BEGIN:VCALENDAR
VERSION:2.0
PRODID:-//132.216.98.100//NONSGML kigkonsult.se iCalcreator 2.20.4//
BEGIN:VEVENT
UID:20260414T074700EDT-5861XSGdvU@132.216.98.100
DTSTAMP:20260414T114700Z
DESCRIPTION:Virtual Informal Systems Seminar (VISS)\n\nCentre for Intell
 igent Machines (CIM) and Groupe d'Etudes et de Recherche en Analyse des
  Decisions (GERAD)\n\nSpeaker: Alec Koppel – Research Scientist\, Amazo
 n\, United States\n\nWebinar link:\nWebinar ID: 910 7928 6959\nPasscode
 : VISS\n\nAbstract: Reinforcement Learning (RL) is a form of stochastic
  adaptive control in which one seeks to estimate the parameters of a co
 ntroller from data alone\, and it has gained popularity in recent years
 . However\, the technological successes of RL are hindered by the high
  variance and irreproducibility that its training exhibits in practice
 . Motivated by this gap\, we'll present recent efforts to solidify the
  theoretical understanding of how risk-sensitivity\, incorporating pri
 or information\, and prioritizing exploration may be subsumed into a '
 general utility'. This entity is defined as any concave function of th
 e long-term state-action occupancy measure of an MDP. We present two d
 ifferent methodologies for RL with general utilities. The first\, for
  the tabular setting\, extends the classical linear programming formul
 ation of dynamic programming to general utilities. We develop a soluti
 on methodology based upon a stochastic variant of the primal-dual meth
 od\, whose polynomial rate of convergence to a primal-dual optimal pai
 r is derived. Experiments demonstrate that the proposed approach yield
 s a rigorous way to incorporate risk-sensitivity into RL. Secondly\, w
 e study scalable solutions for general utilities by searching over par
 ameterized families of policies. To do so\, we put forth the Variation
 al Policy Gradient Theorem\, based upon which we develop the Variation
 al Policy Gradient (VPG) method. VPG constructs a 'shadow reward'\, wh
 ich plays the role of the usual reward in PG methods in defining searc
 h directions in parameter space. We establish the convergence rate of
  this technique to global optimality\, which exploits a bijection betw
 een occupancy measures and parameterized policies. Experimentally\, we
  observe that VPG provides an effective framework for solving constrai
 ned MDPs and exploration problems on benchmarks from OpenAI Gym.\n\nBi
 o: Alec Koppel has been a Research Scientist at Amazon within Supply C
 hain Optimization Technologies (SCOT) since September 2021. From 2017
  to 2021\, he was a Research Scientist with the U.S. Army Research Lab
 oratory (ARL) in the Computational and Information Sciences Directorat
 e. He completed his Master's degree in Statistics and Doctorate in Ele
 ctrical and Systems Engineering\, both at the University of Pennsylvan
 ia (Penn)\, in August 2017. Before coming to Penn\, he completed his M
 aster's degree in Systems Science and Mathematics and Bachelor's degre
 e in Mathematics\, both at Washington University in St. Louis (WashU)\
 , Missouri. He is a recipient of the 2016 UPenn ESE Dept. Award for Ex
 ceptional Service\, an awardee of the Science\, Mathematics\, and Rese
 arch for Transformation (SMART) Scholarship\, a co-author of a Best Pa
 per Finalist at the 2017 IEEE Asilomar Conference on Signals\, Systems
 \, and Computers\, a finalist for the ARL Honorable Scientist Award 20
 19\, an awardee of the 2020 ARL Director's Research Award Translationa
 l Research Challenge (DIRA-TRC)\, a 2020 Honorable Mention from the IE
 EE Robotics and Automation Letters\, and a mentor to the 2021 ARL Summ
 er Symposium Best Project Awardee. His research interests are in optim
 ization and machine learning. His academic work focuses on approximate
  Bayesian inference\, reinforcement learning\, and decentralized optim
 ization\, with an emphasis on applications in robotics and autonomy. O
 n the industry side\, he is investigating inferring hidden supply sign
 als from market data\, and its intersection with vendor selection.\n\n
DTSTART:20220311T150000Z
DTEND:20220311T160000Z
LOCATION:CA\, Zoom
SUMMARY:Beyond the Cumulative Return in Reinforcement Learning
URL:https://www.mcgill.ca/cim/channels/event/beyond-cumulative-return-reinf
 orcement-learning-337692
END:VEVENT
END:VCALENDAR
