Event

PhD defence of Yuwei Fu – Sample Efficient Reinforcement Learning: Methods and Applications

Friday, August 29, 2025, 13:00 to 15:00
McConnell Engineering Building Room 603, 3480 rue University, Montreal, QC, H3A 0E9, CA

Abstract

Deep Reinforcement Learning (DRL) has transformed decision-making in areas such as game playing, robotics, protein structure prediction, and reasoning in large language models. However, its practical use is often hindered by low sample efficiency. Unlike humans, DRL agents typically require millions of interactions to learn effective policies, making training costly and time-consuming. This thesis tackles the sample efficiency challenge in DRL through three novel approaches and demonstrates a practical application in time series forecasting.

First, we address the offline RL setting, where policies are learned from fixed datasets without further online environment interaction. We show that existing model-free methods tend to produce overly conservative policies and propose a relaxed behavior regularization strategy to overcome this issue.
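The idea of behavior regularization can be illustrated with a minimal NumPy sketch. This is not the thesis's actual objective; it shows a generic TD3+BC-style actor loss in which a behavior-cloning penalty keeps the policy near the dataset, and the scale factor `alpha` (a made-up name here) controls how conservative the policy is. A relaxed regularization strategy would, roughly speaking, weaken this penalty where it is overly restrictive.

```python
import numpy as np

def actor_loss(q_values, policy_actions, dataset_actions, alpha=2.5):
    """Illustrative behavior-regularized actor loss (TD3+BC style):
    maximize Q-values while penalizing deviation from dataset actions.
    Shrinking the BC penalty yields a less conservative policy."""
    # Normalize the Q term so alpha trades off the two terms consistently.
    lam = alpha / (np.abs(q_values).mean() + 1e-8)
    # Behavior-cloning penalty: mean squared distance to logged actions.
    bc_term = ((policy_actions - dataset_actions) ** 2).mean()
    return -lam * q_values.mean() + bc_term
```

With `policy_actions` equal to `dataset_actions`, the penalty vanishes and only the Q term remains, which is the fully trusting (non-conservative) limit.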

Next, we investigate the use of pre-trained Vision-Language Models (VLMs) to guide online RL in reward-sparse environments. While VLMs can provide useful task progress signals, we identify a reward misalignment problem. To fix this, we introduce FuRL, a method that aligns VLM-derived rewards with task goals, significantly improving learning efficiency.
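A common way to turn a VLM into a reward signal, sketched below under the assumption of precomputed embeddings (the vectors and function names here are hypothetical, not FuRL's implementation), is the cosine similarity between an image embedding of the current observation and a text embedding of the task description. Misalignment arises when this similarity is high in states that do not actually advance the task.

```python
import numpy as np

def vlm_reward(obs_embedding, goal_embedding):
    """Cosine similarity between a (hypothetical) VLM embedding of the
    observation and the embedding of the task description, used as a
    dense shaping reward in a sparse-reward environment."""
    num = float(obs_embedding @ goal_embedding)
    den = np.linalg.norm(obs_embedding) * np.linalg.norm(goal_embedding) + 1e-8
    return num / den
```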

We also explore Inverse Reinforcement Learning (IRL) from expert video demonstrations. Existing Optimal Transport-based methods often ignore temporal structure. To remedy this, we propose a method that integrates context embeddings and a masking mechanism to capture temporal order, enabling policy learning from just two action-free videos.
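One simple way to impose temporal order on an Optimal Transport cost, shown here as an illustrative stand-in for the thesis's masking mechanism (the `bandwidth` parameter is an assumption of this sketch), is to forbid matches between agent and expert frames that are far apart in normalized time:

```python
import numpy as np

def masked_cost(agent_feats, expert_feats, bandwidth=2):
    """Pairwise cost between agent and expert frame embeddings, with a
    temporal band mask: frames too far apart in time get infinite cost,
    so the transport plan must respect temporal order."""
    T, E = len(agent_feats), len(expert_feats)
    # Euclidean cost between every agent frame and every expert frame.
    cost = np.linalg.norm(agent_feats[:, None] - expert_feats[None, :], axis=-1)
    # Normalized time indices for both trajectories.
    t = np.arange(T)[:, None] / max(T - 1, 1)
    e = np.arange(E)[None, :] / max(E - 1, 1)
    # Mask out pairs outside a band around the temporal diagonal.
    mask = np.abs(t - e) * max(T, E) > bandwidth
    cost[mask] = np.inf
    return cost
```

An OT solver run on this masked cost matrix can only align temporally nearby frames, which is one way to encode the ordering that plain OT ignores.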

Finally, we apply DRL to ensemble learning for time series forecasting under non-stationary conditions. By treating model combination as a reinforcement learning task, we design a system that dynamically adjusts model weights, achieving strong performance even with limited training data.
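The dynamic-weighting idea can be sketched with a simple multiplicative-weights update (an illustrative stand-in, not the thesis's DRL agent, which would instead output the weights as actions and be rewarded by low ensemble error): models with larger recent forecast error are down-weighted before the next prediction.

```python
import numpy as np

def update_weights(weights, errors, lr=1.0):
    """Down-weight ensemble members with large recent forecast error,
    then renormalize so the weights form a convex combination."""
    w = weights * np.exp(-lr * errors)
    return w / w.sum()

def ensemble_forecast(weights, predictions):
    """Weighted combination of the individual model forecasts."""
    return float(weights @ predictions)
```

Under non-stationarity, repeating this update every step lets the ensemble shift mass toward whichever model currently tracks the series best.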
