ISEM Seminar Series
“Provable Efficiency in Online Reinforcement Learning: Function Approximation and RLHF”
by Dr Peng Zhao, Assistant Professor, School of Artificial Intelligence, Nanjing University
29 April 2025 (Tuesday), 2pm – 3.30pm
Venue: E1-07-21/22 - ISEM Executive Classroom
ABSTRACT
Reinforcement learning (RL) has achieved great success in recent years, from AlphaGo to large language models (LLMs), and has become a standard paradigm for modeling interactive learning with an environment. Recent studies have explored the theoretical foundations underpinning this success. In this talk, I will present our recent efforts towards provable efficiency in online RL, from both the statistical and the computational perspective. We first investigate multinomial logit (MNL) mixture MDPs, a generalization of standard linear mixture MDPs that allows the transition model to be non-linear. Owing to this non-linearity, achieving optimal regret with a computationally efficient algorithm is considerably harder than in the linear case. We accomplish this by leveraging online mirror descent (OMD), a powerful paradigm for regret minimization in online learning, as a “one-pass” estimator equipped with a carefully designed local norm. We then study reinforcement learning from human feedback (RLHF), where, in place of the traditional maximum likelihood estimator (MLE), we again use the OMD estimator to achieve computational efficiency while maintaining statistical optimality; the resulting algorithm also shows encouraging empirical performance.
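For readers unfamiliar with online mirror descent, the generic update below gives a sense of why it yields a “one-pass” estimator; the specific loss, regularizer \(\psi\), and local norm used in the work above are details of the talk and are not reproduced here.

\[
\theta_{t+1} = \operatorname*{arg\,min}_{\theta \in \Theta} \; \eta \,\langle \nabla \ell_t(\theta_t), \theta \rangle + D_{\psi}(\theta, \theta_t),
\qquad
D_{\psi}(\theta, \theta') = \psi(\theta) - \psi(\theta') - \langle \nabla \psi(\theta'), \theta - \theta' \rangle,
\]

where \(\ell_t\) is the loss induced by the \(t\)-th observation and \(\eta\) is a step size. Because each round touches only the newest observation, the estimate is maintained in a single pass over the data, in contrast to the MLE, which re-solves an optimization over all past observations at every round.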
ABOUT THE SPEAKER
Peng Zhao is an assistant professor in the School of Artificial Intelligence at Nanjing University, China. His research explores the theoretical foundations of machine learning, with a focus on online learning, stochastic optimization, and their applications to modern machine learning problems such as LLMs. He has published more than 30 papers in top-tier conferences and journals such as ICML, NeurIPS, COLT, and JMLR. He regularly serves as a reviewer and area chair for top-tier conferences and journals.