PhD defence of Borna Sayedana – Learning, Control and Concentration of Cumulative Rewards in MDPs and Markov Jump Systems
Abstract
This thesis examines various aspects of integrating learning and control across different families of stochastic systems. The work addresses three key problems: (i) learning the unknown system parameters from input-output data sequences (i.e., system identification), (ii) integrating learning within the control of dynamical systems (i.e., adaptive control), and (iii) providing probabilistic guarantees for the developed methodologies, including regret upper bounds and concentration bounds. The analysis is conducted within three frameworks of stochastic systems: Markov jump linear systems, finite-state and finite-action Markov Decision Processes (MDPs), and linear stochastic systems.
In the framework of Markov jump linear systems, two primary problems are addressed. First, we focus on the full-state-observation system identification problem, i.e., learning the system parameters from observed sequences of discrete and continuous states. We propose a variant of the least-squares algorithm called switched least squares. By leveraging classical regression theory, we establish the algorithm's strong consistency and derive its convergence rate. Second, we integrate this algorithm into a certainty-equivalence framework tailored for controlling Markov jump linear systems. Using the convergence rate of switched least squares, a novel regret decomposition, and the concentration properties of martingale difference sequences, we derive a sub-linear regret upper bound for the proposed algorithm.
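To convey the flavour of the estimator, a schematic sketch (the precise assumptions and statements appear in the thesis): for a Markov jump linear system $x_{t+1} = A_{\sigma_t} x_t + w_t$ with observed discrete mode $\sigma_t$, the switched least squares estimate of the matrix associated with mode $i$ uses only the time steps spent in that mode,
$$\hat{A}_i(T) = \Big( \textstyle\sum_{t < T:\, \sigma_t = i} x_{t+1} x_t^\top \Big) \Big( \textstyle\sum_{t < T:\, \sigma_t = i} x_t x_t^\top \Big)^{-1},$$
i.e., ordinary least squares restricted to the sub-sample generated under each mode.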
In the second part of this thesis, we investigate the concentration properties of cumulative rewards in Markov Decision Processes (MDPs), focusing on both asymptotic and non-asymptotic settings. We introduce a unified approach to characterize reward concentration in MDPs, covering both infinite-horizon settings (i.e., the average-reward and discounted-reward frameworks) and the finite-horizon setting. The asymptotic results include the law of large numbers, the central limit theorem, and the law of the iterated logarithm, while the non-asymptotic results include Azuma-Hoeffding-type inequalities and a non-asymptotic version of the law of the iterated logarithm. Using these results, we show that two alternative definitions of regret for learning policies in the literature are rate-equivalent. The proofs rely on a novel martingale decomposition of the cumulative reward, properties of the solutions of the policy-evaluation fixed-point equation, and asymptotic and non-asymptotic concentration of martingales.
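As a schematic illustration of such a martingale decomposition (stated here in the average-reward setting under standard assumptions; the exact statements are in the thesis): if $V$ solves the policy-evaluation (Poisson) equation $V(s) + \bar{r} = r(s, \pi(s)) + \sum_{s'} P(s' \mid s, \pi(s)) V(s')$, then
$$\sum_{t=0}^{T-1} \big( r(S_t, A_t) - \bar{r} \big) = V(S_0) - V(S_T) + \sum_{t=1}^{T} \big( V(S_t) - \mathbb{E}[V(S_t) \mid S_{t-1}, A_{t-1}] \big),$$
where the last sum, call it $M_T$, is a martingale with increments bounded by the span $\mathrm{sp}(V)$. Classical tools such as the Azuma-Hoeffding inequality then give $\Pr\big( |M_T| \ge \epsilon \big) \le 2 \exp\!\big( -\epsilon^2 / (2 T\, \mathrm{sp}(V)^2) \big)$, which translates into concentration of the cumulative reward around $T \bar{r}$.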
Finally, the analysis is extended to the case of linear systems, where we establish the asymptotic normality of the cumulative cost induced by the optimal policies in linear quadratic regulators (LQRs); a schematic form of this result is sketched below. These results address some of the key theoretical questions in integrating learning and control in stochastic systems.
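Schematically (with the exact conditions and the variance expression given in the thesis), the LQR result can be read as follows: under an optimal stationary feedback $u_t = K^{*} x_t$ with stage cost $c_t = x_t^\top Q x_t + u_t^\top R u_t$ and optimal average cost $\lambda^{*}$,
$$\frac{1}{\sqrt{T}} \Big( \sum_{t=0}^{T-1} c_t - T \lambda^{*} \Big) \xrightarrow{d} \mathcal{N}(0, \sigma^2),$$
for a limiting variance $\sigma^2$ determined by the closed-loop dynamics and the noise statistics.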