Report on Current Developments in the Research Area of Constrained Markov Decision Processes (CMDPs) and Related Control Problems
General Direction of the Field
Recent advances in Constrained Markov Decision Processes (CMDPs) and related control problems mark a significant shift toward more efficient and versatile algorithms that handle a broader range of scenarios, including adversarial and stochastic environments, bandit feedback, and non-quadratic cost functions. The field is moving toward optimal performance bounds attained by practical, computationally efficient methods, particularly policy optimization approaches rather than traditional occupancy-measure-based methods.
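Concretely, a CMDP augments a standard MDP with one or more cost functions: the goal is a policy that maximizes expected reward while keeping each expected constraint cost below a prescribed threshold. In generic notation (chosen here for illustration),

$$\max_{\pi} \; V_r^{\pi} \quad \text{subject to} \quad V_{c_i}^{\pi} \le \alpha_i, \qquad i = 1, \dots, m,$$

where $V_r^{\pi}$ is the expected cumulative reward of policy $\pi$, $V_{c_i}^{\pi}$ is its expected cumulative cost for the $i$-th constraint, and $\alpha_i$ is the corresponding threshold. Occupancy-measure methods solve this program as a linear program over state-action distributions, while policy optimization methods search directly in policy space, which is typically cheaper and more scalable.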
One of the key trends is the development of algorithms that can seamlessly switch between stochastic and adversarial settings, achieving near-optimal regret and constraint violation bounds in both scenarios. This "best-of-both-worlds" approach is becoming increasingly important as real-world applications often involve mixed or unpredictable environments.
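In the online episodic setting underlying these results, performance over $T$ episodes is typically measured by cumulative regret and cumulative (strong) constraint violation; with the notation above, a single constraint, and $\pi^*$ the best feasible policy,

$$R_T = \sum_{t=1}^{T} \left( V_r^{\pi^*} - V_r^{\pi_t} \right), \qquad V_T = \sum_{t=1}^{T} \left[ V_c^{\pi_t} - \alpha \right]_{+},$$

where $\pi_t$ is the policy played in episode $t$ and $[\cdot]_+$ is the positive part, so over-satisfying the constraint in one episode cannot offset a violation in another (the distinction between strong and weak violation). A best-of-both-worlds guarantee requires near-optimal bounds on both quantities simultaneously in the stochastic and adversarial regimes, without knowing in advance which regime the environment follows. These definitions are generic and may differ in detail from those of the individual papers.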
Another notable trend is the focus on bandit feedback, in which the learner observes only the loss of the action it actually plays rather than the full loss function, a model far more representative of practical scenarios. Researchers are making strides in developing algorithms that achieve optimal regret bounds under this limited feedback, addressing the challenges posed by memory structures and non-quadratic loss functions.
The field is also witnessing advancements in nonlinear control problems, particularly in the context of bandit feedback and adversarial perturbations. Algorithms are being developed that can handle strongly-convex and smooth cost functions, improving upon previous regret bounds and addressing the complexities of real-world control problems.
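A common template in this bandit non-stochastic control literature (stated generically here; individual papers may consider richer or nonlinear dynamics) is a linear system driven by adversarial disturbances, where only the scalar cost of the chosen action is revealed:

$$x_{t+1} = A x_t + B u_t + w_t, \qquad \text{feedback: } c_t(x_t, u_t) \ \text{only},$$

with adversarially chosen disturbances $w_t$ and strongly convex, smooth cost functions $c_t$, and regret measured against the best controller in a comparator class $\Pi$ (e.g., disturbance-action controllers) in hindsight:

$$R_T = \sum_{t=1}^{T} c_t(x_t, u_t) - \min_{\pi \in \Pi} \sum_{t=1}^{T} c_t(x_t^{\pi}, u_t^{\pi}).$$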
Preference learning in multi-armed bandit problems is another area of innovation, where researchers are exploring models that rely on preference feedback rather than scalar rewards. This approach is particularly relevant in scenarios where defining an exact reward function is challenging or infeasible.
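To make the preference-feedback model concrete, a standard assumption in this literature (used here purely as an illustration, not necessarily the exact model of the cited work) is that pairwise comparisons follow a Bradley-Terry model: when options $i$ and $j$ are compared, the learner observes a binary outcome distributed as

$$\Pr(i \succ j) = \frac{\exp(r_i)}{\exp(r_i) + \exp(r_j)},$$

so the latent rewards $r_i$ are never observed directly and must be inferred, up to a common shift, from comparison outcomes alone.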
Finally, there is growing interest in the stabilization and control of age-structured population models, particularly in the context of predator-prey dynamics and harvesting strategies. These models, which involve integro-partial differential equations (IPDEs), present unique challenges that are being addressed through novel control designs and Lyapunov functions.
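The prototypical building block here is a McKendrick-von Foerster type transport equation with a nonlocal birth boundary condition; the predator-prey works couple two such equations and add harvesting, but the single-species version (written below for illustration) already shows the IPDE structure:

$$\partial_t u(t,a) + \partial_a u(t,a) = -\mu(a)\, u(t,a), \qquad u(t,0) = \int_0^{A} \beta(a)\, u(t,a)\, da,$$

where $u(t,a)$ is the population density at age $a$ and time $t$, $\mu(a)$ the age-dependent mortality rate, $\beta(a)$ the birth rate, and $A$ the maximum age. Harvesting typically enters as a control term added to the mortality, and stabilization results construct Lyapunov functionals for the resulting coupled, infinite-dimensional dynamics.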
Noteworthy Papers
Best-of-Both-Worlds Policy Optimization for CMDPs with Bandit Feedback: This paper introduces the first algorithm that can handle both stochastic and adversarial constraints in CMDPs with bandit feedback, achieving optimal regret and constraint violation bounds.
Optimal Strong Regret and Violation in Constrained MDPs via Policy Optimization: The authors provide an efficient policy optimization algorithm with $\widetilde{\mathcal{O}}(\sqrt{T})$ strong regret/violation, answering the open question of whether optimal bounds are achievable through policy optimization methods.
Tight Rates for Bandit Control Beyond Quadratics: This paper achieves an $\widetilde{\mathcal{O}}(\sqrt{T})$ optimal regret for bandit non-stochastic control with strongly-convex and smooth cost functions, improving upon previous bounds and addressing the challenges of adversarial perturbations and non-quadratic costs.
DOPL: Direct Online Preference Learning for Restless Bandits with Preference Feedback: The DOPL algorithm ensures $\widetilde{\mathcal{O}}(\sqrt{T\ln T})$ regret for RMAB with preference feedback, a significant advancement in preference-based learning for bandit problems.
Stabilization of Predator-Prey Age-Structured Hyperbolic PDE when Harvesting both Species is Inevitable: This work introduces novel control designs for stabilizing age-structured predator-prey models, providing explicit estimates of regions of attraction for infinite-dimensional multi-species dynamics.