New Algorithm Addresses Markovian Bandits with Hidden States

Thomas Hira, Victor Boone, Urtzi Ayesta, Ina Maria Verloop · June 29, 2026

Summary

This paper introduces UCB-NOM, an optimistic algorithm for regret minimization in Markovian bandits with non-observable states and constrained decision epochs, achieving nearly logarithmic regret without prior knowledge of the bandit's structure.

This research studies regret minimization in Markovian bandits where the underlying states are non-observable and decision epochs may be constrained. Regret is measured against a "pure" benchmark: the learner's performance is compared with that of an optimal pure policy, one that always selects the same, best arm. The authors introduce "self-degrading Markovian bandits," a generalization of rested Markovian bandits in which pure policies are asymptotically optimal.

The study shows that, without prior knowledge of the bandit's structure, algorithms that rarely switch arms inevitably incur super-logarithmic regret. Despite this, the authors design UCB-NOM, an optimistic algorithm inspired by UCB, which achieves nearly logarithmic regret. With prior knowledge, such as a bound on the bias functions of the arms, UCB-NOM achieves O(log T) regret, and under the same prior knowledge the paper gives an O(sqrt(T log T)) worst-case regret bound. Notably, the regret bounds do not depend on the number of states in the underlying Markov chains, which suggests that non-observability of states is a relatively minor issue in self-degrading Markovian bandits.
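One way to write the "pure" regret benchmark is sketched below. The notation is an assumption made for illustration and is not taken from the paper: $g_a$ stands for the long-run average reward of always pulling arm $a$, $r_t$ for the reward collected at step $t$, and $\mathrm{Reg}(T)$ for the regret after $T$ steps. The paper's exact definition may differ in lower-order terms.

```latex
g^{*} = \max_{a} g_{a},
\qquad
\mathrm{Reg}(T) = T\, g^{*} - \mathbb{E}\!\left[\sum_{t=1}^{T} r_{t}\right]
```

Under this reading, the bounds quoted above say that $\mathrm{Reg}(T)$ grows as $O(\log T)$ when a bound on the bias functions is known, and as $O(\sqrt{T \log T})$ in the worst case.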

Why it matters

For professionals in reinforcement learning, online optimization, and sequential decision-making, this research provides theoretical advancements and a practical algorithm for complex bandit problems where state information is limited, improving decision efficiency in dynamic environments.

How to implement this in your domain

  1. Understand the theoretical framework of Markovian bandits with non-observable states.
  2. Explore the UCB-NOM algorithm for sequential decision-making in uncertain environments.
  3. Apply UCB-NOM in scenarios where state information is hidden and decisions are constrained.
  4. Evaluate the regret performance of UCB-NOM against baseline algorithms in simulation.
  5. Consider how prior knowledge about the system can further optimize the algorithm's performance.
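The simulation step above could start from a small testbed like the following. This is a minimal sketch, not the paper's algorithm: the learner is plain UCB1, used only as a baseline, because this summary does not spell out UCB-NOM's index. The environment is a rested Markovian bandit (which the summary describes as a special case of the self-degrading setting) with two-state arms whose states the learner never sees. All names (`HiddenMarkovArm`, `make_example_arms`, `run_ucb1`) and all transition probabilities are hypothetical choices for illustration.

```python
import math
import random


class HiddenMarkovArm:
    """Two-state rested arm: the chain moves only when pulled; the state is never shown."""

    def __init__(self, p_stay_good, p_stay_bad, rng):
        self.p_stay = (p_stay_good, p_stay_bad)  # index 0 = good state, 1 = bad state
        self.rewards = (1.0, 0.0)                # reward emitted in each state (assumed)
        self.state = 0
        self.rng = rng

    def pull(self):
        """Return the reward of the current hidden state, then take one hidden transition."""
        reward = self.rewards[self.state]
        if self.rng.random() >= self.p_stay[self.state]:
            self.state = 1 - self.state
        return reward

    def stationary_mean(self):
        """Long-run average reward of always pulling this arm (used for the benchmark only)."""
        leave_good = 1.0 - self.p_stay[0]
        leave_bad = 1.0 - self.p_stay[1]
        pi_good = leave_bad / (leave_good + leave_bad)
        return pi_good * self.rewards[0] + (1.0 - pi_good) * self.rewards[1]


def make_example_arms(rng):
    # Hypothetical parameters: arm 0 is mostly in the good state, arm 1 mostly in the bad one.
    return [HiddenMarkovArm(0.9, 0.5, rng), HiddenMarkovArm(0.5, 0.9, rng)]


def run_ucb1(arms, horizon):
    """Baseline UCB1 learner that observes rewards only; returns (total reward, pull counts)."""
    k = len(arms)
    counts = [0] * k
    sums = [0.0] * k
    total = 0.0
    for t in range(1, horizon + 1):
        if t <= k:
            arm = t - 1  # pull every arm once to initialise the estimates
        else:
            arm = max(
                range(k),
                key=lambda i: sums[i] / counts[i] + math.sqrt(2.0 * math.log(t) / counts[i]),
            )
        reward = arms[arm].pull()
        counts[arm] += 1
        sums[arm] += reward
        total += reward
    return total, counts


if __name__ == "__main__":
    rng = random.Random(0)
    arms = make_example_arms(rng)
    horizon = 5000
    total, counts = run_ucb1(arms, horizon)
    best = max(arm.stationary_mean() for arm in arms)
    print("pulls per arm:", counts)
    print("empirical pure regret:", horizon * best - total)
```

The empirical regret printed here compares the collected reward with what the best arm's stationary mean would yield over the same horizon. A single run is noisy, so averaging over many seeds gives a fairer comparison once a UCB-NOM implementation is available to swap in for `run_ucb1`.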

Who benefits

Online Advertising, Recommender Systems, Resource Allocation, Dynamic Pricing

Key takeaways

  • Learning in Markovian bandits with hidden states is a complex challenge.
  • The UCB-NOM algorithm offers nearly logarithmic regret even without prior knowledge.
  • With prior knowledge, such as a bound on the arms' bias functions, UCB-NOM can achieve O(log T) regret.
  • Regret bounds are independent of the number of underlying Markov states.

Original post by Thomas Hira, Victor Boone, Urtzi Ayesta, Ina Maria Verloop

"arXiv:2606.27448v1 Announce Type: new Abstract: This paper studies the problem of regret minimization in Markovian bandits with non-observable states and possibly constrained decision epochs. The focus is restricted to a "pure" regret benchmark, that compares the…"
