Coordinating for Clicks: Learning in Multi-Agent Information Asymmetric Cascading Bandits
This is my applied math paper No. 1 (!), titled “Coordinating for Clicks: Learning in Multi-Agent Information Asymmetric Cascading Bandits”. I collaborated on this project alongside Lily, Lune, and Sophia, and our advisor was William. You can view the manuscript (currently under review at JMLR) by clicking the link or viewing the PDF below.
Bandits are one of the most studied settings in reinforcement learning, with the standard multi-armed bandit featuring a “learning agent” who interacts with a feedback-giving environment. In stochastic bandits, the agent interacts with the environment over a fixed number of rounds, \(T\), called the horizon. In each round, the agent chooses from a set of “actions” (often dubbed “arms”), each with its own reward distribution. Reinforcement learning is, at its core, about learning from feedback in an unknown environment, so the agent’s goal is to identify the optimal arms and collect the highest possible reward.
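To make the setup concrete, here is a toy sketch of that interaction loop (the arm means, horizon, and uniformly random placeholder policy are all my own illustrative choices, not anything from the paper):

```python
import numpy as np

# A minimal stochastic bandit loop: Bernoulli arms with unknown means, a horizon
# of T rounds, and a placeholder policy that picks arms uniformly at random.
rng = np.random.default_rng(0)
true_means = [0.2, 0.5, 0.7]   # one unknown mean reward per arm (made up for illustration)
T = 1_000
total_reward = 0.0
for t in range(T):
    arm = rng.integers(len(true_means))                     # a real policy chooses more cleverly
    total_reward += float(rng.random() < true_means[arm])   # Bernoulli reward: 1 w.p. true_means[arm]
print(total_reward)
```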
In bandits, learners adopt a “policy”, often encoded in an algorithm. Researchers evaluate a policy by its regret (\(R_T\)): the difference between the reward the agent would have received by playing the optimal action over the entire horizon and the reward it actually received by following the algorithm. Ideally, a “good” algorithm incurs sublinear regret, that is, regret that grows slower than \(O(T)\). In the classical stochastic bandit, the Upper Confidence Bound (UCB) algorithm achieves order-optimal regret of \(O(\log T)\). Essentially, the learner maintains a confidence interval around the empirical reward estimate for each arm, and as arms are pulled and more rewards are observed, these upper confidence bounds shrink toward each arm’s true mean reward. The agent then simply picks the action with the highest UCB index each round and, with high probability, ends up identifying the optimal arm. This is just a brief introduction; if you want to learn more, I recommend the book Bandit Algorithms by Tor Lattimore and Csaba Szepesvári.
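As a rough illustration, here is what a UCB policy could look like for Bernoulli arms (a sketch under my own toy assumptions; the exploration constant and index form are one standard choice, not necessarily what the paper uses):

```python
import numpy as np

def ucb(true_means, T, rng=None):
    """Minimal UCB sketch for Bernoulli arms (illustrative only)."""
    rng = rng or np.random.default_rng(0)
    L = len(true_means)
    counts = np.zeros(L)    # number of times each arm has been pulled
    means = np.zeros(L)     # empirical mean reward for each arm
    regret = 0.0
    best = max(true_means)
    for t in range(1, T + 1):
        if t <= L:
            arm = t - 1     # pull every arm once to initialize the estimates
        else:
            ucb_index = means + np.sqrt(1.5 * np.log(t) / counts)  # estimate + confidence radius
            arm = int(np.argmax(ucb_index))
        reward = float(rng.random() < true_means[arm])             # Bernoulli feedback
        counts[arm] += 1
        means[arm] += (reward - means[arm]) / counts[arm]          # running-mean update
        regret += best - true_means[arm]                           # (pseudo-)regret accumulated this round
    return regret

print(ucb([0.2, 0.5, 0.7], T=10_000))  # grows roughly like log T as T increases
```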
The bandit variant we study in this paper is called a cascading bandit. Instead of choosing one action per round, the agent chooses \(K\) different actions arranged in a list. This setting is rife with applications such as recommendation bars and search engines, and for this reason arms are sometimes referred to as “items”, and the “action” taken in a round is now the full ranking/recommendation of \(K\) items.
The key feature of this environment is that the learner may receive only partial feedback: even though they recommend \(K\) items, they may receive feedback on fewer. This is because of the assumption that a “user” examines the items in order, from first to last, and stops examining once they are “attracted” by an item or reach the end of the list. Thus, the remaining items are unobserved by the user, and the agent receives no feedback on them (a keen reader will note that in this setting rewards are Bernoulli, so a “click” on an item results in a reward of \(1\), while an item that is observed but not clicked results in a reward of \(0\)). This is just a brief summary of the environment; if you’re curious, I’ve written more about it in the introduction of the paper below, and you can look at the seminal paper on cascading bandits by Kveton et al., linked here.
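Here is a small sketch of that feedback model (the item indices and attraction probabilities are made up for illustration):

```python
import numpy as np

def cascade_feedback(ranked_items, attract_probs, rng):
    """Simulate one round of cascading feedback: the user scans the list top-down
    and clicks the first item that attracts them (or none).
    Returns the clicked position, or None if the user reaches the end of the list."""
    for pos, item in enumerate(ranked_items):
        if rng.random() < attract_probs[item]:
            return pos   # reward 1 here; items after this position are never examined
    return None          # every recommended item was examined and none was clicked (reward 0)

rng = np.random.default_rng(0)
attract_probs = [0.1, 0.3, 0.05, 0.2]   # hypothetical per-item attraction probabilities
click = cascade_feedback(ranked_items=[1, 3, 0], attract_probs=attract_probs, rng=rng)
# If click == 1, the agent learns that item 1 was observed but not clicked, item 3 was
# clicked, and item 0 was never observed, so it yields no feedback at all.
```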
For our main contribution, we study a multiplayer extension of cascading bandits. Instead of one agent interacting with the cascading environment, we consider \(M\) different learners, each with their own set of \(L\) items. The key distinction is that when each player recommends an item for a position in the ranking, the combination of the \(M\) players’ items gives rise to a joint item. Each joint item has its own independent probability of attracting the user, meaning the agents must now coordinate effectively, selecting the \(K\) optimal joint items to maximize rewards and minimize regret. Beyond this, we study various forms of information asymmetry between the players: players might not be able to observe the items played by others, they might receive different feedback from one another, or both. For an introduction to multiplayer bandits in the standard multi-armed bandit, I suggest reading this paper, written by our advisor, William Chang.
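A toy sketch of how joint items form and how the cascade plays out over them (the table of joint attraction probabilities and all numbers here are my own illustrative constructs, not the paper’s notation):

```python
import numpy as np

# With M players, each choosing one of its own L items per position, the joint item
# at position k is the M-tuple of the players' choices, and the user cascades over
# the list of K joint items.
M, L, K = 2, 3, 2
rng = np.random.default_rng(1)
joint_attract = rng.uniform(0.0, 0.4, size=(L,) * M)   # one attraction probability per joint item

def joint_cascade(rankings, rng):
    """rankings[m][k] is player m's item at position k; returns the clicked position or None."""
    for k in range(K):
        joint_item = tuple(rankings[m][k] for m in range(M))
        if rng.random() < joint_attract[joint_item]:
            return k
    return None

# Example: player 0 recommends items [2, 0] and player 1 recommends items [1, 1].
print(joint_cascade(rankings=[[2, 0], [1, 1]], rng=rng))
```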
In particular, our main results include algorithms for three settings of information asymmetry. We propose \(\texttt{mCascadeUCB}\) for the setting in which players cannot observe one another’s actions, \(\texttt{mCascadeUCB-Intervals}\) for the setting in which the rewards received by the players are IID (independent and identically distributed), and \(\texttt{mCascadeDSEE}\) for the setting combining both of these information asymmetries. The first two, as their names suggest, use UCB-type algorithms and achieve regret on the order of \(O(\log T)\), while the last achieves near-logarithmic regret using an “Explore-Then-Commit” structure, where players first test the attractiveness of items before “committing” to the \(K\) items they believe to be best.
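For intuition on the explore-then-commit structure, here is the bare skeleton of that idea for a single player (a sketch only; the actual \(\texttt{mCascadeDSEE}\) exploration schedule and coordination logic are in the paper):

```python
import numpy as np

def explore_then_commit(true_attract, K, T_explore, rng=None):
    """Generic explore-then-commit skeleton (not the paper's mCascadeDSEE):
    sample every item round-robin during exploration, then commit to the K items
    with the highest estimated attraction probabilities."""
    rng = rng or np.random.default_rng(0)
    L = len(true_attract)
    counts = np.zeros(L)
    est = np.zeros(L)
    for t in range(T_explore):                            # exploration phase
        item = t % L
        reward = float(rng.random() < true_attract[item])
        counts[item] += 1
        est[item] += (reward - est[item]) / counts[item]
    committed = np.argsort(est)[-K:]                      # commit phase: keep the top-K estimates
    return sorted(int(i) for i in committed)

print(explore_then_commit([0.05, 0.3, 0.1, 0.25, 0.2], K=2, T_explore=5_000))
```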
This has been just a brief introduction. If you’d like to learn more, feel free to reach out at kennyguo@ucla.edu.