3260 papers • 126 benchmarks • 313 datasets
Multi-armed bandits refer to the problem of allocating a fixed, limited amount of resources between competing choices so as to maximize expected gain, when each choice's payoff is only partially known. These problems typically involve an exploration/exploitation trade-off. ( Image credit: Microsoft Research )
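As a minimal illustration of the exploration/exploitation trade-off, here is a sketch of an epsilon-greedy agent on a simulated Bernoulli bandit. The arm means, horizon, and epsilon below are illustrative choices, not values taken from any paper listed on this page.

```python
import numpy as np

def epsilon_greedy_bandit(true_means, n_rounds=10_000, epsilon=0.1, seed=0):
    """Run an epsilon-greedy agent on a simulated Bernoulli bandit."""
    rng = np.random.default_rng(seed)
    n_arms = len(true_means)
    counts = np.zeros(n_arms)      # number of pulls per arm
    estimates = np.zeros(n_arms)   # running mean reward per arm
    total_reward = 0.0

    for _ in range(n_rounds):
        # Explore with probability epsilon, otherwise exploit the current best estimate.
        if rng.random() < epsilon:
            arm = int(rng.integers(n_arms))
        else:
            arm = int(np.argmax(estimates))
        reward = float(rng.random() < true_means[arm])   # Bernoulli reward draw
        counts[arm] += 1
        estimates[arm] += (reward - estimates[arm]) / counts[arm]
        total_reward += reward

    return total_reward, estimates

# Example with three arms of unknown payoff probability.
reward, est = epsilon_greedy_bandit([0.2, 0.5, 0.7])
print(reward, est)
```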
These leaderboards are used to track progress in Multi-Armed Bandits
Use these libraries to find Multi-Armed Bandits models and implementations
No subtasks available.
The proposed DRR framework treats recommendation as a sequential decision-making procedure and adopts an "Actor-Critic" reinforcement learning scheme to model the interactions between users and recommender systems, which allows it to account for both dynamic adaptation and long-term rewards.
This work benchmarks well-established and recently developed methods for approximate posterior sampling combined with Thompson Sampling over a series of contextual bandit problems and finds that many approaches that have been successful in the supervised learning setting underperformed in the sequential decision-making scenario.
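For reference, the Thompson Sampling principle itself (sample arm parameters from the posterior and act greedily on the sample) can be written in a few lines for the simplest Beta-Bernoulli case. The benchmarked paper studies much richer approximate posteriors in the contextual setting; this sketch, with illustrative arm means, only shows the basic mechanism.

```python
import numpy as np

def thompson_sampling_bernoulli(true_means, n_rounds=10_000, seed=0):
    """Beta-Bernoulli Thompson Sampling: sample from each arm's posterior, pull the argmax."""
    rng = np.random.default_rng(seed)
    n_arms = len(true_means)
    alpha = np.ones(n_arms)   # Beta posterior: 1 + observed successes
    beta = np.ones(n_arms)    # Beta posterior: 1 + observed failures

    for _ in range(n_rounds):
        theta = rng.beta(alpha, beta)              # one posterior sample per arm
        arm = int(np.argmax(theta))
        reward = rng.random() < true_means[arm]    # Bernoulli reward draw
        alpha[arm] += reward
        beta[arm] += 1 - reward

    return alpha / (alpha + beta)                  # posterior mean reward per arm

print(thompson_sampling_bernoulli([0.2, 0.5, 0.7]))
```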
A new algorithm, NeuralUCB, is proposed, which leverages the representation power of deep neural networks and uses a neural network-based random feature mapping to construct an upper confidence bound (UCB) of reward for efficient exploration.
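To make the "estimate plus exploration bonus" form of a UCB concrete, here is an illustrative LinUCB-style score on raw linear features. NeuralUCB's actual bound is built on neural-network-based random feature mappings and differs in detail, so this is a sketch of the general shape rather than the paper's algorithm; the variable names and the value of `alpha` are assumptions.

```python
import numpy as np

def linucb_scores(A_inv, theta_hat, contexts, alpha=1.0):
    """Upper confidence bounds of the LinUCB form: estimated reward + exploration bonus.

    NeuralUCB replaces the raw context features used here with features produced
    by a neural network; the UCB construction itself has the same structure.
    """
    scores = []
    for x in contexts:                              # one feature vector per arm
        mean = theta_hat @ x                        # estimated reward
        bonus = alpha * np.sqrt(x @ A_inv @ x)      # width of the confidence ellipsoid
        scores.append(mean + bonus)
    return np.array(scores)

# Illustrative usage: three candidate arms with 4-dimensional context features.
d = 4
A_inv = np.eye(d)                                   # inverse design matrix (identity before any data)
theta_hat = np.zeros(d)                             # current reward-model estimate
contexts = np.random.default_rng(0).normal(size=(3, d))
arm = int(np.argmax(linucb_scores(A_inv, theta_hat, contexts)))
print(arm)
```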
This paper proposes a new algorithm, called Neural Thompson Sampling, which adapts deep neural networks for both exploration and exploitation, with a novel posterior distribution of the reward, where its mean is the neural network approximator, and its variance is built upon the neural tangent features of the corresponding neural network.
This paper proposes simple black-box reduction frameworks that can solve a large family of context-free bandit learning problems with an LDP guarantee, and extends the algorithm to Generalized Linear Bandits with regret bound $\tilde{\mathcal{O}}(T^{3/4}/\varepsilon)$ under $(\varepsilon, \delta)$-LDP, which is conjectured to be optimal.
This work applies its algorithm, Limited Memory Neural-Linear with Likelihood Matching (NeuralLinear-LiM2), to a variety of datasets and observes that it achieves performance comparable to the unlimited-memory approach while exhibiting resilience to catastrophic forgetting.
This work proposes a new OPE estimator that leverages marginalized importance weights when action embeddings provide structure in the action space, and analyzes the conditions under which the action embedding provides statistical benefits over conventional estimators.
This work defines an isometry invariant, MaxMinCOV(X), which bounds from below the performance of Lipschitz MAB algorithms for X, and presents an algorithm that comes arbitrarily close to meeting this bound.
The SWITCH estimator is proposed, which can use an existing reward model to achieve a better bias-variance tradeoff than IPS and DR; the work proves an upper bound on its MSE and demonstrates its benefits empirically on a diverse collection of datasets, often outperforming prior work by orders of magnitude.
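For context, the inverse propensity scoring (IPS) baseline that SWITCH improves on can be written in a few lines. The sketch below shows plain IPS only; SWITCH additionally falls back to a reward model when importance weights are large, and that part is omitted. The logged data in the example are made up for illustration.

```python
import numpy as np

def ips_estimate(rewards, logging_propensities, target_propensities):
    """Inverse propensity scoring (IPS) off-policy value estimate.

    Each logged interaction contributes reward * pi_target(a|x) / pi_logging(a|x).
    SWITCH would swap in a reward-model prediction when this weight exceeds a
    threshold; that mechanism is not shown here.
    """
    weights = np.asarray(target_propensities) / np.asarray(logging_propensities)
    return float(np.mean(np.asarray(rewards) * weights))

# Illustrative logged data: observed rewards and both policies' action probabilities.
print(ips_estimate(rewards=[1, 0, 1],
                   logging_propensities=[0.5, 0.2, 0.25],
                   target_propensities=[0.9, 0.1, 0.5]))
```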