3260 papers • 126 benchmarks • 313 datasets
Multi-armed bandits refer to the problem of allocating a fixed, limited amount of resources between competing choices so as to maximize expected gain, when each choice's payoff is only partially known. These problems typically involve an exploration/exploitation trade-off. ( Image credit: Microsoft Research )
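As a minimal illustration of the exploration/exploitation trade-off, here is a sketch of an epsilon-greedy agent on a simulated Bernoulli bandit. The arm means, horizon, and epsilon below are illustrative choices, not values taken from any paper listed on this page.

```python
import numpy as np

def epsilon_greedy_bandit(true_means, n_rounds=10_000, epsilon=0.1, seed=0):
    """Run an epsilon-greedy agent on a simulated Bernoulli bandit."""
    rng = np.random.default_rng(seed)
    n_arms = len(true_means)
    counts = np.zeros(n_arms)      # number of pulls per arm
    estimates = np.zeros(n_arms)   # running mean reward per arm
    total_reward = 0.0

    for _ in range(n_rounds):
        # Explore with probability epsilon, otherwise exploit the current best estimate.
        if rng.random() < epsilon:
            arm = int(rng.integers(n_arms))
        else:
            arm = int(np.argmax(estimates))
        reward = float(rng.random() < true_means[arm])   # Bernoulli reward draw
        counts[arm] += 1
        estimates[arm] += (reward - estimates[arm]) / counts[arm]
        total_reward += reward

    return total_reward, estimates

# Example with three arms of unknown payoff probability.
reward, est = epsilon_greedy_bandit([0.2, 0.5, 0.7])
print(reward, est)
```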
These leaderboards are used to track progress in Multi-Armed Bandits
Use these libraries to find Multi-Armed Bandits models and implementations
No subtasks available.
The proposed DRR framework treats recommendation as a sequential decision-making procedure and adopts an "Actor-Critic" reinforcement learning scheme to model the interactions between users and recommender systems, which allows it to account for both dynamic adaptation and long-term rewards.
This work benchmarks well-established and recently developed methods for approximate posterior sampling combined with Thompson Sampling over a series of contextual bandit problems and finds that many approaches that have been successful in the supervised learning setting underperformed in the sequential decision-making scenario.
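For reference, the Thompson Sampling principle itself (sample arm parameters from the posterior and act greedily on the sample) can be written in a few lines for the simplest Beta-Bernoulli case. The benchmarked paper studies much richer approximate posteriors in the contextual setting; this sketch, with illustrative arm means, only shows the basic mechanism.

```python
import numpy as np

def thompson_sampling_bernoulli(true_means, n_rounds=10_000, seed=0):
    """Beta-Bernoulli Thompson Sampling: sample from each arm's posterior, pull the argmax."""
    rng = np.random.default_rng(seed)
    n_arms = len(true_means)
    alpha = np.ones(n_arms)   # Beta posterior: 1 + observed successes
    beta = np.ones(n_arms)    # Beta posterior: 1 + observed failures

    for _ in range(n_rounds):
        theta = rng.beta(alpha, beta)              # one posterior sample per arm
        arm = int(np.argmax(theta))
        reward = rng.random() < true_means[arm]    # Bernoulli reward draw
        alpha[arm] += reward
        beta[arm] += 1 - reward

    return alpha / (alpha + beta)                  # posterior mean reward per arm

print(thompson_sampling_bernoulli([0.2, 0.5, 0.7]))
```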
A new algorithm, NeuralUCB, is proposed, which leverages the representation power of deep neural networks and uses a neural network-based random feature mapping to construct an upper confidence bound (UCB) of reward for efficient exploration.
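To make the "estimate plus exploration bonus" form of a UCB concrete, here is an illustrative LinUCB-style score on raw linear features. NeuralUCB's actual bound is built on neural-network-based random feature mappings and differs in detail, so this is a sketch of the general shape rather than the paper's algorithm; the variable names and the value of `alpha` are assumptions.

```python
import numpy as np

def linucb_scores(A_inv, theta_hat, contexts, alpha=1.0):
    """Upper confidence bounds of the LinUCB form: estimated reward + exploration bonus.

    NeuralUCB replaces the raw context features used here with features produced
    by a neural network; the UCB construction itself has the same structure.
    """
    scores = []
    for x in contexts:                              # one feature vector per arm
        mean = theta_hat @ x                        # estimated reward
        bonus = alpha * np.sqrt(x @ A_inv @ x)      # width of the confidence ellipsoid
        scores.append(mean + bonus)
    return np.array(scores)

# Illustrative usage: three candidate arms with 4-dimensional context features.
d = 4
A_inv = np.eye(d)                                   # inverse design matrix (identity before any data)
theta_hat = np.zeros(d)                             # current reward-model estimate
contexts = np.random.default_rng(0).normal(size=(3, d))
arm = int(np.argmax(linucb_scores(A_inv, theta_hat, contexts)))
print(arm)
```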
This paper proposes a new algorithm, called Neural Thompson Sampling, which adapts deep neural networks for both exploration and exploitation, with a novel posterior distribution of the reward, where its mean is the neural network approximator, and its variance is built upon the neural tangent features of the corresponding neural network.
This paper proposes simple black-box reduction frameworks that can solve a large family of context-free bandit learning problems with an LDP guarantee, and extends the algorithm to Generalized Linear Bandits with regret bound $\tilde{\mathcal{O}}(T^{3/4}/\varepsilon)$ under $(\varepsilon, \delta)$-LDP, which is conjectured to be optimal.
This work applies its algorithm, Limited Memory Neural-Linear with Likelihood Matching (NeuralLinear-LiM2), to a variety of datasets and observes that it achieves performance comparable to the unlimited-memory approach while exhibiting resilience to catastrophic forgetting.
This work proposes a new OPE estimator that leverages marginalized importance weights when action embeddings provide structure in the action space, and analyzes the conditions under which the action embedding provides statistical benefits over conventional estimators.
This work defines an isometry invariant, MaxMinCOV(X), which bounds from below the performance of Lipschitz MAB algorithms for X, and presents an algorithm that comes arbitrarily close to meeting this bound.
The SWITCH estimator is proposed, which can use an existing reward model to achieve a better bias-variance tradeoff than IPS and DR; the work proves an upper bound on its MSE and demonstrates its benefits empirically on a diverse collection of datasets, often outperforming prior work by orders of magnitude.
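For context, the inverse propensity scoring (IPS) baseline that SWITCH improves on can be written in a few lines. The sketch below shows plain IPS only; SWITCH additionally falls back to a reward model when importance weights are large, and that part is omitted. The logged data in the example are made up for illustration.

```python
import numpy as np

def ips_estimate(rewards, logging_propensities, target_propensities):
    """Inverse propensity scoring (IPS) off-policy value estimate.

    Each logged interaction contributes reward * pi_target(a|x) / pi_logging(a|x).
    SWITCH would swap in a reward-model prediction when this weight exceeds a
    threshold; that mechanism is not shown here.
    """
    weights = np.asarray(target_propensities) / np.asarray(logging_propensities)
    return float(np.mean(np.asarray(rewards) * weights))

# Illustrative logged data: observed rewards and both policies' action probabilities.
print(ips_estimate(rewards=[1, 0, 1],
                   logging_propensities=[0.5, 0.2, 0.25],
                   target_propensities=[0.9, 0.1, 0.5]))
```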