Thompson sampling, named after William R. Thompson, is a heuristic for choosing actions that addresses the exploration-exploitation dilemma in the multi-armed bandit problem. It consists of choosing the action that maximizes the expected reward with respect to a randomly drawn belief.
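For the classic Bernoulli bandit, this belief is usually a Beta posterior per arm: each round we draw one sample from every posterior and pull the arm with the largest draw. The sketch below is a minimal illustration of that idea (function and variable names are ours, not from any of the papers listed here):

```python
import random

def thompson_sampling(true_probs, n_rounds, rng=None):
    """Bernoulli Thompson sampling with Beta(1, 1) priors.

    Maintains a Beta posterior per arm; each round, draws one sample
    from every posterior and pulls the arm whose draw is largest.
    """
    rng = rng or random.Random(0)
    n_arms = len(true_probs)
    successes = [1] * n_arms  # Beta alpha parameters (uniform prior)
    failures = [1] * n_arms   # Beta beta parameters (uniform prior)
    total_reward = 0
    for _ in range(n_rounds):
        # One draw from each arm's posterior belief.
        samples = [rng.betavariate(successes[a], failures[a])
                   for a in range(n_arms)]
        arm = max(range(n_arms), key=lambda a: samples[a])
        reward = 1 if rng.random() < true_probs[arm] else 0
        total_reward += reward
        if reward:
            successes[arm] += 1
        else:
            failures[arm] += 1
    return total_reward, successes, failures
```

Because the posterior of a clearly inferior arm rarely produces the largest draw, play concentrates on the best arm over time while still occasionally exploring the others.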
This work benchmarks well-established and recently developed methods for approximate posterior sampling combined with Thompson Sampling over a series of contextual bandit problems and finds that many approaches that have been successful in the supervised learning setting underperformed in the sequential decision-making scenario.
This tutorial covers the algorithm and its application, illustrating concepts through a range of examples, including Bernoulli bandit problems, shortest path problems, product recommendation, assortment, active learning with neural networks, and reinforcement learning in Markov decision processes.
This work explores adaptations of successful multi-armed bandit policies to the online contextual bandit scenario with binary rewards, using binary classification algorithms such as logistic regression as black-box oracles, yielding approaches that are more scalable than previous work and compatible with any classification algorithm.
Two perturbation approaches are investigated to overcome the conservatism that optimism-based algorithms chronically suffer from in practice, and both empirically show outstanding performance in tackling the conservatism issue that Discounted LinUCB (D-LinUCB) struggles with.
Thompson Sampling-style algorithms for mean-variance MAB and comprehensive regret analyses for Gaussian and Bernoulli bandits with fewer assumptions are developed and shown to significantly outperform existing LCB-based algorithms for all risk tolerances.
This work proposes the Convolutional Neural Process (ConvNP), which endows Neural Processes (NPs) with translation equivariance and extends convolutional conditional NPs to allow for dependencies in the predictive distribution, and proposes a new maximum-likelihood objective to replace the standard ELBO objective in NPs.
This paper proposes a new algorithm, called Neural Thompson Sampling, which adapts deep neural networks for both exploration and exploitation, with a novel posterior distribution of the reward, where its mean is the neural network approximator, and its variance is built upon the neural tangent features of the corresponding neural network.
A variational Bayesian Recurrent Neural Net recommender system is introduced that acts on time series of interactions between the internet platform and the user, and that scales to real-world industrial settings.
The question of the optimality of Thompson Sampling for solving the stochastic multi-armed bandit problem is answered positively for the case of Bernoulli rewards by providing the first finite-time analysis that matches the asymptotic rate given in the Lai and Robbins lower bound for the cumulative regret.
A generalization of the Thompson Sampling algorithm is designed and analyzed for the stochastic contextual multi-armed bandit problem with linear payoff functions, where the contexts are provided by an adaptive adversary.
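In this linear-payoff setting (LinTS, in the style of Agrawal and Goyal), the belief is a Gaussian posterior over the weight vector: sample a parameter vector, score each context with it, and play the highest-scoring arm. The following is a hedged sketch of one round; the function names and the exploration scale `v` are illustrative, not code from the paper:

```python
import numpy as np

def lin_ts_step(B, f, context_vectors, v=1.0, rng=None):
    """One round of Linear Thompson Sampling (illustrative sketch).

    B: d x d design matrix (initialized to the identity).
    f: length-d vector of reward-weighted contexts.
    Samples theta ~ N(B^{-1} f, v^2 * B^{-1}) and returns the index
    of the arm whose context maximizes the sampled linear reward.
    """
    rng = rng or np.random.default_rng(0)
    B_inv = np.linalg.inv(B)
    mu = B_inv @ f
    theta = rng.multivariate_normal(mu, v ** 2 * B_inv)
    scores = context_vectors @ theta
    return int(np.argmax(scores))

def lin_ts_update(B, f, x, reward):
    """Rank-one posterior update after observing (context x, reward)."""
    B = B + np.outer(x, x)
    f = f + reward * x
    return B, f
```

Each observed (context, reward) pair tightens the Gaussian posterior via the rank-one update, so sampled parameters concentrate around the least-squares estimate as data accumulates.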