Discover, visualize, and connect AI research papers. Explore the latest trends and insights in artificial intelligence research.

© 2026 Papersgraph. All rights reserved.

Off-policy evaluation

3260 papers • 126 benchmarks • 313 datasets

Off-policy evaluation (OPE), or offline evaluation more generally, estimates the performance of hypothetical policies using only logged offline data. It is particularly useful in applications where online interaction is high-stakes or expensive, such as precision medicine and recommender systems.
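As a rough illustration of what estimating a policy's value from logged data can look like, here is a minimal sketch of inverse propensity scoring (IPS), one of the standard OPE estimators. The log format, the simulated environment, and the names `ips_estimate`, `simulate_logs`, and `matching_policy` are assumptions made for this example and do not come from any of the papers listed on this page.

```python
import random


def ips_estimate(logs, target_policy):
    """Inverse propensity scoring estimate of a target policy's value.

    logs: list of (context, action, reward, logging_prob) tuples, where
          logging_prob is the probability the logging policy gave to the
          logged action (assumed known and strictly positive).
    target_policy: function (context, action) -> probability.
    """
    total = 0.0
    for context, action, reward, logging_prob in logs:
        # Reweight each logged reward by how much more (or less) likely
        # the target policy is to take the logged action.
        weight = target_policy(context, action) / logging_prob
        total += weight * reward
    return total / len(logs)


def simulate_logs(n, seed=0):
    """Hypothetical logged data: binary context, uniform logging policy."""
    rng = random.Random(seed)
    logs = []
    for _ in range(n):
        context = rng.choice([0, 1])
        action = rng.choice([0, 1])  # uniform logging policy, prob 0.5
        # Reward is Bernoulli with mean 0.8 when action matches context.
        mean = 0.8 if action == context else 0.2
        reward = 1.0 if rng.random() < mean else 0.0
        logs.append((context, action, reward, 0.5))
    return logs


def matching_policy(context, action):
    """Hypothetical target policy: always pick the action equal to the context."""
    return 1.0 if action == context else 0.0


logs = simulate_logs(20000)
estimate = ips_estimate(logs, matching_policy)
print(f"IPS estimate of the matching policy's value: {estimate:.3f}")
```

In this simulated setup the true value of the matching policy is 0.8, so the printed estimate should land close to that figure even though the matching policy was never deployed.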

(Image credit: Papersgraph)

Benchmarks

These leaderboards are used to track progress in off-policy evaluation

No benchmarks available.

Libraries

Use these libraries to find off-policy evaluation models and implementations

4 papers 612

Datasets

No datasets available.

Subtasks

No subtasks available.

Most implemented papers

Open Bandit Dataset and Pipeline: Towards Realistic and Reproducible Off-Policy Evaluation

Yuta Saito, Shunsuke Aihara, Megumi Matsutani, Yusuke Narita•Sun Aug 16 2020

This work presents Open Bandit Dataset, a public logged bandit dataset collected on ZOZOTOWN, a large-scale fashion e-commerce platform, which enables experimental comparisons of different OPE estimators for the first time. It also develops Open Bandit Pipeline, Python software that streamlines and standardizes the implementation of batch bandit algorithms and OPE.

89 0

Benchmarks for Deep Off-Policy Evaluation

Mohammad Norouzi, T. Paine, S. Levine, Ziyu Wang, Justin Fu, Aviral Kumar, Ofir Nachum, G. Tucker, Yutian Chen, Alexander Novikov, Cosmin Paduraru, Mengjiao Yang, Michael R. Zhang•Mon Mar 29 2021

The goal of the benchmark is to provide a standardized measure of progress, motivated by a set of principles designed to challenge and test the limits of existing OPE methods.

110 0

Off-Policy Evaluation for Large Action Spaces via Embeddings

Yuta Saito, T. Joachims•Sat Feb 12 2022

This work proposes a new OPE estimator that leverages marginalized importance weights when action embeddings provide structure in the action space, and analyzes the conditions under which the action embedding provides statistical benefits over conventional estimators.

57 0
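The marginalized importance weights mentioned in the summary of "Off-Policy Evaluation for Large Action Spaces via Embeddings" can be sketched as follows. This is a simplified illustration, not the authors' implementation: it assumes each action maps deterministically to a single discrete embedding (a category), and the names `marginal_embedding_prob`, `mips_estimate`, and `embed_of` are hypothetical.

```python
def marginal_embedding_prob(policy, context, embedding, embed_of, actions):
    """Probability that a policy picks any action carrying this embedding."""
    return sum(policy(context, a) for a in actions if embed_of[a] == embedding)


def mips_estimate(logs, actions, embed_of, target_policy, logging_policy):
    """Marginalized IPS sketch: weight by embedding probabilities, not actions.

    logs: list of (context, action, reward) tuples.
    embed_of: dict mapping each action to its (assumed deterministic) embedding.
    """
    total = 0.0
    for context, action, reward in logs:
        e = embed_of[action]
        num = marginal_embedding_prob(target_policy, context, e, embed_of, actions)
        den = marginal_embedding_prob(logging_policy, context, e, embed_of, actions)
        # The weight depends only on the embedding, so actions sharing an
        # embedding share a weight, which tends to be less extreme.
        total += num / den * reward
    return total / len(logs)


# Toy usage: four actions grouped into two categories.
actions = [0, 1, 2, 3]
embed_of = {0: "a", 1: "a", 2: "b", 3: "b"}
target_probs = {0: 0.7, 1: 0.1, 2: 0.1, 3: 0.1}


def logging_policy(context, action):
    return 0.25  # uniform over the four actions


def target_policy(context, action):
    return target_probs[action]


logs = [(0, 1, 1.0), (0, 2, 0.0)]
mips_value = mips_estimate(logs, actions, embed_of, target_policy, logging_policy)
```

In the toy data, logged action 1 has an action-level weight of 0.1 / 0.25 = 0.4, but its category "a" has a marginal weight of 0.8 / 0.5 = 1.6, since the target policy favors another action in the same category.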

Optimal and Adaptive Off-policy Evaluation in Contextual Bandits

Alekh Agarwal, Yu-Xiang Wang, Miroslav Dudík•Sat Dec 03 2016

This work proposes the SWITCH estimator, which can use an existing reward model to achieve a better bias-variance tradeoff than IPS and DR. The authors prove an upper bound on its MSE and demonstrate its benefits empirically on a diverse collection of data sets, often outperforming prior work by orders of magnitude.

231 0
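The summary of the SWITCH estimator above does not spell out how the reward model is combined with importance weighting. One common reading is a threshold rule: use IPS when an action's importance weight is at most a threshold, and fall back to the reward model for actions whose weight exceeds it. The sketch below follows that reading; the threshold name `tau`, the function signatures, and the toy data are assumptions of this example.

```python
def switch_estimate(logs, actions, target_policy, logging_policy, reward_model, tau):
    """Threshold-switching sketch combining IPS with a reward model.

    logs: list of (context, action, reward) tuples.
    tau: importance-weight threshold (hypothetical parameter name).
    """
    total = 0.0
    for context, action, reward in logs:
        # IPS term: only counted when the logged action's weight is small.
        w = target_policy(context, action) / logging_policy(context, action)
        if w <= tau:
            total += w * reward
        # Model term: covers the actions whose weight is too large to trust.
        for a in actions:
            w_a = target_policy(context, a) / logging_policy(context, a)
            if w_a > tau:
                total += target_policy(context, a) * reward_model(context, a)
    return total / len(logs)


# Toy usage with two actions and a constant reward model.
actions = [0, 1]
target_probs = {0: 0.9, 1: 0.1}


def logging_policy(context, action):
    return 0.5


def target_policy(context, action):
    return target_probs[action]


def reward_model(context, action):
    return 0.5  # hypothetical pre-fitted reward model


logs = [(0, 0, 1.0), (0, 1, 0.0)]
pure_ips = switch_estimate(logs, actions, target_policy, logging_policy,
                           reward_model, tau=float("inf"))
pure_model = switch_estimate(logs, actions, target_policy, logging_policy,
                             reward_model, tau=0.0)
mixed = switch_estimate(logs, actions, target_policy, logging_policy,
                        reward_model, tau=1.0)
```

Under this reading, an infinite threshold recovers plain IPS and a zero threshold recovers the purely model-based estimate, so `tau` moves the estimator between the low-bias and low-variance extremes.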

Robust Generalization despite Distribution Shift via Minimum Discriminating Information

A. Krause, Tobias Sutter, D. Kuhn•Mon Jun 07 2021

This paper introduces a modeling framework in which, in addition to training data, partial structural knowledge of the shifted test distribution is available, and employs the principle of minimum discriminating information to embed this prior knowledge.

13 0

Evaluating the Robustness of Off-Policy Evaluation

Yuta Saito, Yusuke Narita, Haruka Kiyohara, Takuma Udagawa, Kazuki Mogi, Kei Tateno•Mon Aug 30 2021

This work develops Interpretable Evaluation for Offline Evaluation, an experimental procedure that evaluates the robustness of OPE estimators to changes in hyperparameters and/or evaluation policies in an interpretable manner, and applies it to real-world e-commerce platform data.

45 0

Doubly Robust Off-Policy Evaluation for Ranking Policies under the Cascade Behavior Model

Yuta Saito, Yusuke Narita, N. Shimizu, Haruka Kiyohara, Tatsuya Matsuhiro, Yasuo Yamamoto•Wed Feb 02 2022

This work proposes the Cascade Doubly Robust estimator building on the cascade assumption, which assumes that a user interacts with items sequentially from the top position in a ranking, and shows that the proposed estimator is unbiased in more cases compared to existing estimators that make stronger assumptions on user behavior.

51 0

Balanced Off-Policy Evaluation for Personalized Pricing

Adam N. Elmachtoub, Vishal Gupta, Yunfan Zhao•Thu Feb 23 2023

The key idea is to compute an estimate that minimizes the worst-case mean squared error or maximizes a worst-case lower bound on policy performance, where in both cases the worst case is taken with respect to a set of possible revenue functions.

6 0

Off-policy evaluation for slate recommendation

A. Krishnamurthy, J. Langford, Alekh Agarwal, Adith Swaminathan, Miroslav Dudík, Damien Jose, I. Zitouni•Sun May 15 2016

This work introduces a new practical estimator that uses logged data to estimate a policy's performance. The estimator is accurate in a variety of settings, including as a subroutine in a learning-to-rank task, where it achieves competitive performance.

244 0

Importance Sampling Policy Evaluation with an Estimated Behavior Policy

P. Stone, Josiah P. Hanna, S. Niekum•Thu May 31 2018

This paper studies importance sampling with an estimated behavior policy, where the behavior policy estimate comes from the same data used to compute the importance sampling estimate. It finds that this estimator often lowers the mean squared error of off-policy evaluation compared to importance sampling with the true behavior policy or with a behavior policy estimated from a separate data set.

70 0
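To make the idea in "Importance Sampling Policy Evaluation with an Estimated Behavior Policy" concrete, here is a minimal sketch in a single-step (bandit) setting, which is a simplification of the sequential setting the paper studies. The count-based behavior policy estimate and the names `estimate_behavior_policy` and `is_estimate` are assumptions of this example.

```python
from collections import Counter


def estimate_behavior_policy(logs):
    """Count-based estimate of the behavior policy from the logged data itself."""
    pair_counts = Counter((c, a) for c, a, _ in logs)
    context_counts = Counter(c for c, _, _ in logs)

    def policy(context, action):
        return pair_counts[(context, action)] / context_counts[context]

    return policy


def is_estimate(logs, target_policy, behavior_policy):
    """Importance sampling estimate with a supplied behavior policy."""
    total = 0.0
    for context, action, reward in logs:
        total += target_policy(context, action) / behavior_policy(context, action) * reward
    return total / len(logs)


# Toy data: the true behavior policy is uniform over two actions, but by
# chance action 0 was logged three times out of four.
logs = [(0, 0, 1.0), (0, 0, 1.0), (0, 0, 1.0), (0, 1, 0.0)]


def true_behavior(context, action):
    return 0.5


def target_policy(context, action):
    return 1.0 if action == 0 else 0.0  # always choose action 0


with_true = is_estimate(logs, target_policy, true_behavior)
with_estimated = is_estimate(logs, target_policy, estimate_behavior_policy(logs))
```

In this toy case action 0 always yields reward 1, so the target policy's value is 1.0. Using the true propensity of 0.5 overshoots because action 0 is over-represented in the sample, while the estimated propensity of 0.75 corrects for that sampling error.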
