Off-policy Evaluation (OPE), or offline evaluation in general, estimates the performance of hypothetical policies using only offline log data. It is particularly useful in applications where online interaction is high-stakes and expensive, such as precision medicine and recommender systems.
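For concreteness, the most basic OPE estimator, inverse propensity scoring (IPS), reweights each logged reward by the ratio of the evaluation policy's and the logging policy's action probabilities. The sketch below is a minimal, self-contained illustration on synthetic, context-free bandit logs; the setup and variable names are illustrative, not taken from any particular paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic logged bandit feedback: actions were chosen by a known
# logging (behavior) policy pi_0; contexts are omitted for brevity.
n_rounds, n_actions = 10_000, 5
pi_0 = np.full(n_actions, 1.0 / n_actions)            # uniform logging policy
actions = rng.choice(n_actions, size=n_rounds, p=pi_0)
true_means = np.linspace(0.1, 0.5, n_actions)         # unknown in practice
rewards = rng.binomial(1, true_means[actions])

# Hypothetical evaluation policy pi_e whose value we want to estimate offline.
pi_e = np.array([0.05, 0.05, 0.10, 0.20, 0.60])

# Inverse propensity scoring: reweight logged rewards by pi_e / pi_0.
w = pi_e[actions] / pi_0[actions]
v_ips = np.mean(w * rewards)

print(f"IPS estimate of pi_e's value: {v_ips:.4f}")
print(f"Ground-truth value:           {pi_e @ true_means:.4f}")
```

IPS is unbiased when the logging probabilities are known and non-zero wherever the evaluation policy puts mass, but its variance grows with the importance weights; much of the work listed below refines this bias-variance tradeoff.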
The Open Bandit Dataset is presented: a public logged bandit dataset collected on a large-scale fashion e-commerce platform, ZOZOTOWN, that enables experimental comparisons of different OPE estimators for the first time. The authors also develop Python software called Open Bandit Pipeline to streamline and standardize the implementation of batch bandit algorithms and OPE.
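A rough sketch of how Open Bandit Pipeline is typically used for OPE on (a sample of) the Open Bandit Dataset, following the library's public quickstart; class and argument names should be checked against the current obp documentation, and the uniform action distribution below stands in for a real evaluation policy.

```python
import numpy as np
from obp.dataset import OpenBanditDataset
from obp.ope import InverseProbabilityWeighting, OffPolicyEvaluation

# Logged bandit feedback collected by the "random" logging policy on ZOZOTOWN.
dataset = OpenBanditDataset(behavior_policy="random", campaign="all")
bandit_feedback = dataset.obtain_batch_bandit_feedback()

# An evaluation policy is expressed as an action distribution of shape
# (n_rounds, n_actions, len_list); a uniform policy is used here only as a placeholder.
action_dist = np.full(
    (bandit_feedback["n_rounds"], dataset.n_actions, dataset.len_list),
    1.0 / dataset.n_actions,
)

# Estimate the evaluation policy's value with IPS.
ope = OffPolicyEvaluation(
    bandit_feedback=bandit_feedback,
    ope_estimators=[InverseProbabilityWeighting()],
)
print(ope.estimate_policy_values(action_dist=action_dist))
```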
The goal of the benchmark is to provide a standardized measure of progress that is motivated from a set of principles designed to challenge and test the limits of existing OPE methods.
This work proposes a new OPE estimator that leverages marginalized importance weights when action embeddings provide structure in the action space, and analyzes the conditions under which the action embedding provides statistical benefits over conventional estimators.
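The intuition can be seen in a simplified, context-free sketch: when rewards depend on the action only through a lower-dimensional embedding (here a discrete category), importance weights can be marginalized to the embedding level, which shrinks their variance. The deterministic action-to-category mapping, the sizes, and the names below are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup: many actions, but each action maps to one of a few embedding
# categories (e.g., an item category).
n_rounds, n_actions, n_categories = 10_000, 1000, 10
category_of_action = rng.integers(n_categories, size=n_actions)

# Logging and evaluation policies over actions (context-free for brevity).
pi_0 = rng.dirichlet(np.ones(n_actions))
pi_e = rng.dirichlet(np.ones(n_actions))

# Rewards depend on the action only through its category (the key assumption).
category_reward_means = rng.uniform(0.1, 0.9, size=n_categories)
actions = rng.choice(n_actions, size=n_rounds, p=pi_0)
rewards = rng.binomial(1, category_reward_means[category_of_action[actions]])

def marginal_over_categories(pi):
    # Marginal probability of each category under a policy:
    # p(e | pi) = sum of pi(a) over actions a in that category.
    return np.bincount(category_of_action, weights=pi, minlength=n_categories)

w_marginal = marginal_over_categories(pi_e) / marginal_over_categories(pi_0)

# Marginalized-importance-weight estimate: weight by the category-level ratio,
# which is far less extreme than the action-level ratio pi_e(a) / pi_0(a).
v_mips = np.mean(w_marginal[category_of_action[actions]] * rewards)
v_ips = np.mean((pi_e[actions] / pi_0[actions]) * rewards)
print(f"Marginalized estimate: {v_mips:.4f}, vanilla IPS estimate: {v_ips:.4f}")
```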
The SWITCH estimator is proposed, which can use an existing reward model to achieve a better bias-variance tradeoff than IPS and DR. The authors prove an upper bound on its MSE and demonstrate its benefits empirically on a diverse collection of datasets, often outperforming prior work by orders of magnitude.
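A context-free sketch of the switching idea: trust IPS where the importance weight is small, and fall back on the reward model where it is large, with a threshold tau controlling the bias-variance tradeoff. The toy data, the noisy reward model, and the names are illustrative assumptions, not the paper's experimental setup.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy logged data (context-free for brevity): actions drawn from logging policy pi_0.
n_rounds, n_actions = 10_000, 20
pi_0 = rng.dirichlet(np.ones(n_actions))
pi_e = rng.dirichlet(np.ones(n_actions))
true_means = rng.uniform(0, 1, n_actions)
actions = rng.choice(n_actions, size=n_rounds, p=pi_0)
rewards = rng.binomial(1, true_means[actions]).astype(float)

# An existing (possibly biased) reward model r_hat; here a noisy version of the truth.
r_hat = np.clip(true_means + rng.normal(0, 0.1, n_actions), 0, 1)

def switch_estimate(tau):
    w = pi_e / pi_0                        # importance weight per action
    small = w[actions] <= tau              # rounds where IPS is trusted
    ips_part = np.mean(w[actions] * rewards * small)
    # For actions whose weight exceeds tau, use the reward model instead,
    # weighted by how much probability pi_e places on them.
    dm_part = np.sum(pi_e * r_hat * (w > tau))
    return ips_part + dm_part

for tau in (1.0, 5.0, np.inf):             # tau = inf recovers plain IPS
    print(f"tau={tau}: SWITCH-style estimate = {switch_estimate(tau):.4f}")
print(f"true value = {pi_e @ true_means:.4f}")
```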
This paper introduces a modeling framework in which, in addition to training data, the learner has partial structural knowledge of the shifted test distribution, and employs the principle of minimum discriminating information to embed the available prior knowledge.
Interpretable Evaluation for Offline Evaluation is developed: an experimental procedure to evaluate OPE estimators' robustness to changes in hyperparameters and/or evaluation policies in an interpretable manner. The procedure is applied to real-world e-commerce platform data.
This work proposes the Cascade Doubly Robust estimator building on the cascade assumption, which assumes that a user interacts with items sequentially from the top position in a ranking, and shows that the proposed estimator is unbiased in more cases compared to existing estimators that make stronger assumptions on user behavior.
The key idea is to compute an estimate that minimizes the worst-case mean squared error or maximizes a worst-case lower bound on policy performance, where in both cases the worst case is taken with respect to a set of possible revenue functions.
A new practical estimator is proposed that uses logged data to estimate a policy's performance; it is accurate in a variety of settings, including as a subroutine in a learning-to-rank task, where it achieves competitive performance.
This paper studies importance sampling with an estimated behavior policy, where the behavior policy estimate comes from the same set of data used to compute the importance sampling estimate, and finds that this estimator often lowers the mean squared error of off-policy evaluation compared to importance sampling with the true behavior policy or with a behavior policy estimated from a separate data set.
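A toy illustration of the idea with a small discrete context space: the behavior policy is re-estimated from the same log via per-context empirical action frequencies, and IPS is then computed with the estimated propensities. The setup and names are illustrative, not the paper's experiments.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy contextual bandit log over a small discrete context space.
n_rounds, n_contexts, n_actions = 20_000, 4, 3
contexts = rng.integers(n_contexts, size=n_rounds)
pi_0 = rng.dirichlet(np.ones(n_actions), size=n_contexts)   # true behavior policy
pi_e = rng.dirichlet(np.ones(n_actions), size=n_contexts)   # evaluation policy
actions = np.array([rng.choice(n_actions, p=pi_0[x]) for x in contexts])
reward_means = rng.uniform(0, 1, size=(n_contexts, n_actions))
rewards = rng.binomial(1, reward_means[contexts, actions]).astype(float)

# Estimate the behavior policy from the SAME log: per-context empirical
# action frequencies (the maximum-likelihood estimate in this toy setting).
counts = np.zeros((n_contexts, n_actions))
np.add.at(counts, (contexts, actions), 1.0)
pi_0_hat = counts / counts.sum(axis=1, keepdims=True)

def ips(pi_behavior):
    w = pi_e[contexts, actions] / pi_behavior[contexts, actions]
    return np.mean(w * rewards)

true_value = (pi_e * reward_means).sum(axis=1).mean()   # contexts are uniform
print(f"IPS with true behavior policy:      {ips(pi_0):.4f}")
print(f"IPS with estimated behavior policy: {ips(pi_0_hat):.4f}")
print(f"Ground-truth policy value:          {true_value:.4f}")
```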