Data valuation in machine learning tries to determine the worth of data, or of datasets, for downstream tasks. Some methods are task-agnostic and consider datasets as a whole, mostly for decision making in data markets; these look at distributional distances between samples. More often, methods look at how individual points affect the performance of a specific machine learning model: they assign a scalar to each element of a training set reflecting its contribution to the final performance of a model trained on it. Some notions of value depend on a specific model of interest, while others are model-agnostic. Concepts of the usefulness of a datum, or of its influence on the outcome of a prediction, have a long history in statistics and ML, in particular through the notion of the influence function. However, rigorous and practical notions of value for data, and in particular for datasets, have appeared in the ML literature only recently, often based on concepts from cooperative game theory, but also on generalization estimates of neural networks or on optimal transport theory, among others.
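The simplest model-dependent notion of value described above is the leave-one-out (LOO) contribution: the drop in utility (e.g. test accuracy) when a single training point is removed. A minimal sketch, using a toy nearest-centroid classifier as the model of interest (the classifier, the binary-label assumption, and all names here are illustrative choices, not part of any particular method):

```python
import numpy as np

def accuracy(train_X, train_y, test_X, test_y):
    """Utility U(S): test accuracy of a nearest-centroid classifier
    trained on S. Assumes binary labels 0/1."""
    centroids = np.array([train_X[train_y == c].mean(axis=0) for c in (0, 1)])
    # Predict the class of the nearest centroid (squared Euclidean distance).
    dists = ((test_X[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)
    preds = np.argmin(dists, axis=1)
    return (preds == test_y).mean()

def leave_one_out_values(X, y, X_test, y_test):
    """LOO value of point i: v_i = U(D) - U(D \\ {i}).
    Negative values flag points that hurt the trained model."""
    full = accuracy(X, y, X_test, y_test)
    n = len(X)
    values = np.empty(n)
    for i in range(n):
        mask = np.arange(n) != i
        values[i] = full - accuracy(X[mask], y[mask], X_test, y_test)
    return values
```

On a toy set where one point carries the wrong label, its LOO value comes out negative while clean points score zero or higher; LOO is cheap (n retrainings) but, unlike the Shapley-based values below, ignores interactions between points.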
This work develops a principled framework for data valuation in supervised machine learning by proposing data Shapley as a metric to quantify the value of each training datum to the predictor's performance.
A Monte Carlo approximation algorithm is proposed, which is up to three orders of magnitude faster than the baseline approximation algorithm and can accelerate the value calculation process even further.
An overview is given of the most important applications of the Shapley value in machine learning: feature selection, explainability, multi-agent reinforcement learning, ensemble pruning, and data valuation.
It is demonstrated that Beta Shapley outperforms state-of-the-art data valuation methods on several downstream ML tasks such as: 1) detecting mislabeled training data; 2) learning with subsamples; and 3) identifying points whose addition or removal have the largest positive or negative impact on the model.
This study suggests that when the underlying ML algorithm is stochastic, the Banzhaf value is a promising alternative to the other semivalue-based data value schemes given its computational advantage and ability to robustly differentiate data quality.
This work proposes CS-Shapley, a Shapley value with a new value function that discriminates between training instances' in-class and out-of-class contributions, and suggests Shapley-based data valuation is transferable for application across different models.
It is demonstrated that the proposed Data-OOB significantly outperforms existing state-of-the-art data valuation methods in identifying mislabeled data and finding a set of helpful (or harmful) data points, highlighting the potential for applying data values in real-world applications.
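The out-of-bag idea behind Data-OOB can be sketched very simply: score each training point by how often the members of a bagging ensemble that did *not* see it predict it correctly, so no extra retraining beyond fitting the ensemble is needed. The sketch below is a deliberately stripped-down illustration, assuming a label-majority weak learner instead of the trees fit on features that the actual method uses:

```python
import numpy as np

def oob_values(y, n_models=50, seed=0):
    """Data-OOB-style values: each point's value is the fraction of
    bootstrap models for which it was out-of-bag and correctly predicted.

    Weak learner (an assumption for this sketch): predict the majority
    label of the bootstrap sample. Assumes non-negative integer labels.
    """
    rng = np.random.default_rng(seed)
    n = len(y)
    correct = np.zeros(n)
    counts = np.zeros(n)
    for _ in range(n_models):
        boot = rng.integers(0, n, size=n)           # bootstrap sample (with replacement)
        oob = np.setdiff1d(np.arange(n), boot)      # points left out of this sample
        majority = np.bincount(y[boot]).argmax()    # fit the weak learner
        correct[oob] += (y[oob] == majority)        # evaluate only on OOB points
        counts[oob] += 1
    return correct / np.maximum(counts, 1)          # guard against never-OOB points
```

Even with this crude learner, points whose labels disagree with the bulk of the data receive low values, which is the mechanism the method exploits for detecting mislabeled data.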
This paper proposes a repertoire of efficient algorithms for approximating the Shapley value, a popular notion of value which originated in cooperative game theory, and demonstrates the value of each training instance on various benchmark datasets.
The corrupted sample discovery performance of DVRL is close to optimal in many regimes, and for domain adaptation and robust learning DVRL significantly outperforms state-of-the-art by 14.6% and 10.8%, respectively.
This work proposes to boost the efficiency of computing the Shapley value and the Least core by learning to estimate the performance of a learning algorithm on unseen data combinations, and derives bounds relating the error in the predicted learning performance to the approximation error in the resulting SV and LC estimates.