3260 papers • 126 benchmarks • 313 datasets
Data integration (also called information integration) is the process of consolidating data from a set of heterogeneous data sources into a single uniform data set (materialized integration) or into a unified view over the data (virtual integration). Data integration pipelines involve subtasks such as schema matching, table annotation, entity resolution, value normalization, data cleansing, and data fusion. Application domains of data integration include data warehousing, data lakes, and knowledge base consolidation.
Surveys on data integration:
Dong, Srivastava: Big Data Integration, 2013.
Doan, Halevy, Ives: Principles of Data Integration, 2012.
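One of these subtasks, entity resolution combined with value normalization, can be sketched in a few lines of Python. All records, names, and the similarity threshold below are invented for illustration:

```python
from difflib import SequenceMatcher

# Two hypothetical heterogeneous sources describing the same real-world entities.
source_a = [{"id": "a1", "name": "Acme Corp."}, {"id": "a2", "name": "Globex Inc"}]
source_b = [{"id": "b1", "name": "ACME Corporation"}, {"id": "b2", "name": "Initech"}]

def normalize(name):
    """Value normalization: lowercase, strip punctuation and legal suffixes."""
    cleaned = "".join(c for c in name.lower() if c.isalnum() or c.isspace())
    stop = {"corp", "corporation", "inc", "ltd"}
    return " ".join(w for w in cleaned.split() if w not in stop)

def match(a_records, b_records, threshold=0.8):
    """Entity resolution: pair records whose normalized names are similar."""
    pairs = []
    for a in a_records:
        for b in b_records:
            score = SequenceMatcher(None, normalize(a["name"]),
                                    normalize(b["name"])).ratio()
            if score >= threshold:
                pairs.append((a["id"], b["id"], round(score, 2)))
    return pairs

print(match(source_a, source_b))  # "Acme Corp." links to "ACME Corporation"
```

In a full pipeline the matched pairs would then feed a data fusion step that merges the linked records into one consolidated entity.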
These leaderboards are used to track progress in Data Integration.
No benchmarks available.
Use these libraries to find Data Integration models and implementations.
MIMIC-III (‘Medical Information Mart for Intensive Care’) is a large, single-center database comprising information relating to patients admitted to critical care units at a large tertiary care hospital. Data include vital signs, medications, laboratory measurements, observations and notes charted by care providers, fluid balance, procedure codes, diagnostic codes, imaging reports, hospital length of stay, survival data, and more. The database supports applications including academic and industrial research, quality improvement initiatives, and higher education coursework.
Design Type(s): data integration objective
Measurement Type(s): demographics • clinical measurement • intervention • billing • medical history dictionary • pharmacotherapy • clinical laboratory test • medical data
Technology Type(s): Electronic Medical Record • Medical Record • Electronic Billing System • Medical Coding Process Document • Free Text Format
Factor Type(s):
Sample Characteristic(s): Homo sapiens
Machine-accessible metadata file describing the reported data (ISA-Tab format)
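Integrating such per-patient records typically means joining tables on a shared patient identifier. A minimal sketch, with tiny invented rows standing in for the real MIMIC-III tables (which are distributed as CSV files keyed by SUBJECT_ID):

```python
# Hypothetical stand-ins for two MIMIC-III tables; a real pipeline would
# load PATIENTS.csv and ADMISSIONS.csv and join them on SUBJECT_ID.
patients = [
    {"SUBJECT_ID": 1, "GENDER": "F"},
    {"SUBJECT_ID": 2, "GENDER": "M"},
]
admissions = [
    {"SUBJECT_ID": 1, "DIAGNOSIS": "SEPSIS"},
    {"SUBJECT_ID": 1, "DIAGNOSIS": "PNEUMONIA"},
    {"SUBJECT_ID": 2, "DIAGNOSIS": "TRAUMA"},
]

def join_on_subject(patients, admissions):
    """Materialized integration: inner-join two record sets on SUBJECT_ID."""
    by_id = {p["SUBJECT_ID"]: p for p in patients}
    return [{**by_id[a["SUBJECT_ID"]], **a}
            for a in admissions if a["SUBJECT_ID"] in by_id]

merged = join_on_subject(patients, admissions)
print(len(merged))  # 3 admission rows, each enriched with patient demographics
```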
A novel Bayesian hybrid matrix factorisation model (HMF) for data integration is introduced; it combines multiple matrix factorisation methods and can be used for in- and out-of-matrix prediction of missing values.
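HMF itself is a Bayesian combination of several factorisations; as background, the basic idea of in-matrix prediction by factorising a matrix with missing entries can be sketched with plain alternating least squares on toy data:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data matrix with missing entries (NaN); this sketch fits a single
# rank-k factorisation, not the Bayesian hybrid model of the paper.
R = np.array([[5.0, 3.0, np.nan],
              [4.0, np.nan, 1.0],
              [np.nan, 1.0, 2.0]])
mask = ~np.isnan(R)
k, reg = 2, 0.05

U = rng.normal(size=(R.shape[0], k))
V = rng.normal(size=(R.shape[1], k))

for _ in range(20):
    # Update each row factor against that row's observed entries only.
    for i in range(R.shape[0]):
        obs = mask[i]
        A = V[obs].T @ V[obs] + reg * np.eye(k)
        U[i] = np.linalg.solve(A, V[obs].T @ R[i, obs])
    # Symmetric update for the column factors.
    for j in range(R.shape[1]):
        obs = mask[:, j]
        A = U[obs].T @ U[obs] + reg * np.eye(k)
        V[j] = np.linalg.solve(A, U[obs].T @ R[obs, j])

R_hat = U @ V.T  # in-matrix predictions, including the three missing cells
```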
Constraint-based Optimization of Metabolic Objectives (COMO), a user-friendly pipeline that integrates multi-omics data processing, context-specific metabolic model development, simulations, drug databases and disease data to aid drug discovery, is developed.
A novel semi-supervised heterogeneous label propagation algorithm named Heter-LP is proposed, which applies both local and global network features for data integration and implements a label propagation algorithm to find new interactions in drug repositioning research.
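Heter-LP operates on heterogeneous multi-network data; the underlying label propagation idea can be sketched on a single toy graph (the adjacency matrix, seed labels, and damping factor below are invented for illustration):

```python
import numpy as np

# Toy symmetric adjacency over 5 nodes (e.g. drugs/targets).
A = np.array([[0, 1, 1, 0, 0],
              [1, 0, 1, 0, 0],
              [1, 1, 0, 1, 0],
              [0, 0, 1, 0, 1],
              [0, 0, 0, 1, 0]], dtype=float)

# Row-normalize the adjacency into a transition matrix.
W = A / A.sum(axis=1, keepdims=True)

# Seed labels: node 0 is a known positive, node 4 a known negative.
y = np.array([1.0, 0.0, 0.0, 0.0, -1.0])
clamp = np.array([True, False, False, False, True])

f = y.copy()
alpha = 0.8  # weight of propagated information vs. the initial labels
for _ in range(100):
    f = alpha * (W @ f) + (1 - alpha) * y
    f[clamp] = y[clamp]  # keep the seed labels fixed

# Nodes near the positive seed end up with higher scores than nodes
# near the negative seed, ranking candidate new interactions.
```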
This work develops a novel method for feature learning on biological knowledge graphs that combines symbolic methods, in particular knowledge representation using symbolic logic and automated reasoning, with neural networks to generate embeddings of nodes that encode for related information within knowledge graphs.
This paper introduces a joint matrix factorization (JMF) framework regularized by sparse multiple-relationship data, along with two adapted prediction models for pattern recognition and data integration, and presents four update algorithms to solve the framework.
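The core of any joint matrix factorization is a factor matrix shared across data matrices. A minimal sketch on synthetic rank-k data, without the sparsity and multi-relationship regularization of the actual JMF framework:

```python
import numpy as np

rng = np.random.default_rng(1)
n, m1, m2, k = 6, 4, 5, 2

# Two synthetic data matrices over the same n samples, generated to share
# a common row factor -- a stand-in for the multi-source data JMF targets.
U_true = rng.normal(size=(n, k))
X1 = U_true @ rng.normal(size=(k, m1))
X2 = U_true @ rng.normal(size=(k, m2))

# Joint factorization: one shared U, matrix-specific V1 and V2, fitted by
# alternating least squares on the column-stacked problem.
V1 = rng.normal(size=(m1, k))
V2 = rng.normal(size=(m2, k))
for _ in range(30):
    V = np.vstack([V1, V2])                        # (m1 + m2, k)
    X = np.hstack([X1, X2])                        # (n, m1 + m2)
    U = np.linalg.lstsq(V, X.T, rcond=None)[0].T   # shared row factor
    V1 = np.linalg.lstsq(U, X1, rcond=None)[0].T
    V2 = np.linalg.lstsq(U, X2, rcond=None)[0].T

# On exactly rank-k data the joint fit reconstructs both matrices.
err = np.linalg.norm(X1 - U @ V1.T) + np.linalg.norm(X2 - U @ V2.T)
```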
A new machine learning model with engineered features as well as two deep learning models which do not require extensive feature engineering are developed to address the lack of a well-defined semantic description for relational schema.
This work proposes a general statistical framework based on Gaussian graphical models for horizontal and vertical integration of information in such datasets, and develops a debiasing technique and asymptotic distributions of inter-layer directed edge weights that utilize already computed neighborhood selection coefficients for nodes in the upper layer.