Reproducibility has been an important and intensely debated topic in science and medicine for the past few decades.1 As the scientific enterprise has grown in scope and complexity, concerns have emerged regarding how well new findings can be reproduced and validated across different scientific teams and study populations. In some instances,2 the failure to replicate numerous previous studies has added to the growing concern that science and biomedicine may be in the midst of a "reproducibility crisis." Against this backdrop, high-capacity machine learning models are beginning to demonstrate early successes in clinical applications,3 and some have received approval from the US Food and Drug Administration. This new class of clinical prediction tools presents unique challenges and obstacles to reproducibility, which must be carefully considered to ensure that these techniques are valid and deployed safely and effectively.

Reproducibility is a minimal prerequisite for the creation of new knowledge and scientific progress, but defining precisely what it means for a scientific study to be "reproducible" is complex and has been the subject of considerable effort by both individual researchers and organizations such as the National Academies of Sciences, Engineering, and Medicine. First, it is important to distinguish between the notions of reproducibility and replication. A study is reproducible if, given access to the underlying data and analysis code, an independent group can obtain the same result observed in the original study. However, being reproducible does not imply that a study is correct, only that the results were able to be verified by a different group not involved in the original study. A study is replicable if an independent group studying the same phenomenon reaches the same conclusion after performing the same set of experiments or analyses on newly collected data.
The discussion around reproducibility and replication has primarily focused on traditional statistical models and the results from randomized clinical trials, but these considerations can and should apply equally to machine learning studies. Challenges to reproducibility and replication include confounding, multiple hypothesis testing, randomness inherent to the analysis procedure, incomplete documentation, and restricted access to the underlying data and code. The last concern, data access, is especially germane for medicine, as privacy barriers are important considerations for data sharing. However, by definition, replication does not require access to the original data or code because a replication exercise examines the extent to which the original phenomenon generalizes to new contexts and new populations. This Viewpoint focuses on reproducibility, even though it is important to acknowledge that replication is often the ultimate goal. Replication is especially important for studies that use observational data (which is almost always the case for machine learning studies) because these data are often biased, and models could operationalize this bias if not replicated. Reproducing a machine learning model trained by another research team can be difficult, perhaps even prohibitively so, even with unfettered access to raw data and code.