Balancing Quality and Quantity in Machine Learning for Science

Balancing ML methods in scientific research

Introduction

Recent advances in machine learning (ML) have brought transformative changes across many disciplines, particularly scientific research. Nonetheless, several challenges persist: new models must be validated effectively, tested and evaluated robustly, and shown to work in practical, real-world scenarios. Key issues include evaluations that are biased, subjective, or unbalanced, often unintentionally; datasets that fail to represent real-world applications accurately (for instance, overly simplistic datasets); and improper methodologies for splitting data into training, validation, and test sets. This article delves into these issues, drawing on examples from the biological sciences, a field undergoing significant change due to ML techniques.

I will also touch on the interpretability of ML models, which remains quite limited yet is crucial for addressing many of the limitations outlined above.

While the capabilities of some ML models may be overstated, this does not imply that they lack utility or that they have not contributed valuable insights that advance specific subfields of ML.

The Proliferation of ML Models

The rapid increase in the publication of ML research papers focused on scientific applications has prompted me to question the revolutionary claims made by these studies. For instance, while AlphaFold 2 was indeed groundbreaking, what is the broader landscape of ML tools being introduced? How can a multitude of ML tools be touted as "the best" simultaneously? Is the foundational research robust? And if the work is novel and of high quality, can we guarantee that evaluations are impartial and comprehensive? Are the proposed real-world applications as groundbreaking as claimed?

Every time a new ML technique for structural biology emerges, I find myself pondering its validity and, more importantly, its actual effectiveness for my research.

Researchers are increasingly implementing state-of-the-art neural network developments to tackle longstanding scientific challenges, resulting in notable advancements. However, it is vital to ensure that evaluations are fair and objective, and that datasets and predictive capabilities genuinely reflect the practical applications of the ML models.

The surge in social media discussions, preprints, and peer-reviewed publications surrounding modern neural network techniques (transformers, diffusion models, etc.) indicates a promising trend in addressing long-standing scientific questions. This trend is exemplified by AlphaFold 2's success in protein structure prediction during CASP14, as well as breakthroughs in protein design, particularly through D. Baker’s ProteinMPNN, whose predicted sequences underwent extensive experimental validation.

Potential Issues in Datasets and Evaluations

A significant concern in ML research is the quality of datasets used for model evaluations. Many studies assess their models using inadequate datasets that do not accurately reflect real-world applications. Common issues include datasets that overlap with training data or fail to represent realistic scenarios.
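As a concrete illustration of how such overlap can be caught, the sketch below flags test sequences that are suspiciously similar to training sequences using a crude k-mer based similarity. The sequences, k-mer size, and threshold are hypothetical placeholders; real curation pipelines typically rely on dedicated clustering tools such as MMseqs2 or CD-HIT rather than this toy metric.

```python
# Minimal sketch: flag test sequences that overlap too closely with training data.
# Sequences, k-mer size, and threshold are hypothetical placeholders.

def kmer_set(seq: str, k: int = 3) -> set:
    """Decompose a sequence into its overlapping k-mers."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def kmer_similarity(a: str, b: str, k: int = 3) -> float:
    """Crude similarity proxy: Jaccard index over shared k-mers."""
    ka, kb = kmer_set(a, k), kmer_set(b, k)
    if not ka or not kb:
        return 0.0
    return len(ka & kb) / len(ka | kb)

def flag_leaky_test_entries(train_seqs, test_seqs, threshold=0.5):
    """Return test sequences whose similarity to any training sequence exceeds
    the threshold, i.e. likely train/test overlap (data leakage)."""
    return [t for t in test_seqs
            if any(kmer_similarity(t, tr) > threshold for tr in train_seqs)]

# Hypothetical usage with toy sequences: the first test entry is a near-duplicate
# of a training entry and should be flagged.
train = ["MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ", "MADEEKLPPGWEKRMSRSSGRVYYFNHITNASQ"]
test = ["MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVA", "MSLEQKKGADIISKILQIQNSIGKTTSPSTLKT"]
print(flag_leaky_test_entries(train, test))
```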

This is not typically due to malicious intent. Training ML models demands vast datasets that are often impractical to curate manually, leading to inherent limitations in automated curation processes. Furthermore, many publications tend to highlight only favorable cases that demonstrate the model's applicability, neglecting instances that may lack biological significance or are difficult to interpret.

These problems are not exclusive to ML but rather highlight broader issues within scientific research: a tendency to emphasize positive results while disregarding negative or inconclusive findings, which are equally essential for avoiding wasted resources. The publish-or-perish mentality often leads to an overemphasis on positive outcomes, sometimes overstating claims of novelty and superiority.

Given these challenges, I believe that rigorous contests like CASP, along with studies aimed at objectively benchmarking existing methods, contribute more significantly to the field's advancement than the majority of publications detailing new models.
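To make the spirit of such objective benchmarking concrete, here is a minimal sketch, assuming every method has been scored per target with the same metric on the same held-out set; the method names and scores are invented placeholders.

```python
# Minimal sketch of an even-handed benchmark: every method is scored on the
# same held-out targets with the same metric. Names and scores are invented.
from statistics import mean, stdev

# Hypothetical per-target scores (e.g. a structural accuracy metric in [0, 1]).
results = {
    "method_A": [0.81, 0.64, 0.92, 0.55, 0.78],
    "method_B": [0.79, 0.70, 0.88, 0.60, 0.74],
}

for method, scores in results.items():
    print(f"{method}: mean={mean(scores):.2f}, sd={stdev(scores):.2f}, n={len(scores)}")
```

Reporting the spread alongside the mean, on identical targets, is what keeps such comparisons honest.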

Methodology Limitations and Interpretability

One of the major limitations of ML, particularly in structural biology, is the interpretability of models. Many ML systems operate as black boxes, providing little insight into how they achieve their predictions. Understanding why a model performs well or poorly is crucial, especially when addressing complex extrapolations, such as predicting the structure of a protein with a significantly different architecture from known structures.
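To illustrate one basic form of interpretability, the sketch below computes input-gradient saliency for a toy model, assuming PyTorch is available; the network and input are stand-ins for a real predictor, not anything taken from structural biology practice.

```python
# Minimal sketch of input-gradient saliency, a simple interpretability probe.
# The model and input below are hypothetical stand-ins for a real predictor.
import torch
import torch.nn as nn

torch.manual_seed(0)

# Placeholder "black box": a small feed-forward network.
model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 1))
model.eval()

# Hypothetical input features (e.g. an encoded residue window), with gradients enabled.
x = torch.randn(1, 16, requires_grad=True)

# Gradient of the prediction with respect to the input.
model(x).sum().backward()

# The gradient magnitude gives a crude per-feature importance score.
saliency = x.grad.abs().squeeze()
print(saliency)
```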

A lack of interpretability hinders our ability to gain insights into the underlying principles of the biological systems being modeled. This can lead to an erosion of trust in ML applications, particularly in critical scenarios where accuracy is paramount.

Improving interpretability could alleviate many challenges associated with the development and application of ML models, potentially identifying issues before they arise and enhancing the balance between quality and quantity in scientific research.