We are in the midst of a significant boom in data-driven science, in which large and intricate data sets, often containing numerous individually measured and annotated 'features', are fed to voracious artificial intelligence (AI) and machine-learning systems. New applications built on these data sets are published almost daily. Publication alone, however, does not guarantee accuracy or freedom from errors. Scientists must verify the accuracy and validity of these resources before using them; otherwise, errors propagate into new work. Unfortunately, this has already happened.

In recent months, our bioinformatics and systems biology laboratory reviewed state-of-the-art machine-learning methods for predicting metabolic pathways from the chemical structures of metabolites. We sought to identify, implement and, where possible, improve the best methods for understanding how metabolic pathways are affected under different conditions, such as in diseased versus normal tissues. We found several papers, published between 2011 and 2022, that applied various machine-learning methods to a gold-standard metabolite data set derived from the Kyoto Encyclopedia of Genes and Genomes (KEGG), maintained at Kyoto University in Japan. We expected these algorithms to improve over time, and indeed newer methods outperformed older ones. However, we questioned whether those improvements were genuine.

Scientific reproducibility plays a crucial role in vetting data and results, enabling peer reviewers and other research groups to scrutinize a data set when it is used in new applications. Fortunately, two of the papers in our analysis followed best practices for computational reproducibility, providing the resources needed to test their observations: the data set, the computer code and the results generated by that code. Three of the papers used the same data set, enabling direct comparisons.

To our surprise, we uncovered a significant problem known as 'data leakage': the two subsets of the data set used for training and testing the models were cross-contaminated, undermining the separation they are meant to preserve. A substantial portion of the entries in the KEGG COMPOUND database, more than 1,700 out of 6,648, appeared multiple times, corrupting the cross-validation. After removing the duplicates and reapplying the published methods, we observed a notable drop in the F1 score, a machine-learning evaluation metric similar to accuracy but calculated from precision and recall, from 0.94 to 0.82. A score of 0.94 is reasonably high and would make an algorithm usable in many scientific applications; a score of 0.82 suggests limited usefulness, applicable only in certain scenarios and with appropriate handling.
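To make the failure mode concrete, here is a minimal sketch, in Python with pandas and scikit-learn, of how deduplicating compound entries before the train/test split and computing the F1 score fit together. The column names (compound_id, pathway, x1, x2), the random-forest classifier and the synthetic demo data are illustrative assumptions, not the pipeline used in any of the papers we reviewed.

# Minimal sketch of leakage-free evaluation (illustrative, not the published pipeline):
# deduplicate KEGG COMPOUND entries *before* splitting, then score with F1.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score, precision_score, recall_score
from sklearn.model_selection import train_test_split

def evaluate_without_leakage(df: pd.DataFrame, feature_cols: list[str]) -> float:
    # Duplicate compounds let the same metabolite land in both the training
    # and the test subset, which is exactly the data-leakage problem above.
    deduped = df.drop_duplicates(subset="compound_id")

    X = deduped[feature_cols]
    y = deduped["pathway"]
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, stratify=y, random_state=0
    )

    model = RandomForestClassifier(random_state=0).fit(X_train, y_train)
    y_pred = model.predict(X_test)

    # F1 is the harmonic mean of precision and recall; with several pathway
    # classes we report an average weighted by class frequency.
    precision = precision_score(y_test, y_pred, average="weighted", zero_division=0)
    recall = recall_score(y_test, y_pred, average="weighted", zero_division=0)
    print(f"precision={precision:.2f}, recall={recall:.2f}")
    return f1_score(y_test, y_pred, average="weighted")

if __name__ == "__main__":
    import numpy as np
    rng = np.random.default_rng(0)
    n = 300
    clean = pd.DataFrame({
        "compound_id": [f"C{i:05d}" for i in range(n)],
        "x1": rng.normal(size=n),
        "x2": rng.normal(size=n),
        "pathway": rng.integers(0, 3, size=n),
    })
    # Duplicate a third of the entries to mimic a contaminated data set.
    contaminated = pd.concat([clean, clean.iloc[:100]], ignore_index=True)
    print(f"F1 after deduplication: {evaluate_without_leakage(contaminated, ['x1', 'x2']):.2f}")

The essential point is the order of operations: duplicates must be removed (or at least grouped) before the split, or before cross-validation folds are built, so that an identical entry can never sit on both sides of the train/test boundary.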

It is unfortunate that flawed results stemming from the corrupted data set were published in these studies, casting doubt on their findings. However, because the authors of two of the studies adhered to best practices for computational scientific reproducibility and made their data, code and results fully available, the scientific method worked: the flawed results were detected and can potentially be corrected. The third study could not be properly evaluated, because its authors provided neither their data set nor their code. If all of the groups had omitted their data and code, this data-leakage issue would have been nearly impossible to identify. That would affect not only the previously published studies but also any other scientists who might wish to use the data set in their own research.

Moreover, the erroneously high performance reported in these papers could discourage others from attempting to improve on the published methods, because their own algorithms would appear inferior in comparison. It could also complicate the journal publication process, since demonstrating an improvement often serves as a requirement for successful review, potentially delaying research for years.

So, how should we handle these erroneous studies? Some argue for retraction, but we caution against a blanket policy of retraction as a knee-jerk reaction. In our analysis, because two of the three papers included the necessary data, code and complete results, we were able to evaluate their findings and highlight the problematic data set. This type of behavior should be encouraged, for example by allowing the authors to publish corrections. Retraction is better reserved for studies whose results are deeply flawed and that provide no support for reproducible research; treating reproducible and non-reproducible work alike would imply that scientific reproducibility is optional. Demonstrated support for full scientific reproducibility also gives journals a clear guideline for deciding whether correction or retraction is appropriate.

Scientific data are becoming increasingly complex. Data sets used in complex analyses, especially those involving AI, are an integral part of the scientific record. To ensure their availability, they should be made accessible along with the code used for analysis, either as supplementary material or through open data repositories such as Figshare and Zenodo, which guarantee data persistence and provenance. Researchers must also exercise skepticism when working with published data, to avoid repeating the mistakes of others.

This article is from the Nature Careers Community, a platform for Nature readers to share their professional experiences and advice. Guest posts are encouraged.

Reference: Yang, Z., Liu, J., Wang, Z., Wang, Y. & Feng, J. In 2020 IEEE International Conference on Bioinformatics and Biomedicine 126–131 (Institute of Electrical and Electronics Engineers, 2020).

