The Curious Case of How MS-excel Was a Nightmare for Bioinformatics

Last Updated on August 3, 2022 by Editorial Team

Author(s): Salvatore Raieli

Originally published on Towards AI the World’s Leading AI and Technology News and Media Company. If you are building an AI-related product or service, we invite you to consider becoming an AI sponsor. At Towards AI, we help scale AI and technology startups. Let us help you unleash your technology to the masses.

An example of how Ms-Excel can be deleterious in data science

How do we have no power in front of Excel’s habit of arbitrarily changing the format to what it does like?

The unforgettable sin of Ms-Excel of deciding by itself

There is no shortage of science fiction films where machines decide to ignore human orders and rebel (2001: A Space Odyssey, to name the most famous). Anyone who has worked with Ms-Excel might think that this dystopian future is not such an impossible eventuality.

Dave: Open the pod bay doors, HAL.

HAL: I’m sorry, Dave. I’m afraid I can’t do that.

2001: A space Odissey

I don’t know why, but Ms-Excel is fascinated by dates, and this passion results in randomly turning numbers and acronyms into dates. In 2016, a scientific article conducted an analysis on the phenomenon and realized that around 20% of the articles in the literature contain errors related to Ms-Excel. For example, SEPT2 (Septin 2) and MARCH1 (Membrane-Associated Ring Finger (C3HC4) 1, E3 Ubiquitin Protein Ligase) are converted by default to ‘2-Sep’ and ‘1-Mar’. Sometimes, MS-Excel does even better: it transforms SEPT2 in ‘2006/09/02’. There is no way you can turn off this auto-format feature and correcting it is a painstaking process.

In the same article, they analyzed more than 35,000 articles from 2005 to 2015, showing that every journal was impacted. Ironically, the most impacted journal in the analysis is Nature, which is considered the bible of science journals.

gene name errors due to Ms-excel — From the original article “*Prevalence of gene name errors in supplementary Excel files.* a Percentage of published papers with supplementary gene lists in Excel files affected by gene name errors. b Increase in gene name errors by year”

The omics revolution

The right panel also show how the problem increased dramatically year after year. In fact, the Omics revolution started in the 2000s when next-generation sequences started to be much cheaper and widespread. Nowadays, few articles have not performed some omics analysis (like RNAseq, ChipSeq, and so on). Why? Because these new techniques are able to provide a snapshot of a tumor or tissue. To date, they have achieved single-cell resolution (e.g., single-cell RNAseq).

Since you can obtain a huge amount of information, these analyses are becoming ubiquitous. From cancer research to clinical trials, scientists are relying more on the output of some sequencing. These large sequencings generate terabytes of data each year, and to analyze them, it is necessary to use Bash scripts, R, and python.

Considering the need to use sophisticated scripts to analyze all this data, one would think that the situation has improved and that we no longer have to worry about Ms-Excel. Instead, an article from 2021 shows that we have not learned our lesson and MS-Excel is more alive than ever.

As can be seen, the number of errors did not decrease (it remained constant if not increased).

Why spreadsheet are so spread?

It is difficult to associate Ms-Excel with biology or other scientific disciplines, yet it is everywhere. Not just because it is generally installed by default on all computers in laboratories and hospitals. Its success lies in the fact that, in spite of all its bugs and other farfetched annoyances, it allows a lot of analysis to be conducted quickly. Students and professors can open a plethora of files and do basic analysis and visualizations within minutes. In addition, it is also one of the accepted formats when submitting a publication to a scientific journal.

Certainly, R and Python are much better performers, allowing for a much more sophisticated and complex analysis of a huge amount of data and graphs. On the other hand, you have to know the syntax and install libraries, and you waste a lot of time in a series of errors. So when you are tired, that green icon on the desktop is… so so inviting.

Can we solve it?

In 2017, the HUGO Gene Nomenclature Committee (HGNC) decided to rename 27 genes to avoid confusion. Since these committees are generally really difficult to organize and normally conservative in their choices, imagine how much annoyed they were to decide to rename genes just to make them Excel-friendly. However, as we have seen, this was not enough.

Researchers proposed different solutions: in fact, the 2016 article already proposed a script to solve the issue. In 2022 (which shows that this problem still matters) other researchers proposed a web tool to autocorrect Excel misidentified gene names (which you can find here).

It is still too early to say if these efforts will solve the issue or if Ms-Excel will still haunt the sleep of the researchers.

Conclusions and take-away

“It’s Excel’s world, we just live in it.”
— jgalt212 on HackerNews

HGNC has created guidelines on how to decide on a name for a gene (which generally are directed against human stupidity and avoid dumb or offensive names being chosen). However, they forgot the computer stupidity and how Ms-Excel is widespread. True, scientists could abandon Ms-Excel, but it is installed everywhere, and sometimes you need to be quick.

The beauty of this story is the fact that it only takes an insignificant detail to have large-scale nefarious effects. Potentially simple Ms-Excel errors (hopefully unintentional, or it is already part of the machine's plan to take over the world?) can be dragged on for years. An article in Nature can have dozens if not hundreds of citations and influence researchers around the world. After all, genetic testing is becoming increasingly common, and cancer therapies are increasingly decided on mutational profiling. So extreme caution in analyzing articles or lists of genes would be necessary.

This story also teaches us that data collection and selection are critical, but so is the choice of algorithms and software. Moreover, Ms-Excel is showing strange behaviors with different types of data (for example, numbers or strings) and should be avoided in data science (to avoid deleterious effects). In addition, one should always think about the end user because they could be used in an unimagined way with catastrophic results.

if you have found it interesting:

You can look for my other articles, you can also subscribe to get notified when I publish articles, and you can also connect or reach me on LinkedIn. Thanks for your support!

Here is the link to my Github repository, where I am planning to collect code and many resources related to machine learning, artificial intelligence, and more.

GitHub – SalvatoreRa/tutorial: Tutorials on machine learning, artificial intelligence, data science with math explanation and reusable code (in python and R)

The Curious Case of How MS-excel Was a Nightmare for Bioinformatics was originally published in Towards AI on Medium, where people are continuing the conversation by highlighting and responding to this story.

Join thousands of data leaders on the AI newsletter. It’s free, we don’t spam, and we never share your email address. Keep up to date with the latest work in AI. From research to projects and ideas. If you are building an AI startup, an AI-related product, or a service, we invite you to consider becoming a sponsor.

Published via Towards AI

Frequently Used, Contextual References

Resources

Publication

The Curious Case of How MS-excel Was a Nightmare for Bioinformatics

Author(s): Salvatore Raieli

An example of how Ms-Excel can be deleterious in data science

if you have found it interesting:

Towards AI Team

Feedback ↓ Cancel reply

Popular posts

Best Laptops for Deep Learning, Machine Learning (ML), and Data Science for 2023

Best Workstations for Deep Learning, Data Science, and Machine Learning (ML) for 2022

Descriptive Statistics for Data-driven Decision Making with Python

Best Machine Learning (ML) Books - Free and Paid - Editorial Recommendations for 2022

Best Data Science Books - Free and Paid - Editorial Recommendations for 2022

Updates

Recent Posts

LAI #66: Information Theory for People in a Hurry

🔎 Decoding LLM Pipeline — Step 1: Input Processing & Tokenization

Meta to Launch Its Own In-House AI Chip

I Built an AI Money Coach in Python — Here’s How You Can Too (Step-by-Step Guide!)

ChatGPT Now Works Natively in Xcode and VS Code

The World’s Leading AI and Technology Publication.

Company

CONTACT US

🔥 Recommended Articles 🔥

Cookie	Duration	Description
cookielawinfo-checkbox-analytics	11 months	This cookie is set by GDPR Cookie Consent plugin. The cookie is used to store the user consent for the cookies in the category "Analytics".
cookielawinfo-checkbox-functional	11 months	The cookie is set by GDPR cookie consent to record the user consent for the cookies in the category "Functional".
cookielawinfo-checkbox-necessary	11 months	This cookie is set by GDPR Cookie Consent plugin. The cookies is used to store the user consent for the cookies in the category "Necessary".
cookielawinfo-checkbox-others	11 months	This cookie is set by GDPR Cookie Consent plugin. The cookie is used to store the user consent for the cookies in the category "Other.
cookielawinfo-checkbox-performance	11 months	This cookie is set by GDPR Cookie Consent plugin. The cookie is used to store the user consent for the cookies in the category "Performance".
viewed_cookie_policy	11 months	The cookie is set by the GDPR Cookie Consent plugin and is used to store whether or not user has consented to the use of cookies. It does not store any personal data.

Frequently Used, Contextual References

Resources

Publication

The Curious Case of How MS-excel Was a Nightmare for Bioinformatics

Author(s): Salvatore Raieli

An example of how Ms-Excel can be deleterious in data science

if you have found it interesting:

Towards AI Team

Related posts

Feedback ↓ Cancel reply

Popular posts

Updates

Recent Posts

The World’s Leading AI and Technology Publication.

Company

CONTACT US

GDPR CCPA Statement

Subscribe to our AI newsletter!

🔥 Recommended Articles 🔥