Name: Towards AI Legal Name: Towards AI, Inc. Description: Towards AI is the world's leading artificial intelligence (AI) and technology publication. Read by thought-leaders and decision-makers around the world. Phone Number: +1-650-246-9381 Email: [email protected]
228 Park Avenue South New York, NY 10003 United States
Website: Publisher: https://towardsai.net/#publisher Diversity Policy: https://towardsai.net/about Ethics Policy: https://towardsai.net/about Masthead: https://towardsai.net/about
Name: Towards AI Legal Name: Towards AI, Inc. Description: Towards AI is the world's leading artificial intelligence (AI) and technology publication. Founders: Roberto Iriondo, , Job Title: Co-founder and Advisor Works for: Towards AI, Inc. Follow Roberto: X, LinkedIn, GitHub, Google Scholar, Towards AI Profile, Medium, ML@CMU, FreeCodeCamp, Crunchbase, Bloomberg, Roberto Iriondo, Generative AI Lab, Generative AI Lab Denis Piffaretti, Job Title: Co-founder Works for: Towards AI, Inc. Louie Peters, Job Title: Co-founder Works for: Towards AI, Inc. Louis-François Bouchard, Job Title: Co-founder Works for: Towards AI, Inc. Cover:
Towards AI Cover
Logo:
Towards AI Logo
Areas Served: Worldwide Alternate Name: Towards AI, Inc. Alternate Name: Towards AI Co. Alternate Name: towards ai Alternate Name: towardsai Alternate Name: towards.ai Alternate Name: tai Alternate Name: toward ai Alternate Name: toward.ai Alternate Name: Towards AI, Inc. Alternate Name: towardsai.net Alternate Name: pub.towardsai.net
5 stars – based on 497 reviews

Frequently Used, Contextual References

TODO: Remember to copy unique IDs whenever it needs used. i.e., URL: 304b2e42315e

Resources

Take our 85+ lesson From Beginner to Advanced LLM Developer Certification: From choosing a project to deploying a working product this is the most comprehensive and practical LLM course out there!

Publication

The Curious Case of How MS-excel Was a Nightmare for Bioinformatics
Latest

The Curious Case of How MS-excel Was a Nightmare for Bioinformatics

Last Updated on August 3, 2022 by Editorial Team

Author(s): Salvatore Raieli

Originally published on Towards AI the World’s Leading AI and Technology News and Media Company. If you are building an AI-related product or service, we invite you to consider becoming an AI sponsor. At Towards AI, we help scale AI and technology startups. Let us help you unleash your technology to the masses.

An example of how Ms-Excel can be deleterious in dataΒ science

algorithm issues with gene analysis

How do we have no power in front of Excel’s habit of arbitrarily changing the format to what it doesΒ like?

The unforgettable sin of Ms-Excel of deciding byΒ itself

There is no shortage of science fiction films where machines decide to ignore human orders and rebel (2001: A Space Odyssey, to name the most famous). Anyone who has worked with Ms-Excel might think that this dystopian future is not such an impossible eventuality.

Dave: Open the pod bay doors,Β HAL.

HAL: I’m sorry, Dave. I’m afraid I can’t doΒ that.

2001: A spaceΒ Odissey

I don’t know why, but Ms-Excel is fascinated by dates, and this passion results in randomly turning numbers and acronyms into dates. In 2016, a scientific article conducted an analysis on the phenomenon and realized that around 20% of the articles in the literature contain errors related to Ms-Excel. For example, SEPT2 (Septin 2) and MARCH1 (Membrane-Associated Ring Finger (C3HC4) 1, E3 Ubiquitin Protein Ligase) are converted by default to β€˜2-Sep’ and β€˜1-Mar’. Sometimes, MS-Excel does even better: it transforms SEPT2 in β€˜2006/09/02’. There is no way you can turn off this auto-format feature and correcting it is a painstaking process.

In the same article, they analyzed more than 35,000 articles from 2005 to 2015, showing that every journal was impacted. Ironically, the most impacted journal in the analysis is Nature, which is considered the bible of science journals.

gene name errors due to Ms-excel
From the original article β€œPrevalence of gene name errors in supplementary Excel files. a Percentage of published papers with supplementary gene lists in Excel files affected by gene name errors. b Increase in gene name errors byΒ year”

The omics revolution

The right panel also show how the problem increased dramatically year after year. In fact, the Omics revolution started in the 2000s when next-generation sequences started to be much cheaper and widespread. Nowadays, few articles have not performed some omics analysis (like RNAseq, ChipSeq, and so on). Why? Because these new techniques are able to provide a snapshot of a tumor or tissue. To date, they have achieved single-cell resolution (e.g., single-cell RNAseq).

Since you can obtain a huge amount of information, these analyses are becoming ubiquitous. From cancer research to clinical trials, scientists are relying more on the output of some sequencing. These large sequencings generate terabytes of data each year, and to analyze them, it is necessary to use Bash scripts, R, andΒ python.

Considering the need to use sophisticated scripts to analyze all this data, one would think that the situation has improved and that we no longer have to worry about Ms-Excel. Instead, an article from 2021 shows that we have not learned our lesson and MS-Excel is more alive thanΒ ever.

gene name errors due to Ms-excel
adapted from the original article: β€œ Prevalence of gene name errors in the period 2014–2020. (A) Publications with supplementary Excel gene lists. (B) Publications affected by gene name errors. Β© Proportion of affected publications.”

As can be seen, the number of errors did not decrease (it remained constant if not increased).

Why spreadsheet are soΒ spread?

It is difficult to associate Ms-Excel with biology or other scientific disciplines, yet it is everywhere. Not just because it is generally installed by default on all computers in laboratories and hospitals. Its success lies in the fact that, in spite of all its bugs and other farfetched annoyances, it allows a lot of analysis to be conducted quickly. Students and professors can open a plethora of files and do basic analysis and visualizations within minutes. In addition, it is also one of the accepted formats when submitting a publication to a scientific journal.

Certainly, R and Python are much better performers, allowing for a much more sophisticated and complex analysis of a huge amount of data and graphs. On the other hand, you have to know the syntax and install libraries, and you waste a lot of time in a series of errors. So when you are tired, that green icon on the desktop is… so so inviting.

gene name errors due to Ms-excel
Image by the author (adapted from a frame of the movie 2001 Odissey inΒ space)

Can we solveΒ it?

In 2017, the HUGO Gene Nomenclature Committee (HGNC) decided to rename 27 genes to avoid confusion. Since these committees are generally really difficult to organize and normally conservative in their choices, imagine how much annoyed they were to decide to rename genes just to make them Excel-friendly. However, as we have seen, this was notΒ enough.

Researchers proposed different solutions: in fact, the 2016 article already proposed a script to solve the issue. In 2022 (which shows that this problem still matters) other researchers proposed a web tool to autocorrect Excel misidentified gene names (which you can findΒ here).

gene name errors due to Ms-excel
from the article: β€œSchematic of Gene Updater. If old gene names are provided, these genes will be automatically converted to the updated approved geneΒ names.”

It is still too early to say if these efforts will solve the issue or if Ms-Excel will still haunt the sleep of the researchers.

Conclusions and take-away

β€œIt’s Excel’s world, we just live in it.”
β€Šβ€”β€Šjgalt212 on
HackerNews

HGNC has created guidelines on how to decide on a name for a gene (which generally are directed against human stupidity and avoid dumb or offensive names being chosen). However, they forgot the computer stupidity and how Ms-Excel is widespread. True, scientists could abandon Ms-Excel, but it is installed everywhere, and sometimes you need to beΒ quick.

The beauty of this story is the fact that it only takes an insignificant detail to have large-scale nefarious effects. Potentially simple Ms-Excel errors (hopefully unintentional, or it is already part of the machine's plan to take over the world?) can be dragged on for years. An article in Nature can have dozens if not hundreds of citations and influence researchers around the world. After all, genetic testing is becoming increasingly common, and cancer therapies are increasingly decided on mutational profiling. So extreme caution in analyzing articles or lists of genes would be necessary.

This story also teaches us that data collection and selection are critical, but so is the choice of algorithms and software. Moreover, Ms-Excel is showing strange behaviors with different types of data (for example, numbers or strings) and should be avoided in data science (to avoid deleterious effects). In addition, one should always think about the end user because they could be used in an unimagined way with catastrophic results.

if you have found it interesting:

You can look for my other articles, you can also subscribe to get notified when I publish articles, and you can also connect or reach me on LinkedIn. Thanks for yourΒ support!

Here is the link to my Github repository, where I am planning to collect code and many resources related to machine learning, artificial intelligence, andΒ more.

GitHub – SalvatoreRa/tutorial: Tutorials on machine learning, artificial intelligence, data science with math explanation and reusable code (in python and R)


The Curious Case of How MS-excel Was a Nightmare for Bioinformatics was originally published in Towards AI on Medium, where people are continuing the conversation by highlighting and responding to this story.

Join thousands of data leaders on the AI newsletter. It’s free, we don’t spam, and we never share your email address. Keep up to date with the latest work in AI. From research to projects and ideas. If you are building an AI startup, an AI-related product, or a service, we invite you to consider becoming aΒ sponsor.

Published via Towards AI

Feedback ↓