Last Updated on November 4, 2020 by Editorial Team
Author(s): Venkat Raman
I was recently invited to judge a Data Science competition. The students were given the ‘heart disease prediction’ dataset, perhaps an improvised version of the one available on Kaggle. I had seen this dataset before and often come across various self-proclaimed data science gurus teaching naïve people how to predict heart disease through machine learning.
I believe the “Predicting Heart Disease using Machine Learning” is a classic example of how not to apply machine learning to a problem, especially where a lot of domain experience is required.
Let me unpack the various problems in applying machine learning to this data set.
Dive straight into the problem syndrome
Well, this is the first mistake many people make. Jumping straight into the problem and thinking which Machine learning algorithm to apply. Doing EDA at all as part of this process is not *thinking* about the problem. Rather it is a sign that you have already accepted the notion that the problem needs a data science solution. Instead, one of the pertinent questions that need to be asked before starting any analysis is, “Is this problem even predictable through the application of machine learning?”.
Blind faith in Data
This is an extension of the first point. Diving straight into the problem means you have blind faith in the data. People assume the data to be true and do not make an effort to scrutinize the data. For example, the dataset only provided systolic blood pressure. If you spoke to any doctor or even a paramedic, they would tell you that systolic blood pressure alone does not give the full picture. Reporting of the diastolic level is important too. Many don’t even ask the question, “are the features enough to predict the outcome or more features are needed.”
Not enough data per patient: Let’s take a look at the data set above. If you notice, there is only one data point under each feature for a patient. The fundamental problem here is that features like blood pressure, cholesterol, heartbeat are not static. They range. The blood pressure of a person varies from hour to hour, and daily, so does heartbeat. So when it comes to the prediction problem, there is no telling whether 135 mm hg blood pressure was one of the factors to cause the heart disease or was it 140, all while the data set might be reporting 130 mm hg. Ideally, multiple measurements need to be had for each feature for a patient.
Now let’s come to the crux of the matter
Applying algorithm without domain experience
One reason for the high failure rate of data science application in health care is that the data scientists applying the algorithm do not have adequate medical knowledge.
Secondly, in healthcare, causality is taken very seriously. Many rigorous clinical and statistical tests are conducted to infer causality.
In the case study, any machine learning algorithm is just trying to map the input to the output while reducing some error metric. Also, the machine learning algorithm by themselves is not classifiers. We make them as classifiers by setting some cut-off or threshold. Again, this cut off is not decided to deduce causality but to get “favorable metrics.”
Aggravating this problem is the usage of low code libraries. This case study is a case point example of why low code libraries can be dangerous. Low code libraries fit a dozen or more algorithms. Most are not even aware of how some of these algorithms work! They pick the ‘best’ algorithm based on metrics like F1, Precision, Recall, and Accuracy.
The low code libraries that fixate on accuracy metrics lead to ‘Goodhart’s law’ — ‘When a measure becomes a target, it ceases to be a good measure.’
If you are predicting, you are implying causation. In healthcare, a mere prediction is not enough. One needs to prove causation. Machine learning classifier algorithms do not answer the ‘causation’ part.
Believing they have solved a real healthcare problem
Last but not least, many believe that by fitting an ML algorithm to a *healthcare* data set and getting some accuracy metrics, they have solved a real healthcare problem. Nothing can be further from the truth than this, especially when it pertains to the healthcare domain.
There are perhaps thousands of business problems that genuinely warrant data science/machine learning solutions. But at the same time, one should not fall into the trap of “To a person with a hammer, everything looks like a nail.” Seeing everything as a nail (data science problem) and machine learning algorithms (hammer) can be very counterproductive. Much of the 80% failure rate in data science applied to business problems could be attributed.
Good data scientists are like Good doctors. Good doctors suggest conservative treatments first before prescribing heavy dosage medicines or surgery. Similarly, a good data scientist should ask certain pertinent questions first before blindly applying a dozen ML algorithms to the problem.
Doctor: Surgery :: Data Scientist : Machine learning
Your comments and opinions are welcome.
You can reach out to me on
Predicting Heart Disease using Machine Learning? Don’t! was originally published in Towards AI on Medium, where people are continuing the conversation by highlighting and responding to this story.
Published via Towards AI