Data Science Guidebook 2021
Last Updated on September 12, 2021 by Editorial Team
Author(s): Tim Cvetko
Things to keep an eye out for during your DS problem
The world's sexiest job of the 21st century may be appealing, but it also comes with a lot of challenges. No worries, this is your lucky day!
I present to you a modern end-to-end data science guidebook. Hopefully, after reading this article you'll be able to spark that little lightbulb at the back of your head every time you face a challenge in the AI world.
Understanding the Problem
As researchers, programmers, and "smart creatives", we get carried away by the outcome and the hype of artificial intelligence and its capabilities. We want to get results and we want them fast.
Sit down, breathe, take a piece of paper and a pen and start sketching.
Brainstorm, if necessary. Make a plan. I want you to embrace the mentality of thinking three steps ahead. Before the attractive code [ML] begins, I want you to think about:
1 -> what tools you will use for your data pipelines,
2 -> whether there is going to be an overhead in data (lightbulb should go off: Python generators, Spark pipelines; see the sketch after this list),
3 -> what data format you will use, and how it is compatible with the one you need for training,
4 -> whether your problem really requires an ML solution.
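To make point 2 concrete, here is a minimal sketch of the Python-generator idea: streaming a large file batch by batch instead of loading it all at once. The file path and the downstream processing step are hypothetical.

```python
import csv

def stream_rows(path, batch_size=1024):
    """Yield batches of rows lazily, so the full file never sits in memory."""
    with open(path, newline="") as f:
        reader = csv.DictReader(f)
        batch = []
        for row in reader:
            batch.append(row)
            if len(batch) == batch_size:
                yield batch
                batch = []
        if batch:  # flush the final, possibly smaller, batch
            yield batch

# Usage: iterate batch by batch instead of reading everything up front
for batch in stream_rows("data.csv"):  # hypothetical file
    pass  # replace with your processing step
```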
Always think of an easier [non-ML] solution.
I guarantee you: as you search for one, you'll find insights that you wouldn't have found otherwise. You'll see your problem from a different perspective. Even if your problem does require an ML solution, you'll benefit tremendously.
Why is getting to know the problem the mainline of the project?
- You can't understand the data if you don't understand the problem
- You should have a narrative throughout the process, and you should aim to get as close to the solution as possible.
- Your actions are otherwise aimless. Remember, the algorithms revolve around problems, not the other way around. You shouldn't look at an algorithm and imagine or even force it to solve a task. For example: solving the shortest path problem should spark that lightbulb: "Dijkstra" (a minimal sketch follows below).
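Since the shortest-path lightbulb came up, here is a minimal Dijkstra sketch using Python's heapq, on a toy graph of my own invention:

```python
import heapq

def dijkstra(graph, source):
    """Shortest distances from source; graph maps node -> [(neighbor, weight)]."""
    dist = {source: 0}
    heap = [(0, source)]
    while heap:
        d, node = heapq.heappop(heap)
        if d > dist.get(node, float("inf")):
            continue  # stale heap entry; a shorter path was already found
        for neighbor, weight in graph.get(node, []):
            nd = d + weight
            if nd < dist.get(neighbor, float("inf")):
                dist[neighbor] = nd
                heapq.heappush(heap, (nd, neighbor))
    return dist

graph = {"A": [("B", 1), ("C", 4)], "B": [("C", 2)], "C": []}
print(dijkstra(graph, "A"))  # {'A': 0, 'B': 1, 'C': 3}
```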
The Next Step is choosing an Approach
Once you've been in the ML world long enough, you'll be able to envision a solution before ever receiving the data. Even though that sounds great, it isn't.
Here's why: the reason you're so good at what you do is that you're able to think beyond the scope of your knowledge. As a modern data scientist, your job is to come up with clever solutions to modern-day problems. If that weren't the case, we might as well turn our focus to AutoML.
Remember the motto from before: Problem First, Algorithm and Solution later? Add this: "More often than not, it's easier to delegate your tasks."
Break your problem apart. See the big picture.
At this point, I want you to dig deep into your expertise in algorithms and mathematical techniques. Think like an end-to-end developer from the beginning. Anticipate the unpredictable, assuming you understand the type of data you're dealing with and the type of problem you're trying to solve. It's rarely a one-algorithm kind of job.
While researching, think about where each algorithm might go wrong. If you're dealing with neural networks, I want you to think about gradients, choosing the right optimizer, what effect the regularization might have, etc. You have to be that under-the-hood guy, not TensorFlow.
Hands-on learning is the best kind of learning
For example:
- Stochastic Gradient Descent has a noisier convergence than Adam.
- One advantage of using sparse categorical cross-entropy is that it saves memory as well as computation time, because it uses a single integer for a class rather than a whole one-hot vector. These are the kinds of lightbulbs I was talking about (a quick sketch follows below).
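A minimal sketch of that second lightbulb, assuming TensorFlow/Keras and a made-up 10-class problem: the sparse loss takes integer labels directly, so no one-hot vectors ever need to be stored.

```python
import numpy as np
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation="relu", input_shape=(20,)),
    tf.keras.layers.Dense(10, activation="softmax"),
])

model.compile(optimizer="adam",  # typically less noisy convergence than plain SGD
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

y_sparse = np.array([3, 7, 1])  # one integer per sample
# The plain categorical loss would instead need one-hot targets:
# y_onehot = tf.keras.utils.to_categorical(y_sparse, num_classes=10)
```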
Note: once the data is manageable, the magic begins. At that stage, you can of course try many different approaches, do hyperparameter tuning, and so forth.
Understanding Your Data -> EDA
EDA stands for exploratory data analysis, and it is probably the most vital part of your DS project's journey. No real-world project starts from clean data. You're going to spend a lot of time on data fetching and manipulation.
"Everyone wants to do the model work, not the data work": Data Cascades in High-Stakes AI (Google Research)
The project revolves around your aim. Seek hidden meanings in the data and the correlations between features. Don't underestimate the power of data visualization. As programmers, we get quite arrogant about data visualization because it suggests we don't understand our data. Don't let that ego get the better of you.
Go deep. Find every insight possible.
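As a starting point, here is a minimal EDA sketch with pandas and matplotlib; the data.csv path is hypothetical, and the numeric_only flag assumes pandas 1.5 or newer.

```python
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("data.csv")  # hypothetical dataset

print(df.describe())               # summary statistics per column
print(df.isna().sum())             # missing values per column
print(df.corr(numeric_only=True))  # pairwise correlations

# A quick visual pass often reveals what summary tables hide
df.hist(figsize=(10, 8))
plt.tight_layout()
plt.show()
```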
Feature engineering is the process of using domain knowledge to extract features from raw data via data mining techniques. These features can be used to improve the performance of machine learning algorithms.
At this point, I want to show you a bag of tricks useful for selecting features (a consolidated sketch follows the list):
- feature cross -> the most famous example of a feature cross is the longitude-latitude problem. These two values might be nothing more than lonely floating-point values in the dataset. If you choose to pursue the importance of location, the end goal is to determine which regions have the highest correspondence to the output. Crossing combinations of features can provide predictive abilities beyond what those features can provide individually.
Feature Crosses | Machine Learning Crash Course | Google Developers
Another possibility for the location is to define clusters of data points and pass their labels to the model.
- dealing with missing values (NaN) -> this is always an obnoxious problem, right? First step: find insights for the columns with missing data and cross them out of the equation. Second: perhaps a column shares a high correlation with another column, and then you can use padding. Find more here:
Padding and Working with Null or Missing Values
Let me give you some advice they don't teach at school: if the column matches none of the previous requirements and you find that it has a low correlation to the output, drop it.
- data normalization -> simply put, this is the process of scaling your data while preserving what it represents. Why would you do that? Firstly, it helps with outliers. Secondly, while it doesn't fix the exploding-gradients problem outright, it mitigates it. Thirdly, it gives the features a more uniform distribution than before.
- dealing with outliers -> outliers are improbable elements of the dataset that fall way outside the desired limits.
How to Remove Outliers for Machine Learning – Machine Learning Mastery
Good news: you get to pick what counts as an outlier in your data and what you're going to do about it.
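Here is the consolidated sketch promised above: one pandas pass over a hypothetical dataset (all column names are made up) that touches each trick in turn.

```python
import pandas as pd

df = pd.read_csv("data.csv")  # hypothetical dataset

# Feature cross: bucket latitude/longitude, then combine into one region feature
df["lat_bin"] = pd.cut(df["latitude"], bins=10, labels=False)
df["lon_bin"] = pd.cut(df["longitude"], bins=10, labels=False)
df["region"] = df["lat_bin"].astype(str) + "_" + df["lon_bin"].astype(str)

# Missing values: fill from the column's own statistics where that makes sense
df["income"] = df["income"].fillna(df["income"].median())
# ...and drop a column that is mostly empty and barely correlates with the output
df = df.drop(columns=["mostly_empty_col"])

# Normalization: zero mean, unit variance
numeric = ["latitude", "longitude", "income"]
df[numeric] = (df[numeric] - df[numeric].mean()) / df[numeric].std()

# Outliers: you pick the limit; here anything past 3 standard deviations is clipped
df[numeric] = df[numeric].clip(lower=-3.0, upper=3.0)
```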
Model Interpretation
You've reached the final steps of your project. Yet, this is perhaps the one that cements you as a data scientist. "He who tells the story", right?
Having come this far, you no doubt feel you'll push your model to the limit of its possibilities. We're all thinking about it by now, so I'll just say it: "Hyperparameter tuning".
I've gone into much detail about it in the following article, so go ahead.
Keynote: "Don't reinvent the wheel. There's no need. The goal is to find the best hyperparameters for your model."
In addition, I would like to mention something also referred to as the Pandas vs. Caviar method: it's a matter of either babysitting one model and waiting for its results (panda) or spawning multiple models and choosing the best one (caviar).
Caviar > Pandas. Always.
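A minimal sketch of the caviar approach with scikit-learn on synthetic data: spawn a model per hyperparameter combination, score them all on a validation split, and keep the winner. The grid values here are arbitrary.

```python
import itertools
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, random_state=0)

best_score, best_model = -1.0, None
for alpha, lr in itertools.product([1e-4, 1e-3, 1e-2], [1e-3, 1e-2]):
    model = SGDClassifier(alpha=alpha, eta0=lr, learning_rate="constant",
                          random_state=0).fit(X_tr, y_tr)
    score = model.score(X_val, y_val)
    if score > best_score:  # keep the best of the school, no babysitting
        best_score, best_model = score, model

print(f"best validation accuracy: {best_score:.3f}")
```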
Every project has its own interpretation of metrics. To be successful, you must think beyond the scope of general compliance. Ask yourself this:
What criteria would make this project/model successful?
At what point could you say you're satisfied with your model's performance?
You know your model best, so again, these are just a few tricks:
- plot accuracy/loss percentiles (these indicate how faulty your model is on the test set, i.e., how far off your predictions are)
- track loss and accuracy (use an early-stopping callback, regularization, or TensorBoard for logging; see the sketch after this list)
- find hidden connections between the mispredicted data points, so you can fix the cause
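For the second trick, a minimal Keras callback sketch, assuming a compiled model and a train/validation split already exist (the fit call is left commented for that reason):

```python
import tensorflow as tf

callbacks = [
    # Stop when validation loss plateaus and roll back to the best weights
    tf.keras.callbacks.EarlyStopping(monitor="val_loss", patience=5,
                                     restore_best_weights=True),
    # Log curves for inspection with: tensorboard --logdir logs
    tf.keras.callbacks.TensorBoard(log_dir="logs"),
]

# model.fit(X_train, y_train, validation_data=(X_val, y_val),
#           epochs=100, callbacks=callbacks)
```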
Finally, think of the customer/business interactions. What are the applications of your model?
Put yourself in the shoes of the person you're trying to sell your AI to and think of the struggles they might have. Such preparation will be noticed, trust me.
Conclusion
Here you go. This was my all-out, step-by-step guidebook to data science. I encourage you to revisit this list before you start a new project and think ahead. If I could sum up my greatest piece of advice for you, it would be: get to know what you're dealing with. Data really is the centerpiece of your project; the model can only be as good as your data. Be prepared, and the results will not disappoint.
Embrace the mindset of a modern data scientist. And remember, when you find yourself in a seemingly hopeless position: we've all been there. My motto is, if it doesn't take at least 100 bugs during the coding, what's the point?
Connect and read more
Feel free to contact me via any social media. I'd be happy to get your feedback, acknowledgment, or critique.
LinkedIn, Medium, GitHub, Gmail
I have recently started writing my own newsletter, and I would really appreciate it if you checked it out.
Link: https://winning-pioneer-3527.ck.page/4ffcbd7ad7.
- Deploying a Neural Style Transfer App on Streamlit
- How GANs Are Capable of Creating Such High-Resolution Images
- Kubernetes vs. Docker Explained