Life Cycle for Machine Learning Problem — Beginner Writes
Last Updated on May 30, 2022 by Editorial Team
Author(s): Siddharth patel
Originally published on Towards AI.
I am a beginner in ML (well, that’s true). I am writing everything as I am learning. If I can explain it, that will be great! I have been learning a lot about the ML life cycle, and suddenly a random thought crossed my mind: “Hey, how about writing what I think the ML life cycle is and letting other people read and react?” So, here we go!
Many programmers and engineers across the globe write code each day to develop software. They follow certain approaches to do so, which include creating documentation and preparing ER, relational, use case, and sequence diagrams. In the same way, we have a life cycle for developing a complete Machine Learning solution for any given problem.
The Machine Learning life cycle is, in simple words, a method and a set of steps for approaching any ML problem. It helps you get good results in less time: by applying it, you can get to a deployed model with robust results more quickly. It is a repeatable process that you can apply each time you start work on a new problem.
Steps of Machine Learning Life Cycle
- Define Problem Statement
- Data Gathering
- Data Preparation
- Data Analysis
- Model Selection and Training
- Model Testing
- Deployment
These are the main steps for approaching any Machine Learning problem. Although they help you solve an ML problem faster, it is not always necessary to follow every step. You can make changes to suit your problem.
These main steps can also be broken down into smaller ones, and you are encouraged to break them down as far as you need to keep the work easy. For example, while working on the 4th step, Data Analysis, for a specific dataset, you may want to create different types of charts and graphs, make some assumptions based on them, prepare some tables, and write down your findings. Those things help you find a suitable algorithm for your data and get better results.
Also, when you start working, you will discover that many of these steps are interrelated and iterative. By iterative, I mean that you may need to perform a single step a number of times before you get the results you want.
Let’s dive into each step now.
Define Problem Statement
This is the first step for any ML problem. You first need to understand the problem statement and make some assumptions based on it. Then you need to discuss the requirements for solving the problem. There is no point in solving a problem if nobody benefits from the solution, so there must be a clear motivation for it. The last part of defining the problem statement is to discuss the expected outcomes and the output you want at the end.
These steps connect you with the problem statement so that you can make better decisions. Remember, you are not just an ML engineer; you are a person who solves problems for people and helps them benefit from the solution.
Data Gathering
This step may look confusing, so before you start saying “Oh man, this guy must be kidding”, let me clarify. In most problems, you get the data from the person or organization you work for, so you don’t need to find it yourself. In a few problems, though, you have to gather the data, so I will explain this step.
Suppose you are working on an ML problem where you need to get data from the internet; you may need to perform web scraping to collect it. If you are working on a problem given by some “conventional industry”, you may need to get data from sensors or by other methods (don’t worry, such cases are rare). You can then use this data in the following steps.
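To make the web scraping case a bit more concrete, here is a minimal sketch that pulls table cells out of an HTML page using only Python’s standard library. The HTML snippet, the `CellCollector` class, and the city and temperature values are all made up for illustration; for a real site you would first download the page and check that the site allows scraping.

```python
from html.parser import HTMLParser


class CellCollector(HTMLParser):
    """Collects the text of every <td> cell, grouped by table row."""

    def __init__(self):
        super().__init__()
        self.rows = []          # finished rows
        self._row = None        # row currently being read
        self._in_cell = False   # True while inside a <td>

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag == "td" and self._row is not None:
            self._in_cell = True
            self._row.append("")

    def handle_endtag(self, tag):
        if tag == "td":
            self._in_cell = False
        elif tag == "tr" and self._row is not None:
            if self._row:               # skip rows without <td> cells (headers)
                self.rows.append(self._row)
            self._row = None
            self._in_cell = False

    def handle_data(self, data):
        if self._in_cell:
            self._row[-1] += data.strip()


# Hypothetical page content standing in for a downloaded web page.
SAMPLE_HTML = """
<table>
  <tr><th>city</th><th>temp_c</th></tr>
  <tr><td>Pune</td><td>31.5</td></tr>
  <tr><td>Delhi</td><td>38.0</td></tr>
</table>
"""

parser = CellCollector()
parser.feed(SAMPLE_HTML)
print(parser.rows)   # [['Pune', '31.5'], ['Delhi', '38.0']]
```

Note that everything comes out as text, so converting values like "31.5" to numbers is left to the data preparation step.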
Data Preparation
This step is very important, and you need to perform it on almost all real-life datasets. There is a very low chance of getting data that is clean and needs no preparation (Iris, a very famous dataset, is clean and balanced, but you use it just to practice basic problems).
For data preparation, you first need to think about which data you want to use for the specific problem. You should also think about what data you want to add to the dataset.
After that comes a very important step, “Data Preprocessing”. In this step, you clean the data. What I mean is that in any real-life data there are missing values, the data needs formatting, and units may differ between some instances of the dataset (we will discuss instances later). You need to either remove missing values or fill them in, and there are standard methods for both. You may also need to normalize or standardize your data.
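The two options for missing values (remove or fill) can be sketched roughly as follows. This is only an illustration: I am assuming that missing values are stored as `None`, the function names are my own, and filling with the column mean is just one common choice among several.

```python
from statistics import mean


def drop_missing(rows):
    """Remove every row that contains at least one missing value (None)."""
    return [row for row in rows if None not in row]


def fill_missing_with_mean(rows):
    """Replace each None with the mean of the known values in its column."""
    columns = list(zip(*rows))
    col_means = [mean(v for v in col if v is not None) for col in columns]
    return [
        [col_means[j] if value is None else value for j, value in enumerate(row)]
        for row in rows
    ]


# Made-up data: [height_cm, weight_kg], with two missing entries.
data = [
    [170.0, 65.0],
    [None, 72.0],
    [180.0, None],
    [160.0, 58.0],
]

print(drop_missing(data))            # keeps only the two complete rows
print(fill_missing_with_mean(data))  # None -> 170.0 (height), 65.0 (weight)
```

Dropping rows is simpler but throws data away; filling keeps every row but adds values that were never measured. Note that `fill_missing_with_mean` raises an error if a column has no known values at all.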
After that, you perform operations like scaling and data decomposition to make the data ready for the ML algorithm.
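As a small sketch of what scaling can look like, here are two common variants: min-max normalization (rescale to the range 0 to 1) and standardization (shift to mean 0 and scale to standard deviation 1). The function names and the `ages` values are made up for this example, and I use the population standard deviation, which is an assumption on my part.

```python
from statistics import mean, pstdev


def min_max_normalize(values):
    """Rescale values linearly so the smallest becomes 0 and the largest 1."""
    lo, hi = min(values), max(values)
    if hi == lo:                       # constant column: avoid division by zero
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def standardize(values):
    """Shift and scale values to mean 0 and population standard deviation 1."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:                     # constant column: nothing to scale
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


ages = [20.0, 30.0, 40.0]   # made-up attribute values
print(min_max_normalize(ages))   # [0.0, 0.5, 1.0]
print(standardize(ages))
```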
Remember, the amount of data is important, but the quality of that data is more important, so you may want to focus a lot on this step. You cannot feed just any data to your ML model and expect good results. Data matters, a lot!
Data Analysis
This is again a very important step. Once you have preprocessed the data into the required format, you may want to visualize it and derive some assumptions from it. For this step, you need a good command of plotting and of reading charts and graphs (you may do it with Python, Excel, or any tool of your choice). It helps you select good algorithms for the specific dataset and problem statement.
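If you do not have a plotting tool at hand, even a quick text histogram gives a first impression of how an attribute is distributed. The sketch below is one simple way to do it with the standard library; the `text_histogram` helper and the `petal_lengths` values are made up for illustration.

```python
from collections import Counter
from statistics import mean, median


def text_histogram(values, bin_width):
    """Return a list of 'range | bar' lines, one per non-empty bin."""
    bins = Counter(int(v // bin_width) for v in values)
    lines = []
    for b in sorted(bins):
        start = b * bin_width
        lines.append(f"{start:>5}-{start + bin_width:<5} | {'#' * bins[b]}")
    return lines


petal_lengths = [1.4, 1.3, 1.5, 4.7, 4.5, 4.9, 5.1, 6.0, 5.9]  # made-up values
print("mean:", round(mean(petal_lengths), 2), "median:", median(petal_lengths))
for line in text_histogram(petal_lengths, bin_width=1):
    print(line)
```

Here the bars show two separate groups of values (around 1 and around 4 to 6), which is the kind of pattern you would look for in a real chart.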
Basically, you visualize the data here, which gives you a better understanding of the patterns in specific attributes. It also helps you discard attributes that are not useful for your problem statement. For example, if the value of a specific attribute is approximately the same across the whole dataset, it is probably not very useful for prediction.
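The “approximately the same value everywhere” check can also be done in code by looking at the variance of each attribute. This is a minimal sketch: the column names, the rows, and the threshold of 0.01 are all made up, and a sensible threshold depends on the scale of your data.

```python
from statistics import pvariance


def low_variance_columns(rows, names, threshold):
    """Return names of columns whose population variance is below threshold."""
    columns = list(zip(*rows))
    return [
        name for name, col in zip(names, columns)
        if pvariance(col) < threshold
    ]


# Made-up dataset: 'country_code' is the same for every row.
names = ["age", "income_k", "country_code"]
rows = [
    [25, 40.0, 91],
    [32, 55.0, 91],
    [47, 80.0, 91],
    [51, 62.0, 91],
]

print(low_variance_columns(rows, names, threshold=0.01))   # ['country_code']
```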
Model Selection and Training
Once you have all the data and have visualized it to make some assumptions, it is finally time to select a model and train it on the dataset. We use a few standard methods for this purpose.
We generally use k-fold cross-validation to compare candidate algorithms. The dataset is split into k parts (folds); each algorithm is trained k times, each time on k-1 folds, and tested on the fold that was left out. This gives k scores per algorithm, from which we get a mean and a standard deviation (if you are not familiar with the concepts of mean and standard deviation, they are very basic concepts of statistics and we will discuss all that in statistics for ML).
To do this, we take a few ML algorithms that are frequently used (say, standard ML algorithms) and run them on the dataset under k-fold cross-validation. These are a few major algorithms from different families (you may want to read this blog of Jason Brownlee, Ph.D.).
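Here is a rough, standard-library-only sketch of k-fold cross-validation used to compare two algorithms. Everything in it is an assumption made for illustration: the toy data, the two very simple “algorithms” (a majority-class baseline and a 1-nearest-neighbour rule on a single feature), and the helper names. In practice you would plug in real models from an ML library.

```python
import random
from collections import Counter
from statistics import mean, pstdev


def k_fold_splits(n_samples, k, seed=0):
    """Yield (train_indices, test_indices) pairs for k-fold cross-validation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]   # k nearly equal folds
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


def majority_class(train_X, train_y, x):
    """Baseline: always predict the most common training label."""
    return Counter(train_y).most_common(1)[0][0]


def nearest_neighbour(train_X, train_y, x):
    """1-NN on a single numeric feature: copy the label of the closest point."""
    closest = min(range(len(train_X)), key=lambda i: abs(train_X[i] - x))
    return train_y[closest]


def cross_validate(predict, X, y, k=5):
    """Return the accuracy obtained on each of the k held-out folds."""
    scores = []
    for train, test in k_fold_splits(len(X), k):
        train_X = [X[i] for i in train]
        train_y = [y[i] for i in train]
        correct = sum(predict(train_X, train_y, X[i]) == y[i] for i in test)
        scores.append(correct / len(test))
    return scores


# Made-up, cleanly separable toy data: small values are "a", large are "b".
X = [1.0, 1.2, 0.8, 1.1, 0.9, 5.0, 5.2, 4.8, 5.1, 4.9]
y = ["a"] * 5 + ["b"] * 5

for name, algo in [("majority", majority_class), ("1-NN", nearest_neighbour)]:
    scores = cross_validate(algo, X, y, k=5)
    print(f"{name}: mean={mean(scores):.2f} std={pstdev(scores):.2f}")
```

The algorithm with the higher mean score (and a reasonably small standard deviation) is the one you would carry forward to the next round.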
Here, you first decide on your approach, i.e. whether you want to go with a supervised, unsupervised, or semi-supervised approach. Once you decide, you need to choose which specific algorithm to use (sometimes, we also use hybrid approaches). We discard other methods, narrow down our list of algorithms, and keep repeating this step until we settle on the “perfect” algorithm (by perfect, I mean one we have a clear view of, not necessarily the best one).
After deciding on the algorithm, we train our model on our dataset. Before training, we divide the dataset into two parts, one for training and the other for testing. You may go with 70:30, 80:20, or whatever proportion suits your problem statement.
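A train/test split along these lines can be sketched in a few lines of Python. The function name, the default 80:20 ratio, and the fixed random seed are my own choices for this example; shuffling first is an assumption that makes sense when the rows have no meaningful order.

```python
import random


def train_test_split(rows, test_ratio=0.2, seed=42):
    """Shuffle a copy of rows and split it into (train, test) lists."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_ratio)
    return shuffled[n_test:], shuffled[:n_test]


samples = list(range(10))            # stand-in for 10 dataset rows
train, test = train_test_split(samples, test_ratio=0.2)
print(len(train), len(test))         # 8 2
```

Fixing the seed makes the split reproducible, so that two training runs are compared on the same test rows.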
Once you have trained your ML model on the training part, it’s time to test its accuracy.
Model Testing
We check our algorithm’s results against actual results that we already know. This way we can compare predicted results with actual results and get the accuracy of the model. There are also specific methods for testing a model. You need to prepare a good testing method, because a realistic accuracy estimate is very important. Here you compute performance measures, use cross-validation, and so on.
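Comparing predicted results with actual results is easy to sketch in code. The labels below are made up, and the two helpers are just one simple way to compute accuracy and to count which kinds of mistakes were made.

```python
from collections import Counter


def accuracy(actual, predicted):
    """Fraction of predictions that match the known labels."""
    matches = sum(a == p for a, p in zip(actual, predicted))
    return matches / len(actual)


def confusion_counts(actual, predicted):
    """Count how often each (actual, predicted) label pair occurs."""
    return Counter(zip(actual, predicted))


# Made-up labels for six test samples.
actual    = ["spam", "spam", "ham", "ham", "ham", "spam"]
predicted = ["spam", "ham",  "ham", "ham", "spam", "spam"]

print(accuracy(actual, predicted))           # 4 of 6 correct
print(confusion_counts(actual, predicted))
```

The pair counts show not only how many predictions were wrong but also in which direction, which a single accuracy number hides.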
You also try to find out which changes to the algorithm can increase accuracy by training and testing again after making each change. Basically, training and testing are also an iterative process that you keep repeating until you get results close to what you want. You use algorithm tuning or ensemble methods for this purpose.
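Algorithm tuning, in its simplest form, means trying several values of a setting and keeping the one that scores best. The sketch below tunes a single threshold of a toy model; the model, the candidate values, and the validation data are all made up for illustration. A common precaution is to tune on a separate validation set rather than on the final test data, so that the reported accuracy stays honest.

```python
def predict_with_threshold(x, threshold):
    """Toy model: label is 1 when the feature exceeds the threshold."""
    return 1 if x > threshold else 0


def accuracy_for(threshold, X, y):
    """Accuracy of the toy model on (X, y) for a given threshold."""
    correct = sum(
        predict_with_threshold(x, threshold) == label for x, label in zip(X, y)
    )
    return correct / len(X)


def tune_threshold(candidates, X, y):
    """Try every candidate value and return (best_threshold, best_accuracy)."""
    return max(
        ((t, accuracy_for(t, X, y)) for t in candidates),
        key=lambda pair: pair[1],
    )


# Made-up validation data: values above roughly 3 belong to class 1.
X_val = [0.5, 1.0, 2.0, 2.8, 3.2, 4.0, 5.0, 6.0]
y_val = [0, 0, 0, 0, 1, 1, 1, 1]

best_t, best_acc = tune_threshold([1.0, 2.0, 3.0, 4.0, 5.0], X_val, y_val)
print(best_t, best_acc)   # 3.0 1.0
```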
Deployment
Deployment of the model is not necessarily done by an ML engineer. It depends on your work profile, but it is good to learn at least the basics. You sometimes need to explain your findings to the client, because the client may not know much about ML and may want the outcomes in simple, easy-to-understand terms. You may need to create dashboards or deploy your model on a cloud or web app if you are working as an independent ML engineer (companies mostly have people for that, but freelance ML engineers may need to do it themselves).
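To give a feel for what sits at the core of a web app deployment, here is a tiny sketch of a request handler: it takes a JSON request, runs a prediction, and returns a JSON response. This is not a full server. The `predict` function is a placeholder standing in for a trained model, and the request format with a "features" key is something I made up for this example.

```python
import json


def predict(features):
    """Placeholder model: replace with your trained model's prediction."""
    return "high" if sum(features) > 10 else "low"


def handle_request(body):
    """Turn a JSON request body into a JSON response with the prediction."""
    try:
        features = json.loads(body)["features"]
        return json.dumps({"prediction": predict(features)})
    except (ValueError, KeyError, TypeError):
        # Bad JSON, a missing "features" key, or non-numeric values.
        return json.dumps({"error": 'expected JSON like {"features": [1, 2, 3]}'})


print(handle_request('{"features": [4, 5, 6]}'))   # {"prediction": "high"}
print(handle_request("not json"))
```

A web framework or cloud service would call something like `handle_request` for every incoming request; the error branch matters because real clients do send malformed input.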
Summary
Here, I have discussed the complete life cycle of an ML problem and its solution. We use this approach to decrease uncertainty in the process, make things clear, make the process fast, and make the results robust.
Following this method can help you organize your whole process, so that you can modify any step at any time and make things better.
That’s all for today. See you all in the next post!
This post was originally published in Towards AI on Medium.