The AI Process

Author(s): Jeff Holmes


Guide to solving an AI problem


As an AI engineer with an MS in Mathematics and an MSCS in Artificial Intelligence, I find it troubling that the majority of software engineers, researchers, and data scientists using AI/ML are self-taught, which is probably why 80% or more of AI projects fail [1][2]. In fact, a recent article in IEEE Computer reported that almost 34% of scientists have admitted to questionable research practices at some point in their careers [3]. Personally, I have found that the results of almost all journal articles on AI/ML (even peer-reviewed ones) are irreproducible.

Outline

  • What is AI?
  • The AI Process
  • Define the Problem
  • Data Preparation
  • How to Choose an AI Model
  • Model Selection Criteria
  • Experimental Design
  • Why Simpler Models
  • Multinomial Logistic Regression
  • Understand AI Algorithms

What is AI?

Artificial intelligence (AI) focuses on the design and implementation of intelligent systems that perceive, act, and learn in response to their environment.

According to Russell and Norvig [6], an agent is just something that acts (from the Latin agere, meaning “to do”). All computer programs do something, but computer agents are expected to do more complex tasks: operate autonomously, perceive their environment, persist over a prolonged period of time, adapt to change, and create and pursue goals. A rational agent is one that acts so as to achieve the best outcome or, when there is uncertainty, the best expected outcome.

In a nutshell, AI is focused on the study and construction of agents that act rationally, that is, do the right thing as defined by the objective given to the agent; this is called the standard model [6]. There are limitations to this model, such as the issue of limited rationality and the value alignment problem, which lead to the concept of agents that are provably beneficial to humans, but the standard model remains a good reference point for theoretical analysis [6].

When the agent is a computer, the process is called machine learning (ML): a computer observes some data, builds a model based on the data, and uses the model both as a hypothesis about the world and as a piece of software that can solve problems [6].

AI engineering is the discipline focused on developing the tools, systems, and processes that enable the application of artificial intelligence in real-world contexts; it combines principles from systems engineering, software engineering, and computer science to create AI systems.

Since the majority of AI projects fail, academic credentials and integrity are paramount, perhaps more so than in any other field or career. Since each AI project is unique (the data is unique), what worked for one project/company most likely will not work for another. Thus, the “real-world” experience of a self-taught AI practitioner is not really relevant. As a consultant, I am often asked to troubleshoot, fix, redesign, and/or replace poorly designed software written by engineers who supposedly have “real-world” experience. Therefore, I feel it is important to write an article on the AI Engineering Process, or AI Process (AP), which is described in most AI/ML textbooks [5][6].

There are two approaches to AI/ML (model-centric vs. data-centric), and they are mutually exclusive: either you are letting the dataset drive model selection (data-centric) or you are not (model-centric). In AI engineering, we would use a data-centric approach: use AutoML or code a custom test harness to evaluate many algorithms (25–50+) on the dataset and then choose the top performers (say the top three) for further study, being sure to give preference to simpler algorithms (Occam’s Razor). Thus, we would choose more complex SOTA algorithms only if all the simpler algorithms failed miserably. In a research project, we would likely use a model-centric approach to evaluate new algorithms and compare the results to previous results on the same toy dataset, assuming that previous research has already established baselines for simpler models.

The AI Process

We can define an AI Engineering Process, or AI Process (AP), that can be used to solve almost any AI problem [5][6][7][9]:

  1. Define the problem: This step includes the following tasks: defining the scope, value definition, timelines, governance, and resources associated with the deliverable.
  2. Dataset selection: This step can take a few hours or a few months depending on the project. It is crucial to obtain the correct and reliable dataset for an AI/ML project.
  3. Data description: This step includes the following tasks: describe the dataset, including the input features and target feature(s); include summary statistics of the data and counts of any discrete or categorical features, including the target feature.
  4. Data preparation: This step includes the following tasks: data preprocessing, data cleaning, and exploratory data analysis (EDA).
  5. Feature engineering: This step includes the following tasks: quantization or binning; mathematical transforms; scaling and normalization; convert text data features into vectors; modify and/or create new features.
  6. Design: This step includes the following tasks: feature selection, decomposing the problem, and building and evaluating models. We can use AutoML or create a custom test harness to build and evaluate many models to determine what algorithms and views of the data should be chosen for further study.
  7. Training: This step includes building the model, which may include cross-validation.
  8. Evaluation: This step includes evaluating the well-performing models on a hold-out test dataset and selecting a model.
  9. Tuning: This step involves tuning the few selected well-performing models, which may include evaluating ensembles of models to obtain further improvements in accuracy.
  10. Finalize: This step finalizes the chosen model by training it on the entire dataset and making sure the final solution meets the original business requirements for model accuracy and other performance metrics.
  11. Deployment: The model is now ready for deployment. There are two common approaches to the deployment of ML models to production: embed models into a web server or offload the model to an external service. Both model serving approaches have pros and cons.
  12. Monitoring: This is the post-deployment phase which involves observing the model and pipelines, refreshing the model with new data, and tracking success metrics in the context of the original problem.

For more detail on each step, please refer to the AI Checklist, Applied ML Checklist, Data Preparation, and Feature Engineering on my GitHub repo.

Define the problem

The first step in an AI project is to define the problem [6]. In a few sentences, describe the following:

  1. Describe the problem to be solved.
  2. Describe the part(s) of the problem that can be solved by machine learning.
  3. Describe the goal of the project.
  4. Describe the goal of the model: classify, predict, detect, translate, etc.
  5. Define the loss function and/or performance and error metrics for the project.

When designing an agent, one of the first steps is to specify the task environment which is called the PEAS (Performance, Environment, Actuators, Sensors) description [6]. We can think of the task environment as the “problem” and the rational agent(s) are the “solution”. A classic toy example is a simple robot vacuum.

PEAS description for robot vacuum

  • What is the performance measure? cleanness, efficiency: distance traveled to clean, battery life, security
  • What is known about the environment? room, table, wood floor, carpet, different obstacles
  • What actuators does the agent have? wheels, different brushes, vacuum extractor
  • What sensors does the agent have? camera, dirt detection sensor, cliff sensor, bump sensors, infrared wall sensors

After we have decomposed the problem into parts, we may find that there are multiple components that can be handled using traditional software engineering rather than machine learning. We could develop the overall system and then go back later and optimize it, replacing some components with more sophisticated machine learning models.

Part of problem formulation is deciding whether we are dealing with supervised, unsupervised, or reinforcement learning. However, the distinctions are not always so definite.

Data Preparation

The data preparation stage actually involves three steps that may overlap.

  1. Data preprocessing: format adjustments; correct inconsistencies; handle errors in variables.
  2. Exploratory data analysis and visualization: check if the data is normally distributed or heavy-tailed; check for outliers; check if clustering of the data will help; check for imbalanced data.
  3. Data cleaning: check data types; handle missing or invalid values; handle outliers; handle categorical values; encode class labels; parse dates; fix character encodings; handle imbalanced data.
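
As a quick illustration, here is a minimal pandas sketch of a few of these EDA checks on a hypothetical dataset (the column names and distributions are made up for the example):

```python
import numpy as np
import pandas as pd

# Hypothetical dataset for illustration.
rng = np.random.default_rng(42)
df = pd.DataFrame({
    "income": rng.lognormal(mean=10, sigma=1, size=1000),  # heavy-tailed
    "age": rng.normal(loc=40, scale=12, size=1000),
    "target": rng.choice([0, 1], size=1000, p=[0.9, 0.1]),  # imbalanced
})

# Normality check: skewness far from 0 suggests a heavy-tailed feature.
print(df[["income", "age"]].skew())

# Outlier check using the IQR rule.
q1, q3 = df["income"].quantile([0.25, 0.75])
iqr = q3 - q1
mask = (df["income"] < q1 - 1.5 * iqr) | (df["income"] > q3 + 1.5 * iqr)
print(f"{mask.sum()} potential outliers in 'income'")

# Imbalance check: class proportions of the target.
print(df["target"].value_counts(normalize=True))
```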

Data preprocessing

Split first, normalize later: perform the train-test split first and only then normalize the datasets, fitting any scaler on the training set alone so that information from the test set does not leak into training.
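
Here is a minimal sketch of this idea with scikit-learn (a built-in toy dataset stands in for your own data):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

# Split first: hold out the test set before any normalization.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Normalize later: fit the scaler on the training data only,
# then apply the same transformation to the test data.
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)  # transform only: avoids data leakage
```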

Format adjustments

  • Remove leading and trailing spaces
  • Standardize types (decimal separators, date formats, or measurement units)
  • Replace unrecognizable or corrupted characters
  • Check for truncated entries (data entries that are cut off at a certain position)

Correct inconsistencies

  • Check for invalid values (e.g., an age of 200 or a negative age)
  • Check for wrong categories in categorical data (similar products should not be put into different categories)

Handle errors in variables

  • High cardinality: the number of distinct labels in a categorical feature is very high, which makes it difficult for the model to learn.
  • Outliers: extreme values that may be due to error, but not in every case.
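
As a short illustration, here is a minimal pandas sketch of a few of the preprocessing checks above (the DataFrame and its columns are hypothetical; format="mixed" requires pandas 2.0+):

```python
import numpy as np
import pandas as pd

# Hypothetical raw data with a few common problems.
df = pd.DataFrame({
    "name": ["  Alice ", "Bob", "Carol  "],
    "age": [34.0, 200.0, -5.0],  # two invalid ages
    "joined": ["2021-01-05", "2021/01/06", "Jan 7, 2021"],  # mixed formats
})

# Format adjustments: strip leading/trailing spaces, standardize dates.
df["name"] = df["name"].str.strip()
df["joined"] = pd.to_datetime(df["joined"], format="mixed", errors="coerce")

# Correct inconsistencies: flag clearly invalid values for review or removal.
df.loc[~df["age"].between(0, 120), "age"] = np.nan

# High cardinality check: count distinct labels in a categorical column.
print(df["name"].nunique(), "distinct labels in 'name'")
```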

How to Choose an AI Model

Every new AI engineer finds that they need to decide what model to use for a problem.

There are many models to choose from, but there are usually only slight alterations needed to change a regression model into a classification model and vice versa.

First, remember to take a data-centric approach, so avoid asking “what model should I use?” Thus, the first step in the AI/ML process is to perform EDA to understand the properties of your data, such as whether the classes are balanced (classification) or the target is Gaussian (regression).

There are two approaches to model selection: data-centric and model-centric. Either you are letting the data drive model selection (data-centric) or you are not (model-centric).

In a model-centric approach, you are essentially throwing models at the dataset and hoping something will work. Much like throwing bologna at the wall and hoping it sticks, this is an unscientific approach with a low probability of success.

The second step in solving an AI problem is to try simple algorithms (such as linear or logistic regression) as baseline models; your eventual model choice(s) should perform better than all of the baselines.

There are a lot of models to choose from, so consider starting with classification/regression models, which can be evaluated easily using scikit-learn.
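
For example, here is a minimal baseline sketch in scikit-learn; the toy dataset stands in for your own:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

# Majority-class baseline: any real model must beat this score.
baseline = DummyClassifier(strategy="most_frequent")
print("baseline:", cross_val_score(baseline, X, y, cv=5).mean())

# A simple, interpretable first model.
logreg = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
print("logistic:", cross_val_score(logreg, X, y, cv=5).mean())
```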

10 Simple Things to Try Before Neural Networks

Next, best practice is to evaluate many algorithms (say 10–20) using an AutoML tool such as Orange, PyCaret, or AutoGluon and narrow the choices to a few models based on accuracy and error metrics. Then, create a test harness [10] to fully explore the candidates.
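
A custom test harness can be as simple as a loop that evaluates every candidate with the same cross-validation scheme; here is a minimal sketch (the model list and toy dataset are illustrative):

```python
from sklearn.datasets import load_wine
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

X, y = load_wine(return_X_y=True)

# Candidate models, from simplest to more complex.
models = {
    "logreg": LogisticRegression(max_iter=1000),
    "knn": KNeighborsClassifier(),
    "nb": GaussianNB(),
    "tree": DecisionTreeClassifier(random_state=42),
    "svm": SVC(),
    "rf": RandomForestClassifier(random_state=42),
}

# Same resampling scheme for every model so the scores are comparable.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

for name, model in models.items():
    pipe = make_pipeline(StandardScaler(), model)
    scores = cross_val_score(pipe, X, y, cv=cv, scoring="accuracy")
    print(f"{name:8s} mean={scores.mean():.3f} std={scores.std():.3f}")
```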

In general, you should evaluate many simpler models before moving on to more complex models such as neural networks. This approach is not unique to AI/ML; it is similar to the approach used to evaluate and compare algorithms in mathematics, engineering, and other fields.

Keep in mind that, for a balanced binary classification problem, an accuracy of 50% is equivalent to random guessing (a coin toss). Thus, your models should reach an accuracy of roughly 75–80% or better before hyperparameter tuning. Otherwise, you need to select a different model and/or spend more time on data preparation and feature engineering.

A more detailed discussion of the AI engineering process can be found in [5][6].

Model Selection Criteria

In model selection, we are concerned with two questions about learning algorithms [5]:

  1. How can we assess the expected error of a learning algorithm on a problem?
  2. How can we say one model has less error than the other for a given application?

The error rate on the training set is always smaller (by definition) than the error rate on a test set containing instances unseen during training. Thus, we cannot choose between algorithms based on training set errors. Therefore, we need a validation set that is distinct from the training set.

We also need to have several runs on the validation set to compute the average error rates since noise, outliers, and other random factors will affect generalization. Then, we base our evaluation of the learning algorithm on the distribution of these validation errors to assess the expected error of the learning algorithm for the given problem or compare it to the error rate distribution of another learning algorithm.
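
For instance, here is a minimal sketch of collecting a distribution of validation scores with repeated cross-validation in scikit-learn (toy dataset and model are illustrative):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

# Several runs on resampled validation sets give a *distribution*
# of validation scores rather than a single noisy number.
cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=10, random_state=42)
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
scores = cross_val_score(model, X, y, cv=cv, scoring="accuracy")

print(f"validation accuracy: mean={scores.mean():.3f}, std={scores.std():.3f}")
```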

It is important to keep in mind several important points [5]:

1. Whatever conclusion we draw from our analysis is conditioned on the dataset we are given.

As stated by the No Free Lunch Theorem, there is no single best learning algorithm: for any learning algorithm, there is a dataset where it is very accurate and another dataset where it is very poor.

2. The division of a given dataset into a number of training and validation set pairs is only for testing purposes.

Once all the tests are complete and we have decided on the final method and its hyperparameters, we can use all the labeled data previously used for training or validation to train the final learner, which is called finalizing the model.

3. Since we also use the validation set(s) for testing purposes (such as choosing the better of two learning algorithms or deciding where to stop learning), they become part of the data we use.

Therefore, given a dataset, we should first leave some part of it aside as the test set and then use the rest for training and validation.

4. In general, we compare learning algorithms by their error rates, but it should be kept in mind that in real life, the error is only one of the criteria that will affect our decision.

Some other criteria are [5]:

  • risks when errors are generalized using loss functions, instead of 0/1 loss
  • training time and space complexity
  • testing time and space complexity
  • interpretability, i.e., whether the method allows knowledge extraction that can be checked and validated by experts
  • easy programmability

However, the relative importance of these factors changes depending on the application.

When we train a learner on a dataset using a training set and test its accuracy on a validation set and try to draw conclusions, what we are doing is experimentation. Statistics defines a methodology to design experiments correctly and analyze the collected data in a manner so as to be able to extract significant conclusions [5].

Experimental Design

In ML, the goal of an experiment is to analyze the results so as to eliminate the effect of chance and obtain conclusions that we can consider statistically significant [5].

Thus, we want to find a learner with the highest generalization accuracy that also has minimal complexity (its implementation is cheap in time and space) and is robust (unaffected by external sources of variability) [5].

There are three basic principles of experimental design [5]:

1. Randomization requires that the order in which the runs are carried out be randomly determined so that the results are independent. However, order is usually not a problem in software experiments.

2. Replication implies that, for the same configuration of (controllable) factors, the experiment should be run a number of times to average over the effect of uncontrollable factors.

In machine learning, replication is typically done by running the same algorithm on a number of resampled versions of the same dataset, which is called cross-validation.

3. Blocking is used to reduce or eliminate the variability due to nuisance factors that influence the response but in which we are not interested.

When we are comparing learning algorithms, we need to make sure they all use the same resampled subsets of data. In other words, the different training sets in replicated runs should be identical, which is what we mean by blocking [7]. In statistics, when there are two populations, this approach is called pairing, and it is used in paired testing.
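
Here is a minimal sketch of blocking and pairing with scikit-learn and SciPy: one fixed CV object gives both algorithms identical folds, and a paired t-test compares them fold by fold (a common, if approximate, practice, since fold scores are not fully independent):

```python
from scipy.stats import ttest_rel
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# Blocking: one fixed CV object means both algorithms are trained and
# validated on exactly the same folds.
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)

logreg = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
tree = DecisionTreeClassifier(random_state=42)

scores_a = cross_val_score(logreg, X, y, cv=cv)
scores_b = cross_val_score(tree, X, y, cv=cv)

# Pairing: compare the two algorithms fold by fold on identical folds.
t_stat, p_value = ttest_rel(scores_a, scores_b)
print(f"t = {t_stat:.3f}, p = {p_value:.3f}")
```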

Model Selection Criteria

The following seven criteria can help in selecting a model [11]:

1. Explainability

There is a trade-off between explainability and model performance.

Using a more complex model will often increase performance, but the model will be more difficult to interpret.

If there is no need to explain the model and its output to a non-technical audience, more complex models, such as ensemble learners and deep neural networks, could be used.

2. In-memory vs. out-of-memory

It is important to consider the size of your data and the amount of RAM available on the machine where training will occur.

If the RAM can handle all of the training data, you can choose from a wide variety of machine learning algorithms.

If the RAM cannot handle all of the training data, you can explore incremental learning algorithms, which can improve the model by gradually feeding it more training data.
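
As a minimal sketch, scikit-learn's SGDClassifier supports incremental learning through partial_fit; the chunked data here is simulated, and loss="log_loss" is the name used in recent scikit-learn versions:

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

rng = np.random.default_rng(42)
classes = np.array([0, 1])  # all classes must be declared for partial_fit

model = SGDClassifier(loss="log_loss", random_state=42)

# Each iteration stands in for one chunk read from disk or a database,
# so the full dataset never has to fit in RAM at once.
for _ in range(100):
    X_chunk = rng.normal(size=(1000, 20))
    y_chunk = (X_chunk[:, 0] + X_chunk[:, 1] > 0).astype(int)
    model.partial_fit(X_chunk, y_chunk, classes=classes)
```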

3. Number of features and examples

The number of training samples and the number of features per sample is also important in model selection.

If you have a small number of examples and features, a simple learner such as a decision tree or k-nearest neighbors would be a great choice.

If you have a small number of examples and a large number of features, SVMs and Gaussian processes would be a good choice since they can handle a large number of features but require fewer resources.

If you have a large number of examples, deep neural networks and boosting algorithms would be a good choice since they can handle millions of samples and features.

4. Categorical vs numerical features

The type of features is important when choosing a model.

Some machine learning algorithms, such as linear regression, cannot handle categorical features, so you have to convert them into numerical features, while other algorithms, such as decision trees and random forests, can handle categorical features directly.
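
For example, here is a minimal one-hot encoding sketch with scikit-learn (the feature is hypothetical, reusing the ocean-temperature example from later in this article; sparse_output requires scikit-learn 1.2+):

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical categorical feature.
df = pd.DataFrame({"ocean_temp": ["cool", "warm", "hot", "warm"]})

# One-hot encode: each category becomes its own binary column.
encoder = OneHotEncoder(sparse_output=False, handle_unknown="ignore")
encoded = encoder.fit_transform(df[["ocean_temp"]])

print(encoder.get_feature_names_out())
print(encoded)
```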

5. Normality of data

If your data is normally distributed, SVM with linear kernel, logistic regression, or linear regression could be used.

If your data is not normally distributed, deep neural networks or ensemble learners would be a good choice.

6. Training speed

The available time for training is important when choosing a model.

Simple algorithms such as logistic/linear regression or decision trees can be trained in a short time.

Complex algorithms such as neural networks and ensemble learners are slow to train.

If you have access to a multi-core machine, this could significantly reduce the training time of more complex algorithms.

7. Prediction speed

The speed of generating the results is another important criterion for choosing a model.

If your model will be used in a real-time or production environment, it should be able to generate the results with very low latency.

Algorithms such as SVMs, linear/logistic regression, and some types of neural networks are extremely fast at prediction time.

You should also consider where you will deploy your model. If you are using the models for analysis or theoretical purposes, your prediction time can be longer which means you could use ensemble algorithms and very deep neural networks.

Why Simpler Models

The two most common simple algorithms are:

  • Linear Regression (Regression)
  • Logistic Regression (Classification)

You should start with these simple models because [12]:

  • It is likely that your problem does not need a complex algorithm.
  • These two models have been studied thoroughly and are some of the most well-understood models in ML.
  • They are easy to implement and test.
  • They are easily interpretable since they are linear models.

To adapt a regression algorithm to a classification problem, there are two common solutions:

  • Logistic Regression: binary classification
  • Softmax Regression: multiclass classification

In fact, I have recently worked on many projects in which the developers spent weeks or months trying to implement state-of-the-art DL algorithms from research papers, only to have me show how linear regression and/or XGBoost outperformed all their complex models (in many cases achieving 95–98% accuracy on the test dataset). You should evaluate many algorithms to obtain baselines for comparison to justify your final model selection, so you should always know how simpler models perform on your dataset.

If you are doing research, a model-centric approach is acceptable, provided someone has done an extensive evaluation of various models (including simpler ones) on the same toy dataset. When you are using a custom dataset and/or solving real-world problems, you are performing AI engineering (not research), so the rule of thumb is Occam’s Razor: “simpler is better,” or “there is no such thing as best, just good enough.”

Multinomial Logistic Regression

Multinomial Logistic Regression (MLR) is an extension of logistic regression that adds support for multiclass classification problems.
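
Here is a minimal sketch with scikit-learn, whose LogisticRegression handles the multinomial case (the iris toy dataset has three classes):

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)  # three classes, so a multiclass problem

# With the default lbfgs solver, LogisticRegression optimizes the
# multinomial (softmax) loss for multiclass targets.
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
scores = cross_val_score(model, X, y, cv=5, scoring="accuracy")
print(f"mean accuracy: {scores.mean():.3f}")
```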

The primary assumptions of linear regression (simple and multiple) are [13]:

  1. Linearity: There is a linear relationship between the outcome and predictor variable(s).
  2. Normality: The residuals (error calculated by subtracting the predicted value from the actual value) follow a normal distribution.
  3. Homoscedasticity: The variability in the dependent variable is equal for all values of the independent variable(s).

With many independent variables, we often encounter other problems, such as multicollinearity, where variables that are supposed to be independent vary with each other, and the presence of categorical variables, such as an ocean temperature being classified as cool, warm, or hot instead of quantified in degrees.
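
One common way to screen for multicollinearity is the variance inflation factor (VIF); here is a minimal sketch using statsmodels on made-up predictors:

```python
import numpy as np
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Made-up predictors: x2 is nearly a linear function of x1.
rng = np.random.default_rng(42)
x1 = rng.normal(size=200)
df = pd.DataFrame({
    "x1": x1,
    "x2": 2 * x1 + rng.normal(scale=0.1, size=200),  # collinear with x1
    "x3": rng.normal(size=200),                      # independent
})

# A VIF above roughly 5-10 is a common rule of thumb for trouble.
for i, col in enumerate(df.columns):
    print(col, round(variance_inflation_factor(df.values, i), 1))
```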

Here are some tips for working with MLR [13]:

  • When your MLR models get complicated, avoid trying to use coefficients to interpret changes in the outcome versus changes in individual predictors.
  • Create predictions while varying a single predictor, observe how the predictions change, and use these changes to form your conclusions.

Some good tutorials on MLR are given in [14] and [15].

Understand AI Algorithms

You need to know what algorithms are available for a given problem, how they work, and how to get the most out of them. However, this does not mean you need to hand-code the algorithms from scratch.

Even if you are an experienced ML engineer, you should know the performance of simpler models on your dataset/problem which I discuss further in Getting Started with AI.

Conclusion

The AI process discussed here can be used to solve almost any AI problem, with some modifications of course. There does not currently seem to be a clearly defined approach to solving AI problems, so this article attempts to present a consolidated approach drawn from several textbooks and articles. It also discusses some of the issues involved, such as model selection criteria and the case for simpler models, and provides a quick review of understanding AI algorithms, with a link to an article that offers a more detailed discussion. I plan to write some follow-up articles with end-to-end examples using the AI process.

References

[1] Y. Kosarenko, “The majority of business analytics and AI projects are still failing,” Data-Driven Investor, April 30, 2020.

[2] A. DeNisco Rayome, “Why 85% of AI projects fail,” TechRepublic, June 20, 2019.

[3] J. F. DeFranco and J. Voas, “Reproducibility, Fabrication, and Falsification,” IEEE Computer, vol. 54, no. 12, 2021.

[4] T. Shin, “4 Reasons Why You Shouldn’t Use Machine Learning,” Towards Data Science, Oct. 5, 2021.

[5] E. Alpaydin, “Design and Analysis of Machine Learning Experiments,” in Introduction to Machine Learning, 3rd ed., MIT Press, ISBN: 978-0262028189, 2014, ch. 19, pp. 547–588.

[6] S. Russell and P. Norvig, “Developing Machine Learning Systems,” in Artificial Intelligence: A Modern Approach, 4th ed., Upper Saddle River, NJ: Prentice Hall, ISBN: 978-0-13-604259-4, 2021, sec. 19.9, pp. 704–714.

[7] S. Raschka and V. Mirjalili, Python Machine Learning, 2nd ed., Packt, ISBN: 978-1787125933, 2017.

[8] W. McKinney, Python for Data Analysis: Data Wrangling with Pandas, NumPy, and IPython, 2nd ed., O’Reilly Media, ISBN: 978-1491957660, 2017.

[9] J. Brownlee, “Applied Machine Learning Process,” Machine Learning Mastery, Feb. 12, 2014.

[10] J. Brownlee, “How to Evaluate Machine Learning Algorithms,” Machine Learning Mastery, Aug. 16, 2020.

[11] Y. Hosni, “Brief Guide for Machine Learning Model Selection,” MLearning.ai, Dec. 4, 2021.

[12] Z. Warnes, “How to Select an ML Model,” KDnuggets, Aug. 2021.

[13] M. LeGro, “Interpreting Confusing Multiple Linear Regression Results,” Towards Data Science, Sep. 12, 2021.

[14] J. Brownlee, “Multinomial Logistic Regression With Python,” Machine Learning Mastery, Jan. 1, 2021.

[15] W. Xie, “Multinomial Logistic Regression in a Nutshell,” Data Science Student Society @ UC San Diego, Dec. 8, 2020.

[16] P. Bourque and R. E. Fairley, Guide to the Software Engineering Body of Knowledge, v. 3, IEEE, 2014.

[17] J. S. Damji and M. Galarnyk, “Considerations for Deploying Machine Learning Models in Production,” Towards Data Science, Nov. 19, 2021.

