Transform Your Data Science Project: Discover the Benefits of Storing Variables in a YAML File

Last Updated on August 1, 2023 by Editorial Team

Author(s): Shivamshinde

Originally published on Towards AI.

This blog post will discuss the benefits of using a YAML file as a central repository for storing variables, parameters, and hyper-parameters in a data science project. It will explain how this method of storage can improve the efficiency and organization of the project by allowing for easy access and modification of these values. The post will also provide examples and a step-by-step guide for implementing this method in a data science project.

Photo by Fikri Rasyid on Unsplash

Introduction

Machine learning and deep learning problems are all about experimenting with different parameters. Experimentation becomes difficult as the number of parameters grows, partly because of the manual effort required to change parameter values for every iteration. Luckily, there is a way to make this easier: by keeping the parameters in a YAML file and reading them from Python code, we can run different experiments with little effort. This article demonstrates how to combine a YAML file with Python code for this purpose.

Prerequisites

  • Basic knowledge of Python programming language
  • Basic knowledge of how the machine learning lifecycle works

Agenda

  • What is YAML?
  • Why not use the conventional way of storing the variables?
  • Advantages of storing the parameters centrally in the YAML file
  • Downloading PyYAML python library
  • Storing the variables in the YAML file
  • Storing lists and dictionaries in the YAML file
  • Loading the variables from the YAML file into the Python file
  • Conclusion

What is YAML?

Before diving straight into the topic, let’s learn some basic information about YAML.

YAML stands for ‘YAML Ain’t Markup Language’. It is a data-serialization language designed to be easy for people to read and write: it relies on indentation rather than the brackets, quotes, and tags that JSON and XML require. A YAML file only describes data; it contains no executable instructions. Libraries for reading YAML exist for most programming languages, including Python, so the data is easy to load into code.
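To make the comparison concrete, here is a minimal sketch that writes the same data once as YAML and once as JSON and loads both into Python. It assumes the PyYAML library (installed later in this article) is available; the keys and values are made up for illustration and are not taken from the project.

```python
import json

import yaml  # provided by the PyYAML package

# Hypothetical data, written in YAML: indentation and dashes, no brackets.
yaml_text = """
model: random_forest
cv: 5
features:
  - distance_from_home
  - online_payment
"""

# The same hypothetical data, written in JSON.
json_text = (
    '{"model": "random_forest", "cv": 5, '
    '"features": ["distance_from_home", "online_payment"]}'
)

from_yaml = yaml.safe_load(yaml_text)
from_json = json.loads(json_text)

print(from_yaml)
print(from_yaml == from_json)  # both loaders produce the same dictionary
```

Both formats end up as an ordinary Python dictionary, so the choice between them is mostly about how pleasant the file is to edit by hand.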

Why not use the conventional way of storing the variables?

To explain these concepts here, I will be using an example of a data science project named ‘credit card fraud detection’. The aim of the project is simple. The project focuses on detecting whether the performed transaction is fraudulent or not. This is done by using some information about the said transaction. Some of the examples that could be used as information are:

  • The distance between the place where the transaction is done and the home address of the credit card owner.
  • The distance from the last transaction place
  • The ratio of the mean transaction price to the current transaction price
  • The IP address from which the transaction has been made
  • Whether the payment was made online or offline

This detection is done by the machine learning model trained on the credit card transaction history of the user.

A data science project based on machine learning has many stages, such as exploring the data, cleaning the data, finding a suitable machine learning model for the problem, tuning the model, and saving the model. Each of these steps creates a lot of variables, especially the steps where a suitable algorithm is selected and where that algorithm is tuned.

The conventional way of storing variables, hard-coding them in the files where they are used, creates problems in such cases. Let’s understand this in more detail. Finding a suitable machine learning algorithm and getting the best accuracy out of it depends largely on experimenting with the algorithm’s hyper-parameters. With the conventional approach, we have to open each file and change those parameters manually for every experiment. This quickly becomes tedious and error-prone. To avoid this unnecessary work and the mistakes that come with it, we use a different approach, which the rest of this article explains.

Advantages of storing the parameters centrally in the YAML file

Unlike the conventional way of keeping each parameter in the file that uses it, this approach stores all the parameters in one file. Any script can load that file and read the parameters it needs. This keeps the code leaner and less prone to mistakes. A YAML file can also be used to store file paths.

A natural question is: why a YAML file in particular? The answer lies in YAML’s extremely simple syntax. Other file formats would also work, but YAML keeps a simple matter simple.

Now, let’s see how it’s done using some code.

Downloading PyYAML python library

PyYAML is one of the popular third-party Python libraries for working with YAML. It is actively maintained and is also listed on the official YAML website. To install it, run the following command in the terminal.

python -m pip install pyyaml

After the installation of the library is completed, use the following command to import it into the Python file.

import yaml

Note that even though the library you installed is named PyYAML, you import it in Python code under the name ‘yaml’.

Storing the variables in the YAML file

YAML syntax resembles Python in one important respect: indentation defines structure. Keys indented under another key belong to that key. Unlike Python, YAML allows only spaces, not tabs, for indentation. Let’s take a look at a YAML file to understand this.

SimpleImputer:
  strategy: most_frequent
  missing_values: nan

OrdinalEncoder:
  handle_unknown: use_encoded_value
  unknown_value: 100

Here we are storing variables in two groups named SimpleImputer and OrdinalEncoder. These are the variables that are used as parameters for Scikit-Learn’s simple imputer and ordinal encoder transformers in the preprocessing step.
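As a rough sketch of how these two groups look once loaded, the snippet below parses the same YAML from a string (in the project it would come from a file) and splits it into one dictionary per transformer. The variable names are illustrative. One caveat that is an observation about PyYAML rather than something stated in the original: a bare `nan` is loaded as the string "nan", not as a float, so the sketch converts it explicitly.

```python
import math

import yaml

config_text = """
SimpleImputer:
  strategy: most_frequent
  missing_values: nan

OrdinalEncoder:
  handle_unknown: use_encoded_value
  unknown_value: 100
"""

params = yaml.safe_load(config_text)

# Each top-level group becomes its own dictionary of keyword arguments.
imputer_kwargs = dict(params["SimpleImputer"])  # copy, so params stays untouched
encoder_kwargs = params["OrdinalEncoder"]

# PyYAML reads a bare `nan` as the string "nan" (the YAML float form is `.nan`),
# so convert it to a real float NaN before using it as a missing-value marker.
imputer_kwargs["missing_values"] = float(imputer_kwargs["missing_values"])

print(imputer_kwargs)
print(encoder_kwargs)
print(math.isnan(imputer_kwargs["missing_values"]))

# With scikit-learn installed, the dictionaries could then be unpacked as
# keyword arguments, for example:
#   SimpleImputer(**imputer_kwargs)
#   OrdinalEncoder(**encoder_kwargs)
```

Grouping the parameters under the transformer's name keeps the YAML file readable and makes the unpacking step a one-liner.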

Note that plain string values do not need quotation marks in a YAML file: most_frequent and "most_frequent" load as the same string. Quotes matter only when a value would otherwise be read as another type. For example, 100 loads as an integer, whereas "100" loads as a string.
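The effect of quoting can be checked directly. The following sketch loads a handful of made-up keys with PyYAML and prints the Python type of each value; the key names are purely illustrative.

```python
import yaml

text = """
plain_string: most_frequent
quoted_string: "most_frequent"
plain_number: 100
quoted_number: "100"
plain_yes: yes
quoted_yes: "yes"
plain_null: null
"""

values = yaml.safe_load(text)

# Show each value together with the Python type PyYAML chose for it.
for key, value in values.items():
    print(f"{key}: {value!r} ({type(value).__name__})")
```

With PyYAML, the unquoted `yes` becomes the boolean True and the unquoted `null` becomes None, so quoting is the safe choice whenever a string happens to look like a number, a boolean, or null.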

Storing file paths in the YAML file works the same way as storing any other value. The following are the paths used for preprocessing the data and training our credit card fraud detection model.

data_preparation:
  training_db: Training_db
  training_db_dir: Training_Database
  table_name: trainingGoodRawDataTable
  schema_training: config/schema_training.json
  good_validated_raw_dir: data/Training_Raw_Files_Validated/Good_Raw
  master_csv: master.csv
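Once loaded, such path entries are plain strings that can be combined with `pathlib`. The sketch below is only an illustration: joining `good_validated_raw_dir` with `master_csv` is an assumption about how the project uses these two values, and `PurePosixPath` is chosen so the printed result looks the same on every operating system (real code would more likely use `pathlib.Path`).

```python
from pathlib import PurePosixPath

import yaml

config_text = """
data_preparation:
  training_db: Training_db
  training_db_dir: Training_Database
  table_name: trainingGoodRawDataTable
  schema_training: config/schema_training.json
  good_validated_raw_dir: data/Training_Raw_Files_Validated/Good_Raw
  master_csv: master.csv
"""

paths = yaml.safe_load(config_text)["data_preparation"]

# Assumed usage: the master CSV lives inside the validated raw-data directory.
master_csv_path = PurePosixPath(paths["good_validated_raw_dir"]) / paths["master_csv"]
schema_path = PurePosixPath(paths["schema_training"])

print(master_csv_path)
print(schema_path.suffix)
```

Because the directory and file names live in the YAML file, moving the data to a different folder means editing one line of configuration rather than every script that reads the data.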

Storing lists and dictionaries in the YAML file

There are two ways to store lists and dictionaries in a YAML file: inline (flow style) and indented (block style). The following are the hyper-parameters used for tuning the random forest classifier model used for fraud detection.

Approach-1:

random_forest:
  cv: 5
  verbose: 3
  param_grid: {n_estimators: [10, 50, 100, 130], max_depth: [2, 3], max_features: ['auto', 'log2']}

In the first approach, we write lists and dictionaries inline, much as we would in Python: lists in square brackets, and dictionaries in curly braces containing key: value pairs.

Approach-2:

random_forest:
  cv: 5
  verbose: 3
  param_grid:
    n_estimators:
      - 10
      - 50
      - 100
      - 130
    max_depth:
      - 2
      - 3
    max_features:
      - auto
      - log2

In the second approach, each list item goes on its own line, prefixed with ‘-’, and all items of a list sit at the same indentation level. Dictionaries are written as indented key: value pairs, one per line.
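The two approaches are interchangeable from Python's point of view. As a quick check, this sketch loads both versions of the random forest configuration with PyYAML and compares the results; the variable names are illustrative.

```python
import yaml

# Approach 1: lists and dictionaries written inline.
flow_style = """
random_forest:
  cv: 5
  verbose: 3
  param_grid: {n_estimators: [10, 50, 100, 130], max_depth: [2, 3], max_features: ['auto', 'log2']}
"""

# Approach 2: one item per line, structure expressed through indentation.
block_style = """
random_forest:
  cv: 5
  verbose: 3
  param_grid:
    n_estimators:
      - 10
      - 50
      - 100
      - 130
    max_depth:
      - 2
      - 3
    max_features:
      - auto
      - log2
"""

flow_params = yaml.safe_load(flow_style)
block_params = yaml.safe_load(block_style)

print(flow_params == block_params)
print(block_params["random_forest"]["param_grid"]["max_depth"])
```

Since both load to the same dictionary, the choice is a matter of readability: the inline form is compact, while the indented form is easier to scan and produces cleaner diffs when a single value changes.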

Loading the variables from the YAML file into the Python file

Suppose our ‘parameters.yaml’ file contains the following:

random_forest:
  cv: 5
  verbose: 3
  param_grid:
    n_estimators:
      - 10
      - 50
      - 100
      - 130
    max_depth:
      - 2
      - 3
    max_features:
      - auto
      - log2

Now let’s say we want to read the ‘verbose’ variable from our ‘parameters.yaml’ file in a Python file. We can do this in the following way.

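import yaml

# Read the whole YAML file into a Python dictionary.
with open('parameters.yaml') as p:
    params = yaml.safe_load(p)

# Nested keys in the YAML file become nested dictionary lookups.
verbose = params['random_forest']['verbose']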
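The same lookup pattern works for the whole hyper-parameter grid. As a minimal sketch of what the loaded `param_grid` represents, the code below expands it into the individual hyper-parameter combinations using only the standard library. The YAML is embedded as a string to keep the example self-contained; in the project it would be read from ‘parameters.yaml’ as shown above. A dictionary of this shape is also what scikit-learn's `GridSearchCV` accepts as its `param_grid` argument, although that step is not shown here.

```python
import itertools

import yaml

config_text = """
random_forest:
  cv: 5
  verbose: 3
  param_grid:
    n_estimators: [10, 50, 100, 130]
    max_depth: [2, 3]
    max_features: [auto, log2]
"""

params = yaml.safe_load(config_text)
grid = params["random_forest"]["param_grid"]

# Build one dictionary per combination of hyper-parameter values.
names = list(grid)
combinations = [
    dict(zip(names, values)) for values in itertools.product(*grid.values())
]

print(len(combinations))  # 4 * 2 * 2 candidate settings
print(combinations[0])
```

Adding a value to any list in the YAML file changes the number of experiments without touching the Python code.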

You might wonder why we go through the trouble of loading the variable from a YAML file when we could simply set verbose to 3 directly in the Python file. There is a good reason for this practice.

Let’s say the variable is used in multiple files. If it were hard-coded, updating verbose would mean going through every file one by one and changing it in each. If we instead store the variable in the YAML file and load it in every Python file, a single change in the YAML file is reflected everywhere the variable is used.
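This single point of change can be simulated in one script. In the sketch below, `load_params` is a hypothetical shared helper, and the two functions stand in for code that would live in separate files (say, a training script and an evaluation script; these names are invented for illustration). A temporary directory is used so the example does not touch any real configuration.

```python
import tempfile
from pathlib import Path

import yaml


def load_params(path):
    """Read the YAML file at `path` and return its contents as a dictionary."""
    with open(path, encoding="utf-8") as handle:
        return yaml.safe_load(handle)


def training_verbosity(path):
    # Stands in for code in a hypothetical training script.
    return load_params(path)["random_forest"]["verbose"]


def evaluation_verbosity(path):
    # Stands in for code in a hypothetical evaluation script.
    return load_params(path)["random_forest"]["verbose"]


with tempfile.TemporaryDirectory() as tmp:
    config_path = Path(tmp) / "parameters.yaml"
    config_path.write_text("random_forest:\n  cv: 5\n  verbose: 3\n", encoding="utf-8")
    before = (training_verbosity(config_path), evaluation_verbosity(config_path))

    # A single edit to the YAML file ...
    config_path.write_text("random_forest:\n  cv: 5\n  verbose: 1\n", encoding="utf-8")
    # ... is picked up by every consumer the next time it loads the file.
    after = (training_verbosity(config_path), evaluation_verbosity(config_path))

print(before, after)
```

Keeping the loading logic in one helper also means that any later change, such as validating required keys, only has to be made in one place.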

Conclusion

In this article, we learned why hard-coding variables across Python files makes experimentation difficult, what the advantages of storing them in a YAML file are, and how this helps with experiments in a machine learning project. Check out the following link for the complete code for this article.

Discover the Benefits of Storing Variables in a YAML File

replit.com

Outro

I hope you like the article. If you have any thoughts on the article, then please let me know. Also, if you liked the article, then please give a clap.

Connect with me on LinkedIn.

Know more about me on Website.

Mail me at [email protected]

Have a great day!

Published via Towards AI
