Transform Your Data Science Project: Discover the Benefits of Storing Variables in a YAML File
Last Updated on August 1, 2023 by Editorial Team
Author(s): Shivamshinde
Originally published on Towards AI.
This blog post will discuss the benefits of using a YAML file as a central repository for storing variables, parameters, and hyper-parameters in a data science project. It will explain how this method of storage can improve the efficiency and organization of the project by allowing for easy access and modification of these values. The post will also provide examples and a step-by-step guide for implementing this method in a data science project.
Introduction
Machine learning and deep learning problems are all about experimenting with different parameters. Experimentation becomes difficult as the number of parameters grows, partly because of the manual effort required to change the parameter values for every iteration. Luckily, there is a way to make this easier. By pairing a YAML file with Python code, we can run different experiments quite easily. This article demonstrates how to do that.
Prerequisites
- Basic knowledge of Python programming language
- Basic knowledge of how the machine learning lifecycle works
Agenda
- What is YAML?
- Why not use the conventional way of storing the variables?
- Advantages of storing the parameters centrally in the YAML file
- Downloading PyYAML python library
- Storing the variables in the YAML file
- Storing lists and dictionaries in the YAML file
- Loading the variables from the YAML file into the Python file
- Conclusion
What is YAML?
Before diving straight into the topic, let's learn some basic information about YAML.
YAML stands for "YAML Ain't Markup Language". YAML is a language that stores data in a highly human-readable format, with less punctuation than XML or JSON. A YAML file only stores data; it does not contain any actions or logic. The data in a YAML file can also be loaded easily into programming languages such as Python.
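As a small illustration of the readability point, the sketch below writes the same two settings once as YAML and once as JSON and loads both into Python. It uses the PyYAML library, which is installed later in this article; the key names and values are made up for the example.

```python
import json

import yaml  # provided by the PyYAML package

# The same two settings written as YAML and as JSON (illustrative values).
yaml_text = """
model:
  name: random_forest
  cv: 5
"""
json_text = '{"model": {"name": "random_forest", "cv": 5}}'

from_yaml = yaml.safe_load(yaml_text)
from_json = json.loads(json_text)

print(from_yaml)               # {'model': {'name': 'random_forest', 'cv': 5}}
print(from_yaml == from_json)  # True
```

Both formats end up as the same nested dictionary; the YAML version simply needs no braces or quotation marks.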
Why not use the conventional way of storing the variables?
To explain these concepts, I will use an example data science project named "credit card fraud detection". The aim of the project is simple: detect whether a given transaction is fraudulent or not, using some information about that transaction. Some examples of the information that could be used are:
- The distance between the place where the transaction is done and the home address of the credit card owner.
- The distance from the last transaction place
- The ratio of the mean transaction price to the current transaction price
- The IP address from which the transaction has been made
- Whether the payment was made online or offline
This detection is done by the machine learning model trained on the credit card transaction history of the user.
A data science project based on machine learning has many stages, such as exploring the data, cleaning the data, finding a suitable machine learning model for the problem, tuning the model, and saving the model. Each of these steps creates a lot of variables, especially the steps where a suitable algorithm is selected and tuned.
The conventional way of storing variables creates problems in such cases. Let's understand this in more detail. Finding a suitable machine learning algorithm and getting the best accuracy out of it depends largely on experimenting with the algorithm's hyper-parameters. With the conventional approach, we have to go through each file and change those parameters manually for every experiment. This is tedious and prone to errors. To avoid such unnecessary work and such mistakes, a different approach is used, which we will look at later in this article.
Advantages of storing the parameters centrally in the YAML file
Unlike the conventional way of storing the parameters in their respective files, this approach stores all the parameters in one file. Any script can then load this file and obtain the parameters it needs. This approach is leaner and less prone to mistakes. One can also use the YAML file to store file paths.
A question might arise: why use a YAML file in particular? The answer lies in the extremely simple syntax of YAML. Other file types can be used too, but YAML keeps a simple matter simple.
Now, let's see how it's done using some code.
Downloading PyYAML python library
PyYAML is one of Python's popular third-party libraries. It is actively maintained and is also mentioned on the official YAML website. To install this library, use the following command in the terminal.
python -m pip install pyyaml
After the installation of the library is completed, use the following command to import it into the Python file.
import yaml
Note that even though PyYAML is the name of the library you installed, you import the package in Python code using the name "yaml".
Storing the variables in the YAML file
A YAML file has a somewhat similar syntax to Python. In a YAML file, indentation is used to show nesting, just like in Python. Let's take a look at a YAML file to understand this.
SimpleImputer:
  strategy: most_frequent
  missing_values: nan
OrdinalEncoder:
  handle_unknown: use_encoded_value
  unknown_value: 100
Here we are storing variables in two groups named SimpleImputer and OrdinalEncoder. These variables are used as parameters for Scikit-Learn's simple imputer and ordinal encoder transformers in the preprocessing step.
Note that we don't need quotation marks around plain string values such as these in the YAML file, and adding them makes no difference. Quotes do matter when a value could be read as another type: an unquoted 100 is loaded as an integer, while "100" is loaded as a string.
Storing file paths in the YAML file works the same way as storing any other value. The following are the paths used in the preprocessing of the data and the training of our credit card fraud detection model.
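To see how the values above arrive in Python, here is a minimal sketch that loads the same YAML text with PyYAML and prints each value with its type. The quoted_value key is hypothetical and is added only to show the effect of quotes. One thing worth noticing: PyYAML reads the plain word nan as the string "nan", not as a floating-point NaN (YAML spells that .nan), so the code using this value may need to convert it.

```python
import yaml

config_text = """
SimpleImputer:
  strategy: most_frequent
  missing_values: nan
OrdinalEncoder:
  handle_unknown: use_encoded_value
  unknown_value: 100
  quoted_value: "100"   # hypothetical key, only here to show what quotes do
"""

config = yaml.safe_load(config_text)

# Print every value together with the Python type it was loaded as.
for group, settings in config.items():
    for key, value in settings.items():
        print(f"{group}.{key} = {value!r} ({type(value).__name__})")
```

Running this shows strategy and missing_values as str, unknown_value as int, and quoted_value as str.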
data_preparation:
  training_db: Training_db
  training_db_dir: Training_Database
  table_name: trainingGoodRawDataTable
  schema_training: config/schema_training.json
  good_validated_raw_dir: data/Training_Raw_Files_Validated/Good_Raw
  master_csv: master.csv
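Paths loaded from YAML are plain strings, so one possible way to work with them is to wrap them in pathlib.Path objects. The sketch below assumes a hypothetical project root folder named "project"; that name is not part of the original project and is only there to show how the stored relative paths can be joined onto a base directory.

```python
from pathlib import Path

import yaml

paths_text = """
data_preparation:
  training_db: Training_db
  training_db_dir: Training_Database
  table_name: trainingGoodRawDataTable
  schema_training: config/schema_training.json
  good_validated_raw_dir: data/Training_Raw_Files_Validated/Good_Raw
  master_csv: master.csv
"""

prep = yaml.safe_load(paths_text)["data_preparation"]

# Hypothetical base folder; replace with the real project location.
project_root = Path("project")

schema_path = project_root / prep["schema_training"]
raw_dir = project_root / prep["good_validated_raw_dir"]

print(schema_path.as_posix())  # project/config/schema_training.json
print(raw_dir.name)            # Good_Raw
```

Keeping only relative paths in the YAML file and joining them to a root in code makes it easier to move the project between machines.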
Storing lists and dictionaries in the YAML file
There are two ways to store lists and dictionaries in a YAML file. The following are the hyper-parameters used for tuning the random forest classifier model used for fraud detection.
Approach-1:
random_forest:
  cv: 5
  verbose: 3
  param_grid: {n_estimators: [10, 50, 100, 130], max_depth: [2, 3], max_features: ['auto', 'log2']}
In the first approach, we write lists and dictionaries inline, just as we do in Python. Dictionaries are written as key: value pairs inside curly braces, and lists are written inside square brackets.
Approach-2:
random_forest:
  cv: 5
  verbose: 3
  param_grid:
    n_estimators:
      - 10
      - 50
      - 100
      - 130
    max_depth:
      - 2
      - 3
    max_features:
      - auto
      - log2
In the second approach, each list item goes on its own line, starts with the symbol "-", and sits at the same indentation level as the other items of that list. Dictionaries are written as key: value pairs, with nesting shown by indentation.
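The two approaches are only different spellings of the same data (YAML calls them flow style and block style). As a quick check, the sketch below loads both versions with PyYAML and compares the resulting Python objects.

```python
import yaml

# Approach-1: lists and dictionaries written inline (flow style).
flow_style = """
random_forest:
  cv: 5
  verbose: 3
  param_grid: {n_estimators: [10, 50, 100, 130], max_depth: [2, 3], max_features: ['auto', 'log2']}
"""

# Approach-2: one item per line, nesting shown by indentation (block style).
block_style = """
random_forest:
  cv: 5
  verbose: 3
  param_grid:
    n_estimators:
      - 10
      - 50
      - 100
      - 130
    max_depth:
      - 2
      - 3
    max_features:
      - auto
      - log2
"""

from_flow = yaml.safe_load(flow_style)
from_block = yaml.safe_load(block_style)

print(from_flow == from_block)  # True
print(from_block["random_forest"]["param_grid"]["n_estimators"])  # [10, 50, 100, 130]
```

Since both load to the same dictionary, the choice between them is a matter of readability: the inline form is compact, while the block form is easier to scan and to edit one value at a time.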
Loading the variables from the YAML file into the Python file
random_forest:
  cv: 5
  verbose: 3
  param_grid:
    n_estimators:
      - 10
      - 50
      - 100
      - 130
    max_depth:
      - 2
      - 3
    max_features:
      - auto
      - log2
Now let's say the YAML above is saved as "parameters.yaml" and we want to access its "verbose" variable from a Python file. We can do this in the following way.
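import yaml
# Read parameters.yaml and parse it into a nested Python dictionary.
with open('parameters.yaml') as p:
    params = yaml.safe_load(p)
# Look up the value by group name, then by parameter name.
verbose = params['random_forest']['verbose']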
You might wonder why we go through the trouble of loading the variable from the YAML file when we could simply set verbose to 3 directly in the Python file. There is a reason behind this practice.
Let's say we use this variable in multiple files and later want to update it. With hard-coded values, we would have to go through all the files one by one and change each of them. If we instead store the variable in the YAML file and load it in every Python file, then changing the value once in the YAML file is reflected in every Python file that uses it.
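The following is a minimal sketch of that idea, not the code from the original project. To keep it runnable on its own, it writes a small parameters.yaml into a temporary folder; the helper load_params and the two functions standing in for separate scripts are hypothetical names chosen for illustration.

```python
import tempfile
from pathlib import Path

import yaml

# Create a small parameters.yaml in a temporary folder so the sketch is self-contained.
PARAMS_PATH = Path(tempfile.mkdtemp()) / "parameters.yaml"
PARAMS_PATH.write_text("random_forest:\n  cv: 5\n  verbose: 3\n", encoding="utf-8")


def load_params(path=PARAMS_PATH):
    """Hypothetical shared helper: parse the YAML file into a dictionary."""
    with open(path, encoding="utf-8") as f:
        return yaml.safe_load(f)


def training_verbose():
    # Stands in for code that would live in a training script.
    return load_params()["random_forest"]["verbose"]


def evaluation_verbose():
    # Stands in for code that would live in an evaluation script.
    return load_params()["random_forest"]["verbose"]


before = (training_verbose(), evaluation_verbose())

# Change the value once, in the YAML file only.
PARAMS_PATH.write_text("random_forest:\n  cv: 5\n  verbose: 1\n", encoding="utf-8")

after = (training_verbose(), evaluation_verbose())

print(before)  # (3, 3)
print(after)   # (1, 1)
```

Neither function was edited, yet both pick up the new value after the single change to the YAML file.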
Conclusion
In this article, we learned why we shouldn't use the traditional approach of storing variables in Python files. We also learned the advantages of using a YAML file to store the variables and how it helps with experiments in a machine learning project. Check out the following link for the whole code for this article.
Discover the Benefits of Storing Variables in a YAML File
replit.com
Outro
I hope you like the article. If you have any thoughts on the article, then please let me know. Also, if you liked the article, then please give a clap.
Connect with me on LinkedIn.
Know more about me on Website.
Mail me at [email protected]
Have a great day!
Published via Towards AI