Transform Your Data Science Project: Discover the Benefits of Storing Variables in a YAML File

Last Updated on August 1, 2023 by Editorial Team

Author(s): Shivamshinde

Originally published on Towards AI.

This blog post will discuss the benefits of using a YAML file as a central repository for storing variables, parameters, and hyper-parameters in a data science project. It will explain how this method of storage can improve the efficiency and organization of the project by allowing for easy access and modification of these values. The post will also provide examples and a step-by-step guide for implementing this method in a data science project.

Introduction

Machine learning and deep learning projects involve a great deal of experimentation with different parameters. This experimentation gets harder as the number of parameters grows, partly because of the manual effort needed to change the parameter values for every iteration. Luckily, there is a way to make this easier: by pairing a YAML file with the Python code, we can run different experiments with very little effort. This article demonstrates how to use a YAML file together with Python code for such experiments.

Prerequisites

Basic knowledge of the Python programming language
Basic knowledge of how the machine learning lifecycle works

Agenda

What is YAML?
Why not use the conventional way of storing the variables?
Advantages of storing the parameters centrally in the YAML file
Downloading PyYAML python library
Storing the variables in the YAML file
Storing lists and dictionaries in the YAML file
Loading the variables from the YAML file into the Python file
Conclusion

What is YAML?

Before diving straight into the topic, let’s learn some basic information about YAML.

YAML stands for ‘YAML Ain’t Markup Language’. It is a data serialization language designed to be easy for humans to read, and many people find it easier to read than XML or JSON. A YAML file only stores data; it does not contain any actions or logic. Data in a YAML file can also be loaded easily into programming languages such as Python.
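As a small illustration of that last point, the sketch below parses a YAML string into a Python dictionary with PyYAML's yaml.safe_load (PyYAML is installed in a later section of this article). The keys project and test_size are made up for this example and are not part of the fraud detection project.

```python
import yaml

# A small YAML document held in a string; the keys are hypothetical.
document = """
project: credit_card_fraud_detection
test_size: 0.2
"""

# safe_load turns the YAML mapping into an ordinary Python dict.
config = yaml.safe_load(document)
print(config)
```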

Why not use the conventional way of storing the variables?

To explain these concepts, I will use an example data science project named ‘credit card fraud detection’. The aim of the project is simple: to detect whether a given transaction is fraudulent, using information about that transaction. Some examples of such information are:

The distance between the place of the transaction and the home address of the credit card owner
The distance from the place of the last transaction
The ratio of the mean transaction price to the current transaction price
The IP address from which the transaction was made
Whether the payment was made online or offline

This detection is done by a machine learning model trained on the user's credit card transaction history.

A data science project based on machine learning has many stages, such as data exploration, data cleaning, finding a suitable machine learning model for the problem, tuning the model, and saving the model. Each of these steps creates many variables, especially the steps where a suitable algorithm is selected and where that algorithm is tuned.

The conventional way of storing variables, where each value is defined in the file that uses it, creates problems in such cases. Finding a suitable algorithm and getting the best accuracy out of it depends largely on experimenting with the algorithm's hyper-parameters. With the conventional approach, we have to go through each file and change those parameters manually for every experiment. This is tedious and error-prone. To avoid this unnecessary work and the mistakes that come with it, we use a different approach, described in the rest of this article.

Advantages of storing the parameters centrally in the YAML file

Unlike the conventional way of storing parameters in their respective files, this approach stores all the parameters in one file. Any script can then read the parameters it needs by loading this file. This approach is leaner and less prone to mistakes. A YAML file can also be used to store file paths.

One question might arise: why use a YAML file in particular? The answer lies in the very simple syntax of YAML. Other file types can be used as well, but YAML keeps a simple matter simple.
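One way to picture the central file is a single small helper that every script calls. The sketch below is only an illustration: the function name load_params is hypothetical, and a temporary file stands in for a real parameters.yaml so that the example runs anywhere.

```python
from pathlib import Path
import tempfile

import yaml


def load_params(path="parameters.yaml"):
    """Read the central parameter file and return its contents as a dict."""
    with open(path, encoding="utf-8") as f:
        return yaml.safe_load(f)


# Demo with a throwaway file; a real project would keep one parameters.yaml.
with tempfile.TemporaryDirectory() as tmp:
    params_file = Path(tmp) / "parameters.yaml"
    params_file.write_text("random_forest:\n  cv: 5\n  verbose: 3\n", encoding="utf-8")
    params = load_params(params_file)

cv = params["random_forest"]["cv"]
print(cv)
```

Any script that needs a parameter calls the same helper, so no value has to be typed into more than one place.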

Now, let’s see how it’s done using some code.

Downloading PyYAML python library

PyYAML is one of the popular third-party Python libraries for working with YAML. The library is actively maintained and is also mentioned on the official YAML website. To install it, run the following command in the terminal.

python -m pip install pyyaml

After the installation completes, use the following statement to import the library in a Python file.

import yaml

Note that although the library you installed is called PyYAML, you import the package under the name ‘yaml’ in Python code.
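A quick way to confirm that the installation worked is to print the library version and parse a tiny document. This is just a sanity check, and the key answer is a placeholder.

```python
import yaml

# PyYAML exposes its version string as yaml.__version__.
print(yaml.__version__)

# Parse a one-line document to confirm that loading works.
result = yaml.safe_load("answer: 42")
print(result)
```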

Storing the variables in the YAML file

A YAML file has a syntax somewhat similar to Python's: indentation is used to express nesting, just as it is in Python. Let’s take a look at a YAML file to understand this.

SimpleImputer:
 strategy: most_frequent
 missing_values: nan

OrdinalEncoder:
 handle_unknown: use_encoded_value
 unknown_value: 100

Here we are storing variables in two groups named SimpleImputer and OrdinalEncoder. These are the variables that are used as parameters for Scikit-Learn’s simple imputer and ordinal encoder transformers in the preprocessing step.
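To see what Python receives from these two groups, the sketch below loads the same content from a string and inspects the types. One detail is worth checking in your own project: PyYAML reads a bare nan as the string 'nan', while the YAML spelling for a floating-point NaN is .nan. Whether the original project relies on the string or converts it later is not shown, so treat the .nan variant as an assumption about the intent.

```python
import math

import yaml

document = """
SimpleImputer:
  strategy: most_frequent
  missing_values: nan

OrdinalEncoder:
  handle_unknown: use_encoded_value
  unknown_value: 100
"""

params = yaml.safe_load(document)

# Plain words load as str and whole numbers load as int.
strategy = params["SimpleImputer"]["strategy"]
unknown_value = params["OrdinalEncoder"]["unknown_value"]

# A bare `nan` is a string for PyYAML; `.nan` is the float NaN.
missing_as_written = params["SimpleImputer"]["missing_values"]
missing_as_float = yaml.safe_load("missing_values: .nan")["missing_values"]

print(type(strategy).__name__, type(unknown_value).__name__)
print(repr(missing_as_written), math.isnan(missing_as_float))
```

One plausible way to use such a group is to unpack it as keyword arguments, for example SimpleImputer(**params["SimpleImputer"]), although the original project's code is not shown here.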

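Note that we don’t need quotation marks around string values in the YAML file. For ordinary strings like the ones above, adding quotation marks makes no difference. Quotation marks do matter when a value would otherwise be read as another type: for example, PyYAML loads 100 as an integer but '100' as a string.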
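The effect of quotation marks can be checked directly. In the sketch below, the key names are made up for illustration; the behaviour shown is that of PyYAML's safe_load, which follows YAML 1.1 rules and therefore also treats an unquoted yes as a boolean.

```python
import yaml

document = """
plain: use_encoded_value
quoted: 'use_encoded_value'
number: 100
quoted_number: '100'
flag: yes
quoted_flag: 'yes'
"""

values = yaml.safe_load(document)

# Ordinary strings are identical with or without quotes.
print(values["plain"] == values["quoted"])

# Quotes keep number-like and boolean-like text as strings.
print(type(values["number"]).__name__, type(values["quoted_number"]).__name__)
print(type(values["flag"]).__name__, type(values["quoted_flag"]).__name__)
```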

Storing file paths in the YAML file works the same way as storing any other value. The following are the paths used in preprocessing the data and training our credit card fraud detection model.

data_preparation:
 training_db: Training_db
 training_db_dir: Training_Database
 table_name: trainingGoodRawDataTable
 schema_training: config/schema_training.json
 good_validated_raw_dir: data/Training_Raw_Files_Validated/Good_Raw
 master_csv: master.csv
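Once loaded, these entries are ordinary strings, so they can be turned into pathlib.Path objects and combined. The sketch below uses three of the entries above; joining master.csv onto the validated raw-data directory is an assumption made for illustration, since the article does not say where that file lives.

```python
from pathlib import Path

import yaml

document = """
data_preparation:
  schema_training: config/schema_training.json
  good_validated_raw_dir: data/Training_Raw_Files_Validated/Good_Raw
  master_csv: master.csv
"""

paths = yaml.safe_load(document)["data_preparation"]

schema_path = Path(paths["schema_training"])

# Assumption: master.csv is stored inside the validated raw-data directory.
master_csv_path = Path(paths["good_validated_raw_dir"]) / paths["master_csv"]

print(schema_path.suffix)
print(master_csv_path.as_posix())
```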

Storing lists and dictionaries in the YAML file

There are two ways to store lists and dictionaries in a YAML file. The following are the hyper-parameters used for tuning the random forest classifier model used for fraud detection.

Approach-1:

random_forest:
 cv: 5
 verbose: 3
 param_grid: {n_estimators: [10, 50, 100, 130], max_depth: [2, 3], max_features: ['auto', 'log2']}

In the first approach, we write lists and dictionaries inline, much as we do in Python: lists go in square brackets and dictionaries go in curly braces as key: value pairs. Unlike in Python, the keys do not need quotation marks.

Approach-2:

random_forest:
 cv: 5
 verbose: 3
 param_grid:
   n_estimators:
     - 10
     - 50
     - 100
     - 130
   max_depth:
     - 2
     - 3
   max_features:
     - auto
     - log2

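In the second approach, each list item goes on its own line, starts with the symbol ‘-’, and sits at the same indentation level as the other items of that list. Dictionaries are again written as key: value pairs, with nested entries indented under their parent key.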
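The two approaches are interchangeable as far as Python is concerned. The sketch below writes the same settings both ways, with the nesting of the second approach spelled out through indentation, and checks that PyYAML returns identical dictionaries.

```python
import yaml

# Approach 1: lists and dictionaries written inline.
inline_style = """
random_forest:
  cv: 5
  verbose: 3
  param_grid: {n_estimators: [10, 50, 100, 130], max_depth: [2, 3], max_features: ['auto', 'log2']}
"""

# Approach 2: one entry per line, nesting expressed by indentation.
block_style = """
random_forest:
  cv: 5
  verbose: 3
  param_grid:
    n_estimators:
      - 10
      - 50
      - 100
      - 130
    max_depth:
      - 2
      - 3
    max_features:
      - auto
      - log2
"""

inline_params = yaml.safe_load(inline_style)
block_params = yaml.safe_load(block_style)

same = inline_params == block_params
print(same)
```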

Loading the variables from the YAML file into the Python file

Suppose the ‘parameters.yaml’ file has the following content.

random_forest:
 cv: 5
 verbose: 3
 param_grid:
   n_estimators:
     - 10
     - 50
     - 100
     - 130
   max_depth:
     - 2
     - 3
   max_features:
     - auto
     - log2

Now let’s say we want to read the ‘verbose’ variable from our ‘parameters.yaml’ file into a Python file. We can do this in the following way.

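import yaml

# Read the whole YAML file into a Python dictionary
with open('parameters.yaml') as p:
    params = yaml.safe_load(p)

# Look up the value stored under random_forest -> verbose
verbose = params['random_forest']['verbose']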
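The nested values come back as ordinary Python lists and dictionaries. As a sketch of what that enables, the example below expands the loaded param_grid into every hyper-parameter combination with itertools.product. This is a stand-in used to keep the example dependency-free: a dictionary of lists is also the shape that scikit-learn's GridSearchCV accepts for its param_grid argument, but the original project's tuning code is not shown here.

```python
from itertools import product

import yaml

document = """
random_forest:
  cv: 5
  verbose: 3
  param_grid:
    n_estimators: [10, 50, 100, 130]
    max_depth: [2, 3]
    max_features: ['auto', 'log2']
"""

rf = yaml.safe_load(document)["random_forest"]
param_grid = rf["param_grid"]

# Build one dict per combination of hyper-parameter values.
names = list(param_grid)
candidates = [dict(zip(names, combo)) for combo in product(*param_grid.values())]

print(len(candidates))             # 4 * 2 * 2 = 16 combinations
print(len(candidates) * rf["cv"])  # 16 combinations * 5 folds = 80 fits
print(candidates[0])
```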

You might wonder why we go through the trouble of loading the variable from the YAML file when we could simply set verbose to 3 directly in the Python file. There is a reason behind this practice.

Suppose we use this variable in multiple files and later want to update it. If the value is hard-coded, we have to go through the files one by one and change it in each. If instead the variable is stored in the YAML file and loaded in every Python file, then changing its value once in the YAML file is reflected in every Python file that uses it.
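The effect of a single edit can be simulated in a few lines. In the sketch below, the function read_verbose is a hypothetical stand-in for any script that needs the shared setting, and calling it twice plays the role of two separate scripts; a temporary file replaces the real parameters.yaml.

```python
from pathlib import Path
import tempfile

import yaml


def read_verbose(path):
    """Stand-in for any script that loads the shared `verbose` setting."""
    with open(path, encoding="utf-8") as f:
        return yaml.safe_load(f)["random_forest"]["verbose"]


with tempfile.TemporaryDirectory() as tmp:
    params_file = Path(tmp) / "parameters.yaml"

    params_file.write_text("random_forest:\n  verbose: 3\n", encoding="utf-8")
    before = (read_verbose(params_file), read_verbose(params_file))

    # A single edit to the YAML file...
    params_file.write_text("random_forest:\n  verbose: 1\n", encoding="utf-8")

    # ...is picked up by every consumer the next time it loads the file.
    after = (read_verbose(params_file), read_verbose(params_file))

print(before, after)
```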

Conclusion

In this article, we learned why the traditional approach of storing variables in Python files becomes a problem as a project grows. We also learned the advantages of using a YAML file to store variables and how it helps with experiments in a machine learning project. Check out the following link for the whole code for this article.

Discover the Benefits of Storing Variables in a YAML File

replit.com

Outro

I hope you like the article. If you have any thoughts on the article, then please let me know. Also, if you liked the article, then please give a clap.

Connect with me on LinkedIn.

Know more about me on Website.

Mail me at [email protected]

Have a great day!

Published via Towards AI
