Author(s): Haryo Akbarianto Wibowo
My Tools to Increase ML Model Development Productivity
Focused on ML development for text
Hello everyone, welcome to my Medium article. In this article, I want to share the tools I use when developing Machine Learning (ML) models, based on my experience. Specifically, I will focus on Natural Language Processing (NLP) tools, since I currently handle mostly NLP projects.
In my experience, many productivity problems come up when developing ML models; often, I find that what I do is simply not productive. For example, when I log experiment results, I go back and forth between the output folder and a spreadsheet file, copying the results over manually. It takes a lot of time.
Another problem comes from code reusability. As a computer science graduate, I care about reusable code. When I want to scale a previous ML model, I often have trouble extending the code, and in the end I rewrite it to make it tidier. Once more, that also takes a lot of time.
Using the ML tools below, I can overcome some of these problems, especially the time problem. It takes time to learn these tools in the beginning, but once I became comfortable with them, my productivity was noticeably higher than before.
This article might interest you if you want to boost your productivity when developing ML models, especially in NLP. Note that these picks reflect my preferences and might not suit you. All of these tools are free to use.
This list focuses on developing the model, not deploying it.
Table of Contents
- Integrated Development Environment (IDE)
- Machine Learning Libraries
- NLP Text Processing Libraries
- Experiment Logger & Hyperparameter Optimization Tools
- Data Exploration Libraries
Integrated Development Environment (IDE)
Currently, the most popular programming language for ML tasks is Python, so Python is what I use to write the code for developing ML models. Below are the IDEs I use.
Why do I use PyCharm?
When developing code, I tend to use Object-Oriented Programming (OOP) to keep the code reusable and follow the DRY principle. When I write a class for something (e.g., Deep Learning models), PyCharm helps me find the functions I need, together with their documentation. It helps me a lot when developing code. It also shows code-style warnings that help me keep my code tidy, and its error hints make my code less error-prone. Without them, it would take me longer to make the code work.
When do I use PyCharm?
- Developing OOP code (writing classes).
- Developing a complicated module with several dependencies.
Why do I use Visual Studio Code?
I use Visual Studio Code when the code I want to develop is not too complicated (e.g., writing a few data-cleaning functions). On my PC, Visual Studio Code is more lightweight than PyCharm in its CPU and RAM usage. Its error checking is not as good as PyCharm's, although it is still useful in some cases, such as hinting at a missing variable declaration. What I like most about Visual Studio Code is how it handles Git merge conflicts: it highlights where in my code the conflict is, so I can take action on it. In my opinion, it also has a better User Experience (UX) than PyCharm when I do code versioning and remote development (SSH to a remote server).
When do I use Visual Studio Code?
- Making small corrections to my code
- Developing a small piece of code (e.g., writing a function)
- Quickly checking source code, since PyCharm takes longer to open than Visual Studio Code
Why do I use JupyterLab?
JupyterLab, Jupyter’s “next-generation” notebook interface, is a popular IDE in the Data Science community. It has a beautiful interface and is easy to use. I want to emphasize its live-code feature, where you can run code and see the output immediately. It can also render plot visualizations, which I sometimes need for Exploratory Data Analysis (EDA).
I use JupyterLab when I do EDA. For example, when analyzing text, I sometimes plot word frequencies and inspect the output. I also use JupyterLab as a code playground. When I develop a Deep Learning model, I often build each of the model’s layers step by step and check its output in the notebook: I separate each layer and verify whether its output is right. Once I think the code is correct, I move it into a ‘.py’ file using another IDE to mitigate errors.
I avoid writing reusable code in a notebook. Joel Grus, formerly a researcher at the Allen Institute for AI (the team behind AllenNLP), pointed out the reproducibility problems of notebooks in his “I Don’t Like Notebooks” talk. That’s why, when I develop runnable code that is meant to be reused, I write it as a module, so people can easily use my code to train or test the model.
When do I use JupyterLab?
- As a playground for testing code
- EDA on data and on a model’s outputs
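The layer-by-layer checking workflow above can be sketched in a notebook cell like this. This is a minimal illustration, assuming PyTorch; the dimensions are arbitrary placeholders, not from any real model.

```python
import torch
import torch.nn as nn

# Build each layer separately, as I would in a notebook cell,
# and inspect the output shape before moving on.
x = torch.randn(4, 16)      # a dummy batch: 4 samples, 16 features

linear = nn.Linear(16, 8)
h = linear(x)
print(h.shape)              # expect a (4, 8) activation

h = nn.ReLU()(h)
out = nn.Linear(8, 2)(h)
print(out.shape)            # expect (4, 2) logits
```

Only after each intermediate shape looks right do I copy the layers into a proper `nn.Module` in a ‘.py’ file.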
Machine Learning Libraries
These are some libraries that I use when making and training a model.
Why do I use scikit-learn?
A popular, practically must-use library for developing classical ML models in Python. It’s easy to use and covers most of the shallow machine learning algorithms I typically need. It also provides plenty of preprocessing utilities, such as TfidfVectorizer (text vectorization with the TF-IDF algorithm), MinMaxScaler (numerical scaling), and train_test_split (splitting data into train and test sets), which make my life easier. Rather than coding these from scratch, I can simply use this library.
When do I use scikit-learn?
- Developing classical/shallow models
- Data preprocessing
- Splitting data into train, validation, and test sets
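As a small sketch of how these pieces fit together, here is a TF-IDF plus shallow classifier pipeline on a toy dataset (the texts and labels are made up for illustration):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Toy sentiment data; a real project would load an actual dataset.
texts = ["great movie", "awful film", "loved it", "hated it",
         "great film", "awful movie", "loved this", "hated this"]
labels = [1, 0, 1, 0, 1, 0, 1, 0]

# Split the data into train and test sets.
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, random_state=42)

# TF-IDF features feeding a shallow classifier, chained in one pipeline.
model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(X_train, y_train)
preds = model.predict(X_test)
print(preds)
```

The pipeline object can then be evaluated or pickled as a single unit, which is part of what makes scikit-learn so convenient.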
Why do I use PyTorch?
My main reason for using PyTorch is that it is Pythonic and easy to learn. It’s also a popular open-source Deep Learning framework that is actively developed by the community. With its dynamic-graph support, I can easily debug my code, and there are many community-built extensions that make developing with PyTorch productive. Thanks to this ease of use, I can build a prototype model fast.
When do I use PyTorch?
- Developing Deep Learning models
- Experimenting with model architectures
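A minimal sketch of what a PyTorch model looks like, assuming a tiny bag-of-embeddings text classifier; the class name and all dimensions are hypothetical:

```python
import torch
import torch.nn as nn

class TextClassifier(nn.Module):
    """A tiny bag-of-embeddings classifier (illustrative dimensions)."""
    def __init__(self, vocab_size=1000, embed_dim=32, num_classes=2):
        super().__init__()
        # EmbeddingBag averages the embeddings of each token sequence.
        self.embedding = nn.EmbeddingBag(vocab_size, embed_dim)
        self.fc = nn.Linear(embed_dim, num_classes)

    def forward(self, token_ids):
        return self.fc(self.embedding(token_ids))

model = TextClassifier()
batch = torch.randint(0, 1000, (4, 10))  # 4 sequences of 10 token ids
logits = model(batch)
print(logits.shape)  # one logit vector per sequence
```

Because the graph is built dynamically at each forward pass, I can drop a breakpoint or a print anywhere in `forward` and debug it like ordinary Python.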
Why do I use PyTorch Lightning?
PyTorch Lightning is a wrapper around PyTorch that keeps the code organized. When developing a PyTorch model, I often find that my code ends up unorganized: while researching a model, I usually focus on making the code correct rather than tidy, which makes it messy and hard to scale. Furthermore, across model development projects, I keep seeing the same ‘patterns’ in the code that I shouldn’t have to write again and again.
PyTorch Lightning comes to the rescue. I love this library because it keeps my code organized by defining interfaces that I should follow. It also ships features that are easy to use, such as Early Stopping and Model Checkpointing. These make my code tidier, and I can scale it quickly when needed.
When do I use PyTorch Lightning?
- Whenever I develop Deep Learning models with PyTorch, I wrap the code in this library.
Why do I use HuggingFace’s Transformers?
HuggingFace’s Transformers makes my life easier when developing Transformer-based NLP models. To load a pre-trained model, I can simply type the model name, and it will be downloaded and loaded automatically. After I load the model, I can fine-tune it or customize it according to my needs.
I often wrap HuggingFace’s pre-trained models in a PyTorch Lightning module.
When do I use HuggingFace’s Transformers?
- Developing Transformer-based Deep Learning NLP models
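Loading a pre-trained checkpoint by name looks like this sketch; `bert-base-uncased` is just one example model, and the classification head here is freshly initialized, so it still needs fine-tuning:

```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Any checkpoint on the HuggingFace Hub works the same way.
name = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForSequenceClassification.from_pretrained(name, num_labels=2)

# Tokenize a sentence and run it through the model.
inputs = tokenizer("I love this movie!", return_tensors="pt")
outputs = model(**inputs)
print(outputs.logits.shape)  # one row of logits, two labels
```

The same `AutoTokenizer`/`AutoModel...` pattern applies across architectures, which is what makes swapping models so cheap.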
NLP Text Processing Libraries
These are some libraries that I use to preprocess text data.
Why do I use spaCy?
I mainly use spaCy to tokenize text during preprocessing. It also lets me easily extract several text features, such as part-of-speech tags and word lemmas.
When do I use spaCy?
- Text Feature Extraction
- Text Tokenization
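A minimal tokenization sketch: a blank English pipeline tokenizes without downloading anything, while POS tags and lemmas would require a trained pipeline such as `en_core_web_sm` (installed separately via `python -m spacy download en_core_web_sm`).

```python
import spacy

# Blank pipeline: tokenizer only, no model download needed.
nlp = spacy.blank("en")
doc = nlp("The cats are sitting on the mat.")
tokens = [t.text for t in doc]
print(tokens)

# With a trained pipeline (assumed installed), POS/lemmas become available:
#   nlp = spacy.load("en_core_web_sm")
#   [(t.text, t.pos_, t.lemma_) for t in nlp("The cats are sitting.")]
```

Note how punctuation is split off as its own token, which already handles a lot of the fiddly cases in text preprocessing.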
Why do I use NLTK?
Similar to spaCy. I mainly use NLTK for features that spaCy doesn’t provide, such as tokenizing tweets.
When do I use NLTK?
- Text Feature Extraction
- Text Tokenization
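For instance, NLTK’s `TweetTokenizer` keeps Twitter-specific tokens like mentions, hashtags, and emoticons intact, where a generic tokenizer would split them apart. A small sketch:

```python
from nltk.tokenize import TweetTokenizer

tweet = "@user I looove this!!! :) #nlp"
tokens = TweetTokenizer().tokenize(tweet)
print(tokens)  # "@user", ":)" and "#nlp" survive as single tokens
```

`TweetTokenizer` also has options like `reduce_len=True` to shorten exaggerated character runs ("looove"), which is handy for noisy social media text.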
Why do I use HuggingFace’s tokenizers?
When developing an NLP model, we sometimes want to try a subword tokenization algorithm (e.g., ‘eating’ → [‘eat’, ‘ing’]). I use this library to build subword representations: I can train a tokenizer on my own data and then use it for tokenization. It’s implemented in Rust, so it’s fast.
When do I use HuggingFace’s tokenizers?
- Subword tokenization
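Training a subword tokenizer from scratch looks like this sketch, using a BPE model on a toy in-memory corpus (real projects would train from files, with a much larger vocabulary):

```python
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.trainers import BpeTrainer

# Toy corpus, repeated so the trainer sees enough co-occurrences.
corpus = ["eating and drinking", "eat drink", "the cat is eating"] * 50

tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = Whitespace()
trainer = BpeTrainer(vocab_size=60, special_tokens=["[UNK]"])
tokenizer.train_from_iterator(corpus, trainer)

# The learned merges decide the actual subword splits.
print(tokenizer.encode("eating").tokens)
```

The trained tokenizer can be saved to a single JSON file and reloaded later, or handed to Transformers as a fast tokenizer.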
Experiment Logger & Hyperparameter Optimization Tools
These are some tools that I use to log experiments and tune hyperparameters when developing a model.
Why do I use WandB?
I use WandB simply because I find it easy to use for logging my experiments, and it integrates easily with PyTorch Lightning. It helps me a lot in logging and visualizing my experiment results, and with WandB I can always go back to an old experiment. I like its simple interface for collecting experiment logs.
WandB also has a hyperparameter optimization tool called Sweeps, which supports several optimization strategies: grid search, random search, and Bayesian optimization. It logs the results of the optimization and shows the importance of each hyperparameter.
When do I use WandB?
- Experiment Logging
- Hyperparameter Optimization
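A sweep configuration can be written as a plain dictionary (the YAML form is equivalent). This is a sketch; the metric name, parameter names, and project name are hypothetical, and actually running it requires a WandB account:

```python
# Sweep configuration as a plain dict; "bayes" could also be
# "grid" or "random".
sweep_config = {
    "method": "bayes",
    "metric": {"name": "val_loss", "goal": "minimize"},
    "parameters": {
        "lr": {"min": 1e-5, "max": 1e-2},
        "batch_size": {"values": [16, 32, 64]},
    },
}

# Usage sketch (needs wandb installed and a logged-in account):
#   import wandb
#   sweep_id = wandb.sweep(sweep_config, project="my-project")
#   wandb.agent(sweep_id, function=train)  # train() reads wandb.config
print(sweep_config["method"])
```

Each agent run then reads its hyperparameters from `wandb.config` and logs the metric named in the config, which is how Sweeps steers the search.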
Version Control Tools
These are some tools that I use to version my code, data, and models.
Why do I use Git?
When I develop source code, I make sure to version it so it stays trackable. Since I often work together with others, Git makes collaboration seamless. Even on solo projects, I habitually use Git to version my code.
When do I use Git?
- Writing source code for any project.
- Collaborating with others.
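The basic flow when starting to version a project can be sketched as follows (the project and file names are placeholders; the inline `-c` identity flags just let the example run on a machine with no global Git config):

```shell
set -e
# Create a project and initialize a repository in it.
mkdir -p demo-project
git init -q demo-project
echo "print('hello')" > demo-project/train.py

# Stage and commit the first version of the code.
git -C demo-project add train.py
git -C demo-project -c user.name="Me" -c user.email="me@example.com" \
    commit -q -m "Add training script"

# Every version is now trackable.
git -C demo-project log --oneline
```

From here, branches and remotes are what make the collaboration side seamless.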
Why do I use DVC?
To make my model experiments reproducible, I also version the data and model for each commit in Git. Git itself is not suited to handling large files (model outputs or data), so I need another tool, and the one I use to version large files is DVC. I can specify where to store the files, such as a Virtual Private Server, Google Cloud Storage, or even Google Drive, then push to that storage and pull the files matching a given commit in the Git repository.
When do I use DVC?
- ML projects that involve large files (data or models).
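The push/pull workflow described above looks roughly like this command sketch (the remote URL and file paths are hypothetical placeholders):

```shell
# Version a large data file alongside the code.
dvc init
dvc remote add -d storage gdrive://<folder-id>   # or s3://, gs://, ssh://...
dvc add data/train.csv                           # writes data/train.csv.dvc
git add data/train.csv.dvc .gitignore .dvc/config
git commit -m "Track training data with DVC"
dvc push                                         # upload the data to the remote

# Later, on another machine or at an older commit:
#   git checkout <commit> && dvc pull
```

Git only versions the small `.dvc` pointer files, while DVC moves the heavy data itself, which is what keeps each commit reproducible.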
Data Exploration Libraries
These are some tools that I use to do data manipulation and analysis.
Why do I use pandas?
I use pandas to manipulate and analyze the data that will be used to train an ML model. When doing EDA, I always use pandas to explore the data according to my curiosity, and I also rely on it for data processing and cleaning. Moreover, pandas is easy and flexible to use.
When do I use Pandas?
- Data Cleaning and Preprocessing
- Data Analysis and Visualization
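A small sketch of the kind of cleaning and quick EDA described above; the columns and values are a made-up toy dataset:

```python
import pandas as pd

# Toy text dataset with the kind of mess that usually needs cleaning.
df = pd.DataFrame({
    "text": ["  Great product! ", "bad SERVICE", None, "ok"],
    "label": [1, 0, 0, 1],
})

# Cleaning: drop rows with missing text, normalize whitespace and case.
df = df.dropna(subset=["text"])
df["text"] = df["text"].str.strip().str.lower()

# Quick EDA: class balance, plus a length column for later plotting.
df["n_chars"] = df["text"].str.len()
print(df["label"].value_counts().to_dict())
```

From here, `df.plot` or a plotting library takes over for the visualization side of EDA.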
Those are the tools I often use when developing ML models. Note that they were selected purely based on my preference, so some might not suit you; everyone has their own unique taste in increasing their productivity. If you want to suggest other nice productivity tools, feel free to share them in the comment section 😉.
Thank you for reading my article. I hope my article is useful for you 🙂. See ya in my next article!
My Tools to Increase ML Model Development Productivity was originally published in Towards AI on Medium, where people are continuing the conversation by highlighting and responding to this story.