Beginner Tips for Getting Started with Azure Machine Learning
Last Updated on July 26, 2023 by Editorial Team
Author(s): Andrew Blance
Originally published on Towards AI.
Getting ready for the DP-100 Azure Data Science Associate exam.
The end-to-end pipeline for a data science model is diverse and winding. Between exploratory data analysis, model training, deployment, and managing those models, there are a lot of moving parts. Azure Machine Learning is Microsoft's cloud service to help developers along this journey. It offers a wide set of tools to track your model's development, version your data, securely deploy your model, and more.
I think many good APIs and software share some things in common. One of these is that they are good at predicting your desires, or at least have an understanding of how the user will want to interact with the system. When I write code and think "oh boy, I wish there was a way to do this really easily", and then stumble across a feature of the language that does that exact thing in one line, I feel like the writer of the library has done something really special. This happens a lot in Python I think, with Pandas or Numpy being designed in a way that seems to understand how I will want to interact with it (except for times and dates, which kinda suck in everything).
Currently, I am preparing for the DP-100, a Microsoft exam about using Azure ML to do data science. I've spent a lot of time learning about the ecosystem and getting to know how it all works, and I find myself thinking quite a lot about how well designed much of it is: how many of its features make my life easier by removing the need to write a lot of code, because a smart function for the job has already been implemented.
I've not really written about data science before, but I thought I would try it here. There is already a lot of great material out there about the DP-100, so instead, I thought I would try something slightly different. This is a list of things that complement the DP-100 and go well with the syllabus, or in some cases, things I thought were pretty neat from it. These aren't full tutorials on how to use the features, just lil suggestions of things to look at. Enjoy!
Visual Studio Code
Ok, so this is sometimes mentioned in the syllabus. However, I'll mention it again since Azure ML integrates with it so well.
Within Visual Studio Code, the Azure Machine Learning extension gives you access to your Workspaces, Datasets, Computes, and so on. Basically, it allows you to use VSCode as your IDE while retaining the functionality of Azure ML. Once you connect to a running compute, you can access the files stored on it and run your code the way you would on your local machine. In the Azure ML browser interface you are largely limited to using notebooks, whereas here you can write scripts as you please!
Standardisation
From my experience learning programming, I think there may be a distinction between "hard" and "soft" programming skills. I'm gonna call the "hard" skills the pure coding: the language itself. The "soft" stuff is literally everything else around it. I'm not even sure that this is a good distinction to make; in fact, splitting these in two probably results in worse code. However, I mention it because I think sometimes when you learn to code you are subtly trained to make the distinction. In my experience, coding courses and textbooks focus almost solely on the "hard" skills, and leave the "soft" stuff as an exercise for the reader.
I'm not much of a programmer; I still have a huge amount of room for improvement. I think much of where I have got better has come from embracing the softer side and a thoughtful, informed rejection of the "hard". A lot of my problems were typical ones: "oops, I wrote this code 3 months ago and forgot what it does", "oops, I wish I could go back to an earlier version of the code" or "oops, I've been given someone else's code and have no idea how to use it". I feel a lot of these are solved not by being able to write speedier functions, but with DevOps and standards.
Anyway, that is a big introduction to simply say: try standardising stuff. Microsoft has recommended naming conventions for Azure resources, and there are templates out there for laying out your coding projects (I've played around with this, badly, on Github). Standardising things helps new people come into a project, helps you when you join a project you haven't been on, and helps you when you return to code you haven't seen in ages. Things will always be named in a consistent manner, and projects will be laid out in familiar ways.
For example, you could have a resource group for each project, named like:
rg-example-dev-001
This tells you a lot of information already: you know it's a resource group (rg), you have an idea of its purpose (it's for a project called example), and you know it's for the dev build (rather than uat or prod). Now, inside it you could make an Azure Machine Learning Workspace and call it:
mlw-example-dev-001
This style can be followed for everything else, and should hopefully mean everything stays neat and tidy.
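The convention is mechanical enough that you can generate names in code. Here is a minimal sketch; the prefix table and the zero-padded instance number are my assumptions modelled on the rg-example-dev-001 and mlw-example-dev-001 examples above, not an official Microsoft list:

```python
# Build Azure resource names of the form <prefix>-<project>-<env>-<nnn>.
# The prefixes here are assumptions based on the examples in the text.
PREFIXES = {
    "resource_group": "rg",
    "ml_workspace": "mlw",
}

ENVIRONMENTS = {"dev", "uat", "prod"}


def resource_name(kind: str, project: str, env: str, instance: int = 1) -> str:
    """Return a name like 'rg-example-dev-001'."""
    if env not in ENVIRONMENTS:
        raise ValueError(f"unknown environment: {env!r}")
    return f"{PREFIXES[kind]}-{project}-{env}-{instance:03d}"


print(resource_name("resource_group", "example", "dev"))  # rg-example-dev-001
print(resource_name("ml_workspace", "example", "dev"))    # mlw-example-dev-001
```

A helper like this also gives you one obvious place to enforce the convention, rather than relying on everyone remembering it.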
CI/CD
Using some kind of version control is absolutely vital. This is when code is sent to a "repo" for safekeeping. Being able to track changes and revisions to your code is an absolute lifesaver. It's something I think a lot of coding tutorials and textbooks leave out, and I'll admit I spent way too long never using Git at all (just a monumentally terrible mistake).
Continuous integration, then, is when code is automatically checked whenever it is pushed to a repo. This might involve automatically running all the tests you've written, checking that the code builds, or running a linter.
At work, we use Azure DevOps, which has lots of fun tracking things for project management. I use Github for all my personal stuff, and it has a wonderful feature where you can launch notebooks in-browser by hitting the . key on your keyboard. It's amazing. Both have robust CI/CD offerings: Github has Github Actions and Azure DevOps has Pipelines, and both work similarly.
In Github, you can create a file .github/workflows/main.yml which looks like:

name: Linting and Testing

on: [push]

jobs:
  build:
    runs-on: ubuntu-latest
    strategy:
      matrix:
        python-version: [3.7, 3.8, 3.9]
    steps:
      - uses: actions/checkout@v2
      - name: Set up Python ${{ matrix.python-version }}
        uses: actions/setup-python@v2
        with:
          python-version: ${{ matrix.python-version }}
      - name: Install dependencies
        run: |
          python -m pip install --upgrade pip
          pip install -r requirements.txt
      - name: Test with pytest
        run: |
          pytest --ignore=docs
Every time you push your code, this workflow will create a Python instance (either 3.7, 3.8 or 3.9) with packages installed from the contents of requirements.txt, then run pytest. Now, when you check the project's Github page you can take the results of the CI pipeline into account before you accept a pull request. Azure Pipelines use a very similar syntax. It's a great way of making sure code passes certain checks before accepting it.
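For completeness, here is the sort of test file that workflow would pick up. The function and test names are made up for illustration; in a real project the function under test would live in your package and be imported, rather than defined inline:

```python
# tests/test_example.py -- a minimal pytest file of the kind the CI job runs.
# `add` is a hypothetical project function, inlined so the file is self-contained.


def add(a: int, b: int) -> int:
    """Toy function standing in for real project code."""
    return a + b


def test_add():
    assert add(2, 3) == 5


def test_add_is_commutative():
    assert add(2, 3) == add(3, 2)
```

pytest discovers any test_*.py files and runs every function whose name starts with test_, so the workflow needs no extra configuration beyond installing the requirements.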
PS: Microsoft, please release the . key thing for Azure DevOps!
Sharing Environments
For each project, I create a Python environment, driven by a requirements.txt file. This file, as hinted at above, is also used to create the Python environment for the CI/CD pipeline.

When you submit jobs in Azure Machine Learning using the run methods, you need to specify the environment to use. I (wrongly) imagined at the time that I would have to create a whole new environment by looping through my requirements file and passing its contents to .add_pip_package() one package at a time, eventually recreating the same environment as everywhere else. However, it's much easier than that.

Firstly, you can use .from_pip_requirements() and pass the whole requirements file into it in one go. Or, if you've already created a conda environment, you can just pass that to .from_existing_conda_environment().

Then, if you register this environment, you can see it in Azure ML's "Environments" tab! Now, you should have consistent environments across all parts of your project.
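In code, the whole thing is only a few lines. This is a sketch against the v1 azureml-core SDK, assuming you have already downloaded a config.json for your workspace from the portal; the environment name is made up:

```python
from azureml.core import Environment, Workspace

# Assumes config.json (downloaded from the Azure portal) sits next to this script
ws = Workspace.from_config()

# One call ingests the whole requirements.txt -- no looping over
# .add_pip_package() required
env = Environment.from_pip_requirements(
    name="example-env",            # hypothetical name
    file_path="requirements.txt",
)

# Registering the environment makes it appear in the "Environments" tab,
# and lets every run in the project reuse it by name
env.register(workspace=ws)
```

Because this won't run without an authenticated workspace, treat it as a shape to follow rather than something to paste in verbatim.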
Setting a Budget
I am sorry to break this to you: one day you will leave a VM or compute or something running when you thought you turned it off. This might be for an hour, or for weeks, but it's gonna happen. I try not to think about how much I've accidentally spent; it's not good…
In your resource group or subscription, you can set a budget, which can help you stop this problem before it becomes a problem. You can put in the amount you think you should be spending, and set thresholds at which you want to be warned as you approach it.
Final thoughts
So those are a few things that I've found useful when using Azure Machine Learning. Each point could really be its own article; I've done them all a disservice, really! There are also so many other things I've found that I think are neat (the "Model" tab in Azure ML, Labeller, and Synapse) that I would also love to talk about. If people are interested, I might come back and write some more about everything!
However, these five things, VSCode, standardisation, CI/CD, environment management, and budgets, are all good tools to build upon some of the DP-100 content! I might return to them later for a more in-depth exploration, but hopefully you enjoyed what was here!
Andrew is a data scientist at Waterstons, an IT consultancy. He hosts a silly podcast called Brains on the Outside and is in a genuinely terrible band called Dioramarama. Regardless, Durham University deemed him sensible enough to make him a Doctor of particle physics (and kinda machine learning and quantum computing). He can run 5km pretty fast (22:05) and thinks modern Star Trek isn't as bad as people make it out to be.