Last Updated on December 24, 2021 by Editorial Team

Author(s): Ömer Özgür

How To Automate Data Processing With AWS Batch

Photo by Kristin Snippe on Unsplash

While solving a problem in the field of machine learning, we experiment by changing parameters such as the model, augmentation methods, and the data. An important task for a data scientist is to automate these repetitive tasks.

Automating the time-consuming steps of these experiments will make our lives easier and the development process faster.

Demo Architecture

Image by author

In this demo, we will build a simple pipeline. When a CSV is uploaded to an S3 bucket, it triggers a Lambda function, is processed by AWS Batch, and the result is written to another S3 bucket. You can edit the Python code according to your needs. The example follows these steps:

  • Pushing our Docker image to ECR
  • Creating a Batch job
  • Creating a Lambda function

Lambda vs Batch

Basically, both Lambda and Batch exist to run a given task in a serverless way. The main difference is runtime: Lambda invocations are limited to 15 minutes, so if your process takes less than that, use Lambda; otherwise, use Batch.

You can also use GPUs in Batch, whereas Lambda only runs on CPUs.

Let’s Get Started

Note: Your Lambda function, S3 bucket, and Batch resources should be in the same region.

You can use your own computer or create an EC2 instance with Docker installed. The commands below are for Linux, but you can find the Windows versions in the AWS documentation.

First of all, we need to install awscli and set up the credentials.

sudo apt install awscli -y
aws configure
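
To verify that the credentials were picked up, here is a quick sanity check using boto3’s STS client (a minimal sketch, not part of the original walkthrough):

import boto3

# Should print your 12-digit account ID if `aws configure` succeeded.
sts = boto3.client('sts')
print(sts.get_caller_identity()['Account'])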

After configuration, we can go to Elastic Container Registry and create a new repository.
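
If you prefer scripting this step to clicking through the console, here is a minimal boto3 sketch (the repository name pyproject is an assumption, matching the project folder used below):

import boto3

# Hypothetical: create the ECR repository that will hold the demo image.
ecr = boto3.client('ecr', region_name='us-east-2')
repo = ecr.create_repository(repositoryName='pyproject')
print(repo['repository']['repositoryUri'])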

Image by the author from AWS

The commands here will be specific to you. First, let’s authenticate the Docker client.

aws ecr get-login-password --region us-east-2 | docker login --username AWS --password-stdin your_acct_id.dkr.ecr.us-east-2.amazonaws.com

We need three files: a Dockerfile, requirements.txt, and main.py. All of my files are under the pyproject folder.

cd pyproject

Our Dockerfile:

FROM python:3.6
WORKDIR /script
# Install dependencies first so Docker can cache this layer.
COPY requirements.txt .
RUN pip install -r requirements.txt
COPY main.py .
ENTRYPOINT [ "python", "main.py" ]

Our requirements.txt. Add any libraries you need:

tensorflow==2.4.3
numpy
pandas
boto3
s3fs
Pillow==8.0.1

Our basic main.py configures AWS credentials for reading from and writing to S3, and takes the path of the file as a command-line argument.

import os
import sys

import pandas as pd
import s3fs  # lets pandas read/write s3:// paths directly

# Credentials are hardcoded here for simplicity; prefer an IAM role in production.
os.environ['AWS_ACCESS_KEY_ID'] = 'xxxxxxxxxxxxxxxxxxxxxx'
os.environ['AWS_SECRET_ACCESS_KEY'] = 'xxxxxxxxxxxxxxxxxxxxxxxxx'
os.environ['AWS_REGION'] = 'us-east-2'
os.environ['S3_ENDPOINT'] = 'https://s3-us-east-2.amazonaws.com'
os.environ['S3_VERIFY_SSL'] = '0'

if __name__ == "__main__":
    # The Batch command override repeats the ENTRYPOINT arguments,
    # so the CSV path arrives as sys.argv[3].
    csv_path = sys.argv[3]
    df = pd.read_csv(csv_path)
    df = df[df["Age"] > 20]
    # Note: S3 bucket names cannot contain underscores.
    df.to_csv("s3://auto-proc/processed.csv", index=False)
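
Before building the image, you can sanity-check the script locally. A hedged sketch (the S3 path is a hypothetical example; the two filler arguments mimic the repeated ENTRYPOINT so the path lands at sys.argv[3]):

import subprocess

# Hypothetical local test: Batch effectively runs
# `python main.py python main.py <full_path>`.
subprocess.run(
    ["python", "main.py", "python", "main.py",
     "s3://auto-proc/input/example.csv"],
    check=True)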

If our files are okay, we can build our Docker image and push it to ECR:

docker build -t pyproject .
docker tag pyproject:latest your_acct_id.dkr.ecr.us-east-2.amazonaws.com/pyproject:latest
docker push your_acct_id.dkr.ecr.us-east-2.amazonaws.com/pyproject:latest

Creating Batch Job

Image by author from AWS

The most important part of creating a Batch job is entering the Image URI of the image we pushed to ECR earlier in the Image section. We will use the jobQueue and jobDefinition names entered during the Batch creation process when triggering jobs from the Lambda.
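
The job definition can also be registered with boto3 instead of the console; a minimal sketch (the definition name, vCPU, and memory values are assumptions for illustration):

import boto3

batch = boto3.client('batch', region_name='us-east-2')
# Hypothetical values; point 'image' at the ECR URI pushed above.
batch.register_job_definition(
    jobDefinitionName='pyproject-job-def',
    type='container',
    containerProperties={
        'image': 'your_acct_id.dkr.ecr.us-east-2.amazonaws.com/pyproject:latest',
        'resourceRequirements': [
            {'type': 'VCPU', 'value': '2'},
            {'type': 'MEMORY', 'value': '2048'},
        ],
    })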

In a serverless architecture, we are not dependent on a single machine. When a task is assigned, the Docker image we created runs on a machine with the desired resources and then releases that machine.

Creating Lambda Function And S3 Trigger

Image by the author from AWS

In the architectural plan, the CSV file is uploaded to S3 and triggers the Lambda. The triggered Lambda submits a job to Batch with the uploaded file’s path. The Lambda code:

import urllib.parse
import boto3

s3 = boto3.client('s3')
client = boto3.client('batch', 'us-east-2')

def lambda_handler(event, context):
    # Bucket and key of the uploaded file that triggered the event.
    bucket = event['Records'][0]['s3']['bucket']['name']
    key = urllib.parse.unquote_plus(
        event['Records'][0]['s3']['object']['key'], encoding='utf-8')
    response = s3.get_object(Bucket=bucket, Key=key)
    # Rebuild the full S3 path to hand to the Batch container.
    full_path = "s3://{change here}/" + key.split("/")[1]
    print(full_path)
    csv_job = client.submit_job(
        jobName='example_job',
        jobQueue='{change here}',
        jobDefinition='{change here}',
        containerOverrides={
            'command': ["python", "main.py", full_path]})
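
The S3 trigger itself can be added from the Lambda console, or scripted; here is a sketch under the assumption that S3 has already been granted permission to invoke the function (via lambda add-permission, omitted here):

import boto3

s3 = boto3.client('s3')
# Hypothetical bucket name and function ARN; replace with your own.
s3.put_bucket_notification_configuration(
    Bucket='{change here}',
    NotificationConfiguration={
        'LambdaFunctionConfigurations': [{
            'LambdaFunctionArn': 'arn:aws:lambda:us-east-2:{change here}:function:{change here}',
            'Events': ['s3:ObjectCreated:*'],
        }]})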

We can see the submitted jobs in the Batch console. By choosing more vCPUs in the compute environment, we can run multiple Docker containers at once.
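
To check a job programmatically rather than in the console, a small sketch using describe_jobs (the job ID placeholder is hypothetical; submit_job’s response includes it under 'jobId'):

import boto3

batch = boto3.client('batch', region_name='us-east-2')
# Hypothetical job ID taken from the submit_job response.
job = batch.describe_jobs(jobs=['{change here}'])['jobs'][0]
print(job['jobName'], job['status'])  # e.g. example_job RUNNING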

Image by the author from AWS

Conclusion

Data pipelines are an indispensable part of machine learning, and AWS Lambda and Batch are well suited to this kind of need. Automate, and make life easier!

