Efficient Pandas: Using Chunksize for Large Datasets
Last Updated on December 10, 2020 by Editorial Team
Author(s):Β Lawrence Alaso KrukruboΒ
Exploring large data sets efficiently usingΒ Pandas
Data Science professionals often encounter very large data sets with hundreds of dimensions and millions of observations. There are multiple ways to handle large data sets. We all know about the distributed file systems like Hadoop and Spark for handling big data by parallelizing across multiple worker nodes in a cluster. But for this article, we shall use the pandas chunksizeΒ attribute or get_chunk() function.
Imagine for a second that youβre working on a new movie set and youβd like to know:-
1. Whatβs the most common movie rating from 0.5 to 5.0
2. Whatβs the average movie rating for most movies produced.
To answer these questions, first, we need to find a data set that contains movie ratings for tens of thousands of movies. Thanks to Grouplens for providing the Movielens data set, which contains over 20 million movie ratings by over 138,000 users, covering over 27,000 different movies.
This is a large data set used for building Recommender Systems, And itβs precisely what we need. So letβs extract it using wget. Iβm working in Colab, but any notebook or IDE is fine.
!wget -O moviedataset.zip https://s3-api.us-geo.objectstorage.softlayer.net/cf-courses-data/CognitiveClass/ML0101ENv3/labs/moviedataset.zip
print('unziping ...')
!unzip -o -j moviedataset.zip
Unzipping the folder displays 4 CSV files:
links.csv
movies.csv
ratings.csv
tags.csv
Our interest is on the ratings.csv data set, which contains over 20 million movie ratings for over 27,000 movies.
# First let's import a few libraries import pandas as pd import matplotlib.pyplot as plt
Letβs take a peek at the ratings.csv file
ratings_df = pd.read_csv('ratings.csv') print(ratings_df.shape) >> (22884377, 4)
As expected, The ratings_df data frame has over twenty-two million rows. This is a lot of data for our computerβs memory to handle. To make computations on this data set, itβs efficient to process the data set in chunks, one after another. In sort of a lazy fashion, using an iterator object.
Please note, we donβt need to read in the entire file. We could simply view the first five rows using the head() function like this:
pd.read_csv('ratings.csv').head()
Itβ s important to talk about iterable objects and iterators at this pointβ¦
An iterable is an object that has an associated iter() method. Once this iter() method is applied to an iterable, an iterator object is created. Under the hood, this is what a for loop is doing, it takes an iterable like a list, string or tuple, and applies an iter() method and creates an iterator and iterates through it. An iterable also has the __get_item__() method that makes it possible to extract elements from it using the square brackets.
See an example below, converting an iterable to an iterator object.
# x below is a list. Which is an iterable object. x = [1, 2, 3, 'hello', 5, 7]
# passing x to the iter() method converts it to an iterator. y = iter(x)
# Checking type(y) print(type(y)) >> <class 'list_iterator'>
The object returned by calling the pd.read_csv() function on a file is an iterable object. Meaning it has the __get_item__() method and the associated iter() method. However, passing a data frame to an iter() method creates a map object.
df = pd.read_csv('movies.csv').head()
# Let's pass the data frame df, to the iter() method df1 = iter(df)
print(type(df1)) >> <class 'map'>
An iterator is defined as an object that has an associated next() method that produces consecutive values.
To create an iterator from an iterable, all we need to do is use the function iter() and pass it the iterable. Then once we have the iterator defined, we pass it to the next() method and this returns the first value. calling next() again returns the next value and so on⦠Until there are no more values to return and then it throws us a StopIterationError.
x = [1, 2, 3]
x = iter(x) # Converting to an iterator object
# Letβs call the next function on x using a for loop for i in range(4): print(next(x)) >> 1 2 3 StopIterationError # Error is displayed if next is called after all items have been printed out from an iterator object
Note that the terms function and method have been used interchangeably here. Generally, they mean the same thing. Just that a method is usually applied on an object like the head() method on a data frame, while a function usually takes in an argument like the print() function.
If youβd like to find out about python comprehensions and generators see this link to my notebook on Github. Itβs not necessary for this article.
Ok. letβs get back to the ratings_df data frame. We want to answer two questions:
1. Whatβs the most common movie rating from 0.5 to 5.0
2. Whatβs the average movie rating for most movies.
Letβs check the memory consumption of the ratings_df data frame
ratings_memory = ratings_df.memory_usage().sum()
# Let's print out the memory consumption print('Total Current memory is-', ratings_memory,'Bytes.')
# Finally, let's see the memory usage of each dimension. ratings_df.memory_usage() >> Total Current memory is- 732300192 Bytes.
Index 128 userId 183075016 movieId 183075016 rating 183075016 timestamp 183075016 dtype: int64
We can see that the total memory consumption of this data set is over 732.3 million bytes⦠Wow.
Since weβre interested in the ratings, letβs get the different rating keys on the scale, from 0.5 to 5.0
# Let's get a list of the rating scale or keys rate_keys = list(ratings_df['rating'].unique())
# let's sort the ratings keys from highest to lowest. rate_keys = sorted(rate_keys, reverse=True) print(rate_keys) >> [5.0, 4.5, 4.0, 3.5, 3.0, 2.5, 2.0, 1.5, 1.0, 0.5]
We now know our rating scale. Next, is to find a way to get the number of ratings for each key on the scale. Yet due to the memory size, we should read the data set in chunks and perform vectorized operations on each chunk. Avoiding loops except necessary.
Our first goal is to count the number of movie ratings per rating key. Out of the 22 million-plus ratings, how many ratings does each key hold? Answering this question automatically answers our first question:-
Question One:
1. Whatβs the most common movie rating from 0.5 toΒ 5.0
letβs create a dictionary whose keys are the unique rating keys using a simple for loop. Then we assign each key to value zero.
ratings_dict = {} for i in rate_keys: ratings_dict[i] = 0
ratings_dict >>
{0.5: 0, 1.0: 0, 1.5: 0, 2.0: 0, 2.5: 0, 3.0: 0, 3.5: 0, 4.0: 0, 4.5: 0, 5.0: 0}
Next, we use the python enumerate() function, pass the pd.read_csv() function as its first argument, then within the read_csv() function, we specify chunksize = 1000000, to read chunks of one million rows of data at a time.
We start the enumerate() function index at 1, passing start=1 as its second argument. So that we can compute the average number of bytes processed for each chunk using the index. Then we use a simple for loop on the rating keys and extract the number of ratings per key, for each chunk and sum these up for each key in the ratings_dict
The final ratings_dict will contain each rating key as keys and total ratings per key as values.
Using chunksize attribute we can see thatΒ :
Total number of chunks: 23
Average bytes per chunk: 31.8 million bytes
This means we processed about 32 million bytes of data per chunk as against the 732 million bytes if we had worked on the full data frame at once. This is computing and memory-efficient, albeit through lazy iterations of the data frame.
There are 23 chunks because we took 1 million rows from the data set at a time and there are 22.8 million rows. So that means the 23rd chunk had the final 0.8 million rows of data.
We can also see our ratings_dict below complete with each rating key and the total number of ratings per key
{5.0: 3358218, 4.5: 1813922, 4.0: 6265623, 3.5: 2592375, 3.0: 4783899, 2.5: 1044176, 2.0: 1603254, 1.5: 337605, 1.0: 769654, 0.5: 315651}
Note that By specifying chunksize
in read_csv
, the return value will be an iterable object of type TextFileReaderΒ .
Specifying iterator=True
will also return the TextFileReader
object:
# Example of passing chunksize to read_csv reader = pd.read_csv(βsome_data.csvβ, chunksize=100) # Above code reads first 100 rows, if you run it in a loop, it reads the next 100 and so on
# Example of iterator=True. Note iterator=False by default. reader = pd.read_csv('some_data.csv', iterator=True) reader.get_chunk(100) This gets the first 100 rows, running through a loop gets the next 100 rows and so on.
# Both chunksize=100 and reader.get_chunk(100) return same TextFileReader object.
This shows that the chunksize acts just like the next() function of an iterator, in the sense that an iterator uses the next() function to get itsβ next element, while the get_chunksize() function grabs the next specified number of rows of data from the data frame, which is similar to an iterator.
Before moving on, letβs confirm we got the complete ratings from the exercise we did above. Total ratings should be equal to the number of rows in the ratings_df.
sum(list(ratings_dict.values())) == len(ratings_df) >> True
Letβs finally answer question one by selecting the key/value pair from ratings_dict that has the max value.
# We use the operator module to easily get the max and min values import operator max(ratings_dict.items(), key=operator.itemgetter(1))
>> (4.0, 6265623)
We can see that the rating key with the highest rating value is 4.0 with a value of 6,265,623 movie ratings.
Thus, the most common movie rating from 0.5 to 5.0 is 4.0
Letβs visualize the plot of rating keys and values from max to min.
Letβs create a data frame (ratings_dict_df) from the ratings_dict by simply casting each value to a list and passing the ratings_dict to the pandas DataFrame() function. Then we sort the data frame by Count descending.
Question Two:
2. Whatβs the average movie rating for mostΒ movies.
To answer this, we need to calculate the Weighted-Average of the distribution.
This simply means we multiply each rating key by the number of times it was rated and we add them all together and divide by the total number of ratings.
# First we find the sum of the product of rate keys and corresponding values. product = sum((ratings_dict_df.Rating_Keys * ratings_dict_df.Count))
# Let's divide product by total ratings. weighted_average = product / len(ratings_df)
# Then we display the weighted-average below. weighted_average >> 3.5260770044122243
So to answer question two, we can say
The average movie rating from 0.5 to 5.0 is 3.5.
Itβs pretty encouraging that on a scale of 5.0, most movies have a rating of 4.0 and an average rating of 3.5β¦ Hmm, Is anyone thinking of movie production?
If youβre like most people I know, the next logical question is:-
Hey Lawrence, whatβs the chance that my movie would at least be ratedΒ average?
To find out what percentage of movies are rated at least average, we would compute the Relative-frequency percentage distribution of the ratings.
This simply means what percentage of movie ratings does each rating key hold?
Letβs add a percentage column to the ratings_dict_df using apply and lambda.
ratings_dict_df['Percent'] = ratings_dict_df['Count'].apply(lambda x: (x / (len(ratings_df)) * 100))
ratings_dict_df >>
Therefore to find the percentage of movies that are rated at least average (3.5), we simply sum the percentages of movie keys 3.5 to 5.0.
sum(ratings_dict_df[ratings_dict_df.Rating_Keys >= 3.5]['Percent']) >> 61.308804692389046
Findings:
From these exercises, we can infer that on a scale of 5.0, most movies are rated 4.0 and the average rating for movies is 3.5 and finally, over 61.3% of all movies produced have a rating of at least 3.5.
Conclusion:
Weβve seen how we can handle large data sets using pandas chunksize attribute, albeit in a lazy fashion chunk after chunk.
The merits are arguably efficient memory usage and computational efficiency. While demerits include computing time and possible use of for loops. Itβs important to state that applying vectorised operations to each chunk can greatly speed up computing time.
Thanks for your time.
P.S See a link to the notebook for this article in Github.
Bio: Lawrence Krukrubo is a Data Specialist at Tech Layer Africa, passionate about fair and explainable AI and Data Science. Lawrence is certified by IBM as an Advanced-Data Science Professional. He loves to contribute to open-source projects and has written several insightful articles on Data Science and AI. Lawrence holds a BSc in Banking and Finance and pursuing his Masters in Artificial Intelligence and Data Analytics at Teesside, Middlesbrough U.K.
Efficient Pandas: Using Chunksize for Large Datasets was originally published in Towards AIΒ on Medium, where people are continuing the conversation by highlighting and responding to this story.
Published viaΒ Towards AI
George
Thank you! This is exactly what I was looking for!
Could you please provide a code example for the question 1 part? You describe the solution very well but this post is going to help a lot of people and I think it would be useful to see the code too.
Ramos
How do you determine the optimal chunksize.
For my dataset 1million chunksize is still slow (8GB RAM)