Maximizing Pandas Performance: 6 Best Practices for Efficient Data Processing
Last Updated on February 15, 2023 by Editorial Team
Author(s): Fares Sayah
Originally published on Towards AI.
Optimizing Pandas: Understanding Data Types and Memory Usage for Efficient Data Processing
Pandas is a popular library in the world of Data Science that makes it easy to work with data using efficient, high-performance tools. But when dealing with huge amounts of data, Pandas can hit its limits and cause memory problems. To overcome these issues, you can reach for other tools like Dask or Polars. This article gives you some tips to try before switching to another tool.
Pandas data types can be confusing, so it's important to check them when you first begin exploring your data. Having the correct data types will make your analysis more accurate and efficient. Sometimes, Pandas might read an integer column as a floating-point or object type, which can lead to errors and use up extra memory. This article will explain Pandas data types and show you how to save memory by using the right data types.
This article is inspired by Matt Harrison's talk: Effective Pandas | Matt Harrison | PyData Salt Lake City Meetup.
We are going to use vehicle data in this article: the vehicles dataset. The data is fairly large, so we will pick a few columns to experiment with.
Table of Contents
· 1. Reading the Data
· 2. Memory Usage
· 3. Pandas Data Types
· 4. Integers
· 5. Float
· 6. Objects and Category
· 7. Datetimes
· 8. NumPy vs Pandas operations
1. Reading the Data
If the data does not fit into your memory in the first place, you can read it in chunks and explore it piece by piece. The chunksize parameter of read_csv specifies the number of rows to read into memory at a time.
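A minimal sketch of chunked reading; the in-memory CSV here is a tiny stand-in for a large file on disk:

```python
import io

import pandas as pd

# Stand-in for a large CSV on disk; in practice you would pass a file
# path such as "vehicles.csv" to pd.read_csv.
csv_data = io.StringIO("city08,year\n19,1985\n23,1993\n17,2001\n21,2010\n")

# chunksize=2 -> read 2 rows into memory at a time.
shapes = []
for chunk in pd.read_csv(csv_data, chunksize=2):
    shapes.append(chunk.shape)  # each chunk is an ordinary DataFrame

print(shapes)  # [(2, 2), (2, 2)]
```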
After exploring a small portion of the data, you will know which columns are important and which are not. To save extra memory, you can read only the important columns using the usecols parameter of read_csv.
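A sketch of column selection on the same kind of stand-in data (the column names mirror the dataset used here):

```python
import io

import pandas as pd

csv_data = io.StringIO(
    "city08,comb08,make,eng_dscr\n"
    "19,21,Alfa Romeo,SPFI\n"
    "23,24,Ferrari,GUZZLER\n"
)

# usecols parses only the named columns and skips the rest entirely.
df = pd.read_csv(csv_data, usecols=["city08", "comb08", "make"])
print(df.columns.tolist())  # ['city08', 'comb08', 'make']
```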
2. Memory Usage
The Pandas info() function provides valuable information about a DataFrame, including the data type of each column, the number of non-null values, and memory usage. This function is useful for understanding the structure of your data and optimizing memory usage. The memory usage report is displayed at the end of the info() function's output.
To get the full memory usage, we pass the memory_usage="deep" argument to info().
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 41144 entries, 0 to 41143
Data columns (total 14 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 city08 41144 non-null int64
1 comb08 41144 non-null int64
2 highway08 41144 non-null int64
3 cylinders 40938 non-null float64
4 displ 40940 non-null float64
5 drive 39955 non-null object
6 eng_dscr 24991 non-null object
7 fuelCost08 41144 non-null int64
8 make 41144 non-null object
9 model 41144 non-null object
10 trany 41133 non-null object
11 range 41144 non-null int64
12 createdOn 41144 non-null object
13 year 41144 non-null int64
dtypes: float64(2), int64(6), object(6)
memory usage: 18.7 MB
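The call itself, sketched on a small illustrative frame (the report above comes from the full vehicles dataset):

```python
import pandas as pd

df = pd.DataFrame({
    "city08": [19, 23, 17],
    "make": ["Alfa Romeo", "Ferrari", "Dodge"],
})

# Default report: object columns are counted as fixed-size references.
df.info()

# "deep" also measures the strings those references point to,
# so it reports the true memory footprint.
df.info(memory_usage="deep")
```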
3. Pandas Data Types
When importing data into a Pandas DataFrame, the entire dataset is read into memory to determine the data type of each column. This process can sometimes result in incorrect data type assignments, such as treating a column of integer values with missing data as a floating-point type rather than an integer. To avoid this, it's important to carefully review and adjust the data types as needed.
To check the types of your data, you can use .dtypes, which returns a pandas Series mapping each column to its dtype:
city08 int64
comb08 int64
highway08 int64
cylinders float64
displ float64
drive object
eng_dscr object
fuelCost08 int64
make object
model object
trany object
range int64
createdOn object
year int64
dtype: object
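A small frame makes the inference issue concrete; the missing value below is enough to force what should be an integer column into float64:

```python
import pandas as pd

df = pd.DataFrame({
    "city08": [19, 23],
    "cylinders": [4, None],  # missing value -> inferred as float64
    "make": ["Alfa Romeo", "Ferrari"],
})

print(df.dtypes)  # city08 is int64, cylinders float64, make object
```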
Only three types appear in our dataset, but Pandas has 7 types in general:
- object, int64, float64, category, and datetime64 are going to be covered in this article.
- bool: True/False values. Can be a NumPy bool_.
- timedelta[ns]: Differences between two datetimes.
4. Integers
Integer numbers. Can be a NumPy int_, int8, int16, int32, int64, uint8, uint16, uint32, or uint64.
You can use numpy.iinfo() to check the machine limit for the integer types and choose one that allows you to save memory without losing precision.
Machine parameters for int8
---------------------------------------------------------------
min = -128
max = 127
---------------------------------------------------------------
Machine parameters for int16
---------------------------------------------------------------
min = -32768
max = 32767
---------------------------------------------------------------
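The reports above are what printing numpy.iinfo objects produces:

```python
import numpy as np

# Printing an iinfo object yields the "Machine parameters" report.
print(np.iinfo(np.int8))
print(np.iinfo(np.int16))

# The limits are also available as attributes for programmatic checks:
print(np.iinfo(np.int8).min, np.iinfo(np.int8).max)  # -128 127
```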
Use DataFrame.select_dtypes() to select columns with a specific dtype.
The simplest way to convert a pandas column of data to a different type is to use astype().
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 41144 entries, 0 to 41143
Data columns (total 14 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 city08 41144 non-null int16
1 comb08 41144 non-null int16
2 highway08 41144 non-null int8
3 cylinders 40938 non-null float64
4 displ 40940 non-null float64
5 drive 39955 non-null object
6 eng_dscr 24991 non-null object
7 fuelCost08 41144 non-null int16
8 make 41144 non-null object
9 model 41144 non-null object
10 trany 41133 non-null object
11 range 41144 non-null int16
12 createdOn 41144 non-null object
13 year 41144 non-null int16
dtypes: float64(2), int16(5), int8(1), object(6)
memory usage: 17.3 MB
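A sketch of both steps on a toy frame (the report above shows the result of the same idea applied to the full dataset):

```python
import pandas as pd

df = pd.DataFrame({
    "city08": [19, 23, 150],
    "highway08": [25, 30, 120],
    "make": ["Alfa Romeo", "Ferrari", "Dodge"],
})

# Find the integer columns that are candidates for downcasting.
int_cols = df.select_dtypes(include="int64").columns
print(list(int_cols))  # ['city08', 'highway08']

# Downcast based on each column's actual range: highway08 fits in
# int8 (max 127), while city08 needs int16 because of the value 150.
df["highway08"] = df["highway08"].astype("int8")
df["city08"] = df["city08"].astype("int16")
print(df.dtypes)
```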
5. Float
Floating-point numbers. Can be a NumPy float_, float16, float32, or float64.
You can use numpy.finfo() to check the machine limit for the float types and choose one that allows you to save memory without losing precision.
Machine parameters for float16
---------------------------------------------------------------
precision = 3 resolution = 1.00040e-03
machep = -10 eps = 9.76562e-04
negep = -11 epsneg = 4.88281e-04
minexp = -14 tiny = 6.10352e-05
maxexp = 16 max = 6.55040e+04
nexp = 5 min = -max
---------------------------------------------------------------
Machine parameters for float32
---------------------------------------------------------------
precision = 6 resolution = 1.0000000e-06
machep = -23 eps = 1.1920929e-07
negep = -24 epsneg = 5.9604645e-08
minexp = -126 tiny = 1.1754944e-38
maxexp = 128 max = 3.4028235e+38
nexp = 8 min = -max
---------------------------------------------------------------
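As with the integers, the reports above come from printing numpy.finfo objects:

```python
import numpy as np

# Printing a finfo object yields the "Machine parameters" report.
print(np.finfo(np.float16))
print(np.finfo(np.float32))

# Attribute access for programmatic checks:
print(np.finfo(np.float16).max)  # 65504.0
```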
The cylinders column should be an integer dtype, but because it has missing values, pandas read it as a float dtype.
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 41144 entries, 0 to 41143
Data columns (total 14 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 city08 41144 non-null int16
1 comb08 41144 non-null int16
2 highway08 41144 non-null int8
3 cylinders 41144 non-null int8
4 displ 41144 non-null float16
5 drive 39955 non-null object
6 eng_dscr 24991 non-null object
7 fuelCost08 41144 non-null int16
8 make 41144 non-null object
9 model 41144 non-null object
10 trany 41133 non-null object
11 range 41144 non-null int16
12 createdOn 41144 non-null object
13 year 41144 non-null int16
dtypes: float16(1), int16(5), int8(2), object(6)
memory usage: 16.8 MB
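The conversion can be sketched like this; filling missing values with 0 is an illustrative choice, not necessarily what you would want for real analysis:

```python
import pandas as pd

df = pd.DataFrame({
    "cylinders": [4.0, 6.0, None],  # NaN forced this column to float64
    "displ": [2.0, 3.5, None],
})

# Fill the gaps, then downcast to the smallest sufficient types.
df["cylinders"] = df["cylinders"].fillna(0).astype("int8")
df["displ"] = df["displ"].fillna(0).astype("float16")
print(df.dtypes)
```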
6. Objects and Category
Object: Text or mixed numeric and non-numeric values. Can be a NumPy string_, unicode_, or mixed types.
Category: The category data type in pandas is a hybrid data type. It looks and behaves like a string in many instances but internally is represented by an array of integers. This allows the data to be sorted in a custom order and stored more efficiently.
drive and trany have a small number of unique values, so we can convert them to the category dtype.
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 41144 entries, 0 to 41143
Data columns (total 14 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 city08 41144 non-null int16
1 comb08 41144 non-null int16
2 highway08 41144 non-null int8
3 cylinders 41144 non-null int8
4 displ 41144 non-null float16
5 drive 41144 non-null category
6 eng_dscr 24991 non-null object
7 fuelCost08 41144 non-null int16
8 make 41144 non-null category
9 model 41144 non-null object
10 trany 41144 non-null category
11 range 41144 non-null int16
12 createdOn 41144 non-null object
13 year 41144 non-null int16
dtypes: category(3), float16(1), int16(5), int8(2), object(3)
memory usage: 8.8 MB
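A sketch of the conversion; filling the missing drive values with a placeholder label first is one option (the report above suggests they were filled, since drive shows no missing values afterward):

```python
import pandas as pd

df = pd.DataFrame({
    "drive": ["FWD", "RWD", "FWD", None],
    "trany": ["Manual", "Auto", "Auto", "Manual"],
})

# Fill missing labels (placeholder choice), then convert to category.
df["drive"] = df["drive"].fillna("Other").astype("category")
df["trany"] = df["trany"].astype("category")

# Internally each value is now a small integer code into this list:
print(df["drive"].cat.categories.tolist())  # ['FWD', 'Other', 'RWD']
```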
7. Datetimes
Date and time values. Having our dates as datetime64 objects gives us access to a lot of date and time information through the .dt API.
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 41144 entries, 0 to 41143
Data columns (total 14 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 city08 41144 non-null int16
1 comb08 41144 non-null int16
2 highway08 41144 non-null int8
3 cylinders 41144 non-null int8
4 displ 41144 non-null float16
5 drive 41144 non-null category
6 eng_dscr 24991 non-null object
7 fuelCost08 41144 non-null int16
8 make 41144 non-null category
9 model 41144 non-null object
10 trany 41144 non-null category
11 range 41144 non-null int16
12 createdOn 41144 non-null datetime64[ns]
13 year 41144 non-null int16
dtypes: category(3), datetime64[ns](1), float16(1), int16(5), int8(2), object(2)
memory usage: 5.8 MB
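The conversion uses pandas.to_datetime; the date strings below are simplified stand-ins for the dataset's createdOn values:

```python
import pandas as pd

df = pd.DataFrame({"createdOn": ["2013-01-01", "2013-07-15"]})

df["createdOn"] = pd.to_datetime(df["createdOn"])

# The .dt accessor now exposes the date parts directly.
print(df["createdOn"].dt.year.tolist())   # [2013, 2013]
print(df["createdOn"].dt.month.tolist())  # [1, 7]
```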
8. NumPy vs. Pandas operations
Sometimes, just converting data to NumPy arrays will speed up calculations, as in the following example (.values converts the Series to a NumPy array):
78.1 µs ± 1.29 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
36.9 µs ± 579 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
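A self-contained way to run this kind of comparison outside a notebook (exact timings will differ by machine):

```python
import timeit

import numpy as np
import pandas as pd

s = pd.Series(np.random.rand(10_000))

# Same arithmetic, through the Series API and through the raw array.
t_series = timeit.timeit(lambda: s * 2, number=1_000)
t_numpy = timeit.timeit(lambda: s.values * 2, number=1_000)

print(f"Series: {t_series:.4f}s, NumPy: {t_numpy:.4f}s")
```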
Summary
- Proper data type assignment is an important step in exploring a new dataset.
- Pandas generally makes accurate data type inferences, but it's important to be familiar with the conversion options available to ensure the data is properly formatted.
- Correctly assigning data types can result in significant memory savings, potentially reducing memory usage by over 30%.