Maximizing Pandas Performance: 6 Best Practices for Efficient Data Processing

Last Updated on February 15, 2023 by Editorial Team

Author(s): Fares Sayah

Originally published on Towards AI.

Optimizing Pandas: Understanding Data Types and Memory Usage for Efficient Data Processing

Photo by shiyang xu on Unsplash

Pandas is a popular library in the world of Data Science that makes it easy to work with data using efficient, high-performance tools. But when dealing with huge amounts of data, Pandas can hit its limits and cause memory problems. To overcome these issues, you can turn to other tools like Dask or Polars. This article gives you some tips to try before switching to another tool.

Pandas data types can be confusing, so it’s important to check them when you first begin exploring your data. Having the correct data types will make your analysis more accurate and efficient. Sometimes, Pandas might read an integer column as a floating-point or object type, which can lead to errors and use up extra memory. This article will explain Pandas data types and show you how to save memory by using the right ones.

This article is inspired by Matt Harrison’s talk: Effective Pandas | Matt Harrison | PyData Salt Lake City Meetup.

We are going to use the vehicles dataset in this article. The dataset is fairly large, so we will pick a few columns to experiment with.

Table of Contents

· 1. Reading the Data
· 2. Memory Usage
· 3. Pandas Data Types
· 4. Integers
· 5. Float
· 6. Objects and Category
· 7. Datetimes
· 8. NumPy vs. Pandas Operations

1. Reading the Data

If the data does not fit into your memory in the first place, you can read it in chunks and explore it. The chunksize parameter of read_csv() specifies the number of rows to read into memory at a time.
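A minimal sketch of chunked reading (the file name vehicles.csv is an assumption; adjust the chunk size to your machine):

import pandas as pd

# Read the CSV in chunks of 10,000 rows instead of loading everything at once.
# Each chunk is a regular DataFrame that can be explored on its own.
for chunk in pd.read_csv("vehicles.csv", chunksize=10_000):
    print(chunk.shape)
    break  # inspect only the first chunk while exploring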

After exploring a small portion of the data, you now know which columns are important and which are not. To save memory, you can read only the important columns using the usecols parameter.
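For example, a hedged sketch of column-restricted reading, assuming the 14 columns used throughout this article (as listed in the info() output in the next section) are the ones we care about:

import pandas as pd

# usecols tells read_csv to parse and keep only these columns,
# so the rest of the file never occupies memory.
cols = ["city08", "comb08", "highway08", "cylinders", "displ", "drive",
        "eng_dscr", "fuelCost08", "make", "model", "trany", "range",
        "createdOn", "year"]
df = pd.read_csv("vehicles.csv", usecols=cols)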

2. Memory Usage

The Pandas info() function provides valuable information about a DataFrame, including the data type of each column, the number of non-null values, and memory usage. This function is useful for understanding the structure of your data and optimizing memory usage. The memory usage report is displayed at the end of the info() function's output.

To get the full memory usage, pass the memory_usage="deep" argument to info().
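For example:

# "deep" makes Pandas inspect object columns element by element,
# reporting their true memory footprint instead of an estimate.
df.info(memory_usage="deep")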

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 41144 entries, 0 to 41143
Data columns (total 14 columns):
 #   Column      Non-Null Count  Dtype
---  ------      --------------  -----
 0   city08      41144 non-null  int64
 1   comb08      41144 non-null  int64
 2   highway08   41144 non-null  int64
 3   cylinders   40938 non-null  float64
 4   displ       40940 non-null  float64
 5   drive       39955 non-null  object
 6   eng_dscr    24991 non-null  object
 7   fuelCost08  41144 non-null  int64
 8   make        41144 non-null  object
 9   model       41144 non-null  object
 10  trany       41133 non-null  object
 11  range       41144 non-null  int64
 12  createdOn   41144 non-null  object
 13  year        41144 non-null  int64
dtypes: float64(2), int64(6), object(6)
memory usage: 18.7 MB

3. Pandas Data Types

When importing data into a Pandas DataFrame, the entire dataset is read into memory to determine the data types of each column. This process can sometimes result in incorrect data type assignments, such as assuming a column with integer values and missing data is a floating-point data type rather than an integer. To avoid this, it’s important to carefully review and adjust the data types as needed.

To check the types of your data, you can use .dtypes, which returns a Pandas Series mapping each column to its dtype:

city08          int64
comb08          int64
highway08       int64
cylinders     float64
displ         float64
drive          object
eng_dscr       object
fuelCost08      int64
make           object
model          object
trany          object
range           int64
createdOn      object
year            int64
dtype: object

Only three types appear in our dataset, but Pandas has seven dtypes in general:

  • object, int64, float64, category, and datetime64 are covered in this article.
  • bool: True/False values. Can be a NumPy bool_.
  • timedelta[ns]: Differences between two datetimes.

4. Integers

Integer numbers. Can be a NumPy int_, int8, int16, int32, int64, uint8, uint16, uint32, or uint64.

You can use numpy.iinfo() to check the machine limit for the integer types and choose one that allows you to save memory without losing precision.
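A quick sketch of how to print these limits:

import numpy as np

# iinfo() reports the minimum and maximum representable values
# for a given integer type.
for dtype in (np.int8, np.int16):
    print(np.iinfo(dtype))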

Machine parameters for int8
---------------------------------------------------------------
min = -128
max = 127
---------------------------------------------------------------
Machine parameters for int16
---------------------------------------------------------------
min = -32768
max = 32767
---------------------------------------------------------------

Use DataFrame.select_dtypes() to select columns with a specific dtype.

The simplest way to convert a pandas column of data to a different type is to use astype().
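A sketch of the downcasting step (the target types follow from the info() output below; always check a column's min and max before shrinking it):

# Inspect the int64 columns, then convert each one to the
# smallest integer type that can hold its values.
int_cols = df.select_dtypes(include="int64").columns
print(df[int_cols].describe())  # check the min/max before choosing a type

df["highway08"] = df["highway08"].astype("int8")
for col in ["city08", "comb08", "fuelCost08", "range", "year"]:
    df[col] = df[col].astype("int16")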

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 41144 entries, 0 to 41143
Data columns (total 14 columns):
 #   Column      Non-Null Count  Dtype
---  ------      --------------  -----
 0   city08      41144 non-null  int16
 1   comb08      41144 non-null  int16
 2   highway08   41144 non-null  int8
 3   cylinders   40938 non-null  float64
 4   displ       40940 non-null  float64
 5   drive       39955 non-null  object
 6   eng_dscr    24991 non-null  object
 7   fuelCost08  41144 non-null  int16
 8   make        41144 non-null  object
 9   model       41144 non-null  object
 10  trany       41133 non-null  object
 11  range       41144 non-null  int16
 12  createdOn   41144 non-null  object
 13  year        41144 non-null  int16
dtypes: float64(2), int16(5), int8(1), object(6)
memory usage: 17.3 MB

5. Float

Floating-point numbers. Can be a NumPy float_, float16, float32, or float64.

You can use numpy.finfo() to check the machine limit for the float types and choose one that allows you to save memory without losing precision.
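As with the integer limits, a quick sketch:

import numpy as np

# finfo() reports precision, resolution, and range for a float type.
for dtype in (np.float16, np.float32):
    print(np.finfo(dtype))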

Machine parameters for float16
---------------------------------------------------------------
precision =   3   resolution = 1.00040e-03
machep =    -10   eps =        9.76562e-04
negep =     -11   epsneg =     4.88281e-04
minexp =    -14   tiny =       6.10352e-05
maxexp =     16   max =        6.55040e+04
nexp =        5   min =        -max
---------------------------------------------------------------
Machine parameters for float32
---------------------------------------------------------------
precision =   6   resolution = 1.0000000e-06
machep =    -23   eps =        1.1920929e-07
negep =     -24   epsneg =     5.9604645e-08
minexp =   -126   tiny =       1.1754944e-38
maxexp =    128   max =        3.4028235e+38
nexp =        8   min =        -max
---------------------------------------------------------------

The cylinders column should have an integer dtype, but because it has missing values, Pandas reads it as a float dtype. Once the missing values are filled, it can be downcast to an integer type.
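A sketch of the fix, assuming filling the missing values with 0 is acceptable for this analysis:

# Fill the gaps so the columns can be downcast; using 0 here is an
# assumption, not something the dataset dictates.
df["cylinders"] = df["cylinders"].fillna(0).astype("int8")
df["displ"] = df["displ"].fillna(0).astype("float16")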

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 41144 entries, 0 to 41143
Data columns (total 14 columns):
 #   Column      Non-Null Count  Dtype
---  ------      --------------  -----
 0   city08      41144 non-null  int16
 1   comb08      41144 non-null  int16
 2   highway08   41144 non-null  int8
 3   cylinders   41144 non-null  int8
 4   displ       41144 non-null  float16
 5   drive       39955 non-null  object
 6   eng_dscr    24991 non-null  object
 7   fuelCost08  41144 non-null  int16
 8   make        41144 non-null  object
 9   model       41144 non-null  object
 10  trany       41133 non-null  object
 11  range       41144 non-null  int16
 12  createdOn   41144 non-null  object
 13  year        41144 non-null  int16
dtypes: float16(1), int16(5), int8(2), object(6)
memory usage: 16.8 MB

6. Objects and Category

Object: Text or mixed numeric and non-numeric values. Can be a NumPy string_, unicode_, or mixed types.

Category: The category data type in Pandas is a hybrid data type. It looks and behaves like a string in many instances, but internally it is represented by an array of integers. This allows the data to be sorted in a custom order and stored more efficiently.

drive, make, and trany have a small number of unique values, so we can convert them to the category dtype:
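A sketch of the conversion (filling the missing values with a placeholder first is an assumption; NaNs would otherwise survive the cast):

# Each distinct value is stored once; the rows hold small integer codes.
for col in ["drive", "make", "trany"]:
    df[col] = df[col].fillna("Unknown").astype("category")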

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 41144 entries, 0 to 41143
Data columns (total 14 columns):
 #   Column      Non-Null Count  Dtype
---  ------      --------------  -----
 0   city08      41144 non-null  int16
 1   comb08      41144 non-null  int16
 2   highway08   41144 non-null  int8
 3   cylinders   41144 non-null  int8
 4   displ       41144 non-null  float16
 5   drive       41144 non-null  category
 6   eng_dscr    24991 non-null  object
 7   fuelCost08  41144 non-null  int16
 8   make        41144 non-null  category
 9   model       41144 non-null  object
 10  trany       41144 non-null  category
 11  range       41144 non-null  int16
 12  createdOn   41144 non-null  object
 13  year        41144 non-null  int16
dtypes: category(3), float16(1), int16(5), int8(2), object(3)
memory usage: 8.8 MB

7. Datetimes

Date and time values. Having our dates as datetime64 objects gives us access to a lot of date and time information through the .dt accessor.
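A sketch of the conversion (pd.to_datetime infers the timestamp format from the strings):

# Parse the createdOn strings into datetime64[ns] values; date parts
# then become available through the .dt accessor.
df["createdOn"] = pd.to_datetime(df["createdOn"])
print(df["createdOn"].dt.year.head())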

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 41144 entries, 0 to 41143
Data columns (total 14 columns):
 #   Column      Non-Null Count  Dtype
---  ------      --------------  -----
 0   city08      41144 non-null  int16
 1   comb08      41144 non-null  int16
 2   highway08   41144 non-null  int8
 3   cylinders   41144 non-null  int8
 4   displ       41144 non-null  float16
 5   drive       41144 non-null  category
 6   eng_dscr    24991 non-null  object
 7   fuelCost08  41144 non-null  int16
 8   make        41144 non-null  category
 9   model       41144 non-null  object
 10  trany       41144 non-null  category
 11  range       41144 non-null  int16
 12  createdOn   41144 non-null  datetime64[ns]
 13  year        41144 non-null  int16
dtypes: category(3), datetime64[ns](1), float16(1), int16(5), int8(2), object(2)
memory usage: 5.8 MB

8. NumPy vs. Pandas Operations

Sometimes, just converting data to NumPy arrays will speed up calculations, as in the example below (.values converts a Series to a NumPy array):
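A sketch of the comparison in a Jupyter cell (city08 is used for illustration; the first timing below is the Series version, the second the NumPy version):

# Same reduction, two code paths: Series.mean() goes through Pandas
# machinery, while .values.mean() runs directly on the NumPy array.
%timeit df["city08"].mean()
%timeit df["city08"].values.mean()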

78.1 µs ± 1.29 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

36.9 µs ± 579 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

Summary

  • Proper data type assignment is an important step in exploring a new dataset.
  • Pandas generally makes accurate data type inferences, but it’s important to be familiar with the conversion options available to ensure the data is properly formatted.
  • Correctly assigning data types can result in significant memory savings: in this example, usage dropped from 18.7 MB to 5.8 MB, a reduction of nearly 70%.

