Set Operations on Python DataFrames
Last Updated on July 25, 2023 by Editorial Team
Author(s): Shravankumar Hiregoudar
Originally published on Towards AI.
We often perform join, union, difference, intersection, etc. operations between python DataFrames. Throughout my data science journey, Understanding the concepts of Set Theory has helped me perform these tasks efficiently. In this article, I would like to use the set theory to implement DataFrame operations.
The topic of discussion!
1. Basics of βSet Theoryβ and βVenn Diagrams.β
2. Set Operations on DataFrames
3. Conclusion
1. Basics of βSet Theoryβ and βVenn Diagrams.β
We use sets and set theory extensively in Mathematics and Probability theory. A simple definition would be, A set is a collection of elements in which the order of the elements doesn't matter. The concept of set is also used in python, Often referred to set as a datatype.
In python, Sets are very similar to lists except that their elements are immutable (that means you cannot change/mutate an element of a set once declared). However, you can add/remove elements from the set. A set can contain elements of different data types. A set can be;
mySet = set(("Theory", "Mathematics", 214, 3.14, (1,0,0,1)))
We use a Venn diagram to visualize the sets. In the set operations section, we will see how to use them.
2. Set Operations on DataFrames
Let's consider two simple DataFrames to demo the logic of set operations
df1 = pd.DataFrame({'price': [10,20,30,40,50,60],'quantity': [100,200,300,400,500,600], 'margin%': [1,2,3,4,5,6]})df2 = pd.DataFrame({'price': [10,20,30,40,50,60],'quantity': [120,200,300,400,520,620], 'margin%': [7,2,3,6,11,12]})
Printing df1 and df2 looks like;
We have two simple DataFrames from two different sellers, which have the price of the article, the number of articles you would sell for that price, and the margin of profit.
Venn diagram representation of df1 and df2:
- Orange shade: The article/s that is sold by seller 1 at a different price, quantity, and margin than that of seller 2
For example, article 0 is sold by seller 1 at $10, 100 in quantity with a 1% margin, whereas article 0 is sold by seller 2 at $10, 120 in quantity with a 7% margin. - Blue shade: The article/s that is sold by seller 2 at a different price, quantity, and margin than that of seller 1.
- Grey share: The article/s that is sold by seller 2 and seller 1 with the same price, quantity, and margin
For example, article 3 is sold by seller 1 and seller 2 at the same price of $30, 300 in quantity with a 3% margin.
a. Union
The union of two sets is a new set consisting of all elements in set A or B. For example,
set A = {1,2,3,4} and set B = {3,4,5,6}
set A union set B = {1,2,3,4,5,6} often referred as AβͺB
In our case,
A simple approach of union would be;
# Union : Concat or append with duplicatespd.concat([df1, df2]) #ORdf1.append(df2)
output:
A better approach would be;
# Union : Concat or append or merge without duplicatespd.concat([df1,df2]).drop_duplicates().reset_index(drop=True) #ORdf1.append(df2).drop_duplicates().reset_index(drop=True) #ORdf1.merge(df2, how='outer') #left.merge(right, on='key', how='outer')
b. Intersection
The intersection of two sets is a new set consisting of all elements in sets A and B. For example,
set A = {1,2,3,4} and set B = {3,4,5,6}
set A union set B = {3,4} often referred as Aβ©B
In our case,
A simple approach of intersection would be;
pd.merge(df1, df2, how='inner') #OR
df1.merge(df2) #left.merge(right, on='key')
output:
Intersection is inner join operation and union is outer join operation of two dataframes. Other join operations are:
#Inner Join / Intersection
left.merge(right, on='key')#Outer Join / Union
left.merge(right, on='key', how='outer')#Left outer join / Left join - keys from left are used
left.merge(right, on='key', how='left')#Right outer join / Right join - keys from right are used
left.merge(right, on='key', how='right')#In case of different key column names
left2.merge(right2, left_on='keyLeft', right_on='keyRight', how='inner')#To see all possible options ("left_only", "right_only" or "both")
left.merge(right, how='outer', suffixes=['', '_'], indicator=True)
For more advanced join operations
c. Difference
The set AβB consists of elements that are in A but not in B. For example,
set A = {1,2,3,4} and set B = {3,4,5,6}
set A difference B = {1,2} often referred as A-B
In our case,
A simple approach of difference would be;
# Using isin with tuple
df1[~df1.apply(tuple,1).isin(df2.apply(tuple,1)) #OR
df1[~df1.astype(str).apply(tuple, 1).isin(df2.astype(str).apply(tuple, 1))] #If df contains, nan# merge with indicator
df1.merge(df2,indicator = True, how='left').loc[lambda x : x['_merge']!='both']
output:
Conclusion
In this article, you became familiar with the concepts of sets and itβs representation via the Venn diagrams. We also looked at the set basic operations like union, intersection, joins, and difference and how we can apply those logics and concepts for python dataframes for our data science problems.
Thank you for reading!
References
- https://www.probabilitycourse.com/chapter1/1_2_2_set_operations.php
- https://realpython.com/python-sets/
- https://www.w3schools.com/python/python_ref_set.asp
- https://stackoverflow.com/questions/53645882/pandas-merging-101
Join thousands of data leaders on the AI newsletter. Join over 80,000 subscribers and keep up to date with the latest developments in AI. From research to projects and ideas. If you are building an AI startup, an AI-related product, or a service, we invite you to consider becoming aΒ sponsor.
Published via Towards AI