Master LLMs with our FREE course in collaboration with Activeloop & Intel Disruptor Initiative. Join now!

Publication

Set Operations on Python DataFrames
Latest   Machine Learning

Set Operations on Python DataFrames

Last Updated on July 25, 2023 by Editorial Team

Author(s): Shravankumar Hiregoudar

Originally published on Towards AI.

We often perform join, union, difference, intersection, etc. operations between python DataFrames. Throughout my data science journey, Understanding the concepts of Set Theory has helped me perform these tasks efficiently. In this article, I would like to use the set theory to implement DataFrame operations.

Photo by Tyler Lastovich on Unsplash

The topic of discussion!

1. Basics of ‘Set Theory’ and ‘Venn Diagrams.’
2. Set Operations on DataFrames
3. Conclusion

1. Basics of ‘Set Theory’ and ‘Venn Diagrams.’

We use sets and set theory extensively in Mathematics and Probability theory. A simple definition would be, A set is a collection of elements in which the order of the elements doesn't matter. The concept of set is also used in python, Often referred to set as a datatype.

In python, Sets are very similar to lists except that their elements are immutable (that means you cannot change/mutate an element of a set once declared). However, you can add/remove elements from the set. A set can contain elements of different data types. A set can be;

mySet = set(("Theory", "Mathematics", 214, 3.14, (1,0,0,1)))

We use a Venn diagram to visualize the sets. In the set operations section, we will see how to use them.

2. Set Operations on DataFrames

Let's consider two simple DataFrames to demo the logic of set operations

df1 = pd.DataFrame({'price': [10,20,30,40,50,60],'quantity': [100,200,300,400,500,600], 'margin%': [1,2,3,4,5,6]})df2 = pd.DataFrame({'price': [10,20,30,40,50,60],'quantity': [120,200,300,400,520,620], 'margin%': [7,2,3,6,11,12]})

Printing df1 and df2 looks like;

Fig 1: print(df1) and print(df2) result. Screenshot (Source: Author)

We have two simple DataFrames from two different sellers, which have the price of the article, the number of articles you would sell for that price, and the margin of profit.

Venn diagram representation of df1 and df2:

  • Orange shade: The article/s that is sold by seller 1 at a different price, quantity, and margin than that of seller 2
    For example, article 0 is sold by seller 1 at $10, 100 in quantity with a 1% margin, whereas article 0 is sold by seller 2 at $10, 120 in quantity with a 7% margin.
  • Blue shade: The article/s that is sold by seller 2 at a different price, quantity, and margin than that of seller 1.
  • Grey share: The article/s that is sold by seller 2 and seller 1 with the same price, quantity, and margin
    For example, article 3 is sold by seller 1 and seller 2 at the same price of $30, 300 in quantity with a 3% margin.
Fig 2: Venn dig of dfs. Screenshot (Source: Author)

a. Union

The union of two sets is a new set consisting of all elements in set A or B. For example,

set A = {1,2,3,4} and set B = {3,4,5,6}
set A union set B = {1,2,3,4,5,6} often referred as A∪B

In our case,

Fig 3. The union of two dfs (grey area). Screenshot (Source: Author)

A simple approach of union would be;

# Union : Concat or append with duplicatespd.concat([df1, df2]) #ORdf1.append(df2)

output:

Fig 4. Union : Concat or append with duplicates. Screenshot (Source: Author)

A better approach would be;

# Union : Concat or append or merge without duplicatespd.concat([df1,df2]).drop_duplicates().reset_index(drop=True) #ORdf1.append(df2).drop_duplicates().reset_index(drop=True) #ORdf1.merge(df2, how='outer') #left.merge(right, on='key', how='outer')
Fig 5. Union: Concat or append or merge without duplicates. This is the desired result for df1 union df2. (Source: Author)

b. Intersection

The intersection of two sets is a new set consisting of all elements in sets A and B. For example,

set A = {1,2,3,4} and set B = {3,4,5,6}
set A union set B = {3,4} often referred as A∩B

In our case,

Fig 6. The intersection of two dfs (grey area). Screenshot (Source: Author)

A simple approach of intersection would be;

pd.merge(df1, df2, how='inner') #OR
df1.merge(df2) #left.merge(right, on='key')

output:

Fig 6. The intersection of two dfs. Screenshot (Source: Author)

Intersection is inner join operation and union is outer join operation of two dataframes. Other join operations are:

#Inner Join / Intersection
left.merge(right, on='key')
#Outer Join / Union
left.merge(right, on='key', how='outer')
#Left outer join / Left join - keys from left are used
left.merge(right, on='key', how='left')
#Right outer join / Right join - keys from right are used
left.merge(right, on='key', how='right')
#In case of different key column names
left2.merge(right2, left_on='keyLeft', right_on='keyRight', how='inner')
#To see all possible options ("left_only", "right_only" or "both")
left.merge(right, how='outer', suffixes=['', '_'], indicator=True)

For more advanced join operations

c. Difference

The set A−B consists of elements that are in A but not in B. For example,

set A = {1,2,3,4} and set B = {3,4,5,6}
set A difference B = {1,2} often referred as A-B

In our case,

Fig 6. The difference of two dfs i.e. df1-df2 (grey area). Screenshot (Source: Author)

A simple approach of difference would be;

# Using isin with tuple
df1[~df1.apply(tuple,1).isin(df2.apply(tuple,1)) #OR
df1[~df1.astype(str).apply(tuple, 1).isin(df2.astype(str).apply(tuple, 1))] #If df contains, nan
# merge with indicator
df1.merge(df2,indicator = True, how='left').loc[lambda x : x['_merge']!='both']

output:

Fig 7. The difference of two dfs (df1-df2). Screenshot (Source: Author)

Python Set Methods

Conclusion

In this article, you became familiar with the concepts of sets and it’s representation via the Venn diagrams. We also looked at the set basic operations like union, intersection, joins, and difference and how we can apply those logics and concepts for python dataframes for our data science problems.

Thank you for reading!

References

Join thousands of data leaders on the AI newsletter. Join over 80,000 subscribers and keep up to date with the latest developments in AI. From research to projects and ideas. If you are building an AI startup, an AI-related product, or a service, we invite you to consider becoming a sponsor.

Published via Towards AI

Feedback ↓