Set Operations on Python DataFrames

Last Updated on July 25, 2023 by Editorial Team

Author(s): Shravankumar Hiregoudar

Originally published on Towards AI.

We often perform join, union, difference, intersection, etc. operations between python DataFrames. Throughout my data science journey, Understanding the concepts of Set Theory has helped me perform these tasks efficiently. In this article, I would like to use the set theory to implement DataFrame operations.

The topic of discussion!

1. Basics of ‘Set Theory’ and ‘Venn Diagrams.’
2. Set Operations on DataFrames
3. Conclusion

1. Basics of ‘Set Theory’ and ‘Venn Diagrams.’

We use sets and set theory extensively in Mathematics and Probability theory. A simple definition would be, A set is a collection of elements in which the order of the elements doesn't matter. The concept of set is also used in python, Often referred to set as a datatype.

In python, Sets are very similar to lists except that their elements are immutable (that means you cannot change/mutate an element of a set once declared). However, you can add/remove elements from the set. A set can contain elements of different data types. A set can be;

`mySet = set(("Theory", "Mathematics", 214, 3.14, (1,0,0,1)))`

We use a Venn diagram to visualize the sets. In the set operations section, we will see how to use them.

2. Set Operations on DataFrames

Let's consider two simple DataFrames to demo the logic of set operations

`df1 = pd.DataFrame({'price': [10,20,30,40,50,60],'quantity': [100,200,300,400,500,600], 'margin%': [1,2,3,4,5,6]})df2 = pd.DataFrame({'price': [10,20,30,40,50,60],'quantity': [120,200,300,400,520,620], 'margin%': [7,2,3,6,11,12]})`

Printing df1 and df2 looks like;

We have two simple DataFrames from two different sellers, which have the price of the article, the number of articles you would sell for that price, and the margin of profit.

Venn diagram representation of df1 and df2:

• Orange shade: The article/s that is sold by seller 1 at a different price, quantity, and margin than that of seller 2
For example, article 0 is sold by seller 1 at \$10, 100 in quantity with a 1% margin, whereas article 0 is sold by seller 2 at \$10, 120 in quantity with a 7% margin.
• Blue shade: The article/s that is sold by seller 2 at a different price, quantity, and margin than that of seller 1.
• Grey share: The article/s that is sold by seller 2 and seller 1 with the same price, quantity, and margin
For example, article 3 is sold by seller 1 and seller 2 at the same price of \$30, 300 in quantity with a 3% margin.

a. Union

The union of two sets is a new set consisting of all elements in set A or B. For example,

set A = {1,2,3,4} and set B = {3,4,5,6}
set A union set B = {1,2,3,4,5,6} often referred as A∪B

In our case,

A simple approach of union would be;

`# Union : Concat or append with duplicatespd.concat([df1, df2]) #ORdf1.append(df2)`

output:

A better approach would be;

`# Union : Concat or append or merge without duplicatespd.concat([df1,df2]).drop_duplicates().reset_index(drop=True) #ORdf1.append(df2).drop_duplicates().reset_index(drop=True) #ORdf1.merge(df2, how='outer') #left.merge(right, on='key', how='outer')`

b. Intersection

The intersection of two sets is a new set consisting of all elements in sets A and B. For example,

set A = {1,2,3,4} and set B = {3,4,5,6}
set A union set B = {3,4} often referred as A∩B

In our case,

A simple approach of intersection would be;

`pd.merge(df1, df2, how='inner') #ORdf1.merge(df2) #left.merge(right, on='key')`

output:

Intersection is inner join operation and union is outer join operation of two dataframes. Other join operations are:

`#Inner Join / Intersectionleft.merge(right, on='key')#Outer Join / Unionleft.merge(right, on='key', how='outer')#Left outer join / Left join - keys from left are usedleft.merge(right, on='key', how='left')#Right outer join / Right join - keys from right are usedleft.merge(right, on='key', how='right')#In case of different key column namesleft2.merge(right2, left_on='keyLeft', right_on='keyRight', how='inner')#To see all possible options ("left_only", "right_only" or "both")left.merge(right, how='outer', suffixes=['', '_'], indicator=True)`

c. Difference

The set A−B consists of elements that are in A but not in B. For example,

set A = {1,2,3,4} and set B = {3,4,5,6}
set A difference B = {1,2} often referred as A-B

In our case,

A simple approach of difference would be;

`# Using isin with tupledf1[~df1.apply(tuple,1).isin(df2.apply(tuple,1)) #ORdf1[~df1.astype(str).apply(tuple, 1).isin(df2.astype(str).apply(tuple, 1))] #If df contains, nan# merge with indicatordf1.merge(df2,indicator = True, how='left').loc[lambda x : x['_merge']!='both']`

output:

Python Set Methods

Conclusion

In this article, you became familiar with the concepts of sets and it’s representation via the Venn diagrams. We also looked at the set basic operations like union, intersection, joins, and difference and how we can apply those logics and concepts for python dataframes for our data science problems.

References

Join thousands of data leaders on the AI newsletter. Join over 80,000 subscribers and keep up to date with the latest developments in AI. From research to projects and ideas. If you are building an AI startup, an AI-related product, or a service, we invite you to consider becoming a sponsor.

Published via Towards AI