Combine datasets using Pandas merge(), join(), concat() and append()

Last Updated on October 29, 2020 by Editorial Team

Author(s): Vivek Chaudhary

Free stock photo of abstract, access, background — Source: Pexels

In the world of Data Bases, Joins and Unions are the most critical and frequently performed operations. Almost every other query is an amalgamation of either a join or a union. Using Pandas we perform similar kinds of stuff while working on a Data Science algorithm or any ETL (Extract Transform and Load) project, joins and unions are critical here as well.

Just a little difference between join and unions before jumping onto the use cases of both. Both join and union are used to combine data sets, however, the result set of a join is a horizontal combination of the dataset where a result set of a union is a vertical combination of data set.

In Pandas for a horizontal combination we have merge() and join(), whereas for vertical combination we can use concat() and append(). Merge and join perform similar tasks but internally they have some differences, similar to concat and append. And in this blog, I had tried to list out the differences in the nature of these methods.

merge() is used for combining data on common columns or indices.

create two data frames and build an understanding of how merge works.

import pandas as pd

d1 = {‘Id’: [‘A1’, ‘A2’, ‘A3’, ‘A4’,’A5'],
 ‘Name’:[‘Vivek’, ‘Rahul’, ‘Gaurav’, ‘Ankit’,’Vishakha’], 
 ‘Age’:[27, 24, 22, 32, 28],} 
 
 
d2 = {‘Id’: [‘A1’, ‘A2’, ‘A3’, ‘A4’],
 ‘Address’:[‘Delhi’, ‘Gurgaon’, ‘Noida’, ‘Pune’], 
 ‘Qualification’:[‘Btech’, ‘B.A’, ‘Bcom’, ‘B.hons’]}

df1=pd.DataFrame(d1)
df2=pd.DataFrame(d2)

Case 1. merging data on common columns ‘Id’

#Inner Join
pd.merge(df1,df2)
#simple merge with no additional arguments performs an inner/equi join equivalent to data base join operation

pd.merge(df1,df2, how='inner)
#produces output similar as above, as pandas merge by default is an equi join

#Left Join
pd.merge(df1,df2,how=’left’)

#matching and non matching records from left DF which is df1 is present in result data frame

#Right Join
pd.merge(df1,df2,how=’right’)

#matching and non matching records from right DF, df2 will come in result df
#as of now we dont have any non matching record in right df

#outer join
pd.merge(df1,df2,how=’outer’)

#all the matching and non matching records are available in resultant dataset from both data frames

Case 2: merge data frames with no common columns

#changed the merging key in data frame d2

import pandas as pd

d1 = {‘Id’: [‘A1’, ‘A2’, ‘A3’, ‘A4’,’A5'],
 ‘Name’:[‘Vivek’, ‘Rahul’, ‘Gaurav’, ‘Ankit’,’Vishakha’], 
 ‘Age’:[27, 24, 22, 32, 28],} 
 
 
d2 = {‘ID’: [‘A1’, ‘A2’, ‘A3’, ‘A4’],
 ‘Address’:[‘Delhi’, ‘Gurgaon’, ‘Noida’, ‘Pune’], 
 ‘Qualification’:[‘Btech’, ‘B.A’, ‘Bcom’, ‘B.hons’]}

df1=pd.DataFrame(d1)
df2=pd.DataFrame(d2)

#lets check when merging keys are different ('Id' and 'ID')
pd.merge(df1,df2)

merging result in below error:

MergeError: No common columns to perform merge on.

to overcome the merge error, we can use pandas argument ‘left_on’ and ‘right_on’ to explicitly indicate pandas on what key columns we want to merge data frames, rest everything remains similar.

#df1 key column 'Id'
#df2 key column 'ID'

pd.merge(df1,df2,left_on=’Id’,right_on=’Id’,how=’left’)

2. join() is used for combining data on a key column or an index.

create two data frames and build an understanding of how join works.

import pandas as pd
df1 = pd.DataFrame({‘key’: [‘K0’, ‘K1’, ‘K5’, ‘K3’, ‘K4’, ‘K2’],
 ‘A’: [‘A0’, ‘A1’, ‘A5’, ‘A3’, ‘A4’, ‘A2’]})
df2 = pd.DataFrame({‘key’: [‘K0’, ‘K1’, ‘K2’],
 ‘B’: [‘B0’, ‘B1’, ‘B2’]})

Case 1. join on indexes

By default, pandas join operation is performed on indexes both data frames have default indexes values, so no need to specify any join key, join will implicitly be performed on indexes.

#default nature of pandas join is left outer join
df1.join(df2, lsuffix=’_l’, rsuffix=’_r’)

If there are overlapping columns in pandas join, it throws an error :

ValueError: columns overlap but no suffix specified: Index([‘key’], dtype=’object’)

With the help of the below use case try to understand the default nature of pandas join which is left outer join.

Create two data frames with different index values

df1 = pd.DataFrame({‘key’: [‘K0’, ‘K1’, ‘K5’, ‘K3’, ‘K4’, ‘K2’],
 ‘A’: [‘A0’, ‘A1’, ‘A5’, ‘A3’, ‘A4’, ‘A2’]},
 index=[0,1,2,3,4,5])

df2 = pd.DataFrame({‘key’: [‘K0’, ‘K1’, ‘K2’],
 ‘B’: [‘B0’, ‘B1’, ‘B2’]},
 index=[6,7,8])

df1.join(df2,lsuffix=’_l’,rsuffix=’_r’)

#df1 is left DF and df2 is right DF

Index values in both data frames are different, in the case of inner/equi join resultant data set will be empty but data is present from left DF (df1).

#inner join
df1.join(df2,lsuffix=’_l’,rsuffix=’_r’,how=’inner’)

#outer join
df1.join(df2,lsuffix=’_l’,rsuffix=’_r’,how=’outer’)

Case 2. join on columns

Data frames can be joined on columns as well, but as joins work on indexes, we need to convert the join key into the index and then perform join, rest every thin is similar.

#join on data frame column
df1.set_index(‘key1’).join(df2.set_index(‘key2’))

3. concat() is used for combining Data Frames across rows or columns.

create two data frames to understand how concat works. concat is a vertical operation.

Case 1. concat data frames on axis=0, default operation

import pandas as pd
m1 = pd.DataFrame({
 ‘Name’: [‘Alex’, ‘Amy’, ‘Allen’, ‘Alice’, ‘Ayoung’],
 ‘subject_id’:[‘sub1’,’sub2',’sub4',’sub6',’sub5'],
 ‘Marks_scored’:[98,90,87,69,78]},
 index=[1,2,3,4,5])

m2 = pd.DataFrame({
 ‘Name’: [‘Billy’, ‘Brian’, ‘Bran’, ‘Bryce’, ‘Betty’],
 ‘subject_id’:[‘sub2’,’sub4',’sub3',’sub6',’sub5'],
 ‘Marks_scored’:[89,80,79,97,88]},
 index=[4,5,6,7,8])

pd.concat([m1,m2])

Indexes of both the Data Frames are preserved in concat operation.

#ignore indexes from original data frames
pd.concat([m1,m2],ignore_index=True)

sequential numbers are given to index values unlike previous output

Case 2. concat operation on axis=1, horizontal operation

#axis=1 works as join operation
pd.concat([m1,m2],axis=1)

Case 3. concat unequal shape data frames

df1 = pd.DataFrame({‘A’:[1,2,3], ‘B’:[1,2,3]})
df2 = pd.DataFrame({‘A’:[4,5]})

If we try to perform concat operation on df1 and df2 with unequal columns or data frames of different shapes. while performing concatenation operation on unequal shape data frames, pandas updates value as NaN for missing values.

Pandas NaN are float type in nature so values of series B changed to float.

df = pd.concat([df1,df2],ignore_index=True)

4. append() combine data frames vertically fashion

create two data frames to understand how append works.

Case 1. appending data frames, duplicate index issue

m1 = pd.DataFrame({
 ‘Name’: [‘Alex’, ‘Amy’, ‘Allen’, ‘Alice’, ‘Ayoung’],
 ‘subject_id’:[‘sub1’,’sub2',’sub4',’sub6',’sub5'],
 ‘Marks_scored’:[98,90,87,69,78]},
 index=[1,2,3,4,5])

m2 = pd.DataFrame({
 ‘Name’: [‘Billy’, ‘Brian’, ‘Bran’, ‘Bryce’, ‘Betty’],
 ‘subject_id’:[‘sub2’,’sub4',’sub3',’sub6',’sub5'],
 ‘Marks_scored’:[89,80,79,97,88]},
 index=[1,2,3,4,5])

with overlapping indexes append function throws error:

ValueError: Indexes have overlapping values:

m1.append(m2,verify_integrity=False)
#verify_integrity=False is default argument
m1.append(m2)

#output will be similar for both above lines of code

Case 2. append data frames with unequal shapes

create two new data frames with different shapes

m1 = pd.DataFrame({
 ‘Name’: [‘Vivek’, ‘Vishakha’, ‘Ash’, ‘Natalie’, ‘Ayoung’],
 ‘subject_id’:[‘sub1’,’sub2',’sub4',’sub6',’sub5'],
 ‘Marks_scored’:[98,90,87,69,78],
 ‘Rank’:[1,3,6,20,13]},
 index=[1,2,3,4,5])

m2 = pd.DataFrame({
 ‘Name’: [‘Barak’, ‘Wayne’, ‘Saurav’, ‘Yuvraj’, ‘Suresh’],
 ‘subject_id’:[‘sub2’,’sub4',’sub3',’sub6',’sub5'],
 ‘Marks_scored’:[89,80,79,97,88],},
 index=[1,2,3,4,5])

m1.append(m2)

Summary:

· Pandas Data set combination operations to use cases

· Nature of every combination operation

· Pandas merge() , join() way of working and differences

· Pandas concat() , append() way of working and differences

Thanks to all for reading my blog and If you like my content and explanation please follow me on medium and your feedback will always help us to grow.

Thanks

Vivek Chaudhary

Combine datasets using Pandas merge(), join(), concat(), and append() was originally published in Towards AI — Multidisciplinary Science Journal on Medium, where people are continuing the conversation by highlighting and responding to this story.

Published via Towards AI

Frequently Used, Contextual References

Resources

Publication

Combine datasets using Pandas merge(), join(), concat() and append()

Author(s): Vivek Chaudhary

Summary:

Towards AI Team

Feedback ↓ Cancel reply

Popular posts

Best Laptops for Deep Learning, Machine Learning (ML), and Data Science for 2023

Best Workstations for Deep Learning, Data Science, and Machine Learning (ML) for 2022

Descriptive Statistics for Data-driven Decision Making with Python

Best Machine Learning (ML) Books - Free and Paid - Editorial Recommendations for 2022

Best Data Science Books - Free and Paid - Editorial Recommendations for 2022

Updates

Recent Posts

LAI #66: Information Theory for People in a Hurry

🔎 Decoding LLM Pipeline — Step 1: Input Processing & Tokenization

Meta to Launch Its Own In-House AI Chip

I Built an AI Money Coach in Python — Here’s How You Can Too (Step-by-Step Guide!)

ChatGPT Now Works Natively in Xcode and VS Code

The World’s Leading AI and Technology Publication.

Company

CONTACT US

🔥 Recommended Articles 🔥

Cookie	Duration	Description
cookielawinfo-checkbox-analytics	11 months	This cookie is set by GDPR Cookie Consent plugin. The cookie is used to store the user consent for the cookies in the category "Analytics".
cookielawinfo-checkbox-functional	11 months	The cookie is set by GDPR cookie consent to record the user consent for the cookies in the category "Functional".
cookielawinfo-checkbox-necessary	11 months	This cookie is set by GDPR Cookie Consent plugin. The cookies is used to store the user consent for the cookies in the category "Necessary".
cookielawinfo-checkbox-others	11 months	This cookie is set by GDPR Cookie Consent plugin. The cookie is used to store the user consent for the cookies in the category "Other.
cookielawinfo-checkbox-performance	11 months	This cookie is set by GDPR Cookie Consent plugin. The cookie is used to store the user consent for the cookies in the category "Performance".
viewed_cookie_policy	11 months	The cookie is set by the GDPR Cookie Consent plugin and is used to store whether or not user has consented to the use of cookies. It does not store any personal data.

Frequently Used, Contextual References

Resources

Publication

Combine datasets using Pandas merge(), join(), concat() and append()

Author(s): Vivek Chaudhary

Summary:

Towards AI Team

Related posts

Feedback ↓ Cancel reply

Popular posts

Updates

Recent Posts

The World’s Leading AI and Technology Publication.

Company

CONTACT US

GDPR CCPA Statement

Subscribe to our AI newsletter!

🔥 Recommended Articles 🔥