Name: Towards AI Legal Name: Towards AI, Inc. Description: Towards AI is the world's leading artificial intelligence (AI) and technology publication. Read by thought-leaders and decision-makers around the world. Phone Number: +1-650-246-9381 Email: [email protected]
228 Park Avenue South New York, NY 10003 United States
Website: Publisher: https://towardsai.net/#publisher Diversity Policy: https://towardsai.net/about Ethics Policy: https://towardsai.net/about Masthead: https://towardsai.net/about
Name: Towards AI Legal Name: Towards AI, Inc. Description: Towards AI is the world's leading artificial intelligence (AI) and technology publication. Founders: Roberto Iriondo, , Job Title: Co-founder and Advisor Works for: Towards AI, Inc. Follow Roberto: X, LinkedIn, GitHub, Google Scholar, Towards AI Profile, Medium, ML@CMU, FreeCodeCamp, Crunchbase, Bloomberg, Roberto Iriondo, Generative AI Lab, Generative AI Lab Denis Piffaretti, Job Title: Co-founder Works for: Towards AI, Inc. Louie Peters, Job Title: Co-founder Works for: Towards AI, Inc. Louis-François Bouchard, Job Title: Co-founder Works for: Towards AI, Inc. Cover:
Towards AI Cover
Logo:
Towards AI Logo
Areas Served: Worldwide Alternate Name: Towards AI, Inc. Alternate Name: Towards AI Co. Alternate Name: towards ai Alternate Name: towardsai Alternate Name: towards.ai Alternate Name: tai Alternate Name: toward ai Alternate Name: toward.ai Alternate Name: Towards AI, Inc. Alternate Name: towardsai.net Alternate Name: pub.towardsai.net
5 stars – based on 497 reviews

Frequently Used, Contextual References

TODO: Remember to copy unique IDs whenever it needs used. i.e., URL: 304b2e42315e

Resources

Unlock the full potential of AI with Building LLMs for Productionβ€”our 470+ page guide to mastering LLMs with practical projects and expert insights!

Publication

Combine datasets using Pandas merge(), join(), concat() and append()
Data Science

Combine datasets using Pandas merge(), join(), concat() and append()

Last Updated on October 29, 2020 by Editorial Team

Author(s): Vivek Chaudhary

Free stock photo of abstract, access, background
Source: Pexels

In the world of Data Bases, Joins and Unions are the most critical and frequently performed operations. Almost every other query is an amalgamation of either a join or a union. Using Pandas we perform similar kinds of stuff while working on a Data Science algorithm or any ETL (Extract Transform and Load) project, joins and unions are critical here as well.

Just a little difference between join and unions before jumping onto the use cases of both. Both join and union are used to combine data sets, however, the result set of a join is a horizontal combination of the dataset where a result set of a union is a vertical combination of data set.

In Pandas for a horizontal combination we have merge() and join(), whereas for vertical combination we can use concat() and append(). Merge and join perform similar tasks but internally they have some differences, similar to concat and append. And in this blog, I had tried to list out the differences in the nature of these methods.

  1. merge() is used for combining data on common columns orΒ indices.

create two data frames and build an understanding of how merge works.

import pandas as pd
d1 = {β€˜Id’: [β€˜A1’, β€˜A2’, β€˜A3’, β€˜A4’,’A5'],
 β€˜Name’:[β€˜Vivek’, β€˜Rahul’, β€˜Gaurav’, β€˜Ankit’,’Vishakha’], 
 β€˜Age’:[27, 24, 22, 32, 28],} 
 
 
d2 = {β€˜Id’: [β€˜A1’, β€˜A2’, β€˜A3’, β€˜A4’],
 β€˜Address’:[β€˜Delhi’, β€˜Gurgaon’, β€˜Noida’, β€˜Pune’], 
 β€˜Qualification’:[β€˜Btech’, β€˜B.A’, β€˜Bcom’, β€˜B.hons’]}
df1=pd.DataFrame(d1)
df2=pd.DataFrame(d2)

Case 1. merging data on common columnsΒ β€˜Id’

#Inner Join
pd.merge(df1,df2)
#simple merge with no additional arguments performs an inner/equi join equivalent to data base join operation
pd.merge(df1,df2, how='inner)
#produces output similar as above, as pandas merge by default is an equi join
merge inner
#Left Join
pd.merge(df1,df2,how=’left’)
#matching and non matching records from left DF which is df1 is present in result data frame
merge left
#Right Join
pd.merge(df1,df2,how=’right’)
#matching and non matching records from right DF, df2 will come in result df
#as of now we dont have any non matching record in right df
merge right
#outer join
pd.merge(df1,df2,how=’outer’)
#all the matching and non matching records are available in resultant dataset from both data frames
merge fullΒ outer

Case 2: merge data frames with no commonΒ columns

#changed the merging key in data frame d2
import pandas as pd
d1 = {β€˜Id’: [β€˜A1’, β€˜A2’, β€˜A3’, β€˜A4’,’A5'],
 β€˜Name’:[β€˜Vivek’, β€˜Rahul’, β€˜Gaurav’, β€˜Ankit’,’Vishakha’], 
 β€˜Age’:[27, 24, 22, 32, 28],} 
 
 
d2 = {β€˜ID’: [β€˜A1’, β€˜A2’, β€˜A3’, β€˜A4’],
 β€˜Address’:[β€˜Delhi’, β€˜Gurgaon’, β€˜Noida’, β€˜Pune’], 
 β€˜Qualification’:[β€˜Btech’, β€˜B.A’, β€˜Bcom’, β€˜B.hons’]}
df1=pd.DataFrame(d1)
df2=pd.DataFrame(d2)
#lets check when merging keys are different ('Id' and 'ID')
pd.merge(df1,df2)

merging result in belowΒ error:

MergeError: No common columns to perform mergeΒ on.

to overcome the merge error, we can use pandas argument β€˜left_on’ and β€˜right_on’ to explicitly indicate pandas on what key columns we want to merge data frames, rest everything remainsΒ similar.

#df1 key column 'Id'
#df2 key column 'ID'
pd.merge(df1,df2,left_on=’Id’,right_on=’Id’,how=’left’)
merge uncommonΒ keys

2. join() is used for combining data on a key column or anΒ index.

create two data frames and build an understanding of how join works.

import pandas as pd
df1 = pd.DataFrame({β€˜key’: [β€˜K0’, β€˜K1’, β€˜K5’, β€˜K3’, β€˜K4’, β€˜K2’],
 β€˜A’: [β€˜A0’, β€˜A1’, β€˜A5’, β€˜A3’, β€˜A4’, β€˜A2’]})
df2 = pd.DataFrame({β€˜key’: [β€˜K0’, β€˜K1’, β€˜K2’],
 β€˜B’: [β€˜B0’, β€˜B1’, β€˜B2’]})

Case 1. join onΒ indexes

By default, pandas join operation is performed on indexes both data frames have default indexes values, so no need to specify any join key, join will implicitly be performed on indexes.

#default nature of pandas join is left outer join
df1.join(df2, lsuffix=’_l’, rsuffix=’_r’)
join left

If there are overlapping columns in pandas join, it throws an error :

ValueError: columns overlap but no suffix specified: Index([β€˜key’], dtype=’object’)

With the help of the below use case try to understand the default nature of pandas join which is left outer join.

Create two data frames with different indexΒ values

df1 = pd.DataFrame({β€˜key’: [β€˜K0’, β€˜K1’, β€˜K5’, β€˜K3’, β€˜K4’, β€˜K2’],
 β€˜A’: [β€˜A0’, β€˜A1’, β€˜A5’, β€˜A3’, β€˜A4’, β€˜A2’]},
 index=[0,1,2,3,4,5])
df2 = pd.DataFrame({β€˜key’: [β€˜K0’, β€˜K1’, β€˜K2’],
 β€˜B’: [β€˜B0’, β€˜B1’, β€˜B2’]},
 index=[6,7,8])
df1.join(df2,lsuffix=’_l’,rsuffix=’_r’)
#df1 is left DF and df2 is right DF
non-matching indexes

Index values in both data frames are different, in the case of inner/equi join resultant data set will be empty but data is present from left DF (df1).

#inner join
df1.join(df2,lsuffix=’_l’,rsuffix=’_r’,how=’inner’)
join inner
#outer join
df1.join(df2,lsuffix=’_l’,rsuffix=’_r’,how=’outer’)
join outer

Case 2. join onΒ columns

Data frames can be joined on columns as well, but as joins work on indexes, we need to convert the join key into the index and then perform join, rest every thin is similar.

#join on data frame column
df1.set_index(β€˜key1’).join(df2.set_index(β€˜key2’))

3. concat() is used for combining Data Frames across rows orΒ columns.

create two data frames to understand how concat works. concat is a vertical operation.

Case 1. concat data frames on axis=0, default operation

import pandas as pd
m1 = pd.DataFrame({
 β€˜Name’: [β€˜Alex’, β€˜Amy’, β€˜Allen’, β€˜Alice’, β€˜Ayoung’],
 β€˜subject_id’:[β€˜sub1’,’sub2',’sub4',’sub6',’sub5'],
 β€˜Marks_scored’:[98,90,87,69,78]},
 index=[1,2,3,4,5])
m2 = pd.DataFrame({
 β€˜Name’: [β€˜Billy’, β€˜Brian’, β€˜Bran’, β€˜Bryce’, β€˜Betty’],
 β€˜subject_id’:[β€˜sub2’,’sub4',’sub3',’sub6',’sub5'],
 β€˜Marks_scored’:[89,80,79,97,88]},
 index=[4,5,6,7,8])
pd.concat([m1,m2])

Indexes of both the Data Frames are preserved in concat operation.

concat
#ignore indexes from original data frames
pd.concat([m1,m2],ignore_index=True)

sequential numbers are given to index values unlike previousΒ output

index ignored

Case 2. concat operation on axis=1, horizontal operation

#axis=1 works as join operation
pd.concat([m1,m2],axis=1)
concat asΒ join

Case 3. concat unequal shape dataΒ frames

df1 = pd.DataFrame({β€˜A’:[1,2,3], β€˜B’:[1,2,3]})
df2 = pd.DataFrame({β€˜A’:[4,5]})

If we try to perform concat operation on df1 and df2 with unequal columns or data frames of different shapes. while performing concatenation operation on unequal shape data frames, pandas updates value as NaN for missingΒ values.

Pandas NaN are float type in nature so values of series B changed toΒ float.

df = pd.concat([df1,df2],ignore_index=True)
concat unequal shapeΒ DF

4. append() combine data frames vertically fashion

create two data frames to understand how appendΒ works.

Case 1. appending data frames, duplicate indexΒ issue

m1 = pd.DataFrame({
 β€˜Name’: [β€˜Alex’, β€˜Amy’, β€˜Allen’, β€˜Alice’, β€˜Ayoung’],
 β€˜subject_id’:[β€˜sub1’,’sub2',’sub4',’sub6',’sub5'],
 β€˜Marks_scored’:[98,90,87,69,78]},
 index=[1,2,3,4,5])
m2 = pd.DataFrame({
 β€˜Name’: [β€˜Billy’, β€˜Brian’, β€˜Bran’, β€˜Bryce’, β€˜Betty’],
 β€˜subject_id’:[β€˜sub2’,’sub4',’sub3',’sub6',’sub5'],
 β€˜Marks_scored’:[89,80,79,97,88]},
 index=[1,2,3,4,5])

with overlapping indexes append function throwsΒ error:

ValueError: Indexes have overlapping values:

m1.append(m2,verify_integrity=False)
#verify_integrity=False is default argument
m1.append(m2)
#output will be similar for both above lines of code
append

Case 2. append data frames with unequalΒ shapes

create two new data frames with different shapes

m1 = pd.DataFrame({
 β€˜Name’: [β€˜Vivek’, β€˜Vishakha’, β€˜Ash’, β€˜Natalie’, β€˜Ayoung’],
 β€˜subject_id’:[β€˜sub1’,’sub2',’sub4',’sub6',’sub5'],
 β€˜Marks_scored’:[98,90,87,69,78],
 β€˜Rank’:[1,3,6,20,13]},
 index=[1,2,3,4,5])
m2 = pd.DataFrame({
 β€˜Name’: [β€˜Barak’, β€˜Wayne’, β€˜Saurav’, β€˜Yuvraj’, β€˜Suresh’],
 β€˜subject_id’:[β€˜sub2’,’sub4',’sub3',’sub6',’sub5'],
 β€˜Marks_scored’:[89,80,79,97,88],},
 index=[1,2,3,4,5])
m1.append(m2)

Summary:

Β· Pandas Data set combination operations to use cases

Β· Nature of every combination operation

Β· Pandas merge()Β , join() way of working and differences

Β· Pandas concat()Β , append() way of working and differences

Thanks to all for reading my blog and If you like my content and explanation please follow me on medium and your feedback will always help us toΒ grow.

Thanks

Vivek Chaudhary


Combine datasets using Pandas merge(), join(), concat(), and append() was originally published in Towards AIβ€Šβ€”β€ŠMultidisciplinary Science Journal on Medium, where people are continuing the conversation by highlighting and responding to this story.

Published via Towards AI

Feedback ↓