Combine datasets using Pandas merge(), join(), concat() and append()
Last Updated on October 29, 2020 by Editorial Team
Author(s): Vivek Chaudhary
In the world of Data Bases, Joins and Unions are the most critical and frequently performed operations. Almost every other query is an amalgamation of either a join or a union. Using Pandas we perform similar kinds of stuff while working on a Data Science algorithm or any ETL (Extract Transform and Load) project, joins and unions are critical here as well.
Just a little difference between join and unions before jumping onto the use cases of both. Both join and union are used to combine data sets, however, the result set of a join is a horizontal combination of the dataset where a result set of a union is a vertical combination of data set.
In Pandas for a horizontal combination we have merge() and join(), whereas for vertical combination we can use concat() and append(). Merge and join perform similar tasks but internally they have some differences, similar to concat and append. And in this blog, I had tried to list out the differences in the nature of these methods.
- merge() is used for combining data on common columns orΒ indices.
create two data frames and build an understanding of how merge works.
import pandas as pd
d1 = {βIdβ: [βA1β, βA2β, βA3β, βA4β,βA5'], βNameβ:[βVivekβ, βRahulβ, βGauravβ, βAnkitβ,βVishakhaβ], βAgeβ:[27, 24, 22, 32, 28],} d2 = {βIdβ: [βA1β, βA2β, βA3β, βA4β], βAddressβ:[βDelhiβ, βGurgaonβ, βNoidaβ, βPuneβ], βQualificationβ:[βBtechβ, βB.Aβ, βBcomβ, βB.honsβ]}
df1=pd.DataFrame(d1) df2=pd.DataFrame(d2)
Case 1. merging data on common columnsΒ βIdβ
#Inner Join pd.merge(df1,df2) #simple merge with no additional arguments performs an inner/equi join equivalent to data base join operation
pd.merge(df1,df2, how='inner) #produces output similar as above, as pandas merge by default is an equi join
#Left Join pd.merge(df1,df2,how=βleftβ)
#matching and non matching records from left DF which is df1 is present in result data frame
#Right Join pd.merge(df1,df2,how=βrightβ)
#matching and non matching records from right DF, df2 will come in result df #as of now we dont have any non matching record in right df
#outer join pd.merge(df1,df2,how=βouterβ)
#all the matching and non matching records are available in resultant dataset from both data frames
Case 2: merge data frames with no commonΒ columns
#changed the merging key in data frame d2
import pandas as pd
d1 = {βIdβ: [βA1β, βA2β, βA3β, βA4β,βA5'], βNameβ:[βVivekβ, βRahulβ, βGauravβ, βAnkitβ,βVishakhaβ], βAgeβ:[27, 24, 22, 32, 28],} d2 = {βIDβ: [βA1β, βA2β, βA3β, βA4β], βAddressβ:[βDelhiβ, βGurgaonβ, βNoidaβ, βPuneβ], βQualificationβ:[βBtechβ, βB.Aβ, βBcomβ, βB.honsβ]}
df1=pd.DataFrame(d1) df2=pd.DataFrame(d2)
#lets check when merging keys are different ('Id' and 'ID') pd.merge(df1,df2)
merging result in belowΒ error:
MergeError: No common columns to perform mergeΒ on.
to overcome the merge error, we can use pandas argument βleft_onβ and βright_onβ to explicitly indicate pandas on what key columns we want to merge data frames, rest everything remainsΒ similar.
#df1 key column 'Id' #df2 key column 'ID'
pd.merge(df1,df2,left_on=βIdβ,right_on=βIdβ,how=βleftβ)
2. join() is used for combining data on a key column or anΒ index.
create two data frames and build an understanding of how join works.
import pandas as pd df1 = pd.DataFrame({βkeyβ: [βK0β, βK1β, βK5β, βK3β, βK4β, βK2β], βAβ: [βA0β, βA1β, βA5β, βA3β, βA4β, βA2β]}) df2 = pd.DataFrame({βkeyβ: [βK0β, βK1β, βK2β], βBβ: [βB0β, βB1β, βB2β]})
Case 1. join onΒ indexes
By default, pandas join operation is performed on indexes both data frames have default indexes values, so no need to specify any join key, join will implicitly be performed on indexes.
#default nature of pandas join is left outer join df1.join(df2, lsuffix=β_lβ, rsuffix=β_rβ)
If there are overlapping columns in pandas join, it throws an error :
ValueError: columns overlap but no suffix specified: Index([βkeyβ], dtype=βobjectβ)
With the help of the below use case try to understand the default nature of pandas join which is left outer join.
Create two data frames with different indexΒ values
df1 = pd.DataFrame({βkeyβ: [βK0β, βK1β, βK5β, βK3β, βK4β, βK2β], βAβ: [βA0β, βA1β, βA5β, βA3β, βA4β, βA2β]}, index=[0,1,2,3,4,5])
df2 = pd.DataFrame({βkeyβ: [βK0β, βK1β, βK2β], βBβ: [βB0β, βB1β, βB2β]}, index=[6,7,8])
df1.join(df2,lsuffix=β_lβ,rsuffix=β_rβ)
#df1 is left DF and df2 is right DF
Index values in both data frames are different, in the case of inner/equi join resultant data set will be empty but data is present from left DF (df1).
#inner join df1.join(df2,lsuffix=β_lβ,rsuffix=β_rβ,how=βinnerβ)
#outer join df1.join(df2,lsuffix=β_lβ,rsuffix=β_rβ,how=βouterβ)
Case 2. join onΒ columns
Data frames can be joined on columns as well, but as joins work on indexes, we need to convert the join key into the index and then perform join, rest every thin is similar.
#join on data frame column df1.set_index(βkey1β).join(df2.set_index(βkey2β))
3. concat() is used for combining Data Frames across rows orΒ columns.
create two data frames to understand how concat works. concat is a vertical operation.
Case 1. concat data frames on axis=0, default operation
import pandas as pd m1 = pd.DataFrame({ βNameβ: [βAlexβ, βAmyβ, βAllenβ, βAliceβ, βAyoungβ], βsubject_idβ:[βsub1β,βsub2',βsub4',βsub6',βsub5'], βMarks_scoredβ:[98,90,87,69,78]}, index=[1,2,3,4,5])
m2 = pd.DataFrame({ βNameβ: [βBillyβ, βBrianβ, βBranβ, βBryceβ, βBettyβ], βsubject_idβ:[βsub2β,βsub4',βsub3',βsub6',βsub5'], βMarks_scoredβ:[89,80,79,97,88]}, index=[4,5,6,7,8])
pd.concat([m1,m2])
Indexes of both the Data Frames are preserved in concat operation.
#ignore indexes from original data frames pd.concat([m1,m2],ignore_index=True)
sequential numbers are given to index values unlike previousΒ output
Case 2. concat operation on axis=1, horizontal operation
#axis=1 works as join operation pd.concat([m1,m2],axis=1)
Case 3. concat unequal shape dataΒ frames
df1 = pd.DataFrame({βAβ:[1,2,3], βBβ:[1,2,3]}) df2 = pd.DataFrame({βAβ:[4,5]})
If we try to perform concat operation on df1 and df2 with unequal columns or data frames of different shapes. while performing concatenation operation on unequal shape data frames, pandas updates value as NaN for missingΒ values.
Pandas NaN are float type in nature so values of series B changed toΒ float.
df = pd.concat([df1,df2],ignore_index=True)
4. append() combine data frames vertically fashion
create two data frames to understand how appendΒ works.
Case 1. appending data frames, duplicate indexΒ issue
m1 = pd.DataFrame({ βNameβ: [βAlexβ, βAmyβ, βAllenβ, βAliceβ, βAyoungβ], βsubject_idβ:[βsub1β,βsub2',βsub4',βsub6',βsub5'], βMarks_scoredβ:[98,90,87,69,78]}, index=[1,2,3,4,5])
m2 = pd.DataFrame({ βNameβ: [βBillyβ, βBrianβ, βBranβ, βBryceβ, βBettyβ], βsubject_idβ:[βsub2β,βsub4',βsub3',βsub6',βsub5'], βMarks_scoredβ:[89,80,79,97,88]}, index=[1,2,3,4,5])
with overlapping indexes append function throwsΒ error:
ValueError: Indexes have overlapping values:
m1.append(m2,verify_integrity=False) #verify_integrity=False is default argument m1.append(m2)
#output will be similar for both above lines of code
Case 2. append data frames with unequalΒ shapes
create two new data frames with different shapes
m1 = pd.DataFrame({ βNameβ: [βVivekβ, βVishakhaβ, βAshβ, βNatalieβ, βAyoungβ], βsubject_idβ:[βsub1β,βsub2',βsub4',βsub6',βsub5'], βMarks_scoredβ:[98,90,87,69,78], βRankβ:[1,3,6,20,13]}, index=[1,2,3,4,5])
m2 = pd.DataFrame({ βNameβ: [βBarakβ, βWayneβ, βSauravβ, βYuvrajβ, βSureshβ], βsubject_idβ:[βsub2β,βsub4',βsub3',βsub6',βsub5'], βMarks_scoredβ:[89,80,79,97,88],}, index=[1,2,3,4,5])
m1.append(m2)
Summary:
Β· Pandas Data set combination operations to use cases
Β· Nature of every combination operation
Β· Pandas merge()Β , join() way of working and differences
Β· Pandas concat()Β , append() way of working and differences
Thanks to all for reading my blog and If you like my content and explanation please follow me on medium and your feedback will always help us toΒ grow.
Thanks
Vivek Chaudhary
Combine datasets using Pandas merge(), join(), concat(), and append() was originally published in Towards AIβββMultidisciplinary Science Journal on Medium, where people are continuing the conversation by highlighting and responding to this story.
Published via Towards AI