More often than not, the data required for analysis comes from different, isolated sources. We need to integrate these disparate data sources into a unified dataset suitable for analysis and modelling. Combining and integrating data is usually performed in an Enterprise Data Warehouse (EDW) with ETL or ELT jobs. However, for specific use cases we are required to perform additional joining and blending of data to produce a specific unified dataset. Pandas provides us with various SQL-like techniques to combine and blend data. In this post we will look at various techniques for combining and joining data in Pandas. You can download the dataset used in this post from here.
Concatenating Data Frames
This involves joining data frames along a specified axis (row-wise or column-wise).
Let’s create different data frames.
import pandas as pd

df1 = pd.DataFrame(
{
"A": ["A0", "A1", "A2", "A3"],
"B": ["B0", "B1", "B2", "B3"],
"C": ["C0", "C1", "C2", "C3"],
"D": ["D0", "D1", "D2", "D3"],
}
)
df2 = pd.DataFrame(
{
"A": ["A4", "A5", "A6", "A7"],
"B": ["B4", "B5", "B6", "B7"],
"C": ["C4", "C5", "C6", "C7"],
"D": ["D4", "D5", "D6", "D7"],
},
)
df3 = pd.DataFrame(
{
"A": ["A8", "A9", "A10", "A11"],
"B": ["B8", "B9", "B10", "B11"],
"C": ["C8", "C9", "C10", "C11"],
"D": ["D8", "D9", "D10", "D11"],
},
)
df4 = pd.DataFrame(
{
"B": ["B2", "B3", "B6", "B7"],
"D": ["D2", "D3", "D6", "D7"],
"F": ["F2", "F3", "F6", "F7"],
},
)
Stack dataframes on top of each other (row-wise/axis=0)
We set ignore_index=True to discard the original indexes and give the concatenated dataframe a single continuous index
# pd.concat([df1,df2], ignore_index=True) # By default, concat stacks dataframes along axis=0 (row-wise)
pd.concat([df1,df2,df3], ignore_index=True,axis=0)
Pandas concat and set keys
pd.concat([df1,df2,df3],axis=0,keys=['df1','df2','df3'])
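The keys argument labels each source frame in a hierarchical (MultiIndex) row index, which lets you recover any individual frame later with .loc. A minimal sketch with small illustrative frames:

```python
import pandas as pd

df1 = pd.DataFrame({"A": ["A0", "A1"], "B": ["B0", "B1"]})
df2 = pd.DataFrame({"A": ["A2", "A3"], "B": ["B2", "B3"]})

# keys become the outer level of a MultiIndex on the rows
combined = pd.concat([df1, df2], keys=["df1", "df2"])

# Select only the rows that originally came from df2
print(combined.loc["df2"])
```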
Stack dataframes next to each other (column-wise/axis=1)
pd.concat([df1,df4],axis=1)
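Note that column-wise concat aligns rows by index label, not by position; where the indexes do not overlap, pandas fills the gaps with NaN. A small sketch with hypothetical frames that share only index label 1:

```python
import pandas as pd

a = pd.DataFrame({"A": ["A0", "A1"]}, index=[0, 1])
b = pd.DataFrame({"B": ["B2", "B3"]}, index=[1, 2])

# Rows are matched on index labels; label 0 has no "B" value and
# label 2 has no "A" value, so those cells become NaN
side_by_side = pd.concat([a, b], axis=1)
print(side_by_side)
```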
Stack dataframes next to each other (column-wise/axis=1) with keys
pd.concat([df1,df4],axis=1,keys=['df1','df4'])
Pandas append function
Note that DataFrame.append was deprecated in pandas 1.4 and removed in pandas 2.0; prefer pd.concat in current versions.
df1.append([df2,df3],ignore_index=True)
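Since DataFrame.append was removed in pandas 2.0, the same result is obtained with pd.concat. A sketch using small illustrative frames:

```python
import pandas as pd

df1 = pd.DataFrame({"A": ["A0", "A1"]})
df2 = pd.DataFrame({"A": ["A2", "A3"]})
df3 = pd.DataFrame({"A": ["A4", "A5"]})

# Equivalent of the removed df1.append([df2, df3], ignore_index=True)
stacked = pd.concat([df1, df2, df3], ignore_index=True)
print(stacked)
```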
Merging Data Frames
Pandas has full-featured, high-performance in-memory join operations that are idiomatically very similar to those of relational databases like SQL. These methods perform significantly better (in some cases well over an order of magnitude better) than other open-source implementations (such as base::merge.data.frame in R). The reason for this is careful algorithmic design and the internal layout of the data in a DataFrame. The Pandas merge function provides an SQL-like capability for joining two or more datasets.
Load Datasets
hr_departments_df=pd.read_csv('hr_departments.csv')
hr_employees_df=pd.read_csv('hr_employees.csv')
print(hr_departments_df.head())
hr_employees_df.head()
Pandas merge function with keys
Let’s join Departments and Employee data with DEPARTMENT_ID as a common Column
pd.merge(hr_employees_df,hr_departments_df,on=['DEPARTMENT_ID']).head()
Pandas merge function with keys and inner join criteria
Inner join is equivalent to SQL INNER JOIN. It results in the intersection of the data, keeping only rows whose keys match in both dataframes.
Let’s create the data frames first
left = pd.DataFrame(
{
"key1": ["K0", "K0", "K1", "K2"],
"key2": ["K0", "K1", "K0", "K1"],
"A": ["A0", "A1", "A2", "A3"],
"B": ["B0", "B1", "B2", "B3"],
}
)
right = pd.DataFrame(
{
"key1": ["K0", "K1", "K1", "K2"],
"key2": ["K0", "K0", "K0", "K0"],
"C": ["C0", "C1", "C2", "C3"],
"D": ["D0", "D1", "D2", "D3"],
}
)
print(left)
right
Merge on left join
Left join resembles SQL LEFT OUTER JOIN, where the resulting dataframe contains all the keys from the left dataframe
pd.merge(left,right,on=['key1','key2'],how='left')
Merge on right join
Right join resembles SQL RIGHT OUTER JOIN, where the resulting dataframe contains all the keys from the right dataframe
pd.merge(left,right,on=['key1','key2'],how='right')
Merge on inner join
Inner join resembles SQL INNER JOIN, resulting in the intersection of keys from both dataframes
pd.merge(left,right,on=['key1','key2'],how='inner')
Remove duplicates during joining
Removes duplicates from the specified dataframe before joining
pd.merge(left.drop_duplicates(),right,on=['key1','key2'],how='inner')
Validate join
We can validate our join relationship with one of: one_to_one, one_to_many, many_to_one or many_to_many. If the condition is not satisfied, merge raises a MergeError
pd.merge(left.drop_duplicates(),right,on=['key1','key2'],how='inner',
validate='many_to_many')
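To see the validation actually fail, consider a sketch where the keys are unique on the left but duplicated on the right: one_to_many passes, while one_to_one raises pandas.errors.MergeError (the frames below are invented for illustration):

```python
import pandas as pd

left_df = pd.DataFrame({"key": ["K0", "K1"], "A": ["A0", "A1"]})
right_df = pd.DataFrame({"key": ["K0", "K0", "K1"], "C": ["C0", "C1", "C2"]})

# Left keys are unique, right keys are duplicated: one_to_many is satisfied
ok = pd.merge(left_df, right_df, on="key", validate="one_to_many")

# one_to_one requires uniqueness on both sides, so this raises MergeError
try:
    pd.merge(left_df, right_df, on="key", validate="one_to_one")
except pd.errors.MergeError as exc:
    print("validation failed:", exc)
```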
Merge with outer join criteria
Outer join resembles SQL FULL OUTER JOIN. It results in the union of keys from both dataframes
pd.merge(left,right,on=['key1','key2'],how='outer')
Merge with outer join criteria and indicator
The argument indicator=True adds a _merge column that describes the source of each record (left_only, right_only or both)
pd.merge(left,right,on=['key1','key2'],how='outer',indicator=True)
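The _merge column makes it easy to express an anti-join, i.e. rows in one frame with no match in the other. A sketch with small illustrative frames:

```python
import pandas as pd

left_df = pd.DataFrame({"key": ["K0", "K1", "K2"], "A": ["A0", "A1", "A2"]})
right_df = pd.DataFrame({"key": ["K1", "K3"], "C": ["C1", "C3"]})

merged = pd.merge(left_df, right_df, on="key", how="outer", indicator=True)

# Anti-join: keep only left-frame rows that had no match on the right
left_only = merged[merged["_merge"] == "left_only"].drop(columns=["C", "_merge"])
print(left_only)
```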
Check the complete jupyter notebook for the above functions here.
Conclusion
Joining data is critical in the world of analytics. It is mostly done by data engineers; however, it is a necessary skill for any data practitioner. Pandas provides us with powerful SQL-like functions to join different data sources into a unified dataset. In the next post we will look at how to handle missing data in pandas. To learn how to select, sort and filter data in pandas, visit our previous post.