More often, data that is required for analytical purpose comes from different isolated data sources. We need to integrate these desperate data into a unified dataset useful for analytical and modelling purpose. The task for combining and integrating data is usually performed in the Data Warehouse (EDW) with ETL or ELT jobs. However, for specific use-cases we are required to performed additional joining and blending of data to achieve a specific unified dataset. Pandas provides us with various SQL-Like techniques to combine and blend data. In this post we will look at various techniques used for combining and joining data in Pandas. You can download the dataset used in this post from here.

pandas-logo

Concatenating Data Frames

This involves joining data frames along a specified axis (row-wise or column-wise).

Let’s create different data frames.

                    

df1 = pd.DataFrame(
    {
        "A": ["A0", "A1", "A2", "A3"],
        "B": ["B0", "B1", "B2", "B3"],
        "C": ["C0", "C1", "C2", "C3"],
        "D": ["D0", "D1", "D2", "D3"],
    }
)

df2 = pd.DataFrame(
    {
        "A": ["A4", "A5", "A6", "A7"],
        "B": ["B4", "B5", "B6", "B7"],
        "C": ["C4", "C5", "C6", "C7"],
        "D": ["D4", "D5", "D6", "D7"],
    },
)

df3 = pd.DataFrame(
    {
        "A": ["A8", "A9", "A10", "A11"],
        "B": ["B8", "B9", "B10", "B11"],
        "C": ["C8", "C9", "C10", "C11"],
        "D": ["D8", "D9", "D10", "D11"],
    },
)

df4 = pd.DataFrame(
    {
        "B": ["B2", "B3", "B6", "B7"],
        "D": ["D2", "D3", "D6", "D7"],
        "F": ["F2", "F3", "F6", "F7"],
    },
)

Stack dataframes on top of each other (row-wise/axis=0)

We set ignore_index=True to reset the indexes of two dataframes to a unified index

                    

# pd.concat([df1,df2], ignore_index=True) # Default concat function stacks dataframes on axis=0 (rowise)
pd.concat([df1,df2,df3], ignore_index=True,axis=0)

Pandas concat and set keys

                    

pd.concat([df1,df2,df3],axis=0,keys=['df1','df2','df3'])

Stack dataframes next to each other (column-wise/axis=1)

                    

pd.concat([df1,df4],axis=1)

Stack dataframes next to each other (column-wise/axis=1) with keys

                    

pd.concat([df1,df4],axis=1,keys=['df1','df4'])

Pandas append function

                    

df1.append([df2,df3],ignore_index=True)

Merging Data Frames

pandas has full-featured, high performance in-memory join operations idiomatically very similar to relational databases like SQL. These methods perform significantly better (in some cases well over an order of magnitude better) than other open source implementations (like base::merge.data.frame in R). The reason for this is careful algorithmic design and the internal layout of the data in DataFrame. Pandas Merge function provides an SQL-Like capability for joining two or more datasets.

Load Datasets

                    

hr_departments_df=pd.read_csv('hr_departments.csv')
hr_employees_df=pd.read_csv('hr_employees.csv')
print(hr_departments_df.head())
hr_employees_df.head()

Pandas merge function with keys

Let’s join Departments and Employee data with DEPARTMENT_ID as a common Column

                    

pd.merge(hr_employees_df,hr_departments_df,on=['DEPARTMENT_ID']).head()

Pandas merge function with keys and inner join criteria

Inner join is equivalent to SQL INNER JOIN. Results in intersection of data with only matching keys from both dataframes.

Let’s create the data frames first

                    

left = pd.DataFrame(
    {
        "key1": ["K0", "K0", "K1", "K2"],
        "key2": ["K0", "K1", "K0", "K1"],
        "A": ["A0", "A1", "A2", "A3"],
        "B": ["B0", "B1", "B2", "B3"],
    }
)

right = pd.DataFrame(
    {
        "key1": ["K0", "K1", "K1", "K2"],
        "key2": ["K0", "K0", "K0", "K0"],
        "C": ["C0", "C1", "C2", "C3"],
        "D": ["D0", "D1", "D2", "D3"],
    }
)

print(left)
right

Merge on left join

Left join resembles SQL LEFT OUTER JOIN where the resulting dataframe contains all the keys from left dataframe

                    

pd.merge(left,right,on=['key1','key2'],how='left')

Merge on right join

Left join resembles SQL RIGHT OUTER JOIN where the resulting dataframe contains all the keys from right dataframe

                    

pd.merge(left,right,on=['key1','key2'],how='right')

Merge on inner join

Inner join resembles SQL INNER JOIN resulting to intersection of keys from both dataframes

                    

pd.merge(left,right,on=['key1','key2'],how='inner')

Remove duplicates during joining

Removes duplicates to specified dataframe before joining

                    

pd.merge(left.drop_duplicates(),right,on=['key1','key2'],how='inner')

Validate join

We can validate our join criteria based on; one_to_one, one_to_many and many_to_many. If the condition is not satisfied it throws an error

                    

pd.merge(left.drop_duplicates(),right,on=['key1','key2'],how='inner',
validate='many_to_many')

Merge with outer join criteria

Outer join resembles SQL like FULL OUTER JOIN. Results in a union

                    

pd.merge(left,right,on=['key1','key2'],how='outer')

Merge with outer join criteria and indicator

The argument indicator=True creates a new column that describes the kind of join relationship in each records

                    

pd.merge(left,right,on=['key1','key2'],how='outer',indicator=True)

Check the complete jupyter notebook for the above functions here.

Conclusion

Joining data is critical in the world of analytics. It’s heavily done by data engineers, however, it’s a necessary skill for any data practitioner. Pandas provides us with powerful SQL-Like functions to join different data into a unified dataset. In the next post we will look at how to handle missing data in pandas. To learn about how to select, sort and filter data in pandas visit our previous post.

Combining Data in Pandas

Post navigation


0 0 votes
Article Rating
Subscribe
Notify of
guest
0 Comments
Inline Feedbacks
View all comments
0
Would love your thoughts, please comment.x
()
x