More often, data that is required for analytical purpose comes from different isolated data sources. We need to integrate these desperate data into a unified dataset useful for analytical and modelling purpose. The task for combining and integrating data is usually performed in the Data Warehouse (EDW) with ETL or ELT jobs. However, for specific use-cases we are required to performed additional joining and blending of data to achieve a specific unified dataset. Pandas provides us with various SQL-Like techniques to combine and blend data. In this post we will look at various techniques used for combining and joining data in Pandas. You can download the dataset used in this post from here.


Concatenating Data Frames

This involves joining data frames along a specified axis (row-wise or column-wise).

Let’s create different data frames.


df1 = pd.DataFrame(
        "A": ["A0", "A1", "A2", "A3"],
        "B": ["B0", "B1", "B2", "B3"],
        "C": ["C0", "C1", "C2", "C3"],
        "D": ["D0", "D1", "D2", "D3"],

df2 = pd.DataFrame(
        "A": ["A4", "A5", "A6", "A7"],
        "B": ["B4", "B5", "B6", "B7"],
        "C": ["C4", "C5", "C6", "C7"],
        "D": ["D4", "D5", "D6", "D7"],

df3 = pd.DataFrame(
        "A": ["A8", "A9", "A10", "A11"],
        "B": ["B8", "B9", "B10", "B11"],
        "C": ["C8", "C9", "C10", "C11"],
        "D": ["D8", "D9", "D10", "D11"],

df4 = pd.DataFrame(
        "B": ["B2", "B3", "B6", "B7"],
        "D": ["D2", "D3", "D6", "D7"],
        "F": ["F2", "F3", "F6", "F7"],

Stack dataframes on top of each other (row-wise/axis=0)

We set ignore_index=True to reset the indexes of two dataframes to a unified index


# pd.concat([df1,df2], ignore_index=True) # Default concat function stacks dataframes on axis=0 (rowise)
pd.concat([df1,df2,df3], ignore_index=True,axis=0)

Pandas concat and set keys



Stack dataframes next to each other (column-wise/axis=1)



Stack dataframes next to each other (column-wise/axis=1) with keys



Pandas append function



Merging Data Frames

pandas has full-featured, high performance in-memory join operations idiomatically very similar to relational databases like SQL. These methods perform significantly better (in some cases well over an order of magnitude better) than other open source implementations (like in R). The reason for this is careful algorithmic design and the internal layout of the data in DataFrame. Pandas Merge function provides an SQL-Like capability for joining two or more datasets.

Load Datasets



Pandas merge function with keys

Let’s join Departments and Employee data with DEPARTMENT_ID as a common Column



Pandas merge function with keys and inner join criteria

Inner join is equivalent to SQL INNER JOIN. Results in intersection of data with only matching keys from both dataframes.

Let’s create the data frames first


left = pd.DataFrame(
        "key1": ["K0", "K0", "K1", "K2"],
        "key2": ["K0", "K1", "K0", "K1"],
        "A": ["A0", "A1", "A2", "A3"],
        "B": ["B0", "B1", "B2", "B3"],

right = pd.DataFrame(
        "key1": ["K0", "K1", "K1", "K2"],
        "key2": ["K0", "K0", "K0", "K0"],
        "C": ["C0", "C1", "C2", "C3"],
        "D": ["D0", "D1", "D2", "D3"],


Merge on left join

Left join resembles SQL LEFT OUTER JOIN where the resulting dataframe contains all the keys from left dataframe



Merge on right join

Left join resembles SQL RIGHT OUTER JOIN where the resulting dataframe contains all the keys from right dataframe



Merge on inner join

Inner join resembles SQL INNER JOIN resulting to intersection of keys from both dataframes



Remove duplicates during joining

Removes duplicates to specified dataframe before joining



Validate join

We can validate our join criteria based on; one_to_one, one_to_many and many_to_many. If the condition is not satisfied it throws an error



Merge with outer join criteria

Outer join resembles SQL like FULL OUTER JOIN. Results in a union



Merge with outer join criteria and indicator

The argument indicator=True creates a new column that describes the kind of join relationship in each records



Check the complete jupyter notebook for the above functions here.


Joining data is critical in the world of analytics. It’s heavily done by data engineers, however, it’s a necessary skill for any data practitioner. Pandas provides us with powerful SQL-Like functions to join different data into a unified dataset. In the next post we will look at how to handle missing data in pandas. To learn about how to select, sort and filter data in pandas visit our previous post.

Combining Data in Pandas

Post navigation

0 0 votes
Article Rating
Notify of
Inline Feedbacks
View all comments
Would love your thoughts, please comment.x