More often we need to change the way our data appears to perform deeper analyses. To achieve this Pandas gives us various simple techniques that allows us to reshape and transform our data accordingly. We can change the columns to become rows and rows to be columns, we can aggregate and summarise our data into an easier to understand data. In this post we are going to look at various techniques for reshaping DataFrame which includes; Transposing, Pivot, Pivot Tables, Stacking, Unstacking and melt. You can download the datasets used in this post from here.

pandas-logo

Import Data

                    

import pandas as pd
import numpy as np

gdp_df=pd.read_csv("gdp.csv")
gdp_df.head()

pandas-reshape-read-data

Pandas Transpose

The pandas transpose function reflects the DataFrame over its main diagonal by writing rows as columns and vice-versa. The property T is an accessor to the method transpose().

pandas-reshape-transpose-data

Pandas Groupby

The groupby() function in pandas enables us to split the data on a certain criteria (categories), apply a specific function and combine the results. It enables us to efficiently aggregate and summarize data.

Groupby continents

                    

groups=gdp_df.groupby(['continent'])
groups

pandas-reshape-group-data

Iterate through groups

                    

for i,j in groups:
    print(i)
#     print(j)

pandas-reshape-groups-data

Select groups

                    

groups.get_group('Asia').head()

pandas-reshape-select-groups-data

Groupby with aggregation

                    

# gdp_df.groupby(['continent']).aggregate(np.sum) # first approach
gdp_df.groupby(['continent']).sum() #second approach

pandas-reshape-groupby-with-aggregation-data

Groupby and passing multiple aggregation

                    

gdp_df.groupby(['continent']).agg([np.sum,np.mean,np.std]).reset_index()

pandas-reshape-groupby-with-mutiple-aggregation-data

Groupby with aggregation and set index

                    

# gdp_df.groupby(['continent'],as_index=False).sum() #first approach
gdp_df.groupby(['continent']).sum().reset_index() #second approach

pandas-reshape-groupby-with-aggregation-and-set-index-data

Pandas Pivot

Reshape dataframe from long format to wide format. Pivot function does not deal with repeated/duplicated values in index and also does not accept aggregate. Due to this drawbacks we opt to use Pivot Table.

Read Data

                    

# Read data
continental_temperature_df=pd.read_csv('continental_temperature.csv')
# Add Temperature in Farenheit
continental_temperature_df['AvgTemperature_FH']=continental_temperature_df['AvgTemperature']*(9/5)+32

continental_temperature_df.head()

pandas-reshape-pivot-read-data

Pivot data by Region

                    

continental_temperature_df.pivot(index='Year',columns='Region',values=['AvgTemperature']).head()

pandas-reshape-pivot-pivot-by-region-data

Pivot with more than one measure

                    

continental_temperature_df.pivot(index='Year',columns='Region',
values=['AvgTemperature','AvgTemperature_FH']).head()

pandas-reshape-pivot-pivot-with-more-measures-data

Pandas Pivot Table

Pivot Table function creates a spreadsheet-style pivot table as a DataFrame. It is similar to pivot function but with more capabilities of handling duplicates values and also allows us to use various aggregate functions.

Pivot Table with one aggregation

                    

pd.pivot_table(continental_temperature_df,index=['Year'],columns=['Region'],values='AvgTemperature',aggfunc=np.mean).head()

pandas-reshape-pivot-table-with-aggregation-data

Pivot Table with Totals

                    

pd.pivot_table(continental_temperature_df,index=['Year'],columns=['Region'],values='AvgTemperature',
               aggfunc=np.mean,margins=True,margins_name='Mean')

pandas-reshape-pivot-table-with-totals

Pivot Table with multiple aggregations

                    

pd.pivot_table(continental_temperature_df,index=['Year'],columns=['Region'],values=['AvgTemperature','AvgTemperature_FH'],
               aggfunc={'AvgTemperature':np.mean,'AvgTemperature_FH':[np.min,np.max,np.mean,np.std]}).head()

pandas-reshape-pivot-pivot-with-more-aggregations

                    

pd.pivot_table(continental_temperature_df,index=['Year'],columns=['Region'],values=['AvgTemperature','AvgTemperature_FH'],
               aggfunc={'AvgTemperature':np.mean,'AvgTemperature_FH':[np.min,np.max,np.mean,np.std]}).head().T

pandas-reshape-pivot-table-with-more-aggregation-transpose

Pandas Data Stack

Return a reshaped DataFrame or Series having a multi-level index with one or more new inner-most levels compared to the current DataFrame. The new inner-most levels are created by pivoting the columns of the current dataframe. For more details refer here.

Let’s prepare a data frame to stack

                    

# Let's first create a dataframe that we can stack
continental_temp_df=pd.pivot_table(continental_temperature_df,index=['Year'],columns=['Region'],
                                   values='AvgTemperature',aggfunc=np.mean)
continental_temp_df.head()

pandas-reshape-stack-read-data

Stack data by Year and Region

                    

# Let's now stack the above dataframe with Year and Region
continental_temp_df.stack(0).reset_index().head()

pandas-reshape-stack-data-by-region

Pandas Data Unstack

Returns a DataFrame having a new level of column labels whose inner-most level consists of the pivoted index labels. https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.unstack.html#pandas.DataFrame.unstack

Unstack data by Region

                    

stacked_df=gdp_df.groupby(['continent']).sum()
stacked_df.head()

pandas-reshape-unstack-data-by-region

Pandas Melt

Unpivots a DataFrame from wide format to long format.

data frame to Melt

                    

continental_temp_df.reset_index().head()

pandas-reshape-melt-dataframe-to-melt

Melt the above DataFrame

                    

year_region_temp_df=pd.melt(continental_temp_df.reset_index().head(),id_vars=['Year'],value_vars=['Asia','Africa','Europe'],
        value_name='Temperature_C')
year_region_temp_df

pandas-reshape-melt-dataframe

Pandas Cross Tabulation

Pandas cross tabulation function allows us to summarize data similar to Pivot and Pivot_table. By default it gives the count or frequency of occurrence between values of two different columns. It also gives us the capability to use other aggregation functions such as sum, meant etc.

Data Frame for crosstab

                    

# This is our original data before applying crosstab function
year_region_temp_df.head()

pandas-reshape-crosstab-read-data

Cross Tabulation on the Region and Year Columns

                    

'''Let's apply crosstab function and round the results to 1 decimal place. Also note that we are using mean as
    our aggregation function '''
pd.crosstab(year_region_temp_df['Year'],year_region_temp_df['Region'],
            values=year_region_temp_df['Temperature_C'],aggfunc='mean').round(2)

pandas-reshape-crosstab-by-region-year

Adding Totals to crosstab table

                    

pd.crosstab(year_region_temp_df['Year'],year_region_temp_df['Region'],values=year_region_temp_df['Temperature_C'],
            aggfunc='mean', margins=True, margins_name='Mean').round(2)

pandas-reshape-crosstab-with-totals

Cross Tabulation with Normalization

Data to normalize

                    

titanic_df=pd.read_csv('titanic.csv')
titanic_df.head()

pandas-reshape-crosstab-normalization-data

Simple crosstab

                    

'''Let's use titanic dataset and count the number of passenger who survived and the distribution of there gender.'''
pd.crosstab(titanic_df['Survived'],titanic_df['Sex'])

pandas-reshape-crosstab-simple-data

Crosstab with Totals

                    

'''Now let's add total and see how it looks.'''
pd.crosstab(titanic_df['Survived'],titanic_df['Sex'], margins=True, margins_name='Total')

pandas-reshape-crosstab-with-totals-titanic

Crosstab with Normalization Rows and Columns

                    

'''Let's normalize and multiply by 100%. Normalize argument computes the percent contribution of each value
    to the total. For example for female and 0 survived we take (81/887)*100 which we get 9.131905'''
pd.crosstab(titanic_df['Survived'],titanic_df['Sex'], margins=True, margins_name='Total',normalize=True)*100

pandas-reshape-crosstab-normalization-all-data

Crosstab and Normalize a cross rows

                    

'''Let's normalize rowise by setting normalize='index'. For example to get the first value of female and 0 survived we take
    (81/545)*100 to get 14.862385'''   
pd.crosstab(titanic_df['Survived'],titanic_df['Sex'], margins=True, margins_name='Total',normalize='index')*100

pandas-reshape-crosstab-normalization-index

Crosstab and Normalize a cross columns

                    

'''Let's normalize rowise by setting normalize='columns'. For example to get the first value of female and 0 survived we take
    (81/314)*100 to get 25.79617834394904 '''     
pd.crosstab(titanic_df['Survived'],titanic_df['Sex'], margins=True, margins_name='Total',normalize='columns')*100

pandas-reshape-crosstab-normalization-column

Add Grouping to our Crosstab function

                    

'''Now let's add total ans see how it looks.'''
pd.crosstab([titanic_df['Survived'],titanic_df['Pclass']],[titanic_df['Sex'],titanic_df['Siblings/Spouses Aboard']], 
            rownames=['Survival','Passenger Class'], colnames=['Gender','Siblings Aboard'], margins=True, margins_name='Total')

pandas-reshape-crosstab-with-groupby

For complete code check the jupyter notebook here.

Conclusion

In this post we have looked at various ways to reshape our data from Transposing, Pivot, Pivot Table, Stacking, Unstacking, Melting and Cross-Tabulation. By reshaping our data we can perform complex analyses on it. In the next post we will look at String manipulation in pandas. To learn about apply functions in pandas check our previous post here.

Reshaping DataFrame

Post navigation


0 0 votes
Article Rating
Subscribe
Notify of
guest
0 Comments
Inline Feedbacks
View all comments
0
Would love your thoughts, please comment.x
()
x