More often we need to change the way our data appears to perform deeper analyses. To achieve this Pandas gives us various simple techniques that allows us to reshape and transform our data accordingly. We can change the columns to become rows and rows to be columns, we can aggregate and summarise our data into an easier to understand data. In this post we are going to look at various techniques for reshaping DataFrame which includes; Transposing, Pivot, Pivot Tables, Stacking, Unstacking and melt. You can download the datasets used in this post from here.
Import Data
import pandas as pd
import numpy as np
gdp_df=pd.read_csv("gdp.csv")
gdp_df.head()
Pandas Transpose
The pandas transpose function reflects the DataFrame over its main diagonal by writing rows as columns and vice-versa. The property T is an accessor to the method transpose().
gdp_df.T
Pandas Groupby
The groupby() function in pandas enables us to split the data on a certain criteria (categories), apply a specific function and combine the results. It enables us to efficiently aggregate and summarize data.
Groupby continents
groups=gdp_df.groupby(['continent'])
groups
Iterate through groups
for i,j in groups:
print(i)
# print(j)
Select groups
groups.get_group('Asia').head()
Groupby with aggregation
# gdp_df.groupby(['continent']).aggregate(np.sum) # first approach
gdp_df.groupby(['continent']).sum() #second approach
Groupby and passing multiple aggregation
gdp_df.groupby(['continent']).agg([np.sum,np.mean,np.std]).reset_index()
Groupby with aggregation and set index
# gdp_df.groupby(['continent'],as_index=False).sum() #first approach
gdp_df.groupby(['continent']).sum().reset_index() #second approach
Pandas Pivot
Reshape dataframe from long format to wide format. Pivot function does not deal with repeated/duplicated values in index and also does not accept aggregate. Due to this drawbacks we opt to use Pivot Table.
Read Data
# Read data
continental_temperature_df=pd.read_csv('continental_temperature.csv')
# Add Temperature in Farenheit
continental_temperature_df['AvgTemperature_FH']=continental_temperature_df['AvgTemperature']*(9/5)+32
continental_temperature_df.head()
Pivot data by Region
continental_temperature_df.pivot(index='Year',columns='Region',values=['AvgTemperature']).head()
Pivot with more than one measure
continental_temperature_df.pivot(index='Year',columns='Region',
values=['AvgTemperature','AvgTemperature_FH']).head()
Pandas Pivot Table
Pivot Table function creates a spreadsheet-style pivot table as a DataFrame. It is similar to pivot function but with more capabilities of handling duplicates values and also allows us to use various aggregate functions.
Pivot Table with one aggregation
pd.pivot_table(continental_temperature_df,index=['Year'],columns=['Region'],values='AvgTemperature',aggfunc=np.mean).head()
Pivot Table with Totals
pd.pivot_table(continental_temperature_df,index=['Year'],columns=['Region'],values='AvgTemperature',
aggfunc=np.mean,margins=True,margins_name='Mean')
Pivot Table with multiple aggregations
pd.pivot_table(continental_temperature_df,index=['Year'],columns=['Region'],values=['AvgTemperature','AvgTemperature_FH'],
aggfunc={'AvgTemperature':np.mean,'AvgTemperature_FH':[np.min,np.max,np.mean,np.std]}).head()
pd.pivot_table(continental_temperature_df,index=['Year'],columns=['Region'],values=['AvgTemperature','AvgTemperature_FH'],
aggfunc={'AvgTemperature':np.mean,'AvgTemperature_FH':[np.min,np.max,np.mean,np.std]}).head().T
Pandas Data Stack
Return a reshaped DataFrame or Series having a multi-level index with one or more new inner-most levels compared to the current DataFrame. The new inner-most levels are created by pivoting the columns of the current dataframe. For more details refer here.
Let’s prepare a data frame to stack
# Let's first create a dataframe that we can stack
continental_temp_df=pd.pivot_table(continental_temperature_df,index=['Year'],columns=['Region'],
values='AvgTemperature',aggfunc=np.mean)
continental_temp_df.head()
Stack data by Year and Region
# Let's now stack the above dataframe with Year and Region
continental_temp_df.stack(0).reset_index().head()
Pandas Data Unstack
Returns a DataFrame having a new level of column labels whose inner-most level consists of the pivoted index labels. https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.unstack.html#pandas.DataFrame.unstack
Unstack data by Region
stacked_df=gdp_df.groupby(['continent']).sum()
stacked_df.head()
Pandas Melt
Unpivots a DataFrame from wide format to long format.
data frame to Melt
continental_temp_df.reset_index().head()
Melt the above DataFrame
year_region_temp_df=pd.melt(continental_temp_df.reset_index().head(),id_vars=['Year'],value_vars=['Asia','Africa','Europe'],
value_name='Temperature_C')
year_region_temp_df
Pandas Cross Tabulation
Pandas cross tabulation function allows us to summarize data similar to Pivot and Pivot_table. By default it gives the count or frequency of occurrence between values of two different columns. It also gives us the capability to use other aggregation functions such as sum, meant etc.
Data Frame for crosstab
# This is our original data before applying crosstab function
year_region_temp_df.head()
Cross Tabulation on the Region and Year Columns
'''Let's apply crosstab function and round the results to 1 decimal place. Also note that we are using mean as
our aggregation function '''
pd.crosstab(year_region_temp_df['Year'],year_region_temp_df['Region'],
values=year_region_temp_df['Temperature_C'],aggfunc='mean').round(2)
Adding Totals to crosstab table
pd.crosstab(year_region_temp_df['Year'],year_region_temp_df['Region'],values=year_region_temp_df['Temperature_C'],
aggfunc='mean', margins=True, margins_name='Mean').round(2)
Cross Tabulation with Normalization
Data to normalize
titanic_df=pd.read_csv('titanic.csv')
titanic_df.head()
Simple crosstab
'''Let's use titanic dataset and count the number of passenger who survived and the distribution of there gender.'''
pd.crosstab(titanic_df['Survived'],titanic_df['Sex'])
Crosstab with Totals
'''Now let's add total and see how it looks.'''
pd.crosstab(titanic_df['Survived'],titanic_df['Sex'], margins=True, margins_name='Total')
Crosstab with Normalization Rows and Columns
'''Let's normalize and multiply by 100%. Normalize argument computes the percent contribution of each value
to the total. For example for female and 0 survived we take (81/887)*100 which we get 9.131905'''
pd.crosstab(titanic_df['Survived'],titanic_df['Sex'], margins=True, margins_name='Total',normalize=True)*100
Crosstab and Normalize a cross rows
'''Let's normalize rowise by setting normalize='index'. For example to get the first value of female and 0 survived we take
(81/545)*100 to get 14.862385'''
pd.crosstab(titanic_df['Survived'],titanic_df['Sex'], margins=True, margins_name='Total',normalize='index')*100
Crosstab and Normalize a cross columns
'''Let's normalize rowise by setting normalize='columns'. For example to get the first value of female and 0 survived we take
(81/314)*100 to get 25.79617834394904 '''
pd.crosstab(titanic_df['Survived'],titanic_df['Sex'], margins=True, margins_name='Total',normalize='columns')*100
Add Grouping to our Crosstab function
'''Now let's add total ans see how it looks.'''
pd.crosstab([titanic_df['Survived'],titanic_df['Pclass']],[titanic_df['Sex'],titanic_df['Siblings/Spouses Aboard']],
rownames=['Survival','Passenger Class'], colnames=['Gender','Siblings Aboard'], margins=True, margins_name='Total')
For complete code check the jupyter notebook here.
Conclusion
In this post we have looked at various ways to reshape our data from Transposing, Pivot, Pivot Table, Stacking, Unstacking, Melting and Cross-Tabulation. By reshaping our data we can perform complex analyses on it. In the next post we will look at String manipulation in pandas. To learn about apply functions in pandas check our previous post here.