One of the most powerful capability of pandas is to slice based on different criteria. Pandas provides us with different approaches to slice data in both series and dataframe data structures to return the required subset. These approaches include selection by label, position or both. After selecting specific subset of rows and columns we can then sort our data based a specific column/s. This gives us the power to arrange the data in the form we require. Another powerful functionality in pandas is the one to filter data based on a defined criteria. Pandas provides us with SQL-Like functions to filter our data for analytical purpose. In this post we will learn how to use select, sort and filter data in Pandas. You can download the dataset used in this post here. For complete cone check my jupyter notebook here.
Selecting Data in Pandas
In Pandas we can select data to return only a subset of specific rows and columns to work with. We can select data by label or position. Below are different approaches to select data from pandas dataframe.
Select specific columns and all records
# Select one column only titanic_df['Survived'] # Use . syntax if the column name don't have spaces titanic_df.Survived # Select more than one column titanic_df[['Survived','Pclass','Age']]
Select data by position
titanic_df.iloc[:10] # select first 10 records titanic_df.iloc[-10:] # select last 10 records titanic_df.iloc[5:8] # select records from 5 to 8 records. Including 5 excluding 8 titanic_df.iloc[0:-10] # select all data excluding last 10 records titanic_df.iloc[0:,0:2] # select first two columns with all records titanic_df.iloc[0:,3:5] # select all rows and only columns from 3 to 5. titanic_df.iloc[:5,0:-1] # select first 5 records and all columns excluding last 1 column
Sorting Data in Pandas
Pandas gives us flexibility to order our data either ascending, descending or a combination of both in one or more columns. Below are different ways we can sort our data in pandas.
Sort data ascending
Sort data by two columns
titanic_df.sort_values(by=['Age','Fare'], ascending=[True,False]).head() # Sort age by ascending and then Fare by descending
Filtering Data in Pandas
The filtering capabilities enables us to apply logical conditions to slice the data in different ways. Pandas provides SQL-Like functions and mathematical operators to filter our data.
Filter data where age > 50
Filter data where age is between 30 and 40
titanic_df[(titanic_df['Age']>30) & (titanic_df['Age']<40)].head()
Using IN function
Using NOT IN
Slicing and dicing data is a core skill of a data practitioner. Luckily Pandas provides us with all the powerful tools to perform these task. We can select a small subset of data of our interest by applying slicing, sorting and filtering conditions to the dataframe. In the next post we will learn how to reshape data with techniques and approaches such as pivot, pivot_table, cross-tabulation, stacking, unstacking and melt. To learn more about pandas data types and there conversions check our previous post here.