Data comes in many forms: numeric, text, images, video, or audio. Being comfortable working with different data formats is key for any data scientist or analytics practitioner. Pandas provides powerful functions for cleaning and analysing text data. Text analysis is a complex but important way of extracting insights from textual data. In this post we will look at the most common pandas string manipulation functions. You can download the dataset for this post here.
Pandas String Manipulation
Load Data
import pandas as pd
import numpy as np
titanic_df = pd.read_csv('titanic.csv')
titanic_df.head()
Convert text to Upper Case
titanic_df['Name'].str.upper().head()
Convert text to Lower Case
titanic_df['Name'].str.lower().head()
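A common reason to change case is to make comparisons case-insensitive. The sketch below uses a small made-up Series (an assumption for illustration, not the Titanic file) to show how lower-casing collapses spelling variants that differ only in case.

```python
import pandas as pd

# Hypothetical sample: the same name typed with three different casings
names = pd.Series(["Braund, Mr. Owen", "BRAUND, MR. OWEN", "braund, mr. owen"])

# Normalise everything to lower case before comparing or grouping
normalised = names.str.lower()

n_unique_raw = names.nunique()         # distinct values before normalising
n_unique_norm = normalised.nunique()   # distinct values after normalising
print(n_unique_raw, n_unique_norm)
```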
Calculate length of a string
titanic_df['Name'].str.len().head()
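String lengths are often used as a filter. As a rough illustration on made-up data (the sample names and the threshold of 5 are assumptions), the lengths can be turned into a boolean mask:

```python
import pandas as pd

# Hypothetical sample names, not taken from the Titanic file
names = pd.Series(["Ann", "Owen Harris", "Li"])

lengths = names.str.len()          # number of characters in each entry
long_names = names[lengths > 5]    # keep only entries longer than 5 characters
print(long_names)
```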
Strip white spaces
titanic_df['Name'].str.strip().head()
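The effect of strip is easier to see on values that actually carry stray whitespace. The following sketch uses an invented Series (an assumption, since the Titanic names may already be clean) and also shows the one-sided variants lstrip and rstrip.

```python
import pandas as pd

# Hypothetical messy values with leading/trailing spaces, tabs and newlines
raw = pd.Series(["  Braund ", "Heikkinen\t", "\nAllen"])

clean = raw.str.strip()        # remove whitespace from both ends
left_only = raw.str.lstrip()   # remove leading whitespace only
right_only = raw.str.rstrip()  # remove trailing whitespace only
print(clean.tolist())
```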
Split string
titanic_df['Name'].str.split(' ').head()
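Splitting becomes more useful when the pieces land in their own columns. The sketch below assumes names follow a "Surname, Title. Given names" layout (the sample values and column names are illustrative) and uses expand=True to get a DataFrame instead of lists.

```python
import pandas as pd

# Hypothetical names in "Surname, Title. Given names" form
names = pd.Series(["Braund, Mr. Owen Harris", "Heikkinen, Miss. Laina"])

# Split once on ", " and expand the two pieces into separate columns
parts = names.str.split(", ", n=1, expand=True)
parts.columns = ["surname", "rest"]

# Take everything before the first "." in the remainder as the title
parts["title"] = parts["rest"].str.split(".", n=1).str[0]
print(parts[["surname", "title"]])
```

Limiting the split with n=1 keeps any later commas or dots inside the remaining text rather than creating extra pieces.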
Convert strings to dummy (indicator) columns
titanic_df['Sex'].str.get_dummies().head()
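To make the result of get_dummies concrete, here is a small sketch on an invented column (the data and the DataFrame name df are assumptions). Each distinct string becomes its own 0/1 column, which can then be joined back to the original frame.

```python
import pandas as pd

# Hypothetical frame with a text category column
df = pd.DataFrame({"Sex": ["male", "female", "female"]})

# One indicator column per distinct value, in sorted order
dummies = df["Sex"].str.get_dummies()

# Attach the indicator columns next to the original data
encoded = pd.concat([df, dummies], axis=1)
print(encoded)
```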
Repeat text
titanic_df['Sex'].str.repeat(5).head()
Count occurrences of a pattern
titanic_df['Name'].str.count(r'Rev\.').sum()
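One thing to keep in mind is that str.count treats its argument as a regular expression, so an unescaped dot matches any character. The sketch below, on invented names, compares a loose pattern with an escaped one built with re.escape.

```python
import re

import pandas as pd

# Hypothetical names: only the first one carries the title "Rev."
names = pd.Series(["Byles, Rev. Thomas", "Revell, Mr. John", "Smith, Mrs. Reva"])

# Unescaped: "." matches any character, so "Reve" and "Reva" also count
loose_total = names.str.count("Rev.").sum()

# Escaped: the dot must be a literal "."
strict_total = names.str.count(re.escape("Rev.")).sum()
print(loose_total, strict_total)
```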
Search with startswith and endswith
titanic_df['Name'].str.startswith('Mrs.').sum()
titanic_df['Name'].str.endswith('y').sum()
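Unlike str.count, startswith and endswith compare plain text rather than a regular expression. The sketch below, on made-up values, also passes na=False so that missing names are treated as non-matches instead of producing missing results.

```python
import pandas as pd

# Hypothetical names, including one missing value
names = pd.Series(["Mrs. Allen", "Braund, Mr. Owen", None])

# Literal prefix/suffix tests; na=False turns missing entries into False
starts_mrs = names.str.startswith("Mrs.", na=False)
ends_owen = names.str.endswith("Owen", na=False)

print(starts_mrs.sum(), ends_owen.sum())
```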
Swap Cases
titanic_df['Name'].str.swapcase().head()
Filter data with substring
titanic_df['Name'].str.contains('Mr.', regex=False).head()
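The boolean Series returned by contains can be used directly as a row filter. By default the pattern is a regular expression, so "Mr." also matches "Mrs"; passing regex=False matches the text literally. A minimal sketch on invented rows:

```python
import pandas as pd

# Hypothetical frame with three passengers
df = pd.DataFrame(
    {"Name": ["Braund, Mr. Owen", "Cumings, Mrs. John", "Heikkinen, Miss. Laina"]}
)

# Regex match: "." stands for any character, so "Mrs" is matched too
loose_mask = df["Name"].str.contains("Mr.")

# Literal match: only names containing the exact text "Mr."
literal_mask = df["Name"].str.contains("Mr.", regex=False)

# Use the mask to keep only the matching rows
mr_only = df[literal_mask]
print(mr_only)
```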
Replace string
titanic_df['Sex'].str.replace('female','F').head()
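Because replace works on substrings, the order of chained replacements matters when one value contains another ("female" contains "male"). The sketch below uses an invented Series to show both the pitfall and a safe ordering; regex=False is passed so the text is matched literally.

```python
import pandas as pd

# Hypothetical category values
sex = pd.Series(["male", "female", "female"])

# Pitfall: replacing "male" first also rewrites the tail of "female"
wrong = sex.str.replace("male", "M", regex=False)

# Safer: replace the longer value first, then the shorter one
short = (
    sex.str.replace("female", "F", regex=False)
       .str.replace("male", "M", regex=False)
)
print(wrong.tolist(), short.tolist())
```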
Check if text is lower or upper case
titanic_df['Name'].str.islower().head()
titanic_df['Name'].str.isupper().head()
Check if value is numeric
titanic_df['Name'].str.isnumeric().head()
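Note that isnumeric checks whether every character is a numeric character, so strings with a decimal point or a minus sign return False. If the goal is to find values that can be parsed as numbers, pd.to_numeric with errors="coerce" is one alternative. A sketch on invented values:

```python
import pandas as pd

# Hypothetical mixed text values
values = pd.Series(["22", "38.5", "Braund", "-7"])

# True only when all characters are numeric characters
all_numeric_chars = values.str.isnumeric()

# Parse what can be parsed; anything else becomes NaN
parsed = pd.to_numeric(values, errors="coerce")
parseable = parsed.notna()
print(all_numeric_chars.tolist(), parseable.tolist())
```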
For the complete code, see the Jupyter notebook here.
Conclusion
In this post we have looked at the most common text manipulation functions in pandas. As the volume and variety of data being generated grow, understanding how to work with text data is critical. In the next post we will look at how to work with dates in pandas. To learn how to reshape data in pandas, including pivot tables, cross-tabulation, melting, and stacking, check our previous post here.