Data comes in many forms: numeric, text, images, video and audio. Being comfortable with different data formats is key for any data scientist or analytics practitioner. Pandas provides powerful functions for cleaning and analysing text data, all of them available through the .str accessor on a Series. Text analysis is a complex but important way of extracting insights from textual data. In this post we will look at the most common pandas string manipulation functions. You can download the dataset for this post here.


Pandas String Manipulation

Load Data

                    

import pandas as pd
import numpy as np

titanic_df=pd.read_csv('titanic.csv')
titanic_df.head()


Convert text to Upper Case

                    

titanic_df['Name'].str.upper().head()


Convert text to Lower Case

                    

titanic_df['Name'].str.lower().head()
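Case conversion is often used to normalise inconsistent labels before counting or comparing them. The following is a minimal sketch of that idea; the `labels` Series is made up for illustration and is not part of the Titanic data.

```python
import pandas as pd

# Hypothetical labels with inconsistent casing (not taken from the dataset)
labels = pd.Series(["Male", "MALE", "male", "Female"])

# Lower-casing first means all spellings of the same label are counted together
normalised = labels.str.lower()
counts = normalised.value_counts().to_dict()
print(counts)
```

Without the `str.lower()` step, each spelling would be counted as a separate category.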


Calculate length of a string

                    

titanic_df['Name'].str.len().head()


Strip white spaces

                    

titanic_df['Name'].str.strip().head()
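To see what stripping actually removes, it can help to compare string lengths before and after. This is a small sketch on a made-up Series (the values are assumptions, chosen to contain spaces, a newline and a tab); it also shows `str.lstrip` and `str.rstrip`, which trim only one side.

```python
import pandas as pd

# Hypothetical values with stray whitespace (illustrative only)
raw = pd.Series(["  Braund ", "Allen\n", "\tMoran"])

stripped = raw.str.strip()      # removes leading and trailing whitespace
left_only = raw.str.lstrip()    # removes leading whitespace only
right_only = raw.str.rstrip()   # removes trailing whitespace only

# Comparing lengths before and after shows how many characters were removed
print(raw.str.len().tolist())
print(stripped.str.len().tolist())
```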


Split string

                    

titanic_df['Name'].str.split(' ').head()
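Splitting returns a list per row, which is awkward to work with directly. A common follow-up is to pass `expand=True` so each part becomes its own column. The sketch below assumes names follow a "Surname, Title. Given names" layout; the sample values are illustrative and the variable names are hypothetical.

```python
import pandas as pd

# Hypothetical sample in a "Surname, Title. Given names" layout
names = pd.Series([
    "Braund, Mr. Owen Harris",
    "Heikkinen, Miss. Laina",
    "Allen, Mr. William Henry",
])

# Split once on the comma; expand=True returns one column per part
parts = names.str.split(",", n=1, expand=True)
surname = parts[0].str.strip()
rest = parts[1].str.strip()

# The title is the first space-separated token of the remainder
title = rest.str.split(" ").str[0]

print(surname.tolist())
print(title.tolist())
```

The `.str[0]` indexing picks the first element of each list produced by the second split.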


Convert strings to categorical numbers

                    

titanic_df['Sex'].str.get_dummies().head()
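`str.get_dummies` creates one indicator column per distinct value and can also split multi-label strings on a separator. A minimal sketch on made-up data (both Series below are assumptions for illustration):

```python
import pandas as pd

# Hypothetical single-label column
sex = pd.Series(["male", "female", "female"])
dummies = sex.str.get_dummies()  # one 0/1 column per distinct value

# Hypothetical multi-label column; labels are separated by "|"
tags = pd.Series(["a|b", "a", "b|c"])
tag_dummies = tags.str.get_dummies(sep="|")

print(dummies)
print(tag_dummies)
```

The resulting columns are ordered alphabetically, so `female` comes before `male` in the first result.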


Repeat text

                    

titanic_df['Sex'].str.repeat(5).head()


Count occurrence of certain words

                    

# The pattern is a regex, so escape the dot to count the literal title 'Rev.'
titanic_df['Name'].str.count(r'Rev\.').sum()
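One thing to watch for: `str.count` interprets its pattern as a regular expression, so an unescaped `.` matches any character. The sketch below contrasts an unescaped and an escaped pattern on a made-up Series (the names are illustrative, not rows from the dataset).

```python
import re
import pandas as pd

# Hypothetical names; "Revell" contains "Reve", which the loose pattern also matches
names = pd.Series(["Byles, Rev. Thomas", "Revell, Mr. John", "Smith, Mrs. Ann"])

loose = names.str.count("Rev.")              # '.' matches any character
strict = names.str.count(re.escape("Rev."))  # '.' must be a literal dot

print(loose.tolist(), int(loose.sum()))
print(strict.tolist(), int(strict.sum()))
```

`re.escape` is a convenient way to build a literal pattern when the search text contains regex metacharacters.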


String search startswith and endswith

                    

titanic_df['Name'].str.startswith('Mrs.').sum()
titanic_df['Name'].str.endswith('y').sum()
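Both methods return a boolean Series, and summing it counts the `True` values. Unlike `str.count` and `str.contains`, they compare literal text rather than a regular expression. A small sketch on made-up names (assumed values, for illustration only):

```python
import pandas as pd

# Hypothetical names (not rows from the dataset)
names = pd.Series(["Braund, Mr. Owen Harris", "Mrs. Smith", "Kelly, Mr. James"])

starts = names.str.startswith("Mrs.")  # literal prefix test
ends = names.str.endswith("s")         # literal suffix test

# Summing a boolean Series counts the rows where the test is True
n_starts = int(starts.sum())
n_ends = int(ends.sum())
print(n_starts, n_ends)
```

If names are stored surname-first, a prefix test on a title will rarely match, in which case `str.contains` is the more suitable check.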


Swap Cases

                    

titanic_df['Name'].str.swapcase().head()


Filter data with substring

                    

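# regex=False matches the literal text 'Mr.'; as a regex the dot would also match 'Mrs'
titanic_df['Name'].str.contains('Mr.', regex=False).head()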
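`str.contains` returns a boolean mask, which can be used to select the matching rows. By default the pattern is treated as a regular expression, so `'Mr.'` also matches `'Mrs'`. The sketch below uses a made-up DataFrame (an assumption, including the missing value) to show the difference and how `na=False` handles missing names.

```python
import pandas as pd

# Hypothetical frame with one missing name
df = pd.DataFrame({
    "Name": ["Braund, Mr. Owen", "Cumings, Mrs. John", None, "Heikkinen, Miss. Laina"]
})

# Regex match: '.' is a wildcard, so "Mrs" is matched as well
regex_mask = df["Name"].str.contains("Mr.", na=False)

# Literal match: only the exact text "Mr." counts; missing values become False
literal_mask = df["Name"].str.contains("Mr.", regex=False, na=False)

# Use the mask to keep only the matching rows
mr_only = df[literal_mask]
print(mr_only)
```

Passing `na=False` keeps the mask purely boolean, which is what row selection with `df[mask]` expects.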


Replace string

                    

titanic_df['Sex'].str.replace('female','F').head()
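When several replacements are chained, order matters, because `str.replace` substitutes substrings rather than whole values. A minimal sketch on a made-up Series (assumed values) showing the pitfall and two ways around it:

```python
import pandas as pd

# Hypothetical values (illustrative only)
sex = pd.Series(["female", "male", "female"])

# Pitfall: "male" is a substring of "female", so this mangles the female rows
wrong = sex.str.replace("male", "M", regex=False)

# Option 1: replace the longer string first
ordered = (
    sex.str.replace("female", "F", regex=False)
       .str.replace("male", "M", regex=False)
)

# Option 2: anchor the pattern so only whole values match
anchored = sex.str.replace(r"^male$", "M", regex=True)

print(wrong.tolist())
print(ordered.tolist())
print(anchored.tolist())
```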


Check if text is lower or upper

                    

titanic_df['Name'].str.islower().head()
titanic_df['Name'].str.isupper().head()


Check if value is numeric

                    

titanic_df['Name'].str.isnumeric().head()
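Note that `str.isnumeric` checks whether every character is a numeric character, so a decimal string such as `'38.5'` returns `False`. If the goal is to test whether a value can be parsed as a number, `pd.to_numeric` with `errors='coerce'` is one alternative. A small sketch on made-up values (assumptions, for illustration):

```python
import pandas as pd

# Hypothetical string values (illustrative only)
values = pd.Series(["22", "38.5", "abc"])

# True only when every character is numeric; the dot makes "38.5" fail
all_numeric_chars = values.str.isnumeric()

# Unparseable values become NaN, so notna() marks the parseable ones
parseable = pd.to_numeric(values, errors="coerce").notna()

print(all_numeric_chars.tolist())
print(parseable.tolist())
```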


For the complete code, check the Jupyter notebook here.

Conclusion

In this post we have looked at the most common text manipulation functions in pandas. As the volume and variety of generated data grow, understanding how to work with text data is critical. In the next post we will look at how to work with dates in pandas. To learn how to reshape data in pandas, including pivot tables, cross-tabulation, melting and stacking, check our previous post here.
