Missingn data is a data quality issue and a common problem that every data practitioner has to deal with. Missing data affects the accuracy of analysis and the performance of the model. Most machine learning algorithms are sensitive to missing data. In the data cleaning process checking and dealing with missing data is a common task. In Pandas missing data can be presented as NaN, Null or None. Pandas provides us with adequate tools to handle missing data during the data cleaning process. The strategies for handling missing data can as simple as deleting the rows or columns with missing values or as complex as developing a model to predict the value to be used to impute missing data. In this post we will look at different techniques to handle missing data in pandas. For extensive approaches to handle missing data refer to our post on Handling Missing Data.
Causes of Missing Data
There are many reasons that can results to missing values in a datasets; some of the most causes are as listed below;
- Data entry error
- Rare observations that cannot be found
- The species/individual under study can die or drop off before sampling concludes.
- Lack of response to the survey questions
It is important to understand the root cause of the missing data/value to prevent future occurrence or device a mechanism to handle it.
Techniques for Handling Missing Data in Pandas
There are various approaches to deal with missing values/data ranging from simple ones such as dropping the entire record to creating a sophisticated model for data imputation. Below are some of the most common techniques to deal with missing data in Pandas.
- Deleting records with missing values.
- Imputation. This involves filling the missing values with other values either manually determined or statistical computed value.
- Interpolation. This involves constructing new data points to fill in the missing values.
- Full Analysis. We use different approaches such as generative or discriminative approaches to impute missing data.
- Model-based techniques. We can develop a machine learning model to predict the probable value for the missing data point.
For more details on handling missing data points refer here.
Handling Missing Data
Import required libraries
import pandas as pd
import numpy as np
Create a sample data frame with missing values
missing_data_df = pd.DataFrame(
{
"Students": ["Tom", "Peter",np.nan, "Mary", "Tom","King","Tom","Mary",np.nan],
"Exam_Date": ["15/01/2021", "16/01/2021", "19/01/2021", "27/01/2021", "16/01/2021",
"16/01/2021", "16/01/2021", "16/01/2021", "16/01/2021"],
"Math": [79.00, 67.00,np.nan, 84.00, 70.00,np.nan,90.00,76.00,np.nan],
"Physics":[63.00, np.nan, 60.00, np.nan,84.00, 77.00,55.00,np.nan,66.00],
"Computer":[np.nan,78.00, 57.00, 88.00, 75.00,93.00,np.nan,70.00,np.nan],
}
)
missing_data_df
Show null values and return false if not null
missing_data_df.isnull().head()
Show count of null values
missing_data_df.isnull().sum()
Show not null values and return true if not null
missing_data_df.notnull().head()
Drop entire row with all values null
missing_data_df.dropna(how='all')
Drop column with any null value
missing_data_df.dropna(how='any',axis=1)
Drop rows with any of specified columns have null
missing_data_df.dropna(subset=['Math', 'Physics'], how='any')
Drop rows with all of specified columns have null
missing_data_df.dropna(subset=['Math', 'Physics'], how='all')
Drop row with a given number of null values
missing_data_df.dropna(axis=1,thresh=2)
Replace null values with a scalar value
missing_data_df.fillna(-999) # replace null values with -999
Backward Fill
missing_data_df.fillna(method='bfill')
Forward Fill
missing_data_df.fillna(method='ffill')
Replace missing value with a specific value
missing_data_df.replace(np.nan, 0)
Impute null value with statistical measures
missing_data_df.fillna(missing_data_df.Math.mean()) # fillna null value in Math column with mean of the Math
missing_data_df.fillna(missing_data_df.Students.mode()) # fillna null value in Students column with mode of the Students
missing_data_df.fillna(missing_data_df.Computer.median()) # fillna null value in Computer column with median of the Computer
Interpolate missing values
missing_data_df.interpolate(method='linear')
For complete code check in jupyter notebook here
Conclusion
Missing data is as a results of data quality issues. It affects the results of the analysis and performance of machine learning model as many machine learning algorithm are sensitive to missing data. To deal with missing values we can either impute the missing values with other values, drop records with missing values, interpolate or develop machine learning model that can predict the data point to fill in the missing value. In this post we have looked at what is the missing value, causes of missing value and how to handle it. In the next post we will learn about Pandas apply functions. To learn about Combining Data in Pandas visit our previous post here.