Missing data is a common data quality issue that every data practitioner has to deal with. It affects the accuracy of an analysis and the performance of a model, and many machine learning algorithms cannot work with missing values directly. Checking for and dealing with missing data is therefore a routine part of the data cleaning process. Missing data can appear as NaN, NULL or None. The strategies for handling it can be as simple as deleting the rows or columns with missing values, or as complex as developing a model that predicts the value used to impute the missing data. In this post we will look at different techniques for handling missing data and how to implement them in Python.


Causes of Missing Data

Missing values can arise in a dataset for many reasons. Some of the most common causes are listed below:

  1. Data entry error
  2. Rare observations that cannot be found
  3. The species/individual under study can die or drop off before sampling concludes
  4. Lack of response to the survey questions

It is important to understand the root cause of the missing data, either to prevent it from recurring or to devise a mechanism to handle it.

Types of Missing Data

  1. Missing Completely at Random (MCAR). The probability that an instance has a missing value for an attribute does not depend on any value in the dataset, observed or missing.
  2. Missing at Random (MAR). The probability that an instance has a missing value for an attribute depends on other observed values in the dataset, but not on the missing value itself.
  3. Not Missing at Random (NMAR), also written MNAR. The probability that a value is missing depends on the value itself, and possibly also on the values of auxiliary variables. In other words, the value that is missing is related to the reason it is missing, for example when men fail to fill in a depression survey because of their level of depression.
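The three mechanisms are easier to tell apart when simulated. The sketch below is only an illustration: the `age` and `income` columns, the probabilities and the thresholds are all made-up assumptions, not data from this post. It hides income values under each mechanism and compares the mean of what remains observed.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 10_000

# Hypothetical data, invented for this illustration only.
df = pd.DataFrame({
    "age": rng.integers(18, 80, size=n),
    "income": rng.normal(50_000, 15_000, size=n),
})

# MCAR: every income value has the same 20% chance of being missing.
mcar_mask = rng.random(n) < 0.2

# MAR: missingness depends on an observed column (age), not on income itself.
mar_mask = rng.random(n) < np.where(df["age"] > 60, 0.4, 0.1)

# NMAR: missingness depends on the hidden value itself (high incomes are hidden more often).
nmar_mask = rng.random(n) < np.where(df["income"] > 65_000, 0.5, 0.1)

# Series.mask replaces the entries where the mask is True with NaN.
income_mcar = df["income"].mask(mcar_mask)
income_mar = df["income"].mask(mar_mask)
income_nmar = df["income"].mask(nmar_mask)

print("true mean:         ", round(df["income"].mean()))
print("observed mean MCAR:", round(income_mcar.mean()))
print("observed mean NMAR:", round(income_nmar.mean()))
```

Under these assumptions the MCAR mean stays close to the true mean, while the NMAR mean is pulled downwards because high incomes are hidden more often. That bias is why the mechanism matters when choosing a handling strategy.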

Handling Missing Data

There are various approaches to deal with missing values/data ranging from simple techniques such as dropping the entire record to creating a sophisticated model for data imputation. Below are some of the most common techniques to deal with missing data.

  1. Deleting records with missing values.
  2. Imputation. This involves filling the missing values with other values, either manually determined or statistically computed.
  3. Interpolation. This involves constructing new data points to fill in the missing values.
  4. Full Analysis. We use different approaches, such as generative or discriminative approaches, to impute missing data.
  5. Model-based techniques. We can develop a machine learning model to predict the probable value for the missing data point. scikit-learn comes with the KNNImputer class, which is based on k-nearest neighbors. It uses a distance metric to find the records nearest to the one with a missing value and imputes the value from those neighbors.

Create DataFrame

                    

import pandas as pd
import numpy as np

missing_data_df = pd.DataFrame(
    {
        "Students": ["Tom", "Peter",np.nan, "Mary", "Tom","King","Tom","Mary",np.nan],
        "Exam_Date": ["15/01/2021", "16/01/2021", "19/01/2021", "27/01/2021", "16/01/2021",
                      "16/01/2021", "16/01/2021", "16/01/2021", "16/01/2021"],
        "Math": [79.00, 67.00,np.nan, np.nan, 70.00,np.nan,90.00,76.00,np.nan],
        "Physics":[63.00, np.nan, 60.00, np.nan,84.00, 77.00,55.00,np.nan,66.00],
        "Computer":[np.nan,78.00, 57.00, 88.00, np.nan,np.nan,np.nan,70.00,np.nan],
    }
)

missing_data_df

data-preprocessing-missing-data

Flag null values (True if null, False otherwise)

                    

missing_data_df.isnull().head()

data-preprocessing-show-null-values-false
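It is worth checking what pandas actually counts as null. The small sketch below uses a made-up Series to show that both None and NaN are flagged, while placeholder strings such as an empty string or "NA" are not. Treating those two strings as missing is an assumption made for this example; real datasets may use other placeholders.

```python
import numpy as np
import pandas as pd

# Hypothetical values: a real name, None, NaN, an empty string and the text "NA".
values = pd.Series(["Tom", None, np.nan, "", "NA"], dtype=object)

# isna() (an alias of isnull()) flags None and NaN only.
flags = values.isna()
print(flags.tolist())  # [False, True, True, False, False]

# Placeholder strings must be converted explicitly before they count as missing.
placeholders = ["", "NA"]  # assumption: these mean "missing" in this dataset
cleaned = values.mask(values.isin(placeholders))
print(cleaned.isna().sum())  # 4
```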

Flag non-null values (True if not null, False otherwise)

                    

missing_data_df.notnull().head()

data-preprocessing-show-null-values-true

Count of null values

                    

missing_data_df.isnull().sum()

data-preprocessing-count-null-values
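Raw counts are hard to compare across datasets of different sizes, so it can help to look at the share of missing values per column and the number of nulls per row as well. The sketch below uses a small made-up `scores` frame rather than the DataFrame above.

```python
import numpy as np
import pandas as pd

# Small hypothetical frame used only for this illustration.
scores = pd.DataFrame({
    "Math": [79.0, np.nan, 70.0, np.nan],
    "Physics": [63.0, 60.0, np.nan, 77.0],
})

# The mean of a boolean mask is the fraction of True values, so this is percent missing per column.
missing_share = scores.isnull().mean() * 100
print(missing_share)  # Math 50.0, Physics 25.0

# Summing across columns (axis=1) gives the number of nulls in each row.
missing_per_row = scores.isnull().sum(axis=1)
print(missing_per_row.tolist())  # [0, 1, 1, 1]
```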

Techniques for Handling Missing Values

1. Deleting records with missing values

Drop rows in which all values are null

                    

missing_data_df.dropna(how='all')

data-preprocessing-drop-entire-row

Drop columns with any null value

                    

missing_data_df.dropna(how='any',axis=1)

data-preprocessing-drop-column-with-any-missing-value

Drop rows where any of the specified columns is null

                    

missing_data_df.dropna(subset=['Math', 'Physics'], how='any')

data-preprocessing-drop-column-with-specific-number-missing-value

Drop rows where all of the specified columns are null

                    

missing_data_df.dropna(subset=['Math', 'Physics'], how='all')

data-preprocessing-drop-column-with-all-number-missing-value

Drop columns with fewer than a given number of non-null values (thresh counts non-null values, and axis=1 targets columns)

                    

missing_data_df.dropna(axis=1,thresh=2)

data-preprocessing-drop-row-with-specific-number-missing-value
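The thresh argument is easy to misread: it is the minimum number of non-null values a row or column must have to be kept, not the number of nulls allowed. The sketch below uses a small made-up frame to show it on both axes.

```python
import numpy as np
import pandas as pd

# Hypothetical frame: row 0 has 3 non-null values, row 1 has 1, row 2 has 0.
scores = pd.DataFrame({
    "Math": [79.0, np.nan, np.nan],
    "Physics": [63.0, 60.0, np.nan],
    "Computer": [57.0, np.nan, np.nan],
})

# axis=0: keep only the rows with at least 2 non-null values.
rows_kept = scores.dropna(axis=0, thresh=2)
print(rows_kept.index.tolist())  # [0]

# axis=1: keep only the columns with at least 2 non-null values.
cols_kept = scores.dropna(axis=1, thresh=2)
print(cols_kept.columns.tolist())  # ['Physics']
```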

2. Imputation

Replace null values with a scalar value

                    

missing_data_df.fillna(-999) # replace null values with -999

data-preprocessing-impute-with-scalar

Backward Fill with Next Value

                    

missing_data_df.bfill() # same result as fillna(method='bfill'), which newer pandas versions deprecate

data-preprocessing-impute-backwardfill

Forward Fill with Previous Value

                    

missing_data_df.ffill() # same result as fillna(method='ffill'), which newer pandas versions deprecate

data-preprocessing-impute-forwardfill
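Forward and backward fill behave differently at the edges of the data, which is easy to miss on a full DataFrame. The sketch below uses a short made-up Series to show which gaps each direction leaves open and how the limit argument caps the number of consecutive fills.

```python
import numpy as np
import pandas as pd

# Hypothetical series with a leading gap, an interior gap and a trailing gap.
s = pd.Series([np.nan, 1.0, np.nan, np.nan, 4.0, np.nan])

forward = s.ffill()          # leading NaN stays: there is no previous value to copy
backward = s.bfill()         # trailing NaN stays: there is no next value to copy
limited = s.ffill(limit=1)   # fill at most one consecutive NaN after each valid value
both = s.ffill().bfill()     # chaining both directions closes every gap

print(forward.tolist())   # [nan, 1.0, 1.0, 1.0, 4.0, 4.0]
print(backward.tolist())  # [1.0, 1.0, 4.0, 4.0, 4.0, nan]
print(limited.tolist())   # [nan, 1.0, 1.0, nan, 4.0, 4.0]
print(both.tolist())      # [1.0, 1.0, 1.0, 1.0, 4.0, 4.0]
```

Both methods assume that the row order is meaningful, for example records sorted by date. On unordered data the neighbouring value is arbitrary.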

Statistical Imputation

                    

# Fill each column with its own statistic (fillna returns a new object; assign it back to keep the result)
missing_data_df['Math'].fillna(missing_data_df['Math'].mean()) # fill null values in Math with the mean of Math
missing_data_df['Students'].fillna(missing_data_df['Students'].mode()[0]) # fill null values in Students with the mode of Students
missing_data_df['Computer'].fillna(missing_data_df['Computer'].median()) # fill null values in Computer with the median of Computer

data-preprocessing-impute-with-statistical-measure
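One way to apply a different statistic to each column in a single call is to pass fillna a dictionary that maps column names to fill values. The sketch below uses a small made-up frame with simple numbers so that the mean, median and mode are easy to check by hand.

```python
import numpy as np
import pandas as pd

# Hypothetical frame used only for this illustration.
df = pd.DataFrame({
    "Students": ["Tom", "Mary", np.nan, "Tom"],
    "Math": [80.0, np.nan, 70.0, 90.0],
    "Computer": [50.0, 60.0, np.nan, 100.0],
})

fill_values = {
    "Students": df["Students"].mode()[0],  # most frequent name; mode() returns a Series, so take its first entry
    "Math": df["Math"].mean(),             # (80 + 70 + 90) / 3 = 80.0
    "Computer": df["Computer"].median(),   # median of 50, 60, 100 = 60.0
}

# Each column is filled only with its own value; the original frame is left unchanged.
imputed = df.fillna(fill_values)
print(imputed)
```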

3. Interpolate missing values

Interpolate missing data in forward direction

                    

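# Interpolate the numeric columns only; newer pandas versions warn or fail on text columns
missing_data_df[['Math', 'Physics', 'Computer']].interpolate(method='linear', limit_direction='forward')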

data-preprocessing-interpolate-forward

Interpolate missing data in backward direction

                    

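# Interpolate the numeric columns only; newer pandas versions warn or fail on text columns
missing_data_df[['Math', 'Physics', 'Computer']].interpolate(method='linear', limit_direction='backward')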

data-preprocessing-interpolate-backward
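The arithmetic behind linear interpolation is easier to follow on a short series. In the made-up example below, the two gaps between 10 and 40 are filled with evenly spaced values, and limit_direction decides which edge of the series is left untouched.

```python
import numpy as np
import pandas as pd

# Hypothetical series with a leading gap, a two-value interior gap and a trailing gap.
s = pd.Series([np.nan, 10.0, np.nan, np.nan, 40.0, np.nan])

# Interior gap: the step is (40 - 10) / 3 = 10, so the filled values are 20 and 30.
forward = s.interpolate(method="linear", limit_direction="forward")
backward = s.interpolate(method="linear", limit_direction="backward")

print(forward.tolist())   # [nan, 10.0, 20.0, 30.0, 40.0, 40.0]
print(backward.tolist())  # [10.0, 10.0, 20.0, 30.0, 40.0, nan]
```

With the forward direction the leading gap stays empty and the trailing gap repeats the last valid value. The backward direction does the opposite.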

4. Model-based Imputation techniques

Nearest neighbors imputation

DataFrame with Missing Values

data-preprocessing-missing-data

                    

from sklearn.impute import KNNImputer

imputer = KNNImputer(n_neighbors=2, weights="uniform")
imputed_data=imputer.fit_transform(missing_data_df[['Math','Physics','Computer']].to_numpy())
imputed_data

data-preprocessing-knnimputed-data

Join imputed data to DataFrame

                    

pd.concat([missing_data_df[['Students','Exam_Date']],pd.DataFrame(imputed_data,columns=['Math','Physics','Computer'])],axis=1)

data-preprocessing-knnimputed-dataframe
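To see where the imputed numbers come from, it helps to run KNNImputer on a matrix small enough to check by hand. The 4 x 3 matrix below is a made-up toy example, not the student data. With n_neighbors=2 and uniform weights, each missing entry becomes the average of that column over the two nearest rows that have a value there, with distances computed on the columns both rows share.

```python
import numpy as np
from sklearn.impute import KNNImputer

# Toy matrix invented for this illustration; np.nan marks the missing entries.
X = np.array([
    [1.0, 2.0, np.nan],
    [3.0, 4.0, 3.0],
    [np.nan, 6.0, 5.0],
    [8.0, 8.0, 7.0],
])

imputer = KNNImputer(n_neighbors=2, weights="uniform")
X_imputed = imputer.fit_transform(X)
print(X_imputed)

# Row 0, column 2: the two nearest rows are rows 1 and 2, so the value is (3 + 5) / 2 = 4.0
# Row 2, column 0: the two nearest rows are rows 1 and 3, so the value is (3 + 8) / 2 = 5.5
```

Because the method is distance based, columns on very different scales can dominate the distance. Scaling the features first is a common precaution, although it is not shown here.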

For the complete code, check the notebook on GitHub here.

Conclusion

Missing data is a result of data quality issues. It affects the results of an analysis and the performance of machine learning models, as many machine learning algorithms are sensitive to missing data. To deal with missing values we can drop the records that contain them, impute them with other values, interpolate them, or develop a machine learning model that predicts the value to fill in. In this post we have looked at what missing values are, what causes them and how to handle them. In the next post we will learn about handling imbalanced data.
