In this era of big data we often end up with datasets that are too large to process, analyse and model in full. To avoid this we select a small subset of the larger population to work with and deliver insights to the business that can be used to draw inferences and conclusions about the general population. Sampling is the technique of drawing a small set of data, commonly referred to as a sample, from the larger population. In machine learning we also sample a dataset when separating the training and validation sets. When selecting the sample we need to be careful to draw one that represents the true nature of the larger population, to avoid the common problem of sampling bias. Sampling bias occurs when the distribution of the sample data differs from that of the general population, a problem machine learning models commonly run into when taken to production. In this post we will look at various data sampling techniques and how to implement them in Python. Download the data for the post here.
Importance of Sampling
- Sample data is faster to process and analyse than the entire population.
- Sample data is cheaper to process and analyse since it requires fewer resources.
- Proper analysis of sample data is useful in making inferences about the entire population.
Categories of Sampling Methods
Sampling can be broadly classified into two types: probability sampling and non-probability sampling.
- Probability Sampling. The sample data is randomly selected from the population. Each data point has an equal chance of being selected, so a sample of sufficient size can be statistically significant when generalizing to the population.
- Non-probability Sampling. The sample data is not randomly selected. This approach can lead to sampling bias between the sample and the population distribution.
Common Probabilistic Sampling Techniques
- Simple Random Sampling. This is the most common and simplest probability sampling technique, based on pure random selection. Each data point has an equal chance of being selected. However, it may by chance over-represent one class relative to another.
- Systematic Sampling. This method selects the sample following a pre-defined pattern, such as every k-th record, which tends to give a more evenly spread sample (a minimal sketch appears in the Python examples below).
- Stratified Sampling. In this method we first divide the population into strata (subgroups) based on criteria such as gender, colour etc., and then select samples from each stratum. This approach yields a sample that represents the entire population well.
- Cluster Sampling. This method draws samples from clusters of the population. We first divide the population into clusters and then pick a specific number of clusters as the sample. This approach is ideal when we want to focus on a specific area or region in our dataset (see the sketch after the stratified sampling code below).
Sampling Techniques in Python
Load Data
import pandas as pd
import numpy as np

# Load the Titanic dataset
titanic_df = pd.read_csv('titanic.csv')
titanic_df.tail()
1. Simple Random Sampling
Randomly select 10 samples from the population.
titanic_df.sample(n=10, random_state=1)
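Systematic sampling, described earlier, is not part of the original notebook. A minimal sketch on the loaded data could look like this; the step size and the random starting offset are illustrative choices:

# Systematic sampling: pick every k-th row after a random starting offset
k = len(titanic_df) // 10                      # step size giving roughly 10 rows
start = np.random.default_rng(1).integers(k)   # random offset within the first step
systematic_sample = titanic_df.iloc[start::k]
systematic_sample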
2. Stratified Sampling
from sklearn.model_selection import StratifiedKFold

# Stratify on the 'Survived' column so each fold preserves the class proportions
X = titanic_df.drop('Survived', axis=1)
y = titanic_df.Survived

skf = StratifiedKFold(n_splits=3)
print("No. of folds: ", skf.get_n_splits(X, y))

for train_index, test_index in skf.split(X, y):
    X_train, X_test = X.iloc[train_index], X.iloc[test_index]
    y_train, y_test = y.iloc[train_index], y.iloc[test_index]
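Cluster sampling, also described earlier, is likewise not in the notebook. A minimal sketch, using Pclass as an illustrative cluster variable (real cluster sampling usually relies on naturally occurring groups such as geographic regions), could look like this:

# Cluster sampling (sketch): randomly pick 2 clusters and keep all of their rows
rng = np.random.default_rng(1)
clusters = titanic_df['Pclass'].unique()
chosen = rng.choice(clusters, size=2, replace=False)
cluster_sample = titanic_df[titanic_df['Pclass'].isin(chosen)]
cluster_sample.head()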
For complete code check the notebook here.
Common Non-Probabilistic Sampling Techniques
- Convenience Sampling. The samples are selected based on their availability and readiness for sampling. It’s highly prone to bias.
- Judgement Sampling. An expert makes a selective judgement on which data points to include. This method can suffer from personal bias.
- Quota Sampling. We sample data based on pre-defined traits until a quota for each trait is filled (a minimal sketch follows this list).
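As a rough sketch of quota sampling, assuming the Titanic data has a Sex column to define the quotas, we can fill a quota of 50 rows per value by taking whichever rows come first; note the selection is deliberately non-random:

# Quota sampling (sketch): first 50 rows for each value of the assumed 'Sex' column
quota_sample = titanic_df.groupby('Sex').head(50)
quota_sample['Sex'].value_counts()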
Statistical Resampling
Statistical resampling is a non-parametric statistical inference method based on repeatedly drawing samples from data. Unlike sampling, which selects data from the population to be used for inference, resampling repeatedly draws from already-sampled data to improve the performance of models and analyses and to give more accurate inferences about the population. Below are common resampling techniques:
- Bootstrapping. This involves repeatedly selecting samples from the dataset with replacement.
- Cross-Validation. This is a statistical technique for validating predictive models. We divide the data into training and validation sets; the training set is used for fitting the model while the validation set is used for evaluating it. Cross-validation helps us detect overfitting and sampling bias. Cross-validation methods can be exhaustive, using all possible ways to divide the sample data into training and validation sets; exhaustive techniques include leave-p-out and leave-one-out cross-validation. Non-exhaustive methods instead use a specific criterion to divide the sample data into training and validation sets; non-exhaustive techniques include k-fold cross-validation and holdout, among others (a minimal sketch of holdout and leave-one-out follows this list).
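The holdout and leave-one-out approaches are not covered in the code below. A minimal sketch, reusing the same illustrative feature columns that appear in the k-fold example further down, could look like this:

from sklearn.model_selection import train_test_split, LeaveOneOut, cross_val_score
from sklearn.linear_model import LogisticRegression

# Same illustrative feature columns as the k-fold example further down
features = ['Pclass', 'Age', 'Siblings/Spouses Aboard', 'Parents/Children Aboard', 'Fare']
X, y = titanic_df[features], titanic_df['Survived']

# Holdout: a single 80/20 train/validation split
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=1)
holdout_model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print('Holdout accuracy %.2f' % (holdout_model.score(X_val, y_val) * 100), '%')

# Leave-one-out: every row serves as the validation set exactly once (slow on large data)
loo_scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, scoring='accuracy', cv=LeaveOneOut())
print('Leave-one-out mean accuracy %.2f' % (loo_scores.mean() * 100), '%')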
Bootstrapping
from sklearn.utils import resample

# Bootstrap: draw 200 rows from the dataset with replacement
bootstrap = resample(titanic_df, replace=True, n_samples=200, random_state=1)
print("Bootstrap Sample\n")
bootstrap
Out-of-bag Sample
The out-of-bag sample contains the rows of the original data that were never drawn into the bootstrap sample.
print("Out of bag sample\n")
# Keep rows whose index labels do not appear in the bootstrap sample
titanic_df.loc[~titanic_df.index.isin(bootstrap.index)]
KFold Cross-Validation
Let’s fit a logistic regression model and evaluate it with 10-fold cross-validation.
# import required libraries
from sklearn.model_selection import KFold, cross_val_score
from sklearn.linear_model import LogisticRegression

# Get features and target variable from the data
X = titanic_df[['Pclass', 'Age', 'Siblings/Spouses Aboard', 'Parents/Children Aboard', 'Fare']]
y = titanic_df['Survived']

# prepare cross-validation with 10 folds
cv = KFold(n_splits=10, shuffle=True, random_state=1)

# model (max_iter raised to avoid convergence warnings)
model = LogisticRegression(max_iter=1000)

# Evaluate the model with cross-validation
score = cross_val_score(model, X, y, scoring='accuracy', cv=cv)
print('Min Accuracy %.2f' % (score.min() * 100), '%')
print('Max Accuracy %.2f' % (score.max() * 100), '%')
print('Mean Accuracy %.2f' % (np.mean(score) * 100), '%')
Let’s determine which value of k (number of folds) gives the highest accuracy.
# Evaluate the model for a given cross-validation strategy
def model_evaluation(cv, X, y):
    model = LogisticRegression(max_iter=1000)
    score = cross_val_score(model, X, y, scoring='accuracy', cv=cv)
    return score.min(), score.max(), np.mean(score)

folds = range(2, 21)
min_cv, max_cv, mean_cv = list(), list(), list()

# iterate through each value of k (number of folds)
for k in folds:
    cv = KFold(n_splits=k, shuffle=True, random_state=1)
    k_min, k_max, k_mean = model_evaluation(cv, X, y)
    min_cv.append(k_min * 100), max_cv.append(k_max * 100), mean_cv.append(k_mean * 100)
    print('Folds=%d || Min Acc. %.2f || Max Acc. %.2f || Mean Acc. %.2f' % (k, k_min * 100, k_max * 100, k_mean * 100))
Let’s plot the accuracies for the values of k
import matplotlib.pyplot as plt
plt.style.use('ggplot') # Other styles to use: fivethirtyeight
plt.figure(figsize=(20,10)) # Set figure size
plt.rcParams.update({'font.size': 22}) # Set font size
plt.plot(folds,min_cv,color='orange',marker='*') # Plot the minimum accuracy
plt.plot(folds,max_cv,color='skyblue',marker='*') # Plot the maximum accuracy
plt.plot(folds,mean_cv,color='green',marker='*') # Plot the mean accuracy
plt.title('K-Fold',fontsize=24)
plt.xticks(rotation=45)
plt.xlabel('K',fontsize=24)
plt.ylabel('Accuracy',fontsize=24)
plt.legend(['Min','Max','Mean'], loc='upper right')
plt.show()
For complete code check the notebook here.
Errors in Sampling Process
Sampling is a sensitive process that, when done wrongly, can lead to incorrect analyses and insights about the population. The sample data should be statistically significant and free from errors to produce trusted inferences about the population. There are various ways that errors can be introduced into the sample data; below are the most common sampling errors.
- Sampling Error. This occurs when the randomly selected sample turns out to be skewed purely by chance.
- Selection Bias. This results from a sampling method that systematically skews the sample data.
- Systematic Error. This is caused by a faulty tool or a wrong experimental design. This type of error is consistent across the sample data.
Testing Sample & Population Distributions
To be certain that the sample data is correctly selected, and to check for sampling bias, we need to compare the sample and population distributions. We conduct a hypothesis test to determine whether the sample is drawn from the target population. Below are various techniques used for testing sample distributions (a sketch of the first two follows the list).
- Chi-square test. This is used to compare the sample and population distributions of categorical variables.
- Kolmogorov-Smirnov test. This method is used for numerical variables. It determines whether a sample comes from a population with a specific distribution.
- Histogram comparison. The histograms of the sample and the population should have a similar shape. Check the post on histograms here.
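As a rough sketch of the first two tests, assuming SciPy is available and using Fare and Pclass as illustrative columns, we can compare a random sample against the full dataset; a large p-value means we cannot reject the hypothesis that the sample follows the population distribution:

from scipy import stats

# Draw a sample to compare against the full dataset (illustrative size and seed)
sample_df = titanic_df.sample(n=100, random_state=1)

# Kolmogorov-Smirnov test on a numerical column (Fare)
ks_stat, ks_p = stats.ks_2samp(sample_df['Fare'], titanic_df['Fare'])
print('KS statistic %.3f, p-value %.3f' % (ks_stat, ks_p))

# Chi-square test on a categorical column (Pclass): observed counts in the sample
# vs. counts expected from the population proportions
pop_props = titanic_df['Pclass'].value_counts(normalize=True).sort_index()
sample_counts = sample_df['Pclass'].value_counts().reindex(pop_props.index, fill_value=0)
chi_stat, chi_p = stats.chisquare(f_obs=sample_counts, f_exp=pop_props * len(sample_df))
print('Chi-square statistic %.3f, p-value %.3f' % (chi_stat, chi_p))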
Conclusion
In this post we have looked at what data sampling is and why it’s important. We have seen different data sampling techniques, both probabilistic and non-probabilistic. We have also looked at two common resampling techniques, bootstrapping and cross-validation. In the previous post we looked at outliers in data and how to handle them; check it out here. In the next post we will look at data leakage and what causes it across the data science life cycle.