
Hyper-parameter Tuning with Custom GridSearchCV

8 min read · 10 July, 2021

Typically in Machine Learning, we split our dataset into two parts: a training set (usually 70%) and a testing set (usually 30%). If we are working on a classification problem where X holds the features and Y the labels, we split the data into X_train, X_test and Y_train, Y_test. We train and fit the model on the training set and evaluate it on the testing set, computing accuracy by comparing the values the model predicts on the test data with the actual labels. The problem with this approach is that when the model encounters new data in the real world, the accuracy is often no longer the same; the model may fail miserably. To address this, researchers have come up with various methods, one of which is to use Cross-Validation while splitting data.
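As a quick illustration, here is a minimal sketch of such a split (the toy dataset, the 70/30 ratio, and the variable names are only for illustration):

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# A toy two-feature classification dataset, just for illustration
X, Y = make_classification(n_samples=1000, n_features=2, n_informative=2,
                           n_redundant=0, random_state=42)

# Hold out 30% of the rows for testing and train on the remaining 70%
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.3, random_state=42)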

Cross-Validation

To understand this, let's take the K-NN algorithm as an example. Instead of two parts, we split our data into three: training data (60%), CV data (20%), and testing data (20%). NOTE: it's not always 60–20–20; some people prefer 70–10–20 or other ratios. In K-NN, we use the training data to train and fit the model, which essentially means finding the nearest neighbors for each point in the dataset. The CV data is used to find the optimal K (K in K-NN is a hyper-parameter, and selecting the right K is crucial; in simple words, K is the number of nearest neighbors considered when classifying a point). We compute the accuracy on the CV data for multiple values of K, and the K with the highest accuracy is taken as the optimal K. Say we got the highest accuracy, 95%, when K was 5. If we now take the trained model with this optimal K, test it on the testing data, and get around 93% accuracy, we can be more confident that the model will also perform well on real-world data, because it never saw the test data during training or while choosing the best K.
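In code, such a three-way split can be made with two calls to train_test_split; a minimal sketch, reusing X and Y from the example above and the 60–20–20 ratio from the text:

from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

# First carve out 20% of the data as the final test set
X_rest, X_test, Y_rest, Y_test = train_test_split(X, Y, test_size=0.2, random_state=42)

# Then split the remaining 80% into 60% train and 20% CV (0.25 of 80% = 20% of the total)
X_train, X_cv, Y_train, Y_cv = train_test_split(X_rest, Y_rest, test_size=0.25, random_state=42)

# Compare a few values of K on the CV data and keep the one with the best CV accuracy
for k in [1, 3, 5, 7]:
    knn = KNeighborsClassifier(n_neighbors=k).fit(X_train, Y_train)
    print(k, accuracy_score(Y_cv, knn.predict(X_cv)))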

But there is still one problem to address. If we split our data in a 60–20–20 ratio, only 60% of it is used for training, which means we effectively throw away 40% of the information, and training on just 60% of the data is rarely acceptable in real-world settings. To counter this, researchers have come up with several modifications, and we are going to learn one of them in this article.

GridSearchCV

First, split the data into two parts: 80% will be used as training data and 20% as testing data. The training portion is then further divided into four equal blocks, each 20% of the whole dataset, say D1, D2, D3, and D4.

Splitting data into training and testing

Now, in the first round, we use blocks D1, D2, D3 as the training data while D4 serves as the CV data, and with K=1 we measure the accuracy of the model, say 81%. The second combination, with D1, D2, D4 as training data and D3 as CV data, gives 85% accuracy for K=1. The third combination for K=1 gives 85% and the fourth gives 84%. Once all the combinations have been used, we take the average of the accuracies for K=1. This way we make smart use of 80% of the data for training without losing as much information as in the previous case. NOTE: we still have not touched the 20% test data, so once we settle on a K we can test the model on it and expect the test accuracy to hold up in real-world cases too. We can try such combinations for multiple values of K (K = 1, 3, 5, ...) and find the optimal K that gives the highest accuracy, which in our case is K=5 with an accuracy of 93%. So if we test our model with K=5 on the testing data and get a similar accuracy, we can be more confident that the model will achieve similar accuracy on real data too.
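With the fold accuracies quoted above, the CV score we record for K=1 is simply their mean: (81% + 85% + 85% + 84%) / 4 = 83.75%.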

Grid like combinations of K vs number of folds

This method of finding the best hyper-parameter (K in K-NN) by building a grid (see the image above) is known as GridSearchCV. Let's implement GridSearchCV from scratch in Python without using Sklearn. You can also use the Sklearn implementation, which is more efficient.
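For comparison, here is a minimal sketch of how the same kind of search looks with Sklearn's built-in GridSearchCV, reusing X_rest and Y_rest (the 80% train + CV portion) from the sketch above; the grid values are only illustrative, and note that in the rest of this article we define our own function with the same name:

from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# 4-fold cross-validated search over a few odd values of K
search = GridSearchCV(KNeighborsClassifier(), {'n_neighbors': [1, 3, 5, 7, 9]}, cv=4)
search.fit(X_rest, Y_rest)
print(search.best_params_, search.best_score_)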

Implementing Custom GridSearchCV

from sklearn.datasets import make_classification
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from tqdm import tqdm
import pandas as pd
import numpy as np
import random
import matplotlib.pyplot as plt

We will be using make_classification from sklearn.datasets, which creates clusters of points for a classification problem. Its main parameters are n_samples for the number of samples, n_features for the number of features, n_informative for how many of those features are informative, n_redundant for how many are redundant (useless) combinations of the informative ones, and random_state, which, when set to a specific number, returns the same random data every time, so you can share exactly the same generated dataset with anyone.
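As a quick sanity check, the call we make later in this article returns a feature matrix and a label vector of the expected shapes:

x, y = make_classification(n_samples=10000, n_features=2, n_informative=2,
                           n_redundant=0, n_clusters_per_class=1, random_state=53)
print(x.shape, y.shape)   # (10000, 2) (10000,)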

def GridSearchCV(x_train, y_train, classifier, parameter, folds):
    trainscores = []
    testscores = []

Define a function that takes five inputs: x_train, y_train, classifier (KNN in our case), parameter (the hyper-parameter grid, here n_neighbors), and folds, the number of CV folds. We start by defining two empty lists, trainscores and testscores, which will store the accuracies obtained on the training and CV data for every value of K across all folds.

    for k in parameter['n_neighbors']:
        training_fold_scores = []
        cv_fold_scores = []

The first for loop iterates over the different values of K, which are stored in the parameter dictionary as a list: parameter = {'n_neighbors': [1, 3, 5, 7, 9, 11, 13, 15, 17, 19, 21, 23, 25, 27, 29]}. We only use odd values of K from 1 to 29, since with an even K the neighbors can split evenly between the two classes, making it harder to decide the class label.

        for j in range(0, folds):
            training_data = select_data_without_duplicates(x_train)
            cv_data = list(set(range(0, len(x_train))) - set(training_data))

This second loop runs once per fold for each K. So if K=1 and folds=4, we get four iterations, K1F0, K1F1, K1F2, K1F3, where K1 means K=1 and F0 to F3 are folds 0 to 3. Since we take 15 values of K from 1 to 29, we get 15*4=60 combinations in total, from K1F0, K1F1, K1F2, K1F3, K3F0, ... up to K29F3. As mentioned earlier, the data has already been split into two parts: 80% is training + CV data and 20% is testing data. So in the second line, from the 80% of training + CV data we randomly sample 60% as training data, and the third line assigns the remaining 40% to cv_data. The function that randomly selects 60% of the data without creating duplicates is as follows:

def select_data_without_duplicates(x_train):
    return random.sample(range(0, len(x_train)), int(0.6*len(x_train)))

The above function takes x_train and returns a random sample of row indices, 60% of the length of the training data, drawn from 0 to len(x_train) without duplicates.
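For intuition, random.sample draws without replacement, so the returned indices are guaranteed to be unique (a tiny illustration, not part of the final function):

idx = random.sample(range(0, 10), 6)   # e.g. [3, 7, 0, 9, 2, 5]
print(len(idx) == len(set(idx)))       # True: no duplicates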

            X_train = x_train[training_data]
            X_cv = x_train[cv_data]
            Y_train = y_train[training_data]
            Y_cv = y_train[cv_data]

Having split the indices into training_data and cv_data, we use them to index x_train and y_train, giving us the corresponding X_train, X_cv, Y_train, and Y_cv arrays.

            classifier.n_neighbors = k
            classifier.fit(X_train, Y_train)

We set the classifier's n_neighbors to the current value of k and then fit the model on X_train and Y_train.

            Y_cv_predict = classifier.predict(X_cv)
            cv_fold_scores.append(accuracy_score(Y_cv, Y_cv_predict))
            Y_train_predict = classifier.predict(X_train)
            training_fold_scores.append(accuracy_score(Y_train, Y_train_predict))

Once the model is trained, we predict on X_cv and compare the predictions with Y_cv to compute the CV accuracy, which is appended to the cv_fold_scores list we created earlier. Similarly, we predict on X_train, compare with Y_train, and store the training accuracy in training_fold_scores. In our example, folds=4, so we end up with four accuracies for each K.

        trainscores.append(np.mean(np.array(training_fold_scores)))
        testscores.append(np.mean(np.array(cv_fold_scores)))
    return trainscores, testscores

Finally, we take the mean of the four fold accuracies and append the averages to the trainscores and testscores lists respectively (see the grid figure above for a visual). The function is now complete and we can test it on a dataset.

Here is the full code of the function:

grid_search_cv.py
#Returns indices for 60% of the training data, sampled randomly without duplicates
def select_data_without_duplicates(x_train):
    return random.sample(range(0, len(x_train)), int(0.6*len(x_train)))

def GridSearchCV(x_train, y_train, classifier, parameter, folds):
    trainscores = []
    testscores = []
    for k in tqdm(parameter['n_neighbors']):
        training_fold_scores = []
        cv_fold_scores = []
        for j in range(0, folds):
            #Splitting the training indices into train and CV parts
            training_data = select_data_without_duplicates(x_train) #60% of the x_train indices
            cv_data = list(set(range(0, len(x_train))) - set(training_data)) #Remaining indices: 100% - 60%
            #Building X_train, Y_train, X_cv, Y_cv from the new split
            X_train = x_train[training_data]
            X_cv = x_train[cv_data]
            Y_train = y_train[training_data]
            Y_cv = y_train[cv_data]
            #Applying the KNN algorithm and fitting the model
            classifier.n_neighbors = k
            classifier.fit(X_train, Y_train)
            #Predicting on X_cv and appending the accuracy to cv_fold_scores
            Y_cv_predict = classifier.predict(X_cv)
            cv_fold_scores.append(accuracy_score(Y_cv, Y_cv_predict))
            #Predicting on X_train and appending the accuracy to training_fold_scores
            Y_train_predict = classifier.predict(X_train)
            training_fold_scores.append(accuracy_score(Y_train, Y_train_predict))
        #For each K we now have one training accuracy and one CV accuracy per fold.
        #Take the mean across folds and append it to trainscores and testscores.
        trainscores.append(np.mean(np.array(training_fold_scores)))
        testscores.append(np.mean(np.array(cv_fold_scores)))
    return trainscores, testscores

Testing Our Function

x, y = make_classification(n_samples=10000, n_features=2, n_informative=2, n_redundant=0, n_clusters_per_class=1, random_state=53)
#Splitting data into training and testing datasets
X_train, X_test, y_train, y_test = train_test_split(x, y, stratify=y, random_state=32)

Here we are creating a dataset of 10,000 samples with 2 informative features and 0 redundant features (n_redundant must be set to 0 explicitly, since its default is 2). After splitting into train and test sets, we have 7,500 training points and 2,500 test points. If we visualize the features and labels of the data, it looks something like this:

Graph
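Here is a sketch of how our function can be applied to this dataset and the resulting scores plotted (the parameter grid and the fold count follow the walkthrough above):

classifier = KNeighborsClassifier()
parameter = {'n_neighbors': [1, 3, 5, 7, 9, 11, 13, 15, 17, 19, 21, 23, 25, 27, 29]}
folds = 4

trainscores, testscores = GridSearchCV(X_train, y_train, classifier, parameter, folds)

#Plotting hyper-parameter K vs accuracy for the train and CV scores
plt.plot(parameter['n_neighbors'], trainscores, label='Train accuracy')
plt.plot(parameter['n_neighbors'], testscores, label='CV accuracy')
plt.xlabel('Hyper-parameter K')
plt.ylabel('Accuracy')
plt.legend()
plt.show()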

After applying our function to this dataset, we can identify the optimal K, i.e. the best hyper-parameter. Below is the plot of hyper-parameter vs accuracy.

Comparison Graph

The K with the highest accuracy on the test (CV) scores is around 15, at about 87.5%. See the figure below.

Comparison Graph
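If you prefer to read the optimal K off programmatically rather than from the plot, something like this works (using the testscores list and parameter dictionary from above):

optimal_k = parameter['n_neighbors'][int(np.argmax(testscores))]
print(optimal_k)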

For detailed, step-by-step code, refer to my GitHub or Colab.

If you've enjoyed reading this blog and have learnt at least one new thing, do subscribe to receive updates whenever I post a new article directly to your inbox, and do share it on Twitter with your friends.