To evaluate the performance of a trained machine learning model, we have to test it on data that it hasn't seen before. But sometimes there isn't enough data to set aside a sizeable chunk just for testing. A common way around this is to hold out part of the dataset before training using resampling methods, which partition the dataset into representative subsets for training and testing.
As simple as it sounds, a train-test split just sets aside a portion of the available data for testing, usually around 20-30% of the dataset (though this varies with the nature of the dataset, the requirements, and so on). Train-test split is readily available in the scikit-learn library in Python, but it is also fairly simple to implement from scratch.
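If scikit-learn is available, the split is a one-liner; the snippet below is a minimal sketch of its built-in helper (the parameter values here are just for illustration):

```python
from sklearn.model_selection import train_test_split

dataset = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
# test_size=0.3 holds out 30% for testing;
# random_state fixes the shuffle so the split is repeatable
train, test = train_test_split(dataset, test_size=0.3, random_state=42)
print('Train:', train)
print('Test:', test)
```

With a dataset of 10 elements and test_size=0.3, the helper returns 7 training and 3 test elements.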
In this example, we'll use the randrange() function to randomly pick the elements to allocate to training and testing. The following function accepts the dataset and the proportion of the test subset relative to the entire dataset.
from random import randrange

def train_test_split(dataset, split):
    train_set = list(dataset)  # train_set initially gets all the data
    test_set = list()
    test_size = len(dataset) * split
    while len(test_set) < test_size:
        # move a randomly chosen element from the training set to the test set
        random_index = randrange(len(train_set))
        test_set.append(train_set.pop(random_index))
    return train_set, test_set
We can then create a dummy dataset to test this function, and specify to use 30% of the dataset as the test subset.
dataset = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
train, test = train_test_split(dataset, 0.3)
print('Train:', train)
print('Test:', test)
A sample output looks like this:

Train: [1, 2, 4, 5, 7, 8, 10]
Test: [9, 3, 6]
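Because randrange() draws from Python's global random number generator, each run produces a different split. If you need a reproducible split, you can seed the generator first; a minimal sketch, repeating the function defined above:

```python
from random import seed, randrange

def train_test_split(dataset, split):
    # same implementation as above
    train_set = list(dataset)
    test_set = list()
    test_size = len(dataset) * split
    while len(test_set) < test_size:
        random_index = randrange(len(train_set))
        test_set.append(train_set.pop(random_index))
    return train_set, test_set

seed(42)  # fix the random sequence so the split is repeatable
train_a, test_a = train_test_split(list(range(1, 11)), 0.3)

seed(42)  # reseeding with the same value reproduces the same split
train_b, test_b = train_test_split(list(range(1, 11)), 0.3)

print(train_a == train_b and test_a == test_b)  # True
```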
K-Fold Cross Validation Split
Another resampling method performs the train-test split several times, each time using a different partition as the test set. This is called k-fold cross validation: the dataset is split into k equal-sized folds (groups), and training and testing are done k times. Each round, the model is trained on a different training set and tested on a different test set. The overall performance of the model is the average of its performance across the k rounds of training and testing.
For example, if k = 3, we can name the folds A, B, and C. Training and testing are then done k times:

|k|Training set|Test set|
|1|B, C|A|
|2|A, C|B|
|3|A, B|C|
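The averaging of per-round performance mentioned above can be sketched as follows (the accuracy values are placeholders, not real results):

```python
# hypothetical accuracy scores from the k = 3 rounds above
fold_scores = [0.80, 0.85, 0.75]

# the model's overall performance is the mean of the per-fold scores
overall_score = sum(fold_scores) / len(fold_scores)
print('Overall accuracy:', round(overall_score, 2))  # → 0.8
```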
To generate the folds from the dataset, the implementation looks similar to the train-test split; the difference is that the selection is performed k times. Again, we use randrange() to pick the indices randomly for us.
from random import randrange

def cross_validation_split(dataset, k_folds):
    dataset_copy = list(dataset)  # work on a copy so the original dataset is not emptied
    fold_size = int(len(dataset) / k_folds)  # folds are of the same size
    dataset_splits = list()
    for i in range(k_folds):
        fold = list()
        while len(fold) < fold_size:
            random_index = randrange(len(dataset_copy))
            fold.append(dataset_copy.pop(random_index))
        dataset_splits.append(fold)
    return dataset_splits
We set our sample dataset to be the numbers from 1 to 9, and k = 3.
dataset = list(range(1, 10))
k = 3
dataset_folds = cross_validation_split(dataset, k)
dataset_folds then contains a 2-dimensional list of the elements in each fold; i.e., dataset_folds[0] is fold A, dataset_folds[1] is fold B, and dataset_folds[2] is fold C.
[[7, 4, 9], [8, 5, 3], [6, 1, 2]]
We can then simulate the train-test process by referencing indices in the dataset_folds list.
for i in range(k):
    train_set_indices = list(range(k))
    test_set_index = train_set_indices.pop(i)  # fold i is held out for testing
    print('Round', i + 1)
    print('Training set:', [dataset_folds[index] for index in train_set_indices])
    print('Test set:', dataset_folds[test_set_index])
    print('\n')
Running the above code would result in three rounds of training and testing with different subsets of the dataset.
Round 1
Training set: [[8, 5, 3], [6, 1, 2]]
Test set: [7, 4, 9]

Round 2
Training set: [[7, 4, 9], [6, 1, 2]]
Test set: [8, 5, 3]

Round 3
Training set: [[7, 4, 9], [8, 5, 3]]
Test set: [6, 1, 2]
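To make the loop above concrete, here is a hedged sketch that plugs a trivial "model" (predicting the mean of the training folds) into each round and averages its error; the mean-predictor and the mean-absolute-error metric are illustrative choices, not part of the original example:

```python
from random import seed, randrange

def cross_validation_split(dataset, k_folds):
    # same fold-splitting logic as above, applied to a copy of the dataset
    dataset_copy = list(dataset)
    fold_size = int(len(dataset) / k_folds)
    dataset_splits = list()
    for i in range(k_folds):
        fold = list()
        while len(fold) < fold_size:
            random_index = randrange(len(dataset_copy))
            fold.append(dataset_copy.pop(random_index))
        dataset_splits.append(fold)
    return dataset_splits

seed(1)  # make the folds repeatable
k = 3
folds = cross_validation_split(list(range(1, 10)), k)

errors = []
for i in range(k):
    test_fold = folds[i]
    train_data = [x for j, fold in enumerate(folds) if j != i for x in fold]
    prediction = sum(train_data) / len(train_data)  # "training": predict the mean
    # mean absolute error of the prediction on the held-out fold
    error = sum(abs(x - prediction) for x in test_fold) / len(test_fold)
    errors.append(error)

overall_error = sum(errors) / len(errors)  # average performance over k rounds
print('Per-round MAE:', errors)
print('Overall MAE:', overall_error)
```

Swapping the mean-predictor for a real model (and MAE for any other metric) gives the standard k-fold evaluation loop.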