Resampling Methods: Splitting and k-Folds

To evaluate the performance of a trained machine learning model, we need to test it on data that it hasn’t seen before. But sometimes there’s just not enough data to set aside a sizeable chunk purely for testing. A common way to handle this is to hold out part of the dataset before training, using techniques called resampling methods, which partition the dataset to obtain representative data for training and testing.

Train-Test Split

As simple as it sounds, a train-test split just sets aside a portion of the available data for testing, usually around 20-30% of the dataset (although this varies with the nature of the dataset, requirements, etc.). Train-test split is readily available in the scikit-learn library in Python, but it is also fairly simple to implement from scratch.

In this example, we’ll use the randrange() function to randomly pick items to allocate to the training and test sets. The following function accepts the dataset and the proportion of the dataset to use as the test subset.

from random import randrange

def train_test_split(dataset, split):
    train_set = list(dataset)  # train_set starts as a copy of all the data
    test_set = list()

    # Round to a whole number so floating-point error can't add an extra item
    test_size = round(len(dataset) * split)

    # Move randomly chosen items from train_set to test_set
    while len(test_set) < test_size:
        random_index = randrange(len(train_set))
        test_set.append(train_set.pop(random_index))

    return train_set, test_set

We can then create a dummy dataset to test this function, using 30% of the dataset as the test subset.

dataset = [1,2,3,4,5,6,7,8,9,10]
train, test =  train_test_split(dataset, 0.3)
print('Train:', train)
print('Test:', test)

A sample output might look like this:

Train: [1, 2, 4, 5, 7, 8, 10]
Test: [9, 3, 6]
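
Because randrange() draws from Python's global random state, the split above changes on every run. When a repeatable split is needed, one option is to shuffle a copy of the data with a seeded generator and slice it. The sketch below is a variation rather than the function above; the name seeded_train_test_split and its seed parameter are assumptions introduced for illustration.

```python
from random import Random

def seeded_train_test_split(dataset, split, seed=42):
    # Hypothetical helper: shuffle a copy with a private, seeded generator
    shuffled = list(dataset)
    Random(seed).shuffle(shuffled)

    # Round so that floating-point noise cannot change the test size
    test_size = round(len(shuffled) * split)

    # The first test_size shuffled items become the test set
    test_set = shuffled[:test_size]
    train_set = shuffled[test_size:]
    return train_set, test_set

dataset = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
train, test = seeded_train_test_split(dataset, 0.3, seed=1)
print('Train:', train)
print('Test:', test)
```

Calling the function again with the same seed should return the same two lists, which makes experiments easier to compare.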

K-Fold Cross Validation Split

Another resampling method performs the train-test split several times, each time using a different partition as the test set. This is called k-fold cross validation: the dataset is split into k equal-sized folds (groups), and training and testing are done k times. In each round, one fold serves as the test set and the remaining k - 1 folds together form the training set. The overall performance of the model is the average of its performance across all rounds.

For example, if k = 3, we can name the folds A, B, and C. Training and testing are then done three times:

Round    Training set    Test set
1        A, B            C
2        B, C            A
3        A, C            B

To generate the folds from the dataset, the implementation looks similar to train-test split, except that the random selection is repeated k times, once per fold. Again, we use randrange() to randomly pick the items that go into each fold.

from random import randrange

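def cross_validation_split(dataset, k_folds):
    remaining = list(dataset)  # work on a copy so the caller's list is not emptied
    fold_size = len(remaining) // k_folds  # folds are of the same size; leftovers are dropped
    dataset_splits = list()

    for _ in range(k_folds):
        fold = list()

        # Move randomly chosen items out of the remaining pool into this fold
        while len(fold) < fold_size:
            random_index = randrange(len(remaining))
            fold.append(remaining.pop(random_index))

        dataset_splits.append(fold)

    return dataset_splits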

We set our sample dataset to be the numbers from 1 to 9, and k = 3.

dataset = list(range(1, 10))
k = 3
dataset_folds = cross_validation_split(dataset, k)

dataset_folds would then contain a 2-dimensional list of the elements in each fold; i.e., dataset_folds[0] is fold A, dataset_folds[1] is fold B, and dataset_folds[2] is fold C. A sample result:

[[7, 4, 9], [8, 5, 3], [6, 1, 2]]
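
cross_validation_split() truncates the fold size to a whole number, so when the dataset size is not a multiple of k the leftover items never land in any fold. One way to keep every item, sketched below, is to shuffle once and deal the items out round-robin, which gives folds whose sizes differ by at most one. The function name balanced_folds and its seed parameter are assumptions for illustration, not part of the code above.

```python
from random import Random

def balanced_folds(dataset, k_folds, seed=None):
    # Hypothetical helper: shuffle a copy so the caller's list is untouched
    shuffled = list(dataset)
    Random(seed).shuffle(shuffled)

    # Deal items round-robin: fold i gets positions i, i + k, i + 2k, ...
    return [shuffled[i::k_folds] for i in range(k_folds)]

dataset = list(range(1, 11))  # 10 items do not divide evenly by 3
folds = balanced_folds(dataset, 3, seed=0)
print([len(fold) for fold in folds])  # fold sizes: [4, 3, 3]
```

The trade-off is that the folds are no longer exactly equal in size, so the per-round scores rest on slightly different amounts of test data.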

We can then simulate the train-test process by referencing indices in the dataset_folds list.

for i in range(k):
    # Start with every fold index, then pull out the i-th one as the test fold
    train_set_indices = list(range(k))
    test_set_index = train_set_indices.pop(i)

    print('Round', i+1)
    print('Training set:', [dataset_folds[index] for index in train_set_indices])
    print('Test set:', dataset_folds[test_set_index])
    print()

Running the above code would result in three rounds of training and testing with different subsets of the dataset.

Round 1
Training set: [[8, 5, 3], [6, 1, 2]]
Test set: [7, 4, 9]

Round 2
Training set: [[7, 4, 9], [6, 1, 2]]
Test set: [8, 5, 3]

Round 3
Training set: [[7, 4, 9], [8, 5, 3]]
Test set: [6, 1, 2]
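
To tie the rounds back to the averaging step described earlier, the sketch below scores a deliberately trivial "model" on each round and averages the results. The model (always predict the mean of the training values) and the metric (mean absolute error) are stand-ins chosen only to keep the example self-contained; evaluate_fold and the hard-coded folds are likewise assumptions for illustration.

```python
def evaluate_fold(train_values, test_values):
    # Stand-in "model": always predict the mean of the training values
    prediction = sum(train_values) / len(train_values)
    # Stand-in metric: mean absolute error on the held-out fold
    return sum(abs(value - prediction) for value in test_values) / len(test_values)

# Folds copied from the sample output shown earlier
dataset_folds = [[7, 4, 9], [8, 5, 3], [6, 1, 2]]

scores = []
for i, test_fold in enumerate(dataset_folds):
    # Flatten every fold except the i-th into one training list
    train_values = [value
                    for j, fold in enumerate(dataset_folds) if j != i
                    for value in fold]
    scores.append(evaluate_fold(train_values, test_fold))

# The overall score is the mean of the per-round scores
average_score = sum(scores) / len(scores)
print('Scores per round:', scores)
print('Average score:', average_score)
```

With a real model, evaluate_fold would train on train_values and report whatever metric suits the task; the loop and the final average would stay the same.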
