Train/test split

What it is

Imagine studying for a test by reading the answer key, then grading yourself on those very same questions. You’d score 100% — but you wouldn’t really know anything new. Machine learning has the same trap. The fix is a train/test split: hide some of your data before training, then grade the model only on that hidden data — questions it has never seen.

Go deeper: we cut the data into a training set (the model studies it) and a test set (kept secret until grading). The score on the training set is almost always too good — the model has, in effect, seen the answers. The score on the test set is the honest estimate of how it will do on brand-new data out in the world.
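
In plain Python, the cut described above can be sketched in a few lines. This is only an illustration under stated assumptions, not how Spectra’s split works internally: the function name train_test_split, the seed parameter, and the fruit values are all made up for the example.

```python
import random


def train_test_split(rows, hold_out=0.3, seed=0):
    """Shuffle a copy of rows, then cut it into (train, test).

    hold_out is the fraction of rows to keep secret for grading.
    The name, signature, and seed are hypothetical choices for this sketch.
    """
    if not 0.0 < hold_out < 1.0:
        raise ValueError("hold_out must be strictly between 0 and 1")
    shuffled = list(rows)                  # copy, so the caller's list is untouched
    random.Random(seed).shuffle(shuffled)  # shuffle so the cut is not biased by row order
    n_test = max(1, round(len(shuffled) * hold_out))
    return shuffled[n_test:], shuffled[:n_test]


# (sweetness, size, kind): invented values, for illustration only
fruits = [
    (8.0, 7.5, "apple"), (7.5, 7.0, "apple"), (8.5, 8.0, "apple"),
    (7.0, 6.5, "apple"), (8.2, 7.1, "apple"),
    (3.0, 5.0, "lemon"), (2.5, 5.5, "lemon"), (3.5, 4.5, "lemon"),
    (3.0, 6.0, "lemon"), (2.8, 4.8, "lemon"),
]

train, test = train_test_split(fruits, hold_out=0.3)
print(len(train), "rows to study,", len(test), "rows kept secret")
```

Shuffling before the cut matters: if the rows were sorted by kind, a straight cut could hide every lemon from the model. Fixing the seed makes the same rows land in the test set on every run, which keeps scores comparable.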

Why care

This is the difference between a model that looks smart and one that is smart. A model can ace the data it studied and still flop on anything new. That failure is called overfitting, and you can only catch it by grading the model on data it did not train on. Every honest claim about how well a model works, whether in a research paper or a product, rests on that kind of test.

The idea, intuitively

Below, a simple nearest-neighbor model learns to tell apples from lemons: to classify a fruit, it lets the k closest studied fruit vote. The fruit it studies are solid; the fruit we hold out are hollow. Grade it on the solid fruit and it looks great, especially with k = 1, where every studied fruit’s nearest neighbor is itself, so it scores a perfect 100%. Then grade it on the hollow fruit it never saw, and the score drops to an honest one. That gap is the whole point.
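
The k = 1 effect is easy to reproduce in a few lines of Python. The sketch below is a bare-bones nearest-neighbor classifier, not the code behind the simulation: the helper names knn_predict and accuracy and every fruit value are assumptions made for illustration, and the training set deliberately includes one odd, sour little apple sitting among the lemons.

```python
import math
from collections import Counter


def knn_predict(train, point, k):
    """Return the majority kind among the k training rows closest to point."""
    nearest = sorted(train, key=lambda row: math.dist(row[:2], point))[:k]
    votes = Counter(kind for _, _, kind in nearest)
    return votes.most_common(1)[0][0]


def accuracy(train, rows, k):
    """Fraction of rows whose predicted kind matches the true kind."""
    hits = sum(knn_predict(train, row[:2], k) == row[2] for row in rows)
    return hits / len(rows)


# (sweetness, size, kind): invented values, for illustration only
train = [
    (8.0, 7.5, "apple"), (7.5, 7.0, "apple"), (8.5, 8.0, "apple"),
    (4.0, 5.5, "apple"),  # the odd one: a sour, small apple
    (3.0, 5.0, "lemon"), (2.5, 5.5, "lemon"), (3.5, 4.5, "lemon"),
    (3.0, 6.0, "lemon"),
]
test = [
    (7.8, 7.2, "apple"), (8.2, 7.8, "apple"),
    (3.2, 5.2, "lemon"), (4.1, 5.4, "lemon"),
]

for k in (1, 3):
    print(f"k={k}: studied fruit {accuracy(train, train, k):.2f}, "
          f"new fruit {accuracy(train, test, k):.2f}")
```

With k = 1, each studied fruit finds itself at distance zero, so the score on the studied fruit is perfect no matter how odd the data is. The new lemon that lands next to the odd apple gets mislabeled, which is why the score on new fruit comes out lower.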

Peek at the data first

Here are the fruit, each with its sweetness, size, and which kind it really is. We’ll set some rows aside as a secret test, just as Spectra’s split command (with its hold_out: option) does before training.

Try it

Slide Hold out for testing to choose how much fruit to hide, and k to set how many neighbors vote. Watch the two scores: the blue one is graded on fruit the model studied; the green one is the honest score on new fruit. Turn on Mark the studied fruit instead to move the ✓/✗ marks onto the training set.

Where it came from

The idea of holding back data to test a model honestly grew out of 20th-century statistics. In 1974 and 1975, Mervyn Stone and Seymour Geisser published papers that helped formalize cross-validation: rotating which slice is held out so every example gets a fair turn as the test. Today a train/test split (and its cousin, cross-validation) is the first rule of evaluating any model.
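
The rotation behind cross-validation can be sketched in Python as well. This is a minimal illustration rather than a reference implementation: the function name k_fold_indices and its parameters are hypothetical, and the number of slices is called n_folds here to avoid confusion with the k that counts voting neighbors.

```python
def k_fold_indices(n_rows, n_folds):
    """Yield (train_idx, test_idx) pairs so each row is held out exactly once.

    The name and signature are assumptions made for this sketch.
    """
    if not 2 <= n_folds <= n_rows:
        raise ValueError("n_folds must be between 2 and n_rows")
    indices = list(range(n_rows))
    for fold in range(n_folds):
        test_idx = indices[fold::n_folds]  # every n_folds-th row, starting at fold
        held_out = set(test_idx)
        train_idx = [i for i in indices if i not in held_out]
        yield train_idx, test_idx


for train_idx, test_idx in k_fold_indices(n_rows=10, n_folds=5):
    print("study:", train_idx, "grade on:", test_idx)
```

Each pass trains on most of the rows and grades on the slice left out, and averaging the scores from all passes gives a steadier estimate than a single split. Shuffling the rows first is usually a good idea, for the same reason as with a single split.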

Try it in code

In the Studio, split hides a slice of the data, and check grades the model only on that held-out slice:

data  = load "fruits"
train, test = split data, hold_out: 30%

model = make_model "classifier"
train_model model, on: train, predict: "type", using: ["sweetness", "size"]

check model, with: test
