Train/test split
What it is
Imagine studying for a test by reading the answer key, then grading yourself on those very same questions. You’d score 100% — but you wouldn’t really know anything new. Machine learning has the same trap. The fix is a train/test split: hide some of your data before training, then grade the model only on that hidden data — questions it has never seen.
Go deeper: we cut the data into a training set (the model studies it) and a test set (kept secret until grading). The score on the training set is almost always too good — the model has, in effect, seen the answers. The score on the test set is the honest estimate of how it will do on brand-new data out in the world.
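To make the cut concrete, here is a minimal sketch in plain Python (not Spectra). The helper name train_test_split, the seed parameter, and the fruit rows are all made up for illustration; the only idea taken from the text is shuffling the data and hiding a slice before any training happens.

```python
import random

def train_test_split(rows, hold_out=0.3, seed=0):
    """Shuffle a copy of rows, then hide a hold_out fraction as the test set."""
    if not 0 < hold_out < 1:
        raise ValueError("hold_out must be between 0 and 1")
    shuffled = list(rows)                    # copy, so the caller's list is untouched
    random.Random(seed).shuffle(shuffled)    # fixed seed makes the split repeatable
    n_test = max(1, round(len(shuffled) * hold_out))
    return shuffled[n_test:], shuffled[:n_test]   # (train, test)

# Hypothetical fruit rows: (sweetness, size, kind)
fruit = [
    (8, 7, "apple"), (7, 8, "apple"), (9, 6, "apple"), (6, 7, "apple"), (8, 8, "apple"),
    (3, 5, "lemon"), (2, 6, "lemon"), (4, 5, "lemon"), (3, 4, "lemon"), (2, 5, "lemon"),
]

train, test = train_test_split(fruit, hold_out=0.3)
print(len(train), "to study,", len(test), "held out")
```

Shuffling first matters: if the rows were sorted by kind, slicing off the end without shuffling would hide only lemons.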
Why care
This is the difference between a model that looks smart and one that is smart. A model can ace the data it studied and still flop on anything new — that’s called overfitting, and the only way to catch it is to hold out a test set. Every honest claim about how well a model works — in a research paper or a product — rests on testing against data the model never trained on.
The idea, intuitively
Below, a simple model learns to tell apples from lemons. The fruit it studies are solid; the fruit we hold out are hollow. Grade it on the solid fruit and it looks great — especially with k = 1, where every fruit’s nearest neighbor is itself, so it scores a perfect 100%. Then grade it on the hollow fruit it never saw: the score drops to the truth. That gap is the whole point.
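The gap can be reproduced in a few lines. This is a rough sketch in plain Python, assuming a nearest-neighbor vote on squared straight-line distance; the function names and the fruit rows are invented for illustration, and the rows overlap a little on purpose so the model has something to get wrong.

```python
from collections import Counter

def knn_predict(studied, point, k):
    """Let the k studied fruit closest to `point` vote on its kind."""
    by_distance = sorted(
        studied,
        key=lambda row: (row[0] - point[0]) ** 2 + (row[1] - point[1]) ** 2,
    )
    votes = Counter(kind for _, _, kind in by_distance[:k])
    return votes.most_common(1)[0][0]

def accuracy(studied, graded, k):
    """Fraction of `graded` rows whose predicted kind matches the real one."""
    hits = sum(knn_predict(studied, (sweet, size), k) == kind
               for sweet, size, kind in graded)
    return hits / len(graded)

# Hypothetical rows: (sweetness, size, kind)
studied = [(8, 7, "apple"), (7, 8, "apple"), (9, 6, "apple"), (5, 6, "apple"),
           (3, 5, "lemon"), (2, 6, "lemon"), (4, 5, "lemon"), (6, 6, "lemon")]
held_out = [(8, 8, "apple"), (6, 7, "apple"), (3, 4, "lemon"), (4, 4, "lemon")]

train_score = accuracy(studied, studied, k=1)    # graded on fruit it studied
test_score = accuracy(studied, held_out, k=1)    # graded on fruit it never saw
print("studied:", train_score, "held out:", test_score)
```

With k = 1, each studied fruit finds itself at distance zero, so the first score is a perfect 1.0 no matter how messy the data is. The second score comes out lower because one held-out apple sits closest to a studied lemon.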
Peek at the data first
Here are the fruit, each with its sweetness, size, and which kind it really is. We’ll set some rows aside as a secret test, just as Spectra’s split data, hold_out: does before training.
Try it
Slide Hold out for testing to choose how much fruit to hide, and k to set how many neighbors vote. Watch the two scores: the blue one is graded on fruit the model studied; the green one is the honest score on new fruit. Turn on Mark the studied fruit instead to move the ✓/✗ marks onto the training set.
Where it shows up
- Every honest benchmark. When a team reports a model’s accuracy, it’s the test-set score — data the model never trained on.
- Catching overfitting. A big gap between training and test scores is the warning sign that a model memorized instead of learned.
- Tuning fairly. People often hold out a third slice (a validation set) to pick settings, keeping the test set truly untouched until the very end.
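The three-slice idea from the last bullet can be sketched like this in plain Python. The function name three_way_split and the 60/20/20 proportions are assumptions chosen for the example, not a rule from the text.

```python
import random

def three_way_split(rows, validation=0.2, test=0.2, seed=0):
    """Shuffle a copy of rows, then cut it into train, validation, and test slices."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test)
    n_val = round(len(shuffled) * validation)
    test_set = shuffled[:n_test]                  # untouched until the very end
    val_set = shuffled[n_test:n_test + n_val]     # used to compare settings such as k
    train_set = shuffled[n_test + n_val:]         # the model studies only this slice
    return train_set, val_set, test_set

rows = list(range(20))    # stand-in for 20 data rows
train_set, val_set, test_set = three_way_split(rows)
print(len(train_set), len(val_set), len(test_set))
```

The usual routine is to train on train_set, try each setting against val_set, keep the best one, and only then grade once on test_set.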
Where it came from
The idea of holding back data to test a model honestly grew out of 20th-century statistics. In the mid-1970s, Mervyn Stone (1974) and Seymour Geisser (1975) helped formalize cross-validation: rotating which slice is held out so every example gets a fair turn as the test. Today a train/test split (and its cousin, cross-validation) is the first rule of evaluating any model.
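The rotation behind cross-validation can be sketched in plain Python as follows. The helper k_fold_indices is a made-up name, and this version skips the shuffle that would normally come first; it only shows how each row takes exactly one turn in the held-out slice.

```python
def k_fold_indices(n_rows, n_folds):
    """Yield (train_idx, test_idx) pairs; each row index is held out exactly once."""
    indices = list(range(n_rows))
    for fold in range(n_folds):
        test_idx = indices[fold::n_folds]     # every n_folds-th row, starting at `fold`
        held = set(test_idx)
        train_idx = [i for i in indices if i not in held]
        yield train_idx, test_idx

folds = list(k_fold_indices(10, 5))
for train_idx, test_idx in folds:
    print("held out:", test_idx)
```

A model would be trained and graded once per fold, and the fold scores averaged into one estimate.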
Try it in code
In the Studio, split hides a slice of the data, and check grades the model only on that held-out slice:
data = load "fruits"
train, test = split data, hold_out: 30%
model = make_model "classifier"
train_model model, on: train, predict: "type", using: ["sweetness", "size"]
check model, with: test
Check your understanding
- Why is a 100% score on the training data not proof that a model is good?
- What does a big gap between the training score and the test score tell you?
- Why must the test set stay hidden until the very end?