Data leakage
What it is
Data leakage is when information from the test secretly sneaks into training. The model then looks amazing on that test — but it cheated, so the score is a lie. Here a quiz-robot has a true skill of 70%. Leak some of the real test answers into its study set and watch its test score shoot up… while its skill on brand-new questions never moves.
Go deeper: leakage hides in sneaky places. Copying rows into both study and test is the obvious one, but it also happens when you include a column that secretly contains the answer, or when you scale/clean numbers using the whole dataset before splitting (so the test quietly influences the training). The core habit is the same each time: split first, do any scaling or cleaning using the training part only, drop any column that gives the answer away, and keep the test sealed until the very end.
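The "scale before splitting" leak can be seen in a small sketch. The numbers and the helper names `fit_scaler` and `apply_scaler` are made up for illustration (assumptions, not part of the simulation). The leaky version learns the centre and spread from all rows, so the test rows influence how the training data is scaled; the clean version learns them from the training rows alone.

```python
from statistics import mean, pstdev

# Hypothetical feature values (e.g. hours studied); illustrative only.
values = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 20.0]

# Split first: the last two values are the held-out test part.
train, test = values[:6], values[6:]


def fit_scaler(rows):
    """Learn the centre and spread from the given rows only."""
    return mean(rows), pstdev(rows)


def apply_scaler(rows, centre, spread):
    """Standardise rows using statistics learned elsewhere."""
    return [(x - centre) / spread for x in rows]


# Leaky: statistics come from train AND test together.
leaky_centre, leaky_spread = fit_scaler(values)

# Clean: statistics come from the training part only.
clean_centre, clean_spread = fit_scaler(train)

# The test part is transformed with the training statistics, never its own.
train_scaled = apply_scaler(train, clean_centre, clean_spread)
test_scaled = apply_scaler(test, clean_centre, clean_spread)

print("leaky centre:", leaky_centre)  # pulled upward by the test value 20.0
print("clean centre:", clean_centre)  # depends on training rows only
```

The two centres differ only because the test rows were allowed to vote in the leaky version, which is exactly the quiet influence described above.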
Why care
Leakage is one of the most common ways people fool themselves in machine learning. A model that scores 99% in testing and then flops with real users has very often leaked. Spotting it is what separates a result you can trust from one that just looks good on paper.
The idea, intuitively
Imagine studying for a test by memorising the actual exam paper the night before. You’d ace that exact test — and learn nothing. Hand you a fresh paper and you’re back to your real level. Leakage is the model memorising the exam paper without anyone meaning for it to.
Peek at the data first
The scenario is fixed — you never type anything (safety by design). What matters is keeping two sets separate, and what goes wrong when they overlap.
Try it
Slide “Test answers that leaked into study” up and watch the test score climb toward a fake 100%. Then tick “Check it on brand-new data” to reveal the real-world score. The gap between the two bars is the leakage lie.
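One rough way to reason about the two bars: assume the robot answers every leaked question correctly (it memorised them) and answers the rest at its true skill. That assumption and the function name `expected_test_score` are illustrative guesses, not taken from the simulation's actual code.

```python
def expected_test_score(true_skill, leaked_fraction):
    """Expected score on a test where some answers leaked.

    Assumes leaked questions are always answered correctly and the
    remaining questions are answered at the robot's true skill.
    """
    if not 0.0 <= leaked_fraction <= 1.0:
        raise ValueError("leaked_fraction must be between 0 and 1")
    return leaked_fraction + (1.0 - leaked_fraction) * true_skill


TRUE_SKILL = 0.70  # the quiz-robot's real skill on brand-new questions

for leaked in (0.0, 0.25, 0.5, 1.0):
    test_score = expected_test_score(TRUE_SKILL, leaked)
    gap = test_score - TRUE_SKILL  # how much the test score overstates skill
    print(f"leaked={leaked:.0%}  test={test_score:.1%}  "
          f"fresh={TRUE_SKILL:.0%}  gap={gap:.1%}")
```

Under this assumption the fresh-data score does not depend on `leaked_fraction` at all, while the test score rises in a straight line toward 100%.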
Where it shows up
- Accidental copies. The same rows end up in both the training and test sets.
- Giveaway columns. A feature secretly encodes the answer (like an ID that tracks the label).
- Cleaning too early. Scaling or filling-in numbers using the whole dataset before splitting.
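The first two cases can often be caught with simple checks before training. The sketch below uses a tiny made-up table and hypothetical helper names (`shared_rows`, `is_giveaway`); it is one possible sanity check under those assumptions, not a complete leakage detector.

```python
# Hypothetical rows: (student_id, hours_studied, result). Illustrative only.
train_rows = [
    ("p-01", 5, "pass"),
    ("f-02", 2, "fail"),
    ("p-03", 2, "pass"),
    ("f-04", 1, "fail"),
]
test_rows = [
    ("p-03", 2, "pass"),  # accidental copy of a training row
    ("f-05", 3, "fail"),
]


def shared_rows(train, test):
    """Return rows that appear in both sets (accidental copies)."""
    return sorted(set(train) & set(test))


def is_giveaway(rows, feature, label):
    """True if each feature value always goes with a single label value.

    This is a warning sign, not proof of leakage. Features that are unique
    per row pass trivially, so use it on features with repeated values.
    """
    seen = {}
    for row in rows:
        key = feature(row)
        if seen.setdefault(key, label(row)) != label(row):
            return False
    return True


print("copies in both sets:", shared_rows(train_rows, test_rows))

# The first letter of the ID tracks the label in this toy table.
print("ID prefix is a giveaway:",
      is_giveaway(train_rows, lambda r: r[0][0], lambda r: r[2]))

# Two students studied 2 hours with different results, so hours is not.
print("hours is a giveaway:",
      is_giveaway(train_rows, lambda r: r[1], lambda r: r[2]))
```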
Where it came from
As machine learning spread into competitions and industry, researchers kept finding “too good” results that fell apart in the real world. Kaufman, Rosset, and Perlich wrote a well-known paper, “Leakage in Data Mining” (2011), cataloguing how it sneaks in, and it is now a standard warning in many data-science courses.
Try it in code
The honest way: split the data first, train on the learn part, and only ever
check on the held-out part — never let the model see it during training:
data = load "students"
train, test = split data, hold_out: 30%
model = make_model "tree"
train_model model, on: train, predict: "result", using: ["hours_studied", "sleep_hours", "attendance"]
check model, with: test
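In a general-purpose language, the split step above might look roughly like this. It is a guess at the idea behind `split ... hold_out: 30%`, not the actual implementation; the Python function `split`, its `seed` parameter, and the stand-in data are all assumptions made for illustration.

```python
import random


def split(rows, hold_out=0.30, seed=0):
    """Shuffle a copy of rows and hold out a fraction for testing."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split repeatable
    n_test = round(len(shuffled) * hold_out)
    return shuffled[n_test:], shuffled[:n_test]  # (train, test)


# Hypothetical stand-in for the "students" data: ten row identifiers.
data = list(range(10))
train, test = split(data, hold_out=0.30)

# The two parts never share a row, and together they cover everything.
assert not set(train) & set(test)
assert sorted(train + test) == data

print(len(train), "training rows,", len(test), "held-out rows")
```

Everything after the split should touch only `train` until the final check, which is the one place `test` is used.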
Check your understanding
- Why does the real-world score stay at 70% no matter how much leaks?
- Name two sneaky ways test information can leak into training.
- What simple habit prevents most leakage?