Confusion matrix

What it is

A ripe-fruit detector gives every fruit a score, and a threshold turns that score into a yes/no guess: “ripe” or “not ripe.” But a guess can be right or wrong in two different ways. A confusion matrix is the little 2×2 scorecard that sorts every guess into four boxes: correct catches, correct skips, misses, and false alarms.

Go deeper: the four boxes have names. A true positive (TP) is really ripe and called ripe. A true negative (TN) is really not ripe and called not ripe. A false positive (FP) is a false alarm: not ripe, but called ripe. A false negative (FN) is a miss: really ripe, but called not ripe. A single number like “accuracy” smears all four boxes together, so it cannot tell you which kind of mistake the model is making.
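
To make the four boxes concrete, here is a minimal Python sketch that sorts every guess into its box. The scores and truths are made up for illustration, the function name confusion_counts is hypothetical, and treating a score exactly at the threshold as “said ripe” is an assumption, not something the simulation specifies.

```python
def confusion_counts(scores, truths, threshold):
    """Sort every guess into one of the four boxes."""
    counts = {"TP": 0, "TN": 0, "FP": 0, "FN": 0}
    for score, is_ripe in zip(scores, truths):
        # Assumption: a score at or above the threshold means "said ripe".
        said_ripe = score >= threshold
        if said_ripe and is_ripe:
            counts["TP"] += 1   # correct catch
        elif said_ripe and not is_ripe:
            counts["FP"] += 1   # false alarm
        elif not said_ripe and is_ripe:
            counts["FN"] += 1   # miss
        else:
            counts["TN"] += 1   # correct skip
    return counts


# Made-up detector scores and the truth for eight fruits.
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]
truths = [True, True, False, True, True, False, False, False]

counts = confusion_counts(scores, truths, threshold=0.5)
print(counts)  # {'TP': 3, 'TN': 3, 'FP': 1, 'FN': 1}
```

With this toy data the detector makes one mistake of each kind: the unripe fruit scored 0.7 is a false alarm, and the ripe fruit scored 0.4 is a miss.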

Why care

Not every mistake costs the same. A spam filter that sends a real email to the junk folder (a false alarm) is far more annoying than one that lets a little spam through. A medical test that misses a sick patient (a miss) is far worse than one that calls in a healthy person for a second look. The confusion matrix is how you see which kind of mistake your model is making, so you can tune it for the kind that matters.

The idea, intuitively

Picture the fruit laid out by score: really-ripe fruit on the top lane, really-not on the bottom. The threshold is a vertical line — everything to its right is “said ripe.” That line and the two lanes carve the picture into four quadrants, and those quadrants are exactly the four boxes of the matrix. Slide the line right and you stop crying “ripe!” at unripe fruit (fewer false alarms) but you start missing real ripe fruit. The two mistakes trade off.
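
The trade-off can be checked with a few lines of Python. This is only a sketch on made-up scores: the helper mistakes_at is hypothetical, and it assumes a score at or above the threshold counts as “said ripe.”

```python
# Made-up detector scores and the truth for eight fruits.
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]
truths = [True, True, False, True, True, False, False, False]


def mistakes_at(threshold):
    """Count the two kinds of mistake at one threshold position."""
    pairs = list(zip(scores, truths))
    # False alarm: said ripe (score >= threshold) but really not ripe.
    false_alarms = sum(1 for s, ripe in pairs if s >= threshold and not ripe)
    # Miss: said not ripe (score < threshold) but really ripe.
    misses = sum(1 for s, ripe in pairs if s < threshold and ripe)
    return false_alarms, misses


# Slide the line to the right in three steps.
sweep = [(t, *mistakes_at(t)) for t in (0.25, 0.5, 0.75)]
for threshold, false_alarms, misses in sweep:
    print(f"threshold={threshold}: false alarms={false_alarms}, misses={misses}")
# threshold=0.25: false alarms=2, misses=0
# threshold=0.5: false alarms=1, misses=1
# threshold=0.75: false alarms=0, misses=2
```

As the threshold moves right, false alarms fall from 2 to 0 while misses rise from 0 to 2. On real data the two counts rarely trade this neatly, but they move in opposite directions.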

Peek at the data first

Each fruit has a detector score and the truth of what it really was. That truth is what lets us check the guesses — much like Spectra’s describe_data summarizes a dataset before you trust it.

Try it

Slide the threshold and watch the four boxes change. Click any box in the table to light up the fruit it contains and read what that box means. Turn on “Spotlight the mistakes” to ring the misses and false alarms at once.

Where it shows up

Spam filters. False positive = a real email in the junk folder; false negative = spam in your inbox. The matrix shows which is happening.
Medical screening. A miss (false negative) can be dangerous, so doctors often tune the threshold to catch more — accepting more false alarms.
Any classifier report. The confusion matrix is the starting point for accuracy, precision, recall, and almost every other score people quote.
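
As a rough sketch of how those scores come out of the four boxes, the Python below uses the standard definitions: accuracy = (TP + TN) / total, precision = TP / (TP + FP), recall = TP / (TP + FN). The function name summarize and the counts are hypothetical, and returning None when a score is undefined (a zero denominator) is a choice made for this example.

```python
def summarize(tp, tn, fp, fn):
    """Derive three common scores from the four boxes."""
    total = tp + tn + fp + fn
    return {
        # Share of all guesses that were right.
        "accuracy": (tp + tn) / total,
        # Of the fruit called ripe, how much really was ripe?
        "precision": tp / (tp + fp) if (tp + fp) else None,
        # Of the really ripe fruit, how much did we catch?
        "recall": tp / (tp + fn) if (tp + fn) else None,
    }


# Hypothetical detector on 8 fruits: one false alarm, one miss.
balanced = summarize(tp=3, tn=3, fp=1, fn=1)

# Hypothetical detector that never says "ripe" on 100 fruits, 5 of them ripe.
never_ripe = summarize(tp=0, tn=95, fp=0, fn=5)

print(balanced)
print(never_ripe)
```

The second detector reaches 95% accuracy while catching none of the ripe fruit (recall 0), which is the kind of problem the four boxes expose and a single accuracy number hides.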

Where it came from

The idea of crossing “what was true” with “what we said” in a table goes back to early statistics and to signal detection theory developed for radar in the 1950s, where operators had to tell a real blip from noise. The same four-box thinking — hits, misses, false alarms, correct rejections — became standard for judging any classifier.

Try it in code

In the Studio, the check command runs a trained model on held-out data and reports the confusion matrix behind its accuracy:

data  = load "fruits"
train, test = split data, hold_out: 20%

model = make_model "tree"
train_model model, on: train, predict: "type", using: ["sweetness", "size"]

check model, with: test
show_model model

Check your understanding

What is the difference between a false positive and a false negative?
When would you slide the threshold to make more false alarms on purpose?
Why can two models with the same accuracy still behave very differently?