
Precision vs. recall

What it is

A puppy finder looks through a pile of photos and flags the ones it thinks are puppies. Two different questions judge it. Precision asks: of the photos it flagged, how many were really puppies? Recall asks: of all the real puppies, how many did it find? They are not the same — and you usually can’t max out both at once.
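
As a rough illustration of the two questions above, here is a minimal Python sketch. The ten photos and the finder's flags are made-up toy data, and the helper name precision_recall is just an illustrative choice, not part of any library.

```python
# Hypothetical toy data (not from the lesson): 1 = puppy, 0 = not a puppy.
truth   = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]   # what each photo really shows
flagged = [1, 1, 1, 0, 1, 1, 0, 0, 0, 0]   # what the finder flagged

def precision_recall(truth, flagged):
    """Return (precision, recall) for two equal-length lists of 0/1 values."""
    pairs = list(zip(truth, flagged))
    tp = sum(1 for t, f in pairs if t == 1 and f == 1)  # real puppies that were flagged
    fp = sum(1 for t, f in pairs if t == 0 and f == 1)  # non-puppies that were flagged
    fn = sum(1 for t, f in pairs if t == 1 and f == 0)  # real puppies that were missed
    precision = tp / (tp + fp) if (tp + fp) else 0.0    # of the flagged, how many are puppies
    recall = tp / (tp + fn) if (tp + fn) else 0.0       # of the puppies, how many were flagged
    return precision, recall

precision, recall = precision_recall(truth, flagged)
print(f"precision = {precision:.2f}, recall = {recall:.2f}")
# prints: precision = 0.60, recall = 0.75
```

Here the finder flagged five photos and three were puppies (precision 3/5), while it found three of the four real puppies (recall 3/4).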

Go deeper: a threshold turns each photo’s score into a flag or a skip. Make it picky (high threshold) and almost everything you flag is a true puppy, so precision usually climbs, but you stay quiet about the rest, so recall falls. Make it eager (low threshold) and you catch nearly every puppy, so recall climbs, but you flag lots of non-puppies too, so precision falls. The precision-recall curve draws that trade-off.
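
The threshold trade-off can be sketched in a few lines of Python. The scores below are invented for illustration, and precision_recall_at is a hypothetical helper name. Treating precision as 1.0 when nothing is flagged is one common convention, assumed here.

```python
# Hypothetical (score, is_puppy) pairs; higher score = finder is more sure it's a puppy.
photos = [(0.95, 1), (0.90, 1), (0.80, 0), (0.70, 1), (0.60, 1),
          (0.50, 0), (0.40, 1), (0.30, 0), (0.20, 0), (0.10, 0)]

def precision_recall_at(photos, threshold):
    """Flag every photo whose score is >= threshold, then measure the result."""
    flagged = [(score, is_puppy) for score, is_puppy in photos if score >= threshold]
    true_positives = sum(is_puppy for _, is_puppy in flagged)
    total_puppies = sum(is_puppy for _, is_puppy in photos)
    # Assumed convention: if nothing is flagged, nothing was flagged wrongly.
    precision = true_positives / len(flagged) if flagged else 1.0
    recall = true_positives / total_puppies if total_puppies else 0.0
    return precision, recall

# Slide the threshold from picky to eager.
for threshold in (0.85, 0.55, 0.25):
    p, r = precision_recall_at(photos, threshold)
    print(f"threshold {threshold:.2f}: precision {p:.3f}, recall {r:.3f}")
# prints:
# threshold 0.85: precision 1.000, recall 0.400
# threshold 0.55: precision 0.800, recall 0.800
# threshold 0.25: precision 0.625, recall 1.000
```

Plotting the (recall, precision) pairs for every threshold would trace out the precision-recall curve for this toy data.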

Why care

Which mistake hurts more depends on the job. A test for a serious illness wants high recall — missing a sick person is far worse than a false alarm you can double-check. A spam filter wants high precision — tossing a real message into the junk folder is worse than letting a little spam through. Accuracy alone hides this; precision and recall make you choose, on purpose, which error you can live with.
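
To see how accuracy alone can hide the problem, consider a small Python sketch with made-up numbers: 100 people, 5 of them sick, and a lazy "test" that calls everyone healthy. The data and function names are assumptions for illustration only.

```python
# Hypothetical screening data: 1 = sick, 0 = healthy.
truth = [1] * 5 + [0] * 95          # 5 sick people out of 100
always_healthy = [0] * 100          # a "test" that never raises an alarm

def accuracy(truth, predicted):
    """Fraction of all answers that match the truth."""
    return sum(t == p for t, p in zip(truth, predicted)) / len(truth)

def recall(truth, predicted):
    """Fraction of the truly sick people that the test caught."""
    caught = sum(1 for t, p in zip(truth, predicted) if t == 1 and p == 1)
    sick = sum(truth)
    return caught / sick if sick else 0.0

print(f"accuracy = {accuracy(truth, always_healthy):.2f}")  # prints: accuracy = 0.95
print(f"recall   = {recall(truth, always_healthy):.2f}")    # prints: recall   = 0.00
```

The test looks 95% accurate, yet it misses every sick person. Recall exposes that; accuracy does not.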

The idea, intuitively

Picture the photos laid out by score, and a line you can slide. Everything to the right of the line is “flagged.” Recall is about the green puppies: how many of them landed in the flagged zone. Precision is about the flagged zone itself: how green it is. Slide the line and the two move in opposite directions — a see-saw you can feel.

Peek at the data first

Every photo has a finder score and the truth of what it really showed. The truth is what lets us measure precision and recall — much like Spectra’s describe_data summarizes a dataset before you trust it.

Try it

Slide the threshold and watch the blue precision bar and orange recall bar see-saw, while the dot rides along the precision-recall curve. Turn on “Show the balance score (F1)” to add a green bar and a star marking the threshold where F1 peaks, the best available balance between precision and recall.
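
The balance score can also be worked out by hand. F1 is the harmonic mean of precision and recall, F1 = 2PR / (P + R). The sketch below tries each photo's score as a threshold and keeps the one with the highest F1. The scores are invented toy data, and f1_at is a hypothetical helper, not the simulation's own code.

```python
# Hypothetical (score, is_puppy) pairs, as in a small photo pile.
photos = [(0.95, 1), (0.90, 1), (0.80, 0), (0.70, 1), (0.60, 1),
          (0.50, 0), (0.40, 1), (0.30, 0), (0.20, 0), (0.10, 0)]

def f1_at(photos, threshold):
    """F1 = 2PR / (P + R) when every photo scoring >= threshold is flagged."""
    flagged = [is_puppy for score, is_puppy in photos if score >= threshold]
    tp = sum(flagged)                                   # flagged photos that are puppies
    total_puppies = sum(is_puppy for _, is_puppy in photos)
    if tp == 0:
        return 0.0                                      # no puppies caught: F1 is zero
    precision = tp / len(flagged)
    recall = tp / total_puppies
    return 2 * precision * recall / (precision + recall)

# Try every observed score as a candidate threshold and keep the best one.
candidates = sorted({score for score, _ in photos})
best_threshold = max(candidates, key=lambda t: f1_at(photos, t))
best_f1 = f1_at(photos, best_threshold)
print(f"best threshold = {best_threshold:.2f}, F1 = {best_f1:.3f}")
# prints: best threshold = 0.40, F1 = 0.833
```

In this toy pile the star would sit at 0.40, where recall is 1.0 and precision is 5/7.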

Where it shows up

Medical screening. Tuned for high recall — catch every possible case, then confirm with a careful follow-up test.
Spam & safety filters. Tuned for high precision — don’t block real messages or punish innocent users.
Search and recommendations. The first results should be precise (relevant), while still recalling enough good options to choose from.

Where it came from

Precision and recall come from information retrieval — the science of finding the right documents in a big library. Researchers including Cyril Cleverdon used them in the Cranfield search experiments of the late 1950s and 1960s, and they became the standard pair for judging searches and, later, classifiers. The combined F1 score (their balance) came into wide use afterward.

Try it in code

In the Studio, check reports the confusion matrix behind a model’s guesses — the same true/false positives and negatives that precision and recall are built from:

data  = load "fruits"
train, test = split data, hold_out: 20%

model = make_model "tree"
train_model model, on: train, predict: "type", using: ["sweetness", "size"]

check model, with: test
show_model model
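
Outside the Studio, the same four counts can be tallied by hand. The Python sketch below is not the Studio's check command. It is a minimal stand-in that assumes made-up fruit labels and treats "apple" as the positive class; confusion_counts is a hypothetical helper name.

```python
from collections import Counter

# Hypothetical true and predicted fruit types for eight test rows.
actual    = ["apple", "apple", "pear", "apple", "pear",  "pear", "apple", "pear"]
predicted = ["apple", "pear",  "pear", "apple", "apple", "pear", "apple", "pear"]

def confusion_counts(actual, predicted, positive):
    """Tally true/false positives and negatives for one chosen 'positive' class."""
    counts = Counter()
    for truth, guess in zip(actual, predicted):
        if guess == positive:
            counts["tp" if truth == positive else "fp"] += 1
        else:
            counts["fn" if truth == positive else "tn"] += 1
    return counts

c = confusion_counts(actual, predicted, positive="apple")
precision = c["tp"] / (c["tp"] + c["fp"])   # of the rows guessed "apple", how many were
recall = c["tp"] / (c["tp"] + c["fn"])      # of the real apples, how many were found
print(dict(c))
print(f"precision = {precision:.2f}, recall = {recall:.2f}")
# prints: precision = 0.75, recall = 0.75
```

With more than two fruit types, the same tally can be repeated once per class, each time treating that class as the positive one.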


Check your understanding

In your own words, what is the difference between precision and recall?
Name one job where you’d want high recall, and one where you’d want high precision.
Why might you raise the threshold even though it lowers recall?