
ROC curve & AUC

What it is

A detector gives every thing it sees a score — here, a radar reading of how “plane-like” a blip is. A threshold turns that score into a decision: sound the alarm, or stay quiet. The ROC curve shows what happens to two rates as you slide that threshold across every setting: the true-positive rate (real planes you caught) against the false-positive rate (birds you mistook for planes).
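To make the two rates concrete, here is a minimal Python sketch (not Studio code) that counts them at a single threshold. The ten blips, their scores, and the helper name rates_at are invented for illustration, and the alarm is assumed to sound when a score is at or above the threshold.

```python
# Hypothetical radar blips as (score, is_plane) pairs, invented for illustration.
blips = [
    (0.95, True), (0.90, True), (0.80, False), (0.70, True), (0.60, True),
    (0.50, False), (0.40, True), (0.30, False), (0.20, False), (0.10, False),
]


def rates_at(threshold, blips):
    """Return (false_positive_rate, true_positive_rate) for one threshold."""
    planes = sum(1 for _, is_plane in blips if is_plane)
    birds = len(blips) - planes
    # Assumption: the alarm sounds when score >= threshold.
    caught = sum(1 for score, is_plane in blips if is_plane and score >= threshold)
    false_alarms = sum(
        1 for score, is_plane in blips if not is_plane and score >= threshold
    )
    return false_alarms / birds, caught / planes


fpr, tpr = rates_at(0.55, blips)
print(f"false-alarm rate = {fpr:.2f}, catch rate = {tpr:.2f}")
# prints: false-alarm rate = 0.20, catch rate = 0.80
```

With the line at 0.55, four of the five planes are caught and one of the five birds trips the alarm. That pair of rates is one dot on the ROC curve.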

Go deeper: each threshold is one point — one (false-alarm, catch) pair. Sweep the threshold from strict to loose and those points trace a curve from the bottom-left (alarm never sounds) to the top-right (alarm always sounds). A great detector bows toward the top-left corner: lots caught, few false alarms. The straight diagonal is pure guessing. The area under the curve (AUC) squeezes the whole picture into one number — 1.0 is perfect, 0.5 is a coin flip.
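The sweep can be sketched in a few lines of Python (not Studio code). This is one simple way to do it, under stated assumptions: the blips are made up, every score is distinct (ties would need extra care), and the area is estimated with the trapezoid rule. The names roc_points and auc are hypothetical helpers, not part of any library.

```python
# Hypothetical radar blips as (score, is_plane) pairs, invented for illustration.
blips = [
    (0.95, True), (0.90, True), (0.80, False), (0.70, True), (0.60, True),
    (0.50, False), (0.40, True), (0.30, False), (0.20, False), (0.10, False),
]


def roc_points(blips):
    """Sweep the threshold from strict to loose; return (fpr, tpr) points."""
    planes = sum(1 for _, is_plane in blips if is_plane)
    birds = len(blips) - planes
    points = [(0.0, 0.0)]  # strictest setting: the alarm never sounds
    caught = false_alarms = 0
    # Loosen the threshold one blip at a time, highest score first.
    # Assumption: all scores are distinct.
    for _, is_plane in sorted(blips, reverse=True):
        if is_plane:
            caught += 1
        else:
            false_alarms += 1
        points.append((false_alarms / birds, caught / planes))
    return points  # the last point is (1.0, 1.0): the alarm always sounds


def auc(points):
    """Area under the curve by the trapezoid rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area


curve = roc_points(blips)
print(f"AUC = {auc(curve):.2f}")  # prints: AUC = 0.84
```

An AUC of 0.84 sits between the coin-flip 0.5 and the perfect 1.0, which matches a curve that bows toward the top-left without reaching the corner.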

Why care

A single accuracy number depends on where you happened to set the threshold: move the line and the number changes, even though the detector is the same. The ROC curve judges a detector across every threshold at once, so you can compare two models fairly before committing to one operating point. It also lets you pick the threshold on purpose: hug the top-left for balance, or slide along the curve to trade a few more false alarms for a few more catches when a miss is costly.
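Both points can be illustrated with a small Python sketch (not Studio code). The blips, the cost values, and the helper names are assumptions made up for the example, and minimizing a simple weighted count of misses and false alarms is just one possible way to choose a threshold on purpose.

```python
# Hypothetical radar blips as (score, is_plane) pairs, invented for illustration.
blips = [
    (0.95, True), (0.90, True), (0.80, False), (0.70, True), (0.60, True),
    (0.50, False), (0.40, True), (0.30, False), (0.20, False), (0.10, False),
]


def counts_at(threshold, blips):
    """Return (caught, misses, false_alarms, quiet) for one threshold."""
    caught = misses = false_alarms = quiet = 0
    for score, is_plane in blips:
        alarm = score >= threshold  # assumption: alarm at or above the line
        if is_plane and alarm:
            caught += 1
        elif is_plane:
            misses += 1
        elif alarm:
            false_alarms += 1
        else:
            quiet += 1
    return caught, misses, false_alarms, quiet


def accuracy_at(threshold, blips):
    caught, _, _, quiet = counts_at(threshold, blips)
    return (caught + quiet) / len(blips)


def cheapest_threshold(blips, miss_cost, false_alarm_cost):
    """Try each observed score as the threshold; keep the lowest-cost one."""
    def cost(threshold):
        _, misses, false_alarms, _ = counts_at(threshold, blips)
        return miss_cost * misses + false_alarm_cost * false_alarms
    return min(sorted({score for score, _ in blips}), key=cost)


# Same detector, three thresholds, three different accuracy numbers.
for t in (0.85, 0.55, 0.15):
    print(f"threshold {t:.2f}: accuracy {accuracy_at(t, blips):.1f}")
# prints 0.7, then 0.8, then 0.6

print(cheapest_threshold(blips, miss_cost=5, false_alarm_cost=1))  # prints 0.4
print(cheapest_threshold(blips, miss_cost=1, false_alarm_cost=5))  # prints 0.9
```

When a miss is five times as costly as a false alarm, the cheapest line is loose (0.4). Flip the costs and it turns strict (0.9). The detector is the same in both cases; only the operating point moves.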

The idea, intuitively

Picture all the blips laid out by score, with a line you can slide. Everything to the right of the line sets off the alarm. Move the line left and you catch more planes — but more birds trip the alarm too. Move it right and the birds go quiet — but planes slip through. Plot “planes caught” up and “false alarms” across, drop a dot for every line position, and you’ve drawn the ROC curve.

Peek at the data first

Every blip has a radar score and the truth of what it really was. The truth is what lets us count catches and false alarms — much like Spectra’s describe_data summarizes a dataset before you trust it.

Try it

Slide the threshold and watch the green catch-rate bar and red false-alarm bar move together while the dot rides along the ROC curve. Tick “Shade the area under the curve (AUC)” to see the single number that scores the detector across every threshold at once.

Where it shows up

Medical tests. ROC/AUC is the standard way to report how well a test separates sick from healthy across all possible cut-offs.
Fraud & spam detection. Compare detectors by AUC, then pick the threshold that balances catches against false alarms for your costs.
Any score-and-threshold model. Image detectors, credit scoring, search ranking — anywhere a model outputs a score you must turn into a yes/no.

Where it came from

ROC stands for receiver operating characteristic. It was invented by radar engineers during World War II to measure how well an operator could tell real enemy aircraft from noise and bird clutter on the screen. In the 1950s it spread to signal-detection theory in psychology, then to medicine for judging diagnostic tests, and finally became an everyday tool in machine learning for comparing classifiers.

Try it in code

In the Studio, check reports the confusion matrix behind a model’s guesses. Those are the same true and false positives the ROC curve is built from, except that the ROC curve recounts them at every threshold:

data  = load "fruits"
train, test = split data, hold_out: 20%

model = make_model "tree"
train_model model, on: train, predict: "type", using: ["sweetness", "size"]

check model, with: test
show_model model


Check your understanding

What two rates do the axes of an ROC curve show?
Why is a detector that bows toward the top-left corner better than one near the diagonal?
Why can the AUC be a fairer score than a single accuracy number?