Outlier hunt

What it is

An outlier is a value that sits far away from the rest — the point that does not belong. Spotting them is half the job of a data detective. But the real skill is the next question: is this odd value noise (a mistake or glitch we should remove) or signal (a real, surprising event we should study)? Get that call wrong and you either keep garbage or throw away gold.

Go deeper: a common, fair way to flag outliers is the IQR rule. Line the values up, find the middle half (between the 25% mark, Q1, and the 75% mark, Q3), and call its width the interquartile range (IQR). Anything more than 1.5 IQRs below Q1 or above Q3 is flagged. The rule resists extremes for the same reason the median does: it depends on where values rank, not on how far out the most extreme ones sit.
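The IQR rule can be sketched in a few lines of Python. This is an illustration, not the Studio's own implementation: the function names and the scoop counts are made up for the example, and it uses Python's "inclusive" quartile method, which is one of several conventions. Other tools may place the 25% and 75% marks slightly differently.

```python
import statistics

def iqr_fences(values, k=1.5):
    """Return (lower, upper) fences: Q1 - k*IQR and Q3 + k*IQR."""
    # quantiles(n=4) returns the three cut points Q1, median, Q3.
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

def flag_outliers(values, k=1.5):
    """Return the values that fall outside the fences, in original order."""
    lower, upper = iqr_fences(values, k)
    return [v for v in values if v < lower or v > upper]

# Hypothetical daily scoop counts, invented for illustration.
scoops = [38, 40, 41, 42, 43, 44, 45, 46, 47, 0, 95]

fences = iqr_fences(scoops)
flagged = flag_outliers(scoops)
print("normal range:", fences)
print("flagged:", flagged)
```

With this sample, both the 0 and the 95 land outside the fences. The rule only says "far away"; it does not say which one is a mistake.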

Why care

One broken sensor reading can wreck an average; one ignored warning sign can miss a real problem. Fraud alerts, faulty-part detection, finding a new kind of star — all of them are outlier hunts. And every model trains better on data where the mistakes have been cleaned out but the real surprises have been kept in.
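To see how much damage one bad reading can do, here is a small Python sketch. The temperature readings and the 999.0 glitch value are invented for illustration.

```python
import statistics

# Hypothetical sensor readings in degrees Celsius, invented for illustration.
readings = [20.1, 19.8, 20.3, 20.0, 19.9]
with_glitch = readings + [999.0]  # one broken reading sneaks in

mean_clean = statistics.mean(readings)
mean_glitch = statistics.mean(with_glitch)
median_clean = statistics.median(readings)
median_glitch = statistics.median(with_glitch)

print("mean:  ", round(mean_clean, 2), "->", round(mean_glitch, 2))
print("median:", round(median_clean, 2), "->", round(median_glitch, 2))
```

The average jumps by more than a hundred degrees, while the median barely moves. That is why a single glitch is worth hunting down before you trust an average.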

The idea, intuitively

Here is a month of ice-cream sales. Most days look alike, so they form a comfortable band. Two days stick out. One is a broken measurement; the other is a genuine event. They look the same on the chart — both far away — but they deserve opposite treatment. Your job is to tell them apart.

Peek at the data first

Look before you judge. Here are a few days and a summary of the column — the same thing Spectra’s describe_data shows, including the normal range it would flag against.

Try it

Click a dot to inspect that day. Dots outside the shaded normal range are flagged as outliers. For each flagged day, decide: is it noise to remove, or signal to keep and study? Watch the average change as you clean — and notice when removing a point would actually hide the truth.

Noise vs. signal — the real lesson

The 0-scoop day was a recording error (the shop was shut). Removing it is honest cleaning — the average snaps back to the truth. The 95-scoop day was a real festival. It is just as far from normal, but erasing it would throw away the most important thing that happened all month. Far away does not mean wrong. Always ask why a point is strange before you delete it.
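The difference between honest cleaning and over-cleaning can be sketched in Python. The typical-day values below are invented for illustration; only the 0-scoop and 95-scoop days come from the sim.

```python
import statistics

# Hypothetical typical days, invented for illustration.
typical = [38, 40, 41, 42, 43, 44, 45, 46, 47]
closed_day = 0     # recording error: the shop was shut
festival_day = 95  # real event: the festival

all_days = typical + [closed_day, festival_day]

# Honest cleaning: drop only the value we know is a mistake.
cleaned = [v for v in all_days if v != closed_day]

# Over-cleaning: drop everything that looks unusual, including the real event.
over_cleaned = [v for v in cleaned if v != festival_day]

print("average, raw:         ", round(statistics.mean(all_days), 1))
print("average, cleaned:     ", round(statistics.mean(cleaned), 1))
print("average, over-cleaned:", round(statistics.mean(over_cleaned), 1))
print("best day, cleaned:     ", max(cleaned))
print("best day, over-cleaned:", max(over_cleaned))
```

After over-cleaning, the month's best day looks like an ordinary one, and the festival has vanished from the record. The code cannot know why a value is strange; that reason has to come from you.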

Where it shows up

Where it came from

The struggle over whether to discard a stubborn measurement is old: astronomers debated “rejecting” observations through the 1800s. John Tukey, who shaped modern exploratory data analysis in the 1970s, gave us the box plot and the 1.5×IQR rule this sim uses. As ever, the judgment (noise or signal?) still belongs to a thoughtful human.

Try it in code

In the Studio you can see the shape of a column and the values that stretch its tails:

data = load "weather_town"

describe_data data
plot_distribution data, x: "temperature", bins: 10

Check your understanding