Outlier hunt
What it is
An outlier is a value that sits far away from the rest — the point that does not belong. Spotting them is half the job of a data detective. But the real skill is the next question: is this odd value noise (a mistake or glitch we should remove) or signal (a real, surprising event we should study)? Get that call wrong and you either keep garbage or throw away gold.
Go deeper: a common, fair way to flag outliers is the IQR rule. Line the values up, find the middle half (between the 25% and 75% marks, called Q1 and Q3), and call its width the interquartile range (IQR = Q3 − Q1). Any value below Q1 − 1.5 × IQR or above Q3 + 1.5 × IQR is flagged. The rule resists extremes for the same reason the median does: the quartiles depend on the order of the values, not on how large or small the most extreme ones are.
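The IQR rule can be sketched in a few lines of Python. This is a minimal illustration, not how the sim or Spectra computes its range: the function names and the scoop counts are made up for the example, and the "inclusive" quartile method is just one common convention (other tools may place the quartiles slightly differently).

```python
from statistics import quantiles

def iqr_fences(values, k=1.5):
    """Return (low, high): the bounds of the normal range under the IQR rule."""
    # Quartile conventions vary between tools; "inclusive" is one common choice.
    q1, _median, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

def flag_outliers(values, k=1.5):
    """Return the values that fall outside the fences, in their original order."""
    low, high = iqr_fences(values, k)
    return [v for v in values if v < low or v > high]

# Made-up daily scoop counts for illustration (not the sim's actual data).
sales = [42, 38, 45, 40, 0, 44, 39, 41, 95, 43, 37, 46]

low, high = iqr_fences(sales)
print(f"normal range: {low} to {high}")
print("flagged:", flag_outliers(sales))
```

The rule only flags values. It says nothing about whether a flagged value is noise or signal.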
Why care
One broken sensor reading can wreck an average; one ignored warning sign can miss a real problem. Fraud alerts, faulty-part detection, finding a new kind of star: all of them are outlier hunts. And models generally train better on data where the mistakes have been cleaned out but the real surprises have been kept in.
The idea, intuitively
Here is a month of ice-cream sales. Most days look alike, so they form a comfortable band. Two days stick out. One is a broken measurement; the other is a genuine event. They look the same on the chart — both far away — but they deserve opposite treatment. Your job is to tell them apart.
Peek at the data first
Look before you judge. Here are a few days and a summary of the column, the same thing Spectra’s describe_data shows, including the normal range it would flag against.
Try it
Click a dot to inspect that day. Dots outside the shaded normal range are flagged as outliers. For each flagged day, decide: is it noise to remove, or signal to keep and study? Watch the average change as you clean — and notice when removing a point would actually hide the truth.
Noise vs. signal — the real lesson
The 0-scoop day was a recording error (the shop was shut). Removing it is honest cleaning — the average snaps back to the truth. The 95-scoop day was a real festival. It is just as far from normal, but erasing it would throw away the most important thing that happened all month. Far away does not mean wrong. Always ask why a point is strange before you delete it.
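Here is a small Python sketch of that cleaning step, assuming made-up scoop counts and a hypothetical `verdicts` table filled in by a person. The code cannot decide which flagged day is noise. It only applies the decision, dropping the recording error and keeping the festival.

```python
from statistics import mean

# Made-up daily scoop counts for illustration (not the sim's actual data).
# Day 5 is the 0-scoop day; day 9 is the 95-scoop day.
sales = [42, 38, 45, 40, 0, 44, 39, 41, 95, 43, 37, 46]

# A human verdict for each flagged day, reached by asking why it is strange.
# These labels are hypothetical; the code cannot work them out by itself.
verdicts = {5: "noise", 9: "signal"}

def clean(values, verdicts):
    """Drop only the days a person has judged to be noise (days count from 1)."""
    return [v for day, v in enumerate(values, start=1)
            if verdicts.get(day) != "noise"]

cleaned = clean(sales, verdicts)

print("average before cleaning:", round(mean(sales), 1))
print("average after removing the noise:", round(mean(cleaned), 1))
print("festival day still in the data:", 95 in cleaned)
```

Changing the verdict for day 9 to "noise" shows how much the average would shift if the festival were erased too.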
Where it shows up
- Catching fraud. A purchase wildly unlike your usual ones gets flagged — an outlier in spending that is worth a second look.
- Keeping machines healthy. A temperature spike far outside normal can warn that a part is about to fail.
- Discovery. Some of science’s biggest finds were “weird” data points that turned out to be real — signal nobody wanted to delete.
Where it came from
The struggle over whether to discard a stubborn measurement is old: astronomers debated “rejecting” observations through the 1800s. John Tukey, who shaped modern exploratory data analysis in the 1970s, gave us the box plot and the 1.5×IQR rule this sim uses. As ever, the judgment (noise or signal?) still belongs to a thoughtful human.
Try it in code
In the Studio you can see the shape of a column and the values that stretch its tails:
data = load "weather_town"
describe_data data
plot_distribution data, x: "temperature", bins: 10
Check your understanding
- Two days are equally far from normal. Why does one get removed and the other kept?
- What happened to the average after you removed the broken day — up or down? Why?
- Can you think of a real outlier in your own life that was signal, not a mistake?