Cleaning data
What it is
Cleaning is the unglamorous first job of every data project: fixing the survey before you trust the chart. Real tables arrive messy — the same thing spelled two ways, a value that can’t possibly be real, a cell left blank. Cleaning means spotting those problems and fixing each one so the numbers mean what they claim to.
Go deeper: the three chores here are the classics. Inconsistent labels (a typo splits one group into two, so counts are wrong). Impossible or outlier values (a stray 80 on a 0–10 scale — maybe a typo, maybe a sensor glitch — that drags the average far from typical). And missing cells, which you either drop or fill with a sensible stand-in like the median. Each decision changes what the model later learns.
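The three chores can be spotted mechanically before anything is fixed. The following is a minimal Python sketch, not part of the simulation itself: the rows, the `KNOWN_FRUITS` set, and the `find_problems` helper are all made up for illustration, and it assumes a blank cell is stored as `None`.

```python
# Hypothetical survey rows: (fruit, sweetness). None marks a blank cell.
rows = [
    ("apple", 7),
    ("banana", 8),
    ("aple", 6),       # typo: should be "apple"
    ("cherry", 80),    # impossible on a 0-10 scale
    ("banana", None),  # missing value
    ("cherry", 9),
]

KNOWN_FRUITS = {"apple", "banana", "cherry"}  # assumed list of valid labels
SCALE_MIN, SCALE_MAX = 0, 10                  # the survey's sweetness scale


def find_problems(rows):
    """Return (row_index, problem) pairs for the three classic issues."""
    problems = []
    for i, (fruit, sweetness) in enumerate(rows):
        if fruit not in KNOWN_FRUITS:
            problems.append((i, "unknown label"))
        if sweetness is None:
            problems.append((i, "missing value"))
        elif not SCALE_MIN <= sweetness <= SCALE_MAX:
            problems.append((i, "out of range"))
    return problems


problems = find_problems(rows)
print(problems)
```

Detection is kept separate from fixing on purpose: each flagged cell still needs a human decision about what the right value is.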
Why care
A model is only ever as good as the data it learns from: garbage in, garbage out. A single typo can invent a category that doesn’t exist; a single impossible value can make an average lie; a few blanks can crash a calculation or quietly skew it. On real machine-learning projects, much of the time is often spent here, before any clever algorithm runs.
The idea, intuitively
A little fruit survey just came in: which fruit each kid likes and how sweet they find it (0–10). It’s messy. One row says “aple,” so the chart sprouts a fake fourth fruit. One sweetness reads 80, which is impossible — and it shoves the average up past 10. One cell is blank. Click each red cell to fix it and watch the bar chart and the average snap back to something you can believe.
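To see the damage in numbers, here is a small Python sketch. The ten rows are invented for illustration (the simulation's actual values are not listed on this page), but they carry the same three planted problems.

```python
from collections import Counter
from statistics import mean

# Made-up stand-in for the messy survey: (fruit, sweetness on a 0-10 scale).
survey = [
    ("apple", 7), ("banana", 8), ("cherry", 6), ("aple", 7), ("banana", 9),
    ("apple", 80), ("cherry", 5), ("banana", None), ("apple", 6), ("cherry", 7),
]

# The typo "aple" is counted as its own fruit, so the chart gets a fourth bar.
fruit_counts = Counter(fruit for fruit, _ in survey)

# The blank has to be skipped: mean() cannot average a None.
scores = [s for _, s in survey if s is not None]
average = mean(scores)

print(fruit_counts)
print(average)
```

With these rows the counter holds four fruits instead of three, and the average comes out at 15, a value the 0–10 scale cannot even reach.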
Peek at the data first
Ten survey rows, each a fruit and a sweetness score, with three planted problems: the same kind of summary Spectra’s describe_data would give you before you tidy anything.
Try it
Click each red cell in the table to fix it: the typo “aple” merges into “apple,” the impossible 80 snaps back onto the scale, and the blank fills with the median of the rest. Watch the bar chart lose its fake fruit and the average sweetness drop from an impossible number to a believable one. Hit Make it messy again to start over.
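The three clicks correspond to three small fixes, sketched here in Python on invented rows. Two assumptions are worth flagging: the 80 is treated as a mistyped 8 (clamping it to 10 would be another defensible choice), and the blank is filled with the median of the other nine scores after the 80 has been corrected.

```python
from collections import Counter
from statistics import mean, median

# Made-up messy rows: (fruit, sweetness). None marks the blank cell.
messy = [
    ("apple", 7), ("banana", 8), ("cherry", 6), ("aple", 7), ("banana", 9),
    ("apple", 80), ("cherry", 5), ("banana", None), ("apple", 6), ("cherry", 7),
]

LABEL_FIXES = {"aple": "apple"}  # typo -> intended label
VALUE_FIXES = {80: 8}            # assumption: 80 was a mistyped 8

# Fixes 1 and 2: merge the typo and correct the impossible value.
cleaned = [
    (LABEL_FIXES.get(fruit, fruit), VALUE_FIXES.get(score, score))
    for fruit, score in messy
]

# Fix 3: fill the blank with the median of the scores that are present.
fill_value = median(score for _, score in cleaned if score is not None)
cleaned = [
    (fruit, fill_value if score is None else score)
    for fruit, score in cleaned
]

fruit_counts = Counter(fruit for fruit, _ in cleaned)
average = mean(score for _, score in cleaned)

print(fruit_counts)
print(fill_value, average)
```

On these rows the fake fourth fruit disappears, the blank is filled with 7, and the average settles at 7.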
Where it shows up
- Every dataset, ever. Spreadsheets, sensor logs, survey exports — all arrive needing a tidy before analysis.
- Deduping & standardising. “NYC,” “New York,” “new york” are one place; cleaning makes them count as one.
- Missing-value strategy. Whether to drop a row or fill it (mean, median, or a model’s guess) is a real modelling choice.
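Two of the items above can be sketched in a few lines of Python. The alias table, the `standardise` helper, and the scores are hypothetical; real projects usually need a much longer alias list.

```python
from collections import Counter
from statistics import mean, median

# Hypothetical alias table: lowercased spelling -> one canonical name.
ALIASES = {"nyc": "New York", "new york": "New York"}


def standardise(label):
    """Trim whitespace, ignore case, and map known aliases to one name."""
    trimmed = label.strip()
    return ALIASES.get(trimmed.lower(), trimmed)


cities = ["NYC", "New York", "new york", "Boston"]
city_counts = Counter(standardise(city) for city in cities)

# Missing-value strategy: the fill value depends on which summary you pick.
scores = [7, 8, None, 6, 80, 9]  # made-up scores with a blank and an outlier
present = [s for s in scores if s is not None]
mean_fill = mean(present)      # pulled far up by the outlier
median_fill = median(present)  # stays near the typical score

print(city_counts)
print(mean_fill, median_fill)
```

Here the three spellings count as one city, and the mean would fill the blank with 22 while the median fills it with 8, which is why the choice of stand-in matters when an outlier is still in the column.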
Where it came from
The phrase “garbage in, garbage out” dates to the early days of computing. A printed use appears in a 1957 newspaper, and an IBM instructor, George Fuechsel, is often credited with popularising it in the 1960s. The idea that data quality limits any result is older still, and remains the first rule of every data pipeline.
Try it in code
In the Studio, you peek at a dataset and chart a column the same way — cleaning is the step that happens before the model ever sees the numbers:
data = load "weather_town"
describe_data data
plot_distribution data, x: "temperature", bins: 8
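For readers working outside the Studio, a rough plain-Python analogue of "peek, then chart a column" might look like the sketch below. It does not reproduce what Spectra's commands actually print or draw; the temperatures and the `summarise` and `histogram` helpers are invented for illustration.

```python
from statistics import mean, median

# Made-up temperature readings standing in for a dataset column.
temperatures = [12.0, 14.5, 15.0, 15.5, 17.0, 18.5, 19.0, 21.0, 22.5, 28.0]


def summarise(values):
    """A quick peek at one column: size, range, and typical value."""
    return {
        "count": len(values),
        "min": min(values),
        "max": max(values),
        "mean": mean(values),
        "median": median(values),
    }


def histogram(values, bins):
    """Count how many values fall into each of `bins` equal-width bins."""
    low, high = min(values), max(values)
    width = (high - low) / bins
    if width == 0:  # every value identical: put them all in the first bin
        return [len(values)] + [0] * (bins - 1)
    counts = [0] * bins
    for value in values:
        # The maximum value would land one past the end, so clamp it.
        index = min(int((value - low) / width), bins - 1)
        counts[index] += 1
    return counts


summary = summarise(temperatures)
bin_counts = histogram(temperatures, bins=8)

print(summary)
print(bin_counts)
```

A summary like this is where out-of-range minimums or maximums first show up, which is why it is worth running before any cleaning decisions are made.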
Check your understanding
- Why does a single typo like “aple” make a chart wrong, not just ugly?
- How can one impossible value push an average to a number the scale can’t even reach?
- Why fill a blank cell with the median rather than the average, when a giant outlier is present?