
Mean, median & spread

What it is

Before a computer can learn from data, we need a few honest ways to summarize it. Three little numbers do most of the work: the mean (the average), the median (the middle value), and the spread (how far apart the values are). Together they tell you, at a glance, “what’s typical, and how much does it wobble?”

Go deeper: the mean and median are both ways to find the center of the data (statisticians call this central tendency). The mean is the balance point: if the number line were a see-saw with a weight at every dot, the mean is where it balances. The median lines the values up in order and points to the one in the middle (with an even number of values, it is halfway between the two middle ones). They usually sit close together, until a few extreme values show up.
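To make the two centers concrete, here is a small Python sketch. The walk times are made up for illustration, and it uses Python's built-in statistics module rather than the Studio, so treat it as one way to check the idea, not the sim's own code.

```python
from statistics import mean, median

# Hypothetical walk-to-school times in minutes (made up for illustration).
times = [4, 6, 7, 9, 10, 12, 22]

center_mean = mean(times)      # add everything up, divide by how many
center_median = median(times)  # middle value once the list is sorted

# Balance-point check: distances below the mean cancel distances above it.
deviations = [t - center_mean for t in times]
balance = sum(deviations)

# With an even number of values, the median is halfway between the two middle ones.
even_median = median([4, 6, 7, 9])

print(center_mean, center_median, balance, even_median)
```

The deviations adding up to zero is what "balance point" means: the pull of the values on one side of the mean exactly matches the pull on the other side.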

Why care

Every chart, every model, every “average score” you have ever seen rests on these ideas. Machine-learning models lean on the mean to find a center and on the spread to know what counts as “normal.” Knowing which summary to trust — and when the average is lying to you because of a weird value — is one of the most useful habits in all of data science.

The idea, intuitively

Imagine the times it takes a few kids to walk to school, dropped as dots on a number line. The median is just the kid standing in the middle of the line-up. The mean is the balance point — move one kid much farther away and the balance point slides toward them, even though the kid in the middle of the line-up hasn’t changed. The spread is how stretched-out the dots are: bunched tight, or scattered wide.
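Spread can be put into numbers too. The sketch below, in Python with made-up values, compares two groups that share the same center but are stretched very differently. It uses two common measures of spread, the range (largest minus smallest) and the standard deviation; the page itself does not say which measure the sim draws, so that choice is an assumption here.

```python
from statistics import mean, pstdev

# Two hypothetical groups of walk times (minutes), made up for illustration.
tight = [9, 10, 10, 10, 11]   # bunched close together
wide = [2, 6, 10, 14, 18]     # scattered far apart

for name, values in (("tight", tight), ("wide", wide)):
    value_range = max(values) - min(values)  # distance between the extremes
    std_dev = pstdev(values)                 # typical distance from the mean
    print(f"{name}: mean={mean(values)}, range={value_range}, std dev={std_dev:.2f}")
```

Both groups have the same mean, so the center alone cannot tell them apart; only the spread shows that one is bunched tight and the other scattered wide.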

Peek at the data first

Always look before you summarize. Here are the nine starting values and a quick summary of the column — the same thing Spectra’s describe_data shows you.

Try it

Drag any dot along the line to change that value. The blue mean triangle and the green median line move as you do. Use Add a point or Remove a point to change how many you have, and press Drop a far-away point to see what one outlier does to each summary.

So which one do I trust?

When the data is balanced (values spread about evenly on both sides of the center, with nothing extreme), the mean and median land close together, so use either. When a few values are extreme (a millionaire walks into a room of ordinary incomes), the mean gets yanked toward them and can mislead, while the median barely moves and tells the more honest “typical” story. That is exactly the puzzle the next sim, the Outlier hunt, is about.
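The millionaire example is easy to try in a few lines of Python. The incomes are invented for illustration, and the sketch uses only Python's statistics module.

```python
from statistics import mean, median

# Hypothetical yearly incomes in a room (made up for illustration).
incomes = [30_000, 35_000, 40_000, 45_000, 50_000]
mean_before = mean(incomes)
median_before = median(incomes)

# A millionaire walks in.
with_outlier = incomes + [1_000_000]
mean_after = mean(with_outlier)
median_after = median(with_outlier)

print("before:", mean_before, median_before)
print("after: ", mean_after, median_after)
```

One extra value multiplies the mean by five, while the median shifts only from 40,000 to 42,500. Nobody in the room earns anything close to the new mean, which is why the median is the safer "typical" value here.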

A word on sampling

Usually we can’t measure everyone, so we measure a sample and hope it represents the whole group. The more we sample, the closer our sample mean tends to land to the true average, the same “more data settles things down” effect you can watch in the Randomness & probability sim. A good sample is fair: if it systematically leaves some people out, even a correctly computed average can be wrong about the whole group.
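Here is a rough Python sketch of that settling-down effect. The "whole group" is simulated with made-up numbers (a center of 20 minutes and a fixed random seed are both assumptions chosen for the demo), and each sample is drawn fairly, meaning every member has the same chance of being picked.

```python
import random
from statistics import mean

rng = random.Random(0)  # fixed seed so the run is repeatable

# Hypothetical whole group: 20,000 made-up walk times, in minutes.
population = [rng.gauss(20, 5) for _ in range(20_000)]
true_mean = mean(population)

# Draw fair random samples of growing size and record how far
# each sample mean lands from the true mean.
errors = {}
for n in (10, 100, 1_000, 10_000):
    sample = rng.sample(population, n)
    errors[n] = abs(mean(sample) - true_mean)
    print(f"n={n:>6}  sample mean is off by {errors[n]:.3f}")
```

The gap usually shrinks as the sample grows, though chance means a particular small sample can occasionally land closer than a bigger one. Sampling only from one part of the group (say, only the kids who live nearby) would break the fairness assumption, and no sample size would fix that.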

Where it shows up

Where it came from

The idea of averaging measurements to tame error grew through the 1600s–1800s in astronomy and navigation; the median was used by Galileo (1632) and named and studied later by Francis Galton in the 1880s. A clear, careful way to measure spread — the standard deviation — was named by Karl Pearson in 1893. Like most foundational ideas, the credit is shared across many hands and many years.

Try it in code

In the Studio, describe_data reports the same summaries for any column, and plot_distribution shows how that column’s values are distributed:

data = load "lemonade_stand"

describe_data data
plot_distribution data, x: "cups", bins: 8


Check your understanding