
Dimensionality, simply

What it is

Dimensionality reduction squashes data with many numbers per item down to just a few, while losing as little information as possible. The classic tool is PCA (principal component analysis): it finds the direction along which the data spreads out the most, and lets you keep each item’s position along that direction instead of the original columns.

Go deeper: when two columns rise and fall together, they largely repeat each other, so the cloud is a thin, tilted streak rather than a round blob. PCA rotates to a new axis pointing along that streak (the first principal component), then adds a second axis at right angles, and so on. Each point’s position on an axis is just its projection: its shadow on the line, measured from the center of the cloud. Keep only the first axis and you turn two numbers into one. When the streak is thin, that one number preserves almost all of the spread (the variance), which is where the information lives.
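
For just two columns, the rotation can be worked out directly. The sketch below is a minimal illustration in plain Python, not the Studio’s own code: the toy points and the helper names (covariance_2d, pc1_angle, project, spread_kept) are made up for this example. It relies on the closed-form angle that only works for two columns; with more columns you would normally reach for a linear-algebra library.

```python
import math

# Hypothetical toy data: two columns that rise and fall together.
points = [(1.0, 1.2), (2.0, 1.9), (3.0, 3.2), (4.0, 3.8), (5.0, 5.1)]


def mean(values):
    return sum(values) / len(values)


def covariance_2d(pts):
    """Return (var_x, var_y, cov_xy), dividing by n (population formula)."""
    mx = mean([x for x, _ in pts])
    my = mean([y for _, y in pts])
    n = len(pts)
    var_x = sum((x - mx) ** 2 for x, _ in pts) / n
    var_y = sum((y - my) ** 2 for _, y in pts) / n
    cov_xy = sum((x - mx) * (y - my) for x, y in pts) / n
    return var_x, var_y, cov_xy


def pc1_angle(pts):
    """Angle in radians of the direction of greatest spread (two columns only)."""
    var_x, var_y, cov_xy = covariance_2d(pts)
    return 0.5 * math.atan2(2 * cov_xy, var_x - var_y)


def project(pts, angle):
    """Shadow of each point on the line at `angle`, measured from the cloud's center."""
    mx = mean([x for x, _ in pts])
    my = mean([y for _, y in pts])
    ux, uy = math.cos(angle), math.sin(angle)
    return [(x - mx) * ux + (y - my) * uy for x, y in pts]


def spread_kept(pts, angle):
    """Share of the total variance that survives on the line at `angle`."""
    var_x, var_y, _ = covariance_2d(pts)
    shadows = project(pts, angle)  # already centered, so their mean is zero
    var_line = sum(s ** 2 for s in shadows) / len(shadows)
    return var_line / (var_x + var_y)


angle = pc1_angle(points)
print("PC1 angle (degrees):", round(math.degrees(angle), 1))
print("one number per point:", [round(s, 2) for s in project(points, angle)])
print("share of spread kept:", round(spread_kept(points, angle), 3))
```

The shares kept by the first axis and by the axis at right angles to it always add up to 1, which is why a thin streak leaves so little behind on the second axis.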

Why care

Real data often has dozens or hundreds of columns — too many to plot or even to learn from well. Squashing to two or three lets you see the data, speeds up models, and strips out columns that just echo each other. It powers eigenfaces, gene-expression maps, and the “2-D pictures” of high-dimensional word and image embeddings.

The idea, intuitively

Here are two groups in a tilted cloud. Rotate the line and watch each point drop a shadow onto it: that shadow is the single number you’d keep. The bar shows how much of the cloud’s spread that one direction preserves, and the strip below shows the squashed result. Line up with the streak and you keep almost everything — and the two colors stay neatly apart. Turn across it and they collapse into mush.

Peek at the data first

Two numeric columns that move together, plus a group label so we can watch whether the squash keeps the groups apart. This is the same summary describe_data would give.

Try it

Drag Line angle to rotate the line through the cloud. Each point’s dashed shadow lands on the line, and the strip below shows those shadows as one number each. Watch Spread kept climb as you align with the streak, then press Snap to best (PC1) to jump to the perfect direction — and notice the two colors stay separated in the strip.
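
The same experiment can be sketched in a few lines of plain Python. Everything here is an assumption made for illustration, not the simulation’s real data or code: the two small groups are invented, and the helper names (shadows, spread_kept, groups_separate) are hypothetical. The sketch tries every whole-degree angle, picks the one that keeps the most spread, and compares it with the line at right angles.

```python
import math

# Hypothetical data: two groups lying along one tilted streak.
group_a = [(1.0, 1.3), (1.5, 1.2), (2.0, 2.4), (2.5, 2.1)]
group_b = [(5.0, 5.3), (5.5, 5.1), (6.0, 6.4), (6.5, 6.0)]
all_points = group_a + group_b


def variance(values):
    m = sum(values) / len(values)
    return sum((v - m) ** 2 for v in values) / len(values)


# Center of the whole cloud and its total spread across both columns.
center = (
    sum(x for x, _ in all_points) / len(all_points),
    sum(y for _, y in all_points) / len(all_points),
)
total = variance([x for x, _ in all_points]) + variance([y for _, y in all_points])


def shadows(points, angle_deg):
    """Project points onto the line at angle_deg: one number per point."""
    a = math.radians(angle_deg)
    ux, uy = math.cos(a), math.sin(a)
    return [(x - center[0]) * ux + (y - center[1]) * uy for x, y in points]


def spread_kept(angle_deg):
    """Share of the cloud's total variance that lands on the line."""
    return variance(shadows(all_points, angle_deg)) / total


def groups_separate(angle_deg):
    """True if the two groups' shadows do not overlap on the line."""
    a = shadows(group_a, angle_deg)
    b = shadows(group_b, angle_deg)
    return max(a) < min(b) or max(b) < min(a)


# "Snap to best": the whole-degree angle that keeps the most spread.
best = max(range(0, 180), key=spread_kept)
across = (best + 90) % 180

for angle in (best, across):
    print(angle, round(spread_kept(angle), 3), groups_separate(angle))
```

A brute-force sweep over angles mirrors dragging the slider; real PCA finds the best direction in one step rather than by trying each angle.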

Where it shows up

Seeing embeddings. 2-D maps of word or image vectors are often made with PCA or with related methods such as t-SNE and UMAP.
Faces & genes. Eigenfaces and gene-expression studies compress thousands of numbers to a few.
Speed & noise. Models train faster on fewer, stronger features, and columns that just echo each other get merged instead of counted twice.

Where it came from

PCA was invented by Karl Pearson in 1901 as “lines and planes of closest fit,” and developed independently in the 1930s by Harold Hotelling, who gave it the name. The same mathematics (eigenvectors of the covariance matrix) underlies much of modern data visualisation.

Try it in code

In the Studio, plot_distribution lets you eyeball how a single column spreads — the same “where is the spread?” question PCA answers for whole clouds:

data = load "flowers"
describe_data data
plot_distribution data, x: "petal_length", bins: 8


Check your understanding

What does it mean to “project” a point onto a line?
Why does PCA pick the direction of greatest spread to keep?
Why do the two groups survive the squash onto PC1 but not onto the perpendicular line?