
Hierarchical grouping

What it is

Hierarchical clustering builds groups bottom-up. Start with every point as its own tiny group, then repeatedly fuse the two closest groups — over and over — until everything is one big group. The record of who-merged-with-whom-and-when is a tree called a dendrogram.

Go deeper: the bar heights in the tree are the distances at which groups joined, so short bars mean tight, similar groups and tall bars mean a big leap to join. The clever part: you never commit to a number of groups up front. A single horizontal cut across the tree reads off a grouping — cut low for many small groups, high for a few big ones. “Distance between groups” needs a rule (a linkage); here we use average linkage, the mean distance between all pairs of points across the two groups.
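The merge loop described above can be sketched in a few lines of plain Python. This is a minimal, brute-force illustration rather than how the simulation or the Studio is implemented; the names `average_linkage` and `build_tree` and the six sample points are made up for this example.

```python
from itertools import combinations
from math import dist

def average_linkage(points, a, b):
    """Mean distance over all pairs of points taken across groups a and b."""
    return sum(dist(points[i], points[j]) for i in a for j in b) / (len(a) * len(b))

def build_tree(points):
    """Return the merge history as (group_a, group_b, height) tuples."""
    groups = [(i,) for i in range(len(points))]  # every point starts as its own group
    merges = []
    while len(groups) > 1:
        # Pick the two groups with the smallest average-linkage distance.
        a, b = min(combinations(groups, 2),
                   key=lambda pair: average_linkage(points, *pair))
        merges.append((a, b, average_linkage(points, a, b)))
        groups.remove(a)
        groups.remove(b)
        groups.append(tuple(sorted(a + b)))  # the fused group replaces both
    return merges

# Hypothetical data: two tight pairs on the left and one far-off pair on the right.
points = [(0.0, 0.0), (0.0, 1.0), (3.0, 0.0), (3.0, 1.5), (10.0, 0.0), (10.0, 2.0)]
for a, b, height in build_tree(points):
    print(a, "+", b, "at height", round(height, 2))
```

Each printed height is one bar in the dendrogram. The loop recomputes every group-to-group distance on each pass, which is fine for six points but slow for large data; libraries use faster update rules.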

Why care

Unlike k-means, you don’t have to guess k in advance — the tree shows the whole family of groupings at once, and the big gaps suggest where the natural splits are. It’s how biologists draw trees of life from gene data, how documents get organised into topics and sub-topics, and how analysts explore structure before settling on any one answer.
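One simple way to act on "the big gaps suggest where the natural splits are" is to place the cut in the middle of the largest gap between successive merge heights. The sketch below is a heuristic under stated assumptions, not a rule the simulation necessarily follows: `suggest_cut` and `groups_at` are hypothetical helper names, the heights are invented, and the count assumes a complete tree (one merge fewer than the number of points).

```python
def suggest_cut(heights):
    """Return a cut height halfway across the largest gap between merges."""
    ordered = sorted(heights)
    gaps = [(hi - lo, lo, hi) for lo, hi in zip(ordered, ordered[1:])]
    _, lo, hi = max(gaps)  # widest gap wins
    return (lo + hi) / 2

def groups_at(heights, cut):
    """Count groups at a cut: each merge at or below it removes one group."""
    n_points = len(heights) + 1  # a full tree over n points has n - 1 merges
    return n_points - sum(h <= cut for h in heights)

# Hypothetical merge heights for six points.
heights = [1.0, 1.5, 2.0, 3.0, 8.5]
cut = suggest_cut(heights)
print(cut, groups_at(heights, cut))  # → 5.75 2
```

The widest gap here lies between 3.0 and 8.5, so the suggested cut leaves two groups. Treat the result as a starting point for exploration, not a verdict.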

The idea, intuitively

Six points are plotted, each described by two numeric columns. The closest pair joins first (a short bar), then the next closest, and so on up the tree, until the far-off pair on the right finally joins at a tall bar because it sat farthest from everything else. Slide the cut line up and groups merge; slide it down and they split. One slider walks through every grouping the tree can offer.

Peek at the data first

Just six points, each with a label and two numeric columns but no group column: the same ungrouped shape describe_data would summarise before any clustering.

Try it

Slide Cut height. The bottom tree shows a red cut line; points whose merges all fall below the line stay together as one group, and the points up top recolor to match. Cut through the tall gap for two clean groups; cut low to see every pair on its own. Watch the group count and members update.
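What the slider does can be sketched as follows: apply only the merges at or below the cut height, then read off which points ended up together. This is an illustrative sketch, not the simulation's actual code; `cut_tree` is a hypothetical name, and the merge history is invented, with each merge recorded as one point from each side plus the height.

```python
def cut_tree(n_points, merges, cut_height):
    """Group points by applying only the merges at or below cut_height."""
    parent = list(range(n_points))  # each point starts as its own group

    def find(i):
        # Follow parent links to the group's representative point.
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # shorten the chain as we go
            i = parent[i]
        return i

    for a, b, height in merges:
        if height <= cut_height:
            parent[find(a)] = find(b)  # fuse the two groups

    groups = {}
    for i in range(n_points):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())

# Hypothetical merge history: (point in one group, point in the other, height).
merges = [(0, 1, 1.0), (2, 3, 1.5), (4, 5, 2.0), (0, 2, 3.0), (0, 4, 8.5)]
for cut in (0.5, 1.75, 5.0, 9.0):
    print(cut, cut_tree(6, merges, cut))
```

A cut at 5.0 passes through the tall gap and gives two groups; a cut at 0.5 sits below every merge, so all six points stand alone.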

Where it shows up

Where it came from

Agglomerative methods and the dendrogram grew out of mid-20th-century numerical taxonomy — Robert Sokal and Peter Sneath’s 1963 Principles of Numerical Taxonomy is a landmark — with linkage rules like Joe Ward’s (1963) still in daily use across biology and data science.

Try it in code

In the Studio you can cluster the same way and ask for a chosen number of groups — the flat reading a dendrogram cut would give you:

data = load "fruits"
model = make_model "clusterer"
train_model model, on: data, using: ["sweetness", "size"], groups: 2
show_model model

