Hierarchical grouping
What it is
Hierarchical clustering builds groups bottom-up. Start with every point as its own tiny group, then repeatedly fuse the two closest groups — over and over — until everything is one big group. The record of who-merged-with-whom-and-when is a tree called a dendrogram.
Go deeper: the bar heights in the tree are the distances at which groups joined, so short bars mean tight, similar groups and tall bars mean a big leap to join. The clever part: you never commit to a number of groups up front. A single horizontal cut across the tree reads off a grouping — cut low for many small groups, high for a few big ones. “Distance between groups” needs a rule (a linkage); here we use average linkage, the mean distance between all pairs of points across the two groups.
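To make the merge loop concrete, here is a minimal Python sketch of bottom-up grouping with average linkage. It is an illustration, not the simulation's own code: the four points and the names average_linkage and agglomerate are made up for this example. It is also written for readability rather than speed.

```python
from itertools import combinations
from math import dist  # Euclidean distance, Python 3.8+


def average_linkage(a, b, points):
    """Mean distance over every pair made of one point from a and one from b."""
    total = sum(dist(points[i], points[j]) for i in a for j in b)
    return total / (len(a) * len(b))


def agglomerate(points):
    """Return the merge record as (group_a, group_b, height), in merge order."""
    groups = [(i,) for i in range(len(points))]  # every point starts alone
    merges = []
    while len(groups) > 1:
        # Pick the two groups that are closest under average linkage.
        a, b = min(
            combinations(groups, 2),
            key=lambda pair: average_linkage(pair[0], pair[1], points),
        )
        height = average_linkage(a, b, points)  # the bar height in the tree
        merges.append((a, b, height))
        groups.remove(a)
        groups.remove(b)
        groups.append(a + b)  # the fused group replaces its two parts
    return merges


# Hypothetical points (assumed for illustration, not the simulation's data).
points = [(0.0, 0.0), (0.0, 1.0), (4.0, 0.0), (4.0, 1.5)]
for a, b, height in agglomerate(points):
    print(a, "+", b, "at height", round(height, 3))
```

The first two merges join the near neighbours at short heights, and the last merge joins the two pairs at a much taller bar, which is the "big leap" the tree makes visible.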
Why care
Unlike k-means, you don’t have to guess k in advance — the tree shows the whole family of groupings at once, and the big gaps suggest where the natural splits are. It’s how biologists draw trees of life from gene data, how documents get organised into topics and sub-topics, and how analysts explore structure before settling on any one answer.
The idea, intuitively
Six points are each described by two numeric columns. The closest pair joins first (a short bar), then the next closest, and so on up the tree, until the far-off pair on the right finally joins at a tall bar: those two were the odd ones out. Slide the cut line up and groups merge; slide it down and they split. One slider walks through every grouping the tree can offer.
Peek at the data first
Just six labelled points with two numeric columns and no group column, the same unlabelled shape that describe_data would summarise before any clustering.
Try it
Slide Cut height. The bottom tree shows a red cut line; points that are joined below the line share a group, and the points up top recolor to match. Cut through the tall gap for two clean groups; cut low to see every pair on its own. Watch the group count and members update.
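What the slider does can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the merge record below is invented (six points, with the last pair joining at a tall bar), the function name cut is hypothetical, and the sketch relies on merge heights never decreasing, which holds for average linkage.

```python
def cut(merges, n_points, height):
    """Groups left when only merges at or below `height` are kept.

    `merges` is a list of (group_a, group_b, merge_height) in merge order,
    with heights that never decrease.
    """
    group_of = {i: (i,) for i in range(n_points)}  # every point alone at first
    for a, b, merge_height in merges:
        if merge_height <= height:
            merged = tuple(sorted(a + b))
            for i in merged:
                group_of[i] = merged  # every member now points at the fused group
    return sorted(set(group_of.values()))


# Hypothetical merge record for six points (assumed for illustration).
merges = [
    ((0,), (1,), 1.0),
    ((2,), (3,), 1.2),
    ((0, 1), (2, 3), 2.5),
    ((4,), (5,), 3.0),
    ((0, 1, 2, 3), (4, 5), 7.0),  # the tall bar: the far-off pair joins last
]

print(cut(merges, 6, 0.5))  # below every bar: six singletons
print(cut(merges, 6, 2.0))  # two pairs formed, two points still alone
print(cut(merges, 6, 5.0))  # through the tall gap: two groups
```

Moving the height from 0.5 to 5.0 walks from six groups down to two without re-running any clustering; only the reading of the tree changes.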
Where it shows up
- Trees of life. Biologists build phylogenetic trees by merging the most similar species first.
- Topics & sub-topics. Documents organise into nested themes without choosing a count first.
- Exploration. A quick dendrogram reveals structure before committing to a flat k.
Where it came from
Agglomerative methods and the dendrogram grew out of mid-20th-century numerical taxonomy — Robert Sokal and Peter Sneath’s 1963 Principles of Numerical Taxonomy is a landmark — with linkage rules like Joe Ward’s (1963) still in daily use across biology and data science.
Try it in code
In the Studio you can cluster the same way and ask for a chosen number of groups — the flat reading a dendrogram cut would give you:
data = load "fruits"
model = make_model "clusterer"
train_model model, on: data, using: ["sweetness", "size"], groups: 2
show_model model
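Asking for a fixed number of groups amounts to replaying the merges in order and stopping once that many groups remain. The Python sketch below shows the idea; it is not how the Studio's clusterer is implemented, and the merge record and the name flat_groups are hypothetical.

```python
def flat_groups(merges, n_points, k):
    """Replay merges in order and stop once only k groups remain."""
    groups = {(i,) for i in range(n_points)}  # start with every point alone
    for a, b, _height in merges:
        if len(groups) <= k:
            break
        groups -= {a, b}   # the two parts disappear...
        groups.add(a + b)  # ...and their union takes their place
    return sorted(groups)


# Hypothetical merge record for six points (assumed for illustration).
merges = [
    ((0,), (1,), 1.0),
    ((2,), (3,), 1.2),
    ((0, 1), (2, 3), 2.5),
    ((4,), (5,), 3.0),
    ((0, 1, 2, 3), (4, 5), 7.0),
]

print(flat_groups(merges, 6, 2))
```

With k set to 2, the replay stops just before the final merge, which matches cutting the tree anywhere inside its tallest gap.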
Check your understanding
- What do the bar heights in a dendrogram mean?
- How does one cut line turn the tree into a specific set of groups?
- Why is hierarchical clustering handy when you don’t know how many groups to expect?