Choosing k

What it is

k-means finds groups in data — but you have to tell it how many groups to look for. That number is k. Pick too few and real groups get smushed together; pick too many and you split tidy groups into meaningless slivers. Choosing k is the art of finding the number that actually fits the data.

Go deeper: the usual measure is inertia (the “total spread” shown on this page), which is the sum of squared distances from every point to its group’s center. Adding groups can always push inertia lower (with one group per point it hits zero), so you can’t just minimize it. Instead you plot inertia against k and look for the elbow: the spot where the curve stops plunging and flattens out. Before the elbow, each new group fixes a real split; after it, you’re only shaving hairs.
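
To make the inertia idea concrete, here is a minimal Python sketch. It is not how the Studio computes anything: the three synthetic blobs, the hand-rolled k-means loop, and the farthest-first starting rule (chosen only to keep the run stable) are all assumptions, and names such as three_blobs and inertia_for_k are made up for illustration.

```python
import random

def sq_dist(a, b):
    """Squared straight-line distance between two 2-D points."""
    return (a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2

def centroid(group):
    """Mean position of a non-empty list of 2-D points."""
    n = len(group)
    return (sum(p[0] for p in group) / n, sum(p[1] for p in group) / n)

def farthest_first(points, k, rng):
    """Assumed starting rule: one random point, then repeatedly the point
    farthest from every center chosen so far."""
    centers = [rng.choice(points)]
    while len(centers) < k:
        centers.append(max(points, key=lambda p: min(sq_dist(p, c) for c in centers)))
    return centers

def inertia_for_k(points, k, seed=0, max_iters=100):
    """Run a plain k-means loop and return the inertia it ends with."""
    rng = random.Random(seed)
    centers = farthest_first(points, k, rng)
    for _ in range(max_iters):
        # Assignment step: each point joins its nearest center.
        groups = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: sq_dist(p, centers[i]))
            groups[nearest].append(p)
        # Update step: each center moves to the mean of its group
        # (an empty group keeps its old center).
        new_centers = [centroid(g) if g else centers[i] for i, g in enumerate(groups)]
        if new_centers == centers:
            break
        centers = new_centers
    # Inertia: squared distance from every point to its nearest center, summed.
    return sum(min(sq_dist(p, c) for c in centers) for p in points)

def three_blobs(seed=1, per_blob=30):
    """Hypothetical data: three well-separated blobs of 2-D points."""
    rng = random.Random(seed)
    blob_centers = [(0.0, 0.0), (8.0, 0.0), (4.0, 7.0)]
    return [(rng.gauss(cx, 0.5), rng.gauss(cy, 0.5))
            for cx, cy in blob_centers for _ in range(per_blob)]

points = three_blobs()
curve = {k: inertia_for_k(points, k) for k in range(1, 7)}
for k, value in curve.items():
    print(k, round(value, 1))
```

Printing the curve should show large drops up to k = 3 and only small changes after it, which is the shape the elbow rule relies on. Asking for as many groups as there are points, for example inertia_for_k(points[:5], 5), gives zero, which is why the smallest inertia alone cannot pick k.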

Why care

Almost every clustering tool — customer segments, photo grouping, topic discovery — needs a k. Choose it badly and the “groups” are an illusion: one giant blob, or hundreds of near-duplicates. The elbow gives you a principled, visual way to defend the number you picked instead of guessing.

The idea, intuitively

Here are the same unlabelled dots from k-means — three obvious blobs. Slide k from 1 upward and watch the colors and the “total spread” below. From 1 to 3 the spread collapses: each new group snaps onto a real blob. Past 3 it barely moves — you’re just cutting good groups in half. The sharp bend at 3 is the elbow, and it matches the three blobs your eye already sees.
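
The “sharp bend” can also be found with arithmetic rather than by eye. The sketch below is one simple way to do it, assuming the spread has been measured for consecutive values of k: it picks the k where the drop just before it most exceeds the drop just after it. The function name find_elbow and the spread values are made up for illustration; they are not readings from the simulation.

```python
def find_elbow(spread_by_k):
    """Return the k with the sharpest bend in the spread curve, or None
    if there are fewer than three values of k to compare."""
    ks = sorted(spread_by_k)
    best_k, best_bend = None, float("-inf")
    # Look at each k that has a neighbor on both sides.
    for prev_k, k, next_k in zip(ks, ks[1:], ks[2:]):
        drop_before = spread_by_k[prev_k] - spread_by_k[k]
        drop_after = spread_by_k[k] - spread_by_k[next_k]
        bend = drop_before - drop_after  # large when the curve flattens right after k
        if bend > best_bend:
            best_k, best_bend = k, bend
    return best_k

# Illustrative, made-up spread values shaped like a three-blob curve.
spread_by_k = {1: 900.0, 2: 480.0, 3: 60.0, 4: 52.0, 5: 46.0, 6: 41.0}
elbow = find_elbow(spread_by_k)
print("elbow at k =", elbow)
```

With these numbers the bend is largest at k = 3: the drop into 3 is 420, while the drop out of it is only 8. On real data the curve is often smoother, so a rule like this is a starting point to check against the plot, not a verdict.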

Peek at the data first

Two number columns and no group label — exactly the unlabelled shape k-means works on. The only extra decision here is how many groups to ask for.

Try it

Slide Groups (k) from 1 to 6. The top chart recolors the points and moves each group’s center; the bottom chart plots the total spread for every k, with a ring on your current choice and the elbow marked in green. Watch the curve plunge, then flatten — pick the k right at the bend.

Where it shows up

Customer segments. How many kinds of shopper are really there? The elbow suggests the count.
Image & color quantization. How many colors does a picture need to keep looking right?
Any clustering. Topic discovery, anomaly groups, sensor regimes — all need a defensible k.

Where it came from

The elbow heuristic is usually traced to the psychologist Robert L. Thorndike, whose 1953 paper “Who Belongs in the Family?” took up the question of how many clusters to keep. It remains the most common first answer to “how many groups?”, now joined by sharper measures like the silhouette score and the gap statistic.

Try it in code

In the Studio, train_model on a clusterer takes a groups: count — the very k you just slid. Try a few and compare the groupings it shows:

data = load "fruits"
model = make_model "clusterer"
train_model model, on: data, using: ["sweetness", "size"], groups: 3
show_model model

Check your understanding

Why can’t you choose k just by picking the value with the smallest total spread?
In your own words, what is the “elbow,” and why is it the sensible k?
What goes wrong with too few groups? With too many?