Distance & similarity

What it is

Once we turn things into points (a fruit becomes its sweetness and size, a song its tempo and loudness), we can ask one of the most useful questions in machine learning: how far apart are two points? Distance answers it with a single number, and that number is how a computer measures similarity: the closer two points sit, the more alike they are.

Go deeper: the everyday measure is straight-line (Euclidean) distance, and it’s just the Pythagorean theorem. Walk across by Δx and up by Δy; the direct hop between the points is the hypotenuse of that little triangle, so d = √(Δx² + Δy²). There’s a second honest measure too: city-block (Manhattan) distance, where you can only travel along the grid lines, so the cost is |Δx| + |Δy|. It is never shorter than the straight line, and the two match only when the points share a row or a column.
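
To make the two formulas concrete, here is a minimal Python sketch. The function names and the fruit coordinates are made up for illustration; they are not part of the simulation or the Studio.

```python
import math

def euclidean(a, b):
    """Straight-line distance: the hypotenuse of the dx/dy triangle."""
    dx = b[0] - a[0]
    dy = b[1] - a[1]
    return math.sqrt(dx**2 + dy**2)

def manhattan(a, b):
    """City-block distance: walk the grid, so add the absolute differences."""
    return abs(b[0] - a[0]) + abs(b[1] - a[1])

# Hypothetical fruits as (sweetness, size) pairs
fruit_a = (2.0, 1.0)
fruit_b = (5.0, 5.0)

print(euclidean(fruit_a, fruit_b))  # 5.0  (a 3-4-5 triangle)
print(manhattan(fruit_a, fruit_b))  # 7.0  (3 across + 4 up)
```

The absolute values matter: without them, a step to the left or down would count as negative distance.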

Why care

Distance is the quiet engine under a huge share of machine learning. k-nearest neighbors guesses a label from the closest known points. k-means builds groups by pulling points toward the nearest center. Recommenders suggest items that sit near the ones you liked. Get distance, and you’ve got the common idea behind all three.
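
As one example of distance doing the work, the "pull points toward the nearest center" step of k-means starts by deciding which center each point is nearest to. The sketch below shows only that assignment step, with made-up centers and points; `nearest_center` is a hypothetical helper, not a library function.

```python
import math

def nearest_center(point, centers):
    """Return the index of the center closest to point (Euclidean distance)."""
    return min(range(len(centers)), key=lambda i: math.dist(point, centers[i]))

# Hypothetical data: two group centers and three points on the grid
centers = [(2.0, 2.0), (8.0, 7.0)]
points = [(1.0, 3.0), (9.0, 6.0), (3.0, 1.0)]

groups = [nearest_center(p, centers) for p in points]
print(groups)  # [0, 1, 0]
```

A full k-means run would then move each center to the average of its assigned points and repeat; that part is left out here.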

The idea, intuitively

Drag the two fruits around. A triangle springs up between them: the flat leg is how different their sweetness is, the upright leg how different their size is, and the slanted line connecting them is the distance. Push them together and the “how alike” bar fills up; pull them to opposite corners and it empties. That’s the whole trick — close means alike.
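
The page does not say how the “how alike” bar is computed. One simple mapping that matches the behavior described (full when the fruits touch, empty at opposite corners) is sketched below; the 0 to 10 grid size and the name `how_alike` are assumptions for illustration.

```python
import math

GRID_MAX = 10.0  # assumed: both axes run from 0 to 10

def how_alike(a, b, grid_max=GRID_MAX):
    """Map distance to a 0..1 score: 1 when the points coincide,
    0 when they sit in opposite corners of the grid."""
    farthest = math.dist((0.0, 0.0), (grid_max, grid_max))  # corner to corner
    return 1.0 - math.dist(a, b) / farthest

print(how_alike((3.0, 4.0), (3.0, 4.0)))    # 1.0
print(how_alike((0.0, 0.0), (10.0, 10.0)))  # 0.0
```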

Peek at the data first

Each fruit is just a couple of numbers (sweetness, size) that place it on the grid — the same numeric, labelled shape Spectra’s describe_data would show. Distance turns those numbers into one score for “how far apart.”

Try it

Drag fruit A (the red circle) and fruit B (the gold square) anywhere on the grid — or tap an empty spot to send the nearer one there. Watch the Δx/Δy triangle, the distance, and the “how alike” bar update live. Tick city-block to walk the grid instead of cutting straight across, and compare the two distances.

Where it shows up

Where it came from

The straight-line formula is the Pythagorean theorem, credited to Pythagoras of Samos (~500 BCE) though known to Babylonian mathematicians more than a thousand years earlier. The grid-walking alternative is nicknamed Manhattan (or taxicab) distance after the city’s block layout; it was studied by Hermann Minkowski around 1900 as one of a whole family of distance measures.

Try it in code

In the Studio, distance does its work behind the scenes — a classifier finds the nearest known fruits and lets them vote, exactly the “close means alike” idea you just dragged:

data  = load "fruits"
train, test = split data, hold_out: 20%

model = make_model "classifier"
train_model model, on: train, predict: "type", using: ["sweetness", "size"]

check model, with: test
show_model model
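
The Studio hides the nearest-and-vote step, and its internals are not shown here. As a rough idea of what such a classifier could be doing, here is a small k-nearest-neighbors sketch in Python. The fruit values, labels, and the function `knn_predict` are hypothetical, and the Studio's actual model may work differently.

```python
import math
from collections import Counter

def knn_predict(known, query, k=3):
    """Predict a label by majority vote among the k nearest known points."""
    # Sort known fruits by straight-line distance to the query, keep the closest k
    nearest = sorted(known, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Hypothetical training data: ((sweetness, size), type)
known = [
    ((8.0, 3.0), "cherry"),
    ((7.5, 2.5), "cherry"),
    ((9.0, 3.5), "cherry"),
    ((3.0, 8.0), "melon"),
    ((2.5, 9.0), "melon"),
]

print(knn_predict(known, (8.0, 2.0)))  # cherry
```

The new fruit lands among the cherries, so all three of its nearest neighbors vote "cherry".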


Check your understanding