
Decision tree

What it is

A decision tree is a flowchart of yes/no questions. To decide if a fruit is ripe, you might ask “Is it sweet enough?” and then “Is it big enough?” — and each answer sends you down a branch until you reach a guess. The clever part is that the computer figures out which questions to ask, and in what order, all by itself.

Go deeper: at every step the tree tries every possible question and keeps the one that best splits the mixed-up group into two cleaner groups. “Cleaner” is measured by Gini impurity: 0 means a group is all one kind, and with two kinds 0.5 means it is a 50/50 mix. The tree keeps asking until each group is pure, or until it hits a depth limit that stops it from simply memorizing the fruit it was shown.
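
To make the scoring concrete, here is a minimal Python sketch of Gini impurity and of the "try every question, keep the cleanest split" step. The fruit rows, the function names (gini, split_score, best_question), and the choice to test each observed value as a threshold are assumptions for illustration, not the Studio's actual implementation.

```python
from collections import Counter

def gini(labels):
    """Gini impurity: chance that two random picks (with replacement) disagree."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts = Counter(labels)
    return 1.0 - sum((c / n) ** 2 for c in counts.values())

def split_score(rows, labels, feature, threshold):
    """Weighted impurity of the two groups made by asking 'is feature >= threshold?'."""
    yes = [lab for row, lab in zip(rows, labels) if row[feature] >= threshold]
    no = [lab for row, lab in zip(rows, labels) if row[feature] < threshold]
    n = len(labels)
    return len(yes) / n * gini(yes) + len(no) / n * gini(no)

def best_question(rows, labels):
    """Try every feature and every observed value; keep the lowest-scoring split."""
    best = None
    for feature in range(len(rows[0])):
        for threshold in sorted({row[feature] for row in rows}):
            score = split_score(rows, labels, feature, threshold)
            if best is None or score < best[0]:
                best = (score, feature, threshold)
    return best  # (score, feature index, threshold)

# Made-up fruit for illustration: (sweetness, size) and whether it was ripe.
rows = [(2, 3), (3, 8), (8, 2), (7, 7), (9, 8), (8, 9)]
labels = ["unripe", "unripe", "unripe", "ripe", "ripe", "ripe"]

print(gini(["ripe", "ripe"]))    # 0.0 (all one kind)
print(gini(["ripe", "unripe"]))  # 0.5 (a 50/50 mix)
print(best_question(rows, labels))
```

On this toy data the winning question still leaves one mixed group, which is why the tree goes on to ask a second question.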

Why care

Decision trees are everywhere because you can read them. Unlike a tangle of numbers, a tree shows its reasoning as plain questions a person can check, argue with, or explain to a friend. That makes them a favorite when a decision needs to be transparent — and they are the building block of random forests, one of the most reliable tools in machine learning.

The idea, intuitively

Plot every known fruit by two clues: sweetness and size. The ripe fruit huddles in the top-right corner (sweet and big). No single vertical or horizontal line can fence it off, but two questions can. The first question slices the plane in two; the next question tidies up a leftover corner. Together the cuts carve the plot into boxes, and each box ends in one confident guess.
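
The same picture can be sketched as two tiny hand-written rules in Python. The fruit values and the cut-off of 6 for both sweetness and size are made up for illustration; a real tree would pick its own thresholds from the data.

```python
# Made-up fruit: (sweetness, size, ripe?) with the ripe fruit in the sweet-and-big corner.
fruit = [
    (2, 3, False), (3, 8, False), (8, 2, False), (4, 5, False),
    (7, 7, True), (9, 8, True), (8, 9, True),
]

def one_question(sweetness, size):
    """A single cut: 'is it sweet enough?' (size is ignored)."""
    return sweetness >= 6

def two_questions(sweetness, size):
    """Two cuts: the second question tidies up the sweet-but-small corner."""
    if sweetness >= 6:
        return size >= 6
    return False

def accuracy(rule):
    """Fraction of the known fruit that the rule guesses correctly."""
    hits = sum(rule(sw, sz) == ripe for sw, sz, ripe in fruit)
    return hits / len(fruit)

print(accuracy(one_question))   # 6 of 7 right: the sweet but tiny fruit is misjudged
print(accuracy(two_questions))  # 7 of 7 right
```

The nested if mirrors the flowchart: the second question is only asked inside the box that the first question left mixed.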

Peek at the data first

Before growing anything, look at the fruit we already know. Each row has two clues — sweetness and size — and whether it turned out ripe. Here are a few rows with a summary of each column, just like Spectra’s describe_data.

Try it

Press Grow one question to let the tree add its next-best yes/no question. Watch the split lines carve the plot into boxes and the flowchart grow to match. Prune one takes a question back; turn on Shade the boxes to color each box by its guess.
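
Each press of Grow one question corresponds to one more split. As a rough sketch of what growing to a depth limit could look like, the Python below builds a tree as nested dictionaries. The data, the function names (best_split, grow, predict), and the rule that a split must strictly lower the impurity are assumptions for illustration, not the simulation's actual code.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a group of labels (0.0 for an empty or pure group)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_split(rows, labels):
    """Return the (feature, threshold) that lowers impurity the most, or None."""
    best, best_score = None, gini(labels)
    for f in range(len(rows[0])):
        for t in sorted({r[f] for r in rows}):
            yes = [l for r, l in zip(rows, labels) if r[f] >= t]
            no = [l for r, l in zip(rows, labels) if r[f] < t]
            if not yes or not no:
                continue  # a question that separates nothing is useless
            score = (len(yes) * gini(yes) + len(no) * gini(no)) / len(labels)
            if score < best_score:
                best, best_score = (f, t), score
    return best

def grow(rows, labels, max_depth):
    """Grow a tree as nested dicts; a leaf is simply the majority label."""
    can_split = max_depth > 0 and gini(labels) > 0
    split = best_split(rows, labels) if can_split else None
    if split is None:
        return Counter(labels).most_common(1)[0][0]
    f, t = split
    yes = [(r, l) for r, l in zip(rows, labels) if r[f] >= t]
    no = [(r, l) for r, l in zip(rows, labels) if r[f] < t]
    return {
        "feature": f, "threshold": t,
        "yes": grow([r for r, _ in yes], [l for _, l in yes], max_depth - 1),
        "no": grow([r for r, _ in no], [l for _, l in no], max_depth - 1),
    }

def predict(tree, row):
    """Follow the questions down the tree until a leaf (a plain label) is reached."""
    while isinstance(tree, dict):
        tree = tree["yes"] if row[tree["feature"]] >= tree["threshold"] else tree["no"]
    return tree

# Made-up fruit: (sweetness, size) and the known answer.
rows = [(2, 3), (3, 8), (8, 2), (4, 5), (7, 7), (9, 8), (8, 9)]
labels = ["unripe", "unripe", "unripe", "unripe", "ripe", "ripe", "ripe"]

stump = grow(rows, labels, max_depth=1)  # one question only
tree = grow(rows, labels, max_depth=2)   # up to two questions deep

print(predict(stump, (8, 2)))  # the one-question tree calls the sweet, tiny fruit ripe
print(predict(tree, (8, 2)))   # the deeper tree asks about size and says unripe
```

Lowering max_depth plays the role of Prune one: the tree stops earlier and settles for the majority guess in each remaining box.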

Where it shows up

Decisions people must trust. Loan, medical, and safety tools often use trees because a human can follow exactly why a choice was made.
Sorting and triage. “Is it urgent? Is it about billing?” — flowcharts of questions route huge numbers of cases quickly.
Forests. Hundreds of slightly different trees voting together (a random forest) is one of the strongest, most dependable methods around.

Where it came from

The modern decision tree grew from two lines of work that began in the 1980s: Leo Breiman, Jerome Friedman, Richard Olshen, and Charles Stone published Classification and Regression Trees (CART) in 1984, and Ross Quinlan developed the ID3 algorithm in the same period, following it with C4.5 in 1993. Breiman went on to combine many trees into the random forest in 2001.

Try it in code

The Studio’s tree model learns exactly this kind of flowchart, and show_model draws it for you:

data  = load "fruits"
train, test = split data, hold_out: 20%

model = make_model "tree"
train_model model, on: train, predict: "type", using: ["sweetness", "size"]

check model, with: test
show_model model


Check your understanding

Why can two yes/no questions separate the ripe fruit when a single straight line can’t?
What does it mean for a box to be “pure”? Why does the tree stop asking there?
If you let the tree grow forever, it could get every fruit right — so why do we set a depth limit?