Random forest

What it is

One decision tree can be a bit jumpy: change the data a little and it draws a different flowchart. A random forest fixes that by growing many trees, each on a slightly different slice of the data, and letting them vote. The most popular answer wins. A crowd of so-so trees usually beats one clever tree.
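To see why a crowd helps, here is a small Python sketch. It assumes something real forests only approximate: every tree votes independently and is right with the same probability. The function name majority_accuracy and the 60% figure are made up for illustration.

```python
from math import comb

def majority_accuracy(p, n_voters):
    """Chance that more than half of n_voters are right, assuming each voter
    is independently right with probability p (use an odd n_voters so there
    are no ties)."""
    need = n_voters // 2 + 1
    return sum(
        comb(n_voters, k) * p**k * (1 - p) ** (n_voters - k)
        for k in range(need, n_voters + 1)
    )

# Illustrative assumption: each so-so tree is right 60% of the time.
for n in (1, 11, 101):
    print(n, "voters:", round(majority_accuracy(0.6, n), 3))
```

Under these assumptions the crowd gets more accurate as voters are added, as long as each voter is right more often than wrong. Real trees trained on the same data are not independent, which is why a forest works to make its trees differ.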

Go deeper: two tricks make the trees different on purpose. Each tree learns from a bootstrap sample: the same number of rows, but drawn with replacement, so some rows repeat, others are left out, and each tree sees a different mix. And at every question a tree may only look at a random subset of the clues. Different trees make different mistakes, and when they vote, many of those mistakes cancel out. That cancelling is called variance reduction.
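The two tricks and the vote can be sketched in a few lines of Python. This is an illustration, not the Studio's implementation: the helper names (bootstrap_sample, random_clues, majority_vote) are hypothetical, and the rows are stand-in numbers rather than real fruit.

```python
import random
from collections import Counter

def bootstrap_sample(rows, rng):
    """Draw len(rows) rows with replacement: same size, different mix."""
    return [rng.choice(rows) for _ in rows]

def random_clues(clues, how_many, rng):
    """Pick the clues that one question is allowed to look at."""
    return rng.sample(clues, how_many)

def majority_vote(votes):
    """Return the most common answer among the trees' votes."""
    return Counter(votes).most_common(1)[0][0]

rng = random.Random(0)
rows = list(range(1000))                 # stand-in row ids, not real data
sample = bootstrap_sample(rows, rng)
unique_share = len(set(sample)) / len(rows)

print("rows in sample:", len(sample))
print("share of distinct rows:", round(unique_share, 2))
print("clues for this question:", random_clues(["sweetness", "size"], 1, rng))
print("forest says:", majority_vote(["ripe", "unripe", "ripe"]))
```

For a large table, a bootstrap sample contains roughly 63% of the distinct rows on average, so every tree really does study a different slice.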

Why care

“Ask a diverse crowd and take the majority” is one of the most dependable ideas in all of machine learning. Random forests are accurate, hard to fool, need very little fiddling, and still let you peek at what the trees pay attention to. For tables of numbers, they are often the first thing a professional reaches for.

The idea, intuitively

Our ripe fruit (sweet and big) sits above a slanted line. A single tree can only cut straight across or straight up, so it approximates that slant with clumsy steps and misjudges fruit close to the line. Each tree in the forest places its steps a little differently. Stack enough of them and the votes blend into a smoother boundary that hugs the real slanted line.
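Here is a rough Python sketch of that picture, built from scratch so nothing is hidden. Everything in it is an assumption made for illustration: the fruit data is invented (ripe when sweetness + size is above 1), the trees are kept very small, and the cuts are scored by counting mistakes, which is simpler than what real libraries do.

```python
import random

def make_fruit(n, rng):
    """Invented data: (sweetness, size) in 0..1, ripe above a slanted line."""
    rows = []
    for _ in range(n):
        x = (rng.random(), rng.random())
        rows.append((x, x[0] + x[1] > 1.0))
    return rows

def majority(labels):
    return 2 * sum(labels) >= len(labels)       # True means ripe; ties go to ripe

def mistakes(labels):
    ripe = sum(labels)
    return min(ripe, len(labels) - ripe)        # errors if this side votes its majority

def grow_tree(rows, depth, rng, clues_per_question):
    labels = [y for _, y in rows]
    if depth == 0 or len(set(labels)) == 1:
        return majority(labels)                 # leaf
    best = None
    for clue in rng.sample([0, 1], clues_per_question):
        for x, _ in rows:                       # try every seen value as a cut
            cut = x[clue]
            low = [y for xx, y in rows if xx[clue] <= cut]
            high = [y for xx, y in rows if xx[clue] > cut]
            if low and high:
                score = mistakes(low) + mistakes(high)
                if best is None or score < best[0]:
                    best = (score, clue, cut)
    if best is None:
        return majority(labels)
    _, clue, cut = best
    low_rows = [r for r in rows if r[0][clue] <= cut]
    high_rows = [r for r in rows if r[0][clue] > cut]
    return (clue, cut,
            grow_tree(low_rows, depth - 1, rng, clues_per_question),
            grow_tree(high_rows, depth - 1, rng, clues_per_question))

def ask_tree(tree, x):
    while isinstance(tree, tuple):              # walk down until we hit a leaf
        clue, cut, low, high = tree
        tree = low if x[clue] <= cut else high
    return tree

def grow_forest(rows, n_trees, depth, rng):
    forest = []
    for _ in range(n_trees):
        sample = [rng.choice(rows) for _ in rows]   # bootstrap sample
        forest.append(grow_tree(sample, depth, rng, clues_per_question=1))
    return forest

def vote_share(forest, x):
    """Fraction of trees that say ripe (what a vote map would color)."""
    return sum(ask_tree(t, x) for t in forest) / len(forest)

def accuracy(predict, rows):
    return sum(predict(x) == y for x, y in rows) / len(rows)

rng = random.Random(1)
known, held_out = make_fruit(200, rng), make_fruit(500, rng)
one_tree = grow_tree(known, 3, rng, clues_per_question=2)
forest = grow_forest(known, 25, 3, rng)
print("one tree:", accuracy(lambda x: ask_tree(one_tree, x), held_out))
print("forest:  ", accuracy(lambda x: vote_share(forest, x) >= 0.5, held_out))
print("vote share near the line:", vote_share(forest, (0.5, 0.52)))
```

A single tree answers only yes or no, so its boundary is a staircase. The forest's vote share can take many values between 0 and 1, which is what lets the blended boundary follow the slant more closely.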

Peek at the data first

Before building anything, look at the fruit we already know — two clues, sweetness and size, and whether it turned out ripe. Here are a few rows with a summary of each column, just like Spectra’s describe_data.

Try it

Drag the mystery fruit (the “?”) anywhere; every tree votes and the majority colors it. Slide “How many trees” and watch the accuracy climb and steady. Turn on “Show the vote map” to see how sure the forest is across the whole plot: deep green where almost all trees say ripe.

Where it shows up

Where it came from

The idea built up in stages. Tin Kam Ho proposed growing trees on random subsets of the features (“random subspaces”) in 1995. Leo Breiman introduced bagging, short for bootstrap aggregating, in 1996. In 2001 he combined these ideas into the random forest we use today, working with Adele Cutler.

Try it in code

The Studio’s forest model grows a crowd of voting trees, and show_model reveals how many trees there are and which clue each one asks first:

data  = load "students"
train, test = split data, hold_out: 20%

model = make_model "forest"
train_model model, on: train, predict: "result", using: ["hours_studied", "sleep_hours", "attendance"], trees: 12

check model, with: test
show_model model

