Reinforcement learning maze

What it is

Reinforcement learning is how a computer learns by trying things and getting rewards — no teacher showing the right answer, just a score that goes up or down. Here a little critter starts in a maze knowing nothing. Reaching the cheese earns a reward; falling in the pit is a punishment. After enough tries it works out a path all by itself.

Go deeper: the critter keeps a number for every move in every square, a guess at “how good is it to go this way from here?” (its Q-value). After each move it nudges that number toward the reward it got plus the value of the best move available from wherever it landed. Good news slowly seeps backward from the cheese, square by square, until the best move everywhere points home. That trick is called Q-learning.
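To make the nudge concrete, here is a minimal sketch in plain Python (not Spectra). It assumes a made-up corridor of five squares with the cheese at the far end; the names `q_update` and `train`, the reward of 1, and the settings `ALPHA` and `GAMMA` are all illustrative choices, not part of the simulation above.

```python
import random

# Hypothetical corridor: squares 0..4 with the cheese in square 4.
# Every number below is an assumption chosen for illustration.
N_SQUARES = 5
CHEESE = 4
MOVES = (-1, +1)   # step left, step right
ALPHA = 0.5        # how big each nudge is
GAMMA = 0.9        # how much the future counts compared with right now


def q_update(q, square, move, reward, landed, finished):
    """Nudge one Q-value toward reward + value of the best move from `landed`."""
    best_next = 0.0 if finished else max(q[(landed, m)] for m in MOVES)
    target = reward + GAMMA * best_next
    q[(square, move)] += ALPHA * (target - q[(square, move)])


def train(tries, seed=0):
    rng = random.Random(seed)
    q = {(s, m): 0.0 for s in range(N_SQUARES) for m in MOVES}
    for _ in range(tries):
        square = 0
        while square != CHEESE:
            move = rng.choice(MOVES)  # try moves at random
            # Walls at both ends keep the critter inside the corridor.
            landed = min(max(square + move, 0), N_SQUARES - 1)
            finished = landed == CHEESE
            reward = 1.0 if finished else 0.0
            q_update(q, square, move, reward, landed, finished)
            square = landed
    return q


q = train(200)
# Value of each square = value of its best move; it rises toward the cheese.
values = [max(q[(s, m)] for m in MOVES) for s in range(CHEESE)]
best_moves = [max(MOVES, key=lambda m: q[(s, m)]) for s in range(CHEESE)]
print([round(v, 3) for v in values])
print(best_moves)
```

Only the last step before the cheese is ever rewarded directly. Every other square gets its value second-hand, from the square next to it, which is the "seeping backward" described above.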

Why care

Lots of the most striking AI works this way: programs that master chess and Go, robots that learn to walk, and the “helpfulness training” that shapes chatbots all use rewards instead of labelled answers. Reinforcement learning matters because in the real world you often can’t list the right move for every situation — you can only say whether things went well.

The idea, intuitively

Imagine learning a new video game with the sound off and no instructions. At first you press buttons randomly. Now and then something good happens and you remember what you just did; something bad happens and you avoid it next time. Repeat hundreds of times and your fumbling turns into skill. That’s exactly what the critter is doing — trial, error, and a memory of what paid off.

Peek at the data first

There is no spreadsheet here — the only “data” is the reward for each thing that can happen. Everything the critter knows, it learns from these numbers by trying.

Try it

Drag “Tries trained” from 0 upward and watch the blue arrows (the best move in each square) and the gold path appear. Tick “Colour each square by how good it is” to see value spreading out from the cheese like warmth filling the maze.
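If you would like to see where arrows and a path like these could come from, here is a rough Python sketch (not Spectra, and not the simulation's actual code). It assumes a made-up 3×3 maze; the layout, the reward numbers, the step cost, and the helper names `step`, `train`, `best_arrow` and `gold_path` are all hypothetical.

```python
import random

# Hypothetical 3x3 maze. Layout, rewards and settings are all assumptions.
SIZE = 3
START, PIT, CHEESE = (0, 0), (1, 1), (2, 2)
MOVES = {"^": (-1, 0), "v": (1, 0), "<": (0, -1), ">": (0, 1)}
ALPHA, GAMMA = 0.5, 0.9


def step(square, arrow):
    """Move one square; walls keep the critter where it is."""
    row, col = square
    d_row, d_col = MOVES[arrow]
    landed = (min(max(row + d_row, 0), SIZE - 1),
              min(max(col + d_col, 0), SIZE - 1))
    if landed == CHEESE:
        return landed, 1.0, True    # reward for the cheese
    if landed == PIT:
        return landed, -1.0, True   # punishment for the pit
    return landed, -0.04, False     # tiny cost for every ordinary step


def train(tries, seed=0):
    rng = random.Random(seed)
    q = {((r, c), a): 0.0
         for r in range(SIZE) for c in range(SIZE) for a in MOVES}
    for _ in range(tries):
        square, done = START, False
        while not done:
            arrow = rng.choice(list(MOVES))  # explore at random
            landed, reward, done = step(square, arrow)
            best_next = 0.0 if done else max(q[(landed, a)] for a in MOVES)
            q[(square, arrow)] += ALPHA * (reward + GAMMA * best_next
                                           - q[(square, arrow)])
            square = landed
    return q


def best_arrow(q, square):
    """The 'blue arrow': the move with the highest Q-value in this square."""
    return max(MOVES, key=lambda a: q[(square, a)])


def gold_path(q):
    """Follow the best arrows from START; stop at an ending or a repeat."""
    path, square = [START], START
    while square not in (CHEESE, PIT):
        square, _, _ = step(square, best_arrow(q, square))
        if square in path:
            break
        path.append(square)
    return path


untrained, trained = train(0), train(2000)
for row in range(SIZE):
    print(" ".join("C" if (row, col) == CHEESE else
                   "P" if (row, col) == PIT else
                   best_arrow(trained, (row, col)) for col in range(SIZE)))
print(gold_path(untrained))  # no tries yet: the path never leaves the start
print(gold_path(trained))
```

With zero tries every Q-value is still 0, so the arrows are meaningless and the path goes nowhere. After a couple of thousand random tries the arrows steer around the pit and the path reaches the cheese in four moves.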

Where it shows up

Game-playing AI. Game bots such as AlphaGo Zero learn winning moves by playing against themselves, guided only by the reward of winning.
Robots. A robot learns to walk or grab by being rewarded for staying upright or holding on.
Chatbots. “Reinforcement learning from human feedback” rewards helpful, honest answers.

Where it came from

The idea grew from animal-learning psychology: Edward Thorndike’s “law of effect” (1911) says that actions followed by reward get repeated. Richard Bellman (1957) gave it the maths of value and the future. Chris Watkins invented Q-learning in 1989. Decades later DeepMind paired reinforcement learning with neural networks: a deep version of Q-learning reached human-level play on many Atari games (2015), and AlphaGo beat one of the world’s top Go players (2016).

Try it in code

Spectra keeps things safe and tiny, so it learns from prepared examples rather than live rewards — but the “learn by trying, keep what works” spirit is the same one behind the word-babbler:

data = load "sayings"
describe_data data

model = make_model "markov"
train_model model, on: data, using: "text"
generate model, count: 4

Check your understanding

How is learning from rewards different from being shown the right answer?
Why does the critter need so many tries before a clear path appears?
Why does a tiny cost per step make the critter prefer a shorter route?