Tokenization
What it is
A language model can’t read letters. It reads tokens: pieces of text, each represented by a number. Tokenization is the step that chops a sentence into pieces and looks up a number for each piece. You could chop by single characters, by whole words, or, as modern models actually do, by subword pieces that sit in between.
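Here is a minimal sketch of "chop, then look up a number", using only characters and words. The function names (`chop`, `build_vocab`, `encode`) and the sample sentence are made up for illustration, and the IDs are simply handed out in order of first appearance; a real tokenizer uses a fixed vocabulary learned ahead of time.

```python
def chop(text, mode):
    """Split text into pieces: single characters, or whitespace-separated words."""
    if mode == "characters":
        return list(text)
    if mode == "words":
        return text.split()
    raise ValueError(f"unknown mode: {mode}")


def build_vocab(pieces):
    """Give each distinct piece an integer ID, in order of first appearance."""
    vocab = {}
    for piece in pieces:
        if piece not in vocab:
            vocab[piece] = len(vocab)
    return vocab


def encode(pieces, vocab):
    """Replace every piece with its ID: this list of integers is what a model reads."""
    return [vocab[piece] for piece in pieces]


sentence = "the cat sat on the mat"  # illustrative sentence, not from the page's list
for mode in ("characters", "words"):
    pieces = chop(sentence, mode)
    ids = encode(pieces, build_vocab(pieces))
    print(mode, "->", len(pieces), "tokens:", ids)
```

Notice that the two copies of "the" get the same ID, and that chopping by characters produces far more tokens than chopping by words.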
Go deeper: subword tokenizers learn their vocabulary of pieces by scanning lots of text and repeatedly merging the most frequent pair of adjacent symbols into one new symbol (this is byte-pair encoding). Common words like “the” end up as a single token, while a rare word like “unbelievably” is built from reusable pieces (for example, un + believ + ably). That keeps the vocabulary small and lets the model handle words it has never seen.
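The merge loop can be sketched in a few lines. This is a toy version under stated assumptions: it starts from single characters, treats each whitespace-separated word separately, and breaks ties between equally frequent pairs in an arbitrary but repeatable way. The names `count_pairs`, `merge_pair`, and `learn_bpe` are illustrative, not taken from any library.

```python
from collections import Counter


def count_pairs(words):
    """Count adjacent symbol pairs, weighted by how often each word occurs."""
    pairs = Counter()
    for symbols, freq in words.items():
        for left, right in zip(symbols, symbols[1:]):
            pairs[(left, right)] += freq
    return pairs


def merge_pair(words, pair):
    """Rewrite every word so that each occurrence of `pair` becomes one symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged


def learn_bpe(corpus, num_merges):
    """Return the learned merges (in order) and the corpus words after merging."""
    # Each word starts out as a tuple of single characters.
    words = dict(Counter(tuple(word) for word in corpus.split()))
    merges = []
    for _ in range(num_merges):
        pairs = count_pairs(words)
        if not pairs:
            break  # every word is already a single symbol
        # Most frequent pair wins; ties are broken by the pair itself (arbitrary choice).
        best = max(pairs, key=lambda p: (pairs[p], p))
        merges.append(best)
        words = merge_pair(words, best)
    return merges, words


merges, words = learn_bpe("the the the cat", num_merges=2)
print(merges)
print(words)
```

On this tiny corpus the frequent word "the" collapses into a single symbol after two merges, while the rarer "cat" is still spelled out character by character.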
Why care
Tokens are the unit everything else is measured in. A model’s memory limit, its speed, and even what you pay to use one are all counted in tokens. How text is chopped also decides whether a model can spell, handle a new name, or work in another language — so tokenization quietly shapes what a model can and can’t do.
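As a rough sketch of that bookkeeping, the snippet below counts tokens two ways and checks them against a limit and a price. Both numbers are assumptions for illustration: the limit is just an example size, and the price is invented, not a real rate from any provider.

```python
CONTEXT_LIMIT = 8_000        # assumed context size, in tokens
PRICE_PER_1K_TOKENS = 0.50   # hypothetical price, made up for this example


def fits_in_context(token_count, context_limit=CONTEXT_LIMIT):
    """True if the text is short enough for the model to consider all at once."""
    return token_count <= context_limit


def estimate_cost(token_count, price_per_1k=PRICE_PER_1K_TOKENS):
    """Cost grows in direct proportion to the number of tokens."""
    return token_count / 1000 * price_per_1k


sentence = "the cat sat on the mat"  # illustrative sentence
for label, pieces in (("characters", list(sentence)), ("words", sentence.split())):
    count = len(pieces)
    print(label, count, fits_in_context(count), estimate_cost(count))
```

The same sentence costs more and uses up more of the limit when it is chopped into more pieces, which is why the chopping scheme matters.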
The idea, intuitively
Think of cutting a sentence into Lego bricks. Tiny bricks (characters) fit anything but you need a huge pile to build even one word. Giant bricks (whole words) build fast but you’ll be missing the exact brick for any unusual word. Medium bricks (subwords) are the sweet spot: a manageable set of pieces that snap together to make any word.
Peek at the data first
You never type your own text here — sentences come from a small fixed list (safety by design). Here is how a few of their words split into subword pieces before becoming numbers.
Try it
Pick a sentence, then switch Chop by between Characters, Words, and Subwords and watch the chips — and the token count — change. Tick Show the ID numbers to reveal what the model truly reads: a list of integers, not letters.
Where it shows up
- Context limits. “This model holds 8,000 tokens” sets how much text it can consider at once.
- Cost & speed. Usage is billed and timed per token, so chopping efficiently matters.
- New & rare words. Subword pieces let a model sound out names and made-up words it has never seen before.
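To see how pieces cover a word the vocabulary has never stored whole, here is a small sketch that uses greedy longest-match splitting with a single-character fallback. Note the assumptions: the vocabulary is a tiny hand-made set, and greedy longest-match is a simplification chosen for brevity, not the exact procedure a byte-pair-encoding tokenizer follows.

```python
def segment(word, vocab):
    """Split `word` by repeatedly taking the longest known piece at the current position."""
    pieces, i = [], 0
    while i < len(word):
        # Try the longest candidate first, then shorter ones.
        for j in range(len(word), i, -1):
            if word[i:j] in vocab:
                pieces.append(word[i:j])
                i = j
                break
        else:
            # Nothing in the vocabulary matches here: fall back to one character.
            pieces.append(word[i])
            i += 1
    return pieces


vocab = {"un", "believ", "ably", "the"}  # hypothetical, hand-picked pieces

print(segment("unbelievably", vocab))
print(segment("unably", vocab))   # a made-up word, built from known pieces
print(segment("zably", vocab))    # "z" is unknown, so it falls back to a character
```

Because there is always a fallback to single characters, no word is ever impossible to represent; unfamiliar words just take more pieces.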
Where it came from
Byte-pair encoding began as a simple data-compression trick from Philip Gage (1994). Sennrich, Haddow, and Birch (2016) repurposed it for machine translation to handle rare words. It then became, along with related approaches like WordPiece and tools like SentencePiece, the standard way to feed text into transformers: a subword tokenizer sits in front of nearly every modern language model.
Try it in code
Spectra keeps text models safe and tiny, but the “words in, pieces out” idea is the same one the word-babbler uses when it learns from sentences:
data = load "sayings"
describe_data data
model = make_model "markov"
train_model model, on: data, using: "text"
generate model, count: 4
Check your understanding
- Why does a model read numbers instead of letters?
- What goes wrong if you tokenize only by whole words?
- How do subword pieces let a model handle a word it has never seen?