Module 1.1
How LLMs Work
What is an LLM?
At its core, an LLM is a next-token predictor. It was trained on a massive amount of text, and the single task it learned is: given everything before this point, what word (or chunk of a word) comes next?
But this one task, done at enormous scale, produces something that can reason, write code, argue philosophy, and debug your agent.
The key mental model to hold: The model doesn't know things the way you do. It has internalized statistical patterns of language. When it says something confidently wrong, it's not lying - it's predicting plausible-sounding text. This matters when you build agents because you can't trust output blindly.
Tokens
LLMs don't read words. They read tokens - chunks of text the model was trained to recognize.
"Hello, world!" → ["Hello", ",", " world", "!"] = 4 tokens
"tokenization" → ["token", "ization"] = 2 tokens
"AI agent" → ["AI", " agent"] = 2 tokens
Notice the space before "world" and "agent" is part of the token. It's not character-by-character, not word-by-word - it's somewhere in between.
Why it matters:
- You pay per token (in + out)
- Every model has a max token limit for the entire conversation
- Rule of thumb: ~4 characters = 1 token. A 1,000-word essay ≈ 1,300 tokens
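The rule of thumb above can be turned into a quick budgeting helper. This is only a heuristic sketch; real tokenizers (like the one behind the Tiktokenizer visualizer in the sources) give exact counts per model.

```python
# Rough token estimator using the ~4 characters-per-token rule of thumb.
# Only for quick budgeting; actual tokenizers vary by model.
def estimate_tokens(text: str) -> int:
    """Approximate token count: ~4 characters per token, rounded up."""
    return max(1, -(-len(text) // 4))  # ceiling division

# A 1,000-word essay at ~5 characters per word (including spaces)
# lands in the same ballpark as the rule of thumb above.
essay = "word " * 1000
print(estimate_tokens(essay))  # 1250
```

Use an exact tokenizer when billing or hard limits are at stake; the heuristic is for back-of-envelope planning only.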
Context Window
The context window is the model's working memory - the total tokens it can see at once.
It's one shared bucket containing everything:
[ system prompt ] + [ conversation history ] + [ current message ] + [ response ]
→ all of this together must fit within the limit
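The shared-bucket idea can be sketched as a simple budget check. The function name, the component breakdown, and the 200K default are illustrative assumptions, not an API from any provider.

```python
# Sketch: everything shares one token budget. Numbers are illustrative.
def fits_in_context(system_tokens: int, history_tokens: int,
                    message_tokens: int, reserved_for_response: int,
                    limit: int = 200_000) -> bool:
    """True if the full request plus reserved output fits the window."""
    total = (system_tokens + history_tokens
             + message_tokens + reserved_for_response)
    return total <= limit

print(fits_in_context(1_000, 150_000, 2_000, 4_000))  # True
print(fits_in_context(1_000, 198_000, 2_000, 4_000))  # False
```

Note that the response is part of the same budget: if you want up to 4,000 output tokens, you must leave room for them before sending the request.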
| Model | Limit |
|---|---|
| Claude Opus/Sonnet 4.6 | 200K tokens (1M beta for select tiers) |
| GPT-5 | 400K tokens |
| Gemini 3.1 Pro / Flash | Up to 1M tokens |
The agent implication: A long-running agent keeps appending to that conversation history. Eventually it hits the ceiling and breaks - unless you manage it. That's why memory management is a real engineering problem in Phase 3.
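One common way to manage that ceiling is to evict the oldest turns when the history grows too large. A minimal sketch, assuming each message carries a precomputed token count (real systems often summarize instead of dropping):

```python
# Drop the oldest turns until the history fits a token budget.
def trim_history(messages: list[dict], budget: int) -> list[dict]:
    """messages: dicts with a 'tokens' field, ordered oldest first."""
    trimmed = list(messages)  # don't mutate the caller's list
    total = sum(m["tokens"] for m in trimmed)
    while trimmed and total > budget:
        total -= trimmed.pop(0)["tokens"]  # evict the oldest turn
    return trimmed

history = [{"role": "user", "tokens": 500},
           {"role": "assistant", "tokens": 800},
           {"role": "user", "tokens": 300}]
print(trim_history(history, budget=1000))  # keeps only the newest turn
```

Dropping turns loses information, which is exactly why Phase 3 treats memory management as a design problem rather than a one-liner.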
Temperature
Controls how random the model's word choices are.
- 0 → Always picks the most probable next token. Deterministic, consistent.
- 1.0 → More random. Explores less likely word choices. More creative, less reliable.
Think of it like this: the model assigns a probability to every possible next token. Temperature 0 always picks the winner. Higher temperature gives the runner-ups a fighting chance.
For most agent tasks, use 0 or close to it. Predictable behavior usually matters more than creativity, though exploratory or brainstorming agents may benefit from higher values.
The Model Landscape
| Family | Examples | Who makes it |
|---|---|---|
| Claude | Opus 4.6, Sonnet 4.6, Haiku 4.5 | Anthropic |
| GPT | GPT-5, o4-mini, GPT-4.1 | OpenAI |
| Gemini | Gemini 3 Pro, 3 Flash | Google |
| Open source | Llama, Mistral, DeepSeek R1, Qwen 3 | Meta, Mistral AI, DeepSeek, Alibaba |
Within a family, there's usually a capability/cost tradeoff:
- Opus 4.6 / GPT-5 → most capable, most expensive
- Haiku/smaller models → fast, cheap, good enough for simpler tasks
In agent systems, you'll often use cheaper models for routine subtasks and expensive ones only where it matters.
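That routing pattern can be sketched in a few lines. The tier names, the complexity score, and the threshold are all illustrative assumptions; real routers use classifiers, heuristics, or the task type itself.

```python
# Hypothetical router: cheap model for routine subtasks, capable model
# only where it matters. Names and threshold are assumptions.
CHEAP, CAPABLE = "haiku-class", "opus-class"

def pick_model(task_complexity: float) -> str:
    """task_complexity in [0, 1]; the 0.7 threshold is a tunable guess."""
    return CAPABLE if task_complexity > 0.7 else CHEAP

print(pick_model(0.2))  # haiku-class
print(pick_model(0.9))  # opus-class
```

The payoff is cost: if most subtasks are routine, the expensive model only runs on the small fraction that actually needs it.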
Sources
- Attention Is All You Need - Vaswani et al., 2017
- Anthropic - Context Window Documentation
- Tiktokenizer - Token Visualizer