# Module 0.1: How LLMs Work
## What is an LLM?
At its core, an LLM is a next-token predictor. It was trained on a massive amount of text, and all it learned to do is answer one question: given everything up to this point, what chunk of text comes next?
That's it. But this one simple task, done at enormous scale, produces something that can reason, write code, argue philosophy, and debug your agent.
The key mental model to hold: The model doesn't know things the way you do. It has internalized statistical patterns of language. When it says something confidently wrong, it's not lying - it's predicting plausible-sounding text. This matters when you build agents because you can't trust output blindly.
## Tokens
LLMs don't read words. They read tokens - chunks of text the model was trained to recognize.
"Hello, world!" → ["Hello", ",", " world", "!"] = 4 tokens
"tokenization" → ["token", "ization"] = 2 tokens
"AI agent" → ["AI", " agent"] = 2 tokens
Notice the space before "world" and "agent" is part of the token. It's not character-by-character, not word-by-word - it's somewhere in between.
Why it matters:
- You pay per token, for both input and output
- Every model has a hard token limit for the entire conversation
- Rule of thumb: ~4 characters ≈ 1 token, so a 1,000-word essay ≈ 1,300 tokens
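The rule of thumb above is easy to turn into a quick estimator. This is a rough heuristic, not a real tokenizer (exact counts require the model's own tokenizer, e.g. a library like tiktoken); the function name here is ours:

```python
def estimate_tokens(text: str) -> int:
    """Rough token estimate using the ~4 characters per token heuristic."""
    return max(1, len(text) // 4)

# A 1,000-word toy "essay" of 5-character words (including the space):
essay = "word " * 1000
print(estimate_tokens(essay))          # 1250 -- in the right ballpark
print(estimate_tokens("Hello, world!"))  # 3 -- the real tokenizer says 4
```

Good enough for budgeting; never use it where an exact count matters.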
## Context Window
The context window is the model's working memory - the total tokens it can see at once.
It's one shared bucket containing everything:
[ system prompt ] + [ conversation history ] + [ current message ] + [ response ]
↑ all of this together must fit within the limit
| Model | Limit |
|---|---|
| Claude Opus/Sonnet 4.6 | 200K tokens (1M beta) |
| GPT-5 / o4-mini | 200K tokens |
| Gemini 3 Pro | 1M tokens |
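In code, the "one shared bucket" constraint is just a sum. A minimal sketch (the token counts are illustrative; real counting needs the model's tokenizer):

```python
CONTEXT_LIMIT = 200_000  # e.g. a 200K-token model from the table above

system_prompt   = 1_200    # illustrative token counts
history         = 150_000
current_message = 800
max_response    = 8_000    # space reserved for the model's reply

used = system_prompt + history + current_message + max_response
print(f"{used:,} / {CONTEXT_LIMIT:,} tokens -> fits: {used <= CONTEXT_LIMIT}")
```

Note that the reserved response space counts against the limit too, which is easy to forget when budgeting.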
The agent implication: A long-running agent keeps appending to that conversation history. Eventually it hits the ceiling and breaks - unless you manage it. That's why memory management is a real engineering problem in Phase 2.
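One common mitigation is a sliding window: drop the oldest turns until the history fits a token budget. A minimal sketch (the budget and message shape are assumptions, and production agents often summarize old turns rather than simply dropping them):

```python
def trim_history(messages: list[dict], budget: int) -> list[dict]:
    """Drop oldest messages until the estimated token count fits the budget."""
    def estimate(m: dict) -> int:
        return len(m["content"]) // 4  # rough ~4 chars/token heuristic

    trimmed = list(messages)
    while trimmed and sum(estimate(m) for m in trimmed) > budget:
        trimmed.pop(0)  # evict the oldest turn first
    return trimmed

history = [
    {"role": "user", "content": "x" * 4000},       # ~1000 tokens
    {"role": "assistant", "content": "y" * 4000},  # ~1000 tokens
    {"role": "user", "content": "z" * 400},        # ~100 tokens
]
print(len(trim_history(history, budget=1200)))  # oldest turn dropped -> 2
```

The tradeoff: eviction is simple and cheap, but the agent forgets whatever was in the dropped turns. That is why summarization-based memory shows up later in the course.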
## Temperature
Temperature controls how random the model's token choices are.
- 0 → always picks the most probable next token. Deterministic, consistent.
- 1.0 → more random: explores less likely tokens. More creative, less reliable.
Think of it like this: the model assigns a probability to every possible next token. Temperature 0 always picks the winner. Higher temperature gives the runner-ups a fighting chance.
For agents: use 0 or close to it. You want reliable, predictable behavior - not creative surprises in the middle of a task.
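Under the hood, temperature is just a divisor applied to the model's raw scores (logits) before they are turned into probabilities. A toy illustration with three made-up candidate tokens (temperature 0 is the limiting case of always taking the argmax, so we use small positive values here):

```python
import math

def softmax_with_temperature(logits: list[float], temperature: float) -> list[float]:
    """Convert raw scores to probabilities, scaled by temperature."""
    scaled = [x / temperature for x in logits]
    exps = [math.exp(x) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.1]  # hypothetical scores for three candidate tokens

# Low temperature sharpens the distribution toward the top token;
# high temperature flattens it, giving the runner-ups a fighting chance.
print(softmax_with_temperature(logits, 0.5))
print(softmax_with_temperature(logits, 2.0))
```

Running this, the top token's probability is much larger at temperature 0.5 than at 2.0, while the ordering of candidates never changes; only how often the sampler strays from the winner does.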
## The Model Landscape
| Family | Examples | Who makes it |
|---|---|---|
| Claude | Opus 4.6, Sonnet 4.6, Haiku 4.5 | Anthropic |
| GPT | GPT-5, o4-mini, GPT-4.1 | OpenAI |
| Gemini | Gemini 3 Pro, 3 Flash | Google |
| Open source | Llama, Mistral, DeepSeek R1, Qwen 3 | Meta, Mistral AI, DeepSeek, Alibaba |
Within a family, there's usually a capability/cost tradeoff:
- Opus 4.6 / GPT-5 → most capable, most expensive
- Haiku/smaller models → fast, cheap, good enough for simpler tasks
In agent systems, you'll often use cheaper models for routine subtasks and expensive ones only where it matters.
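That routing decision can start as a simple dispatch table. A sketch with made-up model names and a hypothetical complexity flag, not real API identifiers:

```python
# Placeholder model names -- substitute your provider's real identifiers
MODEL_FOR = {
    "simple": "small-fast-model",     # e.g. a Haiku-class model
    "complex": "large-capable-model", # e.g. an Opus-class model
}

def pick_model(task: str, is_complex: bool) -> str:
    """Route routine subtasks to the cheap model, hard ones to the big one."""
    return MODEL_FOR["complex" if is_complex else "simple"]

print(pick_model("summarize this log line", is_complex=False))  # small-fast-model
print(pick_model("plan a multi-file refactor", is_complex=True))  # large-capable-model
```

In practice the `is_complex` decision itself might come from a cheap classifier call, but the shape stays the same: a router in front of two or more model tiers.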