Module 1.1
How LLMs Work
What is an LLM?
At its core, an LLM is a next-token predictor. It was trained on a massive amount of text, and the single task it learned is: given everything before this point, what word (or chunk of a word) comes next?
But this one task, done at enormous scale, produces something that can reason, write code, argue philosophy, and debug your agent.
The key mental model to hold: The model doesn't know things the way you do. It has internalized statistical patterns of language. When it says something confidently wrong, it's not lying - it's predicting plausible-sounding text. This matters when you build agents because you can't trust output blindly.
Tokens
LLMs don't read words. They read tokens - chunks of text the model was trained to recognize.
"Hello, world!" → ["Hello", ",", " world", "!"] = 4 tokens
"tokenization" → ["token", "ization"] = 2 tokens
"AI agent" → ["AI", " agent"] = 2 tokens
Notice the space before "world" and "agent" is part of the token. It's not character-by-character, not word-by-word - it's somewhere in between.
Why it matters:
- You pay per token (in + out)
- Every model has a max token limit for the entire conversation
- Rule of thumb: ~4 characters = 1 token. A 1,000-word essay ≈ 1,300 tokens
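The rule of thumb above can be turned into a quick budgeting helper. This is only a heuristic sketch; real tokenizers (like the one behind the Tiktokenizer visualizer in the sources) give exact counts per model.

```python
# Rough token estimator using the ~4 characters-per-token rule of thumb.
# Only for quick budgeting; actual tokenizers vary by model.
def estimate_tokens(text: str) -> int:
    """Approximate token count: ~4 characters per token, rounded up."""
    return max(1, -(-len(text) // 4))  # ceiling division

# A 1,000-word essay at ~5 characters per word (including spaces)
# lands in the same ballpark as the rule of thumb above.
essay = "word " * 1000
print(estimate_tokens(essay))  # 1250
```

Use an exact tokenizer when billing or hard limits are at stake; the heuristic is for back-of-envelope planning only.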
Context Window
The context window is the model's working memory - the total tokens it can see at once.
It's one shared bucket containing everything:
[ system prompt ] + [ conversation history ] + [ current message ] + [ response ]
→ all of this together must fit within the limit
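The shared-bucket idea can be sketched as a simple budget check. The function name, the component breakdown, and the 200K default are illustrative assumptions, not an API from any provider.

```python
# Sketch: everything shares one token budget. Numbers are illustrative.
def fits_in_context(system_tokens: int, history_tokens: int,
                    message_tokens: int, reserved_for_response: int,
                    limit: int = 200_000) -> bool:
    """True if the full request plus reserved output fits the window."""
    total = (system_tokens + history_tokens
             + message_tokens + reserved_for_response)
    return total <= limit

print(fits_in_context(1_000, 150_000, 2_000, 4_000))  # True
print(fits_in_context(1_000, 198_000, 2_000, 4_000))  # False
```

Note that the response is part of the same budget: if you want up to 4,000 output tokens, you must leave room for them before sending the request.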
| Model | Limit |
|---|---|
| Claude Opus/Sonnet 4.6 | 200K tokens (1M beta for select tiers) |
| GPT-5 | 400K tokens |
| Gemini 3.1 Pro / Flash | Up to 1M tokens |
The agent implication: A long-running agent keeps appending to that conversation history. Eventually it hits the ceiling and breaks - unless you manage it. That's why memory management is a real engineering problem in Phase 3.
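One common way to manage that ceiling is to evict the oldest turns when the history grows too large. A minimal sketch, assuming each message carries a precomputed token count (real systems often summarize instead of dropping):

```python
# Drop the oldest turns until the history fits a token budget.
def trim_history(messages: list[dict], budget: int) -> list[dict]:
    """messages: dicts with a 'tokens' field, ordered oldest first."""
    trimmed = list(messages)  # don't mutate the caller's list
    total = sum(m["tokens"] for m in trimmed)
    while trimmed and total > budget:
        total -= trimmed.pop(0)["tokens"]  # evict the oldest turn
    return trimmed

history = [{"role": "user", "tokens": 500},
           {"role": "assistant", "tokens": 800},
           {"role": "user", "tokens": 300}]
print(trim_history(history, budget=1000))  # keeps only the newest turn
```

Dropping turns loses information, which is exactly why Phase 3 treats memory management as a design problem rather than a one-liner.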
Temperature
Controls how random the model's word choices are.
- 0 → Always picks the most probable next token. Deterministic, consistent.
- 1.0 → More random. Explores less likely word choices. More creative, less reliable.
Think of it like this: the model assigns a probability to every possible next token. Temperature 0 always picks the winner. Higher temperature gives the runner-ups a fighting chance.
For most agent tasks, use 0 or close to it. Predictable behavior usually matters more than creativity, though exploratory or brainstorming agents may benefit from higher values.
The Model Landscape
| Family | Examples | Who makes it |
|---|---|---|
| Claude | Opus 4.6, Sonnet 4.6, Haiku 4.5 | Anthropic |
| GPT | GPT-5, o4-mini, GPT-4.1 | OpenAI |
| Gemini | Gemini 3 Pro, 3 Flash | Google |
| Open source | Llama, Mistral, DeepSeek R1, Qwen 3 | Meta, Mistral AI, DeepSeek, Alibaba |
Within a family, there's usually a capability/cost tradeoff:
- Opus 4.6 / GPT-5 → most capable, most expensive
- Haiku/smaller models → fast, cheap, good enough for simpler tasks
In agent systems, you'll often use cheaper models for routine subtasks and expensive ones only where it matters.
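That routing pattern can be sketched in a few lines. The tier names, the complexity score, and the threshold are all illustrative assumptions; real routers use classifiers, heuristics, or the task type itself.

```python
# Hypothetical router: cheap model for routine subtasks, capable model
# only where it matters. Names and threshold are assumptions.
CHEAP, CAPABLE = "haiku-class", "opus-class"

def pick_model(task_complexity: float) -> str:
    """task_complexity in [0, 1]; the 0.7 threshold is a tunable guess."""
    return CAPABLE if task_complexity > 0.7 else CHEAP

print(pick_model(0.2))  # haiku-class
print(pick_model(0.9))  # opus-class
```

The payoff is cost: if most subtasks are routine, the expensive model only runs on the small fraction that actually needs it.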
Sources
- Attention Is All You Need - Vaswani et al., 2017
- Anthropic - Context Window Documentation
- Tiktokenizer - Token Visualizer