AI Engineering Curriculum
Phase 0: Foundations

Module 0.1

How LLMs Work

What is an LLM?

At its core, an LLM is a next-token predictor. It was trained on a massive amount of text, and all it learned to do is answer one question: given everything before this point, what word or chunk of text comes next?

That's it. But this one simple task, done at enormous scale, produces something that can reason, write code, argue philosophy, and debug your agent.

The key mental model to hold: The model doesn't know things the way you do. It has internalized statistical patterns of language. When it says something confidently wrong, it's not lying - it's predicting plausible-sounding text. This matters when you build agents because you can't trust output blindly.


Tokens

LLMs don't read words. They read tokens - chunks of text the model was trained to recognize.

"Hello, world!"   →  ["Hello", ",", " world", "!"]      = 4 tokens
"tokenization"    →  ["token", "ization"]                = 2 tokens
"AI agent"        →  ["AI", " agent"]                    = 2 tokens

Notice the space before "world" and "agent" is part of the token. It's not character-by-character, not word-by-word - it's somewhere in between.

Why it matters:

  • You pay per token (in + out)
  • Every model has a max token limit for the entire conversation
  • Rule of thumb: ~4 characters = 1 token. A 1,000-word essay ≈ 1,300 tokens
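The rule of thumb above can be turned into a back-of-the-envelope estimator. This is only a sketch: real counts depend on the model's actual tokenizer, and the price used below is a made-up placeholder, not a real rate.

```python
def estimate_tokens(text: str) -> int:
    """Approximate token count using the ~4 characters per token heuristic."""
    return max(1, round(len(text) / 4))

def estimate_cost(text: str, usd_per_million_tokens: float) -> float:
    """Approximate input cost at a hypothetical per-million-token price."""
    return estimate_tokens(text) / 1_000_000 * usd_per_million_tokens

essay = "word " * 1000               # a 1,000-word stand-in essay
print(estimate_tokens(essay))        # ~1,250 with this heuristic
print(estimate_cost(essay, 3.0))     # at a hypothetical $3 / 1M input tokens
```

Close enough for budgeting; for exact counts you'd use the provider's tokenizer (e.g. OpenAI's tiktoken library).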

Context Window

The context window is the model's working memory - the total tokens it can see at once.

It's one shared bucket containing everything:

[ system prompt ] + [ conversation history ] + [ current message ] + [ response ]
                    ↑ all of this together must fit within the limit
Model                      Limit
Claude Opus/Sonnet 4.6     200K tokens (1M beta)
GPT-5 / o4-mini            200K tokens
Gemini 3 Pro               1M tokens

The agent implication: A long-running agent keeps appending to that conversation history. Eventually it hits the ceiling and breaks - unless you manage it. That's why memory management is a real engineering problem in Phase 2.
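One common way to manage that ceiling is to keep the system prompt and drop the oldest turns once the total would blow the budget. A minimal sketch, using the ~4 chars/token heuristic from above (a real agent would count with the model's tokenizer, and might summarize old turns instead of dropping them):

```python
def estimate_tokens(text: str) -> int:
    """Rough token count: ~4 characters per token."""
    return max(1, len(text) // 4)

def trim_history(system_prompt: str, history: list[str], budget: int) -> list[str]:
    """Keep the most recent turns that fit alongside the system prompt."""
    remaining = budget - estimate_tokens(system_prompt)
    kept: list[str] = []
    for turn in reversed(history):       # walk newest-to-oldest
        cost = estimate_tokens(turn)
        if cost > remaining:
            break                        # oldest turns get dropped first
        kept.append(turn)
        remaining -= cost
    return list(reversed(kept))          # restore chronological order
```

The agent then sends `system_prompt + trim_history(...)` on every call, so the request always fits the context window.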


Temperature

Controls how random the model's word choices are.

  • 0 → Always picks the most probable next token. Deterministic, consistent.
  • 1.0 → More random. Explores less likely word choices. More creative, less reliable.

Think of it like this: the model assigns a probability to every possible next token. Temperature 0 always picks the winner. Higher temperature gives the runners-up a fighting chance.

For agents: use 0 or close to it. You want reliable, predictable behavior - not creative surprises in the middle of a task.
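Under the hood, temperature divides the model's raw scores (logits) before they're turned into probabilities: low temperature sharpens the distribution toward the winner, high temperature flattens it. A minimal sketch with made-up scores for three candidate tokens:

```python
import math

def probs_with_temperature(logits: list[float], temperature: float) -> list[float]:
    """Softmax over logits scaled by temperature; 0 means greedy decoding."""
    if temperature == 0:
        # Greedy: all probability mass on the top-scoring token.
        best = logits.index(max(logits))
        return [1.0 if i == best else 0.0 for i in range(len(logits))]
    scaled = [l / temperature for l in logits]
    m = max(scaled)                              # subtract max for stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.1]                         # hypothetical token scores
print(probs_with_temperature(logits, 1.0))       # winner likely, not certain
print(probs_with_temperature(logits, 0))         # [1.0, 0.0, 0.0]: greedy
```

At temperature 1.0 the runner-up tokens still get sampled sometimes; at 0 they never do, which is why agents run at or near 0.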


The Model Landscape

Family        Examples                              Who makes it
Claude        Opus 4.6, Sonnet 4.6, Haiku 4.5       Anthropic
GPT           GPT-5, o4-mini, GPT-4.1               OpenAI
Gemini        Gemini 3 Pro, 3 Flash                 Google
Open source   Llama, Mistral, DeepSeek R1, Qwen 3   Meta, community

Within a family, there's usually a capability/cost tradeoff:

  • Opus 4.6 / GPT-5 → most capable, most expensive
  • Haiku/smaller models → fast, cheap, good enough for simpler tasks

In agent systems, you'll often use cheaper models for routine subtasks and expensive ones only where it matters.
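That routing decision can be as simple as a lookup before each subtask. A sketch of the idea: the model identifiers and the keyword heuristic below are illustrative assumptions, not a real API, and production routers often use a classifier or the cheap model itself to decide.

```python
# Hypothetical model identifiers for this sketch.
CHEAP_MODEL = "small-fast-model"
STRONG_MODEL = "large-capable-model"

def pick_model(task: str) -> str:
    """Route hard-sounding subtasks to the strong model, the rest to the cheap one."""
    hard_signals = ("architect", "debug", "prove", "plan")
    if any(word in task.lower() for word in hard_signals):
        return STRONG_MODEL
    return CHEAP_MODEL

print(pick_model("summarize this email"))        # routine: cheap model
print(pick_model("debug the failing agent"))     # hard: strong model
```

Even a crude router like this can cut costs substantially, since most agent turns are routine.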