Basics of AI & Agents Curriculum
Phase 1: Foundations · 4 min read

Module 1.1

How LLMs Work

What is an LLM?

At its core, an LLM is a next-token predictor. It was trained on a massive amount of text, and all it learned to do is: given everything before this point, what word/chunk comes next?

But this one task, done at enormous scale, produces something that can reason, write code, argue philosophy, and debug your agent.

The key mental model to hold: The model doesn't know things the way you do. It has internalized statistical patterns of language. When it says something confidently wrong, it's not lying - it's predicting plausible-sounding text. This matters when you build agents because you can't trust output blindly.


Tokens

LLMs don't read words. They read tokens - chunks of text the model was trained to recognize.

"Hello, world!"   β†’  ["Hello", ",", " world", "!"]      = 4 tokens
"tokenization"    β†’  ["token", "ization"]                = 2 tokens
"AI agent"        β†’  ["AI", " agent"]                    = 2 tokens

Notice the space before "world" and "agent" is part of the token. It's not character-by-character, not word-by-word - it's somewhere in between.

Why it matters:

  • You pay per token (in + out)
  • Every model has a max token limit for the entire conversation
  • Rule of thumb: ~4 characters ≈ 1 token. A 1,000-word essay ≈ 1,300 tokens
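The two rules of thumb above can be turned into a quick estimator. This is a sketch, not a real tokenizer: actual counts depend on the model's own vocabulary, so use the provider's tokenizer when billing accuracy matters.

```python
def estimate_tokens_by_chars(text: str) -> int:
    """Rough estimate using the ~4 characters per token rule of thumb."""
    return max(1, len(text) // 4)

def estimate_tokens_by_words(text: str) -> int:
    """Alternative rule of thumb: ~1.3 tokens per word
    (a 1,000-word essay comes out to ~1,300 tokens)."""
    return max(1, round(len(text.split()) * 1.3))
```

Both are approximations; they mostly agree for ordinary English prose and diverge for code, non-English text, or long rare words.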

Context Window

The context window is the model's working memory - the total tokens it can see at once.

It's one shared bucket containing everything:

[ system prompt ] + [ conversation history ] + [ current message ] + [ response ]
                    ↑ all of this together must fit within the limit

Model                     Limit
Claude Opus/Sonnet 4.6    200K tokens (1M beta for select tiers)
GPT-5                     400K tokens
Gemini 3.1 Pro / Flash    Up to 1M tokens

The agent implication: A long-running agent keeps appending to that conversation history. Eventually it hits the ceiling and breaks - unless you manage it. That's why memory management is a real engineering problem in Phase 3.
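One simple way to manage that ceiling is to evict the oldest turns once the estimated token count exceeds a budget. A minimal sketch, assuming messages are dicts with `role` and `content` keys and using the ~4-chars-per-token estimate; production agents often summarize old turns instead of dropping them outright:

```python
def trim_history(messages, budget, estimate=lambda m: len(m["content"]) // 4):
    """Drop the oldest non-system messages until the estimated total
    fits within `budget` tokens. The system prompt is always kept."""
    system = [m for m in messages if m["role"] == "system"]
    rest = [m for m in messages if m["role"] != "system"]
    while rest and sum(estimate(m) for m in system + rest) > budget:
        rest.pop(0)  # evict the oldest conversation turn first
    return system + rest
```

Eviction is lossy: the agent forgets whatever was in the dropped turns, which is exactly why memory management gets its own treatment in Phase 3.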


Temperature

Controls how random the model's word choices are.

  • 0 → Always picks the most probable next token. Deterministic, consistent.
  • 1.0 → More random. Explores less likely word choices. More creative, less reliable.

Think of it like this: the model assigns a probability to every possible next token. Temperature 0 always picks the winner. Higher temperature gives the runners-up a fighting chance.
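That intuition maps onto a temperature-scaled softmax. A toy sketch over raw model scores (logits); real samplers layer on extras like top-p, but the temperature mechanic is this:

```python
import math

def next_token_probs(logits, temperature):
    """Convert raw logits into next-token probabilities at a given temperature.
    Temperature 0 is greedy (all mass on the top logit);
    higher temperatures flatten the distribution toward the runners-up."""
    if temperature == 0:
        best = max(range(len(logits)), key=lambda i: logits[i])
        return [1.0 if i == best else 0.0 for i in range(len(logits))]
    scaled = [l / temperature for l in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.1]            # toy scores for three candidate tokens
next_token_probs(logits, 0)         # → [1.0, 0.0, 0.0]  (always the winner)
next_token_probs(logits, 2.0)       # flatter: runners-up get real probability
```

Dividing logits by a larger temperature shrinks the gaps between them before the softmax, which is why high temperatures make unlikely tokens more competitive.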

For most agent tasks, use 0 or close to it. Predictable behavior usually matters more than creativity, though exploratory or brainstorming agents may benefit from higher values.


The Model Landscape

Family         Examples                              Who makes it
Claude         Opus 4.6, Sonnet 4.6, Haiku 4.5       Anthropic
GPT            GPT-5, o4-mini, GPT-4.1               OpenAI
Gemini         Gemini 3 Pro, 3 Flash                 Google
Open source    Llama, Mistral, DeepSeek R1, Qwen 3   Meta, community

Within a family, there's usually a capability/cost tradeoff:

  • Opus 4.6 / GPT-5 → most capable, most expensive
  • Haiku/smaller models → fast, cheap, good enough for simpler tasks

In agent systems, you'll often use cheaper models for routine subtasks and expensive ones only where it matters.
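That routing idea can be sketched as a tiny dispatcher. The model names follow the table above but are placeholders, and the keyword heuristic is purely illustrative; real systems might use a classifier or let a cheap model triage the request:

```python
def pick_model(task: str) -> str:
    """Toy router: send routine subtasks to a cheap, fast model and
    reserve the flagship for everything else. Keyword matching here
    stands in for a real task classifier."""
    routine_markers = ("summarize", "extract", "classify", "reformat")
    if any(marker in task.lower() for marker in routine_markers):
        return "haiku"  # cheap/fast tier
    return "opus"       # capable/expensive tier
```

The payoff is cost: if most subtasks in an agent loop are routine, routing them to the cheap tier can cut spend dramatically while keeping the hard steps on the strong model.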

