Module 2.1
Anthropic Claude SDK (Direct API)
What the SDK Actually Is
Every LLM — Claude, GPT, Gemini — lives behind an HTTP server. You send it a JSON request, it sends you a JSON response. That's the whole thing.
The anthropic Python package doesn't do anything magical. It's a convenience wrapper around that HTTP call — handling auth headers, serializing your request, parsing the response — so you don't have to write the raw HTTP yourself.
The key mental model: You are making an API call. Every parameter you pass shapes that one call. Nothing persists between calls unless you carry it forward.
Install the SDK:

```bash
pip install anthropic
```

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from environment

response = client.messages.create(
    model="claude-opus-4-6",
    max_tokens=1024,
    messages=[
        {"role": "user", "content": "What is an AI agent?"}
    ]
)

print(response.content[0].text)
```

The Three Required Parameters
model is which Claude to use. Different models trade capability for cost and speed — claude-opus-4-6 for hard reasoning tasks, claude-sonnet-4-6 for everyday work, claude-haiku-4-5-20251001 for fast cheap tasks. You'll mix them in agent systems.
max_tokens is a hard ceiling on how long the response can be — not a target. Claude stops naturally when done and only uses what it needs. If you set this too low and Claude has more to say, the response gets cut off mid-sentence. That's a silent failure, not an error — which is why the next concept matters.
messages is the conversation history. It must alternate user / assistant turns. This is how you communicate — and how you maintain context across a conversation.
The Response and Why stop_reason Matters
```python
response.content[0].text       # the text Claude generated
response.stop_reason           # WHY Claude stopped — critical
response.usage.input_tokens    # tokens you sent
response.usage.output_tokens   # tokens Claude generated
```

stop_reason is the most important field in the response. These are the four values you'll deal with day to day:
| Value | What happened |
|---|---|
| `"end_turn"` | Claude finished naturally. Normal. |
| `"max_tokens"` | Hit your ceiling. Response is truncated. |
| `"stop_sequence"` | Hit a custom stop string you defined. |
| `"tool_use"` | Claude wants to call a tool. You must handle this. |
Why it matters for agents: In an agent loop, you can't just read .content[0].text and assume everything went well. If stop_reason is "max_tokens", the response is incomplete. If it's "tool_use", there's no text yet — Claude is waiting for you to run a function. Always check stop_reason before doing anything with the response.
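That branching is easy to centralize in one place. A minimal sketch — the helper name and its return labels are my own, not part of the SDK:

```python
def classify_response(response) -> str:
    """Map a Messages API response to the action an agent loop should take."""
    if response.stop_reason == "tool_use":
        return "run_tool"   # no final text yet: execute the requested tool
    if response.stop_reason == "max_tokens":
        return "truncated"  # incomplete: retry with a higher max_tokens
    if response.stop_reason == "stop_sequence":
        return "hit_stop"   # a custom stop string fired
    return "done"           # "end_turn": safe to read content[0].text
```

An agent loop then dispatches on the label instead of scattering stop_reason checks everywhere.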
The Stateless API and Multi-Turn Conversations
The model has no memory between calls. Every API call is completely independent. When you want a conversation, you are responsible for passing the entire history on every single call.
```python
messages = []

# Turn 1
messages.append({"role": "user", "content": "What is RAG?"})
response = client.messages.create(
    model="claude-opus-4-6",
    max_tokens=512,
    messages=messages
)
messages.append({"role": "assistant", "content": response.content[0].text})

# Turn 2 — the full history goes again
messages.append({"role": "user", "content": "What are its limitations?"})
response = client.messages.create(
    model="claude-opus-4-6",
    max_tokens=512,
    messages=messages
)
```

messages is a plain Python list. After every turn, you append both the user message and the assistant reply. Then you send the entire list on the next call. Claude doesn't know what was said before — it only knows what you put in that list.
The agent implication: This is exactly why context windows fill up. Every turn adds more tokens to the list. A long-running agent will eventually hit the ceiling. Managing that list — when to trim it, when to summarize it — is a real engineering problem you'll solve in Module 2.3.
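The simplest version of that management is a sliding window over the list. A sketch — trim_history is illustrative, not an SDK feature, and summarization is the smarter variant you'll see later:

```python
def trim_history(messages: list[dict], max_messages: int = 20) -> list[dict]:
    """Keep only the most recent messages, making sure the list still
    opens with a "user" message as the API expects."""
    trimmed = messages[-max_messages:]
    while trimmed and trimmed[0]["role"] != "user":
        trimmed.pop(0)  # drop a leading assistant message left over from the cut
    return trimmed
```

Cutting on an arbitrary boundary can leave an assistant message first, which the API won't accept — hence the cleanup loop.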
One gotcha: if you send two consecutive user messages without an assistant message between them, the API auto-merges them. Usually not what you want.
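If you'd rather not rely on that auto-merge, you can normalize the history yourself before sending it. A sketch, with normalize_turns as a hypothetical helper:

```python
def normalize_turns(messages: list[dict]) -> list[dict]:
    """Merge consecutive messages with the same role so the history
    alternates cleanly instead of depending on the API's auto-merge."""
    merged: list[dict] = []
    for msg in messages:
        if merged and merged[-1]["role"] == msg["role"]:
            merged[-1] = {
                "role": msg["role"],
                "content": merged[-1]["content"] + "\n\n" + msg["content"],
            }
        else:
            merged.append(dict(msg))
    return merged
```

Now two back-to-back user messages become one turn with a visible separator, under your control rather than the API's.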
System Prompts
The system prompt is your instruction set — it runs before any user message and defines everything about how Claude behaves in this session.
```python
client.messages.create(
    model="claude-opus-4-6",
    max_tokens=1024,
    system="You are a senior Python engineer. Always use type hints. Never use print() — use logging.",
    messages=[{"role": "user", "content": "Write a function to parse a config file"}]
)
```

Critical gotcha: system is a top-level parameter, not a message. This is the most common mistake when first using the API:
```python
# ❌ WRONG — there is no "system" role in the Messages API
messages=[
    {"role": "system", "content": "You are helpful"},
    {"role": "user", "content": "Hello"}
]

# ✅ RIGHT
system="You are helpful",
messages=[{"role": "user", "content": "Hello"}]
```

The API rejects the unrecognized "system" role with an invalid request error, so at least the mistake fails loudly. It's still one of the most common slips, because the habit carries over from SDKs where a system message is a normal part of the list.
Temperature
Controls how random Claude's word choices are at each step.
- 0 → always picks the most probable next token. Effectively deterministic: the same input produces the same output in almost every case.
- 1.0 → the default. More varied; explores less likely word choices.
For agents: use 0. You want consistent, predictable tool calls and decisions — not creative variations. Unpredictability in an agent loop compounds into unreliable behavior fast.
Streaming
By default, you wait for the entire response before seeing anything. Streaming sends tokens as they're generated:
```python
with client.messages.stream(
    model="claude-opus-4-6",
    max_tokens=1024,
    messages=[{"role": "user", "content": "Explain RAG in detail"}]
) as stream:
    for text in stream.text_stream:
        print(text, end="", flush=True)
```

Use streaming whenever a human is waiting and the response might be long. For autonomous agent loops where you process the full output programmatically, it's less important.
Extended Thinking
For hard reasoning problems, you can give Claude a private scratchpad to think before answering. The thinking comes back as its own content block, separate from the final text answer: it's Claude working through the problem before committing to a response.
```python
response = client.messages.create(
    model="claude-opus-4-6",
    max_tokens=10000,  # must be high — thinking uses these tokens too
    thinking={"type": "enabled", "budget_tokens": 5000},
    messages=[{"role": "user", "content": "Design a database schema for a multi-tenant SaaS app"}]
)

for block in response.content:
    if block.type == "thinking":
        print("Claude's reasoning:", block.thinking)
    elif block.type == "text":
        print("Answer:", block.text)
```

budget_tokens is how much of your max_tokens Claude can spend thinking. Must be at least 1024. Use this for architecture decisions and complex reasoning — not for simple queries.
Prompt Caching
If you're making repeated calls with the same large prefix — a long system prompt, a full codebase — you're paying to re-send those tokens every time. Caching stores the prefix for 5 minutes. Subsequent calls cost ~90% less and run faster.
```python
client.messages.create(
    model="claude-opus-4-6",
    max_tokens=1024,
    system=[
        {"type": "text", "text": "You are a helpful assistant."},
        {
            "type": "text",
            "text": entire_codebase_as_string,
            "cache_control": {"type": "ephemeral"}  # cache everything up to here
        }
    ],
    messages=[{"role": "user", "content": "Find all security vulnerabilities"}]
)
```

Why it matters for agents: Agents often run the same system prompt hundreds of times a day. Caching cuts both the cost and latency of every call after the first.
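Rough arithmetic on the savings. The per-token price below is a placeholder (check current pricing); the assumption doing the work is that cache reads bill at about 10% of the base input rate:

```python
def input_cost(total_tokens: int, cached_tokens: int,
               price_per_token: float = 3e-06) -> float:
    """Estimated input cost when cached_tokens of the prompt hit the cache.
    Assumes cache reads are billed at 10% of the base input-token price."""
    fresh = total_tokens - cached_tokens
    return fresh * price_per_token + cached_tokens * price_per_token * 0.1

# A 50k-token system prompt, re-sent on every call:
cold = input_cost(50_000, cached_tokens=0)       # first call, nothing cached
warm = input_cost(50_000, cached_tokens=50_000)  # later calls, fully cached
```

In production you can read response.usage.cache_read_input_tokens to confirm the cache is actually being hit rather than silently expiring between calls.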
Error Handling
Two errors you'll hit constantly:
Rate limits (429) — too many requests too fast. Fix: exponential backoff.
```python
import time

import anthropic

client = anthropic.Anthropic()

def call_with_backoff(messages, max_retries=5):
    for attempt in range(max_retries):
        try:
            return client.messages.create(
                model="claude-opus-4-6",
                max_tokens=1024,
                messages=messages
            )
        except anthropic.RateLimitError:
            if attempt == max_retries - 1:
                raise
            wait = 2 ** attempt  # 1s, 2s, 4s, 8s between retries
            time.sleep(wait)
```

Context overflow — conversation history too long. Fix: trim or summarize.
```python
# inside the same try/except wrapping client.messages.create(...)
except anthropic.BadRequestError as e:
    if "prompt is too long" in str(e):
        messages = messages[-10:]  # keep the last 10 messages, or summarize older content
```

Both are normal operating conditions in long-running agents — build them in from the start, not as afterthoughts.
Putting It Together
```python
import anthropic

client = anthropic.Anthropic()

def run_agent(system_prompt: str):
    messages: list[dict] = []
    while True:
        user_input = input("You: ").strip()
        if user_input.lower() in ("exit", "quit"):
            break

        messages.append({"role": "user", "content": user_input})
        response = client.messages.create(
            model="claude-opus-4-6",
            max_tokens=1024,
            temperature=0,
            system=system_prompt,
            messages=messages
        )

        if response.stop_reason == "max_tokens":
            print("[Warning: response truncated — increase max_tokens]")

        reply = response.content[0].text
        messages.append({"role": "assistant", "content": reply})
        print(f"\nClaude: {reply}")
        print(f"[{response.usage.input_tokens} in / {response.usage.output_tokens} out]\n")

if __name__ == "__main__":
    run_agent("You are a helpful AI engineering tutor.")
```

Nothing here is complicated. It's a loop, a list, and an API call. Everything you build on top of this — tool use, RAG, multi-agent systems — is just an elaboration of this pattern.