Module 5.3
Excessive Agency & Least Privilege
A successful prompt injection on a read-only agent leaks information. The same attack on an agent with delete permissions and an email send tool is a catastrophe. The difference isn't the injection — it's the agency.
This is LLM06 on the OWASP list, and it's the multiplier. Every other vulnerability in this phase is made dramatically worse by excessive agency. It's also the vulnerability you have the most control over, because it lives entirely in your architecture decisions — not in the model's behavior.
The Three Failure Modes
OWASP defines three distinct ways agents end up with too much power:
1. Excessive functionality. The tool does more than the task requires. You gave the agent a database connector with read, write, and delete because that's what the library exposes. The task only needed read. Now a successful injection can delete production data.
2. Excessive permissions. The credentials granted allow broader access than the task requires. The agent needs to read one user's calendar. You gave it credentials that access all calendars. One injected instruction later, every user's schedule is exposed.
3. Excessive autonomy. High-impact, irreversible actions execute without any human confirmation step. The agent decides to send an email, delete a file, or charge a card — and it just does it, because nothing in the architecture requires it to pause.
These three compound each other. An agent with excessive functionality, running with excessive permissions, operating with excessive autonomy is one successful injection away from a major incident.
The Minimal Footprint Principle
Anthropic's guidance on agent safety distills into what the industry calls the minimal footprint principle, expressed through four concrete rules:
Prefer reversible over irreversible actions. Before anything that cannot be undone — deleting a file, sending an email, calling an external API that charges money — pause. Reversible iteration is almost always the safer default. The cost of undoing a wrong reversible action is low. The cost of an irreversible one can be permanent.
Request minimal permissions. Ask for only what the current task requires. If you're building an agent that reads emails to summarize them, it should have read access to email — not send, not delete. If the task scope expands later, expand permissions then.
Pause at uncertainty. Rather than guessing, agents should stop and confirm when the right action is ambiguous. The cost of one extra confirmation message is trivial. The cost of guessing wrong on an irreversible action is not.
Sandboxed testing before deployment. Any agent that can take real-world actions must be tested in an isolated environment first. Test that the guardrails hold. Test that the HITL checkpoints trigger correctly. Test that the rate limiters work. Then deploy.
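The first and third rules, prefer reversible and pause at uncertainty, can be sketched as a single dispatch helper. This is a hypothetical illustration (the 0.8 confidence threshold, the candidate tuple shape, and the `"confirm_with_user"` sentinel are all assumptions, not Anthropic's API): when the best candidate action is low-confidence or irreversible, the agent pauses instead of guessing.

```python
def choose_action(candidates: list[tuple[str, float, bool]]) -> str:
    """Pick an action from (name, confidence, reversible) candidates.

    Returns "confirm_with_user" rather than guessing when the best
    candidate is low-confidence or would be irreversible.
    """
    name, confidence, reversible = max(candidates, key=lambda c: c[1])
    if confidence < 0.8 or not reversible:
        return "confirm_with_user"  # pause at uncertainty / irreversibility
    return name
```

The threshold itself matters less than where the check lives: in the dispatch path, so no candidate action can skip it.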
The Risk Classification Matrix
Before deciding whether an agent action runs autonomously or requires human approval, classify it across three dimensions:
Reversibility — can this action be undone?
- Fully reversible within seconds (create a draft, write a temp file)
- Reversible with effort (edit a database record with an audit log — can be rolled back)
- Irreversible or very costly to undo (send email, delete production data, charge a card, post publicly)
Blast radius — how much can go wrong?
- Affects only the agent's own working memory or temp files
- Affects the user's local files or personal data
- Affects external systems, other users, third parties, or public-facing surfaces
Data sensitivity — what data is involved?
- No sensitive data
- User PII, financial data, health records, or credentials
The decision rules:
| Reversibility | Blast radius | Data sensitivity | Decision |
|---|---|---|---|
| Reversible | Low | Low | Run autonomously |
| Reversible | Medium | Any | Log + consider dry-run |
| Irreversible | Any | Any | Human approval required |
| Any | High | Any | Human approval required |
| Any | Any | High | Human approval required |
When in doubt, require approval. The friction of one extra confirmation is always cheaper than the cost of a wrong irreversible action.
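The decision rules above can be collapsed into a single classifier. A minimal sketch, not code from OWASP or any framework; the string values for the dimensions and the `Decision` names are assumptions for illustration. Checking the approval conditions first encodes "when in doubt, require approval": an escalating dimension always wins over a permissive one.

```python
from enum import Enum


class Decision(Enum):
    AUTONOMOUS = "run autonomously"
    LOG_AND_DRY_RUN = "log + consider dry-run"
    HUMAN_APPROVAL = "human approval required"


def classify_action(reversible: bool, blast_radius: str, sensitivity: str) -> Decision:
    """Map the three risk dimensions onto a decision.

    blast_radius: "low" | "medium" | "high"
    sensitivity:  "low" | "high"
    """
    # Irreversible, high blast radius, or high sensitivity: always escalate.
    if not reversible or blast_radius == "high" or sensitivity == "high":
        return Decision.HUMAN_APPROVAL
    # Reversible but medium blast radius: log it, consider a dry run first.
    if blast_radius == "medium":
        return Decision.LOG_AND_DRY_RUN
    # Reversible, low blast radius, low sensitivity: safe to run.
    return Decision.AUTONOMOUS
```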
Implementing Permission Scoping
Define what each agent is allowed to do at instantiation time — not at runtime. This makes the scope explicit, reviewable, and enforceable before anything runs.
```python
from enum import Enum
from dataclasses import dataclass


class Permission(Enum):
    READ_FILES = "read_files"
    WRITE_FILES = "write_files"
    DELETE_FILES = "delete_files"
    SEND_EMAIL = "send_email"
    DATABASE_READ = "database_read"
    DATABASE_WRITE = "database_write"
    CALL_EXTERNAL_API = "call_external_api"
    EXECUTE_CODE = "execute_code"


@dataclass
class AgentPermissionScope:
    allowed: frozenset[Permission]
    require_approval_for: frozenset[Permission]
    max_actions_per_run: int = 50


# Research agent — reads only, write requires sign-off
RESEARCH_SCOPE = AgentPermissionScope(
    allowed=frozenset({
        Permission.READ_FILES,
        Permission.DATABASE_READ,
    }),
    require_approval_for=frozenset({
        Permission.WRITE_FILES,  # allowed but needs human OK
    }),
    max_actions_per_run=100,
)

# Task execution agent — broader access, tighter HITL requirements
TASK_AGENT_SCOPE = AgentPermissionScope(
    allowed=frozenset({
        Permission.READ_FILES,
        Permission.WRITE_FILES,
        Permission.EXECUTE_CODE,
        Permission.DATABASE_READ,
        Permission.CALL_EXTERNAL_API,
    }),
    require_approval_for=frozenset({
        Permission.SEND_EMAIL,      # irreversible
        Permission.DATABASE_WRITE,  # high blast radius
        Permission.DELETE_FILES,    # irreversible
    }),
    max_actions_per_run=25,
)
```

The require_approval_for set is the critical piece. These permissions are granted — the agent can do these things — but every attempt triggers a HITL checkpoint. The action is possible, not automatic.
The HITL Checkpoint Pattern
Human-in-the-loop (HITL) checkpoints are the architectural guarantee that high-stakes actions require human eyes before they execute. In production, "human eyes" means a Slack message, an email, a mobile notification — something that pulls a person in before the irreversible thing happens.
```python
class HumanApprovalGate:
    def __init__(self, notify_fn, timeout_seconds=300, auto_approve_reversible=True):
        self.notify = notify_fn
        self.timeout = timeout_seconds
        self.auto_approve_reversible = auto_approve_reversible

    def request_approval(self, action_name, params, is_reversible, risk_level) -> bool:
        # Low-risk reversible actions: auto-approve, no friction
        if is_reversible and risk_level == "low" and self.auto_approve_reversible:
            return True
        # Everything else: escalate to human
        self.notify(
            f"Agent Action Requires Approval: {action_name}",
            {"action": action_name, "params": params,
             "risk": risk_level, "reversible": is_reversible}
        )
        # In production: await webhook response, Slack reaction, SMS reply
        return self._wait_for_response(action_name)

    def _wait_for_response(self, action_name) -> bool:
        # Stub: fail closed. A production implementation blocks on the
        # approval channel (webhook, Slack reaction, SMS reply) and denies
        # the action if self.timeout elapses without an answer.
        return False
```

Two design principles here matter:
First, auto-approve low-risk reversible actions. If you make humans approve everything, they'll start rubber-stamping approvals without reading them — which is worse than no checkpoint at all. Reserve HITL for actions that genuinely warrant attention.
Second, the notification must give enough context to make a real decision. "Agent wants to do something — approve?" is useless. "Agent wants to send an email to external-address@gmail.com with subject 'Account Export' containing 847 rows of customer data — approve?" is a decision.
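A sketch of what "enough context" can look like in the notification payload. The helper name `approval_message` and the message layout are assumptions for illustration:

```python
def approval_message(action: str, params: dict, risk: str, reversible: bool) -> str:
    """Render an approval request with enough context for a real decision."""
    # Sort keys so the same request always renders the same message
    detail = ", ".join(f"{k}={v!r}" for k, v in sorted(params.items()))
    return (
        f"Agent action requires approval: {action}\n"
        f"  parameters: {detail}\n"
        f"  risk: {risk} | reversible: {'yes' if reversible else 'no'}"
    )
```

Passing a string like this to the gate's `notify_fn` puts the recipient, payload size, and risk level in front of the approver, rather than a bare "approve?".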
Poka-Yoke Tool Design
Anthropic introduces the concept of poka-yoke tools — tools designed so that misuse is structurally harder than correct use. The term comes from Japanese manufacturing: "mistake-proofing."
Applied to agent tools:
Require absolute file paths instead of relative ones. A tool that accepts ../../../etc/passwd is exploitable. A tool that validates the path starts with /home/agent/workspace/ is not.
Separate read and write into distinct tools. If the agent needs to read a file, give it a read-only tool. If it needs to write, give it a separate write tool. This makes the scope explicit and makes the HITL decision clear — "which tools does this agent have?" is answerable by looking at its tool list.
Make destructive operations require explicit confirmation parameters. Instead of delete_file(path), design delete_file(path, confirmed=False) where confirmed=True must be explicitly passed. The agent has to consciously decide to set that flag — it can't accidentally delete by omitting a parameter.
Document tools thoroughly. Every tool the agent can use should have a description that includes: what it does, what it can't undo, what the blast radius is. This goes into the system prompt. An agent that knows "this action is irreversible" is more likely to pause before taking it.
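The path-validation and explicit-confirmation patterns can be combined in a single tool. A minimal sketch; the `/home/agent/workspace` sandbox root is an assumption carried over from the path example above, and the docstring follows the documentation rule by stating irreversibility and blast radius.

```python
from pathlib import Path

WORKSPACE = Path("/home/agent/workspace")  # assumed sandbox root


def delete_file(path: str, confirmed: bool = False) -> str:
    """Delete a file inside the agent workspace.

    Irreversible. Blast radius: the user's workspace files only.
    Requires an absolute path and an explicit confirmed=True.
    """
    p = Path(path)
    if not p.is_absolute():
        raise ValueError("path must be absolute, not relative")
    # resolve() collapses any ../ segments before the containment check,
    # so traversal tricks like /home/agent/workspace/../../etc/passwd fail
    if not p.resolve().is_relative_to(WORKSPACE):
        raise ValueError(f"path escapes the workspace: {path}")
    if not confirmed:
        raise PermissionError("irreversible action: pass confirmed=True explicitly")
    p.unlink(missing_ok=True)
    return f"deleted {p}"
```

Each guard makes one class of misuse structurally impossible: relative paths are rejected, traversal out of the sandbox is rejected, and deletion cannot happen as a side effect of an omitted parameter.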
Why This Module Is Mostly Architecture, Not Code
The most important insight of this module: excessive agency is fixed at design time, not at runtime.
You can't add guardrails later to compensate for an over-privileged agent. By the time you're adding runtime filters, the permissions are already granted, the tools are already wired up, and all you're doing is putting a speed bump in front of a vehicle that already has the keys to every door.
The right time to ask "what does this agent actually need?" is before you write the first line of code. Define the scope. Define the HITL thresholds. Define the irreversible action list. Then build to that spec.
That discipline — narrow scope, explicit permissions, HITL on the dangerous stuff — is what separates agents that are safe to deploy from agents that are a liability waiting to happen.
Sources
- Anthropic — Building Effective Agents
- Anthropic — Framework for Safe and Trustworthy Agents
- AWS — The Agentic AI Security Scoping Matrix
- Permit.io — Human-in-the-Loop for AI Agents: Best Practices
- OWASP — LLM06: Excessive Agency