Module 6.3
Multimodal & Computer Use
Most of the web wasn't built for APIs. Legacy enterprise software, desktop applications, government portals, old internal tools — they expose no programmatic interface. The only way to automate them is to interact with them the way a human does: look at the screen and click things.
That's computer use. It's not the right tool for every job — direct APIs are always faster, more reliable, and cheaper when they exist. But for the enormous category of systems where no API exists, computer use is the unlock.
What Multimodal Agents Are
Multimodal agents combine text, vision, audio, and sometimes video processing into a unified reasoning system. They don't just read text — they can see screenshots, interpret images, understand diagrams, and act on what they see. The market reflects the trajectory: multimodal AI surpassed $1.6 billion in 2024 and is growing at 32.7% annually. By 2027, 40% of generative AI solutions are projected to be multimodal — up from 1% in 2023.
Claude's Computer Use API
The loop is simple: Claude sees a screenshot → decides what to do → takes an action → receives a new screenshot → repeats.
Available actions: screenshot (capture current display), mouse_move, click, type, scroll, key (keyboard shortcuts), and zoom (inspect regions in detail, added in the claude_20251124 model version).
import anthropic

client = anthropic.Anthropic()

tools = [{
    "type": "computer",
    "name": "computer",
    "display_width_px": 1920,
    "display_height_px": 1080,
}]

# Start the loop
messages = [{"role": "user", "content": "Book the first available meeting room for tomorrow at 2pm"}]

while True:
    response = client.messages.create(
        model="claude-opus-4-6",
        max_tokens=4096,
        tools=tools,
        messages=messages,
    )
    if response.stop_reason == "end_turn":
        break  # Claude finished the task

    # Echo Claude's turn, then answer each tool call with a tool_result
    messages.append({"role": "assistant", "content": response.content})
    tool_results = []
    for block in response.content:
        if block.type == "tool_use" and block.name == "computer":
            action = block.input["action"]
            if action == "click":
                perform_click(block.input["coordinate"])  # your click function
            # Every action gets a fresh screenshot back so Claude sees the result
            screenshot_b64 = take_screenshot()  # your screenshot function
            tool_results.append({
                "type": "tool_result",
                "tool_use_id": block.id,
                "content": [{"type": "image", "source": {
                    "type": "base64",
                    "media_type": "image/png",
                    "data": screenshot_b64,
                }}],
            })
    messages.append({"role": "user", "content": tool_results})

The loop continues until Claude's stop reason is end_turn, meaning it believes the task is complete.
Security is mandatory. Computer use agents can click malicious links, enter credentials into phishing sites, or install malware if not sandboxed. Always run computer use inside a VM with minimal network access, no sensitive data in the environment, and no access to production systems. This is not optional.
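One application-level defense you can layer on top of the sandbox, sketched below: a hypothetical navigation guard that the action executor consults before the agent goes anywhere. The names here (`ALLOWED_HOSTS`, `is_allowed_url`, `safe_navigate`) are illustrative, not part of any API.

```python
from urllib.parse import urlparse

# Illustrative allowlist. In a real deployment this belongs in the
# sandbox's network policy, not only in application code.
ALLOWED_HOSTS = {"calendar.internal.example.com", "rooms.internal.example.com"}

def is_allowed_url(url: str) -> bool:
    """Allow only http(s) URLs whose host is explicitly allowlisted."""
    parsed = urlparse(url)
    return parsed.scheme in ("http", "https") and parsed.hostname in ALLOWED_HOSTS

def safe_navigate(url: str) -> None:
    """Refuse to send the sandboxed browser anywhere off the allowlist."""
    if not is_allowed_url(url):
        raise PermissionError(f"Blocked navigation to {url}")
    # ... hand the URL to your browser automation here ...
```

Checks like this complement, not replace, VM isolation; the sandbox boundary is the real control.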
Vision-Only: When You Don't Need the Full Loop
Claude's vision model is excellent at reading text, identifying UI elements, and analyzing screenshots — without the computer use tool and without taking any actions. This is useful when:
- You need analysis but not automation ("describe what's on this dashboard screenshot")
- You're building a hybrid: Claude sees and reasons, a separate tool (Playwright, PyAutoGUI) executes
- Speed matters — the screenshot → reason → act → screenshot loop has meaningful latency overhead
- You're processing batches of images or video frames rather than interactive sessions
For many automation tasks, the hybrid approach is better: use Claude's vision for understanding and decision-making, use a deterministic automation tool for execution.
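A minimal sketch of that split, assuming Claude is prompted to reply with a single JSON action object. The action vocabulary (`click_selector`, `type_text`, `done`) and the `execute` dispatcher are invented for illustration; in practice the executor calls would be Playwright or PyAutoGUI.

```python
import json

def parse_action(model_reply: str) -> dict:
    """Parse Claude's reply into an action dict, e.g.
    {"action": "click_selector", "selector": "#book-room"}."""
    action = json.loads(model_reply)
    if action.get("action") not in {"click_selector", "type_text", "done"}:
        raise ValueError(f"Unknown action: {action!r}")
    return action

def execute(action: dict, page) -> None:
    """Deterministic executor: the vision model decides, Playwright acts."""
    if action["action"] == "click_selector":
        page.click(action["selector"])            # Playwright's click
    elif action["action"] == "type_text":
        page.fill(action["selector"], action["text"])  # Playwright's fill
```

Because the executor only accepts a closed set of actions, a confused or manipulated model cannot invent arbitrary behavior; it can only choose among moves you have whitelisted.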
Playwright + LLM — The Faster Alternative
Playwright MCP (launched March 2025) is a Model Context Protocol server that lets Claude control a browser using Playwright — Microsoft's cross-browser automation framework. Instead of working from screenshots, Playwright exposes the accessibility tree: a structured, text-based representation of the webpage's interactive elements.
The difference is significant:
- Computer use (vision-based): screenshot → LLM processes image → click coordinates → repeat. Slow, pixel-imprecise.
- Playwright MCP (accessibility tree): text representation of page structure → LLM reasons about elements by name → Playwright executes precise action. 10–100x faster.
# Playwright MCP flow
Claude gets: <button id="submit-order">Submit Order</button>
Claude reasons: "I need to click the Submit Order button"
Playwright executes: page.click("#submit-order")
Claude gets: updated accessibility tree
Loop continues
Use Playwright when the site has well-structured HTML and semantic elements. Use computer use when the interface is a legacy desktop app, a poorly built web app, or anything that doesn't expose proper semantic structure.
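To make the accessibility-tree idea concrete, here is a small sketch of the lookup an agent layer performs over a snapshot of the kind Playwright's `page.accessibility.snapshot()` returns (a nested dict of `role`/`name`/`children`). The `find_node` helper is illustrative, not a Playwright API.

```python
from typing import Optional

def find_node(node: dict, role: str, name: str) -> Optional[dict]:
    """Depth-first search of an accessibility-tree snapshot for the
    first node matching the given role and accessible name."""
    if node.get("role") == role and node.get("name") == name:
        return node
    for child in node.get("children", []):
        found = find_node(child, role, name)
        if found is not None:
            return found
    return None

# A tiny snapshot of the shape Playwright produces:
tree = {
    "role": "WebArea", "name": "Checkout",
    "children": [
        {"role": "textbox", "name": "Email"},
        {"role": "button", "name": "Submit Order"},
    ],
}
```

Matching by role and accessible name, rather than by pixels, is what makes this path both faster and more robust to cosmetic UI changes.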
Playwright Agents (launched October 2025): three specialized agents — Planner, Generator, and Healer — that automatically create, execute, and repair browser automation test suites. Early adopters report 70–80% faster test creation and 3–5x more test coverage.
Real Products Being Built
Salesforce Agentforce 2.0 deploys AI agents that qualify leads, update CRM records, and generate follow-up emails autonomously inside the Salesforce interface.
ServiceNow Vision AI Agents automate IT operations and incident management by reading and acting on enterprise system interfaces.
In manufacturing, on-premise multimodal systems combine cameras and microphones to detect equipment defects and anomalies — with data never leaving the facility.
In healthcare, multimodal triage agents combine vital signs data, voice assessment, and sensor readings for real-time patient evaluation.
Limitations and Gotchas
Element grounding is an unsolved hard problem. Even powerful vision models struggle to reliably identify exactly which pixel to click on in complex or cluttered interfaces. This gets worse with dynamic content, overlapping elements, and non-standard UI frameworks. Plan for failure and build retry logic.
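A minimal sketch of that retry logic: wrap any flaky UI action in exponential backoff. In practice you would also re-screenshot and re-ground the target element between attempts; this sketch retries the callable as-is.

```python
import time

def with_retries(action, attempts: int = 3, base_delay: float = 0.5):
    """Run a flaky UI action (any zero-argument callable, e.g. a click
    on a grounded element), retrying with exponential backoff."""
    for attempt in range(attempts):
        try:
            return action()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of attempts; surface the failure
            time.sleep(base_delay * (2 ** attempt))  # 0.5s, 1s, 2s, ...
```

The final re-raise matters: a grounding failure that survives all retries should fail loudly, not silently continue the task on a mis-clicked page.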
Computer use is slow. The screenshot-process-act loop adds meaningful latency compared to direct API calls or Playwright. For interactive real-time workflows, this is a UX constraint. For background automation tasks, it's acceptable.
Playwright breaks on SPAs and custom components. Heavily JavaScript-rendered single-page applications often don't expose proper semantic accessibility structure. The accessibility tree is empty or misleading. Computer use (vision) is the fallback — slower but more universally applicable.
Never run computer use with production credentials or sensitive data in scope. The agent can be manipulated via indirect prompt injection through the content it sees on screen. A malicious webpage the agent browses can contain instructions that cause it to take unintended actions.
Vision Agents for Document Intelligence
Not all multimodal agent work involves controlling a UI. A huge category is document intelligence — extracting structured data from images, PDFs, invoices, forms, and reports that have no machine-readable format.
The vision model reads the document as an image; the agent extracts structured output. This replaces expensive OCR pipelines and handles handwritten notes, complex table layouts, and mixed text/graphic documents that traditional parsers can't handle.
import anthropic
import base64
import json

client = anthropic.Anthropic()

def extract_invoice_data(image_path: str) -> dict:
    """Extract structured data from an invoice image."""
    with open(image_path, "rb") as f:
        image_data = base64.b64encode(f.read()).decode("utf-8")
    response = client.messages.create(
        model="claude-opus-4-6",
        max_tokens=1024,
        messages=[{
            "role": "user",
            "content": [
                {
                    "type": "image",
                    "source": {
                        "type": "base64",
                        "media_type": "image/png",
                        "data": image_data,
                    },
                },
                {
                    "type": "text",
                    "text": """Extract the following fields from this invoice as JSON:
{
  "vendor_name": "",
  "invoice_number": "",
  "invoice_date": "",
  "due_date": "",
  "line_items": [{"description": "", "quantity": 0, "unit_price": 0, "total": 0}],
  "subtotal": 0,
  "tax": 0,
  "total_due": 0
}
Return only valid JSON, no explanation."""
                }
            ],
        }]
    )
    return json.loads(response.content[0].text)

Why it matters for agents specifically: Business automation agents frequently encounter documents with no API. A real accounts-payable agent needs to process PDFs from vendors. A compliance agent needs to read regulations. A research agent needs to parse charts and figures in academic papers. Vision is the unlock for all of these.
Common document intelligence patterns:
- Invoice/receipt extraction → accounting automation
- Chart and graph reading → "what was the Q3 revenue trend in this report?"
- Form classification → route incoming documents by type before processing
- Multi-page PDF analysis → summarize a 50-page report, extract key findings per section
- Handwritten notes → digitize meeting notes, annotations, whiteboard photos
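For the multi-page PDF case, Claude's Messages API accepts PDFs directly as `document` content blocks, so a whole report can go in a single request. The helper below only builds the request payload; the summarization prompt is illustrative.

```python
import base64

def build_pdf_request(pdf_bytes: bytes, prompt: str) -> dict:
    """Build a Messages API message that attaches a PDF as a document block."""
    return {
        "role": "user",
        "content": [
            {
                "type": "document",
                "source": {
                    "type": "base64",
                    "media_type": "application/pdf",
                    "data": base64.b64encode(pdf_bytes).decode("utf-8"),
                },
            },
            {"type": "text", "text": prompt},
        ],
    }

sample = b"%PDF-1.4 minimal"  # stand-in for real PDF bytes
msg = build_pdf_request(sample, "Summarize the key findings per section.")
```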
For batch document processing, use the Claude Batch API: submit thousands of documents in one request and retrieve results when ready — 50% cost reduction vs real-time API calls, designed for exactly this workload.
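A sketch of how such a batch might be assembled, assuming each document is already base64-encoded. Each Batch API request pairs a `custom_id` with ordinary Messages API `params`; the model name and prompt here are carried over from the invoice example and are placeholders for your own.

```python
def build_batch_requests(images: dict[str, str]) -> list[dict]:
    """Turn {custom_id: base64_png} into Batch API request entries."""
    return [
        {
            "custom_id": doc_id,
            "params": {
                "model": "claude-opus-4-6",
                "max_tokens": 1024,
                "messages": [{
                    "role": "user",
                    "content": [
                        {"type": "image", "source": {"type": "base64",
                         "media_type": "image/png", "data": b64}},
                        {"type": "text", "text": "Extract invoice fields as JSON."},
                    ],
                }],
            },
        }
        for doc_id, b64 in images.items()
    ]

# Submission (requires an API key and the anthropic SDK):
# client.messages.batches.create(requests=build_batch_requests(docs))
```

The `custom_id` is how you match results back to documents when the batch completes, so use something stable like an invoice number or file hash.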
Building a Vision + Action Pipeline
The real power is combining vision (reasoning about what you see) with action (doing something about it). The pattern:
import anthropic

class DocumentRoutingAgent:
    """
    Vision agent that classifies incoming documents and routes them
    to the right handler — no human needed.
    """
    def __init__(self):
        self.client = anthropic.Anthropic()

    def classify_document(self, image_b64: str) -> str:
        """Use Claude to classify document type from image."""
        resp = self.client.messages.create(
            model="claude-sonnet-4-6",
            max_tokens=50,
            messages=[{
                "role": "user",
                "content": [
                    {"type": "image", "source": {"type": "base64",
                     "media_type": "image/png", "data": image_b64}},
                    {"type": "text", "text":
                     "Classify this document: invoice, contract, report, form, or other. "
                     "Reply with one word only."}
                ]
            }]
        )
        return resp.content[0].text.strip().lower()

    def process(self, image_b64: str):
        doc_type = self.classify_document(image_b64)
        handlers = {
            "invoice": self.handle_invoice,
            "contract": self.handle_contract,
            "form": self.handle_form,
        }
        handler = handlers.get(doc_type, self.handle_unknown)
        return handler(image_b64)

    # Stub handlers: wire these to your real downstream systems
    def handle_invoice(self, image_b64: str): ...
    def handle_contract(self, image_b64: str): ...
    def handle_form(self, image_b64: str): ...
    def handle_unknown(self, image_b64: str): ...

Vision-based classification is faster and more accurate than rule-based filename/MIME-type routing for unstructured document intake — and it handles the long tail of edge cases that break rule-based systems.
Sources
- Claude — Computer Use Tool Documentation
- Microsoft — Playwright MCP: AI-Powered Browser Automation Guide
- Medium — Playwright MCP: Comprehensive Guide 2025
- Omega.ai — AI Computer-Use Benchmarks & Top Agents 2025-2026