Browser Workflows

Orchestrating the Browser

How to choose among Playwright, MCP-connected browser tools, and computer-use loops when a page has to be clicked through.

The Browser Changes the Shape of the Task

Some tasks depend on a live interface: the page has to render, the login has to complete, the drawer has to open, and the download only appears after the right sequence of clicks. Once that is true, the browser is part of the work itself, with its own timing and state. Most readers will not need this chapter for a first project. Read it when you cannot get the data without going through a browser.

You can automate browser work. The hard part is keeping track of what happened during the run. Playwright, a browser automation library, gives you a code-driven setup. Claude Code can connect to browser tools through MCP servers; MCP (Model Context Protocol) is the standard that lets an AI agent call external tools as if they were built in. Anthropic and OpenAI now document computer-use loops that act from screenshots, mouse actions, and keyboard input inside sandboxed environments (isolated containers that limit what the agent can access). These three approaches overlap, but their costs differ. Playwright scripts cost only compute time. MCP browser tools add some overhead from the server layer. Computer-use loops are noticeably more expensive in API tokens because every screenshot is a large image the model has to process.

The operational rule

Use the lightest setup that can still finish the task. When the interface allows it, a deterministic script is more reliable over time than a screenshot-driven loop.

Playwright Script → MCP Browser Tools → Computer Use Loop

A Concrete Browser Problem

Imagine a familiar sort of institutional task: a monthly report is available only through a logged-in dashboard, where you sign in, open a date picker, choose a filter, and download a CSV. There is no export URL you can call directly and no documented API. With that problem in mind, you face several practical questions. How much of the browser do you want to control? How visible should the run be? How much fragility are you willing to tolerate?

Three Ways to Drive the Browser

Deterministic

Playwright script

Best when the interface is stable enough to describe in code and check with clear assertions (explicit checks that the page reached the expected state).

Strongest for repeatable tests and downloads
Produces the most inspectable artifact after the run

Tool-Mediated

MCP browser tools

Best when an agent should use an existing browser setup through structured tools, without rebuilding everything from scratch.

Fits Claude Code and similar tool clients
Keeps browser actions visible as tool calls
Depends on the quality and trustworthiness of the MCP server

Visual

Computer-use loop

Best when the task crosses application boundaries or depends on visual state that is awkward to express through locators alone.

More flexible across unfamiliar interfaces
Slower, less precise, and more expensive than a good script
Needs stronger isolation and human review
Both Anthropic and OpenAI still label these features as provisional

What the Agent Can Drive, and What It Should Not Decide

Whichever approach you choose, the division of responsibility is the same. The agent handles repetition and state tracking. You decide which environment is safe, which credentials are involved, what counts as success, and which actions need human review.

The Agent Does

Draft and revise browser scripts or browser-tool calls
Explain locators, waits, traces (recorded logs of each action), and likely failure points
Move through repetitive UI sequences once the path is clear
Save artifacts such as traces, screenshots, and logs

You Do

Choose the setup and decide when escalation is warranted
Protect credentials, session state, and sensitive pages
Verify that the browser actions proved what you intended
Review consequential actions before they are executed

Save These

The script or tool schema that drove the run
Authentication setup and storage rules
Trace files, screenshots, or step logs
A small note on what constitutes success or failure

Starting Prompt

"This task depends on a live browser. Before you automate it, tell me which approach is likely to work, what I should save from the run, and what kind of failure I should expect first. Explain the choice in plain language."

Playwright First, When Playwright Fits

Playwright is usually the first choice when the browser work can be expressed as code. A locator, in Playwright's terms, is the rule it uses to find a button, field, or link on the page. The mechanics are more involved than that name suggests. Locators are the basis of Playwright's auto-waiting (pausing until an element is ready to act on rather than failing instantly) and of its ability to retry failed lookups, so a script is only as robust as the locators it uses. That is why the docs keep steering people toward user-facing locators such as role, text, label, and test id. Positional selectors, the kind that say "the third div inside the second section," look stable when first written but tend to break the moment someone rearranges the page layout.

You can also let Playwright write the first draft. Its code generator records actions in a browser window and produces starter code in the Inspector (a built-in tool for viewing and debugging locators). The current docs also include first-party test agents, with `planner`, `generator`, and `healer`, and an `init-agents --loop=claude` path for Claude Code. The script is still code you can read and revise, but Playwright now ships some agent-driven scaffolding around it.

Use locators that express intent

Prefer role, label, text, and explicit test ids over positional selectors. Playwright's current docs recommend this directly because user-facing attributes tend to survive UI change better than brittle paths.

Record first, then revise

Codegen is a good starting point. Let it record the path through the interface, then clean the script up by naming the assertions and removing incidental actions.

Save traces for repair work

A Playwright trace records actions, DOM snapshots, console logs, network requests, and source locations, and Trace Viewer lets you step through them after the run. When the script fails, the trace shows where it failed and usually why.
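One way to keep a trace only when something breaks is to ask for it in the Playwright config. The fragment below is a minimal sketch that assumes the standard `@playwright/test` runner and leaves every other setting at its default; `retain-on-failure` and `only-on-failure` are Playwright's own option values.

```typescript
// playwright.config.ts
import { defineConfig } from '@playwright/test';

export default defineConfig({
  use: {
    // Record a trace for every test, but keep it only when the test fails.
    trace: 'retain-on-failure',
    // Capture a screenshot at the point of failure as a second artifact.
    screenshot: 'only-on-failure',
  },
});
```

A saved trace can then be opened with `npx playwright show-trace` followed by the path to the trace zip file.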

Treat authentication state as sensitive data

Playwright's authentication guide recommends storing browser state under `playwright/.auth` and adding it to `.gitignore`. The docs also warn that the state file may contain cookies and headers that can impersonate the user or test account.
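A small pre-flight check can catch the case where the state directory was never added to `.gitignore`. The sketch below is deliberately simplified: the function name `ignoresAuthDir` is hypothetical, it only recognizes a few literal spellings of the rule, and it does not implement git's full pattern syntax (wildcards, negation with `!`, nested ignore files).

```typescript
// Returns true when the text of a .gitignore file contains a rule that
// covers the playwright/.auth directory. Illustrative only: a later
// negation rule ("!...") could still un-ignore a file, and this check
// would not notice.
function ignoresAuthDir(gitignoreText: string): boolean {
  // Literal rules that cover the directory, written without a trailing slash.
  const coveringRules = [
    'playwright/.auth',
    '/playwright/.auth',
    'playwright',
    '/playwright',
    '.auth',
  ];
  const lines = gitignoreText.split(/\r?\n/);
  for (const rawLine of lines) {
    const line = rawLine.trim();
    // Skip blank lines and comments.
    if (line === '' || line.charAt(0) === '#') continue;
    // "dir/" and "dir" both cover the directory, so drop trailing slashes.
    const rule = line.replace(/\/+$/, '');
    if (coveringRules.indexOf(rule) !== -1) return true;
  }
  return false;
}
```

In practice you would read the repository's `.gitignore` into a string and refuse to save browser state when the function returns `false`.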

A short Playwright script for the dashboard scenario from above gives a sense of what the code looks like in practice.

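typescript

import { test, expect } from '@playwright/test';

test('export current month report', async ({ page }) => {
  // Credentials come from the environment, never from the script itself.
  await page.goto('https://reports.example.edu/login');
  await page.getByLabel('Username').fill(process.env.REPORT_USER!);
  await page.getByLabel('Password').fill(process.env.REPORT_PASSWORD!);
  await page.getByRole('button', { name: 'Sign in' }).click();

  // Walk the path to the export using user-facing locators.
  await page.getByRole('button', { name: 'Reports' }).click();
  await page.getByRole('button', { name: 'Current month' }).click();

  // Start listening for the download before the click that triggers it.
  const downloadPromise = page.waitForEvent('download');
  await page.getByRole('button', { name: 'Download CSV' }).click();
  const download = await downloadPromise;

  // Assert on the file itself as well as on the page's confirmation text.
  expect(download.suggestedFilename()).toMatch(/\.csv$/);
  await expect(page.getByText('Download ready')).toBeVisible();
});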

MCP as Browser Tools

With MCP, the agent connects to browser tools that some other system exposes, which means someone else wrote and configured the server that sits between the agent and the browser. Anthropic's current MCP docs say Claude Code can connect to hundreds of external tools and data sources through MCP, and they warn explicitly that third-party MCP servers can expose you to prompt injection risk (hidden text on a page that tricks the AI into doing something unintended) when they fetch untrusted content. That warning matters especially in browser work, because the page itself can carry hidden instructions intended to manipulate the agent.

MCP browser servers suit exploratory work, delegated testing, and cases where you already have a browser setup you trust. A badly configured MCP server might let the agent click through pages containing hidden instructions that override its behavior, or silently pass session cookies to an endpoint you did not authorize. Before using an MCP browser server, find out who wrote it and what it can reach.

Prompt Pattern

Ask the agent to explain the available browser tools before it uses them: what tools exist, what each one can touch, what content is untrusted, and what confirmations should remain human.
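The answers to that prompt can be written down as data, which makes them easier to review and to keep alongside the run. The sketch below is one possible shape, not a format any MCP server defines; the tool names and fields are all hypothetical.

```typescript
// One row per browser tool the MCP server exposes. All names are illustrative.
interface BrowserTool {
  name: string;
  touches: string;             // what the tool can reach
  returnsPageContent: boolean; // true means its output is untrusted input
  needsHumanConfirm: boolean;  // true means a person approves each call
}

const inventory: BrowserTool[] = [
  { name: 'navigate', touches: 'any URL', returnsPageContent: false, needsHumanConfirm: false },
  { name: 'read_page', touches: 'text of the current page', returnsPageContent: true, needsHumanConfirm: false },
  { name: 'click', touches: 'elements on the current page', returnsPageContent: false, needsHumanConfirm: false },
  { name: 'submit_form', touches: 'forms on the current page', returnsPageContent: false, needsHumanConfirm: true },
];

// Tools whose output should be treated as untrusted content.
function untrustedSources(tools: BrowserTool[]): string[] {
  return tools.filter((tool) => tool.returnsPageContent).map((tool) => tool.name);
}

// Tools that should stay behind a human confirmation.
function confirmationList(tools: BrowserTool[]): string[] {
  return tools.filter((tool) => tool.needsHumanConfirm).map((tool) => tool.name);
}
```

Saving a table like this next to the trace gives a later reader the same picture of the server that you had when you approved the run.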

How Computer Use Differs

Newer computer-use systems work by looking at the screen and acting on what they see. Anthropic's current computer-use docs describe a screenshot-driven tool that can click, drag, type, and move through desktop environments inside a sandboxed computing environment. OpenAI's current guide lays out several setup shapes, from a built-in computer-use loop to custom arrangements layered on Playwright, Selenium, VNC (a protocol for viewing and controlling a remote desktop), or MCP. Those shapes include code-execution setups where the model writes and runs short scripts while moving between visual and DOM-based interaction; the DOM, or Document Object Model, is the structured tree of elements that makes up a webpage. The tooling is still scattered across products and documentation, so know that before you commit to one setup.

Computer use is more general than deterministic scripts. It can cross application boundaries, inspect a screen before acting, and handle workflows that ordinary page selectors cannot express. The model can also mistake one button for another, follow prompt injection embedded in a page, or become expensive in tokens and time when the loop runs longer than expected. Both Anthropic and OpenAI treat the feature as provisional. Anthropic labels it beta and insists on sandboxing with human confirmation for consequential actions. OpenAI says the security of the setup is the user's responsibility.

Human Review Point

Accepting terms, submitting forms, authorizing payments, and interacting with sensitive accounts should remain human-confirmed steps, even when the browser interaction itself feels routine.
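A review point like this can be enforced in code and not left to habit. The sketch below is a simplified, synchronous stand-in for an agent loop: it takes a prepared list of proposed actions where a real loop would receive them from the model one at a time. The action kinds, the `runGated` name, and the step budget are all assumptions made for illustration.

```typescript
type ActionKind = 'click' | 'type' | 'submit' | 'pay' | 'accept_terms';

interface ProposedAction {
  kind: ActionKind;
  target: string;
}

interface RunResult {
  executed: string[];
  stoppedBecause: 'done' | 'declined' | 'budget';
}

// Action kinds that must be approved by a person before they run.
const CONSEQUENTIAL: ActionKind[] = ['submit', 'pay', 'accept_terms'];

// Runs actions in order, asks `approve` before consequential ones, and
// stops once `maxSteps` actions have been executed.
function runGated(
  actions: ProposedAction[],
  approve: (action: ProposedAction) => boolean,
  maxSteps: number,
): RunResult {
  const executed: string[] = [];
  for (const action of actions) {
    if (executed.length >= maxSteps) {
      return { executed, stoppedBecause: 'budget' };
    }
    const needsReview = CONSEQUENTIAL.indexOf(action.kind) !== -1;
    if (needsReview && !approve(action)) {
      return { executed, stoppedBecause: 'declined' };
    }
    // A real loop would perform the browser action here; this sketch only logs it.
    executed.push(`${action.kind}:${action.target}`);
  }
  return { executed, stoppedBecause: 'done' };
}
```

The step budget addresses the cost problem described above: a loop that runs longer than expected stops on its own instead of continuing to spend tokens.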

Choose the Setup by Failure Mode

To choose among these approaches, ask how you expect the run to fail and what you want to be looking at when it does.

Playwright is the default when you need reproducibility, explicit assertions, traces, and code reviewable by other people. MCP browser tools make sense when you already trust the server setup and want the agent to use it interactively inside Claude Code or a similar client. Computer use is worth the overhead in narrower circumstances: the interface crosses applications, depends on visual state, or resists ordinary selector-based automation. OpenAI's current guide also documents a hybrid option, a code-execution setup that can use browser libraries such as Playwright for loops, conditional logic, and page inspection while still falling back to visual interaction when needed.
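The selection rule in this paragraph can be written as a short function. This is only a sketch of the reasoning above, not a substitute for judgment; the trait names and the `chooseSetup` function are hypothetical.

```typescript
type Setup = 'playwright' | 'mcp-browser-tools' | 'computer-use';

interface TaskTraits {
  crossesApplications: boolean;       // the workflow leaves the browser page
  dependsOnVisualState: boolean;      // state is awkward to express with locators
  selectorsAreWorkable: boolean;      // ordinary locators can describe the interface
  trustedMcpServerAvailable: boolean; // you know who wrote the server and what it can reach
  wantsInteractiveAgent: boolean;     // the agent should drive the browser interactively
}

// Picks the lightest setup that the traits allow, in the order the text gives.
function chooseSetup(traits: TaskTraits): Setup {
  if (traits.crossesApplications || traits.dependsOnVisualState || !traits.selectorsAreWorkable) {
    return 'computer-use';
  }
  if (traits.trustedMcpServerAvailable && traits.wantsInteractiveAgent) {
    return 'mcp-browser-tools';
  }
  return 'playwright';
}
```

For the dashboard scenario, where the interface is stable and nothing leaves the browser, the function returns `'playwright'`.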

Case Study: A Report Behind Login

The problem

Return to the dashboard scenario from earlier: a logged-in interface, a date picker, a filter drawer, a CSV download, and no API. The interface is slow, but stable enough.

Playwright is the natural starting point here. The browser context can hold the authenticated state, the date picker and filter drawer can be addressed with explicit locators, the CSV download can be asserted, and a trace can be saved when the workflow breaks. If the first attempt turns into a struggle over brittle selectors, the task may move toward a hybrid setup or a screenshot-first loop. Knowing why you escalated, and being able to say what specifically broke, will save time if you need to revisit the approach later.

Save the login and authentication setup separately from the extraction logic.
Store browser state in a git-ignored location such as `playwright/.auth` so it never enters version control.
Record traces on failure.
Treat the downloaded file as the next inspectable artifact in the workflow.
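Inspecting the downloaded file can start with a check as small as the one below. It is a naive sketch: it splits on commas and newlines, so it does not handle quoted fields that contain commas or line breaks, and the column names in the usage note are invented for illustration.

```typescript
interface CsvCheck {
  ok: boolean;
  problems: string[];
  dataRows: number;
}

// Confirms that the file has the expected header columns and at least one data row.
function checkReportCsv(csvText: string, expectedColumns: string[]): CsvCheck {
  const lines = csvText.split(/\r?\n/).filter((line) => line.trim() !== '');
  if (lines.length === 0) {
    return { ok: false, problems: ['file is empty'], dataRows: 0 };
  }
  const problems: string[] = [];
  const header = lines[0].split(',').map((cell) => cell.trim());
  for (const column of expectedColumns) {
    if (header.indexOf(column) === -1) {
      problems.push(`missing column: ${column}`);
    }
  }
  const dataRows = lines.length - 1;
  if (dataRows === 0) {
    problems.push('header present but no data rows');
  }
  return { ok: problems.length === 0, problems, dataRows };
}
```

Calling `checkReportCsv(text, ['date', 'department', 'total'])` on the downloaded text also catches a common silent failure: when the session has expired, the "CSV" is often a login page, and the header check fails immediately.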

What changes in agentic work

The agent can draft the setup quickly. You should spend your time on the authentication boundaries, meaningful assertions, saved traces, and the steps that still need a human.

When to Stop Escalating

A few signals suggest you have pushed the automation further than it can usefully go:

The script has turned into a visual guesser because the interface itself is unstable.
The browser actions can no longer be explained in a small, auditable sequence.
The workflow crosses into sensitive accounts or irreversible decisions.
You see that an API, an export, a saved session, or a static file would be simpler. (This happens more often than you might expect.)

When the target is data embedded in a page, scraping is a lighter approach. A structured endpoint, if one exists, is more stable still: see Working with APIs.