Playwright: E2E Testing and Agent Verification

AI coding tools ship code fast. Without an external check, "done" just means the model said so. Steve Kinney's Playwright course on Master.dev frames the fix clearly: build verification infrastructure — lint and types for static proof, Playwright for behavioral proof. This post distills what I took from that course into one reference page.


Where Playwright sits in the pyramid

Unit tests (Jest) and component tests (React Testing Library) catch building blocks in isolation. Playwright sits at the top of the testing pyramid — fewer tests, slower runs, highest confidence.

        /\
       /  \     E2E — Playwright
      /----\
     /      \   Integration — RTL + API mocks
    /--------\
   /          \ Unit — Jest
  /--------------\

As the Testing Library guiding principle puts it: "The more your tests resemble the way your software is used, the more confidence they can give you."

Playwright does not replace Jest or RTL. It closes the gap they leave open: wrong redirects after login, API contract drift in production builds, layout regressions, and third-party auth mis-wiring.

What you get: real browsers (Chromium, Firefox, WebKit), auto-waiting locators, traces and screenshots on failure, and network interception for deterministic runs.


Getting started

npm init playwright@latest   # or add to an existing project
npx playwright install       # browser binaries
npx playwright test
npx playwright test --ui      # visual runner — best way to learn
npx playwright codegen http://localhost:3000

Key files: playwright.config.ts, tests/, and npm scripts. Most configs use webServer to start the dev server before tests and reuseExistingServer locally so you are not spawning duplicate processes.

A minimal test follows the same rhythm as RTL: navigate, interact, assert.

import { test, expect } from '@playwright/test';

test('sign in shows welcome message', async ({ page }) => {
  await page.goto('/');
  await page.getByRole('button', { name: 'Sign in' }).click();
  await expect(page.getByText('Welcome')).toBeVisible();
});

The accessibility tree (and why it matters)

Playwright's recommended locators resolve elements by ARIA role and accessible name, the same information the browser exposes to assistive technology through the accessibility tree. Tests find elements by role, label, and name, not arbitrary CSS.

Implication: semantic HTML (button, label, heading) produces stable tests. div with an onClick handler produces brittle ones.

Before acting, locators auto-wait until the element is actionable (visible, stable, enabled). In most cases you can drop manual waits such as sleep() helpers or page.waitForTimeout().

Locator priority

Same philosophy as React Testing Library — stay user-facing:

  1. getByRole: buttons, links, headings, checkboxes, textboxes
  2. getByLabel: form fields
  3. getByPlaceholder
  4. getByText
  5. getByAltText / getByTitle
  6. getByTestId: last resort

Filter or chain locators to narrow a match:

await page.getByRole('listitem').filter({ hasText: '1984' }).click();
await page.getByRole('navigation').getByRole('link', { name: 'Home' }).click();

Stay high in the hierarchy. getByRole('listitem') beats chaining CSS into nested divs. When duplicate text appears on the page, narrow with .filter() or scope to a parent.

Codegen records clicks and typing into test code — useful as a starting point, but always edit the output. Generated selectors are often brittle.


Configuring projects and dev servers

playwright.config.ts is where you wire the environment:

import { defineConfig, devices } from '@playwright/test';

export default defineConfig({
  testDir: './tests',
  fullyParallel: true,
  retries: process.env.CI ? 2 : 0,
  use: {
    baseURL: 'http://localhost:3000',
    trace: 'on-first-retry',
    screenshot: 'only-on-failure',
  },
  webServer: {
    command: 'npm run dev',
    url: 'http://localhost:3000',
    reuseExistingServer: !process.env.CI,
  },
  projects: [
    { name: 'chromium', use: { ...devices['Desktop Chrome'] } },
  ],
});

Projects let you run subsets — different browsers, viewports, or auth profiles. Make sure baseURL and webServer.url match.
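
As a rough sketch of how projects can split a suite, the fragment below defines one project per browser plus a mobile viewport. The project names are arbitrary labels chosen for illustration, and the device keys are assumed to be entries in Playwright's bundled devices registry.

```typescript
// playwright.config.ts (fragment): one project per browser or viewport
import { defineConfig, devices } from '@playwright/test';

export default defineConfig({
  projects: [
    { name: 'chromium', use: { ...devices['Desktop Chrome'] } },
    { name: 'firefox', use: { ...devices['Desktop Firefox'] } },
    { name: 'webkit', use: { ...devices['Desktop Safari'] } },
    // Mobile viewport and user agent, emulated in Chromium
    { name: 'mobile-chrome', use: { ...devices['Pixel 5'] } },
  ],
});
```

Running npx playwright test --project=firefox then executes only that subset.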


Authentication without logging in every test

Logging in before every test is slow and brittle. Playwright's answer is storageState: serialize cookies and localStorage once, reuse everywhere.

// tests/auth.setup.ts
import { test as setup, expect } from '@playwright/test';

const authFile = 'playwright/.auth/user.json';

setup('authenticate', async ({ page }) => {
  await page.goto('/login');
  await page.getByLabel('Email').fill('test@example.com');
  await page.getByLabel('Password').fill('secret');
  await page.getByRole('button', { name: 'Sign in' }).click();
  await page.waitForURL('/dashboard');
  await page.context().storageState({ path: authFile });
});

In playwright.config.ts, declare the setup project and make the browser project depend on it so that setup runs first:

// playwright.config.ts
projects: [
  { name: 'setup', testMatch: /.*\.setup\.ts/ },
  {
    name: 'chromium',
    use: { storageState: 'playwright/.auth/user.json' },
    dependencies: ['setup'],
  },
],

Add playwright/.auth to .gitignore — those files contain session tokens.

OAuth and Google login are painful in CI. Prefer test accounts, API-based auth, or a test-environment bypass. Wait for auth to finish with waitForURL or an assertion on a post-login element, because cookies are often set across several redirects and saving state too early captures an incomplete session.

For tests that mutate shared server state in parallel, use one account per worker. For read-only suites, a single shared account is fine.
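
One way to picture "one account per worker" is a small helper that derives an account and a state file from the worker's index. This is a minimal sketch: accountForWorker, the email scheme, and the file naming are all hypothetical. In a real suite the index would most likely come from test.info().parallelIndex inside a worker-scoped fixture.

```typescript
// Hypothetical helper: map a worker's parallel index to its own account and state file.
interface WorkerAccount {
  email: string;
  storageStatePath: string;
}

function accountForWorker(parallelIndex: number): WorkerAccount {
  if (!Number.isInteger(parallelIndex) || parallelIndex < 0) {
    throw new RangeError(`parallelIndex must be a non-negative integer, got ${parallelIndex}`);
  }
  return {
    // Assumed naming scheme for pre-seeded test accounts
    email: `test-user-${parallelIndex}@example.com`,
    // One file per worker, so parallel workers never share a session
    storageStatePath: `playwright/.auth/worker-${parallelIndex}.json`,
  };
}
```

Because each worker gets a distinct file and account, tests that mutate server state do not trample each other's data.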

Per-test override when you need a different role:

test.use({ storageState: 'playwright/.auth/admin.json' });

Network isolation: HAR files and route mocking

External APIs are slow, flaky, and non-deterministic. Playwright offers two main strategies.

HAR record and playback

A HAR (HTTP Archive) is a snapshot of network traffic. Record once, replay in CI forever.

// Record (run once, then commit the HAR)
await page.routeFromHAR('hars/search-books.har', {
  url: '**/api/**',
  update: true,
});

// Replay (CI and local)
await page.routeFromHAR('hars/search-books.har', {
  url: '**/api/**',
  update: false,
});

CLI alternative:

npx playwright open --save-har=example.har --save-har-glob="**/api/**" https://example.com
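
Since a committed HAR is replayed on every CI run, it can help to review what it actually contains before committing. The sketch below lists the recorded requests. It assumes only the log.entries shape from the HAR 1.2 format; summarizeHar and the trimmed-down interfaces are illustrative names, not part of Playwright.

```typescript
// Minimal slice of the HAR 1.2 structure; real files carry many more fields.
interface HarEntry {
  request: { method: string; url: string };
  response: { status: number };
}
interface HarFile {
  log: { entries: HarEntry[] };
}

// List each recorded request as "METHOD path -> status" for a quick review.
function summarizeHar(har: HarFile): string[] {
  return har.log.entries.map((entry) => {
    const { pathname, search } = new URL(entry.request.url);
    return `${entry.request.method} ${pathname}${search} -> ${entry.response.status}`;
  });
}
```

Feeding it the parsed contents of hars/search-books.har gives a short list that is easy to scan for unexpected endpoints or error responses.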

Route interception

For fine-grained control — especially error states a HAR cannot represent:

// Full mock — no network call
await page.route('**/api/v1/fruits', async (route) => {
  await route.fulfill({ json: [{ name: 'Strawberry', id: 21 }] });
});

// Fetch real response, patch the body
await page.route('**/api/v1/fruits', async (route) => {
  const response = await route.fetch();
  const json = await response.json();
  json.push({ name: 'Loquat', id: 100 });
  await route.fulfill({ response, json });
});

// Simulate failure (route only exists inside a page.route handler)
await page.route('**/api/v1/fruits', async (route) => {
  await route.fulfill({ status: 404, body: 'Not found' });
});

Approach              Best for
HAR                   Many endpoints, realistic traffic snapshot
route.fulfill         Single endpoint, error states, speed
route.fetch + patch   Real headers with tweaked JSON body

Visual regression, traces, and debugging

Screenshots

await expect(page).toHaveScreenshot('homepage.png');
await expect(page.getByRole('main')).toHaveScreenshot('main-panel.png');

Update baselines after intentional UI changes: npx playwright test --update-snapshots.

Playwright compares screenshots of the rendered page against stored baseline images, not against Figma files. Use component tests for design-system checks and E2E screenshots for integrated pages.

Traces

Traces are the highest-value artifact when a test fails — for you and for an AI agent.

npx playwright show-trace path/to/trace.zip
# or upload to trace.playwright.dev

A trace bundles DOM snapshots, network log, console output, a screenshot filmstrip, and the source line for each action. On CI, trace: 'on-first-retry' keeps overhead low while capturing failures.

What to feed an agent on failure: test name, assertion error, trace path, and the relevant spec file — not the entire monorepo.
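
A minimal sketch of that idea: collect the four pieces of context into one compact prompt. FailureContext and buildFailurePrompt are hypothetical names, and the closing instruction line is only an example of the kind of constraint you might add.

```typescript
// Hypothetical shape for the failure details worth handing to an agent.
interface FailureContext {
  testName: string;
  assertionError: string;
  tracePath: string;
  specFile: string;
}

// Build a compact prompt from the four facts above and nothing else.
function buildFailurePrompt(ctx: FailureContext): string {
  return [
    `Failing test: ${ctx.testName}`,
    `Spec file: ${ctx.specFile}`,
    `Trace: ${ctx.tracePath}`,
    'Assertion error:',
    ctx.assertionError.trim(),
    'Fix the cause of this failure. Do not edit unrelated files.',
  ].join('\n');
}
```

Keeping the prompt this small leaves the agent's context for the trace and the spec rather than for unrelated code.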


The agent verification loop

The course's through-line is turning tests into an external loop agents cannot argue with:

  1. Code changes (human or agent)
  2. npx playwright test runs (hook or CI)
  3. On failure → trace + stderr → agent prompt
  4. Agent patches → re-run until green

Success criteria must be objective: all tests pass, not "looks good."
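
The loop above can be sketched as a small function with the test runner and the agent injected as callbacks. Everything here is an assumption for illustration: verifyUntilGreen, SuiteResult, and the attempt budget are hypothetical, and runSuite would in practice wrap something like spawnSync('npx', ['playwright', 'test']) from node:child_process.

```typescript
// Result of one suite run.
interface SuiteResult {
  passed: boolean;
  output: string; // stderr plus trace paths, used as agent context
}

// Re-run the suite until it is green or the run budget is spent.
// Returns the number of runs it took, or -1 if the suite never passed.
function verifyUntilGreen(
  runSuite: () => SuiteResult,
  requestPatch: (failureOutput: string) => void,
  maxRuns = 3,
): number {
  for (let run = 1; run <= maxRuns; run++) {
    const result = runSuite();
    if (result.passed) return run; // objective success: the suite's verdict, not an opinion
    if (run < maxRuns) requestPatch(result.output); // no patch after the final run
  }
  return -1;
}
```

The run budget matters: without it, an agent that cannot fix the failure would loop forever instead of handing the problem back to a human.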

Git hooks

Husky + lint-staged is a common pattern:

  • Pre-commit: lint, format, smoke tests (--grep @smoke)
  • Pre-push or CI: full Playwright suite

Same loop for humans and agents. Fast feedback locally; thorough proof before merge.

Playwright MCP vs CLI

              CLI                              MCP (Playwright server)
Best for      CI, hooks, deterministic runs    Agent exploring a live browser
Trust level   High: same result every time     Agent-driven: verify with CLI

Per the official docs, npx playwright init-agents scaffolds three agents:

  • Planner — explores the app → Markdown test plan in specs/
  • Generator — plan → executable tests
  • Healer — runs the suite and repairs failing tests

Chained: Planner → Generator → Healer → CLI verifies.

Rule I took from the course: MCP explores and drafts. CLI green is the merge gate.


The full stack at a glance

Layer                   Tool
Static analysis         ESLint, TypeScript
Unit / integration      Jest, React Testing Library
End-to-end              Playwright
Auth speed              storageState + setup project
Network stability       HAR + page.route
Debug / agent context   Traces, screenshots, HTML report
Enforcement             Git hooks + CI
Agent scaffolding       init-agents (Planner, Generator, Healer)

What I am doing next

  1. Add Playwright to a real app (not just the course demo)
  2. Write three to five smoke tests on critical user paths
  3. Add auth.setup.ts wherever login is required
  4. Enable trace: 'on-first-retry' in CI
  5. Wire a pre-push hook for the full suite

Senior work was never just typing code. It is architecting systems you can trust — and Playwright is one of the pieces that makes that possible when agents are in the loop.


Further reading