Agentic Coding

Verifiable Goals

When the agent can check its own work, it can keep going without you. When it cannot, you stay in the loop.

The Overnight Experiment

In March 2026, Andrej Karpathy released a project called autoresearch: a roughly 630-line Python script that lets an AI agent run machine learning experiments in a loop, unattended. The agent modifies a single training file, trains a small language model for exactly five minutes, then checks whether a validation metric (bits per byte, a measure of how well the model predicts unseen text) improved. If the change helped, the agent commits it. If not, it tries something else.

After two days the agent had run approximately 700 experiments and discovered around 20 improvements that transferred to larger models. It could work unsupervised that long because the goal was a number that either improved or didn't (for bits per byte, lower is better). Had the task been "make the model's outputs sound more natural," a human would have been needed at every step, because no automated metric captures naturalness.

What Makes a Goal Checkable

Some work has a built-in answer key. Converting a CSV export to the JSON schema that Omeka S (a digital collections platform) expects is one example. You can validate every record against the schema and know immediately whether the conversion succeeded. Standardizing phone numbers in a donor spreadsheet to a consistent format like (XXX) XXX-XXXX is another, because a regular expression can check every row in milliseconds. When the goal is that concrete, the agent can try an approach, run the check, and adjust on its own.
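As a minimal sketch of the phone-number check described above, the following Python flags every row that does not match (XXX) XXX-XXXX. The function name and the sample values are made up for illustration; a real spreadsheet would be read from a file first.

```python
import re

# Target format from the text: (XXX) XXX-XXXX, where each X is a digit.
PHONE_PATTERN = re.compile(r"\(\d{3}\) \d{3}-\d{4}")


def nonconforming_rows(phone_numbers):
    """Return (row_index, value) pairs that do not match the target format."""
    return [
        (index, value)
        for index, value in enumerate(phone_numbers)
        if not PHONE_PATTERN.fullmatch(value)
    ]


# Hypothetical sample column from a donor spreadsheet.
sample = ["(212) 555-0147", "212-555-0147", "(646) 555-0199", "555 0100"]
failures = nonconforming_rows(sample)
print(failures)  # [(1, '212-555-0147'), (3, '555 0100')]
```

An empty list means every row conforms, which is the kind of yes-or-no signal an agent can act on without asking you.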

Other work resists that kind of check. Generating Library of Congress Subject Headings from item descriptions is one example. The agent can suggest headings, and a cataloger familiar with the vocabulary can evaluate them, but no automated test will tell you whether "Immigrants, Italian, New York (State), New York" is the right heading for a particular photograph. Deciding how to structure a digital exhibit is harder still, because curatorial choices cannot be checked automatically at all.

Many tasks are partly checkable and partly not. A metadata cleanup project might involve format standardization, which is checkable, alongside quality review of description fields, which is not. In those mixed cases, you can give the checkable parts to the agent and handle the rest yourself.

Fully checkable

Format conversions, schema validation, test suites, data transformations with known expected output, checksum verification, mathematical computation

Partially checkable

Data cleaning (row counts check out, but do the corrections make sense?), API integration (status codes pass, but is the response what you need?), deduplication (the logic ran, but were the right records merged?)

Judgment-dependent

Subject heading assignment, exhibit curation, prose quality, UI design, architecture decisions, anything requiring domain expertise to evaluate

Tests as Steering

In software, making a goal checkable usually means writing tests before asking the agent to code. The tests describe what the code should do. The agent writes code, runs the tests, reads the failures, and adjusts. It keeps doing this until the tests pass.

Write code → Run tests → Read failures → Adjust → (repeat until all tests pass)
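To make the loop concrete, here is a small sketch in Python where the expected behavior is written down before any implementation exists. The title-cleaning task, the case list, and both attempts are hypothetical stand-ins; in practice the attempts would come from the agent and the checks would more likely live in a test framework.

```python
# Spec written first: (input, expected output) pairs for a hypothetical
# title-cleaning function. No implementation exists yet at this point.
SPEC_CASES = [
    ("  Annual Report  ", "Annual Report"),
    ("Annual   Report", "Annual Report"),
    ("", ""),
]


def run_spec(func):
    """Run every case and return readable failure messages (empty list = pass)."""
    failures = []
    for raw, expected in SPEC_CASES:
        try:
            actual = func(raw)
        except Exception as error:  # a crash is also a failure the agent can read
            failures.append(f"{raw!r}: raised {type(error).__name__}")
            continue
        if actual != expected:
            failures.append(f"{raw!r}: expected {expected!r}, got {actual!r}")
    return failures


def first_attempt(title):
    # Trims the ends but leaves internal runs of spaces alone.
    return title.strip()


def second_attempt(title):
    # Collapses every run of whitespace, which also trims the ends.
    return " ".join(title.split())


print(run_spec(first_attempt))   # one failure, for the double-spaced title
print(run_spec(second_attempt))  # []
```

The failure message from the first attempt is what the agent reads in the "Read failures" step; the empty list from the second attempt is the stopping condition.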

SWE-bench, a benchmark that gives coding agents real GitHub issues with existing test suites, works the same way. The agent produces a patch; the tests say whether it worked. As of early 2026, the best agents resolve around 75% of the curated set.

You do not need to be writing software for this to apply. If you are building a data pipeline to prepare records for migration, the "tests" might be assertions you write in advance: the output file has the same number of rows as the input, no required fields contain null values, all dates fall within a plausible range, every ISBN passes a checksum. Write the criterion as a sentence: "every row in the output should have a non-empty value in the title column." The agent can turn that into a check directly. The Thinking in Steps guide describes the same habit. Here you write the criteria so the agent can use them.
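The four assertions listed above might look something like this in Python. This is a sketch under assumptions: the field names, the ISO date format, the plausible date range, and the use of ISBN-13 are all choices made for the example, not requirements from any particular system.

```python
from datetime import date

# Assumptions for this sketch: field names, ISO-format dates, ISBN-13 only.
REQUIRED_FIELDS = ("title", "date", "isbn")
EARLIEST, LATEST = date(1450, 1, 1), date.today()


def isbn13_is_valid(isbn):
    """Check the ISBN-13 checksum: digits weighted 1,3,1,3,... sum to a multiple of 10."""
    digits = isbn.replace("-", "")
    if len(digits) != 13 or not (digits.isascii() and digits.isdigit()):
        return False
    total = sum(int(d) * (1 if i % 2 == 0 else 3) for i, d in enumerate(digits))
    return total % 10 == 0


def check_migration(input_rows, output_rows):
    """Return a list of problems; an empty list means every check passed."""
    problems = []
    if len(input_rows) != len(output_rows):
        problems.append(f"row count changed: {len(input_rows)} -> {len(output_rows)}")
    for index, row in enumerate(output_rows):
        for field in REQUIRED_FIELDS:
            if not row.get(field):
                problems.append(f"row {index}: required field '{field}' is empty")
        raw_date = row.get("date")
        if raw_date:
            try:
                parsed = date.fromisoformat(raw_date)
            except ValueError:
                problems.append(f"row {index}: date '{raw_date}' is not YYYY-MM-DD")
            else:
                if not EARLIEST <= parsed <= LATEST:
                    problems.append(f"row {index}: date '{raw_date}' is implausible")
        isbn = row.get("isbn")
        if isbn and not isbn13_is_valid(isbn):
            problems.append(f"row {index}: ISBN '{isbn}' fails its checksum")
    return problems


# Hypothetical data: the second output row breaks three rules at once.
input_rows = [{"id": 1}, {"id": 2}]
output_rows = [
    {"title": "Harbor Views", "date": "1921-03-05", "isbn": "978-0-306-40615-7"},
    {"title": "", "date": "2999-01-01", "isbn": "978-0-306-40615-8"},
]
problems = check_migration(input_rows, output_rows)
for problem in problems:
    print(problem)
```

Returning a list of messages rather than stopping at the first failure gives the agent the full picture in one run.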

Letting It Run Overnight

The autoresearch pattern is general enough to reuse. You need something the agent can modify (a script, a configuration, a dataset), a scoring function, and a time limit per attempt. The agent proposes a change, scores the result, keeps or discards it, and repeats.
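The propose, score, keep-or-discard cycle can be sketched in a few lines of Python. This is not the autoresearch code: the function names are invented, the "candidate" is a single number, and the scorer and proposer are toys standing in for a real script and a real agent. Here a higher score is better; for a metric like bits per byte you would negate it. The per-attempt time limit is left out for brevity; one way to add it would be to run each attempt in a subprocess with a timeout.

```python
import random


def improve(candidate, propose, score, attempts):
    """Keep a proposed change only when it scores strictly better than the current best."""
    best, best_score = candidate, score(candidate)
    history = []
    for _ in range(attempts):
        trial = propose(best)
        trial_score = score(trial)
        kept = trial_score > best_score
        if kept:
            best, best_score = trial, trial_score
        history.append((trial_score, kept))
    return best, best_score, history


# Toy stand-ins (assumptions for this sketch): the scorer rewards values
# close to 10, and the proposer nudges the current value at random.
rng = random.Random(0)


def toy_score(value):
    return -abs(value - 10.0)


def toy_propose(value):
    return value + rng.uniform(-1.0, 1.0)


start = 0.0
best, best_score, history = improve(start, toy_propose, toy_score, attempts=200)
kept_count = sum(1 for _, kept in history if kept)
print(f"start score {toy_score(start):.2f}, final score {best_score:.2f}")
print(f"kept {kept_count} of {len(history)} proposals")
```

Because a change is kept only when the score improves, the best score can never get worse, which is what makes the loop safe to leave running.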

A librarian writing a Python script to normalize messy metadata exports could set this up with a quality score: the percentage of records that conform to a target schema. Hand the agent the script and the scorer, and let it iterate. Each run takes seconds. By morning the script could be handling the unusual date formats and encoding problems that show up in only a handful of records out of thousands. Those edge cases would have taken hours to find by hand.
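A scorer of that kind could be as small as the sketch below. The schema is hypothetical: three invented fields, each with a simple rule. A real project would more likely validate against the target platform's actual schema.

```python
import re

# Hypothetical target schema: field name -> rule the value must satisfy.
TARGET_SCHEMA = {
    "title": lambda v: isinstance(v, str) and v.strip() != "",
    "date": lambda v: isinstance(v, str)
    and re.fullmatch(r"\d{4}-\d{2}-\d{2}", v) is not None,
    "language": lambda v: isinstance(v, str) and v in {"eng", "spa", "ita"},
}


def conforms(record):
    """True when every schema field is present and passes its rule."""
    return all(
        field in record and rule(record[field])
        for field, rule in TARGET_SCHEMA.items()
    )


def conformance_score(records):
    """Percentage of records that conform; 0.0 for an empty input."""
    if not records:
        return 0.0
    return 100.0 * sum(conforms(r) for r in records) / len(records)


# Made-up records: the second has a date the schema rejects.
records = [
    {"title": "Harbor Views", "date": "1921-03-05", "language": "eng"},
    {"title": "Mulberry Street", "date": "March 1921", "language": "ita"},
    {"title": "Festa", "date": "1930-09-19", "language": "ita"},
    {"title": "Untitled", "date": "1925-01-01", "language": "eng"},
]
score = conformance_score(records)
print(score)  # 75.0
```

The agent's job is then to change the normalization script, not the scorer, until this number stops climbing.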

Anthropic's own data on Claude Code usage bears this out. Their 2026 report found that the longest autonomous sessions nearly doubled in duration between October 2025 and January 2026 (from under 25 minutes to over 45), and the rate at which users included tests in agent-generated pull requests climbed from 31% to 52%.

When You Cannot Walk Away

Some tasks have no machine-checkable criterion. A test suite cannot tell you whether a finding aid reads clearly or whether a visualization emphasizes the right dimension of the data. The model will still produce plausible-sounding work. Fluent prose can disguise bad arguments. Clean-looking code can hide bugs.

For that kind of work, keep the agent on a shorter leash. Let it run for a few steps, review what it produced, redirect if needed. The step-by-step approach from the Thinking in Steps guide applies here, with each step small enough to evaluate before the agent moves on.

Goodhart's Law Applies Here Too

Goodhart's Law says that when a measure becomes a target, it ceases to be a good measure. A coding agent that modifies the test suite to make tests pass, rather than fixing the code the tests were meant to check, is gaming the metric. If the assertions you write are too narrow or too easy to satisfy trivially, the agent will satisfy them without solving the underlying problem.
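One partial guard against the test-editing failure is to fingerprint the test files before the agent runs and compare afterwards. The sketch below is an assumption about how you might do that, with invented function names and a temporary file standing in for a real test suite. It catches only this one kind of gaming; it does nothing about assertions that are too narrow to begin with.

```python
import hashlib
import tempfile
from pathlib import Path


def fingerprint(paths):
    """Map each file path to the SHA-256 digest of its contents."""
    return {str(p): hashlib.sha256(Path(p).read_bytes()).hexdigest() for p in paths}


def changed_files(before, after):
    """Return paths whose digest differs between the two snapshots."""
    return sorted(
        path for path in before.keys() | after.keys()
        if before.get(path) != after.get(path)
    )


# Demonstration with a throwaway file standing in for a real test suite.
with tempfile.TemporaryDirectory() as workdir:
    test_file = Path(workdir) / "test_normalize.py"
    test_file.write_text("assert normalize(' a ') == 'a'\n")
    before = fingerprint([test_file])

    # Simulate an agent weakening the test instead of fixing the code.
    test_file.write_text("assert True\n")
    after = fingerprint([test_file])

    tampered = changed_files(before, after)

print(f"{len(tampered)} test file(s) changed during the run")
```

If the list is non-empty, a passing test run should not be trusted until you have looked at what changed.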

The Same Pattern, Deeper Down

This is also how the models themselves are trained. Researchers at DeepSeek discovered that training language models with verifiable reward signals, where the reward depends on whether code compiles or whether a math problem produces the correct answer, yields stronger reasoning than training with human preference ratings. Human preferences are noisy and gameable; a compiler is not. DeepSeek's R1 model (released January 2025) saw its accuracy on a standard math benchmark jump from 15.6% to 77.9% during the reinforcement learning phase.

Before handing work to an agent, figure out what "done" looks like in terms the machine can check. For a data migration, that might be validation rules for the target schema. For a script, a handful of test cases. If you cannot describe the goal precisely enough for a machine to evaluate it, you have to stay at the keyboard.