Agentic Coding

Thinking in Steps

How to break a complicated task into inspectable, verifiable stages.

A Working Sequence

The model does not know what "correct" means in your context. You do, but that knowledge is usually implicit: you recognize a bad date or a mismatched record when you see one, without being able to hand someone a checklist in advance. You make that implicit knowledge explicit by breaking the task into smaller steps. Most useful work of this kind follows the same sequence:

  1. Inspect the data to understand the problem
  2. Define what "fixed" looks like
  3. Work through the task piece by piece
  4. Verify the result strategically

Each step in this order produces something small enough to check before moving on. The tool will not invent the right checkpoints for you, and without that structure you tend to discover mistakes only after the whole job is finished.

Division of labor

For a data cleaning task, you might decide that dates need to be standardized to YYYY-MM-DD and duplicates should be removed by matching on DOI (Digital Object Identifier, the unique ID assigned to scholarly articles). The tool writes the script, runs it, and shows you the result. You then decide whether the duplicate count looks plausible and whether the changed rows make sense.
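
As a rough illustration of the kind of script the tool might write for those two decisions, here is a minimal sketch. It assumes rows are already loaded as dictionaries, that the DOI lives in a hypothetical `doi` column, and that source dates appear only in the few formats listed; the function names are illustrative, not what any particular tool will produce.

```python
from datetime import datetime

# Assumed input formats; a real export may well contain others.
KNOWN_FORMATS = ["%Y-%m-%d", "%m/%d/%Y", "%d %B %Y", "%B %d, %Y"]


def standardize_date(value):
    """Return the date as YYYY-MM-DD, or None if no known format matches."""
    text = value.strip()
    for fmt in KNOWN_FORMATS:
        try:
            return datetime.strptime(text, fmt).strftime("%Y-%m-%d")
        except ValueError:
            continue
    return None


def dedupe_by_doi(rows):
    """Keep the first row for each DOI; rows with a blank DOI are all kept."""
    seen = set()
    kept = []
    for row in rows:
        # DOIs are case-insensitive, so compare them in lowercase.
        doi = row.get("doi", "").strip().lower()
        if doi and doi in seen:
            continue
        if doi:
            seen.add(doi)
        kept.append(row)
    return kept
```

Returning None for an unrecognized date, rather than guessing, leaves the judgment about that row with you.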

A Method for Breaking Down a Task

This approach works for spreadsheets, batches of files, images, or any other repeatable task where intermediate checking matters.

Step 1: Inspect the file you have

Before you plan anything, look at the data. Exports are often messier than memory suggests. Plan from the file in front of you. A good early habit is to work on a copy, not the original. If a step goes wrong you want to be able to start that step over from clean data, and the easiest way to guarantee that is to keep the source file untouched.

What to ask the tool

"I have a file called acquisitions_2024.csv. Take a look at it and tell me what you see. How many rows? What are the columns? Are there obvious issues like missing values, inconsistent formatting, or duplicate rows?"

The goal at this stage is reconnaissance: an accurate picture of the problem before you commit to a method. If a column you planned to use as a unique key turns out to have blanks in half the rows, it is better to know that now.
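
The tool will usually write its own inspection code, but it can help to know roughly what that code looks like. The sketch below uses only Python's standard library; the function name `inspect_csv` and the shape of the summary are assumptions made for illustration.

```python
import csv
from collections import Counter


def inspect_csv(path):
    """Summarize row count, columns, blank cells per column, and exact duplicate rows."""
    with open(path, newline="", encoding="utf-8") as handle:
        reader = csv.DictReader(handle)
        columns = reader.fieldnames or []
        rows = list(reader)

    # Count cells that are empty or whitespace-only, per column.
    blanks = Counter()
    for row in rows:
        for column in columns:
            if not (row.get(column) or "").strip():
                blanks[column] += 1

    # Count rows that are exact copies of an earlier row.
    row_counts = Counter(tuple(row.get(c) for c in columns) for row in rows)
    duplicates = sum(count - 1 for count in row_counts.values())

    return {
        "rows": len(rows),
        "columns": columns,
        "blank_cells": dict(blanks),
        "duplicate_rows": duplicates,
    }
```

Calling `inspect_csv("acquisitions_2024.csv")` would return a dictionary you can read at a glance. It only finds exact duplicates; near-duplicates need the kind of matching rules discussed later.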

Step 2: Define what "done" looks like

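Before you start fixing anything, write down what the finished product should look like as a short list of checkable statements, such as "every date is in YYYY-MM-DD format" or "no two rows share a DOI."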

These statements are your test criteria. If you cannot define completion clearly enough for someone else to check the work, the tool cannot target it either. You will either stop too early or keep tweaking long after the work is done.
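
One way to make completion criteria checkable is to express each as a small test. The sketch below assumes two hypothetical criteria borrowed from the earlier data-cleaning example (every date is YYYY-MM-DD, no DOI appears twice) and hypothetical column names `date` and `doi`; your own criteria will differ.

```python
import re

DATE_PATTERN = re.compile(r"\d{4}-\d{2}-\d{2}")


def check_done(rows):
    """Return a list of readable failures; an empty list means every criterion passed."""
    failures = []
    seen_dois = set()
    for number, row in enumerate(rows, start=1):
        # Criterion 1 (assumed): every date has the shape YYYY-MM-DD.
        if not DATE_PATTERN.fullmatch(row["date"]):
            failures.append(f"row {number}: date {row['date']!r} is not YYYY-MM-DD")
        # Criterion 2 (assumed): no DOI appears more than once.
        doi = row["doi"].strip().lower()
        if doi:
            if doi in seen_dois:
                failures.append(f"row {number}: duplicate DOI {doi}")
            seen_dois.add(doi)
    return failures
```

The pattern checks only the shape of the date, not whether it is a real calendar date, so a value like 2024-13-45 would still pass.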

Step 3: Identify the sequence

Now determine the order in which things need to happen, since some steps depend on others and getting the sequence wrong usually means redoing work later.

Example: Cleaning a Donor Records Spreadsheet

You have 3,000 rows of donor contact records exported from an old system, and the data is going into a new CRM (Customer Relationship Management system, the database that tracks donor interactions). Here is one workable sequence:

  1. Assess the damage. How many rows have missing email addresses? How many phone numbers are in different formats? Are there obvious duplicates? (No output file yet; this step is just reconnaissance.)
  2. Deduplicate. Find rows that refer to the same person (matching on name + address, or name + email). Keep the most complete record. Save the result as donors_deduped.csv.
  3. Standardize formats. Starting from donors_deduped.csv, normalize phone numbers to (XXX) XXX-XXXX, states to two-letter abbreviations, zip codes to five digits. Save as donors_standardized.csv.
  4. Validate. Are email addresses formatted correctly? Are zip codes real? Flag anything suspicious for manual review. Save as donors_validated.csv with a separate review_flags.csv.
  5. Enrich. For records missing a state, look it up from the zip code. For records missing a zip code, look it up from the city and state. Save as donors_enriched.csv.
  6. Export. Convert the cleaned data to the format the new CRM expects, with a separate file listing everything flagged for review.

The order matters here. Deduplication has to come first because two records that look different on the surface may turn out to refer to the same person once you compare name and address. And validation belongs after standardization, since a phone number that looked malformed at the start may be perfectly valid once the formatting is normalized. Naming each output file also makes it possible to restart a step from known-good data if something goes wrong partway through.
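
The deduplication step in this sequence ("keep the most complete record") could be sketched roughly as follows. The column names `name` and `email` are hypothetical, and "most complete" is assumed to mean "most non-blank fields"; a real donor file may need a different definition.

```python
def completeness(row):
    """Count the non-blank fields in a row."""
    return sum(1 for value in row.values() if (value or "").strip())


def dedupe_keep_most_complete(rows, key_columns=("name", "email")):
    """Group rows by a normalized key and keep the most complete row per group."""
    best = {}
    unmatched = []
    for row in rows:
        key = tuple((row.get(c) or "").strip().lower() for c in key_columns)
        # Rows missing part of the key cannot be matched safely, so keep them all.
        if not all(key):
            unmatched.append(row)
            continue
        if key not in best or completeness(row) > completeness(best[key]):
            best[key] = row
    return list(best.values()) + unmatched
```

Rows with a blank name or email are passed through untouched and moved to the end of the output, on the assumption that merging them would be riskier than leaving them for review.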

Step 4: Describe each step to the tool

Give the tool one step at a time, each small enough that you can inspect the result before moving on.

A useful prompt for Step 3 from the example above

"In donors_deduped.csv, standardize the phone numbers in the phone column to (XXX) XXX-XXXX format. If a phone number has a country code, keep it as a prefix. If the value doesn't look like a phone number at all, leave it unchanged and add the row number to a list of rows that need manual review. Save the result as donors_standardized.csv and save the review list as phone_review.csv."

Notice how the prompt names its input (donors_deduped.csv, the output of step 2), the specific column, the target format, the edge-case behavior, and the output files. When you have asked for something specific, you can tell whether you got it.
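
A prompt that specific maps almost line by line onto code. The sketch below is one possible reading of it, not the script the tool would necessarily write. Several details are assumptions: a country code is taken to be any one to three digits before the last ten and is written with a leading "+", values containing letters (such as extensions) are left for review, and blank cells are flagged too.

```python
import re


def standardize_phone(value):
    """Return (result, ok). Unrecognized values come back unchanged with ok=False."""
    if re.search(r"[A-Za-z]", value):
        return value, False  # extensions or stray text: leave for a person to review
    digits = re.sub(r"\D", "", value)
    if len(digits) < 10 or len(digits) > 13:
        return value, False
    country, national = digits[:-10], digits[-10:]
    formatted = f"({national[:3]}) {national[3:6]}-{national[6:]}"
    if country:
        formatted = f"+{country} {formatted}"  # assumed prefix style
    return formatted, True


def standardize_rows(rows, column="phone"):
    """Standardize one column in place and return the row numbers needing review."""
    review = []
    for number, row in enumerate(rows, start=1):
        new_value, ok = standardize_phone(row[column])
        row[column] = new_value
        if not ok:
            review.append(number)
    return review
```

The returned list of row numbers corresponds to the contents of phone_review.csv in the prompt.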

How large should a step be?

A step is the right size when you can verify its output independently. "Deduplicate and standardize and validate" is too large; if the output is wrong, you will not know which part failed. "Standardize the phone column to (XXX) XXX-XXXX format" is verifiable because you can scan the column and judge the result. When in doubt, make the step smaller.

The all-at-once trap

It is tempting to describe the whole pipeline in one prompt and let the tool sort it out. Sometimes that works. For anything with more than a few steps, however, you lose the ability to inspect intermediate results. If step 3 goes wrong, you may not notice until the end. Working one step at a time is slower but much more reliable.

Verifying the Output

The tool can write code, run it, and produce output, but whether that output is correct in your domain is a judgment only you can make. You know what plausible donor counts look like, which date ranges are reasonable, or when a zip code belongs to the wrong state. The tool does not, unless you tell it.

Domain knowledge as verification

In software development, automated tests catch regressions. Here, the equivalent is your own familiarity with the material. You know a publication date of 2087 is wrong, or that a zip code with three digits needs inspection. These are the kinds of checks worth building into your prompts explicitly, since the model will not flag them on its own.

Verification strategies

You do not need to review every row. The fastest check is often a row count. How many rows went in, and how many came out? If you started with 3,000 records and ended with 2,847, that is 153 duplicates removed. Does that sound plausible? If 500 rows had disappeared instead, something probably went wrong. After the counts look right, spot-check the rows most likely to be wrong: the ones with missing data, unusual values, or edge cases. If the tool handled those correctly, the straightforward cases are almost certainly fine.
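
The row-count check is simple arithmetic, and it can be written down so that it runs after every step. In the sketch below, the 10% threshold is an assumption chosen only so that the two cases above fall on opposite sides of it; what counts as plausible depends on your data.

```python
def check_row_counts(rows_in, rows_out, max_removed_fraction=0.10):
    """Compare input and output row counts against a plausibility threshold."""
    removed = rows_in - rows_out
    fraction = removed / rows_in if rows_in else 0.0
    return {
        "removed": removed,
        "fraction": fraction,
        # A negative fraction means rows were added, which is also worth a look.
        "plausible": 0 <= fraction <= max_removed_fraction,
    }
```

With 3,000 rows in and 2,847 out, this reports 153 removed (about 5%) and marks the result plausible; with 2,500 out, it does not.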

Picking a handful of rows at random and inspecting them manually is unglamorous but catches things automated checks miss. Another useful habit is to ask the tool, after each step, to summarize what it did: "How many rows were changed? What were the most common issues? Show me a few examples of changes you made." That summary often reveals problems faster than scanning the raw data yourself.
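
Drawing the random sample is itself something you can ask the tool to do. A minimal sketch, assuming rows are held in a list; the fixed seed is a design choice that makes the same rows come back each time, so you can recheck them after a rerun.

```python
import random


def sample_rows(rows, k=10, seed=0):
    """Return up to k (row_number, row) pairs chosen at random, in file order."""
    numbered = list(enumerate(rows, start=1))
    rng = random.Random(seed)  # fixed seed so the sample is repeatable
    picked = rng.sample(numbered, min(k, len(numbered)))
    return sorted(picked, key=lambda pair: pair[0])
```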

Build verification into the process

You can ask the tool to add a changes_made column that logs what was modified in each row. That way, you can sort by that column and quickly see everything that was touched. Rows that were left alone don't need your attention.
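
A change log of this kind might be implemented along these lines. This is a sketch under assumptions: rows are dictionaries, each cleaning rule is a function applied to one column, and the log format (old value, arrow, new value) is invented for illustration.

```python
def apply_with_log(row, column, transform):
    """Apply transform to one column and record any change in changes_made."""
    old = row.get(column, "")
    new = transform(old)
    if new != old:
        note = f"{column}: {old!r} -> {new!r}"
        existing = row.get("changes_made", "")
        # Append to earlier notes so several steps can log to the same row.
        row["changes_made"] = f"{existing}; {note}" if existing else note
        row[column] = new
    else:
        # Untouched rows get an empty entry so the column exists in every row.
        row.setdefault("changes_made", "")
    return row
```

Sorting the output on changes_made then puts every untouched row (empty entry) together and every modified row together.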

When Things Go Wrong

The first run rarely produces perfect results. The next thing to do is find where it went wrong. Because each step has its own output, you can locate the problem, fix the description for that step, and rerun it without disturbing the rest of the pipeline.

Common failure modes

Whatever the failure, the recovery path is the same: go back to the step that misfired, describe it more precisely, and rerun from there. Named output files for each stage make this practical, since you can always pick up from the last good file and continue forward.

From One-Off Work to a Repeatable Process

Once you have worked through a problem step by step, you have a procedure you can reuse. Save the prompts you used. Next quarter, when similar data arrives, you can run the same steps with the new file.

How a pipeline develops

Messy problem → Break into steps → Run + verify each → Repeatable process

The first run requires thought. Subsequent runs follow the same steps with new data.

Once every step has been verified individually, you can ask the tool to combine them into a single script that runs the whole pipeline at once. If you automate a pipeline before testing it step by step, all the errors arrive at once instead of one at a time. That makes them harder to fix.
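
A combined script can still keep the per-step visibility that made the manual process reliable. The sketch below is one hypothetical shape for it: each verified step is a function that takes and returns a list of rows, and the runner records row counts before and after each one.

```python
def run_pipeline(rows, steps):
    """Run each (name, function) step in order and report row counts per step."""
    report = []
    for name, step in steps:
        before = len(rows)
        rows = step(rows)
        report.append((name, before, len(rows)))
    return rows, report
```

A call might look like `run_pipeline(rows, [("dedupe", dedupe), ("standardize", standardize)])`, where `dedupe` and `standardize` stand in for whatever functions your verified steps became. The report gives you the same row-count check at every stage that you performed by hand.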

Preserve the prompts that worked

Keep a document with the prompts you used for each step. When the next batch arrives, you can hand those same prompts to the tool with the new file name. If the data has changed slightly, you adjust the relevant step rather than starting over.

If you are using Claude Code and find yourself reusing the same pipeline regularly, the skills guide covers how to package instructions as a skill.

When Not to Use a Pipeline

A self-contained job ("convert this CSV to JSON," "add a column that combines first and last name") or an exploratory question ("what's in this dataset?") can just be a single request. (JSON, short for JavaScript Object Notation, is a structured text format that many web tools and platforms expect as input.) The pipeline approach is worth the overhead when the task has multiple stages that depend on one another. It is also worth the overhead when a wrong final output would be costly, because you want to catch problems before the end.

A Complete Example

Scenario: Preparing Metadata for a Digital Exhibit

You are building a digital exhibit from a collection of 200 photographs. You have a spreadsheet with basic information (filename, rough date, short description), but you need richer metadata for the exhibit platform. Here is the pipeline:

  1. Audit the existing data. "Look at photos_metadata.csv. Tell me which columns are mostly complete and which have a lot of gaps. Are any filenames missing or duplicated?"
  2. Standardize dates. "The date column has entries like 'circa 1920', 'early 1900s', '1923', and 'undated'. Create two new columns: date_start and date_end as years. For 'circa 1920', use 1915-1925. For 'early 1900s', use 1900-1910. For exact years, use the same year for both. For 'undated', leave both blank."
  3. Generate subject headings. "Based on the description column, suggest 2-3 Library of Congress Subject Headings for each photo. Put them in a new lcsh_suggested column, separated by semicolons. I will review these suggestions before assigning final headings."
  4. Check file references. "Verify that every filename in the filename column actually exists in the /photos/ folder. List any mismatches."
  5. Export for the exhibit platform. "Convert the CSV to the JSON format that Omeka S expects, with Dublin Core field mappings. Save it as exhibit_import.json."

Each of these steps can be a single conversation with the tool. Between steps, you check the output. You ask whether the date ranges make sense, whether the suggested subject headings are reasonable, whether the file references match. You don't move on until those checks pass.
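
To check whether the date ranges make sense, it helps to see how the rules in step 2 might translate into code. The sketch below follows the four rules from the prompt. Two things are assumptions: that "circa" and "early" generalize to other years the same way as the examples given, and that anything unrecognized should be reported rather than guessed at.

```python
import re


def parse_date_range(text):
    """Return (date_start, date_end, recognized) for a free-text date entry."""
    value = text.strip().lower()
    if value == "undated":
        return None, None, True
    if re.fullmatch(r"\d{4}", value):
        year = int(value)
        return year, year, True
    circa = re.fullmatch(r"circa (\d{4})", value)
    if circa:
        year = int(circa.group(1))
        return year - 5, year + 5, True
    # Assumption: "early NNNNs" always spans ten years, as the 'early 1900s' rule does.
    early = re.fullmatch(r"early (\d{4})s", value)
    if early:
        start = int(early.group(1))
        return start, start + 10, True
    # Anything else is left blank and marked for manual review.
    return None, None, False
```

The third value separates entries that are deliberately blank ("undated") from entries the rules did not cover, which are the ones to inspect by hand.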

The descriptions you write for each step have to be specific. You should be able to look at the output and say whether the step succeeded. They should also still make sense when you come back to the same task next quarter with fresh data.
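
The file-reference check in step 4 is an example of a step whose success is easy to judge, because it reduces to comparing two sets of names. A minimal sketch, assuming the filenames have already been read from the spreadsheet into a list; the function name and result keys are illustrative.

```python
from pathlib import Path


def compare_files(filenames, folder):
    """Compare filenames listed in the spreadsheet with files present in a folder."""
    listed = {name.strip() for name in filenames if name.strip()}
    present = {p.name for p in Path(folder).iterdir() if p.is_file()}
    return {
        "missing_from_folder": sorted(listed - present),
        "not_in_spreadsheet": sorted(present - listed),
    }
```

The comparison is case-sensitive, so `IMG_001.JPG` and `img_001.jpg` would be reported as a mismatch; whether that is the behavior you want is a decision to state in the prompt.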

Further Reading