Data Workflows

Working with Tabular Data

How to use an AI coding tool to inspect, validate, and transform tables before analysis, reporting, or row-by-row processing.

The First Useful Prompt

A table can look orderly while still being hard to understand. Dates might mix formats and IDs that should stay text are read as numbers. Summary rows are mixed in with ordinary records without any distinguishing marker. If the first prompt is "clean this dataset," the model starts guessing almost immediately about what the rows mean and what kind of changes are acceptable.

The opening prompt should be narrow. Ask for a profile of the file, meaning a quick account of its shape and obvious trouble spots, along with likely keys, structural oddities, and anything that cannot be inferred from the table alone. Once you have that account, the tool can write explicit import code, generate validation checks, reshape the table, and document what changed along the way. Without the profile, the tool produces code that makes assumptions you have not checked.

The operational rule

Profile first, transform second. Without a profile, the tool still produces plausible output, but you do not know what assumptions it made.

Source File
Profile
Explicit Import
Validation
Transform
Handoff

Division of Labor

Some of the work is mechanical: profiling, reshaping, writing validation checks. Some of it is interpretive: deciding what rows represent, which blanks matter, whether two labels refer to the same thing. Keep the two separate, and you will notice more quickly when the tool starts mixing them up.

The Agent Does

  • Profiles the table and summarizes suspect columns
  • Writes explicit import code and repetitive transforms
  • Generates validation checks and comparison reports
  • Reshapes tables and documents what changed

You Do

  • Decide what the rows and columns represent
  • Define missing values and meaningful distinctions
  • Judge whether keys, joins (combining rows from two tables on a shared column), and categories are valid
  • Inspect the output against the source and the question

Save These

  • A dataset profile
  • Import code or import settings
  • Validation checks and summary outputs
  • A cleaned intermediate file

A profile does not need to be elaborate to be useful. It can be as simple as noticing that the file has 14,382 rows and 12 columns, that record_id looks like the identifier, that deposit_date mixes dates and blanks, that orcid should probably stay text, that the year columns suggest wide-form layout (separate columns for 2022, 2023, 2024 rather than a single "year" column), and that the last three rows look more like summaries than ordinary records.

Starting Prompt

"Read this CSV and give me a structural profile: row count, columns, likely identifier fields, likely missing-value markers, suspect type issues, and anything that looks structurally inconsistent. Do not modify the file yet. Tell me which decisions still require domain judgment from me, and explain any technical term you use."

The Workflow

Profile the table before editing it

The first pass should establish the table's basic shape: number of rows and columns, column names, obvious identifiers, visible missing-value markers, and columns that appear to contain mixed types. At this stage, description matters more than transformation.

If you prefer to start visually, spreadsheet tools and OpenRefine are both useful here, but the next prompt should rest on an inspection of the file, whatever tool produced it.

Decide what the table represents

Sometimes the table is structurally wrong, not just textually dirty. A spreadsheet that mixes annual totals, institution metadata, notes, and summary rows in one sheet is holding several different kinds of records at once. Hadley Wickham's "tidy data" framework gives you a test for this: each row should be one observation, each column one variable, each table one kind of observational unit. Does this file pass?

The tool can flag a structural problem, but deciding whether two columns represent competing labels for the same thing or different kinds of information still requires someone who knows the domain.

Write explicit import code

Import is interpretation. The moment a CSV is loaded, somebody is deciding whether a numeric-looking identifier should remain text, whether a blank cell is missing or merely empty, whether a malformed row should halt the process, and whether dates should be parsed at all. Libraries like pandas (a widely used Python data tool) or R's readr make these choices visible if you ask for them explicitly, which is why the agent should be asked to write the explicit version, spelling out each decision in the import code rather than leaving type inference to chance.

Generate checks before and after transformations

Every transformation should come paired with validation: row counts, null counts, uniqueness checks, comparison reports for changed columns, and summaries of unmatched records after joins. When the AI tool is writing and running code on your behalf (sometimes called "agentic" use, where the tool acts semi-autonomously), the validation checks are at least as important as the transformation code. The tool will not write checks for itself unless you ask.

Transform only what you can describe

Standardization, reshaping, and joining all depend on named rules. "Standardize this column" is often too vague, whereas "trim whitespace, convert title case to sentence case, preserve acronyms, and show me fifteen changed values before applying the transformation to the whole column" gives the agent a procedure it can follow. The prompt has to name specific operations.

A Small Import Example

The snippet below is Python, but the same decisions exist in R, Excel Power Query, or any tool that reads a CSV. Which columns are text? Which blanks count as missing? What should happen when a row is malformed? If you do not usually write code yourself, read it as the kind of explicit import instruction you want the agent to generate for you. Each line names a decision the tool would otherwise make without telling you.

python
import pandas as pd

df = pd.read_csv(
    "repository_export.csv",
    dtype={
        "record_id": "string",
        "orcid": "string",
        "issn": "string",
    },
    parse_dates=["deposit_date"],
    na_values=["", "NULL", "Unknown", "N/A"],
    keep_default_na=True,
    on_bad_lines="error",
)

profile = {
    "rows": len(df),
    "columns": list(df.columns),
    "null_counts": df.isna().sum().sort_values(ascending=False).to_dict(),
    "duplicate_record_ids": int(df["record_id"].duplicated().sum()),
}

Prompt Pattern

Ask the tool to explain the import decisions in plain language before running them. A vague answer means you should look more carefully at the code itself.

Where Inference Breaks Down

During import, explicit instructions stop the tool from misreading types and missing values. The same problem appears at every later stage of the workflow. The tool can detect patterns but not always tell what they mean. Below are some common ways the tool gets it wrong without warning you.

Common Inference Failures

Validation Before Confidence

The workflow above treats validation as a numbered step, but in practice it runs alongside every transformation. Ask the tool to generate checks before and after each substantial change, including row counts, null counts, duplicate-key checks, controlled-value summaries, and reports on any join (combining two tables by matching on a shared column). These checks take seconds, and they force you to look at what changed before moving on. Without them, you have no way to tell whether the data was corrupted.

Validation Prompt

"Write the transformation, but also write validation checks: row counts before and after, duplicate checks for record_id, null counts for the date columns, and a report of values changed in institution_name. Save the cleaned file separately from the source file."

Join Warning

When you join two tables, you are making an analytical claim about how the rows in table A relate to the rows in table B through a shared column (one-to-one, one-to-many, etc.). Ask the tool for unmatched-record summaries and row-count comparisons before and after. If the join multiplies rows unexpectedly, the relationship between the tables needs more investigation before you proceed.

Case Study: A Repository Export

The problem

A repository export arrives from several reporting systems at once. Dates are mixed, identifiers have been read as numbers, institution names vary across rows, and annual counts are spread across multiple year columns. The file is usable once you make the structure explicit.

Before

record_id
title
institution
institution_2
orcid
deposit_date
embargo_end
2022_downloads
2023_downloads
2024_downloads
note
total

After

record_id
title
institution_name
orcid
deposit_date
embargo_end
year
downloads
source_note
validation_flag

The workflow

  1. Ask the tool for a profile of the source file and a list of structural problems.
  2. Write explicit import code that preserves identifiers as text and names local missing-value markers.
  3. Generate a missing-data summary and duplicate-key check.
  4. Normalize institution names, but only after reviewing examples of proposed changes.
  5. Reshape the year columns into long format (one row per year per record, with a single "downloads" column) and generate a row-count comparison.
  6. Save the cleaned file separately, along with the validation outputs.

Save the validation outputs and the import code alongside the cleaned table. Six months from now, when the next export arrives, you (or a colleague) can reproduce the same process without starting over.

After the Table Is Clean

Once the table is profiled, cleaned, and saved as a separate intermediate file, you can move on to analysis with ordinary tools, an automated reporting pipeline, or row-by-row processing when each record needs interpretation rather than structural repair. Keep structural cleanup separate from record-level interpretation. They are different problems, and mixing them makes both harder.

Next steps

Once the table is legible, row-by-row processing applies when individual records need judgment. Consider automation when nothing about the structure or the criteria has changed in several runs.

Further Reading