Working with APIs and External Data

How to read API documentation, test small requests, save responses, and keep outside data inspectable.

The First Constraint

With a local file, you can see the data in front of you. With an API (Application Programming Interface, a structured way for one piece of software to ask another for data or a service), the same information is split across the documentation, the request you sent, and the response you got back. The agent can move from question to script in seconds, which is useful until it skips past a wrong parameter (a setting attached to the request that controls what comes back), a partial result, or a page-one-only answer. The workflow in this chapter keeps the request, the response, and the documentation visible so you can inspect them together.

The operational rule

Read the docs, make a small request, save the response.

Documentation → Small Request → Saved Response → Transform → Handoff

Who Checks What

Let the agent handle the repetitive parts: drafting requests, parsing responses, writing transformation code. Judging whether you are asking the right question, and whether the returned data answers it, stays with you. The agent cannot do that for you.

The Agent Does

  • Locate the current docs and summarize the relevant endpoint (the specific URL you will call)
  • Draft small test requests and starter scripts
  • Explain where authentication, pagination, and retries are handled
  • Transform saved responses into cleaned intermediate files

You Do

  • Confirm that the docs are current and match the task
  • Decide where credentials belong and what should stay out of logs
  • Inspect the returned data before trusting downstream output
  • Judge whether the resulting data still answers the original question

Save These

  • The documentation URL you used
  • A small raw response sample, plus any response headers (pagination links, rate-limit counts) worth revisiting
  • A cleaned intermediate file such as JSON (a common structured-text format most APIs use) or CSV
  • A short note on pagination, rate limits, and authentication

Begin with the Documentation

The first request should follow the docs. That sounds obvious, but agents can draft code before anyone has read the docs carefully. Good API documentation covers more than the endpoint itself. It also specifies pagination, error handling, rate limits, and authentication. The first call is rarely sufficient. You will need to handle paging, errors, and limits as part of the same workflow.

Starting Prompt

"Find the current documentation for this API and identify the request that best matches my question. Tell me what the request needs, what I should verify before writing a full script, and which documentation page you are relying on. Do not build the whole workflow yet. Define any technical term you use."

Locate the current documentation

Ask for the live docs, the exact endpoint (the URL your code will call), the required parameters, and the shape of what comes back. Save the documentation URL in the session or project notes so you can return to it when the response surprises you.

Make the smallest request that can still prove the path

Request one record, one page, one narrow filter, or one grouped result. This establishes that the endpoint, parameters, and auth flow are real before you write the wider pipeline.
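
A minimal sketch of that first narrow request, assuming a hypothetical endpoint at https://api.example.org/records that accepts filter and per_page parameters. Both names are placeholders; take the real ones from the documentation you saved. The sketch only builds the URL, so you can compare it against the docs before anything is sent.

```python
from urllib.parse import urlencode

# Hypothetical endpoint and parameter names; replace them with the ones in your API's docs.
BASE_URL = "https://api.example.org/records"


def build_small_request(filter_value, per_page=1):
    """Return the URL for the narrowest request that still exercises the endpoint."""
    params = {"filter": filter_value, "per_page": per_page}
    return f"{BASE_URL}?{urlencode(params)}"


url = build_small_request("year:2024")
print(url)
```

Printing the URL first gives you something concrete to paste into a browser or hand to the agent for review before a script starts looping.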

Save the raw response before you transform it

Write the response to disk. If response headers (the metadata the server sends alongside the data, such as pagination links or rate-limit counters) carry information you will need later, save or inspect those too.
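
One way to do this, sketched with the standard library only. The function name, file naming scheme, and the list of headers worth keeping are assumptions for illustration, not a convention from any particular API. The body is written untouched, and a small sidecar file records the URL, status, and selected headers.

```python
import json
import tempfile
from datetime import datetime, timezone
from pathlib import Path

# Example header names only; keep whichever ones your API's docs tell you to watch.
KEEP_HEADERS = {"link", "retry-after", "x-ratelimit-remaining"}


def save_raw_response(out_dir, name, url, status, headers, body_text):
    """Write the untouched body plus a sidecar file describing the request."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)

    body_path = out_dir / f"{name}.body.txt"
    body_path.write_text(body_text, encoding="utf-8")

    meta = {
        "url": url,
        "status": status,
        "saved_at": datetime.now(timezone.utc).isoformat(),
        "headers": {k: v for k, v in headers.items() if k.lower() in KEEP_HEADERS},
    }
    meta_path = out_dir / f"{name}.meta.json"
    meta_path.write_text(json.dumps(meta, indent=2), encoding="utf-8")
    return body_path, meta_path


# Demonstration with made-up values, written to a temporary folder.
with tempfile.TemporaryDirectory() as tmp:
    body_path, meta_path = save_raw_response(
        tmp,
        "first-request",
        "https://api.example.org/records?per_page=1",
        200,
        {"Link": '<https://api.example.org/records?page=2>; rel="next"', "Server": "demo"},
        '{"results": []}',
    )
    saved_body = body_path.read_text(encoding="utf-8")
    saved_meta = json.loads(meta_path.read_text(encoding="utf-8"))
```

Keeping the body byte-for-byte as received means the saved file can settle later arguments about what the API returned.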

Build the transformation from the saved sample

Once the returned data is visible, the agent can write extraction and cleaning code against real field names and real nesting, including the null values that documentation rarely warns you about. Documentation can describe the response inaccurately. A saved sample shows what the API returns.
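
A small sketch of extraction code written against a sample. The sample here is invented for illustration, and its field names (id, title, venue) are placeholders for whatever your saved response contains. The point is the handling of nulls at both the field level and the nested-object level.

```python
# Hypothetical saved sample; in practice, load this from the raw file on disk.
SAMPLE = {
    "results": [
        {"id": "W1", "title": "First", "venue": {"name": "Journal A"}},
        {"id": "W2", "title": None, "venue": None},
    ]
}


def flatten(record):
    """Pull out the fields the analysis needs, tolerating nulls at any level."""
    venue = record.get("venue") or {}  # a null venue becomes an empty dict
    return {
        "id": record["id"],  # required: fail loudly if it is ever missing
        "title": record.get("title") or "",
        "venue_name": venue.get("name") or "",
    }


rows = [flatten(record) for record in SAMPLE["results"]]
```

Treating id as required and everything else as optional is a design choice for this sketch; the saved sample is what tells you which fields deserve which treatment.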

Scale only after the small request holds

Pagination, retries, batching, and scheduled pulls belong later in the sequence. If you add them before the basic request works, they mostly help a broken request reach more data.

Pagination, Rate Limits, and Error States

Most API problems are mundane. The service returns one page at a time, limits how fast you can call it, or sends back an error that the script may not notice.

GitHub's current REST docs, for example, advise clients to use link headers (metadata the server attaches to each response, separate from the data itself) for pagination, follow redirects, serialize requests where needed, and treat repeated 4xx or 5xx status codes as a reason to stop and correct the integration. Those codes are the server's way of reporting trouble. A 4xx means the request itself was wrong (bad URL, missing credentials), while a 5xx means something failed on the server's end.

OpenAlex's paging guide draws a similar boundary. Basic page and per_page work for the first 10,000 results. Cursor paging works beyond that. Larger pulls require the bulk snapshot.

If you cannot explain how the script reaches the whole result set, or why it stops where it does, treat the data as provisional. That may be fine for exploration, but write that note next to the file.

Pagination

Many APIs return only part of the result set per request. GitHub, for instance, exposes next-page URLs in a response header called Link. OpenAlex uses page and per_page parameters for the first 10,000 results, then switches to cursor=* with a next_cursor value for deeper paging. However your API handles paging, the script should show you whether it reached everything.
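
A sketch of a cursor loop that reports whether it reached everything. It assumes an OpenAlex-style response shape with meta.count, meta.next_cursor, and results; fetch_page is a hypothetical function you would write to make the real request for a given cursor. Here it is replaced by an in-memory fake so the loop can be checked without a network call.

```python
def collect_all(fetch_page, max_pages=1000):
    """Follow next_cursor until it runs out; return rows, expected total, and a completeness flag."""
    rows, cursor, expected = [], "*", None
    for _ in range(max_pages):
        page = fetch_page(cursor)
        expected = page["meta"]["count"]
        rows.extend(page["results"])
        cursor = page["meta"].get("next_cursor")
        if not cursor or not page["results"]:
            break
    return rows, expected, len(rows) == expected


# Fake two-page result set standing in for real responses.
FAKE_PAGES = {
    "*": {"meta": {"count": 3, "next_cursor": "abc"}, "results": [{"id": 1}, {"id": 2}]},
    "abc": {"meta": {"count": 3, "next_cursor": None}, "results": [{"id": 3}]},
}

rows, expected, complete = collect_all(FAKE_PAGES.__getitem__)
print(f"collected {len(rows)} of {expected}; complete={complete}")
```

The max_pages cap is a guard against a loop that never ends; if the loop stops because of the cap, the completeness flag comes back False and the data should be treated as provisional.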

Rate limits

Current GitHub guidance recommends serial requests and explicit backoff when rate limits are hit. That is a good general habit. Twenty parallel calls usually make the rate-limit problem worse.
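
A minimal sketch of serial requests with backoff. The send callable is hypothetical: it stands for whatever function makes one request and returns a (status, headers, body) tuple. The sketch assumes the server signals rate limiting with status 429 and, when it sends a Retry-After header, gives it as a number of seconds. Check your API's docs for the actual signals.

```python
import time


def wait_seconds(headers, attempt, base_delay):
    """Use the server's Retry-After hint if it is numeric; otherwise double the wait each attempt."""
    try:
        return float(headers.get("Retry-After"))
    except (TypeError, ValueError):
        return base_delay * (2 ** attempt)


def call_with_backoff(send, max_attempts=5, base_delay=1.0, sleep=time.sleep):
    """Call send() one request at a time, pausing after each 429 before trying again."""
    for attempt in range(max_attempts):
        status, headers, body = send()
        if status != 429:
            return status, body
        sleep(wait_seconds(headers, attempt, base_delay))
    raise RuntimeError(f"still rate-limited after {max_attempts} attempts")


# Demonstration with canned responses and a recording stand-in for time.sleep.
responses = iter([(429, {"Retry-After": "2"}, ""), (429, {}, ""), (200, {}, "ok")])
waits = []
status, body = call_with_backoff(lambda: next(responses), base_delay=0.5, sleep=waits.append)
```

Passing sleep in as a parameter keeps the function testable without real delays; in a real script you would leave it at the default.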

Error handling

A request that completes without a network error can still carry a failure in its status code. The response.raise_for_status() method in Python's requests library and the response.ok property in JavaScript's fetch API both exist to catch this. Left unchecked, the script treats the body of a 403 or 429 error as the first row of your dataset.
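
A standard-library stand-in for the same idea, for cases where you are handling status codes yourself. The ApiError class and require_success function are names invented for this sketch. Anything outside the 2xx range stops the script with a message that includes the status and the start of the body.

```python
class ApiError(RuntimeError):
    """Raised when a response arrives but its status code signals failure."""


def require_success(status, body_text):
    """Return the body for 2xx responses; raise for everything else."""
    if 200 <= status < 300:
        return body_text
    if 400 <= status < 500:
        kind = "client"
    elif status >= 500:
        kind = "server"
    else:
        kind = "unexpected"
    raise ApiError(f"{kind} error {status}: {body_text[:200]}")


ok_body = require_success(200, '{"results": []}')

try:
    require_success(429, '{"message": "rate limit exceeded"}')
    error_message = None
except ApiError as exc:
    error_message = str(exc)
```

Including a slice of the body in the error is deliberate: many APIs explain the problem there, and it is the first thing you will want to read.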

Schema drift

Field names change, null values appear where they did not before, and providers sometimes retire older response formats without warning. Save a raw sample early. Compare new responses against it to catch changes in the data's structure.
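
One possible way to make that comparison, sketched under simple assumptions: both the saved sample and the new response are already parsed JSON, and describing a list by its first element is good enough for a quick check. The function names and the sample records are invented for illustration.

```python
def schema_of(value, path=""):
    """Return {dotted_path: type_name} for every leaf in a JSON-like value."""
    if isinstance(value, dict):
        out = {}
        for key, child in value.items():
            out.update(schema_of(child, f"{path}.{key}" if path else key))
        return out
    if isinstance(value, list):
        # Describe a list by its first element only; enough for a quick drift check.
        return schema_of(value[0], f"{path}[]") if value else {f"{path}[]": "empty"}
    return {path: type(value).__name__}


def drift(old, new):
    """Compare two responses and list fields that were removed, added, or changed type."""
    old_s, new_s = schema_of(old), schema_of(new)
    return {
        "removed": sorted(old_s.keys() - new_s.keys()),
        "added": sorted(new_s.keys() - old_s.keys()),
        "retyped": sorted(k for k in old_s.keys() & new_s.keys() if old_s[k] != new_s[k]),
    }


saved = {"id": "W1", "year": 2020, "venue": {"name": "A"}}
fresh = {"id": "W1", "year": None, "venue": {"title": "A"}, "doi": "x"}
report = drift(saved, fresh)
```

A field that flips from a number to null shows up under "retyped", which is often the first visible sign that a provider changed something.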

Verification Prompt

"Write the script, then explain how pagination is handled, how error responses are surfaced, and which part of the docs justifies that logic. Save a raw response sample before you write the transformation."

Where the Request Should Run

Code running in a web page cannot call every outside service freely. Browsers enforce a policy called CORS (Cross-Origin Resource Sharing) that blocks requests to servers that have not explicitly opted in; MDN's current guide explains the mechanics, and Your First Web App introduces the static-versus-backend distinction more broadly. The question that comes up in practice is simpler. Where should this particular request run? The answer depends on what the API requires and what you need to keep private.

Browser

Use for public, inspectable experiments

  • The API is public and no secret is involved
  • The provider permits cross-origin requests
  • The response is modest
  • The result can be inspected in the page or browser tools

Local Script

Use for reproducible pulls and saved artifacts

  • You want raw responses and cleaned files on disk
  • You need transformation before the result is useful
  • You expect to rerun the workflow
  • You are the only user of the result

Backend

Use when credentials or shared state enter the picture

  • The request needs a token or key that should stay private
  • Several users depend on the same live result
  • You need server-side caching or coordination
  • The browser should not call the service directly

Prompt Pattern

Ask the agent to justify the execution location after the request itself is stable. That question is most useful once you know what is being called, what it returns, and what has to remain private.

Case Study: OpenAlex Yearly Counts

The question

Suppose you want yearly open access counts for a small visualization. You could start pulling records and see what happens. OpenAlex's docs show a faster method. Their grouping guide shows that group_by returns grouped counts in a group_by array, and their Open Access Trends by Year recipe applies that directly to publication_year.

This answers the question without retrieving millions of works. Two grouped requests are enough: one for all works by year, and one for open access works by year. The script can save both raw responses, combine them into a small CSV, and hand that file off to the visualization stage. The companion project The Rise of Open Access follows that pattern directly.

```python
import csv
import json
import os
from pathlib import Path

import requests

BASE_URL = "https://api.openalex.org/works"
DATA_DIR = Path("data")
RAW_DIR = DATA_DIR / "raw"
RAW_DIR.mkdir(parents=True, exist_ok=True)

# Optional: read the key from the environment so it never appears in the source file.
api_key = os.getenv("OPENALEX_API_KEY")


def fetch_grouped_counts(filter_value):
    """Request per-year counts for the works matching filter_value."""
    params = {
        "filter": filter_value,
        "group_by": "publication_year",
    }
    if api_key:
        params["api_key"] = api_key

    response = requests.get(BASE_URL, params=params, timeout=30)
    response.raise_for_status()  # stop on 4xx/5xx instead of parsing an error body
    return response.json()


# Two grouped requests: all works by year, then open access works by year.
totals = fetch_grouped_counts("publication_year:2015-2025")
oa = fetch_grouped_counts("publication_year:2015-2025,open_access.is_oa:true")

# Save both raw responses before transforming anything.
(RAW_DIR / "openalex-total-by-year.json").write_text(json.dumps(totals, indent=2))
(RAW_DIR / "openalex-oa-by-year.json").write_text(json.dumps(oa, indent=2))

# Each group_by entry carries the year in "key" and the number of works in "count".
total_by_year = {row["key"]: row["count"] for row in totals["group_by"]}
oa_by_year = {row["key"]: row["count"] for row in oa["group_by"]}

# Combine the two series into one small CSV for the visualization stage.
with (DATA_DIR / "openalex-oa-by-year.csv").open("w", newline="") as fh:
    writer = csv.DictWriter(fh, fieldnames=["year", "total", "open_access"])
    writer.writeheader()
    for year in sorted(total_by_year):
        writer.writerow({
            "year": year,
            "total": total_by_year[year],
            "open_access": oa_by_year.get(year, 0),
        })
```

Running this script

The script uses requests, a Python library for making HTTP calls. If you do not have it installed, run pip install requests (or uv pip install requests if you use uv) in your terminal.

OpenAlex works without an API key for light use, but if you have one, set it as an environment variable before running the script. On macOS or Linux: export OPENALEX_API_KEY=your_key_here. On Windows: set OPENALEX_API_KEY=your_key_here in Command Prompt, or $env:OPENALEX_API_KEY = "your_key_here" in PowerShell. Then run the script with python fetch_oa_counts.py (or whatever you named the file). It will create a data/raw/ folder with the two JSON responses and a data/openalex-oa-by-year.csv file combining them.
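
Before handing the CSV to a chart, a quick sanity check can catch the most obvious problems. The sketch below is one possible check, not part of the script above: check_rows is a name invented here, and the CSV text uses made-up numbers purely to show the two failure cases (a missing year, and an open access count larger than the total).

```python
import csv
import io


def check_rows(rows, first_year, last_year):
    """Return a list of human-readable problems; an empty list means nothing obvious is wrong."""
    problems = []
    years = {int(row["year"]) for row in rows}
    missing = sorted(set(range(first_year, last_year + 1)) - years)
    if missing:
        problems.append(f"missing years: {missing}")
    for row in rows:
        if int(row["open_access"]) > int(row["total"]):
            problems.append(f"{row['year']}: open_access exceeds total")
    return problems


# Made-up numbers for illustration only; in practice, open data/openalex-oa-by-year.csv.
FAKE_CSV = "year,total,open_access\n2015,10,4\n2017,5,7\n"
rows = list(csv.DictReader(io.StringIO(FAKE_CSV)))
problems = check_rows(rows, 2015, 2017)
```

An empty problem list does not prove the data is right, but a non-empty one is a clear reason to go back to the saved raw responses.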

Authenticated APIs Need One More Boundary

Once credentials enter the picture, keep them outside the source file. An environment variable (a named value set in your shell or operating system and passed to the programs you start from it) is the most common place. The agent should read the key from there, which keeps the secret out of version control and out of any file you might share. Save a redacted sample response, and decide early what may appear in logs. A credential that leaks into a commit history or a shared notebook is hard to remove completely. Handle this before you write any code.
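
A sketch of both habits together: reading the token from the environment and redacting a response before saving it. The variable name EXAMPLE_API_TOKEN, the set of sensitive keys, and the sample response are all assumptions for illustration; choose the real ones from your API's docs and your own judgment about what is private.

```python
import os

# Hypothetical list; extend it with whatever your API returns that should not be shared.
SENSITIVE_KEYS = {"token", "api_key", "authorization", "email"}


def read_token(var_name="EXAMPLE_API_TOKEN", environ=os.environ):
    """Read the credential from the environment and fail clearly if it is absent."""
    token = environ.get(var_name)
    if not token:
        raise RuntimeError(f"Set {var_name} in your environment before running this script.")
    return token


def redact(value):
    """Return a copy of a JSON-like value with sensitive fields masked."""
    if isinstance(value, dict):
        return {
            key: "[REDACTED]" if key.lower() in SENSITIVE_KEYS else redact(child)
            for key, child in value.items()
        }
    if isinstance(value, list):
        return [redact(item) for item in value]
    return value


# Demonstration with a stand-in environment and an invented response.
token = read_token(environ={"EXAMPLE_API_TOKEN": "s3cret"})
sample = {"user": {"email": "a@example.org", "name": "A"}, "items": [{"api_key": "s3cret", "id": 1}]}
safe_sample = redact(sample)
```

Redaction by key name only catches fields you thought to list, so it is still worth reading the saved sample once before sharing it.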

Authenticated API Prompt

"Read the current docs for this authenticated endpoint. Write a local script that expects the token in an environment variable, saves a small redacted sample response, surfaces status codes clearly, and explains where pagination and retries are handled. Do not write credentials into the file."

When to Stop and Inspect

A line chart can look authoritative even when it is built on the first page of a paginated response.

Once the API output is in a table, the next step is tabular data cleanup. If no supported endpoint exists and the data has to be collected from the page itself, scraping is the fallback.

Further Reading