How to read API documentation, test small requests, save responses, and keep outside data inspectable.
With a local file, you can see the data in front of you. With an API (Application Programming Interface, a structured way for one piece of software to ask another for data or a service), the same information is split across the documentation, the request you sent, and the response you got back. The agent can move from question to script in seconds, which is useful until it skips past a wrong parameter (a setting attached to the request that controls what comes back), a partial result, or a page-one-only answer. The workflow in this chapter keeps the request, the response, and the documentation visible so you can inspect them together.
Read the docs, make a small request, save the response.
Let the agent handle the repetitive parts: drafting requests, parsing responses, writing transformation code. You still have to judge whether you are asking the right question and whether the returned data answers it. The agent cannot do that for you.
The first request should follow the docs. That sounds obvious, but agents can draft code before anyone has read the docs carefully. Good API documentation covers more than the endpoint itself. It also specifies pagination, error handling, rate limits, and authentication. The first call is rarely sufficient. You will need to handle paging, errors, and limits as part of the same workflow.
"Find the current documentation for this API and identify the request that best matches my question. Tell me what the request needs, what I should verify before writing a full script, and which documentation page you are relying on. Do not build the whole workflow yet. Define any technical term you use."
Ask for the live docs, the exact endpoint (the URL your code will call), the required parameters, and the shape of what comes back. Save the documentation URL in the session or project notes so you can return to it when the response surprises you.
Request one record, one page, one narrow filter, or one grouped result. This establishes that the endpoint, parameters, and auth flow are real before you write the wider pipeline.
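A minimal sketch of what "one narrow request" can look like, using only the standard library to build the URL so it can be inspected before anything is sent. The helper name build_probe_url is hypothetical, and the filter value and per_page=1 are assumptions chosen for illustration; check the API's own docs for the parameters it accepts.

```python
from urllib.parse import urlencode

BASE_URL = "https://api.openalex.org/works"


def build_probe_url(filter_value, per_page=1):
    """Return a URL that asks for the smallest useful slice of data."""
    # One narrow filter and one record per page: enough to confirm the
    # endpoint and parameters are real, small enough to read by eye.
    params = {"filter": filter_value, "per_page": per_page}
    return f"{BASE_URL}?{urlencode(params)}"


probe_url = build_probe_url("publication_year:2024")
print(probe_url)  # paste into a browser or pass to your HTTP client
```

Printing the URL first keeps the request itself inspectable: you can compare it against the documentation before the agent wraps it in a loop.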
Write the response to disk. If response headers (the metadata the server sends alongside the data, such as pagination links or rate-limit counters) carry information you will need later, save or inspect those too.
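One way to keep the body and the headers together is to write them into a single JSON file. This is a sketch under assumptions: the function name save_raw_response, the file layout, and the example header value are all illustrative rather than part of any library.

```python
import json
import tempfile
from pathlib import Path


def save_raw_response(out_dir, name, status_code, headers, body):
    """Write status, headers, and parsed body to one JSON file."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    record = {
        "status_code": status_code,
        "headers": dict(headers),  # header mappings are not always plain dicts
        "body": body,
    }
    path = out_dir / f"{name}.json"
    path.write_text(json.dumps(record, indent=2), encoding="utf-8")
    return path


# Demonstration with made-up values, written to a temporary folder.
with tempfile.TemporaryDirectory() as tmp:
    saved = save_raw_response(
        tmp, "sample", 200, {"X-RateLimit-Remaining": "59"}, {"results": []}
    )
    reloaded = json.loads(saved.read_text(encoding="utf-8"))
```

In a real script you would pass the status code, headers, and parsed JSON from your HTTP client's response object and write to a project folder such as data/raw/ instead of a temporary directory.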
Once the returned data is visible, the agent can write extraction and cleaning code against real field names and real nesting, including the null values that documentation rarely warns you about. Documentation can describe the response inaccurately. A saved sample shows what the API returns.
Pagination, retries, batching, and scheduled pulls belong later in the sequence. If you add them before the basic request works, they mostly help a broken request reach more data.
Most API problems are mundane. The service returns one page at a time, limits how fast you can call it, or sends back an error that the script may not notice.
GitHub's current REST docs, for example, advise clients to use link headers (metadata the server attaches to each response, separate from the data itself) for pagination, follow redirects, serialize requests where needed, and treat repeated 4xx or 5xx status codes as a reason to stop and correct the integration. Those codes are the server's way of reporting trouble. A 4xx means the request itself was wrong (bad URL, missing credentials), while a 5xx means something failed on the server's end.
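To make the link-header idea concrete, here is a small sketch that pulls the rel="next" URL out of a Link header string. The header value below is a made-up example in the shape GitHub documents, and parse_link_header is a hypothetical helper; if you use the requests library, its response.links attribute exposes the same information already parsed.

```python
import re

# Matches one entry of the form: <url>; rel="name"
_LINK_PART = re.compile(r'<([^>]+)>\s*;\s*rel="([^"]+)"')


def parse_link_header(value):
    """Map each rel name (next, last, ...) to its URL."""
    return {rel: url for url, rel in _LINK_PART.findall(value or "")}


header = (
    '<https://api.github.com/repositories/1/issues?page=2>; rel="next", '
    '<https://api.github.com/repositories/1/issues?page=5>; rel="last"'
)
links = parse_link_header(header)
next_url = links.get("next")  # None means there is no further page
```

Following the URL the server hands back, instead of constructing page numbers yourself, means the script stops exactly when the server says there is nothing left.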
OpenAlex's paging guide draws a similar boundary. Basic page and per_page work for the first 10,000 results. Cursor paging works beyond that. Larger pulls require the bulk snapshot.
If you cannot explain how the script reaches the whole result set, or why it stops where it does, treat the data as provisional. That may be fine for exploration, but write that note next to the file.
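A note "next to the file" can be as simple as a small sidecar file. The sketch below is one possible convention, not a standard: the .note.json suffix, the field names, and the function write_provenance_note are all assumptions.

```python
import json
import tempfile
from datetime import datetime, timezone
from pathlib import Path


def write_provenance_note(data_path, status, reason):
    """Write a sidecar note describing how complete a data file is."""
    data_path = Path(data_path)
    note_path = data_path.with_name(data_path.name + ".note.json")
    note = {
        "file": data_path.name,
        "status": status,  # e.g. "provisional" or "complete"
        "reason": reason,
        "written_at": datetime.now(timezone.utc).isoformat(),
    }
    note_path.write_text(json.dumps(note, indent=2), encoding="utf-8")
    return note_path


# Demonstration in a temporary folder.
with tempfile.TemporaryDirectory() as tmp:
    note_file = write_provenance_note(
        Path(tmp) / "works.csv", "provisional", "only the first page was fetched"
    )
    note = json.loads(note_file.read_text(encoding="utf-8"))
```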
Many APIs return only part of the result set per request. GitHub, for instance, exposes next-page URLs in a response header called Link. OpenAlex uses page and per_page parameters for the first 10,000 results, then switches to cursor=* with a next_cursor value for deeper paging. However your API handles paging, the script should show you whether it reached everything.
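A cursor loop that reports whether it reached the end might look like the sketch below. It assumes each page is a dict with a results list and a meta.next_cursor value that becomes empty on the last page, which is how OpenAlex describes its cursor paging; the fetch_page callable and the max_pages safety cap are hypothetical and stand in for your real HTTP call.

```python
def collect_all(fetch_page, max_pages=1000):
    """Follow cursors until the API stops returning one, or the cap is hit."""
    cursor = "*"  # starting cursor value
    rows = []
    pages = 0
    reached_end = False
    while pages < max_pages:
        payload = fetch_page(cursor)
        rows.extend(payload.get("results", []))
        pages += 1
        cursor = payload.get("meta", {}).get("next_cursor")
        if not cursor:
            reached_end = True  # the API signalled there is nothing more
            break
    return {"rows": rows, "pages": pages, "reached_end": reached_end}


# Stand-in for real HTTP calls: two fake pages keyed by cursor.
_fake_pages = {
    "*": {"meta": {"next_cursor": "abc"}, "results": [{"id": 1}, {"id": 2}]},
    "abc": {"meta": {"next_cursor": None}, "results": [{"id": 3}]},
}
summary = collect_all(lambda cursor: _fake_pages[cursor])
truncated = collect_all(lambda cursor: _fake_pages[cursor], max_pages=1)
```

The reached_end flag is the point of the design: a run that stopped because of the cap is distinguishable from one that finished, so a partial pull cannot pass silently as a complete one.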
Current GitHub guidance recommends serial requests and explicit backoff when rate limits are hit. That is a good general habit. Twenty parallel calls usually make the rate-limit problem worse.
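Serial requests with backoff can be sketched as a loop that waits longer after each rate-limited attempt. This is an illustration under assumptions: send is a hypothetical callable returning (status, retry_after_seconds, body), only 429 is treated as a rate-limit signal, and the delays are arbitrary. Some services also use 403 for rate limits, so check the docs for the API you are calling.

```python
import time


def call_with_backoff(send, max_attempts=4, base_delay=1.0, sleep=time.sleep):
    """Call send() one attempt at a time, waiting longer after each 429."""
    for attempt in range(1, max_attempts + 1):
        status, retry_after, body = send()
        if status != 429:
            return status, body
        if attempt == max_attempts:
            break
        # Prefer the server's own hint; otherwise double the wait each time.
        wait = retry_after if retry_after is not None else base_delay * 2 ** (attempt - 1)
        sleep(wait)
    raise RuntimeError(f"still rate limited after {max_attempts} attempts")


# Stand-in for a real client: two rate-limited replies, then success.
_replies = iter([(429, None, None), (429, 3, None), (200, None, {"ok": True})])
waits = []  # record the delays instead of actually sleeping
result = call_with_backoff(lambda: next(_replies), sleep=waits.append)
```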
A request that completes without a network error can still carry a failure in its status code. Python's response.raise_for_status() and JavaScript's response.ok both exist to catch this. Left unchecked, the script treats a 403 or 429 error as the first row of your dataset.
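If your HTTP client does not offer a ready-made check, a guard like the following does the same job. It is a minimal sketch: ApiError and ensure_ok are hypothetical names, and the 200-character excerpt length is arbitrary.

```python
class ApiError(Exception):
    """Raised when the server answers with a non-success status code."""


def ensure_ok(status_code, body_text=""):
    """Raise ApiError unless the status code is in the 2xx range."""
    if 200 <= status_code < 300:
        return
    if 400 <= status_code < 500:
        kind = "client error (check the request)"
    elif 500 <= status_code < 600:
        kind = "server error"
    else:
        kind = "unexpected status"
    # Include a short excerpt of the body: error messages often explain the cause.
    raise ApiError(f"{status_code} {kind}: {body_text[:200]}")


ensure_ok(200)  # passes silently
try:
    ensure_ok(429, "rate limit exceeded")
except ApiError as exc:
    message = str(exc)
```

Calling the guard immediately after every request, before any parsing, keeps an error body from ever reaching the transformation code.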
Field names change, null values appear where they did not before, and providers sometimes retire older response formats without warning. Save a raw sample early. Compare new responses against it to catch changes in the data's structure.
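Comparing a new response against the saved sample can be automated with a small structural diff. The sketch below records the type seen at each field path and reports what went missing, what appeared, and what changed type. The helper names and the sample data are invented for illustration.

```python
def field_types(value, prefix=""):
    """Map each field path to the set of type names seen there."""
    found = {}
    if isinstance(value, dict):
        for key, child in value.items():
            found.update(field_types(child, f"{prefix}.{key}" if prefix else key))
    elif isinstance(value, list):
        for child in value:
            # Merge types across list items so mixed values are all recorded.
            for path, types in field_types(child, f"{prefix}[]").items():
                found.setdefault(path, set()).update(types)
    else:
        found[prefix] = {type(value).__name__}
    return found


def schema_drift(baseline, current):
    """Report fields that disappeared, appeared, or changed type."""
    old, new = field_types(baseline), field_types(current)
    return {
        "missing": sorted(set(old) - set(new)),
        "added": sorted(set(new) - set(old)),
        "type_changed": sorted(p for p in set(old) & set(new) if old[p] != new[p]),
    }


baseline = {"results": [{"id": "W1", "title": "A", "year": 2020}]}
current = {"results": [{"id": "W2", "title": None, "publication_year": 2021}]}
drift = schema_drift(baseline, current)
```

Here the renamed field shows up as one missing path and one added path, and the title that became null shows up as a type change.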
"Write the script, then explain how pagination is handled, how error responses are surfaced, and which part of the docs justifies that logic. Save a raw response sample before you write the transformation."
Code running in a web page cannot call every outside service freely. Browsers enforce a policy called CORS (Cross-Origin Resource Sharing) that blocks requests to servers that have not explicitly opted in; MDN's current guide explains the mechanics, and Your First Web App introduces the static-versus-backend distinction more broadly. The question that comes up in practice is simpler. Where should this particular request run? The answer depends on what the API requires and what you need to keep private.
Ask the agent to justify the execution location after the request itself is stable. That question is most useful once you know what is being called, what it returns, and what has to remain private.
Suppose you want yearly open access counts for a small visualization. You could start pulling records and see what happens. OpenAlex's docs show a faster method. Their grouping guide shows that group_by returns grouped counts in a group_by array, and their Open Access Trends by Year recipe applies that directly to publication_year.
This answers the question without retrieving millions of works. Two grouped requests are enough: one for all works by year, and one for open access works by year. The script can save both raw responses, combine them into a small CSV, and hand that file off to the visualization stage. The companion project The Rise of Open Access follows that pattern directly.
import csv
import json
import os
from pathlib import Path

import requests

BASE_URL = "https://api.openalex.org/works"
DATA_DIR = Path("data")
RAW_DIR = DATA_DIR / "raw"
RAW_DIR.mkdir(parents=True, exist_ok=True)

# The key is optional; read it from the environment, never from the source file.
api_key = os.getenv("OPENALEX_API_KEY")


def fetch_grouped_counts(filter_value):
    """Return the grouped-by-year response for one filter."""
    params = {
        "filter": filter_value,
        "group_by": "publication_year",
    }
    if api_key:
        params["api_key"] = api_key
    response = requests.get(BASE_URL, params=params, timeout=30)
    response.raise_for_status()  # stop on 4xx/5xx instead of parsing an error body
    return response.json()


# Two grouped requests: all works by year, and open access works by year.
totals = fetch_grouped_counts("publication_year:2015-2025")
oa = fetch_grouped_counts("publication_year:2015-2025,open_access.is_oa:true")

# Save the raw responses before transforming anything.
(RAW_DIR / "openalex-total-by-year.json").write_text(json.dumps(totals, indent=2))
(RAW_DIR / "openalex-oa-by-year.json").write_text(json.dumps(oa, indent=2))

# Index the counts by year (the "key" field of each group).
total_by_year = {row["key"]: row["count"] for row in totals["group_by"]}
oa_by_year = {row["key"]: row["count"] for row in oa["group_by"]}

# Combine both series into one small CSV for the visualization stage.
with (DATA_DIR / "openalex-oa-by-year.csv").open("w", newline="") as fh:
    writer = csv.DictWriter(fh, fieldnames=["year", "total", "open_access"])
    writer.writeheader()
    for year in sorted(total_by_year):
        writer.writerow({
            "year": year,
            "total": total_by_year[year],
            "open_access": oa_by_year.get(year, 0),
        })
The script uses requests, a Python library for making HTTP calls. If you do not have it installed, run pip install requests (or uv pip install requests if you use uv) in your terminal.
OpenAlex works without an API key for light use, but if you have one, set it as an environment variable before running the script. On macOS or Linux: export OPENALEX_API_KEY=your_key_here. On Windows in Command Prompt: set OPENALEX_API_KEY=your_key_here (in PowerShell, use $env:OPENALEX_API_KEY = "your_key_here" instead). Then run the script with python fetch_oa_counts.py (or whatever you named the file). It will create a data/raw/ folder with the two JSON responses and a data/openalex-oa-by-year.csv combining them.
Grouped responses return key, key_display_name, and count fields; the script reads key and count.
Once credentials enter the picture, keep them outside the source file. An environment variable (a named value your operating system holds in memory, accessible to any program you run) is the most common place. The agent should read the key from there, which keeps the secret out of version control and out of any file you might share. Save a redacted sample response, and decide early what may appear in logs. A credential that leaks into a commit history or a shared notebook is hard to remove completely. Handle this before you write any code.
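The credential handling described here can be sketched in a few lines: read the token from the environment, fail loudly if it is missing, and scrub sensitive fields before a sample is saved. The variable name EXAMPLE_API_TOKEN, the list of sensitive keys, and both function names are assumptions for illustration; adjust them to the API you are using.

```python
import os

# Hypothetical list of field names to scrub from saved samples.
SENSITIVE_KEYS = {"api_key", "token", "authorization", "email"}


def read_token(name="EXAMPLE_API_TOKEN", env=os.environ):
    """Read the token from the environment and fail loudly if it is missing."""
    token = env.get(name)
    if not token:
        raise RuntimeError(f"Set the {name} environment variable before running.")
    return token


def redact(value):
    """Return a copy of a parsed response with sensitive fields masked."""
    if isinstance(value, dict):
        return {
            key: "[REDACTED]" if key.lower() in SENSITIVE_KEYS else redact(child)
            for key, child in value.items()
        }
    if isinstance(value, list):
        return [redact(child) for child in value]
    return value


sample = {"user": {"email": "a@example.org", "name": "A"}, "items": [{"token": "t1"}]}
safe_sample = redact(sample)  # the original dict is left unchanged
```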
"Read the current docs for this authenticated endpoint. Write a local script that expects the token in an environment variable, saves a small redacted sample response, surfaces status codes clearly, and explains where pagination and retries are handled. Do not write credentials into the file."
next_cursor disappears earlier than expected
200 response arrives with empty results after an auth or parameter change
A line chart can look authoritative even when it is built on the first page of a paginated response.
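A cheap guard against charting a partial pull is to compare what you collected with what the API said exists. The sketch below assumes the API reports a total (OpenAlex, for instance, includes a count in the response's meta object) and that your paging loop records whether it reached the end; the function name and messages are hypothetical.

```python
def completeness_problems(expected_count, rows, reached_end):
    """Return a list of reasons the collected rows may be incomplete."""
    problems = []
    if not reached_end:
        problems.append("paging stopped before the API signalled the last page")
    if expected_count is not None and len(rows) != expected_count:
        problems.append(
            f"collected {len(rows)} rows, but the API reported {expected_count}"
        )
    return problems


complete = completeness_problems(3, [{"id": 1}, {"id": 2}, {"id": 3}], True)
partial = completeness_problems(250, list(range(25)), False)
for problem in partial:
    print("WARNING:", problem)
```

An empty list means the pull matches what the API reported. Anything else is a reason to treat the data as provisional before it reaches a chart.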
Once the API output is in a table, the next step is tabular data cleanup. If no supported endpoint exists and the data has to be collected from the page itself, scraping is the fallback.