Collection Work

Web Scraping as a Last Resort

How to investigate a page, find the underlying data source, and scrape slowly enough to verify and repair.

Do You Need to Scrape at All?

Many pages already offer an export, a feed, a sitemap, a documented API, or a data request the browser is making in the background. Any of these is a better starting point than scraping. Scraping is tempting anyway: the data appears to be sitting right there on the page, and an agent can draft a scraper quickly enough that the first attempt may even run.

Investigate first. Scrapy, the most widely used Python framework for web scraping and crawling, has documentation on dynamically loaded content and browser developer tools that points in the same direction: look for the underlying source, inspect what the browser is requesting, save a sample response, and only then decide whether HTML scraping is still warranted. When scraping is necessary, go slow. You need to be able to inspect and repair as you go. Webpages get reorganized often, sometimes shortly after you decide your scraper is reliable.

Export / Feed → Data Request → Saved Page → Browser Automation

Division of Labor

Scraping involves several kinds of judgment: whether to scrape, what source to target, how to extract, and when to stop. The agent handles the mechanical parts well enough. The investigation, the requests, and the saved artifacts are where you have to stay involved.

The Agent Does

Inspect page structure and trace likely data requests
Draft small collection and export scripts
Suggest page-finding rules, often called selectors, and explain what each one returns
Separate request logic from parsing and output

You Do

Decide whether scraping is warranted at all
Judge whether the request pattern is appropriate and sustainable
Verify that extracted fields match the page or response
Decide when the maintenance burden exceeds the value

Save These

One saved response from the actual source (HTML, JSON, or feed), with a note about request assumptions and any constraints you discovered
The request template or request notes
A cleaned intermediate file such as CSV or JSON
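
The first two items can be sketched in a few lines, assuming you already have the response body in hand (copied from the network panel or returned by a script). The function name save_sample, the file naming scheme, and the fields in the notes file are illustrative choices, not a standard.

```python
import json
from datetime import datetime, timezone
from pathlib import Path


def save_sample(body: bytes, url: str, notes: str, out_dir: str,
                suffix: str = ".html") -> Path:
    """Write one raw response plus a small JSON note file next to it."""
    folder = Path(out_dir)
    folder.mkdir(parents=True, exist_ok=True)

    stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    sample_path = folder / f"sample-{stamp}{suffix}"
    sample_path.write_bytes(body)  # keep the bytes exactly as received

    note = {
        "url": url,
        "saved_at": stamp,
        "notes": notes,  # request assumptions and constraints, in plain language
    }
    note_path = sample_path.with_name(sample_path.name + ".notes.json")
    note_path.write_text(json.dumps(note, indent=2), encoding="utf-8")
    return sample_path
```

Writing bytes rather than decoded text means the saved file is the same thing your parser will later receive, which matters when you come back to repair the scraper.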

Starting Prompt

"Inspect this page before writing any scraper. Tell me whether there is an export, feed, sitemap, documented API, or underlying data request. If the page still requires scraping, tell me what source I should save first and which extraction path seems least brittle. Define any technical term you use."

Find the Actual Data Source

Browser developer tools show the live page after scripts have run, which may differ from the original HTML response. JavaScript may have modified the page, the browser may have cleaned up the markup, and Firefox may even add <tbody> elements that do not exist in the raw source. A pattern copied from the browser inspector can therefore fail against the response your scraper actually receives. Scrapy's developer-tools guide walks through this problem in detail.
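
The tbody problem is easy to reproduce. The sketch below uses Python's standard-library ElementTree only because the fragment is small and well-formed; real HTML usually needs a proper HTML parser such as the one Scrapy provides. The table content is made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical raw response: note there is no <tbody> in the source.
RAW_TABLE = "<table><tr><td>Ada Lovelace</td></tr></table>"

table = ET.fromstring(RAW_TABLE)

# Path copied from a browser inspector, which showed an inserted <tbody>.
from_inspector = table.findall("./tbody/tr/td")

# Path that does not depend on the inserted element.
from_raw = table.findall(".//tr/td")

print(len(from_inspector))            # 0
print([td.text for td in from_raw])   # ['Ada Lovelace']
```

The first path is correct for what the inspector displays and still returns nothing against the raw response, which is why testing selectors on a saved response matters.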

The practical sequence starts with opening the browser's network panel (the tab in developer tools that shows every request the page makes behind the scenes), finding the request that carries the data you want, and saving a sample of that HTML or JSON response. Then ask the agent to explain what it sees. This takes longer than asking for a scraper directly, but by the time you write extraction code you know what source you are targeting. Different sources require different work. Suppose a staff directory page appears to list people in ordinary HTML cards. If the browser is loading those names from a background JSON request, you should target the JSON. If the page also offers a CSV export, that is better still.
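
One quick way to tell which saved response carries the data is to search each one for values you can see on the rendered page. This is a rough sketch: the function name find_known_values and both sample responses are hypothetical, and a plain substring check can miss text that is escaped or encoded differently in the source.

```python
def find_known_values(saved_text: str, known_values: list[str]) -> dict[str, bool]:
    """Report which values visible on the page also occur in a saved response."""
    return {value: value in saved_text for value in known_values}


# Hypothetical saved responses for the staff directory example.
raw_html = "<html><body><div id='app'></div><script src='/app.js'></script></body></html>"
json_body = '{"results": [{"name": "Ada Lovelace", "department": "Mathematics"}]}'

visible_on_page = ["Ada Lovelace", "Mathematics"]
print(find_known_values(raw_html, visible_on_page))
print(find_known_values(json_body, visible_on_page))
```

If the values are missing from the original HTML but present in a background response, that response is the source to target.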

Investigation Prompt

"Inspect this page and tell me whether the desired data is present in the original HTML, embedded in JavaScript, or loaded from a secondary request. Show me the smallest request that returns the data. Save one sample response and explain, in plain language, which fields or selectors are stable enough to use."

Source categories

Best Case

Export, feed, or sitemap

Use the supported route and save the file
Prefer this path whenever it exists

Often Better

Underlying data request

Use the browser's network panel to capture the request
Treat the result as undocumented API work

Sometimes Needed

Saved page with selectors

Save the response first
Work against the saved file, then generalize

Escalation

Browser automation

Use when the task requires login, clicking through menus, or other stateful interaction
Expect higher fragility and maintenance cost

Category Shift

If the browser's network panel reveals a JSON endpoint, the project has moved closer to Working with APIs and External Data than to HTML scraping.

Selectors Are a Maintenance Problem

A selector is a rule for finding one part of a page. The examples below use XPath, a syntax for navigating the nested structure of an HTML document (think of it as giving directions through a tree: "go to the third branch, then the second leaf"). Scrapy's developer-tools guide recommends relative selectors based on attributes and identifying features over full XPath paths, which makes sense given how often page layouts change. The selectors guide adds a practical distinction between methods that return one value and methods that return all matches. Extraction code is safer when you know whether you are selecting one thing or many, and why the selector is likely to survive a redesign. Sort that out before you build the loop.

Brittle

/html/body/div[2]/main/div[3]/table/tbody/tr[4]/td[2]/text()

This path depends on the entire page structure staying put. A mild redesign or an added row can break it.

Sturdier

//tr[@data-person-id]/td[@data-field="role"]/text()

This selector names the field it wants. It has a better chance of surviving small layout changes.

It helps to ask the agent to show what each selector returns before building the loop around it. Verifying the output on five records is much faster than fixing five hundred broken records later.
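
A preview step might look like the sketch below. It runs the sturdier selector against a small, made-up saved fragment using the standard-library ElementTree, whose find and findall stand in for the one-versus-many distinction; Scrapy's own selector methods differ in name and capability. The helper name preview is an assumption for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical saved fragment; the data-* attributes mirror the sturdier selector.
SAVED = """
<table>
  <tr data-person-id="1"><td data-field="name">Ada Lovelace</td><td data-field="role">Analyst</td></tr>
  <tr data-person-id="2"><td data-field="name">Grace Hopper</td><td data-field="role">Rear Admiral</td></tr>
  <tr><td>Totals</td><td>2</td></tr>
</table>
"""

ROLE_PATH = ".//tr[@data-person-id]/td[@data-field='role']"


def preview(root: ET.Element, path: str, limit: int = 5) -> list[str]:
    """Return the text of the first few matches so a person can check them."""
    return [(node.text or "").strip() for node in root.findall(path)[:limit]]


root = ET.fromstring(SAVED)
first = root.find(ROLE_PATH)      # one match, or None when nothing matches
every = root.findall(ROLE_PATH)   # all matches, possibly an empty list

print(first.text)                 # Analyst
print(len(every))                 # 2
print(preview(root, ROLE_PATH))   # ['Analyst', 'Rear Admiral']
```

The totals row has no data-person-id attribute, so the selector skips it without any special handling.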

Keep the Scraper Slow Enough to Be Welcome

Throttling slows the scraper down so it does not overwhelm the site and so you have a better chance of noticing when something has gone wrong. Scrapy's AutoThrottle documentation explains that the extension adjusts delay based on latency and treats error responses conservatively: error responses (anything other than a successful "200 OK" status) are not allowed to reduce the delay. Lower target concurrency values (the number of requests happening at the same time) make the crawler more polite, which suits most library, repository, and vendor collection work.

python

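ROBOTSTXT_OBEY = True                  # fetch robots.txt and respect its rules
AUTOTHROTTLE_ENABLED = True            # adjust the delay based on observed latency
AUTOTHROTTLE_START_DELAY = 5.0         # initial delay, in seconds
AUTOTHROTTLE_TARGET_CONCURRENCY = 0.5  # average parallel requests per remote site
DOWNLOAD_DELAY = 1.0                   # floor: AutoThrottle never goes below this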

You do not need to memorize these settings. The principle is that a scraper should pause between requests and slow down further when the site responds slowly. Scrapy's settings docs note that ROBOTSTXT_OBEY is set to True in the settings.py file generated for new projects, even though the fallback default remains False for historical reasons.
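
To make the delay adjustment concrete, here is a simplified sketch that loosely follows the rules the AutoThrottle documentation describes: aim for latency divided by target concurrency, move halfway toward that target, stay within a minimum and maximum, and never let a non-200 response shorten the delay. It is not Scrapy's implementation; the function name next_delay and the 60-second ceiling are assumptions for illustration.

```python
def next_delay(previous_delay: float, latency: float, status: int,
               target_concurrency: float = 0.5,
               min_delay: float = 1.0, max_delay: float = 60.0) -> float:
    """Simplified delay update; an illustration, not Scrapy's actual code."""
    target_delay = latency / target_concurrency
    # Move halfway from the previous delay toward the target.
    candidate = (previous_delay + target_delay) / 2.0
    # Keep the result inside the configured floor and ceiling.
    candidate = min(max(candidate, min_delay), max_delay)
    # Error responses may lengthen the delay but never shorten it.
    if status != 200 and candidate < previous_delay:
        return previous_delay
    return candidate


print(next_delay(5.0, 1.0, 200))  # 3.5
print(next_delay(5.0, 1.0, 503))  # 5.0
print(next_delay(5.0, 4.0, 503))  # 6.5
```

With the settings shown, a one-second response pulls the delay from 5.0 toward 2.0 seconds, while the same latency on an error response leaves the delay where it was.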

Stopping Rule

Repeated errors, repeated blocks, or repeated selector failures are signals to stop and inspect the source again. Do not retry harder or in parallel.
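
The stopping rule can be built into the collection loop itself. The sketch below assumes a hypothetical fetch callable that returns the response body, or None for an error, a block, or an empty extraction; the name collect and the threshold of three consecutive failures are illustrative.

```python
import time
from typing import Callable, Iterable, Optional


def collect(urls: Iterable[str],
            fetch: Callable[[str], Optional[bytes]],
            delay_seconds: float = 5.0,
            max_consecutive_failures: int = 3,
            sleep: Callable[[float], None] = time.sleep,
            ) -> tuple[dict[str, bytes], Optional[str]]:
    """Fetch URLs one at a time and stop early after repeated failures.

    Returns the results gathered so far and, if the run stopped early,
    the URL at which it stopped (otherwise None).
    """
    results: dict[str, bytes] = {}
    failures = 0
    for url in urls:
        body = fetch(url)
        if body is None:
            failures += 1
            if failures >= max_consecutive_failures:
                return results, url  # stop and inspect; do not retry harder
        else:
            failures = 0
            results[url] = body
        sleep(delay_seconds)  # pause between requests, success or not
    return results, None
```

Returning the URL where the run stopped gives you a concrete place to start the next inspection. Passing sleep in as a parameter is only there so the loop can be tried out without waiting.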

Reading robots.txt

robots.txt is a plain text file that usually sits at /robots.txt on a website, where it tells automated crawlers which parts of the site they are asked to avoid. RFC 9309, the IETF specification for the Robots Exclusion Protocol, defines the format. According to the RFC, if a crawler successfully downloads the file, it is expected to follow the parseable rules. The RFC also makes clear that robots.txt expresses preferences rather than enforcing access control.

bash

curl -I https://example.edu/robots.txt
curl https://example.edu/robots.txt
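
Once you have the file, Python's standard-library urllib.robotparser can answer specific questions about it. The sketch below parses a made-up robots.txt from a string rather than the network; the rules, the user-agent name, and the URLs are all assumptions for illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, as the curl command might return it.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 10
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

AGENT = "example-library-bot"  # assumed user-agent name for illustration

print(parser.can_fetch(AGENT, "https://example.edu/directory/"))       # True
print(parser.can_fetch(AGENT, "https://example.edu/private/report"))   # False
print(parser.crawl_delay(AGENT))                                       # 10
```

A Crawl-delay line, when present, is a useful hint for the pause between requests, though not every site includes one.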

For this chapter, the practical use is modest: look at the file, understand what it is saying about crawler behavior, and let that inform the request pattern. It belongs alongside the rest of the context, which includes the site's structure, its terms of service, and the way public-web scraping disputes have been argued in court. The Ninth Circuit's 2022 decision in hiQ Labs v. LinkedIn clarified that scraping publicly available data does not necessarily violate the Computer Fraud and Abuse Act, though the legal picture remains uneven and context-dependent. For librarians, institutional policy and professional norms often matter as much as law. Your institution may have its own rules about automated collection. Even when scraping is legal, the vendor license agreements you have signed may forbid it.

Case Study: The Page That Turns Out to Be JSON

The problem

You need a CSV of names, titles, and departments from an institutional directory. There is no obvious export button. The page looks like a classic scraping target. You might be tempted to select cards or rows from it directly.

If you check the network panel first, you find something different. The saved HTML contains only a shell, while the network panel reveals a paginated JSON request (a URL like /directory/api/people?page=1 that returns structured data). At that point, you save the JSON response, reproduce the request carefully, and write a small transformation script against that source.

Save one JSON response and note any assumptions about pagination, headers, or rate limits that turned out to matter.
Record the request URL, method, query parameters, and stop condition.
Write extraction code against the saved JSON.
Export a clean CSV and keep the raw response for repair work later.
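
The last two steps might look like the sketch below, which works entirely against a saved response. The JSON shape (a results list and a next_page key) and the field names are assumptions; a real directory endpoint will have its own structure, which is why the saved sample comes first.

```python
import csv
import io
import json

# Hypothetical saved response; field names and "next_page" are assumptions.
SAVED_JSON = """
{"results": [
   {"name": "Ada Lovelace", "title": "Analyst", "department": "Mathematics"},
   {"name": "Grace Hopper", "title": "Rear Admiral", "department": null}
 ],
 "next_page": null}
"""

FIELDS = ["name", "title", "department"]


def people_to_rows(payload: dict) -> list[dict]:
    """Keep only the wanted fields; missing or null values become empty strings."""
    return [{field: person.get(field) or "" for field in FIELDS}
            for person in payload.get("results", [])]


def has_more_pages(payload: dict) -> bool:
    """Stop condition: the response advertises no further page."""
    return payload.get("next_page") is not None


def write_csv(rows: list[dict], handle) -> None:
    """Write a header line followed by one line per person."""
    writer = csv.DictWriter(handle, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(rows)


payload = json.loads(SAVED_JSON)
rows = people_to_rows(payload)
buffer = io.StringIO()  # stands in for an open CSV file
write_csv(rows, buffer)
print(buffer.getvalue())
```

Keeping the transformation separate from the request logic means you can rerun it against the raw file whenever the output looks wrong, without contacting the site again.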

When to Stop Scraping

A better supported access method becomes visible
The request pattern grows too complicated or brittle to explain clearly
The maintenance burden exceeds the value of the data
Terms, policies, or institutional constraints make the work dubious
The output can no longer be verified with reasonable effort

If the source turns out to be a data request behind the page, an API approach is more stable. If the task depends on driving a live interface through login flows and browser state, browser automation is the next step. When scraping is the right tool, save the artifacts so you can repair the scraper after the page changes.