How to investigate a page, find the underlying data source, and scrape slowly enough to verify and repair.
Many pages already offer an export, a feed, a sitemap, a documented API, or a data request the browser is making in the background. Any of these is a better starting point than scraping. Scraping is tempting anyway: the data appears to be sitting right there on the page, and an agent can draft a scraper quickly enough that the first attempt may even run.
Investigate first. Scrapy, the most widely used Python framework for web scraping and crawling, has documentation on dynamically loaded content and browser developer tools that points in the same direction: look for the underlying source, inspect what the browser is requesting, save a sample response, and only then decide whether HTML scraping is still warranted. When scraping is necessary, go slow. You need to be able to inspect and repair as you go. Webpages get reorganized often, sometimes shortly after you decide your scraper is reliable.
Scraping involves several kinds of judgment: whether to scrape, what source to target, how to extract, and when to stop. The agent handles the mechanical parts well enough. The investigation, the requests, and the saved artifacts are where you have to stay involved.
"Inspect this page before writing any scraper. Tell me whether there is an export, feed, sitemap, documented API, or underlying data request. If the page still requires scraping, tell me what source I should save first and which extraction path seems least brittle. Define any technical term you use."
Browser developer tools show the live page after scripts have run, which may differ from the original HTML response. JavaScript may have modified the page, the browser may have cleaned up the markup, and Firefox may even add <tbody> elements that do not exist in the raw source. A pattern copied from the browser inspector can therefore fail against the response your scraper actually receives. Scrapy's developer-tools guide walks through this problem in detail.
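The <tbody> mismatch can be reproduced without a browser. The sketch below uses Python's standard-library xml.etree.ElementTree as a stand-in for a real HTML parser, and the table markup is an invented example rather than anything taken from a real site. It assumes the raw response is well-formed enough to parse as XML, which real pages often are not.

```python
import xml.etree.ElementTree as ET

# Hypothetical raw response: the server sends rows directly inside <table>.
RAW_HTML = """
<html><body>
  <table id="staff">
    <tr><td>Ada Example</td><td>Archivist</td></tr>
    <tr><td>Ben Sample</td><td>Cataloger</td></tr>
  </table>
</body></html>
"""

root = ET.fromstring(RAW_HTML)

# Path as it might be copied from an inspector that displays an inserted <tbody>.
inspector_path = "./body/table/tbody/tr/td"
# Path that matches what the server actually sent.
raw_path = "./body/table/tr/td"

inspector_cells = [td.text for td in root.findall(inspector_path)]
raw_cells = [td.text for td in root.findall(raw_path)]

print("inspector path matched:", inspector_cells)
print("raw path matched:", raw_cells)
```

The inspector-style path finds nothing in the raw markup, which is the kind of silent failure that saving and testing against the original response is meant to catch.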
The practical sequence starts with opening the browser's network panel (the tab in developer tools that shows every request the page makes behind the scenes), finding the request that carries the data you want, and saving a sample of that HTML or JSON response. Then ask the agent to explain what it sees. This takes longer than asking for a scraper directly, but by the time you write extraction code you know what source you are targeting. Different sources require different work. Suppose a staff directory page appears to list people in ordinary HTML cards. If the browser is loading those names from a background JSON request, you should target the JSON. If the page also offers a CSV export, that is better still.
"Inspect this page and tell me whether the desired data is present in the original HTML, embedded in JavaScript, or loaded from a secondary request. Show me the smallest request that returns the data. Save one sample response and explain, in plain language, which fields or selectors are stable enough to use."
If the browser's network panel reveals a JSON endpoint, the project has moved closer to Working with APIs and External Data than to HTML scraping.
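Once a sample response is saved, a quick summary of its fields can help decide which ones are stable enough to rely on. The following is a minimal sketch using only the standard library. The payload, the "results" wrapper key, and the field names are all hypothetical; a real response will have its own shape.

```python
import json
from collections import Counter

# Hypothetical saved sample; in practice this would be read from a file
# saved out of the browser's network panel.
SAMPLE = """
{"page": 1, "results": [
  {"id": 101, "name": "Ada Example", "title": "Archivist", "department": "Special Collections"},
  {"id": 102, "name": "Ben Sample", "title": "Cataloger"}
]}
"""

def summarize_records(records):
    """Count how many records carry each field name."""
    counts = Counter()
    for record in records:
        counts.update(record.keys())
    return counts

payload = json.loads(SAMPLE)
records = payload["results"]  # "results" is an assumed wrapper key
field_counts = summarize_records(records)

for field, count in sorted(field_counts.items()):
    note = "" if count == len(records) else "  <- missing in some records"
    print(f"{field}: {count}/{len(records)}{note}")
```

A field that is missing from some records is worth asking about before any extraction code depends on it.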
A selector is a rule for finding one part of a page. The examples below use XPath, a syntax for navigating the nested structure of an HTML document (think of it as giving directions through a tree: "go to the third branch, then the second leaf"). Scrapy's developer-tools guide recommends relative selectors based on attributes and identifying features over full XPath paths, which makes sense given how often page layouts change. The selectors guide adds a practical distinction between methods that return one value and methods that return all matches. Extraction code is safer when you know whether you are selecting one thing or many, and why the selector is likely to survive a redesign. Sort that out before you build the loop.
/html/body/div[2]/main/div[3]/table/tbody/tr[4]/td[2]/text()
This path depends on the entire page structure staying put. A mild redesign or an added row can break it.
//tr[@data-person-id]/td[@data-field="role"]/text()
This selector names the field it wants. It has a better chance of surviving small layout changes.
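The difference between the two styles can be shown on a small invented table. The sketch below uses the standard library's xml.etree.ElementTree, which supports only a subset of XPath and has no text() step, so it reads the .text attribute instead; in a Scrapy project the selectors would more likely be passed to response.xpath(). The markup, the banner row, and the data attributes are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

def build_table(extra_row=False):
    """Return a hypothetical directory table, optionally with a banner row added."""
    banner = "<tr><td>Updated weekly</td><td></td></tr>" if extra_row else ""
    return ET.fromstring(f"""
    <table>
      {banner}
      <tr data-person-id="101"><td data-field="name">Ada Example</td><td data-field="role">Archivist</td></tr>
      <tr data-person-id="102"><td data-field="name">Ben Sample</td><td data-field="role">Cataloger</td></tr>
    </table>
    """)

POSITIONAL = "./tr[2]/td[2]"  # second row, second cell
BY_ATTRIBUTE = ".//tr[@data-person-id]/td[@data-field='role']"

def texts(table, path):
    """Return the text of every cell the path matches."""
    return [cell.text for cell in table.findall(path)]

before, after = build_table(), build_table(extra_row=True)
print("positional:", texts(before, POSITIONAL), "->", texts(after, POSITIONAL))
print("attribute: ", texts(before, BY_ATTRIBUTE), "->", texts(after, BY_ATTRIBUTE))
```

After the banner row is added, the positional path still returns a value, just the wrong person's role. That kind of quiet shift is harder to notice than an outright failure, while the attribute-based path returns the same roles in both cases.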
It helps to ask the agent to show what each selector returns before building the loop around it. Verifying the output on five records is much faster than fixing five hundred broken records later.
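A preview step along these lines might look like the sketch below. It again uses xml.etree.ElementTree on an invented sample, with findall standing in for "all matches" and findtext for "first match only"; Scrapy's own selector methods are different calls, so treat this as an analogy rather than Scrapy code.

```python
import xml.etree.ElementTree as ET

# Hypothetical saved sample; the second person has no role cell.
SAMPLE = ET.fromstring("""
<table>
  <tr data-person-id="101"><td data-field="name">Ada Example</td><td data-field="role">Archivist</td></tr>
  <tr data-person-id="102"><td data-field="name">Ben Sample</td></tr>
  <tr data-person-id="103"><td data-field="name">Cy Placeholder</td><td data-field="role">Curator</td></tr>
</table>
""")

def preview(table, limit=5):
    """Return (person id, name, role) for the first few rows, keeping gaps visible."""
    rows = table.findall(".//tr[@data-person-id]")  # many matches expected
    out = []
    for row in rows[:limit]:
        # findtext returns the first match only, or the default when nothing matches.
        name = row.findtext("td[@data-field='name']", default=None)
        role = row.findtext("td[@data-field='role']", default=None)
        out.append((row.get("data-person-id"), name, role))
    return out

sample_rows = preview(SAMPLE)
for person_id, name, role in sample_rows:
    flag = "" if name and role else "  <- check this record"
    print(person_id, name, role, flag)
```

Keeping the missing role as None, instead of skipping the row, makes the gap visible during review.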
Throttling slows the scraper down so it does not overwhelm the site and so you have a better chance of noticing when something has gone wrong. Scrapy's AutoThrottle documentation explains that the extension adjusts delay based on latency and treats error responses conservatively: error responses (anything other than a successful "200 OK" status) are not allowed to reduce the delay. Lower target concurrency values (the number of requests happening at the same time) make the crawler more polite, which suits most library, repository, and vendor collection work.
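ROBOTSTXT_OBEY = True  # respect the site's robots.txt rules
AUTOTHROTTLE_ENABLED = True  # adjust the delay based on observed latency
AUTOTHROTTLE_START_DELAY = 5.0  # initial delay, in seconds
AUTOTHROTTLE_TARGET_CONCURRENCY = 0.5  # average number of parallel requests per site
DOWNLOAD_DELAY = 1.0  # minimum delay between requests, in seconds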
You do not need to memorize these settings. The point is that a scraper should pause between requests and slow down further when the site responds slowly. Scrapy's settings documentation notes that ROBOTSTXT_OBEY is set to True in the settings.py file generated for new projects, even though the fallback default remains False for historical reasons.
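To make the behavior concrete, the delay adjustment can be sketched as a small function. This is a simplified paraphrase of the rules described in the AutoThrottle documentation, not Scrapy's actual code, and the real extension may differ in detail. The function name, the sample latencies, and the 60-second ceiling (standing in for AUTOTHROTTLE_MAX_DELAY) are assumptions for this example.

```python
def next_delay(previous_delay, latency, status,
               target_concurrency=0.5, min_delay=1.0, max_delay=60.0):
    """Simplified paraphrase of the documented AutoThrottle rules."""
    # Slower responses and lower target concurrency both push the delay up.
    target_delay = latency / target_concurrency
    # Move halfway from the previous delay toward the target.
    candidate = (previous_delay + target_delay) / 2.0
    # Stay between the minimum (DOWNLOAD_DELAY) and the maximum delay.
    candidate = min(max(candidate, min_delay), max_delay)
    # A non-200 response may raise the delay but never lower it.
    if status != 200 and candidate < previous_delay:
        return previous_delay
    return candidate

delay = 5.0  # AUTOTHROTTLE_START_DELAY from the settings above
history = []
# Hypothetical (latency in seconds, HTTP status) pairs.
for latency, status in [(0.4, 200), (0.4, 200), (0.1, 503), (3.0, 200)]:
    delay = next_delay(delay, latency, status)
    history.append(round(delay, 3))
print(history)
```

In this run the fast 503 response leaves the delay unchanged, and the slow response afterwards raises it again.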
Repeated errors, repeated blocks, or repeated selector failures are signals to stop and inspect the source again. Do not retry harder or in parallel.
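One way to build that habit into a script is a small counter that halts the run after a few consecutive failures. The sketch below is illustrative only: the class names, the threshold of three, and the simulated outcomes are all hypothetical.

```python
class StopScraping(Exception):
    """Raised when the run should halt so a person can inspect the source."""

class FailureBudget:
    """Track consecutive failures and stop instead of retrying harder."""

    def __init__(self, max_consecutive=3):
        self.max_consecutive = max_consecutive
        self.consecutive = 0

    def record(self, ok, detail=""):
        if ok:
            self.consecutive = 0  # any success resets the streak
            return
        self.consecutive += 1
        if self.consecutive >= self.max_consecutive:
            raise StopScraping(f"{self.consecutive} failures in a row; last: {detail}")

# Simulated outcomes (hypothetical): (succeeded?, detail)
outcomes = [(True, ""), (False, "HTTP 429"), (False, "HTTP 429"),
            (False, "selector returned nothing")]
budget = FailureBudget(max_consecutive=3)
processed = 0
stopped_reason = None
try:
    for ok, detail in outcomes:
        budget.record(ok, detail)
        processed += 1
except StopScraping as exc:
    stopped_reason = str(exc)
print(processed, stopped_reason)
```

The same counter covers error statuses, blocks, and empty selector results, since all three are reasons to go back to the source.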
robots.txt is a plain text file that usually sits at /robots.txt on a website, where it tells automated crawlers which parts of the site they are asked to avoid. RFC 9309 (the formal internet standard for robots exclusion) covers this pattern. According to the RFC, if a crawler successfully downloads the file, it is expected to follow the parseable rules. The RFC also makes clear that robots.txt expresses preferences rather than enforcing access control.
curl -I https://example.edu/robots.txt
curl https://example.edu/robots.txt
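After looking at the file by hand, the same rules can be checked programmatically with Python's standard-library urllib.robotparser. The sketch below parses invented file contents rather than fetching anything; the rules, the paths, and the user-agent name are hypothetical. Crawl-delay is a widely seen extension and is not part of RFC 9309.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical file contents, as might be saved from the curl command above.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 10
"""

USER_AGENT = "example-library-bot"  # assumed name for this sketch

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

for url in ("https://example.edu/directory/",
            "https://example.edu/private/report"):
    verdict = "allowed" if parser.can_fetch(USER_AGENT, url) else "disallowed"
    print(url, verdict)

# crawl_delay returns None when the file does not request a delay.
print("requested delay:", parser.crawl_delay(USER_AGENT))
```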
For this chapter, the practical use is modest: look at the file, understand what it is saying about crawler behavior, and let that inform the request pattern. It belongs alongside the rest of the context, which includes the site's structure, its terms of service, and the way public-web scraping disputes have been argued in court. The Ninth Circuit's 2022 decision in hiQ Labs v. LinkedIn clarified that scraping publicly available data does not necessarily violate the Computer Fraud and Abuse Act, though the legal picture remains uneven and context-dependent. For librarians, institutional policy and professional norms often matter as much as law. Your institution may have its own rules about automated collection. Even when scraping is legal, the vendor license agreements you have signed may forbid it.
You need a CSV of names, titles, and departments from an institutional directory. There is no obvious export button. The page looks like a classic scraping target. You might be tempted to select cards or rows from it directly.
If you check the network panel first, you find something different. The saved HTML contains only a shell, while the network panel reveals a paginated JSON request (a URL like /directory/api/people?page=1 that returns structured data). At that point, you save the JSON response, reproduce the request carefully, and write a small transformation script against that source.
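A transformation script of that kind might look like the following sketch. It works on saved responses instead of live requests, and everything about the payload is assumed: the "results" key and the name, title, and department fields are placeholders for whatever the real endpoint returns.

```python
import csv
import io
import json

# Hypothetical saved responses for /directory/api/people?page=1 and ?page=2.
SAVED_PAGES = [
    '{"results": [{"name": "Ada Example", "title": "Archivist", '
    '"department": "Special Collections"}]}',
    '{"results": [{"name": "Ben Sample", "title": "Cataloger"}]}',
]

FIELDS = ["name", "title", "department"]

def pages_to_csv(pages, fields=FIELDS):
    """Flatten saved JSON pages into CSV text, leaving missing fields blank."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fields, lineterminator="\n")
    writer.writeheader()
    for page in pages:
        for person in json.loads(page)["results"]:  # "results" is an assumed key
            writer.writerow({field: person.get(field, "") for field in fields})
    return buffer.getvalue()

csv_text = pages_to_csv(SAVED_PAGES)
print(csv_text)
```

Writing against saved pages first means the transformation can be checked and corrected without sending another request to the site.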
If the source turns out to be a data request behind the page, an API approach is more stable. If the task depends on driving a live interface through login flows and browser state, browser orchestration is the next step. When scraping is the right tool, save the artifacts so you can repair the scraper after the page changes.
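Saving artifacts can be as simple as writing each raw response to disk next to a small record of where and when it came from. The sketch below is one possible layout, not a prescribed one; the function name, the file naming scheme, and the example URL are assumptions.

```python
import hashlib
import json
import tempfile
from datetime import datetime, timezone
from pathlib import Path

def save_artifact(directory, url, body):
    """Write the raw response plus a small metadata sidecar; return the body path."""
    directory = Path(directory)
    directory.mkdir(parents=True, exist_ok=True)
    digest = hashlib.sha256(body).hexdigest()
    # Name the file after its content hash so identical responses collide harmlessly.
    body_path = directory / f"{digest[:12]}.body"
    body_path.write_bytes(body)
    metadata = {
        "url": url,
        "sha256": digest,
        "saved_at": datetime.now(timezone.utc).isoformat(),
        "bytes": len(body),
    }
    sidecar = body_path.with_suffix(".json")
    sidecar.write_text(json.dumps(metadata, indent=2), encoding="utf-8")
    return body_path

# Demonstration in a temporary directory with a hypothetical URL and body.
with tempfile.TemporaryDirectory() as tmp:
    saved = save_artifact(tmp, "https://example.edu/directory/api/people?page=1",
                          b'{"results": []}')
    saved_metadata = json.loads(saved.with_suffix(".json").read_text(encoding="utf-8"))
    print(saved.name, saved_metadata["bytes"])
```

When a selector or field later stops working, the saved body gives you something to compare the new response against.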