Turning the Internet into Data

A Catalog Web API

by Stephen Zweibel

The CUNY+ Catalog

A rich vein of data.

Search Results

Holdings Records

MARC Records

010 ## $a ###89048230 
020 ## $a 0316107514 : $c $12.95
020 ## $a 0316107506 (pbk.) : $c $5.95 ($6.95 Can.)
040 ## $a DLC $c DLC $d DLC
050 00 $a GV943.25 $b .B74 1990
100 1# $a Brenner, Richard J.,  $d 1941-
245 10 $a Make the team.  $p Soccer : $b a heads up guide to super soccer! / $c Richard J. Brenner.
246 30 $a Heads up guide to super soccer
250 ## $a 1st ed.
260 ## $a Boston : $b Little, Brown, $c c1990.
300 ## $a 127 p. : $b ill. ; $c 19 cm.
650 #0 $a Soccer $v Juvenile literature.
650 #1 $a Soccer. 
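
A display line like the ones above can be split into a tag, two indicator characters, and a list of subfields. The following is a minimal sketch, assuming the "$" plus one-character subfield convention shown in the sample record; parseMarcLine is a hypothetical helper for illustration and not part of the catalog or the API.

```javascript
// Split one MARC display line, e.g. "020 ## $a 0316107514 : $c $12.95",
// into its tag, indicators, and subfields.
function parseMarcLine(line) {
    var match = /^(\d{3}) (..) (.*)$/.exec(line.trim());
    if (!match) {
        return null; // not a "TAG II data" line
    }
    var field = { tag: match[1], indicators: match[2], subfields: [] };
    // A subfield marker is "$", one code character, then whitespace.
    // Requiring the trailing whitespace keeps prices such as "$12.95" intact.
    var parts = match[3].split(/(?:^|\s)\$([a-z0-9])\s/);
    // split() with a capture group yields [before, code, value, code, value, ...]
    for (var i = 1; i < parts.length; i += 2) {
        field.subfields.push({ code: parts[i], value: parts[i + 1].trim() });
    }
    return field;
}

var isbnField = parseMarcLine('020 ## $a 0316107514 : $c $12.95');
console.log(JSON.stringify(isbnField, null, 2));
```

Real MARC data has more edge cases than this sample (repeated subfields, control fields without indicators), so the sketch is only a starting point.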

Problem

How could this data be accessed programmatically?

{
    "title": "Methods for the analysis of large data-sets",
    "author": "Di Ciaccio, Agostino.",
    "year": "2012",
    "library": "Hunter Main",
    "docNumber": "007268706"
}

<CD>
<TITLE>Greatest Hits</TITLE>
<ARTIST>Dolly Parton</ARTIST>
<COUNTRY>USA</COUNTRY>
<COMPANY>RCA</COMPANY>
<PRICE>9.90</PRICE>
<YEAR>1982</YEAR>
</CD>

Web Scraping

The DOM Tree

<html>
    <head></head>
<body>
    <h1>Header text here</h1>
    <div>
        <p>Paragraph text here</p>
        <h2>Subheader text here</h2>
    </div>
    <div>
        <p>Another paragraph here</p>
    </div>
</body>
</html>
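
Scraping comes down to walking this tree and picking out the nodes that hold the data. The sketch below illustrates the idea with a plain-object stand-in for the page above; the el and collectText helpers are hypothetical names for illustration, and a real scraper would use a DOM library rather than hand-built objects.

```javascript
// Build a tiny node: a tag name, optional text, optional child nodes.
function el(tag, text, children) {
    return { tag: tag, text: text || '', children: children || [] };
}

// Plain-object model of the sample page.
var page = el('html', '', [
    el('head'),
    el('body', '', [
        el('h1', 'Header text here'),
        el('div', '', [
            el('p', 'Paragraph text here'),
            el('h2', 'Subheader text here')
        ]),
        el('div', '', [
            el('p', 'Another paragraph here')
        ])
    ])
]);

// Depth-first walk that gathers the text of every node with the given tag.
function collectText(node, tag, found) {
    found = found || [];
    if (node.tag === tag) {
        found.push(node.text);
    }
    node.children.forEach(function (child) {
        collectText(child, tag, found);
    });
    return found;
}

var paragraphs = collectText(page, 'p');
console.log(paragraphs);
```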

Behind the Curtain


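var express = require('express');
var request = require('request');
var jsdom = require('jsdom'); // relies on the legacy jsdom.env API (jsdom versions before 10)

var app = express();

// GET /marc?docNumber=...&verbose=1 scrapes one catalog record and returns it as JSON.
app.get('/marc', function (req, res) {
    var docNumber = req.query.docNumber;
    var uriBase = 'http://apps.appl.cuny.edu:83/F/';
    // verbose=1 selects catalog format 001; anything else selects 002.
    var format = req.query.verbose === '1' ? '001' : '002';

    var options = {
        uri: uriBase + '?func=direct&doc_number=' + encodeURIComponent(docNumber) + '&format=' + format
    };
    request(options, function (error, response, body) {
        console.log(options.uri);
        // Stop on a network error or on any non-200 reply from the catalog.
        if (error || response.statusCode !== 200) {
            console.log(error);
            res.writeHead(502, { 'Content-Type': 'text/plain' });
            return res.end('Catalog request failed');
        }

        jsdom.env({
            html: body,
            scripts: [
                'https://ajax.googleapis.com/ajax/libs/jquery/1.8.2/jquery.min.js'
            ]
        }, function (err, window) {
            if (err) {
                console.log(err);
                res.writeHead(500, { 'Content-Type': 'text/plain' });
                return res.end('Could not parse catalog page');
            }
            // jQuery is injected into the page by the scripts option above.
            var $ = window.jQuery;
            // Strip image sources from the scraped page.
            $('img[src]').each(function (i, el) {
                $(el).removeAttr('src');
            });

            var wholeMarc = {};

            // The fifth table on the page holds the MARC record.
            var resultsTable = $('table')[4];

            // Each row is a label cell followed by a value cell.
            // Note: a repeated label (e.g. two 020 fields) overwrites the earlier value.
            var rows = $(resultsTable).children('tr');
            rows.each(function (i, item) {
                var marcLabel = $(item).children('td')[0];
                var marcValue = $(item).children('td')[1];
                var name = $(marcLabel).text().trim();
                var value = $(marcValue).text().trim();
                wholeMarc[name] = value;
            });
            res.writeHead(200, {
                'Content-Type': 'text/plain',
                'Access-Control-Allow-Origin': '*'
            });
            res.end(JSON.stringify(wholeMarc, null, 2));
        });
    });
});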

Using Node.js

Documentation

Publicly available

The API

Currently located at:

mighty-wildwood-7308.herokuapp.com
To do a search for 'global warming' in Hunter College:
/search?query=global+warming&queryType=All+Fields&school=HUNTER
Try it out here
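
A search URL like the one above can be assembled in code instead of by hand. This is a minimal sketch: the host and the three parameter names come from the example, while the http scheme and the buildSearchUrl helper are assumptions made for illustration.

```javascript
// Assumption: the scheme is http; the slides give only the host name.
var API_BASE = 'http://mighty-wildwood-7308.herokuapp.com';

// Build a /search URL. URLSearchParams encodes spaces as "+",
// which matches the form of the example query string.
function buildSearchUrl(query, school, queryType) {
    var params = new URLSearchParams({
        query: query,
        queryType: queryType || 'All Fields',
        school: school
    });
    return API_BASE + '/search?' + params.toString();
}

var url = buildSearchUrl('global warming', 'HUNTER');
console.log(url);
```

Letting URLSearchParams handle the encoding avoids mistakes with spaces and punctuation in search terms.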

Deeper Data

Applications

A New Catalog

  • Mobile
  • Responsive
  • Clear UX

Example

Collection Development Scripting

Given a list of ISBNs, for each item get:

  1. Title
  2. Price
  3. Holdings at all CUNY Libraries for this and any other edition
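
The steps above can be sketched as a small report builder. This is an illustration under stated assumptions: lookupIsbn and buildReport are hypothetical names, and the lookup reads from a placeholder table rather than calling the API. The title and price come from the sample MARC record earlier in the deck; the holdings values are made-up placeholders.

```javascript
// Placeholder table standing in for calls to the catalog API.
// Holdings here are invented for illustration only.
var stubCatalog = {
    '0316107514': {
        title: 'Make the team. Soccer',
        price: '$12.95',
        holdings: ['Library A', 'Library B']
    }
};

// Hypothetical lookup: returns a record for a known ISBN, otherwise null.
function lookupIsbn(isbn) {
    return stubCatalog[isbn] || null;
}

// Build one report row per ISBN, flagging ISBNs the lookup cannot find.
function buildReport(isbns, lookup) {
    return isbns.map(function (isbn) {
        var record = lookup(isbn);
        if (!record) {
            return { isbn: isbn, found: false };
        }
        return {
            isbn: isbn,
            found: true,
            title: record.title,
            price: record.price,
            holdingsCount: record.holdings.length,
            holdings: record.holdings
        };
    });
}

var report = buildReport(['0316107514', '0000000000'], lookupIsbn);
console.log(JSON.stringify(report, null, 2));
```

A working version would replace lookupIsbn with asynchronous requests to the API and would also need a way to match other editions of the same title.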

Automate Shelf-reading

Know Thy Shelf

Fun

Exploration

Serendipity

A Data Experiment

Live Version

Static Version

THE END

BY Stephen Zweibel