Sometimes, a website will have what is called an Application Programming Interface or API. In essence, this lets a program on your computer talk to the computer serving the website you’re interested in, such that the website gives you the data that you’re looking for.
An API is just a way of opening up a program so that another program can interact with it. That is, instead of an interface meant for a human to interact with the machine, there’s an API to allow some other machine to interact with this one. If you’ve ever downloaded an app on your phone that allowed you to interact with Instagram (but wasn’t Instagram itself), that interaction was through Instagram’s API, for instance.
A good API will also have documentation explaining how to make calls to it to get the information you want. That is, instead of you punching in the search terms on a search page, and copying and pasting the results, you frame a request as a URL. More or less. The results often come back to you in a text format where data is organized according to keys and values (JSON). Sometimes it’s just a table of data using commas to separate each field for each row (.csv file). JSON looks like the following:
image courtesy Ian Milligan
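To make the keys-and-values idea concrete, here is a minimal sketch of a JSON record and of how Python reads it. The record is invented for illustration; it is not real output from any API.

```python
import json

# An invented record, written as JSON text: keys on the left, values on the right
json_text = """
{
    "title": "The Example Gazette",
    "date": "1901-01-01",
    "page": 3,
    "subjects": ["archeology", "museums"]
}
"""

record = json.loads(json_text)   # turn the JSON text into a Python dictionary

print(record["title"])           # look up a value by its key
print(record["subjects"][0])     # a value can itself be a list
```

Once the text has been parsed, each value is reached by naming its key, which is what makes JSON convenient for programs to work with.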
Getting material out of an API
Each API has its own idiosyncrasies, but we can always look at the documentation and figure out how to form our requests so that our programs grab the data we’re after. The Chronicling America website from the Library of Congress has digitized American newspapers from 1789 to 1963. While the site has a fine search interface, we’ll use the API to grab every article that mentions the word ‘archeology’ (sic).
First of all, go to the Chronicling America site and search in the text box for archeology. Notice how the url changes when it brings back the results:
https://chroniclingamerica.loc.gov/search/pages/results/?state=&date1=1789&date2=1963&proxtext=archeology&x=0&y=0&dateFilterType=yearRange&rows=20&searchType=basic
There’s a lot of stuff in there - amongst other things, we can see a setting for state, for date1 and date2, and our search term appears as the value for a setting called proxtext. This is the API in action.
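If you would like to see those settings pulled apart, the short sketch below splits the results URL into its path and its individual settings. It uses only Python’s standard library; the variable names are my own.

```python
from urllib.parse import urlsplit, parse_qs

url = ("https://chroniclingamerica.loc.gov/search/pages/results/"
       "?state=&date1=1789&date2=1963&proxtext=archeology&x=0&y=0"
       "&dateFilterType=yearRange&rows=20&searchType=basic")

parts = urlsplit(url)

# keep_blank_values=True keeps empty settings such as 'state='
settings = parse_qs(parts.query, keep_blank_values=True)

print(parts.path)                      # everything before the '?'
for name, values in settings.items():  # everything after the '?'
    print(name, "=", values[0])
```

Each name on the left of an equals sign is a setting the site understands, and the value on the right is what we asked for.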
Let’s grab some data! (The script we’re writing is adapted from one created by Tim Sherratt).
- Open a new file in Sublime Text. We’ll first put a bit of metadata in our file, for the programme we’re going to write (we do this so that when we come back to this file later, we know what we were trying to do, etc):
The first bit tells us this is a python file. The next bit tells us what the file is for. The import tells python that we’ll need a module (pre-packaged python that does a particular job) called requests, which lets us grab materials from the web, and json, which helps us deal with json formatted data. The final bit says who wrote the script. Nb The default python environment on your machine may or may not have the requests module installed. If you get an error when you run this script to the effect that ‘requests’ is not found, you can install it at the terminal prompt with pip install requests. The json module is part of Python’s standard library, so it does not need to be installed separately.
- Now we’re going to define some variables to hold the bit of the search url up to where the ? occurs - everything after the question mark are the parameters we want the API to search. We create the api_search_url, define the parameter we want to search for, and define how we want the results returned to us. Remember, anything after a # is a comment. Good commenting makes your code readable and reusable! Add the following to your script:
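A sketch of this step, under a few assumptions: the name params and the format setting are my own guesses at how the request is expressed, while proxtext and the base url come from the search results url we looked at earlier. The last two lines are an extra, there only to preview the url these variables describe.

```python
from urllib.parse import urlencode

# The part of the search url before the '?'
api_search_url = "https://chroniclingamerica.loc.gov/search/pages/results/"

# The parameters that go after the '?': here, just our search term
params = {
    "proxtext": "archeology",  # the keyword to search for
}

# Ask for the results to be returned as JSON (assumed setting name: 'format')
params["format"] = "json"

# Optional preview: the full url these variables describe
preview_url = api_search_url + "?" + urlencode(params)
print(preview_url)
```

To search for something else later, only the value of proxtext needs to change.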
- Now we’ll send the request to the server, and we’ll add a bit of error checking so that if something is wrong, we’ll get some indication of why that is.
By the way, in the block below you’ll notice that some lines are indented. Indentations matter in Python, and you cannot mix spaces with tabs to effect the indentation. I would suggest you just use the TAB key on your keyboard to handle indentations. If you are using Sublime or Atom as your text editor, notice also how the text editor makes it easier to see the indentation level you are at.
Add this to your script:
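One way this step could be written is sketched below. It is an illustration rather than the original script: the function names describe_status and send_request are hypothetical, and the request is wrapped in a function so that nothing is sent until you call it. Notice how the indented lines belong to the if, the else, or the function above them.

```python
import requests

# Assumed values from the previous step
api_search_url = "https://chroniclingamerica.loc.gov/search/pages/results/"
params = {"proxtext": "archeology", "format": "json"}


def describe_status(status_code):
    """Turn an HTTP status code into a short message."""
    if status_code == requests.codes.ok:  # 200 means the request worked
        return "All ok"
    else:
        return "There was a problem. Error code: {}".format(status_code)


def send_request(url, query):
    """Send the request, report what happened, and hand back the response."""
    response = requests.get(url, params=query, timeout=60)
    # Show the full url that was actually sent to the API
    print("URL sent to the API:\n{}\n".format(response.url))
    print(describe_status(response.status_code))
    return response


# No request is sent until you call the function:
# response = send_request(api_search_url, params)
```

Uncomment the last line to actually send the request. If the message is anything other than All ok, the error code is the first thing to look up.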
- Now let’s get the results, and put them into a variable called ‘data’. Then we’ll print the results to the terminal, and finish by also writing the results to a file.
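A sketch of this last step follows. So that it runs on its own without contacting the server, data is filled with a small invented stand-in; in the real script it would come from the response to the request. The key names other than ocr_eng are assumptions.

```python
import json

# In the real script this would be:  data = response.json()
# Here, an invented stand-in so the example runs without a network connection
data = {
    "items": [
        {"title": "The Example Gazette", "ocr_eng": "Notes on archeology"}
    ]
}

# Print the results to the terminal, indented so they are easier to read
print(json.dumps(data, indent=4, sort_keys=True))

# Write the same results to a file called data.json
with open("data.json", "w") as outfile:
    json.dump(data, outfile)
```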
Save your file as ca.py. Open a terminal/command prompt (remember Windows folks: anaconda command prompt) in the folder where you saved this file, and run it with:
$ python ca.py
Your terminal will look like it’s frozen for a few moments; that’s because your computer is reaching out to the Chronicling America website, making its request, and pulling down the results. But in seconds, you’ll have a data.json file with loads of data - 9021 articles in fact!
Congratulations, you now have a program that you wrote that you can use to obtain all sorts of historical information. Now… why not search for something that interests you?
Your complete file will look like this:
Some other APIs
You can modify this code to extract information from other APIs, but it takes a bit of tinkering. In essence, you need to study the website you’re interested in to see how they form the API, and then change up lines 13, 17 and 21 accordingly. You can see this in action for instance here, with regard to the Metropolitan Museum of Art or here, with regard to the Smithsonian. My Winter 2020 digital museums’ class made APIs from collections at some of our national museums; you can see a modified version of the code to access some of their work here.
But… it’s in json format?
JSON is handy for lots of computational tasks, but for you as a beginning digital historian, you might want to have the data as a table. There are a couple of options here. The easiest right now - and there’s no shame in doing this - is to use an online converter. This site: json-csv.com lets you convert your json file to csv or Excel spreadsheet, and even transfer it over to a google doc. Give that a shot right now; the text of the articles by the way is in the field ‘ocr_eng’ which tells us that the text was originally transcribed from the images using OCR or optical character recognition - so there will be errors and weird glitches in the text. Fortunately, there’s also a URL with the direct link to the original document, so you can check things for yourself.
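Another option is to do the conversion yourself in Python. The sketch below is one way it might be done, under stated assumptions: that the records sit in a list under a key called items, and that each record has fields named title, date and url alongside ocr_eng. Check your own data.json and adjust the names to match. The function name items_to_csv is my own, and the sample data is invented.

```python
import csv
import io
import json


def items_to_csv(data, fields, outfile):
    """Write one CSV row per record in data['items'], keeping only `fields`."""
    writer = csv.DictWriter(outfile, fieldnames=fields)
    writer.writeheader()
    for item in data.get("items", []):
        # Missing fields become empty cells rather than raising an error
        writer.writerow({field: item.get(field, "") for field in fields})


# Invented stand-in for the contents of data.json
sample = json.loads("""
{"items": [
    {"title": "The Example Gazette", "date": "19010101",
     "url": "https://example.org/page/1", "ocr_eng": "Notes on archeology"},
    {"title": "The Sample Herald", "date": "19020202",
     "url": "https://example.org/page/2"}
]}
""")

buffer = io.StringIO()  # an in-memory file, so nothing is written to disk
items_to_csv(sample, ["title", "date", "url", "ocr_eng"], buffer)
print(buffer.getvalue())
```

To use it on your own results, load data.json with json.load and pass in a file opened for writing with newline="" in place of the in-memory buffer.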
GLAM Workbench
‘GLAM’ stands for ‘galleries, libraries, archives, and museums’. The GLAM Workbench is by Tim Sherratt, a digital historian in Australia. I would strongly recommend that you explore and give the Workbench a whirl if you are at all interested in the kinds of work that you might be able to do when you are computationally able to treat collections as data. For a glimpse as to what that might mean, check out this presentation.