APIs

Sometimes, a website will have what is called an Application Programming Interface or API. In essence, this lets a program on your computer talk to the computer serving the website you’re interested in, such that the website gives you the data that you’re looking for.

An API is just a way of opening up a program so that another program can interact with it. That is, instead of an interface meant for a human to interact with the machine, there’s an API to allow some other machine to interact with this one. If you’ve ever downloaded an app on your phone that allowed you to interact with Instagram (but wasn’t Instagram itself), that interaction was through Instagram’s API, for instance.

A good API will also have documentation explaining how to frame your calls to it to get the information you want. That is, instead of punching search terms into a search page and copying and pasting the results, you frame your request as a URL (more or less). The results often come back to you as text in which the data is organized as keys and values (JSON). Sometimes it’s just a table of data using commas to separate each field in each row (a .csv file). JSON looks like the following:

[image: an example of JSON-formatted data, courtesy Ian Milligan]
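
A made-up fragment along the same lines - a JSON description of a (fictional) newspaper article - might look like this; each piece of data is a value attached to a named key, and values can themselves be lists or further sets of keys and values:

{
  "title": "The Morning Example",
  "date": "1905-06-01",
  "subjects": ["archeology", "excavations"],
  "pages": 4
}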

In what follows, push yourself until you get stuck. I’m not interested in how far you get, but rather in how you document what you are able to do, how you look for help, how you reach out to others - or how you help others over the bumps. I know also that you all have lots of other claims on your time. Reading through all of this and making notes on what you do/don’t understand is fine too.

Getting material out of an API

Each API has its own idiosyncrasies, but we can always look at the documentation and figure out how to form our requests so that our programs grab the data we’re after. The Chronicling America website from the Library of Congress has digitized American newspapers from 1789 to 1963. While the site has a fine search interface, we’ll use the API to grab every article that mentions the word ‘archeology’ (sic).

First of all, go to the Chronicling America site and search in the text box for archeology. Notice how the url changes when it brings back the results:

https://chroniclingamerica.loc.gov/search/pages/results/?state=&date1=1789&date2=1963&proxtext=archeology&x=0&y=0&dateFilterType=yearRange&rows=20&searchType=basic

There’s a lot of stuff in there - amongst other things, we can see a setting for state, for date1 and date2, and our search term appears as the value for a setting called proxtext. This is the API in action.
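
In fact, you can experiment with this directly in the browser’s address bar. If you strip the url back to its essentials and ask for JSON by adding a format parameter, the site returns raw data instead of a web page of results (assuming the API behaves for you as it did when this was written) - something like:

https://chroniclingamerica.loc.gov/search/pages/results/?proxtext=archeology&format=json

That - the address of the search endpoint plus a handful of key=value parameters - is exactly the kind of url our script will build for us below.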

Let’s grab some data! (The script we’re writing is adapted from one created by Tim Sherratt).

  1. Open a new file in Sublime Text. We’ll first put a bit of metadata in our file, for the programme we’re going to write (we do this so that when we come back to this file later, we know what we were trying to do, etc):
#!/usr/bin/env python
"""
a script for getting materials from the Chronicling America website
"""

# Make these modules available
import requests
import json

__author__ = "your-name"

The first bit tells us this is a Python file. The next bit tells us what the file is for. The import lines tell Python that we’ll need a module (pre-packaged Python code that does a particular job) called requests, which lets us grab materials from the web, and json, which helps us deal with JSON-formatted data. The final bit says who wrote the script. Nb. The json module is part of Python’s standard library, so it’s already there, but the default Python environment on your machine may or may not have the requests module installed. If you get an error when you run this script to the effect that ‘requests’ was not found, you can install it at the terminal prompt with pip install requests.
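
If you do need to install something, the commands look like this at the terminal (or Anaconda prompt). Pygments is another library we’ll lean on a little further down to colour the output, so you might as well install it now if it’s missing too:

$ pip install requests
$ pip install pygments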

  1. Now we’re going to define some variables to hold the bit of the search url up to where the ? occurs - everything after the question mark is the set of parameters we want the API to search with. We create the api_search_url, define the parameter we want to search for, and define how we want the results returned to us. Remember, anything with a # is a comment. Good commenting makes your code readable and reusable! Add the following to your script:
# Create a variable called 'api_search_url' and give it a value
api_search_url = 'https://chroniclingamerica.loc.gov/search/pages/results/'

# This creates a dictionary called 'params' and sets values for the API's mandatory parameters
params = {
    'proxtext': 'archeology' # Search for this keyword     
}

# This adds a 'format' value to our dictionary, asking the API to return JSON
params['format'] = 'json'
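
You don’t need anything else for this walkthrough, but note that the other settings we spotted in the search url (state, date1, date2 and so on) could be added to this same dictionary to narrow the search. Here is a sketch, using the parameter names from the url above - leave these commented out for now so your results match the ones described below:

# Optional - narrow the search by date, using the parameter names from the search url
# params['date1'] = '1900'
# params['date2'] = '1920'
# params['dateFilterType'] = 'yearRange'
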
  1. Now we’ll send the request to the server, and we’ll add a bit of error checking so that if something is wrong, we’ll get some indication of why that is.

By the way, in the block below you’ll notice that some lines are indented. Indentations matter in Python, and you cannot mix spaces with tabs to effect the indentation. I would suggest you just use the TAB key on your keyboard to handle indentations. If you are using Sublime or Atom as your text editor, notice also how the text editor makes it easier to see the indentation level you are at.

Add this to your script:

# This sends our request to the API and stores the result in a variable called 'response'; it joins the api_search_url with the parameters of our search
response = requests.get(api_search_url, params=params)

# This shows us the url that's sent to the API
print('Here\'s the formatted url that gets sent to the ChronAmerica API:\n{}\n'.format(response.url))

# This checks the status code of the response to make sure there were no errors
# Use your keyboard's TAB key to indent the print statements
if response.status_code == requests.codes.ok:
    print('All ok')
elif response.status_code == 403:
    print('There was an authentication error. Did you paste your API key above?')
else:
    print('There was a problem. Error code: {}'.format(response.status_code))

  1. Now let’s get the results, and put them into a variable called ‘data’. Then we’ll print the results to the terminal, and finish by also writing the results to a file.
# Get the API's JSON results and make them available as a Python variable called 'data'
data = response.json()

# Let's prettify the raw JSON data and then display it.

# We're using the Pygments library to add some colour to the output, so we need to import it

from pygments import highlight, lexers, formatters

# This uses Python's JSON module to output the results as nicely indented text
formatted_data = json.dumps(data, indent=2)

# This colours the text
highlighted_data = highlight(formatted_data, lexers.JsonLexer(), formatters.TerminalFormatter())

# And now display the results
print(highlighted_data)

# dump json to file
with open('data.json', 'w') as outfile:
    json.dump(data, outfile)

Save your file as ca.py. Open a terminal/command prompt (remember Windows folks: anaconda command prompt) in the folder where you saved this file, and run it with:

$ python ca.py

Your terminal will look like it’s frozen for a few moments; that’s because your computer is reaching out to the Chronicling America website, making its request, and pulling down the results. But in seconds, you’ll have a data.json file with loads of data - the API reported 9021 matching articles when this lesson was written! (It hands them back a page of results at a time, so data.json holds the first batch of articles along with the overall count.)

Congratulations, you now have a program that you wrote that you can use to obtain all sorts of historical information. Now… why not search for something that interests you?

Your complete file will look like this:

#!/usr/bin/env python
"""
a script for getting materials from the Chronicling America website
"""

# Make these modules available
import requests
import json

__author__ = "your-name"

# Create a variable called 'api_search_url' and give it a value
api_search_url = 'https://chroniclingamerica.loc.gov/search/pages/results/'

# This creates a dictionary called 'params' and sets values for the API's mandatory parameters
params = {
    'proxtext': '' # Put the search keyword between the '' marks
}

# This adds a value for 'encoding' to our dictionary
params['format'] = 'json'

# This sends our request to the API and stores the result in a variable called 'response'
response = requests.get(api_search_url, params=params)

# This shows us the url that's sent to the API
print('Here\'s the formatted url that gets sent to the ChronAmerica API:\n{}\n'.format(response.url))

# This checks the status code of the response to make sure there were no errors
if response.status_code == requests.codes.ok:
    print('All ok')
elif response.status_code == 403:
    print('There was an authentication error. Did you paste your API key above?')
else:
    print('There was a problem. Error code: {}'.format(response.status_code))

# Get the API's JSON results and make them available as a Python variable called 'data'
data = response.json()

# Let's prettify the raw JSON data and then display it.
# We're using the Pygments library to add some colour to the output, so we need to import it

from pygments import highlight, lexers, formatters

# This uses Python's JSON module to output the results as nicely indented text
formatted_data = json.dumps(data, indent=2)

# This colours the text
highlighted_data = highlight(formatted_data, lexers.JsonLexer(), formatters.TerminalFormatter())

# And now display the results
print(highlighted_data)

# dump json to file
with open('data.json', 'w') as outfile:
    json.dump(data, outfile)
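
If you’d like a quick summary of what came back without scrolling through the whole wall of JSON, you could add a couple of lines like these to the end of your script. The field names here (‘totalItems’, ‘items’, ‘date’, ‘title’) are what the Chronicling America API appears to return - check your own data.json and adjust if yours looks different:

# Print the overall number of matches, then the date and title of each article on this page
print('Total matches:', data['totalItems'])
for item in data['items']:
    print(item['date'], '-', item['title'])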

Some other APIs

You can modify this code to extract information from other APIs, but it takes a bit of tinkering. In essence, you need to study the website you’re interested in to see how its API calls are formed, and then change the api_search_url, the search parameters dictionary, and the format setting accordingly. You can see this in action for instance here, with regard to the Metropolitan Museum of Art, or here, with regard to the Smithsonian. My Winter 2020 digital museums class made APIs from collections at some of our national museums; you can see a modified version of the code to access some of their work here.
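
To make that concrete, here is a purely illustrative sketch - the endpoint and parameter names below are invented placeholders, not a real API, so you would swap in whatever the documentation for your chosen collection actually specifies:

# A made-up example - these are placeholder values, not a real endpoint;
# replace them with what your chosen API's documentation tells you
api_search_url = 'https://api.example-museum.org/search/'

params = {
    'q': 'amphora'   # this imaginary API's name for 'search term'
}
params['format'] = 'json'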

But… it’s in json format?

JSON is handy for lots of computational tasks, but as a beginning digital historian you might want to have the data as a table. There are a couple of options here. The easiest right now - and there’s no shame in doing this - is to use an online converter. The site json-csv.com lets you convert your json file to csv or an Excel spreadsheet, and even transfer it over to a Google doc. Give that a shot right now. The text of the articles, by the way, is in the field ‘ocr_eng’, which tells us that the text was originally transcribed from the images using OCR, or optical character recognition - so there will be errors and weird glitches in the text. Fortunately, there’s also a URL with the direct link to the original document, so you can check things for yourself.
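
If you’d rather do that conversion yourself in Python, here is a minimal sketch using the pandas library (you may need to pip install pandas first). It assumes the list of articles sits under the ‘items’ key in your data.json, as in the results described above - adjust if your file is organized differently:

# A minimal sketch - flatten the 'items' list in data.json into a csv table using pandas
import json
import pandas as pd

with open('data.json') as infile:
    data = json.load(infile)

df = pd.json_normalize(data['items'])  # one row per article, one column per field
df.to_csv('data.csv', index=False)
print(df.columns.tolist())             # see what fields you have, e.g. 'ocr_eng'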

GLAM Workbench

‘GLAM’ stands for ‘galleries, libraries, archives, and museums’. The GLAM Workbench is by Tim Sherratt, a digital historian in Australia. I would strongly recommend that you explore it and give the Workbench a whirl if you are at all interested in the kinds of work you might be able to do when you can treat collections, computationally, as data. For a glimpse of what that might mean, check out this presentation.