An Introduction to Topic Models
that link at the end of the video, if you’re interested: http://bit.ly/sw-tm-tour
Topic Model in R
The Topic Model Tool is an excellent tool for building and exploring topic models. But you might want to try building a topic model in R. R is a powerful language for statistical exploring, visualizing, and manipulating all kinds of data, including textual. While R can be worked with from the command line, RStudio is what we need to preserve our sanity. RStudio can be started up from your Anaconda Navigator - you met RStudio back in week 2 remember.
R allows us to do scripted analyses. We write out the sequence of transformations or analyses to be done, making a kind of program that we can then feed our data into. Once we have a workflow that does what we need it to do, it becomes trivial to re-use the code on other datasets. What’s more, when we do an analysis, we can publish the scripts and the data. We don’t have to taken an author’s word for it: we can re-run the analysis ourselves or build upon it.
This course only touches in the lightest manner on the potential for R for doing digital history. Please see Lincoln Mullen’s Computational Historical Thinking with Applications in R for more instruction if you’re interested.
A Walkthrough
We’re going to use the same data as we did before, but this time, it is organized as a table where each chapbook’s text is one row. There are columns for ‘title’, ‘text’, ‘date’, each separated by a comma, so a bit of the metadata is included. You may download this comma separated, or csv, table from this link.
- Make a new folder on your computer called
chapbooks-r
. Put thechapbooks-text.csv
inside it. - Open R Studio. Under ‘File’ select ‘new project’ and then ‘From existing directory.’ In the chooser, select your
chapbooks-r
folder. Having a ‘project’ in R means that R knows where to look for data, and where to save your work or restore from.
- Let’s start a new R script. Hit ‘new file’ under ‘file’ (or the button with the green-and-white plus side, top left) and select ‘r script’
- We need to install some pieces first. In the Console enter these two commands:
|
|
You’ll only have to do this once. If you get a yellow banner warning across the top of your script screen saying some extra packages weren’t installed and would you like them to be installed, say ‘yes’.
- Now, in our script, let’s begin. We’ll invoke those packages, and load our data up. In this step, we’re also doing a bit of cleaning with a regular expression to remove digits from the text field of our data. You can copy and paste the code below into your R script; run it one line at a time.
|
|
-
In the console, take a look at how your data has been transformed. Type
cb
and hit enter - you’ll see a table very much like how your data would look if you opened it in excel. Typecb_df
and it’s still table like, but columns and digits are removed. Typetidy_cb
and you’ll see it’s a list of words, with metadata indicating which document the word is a part of, and which year the document was written! Transforming your data like this makes it more amenable to text analysis. Let’s continue. -
We can now transform that list into a matrix, which will enable us to do the topic model. Add, and then run, the following lines to your script:
|
|
- Now we’ll build the topic model!
|
|
Computers sometimes need to use random numbers in their calculations. By using set.seed
, we enable the machine to generate the same sequence of random numbers every time we do these calculations. The effect is that my results and your results will look the same. If you remove the seed, your results will differ from mine in slightly different, minor, ways.
Here we’re building a model with 15 topics; we run it for 500 iterations. You can tweak those variables if, after examining the results, you find that the model captures too much (or not enough) variability. The topicModel
command takes a bit of time to run, even at just 500 iterations. The higher-end computer you have, the quicker this will run.
- Now, remember that every document will be composed of all 15 topics, just in different proportions; similarly, every topic has every word, but again, in different proportions. So we’ll rearrange things just so that we get the top say 5 terms per topic; these 5 words will give us a sense of what the topic is about. Do you spot where you might change that to say 10 terms?
|
|
- Now comes the fun part - visualization of the patterns!
|
|
- Finally, let’s look at our topic model over time, which is why you’re all here, right?
|
|
Interesting how stories about the army and the English seem to decrease over time in the 18th century… perhaps that’s meaningful? Go back and re-run your script from the point where you selected 15 topic models. Try more topic models. Try fewer. Do you spot any interesting patterns?
Save your script as tm.R
. Put it in your repo for this part of the course. Save screenshots of your findings, and make notes of your observations. These can be shared in your weekly work too - make sure the image file and your notes.md file are in the same repository, then you can insert an image like this:
![description of image](filename.jpg)
or filename.png etc.
Now, there’s a lot in this code that is impenetrable to you right now. But frankly, there was a lot going on with the Topic Modeling Tool that was impenetrable too, but it was hidden behind a user interface. At least here, you can see what it is that you do not yet understand. One of the attractions of doing this kind of work in R and R Studio is that these scripts are reusable, shareable, and citeable. A script like this might accompany a journal article, allowing the reviewers and the readers alike to experiment and see if the author’s conclusions are warranted. Alternatively, you could replicate the study on a different body of evidence, provided that you arranged your data in the same way our original source tables were arranged.
Communicating the process, the meat-grinding, of your research this way leads to better research. And it’s a helluva lot easier to point to a script than say ‘now click this… then this… then if you’re using this version, there should be a drop down over here, otherwise it’ll be… now click the thing that looks like…’