Today, it is increasingly difficult for scholars to do historical, primary source research without institutional access to fulltext databases like Early English Books Online, Eighteenth-Century Collections Online, and the 17th and 18th Century Newspaper Collections, among others. These resources are often prohibitively expensive for smaller colleges and universities, and travel funding for archival research is down, especially for the kind of long-term archival research that would enable scholars to undertake some kinds of large-scale research.
These web-accessible resources have been, indeed, a boon to scholars across the United States and beyond; they enable the kind of granular reading traditional scholarship in the humanities has been built on, but they also enable research on macro levels, with vast bodies of content over time. As Franco Moretti puts it in Graphs, Maps, Trees, “within that old territory [is] a new object of study: instead of concrete, individual works, a trio of artificial constructs…in which the reality of the text undergoes a process of deliberate reduction and abstraction.” He calls this “distant reading,” where distance—in time, in space—is “not an obstacle, but a specific form of knowledge: fewer elements, hence a sharper sense of their overall interconnection.”
Resources like ECCO and EEBO can give scholars just that kind of knowledge; if one wants to know how many times, in what frequencies, and in what contexts a particular search term or string was used in English volumes published over the course of a century, one can use ECCO to find out. One can enter the search term, view a list of results, limit them in various ways, and examine the facsimile page images—even download the entire (non-searchable) PDF of a single text to a flash drive, especially helpful if one is using such tools at public libraries like the Library of Congress.
Yet there is to date no simple, user-friendly tool that allows the raw data of the results stream to be used for data mining or graphing; the metadata of the results cannot be downloaded to, for instance, a spreadsheet, for visual manipulation in tools like IBM’s Many Eyes or pivot charts. A browser plugin to capture the results stream, or an export feature built into the fulltext database itself, would give scholars the ability to explore, practically, Moretti’s theory of distant reading.
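To make the wish concrete, here is a minimal sketch in Python of what such an export might feed into. Everything in it is an assumption for illustration: the record fields (`title`, `year`, `hits`), the placeholder records, and the function names are hypothetical, and none of them reflect how ECCO or EEBO actually structure their results. It simply shows result metadata being written to a spreadsheet-friendly CSV file and tallied by decade.

```python
import csv
import io
from collections import Counter

# Hypothetical result records. The field names and values are placeholders,
# not the real schema or real data of any fulltext database.
SAMPLE_RESULTS = [
    {"title": "Placeholder title A", "year": 1712, "hits": 3},
    {"title": "Placeholder title B", "year": 1718, "hits": 2},
    {"title": "Placeholder title C", "year": 1745, "hits": 1},
]

FIELDS = ["title", "year", "hits"]


def write_results_csv(records, handle):
    """Write result metadata as CSV rows to any open text handle."""
    writer = csv.DictWriter(handle, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(records)


def hits_per_decade(records):
    """Sum the search-term hits for each decade of publication."""
    counts = Counter()
    for record in records:
        decade = (int(record["year"]) // 10) * 10
        counts[decade] += int(record["hits"])
    return dict(sorted(counts.items()))


if __name__ == "__main__":
    # An in-memory buffer stands in for a file on disk.
    buffer = io.StringIO()
    write_results_csv(SAMPLE_RESULTS, buffer)
    print(buffer.getvalue())
    print(hits_per_decade(SAMPLE_RESULTS))
```

A file like this could be opened directly in a spreadsheet or charting tool, and the per-decade tally is the sort of "deliberate reduction and abstraction" that a graph of term frequency over a century would be built from.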
3 Replies to “Piping Fulltext Database Result Streams to Manipulable Data Files for Distant Reading”
UPDATE: On Friday, June 10, 2011, I’m having a Skype conference with some folks at Cengage to discuss new methods of data mining like the one I’ve described above. I would love to be able to pass on others’ ideas, so if you have a quick thought, please comment below!
I’ve had discussions with various people about this, but the relatively crude keyword searches available in ECCO have frustrated me over the years (my institution doesn’t own it, but I can go to Rice next door to access it). What I would like to do is figure out a way to carve out a generically- and chronologically-delimited subset and search for terms within that set. Maybe create a customized set of results (short records? contextualized snatches?) that goes into a PDF that one can put onto a flash drive. That’s my wish, anyway.