In this post, I want to talk a little about my research design process. As I mentioned in my last post, I knew I wanted to create a network visualization, and I also knew I wanted to test Clover’s theory about the self-consciousness of horror as a genre. What would I need to do this?
My approach was to lean into the idea of self-awareness in film. How could I measure this? I know there are many films, at least many in my own image-repertoire, that feature the act of creating, of making something in an aesthetic way. I think this is an important root for the idea of self-awareness, especially in art. What better way to “talk about” or to render what we think we’re doing than by taking it up as a major part of the content we’re making ourselves? Looking first for films about painting, sculpture, photography, acting, filmmaking, and so on (that is, films that are about art and creativity in some meaningful way) should help me identify self-conscious films. I could then count them and organize them by country, by decade, by year, by theme, and more.
I looked first at the data made available by IMDb via their daily downloads. This data is pretty robust as it is, though IMDb is itself crowd-sourced, which brings its own biases and data entry problems. But for the time being, we’ll bracket that! I primarily used two of these data sources: the title.basics.tsv.gz file and the title.ratings.tsv.gz file. Currently, IMDb has 1,048,576 films in its dataset, and each of these films is represented in these two downloads.
The basic title dataset contains a unique alphanumeric identifier for each film, a column indicating the type/format of the title (movie, short, TV series, TV episode, video), two columns indicating primary and original titles, a boolean indicator of whether the film is an adult movie, start and end year columns (start year is used for all types, while end year is used to indicate spans for TV series), a column for runtime in minutes, and a string array of up to three genres associated with each title.
The ratings dataset contains, again, the unique alphanumeric identifier for each film, as well as a column for average rating and a column for total number of votes, which is a loose indicator of popularity.
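As a rough sketch of how the two downloads fit together, the snippet below reads both files and joins them on the shared identifier. The column names (tconst, titleType, genres, averageRating, numVotes) and the `\N` marker for missing values follow IMDb's published dataset format; the sample rows, the helper names, and the use of the standard library's csv module are assumptions for illustration, not the code used in the project.

```python
import csv
import gzip
import io


def read_tsv(handle):
    """Read an IMDb-style TSV into a list of dicts, mapping the '\\N' marker to None."""
    # The IMDb files are not quoted, so quote characters are treated literally.
    reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    return [
        {key: (None if value == r"\N" else value) for key, value in row.items()}
        for row in reader
    ]


def load_gz(path):
    """Read one of the downloaded .tsv.gz files from disk."""
    with gzip.open(path, "rt", encoding="utf-8", newline="") as handle:
        return read_tsv(handle)


def join_on_tconst(basics, ratings):
    """Attach rating columns to each basics row; titles without ratings get None."""
    ratings_by_id = {row["tconst"]: row for row in ratings}
    joined = []
    for row in basics:
        rating = ratings_by_id.get(row["tconst"], {})
        joined.append({
            **row,
            "averageRating": rating.get("averageRating"),
            "numVotes": rating.get("numVotes"),
        })
    return joined


# Made-up sample rows standing in for the real downloads.
BASICS_SAMPLE = "\n".join([
    "\t".join(["tconst", "titleType", "primaryTitle", "originalTitle", "isAdult",
               "startYear", "endYear", "runtimeMinutes", "genres"]),
    "\t".join(["tt9000001", "movie", "Example Horror", "Example Horror", "0",
               "1985", r"\N", "90", "Horror,Thriller"]),
    "\t".join(["tt9000002", "short", "Example Short", "Example Short", "0",
               "2001", r"\N", "12", "Drama"]),
]) + "\n"

RATINGS_SAMPLE = "\n".join([
    "\t".join(["tconst", "averageRating", "numVotes"]),
    "\t".join(["tt9000001", "6.1", "1200"]),
]) + "\n"

basics_rows = read_tsv(io.StringIO(BASICS_SAMPLE))
ratings_rows = read_tsv(io.StringIO(RATINGS_SAMPLE))
joined_rows = join_on_tconst(basics_rows, ratings_rows)
```

With the real files, `load_gz("title.basics.tsv.gz")` would take the place of the in-memory samples.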
From these datasets, I kept only the titles that contained “horror” in the genre array, to create a horror-only subset. This first pass gave me a dataset of 19,682 rows, and from there, I limited it further to only movies or shorts, resulting in a dataset of 9,637 rows. In my first iteration of the project, I used this list to scrape additional information not included in the daily downloads from IMDb: specifically, plot summary, film synopsis, and country of origin (which is also an array, of up to 5 or 6 countries). I had sought to pull box office info as well, but only a tiny fraction of films contained this data, so I opted to ignore it in the end. Overall, this was a relatively easy task, though time-consuming, with the Beautiful Soup package; happily, the URL for each film is composed of its unique alphanumeric identifier. The HTML and CSS for any given film entry allowed me to identify which text chunks I wanted to scrape.
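The two filtering passes and the URL construction might look something like the sketch below. The sample rows and helper names are hypothetical, and the URL pattern is an assumption based on how IMDb title pages are usually addressed.

```python
def has_genre(row, genre):
    """True if the comma-separated genres field contains the genre (case-insensitive)."""
    genres = (row.get("genres") or "").lower().split(",")
    return genre.lower() in genres


def is_movie_or_short(row):
    """True for the two title types kept in the second pass."""
    return row.get("titleType") in {"movie", "short"}


def title_url(tconst):
    """Build a title page URL from the identifier (assumed URL pattern)."""
    return f"https://www.imdb.com/title/{tconst}/"


# Made-up rows; the real data has many more columns.
sample_rows = [
    {"tconst": "tt9000001", "titleType": "movie", "genres": "Horror,Thriller"},
    {"tconst": "tt9000002", "titleType": "tvEpisode", "genres": "Horror"},
    {"tconst": "tt9000003", "titleType": "short", "genres": "Comedy"},
    {"tconst": "tt9000004", "titleType": "short", "genres": None},
]

# First pass: anything tagged as horror, whatever its format.
horror_titles = [row for row in sample_rows if has_genre(row, "horror")]

# Second pass: only movies and shorts.
horror_films = [row for row in horror_titles if is_movie_or_short(row)]

urls_to_scrape = [title_url(row["tconst"]) for row in horror_films]
```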
Several films include more than one plot summary, though others contain no summary at all. While the summary (also called the “storyline” in IMDb) is relatively brief, perhaps 2-6 sentences, the synopsis is typically quite long, and includes a blow-by-blow of the action in the film. I elected to include only the first plot summary and the synopsis, if available. If either was missing, I filled that field with NaN. With these two text fields, I could search for keywords to identify films about art.
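Here is a minimal sketch of that selection step with Beautiful Soup. The markup and the class names `summary` and `synopsis` are invented for the example; IMDb's real pages use different selectors, and they change over time.

```python
import math

from bs4 import BeautifulSoup

# Hypothetical markup: two summaries and one synopsis.
SAMPLE_HTML = """
<html><body>
  <ul>
    <li class="summary">A sculptor's statues begin to move at night.</li>
    <li class="summary">An artist is haunted by her own work.</li>
  </ul>
  <div class="synopsis">The film opens in a studio, where a sculptor works late.</div>
</body></html>
"""


def extract_text_fields(html):
    """Return the first summary and the synopsis, using NaN where either is absent."""
    soup = BeautifulSoup(html, "html.parser")
    summaries = soup.select("li.summary")         # assumed selector
    synopsis = soup.select_one("div.synopsis")    # assumed selector
    return {
        "summary": summaries[0].get_text(" ", strip=True) if summaries else float("nan"),
        "synopsis": synopsis.get_text(" ", strip=True) if synopsis else float("nan"),
    }


fields = extract_text_fields(SAMPLE_HTML)
empty_fields = extract_text_fields("<html><body></body></html>")
```

Only the first of the two summaries ends up in `fields`, and both values in `empty_fields` are NaN (which `math.isnan` can check for).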
I devised a keyword schema focusing on simple keywords and their permutations, like “create/creating/creates/created,” “painting/paint/painter/painted/portrait,” “art/artist/artists,” “sculpture/sculptress/statue,” and so on. My keywords included the following categories:
- art or artist
- acting or theater
- filmmaking or recording
At some point during the process, I inadvertently left “music” out, so the final project prototype does not include it.
I then searched the refined dataset with text fields for words in my keyword clusters. If I had known more about NLP at that point in the process, I would have lemmatized the text, which would have made the act of refining the keyword searches much easier and more accurate. Hindsight and learning!
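A keyword search of this kind can be sketched with regular expressions. The first cluster below reuses words listed earlier in the post; the words in the other two clusters are illustrative guesses rather than the actual lists. Word boundaries matter here, since otherwise “art” would match inside “party” or “starts.”

```python
import re

KEYWORDS = {
    "art or artist": ["art", "artist", "artists", "paint", "painting", "painter",
                      "painted", "portrait", "sculpture", "sculptress", "statue"],
    # The next two lists are illustrative guesses, not the project's real lists.
    "acting or theater": ["actor", "actress", "acting", "theater", "theatre"],
    "filmmaking or recording": ["filmmaker", "filmmaking", "camera", "recording"],
}


def compile_cluster(words):
    """Build one case-insensitive, whole-word pattern for a keyword cluster."""
    alternation = "|".join(re.escape(word) for word in words)
    return re.compile(rf"\b(?:{alternation})\b", re.IGNORECASE)


PATTERNS = {category: compile_cluster(words) for category, words in KEYWORDS.items()}


def find_keywords(text):
    """Map each category to the set of keywords found; non-text input (NaN) gives {}."""
    if not isinstance(text, str):
        return {}
    hits = {}
    for category, pattern in PATTERNS.items():
        found = {match.group(0).lower() for match in pattern.finditer(text)}
        if found:
            hits[category] = found
    return hits


example_hits = find_keywords("A painter finishes a portrait before the party starts.")
```

Lemmatizing the text first would let each cluster shrink to a handful of base forms instead of spelling out every inflection.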
Many of the films matched only one keyword; looking manually at these films, it appeared that they were not really focused on the creative act. A synopsis might use the word “painting” for a painting shown under the credit sequence of a film from the 1940s, or mention a “statue” offhand. Often, these references were in the highly detailed plot synopses. In the final prototype, I elected to include only films with two or more keywords related to making, which helped to weed out false positives.
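The two-keyword cutoff could be applied roughly as follows. This sketch assumes that “two or more keywords” means two or more distinct keywords across the summary and synopsis combined, which is one reading among several; the short word list and sample films are made up.

```python
import re

# Shortened, illustrative keyword list.
MAKING_WORDS = ["paint", "painting", "painter", "portrait", "statue", "sculpture", "artist"]
PATTERN = re.compile(
    r"\b(?:" + "|".join(re.escape(word) for word in MAKING_WORDS) + r")\b",
    re.IGNORECASE,
)


def distinct_keywords(*texts):
    """Collect the distinct keywords found across several text fields, skipping NaN."""
    found = set()
    for text in texts:
        if isinstance(text, str):
            found.update(match.lower() for match in PATTERN.findall(text))
    return found


def passes_threshold(film, minimum=2):
    """Keep a film only if it matches at least `minimum` distinct keywords."""
    keywords = distinct_keywords(film.get("summary"), film.get("synopsis"))
    return len(keywords) >= minimum


sample_films = [
    {"title": "Offhand Mention", "summary": "A statue stands in the hall.",
     "synopsis": float("nan")},
    {"title": "About Making", "summary": "A painter is haunted by a portrait.",
     "synopsis": "The artist locks the studio door."},
]

kept_titles = [film["title"] for film in sample_films if passes_threshold(film)]
```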
Once I had this part done, I could begin calculating relationships and percentages!
But wait… A colleague, during one of our review sessions, suggested that to really dig into this question of whether horror films were more likely than other genres to focus on topics of making or be self-aware, I would need a comparison set of non-horror films. This meant going back to the drawing board, in some ways, though the overall principles remained the same. Ultimately, I retrieved data not only for horror films but also for drama, mystery, thriller, action, romance, and so on. That dataset was so large that I had to limit it to either feature-length films or shorts; I elected to focus on feature films because they may be more familiar to my audiences.
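To make the comparison concrete, here is one way the share of films about making could be computed per genre. The `about_making` flag, the sample films, and the choice to count a multi-genre film once under each of its genres are all assumptions for illustration.

```python
from collections import Counter


def share_by_genre(films):
    """Return, for each genre, the percentage of its films flagged as about making."""
    totals, flagged = Counter(), Counter()
    for film in films:
        for genre in film["genres"]:
            totals[genre] += 1
            if film["about_making"]:
                flagged[genre] += 1
    return {genre: 100.0 * flagged[genre] / totals[genre] for genre in totals}


# Made-up films; a film with several genres is counted once under each.
sample_films = [
    {"genres": ["Horror"], "about_making": True},
    {"genres": ["Horror", "Thriller"], "about_making": False},
    {"genres": ["Drama"], "about_making": False},
    {"genres": ["Drama", "Romance"], "about_making": False},
    {"genres": ["Thriller"], "about_making": True},
]

shares = share_by_genre(sample_films)
```

Because genres overlap, the per-genre percentages are not independent of one another, which is worth keeping in mind when reading them side by side.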
More to come!