Why Big Data Doesn’t Live up to the Hype
In 1996, the artist Karen Reimer alphabetized the entire text of a romance novel and published the results as the book Legendary, Lexical, Loquacious Love. By listing every instance of each word in the original novel, she allowed the frequencies of words to tell a subversive story about the entire genre of the romance novel. The word “beautiful” appears 29 times, and “breasts” occupies one-third of a page. “Her” fills eight pages, while “his” fills only two and a half. The word “intelligent” appears precisely once.
While it’s not shocking that feminine pronouns and anatomy predominate in a romance novel, the technique of studying language through the lens of word frequency does yield a striking insight. The first person to notice what is now recognized as a universal principle of all languages was a Harvard professor of German literature named George Zipf. In 1937, he used a word index to James Joyce’s Ulysses to rank the frequency of every word in the novel. Once he completed this epic task, he noticed that a few words were incredibly frequent. The less frequent a word, the greater the number of words with that frequency. The tenth most frequent word in Ulysses, for example, is “I,” which appears 2653 times. The 100th most frequent word, “say,” occurs 265 times, and the 1000th most frequent word, “step,” appears 26 times. “Indisputable,” which Joyce used only twice, was tied for the 10,000th spot. The bulk of the novel’s vocabulary, in other words, was composed of words that occurred very rarely.
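The arithmetic behind these figures is easy to check. If a word's count is inversely proportional to its rank, then rank times count should stay roughly constant. The Python sketch below uses only the four rank and count pairs quoted above; the variable and function names are illustrative, and the assumption of an exact inverse relation (an exponent of 1) is an idealization rather than something the figures prove.

```python
# Rank -> count pairs for Ulysses, exactly as quoted in the text.
observations = {10: 2653, 100: 265, 1000: 26, 10000: 2}

# Under an idealized inverse relation, rank * count is roughly constant.
products = {rank: rank * count for rank, count in observations.items()}


def predicted_count(rank, constant):
    """Count expected at a given rank if count = constant / rank (assumed form)."""
    return constant / rank


# Use the product at rank 10 as the constant (an arbitrary illustrative choice).
constant = products[10]

for rank, count in observations.items():
    print(
        f"rank {rank:>6}: observed {count:>5}, "
        f"rank x count = {products[rank]:>6}, "
        f"predicted {predicted_count(rank, constant):.1f}"
    )
```

The products come out at 26,530, 26,500, 26,000, and 20,000. The first three agree closely, while the last drifts, which is unsurprising given that a count of 2 is shared by many tied words.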
This is not just a quirk of Joyce’s erudition; the same inverse relation between number and frequency of words characterizes every known language. While a typical adult speaker of English has a vocabulary of roughly 60,000 words, a mere 4000 words supply the material for 98% of conversation. Zipf was working in an analog world. Imagine a dataset of words pulled not from a single romance or modernist novel, but from a vast digital database of every book published in the last 200 years. What patterns and predictions would emerge?
It’s easy to exaggerate the importance of what such a tool could discover. Sometimes it seems the only thing larger than big data is the hype that surrounds it. Within the first 30 pages of Uncharted: Big Data as a Lens on Human Culture, Erez Aiden and Jean-Baptiste Michel manage to compare themselves to Galileo and Darwin and suggest that they, too, are revolutionizing the world. The authors were instrumental in creating the Google Ngram viewer, which allows researchers or anyone else so inclined to explore the changing frequencies of words across time. Likening their creation to a cultural telescope, they proceed to share some of their ostensibly dazzling findings.
Some results are interesting, though many shade into trivia. Bill Clinton, for instance, at the peak of his Ngram-measured fame, was about as frequent as the word “lettuce.” Lewis Carroll really did introduce the word “chortle” to the English language in his 1871 poem Jabberwocky. And Charles Dickens seems to have popularized the greeting “Merry Christmas,” which soars in frequency after the 1843 publication of A Christmas Carol. Many of the Ngram findings fall into one of two categories: things people didn’t know but also didn’t really need to know (Bill Clinton and lettuce), or things people already knew.
Ngram data do offer precise and seemingly unbiased confirmation of trends that human historians have often identified in somewhat hazier ways. Consider, for instance, a significant transition in the history of American self-conception: the switch from the phrase “the United States are” to “the United States is.” When did Americans start considering themselves a singular entity rather than a collective of mostly autonomous parts? A traditional answer, and one given by James McPherson in his Civil War history Battle Cry of Freedom, is that the Civil War marks this moment of transition.
By sifting through millions of digitized books in Google’s enormous database, the Ngram viewer can reveal that the singular did not surpass the plural until 1880. It also reveals in vivid visual form the precise contour of the transition. Aiden and Michel present this as an awe-inspiring demonstration of the power of their cultural telescope, but its actual impact seems modest. McPherson already alludes to a transition, not an instant and total switch, and he comes quite close to pinpointing the precise date. More importantly, he suggests a causal mechanism—he interprets the past. In this instance, the Ngram data offer a refinement, not a revelation.
Another somewhat underwhelming demonstration concerns an automated detector the authors designed to measure the degree of censorship of certain individuals under the Third Reich. After generating the chart, they asked a mere human historian to produce the same results without Ngram data. Rather than failing miserably, she provided a list of names that agreed with their data a vast majority of the time.
To their credit, Aiden and Michel freely acknowledge the distortions inherent in their device. For one thing, people who write books tend to write about other people who write books, so the Ngram data often exaggerate the cultural prominence of academics and authors. It’s also easy to confuse correlation with causation. Did the increasing frequency of the word “zombie” contribute to the rising occurrence of “the future,” was the causation reversed, are the trends unrelated, or do both reflect a deeper cause?
There are some genuinely interesting patterns that Ngram data reveal and that would be hard to measure precisely by other means. One example is the exponential growth of fame curves and the fact that luminaries across various fields tend to follow an identical trajectory from obscurity to fame and back to relative obscurity. Another is the fact that even the best dictionaries struggle to detect exceedingly rare words. When a word’s frequency is below one in a million, the odds that a dictionary has omitted it will increase dramatically. Ultimately, however, Aiden and Michel’s enthusiasm seems best explained by an Ngram that plots the relative frequency of the words “God” and “data.” Data eclipsed God in 1973, and its continuing ascendance suggests a culture that treats it as a surrogate divinity.