Amateur sleuths are hard at work trying to unmask the author of that anonymous Trump-bashing op-ed in The New York Times, but even professional linguists say that identifying the “senior official” through the 957 words published this week will be very difficult.
Robert Leonard, a professor of linguistics at Hofstra University, is a forensic-linguistics expert who has worked on espionage and first-degree murder cases. His colleagues, Georgetown professor Roger Shuy and FBI forensic linguist James Fitzgerald, helped crack the Unabomber case by analyzing Ted Kaczynski's 35,000-word manifesto.
So far, he’s unimpressed with the public’s attempts to figure out who wrote the piece through unscientific analysis, much of which has focused on a single, unusual word.
“Everyone is pointing to ‘lodestar’ and they're saying ‘Who uses lodestar?’” Leonard told The Daily Beast. “But it could be someone trying to throw people off the trail.”
“Lodestar isn’t a useful bit of data at all,” he added.
Leonard said a professional like himself would need days and documents to do a proper analysis.
“It’s what's called authorship analysis and demographic linguistic profiling,” he told The Daily Beast. “And you can tell a lot from it.”
An author’s prose can give a sense of how well they are educated and where they were raised. He noted that Kaczynski used the phrase “rearing children” in the Unabomber manifesto, a usage more common in the northern Midwest than in other parts of the country.
Dialect and regionalisms can play a crucial role.
“It’s a good data point because you unconsciously do it,” Leonard said. “You’re giving away a lot about where you're from and who you are and what you do.”
But there are plenty of pitfalls in this work.
“People keep asking if it's a man or a woman,” he said. “And there have been data-driven studies on what a woman would say and write compared to a man. But that doesn’t mean every woman writes like that. If you are a woman and you wanted to obfuscate your identity, you would make sure to avoid that.”
The type of document being picked apart is also limiting, unless the author of the anonymous Times piece has also written a lot of other op-eds.
“You want to match the style, the register of the language,” Leonard said. “You can’t compare an op-ed to a grocery list to a love letter.”
The method that Leonard uses, a careful analysis of each and every word, of how words are strung together to form sentences, and of the intention behind phrases, is time-consuming.
But there’s a second way to do the detective work: machine learning, which analyzes semantics and syntax under the theory that every writing style is unique and the way someone stitches together a sentence to convey thought is specific to where they come from, who they are and their state of mind.
Data scientists have pounced on the op-ed, using programs that compute “term frequency-inverse document frequency,” or TF-IDF: tests that catalog the distinctive terms in a piece of writing and compare them with other pieces of writing, looking for the equivalent of a literary fingerprint.
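The TF-IDF idea described above can be sketched in a few lines. This is a minimal illustration of the technique, not any specific tool the data scientists used; the toy documents are invented for demonstration.

```python
import math
from collections import Counter

def tfidf(documents):
    """Compute TF-IDF scores for a list of tokenized documents.

    TF  = a term's frequency within one document.
    IDF = log(N / number of documents containing the term),
          which pushes terms shared by every document toward zero
          and leaves distinctive terms with high scores.
    """
    n = len(documents)
    # Document frequency: how many documents contain each term?
    df = Counter()
    for doc in documents:
        df.update(set(doc))
    scores = []
    for doc in documents:
        tf = Counter(doc)
        total = len(doc)
        scores.append({
            term: (count / total) * math.log(n / df[term])
            for term, count in tf.items()
        })
    return scores

# A rare word like "lodestar" scores high in the document that uses it;
# words appearing in every document score zero.
docs = [
    "the lodestar guides the ship".split(),
    "the ship sailed past the harbor".split(),
]
scores = tfidf(docs)
```

Comparing these per-document score vectors is what yields the “literary fingerprint” the article describes.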
But Morteza Dehghani, an assistant professor of psychology and computer science at the University of Southern California, said TF-IDF hasn’t been useful in putting a name to the op-ed writer.
“What’s been done so far is not scientific and simply silly,” he said.
Before any conclusions could be drawn from the method’s predictions, the model would first have to be shown to work. To do that, Dehghani suggested taking op-eds written by White House staff and masking the authors’ names from the algorithm. Those op-eds would then be run through the algorithm to see whether it could identify their authors accurately. If the machine identified them correctly, or at least identified a majority of them correctly, that would mean the model worked and could then be applied to the anonymous op-ed.
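Dehghani's proposed sanity check can be sketched with a deliberately simple attribution model: nearest-author matching by cosine similarity over word counts. Everything here, the similarity measure, the `attribute` helper, and the sample texts, is a hypothetical stand-in for whatever model is actually being validated.

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two word-count vectors (Counters)."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = lambda v: math.sqrt(sum(c * c for c in v.values()))
    denom = norm(a) * norm(b)
    return dot / denom if denom else 0.0

def attribute(known, unknown_text):
    """Guess the author whose known writing is most similar to the unknown text."""
    target = Counter(unknown_text.split())
    return max(known, key=lambda a: cosine(Counter(known[a].split()), target))

# Invented corpus of signed writing, standing in for real op-eds.
known = {
    "author_a": "steady resolve quiet resistance inside the administration",
    "author_b": "tax policy growth markets deregulation economy",
}

# The validation step: hold out a fresh sample whose author we already
# know, mask the name, and see whether the model recovers it. Only if
# it does would we trust the model on a genuinely anonymous text.
held_out = "quiet resistance and steady resolve"
predicted = attribute(known, held_out)
```

If `predicted` matches the held-out sample's true author (here, the overlap clearly points to `author_a`), the model passes the check; in practice this would be repeated across many held-out documents and scored in aggregate.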
As with Leonard’s method, that’s difficult when there aren’t enough op-eds to mine. What internet Sherlocks have done is what Dehghani calls “cross domain classification”—taking tweet language and comparing it with the op-ed.
That’s problematic, Dehghani said, since people communicate differently on different platforms.
Several of the analyses that have been posted online using TF-IDF also drop what Dehghani called “stop words,” the words that bridge ideas together: like, of, that, the, and so on. Those words might seem unimportant in a computer analysis, but Dehghani said they actually serve as “indirect pointers to syntax,” and dropping them removes pieces of the puzzle.
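One way to see why stop words matter is to treat their rates as features in their own right, a classic move in stylometry. The sketch below, with an assumed short word list, computes a per-1,000-words profile of function-word usage, exactly the signal that gets thrown away when stop words are dropped.

```python
from collections import Counter

# A small, assumed list of English function words; real stylometric
# studies use much longer lists.
STOP_WORDS = {"the", "of", "that", "and", "to", "a", "in", "is"}

def function_word_profile(text):
    """Rate of each stop word per 1,000 words of the text.

    Authors differ measurably in how often they use these little
    words, so the profile acts as a stylistic signature.
    """
    words = text.lower().split()
    counts = Counter(w for w in words if w in STOP_WORDS)
    total = len(words)
    return {w: 1000 * counts[w] / total for w in STOP_WORDS}

profile = function_word_profile("the cat sat on the mat")
```

Comparing two authors' profiles (for example, by distance between the rate vectors) is one way the "indirect pointers to syntax" Dehghani mentions can be kept in the analysis.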
“Whatever method we use would need to probably include information about intentions and deeper information than just a bag of words,” he said. “Analysis of semantics is important, but syntax is key. We need human experts to annotate this piece for intentions of the author.”
And what if the intention of the author is to deceive? The assumption that the op-ed writer is being open, honest, and vulnerable, and that any clues dropped through word choice are genuine, is just that: an assumption.
Dehghani and Leonard warned that the writer is probably one step ahead of the public. They may be imitating the style of someone else or adding unusual language as false flags.
Additionally, the op-ed was likely edited for structure and clarity, which can also obscure the text’s “tells” and make it harder to parse than something like the Unabomber’s freewheeling writings.
“When you have documents that someone has written candidly without thinking they would be analyzed, that's what you want,” Leonard said.
CORRECTION, 12:55 a.m., 9/7/18: This post has been updated to reflect that Morteza Dehghani teaches at the University of Southern California, not the University of California, Berkeley.