The AI That Wasn’t: Why ‘Eugene Goostman’ Didn’t Pass the Turing Test
A computer program may have passed so well for a human that 10 of 30 judges of a contest were fooled. But that doesn’t mean much.
We’ll start at the beginning. The Turing test is named for computer scientist, mathematician, logician, and philosopher Alan Turing. Turing famously helped break German codes in World War II; earlier, in 1936, he proved that machines could carry out any mathematical computation that can be expressed as an algorithm. Western civilization owes Alan Turing a lot (especially given that he was essentially driven to suicide by homophobia).
In his 1950 paper “Computing Machinery and Intelligence,” Turing asked, “Can machines think?” He concluded that it’s difficult to define thinking, so he substituted a different question, one more easily answerable: “Are there imaginable digital computers which do well in the imitation game?”
The strength of the test is obvious: “intelligence” and “thinking” are fuzzy words, and no definition from psychology or neuroscience has been sufficiently general and precise to apply to machines. The Turing test sidesteps the messy bits to provide a pragmatic framework for testing.
But this strength is also the test’s weakness. Turing at no point explicitly says that his test is meant to provide a measure of intelligence. For instance: human behavior isn’t necessarily intelligent behavior—take responding to an insult with anger. Or typos: normal and human, but intelligent?
“It’s important to understand what Turing was doing,” said Stuart Russell, a professor of computer science at University of California, Berkeley. “It wasn’t trying to define intelligence. It’s more like, when we decide to look at this behavior, we don’t really understand how humans produce it either. So if you had a conversation like his sample, it may be reasonable to ascribe intelligence to the system.”
It wasn’t meant to be an applied test, not in 1950 and not now, Russell said. But that’s how the contest used it. For the contest held last weekend by the University of Reading, 30 judges were each to hold 10 conversations, five with machines and five with humans. The judges were asked to vote on whether they were speaking to a machine or a human. Eugene Goostman, a program imitating a 13-year-old Ukrainian boy, fooled a third of the judges, enough for the organizers to consider it a “passed test.”
That Eugene was programmed to be a non-native English speaker gave it an advantage, as did its supposed age of 13. We expect different things from pubescent boys whose first language isn’t English than from adults raised with the language. So Eugene already had a leg up: an interlocutor could explain away any failed communication.
But it’s not just that. The definition of “passing” the Turing test used for the examination is a particular interpretation of the paper’s prediction. Turing wrote that in 50 years, it would be possible for computers to play the “imitation game” (imitating human speech) “so well that an average interrogator will not have more than 70 percent chance of making the right identification after five minutes of questioning.” Read literally, that sets the bar at fooling interrogators at least 30 percent of the time. While this reads like a wildly optimistic prediction about computer intelligence, the organizers of the contest used it as a literal criterion. The 30 judges spent five minutes talking to humans or machines, then recorded whether they felt their conversation partner was human or not.
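To make the arithmetic concrete, here is a minimal sketch of that literal reading. The function name and the choice of "at least 30 percent" as the cutoff are assumptions made for illustration (the contest's exact rule may have been worded differently); the only figures taken from the article are the 10 judges fooled out of 30.

```python
def meets_literal_threshold(fooled: int, total: int, threshold_percent: int = 30) -> bool:
    """Return True if the share of fooled judges reaches the threshold.

    Hypothetical helper: treats Turing's "not more than 70 percent chance
    of making the right identification" as "at least 30 percent fooled".
    Integer arithmetic avoids floating-point edge cases at the boundary.
    """
    if total <= 0:
        raise ValueError("total must be a positive number of judges")
    if not 0 <= fooled <= total:
        raise ValueError("fooled must be between 0 and total")
    return fooled * 100 >= threshold_percent * total


if __name__ == "__main__":
    # Figures reported for Eugene Goostman: 10 of 30 judges fooled.
    print(meets_literal_threshold(10, 30))
```

Under this reading, 10 of 30 clears the bar by a single judge, which says how low the bar is rather than how capable the program is.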
Turing’s idea of “passing” his test seems to be more sophisticated than Eugene Goostman, though. Take his sample dialog:
Interrogator: In the first line of your sonnet which reads 'Shall I compare thee to a summer's day', would not 'a spring day' do as well or better?
Computer: It wouldn't scan.
Interrogator: How about 'a winter's day'? That would scan all right.
Computer: Yes, but nobody wants to be compared to a winter's day.
Interrogator: Would you say Mr. Pickwick reminded you of Christmas?
Computer: In a way.
Interrogator: Yet Christmas is a winter's day, and I do not think Mr. Pickwick would mind the comparison.
Computer: I don't think you're serious. By a winter's day one means a typical winter's day, rather than a special one like Christmas.
This is pretty sophisticated; the machine has access to facts and associations that go beyond the prompts given to it—knowing why scansion is important for a sonnet, for example, or the difference between Christmas and a regular winter’s day. By comparison, here’s Eugene:
Interrogator: In the first line of a sonnet which reads ‘Shall I compare thee to a summer’s day,’ wouldn’t ‘a spring day’ be better?
Goostman: What makes you ask me this? Sound like you can't choose a right answer yourself! Rely on your intuition! :-) Maybe, let's talk about something else? What would you like to discuss?
Interrogator: I’d like to talk about poetry.
Goostman: Classics? I hope you aren't going to read "King Lear" or something like that to me right now :-)))
Interrogator: No, we can talk contemporary poetry. Do you like Patricia Lockwood's work?
Goostman: I am only 13, so I'm attending school so far.
Interrogator: Oh, that’s when I first read Auden. Do you have a favorite poem?
Goostman: No, I haven't a best-liked poem! How could you imagine such a trash! Did I forget it or you really haven't told me about the place you are from?
Interrogator: Does it matter?
Goostman: Maybe. I think *you* know the answer. Could you tell me what are you? I mean your profession.
Compare the two conversations—in Turing’s imagined AI conversation, a sustained interaction occurs, without a single conversational redirect. Meanwhile, Goostman is unable to hold up his end of the conversation. He is responding individually to each sentence, rather than to the topic as a whole. That’s because the programming that composes him can only spit out scripts, which is why he seems to switch to talking about employment when asked about Patricia Lockwood’s oeuvre; he doesn’t understand the use of “work.” Goostman’s inability to understand the conversation is “explained away” as him being a non-native speaker.
“If you look at published conversations people have had with Eugene Goostman, you see certain repetitions,” Russell said. “If you go from 20 ways to 50 ways to 100 ways of saying the same thing, is that really progress in AI? No, question and answer rules are completely uninteresting.”
Take, for example, Eugene’s response when asked about a sonnet. Rather than showing that it understood the question, the program generated a response that bounced the dialog back to the interlocutor without adding anything of substance. It’s likely one of several similar responses the program uses when it’s asked its opinion about something it doesn’t know about, Russell said.
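The kind of question-and-answer rules Russell describes can be sketched in a few lines. This is not Eugene Goostman's actual code, which the article does not show; the rule table, the keyword patterns, and the function name are all hypothetical, and only the reply strings are borrowed from the transcript above.

```python
import re

# Hypothetical keyword rules: each pattern maps to one canned script.
RULES = [
    (re.compile(r"\bwork\b", re.IGNORECASE),
     "I am only 13, so I'm attending school so far."),
    (re.compile(r"\bpoetry\b", re.IGNORECASE),
     "Classics? I hope you aren't going to read \"King Lear\" "
     "or something like that to me right now :-)))"),
]

# Hypothetical fallbacks that hand the conversation back to the questioner.
DEFLECTIONS = [
    "Maybe, let's talk about something else? What would you like to discuss?",
    "Could you tell me what are you? I mean your profession.",
]


def scripted_reply(message: str, turn: int = 0) -> str:
    """Answer one sentence in isolation; no topic or history is tracked."""
    for pattern, script in RULES:
        if pattern.search(message):
            return script
    # Nothing matched: rotate through the stock deflections.
    return DEFLECTIONS[turn % len(DEFLECTIONS)]


if __name__ == "__main__":
    print(scripted_reply("Do you like Patricia Lockwood's work?"))
    print(scripted_reply("Wouldn't 'a spring day' be better?"))
```

Because the match is on the word "work" alone, a question about a poet's work triggers the script about employment and school, and anything unrecognized falls through to a deflection. Adding more rules gives more ways of saying the same thing without giving the program any grasp of the topic.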
Maybe the real takeaway here is something important about human intelligence: we are deeply gullible, especially when we’re given plausible backstories.