Like Gen-Z and TikTok or Donald Trump and his Diet Coke button, humans use analogies a lot. Beyond being an annoying part of every SAT, analogies actually help us solve problems and even learn new skills by connecting them with concepts we’re already familiar with. For example, you might tell a child learning to ride a bike for the first time that it’s a lot like balancing on a seesaw.
For a while, it was thought that only humans used analogical reasoning to solve problems. However, new research from psychologists at the University of California, Los Angeles has found that AI chatbots also have the ability to use analogies, much like humans do.
The team published a study Monday in the journal Nature Human Behaviour that found that OpenAI’s large language model GPT-3 performed as well as college students when asked to solve analogy problems like those found on the SAT. The LLM even outperformed the study participants on occasion, suggesting it may surpass us at what is considered a hallmark of human intelligence.
“Language learning models are just trying to do word prediction so we’re surprised they can do reasoning,” senior author Hongjing Lu, a psychology professor at UCLA, said in a press release. “Over the past two years, the technology has taken a big jump from its previous incarnations.”
The study’s authors attempted to test GPT-3 using prompts that weren’t part of its training data. However, they also conceded that this was more of an educated guessing game, because OpenAI has kept the contents of that training data largely secret from the public (much to the chagrin of many tech ethicists).
In one set of tests, for example, the team used problems based on Raven’s Progressive Matrices, a test that asks takers to predict the next image in a series of shapes. The authors converted the images into a text format so that GPT-3 could “see” them.
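To give a sense of what that conversion might look like, here is a minimal, hypothetical sketch of rendering a shape-matrix puzzle as plain text for a text-only model. It illustrates the general idea rather than the encoding the UCLA team actually used.

```python
# Hypothetical sketch: turning a 3x3 matrix-reasoning puzzle into plain text
# so a text-only model can "see" it. Illustrative only; not the study's encoding.

puzzle = [
    ["circle", "square", "triangle"],
    ["square", "triangle", "circle"],
    ["triangle", "circle", "?"],  # the model must fill in the missing cell
]

def matrix_to_prompt(grid):
    """Render the grid row by row as text and ask for the missing cell."""
    lines = ["Each row follows the same rule. Complete the final cell."]
    for row in grid:
        lines.append(" | ".join(row))
    lines.append("Answer:")
    return "\n".join(lines)

print(matrix_to_prompt(puzzle))
```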
This test was then given to both GPT-3 and 40 undergraduate UCLA students. The bot correctly solved 80 percent of the problems, while the human participants had an average score of 60 percent. However, the highest human scores were within the range of the GPT-3 scores.
“Surprisingly, not only did GPT-3 do about as well as humans but it made similar mistakes as well,” Lu said.
The researchers also gave the bot a set of SAT analogy questions that they believed had never been published on the internet, meaning they were unlikely to have been part of GPT-3’s training data. After comparing the bot’s results with the SAT scores of actual college applicants, they found that it performed better than the humans.
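For readers curious how such a question might be put to GPT-3 in practice, here is a hedged sketch using the legacy openai-python (pre-1.0) Completions interface. The study’s actual prompts, model version, and settings are not public, and the analogy shown is a stand-in example, not one of the unpublished test items.

```python
# Hedged sketch: posing an SAT-style analogy to a GPT-3-family model via the
# legacy OpenAI Completions API (openai-python < 1.0). Illustrative only.
import openai

openai.api_key = "YOUR_API_KEY"  # placeholder

prompt = (
    "Complete the analogy with one word.\n"
    "ARTERY is to BLOOD as HIGHWAY is to"
)

response = openai.Completion.create(
    model="text-davinci-003",  # a GPT-3-family model; the study's exact model may differ
    prompt=prompt,
    max_tokens=5,
    temperature=0,             # deterministic output for evaluation
)

print(response["choices"][0]["text"].strip())  # e.g. "TRAFFIC"
```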
It’s important to remember that this doesn’t necessarily mean that the bots are smarter than humans, or even that they are showing signs of human-level intelligence and reasoning. Keep in mind: these are language models trained on massive datasets, including text scraped from huge swaths of the internet. While it might look impressive, the model is merely performing a set of actions that we trained it to do.
“GPT-3 might be kind of thinking like a human,” co-author Keith Holyoak, a psychology professor at UCLA, said in a press release. “But on the other hand, people did not learn by ingesting the entire internet, so the training method is completely different.”
However, he added that the team wants to find out “if it’s really doing it the way people do, or if it’s something brand new—a real artificial intelligence—which would be amazing in its own right.”