How did we get into this mess?
Current AI systems are largely powered by a statistical technique called deep learning, which is very effective at learning correlations, such as correlations between images or sounds and labels. But deep learning struggles when it comes to understanding how objects like sentences relate to their parts (like words and phrases).
Why? It’s missing what linguists call compositionality: a way of constructing the meaning of a complex sentence from the meaning of its parts. For example, in the sentence “The moon is 240,000 miles from the Earth,” the word moon means one specific astronomical object, Earth means another, mile means a unit of distance, 240,000 means a number, and then, by virtue of the way that phrases and sentences work compositionally in English, 240,000 miles means a particular length, and the sentence “The moon is 240,000 miles from the Earth” asserts that the distance between the two heavenly bodies is that particular length.
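To make the idea concrete, here is a minimal sketch of compositional meaning for that one sentence. Everything in it is an illustrative assumption rather than a real semantic theory or an existing library: the toy lexicon and the names `Entity`, `Length`, `DistanceClaim`, `measure_phrase`, and `distance_sentence` are all hypothetical. It shows only that the meaning of the whole is assembled from the meanings of the parts by explicit rules.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entity:
    """A specific object in the world, such as the Moon."""
    name: str


@dataclass(frozen=True)
class Length:
    """A number combined with a unit of distance."""
    value: int
    unit: str


@dataclass(frozen=True)
class DistanceClaim:
    """The assertion that two entities are a given length apart."""
    first: Entity
    second: Entity
    length: Length


# Toy lexicon (hypothetical): each word maps to its own meaning.
LEXICON = {
    "moon": Entity("Moon"),
    "Earth": Entity("Earth"),
    "240,000": 240_000,
    "miles": "mile",
}


def measure_phrase(number_word: str, unit_word: str) -> Length:
    """Rule for phrases like '240,000 miles': number + unit -> length."""
    return Length(LEXICON[number_word], LEXICON[unit_word])


def distance_sentence(subject: str, number_word: str,
                      unit_word: str, obj: str) -> DistanceClaim:
    """Rule for 'The X is N U from the Y': entities + length -> claim."""
    return DistanceClaim(
        LEXICON[subject],
        LEXICON[obj],
        measure_phrase(number_word, unit_word),
    )


claim = distance_sentence("moon", "240,000", "miles", "Earth")
print(claim)
```

Swapping in a different subject, number, or unit changes the resulting claim in a predictable way, because the structure of the sentence, not a learned correlation, determines how the pieces combine.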
Surprisingly, deep learning doesn’t really have any direct way of handling compositionality; it just has information about lots and lots of complex correlations, without any structure. It can learn that dogs have tails and legs, but it doesn’t know how they relate to the life cycle of a dog. Deep learning doesn’t recognize a dog as an animal composed of parts like a head, a tail, and four legs, or even what an animal is, let alone what a head is, and how the concept of head varies across frogs, dogs, and people, different in details yet bearing a common relation to bodies. Nor does deep learning recognize that a sentence like “The moon is 240,000 miles from the Earth” contains phrases that refer to two heavenly bodies and a length.
At the same time, deep learning has no good way to incorporate background knowledge. A system can learn to predict that the words wallet and safe place occur in similar kinds of sentences (“He put his money in the wallet,” “He put his money in a safe place”), but it has no way to relate that to the fact that people like to protect their possessions.
In the language of cognitive psychology, what you do when you read any text is build up a cognitive model of the meaning of what the text is saying. As you read the passage from Farmer Boy, for example, you gradually build up a mental representation—internal to your brain—of all the people, objects, and incidents in the story and the relations among them: Almanzo, the wallet, and Mr. Thompson, and also the events of Almanzo speaking to Mr. Thompson, Mr. Thompson shouting and slapping his pocket, Mr. Thompson snatching the wallet from Almanzo, and so on. It’s only after you’ve read the text and constructed the cognitive model that you do whatever you do with the narrative—answer questions about it, translate it into Russian, illustrate it, or just remember it for later.
Ever since 2013, when DeepMind built a system that played Atari games, often better than humans, without cognitive models (the company was acquired by Google shortly afterward for more than half a billion dollars), cognitive models have gone out of fashion. But what works for games, with their fixed rules and limited options, doesn’t work for reading. The simulated prose of the cognitive-model-free GPT-2 is entertaining, but it’s a far cry from genuine reading comprehension.
That’s because, in the final analysis, statistics are no substitute for real-world understanding. There is a fundamental mismatch between the kind of statistical computation that powers current AI programs and the cognitive-model construction that would be required for systems to actually comprehend what they are trying to read.
We don’t think it is impossible for machines to do better. But mere quantitative improvement—with more data, more layers in our neural networks, and more computers in the networked clusters of powerful machines that run those networks—isn’t going to cut it.