Why isn’t Siri smarter?
AI has accelerated in recent years, especially with deep learning, but current chatbots are an embarrassment. Computers still can’t read or converse intelligently. Their deficiency is disappointing because we want to interact with our world using natural language, and we want computers to read all of those documents out there so they can retrieve the best ones, answer our questions, and summarize what is new. To understand our language, computers need to know our world. They need to be able to answer questions like “Why does it only rain outside?” and “If a book is on a table, and you push the table, what happens?” We humans understand language in a way that is grounded in sensation and action. When someone says the word “chicken,” we map that to our experience with chickens, and we can talk to each other because we have had similar experiences with chickens. This is how computers need to understand language.
There are two paths to building computers with this kind of understanding. The first path is a symbolic one that is traversed by hard-coding our world into computers. To follow the symbolic path, we segment text into meaningless tokens that correspond to words and punctuation. We then manually create representations that assign meanings to these tokens by putting them into groups and creating relationships between the groups. With those representations, we build a model of how the world works, and we ground that model in the manually created representations. The second path is sub-symbolic, which we initially follow by having computers learn from text. This path is synonymous with neural networks (also called deep learning), and it begins with representing words as vectors. It then progresses to representing whole sentences with vectors, and then to using vectors to answer arbitrary questions. To complete this path, we must create algorithms that allow computers to learn from rich sensory experience that is similar to our own.
The Symbolic Path
The symbolic path is a long one. A depressing amount of work in NLP is done without even using the meanings of words. Words are represented by tokens whose only designation is a unique number. This means that the token for “run” and the token for “jog” have different numbers and are therefore as different from each other as the token for “run” is from the token for “Chicago.” Even worse, a sentence is represented not as a sequence of tokens, but as a collection of tokens that doesn’t consider order, called a bag-of-words. This means that “dog bit man” will have the same representation as “man bit dog.”
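To make the loss of word order concrete, here is a minimal sketch of a bag-of-words representation. The helper name bag_of_words and the tiny token_ids vocabulary are assumptions made up for illustration; real systems use larger vocabularies and more careful tokenization.

```python
from collections import Counter

# Hypothetical token IDs: the numbers are arbitrary labels, so nothing
# about them says "run" is closer to "jog" than to "chicago".
token_ids = {"run": 0, "jog": 1, "chicago": 2}


def bag_of_words(sentence):
    """Return word counts for a sentence, discarding word order."""
    return Counter(sentence.lower().split())


# Both sentences contain the same words, so their bags are identical.
print(bag_of_words("dog bit man") == bag_of_words("man bit dog"))  # True
```

Any model built only on these counts cannot tell who bit whom.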
One popular method assigns an index number to each word and represents a document as a vector in which entry i is the number of times word i occurs in the document. This means that if you have a vocabulary size of 50,000, the vector representing a document will have 50,000 dimensions, and most of them will be 0 because their corresponding words are not in the document. The method can get a little fancier by weighting the count of each word by the rarity of that word across all of the documents. This is the classic term frequency-inverse document frequency (tf-idf) method. Using tf-idf, one can find documents similar to a given one; or, if you have a lot of labeled documents, the vectors can be plugged into a regular supervised learning algorithm to label unseen documents.
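The count-then-weight idea can be sketched in a few lines. This is a toy version under stated assumptions: the three-document corpus is invented, and the weighting uses the plain log(N / document frequency) form, which is only one of several tf-idf variants in use.

```python
import math
from collections import Counter

# Hypothetical toy corpus, invented for illustration.
docs = [
    "the dog bit the man",
    "the man walked the dog",
    "stocks fell on monday",
]
tokenized = [d.split() for d in docs]
vocab = sorted({w for doc in tokenized for w in doc})


def count_vector(tokens):
    """Entry i is how many times vocabulary word i occurs in the document."""
    counts = Counter(tokens)
    return [counts[w] for w in vocab]


def idf(word):
    """Rarity weight: words found in fewer documents get larger weights."""
    doc_freq = sum(1 for doc in tokenized if word in doc)
    return math.log(len(tokenized) / doc_freq)


def tfidf_vector(tokens):
    return [c * idf(w) for c, w in zip(count_vector(tokens), vocab)]


def cosine(u, v):
    """Cosine similarity; defined as 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


vectors = [tfidf_vector(t) for t in tokenized]
print(cosine(vectors[0], vectors[1]))  # the two dog-and-man documents
print(cosine(vectors[0], vectors[2]))  # no shared words
```

The two documents about the dog and the man score as similar only because they share tokens, not because the program knows what a dog is.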
A second popular Natural Language Processing (NLP) method that treats words as meaningless tokens is topic modeling. One way to do topic modeling is called Latent Dirichlet Allocation (LDA). LDA starts from a fixed number of topics. Each topic is represented as a distribution over words, and each document is then represented as a distribution over topics. Although the tokens themselves are meaningless, the probability distributions over words provided by the topics provide a sense of the different ideas contained in the documents.
We get the first hint of meaning in sentiment analysis. Sentiment analysis seeks to automatically determine how a writer feels about what he or she has written, and it will tell you, for example, if a reviewer liked your product based on the words used in the review. One common method for sentiment analysis is to give each word a score as being positive or negative. So “happy” and “joyful” would be positive, like 2.0, but “painful” would be negative, like -3.0. Then you sum up the scores of all the words in a given text to find out if the text is positive or negative. “I am happy” would be positive because the word “happy” would have a positive value in the sentiment dictionary. Here there is no real understanding, only table lookups.
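The table-lookup nature of this method is easy to see in code. The sketch below uses the example scores from the paragraph above; the dictionary name SENTIMENT_SCORES and the function sentiment are hypothetical, and a real sentiment lexicon would have thousands of entries.

```python
# Example scores from the text; everything else defaults to 0.0.
SENTIMENT_SCORES = {"happy": 2.0, "joyful": 2.0, "painful": -3.0}


def sentiment(text):
    """Sum the per-word scores; words missing from the table count as 0."""
    words = (w.strip(".,!?") for w in text.lower().split())
    return sum(SENTIMENT_SCORES.get(w, 0.0) for w in words)


print(sentiment("I am happy"))      # 2.0
print(sentiment("I am not happy"))  # 2.0, because "not" is just another token
```

The second result shows the limitation: the lookup has no idea that “not” reverses the meaning.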
The next step along the symbolic path to meaning is to manually construct representations, which means telling the computer what things mean by creating symbols and specifying relationships between symbols. To “understand” text, a computer can then map text to symbols, which allows us to define what the computer should do for this smaller number of symbol configurations. For example, there might be a lot of ways of mentioning that you drive a car in a tweet, but as long as all of those are mapped to the symbol Vehicle, and the symbol Vehicle is specified as being related to Tires, a tire company can know that you are a potential customer.
Let’s take a look at some existing representations. The most famous representation is WordNet. In WordNet, the symbols are groups of words that have the same meaning, called synsets. One synset could be the set consisting of “car” and “automobile.” Each word can be in multiple synsets. For example, “bank” could be in the synset that means river bank, and also in the synset that means a place where money is deposited. There are a few kinds of relationships between synsets, such as has-part, superordinate (more general class), and subordinate (more specific class). For example, the synset containing “automobile” is subordinate to the one containing “motor vehicle” and superordinate to the one containing “ambulance.”
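The structure described above can be mimicked with two small dictionaries. This is a hypothetical miniature built from the examples in the paragraph; the synset names and the helper functions are invented here and are not the real WordNet data or API.

```python
# Hypothetical miniature of a WordNet-style structure (not real WordNet data).
SYNSETS = {
    "car.n": {"car", "automobile"},
    "motor_vehicle.n": {"motor vehicle"},
    "ambulance.n": {"ambulance"},
    "bank.river": {"bank"},
    "bank.money": {"bank"},
}

# Each synset points to its superordinate (more general) synset.
SUPERORDINATE = {"car.n": "motor_vehicle.n", "ambulance.n": "car.n"}


def synsets_for(word):
    """A word can belong to several synsets, one per meaning."""
    return sorted(name for name, words in SYNSETS.items() if word in words)


def subordinates(synset):
    """Synsets that are more specific than the given one."""
    return sorted(c for c, parent in SUPERORDINATE.items() if parent == synset)


print(synsets_for("bank"))    # two meanings, two synsets
print(subordinates("car.n"))  # more specific kinds of car
```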
ConceptNet is a representation that provides commonsense linkages between words. For example, it states that bread is commonly found near toasters. These everyday facts could be useful if you wanted to make a boring chatbot; “Speaking of toasters, you know what you typically find near them? Bread.” But, unfortunately, ConceptNet isn’t organized very well. For instance, it explicitly states that a toaster is related to an automobile. This is true, since they are both machines, but trying to explicitly enumerate everything that is true is hopeless.
Of course, it seems like Wikipedia already has all the information that computers need. We even have a machine-readable form of Wikipedia called DBpedia, and DBpedia and WordNet have been combined into a representation called YAGO (Yet Another Great Ontology). YAGO has good coverage of named entities, such as entertainers, and it was used by Watson to play Jeopardy!, along with other sources. YAGO and DBpedia contain a lot of facts, but they aren’t the basic facts that we learn as young children, and their representations are shallow.
We need deep representations because classification and hierarchy are efficient ways of specifying information. The better your organization, the more power your statements about the world have, and many statements aren’t even necessary, like saying that a toaster is related to an automobile. One representation that organizes concepts down to the lowest level is SUMO (Suggested Upper Merged Ontology). For example, in SUMO, “cooking” is a type of “making” that is a type of “intentional process” that is a type of “process” that is a “physical” thing that is an “entity.”
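The efficiency of hierarchy shows up as soon as we walk the chain. In the sketch below, the parent links come from the “cooking” example above, while the dictionary layout and the function names are assumptions for illustration and not SUMO’s actual format.

```python
# Parent links from the "cooking" example; the dict is a hypothetical
# stand-in for a real ontology file.
IS_A = {
    "cooking": "making",
    "making": "intentional process",
    "intentional process": "process",
    "process": "physical",
    "physical": "entity",
}


def ancestors(concept):
    """Follow parent links upward until the top of the hierarchy."""
    chain = []
    while concept in IS_A:
        concept = IS_A[concept]
        chain.append(concept)
    return chain


def is_a(concept, category):
    """True if category appears anywhere above concept."""
    return category in ancestors(concept)


print(ancestors("cooking"))
print(is_a("cooking", "process"))
```

Anything we state once about “process” now applies to cooking without being written down again.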
SUMO specifies an ontology, explicitly laying out what is, but for machines to really understand us they need to share our experiences. SUMO doesn’t have anything to say on what it is like to experience the world as a human. Fortunately, common experiences, like making a purchase, can be represented by FrameNet. In FrameNet, the frame for a purchase transaction specifies roles for the seller, the buyer, and the thing being purchased. Even more fundamental to the human experience are image schemas. We use image schemas to comprehend spatial arrangements and physics, such as path, containment, blockage, and attraction.
Abstract concepts such as romantic relationships and social organizations are represented as metaphors to this kind of experience. Unfortunately, there currently is no robust implementation of image schemas. Representations for image schemas need to be built, and then we have to merge all of our manually created representations together. There has been some work doing this merging. There is a representation called YAGO-SUMO that merges the low-level organization of SUMO with the instance information of YAGO. This is a good start, especially since YAGO also builds on WordNet.
These representations enable computers to dissect the world and identify situations, but with the possible exception of the poorly organized ConceptNet, they don’t specify at the most basic level how the world is or how it changes. They don’t say anything about cats chasing mice or about flicking light switches to illuminate rooms. Computers need to know all of these things to understand us because our language evolved not to describe our world as it is but rather to communicate only what the listener does not already know.
To reach artificial intelligence by following the symbolic path, we must create a robust world model built on a merged version of these manual representations. The longstanding project Cyc has a large model that uses representations, but it is built on logic, and it is not clear whether logic is sufficiently supple. We will know that we have succeeded when we can ask a computer, “Why does it only rain outside?” and it responds, “A roof blocks the path of things from above.” Or when we ask, “Explain your conception of conductivity,” and it says, “Electrons are little spheres, and electricity is little spheres going through a tube. Conductivity is how freely the spheres can move through the tube.” These answers would be from first principles, and they would show that the computer could combine them flexibly enough to have a real conversation.
The Sub-Symbolic Path
The sub-symbolic path begins with assigning each word a long sequence of numbers in the form of a vector. Word vectors are useful because you can calculate the distance between them. The word vector for the word “run” will be pretty close to the word vector for the word “jog,” but both of those word vectors will be far from the vector for “Chicago.” This is a big improvement over symbols, where all we can say is that going for a run is not the same as going for a jog, and neither is the same as Chicago.
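Here is a minimal sketch of what “distance between word vectors” means, using cosine similarity. The three-dimensional vectors are made up by hand for illustration; learned vectors would have hundreds of dimensions and come from training on text.

```python
import math

# Hypothetical hand-picked vectors; real word vectors are learned.
vectors = {
    "run": [0.9, 0.8, 0.1],
    "jog": [0.8, 0.9, 0.2],
    "Chicago": [0.1, 0.0, 0.9],
}


def cosine_similarity(u, v):
    """Near 1.0 when vectors point the same way, near 0.0 when unrelated."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


print(cosine_similarity(vectors["run"], vectors["jog"]))      # high
print(cosine_similarity(vectors["run"], vectors["Chicago"]))  # low
```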
The word vector for each word has the same dimension. The dimension is usually around 300, and unlike tf-idf document vectors, word vectors are dense, meaning that most values are not 0. To learn the word vectors, the Skip-gram algorithm first initializes each word vector to a random value. It then essentially loops over each word w1 in all of the documents, and for each word w2 around word w1, it pushes the vectors for w1 and w2 closer together, while simultaneously pushing the vector for w1 and the vectors for all other words farther apart.
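The push-together, push-apart loop can be sketched as follows. This is a simplified version under stated assumptions: it uses negative sampling, which pushes away a few randomly chosen words per step rather than all other words, and the tiny corpus, the 8-dimensional vectors, the learning rate, and all function names are invented for illustration.

```python
import math
import random

random.seed(0)

# Hypothetical toy corpus; real training uses millions of sentences.
corpus = "i run every morning . i jog every morning . chicago is a city".split()
vocab = sorted(set(corpus))
dim = 8  # real word vectors usually have around 300 dimensions

# Each word gets a vector as a center word and another as a context word.
w_in = {w: [random.uniform(-0.5, 0.5) for _ in range(dim)] for w in vocab}
w_out = {w: [random.uniform(-0.5, 0.5) for _ in range(dim)] for w in vocab}


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def update(center, other, label, lr=0.05):
    """One gradient step: label 1 pulls the pair together, 0 pushes it apart."""
    v, u = w_in[center], w_out[other]
    grad = lr * (sigmoid(dot(v, u)) - label)
    for k in range(dim):
        v_k = v[k]  # keep the old value so both updates use it
        v[k] -= grad * u[k]
        u[k] -= grad * v_k


def train(epochs=50, window=2, negatives=2):
    for _epoch in range(epochs):
        for i, center in enumerate(corpus):
            lo, hi = max(0, i - window), min(len(corpus), i + window + 1)
            for j in range(lo, hi):
                if j == i:
                    continue
                update(center, corpus[j], 1)  # real neighbor: pull closer
                for _neg in range(negatives):
                    # Random word (may occasionally be a true neighbor): push away.
                    update(center, random.choice(vocab), 0)


train()
print(dot(w_in["run"], w_out["every"]))  # score for a pair seen together
```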
The quote we often see associated with word vectors is “You shall know a word by the company it keeps” by J. R. Firth (1957). This seems to be true, at least to a degree, because word vectors have surprising internal structure. For example, the classic result is that if you subtract the word vector for “Rome” from the word vector for “Italy,” you get something very similar to what you get when you subtract the word vector for “Paris” from the word vector for “France.” This internal structure is impressive, but these vectors are not grounded in experience in the world; they are only grounded in being around other words. As we saw previously, people only express what the reader does not know, so the amount of world knowledge that can be contained in these vectors is limited.
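The country-and-capital arithmetic can be illustrated with a toy example. The vectors below are hand-built so that the offsets match exactly, which is an assumption made for clarity; with learned vectors the match is only approximate. The function names are hypothetical.

```python
import math

# Hypothetical vectors built by hand so country minus capital is constant.
vectors = {
    "Italy": [1.0, 0.0, 1.0],
    "Rome": [1.0, 0.0, 0.0],
    "France": [0.0, 1.0, 1.0],
    "Paris": [0.0, 1.0, 0.0],
    "Chicago": [0.5, 0.5, 0.0],
}


def cosine(u, v):
    d = sum(a * b for a, b in zip(u, v))
    return d / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def offset(a, b):
    """Vector for word a minus vector for word b."""
    return [x - y for x, y in zip(vectors[a], vectors[b])]


def complete_analogy(a, b, c):
    """Find the word d for which a - b is closest to c - d."""
    target = [x - y for x, y in zip(vectors[c], offset(a, b))]
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(vectors[w], target))


print(offset("Italy", "Rome") == offset("France", "Paris"))
print(complete_analogy("Italy", "Rome", "France"))
```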
Just as we can encode words into vectors, we can encode whole sentences into vectors. This encoding is done using a recurrent neural network (RNN). An RNN takes a vector representing its last state and a word vector representing the next word in a sentence, and it produces a new vector representing its new state. It can keep doing this until the end of the sentence, and the last state represents the encoding of the entire sentence.
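The state-update loop described above can be sketched with a plain tanh recurrence. Everything here is an assumption for illustration: the weights and word vectors are random and untrained, the sizes are tiny, and practical encoders typically use gated units such as LSTMs or GRUs.

```python
import math
import random

random.seed(0)
STATE, EMBED = 4, 3  # hypothetical sizes, far smaller than real models


def rand_matrix(rows, cols):
    return [[random.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


# Untrained random parameters; training would adjust these.
W_state = rand_matrix(STATE, STATE)
W_input = rand_matrix(STATE, EMBED)
word_vectors = {
    w: [random.uniform(-1.0, 1.0) for _ in range(EMBED)]
    for w in ["dog", "bit", "man"]
}


def matvec(matrix, vec):
    return [sum(a * b for a, b in zip(row, vec)) for row in matrix]


def rnn_step(state, word_vec):
    """Combine the previous state with the next word to get the new state."""
    pre = [s + x for s, x in zip(matvec(W_state, state), matvec(W_input, word_vec))]
    return [math.tanh(p) for p in pre]


def encode(sentence):
    """Run the RNN over the words; the final state encodes the sentence."""
    state = [0.0] * STATE
    for word in sentence.split():
        state = rnn_step(state, word_vectors[word])
    return state


print(encode("dog bit man"))
print(encode("man bit dog"))
```

Unlike a bag-of-words, the two sentences get different encodings because each state depends on the order in which the words arrived.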
A sentence encoded into a vector using an encoder RNN can then be decoded into a different sentence. This decoding uses another RNN and goes in reverse. This decoder RNN begins with its state vector being the last state vector of the encoder RNN, and it produces the first word of the new sentence. The decoder RNN then takes that produced word and the RNN vector itself as input and returns a new RNN vector, which allows it to produce the next word. This process can continue until a special stop symbol is produced, such as a period.
These encoder-decoder (sequence-to-sequence) models are trained on a corpus consisting of source sentences and their associated target sentences, such as sentences in English and their corresponding translations into Spanish. These sentences are run through the model until it learns the underlying patterns. One current application for these models is machine translation; this is how Google Translate works. The general method also works for many other kinds of problems. For example, if you can encode an image using a neural network (such as a convolutional neural network) into a vector, and if you have enough training data, you can automatically generate captions for images the model has never seen before.
These encoder-decoder models work on many kinds of sequences, but this generality highlights their limitation for use as language understanding agents, such as chatbots. Noam Chomsky proposed that the human brain contains a specialized universal grammar that allows us to learn our native language. In conversations, sequences of words are just the tip of the meaning iceberg, and it is unlikely that a general method run over the surface words in communication could capture the depth of language. Language allows for infinite combinations of concepts, and any training set, no matter how large, will represent only a finite subset.
Beyond the fact that the sequence-to-sequence models are too general to fully capture language, you may wonder how a fixed-length vector can store a growing amount of information as the model moves from word to word. This is a real problem, and it was partially solved by adding attention to the model. During decoding, for example when doing language translation, before outputting the next word of the target sentence, the model can look back over all of the encoded states of the source sentence to help it determine what the next word should be. The model learns what kinds of information it needs in different situations. It treats the encodings of the input sentence as a kind of memory.
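The “look back over the encoded states” step can be sketched as a weighted average. This version assumes simple dot-product scoring, which is a simplification; attention models used in translation learn their scoring function during training. The state values and function names are invented for illustration.

```python
import math


def softmax(scores):
    """Turn arbitrary scores into positive weights that sum to 1."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, encoder_states):
    """Weight each encoder state by how well it matches the query."""
    scores = [sum(q * h for q, h in zip(query, state)) for state in encoder_states]
    weights = softmax(scores)
    size = len(encoder_states[0])
    context = [
        sum(w * state[k] for w, state in zip(weights, encoder_states))
        for k in range(size)
    ]
    return weights, context


# Hypothetical encoder states, one per source word, and a decoder state.
encoder_states = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]
decoder_state = [0.0, 5.0]

weights, context = attend(decoder_state, encoder_states)
print(weights)  # the second source word gets the most attention
print(context)  # the blended "memory" handed to the decoder
```

Because the context vector is recomputed at every decoding step, the model no longer has to squeeze the whole source sentence into one fixed-length vector.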
This attention mechanism can be generalized to enable computers to answer questions based on text. The question itself can be converted into a vector, and instead of looking back through words in the source sentence, the neural network can look at encoded versions of the facts it has seen, and it can find the best facts that will help it answer the current question.
Current question-answering algorithms are trained using generated stories. A story might be like, “Bob went home. Tim went to the junkyard. Bob picked up the jar. Bob went to town.” A question for this story might be “Where is the jar?” The network can learn to say that “town” is the answer, but there is no larger understanding. The network is learning linkages between sequences of symbols, but these kinds of stories do not have sufficiently rich linkages to our world. It won’t help to have these algorithms learn on real stories because, as we have seen, we communicate only what isn’t already covered by shared experience. In a romance novel, you will never read, “In his passion for her, he shoved the table between them aside, and all of the objects on the table moved because gravity was pushing them down, creating friction between the bottoms of the objects and the support surface of the table.”
To train machines so that they can talk with us, we need to immerse them in an environment that is like our own. It can’t just be dialog. When we say “chicken,” we need the machine to have had as much experience with chickens as possible, because to us a chicken isn’t just a bird, it’s everything one can do with it and everything it represents in our culture. There has been work in this direction. For example, OpenAI now allows us to train machines by playing video games, and as our virtual worlds become more like our real one, this kind of training will be increasingly useful. In addition to producing ever-better virtual worlds, computers are spending more time in our physical one. Amazon Alexa is listening attentively in many of our homes. Imagine if she had a camera and a rotating head. Could she watch our eyes and our actions to learn a partially grounded understanding of our lives?
We have seen two potential progressions from natural language processing to artificial intelligence. For the symbolic path, we need to build world models based on deep and organized representations. Success on this path requires that the models we build be comprehensive and flexible. For the sub-symbolic path, we need to train large neural networks in an environment with objects, relationships, and dynamics similar to our own. On this path, the learning agent must be able to perceive this environment at a low enough level that the outlines of how we humans experience our environment are visible. Along either path, we can see that Searle’s Chinese room scenario, where a person who doesn’t know Chinese can maintain written conversations in Mandarin by looking up exactly what to do for each stroke of each character, isn’t really possible. To have a real conversation, comprehension needs to ground out in shared first principles, and when a computer can do that, it will understand as well as you or I.