In this post I argue for a radically different approach to letting machines learn to fully understand a natural language, as I believe the current exploitation of purely text-based methods is reaching a ceiling and will not yield full language comprehension. Instead, I propose gamifying language in an interactive, multi-sensory virtual game.
2024 update: in the era of large language models, I still believe in what I wrote in this post in early 2020. In short, human language is a reflection of our conscious thoughts and feelings. LLMs trained on just one dimension of language therefore could never grasp the intricacies of human experience that mere words reflect. LLMs do not actually feel what they are "saying".
The poet appears between strumming of strings
In fractions of time, syncopated delights
Perceived to caress my ears
My love for poetry was sparked
Upon listening to a song:
"Little Fly" by Esperanza Spalding
On William Blake's famous poem
I thought it pretty and I thought it bold
To merge those art forms naturally
Soon enough I started reading
Dickinson, Pushkin, relentlessly
Though my rhyme may not be as eloquent
Or as stirring as Tatyana's letter
It surely pleases the mind -
To concentrate a conscious thought
Into cryptic little lines
Thus, may I share with you
A recent thought or two...
A major domain in artificial intelligence research concerns the study of human, "natural" language comprehension by computers. Machine learning techniques referred to as natural language processing (NLP) have been shown to successfully automate various language processing tasks. Think of translation services such as Google Translate, spam e-mail filtering in Outlook and Amazon Alexa's text-to-speech. How computers can learn to truly understand natural language the way we humans do, however, remains an open question.
Why we need a new approach to artificial natural language processing
Alan Turing described a paradigm to test "a machine's ability to exhibit intelligent behaviour equivalent to, or indistinguishable from, that of a human". It poses that once a machine can trick a human into believing they are interacting with another human being, the machine passes the test. Depending on how stringent one's testing criteria are, the chatbot ELIZA has successfully led some people to believe they were interacting with another person and could therefore be said to have passed the Turing test.
Despite various critiques on the validity of this test - does passing it really indicate machine intelligence? - it remains a relevant paradigm to think about how to test machines on their level of intelligence, as manifested through language.
Although great efforts are put into, for instance, creating chatbots that respond as humanly as possible, we typically find it relatively easy to tell whether we are chatting with a real person or not. The test becomes more challenging if we could somehow collect all the ways in which people have ever asked and responded to every question ever posed in English, program these into a chatbot machine and let human interactors talk with the machine in disguise. It then becomes likely that at least a significant proportion of these human interactors would be unable to tell the machine from another human. Thus, it would pass the Turing test just like ELIZA and beg the question: has this machine mastered the English language to at least a human level of understanding?
Apart from the practical inefficiencies of storing such an infinitely growing body of data, we cannot guarantee that this machine has truly grasped the meaning that each response represents, just because it "learned" what the sensible responses would be in all possible interactions. It is like teaching a toddler to say "thank you" when they receive a gift from someone else, while the child is not yet able to comprehend what it means to be grateful.
Current machine learning approaches such as tf-idf, n-grams, word2vec and BERT learn language patterns from the frequencies and order with which words or letter groups appear in each other's proximity in plain text. These techniques have proven useful, for instance to automatically find semantic clusters to sort documents and categorise e-mails. However, these machine-generated language models are limited by their training purpose and the scope of their training data, and do not generalise to a unified language machine that can use and interpret language the way we do.
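To illustrate what "learning from proximity" means, here is a minimal, self-contained sketch of a distributional approach in the spirit of word2vec. The toy corpus and window size are my own illustrative assumptions, not any specific library's algorithm:

```python
from collections import Counter, defaultdict
import math

# Toy corpus (illustrative assumption): distributional methods infer
# meaning purely from which words co-occur within a context window.
corpus = [
    "she is a bright student who answers quickly",
    "she is a clever student who answers quickly",
    "the bright lamp lit the dark room",
]

window = 2  # how many neighbouring words count as "context"
vectors = defaultdict(Counter)
for sentence in corpus:
    words = sentence.split()
    for i, w in enumerate(words):
        for j in range(max(0, i - window), min(len(words), i + window + 1)):
            if i != j:
                vectors[w][words[j]] += 1  # count co-occurrence

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb)

# "bright" and "clever" share contexts, so their vectors come out similar.
print(cosine(vectors["bright"], vectors["clever"]))
print(cosine(vectors["bright"], vectors["lamp"]))
```

The similarity between "bright" and "clever" emerges purely from shared contexts; nothing in the vectors encodes that "bright" also relates to light, which is precisely the connotation layer discussed further on.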
For example, try training an NLP classifier on a collection of poems to find the exact topic each poem concerns. It will most likely fail, because poems play on deeper semantic properties of language that surface patterns do not capture.

Why natural language is so difficult to learn
What makes language comprehension difficult is that it is highly context-dependent. The same word can have completely different functional meanings in different sentences (homonyms). Conversely, different words can mean approximately the same thing (synonyms). Approximately, because there is a deeper layer of semantic granularity that is very hard for machines to grasp: the connotations of words.
"an idea or feeling which a word invokes for a person in addition to its literal or primary meaning"
When we have a pair of possible synonyms like "bright" and "clever", their common functional meaning is to describe something or someone that we consider "smart" or "intelligent". The word "bright", however, can also describe something that seems to be significantly lit, light-coloured, clear or shiny. "Clever", in contrast, is not directly associated with light in any way and has a completely different etymology, stemming from an old English description for something that is skilfully or dexterously made. These "word histories" and secondary associative meanings contribute to the connotations of words.
Certain words also change functional meaning over time, as seen with slang adjectives that describe something positive: "cool", "dope", "sick" and "lit". Whereas the original, literal meaning of these words may have neutral or even negative connotations, people started using them to mean something different, which shifts their connotations to positive ones when used in the appropriate situation.
To make it even more complex, connotations of the same words can be different for everyone, because they are informed by personal life experiences. For example, the noun "rose" may be associated with the colour red, as in the familiar "roses are red". One may even think of the flower's soft and tender petals that one can pluck to form a trail in a romantic setting. These associations nudge feelings of romance and warmth. Someone else, however, may have pricked their finger on a thorn earlier today when they picked up a rose, making the connotation of the word "rose" a negative one for them. Connotations, thus, can be regarded as highly dynamic layers of meanings of words, as they easily change over time and place of usage.
It takes creativity and sensitivity to use them well. Hence, people who particularly excel at playing with these deeper layers of semantics are the true masters of language: poets, writers, comedians, public speakers and successful politicians.

Should we start building super-mega-meta-hyper "language" models?
One may suggest that to teach machines to understand natural language, we need to build more complex algorithms that additionally learn all the possible primary, secondary and n-th meanings and connotations of words, and all possible relations between these meanings, as found in the English language. By building this extensive language model of the world from English character patterns, such a machine, we may assume, should be able to classify the meaning of any English expression more accurately. In the Turing test, it would recognize precisely what the human interactor's input means and return a more sensible output... Right?
Imagine we had such a machine: would it attest to a truly human level of comprehension of the English language? Might it go beyond human understanding? Would exhaustive semantic knowledge of the English language be sufficient to engage in any English conversation? Although it may come quite far in learning patterns in language, I still do not expect it to have human-level understanding. To see why, we need to think about language on a more fundamental level: what is language and what is it for?
Why language exists
Language forms a highly multi-dimensional, conscious representation of our cognitive worlds. It is a reflection of our internal mental space through which all our sensory sensations from worldly objects can converge to express our needs, desires, emotions, thoughts and imagination. Why do we have it? Because we are social organisms that survive by the cooperative manipulation of our environment, for which non-chemical interactions proved to be evolutionarily advantageous. Having our own species-specific language that does not rely on pheromones or other biochemical signalling is like a next-level secret communication system that other organisms can never grasp. We simply assign everything in the world a unique symbolic sound and visual pattern.
Language is so essential to human life that children have an age window during which they are optimally receptive to learning a language via what Chomsky calls the "language acquisition device" in our brains. It is as if, together with the dimensionality enabled by our senses, it was inevitable for natural language to develop. Without it, human consciousness as we know it could not exist.
When words of a language are like probabilistic rules of an interaction system
When we regard language as an interaction system in which the functional meaning of words determines the system's probabilistic rules, we may need to teach NLP machines what those rules are derived from in the first place to gain a deeper language understanding. Instead of letting the machine depend on massive amounts of text data to compute an accurate semantic interpretation, we would let the machine learn how to learn a language by itself, making it akin to a language acquisition device.
Following this system-with-rules analogy, rules are primarily defined by the sensory perceptions of conscious entities that interact with objects in space and time through any of the five sensory modalities. The secondary rule definitions come from the range of emotional arousal these objects can typically invoke after primary perceptual processing, which informs the rule's connotations.
Using the noun "rose" again as an example, we can break down the primary and secondary conditions for this rule to be applied when an object is of the class "rose".
1. Vision (eyes): size of the object is typically big enough to be caught by the naked eye, petals and petal folding always appear in a particular shape in various possible bright colour variations
2. Sound (ears): silent, a rose on itself does not produce any perceptible sound, but when manipulated to break it may elicit a soft cracking sound
3. Touch (skin): soft petals but a sturdy stem with sharp thorns; light-weight; fingers usually in a tweezer grip to reach for and hold one
4. Smell (nose): geraniol perceived as its characteristic chemical component
5. Taste (tongue): surely, one could eat certain types, leaving just a slightly bitter taste in the mouth
6. Affect (emotion): when ripe petal colours appear, humans often find them beautiful; its smell is generally found to be pleasant; may induce romantic feelings, as it is often related to romantic settings in Western culture

"Rose", as a linguistic rule in the English system now represents this particular combination of sensory-affective experiences from the physical object. We could theoretically continue to make this conditional breakdown for all existing rules, where rules of the type nouns typically represent objects, verbs often represent or imply time or motion for the object, adverbs represent relations with objects and those of the type adjectives describe intrinsic features or consequences of objects.
Two fundamental problems arise when we make the case for a self-sufficient language-learning machine with this approach. First, computers do not experience sensory dimensions the way living organisms do; second, they have no intrinsic motivation to communicate in a natural language, whether with human beings or with other artificial machines. This makes them unable to grasp what we humans want, feel and think of when we practise language. The question now becomes: how can we deal with these issues?
Virtual gaming for machines to learn natural language
To ultimately teach a machine to comprehend natural language, we may need to find a way to make computers understand the four dimensions of space and time in which objects, including itself, exist. Because, presumably, an intelligent artificial system able to fully comprehend natural language should at a minimum be aware of its environment.
To be aware of one's environment, one needs a way of sensing what is in it before being able to perceive it. Once it can perceive, it can develop the want to interact with its environment as well - much like the development of any biological organism. Once equipped with a predisposed ability to make symbolic representations of the environment, language may naturally evolve with it. Remember, for example, that young children need to explore objects in the world by themselves, preferably with all possible senses. Later, they learn the symbolic sound and visual rule (word) to refer to the object they explored when they want to interact with or about it again.
One way of making objects from our physical world machine-interpretable would be to create 3D-simulations of the world in a virtual environment that a computer agent can interact in. We would for instance have a virtual three-dimensional "rose" with sensory properties as described above to be "discovered" through actions in this virtual space. We could even simulate this rose to grow and perish over time and according to laws of physics show the consequences of interactions between the rose and the computer agent in this virtual environment. The machine would need to learn, for instance, what happens to an object when it is dropped or when it is picked up in different ways, before it can understand verbs and, later, adjectives. Similar virtual environments have already been built to a certain extent for the purpose of video games and virtual reality.

Reinforcement learning, a class of machine learning techniques that has proven useful in recent AI research breakthroughs, lets machines learn through reward functions over a finite set of actions they can take in a given environment. This would serve as an interesting paradigm for devising a virtual environment in which learning language is made into a multi-sensory game, where the computer agent "plays" a human body and levels up every time it has learned a new class of rules (e.g. flowers, furniture, food, up to abstract objects such as freedom and humanity) through correct sensory deduction.
The game will also need a "supervisor", akin to a parent for a child, to inform the player what the correct symbolic pattern is to label each environmental object at the beginning of each level. This would make the game the ultimate semi-supervised reinforcement learning experiment to develop an artificial language acquisition device.
As a starting point, the range of actions it could take could be to manipulate objects in this space in all possible ways with each of the sensory modalities - e.g. to view, smell, listen to, touch, grab, taste and crush the object it encounters.
The reward goal could be defined as "matching the correct sensory experience (how did the object look, sound, taste, smell and feel when interacting with it) to the appropriate rule (word, in sound and character pattern) to classify each object (noun), motion (verb), relation (adverb) and consequence (adjective) it encounters".
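One training episode of such a game could be sketched as below. Everything here is an illustrative assumption of mine (the object properties, action names, tiny matching policy and binary reward), not a specification of the proposed system:

```python
import random

# Toy "virtual environment": two objects with hypothetical sensory
# properties the agent can discover by probing them.
OBJECTS = {
    "rose":  {"smell": "geraniol", "touch": "soft petals, thorny stem"},
    "stone": {"smell": "none",     "touch": "hard, cold surface"},
}
ACTIONS = ["view", "smell", "touch", "taste", "listen"]

def play_episode(true_word, guess_policy, n_probes=3):
    """Probe the object through sensory actions, then guess its label.
    The 'supervisor' returns reward 1 for a correct match, else 0."""
    experience = {}
    for _ in range(n_probes):
        action = random.choice(ACTIONS)
        experience[action] = OBJECTS[true_word].get(action, "nothing")
    guess = guess_policy(experience)
    return 1 if guess == true_word else 0

def nearest_match(experience):
    """Toy policy: guess the word whose known properties overlap most
    with the sensory experience gathered this episode."""
    def overlap(word):
        props = OBJECTS[word]
        return sum(1 for k, v in experience.items() if props.get(k) == v)
    return max(OBJECTS, key=overlap)

reward = play_episode("rose", nearest_match)
```

A real agent would of course learn the policy from accumulated rewards rather than use a hand-written matcher; the sketch only shows the episode structure: sensory probing, labelling, supervisor feedback.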
Gradually, as my hypothesis goes, this computer will learn language more like an actual biological human organism: by gaining a sensory-affective understanding of what objects in the world actually are like and what rules to use to interact about them.
After training in virtual environments comes testing with real humans
Once the machine has observed all possible objects, its actual natural language understanding needs to be tested with real humans. We can let it undergo the Turing test on any conversational topic and set an extremely stringent testing criterion: no human interactor at all can differentiate the machine from the human. An additional way is to see if this machine can use language to generate sensible, authentic expressions - for instance, in the form of poetry, jokes or an essay on a phenomenon it recently observed.
