The Algebra of Semantics: A Linguistic Analysis of NLP
Author(s): Lauro Oliveira
Originally published on Towards AI.
Here we will explore the intersection of linguistics and information theory to understand the mechanisms behind natural language processing (NLP). By examining how words carry meaning, drawing inspiration from mathematical approaches to semantics, and analyzing the boundaries and ranges of meaning in language, we aim to show how these principles support the creation of Large Language Models (LLMs). This post serves as a guide to understanding the linguistic structures that drive modern NLP systems, offering insights into how these structures can be used to improve their capabilities.
Introduction
Linguistics concerns something no human being lives without: language. But what exactly is it?
Keith Allan presents linguistics as the study of the human ability to produce and interpret language, whether by speaking, reading, writing, or signing (used mostly by the deaf community). A language is the way content is encoded for transmission; it can be expressed in many varieties, each composed of the relations and structures among its components.
As many historians hypothesize, language arose as a solution to the problem of communicating with others. Communication itself is not exclusive to humans: many other living beings have communication systems of their own, though human language is arguably the most sophisticated among earthly life forms. Most beings communicate in response to interactions with their environment, and a subset communicate intentionally but in very limited ways (bees, for example, communicate by dancing). Humans, by contrast, communicate intentionally and freely, with infinite possibilities and forms.
The Meaning
Language structure can be decomposed into sentences, phrases, words, letters, phonemes, sounds, and so on. But above all, language is a bearer of meaning. Not all types of information can be expressed in language (proprioceptive sensations, for example), but almost all can be approximated by selecting the right words and composing them according to the language's constraints. This leads to the following question: how do words carry meaning?
Zellig Harris takes a mathematical approach to answering this question. Before there was a language as such, there was a vocabulary of isolated terms called words. Every word has a single, more or less continuous range of meanings and semantic properties, but the distinctions and boundaries of that meaning-range are complicated and vague. Doesn't that remind you of something in the NLP world?
So what does Harris mean by the boundaries of a meaning-range? Take the word "set" as an example. If you know a little English, reading this word probably brings to mind all the meanings "set" can assume; the inverse is also possible, and you can think of all the meanings "set" cannot assume. A sentence like "I died of set" will strike you as weird.
Words can also broaden or specialize their meaning-range through their variations (morphological alterations) or through interactions with the environment (time, space, context, and so on). We can see this as the word carrying prior information, its initial base meaning-range, plus posterior information, the meaning-range given by some modification.
For instance, take the word "Facebook". Today we know for sure which meanings this word can assume, whereas for a 12th-century Templar knight the meaning of "Facebook" would be very restricted.
What do classical NLP embedding techniques try to do?
In 2013, a Google paper proposed a new way to represent words computationally: as vectors whose values are trained from large datasets, treating each word as an atomic unit. This efficient solution changed the NLP world for a long time. What the embedding representation of words actually tries to do is define a fixed space in which each word's vector points in the direction of its meaning; isn't that very similar to the boundaries of the meaning-range? There are also words with a very large meaning-range, words that are not very informative; in embedding space, these are the words least similar to all the other words in the vocabulary.
Vector embeddings try to explicitly encode the meaning-range of words. Classical models such as CBOW and Skip-gram learn the best vector values according to the contexts in which each word occurs.
But wait, we said before that even a single word has its own meaning-range, so why do we need context to learn the best representation? By reading almost every possible usage of a word across the examples in a dataset, the embedding converges to the representation that best fits the word in every occurrence; this represents the "average meaning spectrum" of the word.
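To make this concrete, here is a minimal sketch using the gensim library; the toy corpus and hyperparameters are illustrative, not a real training setup:

```python
from gensim.models import Word2Vec

# A toy corpus: each sentence is a list of tokens. A real corpus
# would contain millions of sentences.
corpus = [
    ["i", "love", "natural", "language", "processing"],
    ["embeddings", "encode", "the", "meaning", "of", "words"],
    ["words", "carry", "meaning", "in", "context"],
]

# sg=1 selects Skip-gram; sg=0 would select CBOW.
model = Word2Vec(corpus, vector_size=50, window=3, min_count=1, sg=1)

# The learned vector is the word's "average meaning spectrum"
# across every context in which it appeared.
print(model.wv["meaning"][:5])
print(model.wv.most_similar("words", topn=2))
```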
Tokens
Time has passed since Google's innovation in word representation, and problems have arisen from it: vocabularies keep getting larger and larger as new data is ingested. A word like "love" can appear in many variations: "lovely", "loveless", "lover", "loved", "beloved", "loveable", and so on. In certain places the word can also be misspelled: "lvoe", "lovve", "ovle", etc.
A solution to this was to create embeddings not for the valid words of a language's vocabulary but for sub-word pieces. As with Lego, we take a textual dataset and search for the smallest set of word pieces that can be combined to write the entire text, pieces that also carry some meaning of their own. For instance, the word "unhappiness" can be split into:
- "un": A common prefix meaning βnotβ.
- βhappiβ: The root of the word, related to βhappyβ.
- βnessβ: A common suffix that turns an adjective into a noun, indicating a state or quality.
This sub-word splitting technique started a new branch of word representation that has increased the capability of language models. Examples of subword tokenizers include Byte-Pair Encoding (BPE), used by ChatGPT's models, and WordPiece, used by BERT models.
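As a quick illustration, here is how a pretrained WordPiece tokenizer splits a word using the Hugging Face transformers library; the exact pieces depend on the vocabulary the tokenizer learned, so treat the output as illustrative:

```python
from transformers import AutoTokenizer

# WordPiece tokenizer trained for BERT.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# Continuation pieces are marked with "##"; the exact split
# depends on the learned vocabulary.
print(tokenizer.tokenize("unhappiness"))
```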
These sub-pieces of words also have an instrumental meaning in linguistic systems: they are the so-called "morphemes". By dictionary definition, a morpheme is the smallest indivisible unit of meaning in a language, and a morpheme is not always a word by itself.
By using not only words but also morphemes, with their constraints and particular meanings, we can computationally approximate an actual language system. For instance, suppose you type to an AI, or say to someone, the word "prejoyment", a word that does not officially exist in the English vocabulary. Connecting its morphemes, we have:
- Pre- (before)
- Joy (happiness)
- -ment (state or condition)
"Prejoyment" could mean the anticipation or feeling of happiness before something enjoyable happens, like the excitement before a holiday or a fun event. This is usually called Linguistic Building.
An English speaker will comprehend this word through an intrinsic command of the language: the meanings of the composed morphemes are perceived by the mind. An AI like ChatGPT will understand it because of its training over thousands and thousands of usage examples of the different sub-pieces of words that approximate this word's morphemes.
We can also define a message in a language as information that we want to transmit. Both language and information are structured as departures from equiprobability, which is why they are inter-translatable; yet not all information can be expressed in language, and some items of language structure express no information.
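A small sketch of this information-theoretic idea: Shannon entropy measures how far a symbol distribution departs from equiprobability, and natural-language text sits below the uniform bound (the example string here is arbitrary):

```python
import math
from collections import Counter

# Entropy (bits/symbol): equiprobable symbols reach the maximum;
# structured language falls below that bound.
def entropy(probs):
    return -sum(p * math.log2(p) for p in probs if p > 0)

text = "language and information are structured as departures"
counts = Counter(text.replace(" ", ""))
total = sum(counts.values())
p_text = [c / total for c in counts.values()]

print(round(entropy(p_text), 2))         # entropy of the structured text
print(round(math.log2(len(counts)), 2))  # equiprobable bound, same alphabet
```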
Meaning of Sentences
When we observe meaning at the sentence level, it is clear that it is not only about selecting the right meaning of each word. Sentence information is also defined by the grammatical relations among the words and by how each word contributes its particular meaning to the meaning of the whole.
Let's take the example of forming a phrase in a language you are learning, say Spanish: "Yo voy a la tienda para comprar pan." ("I go to the store to buy bread.") If, like me, you are learning Spanish from the basics, you may not know the best way to construct this phrase even while knowing the meaning of each word. Like me, you would probably say something like "Yo voy tienda comprar pan": the words whose meanings I am sure of, placed in the right order (at least by my English knowledge), and a Spanish speaker will surely still understand my message.
Why did I show this? Here I obeyed some constraints (the same "order" constraints as in English) and picked the right word meanings to direct the sentence meaning. So remember: order, individual meaning, and grammatical constraints. A message can exist without full grammatical constraints, carried by individual meanings alone, though it may sound weird or not fully comprehensible; with grammatical constraints alone, a message can drift toward an undefined meaning. Order can be seen as one type of grammatical constraint, but it does more than that: take a paragraph and randomly shuffle the position of each word, and the result may be totally different from the original. The order of the words helps connect the individual meanings and directs them toward one very specific meaning for the entire sentence.
How to Express Sentences in NLP Terms?
Computationally, the very first basic step, once word embeddings were known, was to compute the embedding of each word and merge them into a final vector by some aggregation (sum, average, etc.). This was nice and worked well for certain tasks, but more complex NLP tasks require more, because this approach only captures the combination of meanings in a sentence; the order and the grammatical constraints are still missing.
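A minimal sketch of this aggregation idea, with made-up embedding values; note how averaging discards word order entirely:

```python
import numpy as np

# Toy pretrained embeddings (in practice, loaded from word2vec/GloVe).
embeddings = {
    "the":    np.array([0.1, 0.3, 0.0]),
    "cat":    np.array([0.9, 0.2, 0.4]),
    "sleeps": np.array([0.2, 0.8, 0.5]),
}

def sentence_vector(tokens, emb):
    """Aggregate word vectors by averaging; word order is lost."""
    return np.mean([emb[t] for t in tokens if t in emb], axis=0)

# "the cat sleeps" and "sleeps cat the" get the same vector:
v1 = sentence_vector(["the", "cat", "sleeps"], embeddings)
v2 = sentence_vector(["sleeps", "cat", "the"], embeddings)
print(np.allclose(v1, v2))  # True: order information is missing
```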
Limitations of Embedding Aggregations
The biggest example of this failure in NLP tasks is the RAG approach for querying knowledge from an information source, which suffers from semantic dissonance: the mismatch between the meaning of a query and the retrieved response, even when the response is technically related.
Suppose you have a question-answering dataset that you would like to use to improve an LLM you are implementing to answer business questions for the CEO, and someone asks: "How did the company lose revenue in 2024 Q3?" The ideal answer would be: "Product X faced a significant recall in September, leading to a decrease in sales."
However, there is a dissonance: the answer does not explicitly state "lost revenue"; the recall is only indirectly the cause. Consequently, using only aggregations of sentence meaning (aggregated embeddings) to compare the similarity of the question and the answer in a document-retrieval search may not lead to a good result.
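A toy illustration of the problem, with invented vector values: cosine similarity over aggregated embeddings can rank a superficially similar answer above the truly relevant one:

```python
import numpy as np

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# Hypothetical averaged sentence embeddings for the query and two
# candidate answers (values are made up for illustration).
query    = np.array([0.8, 0.1, 0.3])  # "How did the company lose revenue...?"
answer_a = np.array([0.2, 0.9, 0.4])  # the ideal answer about the recall
answer_b = np.array([0.9, 0.2, 0.3])  # an answer that merely repeats "revenue"

# The superficially similar answer wins, even though answer_a is
# the one the CEO actually needs: semantic dissonance.
print(cosine(query, answer_a), cosine(query, answer_b))
```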
Approaches to Order and Grammatical Constraints
A language can contain infinitely many sentences of the most diverse lengths. One of the first attempts to handle "order" in NLP was the Recurrent Neural Network (RNN), which receives a bounded sequence of data and processes each item individually with a very simple idea of "memory", generating a different output for each new item seen. This kind of neural network was an incredible advance: with sequence processing we give the AI a notion of order, and with "memory" we not only strengthen that notion but also implement simple constraints; the meanings of the words seen so far interfere with how the current word is interpreted in building the sentence meaning.
You can see this in the following phrases:
- I know that
- I do not know that
Here, the word "not" completely redirects the meaning of knowing in the phrase.
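A minimal sketch of the recurrent idea, with random (untrained) weights and hypothetical embeddings: the hidden state acts as the "memory", so inserting "not" changes the final representation of the sentence:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4  # embedding and hidden size (tiny, for illustration)

# Hypothetical embeddings and randomly initialized RNN weights;
# a real model would learn all of these from data.
vocab = {w: rng.normal(size=d) for w in ["i", "do", "not", "know", "that"]}
W_h, W_x = rng.normal(size=(d, d)), rng.normal(size=(d, d))

def rnn_encode(tokens):
    h = np.zeros(d)  # the "memory"
    for t in tokens:
        # Each new word updates the memory of everything seen before.
        h = np.tanh(W_h @ h + W_x @ vocab[t])
    return h

h1 = rnn_encode(["i", "know", "that"])
h2 = rnn_encode(["i", "do", "not", "know", "that"])
print(np.round(h1 - h2, 2))  # "not" leaves a trace in the final state
```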
Improved Order Information
Once sequential modeling of text with recurrent neural network architectures had proved its capability, what else could we do?
It's important to highlight the biggest problems of recurrent models:
- Inefficiency: It's not possible to parallelize the training, so it takes a long time for huge datasets.
- Difficulty in Capturing Long-Range Dependencies: The "memorizing capacity" of an RNN is very limited, and retaining information across long-term dependencies is complicated too.
The Transformer architecture helps us solve these issues. Let's take a deep dive into one of its main components, the Positional Encoder.
The task the Positional Encoder is responsible for solving is: "How can I inject a token's position in a given sequence without any architectural tricks?" The solution is simple: for each position of the sequence, generate a unique pattern of sinusoidal signals, and add this signal vector to the embedding of the token at that position.
It's a quick method to infer a positional signal for each token, and it also helps the computational model learn one of the most important linguistic artifacts of a sentence: the position of each token.
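A compact sketch of the sinusoidal positional encoding described in "Attention Is All You Need"; the resulting matrix is added element-wise to the token embeddings:

```python
import numpy as np

def positional_encoding(seq_len, d_model):
    """Sinusoidal positional encoding: one unique signal per position."""
    pos = np.arange(seq_len)[:, None]  # token positions
    i = np.arange(d_model)[None, :]    # embedding dimensions
    angles = pos / np.power(10000.0, (2 * (i // 2)) / d_model)
    # Even dimensions use sine, odd dimensions use cosine.
    return np.where(i % 2 == 0, np.sin(angles), np.cos(angles))

pe = positional_encoding(seq_len=50, d_model=16)
print(pe.shape)  # (50, 16): added to the embedding of each token
```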
Improved Word Relationships and Constraints
As we mentioned, we need to know the words of a sentence, the order in which those words are arranged, and how each word relates to the others. The previous mechanisms solve most of these requirements, but one is still missing: the relationships between words. How do different words in a sentence interact? What are the constraints on those interactions? How can a word at the beginning of a sentence affect a word used in the middle of the text?
Most of these restrictions and effects must be learned by the model, and both are consequences of the interactions of the tokens. A way to explicitly highlight those interactions is the Attention mechanism.
We won't go into the details of the Attention mechanism (it is a bit long to explain and could take an entire post of its own), but we can summarize what it does. Computationally, it tries to optimize the relevance of information by dynamically focusing on the most important parts of the input sequence; from the linguistic point of view, it also helps the model capture the relationships between the tokens of a sequence, even when they are far apart.
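A minimal sketch of scaled dot-product attention, the core computation of the mechanism, with random matrices standing in for the learned query/key/value projections:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """Each token attends to every other token, however far away."""
    scores = Q @ K.T / np.sqrt(Q.shape[-1])  # pairwise relevance
    weights = softmax(scores)                # each row sums to 1
    return weights @ V, weights

# Toy query/key/value matrices for a 3-token sequence (random here;
# in a Transformer they are learned projections of the embeddings).
rng = np.random.default_rng(0)
Q = K = V = rng.normal(size=(3, 8))
out, w = scaled_dot_product_attention(Q, K, V)
print(np.round(w, 2))  # how much each token looks at every other
```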
Putting it all together, it is possible to see from the linguistic point of view why an LLM built on the Transformer architecture has the capacity to learn most of the main components of a language and use them almost as a human does. This is why LLMs like ChatGPT sound so natural: a huge amount of data, with well-encoded information.
Information and Language
In the end, we can see that the most important aspect of a language for computer science is how to translate the information at hand into a message encoded in a language, in such a way that the important linguistic aspects can be captured by a computational model. The information is all that matters.
To build great linguistic computational models, we need to dig deep into how a language works and find good methods to translate linguistic information into a computational input where all the requirements of the language structure can be present.
References:
- The Routledge Handbook of Linguistics, by Keith Allan
- The Theory of Language and Information: A Mathematical Approach, by Zellig Harris
- Natural Language Processing with Transformers, by Tunstall, von Werra, and Wolf
- Deep Learning, by Goodfellow, Bengio, and Courville