History of Natural Language Processing: Syntax vs Semantics
Natural Language Processing is the branch of computer science focused on training machines to process and “understand” patterns in language. Given the complexity and nuances in language, the task has proven to be quite challenging. The primary divide in the field is between syntax and semantics; syntax is focused on the presence or absence of words and the arrangement or order of words, while semantics refers to their meaning.
Many syntax-based tasks are easy for modern computers. The simplest example is Ctrl+F, which finds instances of a given word or phrase in a document. Computers can perform this kind of exact search at scale with flawless or near-flawless accuracy, something humans cannot match. Machines can also undertake more complex syntax-based analysis, such as regular expressions (RegEx), fuzzy string matching and other similar methods.
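To make these syntax-based methods concrete, here is a minimal sketch using only Python's standard library. The sample sentence and the helper name `similarity` are our own illustrative assumptions, and `difflib.SequenceMatcher` is just one of several ways to score fuzzy matches.

```python
import re
from difflib import SequenceMatcher

# Hypothetical sample text, invented for illustration.
document = "The lessee shall pay rent monthly. The Lessee may not sublet the premises."

# Exact, case-sensitive search: the programmatic equivalent of Ctrl+F.
exact_count = document.count("lessee")

# Regular expression: match "lessee" or "lessees" as a whole word, ignoring case.
pattern = re.compile(r"\blessees?\b", flags=re.IGNORECASE)
regex_matches = pattern.findall(document)

def similarity(a: str, b: str) -> float:
    """Fuzzy match score between 0.0 (nothing shared) and 1.0 (identical)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

# A misspelling scores high; an unrelated word with a similar meaning scores low.
typo_score = similarity("lessee", "lesee")
unrelated_score = similarity("lessee", "tenant")

print(exact_count, regex_matches, round(typo_score, 2), round(unrelated_score, 2))
```

Note that “tenant” scores poorly against “lessee” even though the two words are close in meaning. That gap is exactly what purely syntactic matching cannot see.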
Despite the modest success of these syntax-based approaches, computers had for many years at best a very thin understanding of natural language. Machines had no real way to account for semantics: the meaning of words and the role that context plays in shaping that meaning.
In Part I of this series, we provided an initial introduction to embeddings - the numerical representations which enable computers to work with the meaning of words. Starting roughly a decade ago with the highly acclaimed Word2Vec paper, the field of NLP forged a path toward greater semantic understanding of patterns in natural language. While the idea of embeddings can be traced to the 1950s, it was this influential paper which modernized the use of embeddings as a means to model language.
Embeddings in Generative AI
With this historical background in mind, let’s delve deeper into the importance and functioning of embeddings in the realm of generative AI.
Let us start at the beginning with an understanding of how computers process information. Computers inherently process numbers. If we are to feed a computer textual data, like words, phrases, or textual documents - something we must do if we are to leverage legal data - that data must be converted into some sort of a numerical format.
Enter Large Language Models (LLMs) - machine learning models which specialize in processing text. LLMs encode words (more precisely, tokens, which may be whole words or fragments of words) into embeddings: numerical vectors that represent each one. A vector is simply an ordered series of numbers. Visualize a vector as a point in space, where its location is determined by those numbers. For example, in a 2D setting, the word “cat” could be pinpointed by the coordinates [0.5, 1.2]. But in AI, these spaces are usually multi-dimensional, with vectors having many more coordinates to capture the essence of a word: 100, 300, or even more dimensions. The vector for “cat” might be something like [0.5, 1.2, -0.3, 0.8, …, nth dimension].
Words that have similar meanings will generally have similar vectors, meaning they will be close together in the multi-dimensional space. For example, “cat” and “kitten” would be closer to each other than say “cat” and “skyscraper”; similarly, as highlighted in the original Word2Vec paper from 2013, the vector difference between “king” and “man” might be similar to the difference between “queen” and “woman”. This spatial arrangement lets LLMs distinguish between words, their meanings, and relationships, enabling them to generate coherent text.
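The idea of “closeness” can be sketched in a few lines of Python. The 3D vectors below are hand-picked, hypothetical values (not taken from any trained model), chosen so that the dimensions loosely read as royalty, gender and animal-ness. Cosine similarity is a common way to compare embeddings, though not the only one, and the helper names are our own.

```python
import math

# Hand-picked toy vectors (assumed values for illustration only).
# Dimensions loosely read as: [royalty, gender, animal-ness].
vectors = {
    "king":       [0.9,  0.8,  0.0],
    "queen":      [0.9, -0.8,  0.0],
    "man":        [0.1,  0.8,  0.0],
    "woman":      [0.1, -0.8,  0.0],
    "cat":        [0.0,  0.1,  0.9],
    "kitten":     [0.0,  0.0,  1.0],
    "skyscraper": [0.2,  0.0, -0.7],
}

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors: near 1.0 means same direction."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def subtract(u, v):
    """Element-wise difference u - v."""
    return [a - b for a, b in zip(u, v)]

def nearest_word(target, exclude):
    """Word whose vector points most nearly in the same direction as `target`."""
    candidates = (w for w in vectors if w not in exclude)
    return max(candidates, key=lambda w: cosine_similarity(vectors[w], target))

# "cat" sits much closer to "kitten" than to "skyscraper".
cat_kitten = cosine_similarity(vectors["cat"], vectors["kitten"])
cat_skyscraper = cosine_similarity(vectors["cat"], vectors["skyscraper"])

# The king/man offset matches the queen/woman offset in this toy space.
king_minus_man = subtract(vectors["king"], vectors["man"])
queen_minus_woman = subtract(vectors["queen"], vectors["woman"])

# Analogy: king - man + woman should land nearest to "queen".
analogy_target = [k + w for k, w in zip(king_minus_man, vectors["woman"])]
analogy_answer = nearest_word(analogy_target, exclude={"king", "man", "woman"})

print(round(cat_kitten, 3), round(cat_skyscraper, 3), analogy_answer)
```

In a real model the dimensions are learned rather than hand-labeled, and they rarely correspond to anything as tidy as “royalty” or “gender”.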
How are these embeddings created? Using neural networks - computational models loosely inspired by the structure of the human brain - LLMs learn word embeddings from a large corpus of text. Every word in the input corpus is fed into the network and converted into its embedding form. During training, words that appear in similar contexts, and so tend to have similar meanings, move closer together in the spatial representation. These representations typically improve as the model sees more examples of a given word used across varied contexts. At enormous scales (e.g., GPT-4 is widely reported, though not officially confirmed, to have more than a trillion parameters), models can more easily distinguish subtle semantic differences.
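Training a neural network is beyond a short snippet, so the sketch below swaps in a much simpler, count-based stand-in: each word's vector records which other words appear near it. This is not how Word2Vec or an LLM actually learns its embeddings, but it illustrates the same underlying intuition that words used in similar contexts end up with similar vectors. The six-sentence corpus, the stop-word list and the window size are all assumptions made up for illustration.

```python
import math
from collections import Counter

# Tiny hypothetical corpus; real models train on billions of words.
corpus = [
    "the cat chased the mouse",
    "the kitten chased the mouse",
    "the cat drank milk",
    "the kitten drank water",
    "the skyscraper towers over the city",
    "the skyscraper dominates the city skyline",
]
stop_words = {"the"}  # ignore words that carry little meaning
window = 2            # neighbours on each side that count as context

# For every word, count which other words appear within the window.
context_counts = {}
for sentence in corpus:
    tokens = [t for t in sentence.split() if t not in stop_words]
    for i, word in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                context_counts.setdefault(word, Counter())[tokens[j]] += 1

vocabulary = sorted(context_counts)

def embed(word):
    """One coordinate per vocabulary word: how often it co-occurs with `word`."""
    return [context_counts[word][other] for other in vocabulary]

def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

sim_cat_kitten = cosine_similarity(embed("cat"), embed("kitten"))
sim_cat_skyscraper = cosine_similarity(embed("cat"), embed("skyscraper"))

print(sim_cat_kitten, sim_cat_skyscraper)
```

Here “cat” and “kitten” share most of their neighbours (“chased”, “mouse”, “drank”) and so score high, while “cat” and “skyscraper” share none. Neural approaches learn dense, lower-dimensional vectors by adjusting weights during training instead of counting, but more text helps in both cases for the same reason: each new sentence adds evidence about the contexts a word keeps.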
Models such as GPT are trained to predict the next word in a sequence, while others like BERT are trained to fill in masked words using the context on both sides. Once trained, when you input a sentence into an LLM, each word is transformed into its respective embedding. As the sentence passes through the model, these embeddings evolve, amalgamate, and adjust, allowing the LLM to craft answers, categorize text, and more. Earlier algorithms like Word2Vec, GloVe, and FastText were pivotal in showing how such embeddings could be learned from text.
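The first step described above, turning each word of an input sentence into its embedding, is essentially a table lookup. The sketch below assumes a hypothetical four-entry vocabulary with made-up 3D values and a whitespace tokenizer. Real LLMs use subword tokenizers and far larger tables, and they combine embeddings through many neural network layers. The simple averaging shown here is only a stand-in for that “amalgamation”.

```python
# Hypothetical vocabulary and embedding table (made-up values for illustration).
vocab = {"the": 0, "court": 1, "ruled": 2, "<unk>": 3}
embedding_table = [
    [0.1,  0.0,  0.2],  # the
    [0.7, -0.3,  0.5],  # court
    [0.4,  0.9, -0.1],  # ruled
    [0.0,  0.0,  0.0],  # <unk>, used for out-of-vocabulary words
]

def tokenize(text):
    """Naive tokenizer: lowercase and split on whitespace."""
    return text.lower().split()

def embed_sentence(text):
    """Map each token to its id, then look up the matching row in the table."""
    ids = [vocab.get(token, vocab["<unk>"]) for token in tokenize(text)]
    return [embedding_table[i] for i in ids]

def mean_pool(token_vectors):
    """Average the token vectors into a single sentence-level vector."""
    dims = len(token_vectors[0])
    count = len(token_vectors)
    return [sum(v[d] for v in token_vectors) / count for d in range(dims)]

token_vectors = embed_sentence("The court ruled")
sentence_vector = mean_pool(token_vectors)

print(token_vectors)
print([round(x, 3) for x in sentence_vector])
```

A single vector like `sentence_vector` is the kind of representation that downstream tasks such as text categorization can work with.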
In a nutshell, embeddings act as a foundational blueprint for words in LLMs. They evolve and adapt during both training and execution. Well-trained embeddings capture the semantic meaning of words, allowing models to grasp subtleties in meaning, detect synonyms, and even understand analogies. In essence, embeddings provide a bridge between the human world of language and the machine world of numbers, allowing AI models to perform intricate tasks on textual data.
Stay Tuned for Future Parts of our Series on Embeddings
Stay tuned for Part III, where we’ll explore the significance of embeddings in the legal domain and their potential impact on legal tech.