A Basic Introduction to Embeddings

With all of the recent publicity around generative AI, you have likely come across some technical jargon in conversation, academic articles, or even social media. One of the most fundamental and significant terms you will inevitably encounter is embeddings.

What are embeddings, you ask? Consider this definition from Microsoft:

Embeddings are the representations or encodings of tokens, such as sentences, paragraphs, or documents, in a high-dimensional vector space, where each dimension corresponds to a learned feature or attribute of the language.

For those of us just entering the field of generative AI, this definition is likely to raise more questions than answers. To really get a grasp on embeddings, let us take a few steps back and begin with several brief explanations of what an embedding is. And who could better explain it than our dear friend, GPT-4? To get a few different angles on embeddings, we fed three prompts to GPT-4: explain what an embedding in AI is to a 5 year old, a 10 year old, and a 12 year old. Check it out.

Explain what an embedding in AI is to a 5 year old:

Imagine you have a big box of crayons, each crayon being a different color. Now, suppose you want to show your friend what your favorite color is, but you can’t use the word for the color or show the crayon itself. So, you might draw a picture or something that is usually that color, like the sun for yellow, or grass for green.

In AI, an “embedding” is a bit like drawing a picture to show what a word means. It turns words into a kind of picture or pattern that the computer can understand better. So, when we talk about an “embedding” in AI, we are talking about the way the computer thinks about a word by changing it into its own special picture or pattern!

Explain what an embedding in AI is to a 10 year old:

So, you know how sometimes in school, you use a number line to understand how numbers relate to each other? Like, on a number line, you can see that 10 is further to the right than 5.

In AI, an “embedding” is kind of like a super fancy number line for words or things. Each word gets its special spot in this line (or actually, it’s more like a plane with many dimensions, but that’s getting complicated). This special spot is determined by how the word is used and what it means.

So, let’s say the word “cat.” The computer looks at lots of books and websites to see how “cat” is used, and it notices that “cat” is often used similarly to “kitten” but very differently from “car.” So, in the embedding, “cat” and “kitten” would be close together, but “car” would be far away.

This way, the computer can understand words by turning them into points in its fancy, multi-dimensional space. This helps the computer learn the meaning of words based on how and where they are used and helps it to understand and process language better!

Explain what an embedding in AI is to a 12 year old:

You know how every word has its own meaning, right? In AI, an “embedding” is a way to give words a kind of special code so that computers can understand their meanings. This code is made of numbers and helps computers see which words are similar and which ones are not.

Imagine you have a map, and each word is a city on that map. Cities that are close to each other have similar meanings, like “cat” and “kitten” might be neighbors because they are both about small, furry pets. But “cat” and “bicycle” would be far away from each other because they are very different.

To make this map, computers read lots and lots of text, like books, websites, and articles. They look at how words are used together and then figure out how close or far they should be on the map based on how similar their meanings are. So, if “dog” and “bark” appear together a lot, the computer understands that they are related and places them close on the map.

These maps, or “embeddings,” help computers to understand and process language, which is really useful for things like translating languages, answering questions, or making chatbots that can talk to people!
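Stepping outside GPT-4's answers for a moment, the map-building idea can be sketched in a few lines of Python. This is a minimal illustration, not how production embedding models are trained. It assumes a tiny made-up corpus and simply counts which words share a sentence; the helper names cooccurrence_vectors and cosine_similarity are our own, chosen for this example.

```python
from collections import Counter
from math import sqrt

# Tiny made-up corpus (an assumption for illustration; real systems read far more text).
corpus = [
    "the cat chased the mouse",
    "the kitten chased the mouse",
    "the cat drank milk",
    "the kitten drank milk",
    "the car needs fuel",
    "the car needs a driver",
]


def cooccurrence_vectors(sentences):
    """Map each word to counts of the other words that share a sentence with it."""
    vectors = {}
    for sentence in sentences:
        words = sentence.split()
        for i, word in enumerate(words):
            context = words[:i] + words[i + 1:]  # every word except this occurrence
            vectors.setdefault(word, Counter()).update(context)
    return vectors


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse count vectors (higher = more alike)."""
    dot = sum(a[key] * b[key] for key in a if key in b)
    norm_a = sqrt(sum(v * v for v in a.values()))
    norm_b = sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b)


vectors = cooccurrence_vectors(corpus)
cat_kitten = cosine_similarity(vectors["cat"], vectors["kitten"])
cat_car = cosine_similarity(vectors["cat"], vectors["car"])
print(f"cat vs kitten: {cat_kitten:.2f}")
print(f"cat vs car:    {cat_car:.2f}")
```

In this corpus, "cat" and "kitten" appear in identical contexts, so they score higher than "cat" and "car". The second score is still not zero, because very common words such as "the" show up everywhere. Real embedding methods go well beyond raw counts, in part to reduce that kind of noise.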

Embeddings, Revisited

What we learn here is that you can think of an embedding as a numeric code, a list of numbers assigned to a word, that computers use to represent and process language. Because words with similar meanings receive similar codes, embeddings allow computers to ultimately “understand” words. Pretty amazing!
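To make the idea of a numeric code concrete, here is a small Python sketch of how such codes might be used once they exist. The coordinates below are hand-picked for illustration rather than learned from text, and the function name nearest_word is hypothetical. Real embeddings typically have hundreds or thousands of dimensions instead of two.

```python
from math import dist

# Hand-picked 2-D coordinates, purely illustrative (not learned from any text).
embedding = {
    "cat": (1.0, 1.2),
    "kitten": (1.1, 1.0),
    "dog": (1.5, 1.8),
    "car": (6.0, 0.5),
    "bicycle": (5.5, 1.0),
}


def nearest_word(word, table):
    """Return the other word whose point lies closest to `word` (Euclidean distance)."""
    target = table[word]
    others = (w for w in table if w != word)
    return min(others, key=lambda w: dist(target, table[w]))


print(embedding["cat"])                 # the numeric code stored for "cat"
print(nearest_word("cat", embedding))   # closest neighbor of "cat" on this toy map
print(nearest_word("car", embedding))   # closest neighbor of "car" on this toy map
```

On this toy map, the nearest neighbor of "cat" is "kitten" and the nearest neighbor of "car" is "bicycle". Search and recommendation features built on embeddings rest on the same kind of lookup: find the stored points closest to a query point.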

Stay Tuned for More on Embeddings…

In the next part of this series on embeddings, we take a deeper look at why we need embeddings and how they work. Then, we consider the value of legal-specific embeddings.

Jessica Mefford Katz, PhD

Jessica is a Co-Founding Partner and a Vice President at 273 Ventures.

Jessica holds a Ph.D. in Analytic Philosophy and applies the formal logic and rigorous frameworks of the field to technology and data science. She is passionate about assisting teams and individuals in leveraging data and technology to more accurately inform decision-making.

Would you like to learn more about the AI-enabled future of legal work? Send your questions to Jessica by email or LinkedIn.