Swimming in the Sea of ‘Legalese’
The realm of law is a vast ocean of words, brimming with unique legal jargon that is not found in everyday conversations. Indeed, the public uses terms like ‘legalese’ and ‘legal gobbledygook’ to refer to the complex and nuanced language typically used by lawyers.
This year has witnessed significant progress by foundation models on legal tasks, including GPT-4 passing the Bar Exam. Despite this progress, there are still many opportunities to improve the state of the art through legal-specific models and methods.
In Part I of our series on Embeddings, we introduced the concept of an embedding with help from our good friend GPT-4. In Part II, we took a deeper dive into the history of NLP and the use of embeddings in large language models.
The Value of Legal Specific Embeddings
For AI systems like large language models (LLMs) to dive deep into this sea and better understand legal lingo, they need proper training. The use cases in law are fairly varied, including analyzing contracts, monitoring regulations, drafting briefs, overseeing mergers & acquisitions, and even analyzing legal billing. To excel at these tasks, it's essential that a model can recognize, understand, and fluently use the special terms that make the legal world tick.
Over the past several months, we have been steadily building the Kelvin Legal DataPack - an extensive dataset of more than 200 billion law tokens derived from legal, regulatory, and financial texts across many jurisdictions. We are targeting 300+ billion law tokens by the end of 2023 (and even more in 2024).
To make sure that a large language model can accurately and effectively utilize legal terminology, embeddings should be trained on a diet of high-quality legal data and documents. Thus, the most immediate use for the Kelvin Legal DataPack is to support retrieval augmentation. We have taken elements of the Kelvin Legal DataPack to build a series of custom embedding models that can help facilitate higher-quality retrieval-augmented generation (RAG) across several use cases. We also have embeddings built upon the entire Kelvin Legal DataPack.
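The retrieval-augmentation loop described above reduces to a simple pattern: embed the query and each document segment, rank segments by similarity, and feed the top matches to the model as context. The sketch below illustrates just the ranking step; the function names are illustrative, not Kelvin's actual API, and in practice the vectors would come from an embedding model rather than being supplied by hand.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def retrieve_top_k(query_vec, segment_vecs, k=3):
    """Rank document segments by similarity to the query; return top-k segment IDs."""
    ranked = sorted(
        segment_vecs.items(),
        key=lambda item: cosine_similarity(query_vec, item[1]),
        reverse=True,
    )
    return [seg_id for seg_id, _ in ranked[:k]]
```

The retrieved segments are then prepended to the prompt, so the quality of the whole pipeline hinges on whether the embedding model places the query near the truly relevant segments.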
GPT Embeddings vs Kelvin Embeddings in an M&A Use Case
One of the challenges that many legal tech customers face is the lack of transparency around tools, approaches, and the value that individual Gen AI products actually add above the base model. While we certainly understand the need to keep certain information proprietary, our Kelvin Team believes in providing reasonably detailed documentation. Check our Kelvin Docs Page for more information.
The M&A example from the Kelvin Docs Page gives a good feel for Kelvin Embeddings in action. In the example, we combine Kelvin Vector (our vector database) with Kelvin Embeddings to support retrieval augmentation in the context of an M&A diligence checklist.
In this example, we are reviewing a series of hypothetical executive employment agreements and attempting to determine both the salary of each employee and their bonus plan (if applicable). Kelvin's smallest and (in turn) fastest model, en-001-small, outperforms OpenAI's text-embedding-ada-002 when evaluated on this M&A test case. OpenAI's model returned a series of tax-related results, while Kelvin Embeddings returned the correct results for the problem.
For example, the highest-similarity document segment for Kelvin's en-001-small is shown below:
Employment Agreement - Lucille Bluth.docx: 4. Compensation If there is a change in control of Bluth, as defined in the applicable agreements governing such change in control, and if the CFO’s employment is terminated without cause or if the CFO resigns for good reason within 12 months following such change in control, then the CFO shall be entitled to receive (i) a lump sum severance payment equal to two times her base salary and target bonus in effect immediately prior to such termination or resignation, (ii) immediate vesting of all equity awards granted to her by the Company, and (iii) continuation of health benefits for a period of 12 months following such termination or resignation.
By contrast, OpenAI's state-of-the-art text-embedding-ada-002 result is shown below:
Tax Opinion.docx: Generally, foreign nationals are subject to withholding if they are engaged in a US trade or business, and receive a US-source payment of income. Depending on the country of residence of the foreign contractor, certain tax treaty provisions may provide an exemption from withholding.
These results highlight that OpenAI's embeddings are not ideal for this task, as they fail to recall the relevant documents or text segments on this retrieval problem.
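A common way to quantify this kind of head-to-head comparison is recall@k: for a given query, did a model place the known-relevant segments within its top k results? A minimal sketch of the metric, with hypothetical segment IDs standing in for the agreement and tax documents above:

```python
def recall_at_k(ranked_segment_ids, relevant_ids, k):
    """Fraction of the relevant segments that appear in the top-k retrieved results.

    ranked_segment_ids: segment IDs ordered best-first by the embedding model.
    relevant_ids: the set of segment IDs a human judged relevant to the query.
    """
    hits = sum(1 for seg_id in ranked_segment_ids[:k] if seg_id in relevant_ids)
    return hits / len(relevant_ids)

# Hypothetical example: the query asks about compensation, and only the
# "compensation" and "bonus" segments are truly relevant.
ranking = ["compensation", "tax_opinion", "bonus"]  # model's best-first order
relevant = {"compensation", "bonus"}
```

In this toy ranking, recall@1 is 0.5 (only the compensation segment is found) and recall@3 is 1.0; running the same computation over a labeled test set is how a claim like the one above can be verified.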
So to reiterate, Kelvin’s embeddings are legal-specific. Rather than being trained on a swath of general data, Kelvin’s embeddings have been trained on a large corpus of legal and financial documents. We have even created different embeddings for different problems in law.
The Kelvin Legal DataPack - From Embeddings to Fine Tuning to the Path to a ‘LegalGPT’
While legal embeddings are important for tasks such as retrieval augmentation, there is a range of potential uses for a large corpus of legal information. For example, we have already used the Kelvin Legal DataPack to fine-tune an existing foundation model for a specific use case. Ultimately, the Legal DataPack can be used to support the building of a legal-specific foundation model (i.e., a 'LegalGPT') - but more on that later :)