Soon after we started 273 Ventures, we began to research a topic clearly near and dear to our hearts: training data. Every real AI model starts with training data, and a great deal of subsequent performance is determined by the breadth and quality of this data. This is especially true in the legal domain, where the syntax and substance of language are critical.
We didn’t approach this problem lightly. In addition to Dan and Mike’s backgrounds as leading scholars in applied AI, I am a Certified Information Privacy Professional for the US and Europe and one of the first ForHumanity Certified Auditors for AI systems. Over the last three years, my professional focus has been on the ethics, security, and compliance of AI models and their training data.
So, the three of us set out to review available datasets for training. We read hundreds of publications on large model construction, as well as dataset-specific publications such as the paper describing The Pile. Our conclusion, unfortunately, was that no available training dataset checked all the boxes. Two primary issues confronted every option we reviewed: either the data was collected in breach of contract, or there were issues with intellectual property rights.
Existing Datasets and Breach of Contract
Existing Datasets and Intellectual Property Rights Issues
Within the public discourse and legal actions, copyright violations are the most well-known and widely discussed issue surrounding training data for LLMs. Recent complaints, including multiple class action lawsuits against companies that develop LLMs, center on the use of copyrighted materials in training data.
While some models and datasets provide complete transparency about the sources of their training data, many models are either closed or only provide open weights, leaving users in the dark about their exposure. Whether the IP issues are unknown (because sources are undisclosed) or known (because copyrighted material was used in the dataset), both create business continuity risks for organizations relying on these datasets or the models built from them.
Developing a Solution
Like so many others, we were driven to develop our own solution when we couldn’t find one in the market. Enter: the Kelvin Legal DataPack. Our work was driven by a set of principles that would directly address our concerns with existing datasets; we had to have a clear understanding of ownership, intellectual property, and commercial reuse.
Furthermore, we obtained data only from sources with clear intellectual property rights that explicitly allowed reuse for commercial purposes. Our goal was to amass data that was not only free of use restrictions, but also high quality from the start. Rather than ingesting huge swaths of uncurated data and hoping to filter out the "garbage," we opted instead to begin with high-quality data.
We focused on legal domain data drawn from high-quality legal and financial sources, much of which was actually produced by attorneys and other legal professionals. We have collected practically all laws and rules from key jurisdictions, including the US, the UK, and the EU, yielding over 150 billion true tokens and more than 200 billion BPE tokens, as traditionally measured by GPT models. The Kelvin Legal DataPack covers numerous languages, including English, Spanish, German, and French.
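The distinction between "true" tokens and BPE tokens can be confusing, so here is a minimal illustrative sketch, not the method used to measure the DataPack: it counts whitespace-delimited words as "true" tokens and then estimates a GPT-style BPE count using the common rule of thumb of roughly 1.3 BPE tokens per English word (an assumed heuristic; an actual tokenizer library would be needed for exact counts).

```python
# Illustrative sketch only: "true" tokens vs. an estimated BPE token count.
# The 1.3 tokens-per-word ratio is a rough heuristic, not an exact measure.

def true_token_count(text: str) -> int:
    """Count whitespace-delimited tokens ("true" tokens)."""
    return len(text.split())

def estimated_bpe_count(text: str, tokens_per_word: float = 1.3) -> int:
    """Estimate a GPT-style BPE token count from the word count.
    For exact counts, run the text through a real BPE tokenizer."""
    return round(true_token_count(text) * tokens_per_word)

sample = ("The parties hereto agree that this Agreement shall be governed "
          "by and construed in accordance with the laws of the State of Delaware.")

print(true_token_count(sample))    # 23 whitespace tokens
print(estimated_bpe_count(sample)) # 30, a rough BPE estimate
```

Because BPE splits rarer words (and legal text is full of them) into multiple sub-word pieces, a corpus always contains more BPE tokens than whitespace tokens, which is why the DataPack's BPE count exceeds its true-token count.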
We’re excited to offer a dataset with clean provenance and commercial licensing terms; we believe this type of dataset will enable organizations to capture the power of large language models without the exposure to risks historically associated with them.