Introducing the Massive Legal Embedding Benchmark (MLEB)


tl;dr

We’re announcing the release of the Massive Legal Embedding Benchmark (MLEB), the largest, most diverse, and most comprehensive benchmark for legal text embedding models.

MLEB contains 10 datasets spanning multiple document types, jurisdictions, areas of law, and tasks.

To do well on MLEB, embedding models must demonstrate both extensive legal domain knowledge and strong legal reasoning skills.

On MLEB, our newly released Kanon 2 Embedder scores highest while also recording the lowest inference time of any commercial competitor, highlighting the substantial accuracy and efficiency gains to be had from domain adaptation.

The need for an industry-standard legal embedding benchmark

In the process of training Kanon 2 Embedder, our flagship legal embedding model, we found that the only two existing benchmarks for legal embeddings, LegalBench-RAG and the legal split of the Massive Text Embedding Benchmark (MTEB), were either of low quality or low diversity.

With regard to LegalBench-RAG, we found that it includes only four evaluation datasets, all of which consist entirely of contracts. In practice, legal professionals and users seeking legal advice or knowledge search for, and are interested in, a much broader range of document types, including legislation, regulations, cases, and general legal literature. Additionally, the datasets are dominated by US contracts, reflecting the broader overrepresentation of American law in legal benchmarks and public legal datasets.

In respect of the legal split of MTEB, we observed two key issues.

First, we found a significant amount of mislabeling.

AILA Casedocs and AILA Statutes in particular, which together comprise 25% of the legal split and 50% of its English-language data, contain many query-passage pairs that are entirely irrelevant to each other. Upon review of the authors’ paper, we discovered the cause to be that the datasets had been created using an ‘automated methodology’ that paired ‘facts stated in certain [Indian] Supreme Court cases’ with cases and statutes that had been ‘cited by the lawyers arguing those cases’. According to the authors, ‘actually involving legal experts (e.g., to find relevant prior cases / statutes) would have required a significant amount of financial resources and time’.

The second issue we found with the legal split of MTEB was that it lacked diversity in the areas that matter most to legal practitioners and seekers of legal knowledge.

Of the English-language datasets that remain after excluding AILA Casedocs and AILA Statutes, two deal with consumer terms of service (Consumer Contracts QA and Legal Summarization), leaving only one (Corporate Lobbying) that deals with legislation and none that deal with case law. All of these datasets, again, predominantly reflect American law.

Regarding the non-English-language datasets in the legal split of MTEB, we argue that, in many cases, the legal systems of different cultures may fundamentally differ in ways that make cross-jurisdictional comparisons (e.g., between the common law system used by Anglosphere countries and Sharia law) of the effectiveness of legal embeddings inappropriate.

Furthermore, given that the legal split contains two German datasets, one Chinese dataset, and no other non-English datasets, and that those datasets are concentrated on three select legal tasks, we argue that the inclusion of non-English datasets largely introduces bias and noise in ways that are unlikely to be conducive to real-world performance on most English-language legal information retrieval tasks.

What makes MLEB an industry-standard benchmark

Learning from the limitations of existing legal embedding benchmarks, we designed MLEB with four key objectives in mind, namely to:

  1. be of high quality, both in terms of provenance and labeling;
  2. consist of text processing tasks that have genuine real-world utility to legal tech professionals;
  3. be meaningfully challenging in ways likely to require significant legal knowledge and strong legal reasoning skills; and
  4. represent a broad variety of jurisdictions, legal areas, and types of legal texts.

To that end, MLEB contains 10 different evaluation sets spanning a range of difficulties (including tasks requiring legal reasoning as well as tasks requiring lexical analysis), problem types (specifically, retrieval, zero-shot classification, and question answering), jurisdictions (the US, UK, Australia, Ireland, Singapore, and the EU), and document types (decisions, legislation, regulations, contracts, and literature).
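To give a concrete sense of what a retrieval-style task in MLEB involves, the sketch below embeds queries and candidate documents, ranks documents by cosine similarity, and measures recall@k. This is a simplified illustration rather than the official MLEB evaluation harness; the model name, example texts, and relevance labels are placeholders.

```python
# Illustrative retrieval evaluation sketch (not the official MLEB harness).
# Model name, example texts, and relevance labels are placeholders.
import numpy as np
from sentence_transformers import SentenceTransformer

queries = ["Can I claim a deduction for working-from-home expenses?"]   # hypothetical query
docs = [
    "Guidance: How to claim working-from-home running expenses.",       # relevant (hypothetical)
    "Guidance: Lodging a company tax return.",                          # irrelevant (hypothetical)
]
relevant = {0: {0}}  # query index -> set of relevant document indices

model = SentenceTransformer("all-MiniLM-L6-v2")  # placeholder; swap in the embedder under test
q_emb = model.encode(queries, normalize_embeddings=True)
d_emb = model.encode(docs, normalize_embeddings=True)

scores = q_emb @ d_emb.T  # cosine similarity, since embeddings are unit-normalised
k = 1
hits = 0
for qi, row in enumerate(scores):
    top_k = set(np.argsort(-row)[:k])  # indices of the k highest-scoring documents
    hits += bool(relevant[qi] & top_k)
print(f"recall@{k}: {hits / len(queries):.2f}")
```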

Of the 10 datasets in MLEB, 7 are entirely new, constructed either by having subject matter experts hand-label data or by adapting existing expert-labeled data.

One of the most valuable constituents of MLEB is the Australian Tax Guidance Retrieval dataset. This dataset pairs 112 real-life tax questions posed by Australian taxpayers with 105 relevant Australian Government guidance and policy documents.

We constructed this dataset by sourcing questions from the Australian Taxation Office’s community forum, where Australian taxpayers ask accountants and ATO officials their tax questions. We found that, in most cases, such questions can be answered by Australian Government guidance materials that, for whatever reason, taxpayers were unable to locate themselves. Accordingly, we manually reviewed a stratified sample of challenging forum questions and extracted the guidance materials linked by tax experts, confirming that those materials answered the questions.
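For a sense of the shape of the resulting query-to-guidance relevance pairs, a record might look like the following. The field names and values here are our own illustration, not necessarily the schema of the released dataset.

```python
# Hypothetical layout of one query-to-guidance relevance record.
# Field names and values are illustrative only; the released dataset may use a different schema.
import json

record = {
    "query": "Do I need to declare interest earned on a joint bank account?",  # hypothetical forum question
    "relevant_doc_ids": ["ato-guidance-0042"],                                 # hypothetical guidance document ID
}
print(json.dumps(record, indent=2))
```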

What makes this dataset so valuable is that, unlike the vast majority of legal information retrieval evaluation sets currently available, it consists of genuine, challenging, real-world, user-created queries rather than artificially constructed queries that, at times, diverge considerably from the tasks embedding models are actually used for.

The queries are valuable and challenging precisely because users went to the effort of asking them on a forum, indicating that traditional search engines failed to surface the answers they were looking for. The paired guidance materials are, in turn, valuable because accountants and ATO officials have identified them as relevant, and we have independently verified that relevance.

This dataset is just one of several into which we invested considerable, painstaking effort to ensure usefulness and quality.

Below, we present an overview of all the datasets included in MLEB, alongside the features that make each of them unique.
