Stanford University
A Reasoning-Focused Legal Retrieval Benchmark
Pages: 25
Time to read: 78 mins
Language: English
This research article introduces two new benchmarks for evaluating retrieval-augmented language models in the legal domain: Bar Exam QA and Housing Statute QA. They address the lack of realistic legal retrieval benchmarks that capture the complexity of legal question answering. The authors describe how the datasets were constructed. Together they contain approximately 10,000 labeled examples, each pairing a query with a gold passage and an answer, and are designed to reflect real-world legal research tasks. The annotation processes used to produce them are modeled on actual legal research practice. The article also compares existing retrieval pipelines on these datasets and finds that current methods struggle because queries and their relevant documents share little lexical similarity. The authors therefore suggest that developers of legal retrieval-augmented language models may need more sophisticated retrieval strategies that incorporate legal reasoning.
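To make the notion of low lexical similarity concrete, the following minimal sketch computes token-level Jaccard overlap between a query and a passage. It is an illustration only, not the metric or pipeline used in the article. The function names (`tokens`, `jaccard_overlap`) and both example texts are invented for this sketch and are not drawn from either benchmark.

```python
import re


def tokens(text: str) -> set[str]:
    """Lowercase the text and return its set of alphanumeric word tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def jaccard_overlap(query: str, passage: str) -> float:
    """Jaccard similarity between the token sets of a query and a passage."""
    q, p = tokens(query), tokens(passage)
    if not q and not p:
        return 0.0  # two empty texts: define the overlap as zero
    return len(q & p) / len(q | p)


# Hypothetical texts, written for this sketch (not benchmark data).
query = "Can my landlord lock me out for missing one month of rent?"
passage = (
    "An owner may not recover possession of a dwelling unit "
    "except through judicial eviction proceedings."
)

score = jaccard_overlap(query, passage)
print(f"Jaccard overlap: {score:.3f}")
```

In this invented pair the passage is on point for the question, yet the two texts share almost no vocabulary, so a purely lexical scorer would have little signal for ranking the passage highly.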