Building a benchmark for legal autograders
A significant barrier to the development and deployment of legal AI systems is the cost of measuring performance ("evaluation"). For AI systems that output text, such as tools for drafting, analysis, summarization, or question-answering, evaluation requires that human attorneys manually inspect every output. This process is time-consuming and expensive, and it must be repeated every time the underlying AI system is updated or modified.
One possible solution is autograders. Autograders are AI systems that automate the process of human review. In other domains, AI researchers have found that autograders can reliably replace human grading, offering robust evaluation results at a fraction of the price. Indeed, autograder variants have been essential to substantial recent improvements in generative AI.
Unfortunately, autograders for legal tasks remain understudied and underdeveloped. The primary reason is a lack of benchmarks for measuring autograders on legal tasks. Whereas AI researchers have shown that autograders for non-legal tasks frequently "agree" with human evaluation, little is known about the extent to which legal autograders agree with human lawyer assessments of quality.
LegalBench V2 (LB2) is an ongoing project attempting to fill this gap. Our goal is to build benchmark datasets which allow us to validate and compare different autograders for different legal tasks. This benchmark should reveal those legal tasks where existing AI systems may be relied upon in place of costly human judgements. It will also provide a way to iterate on legal autograder designs, allowing us to identify more general strategies for legal autograder construction.
LB2 builds upon the original LegalBench project. Published at NeurIPS 2023, the original LegalBench was the first large-scale open-source effort to benchmark legal reasoning in AI. Drawing on contributions from 40 collaborators (lawyers, law professors, computational legal scholars, and legal aid organizations), it assembled 162 tasks spanning six types of legal reasoning and used them to evaluate 20 different AI models. The project was widely adopted by AI researchers as a standard reference for understanding what AI can and cannot do in legal contexts. LB2 extends that foundation to a new and increasingly important question: not just whether AI can perform legal tasks, but whether AI can reliably judge the quality of legal work produced by other AI systems.
LB2 focuses on preference-based autograders. Preference-based autograders are given two responses to the same task and asked to decide which one an expert attorney would prefer. Specifically, these autograders take three inputs: (1) a task instruction, (2) a first candidate response to that instruction, and (3) a second candidate response. Their output is a judgement as to which of the two responses an expert attorney would prefer.
To validate a preference-based autograder for a task, we first have a set of human attorneys express their preferences between different outputs for the same task instruction. After collecting this data, we can measure — for any autograder — how often its judgements agree with those of humans. A good autograder is one that tends to pick the same outputs as the human attorneys; a bad autograder is one that frequently disagrees. If an autograder agrees with human judgements a substantial proportion of the time (e.g., 70%+), then it can be treated as a reliable proxy for human judgements.
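As a minimal sketch of this agreement computation (the function and variable names here are illustrative, not an LB2 specification), assuming each preference is recorded as "a" or "b":

```python
def agreement_rate(human_choices: list[str], grader_choices: list[str]) -> float:
    """Fraction of instances on which the autograder picked the same
    response as the human attorney. Each element is "a" or "b",
    indicating which of the two responses was preferred."""
    assert len(human_choices) == len(grader_choices)
    matches = sum(h == g for h, g in zip(human_choices, grader_choices))
    return matches / len(human_choices)
```

Under the rule of thumb above, an autograder scoring roughly 0.7 or higher could be treated as a proxy for human review on that task, while a score near 0.5 is no better than chance.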
Datasets for benchmarking autograders have three parts: a set of instructions, a set of model responses for each instruction, and a set of preferences between those responses. To make this concrete, consider the task of case summarization. Here, instructions correspond to cases (along with criteria describing what a good summary should contain), responses correspond to candidate summaries of each case produced by different AI systems, and preferences correspond to attorney judgements as to which of two candidate summaries better satisfies the criteria.
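In code, one entry of such a dataset might look like the following sketch. The record layout and field names are our own assumptions, not a prescribed LB2 format:

```python
from dataclasses import dataclass

@dataclass
class PreferenceExample:
    instruction: str   # e.g., a case to summarize, plus summary criteria
    response_a: str    # candidate summary from one AI system
    response_b: str    # candidate summary from another AI system
    human_choice: str  # "a" or "b": the summary the attorney preferred

example = PreferenceExample(
    instruction=(
        "Summarize the attached judicial opinion for a litigation partner. "
        "A good summary states the procedural posture, the holding, and "
        "the key reasoning in under 300 words."
    ),
    response_a="<summary produced by AI system A>",
    response_b="<summary produced by AI system B>",
    human_choice="a",  # the reviewing attorney preferred summary A
)
```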
Data of this form can be used to validate autograders. Specifically, we would feed each case and pair of summaries into an autograder (most likely an LLM), ask it to select the summary that best reflects the criteria, and measure how often the autograder's choice "agrees" with the human attorney's choice. The more frequently an autograder agrees with human attorneys, the more reliable it is likely to be.
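Putting these pieces together, the validation loop might look like the sketch below. Here `query_llm` is a placeholder for whatever LLM completion API is actually used; it is assumed to take a prompt string and return the model's text:

```python
def llm_autograde(ex: PreferenceExample) -> str:
    """Ask an LLM which of two responses an expert attorney would prefer."""
    prompt = (
        "You are an expert attorney evaluating two AI-generated responses "
        "to the same legal task.\n\n"
        f"Task:\n{ex.instruction}\n\n"
        f"Response A:\n{ex.response_a}\n\n"
        f"Response B:\n{ex.response_b}\n\n"
        "Which response would an expert attorney prefer? "
        "Answer with a single letter: A or B."
    )
    answer = query_llm(prompt)  # placeholder call, not a real library API
    return "a" if answer.strip().upper().startswith("A") else "b"

def validate(dataset: list[PreferenceExample]) -> float:
    human = [ex.human_choice for ex in dataset]
    graded = [llm_autograde(ex) for ex in dataset]
    return agreement_rate(human, graded)
```

In practice one would also randomize which response is shown as A and which as B, since LLM judges are known to exhibit position bias.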
Like the original LegalBench, LB2 is designed to be a community effort. There are two ways to participate.
If you can identify a legal task where AI evaluation matters, we want to hear from you. You do not need to provide any data — describing the task and why it matters is itself valuable. We will work with collaborators to develop datasets around promising ideas, though we cannot guarantee inclusion of every submission.
If you have both a task idea and instructions to contribute, you can let us know via this form. Provided the instructions meet minimum quality standards, we will work with you to build a benchmark around your instructions in LB2. Contributors who provide task data will be credited as co-authors on LB2.
Below are some examples of potential LB2 tasks. These are intended to be illustrative — we are actively seeking ideas beyond these categories.
Law school exam grading
Instructions correspond to law school exam questions, and responses correspond to student essays. An autograder for this task would assess essay quality in the same way a professor or supervising attorney would.
Example instruction
"You are grading a first-year law school exam. The question asked students to analyze whether the following facts give rise to a valid negligence claim under New York law. The answer should identify the relevant elements, apply each to the facts, and reach a reasoned conclusion.
Attachment: Exam question text"
Deposition strategy
Instructions correspond to deponent-specific litigation contexts: a description of the legal dispute and a profile of the witness. Responses correspond to deposition question strategies. An autograder for this task would assess which strategy is more likely to be effective.
Example instruction
"You represent the plaintiff in a personal injury case. The following is a summary of the dispute and a profile of the opposing party's key witness. Devise a deposition question strategy aimed at establishing that the witness had prior knowledge of the defect.
Attachment: Case summary + witness profile"
Appellate brief writing
Instructions correspond to litigation contexts and the specific issue on appeal. Responses correspond to argument sections of briefs. An autograder for this task would assess which argument is stronger, better reasoned, or more persuasive.
Example instruction
"Draft the argument section of an appellant's brief addressing whether the trial court abused its discretion in excluding the expert's testimony under FRE 702. The relevant trial court ruling and expert report are appended.
Attachment: Documents"
Contract clause drafting
Instructions correspond to drafting scenarios specifying the transaction, parties, and the attorney's objectives for a given clause. Responses correspond to competing drafts. An autograder for this task would assess which draft better serves the client's interests.
Example instruction
"Draft a forum-selection clause for a commercial services agreement between a Delaware-incorporated technology company and a California-based vendor. The clause should favor the technology company and account for disputes arising under both state and federal law."
Regulatory compliance analysis
Instructions correspond to regulatory or compliance questions paired with relevant company and product information. Responses correspond to competing legal analyses. An autograder for this task would assess which analysis is more thorough, accurate, or practically useful.
Example instruction
"A fintech startup intends to offer a peer-to-peer lending product in all 50 states. Identify the primary federal and state regulatory frameworks that would apply and describe the key compliance obligations the company must address before launch.
Attachment: Company description and product term sheet"
Legal research memos
Instructions correspond to research questions posed by a supervising attorney, along with relevant factual context. Responses correspond to competing research memos. An autograder for this task would assess which memo more accurately states the law, identifies the most relevant authority, and applies it to the facts.
Example instruction
"A client based in Texas entered into a software licensing agreement with a vendor that included a mandatory arbitration clause. The client now wants to bring a class action. Write a short research memo analyzing whether the arbitration clause is enforceable and whether it bars class proceedings under current federal and Texas law.
Attachment: Contract excerpt"
Client advice letters
Instructions correspond to a client situation and the specific question the client has asked. Responses correspond to competing advice letters. An autograder for this task would assess which letter more clearly explains the legal situation, identifies the key risks, and provides actionable guidance appropriate for a non-lawyer reader.
Example instruction
"Your client is a small business owner who has received a cease-and-desist letter alleging trademark infringement. They have been using their current business name for six years. Draft a client advice letter explaining their legal exposure, options, and recommended next steps in plain language."
Due diligence review
Instructions correspond to a contract review task specifying the client's position and the issues to flag. Responses correspond to competing due diligence summaries. An autograder for this task would assess which summary more accurately identifies material risks, missing provisions, and negotiating points relevant to the client.
Example instruction
"You represent the buyer in an acquisition. Review the following target company's standard customer agreement and identify any provisions that create material liability, limit assignability, or would require third-party consent to transfer upon closing.
Attachment: Contract text"
Demand letters
Instructions correspond to a dispute context and the client's objectives. Responses correspond to competing demand letters. An autograder for this task would assess which letter more effectively asserts the client's legal position, sets an appropriate tone, and maximizes the likelihood of a favorable response.
Example instruction
"Your client is a landlord whose commercial tenant vacated the premises three months before the lease term ended without notice, leaving the space in disrepair. Draft a demand letter seeking unpaid rent, early termination damages, and repair costs.
Attachment: Lease agreement and damage assessment"
Sentencing memoranda
Instructions correspond to a defendant's background, the offense conduct, and the applicable sentencing guidelines range. Responses correspond to competing sentencing memoranda submitted on the defendant's behalf. An autograder for this task would assess which memorandum more effectively presents mitigating factors and advocates for a below-guidelines sentence.
Example instruction
"Your client has pleaded guilty to one count of wire fraud and faces a guidelines range of 37–46 months. He has no prior criminal history, cooperated with investigators, and has significant family caretaking responsibilities. Draft a sentencing memorandum arguing for a sentence of time served.
Attachment: Client background materials and offense summary"
Ready to contribute a task idea or dataset?
Open the contribution form →
For questions about LB2, contribution logistics, or data handling, please contact Neel Guha at neelguha@gmail.com.