A community effort to evaluate AI in the legal domain — Stanford University
AI tools are now widely used across the legal profession — for tasks like reviewing contracts, summarizing case law, drafting pleadings, and analyzing depositions. As these tools become more capable and more relied upon, a critical question emerges: how do we know whether an AI is actually doing a good job?
Evaluating AI performance in legal settings is harder than it sounds. Legal quality is nuanced — a contract clause might be technically correct but poorly suited to a client's risk tolerance; a case summary might be accurate but miss the strategically important detail. Assessing this kind of quality has traditionally required expert human review, which is expensive, slow, and difficult to scale.
LegalBench V2 (LB2) is a follow-up to the original LegalBench project from 2023. LB2 focuses specifically on evaluating a class of AI tools called autograders — AI systems that can assess the quality of other AI systems' outputs in legal contexts.
The original LegalBench, published at NeurIPS 2023, was the first large-scale open-source effort to benchmark legal reasoning in AI. Drawing on contributions from 40 collaborators — lawyers, law professors, computational legal scholars, and legal aid organizations — it assembled 162 tasks spanning six types of legal reasoning, and used them to evaluate 20 different AI models. The project was widely adopted by AI researchers as a standard reference for understanding what AI can and cannot do in legal contexts. LB2 builds on that foundation, and extends it to a new and increasingly important question: not just whether AI can perform legal tasks, but whether AI can reliably judge the quality of legal work produced by other AI systems.
An autograder is an AI system that evaluates the output of another AI system — an automated quality reviewer. If you want to know whether an AI tool is drafting good contracts, you would traditionally need experienced attorneys to review hundreds of AI-generated drafts. This is accurate but expensive and slow. An autograder, if reliable, can do the same thing automatically.
Autograders are already widely used in other fields of AI development. The question LB2 asks is: can autograders be trusted to evaluate AI performance in legal tasks — and if so, which ones?
LB2 focuses on preference-based autograders. Rather than scoring a single output, a preference-based autograder is given two responses to the same task and asked to decide which one an expert attorney would prefer. It takes three inputs: a task instruction, the input data the task operates on, and two candidate responses to compare.
If an autograder consistently agrees with experienced attorneys across many such comparisons, it can serve as a reliable proxy for expert legal judgment — enabling AI tools to be evaluated quickly and at scale.
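The evaluation loop described above can be sketched in a few lines. This is a minimal illustration, not LB2's actual harness: the `autograder_prefers` callable is a hypothetical stand-in for whatever model performs the comparison, and the agreement rate is simply the fraction of comparisons where the autograder picks the same response as the expert annotator.

```python
from dataclasses import dataclass

@dataclass
class Comparison:
    """One pairwise comparison: the three autograder inputs plus the expert label."""
    instruction: str    # the task instruction
    input_data: str     # e.g., the full text of a judicial opinion
    response_a: str
    response_b: str
    attorney_pick: str  # "A" or "B": which response the expert preferred

def agreement_rate(comparisons, autograder_prefers):
    """Fraction of comparisons where the autograder's pick matches the attorney's."""
    matches = sum(
        1
        for c in comparisons
        if autograder_prefers(c.instruction, c.input_data, c.response_a, c.response_b)
        == c.attorney_pick
    )
    return matches / len(comparisons)

# Trivial stand-in autograder for demonstration: always prefers response A.
def always_a(instruction, input_data, response_a, response_b):
    return "A"

data = [
    Comparison("Summarize ...", "Opinion text ...", "Summary 1", "Summary 2", "A"),
    Comparison("Summarize ...", "Opinion text ...", "Summary 3", "Summary 4", "B"),
]
print(agreement_rate(data, always_a))  # 0.5 for this toy dataset
```

A real study would run this over hundreds of expert-labeled comparisons and report the agreement rate per autograder, which is the quantity LB2 is designed to measure.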
Consider the task of case summarization — one of the most common uses of AI in legal practice. Attorneys use AI to generate summaries of judicial opinions, quickly extracting key holdings, facts, and reasoning from lengthy decisions. But how do you know whether a summary is actually good?
Task instruction
"Summarize the following judicial opinion for a litigator who needs to understand the court's holding on personal jurisdiction, the key facts the court relied on, and any limitations the court placed on its reasoning. Write in plain English, in 300 words or fewer."
Input data
The full text of a judicial opinion — for example, a district court ruling on a motion to dismiss for lack of personal jurisdiction.
Two responses to compare
Two AI-generated summaries of the same opinion, differing in quality or emphasis.
The autograder is asked: which of these would an experienced litigator prefer? If it reliably selects the stronger summary across many opinion–pair combinations, it can serve as a trustworthy evaluator of AI-generated summaries — without requiring a human attorney to weigh in each time.
Below are examples of what an LB2 task looks like in practice. These are illustrative — we are actively seeking ideas beyond these categories.
Example instruction
"You are grading a first-year law school exam. The question asked students to analyze whether the following facts give rise to a valid negligence claim under New York law. The answer should identify the relevant elements, apply each to the facts, and reach a reasoned conclusion. [Exam question text]"
Input data
A set of student essay responses to the exam question.
Responses to compare
Pairs of student essays — one stronger, one weaker — ideally with attorney or professor annotations indicating which is preferred and why.
Example instruction
"You represent the plaintiff in a personal injury case. The following is a summary of the dispute and a profile of the opposing party's key witness. [Case summary + witness profile]. Devise a deposition question strategy aimed at establishing that the witness had prior knowledge of the defect."
Input data
Case summaries and witness profiles from real or anonymized litigation scenarios.
Responses to compare
Pairs of deposition strategies drafted by attorneys with differing experience levels or approaches. Human-authored responses are strongly preferred.
Example instruction
"Draft the argument section of an appellant's brief addressing whether the trial court abused its discretion in excluding the expert's testimony under FRE 702. The relevant trial court ruling and expert report are appended. [Documents]"
Input data
Trial court rulings, expert reports, and relevant procedural history.
Responses to compare
Pairs of attorney-drafted argument sections, varying in quality or strategic approach.
Example instruction
"Draft a forum-selection clause for a commercial services agreement between a Delaware-incorporated technology company and a California-based vendor. The clause should favor the technology company and account for disputes arising under both state and federal law."
Input data
Drafting instructions specifying transaction context, parties, and attorney objectives for each clause.
Responses to compare
Pairs of attorney-drafted clauses for the same scenario.
Example instruction
"A fintech startup intends to offer a peer-to-peer lending product in all 50 states. Identify the primary federal and state regulatory frameworks that would apply and describe the key compliance obligations the company must address before launch. [Company description and product term sheet]"
Input data
Regulatory questions or compliance scenarios with accompanying company and product information.
Responses to compare
Pairs of legal analyses addressing the same scenario, varying in thoroughness or practical focus.
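Each of the examples above shares the same shape: an instruction, input data, and a pair of responses with an expert preference. A contributed task could be stored as a simple structured record along these lines; the field names below are illustrative only, not a required LB2 submission format.

```python
import json

# Hypothetical record for a contributed LB2 task; all field names are illustrative.
task_record = {
    "task_id": "contract_forum_selection_001",
    "instruction": "Draft a forum-selection clause for a commercial services agreement ...",
    "input_data": "Transaction context, parties, and drafting objectives ...",
    "responses": {
        "A": "First attorney-drafted clause ...",
        "B": "Second attorney-drafted clause ...",
    },
    "preferred": "A",  # which response the expert annotator preferred
    "annotator_rationale": "Why the annotator preferred this response ...",
}

# Records like this serialize cleanly, e.g. to JSON Lines for a shared dataset.
print(json.dumps(task_record, indent=2))
```

Whatever the final format, the essential ingredients are the same: the three autograder inputs plus an expert preference label, ideally with a rationale.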
Like the original LegalBench, LB2 is designed to be a community effort. There are two ways to participate.
If you can identify a legal task where AI evaluation matters, we want to hear from you. You do not need to provide any data — describing the task and why it matters is itself valuable. We will work with collaborators to develop datasets around promising ideas, though we cannot guarantee inclusion of every submission.
If you have both a task idea and data to contribute, you can let us know via this form. We will work with you to incorporate the data into LB2. Contributors who provide task data will be credited as co-authors on LB2.
For task data contributions, we distinguish between three types — they are not equally valuable, and co-authorship requires at minimum the first two.
Ready to contribute a task idea or dataset?
Open the contribution form →
For questions about LB2, contribution logistics, or data handling, please contact Neel Guha — nguha@cs.stanford.edu.