Background

AI tools are now widely used across the legal profession — for tasks like reviewing contracts, summarizing case law, drafting pleadings, and analyzing depositions. As these tools become more capable and more relied upon, a critical question emerges: how do we know whether an AI is actually doing a good job?

Evaluating AI performance in legal settings is harder than it sounds. Legal quality is nuanced — a contract clause might be technically correct but poorly suited to a client's risk tolerance; a case summary might be accurate but miss the strategically important detail. Assessing this kind of quality has traditionally required expert human review, which is expensive, slow, and difficult to scale.

LegalBench V2 (LB2) is a follow-up to the original LegalBench project from 2023. LB2 focuses specifically on evaluating a class of AI tools called autograders — AI systems that can assess the quality of other AI systems' outputs in legal contexts.

The original LegalBench, published at NeurIPS 2023, was the first large-scale open-source effort to benchmark legal reasoning in AI. Drawing on contributions from 40 collaborators — lawyers, law professors, computational legal scholars, and legal aid organizations — it assembled 162 tasks spanning six types of legal reasoning, and used them to evaluate 20 different AI models. The project was widely adopted by AI researchers as a standard reference for understanding what AI can and cannot do in legal contexts. LB2 builds on that foundation, and extends it to a new and increasingly important question: not just whether AI can perform legal tasks, but whether AI can reliably judge the quality of legal work produced by other AI systems.


What is an Autograder?

An autograder is an AI system that evaluates the output of another AI system — an automated quality reviewer. If you want to know whether an AI tool is drafting good contracts, you would traditionally need experienced attorneys to review hundreds of AI-generated drafts. This is accurate but expensive and slow. An autograder, if reliable, can do the same thing automatically.

Autograders are already widely used in other fields of AI development. The question LB2 asks is: can autograders be trusted to evaluate AI performance in legal tasks — and if so, which ones?

Preference-based autograders

LB2 focuses on preference-based autograders. Rather than scoring a single output, a preference-based autograder is given two responses to the same task and asked to decide which one an expert attorney would prefer. It takes three inputs: the task instruction, the input data the instruction is applied to, and the two candidate responses to compare.

If an autograder consistently agrees with experienced attorneys across many such comparisons, it can serve as a reliable proxy for expert legal judgment — enabling AI tools to be evaluated quickly and at scale.
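Agreement with expert judgment can be quantified very simply: over a set of comparisons, count how often the autograder's pick matches the attorneys' pick. A minimal sketch in Python; the function name and label convention ("A"/"B") are illustrative, not part of LB2:

```python
# Sketch of measuring autograder reliability against expert preference labels.
# All names here are hypothetical, not an official LB2 interface.

def agreement_rate(autograder_picks, attorney_picks):
    """Fraction of comparisons where the autograder's preferred response
    ("A" or "B") matches the response the attorneys preferred."""
    assert len(autograder_picks) == len(attorney_picks)
    matches = sum(a == b for a, b in zip(autograder_picks, attorney_picks))
    return matches / len(autograder_picks)

# Example: the autograder agrees with attorneys on 3 of 4 comparisons.
print(agreement_rate(["A", "B", "B", "A"], ["A", "B", "A", "A"]))  # → 0.75
```

An autograder whose agreement rate stays high across many tasks and response pairs is the kind of reliable proxy LB2 is looking for.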


A Worked Example: Case Summarization

Consider the task of case summarization — one of the most common uses of AI in legal practice. Attorneys use AI to generate summaries of judicial opinions, quickly extracting key holdings, facts, and reasoning from lengthy decisions. But how do you know whether a summary is actually good?

Task instruction

"Summarize the following judicial opinion for a litigator who needs to understand the court's holding on personal jurisdiction, the key facts the court relied on, and any limitations the court placed on its reasoning. Write in plain English, in 300 words or fewer."

Input data

The full text of a judicial opinion — for example, a district court ruling on a motion to dismiss for lack of personal jurisdiction.

Two responses to compare

Response A
Accurately states the holding, but buries it after two paragraphs of procedural history. Omits the court's limiting language. Uses Latin phrases without explanation.
Response B
Leads with the holding in the first sentence. Identifies the two facts the court found dispositive. Flags the key factual distinction limiting the ruling's precedential scope. Written in plain English throughout.

The autograder is asked: which of these would an experienced litigator prefer? If it reliably selects the stronger summary across many such pairs, it can serve as a trustworthy evaluator of AI-generated summaries — without requiring a human attorney to weigh in each time.
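The pieces of this worked example can be pictured as a single record: the instruction, the input material, the two candidate responses, and the expert's preference. A minimal sketch in Python; the class and field names are illustrative assumptions, not an LB2 schema:

```python
from dataclasses import dataclass

@dataclass
class PreferencePair:
    """One preference comparison (hypothetical structure, for illustration)."""
    instruction: str   # the task instruction, e.g. the summarization prompt above
    input_data: str    # the raw legal material, e.g. the full opinion text
    response_a: str    # first candidate response
    response_b: str    # second candidate response
    preferred: str     # "A" or "B": the response the attorneys judged stronger

example = PreferencePair(
    instruction="Summarize the following judicial opinion ...",
    input_data="<full text of the district court opinion>",
    response_a="<summary that buries the holding after procedural history>",
    response_b="<summary that leads with the holding in plain English>",
    preferred="B",
)
```

A benchmark task is then just many such records over different opinions and response pairs, with the autograder's job being to recover the `preferred` label.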


Examples of Tasks We Are Interested In

Below are examples of what an LB2 task looks like in practice. These are illustrative — we are actively seeking ideas beyond these categories.

Legal Education — Law School Exam Grading

Example instruction

"You are grading a first-year law school exam. The question asked students to analyze whether the following facts give rise to a valid negligence claim under New York law. The answer should identify the relevant elements, apply each to the facts, and reach a reasoned conclusion. [Exam question text]"

Input data

A set of student essay responses to the exam question.

Responses to compare

Pairs of student essays — one stronger, one weaker — ideally with attorney or professor annotations indicating which is preferred and why.

Depositions — Question Strategy Evaluation

Example instruction

"You represent the plaintiff in a personal injury case. The following is a summary of the dispute and a profile of the opposing party's key witness. [Case summary + witness profile]. Devise a deposition question strategy aimed at establishing that the witness had prior knowledge of the defect."

Input data

Case summaries and witness profiles from real or anonymized litigation scenarios.

Responses to compare

Pairs of deposition strategies drafted by attorneys, varying in experience or approach. Human-authored responses are strongly preferred.

Brief Writing — Argument Quality

Example instruction

"Draft the argument section of an appellant's brief addressing whether the trial court abused its discretion in excluding the expert's testimony under FRE 702. The relevant trial court ruling and expert report are appended. [Documents]"

Input data

Trial court rulings, expert reports, and relevant procedural history.

Responses to compare

Pairs of attorney-drafted argument sections, varying in quality or strategic approach.

Contract Drafting — Clause Quality

Example instruction

"Draft a forum-selection clause for a commercial services agreement between a Delaware-incorporated technology company and a California-based vendor. The clause should favor the technology company and account for disputes arising under both state and federal law."

Input data

Drafting instructions specifying transaction context, parties, and attorney objectives for each clause.

Responses to compare

Pairs of attorney-drafted clauses for the same scenario.

Regulatory Analysis — Compliance Assessment

Example instruction

"A fintech startup intends to offer a peer-to-peer lending product in all 50 states. Identify the primary federal and state regulatory frameworks that would apply and describe the key compliance obligations the company must address before launch. [Company description and product term sheet]"

Input data

Regulatory questions or compliance scenarios with accompanying company and product information.

Responses to compare

Pairs of legal analyses addressing the same scenario, varying in thoroughness or practical focus.


How to Contribute

Like LegalBench 1, LB2 is designed to be a community effort. There are two ways to participate.

Task Idea Contribution

If you can identify a legal task where AI evaluation matters, we want to hear from you. You do not need to provide any data — describing the task and why it matters is itself valuable. We will work with collaborators to develop datasets around promising ideas, though we cannot guarantee inclusion of every submission.

Task Data Contribution

If you have both a task idea and data to contribute, you can tell us via the contribution form. We will work with you to incorporate the data into LB2. Contributors who provide task data will be credited as co-authors on LB2.

What counts as data?

For task data contributions, we distinguish between three types — they are not equally valuable, and co-authorship requires at minimum the first two.

1. Task instructions — required for co-authorship. The specific prompt or set of prompts defining the legal task. For example: "Summarize the following judicial opinion, identifying the court's holding, the key facts relied upon, and any limitations on the reasoning. Write in plain English, in 300 words or fewer."
2. Input data — required for co-authorship. The raw legal materials the instruction is applied to — for example, the full text of judicial opinions, deposition transcripts, or contract templates.
3. Human-written responses — optional but highly valuable. Pairs of responses drafted by attorneys or other qualified professionals, not AI-generated. These serve as a quality benchmark. If you have access to attorney work product that could serve as response pairs, we strongly encourage you to discuss contribution with us.
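One way to picture how these three data types fit together is as a single contribution bundle. A rough sketch in Python; the field names are hypothetical, not an official LB2 submission format:

```python
# Hypothetical shape of a task data contribution.
# Field names are illustrative, not an official LB2 format.
contribution = {
    # 1. Task instructions (required for co-authorship)
    "task_instructions": [
        "Summarize the following judicial opinion, identifying the court's "
        "holding, the key facts relied upon, and any limitations on the "
        "reasoning. Write in plain English, in 300 words or fewer."
    ],
    # 2. Input data (required for co-authorship)
    "input_data": ["<full text of a judicial opinion>"],
    # 3. Human-written responses (optional but highly valuable)
    "human_written_responses": [
        {
            "response_a": "<attorney-drafted summary>",
            "response_b": "<attorney-drafted summary>",
            "preferred": "A",
        }
    ],
}
```

The first two keys correspond to the minimum required for co-authorship; the third captures the attorney-drafted response pairs that are optional but especially valuable.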

Ready to contribute a task idea or dataset?

Open the contribution form →

For questions about LB2, contribution logistics, or data handling, please contact Neel Guha — nguha@cs.stanford.edu.