publications
Jump to law and policy work, computer science publications, or textbook chapters.
PDFs available on request for paywalled publications.
law and policy
2023
- AI Regulation Has Its Own Alignment Problem: The Technical and Institutional Feasibility of Disclosure, Registration, Licensing, and Auditing. Neel Guha, Christie M. Lawrence, Lindsey A. Gailmard, Kit T. Rodolfa, Faiz Surani, Inioluwa Deborah Raji, Mariano-Florentino Cuéllar, Colleen Honigsberg, Percy Liang, and Daniel E. Ho. George Washington Law Review Symposium on Legally Disruptive Emerging Technologies, forthcoming (2023)
Calls for regulating artificial intelligence (AI) are widespread, but there remains little consensus on both the specific harms that regulation can and should address and the appropriate regulatory actions to take. Computer scientists propose technical solutions that may be infeasible or illegal; lawyers propose regulation that may be technically impossible; and commentators propose policies that may backfire. AI regulation, in that sense, has its own alignment problem, where proposed interventions are often misaligned with societal values. In this Essay, we detail and assess the alignment and the technical and institutional feasibility of four dominant proposals for AI regulation in the United States: disclosure, registration, licensing, and auditing. Our caution against the rush to heavily regulate AI without addressing regulatory alignment is underpinned by three arguments. First, AI regulatory proposals tend to suffer from both regulatory mismatch (i.e., vertical misalignment) and value conflict (i.e., horizontal misalignment). Clarity about a proposal’s objectives, feasibility, and impact may highlight that the proposal is mismatched with the harm it is intended to address. In fact, the impulse for AI regulation may in some instances be better addressed by non-AI regulatory reform. And the more concrete the proposed regulation, the more it will expose tensions and tradeoffs between different regulatory objectives and values. Proposals that purportedly address all that ails AI (safety, trustworthiness, bias, accuracy, and privacy) ignore the reality that many goals cannot be jointly satisfied. Second, the dominant AI regulatory proposals face common technical and institutional feasibility challenges: who in government should coordinate and enforce regulation, how can the scope of regulatory interventions avoid ballooning, and what standards and metrics operationalize trustworthy AI values given the lack of, and unclear path to achieving, technical consensus? Third, the federal government can, to varying degrees, reduce AI regulatory misalignment by designing interventions to account for feasibility and alignment considerations. We thus close with concrete recommendations to minimize misalignment in AI regulation.
- ChatGPT and Physicians’ Malpractice Risk. Michelle M. Mello and Neel Guha. In JAMA Health Forum (2023)
ChatGPT has exploded into the national consciousness. The potential for large language models (LLMs) such as ChatGPT, Bard, and many others to support or replace humans in a range of areas is now clear—and medical decisions are no exception. This has sharpened a perennial medicolegal question: How can physicians incorporate promising new technologies into their practice without increasing liability risk?
- Private Enforcement in the States. Diego Zambrano, Neel Guha, Austin Peters, and Jeffrey Xia. University of Pennsylvania Law Review, forthcoming (2023)
Scholarship on U.S. litigation and civil procedure has scarcely studied the role of private enforcement in the states. Over the past two decades, scholars have established that, almost uniquely in the world, the U.S. often relies on private parties rather than administrative agencies to enforce important statutory provisions. Take your pick of any area in American governance and you will find private rights of action: environmental law, civil rights, employment discrimination, antitrust, consumer protection, business competition, securities fraud, and so on. In each of these areas, Congress deliberately empowered private plaintiffs instead of, or in addition to, government agencies. Yet, despite the vast importance of private enforcement at the federal level, we have no account of how prevalent private rights of action are in state law. And this question is particularly pressing now that a range of states—triggered by the Texas abortion law S.B. 8—are using private enforcement to weaken constitutional rights. Is private enforcement a mode of governance in the states or only at the federal level? If it exists, are there important differences? And what political conditions lead to its adoption?
In this Article, we conduct the first systematic empirical investigation of the hidden world of state private enforcement. Using computational linguistics and machine learning, we identify private enforcement provisions across a unique dataset of all 50 states’ laws going back to 2000. Our results show that private enforcement is ubiquitous at the state level. Even by conservative estimates, there are more than 3,500 private rights of action provisions in state law, ranging from traditional areas like antitrust and employment all the way to privacy violations, lawsuits against police, grave-digging, veterinary care, and waste disposal. Counterintuitively, private enforcement provisions are expanding the most in states like Utah, New Hampshire, Connecticut, Nebraska, and Wisconsin. Much of the growth in private enforcement is concentrated in areas affecting businesses, labor, the environment, and taxes. One takeaway from these results is that state private enforcement is strikingly different from the federal system—sprawling, messier, and even chaotic.
We also use our data to test conventional theories of private enforcement adoption. The most prominent one—the separation of powers theory—posits that Congress enacts private rights of action when the executive is controlled by another political party. Our empirical bottom line, based on regression analyses, is that we fail to find evidence in favor of any of the theories, including separation of powers. We even find no correlation between an increased adoption of private enforcement and legislative control by Democrats or Republicans. It appears the political economy of private enforcement in the states diverges radically from the federal government. With an eye toward future theorizing and empirical testing, we put forth three institutional differences between the states and federal government that may explain this divergence. And we sketch a future comparative research agenda focused on studying federal-state divergence. Reaffirming the central role that private enforcement plays in our system reveals the need to reorient civil procedure and incorporate state private rights of action more explicitly into its core teachings.
2022
- Vulnerabilities in Discovery Tech. Neel Guha, Peter Henderson, and Diego Zambrano. Harvard Journal of Law & Technology (2022)
Recent technological advances are changing the litigation landscape, especially in the context of discovery. For nearly two decades, technologies have reinvented document searches in complex litigation, normalizing the use of machine learning algorithms under the umbrella of “Technology Assisted Review” (TAR). But the latest technological developments are placing discovery beyond the reach of attorney understanding and firmly in the realm of computer science and engineering. As lawyers struggle to keep up, a creeping sense of anxiety is spreading in the legal profession about a lack of transparency and the potential for discovery abuse. Judges, attorneys, bar associations, and scholars warn that lawyers need to closely supervise the technical aspects of TAR and avoid the dangers of sabotage, intentional hacking, or abuse. But these commentators have not fully defined with precision what the risks entail, furnished a clear outline of potential dangers, or defined the appropriate boundaries of debate.
This Article provides a systematic assessment of the potential for abuse in technology-assisted discovery. The Article offers three contributions. First, our most basic aim is to provide a technical but accessible assessment of vulnerabilities in the typical TAR process. To do so, we use the latest computer science research to identify and catalogue the different ways that TAR can go awry, either through intentional abuse or through mistakes. Second, with a better understanding of how discovery can be subverted, we then map potential remedies and reassess current debates in a more helpful light. The upshot is that abuse of technology-assisted discovery is possible but preventable if the right review processes are in place. Finally, we propose reforms to improve the system in the short and long term, with an emphasis on improved metrics that can more fully measure the quality of TAR. By exploring the technical background of discovery abuse, the Article demystifies the engineering substrate of modern discovery. Undertaking this study shows that with the right technical knowledge and assistance, lawyers can safeguard technology-assisted discovery without surrendering professional jurisdiction.
2021
- Building a National AI Research Resource: A Blueprint for the National Research Cloud. Daniel E. Ho, Jennifer King, Russell C. Wald, and Christopher Wan (2021). Contributing Author: §5 (Data Privacy Compliance), §6 (Technical Privacy and Virtual Data Safe Rooms), §8 (Managing Cybersecurity Risks).
computer science
2024
- Prospector Heads: Generalized Feature Attribution for Large Models & Data. Gautam Machiraju, Alexander Derry, Arjun Desai, Neel Guha, Amir-Hossein Karimi, James Zou, Russ Altman, Christopher Ré, and Parag Mallick. arXiv (2024)
Feature attribution, the ability to localize regions of the input data that are relevant for classification, is an important capability for machine learning models in scientific and biomedical domains. Current methods for feature attribution, which rely on "explaining" the predictions of end-to-end classifiers, suffer from imprecise feature localization and are inadequate for use with small sample sizes and high-dimensional datasets due to computational challenges. We introduce prospector heads, an efficient and interpretable alternative to explanation-based methods for feature attribution that can be applied to any encoder and any data modality. Prospector heads generalize across modalities through experiments on sequences (text), images (pathology), and graphs (protein structures), outperforming baseline attribution methods by up to 49 points in mean localization AUPRC. We also demonstrate how prospector heads enable improved interpretation and discovery of class-specific patterns in the input data. Through their high performance, flexibility, and generalizability, prospectors provide a framework for improving trust and transparency for machine learning models in complex domains.
2023
- LegalBench: A Collaboratively Built Benchmark for Measuring Legal Reasoning in Large Language Models. Neel Guha, Julian Nyarko, Daniel E. Ho, Christopher Ré, Adam Chilton, Aditya Narayana, Alex Chohlas-Wood, Austin Peters, Brandon Waldon, Daniel N. Rockmore, and 30 more authors. Advances in Neural Information Processing Systems (2023)
The advent of large language models (LLMs) and their adoption by the legal community has given rise to the question: what types of legal reasoning can LLMs perform? To enable greater study of this question, we present LegalBench: a collaboratively constructed legal reasoning benchmark consisting of 162 tasks covering six different types of legal reasoning. LegalBench was built through an interdisciplinary process, in which we collected tasks designed and hand-crafted by legal professionals. Because these subject matter experts took a leading role in construction, tasks either measure legal reasoning capabilities that are practically useful, or measure reasoning skills that lawyers find interesting. To enable cross-disciplinary conversations about LLMs in the law, we additionally show how popular legal frameworks for describing legal reasoning – which distinguish between its many forms – correspond to LegalBench tasks, thus giving lawyers and LLM developers a common vocabulary. This paper describes LegalBench, presents an empirical evaluation of 20 open-source and commercial LLMs, and illustrates the types of research explorations LegalBench enables.
- Embroid: Unsupervised Prediction Smoothing Can Improve Few-Shot Classification. Neel Guha, Mayee F. Chen, Kush Bhatia, Azalia Mirhoseini, Frederic Sala, and Christopher Ré. Advances in Neural Information Processing Systems (2023)
Recent work has shown that language models’ (LMs) prompt-based learning capabilities make them well suited for automating data labeling in domains where manual annotation is expensive. The challenge is that while writing an initial prompt is cheap, improving a prompt is costly – practitioners often require significant labeled data in order to evaluate the impact of prompt modifications. Our work asks whether it is possible to improve prompt-based learning without additional labeled data. We approach this problem by attempting to modify the predictions of a prompt, rather than the prompt itself. Our intuition is that accurate predictions should also be consistent: samples which are similar under some feature representation should receive the same prompt prediction. We propose Embroid, a method which computes multiple representations of a dataset under different embedding functions, and uses the consistency between the LM predictions for neighboring samples to identify mispredictions. Embroid then uses these neighborhoods to create additional predictions for each sample, and combines these predictions with a simple latent variable graphical model in order to generate a final corrected prediction. In addition to providing a theoretical analysis of Embroid, we conduct a rigorous empirical evaluation across six different LMs and up to 95 different tasks. We find that (1) Embroid substantially improves performance over original prompts (e.g., by an average of 7.3 points on GPT-JT), (2) also realizes improvements for more sophisticated prompting strategies (e.g., chain-of-thought), and (3) can be specialized to domains like law through the embedding functions.
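The neighborhood-consistency idea described above can be illustrated with a minimal sketch. This is a deliberate simplification under stated assumptions: it replaces Embroid's latent variable graphical model with plain majority voting, and the function name and interface are illustrative rather than the released implementation.

```python
import numpy as np

def smooth_predictions(preds, embeddings_list, k=5):
    """Smooth LM predictions by letting each embedding space cast one
    corrective "vote" per sample: the majority label among that sample's
    k nearest neighbors. Votes are then combined by simple majority
    (a hypothetical stand-in for Embroid's graphical-model combination)."""
    preds = np.asarray(preds)
    n = len(preds)
    votes = [preds]  # the original prompt predictions are one vote
    for emb in embeddings_list:
        emb = emb / np.linalg.norm(emb, axis=1, keepdims=True)
        sims = emb @ emb.T                 # cosine similarities
        np.fill_diagonal(sims, -np.inf)    # exclude each sample itself
        nbrs = np.argsort(-sims, axis=1)[:, :k]
        votes.append(np.array([np.bincount(preds[row]).argmax()
                               for row in nbrs]))
    votes = np.stack(votes)                # (1 + n_embeddings, n)
    return np.array([np.bincount(votes[:, i]).argmax() for i in range(n)])
```

Samples whose neighbors consistently disagree with the original prompt prediction are flipped toward the neighborhood consensus, which is how inconsistent (and likely mistaken) predictions get corrected.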
- Don’t Use a Cannon to Kill a Fly: An Efficient Cascading Pipeline for Long Documents. Zehua Li, Neel Guha, and Julian Nyarko. International Conference on AI and Law (2023)
The computational cost of transformer-based models has a quadratic dependence on the length of the input sequence. This makes it challenging to deploy these models in domains with especially lengthy documents, such as the legal domain. To address this issue, we propose a three-stage cascading approach for long document classification. We begin by filtering out likely irrelevant information with a lightweight logistic regression model before passing the more challenging inputs to the transformer-based model. We evaluate our approach using CUAD, a legal dataset with 510 manually annotated long contracts. We find that the cascading approach reduces training time by up to 80% while improving upon baseline performance. We hypothesize that the gains in performance stem from localizing the classification task of the transformer model to particularly difficult examples.
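The filtering idea can be sketched generically: a cheap scorer screens out confidently irrelevant inputs so the expensive model only sees the hard cases. All names and the threshold below are illustrative assumptions, not the paper's code.

```python
def cascade_classify(docs, cheap_score, expensive_predict, threshold=0.1):
    """Two-model cascade: documents the cheap model confidently scores
    as irrelevant are labeled 0 without invoking the expensive model;
    the rest are routed to the expensive model."""
    labels = []
    for doc in docs:
        if cheap_score(doc) < threshold:        # cheap filter, e.g. logistic regression
            labels.append(0)                    # confidently irrelevant: skip the transformer
        else:
            labels.append(expensive_predict(doc))  # e.g. a transformer classifier
    return labels
```

The threshold trades recall for compute: lowering it sends more documents to the expensive model, raising it filters more aggressively.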
- Holistic Evaluation of Language Models. Percy Liang, Rishi Bommasani, Tony Lee, Dimitris Tsipras, Dilara Soylu, Michihiro Yasunaga, Yian Zhang, Deepak Narayanan, Yuhuai Wu, Ananya Kumar, and 40 more authors. Transactions on Machine Learning Research (2023)
Language models (LMs) are becoming the foundation for almost all major language technologies, but their capabilities, limitations, and risks are not well understood. We present Holistic Evaluation of Language Models (HELM) to improve the transparency of language models. First, we taxonomize the vast space of potential scenarios (i.e. use cases) and metrics (i.e. desiderata) that are of interest for LMs. Then we select a broad subset based on coverage and feasibility, noting what’s missing or underrepresented (e.g. question answering for neglected English dialects, metrics for trustworthiness). Second, we adopt a multi-metric approach: We measure 7 metrics (accuracy, calibration, robustness, fairness, bias, toxicity, and efficiency) for each of 16 core scenarios when possible (87.5% of the time). This ensures metrics beyond accuracy don’t fall by the wayside, and that trade-offs are clearly exposed. We also perform 7 targeted evaluations, based on 26 targeted scenarios, to analyze specific aspects (e.g. reasoning, disinformation). Third, we conduct a large-scale evaluation of 30 prominent language models (spanning open, limited-access, and closed models) on all 42 scenarios, 21 of which were not previously used in mainstream LM evaluation. Prior to HELM, models on average were evaluated on just 17.9% of the core HELM scenarios, with some prominent models not sharing a single scenario in common. We improve this to 96.0%: now all 30 models have been densely benchmarked on the same core scenarios and metrics under standardized conditions. Our evaluation surfaces 25 top-level findings. For full transparency, we release all raw model prompts and completions publicly for further analysis, as well as a general modular toolkit. We intend for HELM to be a living benchmark for the community, continuously updated with new scenarios, metrics, and models.
- Ask Me Anything: A Simple Strategy for Prompting Language Models. Simran Arora, Avanika Narayan, Mayee F. Chen, Laurel J. Orr, Neel Guha, Kush Bhatia, Ines Chami, Frederic Sala, and Christopher Ré. International Conference on Learning Representations (2023)
Large language models (LLMs) transfer well to new tasks out-of-the-box simply given a natural language prompt that demonstrates how to perform the task and no additional training. Prompting is a brittle process wherein small modifications to the prompt can cause large variations in the model predictions, and therefore significant effort is dedicated towards designing a painstakingly crafted "perfect prompt" for a task. To mitigate the high degree of effort, we instead ask whether collecting multiple decent, yet imperfect, prompts and aggregating them can lead to a high quality prompting strategy. Our observations motivate our proposed method, Ask Me Anything (AMA). We first develop an understanding of the effective prompt formats, finding that question-answering (QA) prompts, which encourage open-ended generation ("Who went to the park?"), tend to outperform those that restrict the model outputs ("John went to the park. True or False?"). AMA recursively uses the LLM to transform task inputs to the effective QA format. AMA generates multiple questions per input and applies these prompts to collect several noisy "votes" for the input’s true label. We find the prompts have varying accuracies and dependencies and thus propose to use weak supervision, a procedure for combining the noisy predictions, to produce the final predictions. We evaluate AMA across open-source model families (EleutherAI, BLOOM, OPT, and T0) and sizes (125M-175B parameters), demonstrating an average performance lift of 10.2% over the few-shot baseline. This simple strategy enables the open-source GPT-J-6B model to match and exceed the performance of few-shot GPT3-175B on 15 of 20 popular benchmarks. Averaged across these tasks, the GPT-J-6B model outperforms few-shot GPT3-175B.
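The vote-collection step can be sketched in a few lines. This is a simplification: it combines the per-prompt votes by plain majority, whereas AMA learns prompt accuracies and dependencies with weak supervision; the function name and prompt templates are hypothetical.

```python
from collections import Counter

def ama_aggregate(inputs, prompts, llm):
    """Apply every QA-style prompt to each input, collect the noisy
    per-prompt "votes", and combine them by majority vote (a stand-in
    for AMA's weak-supervision aggregation)."""
    labels = []
    for x in inputs:
        votes = [llm(p.format(x)) for p in prompts]
        labels.append(Counter(votes).most_common(1)[0][0])
    return labels
```

The point of aggregating is that no single prompt needs to be "perfect"; several imperfect prompts whose errors are not fully correlated can outvote each other's mistakes.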
2022
- Pile of Law: Learning Responsible Data Filtering from the Law and a 256GB Open-Source Legal Dataset. Peter Henderson, Mark Krass, Lucia Zheng, Neel Guha, Christopher D. Manning, Dan Jurafsky, and Daniel Ho. Advances in Neural Information Processing Systems (2022)
One concern with the rise of large language models lies with their potential for significant harm, particularly from pretraining on biased, obscene, copyrighted, and private information. Emerging ethical approaches have attempted to filter pretraining material, but such approaches have been ad hoc and failed to take context into account. We offer an approach to filtering grounded in law, which has directly addressed the tradeoffs in filtering material. First, we gather and make available the Pile of Law, a 256GB (and growing) dataset of open-source English-language legal and administrative data, covering court opinions, contracts, administrative rules, and legislative records. Pretraining on the Pile of Law may help with legal tasks that have the promise to improve access to justice. Second, we distill the legal norms that governments have developed to constrain the inclusion of toxic or private content into actionable lessons for researchers and discuss how our dataset reflects these norms. Third, we show how the Pile of Law offers researchers the opportunity to learn such filtering rules directly from the data, providing an exciting new research direction in model-based processing.
- LegalBench: Prototyping a Collaborative Benchmark for Legal Reasoning. Neel Guha, Daniel E. Ho, Julian Nyarko, and Christopher Ré (2022)
Can foundation models be guided to execute tasks involving legal reasoning? We believe that building a benchmark to answer this question will require sustained collaborative efforts between the computer science and legal communities. To that end, this short paper serves three purposes. First, we describe how IRAC, a framework legal scholars use to distinguish different types of legal reasoning, can guide the construction of a foundation-model-oriented benchmark. Second, we present a seed set of 44 tasks built according to this framework. We discuss initial findings and highlight directions for new tasks. Finally, inspired by the Open Science movement, we make a call for the legal and computer science communities to join our efforts by contributing new tasks.
2021
- On the Opportunities and Risks of Foundation Models. Rishi Bommasani, Drew A. Hudson, Ehsan Adeli, Russ Altman, Simran Arora, Sydney von Arx, Michael S. Bernstein, Jeannette Bohg, Antoine Bosselut, Emma Brunskill, and 104 more authors (2021)
AI is undergoing a paradigm shift with the rise of models (e.g., BERT, DALL-E, GPT-3) that are trained on broad data at scale and are adaptable to a wide range of downstream tasks. We call these models foundation models to underscore their critically central yet incomplete character. This report provides a thorough account of the opportunities and risks of foundation models, ranging from their capabilities (e.g., language, vision, robotics, reasoning, human interaction) and technical principles (e.g., model architectures, training procedures, data, systems, security, evaluation, theory) to their applications (e.g., law, healthcare, education) and societal impact (e.g., inequity, misuse, economic and environmental impact, legal and ethical considerations). Though foundation models are based on standard deep learning and transfer learning, their scale results in new emergent capabilities, and their effectiveness across so many tasks incentivizes homogenization. Homogenization provides powerful leverage but demands caution, as the defects of the foundation model are inherited by all the adapted models downstream. Despite the impending widespread deployment of foundation models, we currently lack a clear understanding of how they work, when they fail, and what they are even capable of due to their emergent properties. To tackle these questions, we believe much of the critical research on foundation models will require deep interdisciplinary collaboration commensurate with their fundamentally sociotechnical nature.
- When Does Pretraining Help? Assessing Self-Supervised Learning for Law and the CaseHOLD Dataset. Lucia Zheng, Neel Guha, Brandon R. Anderson, Peter Henderson, and Daniel E. Ho. International Conference on AI and Law (2021)
While self-supervised learning has made rapid advances in natural language processing, it remains unclear when researchers should engage in resource-intensive domain-specific pretraining (domain pretraining). The law, puzzlingly, has yielded few documented instances of substantial gains to domain pretraining in spite of the fact that legal language is widely seen to be unique. We hypothesize that these existing results stem from the fact that existing legal NLP tasks are too easy and fail to meet conditions for when domain pretraining can help. To address this, we first present CaseHOLD (Case Holdings On Legal Decisions), a new dataset of over 53,000 multiple-choice questions to identify the relevant holding of a cited case. This dataset presents a fundamental task to lawyers and is both legally meaningful and difficult from an NLP perspective (F1 of 0.4 with a BiLSTM baseline). Second, we assess performance gains on CaseHOLD and existing legal NLP datasets. While a Transformer architecture (BERT) pretrained on a general corpus (Google Books and Wikipedia) improves performance, domain pretraining (using a corpus of approximately 3.5M decisions across all courts in the U.S., larger than BERT’s pretraining corpus) with a custom legal vocabulary exhibits the most substantial performance gains with CaseHOLD (gain of 7.2% on F1, representing a 12% improvement on BERT) and consistent performance gains across two other legal tasks. Third, we show that domain pretraining may be warranted when the task exhibits sufficient similarity to the pretraining corpus: the level of performance increase in three legal tasks was directly tied to the domain specificity of the task. Our findings inform when researchers should engage in resource-intensive pretraining and show that Transformer-based architectures, too, learn embeddings suggestive of distinct legal language.
- Leveraging administrative data for bias audits: Assessing disparate coverage with mobility data for COVID-19 policy. Amanda Coston, Neel Guha, Derek Ouyang, Lisa Lu, Alexandra Chouldechova, and Daniel E. Ho. In Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency (2021)
Anonymized smartphone-based mobility data has been widely adopted in devising and evaluating COVID-19 response strategies such as the targeting of public health resources. Yet little attention has been paid to measurement validity and demographic bias, due in part to the lack of documentation about which users are represented as well as the challenge of obtaining ground truth data on unique visits and demographics. We illustrate how linking large-scale administrative data can enable auditing mobility data for bias in the absence of demographic information and ground truth labels. More precisely, we show that linking voter roll data – containing individual-level voter turnout for specific voting locations along with race and age – can facilitate the construction of rigorous bias and reliability tests. These tests illuminate a sampling bias that is particularly noteworthy in the pandemic context: older and non-white voters are less likely to be captured by mobility data. We show that allocating public health resources based on such mobility data could disproportionately harm high-risk elderly and minority groups.
- Bootleg: Chasing the Tail with Self-Supervised Named Entity Disambiguation. Laurel J. Orr, Megan Leszczynski, Simran Arora, Sen Wu, Neel Guha, Xiao Ling, and Christopher Ré. Conference on Innovative Data Systems Research (2021)
A challenge for named entity disambiguation (NED), the task of mapping textual mentions to entities in a knowledge base, is how to disambiguate entities that appear rarely in the training data, termed tail entities. Humans use subtle reasoning patterns based on knowledge of entity facts, relations, and types to disambiguate unfamiliar entities. Inspired by these patterns, we introduce Bootleg, a self-supervised NED system that is explicitly grounded in reasoning patterns for disambiguation. We define core reasoning patterns for disambiguation, create a learning procedure to encourage the self-supervised model to learn the patterns, and show how to use weak supervision to enhance the signals in the training data. Encoding the reasoning patterns in a simple Transformer architecture, Bootleg meets or exceeds state-of-the-art on three NED benchmarks. We further show that the learned representations from Bootleg successfully transfer to other non-disambiguation tasks that require entity-based knowledge: we improve the state-of-the-art on the popular TACRED relation extraction task by 1.0 F1 points and demonstrate up to an 8% performance lift in highly optimized production search and assistant tasks at a major technology company.
2019
- Machine Learning for AC Optimal Power Flow. Neel Guha, Zhecheng Wang, Matt Wytock, and Arun Majumdar. Climate Change Workshop at the International Conference on Machine Learning (2019)
We explore machine learning methods for AC Optimal Power Flow (ACOPF), the task of optimizing power generation in a transmission network while respecting physical and engineering constraints. We present two formulations of ACOPF as a machine learning problem: 1) an end-to-end prediction task, where we directly predict the optimal generator settings, and 2) a constraint prediction task, where we predict the set of active constraints in the optimal solution. We validate these approaches on two benchmark grids.
- One-Shot Federated Learning. Neel Guha, Ameet Talwalkar, and Virginia Smith. 2nd Workshop on Machine Learning on the Phone and other Consumer Devices at Neural Information Processing Systems (2019)
We present one-shot federated learning, where a central server learns a global model over a network of federated devices in a single round of communication. Our approach - drawing on ensemble learning and knowledge aggregation - achieves an average relative gain of 51.5% in AUC over local baselines and comes within 90.1% of the (unattainable) global ideal. We discuss these methods and identify several promising directions of future work.
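The one-shot setup can be sketched under simplifying assumptions: each device trains a tiny nearest-class-mean model locally, and the server aggregates the models in a single round of communication by averaging their per-class scores. The paper's actual methods draw on ensemble learning and knowledge aggregation; the model choice and names below are illustrative.

```python
import numpy as np

def local_train(X, y):
    """On-device step: fit a tiny nearest-class-mean model (a stand-in
    for whatever model each federated device trains locally)."""
    return {c: X[y == c].mean(axis=0) for c in np.unique(y)}

def ensemble_predict(models, x):
    """Server step, one round of communication: collect every device's
    model once and ensemble them by averaging per-class similarity
    scores (negative distance to each device's class mean)."""
    classes = sorted({c for m in models for c in m})
    scores = {c: np.mean([-np.linalg.norm(x - m[c]) for m in models if c in m])
              for c in classes}
    return max(scores, key=scores.get)
```

Because only the fitted models travel to the server, and only once, no raw data leaves the devices and no iterative communication rounds are needed.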
textbook chapters
2023
- Gamesmanship in Modern Discovery Tech. Diego Zambrano, Neel Guha, and Peter Henderson. In Legal Tech and the Future of Civil Justice (Cambridge University Press, 2023)