
Borrowing from law to filter training data for foundation models



Foundation models are often trained on what is essentially the entire Internet. Learning from such a huge data set allows them to impressively remember and reproduce information that we want them to learn. For example, they can learn to accurately answer factual questions, such as “Who is the President of the United States?”

At the same time, however, foundation models can remember and reproduce information that can be harmful. For example, they may disclose people’s Social Security numbers, credit card information, or criminal records, or answer questions about Muslims by suggesting that they are terrorists.

These are problems foundation model makers must solve, says Peter Henderson, a JD/Ph.D. student at Stanford: “We don’t want models to associate people with their private content or with harmful traits.”

To avoid such consequences, foundation model makers sometimes try to filter out private or toxic content before using a data set to train a model. But removing most – if not all – private or toxic content from the entire internet is an enormous challenge. One reason: context matters. Privacy expectations differ across cultures and even over time. And deciding whether a phrase is toxic can depend on who is speaking, why they’re using that phrase, and what readers expect. In short, it’s a balancing act, and different researchers use different standards.


“We wondered if there was a more principled way to filter pretraining data,” says Henderson. He and his colleagues, including Mark Krass, also a JD/PhD student, had an idea: look to the law. There is a long history of courts setting standards for information disclosure, so why not import those standards into the machine learning (ML) environment?

To test their idea, Henderson and his colleagues gathered Pile of Law, an extensive dataset of judicial and administrative opinions, legal code, casebooks, and other legal documents. They then explored whether Pile of Law could help identify a principled way to filter pretraining data, with a particular focus on privacy and toxicity.
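For readers who want to explore the corpus themselves, Pile of Law is distributed through the Hugging Face Hub. The sketch below streams one subset; the config name and the “text” field are assumptions on my part, so check the dataset card for the subsets and fields that actually exist.

```python
# A minimal sketch of streaming a Pile of Law subset from the Hugging Face Hub.
# The config name ("courtlistener_opinions") and the "text" field are assumptions;
# consult the dataset card for what is actually available. Depending on your
# version of the datasets library, you may also need trust_remote_code=True.
from datasets import load_dataset

opinions = load_dataset(
    "pile-of-law/pile-of-law",
    "courtlistener_opinions",  # assumed config name for court opinions
    split="train",
    streaming=True,            # the corpus is large, so stream rather than download it all
)

for i, record in enumerate(opinions):
    print(record["text"][:200])  # peek at the first few documents
    if i == 2:
        break
```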

Based on the team’s first experiments, Pile of Law offers some valuable opportunities: First, it can help researchers ensure that their training data meets minimum legal standards. And second, it can reveal problems with commonly used filtering standards, such as those for toxicity.

Filtering for privacy

When Henderson and Krass first looked at the datasets currently used to train foundation models, they found none that had been explicitly filtered for personally sensitive information. So they decided to identify the standards courts and governments use to balance privacy and transparency, and then test whether the implicit use of those standards in Pile of Law could point them toward a nuanced approach to data filtering.

First, the team took stock of the different ways courts have addressed privacy issues. They found some clear rules that model designers could adapt to filter their training data. For example, no U.S. jurisdiction discloses minors’ names, Social Security numbers, financial account numbers, or dates of birth.
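These clear rules translate naturally into automated checks. Below is a minimal, hypothetical sketch of pattern-based redaction for Social Security numbers, account-style numbers, and dates of birth; the patterns and the [REDACTED] placeholder are illustrative only, and something like minors’ names would require entity recognition rather than regexes.

```python
import re

# Rough, illustrative patterns for identifiers that no U.S. jurisdiction discloses.
# These are not the researchers' filters; they only show the shape of a bright-line check.
PATTERNS = {
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "account_number": re.compile(r"\b(?:\d[ -]?){13,16}\b"),
    "date_of_birth": re.compile(
        r"\b(?:born|d\.o\.b\.?|date of birth)[:\s]+\d{1,2}/\d{1,2}/\d{2,4}\b",
        re.IGNORECASE,
    ),
}

def redact(text: str) -> str:
    """Replace matches of each bright-line pattern with a redaction tag."""
    for label, pattern in PATTERNS.items():
        text = pattern.sub(f"[REDACTED {label.upper()}]", text)
    return text

print(redact("Claimant (born 4/12/1988, SSN 123-45-6789) seeks benefits."))
```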

But they also found approaches that were more contextual. U.S. courts typically disclose people’s criminal records and the names of litigants in civil cases, for instance, but there are exceptions: in sexual assault cases, victims’ names are often pseudonymized. Similarly, administrative judges use their discretion to protect the names of people who come before them in contexts such as claims for disability benefits or political asylum.

The existence of these contextual standards means that certain subsets of Pile of Law are already implicitly filtered to protect the privacy of certain people. For example, in the context of immigration, people seeking asylum and claiming they were tortured in their own countries are likely to have been given pseudonyms on the public record.

Henderson and his team decided to test whether a model could learn these contextualized norms by using Pile of Law as training data. The result: a model that predicts with 80% accuracy whether a paragraph in an immigration case should use a pseudonym or not. And they showed that these predictions were in line with the law: sentences referring to asylum and torture were more likely to lead to pseudonymity than sentences referring to criminal offenses.
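The paper’s exact setup is not reproduced here, but the shape of the task is a binary paragraph classifier. The sketch below is a stand-in using TF-IDF features and logistic regression on a few invented, hand-labeled paragraphs (1 = should be pseudonymized, 0 = the real name can stand); the team’s actual model, labels, and training data differ.

```python
# A stand-in for the pseudonymity classifier described in the article, not the team's model.
# The paragraphs and labels below are hypothetical examples for illustration only.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

paragraphs = [
    "The applicant testified that she was tortured after seeking asylum.",
    "The respondent was convicted of burglary in 2009.",
    "Petitioner fled her country fearing persecution and applied for asylum.",
    "The defendant pleaded guilty to the charged offense.",
]
labels = [1, 0, 1, 0]  # 1 = pseudonymize, 0 = leave the name in

model = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LogisticRegression())
model.fit(paragraphs, labels)

# Predict whether a new paragraph should refer to the person by pseudonym.
print(model.predict(["The applicant alleges torture by government forces."]))
```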

These and several other experiments suggest that Pile of Law could help researchers develop context-appropriate privacy filters, Henderson says. Next, the team wants to extend these efforts beyond the legal realm: Could a model learn to pseudonymize the names of asylum seekers in a dataset that spans the entire internet?

Filtering for toxicity

In the toxicity arena, Henderson and Krass found a different landscape. Existing filters are widely used and go far beyond what legal standards would suggest. Indeed, applying current toxicity filters to Pile of Law could filter out key portions of some important civil rights-era legal precedents, including Brown v. Board of Education, the landmark case that led to the desegregation of schools in the United States.

In addition, the team found that existing filters may remove toxic content from shorter pieces of text while leaving the same content in place when it appears in longer written works – an unexplained result that is potentially problematic.
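One way to see this kind of behavior is to score the same sentence on its own and again buried inside a longer passage. The sketch below uses the open-source Detoxify classifier as a stand-in for the filters the team evaluated (not necessarily the same ones); the example sentences are invented.

```python
# A sketch of the length effect, using the open-source Detoxify classifier as a
# stand-in for the filters studied in the article.
from detoxify import Detoxify

snippet = "That argument is idiotic and anyone who makes it is a moron."
long_document = (
    "The court reviewed the record at length before turning to the parties' briefs. " * 12
) + snippet  # the same sentence embedded in a much longer, otherwise neutral passage

model = Detoxify("original")
scores = model.predict([snippet, long_document])

# If a filter thresholds on a single document-level score, the same sentence can be
# flagged on its own yet slip through when it appears inside a longer work.
print("short :", round(scores["toxicity"][0], 3))
print("long  :", round(scores["toxicity"][1], 3))
```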

“The lesson is to think more carefully before pulling a filter off the shelf to filter data for training,” says Henderson. “We therefore advocate for more research to properly address toxicity in the training data.”

While Henderson and Krass hope that Pile of Law will lead to less ad hoc filtering of training data than is currently the norm, they also have a second goal: to use Pile of Law to build foundation models capable of legal reasoning.

The team has already shown that foundation models have a poor understanding of how to apply the law to a set of facts. But Henderson hopes AI systems will one day improve lawyers’ efficiency and thoroughness by, for example, checking their citations and identifying all relevant arguments in a case. The aim, he says, is to improve access to justice for people who cannot afford a lawyer.

“It’s a tough challenge, but why not aim for a hard problem to solve?” he says. “And one that can really help people.”

Katharine Miller is a contributing writer for the Stanford Institute for Human-Centered AI.

This story originally appeared on hai.stanford.edu. Copyright 2022.
