
Borrowing from law to filter training data for foundation models



Foundation models are often trained on what is essentially the entire Internet. Learning from such a huge data set allows them to impressively remember and reproduce information that we want them to learn. For example, they can learn to accurately answer factual questions, such as “Who is the President of the United States?”

At the same time, however, foundation models can remember and reproduce information that can be harmful. For example, they may disclose people’s Social Security numbers, credit card information, or criminal records, or answer questions about Muslims by suggesting that they are terrorists.

These are problems foundation model makers must solve, says Peter Henderson, a JD/Ph.D. student at Stanford: “We don’t want models to associate people with their private content or with harmful traits.”

To avoid such consequences, foundation model makers sometimes try to filter out private or toxic content before using a data set to train a model. But removing most – if not all – private or toxic content from the entire internet is an enormous challenge. One reason: context matters. Privacy expectations differ across cultures and even over time. And deciding whether a phrase is toxic can depend on who is speaking, why they’re using that phrase, and what readers expect. In short, it’s a balancing act, and different researchers use different standards.


“We wondered if there was a more principled way to filter pretraining data,” says Henderson. He and his colleagues, including Mark Krass, also a JD/PhD student, had an idea: look to the law. There is a long history of courts setting standards for information disclosure, so why not import those standards into the machine learning (ML) environment?

To test their idea, Henderson and his colleagues gathered Pile of Law, an extensive dataset of judicial and administrative opinions, legal code, casebooks, and other legal documents. They then explored whether Pile of Law could help identify a principled way to filter pretraining data, with a particular focus on privacy and toxicity.
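For readers who want to explore the corpus themselves, Pile of Law is distributed through the Hugging Face Hub. The sketch below streams one subset; the config name and the “text” field are assumptions on my part, so check the dataset card for the subsets and fields that actually exist.

```python
# A minimal sketch of streaming a Pile of Law subset from the Hugging Face Hub.
# The config name ("courtlistener_opinions") and the "text" field are assumptions;
# consult the dataset card for what is actually available. Depending on your
# version of the datasets library, you may also need trust_remote_code=True.
from datasets import load_dataset

opinions = load_dataset(
    "pile-of-law/pile-of-law",
    "courtlistener_opinions",  # assumed config name for court opinions
    split="train",
    streaming=True,            # the corpus is large, so stream rather than download it all
)

for i, record in enumerate(opinions):
    print(record["text"][:200])  # peek at the first few documents
    if i == 2:
        break
```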

Based on the team’s first experiments, Pile of Law offers some valuable opportunities: First, it can help researchers ensure that their training data meets minimum legal standards. And second, it can reveal problems with commonly used filtering standards, such as those for toxicity.

Filtering for privacy

When Henderson and Krass first looked at the datasets currently used to train foundation models, they found none that had been explicitly filtered for personally sensitive information. So they decided to identify the standards courts and governments use to balance privacy and transparency, and then test whether the implicit use of those standards in Pile of Law could point them toward a nuanced approach to data filtering.

First, the team took stock of the different ways courts have addressed privacy issues. They found some clear rules that model designers could adapt to filter their training data. For example, no U.S. jurisdiction discloses minors’ names, Social Security numbers, financial account numbers, or dates of birth.
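These clear rules translate naturally into automated checks. Below is a minimal, hypothetical sketch of pattern-based redaction for Social Security numbers, account-style numbers, and dates of birth; the patterns and the [REDACTED] placeholder are illustrative only, and something like minors’ names would require entity recognition rather than regexes.

```python
import re

# Rough, illustrative patterns for identifiers that no U.S. jurisdiction discloses.
# These are not the researchers' filters; they only show the shape of a bright-line check.
PATTERNS = {
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "account_number": re.compile(r"\b(?:\d[ -]?){13,16}\b"),
    "date_of_birth": re.compile(
        r"\b(?:born|d\.o\.b\.?|date of birth)[:\s]+\d{1,2}/\d{1,2}/\d{2,4}\b",
        re.IGNORECASE,
    ),
}

def redact(text: str) -> str:
    """Replace matches of each bright-line pattern with a redaction tag."""
    for label, pattern in PATTERNS.items():
        text = pattern.sub(f"[REDACTED {label.upper()}]", text)
    return text

print(redact("Claimant (born 4/12/1988, SSN 123-45-6789) seeks benefits."))
```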

But they also found approaches that were more contextual. U.S. courts typically disclose people’s criminal records and the names of litigants in civil cases, for instance, but there are exceptions: in sexual assault cases, victims’ names are often pseudonymized. Similarly, administrative judges use their discretion to protect the names of people who come before them in contexts such as claims for disability benefits or political asylum.

The existence of these contextual standards means that certain subsets of Pile of Law are already implicitly filtered to protect the privacy of certain people. For example, in the context of immigration, people seeking asylum and claiming they were tortured in their own countries are likely to have been given pseudonyms on the public record.

Henderson and his team decided to test whether a model could learn these contextualized norms by using Pile of Law as training data. The result: a model that predicts with 80% accuracy whether a paragraph in an immigration case should use a pseudonym or not. And they showed that these predictions were in line with the law: sentences referring to asylum and torture were more likely to lead to pseudonymity than sentences referring to criminal offenses.
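The paper’s exact setup is not reproduced here, but the shape of the task is a binary paragraph classifier. The sketch below is a stand-in using TF-IDF features and logistic regression on a few invented, hand-labeled paragraphs (1 = should be pseudonymized, 0 = the real name can stand); the team’s actual model, labels, and training data differ.

```python
# A stand-in for the pseudonymity classifier described in the article, not the team's model.
# The paragraphs and labels below are hypothetical examples for illustration only.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

paragraphs = [
    "The applicant testified that she was tortured after seeking asylum.",
    "The respondent was convicted of burglary in 2009.",
    "Petitioner fled her country fearing persecution and applied for asylum.",
    "The defendant pleaded guilty to the charged offense.",
]
labels = [1, 0, 1, 0]  # 1 = pseudonymize, 0 = leave the name in

model = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LogisticRegression())
model.fit(paragraphs, labels)

# Predict whether a new paragraph should refer to the person by pseudonym.
print(model.predict(["The applicant alleges torture by government forces."]))
```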

These and several other experiments suggest that Pile of Law could help researchers develop context-appropriate privacy filters, Henderson says. Next, the team wants to extend these efforts beyond the legal realm: Could a model learn to pseudonymize the names of asylum seekers in a dataset that spans the entire internet?

Filtering for toxicity

In the toxicity arena, Henderson and Krass found a different landscape. Existing filters are widely used and go far beyond what legal standards would suggest. Indeed, applying current toxicity filters to Pile of Law could filter out key portions of some important civil rights-era legal precedents, including Brown v. Board of Education, the landmark case that led to the desegregation of schools in the United States.

In addition, the team found that existing filters may remove toxic content from shorter pieces of text while leaving the same content in place when it appears in longer written works – an unexplained result that is potentially problematic.
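One way to see this kind of behavior is to score the same sentence on its own and again buried inside a longer passage. The sketch below uses the open-source Detoxify classifier as a stand-in for the filters the team evaluated (not necessarily the same ones); the example sentences are invented.

```python
# A sketch of the length effect, using the open-source Detoxify classifier as a
# stand-in for the filters studied in the article.
from detoxify import Detoxify

snippet = "That argument is idiotic and anyone who makes it is a moron."
long_document = (
    "The court reviewed the record at length before turning to the parties' briefs. " * 12
) + snippet  # the same sentence embedded in a much longer, otherwise neutral passage

model = Detoxify("original")
scores = model.predict([snippet, long_document])

# If a filter thresholds on a single document-level score, the same sentence can be
# flagged on its own yet slip through when it appears inside a longer work.
print("short :", round(scores["toxicity"][0], 3))
print("long  :", round(scores["toxicity"][1], 3))
```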

“The lesson is to think more carefully before pulling a filter off the shelf to filter data for training,” says Henderson. “We therefore advocate for more research to properly address toxicity in the training data.”

While Henderson and Krass hope that Pile of Law will lead to less ad hoc filtering of training data than is currently the norm, they also have a second goal: to use Pile of Law to build foundation models capable of legal reasoning.

The team has already shown that foundation models have a poor understanding of how to apply the law to a set of facts. But Henderson hopes AI systems will one day improve lawyers’ efficiency and thoroughness by, for example, checking their citations and identifying all relevant arguments in a case. The aim, he says, is to improve access to justice for people who cannot afford a lawyer.

“It’s a tough challenge, but why not aim for a hard problem to solve?” he says. “And one that can really help people.”

Katharine Miller is a contributing writer for the Stanford Institute for Human-Centered AI.

This story originally appeared on hai.stanford.edu. Copyright 2022.
