Digital assistants of the future promise to make everyday life easier. We may ask them to complete tasks such as booking out-of-town business travel based on the content of an email, or answering open-ended questions that require a combination of personal context and public knowledge. (For example, “Is my blood pressure within the normal range for someone my age?”)
But before we can achieve new levels of efficiency at work and at home, one big question needs to be answered: How can we provide users with strong and transparent privacy guarantees for the underlying personal information that machine learning (ML) models use to arrive at these answers?
If we expect digital assistants to facilitate personal tasks involving a mix of public and private data, the technology must provide “perfect secrecy,” the highest possible level of privacy, in certain situations. Previous methods have either ignored the privacy issue or offered weaker guarantees.
Third-year Stanford computer science Ph.D. student Simran Arora has studied the intersection of ML and privacy with associate professor Christopher Ré as her advisor. They recently investigated whether emerging foundation models – large ML models trained on massive amounts of public data – provide the answer to this pressing privacy question. The resulting paper was released in May 2022 on the preprint service arXiv, with a proposed framework and proof of concept for using ML in the context of personal tasks.
Perfect secrecy defined
According to Arora, a perfect secrecy guarantee satisfies two conditions. First, as users interact with the system, the probability that adversaries learn private information does not increase. Second, as multiple personal tasks are completed using the same private data, the chance of that data being accidentally shared does not increase.
With this definition in mind, she has identified three criteria for evaluating a privacy system against the goal of perfect secrecy:
- Privacy: How well does the system prevent leaks of private data?
- Quality: How well does the model perform a given task when perfect secrecy is guaranteed?
- Feasibility: Is the approach realistic in terms of time and cost to implement the model?
Today, most modern privacy systems use an approach called federated learning, which allows for collective model training by multiple parties while avoiding the exchange of raw data. In this method, the model is sent to each user and then sent back to a central server along with that user’s updates. Source data is theoretically never revealed to participants. But unfortunately, other researchers have discovered that it is possible to recover data from an exposed model.
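The round-trip the article describes can be sketched in a few lines. This is a minimal illustration of the federated averaging idea, not the production protocol: function names are illustrative, and a toy least-squares model stands in for a real neural network. Note how only model weights, never the raw `X` and `y`, leave each client.

```python
import numpy as np

def local_update(global_weights, X, y, lr=0.1, epochs=5):
    """Each client refines the global model on its own private data;
    only the updated weights (never X or y) are sent back."""
    w = global_weights.copy()
    for _ in range(epochs):
        grad = X.T @ (X @ w - y) / len(y)  # least-squares gradient
        w -= lr * grad
    return w

def federated_round(global_weights, client_datasets):
    """The central server averages the clients' returned weights,
    weighted by how much data each client holds."""
    updates, sizes = [], []
    for X, y in client_datasets:
        updates.append(local_update(global_weights, X, y))
        sizes.append(len(y))
    return np.average(updates, axis=0, weights=np.array(sizes, dtype=float))
```

The privacy risk the article goes on to describe lives in those returned weight updates: even though raw data never moves, researchers have shown the updates themselves can leak information about it.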
A popular technique for strengthening the privacy guarantees of federated learning is differential privacy, a statistical approach to protecting private information. It requires the implementer to set privacy parameters that govern a trade-off between model performance and information privacy. In practice, these parameters are difficult for practitioners to set, and the trade-off between privacy and quality is not standardized. So while the chance of a breach is very small, perfect secrecy is not guaranteed with a federated learning approach.
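The parameter the implementer must set is usually called epsilon, and the classic way to use it is the Laplace mechanism. The sketch below, with illustrative names, shows the trade-off in miniature: a smaller epsilon means stronger privacy but a noisier, less useful answer.

```python
import numpy as np

def private_mean(values, lower, upper, epsilon, rng=None):
    """Differentially private mean via the Laplace mechanism.
    Clipping each value to [lower, upper] bounds how much any one
    person can shift the mean -- the 'sensitivity' -- and noise is
    scaled to sensitivity / epsilon. Smaller epsilon => more noise."""
    if rng is None:
        rng = np.random.default_rng()
    clipped = np.clip(values, lower, upper)
    sensitivity = (upper - lower) / len(clipped)
    noise = rng.laplace(scale=sensitivity / epsilon)
    return clipped.mean() + noise
```

Choosing epsilon is exactly the unstandardized privacy-versus-quality judgment call the article says practitioners struggle with.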
“Currently, the industry has adopted a focus on statistical reasoning,” explains Arora. “In other words, how likely is it that someone will discover my personal information? The differential privacy approach used in federated learning requires organizations to make trade-offs between utility and privacy. That is not ideal.”
A new approach with foundation models
When Arora saw how well foundation models like GPT-3 perform new tasks from simple natural-language prompts, often without any additional training, she wondered whether these capabilities could be applied to personal tasks while offering stronger privacy than the status quo.
“With these large language models, you can say ‘Tell me the sentiment of this review’ in natural language and the model outputs the response — positive, negative or neutral,” she said. “We can then use the exact same model without any upgrades to ask a new question with personal context, such as ‘Tell me the subject of this email.’ ”
Arora and Ré began exploring whether off-the-shelf public foundation models, run in a silo on behalf of individual users, could perform personal tasks privately. They developed a simple framework called Foundation Model Controls for User Secrecy (FOCUS), which uses a unidirectional data-flow architecture to perform personal tasks while preserving privacy.
The one-way aspect of the framework is essential: in a scenario with different privacy scopes (i.e., a mix of public and private data), the public foundation model is queried before the user’s private data is touched, preventing private information from leaking back into the public scope.
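FOCUS is a framework, not a library, so the following is only an illustrative sketch of the one-way flow under assumed names: a stub stands in for the public foundation model and records every query it receives, so we can check that nothing derived from private data ever crosses the boundary.

```python
from dataclasses import dataclass, field

@dataclass
class PrivateScope:
    """Holds a user's private data; nothing here is ever sent out."""
    documents: list

@dataclass
class AuditedPublicModel:
    """Stand-in for a public foundation model. It logs every prompt
    it receives so the test below can audit the privacy boundary."""
    queries_seen: list = field(default_factory=list)

    def complete(self, prompt: str) -> str:
        self.queries_seen.append(prompt)
        # A real system would run model inference here; this stub
        # just returns a canned completion for the prompt.
        return f"[completion for: {prompt}]"

def run_personal_task(model: AuditedPublicModel, scope: PrivateScope, task: str) -> list:
    # Step 1 (public side): query the public model with task text only.
    template = model.complete(task)
    # Step 2 (private side): apply the result to private data locally.
    # Data flows public -> private, never the reverse.
    return [f"{template} | applied to: {doc}" for doc in scope.documents]
```

Because step 2 happens entirely on the private side, the public model's query log can contain only the generic task description, never the contents of the user's documents.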
Testing the theory
Arora and Ré assessed the FOCUS framework against the criteria of privacy, quality and feasibility, and the results were encouraging for a proof of concept. FOCUS not only preserves the privacy of personal data; it also conceals the task the model was asked to perform and how that task was completed. Best of all, the approach doesn’t require organizations to set privacy parameters that balance utility against privacy.
In terms of quality, the foundation model approach was competitive with federated learning on six of the seven standard benchmarks. It underperformed, however, in two specific scenarios: when the model was asked to perform an out-of-domain task (something not covered in its training) and when the task was run with small foundation models.
Finally, they considered the feasibility of their framework compared to a federated learning approach. FOCUS eliminates the many rounds of communication between users that occur with federated learning and lets the pre-trained base model do the work faster through inference, making for a more efficient process.
Risks of the foundation model
Arora notes that several challenges must be addressed before foundation models can be widely used for personal tasks. For example, the drop in FOCUS performance on out-of-domain tasks is a concern, as is the slow runtime of inference with large models. For now, Arora recommends that the privacy community increasingly treat foundation models as a baseline, both to aid in designing new privacy benchmarks and to clarify when federated learning is actually needed. Ultimately, the right approach to privacy depends on the user’s context.
Foundation models also introduce risks of their own. They are expensive to pretrain, and they can hallucinate or misclassify information when uncertain. There is also a fairness concern: so far, foundation models are mainly available for resource-rich languages, so a suitable public model may not exist for every user.
Pre-existing data breaches are another complicating factor. “Training basic models on web data that already contains leaked sensitive information raises a whole new set of privacy concerns,” Arora acknowledged.
Looking ahead, she and her colleagues in the Hazy Research lab at Stanford are exploring methods to build more reliable systems and to enable in-context behavior in smaller foundation models, which are better suited to personal tasks on resource-constrained user devices.
Arora can envision a scenario, not too far away, where you ask a digital assistant to book a flight based on an email stating you’re scheduling a meeting with an out-of-town client. And the model coordinates travel logistics without disclosing details about the person or company you will be meeting.
“It is still early days, but I hope that the FOCUS framework and proof of concept will lead to further research into applying public foundation models to private tasks,” said Arora.
Nikki Goth Itoi is a contributing writer for the Stanford Institute for Human-Centered AI.
This story originally appeared on hai.stanford.edu. Copyright 2022