Data privacy on cloud-hosted AI
The rise of cloud-hosted AI software has prompted much discussion about the privacy implications of using it. But neither consumers nor developers building on such software always have a sophisticated framework for evaluating how providers store, use, and share their data. For example, does a company’s promise “not to train on customer data” mean your data is private?
Here is a framework for thinking about the different levels of privacy that cloud platforms offer, ordered from least to most private:
- No Guarantees: The company provides no guarantees that your data will be kept private. For example, an AI company might train on your data and use the resulting models in ways that leak it. Many startups start here but add privacy guarantees later when customers demand them.
- No Outside Exposure: The company does not expose your data to outsiders. A company can meet this standard by not training on your data and also by not otherwise exposing it, for example by posting it online. Many large startups, including some providers of large language models (LLMs), currently operate at this level.
- Limited Access: In addition to safeguards against data leakage, no humans (including employees, contractors, and vendors of the company) will look at your data unless a reasonable process compels them to (such as a subpoena or court order) or the data is flagged by a safety filter. Many large cloud companies effectively offer this level of privacy, whether or not their terms of service explicitly say so.
- No Access: The company cannot access your data no matter what. For example, data may be stored on the customer’s premises, where the company cannot reach it. If I run an LLM on my private laptop, no company can access my prompts or the LLM’s output. Alternatively, if data is used by a SaaS system, it might be encrypted before it leaves the customer’s facility, so the provider never has access to an unencrypted version. For example, when you use an end-to-end encrypted messaging app such as Signal or WhatsApp, the company cannot see the contents of your messages (though it may see “envelope” information such as sender and recipient identities and the time and size of the message). A minimal sketch of this client-side encryption follows the list.
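To make the client-side encryption idea concrete, here is a minimal sketch using the third-party Python `cryptography` package (an assumption for illustration, not any particular provider’s implementation). The key is generated and kept on the customer’s side, so the provider only ever stores ciphertext it cannot read.

```python
# A minimal sketch of client-side ("No Access") encryption, assuming the
# third-party `cryptography` package (pip install cryptography).
from cryptography.fernet import Fernet

# The key is generated and stored only on the customer's side.
key = Fernet.generate_key()
cipher = Fernet(key)

prompt = b"Quarterly revenue figures: ..."

# Only this ciphertext is sent to and stored by the provider.
ciphertext = cipher.encrypt(prompt)

# The provider sees opaque bytes plus "envelope" metadata such as size and
# timing, never the plaintext.
print(len(ciphertext))

# Only the key holder can recover the original data.
assert cipher.decrypt(ciphertext) == prompt
```

The trade-off is that a provider holding only ciphertext cannot compute over your data (setting aside techniques such as homomorphic encryption), which is why No Access in practice usually means on-premises processing or end-to-end encrypted storage and messaging rather than a full cloud AI service.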
These levels may seem clear, but there are many variations within a given level. For instance, a promise not to train on your data can mean different things to different companies. Some forms of generative AI, particularly image generators, can replicate their training data, so training a generative AI algorithm on customer data may run some risk of leaking it. On the other hand, tuning a handful of an algorithm’s hyperparameters (such as learning rate) to customer data, while technically part of the training process, is very unlikely to result in any direct data leakage. So how the data is used in training will affect the risk of leakage.
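To illustrate that distinction, here is a toy sketch (plain NumPy, not any provider’s actual pipeline). In the higher-risk case, the model’s weights are fit directly to the raw customer records; in the lower-risk case, the customer data contributes only one aggregate validation score per candidate learning rate, so far less customer-specific information can flow into the released model.

```python
# Toy contrast between training on customer data and merely tuning a
# hyperparameter against it. Linear model + gradient descent, NumPy only.
import numpy as np

rng = np.random.default_rng(0)
true_w = np.array([1.0, -2.0, 0.5, 0.0, 3.0])

def make_data(n):
    X = rng.normal(size=(n, 5))
    return X, X @ true_w + rng.normal(scale=0.1, size=n)

X_provider, y_provider = make_data(200)   # provider's own training set
X_customer, y_customer = make_data(50)    # stand-in for customer data

def train(X, y, lr, steps=500):
    """Gradient descent on a linear model; the weights see the raw rows of X."""
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        w -= lr * X.T @ (X @ w - y) / len(y)
    return w

def mse(w, X, y):
    return float(np.mean((X @ w - y) ** 2))

# Higher-risk use: train directly on customer data. The weights are a function
# of every raw customer example and could, in principle, memorize some of them.
w_trained_on_customer = train(X_customer, y_customer, lr=0.1)

# Lower-risk use: train on the provider's own data and use the customer data
# only to choose a learning rate. Each candidate contributes a single scalar
# loss, so very little customer-specific information reaches the released model.
candidates = [0.01, 0.05, 0.1, 0.5]
best_lr = min(candidates,
              key=lambda lr: mse(train(X_provider, y_provider, lr),
                                 X_customer, y_customer))
w_released = train(X_provider, y_provider, best_lr)
print("chosen learning rate:", best_lr)
```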
Similarly, the Limited Access level has its complexities. If a company offers this level of privacy, it’s good to understand exactly under what circumstances its employees may look at your data. And if they might look at your data, there are shades of gray in terms of how private the data remains. For example, if a limited group of employees in a secure environment can see only short snippets that have been disassociated from your company ID, that preserves far more privacy than if a large number of employees can freely browse your data.
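As an illustration of the more careful end of that spectrum, here is a minimal sketch of a hypothetical review pipeline (the record layout and helper names are assumptions, not any company’s actual process): the customer ID is replaced with a salted pseudonym, and the reviewer sees only a short snippet around the flagged passage rather than the full record.

```python
# Hypothetical sketch of disassociating flagged data before human review.
import hashlib
import secrets

SALT = secrets.token_bytes(16)   # kept out of reach of reviewers

def pseudonymize(customer_id: str) -> str:
    """Salted hash so reviewers cannot map a snippet back to a customer."""
    return hashlib.sha256(SALT + customer_id.encode()).hexdigest()[:12]

def snippet(text: str, flagged_at: int, radius: int = 40) -> str:
    """Expose only a short window around the flagged position, not the whole record."""
    start = max(0, flagged_at - radius)
    return text[start:flagged_at + radius]

record = {
    "customer_id": "acme-corp-001",
    "content": "...long customer document with a passage flagged by a safety filter...",
    "flagged_at": 35,
}

review_item = {
    "pseudonym": pseudonymize(record["customer_id"]),
    "snippet": snippet(record["content"], record["flagged_at"]),
}
print(review_item)   # the only thing a reviewer would see
```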
In outlining levels of privacy, I am not addressing the question of security. To trust a company to deliver a promised level of privacy is also to trust that its IT infrastructure is secure enough to keep that promise.
Over the past decade, cloud-hosted SaaS software has gained considerable traction. But some customers insist on running on-prem solutions within their own data centers. One reason is that many SaaS providers offer only No Guarantees or No Outside Exposure, while some customers’ data is sensitive enough to require at least Limited Access.