GPT-3 for Corporates — Is Data Privacy an Issue?
Last Updated on March 3, 2021 by Editorial Team
Author(s): Shubham Saboo
GPT-3 is transforming how businesses leverage AI to enhance their existing products and build the next generation of product offerings.
Brief Overview 💡
Generative Pre-trained Transformer 3 is an autoregressive language model that uses deep learning to produce human-like text. It is the third-generation language prediction model in the GPT-n series created by OpenAI. GPT-3 is an extension and scaled-up version of the GPT-2 model architecture: it includes modified initialization, pre-normalization, and reversible tokenization, and shows strong performance on many NLP tasks in zero-shot, one-shot, and few-shot settings.
In the above graph, it is clearly visible how GPT-3 dominates all the smaller models and achieves substantial gains on almost all NLP tasks. It is based on the approach of pretraining on a large dataset followed by fine-tuning or priming for a specific task. Today’s AI systems are limited in how well they perform when switching among language tasks, but GPT-3 switches flexibly between different language tasks while remaining highly performant.
GPT-3 uses 175 billion parameters, by far the largest number of parameters any model had been trained on at the time. Its results offer a fascinating insight: scaling up the training of language models significantly improves task-agnostic, few-shot performance, making it comparable to, or even better than, prior SOTA approaches.
Access to GPT-3 🗝️
Access to GPT-3 is provided in the form of an API. Due to the size of the model, the OpenAI community decided not to release the entire model with its 175 billion parameters. Unlike current AI systems, which are typically designed for a single use case, GPT-3 is designed to be task-agnostic and provides a general-purpose “text in, text out” interface, giving users the flexibility to try it on virtually any language task.
The API is designed so that once you provide it with an apt text prompt, it processes the prompt on OpenAI’s servers and returns a completion that tries to match the pattern you gave it. Unlike current deep learning systems, which require a huge amount of data to achieve SOTA performance, the API needs only a few examples to be primed for your downstream task.
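To make the “text in, text out” flow concrete, here is a minimal sketch of how a completion request might be assembled before it is sent to OpenAI’s servers. The field names mirror the completion API’s common parameters, but the prompt text and default values here are illustrative assumptions, not official defaults.

```python
# Sketch of a "text in, text out" request body for a GPT-3
# completion call. The prompt and parameter values below are
# illustrative assumptions.

def build_completion_request(prompt, max_tokens=64, temperature=0.7):
    """Assemble the JSON body that would be POSTed to the completions endpoint."""
    return {
        "prompt": prompt,            # the text pattern the model should continue
        "max_tokens": max_tokens,    # upper bound on the length of the completion
        "temperature": temperature,  # higher values produce more varied text
    }

request_body = build_completion_request("Q: What is GPT-3?\nA:")
```

In practice this body would be sent with your API key in an `Authorization` header; the point of the sketch is that the caller supplies nothing but text and a few sampling parameters.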
The API is designed to be simple and intuitive so that machine learning teams can be more productive. The idea behind releasing GPT-3 as an API was to let data teams focus on machine learning research rather than worrying about distributed systems problems.
GPT-3 exposes its advanced language model via an open-ended API, which allows users to provide training data to GPT-3 in the form of a training prompt that the model uses to produce appropriate results. For individual accounts, the service generally stores the training data as part of its online learning feature in order to improve the model on the go, which creates a hitch for use cases involving highly confidential data. Data privacy has been the biggest concern for corporates around the world looking to use GPT-3 to create niche, domain-specific applications.
In very simple terms, at its core “all a language model does is predict the next word given a series of previous words.” OpenAI has devised techniques to take a language model (GPT-3) from this simple task to more useful ones such as question answering, document summarization, and context-specific text generation. For a language model, the best results are usually achieved by fine-tuning it on domain-specific data. GPT-3 uses a miniature version of fine-tuning: it allows you to condition the model to mimic a particular behavior by providing it with just a few examples.
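This conditioning is done purely through the training prompt: a handful of worked examples followed by the new input, so the model continues the demonstrated pattern. The sketch below shows one way such a few-shot prompt might be built; the sentiment-classification task, the example reviews, and the `Review:`/`Sentiment:` separators are all illustrative assumptions.

```python
# Sketch of "priming" GPT-3 with a few-shot training prompt.
# The task, example data, and separator format are assumptions
# made for illustration.

EXAMPLES = [
    ("I loved this product!", "positive"),
    ("The delivery was late and the box was damaged.", "negative"),
]

def build_few_shot_prompt(examples, new_input):
    """Concatenate labeled examples so the model mimics the demonstrated pattern."""
    lines = []
    for text, label in examples:
        lines.append(f"Review: {text}\nSentiment: {label}\n")
    # The prompt ends mid-pattern; the model's completion supplies the label.
    lines.append(f"Review: {new_input}\nSentiment:")
    return "\n".join(lines)

prompt = build_few_shot_prompt(EXAMPLES, "Works exactly as advertised.")
```

Because the prompt ends right after `Sentiment:`, the model's most likely continuation is a label consistent with the examples shown, with no weight updates or conventional fine-tuning involved.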
After receiving a lot of interest from corporates around the world in using this extremely powerful language model, OpenAI has introduced corporate accounts, which allow corporate users to sign a special Memorandum of Understanding (MoU) and Data Privacy Agreement (DPA) with OpenAI to address concerns around data leaks and data privacy.
Corporates’ Concerns (Ask)
- The GPT-3 API endpoint exposed by OpenAI should not retain or save any part of the training data provided to it as part of the model fine-tuning/training process.
- No third party should be able to extract or access the data shown to the model as a part of the training prompt by providing any kind of input to the exposed API endpoint.
Response from OpenAI
- For the first part, GPT-3 comes with a default “data retention” period that requires the model to keep the data around for some time in order to detect and prevent misuse of the API’s capabilities (for the things mentioned in OpenAI’s ToU, section 3[j]). Under the custom data privacy agreements designed for corporates, the retention window can be made flexible by mutual agreement between both parties, after which the data will be scrubbed from OpenAI’s systems.
- For the second part, around data leakage, this can be addressed by creating data and model silos. OpenAI will silo off the data, so no matter how long the retention period, third parties will never be able to access or extract your data by providing any input to the GPT-3 API.
- Both asks are handled independently by OpenAI: the retention period applies only to OpenAI, not to third parties, and by creating data silos, third parties will never be able to access the data regardless of the retention window.
If you would like to learn more or want me to write more on this subject, feel free to reach out.
Published via Towards AI