Understanding LLMs, privacy and security -- why a secure gateway approach is needed
Over the past year, we have seen generative AI and large language models (LLMs) go from a niche area of AI research to one of the fastest-growing areas of technology. According to Goldman Sachs, around $200 billion is due to be invested in this market globally, boosting global labor productivity by one percentage point. That might not sound like much, but it would add up to $7 trillion to the global economy.
However, while these LLM applications have potential, there are still problems to solve around privacy and data residency. Currently, employees can unknowingly share sensitive company data or customers' Personally Identifiable Information (PII) with services like OpenAI. This opens up new security and data privacy risks.
Alongside this, we have to factor in the potential for AI hallucinations, where LLMs return results that look realistic but are not factual. How can organisations understand LLMs, and manage their use so that they are both effective and secure?
Building a model for LLM security
To start with, it’s worth knowing how LLMs work in practice. There are two classes of LLMs. Private LLMs -- such as OpenAI’s GPT-3.5 and GPT-4, and Google’s Bard -- are services that you access through an API or interface. The alternative is open models, like the open-source models available from a service like Hugging Face, or those based on Meta’s open LLM Llama 2.
The main advantage of open models is that you can host them on your own infrastructure, either on your own on-premise hardware or in your own cloud deployment. This gives you control over how the model is used and ensures that any data you use stays under your control. The drawback is that these models currently trail the private models in terms of performance, although that gap is closing.
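As a rough illustration, a self-hosted open model can be loaded and queried entirely on your own hardware with the Hugging Face transformers library. The model name below is one example (Meta's Llama 2 7B chat checkpoint, which is gated and requires approved access); this is a minimal sketch, not a production setup.

# Minimal sketch: running an open model on your own infrastructure,
# so prompts and responses never leave your environment.
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline

MODEL_ID = "meta-llama/Llama-2-7b-chat-hf"  # example open model; access is gated by Meta

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(MODEL_ID, device_map="auto")

generate = pipeline("text-generation", model=model, tokenizer=tokenizer)
print(generate("Summarise our data retention policy in one sentence.",
               max_new_tokens=100)[0]["generated_text"])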
The issue here is that there are no security frameworks or compliance standards that govern or audit these technologies. There is a lot of work going on in this area, such as the Artificial Intelligence and Data Act (AIDA) in Canada, the EU AI Act, and the Blueprint for an AI Bill of Rights in the U.S. However, those regulations have not been fully implemented, and companies don’t have the tools available to help them use LLMs securely and safely.
Today, developers have to rely on the best practices that already exist around machine learning and software, and perform their own due diligence on the supply chain for their components. For example, if you are using the OpenAI API, any data you share is sent to OpenAI, where it can be used to evolve and retrain the OpenAI LLM models. If you include any PII in this data, it could then be shared, which would violate Governance, Risk and Compliance (GRC) rules around data privacy. If you want to adopt OpenAI and still manage that data securely, you can use the Azure OpenAI service, as this does not share any data that can be re-used.
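As a rough sketch of what that looks like in code, the snippet below routes requests through an Azure OpenAI deployment rather than the public OpenAI endpoint, using the openai Python SDK (version 1.x). The endpoint, API version, deployment name and environment variables are placeholders you would replace with your own.

# Sketch: calling a model via an Azure OpenAI deployment, which keeps request
# data inside your own Azure tenancy rather than the public OpenAI service.
# Endpoint, API version and deployment name are placeholders.
import os
from openai import AzureOpenAI

client = AzureOpenAI(
    azure_endpoint=os.environ["AZURE_OPENAI_ENDPOINT"],  # e.g. https://<your-resource>.openai.azure.com
    api_key=os.environ["AZURE_OPENAI_API_KEY"],
    api_version="2024-02-01",
)

response = client.chat.completions.create(
    model="my-gpt4-deployment",  # the name of your own deployment, not a public model id
    messages=[{"role": "user", "content": "Draft a reply to this customer complaint."}],
)
print(response.choices[0].message.content)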
Alongside this, you may want to look at how you manage any prompt data that you provide to the LLM. Scrubbing this data before it gets sent to the LLM can help you maintain security, but it is hard to do this with 100 percent accuracy. While open models make it easier to manage GRC and prevent violations, you will also have to implement encryption around any API calls and Role-Based Access Controls on your datasets.
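To show the kind of scrubbing involved, the sketch below redacts a few common PII patterns (email addresses, card-like numbers and phone numbers) before a prompt is sent on. The patterns are illustrative only -- this is exactly the part that is hard to get 100 percent right.

# Sketch: redacting obvious PII patterns from a prompt before it leaves your environment.
# Regex-based scrubbing like this will miss edge cases, which is why it cannot
# reach 100 percent accuracy on its own.
import re

PII_PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "CARD_NUMBER": re.compile(r"\b(?:\d[ -]?){13,16}\b"),
    "PHONE": re.compile(r"\b\+?\d[\d -]{8,}\d\b"),
}

def scrub(prompt: str) -> str:
    for label, pattern in PII_PATTERNS.items():
        prompt = pattern.sub(f"[{label} REDACTED]", prompt)
    return prompt

print(scrub("Customer jane.doe@example.com paid with card 4111 1111 1111 1111."))
# -> Customer [EMAIL REDACTED] paid with card [CARD_NUMBER REDACTED].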
Whether you choose an open or a private LLM, it makes sense to centralise your control over LLM use. To achieve this, you can use an LLM gateway -- an API proxy that carries out real-time logging and validation of requests. With this in place, you have a central point of control over the data sent to LLMs, and you can track what is shared and see the responses that come back. Just like the LLMs themselves, LLM security is still developing.
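A minimal sketch of what such a gateway can look like is below, using FastAPI and httpx as one possible stack. The upstream URL, the simplistic PII check and the credential handling are placeholders for illustration, not a production design.

# Sketch of an LLM gateway: a small API proxy that validates and logs every
# prompt before forwarding it to the upstream LLM provider.
import logging
import httpx
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel

UPSTREAM_URL = "https://api.openai.com/v1/chat/completions"  # or your own hosted model
audit_log = logging.getLogger("llm_gateway")
logging.basicConfig(level=logging.INFO)

app = FastAPI()

class PromptRequest(BaseModel):
    user: str
    prompt: str

@app.post("/v1/chat")
async def proxy_chat(req: PromptRequest):
    # Validation: block prompts that still contain obvious PII markers.
    if "@" in req.prompt:  # placeholder check; a real gateway would use proper PII detection
        raise HTTPException(status_code=400, detail="Prompt appears to contain PII")

    audit_log.info("user=%s prompt_chars=%d", req.user, len(req.prompt))  # central audit trail

    async with httpx.AsyncClient(timeout=30) as client:
        upstream = await client.post(
            UPSTREAM_URL,
            headers={"Authorization": "Bearer <API_KEY>"},  # placeholder credential
            json={"model": "gpt-4", "messages": [{"role": "user", "content": req.prompt}]},
        )

    audit_log.info("user=%s status=%d", req.user, upstream.status_code)  # log the response too
    return upstream.json()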
LLM security and performance
While LLM security is a significant concern, performance is another area that has to be looked at. LLMs are trained on Internet data sets like CommonCrawl, WebText, C4, CoDEx and BookCorpus, which provide the framework for putting together responses. However, the LLM does not understand this data; it only captures the semantic meaning of the words involved and which ones are most likely to appear next to each other.
For general questions, a model can normally respond accurately based on the data it has been trained on. However, when it does not have an answer, rather than admitting as much, it will generate a plausible response. These false responses are dubbed AI hallucinations. In more specialised domains, where data might be lacking or out of date, these hallucinations can be very serious. To solve this problem, LLMs can be improved with fine-tuning.
Fine-tuning uses more specific data sets around those topics to add more in-depth insight to the base model. However, to carry this out, you need a relatively mature data engineering infrastructure and have to collect that data in the right format. You will also have to know where the data comes from and how it was gathered, so you understand its provenance and how trustworthy it is. Alongside this, you can use your approach to LLMs and data gathering to track how that fine-tuning data affects and improves the quality of responses over time.
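As an illustration of what "the right format" and provenance tracking can look like in practice, the sketch below writes fine-tuning examples to a JSONL file alongside a parallel record of where each example came from. The field names and the two-file layout are assumptions for this sketch; the exact training schema depends on the model or service you fine-tune against.

# Sketch: preparing fine-tuning data as JSONL while recording provenance for
# each example. Field names here are illustrative only.
import json
from datetime import date

examples = [
    {
        "prompt": "What does our incident response runbook say about ransomware?",
        "completion": "Isolate affected hosts, preserve forensic images, then notify the IR lead.",
        "source": "internal-wiki/ir-runbook-v3",   # where the data came from
        "collected_on": str(date(2024, 1, 15)),    # when it was gathered
        "reviewed_by": "security-team",            # who vetted it for accuracy and PII
    },
]

with open("finetune_train.jsonl", "w") as train, open("finetune_provenance.jsonl", "w") as prov:
    for ex in examples:
        # The model only needs the text pair; provenance is kept in a parallel record
        # so you can audit how each example influences response quality over time.
        train.write(json.dumps({"prompt": ex["prompt"], "completion": ex["completion"]}) + "\n")
        prov.write(json.dumps({k: ex[k] for k in ("source", "collected_on", "reviewed_by")}) + "\n")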
LLMs and Generative AI have huge potential, and they are evolving extremely rapidly in response to market demand. At the same time, we need to develop the management and security framework around Generative AI as well, so that companies can keep their deployments secure. We have to manage and track how users interact with LLMs, so that we can see that impact and where we can improve results too.
Getting this right from the start will serve everyone better. The alternative is that we face the same kinds of problems that have come up around software development, insecure cloud deployments and shadow IT. We have a chance to build security and privacy into how Generative AI is delivered from the start, so we should not miss this opportunity.
Dr. Jeff Schwartzentruber holds the position of Sr. Machine Learning Scientist at eSentire, a cyber-security company specializing in Managed Detection and Response (MDR). Dr. Schwartzentruber’s primary academic and industry research has been concentrated in solving problems at the intersection of cyber-security and machine learning (ML). Over his 10-year career, Dr. Schwartzentruber has been involved in applying ML for threat detection and security analytics for large Canadian financial institutions, public sector organizations and SMEs. In addition to his private sector work, Dr. Schwartzentruber is also an Adjunct Faculty at Dalhousie University in the Department of Computer Science, a Special Graduate Faculty member with the School of Computer Science at the University of Guelph, and a Research Fellow at the Rogers Cybersecure Catalyst.