Why safe use of GenAI requires a new approach to unstructured data management [Q&A]
Large language models generally train on unstructured data such as text and media. But most enterprise data security strategies are designed around structured data (data organized in traditional databases or formal schemas).
The use of unstructured data in GenAI introduces new challenges for governance, privacy and security that these traditional approaches aren't equipped to handle.
We spoke to Rehan Jalil, CEO of Securiti, to discuss how organizations need to rethink how they're governing and protecting unstructured data in order to safely leverage GenAI.
BN: How does unstructured data differ from structured data?
RJ: At a high level, the difference is straightforward. Structured data is any data that lives in traditional row-column stores (e.g., relational or SQL databases, Excel documents or data warehouses) or has a predefined data model. This tends to include things like financial transactions, inventory information, and patient records.
Unstructured data is all the other data that doesn't exist in spreadsheets and databases (often stored in non-relational or NoSQL databases or data lakes). It's typically text-heavy and lacks the organization and properties of structured data -- for example, all of the documents, emails, social media posts, web pages, and multimedia content that a company may have or own. It can also include all the regulations and policies that companies may need to adhere to, such as tax codes or insurance terms of coverage.
Today, about 90 percent of data being generated in enterprises is unstructured.
BN: How does this impact generative AI deployments?
RJ: In the past, companies really just mined their structured data to make business decisions. But GenAI is upending that. Most generative AI models work by analyzing unstructured data, such as text data on the web, and providing outputs based on that data. Generative AI technologies employ this data to train models and build natural language processing capabilities. This causes a problem for organizations, as the vast majority of their data management solutions were built for structured data.
The issue is that the industry has not put the same resources into developing technologies and strategies for managing unstructured data as it has for structured data. Lots of organizations struggle to even identify all the locations where their unstructured data might live -- across which shared drives, cloud systems, applications, and so on. And once it is identified, unstructured data requires different, more complex management and specialized techniques in order for data teams to extract meaningful insights and patterns from it -- techniques such as natural language processing, text mining, and machine learning.
BN: Why is unstructured data so challenging to manage and secure?
RJ: There are a number of factors at play. The biggest issue is simply volume and variety. There are massive amounts of unstructured data within organizations, and it comes from a diverse range of sources, such as emails, documents, social media posts, and multimedia files. This makes it difficult for teams to keep track of the data and to enforce consistent governance and security policies across the organization.
Uncontrolled access and sharing is another hurdle. Once created, unstructured data tends to grow quickly across various systems, devices, and cloud services as people copy, modify, manipulate, and share the content. Because of this, it can be very difficult to keep track of where data came from and who should have access.
Unstructured data also tends to live across many silos, and ownership is often fuzzy. The data is frequently created and managed by different departments or individuals within an organization, leading to data silos and ambiguity around data ownership and accountability. While structured data is more likely to have known ownership within an organization due to understood security or cost implications, a company’s unstructured data is often sequestered, either for legitimate reasons (e.g., upcoming commentary for an acquisition) or for less desirable ones (e.g., political boundaries between divisions).
Last, unstructured data comes in a wide variety of formats. Whereas structured data has collapsed into a small set of universal standards, SQL being a principal one, unstructured content systems have a multitude of formats and legacy patterns. The tools needed to manage these formats in a unified way are specialized and require a commitment from the organization to deploy and use them.
BN: What should organizations do to safely use unstructured data for GenAI?
RJ: Managing unstructured data for generative AI is possible if enterprises acquire seven key capabilities:
- Discover, catalog, and classify unstructured data: Automatically discover, catalog, and classify files and objects on the fly, which are essential for GenAI projects.
- Preserve access entitlements of unstructured data: Maintain existing enterprise entitlements at source systems to ensure that only authorized users access relevant data via GenAI prompts.
- Trace the lineage of unstructured data: Understand data mapping and flows from source to end results, showing how the data moves from unstructured data systems to vector databases, to LLMs, and finally to endpoints.
- Curate unstructured data: Automate the labeling or tagging of files to ensure that only relevant data with associated context is fed to GenAI models, thereby providing accurate responses with citations.
- Sanitize unstructured data: Classify and redact or mask sensitive data from files that GenAI projects use.
- Focus on the quality of unstructured data: Emphasize the freshness, uniqueness, and relevance of data to prevent unintended data usage in GenAI projects.
- Secure unstructured prompts and responses with pre-configured policies: Detect, classify, and redact sensitive information on the fly, block toxic content, and enforce compliance with topic and tone guidelines.
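To make two of the capabilities above more concrete (preserving access entitlements and sanitizing sensitive data), here is a minimal Python sketch. It is an illustration only, not a description of any vendor's product: the `Document` class, the `redact` and `build_context` functions, and the two regular expressions are all hypothetical, and a real classifier would cover far more data types than an email address and a US-style Social Security number.

```python
import re
from dataclasses import dataclass, field

# Hypothetical patterns for illustration; real classifiers detect many more types.
PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}


@dataclass
class Document:
    """A file pulled from a source system, with the groups allowed to read it."""
    source: str
    text: str
    allowed_groups: set = field(default_factory=set)


def redact(text: str) -> str:
    """Replace each match of a sensitive pattern with a bracketed label."""
    for label, pattern in PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text


def build_context(docs: list, user_groups: set) -> list:
    """Keep only documents the user is entitled to see, then redact them.

    Returns (source, redacted_text) pairs so a response can cite its sources.
    """
    context = []
    for doc in docs:
        # Entitlement check: the user must share at least one group with the file.
        if doc.allowed_groups & user_groups:
            context.append((doc.source, redact(doc.text)))
    return context


docs = [
    Document("hr/offer.txt", "Contact jane.doe@example.com, SSN 123-45-6789.", {"hr"}),
    Document("fin/q3.txt", "Q3 revenue summary for the board.", {"finance"}),
]
finance_context = build_context(docs, {"finance"})
```

In this sketch, a user in the `finance` group would only ever have `fin/q3.txt` passed to the model, and any document that is passed along has matching emails and SSN-like strings masked first. Keeping the source path alongside the text is one simple way to support citations and lineage.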
Image credit: SergeyNivens/depositphotos.com