Dealing with the security risks of unstructured data [Q&A]
Businesses are increasingly reliant on data. In the past that's generally been in a structured form but, thanks to increasing amounts of customer information gleaned via the IoT and channels like social media, unstructured data has taken on a new importance.
Yet unstructured data also introduces new risks. AI-based solutions specialist Concentric is launching a new data access governance solution that addresses the challenge of unstructured data security. We spoke to Karthik Krishnan, CEO at Concentric, to find out more.
BN: Why do enterprises need to take unstructured data seriously?
KK: By some estimates, 80 percent of enterprise data exists in unstructured formats -- and that data is often business-critical, sensitive or regulated. In fact, some types of sensitive information, such as intellectual property, strategic plans or personnel information, are more likely to be found in a document or spreadsheet than in a database, and those files are prime targets for cybercriminals.
But unlike the data in a database -- which is usually controlled by security experts on the IT team -- unstructured files are governed by end users, who make consequential security decisions for the files they create and manage. Overshared documents raise the risk of data loss, and there’s no good way to make sure users are managing this information responsibly.
BN: What's the first step in dealing with unstructured data?
KK: Especially with unstructured data, the first step is understanding what you have. Our customers often have upwards of 10 million files, and you wouldn’t consider all of them to be business critical. So there has to be a way to focus on what’s important, urgent and at-risk. That's why data discovery and categorization are the cornerstones of effective unstructured data access governance.
BN: How hard is it to identify potentially sensitive data in a mass of other information?
KK: It's tough. If you think about all the types of data you need to protect, the list is long and diverse. Finding critical sales data, forecasts, financial performance, personnel files and contracts among the office party invitations and other trivial stuff isn't easy. Frankly, the inability of current solutions to do this is why unstructured data is the mess that it is. One approach tries to use pattern-matching to do it, which inevitably leads to an ever-growing tangle of unmaintainable rules that still can't tell an NDA from a purchase agreement. The other approach puts the burden on end users to tag their files as sensitive or confidential -- and we all know how well any IT initiative that relies on consistent end-user behavior turns out. So, yeah, identifying the important stuff is hard.
The good news is that recent advances in natural language processing (an application of deep learning/AI) have made models really good at this. That's what we've commercialized at Concentric.
BN: What makes Concentric's Semantic Intelligence solution different?
KK: It's how we've applied deep learning to the problem, which creates two key advantages for our customers. We've already talked about our first differentiator, and that's how we categorize data using deep learning. We can put documents into one of over 90 categories out of the box, and customers can easily create new models for their specific data. Like I mentioned earlier, this ability to categorize data -- accurately and comprehensively -- is the foundation for unstructured data access governance.
The second capability is something we call Risk Distance analysis. Once data has been categorized, we use Risk Distance to compare the aggregate security practices in a group of peer files to the specific security practices for a single file. So, for example, if only one of dozens of M&A files is in a folder accessible to all employees, we can identify that file as high risk -- without ever creating an explicit policy or asking an end user to mark the file. It’s an automated, accurate way to spot risk because it relies on the collective choices of the file owners, who are, after all, the content experts.
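To make the peer-comparison idea concrete, here is a minimal Python sketch. It is not Concentric's implementation: the `FileRecord` fields, the three access levels in `ACCESS_RANK`, and the `min_peers` and `max_share` thresholds are all assumptions chosen for illustration. It groups files by category, treats the most common access level in each group as the peer norm, and flags any file that is both rarer and more widely exposed than that norm.

```python
from collections import Counter, defaultdict
from dataclasses import dataclass

# Hypothetical access levels, ordered from least to most exposed.
ACCESS_RANK = {"restricted": 0, "team": 1, "all_employees": 2}


@dataclass(frozen=True)
class FileRecord:
    path: str
    category: str  # assumed to come from an earlier categorization step
    access: str    # one of the keys in ACCESS_RANK


def flag_overexposed(files, min_peers=5, max_share=0.2):
    """Return paths whose access is rare among peers and broader than the peer norm."""
    by_category = defaultdict(list)
    for f in files:
        by_category[f.category].append(f)

    flagged = []
    for peers in by_category.values():
        if len(peers) < min_peers:
            continue  # too few peers to infer a norm
        counts = Counter(f.access for f in peers)
        norm_access, _ = counts.most_common(1)[0]
        for f in peers:
            share = counts[f.access] / len(peers)
            broader = ACCESS_RANK[f.access] > ACCESS_RANK[norm_access]
            if share <= max_share and broader:
                flagged.append(f.path)
    return flagged


# Eleven M&A files are restricted; one sits in a folder open to everyone.
files = [FileRecord(f"/deals/m_and_a_{i}.docx", "m&a", "restricted") for i in range(11)]
files.append(FileRecord("/public/m_and_a_draft.docx", "m&a", "all_employees"))
files.append(FileRecord("/public/party_invite.docx", "social", "all_employees"))

print(flag_overexposed(files))
```

In this toy data set only the stray M&A draft is flagged. The party invitation is ignored because it has no peer group large enough to define a norm, which mirrors the point that no explicit policy or user tagging is needed.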
We've just released a new analysis capability that gives even more insight into risk by highlighting file activity. That helps highlight which files might need more urgent attention because they’re being routinely moved, shared, printed or otherwise used. It can also help with data retention management on the other end of the scale -- if a sensitive document isn’t getting much use, it might be a candidate for deletion or deep archiving.
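As a rough illustration of how activity data could drive both ends of that scale, the sketch below counts recent events per sensitive file. The event format (a path and a date), the 90-day window and the `busy_threshold` value are assumptions for the example, not details of the product.

```python
from datetime import date, timedelta


def triage_by_activity(events, sensitive_paths, today, window_days=90, busy_threshold=10):
    """Split sensitive files into heavily used ones and ones with no recent use."""
    cutoff = today - timedelta(days=window_days)
    recent = {path: 0 for path in sensitive_paths}
    for path, when in events:
        # Count only events on sensitive files that fall inside the window.
        if path in recent and when >= cutoff:
            recent[path] += 1
    urgent = sorted(p for p, n in recent.items() if n >= busy_threshold)
    archive_candidates = sorted(p for p, n in recent.items() if n == 0)
    return urgent, archive_candidates


today = date(2021, 6, 1)
# Twelve recent events (moves, shares, prints, ...) on one file, one stale event on another.
events = [("/hr/salaries.xlsx", today - timedelta(days=d)) for d in range(12)]
events.append(("/legal/old_contract.pdf", today - timedelta(days=400)))

sensitive = {"/hr/salaries.xlsx", "/legal/old_contract.pdf", "/finance/forecast.xlsx"}
urgent, archive_candidates = triage_by_activity(events, sensitive, today)
print(urgent)
print(archive_candidates)
```

Here the salary spreadsheet lands in the urgent list, while the two files with no activity inside the window become candidates for deletion or deep archiving.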
BN: How can this help to ensure compliance with GDPR, CCPA, etc.?
KK: Personally identifiable information or personal health information (PII/PHI) is always sensitive, so we’ve put a lot of energy into making sure we can find it. Our UI, for example, has tools dedicated to locating, analyzing and working with PII/PHI. Beyond that, Concentric brings some unique benefits to the compliance table, and here's how.
Right-to-know and right-to-be-forgotten mandates put a double burden on compliance professionals -- they not only have to find the relevant information, they also have to decide what to do with it. Our categorization capabilities make that decision far easier by showing, for example, whether a specific bit of PII/PHI is in a marketing document (which likely should be deleted) or in a contract (which needs to be maintained).
Image Credit: Profit_Image / Shutterstock