Solving for GDPR: It’s about technology and human behavior
GDPR: it’s a nightmare for organizations, but a much-needed protection for citizens in a world of Cambridge Analytica, criminal hackers, and nation-state cyberthreats. Many aspects of the regulation are extremely tricky to implement, but let’s consider just one. Imagine the following scenario:
A new customer signs up to your eCommerce website. Their data is copied into several back-end systems: maybe a CRM, an accounting system, an order management system, a marketing platform, and probably some kind of data science workbench. Sometime later, an analyst is tasked with analyzing new customers: their behaviors, their retention rates, and other important factors. They know customer data is spread across dozens of these systems, so they ask IT to prepare a dataset for them. Maybe a month later, IT comes back with a dataset provisioned in the corporate data lake. The data isn’t quite fit for purpose and contains far more information than the analyst needs.
By now the analyst’s deadline is fast approaching, and they don’t have time to wait for IT to iterate on the data again. Taking matters into their own hands, they perform an extract into Excel or Tableau and massage the data into something more suitable. They get the data they want (or maybe only a subset of it, since extracts are limited in size) and then share this extract with the rest of the team, performing manual updates every week or so.
Here’s the kicker: That customer information is personal data. If that extract gets leaked (and let’s be honest, an Excel or Tableau extract emailed to a dozen other accounts isn’t exactly secure), then the organization will be in breach of the regulation, and the nightmare begins.
So how do we address this common scenario?
The first and most direct approach organizations should consider is education. Part of the problem is people’s day-to-day behavior: they’ve never had to worry about controlling personal data before. By providing appropriate training and updating HR policies, organizations can begin to build a culture of responsibility around personal data.
We should also look at why people do these kinds of things in the first place. After all, it’s one thing to educate people about why they shouldn’t do this going forward, but if there isn’t an alternative workflow for them to follow, chances are they’ll fall back into the habits that let them get their jobs done.
The first challenge is acquiring and preparing the data. Some of that data could be copied into a data lake, but some of it would probably require security to sanction the copy, which adds time and overhead to the process. If the analyst could access the data source directly, in a secured and controlled fashion, much of that delay would disappear.
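One way to picture what “secured and controlled” direct access could look like is a governed view that pseudonymizes personal-data columns before the analyst ever sees them. The following is a minimal sketch, not any particular product’s mechanism; the field names and masking policy are illustrative assumptions:

```python
# Sketch: column-level masking for direct, governed access to customer data.
# PERSONAL_FIELDS and the masking policy are illustrative assumptions.
import hashlib

PERSONAL_FIELDS = {"name", "email", "phone"}  # columns treated as personal data

def pseudonymize(value: str) -> str:
    """Replace a personal value with a stable, irreversible token.
    Stable tokens mean joins and group-bys on the column still work."""
    return hashlib.sha256(value.encode()).hexdigest()[:12]

def masked_view(rows, allowed_personal_fields=frozenset()):
    """Yield rows with personal columns pseudonymized unless explicitly allowed."""
    for row in rows:
        yield {
            key: (value
                  if key not in PERSONAL_FIELDS or key in allowed_personal_fields
                  else pseudonymize(str(value)))
            for key, value in row.items()
        }

customers = [{"name": "Ada", "email": "ada@example.com", "orders": 3}]
print(list(masked_view(customers)))
```

The analyst queries the live source through the view and gets analytically useful data (counts, joins, retention cohorts) without ever holding raw personal data on their desktop.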
Second, the data needs to be curated; it won’t arrive in the shape the analyst needs for their analysis. Data curation is almost certainly an iterative process, especially since the analyst doesn’t know the shape of the data up front and may not even fully understand what questions to ask of it, so they can’t precisely and accurately communicate their requirements to IT. Again, if the analyst could do this work themselves, they would bypass IT. Well, that’s exactly why they extract to something like Excel in the first place!
Finally, they need to share this data with their team.
If the analyst had a single environment that could connect securely to the data sources they need, let them prepare the data without making extracts, and share that data securely with their team, all while maintaining lineage and audit logs of what data was accessed and by whom, then it would become feasible for employees to follow their GDPR training and keep access to personal data controlled.
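The “what was accessed, by whom” requirement can be made concrete with an audit record written on every data access. A minimal sketch, assuming a simple in-memory log and hypothetical function names; a real platform would use an append-only, tamper-evident store:

```python
# Sketch: recording who accessed which dataset, when, and with what query.
# run_query, audit_log, and the record fields are illustrative assumptions.
import time

audit_log = []  # in practice: an append-only, tamper-evident store

def run_query(user: str, dataset: str, sql: str, executor):
    """Execute a query and record the access for GDPR accountability."""
    audit_log.append({
        "user": user,
        "dataset": dataset,
        "query": sql,
        "timestamp": time.time(),
    })
    return executor(sql)

result = run_query(
    "analyst_1",
    "customers",
    "SELECT region, COUNT(*) FROM customers GROUP BY region",
    executor=lambda sql: "result-set",  # stand-in for a real query engine
)
```

With records like these, a data protection officer can answer “who touched this customer’s data?” directly from the log, rather than hunting for emailed spreadsheets.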
Of course, we shouldn’t forget query performance. Tableau and Excel tend to perform well with a reasonably sized dataset copied to the desktop, so whatever solution is put in place must offer comparable (or better) performance; if every query takes an hour to run, the analyst will grow impatient and be tempted to make an extract anyway. If the solution also scales to queries across datasets that vastly exceed what a Tableau extract could handle, then the whole deal becomes a no-brainer!
This is precisely the set of capabilities we find in Data-as-a-Service platforms. They provide an integrated, self-service environment in which the analyst can access any sanctioned data source, with appropriate security restrictions in place. They do this without extracting the data, while maintaining full lineage and auditing of the prepared datasets, which can in turn be shared with a team, again in a secure fashion. And they do all of this while scaling to petabytes of data and supporting sub-second queries for complex analytical workloads.
GDPR may still be a headache, but Data-as-a-Service platforms provide an important piece of your overall solution, supporting your analysts and data scientists.
Christy Haragan is an engineer at Dremio. Previously she was a sales engineer at MarkLogic.