How synthetic data can unlock and help monetize information [Q&A]
Big data offers major opportunities for many industries. But in areas like finance, where personal information is involved, using that information raises privacy concerns.
One solution is to anonymize the information in some way. To discover more about how this works, we spoke to Randy Koch, the CEO of ARM Insight, a company pioneering the use of synthetic data and helping more than 1,000 financial institutions monetize their data safely.
BN: Why is it so important to be able to use this data?
RK: Financial services holds more valuable data than any other industry; maybe the only comparable one is healthcare. Financial institutions are sitting on valuable data assets, but they are only beginning to be able to properly extract value from that data, both internally and externally. ARM Insight exists to extract value from financial data assets, both internally for the organization (the bank, credit union, or fintech) and externally with its partners, new clients, etc. We've been doing that for 10 years now, for more than 1,000 clients, and we do it in a safe, secure way.
BN: How can you do that without compromising the client’s data?
RK: Any large or medium-sized bank, credit union, or fintech has very valuable data. The problem is that the executives and the IT infrastructure setup see that data set as one big bowl of data that has personal information in it: name, address, social security number, and transactions. Because it contains scary, private information, it needs to be treated with kid gloves, and they don't understand how they can break it apart and find ways to drive value. They have this big data set in which private data and transactions are all commingled.
But what we say is that they can start treating the data as three different categories. The first is what we call raw data, which includes all the PII. So, for example, there is Randy Koch. We have his address, and we know he stopped at Starbucks at 7am and spent $10. That's the part that needs to be protected.
The second data set, or data component, that we create is called anonymous data. Anonymous data is exactly the same as the raw data, except you take out all of the private information. There's no more Randy Koch and no more address, but you keep the transaction the same, so you still see the Starbucks visit at 7am and the $10 spent. You can use this for analytics, but you're limiting its value.
The third data set is called synthetic data, and this is the part that has been transformative over the last 18 months. What this does is create an entirely new, fake data set based on both the personal information and the transactional information. So instead of the date of birth you would just say Generation X; instead of being at Starbucks at 7am you say they were there at 7:05; and instead of paying $10, they spent $10.10. The algorithm makes sure the aggregate statistics are the same, but it protects personal privacy. This gets you outside of GDPR and outside of CCPA, and now you're totally free to monetize that synthetic data both internally and externally.
The result is that the big bowl of data is no longer scary. In an institution, only five to 10 percent of the data is actually consumer data that needs to be protected. But now we can take the other 90 percent and get value out of that data in a secure, privacy-protected way.
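To make the three categories above concrete, here is a minimal Python sketch. It is not ARM Insight's actual algorithm, which the interview does not detail. The record fields, the generation cutoffs, and the jitter limits (`max_minutes`, `max_pct`) are all assumptions chosen for illustration, and the customer shown is a placeholder.

```python
import random
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class RawTransaction:
    """Raw data: personal fields and the transaction together (hypothetical schema)."""
    name: str
    address: str
    ssn: str
    birth_year: int
    merchant: str
    timestamp: datetime
    amount: float


def to_anonymous(tx: RawTransaction) -> dict:
    """Anonymous data: drop the personal fields, keep the transaction unchanged."""
    return {"merchant": tx.merchant, "timestamp": tx.timestamp, "amount": tx.amount}


def generation_label(birth_year: int) -> str:
    """Replace a birth year with a coarse bucket (cutoffs are an assumption)."""
    if birth_year <= 1964:
        return "Baby Boomer or earlier"
    if birth_year <= 1980:
        return "Generation X"
    if birth_year <= 1996:
        return "Millennial"
    return "Generation Z or later"


def to_synthetic(tx: RawTransaction, rng: random.Random,
                 max_minutes: int = 10, max_pct: float = 0.02) -> dict:
    """Synthetic data: generalize the personal field, nudge time and amount slightly."""
    shift = timedelta(minutes=rng.randint(-max_minutes, max_minutes))
    factor = 1 + rng.uniform(-max_pct, max_pct)
    return {
        "generation": generation_label(tx.birth_year),
        "merchant": tx.merchant,
        "timestamp": tx.timestamp + shift,
        "amount": round(tx.amount * factor, 2),
    }


rng = random.Random(0)  # fixed seed so the sketch is repeatable
raw = RawTransaction(
    name="Example Customer", address="1 Example St", ssn="000-00-0000",
    birth_year=1970, merchant="Starbucks",
    timestamp=datetime(2020, 1, 6, 7, 0), amount=10.00,
)
anonymous = to_anonymous(raw)
synthetic = to_synthetic(raw, rng)
print(anonymous)
print(synthetic)
```

In this sketch the anonymous record still carries the exact time and amount, while the synthetic record carries only a generation label and slightly shifted values, mirroring the 7am versus 7:05 and $10 versus $10.10 example in the answer.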
BN: So what might you use the anonymized data for?
RK: We believe that for most of our clients there are two use cases for synthetic data. The main use case we see is internal, where we put our synthetic data application inside the bank's firewalls. About six months ago a chief data officer from a top 20 bank came up to me and she said, "Randy, I currently have 110 data scientists, analysts and QA people working on my real transactional data, with PII. Our single biggest security risk is employee misuse of data. Can you create some fake data sets for us, so that only seven of my people need to be looking at the real data?"
We implemented a solution for them and reduced the number of people actually exposed to sensitive data by 93 percent. And if you just think about that for a second: the Equifax breach, the Capital One breach, the Target breach; none of them would have had anywhere near the news impact if they'd used fake data.
The algorithm itself randomly changes the date field or the amount field, but only by small amounts, so the aggregate always comes back to 99.97 percent accuracy. So if you look at a million transactions, the data is still statistically very relevant. This brings us to the second use case, which is being able to use the data externally.
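The idea that small random changes to individual records leave the totals nearly intact can be checked with a short Python sketch. This is an illustration under stated assumptions, not the company's algorithm: the amounts are randomly generated, the 2 percent jitter limit (`max_pct`) is arbitrary, and "accuracy" is defined here as one minus the relative error of the total, which may differ from how the 99.97 percent figure is measured.

```python
import random


def perturb_amounts(amounts, rng, max_pct=0.02):
    """Scale each amount by its own random factor in [1 - max_pct, 1 + max_pct]."""
    return [round(a * (1 + rng.uniform(-max_pct, max_pct)), 2) for a in amounts]


def aggregate_accuracy(original, synthetic):
    """One minus the relative error between the two totals."""
    total = sum(original)
    return 1 - abs(sum(synthetic) - total) / total


rng = random.Random(42)  # fixed seed so the sketch is repeatable

# Hypothetical transaction amounts between $1 and $200.
original = [round(rng.uniform(1, 200), 2) for _ in range(100_000)]
synthetic = perturb_amounts(original, rng)

changed = sum(1 for o, s in zip(original, synthetic) if o != s)
accuracy = aggregate_accuracy(original, synthetic)
print(f"records changed: {changed} of {len(original)}")
print(f"aggregate accuracy: {accuracy:.4%}")
```

Because the individual changes are independent and centered on zero, they largely cancel out when summed, so the relative error of the total shrinks as the number of transactions grows.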
If, for example, you are a retailer and you want to know where people go an hour before or an hour after coming into your store, a bank's information can probably tell you. Selling synthetic data that captures those trends can therefore create a revenue stream for the bank.
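As a rough sketch of the retailer question above, the following Python counts which other merchants the same customer visited within an hour before or after a visit to a target store. The `(customer_id, merchant, timestamp)` record layout, the pseudonymous customer IDs, and the merchant names are all hypothetical; the interview does not describe the format of the data sold.

```python
from collections import Counter
from datetime import datetime, timedelta


def nearby_merchants(transactions, target, window=timedelta(hours=1)):
    """Count merchants visited within `window` before and after a visit to `target`.

    `transactions` is an iterable of (customer_id, merchant, timestamp) tuples.
    """
    by_customer = {}
    for customer, merchant, ts in transactions:
        by_customer.setdefault(customer, []).append((merchant, ts))

    before, after = Counter(), Counter()
    for visits in by_customer.values():
        target_times = [ts for merchant, ts in visits if merchant == target]
        for merchant, ts in visits:
            if merchant == target:
                continue
            for t in target_times:
                if timedelta(0) < t - ts <= window:
                    before[merchant] += 1
                elif timedelta(0) < ts - t <= window:
                    after[merchant] += 1
    return before, after


# Hypothetical synthetic records with pseudonymous customer IDs.
sample = [
    ("c1", "Coffee Shop", datetime(2020, 1, 6, 7, 5)),
    ("c1", "Retailer A", datetime(2020, 1, 6, 7, 40)),
    ("c1", "Gas Station", datetime(2020, 1, 6, 8, 15)),
    ("c2", "Coffee Shop", datetime(2020, 1, 6, 9, 0)),
    ("c2", "Retailer A", datetime(2020, 1, 6, 11, 30)),
]
before, after = nearby_merchants(sample, "Retailer A")
print(dict(before))  # {'Coffee Shop': 1}
print(dict(after))   # {'Gas Station': 1}
```

Customer c2's coffee purchase falls outside the one-hour window, so it is not counted. A query like this only needs a consistent per-customer identifier, not a name or address.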
BN: Could this open up opportunities for other industries too?
RK: Absolutely. Two years ago, if you went to a bank or fintech provider and said, "Hey, let me show your data to a retailer," they'd be scared and say no because of privacy regulations. But as long as the data is synthetic, it's safe to create a revenue stream from it.
Insurance would be another area. You could show how much was paid out in claims, or what properties were worth in a particular area, for example, changing the amounts just slightly, and you still get the same statistical accuracy. So now you can share that and get value out of that data.