Online anonymity is a lie: Research challenges privacy protection frameworks
Online privacy and anonymity seem further out of reach than ever. It is almost as if every new advance in technology removes another brick from an already flimsy wall of privacy on the web.
Although legislation such as the GDPR was designed to protect user privacy and anonymity, these guarantees hold little weight against powerful technologies like machine learning, which, researchers have found, can piece together anonymized information to reconstruct your identity.
The research, recently published in Nature Communications, has demonstrated that it is possible to reconstruct the real identities of individuals from sampled and anonymized datasets using a combination of only fifteen basic demographic attributes, such as age, ethnicity, marital status, and number of children.
These findings carry important implications for the whole debate surrounding online privacy, and reveal just how easy it is to de-anonymize individuals from their data.
This article explains the anonymization processes commonly used by companies, describes how the researchers managed to re-identify individuals from their demographic data, and discusses the wider policy implications of the findings.
Anonymization processes for personal data
The anonymization techniques currently in wide use by companies involve de-identifying and sampling user information in the hope of making re-identification almost impossible. De-identification removes directly identifying information, such as names and physical and email addresses, from a company’s dataset; sampling means that only a subset of the records is released.
To further enhance the level of anonymity, companies often also add noise (small random distortions of the data) to the dataset to make re-identification more difficult.
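The three steps described above (removing direct identifiers, sampling, and adding noise) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the field names, the sampling rate, and the choice to add noise to the age field are all hypothetical and are not taken from the research or from any particular company's practice.

```python
import random

# Hypothetical field names; a real dataset will use different ones.
DIRECT_IDENTIFIERS = {"name", "email", "address"}


def anonymize(records, sample_rate=0.5, age_noise=2, seed=0):
    """Drop direct identifiers, keep a random sample, and perturb ages."""
    rng = random.Random(seed)
    released = []
    for record in records:
        # Sampling: release only a fraction of the records.
        if rng.random() >= sample_rate:
            continue
        # De-identification: remove fields that name the person directly.
        row = {k: v for k, v in record.items() if k not in DIRECT_IDENTIFIERS}
        # Noise: shift the age by a small random amount (assumed noise model).
        row["age"] = row["age"] + rng.randint(-age_noise, age_noise)
        released.append(row)
    return released


# Made-up example records.
records = [
    {"name": "Alice", "email": "alice@example.com", "address": "1 Main St",
     "age": 34, "zip": "10001", "gender": "F"},
    {"name": "Bob", "email": "bob@example.com", "address": "2 Oak Ave",
     "age": 41, "zip": "10002", "gender": "M"},
    {"name": "Carol", "email": "carol@example.com", "address": "3 Pine Rd",
     "age": 29, "zip": "10001", "gender": "F"},
]

released = anonymize(records)
print(released)
```

Note what survives this process: age, ZIP code, and gender are still present in every released row. Those remaining demographic attributes are exactly what a re-identification attempt works with.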
Data that has been anonymized in this way can be sold to third parties, as regulations like the GDPR and CCPA do not apply to anonymized information.
But how anonymous is the information in an "anonymized" dataset really?
If the findings of the researchers from Imperial College London and Université catholique de Louvain are summarized in three words, the answer to this question is: not very much.
For a sufficiently advanced machine learning algorithm, fifteen demographic characteristics are all it takes to destroy a customer’s anonymity in a company’s database.
Reconstructing identities with machine learning
Websites, governments, and companies have been collecting personal data of individuals ever since digital technologies made it possible to do so.
With the development of powerful predictive analytics tools like machine learning and Big Data, the ability to analyze customers and draw valuable insights from seemingly unimportant pieces of customer data has advanced formidably, as cases such as the Facebook-Cambridge Analytica scandal and the NHS sharing patients' medical data with DeepMind clearly reveal.
However, researchers have furnished compelling evidence that, in the age of AI, the anonymizing techniques used by companies and websites are far from adequate. So inadequate, in fact, that a machine learning model designed especially for the job estimates that a whopping 99.98 percent of Americans could be correctly re-identified in any supposedly anonymized dataset using 15 demographic attributes.
That’s near-perfect accuracy for reconstructing consumer identity, giving the lie to the idea that current techniques for anonymizing data are effective at all.
To explain how anonymity is removed by piecing together different bits of information, a task at which existing machine learning programs are exceedingly adept, one of the authors of the research, Dr Luc Rocher, said: "While there might be a lot of people who are in their thirties, male, and living in New York City, far fewer of them were also born on 5 January, are driving a red sports car, and live with two kids (both girls) and one dog."
This only means one thing: just because our names, email addresses, or fingerprints aren’t included in a dataset, it doesn’t rule out the possibility of our correct identity being made out using other, seemingly less important pieces of information, even when anonymized using current techniques.
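The narrowing effect Dr Rocher describes can be illustrated by counting how many records in a dataset carry a unique combination of attributes as more attributes are considered. The sketch below uses a tiny, made-up dataset and hypothetical attribute names. It is a simple counting exercise, not the statistical model used in the research.

```python
from collections import Counter


def unique_fraction(records, attributes):
    """Fraction of records whose combination of `attributes` is unique."""
    keys = [tuple(r[a] for a in attributes) for r in records]
    counts = Counter(keys)
    return sum(1 for k in keys if counts[k] == 1) / len(keys)


# Made-up records with no names or email addresses at all.
people = [
    {"age_band": "30s", "gender": "M", "city": "NYC", "birthday": "01-05", "kids": 2},
    {"age_band": "30s", "gender": "M", "city": "NYC", "birthday": "03-12", "kids": 2},
    {"age_band": "30s", "gender": "M", "city": "NYC", "birthday": "03-12", "kids": 0},
    {"age_band": "30s", "gender": "F", "city": "NYC", "birthday": "07-30", "kids": 1},
]

ORDER = ["age_band", "gender", "city", "birthday", "kids"]

# Add one attribute at a time and watch the unique share grow.
for n in range(1, len(ORDER) + 1):
    used = ORDER[:n]
    print(f"{n} attribute(s): {unique_fraction(people, used):.2f} unique")
```

In this toy dataset nobody is unique by age band alone, but once all five attributes are combined every record is unique, even though no direct identifier is present.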
Calculating the likelihood of correct re-identification
The findings of the research clearly reveal just how easy it is to reverse engineer incomplete personal information from a dataset to re-identify an individual.
To further prove the point, the researchers even prepared an online tool as a proof of concept that asks for information that companies and websites routinely collect from you. Initially, the tool only asks you to enter the first part of your ZIP code, your date of birth, and your gender.
It then estimates the probability that you can be re-identified from an incomplete and anonymized dataset using just these bits of information.
The tool then asks how many vehicles you have, along with your marital, home ownership, and employment status, to refine the estimate. The more attributes are added to the calculation, the higher the likelihood of correct re-identification, reaching 99.98 percent when just 15 of these characteristics are factored into the machine learning algorithm.
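A rough back-of-the-envelope calculation shows why each extra attribute pushes the likelihood up. If a person's combination of attribute values occurs with probability p, then the chance that none of the other n - 1 people in a population shares it is (1 - p)^(n - 1). The sketch below applies that formula under two loud assumptions: the attributes are treated as statistically independent (real demographic attributes are correlated), and every frequency listed is a hypothetical round number. It is not the researchers' model and will not reproduce their figures.

```python
import math


def uniqueness_probability(value_frequencies, population):
    """Chance that nobody else in `population` shares this attribute
    combination, assuming the attributes are statistically independent."""
    p = math.prod(value_frequencies)
    return (1.0 - p) ** (population - 1)


# Hypothetical, rounded frequencies of one person's attribute values.
ATTRIBUTES = [
    ("gender", 1 / 2),
    ("date of birth", 1 / (365 * 80)),
    ("ZIP code prefix", 1 / 900),
    ("marital status", 1 / 4),
    ("number of vehicles", 1 / 4),
]
POPULATION = 330_000_000  # assumed population size

# Recompute the estimate as each attribute is added.
for k in range(1, len(ATTRIBUTES) + 1):
    names = [name for name, _ in ATTRIBUTES[:k]]
    freqs = [freq for _, freq in ATTRIBUTES[:k]]
    prob = uniqueness_probability(freqs, POPULATION)
    print(f"{k} attribute(s) ({names[-1]} added): {prob:.4f}")
```

Because every added frequency is at most 1, the combined probability p can only shrink, so the estimated chance of being unique never decreases as attributes are added.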
Since most companies act well within the scope of GDPR guidelines when they collect and sell these pieces of consumer demographic information (after anonymization), this shows the abject futility of data protection legislation in preserving individual anonymity.
From the privacy point of view, this is a serious problem.
Implications for Privacy/Anonymity Regulations
Since the prohibitions enforced by the GDPR and CCPA only apply to identifiable information, companies "anonymize" the data in order to place it outside the reach of these laws.
That wouldn’t be a problem if the data actually were anonymous. But with research revealing the sheer inefficacy of prevailing anonymization standards, it is clear that privacy protection legislation is risibly inadequate in actually protecting our privacy.
The dwindling privacy of people worldwide is a fact that can be gauged by the rising demand for online privacy tools like VPNs. In earlier times, VPNs only provided basic IP address masking functionality. Today, these tools have developed the capability to guard user privacy in a much fuller way. This 20-factor comparison of VPNs gives a pretty good idea of the features any privacy-conscious user should pay attention to.
Nonetheless, the power at the disposal of companies to nullify these protections is still greater. With the state of online privacy tainted to such a degree, the importance of legal protection of the right to privacy is higher today than ever before.
The key takeaway from this research is that policymakers should acknowledge the failure of existing privacy protection legislation and urgently revise it in light of these findings.
This might entail a change in the definition of "anonymous" in these policy frameworks, keeping in mind the capabilities of existing technology and how thoroughly these can undo the currently used anonymization techniques.
These anonymization processes will need to be far more sophisticated than they currently are and must be specifically designed to combat the anti-anonymity potential of AI.
For all the privacy scandals and controversies in recent times, we are still no closer to preserving our own anonymity on the web.
The application of machine learning models to effectively overturn all anonymization efforts in use today is a wake-up call for policymakers, and underscores the need to formulate more stringent frameworks for privacy protection.
As Magnus Steinberg, CTO of Surfshark, aptly stated in this article: "...the sooner our societies reach the critical point of understanding that privacy is a human right and not a privilege, the earlier it will take the route of sustainability."
It won’t be surprising if these findings are ignored by the policymakers responsible for privacy legislation. After all, lawmakers and governments have historically been painfully slow to implement changes that science strongly supports.
Let’s hope for our sakes that we get to see a deviation from this pattern. And soon.
Osama Tahir is a writer who covers issues relating to online privacy, science, and the sociological impact of technology in modern times. He is a contributing author at Hackernoon and The Globe Post.