Synthetic data: A new frontier for data science

October 15, 2019

Since the EU's General Data Protection Regulation (GDPR) came into effect in May 2018, many businesses with customers in the EU have been rightly fearful of its infringement penalties, which can result in fines of up to 4% of annual global turnover.

As recently as last month, British businesses were given a reminder of what it means to suffer a data breach, when British Airways and Marriott International were handed eye-watering fines (£183m and £100m, respectively). For large businesses that handle huge quantities of personal data, such as banks and financial institutions, this is particularly daunting.

We’ve all heard that “data is the new oil”, and that modern businesses need to utilise customer data in order to better understand their customers, as well as to train AI and machine learning algorithms. But now, in order to avoid a breach, many businesses are treating their data like radioactive material, with strict procedures around who can access it and when. While this is undeniably a positive trend for data privacy, it nonetheless restricts an organisation’s data agility and its ability to innovate.

The problem with traditional anonymisation

Wise businesses are now rightly seeking out new privacy-enhancing technologies in order to strike a balance between data utility and security, with many now running data-intensive processes (e.g. testing and data analysis) on “anonymised” datasets.

There are various anonymisation techniques, but one of the most commonly used is generalisation, where a specific data point (e.g. a customer’s full home address) is broadened (e.g. to the customer’s region or city). This sacrifices a degree of utility within the dataset in order to ensure that the individuals within it are unidentifiable.
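To make the idea concrete, here is a minimal sketch of generalisation. The field names, the address-to-city rule and the band widths are all invented for illustration; real anonymisation pipelines apply far more rigorous rules (and measure the resulting re-identification risk).

```python
def generalise_record(record):
    """Replace precise fields with broader categories."""
    return {
        # Full street address is reduced to the city alone.
        "city": record["address"].split(",")[-1].strip(),
        # Exact age becomes a ten-year band, e.g. 37 -> "30-39".
        "age_band": f"{(record['age'] // 10) * 10}-{(record['age'] // 10) * 10 + 9}",
        # Salary is rounded to the nearest 10,000.
        "salary_band": round(record["salary"], -4),
    }

customer = {"address": "12 High Street, Manchester", "age": 37, "salary": 41250}
print(generalise_record(customer))
# {'city': 'Manchester', 'age_band': '30-39', 'salary_band': 40000}
```

Note the trade-off the article describes: after generalisation you can still analyse customers by city or age band, but any analysis that needed the exact address or salary has lost precision.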


One of the reasons that anonymisation has become so popular is because GDPR doesn’t apply to personal data that has been anonymised. But rather worryingly, recent research suggests that the bulk of anonymisation currently used is shockingly ineffective at masking an individual’s identity – and machine-learning models can re-identify individuals in the vast majority of cases.

So, it turns out that you don’t actually need granular information on an individual in order to identify them; and, consequently, traditional anonymisation techniques simply won’t cut the mustard.

Sophisticated synthetic data

The answer to more sophisticated anonymisation actually lies in a technique that data scientists have been using for over two decades: synthetic data.

In synthetic datasets, each data point belongs to an entirely fictional individual with their very own name, age, address, bank account number, tax records, medical history and any other details required for data analysis. Historically, the main issue has been that it is incredibly difficult to generate synthetic data of a high enough quality for advanced data science.

All this changes with recent advances in AI and machine learning. By training algorithms on “real” data, we can now generate synthetic datasets that retain all of the underlying statistics of the original data, but with zero personal or identifiable information.
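As a toy illustration of "retaining the underlying statistics", the sketch below fits the mean and covariance of a made-up customer table and samples entirely new records from the fitted model. Everything here is invented for illustration (the columns, the correlations, the Gaussian model); production systems use much richer generative models, but the principle is the same: synthetic rows reproduce the statistics of the real data without corresponding to any real individual.

```python
import numpy as np

rng = np.random.default_rng(42)

# Stand-in for a "real" customer table (invented columns:
# age, income, balance), with income correlated to age.
n = 5000
age = rng.normal(45, 12, n)
income = 800 * age + rng.normal(0, 5000, n)
balance = 0.3 * income + rng.normal(0, 2000, n)
real = np.column_stack([age, income, balance])

# Fit the joint statistics of the real data...
mean = real.mean(axis=0)
cov = np.cov(real, rowvar=False)

# ...and draw brand-new synthetic records from the fitted model.
# No synthetic row maps back to any real individual.
synthetic = rng.multivariate_normal(mean, cov, size=n)

# The synthetic table reproduces the original correlations.
print(np.corrcoef(real[:, 0], real[:, 1])[0, 1])       # real age-income correlation
print(np.corrcoef(synthetic[:, 0], synthetic[:, 1])[0, 1])  # very similar value
```

A model trained on the synthetic table would see essentially the same relationships (here, that income rises with age) as one trained on the real table.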

An easy way to visualise this is through Nvidia’s work with Generative Adversarial Networks (GANs), the technology behind This Person Does Not Exist. The site uses a model trained on a dataset of real celebrity faces to produce hyper-realistic images of people who do not exist. This is, in essence, synthetic data. Each generated individual has a host of attributes that could be analysed (e.g. eye colour, hair colour, skin tone), but the data cannot be compromised because it doesn’t belong to real people.
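The adversarial idea itself fits in a few lines. The toy sketch below (my own illustration, not Nvidia's method) pits a one-parameter generator against a logistic discriminator on 1-D data instead of images: the discriminator learns to tell real samples from fakes, and the generator learns to fool it, dragging its output distribution towards the real one. Learning rates, step counts and the Gaussian target are all invented for the demo; real GANs use deep networks on both sides.

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda s: 1.0 / (1.0 + np.exp(-s))

# "Real" data: a 1-D Gaussian centred on 3, standing in for a training set.
def real_batch(n):
    return rng.normal(3.0, 1.0, n)

b = 0.0           # generator: G(z) = z + b (learns only the location)
w, c = 0.1, 0.0   # discriminator: D(x) = sigmoid(w*x + c)
lr, batch = 0.05, 64

for _ in range(2000):
    x = real_batch(batch)              # real samples
    z = rng.normal(0.0, 1.0, batch)    # noise fed to the generator
    g = z + b                          # fake samples
    d_real = sigmoid(w * x + c)
    d_fake = sigmoid(w * g + c)

    # Discriminator step: ascend log D(real) + log(1 - D(fake)).
    w += lr * (np.mean((1 - d_real) * x) - np.mean(d_fake * g))
    c += lr * (np.mean(1 - d_real) - np.mean(d_fake))

    # Generator step: ascend log D(fake), i.e. make fakes look real to D.
    d_fake = sigmoid(w * g + c)
    b += lr * np.mean((1 - d_fake) * w)

synthetic = rng.normal(0.0, 1.0, 10000) + b
print(round(synthetic.mean(), 2))  # typically close to the real mean of 3
```

At equilibrium the discriminator can no longer tell real from fake, which is exactly the property that makes GAN-generated faces, and GAN-generated customer records, statistically convincing.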

Imagine taking this technology and applying it to customer data: you now have a dataset that can be shared throughout your data science team and used for all kinds of modelling without the need for excessive administration and with none of the privacy risk. Meanwhile, your “real” customer data can be stored on a secure server and very few people need access to it.

The new frontier

With more businesses looking to adopt a synthetic data strategy, there will no doubt be knock-on effects across all industries. Organisations equipped with the tools necessary to unlock the potential of their data will now be able to utilise their customer data whilst being both risk-averse and responsible.


With bountiful opportunities to conduct data science and advanced machine learning – on datasets where we control the statistical properties – we can expect to see a new era of data innovation and a reshaped data economy.

The advent of social media led to enormous leaps in the field of AI, but very little attention was paid to keeping that data secure. Now, with synthetic data, we can continue our way along the exponential innovation curve, but this time whilst adhering to regulation and treating data with the caution and care it deserves.

The post Synthetic data: A new frontier for data science appeared first on JAXenter.
