Opening a new chapter in Dubai’s data sharing story: synthetic data for utility, quality and privacy

Dubai Digital Authority and the City’s Data and AI Committee release an implementation framework for synthetic data.
The framework is to be supported by a Synthetic Data Sandbox to allow for safe experimentation and validation of synthetic data uses.
A research report sets out the management and data science case for synthetic data as an important building block in our digitising economy.

The perennial challenge of working with data is to extract patterns and insights from it while keeping it confidential and safe. Rightly, data governance and ethics put the brakes on revealing granular details, such as personally identifiable information, in data exercises. Yet the true power of data is unlocked by sharing it. The more widely available it is to developers, entrepreneurs, academics, and government staff, the more social and business value can be derived from it.

Dubai Digital Authority’s goal is to make available the largest possible variety of high-quality data: generating fresh perspectives on public policy challenges, creating better government services and boosting data-intensive elements of the digital economy (e.g. data-hungry AI).

Dubai Pulse is our data portal for the Emirate of Dubai. It contains a large variety of data with the potential to fuel the above. Yet for good reason, many more datasets are considered sensitive. They could contain commercially confidential information. Alternatively, careless disclosure could lead to harm against individuals or critical city infrastructure. These considerations mean we place high priority on creating controls to ensure that data isn’t leaked or used improperly.

Often, the strict hierarchies for data access mean that even government decision-makers can’t access individual data unless they have very specific rights. Even when testing datasets for completeness and undertaking quality control, our teams at Digital Dubai rely on automated scripts without actually being able to access personally identifiable information.

Historically, the tension between keeping data secure while making it available has been resolved through a process called anonymisation. This means that personal data is processed in such a way that it is impossible to identify individuals. Data can be aggregated, or converted into statistics. According to the Finnish Office of the Data Protection Ombudsman, “The prevention of identification must be permanent and make it impossible for the controller or a third party to convert the data back into identifiable form with the information held by them.”[1]

But anonymisation is not without its challenges. Often, data anonymisation solutions must be built on a case-by-case basis, which has a resource cost. And while aggregating datasets protects privacy, this technique by definition prevents analysis at individual and small-group level. In short, data can lose its usefulness quickly. Moreover, the risk that anonymised data can be reconstructed to point back to individuals means that tight access controls must often remain in place.

At Digital Dubai, we have focused our work to advance data sharing around a couple of insights. First, that the availability of large, detailed datasets is very important to furthering analytical and machine learning applications, which in turn unlock insights and innovation potential in the digital economy. Second, that for many use cases, the exact data with individuals’ details isn’t needed. Rather, what is essential is that the statistical relationships between variables (the data structure) continue to hold true.

This led us to explore a new approach to privacy preservation — synthetic data. In brief, synthetic data is artificially generated by an AI algorithm to retain the statistical properties of the original dataset, without copying or using individual rows or details from it.

It keeps the structure of the original information, but doesn’t use anything else from the original dataset. The benefit of this data is that it can still be used for analytical and modelling purposes.[2]
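To make the idea of "keeping the structure" concrete, here is a minimal sketch in Python. It is not the AI algorithm used in the research: it assumes a toy table with two numeric columns, summarises it with a simple bivariate normal model, and then samples brand-new rows from that model. The function names and the toy data are hypothetical, and this simple approach on its own carries no formal privacy guarantee.

```python
import math
import random
import statistics


def fit_two_columns(rows):
    """Summarise two numeric columns by their means, spreads and correlation."""
    xs = [x for x, _ in rows]
    ys = [y for _, y in rows]
    mean_x, mean_y = statistics.fmean(xs), statistics.fmean(ys)
    sd_x, sd_y = statistics.stdev(xs), statistics.stdev(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in rows) / (len(rows) - 1)
    return mean_x, mean_y, sd_x, sd_y, cov / (sd_x * sd_y)


def sample_synthetic(params, n, rng):
    """Draw n new rows from a bivariate normal with the fitted parameters."""
    mean_x, mean_y, sd_x, sd_y, rho = params
    # Guard against tiny floating-point overshoot when |rho| is close to 1.
    residual = math.sqrt(max(0.0, 1.0 - rho * rho))
    rows = []
    for _ in range(n):
        z1, z2 = rng.gauss(0, 1), rng.gauss(0, 1)
        rows.append((mean_x + sd_x * z1,
                     mean_y + sd_y * (rho * z1 + residual * z2)))
    return rows


if __name__ == "__main__":
    rng = random.Random(0)
    # Toy "original" table, itself randomly generated for illustration only.
    original = []
    for _ in range(2000):
        x = rng.gauss(40, 10)
        original.append((x, 0.5 * x + rng.gauss(0, 5)))
    synthetic = sample_synthetic(fit_two_columns(original), 2000, rng)
    print("original correlation: ", round(fit_two_columns(original)[4], 2))
    print("synthetic correlation:", round(fit_two_columns(synthetic)[4], 2))
```

The synthetic rows are sampled from the fitted model rather than copied from the original table, yet the relationship between the two columns carries over. Real generators model many columns and mixed data types, which is where more capable algorithms come in.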

We wanted to further explore the potential benefits of synthetic data. So, we joined forces with Faculty AI — one of Europe's most experienced AI and machine learning specialists — to conduct joint research and experimentation. The full results can be found in “Unlocking the Power of Data Through Private-Synthetic Data”, a research paper which makes both the management and the data science case for synthetic data.

The gold standard of private synthetic data is created through a mathematical framework called differential privacy. This sort of synthetic data has already been tested by the US Census and the UK’s National Health Service to ensure that collected data can’t be traced back to individuals but can still be a value-giving asset.
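For readers who want a feel for differential privacy, the sketch below shows its most basic textbook building block, the Laplace mechanism, applied to a simple count. It is an illustration only and not the method used in our research; private synthetic data generators typically apply the same principle inside the model that produces the data. The function names and records are hypothetical, and `epsilon` is the standard privacy parameter, where smaller values mean more noise and stronger privacy.

```python
import random


def laplace_noise(scale, rng):
    """Sample Laplace(0, scale) noise as the difference of two exponentials."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def private_count(records, predicate, epsilon, rng):
    """Return a noisy count of the records matching predicate.

    Adding or removing one record changes a count by at most 1, so the
    sensitivity is 1 and the noise scale is 1 / epsilon.
    """
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    true_count = sum(1 for r in records if predicate(r))
    return true_count + laplace_noise(1.0 / epsilon, rng)


if __name__ == "__main__":
    rng = random.Random(0)
    # Hypothetical records: (district, severity), invented for illustration.
    records = [("A", "minor"), ("A", "severe"), ("B", "minor"),
               ("B", "minor"), ("C", "severe")]
    for eps in (0.1, 1.0, 10.0):
        noisy = private_count(records, lambda r: r[1] == "minor", eps, rng)
        print(f"epsilon={eps}: noisy count = {noisy:.2f}")
```

Because the noise is calibrated to the largest effect any single record can have, the published figure reveals very little about whether a given individual is in the data. The trade-off is visible in `epsilon`: a small value protects privacy strongly but blurs the count, while a large value does the reverse.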

During Digital Dubai’s trials with Faculty, we carried out a series of experiments on thousands of records across three datasets, including traffic accident data from Dubai Pulse. These experiments analysed the amount of privacy preserved and utility retained when private synthetic data is compared to traditional anonymisation methods.

The research showed that private synthetic data outperforms conventional data anonymisation techniques such as removal, substitution, masking and aggregation. For example, for the Traffic Accident dataset in Dubai Pulse, we were able to almost completely protect individuals’ privacy while preserving 90% of the original data’s utility in the synthetic version.

So synthetic data opens up a new world of possibilities. It guards against data breaches and leaks. It improves data quality and can be used to correct known biases. It can enable less restrictive data access controls for easier project management and wider stakeholder involvement. Shortened governance processes also contribute to getting projects off the ground that were once thought of as unfeasible.

Think of the possibilities. Private-synthetic data could be safely shared across government departments without risk of privacy breaches, and could even improve data analytics collaboration between the public and private sector. Further down the line, private-synthetic data could be offered as a service, and a synthetic version of Dubai Pulse could open up far more datasets for general consumption.

Yet, there are important caveats. A range of proprietary and open-source synthetic data generators are entering the market. Establishing the best fit with the needs of your own particular use case is not easy. Further, synthetic data generation using differential privacy can be resource intensive, pointing to a period of measured adoption in which we will prioritise high-value and high-impact use cases.

It is for these reasons that, with the Data and AI Committee, we built on the evidence provided through the research report to produce a synthetic data implementation framework. Designed to be informative, flexible and practical, the framework helps government entities and the private sector to understand, create and use synthetic data successfully while minimising risk. Additionally, we will set up a sandbox to enable entities to validate their use cases and to tackle difficult questions with DDA’s data experts. Together, the two will help us build a catalogue of use cases and evidence to inform future scale-up of synthetic data usage, and any accompanying changes to governance and data infrastructure.

In signing off, our message on this complex but important topic is simple: get involved, and come and learn about synthetic data together. Both framework and sandbox can be accessed here.

[1] https://tietosuoja.fi/en/pseudonymised-and-anonymised-data

[2] https://dwpdigital.blog.gov.uk/2021/06/18/why-synthetic-data-could-be-useful-for-a-government-department/
