The Potential of Synthetic Data, 08.09.2021
The third meeting of the Expert Group Privacy Technologies for Data Collaboration took place online on the afternoon of September 8, 2021. We were joined by 14 participants.
Nico Ebert from ZHAW opened the meeting with a discussion about holding the fourth meeting on November 26 in person. The participants agreed to meet in Zurich. He also introduced the speaker for the upcoming meeting: Juan Troncoso-Pastoriza from EPFL, co-founder of the startup Tune Insight. Juan will introduce the group to the basics of homomorphic encryption.
Afterwards Matthias Templ, an expert from ZHAW in data anonymization and synthetic data, presented the concept of synthetic data. According to the McGraw-Hill Dictionary of Scientific and Technical Terms, synthetic data is “any production data applicable to a given situation that are not obtained by direct measurement”. Synthetic data is generated from datasets that often contain personal data and therefore should not be shared with third parties. The synthetic dataset nevertheless preserves major properties of the original dataset, so it can be used for similar purposes, such as learning about distributions.
Matthias explained that creating synthetic data first requires a good understanding of the original dataset (e.g. personal data about a population). This includes understanding how the data was generated and its inherent distributions (including marginal distributions). These distributions are then rebuilt with one or more models (e.g. neural networks, decision trees), and the models are used to generate the synthetic dataset. Matthias has developed and published an R library to accomplish this task. He also demonstrated some real-world examples in which synthetic data had been applied. After Matthias’ presentation the participants discussed the potential of synthetic data. Another discussion point was which modelling techniques are required for which level of complexity in the original dataset (e.g. datasets with only a few features require less complex techniques).
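To make the process described above more concrete, the following is a minimal sketch in Python. It is not the R library mentioned in the presentation, and it does not use neural networks or decision trees. Instead, it swaps in the simplest possible models: a frequency table for the marginal distribution of one variable and a conditional frequency table for a second variable. The variable names (`region`, `income_band`), the toy records, and all function names are hypothetical and chosen only for illustration.

```python
import random
from collections import Counter, defaultdict


def fit_model(records):
    """Estimate the marginal distribution of 'region' and the
    conditional distribution of 'income_band' given 'region'."""
    marginal = Counter(r["region"] for r in records)
    conditional = defaultdict(Counter)
    for r in records:
        conditional[r["region"]][r["income_band"]] += 1
    return marginal, conditional


def sample_from(counter, rng):
    """Draw one value with probability proportional to its observed count."""
    values = list(counter.keys())
    weights = list(counter.values())
    return rng.choices(values, weights=weights, k=1)[0]


def generate(model, n, rng):
    """Generate n synthetic records by sampling from the fitted tables."""
    marginal, conditional = model
    synthetic = []
    for _ in range(n):
        region = sample_from(marginal, rng)
        income_band = sample_from(conditional[region], rng)
        synthetic.append({"region": region, "income_band": income_band})
    return synthetic


# Made-up "original" data standing in for a sensitive dataset.
original = (
    [{"region": "north", "income_band": "low"}] * 30
    + [{"region": "north", "income_band": "high"}] * 10
    + [{"region": "south", "income_band": "low"}] * 20
    + [{"region": "south", "income_band": "high"}] * 40
)

rng = random.Random(42)  # fixed seed so the sketch is reproducible
model = fit_model(original)
synthetic = generate(model, 1000, rng)

# Compare a marginal distribution between original and synthetic data.
synthetic_marginal = Counter(r["region"] for r in synthetic)
north_share = synthetic_marginal["north"] / len(synthetic)
print(f"share of 'north': original 0.40, synthetic {north_share:.2f}")
```

The synthetic records are drawn from the fitted tables rather than copied row by row from the original data, yet the marginal and conditional frequencies stay close to the original ones. A sketch like this offers no formal privacy guarantee, and datasets with more features would call for richer models, in line with the discussion point about matching technique to dataset complexity.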
In the second half of the meeting the participants discussed the potential benefits of the “Data Collaboration Canvas”. The Data Collaboration Canvas is a graphical workshop tool that has been developed with the help of the Expert Group. It is aimed at organizations that want to explore, at an early stage, the potential of data innovation with other organizations in order to create mutual added value. It offers a simple, visual structuring aid, e.g. in workshops, for identifying shared opportunities and hurdles of collaboration. The canvas can be used to identify data collaboration opportunities not only between organizations such as companies but also within an organization (e.g. between different divisions or departments). Participants applied the canvas in two different use cases and afterwards discussed its usability and comprehensibility.