By Kurt Stockinger, ZHAW
In our latest expert group meeting, the following talks were presented:
Methods of Statistical Disclosure Control applied on Microdata
Simon Würsten, SBB
Big Data and AI Technologies on Microsoft Azure Cloud
Gerald Reif, IPT
Reproducible Data Science
Luca Furrer, Trivadis
First, Simon Würsten from SBB introduced various methods of data anonymization. Each of the presented methods can be considered a trade-off between anonymization strength and expressiveness of the data (i.e. minimizing disclosure risk while maximizing data utility). For instance, some methods randomly change the values of data while others reshuffle the content of values between different attributes. Depending on which types of data analysis are performed, the respective anonymization methods can be chosen along with a report about the strength of the methods. The presented approaches have a very high potential to be used in various data-sensitive areas such as health care or e-government. The technology is ready to be used, for instance, in a PoC by other Alliance members (see R library sdcMicro).
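To make the trade-off between disclosure risk and data utility more tangible, here is a minimal Python sketch of two perturbation ideas mentioned in the talk: adding random noise to values and reshuffling values. It is not the sdcMicro API and not the speaker's implementation. The records, the function names add_noise and swap_values, and the noise scale are illustrative assumptions, and reshuffling is interpreted here as swapping one attribute's values across records.

```python
import random

def add_noise(values, scale, rng):
    """Perturb each numeric value with zero-mean Gaussian noise.

    A larger scale lowers disclosure risk but also lowers data utility.
    """
    return [v + rng.gauss(0.0, scale) for v in values]

def swap_values(values, rng):
    """Shuffle one attribute's values across records.

    The column keeps its overall distribution, but the link between
    a value and the record it came from is broken.
    """
    shuffled = list(values)
    rng.shuffle(shuffled)
    return shuffled

# Hypothetical microdata, invented for illustration only.
records = [
    {"age": 34, "income": 72000},
    {"age": 51, "income": 98000},
    {"age": 27, "income": 54000},
    {"age": 45, "income": 81000},
]

rng = random.Random(42)  # fixed seed so the sketch is repeatable
ages = [r["age"] for r in records]
incomes = [r["income"] for r in records]

noisy_incomes = add_noise(incomes, scale=2000.0, rng=rng)
swapped_ages = swap_values(ages, rng)

anonymized = [
    {"age": age, "income": round(income)}
    for age, income in zip(swapped_ages, noisy_incomes)
]
```

Swapping preserves per-attribute statistics exactly while noise preserves them only approximately, which is one reason the choice of method depends on the analysis that will be run on the anonymized data.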
Next, Gerald Reif from IPT presented the big data and AI architecture blueprint on the Microsoft Azure Cloud. Currently, one of the most widely used approaches is the lambda architecture, which consists of three layers: (1) the Speed Layer for real-time stream processing, (2) the Batch Layer for processing large amounts of stored data, and (3) the Service Layer for presenting and reacting to the analysis results. There is currently a clear trend of combining and consolidating big data and machine learning technology from Apache Spark and Azure PaaS services. The advantage of the combined solution is the bleeding-edge open-source technology of Apache Spark coupled with the enterprise features and user management functionality of Microsoft.
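The interplay of the three layers can be sketched in a few lines of plain Python. This is a toy illustration under our own assumptions, not the Azure blueprint from the talk: the event format, the names batch_view, SpeedLayer and serve, and the counting use case are all hypothetical, and no Spark or Azure service is involved.

```python
from collections import Counter

def batch_view(stored_events):
    """Batch Layer: recompute a complete view from all stored events."""
    return Counter(event["key"] for event in stored_events)

class SpeedLayer:
    """Speed Layer: keep an incremental view of events not yet in the batch view."""

    def __init__(self):
        self.view = Counter()

    def on_event(self, event):
        self.view[event["key"]] += 1

def serve(batch, realtime, key):
    """Service Layer: answer a query by merging the batch and real-time views."""
    return batch.get(key, 0) + realtime.get(key, 0)

# Hypothetical events: historical ones are already stored, recent ones arrive as a stream.
stored = [{"key": "zurich"}, {"key": "bern"}, {"key": "zurich"}]
batch = batch_view(stored)

speed = SpeedLayer()
for event in [{"key": "zurich"}, {"key": "basel"}]:
    speed.on_event(event)

zurich_count = serve(batch, speed.view, "zurich")
basel_count = serve(batch, speed.view, "basel")
```

In a real deployment, the real-time view is discarded once the next batch run has absorbed the same events, so that nothing is counted twice.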
Finally, Luca Furrer from Trivadis provided insights into the latest tools to enable reproducibility of data science experiments. In principle, three different aspects need to be reproducible: data, code/models, and parameters. Promising tools for these aspects are DVC, MLflow and Git. The advantage of these tools is that data scientists can easily keep the history of code and data and track the results of various machine learning experiments along with the chosen parameters. The tools integrate well together through Git.
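The three aspects can be captured in a single run record, which the following standard-library sketch illustrates. It does not use the DVC or MLflow APIs. It only shows the kind of information such tools keep together, and the names fingerprint and run_record, the commit identifier and all values are hypothetical.

```python
import hashlib
import json

def fingerprint(data: bytes) -> str:
    """Identify a dataset by the SHA-256 hash of its content."""
    return hashlib.sha256(data).hexdigest()

def run_record(data: bytes, code_version: str, params: dict, metrics: dict) -> str:
    """Tie data, code version and parameters to the results of one experiment."""
    record = {
        "data_sha256": fingerprint(data),
        "code_version": code_version,  # e.g. a Git commit hash
        "params": params,
        "metrics": metrics,
    }
    # Sorted keys make the serialized record deterministic and easy to diff.
    return json.dumps(record, sort_keys=True)

# Hypothetical experiment, invented for illustration only.
training_data = b"age,income\n34,72000\n51,98000\n"
record = run_record(
    data=training_data,
    code_version="abc1234",
    params={"learning_rate": 0.1, "max_depth": 3},
    metrics={"accuracy": 0.91},
)
```

If the data, the code version or any parameter changes, the record changes as well, which makes it possible to tell afterwards exactly which combination produced a given result.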
The presentations were followed by lively discussions about the methods, the architectures, and the experiences of using them in real life. One of the main questions was about the experience of deploying machine learning models in production over longer periods of time. A typical phenomenon is that big data and AI technology is often successfully used in proofs of concept, but there is little information on how the approaches “pass the test of time” in real production environments.
As part of a future event – and possibly in collaboration with the expert group on machine learning – we are planning to report on the experience of using machine learning models in production. Typical questions to be addressed are: What models should be deployed? How often should models be deployed? When should re-training be done? How do we handle rapidly changing data? How do models degrade over time and what can we do to mitigate model degradation?