Leveraging Synthetic Data for Secure Collaborative Machine Learning in Industrial Applications

Supervisor(s): Andy Ludwig
Status: finished
Topic: Machine Learning Methods
Author: Robert Haimerl
Submission: 2025-04-15
Type of Thesis: Master's thesis
Thesis topic in co-operation with the Fraunhofer Institute for Applied and Integrated Security AISEC, Garching

Description

Users with aligned interests often collaborate to satisfy the ever-growing need for data
for deep learning model training. At the same time, the produced data frequently
contains confidential information and is therefore not intended to be shared with other
participants. To solve this dilemma, we turn to synthetic data as an intermediate
product derived from the original measurements. Numerous approaches exist that
try to emulate the statistical properties of reference data and use them to create synthetic
clones. Still, it is unclear whether they can provide suitable training data while keeping
sensitive information secret from peers. This work compares generation methods to
assess their suitability for keeping such data confidential in a collaborative setting. We
start by evaluating the statistical properties of synthesizers for general tabular data
before focusing on a concrete use case with sequential time-series data. Our findings
show that different generators exhibit varied statistical properties, which are mostly
consistent indicators of the method used. When the synthesized data is used to train
predictive models, the resulting predictions are of comparable accuracy, with a possible
drop when relying on additional confidentiality-ensuring measures. However, we also
show that such measures can be necessary to protect against common re-identification
attacks, especially when using data prone to producing an overfitted model. Our
work demonstrates that this approach is a viable strategy for preserving the
confidentiality of collaborative training data in our case, although further research is
needed to extend the results to general industrial data.