Description
Users with aligned interests often collaborate to satisfy the ever-growing need for training data for deep learning models. At the same time, the data they produce frequently contains confidential information and is therefore not intended to be shared with other participants. To resolve this dilemma, we turn to synthetic data as an intermediate product derived from the original measurements. Numerous approaches exist that emulate the statistical properties of reference data to create synthetic clones. Still, it is unclear whether they can provide suitable training data while keeping sensitive information secret from peers. This work compares generation methods to verify their usability for keeping such data confidential in a collaborative setting. We start by evaluating the statistical properties of synthesizers for general tabular data before focusing on a concrete use case with sequential time-series data. Our findings show that different generators exhibit varied statistical properties, which are largely consistent indicators of the method used. When the synthesized data is used to train predictive models, the resulting predictions are of comparable accuracy, with a possible drop when additional confidentiality-preserving measures are applied. However, we also show that such measures can be necessary to protect against common re-identification attacks, especially when the data is prone to producing an overfitted model. Our work demonstrates that this approach is a viable strategy for preserving the confidentiality of collaborative training data in our case, with further research needed to extend the results to general industrial data.