Microsoft Research has worked with IOM to develop a new algorithm to derive “synthetic data” from CTDC’s sensitive victim case data. Rather than systematically redacting cases, which results in a substantial amount of data being suppressed, the algorithm generates a synthetic dataset that accurately preserves the statistical properties and relationships in the original data. Representative data on all of CTDC’s victim-of-trafficking cases are now available as a downloadable data file thanks to the new algorithm.
The synthetic dataset provides first-hand, critical information on the socio-demographic profile of victims, types of exploitation, and the trafficking process, including means of control used on victims. The new algorithm has enabled CTDC to share more data and allow more effective research to be conducted while protecting privacy and civil liberties. Access to additional attributes of victim case records will enable stakeholders to develop a more comprehensive understanding of this crime and the needs of survivors.
The records of the synthetic dataset no longer correspond to actual individuals; each is constructed entirely from common attribute combinations. This means that none of the attribute combinations in the synthetic dataset can be linked to distinctive individuals (or even small groups of distinctive individuals) in the sensitive dataset or in the world at large.
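To make the idea of "common attribute combinations" concrete, the following is a minimal Python sketch of one way such a property could be checked. It is not the algorithm developed by Microsoft Research. The threshold `k`, the maximum combination length `max_len`, the function names, and the toy records are all hypothetical and chosen only for illustration.

```python
from collections import Counter
from itertools import combinations


def count_combinations(records, max_len):
    """Count every attribute-value combination of up to max_len attributes."""
    counts = Counter()
    for record in records:
        items = sorted(record.items())  # stable order so combinations match
        for size in range(1, max_len + 1):
            for combo in combinations(items, size):
                counts[combo] += 1
    return counts


def rare_combinations(synthetic_record, sensitive_counts, k, max_len):
    """Return the combinations in a synthetic record seen fewer than k times
    in the sensitive data (an empty list means all are 'common')."""
    items = sorted(synthetic_record.items())
    rare = []
    for size in range(1, max_len + 1):
        for combo in combinations(items, size):
            if sensitive_counts[combo] < k:
                rare.append(combo)
    return rare


# Hypothetical toy records; the attribute names and values are illustrative only.
sensitive = [
    {"gender": "F", "age": "18-20", "exploitation": "labour"},
    {"gender": "F", "age": "18-20", "exploitation": "labour"},
    {"gender": "F", "age": "18-20", "exploitation": "labour"},
    {"gender": "M", "age": "30-38", "exploitation": "labour"},
]
counts = count_combinations(sensitive, max_len=2)

common_record = {"gender": "F", "age": "18-20", "exploitation": "labour"}
distinctive_record = {"gender": "M", "age": "30-38", "exploitation": "labour"}

common_result = rare_combinations(common_record, counts, k=3, max_len=2)
distinctive_result = rare_combinations(distinctive_record, counts, k=3, max_len=2)
print(common_result)            # no rare combinations
print(len(distinctive_result))  # several rare combinations
```

In this sketch, a synthetic record would be acceptable only if `rare_combinations` returns an empty list, meaning every short combination of its attributes is shared by at least `k` records in the sensitive data. The second toy record fails that test because it reproduces attributes that belong to a single individual.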
The new privacy-preserving synthetic data solution, developed at Microsoft Research in the Python programming language, has also been made freely available via GitHub.