This is an initial list of useful definitions for the terms used throughout the website.
Data anonymization refers to the process of removing from data any information that could lead to an individual being identified. Anonymization is a form of data de-identification in which the resulting data cannot be linked back to the original data; in other words, it cannot be ‘re-identified’. Data anonymization often includes data transformation, that is, changes to the structure or format of the data.
Since 2019, Microsoft Research has worked with IOM to develop and refine an algorithm to generate synthetic data from CTDC’s sensitive victim case data. Rather than systematically redacting cases, which results in a substantial amount of data being suppressed, the algorithm generates a synthetic dataset that accurately preserves the statistical properties and relationships in the original data. However, the records of the synthetic dataset no longer correspond to actual individuals, and each is constructed entirely from common attribute combinations. This means that none of the attribute combinations in the synthetic dataset can be linked to distinctive individuals (or even small groups of distinctive individuals) in the sensitive dataset, or in the world at large. In September 2021, CTDC released its first downloadable Global Synthetic Dataset, representing data from over 156,000 victims and survivors of trafficking across 189 countries and territories (where victims were first identified and supported by CTDC partners). In December 2022, CTDC released its second synthetic dataset, the Global Victim-Perpetrator Synthetic Dataset, which was produced using an extension of the algorithm with added support for differential privacy.
K-anonymization is a data anonymization technique that redacts cases falling into sets with fewer than k members, where each set is defined by a unique combination of values of the different variables in a dataset. This means that it is not possible to query a dataset and return fewer than a pre-determined number (k) of results, regardless of the query. The appropriate threshold depends on the nature of the dataset and its size. Based on research and testing, CTDC uses k=10, which means that cases have been redacted from the Global K-Anonymized Dataset such that queries to it cannot return fewer than 10 results.
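The redaction step described above can be illustrated with a minimal Python sketch. It is not CTDC's actual pipeline: the function name `k_anonymize`, the column names and the sample records are all hypothetical, and a small k is used so the effect is visible on a handful of rows.

```python
from collections import Counter


def k_anonymize(records, columns, k):
    """Keep only records whose combination of values across `columns`
    is shared by at least k records; all other records are redacted."""
    def key(record):
        return tuple(record[c] for c in columns)

    # Count how many records fall into each set (unique value combination).
    counts = Counter(key(r) for r in records)
    return [r for r in records if counts[key(r)] >= k]


# Hypothetical sample data: three sets with 3, 2 and 1 members.
records = (
    [{"gender": "F", "age_group": "18-20", "country": "A"}] * 3
    + [{"gender": "M", "age_group": "30-38", "country": "B"}] * 2
    + [{"gender": "F", "age_group": "9-17", "country": "C"}]
)

# With k=2, the single record from country "C" is redacted.
released = k_anonymize(records, ["gender", "age_group", "country"], k=2)
```

Because every remaining set has at least k members, any filter on these columns that matches at least one released record matches at least k of them.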
A codebook is a comprehensive record made available for anyone wishing to understand or analyse the dataset. It is particularly valuable for researchers and analysts. A codebook describes the content and variables of a dataset, including definitions and methodological considerations. It also contains the possible values and formats for all variables. Codebooks are provided on CTDC so that users can understand the different data sources of the combined dataset, as well as the particularities of each contribution.
A data dictionary describes the structure of a database or dataset by listing and classifying all variables, and specifying the format within which data is stored. It also includes lookup tables for relevant variables. It is usually aimed at helping programmers or database administrators work with a dataset. Data dictionaries are provided on CTDC especially for the use of future data contributors, so that they understand the format and values that they need to adhere to.
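As a rough illustration of how a contributor might use a data dictionary, the Python sketch below checks a record against a small dictionary with a lookup table. The variable names, allowed values and the `validate` helper are hypothetical and do not reproduce the actual CTDC data dictionary.

```python
# Hypothetical data dictionary: each variable has a storage type and,
# where relevant, a lookup table of allowed values.
DATA_DICTIONARY = {
    "year_of_registration": {"type": int, "allowed": None},
    "gender": {"type": str, "allowed": {"Female", "Male", "Other"}},
}


def validate(record, dictionary):
    """Return a list of human-readable problems found in `record`."""
    errors = []
    for name, spec in dictionary.items():
        if name not in record:
            errors.append(f"{name}: missing")
            continue
        value = record[name]
        if not isinstance(value, spec["type"]):
            errors.append(f"{name}: expected {spec['type'].__name__}")
        elif spec["allowed"] is not None and value not in spec["allowed"]:
            errors.append(f"{name}: value {value!r} not in lookup table")
    return errors


ok = validate({"year_of_registration": 2020, "gender": "Female"}, DATA_DICTIONARY)
bad = validate({"year_of_registration": "2020", "gender": "F"}, DATA_DICTIONARY)
```

Here `ok` is an empty list, while `bad` reports a wrong type for the year and a gender value that is not in the lookup table.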
A standardized dataset is a dataset for which common data definitions, formats, categories and structures of all data elements have been agreed. For the CTDC Global Dataset, data from different contributing organizations are combined and standardized in order to produce a unified dataset which adheres to these common standards.
De-identification of data refers to the process of removing or obscuring information from individual-level data in a way that minimizes the risk of an individual being identified through the data. There are different methods of data de-identification, some of which do not transform the data but allow for it to be “re-identified” and some of which permanently remove identifying features from data (such as anonymization).
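The difference between reversible and permanent de-identification can be sketched in Python as follows. This is an illustration only: the field names and both helper functions are hypothetical. Note that dropping a direct identifier is only one step towards anonymization, since combinations of the remaining attributes may still identify someone.

```python
import secrets


def pseudonymize(records, id_field):
    """Replace the identifier with a random token and return a separate
    key table. Whoever holds the key table can re-identify the data."""
    key_table = {}
    output = []
    for record in records:
        token = secrets.token_hex(8)
        key_table[token] = record[id_field]
        output.append({**record, id_field: token})
    return output, key_table


def drop_identifier(records, id_field):
    """Permanently remove the identifier; no key table is kept."""
    return [{k: v for k, v in r.items() if k != id_field} for r in records]


cases = [{"case_id": "X-001", "gender": "F"}, {"case_id": "X-002", "gender": "M"}]

pseudonymized, key_table = pseudonymize(cases, "case_id")
stripped = drop_identifier(cases, "case_id")
```

With `key_table`, each token in `pseudonymized` maps back to its original `case_id`; the records in `stripped` carry no such link.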
GIS stands for Geographic Information System. It is software that helps to visualize, analyse, and interpret geographic data to understand relationships, patterns and trends. A GIS typically allows multiple layers of geographic information to be displayed on a single map. CTDC uses GIS through the ArcGIS mapping software. This software maps the main human trafficking trends based on identified or assisted victim data, at country, state and regional levels, without pointing to specific route coordinates.