Privacy preservation in the release of statistical data has been a concern of the scientific community for decades. This preoccupation has been gradually expanding to outside of academia, and has been reflected in the widespread enactment and reinforcement of privacy-protection legislation around the world. In Brazil, the new privacy law enacted in 2018 (LGPD) establishes mandatory restrictions on governmental agencies that publicly release data on individuals, and prescribes sanctions in case of non-compliance. In this context, it is paramount for those agencies to thoroughly review and, if necessary, adapt their current methods of data publishing. However, it is well known that any disclosure control method applied to the release of statistical data may present deleterious effects on data utility, i.e. on the quality of information provided to legitimate consumers, such as analysts and society as a whole. A fine balance between privacy and utility must be achieved, taking into consideration the interests of several stakeholders, including data owners, legitimate data consumers, and the government.
In this thesis, we provide a thorough quantitative study of privacy risks in the release of the official Brazilian Educational Censuses provided annually by INEP, which is Brazil’s governmental agency responsible for the development and maintenance of educational statistics systems. More precisely, we formally analyze privacy risks in databases released as microdata, i.e. data at each individual’s record level, and protected by the technique of de-identification, i.e. the removal of direct identifying information such as the individuals' names or personal identification numbers.
In order to do so, we propose a unified classification system for attacks, which allows us to properly cover and formalize the landscape of privacy risks in the Educational Censuses. Our first contribution are models of attacks rigorously formalized in the framework of quantitative information flow, defined along three orthogonal dimensions: (i) risk of re-identification vs. risk of attribute-inference; (ii) attacks on a single database vs. attacks on longitudinal databases, i.e. those that are updated and extended frequently, as in the case of INEP’s Censuses; and (iii) deterministic vs. probabilistic measures of privacy risk.
As a second contribution, we employ our formal models to obtain extensive quantitative evaluations of privacy risks on INEP’s Educational Census databases, which account for more than fifty million students, or around 25% of the country’s current population. Those experiments unequivocally show that INEP’s current disclosure control methods are insufficient to guarantee individuals' privacy at any acceptable level, and therefore may be in contempt with Brazil’s new privacy legislation. For instance, 81.13% of students in the School Census of 2019, corresponding to approximately 39,085,531 individuals, may be subject to complete re-identification under reasonably modest attacks. We argue, therefore, that INEP should abandon current practices and consider stricter disclosure control methods.
As a third contribution, we formally evaluate the trade-off between privacy and utility in two variants of differential privacy –the golden standard disclosure control technique in the literature– as the method to be employed to INEP’s Educational Censuses releases. Our results confirm that global differential privacy tends to favor utility over privacy, whereas local differential privacy tends to act in the opposite way.
To the best of our knowledge, our analyses are the most extensive of its kind in the literature. Furthermore, our results provide INEP with solid empirical evidence to guide well-informed future decisions when complying to Brazil’s new privacy legislation, and have the potential to positively impact a significant fraction of the Brazilian population.