M. Finck & A. Biega: Purpose Limitation and Data Minimisation in Data-Driven Systems

Highlights summarized by Sabrina Breyer

Highlights of the VEIL

In this Virtual Ethical Innovation Lecture, Michèle Finck and Asia Biega talk about the GDPR’s two principles of purpose limitation and data minimisation, their conflicts and potential solutions. The talk contained the gist of their publication “Reviving Purpose Limitation and Data Minimisation in Data-Driven Systems” (2021), focussing on the widespread notion that the principles both cannot be implemented meaningfully in contemporary data processing environments and stifle innovation. Looking ahead to the EU’s proposed Data Governance Act, which will incentivise a further increase in (personal) data processing, the authors expect sustained and controversial discussions about these principles and the need to reconcile them.

Purpose Limitation and Data Minimisation

Article 5(1)(b) of the GDPR sets out the principle of purpose limitation, which states that personal data can only be used for pre-defined purposes. This requires two aspects to be considered before data is actually collected: (a) purpose specification: the purpose must be stated explicitly (general references to overarching business purposes are not allowed) and must be legitimate (a purpose cannot give ground to illegitimate actions); (b) compatible use: further processing must not go beyond the defined purposes, with some exemptions, e.g., for scientific research, statistics, and cases in which the data subject has given consent.

Article 5(1)(c) of the GDPR says personal data shall be “adequate, relevant and limited to what is necessary in relation to the purposes for which they are processed”. Requiring data adequacy might seem counter-intuitive, because it can lead to the need for processing even more data, e.g., when trying to make data sets representative of a given population. Additionally, data needs to be relevant and necessary with respect to the defined purpose. In qualitative terms, data minimisation also requires preferring pseudonymised over plain-text data and ordinary over sensitive personal data.

A Computational Perspective: In Search of Conflicts

Biega and Finck looked for potential conflicts from a computational perspective. Several existing computational techniques can be regarded as serving the principle of data minimisation, such as feature selection (e.g. removal of certain features about users), outlier detection (e.g. removal of noisy data), and active learning (e.g. selectively labelling data). Besides enhancing the quality of the models, these techniques also help to stay within budget constraints. Hence, there seems to be no obvious general conflict between data minimisation and scientific and economic benefits.
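
To illustrate how feature selection can double as data minimisation, the following is a minimal sketch in Python. It assumes hypothetical user records and uses a simple variance filter chosen purely for illustration; the names `select_features`, `minimise` and the example features are assumptions and do not come from the lecture.

```python
from statistics import pvariance


def select_features(rows, min_variance=1e-3):
    """Return the names of features whose variance across users exceeds min_variance.

    rows: list of dicts mapping feature name -> numeric value (hypothetical records).
    """
    names = sorted(rows[0].keys())
    return [n for n in names if pvariance([r[n] for r in rows]) > min_variance]


def minimise(rows, keep):
    """Drop every feature that is not in keep, so it is not passed on downstream."""
    return [{n: r[n] for n in keep} for r in rows]


# Hypothetical user records: "has_account" is identical for every user,
# so it carries no information for a model and need not be collected.
users = [
    {"age": 23, "clicks": 10, "has_account": 1},
    {"age": 41, "clicks": 3, "has_account": 1},
    {"age": 35, "clicks": 7, "has_account": 1},
]

kept = select_features(users)
reduced = minimise(users, kept)
print(kept)
```

A variance filter is only one of many feature selection strategies; in practice the choice of features to drop would be validated against the model's performance on the stated purpose.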

A further aspect that indicates why the GDPR’s principles might not be as controversial as is often asserted is empirical evidence from the literature. Krause and Horvitz demonstrate in their “Utility-Theoretic Approach to Privacy and Personalization” (2008) that, beyond some point, collecting more data does not significantly improve the model’s quality. Furthermore, Hestness et al. (2017) show in their paper “Deep Learning Scaling is Predictable, Empirically” that model performance passes through three stages as the amount of training data grows. In the Small Data Region the model has too little data to learn effectively and needs to be enriched with more. In the Power-law Region more data reduces the generalization error along a power law, which appears as a straight line on a log-log scale. In the Irreducible Error Region additional data no longer reduces the error. Consequently, from the perspective of both data protection and data minimisation, collecting as much data as possible is questionable.

Building on these findings, Biega and colleagues sought to replicate this effect on a number of different data sets and observed it, e.g., on the GoogleLocal data set, which contains ratings of businesses (see Shanmugam et al. 2021, “Learning to Limit Data Collection via Scaling Laws”).
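
The diminishing returns described above can be sketched as a stopping rule. The following Python example is a minimal sketch, not the method of Shanmugam et al.: it assumes that the error follows a pure power law error ≈ alpha · size^beta (ignoring any irreducible error floor), fits alpha and beta by least squares in log-log space, and stops "collecting" once doubling the data is predicted to gain less than a chosen threshold. The measurements, the threshold `min_gain` and all function names are hypothetical.

```python
import math


def fit_power_law(sizes, errors):
    """Least-squares fit of log(error) = log(alpha) + beta * log(size)."""
    xs = [math.log(m) for m in sizes]
    ys = [math.log(e) for e in errors]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    beta = sxy / sxx
    alpha = math.exp(mean_y - beta * mean_x)
    return alpha, beta


def predicted_error(alpha, beta, size):
    """Error predicted by the fitted power law at a given data set size."""
    return alpha * size ** beta


def gain_from_doubling(alpha, beta, size):
    """Predicted absolute error reduction if the data set size doubles."""
    return predicted_error(alpha, beta, size) - predicted_error(alpha, beta, 2 * size)


# Synthetic learning curve that follows an exact power law (illustration only).
sizes = [1000, 2000, 4000, 8000]
errors = [2.0 * m ** -0.5 for m in sizes]
alpha, beta = fit_power_law(sizes, errors)

# Keep doubling the data only while the predicted gain justifies it.
min_gain = 0.005
size = sizes[-1]
while gain_from_doubling(alpha, beta, size) >= min_gain:
    size *= 2

print(alpha, beta, size)
```

The threshold plays the role of a purpose-bound performance requirement: once the predicted benefit of further collection falls below it, additional personal data is hard to justify as necessary.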

What are obstacles to adoption?

Finck and Biega offered some hypotheses on potential obstacles to adopting the principles of purpose limitation and data minimisation. One obstacle might be the scarcity of guidelines, which is supported by the article “Understanding Software Developers’ Approach towards Implementing Data Minimization” (Senarath and Arachchilage 2018). There appears to be a wide divergence in understanding what these principles mean and could look like in practice, showing a need for guidelines on how to implement and to satisfy the requirements of these principles. Especially regarding data-driven systems there is a lack of operational definitions. For instance, the UK and Norwegian data protection authorities’ guidelines do not cover data-driven systems and only mention techniques without detailed guidelines.

How to operationalise purpose limitation and data minimisation?

Asia Biega outlines some questions to be considered in operationalising the principles in data-driven systems (e.g., recommender systems), extracted from “Operationalizing the Legal Principle of Data Minimization for Personalization” (Biega et al. 2020):

  1. What is the purpose of personal data collection?
    Observation shows that in personalisation, personal data is collected not necessarily to deliver, but rather to improve, the service. The proposal is to tie the processing to improvements in performance metrics.
  2. What does it mean to limit data in relation to the purpose?
    From a global perspective, data should be limited based on a threshold on the loss in global model performance. From a per-user perspective, data should be limited based on a threshold on the loss in per-user model performance.
  3. What are the trade-offs and consequences of data minimisation according to those interpretations?
    In an experiment, an observational pool was selected for each minimising user in order to measure the resulting loss in model quality.


  • As some items from the observational pool were selected, the error distribution across the population remains the same (across various minimisation strategies).
  • A large portion of the data can be minimised out without decreasing quality.
  • The results depend on the base algorithm: k-nearest neighbour is, e.g., less robust to minimisation than singular value decomposition (a matrix factorisation technique).
  • Minimisation might be feasible globally, but rather hard per user. The loss in quality incurred for a single user might be quite large. Some users gain, some lose.
  • There is no clear correlation between the error delta and the number of ratings, the average rating value, the average popularity of items, the number of genres, the average similarity to all system users, or the average similarity to the 30 closest system users.
  • It can be reasonable to ask a few users to provide more specific data to enhance the overall quality of the model without collecting these data from all users. This was revealed by testing the impact on users when minimising with active feature selection, which analyses which data have an impact and which are redundant, so that collection can stop as early as possible.
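
The difference between the global and the per-user interpretation can be made concrete with a small check. The following Python example is a minimal sketch under stated assumptions: the per-user error values, the threshold `max_loss` and the function names are hypothetical, and the code only tests the two criteria for a given minimisation result rather than reproducing the experiments of Biega et al.

```python
def global_ok(full_err, min_err, max_loss):
    """Global criterion: the average error increase across users stays within max_loss."""
    deltas = [min_err[u] - full_err[u] for u in full_err]
    return sum(deltas) / len(deltas) <= max_loss


def users_over_threshold(full_err, min_err, max_loss):
    """Per-user criterion: list the users whose own error increase exceeds max_loss."""
    return sorted(u for u in full_err if min_err[u] - full_err[u] > max_loss)


# Hypothetical per-user errors (e.g. RMSE) with full data and after minimisation.
full_err = {"u1": 0.90, "u2": 1.00, "u3": 0.80, "u4": 1.10}
min_err = {"u1": 0.88, "u2": 1.02, "u3": 1.10, "u4": 1.08}
max_loss = 0.10

passes_globally = global_ok(full_err, min_err, max_loss)
violations = users_over_threshold(full_err, min_err, max_loss)
print(passes_globally, violations)
```

In this made-up example the average loss stays below the threshold, so minimisation passes the global criterion, while one user bears a loss well above it and two users even improve slightly. This mirrors the observation that some users gain and some lose.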

Challenges in the Practical Implementation of Data Protection Law

Michèle Finck and Asia Biega conclude that compliance with the principles of purpose limitation and data minimisation is possible. However, this comes with unacknowledged trade-offs from a legal perspective. One is the difficulty of quantifying legal requirements: there is no concrete guidance and, hence, money could be wasted. Trade-offs between different GDPR principles can also inhibit a structured implementation: the fairness principle may conflict with data minimisation, the GDPR prescribes no hierarchy, and it is not clear how to reconcile principles when they conflict. There are also economic and environmental costs of enforcing data subject rights. For instance, the right to deletion and the right to withdraw consent might require costly retraining as long as there is a lack of efficient ways to remove personal data from a trained model. An overarching and well-known challenge of the GDPR’s practical implementation is the cost of compliance versus the unlikelihood of enforcement.


Many questions remain to be answered in order to support compliance with and implementation of data protection law. They concern topics such as mathematical interpretations of the principles, decision rules for when to retain data, machine learning models to automate compliance, quantitative studies to understand the effects of enforcement, qualitative studies to understand the value of data, auditing for compliance, and balancing different data protection law principles. A number of computational approaches are also needed, e.g., to help individuals decide whether to trade their data in return for what they get. In addition to the present difficulties for white box techniques, it remains unclear how black box models can be augmented such that these principles can be balanced computationally.


Questions during the Q&A session revolved around what constitutes good data science when using as little data as possible, and how to keep up with technological advances when formulating guidelines, with a specific focus on medicine.


Biega, A. J., Potash, P., Daumé III, H., Diaz, F., Finck, M. (2020): Operationalizing the Legal Principle of Data Minimization for Personalization. Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval (Association for Computing Machinery 2020). https://doi.org/10.48550/arXiv.2005.13718

Finck, M., Biega, A. (2021): Reviving Purpose Limitation and Data Minimisation in Data-Driven Systems. Technology and Regulation, 44-61. https://doi.org/10.26116/techreg.2021.004

Hestness, J., Narang, S., Ardalani, N., Diamos, G., Jun, H., Kianinejad, H., Patwary, Md. M. A., Yang, Y., Zhou, Y. (2017): Deep Learning Scaling is Predictable, Empirically. https://doi.org/10.48550/arXiv.1712.00409

Krause, A., Horvitz, E. (2008): A Utility-Theoretic Approach to Privacy and Personalization. 39 Journal of Artificial Intelligence Research 633. https://www.aaai.org/Papers/AAAI/2008/AAAI08-187.pdf

Shanmugam, D., Shabanian, S., Diaz, F., Finck, M., Biega, A. (2021): Learning to Limit Data Collection via Scaling Laws: Data Minimization Compliance in Practice. https://doi.org/10.48550/arXiv.2107.08096

Senarath, A., Arachchilage, N. A. G. (2018): Understanding Software Developers’ Approach towards Implementing Data Minimization. SOUPS. https://doi.org/10.48550/arXiv.1808.01479