Data Quality Frameworks

Getting the requirements right for the data set used to train AI systems is a hot topic. It’s a complex issue that Data Quality Frameworks can help to resolve.

Effective Data Quality Frameworks (DQFs) today monitor a range of data quality dimensions, including completeness, accuracy, validity, timeliness, accessibility and consistency. Few DQFs, however, measure fairness, in the sense of historical biases in the data. Such biases are undesirable from the start and may even intensify over time. They can lead to untrustworthy learning algorithms or AI models, which unfortunately are already in use in many systems today. A well-known example is the gender bias discovered in Amazon’s recruiting tool in 2018. Two years earlier, a racial bias was discovered in the methodology used to predict the risk of recidivism (reoffending) in the state of Florida.

Good Data Quality Management ensures efficient operation of DQFs, for instance by helping to determine in advance what the data could be used for, investigating the potential modelling possibilities and providing a stimulus for further model development. Good data lineage helps to understand how the current data files were formed and what they are used for. This, in turn, makes errors and deviations in the data easier to reproduce and trace, which is especially useful since terms, conditions and even definitions change over time.

A European approach to excellence and trust

One of the ambitions of EU Commission President Ursula von der Leyen is a coordinated European approach to the human and ethical implications of AI, as well as a reflection on the better use of big data for innovation. In its recent White Paper, Artificial Intelligence – A European approach to excellence and trust, the European Commission sets out the following requirement for the data sets used to train AI systems:

“Requirements to take reasonable measures aimed at ensuring that such subsequent use of AI systems does not lead to outcomes entailing prohibited discrimination. These requirements could entail in particular obligations to use data sets that are sufficiently representative, especially to ensure that all relevant dimensions of gender, ethnicity and other possible grounds of prohibited discrimination are appropriately reflected in those data sets.”

Here, I want to stress that “representative” should be read as desired representativeness and not as historically factual representativeness. Training data is by default a historical view of reality, and that view may contain an unfair historical bias, which is not something we aspire to as a society.

In other words, sometimes we actually want to change the factual but unfair distribution, which only works if fairness is one of the Data Quality Dimensions. For example, in the case of the Amazon HR system, we would want to improve the representation of women in the historical employee data set, moving the unfair distribution towards the desired representativeness.
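To make the difference between factual and desired representativeness concrete, the minimal sketch below compares the observed share of each group in a data set with a target share. The function name `representation_gap`, the record layout and the 50/50 target are assumptions made purely for illustration; what the desired distribution should be is a choice each organisation has to make for itself.

```python
from collections import Counter


def representation_gap(records, attribute, desired_shares):
    """Return observed share minus desired share for each group.

    A negative value means the group is under-represented relative
    to the (hypothetical) target distribution.
    """
    counts = Counter(record[attribute] for record in records)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("cannot measure representation on an empty data set")
    return {
        group: counts.get(group, 0) / total - target
        for group, target in desired_shares.items()
    }


# Illustrative historical employee records: 8 men and 2 women.
employees = [{"gender": "m"}] * 8 + [{"gender": "f"}] * 2

# Assumed target for this example only: an equal split.
gaps = representation_gap(employees, "gender", {"m": 0.5, "f": 0.5})
```

In this made-up data set women are under-represented by roughly 30 percentage points against the assumed target, which is the kind of figure a fairness dimension in a DQF could report alongside completeness or validity.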

Four practical steps to add fairness to Data Quality Frameworks:

  1. Discuss and define what fairness means for your organization and your data use-cases
  2. Gather a factually representative subset of data
  3. Measure fairness
  4. Optimize fairness, by either
    a. Excluding unfair features, like gender or ethnicity (only possible with direct structured inputs)
    b. Counteracting unfair biases through positive discrimination, e.g. by over- or under-sampling.
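Steps 3 and 4b can be sketched in a few lines of Python. The example below is only one possible reading: it assumes fairness is defined as demographic parity (equal rates of a positive outcome across groups), which is just one of several definitions an organisation might agree on in step 1. The function names, the `gender` and `hired` fields and the data are hypothetical. The oversampling simply duplicates positive rows of the lower-rate groups until they reach the highest observed rate.

```python
import random
from collections import defaultdict


def group_counts(rows, group_key, label_key):
    """Return {group: [positives, total]} for a binary label."""
    counts = defaultdict(lambda: [0, 0])
    for row in rows:
        counts[row[group_key]][0] += 1 if row[label_key] else 0
        counts[row[group_key]][1] += 1
    return dict(counts)


def parity_difference(rows, group_key, label_key):
    """Step 3: largest gap in positive-label rate between any two groups."""
    rates = [pos / total for pos, total in group_counts(rows, group_key, label_key).values()]
    return max(rates) - min(rates)


def oversample_to_parity(rows, group_key, label_key, seed=0):
    """Step 4b: duplicate positive rows until every group reaches the best rate."""
    rng = random.Random(seed)
    counts = group_counts(rows, group_key, label_key)
    best_pos, best_total = max(counts.values(), key=lambda c: c[0] / c[1])
    balanced = list(rows)
    for group, (pos, total) in counts.items():
        neg = total - pos
        if best_pos == best_total:
            if neg:
                raise ValueError("a 100% target rate cannot be reached by oversampling")
            continue
        # Smallest positive count p with p / (p + neg) >= best_pos / best_total,
        # computed with integer ceiling division to avoid rounding errors.
        needed = -(-best_pos * neg // (best_total - best_pos)) - pos
        pool = [r for r in rows if r[group_key] == group and r[label_key]]
        if needed > 0 and not pool:
            raise ValueError(f"group {group!r} has no positive rows to duplicate")
        balanced.extend(rng.choices(pool, k=needed))
    return balanced


# Illustrative hiring history: 6 of 10 men hired, 2 of 10 women hired.
history = (
    [{"gender": "m", "hired": True}] * 6 + [{"gender": "m", "hired": False}] * 4
    + [{"gender": "f", "hired": True}] * 2 + [{"gender": "f", "hired": False}] * 8
)

before = parity_difference(history, "gender", "hired")
balanced = oversample_to_parity(history, "gender", "hired")
after = parity_difference(balanced, "gender", "hired")
```

Oversampling keeps all original rows but repeats some of them, so a model may overfit to the duplicated examples. Under-sampling the over-represented outcomes is the mirror-image option and trades that risk for a smaller data set.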

Showing biases in data and openly discussing the ethics around fairness (step 1) increases awareness in the early stages of data preparation. It also strengthens employees’ ability to detect, cope with and discuss biases. Discussions about ethics and fairness are important, but they also require sensitivity and subtlety. Safety must be an integral part of the culture, and key functions in the organisation must set the example. Ethics and fairness in data should be recognized as a priority right up to board level in the governance structure.

This is the first in a series of articles on how to add fairness to Data Quality Frameworks. In my next article I will look in more detail at step 2: how to gather a factually representative data subset.

Have you incorporated fairness into your Data Quality Management? Would you like to know more about Data Quality Management or what the Amsterdam Data Collective can do for you? Get in touch with Casper Rutjes, or check our contact page.