Proper data quality frameworks (DQFs) today monitor a range of data quality dimensions, such as completeness, accuracy, validity, timeliness, accessibility and consistency. But few DQFs measure fairness, in the sense of historical biases that we may not want to preserve, let alone amplify, in the future. Unmeasured bias of this kind can lead to untrustworthy learning algorithms or AI models, which are widely used in systems such as financial rating systems, HR systems and fraud detection software.
Furthermore, good data management can help determine in advance what the data can be used for and which modelling possibilities exist, which in turn stimulates further model development. Good data lineage then helps to understand how the current data files were formed and what they are used for. This makes errors and deviations in the data easier to reproduce and trace, which is especially useful as terms, conditions and even definitions change over time.
In its recent White Paper on Artificial Intelligence, A European approach to excellence and trust, the European Commission sets out one of the requirements for the data sets used to train AI systems:
“Requirements to take reasonable measures aimed at ensuring that such subsequent use of AI systems does not lead to outcomes entailing prohibited discrimination. These requirements could entail in particular obligations to use data sets that are sufficiently representative, especially to ensure that all relevant dimensions of gender, ethnicity and other possible grounds of prohibited discrimination are appropriately reflected in those data sets”
Here, I want to stress that representative should be read as desired representativeness, not historically factual representativeness. Training data is by default a historical view of reality, and it may carry an unfair historical bias that we as a society do not want. In other words, sometimes we actually want to change the factual but unfair distribution, which only works if fairness is one of the data quality dimensions.
In practical steps:
- Discuss and define what fairness means for your organization and your data use cases
- Gather a factually representative subset of the data
- Measure fairness
- Optimize fairness, by either
a. Excluding unfair features, like gender or ethnicity (only possible when these are direct, structured inputs)
b. Counteracting unfair biases through positive discrimination, e.g. by over- or under-sampling.
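To make the measuring and optimizing steps above a bit more concrete, here is a minimal Python sketch. It is only one possible reading of those steps: the metric (the lowest positive-outcome rate per group divided by the highest), the field names `group` and `hired`, the toy data and the helper names are all illustrative assumptions, not a prescribed standard. What fairness means for your use case still has to come out of the discussion in the first step.

```python
import math
import random
from collections import defaultdict


def selection_rates(rows, group_key, outcome_key):
    """Return the share of positive outcomes per group."""
    totals = defaultdict(int)
    positives = defaultdict(int)
    for row in rows:
        totals[row[group_key]] += 1
        if row[outcome_key]:
            positives[row[group_key]] += 1
    return {group: positives[group] / totals[group] for group in totals}


def parity_ratio(rates):
    """Lowest selection rate divided by the highest; 1.0 means equal rates."""
    highest = max(rates.values())
    return min(rates.values()) / highest if highest > 0 else 1.0


def oversample_to_parity(rows, group_key, outcome_key, seed=0):
    """Over-sample positive rows of disadvantaged groups (hypothetical helper).

    Positive-outcome rows are duplicated at random until each group reaches
    at least the highest selection rate found in the data. Groups without
    any positive row cannot be repaired this way and are left unchanged.
    """
    rng = random.Random(seed)
    rates = selection_rates(rows, group_key, outcome_key)
    target = max(rates.values())
    if target >= 1.0:
        raise ValueError("target rate of 1.0 cannot be reached by over-sampling")

    balanced = list(rows)
    for group, rate in rates.items():
        members = [r for r in rows if r[group_key] == group]
        positives = [r for r in members if r[outcome_key]]
        if rate >= target or not positives:
            continue
        n, p = len(members), len(positives)
        # Solve (p + k) / (n + k) >= target for the number of extra rows k.
        extra = math.ceil((target * n - p) / (1 - target) - 1e-9)
        balanced.extend(dict(rng.choice(positives)) for _ in range(extra))
    return balanced


# Toy historical data (assumed): group A was hired 6 out of 10 times,
# group B only 3 out of 10 times.
history = (
    [{"group": "A", "hired": True} for _ in range(6)]
    + [{"group": "A", "hired": False} for _ in range(4)]
    + [{"group": "B", "hired": True} for _ in range(3)]
    + [{"group": "B", "hired": False} for _ in range(7)]
)

before = parity_ratio(selection_rates(history, "group", "hired"))
balanced = oversample_to_parity(history, "group", "hired")
after = parity_ratio(selection_rates(balanced, "group", "hired"))
print(f"parity ratio before: {before:.2f}, after: {after:.2f}")
```

Note that over-sampling only changes the distribution the model is trained on. It does not remove proxies for gender or ethnicity hidden in other features, so measuring fairness again on the model output remains necessary.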
Showing biases in data and openly discussing the ethics around fairness (step 1) increases awareness in the early stages of data preparation. It also builds employees' skills in detecting, coping with and discussing biases. Ethics and fairness discussions are important but also sensitive and subtle. A safe environment for them requires a culture that starts with the key functions in the organization and recognizes these topics as a priority up to board level in the governance structure.
What is your opinion?
Have you incorporated fairness into your data quality management (DQM)?