Organisations face various challenges when it comes to collaboration and data sharing. Parties may be unable or unwilling to share data due to privacy concerns, legal restrictions, or competitive reasons. However, advancements in technology have enabled the development of solutions such as multi-party computation (MPC), which allows for secure and privacy-preserving collaborations. One company revolutionising MPC is Roseman Labs. In this article, we will explore the applications and benefits of MPC technologies, including Roseman Labs’ Virtual Data Lake (VDL) product.
Unleashing the Power of MPC with Roseman Labs
Roseman Labs is a technology company that is transforming the way organisations handle sensitive data. Founded by Niek Bouman, Roderick Rodenburg, and Toon Segers, the company builds its technology around multi-party computation (MPC). MPC is a cryptographic technique that allows multiple parties to compute jointly on their private data without revealing that data to each other: each party learns the agreed result of the computation, but not the other parties' inputs.
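To make the idea concrete, here is a minimal sketch of additive secret sharing, one of the classic building blocks behind MPC. It is a toy illustration only: the source does not describe which protocol Roseman Labs uses, and the function names (`share`, `reconstruct`, `secure_sum`) and the choice of modulus are assumptions made for this example. Three parties compute the sum of their private numbers while each individual share looks like random noise.

```python
import secrets

# Toy additive secret sharing. The modulus is an illustrative choice,
# not a parameter taken from any real product.
PRIME = 2**61 - 1


def share(value, n_parties):
    """Split `value` into n random shares that sum to it modulo PRIME."""
    shares = [secrets.randbelow(PRIME) for _ in range(n_parties - 1)]
    shares.append((value - sum(shares)) % PRIME)
    return shares


def reconstruct(shares):
    """Recombine shares; this only works when all of them are present."""
    return sum(shares) % PRIME


def secure_sum(private_inputs):
    """Sum the inputs without any party seeing another party's raw value."""
    n = len(private_inputs)
    # Each party splits its input and hands one share to every party.
    distributed = [share(x, n) for x in private_inputs]
    # Party j adds up the shares it received; alone, this partial sum
    # reveals nothing about any single input.
    partial_sums = [
        sum(distributed[i][j] for i in range(n)) % PRIME for j in range(n)
    ]
    # Only the partial sums are published and combined.
    return reconstruct(partial_sums)


total = secure_sum([120, 75, 310])
print(total)  # 505
```

Production MPC systems add much more on top of this, such as multiplication of shared values and protection against misbehaving parties, but the principle is the same: the result is revealed while the inputs are not.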
Although MPC has been around since the 1980s, it was not until Niek’s breakthrough that it became fast enough to use in production applications. The first use cases followed quickly; some of the first products Roseman Labs developed with partners included a private survey tool for the National Cyber Security Center (NCSC) in the Netherlands and a platform for information exchange between parties involved in fighting human trafficking.
From there, Roseman Labs developed the Virtual Data Lake (VDL), which allows data scientists to work with MPC as if it were a normal data science product. With the VDL, the complexity of MPC is abstracted away from users, who can interact with the data through a familiar Python interface. Overall, Roseman Labs’ technology is ground-breaking for sensitive data analysis, allowing organisations to collaborate and share information securely.
Ensuring High-Quality Data Sets for Secure Analysis
When our team at ADC first learned about the VDL, one of the issues we raised was data quality, which is always a challenge when building a data science model. In secure data analysis with MPC, ensuring the quality of datasets is just as crucial. However, since the actual data remains hidden to prevent unauthorised access, how can the data quality be checked?
In principle, the Virtual Data Lake (VDL) allows the user to perform various checks on the shared data, such as detecting missing values or unusual distributions. Whether such checks are allowed, however, depends on the use case and the sensitivity of the data. Parties that want to build a model or perform analysis on shared data are therefore advised to align on definitions and data quality before encrypting and uploading the data to the VDL.
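As an illustration of what aligning on definitions and data quality before upload could look like, the sketch below runs a simple local check on each party's own rows against an agreed schema. It is plain Python and does not use the VDL or Crandas; the column names, the example rows, and the `quality_report` helper are all hypothetical.

```python
# Hypothetical schema that the collaborating parties agreed on beforehand.
EXPECTED_COLUMNS = ["income", "loan_amount", "defaulted"]


def quality_report(rows, expected_columns):
    """Summarise missing values and unexpected columns in one party's rows."""
    report = {"row_count": len(rows), "missing": {}, "unexpected_columns": set()}
    for column in expected_columns:
        # Absent keys, None, and empty strings all count as missing.
        report["missing"][column] = sum(
            1 for row in rows if row.get(column) in (None, "")
        )
    for row in rows:
        # Columns outside the agreed schema hint at misaligned definitions.
        report["unexpected_columns"] |= set(row) - set(expected_columns)
    return report


# Made-up rows; each party would run the check on its own data only.
bank_a_rows = [
    {"income": 42000, "loan_amount": 10000, "defaulted": 0},
    {"income": None, "loan_amount": 5000, "defaulted": 1},
]
bank_b_rows = [
    {"income": 38000, "loan_amount": "", "defaulted": 0, "branch": "X"},
]

report_a = quality_report(bank_a_rows, EXPECTED_COLUMNS)
report_b = quality_report(bank_b_rows, EXPECTED_COLUMNS)
print(report_a)
print(report_b)
```

Because each party runs such a check locally, only the summary (counts and column names) needs to be discussed with the other parties, not the underlying records.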
When we worked with Roseman Labs, they were exploring ways to provide more feedback to users during the data uploading process. This included offering to work with users beforehand to ensure data compliance. They were eager to continue improving their functionality for assessing data quality. Consequently, we collaborated with them to put the VDL product to the test and provide feedback from a data science perspective.
A Use Case Test on Default Model Prediction
The collaboration between ADC and Roseman Labs began with a workshop hosted by Roseman Labs at the ADC office, where our consultants were introduced to Multi-Party Computation (MPC) and its potential use cases. From there, Joost Veenkamp (Project Lead, Advanced Analytics) became interested in Roseman Labs’ innovative approach to dealing with privacy concerns around sensitive data and proposed testing the VDL from a data science perspective.
A small working group of ADC consultants, led by Joost, took up the challenge of building a data science model on the VDL. The selected use case was a credit risk model for a bank, more specifically a Probability of Default (PD) model. Using an open-source banking dataset, the team simulated a situation in which two banks want to cooperatively build a PD model on data that they share on the VDL. The team built the model following the ADC way of working, which includes steps such as exploratory data analysis, data quality assessment, model evaluation, testing, and validation.
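For readers unfamiliar with PD models, the sketch below shows the general shape of one: a logistic regression that maps borrower features to a probability of default. It is not the model the team built and it does not run on the VDL; it is plain Python on made-up rows that are pooled in the clear, whereas on the VDL the pooled data would remain hidden. The feature names, data values, and hyperparameters are assumptions chosen for illustration.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(features, labels, learning_rate=0.1, epochs=2000):
    """Fit weights and a bias by batch gradient descent on the log-loss."""
    n_samples, n_features = len(features), len(features[0])
    weights, bias = [0.0] * n_features, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * n_features, 0.0
        for x, y in zip(features, labels):
            prediction = sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)
            error = prediction - y
            for j in range(n_features):
                grad_w[j] += error * x[j]
            grad_b += error
        weights = [
            w - learning_rate * g / n_samples for w, g in zip(weights, grad_w)
        ]
        bias -= learning_rate * grad_b / n_samples
    return weights, bias


def predict_pd(weights, bias, x):
    """Return the estimated probability of default for one borrower."""
    return sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)


# Made-up rows: ([debt_to_income, missed_payments], defaulted)
bank_a = [([0.1, 0], 0), ([0.2, 0], 0), ([0.8, 3], 1), ([0.7, 2], 1)]
bank_b = [([0.3, 1], 0), ([0.9, 4], 1), ([0.15, 0], 0), ([0.6, 3], 1)]

pooled = bank_a + bank_b
X = [x for x, _ in pooled]
y = [label for _, label in pooled]

weights, bias = fit_logistic(X, y)
low_risk = predict_pd(weights, bias, [0.1, 0])
high_risk = predict_pd(weights, bias, [0.9, 4])
print(f"PD low-risk borrower:  {low_risk:.3f}")
print(f"PD high-risk borrower: {high_risk:.3f}")
```

The point of pooling is that neither bank's four rows alone cover the full range of borrowers; a model trained on the combined rows sees more variation than either party could provide by itself.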
With the necessary support from Roseman Labs, our team was able to test the full modelling cycle for the use case. We split the dataset into several parts, uploaded the data to the VDL, and built the PD model using the Python interface to the VDL: Roseman Labs’ Crandas library. Naturally, the main challenge in building a high-quality, well-performing model is that you cannot view the data itself. Instead, you work with the data through functions that the data-sharing parties have agreed on and that are subsequently made available on the VDL. For each step in the modelling cycle, the team assessed how well it could complete the work needed to arrive at a functional and well-performing model.
The Potential of MPC for Secure Data Collaboration
Overall, the conclusion of this project was that the Python interface to the VDL offers data scientists an intuitive way to work with the data. Furthermore, the project demonstrated that it is possible to build a well-functioning, high-quality model on the VDL. Naturally, this comes with a few conditions that data scientists must be aware of. For example, assessing data quality and making sure that data definitions align between datasets is an essential step in any modelling process, but even more so when using the VDL.
Roseman Labs’ VDL product and other MPC technologies have the potential to transform how companies collaborate in industries such as finance, healthcare, and the public sector. For instance, banks can leverage transaction or client data from multiple sources without openly sharing sensitive data that legal or privacy restrictions prevent them from exchanging.
Similarly, MPC enables healthcare providers to collaborate without compromising patient privacy, allowing for research and analysis while respecting data privacy regulations. With the VDL technology, researchers can distill valuable insights about groups of patients without seeing individual patient data, thereby promoting both data privacy and research outcomes.
Furthermore, specific use cases such as quality registries in healthcare can benefit significantly from MPC technology. The patient journey is a crucial factor in healthcare research, and linking patient data across multiple providers can be highly informative. These examples highlight the power of MPC to drive collaboration and insights in industries where data privacy and sharing are paramount concerns.