Please use this identifier to cite or link to this item: http://hdl.handle.net/1942/47340
Title: Accelerating Open Data Integration of Real-World Health Data Silos
Authors: PARCIAK, Marcel 
Advisors: Vansummeren, Stijn
Issue Date: 2025
Publisher: UHasselt
Abstract: Medical information collected during routine healthcare, or real-world data (RWD), can facilitate medical research projects and allow the generation of real-world evidence. While large amounts of RWD are generated on a daily basis, they remain locked in autonomous and heterogeneous health data silos, inaccessible to medical data analysts. To make RWD accessible, we need to apply open data integration techniques, a topic well-researched in data management. Open data integration refers to a pay-as-you-go approach to data integration that copes with the volume, variety, velocity and veracity of data in increasingly common data lake settings. Data integration tasks are the individual steps needed to integrate data. Due to the inherent challenges of the healthcare domain, many automations of these tasks are not applied to RWD by medical informaticians. In this thesis, we investigate this gap and show how the open data integration of real-world health data silos can be accelerated. We first investigate the gap between data management and medical informatics in a literature review, quantifying which health data integration tasks lack automations developed in data management. Apart from identifying duplicate patients in multiple datasets, which is well researched in medical informatics, we conclude that all data integration tasks could benefit from data integration approaches developed in data management. Next, we approach data integration from two parallel perspectives. From a data management perspective, we survey the discovery of approximate functional dependencies (AFDs), a multi-column data profiling approach that detects a strong relationship between sets of attributes of a relation. Based on this survey, we recommend AFD measures to efficiently discover AFDs in RWD. From a medical informatics perspective, we give a retrospective of a project where we developed a health data integration platform together with three partner hospitals.
In particular, we discuss our development approach and lessons learned regarding the complex landscape of real-world health data silos and its associated stakeholders. The lessons from these two perspectives guide our next contribution. We tackle schema matching, a data integration task that lacks automation in the healthcare domain. We address name-based schema matching, which aims to identify semantic correspondences between two schemas based on the names and descriptions of schema elements. We use a large language model (LLM) and examine how the amount of information put into a single prompt affects the matching quality. To increase the applicability of our approach to sensitive RWD, we do not use any instances, i.e. actual data values. After our initial development based on public data, we validate our approach on private RWD schemas obtained from four Belgian hospitals. We show that our LLM-based schema matching approach returns high-quality correspondences and give practical considerations for future users. Using the example of a name-based schema matcher developed with the healthcare domain in mind, we illustrate how open data integration techniques developed in data management accelerate unlocking real-world health data silos.
Document URI: http://hdl.handle.net/1942/47340
Datasets of the publication: https://doi.org/10.5281/zenodo.8098908
Category: T1
Type: Theses and Dissertations
Appears in Collections: Research publications

Files in This Item:
File: thesis_Parciak_TUL.pdf (Until 2030-09-13)
Description: Published version
Size: 3.4 MB
Format: Adobe PDF

Items in DSpace are protected by copyright, with all rights reserved, unless otherwise indicated.