A Bayesian Network Approach to Lung Cancer Screening: Assessing the Impact of Data Quantity, Quality, and the Combination of Data from Danish Electronic Health Records

Daalen, Florian van; Henriksen, Margrethe Hostgaard Bang; Hansen, Torben Frostrup; Jensen, Lars Henrik; Brasen, Claus Lohman; Hilberg, Ole; Andersen, Martin Ask Klausholt; Humerfelt, Elise; Wee, Leonard; BERMEJO DELGADO, Inigo

Please use this identifier to cite or link to this item: http://hdl.handle.net/1942/44938

Title:	A Bayesian Network Approach to Lung Cancer Screening: Assessing the Impact of Data Quantity, Quality, and the Combination of Data from Danish Electronic Health Records
Authors:	Daalen, Florian van; Henriksen, Margrethe Hostgaard Bang; Hansen, Torben Frostrup; Jensen, Lars Henrik; Brasen, Claus Lohman; Hilberg, Ole; Andersen, Martin Ask Klausholt; Humerfelt, Elise; Wee, Leonard; BERMEJO DELGADO, Inigo
Issue Date:	2024
Publisher:	MDPI
Source:	Cancers, 16 (23) (Art N° 3989)
Abstract:	Background/Objectives: Lung cancer (LC) is the leading cause of cancer mortality, making early diagnosis essential. While LC screening trials are underway globally, optimal prediction models and inclusion criteria are still lacking. This study aimed to develop and evaluate Bayesian Network (BN) models for LC risk prediction using a decade of data from Denmark. The primary goal was to assess BN performance on datasets varying in size and completeness, simulate real-world screening scenarios, and identify the most valuable data sources for LC screening. Methods: The study included 38,944 patients evaluated for LC, with 11,284 (29%) diagnosed. Data on comorbidities, medications, and general practice were available for the entire cohort, while laboratory results, smoking habits, and other variables were only available for subsets. The cohort was divided into four subsets based on data availability, and BNs were trained and validated across these subsets using cross-validation and external validation. To determine the optimal combination of variables, all possible data combinations were evaluated on the samples that contained all the variables (n = 5587). Results: A model trained on the small, complete dataset (AUC 0.78) performed similarly on a larger dataset with 21% missing data (AUC 0.78). Performance dropped when 39% of data were missing (AUC 0.67), resulting in informative variables missing completely in the dataset. Laboratory results and smoking data were the most informative, significantly outperforming models based only on age and smoking status (AUC 0.70). Conclusions: BN models demonstrated moderate to strong predictive performance, even with incomplete data, highlighting the potential value of incorporating laboratory results in LC screening programs.
Notes:	Henriksen, MHB (corresponding author), Vejle Univ Hosp, Dept Oncol, DK-7100 Vejle, Denmark.; Henriksen, MHB (corresponding author), Univ Southern Denmark, Inst Reg Hlth Res, DK-5230 Odense, Denmark. florian.vandaalen@maastro.nl; margrethe.hostgaard.bang.henriksen@rsyd.dk; torben.hansen@rsyd.dk; lars.henrik.jensen@rsyd.dk; claus.lohman.brasen@rsyd.dk; ole.hilberg@rsyd.dk
Keywords:	lung cancer;bayesian networks;prediction models;screening;early detection;missing data;risk stratification
Document URI:	http://hdl.handle.net/1942/44938
e-ISSN:	2072-6694
DOI:	10.3390/cancers16233989
ISI #:	001376152400001
Rights:	2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Category:	A1
Type:	Journal Contribution
Validations:	ecoom 2025
Appears in Collections:	Research publications

Files in This Item:

File	Description	Size	Format
A Bayesian Network Approach .pdf	Published version	2.07 MB	Adobe PDF

SCOPUS™ Citations: 1 (checked on Feb 27, 2026)

WEB OF SCIENCE™ Citations: 1 (checked on Feb 26, 2026)
