Please use this identifier to cite or link to this item: http://hdl.handle.net/1942/44312
Title: Conditional Metrics for Data Ingestion Validation
Authors: BYLOIS, Niels 
Advisors: Neven, Frank
Vansummeren, Stijn
Luyten, Kris
Issue Date: 2024
Abstract: Validating the quality of continuously collected data is crucial to ensure the trustworthiness of analytics insights. A widely used approach for validating data quality is to specify, either manually or automatically, so-called data unit tests that check whether data quality metrics lie within expected bounds. Unfortunately, existing approaches suffer from two limitations. First, the data unit tests that they support are based on global metrics, which can only provide coarse-grained signals of data quality and are unable to detect fine-grained errors. Second, even when a data unit test signals a data quality problem, it does not provide a principled method to identify the part of the data that is responsible for the test's failure. We present an approach to data quality validation that is not only capable of detecting fine-grained errors, but also helps identify the erroneous tuples responsible. Our approach is based on a novel form of metrics, called conditional metrics, which compute data quality signals over specific parts of the ingestion data and therefore allow for a more fine-grained analysis than standard global metrics, which operate on the entire ingestion data. The methodology consists of two phases: a unit test discovery phase and a monitoring and error identification phase. In the discovery phase, we automatically derive conditional-metric-based unit tests from historical ingestion sequences, using a notion of stability over the historical ingestions as the selection criterion for adopting conditional metrics as data unit tests. In the subsequent phase, we use the derived unit tests to validate the quality of new ingestion batches. When an ingestion batch fails one or more unit tests, we show how conditional metrics can be used to identify potential errors. We study different ways of implementing both phases and compare their effectiveness on two real-world datasets and seven synthetic error scenarios. The improvement that we measure over global metrics, as well as the error-identification F1-scores that we obtain, indicate that conditional metrics are a promising approach to fine-grained error detection for data ingestion validation. Furthermore, we compare our methodology to a data quality verification tool and to machine learning methods, demonstrating that our approach can detect more fine-grained errors. We then explore further extensions to this methodology, including clustering batches based on the conditional metrics to improve the error-identification process, a quantified failing method that considers the severity of failing conditions, and an iterative scoring method to enhance the performance of the methodology. Finally, we present an interactive tool that enables data stewards to explore the failing data unit tests and the corresponding data to identify the root cause of the fine-grained errors.
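
The thesis itself is embargoed, so the following is only a minimal Python sketch of the idea the abstract describes: a conditional metric computes a quality signal per condition (here, the fraction of NULL values in one column, split per value of a categorical column), the discovery phase keeps only conditions whose historical values are stable and derives expected bounds from them, and the monitoring phase flags the conditions of a new batch that fall outside those bounds. The column names ('store', 'price'), the null-fraction metric, the standard-deviation-based stability test, and the k-sigma bounds are illustrative assumptions, not the thesis' actual definitions.

from statistics import mean, stdev

# A batch is a list of rows (dicts). A conditional metric computes a data
# quality signal per condition -- here, the fraction of NULL 'price' values
# per value of the 'store' column (illustrative column names).
def null_fraction_per_condition(batch, cond_col="store", target_col="price"):
    counts, nulls = {}, {}
    for row in batch:
        key = row[cond_col]
        counts[key] = counts.get(key, 0) + 1
        nulls[key] = nulls.get(key, 0) + (row[target_col] is None)
    return {key: nulls[key] / counts[key] for key in counts}

# Discovery phase (sketch): keep a conditional metric as a unit test only if
# its historical values are stable (spread below a threshold), and derive
# expected bounds from the historical mean and standard deviation.
def discover_unit_tests(history, stability_threshold=0.05, k=3.0):
    per_condition = {}
    for batch in history:
        for key, value in null_fraction_per_condition(batch).items():
            per_condition.setdefault(key, []).append(value)
    tests = {}
    for key, values in per_condition.items():
        if len(values) >= 2 and stdev(values) <= stability_threshold:
            m, s = mean(values), stdev(values)
            tests[key] = (max(0.0, m - k * s), min(1.0, m + k * s))
    return tests

# Monitoring phase (sketch): a new batch fails the unit test of every condition
# whose metric falls outside its expected bounds; those conditions point to the
# tuples that are potentially erroneous.
def failing_conditions(new_batch, tests):
    observed = null_fraction_per_condition(new_batch)
    return {key: value for key, value in observed.items()
            if key in tests and not (tests[key][0] <= value <= tests[key][1])}

In this sketch, a flagged condition (a 'store' value whose null fraction drifted) points directly at the subset of tuples to inspect, which is the fine-grained error-identification signal that a single global null fraction over the whole batch cannot provide.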
Document URI: http://hdl.handle.net/1942/44312
Category: T1
Type: Theses and Dissertations
Appears in Collections: Research publications

Files in This Item:
File: thesis_Niels_Bylois.pdf
Description: Published version
Size: 8.65 MB
Format: Adobe PDF
Embargoed until 2029-09-29