Please use this identifier to cite or link to this item: http://hdl.handle.net/1942/41759
Title: CM-EXPLORER: Dissecting Data Ingestion Problems
Authors: BYLOIS, Niels; NEVEN, Frank; VANSUMMEREN, Stijn
Issue Date: 2023
Publisher: ASSOC COMPUTING MACHINERY
Source: Proceedings of the VLDB Endowment, 16 (12), p. 3958-3961
Abstract: Data ingestion validation, the task of certifying the quality of continuously collected data, is crucial to ensure the trustworthiness of analytics insights. A widely used approach for validating data quality is to specify, either manually or automatically, so-called data unit tests that check whether data quality metrics lie within expected bounds. We employ conditional unit tests based on conditional metrics (CMs) that compute data quality signals over specific parts of the ingestion data and therefore allow for a fine-grained detection of errors. A violated conditional unit test specifies a set of erroneous tuples in a natural way: the subrelation that its CM refers to. Unfortunately, the downside of their fine-grained nature is that violating unit tests are often correlated: a single error in an ingestion batch may cause multiple tests (each referring to different parts of the batch) to fail. The key challenge is therefore to untangle this correlation and filter out the most relevant violated conditional unit tests, i.e., tests that identify a core set of erroneous tuples and act as an explanation for the errors. We present CM-EXPLORER, a system that supports data stewards in quickly finding the most relevant violated conditional unit tests. The system consists of three components: (1) a graph explorer for visualizing the correlation structure of the violated unit tests; (2) a relation explorer for browsing the tuples selected by conditional unit tests; and (3) a history explorer for gaining insight into why conditional unit tests are violated. In this paper, we discuss these components and present the different scenarios that we make available for the demonstration.
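The idea of a conditional unit test described in the abstract can be illustrated with a minimal Python sketch. It is not CM-EXPLORER's implementation: the class name ConditionalUnitTest, the null-fraction metric, the bounds, and the toy batch are all hypothetical and chosen only for illustration. The sketch shows a metric computed over the subrelation selected by a condition, a bounds check, and how a single error (missing prices for Belgian web orders) can make two overlapping tests fail at once.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional

Row = Dict[str, Optional[object]]


@dataclass
class ConditionalUnitTest:
    """Hypothetical conditional unit test: a metric over a selected subrelation."""
    name: str
    condition: Callable[[Row], bool]       # selects the subrelation
    metric: Callable[[List[Row]], float]   # data quality signal on that subrelation
    lower: float                           # expected bounds for the metric
    upper: float

    def subrelation(self, batch: List[Row]) -> List[Row]:
        return [row for row in batch if self.condition(row)]

    def is_violated(self, batch: List[Row]) -> bool:
        rows = self.subrelation(batch)
        if not rows:
            # Assumption: a test over an empty subrelation is not checked.
            return False
        value = self.metric(rows)
        return not (self.lower <= value <= self.upper)


def null_fraction(column: str) -> Callable[[List[Row]], float]:
    """Metric factory: fraction of rows whose `column` is missing."""
    def metric(rows: List[Row]) -> float:
        return sum(row[column] is None for row in rows) / len(rows)
    return metric


# Toy ingestion batch (invented data): prices are missing for Belgian web orders.
batch: List[Row] = [
    {"country": "BE", "channel": "web", "price": None},
    {"country": "BE", "channel": "web", "price": None},
    {"country": "BE", "channel": "shop", "price": 12.0},
    {"country": "NL", "channel": "web", "price": 9.5},
    {"country": "NL", "channel": "shop", "price": 11.0},
]

tests = [
    ConditionalUnitTest("country=BE", lambda r: r["country"] == "BE",
                        null_fraction("price"), 0.0, 0.1),
    ConditionalUnitTest("channel=web", lambda r: r["channel"] == "web",
                        null_fraction("price"), 0.0, 0.1),
    ConditionalUnitTest("country=NL", lambda r: r["country"] == "NL",
                        null_fraction("price"), 0.0, 0.1),
]

# Names of the tests whose metric falls outside its expected bounds.
violated = [t.name for t in tests if t.is_violated(batch)]
print(violated)
```

Both the country=BE and the channel=web tests fail here although only one underlying error exists, which is the kind of correlation among violated tests that the abstract says the system helps untangle. The subrelation of each violated test is the set of tuples it points to as potentially erroneous.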
Notes: Bylois, N (corresponding author), UHasselt, ACSL, Data Sci Inst, Antwerp, Belgium.
niels.bylois@uhasselt.be; frank.neven@uhasselt.be; stijn.vansummeren@uhasselt.be
Document URI: http://hdl.handle.net/1942/41759
ISSN: 2150-8097
e-ISSN: 2150-8097
DOI: 10.14778/3611540.3611595
ISI #: 001067701000056
Rights: This work is licensed under the Creative Commons BY-NC-ND 4.0 International License. Visit https://creativecommons.org/licenses/by-nc-nd/4.0/ to view a copy of this license. For any use beyond those covered by this license, obtain permission by emailing info@vldb.org. Copyright is held by the owner/author(s). Publication rights licensed to the VLDB Endowment.
Category: A1
Type: Journal Contribution
Appears in Collections: Research publications

Files in This Item:
File: CM-Explorer_ Dissecting Data Ingestion Problems.pdf (Restricted Access)
Description: Published version
Size: 1.01 MB
Format: Adobe PDF

File: CM-Explorer_ Dissecting Data Ingestion Problems.pdf (Until 2024-08-31)
Description: Peer-reviewed author version
Size: 946.46 kB
Format: Adobe PDF

Items in DSpace are protected by copyright, with all rights reserved, unless otherwise indicated.