Please use this identifier to cite or link to this item: http://hdl.handle.net/1942/41467
Title: Annotated Benchmark of Real-World Data for Approximate Functional Dependency Discovery
Data Creator - person: PARCIAK, Marcel; VANSUMMEREN, Stijn; WEYTJENS, Sebastiaan; PEETERS, Liesbet; NEVEN, Frank; HENS, Niel
Data Creator - organization: Hasselt University
Data Curator - person: PARCIAK, Marcel 
Data Curator - organization: Hasselt University
Rights Holder - person: PARCIAK, Marcel 
Rights Holder - organization: Hasselt University
Publisher: Zenodo
Issue Date: 2023
Abstract: This collection consists of ten open access relations commonly used by the data management community. In addition to the relations themselves (please take note of the references to the original sources below), we added three lists to this collection that describe approximate functional dependencies found in the relations. These lists are the result of a manual annotation process performed by two independent individuals, who consulted the respective schemas of the relations and identified column combinations where one column implies another based on its semantics. As an example, in the claims.csv file, AirportCode implies AirportName, as each code should be unique for a given airport.

The file ground_truth.csv is a comma-separated file containing approximate functional dependencies. Its table column names the relation we refer to; lhs and rhs reference two columns of that relation for which we found that lhs semantically implies rhs. The files excluded_candidates.csv and included_candidates.csv list all column combinations that were excluded from or included in the manual annotation, respectively. We excluded a candidate if there was no tuple where both attributes had a value or if the g3_prime value was too small.

Dataset References
adult.csv: Dua, D. and Graff, C. (2019). UCI Machine Learning Repository. Irvine, CA: University of California, School of Information and Computer Science.
claims.csv: TSA Claims Data 2002 to 2006, published by the U.S. Department of Homeland Security.
dblp10k.csv: Frequency-aware Similarity Measures. Lange, Dustin; Naumann, Felix (2011). 243–248. Made available as DBLP Dataset 2.
hospital.csv: Hospital dataset used in Johann Birnick, Thomas Bläsius, Tobias Friedrich, Felix Naumann, Thorsten Papenbrock, and Martin Schirneck. 2020. Hitting set enumeration with partial information for unique column combination discovery. Proc. VLDB Endow. 13, 12 (August 2020), 2270–2283. https://doi.org/10.14778/3407790.3407824. Made available as part of the dataset collection to that paper.
t_biocase_... files: t_bioc_... files used in Johann Birnick, Thomas Bläsius, Tobias Friedrich, Felix Naumann, Thorsten Papenbrock, and Martin Schirneck. 2020. Hitting set enumeration with partial information for unique column combination discovery. Proc. VLDB Endow. 13, 12 (August 2020), 2270–2283. https://doi.org/10.14778/3407790.3407824. Made available as part of the dataset collection to that paper.
tax.csv: Tax dataset used in Johann Birnick, Thomas Bläsius, Tobias Friedrich, Felix Naumann, Thorsten Papenbrock, and Martin Schirneck. 2020. Hitting set enumeration with partial information for unique column combination discovery. Proc. VLDB Endow. 13, 12 (August 2020), 2270–2283. https://doi.org/10.14778/3407790.3407824. Made available as part of the dataset collection to that paper.
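The abstract describes the table, lhs and rhs columns of ground_truth.csv and an exclusion rule based on missing values and a g3_prime score. The following is a minimal sketch of how such an annotation could be checked against a relation. It rests on several assumptions: the sample rows are invented for illustration, the table column is assumed to hold the file name, and the function computes the classic g3 error (the minimum fraction of tuples to remove so the dependency holds exactly) as a stand-in, since this record does not define g3_prime.

```python
import csv
import io
from collections import Counter, defaultdict

# Hypothetical miniature of claims.csv; the real relation is much larger.
SAMPLE_CLAIMS = """AirportCode,AirportName
JFK,John F. Kennedy International
JFK,John F. Kennedy International
JFK,JFK Intl
LAX,Los Angeles International
LAX,
"""

# Hypothetical ground-truth row using the documented columns table, lhs, rhs.
# Whether the real file stores "claims.csv" or "claims" is an assumption.
SAMPLE_GROUND_TRUTH = """table,lhs,rhs
claims.csv,AirportCode,AirportName
"""


def g3_error(rows, lhs, rhs):
    """Classic g3 error of the dependency lhs -> rhs.

    Rows where either attribute is empty are skipped. Returns None when
    no row has a value for both attributes.
    """
    groups = defaultdict(Counter)
    total = 0
    for row in rows:
        left, right = row.get(lhs), row.get(rhs)
        if not left or not right:
            continue
        groups[left][right] += 1
        total += 1
    if total == 0:
        return None
    # For each lhs value keep only the rows carrying its most frequent rhs value.
    kept = sum(max(counter.values()) for counter in groups.values())
    return (total - kept) / total


relations = {"claims.csv": list(csv.DictReader(io.StringIO(SAMPLE_CLAIMS)))}
results = {}
for fd in csv.DictReader(io.StringIO(SAMPLE_GROUND_TRUTH)):
    rows = relations[fd["table"]]
    results[(fd["table"], fd["lhs"], fd["rhs"])] = g3_error(rows, fd["lhs"], fd["rhs"])

error = results[("claims.csv", "AirportCode", "AirportName")]
print(error)
```

In this toy relation one of the four complete tuples disagrees with the majority name for its code, so the dependency holds only approximately. A candidate for which g3_error returns None corresponds to the "no tuple where both attributes had a value" case in the abstract.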
Research Discipline: Natural sciences > Information and computing sciences > Information systems > Data models (01020501)
Keywords: data science;computer science;databases;approximate functional dependencies;data management;relational data;dataset;benchmark
DOI: 10.5281/zenodo.8098909
Link to publication/dataset: https://zenodo.org/record/8098909
Source: Zenodo. 10.5281/zenodo.8098909 https://zenodo.org/record/8098909
License: Creative Commons Attribution 4.0 International (CC-BY-4.0)
Access Rights: Open Access
Version: 1.0
Category: DS
Type: Dataset
Appears in Collections: Datasets
