Please use this identifier to cite or link to this item: http://hdl.handle.net/1942/41467
Full metadata record
DC Field | Value | Language
dc.date.accessioned | 2023-10-04T11:27:42Z | -
dc.date.available | 2023-10-04T11:27:42Z | -
dc.date.issued | 2023 | -
dc.date.submitted | 2023-10-04T11:26:23Z | -
dc.identifier.citation | Zenodo. 10.5281/zenodo.8098909 https://zenodo.org/record/8098909 | -
dc.identifier.uri | http://hdl.handle.net/1942/41467 | -
dc.description.abstract | Annotated Benchmark of Real-World Data for Approximate Functional Dependency Discovery. This collection consists of ten open access relations commonly used by the data management community. In addition to the relations themselves (please take note of the references to the original sources below), we added three lists to this collection that describe approximate functional dependencies found in the relations. These lists are the result of a manual annotation process performed by two independent individuals, who consulted the respective schemas of the relations and identified column combinations where one column implies another based on its semantics. As an example, in the claims.csv file, AirportCode implies AirportName, as each code should be unique for a given airport. The file ground_truth.csv is a comma-separated file containing approximate functional dependencies: the table column names the relation, and lhs and rhs reference two columns of that relation where, semantically, lhs implies rhs. The files excluded_candidates.csv and included_candidates.csv list all column combinations that were excluded or included in the manual annotation, respectively. We excluded a candidate if there was no tuple where both attributes had a value or if the g3_prime value was too small. Dataset References. adult.csv: Dua, D. and Graff, C. (2019). UCI Machine Learning Repository. Irvine, CA: University of California, School of Information and Computer Science. claims.csv: TSA Claims Data 2002 to 2006, published by the U.S. Department of Homeland Security. dblp10k.csv: Frequency-aware Similarity Measures. Lange, Dustin; Naumann, Felix (2011). 243–248. Made available as DBLP Dataset 2. hospital.csv: Hospital dataset used in Johann Birnick, Thomas Bläsius, Tobias Friedrich, Felix Naumann, Thorsten Papenbrock, and Martin Schirneck. 2020. Hitting set enumeration with partial information for unique column combination discovery. Proc. VLDB Endow. 13, 12 (August 2020), 2270–2283. https://doi.org/10.14778/3407790.3407824. Made available as part of the dataset collection to that paper. t_biocase_... files: t_bioc_... files used in Johann Birnick, Thomas Bläsius, Tobias Friedrich, Felix Naumann, Thorsten Papenbrock, and Martin Schirneck. 2020. Hitting set enumeration with partial information for unique column combination discovery. Proc. VLDB Endow. 13, 12 (August 2020), 2270–2283. https://doi.org/10.14778/3407790.3407824. Made available as part of the dataset collection to that paper. tax.csv: Tax dataset used in Johann Birnick, Thomas Bläsius, Tobias Friedrich, Felix Naumann, Thorsten Papenbrock, and Martin Schirneck. 2020. Hitting set enumeration with partial information for unique column combination discovery. Proc. VLDB Endow. 13, 12 (August 2020), 2270–2283. https://doi.org/10.14778/3407790.3407824. Made available as part of the dataset collection to that paper. | -
dc.language.iso | en | -
dc.publisher | Zenodo | -
dc.subject.classification | Data models | -
dc.subject.other | data science | -
dc.subject.other | computer science | -
dc.subject.other | databases | -
dc.subject.other | approximate functional dependencies | -
dc.subject.other | data management | -
dc.subject.other | relational data | -
dc.subject.other | dataset | -
dc.subject.other | benchmark | -
dc.title | Annotated Benchmark of Real-World Data for Approximate Functional Dependency Discovery | -
dc.type | Dataset | -
local.bibliographicCitation.jcat | DS | -
dc.description.version | 1.0 | -
dc.rights.license | Creative Commons Attribution 4.0 International (CC-BY-4.0) | -
dc.identifier.doi | 10.5281/zenodo.8098909 | -
dc.identifier.url | https://zenodo.org/record/8098909 | -
dc.description.other | You can cite all versions by using the DOI 10.5281/zenodo.8098908. This DOI represents all versions, and will always resolve to the latest one. | -
local.provider.type | datacite | -
local.uhasselt.international | no | -
local.contributor.datacreator | PARCIAK, Marcel | -
local.contributor.datacreator | VANSUMMEREN, Stijn | -
local.contributor.datacreator | WEYTJENS, Sebastiaan | -
local.contributor.datacreator | PEETERS, Liesbet | -
local.contributor.datacreator | NEVEN, Frank | -
local.contributor.datacreator | HENS, Niel | -
local.contributor.datacurator | PARCIAK, Marcel | -
local.contributor.rightsholder | PARCIAK, Marcel | -
local.format.extent | 4.0 MB; 17.4 Mb; 5.0 Mb; 270 kB; 6 kB; 30.6 MB; 79.1 kB; 14.1 MB; 21.2 MB; 24.9 MB; 30.5 Mb; 29.8 Mb; 73.0 MB | -
local.format.mimetype | Comma-separated values (CSV) | -
local.contributororcid.datacreator | 0000-0002-6950-929X | -
local.contributororcid.datacreator | 0000-0001-7793-9049 | -
local.contributororcid.datacreator | 0000-0001-5892-508X | -
local.contributororcid.datacreator | 0000-0002-6066-3899 | -
local.contributororcid.datacreator | 0000-0002-7143-1903 | -
local.contributororcid.datacreator | 0000-0003-1881-0637 | -
local.contributororcid.datacurator | 0000-0002-6950-929X | -
local.contributororcid.rightsholder | 0000-0002-6950-929X | -
local.contributingorg.datacreator | Hasselt University | -
local.contributingorg.datacurator | Hasselt University | -
local.contributingorg.rightsholder | Hasselt University | -
dc.rights.access | Open Access | -
item.accessRights | Closed Access | -
item.fullcitation | PARCIAK, Marcel; VANSUMMEREN, Stijn; WEYTJENS, Sebastiaan; PEETERS, Liesbet; NEVEN, Frank & HENS, Niel (2023) Annotated Benchmark of Real-World Data for Approximate Functional Dependency Discovery. Zenodo. 10.5281/zenodo.8098909 https://zenodo.org/record/8098909. | -
item.fulltext | No Fulltext | -
item.contributor | PARCIAK, Marcel | -
item.contributor | VANSUMMEREN, Stijn | -
item.contributor | WEYTJENS, Sebastiaan | -
item.contributor | PEETERS, Liesbet | -
item.contributor | NEVEN, Frank | -
item.contributor | HENS, Niel | -
crisitem.discipline.code | 01020501 | -
crisitem.discipline.name | Data models | -
crisitem.discipline.path | Natural sciences > Information and computing sciences > Information systems > Data models | -
crisitem.discipline.pathandcode | Natural sciences > Information and computing sciences > Information systems > Data models (01020501) | -
crisitem.license.code | CC-BY-4.0 | -
crisitem.license.name | Creative Commons Attribution 4.0 International (CC-BY-4.0) | -
Appears in Collections:Datasets
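The abstract describes candidate dependencies of the form lhs implies rhs and an exclusion rule based on missing values and a g3_prime score. The record does not define g3_prime, so the sketch below does not reproduce it. As an assumption, it computes the classical g3 error instead: the fraction of tuples that would have to be removed for lhs -> rhs to hold exactly. The function name g3_error and the toy rows are hypothetical and are not taken from the dataset.

```python
from collections import Counter, defaultdict


def g3_error(rows, lhs, rhs):
    """Classical g3 error of lhs -> rhs (an assumption; not the record's g3_prime).

    rows is an iterable of dicts, e.g. as produced by csv.DictReader.
    Rows where either attribute is missing or empty are skipped.
    Returns None if no row has a value for both attributes.
    """
    groups = defaultdict(Counter)
    total = 0
    for row in rows:
        left, right = row.get(lhs), row.get(rhs)
        if left in (None, "") or right in (None, ""):
            continue  # only tuples with both values count
        groups[left][right] += 1
        total += 1
    if total == 0:
        return None  # mirrors "no tuple where both attributes had a value"
    # For each lhs value, keep the rows carrying its most frequent rhs value.
    kept = sum(max(counter.values()) for counter in groups.values())
    return 1 - kept / total


# Hypothetical toy rows, loosely modelled on the claims.csv example.
toy_rows = [
    {"AirportCode": "JFK", "AirportName": "John F. Kennedy International"},
    {"AirportCode": "JFK", "AirportName": "John F. Kennedy International"},
    {"AirportCode": "JFK", "AirportName": "JFK Intl"},
    {"AirportCode": "LAX", "AirportName": "Los Angeles International"},
    {"AirportCode": "LAX", "AirportName": ""},
]
print(g3_error(toy_rows, "AirportCode", "AirportName"))
```

In the toy data, four rows have both values and one of them (the "JFK Intl" spelling) conflicts with the majority, so the dependency holds only approximately. To apply the same function to a real relation, the rows could be read with csv.DictReader and the lhs and rhs names taken from a row of ground_truth.csv.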
Items in DSpace are protected by copyright, with all rights reserved, unless otherwise indicated.