Please use this identifier to cite or link to this item: http://hdl.handle.net/1942/35297
Full metadata record
DC Field | Value | Language
dc.contributor.advisor | VALKENBORG, Dirk |
dc.contributor.advisor | DYUBANKOVA, Natalia |
dc.contributor.author | Van Eylen, Tim |
dc.date.accessioned | 2021-09-13T13:06:28Z | -
dc.date.available | 2021-09-13T13:06:28Z | -
dc.date.issued | 2021 |
dc.identifier.uri | http://hdl.handle.net/1942/35297 | -
dc.description.abstract | MinHash Locality Sensitive Hashing (LSH) was used to find and remove near-duplicates from large chemical datasets to avoid data leakage during training and testing of AI models for forward prediction modelling. The MinHash LSH algorithm is a nearest-neighbour algorithm that provides query times with O(n) time complexity, whereas pairwise comparisons require O(n²) time, making them intractable for large datasets. A recent attention neural network, Molecular Transformer, was tested on the combination of three large datasets with and without the removal of these near-duplicates and compared against the literature. It was concluded that MinHash LSH provides an elegant approach to removing near-duplicates. Furthermore, the reported results of the Molecular Transformer were not generalizable to aggregated datasets, although the reduced accuracy of the model on a reduced dataset could be shown. |
dc.format.mimetype | Application/pdf |
dc.language | en |
dc.publisher | tUL |
dc.title | Applicability domain of chemical reaction modeling |
dc.type | Theses and Dissertations |
local.bibliographicCitation.jcat | T2 |
dc.description.notes | Master of Statistics and Data Science-Biostatistics |
local.type.specified | Master thesis |
item.fullcitation | Van Eylen, Tim (2021) Applicability domain of chemical reaction modeling. | -
item.accessRights | Open Access | -
item.fulltext | With Fulltext | -
item.contributor | Van Eylen, Tim | -
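
The near-duplicate search described in the abstract can be illustrated with a minimal, dependency-free sketch. It is not the thesis's implementation: the character 3-gram shingling, the 64 hash functions, the 16 bands, and all function names are assumptions chosen for illustration, and the thesis may represent reactions differently. Each record is reduced to a MinHash signature, the signature is cut into bands, and records that share a bucket in any band become candidate pairs, so the bucketing step is a single pass over the records rather than an all-against-all comparison.

```python
import hashlib
from collections import defaultdict


def shingles(text, k=3):
    """Return the set of character k-grams of a string (illustrative choice)."""
    return {text[i:i + k] for i in range(max(len(text) - k + 1, 1))}


def minhash_signature(items, num_perm=64):
    """Keep the minimum of each seeded hash over the set.

    Each seeded hash stands in for one random permutation of the universe.
    """
    signature = []
    for seed in range(num_perm):
        salt = seed.to_bytes(8, "little")
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(s.encode(), digest_size=8, salt=salt).digest(),
                "little",
            )
            for s in items
        ))
    return signature


def estimated_jaccard(sig_a, sig_b):
    """Fraction of matching signature positions estimates Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


def lsh_candidate_pairs(signatures, bands=16):
    """Bucket each band of each signature; co-bucketed keys become candidates."""
    rows = len(next(iter(signatures.values()))) // bands
    buckets = defaultdict(list)
    for key, sig in signatures.items():  # one pass over the records
        for b in range(bands):
            band = tuple(sig[b * rows:(b + 1) * rows])
            buckets[(b, band)].append(key)
    pairs = set()
    for members in buckets.values():
        for i in range(len(members)):
            for j in range(i + 1, len(members)):
                pairs.add(tuple(sorted((members[i], members[j]))))
    return pairs


if __name__ == "__main__":
    # Hypothetical reaction strings, used only to exercise the functions.
    records = {
        "r1": "CCO>>CC=O",
        "r2": "CCO>>CC=O",
        "r3": "c1ccccc1Br.OB(O)c1ccccc1",
    }
    sigs = {key: minhash_signature(shingles(val)) for key, val in records.items()}
    print(lsh_candidate_pairs(sigs))
```

Candidate pairs would normally be verified with `estimated_jaccard` (or an exact similarity) against a chosen threshold before one member of each pair is dropped; the number of bands and rows per band controls how similar two records must be to collide.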
Appears in Collections: Master theses
Files in This Item:
File | Description | Size | Format
aff78031-f687-49d5-a8bd-f6509442e578.pdf | | 1.61 MB | Adobe PDF

Items in DSpace are protected by copyright, with all rights reserved, unless otherwise indicated.