Please use this identifier to cite or link to this item: http://hdl.handle.net/1942/35297
Full metadata record
DC Field | Value | Language
dc.contributor.advisor | VALKENBORG, Dirk |
dc.contributor.advisor | DYUBANKOVA, Natalia |
dc.contributor.author | Van Eylen, Tim |
dc.date.accessioned | 2021-09-13T13:06:28Z | -
dc.date.available | 2021-09-13T13:06:28Z | -
dc.date.issued | 2021 |
dc.identifier.uri | http://hdl.handle.net/1942/35297 | -
dc.description.abstract | MinHash Locality Sensitive Hashing (LSH) was used to find and remove near-duplicates from large chemical datasets to avoid data leakage during training and testing of AI models for forward prediction modelling. The MinHash LSH algorithm is a nearest-neighbour algorithm that provides query times with O(n) time complexity, whereas pairwise comparisons require O(n²) time complexity, making them intractable for large datasets. A recent attention neural network, Molecular Transformer, was tested on the combination of three large datasets with and without the removal of these near-duplicates and compared against results from the literature. It was concluded that MinHash LSH provides an elegant approach to removing near-duplicates. Furthermore, the reported results of the Molecular Transformer were not generalizable to aggregated datasets, although the reduced accuracy of the model on a reduced dataset could be shown. |
dc.format.mimetype | Application/pdf |
dc.language | en |
dc.publisher | tUL |
dc.title | Applicability domain of chemical reaction modeling |
dc.type | Theses and Dissertations |
local.bibliographicCitation.jcat | T2 |
dc.description.notes | Master of Statistics and Data Science-Biostatistics |
local.type.specified | Master thesis |
item.fulltext | With Fulltext | -
item.contributor | Van Eylen, Tim | -
item.fullcitation | Van Eylen, Tim (2021) Applicability domain of chemical reaction modeling. | -
item.accessRights | Open Access | -
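The near-duplicate removal described in the abstract can be illustrated with a minimal, standard-library-only sketch of MinHash with LSH banding. This is not the thesis code: the character 3-gram tokenisation, the values of NUM_PERM and BANDS, the 0.8 threshold, and all function names are assumptions chosen for illustration. Records are treated as plain strings (for example, reaction SMILES).

```python
import hashlib
from collections import defaultdict

NUM_PERM = 64  # number of hash functions per signature (assumed value)
BANDS = 16     # number of LSH bands; rows per band = NUM_PERM // BANDS


def shingles(text, k=3):
    """Return the set of character k-grams of a string."""
    return {text[i:i + k] for i in range(max(1, len(text) - k + 1))}


def _hash(seed, token):
    """Seeded 64-bit hash of a token, using the seed as a BLAKE2b salt."""
    digest = hashlib.blake2b(
        token.encode("utf-8"), digest_size=8, salt=seed.to_bytes(8, "little")
    ).digest()
    return int.from_bytes(digest, "little")


def minhash(tokens):
    """MinHash signature: the minimum hash value under each seeded hash."""
    return tuple(min(_hash(seed, t) for t in tokens) for seed in range(NUM_PERM))


def estimated_jaccard(sig_a, sig_b):
    """Fraction of matching signature positions, an estimate of Jaccard similarity."""
    return sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)


def candidate_pairs(signatures):
    """Band each signature; records sharing any band bucket become candidates."""
    rows = NUM_PERM // BANDS
    buckets = defaultdict(list)
    for idx, sig in enumerate(signatures):
        for b in range(BANDS):
            buckets[(b, sig[b * rows:(b + 1) * rows])].append(idx)
    pairs = set()
    for members in buckets.values():
        for i in range(len(members)):
            for j in range(i + 1, len(members)):
                pairs.add((members[i], members[j]))
    return pairs


def deduplicate(records, threshold=0.8):
    """Keep the first record of every group of estimated near-duplicates."""
    signatures = [minhash(shingles(r)) for r in records]
    dropped = set()
    for i, j in sorted(candidate_pairs(signatures)):
        if i not in dropped and estimated_jaccard(signatures[i], signatures[j]) >= threshold:
            dropped.add(j)
    return [r for idx, r in enumerate(records) if idx not in dropped]
```

Only records that collide in at least one band bucket are compared, which is how the banding step avoids comparing every pair. In practice a maintained library implementation would likely be preferable to this hand-rolled version.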
Appears in Collections: Master theses
Files in This Item:
File | Description | Size | Format
aff78031-f687-49d5-a8bd-f6509442e578.pdf | | 1.61 MB | Adobe PDF
Items in DSpace are protected by copyright, with all rights reserved, unless otherwise indicated.