Please use this identifier to cite or link to this item: http://hdl.handle.net/1942/35297
Title: Applicability domain of chemical reaction modeling
Authors: Van Eylen, Tim
Advisors: VALKENBORG, Dirk; DYUBANKOVA, Natalia
Issue Date: 2021
Publisher: tUL
Abstract: MinHash Locality Sensitive Hashing (LSH) was used to find and remove near-duplicates from large chemical datasets, in order to avoid data leakage between the training and test sets of AI models for forward prediction modelling. MinHash LSH is a nearest-neighbour algorithm that provides query times of O(n) time complexity, whereas pairwise comparisons require O(n²) time, which makes them intractable for large datasets. A recent attention-based neural network, the Molecular Transformer, was tested on the combination of three large datasets, with and without removal of these near-duplicates, and compared against results from the literature. It was concluded that MinHash LSH provides an elegant approach to removing near-duplicates. Furthermore, the reported results of the Molecular Transformer were not generalizable to aggregated datasets, although the reduced accuracy of the model on a reduced dataset could be shown.
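The near-duplicate removal described in the abstract can be illustrated with a minimal sketch. It is not the thesis implementation: the character 3-gram featurization of reaction strings, the banding parameters, the 0.8 Jaccard threshold, and all function names (`shingles`, `minhash_signature`, `candidate_pairs`, `drop_near_duplicates`) are assumptions chosen for illustration only.

```python
import hashlib
from collections import defaultdict


def shingles(text, k=3):
    """Character k-grams of a string (assumed stand-in for a chemical fingerprint)."""
    return {text[i:i + k] for i in range(max(1, len(text) - k + 1))}


def _hash(token, seed):
    """Deterministic 64-bit hash of a token under a given seed."""
    data = f"{seed}:{token}".encode("utf-8")
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def minhash_signature(items, num_perm=64):
    """MinHash signature: the minimum hash of the set under each seeded hash."""
    return tuple(min(_hash(t, seed) for t in items) for seed in range(num_perm))


def candidate_pairs(sets, num_perm=64, bands=16):
    """LSH banding: records that share at least one signature band become candidates."""
    rows = num_perm // bands
    buckets = defaultdict(list)
    for idx, items in enumerate(sets):  # one pass over the records
        sig = minhash_signature(items, num_perm)
        for b in range(bands):
            buckets[(b, sig[b * rows:(b + 1) * rows])].append(idx)
    pairs = set()
    for members in buckets.values():  # compare only within a bucket
        for i in range(len(members)):
            for j in range(i + 1, len(members)):
                pairs.add((members[i], members[j]))
    return pairs


def jaccard(a, b):
    """Exact Jaccard similarity of two non-empty sets."""
    return len(a & b) / len(a | b)


def drop_near_duplicates(records, threshold=0.8, num_perm=64, bands=16):
    """Keep the first record of each verified near-duplicate pair, drop the later one."""
    sets = [shingles(r) for r in records]
    dropped = set()
    for i, j in sorted(candidate_pairs(sets, num_perm, bands)):
        # Candidates are verified with exact Jaccard before anything is removed.
        if i not in dropped and jaccard(sets[i], sets[j]) >= threshold:
            dropped.add(j)
    return [r for idx, r in enumerate(records) if idx not in dropped]
```

Calling `drop_near_duplicates` on a list of reaction strings returns the list with later near-duplicates removed. Only records that land in the same bucket are ever compared, which is how the approach avoids the full pairwise comparison; the within-bucket loop can still become expensive when many records share a bucket.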
Notes: Master of Statistics and Data Science-Biostatistics
Document URI: http://hdl.handle.net/1942/35297
Category: T2
Type: Theses and Dissertations
Appears in Collections: Master theses

Files in This Item:
File: aff78031-f687-49d5-a8bd-f6509442e578.pdf
Size: 1.61 MB
Format: Adobe PDF

Items in DSpace are protected by copyright, with all rights reserved, unless otherwise indicated.