Please use this identifier to cite or link to this item: http://hdl.handle.net/1942/35297
Title: Applicability domain of chemical reaction modeling
Authors: Van Eylen, Tim
Advisors: VALKENBORG, Dirk; DYUBANKOVA, Natalia
Issue Date: 2021
Publisher: tUL
Abstract: MinHash Locality Sensitive Hashing (LSH) was used to find and remove near-duplicates from large chemical datasets, in order to avoid data leakage between the training and test sets of AI models for forward prediction modelling. The MinHash LSH algorithm is a nearest-neighbour algorithm which provides query times in O(n) time complexity, whereas pairwise comparisons require O(n²) time complexity, making them intractable for large datasets. A recent attention-based neural network, the Molecular Transformer, was tested on the combination of three large datasets, with and without the removal of these near-duplicates, and compared against results reported in the literature. It was concluded that MinHash LSH provides an elegant approach to removing near-duplicates. Furthermore, the reported results of the Molecular Transformer were not generalizable to aggregated datasets, although the reduced accuracy of the model on a reduced dataset could be shown.
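The near-duplicate removal described in the abstract can be illustrated with a minimal, standard-library-only sketch of MinHash with LSH banding. This is not the thesis's implementation: representing each record as character 3-grams of a reaction string, the parameter values (64 hash functions, 16 bands), and the names shingles, minhash_signature, near_duplicate_pairs and deduplicate are all assumptions made for illustration.

```python
import hashlib
from collections import defaultdict


def shingles(text, k=3):
    """Return the set of character k-grams of a string (assumed featurisation)."""
    return {text[i:i + k] for i in range(max(1, len(text) - k + 1))}


def _hash(token, seed):
    """Deterministic 64-bit hash of a token; the seed selects the hash function."""
    digest = hashlib.blake2b(
        token.encode("utf-8"),
        digest_size=8,
        salt=seed.to_bytes(8, "little"),
    ).digest()
    return int.from_bytes(digest, "little")


def minhash_signature(items, num_perm=64):
    """MinHash signature: the minimum hash value under each of num_perm seeds."""
    return tuple(min(_hash(x, seed) for x in items) for seed in range(num_perm))


def near_duplicate_pairs(records, num_perm=64, bands=16):
    """Return candidate near-duplicate index pairs (i, j) with i < j.

    Signatures are cut into bands; two records become candidates as soon as
    they agree on every row of at least one band, so no all-pairs comparison
    is needed.
    """
    rows = num_perm // bands
    buckets = defaultdict(list)
    for idx, record in enumerate(records):
        sig = minhash_signature(shingles(record), num_perm)
        for b in range(bands):
            buckets[(b, sig[b * rows:(b + 1) * rows])].append(idx)

    pairs = set()
    for members in buckets.values():
        # Indices were appended in increasing order, so members[i] < members[j].
        for i in range(len(members)):
            for j in range(i + 1, len(members)):
                pairs.add((members[i], members[j]))
    return pairs


def deduplicate(records, **kwargs):
    """Keep the first record of every candidate pair and drop the later one.

    Simplification: a record is dropped whenever it pairs with any earlier
    record, even if that earlier record is itself dropped.
    """
    drop = {j for _, j in near_duplicate_pairs(records, **kwargs)}
    return [r for i, r in enumerate(records) if i not in drop]
```

Candidate pairs are approximate: increasing the number of bands makes the filter more permissive, and a real pipeline would typically confirm candidates with an exact Jaccard similarity check before removing anything.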
Notes: Master of Statistics and Data Science-Biostatistics
Document URI: http://hdl.handle.net/1942/35297
Category: T2
Type: Theses and Dissertations
Appears in Collections: Master theses

Files in This Item:
File: aff78031-f687-49d5-a8bd-f6509442e578.pdf
Size: 1.61 MB
Format: Adobe PDF

Page view(s): 100 (checked on Nov 7, 2023)
Download(s): 54 (checked on Nov 7, 2023)


Items in DSpace are protected by copyright, with all rights reserved, unless otherwise indicated.