Please use this identifier to cite or link to this item:
http://hdl.handle.net/1942/35297
Title: | Applicability domain of chemical reaction modeling |
Authors: | Van Eylen, Tim |
Advisors: | VALKENBORG, Dirk; DYUBANKOVA, Natalia |
Issue Date: | 2021 |
Publisher: | tUL |
Abstract: | MinHash Locality Sensitive Hashing (LSH) was used to find and remove near-duplicates from large chemical datasets to avoid data leakage during training and testing of AI models for forward prediction modelling. The MinHash LSH algorithm is a nearest-neighbour algorithm which provides query times in O(n) time complexity, while pairwise comparisons require O(n²) time complexity, making them intractable for large datasets. A recent attention neural network, Molecular Transformer, was tested on the combination of three large datasets with and without the removal of these near-duplicates and compared against literature. It was concluded that MinHash LSH provides an elegant approach to removing near-duplicates. Furthermore, the reported results of the Molecular Transformer were not generalizable to aggregated datasets, although the reduced accuracy of the model on a reduced dataset could be shown. |
Notes: | Master of Statistics and Data Science-Biostatistics |
Document URI: | http://hdl.handle.net/1942/35297 |
Category: | T2 |
Type: | Theses and Dissertations |
Appears in Collections: | Master theses |
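
The near-duplicate removal described in the abstract can be illustrated with a minimal sketch. It is not the thesis implementation: it assumes each reaction is a plain string (for example a reaction SMILES), uses character 3-grams as the set representation, and picks the signature length, band count and similarity threshold arbitrarily. All function names (`shingles`, `minhash`, `candidate_pairs`, `deduplicate`) are hypothetical, and only the Python standard library is used.

```python
import hashlib
from collections import defaultdict
from itertools import combinations

NUM_PERM = 64             # signature length (assumed value)
BANDS = 16                # number of LSH bands (assumed value)
ROWS = NUM_PERM // BANDS  # signature rows per band


def shingles(text, k=3):
    """Return the set of character k-grams of a string."""
    return {text[i:i + k] for i in range(max(1, len(text) - k + 1))}


def _hash(seed, token):
    """Seeded 64-bit hash; each seed stands in for one random permutation."""
    data = f"{seed}:{token}".encode("utf-8")
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def minhash(tokens):
    """MinHash signature: the minimum hash value of the set under each seed."""
    return tuple(min(_hash(seed, t) for t in tokens) for seed in range(NUM_PERM))


def candidate_pairs(signatures):
    """Banding step: items sharing an identical band land in the same bucket."""
    buckets = defaultdict(list)
    for key, sig in signatures.items():
        for b in range(BANDS):
            band = sig[b * ROWS:(b + 1) * ROWS]
            buckets[(b, band)].append(key)
    pairs = set()
    for members in buckets.values():
        pairs.update(combinations(sorted(members), 2))
    return pairs


def estimated_jaccard(sig_a, sig_b):
    """Fraction of matching signature positions estimates Jaccard similarity."""
    return sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)


def deduplicate(records, threshold=0.8):
    """Keep the first record of every group of near-duplicates."""
    sigs = {i: minhash(shingles(r)) for i, r in enumerate(records)}
    dropped = set()
    for a, b in sorted(candidate_pairs(sigs)):
        if a in dropped or b in dropped:
            continue
        if estimated_jaccard(sigs[a], sigs[b]) >= threshold:
            dropped.add(b)  # b > a, so the earlier record survives
    return [r for i, r in enumerate(records) if i not in dropped]


if __name__ == "__main__":
    reactions = [
        "CCO.CC(=O)O>>CC(=O)OCC",
        "CCO.CC(=O)O>>CC(=O)OCC",
        "c1ccccc1Br.OB(O)c1ccccc1>>c1ccccc1-c1ccccc1",
    ]
    print(deduplicate(reactions))
```

Only records that collide in at least one band are compared, which is what avoids the all-pairs comparison mentioned in the abstract. Raising `ROWS` per band makes the filter stricter; raising `BANDS` makes it more permissive.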
Files in This Item:
File | Description | Size | Format | |
---|---|---|---|---|
aff78031-f687-49d5-a8bd-f6509442e578.pdf | | 1.61 MB | Adobe PDF | View/Open |
Items in DSpace are protected by copyright, with all rights reserved, unless otherwise indicated.