Please use this identifier to cite or link to this item:
http://hdl.handle.net/1942/35297
Title: | Applicability domain of chemical reaction modeling |
Authors: | Van Eylen, Tim |
Advisors: | VALKENBORG, Dirk; DYUBANKOVA, Natalia |
Issue Date: | 2021 |
Publisher: | tUL |
Abstract: | MinHash Locality Sensitive Hashing (LSH) was used to find and remove near-duplicates from large chemical datasets, to avoid data leakage between training and testing of AI models for forward prediction modelling. MinHash LSH is a nearest-neighbour algorithm that provides query times with O(n) time complexity, whereas pairwise comparisons require O(n²) time complexity, which makes them intractable for large datasets. A recent attention-based neural network, the Molecular Transformer, was tested on the combination of three large datasets, with and without removal of these near-duplicates, and compared against the literature. It was concluded that MinHash LSH provides an elegant approach to removing near-duplicates. Furthermore, the reported results of the Molecular Transformer were not generalizable to aggregated datasets, although the reduced accuracy of the model on a reduced dataset could be shown. |
Notes: | Master of Statistics and Data Science-Biostatistics |
Document URI: | http://hdl.handle.net/1942/35297 |
Category: | T2 |
Type: | Theses and Dissertations |
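The near-duplicate removal described in the abstract can be illustrated with a minimal, standard-library-only sketch. This is not the thesis's implementation: the character 3-gram tokenisation, the signature length `NUM_PERM`, the band count `BANDS`, the 0.8 Jaccard threshold, and all function names are assumptions chosen for illustration. Records whose MinHash signatures share at least one band become candidate pairs, and only those candidates are verified with an exact Jaccard similarity, which avoids comparing every pair.

```python
import hashlib
from collections import defaultdict

NUM_PERM = 64  # signature length (assumed value)
BANDS = 16     # number of LSH bands (assumed value); 4 rows per band


def shingles(text, k=3):
    """Character k-grams of a string, e.g. a reaction SMILES (assumed tokenisation)."""
    return {text[i:i + k] for i in range(max(1, len(text) - k + 1))}


def _hash(seed, token):
    """Seeded 64-bit hash; each seed stands in for one random permutation."""
    data = f"{seed}:{token}".encode()
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def minhash(tokens, num_perm=NUM_PERM):
    """MinHash signature: the minimum hash value of the set under each seed."""
    return tuple(min(_hash(seed, t) for t in tokens) for seed in range(num_perm))


def candidate_pairs(signatures, bands=BANDS):
    """Pairs of keys whose signatures agree on at least one whole band."""
    rows = len(next(iter(signatures.values()))) // bands
    buckets = defaultdict(list)
    for key, sig in signatures.items():
        for b in range(bands):
            buckets[(b, sig[b * rows:(b + 1) * rows])].append(key)
    pairs = set()
    for keys in buckets.values():
        for i in range(len(keys)):
            for j in range(i + 1, len(keys)):
                pairs.add(tuple(sorted((keys[i], keys[j]))))
    return pairs


def jaccard(a, b):
    """Exact Jaccard similarity of two sets."""
    return len(a & b) / len(a | b)


def deduplicate(records, threshold=0.8):
    """Drop the lexicographically larger key of each verified near-duplicate pair."""
    sets = {key: shingles(text) for key, text in records.items()}
    sigs = {key: minhash(tokens) for key, tokens in sets.items()}
    dropped = set()
    for a, b in sorted(candidate_pairs(sigs)):
        if a in dropped or b in dropped:
            continue
        if jaccard(sets[a], sets[b]) >= threshold:
            dropped.add(b)
    return [key for key in records if key not in dropped]


# Hypothetical toy records: r1 and r2 are exact duplicates, r3 is unrelated.
records = {
    "r1": "CCO.CC(=O)O>>CC(=O)OCC",
    "r2": "CCO.CC(=O)O>>CC(=O)OCC",
    "r3": "c1ccccc1Br>>c1ccccc1O",
}
kept = deduplicate(records)
```

In this toy example `r2` is removed because its signature is identical to that of `r1`. For real datasets, a maintained library and a chemistry-aware fingerprint would likely be a better choice than character 3-grams.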
Appears in Collections: | Master theses |
Files in This Item:
File | Description | Size | Format | |
---|---|---|---|---|
aff78031-f687-49d5-a8bd-f6509442e578.pdf | | 1.61 MB | Adobe PDF | View/Open |
Items in DSpace are protected by copyright, with all rights reserved, unless otherwise indicated.