Please use this identifier to cite or link to this item:
http://hdl.handle.net/1942/35297
Full metadata record
| DC Field | Value | Language |
|---|---|---|
| dc.contributor.advisor | VALKENBORG, Dirk | |
| dc.contributor.advisor | DYUBANKOVA, Natalia | |
| dc.contributor.author | Van Eylen, Tim | |
| dc.date.accessioned | 2021-09-13T13:06:28Z | - |
| dc.date.available | 2021-09-13T13:06:28Z | - |
| dc.date.issued | 2021 | |
| dc.identifier.uri | http://hdl.handle.net/1942/35297 | - |
| dc.description.abstract | MinHash Locality Sensitive Hashing (LSH) was used to find and remove near-duplicates from large chemical datasets, in order to avoid data leakage between the training and test sets of AI models for forward reaction prediction. MinHash LSH is a nearest-neighbour algorithm that finds near-duplicates in O(n) time, whereas exhaustive pairwise comparison requires O(n²) time and is therefore intractable for large datasets. A recent attention-based neural network, the Molecular Transformer, was tested on the combination of three large datasets, with and without removal of these near-duplicates, and compared against the literature. It was concluded that MinHash LSH provides an elegant approach to removing near-duplicates. Furthermore, the reported results of the Molecular Transformer were not generalizable to aggregated datasets, although the reduced accuracy of the model on a reduced dataset could be shown. | |
| dc.format.mimetype | Application/pdf | |
| dc.language | en | |
| dc.publisher | tUL | |
| dc.title | Applicability domain of chemical reaction modeling | |
| dc.type | Theses and Dissertations | |
| local.bibliographicCitation.jcat | T2 | |
| dc.description.notes | Master of Statistics and Data Science-Biostatistics | |
| local.type.specified | Master thesis | |
| item.accessRights | Open Access | - |
| item.fullcitation | Van Eylen, Tim (2021) Applicability domain of chemical reaction modeling. | - |
| item.fulltext | With Fulltext | - |
| item.contributor | Van Eylen, Tim | - |
| Appears in Collections: | Master theses | |
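
The near-duplicate search described in the abstract can be illustrated with a minimal sketch of MinHash with LSH banding. It assumes each reaction is available as a string (for example a reaction SMILES) and is shingled into character 3-grams; the representation, hash family, and parameters actually used in the thesis are not stated in this record, and the function names `shingles`, `minhash_signature`, and `near_duplicate_groups` are hypothetical.

```python
import hashlib
from collections import defaultdict


def shingles(text, k=3):
    """Return the set of overlapping character k-grams of a string."""
    return {text[i:i + k] for i in range(max(1, len(text) - k + 1))}


def minhash_signature(items, num_perm=64):
    """MinHash signature: the minimum of each seeded 64-bit hash over the set."""
    signature = []
    for seed in range(num_perm):
        salt = seed.to_bytes(4, "big")  # one salted hash function per seed
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(s.encode(), digest_size=8, salt=salt).digest(),
                "big",
            )
            for s in items
        ))
    return signature


def near_duplicate_groups(records, num_perm=64, bands=16):
    """Group record indices that share at least one LSH band bucket.

    Each record is hashed once, so the pass over the data is linear in
    the number of records (assumed parameters: 16 bands of 4 rows).
    """
    rows = num_perm // bands
    buckets = defaultdict(list)
    for idx, text in enumerate(records):
        sig = minhash_signature(shingles(text), num_perm)
        for b in range(bands):
            key = (b, tuple(sig[b * rows:(b + 1) * rows]))
            buckets[key].append(idx)
    # Only buckets holding more than one record are duplicate candidates.
    return {tuple(members) for members in buckets.values() if len(members) > 1}


# Illustrative strings only; they are not taken from the thesis datasets.
records = [
    "CCO>>CC=O",
    "CCO>>CC=O",
    "c1ccccc1Br.OB(O)c1ccccc1>>c1ccccc1-c1ccccc1",
]
groups = near_duplicate_groups(records)
```

In practice the candidate groups would still be checked with an exact or estimated Jaccard similarity before any record is dropped, since banding trades some false positives and false negatives for speed.
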
Files in This Item:
| File | Description | Size | Format | |
|---|---|---|---|---|
| aff78031-f687-49d5-a8bd-f6509442e578.pdf | | 1.61 MB | Adobe PDF | View/Open |
Items in DSpace are protected by copyright, with all rights reserved, unless otherwise indicated.