Please use this identifier to cite or link to this item: http://hdl.handle.net/1942/40266
Title: Materials design through ensemble learning: When the average model knows best
Authors: VANPOUCKE, Danny E.P. 
Mehrkanoon, Siamak
Bernaerts, Katrien V.
Issue Date: 2023
Source: E-MRS Spring Meeting 2023, Strasbourg, France, 29/05/2023-02/06/2023
Abstract: Machine learning plays an ever more important role in modern materials design and discovery, presenting a steady flow of new discoveries. Unfortunately, these achievements are generally rooted in large data sets. Although such big data sets are becoming more commonplace, they are not representative of the day-to-day work performed by materials researchers, where large numbers of samples are often unfeasible due to production cost, production time, or the availability of raw materials. In this work, we investigate the impact of very small data sets (<25 samples) on model quality and show how high-quality models can be constructed even for such data sets.

Machine learning on small data sets
Due to the success of machine learning within the context of large data sets, there is a natural interest in applying these methods to small data sets as well. The use of AI and ML in these cases is generally aimed at improved design of experiments for materials optimisation, often in combination with robotic automation. Work on small data sets (50 to several hundred samples) using active learning and small deep neural networks shows that, even in this context, ML can be successful for materials research. However, the quality of the obtained models is often defined in an ad hoc fashion, as is their sensitivity to the data used. Moreover, the required human selection steps, though clearly present, are generally not discussed.

Model quality in small data sets
In this work, we present a critical investigation of the role of small (<25 samples) data sets in ML-based regression analysis. We start from a conceptual analysis of the quality of ML models, using training, validation and test sets. In this discussion, the strong dependence of the model quality on the considered data points is highlighted as an important limitation of ML. Using both synthetic and experimental data sets, we show that the model instances of an ensemble are distributed around the model average [1,2]. This result appears to be independent of the underlying model. More interestingly, we find that this ensemble average presents a model quality on par with that of the best available model instance in the ensemble for that data set. We therefore propose to construct a single model instance that is equivalent to the ensemble average but presents a much lower computational cost for evaluation and storage. This mitigates the observed limitation of ML for small data sets and makes ML accessible within the context of day-to-day small-scale materials projects. [2]
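To make the ensemble-averaging idea concrete, the following is a minimal Python sketch, not the authors' implementation: the synthetic data, the ridge-regression model, and the ensemble size of 100 are illustrative assumptions. It trains an ensemble of model instances on bootstrap resamples of a small (<25 samples) data set, scores each instance against the ensemble-average prediction on held-out data, and shows how, for a linear model, a single instance equivalent to the ensemble average can be built by averaging coefficients.

# Minimal sketch (not the authors' code): ensemble of model instances on a
# tiny data set, compared against the ensemble-average prediction.
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)

# Illustrative "small data" problem: 20 training samples, 1 feature.
X = rng.uniform(-2, 2, size=(20, 1))
y = 1.5 * X[:, 0] - 0.7 + rng.normal(scale=0.2, size=20)

# Held-out test set, used only to score instances and their average.
X_test = rng.uniform(-2, 2, size=(200, 1))
y_test = 1.5 * X_test[:, 0] - 0.7

# Train an ensemble: each instance sees a bootstrap resample of the data.
instances = []
for _ in range(100):
    idx = rng.integers(0, len(X), size=len(X))
    instances.append(Ridge(alpha=1.0).fit(X[idx], y[idx]))

preds = np.array([m.predict(X_test) for m in instances])  # (100, 200)
instance_mse = [mean_squared_error(y_test, p) for p in preds]

# Ensemble-average prediction: mean over the instance predictions.
avg_mse = mean_squared_error(y_test, preds.mean(axis=0))

print(f"best instance MSE:    {min(instance_mse):.4f}")
print(f"median instance MSE:  {np.median(instance_mse):.4f}")
print(f"ensemble-average MSE: {avg_mse:.4f}")

# For a linear model, a single equivalent instance follows from averaging
# the fitted coefficients, so the full ensemble need not be stored.
coef = np.mean([m.coef_ for m in instances], axis=0)
intercept = np.mean([m.intercept_ for m in instances])
single_pred = X_test @ coef + intercept  # equals preds.mean(axis=0) here

For linear models the coefficient average reproduces the prediction average exactly; for nonlinear models, a dedicated equivalent instance (as proposed in the abstract) is what avoids storing and evaluating the full ensemble.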
Other: The author name needs to be updated to include middle names: Danny Vanpoucke is to be updated to Danny E.P. Vanpoucke, and correctly coupled to the personnel ID in the UHasselt database, which incorrectly omits the middle names. This work is not a duplicate; the duplicate flag is a false positive caused by checking only the title of the object.
Document URI: http://hdl.handle.net/1942/40266
Category: C2
Type: Conference Material
Appears in Collections:Research publications

Files in This Item:
File: Abstract_3000Char.pdf (Restricted Access)
Description: Conference material
Size: 46.84 kB
Format: Adobe PDF