Please use this identifier to cite or link to this item: http://hdl.handle.net/1942/41551
Title: High-dimensional data
Authors: GEUBBELMANS, Melvin 
ROUSSEAU, Axel-Jan 
VALKENBORG, Dirk 
BURZYKOWSKI, Tomasz 
Issue Date: 2023
Publisher: MOSBY-ELSEVIER
Source: AMERICAN JOURNAL OF ORTHODONTICS AND DENTOFACIAL ORTHOPEDICS, 164 (3), p. 453-456
Abstract: Big data is a relatively recent term that has emerged because of the rapid collection and generation of data, the increase in storage and computing capacity, and the rise of a new generation of machine learning algorithms. It is generally characterized by 3 "V's." The first "V," volume, refers to the size and scale of the data. This can be the number of subjects (observations) or variables (covariates, features) in the dataset. Velocity, the second "V," stands for the rate at which the data are generated and the speed of analysis. The final "V," variety, refers to the variation within a dataset and is associated with noisiness, the occurrence of missing data, or differences in the storage methods. Specialized and distributed computer architectures that can provide adequate memory and processing power are often required to handle extremely large datasets. In machine learning (ML), the volume of the datasets and the increasing number of variables pose additional issues to the ML algorithms, especially when the number of variables (features) vastly exceeds the number of observations. This case is often called the (n ≪ p)-problem or high-dimensional data; sometimes, it is called the curse of dimensionality. In what follows, we present 2 examples of problems regarding high-dimensional data.

In a previous article of this series,1 the K-nearest-neighbors algorithm was discussed. One of the main assumptions of this method is that 2 observations have to be close to each other in every dimension (across each variable). Adding dimensions (variables) without adding data reduces the reliability of the algorithm because the points will lie further apart in the expanded variable space, resulting in empty data regions. To compensate for this, the number of observations in the dataset should be increased accordingly.

We will use a linear regression model to illustrate another problem with high dimensionality. Let us denote by n and p the number of observations and the total number of variables (p − 1 explanatory variables and the dependent variable) in a dataset used to fit a linear regression model. The least-squares method is often applied to ensure the best fit of the model. A linear combination of the p − 1 explanatory variables is found such that the residual sum of squares is the smallest. When the number of observations is less than or equal to the number of variables (n ≤ p), it is possible to find combinations of the features with the residual sum of squares equal to zero (see Fig 1), leading to a model that fits the data perfectly. However, the results of validating such a model on an independent testing dataset will likely be poor because of overfitting.2

The curse of dimensionality also holds for classification tasks. When adding more explanatory variables to a dataset, the dimensionality increases to a point at which the classification problem can be solved perfectly (without misclassification), even considering the most straightforward and inflexible models. This principle is exploited in support vector machines, a technique that will be discussed in a subsequent article in this series on ML. Two approaches can be considered to address the curse of dimensionality when fitting a model: selection of only important variables (features) when building the model or dimension reduction based on data transformations.
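To make the first example concrete, the following Python sketch (not part of the article; the sample size and dimensions are arbitrary illustrative choices) shows how the average nearest-neighbor distance grows as variables are added while the number of observations stays fixed, which is the "empty data regions" effect described above.

```python
# Sketch: average nearest-neighbor distance versus dimensionality for a
# fixed number of observations. Illustrative only; n = 100 and the chosen
# dimensions are arbitrary assumptions.
import numpy as np

rng = np.random.default_rng(0)
n = 100  # fixed number of observations

for p in (1, 2, 10, 100):
    X = rng.uniform(size=(n, p))                      # n points in the unit hypercube
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)  # pairwise distances
    np.fill_diagonal(d, np.inf)                       # ignore self-distances
    print(f"p = {p:3d}: mean nearest-neighbor distance = {d.min(axis=1).mean():.3f}")
```

As p increases with n held constant, the nearest neighbors drift further away, which is why K-nearest-neighbors becomes less reliable without additional observations.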
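The least-squares example can be illustrated in the same spirit. The sketch below (simulated data; all sizes are arbitrary choices, not taken from the article) fits an ordinary least-squares model with more features than observations using the minimum-norm solution: the training residual sum of squares is essentially zero, while the error on an independent test set stays large, i.e., overfitting.

```python
# Sketch of the n <= p problem for least squares: with more features than
# observations the training fit is perfect, yet the model generalizes poorly.
# Simulated data; sizes and coefficients are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(1)
n_train, n_test, p = 20, 200, 50            # n <= p
beta = np.zeros(p)
beta[:3] = [2.0, -1.0, 0.5]                 # only 3 features truly matter

X_train = rng.normal(size=(n_train, p))
X_test = rng.normal(size=(n_test, p))
y_train = X_train @ beta + rng.normal(scale=1.0, size=n_train)
y_test = X_test @ beta + rng.normal(scale=1.0, size=n_test)

# Minimum-norm least-squares solution (lstsq handles the rank-deficient case)
coef, *_ = np.linalg.lstsq(X_train, y_train, rcond=None)

rss_train = np.sum((y_train - X_train @ coef) ** 2)
mse_test = np.mean((y_test - X_test @ coef) ** 2)
print(f"training RSS = {rss_train:.2e} (essentially zero)")
print(f"test MSE     = {mse_test:.2f} (overfitting)")
```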
FEATURE SELECTION
Apart from feature subset selection methods, a general approach to performing feature selection is model regularization, that is, adding, in the search for the best-fitting model, a penalty for the number and magnitude of the coefficients included in the model.2 For instance, in the Least Absolute Shrinkage and Selection Operator (LASSO), the penalty takes the form of the sum of the absolute values of all coefficients multiplied by a non-negative constant λ. This penalty term is also known as the shrinkage penalty. LASSO performs
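As a minimal illustration of the shrinkage penalty, the sketch below applies LASSO to simulated high-dimensional data using scikit-learn (a choice of implementation not prescribed by the article; its alpha argument plays the role of the constant λ above, and its value here is arbitrary). Most coefficients are shrunk exactly to zero, so the non-zero ones identify the selected features.

```python
# Minimal LASSO sketch on simulated high-dimensional data (p >> n).
# Simulated data; sizes, coefficients, and alpha are illustrative assumptions.
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(2)
n, p = 50, 200                               # high-dimensional: p >> n
beta = np.zeros(p)
beta[:5] = [3.0, -2.0, 1.5, 1.0, -0.5]       # only 5 truly relevant features

X = rng.normal(size=(n, p))
y = X @ beta + rng.normal(scale=0.5, size=n)

lasso = Lasso(alpha=0.1).fit(X, y)           # alpha is the shrinkage-penalty weight
selected = np.flatnonzero(lasso.coef_)
print(f"features with non-zero coefficients: {selected}")
```

In practice the penalty weight is usually chosen by cross-validation rather than fixed by hand.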
Notes: Burzykowski, T (corresponding author), Hasselt Univ, Data Sci Inst, Agoralaan 1,Bldg D, B-3590 Diepenbeek, Belgium.
tomasz.burzykowski@uhasselt.be
Document URI: http://hdl.handle.net/1942/41551
ISSN: 0889-5406
e-ISSN: 1097-6752
DOI: 10.1016/j.ajodo.2023.06.012
ISI #: 001068615000001
Rights: 2023
Category: A2
Type: Journal Contribution
Appears in Collections:Research publications

Files in This Item:
File: High-dimensional data.pdf (Published version, 377.99 kB, Adobe PDF)
Web of Science citations: 2 (checked on Apr 22, 2024)