Please use this identifier to cite or link to this item: http://hdl.handle.net/1942/41809
Title: Generalized linear models
Authors: BURZYKOWSKI, Tomasz 
GEUBBELMANS, Melvin 
ROUSSEAU, Axel-Jan 
VALKENBORG, Dirk 
Issue Date: 2023
Publisher: MOSBY-ELSEVIER
Source: AMERICAN JOURNAL OF ORTHODONTICS AND DENTOFACIAL ORTHOPEDICS, 164 (4), p. 604-606
Abstract: Machine learning (ML) algorithms use statistical models to find patterns or structures in data. These models can formulate predictions for new observations on the basis of these patterns, which the ML algorithm can translate into decisions. The simple linear regression model [1] is the most fundamental of all statistical and ML models. It is used to describe the effect of a continuous explanatory variable (covariate) on the mean (expected) value of a continuous dependent variable (continuous response). In particular, the mean value of the dependent variable is expressed as a linear function of the covariate. To illustrate the idea, let us consider Figure A. It presents a scatterplot of measurements of the physical health and environment quality-of-life domains obtained with the World Health Organization Quality of Life BREF questionnaire [2] for 290 subjects with and without oral submucous fibrosis (OSMF) [3]. The plot suggests a positive association between the 2 measurements: the physical health scores seem to increase (ie, become more favorable) for increasing (ie, more favorable) environment scores. To describe this relationship, we can apply a simple linear regression model with the physical health score as the dependent variable and the environment score as the explanatory variable. As a result, we obtain the following estimated form of the model: mean physical health score = 8.9 + 0.33 × (environment score). The 95% confidence interval for the coefficient of the environment score is 0.27 to 0.40. As it does not include 0, we can reject (at the 2-sided significance level of 0.05) the null hypothesis that the true value of the coefficient is equal to 0. On the basis of the estimated coefficient value, we can conclude that the mean of the physical health score increases by about 0.33 for a unit increase in the environment score. Figure B includes a straight line illustrating the estimated regression model.
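As an illustration of the fitting and interval steps described above, here is a minimal Python sketch of ordinary least squares for one covariate. It is not the authors' analysis: the data are simulated, the function name `fit_simple_linear_regression` is hypothetical, the noise level and score range are assumptions, and the interval uses the large-sample normal quantile 1.96 rather than an exact t quantile. The values 8.9 and 0.33 are borrowed from the text only to make the simulation resemble the reported fit.

```python
import numpy as np


def fit_simple_linear_regression(x, y, z=1.96):
    """Ordinary least squares for mean(y) = b0 + b1 * x.

    Returns (b0, b1, (ci_low, ci_high)); the interval for b1 uses the
    normal quantile z (1.96 gives roughly 95% coverage in large samples).
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n = x.size
    sxx = np.sum((x - x.mean()) ** 2)
    b1 = np.sum((x - x.mean()) * (y - y.mean())) / sxx  # slope
    b0 = y.mean() - b1 * x.mean()                       # intercept
    residuals = y - (b0 + b1 * x)
    sigma2 = np.sum(residuals ** 2) / (n - 2)           # residual variance
    se_b1 = np.sqrt(sigma2 / sxx)                       # standard error of slope
    return b0, b1, (b1 - z * se_b1, b1 + z * se_b1)


# Simulated data, NOT the study data (score range and noise are assumptions).
rng = np.random.default_rng(0)
env = rng.uniform(10, 40, size=290)
phys = 8.9 + 0.33 * env + rng.normal(0, 3, size=290)

b0, b1, ci = fit_simple_linear_regression(env, phys)

# Reject "slope = 0" at the 2-sided 0.05 level when the interval excludes 0.
slope_differs_from_zero = not (ci[0] <= 0.0 <= ci[1])
print(f"slope = {b1:.2f}, approx. 95% CI = ({ci[0]:.2f}, {ci[1]:.2f})")
```

The slope `b1` is read exactly as in the text: the estimated change in the mean response for a unit increase in the covariate.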
The regression line does seem to fit the cloud of points reasonably well. Simple linear regression can be extended to include more than 1 covariate. The extension leads to a multiple linear regression model [1]. In practice, some (or even all) of the potential explanatory variables may be discrete (categorical) rather than continuous (ie, they may assume only a limited set of values). Sometimes such variables are referred to as factors, with their values referred to as levels. Examples include sex (with 2 levels: male and female), smoking status (with 3 levels: nonsmoker, light-smoker, heavy-smoker), race, and so on. A linear regression model that includes only factors as explanatory variables is equivalent to an analysis of variance (ANOVA) model. A linear regression model that includes a mix of covariates and factors can be seen as equivalent to an analysis of covariance (ANCOVA) model. We can include a factor in a linear regression model using dummy variables (ie, binary variables coding particular factor levels); for a factor with K levels, we should include the dummy variables corresponding to only K − 1 of the levels in the model. To illustrate the idea, let us consider Figure C. It presents the scatterplot of measurements of the physical health and environment scores, but with colors indicating the OSMF status of the subjects. For both controls (subjects without OSMF) and cases (subjects with OSMF), the plot suggests a positive association between the 2 scores. It seems, however, that the physical health scores for cases are slightly lower than for controls. To quantify this observation, we may apply a linear regression model with the physical health score as the dependent variable, and the environment score and the OSMF status as the explanatory variables. In particular, for the OSMF status (a factor with 2 levels), we use 1 dummy variable for the cases (ie, equal to 0 for controls and 1 for cases). The estimated ANCOVA model is as follows:
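The estimated coefficients are not included in this record. As a sketch of the dummy-variable coding only, the Python example below builds K − 1 binary columns for a factor with K levels and fits an ANCOVA-type model by least squares. The helper `dummy_code`, the simulated data, and every coefficient used to generate them are hypothetical and do not reproduce the authors' results.

```python
import numpy as np


def dummy_code(levels, reference):
    """Code a factor as 0/1 columns, one per non-reference level.

    A factor with K levels yields K - 1 columns; the reference level is
    represented by all columns being 0. Assumes at least 2 distinct levels.
    """
    levels = np.asarray(levels)
    kept = [lv for lv in sorted(set(levels.tolist())) if lv != reference]
    matrix = np.column_stack([(levels == lv).astype(float) for lv in kept])
    return kept, matrix


# A 3-level factor needs only 2 dummy variables.
smoke_names, smoke_dummies = dummy_code(
    ["nonsmoker", "light-smoker", "heavy-smoker", "nonsmoker"],
    reference="nonsmoker",
)

# Simulated ANCOVA-type data, NOT the study data (all values are assumptions).
rng = np.random.default_rng(1)
n = 200
env = rng.uniform(10, 40, size=n)                 # continuous covariate
status = rng.choice(["control", "case"], size=n)  # factor with 2 levels
names, d = dummy_code(status, reference="control")  # 0 = control, 1 = case
phys = 10.0 + 0.3 * env - 1.5 * d[:, 0] + rng.normal(0, 2, size=n)

# Design matrix: intercept, covariate, and the single dummy variable.
X = np.column_stack([np.ones(n), env, d])
coef, *_ = np.linalg.lstsq(X, phys, rcond=None)
intercept, slope, case_shift = coef
print(f"intercept={intercept:.2f}, slope={slope:.2f}, case shift={case_shift:.2f}")
```

In this coding, `case_shift` estimates the difference in mean response between cases and controls at the same covariate value, which is how a dummy-variable coefficient is interpreted in an ANCOVA model.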
Notes: Burzykowski, T (corresponding author), Hasselt Univ, Data Sci Inst, Agoralaan 1,Bldg D, B-3590 Diepenbeek, Belgium.
tomasz.burzykowski@uhasselt.be
Document URI: http://hdl.handle.net/1942/41809
ISSN: 0889-5406
e-ISSN: 1097-6752
DOI: 10.1016/j.ajodo.2023.07.005
ISI #: 001085280300001
Rights: 2023 by the American Association of Orthodontists. All rights reserved. https://doi.org/10.1016/j.ajodo.2023.07.0
Category: A2
Type: Journal Contribution
Appears in Collections: Research publications

Files in This Item:
File: Generalized linear models.pdf (Restricted Access)
Description: Published version
Size: 480.66 kB
Format: Adobe PDF

Web of Science citations: 2 (checked on Apr 24, 2024)


Items in DSpace are protected by copyright, with all rights reserved, unless otherwise indicated.