Computationally Efficient Estimators for Large Clustered Data and Related Topics

FLOREZ POVEDA, Alvaro

Please use this identifier to cite or link to this item: http://hdl.handle.net/1942/31828

Title:	Computationally Efficient Estimators for Large Clustered Data and Related Topics
Authors:	FLOREZ POVEDA, Alvaro
Advisors:	Molenberghs, Geert; Verbeke, Geert
Issue Date:	2020
Abstract:	The analysis of clustered data is commonly carried out with the generalized linear mixed model (GLMM). Here, the variability and correlation due to clustering are captured by cluster-specific, normally distributed random effects. Routinely, the GLMM is fitted by maximizing the marginal likelihood. For normally distributed outcomes, the marginal (log-)likelihood is available in closed form, and the estimation is done iteratively. For other types of outcomes, the marginalization is done numerically and still has to be combined with iterative maximization methods. Generally, the whole fitting process is computationally manageable with medium to relatively large data. Nevertheless, for complex models with many variance components and several large clusters, these procedures can be too time-consuming or even intractable. Therefore, we propose a fast two-stage estimator for fitting a GLMM. It is rooted in the pseudo-likelihood split-sample methodology, in which the sample is partitioned into K subsamples that are analyzed separately, after which the results are combined to obtain overall inferences. For clustered data in particular, this partitioning can be done in terms of independent or dependent subsamples. We consider the most extreme partitioning, in which every subsample consists of a single cluster, leading to the so-called cluster-by-cluster (CbC) estimator. It follows the same two steps as the split-sample methodology: (1) a generalized linear model (GLM) is fitted individually in each cluster; (2) a global set of parameter estimates is computed using weights. However, the covariance matrix of the random effects (D) measures the variability between clusters and consequently cannot be estimated from a single cluster. Hence, a method-of-moments approach is proposed.
In Chapter 4, we focus on normally distributed clustered data, i.e., the linear mixed model (LMM). In this case, the CbC estimator can be expressed in closed form and is therefore computationally highly efficient. In Chapter 5, we extend the CbC estimator to the GLMM, with special attention to binary, count, and time-to-event outcomes. Moreover, in Chapter 6, our methodology is expanded to the analysis of clustered count data with overdispersion and/or an excess of zeros. In the latter cases, the CbC estimator is no longer available in closed form, but it is still computationally less intensive than the maximum likelihood estimator (MLE). Based on theoretical results and simulations, our proposal shows good statistical properties. Its performance depends mostly on the type of outcome and the variance structure. Generally, it is asymptotically efficient when the size of the clusters increases faster than the number of clusters, a hierarchical setting commonly encountered in, for example, meta-analysis.
Motivated by the use of the asymmetric Laplace (AL) distribution for fitting a quantile regression (QR) model, in Chapter 7 we consider an MLE for a QR model for continuous clustered data based on the multivariate asymmetric Laplace (MAL) distribution. The MAL distribution is closed under marginalization, i.e., its marginals are univariate AL distributions. This implies that, marginally, the connection with univariate quantile regression holds. However, some issues with the MAL distribution require special attention. In particular, its density has an asymptote at the origin. Consequently, the likelihood surface contains a ‘minefield’ of spikes, which leads to problems in maximizing the likelihood. We solve these issues by a slight modification of the MAL density so that it becomes a well-behaved, smooth density. Although this adjustment may lead to some bias, we show via simulations that this bias is negligible for the QR parameters. Furthermore, the resulting estimator is more efficient than the univariate QR estimator, which ignores the association due to clustering.
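The two-step cluster-by-cluster idea described for the linear mixed model can be illustrated with a minimal sketch. It is not the thesis's implementation and rests on simplifying assumptions that are not stated in the abstract: every fixed effect has a corresponding random effect (so each cluster can be fitted by ordinary least squares), the per-cluster estimates are combined with equal weights, and D is obtained by a simple moment correction. The names `cbc_lmm` and `simulate_clusters` are hypothetical.

```python
import numpy as np

def cbc_lmm(clusters):
    """Cluster-by-cluster sketch for y_i = X_i (beta + b_i) + e_i.

    clusters: list of (X_i, y_i), X_i of shape (n_i, p) with n_i > p.
    Assumes equal weights in the combination step (an illustrative choice).
    """
    betas, xtx_invs = [], []
    rss, dof = 0.0, 0
    # Step 1: fit each cluster separately by ordinary least squares.
    for X, y in clusters:
        xtx_inv = np.linalg.inv(X.T @ X)
        b = xtx_inv @ X.T @ y
        resid = y - X @ b
        betas.append(b)
        xtx_invs.append(xtx_inv)
        rss += resid @ resid
        dof += X.shape[0] - X.shape[1]
    betas = np.array(betas)
    # Step 2: combine the cluster-specific estimates.
    beta_hat = betas.mean(axis=0)
    sigma2_hat = rss / dof  # pooled residual variance
    # Method of moments: Var(beta_i_hat) = D + sigma^2 (X_i'X_i)^{-1},
    # so subtract the average sampling variance from the empirical covariance.
    D_hat = np.cov(betas, rowvar=False) - sigma2_hat * np.mean(xtx_invs, axis=0)
    return beta_hat, sigma2_hat, D_hat

def simulate_clusters(n_clusters, n_per_cluster, beta, D, sigma, rng):
    """Simulate random intercept-and-slope data (hypothetical helper)."""
    clusters = []
    for _ in range(n_clusters):
        t = rng.uniform(0.0, 1.0, size=n_per_cluster)
        X = np.column_stack([np.ones(n_per_cluster), t])
        b = rng.multivariate_normal(np.zeros(len(beta)), D)
        y = X @ (beta + b) + rng.normal(0.0, sigma, size=n_per_cluster)
        clusters.append((X, y))
    return clusters

rng = np.random.default_rng(0)
beta_true = np.array([1.0, 2.0])
D_true = np.array([[1.0, 0.3], [0.3, 0.5]])
data = simulate_clusters(500, 30, beta_true, D_true, 1.0, rng)
beta_hat, sigma2_hat, D_hat = cbc_lmm(data)
print(beta_hat, sigma2_hat, D_hat, sep="\n")
```

No iterative optimization is involved: each step is a closed-form matrix computation, which is what makes this kind of estimator cheap for many large clusters. The moment-based `D_hat` is not guaranteed to be positive semi-definite in small samples.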
In the causal-inference paradigm, multivariate surrogate endpoint assessment can be done by means of the individual causal association (ICA). The ICA is a function of a partially identifiable correlation matrix (R). Consequently, it cannot be estimated from the data without imposing untestable assumptions. This issue is tackled by estimating the ICA across the set of values of the unidentifiable entries of R that lead to a valid correlation matrix, the so-called ΩR space. This sensitivity analysis helps us to quantify the non-identifiability regarding the ICA. However, this matrix completion problem is very challenging. In Chapter 8, a fast algorithm that builds on previous work is developed to generate high-dimensional correlation matrices with some elements fixed in advance. Based on simulations and data analysis, our methodology allows us to evaluate several surrogates jointly in a fast manner.
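To make the matrix completion problem concrete, the sketch below fills the unidentified entries of a correlation matrix with uniform draws and keeps only the draws that yield a valid (positive-definite) matrix. This is plain rejection sampling, shown only to illustrate the problem; it is not the fast algorithm of Chapter 8. The function name `sample_valid_completions`, the 4 x 4 layout, and the fixed values 0.8 and 0.7 are hypothetical.

```python
import numpy as np

def sample_valid_completions(R_partial, n_draws, rng):
    """Naive rejection sampler for completing a correlation matrix.

    R_partial: symmetric matrix with unit diagonal; np.nan marks the
    entries that are not identifiable and must be filled in.
    Returns the list of completed matrices that are positive definite.
    """
    # Positions of the free entries in the strict upper triangle.
    rows, cols = np.where(np.triu(np.isnan(R_partial), k=1))
    accepted = []
    for _ in range(n_draws):
        R = R_partial.copy()
        vals = rng.uniform(-1.0, 1.0, size=rows.size)
        R[rows, cols] = vals
        R[cols, rows] = vals  # keep the matrix symmetric
        # Valid correlation matrix: all eigenvalues strictly positive.
        if np.linalg.eigvalsh(R).min() > 0:
            accepted.append(R)
    return accepted

# Illustrative 4 x 4 case with two entries fixed (hypothetical values).
R0 = np.full((4, 4), np.nan)
np.fill_diagonal(R0, 1.0)
R0[0, 2] = R0[2, 0] = 0.8
R0[1, 3] = R0[3, 1] = 0.7

rng = np.random.default_rng(1)
valid = sample_valid_completions(R0, 5000, rng)
print(f"accepted {len(valid)} of 5000 draws")
```

The fraction of accepted draws shrinks quickly as the dimension and the number of free entries grow, which is why a naive approach of this kind becomes impractical for several surrogates evaluated jointly.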
Document URI:	http://hdl.handle.net/1942/31828
Category:	T1
Type:	Theses and Dissertations
Appears in Collections:	Research publications

Files in This Item:

File	Description	Size	Format
Ordner1.pdf		4.8 MB	Adobe PDF
