Inference With Unequal Cluster Sizes

HERMANS, Lisa

Please use this identifier to cite or link to this item: http://hdl.handle.net/1942/27981

Full metadata record

DC Field	Value	Language
dc.contributor.advisor	MOLENBERGHS, Geert	-
dc.contributor.advisor	AERTS, Marc	-
dc.contributor.author	HERMANS, Lisa	-
dc.date.accessioned	2019-04-05T10:26:37Z	-
dc.date.available	2019-04-05T10:26:37Z	-
dc.date.issued	2019	-
dc.identifier.uri	http://hdl.handle.net/1942/27981	-
dc.description.abstract	A random sample is not always of a fixed, a priori determined size. Examples include sequential sampling and stopping rules, missing data, and clusters with random size. In such cases there is often no complete sufficient statistic. Completeness means that any measurable function of a sufficient statistic that has zero expectation for every value of the parameter indexing the parametric model class is the zero function almost everywhere. A simple characterization of incompleteness is given for the exponential family in terms of the mapping between the sufficient statistic and the parameter, based upon the implicit function theorem. Essentially this is a comparison of the dimension of the sufficient statistic to the length of the parameter vector. This results in an easily verifiable criterion for incompleteness, clear and simple to use, even in complex settings, as is shown for missing data and clusters of random size. The analysis of hierarchical data that take the form of clusters with random size has received considerable attention in the literature. In this work, the focus was on clustered data with unequal cluster sizes, meaning that a joint model of outcome and sample size was not studied. Also, the focus here was on samples that are very large in terms of number of clusters and/or members per cluster, on the one hand, as well as on very small samples (e.g., when studying rare diseases), on the other. Whereas maximum likelihood inference is straightforward in medium to large samples, in samples of the sizes considered here it may become prohibitive. Sample-splitting (Molenberghs, Verbeke, and Iddi, 2011) was proposed as a way to replace iterative optimization of a likelihood that does not admit an analytical solution, with closed-form calculations. Pseudo-likelihood (Molenberghs et al., 2014), consisting of computing weighted averages over solutions obtained from subsamples created according to sample size, was used. 
As a result, the statistical properties of this approach were investigated. In a first attempt, the compound-symmetry variance structure was used to investigate this modelling framework. In a subsample with only clusters of the same size, there are closed-form solutions, and other useful properties can be obtained. The operational characteristics are studied using simulations. It follows that the proposed non-iterative methods have a strong beneficial impact on computation time. Next, statistically and computationally efficient estimation in a hierarchical data setting with unequal cluster sizes and an AR(1) covariance structure was studied. As for the compound-symmetry model, the pseudo-likelihood and split-sample methods of Fieuws and Verbeke (2006) and Molenberghs, Verbeke, and Iddi (2011) were used. Maximum likelihood estimation for AR(1) requires numerical iteration when cluster sizes are unequal. A near-optimal non-iterative procedure was proposed. Results showed that the method is statistically nearly as efficient as maximum likelihood, but shows great savings in computation time. The odds ratio is a frequently used measure to investigate the association between binary variables. Often, such outcomes are measured across strata of different sizes. Mantel and Haenszel (1959) proposed estimators for a common odds ratio, taking into account the stratification. The most common one is among the best known and most used estimators in statistics. The setting studied by Mantel and Haenszel fits within this framework of sample-splitting and combining with proper weights. The Mantel and Haenszel estimator does not follow from optimality considerations, but nevertheless has properties similar to and often better than the optimal estimator. This was shown by comparing it to the optimal estimator, whose existence was demonstrated in spite of the absence of complete sufficient statistics. 
It is shown, via simulations, that the optimal estimator outperforms the Mantel-Haenszel estimator only in certain settings with huge sample sizes. Missing data are almost inevitable in correlated-data studies. For non-Gaussian outcomes with moderate to large sequences, direct-likelihood methods can involve complex, hard-to-manipulate likelihoods. Popular alternative approaches, like generalized estimating equations, that are frequently used to circumvent the computational complexity of full likelihood, are less suitable when scientific interest, at least in part, is placed on the association structure; pseudo-likelihood methods are then a viable alternative. When the missing data are missing at random, Molenberghs et al. (2011) proposed a suite of corrections to the standard form of pseudo-likelihood, taking the form of singly and doubly robust estimators. They provided the basis, and exemplified it in insightful yet primarily illustrative examples. The important case of marginal models for hierarchical binary data was considered. Our doubly robust estimator is more convenient than the classical doubly robust estimators. The ideas are illustrated using a marginal model for a binary response, more specifically a Bahadur model.	-
dc.description.abstract	A sample is not always of a fixed, predetermined size. Examples are sequential studies, missing data, and unbalanced hierarchical data. In such settings there is often no complete sufficient statistic. A simple characterization of completeness is formulated for the exponential family in terms of a comparison of the dimensions of the sufficient statistic and the parameter, based on the implicit function theorem. It is a simple and easily verifiable criterion, even for complex settings with missing data and unbalanced hierarchical data. Unbalanced hierarchical data have already been studied from various angles. In this thesis the focus is on samples that are very large, i.e., with many clusters or many measurements per cluster, and on samples that are very small (studies of rare diseases). Determining the maximum likelihood estimator in medium-sized samples is quite feasible, but in the settings discussed here it can pose difficulties: there are no closed-form analytical solutions and the likelihood function can only be optimized iteratively. Consequently, the sample was split into parts according to cluster size (Molenberghs, Verbeke, and Iddi, 2011). These subsamples are thereby balanced and do yield closed-form solutions. A pseudo-likelihood was used to combine the solutions from each subsample by means of weights. The properties of this methodology were examined in detail on balanced data that follow a compound-symmetry covariance structure. Its applicability was investigated through a simulation study. It follows that this non-iterative method requires only a short computation time and is very precise. 
Next, this estimation method was investigated further in an unbalanced hierarchical data set with an autoregressive (AR(1)) covariance structure. Here too, the method is nearly as efficient as maximum likelihood and the computation time is much lower. The odds ratio is a statistic that is frequently used to investigate the association between binary variables. In such settings, too, the data may be grouped into strata of unequal size. The best-known and most widely used estimator is the one devised by Mantel and Haenszel (1959). The estimator combines the odds ratios of subpopulations into a weighted estimator, but does not follow from optimality calculations. The Mantel and Haenszel estimator was compared with the optimal estimator. It can be concluded that the Mantel and Haenszel estimator has very good properties. Only in settings with very large sample sizes does the optimal estimator outperform the Mantel and Haenszel estimator. Missing data occur very often in such settings. For non-normally distributed data from a very large sample, the calculations for the likelihood function can become very complex. Generalized estimating equations are then a good alternative, but are less suitable when interest lies (in part) in the correlation structure of the data. Pseudo-likelihood functions are better suited here. When the missing data are missing at random, Molenberghs et al. (2011) made singly and doubly robust adjustments to the standard pseudo-likelihood function to allow correct inference. Whereas they laid the general foundation, this work focused on marginal models for hierarchical binary data. A Bahadur model was chosen here as the marginal model.	-
dc.language.iso	en	-
dc.subject.other	Random Cluster Size; Pseudo-likelihood; Weighted Estimation; Completeness; Missing Data	-
dc.title	Inference With Unequal Cluster Sizes	-
dc.type	Theses and Dissertations	-
local.format.pages	258	-
local.bibliographicCitation.jcat	T1	-
local.type.refereed	Non-Refereed	-
local.type.specified	Phd thesis	-
item.fulltext	With Fulltext	-
item.contributor	HERMANS, Lisa	-
item.accessRights	Open Access	-
item.fullcitation	HERMANS, Lisa (2019) Inference With Unequal Cluster Sizes.	-
Appears in Collections:	Research publications
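
Illustrative note: the abstract describes the Mantel and Haenszel (1959) common odds ratio as a weighted combination over strata of unequal size. The sketch below shows the standard textbook form of that estimator, sum(a*d/n) / sum(b*c/n), for 2x2 tables with cells (a, b, c, d). It is not taken from the thesis; the function name and the input layout are assumptions made for illustration only.

```python
from typing import Iterable, Tuple

# Hypothetical helper, not from the thesis: standard Mantel-Haenszel
# common odds ratio over stratified 2x2 tables.
# Each stratum is (a, b, c, d):
#   a = exposed cases,   b = exposed non-cases,
#   c = unexposed cases, d = unexposed non-cases.
def mantel_haenszel_or(strata: Iterable[Tuple[float, float, float, float]]) -> float:
    numerator = 0.0
    denominator = 0.0
    for a, b, c, d in strata:
        n = a + b + c + d
        if n == 0:
            continue  # an empty stratum carries no information
        numerator += a * d / n
        denominator += b * c / n
    if denominator == 0:
        raise ValueError("estimator undefined: sum of b*c/n is zero")
    return numerator / denominator


if __name__ == "__main__":
    # Two strata of different sizes, each with stratum odds ratio 4.
    tables = [(2, 1, 1, 2), (4, 2, 2, 4)]
    print(mantel_haenszel_or(tables))
```

With a single stratum the estimator reduces to the usual cross-product ratio a*d/(b*c); with several strata that share the same odds ratio, it returns that common value.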

Files in This Item:

File	Description	Size	Format
Thesis - Lisa Hermans.pdf		2.67 MB	Adobe PDF	View/Open
