Statistical models for high-throughput proteomic and genomic data

CLAESEN, Jurgen

Please use this identifier to cite or link to this item: http://hdl.handle.net/1942/20327

Title:	Statistical models for high-throughput proteomic and genomic data
Authors:	CLAESEN, Jurgen
Advisors:	BURZYKOWSKI, Tomasz; VALKENBORG, Dirk
Issue Date:	2013
Abstract:	The advent of high-throughput techniques such as microarrays, mass spectrometry, and next-generation sequencing has expedited the shift towards "-omics"-oriented studies in biology and the life sciences. These techniques generate large, often complex datasets that require appropriate statistical analysis. The main goal of this dissertation was the development of statistical models that allow the analysis of high-throughput proteomic and genomic data. In the first part of this thesis, the focus is on mass-spectrometry-based proteomics, and more specifically on the (aggregated) isotopic distribution and how this characteristic signal can be used to interpret mass spectral data. The different peaks of the isotopic distribution correspond to the isotopic variants of the molecule under study. These variants result from the isotopes of the molecule's chemical elements. For example, carbon (C) has two isotopes occurring in nature, 12C and 13C, which differ in mass and probability of occurrence. In Chapter 4, we presented BRAIN, a method to calculate the aggregated isotopic distribution based on the atomic composition of the molecule. The aggregated isotopic distribution combines the isotopic variants that have minuscule differences in mass. The mass and abundance of these aggregated isotopic variants can be calculated recursively based on the Newton-Girard theorem and Viète's formulae. Because of the recursive nature of the proposed algorithm, the method is efficient in terms of memory and computation time, and simple to implement. BRAIN is at least as accurate as existing algorithms such as Emass. The (aggregated) isotopic distribution could be a valuable tool in the interpretation of mass spectral data, for instance for identifying unknown biomolecules or monitoring post-translational modifications. However, the isotopic distribution as a whole, i.e., the masses and abundances of the isotopic variants, is rarely used. 
Most often, features of the isotopic distribution such as the mono-isotopic mass, the average mass, or the most abundant peak mass are used. A common practice for identifying unknown molecules is to search observed masses against databases of known molecules. As a result, several candidates can be found for each unknown molecule. The ultimate goal is to reduce the number of candidates to the absolute minimum, i.e., to the correct protein or peptide. To achieve this goal, a very high mass accuracy is required. We illustrated, in Chapter 5, that the mass accuracy is influenced not only by the mass precision and mass calibration of the mass spectrometer, but also by the elemental isotope definition. Contrary to popular belief, the elemental isotope definition is not invariable. These variations, although small, cannot be ignored in a database-driven identification process. If the proteins under study, or a subset of them, are known, recalibrating the observed masses prior to searching a protein/peptide database is advisable. Alternatively, the tolerance window used to search the database can be moderately increased. The use of the isotopic distribution is not limited to signal processing or identification. It can also be used to monitor changes in the atomic composition of biomolecules. In hydrogen-deuterium exchange mass spectrometry (HDX-MS), the exchange rates of the labile H-atoms of a biomolecule are often estimated from the number of incorporated deuterium atoms, which is in turn measured by the change in the average mass of the molecule under study. However, the accuracy and precision of the exchange rates estimated by these methods are poor, because summarizing the spectrum by its average mass discards part of the available information. In Chapter 6, we introduced an alternative method, which uses the aggregated isotopic distribution. 
The proposed Markov-chain-based model estimates the individual exchange rates from the changes in the abundance of each isotopic variant instead of from the number of incorporated deuterium atoms. As a result, the estimated exchange rates are more reliable than those of the existing methods. Simulation studies showed that correctly specifying the number of exchangeable hydrogen atoms and increasing the information content of the data, by adding informative spectra and/or removing non-informative spectra, further improve the quality of the estimated exchange rates. In the second part of this dissertation, we proposed two models to genetically dissect phenotypic traits based on the principle of linkage analysis. Linkage analysis relies on the extent of co-segregation between adjacent genes and/or molecular markers, such as single nucleotide polymorphisms (SNPs). The width of the identified regions, which contain potential causal genes, depends on the number of identified molecular markers and known genes. The development of massively parallel whole-genome sequencing made it possible to detect many molecular markers in a fast and automated manner. As a result, the mapping resolution increases, which in principle allows individual causal genes to be identified. Thus far, only a limited number of gene mapping methods specifically designed for whole-genome-sequencing data have been proposed. Almost all of these methods are two-stage procedures: for each marker a linkage probability or p-value is determined, and these probabilities are combined in a sliding-window approach. Subsequently, the resulting averaged probabilities are used to select regions that potentially contain gene loci. In contrast to the existing methods, the semi-parametric scatterplot smoother, introduced in Chapter 8, combines the fitting of smoothing splines with testing whether certain parts of the identified trend are linked to causal genes. 
As a result, multiple gene loci per chromosome can be detected simultaneously. However, the discovered regions are relatively wide. The proposed scatterplot smoother can also be used to model experiments with multiple segregant pools. This feature allows enrichment effects to be investigated and can reduce the size of the identified regions. In Chapter 9, an alternative method to map multiple causal genes is introduced. The proposed hidden Markov model (HMM) determines for each molecular marker individually to which hidden state it belongs, while accounting for the serial dependence between neighboring markers. Applying the HMM at the level of the individual markers leads to an increased resolution. The potentially linked regions identified by the HMM are subparts of the relatively wide genomic regions found by the scatterplot smoother. The base-calling step of the next-generation sequencing procedure is error-prone. For the scatterplot smoother, erroneously identified SNPs therefore have to be filtered out before the nonlinear relation between the mismatch frequency and the chromosomal position is estimated. The hidden Markov model does not need this additional filtering step, as it is able to incorporate the possibility of sequencing errors.
The development and rise of high-throughput methods such as microarrays, mass spectrometry, and next-generation sequencing have brought about a revolution in biological and biomedical research. Instead of small-scale experiments that focus specifically on one or a few genes or proteins, all genes (the genome) or all proteins (the proteome) present in a cell or tissue are studied simultaneously. This approach generates large amounts of data, for which adapted statistical analyses are necessary. In this thesis we propose a number of statistical models suited to the analysis of proteomics and genomics experiments carried out with high-throughput techniques. 
In the first part of this thesis, the focus is on the study of the proteome by means of mass spectrometry. In this part we examine how the (aggregated) isotopic distribution can be used for the interpretation and analysis of spectra. The isotopic distribution is a characteristic pattern of a biomolecule that can be measured with high-resolution mass spectrometers. The peaks in this pattern correspond to the different variants of the measured molecule. These variants arise from the isotopes of the elements of which biomolecules are composed. Carbon (C), for example, has two natural variants or isotopes, namely 12C and 13C, each with a different mass and abundance. In Chapter 4 we present BRAIN, an algorithm that calculates the aggregated isotopic distribution of a molecule from the chemical elements of which the molecule is composed. In the aggregated isotopic distribution, the isotopic variants of a biomolecule that differ minimally in mass are combined. For each aggregated isotopic variant, the mass and the abundance can be calculated. By applying the Newton-Girard theorem and Viète's formulae, BRAIN can calculate these masses and abundances recursively. Thanks to its recursive nature, BRAIN is fast, easy to implement, and requires little computer memory. Compared with existing methods, BRAIN is at least as accurate. The isotopic distribution can be a valuable tool for the interpretation and identification of unknown molecules. However, the isotopic distribution as a whole is rarely used for this purpose. Much more often, the average mass of a molecule, the mono-isotopic mass, or the mass of the aggregated isotopic variant with the highest abundance is used. 
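As a rough illustration of what an aggregated isotopic distribution is, the sketch below computes the abundances of the aggregated variants by repeated polynomial convolution. This is the straightforward textbook approach, not the recursive BRAIN algorithm, and it returns abundances only, not masses. The isotope abundances in `ABUNDANCES` and the function names are assumptions made for this example; as the abstract notes, elemental isotope definitions are not invariable.

```python
# Illustrative isotope abundances, indexed by the number of extra neutrons
# relative to the lightest isotope. These values are assumptions for the
# example and are not taken from the thesis.
ABUNDANCES = {
    "C": [0.9893, 0.0107],      # 12C, 13C
    "H": [0.999885, 0.000115],  # 1H, 2H
}


def convolve(p, q):
    """Multiply two probability polynomials (discrete convolution)."""
    out = [0.0] * (len(p) + len(q) - 1)
    for i, a in enumerate(p):
        for j, b in enumerate(q):
            out[i + j] += a * b
    return out


def aggregated_distribution(composition, abundances=ABUNDANCES, max_peaks=10):
    """Abundance of each aggregated variant, indexed by extra neutrons.

    Truncating to max_peaks after every step is exact for the peaks that
    are kept, because peak k only depends on peaks 0..k of each factor.
    """
    dist = [1.0]
    for element, count in composition.items():
        for _ in range(count):
            dist = convolve(dist, abundances[element])[:max_peaks]
    return dist


# Methane (CH4) as a small hypothetical example.
methane = aggregated_distribution({"C": 1, "H": 4})
for extra_neutrons, probability in enumerate(methane):
    print(extra_neutrons, probability)
```

The cost of this naive approach grows with the number of atoms, which is one reason a recursive method that is efficient in memory and computation time is attractive for large biomolecules.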
A frequently applied way to identify unknown biomolecules is to search with the observed masses of unknown proteins or peptides against a database containing the masses of known molecules. Often several candidate identifications are found. The ultimate goal is to reduce this number of candidates to the absolute minimum, namely one. To achieve this goal, the mass accuracy of the experiment must be very high. We show, in Chapter 5, that the mass accuracy is determined not only by the mass precision and mass calibration of the mass spectrometer, but also by the definitions of the masses and abundances of the isotopes. Contrary to what is generally assumed, these isotope definitions are subject to change. Although these changes are small, we cannot ignore them when searching a database. If all proteins, or a part of them, are known, the observed masses can be recalibrated before the identification starts. Increasing the tolerance with which the database is searched is an alternative way to deal with the variability of the isotope definitions. Besides the identification of biomolecules, the isotopic distribution can also be used to detect and monitor changes in the chemical composition of a protein. In hydrogen-deuterium exchange experiments, the exchange constants of the labile hydrogen atoms (H) are usually estimated using the number of incorporated deuterium atoms (D). The number of exchanged D-atoms is determined from the changes in the average mass. However, the accuracy and precision of these estimated exchange constants are poor, since the available information is reduced. In Chapter 6 we propose an alternative way to determine the individual exchange constants. 
Instead of the changes in the average mass, the proposed model uses the changes in the aggregated isotopic distribution. Compared with the existing method for determining exchange constants, the quality of the estimated exchange constants is higher. Simulation studies show that the reliability of the estimated exchange constants can be improved by specifying the number of exchangeable H-atoms and by increasing the information content of the data. The latter is possible by increasing the number of informative spectra or by removing uninformative spectra. In the second part of this thesis we propose two ways to genetically dissect the phenotypic traits of an organism based on the principle of linkage. When a particular gene is responsible for a phenotypic trait, we can determine the location of this gene using the inheritance patterns of nearby genes and molecular markers such as single nucleotide polymorphisms. The larger the number of nearby genes and molecular markers, the more accurate the localization of the gene on the chromosome will be. The further development of DNA sequencing has facilitated the detection of molecular markers. Thanks to these developments, very many molecular markers can now be detected quickly and relatively easily, so that in principle individual causal genes can be mapped. At present there are a limited number of methods that identify causal genes or chromosomal regions using molecular markers found with next-generation sequencing. The vast majority of these existing methods first determine a p-value or a linkage probability. Subsequently, these probabilities are combined and used to identify potential causal genes. 
In contrast to these existing methods, the semi-parametric scatterplot smoother proposed in Chapter 8 combines estimating the underlying nonlinear trend of the SNP frequencies with formally testing whether certain parts of the estimated trend contain causal genes. Thanks to this approach, it is also possible to identify several causal genes simultaneously. The proposed model also allows different experiments to be compared with each other, which in certain cases improves the mapping resolution. In Chapter 9 we propose an alternative method to map several causal genes. The proposed hidden Markov model determines for each molecular marker to which hidden state the marker belongs, while taking into account the hidden states of neighboring markers. The number of hidden states is determined in advance, as is their biological interpretation. Applying a hidden Markov model to individual molecular markers results in an increased resolution. The identified chromosomal regions are a part of the relatively large regions found by the scatterplot smoother. The detection of molecular markers with next-generation sequencing is error-prone. As a result, the incorrectly identified markers must be filtered out before the underlying nonlinear trend can be estimated. In contrast to the scatterplot smoother, this extra step is not needed for the hidden Markov model.
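The idea of assigning each marker to a hidden state while accounting for its neighbors can be sketched with a small two-state hidden Markov model decoded by the standard Viterbi algorithm. Everything specific here is an assumption made for illustration and does not come from the thesis: the state names, the expected allele frequencies in `ALLELE_FREQ` (0.5 for unlinked markers, 0.99 for linked markers so that a small error rate is tolerated), the `STAY` probability, and the binomial emission model on per-marker read counts. The abstract does not say which decoding method the thesis uses.

```python
import math

# Hypothetical model settings, chosen only for this example.
STATES = ("unlinked", "linked")
ALLELE_FREQ = {"unlinked": 0.5, "linked": 0.99}  # 0.99 leaves room for errors
STAY = 0.95  # probability that neighboring markers share the same state


def log_emission(state, k, n):
    """Binomial log-likelihood of k matching reads out of n.

    The binomial coefficient is dropped because it is identical for
    both states and therefore does not affect the decoding.
    """
    p = ALLELE_FREQ[state]
    return k * math.log(p) + (n - k) * math.log(1.0 - p)


def log_transition(prev, cur):
    return math.log(STAY if prev == cur else 1.0 - STAY)


def viterbi(markers):
    """Most likely state sequence for a list of (k, n) read counts."""
    if not markers:
        return []
    score = {s: math.log(0.5) + log_emission(s, *markers[0]) for s in STATES}
    backpointers = []
    for k, n in markers[1:]:
        new_score, pointers = {}, {}
        for cur in STATES:
            best_prev = max(
                STATES, key=lambda prev: score[prev] + log_transition(prev, cur)
            )
            new_score[cur] = (
                score[best_prev]
                + log_transition(best_prev, cur)
                + log_emission(cur, k, n)
            )
            pointers[cur] = best_prev
        score = new_score
        backpointers.append(pointers)
    # Trace the best path backwards from the best final state.
    state = max(STATES, key=lambda s: score[s])
    path = [state]
    for pointers in reversed(backpointers):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path


# Hypothetical read counts (matching reads, coverage) along a chromosome.
example_markers = [(5, 10), (6, 10), (10, 10), (10, 10), (10, 10), (4, 10)]
print(viterbi(example_markers))
```

Because the linked state allows a small fraction of non-matching reads, an occasional sequencing error does not force a state change, which mirrors the point that such a model can absorb errors without a separate filtering step.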
Document URI:	http://hdl.handle.net/1942/20327
Category:	T1
Type:	Theses and Dissertations
Appears in Collections:	PhD theses; Research publications

Files in This Item:

File	Description	Size	Format
4483,1 D-2013-2451-33 Jürgen Claesen.pdf		18.74 MB	Adobe PDF	View/Open
