Statistical methods for the analysis of high-throughput proteomic and genomic data

ZAMANZAD GHAVIDEL, Fatemeh

Please use this identifier to cite or link to this item: http://hdl.handle.net/1942/20353

Title:	Statistical methods for the analysis of high-throughput proteomic and genomic data
Authors:	ZAMANZAD GHAVIDEL, Fatemeh
Advisors:	BURZYKOWSKI, Tomasz
Issue Date:	2014
Abstract:	In this dissertation, we proposed statistical methods for datasets from proteomics and genomics workflows. Over the past decade, mass spectrometry (MS)-based proteomics has emerged as a high-throughput method for the identification and quantification of proteins in complex samples. High-resolution MS data contain a large amount of noisy, redundant, and irrelevant information. Only part of the data carries the biologically meaningful signal, i.e., peptides and small proteins, which makes it difficult to distinguish peptide/protein peaks accurately from peaks generated by noise. To overcome this obstacle, prior information related to the physical properties of the peptide/protein, i.e., the isotopic distribution, is needed. In addition, a similarity measure is required to distinguish between peptide and noise peak clusters. In Chapter 4, we considered the use of Pearson’s χ² statistic and the Mahalanobis distance for this purpose. We evaluated the performance of the two similarity measures by using a designed MALDI-TOF experiment. The results, which could extend to any high-resolution mass spectrum, indicated that Pearson’s χ² statistic offered better discriminative power for detecting putative peptide clusters than the Mahalanobis distance. Protein identification is a key step in proteomics, and shotgun proteomics is recognized as one of the main techniques for protein identification and quantification. In a standard computational pipeline, MS/MS spectra from a mass spectrometer are interpreted by database search engines or de novo sequencing approaches. In database search algorithms, fragment ions derived from the unidentified protein are compared with theoretical data, and a score is assigned according to how well the two sets of data match. The top score is expected to identify the unknown protein. The limiting factor in all database search tools is the tradeoff between false positives and false negatives. 
It is essential to keep false positives to a minimum during protein identification. In general, peptide identification based on tandem MS and database search algorithms does not take into account information about the isotope distributions of the precursor ions. To determine the effectiveness of these search algorithms in terms of their ability to distinguish between correct and incorrect peptide assignments, in Chapter 5 we proposed an additional metric that uses Pearson’s χ² statistic to quantify the similarity between the theoretical isotopic distributions of the precursor ions selected for tandem MS and the experimental mass spectra. The observed association between Pearson’s χ² statistic and the score function indicated that good scores can be obtained for molecules that exhibit atypical isotope profiles, while low scores can be obtained for fragment spectra that have a clear peptide-like isotope pattern. These results demonstrated that Pearson’s χ² statistic can be used in conjunction with the score of database search algorithms to increase the sensitivity and specificity of peptide identification. Many search engines are available for the analysis of proteomics data produced by MS/MS. These search algorithms vary in accuracy, sensitivity, and specificity because their scoring mechanisms rest on different principles. It is therefore of interest to measure the degree of agreement between different search engines in terms of peptide identification, for instance, how likely it is that a peptide identification obtained from SEQUEST is also observed in MASCOT. In Chapter 6, we proposed Cohen’s kappa coefficient (chance-corrected agreement) to determine the level of agreement between MASCOT and SEQUEST. The results suggested that there is, in general, good agreement between the peptide assignments of the two search engines. 
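The two measures named above can be illustrated with a minimal Python sketch. It is not the dissertation's implementation: the function names `pearson_chi2` and `cohen_kappa`, the rescaling of both isotope profiles to unit sum, and the use of one categorical label per spectrum are assumptions made only for this illustration.

```python
from collections import Counter


def pearson_chi2(observed, expected):
    """Pearson's chi-square statistic between two peak-intensity profiles.

    Assumption: both profiles are rescaled to sum to 1, so only the shapes
    of the observed and theoretical isotope patterns are compared.
    """
    if len(observed) != len(expected):
        raise ValueError("profiles must contain the same number of peaks")
    if any(e <= 0 for e in expected):
        raise ValueError("expected intensities must be positive")
    obs_total = sum(observed)
    exp_total = sum(expected)
    obs = [o / obs_total for o in observed]
    exp = [e / exp_total for e in expected]
    # Sum over peaks of (observed - expected)^2 / expected.
    return sum((o - e) ** 2 / e for o, e in zip(obs, exp))


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa: agreement between two raters, corrected for chance.

    Assumption: labels_a[i] and labels_b[i] are the labels that two search
    engines assign to the same spectrum i.
    """
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Observed proportion of spectra on which both raters agree.
    p_observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Agreement expected by chance, from each rater's marginal frequencies.
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    p_chance = sum(counts_a[c] * counts_b[c] for c in counts_a) / n ** 2
    if p_chance == 1:
        raise ValueError("kappa is undefined when chance agreement equals 1")
    return (p_observed - p_chance) / (1 - p_chance)


if __name__ == "__main__":
    # Toy numbers invented for illustration only.
    print(pearson_chi2([50.0, 30.0, 15.0, 5.0], [0.55, 0.30, 0.11, 0.04]))
    print(cohen_kappa(["hit", "hit", "miss", "hit"],
                      ["hit", "miss", "miss", "hit"]))
```

A small χ² value means the observed isotope pattern is close to the theoretical one. A kappa near 1 indicates agreement well beyond what chance alone would produce, and a kappa near 0 indicates agreement no better than chance.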
The advent of high-throughput sequencing methods, such as next-generation sequencing (NGS), has greatly accelerated biological and medical research and discovery. NGS provides an effective way to identify, on a large scale, polymorphic DNA loci that serve as molecular markers for locating the gene loci responsible for a trait of interest. In Chapters 9, 10, and 11, we introduced different variants and generalizations of the basic hidden Markov model (HMM) proposed in [109], which was used to map various QTLs responsible for high ethanol-tolerance in S. cerevisiae with NGS. One possible extension of the Markovian model in the basic HMM concerns the direction of modelling. Both the preceding state of the (i − 1)-th SNP and the following state of the (i + 1)-th SNP carry useful information about the current i-th SNP. Uni-directional HMMs ignore this influence, hence the motivation for applying the DHMM in Chapter 10. The comparison of the uni-directional HMM and the DHMM for chromosome XIV revealed only a slight difference in the parameter estimates, with a minimal gain in the precision of estimation for the DHMM. As a result, the DHMM and the uni-directional HMMs assigned the SNPs to the same states. The main advantage of the DHMM is that it produces a single set of estimates of the parameters of interest, i.e., the emission (concordance) probabilities. In Chapter 10, we proposed the non-homogeneous HMM (NH-HMM). The advantage of the NH-HMM is that it allows the transition probabilities of the basic HMM to vary with distance by exploiting covariate information. Our model assumed that the distance between neighboring SNPs can influence the state assignment of each SNP. The NH-HMM was able to detect gene loci responsible for high ethanol-tolerance in S. cerevisiae. In Chapter 11, we considered a joint HMM of two pools of segregants at the same time. 
The motivation was that significant differences in the state-dependent probabilities between the two pools might point to potential regions of gene loci. The joint HMM was able to detect potential genomic regions for high ethanol-tolerance in chromosome XIV. However, the same approach did not work properly in chromosome IX.
In this dissertation we propose statistical methods with which data on the proteome and the genome can be analyzed. In the last decade, mass spectrometry-based proteomics has often been used as a high-throughput method for the identification and quantification of proteins in complex biological samples. Data from such experiments contain redundant and irrelevant information and are often subject to noise. This makes it difficult to distinguish the biologically relevant signals, i.e., the peptides and proteins, from noise signals. One solution to this problem is to compare the measured signals with theoretically calculated biological signals. In Chapter 4 we evaluated two similarity measures, Pearson’s χ² statistic and the Mahalanobis distance, with which relevant biological signals, i.e., the isotopic distribution, can be detected. In a MALDI-TOF experiment, the discriminative power of Pearson’s χ² statistic was higher than that of the Mahalanobis distance. The identification of proteins plays an important role in proteomics. One of the most widely used techniques for protein identification and quantification is shotgun proteomics. Tandem MS spectra are compared with databases with the help of specialized search engines. These search engines compare unidentified protein fragments with theoretical data and assign a score that expresses how similar the protein fragment is to the theoretical data. The better this score, the more likely the identification. Determining when a score is good is not straightforward and is a challenge for many search engines. These search engines rarely, if ever, use the isotopic distribution of the precursor proteins. In Chapter 5 we proposed Pearson’s χ² statistic as a measure of the similarity between the observed and theoretical isotopic distributions of the precursor proteins. With Pearson’s χ² statistic we could show that a good score for a given identification does not necessarily correspond to a high similarity between the observed and calculated isotopic distributions of the precursor proteins, and vice versa. Combining Pearson’s χ² statistic with the search engine scores led to increased sensitivity and specificity of protein identification. Many search engines exist for the analysis of tandem MS data. The results of these search engines differ in accuracy, sensitivity, and specificity. The underlying reason is the way in which the scores are computed. Despite these differences, we are interested in the degree of agreement between the results of the search engines. For example, we ask why an identification with SEQUEST is or is not the same as an identification with MASCOT. In Chapter 6 we proposed Cohen’s kappa coefficient to determine the degree of agreement between the MASCOT and SEQUEST identification results. Using Cohen’s kappa coefficient, we found good agreement between the results of MASCOT and SEQUEST. The rise of high-throughput sequencing methods, such as NGS, has brought about a turning point in biological and biomedical research. Thanks to this technique, polymorphic nucleotides in DNA can be detected efficiently and on a large scale. These nucleotides can be used, among other things, as molecular markers to determine the function of particular genes. In Chapters 9, 10, and 11 we introduced a number of modifications to a hidden Markov model [109] that was used to identify various QTLs responsible for abnormal ethanol tolerance in S. cerevisiae. One of the possible modifications relates to the underlying dependence between the molecular markers. In Chapter 10 we proposed a non-homogeneous HMM. In a non-homogeneous HMM, the transition probabilities are a function of one or more covariates. In this way we can take the distance between two neighboring markers into account. This non-homogeneous model was likewise able to identify several known genes responsible for abnormal ethanol tolerance. In Chapter 11 we extended the basic HMM so that it can handle markers from two different groups. Thanks to this modification, significant differences between two marker groups can be found. This joint HMM was able to identify potential chromosomal regions in chromosome XIV that are related to ethanol tolerance. In chromosome IX this approach did not work.
Document URI:	http://hdl.handle.net/1942/20353
Category:	T1
Type:	Theses and Dissertations
Appears in Collections:	PhD theses; Research publications

Files in This Item:

File	Description	Size	Format
7554 D-2014-2451-58 Fatemeh Zamanzad Ghavidel.pdf		6.68 MB	Adobe PDF	View/Open
