Please use this identifier to cite or link to this item: http://hdl.handle.net/1942/26729
Title: Statistical Methods for Transcriptomic and Metabolomic Data Analysis
Authors: PADAYACHEE, Trishanta 
Advisors: BURZYKOWSKI, Tomasz; SHKEDY, Ziv
Issue Date: 2018
Abstract: Omics technologies have advanced rapidly, giving rise to an extensive amount of widely available omics data. The analysis of omics data can lead to the identification of molecular profiles that are associated with disease status, susceptibility, or progression, or it may provide insight into biological pathways or processes that differ between diseased and control patients. Biological processes are, however, extremely intricate, and obtaining biologically meaningful information from this mass of data is a non-trivial task. To capture the complexity of biological processes, research is now centering on the integrative analysis of omics data. However, methodological development in this area is lacking. As a result, complex data are analysed in rather simple ways that fail to capture the complexity of the biological problem. The research presented in Part I of this dissertation aims to improve on currently implemented methods for the integrative analysis of omics datasets. A way to enhance our understanding of the development and progression of complex diseases is to investigate the influence of cellular environments on gene co-expression (i.e., gene-pair correlations). Investigating whether metabolites regulate the co-expression of a predefined gene module (a set of co-expressed (correlated) genes belonging to the same biological pathway) is one of the relevant questions posed in the integrative analysis of metabolomic and transcriptomic data (Inouye et al., 2010a). In Part I of this dissertation, three statistical models are described for investigating the association between gene-module co-expression and metabolite concentrations. The suitability and versatility of the proposed models are investigated through simulation studies and an application to real-life data. Specifically, a subset of the DILGOM (DIetary, Lifestyle, and Genetic determinants of Obesity and Metabolic syndrome) study data (Inouye et al., 2010a) is analysed.
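As a rough illustration of what metabolite-dependent co-expression means, the sketch below computes the Pearson correlation of one gene pair separately for samples with low and high metabolite concentration. It is not taken from the dissertation: the function names, the single cut-off, and the toy data are assumptions made purely for illustration.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equally long lists of numbers."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def coexpression_by_group(gene_a, gene_b, metabolite, cutoff):
    """Correlation of a gene pair below/at and above a hypothetical cut-off.

    Returns (correlation in low-metabolite samples,
             correlation in high-metabolite samples).
    """
    low = [(a, b) for a, b, m in zip(gene_a, gene_b, metabolite) if m <= cutoff]
    high = [(a, b) for a, b, m in zip(gene_a, gene_b, metabolite) if m > cutoff]
    r_low = pearson([a for a, _ in low], [b for _, b in low])
    r_high = pearson([a for a, _ in high], [b for _, b in high])
    return r_low, r_high


# Toy data (invented): the pair is positively correlated at low metabolite
# levels and negatively correlated at high levels.
gene_a = [1, 2, 3, 4, 1, 2, 3, 4]
gene_b = [1, 2, 3, 4, 4, 3, 2, 1]
metabolite = [0, 0, 0, 0, 1, 1, 1, 1]
r_low, r_high = coexpression_by_group(gene_a, gene_b, metabolite, cutoff=0.5)
```

A difference between the two correlations is what a conditional co-expression analysis looks for; a formal analysis would additionally need a statistical model to judge whether the difference is larger than expected by chance.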
Part I of the dissertation begins with a description of a simple linear regression (SLR) approach that has been previously implemented for the investigation of conditional co-expression (Inouye et al., 2010a). Attention is drawn to several limitations of the approach. As an alternative, a multivariate linear model for studying the dependence between categorised metabolite concentrations and gene-module co-expression is proposed in Chapter 4. The approach addresses the limitations of the linear-regression-based analysis. Through a simulation study, it is shown that the SLR approach suffers from a highly inflated type I error probability and that the proposed multivariate model is less prone to the detection of spurious conditional correlations. Often, changes in gene co-expression are investigated across two or more biological conditions defined by categorising a continuous covariate. However, the selection of arbitrary cut-off points may have an influence on the results of an analysis. To address this issue, in Chapter 5, a multivariate linear model for investigating the association between gene-module co-expression and a continuous covariate is proposed. Fitting a multivariate model that fully captures the dependence structure of several variables can become increasingly challenging as the number of parameters and the size of the variance-covariance matrix increase. Chapter 7 provides a computationally more feasible solution for investigating the conditional co-expression of a gene module. In particular, a copula-based pseudo-likelihood approach is proposed. The multivariate density described in Chapter 5 is replaced by a pseudo-likelihood function formed by the product of all pairwise densities over the set of all possible gene pairs within the gene module. Furthermore, bivariate densities are modelled using Gaussian, Gumbel-Hougaard, and Clayton copulas that specify the gene-pair correlations as a function of the metabolite concentrations.
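The pairwise pseudo-likelihood idea can be sketched as follows for the Gaussian copula only. This is a minimal illustration under stated assumptions, not the dissertation's implementation: margins are handled by rank-based normal scores (ties are broken by sample order), a single hypothetical intercept `alpha` and slope `beta` are shared by all gene pairs, and the correlation is linked to the metabolite concentration through a hyperbolic tangent so that it stays between -1 and 1.

```python
import math
from itertools import combinations
from statistics import NormalDist


def normal_scores(values):
    """Map one gene's expression values to normal scores via rank / (n + 1)."""
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    scores = [0.0] * n
    for rank, i in enumerate(order, start=1):
        scores[i] = NormalDist().inv_cdf(rank / (n + 1))
    return scores


def gaussian_copula_logdensity(z1, z2, rho):
    """Log-density of the bivariate Gaussian copula at normal scores (z1, z2)."""
    one_minus = 1.0 - rho * rho
    quad = rho * rho * (z1 * z1 + z2 * z2) - 2.0 * rho * z1 * z2
    return -0.5 * math.log(one_minus) - quad / (2.0 * one_minus)


def pairwise_pseudo_loglik(expr, metabolite, alpha, beta):
    """Sum of pairwise copula log-densities over all gene pairs and samples.

    expr: dict mapping gene name -> list of expression values (one per sample)
    metabolite: list of metabolite concentrations (one per sample)
    alpha, beta: hypothetical parameters; rho_i = tanh(alpha + beta * x_i)
    """
    z = {gene: normal_scores(values) for gene, values in expr.items()}
    total = 0.0
    for g1, g2 in combinations(sorted(z), 2):
        for i, x in enumerate(metabolite):
            rho = math.tanh(alpha + beta * x)
            total += gaussian_copula_logdensity(z[g1][i], z[g2][i], rho)
    return total
```

In practice the function would be maximised over `alpha` and `beta` with a numerical optimiser, and a non-zero `beta` would indicate that the gene-pair correlation changes with the metabolite concentration. Because only bivariate densities are evaluated, the cost grows with the number of gene pairs rather than with the size of a full variance-covariance matrix.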
In addition to reducing the computational burden, the copula-based pseudo-likelihood approach facilitates the estimation of other, non-parametric measures of association such as Kendall’s tau and Spearman’s rho.

High-throughput techniques enable the measurement of the chemical composition of cells, tissues, or biofluids. The reproducibility, precision, and inherent noise of the measurements vary between techniques. In some instances, the biological signal may constitute only a small portion of the collected measurements. Efficient extraction of the biological signal is required before the data can be analysed. A variety of approaches exist to extract biological signal. The adopted approach can have an impact on downstream analyses. In Part II of this dissertation, the impact of the method for extracting metabolic signal from proton nuclear magnetic resonance (1H-NMR) data on the classification of lung cancer samples is studied. Extracting metabolic information from NMR spectra is complex because an immense amount of detail on the chemical composition of a biological sample is expressed through a single spectrum. The simplest approach to quantifying the signal is spectral binning, which involves subdividing the spectra into regions along the chemical shift axis and integrating the peaks within each region (Louis et al., 2015). However, due to overlapping resonance signals, the integration values do not always correspond to the concentrations of specific metabolites. An alternative, more advanced statistical approach is spectral deconvolution. BATMAN (Bayesian AuTomated Metabolite Analyser for NMR data) (Astle et al., 2012; Hao et al., 2014) performs spectral deconvolution using prior information on the spectral signatures of metabolites. In this way, BATMAN estimates relative metabolic concentrations.
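Spectral binning as described above can be sketched in a few lines. The example is illustrative only: the function name, the region boundaries, and the toy spectrum are assumptions, chemical shifts are assumed to be sorted in increasing order, and the integral is a simple trapezoidal sum over the points that fall inside each region.

```python
def bin_spectrum(ppm, intensity, edges):
    """Integrate a spectrum over consecutive chemical-shift regions.

    ppm: chemical shifts, assumed sorted in increasing order
    intensity: spectral intensities at those shifts
    edges: increasing region boundaries in ppm; region k covers
           edges[k] <= shift < edges[k + 1]
    Returns one trapezoidal integral per region. Segments that cross a
    region boundary are ignored in this simplified sketch.
    """
    integrals = [0.0] * (len(edges) - 1)
    for k in range(len(edges) - 1):
        lo, hi = edges[k], edges[k + 1]
        idx = [i for i, p in enumerate(ppm) if lo <= p < hi]
        for a, b in zip(idx, idx[1:]):
            integrals[k] += 0.5 * (intensity[a] + intensity[b]) * (ppm[b] - ppm[a])
    return integrals


# Toy spectrum (invented values) split into two regions.
ppm = [0, 1, 2, 3, 4, 5]
intensity = [1, 1, 1, 2, 2, 2]
bins = bin_spectrum(ppm, intensity, edges=[0, 3, 6])
```

Each returned value summarises all signal in its region, which is why overlapping resonances from several metabolites end up in the same bin and the bin value cannot always be attributed to a single metabolite.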

Both spectral binning and spectral deconvolution using BATMAN were applied to 400 MHz and 900 MHz NMR spectra of blood plasma samples from lung cancer patients and control subjects (Chapter 11). The relative concentrations estimated by BATMAN were compared with the binning integration values in terms of their ability to discriminate between lung cancer patients and controls (Chapter 12). For the 400 MHz data, the spectral binning approach provided greater discriminatory power. However, for the 900 MHz data, the relative metabolic concentrations obtained by using BATMAN provided greater predictive power. While spectral binning is computationally advantageous and less laborious, BATMAN-estimated features correspond directly to specific metabolites and therefore have a simpler interpretation.
Category: T1
Type: Theses and Dissertations
Appears in Collections: PhD theses; Research publications

Files in This Item:
File: Dissertation_TPadayachee.pdf; Size: 8.17 MB; Format: Adobe PDF
Items in DSpace are protected by copyright, with all rights reserved, unless otherwise indicated.