The EM-algorithm for modeling Serial Analysis of Gene Expression (SAGE) data

AMPE, Michele

Please use this identifier to cite or link to this item: http://hdl.handle.net/1942/3650

Title:	The EM-algorithm for modeling Serial Analysis of Gene Expression (SAGE) data
Authors:	AMPE, Michele
Advisors:	BURZYKOWSKI, T. VALKENBORG, D.
Issue Date:	2007
Abstract:	Serial Analysis of Gene Expression (SAGE), a technique developed at Johns Hopkins University in the USA, allows the analysis of overall gene expression patterns. SAGE is an open platform because, unlike microarrays, it does not require a preexisting clone. SAGE can therefore be used to identify and quantify known genes as well as new genes. From a statistical point of view, a SAGE experiment consists of the following 7 steps:
1. Extract a sample of mRNA fragments from a biological sample.
2. Convert the mRNA fragments into cDNA clones.
3. Generate tags by cutting 10 or 17 base long segments from a certain site of the cDNA. These tags are what we call the true tags.
4. Apply the PCR (Polymerase Chain Reaction) procedure to boost the counts of the tags.
5. Link the tags to form long sequences.
6. Take a sample of those sequences.
7. Read off tag counts by sequencing these chosen sequences.
The resulting tags are called sequenced tags and the resulting counts are the observed counts. Note that no true tags are lost before, during or after sequencing, hence the number of sequenced tags is equal to the number of true tags. In the following sections we will assume that the true tags uniquely identify mRNA fragments that are present in the biological sample. The result of a SAGE experiment, called a SAGE library, contains the observed counts. Hence a SAGE experiment can only measure the expression levels of the tags. We can get the gene expression levels from a SAGE library by mapping the tags onto the genes.
The aspects of SAGE experiments that bias the outcomes were studied by Stollberg et al. (2000) by simulating libraries. The following four sources of error are considered: (1) sampling errors in tag selection; (2) sequencing errors; (3) non-uniqueness of tag sequences; and (4) non-randomness of DNA sequences.
The authors also provide a maximum likelihood approach to estimate the number of unique transcripts and their frequency distribution. In what follows, we will focus on sequencing errors. Sequencing errors have a large impact on the outcome of a SAGE experiment: non-existing tags may be introduced at low abundance and the real abundance of the other tags may decrease.
Colinge and Feger (2001) introduced an approach to identify tags whose abundance is biased by sequencing errors. Their approach is based on a concept of neighbourhood, i.e. abundant tags can contaminate tags whose sequence is very close. They assume constant error probabilities and use matrix inversion to correct for sequencing errors. There are also more biological approaches to the problem of sequencing errors, as in Blades et al. (2004a,b). In Blades et al. (2004a), the fact that frequency distributions of tags display a regularity across cell types and species is used to:
• automatically discount low counts that are not reliable for the comparison of expression levels across conditions for a specific gene;
• transform the tag counts to a scale that provides a more reliable correlation and clustering of genome-wide expression profiles.
They state that the transformation enhances the ability to distinguish between signal and noise in SAGE data. Blades et al. (2004b) observed a linear relationship between the copy number of a given tag and the number of observed tags that differ from the given tag by a single base. By transforming the slope of this relationship, an estimate of the sequencing error rate can be found. Akmaev and Wang (2004) estimated error rates based on a mathematical model that includes the PCR and sequencing error contributions. According to their estimates, about 3.5% of Long SAGE tags (10-17 base pair tags) will inherit errors from the PCR amplification and 17.3% of the Long SAGE tags will have sequencing errors.
Beissbarth et al. (2004) introduced a statistical model for the propagation of sequencing errors and proposed an Expectation-Maximization (EM) algorithm to correct for the sequencing errors given a library of observed sequences and base-calling error estimates. The suggested correction method adjusts the tag counts to be closer to the true counts, so that the bias introduced by the sequencing errors can be partly corrected. In the article, they make use of the sequence neighbourhood of SAGE tags: they assume that sequencing errors can only come from first-order neighbour tags. First-order neighbour tags are tags that differ from each other by only 1 nucleotide, e.g. AAAA and AAAC are first-order neighbour tags. The authors simulate the true tag counts by sampling from a Poisson distribution with mean pλ, with p the proportion of a tag in the library and λ a parameter for setting the size of the library. An observed tag sequence is generated from a true tag sequence using the simulated quality values of the true tag sequence (given by a base-calling program and expressed as a function of the probability of a base-calling error) as the multinomial probabilities, i.e. each base is replaced by one of the three other bases with the probability specified by the sequencing quality value of that base. The counts of the observed tags are then summed to give the observed tag counts. The implementation of the algorithm is done in R.
We also propose a statistical model for the propagation of sequencing errors in the case that we have multiple SAGE libraries, and we correct for the sequencing errors through an EM algorithm using a strategy similar to that of Beissbarth et al. (2004). We use MATLAB for the implementation. There are, however, some differences between our method and the one developed by Beissbarth et al. (2004).
We assume that the true tag counts follow a multinomial distribution with parameters π and N, where π is the vector of probabilities that represent the relative expression levels of the DNA fragments and N is the number of true tags. The error estimates that we propose are partly based on the estimate given in Akmaev and Wang (2004). Another difference is that we allow a tag to be misread as any of the possible tags, instead of restricting misreads to the first-order neighbours. Finally, Beissbarth et al. (2004) work with Long SAGE sequences, whereas we work with sequences of four base pairs because we do not impose the first-order neighbour restriction.
In section 2, we explain the notation and the settings that we will use throughout this thesis. In section 3, we give a detailed mathematical description of the EM algorithm with the expressions for the estimates of the expression probabilities π and the corresponding variance-covariance matrix. In section 4, we simulate SAGE libraries to study the following:
• the potential gain in terms of bias when we use estimates obtained by the EM algorithm instead of the observed expression probabilities;
• the potential gain in terms of bias when we use multiple libraries instead of a single library;
• the effect of the probabilities of sequencing errors;
• the comparison of the bias using our method and using the method of Beissbarth et al. (2004).
The results of the simulations are given in section 5.
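The setting described in the abstract (multinomial true counts over four-base tags, misreads allowed between any pair of tags, several libraries) can be illustrated with a minimal Python sketch. It is not the thesis's MATLAB implementation and makes assumptions that the abstract does not state: a constant per-base error probability, substitutions that are uniform over the three other bases and independent across positions, a known misread matrix, and libraries that share the same π so their counts can be pooled. All function names are hypothetical.

```python
import itertools
import numpy as np

BASES = "ACGT"

def all_tags(length=4):
    """Enumerate every tag of the given length (4**length tags)."""
    return ["".join(t) for t in itertools.product(BASES, repeat=length)]

def misread_matrix(tags, per_base_error):
    """E[i, j] = P(observe tag j | true tag i).

    Assumption (not from the thesis): each base is misread independently
    with probability per_base_error, uniformly into one of the 3 other bases.
    """
    codes = np.array([[BASES.index(b) for b in t] for t in tags])
    mismatches = (codes[:, None, :] != codes[None, :, :]).sum(axis=2)
    length = codes.shape[1]
    return ((per_base_error / 3.0) ** mismatches
            * (1.0 - per_base_error) ** (length - mismatches))

def em_expression(observed_counts, E, n_iter=500, tol=1e-10):
    """EM estimate of the expression probabilities pi.

    observed_counts: 1-D array (one library) or 2-D array (one row per
    library); rows are pooled under the assumption of a common pi and E.
    """
    y = np.asarray(observed_counts, dtype=float)
    if y.ndim == 2:
        y = y.sum(axis=0)
    total = y.sum()
    pi = np.full(E.shape[0], 1.0 / E.shape[0])
    for _ in range(n_iter):
        mix = pi @ E                      # P(observe tag j) under current pi
        ratio = np.divide(y, mix, out=np.zeros_like(y), where=mix > 0)
        expected_true = pi * (E @ ratio)  # E-step: expected true counts
        new_pi = expected_true / total    # M-step: multinomial MLE
        converged = np.abs(new_pi - pi).max() < tol
        pi = new_pi
        if converged:
            break
    return pi

def simulate_library(pi, E, n_tags, rng):
    """Draw true counts from Multinomial(n_tags, pi), then misread each tag."""
    true_counts = rng.multinomial(n_tags, pi)
    observed = np.zeros(len(pi), dtype=np.int64)
    for i, n in enumerate(true_counts):
        if n > 0:
            observed += rng.multinomial(n, E[i])
    return observed

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    tags = all_tags(4)
    E = misread_matrix(tags, per_base_error=0.05)
    pi_true = rng.dirichlet(np.full(len(tags), 0.3))
    libraries = np.array([simulate_library(pi_true, E, 20000, rng)
                          for _ in range(3)])
    naive = libraries.sum(axis=0) / libraries.sum()
    pi_hat = em_expression(libraries, E)
    print("mean abs error, observed proportions:", np.abs(naive - pi_true).mean())
    print("mean abs error, EM estimate:         ", np.abs(pi_hat - pi_true).mean())
```

The E-step distributes each observed count over the true tags that could have produced it, and the M-step renormalises the expected true counts. Pooling the libraries is only one way to combine them and relies on the shared-π assumption stated above.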
Notes:	Master in Biostatistics
Document URI:	http://hdl.handle.net/1942/3650
Category:	T2
Type:	Theses and Dissertations
Appears in Collections:	Applied Statistics: Master theses

Files in This Item:

File	Description	Size	Format
ampe.pdf		1.52 MB	Adobe PDF
