Stochastic Context-Free Grammars for Modeling Three Spliceosomal Small Nuclear Ribonucleic Acids

by R.C. Underwood
revised 18.aug.2003
http://reality.sgi.com/rcu/thesis/thesis.html

[My thesis was published as UCSC Technical Report number UCSC-CRL-94-23, the PostScript for which is still available (as of August 2003) via anonymous ftp from UCSC's Computer Science Department.]

Abstract

In this thesis, stochastic context-free grammars (SCFGs) are applied to the problems of folding, aligning and modeling families of homologous RNA sequences--specifically, the small nuclear RNA sequences of the spliceosome. SCFGs generalize the hidden Markov models (HMMs) used in related work on protein and DNA to capture the sequences' common primary and secondary structure. This thesis discusses a recently introduced algorithm, Tree-Grammar EM, for deducing SCFG parameters automatically from unaligned, unfolded training sequences (see references below) and demonstrates its application to modeling three of the five prominent trans-acting factors in spliceosomal small nuclear RNA: U1, U5, and U6. Tree-Grammar EM, a generalization of the HMM forward-backward algorithm, is based on tree grammars and is faster than the previously proposed inside-outside SCFG training algorithm. Results show that after having been trained on as few as about 20 snRNA sequences, each of the three models can discern snRNA sequences from similar-length RNA sequences of other kinds, can find secondary structure of new snRNA sequences and can produce multiple alignments of snRNA sequences.

Four Salient Figures

The first and second figures below depict histograms over a large set of RNA sequences, a few of which are snRNA. For each sequence, the grammar is used to computed a negative log-likelihood (NLL) that indicates how probable it is that the grammar would have produced that sequence. The ``Z score'' indicated for each sequence on the X axis is the normalized NLL of that sequence--that is, the difference between the sequence's NLL and the average NLL of a typical sequence of the same length (see thesis for details).

The U6 grammar, trained on 22 U6 sequences using the Tree-Grammar EM algorithm, was able to correctly discriminate non-U6 sequences (striped) from U6 snRNA sequences (solid black) in the following figure. It misclassified only two of 45 U6 sequences, and misclassified none of the 1460 non-U6 sequences. (The non-U6 sequences included some U1 and U5 sequences, and pieces of other RNA sequences such as tRNA and rRNA.)

In comparison, the same U6 grammar, before training, was unable to discriminate any U6 snRNA sequences from non-snRNA sequences:

In the third and fourth figures, letters represent the four nucleotides: G for guanine, A for adenine, C for cytosine, and U for uracil. Watson-Crick base pairs (G-C and A-U) are indicated by a dash between two nucleotides. Other pairs are indicated by a dot between two nucleotides. Each ladder structure represents an RNA helix.

The U5 grammar designed in my experiments using the Tree-Grammar EM algorithm predicted the following folding for a U5 snRNA sequence. It matches the accepted hand-derived folding for the sequence exactly.

The U1 grammar designed in my experiments using the Tree-Grammar EM algorithm predicted the following folding for a U1 snRNA sequence whose ``actual'' (trusted, derived by hand) folding is unknown. It resembles closely the hand-derived trusted foldings for other U1 sequences.

References

Y. Sakakibara, M. Brown, R. Underwood, I.S. Mian, and D. Haussler. Stochastic context-free grammars for modeling RNA. Technical Report UCSC-CRL-93-16, UC Santa Cruz, Computer and Information Sciences Dept., Santa Cruz, CA 95064, 1993.

Y. Sakakibara, M. Brown, I.S. Mian, R. Underwood, and D. Haussler. Stochastic context-free grammars for modeling RNA. In Procedings of the Hawaii International Conference on System Sciences, Los Alamitos, CA, 1994. IEEE Computer Society Press.

Yasubumi Sakakibara, Michael Brown, Richard Hughey, I. Saira Mian, Kimmen Sjolander, R. C. Underwood, and David Haussler. Recent methods for RNA modeling using stochastic context-free grammars. In Proceedings of the Asilomar Conference on Combinatorial Pattern Matching, New York, NY, 1994. Springer-Verlag.