Stochastic Context-Free Grammars for Modeling
Three Spliceosomal Small Nuclear Ribonucleic Acids
- by R.C. Underwood
- revised 18.aug.2003
- http://reality.sgi.com/rcu/thesis/thesis.html
[My thesis was published as UCSC Technical Report number
UCSC-CRL-94-23, the PostScript for which is still available (as of
August 2003) via anonymous
ftp from UCSC's Computer Science
Department.]
Abstract
In this thesis,
stochastic context-free grammars (SCFGs) are applied to
the problems of folding, aligning and modeling families
of homologous RNA sequences--specifically, the small nuclear
RNA sequences of the spliceosome. SCFGs
generalize the hidden Markov models (HMMs) used
in related work on protein and DNA to
capture the sequences' common primary and secondary
structure.
This thesis discusses a recently introduced
algorithm, Tree-Grammar EM, for
deducing SCFG parameters automatically from unaligned,
unfolded training sequences
(see references below)
and
demonstrates its application to modeling three of
the five prominent
trans-acting factors in spliceosomal small nuclear RNA:
U1, U5, and U6.
Tree-Grammar EM, a generalization of the HMM forward-backward algorithm,
is based on tree grammars and is faster than the previously
proposed inside-outside SCFG training algorithm.
Results show that after having been trained on as few as
about 20 snRNA sequences, each of the
three
models can discern snRNA sequences from
similar-length RNA sequences of other kinds,
can find secondary structure of new snRNA sequences and can
produce multiple alignments of snRNA sequences.
Four Salient Figures
The first and second figures below
depict histograms over a large set of RNA sequences, a
few of which are snRNA.
For each sequence, the grammar is used to computed a
negative log-likelihood (NLL) that indicates how probable
it is that the grammar would have produced that sequence.
The ``Z score'' indicated for each sequence on the X axis is
the normalized NLL of that sequence--that is, the
difference between the sequence's NLL and the average
NLL of a typical sequence of the same length (see thesis for details).
The U6 grammar, trained on 22 U6 sequences using the
Tree-Grammar EM algorithm,
was able to correctly discriminate non-U6 sequences (striped)
from U6 snRNA sequences (solid black) in the following figure.
It misclassified only two of 45 U6 sequences, and misclassified
none of the 1460 non-U6 sequences. (The non-U6 sequences
included some U1 and U5 sequences, and pieces of other RNA
sequences such as tRNA and rRNA.)

In comparison, the same U6 grammar, before training,
was unable to discriminate any U6 snRNA sequences from non-snRNA
sequences:

In the third and fourth figures, letters represent the four nucleotides:
G for guanine,
A for adenine,
C for cytosine, and
U for uracil.
Watson-Crick base pairs (G-C and A-U)
are indicated by a dash between two nucleotides.
Other pairs are indicated by a dot between two nucleotides.
Each ladder structure represents an RNA helix.
The U5 grammar designed in my experiments using the Tree-Grammar EM algorithm
predicted the following folding for a U5 snRNA sequence. It
matches the accepted hand-derived folding for the
sequence exactly.

The U1 grammar designed in my experiments using the Tree-Grammar EM algorithm
predicted the following folding for a U1 snRNA sequence whose
``actual'' (trusted, derived by hand) folding is unknown. It
resembles closely the hand-derived trusted foldings
for other U1 sequences.

References
-
Y. Sakakibara, M. Brown, R. Underwood,
I.S. Mian, and D. Haussler.
Stochastic context-free grammars for modeling RNA. Technical Report
UCSC-CRL-93-16, UC Santa Cruz, Computer and Information Sciences
Dept., Santa Cruz, CA 95064, 1993.
-
Y. Sakakibara, M. Brown, I.S. Mian, R. Underwood,
and D. Haussler.
Stochastic context-free grammars for modeling RNA. In Procedings
of the Hawaii International Conference on System Sciences,
Los Alamitos, CA, 1994. IEEE Computer Society Press.
-
Yasubumi Sakakibara,
Michael Brown,
Richard Hughey,
I. Saira Mian,
Kimmen Sjolander,
R. C. Underwood, and
David Haussler. Recent
methods for RNA modeling using stochastic context-free grammars.
In Proceedings of the Asilomar Conference on Combinatorial
Pattern Matching, New York, NY, 1994. Springer-Verlag.