Algorithms and statistical methods for exact motif discovery

Medientyp: Dissertation; Sonstige Veröffentlichung; E-Book; Elektronische Hochschulschrift

Titel: Algorithms and statistical methods for exact motif discovery

Beteiligte: Marschall, Tobias [VerfasserIn]

Erschienen: Eldorado - Repositorium der TU Dortmund, 2011-05-23

Sprache: Englisch

DOI: https://doi.org/10.17877/DE290R-552

Schlagwörter: Probabilistic arithmetic automation ; Motif statistics ; Motif discovery ; Mycobacterium tuberculosis ; Compound poisson

Entstehung:

Anmerkungen: Diese Datenquelle enthält auch Bestandsnachweise, die nicht zu einem Volltext führen.

Beschreibung: The motif discovery problem consists of uncovering exceptional patterns (called motifs) in sets of sequences. It arises in molecular biology when searching for yet unknown functional sites in DNA sequences. In this thesis, we develop a motif discovery algorithm that (1) is exact, that means it returns a motif with optimal score, (2) can use the statistical significance with respect to complex background models as a scoring function, (3) takes into account the effects of self-overlaps of motif instances, and (4) is efficient enough to be useful in large-scale applications. To this end, several algorithms and statistical methods are developed. First, the concepts of deterministic arithmetic automata (DAAs) and probabilistic arithmetic automata (PAAs) are introduced. We prove that they allow calculating the distributions of values resulting from deterministic computations on random texts generated by arbitrary finite-memory text models. This technique is applied three times: first, to compute the distribution of the number of occurrences of a pattern in a random string, second, to compute the distribution of the number of character accesses made by windowbased pattern matching algorithms, and, third, to compute the distribution of clump sizes, where a clump is a maximal set of overlapping motif occurrences. All of these applications are interesting theoretical topics in themselves and, in all three cases, our results go beyond those known previously. In order to compute the distribution of the number of occurrences of a motif in a random text, a deterministic finite automaton (DFA) accepting the motif’s instances is needed to subsequently construct a PAA. We therefore address the problem of efficiently constructing minimal DFAs for motif types common in computational biology. We introduce simple non-deterministic finite automata (NFAs) and prove that these NFAs are transformed into minimal DFAs by the classical subset construction. We show that they can be built from (sets of) generalized strings and from consensus ...

Zugangsstatus: Freier Zugang

Rechte-/Nutzungshinweise: Urheberrechtsschutz

Nur in Feld suchen:

Zuletzt gesuchte Begriffe: