I’ll make a summary about an article of Fast String
Searching with suffixes trees posted in August 1 1996 by Mark Nelson in
Computer Science, Data Compression, and Magazine Articles.
The article was published in Dr.Dobb’s Journal in August 1996
Problem
The article starts specifying what the problem is; it talks
about how programmers face day to day some match string sequence problems, and
how improving it will benefit a lot. Then it gives an example of how useful it
would be to improve string searching algorithms, imagine that you work as a
programmer in a DNA sequence project, researchers will be busy producing
fragmented sequences of nucleotides, they send this information to the server
and it’s expected to locale the sequences in a database of genomes, as we
already know a genome of a specific virus may contain thousands of nucleotides
bases and as a server of a DNA research would be full of viruses, if the string
searching algorithm is not efficient it would take a lot of time to match with
the correct virus. If you use a brute force string search algorithm it would
take forever to perform a string comparison at every single nucleotide in every
genome in the database. So the problem is how this can be solved.
Intuitive Solution
Building a search trie for searching through input text. A
trie is a type of tree that has N possible branches from each node, where N is
the number of characters in the alphabet. It’s called suffix because the trie
contains all the suffixes of a given block of text, in this case all the viral
genome.
Example:
This example shows a suffix trie for the word BANANAS. It
gets divided in a starting root node and all of the suffixes of the word
BANANAS in this case the suffixes are BANANAS, ANANAS, NANAS and an S. Suffixes
trees simplify the job because if you input a text of N length and need to
search for a length M string it would just need to compare M to search, in
traditional brute force search algorithms it would take a lot of time because
it would compare all N*M comparisons available. Suffixes trees are so efficient
that if the word BANANAS was in the collected works of Shakespear it would only
take 7 character comparisons to find it.
But if they are so efficient why we don’t hear much about
there use?
This is because constructing one requires O(N2) time and space.
A solution to this was proposed in 1976 by Edward McCreight
when he published a paper on what came to be known as the suffix tree. The
difference between suffix tree and suffix trie it’s that the suffix tree
eliminate nodes that have a single descendant, meaning that now individual
edges in the tree may now represent sequences of text instead of single
characters.
In this picture we can see that by using this method we removed 12 nodes meaning that the time and space requeriment for the contruction is also reduced from O(N2) to O(N).
McCreight’s original algorithm had some disadvantages; one
of them was that it required the tree to be built in reverse order meaning
characters were added from the end of the input. This makes much more difficult
to use for applications of data compression. Then Esko Ukkonen from the
Iniversity of Helsinki make some modifications to the McCreight algorithm
making it able to work from left to right.
Esko Ukkonen algorithm would work this way:
It starts with an empty tree for a given string of text, it
then adds each of the N prefixes in this case B is inserte to the tree then BA
then BAN untll its completely inserted and the tree is complete.
No llegaste mucho a la parte de cómo aplicarlos, pero van 2 puntos extra anyhow.
ResponderEliminar