jueves, 28 de febrero de 2013

Fast String Searching with Suffixes trees


I’ll make a summary about an article of Fast String Searching with suffixes trees posted in August 1 1996 by Mark Nelson in Computer Science, Data Compression, and Magazine Articles.
The article was published in Dr.Dobb’s Journal in August 1996

Problem
The article starts specifying what the problem is; it talks about how programmers face day to day some match string sequence problems, and how improving it will benefit a lot. Then it gives an example of how useful it would be to improve string searching algorithms, imagine that you work as a programmer in a DNA sequence project, researchers will be busy producing fragmented sequences of nucleotides, they send this information to the server and it’s expected to locale the sequences in a database of genomes, as we already know a genome of a specific virus may contain thousands of nucleotides bases and as a server of a DNA research would be full of viruses, if the string searching algorithm is not efficient it would take a lot of time to match with the correct virus. If you use a brute force string search algorithm it would take forever to perform a string comparison at every single nucleotide in every genome in the database. So the problem is how this can be solved.

Intuitive Solution

Building a search trie for searching through input text. A trie is a type of tree that has N possible branches from each node, where N is the number of characters in the alphabet. It’s called suffix because the trie contains all the suffixes of a given block of text, in this case all the viral genome.

Example:


This example shows a suffix trie for the word BANANAS. It gets divided in a starting root node and all of the suffixes of the word BANANAS in this case the suffixes are BANANAS, ANANAS, NANAS and an S. Suffixes trees simplify the job because if you input a text of N length and need to search for a length M string it would just need to compare M to search, in traditional brute force search algorithms it would take a lot of time because it would compare all N*M comparisons available. Suffixes trees are so efficient that if the word BANANAS was in the collected works of Shakespear it would only take 7 character comparisons to find it.

But if they are so efficient why we don’t hear much about there use?

This is because constructing one requires O(N2) time and space.

A solution to this was proposed in 1976 by Edward McCreight when he published a paper on what came to be known as the suffix tree. The difference between suffix tree and suffix trie it’s that the suffix tree eliminate nodes that have a single descendant, meaning that now individual edges in the tree may now represent sequences of text instead of single characters.


In this picture we can see that by using this method we removed 12 nodes meaning that the time and space requeriment for the contruction is also reduced  from O(N2) to O(N).

McCreight’s original algorithm had some disadvantages; one of them was that it required the tree to be built in reverse order meaning characters were added from the end of the input. This makes much more difficult to use for applications of data compression. Then Esko Ukkonen from the Iniversity of Helsinki make some modifications to the McCreight algorithm making it able to work from left to right.


Esko Ukkonen algorithm would work this way:


It starts with an empty tree for a given string of text, it then adds each of the N prefixes in this case B is inserte to the tree then BA then BAN untll its completely inserted and the tree is complete.






1 comentario:

  1. No llegaste mucho a la parte de cómo aplicarlos, pero van 2 puntos extra anyhow.

    ResponderEliminar