miércoles, 29 de mayo de 2013

Bioinformatics-Extra Points

Title: Pattern recognition techniques for the emerging field of bioinformatics: A review
Authors: Alan Wee-Chung Liew, Hong Yan, Mengsu Yang

Article

Bioinformatics has grown rapidly in recent years, generating interest in the computer science and engineering communities. Thanks to the public information found all around the internet, there are now huge databases full of useful data for the Human Genome Project, which has opened up interest in new applications and methods for pattern recognition. The main objective of this article is to review 2 major topics: DNA sequence analysis and DNA microarray data analysis. Some of the vision techniques we can find in this article are image analysis for gene expression data extraction, which is then followed by data pre-processing and cluster analysis for pattern discovery. The article describes the current methods used to accomplish all of this and what computer vision techniques could contribute.

As we already said, the recent advances in science and the big amount of information that can be obtained about the Human Genome Project have created great interest among researchers, but one problem they face is that the amount of data is too big, and processing it quickly has become a major challenge. Two important technologies in these projects are:


  • DNA sequence analysis
  • DNA micro array data analysis
The article tells us that even though these technologies have been around for almost 2 decades, it is only now that they are creating more interest, because of the big public-domain online databases that can be found and that hold a massive amount of sequence data usable in the project. The authors even give some examples of these databases along with their sources, for example GenBank:


National Center for Biotechnology Information, located in the USA.

Image Processing

A critical aspect of DNA microarray technology is the ability to extract expression data from images accurately. To do this, researchers need innovative image processing techniques to locate the spots in the images and to measure the resulting expression ratios. Once the expression data are obtained, they go to cluster analysis, where we look for groups of genes that are similarly expressed.

One of the problems that scientists and researchers face when working with DNA sequences: when a molecular biologist is presented with an unknown DNA sequence, the first task is to look for a similar sequence in the public databases. This simplifies the job, because the researcher can build on the efforts of many others.

 

DNA sequence analysis

In this section of the article they discuss DNA from a biological background, explaining some things we may already know, such as that DNA is the basis of heredity and is made up of small molecules named nucleotides, which come in 4 bases:

  • Adenine
  • Cytosine
  • Guanine 
  • Thymine
DNA carries the genetic information required for the organism to function.

Here is a picture summarizing the information given about DNA; we can see the bases represented by the letters inside.

In this section they also explain the whole process of ordering amino acids and other aspects of the DNA structure.

Sequence Comparison

When a new DNA sequence appears, the most common task is to compare it with existing sequences that are already well studied and documented. When 2 sequences from different organisms are similar, they may share a common ancestor, and the sequences can then be cataloged as homologous. One method that can be used for sequence comparison is sequence alignment, which is similar to the string matching problem in pattern recognition. The standard pairwise alignment method is based on dynamic programming: it compares every pair of characters in the two sequences and generates an alignment and a score. One disadvantage of this method is that it is really slow and the number of comparisons can be really big; for example, DNA databases today contain billions of bases and are still growing. To improve on this, fast heuristic local alignment algorithms have been developed; the search tool used on the databases is BLAST, which is freely available. Other methods for sequence comparison use visualization techniques such as:

  • DB-curve (Dual-Base Curve): here 2 bases are assigned to 2 vectors and the remaining bases are assigned to another vector. Similarities and differences between the sequences can be easily observed in the plots.
Here is a Dual-Base Curve test for 8 different species; we can appreciate the similarities and differences in the plots.

Even though the method is really effective in terms of precision, it isn't in terms of time, because of the number of bases to be compared; in the human genome, for example, the number of nucleotide bases is around 3x10^9. This creates a huge demand on computational efficiency, calling for efficient and fast processing algorithms.
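The dynamic-programming pairwise alignment mentioned earlier can be sketched in a few lines. This is only a minimal illustration with made-up scoring values (match +1, mismatch -1, gap -1), not the exact scheme from the article:

```python
def nw_score(a, b, match=1, mismatch=-1, gap=-1):
    """Needleman-Wunsch global alignment score between sequences a and b."""
    m, n = len(a), len(b)
    # dp[i][j] = best score aligning a[:i] with b[:j]
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        dp[i][0] = i * gap                     # a aligned against gaps
    for j in range(1, n + 1):
        dp[0][j] = j * gap                     # b aligned against gaps
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            dp[i][j] = max(dp[i - 1][j - 1] + s,   # align the two characters
                           dp[i - 1][j] + gap,     # gap in b
                           dp[i][j - 1] + gap)     # gap in a
    return dp[m][n]

print(nw_score("GATTACA", "GCATGCU"))
```

The quadratic table is exactly why the method gets slow on billions of bases, and why heuristics like BLAST exist.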

DNA microarray gene expression profiling

Gene expression profiling is the process of determining when and where particular genes are expressed. Microarray technology has emerged as a powerful tool for genomic research, allowing the study of thousands of different DNA nucleotide sequences. The task in microarray image analysis is to compute the expression ratio for each given spot.

This image shows the process that is followed.

As we can see in the image it follows various steps:

  1. Identify the location of all blocks on the microarray image.
  2. Generate the grid within each block which subdivides the block into p x q sub regions, each containing at most one spot.
  3. Segment the spot, if any, in each sub-region.

Vision Techniques


First, a grayscale image is generated from the two TIFF images. The grayscale image can undergo image smoothing, or a filter can be applied, to reduce the effect of image noise. The blocks in a microarray image are arranged in a rigid pattern due to the printing process, and each block is surrounded by regions void of any spots. To locate the individual spots in a block we perform the gridding operation, which consists of locating good-quality spots (guide spots). To account for the variable background and spot intensity, a novel adaptive thresholding procedure and morphological processing are used to detect the guide spots. Then spot segmentation is performed in each of the sub-regions defined by the grid. Segmentation involves finding a circle that separates out the spot. When a spot is present, the intensity distribution of the pixels within the sub-region is modeled using a 2-class Gaussian mixture model to find the optimum threshold. Once the sub-region is thresholded and segmented, a best-fit circle is computed for the final spot segmentation.

This is the image of the process described above. The grayscale image represents the segmentation of a microarray image into blocks, and the right image the gridding within a block.

This is an image of the result of spot segmentation.


Data extraction and processing


Once the spots in a microarray image are extracted, the intensity value of each spot can be obtained. But the results may not be fully reliable, so preprocessing must be done following these steps:

  1. Background correction.
  2. Data Normalization.
  3. Missing value estimation.

Background correction rests on the belief that a spot's measured intensity includes a contribution not due to specific hybridization of the target to the probe. The purpose of normalization is to adjust for any bias that arises from variation in the microarray process rather than from biological differences between the RNA samples.
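A tiny numeric sketch of the first two steps, on made-up spot intensities (the array names, background estimates, and median-centering normalization here are illustrative choices, not the article's exact procedure):

```python
import numpy as np

cy5 = np.array([1200.0, 800.0, 1500.0])   # red-channel spot intensities
cy3 = np.array([1000.0, 900.0, 700.0])    # green-channel spot intensities
bg_red, bg_green = 100.0, 80.0            # local background estimates

# 1. Background correction: subtract the local background from each spot
red = cy5 - bg_red
green = cy3 - bg_green

# 2. Normalization: median-center the log ratios so systematic dye bias
#    is removed and the median log-ratio becomes 0
log_ratio = np.log2(red / green)
normalized = log_ratio - np.median(log_ratio)
print(normalized)
```

After this, a positive value means the gene is expressed more in the red sample, a negative value more in the green one.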

Pattern Discovery by cluster analysis

A standard tool in gene expression data analysis is cluster analysis, which aims at finding groups in a given data set such that objects in the same group are similar to each other while objects in different groups are dissimilar. Genes with related functions are expected to have similar expression patterns. Clustering of gene expression data has been applied to the study of temporal gene expression in sporulation, the identification of gene regulatory networks, and the study of cancer.

For the project, a binary hierarchical clustering algorithm is proposed. The algorithm performs successive binary subdivisions of the data in a hierarchical manner, until further splitting of a partition into two smaller partitions is no longer significant.
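A rough sketch of that idea, using repeated 2-means splits and a simple size/depth stopping rule in place of the article's statistical significance test (the data and parameters are illustrative):

```python
import numpy as np

def two_means(X, iters=20, seed=0):
    # One binary split: plain 2-means on the rows of X
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), 2, replace=False)]
    for _ in range(iters):
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        for k in (0, 1):
            if (labels == k).any():
                centers[k] = X[labels == k].mean(axis=0)
    return labels

def binary_hierarchical(X, idx=None, min_size=2, depth=0, max_depth=3):
    # Recursively split each partition in two until partitions are small
    if idx is None:
        idx = np.arange(len(X))
    if len(idx) <= min_size or depth >= max_depth:
        return [idx.tolist()]
    labels = two_means(X[idx])
    left, right = idx[labels == 0], idx[labels == 1]
    if len(left) == 0 or len(right) == 0:
        return [idx.tolist()]
    return (binary_hierarchical(X, left, min_size, depth + 1, max_depth)
            + binary_hierarchical(X, right, min_size, depth + 1, max_depth))

# Two well-separated groups of "expression profiles"
rng = np.random.default_rng(2)
X = np.vstack([rng.normal(0, 0.5, (4, 3)), rng.normal(10, 0.5, (4, 3))])
clusters = binary_hierarchical(X)
print(clusters)
```

Each leaf of the recursion is one final cluster; together they partition all the genes.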


Conclusions

I think the project is really interesting. It works with a lot of techniques we have seen in class, like computer vision techniques to get more efficient segmentation results; they also work with clustering for pattern recognition, similar to something we did a year ago in parallel systems class, and they use it to search databases for similar genes or DNA. This article shows the growing interest from computer science, demonstrating that researchers and scientists need efficient algorithms that may be applied in other fields too.



References

Liew, A. W.-C., Yan, H. and Yang, M. (2005) Pattern recognition techniques for the emerging field of bioinformatics: A review. [e-book] [Accessed: 29 May 2013].

 








Wavelets- Image Compression

For this homework the objective was to design a program able to apply compression to images using wavelets, then test it and reach a conclusion. In my case I had already worked with wavelets on another homework for the computer vision lab. First I will cover some important background, and then I will explain the program and apply it to some images as a test.

PyWavelets

PyWavelets is a free, open-source wavelet transform library for Python. It is very easy to use; of its many features I worked with the 2D Forward and Inverse Discrete Wavelet Transform only, but it includes many others. You can look for more information about this library at:


It can be installed with the following line in any Linux terminal:

sudo apt-get install python-pywt

This module is needed to run the program I will show.

Wavelet Compression

It is a form of data compression commonly used for images, but it can also be applied to video or audio. A good example of wavelet compression in use is JPEG 2000. The objective of wavelet compression is to store the image data in as little space as possible; it can be either lossless or lossy. In other words, what we do is keep the transient elements of the data signal and represent them with a smaller amount of information.

How does it work?

First of all we choose a wavelet and apply it; this generates as many coefficients as there are pixels in the image. The coefficients concentrate all the information in arrays, and from these coefficients we can then apply compression.
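The idea can be sketched with PyWavelets: transform, zero out the small detail coefficients, and invert. (Note: the coeffs_to_array/array_to_coeffs helpers assume a newer PyWavelets than the python-pywt package mentioned above, and the 90% threshold is an arbitrary illustrative choice.)

```python
import numpy as np
import pywt

rng = np.random.default_rng(1)
image = rng.integers(0, 256, size=(64, 64)).astype(float)  # stand-in image

coeffs = pywt.wavedec2(image, 'haar', level=2)      # multilevel 2-D DWT
arr, slices = pywt.coeffs_to_array(coeffs)          # flatten for thresholding
threshold = np.percentile(np.abs(arr), 90)          # keep roughly the largest 10%
arr[np.abs(arr) < threshold] = 0.0                  # discard small details
coeffs2 = pywt.array_to_coeffs(arr, slices, output_format='wavedec2')
reconstructed = pywt.waverec2(coeffs2, 'haar')      # image from kept coefficients
print(np.count_nonzero(arr), arr.size)
```

The zeroed coefficients are what a file format would store cheaply; the reconstruction shows how much image survives with only the kept ones.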

Image Decomposition

Because the method I work with is the DWT (Discrete Wavelet Transform), I will explain it in more detail. When we apply the DWT to an image we obtain 4 coefficient arrays.

                            -------------------
                            |        |        |
                            | cA(LL) | cH(LH) |
                            |        |        |
(cA, (cH, cV, cD))  <--->   -------------------
                            |        |        |
                            | cV(HL) | cD(HH) |
                            |        |        |
                            -------------------
  • cA: Approximation Coefficient
  • cH: Horizontal Detail Coefficient
  • cV: Vertical Detail Coefficient
  • cD: Diagonal Detail Coefficient
We can then use each of the coefficients to create new images that show the detail contained in each one. From the approximation coefficient we get a whitish image that contains most of the original information; all the others, as their names say, contain details.
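This decomposition is easy to see on a tiny example with PyWavelets:

```python
import numpy as np
import pywt

# A tiny 4x4 "image": dwt2 returns the approximation cA and the
# (cH, cV, cD) detail coefficients, each half the input size per axis.
image = np.arange(16, dtype=float).reshape(4, 4)
cA, (cH, cV, cD) = pywt.dwt2(image, 'haar')
print(cA.shape)   # (2, 2)
print(cA)         # each haar cA value is the sum of a 2x2 block divided by 2
```

For the haar wavelet, each approximation value averages (up to scaling) a 2x2 block, which is why the cA image looks like a half-size copy of the original.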


Image Compression

Nowadays we have cameras that can take excellent, really high-resolution pictures. The problem comes when we want to transfer them to another computer or upload them to the internet; that's why we have programs that apply image compression, to keep the image viewable but easier to handle. By image compression we mean reducing the irrelevance and redundancy of the image data in order to store or transmit it in an efficient form.

Types of Compression

  • Lossy: removes irrelevance and redundancy, achieving high compression ratios but being unable to restore the original image completely.
  • Lossless: it is possible to recover the original image without any loss.

Good examples of the types of image compression can be found in the different image formats; for example, JPEG is based on the DCT (Discrete Cosine Transform), taking 8x8 blocks from the image and applying the DCT to each. One disadvantage of JPEG is that it is considered a lossy format, but it reaches high compression ratios, going from 10:1 to 100:1. A good example of a lossless format is GIF: it has lower compression ratios, but there is no information loss and the image can be completely restored; compression ratios go from 4:1 to 10:1.
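For illustration, here is what the 8x8 DCT step looks like (a sketch using SciPy, which is not used elsewhere in this post; real JPEG also quantizes and entropy-codes the coefficients):

```python
import numpy as np
from scipy.fft import dctn

# A flat 8x8 block, JPEG-style: a constant block concentrates all its
# energy in the single DC coefficient, so the other 63 values are zero
# and compress to almost nothing.
block = np.full((8, 8), 128.0)
coeffs = dctn(block, norm='ortho')
print(coeffs[0, 0])   # 1024.0 -- the DC term carries everything
```

Real photo blocks are not flat, but their energy still piles up in the low-frequency corner, which is what makes the 8x8 DCT worth doing.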

Code

Now I will explain the code part by part. As I already said, I used the Python PyWavelets library to work with wavelets, using the DWT and the haar wavelet.

I used the dwt2() function, which applies the Discrete Wavelet Transform in 2 dimensions. The function requires 2 parameters:

pywt.dwt2(data,wavelet)

Data: array or matrix with the pixel information.
Wavelet: the wavelet we choose, in this case haar.

In my computer vision homework I first applied grayscale conversion and image resizing. I will remove those steps here, because I discovered I was getting extra compression from them, and the intention is to work only with wavelets.

So the first step is to generate the array or matrix to pass as the data parameter of the function.


Then we use the function already explained to generate the 4 coefficients. We get 4 coefficients because we are working in 2D; in 1D there are only 2. As already mentioned, we send the array and choose the wavelet, in this case haar.
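These two steps can be sketched like this (a synthetic gradient stands in for a real image file, so the snippet is self-contained):

```python
import numpy as np
from PIL import Image
import pywt

# Build a 256x256 grayscale test image and turn it into the data matrix
img = Image.fromarray(np.uint8(np.tile(np.arange(256), (256, 1))))
data = np.asarray(img.convert("L"), dtype=float)   # pixel matrix for dwt2

# Apply the 2-D DWT with the haar wavelet: 4 half-size coefficient arrays
cA, (cH, cV, cD) = pywt.dwt2(data, 'haar')
print(data.shape, cA.shape)   # (256, 256) -> (128, 128)
```

With a real photo you would replace the synthetic array with Image.open on your file; everything after that is identical.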


Now that we have the 4 coefficients, we create a new image for each of them. We also need to apply a filter, because the arrays generated from the coefficients contain negative numbers and numbers above 255, which can't go into an image. What we do is create a filter: any negative value becomes 0, any value above 255 becomes 255, and any other value remains the same. After fixing the array we apply it to a new image, generating the coefficient images.

We create a function to enumerate each of the coefficients

And then we apply the filter
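The filter described above amounts to clamping values into the valid pixel range; a sketch with hypothetical coefficient values:

```python
import numpy as np
from PIL import Image

# Hypothetical coefficient values: negative and above-255 entries occur
coeff = np.array([[-40.5, 12.0], [300.2, 255.0]])

# Clamp into 0..255 and cast, so the array is a valid grayscale image
clipped = np.clip(coeff, 0, 255).astype(np.uint8)
img = Image.fromarray(clipped)   # a viewable coefficient image
print(clipped.tolist())          # [[0, 12], [255, 255]]
```

np.clip does in one call what an element-by-element filter loop would do, and is much faster on large coefficient arrays.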

After we have the 4 images, I created a function to evaluate each of them. For the test I take into account the coefficient used, the percentage of compression relative to the original image and the newly created image, and the compression time for each image.

But this only gives us the compression ratio of each of the coefficients, and of course they come out highly compressed because the image is divided into 4. To get the real compression ratio we need to apply the inverse transform, taking as parameters the coefficients we created. The function to use is idwt2(), which takes the tuple of 4 coefficients and the wavelet used for compression.
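The inverse step is one call; note that for the 2-D transform the inverse is idwt2, not the 1-D idwt:

```python
import numpy as np
import pywt

data = np.arange(16, dtype=float).reshape(4, 4)   # stand-in image matrix
coeffs = pywt.dwt2(data, 'haar')                  # (cA, (cH, cV, cD))
restored = pywt.idwt2(coeffs, 'haar')             # inverse with the same wavelet
print(np.allclose(restored, data))                # True: haar DWT is invertible
```

With unmodified coefficients the reconstruction is exact; the file-size savings in the experiments come from how the transformed data re-encodes, not from discarding information.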


To evaluate this I used the same method as in the "pruebas" function, but I put it all in the main.


Examples-Test

I will try different Images, different levels, different sizes and different image formats to see the results and get to a conclusion.

JPG

Level 1, which means the input image is the original image:

Original Image

Coefficients

Inverted-Compressed


Results


As we can see in the results, we get a 26% compression rate on the final image. Not that big, but it is decent; it also took almost 3 seconds to compress.

Level 2, the input for this will be the Approximation Coefficient image.

Input Image

Coefficients

Inverted-Compressed



As we can see, we don't get any more compression, not even 1%, and because of this the process is really quick too. Now I will try with a large image.

Level 1

Original-Input Image

Coefficients




Inverted-Compressed

Results



As we can see, a bigger image means a bigger compression time; in this case we got a 6% compression rate, not that good for the time it took.

Level 2

Input Image

Coefficients

Inverse-Compressed

Results

Again, just like in the last example, we don't get any more compression, and because of this the time is really low. The last JPG example will be a small image.

Original Image

Coefficients

Inverse-Compressed

Results

Here we get some interesting results: as we can see with the compressed image, instead of getting compression we actually recover more data, making the file a little bigger than it was.

Level 2

Input Image

Coefficients

Inverse-Compressed


Results

Now in this case, for the level 2 image, we did get some compression. In level 1 we were getting a bigger image; in level 2 we get really low compression, not even 1%, but at least we got some. The time, as we already know, depends on the size.

PNG

For this test I will use PNG images of almost the same sizes to see if we get similar results. I will start with the medium image. To work with PNG we just have to add support for RGBA to our code.

Original-Image

Coefficients



Inverse-Compressed

Results

Comparing the results of this image with the JPG, we can see that we get a similar compression percentage, but the time is almost double.

Level 2

Input-Image

Coefficients

Inverse-Compressed

Results

In this case, for level 2, we got similar results compared to JPG, but as with the last test we get more compression time. The next test will be with the large image.

Original-Image

Coefficients




Inverse-Compressed

Results

In this case, in comparison with the JPG image, we got a bigger compression rate and less time.

Level-2

Input-Image

Coefficients

Inverse-Compressed

Results


In this example we got a larger image. What I can see in comparison with JPG is that the coefficient images are very detailed; even in level 2 they keep a high degree of detail, while with JPG even in level 1 the quality of detail in the coefficients is very low. The last test will be with the small image.

Original-Image
Coefficients

Inverse-Compressed

Results



In this case, in comparison with the JPG, we did get compression, and a big one, with a good time.

Level 2

Input-Image
Coefficients

Inverse-Compressed
Results
And in this last example we didn't get any compression; we even got negative values. We can also see that the quality of the coefficient images is really low; it may depend on the type of image.

Conclusions

In conclusion, from all the tests I did with the different formats and images, we get better compression rates and better times with JPG, because the format already has a high compression rate. The only odd thing I found was the case of the small image, where I didn't get any compression; I even tried with other images of the same size, which I didn't include in the report, and kept getting the same results. On the other hand we have the PNG tests, where we got a little more time and more or less the same compression; I think it takes more time because PNG doesn't compress as heavily as JPG. Another thing I noticed was that the coefficient images from the PNG format have more detail, which must be because the image keeps good quality thanks to the PNG format. For level 2 compression the results weren't good: you don't really get good compression on the inverse image, only on the coefficients.

Complete Code

References

  • En.wikipedia.org (2010) Image compression - Wikipedia, the free encyclopedia. [online] Available at: http://en.wikipedia.org/wiki/Image_compression [Accessed: 29 May 2013].
  • Wasilewski, F. (2006) PyWavelets - Discrete Wavelet Transform in Python — PyWavelets Documentation. [online] Available at: http://www.pybytes.com/pywavelets/ [Accessed: 29 May 2013].
  • Henao González, A. (n.p.) Compresión de imágenes usando la transformada de wavelet y el algoritmo de Huffman. [e-book] Manizales: [Accessed: 24 May 2013].
  • En.wikipedia.org (2000) Wavelet transform - Wikipedia, the free encyclopedia. [online] Available at: https://en.wikipedia.org/wiki/Wavelet_transform [Accessed: 29 May 2013].