Authors: Alan Wee-Chung Liew, Hong Yan, Mengsu Yang
Article
Bioinformatics has grown rapidly in recent years and has attracted interest from the computer science and engineering communities. Thanks to the Human Genome Project, huge public databases full of useful data are now available on the internet. This has increased interest in new applications and methods for pattern recognition. The main objective of the article is to review two major topics: DNA sequence analysis and DNA microarray data analysis. The vision techniques covered in the article include image analysis for extracting gene expression data, followed by data pre-processing and cluster analysis for pattern discovery. The article describes the current methods used for these tasks and where vision techniques can be applied.
As mentioned above, recent advances in science and the large amount of information produced by the Human Genome Project have created great interest among researchers. One problem they face is that the amount of data is too big, and processing it quickly has become a major challenge. Two important technologies in this work are:
- DNA sequence analysis
- DNA micro array data analysis
The article tells us that even though these technologies have been around for almost two decades, they are attracting more interest now because of the large public-domain online databases, which hold a massive amount of sequence data that can be used in such projects. The authors give some examples of these databases and their sources, for example GenBank:
National Center for Biotechnology Information (NCBI), located in the USA.
Image Processing
A critical aspect of DNA microarray technology is the ability to extract expression data from images accurately. To do this, researchers need innovative image processing techniques that locate the spots in the images and measure the resulting expression ratio. Once the expression data are obtained, they are passed to cluster analysis, where we look for groups of genes that are similarly expressed.
One of the problems researchers face when working with DNA sequences is this: when a molecular biologist is presented with an unknown DNA sequence, the first task is to look for similar sequences in the public databases. This simplifies the job, because the researcher can build on the effort of many other researchers.
DNA sequence analysis
In this section the article gives the biological background of DNA. It explains some things we may already know: DNA is the basis of heredity, and it is made up of small molecules called nucleotides, which are distinguished by 4 bases:
- Adenine
- Cytosine
- Guanine
- Thymine
DNA carries the genetic information required for the organism to function.
Here is a picture that summarizes the information given about DNA; the bases are represented by the letters inside.
This section of the article also explains the process by which amino acids are ordered, along with other aspects of the DNA structure.
Sequence Comparison
When a new DNA sequence appears, the most common task is to compare it with existing sequences that are already well studied and documented. When two sequences from different organisms are similar, they may share a common ancestor, and the sequences are then said to be homologous. One method for sequence comparison is sequence alignment, which is similar to the string matching problem in pattern recognition. The standard pairwise alignment method is based on dynamic programming: it compares every pair of characters in the two sequences and generates an alignment and a score. One disadvantage of this method is that it is slow, because the number of comparisons can be very large; DNA databases today contain billions of bases and are still growing. To speed things up, fast heuristic local alignment algorithms have been developed. A common tool for searching the databases is BLAST, which is freely available. Sequences can also be compared using visualization techniques such as:
- DB-curve (Dual-Base Curve): two bases are each assigned their own vector and the remaining bases are assigned to another vector. Similarities and differences between sequences can then be easily observed in the plots.
Here is a Dual-Base Curve test for 8 different species; the plots show the similarities and differences between them.
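To get a feeling for how such a curve could be drawn, here is a minimal sketch. It assumes the two chosen bases step diagonally, (1, 1) and (1, -1), while the two remaining bases step horizontally, (1, 0). These exact vectors, the choice of A and T as the two bases, and the name `db_curve` are my own assumptions for illustration and are not taken from the article.

```python
def db_curve(sequence, up="A", down="T"):
    """Return the (x, y) points of a dual-base style walk over a DNA sequence.

    Assumed vectors (illustrative only):
    `up` base -> (1, 1), `down` base -> (1, -1), any other base -> (1, 0).
    """
    x, y = 0, 0
    points = [(x, y)]
    for base in sequence.upper():
        if base == up:
            y += 1
        elif base == down:
            y -= 1
        x += 1  # every base moves the curve one step to the right
        points.append((x, y))
    return points


# Two made-up sequences that differ in a single base
for name, seq in [("seq1", "ATGCGATA"), ("seq2", "ATGCGTTA")]:
    print(name, db_curve(seq))
```

Plotting the returned points for several sequences on the same axes would show where the curves run together and where they drift apart.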
Even though the method is precise, it is not efficient in terms of time because of the number of bases to be compared. In the human genome, for example, the number of nucleotide bases is around 3 × 10^9. This creates a huge demand for computational efficiency and for fast processing algorithms.
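The dynamic programming alignment mentioned in this section can be sketched as follows. This is a minimal version of the classic Needleman-Wunsch global alignment, not code from the article. The scoring values (match +1, mismatch -1, gap -1) and the function name `global_alignment` are assumptions chosen for illustration.

```python
def global_alignment(a, b, match=1, mismatch=-1, gap=-1):
    """Needleman-Wunsch global alignment; returns (score, aligned_a, aligned_b)."""
    n, m = len(a), len(b)

    def pair_score(i, j):
        return match if a[i - 1] == b[j - 1] else mismatch

    # score[i][j] holds the best score for aligning a[:i] with b[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(
                score[i - 1][j - 1] + pair_score(i, j),  # align the two characters
                score[i - 1][j] + gap,                   # gap in b
                score[i][j - 1] + gap,                   # gap in a
            )

    # Trace back from the bottom-right corner to recover one best alignment
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + pair_score(i, j):
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return score[n][m], "".join(reversed(out_a)), "".join(reversed(out_b))


print(global_alignment("ACGT", "AGT"))
```

The two nested loops fill a table with one cell per pair of characters, so the work grows with the product of the two sequence lengths. That is why this approach is too slow for searching whole databases and heuristic tools are preferred there.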
DNA microarray gene expression profiling
Gene expression profiling is the process of determining when and where particular genes are expressed. Microarray technology has emerged as a powerful tool for genomic research, allowing the study of thousands of different DNA nucleotide sequences. The task in microarray image analysis is to compute the expression ratio for each spot.
This image shows the process that is followed.
As we can see in the image, the process involves several steps:
- Identify the location of all blocks on the microarray image.
- Generate the grid within each block, which subdivides the block into p × q sub-regions, each containing at most one spot.
- Segment the spot, if any, in each sub-region.
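The subdivision of a block into p × q sub-regions can be sketched like this. The sketch assumes a perfectly regular, evenly spaced grid, which is a simplification of mine; the function name `grid_subregions` and the pixel coordinates are hypothetical.

```python
def grid_subregions(top, left, height, width, p, q):
    """Split a block into p rows x q columns of cells.

    Each cell is returned as (row0, col0, row1, col1) in pixel coordinates,
    with row1 and col1 exclusive. Assumes an evenly spaced grid.
    """
    cells = []
    for r in range(p):
        row0 = top + r * height // p
        row1 = top + (r + 1) * height // p
        for c in range(q):
            col0 = left + c * width // q
            col1 = left + (c + 1) * width // q
            cells.append((row0, col0, row1, col1))
    return cells


# A made-up 100 x 200 pixel block split into 2 rows and 4 columns
for cell in grid_subregions(0, 0, 100, 200, 2, 4):
    print(cell)
```

Each returned cell is then the region in which the spot segmentation step looks for at most one spot.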
Vision Techniques
First, a grayscale image is generated from the two TIFF images of the microarray. The grayscale image can then be smoothed or filtered to reduce the effect of image noise. The blocks in a microarray image are arranged in a rigid pattern due to the printing process, and each block is surrounded by regions void of any spot. To locate the individual spots in a block we perform the gridding operation, which consists of locating good-quality spots (guide spots). To account for the variable background and spot intensity, a novel adaptive threshold procedure and morphological processing are used to detect the guide spots. Then spot segmentation is performed in each of the sub-regions defined by the grid. Segmentation involves finding a circle that separates out the spot. When a spot is present, the intensity distribution of the pixels within the sub-region is modeled using a 2-class Gaussian mixture model to find the optimum threshold. Once the sub-region is thresholded and segmented, a best-fit circle is computed for the final spot segmentation.
This is the image of the process described above. The grayscale image shows the segmentation of a microarray image into blocks, and the image on the right shows the gridding within a block.
This is an image of the result of spot segmentation.
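The thresholding step with a 2-class Gaussian mixture model can be sketched as below. This only covers finding a threshold for one sub-region; the adaptive threshold for guide spots, the morphological processing and the circle fitting are left out. Fitting the mixture with the EM algorithm and placing the threshold where the two weighted densities cross are my assumptions, since the text does not say how the model is fitted. The function names and the pixel values are made up.

```python
import math


def _log_weighted_normal(x, weight, mean, var):
    """Log of weight * N(x; mean, var)."""
    return (math.log(weight) - 0.5 * math.log(2 * math.pi * var)
            - (x - mean) ** 2 / (2 * var))


def gmm_threshold(pixels, iterations=50):
    """Fit a 2-class 1-D Gaussian mixture with EM and return a threshold."""
    lo, hi = min(pixels), max(pixels)
    # Start with one class near the dark end and one near the bright end
    means = [lo + 0.25 * (hi - lo), lo + 0.75 * (hi - lo)]
    start_var = max(((hi - lo) / 4) ** 2, 1e-6)
    variances = [start_var, start_var]
    weights = [0.5, 0.5]
    n = len(pixels)

    for _ in range(iterations):
        # E-step: probability that each pixel belongs to class 0
        resp0 = []
        for x in pixels:
            l0 = _log_weighted_normal(x, weights[0], means[0], variances[0])
            l1 = _log_weighted_normal(x, weights[1], means[1], variances[1])
            top = max(l0, l1)  # subtract the max to avoid underflow
            e0, e1 = math.exp(l0 - top), math.exp(l1 - top)
            resp0.append(e0 / (e0 + e1))
        # M-step: re-estimate weight, mean and variance of each class
        for k in (0, 1):
            resp = resp0 if k == 0 else [1.0 - r for r in resp0]
            total = max(sum(resp), 1e-12)
            means[k] = sum(r * x for r, x in zip(resp, pixels)) / total
            spread = sum(r * (x - means[k]) ** 2 for r, x in zip(resp, pixels))
            variances[k] = max(spread / total, 1e-6)
            weights[k] = max(total / n, 1e-12)

    # Scan from the dark mean to the bright mean and stop at the first
    # intensity where the bright class becomes at least as likely
    dark, bright = (0, 1) if means[0] <= means[1] else (1, 0)
    steps = 1000
    for s in range(steps + 1):
        t = means[dark] + (means[bright] - means[dark]) * s / steps
        if (_log_weighted_normal(t, weights[bright], means[bright], variances[bright])
                >= _log_weighted_normal(t, weights[dark], means[dark], variances[dark])):
            return t
    return means[bright]


# Made-up sub-region: dark background pixels plus a bright spot
pixels = [10, 12, 11, 9, 10, 13, 11, 10] * 5 + [200, 198, 202, 205, 199, 201] * 5
threshold = gmm_threshold(pixels)
spot_pixels = [p for p in pixels if p >= threshold]
print(threshold, len(spot_pixels))
```

Pixels at or above the threshold would be treated as spot pixels, and the best-fit circle would then be computed from them.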
Data extraction and processing
Once the spots in a microarray image are extracted, the intensity value of each spot can be obtained. These raw values may not be fully reliable, so preprocessing must be done in the following steps:
- Background correction.
- Data Normalization.
- Missing value estimation.
Background correction is based on the belief that a spot's measured intensity includes a contribution that is not due to specific hybridization of the target to the probe. The purpose of normalization is to adjust for any bias that arises from variation in the microarray process rather than from biological differences between the RNA samples.
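A small sketch of the first two preprocessing steps follows. It assumes that background correction is a simple subtraction of a local background estimate, that the two channels are called red and green, and that normalization is done by centring the log ratios on their median. These are common choices, but they are my assumptions and not necessarily the procedures the article describes. All names and numbers are made up.

```python
import math
from statistics import median


def log_ratios(red, green, red_bg, green_bg, floor=1.0):
    """Background-corrected log2(red / green) for each spot."""
    ratios = []
    for r, g, rb, gb in zip(red, green, red_bg, green_bg):
        r_corr = max(r - rb, floor)  # clamp so the logarithm stays defined
        g_corr = max(g - gb, floor)
        ratios.append(math.log2(r_corr / g_corr))
    return ratios


def median_normalize(values):
    """Shift the values so that their median becomes zero."""
    centre = median(values)
    return [v - centre for v in values]


# Made-up intensities for three spots
red = [900, 300, 500]
green = [300, 300, 200]
background = [100, 100, 100]

raw = log_ratios(red, green, background, background)
print(raw)
print(median_normalize(raw))
```

Working with log ratios makes a doubling and a halving of expression symmetric around zero, which is convenient for the clustering step that comes next.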
Pattern Discovery by cluster analysis
A standard tool in gene expression data analysis is cluster analysis. Cluster analysis aims at finding groups in a given data set such that objects in the same group are similar to each other while objects in different groups are dissimilar. Genes with related functions are expected to have similar expression patterns. Clustering of gene expression data has been applied to the study of temporal gene expression in sporulation, the identification of gene regulatory networks and the study of cancer.
For this purpose a binary hierarchical clustering algorithm is proposed. The algorithm performs successive binary subdivisions of the data in a hierarchical manner, until splitting a partition into two smaller partitions is no longer significant.
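The idea of splitting in two until a split stops being significant can be sketched as follows. This is not the algorithm from the article. As stand-ins, the sketch uses plain 2-means for each binary split and calls a split significant when the distance between the two centroids is at least `min_separation` times the larger within-group spread. Both choices, and all names and data, are my own assumptions.

```python
import math


def _dist(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def _mean(points):
    return [sum(col) / len(points) for col in zip(*points)]


def _spread(points):
    """Root mean squared distance of the points to their centroid."""
    centre = _mean(points)
    return math.sqrt(sum(_dist(p, centre) ** 2 for p in points) / len(points))


def _two_means(points, iterations=20):
    """Split points into two groups with 2-means (deterministic start)."""
    centre = _mean(points)
    c0 = max(points, key=lambda p: _dist(p, centre))
    c1 = max(points, key=lambda p: _dist(p, c0))
    left, right = [], []
    for _ in range(iterations):
        left = [p for p in points if _dist(p, c0) <= _dist(p, c1)]
        right = [p for p in points if _dist(p, c0) > _dist(p, c1)]
        if not left or not right:
            break
        c0, c1 = _mean(left), _mean(right)
    return left, right


def binary_hierarchical_clustering(points, min_separation=3.0, min_size=2):
    """Recursively bisect the data until a split is judged insignificant."""
    if len(points) < 2 * min_size:
        return [points]
    left, right = _two_means(points)
    if len(left) < min_size or len(right) < min_size:
        return [points]
    gap = _dist(_mean(left), _mean(right))
    within = max(_spread(left), _spread(right), 1e-12)
    if gap < min_separation * within:
        return [points]  # assumed stopping rule: the split is not significant
    return (binary_hierarchical_clustering(left, min_separation, min_size)
            + binary_hierarchical_clustering(right, min_separation, min_size))


# Made-up expression profiles: three groups of four genes, two conditions each
offsets = [(0.0, 0.0), (0.2, 0.1), (0.1, 0.3), (0.3, 0.2)]
genes = [(bx + dx, by + dy)
         for bx, by in [(0, 0), (5, 5), (20, 1)]
         for dx, dy in offsets]
clusters = binary_hierarchical_clustering(genes)
for cluster in clusters:
    print(cluster)
```

A nice property of this kind of scheme is that the number of clusters does not have to be chosen in advance; it falls out of the stopping rule.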
Conclusions
I think the project is really interesting. It works with a lot of techniques we have seen in class, such as computer vision techniques to get better results when segmenting the images. The authors also use clustering for pattern recognition, similar to something we did a year ago in the parallel systems class, and they search databases for similar genes or DNA sequences. The article shows the growing interest from computer science and demonstrates that researchers and scientists need efficient algorithms that may be applied in other fields too.
References
Liew, A. W.-C., Yan, H. and Yang, M. (2005) Pattern recognition techniques for the emerging field of bioinformatics: A review. [e-book] [Accessed: 29 May 2013].