14 January 2016

BRAKER1 – a new gene search algorithm

Scientists have proposed an algorithm that makes it possible to analyze DNA faster and more accurately

MIPT Press Service 

A group of scientists from Germany, the United States and Russia, including Mark Borodovsky, a department head at the Moscow Institute of Physics and Technology (MIPT), has proposed an algorithm that automates the search for genes and makes it more efficient. The development combines the strengths of the most advanced tools for working with genomic data. The new method should allow new DNA sequences to be analyzed faster and more accurately, and the complete set of genes in a genome to be found.

Although the article describing the algorithm, Hoff et al., "BRAKER1: Unsupervised RNA-Seq-Based Genome Annotation with GeneMark-ET and AUGUSTUS", was only recently published in the journal Bioinformatics (Oxford Journals), the method has already proved its relevance: the program has been downloaded by more than 1,500 centers and laboratories around the world. In tests, the algorithm was significantly more accurate than other algorithms. The work belongs to bioinformatics, a discipline at the junction of several sciences.

Bioinformatics is a set of methods from mathematics, statistics and computer science used to study biological molecules such as DNA, RNA and proteins. DNA, fundamentally an informational molecule, is sometimes even depicted in computerized form to emphasize its role as the molecule of biological memory.



Bioinformatics is in great demand, because each newly sequenced genome raises more questions than scientists have time to answer. Specialists, and their time, are worth their weight in gold. That is why automation is the key to success for any bioinformatician, and why algorithms of this kind are so badly needed.

One of the important tasks of bioinformatics is genome annotation: determining which parts of the DNA molecule are used to synthesize RNA and proteins. These regions, the genes, are of special scientific interest. Many studies need information not about the entire DNA (about 2 meters long in a single human cell) but about its most informative part, the genes. Gene regions are identified either by searching for sequence fragments similar to already known genes or by detecting nucleotide patterns characteristic of genes. This is done with predictive algorithms.

Finding gene regions is a non-trivial task, especially in eukaryotic organisms, which include almost all widely known species except bacteria. In eukaryotic cells the task is complicated by interruptions in the coding regions (introns) and by the lack of unambiguous signals that indicate whether a region is coding or not.
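To illustrate why introns complicate the search, here is a minimal Python sketch. The sequence and the exon coordinates are invented for illustration. It shows that the protein-coding message is spread over separate exons and only becomes contiguous once the intron between them is removed.

```python
# Toy example: the sequence and coordinates are invented for illustration.
# The intron starts with GT and ends with AG, the usual boundary signals.
genomic = "ATGGCC" + "GTAAGTTTTCAG" + "AAATGA"  # exon 1 + intron + exon 2

# Exon coordinates as 0-based, end-exclusive (start, end) pairs.
exons = [(0, 6), (18, 24)]

def splice(sequence, exon_coords):
    """Join the exon segments, dropping the introns between them."""
    return "".join(sequence[start:end] for start, end in exon_coords)

coding = splice(genomic, exons)
# Read the spliced message three letters (one codon) at a time.
codons = [coding[i:i + 3] for i in range(0, len(coding), 3)]
print(coding)
print(codons)
```

A gene finder has to solve the reverse problem: given only the genomic string, it must work out where the exon boundaries are.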

The algorithm proposed by the scientists determines which regions of DNA are genes and which are not. This can be done with a Markov chain (a sequence of random events in which the probability of each event depends only on the preceding state), trained on already known genes. The states of the chain in this case are either nucleotides or nucleotide words. The algorithm finds the most likely division of the genome into coding and non-coding regions, the one that best classifies genomic fragments by their ability to encode proteins or RNA.

Experimental data obtained from RNA provide additional information on which to train the model. Some gene prediction programs can use these data to improve accuracy. However, such algorithms need a training set for species-specific training of the model. For example, the AUGUSTUS program, which shows high accuracy, needs a training set of genes. Such a set can be generated by another program, GeneMark-ET, which belongs to the category of self-training algorithms. The two were combined in BRAKER1, developed jointly by the authors of AUGUSTUS and GeneMark-ET.
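The Markov-chain idea described above can be sketched in a few lines of Python. This is only a toy under stated assumptions, not the model used by GeneMark-ET, AUGUSTUS or BRAKER1, which are far more elaborate. The states are single nucleotides, the training sequences are invented, and the function names are hypothetical. Two chains are trained, one on "coding" examples and one on "non-coding" examples, and a fragment is scored by comparing how well each chain explains it.

```python
import math

ALPHABET = "ACGT"

def train_markov_chain(sequences):
    """Estimate first-order transition probabilities with add-one smoothing."""
    counts = {a: {b: 1 for b in ALPHABET} for a in ALPHABET}
    for seq in sequences:
        for prev, nxt in zip(seq, seq[1:]):
            counts[prev][nxt] += 1
    probs = {}
    for prev, row in counts.items():
        total = sum(row.values())
        probs[prev] = {nxt: c / total for nxt, c in row.items()}
    return probs

def log_likelihood(seq, probs):
    """Log-probability of the transitions in seq under one chain."""
    return sum(math.log(probs[p][n]) for p, n in zip(seq, seq[1:]))

def coding_score(seq, coding_model, noncoding_model):
    """Log-odds: positive values favor the coding model."""
    return log_likelihood(seq, coding_model) - log_likelihood(seq, noncoding_model)

# Invented training data, chosen only to make the two classes look different.
coding_examples = ["ATGGCCGCGGCCGGCGCC", "ATGGGCGCCGCGGGCTGA"]
noncoding_examples = ["ATATTTAAATATTTTAAA", "TTTTAAATATATAATTTA"]

coding_model = train_markov_chain(coding_examples)
noncoding_model = train_markov_chain(noncoding_examples)

for fragment in ("GCCGCGGGC", "TATATTTAA"):
    print(fragment, round(coding_score(fragment, coding_model, noncoding_model), 2))
```

With this toy data the first fragment gets a positive score and the second a negative one. Using nucleotide words as states, as the text mentions, would mean conditioning each nucleotide on several preceding ones rather than just one.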

BRAKER1 has proved highly efficient. On a single processor, training and gene prediction for a genome of 120 million base pairs take approximately 17.5 hours. This is a good result, and the time can be reduced significantly by running on parallel processors, so in the future the algorithm can work even faster.
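The timing figures above translate into a rough throughput, and a back-of-the-envelope estimate shows how parallel processors might shorten the run. The sketch below uses Amdahl's law. The parallel fraction of 0.9 is a hypothetical value chosen for illustration, not a measurement of BRAKER1, and the function name is likewise an assumption.

```python
# Figures from the text: one processor, ~17.5 hours, 120 million base pairs.
GENOME_BP = 120_000_000
SINGLE_CORE_HOURS = 17.5

throughput_bp_per_hour = GENOME_BP / SINGLE_CORE_HOURS

def estimated_hours(cores, parallel_fraction):
    """Amdahl's law: only the parallel fraction of the work speeds up."""
    serial = 1.0 - parallel_fraction
    return SINGLE_CORE_HOURS * (serial + parallel_fraction / cores)

# parallel_fraction=0.9 is a hypothetical value, not a measurement of BRAKER1.
for cores in (1, 4, 8):
    print(cores, "cores:", round(estimated_hours(cores, 0.9), 2), "hours")
print(round(throughput_bp_per_hour / 1e6, 2), "million base pairs per hour")
```

The actual gain depends on how much of the pipeline can run in parallel, which this estimate does not attempt to measure.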

Tools like this help solve many different problems. Accurate annotation of genes in a genome is extremely important. One example is the international 1000 Genomes Project, launched in 2008 with the participation of 75 laboratories and companies, whose first results have already been published. The project sequences the genomes of many people, with particular attention to the coding parts, and identifies rare nucleotide substitutions. It has uncovered a large number of rare gene variants, that is, substitutions in genes, some of which lead to disease. When diagnosing genetic diseases, it is very important to understand which substitutions in gene regions cause disease. In the future this will help doctors diagnose complex conditions such as heart disease, diabetes and cancer.

BRAKER1 makes it possible to work effectively with the genomes of new organisms, speeding up genome annotation and the acquisition of critical knowledge in the life sciences.

Portal "Eternal youth" http://vechnayamolodost.ru
