08 July 2015

Genomics: astronomically big Big Data

DNA banks, not porn sites, turned out to be the fastest-growing segment of the Internet


The fastest-growing segment of the Internet turned out to be not porn sites (obviously, the author made a typical Freudian slip here, on the principle of "everyone talks about what ails them..." – VM :), video hosting or social networks, but genomic data banks and related portals, scientists say in an article published in the journal PLoS One (Stephens et al., Big Data: Astronomical or Genomic? – VM).

"As DNA analysis technologies continue to improve and the cost of the procedure falls, we expect a real explosion in the use of sequencing in everyday life and an accompanying flood of information. The only way to survive it is to improve the computing infrastructure responsible for processing genomic data," said Gene Robinson of the University of Illinois at Urbana–Champaign (in the press release "Genomics among the biggest of Big Data, experts say" – VM).

Robinson, a geneticist by profession, and several mathematicians and programmers decided to assess the scale of this explosion by comparing how several of the most dynamic segments of the global network have developed in recent years – social networks, video hosting and distributed systems for processing scientific information.

The first two were represented by familiar portals, the Twitter microblogging service and the YouTube video hosting site, and the third by a number of projects in astronomy, particle physics and molecular biology.

Contrary to scientists' expectations, the volumes of processed, transmitted and stored information have grown most in recent years not in social networks and video hosting, but in genomic data banks.

For comparison, the former produce about 10-100 petabytes (a petabyte is a million gigabytes) of "original content" every year, which may seem like a very large figure. Genomic databases currently grow by a comparable amount, but their growth rate is many times higher: the volume of new genetic data doubles every seven to eight months.
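To get a feel for what such a doubling time means, here is a small back-of-the-envelope sketch in Python. It is not taken from the article by Stephens et al. It assumes that the doubling period stays constant, which the authors do not claim. The function names `growth_factor` and `months_to_grow` are invented for this illustration.

```python
import math


def growth_factor(months, doubling_months):
    """Factor by which a quantity grows in `months` if it doubles
    every `doubling_months` (assumed constant, for illustration only)."""
    return 2 ** (months / doubling_months)


def months_to_grow(factor, doubling_months):
    """Months needed to grow by `factor` at a constant doubling time."""
    return doubling_months * math.log2(factor)


# How long does it take to grow a thousandfold, for example
# from the petabyte scale to the exabyte scale?
for doubling in (7, 8):
    years = months_to_grow(1000, doubling) / 12
    print(f"doubling every {doubling} months: x1000 in about {years:.1f} years")
```

Under this simplifying assumption, a thousandfold increase takes roughly six to seven years. Real growth depends on sequencing costs and adoption, so these figures only illustrate the arithmetic of doubling and are not a forecast.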

As a result, in just ten years online banks of genomic information will be "putting on weight" at a rate of several exabytes (thousands of petabytes) per year, which will cause a huge number of problems with storing and processing such a mass of data. Most of these problems will be compounded by the fact that biologists, unlike physicists and astronomers, have not yet developed uniform standards for processing, compressing and archiving genomic information.

As the authors of the article explain, geneticists have not yet managed to create an algorithm that would allow them to "throw out" common and insignificant fragments of human DNA. Because of this, storing the genomes of even just the members of the "golden billion" will require storage devices with a capacity of several exabytes, which is a big problem today and will still be difficult in 10 years.

"For a very long time, people have used the adjective 'astronomical' to describe things of truly gigantic scale, volume or size. Having revealed the incredible growth rate of genomic data, my colleagues and I now propose to call such things not 'astronomical' but 'genomic'," concludes Michael Schatz of Cold Spring Harbor Laboratory (USA).

Portal "Eternal youth" http://vechnayamolodost.ru
08.07.2015