HighFCM: Exploring deep Markov models in genomic data compression using sequence pre-analysis


HighFCM is a compression algorithm that pre-analyses the data before compressing it, with the aim of identifying regions of low complexity. This strategy makes it possible to use deeper context models, supported by hash tables, without requiring huge amounts of memory. For example, context depths as large as 32 are attainable for alphabets of four symbols, as is the case for genomic sequences. These deeper context models achieve very high compression on highly repetitive genomic sequences, yielding improvements over previous algorithms. Furthermore, the method is universal, in the sense that it can be applied to any type of textual data (such as quality scores). HighFCM was designed and implemented at IEETA, a research unit of the University of Aveiro, and is available for non-commercial use.
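To illustrate why hash tables make deep contexts affordable, the following is a minimal sketch of an order-k finite-context model over the alphabet {A, C, G, T}. It is not the HighFCM implementation and does not include the pre-analysis step; the function name `estimate_bits`, the smoothing parameter `alpha`, and the choice to charge 2 bits for symbols that lack a full context are all assumptions made for this example.

```python
import math
from collections import defaultdict

SYMBOLS = "ACGT"
INDEX = {s: i for i, s in enumerate(SYMBOLS)}


def estimate_bits(sequence, depth, alpha=1.0):
    """Estimate the code length (in bits) of `sequence` under an
    order-`depth` finite-context model whose counts live in a hash table.

    Returns (total_bits, number_of_stored_contexts).
    Hypothetical helper for illustration only.
    """
    # Only contexts that actually occur get an entry in the table.
    counts = defaultdict(lambda: [0, 0, 0, 0])
    # Each symbol takes 2 bits, so a depth-32 context fits in a 64-bit key.
    mask = (1 << (2 * depth)) - 1
    context = 0
    total_bits = 0.0

    for position, symbol in enumerate(sequence):
        s = INDEX[symbol]
        if position >= depth:
            c = counts[context]
            # Additive smoothing (assumed estimator, parameter alpha).
            p = (c[s] + alpha) / (sum(c) + 4 * alpha)
            total_bits += -math.log2(p)
            c[s] += 1
        else:
            # No full context yet: assume a uniform cost of 2 bits.
            total_bits += 2.0
        # Slide the context window by one symbol.
        context = ((context << 2) | s) & mask

    return total_bits, len(counts)


if __name__ == "__main__":
    bits, stored = estimate_bits("ACGT" * 50, depth=32)
    print(f"{bits:.1f} bits, {stored} contexts stored")
```

A direct table indexed by every possible depth-32 context would need 4^32 = 2^64 entries, whereas the dictionary above stores only the contexts that appear in the input, which is a small set when the sequence is highly repetitive.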


Diogo Pratas and Armando J. Pinho. “Exploring deep Markov models in genomic data compression using sequence pre-analysis”. Proc. of the European Signal Processing Conference, EUSIPCO 2014, Lisboa, Portugal, September 2014.
DOI: to add.