SACO: a lossless compression tool for the sequence alignments found in MAF files.

About SACO

SACO was designed to handle the DNA bases and gap symbols found in MAF files. Our method is based on a mixture of finite-context models. Unlike a recent approach, it addresses the DNA bases and the gap symbols at once, which lets it better exploit the correlations between them. For comparison with previous methods, our algorithm was tested on the multiz28way dataset. On average, it attained 0.94 bits per symbol, approximately 7% better than the previous best, with similar computational complexity. We also tested the model on the most recent dataset, multiz46way. On this dataset, which contains alignments of 46 different species, our compression model achieved an average of 0.72 bits per symbol of the multiple sequence alignment (MSA) blocks.
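To give a rough feel for what "a mixture of finite-context models" and "bits per symbol" mean, here is a minimal Python sketch. It is not SACO's implementation: it works on a single one-dimensional string rather than on the rows and columns of an MSA block, and it only estimates the ideal code length instead of producing a compressed file. The class and function names, the model orders, the smoothing parameter `alpha`, and the forgetting factor `gamma` are all assumptions chosen for illustration.

```python
import math
from collections import defaultdict

ALPHABET = "ACGT-"  # four DNA bases plus the gap symbol, modelled together


class FiniteContextModel:
    """Order-k model: counts the symbols seen after each length-k context."""

    def __init__(self, order, alpha=1.0):
        self.order = order
        self.alpha = alpha  # additive smoothing (illustrative value)
        self.counts = defaultdict(lambda: [0] * len(ALPHABET))

    def _context(self, history):
        # history[-0:] would return the whole string, so handle order 0 apart
        return history[-self.order:] if self.order else ""

    def probs(self, history):
        c = self.counts[self._context(history)]
        total = sum(c) + self.alpha * len(ALPHABET)
        return [(n + self.alpha) / total for n in c]

    def update(self, history, symbol):
        self.counts[self._context(history)][ALPHABET.index(symbol)] += 1


def mixture_bits_per_symbol(sequence, orders=(1, 2, 4), gamma=0.9):
    """Average ideal code length, in bits per symbol, under the mixture."""
    if not sequence:
        raise ValueError("sequence must not be empty")
    models = [FiniteContextModel(k) for k in orders]
    weights = [1.0 / len(models)] * len(models)
    max_order = max(orders)
    total_bits = 0.0
    for i, symbol in enumerate(sequence):
        history = sequence[max(0, i - max_order):i]
        idx = ALPHABET.index(symbol)
        dists = [m.probs(history) for m in models]
        # Mixture probability of the symbol that actually occurred
        p = sum(w * d[idx] for w, d in zip(weights, dists))
        total_bits += -math.log2(p)
        # Favour the models that predicted this symbol well; gamma < 1
        # gradually forgets older performance
        weights = [(w ** gamma) * d[idx] for w, d in zip(weights, dists)]
        norm = sum(weights)
        weights = [w / norm for w in weights]
        for m in models:
            m.update(history, symbol)
    return total_bits / len(sequence)


if __name__ == "__main__":
    print(mixture_bits_per_symbol("ACGT-" * 200))
```

With five symbols, a model that knows nothing spends log2(5), about 2.32 bits, on each symbol. The figures quoted above (0.94 and 0.72 bits per symbol) indicate how far below that baseline the actual method gets on real alignment data.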

Data sets

If you use this software, please cite the following publications:

  • Luís M. O. Matos, Diogo Pratas, and Armando J. Pinho, “A Compression Model for DNA Multiple Sequence Alignment Blocks”, in IEEE Transactions on Information Theory, volume 59, number 5, pages 3189-3198, May 2013. DOI:
  • Luís M. O. Matos, Diogo Pratas, and Armando J. Pinho, “Compression of whole genome alignments using a mixture of finite-context models”, in Proceedings of the International Conference on Image Analysis and Recognition, ICIAR 2012, (Editors: A. Campilho and M. Kamel, volume 2324 of Lecture Notes in Computer Science (LNCS)), pages 359-366, Springer Berlin Heidelberg, Aveiro, Portugal, June 2012. DOI: