QualComp: a new lossy compressor for quality scores based on rate distortion theory
QualComp is a software tool for the lossy compression of the quality scores found in a FASTQ file.
It is written in C and is freely available for download.
The algorithm takes as input a FASTQ file and a user-specified parameter, 'rate' (the number of bits per quality score sequence), to be used by the lossy compressor. The program also lets the user cluster the data prior to compression by specifying the number of clusters as an optional input parameter.
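As an illustration of the input, the short Python sketch below extracts the quality score sequences from FASTQ text. It is not QualComp code. It assumes the common four-line record layout and a Phred+33 ASCII encoding, and the function name `read_quality_scores` and its `offset` parameter are hypothetical.

```python
import io

def read_quality_scores(handle, offset=33):
    """Yield one list of integer quality scores per FASTQ record.

    Assumes four lines per record and a Phred+33 encoding; both are
    assumptions for this sketch, not details taken from QualComp.
    """
    for line_number, line in enumerate(handle):
        if line_number % 4 == 3:  # the fourth line of each record holds the qualities
            yield [ord(ch) - offset for ch in line.rstrip("\r\n")]

# Two tiny made-up records, used only to exercise the parser.
example = io.StringIO("@read1\nACGT\n+\nIIH#\n@read2\nTTGA\n+\n5?I!\n")
scores = list(read_quality_scores(example))
print(scores)
```

Each inner list is one quality score sequence, which is the unit that the 'rate' parameter refers to.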
The algorithm assumes that the quality score sequences are distributed according to a multivariate Gaussian whose mean and covariance matrix are computed from the data to be compressed. Given this model, and using results from rate distortion theory, the lossy compression algorithm allocates the bits optimally so as to minimize the mean squared error (MSE) between the original quality scores and those reconstructed after lossy compression.
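To make the bit allocation idea concrete, here is a minimal Python sketch of the textbook "reverse water-filling" result for independent Gaussian components. It is an illustration under stated assumptions and may differ from what QualComp implements. The per-component variances are taken as given (for a multivariate Gaussian they would typically come from decorrelating the covariance matrix), and the names `reverse_water_filling`, `variances` and `total_bits` are hypothetical.

```python
import math

def reverse_water_filling(variances, total_bits, iterations=100):
    """Split total_bits across independent Gaussian components.

    Components whose variance exceeds the water level get
    0.5 * log2(variance / level) bits; the others get none.
    Returns (bits_per_component, water_level).
    """
    positive = [v for v in variances if v > 0]
    if total_bits <= 0 or not positive:
        return [0.0] * len(variances), max(variances, default=0.0)

    def bits_at(level):
        return [0.5 * math.log2(v / level) if v > level else 0.0
                for v in variances]

    # Bisection on the water level: a lower level spends more bits.
    low, high = 0.0, max(positive)
    for _ in range(iterations):
        mid = (low + high) / 2
        if sum(bits_at(mid)) > total_bits:
            low = mid   # over budget, so raise the water level
        else:
            high = mid  # within budget, so try a lower level
    return bits_at(high), high

bits, level = reverse_water_filling([16.0, 4.0, 1.0], total_bits=3)
print(bits, level)
```

For variances 16, 4 and 1 with a budget of 3 bits, the water level comes out at about 1 and the allocation at about 2, 1 and 0 bits: the high-variance components receive the bits, and the component at the water level is not coded.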
Clustering the data prior to compression makes the model for the quality scores more accurate, which improves the performance of the algorithm by reducing the MSE, especially when the rate used for compression is low.
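The effect of clustering can be illustrated with a small Python sketch. The text above does not name the clustering method, so plain k-means (Lloyd's algorithm) is used here purely as an assumption. The data are made up, and `kmeans` and `squared_error` are hypothetical helper names.

```python
def kmeans(vectors, k, iterations=20):
    """Plain Lloyd's k-means; returns (centroids, labels)."""
    centroids = [list(v) for v in vectors[:k]]  # naive init: the first k vectors
    labels = [0] * len(vectors)
    for _ in range(iterations):
        # Assignment step: attach each vector to its nearest centroid.
        for i, v in enumerate(vectors):
            labels[i] = min(
                range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(v, centroids[c])),
            )
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return centroids, labels

def squared_error(vectors, centroids, labels):
    """Total squared distance from each vector to its assigned centroid."""
    return sum(
        sum((a - b) ** 2 for a, b in zip(v, centroids[lab]))
        for v, lab in zip(vectors, labels)
    )

# Made-up quality score sequences: two high-quality and two low-quality reads.
reads = [[40, 40, 38], [12, 10, 8], [39, 40, 37], [11, 9, 8]]

global_mean = [sum(col) / len(reads) for col in zip(*reads)]
single_model_error = squared_error(reads, [global_mean], [0] * len(reads))

centroids, labels = kmeans(reads, k=2)
clustered_error = squared_error(reads, centroids, labels)
print(single_model_error, clustered_error)
```

In this toy case the spread around two cluster means is far smaller than the spread around a single global mean, which is the sense in which a per-cluster model can fit the data more closely.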
See the reference for more details.
Download data used in the reference