
Compression test for short strings

This post was imported from my old WordPress installation. If there are display problems, broken links or missing images, please just leave a comment here. Thanks.


Data is getting bigger and bigger as technology advances. We didn't even think about using something as inefficient as XML when I started writing source code on a Datapoint 6600 or even in the MS-DOS world, but today? A meg more or less doesn't matter anymore. But sometimes space is still at a premium, and so it is for my current project. I'm working with short strings (text plus some extra chars, about 6 bits per char) and they should be compressed.

The strings are about 50 to 200 bytes and use a limited charset of up to 64 chars. I assume three things:

  1. Any compression attempt will have very little effect on short data like this
  2. Typical compression algorithms can't benefit much from the reduced charset
  3. Building a special compression for my kind of data will most likely give the best result
I already started on number 3 during my current processing work and "compressed" a few common character combinations using very basic replacement rules, but all known common structures have been used up; anything starting from here must be able to accept all kinds of data. Taking advantage of the 6-bit limited charset isn't possible because any (8-bit) compression result must be recoded to 6 bits again to get a valid transportation format.
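
For illustration, a minimal sketch of what such replacement rules could look like. The sequences and shortcut codes below are made up for this example; the real rules are specific to my data and use codes that are otherwise unused in the 64-char alphabet:

# Hypothetical replacement table: frequent character combinations are mapped
# to single codes that don't occur in the payload alphabet.
my %replace = (
    'ae' => '!',
    'ei' => '"',
    'oo' => '#',
);

sub shrink {
    my $string = shift;
    for my $seq (keys %replace) {
        my $code = $replace{$seq};
        $string =~ s/\Q$seq\E/$code/g;   # replace every occurrence
    }
    return $string;
}

# The reverse mapping (code back to sequence) restores the original string.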

CPAN offers a huge number of compression options and I want to try some of them on my data:

  • BZip2 is one of the best compression algorithms today, maybe the best one, but I don't think it'll be very good at short strings
  • LZ4, LZF and LZV1 are based on the same LZ compression strategy
  • LZW is part of the same LZ family but has two operation modes: 12 and 16 bit. The 12-bit mode is made for smaller strings and should get better results than the 16-bit mode
  • Snappy is a pretty new compression algorithm developed by Google, optimized for fast processing instead of high ratios
  • ZLib is good old stuff but is still used by many applications

My test script

I wrote a short test script for comparing the different CPAN modules:
#!/usr/bin/perl

use Time::HiRes qw(time);
use Compress::Raw::Bzip2;
use Compress::Raw::Zlib;
use Compress::LZF ();
use Compress::LZW ();
use Compress::LZ4 ();
use Compress::LZV1 ();
use Compress::Snappy ();

my %compressions = (
    bzip2 => sub {
        my $bz = Compress::Raw::Bzip2->new;
        my $output;
        $bz->bzdeflate(shift, $output);
        my $o;
        $bz->bzflush($o);
        $output .= $o;
        $bz->bzclose($o);
        $output .= $o;
        return $output;
    },
    zlib => sub {
        my $zl = Compress::Raw::Zlib::Deflate->new;
        my $output;
        $zl->deflate(shift, $output);
        my $o;
        $zl->flush($o);
        $output .= $o;
        return $output;
    },
    lzf    => sub { return Compress::LZF::compress(shift); },
    lzw16  => sub { return Compress::LZW::compress(shift); },
    lzw12  => sub { return Compress::LZW::compress(shift, 12); },
    lz4    => sub { return Compress::LZ4::compress(shift); },
    lzv1   => sub { return Compress::LZV1::compress(shift); },
    snappy => sub { return Compress::Snappy::compress(shift); },
);

my @teststrings = (
    'This is a simple English text test string.',
    'uM4iec8eiy2oPahfeiwo',
    'Taeph7jaiJaa5Fi1eek8',
    'aec7amoh9Oibei0paesh',
    'ahYeh2xiahanai6eiF4j',
    'OhMee6Ishai2Oxoo4Ooy',
    'aechoopoocoo4Vief3EiraR9thooha',
    'eeSh8yieliequeexohthie1eic1cha',
    'iequiLei7yo5sheev6aepaik8ahlid',
    'Ma3Aixoesooreikeevuighie5ohQua',
    'xoa9theV8Dooyee3Xi5uZeivee4che',
    'aish5izitae2OhFeehai5quoZai9AhChaenai9du',
    # Some other string samples up to 200 bytes each
);

print "Algo.\tbest\tavg.\tworst\ttime\n";for my $compress_name (sort keys %compressions) { my $clear_size; my $compress_size; my $best; my $worst; my $start = time; for my $string (@teststrings) { my $compressed = &{$compressions{$compress_name}}($string); $clear_size += length($string); $compress_size += length($compressed); my $ratio = length($compressed) / length($string); $best = $ratio if !defined $best or $ratio < $best; $worst = $ratio if !defined $worst or $ratio > $worst; } my $duration = time - $start; printf "%s\t%.03f\t%.03f\t%.03f\t%.05f\n",$compress_name, (100 * $best),(100 * $compress_size / $clear_size),(100 * $worst), $duration;}

All compression modules are loaded at script start and each module's compression call is wrapped in a single sub that takes the input string and returns the compressed data. The test strings are defined in a simple array.
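
For example, compressing a single string with one algorithm is just a hash lookup and a sub call (a quick usage sketch, not part of the script itself):

my $input      = 'This is a simple English text test string.';
my $compressed = $compressions{lz4}->($input);
printf "lz4: %d -> %d bytes\n", length($input), length($compressed);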

Finally the script loops through all compression options sorted by their name and throws all the test strings at each module. The total uncompressed string length and the total compressed string length are counted. The best and worst compression ratios are also determined.

The time required for all strings is recorded, but for information only. A single run of each compression on each string will be very inaccurate because everything runs so quickly that many other things might influence the results. I prefer the Benchmark module from CPAN for real timing analysis.
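
If I wanted real numbers, a run with Benchmark could look roughly like this (a sketch only, reusing the %compressions hash and @teststrings from the script above):

use Benchmark qw(cmpthese);

my %benchmarks;
for my $name (keys %compressions) {
    my $sub = $compressions{$name};
    # Each benchmark iteration compresses the whole test set once.
    $benchmarks{$name} = sub { $sub->($_) for @teststrings };
}

# Run every variant for at least 2 CPU seconds and print a comparison chart.
cmpthese(-2, \%benchmarks);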

The results

Algo.    best     avg.     worst    time
bzip2    185.500  235.390  545.000  0.00340
lz4      102.778  106.230  130.000  0.00015
lzf       98.333  101.023  105.000  0.00045
lzv1     100.500  101.053  105.000  0.00016
lzw12    116.250  126.758  150.000  0.02354
lzw16    155.000  168.674  200.000  0.01340
snappy    98.333  102.981  110.000  0.00052
zlib      77.000   92.302  140.000  0.01070
BZip2 is a great algorithm but clearly fails on my test strings: the "compression" grows the data by between 85 and 445%.

I'm a little surprised by zlib, but it seems to be the best solution for the sample strings and the only one that actually produced smaller data on average.

Real data - another surprise

I can't publish my "real" test data because those strings belong to a private project, but here are the compression test results, which are even more surprising than those from the sample strings:
Algo.    best     avg.     worst    time
bzip2    354.545  379.310  397.674  0.00159
lz4      113.636  113.646  113.953  0.00008
lzf      102.273  102.274  102.326  0.00057
lzv1     102.273  102.274  102.326  0.00018
lzw12    146.512  149.743  150.000  0.00803
lzw16    195.349  199.560  200.000  0.00410
snappy   104.545  104.549  104.651  0.00010
zlib     118.182  118.195  118.605  0.00385
My very basic self-made compression seems to be good enough to make the data incompressible; maybe the strings are just too short to be compressed at all. The LZ family grows the strings by about 2% and even zlib isn't working any longer.

Bad news: I can't do any further compression. The very different results are surprising, but I expected the final outcome, as compression is no magic at all and incompressible strings stay incompressible.

Finally: if you want to choose a compression algo, don't rely on anybody else's results; grab your own data samples and do your own tests.


4 comments. Leave a comment

  1. simon

    It is a well known fact that general compression techniques don't work well for short strings. smaz might help in this case: https://github.com/antirez/smaz

  2. Sebastian

    I was looking for a Perl algorithm and really wanted to try Smaz, but I didn't find a Perl module for it. It's optimized for English text according to its homepage, and my data (the samples in the blog post as well as the real-world data) isn't text at all.
    Anything below 100% would have been ok for me; I didn't expect 50 or 80% compression.

  3. Trush_No_One

    The best option available for Perl right now is to use one of the zlib modules with a shared dictionary.
    See http://stackoverflow.com/questions/479218/how-to-compress-small-strings for the explanation.
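
    A minimal sketch of that idea with Compress::Raw::Zlib; the dictionary string here is only a placeholder, it would have to be built from a real corpus, and both sides must use the identical dictionary:

    use Compress::Raw::Zlib;

    # Placeholder preset dictionary: common substrings taken from your own corpus.
    my $dict = 'aechoopoocooquee';

    my ($deflate, $status) = Compress::Raw::Zlib::Deflate->new(-Dictionary => $dict);
    die "deflate init failed: $status" unless $deflate;

    my $output;
    $deflate->deflate('eeSh8yieliequeexohthie1eic1cha', $output);
    my $tail;
    $deflate->flush($tail);
    $output .= $tail;

    # Decompression only works with the identical dictionary.
    my ($inflate) = Compress::Raw::Zlib::Inflate->new(-Dictionary => $dict);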

  4. We are currently extending SMAZ with a new statistical coder that might work for your restricted-alphabet data. It requires a once-off corpus of training data so that it knows what kind of stuff it is compressing (this is when the program is compiled), and from there it just compresses strings. Currently we are seeing about 50% reductions for Twitter messages. See http://github.com/servalproject/smaz

Leave a comment
