
Choosing the Best Archive Format for Source Code Distribution

The Problem

When preparing a source distribution of an open source package, an important goal is to make it easily available to as many users as possible, including those who are not very technically inclined. Developers sometimes ask the users of their programs to rebuild the application from sources in order to use the latest version, which is not yet available in binary form (of course, this wouldn't happen in an ideal world, but in practice it does, albeit more rarely nowadays).

There are two aspects to doing this. First, the distribution should be available in some common format, as many users wouldn't be willing to install an exotic unpacking tool just to be able to unpack the download. Second, the distribution should ideally be as small as possible, because many people still don't have real broadband network connections and may be discouraged by having to download very large files.

Unfortunately, the two considerations pull in opposite directions: the newest archive formats offer the best compression and hence the smallest file sizes, while the oldest ones are the ones that are truly ubiquitous. Hence we are interested in finding out whether using newer archive formats and programs is worth it.

Test and Results

We used several different archival programs under a Debian Lenny GNU/Linux system to compress the contents of the wxWidgets 2.9.2 release. The obvious candidates were zip, which is by far the most common format used under Microsoft Windows systems, and gzip and bzip2, which are the traditional compression formats for the various Unix systems. The main contender for an alternative format for Windows users is 7zip, and we also tested its more Unix-oriented alternative called xz. Finally, lzop was included simply as a kind of baseline which is expected to be surpassed by all the others, as lzop is designed to be fast first and foremost rather than to provide maximal compression.

The uncompressed tar file containing the wxWidgets sources is approximately 115MiB in size. The table below shows the sizes of the compressed files in the different formats as a percentage of this size, as well as the time required for compression and decompression. While the former is not especially important, as it only needs to be done once, we want to verify that the latter remains reasonable for all formats.

Format  Size                 Compression        Decompression
        (% of the original)  (time in seconds)  (time in seconds)
tar     100.0                 1.5               0.5
lzop     28.8                 0.5               0.7
zip      23.2                 2.6               1.5
gzip     19.3                 3.9               0.7
bzip2    14.5                10.3               2.4
xz       12.1                35.1               1.1
7zip     11.8                20.1               2.9

Please note that the times for lzop, gzip, bzip2 and xz don't include creating the tar file or extracting from it, and so are not directly comparable with the times for zip and 7zip. Besides, even though the times were averaged over 5 runs after discarding the slowest and the fastest, they are probably still not very precise, because the effects of disk caching and other activity on the system are impossible to estimate. They should therefore be considered as merely indicative: they show that compressing this amount of data is fast enough not to be a problem, at least on a modern multi-processor machine. In any case, the column that really counts in the table above is the one with the sizes.
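The measurement procedure described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the script used for the tests: it relies on the gzip, bz2 and lzma modules of the Python standard library as stand-ins for the command-line gzip, bzip2 and xz programs, it works on data held in memory, and the names benchmark, timed and trimmed_mean are hypothetical helpers introduced for this example.

```python
import bz2
import gzip
import lzma
import time

# Standard library codecs used as stand-ins for the command-line tools;
# their default settings may differ from those of the programs tested.
CODECS = {
    "gzip": (gzip.compress, gzip.decompress),
    "bzip2": (bz2.compress, bz2.decompress),
    "xz": (lzma.compress, lzma.decompress),
}


def trimmed_mean(samples):
    """Average the samples after discarding the fastest and the slowest."""
    if len(samples) < 3:
        raise ValueError("need at least 3 samples")
    kept = sorted(samples)[1:-1]
    return sum(kept) / len(kept)


def timed(func, arg, runs):
    """Call func(arg) several times; return (result, trimmed mean seconds)."""
    samples = []
    result = None
    for _ in range(runs):
        start = time.perf_counter()
        result = func(arg)
        samples.append(time.perf_counter() - start)
    return result, trimmed_mean(samples)


def benchmark(data, runs=5):
    """Map codec name to (size in % of original, compress s, decompress s)."""
    results = {}
    for name, (compress, decompress) in CODECS.items():
        packed, c_time = timed(compress, data, runs)
        unpacked, d_time = timed(decompress, packed, runs)
        if unpacked != data:
            raise RuntimeError(f"{name} round trip altered the data")
        results[name] = (100.0 * len(packed) / len(data), c_time, d_time)
    return results
```

To try it, read the bytes of an uncompressed tar file and pass them to benchmark(); each entry of the result corresponds to one row of a table like the one above. As with the original measurements, the times obtained this way are only indicative.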

Conclusion

The results of the tests show quite clearly, if not unexpectedly, that using LZMA compression, as both 7zip and xz do, is advantageous. The difference between bzip2 and 7zip is less than 3 percentage points of the original size and, while this is still quite respectable, the real gap is between the traditional zip and 7z formats: the latter is about half the size (about 13MiB instead of 26MiB) and this can clearly make a difference for bandwidth-constrained users.
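The arithmetic behind these figures is easy to reproduce from the table. The sketch below assumes the approximate 115MiB size of the uncompressed tar file and the percentages listed above; the names size_mib and relative_saving are hypothetical helpers introduced for this example.

```python
# Approximate size of the uncompressed tar file, as stated in the text.
ORIGINAL_MIB = 115

# Compressed sizes in percent of the original, copied from the table.
SIZE_PERCENT = {
    "lzop": 28.8,
    "zip": 23.2,
    "gzip": 19.3,
    "bzip2": 14.5,
    "xz": 12.1,
    "7zip": 11.8,
}


def size_mib(fmt):
    """Approximate size in MiB of the archive in the given format."""
    return ORIGINAL_MIB * SIZE_PERCENT[fmt] / 100


def relative_saving(new, old):
    """Fraction by which the `new` format is smaller than the `old` one."""
    return 1 - SIZE_PERCENT[new] / SIZE_PERCENT[old]


if __name__ == "__main__":
    for fmt in SIZE_PERCENT:
        print(f"{fmt:6} {size_mib(fmt):5.1f} MiB")
    print(f"7zip vs zip: {relative_saving('7zip', 'zip'):.0%} smaller")
```

With these inputs, zip comes out at roughly 26.7MiB and 7zip at roughly 13.6MiB, close to the rounded figures quoted above; since the 115MiB starting point is itself approximate, the results should not be read as exact file sizes.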

It's probably too early to adopt xz as an alternative format for Unix distributions. We can notice, however, that the relative gain of xz over bzip2 (files about 17% smaller) is not much less than that of bzip2 over gzip (about 25% smaller), and the latter was enough for bzip2 to almost completely displace gzip on many systems, so switching to xz might be just a question of time. For Windows users, on the other hand, the size benefits of the 7z format already seem important enough today to justify providing it in addition to the ubiquitous zip, and we will continue to do so for wxWidgets distributions in the near future.
