How file compression works, what are archive files

What is file archiving

File archiving means combining multiple files into a single file for easier management of the data (e.g. backup, sharing by email attachment, FTP, torrent, cloud, or any other kind of network service). The host filesystem then treats all the data as a single file rather than as multiple ones, eliminating the overhead of handling multiple objects: for each single file, locating the physical data on disk, locating possible fragments, checking file-level security permissions, and so on.
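As a minimal sketch of archiving without compression, the following Python snippet packs two in-memory files into a single TAR byte string using the standard tarfile module. The file names and contents are hypothetical, chosen only for illustration.

```python
import io
import tarfile

def make_archive(files):
    """Pack a {name: bytes} mapping into one uncompressed TAR byte string."""
    buffer = io.BytesIO()
    # Mode "w" writes a plain TAR: files are combined, not compressed.
    with tarfile.open(fileobj=buffer, mode="w") as archive:
        for name, data in files.items():
            info = tarfile.TarInfo(name=name)
            info.size = len(data)
            archive.addfile(info, io.BytesIO(data))
    return buffer.getvalue()

def list_archive(blob):
    """Return the member names stored in a TAR byte string."""
    with tarfile.open(fileobj=io.BytesIO(blob), mode="r") as archive:
        return archive.getnames()

# Hypothetical sample files.
files = {"notes.txt": b"hello", "data.csv": b"a,b\n1,2\n"}
blob = make_archive(files)
print(len(blob), list_archive(blob))
```

Note that the archive is larger than the sum of its members: headers and padding are added, and nothing is compressed at this stage.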

What is file compression

File compression means reducing the size of data on disk by encoding it to a smaller output, employing various strategies to efficiently map (in most cases) a larger input to a smaller output, e.g. using statistical analysis to reduce redundancy in the input data.
Data compression, too, predates the development of the ZIP standard: once the input files were merged into a single output archive, the operation was often followed by lossless data compression to reduce the size of the archive, using various utilities available at the time such as SQ (DOS, CP/M), CRUNCH (CP/M), and compress (Unix).
The TAR format, for example, is still an uncompressed archive standard and relies on external compressors, nowadays usually GZ (fast Deflate-based compression, same as in the ZIP format), BZ2 (more powerful compression), XZ (modern, very powerful LZMA-based compression, the default compression algorithm used in the 7Z format), BR (Google's Brotli, a modern, very fast compressor), and ZST (Facebook's Zstandard, another modern, very fast compressor).
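To illustrate how different compressors shrink the same redundant input, here is a small sketch using three codecs from the Python standard library. The sample data is a hypothetical, deliberately repetitive string; Brotli and Zstandard are left out because they are not assumed to be installed.

```python
import bz2
import gzip
import lzma

# Hypothetical sample input: highly redundant text, so every codec can shrink it.
data = b"the quick brown fox jumps over the lazy dog\n" * 2000

compressors = {
    "gz": gzip.compress,    # Deflate, as in ZIP
    "bz2": bz2.compress,    # BZip2
    "xz": lzma.compress,    # LZMA2 in an XZ container (module default)
}

sizes = {name: len(func(data)) for name, func in compressors.items()}
for name, size in sizes.items():
    print(f"{name}: {len(data)} -> {size} bytes")
```

Actual ratios depend heavily on the input; real files are rarely as redundant as this sample.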

Learn more about similarities and differences in the Lossy and lossless data compression paragraph. For general-purpose compressed archive files, however, compression means lossless compression: a reversible 1:1 mapping of the input to a smaller output.

SEA's ARC format (1985) combined archival and (lossless) compression in a single pass, providing probably the first example of a general-purpose archive manager, which allowed users both to spare storage for backup and to save upload and download bandwidth (and time) for sharing, at the time mainly through BBSes.
A few years later, after a controversy with SEA about alleged derived work in PKARC, Phil Katz superseded his previous work by releasing PKZIP, which enjoyed great success due to multiple factors, such as superior speed and efficiency, the specs being released into the public domain, and relatively few competitors in years of fast PC market expansion.

Lossy and lossless data compression definition

How lossy and lossless compression works

Data compression can be defined as lossy or lossless, depending on whether the compression process is reversible, i.e. whether the original information is lost or preserved in the process. The two types of algorithms have different pros and cons, and different fields of application.

Lossless compression definition, file archiving

Lossless compression uses statistical models to map the input to a smaller output, eliminating redundancy in the data.
In this way the output carries exactly all the information of the input in fewer bytes, and can be expanded when needed to a 1:1 copy of the original data (restoring exactly the original content), which is a fundamental property for storing some types of data, e.g. software or a database.
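The round-trip property can be checked directly. The sketch below compresses a hypothetical byte string with zlib (a Deflate implementation from the Python standard library), expands it again, and verifies that the result is byte-for-byte identical to the input.

```python
import hashlib
import zlib

# Hypothetical stand-in for a file whose every byte matters (e.g. a program).
original = bytes(range(256)) * 64 + b"repeated record;" * 500

compressed = zlib.compress(original, level=9)
restored = zlib.decompress(compressed)

# Lossless means the restored bytes are identical, so the digests match too.
same_bytes = restored == original
same_digest = hashlib.sha256(restored).digest() == hashlib.sha256(original).digest()
print(len(original), len(compressed), same_bytes, same_digest)
```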

For this reason lossless compression algorithms are used for data backup and for the archive file formats handled by general-purpose archive manager utilities, like 7Z, RAR, and ZIP, where an exact and reversible image of the original data must be saved.
Examples of lossless compression algorithms are Deflate (used e.g. in ZIP and GZ formats), BZip2 (used in BZ2 format), PPMd (RAR, 7Z formats), and LZMA / LZMA2 (7Z / XZ formats).

Some graphic file formats (notably PNG and Deflate-compressed TIFF) use lossless compression, which usually results in less compression but no image quality degradation after multiple cycles of modification and saving of the picture, making this kind of image format suitable for intermediate save files in image editing tools.

Lossy compression definition, multimedia data compression

Lossy compression, instead, works by identifying unnecessary or less relevant information (not just redundant data) and removing it.

Unlike lossless compression, the amount of information to compress is effectively reduced.
The loss of information / content is irreversible and, depending on the nature of the algorithm, will likely happen each time the content is modified and saved to a lossy file format, e.g. when editing a lossy JPEG image and saving it multiple times to intermediate work files.

In this way the data compression ratio is improved, but at the cost of losing part of the information: lossy compression is a non-reversible process, and is a suitable choice only when, by design, the original content does not need to be restored.
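The irreversibility can be shown with a toy example. The sketch below is not how JPEG or MP3 work (real codecs rely on transforms and perceptual models); it is a deliberately simplified quantizer, with hypothetical function names and sample values, that keeps only the coarse level of each 8-bit sample.

```python
def lossy_encode(samples, step=16):
    """Keep only the coarse level of each 8-bit sample (toy quantizer)."""
    return [s // step for s in samples]

def lossy_decode(levels, step=16):
    """Rebuild approximate samples at the middle of each quantization bucket."""
    return [level * step + step // 2 for level in levels]

# Hypothetical 8-bit samples.
original = [0, 3, 17, 18, 100, 101, 102, 250, 255]
encoded = lossy_encode(original)
decoded = lossy_decode(encoded)

# Largest deviation introduced by dropping the fine detail.
max_error = max(abs(a - b) for a, b in zip(original, decoded))
print(encoded)
print(decoded, max_error)
```

Different inputs (such as 17 and 18) collapse to the same encoded level, so no decoder can tell them apart afterwards: the fine detail is gone for good, although the result stays close to the original.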

Lossy compression is consequently not suitable for general-purpose file archiving (for example, losing a single byte of an executable file could make it stop working), but it works very well when the loss of less relevant information is acceptable, as in graphic and multimedia file compression: for example, MP3 drops audio information below the audibility threshold, JPEG drops details that are not visible in images, and compressed video formats such as MPEG (AVI, MKV, MPG, MP4...) do both.

The most common lossy compression algorithms are consequently fine-tuned for the specific patterns of a multimedia data type.
File types already compressed with lossy algorithms will compress poorly, if at all, when added to archives that use general-purpose compression algorithms: already compressed data has little redundancy left to remove.
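The effect of compressing already compressed data can be sketched as follows. The sample text is hypothetical, and random bytes are used, as an assumption, as a stand-in for media already squeezed by a lossy codec, since both offer very little redundancy to a general-purpose compressor.

```python
import os
import random
import zlib

# Hypothetical sample text: random picks from a small vocabulary, so it is
# redundant without being a trivially repeating pattern.
words = [b"archive", b"file", b"data", b"zip", b"lossy", b"lossless",
         b"stream", b"backup", b"byte", b"ratio", b"codec", b"block"]
rng = random.Random(0)
text = b" ".join(rng.choice(words) for _ in range(10_000))

# First pass: plenty of redundancy to remove.
once = zlib.compress(text, 9)
# Second pass: the compressed stream has little redundancy left.
twice = zlib.compress(once, 9)

# Random bytes stand in (as an assumption) for already compressed media.
noise = os.urandom(50_000)
noise_packed = zlib.compress(noise, 9)

print(len(text), len(once), len(twice))
print(len(noise), len(noise_packed))
```

The second pass gains almost nothing and may even add a few bytes of container overhead, which is why archive managers often offer a "store" (no compression) option for such files.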

Due to the lossy nature of those compression schemes, however, professional editing work is usually performed, whenever feasible, on uncompressed data (e.g. WAV audio or TIFF images) or on data compressed in a lossless way (e.g. FLAC audio or PNG images), so that saving the work in progress multiple times does not lose bits of information each time, with progressive degradation of quality. Lossy compression is usually reserved for the final step, to create a reasonably sized output to distribute for media consumption.

Lossy vs lossless compression

Lossy and lossless compression algorithms are so different in scope that they cannot really be put in direct competition.
When the original content needs to be restored completely on decompression (binary files, raw data), lossless, fully reversible compression is the only option. When some degree of data loss is acceptable (e.g. finalizing work on multimedia files such as MP3 audio, MPEG video, JPEG graphics), the advantages of lossy compression over lossless compression in terms of speed and maximum compression ratio are generally so evident that lossy, non-reversible compression is the only viable choice to meet size and/or performance constraints.

Read the lossless compression and lossy compression definitions on Wikipedia.

What compressed archive files are

What is a ZIP file

The ZIP format is a lossless data compression and archival format created in 1989 by Phil Katz, implemented for the first time in PKWARE's PKZIP.
The ZIP file format specifications were released into the public domain and the format had long-lasting success, to the point that "zip" is often used colloquially for any generic compressed archive, and many package formats are based on Deflate compression and/or the same or very similar specs: Java JAR / WAR / EAR, Android APK, Apple iOS IPA files (iPhone and iPad devices), Microsoft CAB and Office compound files.
WinZip 12.1 (2009) introduced the new ZIPX file format specifications, identifying a new archive standard which supports newer and more powerful compression algorithms.
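As a minimal sketch of how a classic ZIP archive combines archiving and Deflate compression, the snippet below builds a small archive in memory with Python's standard zipfile module and reads back the per-file entries. Member names and contents are hypothetical.

```python
import io
import zipfile

buffer = io.BytesIO()
# ZIP_DEFLATED selects Deflate, the classic ZIP compression method.
with zipfile.ZipFile(buffer, "w", compression=zipfile.ZIP_DEFLATED) as archive:
    archive.writestr("readme.txt", "hello " * 200)
    archive.writestr("docs/manual.txt", "chapter one\n" * 100)

# Reopen the same bytes and inspect the per-file entries.
with zipfile.ZipFile(io.BytesIO(buffer.getvalue())) as archive:
    entries = {i.filename: (i.file_size, i.compress_size) for i in archive.infolist()}
    first = archive.read("readme.txt").decode()

print(entries)
```

Each member is compressed individually and recorded with its original and compressed size, which is what allows a single file to be extracted without unpacking the whole archive.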

What are RAR, ACE, 7Z files

During the '90s and beyond, multiple alternative archival standards emerged, such as ARJ, RAR (1993), ACE, and 7Z (1999), introducing unique features to distinguish them from the growing number of competitors, for example:

usually, a stronger compression ratio than ZIP at the cost of slower operation, a disadvantage that could be paid off by the faster transfer (especially on slow and public networks) of a smaller output file
multi-volume archival, spanning the output over multiple files to meet constraints such as mail attachment size limits
encryption, to protect the end user's privacy if the file is stolen or passes through insecure servers (unencrypted public network, or any third-party controlled channel such as a mail server or remote storage service)
error detection and error correction (as implemented in ARC and RAR formats), to prevent extraction in the event data gets corrupted (e.g. faulty connection, damaged disk) and to attempt recovery from known good data.
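Of the features listed above, error detection is the simplest to sketch. The example below appends a CRC-32 checksum to a payload and verifies it later; the pack and is_intact helpers and the byte layout are hypothetical and do not reproduce any specific archive format. Error correction, which needs additional recovery data, is not covered.

```python
import zlib

def pack(payload):
    """Append a CRC-32 checksum (4 bytes, big-endian) to the payload."""
    return payload + zlib.crc32(payload).to_bytes(4, "big")

def is_intact(blob):
    """Recompute the checksum and compare it with the stored one."""
    payload, stored = blob[:-4], int.from_bytes(blob[-4:], "big")
    return zlib.crc32(payload) == stored

blob = pack(b"important archive content")
damaged = bytearray(blob)
damaged[3] ^= 0x01          # flip one bit to simulate a transfer error
print(is_intact(blob), is_intact(bytes(damaged)))
```

An extractor that finds a checksum mismatch can refuse to write the corrupted file instead of silently producing bad data.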

Evolution of file archiving formats

Archival file formats tend to be geared more towards powerful, computing-intensive features that enhance manageability of data (high compression, strong encryption) than towards the ability to work on live data (rapid read and write access) as filesystems do, even if some archive management utilities offer various mechanisms to add or remove data inside archives, and to update or sync files already in the archive.
More choice in standards brought users more features and healthy competition between standards (see comparison of archive formats) and implementations, but also created the need for flexible multi-purpose archival applications, like PeaZip, to deal with the different formats users may encounter and to make full use of the features of each supported file format.
In recent years the trend has been shifting toward very fast and highly efficient compression algorithms, aiming to minimize the compression and decompression overhead in data transmission and achieve near real-time speed, with algorithms like Google's Brotli (BR file format) and Facebook's Zstandard (ZST file format).
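The trade-off between speed and compression ratio can be explored with a rough sketch. Brotli and Zstandard are not assumed to be available here, so zlib compression levels are used as a stand-in; the measure helper and the sample data are hypothetical, and timings will vary between machines.

```python
import random
import time
import zlib

def measure(data, levels=(1, 6, 9)):
    """Return {level: (compressed_size, seconds)} for each zlib level."""
    results = {}
    for level in levels:
        start = time.perf_counter()
        packed = zlib.compress(data, level)
        results[level] = (len(packed), time.perf_counter() - start)
    return results

# Hypothetical sample data: random picks from a small vocabulary.
rng = random.Random(1)
vocabulary = [b"fast", b"small", b"stream", b"packet", b"frame", b"cache"]
data = b" ".join(rng.choice(vocabulary) for _ in range(50_000))

results = measure(data)
for level, (size, seconds) in results.items():
    print(f"level {level}: {size} bytes in {seconds * 1000:.2f} ms")
```

Lower levels spend less time searching for redundancy, which is the kind of compromise that matters when compression has to keep up with a network stream.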

More online resources about file archiving and compression formats: AR (Unix), LBR (CP/M), SEA company, WinZip's ZIPX standard, Google Brotli, Facebook Zstandard project pages.

Synopsis: What is an archive file? What is a compressed file? What is a zip file? How lossy and lossless data compression work. What are the advantages and disadvantages of non-reversible and reversible algorithms. Compare definitions and compressed file types. Lossy vs lossless compression. How do 7Z, RAR, ZIP files work? How does file compression work? How does file archiving work? What do file archiving and file compression mean?

Topics: how data compression works, lossy and lossless compression, what is an archive file, what is a zip or a rar file
