How file compression works, what are archive files
What is file archiving
File archiving means combining multiple files into a single file for easier management of the data (e.g. backup, or sharing by email attachment, FTP, torrent, cloud, or any other kind of network service). For the host filesystem all the data is then treated as a single file rather than as multiple ones, which eliminates the overhead of handling multiple objects: for each single file, locating the physical data on disk, locating possible fragments, checking file level security permissions, and so on.
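As a minimal sketch of this idea, assuming Python and its standard tarfile module, several files can be packed into one archive object. The file names and the helpers build_archive and list_archive are illustrative only and do not come from any specific archive manager.

```python
import io
import tarfile


def build_archive(files: dict[str, bytes]) -> bytes:
    """Pack several named byte strings into one uncompressed TAR archive."""
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w") as archive:
        for name, data in files.items():
            info = tarfile.TarInfo(name=name)
            info.size = len(data)  # the header must declare the member size
            archive.addfile(info, io.BytesIO(data))
    return buffer.getvalue()


def list_archive(blob: bytes) -> list[str]:
    """Return the member names stored inside the archive."""
    with tarfile.open(fileobj=io.BytesIO(blob), mode="r") as archive:
        return archive.getnames()


# Hypothetical input files, used only for illustration.
sample_files = {"notes.txt": b"hello", "data/report.csv": b"a,b\n1,2\n"}
archive_blob = build_archive(sample_files)
print(len(archive_blob), list_archive(archive_blob))
```

Note that the archive is larger than the sum of its members, because of headers and padding: archiving alone groups the data, it does not shrink it.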
What is file compression
File compression means reducing the size of data on disk by encoding it to a smaller output, employing various strategies to efficiently map (most cases of) a larger input to a smaller output, e.g. using statistical analysis to reduce redundancy in the input data.
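The effect of redundancy can be sketched with Python's standard zlib module (a Deflate implementation). The two inputs below are made up for illustration: one is highly repetitive, the other is random bytes, which stand in for the cases that cannot be mapped to a smaller output.

```python
import os
import zlib

redundant = b"ABCD" * 25_000        # 100,000 bytes, highly repetitive
random_like = os.urandom(100_000)   # 100,000 bytes with no pattern to exploit

packed_redundant = zlib.compress(redundant, 9)
packed_random = zlib.compress(random_like, 9)

print("redundant input :", len(redundant), "->", len(packed_redundant))
print("random input    :", len(random_like), "->", len(packed_random))
```

The repetitive input shrinks to a tiny fraction of its size, while the random input stays at roughly its original size.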
Data compression, too, predates the development of the ZIP standard: once the input files were merged into a single output archive, the operation was often followed by lossless data compression to reduce the size of the archive, using utilities available at the time such as SQ (DOS, CP/M), CRUNCH (CP/M), and compress (Unix).
The TAR format, for example, is still an uncompressed archive standard and relies on external compressors, nowadays usually GZ (fast Deflate based compression, same as in the ZIP format), BZ2 (more powerful compression), XZ (modern, very powerful LZMA based compression, the default compression algorithm used in the 7Z format), BR, Google's Brotli (modern, very fast compressor), and ZST, Facebook's Zstandard (another modern, very fast compressor).
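The two-stage approach (archive first, then compress the whole archive with an external compressor) can be sketched as follows, assuming Python's standard tarfile, gzip, bz2 and lzma modules. The helper make_tar and the sample content are hypothetical. Brotli and Zstandard are left out because this sketch is limited to modules long available in the standard library.

```python
import bz2
import gzip
import io
import lzma
import tarfile


def make_tar(files: dict[str, bytes]) -> bytes:
    """Stage 1: merge the files into one uncompressed TAR archive."""
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w") as archive:
        for name, data in files.items():
            info = tarfile.TarInfo(name=name)
            info.size = len(data)
            archive.addfile(info, io.BytesIO(data))
    return buffer.getvalue()


text = b"The quick brown fox jumps over the lazy dog.\n" * 2000
tar_blob = make_tar({"a.txt": text, "b.txt": text})

# Stage 2: pass the whole archive through a separate compressor.
sizes = {
    "tar": len(tar_blob),
    "tar.gz": len(gzip.compress(tar_blob)),
    "tar.bz2": len(bz2.compress(tar_blob)),
    "tar.xz": len(lzma.compress(tar_blob)),  # lzma defaults to the XZ container
}
for label, size in sizes.items():
    print(f"{label:8} {size} bytes")
```

The resulting sizes depend heavily on the input, so the ranking of the compressors on this artificial sample should not be read as a benchmark.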
Learn more about similarities and differences in the Lossy and lossless data compression paragraph. For general purpose compressed archive files, however, compression means lossless compression, a 1:1, fully reversible mapping of the input to a smaller output.
SEA's ARC format (1985) combined archival and (lossless) compression in a single pass, providing probably the first example of a general purpose archive manager, which allowed users both to spare storage for backup and to save upload and download bandwidth (and time) for sharing, at the time mainly through BBS.
A few years later, after a controversy with SEA about alleged derived work in PKARC, Phil Katz superseded the previous works by releasing PKZIP, which met great success due to multiple factors: superior speed and efficiency, specifications released into the public domain, and relatively few competitors in years of fast PC market expansion.
Lossy and lossless data compression definition
How lossy and lossless compression works
Data compression can be defined as lossy or lossless, depending on whether the compression process is reversible, that is, whether the original information is preserved or partly lost in the process. The two types of algorithms have different pros and cons, and different fields of application.
Lossless compression definition, file archiving
Lossless compression uses statistical models to map the input to a smaller output, eliminating redundancy in the data.
In this way the output carries exactly all the information contained in the input in fewer bytes, and can be expanded when needed to a 1:1 copy of the original data (restoring exactly the original content), which is a fundamental property for storing some types of data, e.g. software or a database.
For this reason lossless compression algorithms are used for data backup and for the archive file formats used in general purpose archive manager utilities, like 7Z, RAR, and ZIP, where an exact and reversible image of the original data must be saved.
Examples of lossless compression algorithms are Deflate (used e.g. for ZIP and GZ formats), BZip2 (used in BZ2 format), PPMd (RAR, 7Z formats), LZMA / LZMA2 (7Z / XZ format).
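The reversibility property can be checked directly. The sketch below assumes Python's standard zlib, bz2, lzma and hashlib modules. PPMd is omitted because it is not in the standard library, and the sample data is invented for illustration.

```python
import bz2
import hashlib
import lzma
import zlib

# Made-up sample: the kind of data where every byte matters.
original = ("INSERT INTO users VALUES (1, 'alice');\n" * 500).encode("utf-8")
original_digest = hashlib.sha256(original).hexdigest()

codecs = {
    "Deflate (zlib)": (zlib.compress, zlib.decompress),
    "BZip2": (bz2.compress, bz2.decompress),
    "LZMA (XZ container)": (lzma.compress, lzma.decompress),
}

results = {}
for name, (compress, decompress) in codecs.items():
    packed = compress(original)
    restored = decompress(packed)
    # A lossless round trip must give back a byte-for-byte identical copy.
    identical = hashlib.sha256(restored).hexdigest() == original_digest
    results[name] = (len(packed), identical)
    print(f"{name:20} {len(original)} -> {len(packed)} bytes, identical: {identical}")
```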
Lossy compression definition, multimedia data compression
Lossy compression, instead, works by identifying unnecessary or less relevant information (not just redundant data) and removing it.
Unlike lossless compression, the amount of information to compress is effectively reduced.
The loss of information / content is irreversible and, depending on the nature of the algorithm, will likely happen each time the content is modified and saved to a lossy file format, e.g. when editing a lossy JPEG image and saving it multiple times to intermediate work files.
In this way the data compression ratio is improved, but at the cost of losing part of the information: lossy compression is a non reversible process, which makes it a suitable choice only when the original content is not meant, by design, to be restored again.
Lossy compression is consequently not suitable for general purpose file archiving (for example, losing a single byte of an executable file could make it stop working), but it works very well when the loss of less relevant information is acceptable, as for graphic and multimedia file compression: for example MP3 discards audio information below the audibility threshold, JPEG discards details that are not visible in images, and compressed video formats such as MPEG (AVI, MKV, MPG, MP4...) do both.
The most common lossy compression algorithms are consequently fine tuned for the specific patterns of a multimedia data type. Since their output is already compressed, file types saved with lossy algorithms will compress poorly (or not at all) if added to archive files compressed with general purpose compression algorithms.
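The poor compressibility of already compressed data can be illustrated by compressing the same content twice. This is a sketch using Python's standard zlib module. The compressed stream stands in for any already compressed file, and the word list is an arbitrary illustration.

```python
import random
import zlib

# Arbitrary vocabulary used to generate text-like, moderately redundant data.
WORDS = ["archive", "backup", "file", "data", "lossy", "lossless", "ratio",
         "speed", "volume", "stream", "block", "header", "format", "zip",
         "tar", "disk", "network", "upload", "download", "size", "byte",
         "image", "audio", "video"]

rng = random.Random(42)  # fixed seed so the run is repeatable
text = " ".join(rng.choices(WORDS, k=20_000)).encode("ascii")

once = zlib.compress(text, 9)   # first pass removes most of the redundancy
twice = zlib.compress(once, 9)  # second pass has almost nothing left to remove

print("original        :", len(text))
print("compressed once :", len(once))
print("compressed twice:", len(twice))
```

The second pass brings essentially no gain and may even add a few bytes of overhead.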
Due to the lossy nature of those compression schemes, however, professional editing work is usually performed, whenever feasible, on non compressed data (e.g. WAV audio, or TIFF images) or on data compressed in a lossless way (e.g. FLAC audio, or PNG images), so that saving the work in progress multiple times does not lose bits of information each time, with progressive degradation of quality. Lossy compression is usually reserved for the final step, to create a reasonably sized output to distribute for media consumption.
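The progressive degradation described above can be sketched with a toy model. Plain quantization (rounding each sample to the nearest multiple of a step) is used here as a stand-in for a real lossy codec, which is far more sophisticated. The function names, sample values and step sizes are all assumptions made for illustration.

```python
def save_lossy(samples: list[int], step: int) -> list[int]:
    """Toy lossy save: keep only the nearest multiple of `step` per sample."""
    return [round(s / step) * step for s in samples]


def total_error(reference: list[int], candidate: list[int]) -> int:
    """Sum of absolute differences against the original samples."""
    return sum(abs(a - b) for a, b in zip(reference, candidate))


original = [0, 3, 7, 12, 18, 25, 33, 42, 52, 63]

# Workflow A: intermediate work file saved in lossy form, then exported again.
first_save = save_lossy(original, 8)
second_save = save_lossy(first_save, 6)

# Workflow B: work kept lossless, lossy step applied only once at the end.
final_only = save_lossy(original, 6)

print("after first lossy save  :", total_error(original, first_save))
print("after second lossy save :", total_error(original, second_save))
print("single final lossy save :", total_error(original, final_only))
```

In this toy run the error grows with the second lossy save, and the single final export stays closest to the original. Nothing in the saved values allows the discarded detail to be recovered.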
Lossy vs lossless compression
Lossy and lossless compression algorithms are so different in scope that they cannot really be put in direct competition.
When the original content needs to be restored completely on decompression (binary files, raw data), lossless, fully reversible compression is the only option. When some degree of data loss is acceptable (e.g. finalizing work on multimedia files such as MP3 audio, MPEG video, JPEG graphics), the advantages of lossy compression over lossless compression in terms of speed and maximum compression ratio are generally so evident that lossy, non reversible compression is the only viable choice to meet size and/or performance constraints.
Read the lossless compression and lossy compression definitions on Wikipedia.
What compressed archive files are
What is a ZIP file
ZIP format is a lossless data compression and archival format created in 1989 by Phil Katz, implemented for the first time in PKWARE's PKZIP.
The ZIP file format specifications were released into the public domain and the format had long and lasting success, to the point that "zip" is often used colloquially for any generic compressed archive, and many package formats are based on Deflate compression and/or the same or very similar specs: Java JAR / WAR / EAR, Android APK, Apple iOS IPA files (iPhone and iPad devices), Microsoft CAB and Office compound files.
WinZip 12.1 (2009) introduced the new ZIPX file format specifications for identifying a new archive standard which supports newer and more powerful compression algorithms.
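How a ZIP file combines archiving and Deflate compression in one container can be sketched with Python's standard zipfile module. The helper build_zip and the member names are hypothetical. ZIP_STORED (no compression) is included only to compare against ZIP_DEFLATED.

```python
import io
import zipfile


def build_zip(files: dict[str, bytes], method: int) -> bytes:
    """Write all files into one in-memory ZIP archive using the given method."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, mode="w", compression=method) as archive:
        for name, data in files.items():
            archive.writestr(name, data)
    return buffer.getvalue()


files = {
    "readme.txt": b"ZIP stores each member separately.\n" * 300,
    "config.ini": b"[core]\nverbose = true\n" * 300,
}

stored = build_zip(files, zipfile.ZIP_STORED)      # archive only
deflated = build_zip(files, zipfile.ZIP_DEFLATED)  # archive + Deflate

with zipfile.ZipFile(io.BytesIO(deflated)) as archive:
    # Each member records its own original and compressed size.
    member_report = [(i.filename, i.file_size, i.compress_size)
                     for i in archive.infolist()]
    restored = {name: archive.read(name) for name in archive.namelist()}

print(len(stored), len(deflated), member_report)
```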
What are RAR, ACE, 7Z files
During the '90s and beyond, multiple alternative archival standards emerged, such as ARJ, RAR (1993), ACE, and 7Z (1999), introducing unique features to distinguish them from the growing number of competitors, for example:
- usually, stronger compression ratio than ZIP at the cost of slower operation, a disadvantage that could be paid off by the shorter transfer time (especially on slow and public networks) of the smaller output file
- multi volume archival, spanning the output over multiple files to meet constraints such as mail attachment size limits
- encryption, to protect the end user's privacy if the file is stolen, or passes through insecure servers (unencrypted public network, or any third party controlled channel such as a mail server or remote storage service)
- error detection and error correction (as implemented in ARC and RAR formats), to prevent extraction when data gets corrupted (e.g. faulty connection, damaged disk) and to attempt recovery from known good data.
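Two of the features listed above, multi volume spanning and error detection, can be sketched in a simplified form with Python's standard zlib.crc32. The helpers split_volumes and join_volumes, the payload and the volume size are hypothetical. Real formats add volume headers and, for error correction, recovery data that this sketch does not attempt to reproduce.

```python
import zlib


def split_volumes(blob: bytes, volume_size: int) -> list[bytes]:
    """Span one archive over several fixed-size volumes (last one may be shorter)."""
    return [blob[i:i + volume_size] for i in range(0, len(blob), volume_size)]


def join_volumes(volumes: list[bytes]) -> bytes:
    """Rebuild the archive by concatenating the volumes in order."""
    return b"".join(volumes)


payload = bytes(range(256)) * 40   # 10,240 bytes standing in for an archive
checksum = zlib.crc32(payload)     # stored alongside the data when archiving

volumes = split_volumes(payload, 4096)
rebuilt = join_volumes(volumes)
print("volumes:", [len(v) for v in volumes])
print("intact :", zlib.crc32(rebuilt) == checksum)

# Simulate corruption in transit: flip one bit in the second volume.
damaged_volume = bytearray(volumes[1])
damaged_volume[100] ^= 0x01
corrupted = join_volumes([volumes[0], bytes(damaged_volume), volumes[2]])
print("corrupted copy passes check:", zlib.crc32(corrupted) == checksum)
```

A checksum like this only detects the damage. Repairing it requires the extra redundancy that error correction features store in the archive.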
Evolution of file archiving formats
Archival file formats tend to be geared more towards powerful, computing intensive features that enhance manageability of data (high compression, strong encryption) than towards the ability to work on live data (rapid read and write access) as filesystems do, even if some archive management utilities offer various mechanisms to add or remove data inside archives, and to update or sync files already in the archive.
More choice in standards brought users more features and healthy competition between standards (see comparison of archive formats) and implementations, but also brought the need for flexible multi-purpose archival applications, like PeaZip, to deal with the different formats users may encounter, and to make full use of the potential of the features of the different supported file formats.
In recent years the trend has been shifting toward very fast and highly efficient compression algorithms, aiming to minimize the compression and decompression overhead in data transmission and achieve near real-time speed, with algorithms like Google's Brotli (BR file format) and Facebook's Zstandard (ZST file format).
More online resources about file archiving and compression formats: AR (Unix), LBR (CP/M), SEA company, WinZip's ZIPX standard, Google Brotli, and Facebook Zstandard project pages.