diff options
Diffstat (limited to 'docs/handbook/neozip/overview.md')
| -rw-r--r-- | docs/handbook/neozip/overview.md | 509 |
1 files changed, 509 insertions, 0 deletions
diff --git a/docs/handbook/neozip/overview.md b/docs/handbook/neozip/overview.md new file mode 100644 index 0000000000..acf32a427e --- /dev/null +++ b/docs/handbook/neozip/overview.md @@ -0,0 +1,509 @@ +# Neozip Overview + +## What Is Neozip? + +Neozip is Project Tick's fork of **zlib-ng**, which is itself a modernized, +performance-oriented fork of the venerable zlib compression library. Neozip +provides a drop-in replacement for zlib with significantly improved throughput +on modern hardware while retaining full API and format compatibility with the +original zlib 1.3.1 specification. + +The library implements the **DEFLATE** compressed data format (RFC 1951), +wrapped in either the **zlib** container (RFC 1950) or the **gzip** container +(RFC 1952). It also exposes raw deflate streams without any wrapper. + +Neozip tracks upstream zlib-ng closely. At the time of writing, the embedded +version strings are: + +```c +#define ZLIBNG_VERSION "2.3.90" +#define ZLIB_VERSION "1.3.1.zlib-ng" +``` + +--- + +## Why Neozip Exists + +The original zlib library was written in the early 1990s when CPUs had very +different performance characteristics. While zlib is extremely portable and +well-tested, it leaves significant performance on the table on modern +processors because: + +1. **No SIMD utilisation** — zlib's inner loops (match finding, checksumming, + sliding the hash window) are scalar C targeting 32-bit architectures. +2. **Conservative data structures** — hash chain lengths, buffer sizes, and + alignment are tuned for machines with tiny caches. +3. **No runtime CPU feature detection** — the same compiled binary cannot + select between SSE2 and AVX-512 code paths at runtime. + +Neozip (via zlib-ng) addresses every one of these issues while maintaining +byte-for-byte compatible output with zlib for any given set of compression +parameters (when the `ZLIB_COMPAT` build option is enabled). + +--- + +## Feature List + +### Core Compression and Decompression + +| Feature | Description | +|---|---| +| DEFLATE compression (RFC 1951) | Full implementation of LZ77 + Huffman coding | +| DEFLATE decompression | State-machine inflater with optimised fast paths | +| zlib wrapper (RFC 1950) | Adler-32 integrity, two-byte header | +| gzip wrapper (RFC 1952) | CRC-32 integrity, file metadata header | +| Raw deflate | No wrapper, caller handles integrity | +| Compression levels 0–9 | From stored (level 0) through maximum compression (level 9) | +| Multiple strategies | `Z_DEFAULT_STRATEGY`, `Z_FILTERED`, `Z_HUFFMAN_ONLY`, `Z_RLE`, `Z_FIXED` | +| Streaming API | Process data in arbitrarily-sized chunks via `deflate()` / `inflate()` | +| One-shot API | `compress()` / `uncompress()` for simple in-memory use | +| gzip file I/O | `gzopen()`, `gzread()`, `gzwrite()`, `gzprintf()`, etc. | +| Dictionary support | Pre-seed compression / decompression with a shared dictionary | + +### Performance Optimisations + +| Optimisation | Details | +|---|---| +| Runtime CPU detection | `cpu_features.c` queries CPUID (x86), `/proc/cpuinfo` (ARM), etc. | +| Function dispatch table | `functable.c` selects the best implementation for each hot function | +| x86 SSE2 | `slide_hash`, `compare256`, `chunkset`, `inflate_fast`, CRC-32 Chorba | +| x86 SSSE3 | `adler32`, `chunkset`, `inflate_fast` | +| x86 SSE4.1 | CRC-32 Chorba SSE4.1 variant | +| x86 SSE4.2 | `adler32_copy` | +| x86 PCLMULQDQ | Carryless-multiply CRC-32 | +| x86 AVX2 | `adler32`, `compare256`, `chunkset`, `slide_hash`, `inflate_fast`, `longest_match` | +| x86 AVX-512 | `adler32`, `compare256`, `chunkset`, `inflate_fast`, `longest_match` | +| x86 AVX-512 VNNI | `adler32` using VPDPBUSD | +| x86 VPCLMULQDQ | Vectorised CRC-32 with AVX2 and AVX-512 widths | +| ARM NEON | `adler32`, `compare256`, `chunkset`, `slide_hash`, `inflate_fast`, `longest_match` | +| ARM CRC32 extension | Hardware CRC-32 instructions | +| ARM PMULL+EOR3 | Polynomial multiply CRC-32 with SHA3 three-way XOR | +| ARMv6 SIMD | `slide_hash` for 32-bit ARM | +| PowerPC VMX/VSX | `adler32`, `slide_hash`, `chunkset`, `inflate_fast` | +| POWER8/9 | Optimised Adler-32, CRC-32, compare256 | +| RISC-V RVV | Vector extensions for core loops | +| RISC-V Zbc | Bit-manipulation CRC-32 | +| IBM z/Architecture DFLTCC | Hardware deflate/inflate in a single instruction | +| LoongArch LSX/LASX | SIMD for CRC-32 and general loops | + +### Algorithmic Improvements + +| Improvement | Details | +|---|---| +| Quick deflate (level 1) | Intel-designed single-pass strategy (`deflate_quick.c`) | +| Medium deflate (levels 3-6) | Intel-designed strategy bridging fast and slow (`deflate_medium.c`) | +| Chorba CRC-32 | Modern CRC algorithm by Kadatch & Jenkins with braided and SIMD variants | +| 64-bit bit buffer | `bi_buf` is `uint64_t` instead of `unsigned long`, reducing flush frequency | +| Unified memory allocation | Single `zalloc` call for all deflate/inflate buffers, cache-line aligned | +| LIT_MEM mode | Separate distance/length buffers for platforms without fast unaligned access | +| Rolling hash for level 9 | `insert_string_roll` for better match quality at maximum compression | + +### Build System + +| Feature | Details | +|---|---| +| CMake (≥ 3.14) | Primary build system with extensive option detection | +| C11 standard | Default; C99, C17, C23 also supported | +| zlib-compat mode | `ZLIB_COMPAT=ON` produces a drop-in `libz` replacement | +| Native mode | `ZLIB_COMPAT=OFF` produces `libz-ng` with `zng_` prefixed API | +| Static and shared libraries | Both targets generated | +| Google Test suite | Comprehensive C++ test suite under `test/` | +| Fuzz targets | Under `test/fuzz/` for OSS-Fuzz integration | +| Benchmark suite | Google Benchmark harnesses under `test/benchmarks/` | +| Sanitizer support | ASan, MSan, TSan, UBSan integration via `WITH_SANITIZER` | +| Code coverage | `WITH_CODE_COVERAGE` for lcov/gcov | + +--- + +## Repository Structure + +The neozip source tree is organised as follows: + +``` +neozip/ +├── CMakeLists.txt # Top-level build configuration +├── deflate.c / deflate.h # Core compression engine +├── deflate_fast.c # Level 1-2 (or 2-3) fast strategy +├── deflate_medium.c # Level 3-6 medium strategy (Intel) +├── deflate_slow.c # Level 7-9 lazy/slow strategy +├── deflate_quick.c # Level 1 quick strategy (Intel) +├── deflate_stored.c # Level 0 stored (no compression) +├── deflate_huff.c # Huffman-only strategy +├── deflate_rle.c # Run-length encoding strategy +├── deflate_p.h # Private deflate inline helpers +├── inflate.c / inflate.h # Decompression state machine +├── inflate_p.h # Private inflate inline helpers +├── infback.c # Inflate with caller-provided window +├── inftrees.c / inftrees.h # Huffman code table builder for inflate +├── inffast_tpl.h # Fast inflate inner loop template +├── inffixed_tbl.h # Fixed Huffman tables for inflate +├── trees.c / trees.h # Huffman tree construction for deflate +├── trees_emit.h # Bit emission macros for tree output +├── trees_tbl.h # Static Huffman tree tables +├── adler32.c # Adler-32 checksum entry points +├── adler32_p.h # Scalar Adler-32 implementation +├── crc32.c # CRC-32 checksum entry points +├── crc32_braid_p.h # Braided CRC-32 configuration +├── crc32_braid_comb.c # CRC-32 combine logic +├── crc32_chorba_p.h # Chorba CRC-32 algorithm +├── compress.c # One-shot compress() +├── uncompr.c # One-shot uncompress() +├── gzlib.c # gzip file I/O common code +├── gzread.c # gzip file reading +├── gzwrite.c # gzip file writing +├── gzguts.h # gzip internal definitions +├── zlib.h.in # Public API header (zlib-compat mode) +├── zlib-ng.h.in # Public API header (native mode) +├── zbuild.h # Build-system defines, compiler abstraction +├── zutil.h / zutil.c # Internal utility functions +├── zutil_p.h # Private utility helpers +├── zendian.h # Endianness detection and byte-swap macros +├── zmemory.h # Aligned memory read/write helpers +├── zarch.h # Architecture detection macros +├── cpu_features.c / .h # Runtime CPU feature detection dispatch +├── functable.c / .h # Runtime function pointer dispatch table +├── arch_functions.h # Architecture-specific function declarations +├── arch_natives.h # Native (compile-time) arch selection +├── insert_string.c # Hash table insert implementations +├── insert_string_p.h # Private insert_string helpers +├── insert_string_tpl.h # Insert string template macros +├── match_tpl.h # Longest-match template (compare256 based) +├── chunkset_tpl.h # Chunk memory-set template +├── compare256_rle.h # RLE-optimised compare256 +├── arch/ # Architecture-specific SIMD implementations +│ ├── generic/ # Portable C fallbacks +│ ├── x86/ # SSE2, SSSE3, SSE4, AVX2, AVX-512, PCLMULQDQ +│ ├── arm/ # NEON, CRC32 extension, PMULL +│ ├── power/ # VMX, VSX, POWER8, POWER9 +│ ├── s390/ # IBM z DFLTCC +│ ├── riscv/ # RVV, Zbc +│ └── loongarch/ # LSX, LASX +├── test/ # GTest test suite, fuzz targets, benchmarks +├── cmake/ # CMake modules (intrinsic detection, etc.) +├── doc/ # Upstream documentation +├── tools/ # Utility scripts +└── win32/ # Windows-specific files +``` + +--- + +## Data Formats + +Neozip processes three container formats, all built on top of the same DEFLATE +compressed data representation: + +### Raw Deflate (RFC 1951) + +A sequence of DEFLATE blocks with no framing. The caller is responsible for +any integrity checking. Selected by passing a negative `windowBits` value +(e.g., `-15`) to `deflateInit2()` / `inflateInit2()`. + +### zlib Format (RFC 1950) + +``` ++---+---+ +---+---+---+---+ +| CMF|FLG| | DATA | ... +---+---+---+---+ ++---+---+ +---+---+---+---+ | ADLER-32 | + +---+---+---+---+ +``` + +- **CMF** (Compression Method and Flags): method = 8 (deflate), window size +- **FLG**: check bits, optional preset dictionary flag (`FDICT`) +- **DATA**: raw deflate blocks +- **ADLER-32**: checksum of uncompressed data (big-endian) + +Overhead: 6 bytes (`ZLIB_WRAPLEN`). + +### gzip Format (RFC 1952) + +``` ++---+---+---+---+---+---+---+---+---+---+ +-------+ +---+---+---+---+---+---+---+---+ +|ID1|ID2| CM|FLG| MTIME |XFL| OS | | DATA | | CRC-32 | ISIZE | ++---+---+---+---+---+---+---+---+---+---+ +-------+ +---+---+---+---+---+---+---+---+ +``` + +- **ID1, ID2**: Magic bytes `0x1f`, `0x8b` +- **CM**: Compression method (8 = deflate) +- **FLG**: Flags for text, CRC, extra, name, comment +- **MTIME**: Modification time (Unix epoch) +- **XFL**: Extra flags (2 = best compression, 4 = fastest) +- **OS**: Operating system code +- **DATA**: Raw deflate blocks +- **CRC-32**: CRC of uncompressed data +- **ISIZE**: Uncompressed size mod 2^32 + +Overhead: 18 bytes (`GZIP_WRAPLEN`). + +--- + +## Compilation Modes + +### zlib-Compatible Mode (`ZLIB_COMPAT=ON`) + +When built with `-DZLIB_COMPAT=ON`: + +- The library is named `libz` (no suffix). +- All public symbols use standard zlib names: `deflateInit`, `inflate`, `crc32`, etc. +- The `z_stream` structure uses `unsigned long` for `total_in` / `total_out`. +- Header file is `zlib.h`. +- Symbol prefix macro `PREFIX()` expands to `z_` (via mangling headers). +- The `ZLIB_COMPAT` preprocessor macro is defined. +- gzip file operations (`WITH_GZFILEOP`) are forced on. + +This is the mode to use when neozip must be a transparent drop-in replacement +for system zlib. + +### Native Mode (`ZLIB_COMPAT=OFF`) + +When built with `-DZLIB_COMPAT=OFF` (the default): + +- The library is named `libz-ng`. +- All public symbols use `zng_` prefixed names: `zng_deflateInit`, `zng_inflate`, etc. +- The `zng_stream` structure uses fixed-width types (`uint32_t`). +- Header file is `zlib-ng.h`. +- Symbol prefix macro `PREFIX()` expands to `zng_`. +- No `ZLIB_COMPAT` macro is defined. + +Native mode is recommended for new code. Types are cleaner and there is no +ambiguity about which zlib implementation is in use. + +--- + +## Compression Levels and Strategies + +### Compression Levels + +Neozip maps each compression level (0–9) to a specific **strategy function** +and a set of tuning parameters defined in the `configuration_table` in +`deflate.c`: + +```c +static const config configuration_table[10] = { +/* good lazy nice chain */ +/* 0 */ {0, 0, 0, 0, deflate_stored}, /* store only */ +/* 1 */ {0, 0, 0, 0, deflate_quick}, /* quick strategy */ +/* 2 */ {4, 4, 8, 4, deflate_fast}, +/* 3 */ {4, 6, 16, 6, deflate_medium}, +/* 4 */ {4, 12, 32, 24, deflate_medium}, +/* 5 */ {8, 16, 32, 32, deflate_medium}, +/* 6 */ {8, 16, 128, 128, deflate_medium}, +/* 7 */ {8, 32, 128, 256, deflate_slow}, +/* 8 */ {32, 128, 258, 1024, deflate_slow}, +/* 9 */ {32, 258, 258, 4096, deflate_slow}, +}; +``` + +The `config` fields are: +- **good_length** — reduce lazy search above this match length +- **max_lazy** — do not perform lazy search above this match length +- **nice_length** — quit search above this match length +- **max_chain** — maximum hash chain length to traverse +- **func** — pointer to the strategy function + +### Strategy Functions + +| Strategy | Levels | Source File | Description | +|---|---|---|---| +| `deflate_stored` | 0 | `deflate_stored.c` | No compression; copies input as stored blocks | +| `deflate_quick` | 1 | `deflate_quick.c` | Fastest compression; static Huffman, minimal match search | +| `deflate_fast` | 2 (or 1–3 without quick) | `deflate_fast.c` | Greedy matching, no lazy evaluation | +| `deflate_medium` | 3–6 | `deflate_medium.c` | Balanced: limited lazy evaluation, match merging | +| `deflate_slow` | 7–9 | `deflate_slow.c` | Full lazy evaluation, deepest hash chain search | +| `deflate_huff` | (Z_HUFFMAN_ONLY) | `deflate_huff.c` | Huffman-only, no LZ77 matching | +| `deflate_rle` | (Z_RLE) | `deflate_rle.c` | Run-length encoding, distance always 1 | + +### Explicit Strategies + +The `strategy` parameter to `deflateInit2()` can override the default level-based +selection: + +- **`Z_DEFAULT_STRATEGY` (0)** — Normal deflate; level selects the function. +- **`Z_FILTERED` (1)** — Optimised for data produced by a filter (e.g., PNG predictors). + Uses `deflate_slow` with short match rejection. +- **`Z_HUFFMAN_ONLY` (2)** — No LZ77; every byte is a literal. +- **`Z_RLE` (3)** — Only find runs of identical bytes (distance = 1). +- **`Z_FIXED` (4)** — Use fixed Huffman codes instead of dynamic trees. + +--- + +## Memory Layout + +Neozip uses a **single-allocation** strategy for both deflate and inflate +states. The function `alloc_deflate()` in `deflate.c` computes the total +buffer size required and calls `zalloc` exactly once, then partitions the +returned memory into: + +1. **Window buffer** — Aligned to `WINDOW_PAD_SIZE` (64 or 4096 bytes depending + on architecture). Size: `2 * (1 << windowBits)`. +2. **prev array** — `Pos` (uint16_t) array of size `1 << windowBits`. Aligned to 64 bytes. +3. **head array** — `Pos` array of size `HASH_SIZE` (65536). Aligned to 64 bytes. +4. **pending_buf** — Output bit buffer of size `lit_bufsize * LIT_BUFS + 1`. Aligned to 64 bytes. +5. **deflate_state** — The `internal_state` struct itself. Aligned to 64 bytes + (cache-line aligned via `ALIGNED_(64)`). +6. **deflate_allocs** — Book-keeping struct tracking the original allocation pointer. + +The `inflate_state` uses an analogous scheme via `alloc_inflate()`: + +1. **Window buffer** — `(1 << MAX_WBITS) + 64` bytes with `WINDOW_PAD_SIZE` alignment. +2. **inflate_state** — The state struct, 64-byte aligned. +3. **inflate_allocs** — Book-keeping. + +This approach minimises the number of `malloc` calls, improves cache locality, +and simplifies cleanup (a single `zfree` releases everything). + +--- + +## Thread Safety + +Neozip is thread-safe under the following conditions: + +1. Each `z_stream` instance is accessed from only one thread at a time. +2. The `zalloc` and `zfree` callbacks are thread-safe (the defaults use + `malloc` / `free`, which are thread-safe on all supported platforms). + +The function dispatch table (`functable`) uses atomic stores during +initialisation: + +```c +#define FUNCTABLE_ASSIGN(VAR, FUNC_NAME) \ + __atomic_store(&(functable.FUNC_NAME), &(VAR.FUNC_NAME), __ATOMIC_SEQ_CST) +#define FUNCTABLE_BARRIER() __atomic_thread_fence(__ATOMIC_SEQ_CST) +``` + +This ensures that even if multiple threads call `deflateInit` / `inflateInit` +concurrently, the function table is initialised safely. + +--- + +## Version Identification + +The library provides several ways to query version information: + +```c +const char *zlibVersion(void); // Returns "1.3.1.zlib-ng" in compat mode +const char *zlibng_version(void); // Returns "2.3.90" + +// Compile-time constants +#define ZLIBNG_VERSION "2.3.90" +#define ZLIBNG_VERNUM 0x02039000L +#define ZLIB_VERSION "1.3.1.zlib-ng" +#define ZLIB_VERNUM 0x131f +``` + +In compat mode, `deflateInit` and `inflateInit` verify that the header version +matches the library version to prevent ABI mismatches: + +```c +#define CHECK_VER_STSIZE(version, stream_size) \ + (version == NULL || version[0] != ZLIB_VERSION[0] || \ + stream_size != (int32_t)sizeof(PREFIX3(stream))) +``` + +--- + +## Licensing + +Neozip inherits the zlib/libpng license from both zlib and zlib-ng: + +> This software is provided 'as-is', without any express or implied warranty. +> Permission is granted to anyone to use this software for any purpose, +> including commercial applications, and to alter it and redistribute it freely, +> subject to the following restrictions: [...] + +See `LICENSE.md` in the neozip source tree for the full text. + +--- + +## Key Differences from Upstream zlib + +| Area | zlib 1.3.1 | Neozip (zlib-ng) | +|---|---|---| +| Bit buffer width | 32-bit `unsigned long` | 64-bit `uint64_t` | +| Hash table size | 32768 entries (15 bits) | 65536 entries (16 bits) | +| Match buffer format | Overlaid `sym_buf` only | `LIT_MEM` option for separate `d_buf`/`l_buf` | +| Hash function | Three-byte rolling | Four-byte CRC-based or multiplicative | +| SIMD acceleration | None | Extensive (see Performance Optimisations) | +| CPU detection | None (compile-time only) | Runtime `cpuid` / feature detection | +| Memory allocation | Multiple `zalloc` calls | Single allocation, cache-aligned | +| Minimum match length | 3 (`STD_MIN_MATCH`) | Internally uses `WANT_MIN_MATCH = 4` for speed | +| Quick strategy | None | `deflate_quick` for level 1 | +| Medium strategy | None | `deflate_medium` for levels 3–6 | +| Data structure alignment | None | `ALIGNED_(64)` on key structs | +| Build system | Makefile / CMake | CMake primary with full feature detection | + +--- + +## Quick Start + +### Building + +```bash +cd neozip +mkdir build && cd build +cmake .. -DCMAKE_BUILD_TYPE=Release +cmake --build . -j$(nproc) +``` + +### Using in a CMake Project + +```cmake +find_package(zlib-ng CONFIG REQUIRED) +target_link_libraries(myapp PRIVATE zlib-ng::zlib-ng) +``` + +Or in zlib-compat mode: + +```cmake +find_package(ZLIB CONFIG REQUIRED) +target_link_libraries(myapp PRIVATE ZLIB::ZLIB) +``` + +### Minimal Compression Example + +```c +#include <zlib-ng.h> +#include <string.h> +#include <stdio.h> + +int main(void) { + const char *source = "Hello, Neozip! This is a test of compression."; + size_t source_len = strlen(source); + + size_t dest_len = zng_compressBound(source_len); + unsigned char *dest = malloc(dest_len); + + int ret = zng_compress(dest, &dest_len, (const unsigned char *)source, source_len); + if (ret == Z_OK) { + printf("Compressed %zu bytes to %zu bytes\n", source_len, dest_len); + } + + unsigned char *recovered = malloc(source_len + 1); + size_t recovered_len = source_len; + zng_uncompress(recovered, &recovered_len, dest, dest_len); + recovered[recovered_len] = '\0'; + printf("Recovered: %s\n", recovered); + + free(dest); + free(recovered); + return 0; +} +``` + +--- + +## Further Reading + +- [Architecture](architecture.md) — Module-by-module breakdown of the source +- [Building](building.md) — Complete CMake option reference +- [Deflate Algorithms](deflate-algorithms.md) — LZ77 match finding and strategies +- [Inflate Engine](inflate-engine.md) — Decompression state machine +- [Huffman Coding](huffman-coding.md) — Tree construction and bit emission +- [Checksum Algorithms](checksum-algorithms.md) — CRC-32 and Adler-32 details +- [Hardware Acceleration](hardware-acceleration.md) — CPU detection and dispatch +- [x86 Optimizations](x86-optimizations.md) — SSE/AVX/PCLMULQDQ implementations +- [ARM Optimizations](arm-optimizations.md) — NEON and CRC32 extension +- [Gzip Support](gzip-support.md) — gzip file I/O layer +- [API Reference](api-reference.md) — Full public API documentation +- [Performance Tuning](performance-tuning.md) — Benchmarking and tuning guide +- [Testing](testing.md) — Test suite reference +- [Code Style](code-style.md) — Coding conventions |
