# Performance Tuning

## Overview

Neozip offers multiple controls for trading compression ratio against speed: compression level, strategy, window size, memory level, hardware acceleration, and buffer sizing. This guide describes how each knob affects performance and when to use them.

---

## Compression Level

The `level` parameter (0–9) selects the deflate strategy function and its internal tuning parameters via the `configuration_table`:

```c
static const config configuration_table[10] = {
/*      good_length lazy nice max_chain func */
/* 0 */ {0,    0,   0,    0, deflate_stored},  // No compression
/* 1 */ {0,    0,   0,    0, deflate_quick},   // Fastest (Intel)
/* 2 */ {4,    4,   8,    4, deflate_fast},    // Fast greedy
/* 3 */ {4,    6,  32,   32, deflate_fast},
/* 4 */ {4,    4,  16,   16, deflate_medium},  // Balanced (Intel)
/* 5 */ {8,   16,  32,   32, deflate_medium},
/* 6 */ {8,   16, 128,  128, deflate_medium},  // Default
/* 7 */ {8,   32, 128,  256, deflate_slow},    // Slow lazy
/* 8 */ {32, 128, 258, 1024, deflate_slow},
/* 9 */ {32, 258, 258, 4096, deflate_slow},    // Maximum
};
```

| Parameter | Effect |
|---|---|
| `good_length` | Reduce match search when match ≥ this length |
| `max_lazy` | Don't try lazy match if current match ≥ this |
| `nice_length` | Stop searching once match ≥ this length |
| `max_chain` | Maximum hash chain steps to search |

### Level Selection Guide

| Use Case | Recommended Level | Rationale |
|---|---|---|
| Real-time streaming | 1 | `deflate_quick`: static Huffman, minimal search |
| Network compression | 2–3 | `deflate_fast`: greedy match, short chains |
| General purpose | 6 (default) | `deflate_medium`: good ratio/speed balance |
| Archival storage | 9 | `deflate_slow`: full lazy evaluation, deep chains |
| Pre-compressed data | 0 | `deflate_stored`: passthrough with framing |

### Speed vs. Ratio Tradeoffs

Approximate throughput (x86_64 with AVX2, single core):

| Level | Compression Speed | Ratio (typical) |
|---|---|---|
| 0 | ~5 GB/s | 1.00 (none) |
| 1 | ~800 MB/s | 2.0–2.5:1 |
| 3 | ~400 MB/s | 2.2–2.8:1 |
| 6 | ~150 MB/s | 2.5–3.2:1 |
| 9 | ~30 MB/s | 2.6–3.4:1 |

Decompression speed is largely independent of the compression level (~1–2 GB/s), since it only depends on the encoded stream, not the search strategy.

---

## Strategy Selection

### `Z_DEFAULT_STRATEGY` (0)

Standard DEFLATE with adaptive Huffman coding and LZ77 matching. Best for most data types.

### `Z_FILTERED` (1)

Optimised for data produced by filters (e.g., delta encoding, integer sequences). Uses shorter hash chains and favours Huffman coding efficiency.

### `Z_HUFFMAN_ONLY` (2)

Disables LZ77 matching entirely. Every byte is encoded as a literal. Fast but poor compression ratio for most data. Useful when the data has already been transformed (e.g., BWT output).

```c
// deflate_huff.c: Only emits literals
block_state deflate_huff(deflate_state *s, int flush) {
    for (;;) {
        // No match search: emit one literal per byte
        zng_tr_tally_lit(s, s->window[s->strstart]);
        s->strstart++;
        s->lookahead--;
        if (s->sym_next == s->sym_end) {
            FLUSH_BLOCK(s, 0);
        }
    }
}
```

### `Z_RLE` (3)

Run-length encoding: only matches at distance 1. Very fast for data with repeated byte patterns:

```c
// deflate_rle.c
block_state deflate_rle(deflate_state *s, int flush) {
    // Only search for matches at distance == 1
    // Uses compare256_rle for fast run detection
    match_len = FUNCTABLE_CALL(compare256)(scan, scan - 1);
}
```

### `Z_FIXED` (4)

Forces use of static (fixed) Huffman tables for every block. Eliminates the overhead of dynamic tree transmission. Slightly faster for small blocks where the tree overhead dominates.

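To make the distance-1 matching behind `Z_RLE` more concrete, here is a minimal scalar sketch of run detection. It is an illustration only, not Neozip's implementation (which, per the excerpt above, dispatches to an optimised compare routine). The helper name `rle_run_length` and the `MAX_MATCH_LEN` macro are hypothetical; the 258-byte cap is the maximum match length allowed by the DEFLATE format.

```c
#include <assert.h>
#include <stddef.h>

/* DEFLATE limits a single match to 258 bytes. */
#define MAX_MATCH_LEN 258

/*
 * Hypothetical helper (not a Neozip function): counts how many bytes
 * starting at `scan` repeat the byte immediately before it, which is
 * exactly a match at distance 1.
 * The caller must guarantee that scan[-1] is readable and that `avail`
 * bytes are readable starting at `scan`.
 */
static size_t rle_run_length(const unsigned char *scan, size_t avail) {
    assert(scan != NULL);
    const unsigned char prev = scan[-1];
    const size_t limit = avail < MAX_MATCH_LEN ? avail : MAX_MATCH_LEN;
    size_t len = 0;
    /* Extend the run while the current byte equals the previous one. */
    while (len < limit && scan[len] == prev) {
        len++;
    }
    return len;
}
```

A byte-at-a-time loop like this corresponds to the scalar baseline; a SIMD variant would compare many bytes per iteration but return the same length.
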
### Strategy Selection Guide

| Data Type | Strategy |
|---|---|
| General text/binary | `Z_DEFAULT_STRATEGY` |
| Numeric arrays, deltas | `Z_FILTERED` |
| Pre-transformed data | `Z_HUFFMAN_ONLY` |
| Runs of repeated bytes | `Z_RLE` |
| Very small blocks | `Z_FIXED` |
| Random/encrypted data | Level 0 (skip entirely) |

---

## Window Size (`windowBits`)

Controls the LZ77 sliding window (8–15, default 15):

| windowBits | Window Size | Memory (deflate) |
|---|---|---|
| 9 | 512 B | ~4 KB |
| 10 | 1 KB | ~8 KB |
| 11 | 2 KB | ~16 KB |
| 12 | 4 KB | ~32 KB |
| 13 | 8 KB | ~64 KB |
| 14 | 16 KB | ~128 KB |
| 15 | 32 KB | ~256 KB |

Smaller windows use less memory but find fewer long-distance matches, reducing compression ratio. For streaming protocols with tight memory budgets, a `windowBits` of 10–12 is a reasonable compromise.

---

## Memory Level (`memLevel`)

Controls the internal hash table and buffer sizes (1–9, default 8):

```c
#define DEF_MEM_LEVEL 8

// In deflateInit2:
s->hash_size = 1 << (memLevel + 7);  // hash_bits = memLevel + 7
s->lit_bufsize = 1 << (memLevel + 6);
```

| memLevel | Hash Table Entries | Literal Buffer | Total Memory |
|---|---|---|---|
| 1 | 256 | 128 | ~1 KB |
| 4 | 2048 | 1024 | ~16 KB |
| 8 (default) | 32768 | 16384 | ~256 KB |
| 9 | 65536 | 32768 | ~512 KB |

A higher `memLevel` improves hash distribution (fewer collisions) and allows more symbols to accumulate before flushing, improving Huffman coding efficiency.

---

## Hardware Acceleration

### Enabling SIMD

**Runtime detection** (default, recommended for distributed binaries):

```bash
cmake .. -DWITH_RUNTIME_CPU_DETECTION=ON
```

**Native compilation** (fastest, for local/dedicated use):

```bash
cmake .. -DWITH_NATIVE_INSTRUCTIONS=ON
```

This passes `-march=native` to the compiler, enabling all instructions supported by the build machine.

### Selective Feature Control

Disable specific SIMD features:

```bash
cmake .. -DWITH_AVX512=OFF       # Avoid AVX-512 (thermal throttling concern)
cmake .. -DWITH_VPCLMULQDQ=OFF   # Disable VPCLMULQDQ CRC
cmake .. -DWITH_NEON=OFF         # Disable NEON on ARM
```

### SIMD Impact by Operation

| Operation | Scalar | Best SIMD | Speedup |
|---|---|---|---|
| Adler-32 | ~1 B/cycle | ~32 B/cycle (AVX-512+VNNI) | 32× |
| CRC-32 | ~4 B/cycle | ~64 B/cycle (VPCLMULQDQ) | 16× |
| Compare256 | ~1 B/cycle | ~16 B/cycle (AVX2) | 16× |
| Slide Hash | ~1 entry/cycle | ~32 entries/cycle (AVX-512) | 32× |
| Inflate Copy | ~1 B/cycle | ~32 B/cycle (AVX2 chunkmemset) | 32× |

---

## Buffer Sizing

### Compression Buffers

For streaming compression, the output buffer should hold at least `deflateBound(&strm, chunk_size)` bytes, where `chunk_size` is the expected input chunk size:

```c
size_t out_size = deflateBound(&strm, chunk_size);
```

Larger buffers reduce system call overhead and improve throughput.

### Gzip Buffer

```c
gzbuffer(gz, size);  // Set before first read/write
```

Default `GZBUFSIZE` is 131072 (128 KB). For sequential I/O, larger buffers (256 KB–1 MB) improve throughput by amortising I/O overhead.

### Inflate Buffers

The inflate engine benefits from output buffers ≥ 32 KB (the maximum window size). Buffers ≥ 64 KB keep the fast path active longer (the fast path requires ≥ 258 bytes of output space and ≥ 6 bytes of input).

---

## `deflateTune()`

Fine-tune the `configuration_table` parameters at runtime without changing the level:

```c
int deflateTune(z_stream *strm, int good_length, int max_lazy,
                int nice_length, int max_chain);
```

Example (high-speed level 6):

```c
deflateInit(&strm, 6);
deflateTune(&strm, 4, 8, 32, 64);  // Shorter chains than default
```

Example (deeper search at level 4):

```c
deflateInit(&strm, 4);
deflateTune(&strm, 16, 64, 128, 512);  // Deeper search
```

---

## Profiling Tips

### 1. Identify the Bottleneck

Use `perf` or an equivalent profiler to determine whether compression is CPU-bound (expect hash lookup and match search to dominate) or I/O-bound (expect read/write syscalls to dominate):

```bash
perf record -g ./minigzip < large_file > /dev/null
perf report
```

Look for hot functions:

- `longest_match_*`: string matching (CPU-bound)
- `adler32_*` / `crc32_*`: checksumming (CPU-bound)
- `slide_hash_*`: window maintenance (CPU-bound)
- `__write` / `__read`: I/O (I/O-bound)

### 2. Verify SIMD Usage

Check which implementations are selected:

```bash
# Check for SIMD symbols in the binary
nm -D libz-ng.so | grep -E 'avx2|neon|sse|pclmul'
```

Or set a breakpoint in `init_functable()` during debugging.

### 3. Benchmark Specific Functions

Use the built-in benchmarks:

```bash
cmake .. -DWITH_BENCHMARKS=ON
cmake --build .
./benchmark_adler32 --benchmark_repetitions=5
./benchmark_compress --benchmark_filter="BM_Compress/6"
```

---

## Common Tuning Scenarios

### High-Throughput Compression (Level 1)

```c
deflateInit2(&strm, 1, Z_DEFLATED, 15 + 16, 8, Z_DEFAULT_STRATEGY);
```

Level 1 uses `deflate_quick`: no hash chain walking, static Huffman tables, minimal overhead. Best for cases where compression speed matters more than ratio (real-time logging, network IPC).

### Maximum Compression (Level 9)

```c
deflateInit2(&strm, 9, Z_DEFLATED, 15, 9, Z_DEFAULT_STRATEGY);
```

Level 9 with `memLevel` 9 provides the deepest search (`max_chain=4096`) and the largest hash table. Use for archival, where slow compression is acceptable; decompression speed is largely unaffected by the level.

### Memory-Constrained Environment

```c
deflateInit2(&strm, 6, Z_DEFLATED, 10, 4, Z_DEFAULT_STRATEGY);
```

`windowBits` 10 (1 KB window) with `memLevel` 4 gives ~16 KB total memory. Suitable for embedded systems.

### Multiple Streams in Parallel

Each `z_stream` is independent.
For multi-threaded compression, create one stream per thread:

```c
// Thread-safe: each thread has its own z_stream
#pragma omp parallel for
for (int i = 0; i < num_chunks; i++) {
    z_stream strm = {0};  // Zero-initialise (valid in every C standard)
    deflateInit(&strm, 6);
    // compress chunk[i]
    deflateEnd(&strm);
}
```

The `functable` initialisation is thread-safe (atomic init flag), so the first call from any thread will safely initialise SIMD dispatch.
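
The loop above leaves the chunking itself unspecified. One possible way to split an input of `total` bytes into near-equal chunks is sketched below. The `chunk_range` type and `chunk_bounds` function are hypothetical names introduced for this example and are not part of Neozip.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical type: byte range of one chunk within the input buffer. */
typedef struct {
    size_t offset;  /* start of the chunk, in bytes from the buffer start */
    size_t length;  /* number of bytes in the chunk */
} chunk_range;

/*
 * Hypothetical helper: range of chunk `i` when `total` bytes are divided
 * into `num_chunks` chunks whose sizes differ by at most one byte.
 */
static chunk_range chunk_bounds(size_t total, size_t num_chunks, size_t i) {
    assert(num_chunks > 0 && i < num_chunks);
    const size_t base = total / num_chunks;
    const size_t extra = total % num_chunks;  /* first `extra` chunks get +1 byte */
    chunk_range r;
    r.offset = i * base + (i < extra ? i : extra);
    r.length = base + (i < extra ? 1 : 0);
    return r;
}
```

Inside the parallel loop, each iteration would call `chunk_bounds(total, num_chunks, i)` and compress only its own range. Because every chunk gets its own stream, matches cannot cross chunk boundaries, so very small chunks are likely to cost some compression ratio.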