diff options
Diffstat (limited to 'docs/handbook/neozip/performance-tuning.md')
| -rw-r--r-- | docs/handbook/neozip/performance-tuning.md | 361 |
1 files changed, 361 insertions, 0 deletions
diff --git a/docs/handbook/neozip/performance-tuning.md b/docs/handbook/neozip/performance-tuning.md new file mode 100644 index 0000000000..f45706e964 --- /dev/null +++ b/docs/handbook/neozip/performance-tuning.md @@ -0,0 +1,361 @@ +# Performance Tuning + +## Overview + +Neozip offers multiple controls for trading compression ratio against +speed: compression level, strategy, window size, memory level, hardware +acceleration, and buffer sizing. This guide describes how each knob +affects performance and when to use them. + +--- + +## Compression Level + +The `level` parameter (0–9) selects the deflate strategy function and its +internal tuning parameters via the `configuration_table`: + +```c +static const config configuration_table[10] = { +/* good_length lazy nice max_chain func */ +/* 0 */ {0, 0, 0, 0, deflate_stored}, // No compression +/* 1 */ {0, 0, 0, 0, deflate_quick}, // Fastest (Intel) +/* 2 */ {4, 4, 8, 4, deflate_fast}, // Fast greedy +/* 3 */ {4, 6, 32, 32, deflate_fast}, +/* 4 */ {4, 4, 16, 16, deflate_medium}, // Balanced (Intel) +/* 5 */ {8, 16, 32, 32, deflate_medium}, +/* 6 */ {8, 16,128, 128, deflate_medium}, // Default +/* 7 */ {8, 32,128, 256, deflate_slow}, // Slow lazy +/* 8 */ {32, 128,258, 1024, deflate_slow}, +/* 9 */ {32, 258,258, 4096, deflate_slow}, // Maximum +}; +``` + +| Parameter | Effect | +|---|---| +| `good_length` | Reduce match search when match ≥ this length | +| `max_lazy` | Don't try lazy match if current match ≥ this | +| `nice_length` | Stop searching once match ≥ this length | +| `max_chain` | Maximum hash chain steps to search | + +### Level Selection Guide + +| Use Case | Recommended Level | Rationale | +|---|---|---| +| Real-time streaming | 1 | `deflate_quick`: static Huffman, minimal search | +| Network compression | 2–3 | `deflate_fast`: greedy match, short chains | +| General purpose | 6 (default) | `deflate_medium`: good ratio/speed balance | +| Archival storage | 9 | `deflate_slow`: full lazy evaluation, deep chains | +| Pre-compressed data | 0 | `deflate_stored`: passthrough with framing | + +### Speed vs. Ratio Tradeoffs + +Approximate throughput (x86_64 with AVX2, single core): + +| Level | Compression Speed | Ratio (typical) | +|---|---|---| +| 0 | ~5 GB/s | 1.00 (none) | +| 1 | ~800 MB/s | 2.0–2.5:1 | +| 3 | ~400 MB/s | 2.2–2.8:1 | +| 6 | ~150 MB/s | 2.5–3.2:1 | +| 9 | ~30 MB/s | 2.6–3.4:1 | + +Decompression speed is largely independent of the compression level +(~1–2 GB/s), since it only depends on the encoded stream, not the search +strategy. + +--- + +## Strategy Selection + +### `Z_DEFAULT_STRATEGY` (0) + +Standard DEFLATE with adaptive Huffman coding and LZ77 matching. +Best for most data types. + +### `Z_FILTERED` (1) + +Optimised for data produced by filters (e.g., delta encoding, integer +sequences). Uses shorter hash chains and favours Huffman coding efficiency. + +### `Z_HUFFMAN_ONLY` (2) + +Disables LZ77 matching entirely. Every byte is encoded as a literal. +Fast but poor compression ratio for most data. Useful when the data has +already been transformed (e.g., BWT output). + +```c +// deflate_huff.c: Only emits literals +block_state deflate_huff(deflate_state *s, int flush) { + for (;;) { + // No match search — emit one literal per byte + zng_tr_tally_lit(s, s->window[s->strstart]); + s->strstart++; + s->lookahead--; + if (s->sym_next == s->sym_end) { + FLUSH_BLOCK(s, 0); + } + } +} +``` + +### `Z_RLE` (3) + +Run-length encoding: only matches at distance 1. Very fast for data with +repeated byte patterns: + +```c +// deflate_rle.c +block_state deflate_rle(deflate_state *s, int flush) { + // Only search for matches at distance == 1 + // Uses compare256_rle for fast run detection + match_len = FUNCTABLE_CALL(compare256)(scan, scan - 1); +} +``` + +### `Z_FIXED` (4) + +Forces use of static (fixed) Huffman tables for every block. Eliminates +the overhead of dynamic tree transmission. Slightly faster for small +blocks where the tree overhead dominates. + +### Strategy Selection Guide + +| Data Type | Strategy | +|---|---| +| General text/binary | `Z_DEFAULT_STRATEGY` | +| Numeric arrays, deltas | `Z_FILTERED` | +| Pre-transformed data | `Z_HUFFMAN_ONLY` | +| Runs of repeated bytes | `Z_RLE` | +| Very small blocks | `Z_FIXED` | +| Random/encrypted data | Level 0 (skip entirely) | + +--- + +## Window Size (`windowBits`) + +Controls the LZ77 sliding window (8–15, default 15): + +| windowBits | Window Size | Memory (deflate) | +|---|---|---| +| 9 | 512 B | ~4 KB | +| 10 | 1 KB | ~8 KB | +| 11 | 2 KB | ~16 KB | +| 12 | 4 KB | ~32 KB | +| 13 | 8 KB | ~64 KB | +| 14 | 16 KB | ~128 KB | +| 15 | 32 KB | ~256 KB | + +Smaller windows use less memory but find fewer long-distance matches, +reducing compression ratio. For streaming protocols with tight memory +budgets, windowBits=10–12 is a reasonable compromise. + +--- + +## Memory Level (`memLevel`) + +Controls the internal hash table and buffer sizes (1–9, default 8): + +```c +#define DEF_MEM_LEVEL 8 + +// In deflateInit2: +s->hash_size = 1 << (memLevel + 7); // hash_bits = memLevel + 7 +s->lit_bufsize = 1 << (memLevel + 6); +``` + +| memLevel | Hash Table Entries | Literal Buffer | Total Memory | +|---|---|---|---| +| 1 | 256 | 128 | ~1 KB | +| 4 | 2048 | 1024 | ~16 KB | +| 8 (default) | 32768 | 16384 | ~256 KB | +| 9 | 65536 | 32768 | ~512 KB | + +Higher memLevel improves hash distribution (fewer collisions) and allows +more symbols to accumulate before flushing, improving Huffman coding +efficiency. + +--- + +## Hardware Acceleration + +### Enabling SIMD + +**Runtime detection** (default, recommended for distributed binaries): +```bash +cmake .. -DWITH_RUNTIME_CPU_DETECTION=ON +``` + +**Native compilation** (fastest, for local/dedicated use): +```bash +cmake .. -DWITH_NATIVE_INSTRUCTIONS=ON +``` + +This passes `-march=native` to the compiler, enabling all instructions +supported by the build machine. + +### Selective Feature Control + +Disable specific SIMD features: +```bash +cmake .. -DWITH_AVX512=OFF # Avoid AVX-512 (thermal throttling concern) +cmake .. -DWITH_VPCLMULQDQ=OFF # Disable VPCLMULQDQ CRC +cmake .. -DWITH_NEON=OFF # Disable NEON on ARM +``` + +### SIMD Impact by Operation + +| Operation | Scalar | Best SIMD | Speedup | +|---|---|---|---| +| Adler-32 | ~1 B/cycle | ~32 B/cycle (AVX-512+VNNI) | 32× | +| CRC-32 | ~4 B/cycle | ~64 B/cycle (VPCLMULQDQ) | 16× | +| Compare256 | ~1 B/cycle | ~16 B/cycle (AVX2) | 16× | +| Slide Hash | ~1 entry/cycle | ~32 entries/cycle (AVX-512) | 32× | +| Inflate Copy | ~1 B/cycle | ~32 B/cycle (AVX2 chunkmemset) | 32× | + +--- + +## Buffer Sizing + +### Compression Buffers + +For streaming compression, the output buffer should be at least as large +as `deflateBound(sourceLen)` for the expected input chunk size: + +```c +size_t out_size = deflateBound(&strm, chunk_size); +``` + +Larger buffers reduce system call overhead and improve throughput. + +### Gzip Buffer + +```c +gzbuffer(gz, size); // Set before first read/write +``` + +Default `GZBUFSIZE` is 131072 (128 KB). For sequential I/O, larger +buffers (256 KB–1 MB) improve throughput by amortising I/O overhead. + +### Inflate Buffers + +The inflate engine benefits from output buffers ≥ 32 KB (the maximum +window size). Buffers ≥ 64 KB keep the fast path active longer (the +fast path requires ≥ 258 bytes of output space and ≥ 6 bytes of input). + +--- + +## `deflateTune()` + +Fine-tune the `configuration_table` parameters at runtime without +changing the level: + +```c +int deflateTune(z_stream *strm, int good_length, int max_lazy, + int nice_length, int max_chain); +``` + +Example — high-speed level 6: +```c +deflateInit(&strm, 6); +deflateTune(&strm, 4, 8, 32, 64); // Shorter chains than default +``` + +Example — deeper search at level 4: +```c +deflateInit(&strm, 4); +deflateTune(&strm, 16, 64, 128, 512); // Deeper search +``` + +--- + +## Profiling Tips + +### 1. Identify the Bottleneck + +Use `perf` or equivalent to identify whether compression is CPU-bound +(expect: hash lookup, match search) or I/O-bound (expect: read/write +syscalls): + +```bash +perf record -g ./minigzip < large_file > /dev/null +perf report +``` + +Look for hot functions: +- `longest_match_*` — String matching (CPU-bound) +- `adler32_*` / `crc32_*` — Checksumming (CPU-bound) +- `slide_hash_*` — Window maintenance (CPU-bound) +- `__write` / `__read` — I/O (I/O-bound) + +### 2. Verify SIMD Usage + +Check which implementations are selected: + +```bash +# Check for SIMD symbols in the binary +nm -D libz-ng.so | grep -E 'avx2|neon|sse|pclmul' +``` + +Or set a breakpoint in `init_functable()` during debugging. + +### 3. Benchmark Specific Functions + +Use the built-in benchmarks: +```bash +cmake .. -DWITH_BENCHMARKS=ON +cmake --build . +./benchmark_adler32 --benchmark_repetitions=5 +./benchmark_compress --benchmark_filter="BM_Compress/6" +``` + +--- + +## Common Tuning Scenarios + +### High-Throughput Compression (Level 1) + +```c +deflateInit2(&strm, 1, Z_DEFLATED, 15 + 16, 8, Z_DEFAULT_STRATEGY); +``` + +Level 1 uses `deflate_quick`: no hash chain walking, static Huffman tables, +minimal overhead. Best for cases where compression speed matters more than +ratio (real-time logging, network IPC). + +### Maximum Compression (Level 9) + +```c +deflateInit2(&strm, 9, Z_DEFLATED, 15, 9, Z_DEFAULT_STRATEGY); +``` + +Level 9 + memLevel 9 provides the deepest search (`max_chain=4096`) and +largest hash table. Use for archival where decompression speed matters +but compression can be slow. + +### Memory-Constrained Environment + +```c +deflateInit2(&strm, 6, Z_DEFLATED, 10, 4, Z_DEFAULT_STRATEGY); +``` + +windowBits=10 (1KB window) + memLevel=4 gives ~16KB total memory. +Suitable for embedded systems. + +### Multiple Streams in Parallel + +Each `z_stream` is independent. For multi-threaded compression, create +one stream per thread: + +```c +// Thread-safe: each thread has its own z_stream +#pragma omp parallel for +for (int i = 0; i < num_chunks; i++) { + z_stream strm = {}; + deflateInit(&strm, 6); + // compress chunk[i] + deflateEnd(&strm); +} +``` + +The `functable` initialisation is thread-safe (atomic init flag), so +the first call from any thread will safely initialise SIMD dispatch. |
