1 files changed, 361 insertions, 0 deletions
diff --git a/docs/handbook/neozip/performance-tuning.md b/docs/handbook/neozip/performance-tuning.md
new file mode 100644
index 0000000000..f45706e964
--- /dev/null
+++ b/docs/handbook/neozip/performance-tuning.md
@@ -0,0 +1,361 @@
+# Performance Tuning
+
+## Overview
+
+Neozip offers multiple controls for trading compression ratio against
+speed: compression level, strategy, window size, memory level, hardware
+acceleration, and buffer sizing. This guide describes how each knob
+affects performance and when to use them.
+
+---
+
+## Compression Level
+
+The `level` parameter (0–9) selects the deflate strategy function and its
+internal tuning parameters via the `configuration_table`:
+
+```c
+static const config configuration_table[10] = {
+/*      good_length  lazy  nice  max_chain  func */
+/* 0 */ {0,    0,  0,    0, deflate_stored},    // No compression
+/* 1 */ {0,    0,  0,    0, deflate_quick},      // Fastest (Intel)
+/* 2 */ {4,    4,  8,    4, deflate_fast},        // Fast greedy
+/* 3 */ {4,    6, 32,   32, deflate_fast},
+/* 4 */ {4,    4, 16,   16, deflate_medium},      // Balanced (Intel)
+/* 5 */ {8,   16, 32,   32, deflate_medium},
+/* 6 */ {8,   16,128,  128, deflate_medium},      // Default
+/* 7 */ {8,   32,128,  256, deflate_slow},        // Slow lazy
+/* 8 */ {32, 128,258, 1024, deflate_slow},
+/* 9 */ {32, 258,258, 4096, deflate_slow},        // Maximum
+};
+```
+
+| Parameter | Effect |
+|---|---|
+| `good_length` | Reduce match search when match ≥ this length |
+| `max_lazy` | Don't try lazy match if current match ≥ this |
+| `nice_length` | Stop searching once match ≥ this length |
+| `max_chain` | Maximum hash chain steps to search |
+
+### Level Selection Guide
+
+| Use Case | Recommended Level | Rationale |
+|---|---|---|
+| Real-time streaming | 1 | `deflate_quick`: static Huffman, minimal search |
+| Network compression | 2–3 | `deflate_fast`: greedy match, short chains |
+| General purpose | 6 (default) | `deflate_medium`: good ratio/speed balance |
+| Archival storage | 9 | `deflate_slow`: full lazy evaluation, deep chains |
+| Pre-compressed data | 0 | `deflate_stored`: passthrough with framing |
+
+### Speed vs. Ratio Tradeoffs
+
+Approximate throughput (x86_64 with AVX2, single core):
+
+| Level | Compression Speed | Ratio (typical) |
+|---|---|---|
+| 0 | ~5 GB/s | 1.00 (none) |
+| 1 | ~800 MB/s | 2.0–2.5:1 |
+| 3 | ~400 MB/s | 2.2–2.8:1 |
+| 6 | ~150 MB/s | 2.5–3.2:1 |
+| 9 | ~30 MB/s | 2.6–3.4:1 |
+
+Decompression speed is largely independent of the compression level
+(~1–2 GB/s), since it only depends on the encoded stream, not the search
+strategy.
+
+---
+
+## Strategy Selection
+
+### `Z_DEFAULT_STRATEGY` (0)
+
+Standard DEFLATE with adaptive Huffman coding and LZ77 matching.
+Best for most data types.
+
+### `Z_FILTERED` (1)
+
+Optimised for data produced by filters (e.g., delta encoding, integer
+sequences). Uses shorter hash chains and favours Huffman coding efficiency.
+
+### `Z_HUFFMAN_ONLY` (2)
+
+Disables LZ77 matching entirely. Every byte is encoded as a literal.
+Fast but poor compression ratio for most data. Useful when the data has
+already been transformed (e.g., BWT output).
+
+```c
+// deflate_huff.c: Only emits literals
+block_state deflate_huff(deflate_state *s, int flush) {
+    for (;;) {
+        // No match search — emit one literal per byte
+        zng_tr_tally_lit(s, s->window[s->strstart]);
+        s->strstart++;
+        s->lookahead--;
+        if (s->sym_next == s->sym_end) {
+            FLUSH_BLOCK(s, 0);
+        }
+    }
+}
+```
+
+### `Z_RLE` (3)
+
+Run-length encoding: only matches at distance 1. Very fast for data with
+repeated byte patterns:
+
+```c
+// deflate_rle.c
+block_state deflate_rle(deflate_state *s, int flush) {
+    // Only search for matches at distance == 1
+    // Uses compare256_rle for fast run detection
+    match_len = FUNCTABLE_CALL(compare256)(scan, scan - 1);
+}
+```
+
+### `Z_FIXED` (4)
+
+Forces use of static (fixed) Huffman tables for every block. Eliminates
+the overhead of dynamic tree transmission. Slightly faster for small
+blocks where the tree overhead dominates.
+
+### Strategy Selection Guide
+
+| Data Type | Strategy |
+|---|---|
+| General text/binary | `Z_DEFAULT_STRATEGY` |
+| Numeric arrays, deltas | `Z_FILTERED` |
+| Pre-transformed data | `Z_HUFFMAN_ONLY` |
+| Runs of repeated bytes | `Z_RLE` |
+| Very small blocks | `Z_FIXED` |
+| Random/encrypted data | Level 0 (skip entirely) |
+
+---
+
+## Window Size (`windowBits`)
+
+Controls the LZ77 sliding window (8–15, default 15):
+
+| windowBits | Window Size | Memory (deflate) |
+|---|---|---|
+| 9 | 512 B | ~4 KB |
+| 10 | 1 KB | ~8 KB |
+| 11 | 2 KB | ~16 KB |
+| 12 | 4 KB | ~32 KB |
+| 13 | 8 KB | ~64 KB |
+| 14 | 16 KB | ~128 KB |
+| 15 | 32 KB | ~256 KB |
+
+Smaller windows use less memory but find fewer long-distance matches,
+reducing compression ratio. For streaming protocols with tight memory
+budgets, windowBits=10–12 is a reasonable compromise.
+
+---
+
+## Memory Level (`memLevel`)
+
+Controls the internal hash table and buffer sizes (1–9, default 8):
+
+```c
+#define DEF_MEM_LEVEL 8
+
+// In deflateInit2:
+s->hash_size = 1 << (memLevel + 7);  // hash_bits = memLevel + 7
+s->lit_bufsize = 1 << (memLevel + 6);
+```
+
+| memLevel | Hash Table Entries | Literal Buffer | Total Memory |
+|---|---|---|---|
+| 1 | 256 | 128 | ~1 KB |
+| 4 | 2048 | 1024 | ~16 KB |
+| 8 (default) | 32768 | 16384 | ~256 KB |
+| 9 | 65536 | 32768 | ~512 KB |
+
+Higher memLevel improves hash distribution (fewer collisions) and allows
+more symbols to accumulate before flushing, improving Huffman coding
+efficiency.
+
+---
+
+## Hardware Acceleration
+
+### Enabling SIMD
+
+**Runtime detection** (default, recommended for distributed binaries):
+```bash
+cmake .. -DWITH_RUNTIME_CPU_DETECTION=ON
+```
+
+**Native compilation** (fastest, for local/dedicated use):
+```bash
+cmake .. -DWITH_NATIVE_INSTRUCTIONS=ON
+```
+
+This passes `-march=native` to the compiler, enabling all instructions
+supported by the build machine.
+
+### Selective Feature Control
+
+Disable specific SIMD features:
+```bash
+cmake .. -DWITH_AVX512=OFF       # Avoid AVX-512 (thermal throttling concern)
+cmake .. -DWITH_VPCLMULQDQ=OFF   # Disable VPCLMULQDQ CRC
+cmake .. -DWITH_NEON=OFF         # Disable NEON on ARM
+```
+
+### SIMD Impact by Operation
+
+| Operation | Scalar | Best SIMD | Speedup |
+|---|---|---|---|
+| Adler-32 | ~1 B/cycle | ~32 B/cycle (AVX-512+VNNI) | 32× |
+| CRC-32 | ~4 B/cycle | ~64 B/cycle (VPCLMULQDQ) | 16× |
+| Compare256 | ~1 B/cycle | ~16 B/cycle (AVX2) | 16× |
+| Slide Hash | ~1 entry/cycle | ~32 entries/cycle (AVX-512) | 32× |
+| Inflate Copy | ~1 B/cycle | ~32 B/cycle (AVX2 chunkmemset) | 32× |
+
+---
+
+## Buffer Sizing
+
+### Compression Buffers
+
+For streaming compression, the output buffer should be at least as large
+as `deflateBound(sourceLen)` for the expected input chunk size:
+
+```c
+size_t out_size = deflateBound(&strm, chunk_size);
+```
+
+Larger buffers reduce system call overhead and improve throughput.
+
+### Gzip Buffer
+
+```c
+gzbuffer(gz, size);  // Set before first read/write
+```
+
+Default `GZBUFSIZE` is 131072 (128 KB). For sequential I/O, larger
+buffers (256 KB–1 MB) improve throughput by amortising I/O overhead.
+
+### Inflate Buffers
+
+The inflate engine benefits from output buffers ≥ 32 KB (the maximum
+window size). Buffers ≥ 64 KB keep the fast path active longer (the
+fast path requires ≥ 258 bytes of output space and ≥ 6 bytes of input).
+
+---
+
+## `deflateTune()`
+
+Fine-tune the `configuration_table` parameters at runtime without
+changing the level:
+
+```c
+int deflateTune(z_stream *strm, int good_length, int max_lazy,
+                int nice_length, int max_chain);
+```
+
+Example — high-speed level 6:
+```c
+deflateInit(&strm, 6);
+deflateTune(&strm, 4, 8, 32, 64);  // Shorter chains than default
+```
+
+Example — deeper search at level 4:
+```c
+deflateInit(&strm, 4);
+deflateTune(&strm, 16, 64, 128, 512);  // Deeper search
+```
+
+---
+
+## Profiling Tips
+
+### 1. Identify the Bottleneck
+
+Use `perf` or equivalent to identify whether compression is CPU-bound
+(expect: hash lookup, match search) or I/O-bound (expect: read/write
+syscalls):
+
+```bash
+perf record -g ./minigzip < large_file > /dev/null
+perf report
+```
+
+Look for hot functions:
+- `longest_match_*` — String matching (CPU-bound)
+- `adler32_*` / `crc32_*` — Checksumming (CPU-bound)
+- `slide_hash_*` — Window maintenance (CPU-bound)
+- `__write` / `__read` — I/O (I/O-bound)
+
+### 2. Verify SIMD Usage
+
+Check which implementations are selected:
+
+```bash
+# Check for SIMD symbols in the binary
+nm -D libz-ng.so | grep -E 'avx2|neon|sse|pclmul'
+```
+
+Or set a breakpoint in `init_functable()` during debugging.
+
+### 3. Benchmark Specific Functions
+
+Use the built-in benchmarks:
+```bash
+cmake .. -DWITH_BENCHMARKS=ON
+cmake --build .
+./benchmark_adler32 --benchmark_repetitions=5
+./benchmark_compress --benchmark_filter="BM_Compress/6"
+```
+
+---
+
+## Common Tuning Scenarios
+
+### High-Throughput Compression (Level 1)
+
+```c
+deflateInit2(&strm, 1, Z_DEFLATED, 15 + 16, 8, Z_DEFAULT_STRATEGY);
+```
+
+Level 1 uses `deflate_quick`: no hash chain walking, static Huffman tables,
+minimal overhead. Best for cases where compression speed matters more than
+ratio (real-time logging, network IPC).
+
+### Maximum Compression (Level 9)
+
+```c
+deflateInit2(&strm, 9, Z_DEFLATED, 15, 9, Z_DEFAULT_STRATEGY);
+```
+
+Level 9 + memLevel 9 provides the deepest search (`max_chain=4096`) and
+largest hash table. Use for archival where decompression speed matters
+but compression can be slow.
+
+### Memory-Constrained Environment
+
+```c
+deflateInit2(&strm, 6, Z_DEFLATED, 10, 4, Z_DEFAULT_STRATEGY);
+```
+
+windowBits=10 (1KB window) + memLevel=4 gives ~16KB total memory.
+Suitable for embedded systems.
+
+### Multiple Streams in Parallel
+
+Each `z_stream` is independent. For multi-threaded compression, create
+one stream per thread:
+
+```c
+// Thread-safe: each thread has its own z_stream
+#pragma omp parallel for
+for (int i = 0; i < num_chunks; i++) {
+    z_stream strm = {};
+    deflateInit(&strm, 6);
+    // compress chunk[i]
+    deflateEnd(&strm);
+}
+```
+
+The `functable` initialisation is thread-safe (atomic init flag), so
+the first call from any thread will safely initialise SIMD dispatch.