1 files changed, 797 insertions, 0 deletions
diff --git a/docs/handbook/neozip/deflate-algorithms.md b/docs/handbook/neozip/deflate-algorithms.md
new file mode 100644
index 0000000000..4d3eeb03c9
--- /dev/null
+++ b/docs/handbook/neozip/deflate-algorithms.md
@@ -0,0 +1,797 @@
+# Deflate Algorithms
+
+## Overview
+
+The DEFLATE algorithm (RFC 1951) combines **LZ77** sliding-window compression
+with **Huffman coding**. Neozip implements DEFLATE through a modular strategy
+system where each compression level maps to a specific strategy function with
+tuned parameters. This document covers every aspect of the compression
+pipeline: hash chains, match finding, lazy evaluation, and the strategy
+functions.
+
+---
+
+## The DEFLATE Compression Pipeline
+
+At a high level, DEFLATE works as follows:
+
+1. **Sliding window**: Input is processed through a 32KB (configurable)
+   sliding window. Each position is hashed and inserted into a hash table.
+2. **LZ77 match finding**: For each position, the hash table is consulted to
+   find previous occurrences of the same byte sequence. The longest match
+   within the window distance is selected.
+3. **Symbol emission**: Either a **literal** byte (no match found) or a
+   **length/distance pair** (match found) is recorded.
+4. **Huffman coding**: Accumulated symbols are encoded with Huffman codes
+   (static, dynamic, or the block is stored uncompressed) and output as a
+   DEFLATE block.
+
+```
+Input bytes
+    │
+    ▼
+┌──────────┐     ┌──────────────┐     ┌──────────────┐     ┌────────────┐
+│  Sliding  │ ──▶ │ Hash Insert  │ ──▶ │ Match Finding│ ──▶ │  Symbol    │
+│  Window   │     │ (head/prev)  │     │ (longest_    │     │  Buffer    │
+│ (32K×2)   │     │              │     │  match)      │     │ (sym_buf)  │
+└──────────┘     └──────────────┘     └──────────────┘     └─────┬──────┘
+                                                                  │
+                                                                  ▼
+                                                           ┌────────────┐
+                                                           │  Huffman   │
+                                                           │  Encoding  │
+                                                           │ (trees.c)  │
+                                                           └─────┬──────┘
+                                                                  │
+                                                                  ▼
+                                                           Compressed
+                                                           Output
+```
+
+---
+
+## The Sliding Window
+
+The sliding window is a byte array of size `2 * w_size`, where `w_size`
+defaults to 32768 (32KB, corresponding to `windowBits = 15`):
+
+```c
+unsigned char *window;    // Size: 2 * w_size bytes
+unsigned int window_size; // = 2 * w_size
+unsigned int w_size;      // = 1 << windowBits (default 32768)
+```
+
+Input is read into the **upper half** of the window (positions `w_size` to
+`2*w_size - 1`). When this half fills up, the lower half is discarded, the
+upper half is moved to the lower half (`memcpy` or `memmove`), and new input
+fills the upper half again. This is the "slide" operation.
+
+### Window Sliding
+
+The `fill_window()` function in `deflate.c` performs the slide:
+
+1. If more data is needed and `strstart >= w_size + MAX_DIST(s)`:
+   - Copy bytes `[w_size, 2*w_size)` to `[0, w_size)`
+   - Adjust `match_start`, `strstart`, `block_start` by `-w_size`
+   - Call `FUNCTABLE_CALL(slide_hash)(s)` to update hash table entries
+2. Read new input from `strm->next_in` into the free space
+3. Maintain the `high_water` mark for memory-check safety
+
+### `slide_hash()`
+
+When the window slides, all hash table entries (`head[]` and `prev[]`) must be
+decremented by `w_size`. Entries that would become negative (pointing before
+the new window start) are set to zero.
+
+The generic C implementation:
+```c
+void slide_hash_c(deflate_state *s) {
+    Pos *p;
+    unsigned n = HASH_SIZE;
+    p = &s->head[n];
+    do {
+        unsigned m = *--p;
+        *p = (Pos)(m >= w_size ? m - w_size : 0);
+    } while (--n);
+    // Same for prev[]
+}
+```
+
+SIMD implementations process 8 or 16 entries at a time using vector
+subtraction and saturation.
+
+---
+
+## Hash Table Structure
+
+The hash table maps byte sequences to window positions for fast match lookup.
+It consists of two arrays:
+
+### `head[HASH_SIZE]`
+
+An array of `Pos` (uint16_t) values, indexed by hash value. Each entry
+contains the most recent window position that hashed to that index.
+
+```c
+#define HASH_BITS  16u
+#define HASH_SIZE  65536u       // 2^16
+#define HASH_MASK  (HASH_SIZE - 1u)
+
+Pos *head;   // head[HASH_SIZE]: most recent position for each hash
+```
+
+Note: zlib-ng uses a 16-bit hash (65536 entries) compared to original zlib's
+15-bit hash (32768 entries), providing better distribution and fewer collisions.
+
+### `prev[w_size]`
+
+An array maintaining the hash **chain** — a linked list of all positions
+that share the same hash value:
+
+```c
+Pos *prev;   // prev[w_size]: chain links, indexed by (position & w_mask)
+```
+
+When a new position `str` is inserted for hash value `h`:
+```c
+prev[str & w_mask] = head[h];   // Link to previous chain head
+head[h] = str;                  // New chain head
+```
+
+Walking the chain from `head[h]` through `prev[]` yields all previous
+positions with the same hash, in reverse chronological order.
+
+---
+
+## Hash Function
+
+Neozip hashes 4 bytes at the current position (compared to zlib's 3 bytes).
+The hash function is defined in `insert_string_tpl.h`:
+
+```c
+Z_FORCEINLINE static uint32_t UPDATE_HASH(uint32_t h, uint32_t val) {
+    HASH_CALC(h, val);          // Architecture-specific hash computation
+    return h & HASH_CALC_MASK;  // Mask to HASH_SIZE
+}
+```
+
+The `HASH_CALC` macro uses a fast multiplicative hash or CRC-based hash
+depending on the configuration. Reading 4 bytes at once (`zng_memread_4`)
+provides better hash distribution than 3-byte hashing.
+
+### Hash Insert Operations
+
+Several insert variants exist:
+
+#### `quick_insert_value()`
+
+Used by `deflate_fast` and `deflate_quick`. Inserts a single position
+with a pre-read 4-byte value:
+
+```c
+Z_FORCEINLINE static uint32_t QUICK_INSERT_VALUE(
+        deflate_state *const s, uint32_t str, uint32_t val) {
+    uint32_t hm, head;
+    HASH_CALC_VAR_INIT;
+    HASH_CALC(HASH_CALC_VAR, val);
+    HASH_CALC_VAR &= HASH_CALC_MASK;
+    hm = HASH_CALC_VAR;
+    head = s->head[hm];
+    if (LIKELY(head != str)) {
+        s->prev[str & W_MASK(s)] = (Pos)head;
+        s->head[hm] = (Pos)str;
+    }
+    return head;
+}
+```
+
+#### `quick_insert_string()`
+
+Like `quick_insert_value()` but reads the 4 bytes itself:
+
+```c
+Z_FORCEINLINE static uint32_t QUICK_INSERT_STRING(
+        deflate_state *const s, uint32_t str) {
+    uint8_t *strstart = s->window + str + HASH_CALC_OFFSET;
+    uint32_t val, hm, head;
+    HASH_CALC_VAR_INIT;
+    HASH_CALC_READ;   // val = Z_U32_FROM_LE(zng_memread_4(strstart))
+    HASH_CALC(HASH_CALC_VAR, val);
+    // ... insert ...
+    return head;
+}
+```
+
+#### `insert_string()`
+
+Batch insert. Inserts `count` consecutive positions, used after a match
+to insert all strings within the matched region:
+
+```c
+void insert_string(deflate_state *const s, uint32_t str, uint32_t count);
+```
+
+#### `insert_string_roll()` / `quick_insert_string_roll()`
+
+Rolling hash insert for level 9 (`deflate_slow` with `LONGEST_MATCH_SLOW`).
+Uses a different hash update that considers the full string context for
+better match quality at the cost of speed.
+
+---
+
+## Match Finding
+
+### `longest_match()` (Standard)
+
+Defined via the `match_tpl.h` template. This is the hot inner loop of
+compression.
+
+**Algorithm**:
+
+1. Set `best_len` to `prev_length` (or `STD_MIN_MATCH - 1`)
+2. Read the first 8 bytes at `scan` and at `scan + offset` for fast
+   comparison (where `offset = best_len - 1`, adjusted for word boundaries)
+3. Set `limit` to prevent matches beyond `MAX_DIST(s)`
+4. If `best_len >= good_match`, halve `chain_length` to speed up
+5. Walk the hash chain via `prev[]`:
+   a. For each candidate `cur_match`:
+      - Quick reject: compare 8 bytes at match end position (`mbase_end + cur_match`)
+        against `scan_end`
+      - Quick reject: compare 8 bytes at match start against `scan_start`
+      - If both match: call `compare256()` for exact length
+      - If new length > `best_len`: update `best_len` and `match_start`
+      - If `best_len >= nice_match`: return immediately
+      - Update `scan_end` for the new best length
+   b. Follow `prev[cur_match & wmask]` to next chain entry
+   c. Stop when `chain_length` exhausted or `cur_match <= limit`
+
+```c
+Z_INTERNAL uint32_t LONGEST_MATCH(deflate_state *const s, uint32_t cur_match) {
+    // Early exit: if chain becomes too deep for poor matches
+    int32_t early_exit = s->level < EARLY_EXIT_TRIGGER_LEVEL;
+
+    while (--chain_length) {
+        match = mbase_start + cur_match;
+        if (zng_memread_8(mbase_end + cur_match) == scan_end &&
+            zng_memread_8(match) == scan_start) {
+            len = FUNCTABLE_CALL(compare256)(scan + 2, match + 2) + 2;
+            if (len > best_len) {
+                s->match_start = cur_match;
+                best_len = len;
+                if (best_len >= nice_match) return best_len;
+                offset = best_len - 1;
+                // Update scan_end
+            }
+        }
+        cur_match = prev[cur_match & wmask];
+        if (cur_match <= limit) return best_len;
+    }
+}
+```
+
+**Key parameters** that control match-finding behaviour:
+
+| Parameter | Field | Effect |
+|---|---|---|
+| `max_chain_length` | `s->max_chain_length` | Maximum hash chain entries to check |
+| `good_match` | `s->good_match` | Halve chain search above this best length |
+| `nice_match` | `s->nice_match` | Stop searching above this length |
+| `max_lazy_match` | `s->max_lazy_match` | Don't bother with lazy match above this |
+
+### `longest_match_slow()` (Level 9)
+
+The slow variant (`LONGEST_MATCH_SLOW`) adds chain re-rooting for better
+match quality:
+
+1. When continuing a lazy evaluation search, it doesn't just follow the
+   chain from the hash of the current position
+2. Instead, it finds the most distant chain starting from positions
+   `scan[1], scan[2], ..., scan[best_len]` using `update_hash_roll()`
+3. This effectively searches multiple hash chains to find matches that
+   the standard algorithm would miss
+
+```c
+// Re-root: find a more distant chain start
+hash = update_hash_roll(0, scan[1]);
+hash = update_hash_roll(hash, scan[2]);
+for (uint32_t i = 3; i <= best_len; i++) {
+    hash = update_hash_roll(hash, scan[i]);
+    pos = s->head[hash];
+    if (pos < cur_match) {
+        cur_match = pos;  // Found a more distant starting point
+    }
+}
+```
+
+### `compare256()`
+
+Compares up to 256 bytes between two pointers, returning the number of
+matching bytes. Architecture-specific implementations:
+
+| Implementation | File | Method |
+|---|---|---|
+| `compare256_c` | `arch/generic/compare256_c.c` | 8-byte word comparison |
+| `compare256_sse2` | `arch/x86/compare256_sse2.c` | 16-byte SSE2 comparison |
+| `compare256_avx2` | `arch/x86/compare256_avx2.c` | 32-byte AVX2 comparison |
+| `compare256_avx512` | `arch/x86/compare256_avx512.c` | 64-byte AVX-512 comparison |
+| `compare256_neon` | `arch/arm/compare256_neon.c` | 16-byte NEON comparison |
+
+The compare function uses `FUNCTABLE_CALL(compare256)` for dispatch.
+
+---
+
+## Configuration Table
+
+Each compression level has a set of tuning parameters defined in
+`deflate.c`:
+
+```c
+typedef struct config_s {
+    uint16_t good_length; // Reduce lazy search above this match length
+    uint16_t max_lazy;    // Do not perform lazy search above this length
+    uint16_t nice_length; // Quit search above this match length
+    uint16_t max_chain;   // Maximum hash chain length
+    compress_func func;   // Strategy function pointer
+} config;
+```
+
+The full table:
+
+| Level | good | lazy | nice | chain | Strategy |
+|---|---|---|---|---|---|
+| 0 | 0 | 0 | 0 | 0 | `deflate_stored` |
+| 1 | 0 | 0 | 0 | 0 | `deflate_quick` |
+| 2 | 4 | 4 | 8 | 4 | `deflate_fast` |
+| 3 | 4 | 6 | 16 | 6 | `deflate_medium` |
+| 4 | 4 | 12 | 32 | 24 | `deflate_medium` |
+| 5 | 8 | 16 | 32 | 32 | `deflate_medium` |
+| 6 | 8 | 16 | 128 | 128 | `deflate_medium` |
+| 7 | 8 | 32 | 128 | 256 | `deflate_slow` |
+| 8 | 32 | 128 | 258 | 1024 | `deflate_slow` |
+| 9 | 32 | 258 | 258 | 4096 | `deflate_slow` |
+
+When `NO_QUICK_STRATEGY` is defined, level 1 uses `deflate_fast` and level 2
+shifts accordingly. When `NO_MEDIUM_STRATEGY` is defined, levels 3–6 use
+`deflate_fast` or `deflate_slow`.
+
+---
+
+## Strategy Functions in Detail
+
+### `deflate_stored` (Level 0)
+
+No compression. Input is emitted as stored blocks (type 0 in DEFLATE):
+
+- Each stored block has a 5-byte header: BFINAL (1 bit), BTYPE (2 bits = 00),
+  padding to byte boundary, LEN (16 bits), NLEN (16 bits, one's complement of LEN)
+- Maximum stored block length: 65535 bytes (`MAX_STORED`)
+- Directly copies from `next_in` to `next_out` when possible
+- Falls back to buffering through the window when direct copy isn't feasible
+
+```c
+Z_INTERNAL block_state deflate_stored(deflate_state *s, int flush) {
+    unsigned min_block = MIN(s->pending_buf_size - 5, w_size);
+    // ...
+    len = MAX_STORED;
+    have = (s->bi_valid + 42) >> 3;  // Header overhead
+    // Copy blocks directly when possible
+}
+```
+
+### `deflate_quick` (Level 1)
+
+The fastest compression strategy, designed by Intel:
+
+- Uses **static Huffman trees** only (no dynamic tree construction)
+- Single-pass greedy matching with no lazy evaluation
+- Emits blocks via `zng_tr_emit_tree(s, STATIC_TREES, last)` and
+  `zng_tr_emit_end_block(s, static_ltree, last)`
+- Tracks block state: `block_open` = 0 (closed), 1 (open), 2 (open + last)
+- Flushes when `pending` approaches `pending_buf_size`
+
+```c
+Z_INTERNAL block_state deflate_quick(deflate_state *s, int flush) {
+    // Start static block
+    quick_start_block(s, last);
+
+    for (;;) {
+        // Flush if pending buffer nearly full
+        if (s->pending + ((BIT_BUF_SIZE + 7) >> 3) >= s->pending_buf_size) {
+            PREFIX(flush_pending)(s->strm);
+            // ...
+        }
+
+        // Hash insert
+        uint32_t hash_head = quick_insert_value(s, s->strstart, str_val);
+
+        // Try to find a match
+        if (dist <= MAX_DIST(s) && dist > 0) {
+            // Quick match comparison...
+            if (match_val == str_val) {
+                // Full match via longest_match or inline compare
+                zng_tr_emit_dist(s, static_ltree, static_dtree, ...);
+            }
+        }
+        // No match: emit literal
+        zng_tr_emit_lit(s, static_ltree, lc);
+    }
+}
+```
+
+### `deflate_fast` (Level 2, or 1–3 without quick)
+
+Greedy matching without lazy evaluation:
+
+1. `fill_window()` if `lookahead < MIN_LOOKAHEAD`
+2. Hash insert via `quick_insert_value()`
+3. If match found (within `MAX_DIST`, length ≥ `WANT_MIN_MATCH`):
+   - Record match via `zng_tr_tally_dist()`
+   - Insert all strings within the match region
+   - Advance `strstart` by `match_length`
+4. If no match:
+   - Record literal via `zng_tr_tally_lit()`
+   - Advance `strstart` by 1
+5. If tally function returns true (buffer full): flush block
+
+```c
+Z_INTERNAL block_state deflate_fast(deflate_state *s, int flush) {
+    for (;;) {
+        if (s->lookahead < MIN_LOOKAHEAD) {
+            PREFIX(fill_window)(s);
+            // ...
+        }
+        if (s->lookahead >= WANT_MIN_MATCH) {
+            uint32_t hash_head = quick_insert_value(s, s->strstart, str_val);
+            if (dist <= MAX_DIST(s) && dist > 0 && hash_head != 0) {
+                match_len = FUNCTABLE_CALL(longest_match)(s, hash_head);
+            }
+        }
+        if (match_len >= WANT_MIN_MATCH) {
+            bflush = zng_tr_tally_dist(s, s->strstart - s->match_start,
+                                        match_len - STD_MIN_MATCH);
+            s->lookahead -= match_len;
+            // Insert strings within match
+            s->strstart += match_len;
+        } else {
+            bflush = zng_tr_tally_lit(s, lc);
+            s->lookahead--;
+            s->strstart++;
+        }
+        if (UNLIKELY(bflush))
+            FLUSH_BLOCK(s, 0);
+    }
+}
+```
+
+### `deflate_medium` (Levels 3–6)
+
+Intel's balanced strategy that bridges fast and slow:
+
+Uses a `struct match` to track match attributes:
+```c
+struct match {
+    uint16_t match_start;
+    uint16_t match_length;
+    uint16_t strstart;
+    uint16_t orgstart;
+};
+```
+
+Key helper functions:
+- **`find_best_match()`** — Calls `longest_match()` and returns a `struct match`
+- **`emit_match()`** — Emits literals for short matches (< `WANT_MIN_MATCH`)
+  or a length/distance pair for longer ones
+- **`insert_match()`** — Inserts hash entries for matched positions
+
+The algorithm maintains a two-position lookahead:
+
+1. Find best match at current position
+2. Find best match at next position
+3. Compare: if next match is better by a meaningful margin, emit current
+   position as a literal and adopt the next match
+4. Otherwise, emit the current match
+
+```c
+Z_INTERNAL block_state deflate_medium(deflate_state *s, int flush) {
+    struct match current_match, next_match;
+
+    for (;;) {
+        current_match = find_best_match(s, hash_head);
+
+        if (current_match.match_length >= WANT_MIN_MATCH) {
+            // Check if next position has a better match
+            next_match = find_best_match(s, next_hash_head);
+            if (next_match.match_length > current_match.match_length + 1) {
+                // Skip current, use next
+                emit_match(s, literal_match);
+                insert_match(s, literal_match);
+                continue;
+            }
+        }
+        emit_match(s, current_match);
+        insert_match(s, current_match);
+    }
+}
+```
+
+### `deflate_slow` (Levels 7–9)
+
+Full lazy match evaluation — the traditional approach for maximum compression:
+
+1. Find longest match at current position
+2. **Do not emit it yet** — set `match_available = 1`
+3. Advance to next position, find another match
+4. If the new match is **not better** than the previous one:
+   - Emit the previous match (the "lazy" match)
+   - Insert all strings within the matched region
+   - Skip past the match
+5. If the new match **is better**:
+   - Emit the previous position as a literal
+   - Record the new match as `match_available`
+   - Continue from step 3
+
+```c
+Z_INTERNAL block_state deflate_slow(deflate_state *s, int flush) {
+    // Level ≥ 9: use slow match finder and rolling insert
+    if (level >= 9) {
+        longest_match = FUNCTABLE_FPTR(longest_match_slow);
+        insert_string_func = insert_string_roll;
+    }
+
+    for (;;) {
+        // Find match
+        if (dist <= MAX_DIST(s) && s->prev_length < s->max_lazy_match) {
+            match_len = longest_match(s, hash_head);
+        }
+
+        // Lazy evaluation
+        if (s->prev_length >= STD_MIN_MATCH && match_len <= s->prev_length) {
+            // Previous match was better — emit it
+            bflush = zng_tr_tally_dist(s, s->strstart - 1 - s->prev_match,
+                                        s->prev_length - STD_MIN_MATCH);
+            // Insert strings within match
+            s->prev_length -= 1;
+            s->lookahead -= s->prev_length;
+            s->strstart += s->prev_length;
+            s->prev_length = 0;
+            s->match_available = 0;
+        } else if (s->match_available) {
+            // Previous position has no good match — emit as literal
+            bflush = zng_tr_tally_lit(s, s->window[s->strstart - 1]);
+        } else {
+            s->match_available = 1;
+        }
+
+        s->prev_length = match_len;
+        s->prev_match = s->match_start;
+    }
+}
+```
+
+**Level 9 enhancements**:
+- Uses `longest_match_slow` which re-roots hash chains for deeper search
+- Uses `insert_string_roll` with a rolling hash for better distribution
+- `max_chain = 4096` provides the deepest chain traversal
+- `nice_match = 258` (maximum match length) means it never gives up early
+
+### `deflate_huff` (Z_HUFFMAN_ONLY)
+
+Huffman-only compression — every byte is emitted as a literal, no LZ77:
+
+```c
+Z_INTERNAL block_state deflate_huff(deflate_state *s, int flush) {
+    for (;;) {
+        if (s->lookahead == 0) {
+            PREFIX(fill_window)(s);
+            if (s->lookahead == 0) {
+                if (flush == Z_NO_FLUSH) return need_more;
+                break;
+            }
+        }
+        bflush = zng_tr_tally_lit(s, s->window[s->strstart]);
+        s->lookahead--;
+        s->strstart++;
+        if (bflush) FLUSH_BLOCK(s, 0);
+    }
+}
+```
+
+This forces construction of a Huffman tree that encodes only literal
+frequencies. Useful when the data is already compressed or random.
+
+### `deflate_rle` (Z_RLE)
+
+Run-length encoding — only finds runs of identical bytes (distance = 1):
+
+```c
+Z_INTERNAL block_state deflate_rle(deflate_state *s, int flush) {
+    for (;;) {
+        // Check for a run: scan[-1] == scan[0] == scan[1]
+        if (s->lookahead >= STD_MIN_MATCH && s->strstart > 0) {
+            scan = s->window + s->strstart - 1;
+            if (scan[0] == scan[1] && scan[1] == scan[2]) {
+                match_len = compare256_rle(scan, scan + 3) + 2;
+                match_len = MIN(match_len, STD_MAX_MATCH);
+            }
+        }
+
+        if (match_len >= STD_MIN_MATCH) {
+            bflush = zng_tr_tally_dist(s, 1, match_len - STD_MIN_MATCH);
+        } else {
+            bflush = zng_tr_tally_lit(s, s->window[s->strstart]);
+        }
+    }
+}
+```
+
+The `compare256_rle()` function is optimised for the RLE case where
+all bytes in the run are identical:
+
+```c
+// compare256_rle.h
+static inline uint32_t compare256_rle_64(const uint8_t *src0, const uint8_t *src1) {
+    // Read 8-byte words from src0, replicate src1[0] across 8 bytes
+    // XOR and count trailing zeros to find first difference
+}
+```
+
+---
+
+## Symbol Buffer
+
+Matches and literals are stored in a symbol buffer before being emitted:
+
+### Overlaid Format (`sym_buf`, default)
+
+Three bytes per symbol:
+- `sym_buf[i+0]`, `sym_buf[i+1]` — distance (0 for literal)
+- `sym_buf[i+2]` — literal byte or match length
+
+```c
+// zng_tr_tally_dist:
+zng_memwrite_4(&s->sym_buf[sym_next], Z_U32_TO_LE(dist | ((uint32_t)len << 16)));
+s->sym_next = sym_next + 3;
+
+// zng_tr_tally_lit:
+zng_memwrite_4(&s->sym_buf[sym_next], Z_U32_TO_LE((uint32_t)c << 16));
+s->sym_next = sym_next + 3;
+```
+
+### Separate Format (`LIT_MEM`)
+
+When `LIT_MEM` is defined (automatic when `OPTIMAL_CMP < 32`):
+```c
+uint16_t *d_buf;       // Distance buffer
+unsigned char *l_buf;  // Literal/length buffer
+
+// zng_tr_tally_dist:
+s->d_buf[sym_next] = (uint16_t)dist;
+s->l_buf[sym_next] = (uint8_t)len;
+s->sym_next = sym_next + 1;
+```
+
+This uses ~20% more memory but is 1–2% faster on platforms without fast
+unaligned access.
+
+The buffer size is `lit_bufsize` entries. When `sym_next` reaches `sym_end`,
+the block is flushed. The constant `LIT_BUFS` determines the buffer
+multiplier: 4 (overlaid) or 5 (separate).
+
+---
+
+## Block Flushing
+
+`zng_tr_flush_block()` in `trees.c` decides how to emit the accumulated
+symbols:
+
+1. **Compute tree statistics**: Build dynamic Huffman trees, compute
+   `opt_len` (dynamic tree bit cost) and `static_len` (static tree bit cost)
+2. **Compute stored cost**: Raw data length + 5 bytes overhead per block
+3. **Choose the best**:
+   - If stored cost ≤ `opt_len` and stored cost ≤ `static_len`: emit stored block
+   - Else if `static_len` ≤ `opt_len + 10`: emit with static trees
+   - Else: emit with dynamic trees
+
+```c
+void zng_tr_flush_block(deflate_state *s, char *buf, uint32_t stored_len, int last) {
+    build_tree(s, &s->l_desc);
+    build_tree(s, &s->d_desc);
+
+    if (stored_len + 4 <= opt_len && buf != NULL) {
+        // Stored block
+    } else if (s->strategy == Z_FIXED || static_len == opt_len) {
+        // Static trees
+        compress_block(s, static_ltree, static_dtree);
+    } else {
+        // Dynamic trees
+        send_all_trees(s, lcodes, dcodes, blcodes);
+        compress_block(s, s->dyn_ltree, s->dyn_dtree);
+    }
+    init_block(s);  // Reset for next block
+}
+```
+
+---
+
+## Match Length Constraints
+
+| Constant | Value | Purpose |
+|---|---|---|
+| `STD_MIN_MATCH` | 3 | Minimum match length per DEFLATE spec |
+| `STD_MAX_MATCH` | 258 | Maximum match length per DEFLATE spec |
+| `WANT_MIN_MATCH` | 4 | Internal minimum for actual match output |
+| `MIN_LOOKAHEAD` | `STD_MAX_MATCH + WANT_MIN_MATCH + 1` | Minimum lookahead before refilling window |
+| `MAX_DIST(s)` | `s->w_size - MIN_LOOKAHEAD` | Maximum back-reference distance |
+
+`WANT_MIN_MATCH = 4` is a performance optimisation: 3-byte matches provide
+minimal compression benefit but cost significant CPU time to find. By
+requiring 4-byte matches, the hash function can read a full 32-bit word
+and the match finder can use 32-bit comparisons for faster rejection.
+
+---
+
+## Block Types in Deflate Output
+
+Every DEFLATE block starts with a 3-bit header:
+- Bit 0: `BFINAL` — 1 if this is the last block
+- Bits 1–2: `BTYPE` — block type (00=stored, 01=fixed, 10=dynamic, 11=reserved)
+
+### Stored Block (BTYPE=00)
+
+```
+BFINAL BTYPE  pad   LEN    NLEN   DATA
+  1     00    0-7   16     16     LEN bytes
+```
+
+### Fixed Huffman Block (BTYPE=01)
+
+Uses predefined static trees (`static_ltree`, `static_dtree`). No tree
+data in the block — the decoder knows the fixed codes.
+
+### Dynamic Huffman Block (BTYPE=10)
+
+Contains the tree definition before the data:
+```
+BFINAL BTYPE HLIT HDIST HCLEN [Code lengths for code length alphabet]
+[Encoded literal/length code lengths] [Encoded distance code lengths]
+[Compressed data using the defined trees]
+```
+
+---
+
+## Flush Modes
+
+The `flush` parameter to `deflate()` controls output behaviour:
+
+| Value | Name | Effect |
+|---|---|---|
+| 0 | `Z_NO_FLUSH` | Normal operation; deflate decides when to emit output |
+| 1 | `Z_PARTIAL_FLUSH` | Flush output, emit empty fixed code block (10 bits) |
+| 2 | `Z_SYNC_FLUSH` | Flush output, align to byte, emit stored empty block (00 00 FF FF) |
+| 3 | `Z_FULL_FLUSH` | Like `Z_SYNC_FLUSH` but also resets compression state for random access |
+| 4 | `Z_FINISH` | Complete the stream. Returns `Z_STREAM_END` when done |
+| 5 | `Z_BLOCK` | Stop at next block boundary |
+| 6 | `Z_TREES` | Like `Z_BLOCK` but also emit trees (for `Z_TREES` flush) |
+
+Flush priority is ordered via the `RANK()` macro:
+```c
+#define RANK(f) (((f) * 2) - ((f) > 4 ? 9 : 0))
+```
+
+---
+
+## The `block_state` Return Values
+
+Each strategy function returns a `block_state`:
+
+```c
+typedef enum {
+    need_more,       // Need more input or more output space
+    block_done,      // Block was flushed
+    finish_started,  // Z_FINISH started, need more output calls
+    finish_done      // Z_FINISH complete
+} block_state;
+```
+
+`deflate()` uses these to control its outer loop and determine when to
+return to the caller.