summaryrefslogtreecommitdiff
path: root/docs/handbook/neozip/performance-tuning.md
blob: f45706e964de1e1dd88164c403a30de45d3576ec (plain)
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
# Performance Tuning

## Overview

Neozip offers multiple controls for trading compression ratio against
speed: compression level, strategy, window size, memory level, hardware
acceleration, and buffer sizing. This guide describes how each knob
affects performance and when to use them.

---

## Compression Level

The `level` parameter (0–9) selects the deflate strategy function and its
internal tuning parameters via the `configuration_table`:

```c
static const config configuration_table[10] = {
/*      good_length  lazy  nice  max_chain  func */
/* 0 */ {0,    0,  0,    0, deflate_stored},    // No compression
/* 1 */ {0,    0,  0,    0, deflate_quick},      // Fastest (Intel)
/* 2 */ {4,    4,  8,    4, deflate_fast},        // Fast greedy
/* 3 */ {4,    6, 32,   32, deflate_fast},
/* 4 */ {4,    4, 16,   16, deflate_medium},      // Balanced (Intel)
/* 5 */ {8,   16, 32,   32, deflate_medium},
/* 6 */ {8,   16,128,  128, deflate_medium},      // Default
/* 7 */ {8,   32,128,  256, deflate_slow},        // Slow lazy
/* 8 */ {32, 128,258, 1024, deflate_slow},
/* 9 */ {32, 258,258, 4096, deflate_slow},        // Maximum
};
```

| Parameter | Effect |
|---|---|
| `good_length` | Reduce match search when match ≥ this length |
| `max_lazy` | Don't try lazy match if current match ≥ this |
| `nice_length` | Stop searching once match ≥ this length |
| `max_chain` | Maximum hash chain steps to search |

### Level Selection Guide

| Use Case | Recommended Level | Rationale |
|---|---|---|
| Real-time streaming | 1 | `deflate_quick`: static Huffman, minimal search |
| Network compression | 2–3 | `deflate_fast`: greedy match, short chains |
| General purpose | 6 (default) | `deflate_medium`: good ratio/speed balance |
| Archival storage | 9 | `deflate_slow`: full lazy evaluation, deep chains |
| Pre-compressed data | 0 | `deflate_stored`: passthrough with framing |

### Speed vs. Ratio Tradeoffs

Approximate throughput (x86_64 with AVX2, single core):

| Level | Compression Speed | Ratio (typical) |
|---|---|---|
| 0 | ~5 GB/s | 1.00 (none) |
| 1 | ~800 MB/s | 2.0–2.5:1 |
| 3 | ~400 MB/s | 2.2–2.8:1 |
| 6 | ~150 MB/s | 2.5–3.2:1 |
| 9 | ~30 MB/s | 2.6–3.4:1 |

Decompression speed is largely independent of the compression level
(~1–2 GB/s), since it only depends on the encoded stream, not the search
strategy.

---

## Strategy Selection

### `Z_DEFAULT_STRATEGY` (0)

Standard DEFLATE with adaptive Huffman coding and LZ77 matching.
Best for most data types.

### `Z_FILTERED` (1)

Optimised for data produced by filters (e.g., delta encoding, integer
sequences). Uses shorter hash chains and favours Huffman coding efficiency.

### `Z_HUFFMAN_ONLY` (2)

Disables LZ77 matching entirely. Every byte is encoded as a literal.
Fast but poor compression ratio for most data. Useful when the data has
already been transformed (e.g., BWT output).

```c
// deflate_huff.c: Only emits literals
block_state deflate_huff(deflate_state *s, int flush) {
    for (;;) {
        // No match search — emit one literal per byte
        zng_tr_tally_lit(s, s->window[s->strstart]);
        s->strstart++;
        s->lookahead--;
        if (s->sym_next == s->sym_end) {
            FLUSH_BLOCK(s, 0);
        }
    }
}
```

### `Z_RLE` (3)

Run-length encoding: only matches at distance 1. Very fast for data with
repeated byte patterns:

```c
// deflate_rle.c
block_state deflate_rle(deflate_state *s, int flush) {
    // Only search for matches at distance == 1
    // Uses compare256_rle for fast run detection
    match_len = FUNCTABLE_CALL(compare256)(scan, scan - 1);
}
```

### `Z_FIXED` (4)

Forces use of static (fixed) Huffman tables for every block. Eliminates
the overhead of dynamic tree transmission. Slightly faster for small
blocks where the tree overhead dominates.

### Strategy Selection Guide

| Data Type | Strategy |
|---|---|
| General text/binary | `Z_DEFAULT_STRATEGY` |
| Numeric arrays, deltas | `Z_FILTERED` |
| Pre-transformed data | `Z_HUFFMAN_ONLY` |
| Runs of repeated bytes | `Z_RLE` |
| Very small blocks | `Z_FIXED` |
| Random/encrypted data | Level 0 (skip entirely) |

---

## Window Size (`windowBits`)

Controls the LZ77 sliding window (8–15, default 15):

| windowBits | Window Size | Memory (deflate) |
|---|---|---|
| 9 | 512 B | ~4 KB |
| 10 | 1 KB | ~8 KB |
| 11 | 2 KB | ~16 KB |
| 12 | 4 KB | ~32 KB |
| 13 | 8 KB | ~64 KB |
| 14 | 16 KB | ~128 KB |
| 15 | 32 KB | ~256 KB |

Smaller windows use less memory but find fewer long-distance matches,
reducing compression ratio. For streaming protocols with tight memory
budgets, windowBits=10–12 is a reasonable compromise.

---

## Memory Level (`memLevel`)

Controls the internal hash table and buffer sizes (1–9, default 8):

```c
#define DEF_MEM_LEVEL 8

// In deflateInit2:
s->hash_size = 1 << (memLevel + 7);  // hash_bits = memLevel + 7
s->lit_bufsize = 1 << (memLevel + 6);
```

| memLevel | Hash Table Entries | Literal Buffer | Total Memory |
|---|---|---|---|
| 1 | 256 | 128 | ~1 KB |
| 4 | 2048 | 1024 | ~16 KB |
| 8 (default) | 32768 | 16384 | ~256 KB |
| 9 | 65536 | 32768 | ~512 KB |

Higher memLevel improves hash distribution (fewer collisions) and allows
more symbols to accumulate before flushing, improving Huffman coding
efficiency.

---

## Hardware Acceleration

### Enabling SIMD

**Runtime detection** (default, recommended for distributed binaries):
```bash
cmake .. -DWITH_RUNTIME_CPU_DETECTION=ON
```

**Native compilation** (fastest, for local/dedicated use):
```bash
cmake .. -DWITH_NATIVE_INSTRUCTIONS=ON
```

This passes `-march=native` to the compiler, enabling all instructions
supported by the build machine.

### Selective Feature Control

Disable specific SIMD features:
```bash
cmake .. -DWITH_AVX512=OFF       # Avoid AVX-512 (thermal throttling concern)
cmake .. -DWITH_VPCLMULQDQ=OFF   # Disable VPCLMULQDQ CRC
cmake .. -DWITH_NEON=OFF         # Disable NEON on ARM
```

### SIMD Impact by Operation

| Operation | Scalar | Best SIMD | Speedup |
|---|---|---|---|
| Adler-32 | ~1 B/cycle | ~32 B/cycle (AVX-512+VNNI) | 32× |
| CRC-32 | ~4 B/cycle | ~64 B/cycle (VPCLMULQDQ) | 16× |
| Compare256 | ~1 B/cycle | ~16 B/cycle (AVX2) | 16× |
| Slide Hash | ~1 entry/cycle | ~32 entries/cycle (AVX-512) | 32× |
| Inflate Copy | ~1 B/cycle | ~32 B/cycle (AVX2 chunkmemset) | 32× |

---

## Buffer Sizing

### Compression Buffers

For streaming compression, the output buffer should be at least as large
as `deflateBound(sourceLen)` for the expected input chunk size:

```c
size_t out_size = deflateBound(&strm, chunk_size);
```

Larger buffers reduce system call overhead and improve throughput.

### Gzip Buffer

```c
gzbuffer(gz, size);  // Set before first read/write
```

Default `GZBUFSIZE` is 131072 (128 KB). For sequential I/O, larger
buffers (256 KB–1 MB) improve throughput by amortising I/O overhead.

### Inflate Buffers

The inflate engine benefits from output buffers ≥ 32 KB (the maximum
window size). Buffers ≥ 64 KB keep the fast path active longer (the
fast path requires ≥ 258 bytes of output space and ≥ 6 bytes of input).

---

## `deflateTune()`

Fine-tune the `configuration_table` parameters at runtime without
changing the level:

```c
int deflateTune(z_stream *strm, int good_length, int max_lazy,
                int nice_length, int max_chain);
```

Example — high-speed level 6:
```c
deflateInit(&strm, 6);
deflateTune(&strm, 4, 8, 32, 64);  // Shorter chains than default
```

Example — deeper search at level 4:
```c
deflateInit(&strm, 4);
deflateTune(&strm, 16, 64, 128, 512);  // Deeper search
```

---

## Profiling Tips

### 1. Identify the Bottleneck

Use `perf` or equivalent to identify whether compression is CPU-bound
(expect: hash lookup, match search) or I/O-bound (expect: read/write
syscalls):

```bash
perf record -g ./minigzip < large_file > /dev/null
perf report
```

Look for hot functions:
- `longest_match_*` — String matching (CPU-bound)
- `adler32_*` / `crc32_*` — Checksumming (CPU-bound)
- `slide_hash_*` — Window maintenance (CPU-bound)
- `__write` / `__read` — I/O (I/O-bound)

### 2. Verify SIMD Usage

Check which implementations are selected:

```bash
# Check for SIMD symbols in the binary
nm -D libz-ng.so | grep -E 'avx2|neon|sse|pclmul'
```

Or set a breakpoint in `init_functable()` during debugging.

### 3. Benchmark Specific Functions

Use the built-in benchmarks:
```bash
cmake .. -DWITH_BENCHMARKS=ON
cmake --build .
./benchmark_adler32 --benchmark_repetitions=5
./benchmark_compress --benchmark_filter="BM_Compress/6"
```

---

## Common Tuning Scenarios

### High-Throughput Compression (Level 1)

```c
deflateInit2(&strm, 1, Z_DEFLATED, 15 + 16, 8, Z_DEFAULT_STRATEGY);
```

Level 1 uses `deflate_quick`: no hash chain walking, static Huffman tables,
minimal overhead. Best for cases where compression speed matters more than
ratio (real-time logging, network IPC).

### Maximum Compression (Level 9)

```c
deflateInit2(&strm, 9, Z_DEFLATED, 15, 9, Z_DEFAULT_STRATEGY);
```

Level 9 + memLevel 9 provides the deepest search (`max_chain=4096`) and
largest hash table. Use for archival where decompression speed matters
but compression can be slow.

### Memory-Constrained Environment

```c
deflateInit2(&strm, 6, Z_DEFLATED, 10, 4, Z_DEFAULT_STRATEGY);
```

windowBits=10 (1KB window) + memLevel=4 gives ~16KB total memory.
Suitable for embedded systems.

### Multiple Streams in Parallel

Each `z_stream` is independent. For multi-threaded compression, create
one stream per thread:

```c
// Thread-safe: each thread has its own z_stream
#pragma omp parallel for
for (int i = 0; i < num_chunks; i++) {
    z_stream strm = {};
    deflateInit(&strm, 6);
    // compress chunk[i]
    deflateEnd(&strm);
}
```

The `functable` initialisation is thread-safe (atomic init flag), so
the first call from any thread will safely initialise SIMD dispatch.