Diffstat (limited to 'docs/handbook/neozip/checksum-algorithms.md')
 -rw-r--r--  docs/handbook/neozip/checksum-algorithms.md | 461
 1 file changed, 461 insertions, 0 deletions
diff --git a/docs/handbook/neozip/checksum-algorithms.md b/docs/handbook/neozip/checksum-algorithms.md
new file mode 100644
index 0000000000..b21504c5e3
--- /dev/null
+++ b/docs/handbook/neozip/checksum-algorithms.md
@@ -0,0 +1,461 @@

# Checksum Algorithms

## Overview

Neozip implements two checksum algorithms used by the DEFLATE family of
compression formats:

- **Adler-32**: A fast checksum used in the zlib container (RFC 1950)
- **CRC-32**: A more robust check used in the gzip container (RFC 1952)

Both algorithms have SIMD-accelerated implementations across x86, ARM,
Power, RISC-V, s390, and LoongArch architectures.

---

## Adler-32

### Algorithm

Adler-32 is defined in RFC 1950. It consists of two running sums:

- **s1**: Sum of all bytes (mod BASE)
- **s2**: Sum of all intermediate s1 values (mod BASE)

```c
#define BASE 65521U // Largest prime less than 65536
#define NMAX 5552   // Largest n where 255n(n+1)/2 + (n+1)(BASE-1) <= 2^32 - 1
```

The `NMAX` constant determines how many bytes can be accumulated before a
modular reduction is required to prevent 32-bit overflow.

### Scalar Implementation

From `adler32.c`:

```c
Z_INTERNAL uint32_t adler32_c(uint32_t adler, const uint8_t *buf, size_t len) {
    uint32_t sum2 = (adler >> 16) & 0xffff;
    adler &= 0xffff;

    if (len == 1) {
        adler += buf[0];
        if (adler >= BASE) adler -= BASE;
        sum2 += adler;
        if (sum2 >= BASE) sum2 -= BASE;
        return adler | (sum2 << 16);
    }

    // Split into NMAX-sized blocks
    while (len >= NMAX) {
        len -= NMAX;
        unsigned n = NMAX / 16;
        do {
            // Unrolled: ADLER_DO16 handles 16 bytes per iteration
            ADLER_DO16(buf);
            buf += 16;
        } while (--n);
        MOD(adler); // adler %= BASE
        MOD(sum2);
    }

    // Process remaining bytes
    while (len >= 16) {
        len -= 16;
        ADLER_DO16(buf);
        buf += 16;
    }
    while (len--) {
        adler += *buf++;
        sum2 += adler;
    }
    MOD(adler);
    MOD(sum2);
    return adler | (sum2 << 16);
}
```

### Accumulation Macros

From `adler32_p.h`:

```c
#define ADLER_DO1(buf)  { adler += *(buf); sum2 += adler; }
#define ADLER_DO2(buf)  ADLER_DO1(buf); ADLER_DO1(buf + 1)
#define ADLER_DO4(buf)  ADLER_DO2(buf); ADLER_DO2(buf + 2)
#define ADLER_DO8(buf)  ADLER_DO4(buf); ADLER_DO4(buf + 4)
#define ADLER_DO16(buf) ADLER_DO8(buf); ADLER_DO8(buf + 8)
```

### Modular Reduction

```c
#define MOD(a)  a %= BASE
#define MOD4(a) a %= BASE
```

`BASE` being prime matters for the checksum's error-detection strength, not
for the reduction itself; the plain modulo is acceptable here because `NMAX`
makes it rare. On architectures where integer division is expensive, the sums
can instead be reduced by conditionally subtracting `BASE`, as the
single-byte path above does.

### Combining Adler-32 Checksums

`adler32_combine_()` merges two Adler-32 checksums from adjacent data
segments without accessing the original data:

```c
static uint32_t adler32_combine_(uint32_t adler1, uint32_t adler2, z_off64_t len2) {
    uint32_t sum1, sum2;
    unsigned rem;

    // modular arithmetic to combine:
    //   s1_combined = (s1_a + s1_b - 1) % BASE
    //   s2_combined = (s2_a + s2_b + s1_a * len2 - len2) % BASE
    rem = (unsigned)(len2 % BASE);
    sum1 = adler1 & 0xffff;
    sum2 = rem * sum1;
    MOD(sum2);
    sum1 += (adler2 & 0xffff) + BASE - 1;
    sum2 += ((adler1 >> 16) & 0xffff) + ((adler2 >> 16) & 0xffff) + BASE - rem;
    if (sum1 >= BASE) sum1 -= BASE;
    if (sum1 >= BASE) sum1 -= BASE;
    if (sum2 >= ((unsigned long)BASE << 1)) sum2 -= ((unsigned long)BASE << 1);
    if (sum2 >= BASE) sum2 -= BASE;
    return sum1 | (sum2 << 16);
}
```

### SIMD Implementations

SIMD Adler-32 uses parallel accumulation with dot products:

**AVX2** (`arch/x86/adler32_avx2.c`):
```c
Z_INTERNAL uint32_t adler32_avx2(uint32_t adler, const uint8_t *buf, size_t len) {
    static const uint8_t dot2v[32] = {32, 31, 30, ..., 1}; // Position weights
    static const int16_t dot3v[16] = {1, 1, 1, ..., 1};    // Horizontal-add weights (all ones)
    const __m256i vdot2v = _mm256_load_si256((const __m256i *)dot2v);
    const __m256i vones  = _mm256_load_si256((const __m256i *)dot3v);
    __m256i vbuf, vs1, vs2, vs1_0, vs3;

    vs1 = _mm256_set_epi32(0, 0, 0, 0, 0, 0, 0, adler & 0xffff);
    vs2 = _mm256_set_epi32(0, 0, 0, 0, 0, 0, 0, adler >> 16);
    vs1_0 = vs1;

    while (len >= 32) {
        vs1_0 = vs1;
        // Load 32 bytes
        vbuf = _mm256_loadu_si256((__m256i*)buf);
        // sum1 += bytes[0..31]
        vs1 = _mm256_add_epi32(vs1, _mm256_sad_epu8(vbuf, _mm256_setzero_si256()));
        // vs3 = dot product: sum2 += (32-i)*byte[i]
        vs3 = _mm256_maddubs_epi16(vbuf, vdot2v);
        // ... accumulate vs2
        vs2 = _mm256_add_epi32(vs2, _mm256_madd_epi16(vs3, vones));
        // vs2 += 32 * previous_vs1
        vs2 = _mm256_add_epi32(vs2, _mm256_slli_epi32(vs1_0, 5));
        buf += 32;
        len -= 32;
    }
    // Horizontal reduction and modular reduction
    ...
}
```

The key insight: instead of computing `sum2 += s1_n` for each byte n
individually, SIMD computes `sum2 += k * byte[i]` via `_mm256_maddubs_epi16()`,
where k represents the positional weight.

**Available SIMD variants**:

| Architecture | Implementation | Vector Width |
|---|---|---|
| x86 SSE4.1 | `adler32_sse41.c` | 128-bit |
| x86 SSSE3 | `adler32_ssse3.c` | 128-bit |
| x86 AVX2 | `adler32_avx2.c` | 256-bit |
| x86 AVX-512 | `adler32_avx512.c` | 512-bit |
| x86 AVX-512+VNNI | `adler32_avx512_vnni.c` | 512-bit |
| ARM NEON | `adler32_neon.c` | 128-bit |
| Power VMX (Altivec) | `adler32_vmx.c` | 128-bit |
| Power8 | `adler32_power8.c` | 128-bit |
| RISC-V RVV | `adler32_rvv.c` | Scalable |
| LoongArch LASX | `adler32_lasx.c` | 256-bit |

### Adler-32 with Copy

`adler32_copy()` computes Adler-32 while simultaneously copying data,
fusing two memory passes into one:

```c
typedef uint32_t (*adler32_copy_func)(uint32_t adler, uint8_t *dst,
                                      const uint8_t *src, size_t len);
```

This is used during inflate to compute the checksum while copying
decompressed data to the output buffer.

---

## CRC-32

### Algorithm

CRC-32 uses the standard polynomial 0xEDB88320 (reflected form):

```c
#define POLY 0xedb88320 // CRC-32 polynomial (reversed)
```

### Braided CRC-32

The default software implementation uses a "braided" algorithm that
processes multiple bytes per step using interleaved CRC tables:

```c
#define BRAID_N 5 // Number of interleaved CRC computations
#define BRAID_W 8 // Bytes per word (8 for 64-bit, 4 for 32-bit)
```

From `crc32_braid_p.h`, the braided approach processes 5 words (40 bytes
on 64-bit) per iteration:

```c
// Braided CRC processing (conceptual)
// Process BRAID_N words at a time:
z_word_t braids[BRAID_N];

// Load BRAID_N words from input
for (int k = 0; k < BRAID_N; k++)
    braids[k] = *(z_word_t *)(buf + k * BRAID_W);

// For each word, XOR with running CRC then look up table
for (int k = 0; k < BRAID_N; k++) {
    z_word_t word = braids[k];
    // CRC-fold using braid tables:
    //   crc = crc_braid_table[N-1-k][byte0] ^ ... ^ crc_braid_table[0][byteN-1]
}
```

The braid tables are precomputed in `crc32_braid_tbl.h`.

### Chorba CRC-32

A newer CRC-32 variant uses a "Chorba" reduction technique for even faster
software CRC computation. It is selected for inputs of 256 bytes or more:

```c
Z_INTERNAL uint32_t crc32_braid(uint32_t crc, const uint8_t *buf, size_t len) {
    // Short paths for small inputs
    if (len < 64) {
        return crc32_small(crc, buf, len);
    }
    // For lengths >= threshold, use Chorba
    if (len >= 256) {
        return crc32_chorba(crc, buf, len);
    }
    // Otherwise use braided
    ...
}
```

### SIMD CRC-32 Implementations

Hardware-accelerated CRC-32 is available on these architectures:

| Architecture | Instruction | File |
|---|---|---|
| x86 (PCLMULQDQ) | Carry-less multiply | `crc32_pclmulqdq.c` |
| x86 (VPCLMULQDQ) | AVX-512 carry-less multiply | `crc32_vpclmulqdq.c` |
| ARM (CRC32) | CRC32W/CRC32B instructions | `crc32_acle.c` |
| ARM (PMULL) | Polynomial multiply long | `crc32_pmull.c` |
| Power8 | Vector carry-less multiply | `crc32_power8.c` |
| s390 (CRC32) | DFLTCC or hardware CRC | `crc32_vx.c` |
| RISC-V | Zbc carry-less multiply | `crc32_rvv.c` |

**x86 PCLMULQDQ** (`arch/x86/crc32_pclmulqdq.c`):
Folds 64 bytes at a time via carry-less multiplication, with a final
Barrett reduction down to a 32-bit CRC:

```c
Z_INTERNAL uint32_t crc32_pclmulqdq(uint32_t crc32, const uint8_t *buf, size_t len) {
    __m128i xmm_crc0, xmm_crc1, xmm_crc2, xmm_crc3;
    __m128i xmm_fold4; // Fold constant

    // Initialize with CRC and first 64 bytes
    xmm_crc0 = _mm_loadu_si128((__m128i *)buf);
    xmm_crc0 = _mm_xor_si128(xmm_crc0, _mm_cvtsi32_si128(crc32));
    // ... load crc1, crc2, crc3

    // Main fold loop: process 64 bytes per iteration
    while (len >= 64) {
        // Fold: crc_n = pclmulqdq(crc_n, fold_constant) ^ next_data
        xmm_crc0 = _mm_xor_si128(
            _mm_clmulepi64_si128(xmm_crc0, xmm_fold4, 0x01),
            _mm_clmulepi64_si128(xmm_crc0, xmm_fold4, 0x10));
        xmm_crc0 = _mm_xor_si128(xmm_crc0, _mm_loadu_si128(next++));
        // Repeat for crc1..crc3
    }

    // Final reduction to 32-bit CRC
    // Barrett reduction using mu and polynomial constants
}
```

This processes data at roughly 16 bytes per cycle on modern x86 hardware.

### CRC-32 with Copy

Like Adler-32, CRC-32 has a combined compute-and-copy variant:

```c
typedef uint32_t (*crc32_copy_func)(uint32_t crc, uint8_t *dst,
                                    const uint8_t *src, size_t len);
```

This fuses the CRC computation with the `memcpy`, so the cache lines
loaded for copying also feed the CRC calculation.

### Combining CRC-32 Values

```c
uint32_t crc32_combine(uint32_t crc1, uint32_t crc2, z_off_t len2);
uint32_t crc32_combine_gen(z_off_t len2);
uint32_t crc32_combine_op(uint32_t crc1, uint32_t crc2, uint32_t op);
```

The two-phase combine lets callers pre-compute the combination operator for a
known second-segment length, then apply it to multiple CRC pairs.

---

## Dispatch via `functable`

Checksum functions are dispatched through the `functable_s` structure:

```c
struct functable_s {
    adler32_func      adler32;
    adler32_copy_func adler32_copy;
    compare256_func   compare256;
    crc32_func        crc32;
    crc32_copy_func   crc32_copy;
    // ... other function pointers
};
```

`functable.c` selects the best implementation at runtime:

```c
// x86 dispatch cascade for adler32:
#ifdef X86_SSE41
    if (cf.x86.has_sse41)
        functable.adler32 = adler32_sse41;
#endif
#ifdef X86_AVX2
    if (cf.x86.has_avx2)
        functable.adler32 = adler32_avx2;
#endif
#ifdef X86_AVX512
    if (cf.x86.has_avx512)
        functable.adler32 = adler32_avx512;
#endif
#ifdef X86_AVX512VNNI
    if (cf.x86.has_avx512vnni)
        functable.adler32 = adler32_avx512_vnni;
#endif
```

Each architecture-specific source file is compiled separately with its
required SIMD flags (e.g., `-mavx2`, `-mpclmul`).

---

## Function Table API

### Public API

```c
uint32_t PREFIX(adler32)(uint32_t adler, const uint8_t *buf, uint32_t len);
uint32_t PREFIX(crc32)(uint32_t crc, const uint8_t *buf, uint32_t len);
```

For zlib compatibility, `adler32_z()` and `crc32_z()` accept a `size_t` length:

```c
uint32_t PREFIX(adler32_z)(uint32_t adler, const uint8_t *buf, size_t len);
uint32_t PREFIX(crc32_z)(uint32_t crc, const uint8_t *buf, size_t len);
```

### Initial Values

- Adler-32: `adler32(0, NULL, 0)` returns `1` (initial value)
- CRC-32: `crc32(0, NULL, 0)` returns `0` (initial value)

### Typical Usage

```c
uint32_t checksum = PREFIX(adler32)(0L, Z_NULL, 0);
checksum = PREFIX(adler32)(checksum, data, data_len);
// checksum now holds the Adler-32 of data[0..data_len-1]
```

---

## Performance Characteristics

### Adler-32

| Implementation | Throughput (approximate) |
|---|---|
| Scalar C | ~1 byte/cycle |
| SSE4.1 | ~8 bytes/cycle |
| AVX2 | ~16 bytes/cycle |
| AVX-512+VNNI | ~32 bytes/cycle |
| ARM NEON | ~8 bytes/cycle |

### CRC-32

| Implementation | Throughput (approximate) |
|---|---|
| Braided (scalar) | ~4 bytes/cycle |
| PCLMULQDQ | ~16 bytes/cycle |
| VPCLMULQDQ (AVX-512) | ~64 bytes/cycle |
| ARM CRC32 | ~4 bytes/cycle |
| ARM PMULL | ~16 bytes/cycle |

CRC-32 is computationally heavier than Adler-32, but hardware acceleration
closes the gap significantly.

---

## Checksum in the Compression Pipeline

### During Deflate

In `deflate.c`, checksums are computed on the input data:

```c
if (s->wrap == 2) {
    // gzip: CRC-32
    strm->adler = FUNCTABLE_CALL(crc32)(strm->adler, strm->next_in, strm->avail_in);
} else if (s->wrap == 1) {
    // zlib: Adler-32
    strm->adler = FUNCTABLE_CALL(adler32)(strm->adler, strm->next_in, strm->avail_in);
}
```

### During Inflate

In `inflate.c`, checksums are computed on the output data:

```c
static inline void inf_chksum(PREFIX3(stream) *strm, const uint8_t *buf, uint32_t len) {
    struct inflate_state *state = (struct inflate_state *)strm->state;
    if (state->flags)
        strm->adler = state->check = FUNCTABLE_CALL(crc32)(state->check, buf, len);
    else
        strm->adler = state->check = FUNCTABLE_CALL(adler32)(state->check, buf, len);
}
```

The `_copy` variants (`inf_chksum_cpy`) are preferred when data is being
both checksummed and copied, as they fuse the two operations.
