Diffstat (limited to 'docs/handbook/neozip/checksum-algorithms.md')
 -rw-r--r--  docs/handbook/neozip/checksum-algorithms.md | 461
 1 file changed, 461 insertions, 0 deletions
diff --git a/docs/handbook/neozip/checksum-algorithms.md b/docs/handbook/neozip/checksum-algorithms.md
new file mode 100644
index 0000000000..b21504c5e3
--- /dev/null
+++ b/docs/handbook/neozip/checksum-algorithms.md
@@ -0,0 +1,461 @@

# Checksum Algorithms

## Overview

Neozip implements two checksum algorithms used by the DEFLATE family of
compression formats:

- **Adler-32**: A fast checksum used in the zlib container (RFC 1950)
- **CRC-32**: A more robust check used in the gzip container (RFC 1952)

Both algorithms have SIMD-accelerated implementations across x86, ARM,
Power, RISC-V, s390, and LoongArch architectures.

---

## Adler-32

### Algorithm

Adler-32 is defined in RFC 1950. It consists of two running sums:

- **s1**: Sum of all bytes (mod BASE)
- **s2**: Sum of all intermediate s1 values (mod BASE)

```c
#define BASE 65521U // Largest prime less than 65536
#define NMAX 5552   // Largest n where 255n(n+1)/2 + (n+1)(BASE-1) <= 2^32 - 1
```

The `NMAX` constant determines how many bytes can be accumulated before a
modular reduction is required to prevent 32-bit overflow.

### Scalar Implementation

From `adler32.c`:

```c
Z_INTERNAL uint32_t adler32_c(uint32_t adler, const uint8_t *buf, size_t len) {
    uint32_t sum2 = (adler >> 16) & 0xffff;
    adler &= 0xffff;

    if (len == 1) {
        adler += buf[0];
        if (adler >= BASE) adler -= BASE;
        sum2 += adler;
        if (sum2 >= BASE) sum2 -= BASE;
        return adler | (sum2 << 16);
    }

    // Split into NMAX-sized blocks
    while (len >= NMAX) {
        len -= NMAX;
        unsigned n = NMAX / 16;
        do {
            // Unrolled: ADLER_DO16 handles 16 bytes per iteration
            ADLER_DO16(buf);
            buf += 16;
        } while (--n);
        MOD(adler); // adler %= BASE
        MOD(sum2);
    }

    // Process remaining bytes
    while (len >= 16) {
        len -= 16;
        ADLER_DO16(buf);
        buf += 16;
    }
    while (len--) {
        adler += *buf++;
        sum2 += adler;
    }
    MOD(adler);
    MOD(sum2);
    return adler | (sum2 << 16);
}
```

### Accumulation Macros

From `adler32_p.h`:

```c
#define ADLER_DO1(buf)  { adler += *(buf); sum2 += adler; }
#define ADLER_DO2(buf)  ADLER_DO1(buf); ADLER_DO1(buf + 1)
#define ADLER_DO4(buf)  ADLER_DO2(buf); ADLER_DO2(buf + 2)
#define ADLER_DO8(buf)  ADLER_DO4(buf); ADLER_DO4(buf + 4)
#define ADLER_DO16(buf) ADLER_DO8(buf); ADLER_DO8(buf + 8)
```

### Modular Reduction

```c
#define MOD(a)  a %= BASE
#define MOD4(a) a %= BASE
```

`BASE` being prime matters for the checksum's error-detection strength, not
for the reduction itself; the plain modulo is acceptable here because `NMAX`
makes it rare. On architectures where integer division is expensive, the sums
can instead be reduced by conditionally subtracting `BASE`, as the
single-byte path above does.

### Combining Adler-32 Checksums

`adler32_combine_()` merges two Adler-32 checksums from adjacent data
segments without accessing the original data:

```c
static uint32_t adler32_combine_(uint32_t adler1, uint32_t adler2, z_off64_t len2) {
    uint32_t sum1, sum2;
    unsigned rem;

    // modular arithmetic to combine:
    //   s1_combined = (s1_a + s1_b - 1) % BASE
    //   s2_combined = (s2_a + s2_b + s1_a * len2 - len2) % BASE
    rem = (unsigned)(len2 % BASE);
    sum1 = adler1 & 0xffff;
    sum2 = rem * sum1;
    MOD(sum2);
    sum1 += (adler2 & 0xffff) + BASE - 1;
    sum2 += ((adler1 >> 16) & 0xffff) + ((adler2 >> 16) & 0xffff) + BASE - rem;
    if (sum1 >= BASE) sum1 -= BASE;
    if (sum1 >= BASE) sum1 -= BASE;
    if (sum2 >= ((unsigned long)BASE << 1)) sum2 -= ((unsigned long)BASE << 1);
    if (sum2 >= BASE) sum2 -= BASE;
    return sum1 | (sum2 << 16);
}
```

### SIMD Implementations

SIMD Adler-32 uses parallel accumulation with dot products:

**AVX2** (`arch/x86/adler32_avx2.c`):
```c
Z_INTERNAL uint32_t adler32_avx2(uint32_t adler, const uint8_t *buf, size_t len) {
    static const uint8_t dot2v[32] = {32, 31, 30, ..., 1}; // Position weights
    static const int16_t dot3v[16] = {1, 1, 1, ..., 1};    // Horizontal-add weights (all ones)
    const __m256i vdot2v = _mm256_load_si256((const __m256i *)dot2v);
    const __m256i vones  = _mm256_load_si256((const __m256i *)dot3v);
    __m256i vbuf, vs1, vs2, vs1_0, vs3;

    vs1 = _mm256_set_epi32(0, 0, 0, 0, 0, 0, 0, adler & 0xffff);
    vs2 = _mm256_set_epi32(0, 0, 0, 0, 0, 0, 0, adler >> 16);
    vs1_0 = vs1;

    while (len >= 32) {
        vs1_0 = vs1;
        // Load 32 bytes
        vbuf = _mm256_loadu_si256((__m256i*)buf);
        // sum1 += bytes[0..31]
        vs1 = _mm256_add_epi32(vs1, _mm256_sad_epu8(vbuf, _mm256_setzero_si256()));
        // vs3 = dot product: sum2 += (32-i)*byte[i]
        vs3 = _mm256_maddubs_epi16(vbuf, vdot2v);
        // ... accumulate vs2
        vs2 = _mm256_add_epi32(vs2, _mm256_madd_epi16(vs3, vones));
        // vs2 += 32 * previous_vs1
        vs2 = _mm256_add_epi32(vs2, _mm256_slli_epi32(vs1_0, 5));
        buf += 32;
        len -= 32;
    }
    // Horizontal reduction and modular reduction
    ...
}
```

The key insight: instead of computing `sum2 += s1_n` for each byte n
individually, SIMD computes `sum2 += k * byte[i]` via `_mm256_maddubs_epi16()`,
where k represents the positional weight.

**Available SIMD variants**:

| Architecture | Implementation | Vector Width |
|---|---|---|
| x86 SSE4.1 | `adler32_sse41.c` | 128-bit |
| x86 SSSE3 | `adler32_ssse3.c` | 128-bit |
| x86 AVX2 | `adler32_avx2.c` | 256-bit |
| x86 AVX-512 | `adler32_avx512.c` | 512-bit |
| x86 AVX-512+VNNI | `adler32_avx512_vnni.c` | 512-bit |
| ARM NEON | `adler32_neon.c` | 128-bit |
| Power VMX (Altivec) | `adler32_vmx.c` | 128-bit |
| Power8 | `adler32_power8.c` | 128-bit |
| RISC-V RVV | `adler32_rvv.c` | Scalable |
| LoongArch LASX | `adler32_lasx.c` | 256-bit |

### Adler-32 with Copy

`adler32_copy()` computes Adler-32 while simultaneously copying data,
fusing two memory passes into one:

```c
typedef uint32_t (*adler32_copy_func)(uint32_t adler, uint8_t *dst,
                                      const uint8_t *src, size_t len);
```

This is used during inflate to compute the checksum while copying
decompressed data to the output buffer.

---

## CRC-32

### Algorithm

CRC-32 uses the standard polynomial 0xEDB88320 (reflected form):

```c
#define POLY 0xedb88320 // CRC-32 polynomial (reversed)
```

### Braided CRC-32

The default software implementation uses a "braided" algorithm that
processes multiple bytes per step using interleaved CRC tables:

```c
#define BRAID_N 5 // Number of interleaved CRC computations
#define BRAID_W 8 // Bytes per word (8 for 64-bit, 4 for 32-bit)
```

From `crc32_braid_p.h`, the braided approach processes 5 words (40 bytes
on 64-bit) per iteration:

```c
// Braided CRC processing (conceptual)
// Process BRAID_N words at a time:
z_word_t braids[BRAID_N];

// Load BRAID_N words from input
for (int k = 0; k < BRAID_N; k++)
    braids[k] = *(z_word_t *)(buf + k * BRAID_W);

// For each word, XOR with running CRC then look up table
for (int k = 0; k < BRAID_N; k++) {
    z_word_t word = braids[k];
    // CRC-fold using braid tables:
    //   crc = crc_braid_table[N-1-k][byte0] ^ ... ^ crc_braid_table[0][byteN-1]
}
```

The braid tables are precomputed in `crc32_braid_tbl.h`.

### Chorba CRC-32

A newer CRC-32 variant uses a "Chorba" reduction technique for even faster
software CRC computation. It is selected for inputs of 256 bytes or more:

```c
Z_INTERNAL uint32_t crc32_braid(uint32_t crc, const uint8_t *buf, size_t len) {
    // Short paths for small inputs
    if (len < 64) {
        return crc32_small(crc, buf, len);
    }
    // For lengths >= threshold, use Chorba
    if (len >= 256) {
        return crc32_chorba(crc, buf, len);
    }
    // Otherwise use braided
    ...
}
```

### SIMD CRC-32 Implementations

Hardware-accelerated CRC-32 is available on these architectures:

| Architecture | Instruction | File |
|---|---|---|
| x86 (PCLMULQDQ) | Carry-less multiply | `crc32_pclmulqdq.c` |
| x86 (VPCLMULQDQ) | AVX-512 carry-less multiply | `crc32_vpclmulqdq.c` |
| ARM (CRC32) | CRC32W/CRC32B instructions | `crc32_acle.c` |
| ARM (PMULL) | Polynomial multiply long | `crc32_pmull.c` |
| Power8 | Vector carry-less multiply | `crc32_power8.c` |
| s390 (CRC32) | DFLTCC or hardware CRC | `crc32_vx.c` |
| RISC-V | Zbc carry-less multiply | `crc32_rvv.c` |

**x86 PCLMULQDQ** (`arch/x86/crc32_pclmulqdq.c`):
Folds 64 bytes at a time via carry-less multiplication, with a final
Barrett reduction down to a 32-bit CRC:

```c
Z_INTERNAL uint32_t crc32_pclmulqdq(uint32_t crc32, const uint8_t *buf, size_t len) {
    __m128i xmm_crc0, xmm_crc1, xmm_crc2, xmm_crc3;
    __m128i xmm_fold4; // Fold constant

    // Initialize with CRC and first 64 bytes
    xmm_crc0 = _mm_loadu_si128((__m128i *)buf);
    xmm_crc0 = _mm_xor_si128(xmm_crc0, _mm_cvtsi32_si128(crc32));
    // ... load crc1, crc2, crc3

    // Main fold loop: process 64 bytes per iteration
    while (len >= 64) {
        // Fold: crc_n = pclmulqdq(crc_n, fold_constant) ^ next_data
        xmm_crc0 = _mm_xor_si128(
            _mm_clmulepi64_si128(xmm_crc0, xmm_fold4, 0x01),
            _mm_clmulepi64_si128(xmm_crc0, xmm_fold4, 0x10));
        xmm_crc0 = _mm_xor_si128(xmm_crc0, _mm_loadu_si128(next++));
        // Repeat for crc1..crc3
    }

    // Final reduction to 32-bit CRC
    // Barrett reduction using mu and polynomial constants
}
```

This processes data at roughly 16 bytes per cycle on modern x86 hardware.

### CRC-32 with Copy

Like Adler-32, CRC-32 has a combined compute-and-copy variant:

```c
typedef uint32_t (*crc32_copy_func)(uint32_t crc, uint8_t *dst,
                                    const uint8_t *src, size_t len);
```

This fuses the CRC computation with the `memcpy`, so the cache lines
loaded for copying also feed the CRC calculation.

### Combining CRC-32 Values

```c
uint32_t crc32_combine(uint32_t crc1, uint32_t crc2, z_off_t len2);
uint32_t crc32_combine_gen(z_off_t len2);
uint32_t crc32_combine_op(uint32_t crc1, uint32_t crc2, uint32_t op);
```

The two-phase combine lets callers pre-compute the combination operator for a
known second-segment length, then apply it to multiple CRC pairs.

---

## Dispatch via `functable`

Checksum functions are dispatched through the `functable_s` structure:

```c
struct functable_s {
    adler32_func      adler32;
    adler32_copy_func adler32_copy;
    compare256_func   compare256;
    crc32_func        crc32;
    crc32_copy_func   crc32_copy;
    // ... other function pointers
};
```

`functable.c` selects the best implementation at runtime:

```c
// x86 dispatch cascade for adler32:
#ifdef X86_SSE41
    if (cf.x86.has_sse41)
        functable.adler32 = adler32_sse41;
#endif
#ifdef X86_AVX2
    if (cf.x86.has_avx2)
        functable.adler32 = adler32_avx2;
#endif
#ifdef X86_AVX512
    if (cf.x86.has_avx512)
        functable.adler32 = adler32_avx512;
#endif
#ifdef X86_AVX512VNNI
    if (cf.x86.has_avx512vnni)
        functable.adler32 = adler32_avx512_vnni;
#endif
```

Each architecture-specific source file is compiled separately with its
required SIMD flags (e.g., `-mavx2`, `-mpclmul`).

---

## Function Table API

### Public API

```c
uint32_t PREFIX(adler32)(uint32_t adler, const uint8_t *buf, uint32_t len);
uint32_t PREFIX(crc32)(uint32_t crc, const uint8_t *buf, uint32_t len);
```

For zlib compatibility, `adler32_z()` and `crc32_z()` accept a `size_t` length:

```c
uint32_t PREFIX(adler32_z)(uint32_t adler, const uint8_t *buf, size_t len);
uint32_t PREFIX(crc32_z)(uint32_t crc, const uint8_t *buf, size_t len);
```

### Initial Values

- Adler-32: `adler32(0, NULL, 0)` returns `1` (initial value)
- CRC-32: `crc32(0, NULL, 0)` returns `0` (initial value)

### Typical Usage

```c
uint32_t checksum = PREFIX(adler32)(0L, Z_NULL, 0);
checksum = PREFIX(adler32)(checksum, data, data_len);
// checksum now holds the Adler-32 of data[0..data_len-1]
```

---

## Performance Characteristics

### Adler-32

| Implementation | Throughput (approximate) |
|---|---|
| Scalar C | ~1 byte/cycle |
| SSE4.1 | ~8 bytes/cycle |
| AVX2 | ~16 bytes/cycle |
| AVX-512+VNNI | ~32 bytes/cycle |
| ARM NEON | ~8 bytes/cycle |

### CRC-32

| Implementation | Throughput (approximate) |
|---|---|
| Braided (scalar) | ~4 bytes/cycle |
| PCLMULQDQ | ~16 bytes/cycle |
| VPCLMULQDQ (AVX-512) | ~64 bytes/cycle |
| ARM CRC32 | ~4 bytes/cycle |
| ARM PMULL | ~16 bytes/cycle |

CRC-32 is computationally heavier than Adler-32, but hardware acceleration
closes the gap significantly.

---

## Checksum in the Compression Pipeline

### During Deflate

In `deflate.c`, checksums are computed on the input data:

```c
if (s->wrap == 2) {
    // gzip: CRC-32
    strm->adler = FUNCTABLE_CALL(crc32)(strm->adler, strm->next_in, strm->avail_in);
} else if (s->wrap == 1) {
    // zlib: Adler-32
    strm->adler = FUNCTABLE_CALL(adler32)(strm->adler, strm->next_in, strm->avail_in);
}
```

### During Inflate

In `inflate.c`, checksums are computed on the output data:

```c
static inline void inf_chksum(PREFIX3(stream) *strm, const uint8_t *buf, uint32_t len) {
    struct inflate_state *state = (struct inflate_state *)strm->state;
    if (state->flags)
        strm->adler = state->check = FUNCTABLE_CALL(crc32)(state->check, buf, len);
    else
        strm->adler = state->check = FUNCTABLE_CALL(adler32)(state->check, buf, len);
}
```

The `_copy` variants (`inf_chksum_cpy`) are preferred when data is being
both checksummed and copied, as they fuse the two operations.
