# Checksum Algorithms ## Overview Neozip implements two checksum algorithms used by the DEFLATE family of compression formats: - **Adler-32**: A fast checksum used in the zlib container (RFC 1950) - **CRC-32**: A more robust check used in the gzip container (RFC 1952) Both algorithms have SIMD-accelerated implementations across x86, ARM, Power, RISC-V, s390, and LoongArch architectures. --- ## Adler-32 ### Algorithm Adler-32 is defined in RFC 1950. It consists of two running sums: - **s1**: Sum of all bytes (mod BASE) - **s2**: Sum of all intermediate s1 values (mod BASE) ```c #define BASE 65521U // Largest prime less than 65536 #define NMAX 5552 // Largest n where 255n(n+1)/2 + (n+1)(BASE-1) <= 2^32 - 1 ``` The `NMAX` constant determines how many bytes can be accumulated before a modular reduction is required to prevent 32-bit overflow. ### Scalar Implementation From `adler32.c`: ```c Z_INTERNAL uint32_t adler32_c(uint32_t adler, const uint8_t *buf, size_t len) { uint32_t sum2 = (adler >> 16) & 0xffff; adler &= 0xffff; if (len == 1) { adler += buf[0]; if (adler >= BASE) adler -= BASE; sum2 += adler; if (sum2 >= BASE) sum2 -= BASE; return adler | (sum2 << 16); } // Split into NMAX-sized blocks while (len >= NMAX) { len -= NMAX; unsigned n = NMAX / 16; do { // Unrolled: 16 ADLER_DO per iteration ADLER_DO16(buf); buf += 16; } while (--n); MOD(adler); // adler %= BASE MOD(sum2); } // Process remaining bytes while (len >= 16) { len -= 16; ADLER_DO16(buf); buf += 16; } while (len--) { adler += *buf++; sum2 += adler; } MOD(adler); MOD(sum2); return adler | (sum2 << 16); } ``` ### Accumulation Macros From `adler32_p.h`: ```c #define ADLER_DO1(buf) { adler += *(buf); sum2 += adler; } #define ADLER_DO2(buf) ADLER_DO1(buf); ADLER_DO1(buf + 1) #define ADLER_DO4(buf) ADLER_DO2(buf); ADLER_DO2(buf + 2) #define ADLER_DO8(buf) ADLER_DO4(buf); ADLER_DO4(buf + 4) #define ADLER_DO16(buf) ADLER_DO8(buf); ADLER_DO8(buf + 8) ``` ### Modular Reduction ```c #define MOD(a) a %= BASE #define MOD4(a) a %= BASE ``` The straightforward modulo works well because BASE is prime. On architectures where division is expensive, Adler-32 can alternatively be reduced by subtracting BASE in a loop. ### Combining Adler-32 Checksums `adler32_combine_()` merges two Adler-32 checksums from adjacent data segments without accessing the original data: ```c static uint32_t adler32_combine_(uint32_t adler1, uint32_t adler2, z_off64_t len2) { uint32_t sum1, sum2; unsigned rem; // modular arithmetic to combine: // s1_combined = (s1_a + s1_b - 1) % BASE // s2_combined = (s2_a + s2_b + s1_a * len2 - len2) % BASE rem = (unsigned)(len2 % BASE); sum1 = adler1 & 0xffff; sum2 = rem * sum1; MOD(sum2); sum1 += (adler2 & 0xffff) + BASE - 1; sum2 += ((adler1 >> 16) & 0xffff) + ((adler2 >> 16) & 0xffff) + BASE - rem; if (sum1 >= BASE) sum1 -= BASE; if (sum1 >= BASE) sum1 -= BASE; if (sum2 >= ((unsigned long)BASE << 1)) sum2 -= ((unsigned long)BASE << 1); if (sum2 >= BASE) sum2 -= BASE; return sum1 | (sum2 << 16); } ``` ### SIMD Implementations SIMD Adler-32 uses parallel accumulation with dot products: **AVX2** (`arch/x86/adler32_avx2.c`): ```c Z_INTERNAL uint32_t adler32_avx2(uint32_t adler, const uint8_t *buf, size_t len) { static const uint8_t dot2v[] = {32,31,30,...,1}; // Position weights static const uint8_t dot3v[] = {32,32,32,...,32}; // Sum1 weight (all ones) __m256i vbuf, vs1, vs2, vs1_0, vs3; vs1 = _mm256_set_epi32(0, 0, 0, 0, 0, 0, 0, adler & 0xffff); vs2 = _mm256_set_epi32(0, 0, 0, 0, 0, 0, 0, adler >> 16); vs1_0 = vs1; while (len >= 32) { vs1_0 = vs1; // Load 32 bytes vbuf = _mm256_loadu_si256((__m256i*)buf); // sum1 += bytes[0..31] vs1 = _mm256_add_epi32(vs1, _mm256_sad_epu8(vbuf, _mm256_setzero_si256())); // vs3 = dot product: sum2 += (32-i)*byte[i] vs3 = _mm256_maddubs_epi16(vbuf, vdot2v); // ... accumulate vs2 vs2 = _mm256_add_epi32(vs2, _mm256_madd_epi16(vs3, vones)); // vs2 += 32 * previous_vs1 vs2 = _mm256_add_epi32(vs2, _mm256_slli_epi32(vs1_0, 5)); buf += 32; len -= 32; } // Horizontal reduction and modular reduction ... } ``` The key insight: Instead of computing `sum2 += s1_n` for each byte n individually, SIMD computes `sum2 += k * byte[i]` via `_mm256_maddubs_epi16()` where k represents the positional weight. **Available SIMD variants**: | Architecture | Implementation | Vector Width | |---|---|---| | x86 SSE4.1 | `adler32_sse41.c` | 128-bit | | x86 SSSE3 | `adler32_ssse3.c` | 128-bit | | x86 AVX2 | `adler32_avx2.c` | 256-bit | | x86 AVX-512 | `adler32_avx512.c` | 512-bit | | x86 AVX-512+VNNI | `adler32_avx512_vnni.c` | 512-bit | | ARM NEON | `adler32_neon.c` | 128-bit | | Power VMX (Altivec) | `adler32_vmx.c` | 128-bit | | Power8 | `adler32_power8.c` | 128-bit | | RISC-V RVV | `adler32_rvv.c` | Scalable | | LoongArch LASX | `adler32_lasx.c` | 256-bit | ### Adler-32 with Copy `adler32_copy()` computes Adler-32 while simultaneously copying data, fusing two memory passes into one: ```c typedef uint32_t (*adler32_copy_func)(uint32_t adler, uint8_t *dst, const uint8_t *src, size_t len); ``` This is used during inflate to compute the checksum while copying decompressed data to the output buffer. --- ## CRC-32 ### Algorithm CRC-32 uses the standard polynomial 0xEDB88320 (reflected form): ```c #define POLY 0xedb88320 // CRC-32 polynomial (reversed) ``` ### Braided CRC-32 The default software implementation uses a "braided" algorithm that processes multiple bytes per step using interleaved CRC tables: ```c #define BRAID_N 5 // Number of interleaved CRC computations #define BRAID_W 8 // Bytes per word (8 for 64-bit, 4 for 32-bit) ``` From `crc32_braid_p.h`, the braided approach processes 5 words (40 bytes on 64-bit) per iteration: ```c // Braided CRC processing (conceptual) // Process BRAID_N words at a time: z_word_t braids[BRAID_N]; // Load BRAID_N words from input for (int k = 0; k < BRAID_N; k++) braids[k] = *(z_word_t *)(buf + k * BRAID_W); // For each word, XOR with running CRC then look up table for (int k = 0; k < BRAID_N; k++) { z_word_t word = braids[k]; // CRC-fold using braid tables: // crc = crc_braid_table[N-1-k][byte0] ^ ... ^ crc_braid_table[0][byteN-1] } ``` The braid tables are generated at compile time by `crc32_braid_tbl.h`. ### Chorba CRC-32 A newer CRC-32 algorithm using a "Chorba" reduction technique for even faster software CRC computation. Selected when size >= 256 bytes: ```c Z_INTERNAL uint32_t crc32_braid(uint32_t crc, const uint8_t *buf, size_t len) { // Short paths for small inputs if (len < 64) { return crc32_small(crc, buf, len); } // For lengths >= threshold, use Chorba if (len >= 256) { return crc32_chorba(crc, buf, len); } // Otherwise use braided ... } ``` ### SIMD CRC-32 Implementations Hardware-accelerated CRC-32 is available on these architectures: | Architecture | Instruction | File | |---|---|---| | x86 (PCLMULQDQ) | Carry-less multiply | `crc32_pclmulqdq.c` | | x86 (VPCLMULQDQ) | AVX-512 carry-less multiply | `crc32_vpclmulqdq.c` | | ARM (CRC32) | CRC32W/CRC32B instructions | `crc32_acle.c` | | ARM (PMULL) | Polynomial multiply long | `crc32_pmull.c` | | Power8 | Vector carry-less multiply | `crc32_power8.c` | | s390 (CRC32) | DFLTCC or hardware CRC | `crc32_vx.c` | | RISC-V | Zbc carry-less multiply | `crc32_rvv.c` | **x86 PCLMULQDQ** (`arch/x86/crc32_pclmulqdq.c`): Uses Barrett reduction via carry-less multiplication to fold 64 bytes at a time: ```c Z_INTERNAL uint32_t crc32_pclmulqdq(uint32_t crc32, const uint8_t *buf, size_t len) { __m128i xmm_crc0, xmm_crc1, xmm_crc2, xmm_crc3; __m128i xmm_fold4; // Fold constant // Initialize with CRC and first 64 bytes xmm_crc0 = _mm_loadu_si128((__m128i *)buf); xmm_crc0 = _mm_xor_si128(xmm_crc0, _mm_cvtsi32_si128(crc32)); // ... load crc1, crc2, crc3 // Main fold loop: process 64 bytes per iteration while (len >= 64) { // Fold: crc_n = pclmulqdq(crc_n, fold_constant) ^ next_data xmm_crc0 = _mm_xor_si128( _mm_clmulepi64_si128(xmm_crc0, xmm_fold4, 0x01), _mm_clmulepi64_si128(xmm_crc0, xmm_fold4, 0x10)); xmm_crc0 = _mm_xor_si128(xmm_crc0, _mm_loadu_si128(next++)); // Repeat for crc1..crc3 } // Final reduction to 32-bit CRC // Barrett reduction using mu and polynomial constants } ``` This processes data at ~16 bytes/cycle on modern x86 hardware. ### CRC-32 with Copy Like Adler-32, CRC-32 has a combined compute-and-copy variant: ```c typedef uint32_t (*crc32_copy_func)(uint32_t crc, uint8_t *dst, const uint8_t *src, size_t len); ``` This fuses the CRC computation with the `memcpy`, utilising cache lines loaded for copying to also feed the CRC calculation. ### Combining CRC-32 Values ```c uint32_t crc32_combine(uint32_t crc1, uint32_t crc2, z_off_t len2); uint32_t crc32_combine_gen(z_off_t len2); uint32_t crc32_combine_op(uint32_t crc1, uint32_t crc2, uint32_t op); ``` Two-phase combine enables pre-computing the combination operator for a known second-segment length, then applying it to multiple CRC pairs. --- ## Dispatch via `functable` Checksum functions are dispatched through the `functable_s` structure: ```c struct functable_s { adler32_func adler32; adler32_copy_func adler32_copy; compare256_func compare256; crc32_func crc32; crc32_copy_func crc32_copy; // ... other function pointers }; ``` `functable.c` selects the best implementation at runtime: ```c // x86 dispatch cascade for adler32: #ifdef X86_SSE42 if (cf.x86.has_sse42) functable.adler32 = adler32_sse42; #endif #ifdef X86_AVX2 if (cf.x86.has_avx2) functable.adler32 = adler32_avx2; #endif #ifdef X86_AVX512 if (cf.x86.has_avx512) functable.adler32 = adler32_avx512; #endif #ifdef X86_AVX512VNNI if (cf.x86.has_avx512vnni) functable.adler32 = adler32_avx512_vnni; #endif ``` Each architecture-specific source file is compiled separately with its required SIMD flags (e.g., `-mavx2`, `-mpclmul`). --- ## Function Table API ### Public API ```c uint32_t PREFIX(adler32)(uint32_t adler, const uint8_t *buf, uint32_t len); uint32_t PREFIX(crc32)(uint32_t crc, const uint8_t *buf, uint32_t len); ``` For zlib compatibility, `adler32_z()` and `crc32_z()` accept `size_t` length: ```c uint32_t PREFIX(adler32_z)(uint32_t adler, const uint8_t *buf, size_t len); uint32_t PREFIX(crc32_z)(uint32_t crc, const uint8_t *buf, size_t len); ``` ### Initial Values - Adler-32: `adler32(0, NULL, 0)` returns `1` (initial value) - CRC-32: `crc32(0, NULL, 0)` returns `0` (initial value) ### Typical Usage ```c uint32_t checksum = PREFIX(adler32)(0L, Z_NULL, 0); checksum = PREFIX(adler32)(checksum, data, data_len); // checksum now holds the Adler-32 of data[0..data_len-1] ``` --- ## Performance Characteristics ### Adler-32 | Implementation | Throughput (approximate) | |---|---| | Scalar C | ~1 byte/cycle | | SSE4.1 | ~8 bytes/cycle | | AVX2 | ~16 bytes/cycle | | AVX-512+VNNI | ~32 bytes/cycle | | ARM NEON | ~8 bytes/cycle | ### CRC-32 | Implementation | Throughput (approximate) | |---|---| | Braided (scalar) | ~4 bytes/cycle | | PCLMULQDQ | ~16 bytes/cycle | | VPCLMULQDQ (AVX-512) | ~64 bytes/cycle | | ARM CRC32 | ~4 bytes/cycle | | ARM PMULL | ~16 bytes/cycle | CRC-32 is computationally heavier than Adler-32, but hardware acceleration closes the gap significantly. --- ## Checksum in the Compression Pipeline ### During Deflate In `deflate.c`, checksums are computed on the input data: ```c if (s->wrap == 2) { // gzip: CRC-32 strm->adler = FUNCTABLE_CALL(crc32)(strm->adler, strm->next_in, strm->avail_in); } else if (s->wrap == 1) { // zlib: Adler-32 strm->adler = FUNCTABLE_CALL(adler32)(strm->adler, strm->next_in, strm->avail_in); } ``` ### During Inflate In `inflate.c`, checksums are computed on the output data: ```c static inline void inf_chksum(PREFIX3(stream) *strm, const uint8_t *buf, uint32_t len) { struct inflate_state *state = (struct inflate_state *)strm->state; if (state->flags) strm->adler = state->check = FUNCTABLE_CALL(crc32)(state->check, buf, len); else strm->adler = state->check = FUNCTABLE_CALL(adler32)(state->check, buf, len); } ``` The `_copy` variants (`inf_chksum_cpy`) are preferred when data is being both checksummed and copied, as they fuse the two operations.