path: root/insert_string.c
author    Adam Stylinski <kungfujesus06@gmail.com> 2025-11-21 10:02:14 -0500
committer Hans Kristian Rosbach <hk-github@circlestorm.org> 2025-11-22 11:27:12 +0100
commit   fe179585e7c25234c2c224116ccfed8b0a78dbd9 (patch)
tree     2986ef3703177e9d15ae3504353bfff34700f495 /insert_string.c
parent   f6e28fb1648f30912ddb4f6ba4a80adeab37b90f (diff)
download Project-Tick-fe179585e7c25234c2c224116ccfed8b0a78dbd9.tar.gz
         Project-Tick-fe179585e7c25234c2c224116ccfed8b0a78dbd9.zip
Conditionally shortcut via the chorba polynomial based on compile flags
As it turns out, the copying CRC32 variant _is_ slower when compiled with generic flags. The main reason is the extra stack spills and the lack of operations we can overlap with the moves. However, when compiling for an architecture with more registers, such as AVX512, we no longer have to eat all those costly stack spills, and we can overlap with a three-operand XOR. Conditionally guarding this means that if a Linux distribution wants to compile with -march=x86-64-v4, they get all the upsides of this.

Notably, this code is not actually used if you happen to have something that supports 512-bit-wide CLMUL, so this helps a somewhat narrow range of targets (mostly the earlier, pre-Ice Lake AVX512 implementations).

We also must guard on AVX512VL, as specifying just AVX512F makes GCC generate vpternlog instructions at 512-bit widths only, so a bunch of packing and unpacking between 512-bit and 256-bit registers has to occur, absolutely killing runtime. Only with AVX512VL is there a 128-bit-wide vpternlog.
Diffstat (limited to 'insert_string.c')
0 files changed, 0 insertions, 0 deletions