Building uint128 with x64 intrinsics: How fixed-width arithmetic beats dynamic bigint libraries

A minimal 128-bit unsigned integer implementation using carry/borrow/multiply intrinsics generates assembly identical to compiler builtins. For CTOs facing precision requirements in geometry, finance, or crypto workloads, fixed-width types offer predictable performance without abstraction overhead—the foundation scales cleanly to 256 or 512 bits.

What This Is

A two-limb u128 type built from uint64_t pairs with x64 intrinsics (_addcarry_u64, _subborrow_u64, _mulx_u64) produces codegen matching __uint128_t for add/sub/mul/compare operations. No runtime penalty. No abstraction tax. The approach extends to 192, 256, or wider types by adding limbs; production systems already run 256-bit arithmetic in hot paths and scale to 512 bits for edge cases.
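
A minimal sketch of the two-limb layout and the intrinsics-free comparisons, assuming C++17; the name u128 and its members are illustrative, not taken from any particular library:

    #include <cstdint>

    // Two base-2^64 digits: lo holds bits 0-63, hi holds bits 64-127.
    struct u128 {
        std::uint64_t lo;
        std::uint64_t hi;
    };

    // Comparisons need no intrinsics: compare high limbs first, then low.
    constexpr bool operator==(u128 a, u128 b) { return a.hi == b.hi && a.lo == b.lo; }
    constexpr bool operator<(u128 a, u128 b)  { return a.hi != b.hi ? a.hi < b.hi : a.lo < b.lo; }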

Why Fixed Width Matters

Dynamic bigint libraries solve the wrong problem when your bounds are known. GMP and equivalents pay for flexibility with heap allocations, branches, and pointer chasing. If your values fit 128 bits—common in computational geometry, high-precision finance, or blockchain cryptography—fixed types deliver constant-time operations with zero indirection.

The pattern: represent values as base-2^64 digits and let the hardware carry chain do the propagation. Addition becomes add + adc. Subtraction is sub + sbb. Multiplication uses _mulx_u64 (BMI2) or _umul128 (MSVC) for the 64×64→128 partial products, discarding anything above bit 127 to keep the result at 128 bits. The compiler emits exactly what you'd write in assembly.
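
A sketch of the three operations, assuming GCC or Clang on x64 with BMI2 enabled (-mbmi2) and the u128 layout from the sketch above; function names are illustrative:

    #include <cstdint>
    #include <immintrin.h>   // _addcarry_u64, _subborrow_u64, _mulx_u64 (GCC/Clang, x64)

    struct u128 { std::uint64_t lo, hi; };   // two-limb layout from the earlier sketch

    // a + b: add the low limbs, then fold the carry into the high limbs (add + adc).
    inline u128 add(u128 a, u128 b) {
        unsigned long long lo, hi;
        unsigned char c = _addcarry_u64(0, a.lo, b.lo, &lo);
        _addcarry_u64(c, a.hi, b.hi, &hi);   // carry out of bit 127 wraps, as unsigned math does
        return u128{lo, hi};
    }

    // a - b: subtract the low limbs, then propagate the borrow (sub + sbb).
    inline u128 sub(u128 a, u128 b) {
        unsigned long long lo, hi;
        unsigned char bw = _subborrow_u64(0, a.lo, b.lo, &lo);
        _subborrow_u64(bw, a.hi, b.hi, &hi);
        return u128{lo, hi};
    }

    // a * b mod 2^128: one full 64x64 -> 128 product for the low limbs, plus two
    // cross terms whose upper halves land above bit 127 and are discarded.
    inline u128 mul(u128 a, u128 b) {
        unsigned long long hi;
        unsigned long long lo = _mulx_u64(a.lo, b.lo, &hi);
        hi += a.lo * b.hi + a.hi * b.lo;
        return u128{lo, hi};
    }

Compiled at -O2, each function reduces to the handful of instructions named above, which is what lets the codegen line up with __uint128_t.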

The Landscape

C++ Alliance's int128 library hit production-ready status in July 2025, delivering >100% speedups for Boost.Decimal's 128-bit backend when integrated with 256-bit types. Template libraries like wide-integer offer configurable uint256/uint512 with 32-bit default limbs (64-bit via macro). CJM Numerics provides C++20 header-only alternatives. Proposal P3140R1 pushes for std::int_least128_t standardization—recognition that intrinsics-only approaches lock code to specific platforms.

NVIDIA's CUDA int128 support powers decimal arithmetic in RAPIDS libcudf for GPU workloads. On Skylake, hardware divq for a 128-by-64-bit division clocks 76 cycles (~20ns); LLVM's __udivti3 software fallback for general 128-bit division hits 150+ns due to loop overhead.
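
When the high limb is already known to be smaller than the divisor, the 128-by-64 case can go straight to divq instead of __udivti3. A hedged sketch using GCC/Clang inline assembly (MSVC exposes the equivalent operation as _udiv128); the function name is illustrative:

    #include <cstdint>

    // Quotient of the 128-bit value (hi:lo) divided by d; remainder stored via *rem.
    // Precondition: hi < d, otherwise the quotient overflows 64 bits and divq faults.
    inline std::uint64_t div_128_by_64(std::uint64_t hi, std::uint64_t lo,
                                       std::uint64_t d, std::uint64_t* rem) {
        std::uint64_t q, r;
        __asm__("divq %[divisor]"
                : "=a"(q), "=d"(r)
                : [divisor] "r"(d), "a"(lo), "d"(hi)
                : "cc");
        *rem = r;
        return q;
    }

The precondition is the price of the fast path: divq faults if the quotient does not fit in 64 bits, which is exactly the generality __udivti3 pays for in software.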

Trade-offs

This is x64-specific, unsigned-only, and deliberately narrow. MSVC requires different intrinsics. Division remains expensive (hardware or software). Portable alternatives exist but sacrifice performance predictability. For teams with known precision bounds and x64 targets, the metal-to-math directness is worth the platform constraint.
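
On MSVC, the multiply is typically written with _umul128 from <intrin.h>, which returns the low half of the product and stores the high half. A sketch under that assumption, mirroring the earlier function names:

    #include <cstdint>
    #include <intrin.h>   // MSVC: _umul128, _addcarry_u64, _subborrow_u64

    struct u128 { std::uint64_t lo, hi; };   // same two-limb layout as before

    // Same 128x128 -> 128 multiply as the GCC/Clang sketch, with _umul128 in
    // place of the BMI2 _mulx_u64.
    inline u128 mul(u128 a, u128 b) {
        std::uint64_t hi;
        std::uint64_t lo = _umul128(a.lo, b.lo, &hi);   // 64x64 -> 128
        hi += a.lo * b.hi + a.hi * b.lo;                 // cross terms, kept mod 2^64
        return u128{lo, hi};
    }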

The takeaway: When your arithmetic fits fixed bounds, make the compiler prove it can't beat hand-rolled assembly. Usually it can't—so use intrinsics and let it match you instead.