viable/strict/1772468940: [inductor] Add FMA lowering for add-with-alpha on CUDA
PyTorch Releases · README · ~3 min read
2-Minute Brief
According to PyTorch Releases: Eager CUDA computes a + alpha * b as fma(b, alpha, a). Without this lowering, Triton computes b * alpha and then adds it to a as two separate operations, losing the FMA single-rounding precision guarantee. This affects optimizer weight_decay paths, which use grad.add(param, alpha=weight_decay) and _foreach_add with alpha. Authored with Claude. Pull Request resolved: #175838. Approved by: https://github.com/v0i0. ghstack dependencies: #174912, #175309, #175310.