The new implementations are provided under !__OBSOLETE_MATH, it uses
ISO C99 code. With default settings the worst case error in nearest
rounding mode is 0.519 ULP with inlined fma and fma contraction. It uses
a 2 KB lookup table, on aarch64 .text+.rodata size of libm.a is increased
by 1703 bytes. The w_log.c wrapper is disabled since error handling is
inline in the new code.
New __HAVE_FAST_FMA and __HAVE_FAST_FMA_DEFAULT feature macros were
added to enable selecting between the code path that uses fma and the
one that does not. Targets supposed to set __HAVE_FAST_FMA_DEFAULT
if they have single instruction fma and the compiler can actually
inline it (gcc has __FP_FAST_FMA macro but that does not guarantee
inlining with -fno-builtin-fma).
Improvements on Cortex-A72:
latency: 1.9x
thruput: 2.3x