The !HAVE_FAST_FMA code path splits r = z/c - 1 into r = rhi + rlo.
The split used truncation, so when z = 1-tiny and c = 1, rhi and rlo
could have much larger magnitude than r, which later caused large
rounding errors. Round to nearest at the split instead of truncating.
In newlib with default settings this was observable on some arm targets
that enable the new math code but have no fma.
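
The effect of the rounding change can be illustrated with a minimal
sketch. It is not the patched source: the helper names (asuint64,
asdouble, struct split, split_truncate, split_nearest) and the choice
of splitting z at bit 32 of its representation are assumptions made
for illustration, and invc stands for 1/c.

```c
#include <stdint.h>
#include <string.h>

/* Reinterpret between a double and its IEEE 754 bit pattern.  */
static inline uint64_t
asuint64 (double x)
{
  uint64_t u;
  memcpy (&u, &x, sizeof u);
  return u;
}

static inline double
asdouble (uint64_t u)
{
  double x;
  memcpy (&x, &u, sizeof x);
  return x;
}

/* Hypothetical container for the two parts of r = rhi + rlo.  */
struct split
{
  double rhi;
  double rlo;
};

/* Truncating split (assumed old behaviour): clear the low 32 bits of z.  */
static struct split
split_truncate (double z, double invc)
{
  double zhi = asdouble (asuint64 (z) & (-1ULL << 32));
  double zlo = z - zhi;
  struct split s = { zhi * invc - 1.0, zlo * invc };
  return s;
}

/* Nearest split (assumed fix): add half of the cut-off unit before
   masking, so zhi is z rounded to nearest at bit 32.  */
static struct split
split_nearest (double z, double invc)
{
  double zhi = asdouble ((asuint64 (z) + (1ULL << 31)) & (-1ULL << 32));
  double zlo = z - zhi;
  struct split s = { zhi * invc - 1.0, zlo * invc };
  return s;
}
```

With z = 0x1.fffffffffffffp-1 (that is, 1 - 2^-53) and invc = 1, so that
r = -2^-53, the truncating version gives rhi = -0x1p-21 and
rlo = 0x1p-21 - 0x1p-53, both far larger in magnitude than r. The nearest
version gives rhi = 0 and rlo = -0x1p-53.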
Synchronize code style and comments with Arm Optimized Routines. There
are no code changes in this patch. This ensures that different projects
using the same code have a consistent code style, so bug fix patches can
be applied more easily.
The new implementation is provided under !__OBSOLETE_MATH and uses
ISO C99 code. With default settings the worst case error in nearest
rounding mode is 0.54 ULP with inlined fma and fma contraction. It uses
a 4 KB lookup table in addition to the table in exp_data.c. On aarch64
the .text+.rodata size of libm.a increases by 2295 bytes.
Improvements on Cortex-A72:
latency: 3.3x
throughput: 4.9x