Commit Graph

2852 Commits

Author SHA1 Message Date
Siddhesh Poyarekar f44eee8f1b Improve strncmp for mutually misaligned inputs
The mutually misaligned inputs on aarch64 are compared with a simple
byte copy, which is not very efficient.  Enhance the comparison
similar to strcmp by loading a double-word at a time.  The peak
performance improvement (i.e. 4k maxlen comparisons) due to this on
the strncmp microbenchmark in glibc is as follows:

falkor: 3.5x (up to 72% time reduction)
cortex-a73: 3.5x (up to 71% time reduction)
cortex-a53: 3.5x (up to 71% time reduction)

All mutually misaligned inputs from 16 bytes maxlen onwards show
upwards of 15% improvement and there is no measurable effect on the
performance of aligned/mutually aligned inputs.
2018-07-13 13:27:54 +02:00
Szabolcs Nagy 81dc535bb9 Remove float compare option from sincosf
PREFER_FLOAT_COMPARISON setting was not correct as it could raise
spurious exceptions.  Fixing it is easy: just use ISLESS(x, y) instead
of abstop12(x) < abstop12(y) with appropriate non-signaling definition
for ISLESS.  However it seems this setting is not very useful (there is
only minor performance difference on various architectures), so remove
this option for now.
2018-07-11 17:16:04 +02:00
Szabolcs Nagy 358f3c61d6 Fix the documentation comments for log_inline in pow
There was a typo and the arguments were not explained clearly.
2018-07-11 17:16:04 +02:00
Szabolcs Nagy 138575c9b9 Fix namespace issues in sinf, cosf and sincosf
Use const sincos_t for clarity instead of making the typedef const.
Use __inv_pi4 and __sincosf_table to avoid namespace issues with
static linking.
2018-07-06 10:29:01 +02:00
Szabolcs Nagy 2805b07fa1 Fix large ulp error in pow without fma very near 1.0
The !HAVE_FAST_FMA code path split r = z/c - 1 into r = rhi + rlo such
that when z = 1-tiny and c = 1 then rlo and rhi could have much larger
magnitude than r which later caused large rounding errors.

So do a nearest rounding instead of truncation at the split.

In newlib with default settings this was observable on some arm targets
that enable the new math code but has no fma.
2018-07-06 10:29:01 +02:00
Szabolcs Nagy 6a85e1a4e5 Change the return type of converttoint and document the semantics
The roundtoint and converttoint internal functions are only called with small
values, so 32 bit result is enough for converttoint and it is a signed int
conversion so the natural return type is int32_t.

The original idea was to help the compiler keeping the result in uint64_t,
then it's clear that no sign extension is needed and there is no accidental
undefined or implementation defined signed int arithmetics.

But it turns out gcc does a good job with inlining so changing the type has
no overhead and the semantics of the conversion is less surprising this way.
Since we want to allow the asuint64 (x + 0x1.8p52) style conversion, the top
bits were never usable and the existing code ensures that only the bottom
32 bits of the conversion result are used.

In newlib with default settings only aarch64 is affected and there is no
significant code generation change with gcc after the patch.
2018-07-06 10:29:01 +02:00
Szabolcs Nagy 73a3e95ff2 Remove unused TOINT_RINT and TOINT_SHIFT macros
Only have separate code paths for TOINT_INTRINSICS and !TOINT_INTRINSICS.
2018-07-06 10:29:01 +02:00
Szabolcs Nagy 393a1cb4ea Move __HAVE_FAST_FMA to math_config.h
Define it consistently with other HAVE_* macros that only affect code
using math_config.h.  This is also closer to the Arm Optimized Routines
code.
2018-07-06 10:29:01 +02:00
Szabolcs Nagy cbe50607fb Fix code style and comments of new math code
Synchronize code style and comments with Arm Optimized Routines, there
are no code changes in this patch.  This ensures different projects using
the same code have consistent code style so bug fix patches can be applied
more easily.
2018-07-06 10:29:01 +02:00
Takashi Yano 6a3e08a53e Fix newlib functions perror()/psignal() not to use writev().
This fix is for some platforms which do not have writev().
*perror.c: Use _write_r() instead of writev().
*psignal.c: Use write() insetad of writev().

Revise commit: d4f4e7ae1b
2018-07-05 15:33:49 -04:00
Takashi Yano d4f4e7ae1b Fix a bug of perror()/psignal() that changes the orientation of stderr.
* perror.c: Fix the problem that perror() changes the orientation
  of stderr to byte-oriented mode if stderr is not oriented yet.
* psignal.c: Ditto.
2018-07-04 14:17:28 +02:00
Corinna Vinschen 006520ca2b newlib: enable new math functions on Cygwin
Signed-off-by: Corinna Vinschen <corinna@vinschen.de>
2018-06-27 15:53:51 +02:00
Szabolcs Nagy b99d49e506 New pow implementation
The new implementation is provided under !__OBSOLETE_MATH, it uses
ISO C99 code.  With default settings the worst case error in nearest
rounding mode is 0.54 ULP with inlined fma and fma contraction.  It uses
a 4 KB lookup table in addition to the table in exp_data.c, on aarch64
.text+.rodata size of libm.a is increased by 2295 bytes.

Improvements on Cortex-A72:
latency: 3.3x
thruput: 4.9x
2018-06-27 15:40:49 +02:00
Szabolcs Nagy 07e2c32828 New log2 implementation
The new implementation is provided under !__OBSOLETE_MATH, it uses
ISO C99 code.  With default settings the worst case error in nearest
rounding mode is 0.547 ULP with inlined fma and fma contraction.  It uses
a 1 KB lookup table, on aarch64 .text+.rodata size of libm.a is increased
by 1584 bytes.

Note that the math.h header defines log2(x) to be log(x)/Ln2, this is
not changed, so the new code is only used if that macro is suppressed.

Improvements on Cortex-A72:
latency: 2.0x
thruput: 2.2x
2018-06-27 15:40:49 +02:00
Szabolcs Nagy e5791079c6 New log implementation
The new implementations are provided under !__OBSOLETE_MATH, it uses
ISO C99 code.  With default settings the worst case error in nearest
rounding mode is 0.519 ULP with inlined fma and fma contraction.  It uses
a 2 KB lookup table, on aarch64 .text+.rodata size of libm.a is increased
by 1703 bytes.  The w_log.c wrapper is disabled since error handling is
inline in the new code.

New __HAVE_FAST_FMA and __HAVE_FAST_FMA_DEFAULT feature macros were
added to enable selecting between the code path that uses fma and the
one that does not.  Targets supposed to set __HAVE_FAST_FMA_DEFAULT
if they have single instruction fma and the compiler can actually
inline it (gcc has __FP_FAST_FMA macro but that does not guarantee
inlining with -fno-builtin-fma).

Improvements on Cortex-A72:
latency: 1.9x
thruput: 2.3x
2018-06-27 15:40:49 +02:00
Szabolcs Nagy fb929067db New exp and exp2 implementations
The new implementations are provided under !__OBSOLETE_MATH, they use
ISO C99 code.  There are several settings, with the default one the
worst case error in nearest rounding mode is 0.509 ULP for exp and
0.507 ULP for exp2 when a multiply and add is contracted into an fma.
They use a shared 2 KB lookup table, on aarch64 .text+.rodata size
of libm.a is increased by 1868 bytes.  The w_*.c wrappers are disabled
for the new code as it takes care of error handling inline.

The old exp2(x) code used to be just pow(2,x) so the speedup there
is more significant.

The file name has no special prefix to avoid any name collision with
existing files.

Improvements on Cortex-A72:
exp latency: 3.2x
exp thruput: 4.1x
exp2 latency: 7.8x
exp2 thruput: 18.8x
2018-06-27 15:40:49 +02:00
Szabolcs Nagy cfbcbd1c95 Use uint32_t sign argument to math error functions
This change is equivalent to the commit
c65db17340
and only affects code that is from the Arm optimized-routines project.

It does not affect the observable behaviour, but the code generation
can be different on 64bit targets.  The intention is to make the
portable semantics of the code obvious by using a fixed size type.
2018-06-27 15:40:49 +02:00
Takashi Yano 048490485a Fix Unicode table.
* (mkcategories): Fix a bug that outputs incorrect Unicode category
  table for code point ranges.
* (categories.t): Rebuild it using the bug-fixed mkcategories.

This fixes the problem reported in the following post.
https://cygwin.com/ml/cygwin/2018-06/msg00248.html
2018-06-26 10:19:12 +02:00
Corinna Vinschen b14daac482 Revert "Remove -fno-builtin to allow gcc to inline functions such as fabs, floor, creal, imag."
This reverts commit c077b9de99.

Yet another accidental commit...
2018-06-26 10:17:04 +02:00
Jon Beniston c077b9de99 Remove -fno-builtin to allow gcc to inline functions such as fabs, floor, creal, imag. 2018-06-25 13:31:51 +02:00
Wilco Dijkstra 3baadb9912 Improve performance of sinf/cosf/sincosf
Here is the correct patch with both filenames and int cast fixed:

This patch is a complete rewrite of sinf, cosf and sincosf.  The new version
is significantly faster, as well as simple and accurate.
The worst-case ULP is 0.56072, maximum relative error is 0.5303p-23 over all
4 billion inputs.  In non-nearest rounding modes the error is 1ULP.

The algorithm uses 3 main cases: small inputs which don't need argument
reduction, small inputs which need a simple range reduction and large inputs
requiring complex range reduction.  The code uses approximate integer
comparisons to quickly decide between these cases - on some targets this may
be slow, so this can be configured to use floating point comparisons.

The small range reducer uses a single reduction step to handle values up to
120.0.  It is fastest on targets which support inlined round instructions.

The large range reducer uses integer arithmetic for simplicity.  It does a
32x96 bit multiply to compute a 64-bit modulo result.  This is more than
accurate enough to handle the worst-case cancellation for values close to
an integer multiple of PI/4.  It could be further optimized, however it is
already much faster than necessary.

Simple benchmark showing speedup factor on AArch64 for various ranges:

range	0.7853982	sinf	1.7	cosf	2.2	sincosf	2.8
range	1.570796	sinf	1.9	cosf	1.9	sincosf	2.7
range	3.141593	sinf	2.0	cosf	2.0	sincosf	3.5
range	6.283185	sinf	2.3	cosf	2.3	sincosf	4.2
range	125.6637	sinf	2.9	cosf	3.0	sincosf	5.1
range	1.1259e15	sinf	26.8	cosf	26.8	sincosf	45.2

ChangeLog:
2018-05-18  Wilco Dijkstra  <wdijkstr@arm.com>

        * newlib/libm/common/Makefile.in: Regenerated.
        * newlib/libm/common/Makefile.am: Add sinf.c, cosf.c, sincosf.c
        sincosf.h, sincosf_data.c. Add -fbuiltin -fno-math-errno to CFLAGS.
        * newlib/libm/common/math_config.h: Add HAVE_FAST_ROUND, HAVE_FAST_LROUND,
        roundtoint, converttoint, force_eval_float, force_eval_double, eval_as_float,
        eval_as_double, likely, unlikely.
        * newlib/libm/common/cosf.c: New file.
        * newlib/libm/common/sinf.c: Likewise.
        * newlib/libm/common/sincosf.h: Likewise.
        * newlib/libm/common/sincosf.c: Likewise.
        * newlib/libm/common/sincosf_data.c: Likewise.
        * newlib/libm/math/sf_cos.c: Add #if to build conditionally.
        * newlib/libm/math/sf_sin.c: Likewise.
        * newlib/libm/math/wf_sincos.c: Likewise.

--
2018-06-21 09:37:04 +02:00
Corinna Vinschen cfe8c6c504 Revert "Improve performance of sinf/cosf/sincosf"
This reverts commit fca80a9d1b.

Accidentally pushed a preliminary version
2018-06-21 09:36:39 +02:00
Jon Beniston b7d9d27b0e libm/common/s_round.c (round): Add cast for 16-bit CPUs 2018-06-21 09:31:13 +02:00
Wilco Dijkstra fca80a9d1b Improve performance of sinf/cosf/sincosf
This patch is a complete rewrite of sinf, cosf and sincosf.  The new version
is significantly faster, as well as simple and accurate.
The worst-case ULP is 0.56072, maximum relative error is 0.5303p-23 over all
4 billion inputs.  In non-nearest rounding modes the error is 1ULP.

The algorithm uses 3 main cases: small inputs which don't need argument
reduction, small inputs which need a simple range reduction and large inputs
requiring complex range reduction.  The code uses approximate integer
comparisons to quickly decide between these cases - on some targets this may
be slow, so this can be configured to use floating point comparisons.

The small range reducer uses a single reduction step to handle values up to
120.0.  It is fastest on targets which support inlined round instructions.

The large range reducer uses integer arithmetic for simplicity.  It does a
32x96 bit multiply to compute a 64-bit modulo result.  This is more than
accurate enough to handle the worst-case cancellation for values close to
an integer multiple of PI/4.  It could be further optimized, however it is
already much faster than necessary.

Simple benchmark showing speedup factor on AArch64 for various ranges:

range	0.7853982	sinf	1.7	cosf	2.2	sincosf	2.8
range	1.570796	sinf	1.9	cosf	1.9	sincosf	2.7
range	3.141593	sinf	2.0	cosf	2.0	sincosf	3.5
range	6.283185	sinf	2.3	cosf	2.3	sincosf	4.2
range	125.6637	sinf	2.9	cosf	3.0	sincosf	5.1
range	1.1259e15	sinf	26.8	cosf	26.8	sincosf	45.2

ChangeLog:
2018-06-18  Wilco Dijkstra  <wdijkstr@arm.com>

        * newlib/libm/common/Makefile.in: Regenerated.
        * newlib/libm/common/Makefile.am: Add sinf.c, cosf.c, sincosf.c
        sincosf.h, sincosf_data.c. Add -fbuiltin -fno-math-errno to CFLAGS.
        * newlib/libm/common/math_config.h: Add HAVE_FAST_ROUND, HAVE_FAST_LROUND,
        roundtoint, converttoint, force_eval_float, force_eval_double, eval_as_float,
        eval_as_double, likely, unlikely.
        * newlib/libm/common/cosf.c: New file.
        * newlib/libm/common/sinf.c: Likewise.
        * newlib/libm/common/sincosf.h: Likewise.
        * newlib/libm/common/sincosf.c: Likewise.
        * newlib/libm/common/sincosf_data.c: Likewise.
        * newlib/libm/math/sf_cos.c: Add #if to build conditionally.
        * newlib/libm/math/sf_sin.c: Likewise.
        * newlib/libm/math/wf_sincos.c: Likewise.

--
2018-06-19 09:44:28 +02:00
Thomas Kindler 9dd3c3b0ad newlib: getopt now permutes multi-flag options correctly
Previously, "test 1 2 3 -a -b -c"  was permuted to "test -a -b -c 1 2 3",
but "test 1 2 3 -abc" was left as "test 1 2 3 -abc".

Signed-off-by: Thomas Kindler <mail+newlib@t-kindler.de>
2018-06-18 18:45:44 +02:00
Jeff Johnston 4a3d0a5a5d Fix issue with malloc_extend_top
- when calculating a correction to align next brk to page boundary,
  ensure that the correction is less than a page size
- if allocating the correction fails, ensure that the top size is
  set to brk + sbrk_size (minus any front alignment made)

Signed-off-by: Jeff Johnston <jjohnstn@redhat.com>
2018-05-29 10:16:48 -04:00
Matthias Kannwischer fcfea0ae2d fix llrint and lrint for 52 <= exponent <= 62 2018-05-29 15:59:48 +02:00
Freddie Chopin 3305f35570 Fix 32-bit overflow in mktime() when time_t is 64-bits long
When converting number of days since epoch (32-bits) to seconds,
calculations using 32-bit `long` overflow for years above 2038. Solve
this by casting number of days to `time_t` just before final
multiplication.

Signed-off-by: Freddie Chopin <freddie.chopin@gmail.com>
2018-05-29 15:27:03 +02:00
Jeff Johnston e928275566 Use _LDBL_EQ_DBL in nexttowardf.c
2018-05-07  Tom de Vries  <tom@codesourcery.com>

	* libm/common/nexttowardf.c: Use _LDBL_EQ_DBL instead of
	_LDBL_EQ_DOUBLE.
2018-05-07 12:22:12 -04:00
Jeff Johnston cd31fbb2ae Add nvptx port.
- From: Cesar Philippidis <cesar@codesourcery.com>
  Date: Tue, 10 Apr 2018 14:43:42 -0700
  Subject: [PATCH] nvptx port

  This port adds support for Nvidia GPU's, which are primarily used as
  offload accelerators in OpenACC and OpenMP.
2018-04-13 15:42:37 -04:00
Corinna Vinschen 27652b608d strtod: Convert 64 bit double to 64 bit int during computation
The gdtoa implementation uses the type long, defined as Long, in lots
of code.  For historical reason newlib defines Long as int32_t instead.

This works fine, as long as floating point exceptions are not enabled.
The conversion to 32 bit int can lead to a FE_INVALID situation.

Example:

  const char *str = "121645100408832000.0";
  char *ptr;

  feenableexcept (FE_INVALID);
  strtod (str, &ptr);

This leads to the following situation in strtod

  double aadj;
  Long L;

  [...]
  L = (Long)aadj;

For instance, on x86_64 the code here is

  cvttsd2si %xmm0,%eax

At this point, aadj is 2529648000.0 in our example.  The conversion to
32 bit %eax results in a negative int value, thus the conversion is
invalid.  With feenableexcept (FE_INVALID), a SIGFPE is raised.

Fix this by always using 64 bit ints here if double is not a 32 bit type
to avoid this type of FP exceptions.

Signed-off-by: Corinna Vinschen <corinna@vinschen.de>
2018-04-09 11:31:04 +02:00
Corinna Vinschen 1ee6654e50 newlib: fix iswupper_l in !_MB_CAPABLE case
Signed-off-by: Corinna Vinschen <corinna@vinschen.de>
2018-03-27 12:35:27 +02:00
Thomas Wolff fc59da00c8 comments to document struct caseconv_entry
explain design of compact (packed) struct caseconv_entry,
in case it needs to be modified for future Unicode versions
2018-03-26 12:01:50 +02:00
Thomas Wolff b49ce5af1b newlib: fix indentation in toulower
Signed-off-by: Corinna Vinschen <corinna@vinschen.de>
2018-03-26 10:00:16 +02:00
Hakan Lindqvist 3ce38df8d1 Reduce qsort stack consumption
Classical function call recursion wastes a lot of stack space.
Each recursion level requires a full stack frame comprising all
local variables and additional space as dictated by the
processor calling convention.

This implementation instead stores the variables that are unique
for each recursion level in a parameter stack array, and uses
iteration to emulate recursion. Function call recursion is not
used until the array is full.

To ensure the stack consumption isn't worsened by this design, the
size of the parameter stack array is chosen to be similar to the
stack frame excluding the array. Each function call recursion level
can handle 8 iterative recursion levels.

Stack consumption will worsen when sorting tiny arrays that do not
need recursion (of 6 elements or less). It will be about equal for
up to 15 elements, and be an improvement for larger arrays. The best
case improvement is a stack size reduction down to about one quarter
of the stack consumption before the change.

A design where the parameter stack array is large enough for the
worst case recursion level was rejected because it would worsen
the stack consumption when sorting arrays smaller than about 1500
elements. The worst case is 31 levels on a 32-bit system.

A design with a dynamic parameter array size was rejected because
of limitations in some compilers.
2018-03-16 10:21:23 +01:00
Hakan Lindqvist 0045445ad6 Ensure qsort recursion depth is bounded
The qsort algorithm splits the input array in three parts. The
left and right parts may need further sorting. One of them is
sorted by recursion, the other by iteration. This update ensures
that it is the smaller part that is chosen for recursion.

By choosing the smaller part, each recursion level will handle
less than half the array of the previous recursion level. Hence
the recursion depth is bounded to be less than log2(n) i.e. 1
level per significant bit in the array size n.

The update also includes code comments explaining the algorithm.
2018-03-16 10:21:23 +01:00
Joel Sherrill 948db3e4b7 Correct prototypes of pthread_mutex_getprioceiling() and pthread_setschedparam() 2018-03-15 09:25:45 -05:00
Richard Earnshaw 0bb8697333 [arm] Fix syscalls.c for newlib embedded syscalls builds
Newlib has a build configuration where syscalls can be directly
embedded in the newlib library rather than relying on libgloss.

This configuration was broken recently by an update to the libgloss
support for Arm that was not propagated to the syscalls interface in
newlib itself.  This patch restores the build.  It's essentially a
copy of https://sourceware.org/ml/newlib/2018/msg00128.html but there
are some other minor cleanups and changes that I've made at the same
time.  None of those cleanups affect functionality.

The prototypes of the following functions have been updated: _link,
_sbrk, _getpid, _write, _swiwrite, _lseek, _swilseek, _read and
_swiread.

Signed-off-by: Richard Earnshaw <Richard.Earnshaw@arm.com>
2018-03-15 09:55:11 +00:00
Yaakov Selkowitz 829820af6e ssp: fix wchar.h with -std=c99
https://sourceware.org/ml/newlib/2018/msg00261.html

Signed-off-by: Yaakov Selkowitz <yselkowi@redhat.com>
2018-03-14 10:46:32 -05:00
Yaakov Selkowitz e494b56035 Fix alloc_align and alloc_size macros for multiple arguments
https://sourceware.org/ml/newlib/2018/msg00263.html

This is a follow-up to commit 4564b30f33.

Signed-off-by: Yaakov Selkowitz <yselkowi@redhat.com>
2018-03-14 10:17:51 -05:00
Corinna Vinschen 134f93f313 ctype: align size of category bit fields to small targets needs
E.g. arm ABI requires -fshort-enums for bare-metal toolchains.
Given there are only 29 category enums, the compiler chooses an
8 bit enum type, so a size of 11 bits for the bitfield leads to
a compile time error:

  error: width of 'cat' exceeds its type
    enum category cat: 11;
                  ^~~

Fix this by aligning the size of the category members to byte
borders.

Signed-off-by: Corinna Vinschen <corinna@vinschen.de>
2018-03-14 11:38:24 +01:00
Corinna Vinschen edcf783dc2 Revert "ctype: align size of category bit fields to small targets needs"
This reverts commit e98d3eb3eb.

It has accidentally included some work in progress.
2018-03-14 11:36:06 +01:00
Thomas Wolff 44d90834fb fix/enhance Unicode table generation scripts
Scripts do not try to acquire Unicode data by best-effort magic anymore.
Options supported:
-h for help
-i to copy Unicode data from /usr/share/unicode/ucd first
-u to download Unicode data from unicode.org first
If (despite of -i or -u if given) the necessary Unicode files are not
available locally, table generation is skipped, but no error code is
returned, so not to obstruct the build process if called from a Makefile.
2018-03-14 10:44:32 +01:00
Corinna Vinschen e98d3eb3eb ctype: align size of category bit fields to small targets needs
E.g. arm ABI requires -fshort-enums for bare-metal toolchains.
Given there are only 29 category enums, the compiler chooses an
8 bit enum type, so a size of 11 bits for the bitfield leads to
a compile time error:

  error: width of 'cat' exceeds its type
    enum category cat: 11;
                  ^~~

Fix this by aligning the size of the category members to byte
borders.

Signed-off-by: Corinna Vinschen <corinna@vinschen.de>
2018-03-14 10:36:38 +01:00
Corinna Vinschen e186dc8661 towctrans_l: Always return a value from helper functions
touupper and toulower didn't return a value in all cases.  Worse,
this only broke Cygwin when building without optimization for debug
purposes.

Why GCC neglects to notice this is a mystery.

While at it, fix formatting.

Signed-off-by: Corinna Vinschen <corinna@vinschen.de>
2018-03-13 22:09:30 +01:00
Joel Sherrill 5b97e36239 rtems/.../dirent.h: Add alphasort() prototype 2018-03-13 09:11:47 -05:00
Jon Turney 4564b30f33 Correct alloc_size annotation on reallocarray()
Signed-off-by: Jon Turney <jon.turney@dronecode.org.uk>
2018-03-13 09:04:56 -05:00
Thomas Wolff c8d96a96ea make target for explicit Unicode data tables generation
Run 'make unidata' in newlib target directory to generate Unicode
data tables for libc functions wcwidth, tow* and isw*.
2018-03-12 12:09:44 +01:00
Thomas Wolff a352730004 character data generation 2018-03-12 11:39:50 +01:00
Thomas Wolff 41f72ab4d7 use generated character data
The tow* functions use an included case conversion table which can be
generated from Unicode data.
The isw* functions use a character categories table (provided by
categories.c) which can be generated from Unicode data.
Delegation between current-locale and specific-locale-dependent functions
was reverted towards the generic locale-dependent functions (*_l.c);
this is however only relevant on systems with non-Unicode wide character
locales, thus not on Cygwin.
2018-03-12 11:39:42 +01:00