Marcus Shawcroft wrote:
> This patch appears to have been munged by the mail system, can you
> repost as an attachment please.
Sure, I've attached the patch.
Wilco
Add a simple rawmemchr implementation. Use strlen for rawmemchr(s, '\0'),
since strlen is the fastest way to search for '\0', and use memchr with an
effectively infinite size for all other characters. This is 3x faster for
large sizes.
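In C terms the stub boils down to the following (a minimal sketch of the
approach described above, not the committed code, which dispatches to the
assembly routines):

#include <string.h>

void *
rawmemchr (const void *s, int c)
{
  /* strlen is the fastest way to locate the terminating '\0'.  */
  if (c == '\0')
    return (char *) s + strlen (s);
  /* For any other character, search with an unbounded size; rawmemchr
     assumes the character is present, so memchr never reads past it.  */
  return memchr (s, c, (size_t) -1);
}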
ChangeLog:
2016-04-22 Wilco Dijkstra <wdijkstr@arm.com>
* newlib/libc/machine/aarch64/Makefile.am: Add rawmemchr.S and
rawmemchr-stub.c.
* newlib/libc/machine/aarch64/Makefile.in: Regenerated.
* newlib/libc/machine/aarch64/rawmemchr.S (rawmemchr): New file.
* newlib/libc/machine/aarch64/rawmemchr-stub.c (rawmemchr): Likewise.
Newlib defines defaults for internal types via <sys/_types.h> and uses
<machine/_types.h> to let targets define their own type if necessary.
Previously, for example:
#ifndef __dev_t_defined
typedef short __dev_t;
#endif
However, the __*_t_defined pattern conflicts with the glibc type guard
pattern for user types, e.g. dev_t in this example. Introduce a
__machine_*_t_defined pattern for internal types (defined by
<machine/_types.h>, used by <sys/_types.h>). For example:
#ifndef __machine_dev_t_defined
typedef short __dev_t;
#endif
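To illustrate the layering (a hypothetical sketch following the dev_t
example above; not taken from the patch itself), the internal guard and
the glibc-style user-type guard can now coexist without colliding:

/* <sys/_types.h>: internal default, overridable per target.  */
#include <machine/_types.h>
#ifndef __machine_dev_t_defined
typedef short __dev_t;
#endif

/* <sys/types.h>: user-visible type, guarded with the glibc pattern.  */
#ifndef __dev_t_defined
typedef __dev_t dev_t;
#define __dev_t_defined
#endif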
Signed-off-by: Sebastian Huber <sebastian.huber@embedded-brains.de>
This is an optimized memset for AArch64. Memset is split into 4 main
cases: small sets of up to 16 bytes; medium sets of 16..96 bytes, which
are fully unrolled; large memsets of more than 96 bytes, which align the
destination and use an unrolled loop processing 64 bytes per iteration;
and zero memsets of more than 256 bytes, which use the dc zva
instruction (with faster versions for the common ZVA sizes of 64 and
128 bytes). STP of Q registers is used to reduce code size without loss
of performance.
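As a rough C sketch of the size dispatch described above (the thresholds
come from the text; everything else is illustrative, since the committed
code is AArch64 assembly):

#include <stddef.h>

/* Hypothetical sketch of the dispatch only; every branch here just
   byte-fills, whereas the real memset.S uses the stores described in
   the text.  */
void *
memset_sketch (void *dst, int c, size_t n)
{
  unsigned char *d = dst;

  if (n <= 16)
    {
      /* Small case: a couple of overlapping stores in assembly.  */
      while (n--)
        *d++ = (unsigned char) c;
    }
  else if (n <= 96)
    {
      /* Medium case: fully unrolled STP stores in assembly.  */
      while (n--)
        *d++ = (unsigned char) c;
    }
  else
    {
      /* Large case: align the destination, then 64 bytes per
         iteration; zeroing more than 256 bytes uses dc zva.  */
      while (n--)
        *d++ = (unsigned char) c;
    }
  return dst;
}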
This is an optimized memcpy for AArch64. Copies are split into 3 main
cases: small copies of up to 16 bytes; medium copies of 17..96 bytes,
which are fully unrolled; and large copies of more than 96 bytes, which
align the destination and use an unrolled loop processing 64 bytes per
iteration. In order to share code with memmove, small and medium copies
read all data before writing, allowing any kind of overlap. On a random
copy test memcpy is 40.8% faster on A57 and 28.4% faster on A53.
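The overlap-safe read-before-write idea looks roughly like this in C (a
hypothetical fixed-size 16-byte copy; the real code is assembly and
covers all small and medium sizes):

#include <stdint.h>
#include <string.h>

/* Copy exactly 16 bytes, loading everything before storing anything,
   so the routine is correct even when dst and src overlap.  */
static void
copy16 (unsigned char *dst, const unsigned char *src)
{
  uint64_t a, b;
  memcpy (&a, src, 8);      /* both loads first...   */
  memcpy (&b, src + 8, 8);
  memcpy (dst, &a, 8);      /* ...then both stores.  */
  memcpy (dst + 8, &b, 8);
}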
This is an optimized memmove for AArch64. All copies of up to 96
bytes and all backward copies are done by the new memcpy. The only
remaining case is large forward copies which are done in the same way
as the memcpy loop, but copying from the end rather than the start.
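Copying from the end is what makes a forward-overlapping move safe; as a
minimal C illustration (a hypothetical helper, not the committed code):

#include <stddef.h>

/* Overlap-safe when dst > src (a forward-overlapping move): each byte
   is read before anything that could overwrite it is stored.  */
static void
copy_backward (unsigned char *dst, const unsigned char *src, size_t n)
{
  while (n--)
    dst[n] = src[n];
}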
improvements. Adjust to allow building as stpcpy.
* libc/machine/aarch64/stpcpy.S: New file.
* libc/machine/aarch64/stpcpy-stub.c: New file.
* libc/machine/aarch64/Makefile.am (lib_a_SOURCES): Build stpcpy.
* libc/machine/aarch64/Makefile.in: Regenerated.
* libc/machine/aarch64/strrchr-stub.c: New file.
* libc/machine/aarch64/Makefile.am: Add them to build list.
* libc/machine/aarch64/Makefile.in: Regenerated.
2014-07-11 Kévin Petit <kevin.petit@arm.com>
* libc/machine/aarch64/memchr.S: New file.
* libc/machine/aarch64/memchr-stub.c: New file.
* libc/machine/aarch64/Makefile.am: Add the new files.
* libc/machine/aarch64/Makefile.in: Regenerated.
* libc/machine/aarch64/strchrnul-stub.c: New file.
* libc/machine/aarch64/Makefile.am: Add them to build list.
* libc/machine/aarch64/Makefile.in: Regenerated.
* libc/machine/aarch64/strchr-stub.c: New file.
* libc/machine/aarch64/Makefile.am: Add them to build list.
* libc/machine/aarch64/Makefile.in: Regenerated.