Add support for the AMD GCN GPU architecture. This is primarily intended for
use with OpenMP and OpenACC offloading. It can also be used for stand-alone
programs, but this is intended mostly for testing the compiler and is not
expected to be useful in general.
The GPU architecture is highly parallel, and therefore Newlib must be
configured to use dynamic re-entrancy, and thread-safe malloc.
The only I/O available is a via a shared-memory interface provided by libgomp
and the gcn-run tool included with GCC. At this time this is limited to
stdout, argc/argv, and the return code.
This patch fixes an issue in the previous memset loop change. If the
zva size is >= 256 and there are more than 64 bytes left in the
tail, we could enter the loop and thus need to rebias dst by 32 as
well.
Since no known CPUs use this size this can't be tested natively, so I've
tested it on a simulator initialized with a large zva size.
--
This fixes an ineffiency in the non-zero memset. Delaying the writeback
until the end of the loop is slightly faster on some cores - this shows
~5% performance gain on Cortex-A53 when doing large non-zero memsets.
Tested against the GLIBC testsuite.
Move common content of the various <sys/dirent.h> and the latest FreeBSD
<dirent.h> to <dirent.h>.
Signed-off-by: Sebastian Huber <sebastian.huber@embedded-brains.de>
This macro selects a compiler option that disables recognition of
common memset/memcpy patterns and converting those to direct
memset/memcpy calls.
Signed-off-by: Keith Packard <keithp@keithp.com>
Replace the simple byte-wise compare in the misaligned case with a
dword compare with page boundary checks in place. For simplicity I've
chosen a 4K page boundary so that we don't have to query the actual
page size on the system.
This results in up to 3x improvement in performance in the unaligned
case on falkor and about 2.5x improvement on mustang as measured using
bench-strcmp in glibc.
This improved memcmp provides a fast path for compares up to 16 bytes
and then compares 16 bytes at a time, thus optimizing loads from both
sources. The glibc memcmp microbenchmark retains performance (with an
error of ~1ns) for smaller compare sizes and reduces up to 31% of
execution time for compares up to 4K on the APM Mustang. On Qualcomm
Falkor this improves to almost 48%, i.e. it is almost 2x improvement
for sizes of 2K and above.
The mutually misaligned inputs on aarch64 are compared with a simple
byte copy, which is not very efficient. Enhance the comparison
similar to strcmp by loading a double-word at a time. The peak
performance improvement (i.e. 4k maxlen comparisons) due to this on
the strncmp microbenchmark in glibc is as follows:
falkor: 3.5x (up to 72% time reduction)
cortex-a73: 3.5x (up to 71% time reduction)
cortex-a53: 3.5x (up to 71% time reduction)
All mutually misaligned inputs from 16 bytes maxlen onwards show
upwards of 15% improvement and there is no measurable effect on the
performance of aligned/mutually aligned inputs.
- From: Cesar Philippidis <cesar@codesourcery.com>
Date: Tue, 10 Apr 2018 14:43:42 -0700
Subject: [PATCH] nvptx port
This port adds support for Nvidia GPU's, which are primarily used as
offload accelerators in OpenACC and OpenMP.
At least with Binutils 2.30 and GCC 7.3 we need symbol definitions
without the leading underscore.
Signed-off-by: Sebastian Huber <sebastian.huber@embedded-brains.de>
Discard QUICKREF sections, rather than writing them to stderr
Discard MATHREF sections, rather than discarding as an error
Pass NOTES sections through to texinfo, rather than discarding as an error
Don't redirect makedoc stderr to .ref file
Remove makedoc output on error
Remove .ref files from CLEANFILES
Regenerate Makefile.ins
Signed-off-by: Jon Turney <jon.turney@dronecode.org.uk>
- For prevent confuse about what BSD license variant we used, 2- or
3-clause license, we change the license to FreeBSD license to make
it unambiguously refers to the 2-clause license.
With this change the arm platform can now be fully compiled with Clang.
Tested by comparing the output with GCC 4.8.2, and Clang 4.0, using a
variety of arches, big/little endianness, and arm/thumb mode to verify
the generated assembly output matches between GCC vs Clang with UAL, and
also GCC with UAL vs GCC with non-UAL, for all preprocessor code blocks.
The only difference found is an extra nop at the end of the function
when compiled with GCC using armv7-a/thumb/little-endian/-O2 compared to
Clang. The nop is not emitted when compiled in big-endian mode.
The implementation of the POSIX access() function is nothing machine
specific like memcpy(), etc. Move it back to the system domain. This
avoids problems due to the include search order of the Newlib/GCC build
which picks up machine includes before system includes.
Signed-off-by: Sebastian Huber <sebastian.huber@embedded-brains.de>
In patch b219285f87 you have a syntax
error in the PLD instruction. The syntax for the pld argument should be
in square brackets as it's a memory address like so: pld [r1]. With
your patch the newlib build fails for armv7-a targets. This patch fixes
the build failures.
Tested by making sure the newlib build completes successfully.
2016-01-26 Kyrylo Tkachov <kyrylo.tkachov@arm.com>
* libc/machine/arm/strcpy.c (strcpy): Fix PLD assembly syntax.
* libc/machine/arm/strlen-stub.c (strlen): Likewise.
LTO can re-order top-level assembly blocks, which can cause this
macro definition to appear after its use (or not at all), causing
compilation failures. On modern toolchains (armv4t+), assembly
should write `bx lr` in all cases, and linkers will transparently
convert them to `mov pc, lr`, allowing us to simply remove the
macro.
(source: https://groups.google.com/forum/#!topic/comp.sys.arm/3l7fVGX-Wug
and verified empirically)
For the armv4.S file, preserve this macro to maximize backwards
compatibility.
LTO can re-order top-level assembly blocks, which can cause this
macro definition to appear after its use (or not at all), causing
compilation failures. As the macro has very few uses, simply removing
it by inlining is a simple fix.
n.b. one of the macro invocations in strlen-stub.c was already
guarded by the relevant #define, so it is simply converted directly
to a pld
In the case of memcpy-armv7m.S being built for a big-endian multilib
(including armv7 without a specific profile), realignment code made
assumptions about the byte ordering being little-endian.
Signed-off-by: Corinna Vinschen <corinna@vinschen.de>
strcmp.S contained invalid guard for code that used barrel-shifter optional
instruction - it was checking for !ARC601 instead of whether barrel shifter
is present. While it is true that ARC601 doesn't have barrel shifter, so
does some other ARC EM configurations.
2016-07-21 Anton Kolesov <Anton.Kolesov@synopsys.com>
* libc/machine/arc/strcmp.S: Fix big endian without barrel shifter.
Signed-off-by: Anton Kolesov <Anton.Kolesov@synopsys.com>
Prealloc instruction may not be present in all HS variants. Hence, use
prefetch instead of prealloc.
newlib/
2016-04-26 Claudiu Zissulescu <claziss@synopsys.com>
* libc/machine/arc/memset-archs.S: Use prefetch.
Add makedocbook, a tool to process makedoc markup and output DocBook XML
refentries.
Process all the source files which are processed with makedoc with
makedocbook as well
Add chapter-texi2docbook, a tool to automatically generate DocBook XML
chapter files from the chapter .texi files. For generating man pages all we
care about is the content of the refentries, so all this needs to do is
convert the @include of the makedoc generated .def files to xi:include of
the makedocbook generated .xml files.
Add skeleton Docbook XML book files, lib[cm].in.xml which include these
generated chapters, which in turn include the generated files containing
refentries, which is processed with xsltproc to generate the lib[cm].xml
Add new make targets to generate and install man pages from lib[cm].xml