Commit Graph

3660 Commits

Author SHA1 Message Date
Sebastian Huber 1db7cf5ce6 RTEMS: Add README 2022-07-11 13:19:29 +02:00
Gleb Smirnoff c1abc93988 libc/syslog: fully deprecate and don't try to open "/dev/log"
The "/dev/log" socket existed in pre-FreeBSD times.  Later it was
substituted to a compatibility symlink.  The symlink creation was
deprecated in FreeBSD 10.2 and 9-STABLE.

Reviewed by:		markj
Differential revision:	https://reviews.freebsd.org/D35304
2022-07-11 13:19:29 +02:00
Konrad Sewiłło-Jopek cf2ba7d7f8 arp: Implement sticky ARP mode for interfaces.
Provide sticky ARP flag for network interface which marks it as the
"sticky" one similarly to what we have for bridges. Once interface is
marked sticky, any address resolved using the ARP will be saved as a
static one in the ARP table. Such functionality may be used to prevent
ARP spoofing or to decrease latencies in Ethernet networks.

The drawbacks include potential limitations in usage of ARP-based
load-balancers and high-availability solutions such as carp(4).

The implemented option is disabled by default, therefore should not
impact the default behaviour of the networking stack.

Sponsored by:		Conclusive Engineering sp. z o.o.
Reviewed By:		melifaro, pauamma_gundo.com
Differential Revision: https://reviews.freebsd.org/D35314
MFC after:		2 weeks
2022-07-11 13:19:29 +02:00
Alan Somers 27dfb5f33f Correctly measure system load averages > 1024
The old fixed-point arithmetic used for calculating load averages had an
overflow at 1024.  So on systems with extremely high load, the observed
load average would actually fall back to 0 and shoot up again, creating
a kind of sawtooth graph.

Fix this by using 64-bit math internally, while still reporting the load
average to userspace as a 32-bit number.

Sponsored by:	Axcient
Reviewed by:	imp
Differential Revision: https://reviews.freebsd.org/D35134
2022-07-11 13:19:29 +02:00
Konstantin Belousov 0ed668df2c Add ifcap2 names for RXTLS4 and RXTLS6 interface capabilities
and corresponding nvlist capabilities name strings.

Reviewed by:	hselasky, jhb, kp (previous version)
Sponsored by:	NVIDIA Networking
MFC after:	3 weeks
Differential revision:	https://reviews.freebsd.org/D32551
2022-07-11 13:19:29 +02:00
Konstantin Belousov 361bd82a1f Kernel-side infrastructure to implement nvlist-based set/get ifcaps
Reviewed by:	hselasky, jhb, kp (previous version)
Sponsored by:	NVIDIA Networking
MFC after:	3 weeks
Differential revision:	https://reviews.freebsd.org/D32551
2022-07-11 13:19:29 +02:00
Richard Scheffenegger aeced2f48a tcp: LRO code to deal with all 12 TCP header flags
TCP per RFC793 has 4 reserved flag bits for future use. One
of those bits may be used for Accurate ECN.
This patch is to include these bits in the LRO code to ease
the extensibility if/when these bits are used.

Reviewed By: hselasky, rrs, #transport
Sponsored by:        NetApp, Inc.
Differential Revision: https://reviews.freebsd.org/D34127
2022-07-11 13:19:29 +02:00
Mike Karels a9a87c1921 kernel: deprecate Internet Class A/B/C
Hide historical Class A/B/C macros unless IN_HISTORICAL_NETS is defined;
define it for user level.  Define IN_MULTICAST separately from IN_CLASSD,
and use it in pf instead of IN_CLASSD.  Stop using class for setting
default masks when not specified; instead, define new default mask
(24 bits).  Warn when an Internet address is set without a mask.

MFC after:	1 month
Reviewed by:	cy
Differential Revision: https://reviews.freebsd.org/D32708
2022-07-11 13:19:29 +02:00
Peter Lei 73784208e3 tcp: socket option to get stack alias name
TCP stack sysctl nodes are currently inserted using the stack
name alias. Allow the user to get the current stack's alias to
allow for programatic sysctl access.

Obtained from:	Netflix
2022-07-11 13:19:29 +02:00
Randall Stewart 0464f26db0 tcp: Add hystart-plus to cc_newreno and rack.
TCP Hystart draft version -03:
https://datatracker.ietf.org/doc/html/draft-ietf-tcpm-hystartplusplus

Is a new version of hystart that allows one to carefully exit slow start if the RTT
spikes too much. The newer version has a slower-slow-start so to speak that then
kicks in for five round trips. To see if you exited too early, if not into congestion avoidance.
This commit will add that feature to our newreno CC and add the needed bits in rack to
be able to enable it.

Reviewed by: tuexen
Sponsored by: Netflix Inc.
Differential Revision:	https://reviews.freebsd.org/D32373
2022-07-11 13:19:29 +02:00
Randall Stewart 57703f72c8 tcp: Add support for DSACK based reordering window to rack.
The rack stack, with respect to the rack bits in it, was originally built based
on an early I-D of rack. In fact at that time the TLP bits were in a separate
I-D. The dynamic reordering window based on DSACK events was not present
in rack at that time. It is now part of the RFC and we need to update our stack
to include these features. However we want to have a way to control the feature
so that we can, if the admin decides, make it stay the same way system wide as
well as via socket option. The new sysctl and socket option has the following
meaning for setting:

00 (0) - Keep the old way, i.e. reordering window is 1 and do not use DSACK bytes to add to reorder window
01 (1) - Change the Reordering window to 1/4 of an RTT but do not use DSACK bytes to add to reorder window
10 (2) - Keep the reordering window as 1, but do use SACK bytes to add additional 1/4 RTT delay to the reorder window
11 (3) - reordering window is 1/4 of an RTT and add additional DSACK bytes to increase the reordering window (RFC behavior)

The default currently in the sysctl is 3 so we get standards based behavior.
Reviewed by: tuexen
Sponsored by: Netflix Inc.
Differential Revision: https://reviews.freebsd.org/D31506
2022-07-11 13:19:29 +02:00
Andrew Gallatin 4bf5c259d3 tsleep: Add a PNOLOCK flag
Add a PNOLOCK flag so that, in the race circumstance where
wakeup races are externally mitigated, tsleep() can be
called with a sleep time of 0 without triggering an
an assertion.

Reviewed by: jhb
Sponsored by: Netflix
2022-07-11 13:19:29 +02:00
Roy Marples 356891f5e0 socket: Implement SO_RERROR
SO_RERROR indicates that receive buffer overflows should be handled as
errors. Historically receive buffer overflows have been ignored and
programs could not tell if they missed messages or messages had been
truncated because of overflows. Since programs historically do not
expect to get receive overflow errors, this behavior is not the
default.

This is really really important for programs that use route(4) to keep
in sync with the system. If we loose a message then we need to reload
the full system state, otherwise the behaviour from that point is
undefined and can lead to chasing bogus bug reports.

Reviewed by:	philip (network), kbowling (transport), gbe (manpages)
MFC after:	2 weeks
Differential Revision:	https://reviews.freebsd.org/D26652
2022-07-11 13:19:29 +02:00
Kristof Provost 5260d10c98 pf: syncookie support
Import OpenBSD's syncookie support for pf. This feature help pf resist
TCP SYN floods by only creating states once the remote host completes
the TCP handshake rather than when the initial SYN packet is received.

This is accomplished by using the initial sequence numbers to encode a
cookie (hence the name) in the SYN+ACK response and verifying this on
receipt of the client ACK.

Reviewed by:	kbowling
Obtained from:	OpenBSD
MFC after:	1 week
Sponsored by:	Modirum MDPay
Differential Revision:	https://reviews.freebsd.org/D31138
2022-07-11 13:19:29 +02:00
Randall Stewart b89c5a3e88 tcp: Add a socket option to rack
so we can test various changes to the slop value in timers.

Timer_slop, in TCP, has been 200ms for a long time. This value dates back
a long time when delayed ack timers were longer and links were slower. A
200ms timer slop allows 1 MSS to be sent over a 60kbps link. Its possible that
lowering this value to something more in line with todays delayed ack values (40ms)
might improve TCP. This bit of code makes it so rack can, via a socket option,
adjust the timer slop.

Reviewed by: mtuexen
Sponsered by: Netflix Inc
Differential Revision:	https://reviews.freebsd.org/D30249
2022-07-11 13:19:29 +02:00
Richard Scheffenegger d4971b6464 tcp: SACK Lost Retransmission Detection (LRD)
Recover from excessive losses without reverting to a
retransmission timeout (RTO). Disabled by default, enable
with sysctl net.inet.tcp.do_lrd=1

Reviewed By: #transport, rrs, tuexen, #manpages
Sponsored by: Netapp, Inc.
Differential Revision: https://reviews.freebsd.org/D28931
2022-07-11 13:19:29 +02:00
Randall Stewart a00ca7bd54 This brings into sync FreeBSD with the netflix
versions of rack and bbr. This fixes several breakages (panics) since the
tcp_lro code was committed that have been reported. Quite a few new features
are now in rack (prefecting of DGP -- Dynamic Goodput Pacing among the
largest). There is also support for ack-war prevention. Documents comming soon
on rack..

Sponsored by:           Netflix
Reviewed by:		rscheff, mtuexen
Differential Revision:	https://reviews.freebsd.org/D30036
2022-07-11 13:19:29 +02:00
John Baldwin 8424d5c949 Use thunks for compat ioctls using struct ifgroupreq.
Reviewed by:	brooks, kib
Obtained from:	CheriBSD
Sponsored by:	DARPA
Differential Revision:	https://reviews.freebsd.org/D29893
2022-07-11 13:19:29 +02:00
Konstantin Belousov 19a627f3a4 ioccom: define ioctl cmd value that can never be valid
Its use is for cases where some filler is needed for cmd, or we need an
indication that there were no cmd supplied, and so on.

Reviewed by:	jhb
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D29935
2022-07-11 13:19:29 +02:00
Thomas Munro 363527bb03 poll(2): Add POLLRDHUP.
Teach poll(2) to support Linux-style POLLRDHUP events for sockets, if
requested.  Triggered when the remote peer shuts down writing or closes
its end.

Reviewed by:	kib
MFC after:	1 month
Differential Revision:	https://reviews.freebsd.org/D29757
2022-07-11 13:19:29 +02:00
Michael Tuexen 85140fb378 tcp: add support for TCP over UDP
Adding support for TCP over UDP allows communication with
TCP stacks which can be implemented in userspace without
requiring special priviledges or specific support by the OS.
This is joint work with rrs.

Reviewed by:		rrs
Sponsored by:		Netflix, Inc.
MFC after:		1 week
Differential Revision:	https://reviews.freebsd.org/D29469
2022-07-11 13:19:29 +02:00
Bjoern A. Zeeb defb5ffed4 termios: add more speeds
A lot of small arm64 gadgets are using 1500000 as console speed.
While cu can perfectly deal with this some 3rd party software, e.g.,
comms/conserver-con add speeds based on B<n> being defined.
Having it defined here simplifies enhancing other software.

Obtained-from:	NetBSD sys/sys/termios.h 1.36
MFC-after:	2 weeks
Reviewed-by:	philip (,okayed by imp)
Differential Revision:	https://reviews.freebsd.org/D29209
2022-07-11 13:19:29 +02:00
Alexander V. Chernikov 3be97ff62c Revert "SO_RERROR indicates that receive buffer overflows"
Wrong version of the change was pushed inadvertenly.

This reverts commit 4a01b854ca5c2e5124958363b3326708b913af71.
2022-07-11 13:19:29 +02:00
Alexander V. Chernikov 2ba2e1e052 SO_RERROR indicates that receive buffer overflows
should be handled as errors. Historically receive buffer overflows have been
ignored and programs could not tell if they missed messages or messages had
been truncated because of overflows. Since programs historically do not expect
to get receive overflow errors, this behavior is not the default.

This is really really important for programs that use route(4) to keep in sync
with the system. If we loose a message then we need to reload the full system
state, otherwise the behaviour from that point is undefined and can lead
to chasing bogus bug reports.
2022-07-11 13:19:29 +02:00
Alex Richardson 8054ce555f Expose clang's alignment builtins and use them for roundup2/rounddown2
This makes roundup2/rounddown2 type- and const-preserving and allows
using it on pointer types without casting to uintptr_t first. Not
performing pointer-to-integer conversions also helps the compiler's
optimization passes and can therefore result in better code generation.
When using it with integer values there should be no change other than
the compiler checking that the alignment value is a valid power-of-two.

I originally implemented these builtins for CHERI a few years ago and
they have been very useful for CheriBSD. However, they are also useful
for non-CHERI code so I was able to upstream them for Clang 10.0.

Rationale from the clang documentation:
Clang provides builtins to support checking and adjusting alignment
of pointers and integers. These builtins can be used to avoid relying
on implementation-defined behavior of arithmetic on integers derived
from pointers. Additionally, these builtins retain type information
and, unlike bitwise arithmetic, they can perform semantic checking on
the alignment value.

There is also a feature request for GCC, so GCC may also support it in
the future: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98641

Reviewed By:	brooks, jhb, imp
Differential Revision: https://reviews.freebsd.org/D28332
2022-07-11 13:19:29 +02:00
Gleb Smirnoff 5bc5689a6a Catch up with 6edfd179c86: mechanically rename IFCAP_NOMAP to IFCAP_MEXTPG.
Originally IFCAP_NOMAP meant that the mbuf has external storage pointer
that points to unmapped address.  Then, this was extended to array of
such pointers.  Then, such mbufs were augmented with header/trailer.
Basically, extended mbufs are extended, and set of features is subject
to change.  The new name should be generic enough to avoid further
renaming.
2022-07-11 11:52:46 +02:00
Konstantin Belousov 581bde91a5 Add tcgetwinsize(3) and tcsetwinsize(3) to termios
These functions get/set tty winsize respectively, and are trivial wrappers
around corresponding termio ioctls.

The functions are expected to be a part of POSIX.1 issue 8:
https://www.austingroupbugs.net/view.php?id=1151#c3856.
They are currently available in NetBSD and in musl libc.

PR:	251868
Submitted by:	Soumendra Ganguly <soumendraganguly@gmail.com>
MFC after:	1 week
Differential Revision:	https://reviews.freebsd.org/D27650
2022-07-11 11:52:46 +02:00
Andrew Gallatin c76896074b Filter TCP connections to SO_REUSEPORT_LB listen sockets by NUMA domain
In order to efficiently serve web traffic on a NUMA
machine, one must avoid as many NUMA domain crossings as
possible. With SO_REUSEPORT_LB, a number of workers can share a
listen socket. However, even if a worker sets affinity to a core
or set of cores on a NUMA domain, it will receive connections
associated with all NUMA domains in the system. This will lead to
cross-domain traffic when the server writes to the socket or
calls sendfile(), and memory is allocated on the server's local
NUMA node, but transmitted on the NUMA node associated with the
TCP connection. Similarly, when the server reads from the socket,
he will likely be reading memory allocated on the NUMA domain
associated with the TCP connection.

This change provides a new socket ioctl, TCP_REUSPORT_LB_NUMA. A
server can now tell the kernel to filter traffic so that only
incoming connections associated with the desired NUMA domain are
given to the server. (Of course, in the case where there are no
servers sharing the listen socket on some domain, then as a
fallback, traffic will be hashed as normal to all servers sharing
the listen socket regardless of domain). This allows a server to
deal only with traffic that is local to its NUMA domain, and
avoids cross-domain traffic in most cases.

This patch, and a corresponding small patch to nginx to use
TCP_REUSPORT_LB_NUMA allows us to serve 190Gb/s of kTLS encrypted
https media content from dual-socket Xeons with only 13% (as
measured by pcm.x) cross domain traffic on the memory controller.

Reviewed by:	jhb, bz (earlier version), bcr (man page)
Tested by: gonzo
Sponsored by:	Netfix
Differential Revision:	https://reviews.freebsd.org/D21636
2022-07-11 11:52:46 +02:00
Brooks Davis 70b6efc47d style(9): Correct whitespace in struct definitions
struct ifconf and struct ifreq use the odd style "struct<tab>foo".
struct ifdrv seems to have tried to follow this but was committed with
spaces in place of most tabs resulting in "struct<space><space>ifdrv".

MFC after:	3 days
2022-07-11 11:52:46 +02:00
Conrad Meyer 3f7425e8bb unix(4): Enhance LOCAL_CREDS_PERSISTENT ABI
As this ABI is still fresh (r367287), let's correct some mistakes now:

- Version the structure to allow for future changes
- Include sender's pid in control message structure
- Use a distinct control message type from the cmsgcred / sockcred mess

Discussed with:	kib, markj, trasz
Differential Revision:	https://reviews.freebsd.org/D27084
2022-07-11 11:52:46 +02:00
Conrad Meyer 55dec604f8 unix(4): Add SOL_LOCAL:LOCAL_CREDS_PERSISTENT
This option is intended to be semantically identical to Linux's
SOL_SOCKET:SO_PASSCRED.  For now, it is mutually exclusive with the
pre-existing sockopt SOL_LOCAL:LOCAL_CREDS.

Reviewed by:	markj (penultimate version)
Differential Revision:	https://reviews.freebsd.org/D27011
2022-07-11 11:52:46 +02:00
Warner Losh 1cb590ab48 Integrate 4.4BSD-Lite2 changes to IOC_* definitions
Bring in the long-overdue 4.4BSD-Lite2 rev 8.3 by cgd of
sys/ioccom.h. This uses UL suffix for the IOC_* constants so they
don't sign extend. Also bring in the handy diagram from NetBSD's
version of this file. This alters the 4.4BSD-Lite2 code slightly
in a way that's semantically the same but more compact.

This should stop the warnings from Chrome for bogus sign extension.

Reviewed by: kib@, jhb@
Differential Revision: https://reviews.freebsd.org/D26423
2022-07-11 11:52:46 +02:00
John Baldwin 5ea36d92e6 Support hardware rate limiting (pacing) with TLS offload.
- Add a new send tag type for a send tag that supports both rate
  limiting (packet pacing) and TLS offload (mostly similar to D22669
  but adds a separate structure when allocating the new tag type).

- When allocating a send tag for TLS offload, check to see if the
  connection already has a pacing rate.  If so, allocate a tag that
  supports both rate limiting and TLS offload rather than a plain TLS
  offload tag.

- When setting an initial rate on an existing ifnet KTLS connection,
  set the rate in the TCP control block inp and then reset the TLS
  send tag (via ktls_output_eagain) to reallocate a TLS + ratelimit
  send tag.  This allocates the TLS send tag asynchronously from a
  task queue, so the TLS rate limit tag alloc is always sleepable.

- When modifying a rate on a connection using KTLS, look for a TLS
  send tag.  If the send tag is only a plain TLS send tag, assume we
  failed to allocate a TLS ratelimit tag (either during the
  TCP_TXTLS_ENABLE socket option, or during the send tag reset
  triggered by ktls_output_eagain) and ignore the new rate.  If the
  send tag is a ratelimit TLS send tag, change the rate on the TLS tag
  and leave the inp tag alone.

- Lock the inp lock when setting sb_tls_info for a socket send buffer
  so that the routines in tcp_ratelimit can safely dereference the
  pointer without needing to grab the socket buffer lock.

- Add an IFCAP_TXTLS_RTLMT capability flag and associated
  administrative controls in ifconfig(8).  TLS rate limit tags are
  only allocated if this capability is enabled.  Note that TLS offload
  (whether unlimited or rate limited) always requires IFCAP_TXTLS[46].

Reviewed by:	gallatin, hselasky
Relnotes:	yes
Sponsored by:	Netflix
Differential Revision:	https://reviews.freebsd.org/D26691
2022-07-11 11:52:46 +02:00
Andrey V. Elsukov b8e36b9251 Implement SIOCGIFALIAS.
It is lightweight way to check if an IPv4 address exists.

Submitted by:	Roy Marples
Reviewed by:	gnn, melifaro
MFC after:	2 weeks
Differential Revision:	https://reviews.freebsd.org/D26636
2022-07-11 11:52:46 +02:00
Richard Scheffenegger 3f0cc70c13 Add IP(V6)_VLAN_PCP to set 802.1 priority per-flow.
This adds a new IP_PROTO / IPV6_PROTO setsockopt (getsockopt)
option IP(V6)_VLAN_PCP, which can be set to -1 (interface
default), or explicitly to any priority between 0 and 7.

Note that for untagged traffic, explicitly adding a
priority will insert a special 801.1Q vlan header with
vlan ID = 0 to carry the priority setting

Reviewed by:	gallatin, rrs
MFC after:	2 weeks
Sponsored by:	NetApp, Inc.
Differential Revision:	https://reviews.freebsd.org/D26409
2022-07-11 11:52:46 +02:00
Konstantin Belousov ec997fae0e Fix typo.
Sponsored by:	Mellanox Technologies/NVIDIA Networking
MFC after:	3 days
2022-07-11 11:52:46 +02:00
Alexander V. Chernikov 48ba673ce9 Introduce scalable route multipath.
This change is based on the nexthop objects landed in D24232.

The change introduces the concept of nexthop groups.
Each group contains the collection of nexthops with their
 relative weights and a dataplane-optimized structure to enable
 efficient nexthop selection.

Simular to the nexthops, nexthop groups are immutable. Dataplane part
 gets compiled during group creation and is basically an array of
 nexthop pointers, compiled w.r.t their weights.

With this change, `rt_nhop` field of `struct rtentry` contains either
 nexthop or nexthop group. They are distinguished by the presense of
 NHF_MULTIPATH flag.
All dataplane lookup functions returns pointer to the nexthop object,
leaving nexhop groups details inside routing subsystem.

User-visible changes:

The change is intended to be backward-compatible: all non-mpath operations
 should work as before with ROUTE_MPATH and net.route.multipath=1.

All routes now comes with weight, default weight is 1, maximum is 2^24-1.

Current maximum multipath group width is statically set to 64.
 This will become sysctl-tunable in the followup changes.

Using functionality:
* Recompile kernel with ROUTE_MPATH
* set net.route.multipath to 1

route add -6 2001:db8::/32 2001:db8::2 -weight 10
route add -6 2001:db8::/32 2001:db8::3 -weight 20

netstat -6On

Nexthop groups data

Internet6:
GrpIdx  NhIdx     Weight   Slots                                 Gateway     Netif  Refcnt
1         ------- ------- ------- --------------------------------------- ---------       1
              13      10       1                             2001:db8::2     vlan2
              14      20       2                             2001:db8::3     vlan2

Next steps:
* Land outbound hashing for locally-originated routes ( D26523 ).
* Fix net/bird multipath (net/frr seems to work fine)
* Add ROUTE_MPATH to GENERIC
* Set net.route.multipath=1 by default

Tested by:	olivier
Reviewed by:	glebius
Relnotes:	yes
Differential Revision:	https://reviews.freebsd.org/D26449
2022-07-11 11:52:46 +02:00
Ed Maste 9dd91a8330 add SIOCGIFDATA ioctl
For interfaces that do not support SIOCGIFMEDIA (for which there are
quite a few) the only fallback is to query the interface for
if_data->ifi_link_state.  While it's possible to get at if_data for an
interface via getifaddrs(3) or sysctl, both are heavy weight mechanisms.

SIOCGIFDATA is a simple ioctl to retrieve this fast with very little
resource use in comparison.  This implementation mirrors that of other
similar ioctls in FreeBSD.

Submitted by:	Roy Marples <roy@marples.name>
Reviewed by:	markj
MFC after:	1 month
Differential Revision:	https://reviews.freebsd.org/D26538
2022-07-11 11:52:46 +02:00
Richard Scheffenegger 7b30b9f648 TCP: send full initial window when timestamps are in use
The fastpath in tcp_output tries to send out
full segments, and avoid sending partial segments by
comparing against the static t_maxseg variable.
That value does not consider tcp options like timestamps,
while the initial window calculation is using
the correct dynamic tcp_maxseg() function.

Due to this interaction, the last, full size segment
is considered too short and not sent out immediately.

Reviewed by:	tuexen
MFC after:	2 weeks
Sponsored by:	NetApp, Inc.
Differential Revision:	https://reviews.freebsd.org/D26478
2022-07-11 11:52:46 +02:00
Navdeep Parhar 43e76bafcd Add two new ifnet capabilities
for hw checksumming and TSO for VXLAN traffic.

These are similar to the existing VLAN capabilities.

Reviewed by:	kib@
Sponsored by:	Chelsio Communications
Differential Revision:	https://reviews.freebsd.org/D25873
2022-07-11 11:52:46 +02:00
Konstantin Belousov 1306ff4c92 Support for userspace non-transparent superpages (largepages).
Created with shm_open2(SHM_LARGEPAGE) and then configured with
FIOSSHMLPGCNF ioctl, largepages posix shared memory objects guarantee
that all userspace mappings of it are served by superpage non-managed
mappings.

Only amd64 for now, both 2M and 1G superpages can be requested, the
later requires CPU feature.

Reviewed by:	markj
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D24652
2022-07-11 11:52:46 +02:00
Mark Johnston 1a5f14a0c5 Include the psind in data returned by mincore(2).
Currently we use a single bit to indicate whether the virtual page is
part of a superpage.  To support a forthcoming implementation of
non-transparent 1GB superpages, it is useful to provide more detailed
information about large page sizes.

The change converts MINCORE_SUPER into a mask for MINCORE_PSIND(psind)
values, indicating a mapping of size psind, where psind is an index into
the pagesizes array returned by getpagesizes(3), which in turn comes
from the hw.pagesizes sysctl.  MINCORE_PSIND(1) is equal to the old
value of MINCORE_SUPER.

For now, two bits are used to record the page size, permitting values
of MAXPAGESIZES up to 4.

Reviewed by:	alc, kib
Sponsored by:	Juniper Networks, Inc.
Sponsored by:	Klara, Inc.
Differential Revision:	https://reviews.freebsd.org/D26238
2022-07-11 11:52:46 +02:00
Mateusz Guzik d066d123f1 sys: clean up empty lines in .c and .h files 2022-07-11 11:52:46 +02:00
Mateusz Guzik 27fc846731 net: clean up empty lines in .c and .h files 2022-07-11 11:52:46 +02:00
Konstantin Belousov 941cda2c16 Add SOL_LOCAL symbolic constant for unix socket option level.
The constant seems to exists on MacOS X >= 10.8.

Requested by:	swills
Reviewed by:	allanjude, kevans
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D25933
2022-07-11 11:52:46 +02:00
Kyle Evans c95c267a46 shm_open2: Implement SHM_GROW_ON_WRITE
Lack of SHM_GROW_ON_WRITE is actively breaking Python's memfd_create tests,
so go ahead and implement it. A future change will make memfd_create always
set SHM_GROW_ON_WRITE, to match Linux behavior and unbreak Python's tests
on -CURRENT.

Reviewed by:	kib
Differential Revision:	https://reviews.freebsd.org/D25502
2022-07-11 11:52:46 +02:00
Wei Hu 1a840361e8 HyperV socket implementation for FreeBSD
This change adds Hyper-V socket feature in FreeBSD. New socket address
family AF_HYPERV and its kernel support are added.

Submitted by:	Wei Hu <weh@microsoft.com>
Reviewed by:	Dexuan Cui <decui@microsoft.com>
Relnotes:	yes
Sponsored by:	Microsoft
Differential Revision:	https://reviews.freebsd.org/D24061
2022-07-11 11:52:46 +02:00
John Baldwin 7293d1e7b6 Initial support for kernel offload of TLS receive.
- Add a new TCP_RXTLS_ENABLE socket option to set the encryption and
  authentication algorithms and keys as well as the initial sequence
  number.

- When reading from a socket using KTLS receive, applications must use
  recvmsg().  Each successful call to recvmsg() will return a single
  TLS record.  A new TCP control message, TLS_GET_RECORD, will contain
  the TLS record header of the decrypted record.  The regular message
  buffer passed to recvmsg() will receive the decrypted payload.  This
  is similar to the interface used by Linux's KTLS RX except that
  Linux does not return the full TLS header in the control message.

- Add plumbing to the TOE KTLS interface to request either transmit
  or receive KTLS sessions.

- When a socket is using receive KTLS, redirect reads from
  soreceive_stream() into soreceive_generic().

- Note that this interface is currently only defined for TLS 1.1 and
  1.2, though I believe we will be able to reuse the same interface
  and structures for 1.3.
2022-07-11 11:52:46 +02:00
Randall Stewart 1da65b8919 This change does a small prepratory step
in getting the latest rack and bbr in from the NF repo. When those come in the
OOB data handling will be fixed where Skyzaller crashes.

Differential Revision:	https://reviews.freebsd.org/D24575
2022-07-11 11:52:46 +02:00
Alexander V. Chernikov b948693357 Convert route caching to nexthop caching.
This change is build on top of nexthop objects introduced in r359823.

Nexthops are separate datastructures, containing all necessary information
 to perform packet forwarding such as gateway interface and mtu. Nexthops
 are shared among the routes, providing more pre-computed cache-efficient
 data while requiring less memory. Splitting the LPM code and the attached
 data solves multiple long-standing problems in the routing layer,
 drastically reduces the coupling with outher parts of the stack and allows
 to transparently introduce faster lookup algorithms.

Route caching was (re)introduced to minimise (slow) routing lookups, allowing
 for notably better performance for large TCP senders. Caching works by
 acquiring rtentry reference, which is protected by per-rtentry mutex.
 If the routing table is changed (checked by comparing the rtable generation id)
 or link goes down, cache record gets withdrawn.

Nexthops have the same reference counting interface, backed by refcount(9).
This change merely replaces rtentry with the actual forwarding nextop as a
 cached object, which is mostly mechanical. Other moving parts like cache
 cleanup on rtable change remains the same.

Differential Revision:	https://reviews.freebsd.org/D24340
2022-07-11 11:52:46 +02:00
Jonathan T. Looney 09e5cb57a0 Make the path length of UNIX domain sockets
specified by a #define. Also, add a comment describing the historical context
for this length.

Reviewed by:	bz, jhb, kbowling (previous version)
MFC after:	2 weeks
Sponsored by:	Netflix, Inc.
Differential Revision:	https://reviews.freebsd.org/D24272
2022-07-11 11:52:46 +02:00
Alexander V. Chernikov 86484e84d7 Introduce nexthop objects and new routing KPI.
This is the foundational change for the routing subsytem rearchitecture.
 More details and goals are available in https://reviews.freebsd.org/D24141 .

This patch introduces concept of nexthop objects and new nexthop-based
 routing KPI.

Nexthops are objects, containing all necessary information for performing
 the packet output decision. Output interface, mtu, flags, gw address goes
 there. For most of the cases, these objects will serve the same role as
 the struct rtentry is currently serving.
Typically there will be low tens of such objects for the router even with
 multiple BGP full-views, as these objects will be shared between routing
 entries. This allows to store more information in the nexthop.

New KPI:

struct nhop_object *fib4_lookup(uint32_t fibnum, struct in_addr dst,
  uint32_t scopeid, uint32_t flags, uint32_t flowid);
struct nhop_object *fib6_lookup(uint32_t fibnum, const struct in6_addr *dst6,
  uint32_t scopeid, uint32_t flags, uint32_t flowid);

These 2 function are intended to replace all all flavours of
 <in_|in6_>rtalloc[1]<_ign><_fib>, mpath functions  and the previous
 fib[46]-generation functions.

Upon successful lookup, they return nexthop object which is guaranteed to
 exist within current NET_EPOCH. If longer lifetime is desired, one can
 specify NHR_REF as a flag and get a referenced version of the nexthop.
 Reference semantic closely resembles rtentry one, allowing sed-style conversion.

Additionally, another 2 functions are introduced to support uRPF functionality
 inside variety of our firewalls. Their primary goal is to hide the multipath
 implementation details inside the routing subsystem, greatly simplifying
 firewalls implementation:

int fib4_lookup_urpf(uint32_t fibnum, struct in_addr dst, uint32_t scopeid,
  uint32_t flags, const struct ifnet *src_if);
int fib6_lookup_urpf(uint32_t fibnum, const struct in6_addr *dst6, uint32_t scopeid,
  uint32_t flags, const struct ifnet *src_if);

All functions have a separate scopeid argument, paving way to eliminating IPv6 scope
 embedding and allowing to support IPv4 link-locals in the future.

Structure changes:
 * rtentry gets new 'rt_nhop' pointer, slightly growing the overall size.
 * rib_head gets new 'rnh_preadd' callback pointer, slightly growing overall sz.

Old KPI:
During the transition state old and new KPI will coexists. As there are another 4-5
 decent-sized conversion patches, it will probably take a couple of weeks.
To support both KPIs, fields not required by the new KPI (most of rtentry) has to be
 kept, resulting in the temporary size increase.
Once conversion is finished, rtentry will notably shrink.

More details:
* architectural overview: https://reviews.freebsd.org/D24141
* list of the next changes: https://reviews.freebsd.org/D24232

Reviewed by:	ae,glebius(initial version)
Differential Revision:	https://reviews.freebsd.org/D24232
2022-07-11 11:52:46 +02:00
Gleb Smirnoff f3303cf1d5 Although most of the NIC drivers are epoch ready,
due to peer pressure switch over to opt-in instead of opt-out for epoch.

Instead of IFF_NEEDSEPOCH, provide IFF_KNOWSEPOCH. If driver marks
itself with IFF_KNOWSEPOCH, then ether_input() would not enter epoch
when processing its packets.

Now this will create recursive entrance in epoch in >90% network
drivers, but will guarantee safeness of the transition.

Mark several tested drivers as IFF_KNOWSEPOCH.

Reviewed by:		hselasky, jeff, bz, gallatin
Differential Revision:	https://reviews.freebsd.org/D23674
2022-07-11 11:52:46 +02:00
Randall Stewart 0dfcaa0211 White space cleanup --
remove trailing tab's or spaces from any line.

Sponsored by:	Netflix Inc.
2022-07-11 11:52:46 +02:00
Gleb Smirnoff 301991542a Introduce flag IFF_NEEDSEPOCH
that marks Ethernet interfaces that supposedly may call into ether_input()
without network epoch.

They all need to be reviewed before 13.0-RELEASE.  Some may need
be fixed.  The flag is not planned to be used in the kernel for
a long time.
2022-07-11 11:52:46 +02:00
Michael Tuexen ebbb6536b7 Add flags for upcoming patches related to improved ECN handling.
No functional change.

Submitted by: Richard Scheffenegger
Reviewed by: rgrimes@, tuexen@
Differential Revision: https://reviews.freebsd.org/D22429
2022-07-11 11:52:46 +02:00
Edward Tomasz Napierala 0c4d87ca5f Make use of the stats(3) framework in the TCP stack.
This makes it possible to retrieve per-connection statistical
information such as the receive window size, RTT, or goodput,
using a newly added TCP_STATS getsockopt(3) option, and extract
them using the stats_voistat_fetch(3) API.

See the net/tcprtt port for an example consumer of this API.

Compared to the existing TCP_INFO system, the main differences
are that this mechanism is easy to extend without breaking ABI,
and provides statistical information instead of raw "snapshots"
of values at a given point in time.  stats(3) is more generic
and can be used in both userland and the kernel.

Reviewed by:	thj
Tested by:	thj
Obtained from:	Netflix
Relnotes:	yes
Sponsored by:	Klara Inc, Netflix
Differential Revision:	https://reviews.freebsd.org/D20655
2022-07-11 11:52:46 +02:00
David Bright 0c854dd6d1 Jail and capability mode for shm_rename;
add audit support for shm_rename

Co-mingling two things here:

  * Addressing some feedback from Konstantin and Kyle re: jail,
    capability mode, and a few other things
  * Adding audit support as promised.

The audit support change includes a partial refresh of OpenBSM from
upstream, where the change to add shm_rename has already been
accepted. Matthew doesn't plan to work on refreshing anything else to
support audit for those new event types.

Submitted by:	Matthew Bryan <matthew.bryan@isilon.com>
Reviewed by:	kib
Relnotes:	Yes
Sponsored by:	Dell EMC Isilon
Differential Revision:	https://reviews.freebsd.org/D22083
2022-07-11 11:52:46 +02:00
John Baldwin 12fb531a70 Add a TOE KTLS mode and a TOE hook for allocating TLS sessions.
This adds the glue to allocate TLS sessions and invokes it from
the TLS enable socket option handler.  This also adds some counters
for active TOE sessions.

The TOE KTLS mode is returned by getsockopt(TLSTX_TLS_MODE) when
TOE KTLS is in use on a socket, but cannot be set via setsockopt().

To simplify various checks, a TLS session now includes an explicit
'mode' member set to the value returned by TLSTX_TLS_MODE.  Various
places that used to check 'sw_encrypt' against NULL to determine
software vs ifnet (NIC) TLS now check 'mode' instead.

Reviewed by:	np, gallatin
Sponsored by:	Chelsio Communications
Differential Revision:	https://reviews.freebsd.org/D21891
2022-07-11 11:52:46 +02:00
Kyle Evans 1ef7e3904d MFD_*: swap ordering
This API is still young enough that I would expect no one to be dependant on
this yet... Swap the ordering while it's young to match Linux values to
potentially ease implementation of linuxolator syscall, being able to reuse
existing constants.
2022-07-11 11:52:46 +02:00
David Bright 53648039c4 Add an shm_rename syscall
Add an atomic shm rename operation, similar in spirit to a file
rename. Atomically unlink an shm from a source path and link it to a
destination path. If an existing shm is linked at the destination
path, unlink it as part of the same atomic operation. The caller needs
the same permissions as shm_unlink to the shm being renamed, and the
same permissions for the shm at the destination which is being
unlinked, if it exists. If those fail, EACCES is returned, as with the
other shm_* syscalls.

truss support is included; audit support will come later.

This commit includes only the implementation; the sysent-generated
bits will come in a follow-on commit.

Submitted by:	Matthew Bryan <matthew.bryan@isilon.com>
Reviewed by:	jilles (earlier revision)
Reviewed by:	brueffer (manpages, earlier revision)
Relnotes:	yes
Sponsored by:	Dell EMC Isilon
Differential Revision:	https://reviews.freebsd.org/D21423
2022-07-11 11:52:46 +02:00
Kyle Evans 9243caa8d3 Add linux-compatible memfd_create
memfd_create is effectively a SHM_ANON shm_open(2) mapping with optional
CLOEXEC and file sealing support. This is used by some mesa parts, some
linux libs, and qemu can also take advantage of it and uses the sealing to
prevent resizing the region.

This reimplements shm_open in terms of shm_open2(2) at the same time.

shm_open(2) will be moved to COMPAT12 shortly.

Reviewed by:	markj, kib
Differential Revision:	https://reviews.freebsd.org/D21393
2022-07-11 11:52:46 +02:00
Kyle Evans 99b66f5315 Add a shm_open2 syscall to support upcoming memfd_create
shm_open2 allows a little more flexibility than the original shm_open.
shm_open2 doesn't enforce CLOEXEC on its callers, and it has a separate
shmflag argument that can be expanded later. Currently the only shmflag is
to allow file sealing on the returned fd.

shm_open and memfd_create will both be implemented in libc to use this new
syscall.

__FreeBSD_version is bumped to indicate the presence.

Reviewed by:	kib, markj
Differential Revision:	https://reviews.freebsd.org/D21393
2022-07-11 11:52:46 +02:00
Randall Stewart 878b65b3b6 This commit adds BBR (Bottleneck Bandwidth and RTT) congestion control.
This is a completely separate TCP stack (tcp_bbr.ko) that will be built only if
you add the make options WITH_EXTRA_TCP_STACKS=1 and also include the option
TCPHPTS.  You can also include the RATELIMIT option if you have a NIC interface
that supports hardware pacing, BBR understands how to use such a feature.

Note that this commit also adds in a general purpose time-filter which
allows you to have a min-filter or max-filter. A filter allows you to
have a low (or high) value for some period of time and degrade slowly
to another value has time passes. You can find out the details of
BBR by looking at the original paper at:

https://queue.acm.org/detail.cfm?id=3022184

or consult many other web resources you can find on the web
referenced by "BBR congestion control". It should be noted that
BBRv1 (which this is) does tend to unfairness in cases of small
buffered paths, and it will usually get less bandwidth in the case
of large BDP paths(when competing with new-reno or cubic flows). BBR
is still an active research area and we do plan on  implementing V2
of BBR to see if it is an improvement over V1.

Sponsored by:	Netflix Inc.
Differential Revision:	https://reviews.freebsd.org/D21582
2022-07-11 11:52:46 +02:00
Alan Somers ce921ffca8 Reduce namespace pollution from r349233
Define __daddr_t in _types.h and use it in filio.h

Reported by:	ian, bde
Reviewed by:	ian, imp, cem
MFC after:	2 weeks
MFC-With:	349233
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D20715
2022-07-11 11:52:46 +02:00
Alan Somers 5a6ad7c5bc #include <sys/types.h> from sys/filio.h
This fixes world build after r349231

Reported by:	Jenkins
MFC after:	2 weeks
MFC-With:	349231
Sponsored by:	The FreeBSD Foundation
2022-07-11 11:52:46 +02:00
Alan Somers 8fe49db783 Add FIOBMAP2 ioctl
This ioctl exposes VOP_BMAP information to userland. It can be used by
programs like fragmentation analyzers and optimized cp implementations. But
I'm using it to test fusefs's VOP_BMAP implementation. The "2" in the name
distinguishes it from the similar but incompatible FIBMAP ioctls in NetBSD
and Linux.  FIOBMAP2 differs from FIBMAP in that it uses a 64-bit block
number instead of 32-bit, and it also returns runp and runb.

Reviewed by:	mckusick
MFC after:	2 weeks
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D20705
2022-07-11 11:52:46 +02:00
Brooks Davis c42aaaea4f Move 32-bit compat support for FIODGNAME to the right place.
ioctl(2) commands only have meaning in the context of a file descriptor
so translating them in the syscall layer is incorrect.

The new handler users an accessor to retrieve/construct a pointer from
the last member of the passed structure and relies on type punning to
access the other member which requires no translation.

Unlike r339174 this change supports both places FIODGNAME is handled.

Reviewed by:	kib
Obtained from:	CheriBSD
Sponsored by:	DARPA, AFRL
Differential Revision:	https://reviews.freebsd.org/D17475
2022-07-11 11:52:46 +02:00
Pedro F. Giffuni eb4cbf4fd3 sys: further adoption of SPDX licensing ID tags.
Mainly focus on files that use BSD 3-Clause license.

The Software Package Data Exchange (SPDX) group provides a specification
to make it easier for automated tools to detect and summarize well known
opensource licenses. We are gradually adopting the specification, noting
that the tags are considered only advisory and do not, in any way,
superceed or replace the license texts.

Special thanks to Wind River for providing access to "The Duke of
Highlander" tool: an older (2014) run over FreeBSD tree was useful as a
starting point.
2022-07-11 11:52:46 +02:00
Sebastian Huber 5c0c0e5c77 RTEMS: Remove FreeBSD version tags 2022-07-11 11:52:46 +02:00
Warner Losh 9331071f02 cdefs.h: Remove redundant #ifdefs
Remove redunant #ifdef __GNUC__ inside an #if defined(__GNUC__)
block. They are nops.

Sponsored by:		Netflix
2022-07-11 11:52:46 +02:00
Mark Johnston f537ff8ee5 cdefs: Add a default definition for __nosanitizememory
MFC after:	1 week
Sponsored by:	The FreeBSD Foundation
2022-07-11 11:52:46 +02:00
Mark Johnston 8801440e4f cdefs: Make __nosanitizeaddress work for KASAN as well
Add __nosanitizememory while I'm here.

Reviewed by:	andrew, kib
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D30126
2022-07-11 11:52:46 +02:00
Warner Losh 65338d7299 cdefs.h: Remove __GNUCLIKE___OFFSETOF, it's unused
__GNUCLIKE___OFFSETOF is unreferenced in the tree, remove it as long
obsolete.

Sponsored by:		Netflix
2022-07-11 11:52:46 +02:00
Alex Richardson 68109f904b Expose clang's alignment builtins and use them for roundup2/rounddown2
This makes roundup2/rounddown2 type- and const-preserving and allows
using it on pointer types without casting to uintptr_t first. Not
performing pointer-to-integer conversions also helps the compiler's
optimization passes and can therefore result in better code generation.
When using it with integer values there should be no change other than
the compiler checking that the alignment value is a valid power-of-two.

I originally implemented these builtins for CHERI a few years ago and
they have been very useful for CheriBSD. However, they are also useful
for non-CHERI code so I was able to upstream them for Clang 10.0.

Rationale from the clang documentation:
Clang provides builtins to support checking and adjusting alignment
of pointers and integers. These builtins can be used to avoid relying
on implementation-defined behavior of arithmetic on integers derived
from pointers. Additionally, these builtins retain type information
and, unlike bitwise arithmetic, they can perform semantic checking on
the alignment value.

There is also a feature request for GCC, so GCC may also support it in
the future: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98641

Reviewed By:	brooks, jhb, imp
Differential Revision: https://reviews.freebsd.org/D28332
2022-07-11 11:52:46 +02:00
Warner Losh c382ecde55 cdefs.h: remove intel_compiler support
The  age  of   the  intel  compiler  support  is  so   old  as  to  be
uninteresting. No recent recports of  intel compiler support have been
received.  Remove  all the  special  case  workarounds for  the  Intel
compiler. Should there be interest in supporting the compiler, contact
me and I'll work with people to make it happen, though I suspect these
instances are more likely to be in the way than to be helpful.

Reviewed by: cem, emaste, vangyzen, dim
Differential Revision: https://reviews.freebsd.org/D26817
2022-07-11 11:52:46 +02:00
Sebastian Huber 282d57d2a8 Reduce namespace pollution from <sys/_types.h>
Provide only __daddr_t through <sys/_types.h>.
2022-07-08 06:57:52 +02:00
Sebastian Huber 4ab39e0a85 RTEMS: Declare ioctl() also if _KERNEL is defined
This fixes the following warning in libbsd:

rtems/blkdev.h:200:10: warning: implicit declaration of function 'ioctl'; did
  you mean 'ifioctl'? [-Wimplicit-function-declaration]

Remove unnecessary includes.
2022-07-08 06:57:52 +02:00
Ken Brown 5d4f405d3b Cygwin: redefine some macros for Linux compatibility
Define FD_SETSIZE (<sys/select.h>) to be 1024 by default, and define
NOFILE (<sys/param.h>) to be OPEN_MAX (== 3200) by default.

Remove the comment in <sys/select.h> that FD_SETSIZE should be >=
NOFILE.

Bump API minor.

Addresses: https://cygwin.com/pipermail/cygwin/2022-July/251839.html
2022-07-07 08:22:40 -04:00
Ken Brown 1503d14af1 Cygwin: stdio: don't try again to read after EOF
This reverts commit 1f8f7e2d54, "* libc/stdio/refill.c (__srefill):
Try again after EOF on Cygwin."  If EOF is set on a file, the stdio
input functions will now immediately return EOF rather than trying
again to read.  This aligns Cygwin's behavior to that of Linux.

Addresses: https://cygwin.com/pipermail/cygwin/2022-June/251672.html
2022-07-04 18:55:08 -04:00
Sebastian Huber 27fd806cd7 RTEMS: _KERNEL tweak for <sys/cpuset.h>
If _KERNEL is defined, then do not delcare CPU_ALLOC() and CPU_FREE() since
__cpuset_alloc() and __cpuset_free() are not declared as well.
2022-07-01 07:25:32 +02:00
Stefan Eßer e7ffbdb018 newlib/libc/sys/rtems/include/sys/cpuset.h
Fix typo in source file.

Reported by:	pluknet at gmail.com (Sergey Kandaurov)
2022-06-22 10:28:42 +02:00
Stefan Eßer e927f541f7 Make CPU_SET macros compliant with other implementations
The introduction of <sched.h> improved compatibility with some 3rd
party software, but caused the configure scripts of some ports to
assume that they were run in a GLIBC compatible environment.

Parts of sched.h were made conditional on -D_WITH_CPU_SET_T being
added to ports, but there still were compatibility issues due to
invalid assumptions made in autoconfigure scripts.

The differences between the FreeBSD version of macros like CPU_AND,
CPU_OR, etc. and the GLIBC versions was in the number of arguments:
FreeBSD used a 2-address scheme (one source argument is also used as
the destination of the operation), while GLIBC uses a 3-adderess
scheme (2 source operands and a separately passed destination).

The GLIBC scheme provides a super-set of the functionality of the
FreeBSD macros, since it does not prevent passing the same variable
as source and destination arguments. In code that wanted to preserve
both source arguments, the FreeBSD macros required a temporary copy of
one of the source arguments.

This patch set allows to unconditionally provide functions and macros
expected by 3rd party software written for GLIBC based systems, but
breaks builds of externally maintained sources that use any of the
following macros: CPU_AND, CPU_ANDNOT, CPU_OR, CPU_XOR.

One contributed driver (contrib/ofed/libmlx5) has been patched to
support both the old and the new CPU_OR signatures. If this commit
is merged to -STABLE, the version test will have to be extended to
cover more ranges.

Ports that have added -D_WITH_CPU_SET_T to build on -CURRENT do
no longer require that option.

The FreeBSD version has been bumped to 1400046 to reflect this
incompatible change.

Reviewed by:	kib
MFC after:	2 weeks
Relnotes:	yes
Differential Revision:	https://reviews.freebsd.org/D33451
2022-06-22 10:28:42 +02:00
Stefan Eßer 6af6e29552 sys/_bitset.h: Fix fall-out from commit 5e04571cf3c
The changes to the bitset macros allowed sched.h to be included
into userland programs without name space pollution due to BIT_*
and BITSET_* macros.

The definition of a "struct bitset" had been overlooked. This name
space pollution caused the build of port print/miktex to fail.

This commit makes the definition of struct bitset depend on the
same condition as the visibility of the BIT_* and BITSET_* macros,
i.e. needs _KERNEL or _WANT_FREEBSD_BITSET to be defined before
including sys/_bitset.h.

It has been tested with "make universe" since a prior attempt to
fix the issue broke the PowerPC64 kernel build.

This commit shall be MFCed together with commit 5e04571cf3c.

Reported by:    arrowd
MFC after:      1 month
2022-06-22 10:15:26 +02:00
Stefan Eßer 3af17aef2b sys/_bitset.h: revert commit 74e014dbfab
It caused kernel build for PowerPC64 to fail.

A different patch is being tested with make universe to make sure it
works on all architectures.

MFC after:	1 month<N [day[s]|week[s]|month[s]].  Request a reminder email>
2022-06-22 10:15:26 +02:00
Stefan Eßer c78c56c06d sys/_bitset.h: Fix fall-out from commit 5e04571cf3c
There is a reference to malloc() in #define __BITSET_ALLOC. Even
though this macro is only defined but not used, it causes the lang/gcc
ports to fail. The gcc ports "poison" a number of functions including
malloc() and prevent their use (including in macro definitions).

This commit moved the declaration of __BITSET_ALLOC into the
conditional block that depends on _KERNEL or _WANT_FREEBSD_BITSET
being defined.

There is no use of __BITSET_ALLOC in the FreeBSD sources, and userland
programs that want to use BITSEC_ALLOC will define _WANT_FREEBSD_BITSET
anyway.

This patch has been tested by building lang/gcc11 and a successful
make buildworld.

This commit shall be MFCed together with commit 5e04571cf3c.

MFC after:	1 month
2022-06-22 10:15:26 +02:00
Konstantin Belousov 2f6651097e sys/_bitset.h: Fix fall-out from commit 5e04571cf3c
The changes to the bitset macros allowed sched.h to be included into
userland programs without name space pollution due to BIT_* and
BITSET_* macros.

The definition of a global variable "bitset" had been overlooked.
This name space pollution caused a compile failure in print/miktex.

This commit renames the bitset variable to __bitset with the same
mapping back to the bitset if _KERNEL or _WANT_FREEBSD_BITSET is
defined.

This fix has been suggested by kib. It has been tested to let the
build of the print/miktex port succeed and to not break buildworld.

This commit shall be MFCed together with commit 5e04571cf3c.

Reported by:	arrowd
MFC after:	1 month
2022-06-22 10:15:26 +02:00
Stefan Eßer a1071cb178 sys/bitset.h: reduce visibility of BIT_* macros
Add two underscore characters "__" to names of BIT_* and BITSET_*
macros to move them to the implementation name space and to prevent
a name space pollution due to BIT_* macros in 3rd party programs with
conflicting parameter signatures.

These prefixed macro names are used in kernel header files to define
macros in e.g. sched.h, sys/cpuset.h and sys/domainset.h.

If C programs are built with either -D_KERNEL (automatically passed
when building a kernel or kernel modules) or -D_WANT_FREENBSD_BITSET
(or this macros is defined in the source code before including the
bitset macros), then all macros are made visible with their previous
names, too. E.g., both __BIT_SET() and BIT_SET() are visible with
either of _KERNEL or _WANT_FREEBSD_BITSET defined.

The main reason for this change is that some 3rd party sources
including sched.h have been found to contain conflicting BIT_*
macros.

As a work-around, parts of shed.h have been made conditional and
depend on _WITH_CPU_SET_T being set when sched.h is included.
Ports that expect the full functionality provided by sched.h need
to be built with -D_WITH_CPU_SET_T. But this leads to conflicts if
BIT_* macros are defined in that program, too.

This patch set makes all of sched.h visible again without this
parameter being passed and without any name space pollution due
to BIT_* macros becoming visible when sched.h is included.

This patch set will be backported to the STABLE branches, but ports
will need to use -D_WITH_CPU_SET_T as long as there are supported
releases that do not contain these patches.

Reviewed by:	kib, markj
MFC after:	1 month
Relnotes:	yes
Differential Revision:	https://reviews.freebsd.org/D33235
2022-06-22 10:15:26 +02:00
Konstantin Belousov 4ac3ee88c7 sched.h: add CPU_EQUAL() for better compatibility with Linux
Reviewed by:	jhb
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D32901
2022-06-22 10:15:26 +02:00
Mark Johnston 6070714e0f cpuset(9): Add CPU_FOREACH_IS(SET|CLR) and modify consumers to use it
This implementation is faster and doesn't modify the cpuset, so it lets
us avoid some unnecessary copying as well.  No functional change
intended.

This is a re-application of commit
9068f6ea697b1b28ad1326a4c7a9ba86f08b985e.

Reviewed by:	cem, kib, jhb
MFC after:	2 weeks
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D32029
2022-06-22 10:15:26 +02:00
Mark Johnston 27e8401c46 bitset: Reimplement BIT_FOREACH_IS(SET|CLR)
Eliminate the nested loops and re-implement following a suggestion from
rlibby.

Add some simple regression tests.

Reviewed by:	rlibby, kib
MFC after:	2 weeks
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D32472
2022-06-22 10:15:26 +02:00
Mark Johnston 96ddb4055e Revert "cpuset(9): Add CPU_FOREACH_IS(SET|CLR) and modify consumers to use it"
This reverts commit 9068f6ea697b1b28ad1326a4c7a9ba86f08b985e.

The underlying macro needs to be reworked to avoid problems with control
flow statements.

Reported by:	rlibby
2022-06-22 10:15:26 +02:00
Mark Johnston 112245b78f cpuset(9): Add CPU_FOREACH_IS(SET|CLR) and modify consumers to use it
This implementation is faster and doesn't modify the cpuset, so it lets
us avoid some unnecessary copying as well.  No functional change
intended.

Reviewed by:	cem, kib, jhb
MFC after:	2 weeks
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D32029
2022-06-22 10:15:26 +02:00
Mark Johnston 37a3e59636 bitset(9): Introduce BIT_FOREACH_ISSET and BIT_FOREACH_ISCLR
These allow one to non-destructively iterate over the set or clear bits
in a bitset.  The motivation is that we have several code fragments
which iterate over a CPU set like this:

while ((cpu = CPU_FFS(&cpus)) != 0) {
	cpu--;
	CPU_CLR(cpu, &cpus);
	<do something>;
}

This is slow since CPU_FFS begins the search at the beginning of the
bitset each time.  On amd64 and arm64, CPU sets have size 256, so there
are four limbs in the bitset and we do a lot of unnecessary scanning.

A second problem is that this is destructive, so code which needs to
preserve the original set has to make a copy.  In particular, we have
quite a few functions which take a cpuset_t parameter by value, meaning
that each call has to copy the 32 byte cpuset_t.

The new macros address both problems.

Reviewed by:	cem, kib
MFC after:	2 weeks
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D32028
2022-06-22 10:15:26 +02:00
Patrick Kelsey 7c03cdf47e iflib: Improve mapping of TX/RX queues to CPUs
iflib now supports mapping each (TX,RX) queue pair to the same CPU
(default), to separate CPUs, or to a pair of physical and logical CPUs
that share the same L2 cache.  The mapping mechanism supports unequal
numbers of TX and RX queues, with the excess queues always being
mapped to consecutive physical CPUs.  When the platform cannot
distinguish between physical and logical CPUs, all are treated as
physical CPUs.  See the comment on get_cpuid_for_queue() for the
entire matrix.

The following device-specific tunables influence the mapping process:
dev.<device>.<unit>.iflib.core_offset       (existing)
dev.<device>.<unit>.iflib.separate_txrx     (existing)
dev.<device>.<unit>.iflib.use_logical_cores (new)

The following new, read-only sysctls provide visibility of the mapping
results:
dev.<device>.<unit>.iflib.{t,r}xq<n>.cpu

When an iflib driver allocates TX softirqs without providing reference
RX IRQs, iflib now binds those TX softirqs to CPUs using the above
mapping mechanism (that is, treats them as if they were TX IRQs).
Previously, such bindings were left up to the grouptaskqueue code and
thus fell outside of the iflib CPU mapping strategy.

Reviewed by:	kbowling
Tested by:	olivier, pkelsey
MFC after:	3 weeks
Differential Revision:	https://reviews.freebsd.org/D24094
2022-06-22 10:15:26 +02:00
Ryan Libby fb0a5865e4 bitset: implement BIT_TEST_CLR_ATOMIC & BIT_TEST_SET_ATOMIC
That is, provide wrappers around the atomic_testandclear and
atomic_testandset primitives.

Submitted by:	jeff
Reviewed by:	cem, kib, markj
Sponsored by:	Dell EMC Isilon
Differential Revision:	https://reviews.freebsd.org/D22702
2022-06-22 10:15:26 +02:00
D Scott Phillips 9d50b44689 bitset: expand bit index type to `long`
An upcoming patch to use the bitset macros for tracking vm page
dump information could conceivably need more than INT_MAX bits.
Expand the bit type to long so that the extra range is available
on 64-bit platforms where it would most likely be needed.

CPUSET_COUNT and DOMAINSET_COUNT are also modified to remain of
type `int`.

Reviewed by:	kib, markj
Approved by:	scottl (implicit)
MFC after:	1 week
Sponsored by:	Ampere Computing, Inc.
Differential Revision:	https://reviews.freebsd.org/D26190
2022-06-22 10:15:26 +02:00
D Scott Phillips 4c5b7bec97 bitset: add BIT_FFS_AT() for finding the first bit set greater than a start bit
Reviewed by:	kib
Approved by:	scottl (implicit)
MFC after:	1 week
Sponsored by:	Ampere Computing, Inc.
Differential Revision:	https://reviews.freebsd.org/D26128
2022-06-22 10:15:26 +02:00
Konstantin Belousov 5e7a2b174a Fix undefined behavior: left-shifting into the sign bit.
Reviewed by:	dim, markj
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D22898
2022-06-22 10:15:26 +02:00
Ryan Libby de1380c36b bitset: rename confusing macro NAND to ANDNOT
s/BIT_NAND/BIT_ANDNOT/, and for CPU and DOMAINSET too.  The actual
implementation is "and not" (or "but not"), i.e. A but not B.
Fortunately this does appear to be what all existing callers want.

Don't supply a NAND (not (A and B)) operation at this time.

Discussed with:	jeff
Reviewed by:	cem
Sponsored by:	Dell EMC Isilon
Differential Revision:	https://reviews.freebsd.org/D22791
2022-06-22 10:15:26 +02:00