summaryrefslogtreecommitdiffstats
path: root/Documentation/networking
diff options
context:
space:
mode:
Diffstat (limited to 'Documentation/networking')
-rw-r--r--Documentation/networking/bonding.rst4
-rw-r--r--Documentation/networking/can.rst33
-rw-r--r--Documentation/networking/device_drivers/ethernet/freescale/dpaa2/mac-phy-support.rst9
-rw-r--r--Documentation/networking/device_drivers/ethernet/marvell/octeon_ep.rst1
-rw-r--r--Documentation/networking/device_drivers/ethernet/mellanox/mlx5.rst124
-rw-r--r--Documentation/networking/device_drivers/ethernet/netronome/nfp.rst165
-rw-r--r--Documentation/networking/devlink/devlink-info.rst5
-rw-r--r--Documentation/networking/devlink/devlink-port.rst168
-rw-r--r--Documentation/networking/devlink/devlink-region.rst13
-rw-r--r--Documentation/networking/devlink/devlink-trap.rst13
-rw-r--r--Documentation/networking/devlink/etas_es58x.rst36
-rw-r--r--Documentation/networking/devlink/ice.rst128
-rw-r--r--Documentation/networking/ethtool-netlink.rst32
-rw-r--r--Documentation/networking/index.rst1
-rw-r--r--Documentation/networking/ip-sysctl.rst111
-rw-r--r--Documentation/networking/ipvs-sysctl.rst24
-rw-r--r--Documentation/networking/tc-queue-filters.rst37
-rw-r--r--Documentation/networking/timestamping.rst32
-rw-r--r--Documentation/networking/xfrm_device.rst62
19 files changed, 882 insertions, 116 deletions
diff --git a/Documentation/networking/bonding.rst b/Documentation/networking/bonding.rst
index 96cd7a26f3d9..adc4bf4f3c50 100644
--- a/Documentation/networking/bonding.rst
+++ b/Documentation/networking/bonding.rst
@@ -566,7 +566,8 @@ miimon
link monitoring. A value of 100 is a good starting point.
The use_carrier option, below, affects how the link state is
determined. See the High Availability section for additional
- information. The default value is 0.
+ information. The default value is 100 if arp_interval is not
+ set.
min_links
@@ -956,6 +957,7 @@ xmit_hash_policy
hash = hash XOR source IP XOR destination IP
hash = hash XOR (hash RSHIFT 16)
hash = hash XOR (hash RSHIFT 8)
+ hash = hash RSHIFT 1
And then hash is reduced modulo slave count.
If the protocol is IPv6 then the source and destination
diff --git a/Documentation/networking/can.rst b/Documentation/networking/can.rst
index ebc822e605f5..90121deef217 100644
--- a/Documentation/networking/can.rst
+++ b/Documentation/networking/can.rst
@@ -1148,6 +1148,39 @@ tuning on deep embedded systems'. The author is running a MPC603e
load without any problems ...
+Switchable Termination Resistors
+--------------------------------
+
+CAN bus requires a specific impedance across the differential pair,
+typically provided by two 120Ohm resistors on the farthest nodes of
+the bus. Some CAN controllers support activating / deactivating a
+termination resistor(s) to provide the correct impedance.
+
+Query the available resistances::
+
+ $ ip -details link show can0
+ ...
+ termination 120 [ 0, 120 ]
+
+Activate the terminating resistor::
+
+ $ ip link set dev can0 type can termination 120
+
+Deactivate the terminating resistor::
+
+ $ ip link set dev can0 type can termination 0
+
+To enable termination resistor support to a can-controller, either
+implement in the controller's struct can-priv::
+
+ termination_const
+ termination_const_cnt
+ do_set_termination
+
+or add gpio control with the device tree entries from
+Documentation/devicetree/bindings/net/can/can-controller.yaml
+
+
The Virtual CAN Driver (vcan)
-----------------------------
diff --git a/Documentation/networking/device_drivers/ethernet/freescale/dpaa2/mac-phy-support.rst b/Documentation/networking/device_drivers/ethernet/freescale/dpaa2/mac-phy-support.rst
index 51e6624fb774..1d2f55feca24 100644
--- a/Documentation/networking/device_drivers/ethernet/freescale/dpaa2/mac-phy-support.rst
+++ b/Documentation/networking/device_drivers/ethernet/freescale/dpaa2/mac-phy-support.rst
@@ -181,10 +181,13 @@ when necessary using the below listed API::
- int dpaa2_mac_connect(struct dpaa2_mac *mac);
- void dpaa2_mac_disconnect(struct dpaa2_mac *mac);
-A phylink integration is necessary only when the partner DPMAC is not of TYPE_FIXED.
-One can check for this condition using the below API::
+A phylink integration is necessary only when the partner DPMAC is not of
+``TYPE_FIXED``. This means it is either of ``TYPE_PHY``, or of
+``TYPE_BACKPLANE`` (the difference being the two that in the ``TYPE_BACKPLANE``
+mode, the MC firmware does not access the PCS registers). One can check for
+this condition using the following helper::
- - bool dpaa2_mac_is_type_fixed(struct fsl_mc_device *dpmac_dev,struct fsl_mc_io *mc_io);
+ - static inline bool dpaa2_mac_is_type_phy(struct dpaa2_mac *mac);
Before connection to a MAC, the caller must allocate and populate the
dpaa2_mac structure with the associated net_device, a pointer to the MC portal
diff --git a/Documentation/networking/device_drivers/ethernet/marvell/octeon_ep.rst b/Documentation/networking/device_drivers/ethernet/marvell/octeon_ep.rst
index bc562c49011b..cad96c8d1f97 100644
--- a/Documentation/networking/device_drivers/ethernet/marvell/octeon_ep.rst
+++ b/Documentation/networking/device_drivers/ethernet/marvell/octeon_ep.rst
@@ -23,6 +23,7 @@ Supported Devices
=================
Currently, this driver support following devices:
* Network controller: Cavium, Inc. Device b200
+ * Network controller: Cavium, Inc. Device b400
Interface Control
=================
diff --git a/Documentation/networking/device_drivers/ethernet/mellanox/mlx5.rst b/Documentation/networking/device_drivers/ethernet/mellanox/mlx5.rst
index 5edf50d7dbd5..6969652f593c 100644
--- a/Documentation/networking/device_drivers/ethernet/mellanox/mlx5.rst
+++ b/Documentation/networking/device_drivers/ethernet/mellanox/mlx5.rst
@@ -25,7 +25,7 @@ Enabling the driver and kconfig options
| at build time via kernel Kconfig flags.
| Basic features, ethernet net device rx/tx offloads and XDP, are available with the most basic flags
| CONFIG_MLX5_CORE=y/m and CONFIG_MLX5_CORE_EN=y.
-| For the list of advanced features please see below.
+| For the list of advanced features, please see below.
**CONFIG_MLX5_CORE=(y/m/n)** (module mlx5_core.ko)
@@ -89,11 +89,11 @@ Enabling the driver and kconfig options
**CONFIG_MLX5_EN_IPSEC=(y/n)**
-| Enables `IPSec XFRM cryptography-offload accelaration <http://www.mellanox.com/related-docs/prod_software/Mellanox_Innova_IPsec_Ethernet_Adapter_Card_User_Manual.pdf>`_.
+| Enables `IPSec XFRM cryptography-offload acceleration <http://www.mellanox.com/related-docs/prod_software/Mellanox_Innova_IPsec_Ethernet_Adapter_Card_User_Manual.pdf>`_.
**CONFIG_MLX5_EN_TLS=(y/n)**
-| TLS cryptography-offload accelaration.
+| TLS cryptography-offload acceleration.
**CONFIG_MLX5_INFINIBAND=(y/n/m)** (module mlx5_ib.ko)
@@ -139,14 +139,14 @@ flow_steering_mode: Device flow steering mode
The flow steering mode parameter controls the flow steering mode of the driver.
Two modes are supported:
1. 'dmfs' - Device managed flow steering.
-2. 'smfs - Software/Driver managed flow steering.
+2. 'smfs' - Software/Driver managed flow steering.
In DMFS mode, the HW steering entities are created and managed through the
Firmware.
In SMFS mode, the HW steering entities are created and managed though by
-the driver directly into Hardware without firmware intervention.
+the driver directly into hardware without firmware intervention.
-SMFS mode is faster and provides better rule inserstion rate compared to default DMFS mode.
+SMFS mode is faster and provides better rule insertion rate compared to default DMFS mode.
User command examples:
@@ -165,9 +165,9 @@ User command examples:
enable_roce: RoCE enablement state
----------------------------------
RoCE enablement state controls driver support for RoCE traffic.
-When RoCE is disabled, there is no gid table, only raw ethernet QPs are supported and traffic on the well known UDP RoCE port is handled as raw ethernet traffic.
+When RoCE is disabled, there is no gid table, only raw ethernet QPs are supported and traffic on the well-known UDP RoCE port is handled as raw ethernet traffic.
-To change RoCE enablement state a user must change the driverinit cmode value and run devlink reload.
+To change RoCE enablement state, a user must change the driverinit cmode value and run devlink reload.
User command examples:
@@ -186,7 +186,7 @@ User command examples:
esw_port_metadata: Eswitch port metadata state
----------------------------------------------
-When applicable, disabling Eswitch metadata can increase packet rate
+When applicable, disabling eswitch metadata can increase packet rate
up to 20% depending on the use case and packet sizes.
Eswitch port metadata state controls whether to internally tag packets with
@@ -253,26 +253,26 @@ mlx5 subfunction
================
mlx5 supports subfunction management using devlink port (see :ref:`Documentation/networking/devlink/devlink-port.rst <devlink_port>`) interface.
-A Subfunction has its own function capabilities and its own resources. This
+A subfunction has its own function capabilities and its own resources. This
means a subfunction has its own dedicated queues (txq, rxq, cq, eq). These
queues are neither shared nor stolen from the parent PCI function.
-When a subfunction is RDMA capable, it has its own QP1, GID table and rdma
+When a subfunction is RDMA capable, it has its own QP1, GID table, and RDMA
resources neither shared nor stolen from the parent PCI function.
A subfunction has a dedicated window in PCI BAR space that is not shared
-with ther other subfunctions or the parent PCI function. This ensures that all
-devices (netdev, rdma, vdpa etc.) of the subfunction accesses only assigned
+with the other subfunctions or the parent PCI function. This ensures that all
+devices (netdev, rdma, vdpa, etc.) of the subfunction accesses only assigned
PCI BAR space.
-A Subfunction supports eswitch representation through which it supports tc
+A subfunction supports eswitch representation through which it supports tc
offloads. The user configures eswitch to send/receive packets from/to
the subfunction port.
Subfunctions share PCI level resources such as PCI MSI-X IRQs with
other subfunctions and/or with its parent PCI function.
-Example mlx5 software, system and device view::
+Example mlx5 software, system, and device view::
_______
| admin |
@@ -310,7 +310,7 @@ Example mlx5 software, system and device view::
| (device add/del)
_____|____ ____|________
| | | subfunction |
- | PCI NIC |---- activate/deactive events---->| host driver |
+ | PCI NIC |--- activate/deactivate events--->| host driver |
|__________| | (mlx5_core) |
|_____________|
@@ -320,7 +320,7 @@ Subfunction is created using devlink port interface.
$ devlink dev eswitch set pci/0000:06:00.0 mode switchdev
-- Add a devlink port of subfunction flaovur::
+- Add a devlink port of subfunction flavour::
$ devlink port add pci/0000:06:00.0 flavour pcisf pfnum 0 sfnum 88
pci/0000:06:00.0/32768: type eth netdev eth6 flavour pcisf controller 0 pfnum 0 sfnum 88 external false splittable false
@@ -351,46 +351,30 @@ driver.
MAC address setup
-----------------
-mlx5 driver provides mechanism to setup the MAC address of the PCI VF/SF.
+mlx5 driver support devlink port function attr mechanism to setup MAC
+address. (refer to Documentation/networking/devlink/devlink-port.rst)
-The configured MAC address of the PCI VF/SF will be used by netdevice and rdma
-device created for the PCI VF/SF.
+RoCE capability setup
+---------------------
+Not all mlx5 PCI devices/SFs require RoCE capability.
-- Get the MAC address of the VF identified by its unique devlink port index::
+When RoCE capability is disabled, it saves 1 Mbytes worth of system memory per
+PCI devices/SF.
- $ devlink port show pci/0000:06:00.0/2
- pci/0000:06:00.0/2: type eth netdev enp6s0pf0vf1 flavour pcivf pfnum 0 vfnum 1
- function:
- hw_addr 00:00:00:00:00:00
-
-- Set the MAC address of the VF identified by its unique devlink port index::
-
- $ devlink port function set pci/0000:06:00.0/2 hw_addr 00:11:22:33:44:55
-
- $ devlink port show pci/0000:06:00.0/2
- pci/0000:06:00.0/2: type eth netdev enp6s0pf0vf1 flavour pcivf pfnum 0 vfnum 1
- function:
- hw_addr 00:11:22:33:44:55
+mlx5 driver support devlink port function attr mechanism to setup RoCE
+capability. (refer to Documentation/networking/devlink/devlink-port.rst)
-- Get the MAC address of the SF identified by its unique devlink port index::
+migratable capability setup
+---------------------------
+User who wants mlx5 PCI VFs to be able to perform live migration need to
+explicitly enable the VF migratable capability.
- $ devlink port show pci/0000:06:00.0/32768
- pci/0000:06:00.0/32768: type eth netdev enp6s0pf0sf88 flavour pcisf pfnum 0 sfnum 88
- function:
- hw_addr 00:00:00:00:00:00
-
-- Set the MAC address of the VF identified by its unique devlink port index::
-
- $ devlink port function set pci/0000:06:00.0/32768 hw_addr 00:00:00:00:88:88
-
- $ devlink port show pci/0000:06:00.0/32768
- pci/0000:06:00.0/32768: type eth netdev enp6s0pf0sf88 flavour pcivf pfnum 0 sfnum 88
- function:
- hw_addr 00:00:00:00:88:88
+mlx5 driver support devlink port function attr mechanism to setup migratable
+capability. (refer to Documentation/networking/devlink/devlink-port.rst)
SF state setup
--------------
-To use the SF, the user must active the SF using the SF function state
+To use the SF, the user must activate the SF using the SF function state
attribute.
- Get the state of the SF identified by its unique devlink port index::
@@ -447,7 +431,7 @@ for it.
Additionally, the SF port also gets the event when the driver attaches to the
auxiliary device of the subfunction. This results in changing the operational
-state of the function. This provides visiblity to the user to decide when is it
+state of the function. This provides visibility to the user to decide when is it
safe to delete the SF port for graceful termination of the subfunction.
- Show the SF port operational state::
@@ -464,14 +448,14 @@ tx reporter
-----------
The tx reporter is responsible for reporting and recovering of the following two error scenarios:
-- TX timeout
+- tx timeout
Report on kernel tx timeout detection.
Recover by searching lost interrupts.
-- TX error completion
+- tx error completion
Report on error tx completion.
- Recover by flushing the TX queue and reset it.
+ Recover by flushing the tx queue and reset it.
-TX reporter also support on demand diagnose callback, on which it provides
+tx reporter also support on demand diagnose callback, on which it provides
real time information of its send queues status.
User commands examples:
@@ -491,32 +475,32 @@ rx reporter
-----------
The rx reporter is responsible for reporting and recovering of the following two error scenarios:
-- RX queues initialization (population) timeout
- RX queues descriptors population on ring initialization is done in
- napi context via triggering an irq, in case of a failure to get
- the minimum amount of descriptors, a timeout would occur and it
- could be recoverable by polling the EQ (Event Queue).
-- RX completions with errors (reported by HW on interrupt context)
+- rx queues' initialization (population) timeout
+ Population of rx queues' descriptors on ring initialization is done
+ in napi context via triggering an irq. In case of a failure to get
+ the minimum amount of descriptors, a timeout would occur, and
+ descriptors could be recovered by polling the EQ (Event Queue).
+- rx completions with errors (reported by HW on interrupt context)
Report on rx completion error.
Recover (if needed) by flushing the related queue and reset it.
-RX reporter also supports on demand diagnose callback, on which it
-provides real time information of its receive queues status.
+rx reporter also supports on demand diagnose callback, on which it
+provides real time information of its receive queues' status.
-- Diagnose rx queues status, and corresponding completion queue::
+- Diagnose rx queues' status and corresponding completion queue::
$ devlink health diagnose pci/0000:82:00.0 reporter rx
-NOTE: This command has valid output only when interface is up, otherwise the command has empty output.
+NOTE: This command has valid output only when interface is up. Otherwise, the command has empty output.
- Show number of rx errors indicated, number of recover flows ended successfully,
- is autorecover enabled and graceful period from last recover::
+ is autorecover enabled, and graceful period from last recover::
$ devlink health show pci/0000:82:00.0 reporter rx
fw reporter
-----------
-The fw reporter implements diagnose and dump callbacks.
+The fw reporter implements `diagnose` and `dump` callbacks.
It follows symptoms of fw error such as fw syndrome by triggering
fw core dump and storing it into the dump buffer.
The fw reporter diagnose command can be triggered any time by the user to check
@@ -537,7 +521,7 @@ running it on other PF or any VF will return "Operation not permitted".
fw fatal reporter
-----------------
-The fw fatal reporter implements dump and recover callbacks.
+The fw fatal reporter implements `dump` and `recover` callbacks.
It follows fatal errors indications by CR-space dump and recover flow.
The CR-space dump uses vsc interface which is valid even if the FW command
interface is not functional, which is the case in most FW fatal errors.
@@ -552,7 +536,7 @@ User commands examples:
$ devlink health recover pci/0000:82:00.0 reporter fw_fatal
-- Read FW CR-space dump if already strored or trigger new one::
+- Read FW CR-space dump if already stored or trigger new one::
$ devlink health dump show pci/0000:82:00.1 reporter fw_fatal
@@ -561,10 +545,10 @@ NOTE: This command can run only on PF.
mlx5 tracepoints
================
-mlx5 driver provides internal trace points for tracking and debugging using
+mlx5 driver provides internal tracepoints for tracking and debugging using
kernel tracepoints interfaces (refer to Documentation/trace/ftrace.rst).
-For the list of support mlx5 events check /sys/kernel/debug/tracing/events/mlx5/
+For the list of support mlx5 events, check `/sys/kernel/debug/tracing/events/mlx5/`.
tc and eswitch offloads tracepoints:
diff --git a/Documentation/networking/device_drivers/ethernet/netronome/nfp.rst b/Documentation/networking/device_drivers/ethernet/netronome/nfp.rst
index ada611fb427c..650b57742d51 100644
--- a/Documentation/networking/device_drivers/ethernet/netronome/nfp.rst
+++ b/Documentation/networking/device_drivers/ethernet/netronome/nfp.rst
@@ -1,50 +1,57 @@
.. SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause)
+.. include:: <isonum.txt>
-=============================================
-Netronome Flow Processor (NFP) Kernel Drivers
-=============================================
+===========================================
+Network Flow Processor (NFP) Kernel Drivers
+===========================================
-Copyright (c) 2019, Netronome Systems, Inc.
+:Copyright: |copy| 2019, Netronome Systems, Inc.
+:Copyright: |copy| 2022, Corigine, Inc.
Contents
========
- `Overview`_
- `Acquiring Firmware`_
+- `Devlink Info`_
+- `Configure Device`_
+- `Statistics`_
Overview
========
-This driver supports Netronome's line of Flow Processor devices,
-including the NFP4000, NFP5000, and NFP6000 models, which are also
-incorporated in the company's family of Agilio SmartNICs. The SR-IOV
-physical and virtual functions for these devices are supported by
-the driver.
+This driver supports Netronome and Corigine's line of Network Flow Processor
+devices, including the NFP3800, NFP4000, NFP5000, and NFP6000 models, which
+are also incorporated in the companies' family of Agilio SmartNICs. The SR-IOV
+physical and virtual functions for these devices are supported by the driver.
Acquiring Firmware
==================
-The NFP4000 and NFP6000 devices require application specific firmware
-to function. Application firmware can be located either on the host file system
+The NFP3800, NFP4000 and NFP6000 devices require application specific firmware
+to function. Application firmware can be located either on the host file system
or in the device flash (if supported by management firmware).
Firmware files on the host filesystem contain card type (`AMDA-*` string), media
-config etc. They should be placed in `/lib/firmware/netronome` directory to
+config etc. They should be placed in `/lib/firmware/netronome` directory to
load firmware from the host file system.
Firmware for basic NIC operation is available in the upstream
`linux-firmware.git` repository.
+A more comprehensive list of firmware can be downloaded from the
+`Corigine Support site <https://www.corigine.com/DPUDownload.html>`_.
+
Firmware in NVRAM
-----------------
Recent versions of management firmware supports loading application
-firmware from flash when the host driver gets probed. The firmware loading
+firmware from flash when the host driver gets probed. The firmware loading
policy configuration may be used to configure this feature appropriately.
Devlink or ethtool can be used to update the application firmware on the device
flash by providing the appropriate `nic_AMDA*.nffw` file to the respective
-command. Users need to take care to write the correct firmware image for the
+command. Users need to take care to write the correct firmware image for the
card and media configuration to flash.
Available storage space in flash depends on the card being used.
@@ -79,9 +86,9 @@ You may need to use hard instead of symbolic links on distributions
which use old `mkinitrd` command instead of `dracut` (e.g. Ubuntu).
After changing firmware files you may need to regenerate the initramfs
-image. Initramfs contains drivers and firmware files your system may
-need to boot. Refer to the documentation of your distribution to find
-out how to update initramfs. Good indication of stale initramfs
+image. Initramfs contains drivers and firmware files your system may
+need to boot. Refer to the documentation of your distribution to find
+out how to update initramfs. Good indication of stale initramfs
is system loading wrong driver or firmware on boot, but when driver is
later reloaded manually everything works correctly.
@@ -89,9 +96,9 @@ Selecting firmware per device
-----------------------------
Most commonly all cards on the system use the same type of firmware.
-If you want to load specific firmware image for a specific card, you
-can use either the PCI bus address or serial number. Driver will print
-which files it's looking for when it recognizes a NFP device::
+If you want to load a specific firmware image for a specific card, you
+can use either the PCI bus address or serial number. The driver will
+print which files it's looking for when it recognizes a NFP device::
nfp: Looking for firmware file in order of priority:
nfp: netronome/serial-00-12-34-aa-bb-cc-10-ff.nffw: not found
@@ -106,6 +113,15 @@ Note that `serial-*` and `pci-*` files are **not** automatically included
in initramfs, you will have to refer to documentation of appropriate tools
to find out how to include them.
+Running firmware version
+------------------------
+
+The version of the loaded firmware for a particular <netdev> interface,
+(e.g. enp4s0), or an interface's port <netdev port> (e.g. enp4s0np0) can
+be displayed with the ethtool command::
+
+ $ ethtool -i <netdev>
+
Firmware loading policy
-----------------------
@@ -132,6 +148,115 @@ abi_drv_load_ifc
Defines a list of PF devices allowed to load FW on the device.
This variable is not currently user configurable.
+Devlink Info
+============
+
+The devlink info command displays the running and stored firmware versions
+on the device, serial number and board information.
+
+Devlink info command example (replace PCI address)::
+
+ $ devlink dev info pci/0000:03:00.0
+ pci/0000:03:00.0:
+ driver nfp
+ serial_number CSAAMDA2001-1003000111
+ versions:
+ fixed:
+ board.id AMDA2001-1003
+ board.rev 01
+ board.manufacture CSA
+ board.model mozart
+ running:
+ fw.mgmt 22.10.0-rc3
+ fw.cpld 0x1000003
+ fw.app nic-22.09.0
+ chip.init AMDA-2001-1003 1003000111
+ stored:
+ fw.bundle_id bspbundle_1003000111
+ fw.mgmt 22.10.0-rc3
+ fw.cpld 0x0
+ chip.init AMDA-2001-1003 1003000111
+
+Configure Device
+================
+
+This section explains how to use Agilio SmartNICs running basic NIC firmware.
+
+Configure interface link-speed
+------------------------------
+The following steps explains how to change between 10G mode and 25G mode on
+Agilio CX 2x25GbE cards. The changing of port speed must be done in order,
+port 0 (p0) must be set to 10G before port 1 (p1) may be set to 10G.
+
+Down the respective interface(s)::
+
+ $ ip link set dev <netdev port 0> down
+ $ ip link set dev <netdev port 1> down
+
+Set interface link-speed to 10G::
+
+ $ ethtool -s <netdev port 0> speed 10000
+ $ ethtool -s <netdev port 1> speed 10000
+
+Set interface link-speed to 25G::
+
+ $ ethtool -s <netdev port 0> speed 25000
+ $ ethtool -s <netdev port 1> speed 25000
+
+Reload driver for changes to take effect::
+
+ $ rmmod nfp; modprobe nfp
+
+Configure interface Maximum Transmission Unit (MTU)
+---------------------------------------------------
+
+The MTU of interfaces can temporarily be set using the iproute2, ip link or
+ifconfig tools. Note that this change will not persist. Setting this via
+Network Manager, or another appropriate OS configuration tool, is
+recommended as changes to the MTU using Network Manager can be made to
+persist.
+
+Set interface MTU to 9000 bytes::
+
+ $ ip link set dev <netdev port> mtu 9000
+
+It is the responsibility of the user or the orchestration layer to set
+appropriate MTU values when handling jumbo frames or utilizing tunnels. For
+example, if packets sent from a VM are to be encapsulated on the card and
+egress a physical port, then the MTU of the VF should be set to lower than
+that of the physical port to account for the extra bytes added by the
+additional header. If a setup is expected to see fallback traffic between
+the SmartNIC and the kernel then the user should also ensure that the PF MTU
+is appropriately set to avoid unexpected drops on this path.
+
+Configure Forward Error Correction (FEC) modes
+----------------------------------------------
+
+Agilio SmartNICs support FEC mode configuration, e.g. Auto, Firecode Base-R,
+ReedSolomon and Off modes. Each physical port's FEC mode can be set
+independently using ethtool. The supported FEC modes for an interface can
+be viewed using::
+
+ $ ethtool <netdev>
+
+The currently configured FEC mode can be viewed using::
+
+ $ ethtool --show-fec <netdev>
+
+To force the FEC mode for a particular port, auto-negotiation must be disabled
+(see the `Auto-negotiation`_ section). An example of how to set the FEC mode
+to Reed-Solomon is::
+
+ $ ethtool --set-fec <netdev> encoding rs
+
+Auto-negotiation
+----------------
+
+To change auto-negotiation settings, the link must first be put down. After the
+link is down, auto-negotiation can be enabled or disabled using::
+
+ ethtool -s <netdev> autoneg <on|off>
+
Statistics
==========
diff --git a/Documentation/networking/devlink/devlink-info.rst b/Documentation/networking/devlink/devlink-info.rst
index 7572bf6de5c1..1242b0e6826b 100644
--- a/Documentation/networking/devlink/devlink-info.rst
+++ b/Documentation/networking/devlink/devlink-info.rst
@@ -198,6 +198,11 @@ fw.bundle_id
Unique identifier of the entire firmware bundle.
+fw.bootloader
+-------------
+
+Version of the bootloader.
+
Future work
===========
diff --git a/Documentation/networking/devlink/devlink-port.rst b/Documentation/networking/devlink/devlink-port.rst
index 7627b1da01f2..3da590953ce8 100644
--- a/Documentation/networking/devlink/devlink-port.rst
+++ b/Documentation/networking/devlink/devlink-port.rst
@@ -110,7 +110,7 @@ devlink ports for both the controllers.
Function configuration
======================
-A user can configure the function attribute before enumerating the PCI
+Users can configure one or more function attributes before enumerating the PCI
function. Usually it means, user should configure function attribute
before a bus specific device for the function is created. However, when
SRIOV is enabled, virtual function devices are created on the PCI bus.
@@ -119,9 +119,127 @@ function device to the driver. For subfunctions, this means user should
configure port function attribute before activating the port function.
A user may set the hardware address of the function using
-'devlink port function set hw_addr' command. For Ethernet port function
+`devlink port function set hw_addr` command. For Ethernet port function
this means a MAC address.
+Users may also set the RoCE capability of the function using
+`devlink port function set roce` command.
+
+Users may also set the function as migratable using
+'devlink port function set migratable' command.
+
+Function attributes
+===================
+
+MAC address setup
+-----------------
+The configured MAC address of the PCI VF/SF will be used by netdevice and rdma
+device created for the PCI VF/SF.
+
+- Get the MAC address of the VF identified by its unique devlink port index::
+
+ $ devlink port show pci/0000:06:00.0/2
+ pci/0000:06:00.0/2: type eth netdev enp6s0pf0vf1 flavour pcivf pfnum 0 vfnum 1
+ function:
+ hw_addr 00:00:00:00:00:00
+
+- Set the MAC address of the VF identified by its unique devlink port index::
+
+ $ devlink port function set pci/0000:06:00.0/2 hw_addr 00:11:22:33:44:55
+
+ $ devlink port show pci/0000:06:00.0/2
+ pci/0000:06:00.0/2: type eth netdev enp6s0pf0vf1 flavour pcivf pfnum 0 vfnum 1
+ function:
+ hw_addr 00:11:22:33:44:55
+
+- Get the MAC address of the SF identified by its unique devlink port index::
+
+ $ devlink port show pci/0000:06:00.0/32768
+ pci/0000:06:00.0/32768: type eth netdev enp6s0pf0sf88 flavour pcisf pfnum 0 sfnum 88
+ function:
+ hw_addr 00:00:00:00:00:00
+
+- Set the MAC address of the SF identified by its unique devlink port index::
+
+ $ devlink port function set pci/0000:06:00.0/32768 hw_addr 00:00:00:00:88:88
+
+ $ devlink port show pci/0000:06:00.0/32768
+ pci/0000:06:00.0/32768: type eth netdev enp6s0pf0sf88 flavour pcisf pfnum 0 sfnum 88
+ function:
+ hw_addr 00:00:00:00:88:88
+
+RoCE capability setup
+---------------------
+Not all PCI VFs/SFs require RoCE capability.
+
+When RoCE capability is disabled, it saves system memory per PCI VF/SF.
+
+When user disables RoCE capability for a VF/SF, user application cannot send or
+receive any RoCE packets through this VF/SF and RoCE GID table for this PCI
+will be empty.
+
+When RoCE capability is disabled in the device using port function attribute,
+VF/SF driver cannot override it.
+
+- Get RoCE capability of the VF device::
+
+ $ devlink port show pci/0000:06:00.0/2
+ pci/0000:06:00.0/2: type eth netdev enp6s0pf0vf1 flavour pcivf pfnum 0 vfnum 1
+ function:
+ hw_addr 00:00:00:00:00:00 roce enable
+
+- Set RoCE capability of the VF device::
+
+ $ devlink port function set pci/0000:06:00.0/2 roce disable
+
+ $ devlink port show pci/0000:06:00.0/2
+ pci/0000:06:00.0/2: type eth netdev enp6s0pf0vf1 flavour pcivf pfnum 0 vfnum 1
+ function:
+ hw_addr 00:00:00:00:00:00 roce disable
+
+migratable capability setup
+---------------------------
+Live migration is the process of transferring a live virtual machine
+from one physical host to another without disrupting its normal
+operation.
+
+User who want PCI VFs to be able to perform live migration need to
+explicitly enable the VF migratable capability.
+
+When user enables migratable capability for a VF, and the HV binds the VF to VFIO driver
+with migration support, the user can migrate the VM with this VF from one HV to a
+different one.
+
+However, when migratable capability is enable, device will disable features which cannot
+be migrated. Thus migratable cap can impose limitations on a VF so let the user decide.
+
+Example of LM with migratable function configuration:
+- Get migratable capability of the VF device::
+
+ $ devlink port show pci/0000:06:00.0/2
+ pci/0000:06:00.0/2: type eth netdev enp6s0pf0vf1 flavour pcivf pfnum 0 vfnum 1
+ function:
+ hw_addr 00:00:00:00:00:00 migratable disable
+
+- Set migratable capability of the VF device::
+
+ $ devlink port function set pci/0000:06:00.0/2 migratable enable
+
+ $ devlink port show pci/0000:06:00.0/2
+ pci/0000:06:00.0/2: type eth netdev enp6s0pf0vf1 flavour pcivf pfnum 0 vfnum 1
+ function:
+ hw_addr 00:00:00:00:00:00 migratable enable
+
+- Bind VF to VFIO driver with migration support::
+
+ $ echo <pci_id> > /sys/bus/pci/devices/0000:08:00.0/driver/unbind
+ $ echo mlx5_vfio_pci > /sys/bus/pci/devices/0000:08:00.0/driver_override
+ $ echo <pci_id> > /sys/bus/pci/devices/0000:08:00.0/driver/bind
+
+Attach VF to the VM.
+Start the VM.
+Perform live migration.
+
Subfunction
============
@@ -130,10 +248,11 @@ it is deployed. Subfunction is created and deployed in unit of 1. Unlike
SRIOV VFs, a subfunction doesn't require its own PCI virtual function.
A subfunction communicates with the hardware through the parent PCI function.
-To use a subfunction, 3 steps setup sequence is followed.
-(1) create - create a subfunction;
-(2) configure - configure subfunction attributes;
-(3) deploy - deploy the subfunction;
+To use a subfunction, 3 steps setup sequence is followed:
+
+1) create - create a subfunction;
+2) configure - configure subfunction attributes;
+3) deploy - deploy the subfunction;
Subfunction management is done using devlink port user interface.
User performs setup on the subfunction management device.
@@ -191,13 +310,48 @@ API allows to configure following rate object's parameters:
``tx_max``
Maximum TX rate value.
+``tx_priority``
+ Allows for usage of strict priority arbiter among siblings. This
+ arbitration scheme attempts to schedule nodes based on their priority
+ as long as the nodes remain within their bandwidth limit. The higher the
+ priority the higher the probability that the node will get selected for
+ scheduling.
+
+``tx_weight``
+ Allows for usage of Weighted Fair Queuing arbitration scheme among
+ siblings. This arbitration scheme can be used simultaneously with the
+ strict priority. As a node is configured with a higher rate it gets more
+ BW relative to it's siblings. Values are relative like a percentage
+ points, they basically tell how much BW should node take relative to
+ it's siblings.
+
``parent``
Parent node name. Parent node rate limits are considered as additional limits
to all node children limits. ``tx_max`` is an upper limit for children.
``tx_share`` is a total bandwidth distributed among children.
+``tx_priority`` and ``tx_weight`` can be used simultaneously. In that case
+nodes with the same priority form a WFQ subgroup in the sibling group
+and arbitration among them is based on assigned weights.
+
+Arbitration flow from the high level:
+
+#. Choose a node, or group of nodes with the highest priority that stays
+ within the BW limit and are not blocked. Use ``tx_priority`` as a
+ parameter for this arbitration.
+
+#. If group of nodes have the same priority perform WFQ arbitration on
+ that subgroup. Use ``tx_weight`` as a parameter for this arbitration.
+
+#. Select the winner node, and continue arbitration flow among it's children,
+ until leaf node is reached, and the winner is established.
+
+#. If all the nodes from the highest priority sub-group are satisfied, or
+ overused their assigned BW, move to the lower priority nodes.
+
Driver implementations are allowed to support both or either rate object types
-and setting methods of their parameters.
+and setting methods of their parameters. Additionally driver implementation
+may export nodes/leafs and their child-parent relationships.
Terms and Definitions
=====================
diff --git a/Documentation/networking/devlink/devlink-region.rst b/Documentation/networking/devlink/devlink-region.rst
index f06dca9a1eb6..9232cd7da301 100644
--- a/Documentation/networking/devlink/devlink-region.rst
+++ b/Documentation/networking/devlink/devlink-region.rst
@@ -31,6 +31,15 @@ in its ``devlink_region_ops`` structure. If snapshot id is not set in
the ``DEVLINK_CMD_REGION_NEW`` request kernel will allocate one and send
the snapshot information to user space.
+Regions may optionally allow directly reading from their contents without a
+snapshot. Direct read requests are not atomic. In particular a read request
+of size 256 bytes or larger will be split into multiple chunks. If atomic
+access is required, use a snapshot. A driver wishing to enable this for a
+region should implement the ``.read`` callback in the ``devlink_region_ops``
+structure. User space can request a direct read by using the
+``DEVLINK_ATTR_REGION_DIRECT`` attribute instead of specifying a snapshot
+id.
+
example usage
-------------
@@ -65,6 +74,10 @@ example usage
$ devlink region read pci/0000:00:05.0/fw-health snapshot 1 address 0 length 16
0000000000000000 0014 95dc 0014 9514 0035 1670 0034 db30
+ # Read from the region without a snapshot
+ $ devlink region read pci/0000:00:05.0/fw-health address 16 length 16
+ 0000000000000010 0000 0000 ffff ff04 0029 8c00 0028 8cc8
+
As regions are likely very device or driver specific, no generic regions are
defined. See the driver-specific documentation files for information on the
specific regions a driver supports.
diff --git a/Documentation/networking/devlink/devlink-trap.rst b/Documentation/networking/devlink/devlink-trap.rst
index 90d1381b88de..2c14dfe69b3a 100644
--- a/Documentation/networking/devlink/devlink-trap.rst
+++ b/Documentation/networking/devlink/devlink-trap.rst
@@ -485,6 +485,16 @@ be added to the following table:
- Traps incoming packets that the device decided to drop because
the destination MAC is not configured in the MAC table and
the interface is not in promiscuous mode
+ * - ``eapol``
+ - ``control``
+ - Traps "Extensible Authentication Protocol over LAN" (EAPOL) packets
+ specified in IEEE 802.1X
+ * - ``locked_port``
+ - ``drop``
+ - Traps packets that the device decided to drop because they failed the
+ locked bridge port check. That is, packets that were received via a
+ locked port and whose {SMAC, VID} does not correspond to an FDB entry
+ pointing to the port
Driver-specific Packet Traps
============================
@@ -589,6 +599,9 @@ narrow. The description of these groups must be added to the following table:
* - ``parser_error_drops``
- Contains packet traps for packets that were marked by the device during
parsing as erroneous
+ * - ``eapol``
+ - Contains packet traps for "Extensible Authentication Protocol over LAN"
+ (EAPOL) packets specified in IEEE 802.1X
Packet Trap Policers
====================
diff --git a/Documentation/networking/devlink/etas_es58x.rst b/Documentation/networking/devlink/etas_es58x.rst
new file mode 100644
index 000000000000..3b857d82a44c
--- /dev/null
+++ b/Documentation/networking/devlink/etas_es58x.rst
@@ -0,0 +1,36 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+==========================
+etas_es58x devlink support
+==========================
+
+This document describes the devlink features implemented by the
+``etas_es58x`` device driver.
+
+Info versions
+=============
+
+The ``etas_es58x`` driver reports the following versions
+
+.. list-table:: devlink info versions implemented
+ :widths: 5 5 90
+
+ * - Name
+ - Type
+ - Description
+ * - ``fw``
+ - running
+ - Version of the firmware running on the device. Also available
+ through ``ethtool -i`` as the first member of the
+ ``firmware-version``.
+ * - ``fw.bootloader``
+ - running
+ - Version of the bootloader running on the device. Also available
+ through ``ethtool -i`` as the second member of the
+ ``firmware-version``.
+ * - ``board.rev``
+ - fixed
+ - The hardware revision of the device.
+ * - ``serial_number``
+ - fixed
+ - The USB serial number. Also available through ``lsusb -v``.
diff --git a/Documentation/networking/devlink/ice.rst b/Documentation/networking/devlink/ice.rst
index 0c89ceb8986d..625efb3777d5 100644
--- a/Documentation/networking/devlink/ice.rst
+++ b/Documentation/networking/devlink/ice.rst
@@ -189,12 +189,21 @@ device data.
* - ``nvm-flash``
- The contents of the entire flash chip, sometimes referred to as
the device's Non Volatile Memory.
+ * - ``shadow-ram``
+ - The contents of the Shadow RAM, which is loaded from the beginning
+ of the flash. Although the contents are primarily from the flash,
+ this area also contains data generated during device boot which is
+ not stored in flash.
* - ``device-caps``
- The contents of the device firmware's capabilities buffer. Useful to
determine the current state and configuration of the device.
-Users can request an immediate capture of a snapshot via the
-``DEVLINK_CMD_REGION_NEW``
+Both the ``nvm-flash`` and ``shadow-ram`` regions can be accessed without a
+snapshot. The ``device-caps`` region requires a snapshot as the contents are
+sent by firmware and can't be split into separate reads.
+
+Users can request an immediate capture of a snapshot for all three regions
+via the ``DEVLINK_CMD_REGION_NEW`` command.
.. code:: shell
@@ -254,3 +263,118 @@ Users can request an immediate capture of a snapshot via the
0000000000000210 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
$ devlink region delete pci/0000:01:00.0/device-caps snapshot 1
+
+Devlink Rate
+============
+
+The ``ice`` driver implements devlink-rate API. It allows for offload of
+the Hierarchical QoS to the hardware. It enables user to group Virtual
+Functions in a tree structure and assign supported parameters: tx_share,
+tx_max, tx_priority and tx_weight to each node in a tree. So effectively
+user gains an ability to control how much bandwidth is allocated for each
+VF group. This is later enforced by the HW.
+
+It is assumed that this feature is mutually exclusive with DCB performed
+in FW and ADQ, or any driver feature that would trigger changes in QoS,
+for example creation of the new traffic class. The driver will prevent DCB
+or ADQ configuration if user started making any changes to the nodes using
+devlink-rate API. To configure those features a driver reload is necessary.
+Correspondingly if ADQ or DCB will get configured the driver won't export
+hierarchy at all, or will remove the untouched hierarchy if those
+features are enabled after the hierarchy is exported, but before any
+changes are made.
+
+This feature is also dependent on switchdev being enabled in the system.
+It's required bacause devlink-rate requires devlink-port objects to be
+present, and those objects are only created in switchdev mode.
+
+If the driver is set to the switchdev mode, it will export internal
+hierarchy the moment VF's are created. Root of the tree is always
+represented by the node_0. This node can't be deleted by the user. Leaf
+nodes and nodes with children also can't be deleted.
+
+.. list-table:: Attributes supported
+ :widths: 15 85
+
+ * - Name
+ - Description
+ * - ``tx_max``
+ - maximum bandwidth to be consumed by the tree Node. Rate Limit is
+ an absolute number specifying a maximum amount of bytes a Node may
+ consume during the course of one second. Rate limit guarantees
+ that a link will not oversaturate the receiver on the remote end
+ and also enforces an SLA between the subscriber and network
+ provider.
+ * - ``tx_share``
+ - minimum bandwidth allocated to a tree node when it is not blocked.
+ It specifies an absolute BW. While tx_max defines the maximum
+ bandwidth the node may consume, the tx_share marks committed BW
+ for the Node.
+ * - ``tx_priority``
+ - allows for usage of strict priority arbiter among siblings. This
+ arbitration scheme attempts to schedule nodes based on their
+ priority as long as the nodes remain within their bandwidth limit.
+ Range 0-7. Nodes with priority 7 have the highest priority and are
+ selected first, while nodes with priority 0 have the lowest
+ priority. Nodes that have the same priority are treated equally.
+ * - ``tx_weight``
+ - allows for usage of Weighted Fair Queuing arbitration scheme among
+ siblings. This arbitration scheme can be used simultaneously with
+ the strict priority. Range 1-200. Only relative values mater for
+ arbitration.
+
+``tx_priority`` and ``tx_weight`` can be used simultaneously. In that case
+nodes with the same priority form a WFQ subgroup in the sibling group
+and arbitration among them is based on assigned weights.
+
+.. code:: shell
+
+ # enable switchdev
+ $ devlink dev eswitch set pci/0000:4b:00.0 mode switchdev
+
+ # at this point driver should export internal hierarchy
+ $ echo 2 > /sys/class/net/ens785np0/device/sriov_numvfs
+
+ $ devlink port function rate show
+ pci/0000:4b:00.0/node_25: type node parent node_24
+ pci/0000:4b:00.0/node_24: type node parent node_0
+ pci/0000:4b:00.0/node_32: type node parent node_31
+ pci/0000:4b:00.0/node_31: type node parent node_30
+ pci/0000:4b:00.0/node_30: type node parent node_16
+ pci/0000:4b:00.0/node_19: type node parent node_18
+ pci/0000:4b:00.0/node_18: type node parent node_17
+ pci/0000:4b:00.0/node_17: type node parent node_16
+ pci/0000:4b:00.0/node_14: type node parent node_5
+ pci/0000:4b:00.0/node_5: type node parent node_3
+ pci/0000:4b:00.0/node_13: type node parent node_4
+ pci/0000:4b:00.0/node_12: type node parent node_4
+ pci/0000:4b:00.0/node_11: type node parent node_4
+ pci/0000:4b:00.0/node_10: type node parent node_4
+ pci/0000:4b:00.0/node_9: type node parent node_4
+ pci/0000:4b:00.0/node_8: type node parent node_4
+ pci/0000:4b:00.0/node_7: type node parent node_4
+ pci/0000:4b:00.0/node_6: type node parent node_4
+ pci/0000:4b:00.0/node_4: type node parent node_3
+ pci/0000:4b:00.0/node_3: type node parent node_16
+ pci/0000:4b:00.0/node_16: type node parent node_15
+ pci/0000:4b:00.0/node_15: type node parent node_0
+ pci/0000:4b:00.0/node_2: type node parent node_1
+ pci/0000:4b:00.0/node_1: type node parent node_0
+ pci/0000:4b:00.0/node_0: type node
+ pci/0000:4b:00.0/1: type leaf parent node_25
+ pci/0000:4b:00.0/2: type leaf parent node_25
+
+ # let's create some custom node
+ $ devlink port function rate add pci/0000:4b:00.0/node_custom parent node_0
+
+ # second custom node
+ $ devlink port function rate add pci/0000:4b:00.0/node_custom_1 parent node_custom
+
+ # reassign second VF to newly created branch
+ $ devlink port function rate set pci/0000:4b:00.0/2 parent node_custom_1
+
+ # assign tx_weight to the VF
+ $ devlink port function rate set pci/0000:4b:00.0/2 tx_weight 5
+
+ # assign tx_share to the VF
+ $ devlink port function rate set pci/0000:4b:00.0/2 tx_share 500Mbps
diff --git a/Documentation/networking/ethtool-netlink.rst b/Documentation/networking/ethtool-netlink.rst
index d578b8bcd8a4..f10f8eb44255 100644
--- a/Documentation/networking/ethtool-netlink.rst
+++ b/Documentation/networking/ethtool-netlink.rst
@@ -222,6 +222,7 @@ Userspace to kernel:
``ETHTOOL_MSG_MODULE_GET`` get transceiver module parameters
``ETHTOOL_MSG_PSE_SET`` set PSE parameters
``ETHTOOL_MSG_PSE_GET`` get PSE parameters
+ ``ETHTOOL_MSG_RSS_GET`` get RSS settings
===================================== =================================
Kernel to userspace:
@@ -263,6 +264,7 @@ Kernel to userspace:
``ETHTOOL_MSG_PHC_VCLOCKS_GET_REPLY`` PHC virtual clocks info
``ETHTOOL_MSG_MODULE_GET_REPLY`` transceiver module parameters
``ETHTOOL_MSG_PSE_GET_REPLY`` PSE parameters
+ ``ETHTOOL_MSG_RSS_GET_REPLY`` RSS settings
======================================== =================================
``GET`` requests are sent by userspace applications to retrieve device
@@ -491,6 +493,7 @@ Kernel response contents:
``ETHTOOL_A_LINKSTATE_SQI_MAX`` u32 Max support SQI value
``ETHTOOL_A_LINKSTATE_EXT_STATE`` u8 link extended state
``ETHTOOL_A_LINKSTATE_EXT_SUBSTATE`` u8 link extended substate
+ ``ETHTOOL_A_LINKSTATE_EXT_DOWN_CNT`` u32 count of link down events
==================================== ====== ============================
For most NIC drivers, the value of ``ETHTOOL_A_LINKSTATE_LINK`` returns
@@ -1686,6 +1689,33 @@ to control PoDL PSE Admin functions. This option is implementing
``IEEE 802.3-2018`` 30.15.1.2.1 acPoDLPSEAdminControl. See
``ETHTOOL_A_PODL_PSE_ADMIN_STATE`` for supported values.
+RSS_GET
+=======
+
+Get indirection table, hash key and hash function info associated with a
+RSS context of an interface similar to ``ETHTOOL_GRSSH`` ioctl request.
+
+Request contents:
+
+===================================== ====== ==========================
+ ``ETHTOOL_A_RSS_HEADER`` nested request header
+ ``ETHTOOL_A_RSS_CONTEXT`` u32 context number
+===================================== ====== ==========================
+
+Kernel response contents:
+
+===================================== ====== ==========================
+ ``ETHTOOL_A_RSS_HEADER`` nested reply header
+ ``ETHTOOL_A_RSS_HFUNC`` u32 RSS hash func
+ ``ETHTOOL_A_RSS_INDIR`` binary Indir table bytes
+ ``ETHTOOL_A_RSS_HKEY`` binary Hash key bytes
+===================================== ====== ==========================
+
+ETHTOOL_A_RSS_HFUNC attribute is bitmap indicating the hash function
+being used. Current supported options are toeplitz, xor or crc32.
+ETHTOOL_A_RSS_INDIR attribute returns RSS indrection table where each byte
+indicates queue number.
+
Request translation
===================
@@ -1767,7 +1797,7 @@ are netlink only.
``ETHTOOL_GMODULEEEPROM`` ``ETHTOOL_MSG_MODULE_EEPROM_GET``
``ETHTOOL_GEEE`` ``ETHTOOL_MSG_EEE_GET``
``ETHTOOL_SEEE`` ``ETHTOOL_MSG_EEE_SET``
- ``ETHTOOL_GRSSH`` n/a
+ ``ETHTOOL_GRSSH`` ``ETHTOOL_MSG_RSS_GET``
``ETHTOOL_SRSSH`` n/a
``ETHTOOL_GTUNABLE`` n/a
``ETHTOOL_STUNABLE`` n/a
diff --git a/Documentation/networking/index.rst b/Documentation/networking/index.rst
index 16a153bcc5fe..4f2d1f682a18 100644
--- a/Documentation/networking/index.rst
+++ b/Documentation/networking/index.rst
@@ -104,6 +104,7 @@ Contents:
switchdev
sysfs-tagging
tc-actions-env-rules
+ tc-queue-filters
tcp-thin
team
timestamping
diff --git a/Documentation/networking/ip-sysctl.rst b/Documentation/networking/ip-sysctl.rst
index e7b3fa7bb3f7..7fbd060d6047 100644
--- a/Documentation/networking/ip-sysctl.rst
+++ b/Documentation/networking/ip-sysctl.rst
@@ -1069,6 +1069,81 @@ tcp_child_ehash_entries - INTEGER
Default: 0
+tcp_plb_enabled - BOOLEAN
+ If set and the underlying congestion control (e.g. DCTCP) supports
+ and enables PLB feature, TCP PLB (Protective Load Balancing) is
+ enabled. PLB is described in the following paper:
+ https://doi.org/10.1145/3544216.3544226. Based on PLB parameters,
+ upon sensing sustained congestion, TCP triggers a change in
+ flow label field for outgoing IPv6 packets. A change in flow label
+ field potentially changes the path of outgoing packets for switches
+ that use ECMP/WCMP for routing.
+
+ PLB changes socket txhash which results in a change in IPv6 Flow Label
+ field, and currently no-op for IPv4 headers. It is possible
+ to apply PLB for IPv4 with other network header fields (e.g. TCP
+ or IPv4 options) or using encapsulation where outer header is used
+ by switches to determine next hop. In either case, further host
+ and switch side changes will be needed.
+
+ When set, PLB assumes that congestion signal (e.g. ECN) is made
+ available and used by congestion control module to estimate a
+ congestion measure (e.g. ce_ratio). PLB needs a congestion measure to
+ make repathing decisions.
+
+ Default: FALSE
+
+tcp_plb_idle_rehash_rounds - INTEGER
+ Number of consecutive congested rounds (RTT) seen after which
+ a rehash can be performed, given there are no packets in flight.
+ This is referred to as M in PLB paper:
+ https://doi.org/10.1145/3544216.3544226.
+
+ Possible Values: 0 - 31
+
+ Default: 3
+
+tcp_plb_rehash_rounds - INTEGER
+ Number of consecutive congested rounds (RTT) seen after which
+ a forced rehash can be performed. Be careful when setting this
+ parameter, as a small value increases the risk of retransmissions.
+ This is referred to as N in PLB paper:
+ https://doi.org/10.1145/3544216.3544226.
+
+ Possible Values: 0 - 31
+
+ Default: 12
+
+tcp_plb_suspend_rto_sec - INTEGER
+ Time, in seconds, to suspend PLB in event of an RTO. In order to avoid
+ having PLB repath onto a connectivity "black hole", after an RTO a TCP
+ connection suspends PLB repathing for a random duration between 1x and
+ 2x of this parameter. Randomness is added to avoid concurrent rehashing
+ of multiple TCP connections. This should be set corresponding to the
+ amount of time it takes to repair a failed link.
+
+ Possible Values: 0 - 255
+
+ Default: 60
+
+tcp_plb_cong_thresh - INTEGER
+ Fraction of packets marked with congestion over a round (RTT) to
+ tag that round as congested. This is referred to as K in the PLB paper:
+ https://doi.org/10.1145/3544216.3544226.
+
+ The 0-1 fraction range is mapped to 0-256 range to avoid floating
+ point operations. For example, 128 means that if at least 50% of
+ the packets in a round were marked as congested then the round
+ will be tagged as congested.
+
+ Setting threshold to 0 means that PLB repaths every RTT regardless
+ of congestion. This is not intended behavior for PLB and should be
+ used only for experimentation purpose.
+
+ Possible Values: 0 - 256
+
+ Default: 128
+
UDP variables
=============
@@ -1102,6 +1177,33 @@ udp_rmem_min - INTEGER
udp_wmem_min - INTEGER
UDP does not have tx memory accounting and this tunable has no effect.
+udp_hash_entries - INTEGER
+ Show the number of hash buckets for UDP sockets in the current
+ networking namespace.
+
+ A negative value means the networking namespace does not own its
+ hash buckets and shares the initial networking namespace's one.
+
+udp_child_ehash_entries - INTEGER
+ Control the number of hash buckets for UDP sockets in the child
+ networking namespace, which must be set before clone() or unshare().
+
+ If the value is not 0, the kernel uses a value rounded up to 2^n
+ as the actual hash bucket size. 0 is a special value, meaning
+ the child networking namespace will share the initial networking
+ namespace's hash buckets.
+
+ Note that the child will use the global one in case the kernel
+ fails to allocate enough memory. In addition, the global hash
+ buckets are spread over available NUMA nodes, but the allocation
+ of the child hash table depends on the current process's NUMA
+ policy, which could result in performance differences.
+
+ Possible values: 0, 2^n (n: 7 (128) - 16 (64K))
+
+ Default: 0
+
+
RAW variables
=============
@@ -3025,6 +3127,15 @@ ecn_enable - BOOLEAN
Default: 1
+l3mdev_accept - BOOLEAN
+ Enabling this option allows a "global" bound socket to work
+ across L3 master domains (e.g., VRFs) with packets capable of
+ being received regardless of the L3 domain in which they
+ originated. Only valid when the kernel was compiled with
+ CONFIG_NET_L3_MASTER_DEV.
+
+ Default: 1 (enabled)
+
``/proc/sys/net/core/*``
========================
diff --git a/Documentation/networking/ipvs-sysctl.rst b/Documentation/networking/ipvs-sysctl.rst
index 387fda80f05f..3fb5fa142eef 100644
--- a/Documentation/networking/ipvs-sysctl.rst
+++ b/Documentation/networking/ipvs-sysctl.rst
@@ -129,6 +129,26 @@ drop_packet - INTEGER
threshold. When the mode 3 is set, the always mode drop rate
is controlled by the /proc/sys/net/ipv4/vs/am_droprate.
+est_cpulist - CPULIST
+ Allowed CPUs for estimation kthreads
+
+ Syntax: standard cpulist format
+ empty list - stop kthread tasks and estimation
+ default - the system's housekeeping CPUs for kthreads
+
+ Example:
+ "all": all possible CPUs
+ "0-N": all possible CPUs, N denotes last CPU number
+ "0,1-N:1/2": first and all CPUs with odd number
+ "": empty list
+
+est_nice - INTEGER
+ default 0
+ Valid range: -20 (more favorable) .. 19 (less favorable)
+
+ Niceness value to use for the estimation kthreads (scheduling
+ priority)
+
expire_nodest_conn - BOOLEAN
- 0 - disabled (default)
- not 0 - enabled
@@ -304,8 +324,8 @@ run_estimation - BOOLEAN
0 - disabled
not 0 - enabled (default)
- If disabled, the estimation will be stop, and you can't see
- any update on speed estimation data.
+ If disabled, the estimation will be suspended and kthread tasks
+ stopped.
You can always re-enable estimation by setting this value to 1.
But be careful, the first estimation after re-enable is not
diff --git a/Documentation/networking/tc-queue-filters.rst b/Documentation/networking/tc-queue-filters.rst
new file mode 100644
index 000000000000..6b417092276f
--- /dev/null
+++ b/Documentation/networking/tc-queue-filters.rst
@@ -0,0 +1,37 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+=========================
+TC queue based filtering
+=========================
+
+TC can be used for directing traffic to either a set of queues or
+to a single queue on both the transmit and receive side.
+
+On the transmit side:
+
+1) TC filter directing traffic to a set of queues is achieved
+ using the action skbedit priority for Tx priority selection,
+ the priority maps to a traffic class (set of queues) when
+ the queue-sets are configured using mqprio.
+
+2) TC filter directs traffic to a transmit queue with the action
+ skbedit queue_mapping $tx_qid. The action skbedit queue_mapping
+ for transmit queue is executed in software only and cannot be
+ offloaded.
+
+Likewise, on the receive side, the two filters for selecting set of
+queues and/or a single queue are supported as below:
+
+1) TC flower filter directs incoming traffic to a set of queues using
+ the 'hw_tc' option.
+ hw_tc $TCID - Specify a hardware traffic class to pass matching
+ packets on to. TCID is in the range 0 through 15.
+
+2) TC filter with action skbedit queue_mapping $rx_qid selects a
+ receive queue. The action skbedit queue_mapping for receive queue
+ is supported only in hardware. Multiple filters may compete in
+ the hardware for queue selection. In such case, the hardware
+ pipeline resolves conflicts based on priority. On Intel E810
+ devices, TC filter directing traffic to a queue have higher
+ priority over flow director filter assigning a queue. The hash
+ filter has lowest priority.
diff --git a/Documentation/networking/timestamping.rst b/Documentation/networking/timestamping.rst
index be4eb1242057..f17c01834a12 100644
--- a/Documentation/networking/timestamping.rst
+++ b/Documentation/networking/timestamping.rst
@@ -179,7 +179,8 @@ SOF_TIMESTAMPING_OPT_ID:
identifier and returns that along with the timestamp. The identifier
is derived from a per-socket u32 counter (that wraps). For datagram
sockets, the counter increments with each sent packet. For stream
- sockets, it increments with every byte.
+ sockets, it increments with every byte. For stream sockets, also set
+ SOF_TIMESTAMPING_OPT_ID_TCP, see the section below.
The counter starts at zero. It is initialized the first time that
the socket option is enabled. It is reset each time the option is
@@ -192,6 +193,35 @@ SOF_TIMESTAMPING_OPT_ID:
among all possibly concurrently outstanding timestamp requests for
that socket.
+SOF_TIMESTAMPING_OPT_ID_TCP:
+ Pass this modifier along with SOF_TIMESTAMPING_OPT_ID for new TCP
+ timestamping applications. SOF_TIMESTAMPING_OPT_ID defines how the
+ counter increments for stream sockets, but its starting point is
+ not entirely trivial. This option fixes that.
+
+ For stream sockets, if SOF_TIMESTAMPING_OPT_ID is set, this should
+ always be set too. On datagram sockets the option has no effect.
+
+ A reasonable expectation is that the counter is reset to zero with
+ the system call, so that a subsequent write() of N bytes generates
+ a timestamp with counter N-1. SOF_TIMESTAMPING_OPT_ID_TCP
+ implements this behavior under all conditions.
+
+ SOF_TIMESTAMPING_OPT_ID without modifier often reports the same,
+ especially when the socket option is set when no data is in
+ transmission. If data is being transmitted, it may be off by the
+ length of the output queue (SIOCOUTQ).
+
+ The difference is due to being based on snd_una versus write_seq.
+ snd_una is the offset in the stream acknowledged by the peer. This
+ depends on factors outside of process control, such as network RTT.
+ write_seq is the last byte written by the process. This offset is
+ not affected by external inputs.
+
+ The difference is subtle and unlikely to be noticed when configured
+ at initial socket creation, when no data is queued or sent. But
+ SOF_TIMESTAMPING_OPT_ID_TCP behavior is more robust regardless of
+ when the socket option is set.
SOF_TIMESTAMPING_OPT_CMSG:
Support recv() cmsg for all timestamped packets. Control messages
diff --git a/Documentation/networking/xfrm_device.rst b/Documentation/networking/xfrm_device.rst
index 01391dfd37d9..c43ace79e320 100644
--- a/Documentation/networking/xfrm_device.rst
+++ b/Documentation/networking/xfrm_device.rst
@@ -5,6 +5,7 @@ XFRM device - offloading the IPsec computations
===============================================
Shannon Nelson <shannon.nelson@oracle.com>
+Leon Romanovsky <leonro@nvidia.com>
Overview
@@ -18,10 +19,21 @@ can radically increase throughput and decrease CPU utilization. The XFRM
Device interface allows NIC drivers to offer to the stack access to the
hardware offload.
+Right now, there are two types of hardware offload that kernel supports.
+ * IPsec crypto offload:
+ * NIC performs encrypt/decrypt
+ * Kernel does everything else
+ * IPsec packet offload:
+ * NIC performs encrypt/decrypt
+ * NIC does encapsulation
+ * Kernel and NIC have SA and policy in-sync
+ * NIC handles the SA and policies states
+ * The Kernel talks to the keymanager
+
Userland access to the offload is typically through a system such as
libreswan or KAME/raccoon, but the iproute2 'ip xfrm' command set can
be handy when experimenting. An example command might look something
-like this::
+like this for crypto offload:
ip x s add proto esp dst 14.0.0.70 src 14.0.0.52 spi 0x07 mode transport \
reqid 0x07 replay-window 32 \
@@ -29,6 +41,17 @@ like this::
sel src 14.0.0.52/24 dst 14.0.0.70/24 proto tcp \
offload dev eth4 dir in
+and for packet offload
+
+ ip x s add proto esp dst 14.0.0.70 src 14.0.0.52 spi 0x07 mode transport \
+ reqid 0x07 replay-window 32 \
+ aead 'rfc4106(gcm(aes))' 0x44434241343332312423222114131211f4f3f2f1 128 \
+ sel src 14.0.0.52/24 dst 14.0.0.70/24 proto tcp \
+ offload packet dev eth4 dir in
+
+ ip x p add src 14.0.0.70 dst 14.0.0.52 offload packet dev eth4 dir in
+ tmpl src 14.0.0.70 dst 14.0.0.52 proto esp reqid 10000 mode transport
+
Yes, that's ugly, but that's what shell scripts and/or libreswan are for.
@@ -40,17 +63,24 @@ Callbacks to implement
/* from include/linux/netdevice.h */
struct xfrmdev_ops {
+ /* Crypto and Packet offload callbacks */
int (*xdo_dev_state_add) (struct xfrm_state *x);
void (*xdo_dev_state_delete) (struct xfrm_state *x);
void (*xdo_dev_state_free) (struct xfrm_state *x);
bool (*xdo_dev_offload_ok) (struct sk_buff *skb,
struct xfrm_state *x);
void (*xdo_dev_state_advance_esn) (struct xfrm_state *x);
+
+ /* Solely packet offload callbacks */
+ void (*xdo_dev_state_update_curlft) (struct xfrm_state *x);
+ int (*xdo_dev_policy_add) (struct xfrm_policy *x);
+ void (*xdo_dev_policy_delete) (struct xfrm_policy *x);
+ void (*xdo_dev_policy_free) (struct xfrm_policy *x);
};
-The NIC driver offering ipsec offload will need to implement these
-callbacks to make the offload available to the network stack's
-XFRM subsystem. Additionally, the feature bits NETIF_F_HW_ESP and
+The NIC driver offering ipsec offload will need to implement callbacks
+relevant to supported offload to make the offload available to the network
+stack's XFRM subsystem. Additionally, the feature bits NETIF_F_HW_ESP and
NETIF_F_HW_ESP_TX_CSUM will signal the availability of the offload.
@@ -79,7 +109,8 @@ and an indication of whether it is for Rx or Tx. The driver should
=========== ===================================
0 success
- -EOPNETSUPP offload not supported, try SW IPsec
+ -EOPNETSUPP offload not supported, try SW IPsec,
+ not applicable for packet offload mode
other fail the request
=========== ===================================
@@ -96,6 +127,7 @@ will serviceable. This can check the packet information to be sure the
offload can be supported (e.g. IPv4 or IPv6, no IPv4 options, etc) and
return true of false to signify its support.
+Crypto offload mode:
When ready to send, the driver needs to inspect the Tx packet for the
offload information, including the opaque context, and set up the packet
send accordingly::
@@ -139,13 +171,25 @@ the stack in xfrm_input().
In ESN mode, xdo_dev_state_advance_esn() is called from xfrm_replay_advance_esn().
Driver will check packet seq number and update HW ESN state machine if needed.
+Packet offload mode:
+HW adds and deletes XFRM headers. So in RX path, XFRM stack is bypassed if HW
+reported success. In TX path, the packet lefts kernel without extra header
+and not encrypted, the HW is responsible to perform it.
+
When the SA is removed by the user, the driver's xdo_dev_state_delete()
-is asked to disable the offload. Later, xdo_dev_state_free() is called
-from a garbage collection routine after all reference counts to the state
+and xdo_dev_policy_delete() are asked to disable the offload. Later,
+xdo_dev_state_free() and xdo_dev_policy_free() are called from a garbage
+collection routine after all reference counts to the state and policy
have been removed and any remaining resources can be cleared for the
offload state. How these are used by the driver will depend on specific
hardware needs.
As a netdev is set to DOWN the XFRM stack's netdev listener will call
-xdo_dev_state_delete() and xdo_dev_state_free() on any remaining offloaded
-states.
+xdo_dev_state_delete(), xdo_dev_policy_delete(), xdo_dev_state_free() and
+xdo_dev_policy_free() on any remaining offloaded states.
+
+Outcome of HW handling packets, the XFRM core can't count hard, soft limits.
+The HW/driver are responsible to perform it and provide accurate data when
+xdo_dev_state_update_curlft() is called. In case of one of these limits
+occuried, the driver needs to call to xfrm_state_check_expire() to make sure
+that XFRM performs rekeying sequence.