summaryrefslogtreecommitdiffstats
AgeCommit message (Collapse)AuthorFilesLines
2021-10-16net: make use of helper netif_is_bridge_master()Kyungrok Chung9-14/+14
Make use of netdev helper functions to improve code readability. Replace 'dev->priv_flags & IFF_EBRIDGE' with netif_is_bridge_master(dev). Signed-off-by: Kyungrok Chung <acadx0@gmail.com> Reviewed-by: Nikolay Aleksandrov <nikolay@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-10-16Merge branch 'smc-rv23'David S. Miller14-383/+1607
Karsten Graul says: ==================== net/smc: introduce SMC-Rv2 support Please apply the following patch series for smc to netdev's net-next tree. SMC-Rv2 support (see https://www.ibm.com/support/pages/node/6326337) provides routable RoCE support for SMC-R, eliminating the current same-subnet restriction, by exploiting the UDP encapsulation feature of the RoCE adapter hardware. v2: resend of the v1 patch series, and CC linux-rdma this time v3: rebase after net tree was merged into net-next ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2021-10-16net/smc: stop links when their GID is removedKarsten Graul1-0/+53
With SMC-Rv2 the GID is an IP address that can be deleted from the device. When an IB_EVENT_GID_CHANGE event is provided then iterate over all active links and check if their GID is still defined. Otherwise stop the affected link. Signed-off-by: Karsten Graul <kgraul@linux.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-10-16net/smc: add netlink support for SMC-Rv2Karsten Graul2-27/+88
Implement the netlink support for SMC-Rv2 related attributes that are provided to user space. Signed-off-by: Karsten Graul <kgraul@linux.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-10-16net/smc: extend LLC layer for SMC-Rv2Karsten Graul7-115/+531
Add support for large v2 LLC control messages in smc_llc.c. The new large work request buffer allows to combine control messages into one packet that had to be spread over several packets before. Add handling of the new v2 LLC messages. Signed-off-by: Karsten Graul <kgraul@linux.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-10-16net/smc: add v2 support to the work request layerKarsten Graul5-21/+199
In the work request layer define one large v2 buffer for each link group that is used to transmit and receive large LLC control messages. Add the completion queue handling for this buffer. Signed-off-by: Karsten Graul <kgraul@linux.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-10-16net/smc: retrieve v2 gid from IB deviceKarsten Graul4-30/+95
In smc_ib.c, scan for RoCE devices that support UDP encapsulation. Find an eligible device and check that there is a route to the remote peer. Signed-off-by: Karsten Graul <kgraul@linux.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-10-16net/smc: add v2 format of CLC decline messageKarsten Graul2-10/+55
The CLC decline message changed with SMC-Rv2 and supports up to 4 additional diagnosis codes. Signed-off-by: Karsten Graul <kgraul@linux.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-10-16net/smc: add listen processing for SMC-Rv2Karsten Graul4-66/+165
Implement the server side of the SMC-Rv2 processing. Process incoming CLC messages, find eligible devices and check for a valid route to the remote peer. Signed-off-by: Karsten Graul <kgraul@linux.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-10-16net/smc: add SMC-Rv2 connection establishmentKarsten Graul6-61/+267
Send a CLC proposal message, and the remote side process this type of message and determine the target GID. Check for a valid route to this GID, and complete the connection establishment. Signed-off-by: Karsten Graul <kgraul@linux.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-10-16net/smc: prepare for SMC-Rv2 connectionKarsten Graul4-35/+113
Prepare the connection establishment with SMC-Rv2. Detect eligible RoCE cards and indicate all supported SMC modes for the connection. Signed-off-by: Karsten Graul <kgraul@linux.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-10-16net/smc: save stack space and allocate smc_init_infoKarsten Graul1-30/+53
The struct smc_init_info grew over time, its time to save space on stack and allocate this struct dynamically. Signed-off-by: Karsten Graul <kgraul@linux.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-10-16net: stream: don't purge sk_error_queue in sk_stream_kill_queues()Jakub Kicinski1-3/+0
sk_stream_kill_queues() can be called on close when there are still outstanding skbs to transmit. Those skbs may try to queue notifications to the error queue (e.g. timestamps). If sk_stream_kill_queues() purges the queue without taking its lock the queue may get corrupted, and skbs leaked. This shows up as a warning about an rmem leak: WARNING: CPU: 24 PID: 0 at net/ipv4/af_inet.c:154 inet_sock_destruct+0x... The leak is always a multiple of 0x300 bytes (the value is in %rax on my builds, so RAX: 0000000000000300). 0x300 is truesize of an empty sk_buff. Indeed if we dump the socket state at the time of the warning the sk_error_queue is often (but not always) corrupted. The ->next pointer points back at the list head, but not the ->prev pointer. Indeed we can find the leaked skb by scanning the kernel memory for something that looks like an skb with ->sk = socket in question, and ->truesize = 0x300. The contents of ->cb[] of the skb confirms the suspicion that it is indeed a timestamp notification (as generated in __skb_complete_tx_timestamp()). Removing purging of sk_error_queue should be okay, since inet_sock_destruct() does it again once all socket refs are gone. Eric suggests this may cause sockets that go thru disconnect() to maintain notifications from the previous incarnations of the socket, but that should be okay since the race was there anyway, and disconnect() is not exactly dependable. Thanks to Jonathan Lemon and Omar Sandoval for help at various stages of tracing the issue. Fixes: cb9eff097831 ("net: new user space API for time stamping of incoming and outgoing packets") Signed-off-by: Jakub Kicinski <kuba@kernel.org> Reviewed-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-10-16Merge branch 'dev_addr-conversions-part-1'David S. Miller15-46/+69
Jakub Kicinski says: ==================== ethernet: manual netdev->dev_addr conversions (part 1) Manual conversions of drivers writing directly to netdev->dev_addr (part 1 out of 3). ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2021-10-16ethernet: ixgb: use eth_hw_addr_set()Jakub Kicinski1-2/+6
Commit 406f42fa0d3c ("net-next: When a bond have a massive amount of VLANs...") introduced a rbtree for faster Ethernet address look up. To maintain netdev->dev_addr in this tree we need to make all the writes to it got through appropriate helpers. Read the address into an array on the stack, then call eth_hw_addr_set(). ixgb_get_ee_mac_addr() is used with a non-nevdev->dev_addr pointer so we can't deal with the problem inside it. Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-10-16ethernet: ibmveth: use ether_addr_to_u64()Jakub Kicinski1-14/+3
We'll want to make netdev->dev_addr const, remove the local helper which is missing a const qualifier on the argument and use ether_addr_to_u64(). Similar story to mlx4. Signed-off-by: Jakub Kicinski <kuba@kernel.org> Reviewed-by: Tyrel Datwyler <tyreld@linux.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-10-16ethernet: enetc: use eth_hw_addr_set()Jakub Kicinski3-3/+7
Commit 406f42fa0d3c ("net-next: When a bond have a massive amount of VLANs...") introduced a rbtree for faster Ethernet address look up. To maintain netdev->dev_addr in this tree we need to make all the writes to it got through appropriate helpers. Pass a netdev into the helper instead of just the address, read the address into an array on the stack, then call eth_hw_addr_set(). Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-10-16ethernet: ec_bhf: use eth_hw_addr_set()Jakub Kicinski1-1/+3
Commit 406f42fa0d3c ("net-next: When a bond have a massive amount of VLANs...") introduced a rbtree for faster Ethernet address look up. To maintain netdev->dev_addr in this tree we need to make all the writes to it got through appropriate helpers. Copy the address into an array on the stack, then call eth_hw_addr_set(). Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-10-16ethernet: enic: use eth_hw_addr_set()Jakub Kicinski1-2/+3
Commit 406f42fa0d3c ("net-next: When a bond have a massive amount of VLANs...") introduced a rbtree for faster Ethernet address look up. To maintain netdev->dev_addr in this tree we need to make all the writes to it got through appropriate helpers. Use a zero'ed array on the stack, then call eth_hw_addr_set(). Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-10-16ethernet: bcmgenet: use eth_hw_addr_set()Jakub Kicinski1-2/+6
Commit 406f42fa0d3c ("net-next: When a bond have a massive amount of VLANs...") introduced a rbtree for faster Ethernet address look up. To maintain netdev->dev_addr in this tree we need to make all the writes to it got through appropriate helpers. Read the address into an array on the stack, then call eth_hw_addr_set(). Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-10-16ethernet: bnx2x: use eth_hw_addr_set()Jakub Kicinski1-5/+11
Commit 406f42fa0d3c ("net-next: When a bond have a massive amount of VLANs...") introduced a rbtree for faster Ethernet address look up. To maintain netdev->dev_addr in this tree we need to make all the writes to it got through appropriate helpers. Read the address into an array on the stack, then call eth_hw_addr_set(). Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-10-16ethernet: aquantia: use eth_hw_addr_set()Jakub Kicinski1-2/+4
Commit 406f42fa0d3c ("net-next: When a bond have a massive amount of VLANs...") introduced a rbtree for faster Ethernet address look up. To maintain netdev->dev_addr in this tree we need to make all the writes to it got through appropriate helpers. Use an array on the stack, then call eth_hw_addr_set(). eth_hw_addr_set() is after error checking, this should be fine, error propagates all the way to failing probe. Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-10-16ethernet: amd: use eth_hw_addr_set()Jakub Kicinski2-5/+12
Commit 406f42fa0d3c ("net-next: When a bond have a massive amount of VLANs...") introduced a rbtree for faster Ethernet address look up. To maintain netdev->dev_addr in this tree we need to make all the writes to it got through appropriate helpers. Read the address into an array on the stack, then call eth_hw_addr_set(). Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-10-16ethernet: alteon: use eth_hw_addr_set()Jakub Kicinski1-6/+8
Commit 406f42fa0d3c ("net-next: When a bond have a massive amount of VLANs...") introduced a rbtree for faster Ethernet address look up. To maintain netdev->dev_addr in this tree we need to make all the writes to it got through appropriate helpers. Break the address apart into an array on the stack, then call eth_hw_addr_set(). Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-10-16ethernet: aeroflex: use eth_hw_addr_set()Jakub Kicinski1-3/+3
Commit 406f42fa0d3c ("net-next: When a bond have a massive amount of VLANs...") introduced a rbtree for faster Ethernet address look up. To maintain netdev->dev_addr in this tree we need to make all the writes to it got through appropriate helpers. macaddr[] is a module param, and int, so copy the address into an array of u8 on the stack, then call eth_hw_addr_set(). Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-10-16ethernet: adaptec: use eth_hw_addr_set()Jakub Kicinski1-1/+3
Commit 406f42fa0d3c ("net-next: When a bond have a massive amount of VLANs...") introduced a rbtree for faster Ethernet address look up. To maintain netdev->dev_addr in this tree we need to make all the writes to it got through appropriate helpers. Read the address into an array on the stack, then call eth_hw_addr_set(). Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-10-16net: ipvtap: fix template string argument of device_create() callJean Sacren1-1/+1
The last argument of device_create() call should be a template string. The tap_name variable should be the argument to the string, but not the argument of the call itself. We should add the template string and turn tap_name into its argument. Signed-off-by: Jean Sacren <sakiwit@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-10-16net: macvtap: fix template string argument of device_create() callJean Sacren1-1/+1
The last argument of device_create() call should be a template string. The tap_name variable should be the argument to the string, but not the argument of the call itself. We should add the template string and turn tap_name into its argument. Signed-off-by: Jean Sacren <sakiwit@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-10-16Merge tag 'mlx5-updates-2021-10-15' of ↵David S. Miller32-133/+529
git://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux Saeed Mahameed says: ==================== mlx5-updates-2021-10-15 1) From Rongwei Liu: Use system_image_guid and native_port_num when bonding. Don't relay on PCIe ids anymore. With some specific NIC, the physical devices may have PCIe IDs like 0001:01:00.0/1 and 0002:02:00.0/1. All of these devices should have the same system_image_guid and device index can be queried from native_port_num. For matching sibling devices/port of the same HCA, compare the HCA GUID reported on each device rather than just assuming PCIe ids have similar attributes. 2) From Amir Tzin: Use HCA defined Timouts Replace hard coded timeouts with values stored by firmware in default timeouts register (DTOR). Timeouts are read during driver load. If DTOR is not supported by firmware then fallback to hard coded defaults instead. 3) From Shay Drory: Disable roce at HCA level Disable RoCE in Firmware when devlink roce parameter is set to off. 4) A small set of trivial cleanups ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2021-10-16Merge branch '100GbE' of ↵David S. Miller22-420/+628
git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/next-queue Tony Nguyen says: ==================== 100GbE Intel Wired LAN Driver Updates 2021-10-14 Maciej Machnikowski says: Extend the driver implementation to support PTP pins on E810-T and derivative devices. E810-T adapters are equipped with: - 2 external bidirectional SMA connectors - 1 internal TX U.FL shared with SMA1 - 1 internal RX U.FL shared with SMA2 The SMA and U.FL configuration is controlled by the external multiplexer. E810-T Derivatives are equipped with: - 2 1PPS outputs on SDP20 and SDP22 - 2 1PPS inputs on SDP21 and SDP23 --- v2: - Remove defensive programming check and simplify return statement (Patch 3) - Remove unnecessary parentheses (Patch 4) ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2021-10-16Merge branch 'mptcp-fixes'David S. Miller5-10/+16
Mat Martineau says: ==================== mptcp: A few fixes This set has three separate changes for the net-next tree: Patch 1 guarantees safe handling and a warning if a NULL value is encountered when gathering subflow data for the MPTCP_SUBFLOW_ADDRS socket option. Patch 2 increases the default number of subflows allowed per MPTCP connection. Patch 3 makes an existing function 'static'. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2021-10-16mptcp: Make mptcp_pm_nl_mp_prio_send_ack() staticMat Martineau2-6/+3
This function is only used within pm_netlink.c now. Fixes: 067065422fcd ("mptcp: add the outgoing MP_PRIO support") Reviewed-by: Matthieu Baerts <matthieu.baerts@tessares.net> Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-10-16mptcp: increase default max additional subflows to 2Paolo Abeni3-4/+10
The current default does not allowing additional subflows, mostly as a safety restriction to avoid uncontrolled resource consumption on busy servers. Still the system admin and/or the application have to opt-in to MPTCP explicitly. After that, they need to change (increase) the default maximum number of additional subflows. Let set that to reasonable default, and make end-users life easier. Additionally we need to update some self-tests accordingly. Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-10-16mptcp: Avoid NULL dereference in mptcp_getsockopt_subflow_addrs()Tim Gardner1-0/+3
Coverity complains of a possible NULL dereference in mptcp_getsockopt_subflow_addrs(): 861 } else if (sk->sk_family == AF_INET6) { 3. returned_null: inet6_sk returns NULL. [show details] 4. var_assigned: Assigning: np = NULL return value from inet6_sk. 862 const struct ipv6_pinfo *np = inet6_sk(sk); Fix this by checking for NULL. Closes: https://github.com/multipath-tcp/mptcp_net-next/issues/231 Fixes: c11c5906bc0a ("mptcp: add MPTCP_SUBFLOW_ADDRS getsockopt support") Cc: Florian Westphal <fw@strlen.de> Signed-off-by: Tim Gardner <tim.gardner@canonical.com> [mjm: Added WARN_ON_ONCE() to the unexpected case] Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-10-15net/mlx5: Use system_image_guid to determine bondingRongwei Liu1-1/+13
With specific NICs, the PFs may have different PCIe ids like 0001:01:00.0/1 and 0002:02:00:00/1. For PFs with the same system_image_guid, driver should consider them under the same physical NIC and they are legal to bond together. If firmware doesn't support system_image_guid, set it to zero and fallback to use PCIe ids. Signed-off-by: Rongwei Liu <rongweil@nvidia.com> Reviewed-by: Mark Bloch <mbloch@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2021-10-15net/mlx5: Use native_port_num as 1st option of device indexRongwei Liu1-1/+6
Using "native_port_num" can support more NICs. Fallback to PCIe IDs if "native_port_num" query fails. Signed-off-by: Rongwei Liu <rongweil@nvidia.com> Reviewed-by: Mark Bloch <mbloch@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2021-10-15net/mlx5: Introduce new device index wrapperRongwei Liu6-7/+12
Downstream patches. Signed-off-by: Rongwei Liu <rongweil@nvidia.com> Reviewed-by: Mark Bloch <mbloch@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2021-10-15net/mlx5: Check return status first when querying system_image_guidRongwei Liu1-9/+12
When querying system_image_guid from firmware, we should check return value first. The buffer content is valid only if query succeed. Signed-off-by: Rongwei Liu <rongweil@nvidia.com> Reviewed-by: Mark Bloch <mbloch@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2021-10-15net/mlx5: DR, Prefer kcalloc over open coded arithmeticLen Baker1-2/+6
As noted in the "Deprecated Interfaces, Language Features, Attributes, and Conventions" documentation [1], size calculations (especially multiplication) should not be performed in memory allocator (or similar) function arguments due to the risk of them overflowing. This could lead to values wrapping around and a smaller allocation being made than the caller was expecting. Using those allocations could lead to linear overflows of heap memory and other misbehaviors. So, refactor the code a bit to use the purpose specific kcalloc() function instead of the argument size * count in the kzalloc() function. [1] https://www.kernel.org/doc/html/v5.14/process/deprecated.html#open-coded-arithmetic-in-allocator-arguments Signed-off-by: Len Baker <len.baker@gmx.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2021-10-15net/mlx5e: Add extack msgs related to TC for better debugAbhiram R N1-30/+76
As multiple places EOPNOTSUPP and EINVAL is returned from driver it becomes difficult to understand the reason only with error code. With the netlink extack message exact reason will be known and will aid in debugging. Signed-off-by: Abhiram R N <abhiramrn@gmail.com> Reviewed-by: Roi Dayan <roid@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2021-10-15net/mlx5: CT: Fix missing cleanup of ct nat table on init failurePaul Blakey1-0/+1
If CT fails to initialize it's rhashtables, it doesn't destroy the ct nat global table. Destroy the ct nat global table on ct init failure. Fixes: d7cade513752 ("net/mlx5e: check return value of rhashtable_init") Signed-off-by: Paul Blakey <paulb@nvidia.com> Reviewed-by: Oz Shlomo <ozsh@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2021-10-15net/mlx5: Disable roce at HCA levelShay Drory4-7/+33
Currently, when a user disables roce via the devlink param, this change isn't passed down to the device. If device allows disabling RoCE at device level, make use of it. This instructs the device to skip memory allocations related to RoCE functionality which otherwise is done by the device. Signed-off-by: Shay Drory <shayd@nvidia.com> Reviewed-by: Parav Pandit <parav@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2021-10-15net/mlx5i: Enable Rx steering for IPoIB via ethtoolMoosa Baransi5-10/+43
Enable steering IPoIB packets via ethtool, the same way it is done today for Ethernet packets. Signed-off-by: Moosa Baransi <moosab@nvidia.com> Reviewed-by: Tariq Toukan <tariqt@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2021-10-15net/mlx5: Bridge, provide flow source hintsVlad Buslov1-0/+4
Currently, SMFS mode doesn't support rx-loopback flows which causes bridge egress rules to be rejected because without hint rules for both rx and tx destinations are created by default. Provide explicit flow source hints for compatibility with SMFS. Signed-off-by: Vlad Buslov <vladbu@nvidia.com> Reviewed-by: Roi Dayan <roid@nvidia.com> Reviewed-by: Yevgeny Kliteynik <kliteyn@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2021-10-15net/mlx5: Read timeout values from DTORAmir Tzin10-40/+124
Replace hard coded timeouts with values stored by firmware in default timeouts register (DTOR). Timeouts are read during driver load. If DTOR is not supported by firmware then fallback to hard coded defaults instead. Signed-off-by: Amir Tzin <amirtz@nvidia.com> Reviewed-by: Moshe Shemesh <moshe@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2021-10-15net/mlx5: Read timeout values from init segmentAmir Tzin6-26/+161
Replace hard coded timeouts with values stored in firmware's init segment. Timeouts are read from init segment during driver load. If init segment timeouts are not supported then fallback to hard coded defaults instead. Also move pre initialization timeouts which cannot be read from firmware to the new mechanism. Signed-off-by: Amir Tzin <amirtz@mellanox.com> Reviewed-by: Moshe Shemesh <moshe@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2021-10-15net/mlx5: Add layout to support default timeouts registerAmir Tzin3-2/+40
Add needed structures and defines for DTOR (default timeouts register). This will be used to get timeouts values from FW instead of hard coded values in the driver code thus enabling support for slower devices which need longer timeouts. Signed-off-by: Amir Tzin <amirtz@nvidia.com> Reviewed-by: Moshe Shemesh <moshe@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2021-10-15ice: make use of ice_for_each_* macrosMaciej Fijalkowski6-27/+30
Go through the code base and use ice_for_each_* macros. While at it, introduce ice_for_each_xdp_txq() macro that can be used for looping over xdp_rings array. Commit is not introducing any new functionality. Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com> Tested-by: Gurucharan G <gurucharanx.g@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2021-10-15ice: introduce XDP_TX fallback pathMaciej Fijalkowski6-9/+75
Under rare circumstances there might be a situation where a requirement of having XDP Tx queue per CPU could not be fulfilled and some of the Tx resources have to be shared between CPUs. This yields a need for placing accesses to xdp_ring inside a critical section protected by spinlock. These accesses happen to be in the hot path, so let's introduce the static branch that will be triggered from the control plane when driver could not provide Tx queue dedicated for XDP on each CPU. Currently, the design that has been picked is to allow any number of XDP Tx queues that is at least half of a count of CPUs that platform has. For lower number driver will bail out with a response to user that there were not enough Tx resources that would allow configuring XDP. The sharing of rings is signalled via static branch enablement which in turn indicates that lock for xdp_ring accesses needs to be taken in hot path. Approach based on static branch has no impact on performance of a non-fallback path. One thing that is needed to be mentioned is a fact that the static branch will act as a global driver switch, meaning that if one PF got out of Tx resources, then other PFs that ice driver is servicing will suffer. However, given the fact that HW that ice driver is handling has 1024 Tx queues per each PF, this is currently an unlikely scenario. Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com> Tested-by: George Kuruvinakunnel <george.kuruvinakunnel@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2021-10-15ice: optimize XDP_TX workloadsMaciej Fijalkowski4-25/+88
Optimize Tx descriptor cleaning for XDP. Current approach doesn't really scale and chokes when multiple flows are handled. Introduce two ring fields, @next_dd and @next_rs that will keep track of descriptor that should be looked at when the need for cleaning arise and the descriptor that should have the RS bit set, respectively. Note that at this point the threshold is a constant (32), but it is something that we could make configurable. First thing is to get away from setting RS bit on each descriptor. Let's do this only once NTU is higher than the currently @next_rs value. In such case, grab the tx_desc[next_rs], set the RS bit in descriptor and advance the @next_rs by a 32. Second thing is to clean the Tx ring only when there are less than 32 free entries. For that case, look up the tx_desc[next_dd] for a DD bit. This bit is written back by HW to let the driver know that xmit was successful. It will happen only for those descriptors that had RS bit set. Clean only 32 descriptors and advance the DD bit. Actual cleaning routine is moved from ice_napi_poll() down to the ice_xmit_xdp_ring(). It is safe to do so as XDP ring will not get any SKBs in there that would rely on interrupts for the cleaning. Nice side effect is that for rare case of Tx fallback path (that next patch is going to introduce) we don't have to trigger the SW irq to clean the ring. With those two concepts, ring is kept at being almost full, but it is guaranteed that driver will be able to produce Tx descriptors. This approach seems to work out well even though the Tx descriptors are produced in one-by-one manner. Test was conducted with the ice HW bombarded with packets from HW generator, configured to generate 30 flows. Xdp2 sample yields the following results: <snip> proto 17: 79973066 pkt/s proto 17: 80018911 pkt/s proto 17: 80004654 pkt/s proto 17: 79992395 pkt/s proto 17: 79975162 pkt/s proto 17: 79955054 pkt/s proto 17: 79869168 pkt/s proto 17: 79823947 pkt/s proto 17: 79636971 pkt/s </snip> As that sample reports the Rx'ed frames, let's look at sar output. It says that what we Rx'ed we do actually Tx, no noticeable drops. Average: IFACE rxpck/s txpck/s rxkB/s txkB/s rxcmp/s txcmp/s rxmcst/s %ifutil Average: ens4f1 79842324.00 79842310.40 4678261.17 4678260.38 0.00 0.00 0.00 38.32 with tx_busy staying calm. When compared to a state before: Average: IFACE rxpck/s txpck/s rxkB/s txkB/s rxcmp/s txcmp/s rxmcst/s %ifutil Average: ens4f1 90919711.60 42233822.60 5327326.85 2474638.04 0.00 0.00 0.00 43.64 it can be observed that the amount of txpck/s is almost doubled, meaning that the performance is improved by around 90%. All of this due to the drops in the driver, previously the tx_busy stat was bumped at a 7mpps rate. Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com> Tested-by: George Kuruvinakunnel <george.kuruvinakunnel@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>