summaryrefslogtreecommitdiffstats
AgeCommit message (Collapse)AuthorFilesLines
2018-05-17net/mlx5e: Prepare for shared table to keep TC eswitch flowsOr Gerlitz2-20/+20
This is a refactoring step to be able and store the hash table which keeps track of offloaded TC flows in a different location for NIC vs e-switch rules. Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com> Signed-off-by: Jiri Pirko <jiri@mellanox.com> Reviewed-by: Paul Blakey <paulb@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2018-05-17net/mlx5e: Add ingress/egress indication for offloaded TC flowsOr Gerlitz5-31/+70
When an e-switch TC rule is offloaded through the egdev (egress device) mechanism, we treat this as egress, all other cases (NIC and e-switch) are considred ingress. This is preparation step that will allow us to identify "wrong" stat/del offload calls made by the TC core on egdev based flows and ignore them. Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com> Signed-off-by: Jiri Pirko <jiri@mellanox.com> Reviewed-by: Paul Blakey <paulb@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2018-05-17net/mlx5e: Offload TC eswitch rules for VFs belonging to different PFsRabie Loulou1-1/+16
When the merged eswitch capability is supported, allow offloading rules between VFs which belong to different PFs (and hence have different eswitch affinity). Signed-off-by: Rabie Loulou <rabiel@mellanox.com> Reviewed-by: Or Gerlitz <ogerlitz@mellanox.com> Reviewed-by: Roi Dayan <roid@mellanox.com> Reviewed-by: Shahar Klein <shahark@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2018-05-17Merge tag 'mlx5-updates-2018-05-17' of ↵Saeed Mahameed11-17/+62
git://git.kernel.org/pub/scm/linux/kernel/git/mellanox/linux mlx5-updates-2018-05-17 mlx5 core dirver updates for both net-next and rdma-next branches. From Christophe JAILLET, first three patches to use kvfree where needed. From: Or Gerlitz <ogerlitz@mellanox.com> Next six patches from Roi and Co adds support for merged sriov e-switch which comes to serve cases where both PFs, VFs set on them and both uplinks are to be used in single v-switch SW model. When merged e-switch is supported, the per-port e-switch is logically merged into one e-switch that spans both physical ports and all the VFs. This model allows to offload TC eswitch rules between VFs belonging to different PFs (and hence have different eswitch affinity), it also sets the some of the foundations needed for uplink LAG support. Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2018-05-17net/mlx5e: Explicitly set source e-switch in offloaded TC rulesShahar Klein3-0/+10
Set a specific source e-switch when setting a rule that matches on the ingress port. Signed-off-by: Shahar Klein <shahark@mellanox.com> Reviewed-by: Or Gerlitz <ogerlitz@mellanox.com> Reviewed-by: Roi Dayan <roid@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2018-05-17net/mlx5: Add source e-switch ownerShahar Klein2-2/+14
The source e-switch owner allows a vport on one e-switch port be associated with a rule defined on the second port e-switch. The role of the source eswitch owner valid bit in the flow group is to allow the firmware fail driver attempts to wild card the source eswitch match field. If this bit is not set, the firmware ignores the source eswitch owner field totally. Signed-off-by: Shahar Klein <shahark@mellanox.com> Reviewed-by: Or Gerlitz <ogerlitz@mellanox.com> Reviewed-by: Roi Dayan <roid@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2018-05-17net/mlx5e: Explicitly set destination e-switch in FDB rulesRabie Loulou3-0/+8
Set a specific destination e-switch when setting a destination vport. Signed-off-by: Rabie Loulou <rabiel@mellanox.com> Reviewed-by: Or Gerlitz <ogerlitz@mellanox.com> Reviewed-by: Roi Dayan <roid@mellanox.com> Reviewed-by: Shahar Klein <shahark@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2018-05-17net/mlx5: Add destination e-switch ownerShahar Klein7-10/+21
The destination e-switch owner allows a rule in namespace of one e-switch owner to point to a vport that is natively associated with another e-switch owner. Signed-off-by: Shahar Klein <shahark@mellanox.com> Reviewed-by: Or Gerlitz <ogerlitz@mellanox.com> Reviewed-by: Roi Dayan <roid@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2018-05-17net/mlx5: Properly handle a vport destination when setting FTEShahar Klein1-0/+3
When creating FTE, properly distinguish between destination being vport or tir. The previous code just worked accidentally b/c of both dest being in the same offset within a union. Signed-off-by: Shahar Klein <shahark@mellanox.com> Reviewed-by: Or Gerlitz <ogerlitz@mellanox.com> Reviewed-by: Roi Dayan <roid@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2018-05-17net/mlx5: Add merged e-switch capRoi Dayan1-1/+2
When merged e-switch is supported, the per-port e-switch is logically merged into one e-switch that spans both physical ports and all the VFs. Under merged eswitch, both the matching on source vport and setting destination vport can have a 2nd attribute which is the vhca id of the eswitch owner. For example: esw0: {match: <src vport=1 owner=0> action: fwd to <dst vport=7, owner=1>} is a flow set on eswitch0 matching on source vport=1 from his eswitch and the action being fwd to dest vport=7 of eswitch1. Signed-off-by: Roi Dayan <roid@mellanox.com> Reviewed-by: Shahar Klein <shahark@mellanox.com> Reviewed-by: Or Gerlitz Klein <ogerlitz@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2018-05-17Merge branch 'net-Allow-more-drivers-with-COMPILE_TEST'David S. Miller5-10/+10
Florian Fainelli says: ==================== net: Allow more drivers with COMPILE_TEST This patch series includes more drivers to be build tested with COMPILE_TEST enabled. This helps cover some of the issues I just ran into with missing a driver *sigh*. Chanves in v3: - drop the TI Keystone NETCP driver from the COMPILE_TEST additions Changes in v2: - allow FEC to build outside of CONFIG_ARM/ARM64 by defining a layout of registers, this is not meant to run, so this is not a real issue if we are not matching the correct register layout ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2018-05-17net: phy: Allow MDIO_MOXART and MDIO_SUN4I with COMPILE_TESTFlorian Fainelli1-2/+2
Those drivers build just fine with COMPILE_TEST, so make that possible. Signed-off-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-05-17net: ethernet: freescale: Allow FEC with COMPILE_TESTFlorian Fainelli3-3/+3
The Freescale FEC driver builds fine with COMPILE_TEST, so make that possible. Signed-off-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-05-17net: ethernet: ti: Allow most drivers with COMPILE_TESTFlorian Fainelli1-5/+5
Most of the TI drivers build just fine with COMPILE_TEST, cpmac (AR7) is the exception because it uses a header file from arch/mips/include/asm/mach-ar7/ar7.h and keystone netcp which requires help from drivers/soc/ti/ for queue management helpers. Signed-off-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-05-17vlan: Add extack messages for link createDavid Ahern3-15/+44
Add informative messages for error paths related to adding a VLAN to a device. Signed-off-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-05-17qede: Add build_skb() support.Manish Chopra4-174/+137
This patch makes use of build_skb() throughout in driver's receieve data path [HW gro flow and non HW gro flow]. With this, driver can build skb directly from the page segments which are already mapped to the hardware instead of allocating new SKB via netdev_alloc_skb() and memcpy the data which is quite costly. This really improves performance (keeping same or slight gain in rx throughput) in terms of CPU utilization which is significantly reduced [almost half] in non HW gro flow where for every incoming MTU sized packet driver had to allocate skb, memcpy headers etc. Additionally in that flow, it also gets rid of bunch of additional overheads [eth_get_headlen() etc.] to split headers and data in the skb. Tested with: system: 2 sockets, 4 cores per socket, hyperthreading, 2x4x2=16 cores iperf [server]: iperf -s iperf [client]: iperf -c <server_ip> -t 500 -i 10 -P 32 HW GRO off – w/o build_skb(), throughput: 36.8 Gbits/sec Average: CPU %usr %nice %sys %iowait %irq %soft %steal %guest %idle Average: all 0.59 0.00 32.93 0.00 0.00 43.07 0.00 0.00 23.42 HW GRO off - with build_skb(), throughput: 36.9 Gbits/sec Average: CPU %usr %nice %sys %iowait %irq %soft %steal %guest %idle Average: all 0.70 0.00 31.70 0.00 0.00 25.68 0.00 0.00 41.92 HW GRO on - w/o build_skb(), throughput: 36.9 Gbits/sec Average: CPU %usr %nice %sys %iowait %irq %soft %steal %guest %idle Average: all 0.86 0.00 24.14 0.00 0.00 6.59 0.00 0.00 68.41 HW GRO on - with build_skb(), throughput: 37.5 Gbits/sec Average: CPU %usr %nice %sys %iowait %irq %soft %steal %guest %idle Average: all 0.87 0.00 23.75 0.00 0.00 6.19 0.00 0.00 69.19 Signed-off-by: Ariel Elior <ariel.elior@cavium.com> Signed-off-by: Manish Chopra <manish.chopra@cavium.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-05-17Merge branch '10GbE' of ↵David S. Miller11-61/+82
git://git.kernel.org/pub/scm/linux/kernel/git/jkirsher/next-queue Jeff Kirsher says: ==================== 10GbE Intel Wired LAN Driver Updates 2018-05-17 This series contains updates to ixgbe, ixgbevf and ice drivers. Cathy Zhou resolves sparse warnings by using the force attribute. Mauro S M Rodrigues fixes a bug where IRQs were not freed if a PCI error recovery system opts to remove the device which causes ixgbe_io_error_detected() to return PCI_ERS_RESULT_DISCONNECT before calling ixgbe_close_suspend() which results in IRQs not freed and crashing when the remove handler calls pci_disable_device(). Resolved this by calling ixgbe_close_suspend() before evaluating the PCI channel state. Pavel Tatashin releases the rtnl_lock during the call to ixgbe_close_suspend() to allow scaling if device_shutdown() is multi-threaded. Emil modifies ixgbe to not validate the MAC address during a reset, unless the MAC was set on the host so that the VF will get a new MAC address every time it reloads. Also updates ixgbevf to set hw->mac.perm_addr in order to retain the custom MAC on a reset. Anirudh updates the ice NVM read/erase/update AQ commands to align with the latest specification. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2018-05-17tc-testing: fixed copy-pasting error in ife testsRoman Mashak1-14/+14
Reported-by: Vlad Buslov <vladbu@mellanox.com> Reported-by: Davide Caratti <dcaratti@redhat.com> Signed-off-by: Roman Mashak <mrv@mojatatu.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-05-17net/ncsi: prevent a couple array underflowsDan Carpenter1-2/+3
We recently refactored this code and introduced a static checker warning. Smatch complains that if cmd->index is zero then we would underflow the arrays. That's obviously true. The question is whether we prevent cmd->index from being zero at a different level. I've looked at the code and I don't immediately see a check for that. Fixes: 062b3e1b6d4f ("net/ncsi: Refactor MAC, VLAN filters") Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-05-17net/smc: init conn.tx_work & conn.send_lock soonerEric Dumazet3-3/+4
syzkaller found that following program crashes the host : { int fd = socket(AF_SMC, SOCK_STREAM, 0); int val = 1; listen(fd, 0); shutdown(fd, SHUT_RDWR); setsockopt(fd, 6, TCP_NODELAY, &val, 4); } Simply initialize conn.tx_work & conn.send_lock at socket creation, rather than deeper in the stack. ODEBUG: assert_init not available (active state 0) object type: timer_list hint: (null) WARNING: CPU: 1 PID: 13988 at lib/debugobjects.c:329 debug_print_object+0x16a/0x210 lib/debugobjects.c:326 Kernel panic - not syncing: panic_on_warn set ... CPU: 1 PID: 13988 Comm: syz-executor0 Not tainted 4.17.0-rc4+ #46 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 Call Trace: __dump_stack lib/dump_stack.c:77 [inline] dump_stack+0x1b9/0x294 lib/dump_stack.c:113 panic+0x22f/0x4de kernel/panic.c:184 __warn.cold.8+0x163/0x1b3 kernel/panic.c:536 report_bug+0x252/0x2d0 lib/bug.c:186 fixup_bug arch/x86/kernel/traps.c:178 [inline] do_error_trap+0x1de/0x490 arch/x86/kernel/traps.c:296 do_invalid_op+0x1b/0x20 arch/x86/kernel/traps.c:315 invalid_op+0x14/0x20 arch/x86/entry/entry_64.S:992 RIP: 0010:debug_print_object+0x16a/0x210 lib/debugobjects.c:326 RSP: 0018:ffff880197a37880 EFLAGS: 00010086 RAX: 0000000000000061 RBX: 0000000000000005 RCX: ffffc90001ed0000 RDX: 0000000000004aaf RSI: ffffffff8160f6f1 RDI: 0000000000000001 RBP: ffff880197a378c0 R08: ffff8801aa7a0080 R09: ffffed003b5e3eb2 R10: ffffed003b5e3eb2 R11: ffff8801daf1f597 R12: 0000000000000001 R13: ffffffff88d96980 R14: ffffffff87fa19a0 R15: ffffffff81666ec0 debug_object_assert_init+0x309/0x500 lib/debugobjects.c:692 debug_timer_assert_init kernel/time/timer.c:724 [inline] debug_assert_init kernel/time/timer.c:776 [inline] del_timer+0x74/0x140 kernel/time/timer.c:1198 try_to_grab_pending+0x439/0x9a0 kernel/workqueue.c:1223 mod_delayed_work_on+0x91/0x250 kernel/workqueue.c:1592 mod_delayed_work include/linux/workqueue.h:541 [inline] smc_setsockopt+0x387/0x6d0 net/smc/af_smc.c:1367 __sys_setsockopt+0x1bd/0x390 net/socket.c:1903 __do_sys_setsockopt net/socket.c:1914 [inline] __se_sys_setsockopt net/socket.c:1911 [inline] __x64_sys_setsockopt+0xbe/0x150 net/socket.c:1911 do_syscall_64+0x1b1/0x800 arch/x86/entry/common.c:287 entry_SYSCALL_64_after_hwframe+0x49/0xbe Fixes: 01d2f7e2cdd3 ("net/smc: sockopts TCP_NODELAY and TCP_CORK") Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Ursula Braun <ubraun@linux.ibm.com> Cc: linux-s390@vger.kernel.org Reported-by: syzbot <syzkaller@googlegroups.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-05-17nfp: flower: fix error path during representor creationJiri Pirko3-4/+19
Don't store repr pointer to reprs array until the representor is successfully created. This avoids message about "representor destruction" even when it was never created. Also it cleans-up the flow. Also, check return value after port alloc. Signed-off-by: Jiri Pirko <jiri@mellanox.com> Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-05-17Merge branch 'mvpp2-small-improvements'David S. Miller1-41/+18
Antoine Tenart says: ==================== net: mvpp2: small improvements Those 3 patches are small improvements to the Marvell PPv2 driver. The series does not conflict with the one sent about phylink and 1000/2500baseX support, so the two series can live in parallel. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2018-05-17net: mvpp2: print rx error with rate-limitYan Markman1-6/+8
Prevent flood of RX error prints during heavy traffic with weak signal in link by checking net_ratelimit() before using netdev_err(). Signed-off-by: Yan Markman <ymarkman@marvell.com> [Antoine: small rework, commit message] Signed-off-by: Antoine Tenart <antoine.tenart@bootlin.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-05-17net: mvpp2: set mac address does not require the stop/start sequenceYan Markman1-31/+7
Remove special stop/start handling from the set_mac_address callback. All this special care is not needed, and can be removed. It also simplifies the up/down status in the driver and helps avoiding possible link status mismatch issues. Signed-off-by: Yan Markman <ymarkman@marvell.com> [Antoine: commit message] Signed-off-by: Antoine Tenart <antoine.tenart@bootlin.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-05-17net: mvpp2: avoid checking for free aggregated descriptors twiceYan Markman1-4/+3
Avoid repeating the check for free aggregated descriptors when it already failed at the beginning of the function. Signed-off-by: Yan Markman <ymarkman@marvell.com> [Antoine: commit message] Signed-off-by: Antoine Tenart <antoine.tenart@bootlin.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-05-17Merge branch 'mvpp2-phylink-conversion'David S. Miller4-355/+595
Antoine Tenart says: ==================== net: mvpp2: phylink conversion This series convert the Marvell PPv2 driver to phylink (models the MAC to PHY link). One important point is the PPv2 driver supports two probe modes: device tree and ACPI. This series only brings phylink support for the device tree mode, as the ACPI one will need further work. Still, the driver should be working as before when using ACPI. This split should be temporary, and was discussed with Marcin (in Cc.) who added ACPI support to the driver. Also as the SFP cages on both DB boards can be considered as non-wired. We thus chose not to describe those SFP cages and we use fixed-link. The rest of the series uses phylink to add support for 1000BaseX and 2500BaseX modes in the PPv2 driver. To do this, two patches are needed in the common PHY framework (patches 3 and 4). The last 4 patches modify the device tree to use the new PPv2 functionalities. The series has been tested for the device tree mode on the 7040-db, 8040-db and 8040-mcbin boards, to ensure all the interface where working as expected. @Dave: patches 7 to 10 should go through the mvebu tree (Gregory in Cc.) to avoid any conflict with the other mvebu dt patches taken during this cycle. The series is based on today's net-next. Since v2: - Removed the SFP description from the DB boards, as their SFP cages are wired properly. We now use fixed-link. - Because of this rework, split the series in two, so that the SFP part is reviewed separately. - Small fixes in the phylink patch. - Rebased on the latest net-next branch. Since v1: - Chose a different approach to the SFP changes, as the previous ones weren't valid and reworked both BD boards device trees. - Misc fixes. - Added Kishon's acked-by on one patch. - Rebaed on latest net-next branch. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2018-05-17net: mvpp2: 2500baseX supportAntoine Tenart1-12/+39
This patch adds the 2500Base-X PHY mode support in the Marvell PPv2 driver. 2500Base-X is quite close to 1000Base-X and SGMII modes and uses nearly the same code path. Signed-off-by: Antoine Tenart <antoine.tenart@bootlin.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-05-17net: mvpp2: 1000baseX supportAntoine Tenart1-21/+51
This patch adds the 1000Base-X PHY mode support in the Marvell PPv2 driver. 1000Base-X is quite close the SGMII and uses nearly the same code path. Signed-off-by: Antoine Tenart <antoine.tenart@bootlin.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-05-17phy: cp110-comphy: 2.5G SGMII modeAntoine Tenart1-3/+14
This patch allow the CP110 comphy to configure some lanes in the 2.5G SGMII mode. This mode is quite close to SGMII and uses nearly the same code path. Signed-off-by: Antoine Tenart <antoine.tenart@bootlin.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-05-17phy: add 2.5G SGMII mode to the phy_mode enumAntoine Tenart1-0/+1
This patch adds one more generic PHY mode to the phy_mode enum, to allow configuring generic PHYs to the 2.5G SGMII mode by using the set_mode callback. Signed-off-by: Antoine Tenart <antoine.tenart@bootlin.com> Acked-by: Kishon Vijay Abraham I <kishon@ti.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-05-17net: mvpp2: phylink supportAntoine Tenart2-338/+509
Convert the PPv2 driver to implement phylink helpers, and use phylink in DT mode. The other mode supported is ACPI, which will need further work in order to be entirely compatible with phylink. The MAC and GoP configuration functions were completely moved to fit into the phylink helpers. When a PHY is always present between the MAC and the physical port, phylink only is used, but when this is not the case (the MAC directly is connected to the physical port) the link IRQ is used to detect changes in the link state and call phylink_mac_change. The ACPI mode do not uses phylink as of now, and the changes shouldn't impact its use. Signed-off-by: Antoine Tenart <antoine.tenart@bootlin.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-05-17net: mvpp2: align the ethtool ops definitionAntoine Tenart1-12/+12
Cosmetic patch to align the ethtool functions to ops definitions. This patch does not change in any way the driver's behaviour. Signed-off-by: Antoine Tenart <antoine.tenart@bootlin.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-05-17Merge tag 'wireless-drivers-next-for-davem-2018-05-17' of ↵David S. Miller179-1480/+5580
git://git.kernel.org/pub/scm/linux/kernel/git/kvalo/wireless-drivers-next Kalle Valo says: ==================== wireless-drivers-next patches for 4.18 The first pull request for 4.18. As usual new features and bug fixes but nothing really special. I also merged wireless-drivers due to an iwlwifi patch dependency. Major changes: iwlwifi * implement Traffic Condition Monitor and use it for scan, BT coex and to detect when the AP doesn't support UAPSD properly * some more work for the 22000 family of devices; * introduce AMSDU rate control offload qtnfmac * DFS offload support rsi * roaming enhancements * increase max supported aggregation subframes * don't advertise 5 GHz support if the device doesn't support it brcmfmac * add support for BCM4366E chipset * add support for bcm43364 wireless chipset ath10k * enable temperature reads for QCA6174 and QCA9377 * add firmware memory dump support for QCA9984 * continue adding WCN3990 support via SNOC bus ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2018-05-17vmxnet3: Replace msleep(1) with usleep_range()YueHaibing2-4/+4
As documented in Documentation/timers/timers-howto.txt, replace msleep(1) with usleep_range(). Signed-off-by: YueHaibing <yuehaibing@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-05-17bonding: introduce link change helperTonghao Zhang1-20/+20
Introduce an new common helper to avoid redundancy. Signed-off-by: Tonghao Zhang <xiangxia.m.yue@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-05-17Merge branch 'tcp-default-RACK-loss-recovery'David S. Miller4-64/+124
Yuchung Cheng says: ==================== tcp: default RACK loss recovery This patch set implements the features correspond to the draft-ietf-tcpm-rack-03 version of the RACK draft. https://datatracker.ietf.org/meeting/101/materials/slides-101-tcpm-update-on-tcp-rack-00 1. SACK: implement equivalent DUPACK threshold heuristic in RACK to replace existing RFC6675 recovery (tcp_mark_head_lost). 2. Non-SACK: simplify RFC6582 NewReno implementation 3. RTO: apply RACK's time-based approach to avoid spuriouly marking very recently sent packets lost. 4. with (1)(2)(3), make RACK the exclusive fast recovery mechanism to mark losses based on time on S/ACK. Tail loss probe and F-RTO remain enabled by default as complementary mechanisms to send probes in CA_Open and CA_Loss states. The probes would solicit S/ACKs to trigger RACK time-based loss detection. All Google web and internal servers have been running RACK-only mode (4) for a while now. a/b experiments indicate RACK/TLP on average reduces recovery latency by 10% compared to RFC6675. RFC6675 is default-off now but can be enabled by disabling RACK (sysctl net.ipv4.tcp_recovery=0) for unseen issues. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2018-05-17tcp: don't mark recently sent packets lost on RTOYuchung Cheng1-4/+8
An RTO event indicates the head has not been acked for a long time after its last (re)transmission. But the other packets are not necessarily lost if they have been only sent recently (for example due to application limit). This patch would prohibit marking packets sent within an RTT to be lost on RTO event, using similar logic in TCP RACK detection. Normally the head (SND.UNA) would be marked lost since RTO should fire strictly after the head was sent. An exception is when the most recent RACK RTT measurement is larger than the (previous) RTO. To address this exception the head is always marked lost. Congestion control interaction: since we may not mark every packet lost, the congestion window may be more than 1 (inflight plus 1). But only one packet will be retransmitted after RTO, since tcp_retransmit_timer() calls tcp_retransmit_skb(...,segs=1). The connection still performs slow start from one packet (with Cubic congestion control). This commit was tested in an A/B test with Google web servers, and showed a reduction of 2% in (spurious) retransmits post timeout (SlowStartRetrans), and correspondingly reduced DSACKs (DSACKIgnoredOld) by 7%. Signed-off-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: Neal Cardwell <ncardwell@google.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Soheil Hassas Yeganeh <soheil@google.com> Reviewed-by: Priyaranjan Jha <priyarjha@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-05-17tcp: new helper tcp_rack_skb_timeoutYuchung Cheng3-7/+14
Create and export a new helper tcp_rack_skb_timeout and move tcp_is_rack to prepare the final RTO change. Signed-off-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: Neal Cardwell <ncardwell@google.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Soheil Hassas Yeganeh <soheil@google.com> Reviewed-by: Priyaranjan Jha <priyarjha@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-05-17tcp: separate loss marking and state update on RTOYuchung Cheng1-2/+2
Previously when TCP times out, it first updates cwnd and ssthresh, marks packets lost, and then updates congestion state again. This was fine because everything not yet delivered is marked lost, so the inflight is always 0 and cwnd can be safely set to 1 to retransmit one packet on timeout. But the inflight may not always be 0 on timeout if TCP changes to mark packets lost based on packet sent time. Therefore we must first mark the packet lost, then set the cwnd based on the (updated) inflight. This is not a pure refactor. Congestion control may potentially break if it uses (not yet updated) inflight to compute ssthresh. Fortunately all existing congestion control modules does not do that. Also it changes the inflight when CA_LOSS_EVENT is called, and only westwood processes such an event but does not use inflight. This change has two other minor side benefits: 1) consistent with Fast Recovery s.t. the inflight is updated first before tcp_enter_recovery flips state to CA_Recovery. 2) avoid intertwining loss marking with state update, making the code more readable. Signed-off-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: Neal Cardwell <ncardwell@google.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Soheil Hassas Yeganeh <soheil@google.com> Reviewed-by: Priyaranjan Jha <priyarjha@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-05-17tcp: new helper tcp_timeout_mark_lostYuchung Cheng1-21/+29
Refactor using a new helper, tcp_timeout_mark_loss(), that marks packets lost upon RTO. Signed-off-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: Neal Cardwell <ncardwell@google.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Soheil Hassas Yeganeh <soheil@google.com> Reviewed-by: Priyaranjan Jha <priyarjha@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-05-17tcp: account lost retransmit after timeoutYuchung Cheng3-17/+6
The previous approach for the lost and retransmit bits was to wipe the slate clean: zero all the lost and retransmit bits, correspondingly zero the lost_out and retrans_out counters, and then add back the lost bits (and correspondingly increment lost_out). The new approach is to treat this very much like marking packets lost in fast recovery. We don’t wipe the slate clean. We just say that for all packets that were not yet marked sacked or lost, we now mark them as lost in exactly the same way we do for fast recovery. This fixes the lost retransmit accounting at RTO time and greatly simplifies the RTO code by sharing much of the logic with Fast Recovery. Signed-off-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: Neal Cardwell <ncardwell@google.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Soheil Hassas Yeganeh <soheil@google.com> Reviewed-by: Priyaranjan Jha <priyarjha@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-05-17tcp: simpler NewReno implementationYuchung Cheng3-8/+39
This is a rewrite of NewReno loss recovery implementation that is simpler and standalone for readability and better performance by using less states. Note that NewReno refers to RFC6582 as a modification to the fast recovery algorithm. It is used only if the connection does not support SACK in Linux. It should not to be confused with the Reno (AIMD) congestion control. Signed-off-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: Neal Cardwell <ncardwell@google.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Soheil Hassas Yeganeh <soheil@google.com> Reviewed-by: Priyaranjan Jha <priyarjha@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-05-17tcp: disable RFC6675 loss detectionYuchung Cheng2-5/+10
This patch disables RFC6675 loss detection and make sysctl net.ipv4.tcp_recovery = 1 controls a binary choice between RACK (1) or RFC6675 (0). Signed-off-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: Neal Cardwell <ncardwell@google.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Soheil Hassas Yeganeh <soheil@google.com> Reviewed-by: Priyaranjan Jha <priyarjha@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-05-17tcp: support DUPACK threshold in RACKYuchung Cheng3-13/+29
This patch adds support for the classic DUPACK threshold rule (#DupThresh) in RACK. When the number of packets SACKed is greater or equal to the threshold, RACK sets the reordering window to zero which would immediately mark all the unsacked packets below the highest SACKed sequence lost. Since this approach is known to not work well with reordering, RACK only uses it if no reordering has been observed. The DUPACK threshold rule is a particularly useful extension to the fast recoveries triggered by RACK reordering timer. For example data-center transfers where the RTT is much smaller than a timer tick, or high RTT path where the default RTT/4 may take too long. Note that this patch differs slightly from RFC6675. RFC6675 considers a packet lost when at least #DupThresh higher-sequence packets are SACKed. With RACK, for connections that have seen reordering, RACK continues to use a dynamically-adaptive time-based reordering window to detect losses. But for connections on which we have not yet seen reordering, this patch considers a packet lost when at least one higher sequence packet is SACKed and the total number of SACKed packets is at least DupThresh. For example, suppose a connection has not seen reordering, and sends 10 packets, and packets 3, 5, 7 are SACKed. RFC6675 considers packets 1 and 2 lost. RACK considers packets 1, 2, 4, 6 lost. There is some small risk of spurious retransmits here due to reordering. However, this is mostly limited to the first flight of a connection on which the sender receives SACKs from reordering. And RFC 6675 and FACK loss detection have a similar risk on the first flight with reordering (it's just that the risk of spurious retransmits from reordering was slightly narrower for those older algorithms due to the margin of 3*MSS). Also the minimum reordering window is reduced from 1 msec to 0 to recover quicker on short RTT transfers. Therefore RACK is more aggressive in marking packets lost during recovery to reduce the reordering window timeouts. Signed-off-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: Neal Cardwell <ncardwell@google.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Soheil Hassas Yeganeh <soheil@google.com> Reviewed-by: Priyaranjan Jha <priyarjha@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-05-17net: ethernet: ti: cpsw: disable mq feature for "AM33xx ES1.0" devicesIvan Khoronzhuk1-49/+60
The early versions of am33xx devices, related to ES1.0 SoC revision have errata limiting mq support. That's the same errata as commit 7da1160002f1 ("drivers: net: cpsw: add am335x errata workarround for interrutps") AM33xx Errata [1] Advisory 1.0.9 http://www.ti.com/lit/er/sprz360f/sprz360f.pdf After additional investigation were found that drivers w/a is propagated on all AM33xx SoCs and on DM814x. But the errata exists only for ES1.0 of AM33xx family, limiting mq support for revisions after ES1.0. So, disable mq support only for related SoCs and use separate polls for revisions allowing mq. Signed-off-by: Ivan Khoronzhuk <ivan.khoronzhuk@linaro.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-05-17Merge branch 'sched-refactor-NOLOCK-qdiscs'David S. Miller3-7/+23
Paolo Abeni says: ==================== sched: refactor NOLOCK qdiscs With the introduction of NOLOCK qdiscs, pfifo_fast performances in the uncontended scenario degraded measurably, especially after the commit eb82a9944792 ("net: sched, fix OOO packets with pfifo_fast"). This series restore the pfifo_fast performances in such scenario back the previous level, mainly reducing the number of atomic operations required to perform the qdisc_run() call. Even performances in the contended scenario increase measurably. Note: This series is on top of: sched: manipulate __QDISC_STATE_RUNNING in qdisc_run_* helpers ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2018-05-17pfifo_fast: drop unneeded additional lock on dequeuePaolo Abeni2-2/+7
After the previous patch, for NOLOCK qdiscs, q->seqlock is always held when the dequeue() is invoked, we can drop any additional locking to protect such operation. Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-05-17sched: replace __QDISC_STATE_RUNNING bit with a spin lockPaolo Abeni2-5/+16
So that we can use lockdep on it. The newly introduced sequence lock has the same scope of busylock, so it shares the same lockdep annotation, but it's only used for NOLOCK qdiscs. With this changeset we acquire such lock in the control path around flushing operation (qdisc reset), to allow more NOLOCK qdisc perf improvement in the next patch. Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-05-17ice: Update NVM AQ command functionsAnirudh Venkataramanan2-9/+11
This patch updates the NVM read/erase/update AQ commands to align with the latest specification. Signed-off-by: Anirudh Venkataramanan <anirudh.venkataramanan@intel.com> Tested-by: Tony Brelinski <tonyx.brelinski@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
2018-05-17ixgbevf: fix MAC address changes through ixgbevf_set_mac()Emil Tantilov1-0/+1
Set hw->mac.perm_addr in ixgbevf_set_mac() in order to avoid losing the custom MAC on reset. This can happen in the following case: >ip link set $vf address $mac >ethtool -r $vf Signed-off-by: Emil Tantilov <emil.s.tantilov@intel.com> Tested-by: Andrew Bowers <andrewx.bowers@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>