summaryrefslogtreecommitdiffstats
AgeCommit message (Collapse)AuthorFilesLines
2019-05-01net: mvpp2: cls: Add Classification offload supportMaxime Chevallier4-12/+410
This commit introduces basic classification offloading support for the PPv2 controller. The PPv2 classifier has many classification engines, for now we only use the C2 TCAM match engine. This engine allows to perform ternary lookups on 64 bits keys (called Header Extracted Key), that are built by extracting fields from the packet header and concatenating them. At most 4 fields can be extracted for a single lookup. This basic implementation allows to build the HEK from the following fields : - L4 source and destination ports (for UDP and TCP) More fields are to be added in the future. Classification flows are added through the ethtool interface, using the newly introduced flow_rule infrastructure as an internal rule representation, allowing to more easily implement tc flower rules if need be. The internal design for now allocates one range of 4 rules per port due to the internal design of the flow table, which uses 22 sub-flows. When inserting a classification rule, the rule is created in every relevant sub-flow. This low rule-count is a very simple design which reaches quickly the limitations of the flow table ordering, but guarantees that the rule ordering will always be respected. This commit only introduces support for the "steer to rxq" action. Signed-off-by: Maxime Chevallier <maxime.chevallier@bootlin.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-05-01net: mvpp2: cls: Use a bitfield to represent the flow_typeMaxime Chevallier2-69/+109
As of today, the classification code is used only for RSS. We split the incoming traffic into multiple flows, that correspond to the ethtool flow_type parameter. We don't want to use the ethtool flow definitions such as TCP_V4_FLOW, for several reason : - We want to decorrelate the driver code from ethtool as much as possible, so that we can easily use other interfaces such as tc flower, - We want the flow_type to be a bitfield, so that we can match flows embedded into each other, such as TCP4 which is a subset of IP4. This commit does the conversion to the newer type. Signed-off-by: Maxime Chevallier <maxime.chevallier@bootlin.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-05-01net: mvpp2: cls: Remove extra whitespace in mvpp2_cls_flow_writeMaxime Chevallier1-3/+3
Cosmetic patch removing extra whitespaces when writing the flow_table entries Signed-off-by: Maxime Chevallier <maxime.chevallier@bootlin.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-05-01Merge branch 'mlx5-next' of ↵Saeed Mahameed28-247/+627
git://git.kernel.org/pub/scm/linux/kernel/git/mellanox/linux This merge commit includes some misc shared code updates from mlx5-next branch needed for net-next. 1) From Aya: Enable general events on all physical link types and restrict general event handling of subtype DELAY_DROP_TIMEOUT in mlx5 rdma driver to ethernet links only as it was intended. 2) From Eli: Introduce low level bits for prio tag mode 3) From Maor: Low level steering updates to support RDMA RX flow steering and enables RoCE loopback traffic when switchdev is enabled. 4) From Vu and Parav: Two small mlx5 core cleanups 5) From Yevgeny add HW definitions of geneve offloads Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2019-05-01Merge tag 'arc-5.1-final' of ↵Linus Torvalds3-23/+25
git://git.kernel.org/pub/scm/linux/kernel/git/vgupta/arc Pull ARC fixes from Vineet Gupta: "A few minor fixes for ARC. - regression in memset if line size !64 - avoid panic if PAE and IOC" * tag 'arc-5.1-final' of git://git.kernel.org/pub/scm/linux/kernel/git/vgupta/arc: ARC: memset: fix build with L1_CACHE_SHIFT != 6 ARC: [hsdk] Make it easier to add PAE40 region to DTB ARC: PAE40: don't panic and instead turn off hw ioc
2019-05-01net/mlx5: Fix broken hca cap offsetSaeed Mahameed1-2/+2
The cited commit broke the offsets of hca cap struct, fix it. While at it, cleanup a white space introduced by the same commit. Fixes: b169e64a2444 ("net/mlx5: Geneve, Add flow table capabilities for Geneve decap with TLV options") Reported-by: Qian Cai <cai@lca.pw> Cc: Yevgeny Kliteynik <kliteyn@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2019-05-01PCI/portdrv: Use shared MSI/MSI-X vector for Bandwidth ManagementAlex Williamson1-1/+2
The Interrupt Message Number in the PCIe Capabilities register (PCIe r4.0, sec 7.5.3.2) indicates which MSI/MSI-X vector is shared by interrupts related to the PCIe Capability, including Link Bandwidth Management and Link Autonomous Bandwidth Interrupts (Link Control, 7.5.3.7), Command Completed and Hot-Plug Interrupts (Slot Control, 7.5.3.10), and the PME Interrupt (Root Control, 7.5.3.12). pcie_message_numbers() checked whether we want to enable PME or Hot-Plug interrupts but neglected to check for Link Bandwidth Management, so if we only wanted the Bandwidth Management interrupts, it decided we didn't need any vectors at all. Then pcie_port_enable_irq_vec() tried to reallocate zero vectors, which failed, resulting in fallback to INTx. On some systems, e.g., an X79-based workstation, that INTx seems broken or not handled correctly, so we got spurious IRQ16 interrupts for Bandwidth Management events. Change pcie_message_numbers() so that if we want Link Bandwidth Management interrupts, we use the shared MSI/MSI-X vector from the PCIe Capabilities register. Fixes: e8303bb7a75c ("PCI/LINK: Report degraded links via link bandwidth notification") Link: https://lore.kernel.org/lkml/155597243666.19387.1205950870601742062.stgit@gimli.home Signed-off-by: Alex Williamson <alex.williamson@redhat.com> [bhelgaas: changelog] Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
2019-05-01Merge tag 'acpi-5.1-rc8' of ↵Linus Torvalds1-5/+1
git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm Pull ACPI fix from Rafael Wysocki: "Revert a recent ACPICA change that caused initialization to fail on systems with Thunderbolt docking stations connected at the init time" * tag 'acpi-5.1-rc8' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: Revert "ACPICA: Clear status of GPEs before enabling them"
2019-05-01gcc-9: don't warn about uninitialized btrfs extent_type variableLinus Torvalds1-1/+1
The 'extent_type' variable does seem to be reliably initialized, but it's _very_ non-obvious, since there's a "goto next" case that jumps over the normal initialization. That will then always trigger the "start >= extent_end" test, which will end up never falling through to the use of that variable. But the code is certainly not obvious, and the compiler warning looks reasonable. Make 'extent_type' an int, and initialize it to an invalid negative value, which seems to be the common pattern in other places. Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-05-01Merge branch 'net-ll_temac-x86_64-support'David S. Miller5-203/+432
Esben Haabendal says: ==================== net: ll_temac: x86_64 support This patch series adds support for use of ll_temac driver with platform_data configuration and fixes endianess and 64-bit problems so that it can be used on x86_64 platform. A few bugfixes are also included. Changes since v2: - Fixed lp->indirect_mutex initialization regression for OF platforms introduced in v2 Changes since v1: - Make indirect_mutex specification mandatory when using platform_data - Move header to include/linux/platform_data - Enable COMPILE_TEST for XILINX_LL_TEMAC - Rebased to v5.1-rc7 ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2019-05-01net: ll_temac: Enable DMA when ready, not beforeEsben Haabendal1-5/+10
As soon as TAILDESCR_PTR is written, DMA transfers might start. Let's ensure we are ready to receive DMA IRQ's before doing that. Signed-off-by: Esben Haabendal <esben@geanix.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-05-01net: ll_temac: Allow configuration of IRQ coalescingEsben Haabendal3-12/+37
This allows custom setup of IRQ coalescing for platforms using legacy platform_device. The irq timeout and count parameters can be used for tuning cpu load vs. latency. I have maintained the 0x00000400 bit in TX_CHNL_CTRL. It is specified as unused in the documentation I have available. It does not make any difference in the hardware I have available, so it is left in to not risk breaking other platforms where it might be used. Signed-off-by: Esben Haabendal <esben@geanix.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-05-01net: ll_temac: Replace bad usage of msleep() with usleep_range()Esben Haabendal1-1/+1
Use usleep_range() to avoid problems with msleep() actually sleeping much longer than expected. Signed-off-by: Esben Haabendal <esben@geanix.com> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-05-01net: ll_temac: Fix bug causing buffer descriptor overrunEsben Haabendal1-1/+1
As we are actually using a BD for both the skb and each frag contained in it, the oldest TX BD would be overwritten when there was exactly one BD less than needed. Signed-off-by: Esben Haabendal <esben@geanix.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-05-01net: ll_temac: Fix iommu/swiotlb leakEsben Haabendal1-1/+1
Unmap the actual buffer length, not the amount of data received, avoiding resource exhaustion of swiotlb (seen on x86_64 platform). Signed-off-by: Esben Haabendal <esben@geanix.com> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-05-01net: ll_temac: Support indirect_mutex share within TEMAC IPEsben Haabendal4-20/+43
Indirect register access goes through a DCR bus bridge, which allows only one outstanding transaction. And to make matters worse, each TEMAC IP block contains two Ethernet interfaces, and although they seem to have separate registers for indirect access, they actually share the registers. Or to be more specific, MSW, LSW and CTL registers are physically shared between Ethernet interfaces in same TEMAC IP, with RDY register being (almost) specificic to the Ethernet interface. The 0x10000 bit in RDY reflects combined bus ready state though. So we need to take care to synchronize not only within a single device, but also between devices in same TEMAC IP. This commit allows to do that with legacy platform devices. For OF devices, the xlnx,compound parent of the temac node should be used to find siblings, and setup a shared indirect_mutex between them. I will leave this work to somebody else, as I don't have hardware to test that. No regression is introduced by that, as before this commit using two Ethernet interfaces in same TEMAC block is simply broken. Signed-off-by: Esben Haabendal <esben@geanix.com> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-05-01net: ll_temac: Allow use on x86 platformsEsben Haabendal1-2/+2
With little-endian and 64-bit support in place, the ll_temac driver can now be used on x86 and x86_64 platforms. And while at it, enable COMPILE_TEST also. Signed-off-by: Esben Haabendal <esben@geanix.com> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-05-01net: ll_temac: Fix support for little-endian platformsEsben Haabendal1-39/+50
Both TEMAC and SDMA is big-endian, so make sure that all values in SDMA buffer descriptors (cmdac_bd) are handled as big-endian, independent of the host endianness. With all currently supported platforms being big-endian, this change does not make a change for any of them. Note, when using app3 and app4 for piggybacking skb pointers there is no need to care about endianness, as neither TEMAC nor SDMA access app3 and app4 in TX buffer descriptors. Signed-off-by: Esben Haabendal <esben@geanix.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-05-01net: ll_temac: Add support for non-native register endiannessEsben Haabendal3-22/+79
Replace the powerpc specific MMIO register access functions with the generic big-endian mmio access functions, and add support for little-endian access depending on configuration. Big-endian access is maintained as the default, but little-endian can be configured in device-tree binding or in platform data. The temac_ior()/temac_iow() functions are replaced with macro wrappers to avoid modifying existing code more than necessary. Signed-off-by: Esben Haabendal <esben@geanix.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-05-01net: ll_temac: Fix support for 64-bit platformsEsben Haabendal2-4/+32
The use of buffer descriptor APP4 field (32-bit) for storing skb pointer obviously does not work on 64-bit platforms. As APP3 is also unused, we can use that to store the other half of 64-bit pointer values. Contrary to what is hinted at in commit message of commit 15bfe05c8d63 ("net: ethernet: xilinx: Mark XILINX_LL_TEMAC broken on 64-bit") there are no other pointers stored in cdmac_bd. Signed-off-by: Esben Haabendal <esben@geanix.com> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-05-01net: ll_temac: Extend support to non-device-tree platformsEsben Haabendal4-66/+166
Support initialization with platdata, so the driver can be used on non-device-tree platforms. For currently supported device-tree platforms, the driver should behave as before. Signed-off-by: Esben Haabendal <esben@geanix.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-05-01net: ll_temac: Fix and simplify error handling by using devres functionsEsben Haabendal3-42/+22
As a side effect, a few error cases are fixed. If of_iomap() of sdma_regs failed, no error code was returned. Fixed to return -ENOMEM similar to of_iomap() fail of regs. If sysfs_create_group() or register_netdev() failed, lp->phy_node was not released. Finally, the order in remove function is corrected to be reverse order of what is done in probe, i.e. calling temac_mdio_teardown() last, so we unregister the netdev that most likely is using the mdio_bus first. Signed-off-by: Esben Haabendal <esben@geanix.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-05-01selftests: fib_rule_tests: print the result and return 1 if any tests failedHangbin Liu1-0/+6
Fixes: 65b2b4939a64 ("selftests: net: initial fib rule tests") Signed-off-by: Hangbin Liu <liuhangbin@gmail.com> Reviewed-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-05-01net: ethernet: ti: cpsw: Fix inconsistent IS_ERR and PTR_ERR in cpsw_probe()YueHaibing1-1/+1
Fix inconsistent IS_ERR and PTR_ERR in cpsw_probe, The proper pointer to use is clk instead of mode. This issue was detected with the help of Coccinelle. Fixes: 83a8471ba255 ("net: ethernet: ti: cpsw: refactor probe to group common hw initialization") Signed-off-by: YueHaibing <yuehaibing@huawei.com> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-05-01gcc-9: properly declare the {pv,hv}clock_page storageLinus Torvalds1-2/+2
The pvlock_page and hvclock_page variables are (as the name implies) addresses to pages, created by the linker script. But we declared them as just "extern u8" variables, which _works_, but now that gcc does some more bounds checking, it causes warnings like warning: array subscript 1 is outside array bounds of ‘u8[1]’ when we then access more than one byte from those variables. Fix this by simply making the declaration of the variables match reality, which makes the compiler happy too. Signed-off-by: Linus Torvalds <torvalds@-linux-foundation.org>
2019-05-01gcc-9: don't warn about uninitialized variableLinus Torvalds1-1/+1
I'm not sure what made gcc warn about this code now. The 'ret' variable does end up initialized in all cases, but it's definitely not obvious, so the compiler is quite reasonable to warn about this. So just add initialization to make it all much more obvious both to compilers and to humans. Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-05-01gcc-9: silence 'address-of-packed-member' warningLinus Torvalds1-1/+1
We already did this for clang, but now gcc has that warning too. Yes, yes, the address may be unaligned. And that's kind of the point. Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-05-01ipv4: ip_do_fragment: Preserve skb_iif during fragmentationShmulik Ladkani1-0/+1
Previously, during fragmentation after forwarding, skb->skb_iif isn't preserved, i.e. 'ip_copy_metadata' does not copy skb_iif from given 'from' skb. As a result, ip_do_fragment's creates fragments with zero skb_iif, leading to inconsistent behavior. Assume for example an eBPF program attached at tc egress (post forwarding) that examines __sk_buff->ingress_ifindex: - the correct iif is observed if forwarding path does not involve fragmentation/refragmentation - a bogus iif is observed if forwarding path involves fragmentation/refragmentatiom Fix, by preserving skb_iif during 'ip_copy_metadata'. Signed-off-by: Shmulik Ladkani <shmulik.ladkani@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-05-01io_uring: avoid page allocation warningsMark Rutland1-8/+9
In io_sqe_buffer_register() we allocate a number of arrays based on the iov_len from the user-provided iov. While we limit iov_len to SZ_1G, we can still attempt to allocate arrays exceeding MAX_ORDER. On a 64-bit system with 4KiB pages, for an iov where iov_base = 0x10 and iov_len = SZ_1G, we'll calculate that nr_pages = 262145. When we try to allocate a corresponding array of (16-byte) bio_vecs, requiring 4194320 bytes, which is greater than 4MiB. This results in SLUB warning that we're trying to allocate greater than MAX_ORDER, and failing the allocation. Avoid this by using kvmalloc() for allocations dependent on the user-provided iov_len. At the same time, fix a leak of imu->bvec when registration fails. Full splat from before this patch: WARNING: CPU: 1 PID: 2314 at mm/page_alloc.c:4595 __alloc_pages_nodemask+0x7ac/0x2938 mm/page_alloc.c:4595 Kernel panic - not syncing: panic_on_warn set ... CPU: 1 PID: 2314 Comm: syz-executor326 Not tainted 5.1.0-rc7-dirty #4 Hardware name: linux,dummy-virt (DT) Call trace: dump_backtrace+0x0/0x2f0 include/linux/compiler.h:193 show_stack+0x20/0x30 arch/arm64/kernel/traps.c:158 __dump_stack lib/dump_stack.c:77 [inline] dump_stack+0x110/0x190 lib/dump_stack.c:113 panic+0x384/0x68c kernel/panic.c:214 __warn+0x2bc/0x2c0 kernel/panic.c:571 report_bug+0x228/0x2d8 lib/bug.c:186 bug_handler+0xa0/0x1a0 arch/arm64/kernel/traps.c:956 call_break_hook arch/arm64/kernel/debug-monitors.c:301 [inline] brk_handler+0x1d4/0x388 arch/arm64/kernel/debug-monitors.c:316 do_debug_exception+0x1a0/0x468 arch/arm64/mm/fault.c:831 el1_dbg+0x18/0x8c __alloc_pages_nodemask+0x7ac/0x2938 mm/page_alloc.c:4595 alloc_pages_current+0x164/0x278 mm/mempolicy.c:2132 alloc_pages include/linux/gfp.h:509 [inline] kmalloc_order+0x20/0x50 mm/slab_common.c:1231 kmalloc_order_trace+0x30/0x2b0 mm/slab_common.c:1243 kmalloc_large include/linux/slab.h:480 [inline] __kmalloc+0x3dc/0x4f0 mm/slub.c:3791 kmalloc_array include/linux/slab.h:670 [inline] io_sqe_buffer_register fs/io_uring.c:2472 [inline] __io_uring_register fs/io_uring.c:2962 [inline] __do_sys_io_uring_register fs/io_uring.c:3008 [inline] __se_sys_io_uring_register fs/io_uring.c:2990 [inline] __arm64_sys_io_uring_register+0x9e0/0x1bc8 fs/io_uring.c:2990 __invoke_syscall arch/arm64/kernel/syscall.c:35 [inline] invoke_syscall arch/arm64/kernel/syscall.c:47 [inline] el0_svc_common.constprop.0+0x148/0x2e0 arch/arm64/kernel/syscall.c:83 el0_svc_handler+0xdc/0x100 arch/arm64/kernel/syscall.c:129 el0_svc+0x8/0xc arch/arm64/kernel/entry.S:948 SMP: stopping secondary CPUs Dumping ftrace buffer: (ftrace buffer empty) Kernel Offset: disabled CPU features: 0x002,23000438 Memory Limit: none Rebooting in 1 seconds.. Fixes: edafccee56ff3167 ("io_uring: add support for pre-mapped user IO buffers") Signed-off-by: Mark Rutland <mark.rutland@arm.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Jens Axboe <axboe@kernel.dk> Cc: linux-fsdevel@vger.kernel.org Cc: linux-block@vger.kernel.org Cc: linux-kernel@vger.kernel.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-05-01Merge branch 'net-sched-taprio-change-schedules'David S. Miller2-206/+412
Vinicius Costa Gomes says: ==================== net/sched: taprio change schedules Changes from RFC: - Removed the patches for taprio offloading, because of the lack of in-tree users; - Updated the links to point to the PATCH version of this series; Original cover letter: Overview -------- This RFC has two objectives, it adds support for changing the running schedules during "runtime", explained in more detail later, and proposes an interface between taprio and the drivers for hardware offloading. These two different features are presented together so it's clear what the "final state" would look like. But after the RFC stage, they can be proposed (and reviewed) separately. Changing the schedules without disrupting traffic is important for handling dynamic use cases, for example, when streams are added/removed and when the network configuration changes. Hardware offloading support allows schedules to be more precise and have lower resource usage. Changing schedules ------------------ The same as the other interfaces we proposed, we try to use the same concepts as the IEEE 802.1Q-2018 specification. So, for changing schedules, there are an "oper" (operational) and an "admin" schedule. The "admin" schedule is mutable and not in use, the "oper" schedule is immutable and is in use. That is, when the user first adds an schedule it is in the "admin" state, and it becomes "oper" when its base-time (basically when it starts) is reached. What this means is that now it's possible to create taprio with a schedule: $ tc qdisc add dev IFACE parent root handle 100 taprio \ num_tc 3 \ map 2 2 1 0 2 2 2 2 2 2 2 2 2 2 2 2 \ queues 1@0 1@1 2@2 \ base-time 10000000 \ sched-entry S 03 300000 \ sched-entry S 02 300000 \ sched-entry S 06 400000 \ clockid CLOCK_TAI And then, later, after the previous schedule is "promoted" to "oper", add a new ("admin") schedule to be used some time later: $ tc qdisc change dev IFACE parent root handle 100 taprio \ base-time 1553121866000000000 \ sched-entry S 02 500000 \ sched-entry S 0f 400000 \ clockid CLOCK_TAI When enabling the ability to change schedules, it makes sense to add two more defined knobs to schedules: "cycle-time" allows to truncate a cycle to some value, so it repeats after a well-defined value; "cycle-time-extension" controls how much an entry can be extended if it's the last one before the change of schedules, the reason is to avoid a very small cycle when transitioning from a schedule to another. With these, taprio in the software mode should provide a fairly complete implementation of what's defined in the Enhancements for Scheduled Traffic parts of the specification. Hardware offload ---------------- Some workloads require better guarantees from their schedules than what's provided by the software implementation. This series proposes an interface for configuring schedules into compatible network controllers. This part is proposed together with the support for changing schedules, because it raises questions like, should the "qdisc" side be responsible of providing visibility into the schedules or should it be the driver? In this proposal, the driver is called passing the new schedule as soon as it is validated, and the "core" qdisc takes care of displaying (".dump()") the correct schedules at all times. It means that some logic would need to be duplicated in the driver, if the hardware doesn't have support for multiple schedules. But as taprio doesn't have enough information about the underlying controller to know how much in advance a schedule needs to be informed to the hardware, it feels like a fair compromise. The hardware offloading part of this proposal also tries to define an interface for frame-preemption and how it interacts with the scheduling of traffic, see Section 8.6.8.4 of IEEE 802.1Q-2018 for more information. One important difference between the qdisc interface and the qdisc-driver interface, is that the "gate mask" on the qdisc side references traffic classes, that is bit 0 of the gate mask means Traffic Class 0, and in the driver interface, it specifies the queues, that is bit 0 means queue 0. That is to say that taprio converts the references to traffic classes to references to queues before sending the offloading request to the driver. Request for help ---------------- I would like that interested driver maintainers could take a look at the proposed interface and see if it's going to be too awkward for any particular device. Also, pointers to available documentation would be appreciated. The idea here is to start a discussion so we can have an interface that would work for multiple vendors. Links ----- kernel patches: https://github.com/vcgomes/net-next/tree/taprio-add-support-for-change-v3 iproute2 patches: https://github.com/vcgomes/iproute2/tree/taprio-add-support-for-change-v3 ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2019-05-01taprio: Add support for cycle-time-extensionVinicius Costa Gomes2-6/+30
IEEE 802.1Q-2018 defines the concept of a cycle-time-extension, so the last entry of a schedule before the start of a new schedule can be extended, so "too-short" entries can be avoided. Signed-off-by: Vinicius Costa Gomes <vinicius.gomes@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-05-01taprio: Add support for setting the cycle-time manuallyVinicius Costa Gomes2-8/+52
IEEE 802.1Q-2018 defines that a the cycle-time of a schedule may be overridden, so the schedule is truncated to a determined "width". Signed-off-by: Vinicius Costa Gomes <vinicius.gomes@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-05-01taprio: Add support adding an admin scheduleVinicius Costa Gomes2-193/+329
The IEEE 802.1Q-2018 defines two "types" of schedules, the "Oper" (from operational?) and "Admin" ones. Up until now, 'taprio' only had support for the "Oper" one, added when the qdisc is created. This adds support for the "Admin" one, which allows the .change() operation to be supported. Just for clarification, some quick (and dirty) definitions, the "Oper" schedule is the currently (as in this instant) running one, and it's read-only. The "Admin" one is the one that the system configurator has installed, it can be changed, and it will be "promoted" to "Oper" when it's 'base-time' is reached. The idea behing this patch is that calling something like the below, (after taprio is already configured with an initial schedule): $ tc qdisc change taprio dev IFACE parent root \ base-time X \ sched-entry <CMD> <GATES> <INTERVAL> \ ... Will cause a new admin schedule to be created and programmed to be "promoted" to "Oper" at instant X. If an "Admin" schedule already exists, it will be overwritten with the new parameters. Up until now, there was some code that was added to ease the support of changing a single entry of a schedule, but was ultimately unused. Now, that we have support for "change" with more well thought semantics, updating a single entry seems to be less useful. So we remove what is in practice dead code, and return a "not supported" error if the user tries to use it. If changing a single entry would make the user's life easier we may ressurrect this idea, but at this point, removing it simplifies the code. For now, only the schedule specific bits are allowed to be added for a new schedule, that means that 'clockid', 'num_tc', 'map' and 'queues' cannot be modified. Example: $ tc qdisc change dev IFACE parent root handle 100 taprio \ base-time $BASE_TIME \ sched-entry S 00 500000 \ sched-entry S 0f 500000 \ clockid CLOCK_TAI The only change in the netlink API introduced by this change is the introduction of an "admin" type in the response to a dump request, that type allows userspace to separate the "oper" schedule from the "admin" schedule. If userspace doesn't support the "admin" type, it will only display the "oper" schedule. Signed-off-by: Vinicius Costa Gomes <vinicius.gomes@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-05-01taprio: Fix potencial use of invalid memory during dequeue()Vinicius Costa Gomes1-6/+8
Right now, this isn't a problem, but the next commit allows schedules to be added during runtime. When a new schedule transitions from the inactive to the active state ("admin" -> "oper") the previous one can be freed, if it's freed just after the RCU read lock is released, we may access an invalid entry. So, we should take care to protect the dequeue() flow, so all the places that access the entries are protected by the RCU read lock. Signed-off-by: Vinicius Costa Gomes <vinicius.gomes@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-05-01Merge branch 'tcp-undo-congestion'David S. Miller6-49/+84
Yuchung Cheng says: ==================== undo congestion window on spurious SYN or SYNACK timeout Linux TCP currently uses the initial congestion window of 1 packet if multiple SYN or SYNACK timeouts per RFC6298. However such timeouts are often spurious on wireless or cellular networks that experience high delay variances (e.g. ramping up dormant radios or local link retransmission). Another case is when the underlying path is longer than the default SYN timeout (e.g. 1 second). In these cases starting the transfer with a minimal congestion window is detrimental to the performance for short flows. One naive approach is to simply ignore SYN or SYNACK timeouts and always use a larger or default initial window. This approach however risks pouring gas to the fire when the network is already highly congested. This is particularly true in data center where application could start thousands to millions of connections over a single or multiple hosts resulting in high SYN drops (e.g. incast). This patch-set detects spurious SYN and SYNACK timeouts upon completing the handshake via the widely-supported TCP timestamp options. Upon such events the sender reverts to the default initial window to start the data transfer so it gets best of both worlds. This patch-set supports this feature for both active and passive as well as Fast Open or regular connections. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2019-05-01tcp: refactor setting the initial congestion windowYuchung Cheng3-22/+26
Relocate the congestion window initialization from tcp_init_metrics() to tcp_init_transfer() to improve code readability. Signed-off-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-05-01tcp: refactor to consolidate TFO passive open codeYuchung Cheng1-27/+25
Use a helper to consolidate two identical code block for passive TFO. Signed-off-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-05-01tcp: undo cwnd on Fast Open spurious SYNACK retransmitYuchung Cheng1-0/+3
This patch makes passive Fast Open reverts the cwnd to default initial cwnd (10 packets) if the SYNACK timeout is spurious. Passive Fast Open uses a full socket during handshake so it can use the existing undo logic to detect spurious retransmission by recording the first SYNACK timeout in key state variable retrans_stamp. Upon receiving the ACK of the SYNACK, if the socket has sent some data before the timeout, the spurious timeout is detected by tcp_try_undo_recovery() in tcp_process_loss() in tcp_ack(). But if the socket has not send any data yet, tcp_ack() does not execute the undo code since no data is acknowledged. The fix is to check such case explicitly after tcp_ack() during the ACK processing in SYN_RECV state. In addition this is checked in FIN_WAIT_1 state in case the server closes the socket before handshake completes. Signed-off-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-05-01tcp: lower congestion window on Fast Open SYNACK timeoutYuchung Cheng1-0/+3
TCP sender would use congestion window of 1 packet on the second SYN and SYNACK timeout except passive TCP Fast Open. This makes passive TFO too aggressive and unfair during congestion at handshake. This patch fixes this issue so TCP (fast open or not, passive or active) always conforms to the RFC6298. Note that tcp_enter_loss() is called only once during recurring timeouts. This is because during handshake, high_seq and snd_una are the same so tcp_enter_loss() would incorrect set the undo state variables multiple times. Signed-off-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-05-01tcp: undo init congestion window on false SYNACK timeoutYuchung Cheng2-0/+7
Linux implements RFC6298 and use an initial congestion window of 1 upon establishing the connection if the SYNACK packet is retransmitted 2 or more times. In cellular networks SYNACK timeouts are often spurious if the wireless radio was dormant or idle. Also some network path is longer than the default SYNACK timeout. In both cases falsely starting with a minimal cwnd are detrimental to performance. This patch avoids doing so when the final ACK's TCP timestamp indicates the original SYNACK was delivered. It remembers the original SYNACK timestamp when SYNACK timeout has occurred and re-uses the function to detect spurious SYN timeout conveniently. Note that a server may receives multiple SYNs from and immediately retransmits SYNACKs without any SYNACK timeout. This often happens on when the client SYNs have timed out due to wireless delay above. In this case since the server will still use the default initial congestion (e.g. 10) because tp->undo_marker is reset in tcp_init_metrics(). This is an intentional design because packets are not lost but delayed. This patch only covers regular TCP passive open. Fast Open is supported in the next patch. Signed-off-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-05-01tcp: better SYNACK sent timestampYuchung Cheng2-1/+5
Detecting spurious SYNACK timeout using timestamp option requires recording the exact SYNACK skb timestamp. Previously the SYNACK sent timestamp was stamped slightly earlier before the skb was transmitted. This patch uses the SYNACK skb transmission timestamp directly. Signed-off-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-05-01tcp: undo initial congestion window on false SYN timeoutYuchung Cheng2-1/+17
Linux implements RFC6298 and use an initial congestion window of 1 upon establishing the connection if the SYN packet is retransmitted 2 or more times. In cellular networks SYN timeouts are often spurious if the wireless radio was dormant or idle. Also some network path is longer than the default SYN timeout. Having a minimal cwnd on both cases are detrimental to TCP startup performance. This patch extends TCP undo feature (RFC3522 aka TCP Eifel) to detect spurious SYN timeout via TCP timestamps. Since tp->retrans_stamp records the initial SYN timestamp instead of first retransmission, we have to implement a different undo code additionally. The detection also must happen before tcp_ack() as retrans_stamp is reset when SYN is acknowledged. Note this patch covers both active regular and fast open. Signed-off-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-05-01tcp: avoid unconditional congestion window undo on SYN retransmitYuchung Cheng1-2/+2
Previously if an active TCP open has SYN timeout, it always undo the cwnd upon receiving the SYNACK. This is because tcp_clean_rtx_queue would reset tp->retrans_stamp when SYN is acked, which fools then tcp_try_undo_loss and tcp_packet_delayed. Addressing this issue is required to properly support undo for spurious SYN timeout. Fixing this is tricky -- for active TCP open tp->retrans_stamp records the time when the handshake starts, not the first retransmission time as the name may suggest. The simplest fix is for tcp_packet_delayed to ensure it is valid before comparing with other timestamp. One side effect of this change is active TCP Fast Open that incurred SYN timeout. Upon receiving a SYN-ACK that only acknowledged the SYN, it would immediately retransmit unacknowledged data in tcp_ack() because the data is marked lost after SYN timeout. But the retransmission would have an incorrect ack sequence number since rcv_nxt has not been updated yet tcp_rcv_synsent_state_process(), the retransmission needs to properly handed by tcp_rcv_fastopen_synack() like before. Signed-off-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-05-01net/tls: avoid NULL pointer deref on nskb->sk in fallbackJakub Kicinski1-1/+2
update_chksum() accesses nskb->sk before it has been set by complete_skb(), move the init up. Fixes: e8f69799810c ("net/tls: Add generic NIC offload infrastructure") Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com> Reviewed-by: Simon Horman <simon.horman@netronome.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-05-01netdevsim: fix fall-through annotationGustavo A. R. Silva1-1/+1
Replace "pass through" with a proper "fall through" annotation in order to fix the following warning: drivers/net/netdevsim/bus.c: In function ‘new_device_store’: drivers/net/netdevsim/bus.c:170:14: warning: this statement may fall through [-Wimplicit-fallthrough=] port_count = 1; ~~~~~~~~~~~^~~ drivers/net/netdevsim/bus.c:172:2: note: here case 2: ^~~~ Warning level 3 was used: -Wimplicit-fallthrough=3 This fix is part of the ongoing efforts to enable -Wimplicit-fallthrough Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com> Acked-by: Jakub Kicinski <jakub.kicinski@netronome.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-05-01selftests: fib_rule_tests: Fix icmp proto with ipv6David Ahern1-2/+2
A recent commit returns an error if icmp is used as the ip-proto for IPv6 fib rules. Update fib_rule_tests to send ipv6-icmp instead of icmp. Fixes: 5e1a99eae8499 ("ipv4: Add ICMPv6 support when parse route ipproto") Signed-off-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-05-01packet: validate msg_namelen in send directlyWillem de Bruijn1-10/+14
Packet sockets in datagram mode take a destination address. Verify its length before passing to dev_hard_header. Prior to 2.6.14-rc3, the send code ignored sll_halen. This is established behavior. Directly compare msg_namelen to dev->addr_len. Change v1->v2: initialize addr in all paths Fixes: 6b8d95f1795c4 ("packet: validate address length if non-zero") Suggested-by: David Laight <David.Laight@aculab.com> Signed-off-by: Willem de Bruijn <willemb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-05-01packet: in recvmsg msg_name return at least sizeof sockaddr_llWillem de Bruijn1-2/+11
Packet send checks that msg_name is at least sizeof sockaddr_ll. Packet recv must return at least this length, so that its output can be passed unmodified to packet send. This ceased to be true since adding support for lladdr longer than sll_addr. Since, the return value uses true address length. Always return at least sizeof sockaddr_ll, even if address length is shorter. Zero the padding bytes. Change v1->v2: do not overwrite zeroed padding again. use copy_len. Fixes: 0fb375fb9b93 ("[AF_PACKET]: Allow for > 8 byte hardware addresses.") Suggested-by: David Laight <David.Laight@aculab.com> Signed-off-by: Willem de Bruijn <willemb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-05-01sfc: mcdi_port: Mark expected switch fall-throughGustavo A. R. Silva1-0/+1
In preparation to enabling -Wimplicit-fallthrough, mark switch cases where we are expecting to fall through. This patch fixes the following warning: drivers/net/ethernet/sfc/mcdi_port.c: In function ‘efx_mcdi_phy_decode_link’: ./include/linux/compiler.h:77:22: warning: this statement may fall through [-Wimplicit-fallthrough=] # define unlikely(x) __builtin_expect(!!(x), 0) ^~~~~~~~~~~~~~~~~~~~~~~~~~ ./include/asm-generic/bug.h:125:2: note: in expansion of macro ‘unlikely’ unlikely(__ret_warn_on); \ ^~~~~~~~ drivers/net/ethernet/sfc/mcdi_port.c:344:3: note: in expansion of macro ‘WARN_ON’ WARN_ON(1); ^~~~~~~ drivers/net/ethernet/sfc/mcdi_port.c:345:2: note: here case MC_CMD_FCNTL_OFF: ^~~~ Warning level 3 was used: -Wimplicit-fallthrough=3 This patch is part of the ongoing efforts to enable -Wimplicit-fallthrough. Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com> Acked-by: Edward Cree <ecree@solarflare.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-05-01devlink: Change devlink health locking mechanismMoshe Shemesh2-23/+75
The devlink health reporters create/destroy and user commands currently use the devlink->lock as a locking mechanism. Different reporters have different rules in the driver and are being created/destroyed during different stages of driver load/unload/running. So during execution of a reporter recover the flow can go through another reporter's destroy and create. Such flow leads to deadlock trying to lock a mutex already held. With the new locking mechanism the different reporters share mutex lock only to protect access to shared reporters list. Added refcount per reporter, to protect the reporters from destroy while being used. Signed-off-by: Moshe Shemesh <moshe@mellanox.com> Signed-off-by: Jiri Pirko <jiri@mellanox.com> Acked-by: Jakub Kicinski <jakub.kicinski@netronome.com> Signed-off-by: David S. Miller <davem@davemloft.net>