summaryrefslogtreecommitdiffstats
path: root/net
AgeCommit message (Collapse)AuthorFilesLines
2022-08-22net: dsa: make phylink-related OF properties mandatory on DSA and CPU portsVladimir Oltean1-5/+167
Early DSA drivers were kind of simplistic in that they assumed a fairly narrow hardware layout. User ports would have integrated PHYs at an internal MDIO address that is derivable from the port number, and shared (DSA and CPU) ports would have an MII-style (serial or parallel) connection to another MAC. Phylib and then phylink were used to drive the internal PHYs, and this needed little to no description through the platform data structures. Bringing up the shared ports at the maximum supported link speed was the responsibility of the drivers. As a result of this, when these early drivers were converted from platform data to the new DSA OF bindings, there was no link information translated into the first DT bindings. https://lore.kernel.org/all/YtXFtTsf++AeDm1l@lunn.ch/ Later, phylink was adopted for shared ports as well, and today we have a workaround in place, introduced by commit a20f997010c4 ("net: dsa: Don't instantiate phylink for CPU/DSA ports unless needed"). There, DSA checks for the presence of phy-handle/fixed-link/managed OF properties, and if missing, phylink registration would be skipped. This is because phylink is optional for some drivers (the shared ports already work without it), but the process of starting to register a port with phylink is irreversible: if phylink_create() fails to find the fwnode properties it needs, it bails out and it leaves the ports inoperational (because phylink expects ports to be initially down, so DSA necessarily takes them down, and doesn't know how to put them back up again). DSA being a common framework, new drivers opt into this workaround willy-nilly, but the ideal behavior from the DSA core's side would have been to not interfere with phylink's process of failing at all. This isn't possible because of regression concerns with pre-phylink DT blobs, but at least DSA should put a stop to the proliferation of more of such cases that rely on the workaround to skip phylink registration, and sanitize the environment that new drivers work in. To that end, create a list of compatible strings for which the workaround is preserved, and don't apply the workaround for any drivers outside that list (this includes new drivers). In some cases, we make the assumption that even existing drivers don't rely on DSA's workaround, and we do this by looking at the device trees in which they appear. We can't fully know what is the situation with downstream DT blobs, but we can guess the overall trend by studying the DT blobs that were submitted upstream. If there are upstream blobs that have lacking descriptions, we take it as very likely that there are many more downstream blobs that do so too. If all upstream blobs have complete descriptions, we take that as a hint that the driver is a candidate for enforcing strict DT bindings (considering that most bindings are copy-pasted). If there are no upstream DT blobs, we take the conservative route of allowing the workaround, unless the driver maintainer instructs us otherwise. The driver situation is as follows: ar9331 ~~~~~~ compatible strings: - qca,ar9331-switch 1 occurrence in mainline device trees, part of SoC dtsi (arch/mips/boot/dts/qca/ar9331.dtsi), description is not problematic. Verdict: opt into strict DT bindings and out of workarounds. b53 ~~~ compatible strings: - brcm,bcm5325 - brcm,bcm53115 - brcm,bcm53125 - brcm,bcm53128 - brcm,bcm5365 - brcm,bcm5389 - brcm,bcm5395 - brcm,bcm5397 - brcm,bcm5398 - brcm,bcm53010-srab - brcm,bcm53011-srab - brcm,bcm53012-srab - brcm,bcm53018-srab - brcm,bcm53019-srab - brcm,bcm5301x-srab - brcm,bcm11360-srab - brcm,bcm58522-srab - brcm,bcm58525-srab - brcm,bcm58535-srab - brcm,bcm58622-srab - brcm,bcm58623-srab - brcm,bcm58625-srab - brcm,bcm88312-srab - brcm,cygnus-srab - brcm,nsp-srab - brcm,omega-srab - brcm,bcm3384-switch - brcm,bcm6328-switch - brcm,bcm6368-switch - brcm,bcm63xx-switch I've found at least these mainline DT blobs with problems: arch/arm/boot/dts/bcm47094-linksys-panamera.dts - lacks phy-mode arch/arm/boot/dts/bcm47189-tenda-ac9.dts - lacks phy-mode and fixed-link arch/arm/boot/dts/bcm47081-luxul-xap-1410.dts arch/arm/boot/dts/bcm47081-luxul-xwr-1200.dts arch/arm/boot/dts/bcm47081-buffalo-wzr-600dhp2.dts - lacks phy-mode and fixed-link arch/arm/boot/dts/bcm47094-luxul-xbr-4500.dts arch/arm/boot/dts/bcm4708-smartrg-sr400ac.dts arch/arm/boot/dts/bcm4708-luxul-xap-1510.dts arch/arm/boot/dts/bcm953012er.dts arch/arm/boot/dts/bcm4708-netgear-r6250.dts arch/arm/boot/dts/bcm4708-buffalo-wzr-1166dhp-common.dtsi arch/arm/boot/dts/bcm4708-luxul-xwc-1000.dts arch/arm/boot/dts/bcm47094-luxul-abr-4500.dts - lacks phy-mode and fixed-link arch/arm/boot/dts/bcm53016-meraki-mr32.dts - lacks phy-mode Verdict: opt into DSA workarounds. bcm_sf2 ~~~~~~~ compatible strings: - brcm,bcm4908-switch - brcm,bcm7445-switch-v4.0 - brcm,bcm7278-switch-v4.0 - brcm,bcm7278-switch-v4.8 A single occurrence in mainline (arch/arm64/boot/dts/broadcom/bcm4908/bcm4908.dtsi), part of a SoC dtsi, valid description. Florian Fainelli explains that most of the bcm_sf2 device trees lack a full description for the internal IMP ports. Verdict: opt the BCM4908 into strict DT bindings, and opt the rest into the workarounds. Note that even though BCM4908 has strict DT bindings, it still does not register with phylink on the IMP port due to it implementing ->adjust_link(). hellcreek ~~~~~~~~~ compatible strings: - hirschmann,hellcreek-de1soc-r1 No occurrence in mainline device trees. Kurt Kanzenbach explains that the downstream device trees lacked phy-mode and fixed link, and needed work, but were fixed in the meantime. Verdict: opt into strict DT bindings and out of workarounds. lan9303 ~~~~~~~ compatible strings: - smsc,lan9303-mdio - smsc,lan9303-i2c 1 occurrence in mainline device trees: arch/arm/boot/dts/imx53-kp-hsc.dts - no phy-mode, no fixed-link Verdict: opt out of strict DT bindings and into workarounds. lantiq_gswip ~~~~~~~~~~~~ compatible strings: - lantiq,xrx200-gswip - lantiq,xrx300-gswip - lantiq,xrx330-gswip No occurrences in mainline device trees. Martin Blumenstingl confirms that the downstream OpenWrt device trees lack a proper fixed-link and need work, and that the incomplete description can even be seen in the example from Documentation/devicetree/bindings/net/dsa/lantiq-gswip.txt. Verdict: opt out of strict DT bindings and into workarounds. microchip ksz ~~~~~~~~~~~~~ compatible strings: - microchip,ksz8765 - microchip,ksz8794 - microchip,ksz8795 - microchip,ksz8863 - microchip,ksz8873 - microchip,ksz9477 - microchip,ksz9897 - microchip,ksz9893 - microchip,ksz9563 - microchip,ksz8563 - microchip,ksz9567 - microchip,lan9370 - microchip,lan9371 - microchip,lan9372 - microchip,lan9373 - microchip,lan9374 5 occurrences in mainline device trees, all descriptions are valid. But we had a snafu for the ksz8795 and ksz9477 drivers where the phy-mode property would be expected to be located directly under the 'switch' node rather than under a port OF node. It was fixed by commit edecfa98f602 ("net: dsa: microchip: look for phy-mode in port nodes"). The driver still has compatibility with the old DT blobs. The lan937x support was added later than the above snafu was fixed, and even though it has support for the broken DT blobs by virtue of sharing a common probing function, I'll take it that its DT blobs are correct. Verdict: opt lan937x into strict DT bindings, and the others out. mt7530 ~~~~~~ compatible strings - mediatek,mt7621 - mediatek,mt7530 - mediatek,mt7531 Multiple occurrences in mainline device trees, one is part of an SoC dtsi (arch/mips/boot/dts/ralink/mt7621.dtsi), all descriptions are fine. Verdict: opt into strict DT bindings and out of workarounds. mv88e6060 ~~~~~~~~~ compatible string: - marvell,mv88e6060 no occurrences in mainline, nobody knows anybody who uses it. Verdict: opt out of strict DT bindings and into workarounds. mv88e6xxx ~~~~~~~~~ compatible strings: - marvell,mv88e6085 - marvell,mv88e6190 - marvell,mv88e6250 Device trees that have incomplete descriptions of CPU or DSA ports: arch/arm64/boot/dts/freescale/imx8mq-zii-ultra.dtsi - lacks phy-mode arch/arm64/boot/dts/marvell/cn9130-crb.dtsi - lacks phy-mode and fixed-link arch/arm/boot/dts/vf610-zii-ssmb-spu3.dts - lacks phy-mode arch/arm/boot/dts/kirkwood-mv88f6281gtw-ge.dts - lacks phy-mode arch/arm/boot/dts/vf610-zii-spb4.dts - lacks phy-mode arch/arm/boot/dts/vf610-zii-cfu1.dts - lacks phy-mode arch/arm/boot/dts/vf610-zii-dev-rev-c.dts - lacks phy-mode on CPU port, fixed-link on DSA ports arch/arm/boot/dts/vf610-zii-dev-rev-b.dts - lacks phy-mode on CPU port arch/arm/boot/dts/armada-381-netgear-gs110emx.dts - lacks phy-mode arch/arm/boot/dts/vf610-zii-scu4-aib.dts - lacks fixed-link on xgmii DSA ports and/or in-band-status on 2500base-x DSA ports, and phy-mode on CPU port arch/arm/boot/dts/imx6qdl-gw5904.dtsi - lacks phy-mode and fixed-link arch/arm/boot/dts/armada-385-clearfog-gtr-l8.dts - lacks phy-mode and fixed-link arch/arm/boot/dts/vf610-zii-ssmb-dtu.dts - lacks phy-mode arch/arm/boot/dts/kirkwood-dir665.dts - lacks phy-mode arch/arm/boot/dts/kirkwood-rd88f6281.dtsi - lacks phy-mode arch/arm/boot/dts/orion5x-netgear-wnr854t.dts - lacks phy-mode and fixed-link arch/arm/boot/dts/armada-388-clearfog.dts - lacks phy-mode arch/arm/boot/dts/armada-xp-linksys-mamba.dts - lacks phy-mode arch/arm/boot/dts/armada-385-linksys.dtsi - lacks phy-mode arch/arm/boot/dts/imx6q-b450v3.dts arch/arm/boot/dts/imx6q-b850v3.dts - has a phy-handle but not a phy-mode? arch/arm/boot/dts/armada-370-rd.dts - lacks phy-mode arch/arm/boot/dts/kirkwood-linksys-viper.dts - lacks phy-mode arch/arm/boot/dts/imx51-zii-rdu1.dts - lacks phy-mode arch/arm/boot/dts/imx51-zii-scu2-mezz.dts - lacks phy-mode arch/arm/boot/dts/imx6qdl-zii-rdu2.dtsi - lacks phy-mode arch/arm/boot/dts/armada-385-clearfog-gtr-s4.dts - lacks phy-mode and fixed-link Verdict: opt out of strict DT bindings and into workarounds. ocelot ~~~~~~ compatible strings: - mscc,vsc9953-switch - felix (arch/arm64/boot/dts/freescale/fsl-ls1028a.dtsi) is a PCI device, has no compatible string 2 occurrences in mainline, both are part of SoC dtsi and complete. Verdict: opt into strict DT bindings and out of workarounds. qca8k ~~~~~ compatible strings: - qca,qca8327 - qca,qca8328 - qca,qca8334 - qca,qca8337 5 occurrences in mainline device trees, none of the descriptions are problematic. Verdict: opt into strict DT bindings and out of workarounds. realtek ~~~~~~~ compatible strings: - realtek,rtl8366rb - realtek,rtl8365mb 2 occurrences in mainline, both descriptions are fine, additionally rtl8365mb.c has a comment "The device tree firmware should also specify the link partner of the extension port - either via a fixed-link or other phy-handle." Verdict: opt into strict DT bindings and out of workarounds. rzn1_a5psw ~~~~~~~~~~ compatible strings: - renesas,rzn1-a5psw One single occurrence, part of SoC dtsi (arch/arm/boot/dts/r9a06g032.dtsi), description is fine. Verdict: opt into strict DT bindings and out of workarounds. sja1105 ~~~~~~~ Driver already validates its port OF nodes in sja1105_parse_ports_node(). Verdict: opt into strict DT bindings and out of workarounds. vsc73xx ~~~~~~~ compatible strings: - vitesse,vsc7385 - vitesse,vsc7388 - vitesse,vsc7395 - vitesse,vsc7398 2 occurrences in mainline device trees, both descriptions are fine. Verdict: opt into strict DT bindings and out of workarounds. xrs700x ~~~~~~~ compatible strings: - arrow,xrs7003e - arrow,xrs7003f - arrow,xrs7004e - arrow,xrs7004f no occurrences in mainline, we don't know. Verdict: opt out of strict DT bindings and into workarounds. Because there is a pattern where newly added switches reuse existing drivers more often than introducing new ones, I've opted for deciding who gets to opt into the workaround based on an OF compatible match table in the DSA core. The alternative would have been to add another boolean property to struct dsa_switch, like configure_vlan_while_not_filtering. But this avoids situations where sometimes driver maintainers obfuscate what goes on by sharing a common probing function, and therefore making new switches inherit old quirks. Side note, we also warn about missing properties for drivers that rely on the workaround. This isn't an indication that we'll break compatibility with those DT blobs any time soon, but is rather done to raise awareness about the change, for future DT blob authors. Cc: Rob Herring <robh+dt@kernel.org> Cc: Frank Rowand <frowand.list@gmail.com> Acked-by: Alvin Šipraga <alsi@bang-olufsen.dk> # realtek Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-08-22net: dsa: rename dsa_port_link_{,un}register_ofVladimir Oltean3-16/+16
There is a subset of functions that applies only to shared (DSA and CPU) ports, yet this is difficult to comprehend by looking at their code alone. These are dsa_port_link_register_of(), dsa_port_link_unregister_of(), and the functions that only these 2 call. Rename this class of functions to dsa_shared_port_* to make this fact more evident, even if this goes against the apparent convention that function names in port.c must start with dsa_port_. Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-08-22net: dsa: avoid dsa_port_link_{,un}register_of() calls with platform dataVladimir Oltean1-10/+24
dsa_port_link_register_of() and dsa_port_link_unregister_of() are not written with the fact in mind that they can be called with a dp->dn that is NULL (as evidenced even by the _of suffix in their name), but this is exactly what happens. How this behaves will differ depending on whether the backing driver implements ->adjust_link() or not. If it doesn't, the "if (of_phy_is_fixed_link(dp->dn) || phy_np)" condition will return false, and dsa_port_link_register_of() will do nothing and return 0. If the driver does implement ->adjust_link(), the "if (of_phy_is_fixed_link(dp->dn))" condition will return false (dp->dn is NULL) and we will call dsa_port_setup_phy_of(). This will call dsa_port_get_phy_device(), which will also return NULL, and we will also do nothing and return 0. It is hard to maintain this code and make future changes to it in this state, so just suppress calls to these 2 functions if dp->dn is NULL. The only functional effect is that if the driver does implement ->adjust_link(), we'll stop printing this to the console: Using legacy PHYLIB callbacks. Please migrate to PHYLINK! but instead we'll always print: [ 8.539848] dsa-loop fixed-0:1f: skipping link registration for CPU port 5 This is for the better anyway, since "using legacy phylib callbacks" was misleading information - we weren't issuing _any_ callbacks due to dsa_port_get_phy_device() returning NULL. Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-08-22Merge tag 'nfs-for-5.20-2' of git://git.linux-nfs.org/projects/trondmy/linux-nfsLinus Torvalds1-1/+1
Pull NFS client fixes from Trond Myklebust: "Stable fixes: - NFS: Fix another fsync() issue after a server reboot Bugfixes: - NFS: unlink/rmdir shouldn't call d_delete() twice on ENOENT - NFS: Fix missing unlock in nfs_unlink() - Add sanity checking of the file type used by __nfs42_ssc_open - Fix a case where we're failing to set task->tk_rpc_status Cleanups: - Remove the NFS_CONTEXT_RESEND_WRITES flag that got obsoleted by the fsync() fix" * tag 'nfs-for-5.20-2' of git://git.linux-nfs.org/projects/trondmy/linux-nfs: SUNRPC: RPC level errors should set task->tk_rpc_status NFSv4.2 fix problems with __nfs42_ssc_open NFS: unlink/rmdir shouldn't call d_delete() twice on ENOENT NFS: Cleanup to remove unused flag NFS_CONTEXT_RESEND_WRITES NFS: Remove a bogus flag setting in pnfs_write_done_resend_to_mds NFS: Fix another fsync() issue after a server reboot NFS: Fix missing unlock in nfs_unlink()
2022-08-22Remove DECnet support from kernelStephen Hemminger23-10683/+1
DECnet is an obsolete network protocol that receives more attention from kernel janitors than users. It belongs in computer protocol history museum not in Linux kernel. It has been "Orphaned" in kernel since 2010. The iproute2 support for DECnet was dropped in 5.0 release. The documentation link on Sourceforge says it is abandoned there as well. Leave the UAPI alone to keep userspace programs compiling. This means that there is still an empty neighbour table for AF_DECNET. The table of /proc/sys/net entries was updated to match current directories and reformatted to be alphabetical. Signed-off-by: Stephen Hemminger <stephen@networkplumber.org> Acked-by: David Ahern <dsahern@kernel.org> Acked-by: Nikolay Aleksandrov <razor@blackwall.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-08-22rose: check NULL rose_loopback_neigh->loopbackBernard Pidoux1-1/+2
Commit 3b3fd068c56e3fbea30090859216a368398e39bf added NULL check for `rose_loopback_neigh->dev` in rose_loopback_timer() but omitted to check rose_loopback_neigh->loopback. It thus prevents *all* rose connect. The reason is that a special rose_neigh loopback has a NULL device. /proc/net/rose_neigh illustrates it via rose_neigh_show() function : [...] seq_printf(seq, "%05d %-9s %-4s %3d %3d %3s %3s %3lu %3lu", rose_neigh->number, (rose_neigh->loopback) ? "RSLOOP-0" : ax2asc(buf, &rose_neigh->callsign), rose_neigh->dev ? rose_neigh->dev->name : "???", rose_neigh->count, /proc/net/rose_neigh displays special rose_loopback_neigh->loopback as callsign RSLOOP-0: addr callsign dev count use mode restart t0 tf digipeaters 00001 RSLOOP-0 ??? 1 2 DCE yes 0 0 By checking rose_loopback_neigh->loopback, rose_rx_call_request() is called even in case rose_loopback_neigh->dev is NULL. This repairs rose connections. Verification with rose client application FPAC: FPAC-Node v 4.1.3 (built Aug 5 2022) for LINUX (help = h) F6BVP-4 (Commands = ?) : u Users - AX.25 Level 2 sessions : Port Callsign Callsign AX.25 state ROSE state NetRom status axudp F6BVP-5 -> F6BVP-9 Connected Connected --------- Fixes: 3b3fd068c56e ("rose: Fix Null pointer dereference in rose_send_frame()") Signed-off-by: Bernard Pidoux <f6bvp@free.fr> Suggested-by: Francois Romieu <romieu@fr.zoreil.com> Cc: Thomas DL9SAU Osterried <thomas@osterried.de> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-08-22openvswitch: Fix overreporting of drops in dropwatchMike Pattrick1-2/+3
Currently queue_userspace_packet will call kfree_skb for all frames, whether or not an error occurred. This can result in a single dropped frame being reported as multiple drops in dropwatch. This functions caller may also call kfree_skb in case of an error. This patch will consume the skbs instead and allow caller's to use kfree_skb. Signed-off-by: Mike Pattrick <mkp@redhat.com> Link: https://bugzilla.redhat.com/show_bug.cgi?id=2109957 Signed-off-by: David S. Miller <davem@davemloft.net>
2022-08-22openvswitch: Fix double reporting of drops in dropwatchMike Pattrick1-3/+10
Frames sent to userspace can be reported as dropped in ovs_dp_process_packet, however, if they are dropped in the netlink code then netlink_attachskb will report the same frame as dropped. This patch checks for error codes which indicate that the frame has already been freed. Signed-off-by: Mike Pattrick <mkp@redhat.com> Link: https://bugzilla.redhat.com/show_bug.cgi?id=2109946 Signed-off-by: David S. Miller <davem@davemloft.net>
2022-08-22af_unix: Show number of inflight fds for sockets in TCP_LISTEN state tooKirill Tkhai1-3/+33
TCP_LISTEN sockets is a special case. They preserve skb with a newly connected sock till accept() makes it fully functional socket. Receive queue of such socket may grow after connected peer send messages there. Since these messages may contain scm_fds, we should expose correct fdinfo::scm_fds for listening socket too. Signed-off-by: Kirill Tkhai <tkhai@ya.ru> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-08-20dynamic_dname(): drop unused dentry argumentAl Viro1-1/+1
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2022-08-20batman-adv: Fix hang up with small MTU hard-interfaceShigeru Yoshida1-0/+4
The system hangs up when batman-adv soft-interface is created on hard-interface with small MTU. For example, the following commands create batman-adv soft-interface on dummy interface with zero MTU: # ip link add name dummy0 type dummy # ip link set mtu 0 dev dummy0 # ip link set up dev dummy0 # ip link add name bat0 type batadv # ip link set dev dummy0 master bat0 These commands cause the system hang up with the following messages: [ 90.578925][ T6689] batman_adv: bat0: Adding interface: dummy0 [ 90.580884][ T6689] batman_adv: bat0: The MTU of interface dummy0 is too small (0) to handle the transport of batman-adv packets. Packets going over this interface will be fragmented on layer2 which could impact the performance. Setting the MTU to 1560 would solve the problem. [ 90.586264][ T6689] batman_adv: bat0: Interface activated: dummy0 [ 90.590061][ T6689] batman_adv: bat0: Forced to purge local tt entries to fit new maximum fragment MTU (-320) [ 90.595517][ T6689] batman_adv: bat0: Forced to purge local tt entries to fit new maximum fragment MTU (-320) [ 90.598499][ T6689] batman_adv: bat0: Forced to purge local tt entries to fit new maximum fragment MTU (-320) This patch fixes this issue by returning error when enabling hard-interface with small MTU size. Fixes: c6c8fea29769 ("net: Add batman-adv meshing protocol") Signed-off-by: Shigeru Yoshida <syoshida@redhat.com> Signed-off-by: Sven Eckelmann <sven@narfation.org> Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
2022-08-20batman-adv: Drop initialization of flexible ethtool_link_ksettingsSven Eckelmann1-1/+0
The commit 94dfc73e7cf4 ("treewide: uapi: Replace zero-length arrays with flexible-array members") changed various structures from using 0-length arrays to flexible arrays net/batman-adv/bat_v_elp.c: note: in included file: ./include/linux/ethtool.h:148:38: warning: nested flexible array net/batman-adv/bat_v_elp.c:128:9: warning: using sizeof on a flexible structure In theory, this could be worked around by using {} as initializer for the variable on the stack. But this variable doesn't has to be initialized at all by the caller of __ethtool_get_link_ksettings - everything will be initialized by the callee when no error occurs. Signed-off-by: Sven Eckelmann <sven@narfation.org> Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
2022-08-19SUNRPC: RPC level errors should set task->tk_rpc_statusTrond Myklebust1-1/+1
Fix up a case in call_encode() where we're failing to set task->tk_rpc_status when an RPC level error occurred. Fixes: 9c5948c24869 ("SUNRPC: task should be exit if encode return EKEYEXPIRED more times") Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2022-08-18Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netJakub Kicinski22-152/+305
No conflicts. Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-08-18bpf: Add a few optnames to bpf_setsockoptMartin KaFai Lau1-0/+5
This patch adds a few optnames for bpf_setsockopt: SO_REUSEADDR, IPV6_AUTOFLOWLABEL, TCP_MAXSEG, TCP_NODELAY, and TCP_THIN_LINEAR_TIMEOUTS. Thanks to the previous patches of this set, all additions can reuse the sk_setsockopt(), do_ipv6_setsockopt(), and do_tcp_setsockopt(). The only change here is to allow them in bpf_setsockopt. The bpf prog has been able to read all members of a sk by using PTR_TO_BTF_ID of a sk. The optname additions here can also be read by the same approach. Meaning there is a way to read the values back. These optnames can also be added to bpf_getsockopt() later with another patch set that makes the bpf_getsockopt() to reuse the sock_getsockopt(), tcp_getsockopt(), and ip[v6]_getsockopt(). Thus, this patch does not add more duplicated code to bpf_getsockopt() now. Reviewed-by: Stanislav Fomichev <sdf@google.com> Signed-off-by: Martin KaFai Lau <kafai@fb.com> Link: https://lore.kernel.org/r/20220817061841.4181642-1-kafai@fb.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-08-18bpf: Change bpf_setsockopt(SOL_IPV6) to reuse do_ipv6_setsockopt()Martin KaFai Lau3-32/+29
After the prep work in the previous patches, this patch removes the dup code from bpf_setsockopt(SOL_IPV6) and reuses the implementation in do_ipv6_setsockopt(). ipv6 could be compiled as a module. Like how other code solved it with stubs in ipv6_stubs.h, this patch adds the do_ipv6_setsockopt to the ipv6_bpf_stub. The current bpf_setsockopt(IPV6_TCLASS) does not take the INET_ECN_MASK into the account for tcp. The do_ipv6_setsockopt(IPV6_TCLASS) will handle it correctly. The existing optname white-list is refactored into a new function sol_ipv6_setsockopt(). After this last SOL_IPV6 dup code removal, the __bpf_setsockopt() is simplified enough that the extra "{ }" around the if statement can be removed. Reviewed-by: Stanislav Fomichev <sdf@google.com> Signed-off-by: Martin KaFai Lau <kafai@fb.com> Link: https://lore.kernel.org/r/20220817061834.4181198-1-kafai@fb.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-08-18bpf: Change bpf_setsockopt(SOL_IP) to reuse do_ip_setsockopt()Martin KaFai Lau2-22/+22
After the prep work in the previous patches, this patch removes the dup code from bpf_setsockopt(SOL_IP) and reuses the implementation in do_ip_setsockopt(). The existing optname white-list is refactored into a new function sol_ip_setsockopt(). NOTE, the current bpf_setsockopt(IP_TOS) is quite different from the the do_ip_setsockopt(IP_TOS). For example, it does not take the INET_ECN_MASK into the account for tcp and also does not adjust sk->sk_priority. It looks like the current bpf_setsockopt(IP_TOS) was referencing the IPV6_TCLASS implementation instead of IP_TOS. This patch tries to rectify that by using the do_ip_setsockopt(IP_TOS). While this is a behavior change, the do_ip_setsockopt(IP_TOS) behavior is arguably what the user is expecting. At least, the INET_ECN_MASK bits should be masked out for tcp. Reviewed-by: Stanislav Fomichev <sdf@google.com> Signed-off-by: Martin KaFai Lau <kafai@fb.com> Link: https://lore.kernel.org/r/20220817061826.4180990-1-kafai@fb.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-08-18bpf: Change bpf_setsockopt(SOL_TCP) to reuse do_tcp_setsockopt()Martin KaFai Lau2-69/+32
After the prep work in the previous patches, this patch removes all the dup code from bpf_setsockopt(SOL_TCP) and reuses the do_tcp_setsockopt(). The existing optname white-list is refactored into a new function sol_tcp_setsockopt(). The sol_tcp_setsockopt() also calls the bpf_sol_tcp_setsockopt() to handle the TCP_BPF_XXX specific optnames. bpf_setsockopt(TCP_SAVE_SYN) now also allows a value 2 to save the eth header also and it comes for free from do_tcp_setsockopt(). Reviewed-by: Stanislav Fomichev <sdf@google.com> Signed-off-by: Martin KaFai Lau <kafai@fb.com> Link: https://lore.kernel.org/r/20220817061819.4180146-1-kafai@fb.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-08-18bpf: Refactor bpf specific tcp optnames to a new functionMartin KaFai Lau1-29/+50
The patch moves all bpf specific tcp optnames (TCP_BPF_XXX) to a new function bpf_sol_tcp_setsockopt(). This will make the next patch easier to follow. Reviewed-by: Stanislav Fomichev <sdf@google.com> Signed-off-by: Martin KaFai Lau <kafai@fb.com> Link: https://lore.kernel.org/r/20220817061812.4179645-1-kafai@fb.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-08-18bpf: Change bpf_setsockopt(SOL_SOCKET) to reuse sk_setsockopt()Martin KaFai Lau2-98/+32
After the prep work in the previous patches, this patch removes most of the dup code from bpf_setsockopt(SOL_SOCKET) and reuses them from sk_setsockopt(). The sock ptr test is added to the SO_RCVLOWAT because the sk->sk_socket could be NULL in some of the bpf hooks. The existing optname white-list is refactored into a new function sol_socket_setsockopt(). Reviewed-by: Stanislav Fomichev <sdf@google.com> Signed-off-by: Martin KaFai Lau <kafai@fb.com> Link: https://lore.kernel.org/r/20220817061804.4178920-1-kafai@fb.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-08-18bpf: Embed kernel CONFIG check into the if statement in bpf_setsockoptMartin KaFai Lau1-7/+3
This patch moves the "#ifdef CONFIG_XXX" check into the "if/else" statement itself. The change is done for the bpf_setsockopt() function only. It will make the latter patches easier to follow without the surrounding ifdef macro. Reviewed-by: Stanislav Fomichev <sdf@google.com> Signed-off-by: Martin KaFai Lau <kafai@fb.com> Link: https://lore.kernel.org/r/20220817061758.4178374-1-kafai@fb.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-08-18bpf: net: Change do_ipv6_setsockopt() to use the sockopt's lock_sock() and ↵Martin KaFai Lau1-7/+7
capable() Similar to the earlier patch that avoids sk_setsockopt() from taking sk lock and doing capable test when called by bpf. This patch changes do_ipv6_setsockopt() to use the sockopt_{lock,release}_sock() and sockopt_[ns_]capable(). Reviewed-by: Stanislav Fomichev <sdf@google.com> Signed-off-by: Martin KaFai Lau <kafai@fb.com> Link: https://lore.kernel.org/r/20220817061744.4176893-1-kafai@fb.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-08-18bpf: net: Change do_ip_setsockopt() to use the sockopt's lock_sock() and ↵Martin KaFai Lau1-6/+6
capable() Similar to the earlier patch that avoids sk_setsockopt() from taking sk lock and doing capable test when called by bpf. This patch changes do_ip_setsockopt() to use the sockopt_{lock,release}_sock() and sockopt_[ns_]capable(). Reviewed-by: Stanislav Fomichev <sdf@google.com> Signed-off-by: Martin KaFai Lau <kafai@fb.com> Link: https://lore.kernel.org/r/20220817061737.4176402-1-kafai@fb.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-08-18bpf: net: Change do_tcp_setsockopt() to use the sockopt's lock_sock() and ↵Martin KaFai Lau1-9/+9
capable() Similar to the earlier patch that avoids sk_setsockopt() from taking sk lock and doing capable test when called by bpf. This patch changes do_tcp_setsockopt() to use the sockopt_{lock,release}_sock() and sockopt_[ns_]capable(). Reviewed-by: Stanislav Fomichev <sdf@google.com> Signed-off-by: Martin KaFai Lau <kafai@fb.com> Link: https://lore.kernel.org/r/20220817061730.4176021-1-kafai@fb.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-08-18bpf: net: Consider has_current_bpf_ctx() when testing capable() in ↵Martin KaFai Lau1-13/+25
sk_setsockopt() When bpf program calling bpf_setsockopt(SOL_SOCKET), it could be run in softirq and doesn't make sense to do the capable check. There was a similar situation in bpf_setsockopt(TCP_CONGESTION). In commit 8d650cdedaab ("tcp: fix tcp_set_congestion_control() use from bpf hook"), tcp_set_congestion_control(..., cap_net_admin) was added to skip the cap check for bpf prog. This patch adds sockopt_ns_capable() and sockopt_capable() for the sk_setsockopt() to use. They will consider the has_current_bpf_ctx() before doing the ns_capable() and capable() test. They are in EXPORT_SYMBOL for the ipv6 module to use in a latter patch. Suggested-by: Stanislav Fomichev <sdf@google.com> Reviewed-by: Stanislav Fomichev <sdf@google.com> Signed-off-by: Martin KaFai Lau <kafai@fb.com> Link: https://lore.kernel.org/r/20220817061723.4175820-1-kafai@fb.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-08-18bpf: net: Avoid sk_setsockopt() taking sk lock when called from bpfMartin KaFai Lau1-3/+27
Most of the code in bpf_setsockopt(SOL_SOCKET) are duplicated from the sk_setsockopt(). The number of supported optnames are increasing ever and so as the duplicated code. One issue in reusing sk_setsockopt() is that the bpf prog has already acquired the sk lock. This patch adds a has_current_bpf_ctx() to tell if the sk_setsockopt() is called from a bpf prog. The bpf prog calling bpf_setsockopt() is either running in_task() or in_serving_softirq(). Both cases have the current->bpf_ctx initialized. Thus, the has_current_bpf_ctx() only needs to test !!current->bpf_ctx. This patch also adds sockopt_{lock,release}_sock() helpers for sk_setsockopt() to use. These helpers will test has_current_bpf_ctx() before acquiring/releasing the lock. They are in EXPORT_SYMBOL for the ipv6 module to use in a latter patch. Note on the change in sock_setbindtodevice(). sockopt_lock_sock() is done in sock_setbindtodevice() instead of doing the lock_sock in sock_bindtoindex(..., lock_sk = true). Reviewed-by: Stanislav Fomichev <sdf@google.com> Signed-off-by: Martin KaFai Lau <kafai@fb.com> Link: https://lore.kernel.org/r/20220817061717.4175589-1-kafai@fb.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-08-18net: Add sk_setsockopt() to take the sk ptr instead of the sock ptrMartin KaFai Lau1-3/+10
A latter patch refactors bpf_setsockopt(SOL_SOCKET) with the sock_setsockopt() to avoid code duplication and code drift between the two duplicates. The current sock_setsockopt() takes sock ptr as the argument. The very first thing of this function is to get back the sk ptr by 'sk = sock->sk'. bpf_setsockopt() could be called when the sk does not have the sock ptr created. Meaning sk->sk_socket is NULL. For example, when a passive tcp connection has just been established but has yet been accept()-ed. Thus, it cannot use the sock_setsockopt(sk->sk_socket) or else it will pass a NULL ptr. This patch moves all sock_setsockopt implementation to the newly added sk_setsockopt(). The new sk_setsockopt() takes a sk ptr and immediately gets the sock ptr by 'sock = sk->sk_socket' The existing sock_setsockopt(sock) is changed to call sk_setsockopt(sock->sk). All existing callers have both sock->sk and sk->sk_socket pointer. The latter patch will make bpf_setsockopt(SOL_SOCKET) call sk_setsockopt(sk) directly. The bpf_setsockopt(SOL_SOCKET) does not use the optnames that require sk->sk_socket, so it will be safe. Reviewed-by: Stanislav Fomichev <sdf@google.com> Signed-off-by: Martin KaFai Lau <kafai@fb.com> Link: https://lore.kernel.org/r/20220817061711.4175048-1-kafai@fb.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-08-18ip_tunnel: Respect tunnel key's "flow_flags" in IP tunnelsEyal Birger2-4/+5
Commit 451ef36bd229 ("ip_tunnels: Add new flow flags field to ip_tunnel_key") added a "flow_flags" member to struct ip_tunnel_key which was later used by the commit in the fixes tag to avoid dropping packets with sources that aren't locally configured when set in bpf_set_tunnel_key(). VXLAN and GENEVE were made to respect this flag, ip tunnels like IPIP and GRE were not. This commit fixes this omission by making ip_tunnel_init_flow() receive the flow flags from the tunnel key in the relevant collect_md paths. Fixes: b8fff748521c ("bpf: Set flow flag to allow any source IP in bpf_tunnel_key") Signed-off-by: Eyal Birger <eyal.birger@gmail.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Reviewed-by: Paul Chaignon <paul@isovalent.com> Link: https://lore.kernel.org/bpf/20220818074118.726639-1-eyal.birger@gmail.com
2022-08-18tcp: handle pure FIN case correctlyCong Wang2-3/+4
When skb->len==0, the recv_actor() returns 0 too, but we also use 0 for error conditions. This patch amends this by propagating the errors to tcp_read_skb() so that we can distinguish skb->len==0 case from error cases. Fixes: 04919bed948d ("tcp: Introduce tcp_read_skb()") Reported-by: Eric Dumazet <edumazet@google.com> Cc: John Fastabend <john.fastabend@gmail.com> Cc: Jakub Sitnicki <jakub@cloudflare.com> Signed-off-by: Cong Wang <cong.wang@bytedance.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-08-18tcp: refactor tcp_read_skb() a bitCong Wang1-17/+9
As tcp_read_skb() only reads one skb at a time, the while loop is unnecessary, we can turn it into an if. This also simplifies the code logic. Cc: Eric Dumazet <edumazet@google.com> Cc: John Fastabend <john.fastabend@gmail.com> Cc: Jakub Sitnicki <jakub@cloudflare.com> Signed-off-by: Cong Wang <cong.wang@bytedance.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-08-18tcp: fix tcp_cleanup_rbuf() for tcp_read_skb()Cong Wang1-10/+14
tcp_cleanup_rbuf() retrieves the skb from sk_receive_queue, it assumes the skb is not yet dequeued. This is no longer true for tcp_read_skb() case where we dequeue the skb first. Fix this by introducing a helper __tcp_cleanup_rbuf() which does not require any skb and calling it in tcp_read_skb(). Fixes: 04919bed948d ("tcp: Introduce tcp_read_skb()") Cc: Eric Dumazet <edumazet@google.com> Cc: John Fastabend <john.fastabend@gmail.com> Cc: Jakub Sitnicki <jakub@cloudflare.com> Signed-off-by: Cong Wang <cong.wang@bytedance.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-08-18tcp: fix sock skb accounting in tcp_read_skb()Cong Wang1-0/+1
Before commit 965b57b469a5 ("net: Introduce a new proto_ops ->read_skb()"), skb was not dequeued from receive queue hence when we close TCP socket skb can be just flushed synchronously. After this commit, we have to uncharge skb immediately after being dequeued, otherwise it is still charged in the original sock. And we still need to retain skb->sk, as eBPF programs may extract sock information from skb->sk. Therefore, we have to call skb_set_owner_sk_safe() here. Fixes: 965b57b469a5 ("net: Introduce a new proto_ops ->read_skb()") Reported-and-tested-by: syzbot+a0e6f8738b58f7654417@syzkaller.appspotmail.com Tested-by: Stanislav Fomichev <sdf@google.com> Cc: Eric Dumazet <edumazet@google.com> Cc: John Fastabend <john.fastabend@gmail.com> Cc: Jakub Sitnicki <jakub@cloudflare.com> Signed-off-by: Cong Wang <cong.wang@bytedance.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-08-18net: genl: fix error path memory leak in policy dumpingJakub Kicinski2-3/+17
If construction of the array of policies fails when recording non-first policy we need to unwind. netlink_policy_dump_add_policy() itself also needs fixing as it currently gives up on error without recording the allocated pointer in the pstate pointer. Reported-by: syzbot+dc54d9ba8153b216cae0@syzkaller.appspotmail.com Fixes: 50a896cf2d6f ("genetlink: properly support per-op policy dumping") Link: https://lore.kernel.org/r/20220816161939.577583-1-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-08-17net: dsa: don't warn in dsa_port_set_state_now() when driver doesn't support itVladimir Oltean1-2/+5
ds->ops->port_stp_state_set() is, like most DSA methods, optional, and if absent, the port is supposed to remain in the forwarding state (as standalone). Such is the case with the mv88e6060 driver, which does not offload the bridge layer. DSA warns that the STP state can't be changed to FORWARDING as part of dsa_port_enable_rt(), when in fact it should not. The error message is also not up to modern standards, so take the opportunity to make it more descriptive. Fixes: fd3645413197 ("net: dsa: change scope of STP state setter") Reported-by: Sergei Antonov <saproj@gmail.com> Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Reviewed-by: Sergei Antonov <saproj@gmail.com> Link: https://lore.kernel.org/r/20220816201445.1809483-1-vladimir.oltean@nxp.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-08-17Merge https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-nextJakub Kicinski3-12/+11
Andrii Nakryiko says: ==================== bpf-next 2022-08-17 We've added 45 non-merge commits during the last 14 day(s) which contain a total of 61 files changed, 986 insertions(+), 372 deletions(-). The main changes are: 1) New bpf_ktime_get_tai_ns() BPF helper to access CLOCK_TAI, from Kurt Kanzenbach and Jesper Dangaard Brouer. 2) Few clean ups and improvements for libbpf 1.0, from Andrii Nakryiko. 3) Expose crash_kexec() as kfunc for BPF programs, from Artem Savkov. 4) Add ability to define sleepable-only kfuncs, from Benjamin Tissoires. 5) Teach libbpf's bpf_prog_load() and bpf_map_create() to gracefully handle unsupported names on old kernels, from Hangbin Liu. 6) Allow opting out from auto-attaching BPF programs by libbpf's BPF skeleton, from Hao Luo. 7) Relax libbpf's requirement for shared libs to be marked executable, from Henqgi Chen. 8) Improve bpf_iter internals handling of error returns, from Hao Luo. 9) Few accommodations in libbpf to support GCC-BPF quirks, from James Hilliard. 10) Fix BPF verifier logic around tracking dynptr ref_obj_id, from Joanne Koong. 11) bpftool improvements to handle full BPF program names better, from Manu Bretelle. 12) bpftool fixes around libcap use, from Quentin Monnet. 13) BPF map internals clean ups and improvements around memory allocations, from Yafang Shao. 14) Allow to use cgroup_get_from_file() on cgroupv1, allowing BPF cgroup iterator to work on cgroupv1, from Yosry Ahmed. 15) BPF verifier internal clean ups, from Dave Marchevsky and Joanne Koong. 16) Various fixes and clean ups for selftests/bpf and vmtest.sh, from Daniel Xu, Artem Savkov, Joanne Koong, Andrii Nakryiko, Shibin Koikkara Reeny. * https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (45 commits) selftests/bpf: Few fixes for selftests/bpf built in release mode libbpf: Clean up deprecated and legacy aliases libbpf: Streamline bpf_attr and perf_event_attr initialization libbpf: Fix potential NULL dereference when parsing ELF selftests/bpf: Tests libbpf autoattach APIs libbpf: Allows disabling auto attach selftests/bpf: Fix attach point for non-x86 arches in test_progs/lsm libbpf: Making bpf_prog_load() ignore name if kernel doesn't support selftests/bpf: Update CI kconfig selftests/bpf: Add connmark read test selftests/bpf: Add existing connection bpf_*_ct_lookup() test bpftool: Clear errno after libcap's checks bpf: Clear up confusion in bpf_skb_adjust_room()'s documentation bpftool: Fix a typo in a comment libbpf: Add names for auxiliary maps bpf: Use bpf_map_area_alloc consistently on bpf map creation bpf: Make __GFP_NOWARN consistent in bpf map creation bpf: Use bpf_map_area_free instread of kvfree bpf: Remove unneeded memset in queue_stack_map creation libbpf: preserve errno across pr_warn/pr_info/pr_debug ... ==================== Link: https://lore.kernel.org/r/20220817215656.1180215-1-andrii@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-08-17Merge git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nfJakub Kicinski7-90/+182
Florian Westphal says: ==================== netfilter: conntrack and nf_tables bug fixes The following patchset contains netfilter fixes for net. Broken since 5.19: A few ancient connection tracking helpers assume TCP packets cannot exceed 64kb in size, but this isn't the case anymore with 5.19 when BIG TCP got merged, from myself. Regressions since 5.19: 1. 'conntrack -E expect' won't display anything because nfnetlink failed to enable events for expectations, only for normal conntrack events. 2. partially revert change that added resched calls to a function that can be in atomic context. Both broken and fixed up by myself. Broken for several releases (up to original merge of nf_tables): Several fixes for nf_tables control plane, from Pablo. This fixes up resource leaks in error paths and adds more sanity checks for mutually exclusive attributes/flags. Kconfig: NF_CONNTRACK_PROCFS is very old and doesn't provide all info provided via ctnetlink, so it should not default to y. From Geert Uytterhoeven. Selftests: rework nft_flowtable.sh: it frequently indicated failure; the way it tried to detect an offload failure did not work reliably. * git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf: testing: selftests: nft_flowtable.sh: rework test to detect offload failure testing: selftests: nft_flowtable.sh: use random netns names netfilter: conntrack: NF_CONNTRACK_PROCFS should no longer default to y netfilter: nf_tables: check NFT_SET_CONCAT flag if field_count is specified netfilter: nf_tables: disallow NFT_SET_ELEM_CATCHALL and NFT_SET_ELEM_INTERVAL_END netfilter: nf_tables: NFTA_SET_ELEM_KEY_END requires concat and interval flags netfilter: nf_tables: validate NFTA_SET_ELEM_OBJREF based on NFT_SET_OBJECT flag netfilter: nf_tables: really skip inactive sets when allocating name netfilter: nfnetlink: re-enable conntrack expectation events netfilter: nf_tables: fix scheduling-while-atomic splat netfilter: nf_ct_irc: cap packet search space to 4k netfilter: nf_ct_ftp: prefer skb_linearize netfilter: nf_ct_h323: cap packet size at 64k netfilter: nf_ct_sane: remove pseudo skb linearization netfilter: nf_tables: possible module reference underflow in error path netfilter: nf_tables: disallow NFTA_SET_ELEM_KEY_END with NFT_SET_ELEM_INTERVAL_END flag netfilter: nf_tables: use READ_ONCE and WRITE_ONCE for shared generation id access ==================== Link: https://lore.kernel.org/r/20220817140015.25843-1-fw@strlen.de Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-08-17skmsg: Fix wrong last sg check in sk_msg_recvmsg()Liu Jian1-2/+2
Fix one kernel NULL pointer dereference as below: [ 224.462334] Call Trace: [ 224.462394] __tcp_bpf_recvmsg+0xd3/0x380 [ 224.462441] ? sock_has_perm+0x78/0xa0 [ 224.462463] tcp_bpf_recvmsg+0x12e/0x220 [ 224.462494] inet_recvmsg+0x5b/0xd0 [ 224.462534] __sys_recvfrom+0xc8/0x130 [ 224.462574] ? syscall_trace_enter+0x1df/0x2e0 [ 224.462606] ? __do_page_fault+0x2de/0x500 [ 224.462635] __x64_sys_recvfrom+0x24/0x30 [ 224.462660] do_syscall_64+0x5d/0x1d0 [ 224.462709] entry_SYSCALL_64_after_hwframe+0x65/0xca In commit 9974d37ea75f ("skmsg: Fix invalid last sg check in sk_msg_recvmsg()"), we change last sg check to sg_is_last(), but in sockmap redirection case (without stream_parser/stream_verdict/ skb_verdict), we did not mark the end of the scatterlist. Check the sk_msg_alloc, sk_msg_page_add, and bpf_msg_push_data functions, they all do not mark the end of sg. They are expected to use sg.end for end judgment. So the judgment of '(i != msg_rx->sg.end)' is added back here. Fixes: 9974d37ea75f ("skmsg: Fix invalid last sg check in sk_msg_recvmsg()") Signed-off-by: Liu Jian <liujian56@huawei.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: John Fastabend <john.fastabend@gmail.com> Acked-by: Jakub Sitnicki <jakub@cloudflare.com> Link: https://lore.kernel.org/bpf/20220809094915.150391-1-liujian56@huawei.com
2022-08-17batman-adv: Drop unused headers in trace.hSven Eckelmann1-2/+0
The commit 9abc291812d7 ("batman-adv: tracing: Use the new __vstring() helper") removed the usage of WARN_ON_ONCE and __dynamic_array in this file. But it was forgotten to adjust the headers accordingly (dropping the now no longer used ones). Signed-off-by: Sven Eckelmann <sven@narfation.org> Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
2022-08-17batman-adv: Start new development cycleSimon Wunderlich1-1/+1
This version will contain all the (major or even only minor) changes for Linux 6.1. The version number isn't a semantic version number with major and minor information. It is just encoding the year of the expected publishing as Linux -rc1 and the number of published versions this year (starting at 0). Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
2022-08-17tls: rx: react to strparser initialization errorsJakub Kicinski1-1/+3
Even though the normal strparser's init function has a return value we got away with ignoring errors until now, as it only validates the parameters and we were passing correct parameters. tls_strp can fail to init on memory allocation errors, which syzbot duly induced and reported. Reported-by: syzbot+abd45eb849b05194b1b6@syzkaller.appspotmail.com Fixes: 84c61fe1a75b ("tls: rx: do not use the standard strparser") Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-08-17tcp: Make SYN ACK RTO tunable by BPF programs with TFOJie Meng2-2/+3
Instead of the hardcoded TCP_TIMEOUT_INIT, this diff calls tcp_timeout_init to initiate req->timeout like the non TFO SYN ACK case. Tested using the following packetdrill script, on a host with a BPF program that sets the initial connect timeout to 10ms. `../../common/defaults.sh` // Initialize connection 0 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3 +0 setsockopt(3, SOL_TCP, TCP_FASTOPEN, [1], 4) = 0 +0 bind(3, ..., ...) = 0 +0 listen(3, 1) = 0 +0 < S 0:0(0) win 32792 <mss 1000,sackOK,FO TFO_COOKIE> +0 > S. 0:0(0) ack 1 <mss 1460,nop,nop,sackOK> +.01 > S. 0:0(0) ack 1 <mss 1460,nop,nop,sackOK> +.02 > S. 0:0(0) ack 1 <mss 1460,nop,nop,sackOK> +.04 > S. 0:0(0) ack 1 <mss 1460,nop,nop,sackOK> +.01 < . 1:1(0) ack 1 win 32792 +0 accept(3, ..., ...) = 4 Signed-off-by: Jie Meng <jmeng@fb.com> Acked-by: Neal Cardwell <ncardwell@google.com> Acked-by: Yuchung Cheng <ycheng@google.com> Acked-by: Martin KaFai Lau <kafai@fb.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-08-17xfrm: policy: fix metadata dst->dev xmit null pointer dereferenceNikolay Aleksandrov1-1/+1
When we try to transmit an skb with metadata_dst attached (i.e. dst->dev == NULL) through xfrm interface we can hit a null pointer dereference[1] in xfrmi_xmit2() -> xfrm_lookup_with_ifid() due to the check for a loopback skb device when there's no policy which dereferences dst->dev unconditionally. Not having dst->dev can be interepreted as it not being a loopback device, so just add a check for a null dst_orig->dev. With this fix xfrm interface's Tx error counters go up as usual. [1] net-next calltrace captured via netconsole: BUG: kernel NULL pointer dereference, address: 00000000000000c0 #PF: supervisor read access in kernel mode #PF: error_code(0x0000) - not-present page PGD 0 P4D 0 Oops: 0000 [#1] PREEMPT SMP CPU: 1 PID: 7231 Comm: ping Kdump: loaded Not tainted 5.19.0+ #24 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.0-1.fc36 04/01/2014 RIP: 0010:xfrm_lookup_with_ifid+0x5eb/0xa60 Code: 8d 74 24 38 e8 26 a4 37 00 48 89 c1 e9 12 fc ff ff 49 63 ed 41 83 fd be 0f 85 be 01 00 00 41 be ff ff ff ff 45 31 ed 48 8b 03 <f6> 80 c0 00 00 00 08 75 0f 41 80 bc 24 19 0d 00 00 01 0f 84 1e 02 RSP: 0018:ffffb0db82c679f0 EFLAGS: 00010246 RAX: 0000000000000000 RBX: ffffd0db7fcad430 RCX: ffffb0db82c67a10 RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffffb0db82c67a80 RBP: ffffb0db82c67a80 R08: ffffb0db82c67a14 R09: 0000000000000000 R10: 0000000000000000 R11: ffff8fa449667dc8 R12: ffffffff966db880 R13: 0000000000000000 R14: 00000000ffffffff R15: 0000000000000000 FS: 00007ff35c83f000(0000) GS:ffff8fa478480000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00000000000000c0 CR3: 000000001ebb7000 CR4: 0000000000350ee0 Call Trace: <TASK> xfrmi_xmit+0xde/0x460 ? tcf_bpf_act+0x13d/0x2a0 dev_hard_start_xmit+0x72/0x1e0 __dev_queue_xmit+0x251/0xd30 ip_finish_output2+0x140/0x550 ip_push_pending_frames+0x56/0x80 raw_sendmsg+0x663/0x10a0 ? try_charge_memcg+0x3fd/0x7a0 ? __mod_memcg_lruvec_state+0x93/0x110 ? sock_sendmsg+0x30/0x40 sock_sendmsg+0x30/0x40 __sys_sendto+0xeb/0x130 ? handle_mm_fault+0xae/0x280 ? do_user_addr_fault+0x1e7/0x680 ? kvm_read_and_reset_apf_flags+0x3b/0x50 __x64_sys_sendto+0x20/0x30 do_syscall_64+0x34/0x80 entry_SYSCALL_64_after_hwframe+0x46/0xb0 RIP: 0033:0x7ff35cac1366 Code: eb 0b 00 f7 d8 64 89 02 48 c7 c0 ff ff ff ff eb b8 0f 1f 00 41 89 ca 64 8b 04 25 18 00 00 00 85 c0 75 11 b8 2c 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 72 c3 90 55 48 83 ec 30 44 89 4c 24 2c 4c 89 RSP: 002b:00007fff738e4028 EFLAGS: 00000246 ORIG_RAX: 000000000000002c RAX: ffffffffffffffda RBX: 00007fff738e57b0 RCX: 00007ff35cac1366 RDX: 0000000000000040 RSI: 0000557164e4b450 RDI: 0000000000000003 RBP: 0000557164e4b450 R08: 00007fff738e7a2c R09: 0000000000000010 R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000040 R13: 00007fff738e5770 R14: 00007fff738e4030 R15: 0000001d00000001 </TASK> Modules linked in: netconsole veth br_netfilter bridge bonding virtio_net [last unloaded: netconsole] CR2: 00000000000000c0 CC: Steffen Klassert <steffen.klassert@secunet.com> CC: Daniel Borkmann <daniel@iogearbox.net> Fixes: 2d151d39073a ("xfrm: Add possibility to set the default to block if we have no policy") Signed-off-by: Nikolay Aleksandrov <razor@blackwall.org> Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2022-08-17netfilter: conntrack: NF_CONNTRACK_PROCFS should no longer default to yGeert Uytterhoeven1-1/+0
NF_CONNTRACK_PROCFS was marked obsolete in commit 54b07dca68557b09 ("netfilter: provide config option to disable ancient procfs parts") in v3.3. Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org> Signed-off-by: Florian Westphal <fw@strlen.de>
2022-08-16net: sched: delete unused input parameter in qdisc_createZhengchao Shao1-3/+3
The input parameter p is unused in qdisc_create. Delete it. Signed-off-by: Zhengchao Shao <shaozhengchao@huawei.com> Link: https://lore.kernel.org/r/20220815061023.51318-1-shaozhengchao@huawei.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-08-16net: sched: fix misuse of qcpu->backlog in gnet_stats_add_queue_cpuZhengchao Shao1-1/+1
In the gnet_stats_add_queue_cpu function, the qstats->qlen statistics are incorrectly set to qcpu->backlog. Fixes: 448e163f8b9b ("gen_stats: Add gnet_stats_add_queue()") Signed-off-by: Zhengchao Shao <shaozhengchao@huawei.com> Link: https://lore.kernel.org/r/20220815030848.276746-1-shaozhengchao@huawei.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-08-16net: sched: remove the unused return value of unregister_qdiscZhengchao Shao1-2/+3
Return value of unregister_qdisc is unused, remove it. Signed-off-by: Zhengchao Shao <shaozhengchao@huawei.com> Link: https://lore.kernel.org/r/20220815030417.271894-1-shaozhengchao@huawei.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-08-15net: rtnetlink: fix module reference count leak issue in rtnetlink_rcv_msgZhengchao Shao1-0/+1
When bulk delete command is received in the rtnetlink_rcv_msg function, if bulk delete is not supported, module_put is not called to release the reference counting. As a result, module reference count is leaked. Fixes: a6cec0bcd342 ("net: rtnetlink: add bulk delete support flag") Signed-off-by: Zhengchao Shao <shaozhengchao@huawei.com> Acked-by: Nikolay Aleksandrov <razor@blackwall.org> Link: https://lore.kernel.org/r/20220815024629.240367-1-shaozhengchao@huawei.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-08-15cgroup: Replace cgroup->ancestor_ids[] with ->ancestors[]Tejun Heo1-4/+5
Every cgroup knows all its ancestors through its ->ancestor_ids[]. There's no advantage to remembering the IDs instead of the pointers directly and this makes the array useless for finding an actual ancestor cgroup forcing cgroup_ancestor() to iteratively walk up the hierarchy instead. Let's replace cgroup->ancestor_ids[] with ->ancestors[] and remove the walking-up from cgroup_ancestor(). While at it, improve comments around cgroup_root->cgrp_ancestor_storage. This patch shouldn't cause user-visible behavior differences. v2: Update cgroup_ancestor() to use ->ancestors[]. v3: cgroup_root->cgrp_ancestor_storage's type is updated to match cgroup->ancestors[]. Better comments. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Namhyung Kim <namhyung@kernel.org>
2022-08-15netfilter: nf_tables: check NFT_SET_CONCAT flag if field_count is specifiedPablo Neira Ayuso1-0/+5
Since f3a2181e16f1 ("netfilter: nf_tables: Support for sets with multiple ranged fields"), it possible to combine intervals and concatenations. Later on, ef516e8625dd ("netfilter: nf_tables: reintroduce the NFT_SET_CONCAT flag") provides the NFT_SET_CONCAT flag for userspace to report that the set stores a concatenation. Make sure NFT_SET_CONCAT is set on if field_count is specified for consistency. Otherwise, if NFT_SET_CONCAT is specified with no field_count, bail out with EINVAL. Fixes: ef516e8625dd ("netfilter: nf_tables: reintroduce the NFT_SET_CONCAT flag") Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2022-08-15netfilter: nf_tables: disallow NFT_SET_ELEM_CATCHALL and ↵Pablo Neira Ayuso1-0/+3
NFT_SET_ELEM_INTERVAL_END These flags are mutually exclusive, report EINVAL in this case. Fixes: aaa31047a6d2 ("netfilter: nftables: add catch-all set element support") Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>