Age | Commit message (Collapse) | Author | Files | Lines |
|
Early DSA drivers were kind of simplistic in that they assumed a fairly
narrow hardware layout. User ports would have integrated PHYs at an
internal MDIO address that is derivable from the port number, and shared
(DSA and CPU) ports would have an MII-style (serial or parallel)
connection to another MAC. Phylib and then phylink were used to drive
the internal PHYs, and this needed little to no description through the
platform data structures. Bringing up the shared ports at the maximum
supported link speed was the responsibility of the drivers.
As a result of this, when these early drivers were converted from
platform data to the new DSA OF bindings, there was no link information
translated into the first DT bindings.
https://lore.kernel.org/all/YtXFtTsf++AeDm1l@lunn.ch/
Later, phylink was adopted for shared ports as well, and today we have a
workaround in place, introduced by commit a20f997010c4 ("net: dsa: Don't
instantiate phylink for CPU/DSA ports unless needed"). There, DSA checks
for the presence of phy-handle/fixed-link/managed OF properties, and if
missing, phylink registration would be skipped. This is because phylink
is optional for some drivers (the shared ports already work without it),
but the process of starting to register a port with phylink is
irreversible: if phylink_create() fails to find the fwnode properties it
needs, it bails out and it leaves the ports inoperational (because
phylink expects ports to be initially down, so DSA necessarily takes
them down, and doesn't know how to put them back up again).
DSA being a common framework, new drivers opt into this workaround
willy-nilly, but the ideal behavior from the DSA core's side would have
been to not interfere with phylink's process of failing at all. This
isn't possible because of regression concerns with pre-phylink DT blobs,
but at least DSA should put a stop to the proliferation of more of such
cases that rely on the workaround to skip phylink registration, and
sanitize the environment that new drivers work in.
To that end, create a list of compatible strings for which the
workaround is preserved, and don't apply the workaround for any drivers
outside that list (this includes new drivers).
In some cases, we make the assumption that even existing drivers don't
rely on DSA's workaround, and we do this by looking at the device trees
in which they appear. We can't fully know what is the situation with
downstream DT blobs, but we can guess the overall trend by studying the
DT blobs that were submitted upstream. If there are upstream blobs that
have lacking descriptions, we take it as very likely that there are many
more downstream blobs that do so too. If all upstream blobs have
complete descriptions, we take that as a hint that the driver is a
candidate for enforcing strict DT bindings (considering that most
bindings are copy-pasted). If there are no upstream DT blobs, we take
the conservative route of allowing the workaround, unless the driver
maintainer instructs us otherwise.
The driver situation is as follows:
ar9331
~~~~~~
compatible strings:
- qca,ar9331-switch
1 occurrence in mainline device trees, part of SoC dtsi
(arch/mips/boot/dts/qca/ar9331.dtsi), description is not problematic.
Verdict: opt into strict DT bindings and out of workarounds.
b53
~~~
compatible strings:
- brcm,bcm5325
- brcm,bcm53115
- brcm,bcm53125
- brcm,bcm53128
- brcm,bcm5365
- brcm,bcm5389
- brcm,bcm5395
- brcm,bcm5397
- brcm,bcm5398
- brcm,bcm53010-srab
- brcm,bcm53011-srab
- brcm,bcm53012-srab
- brcm,bcm53018-srab
- brcm,bcm53019-srab
- brcm,bcm5301x-srab
- brcm,bcm11360-srab
- brcm,bcm58522-srab
- brcm,bcm58525-srab
- brcm,bcm58535-srab
- brcm,bcm58622-srab
- brcm,bcm58623-srab
- brcm,bcm58625-srab
- brcm,bcm88312-srab
- brcm,cygnus-srab
- brcm,nsp-srab
- brcm,omega-srab
- brcm,bcm3384-switch
- brcm,bcm6328-switch
- brcm,bcm6368-switch
- brcm,bcm63xx-switch
I've found at least these mainline DT blobs with problems:
arch/arm/boot/dts/bcm47094-linksys-panamera.dts
- lacks phy-mode
arch/arm/boot/dts/bcm47189-tenda-ac9.dts
- lacks phy-mode and fixed-link
arch/arm/boot/dts/bcm47081-luxul-xap-1410.dts
arch/arm/boot/dts/bcm47081-luxul-xwr-1200.dts
arch/arm/boot/dts/bcm47081-buffalo-wzr-600dhp2.dts
- lacks phy-mode and fixed-link
arch/arm/boot/dts/bcm47094-luxul-xbr-4500.dts
arch/arm/boot/dts/bcm4708-smartrg-sr400ac.dts
arch/arm/boot/dts/bcm4708-luxul-xap-1510.dts
arch/arm/boot/dts/bcm953012er.dts
arch/arm/boot/dts/bcm4708-netgear-r6250.dts
arch/arm/boot/dts/bcm4708-buffalo-wzr-1166dhp-common.dtsi
arch/arm/boot/dts/bcm4708-luxul-xwc-1000.dts
arch/arm/boot/dts/bcm47094-luxul-abr-4500.dts
- lacks phy-mode and fixed-link
arch/arm/boot/dts/bcm53016-meraki-mr32.dts
- lacks phy-mode
Verdict: opt into DSA workarounds.
bcm_sf2
~~~~~~~
compatible strings:
- brcm,bcm4908-switch
- brcm,bcm7445-switch-v4.0
- brcm,bcm7278-switch-v4.0
- brcm,bcm7278-switch-v4.8
A single occurrence in mainline
(arch/arm64/boot/dts/broadcom/bcm4908/bcm4908.dtsi), part of a SoC
dtsi, valid description. Florian Fainelli explains that most of the
bcm_sf2 device trees lack a full description for the internal IMP
ports.
Verdict: opt the BCM4908 into strict DT bindings, and opt the rest
into the workarounds. Note that even though BCM4908 has strict DT
bindings, it still does not register with phylink on the IMP port
due to it implementing ->adjust_link().
hellcreek
~~~~~~~~~
compatible strings:
- hirschmann,hellcreek-de1soc-r1
No occurrence in mainline device trees. Kurt Kanzenbach explains
that the downstream device trees lacked phy-mode and fixed link, and
needed work, but were fixed in the meantime.
Verdict: opt into strict DT bindings and out of workarounds.
lan9303
~~~~~~~
compatible strings:
- smsc,lan9303-mdio
- smsc,lan9303-i2c
1 occurrence in mainline device trees:
arch/arm/boot/dts/imx53-kp-hsc.dts
- no phy-mode, no fixed-link
Verdict: opt out of strict DT bindings and into workarounds.
lantiq_gswip
~~~~~~~~~~~~
compatible strings:
- lantiq,xrx200-gswip
- lantiq,xrx300-gswip
- lantiq,xrx330-gswip
No occurrences in mainline device trees. Martin Blumenstingl
confirms that the downstream OpenWrt device trees lack a proper
fixed-link and need work, and that the incomplete description can
even be seen in the example from
Documentation/devicetree/bindings/net/dsa/lantiq-gswip.txt.
Verdict: opt out of strict DT bindings and into workarounds.
microchip ksz
~~~~~~~~~~~~~
compatible strings:
- microchip,ksz8765
- microchip,ksz8794
- microchip,ksz8795
- microchip,ksz8863
- microchip,ksz8873
- microchip,ksz9477
- microchip,ksz9897
- microchip,ksz9893
- microchip,ksz9563
- microchip,ksz8563
- microchip,ksz9567
- microchip,lan9370
- microchip,lan9371
- microchip,lan9372
- microchip,lan9373
- microchip,lan9374
5 occurrences in mainline device trees, all descriptions are valid.
But we had a snafu for the ksz8795 and ksz9477 drivers where the
phy-mode property would be expected to be located directly under the
'switch' node rather than under a port OF node. It was fixed by
commit edecfa98f602 ("net: dsa: microchip: look for phy-mode in port
nodes"). The driver still has compatibility with the old DT blobs.
The lan937x support was added later than the above snafu was fixed,
and even though it has support for the broken DT blobs by virtue of
sharing a common probing function, I'll take it that its DT blobs
are correct.
Verdict: opt lan937x into strict DT bindings, and the others out.
mt7530
~~~~~~
compatible strings
- mediatek,mt7621
- mediatek,mt7530
- mediatek,mt7531
Multiple occurrences in mainline device trees, one is part of an SoC
dtsi (arch/mips/boot/dts/ralink/mt7621.dtsi), all descriptions are fine.
Verdict: opt into strict DT bindings and out of workarounds.
mv88e6060
~~~~~~~~~
compatible string:
- marvell,mv88e6060
no occurrences in mainline, nobody knows anybody who uses it.
Verdict: opt out of strict DT bindings and into workarounds.
mv88e6xxx
~~~~~~~~~
compatible strings:
- marvell,mv88e6085
- marvell,mv88e6190
- marvell,mv88e6250
Device trees that have incomplete descriptions of CPU or DSA ports:
arch/arm64/boot/dts/freescale/imx8mq-zii-ultra.dtsi
- lacks phy-mode
arch/arm64/boot/dts/marvell/cn9130-crb.dtsi
- lacks phy-mode and fixed-link
arch/arm/boot/dts/vf610-zii-ssmb-spu3.dts
- lacks phy-mode
arch/arm/boot/dts/kirkwood-mv88f6281gtw-ge.dts
- lacks phy-mode
arch/arm/boot/dts/vf610-zii-spb4.dts
- lacks phy-mode
arch/arm/boot/dts/vf610-zii-cfu1.dts
- lacks phy-mode
arch/arm/boot/dts/vf610-zii-dev-rev-c.dts
- lacks phy-mode on CPU port, fixed-link on DSA ports
arch/arm/boot/dts/vf610-zii-dev-rev-b.dts
- lacks phy-mode on CPU port
arch/arm/boot/dts/armada-381-netgear-gs110emx.dts
- lacks phy-mode
arch/arm/boot/dts/vf610-zii-scu4-aib.dts
- lacks fixed-link on xgmii DSA ports and/or in-band-status on
2500base-x DSA ports, and phy-mode on CPU port
arch/arm/boot/dts/imx6qdl-gw5904.dtsi
- lacks phy-mode and fixed-link
arch/arm/boot/dts/armada-385-clearfog-gtr-l8.dts
- lacks phy-mode and fixed-link
arch/arm/boot/dts/vf610-zii-ssmb-dtu.dts
- lacks phy-mode
arch/arm/boot/dts/kirkwood-dir665.dts
- lacks phy-mode
arch/arm/boot/dts/kirkwood-rd88f6281.dtsi
- lacks phy-mode
arch/arm/boot/dts/orion5x-netgear-wnr854t.dts
- lacks phy-mode and fixed-link
arch/arm/boot/dts/armada-388-clearfog.dts
- lacks phy-mode
arch/arm/boot/dts/armada-xp-linksys-mamba.dts
- lacks phy-mode
arch/arm/boot/dts/armada-385-linksys.dtsi
- lacks phy-mode
arch/arm/boot/dts/imx6q-b450v3.dts
arch/arm/boot/dts/imx6q-b850v3.dts
- has a phy-handle but not a phy-mode?
arch/arm/boot/dts/armada-370-rd.dts
- lacks phy-mode
arch/arm/boot/dts/kirkwood-linksys-viper.dts
- lacks phy-mode
arch/arm/boot/dts/imx51-zii-rdu1.dts
- lacks phy-mode
arch/arm/boot/dts/imx51-zii-scu2-mezz.dts
- lacks phy-mode
arch/arm/boot/dts/imx6qdl-zii-rdu2.dtsi
- lacks phy-mode
arch/arm/boot/dts/armada-385-clearfog-gtr-s4.dts
- lacks phy-mode and fixed-link
Verdict: opt out of strict DT bindings and into workarounds.
ocelot
~~~~~~
compatible strings:
- mscc,vsc9953-switch
- felix (arch/arm64/boot/dts/freescale/fsl-ls1028a.dtsi) is a PCI
device, has no compatible string
2 occurrences in mainline, both are part of SoC dtsi and complete.
Verdict: opt into strict DT bindings and out of workarounds.
qca8k
~~~~~
compatible strings:
- qca,qca8327
- qca,qca8328
- qca,qca8334
- qca,qca8337
5 occurrences in mainline device trees, none of the descriptions are
problematic.
Verdict: opt into strict DT bindings and out of workarounds.
realtek
~~~~~~~
compatible strings:
- realtek,rtl8366rb
- realtek,rtl8365mb
2 occurrences in mainline, both descriptions are fine, additionally
rtl8365mb.c has a comment "The device tree firmware should also
specify the link partner of the extension port - either via a
fixed-link or other phy-handle."
Verdict: opt into strict DT bindings and out of workarounds.
rzn1_a5psw
~~~~~~~~~~
compatible strings:
- renesas,rzn1-a5psw
One single occurrence, part of SoC dtsi
(arch/arm/boot/dts/r9a06g032.dtsi), description is fine.
Verdict: opt into strict DT bindings and out of workarounds.
sja1105
~~~~~~~
Driver already validates its port OF nodes in
sja1105_parse_ports_node().
Verdict: opt into strict DT bindings and out of workarounds.
vsc73xx
~~~~~~~
compatible strings:
- vitesse,vsc7385
- vitesse,vsc7388
- vitesse,vsc7395
- vitesse,vsc7398
2 occurrences in mainline device trees, both descriptions are fine.
Verdict: opt into strict DT bindings and out of workarounds.
xrs700x
~~~~~~~
compatible strings:
- arrow,xrs7003e
- arrow,xrs7003f
- arrow,xrs7004e
- arrow,xrs7004f
no occurrences in mainline, we don't know.
Verdict: opt out of strict DT bindings and into workarounds.
Because there is a pattern where newly added switches reuse existing
drivers more often than introducing new ones, I've opted for deciding
who gets to opt into the workaround based on an OF compatible match
table in the DSA core. The alternative would have been to add another
boolean property to struct dsa_switch, like configure_vlan_while_not_filtering.
But this avoids situations where sometimes driver maintainers obfuscate
what goes on by sharing a common probing function, and therefore making
new switches inherit old quirks.
Side note, we also warn about missing properties for drivers that rely
on the workaround. This isn't an indication that we'll break
compatibility with those DT blobs any time soon, but is rather done to
raise awareness about the change, for future DT blob authors.
Cc: Rob Herring <robh+dt@kernel.org>
Cc: Frank Rowand <frowand.list@gmail.com>
Acked-by: Alvin Šipraga <alsi@bang-olufsen.dk> # realtek
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
There is a subset of functions that applies only to shared (DSA and CPU)
ports, yet this is difficult to comprehend by looking at their code alone.
These are dsa_port_link_register_of(), dsa_port_link_unregister_of(),
and the functions that only these 2 call.
Rename this class of functions to dsa_shared_port_* to make this fact
more evident, even if this goes against the apparent convention that
function names in port.c must start with dsa_port_.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
dsa_port_link_register_of() and dsa_port_link_unregister_of() are not
written with the fact in mind that they can be called with a dp->dn that
is NULL (as evidenced even by the _of suffix in their name), but this is
exactly what happens.
How this behaves will differ depending on whether the backing driver
implements ->adjust_link() or not.
If it doesn't, the "if (of_phy_is_fixed_link(dp->dn) || phy_np)"
condition will return false, and dsa_port_link_register_of() will do
nothing and return 0.
If the driver does implement ->adjust_link(), the
"if (of_phy_is_fixed_link(dp->dn))" condition will return false
(dp->dn is NULL) and we will call dsa_port_setup_phy_of(). This will
call dsa_port_get_phy_device(), which will also return NULL, and we will
also do nothing and return 0.
It is hard to maintain this code and make future changes to it in this
state, so just suppress calls to these 2 functions if dp->dn is NULL.
The only functional effect is that if the driver does implement
->adjust_link(), we'll stop printing this to the console:
Using legacy PHYLIB callbacks. Please migrate to PHYLINK!
but instead we'll always print:
[ 8.539848] dsa-loop fixed-0:1f: skipping link registration for CPU port 5
This is for the better anyway, since "using legacy phylib callbacks"
was misleading information - we weren't issuing _any_ callbacks due to
dsa_port_get_phy_device() returning NULL.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Pull NFS client fixes from Trond Myklebust:
"Stable fixes:
- NFS: Fix another fsync() issue after a server reboot
Bugfixes:
- NFS: unlink/rmdir shouldn't call d_delete() twice on ENOENT
- NFS: Fix missing unlock in nfs_unlink()
- Add sanity checking of the file type used by __nfs42_ssc_open
- Fix a case where we're failing to set task->tk_rpc_status
Cleanups:
- Remove the NFS_CONTEXT_RESEND_WRITES flag that got obsoleted by the
fsync() fix"
* tag 'nfs-for-5.20-2' of git://git.linux-nfs.org/projects/trondmy/linux-nfs:
SUNRPC: RPC level errors should set task->tk_rpc_status
NFSv4.2 fix problems with __nfs42_ssc_open
NFS: unlink/rmdir shouldn't call d_delete() twice on ENOENT
NFS: Cleanup to remove unused flag NFS_CONTEXT_RESEND_WRITES
NFS: Remove a bogus flag setting in pnfs_write_done_resend_to_mds
NFS: Fix another fsync() issue after a server reboot
NFS: Fix missing unlock in nfs_unlink()
|
|
DECnet is an obsolete network protocol that receives more attention
from kernel janitors than users. It belongs in computer protocol
history museum not in Linux kernel.
It has been "Orphaned" in kernel since 2010. The iproute2 support
for DECnet was dropped in 5.0 release. The documentation link on
Sourceforge says it is abandoned there as well.
Leave the UAPI alone to keep userspace programs compiling.
This means that there is still an empty neighbour table
for AF_DECNET.
The table of /proc/sys/net entries was updated to match
current directories and reformatted to be alphabetical.
Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
Acked-by: David Ahern <dsahern@kernel.org>
Acked-by: Nikolay Aleksandrov <razor@blackwall.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Commit 3b3fd068c56e3fbea30090859216a368398e39bf added NULL check for
`rose_loopback_neigh->dev` in rose_loopback_timer() but omitted to
check rose_loopback_neigh->loopback.
It thus prevents *all* rose connect.
The reason is that a special rose_neigh loopback has a NULL device.
/proc/net/rose_neigh illustrates it via rose_neigh_show() function :
[...]
seq_printf(seq, "%05d %-9s %-4s %3d %3d %3s %3s %3lu %3lu",
rose_neigh->number,
(rose_neigh->loopback) ? "RSLOOP-0" : ax2asc(buf, &rose_neigh->callsign),
rose_neigh->dev ? rose_neigh->dev->name : "???",
rose_neigh->count,
/proc/net/rose_neigh displays special rose_loopback_neigh->loopback as
callsign RSLOOP-0:
addr callsign dev count use mode restart t0 tf digipeaters
00001 RSLOOP-0 ??? 1 2 DCE yes 0 0
By checking rose_loopback_neigh->loopback, rose_rx_call_request() is called
even in case rose_loopback_neigh->dev is NULL. This repairs rose connections.
Verification with rose client application FPAC:
FPAC-Node v 4.1.3 (built Aug 5 2022) for LINUX (help = h)
F6BVP-4 (Commands = ?) : u
Users - AX.25 Level 2 sessions :
Port Callsign Callsign AX.25 state ROSE state NetRom status
axudp F6BVP-5 -> F6BVP-9 Connected Connected ---------
Fixes: 3b3fd068c56e ("rose: Fix Null pointer dereference in rose_send_frame()")
Signed-off-by: Bernard Pidoux <f6bvp@free.fr>
Suggested-by: Francois Romieu <romieu@fr.zoreil.com>
Cc: Thomas DL9SAU Osterried <thomas@osterried.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Currently queue_userspace_packet will call kfree_skb for all frames,
whether or not an error occurred. This can result in a single dropped
frame being reported as multiple drops in dropwatch. This functions
caller may also call kfree_skb in case of an error. This patch will
consume the skbs instead and allow caller's to use kfree_skb.
Signed-off-by: Mike Pattrick <mkp@redhat.com>
Link: https://bugzilla.redhat.com/show_bug.cgi?id=2109957
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Frames sent to userspace can be reported as dropped in
ovs_dp_process_packet, however, if they are dropped in the netlink code
then netlink_attachskb will report the same frame as dropped.
This patch checks for error codes which indicate that the frame has
already been freed.
Signed-off-by: Mike Pattrick <mkp@redhat.com>
Link: https://bugzilla.redhat.com/show_bug.cgi?id=2109946
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
TCP_LISTEN sockets is a special case. They preserve skb with a newly
connected sock till accept() makes it fully functional socket.
Receive queue of such socket may grow after connected peer
send messages there. Since these messages may contain scm_fds,
we should expose correct fdinfo::scm_fds for listening socket too.
Signed-off-by: Kirill Tkhai <tkhai@ya.ru>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
|
The system hangs up when batman-adv soft-interface is created on
hard-interface with small MTU. For example, the following commands
create batman-adv soft-interface on dummy interface with zero MTU:
# ip link add name dummy0 type dummy
# ip link set mtu 0 dev dummy0
# ip link set up dev dummy0
# ip link add name bat0 type batadv
# ip link set dev dummy0 master bat0
These commands cause the system hang up with the following messages:
[ 90.578925][ T6689] batman_adv: bat0: Adding interface: dummy0
[ 90.580884][ T6689] batman_adv: bat0: The MTU of interface dummy0 is too small (0) to handle the transport of batman-adv packets. Packets going over this interface will be fragmented on layer2 which could impact the performance. Setting the MTU to 1560 would solve the problem.
[ 90.586264][ T6689] batman_adv: bat0: Interface activated: dummy0
[ 90.590061][ T6689] batman_adv: bat0: Forced to purge local tt entries to fit new maximum fragment MTU (-320)
[ 90.595517][ T6689] batman_adv: bat0: Forced to purge local tt entries to fit new maximum fragment MTU (-320)
[ 90.598499][ T6689] batman_adv: bat0: Forced to purge local tt entries to fit new maximum fragment MTU (-320)
This patch fixes this issue by returning error when enabling
hard-interface with small MTU size.
Fixes: c6c8fea29769 ("net: Add batman-adv meshing protocol")
Signed-off-by: Shigeru Yoshida <syoshida@redhat.com>
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
|
|
The commit 94dfc73e7cf4 ("treewide: uapi: Replace zero-length arrays with
flexible-array members") changed various structures from using 0-length
arrays to flexible arrays
net/batman-adv/bat_v_elp.c: note: in included file:
./include/linux/ethtool.h:148:38: warning: nested flexible array
net/batman-adv/bat_v_elp.c:128:9: warning: using sizeof on a flexible structure
In theory, this could be worked around by using {} as initializer for the
variable on the stack. But this variable doesn't has to be initialized at
all by the caller of __ethtool_get_link_ksettings - everything will be
initialized by the callee when no error occurs.
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
|
|
Fix up a case in call_encode() where we're failing to set
task->tk_rpc_status when an RPC level error occurred.
Fixes: 9c5948c24869 ("SUNRPC: task should be exit if encode return EKEYEXPIRED more times")
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
|
|
No conflicts.
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
This patch adds a few optnames for bpf_setsockopt:
SO_REUSEADDR, IPV6_AUTOFLOWLABEL, TCP_MAXSEG, TCP_NODELAY,
and TCP_THIN_LINEAR_TIMEOUTS.
Thanks to the previous patches of this set, all additions can reuse
the sk_setsockopt(), do_ipv6_setsockopt(), and do_tcp_setsockopt().
The only change here is to allow them in bpf_setsockopt.
The bpf prog has been able to read all members of a sk by
using PTR_TO_BTF_ID of a sk. The optname additions here can also be
read by the same approach. Meaning there is a way to read
the values back.
These optnames can also be added to bpf_getsockopt() later with
another patch set that makes the bpf_getsockopt() to reuse
the sock_getsockopt(), tcp_getsockopt(), and ip[v6]_getsockopt().
Thus, this patch does not add more duplicated code to
bpf_getsockopt() now.
Reviewed-by: Stanislav Fomichev <sdf@google.com>
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Link: https://lore.kernel.org/r/20220817061841.4181642-1-kafai@fb.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
After the prep work in the previous patches,
this patch removes the dup code from bpf_setsockopt(SOL_IPV6)
and reuses the implementation in do_ipv6_setsockopt().
ipv6 could be compiled as a module. Like how other code solved it
with stubs in ipv6_stubs.h, this patch adds the do_ipv6_setsockopt
to the ipv6_bpf_stub.
The current bpf_setsockopt(IPV6_TCLASS) does not take the
INET_ECN_MASK into the account for tcp. The
do_ipv6_setsockopt(IPV6_TCLASS) will handle it correctly.
The existing optname white-list is refactored into a new
function sol_ipv6_setsockopt().
After this last SOL_IPV6 dup code removal, the __bpf_setsockopt()
is simplified enough that the extra "{ }" around the if statement
can be removed.
Reviewed-by: Stanislav Fomichev <sdf@google.com>
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Link: https://lore.kernel.org/r/20220817061834.4181198-1-kafai@fb.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
After the prep work in the previous patches,
this patch removes the dup code from bpf_setsockopt(SOL_IP)
and reuses the implementation in do_ip_setsockopt().
The existing optname white-list is refactored into a new
function sol_ip_setsockopt().
NOTE,
the current bpf_setsockopt(IP_TOS) is quite different from the
the do_ip_setsockopt(IP_TOS). For example, it does not take
the INET_ECN_MASK into the account for tcp and also does not adjust
sk->sk_priority. It looks like the current bpf_setsockopt(IP_TOS)
was referencing the IPV6_TCLASS implementation instead of IP_TOS.
This patch tries to rectify that by using the do_ip_setsockopt(IP_TOS).
While this is a behavior change, the do_ip_setsockopt(IP_TOS) behavior
is arguably what the user is expecting. At least, the INET_ECN_MASK bits
should be masked out for tcp.
Reviewed-by: Stanislav Fomichev <sdf@google.com>
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Link: https://lore.kernel.org/r/20220817061826.4180990-1-kafai@fb.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
After the prep work in the previous patches,
this patch removes all the dup code from bpf_setsockopt(SOL_TCP)
and reuses the do_tcp_setsockopt().
The existing optname white-list is refactored into a new
function sol_tcp_setsockopt(). The sol_tcp_setsockopt()
also calls the bpf_sol_tcp_setsockopt() to handle
the TCP_BPF_XXX specific optnames.
bpf_setsockopt(TCP_SAVE_SYN) now also allows a value 2 to
save the eth header also and it comes for free from
do_tcp_setsockopt().
Reviewed-by: Stanislav Fomichev <sdf@google.com>
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Link: https://lore.kernel.org/r/20220817061819.4180146-1-kafai@fb.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
The patch moves all bpf specific tcp optnames (TCP_BPF_XXX)
to a new function bpf_sol_tcp_setsockopt(). This will make
the next patch easier to follow.
Reviewed-by: Stanislav Fomichev <sdf@google.com>
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Link: https://lore.kernel.org/r/20220817061812.4179645-1-kafai@fb.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
After the prep work in the previous patches,
this patch removes most of the dup code from bpf_setsockopt(SOL_SOCKET)
and reuses them from sk_setsockopt().
The sock ptr test is added to the SO_RCVLOWAT because
the sk->sk_socket could be NULL in some of the bpf hooks.
The existing optname white-list is refactored into a new
function sol_socket_setsockopt().
Reviewed-by: Stanislav Fomichev <sdf@google.com>
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Link: https://lore.kernel.org/r/20220817061804.4178920-1-kafai@fb.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
This patch moves the "#ifdef CONFIG_XXX" check into the "if/else"
statement itself. The change is done for the bpf_setsockopt()
function only. It will make the latter patches easier to follow
without the surrounding ifdef macro.
Reviewed-by: Stanislav Fomichev <sdf@google.com>
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Link: https://lore.kernel.org/r/20220817061758.4178374-1-kafai@fb.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
capable()
Similar to the earlier patch that avoids sk_setsockopt() from
taking sk lock and doing capable test when called by bpf. This patch
changes do_ipv6_setsockopt() to use the sockopt_{lock,release}_sock()
and sockopt_[ns_]capable().
Reviewed-by: Stanislav Fomichev <sdf@google.com>
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Link: https://lore.kernel.org/r/20220817061744.4176893-1-kafai@fb.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
capable()
Similar to the earlier patch that avoids sk_setsockopt() from
taking sk lock and doing capable test when called by bpf. This patch
changes do_ip_setsockopt() to use the sockopt_{lock,release}_sock()
and sockopt_[ns_]capable().
Reviewed-by: Stanislav Fomichev <sdf@google.com>
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Link: https://lore.kernel.org/r/20220817061737.4176402-1-kafai@fb.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
capable()
Similar to the earlier patch that avoids sk_setsockopt() from
taking sk lock and doing capable test when called by bpf. This patch
changes do_tcp_setsockopt() to use the sockopt_{lock,release}_sock()
and sockopt_[ns_]capable().
Reviewed-by: Stanislav Fomichev <sdf@google.com>
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Link: https://lore.kernel.org/r/20220817061730.4176021-1-kafai@fb.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
sk_setsockopt()
When bpf program calling bpf_setsockopt(SOL_SOCKET),
it could be run in softirq and doesn't make sense to do the capable
check. There was a similar situation in bpf_setsockopt(TCP_CONGESTION).
In commit 8d650cdedaab ("tcp: fix tcp_set_congestion_control() use from bpf hook"),
tcp_set_congestion_control(..., cap_net_admin) was added to skip
the cap check for bpf prog.
This patch adds sockopt_ns_capable() and sockopt_capable() for
the sk_setsockopt() to use. They will consider the
has_current_bpf_ctx() before doing the ns_capable() and capable() test.
They are in EXPORT_SYMBOL for the ipv6 module to use in a latter patch.
Suggested-by: Stanislav Fomichev <sdf@google.com>
Reviewed-by: Stanislav Fomichev <sdf@google.com>
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Link: https://lore.kernel.org/r/20220817061723.4175820-1-kafai@fb.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
Most of the code in bpf_setsockopt(SOL_SOCKET) are duplicated from
the sk_setsockopt(). The number of supported optnames are
increasing ever and so as the duplicated code.
One issue in reusing sk_setsockopt() is that the bpf prog
has already acquired the sk lock. This patch adds a
has_current_bpf_ctx() to tell if the sk_setsockopt() is called from
a bpf prog. The bpf prog calling bpf_setsockopt() is either running
in_task() or in_serving_softirq(). Both cases have the current->bpf_ctx
initialized. Thus, the has_current_bpf_ctx() only needs to
test !!current->bpf_ctx.
This patch also adds sockopt_{lock,release}_sock() helpers
for sk_setsockopt() to use. These helpers will test
has_current_bpf_ctx() before acquiring/releasing the lock. They are
in EXPORT_SYMBOL for the ipv6 module to use in a latter patch.
Note on the change in sock_setbindtodevice(). sockopt_lock_sock()
is done in sock_setbindtodevice() instead of doing the lock_sock
in sock_bindtoindex(..., lock_sk = true).
Reviewed-by: Stanislav Fomichev <sdf@google.com>
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Link: https://lore.kernel.org/r/20220817061717.4175589-1-kafai@fb.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
A latter patch refactors bpf_setsockopt(SOL_SOCKET) with the
sock_setsockopt() to avoid code duplication and code
drift between the two duplicates.
The current sock_setsockopt() takes sock ptr as the argument.
The very first thing of this function is to get back the sk ptr
by 'sk = sock->sk'.
bpf_setsockopt() could be called when the sk does not have
the sock ptr created. Meaning sk->sk_socket is NULL. For example,
when a passive tcp connection has just been established but has yet
been accept()-ed. Thus, it cannot use the sock_setsockopt(sk->sk_socket)
or else it will pass a NULL ptr.
This patch moves all sock_setsockopt implementation to the newly
added sk_setsockopt(). The new sk_setsockopt() takes a sk ptr
and immediately gets the sock ptr by 'sock = sk->sk_socket'
The existing sock_setsockopt(sock) is changed to call
sk_setsockopt(sock->sk). All existing callers have both sock->sk
and sk->sk_socket pointer.
The latter patch will make bpf_setsockopt(SOL_SOCKET) call
sk_setsockopt(sk) directly. The bpf_setsockopt(SOL_SOCKET) does
not use the optnames that require sk->sk_socket, so it will
be safe.
Reviewed-by: Stanislav Fomichev <sdf@google.com>
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Link: https://lore.kernel.org/r/20220817061711.4175048-1-kafai@fb.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
Commit 451ef36bd229 ("ip_tunnels: Add new flow flags field to ip_tunnel_key")
added a "flow_flags" member to struct ip_tunnel_key which was later used by
the commit in the fixes tag to avoid dropping packets with sources that
aren't locally configured when set in bpf_set_tunnel_key().
VXLAN and GENEVE were made to respect this flag, ip tunnels like IPIP and GRE
were not.
This commit fixes this omission by making ip_tunnel_init_flow() receive
the flow flags from the tunnel key in the relevant collect_md paths.
Fixes: b8fff748521c ("bpf: Set flow flag to allow any source IP in bpf_tunnel_key")
Signed-off-by: Eyal Birger <eyal.birger@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Paul Chaignon <paul@isovalent.com>
Link: https://lore.kernel.org/bpf/20220818074118.726639-1-eyal.birger@gmail.com
|
|
When skb->len==0, the recv_actor() returns 0 too, but we also use 0
for error conditions. This patch amends this by propagating the errors
to tcp_read_skb() so that we can distinguish skb->len==0 case from
error cases.
Fixes: 04919bed948d ("tcp: Introduce tcp_read_skb()")
Reported-by: Eric Dumazet <edumazet@google.com>
Cc: John Fastabend <john.fastabend@gmail.com>
Cc: Jakub Sitnicki <jakub@cloudflare.com>
Signed-off-by: Cong Wang <cong.wang@bytedance.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
As tcp_read_skb() only reads one skb at a time, the while loop is
unnecessary, we can turn it into an if. This also simplifies the
code logic.
Cc: Eric Dumazet <edumazet@google.com>
Cc: John Fastabend <john.fastabend@gmail.com>
Cc: Jakub Sitnicki <jakub@cloudflare.com>
Signed-off-by: Cong Wang <cong.wang@bytedance.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
tcp_cleanup_rbuf() retrieves the skb from sk_receive_queue, it
assumes the skb is not yet dequeued. This is no longer true for
tcp_read_skb() case where we dequeue the skb first.
Fix this by introducing a helper __tcp_cleanup_rbuf() which does
not require any skb and calling it in tcp_read_skb().
Fixes: 04919bed948d ("tcp: Introduce tcp_read_skb()")
Cc: Eric Dumazet <edumazet@google.com>
Cc: John Fastabend <john.fastabend@gmail.com>
Cc: Jakub Sitnicki <jakub@cloudflare.com>
Signed-off-by: Cong Wang <cong.wang@bytedance.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Before commit 965b57b469a5 ("net: Introduce a new proto_ops
->read_skb()"), skb was not dequeued from receive queue hence
when we close TCP socket skb can be just flushed synchronously.
After this commit, we have to uncharge skb immediately after being
dequeued, otherwise it is still charged in the original sock. And we
still need to retain skb->sk, as eBPF programs may extract sock
information from skb->sk. Therefore, we have to call
skb_set_owner_sk_safe() here.
Fixes: 965b57b469a5 ("net: Introduce a new proto_ops ->read_skb()")
Reported-and-tested-by: syzbot+a0e6f8738b58f7654417@syzkaller.appspotmail.com
Tested-by: Stanislav Fomichev <sdf@google.com>
Cc: Eric Dumazet <edumazet@google.com>
Cc: John Fastabend <john.fastabend@gmail.com>
Cc: Jakub Sitnicki <jakub@cloudflare.com>
Signed-off-by: Cong Wang <cong.wang@bytedance.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
If construction of the array of policies fails when recording
non-first policy we need to unwind.
netlink_policy_dump_add_policy() itself also needs fixing as
it currently gives up on error without recording the allocated
pointer in the pstate pointer.
Reported-by: syzbot+dc54d9ba8153b216cae0@syzkaller.appspotmail.com
Fixes: 50a896cf2d6f ("genetlink: properly support per-op policy dumping")
Link: https://lore.kernel.org/r/20220816161939.577583-1-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
ds->ops->port_stp_state_set() is, like most DSA methods, optional, and
if absent, the port is supposed to remain in the forwarding state (as
standalone). Such is the case with the mv88e6060 driver, which does not
offload the bridge layer. DSA warns that the STP state can't be changed
to FORWARDING as part of dsa_port_enable_rt(), when in fact it should not.
The error message is also not up to modern standards, so take the
opportunity to make it more descriptive.
Fixes: fd3645413197 ("net: dsa: change scope of STP state setter")
Reported-by: Sergei Antonov <saproj@gmail.com>
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Sergei Antonov <saproj@gmail.com>
Link: https://lore.kernel.org/r/20220816201445.1809483-1-vladimir.oltean@nxp.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Andrii Nakryiko says:
====================
bpf-next 2022-08-17
We've added 45 non-merge commits during the last 14 day(s) which contain
a total of 61 files changed, 986 insertions(+), 372 deletions(-).
The main changes are:
1) New bpf_ktime_get_tai_ns() BPF helper to access CLOCK_TAI, from Kurt
Kanzenbach and Jesper Dangaard Brouer.
2) Few clean ups and improvements for libbpf 1.0, from Andrii Nakryiko.
3) Expose crash_kexec() as kfunc for BPF programs, from Artem Savkov.
4) Add ability to define sleepable-only kfuncs, from Benjamin Tissoires.
5) Teach libbpf's bpf_prog_load() and bpf_map_create() to gracefully handle
unsupported names on old kernels, from Hangbin Liu.
6) Allow opting out from auto-attaching BPF programs by libbpf's BPF skeleton,
from Hao Luo.
7) Relax libbpf's requirement for shared libs to be marked executable, from
Henqgi Chen.
8) Improve bpf_iter internals handling of error returns, from Hao Luo.
9) Few accommodations in libbpf to support GCC-BPF quirks, from James Hilliard.
10) Fix BPF verifier logic around tracking dynptr ref_obj_id, from Joanne Koong.
11) bpftool improvements to handle full BPF program names better, from Manu
Bretelle.
12) bpftool fixes around libcap use, from Quentin Monnet.
13) BPF map internals clean ups and improvements around memory allocations,
from Yafang Shao.
14) Allow to use cgroup_get_from_file() on cgroupv1, allowing BPF cgroup
iterator to work on cgroupv1, from Yosry Ahmed.
15) BPF verifier internal clean ups, from Dave Marchevsky and Joanne Koong.
16) Various fixes and clean ups for selftests/bpf and vmtest.sh, from Daniel
Xu, Artem Savkov, Joanne Koong, Andrii Nakryiko, Shibin Koikkara Reeny.
* https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (45 commits)
selftests/bpf: Few fixes for selftests/bpf built in release mode
libbpf: Clean up deprecated and legacy aliases
libbpf: Streamline bpf_attr and perf_event_attr initialization
libbpf: Fix potential NULL dereference when parsing ELF
selftests/bpf: Tests libbpf autoattach APIs
libbpf: Allows disabling auto attach
selftests/bpf: Fix attach point for non-x86 arches in test_progs/lsm
libbpf: Making bpf_prog_load() ignore name if kernel doesn't support
selftests/bpf: Update CI kconfig
selftests/bpf: Add connmark read test
selftests/bpf: Add existing connection bpf_*_ct_lookup() test
bpftool: Clear errno after libcap's checks
bpf: Clear up confusion in bpf_skb_adjust_room()'s documentation
bpftool: Fix a typo in a comment
libbpf: Add names for auxiliary maps
bpf: Use bpf_map_area_alloc consistently on bpf map creation
bpf: Make __GFP_NOWARN consistent in bpf map creation
bpf: Use bpf_map_area_free instread of kvfree
bpf: Remove unneeded memset in queue_stack_map creation
libbpf: preserve errno across pr_warn/pr_info/pr_debug
...
====================
Link: https://lore.kernel.org/r/20220817215656.1180215-1-andrii@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Florian Westphal says:
====================
netfilter: conntrack and nf_tables bug fixes
The following patchset contains netfilter fixes for net.
Broken since 5.19:
A few ancient connection tracking helpers assume TCP packets cannot
exceed 64kb in size, but this isn't the case anymore with 5.19 when
BIG TCP got merged, from myself.
Regressions since 5.19:
1. 'conntrack -E expect' won't display anything because nfnetlink failed
to enable events for expectations, only for normal conntrack events.
2. partially revert change that added resched calls to a function that can
be in atomic context. Both broken and fixed up by myself.
Broken for several releases (up to original merge of nf_tables):
Several fixes for nf_tables control plane, from Pablo.
This fixes up resource leaks in error paths and adds more sanity
checks for mutually exclusive attributes/flags.
Kconfig:
NF_CONNTRACK_PROCFS is very old and doesn't provide all info provided
via ctnetlink, so it should not default to y. From Geert Uytterhoeven.
Selftests:
rework nft_flowtable.sh: it frequently indicated failure; the way it
tried to detect an offload failure did not work reliably.
* git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf:
testing: selftests: nft_flowtable.sh: rework test to detect offload failure
testing: selftests: nft_flowtable.sh: use random netns names
netfilter: conntrack: NF_CONNTRACK_PROCFS should no longer default to y
netfilter: nf_tables: check NFT_SET_CONCAT flag if field_count is specified
netfilter: nf_tables: disallow NFT_SET_ELEM_CATCHALL and NFT_SET_ELEM_INTERVAL_END
netfilter: nf_tables: NFTA_SET_ELEM_KEY_END requires concat and interval flags
netfilter: nf_tables: validate NFTA_SET_ELEM_OBJREF based on NFT_SET_OBJECT flag
netfilter: nf_tables: really skip inactive sets when allocating name
netfilter: nfnetlink: re-enable conntrack expectation events
netfilter: nf_tables: fix scheduling-while-atomic splat
netfilter: nf_ct_irc: cap packet search space to 4k
netfilter: nf_ct_ftp: prefer skb_linearize
netfilter: nf_ct_h323: cap packet size at 64k
netfilter: nf_ct_sane: remove pseudo skb linearization
netfilter: nf_tables: possible module reference underflow in error path
netfilter: nf_tables: disallow NFTA_SET_ELEM_KEY_END with NFT_SET_ELEM_INTERVAL_END flag
netfilter: nf_tables: use READ_ONCE and WRITE_ONCE for shared generation id access
====================
Link: https://lore.kernel.org/r/20220817140015.25843-1-fw@strlen.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Fix one kernel NULL pointer dereference as below:
[ 224.462334] Call Trace:
[ 224.462394] __tcp_bpf_recvmsg+0xd3/0x380
[ 224.462441] ? sock_has_perm+0x78/0xa0
[ 224.462463] tcp_bpf_recvmsg+0x12e/0x220
[ 224.462494] inet_recvmsg+0x5b/0xd0
[ 224.462534] __sys_recvfrom+0xc8/0x130
[ 224.462574] ? syscall_trace_enter+0x1df/0x2e0
[ 224.462606] ? __do_page_fault+0x2de/0x500
[ 224.462635] __x64_sys_recvfrom+0x24/0x30
[ 224.462660] do_syscall_64+0x5d/0x1d0
[ 224.462709] entry_SYSCALL_64_after_hwframe+0x65/0xca
In commit 9974d37ea75f ("skmsg: Fix invalid last sg check in
sk_msg_recvmsg()"), we change last sg check to sg_is_last(),
but in sockmap redirection case (without stream_parser/stream_verdict/
skb_verdict), we did not mark the end of the scatterlist. Check the
sk_msg_alloc, sk_msg_page_add, and bpf_msg_push_data functions, they all
do not mark the end of sg. They are expected to use sg.end for end
judgment. So the judgment of '(i != msg_rx->sg.end)' is added back here.
Fixes: 9974d37ea75f ("skmsg: Fix invalid last sg check in sk_msg_recvmsg()")
Signed-off-by: Liu Jian <liujian56@huawei.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Acked-by: Jakub Sitnicki <jakub@cloudflare.com>
Link: https://lore.kernel.org/bpf/20220809094915.150391-1-liujian56@huawei.com
|
|
The commit 9abc291812d7 ("batman-adv: tracing: Use the new __vstring()
helper") removed the usage of WARN_ON_ONCE and __dynamic_array in this
file. But it was forgotten to adjust the headers accordingly (dropping the
now no longer used ones).
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
|
|
This version will contain all the (major or even only minor) changes for
Linux 6.1.
The version number isn't a semantic version number with major and minor
information. It is just encoding the year of the expected publishing as
Linux -rc1 and the number of published versions this year (starting at 0).
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
|
|
Even though the normal strparser's init function has a return
value we got away with ignoring errors until now, as it only
validates the parameters and we were passing correct parameters.
tls_strp can fail to init on memory allocation errors, which
syzbot duly induced and reported.
Reported-by: syzbot+abd45eb849b05194b1b6@syzkaller.appspotmail.com
Fixes: 84c61fe1a75b ("tls: rx: do not use the standard strparser")
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Instead of the hardcoded TCP_TIMEOUT_INIT, this diff calls tcp_timeout_init
to initiate req->timeout like the non TFO SYN ACK case.
Tested using the following packetdrill script, on a host with a BPF
program that sets the initial connect timeout to 10ms.
`../../common/defaults.sh`
// Initialize connection
0 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3
+0 setsockopt(3, SOL_TCP, TCP_FASTOPEN, [1], 4) = 0
+0 bind(3, ..., ...) = 0
+0 listen(3, 1) = 0
+0 < S 0:0(0) win 32792 <mss 1000,sackOK,FO TFO_COOKIE>
+0 > S. 0:0(0) ack 1 <mss 1460,nop,nop,sackOK>
+.01 > S. 0:0(0) ack 1 <mss 1460,nop,nop,sackOK>
+.02 > S. 0:0(0) ack 1 <mss 1460,nop,nop,sackOK>
+.04 > S. 0:0(0) ack 1 <mss 1460,nop,nop,sackOK>
+.01 < . 1:1(0) ack 1 win 32792
+0 accept(3, ..., ...) = 4
Signed-off-by: Jie Meng <jmeng@fb.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Acked-by: Yuchung Cheng <ycheng@google.com>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
When we try to transmit an skb with metadata_dst attached (i.e. dst->dev
== NULL) through xfrm interface we can hit a null pointer dereference[1]
in xfrmi_xmit2() -> xfrm_lookup_with_ifid() due to the check for a
loopback skb device when there's no policy which dereferences dst->dev
unconditionally. Not having dst->dev can be interepreted as it not being
a loopback device, so just add a check for a null dst_orig->dev.
With this fix xfrm interface's Tx error counters go up as usual.
[1] net-next calltrace captured via netconsole:
BUG: kernel NULL pointer dereference, address: 00000000000000c0
#PF: supervisor read access in kernel mode
#PF: error_code(0x0000) - not-present page
PGD 0 P4D 0
Oops: 0000 [#1] PREEMPT SMP
CPU: 1 PID: 7231 Comm: ping Kdump: loaded Not tainted 5.19.0+ #24
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.0-1.fc36 04/01/2014
RIP: 0010:xfrm_lookup_with_ifid+0x5eb/0xa60
Code: 8d 74 24 38 e8 26 a4 37 00 48 89 c1 e9 12 fc ff ff 49 63 ed 41 83 fd be 0f 85 be 01 00 00 41 be ff ff ff ff 45 31 ed 48 8b 03 <f6> 80 c0 00 00 00 08 75 0f 41 80 bc 24 19 0d 00 00 01 0f 84 1e 02
RSP: 0018:ffffb0db82c679f0 EFLAGS: 00010246
RAX: 0000000000000000 RBX: ffffd0db7fcad430 RCX: ffffb0db82c67a10
RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffffb0db82c67a80
RBP: ffffb0db82c67a80 R08: ffffb0db82c67a14 R09: 0000000000000000
R10: 0000000000000000 R11: ffff8fa449667dc8 R12: ffffffff966db880
R13: 0000000000000000 R14: 00000000ffffffff R15: 0000000000000000
FS: 00007ff35c83f000(0000) GS:ffff8fa478480000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00000000000000c0 CR3: 000000001ebb7000 CR4: 0000000000350ee0
Call Trace:
<TASK>
xfrmi_xmit+0xde/0x460
? tcf_bpf_act+0x13d/0x2a0
dev_hard_start_xmit+0x72/0x1e0
__dev_queue_xmit+0x251/0xd30
ip_finish_output2+0x140/0x550
ip_push_pending_frames+0x56/0x80
raw_sendmsg+0x663/0x10a0
? try_charge_memcg+0x3fd/0x7a0
? __mod_memcg_lruvec_state+0x93/0x110
? sock_sendmsg+0x30/0x40
sock_sendmsg+0x30/0x40
__sys_sendto+0xeb/0x130
? handle_mm_fault+0xae/0x280
? do_user_addr_fault+0x1e7/0x680
? kvm_read_and_reset_apf_flags+0x3b/0x50
__x64_sys_sendto+0x20/0x30
do_syscall_64+0x34/0x80
entry_SYSCALL_64_after_hwframe+0x46/0xb0
RIP: 0033:0x7ff35cac1366
Code: eb 0b 00 f7 d8 64 89 02 48 c7 c0 ff ff ff ff eb b8 0f 1f 00 41 89 ca 64 8b 04 25 18 00 00 00 85 c0 75 11 b8 2c 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 72 c3 90 55 48 83 ec 30 44 89 4c 24 2c 4c 89
RSP: 002b:00007fff738e4028 EFLAGS: 00000246 ORIG_RAX: 000000000000002c
RAX: ffffffffffffffda RBX: 00007fff738e57b0 RCX: 00007ff35cac1366
RDX: 0000000000000040 RSI: 0000557164e4b450 RDI: 0000000000000003
RBP: 0000557164e4b450 R08: 00007fff738e7a2c R09: 0000000000000010
R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000040
R13: 00007fff738e5770 R14: 00007fff738e4030 R15: 0000001d00000001
</TASK>
Modules linked in: netconsole veth br_netfilter bridge bonding virtio_net [last unloaded: netconsole]
CR2: 00000000000000c0
CC: Steffen Klassert <steffen.klassert@secunet.com>
CC: Daniel Borkmann <daniel@iogearbox.net>
Fixes: 2d151d39073a ("xfrm: Add possibility to set the default to block if we have no policy")
Signed-off-by: Nikolay Aleksandrov <razor@blackwall.org>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
|
|
NF_CONNTRACK_PROCFS was marked obsolete in commit 54b07dca68557b09
("netfilter: provide config option to disable ancient procfs parts") in
v3.3.
Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org>
Signed-off-by: Florian Westphal <fw@strlen.de>
|
|
The input parameter p is unused in qdisc_create. Delete it.
Signed-off-by: Zhengchao Shao <shaozhengchao@huawei.com>
Link: https://lore.kernel.org/r/20220815061023.51318-1-shaozhengchao@huawei.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
In the gnet_stats_add_queue_cpu function, the qstats->qlen statistics
are incorrectly set to qcpu->backlog.
Fixes: 448e163f8b9b ("gen_stats: Add gnet_stats_add_queue()")
Signed-off-by: Zhengchao Shao <shaozhengchao@huawei.com>
Link: https://lore.kernel.org/r/20220815030848.276746-1-shaozhengchao@huawei.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Return value of unregister_qdisc is unused, remove it.
Signed-off-by: Zhengchao Shao <shaozhengchao@huawei.com>
Link: https://lore.kernel.org/r/20220815030417.271894-1-shaozhengchao@huawei.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
When bulk delete command is received in the rtnetlink_rcv_msg function,
if bulk delete is not supported, module_put is not called to release
the reference counting. As a result, module reference count is leaked.
Fixes: a6cec0bcd342 ("net: rtnetlink: add bulk delete support flag")
Signed-off-by: Zhengchao Shao <shaozhengchao@huawei.com>
Acked-by: Nikolay Aleksandrov <razor@blackwall.org>
Link: https://lore.kernel.org/r/20220815024629.240367-1-shaozhengchao@huawei.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Every cgroup knows all its ancestors through its ->ancestor_ids[]. There's
no advantage to remembering the IDs instead of the pointers directly and
this makes the array useless for finding an actual ancestor cgroup forcing
cgroup_ancestor() to iteratively walk up the hierarchy instead. Let's
replace cgroup->ancestor_ids[] with ->ancestors[] and remove the walking-up
from cgroup_ancestor().
While at it, improve comments around cgroup_root->cgrp_ancestor_storage.
This patch shouldn't cause user-visible behavior differences.
v2: Update cgroup_ancestor() to use ->ancestors[].
v3: cgroup_root->cgrp_ancestor_storage's type is updated to match
cgroup->ancestors[]. Better comments.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Namhyung Kim <namhyung@kernel.org>
|
|
Since f3a2181e16f1 ("netfilter: nf_tables: Support for sets with
multiple ranged fields"), it possible to combine intervals and
concatenations. Later on, ef516e8625dd ("netfilter: nf_tables:
reintroduce the NFT_SET_CONCAT flag") provides the NFT_SET_CONCAT flag
for userspace to report that the set stores a concatenation.
Make sure NFT_SET_CONCAT is set on if field_count is specified for
consistency. Otherwise, if NFT_SET_CONCAT is specified with no
field_count, bail out with EINVAL.
Fixes: ef516e8625dd ("netfilter: nf_tables: reintroduce the NFT_SET_CONCAT flag")
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
|
|
NFT_SET_ELEM_INTERVAL_END
These flags are mutually exclusive, report EINVAL in this case.
Fixes: aaa31047a6d2 ("netfilter: nftables: add catch-all set element support")
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
|