summaryrefslogtreecommitdiffstats
path: root/net
AgeCommit message (Collapse)AuthorFilesLines
2021-08-06Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nfJakub Kicinski17-83/+139
Pablo Neira Ayuso says: ==================== Netfilter fixes for net The following patchset contains Netfilter fixes for net: 1) Restrict range element expansion in ipset to avoid soft lockup, from Jozsef Kadlecsik. 2) Memleak in error path for nf_conntrack_bridge for IPv4 packets, from Yajun Deng. 3) Simplify conntrack garbage collection strategy to avoid frequent wake-ups, from Florian Westphal. 4) Fix NFNLA_HOOK_FUNCTION_NAME string, do not include module name. 5) Missing chain family netlink attribute in chain description in nfnetlink_hook. 6) Incorrect sequence number on nfnetlink_hook dumps. 7) Use netlink request family in reply message for consistency. 8) Remove offload_pickup sysctl, use conntrack for established state instead, from Florian Westphal. 9) Translate NFPROTO_INET/ingress to NFPROTO_NETDEV/ingress, since NFPROTO_INET is not exposed through nfnetlink_hook. * git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf: netfilter: nfnetlink_hook: translate inet ingress to netdev netfilter: conntrack: remove offload_pickup sysctl again netfilter: nfnetlink_hook: Use same family as request message netfilter: nfnetlink_hook: use the sequence number of the request message netfilter: nfnetlink_hook: missing chain family netfilter: nfnetlink_hook: strip off module name from hookfn netfilter: conntrack: collect all entries in one cycle netfilter: nf_conntrack_bridge: Fix memory leak when error netfilter: ipset: Limit the maximal range of consecutive elements to add/delete ==================== Link: https://lore.kernel.org/r/20210806151149.6356-1-pablo@netfilter.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-08-06netfilter: nfnetlink_hook: translate inet ingress to netdevPablo Neira Ayuso1-1/+7
The NFPROTO_INET pseudofamily is not exposed through this new netlink interface. The netlink dump either shows NFPROTO_IPV4 or NFPROTO_IPV6 for NFPROTO_INET prerouting/input/forward/output/postrouting hooks. The NFNLA_CHAIN_FAMILY attribute provides the family chain, which specifies if this hook applies to inet traffic only (either IPv4 or IPv6). Translate the inet/ingress hook to netdev/ingress to fully hide the NFPROTO_INET implementation details. Fixes: e2cf17d3774c ("netfilter: add new hook nfnl subsystem") Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2021-08-06netfilter: conntrack: remove offload_pickup sysctl againFlorian Westphal4-21/+8
These two sysctls were added because the hardcoded defaults (2 minutes, tcp, 30 seconds, udp) turned out to be too low for some setups. They appeared in 5.14-rc1 so it should be fine to remove it again. Marcelo convinced me that there should be no difference between a flow that was offloaded vs. a flow that was not wrt. timeout handling. Thus the default is changed to those for TCP established and UDP stream, 5 days and 120 seconds, respectively. Marcelo also suggested to account for the timeout value used for the offloading, this avoids increase beyond the value in the conntrack-sysctl and will also instantly expire the conntrack entry with altered sysctls. Example: nf_conntrack_udp_timeout_stream=60 nf_flowtable_udp_timeout=60 This will remove offloaded udp flows after one minute, rather than two. An earlier version of this patch also cleared the ASSURED bit to allow nf_conntrack to evict the entry via early_drop (i.e., table full). However, it looks like we can safely assume that connection timed out via HW is still in established state, so this isn't needed. Quoting Oz: [..] the hardware sends all packets with a set FIN flags to sw. [..] Connections that are aged in hardware are expected to be in the established state. In case it turns out that back-to-sw-path transition can occur for 'dodgy' connections too (e.g., one side disappeared while software-path would have been in RETRANS timeout), we can adjust this later. Cc: Oz Shlomo <ozsh@nvidia.com> Cc: Paul Blakey <paulb@nvidia.com> Suggested-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com> Signed-off-by: Florian Westphal <fw@strlen.de> Reviewed-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com> Reviewed-by: Oz Shlomo <ozsh@nvidia.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2021-08-06netfilter: nfnetlink_hook: Use same family as request messagePablo Neira Ayuso1-3/+3
Use the same family as the request message, for consistency. The netlink payload provides sufficient information to describe the hook object, including the family. This makes it easier to userspace to correlate the hooks are that visited by the packets for a certain family. Fixes: e2cf17d3774c ("netfilter: add new hook nfnl subsystem") Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2021-08-06netfilter: nfnetlink_hook: use the sequence number of the request messagePablo Neira Ayuso1-1/+2
The sequence number allows to correlate the netlink reply message (as part of the dump) with the original request message. The cb->seq field is internally used to detect an interference (update) of the hook list during the netlink dump, do not use it as sequence number in the netlink dump header. Fixes: e2cf17d3774c ("netfilter: add new hook nfnl subsystem") Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2021-08-06netfilter: nfnetlink_hook: missing chain familyPablo Neira Ayuso1-2/+6
The family is relevant for pseudo-families like NFPROTO_INET otherwise the user needs to rely on the hook function name to differentiate it from NFPROTO_IPV4 and NFPROTO_IPV6 names. Add nfnl_hook_chain_desc_attributes instead of using the existing NFTA_CHAIN_* attributes, since these do not provide a family number. Fixes: e2cf17d3774c ("netfilter: add new hook nfnl subsystem") Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2021-08-06netfilter: nfnetlink_hook: strip off module name from hookfnPablo Neira Ayuso1-0/+1
NFNLA_HOOK_FUNCTION_NAME should include the hook function name only, the module name is already provided by NFNLA_HOOK_MODULE_NAME. Fixes: e2cf17d3774c ("netfilter: add new hook nfnl subsystem") Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2021-08-06netfilter: conntrack: collect all entries in one cycleFlorian Westphal1-49/+22
Michal Kubecek reports that conntrack gc is responsible for frequent wakeups (every 125ms) on idle systems. On busy systems, timed out entries are evicted during lookup. The gc worker is only needed to remove entries after system becomes idle after a busy period. To resolve this, always scan the entire table. If the scan is taking too long, reschedule so other work_structs can run and resume from next bucket. After a completed scan, wait for 2 minutes before the next cycle. Heuristics for faster re-schedule are removed. GC_SCAN_INTERVAL could be exposed as a sysctl in the future to allow tuning this as-needed or even turn the gc worker off. Reported-by: Michal Kubecek <mkubecek@suse.cz> Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2021-08-05Merge tag 'net-5.14-rc5' of ↵Linus Torvalds15-44/+129
git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net Pull networking fixes from Jakub Kicinski: "Including fixes from ipsec. Current release - regressions: - sched: taprio: fix init procedure to avoid inf loop when dumping - sctp: move the active_key update after sh_keys is added Current release - new code bugs: - sparx5: fix build with old GCC & bitmask on 32-bit targets Previous releases - regressions: - xfrm: redo the PREEMPT_RT RCU vs hash_resize_mutex deadlock fix - xfrm: fixes for the compat netlink attribute translator - phy: micrel: Fix detection of ksz87xx switch Previous releases - always broken: - gro: set inner transport header offset in tcp/udp GRO hook to avoid crashes when such packets reach GSO - vsock: handle VIRTIO_VSOCK_OP_CREDIT_REQUEST, as required by spec - dsa: sja1105: fix static FDB entries on SJA1105P/Q/R/S and SJA1110 - bridge: validate the NUD_PERMANENT bit when adding an extern_learn FDB entry - usb: lan78xx: don't modify phy_device state concurrently - usb: pegasus: check for errors of IO routines" * tag 'net-5.14-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (48 commits) net: vxge: fix use-after-free in vxge_device_unregister net: fec: fix use-after-free in fec_drv_remove net: pegasus: fix uninit-value in get_interrupt_interval net: ethernet: ti: am65-cpsw: fix crash in am65_cpsw_port_offload_fwd_mark_update() bnx2x: fix an error code in bnx2x_nic_load() net: wwan: iosm: fix recursive lock acquire in unregister net: wwan: iosm: correct data protocol mask bit net: wwan: iosm: endianness type correction net: wwan: iosm: fix lkp buildbot warning net: usb: lan78xx: don't modify phy_device state concurrently docs: networking: netdevsim rules net: usb: pegasus: Remove the changelog and DRIVER_VERSION. net: usb: pegasus: Check the return value of get_geristers() and friends; net/prestera: Fix devlink groups leakage in error flow net: sched: fix lockdep_set_class() typo error for sch->seqlock net: dsa: qca: ar9331: reorder MDIO write sequence VSOCK: handle VIRTIO_VSOCK_OP_CREDIT_REQUEST mptcp: drop unused rcu member in mptcp_pm_addr_entry net: ipv6: fix returned variable type in ip6_skb_dst_mtu nfp: update ethtool reporting of pauseframe control ...
2021-08-05Bluetooth: defer cleanup of resources in hci_unregister_dev()Tetsuo Handa3-24/+44
syzbot is hitting might_sleep() warning at hci_sock_dev_event() due to calling lock_sock() with rw spinlock held [1]. It seems that history of this locking problem is a trial and error. Commit b40df5743ee8 ("[PATCH] bluetooth: fix socket locking in hci_sock_dev_event()") in 2.6.21-rc4 changed bh_lock_sock() to lock_sock() as an attempt to fix lockdep warning. Then, commit 4ce61d1c7a8e ("[BLUETOOTH]: Fix locking in hci_sock_dev_event().") in 2.6.22-rc2 changed lock_sock() to local_bh_disable() + bh_lock_sock_nested() as an attempt to fix the sleep in atomic context warning. Then, commit 4b5dd696f81b ("Bluetooth: Remove local_bh_disable() from hci_sock.c") in 3.3-rc1 removed local_bh_disable(). Then, commit e305509e678b ("Bluetooth: use correct lock to prevent UAF of hdev object") in 5.13-rc5 again changed bh_lock_sock_nested() to lock_sock() as an attempt to fix CVE-2021-3573. This difficulty comes from current implementation that hci_sock_dev_event(HCI_DEV_UNREG) is responsible for dropping all references from sockets because hci_unregister_dev() immediately reclaims resources as soon as returning from hci_sock_dev_event(HCI_DEV_UNREG). But the history suggests that hci_sock_dev_event(HCI_DEV_UNREG) was not doing what it should do. Therefore, instead of trying to detach sockets from device, let's accept not detaching sockets from device at hci_sock_dev_event(HCI_DEV_UNREG), by moving actual cleanup of resources from hci_unregister_dev() to hci_cleanup_dev() which is called by bt_host_release() when all references to this unregistered device (which is a kobject) are gone. Since hci_sock_dev_event(HCI_DEV_UNREG) no longer resets hci_pi(sk)->hdev, we need to check whether this device was unregistered and return an error based on HCI_UNREGISTER flag. There might be subtle behavioral difference in "monitor the hdev" functionality; please report if you found something went wrong due to this patch. Link: https://syzkaller.appspot.com/bug?extid=a5df189917e79d5e59c9 [1] Reported-by: syzbot <syzbot+a5df189917e79d5e59c9@syzkaller.appspotmail.com> Suggested-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Fixes: e305509e678b ("Bluetooth: use correct lock to prevent UAF of hdev object") Acked-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-08-04Merge branch 'master' of ↵David S. Miller4-26/+67
git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec Steffen Klassert says: ==================== pull request (net): ipsec 2021-08-04 1) Fix a sysbot reported memory leak in xfrm_user_rcv_msg. From Pavel Skripkin. 2) Revert "xfrm: policy: Read seqcount outside of rcu-read side in xfrm_policy_lookup_bytype". This commit tried to fix a lockin bug, but only cured some of the symptoms. A proper fix is applied on top of this revert. 3) Fix a locking bug on xfrm state hash resize. A recent change on sequence counters accidentally repaced a spinlock by a mutex. Fix from Frederic Weisbecker. 4) Fix possible user-memory-access in xfrm_user_rcv_msg_compat(). From Dmitry Safonov. 5) Add initialiation sefltest fot xfrm_spdattr_type_t. From Dmitry Safonov. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2021-08-04net: sched: fix lockdep_set_class() typo error for sch->seqlockYunsheng Lin1-1/+1
According to comment in qdisc_alloc(), sch->seqlock's lockdep class key should be set to qdisc_tx_busylock, due to possible type error, sch->busylock's lockdep class key is set to qdisc_tx_busylock, which is duplicated because sch->busylock's lockdep class key is already set in qdisc_alloc(). So fix it by replacing sch->busylock with sch->seqlock. Fixes: 96009c7d500e ("sched: replace __QDISC_STATE_RUNNING bit with a spin lock") Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-08-04netfilter: nf_conntrack_bridge: Fix memory leak when errorYajun Deng1-0/+6
It should be added kfree_skb_list() when err is not equal to zero in nf_br_ip_fragment(). v2: keep this aligned with IPv6. v3: modify iter.frag_list to iter.frag. Fixes: 3c171f496ef5 ("netfilter: bridge: add connection tracking system") Signed-off-by: Yajun Deng <yajun.deng@linux.dev> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2021-08-04netfilter: ipset: Limit the maximal range of consecutive elements to add/deleteJozsef Kadlecsik10-7/+85
The range size of consecutive elements were not limited. Thus one could define a huge range which may result soft lockup errors due to the long execution time. Now the range size is limited to 2^20 entries. Reported-by: Brad Spengler <spender@grsecurity.net> Signed-off-by: Jozsef Kadlecsik <kadlec@netfilter.org> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2021-08-03VSOCK: handle VIRTIO_VSOCK_OP_CREDIT_REQUESTHarshavardhan Unnibhavi1-0/+3
The original implementation of the virtio-vsock driver does not handle a VIRTIO_VSOCK_OP_CREDIT_REQUEST as required by the virtio-vsock specification. The vsock device emulated by vhost-vsock and the virtio-vsock driver never uses this request, which was probably why nobody noticed it. However, another implementation of the device may use this request type. Hence, this commit introduces a way to handle an explicit credit request by responding with a corresponding credit update as required by the virtio-vsock specification. Fixes: 06a8fc78367d ("VSOCK: Introduce virtio_vsock_common.ko") Signed-off-by: Harshavardhan Unnibhavi <harshanavkis@gmail.com> Reviewed-by: Stefano Garzarella <sgarzare@redhat.com> Acked-by: Michael S. Tsirkin <mst@redhat.com> Link: https://lore.kernel.org/r/20210802173506.2383-1-harshanavkis@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-08-03mptcp: drop unused rcu member in mptcp_pm_addr_entryGeliang Tang1-1/+0
kfree_rcu() had been removed from pm_netlink.c, so this rcu field in struct mptcp_pm_addr_entry became useless. Let's drop it. Signed-off-by: Geliang Tang <geliangtang@gmail.com> Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Link: https://lore.kernel.org/r/20210802231914.54709-1-mathew.j.martineau@linux.intel.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-08-03sctp: move the active_key update after sh_keys is addedXin Long1-5/+9
In commit 58acd1009226 ("sctp: update active_key for asoc when old key is being replaced"), sctp_auth_asoc_init_active_key() is called to update the active_key right after the old key is deleted and before the new key is added, and it caused that the active_key could be found with the key_id. In Ying Xu's testing, the BUG_ON in sctp_auth_asoc_init_active_key() was triggered: [ ] kernel BUG at net/sctp/auth.c:416! [ ] RIP: 0010:sctp_auth_asoc_init_active_key.part.8+0xe7/0xf0 [sctp] [ ] Call Trace: [ ] sctp_auth_set_key+0x16d/0x1b0 [sctp] [ ] sctp_setsockopt.part.33+0x1ba9/0x2bd0 [sctp] [ ] __sys_setsockopt+0xd6/0x1d0 [ ] __x64_sys_setsockopt+0x20/0x30 [ ] do_syscall_64+0x5b/0x1a0 So fix it by moving the active_key update after sh_keys is added. Fixes: 58acd1009226 ("sctp: update active_key for asoc when old key is being replaced") Reported-by: Ying Xu <yinxu@redhat.com> Signed-off-by: Xin Long <lucien.xin@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-08-03net: really fix the build...David S. Miller1-1/+15
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-08-02net: bridge: validate the NUD_PERMANENT bit when adding an extern_learn FDB ↵Vladimir Oltean3-8/+27
entry Currently it is possible to add broken extern_learn FDB entries to the bridge in two ways: 1. Entries pointing towards the bridge device that are not local/permanent: ip link add br0 type bridge bridge fdb add 00:01:02:03:04:05 dev br0 self extern_learn static 2. Entries pointing towards the bridge device or towards a port that are marked as local/permanent, however the bridge does not process the 'permanent' bit in any way, therefore they are recorded as though they aren't permanent: ip link add br0 type bridge bridge fdb add 00:01:02:03:04:05 dev br0 self extern_learn permanent Since commit 52e4bec15546 ("net: bridge: switchdev: treat local FDBs the same as entries towards the bridge"), these incorrect FDB entries can even trigger NULL pointer dereferences inside the kernel. This is because that commit made the assumption that all FDB entries that are not local/permanent have a valid destination port. For context, local / permanent FDB entries either have fdb->dst == NULL, and these point towards the bridge device and are therefore local and not to be used for forwarding, or have fdb->dst == a net_bridge_port structure (but are to be treated in the same way, i.e. not for forwarding). That assumption _is_ correct as long as things are working correctly in the bridge driver, i.e. we cannot logically have fdb->dst == NULL under any circumstance for FDB entries that are not local. However, the extern_learn code path where FDB entries are managed by a user space controller show that it is possible for the bridge kernel driver to misinterpret the NUD flags of an entry transmitted by user space, and end up having fdb->dst == NULL while not being a local entry. This is invalid and should be rejected. Before, the two commands listed above both crashed the kernel in this check from br_switchdev_fdb_notify: struct net_device *dev = info.is_local ? br->dev : dst->dev; info.is_local == false, dst == NULL. After this patch, the invalid entry added by the first command is rejected: ip link add br0 type bridge && bridge fdb add 00:01:02:03:04:05 dev br0 self extern_learn static; ip link del br0 Error: bridge: FDB entry towards bridge must be permanent. and the valid entry added by the second command is properly treated as a local address and does not crash br_switchdev_fdb_notify anymore: ip link add br0 type bridge && bridge fdb add 00:01:02:03:04:05 dev br0 self extern_learn permanent; ip link del br0 Fixes: eb100e0e24a2 ("net: bridge: allow to add externally learned entries from user-space") Reported-by: syzbot+9ba1174359adba5a5b7c@syzkaller.appspotmail.com Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Acked-by: Nikolay Aleksandrov <nikolay@nvidia.com> Link: https://lore.kernel.org/r/20210801231730.7493-1-vladimir.oltean@nxp.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-08-02net/sched: taprio: Fix init procedureYannick Vignon1-2/+0
Commit 13511704f8d759 ("net: taprio offload: enforce qdisc to netdev queue mapping") resulted in duplicate entries in the qdisc hash. While this did not impact the overall operation of the qdisc and taprio code paths, it did result in an infinite loop when dumping the qdisc properties, at least on one target (NXP LS1028 ARDB). Removing the duplicate call to qdisc_hash_add() solves the problem. Fixes: 13511704f8d759 ("net: taprio offload: enforce qdisc to netdev queue mapping") Signed-off-by: Yannick Vignon <yannick.vignon@nxp.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-08-02net, gro: Set inner transport header offset in tcp/udp GRO hookJakub Sitnicki2-0/+7
GSO expects inner transport header offset to be valid when skb->encapsulation flag is set. GSO uses this value to calculate the length of an individual segment of a GSO packet in skb_gso_transport_seglen(). However, tcp/udp gro_complete callbacks don't update the skb->inner_transport_header when processing an encapsulated TCP/UDP segment. As a result a GRO skb has ->inner_transport_header set to a value carried over from earlier skb processing. This can have mild to tragic consequences. From miscalculating the GSO segment length to triggering a page fault [1], when trying to read TCP/UDP header at an address past the skb->data page. The latter scenario leads to an oops report like so: BUG: unable to handle page fault for address: ffff9fa7ec00d008 #PF: supervisor read access in kernel mode #PF: error_code(0x0000) - not-present page PGD 123f201067 P4D 123f201067 PUD 123f209067 PMD 0 Oops: 0000 [#1] SMP NOPTI CPU: 44 PID: 0 Comm: swapper/44 Not tainted 5.4.53-cloudflare-2020.7.21 #1 Hardware name: HYVE EDGE-METAL-GEN10/HS-1811DLite1, BIOS V2.15 02/21/2020 RIP: 0010:skb_gso_transport_seglen+0x44/0xa0 Code: c0 41 83 e0 11 f6 87 81 00 00 00 20 74 30 0f b7 87 aa 00 00 00 0f [...] RSP: 0018:ffffad8640bacbb8 EFLAGS: 00010202 RAX: 000000000000feda RBX: ffff9fcc8d31bc00 RCX: ffff9fa7ec00cffc RDX: ffff9fa7ebffdec0 RSI: 000000000000feda RDI: 0000000000000122 RBP: 00000000000005c4 R08: 0000000000000001 R09: 0000000000000000 R10: ffff9fe588ae3800 R11: ffff9fe011fc92f0 R12: ffff9fcc8d31bc00 R13: ffff9fe0119d4300 R14: 00000000000005c4 R15: ffff9fba57d70900 FS: 0000000000000000(0000) GS:ffff9fe68df00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: ffff9fa7ec00d008 CR3: 0000003e99b1c000 CR4: 0000000000340ee0 Call Trace: <IRQ> skb_gso_validate_network_len+0x11/0x70 __ip_finish_output+0x109/0x1c0 ip_sublist_rcv_finish+0x57/0x70 ip_sublist_rcv+0x2aa/0x2d0 ? ip_rcv_finish_core.constprop.0+0x390/0x390 ip_list_rcv+0x12b/0x14f __netif_receive_skb_list_core+0x2a9/0x2d0 netif_receive_skb_list_internal+0x1b5/0x2e0 napi_complete_done+0x93/0x140 veth_poll+0xc0/0x19f [veth] ? mlx5e_napi_poll+0x221/0x610 [mlx5_core] net_rx_action+0x1f8/0x790 __do_softirq+0xe1/0x2bf irq_exit+0x8e/0xc0 do_IRQ+0x58/0xe0 common_interrupt+0xf/0xf </IRQ> The bug can be observed in a simple setup where we send IP/GRE/IP/TCP packets into a netns over a veth pair. Inside the netns, packets are forwarded to dummy device: trafgen -> [veth A]--[veth B] -forward-> [dummy] For veth B to GRO aggregate packets on receive, it needs to have an XDP program attached (for example, a trivial XDP_PASS). Additionally, for UDP, we need to enable GSO_UDP_L4 feature on the device: ip netns exec A ethtool -K AB rx-udp-gro-forwarding on The last component is an artificial delay to increase the chances of GRO batching happening: ip netns exec A tc qdisc add dev AB root \ netem delay 200us slot 5ms 10ms packets 2 bytes 64k With such a setup in place, the bug can be observed by tracing the skb outer and inner offsets when GSO skb is transmitted from the dummy device: tcp: FUNC DEV SKB_LEN NH TH ENC INH ITH GSO_SIZE GSO_TYPE ip_finish_output dumB 2830 270 290 1 294 254 1383 (tcpv4,gre,) ^^^ udp: FUNC DEV SKB_LEN NH TH ENC INH ITH GSO_SIZE GSO_TYPE ip_finish_output dumB 2818 270 290 1 294 254 1383 (gre,udp_l4,) ^^^ Fix it by updating the inner transport header offset in tcp/udp gro_complete callbacks, similar to how {inet,ipv6}_gro_complete callbacks update the inner network header offset, when skb->encapsulation flag is set. [1] https://lore.kernel.org/netdev/CAKxSbF01cLpZem2GFaUaifh0S-5WYViZemTicAg7FCHOnh6kug@mail.gmail.com/ Fixes: bf296b125b21 ("tcp: Add GRO support") Fixes: f993bc25e519 ("net: core: handle encapsulation offloads when computing segment lengths") Fixes: e20cf8d3f1f7 ("udp: implement GRO for plain UDP sockets.") Reported-by: Alex Forster <aforster@cloudflare.com> Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-07-30Merge tag 'net-5.14-rc4' of ↵Linus Torvalds29-134/+247
git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net Pull networking fixes from Jakub Kicinski: "Networking fixes for 5.14-rc4, including fixes from bpf, can, WiFi (mac80211) and netfilter trees. Current release - regressions: - mac80211: fix starting aggregation sessions on mesh interfaces Current release - new code bugs: - sctp: send pmtu probe only if packet loss in Search Complete state - bnxt_en: add missing periodic PHC overflow check - devlink: fix phys_port_name of virtual port and merge error - hns3: change the method of obtaining default ptp cycle - can: mcba_usb_start(): add missing urb->transfer_dma initialization Previous releases - regressions: - set true network header for ECN decapsulation - mlx5e: RX, avoid possible data corruption w/ relaxed ordering and LRO - phy: re-add check for PHY_BRCM_DIS_TXCRXC_NOENRGY on the BCM54811 PHY - sctp: fix return value check in __sctp_rcv_asconf_lookup Previous releases - always broken: - bpf: - more spectre corner case fixes, introduce a BPF nospec instruction for mitigating Spectre v4 - fix OOB read when printing XDP link fdinfo - sockmap: fix cleanup related races - mac80211: fix enabling 4-address mode on a sta vif after assoc - can: - raw: raw_setsockopt(): fix raw_rcv panic for sock UAF - j1939: j1939_session_deactivate(): clarify lifetime of session object, avoid UAF - fix number of identical memory leaks in USB drivers - tipc: - do not blindly write skb_shinfo frags when doing decryption - fix sleeping in tipc accept routine" * tag 'net-5.14-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (91 commits) gve: Update MAINTAINERS list can: esd_usb2: fix memory leak can: ems_usb: fix memory leak can: usb_8dev: fix memory leak can: mcba_usb_start(): add missing urb->transfer_dma initialization can: hi311x: fix a signedness bug in hi3110_cmd() MAINTAINERS: add Yasushi SHOJI as reviewer for the Microchip CAN BUS Analyzer Tool driver bpf: Fix leakage due to insufficient speculative store bypass mitigation bpf: Introduce BPF nospec instruction for mitigating Spectre v4 sis900: Fix missing pci_disable_device() in probe and remove net: let flow have same hash in two directions nfc: nfcsim: fix use after free during module unload tulip: windbond-840: Fix missing pci_disable_device() in probe and remove sctp: fix return value check in __sctp_rcv_asconf_lookup nfc: s3fwrn5: fix undefined parameter values in dev_err() net/mlx5: Fix mlx5_vport_tbl_attr chain from u16 to u32 net/mlx5e: Fix nullptr in mlx5e_hairpin_get_mdev() net/mlx5: Unload device upon firmware fatal error net/mlx5e: Fix page allocation failure for ptp-RQ over SF net/mlx5e: Fix page allocation failure for trap-RQ over SF ...
2021-07-29Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpfDavid S. Miller1-11/+28
Daniel Borkmann says: ==================== pull-request: bpf 2021-07-29 The following pull-request contains BPF updates for your *net* tree. We've added 9 non-merge commits during the last 14 day(s) which contain a total of 20 files changed, 446 insertions(+), 138 deletions(-). The main changes are: 1) Fix UBSAN out-of-bounds splat for showing XDP link fdinfo, from Lorenz Bauer. 2) Fix insufficient Spectre v4 mitigation in BPF runtime, from Daniel Borkmann, Piotr Krysiuk and Benedict Schlueter. 3) Batch of fixes for BPF sockmap found under stress testing, from John Fastabend. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2021-07-28af_unix: fix garbage collect vs MSG_PEEKMiklos Szeredi1-2/+49
unix_gc() assumes that candidate sockets can never gain an external reference (i.e. be installed into an fd) while the unix_gc_lock is held. Except for MSG_PEEK this is guaranteed by modifying inflight count under the unix_gc_lock. MSG_PEEK does not touch any variable protected by unix_gc_lock (file count is not), yet it needs to be serialized with garbage collection. Do this by locking/unlocking unix_gc_lock: 1) increment file count 2) lock/unlock barrier to make sure incremented file count is visible to garbage collection 3) install file into fd This is a lock barrier (unlike smp_mb()) that ensures that garbage collection is run completely before or completely after the barrier. Cc: <stable@vger.kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-07-28net: let flow have same hash in two directionszhang kai1-9/+9
using same source and destination ip/port for flow hash calculation within the two directions. Signed-off-by: zhang kai <zhangkaiheb@126.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-07-28sctp: fix return value check in __sctp_rcv_asconf_lookupMarcelo Ricardo Leitner1-1/+1
As Ben Hutchings noticed, this check should have been inverted: the call returns true in case of success. Reported-by: Ben Hutchings <ben@decadent.org.uk> Fixes: 0c5dc070ff3d ("sctp: validate from_addr_param return") Signed-off-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com> Reviewed-by: Xin Long <lucien.xin@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-07-27bpf, sockmap: Fix memleak on ingress msg enqueueJohn Fastabend1-6/+0
If backlog handler is running during a tear down operation we may enqueue data on the ingress msg queue while tear down is trying to free it. sk_psock_backlog() sk_psock_handle_skb() skb_psock_skb_ingress() sk_psock_skb_ingress_enqueue() sk_psock_queue_msg(psock,msg) spin_lock(ingress_lock) sk_psock_zap_ingress() _sk_psock_purge_ingerss_msg() _sk_psock_purge_ingress_msg() -- free ingress_msg list -- spin_unlock(ingress_lock) spin_lock(ingress_lock) list_add_tail(msg,ingress_msg) <- entry on list with no one left to free it. spin_unlock(ingress_lock) To fix we only enqueue from backlog if the ENABLED bit is set. The tear down logic clears the bit with ingress_lock set so we wont enqueue the msg in the last step. Fixes: 799aa7f98d53 ("skmsg: Avoid lock_sock() in sk_psock_backlog()") Signed-off-by: John Fastabend <john.fastabend@gmail.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Acked-by: Jakub Sitnicki <jakub@cloudflare.com> Acked-by: Martin KaFai Lau <kafai@fb.com> Link: https://lore.kernel.org/bpf/20210727160500.1713554-4-john.fastabend@gmail.com
2021-07-27bpf, sockmap: On cleanup we additionally need to remove cached skbJohn Fastabend1-6/+29
Its possible if a socket is closed and the receive thread is under memory pressure it may have cached a skb. We need to ensure these skbs are free'd along with the normal ingress_skb queue. Before 799aa7f98d53 ("skmsg: Avoid lock_sock() in sk_psock_backlog()") tear down and backlog processing both had sock_lock for the common case of socket close or unhash. So it was not possible to have both running in parrallel so all we would need is the kfree in those kernels. But, latest kernels include the commit 799aa7f98d5e and this requires a bit more work. Without the ingress_lock guarding reading/writing the state->skb case its possible the tear down could run before the state update causing it to leak memory or worse when the backlog reads the state it could potentially run interleaved with the tear down and we might end up free'ing the state->skb from tear down side but already have the reference from backlog side. To resolve such races we wrap accesses in ingress_lock on both sides serializing tear down and backlog case. In both cases this only happens after an EAGAIN error case so having an extra lock in place is likely fine. The normal path will skip the locks. Note, we check state->skb before grabbing lock. This works because we can only enqueue with the mutex we hold already. Avoiding a race on adding state->skb after the check. And if tear down path is running that is also fine if the tear down path then removes state->skb we will simply set skb=NULL and the subsequent goto is skipped. This slight complication avoids locking in normal case. With this fix we no longer see this warning splat from tcp side on socket close when we hit the above case with redirect to ingress self. [224913.935822] WARNING: CPU: 3 PID: 32100 at net/core/stream.c:208 sk_stream_kill_queues+0x212/0x220 [224913.935841] Modules linked in: fuse overlay bpf_preload x86_pkg_temp_thermal intel_uncore wmi_bmof squashfs sch_fq_codel efivarfs ip_tables x_tables uas xhci_pci ixgbe mdio xfrm_algo xhci_hcd wmi [224913.935897] CPU: 3 PID: 32100 Comm: fgs-bench Tainted: G I 5.14.0-rc1alu+ #181 [224913.935908] Hardware name: Dell Inc. Precision 5820 Tower/002KVM, BIOS 1.9.2 01/24/2019 [224913.935914] RIP: 0010:sk_stream_kill_queues+0x212/0x220 [224913.935923] Code: 8b 83 20 02 00 00 85 c0 75 20 5b 5d 41 5c 41 5d 41 5e 41 5f c3 48 89 df e8 2b 11 fe ff eb c3 0f 0b e9 7c ff ff ff 0f 0b eb ce <0f> 0b 5b 5d 41 5c 41 5d 41 5e 41 5f c3 90 0f 1f 44 00 00 41 57 41 [224913.935932] RSP: 0018:ffff88816271fd38 EFLAGS: 00010206 [224913.935941] RAX: 0000000000000ae8 RBX: ffff88815acd5240 RCX: dffffc0000000000 [224913.935948] RDX: 0000000000000003 RSI: 0000000000000ae8 RDI: ffff88815acd5460 [224913.935954] RBP: ffff88815acd5460 R08: ffffffff955c0ae8 R09: fffffbfff2e6f543 [224913.935961] R10: ffffffff9737aa17 R11: fffffbfff2e6f542 R12: ffff88815acd5390 [224913.935967] R13: ffff88815acd5480 R14: ffffffff98d0c080 R15: ffffffff96267500 [224913.935974] FS: 00007f86e6bd1700(0000) GS:ffff888451cc0000(0000) knlGS:0000000000000000 [224913.935981] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [224913.935988] CR2: 000000c0008eb000 CR3: 00000001020e0005 CR4: 00000000003706e0 [224913.935994] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [224913.936000] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [224913.936007] Call Trace: [224913.936016] inet_csk_destroy_sock+0xba/0x1f0 [224913.936033] __tcp_close+0x620/0x790 [224913.936047] tcp_close+0x20/0x80 [224913.936056] inet_release+0x8f/0xf0 [224913.936070] __sock_release+0x72/0x120 [224913.936083] sock_close+0x14/0x20 Fixes: a136678c0bdbb ("bpf: sk_msg, zap ingress queue on psock down") Signed-off-by: John Fastabend <john.fastabend@gmail.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Acked-by: Jakub Sitnicki <jakub@cloudflare.com> Acked-by: Martin KaFai Lau <kafai@fb.com> Link: https://lore.kernel.org/bpf/20210727160500.1713554-3-john.fastabend@gmail.com
2021-07-27bpf, sockmap: Zap ingress queues after stopping strparserJohn Fastabend1-2/+2
We don't want strparser to run and pass skbs into skmsg handlers when the psock is null. We just sk_drop them in this case. When removing a live socket from map it means extra drops that we do not need to incur. Move the zap below strparser close to avoid this condition. This way we stop the stream parser first stopping it from processing packets and then delete the psock. Fixes: a136678c0bdbb ("bpf: sk_msg, zap ingress queue on psock down") Signed-off-by: John Fastabend <john.fastabend@gmail.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Acked-by: Jakub Sitnicki <jakub@cloudflare.com> Acked-by: Martin KaFai Lau <kafai@fb.com> Link: https://lore.kernel.org/bpf/20210727160500.1713554-2-john.fastabend@gmail.com
2021-07-27net: llc: fix skb_over_panicPavel Skripkin2-2/+10
Syzbot reported skb_over_panic() in llc_pdu_init_as_xid_cmd(). The problem was in wrong LCC header manipulations. Syzbot's reproducer tries to send XID packet. llc_ui_sendmsg() is doing following steps: 1. skb allocation with size = len + header size len is passed from userpace and header size is 3 since addr->sllc_xid is set. 2. skb_reserve() for header_len = 3 3. filling all other space with memcpy_from_msg() Ok, at this moment we have fully loaded skb, only headers needs to be filled. Then code comes to llc_sap_action_send_xid_c(). This function pushes 3 bytes for LLC PDU header and initializes it. Then comes llc_pdu_init_as_xid_cmd(). It initalizes next 3 bytes *AFTER* LLC PDU header and call skb_push(skb, 3). This looks wrong for 2 reasons: 1. Bytes rigth after LLC header are user data, so this function was overwriting payload. 2. skb_push(skb, 3) call can cause skb_over_panic() since all free space was filled in llc_ui_sendmsg(). (This can happen is user passed 686 len: 686 + 14 (eth header) + 3 (LLC header) = 703. SKB_DATA_ALIGN(703) = 704) So, in this patch I added 2 new private constansts: LLC_PDU_TYPE_U_XID and LLC_PDU_LEN_U_XID. LLC_PDU_LEN_U_XID is used to correctly reserve header size to handle LLC + XID case. LLC_PDU_TYPE_U_XID is used by llc_pdu_header_init() function to push 6 bytes instead of 3. And finally I removed skb_push() call from llc_pdu_init_as_xid_cmd(). This changes should not affect other parts of LLC, since after all steps we just transmit buffer. Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2") Reported-and-tested-by: syzbot+5e5a981ad7cc54c4b2b4@syzkaller.appspotmail.com Signed-off-by: Pavel Skripkin <paskripkin@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-07-26sctp: delete addr based on sin6_scope_idChen Shen1-2/+3
sctp_inet6addr_event deletes 'addr' from 'local_addr_list' when setting netdev down, but it is possible to delete the incorrect entry (match the first one with the same ipaddr, but the different 'ifindex'), if there are some netdevs with the same 'local-link' ipaddr added already. It should delete the entry depending on 'sin6_addr' and 'sin6_scope_id' both. otherwise, the endpoint will call 'sctp_sf_ootb' if it can't find the according association when receives 'heartbeat', and finally will reply 'abort'. For example: 1.when linux startup the entries in local_addr_list: ifindex:35 addr:fe80::40:43ff:fe80:0 (eths0.201) ifindex:36 addr:fe80::40:43ff:fe80:0 (eths0.209) ifindex:37 addr:fe80::40:43ff:fe80:0 (eths0.210) the route table: local fe80::40:43ff:fe80:0 dev eths0.201 local fe80::40:43ff:fe80:0 dev eths0.209 local fe80::40:43ff:fe80:0 dev eths0.210 2.after 'ifconfig eths0.209 down' the entries in local_addr_list: ifindex:36 addr:fe80::40:43ff:fe80:0 (eths0.209) ifindex:37 addr:fe80::40:43ff:fe80:0 (eths0.210) the route table: local fe80::40:43ff:fe80:0 dev eths0.201 local fe80::40:43ff:fe80:0 dev eths0.210 3.asoc not found for src:[fe80::40:43ff:fe80:0]:37381 dst:[:1]:53335 ::1->fe80::40:43ff:fe80:0 HEARTBEAT fe80::40:43ff:fe80:0->::1 ABORT Signed-off-by: Chen Shen <peterchenshen@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-07-26net: xfrm: Fix end of loop tests for list_for_each_entryHarshvardhan Jha1-1/+1
The list_for_each_entry() iterator, "pos" in this code, can never be NULL so the warning will never be printed. Signed-off-by: Harshvardhan Jha <harshvardhan.jha@oracle.com> Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2021-07-25sctp: send pmtu probe only if packet loss in Search Complete stateXin Long1-1/+5
This patch is to introduce last_rtx_chunks into sctp_transport to detect if there's any packet retransmission/loss happened by checking against asoc's rtx_data_chunks in sctp_transport_pl_send(). If there is, namely, transport->last_rtx_chunks != asoc->rtx_data_chunks, the pmtu probe will be sent out. Otherwise, increment the pl.raise_count and return when it's in Search Complete state. With this patch, if in Search Complete state, which is a long period, it doesn't need to keep probing the current pmtu unless there's data packet loss. This will save quite some traffic. v1->v2: - add the missing Fixes tag. Fixes: 0dac127c0557 ("sctp: do black hole detection in search complete state") Signed-off-by: Xin Long <lucien.xin@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-07-25sctp: improve the code for pmtu probe send and recv updateXin Long2-26/+30
This patch does 3 things: - make sctp_transport_pl_send() and sctp_transport_pl_recv() return bool type to decide if more probe is needed to send. - pr_debug() only when probe is really needed to send. - count pl.raise_count in sctp_transport_pl_send() instead of sctp_transport_pl_recv(), and it's only incremented for the 1st probe for the same size. These are preparations for the next patch to make probes happen only when there's packet loss in Search Complete state. Signed-off-by: Xin Long <lucien.xin@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-07-25devlink: Fix phys_port_name of virtual port and merge errorParav Pandit1-8/+2
Merge commit cited in fixes tag was incorrect. Due to it phys_port_name of the virtual port resulted in incorrect name. Also the phys_port_name of the physical port was written twice due to the merge error. Fix it by removing the old code and inserting back the misplaced code. Related commits of interest in net and net-next branches that resulted in merge conflict are: in net-next branch: commit f285f37cb1e6 ("devlink: append split port number to the port name") in net branch: commit b28d8f0c25a9 ("devlink: Correct VIRTUAL port to not have phys_port attributes") Fixes: 126285651b7 ("Merge ra.kernel.org:/pub/scm/linux/kernel/git/netdev/net") Signed-off-by: Parav Pandit <parav@nvidia.com> Reported-by: Niklas Schnelle <schnelle@linux.ibm.com> Tested-by: Niklas Schnelle <schnelle@linux.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-07-24tipc: do not write skb_shinfo frags when doing decrytionXin Long1-10/+4
One skb's skb_shinfo frags are not writable, and they can be shared with other skbs' like by pskb_copy(). To write the frags may cause other skb's data crash. So before doing en/decryption, skb_cow_data() should always be called for a cloned or nonlinear skb if req dst is using the same sg as req src. While at it, the likely branch can be removed, as it will be covered by skb_cow_data(). Note that esp_input() has the same issue, and I will fix it in another patch. tipc_aead_encrypt() doesn't have this issue, as it only processes linear data in the unlikely branch. Fixes: fc1b6d6de220 ("tipc: introduce TIPC encryption & authentication") Reported-by: Shuang Li <shuali@redhat.com> Signed-off-by: Xin Long <lucien.xin@gmail.com> Acked-by: Jon Maloy <jmaloy@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-07-24can: j1939: j1939_xtp_rx_dat_one(): fix rxtimer value between consecutive ↵Zhang Changzhong1-1/+1
TP.DT to 750ms For receive side, the max time interval between two consecutive TP.DT should be 750ms. Fixes: 9d71dd0c7009 ("can: add support of SAE J1939 protocol") Link: https://lore.kernel.org/r/1625569210-47506-1-git-send-email-zhangchangzhong@huawei.com Cc: linux-stable <stable@vger.kernel.org> Signed-off-by: Zhang Changzhong <zhangchangzhong@huawei.com> Acked-by: Oleksij Rempel <o.rempel@pengutronix.de> Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
2021-07-24can: j1939: j1939_session_deactivate(): clarify lifetime of session objectOleksij Rempel1-2/+7
The j1939_session_deactivate() is decrementing the session ref-count and potentially can free() the session. This would cause use-after-free situation. However, the code calling j1939_session_deactivate() does always hold another reference to the session, so that it would not be free()ed in this code path. This patch adds a comment to make this clear and a WARN_ON, to ensure that future changes will not violate this requirement. Further this patch avoids dereferencing the session pointer as a precaution to avoid use-after-free if the session is actually free()ed. Fixes: 9d71dd0c7009 ("can: add support of SAE J1939 protocol") Link: https://lore.kernel.org/r/20210714111602.24021-1-o.rempel@pengutronix.de Reported-by: Xiaochen Zou <xzou017@ucr.edu> Signed-off-by: Oleksij Rempel <o.rempel@pengutronix.de> Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
2021-07-24can: raw: raw_setsockopt(): fix raw_rcv panic for sock UAFZiyang Xuan1-2/+18
We get a bug during ltp can_filter test as following. =========================================== [60919.264984] BUG: unable to handle kernel NULL pointer dereference at 0000000000000010 [60919.265223] PGD 8000003dda726067 P4D 8000003dda726067 PUD 3dda727067 PMD 0 [60919.265443] Oops: 0000 [#1] SMP PTI [60919.265550] CPU: 30 PID: 3638365 Comm: can_filter Kdump: loaded Tainted: G W 4.19.90+ #1 [60919.266068] RIP: 0010:selinux_socket_sock_rcv_skb+0x3e/0x200 [60919.293289] RSP: 0018:ffff8d53bfc03cf8 EFLAGS: 00010246 [60919.307140] RAX: 0000000000000000 RBX: 000000000000001d RCX: 0000000000000007 [60919.320756] RDX: 0000000000000001 RSI: ffff8d5104a8ed00 RDI: ffff8d53bfc03d30 [60919.334319] RBP: ffff8d9338056800 R08: ffff8d53bfc29d80 R09: 0000000000000001 [60919.347969] R10: ffff8d53bfc03ec0 R11: ffffb8526ef47c98 R12: ffff8d53bfc03d30 [60919.350320] perf: interrupt took too long (3063 > 2500), lowering kernel.perf_event_max_sample_rate to 65000 [60919.361148] R13: 0000000000000001 R14: ffff8d53bcf90000 R15: 0000000000000000 [60919.361151] FS: 00007fb78b6b3600(0000) GS:ffff8d53bfc00000(0000) knlGS:0000000000000000 [60919.400812] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [60919.413730] CR2: 0000000000000010 CR3: 0000003e3f784006 CR4: 00000000007606e0 [60919.426479] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [60919.439339] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [60919.451608] PKRU: 55555554 [60919.463622] Call Trace: [60919.475617] <IRQ> [60919.487122] ? update_load_avg+0x89/0x5d0 [60919.498478] ? update_load_avg+0x89/0x5d0 [60919.509822] ? account_entity_enqueue+0xc5/0xf0 [60919.520709] security_sock_rcv_skb+0x2a/0x40 [60919.531413] sk_filter_trim_cap+0x47/0x1b0 [60919.542178] ? kmem_cache_alloc+0x38/0x1b0 [60919.552444] sock_queue_rcv_skb+0x17/0x30 [60919.562477] raw_rcv+0x110/0x190 [can_raw] [60919.572539] can_rcv_filter+0xbc/0x1b0 [can] [60919.582173] can_receive+0x6b/0xb0 [can] [60919.591595] can_rcv+0x31/0x70 [can] [60919.600783] __netif_receive_skb_one_core+0x5a/0x80 [60919.609864] process_backlog+0x9b/0x150 [60919.618691] net_rx_action+0x156/0x400 [60919.627310] ? sched_clock_cpu+0xc/0xa0 [60919.635714] __do_softirq+0xe8/0x2e9 [60919.644161] do_softirq_own_stack+0x2a/0x40 [60919.652154] </IRQ> [60919.659899] do_softirq.part.17+0x4f/0x60 [60919.667475] __local_bh_enable_ip+0x60/0x70 [60919.675089] __dev_queue_xmit+0x539/0x920 [60919.682267] ? finish_wait+0x80/0x80 [60919.689218] ? finish_wait+0x80/0x80 [60919.695886] ? sock_alloc_send_pskb+0x211/0x230 [60919.702395] ? can_send+0xe5/0x1f0 [can] [60919.708882] can_send+0xe5/0x1f0 [can] [60919.715037] raw_sendmsg+0x16d/0x268 [can_raw] It's because raw_setsockopt() concurrently with unregister_netdevice_many(). Concurrent scenario as following. cpu0 cpu1 raw_bind raw_setsockopt unregister_netdevice_many unlist_netdevice dev_get_by_index raw_notifier raw_enable_filters ...... can_rx_register can_rcv_list_find(..., net->can.rx_alldev_list) ...... sock_close raw_release(sock_a) ...... can_receive can_rcv_filter(net->can.rx_alldev_list, ...) raw_rcv(skb, sock_a) BUG After unlist_netdevice(), dev_get_by_index() return NULL in raw_setsockopt(). Function raw_enable_filters() will add sock and can_filter to net->can.rx_alldev_list. Then the sock is closed. Followed by, we sock_sendmsg() to a new vcan device use the same can_filter. Protocol stack match the old receiver whose sock has been released on net->can.rx_alldev_list in can_rcv_filter(). Function raw_rcv() uses the freed sock. UAF BUG is triggered. We can find that the key issue is that net_device has not been protected in raw_setsockopt(). Use rtnl_lock to protect net_device in raw_setsockopt(). Fixes: c18ce101f2e4 ("[CAN]: Add raw protocol") Link: https://lore.kernel.org/r/20210722070819.1048263-1-william.xuanziyang@huawei.com Cc: linux-stable <stable@vger.kernel.org> Signed-off-by: Ziyang Xuan <william.xuanziyang@huawei.com> Acked-by: Oliver Hartkopp <socketcan@hartkopp.net> Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
2021-07-23Merge tag 'mac80211-for-net-2021-07-23' of ↵David S. Miller7-33/+63
git://git.kernel.org/pub/scm/linux/kernel/git/jberg/mac80211 Couple of fixes: * fix aggregation on mesh * fix late enabling of 4-addr mode * leave monitor SKBs with some headroom * limit band information for old applications * fix virt-wifi WARN_ON * fix memory leak in cfg80211 BSS list maintenance
2021-07-23net: qrtr: fix memory leaksPavel Skripkin1-1/+5
Syzbot reported memory leak in qrtr. The problem was in unputted struct sock. qrtr_local_enqueue() function calls qrtr_port_lookup() which takes sock reference if port was found. Then there is the following check: if (!ipc || &ipc->sk == skb->sk) { ... return -ENODEV; } Since we should drop the reference before returning from this function and ipc can be non-NULL inside this if, we should add qrtr_port_put() inside this if. The similar corner case is in qrtr_endpoint_post() as Manivannan reported. In case of sock_queue_rcv_skb() failure we need to put port reference to avoid leaking struct sock pointer. Fixes: e04df98adf7d ("net: qrtr: Remove receive worker") Fixes: bdabad3e363d ("net: Add Qualcomm IPC router") Reported-and-tested-by: syzbot+35a511c72ea7356cdcf3@syzkaller.appspotmail.com Signed-off-by: Pavel Skripkin <paskripkin@gmail.com> Reviewed-by: Manivannan Sadhasivam <mani@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-07-23Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nfDavid S. Miller6-10/+41
Pablo Neira Ayusosays: ==================== Netfilter fixes for net The following patchset contains Netfilter fixes for net: 1) Memleak in commit audit error path, from Dongliang Mu. 2) Avoid possible false sharing for flowtable timeout updates and nft_last use. 3) Adjust conntrack timestamp due to garbage collection delay, from Florian Westphal. 4) Fix nft_nat without layer 3 address for the inet family. 5) Fix compilation warning in nfnl_hook when ingress support is disabled, from Arnd Bergmann. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2021-07-23ipv6: decrease hop limit counter in ip6_forward()Kangmin Park1-2/+3
Decrease hop limit counter when deliver skb to ndp proxy. Signed-off-by: Kangmin Park <l4stpr0gr4m@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-07-23net: Set true network header for ECN decapsulationGilad Naaman1-1/+1
In cases where the header straight after the tunnel header was another ethernet header (TEB), instead of the network header, the ECN decapsulation code would treat the ethernet header as if it was an IP header, resulting in mishandling and possible wrong drops or corruption of the IP header. In this case, ECT(1) is sent, so IP_ECN_decapsulate tries to copy it to the inner IPv4 header, and correct its checksum. The offset of the ECT bits in an IPv4 header corresponds to the lower 2 bits of the second octet of the destination MAC address in the ethernet header. The IPv4 checksum corresponds to end of the source address. In order to reproduce: $ ip netns add A $ ip netns add B $ ip -n A link add _v0 type veth peer name _v1 netns B $ ip -n A link set _v0 up $ ip -n A addr add dev _v0 10.254.3.1/24 $ ip -n A route add default dev _v0 scope global $ ip -n B link set _v1 up $ ip -n B addr add dev _v1 10.254.1.6/24 $ ip -n B route add default dev _v1 scope global $ ip -n B link add gre1 type gretap local 10.254.1.6 remote 10.254.3.1 key 0x49000000 $ ip -n B link set gre1 up # Now send an IPv4/GRE/Eth/IPv4 frame where the outer header has ECT(1), # and the inner header has no ECT bits set: $ cat send_pkt.py #!/usr/bin/env python3 from scapy.all import * pkt = IP(b'E\x01\x00\xa7\x00\x00\x00\x00@/`%\n\xfe\x03\x01\n\xfe\x01\x06 \x00eXI\x00' b'\x00\x00\x18\xbe\x92\xa0\xee&\x18\xb0\x92\xa0l&\x08\x00E\x00\x00}\x8b\x85' b'@\x00\x01\x01\xe4\xf2\x82\x82\x82\x01\x82\x82\x82\x02\x08\x00d\x11\xa6\xeb' b'3\x1e\x1e\\xf3\\xf7`\x00\x00\x00\x00ZN\x00\x00\x00\x00\x00\x00\x10\x11\x12' b'\x13\x14\x15\x16\x17\x18\x19\x1a\x1b\x1c\x1d\x1e\x1f !"#$%&\'()*+,-./01234' b'56789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ') send(pkt) $ sudo ip netns exec B tcpdump -neqlllvi gre1 icmp & ; sleep 1 $ sudo ip netns exec A python3 send_pkt.py In the original packet, the source/destinatio MAC addresses are dst=18:be:92:a0:ee:26 src=18:b0:92:a0:6c:26 In the received packet, they are dst=18:bd:92:a0:ee:26 src=18:b0:92:a0:6c:27 Thanks to Lahav Schlesinger <lschlesinger@drivenets.com> and Isaac Garzon <isaac@speed.io> for helping me pinpoint the origin. Fixes: b723748750ec ("tunnel: Propagate ECT(1) when decapsulating as recommended by RFC6040") Cc: David S. Miller <davem@davemloft.net> Cc: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org> Cc: David Ahern <dsahern@kernel.org> Cc: Jakub Kicinski <kuba@kernel.org> Cc: Toke Høiland-Jørgensen <toke@redhat.com> Signed-off-by: Gilad Naaman <gnaaman@drivenets.com> Acked-by: Toke Høiland-Jørgensen <toke@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-07-23tipc: fix sleeping in tipc accept routineHoang Le1-5/+4
The release_sock() is blocking function, it would change the state after sleeping. In order to evaluate the stated condition outside the socket lock context, switch to use wait_woken() instead. Fixes: 6398e23cdb1d8 ("tipc: standardize accept routine") Acked-by: Jon Maloy <jmaloy@redhat.com> Signed-off-by: Hoang Le <hoang.h.le@dektech.com.au> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-07-23tipc: fix implicit-connect for SYN+Xin Long1-8/+13
For implicit-connect, when it's either SYN- or SYN+, an ACK should be sent back to the client immediately. It's not appropriate for the client to enter established state only after receiving data from the server. On client side, after the SYN is sent out, tipc_wait_for_connect() should be called to wait for the ACK if timeout is set. This patch also restricts __tipc_sendstream() to call __sendmsg() only when it's in TIPC_OPEN state, so that the client can program in a single loop doing both connecting and data sending like: for (...) sendmsg(dest, buf); This makes the implicit-connect more implicit. Fixes: b97bf3fd8f6a ("[TIPC] Initial merge") Signed-off-by: Xin Long <lucien.xin@gmail.com> Acked-by: Jon Maloy <jmaloy@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-07-23netfilter: nfnl_hook: fix unused variable warningArnd Bergmann1-0/+2
The only user of this variable is in an #ifdef: net/netfilter/nfnetlink_hook.c: In function 'nfnl_hook_entries_head': net/netfilter/nfnetlink_hook.c:177:28: error: unused variable 'netdev' [-Werror=unused-variable] Fixes: e2cf17d3774c ("netfilter: add new hook nfnl subsystem") Signed-off-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2021-07-23netfilter: nft_nat: allow to specify layer 4 protocol NAT onlyPablo Neira Ayuso1-1/+3
nft_nat reports a bogus EAFNOSUPPORT if no layer 3 information is specified. Fixes: d07db9884a5f ("netfilter: nf_tables: introduce nft_validate_register_load()") Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2021-07-23netfilter: conntrack: adjust stop timestamp to real expiry valueFlorian Westphal1-1/+6
In case the entry is evicted via garbage collection there is delay between the timeout value and the eviction event. This adjusts the stop value based on how much time has passed. Fixes: b87a2f9199ea82 ("netfilter: conntrack: add gc worker to remove timed-out entries") Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2021-07-23netfilter: nft_last: avoid possible false sharingPablo Neira Ayuso1-7/+13
Use the idiom described in: https://github.com/google/ktsan/wiki/READ_ONCE-and-WRITE_ONCE#it-may-improve-performance Moreover, prevent a compiler optimization. Fixes: 836382dc2471 ("netfilter: nf_tables: add last expression") Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>