summaryrefslogtreecommitdiffstats
path: root/net
AgeCommit message (Collapse)AuthorFilesLines
2022-05-31netfilter: nf_tables: double hook unregistration in netns pathPablo Neira Ayuso1-13/+41
__nft_release_hooks() is called from pre_netns exit path which unregisters the hooks, then the NETDEV_UNREGISTER event is triggered which unregisters the hooks again. [ 565.221461] WARNING: CPU: 18 PID: 193 at net/netfilter/core.c:495 __nf_unregister_net_hook+0x247/0x270 [...] [ 565.246890] CPU: 18 PID: 193 Comm: kworker/u64:1 Tainted: G E 5.18.0-rc7+ #27 [ 565.253682] Workqueue: netns cleanup_net [ 565.257059] RIP: 0010:__nf_unregister_net_hook+0x247/0x270 [...] [ 565.297120] Call Trace: [ 565.300900] <TASK> [ 565.304683] nf_tables_flowtable_event+0x16a/0x220 [nf_tables] [ 565.308518] raw_notifier_call_chain+0x63/0x80 [ 565.312386] unregister_netdevice_many+0x54f/0xb50 Unregister and destroy netdev hook from netns pre_exit via kfree_rcu so the NETDEV_UNREGISTER path see unregistered hooks. Fixes: 767d1216bff8 ("netfilter: nftables: fix possible UAF over chains from packet path in netns") Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2022-05-31netfilter: nf_tables: hold mutex on netns pre_exit pathPablo Neira Ayuso1-0/+4
clean_net() runs in workqueue while walking over the lists, grab mutex. Fixes: 767d1216bff8 ("netfilter: nftables: fix possible UAF over chains from packet path in netns") Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2022-05-31netfilter: nf_tables: sanitize nft_set_desc_concat_parse()Pablo Neira Ayuso1-4/+13
Add several sanity checks for nft_set_desc_concat_parse(): - validate desc->field_count not larger than desc->field_len array. - field length cannot be larger than desc->field_len (ie. U8_MAX) - total length of the concatenation cannot be larger than register array. Joint work with Florian Westphal. Fixes: f3a2181e16f1 ("netfilter: nf_tables: Support for sets with multiple ranged fields") Reported-by: <zhangziming.zzm@antgroup.com> Reviewed-by: Stefano Brivio <sbrivio@redhat.com> Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2022-05-31xprtrdma: treat all calls not a bcall when bc_serv is NULLKinglong Mee1-0/+5
When a rdma server returns a fault format reply, nfs v3 client may treats it as a bcall when bc service is not exist. The debug message at rpcrdma_bc_receive_call are, [56579.837169] RPC: rpcrdma_bc_receive_call: callback XID 00000001, length=20 [56579.837174] RPC: rpcrdma_bc_receive_call: 00 00 00 01 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 04 After that, rpcrdma_bc_receive_call will meets NULL pointer as, [ 226.057890] BUG: unable to handle kernel NULL pointer dereference at 00000000000000c8 ... [ 226.058704] RIP: 0010:_raw_spin_lock+0xc/0x20 ... [ 226.059732] Call Trace: [ 226.059878] rpcrdma_bc_receive_call+0x138/0x327 [rpcrdma] [ 226.060011] __ib_process_cq+0x89/0x170 [ib_core] [ 226.060092] ib_cq_poll_work+0x26/0x80 [ib_core] [ 226.060257] process_one_work+0x1a7/0x360 [ 226.060367] ? create_worker+0x1a0/0x1a0 [ 226.060440] worker_thread+0x30/0x390 [ 226.060500] ? create_worker+0x1a0/0x1a0 [ 226.060574] kthread+0x116/0x130 [ 226.060661] ? kthread_flush_work_fn+0x10/0x10 [ 226.060724] ret_from_fork+0x35/0x40 ... Signed-off-by: Kinglong Mee <kinglongmee@gmail.com> Reviewed-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2022-05-31net/ipv6: Expand and rename accept_unsolicited_na to accept_untracked_naArun Ajith S2-20/+28
RFC 9131 changes default behaviour of handling RX of NA messages when the corresponding entry is absent in the neighbour cache. The current implementation is limited to accept just unsolicited NAs. However, the RFC is more generic where it also accepts solicited NAs. Both types should result in adding a STALE entry for this case. Expand accept_untracked_na behaviour to also accept solicited NAs to be compliant with the RFC and rename the sysctl knob to accept_untracked_na. Fixes: f9a2fb73318e ("net/ipv6: Introduce accept_unsolicited_na knob to implement router-side changes for RFC9131") Signed-off-by: Arun Ajith S <aajith@arista.com> Reviewed-by: David Ahern <dsahern@kernel.org> Link: https://lore.kernel.org/r/20220530101414.65439-1-aajith@arista.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2022-05-30net: ipv4: Avoid bounds check warninghuhai1-2/+2
Fix the following build warning when CONFIG_IPV6 is not set: In function ‘fortify_memcpy_chk’, inlined from ‘tcp_md5_do_add’ at net/ipv4/tcp_ipv4.c:1210:2: ./include/linux/fortify-string.h:328:4: error: call to ‘__write_overflow_field’ declared with attribute warning: detected write beyond size of field (1st parameter); maybe use struct_group()? [-Werror=attribute-warning] 328 | __write_overflow_field(p_size_field, size); | ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Suggested-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: huhai <huhai@kylinos.cn> Reviewed-by: David Ahern <dsahern@kernel.org> Link: https://lore.kernel.org/r/20220526101213.2392980-1-zhanggenjian@kylinos.cn Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-05-28Merge tag 'hyperv-next-signed-20220528' of ↵Linus Torvalds1-4/+17
git://git.kernel.org/pub/scm/linux/kernel/git/hyperv/linux Pull hyperv updates from Wei Liu: - Harden hv_sock driver (Andrea Parri) - Harden Hyper-V PCI driver (Andrea Parri) - Fix multi-MSI for Hyper-V PCI driver (Jeffrey Hugo) - Fix Hyper-V PCI to reduce boot time (Dexuan Cui) - Remove code for long EOL'ed Hyper-V versions (Michael Kelley, Saurabh Sengar) - Fix balloon driver error handling (Shradha Gupta) - Fix a typo in vmbus driver (Julia Lawall) - Ignore vmbus IMC device (Michael Kelley) - Add a new error message to Hyper-V DRM driver (Saurabh Sengar) * tag 'hyperv-next-signed-20220528' of git://git.kernel.org/pub/scm/linux/kernel/git/hyperv/linux: (28 commits) hv_balloon: Fix balloon_probe() and balloon_remove() error handling scsi: storvsc: Removing Pre Win8 related logic Drivers: hv: vmbus: fix typo in comment PCI: hv: Fix synchronization between channel callback and hv_pci_bus_exit() PCI: hv: Add validation for untrusted Hyper-V values PCI: hv: Fix interrupt mapping for multi-MSI PCI: hv: Reuse existing IRTE allocation in compose_msi_msg() drm/hyperv: Remove support for Hyper-V 2008 and 2008R2/Win7 video: hyperv_fb: Remove support for Hyper-V 2008 and 2008R2/Win7 scsi: storvsc: Remove support for Hyper-V 2008 and 2008R2/Win7 Drivers: hv: vmbus: Remove support for Hyper-V 2008 and Hyper-V 2008R2/Win7 x86/hyperv: Disable hardlockup detector by default in Hyper-V guests drm/hyperv: Add error message for fb size greater than allocated PCI: hv: Do not set PCI_COMMAND_MEMORY to reduce VM boot time PCI: hv: Fix hv_arch_irq_unmask() for multi-MSI Drivers: hv: vmbus: Refactor the ring-buffer iterator functions Drivers: hv: vmbus: Accept hv_sock offers in isolated guests hv_sock: Add validation for untrusted Hyper-V values hv_sock: Copy packets sent by Hyper-V out of the ring buffer hv_sock: Check hv_pkt_iter_first_raw()'s return value ...
2022-05-28net: nfc: Directly use ida_alloc()/free()keliu1-2/+2
Use ida_alloc()/ida_free() instead of deprecated ida_simple_get()/ida_simple_remove() . Signed-off-by: keliu <liuke94@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-05-28tcp: fix tcp_mtup_probe_success vs wrong snd_cwndEric Dumazet1-4/+7
syzbot got a new report [1] finally pointing to a very old bug, added in initial support for MTU probing. tcp_mtu_probe() has checks about starting an MTU probe if tcp_snd_cwnd(tp) >= 11. But nothing prevents tcp_snd_cwnd(tp) to be reduced later and before the MTU probe succeeds. This bug would lead to potential zero-divides. Debugging added in commit 40570375356c ("tcp: add accessors to read/set tp->snd_cwnd") has paid off :) While we are at it, address potential overflows in this code. [1] WARNING: CPU: 1 PID: 14132 at include/net/tcp.h:1219 tcp_mtup_probe_success+0x366/0x570 net/ipv4/tcp_input.c:2712 Modules linked in: CPU: 1 PID: 14132 Comm: syz-executor.2 Not tainted 5.18.0-syzkaller-07857-gbabf0bb978e3 #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 RIP: 0010:tcp_snd_cwnd_set include/net/tcp.h:1219 [inline] RIP: 0010:tcp_mtup_probe_success+0x366/0x570 net/ipv4/tcp_input.c:2712 Code: 74 08 48 89 ef e8 da 80 17 f9 48 8b 45 00 65 48 ff 80 80 03 00 00 48 83 c4 30 5b 41 5c 41 5d 41 5e 41 5f 5d c3 e8 aa b0 c5 f8 <0f> 0b e9 16 fe ff ff 48 8b 4c 24 08 80 e1 07 38 c1 0f 8c c7 fc ff RSP: 0018:ffffc900079e70f8 EFLAGS: 00010287 RAX: ffffffff88c0f7f6 RBX: ffff8880756e7a80 RCX: 0000000000040000 RDX: ffffc9000c6c4000 RSI: 0000000000031f9e RDI: 0000000000031f9f RBP: 0000000000000000 R08: ffffffff88c0f606 R09: ffffc900079e7520 R10: ffffed101011226d R11: 1ffff1101011226c R12: 1ffff1100eadcf50 R13: ffff8880756e72c0 R14: 1ffff1100eadcf89 R15: dffffc0000000000 FS: 00007f643236e700(0000) GS:ffff8880b9b00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007f1ab3f1e2a0 CR3: 0000000064fe7000 CR4: 00000000003506e0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Call Trace: <TASK> tcp_clean_rtx_queue+0x223a/0x2da0 net/ipv4/tcp_input.c:3356 tcp_ack+0x1962/0x3c90 net/ipv4/tcp_input.c:3861 tcp_rcv_established+0x7c8/0x1ac0 net/ipv4/tcp_input.c:5973 tcp_v6_do_rcv+0x57b/0x1210 net/ipv6/tcp_ipv6.c:1476 sk_backlog_rcv include/net/sock.h:1061 [inline] __release_sock+0x1d8/0x4c0 net/core/sock.c:2849 release_sock+0x5d/0x1c0 net/core/sock.c:3404 sk_stream_wait_memory+0x700/0xdc0 net/core/stream.c:145 tcp_sendmsg_locked+0x111d/0x3fc0 net/ipv4/tcp.c:1410 tcp_sendmsg+0x2c/0x40 net/ipv4/tcp.c:1448 sock_sendmsg_nosec net/socket.c:714 [inline] sock_sendmsg net/socket.c:734 [inline] __sys_sendto+0x439/0x5c0 net/socket.c:2119 __do_sys_sendto net/socket.c:2131 [inline] __se_sys_sendto net/socket.c:2127 [inline] __x64_sys_sendto+0xda/0xf0 net/socket.c:2127 do_syscall_x64 arch/x86/entry/common.c:50 [inline] do_syscall_64+0x2b/0x70 arch/x86/entry/common.c:80 entry_SYSCALL_64_after_hwframe+0x46/0xb0 RIP: 0033:0x7f6431289109 Code: ff ff c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 40 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 b8 ff ff ff f7 d8 64 89 01 48 RSP: 002b:00007f643236e168 EFLAGS: 00000246 ORIG_RAX: 000000000000002c RAX: ffffffffffffffda RBX: 00007f643139c100 RCX: 00007f6431289109 RDX: 00000000d0d0c2ac RSI: 0000000020000080 RDI: 000000000000000a RBP: 00007f64312e308d R08: 0000000000000000 R09: 0000000000000000 R10: 0000000000000001 R11: 0000000000000246 R12: 0000000000000000 R13: 00007fff372533af R14: 00007f643236e300 R15: 0000000000022000 Fixes: 5d424d5a674f ("[TCP]: MTU probing") Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: syzbot <syzkaller@googlegroups.com> Acked-by: Yuchung Cheng <ycheng@google.com> Acked-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-05-28net/smc: fixes for converting from "struct smc_cdc_tx_pend **" to "struct ↵Guangguan Wang1-1/+1
smc_wr_tx_pend_priv *" "struct smc_cdc_tx_pend **" can not directly convert to "struct smc_wr_tx_pend_priv *". Fixes: 2bced6aefa3d ("net/smc: put slot when connection is killed") Signed-off-by: Guangguan Wang <guangguan.wang@linux.alibaba.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-05-27netfilter: nf_tables: set element extended ACK reporting supportPablo Neira Ayuso1-3/+9
Report the element that causes problems via netlink extended ACK for set element commands. Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2022-05-27netfilter: cttimeout: fix slab-out-of-bounds read in cttimeout_net_exitFlorian Westphal1-2/+3
syzbot reports: BUG: KASAN: slab-out-of-bounds in __list_del_entry_valid+0xcc/0xf0 lib/list_debug.c:42 [..] list_del include/linux/list.h:148 [inline] cttimeout_net_exit+0x211/0x540 net/netfilter/nfnetlink_cttimeout.c:617 No reproducer so far. Looking at recent changes in this area its clear that the free_head must not be at the end of the structure because nf_ct_timeout structure has variable size. Reported-by: <syzbot+92968395eedbdbd3617d@syzkaller.appspotmail.com> Fixes: 78222bacfca9 ("netfilter: cttimeout: decouple unlink and free on netns destruction") Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2022-05-27netfilter: nfnetlink: fix warn in nfnetlink_unbindFlorian Westphal1-19/+5
syzbot reports following warn: WARNING: CPU: 0 PID: 3600 at net/netfilter/nfnetlink.c:703 nfnetlink_unbind+0x357/0x3b0 net/netfilter/nfnetlink.c:694 The syzbot generated program does this: socket(AF_NETLINK, SOCK_RAW, NETLINK_NETFILTER) = 3 setsockopt(3, SOL_NETLINK, NETLINK_DROP_MEMBERSHIP, [1], 4) = 0 ... which triggers 'WARN_ON_ONCE(nfnlnet->ctnetlink_listeners == 0)' check. Instead of counting, just enable reporting for every bind request and check if we still have listeners on unbind. While at it, also add the needed bounds check on nfnl_group2type[] access. Reported-by: <syzbot+4903218f7fba0a2d6226@syzkaller.appspotmail.com> Reported-by: <syzbot+afd2d80e495f96049571@syzkaller.appspotmail.com> Fixes: 2794cdb0b97b ("netfilter: nfnetlink: allow to detect if ctnetlink listeners exist") Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2022-05-27xen: switch gnttab_end_foreign_access() to take a struct page pointerJuergen Gross1-4/+4
Instead of a virtual kernel address use a pointer of the associated struct page as second parameter of gnttab_end_foreign_access(). Most users have that pointer available already and are creating the virtual address from it, risking problems in case the memory is located in highmem. gnttab_end_foreign_access() itself won't need to get the struct page from the address again. Suggested-by: Jan Beulich <jbeulich@suse.com> Signed-off-by: Juergen Gross <jgross@suse.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> Signed-off-by: Juergen Gross <jgross@suse.com>
2022-05-26Merge tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdmaLinus Torvalds2-3/+3
Pull rdma updates from Jason Gunthorpe: "Small collection of incremental improvement patches: - Minor code cleanup patches, comment improvements, etc from static tools - Clean the some of the kernel caps, reducing the historical stealth uAPI leftovers - Bug fixes and minor changes for rdmavt, hns, rxe, irdma - Remove unimplemented cruft from rxe - Reorganize UMR QP code in mlx5 to avoid going through the IB verbs layer - flush_workqueue(system_unbound_wq) removal - Ensure rxe waits for objects to be unused before allowing the core to free them - Several rc quality bug fixes for hfi1" * tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma: (67 commits) RDMA/rtrs-clt: Fix one kernel-doc comment RDMA/hfi1: Remove all traces of diagpkt support RDMA/hfi1: Consolidate software versions RDMA/hfi1: Remove pointless driver version RDMA/hfi1: Fix potential integer multiplication overflow errors RDMA/hfi1: Prevent panic when SDMA is disabled RDMA/hfi1: Prevent use of lock before it is initialized RDMA/rxe: Fix an error handling path in rxe_get_mcg() IB/core: Fix typo in comment RDMA/core: Fix typo in comment IB/hf1: Fix typo in comment IB/qib: Fix typo in comment IB/iser: Fix typo in comment RDMA/mlx4: Avoid flush_scheduled_work() usage IB/isert: Avoid flush_scheduled_work() usage RDMA/mlx5: Remove duplicate pointer assignment in mlx5_ib_alloc_implicit_mr() RDMA/qedr: Remove unnecessary synchronize_irq() before free_irq() RDMA/hns: Use hr_reg_read() instead of remaining roce_get_xxx() RDMA/hns: Use hr_reg_xxx() instead of remaining roce_set_xxx() RDMA/irdma: Add SW mechanism to generate completions on error ...
2022-05-26Merge tag 'nfsd-5.19' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linuxLinus Torvalds9-41/+49
Pull nfsd updates from Chuck Lever: "We introduce 'courteous server' in this release. Previously NFSD would purge open and lock state for an unresponsive client after one lease period (typically 90 seconds). Now, after one lease period, another client can open and lock those files and the unresponsive client's lease is purged; otherwise if the unresponsive client's open and lock state is uncontended, the server retains that open and lock state for up to 24 hours, allowing the client's workload to resume after a lengthy network partition. A longstanding issue with NFSv4 file creation is also addressed. Previously a file creation can fail internally, returning an error to the client, but leave the newly created file in place as an artifact. The file creation code path has been reorganized so that internal failures and race conditions are less likely to result in an unwanted file creation. A fault injector has been added to help exercise paths that are run during kernel metadata cache invalidation. These caches contain information maintained by user space about exported filesystems. Many of our test workloads do not trigger cache invalidation. There is one patch that is needed to support PREEMPT_RT and a fix for an ancient 'sleep while spin-locked' splat that seems to have become easier to hit since v5.18-rc3" * tag 'nfsd-5.19' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux: (36 commits) NFSD: nfsd_file_put() can sleep NFSD: Add documenting comment for nfsd4_release_lockowner() NFSD: Modernize nfsd4_release_lockowner() NFSD: Fix possible sleep during nfsd4_release_lockowner() nfsd: destroy percpu stats counters after reply cache shutdown nfsd: Fix null-ptr-deref in nfsd_fill_super() nfsd: Unregister the cld notifier when laundry_wq create failed SUNRPC: Use RMW bitops in single-threaded hot paths NFSD: Clean up the show_nf_flags() macro NFSD: Trace filecache opens NFSD: Move documenting comment for nfsd4_process_open2() NFSD: Fix whitespace NFSD: Remove dprintk call sites from tail of nfsd4_open() NFSD: Instantiate a struct file when creating a regular NFSv4 file NFSD: Clean up nfsd_open_verified() NFSD: Remove do_nfsd_create() NFSD: Refactor NFSv4 OPEN(CREATE) NFSD: Refactor NFSv3 CREATE NFSD: Refactor nfsd_create_setattr() NFSD: Avoid calling fh_drop_write() twice in do_nfsd_create() ...
2022-05-26netfilter: nft_limit: Clone packet limits' cost valuePhil Sutter1-0/+2
When cloning a packet-based limit expression, copy the cost value as well. Otherwise the new limit is not functional anymore. Fixes: 3b9e2ea6c11bf ("netfilter: nft_limit: move stateful fields out of expression data") Signed-off-by: Phil Sutter <phil@nwl.cc> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2022-05-26netfilter: nf_tables: disallow non-stateful expression in sets earlierPablo Neira Ayuso1-9/+10
Since 3e135cd499bf ("netfilter: nft_dynset: dynamic stateful expression instantiation"), it is possible to attach stateful expressions to set elements. cd5125d8f518 ("netfilter: nf_tables: split set destruction in deactivate and destroy phase") introduces conditional destruction on the object to accomodate transaction semantics. nft_expr_init() calls expr->ops->init() first, then check for NFT_STATEFUL_EXPR, this stills allows to initialize a non-stateful lookup expressions which points to a set, which might lead to UAF since the set is not properly detached from the set->binding for this case. Anyway, this combination is non-sense from nf_tables perspective. This patch fixes this problem by checking for NFT_STATEFUL_EXPR before expr->ops->init() is called. The reporter provides a KASAN splat and a poc reproducer (similar to those autogenerated by syzbot to report use-after-free errors). It is unknown to me if they are using syzbot or if they use similar automated tool to locate the bug that they are reporting. For the record, this is the KASAN splat. [ 85.431824] ================================================================== [ 85.432901] BUG: KASAN: use-after-free in nf_tables_bind_set+0x81b/0xa20 [ 85.433825] Write of size 8 at addr ffff8880286f0e98 by task poc/776 [ 85.434756] [ 85.434999] CPU: 1 PID: 776 Comm: poc Tainted: G W 5.18.0+ #2 [ 85.436023] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.14.0-2 04/01/2014 Fixes: 0b2d8a7b638b ("netfilter: nf_tables: add helper functions for expression handling") Reported-and-tested-by: Aaron Adams <edg-e@nccgroup.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2022-05-25net, neigh: Set lower cap for neigh_managed_work rearmingDaniel Borkmann1-1/+1
Yuwei reported that plain reuse of DELAY_PROBE_TIME to rearm work queue in neigh_managed_work is problematic if user explicitly configures the DELAY_PROBE_TIME to 0 for a neighbor table. Such misconfig can then hog CPU to 100% processing the system work queue. Instead, set lower interval bound to HZ which is totally sufficient. Yuwei is additionally looking into making the interval separately configurable from DELAY_PROBE_TIME. Reported-by: Yuwei Wang <wangyuweihx@gmail.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://lore.kernel.org/netdev/797c3c53-ce1b-9f60-e253-cda615788f4a@iogearbox.net Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org> Link: https://lore.kernel.org/r/3b8c5aa906c52c3a8c995d1b2e8ccf650ea7c716.1653432794.git.daniel@iogearbox.net Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-05-25net/smc: set ini->smcrv2.ib_dev_v2 to NULL if SMC-Rv2 is unavailableliuyacan1-0/+1
In the process of checking whether RDMAv2 is available, the current implementation first sets ini->smcrv2.ib_dev_v2, and then allocates smc buf desc and register rmb, but the latter may fail. In this case, the pointer should be reset. Fixes: e49300a6bf62 ("net/smc: add listen processing for SMC-Rv2") Signed-off-by: liuyacan <liuyacan@corp.netease.com> Reviewed-by: Karsten Graul <kgraul@linux.ibm.com> Link: https://lore.kernel.org/r/20220525085408.812273-1-liuyacan@corp.netease.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-05-25Merge tag 'net-next-5.19' of ↵Linus Torvalds319-4092/+7705
git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next Pull networking updates from Jakub Kicinski: "Core ---- - Support TCPv6 segmentation offload with super-segments larger than 64k bytes using the IPv6 Jumbogram extension header (AKA BIG TCP). - Generalize skb freeing deferral to per-cpu lists, instead of per-socket lists. - Add a netdev statistic for packets dropped due to L2 address mismatch (rx_otherhost_dropped). - Continue work annotating skb drop reasons. - Accept alternative netdev names (ALT_IFNAME) in more netlink requests. - Add VLAN support for AF_PACKET SOCK_RAW GSO. - Allow receiving skb mark from the socket as a cmsg. - Enable memcg accounting for veth queues, sysctl tables and IPv6. BPF --- - Add libbpf support for User Statically-Defined Tracing (USDTs). - Speed up symbol resolution for kprobes multi-link attachments. - Support storing typed pointers to referenced and unreferenced objects in BPF maps. - Add support for BPF link iterator. - Introduce access to remote CPU map elements in BPF per-cpu map. - Allow middle-of-the-road settings for the kernel.unprivileged_bpf_disabled sysctl. - Implement basic types of dynamic pointers e.g. to allow for dynamically sized ringbuf reservations without extra memory copies. Protocols --------- - Retire port only listening_hash table, add a second bind table hashed by port and address. Avoid linear list walk when binding to very popular ports (e.g. 443). - Add bridge FDB bulk flush filtering support allowing user space to remove all FDB entries matching a condition. - Introduce accept_unsolicited_na sysctl for IPv6 to implement router-side changes for RFC9131. - Support for MPTCP path manager in user space. - Add MPTCP support for fallback to regular TCP for connections that have never connected additional subflows or transmitted out-of-sequence data (partial support for RFC8684 fallback). - Avoid races in MPTCP-level window tracking, stabilize and improve throughput. - Support lockless operation of GRE tunnels with seq numbers enabled. - WiFi support for host based BSS color collision detection. - Add support for SO_TXTIME/SCM_TXTIME on CAN sockets. - Support transmission w/o flow control in CAN ISOTP (ISO 15765-2). - Support zero-copy Tx with TLS 1.2 crypto offload (sendfile). - Allow matching on the number of VLAN tags via tc-flower. - Add tracepoint for tcp_set_ca_state(). Driver API ---------- - Improve error reporting from classifier and action offload. - Add support for listing line cards in switches (devlink). - Add helpers for reporting page pool statistics with ethtool -S. - Add support for reading clock cycles when using PTP virtual clocks, instead of having the driver convert to time before reporting. This makes it possible to report time from different vclocks. - Support configuring low-latency Tx descriptor push via ethtool. - Separate Clause 22 and Clause 45 MDIO accesses more explicitly. New hardware / drivers ---------------------- - Ethernet: - Marvell's Octeon NIC PCI Endpoint support (octeon_ep) - Sunplus SP7021 SoC (sp7021_emac) - Add support for Renesas RZ/V2M (in ravb) - Add support for MediaTek mt7986 switches (in mtk_eth_soc) - Ethernet PHYs: - ADIN1100 industrial PHYs (w/ 10BASE-T1L and SQI reporting) - TI DP83TD510 PHY - Microchip LAN8742/LAN88xx PHYs - WiFi: - Driver for pureLiFi X, XL, XC devices (plfxlc) - Driver for Silicon Labs devices (wfx) - Support for WCN6750 (in ath11k) - Support Realtek 8852ce devices (in rtw89) - Mobile: - MediaTek T700 modems (Intel 5G 5000 M.2 cards) - CAN: - ctucanfd: add support for CTU CAN FD open-source IP core from Czech Technical University in Prague Drivers ------- - Delete a number of old drivers still using virt_to_bus(). - Ethernet NICs: - intel: support TSO on tunnels MPLS - broadcom: support multi-buffer XDP - nfp: support VF rate limiting - sfc: use hardware tx timestamps for more than PTP - mlx5: multi-port eswitch support - hyper-v: add support for XDP_REDIRECT - atlantic: XDP support (including multi-buffer) - macb: improve real-time perf by deferring Tx processing to NAPI - High-speed Ethernet switches: - mlxsw: implement basic line card information querying - prestera: add support for traffic policing on ingress and egress - Embedded Ethernet switches: - lan966x: add support for packet DMA (FDMA) - lan966x: add support for PTP programmable pins - ti: cpsw_new: enable bc/mc storm prevention - Qualcomm 802.11ax WiFi (ath11k): - Wake-on-WLAN support for QCA6390 and WCN6855 - device recovery (firmware restart) support - support setting Specific Absorption Rate (SAR) for WCN6855 - read country code from SMBIOS for WCN6855/QCA6390 - enable keep-alive during WoWLAN suspend - implement remain-on-channel support - MediaTek WiFi (mt76): - support Wireless Ethernet Dispatch offloading packet movement between the Ethernet switch and WiFi interfaces - non-standard VHT MCS10-11 support - mt7921 AP mode support - mt7921 IPv6 NS offload support - Ethernet PHYs: - micrel: ksz9031/ksz9131: cabletest support - lan87xx: SQI support for T1 PHYs - lan937x: add interrupt support for link detection" * tag 'net-next-5.19' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: (1809 commits) ptp: ocp: Add firmware header checks ptp: ocp: fix PPS source selector debugfs reporting ptp: ocp: add .init function for sma_op vector ptp: ocp: vectorize the sma accessor functions ptp: ocp: constify selectors ptp: ocp: parameterize input/output sma selectors ptp: ocp: revise firmware display ptp: ocp: add Celestica timecard PCI ids ptp: ocp: Remove #ifdefs around PCI IDs ptp: ocp: 32-bit fixups for pci start address Revert "net/smc: fix listen processing for SMC-Rv2" ath6kl: Use cc-disable-warning to disable -Wdangling-pointer selftests/bpf: Dynptr tests bpf: Add dynptr data slices bpf: Add bpf_dynptr_read and bpf_dynptr_write bpf: Dynptr support for ring buffers bpf: Add bpf_dynptr_from_mem for local dynptrs bpf: Add verifier support for dynptrs bpf: Suppress 'passing zero to PTR_ERR' warning bpf: Introduce bpf_arch_text_invalidate for bpf_prog_pack ...
2022-05-25libceph: use swap() macro instead of taking tmp variableGuo Zhengkui1-4/+1
Fix the following coccicheck warning: net/ceph/crush/mapper.c:1077:8-9: WARNING opportunity for swap() by using swap() for the swapping of variable values and drop the tmp variable that is not needed any more. Signed-off-by: Guo Zhengkui <guozhengkui@vivo.com> Reviewed-by: Ilya Dryomov <idryomov@gmail.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2022-05-25Merge tag 'linux-kselftest-kunit-5.19-rc1' of ↵Linus Torvalds1-2/+2
git://git.kernel.org/pub/scm/linux/kernel/git/shuah/linux-kselftest Pull KUnit updates from Shuah Khan: "Several fixes, cleanups, and enhancements to tests and framework: - introduce _NULL and _NOT_NULL macros to pointer error checks - rework kunit_resource allocation policy to fix memory leaks when caller doesn't specify free() function to be used when allocating memory using kunit_add_resource() and kunit_alloc_resource() funcs. - add ability to specify suite-level init and exit functions" * tag 'linux-kselftest-kunit-5.19-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/shuah/linux-kselftest: (41 commits) kunit: tool: Use qemu-system-i386 for i386 runs kunit: fix executor OOM error handling logic on non-UML kunit: tool: update riscv QEMU config with new serial dependency kcsan: test: use new suite_{init,exit} support kunit: tool: Add list of all valid test configs on UML kunit: take `kunit_assert` as `const` kunit: tool: misc cleanups kunit: tool: minor cosmetic cleanups in kunit_parser.py kunit: tool: make parser stop overwriting status of suites w/ no_tests kunit: tool: remove dead parse_crash_in_log() logic kunit: tool: print clearer error message when there's no TAP output kunit: tool: stop using a shell to run kernel under QEMU kunit: tool: update test counts summary line format kunit: bail out of test filtering logic quicker if OOM lib/Kconfig.debug: change KUnit tests to default to KUNIT_ALL_TESTS kunit: Rework kunit_resource allocation policy kunit: fix debugfs code to use enum kunit_status, not bool kfence: test: use new suite_{init/exit} support, add .kunitconfig kunit: add ability to specify suite-level init and exit functions kunit: rename print_subtest_{start,end} for clarity (s/subtest/suite) ...
2022-05-25xfrm: do not set IPv4 DF flag when encapsulating IPv6 frames <= 1280 bytes.Maciej Żenczykowski1-1/+2
One may want to have DF set on large packets to support discovering path mtu and limiting the size of generated packets (hence not setting the XFRM_STATE_NOPMTUDISC tunnel flag), while still supporting networks that are incapable of carrying even minimal sized IPv6 frames (post encapsulation). Having IPv4 Don't Frag bit set on encapsulated IPv6 frames that are not larger than the minimum IPv6 mtu of 1280 isn't useful, because the resulting ICMP Fragmentation Required error isn't actionable (even assuming you receive it) because IPv6 will not drop it's path mtu below 1280 anyway. While the IPv4 stack could prefrag the packets post encap, this requires the ICMP error to be successfully delivered and causes a loss of the original IPv6 frame (thus requiring a retransmit and latency hit). Luckily with IPv4 if we simply don't set the DF flag, we'll just make further fragmenting the packets some other router's problems. We'll still learn the correct IPv4 path mtu through encapsulation of larger IPv6 frames. I'm still not convinced this patch is entirely sufficient to make everything happy... but I don't see how it could possibly make things worse. See also recent: 4ff2980b6bd2 'xfrm: fix tunnel model fragmentation behavior' and friends Cc: Lorenzo Colitti <lorenzo@google.com> Cc: Eric Dumazet <edumazet@google.com> Cc: Lina Wang <lina.wang@mediatek.com> Cc: Steffen Klassert <steffen.klassert@secunet.com> Signed-off-by: Maciej Zenczykowski <maze@google.com> Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2022-05-25Revert "net: af_key: add check for pfkey_broadcast in function pfkey_process"Michal Kubecek1-4/+6
This reverts commit 4dc2a5a8f6754492180741facf2a8787f2c415d7. A non-zero return value from pfkey_broadcast() does not necessarily mean an error occurred as this function returns -ESRCH when no registered listener received the message. In particular, a call with BROADCAST_PROMISC_ONLY flag and null one_sk argument can never return zero so that this commit in fact prevents processing any PF_KEY message. One visible effect is that racoon daemon fails to find encryption algorithms like aes and refuses to start. Excluding -ESRCH return value would fix this but it's not obvious that we really want to bail out here and most other callers of pfkey_broadcast() also ignore the return value. Also, as pointed out by Steffen Klassert, PF_KEY is kind of deprecated and newer userspace code should use netlink instead so that we should only disturb the code for really important fixes. v2: add a comment explaining why is the return value ignored Signed-off-by: Michal Kubecek <mkubecek@suse.cz> Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2022-05-24Merge tag 'kernel-hardening-v5.19-rc1' of ↵Linus Torvalds1-5/+3
git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux Pull kernel hardening updates from Kees Cook: - usercopy hardening expanded to check other allocation types (Matthew Wilcox, Yuanzheng Song) - arm64 stackleak behavioral improvements (Mark Rutland) - arm64 CFI code gen improvement (Sami Tolvanen) - LoadPin LSM block dev API adjustment (Christoph Hellwig) - Clang randstruct support (Bill Wendling, Kees Cook) * tag 'kernel-hardening-v5.19-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux: (34 commits) loadpin: stop using bdevname mm: usercopy: move the virt_addr_valid() below the is_vmalloc_addr() gcc-plugins: randstruct: Remove cast exception handling af_unix: Silence randstruct GCC plugin warning niu: Silence randstruct warnings big_keys: Use struct for internal payload gcc-plugins: Change all version strings match kernel randomize_kstack: Improve docs on requirements/rationale lkdtm/stackleak: fix CONFIG_GCC_PLUGIN_STACKLEAK=n arm64: entry: use stackleak_erase_on_task_stack() stackleak: add on/off stack variants lkdtm/stackleak: check stack boundaries lkdtm/stackleak: prevent unexpected stack usage lkdtm/stackleak: rework boundary management lkdtm/stackleak: avoid spurious failure stackleak: rework poison scanning stackleak: rework stack high bound handling stackleak: clarify variable names stackleak: rework stack low bound handling stackleak: remove redundant check ...
2022-05-24Merge tag 'random-5.19-rc1-for-linus' of ↵Linus Torvalds3-8/+1
git://git.kernel.org/pub/scm/linux/kernel/git/crng/random Pull random number generator updates from Jason Donenfeld: "These updates continue to refine the work began in 5.17 and 5.18 of modernizing the RNG's crypto and streamlining and documenting its code. New for 5.19, the updates aim to improve entropy collection methods and make some initial decisions regarding the "premature next" problem and our threat model. The cloc utility now reports that random.c is 931 lines of code and 466 lines of comments, not that basic metrics like that mean all that much, but at the very least it tells you that this is very much a manageable driver now. Here's a summary of the various updates: - The random_get_entropy() function now always returns something at least minimally useful. This is the primary entropy source in most collectors, which in the best case expands to something like RDTSC, but prior to this change, in the worst case it would just return 0, contributing nothing. For 5.19, additional architectures are wired up, and architectures that are entirely missing a cycle counter now have a generic fallback path, which uses the highest resolution clock available from the timekeeping subsystem. Some of those clocks can actually be quite good, despite the CPU not having a cycle counter of its own, and going off-core for a stamp is generally thought to increase jitter, something positive from the perspective of entropy gathering. Done very early on in the development cycle, this has been sitting in next getting some testing for a while now and has relevant acks from the archs, so it should be pretty well tested and fine, but is nonetheless the thing I'll be keeping my eye on most closely. - Of particular note with the random_get_entropy() improvements is MIPS, which, on CPUs that lack the c0 count register, will now combine the high-speed but short-cycle c0 random register with the lower-speed but long-cycle generic fallback path. - With random_get_entropy() now always returning something useful, the interrupt handler now collects entropy in a consistent construction. - Rather than comparing two samples of random_get_entropy() for the jitter dance, the algorithm now tests many samples, and uses the amount of differing ones to determine whether or not jitter entropy is usable and how laborious it must be. The problem with comparing only two samples was that if the cycle counter was extremely slow, but just so happened to be on the cusp of a change, the slowness wouldn't be detected. Taking many samples fixes that to some degree. This, combined with the other improvements to random_get_entropy(), should make future unification of /dev/random and /dev/urandom maybe more possible. At the very least, were we to attempt it again today (we're not), it wouldn't break any of Guenter's test rigs that broke when we tried it with 5.18. So, not today, but perhaps down the road, that's something we can revisit. - We attempt to reseed the RNG immediately upon waking up from system suspend or hibernation, making use of the various timestamps about suspend time and such available, as well as the usual inputs such as RDRAND when available. - Batched randomness now falls back to ordinary randomness before the RNG is initialized. This provides more consistent guarantees to the types of random numbers being returned by the various accessors. - The "pre-init injection" code is now gone for good. I suspect you in particular will be happy to read that, as I recall you expressing your distaste for it a few months ago. Instead, to avoid a "premature first" issue, while still allowing for maximal amount of entropy availability during system boot, the first 128 bits of estimated entropy are used immediately as it arrives, with the next 128 bits being buffered. And, as before, after the RNG has been fully initialized, it winds up reseeding anyway a few seconds later in most cases. This resulted in a pretty big simplification of the initialization code and let us remove various ad-hoc mechanisms like the ugly crng_pre_init_inject(). - The RNG no longer pretends to handle the "premature next" security model, something that various academics and other RNG designs have tried to care about in the past. After an interesting mailing list thread, these issues are thought to be a) mainly academic and not practical at all, and b) actively harming the real security of the RNG by delaying new entropy additions after a potential compromise, making a potentially bad situation even worse. As well, in the first place, our RNG never even properly handled the premature next issue, so removing an incomplete solution to a fake problem was particularly nice. This allowed for numerous other simplifications in the code, which is a lot cleaner as a consequence. If you didn't see it before, https://lore.kernel.org/lkml/YmlMGx6+uigkGiZ0@zx2c4.com/ may be a thread worth skimming through. - While the interrupt handler received a separate code path years ago that avoids locks by using per-cpu data structures and a faster mixing algorithm, in order to reduce interrupt latency, input and disk events that are triggered in hardirq handlers were still hitting locks and more expensive algorithms. Those are now redirected to use the faster per-cpu data structures. - Rather than having the fake-crypto almost-siphash-based random32 implementation be used right and left, and in many places where cryptographically secure randomness is desirable, the batched entropy code is now fast enough to replace that. - As usual, numerous code quality and documentation cleanups. For example, the initialization state machine now uses enum symbolic constants instead of just hard coding numbers everywhere. - Since the RNG initializes once, and then is always initialized thereafter, a pretty heavy amount of code used during that initialization is never used again. It is now completely cordoned off using static branches and it winds up in the .text.unlikely section so that it doesn't reduce cache compactness after the RNG is ready. - A variety of functions meant for waiting on the RNG to be initialized were only used by vsprintf, and in not a particularly optimal way. Replacing that usage with a more ordinary setup made it possible to remove those functions. - A cleanup of how we warn userspace about the use of uninitialized /dev/urandom and uninitialized get_random_bytes() usage. Interestingly, with the change you merged for 5.18 that attempts to use jitter (but does not block if it can't), the majority of users should never see those warnings for /dev/urandom at all now, and the one for in-kernel usage is mainly a debug thing. - The file_operations struct for /dev/[u]random now implements .read_iter and .write_iter instead of .read and .write, allowing it to also implement .splice_read and .splice_write, which makes splice(2) work again after it was broken here (and in many other places in the tree) during the set_fs() removal. This was a bit of a last minute arrival from Jens that hasn't had as much time to bake, so I'll be keeping my eye on this as well, but it seems fairly ordinary. Unfortunately, read_iter() is around 3% slower than read() in my tests, which I'm not thrilled about. But Jens and Al, spurred by this observation, seem to be making progress in removing the bottlenecks on the iter paths in the VFS layer in general, which should remove the performance gap for all drivers. - Assorted other bug fixes, cleanups, and optimizations. - A small SipHash cleanup" * tag 'random-5.19-rc1-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/crng/random: (49 commits) random: check for signals after page of pool writes random: wire up fops->splice_{read,write}_iter() random: convert to using fops->write_iter() random: convert to using fops->read_iter() random: unify batched entropy implementations random: move randomize_page() into mm where it belongs random: remove mostly unused async readiness notifier random: remove get_random_bytes_arch() and add rng_has_arch_random() random: move initialization functions out of hot pages random: make consistent use of buf and len random: use proper return types on get_random_{int,long}_wait() random: remove extern from functions in header random: use static branch for crng_ready() random: credit architectural init the exact amount random: handle latent entropy and command line from random_init() random: use proper jiffies comparison macro random: remove ratelimiting for in-kernel unseeded randomness random: move initialization out of reseeding hot path random: avoid initializing twice in credit race random: use symbolic constants for crng_init states ...
2022-05-24Revert "net/smc: fix listen processing for SMC-Rv2"liuyacan1-27/+17
This reverts commit 8c3b8dc5cc9bf6d273ebe18b16e2d6882bcfb36d. Some rollback issue will be fixed in other patches in the future. Link: https://lore.kernel.org/all/20220523055056.2078994-1-liuyacan@corp.netease.com/ Fixes: 8c3b8dc5cc9b ("net/smc: fix listen processing for SMC-Rv2") Signed-off-by: liuyacan <liuyacan@corp.netease.com> Link: https://lore.kernel.org/r/20220524090230.2140302-1-liuyacan@corp.netease.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-05-24Merge tag 'v5.18' into rdma.git for-nextJason Gunthorpe126-731/+1424
Following patches have dependencies. Resolve the merge conflict in drivers/net/ethernet/mellanox/mlx5/core/main.c by keeping the new names for the fs functions following linux-next: https://lore.kernel.org/r/20220519113529.226bc3e2@canb.auug.org.au/ Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2022-05-23Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netJakub Kicinski9-56/+82
drivers/net/ethernet/cadence/macb_main.c 5cebb40bc955 ("net: macb: Fix PTP one step sync support") 138badbc21a0 ("net: macb: use NAPI for TX completion path") https://lore.kernel.org/all/20220523111021.31489367@canb.auug.org.au/ net/smc/af_smc.c 75c1edf23b95 ("net/smc: postpone sk_refcnt increment in connect()") 3aba103006bc ("net/smc: align the connect behaviour with TCP") https://lore.kernel.org/all/20220524114408.4bf1af38@canb.auug.org.au/ Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-05-23Merge https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-nextJakub Kicinski7-20/+105
Daniel Borkmann says: ==================== pull-request: bpf-next 2022-05-23 We've added 113 non-merge commits during the last 26 day(s) which contain a total of 121 files changed, 7425 insertions(+), 1586 deletions(-). The main changes are: 1) Speed up symbol resolution for kprobes multi-link attachments, from Jiri Olsa. 2) Add BPF dynamic pointer infrastructure e.g. to allow for dynamically sized ringbuf reservations without extra memory copies, from Joanne Koong. 3) Big batch of libbpf improvements towards libbpf 1.0 release, from Andrii Nakryiko. 4) Add BPF link iterator to traverse links via seq_file ops, from Dmitrii Dolgov. 5) Add source IP address to BPF tunnel key infrastructure, from Kaixi Fan. 6) Refine unprivileged BPF to disable only object-creating commands, from Alan Maguire. 7) Fix JIT blinding of ld_imm64 when they point to subprogs, from Alexei Starovoitov. 8) Add BPF access to mptcp_sock structures and their meta data, from Geliang Tang. 9) Add new BPF helper for access to remote CPU's BPF map elements, from Feng Zhou. 10) Allow attaching 64-bit cookie to BPF link of fentry/fexit/fmod_ret, from Kui-Feng Lee. 11) Follow-ups to typed pointer support in BPF maps, from Kumar Kartikeya Dwivedi. 12) Add busy-poll test cases to the XSK selftest suite, from Magnus Karlsson. 13) Improvements in BPF selftest test_progs subtest output, from Mykola Lysenko. 14) Fill bpf_prog_pack allocator areas with illegal instructions, from Song Liu. 15) Add generic batch operations for BPF map-in-map cases, from Takshak Chahande. 16) Make bpf_jit_enable more user friendly when permanently on 1, from Tiezhu Yang. 17) Fix an array overflow in bpf_trampoline_get_progs(), from Yuntao Wang. ==================== Link: https://lore.kernel.org/r/20220523223805.27931-1-daniel@iogearbox.net Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-05-23Merge tag 'for-net-next-2022-05-23' of ↵Jakub Kicinski10-35/+168
git://git.kernel.org/pub/scm/linux/kernel/git/bluetooth/bluetooth-next Luiz Augusto von Dentz says: ==================== bluetooth-next pull request for net-next: - Add support for Realtek 8761BUV - Add HCI_QUIRK_BROKEN_ENHANCED_SETUP_SYNC_CONN quirk - Add support for RTL8852C - Add a new PID/VID 0489/e0c8 for MT7921 - Add support for Qualcomm WCN785x * tag 'for-net-next-2022-05-23' of git://git.kernel.org/pub/scm/linux/kernel/git/bluetooth/bluetooth-next: (26 commits) Bluetooth: hci_sync: use hci_skb_event() helper Bluetooth: eir: Add helpers for managing service data Bluetooth: hci_sync: Fix attempting to suspend with unfiltered passive scan Bluetooth: MGMT: Add conditions for setting HCI_CONN_FLAG_REMOTE_WAKEUP Bluetooth: btmtksdio: fix the reset takes too long Bluetooth: btmtksdio: fix possible FW initialization failure Bluetooth: btmtksdio: fix use-after-free at btmtksdio_recv_event Bluetooth: btbcm: Add entry for BCM4373A0 UART Bluetooth Bluetooth: btusb: Add a new PID/VID 0489/e0c8 for MT7921 Bluetooth: btusb: Add 0x0bda:0x8771 Realtek 8761BUV devices Bluetooth: btusb: Set HCI_QUIRK_BROKEN_ERR_DATA_REPORTING for QCA Bluetooth: core: Fix missing power_on work cancel on HCI close Bluetooth: btusb: add support for Qualcomm WCN785x Bluetooth: protect le accept and resolv lists with hdev->lock Bluetooth: use hdev lock for accept_list and reject_list in conn req Bluetooth: use hdev lock in activate_scan for hci_is_adv_monitoring Bluetooth: btrtl: Add support for RTL8852C Bluetooth: btusb: Set HCI_QUIRK_BROKEN_ENHANCED_SETUP_SYNC_CONN for QCA Bluetooth: Print broken quirks Bluetooth: HCI: Add HCI_QUIRK_BROKEN_ENHANCED_SETUP_SYNC_CONN quirk ... ==================== Link: https://lore.kernel.org/r/20220523204151.3327345-1-luiz.dentz@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-05-23Bluetooth: hci_conn: Fix hci_connect_le_syncLuiz Augusto von Dentz2-5/+8
The handling of connection failures shall be handled by the request completion callback as already done by hci_cs_le_create_conn, also make sure to use hci_conn_failed instead of hci_le_conn_failed as the later don't actually call hci_conn_del to cleanup. Link: https://github.com/bluez/bluez/issues/340 Fixes: 8e8b92ee60de5 ("Bluetooth: hci_sync: Add hci_le_create_conn_sync") Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com> Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2022-05-23Merge tag 'for-5.19/io_uring-net-2022-05-22' of git://git.kernel.dk/linux-blockLinus Torvalds1-6/+10
Pull io_uring 'more data in socket' support from Jens Axboe: "To be able to fully utilize the 'poll first' support in the core io_uring branch, it's advantageous knowing if the socket was empty after a receive. This adds support for that" * tag 'for-5.19/io_uring-net-2022-05-22' of git://git.kernel.dk/linux-block: io_uring: return hint on whether more data is available after receive tcp: pass back data left in socket after receive
2022-05-23Merge tag 'for-5.19/io_uring-socket-2022-05-22' of ↵Linus Torvalds1-10/+42
git://git.kernel.dk/linux-block Pull io_uring socket() support from Jens Axboe: "This adds support for socket(2) for io_uring. This is handy when using direct / registered file descriptors with io_uring. Outside of those two patches, a small series from Dylan on top that improves the tracing by providing a text representation of the opcode rather than needing to decode this by reading the header file every time. That sits in this branch as it was the last opcode added (until it wasn't...)" * tag 'for-5.19/io_uring-socket-2022-05-22' of git://git.kernel.dk/linux-block: io_uring: use the text representation of ops in trace io_uring: rename op -> opcode io_uring: add io_uring_get_opcode io_uring: add type to op enum io_uring: add socket(2) support net: add __sys_socket_file()
2022-05-23Bluetooth: hci_sync: use hci_skb_event() helperAhmad Fatoum1-1/+1
This instance is the only opencoded version of the macro, so have it follow suit. Signed-off-by: Ahmad Fatoum <a.fatoum@pengutronix.de> Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2022-05-23SUNRPC: Use RMW bitops in single-threaded hot pathsChuck Lever5-11/+11
I noticed CPU pipeline stalls while using perf. Once an svc thread is scheduled and executing an RPC, no other processes will touch svc_rqst::rq_flags. Thus bus-locked atomics are not needed outside the svc thread scheduler. Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2022-05-23net: dsa: OF-ware slave_mii_busLuiz Angelo Daros de Luca1-1/+6
If found, register the DSA internally allocated slave_mii_bus with an OF "mdio" child object. It can save some drivers from creating their custom internal MDIO bus. Signed-off-by: Luiz Angelo Daros de Luca <luizluca@gmail.com> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Reviewed-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-05-23net/smc: fix listen processing for SMC-Rv2liuyacan1-17/+27
In the process of checking whether RDMAv2 is available, the current implementation first sets ini->smcrv2.ib_dev_v2, and then allocates smc buf desc, but the latter may fail. Unfortunately, the caller will only check the former. In this case, a NULL pointer reference will occur in smc_clc_send_confirm_accept() when accessing conn->rmb_desc. This patch does two things: 1. Use the return code to determine whether V2 is available. 2. If the return code is NODEV, continue to check whether V1 is available. Fixes: e49300a6bf62 ("net/smc: add listen processing for SMC-Rv2") Signed-off-by: liuyacan <liuyacan@corp.netease.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-05-23net/smc: postpone sk_refcnt increment in connect()liuyacan1-1/+1
Same trigger condition as commit 86434744. When setsockopt runs in parallel to a connect(), and switch the socket into fallback mode. Then the sk_refcnt is incremented in smc_connect(), but its state stay in SMC_INIT (NOT SMC_ACTIVE). This cause the corresponding sk_refcnt decrement in __smc_release() will not be performed. Fixes: 86434744fedf ("net/smc: add fallback check to connect()") Signed-off-by: liuyacan <liuyacan@corp.netease.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-05-22net: wrap the wireless pointers in struct net_device in an ifdefJakub Kicinski2-8/+15
Most protocol-specific pointers in struct net_device are under a respective ifdef. Wireless is the notable exception. Since there's a sizable number of custom-built kernels for datacenter workloads which don't build wireless it seems reasonable to ifdefy those pointers as well. While at it move IPv4 and IPv6 pointers up, those are special for obvious reasons. Acked-by: Johannes Berg <johannes@sipsolutions.net> Acked-by: Stefan Schmidt <stefan@datenfreihafen.org> # ieee802154 Acked-by: Sven Eckelmann <sven@narfation.org> Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-05-22rxrpc: Fix decision on when to generate an IDLE ACKDavid Howells4-16/+25
Fix the decision on when to generate an IDLE ACK by keeping a count of the number of packets we've received, but not yet soft-ACK'd, and the number of packets we've processed, but not yet hard-ACK'd, rather than trying to keep track of which DATA sequence numbers correspond to those points. We then generate an ACK when either counter exceeds 2. The counters are both cleared when we transcribe the information into any sort of ACK packet for transmission. IDLE and DELAY ACKs are skipped if both counters are 0 (ie. no change). Fixes: 805b21b929e2 ("rxrpc: Send an ACK after every few DATA packets we receive") Signed-off-by: David Howells <dhowells@redhat.com> cc: Marc Dionne <marc.dionne@auristor.com> cc: linux-afs@lists.infradead.org Signed-off-by: David S. Miller <davem@davemloft.net>
2022-05-22rxrpc: Don't let ack.previousPacket regressDavid Howells3-4/+6
The previousPacket field in the rx ACK packet should never go backwards - it's now the highest DATA sequence number received, not the last on received (it used to be used for out of sequence detection). Fixes: 248f219cb8bc ("rxrpc: Rewrite the data and ack handling code") Signed-off-by: David Howells <dhowells@redhat.com> cc: Marc Dionne <marc.dionne@auristor.com> cc: linux-afs@lists.infradead.org Signed-off-by: David S. Miller <davem@davemloft.net>
2022-05-22rxrpc: Fix overlapping ACK accountingDavid Howells2-11/+12
Fix accidental overlapping of Rx-phase ACK accounting with Tx-phase ACK accounting through variables shared between the two. call->acks_* members refer to ACKs received in the Tx phase and call->ackr_* members to ACKs sent/to be sent during the Rx phase. Fixes: 1a2391c30c0b ("rxrpc: Fix detection of out of order acks") Signed-off-by: David Howells <dhowells@redhat.com> cc: Jeffrey Altman <jaltman@auristor.com> cc: Marc Dionne <marc.dionne@auristor.com> cc: linux-afs@lists.infradead.org Signed-off-by: David S. Miller <davem@davemloft.net>
2022-05-22rxrpc: Don't try to resend the request if we're receiving the replyDavid Howells1-1/+2
rxrpc has a timer to trigger resending of unacked data packets in a call. This is not cancelled when a client call switches to the receive phase on the basis that most calls don't last long enough for it to ever expire. However, if it *does* expire after we've started to receive the reply, we shouldn't then go into trying to retransmit or pinging the server to find out if an ack got lost. Fix this by skipping the resend code if we're into receiving the reply to a client call. Fixes: 17926a79320a ("[AF_RXRPC]: Provide secure RxRPC sockets for use by userspace and kernel both") Signed-off-by: David Howells <dhowells@redhat.com> cc: linux-afs@lists.infradead.org Signed-off-by: David S. Miller <davem@davemloft.net>
2022-05-22rxrpc: Fix listen() setting the bar too high for the prealloc ringsDavid Howells1-2/+2
AF_RXRPC's listen() handler lets you set the backlog up to 32 (if you bump up the sysctl), but whilst the preallocation circular buffers have 32 slots in them, one of them has to be a dead slot because we're using CIRC_CNT(). This means that listen(rxrpc_sock, 32) will cause an oops when the socket is closed because rxrpc_service_prealloc_one() allocated one too many calls and rxrpc_discard_prealloc() won't then be able to get rid of them because it'll think the ring is empty. rxrpc_release_calls_on_socket() then tries to abort them, but oopses because call->peer isn't yet set. Fix this by setting the maximum backlog to RXRPC_BACKLOG_MAX - 1 to match the ring capacity. BUG: kernel NULL pointer dereference, address: 0000000000000086 ... RIP: 0010:rxrpc_send_abort_packet+0x73/0x240 [rxrpc] Call Trace: <TASK> ? __wake_up_common_lock+0x7a/0x90 ? rxrpc_notify_socket+0x8e/0x140 [rxrpc] ? rxrpc_abort_call+0x4c/0x60 [rxrpc] rxrpc_release_calls_on_socket+0x107/0x1a0 [rxrpc] rxrpc_release+0xc9/0x1c0 [rxrpc] __sock_release+0x37/0xa0 sock_close+0x11/0x20 __fput+0x89/0x240 task_work_run+0x59/0x90 do_exit+0x319/0xaa0 Fixes: 00e907127e6f ("rxrpc: Preallocate peers, conns and calls for incoming service requests") Reported-by: Marc Dionne <marc.dionne@auristor.com> Signed-off-by: David Howells <dhowells@redhat.com> cc: linux-afs@lists.infradead.org Link: https://lists.infradead.org/pipermail/linux-afs/2022-March/005079.html Signed-off-by: David S. Miller <davem@davemloft.net>
2022-05-22afs: Adjust ACK interpretation to try and cope with NATDavid Howells1-0/+27
If a client's address changes, say if it is NAT'd, this can disrupt an in progress operation. For most operations, this is not much of a problem, but StoreData can be different as some servers modify the target file as the data comes in, so if a store request is disrupted, the file can get corrupted on the server. The problem is that the server doesn't recognise packets that come after the change of address as belonging to the original client and will bounce them, either by sending an OUT_OF_SEQUENCE ACK to the apparent new call if the packet number falls within the initial sequence number window of a call or by sending an EXCEEDS_WINDOW ACK if it falls outside and then aborting it. In both cases, firstPacket will be 1 and previousPacket will be 0 in the ACK information. Fix this by the following means: (1) If a client call receives an EXCEEDS_WINDOW ACK with firstPacket as 1 and previousPacket as 0, assume this indicates that the server saw the incoming packets from a different peer and thus as a different call. Fail the call with error -ENETRESET. (2) Also fail the call if a similar OUT_OF_SEQUENCE ACK occurs if the first packet has been hard-ACK'd. If it hasn't been hard-ACK'd, the ACK packet will cause it to get retransmitted, so the call will just be repeated. (3) Make afs_select_fileserver() treat -ENETRESET as a straight fail of the operation. (4) Prioritise the error code over things like -ECONNRESET as the server did actually respond. (5) Make writeback treat -ENETRESET as a retryable error and make it redirty all the pages involved in a write so that the VM will retry. Note that there is still a circumstance that I can't easily deal with: if the operation is fully received and processed by the server, but the reply is lost due to address change. There's no way to know if the op happened. We can examine the server, but a conflicting change could have been made by a third party - and we can't tell the difference. In such a case, a message like: kAFS: vnode modified {100058:146266} b7->b8 YFS.StoreData64 (op=2646a) will be logged to dmesg on the next op to touch the file and the client will reset the inode state, including invalidating clean parts of the pagecache. Reported-by: Marc Dionne <marc.dionne@auristor.com> Signed-off-by: David Howells <dhowells@redhat.com> cc: linux-afs@lists.infradead.org Link: http://lists.infradead.org/pipermail/linux-afs/2021-December/004811.html # v1 Signed-off-by: David S. Miller <davem@davemloft.net>
2022-05-22rxrpc, afs: Fix selection of abort codesDavid Howells2-3/+3
The RX_USER_ABORT code should really only be used to indicate that the user of the rxrpc service (ie. userspace) implicitly caused a call to be aborted - for instance if the AF_RXRPC socket is closed whilst the call was in progress. (The user may also explicitly abort a call and specify the abort code to use). Change some of the points of generation to use other abort codes instead: (1) Abort the call with RXGEN_SS_UNMARSHAL or RXGEN_CC_UNMARSHAL if we see ENOMEM and EFAULT during received data delivery and abort with RX_CALL_DEAD in the default case. (2) Abort with RXGEN_SS_MARSHAL if we get ENOMEM whilst trying to send a reply. (3) Abort with RX_CALL_DEAD if we stop hearing from the peer if we had heard from the peer and abort with RX_CALL_TIMEOUT if we hadn't. (4) Abort with RX_CALL_DEAD if we try to disconnect a call that's not completed successfully or been aborted. Reported-by: Jeffrey Altman <jaltman@auristor.com> Signed-off-by: David Howells <dhowells@redhat.com> cc: Marc Dionne <marc.dionne@auristor.com> cc: linux-afs@lists.infradead.org Signed-off-by: David S. Miller <davem@davemloft.net>
2022-05-22rxrpc: Return an error to sendmsg if call failedDavid Howells1-0/+6
If at the end of rxrpc sendmsg() or rxrpc_kernel_send_data() the call that was being given data was aborted remotely or otherwise failed, return an error rather than returning the amount of data buffered for transmission. The call (presumably) did not complete, so there's not much point continuing with it. AF_RXRPC considers it "complete" and so will be unwilling to do anything else with it - and won't send a notification for it, deeming the return from sendmsg sufficient. Not returning an error causes afs to incorrectly handle a StoreData operation that gets interrupted by a change of address due to NAT reconfiguration. This doesn't normally affect most operations since their request parameters tend to fit into a single UDP packet and afs_make_call() returns before the server responds; StoreData is different as it involves transmission of a lot of data. This can be triggered on a client by doing something like: dd if=/dev/zero of=/afs/example.com/foo bs=1M count=512 at one prompt, and then changing the network address at another prompt, e.g.: ifconfig enp6s0 inet 192.168.6.2 && route add 192.168.6.1 dev enp6s0 Tracing packets on an Auristor fileserver looks something like: 192.168.6.1 -> 192.168.6.3 RX 107 ACK Idle Seq: 0 Call: 4 Source Port: 7000 Destination Port: 7001 192.168.6.3 -> 192.168.6.1 AFS (RX) 1482 FS Request: Unknown(64538) (64538) 192.168.6.3 -> 192.168.6.1 AFS (RX) 1482 FS Request: Unknown(64538) (64538) 192.168.6.1 -> 192.168.6.3 RX 107 ACK Idle Seq: 0 Call: 4 Source Port: 7000 Destination Port: 7001 <ARP exchange for 192.168.6.2> 192.168.6.2 -> 192.168.6.1 AFS (RX) 1482 FS Request: Unknown(0) (0) 192.168.6.2 -> 192.168.6.1 AFS (RX) 1482 FS Request: Unknown(0) (0) 192.168.6.1 -> 192.168.6.2 RX 107 ACK Exceeds Window Seq: 0 Call: 4 Source Port: 7000 Destination Port: 7001 192.168.6.1 -> 192.168.6.2 RX 74 ABORT Seq: 0 Call: 4 Source Port: 7000 Destination Port: 7001 192.168.6.1 -> 192.168.6.2 RX 74 ABORT Seq: 29321 Call: 4 Source Port: 7000 Destination Port: 7001 The Auristor fileserver logs code -453 (RXGEN_SS_UNMARSHAL), but the abort code received by kafs is -5 (RX_PROTOCOL_ERROR) as the rx layer sees the condition and generates an abort first and the unmarshal error is a consequence of that at the application layer. Reported-by: Marc Dionne <marc.dionne@auristor.com> Signed-off-by: David Howells <dhowells@redhat.com> cc: linux-afs@lists.infradead.org Link: http://lists.infradead.org/pipermail/linux-afs/2021-December/004810.html # v1 Signed-off-by: David S. Miller <davem@davemloft.net>
2022-05-22rxrpc: Fix locking issueDavid Howells5-22/+16
There's a locking issue with the per-netns list of calls in rxrpc. The pieces of code that add and remove a call from the list use write_lock() and the calls procfile uses read_lock() to access it. However, the timer callback function may trigger a removal by trying to queue a call for processing and finding that it's already queued - at which point it has a spare refcount that it has to do something with. Unfortunately, if it puts the call and this reduces the refcount to 0, the call will be removed from the list. Unfortunately, since the _bh variants of the locking functions aren't used, this can deadlock. ================================ WARNING: inconsistent lock state 5.18.0-rc3-build4+ #10 Not tainted -------------------------------- inconsistent {SOFTIRQ-ON-W} -> {IN-SOFTIRQ-W} usage. ksoftirqd/2/25 [HC0[0]:SC1[1]:HE1:SE0] takes: ffff888107ac4038 (&rxnet->call_lock){+.?.}-{2:2}, at: rxrpc_put_call+0x103/0x14b {SOFTIRQ-ON-W} state was registered at: ... Possible unsafe locking scenario: CPU0 ---- lock(&rxnet->call_lock); <Interrupt> lock(&rxnet->call_lock); *** DEADLOCK *** 1 lock held by ksoftirqd/2/25: #0: ffff8881008ffdb0 ((&call->timer)){+.-.}-{0:0}, at: call_timer_fn+0x5/0x23d Changes ======= ver #2) - Changed to using list_next_rcu() rather than rcu_dereference() directly. Fixes: 17926a79320a ("[AF_RXRPC]: Provide secure RxRPC sockets for use by userspace and kernel both") Signed-off-by: David Howells <dhowells@redhat.com> cc: Marc Dionne <marc.dionne@auristor.com> cc: linux-afs@lists.infradead.org Signed-off-by: David S. Miller <davem@davemloft.net>