summaryrefslogtreecommitdiffstats
path: root/kernel/bpf
AgeCommit message (Collapse)AuthorFilesLines
2022-10-20Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netJakub Kicinski5-7/+7
No conflicts. Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-10-18Merge tag 'for-netdev' of ↵Jakub Kicinski3-8/+28
git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next Daniel Borkmann says: ==================== pull-request: bpf-next 2022-10-18 We've added 33 non-merge commits during the last 14 day(s) which contain a total of 31 files changed, 874 insertions(+), 538 deletions(-). The main changes are: 1) Add RCU grace period chaining to BPF to wait for the completion of access from both sleepable and non-sleepable BPF programs, from Hou Tao & Paul E. McKenney. 2) Improve helper UAPI by explicitly defining BPF_FUNC_xxx integer values. In the wild we have seen OS vendors doing buggy backports where helper call numbers mismatched. This is an attempt to make backports more foolproof, from Andrii Nakryiko. 3) Add libbpf *_opts API-variants for bpf_*_get_fd_by_id() functions, from Roberto Sassu. 4) Fix libbpf's BTF dumper for structs with padding-only fields, from Eduard Zingerman. 5) Fix various libbpf bugs which have been found from fuzzing with malformed BPF object files, from Shung-Hsi Yu. 6) Clean up an unneeded check on existence of SSE2 in BPF x86-64 JIT, from Jie Meng. 7) Fix various ASAN bugs in both libbpf and selftests when running the BPF selftest suite on arm64, from Xu Kuohai. 8) Fix missing bpf_iter_vma_offset__destroy() call in BPF iter selftest and use in-skeleton link pointer to remove an explicit bpf_link__destroy(), from Jiri Olsa. 9) Fix BPF CI breakage by pointing to iptables-legacy instead of relying on symlinked iptables which got upgraded to iptables-nft, from Martin KaFai Lau. 10) Minor BPF selftest improvements all over the place, from various others. * tag 'for-netdev' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (33 commits) bpf/docs: Update README for most recent vmtest.sh bpf: Use rcu_trace_implies_rcu_gp() for program array freeing bpf: Use rcu_trace_implies_rcu_gp() in local storage map bpf: Use rcu_trace_implies_rcu_gp() in bpf memory allocator rcu-tasks: Provide rcu_trace_implies_rcu_gp() selftests/bpf: Use sys_pidfd_open() helper when possible libbpf: Fix null-pointer dereference in find_prog_by_sec_insn() libbpf: Deal with section with no data gracefully libbpf: Use elf_getshdrnum() instead of e_shnum selftest/bpf: Fix error usage of ASSERT_OK in xdp_adjust_tail.c selftests/bpf: Fix error failure of case test_xdp_adjust_tail_grow selftest/bpf: Fix memory leak in kprobe_multi_test selftests/bpf: Fix memory leak caused by not destroying skeleton libbpf: Fix memory leak in parse_usdt_arg() libbpf: Fix use-after-free in btf_dump_name_dups selftests/bpf: S/iptables/iptables-legacy/ in the bpf_nf and xdp_synproxy test selftests/bpf: Alphabetize DENYLISTs selftests/bpf: Add tests for _opts variants of bpf_*_get_fd_by_id() libbpf: Introduce bpf_link_get_fd_by_id_opts() libbpf: Introduce bpf_btf_get_fd_by_id_opts() ... ==================== Link: https://lore.kernel.org/r/20221018210631.11211-1-daniel@iogearbox.net Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-10-18bpf: Use rcu_trace_implies_rcu_gp() for program array freeingHou Tao1-1/+7
To support both sleepable and normal uprobe bpf program, the freeing of trace program array chains a RCU-tasks-trace grace period and a normal RCU grace period one after the other. With the introduction of rcu_trace_implies_rcu_gp(), __bpf_prog_array_free_sleepable_cb() can check whether or not a normal RCU grace period has also passed after a RCU-tasks-trace grace period has passed. If it is true, it is safe to invoke kfree() directly. Signed-off-by: Hou Tao <houtao1@huawei.com> Link: https://lore.kernel.org/r/20221014113946.965131-5-houtao@huaweicloud.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-10-18bpf: Use rcu_trace_implies_rcu_gp() in local storage mapHou Tao1-2/+11
Local storage map is accessible for both sleepable and non-sleepable bpf program, and its memory is freed by using both call_rcu_tasks_trace() and kfree_rcu() to wait for both RCU-tasks-trace grace period and RCU grace period to pass. With the introduction of rcu_trace_implies_rcu_gp(), both bpf_selem_free_rcu() and bpf_local_storage_free_rcu() can check whether or not a normal RCU grace period has also passed after a RCU-tasks-trace grace period has passed. If it is true, it is safe to call kfree() directly. Signed-off-by: Hou Tao <houtao1@huawei.com> Link: https://lore.kernel.org/r/20221014113946.965131-4-houtao@huaweicloud.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-10-18bpf: Use rcu_trace_implies_rcu_gp() in bpf memory allocatorHou Tao1-5/+10
The memory free logic in bpf memory allocator chains a RCU Tasks Trace grace period and a normal RCU grace period one after the other, so it can ensure that both sleepable and non-sleepable programs have finished. With the introduction of rcu_trace_implies_rcu_gp(), __free_rcu_tasks_trace() can check whether or not a normal RCU grace period has also passed after a RCU Tasks Trace grace period has passed. If it is true, freeing these elements directly, else freeing through call_rcu(). Signed-off-by: Hou Tao <houtao1@huawei.com> Link: https://lore.kernel.org/r/20221014113946.965131-3-houtao@huaweicloud.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-10-17Merge tag 'cgroup-for-6.1-rc1-fixes' of ↵Linus Torvalds1-1/+1
git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup Pull cgroup fixes from Tejun Heo: - Fix a recent regression where a sleeping kernfs function is called with css_set_lock (spinlock) held - Revert the commit to enable cgroup1 support for cgroup_get_from_fd/file() Multiple users assume that the lookup only works for cgroup2 and breaks when fed a cgroup1 file. Instead, introduce a separate set of functions to lookup both v1 and v2 and use them where the user explicitly wants to support both versions. - Compat update for tools/perf/util/bpf_skel/bperf_cgroup.bpf.c. - Add Josef Bacik as a blkcg maintainer. * tag 'cgroup-for-6.1-rc1-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: blkcg: Update MAINTAINERS entry mm: cgroup: fix comments for get from fd/file helpers perf stat: Support old kernels for bperf cgroup counting bpf: cgroup_iter: support cgroup1 using cgroup fd cgroup: add cgroup_v1v2_get_from_[fd/file]() Revert "cgroup: enable cgroup_get_from_file() on cgroup1" cgroup: Reorganize css_set_lock and kernfs path processing
2022-10-11treewide: use get_random_u32() when possibleJason A. Donenfeld4-4/+4
The prandom_u32() function has been a deprecated inline wrapper around get_random_u32() for several releases now, and compiles down to the exact same code. Replace the deprecated wrapper with a direct call to the real function. The same also applies to get_random_int(), which is just a wrapper around get_random_u32(). This was done as a basic find and replace. Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Reviewed-by: Kees Cook <keescook@chromium.org> Reviewed-by: Yury Norov <yury.norov@gmail.com> Reviewed-by: Jan Kara <jack@suse.cz> # for ext4 Acked-by: Toke Høiland-Jørgensen <toke@toke.dk> # for sch_cake Acked-by: Chuck Lever <chuck.lever@oracle.com> # for nfsd Acked-by: Jakub Kicinski <kuba@kernel.org> Acked-by: Mika Westerberg <mika.westerberg@linux.intel.com> # for thunderbolt Acked-by: Darrick J. Wong <djwong@kernel.org> # for xfs Acked-by: Helge Deller <deller@gmx.de> # for parisc Acked-by: Heiko Carstens <hca@linux.ibm.com> # for s390 Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2022-10-11treewide: use prandom_u32_max() when possible, part 1Jason A. Donenfeld1-2/+2
Rather than incurring a division or requesting too many random bytes for the given range, use the prandom_u32_max() function, which only takes the minimum required bytes from the RNG and avoids divisions. This was done mechanically with this coccinelle script: @basic@ expression E; type T; identifier get_random_u32 =~ "get_random_int|prandom_u32|get_random_u32"; typedef u64; @@ ( - ((T)get_random_u32() % (E)) + prandom_u32_max(E) | - ((T)get_random_u32() & ((E) - 1)) + prandom_u32_max(E * XXX_MAKE_SURE_E_IS_POW2) | - ((u64)(E) * get_random_u32() >> 32) + prandom_u32_max(E) | - ((T)get_random_u32() & ~PAGE_MASK) + prandom_u32_max(PAGE_SIZE) ) @multi_line@ identifier get_random_u32 =~ "get_random_int|prandom_u32|get_random_u32"; identifier RAND; expression E; @@ - RAND = get_random_u32(); ... when != RAND - RAND %= (E); + RAND = prandom_u32_max(E); // Find a potential literal @literal_mask@ expression LITERAL; type T; identifier get_random_u32 =~ "get_random_int|prandom_u32|get_random_u32"; position p; @@ ((T)get_random_u32()@p & (LITERAL)) // Add one to the literal. @script:python add_one@ literal << literal_mask.LITERAL; RESULT; @@ value = None if literal.startswith('0x'): value = int(literal, 16) elif literal[0] in '123456789': value = int(literal, 10) if value is None: print("I don't know how to handle %s" % (literal)) cocci.include_match(False) elif value == 2**32 - 1 or value == 2**31 - 1 or value == 2**24 - 1 or value == 2**16 - 1 or value == 2**8 - 1: print("Skipping 0x%x for cleanup elsewhere" % (value)) cocci.include_match(False) elif value & (value + 1) != 0: print("Skipping 0x%x because it's not a power of two minus one" % (value)) cocci.include_match(False) elif literal.startswith('0x'): coccinelle.RESULT = cocci.make_expr("0x%x" % (value + 1)) else: coccinelle.RESULT = cocci.make_expr("%d" % (value + 1)) // Replace the literal mask with the calculated result. @plus_one@ expression literal_mask.LITERAL; position literal_mask.p; expression add_one.RESULT; identifier FUNC; @@ - (FUNC()@p & (LITERAL)) + prandom_u32_max(RESULT) @collapse_ret@ type T; identifier VAR; expression E; @@ { - T VAR; - VAR = (E); - return VAR; + return E; } @drop_var@ type T; identifier VAR; @@ { - T VAR; ... when != VAR } Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Reviewed-by: Kees Cook <keescook@chromium.org> Reviewed-by: Yury Norov <yury.norov@gmail.com> Reviewed-by: KP Singh <kpsingh@kernel.org> Reviewed-by: Jan Kara <jack@suse.cz> # for ext4 and sbitmap Reviewed-by: Christoph Böhmwalder <christoph.boehmwalder@linbit.com> # for drbd Acked-by: Jakub Kicinski <kuba@kernel.org> Acked-by: Heiko Carstens <hca@linux.ibm.com> # for s390 Acked-by: Ulf Hansson <ulf.hansson@linaro.org> # for mmc Acked-by: Darrick J. Wong <djwong@kernel.org> # for xfs Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2022-10-11bpf: cgroup_iter: support cgroup1 using cgroup fdYosry Ahmed1-1/+1
Use cgroup_v1v2_get_from_fd() in cgroup_iter to support attaching to both cgroup v1 and v2 using fds. Signed-off-by: Yosry Ahmed <yosryahmed@google.com> Acked-by: Martin KaFai Lau <martin.lau@kernel.org> Signed-off-by: Tejun Heo <tj@kernel.org>
2022-10-10Merge tag 'mm-stable-2022-10-08' of ↵Linus Torvalds2-6/+6
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull MM updates from Andrew Morton: - Yu Zhao's Multi-Gen LRU patches are here. They've been under test in linux-next for a couple of months without, to my knowledge, any negative reports (or any positive ones, come to that). - Also the Maple Tree from Liam Howlett. An overlapping range-based tree for vmas. It it apparently slightly more efficient in its own right, but is mainly targeted at enabling work to reduce mmap_lock contention. Liam has identified a number of other tree users in the kernel which could be beneficially onverted to mapletrees. Yu Zhao has identified a hard-to-hit but "easy to fix" lockdep splat at [1]. This has yet to be addressed due to Liam's unfortunately timed vacation. He is now back and we'll get this fixed up. - Dmitry Vyukov introduces KMSAN: the Kernel Memory Sanitizer. It uses clang-generated instrumentation to detect used-unintialized bugs down to the single bit level. KMSAN keeps finding bugs. New ones, as well as the legacy ones. - Yang Shi adds a userspace mechanism (madvise) to induce a collapse of memory into THPs. - Zach O'Keefe has expanded Yang Shi's madvise(MADV_COLLAPSE) to support file/shmem-backed pages. - userfaultfd updates from Axel Rasmussen - zsmalloc cleanups from Alexey Romanov - cleanups from Miaohe Lin: vmscan, hugetlb_cgroup, hugetlb and memory-failure - Huang Ying adds enhancements to NUMA balancing memory tiering mode's page promotion, with a new way of detecting hot pages. - memcg updates from Shakeel Butt: charging optimizations and reduced memory consumption. - memcg cleanups from Kairui Song. - memcg fixes and cleanups from Johannes Weiner. - Vishal Moola provides more folio conversions - Zhang Yi removed ll_rw_block() :( - migration enhancements from Peter Xu - migration error-path bugfixes from Huang Ying - Aneesh Kumar added ability for a device driver to alter the memory tiering promotion paths. For optimizations by PMEM drivers, DRM drivers, etc. - vma merging improvements from Jakub Matěn. - NUMA hinting cleanups from David Hildenbrand. - xu xin added aditional userspace visibility into KSM merging activity. - THP & KSM code consolidation from Qi Zheng. - more folio work from Matthew Wilcox. - KASAN updates from Andrey Konovalov. - DAMON cleanups from Kaixu Xia. - DAMON work from SeongJae Park: fixes, cleanups. - hugetlb sysfs cleanups from Muchun Song. - Mike Kravetz fixes locking issues in hugetlbfs and in hugetlb core. Link: https://lkml.kernel.org/r/CAOUHufZabH85CeUN-MEMgL8gJGzJEWUrkiM58JkTbBhh-jew0Q@mail.gmail.com [1] * tag 'mm-stable-2022-10-08' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (555 commits) hugetlb: allocate vma lock for all sharable vmas hugetlb: take hugetlb vma_lock when clearing vma_lock->vma pointer hugetlb: fix vma lock handling during split vma and range unmapping mglru: mm/vmscan.c: fix imprecise comments mm/mglru: don't sync disk for each aging cycle mm: memcontrol: drop dead CONFIG_MEMCG_SWAP config symbol mm: memcontrol: use do_memsw_account() in a few more places mm: memcontrol: deprecate swapaccounting=0 mode mm: memcontrol: don't allocate cgroup swap arrays when memcg is disabled mm/secretmem: remove reduntant return value mm/hugetlb: add available_huge_pages() func mm: remove unused inline functions from include/linux/mm_inline.h selftests/vm: add selftest for MADV_COLLAPSE of uffd-minor memory selftests/vm: add file/shmem MADV_COLLAPSE selftest for cleared pmd selftests/vm: add thp collapse shmem testing selftests/vm: add thp collapse file and tmpfs testing selftests/vm: modularize thp collapse memory operations selftests/vm: dedup THP helpers mm/khugepaged: add tracepoint to hpage_collapse_scan_file() mm/madvise: add file and shmem support to MADV_COLLAPSE ...
2022-10-10Merge tag 'perf-core-2022-10-07' of ↵Linus Torvalds1-2/+2
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull perf events updates from Ingo Molnar: "PMU driver updates: - Add AMD Last Branch Record Extension Version 2 (LbrExtV2) feature support for Zen 4 processors. - Extend the perf ABI to provide branch speculation information, if available, and use this on CPUs that have it (eg. LbrExtV2). - Improve Intel PEBS TSC timestamp handling & integration. - Add Intel Raptor Lake S CPU support. - Add 'perf mem' and 'perf c2c' memory profiling support on AMD CPUs by utilizing IBS tagged load/store samples. - Clean up & optimize various x86 PMU details. HW breakpoints: - Big rework to optimize the code for systems with hundreds of CPUs and thousands of breakpoints: - Replace the nr_bp_mutex global mutex with the bp_cpuinfo_sem per-CPU rwsem that is read-locked during most of the key operations. - Improve the O(#cpus * #tasks) logic in toggle_bp_slot() and fetch_bp_busy_slots(). - Apply micro-optimizations & cleanups. - Misc cleanups & enhancements" * tag 'perf-core-2022-10-07' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (75 commits) perf/hw_breakpoint: Annotate tsk->perf_event_mutex vs ctx->mutex perf: Fix pmu_filter_match() perf: Fix lockdep_assert_event_ctx() perf/x86/amd/lbr: Adjust LBR regardless of filtering perf/x86/utils: Fix uninitialized var in get_branch_type() perf/uapi: Define PERF_MEM_SNOOPX_PEER in kernel header file perf/x86/amd: Support PERF_SAMPLE_PHY_ADDR perf/x86/amd: Support PERF_SAMPLE_ADDR perf/x86/amd: Support PERF_SAMPLE_{WEIGHT|WEIGHT_STRUCT} perf/x86/amd: Support PERF_SAMPLE_DATA_SRC perf/x86/amd: Add IBS OP_DATA2 DataSrc bit definitions perf/mem: Introduce PERF_MEM_LVLNUM_{EXTN_MEM|IO} perf/x86/uncore: Add new Raptor Lake S support perf/x86/cstate: Add new Raptor Lake S support perf/x86/msr: Add new Raptor Lake S support perf/x86: Add new Raptor Lake S support bpf: Check flags for branch stack in bpf_read_branch_records helper perf, hw_breakpoint: Fix use-after-free if perf_event_open() fails perf: Use sample_flags for raw_data perf: Use sample_flags for addr ...
2022-10-04Merge tag 'net-next-6.1' of ↵Linus Torvalds26-675/+2422
git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next Pull networking updates from Jakub Kicinski: "Core: - Introduce and use a single page frag cache for allocating small skb heads, clawing back the 10-20% performance regression in UDP flood test from previous fixes. - Run packets which already went thru HW coalescing thru SW GRO. This significantly improves TCP segment coalescing and simplifies deployments as different workloads benefit from HW or SW GRO. - Shrink the size of the base zero-copy send structure. - Move TCP init under a new slow / sleepable version of DO_ONCE(). BPF: - Add BPF-specific, any-context-safe memory allocator. - Add helpers/kfuncs for PKCS#7 signature verification from BPF programs. - Define a new map type and related helpers for user space -> kernel communication over a ring buffer (BPF_MAP_TYPE_USER_RINGBUF). - Allow targeting BPF iterators to loop through resources of one task/thread. - Add ability to call selected destructive functions. Expose crash_kexec() to allow BPF to trigger a kernel dump. Use CAP_SYS_BOOT check on the loading process to judge permissions. - Enable BPF to collect custom hierarchical cgroup stats efficiently by integrating with the rstat framework. - Support struct arguments for trampoline based programs. Only structs with size <= 16B and x86 are supported. - Invoke cgroup/connect{4,6} programs for unprivileged ICMP ping sockets (instead of just TCP and UDP sockets). - Add a helper for accessing CLOCK_TAI for time sensitive network related programs. - Support accessing network tunnel metadata's flags. - Make TCP SYN ACK RTO tunable by BPF programs with TCP Fast Open. - Add support for writing to Netfilter's nf_conn:mark. Protocols: - WiFi: more Extremely High Throughput (EHT) and Multi-Link Operation (MLO) work (802.11be, WiFi 7). - vsock: improve support for SO_RCVLOWAT. - SMC: support SO_REUSEPORT. - Netlink: define and document how to use netlink in a "modern" way. Support reporting missing attributes via extended ACK. - IPSec: support collect metadata mode for xfrm interfaces. - TCPv6: send consistent autoflowlabel in SYN_RECV state and RST packets. - TCP: introduce optional per-netns connection hash table to allow better isolation between namespaces (opt-in, at the cost of memory and cache pressure). - MPTCP: support TCP_FASTOPEN_CONNECT. - Add NEXT-C-SID support in Segment Routing (SRv6) End behavior. - Adjust IP_UNICAST_IF sockopt behavior for connected UDP sockets. - Open vSwitch: - Allow specifying ifindex of new interfaces. - Allow conntrack and metering in non-initial user namespace. - TLS: support the Korean ARIA-GCM crypto algorithm. - Remove DECnet support. Driver API: - Allow selecting the conduit interface used by each port in DSA switches, at runtime. - Ethernet Power Sourcing Equipment and Power Device support. - Add tc-taprio support for queueMaxSDU parameter, i.e. setting per traffic class max frame size for time-based packet schedules. - Support PHY rate matching - adapting between differing host-side and link-side speeds. - Introduce QUSGMII PHY mode and 1000BASE-KX interface mode. - Validate OF (device tree) nodes for DSA shared ports; make phylink-related properties mandatory on DSA and CPU ports. Enforcing more uniformity should allow transitioning to phylink. - Require that flash component name used during update matches one of the components for which version is reported by info_get(). - Remove "weight" argument from driver-facing NAPI API as much as possible. It's one of those magic knobs which seemed like a good idea at the time but is too indirect to use in practice. - Support offload of TLS connections with 256 bit keys. New hardware / drivers: - Ethernet: - Microchip KSZ9896 6-port Gigabit Ethernet Switch - Renesas Ethernet AVB (EtherAVB-IF) Gen4 SoCs - Analog Devices ADIN1110 and ADIN2111 industrial single pair Ethernet (10BASE-T1L) MAC+PHY. - Rockchip RV1126 Gigabit Ethernet (a version of stmmac IP). - Ethernet SFPs / modules: - RollBall / Hilink / Turris 10G copper SFPs - HALNy GPON module - WiFi: - CYW43439 SDIO chipset (brcmfmac) - CYW89459 PCIe chipset (brcmfmac) - BCM4378 on Apple platforms (brcmfmac) Drivers: - CAN: - gs_usb: HW timestamp support - Ethernet PHYs: - lan8814: cable diagnostics - Ethernet NICs: - Intel (100G): - implement control of FCS/CRC stripping - port splitting via devlink - L2TPv3 filtering offload - nVidia/Mellanox: - tunnel offload for sub-functions - MACSec offload, w/ Extended packet number and replay window offload - significantly restructure, and optimize the AF_XDP support, align the behavior with other vendors - Huawei: - configuring DSCP map for traffic class selection - querying standard FEC statistics - querying SerDes lane number via ethtool - Marvell/Cavium: - egress priority flow control - MACSec offload - AMD/SolarFlare: - PTP over IPv6 and raw Ethernet - small / embedded: - ax88772: convert to phylink (to support SFP cages) - altera: tse: convert to phylink - ftgmac100: support fixed link - enetc: standard Ethtool counters - macb: ZynqMP SGMII dynamic configuration support - tsnep: support multi-queue and use page pool - lan743x: Rx IP & TCP checksum offload - igc: add xdp frags support to ndo_xdp_xmit - Ethernet high-speed switches: - Marvell (prestera): - support SPAN port features (traffic mirroring) - nexthop object offloading - Microchip (sparx5): - multicast forwarding offload - QoS queuing offload (tc-mqprio, tc-tbf, tc-ets) - Ethernet embedded switches: - Marvell (mv88e6xxx): - support RGMII cmode - NXP (felix): - standardized ethtool counters - Microchip (lan966x): - QoS queuing offload (tc-mqprio, tc-tbf, tc-cbs, tc-ets) - traffic policing and mirroring - link aggregation / bonding offload - QUSGMII PHY mode support - Qualcomm 802.11ax WiFi (ath11k): - cold boot calibration support on WCN6750 - support to connect to a non-transmit MBSSID AP profile - enable remain-on-channel support on WCN6750 - Wake-on-WLAN support for WCN6750 - support to provide transmit power from firmware via nl80211 - support to get power save duration for each client - spectral scan support for 160 MHz - MediaTek WiFi (mt76): - WiFi-to-Ethernet bridging offload for MT7986 chips - RealTek WiFi (rtw89): - P2P support" * tag 'net-next-6.1' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: (1864 commits) eth: pse: add missing static inlines once: rename _SLOW to _SLEEPABLE net: pse-pd: add regulator based PSE driver dt-bindings: net: pse-dt: add bindings for regulator based PoDL PSE controller ethtool: add interface to interact with Ethernet Power Equipment net: mdiobus: search for PSE nodes by parsing PHY nodes. net: mdiobus: fwnode_mdiobus_register_phy() rework error handling net: add framework to support Ethernet PSE and PDs devices dt-bindings: net: phy: add PoDL PSE property net: marvell: prestera: Propagate nh state from hw to kernel net: marvell: prestera: Add neighbour cache accounting net: marvell: prestera: add stub handler neighbour events net: marvell: prestera: Add heplers to interact with fib_notifier_info net: marvell: prestera: Add length macros for prestera_ip_addr net: marvell: prestera: add delayed wq and flush wq on deinit net: marvell: prestera: Add strict cleanup of fib arbiter net: marvell: prestera: Add cleanup of allocated fib_nodes net: marvell: prestera: Add router nexthops ABI eth: octeon: fix build after netif_napi_add() changes net/mlx5: E-Switch, Return EBUSY if can't get mode lock ...
2022-10-03Merge tag 'lsm-pr-20221003' of ↵Linus Torvalds1-0/+1
git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/lsm Pull LSM updates from Paul Moore: "Seven patches for the LSM layer and we've got a mix of trivial and significant patches. Highlights below, starting with the smaller bits first so they don't get lost in the discussion of the larger items: - Remove some redundant NULL pointer checks in the common LSM audit code. - Ratelimit the lockdown LSM's access denial messages. With this change there is a chance that the last visible lockdown message on the console is outdated/old, but it does help preserve the initial series of lockdown denials that started the denial message flood and my gut feeling is that these might be the more valuable messages. - Open userfaultfds as readonly instead of read/write. While this code obviously lives outside the LSM, it does have a noticeable impact on the LSMs with Ondrej explaining the situation in the commit description. It is worth noting that this patch languished on the VFS list for over a year without any comments (objections or otherwise) so I took the liberty of pulling it into the LSM tree after giving fair notice. It has been in linux-next since the end of August without any noticeable problems. - Add a LSM hook for user namespace creation, with implementations for both the BPF LSM and SELinux. Even though the changes are fairly small, this is the bulk of the diffstat as we are also including BPF LSM selftests for the new hook. It's also the most contentious of the changes in this pull request with Eric Biederman NACK'ing the LSM hook multiple times during its development and discussion upstream. While I've never taken NACK's lightly, I'm sending these patches to you because it is my belief that they are of good quality, satisfy a long-standing need of users and distros, and are in keeping with the existing nature of the LSM layer and the Linux Kernel as a whole. The patches in implement a LSM hook for user namespace creation that allows for a granular approach, configurable at runtime, which enables both monitoring and control of user namespaces. The general consensus has been that this is far preferable to the other solutions that have been adopted downstream including outright removal from the kernel, disabling via system wide sysctls, or various other out-of-tree mechanisms that users have been forced to adopt since we haven't been able to provide them an upstream solution for their requests. Eric has been steadfast in his objections to this LSM hook, explaining that any restrictions on the user namespace could have significant impact on userspace. While there is the possibility of impacting userspace, it is important to note that this solution only impacts userspace when it is requested based on the runtime configuration supplied by the distro/admin/user. Frederick (the pathset author), the LSM/security community, and myself have tried to work with Eric during development of this patchset to find a mutually acceptable solution, but Eric's approach and unwillingness to engage in a meaningful way have made this impossible. I have CC'd Eric directly on this pull request so he has a chance to provide his side of the story; there have been no objections outside of Eric's" * tag 'lsm-pr-20221003' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/lsm: lockdown: ratelimit denial messages userfaultfd: open userfaultfds with O_RDONLY selinux: Implement userns_create hook selftests/bpf: Add tests verifying bpf lsm userns_create hook bpf-lsm: Make bpf_lsm_userns_create() sleepable security, lsm: Introduce security_create_user_ns() lsm: clean up redundant NULL pointer check
2022-10-03Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netJakub Kicinski4-25/+35
Merge in the left-over fixes before the net-next pull-request. Conflicts: drivers/net/ethernet/mediatek/mtk_ppe.c ae3ed15da588 ("net: ethernet: mtk_eth_soc: fix state in __mtk_foe_entry_clear") 9d8cb4c096ab ("net: ethernet: mtk_eth_soc: add foe_entry_size to mtk_eth_soc") https://lore.kernel.org/all/6cb6893b-4921-a068-4c30-1109795110bb@tessares.net/ kernel/bpf/helpers.c 8addbfc7b308 ("bpf: Gate dynptr API behind CAP_BPF") 5679ff2f138f ("bpf: Move bpf_loop and bpf_for_each_map_elem under CAP_BPF") 8a67f2de9b1d ("bpf: expose bpf_strtol and bpf_strtoul to all program types") https://lore.kernel.org/all/20221003201957.13149-1-daniel@iogearbox.net/ Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-10-03bpf: kmsan: initialize BPF registers with zeroesAlexander Potapenko1-1/+1
When executing BPF programs, certain registers may get passed uninitialized to helper functions. E.g. when performing a JMP_CALL, registers BPF_R1-BPF_R5 are always passed to the helper, no matter how many of them are actually used. Passing uninitialized values as function parameters is technically undefined behavior, so we work around it by always initializing the registers. Link: https://lkml.kernel.org/r/20220915150417.722975-42-glider@google.com Signed-off-by: Alexander Potapenko <glider@google.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Alexei Starovoitov <ast@kernel.org> Cc: Andrey Konovalov <andreyknvl@gmail.com> Cc: Andrey Konovalov <andreyknvl@google.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Borislav Petkov <bp@alien8.de> Cc: Christoph Hellwig <hch@lst.de> Cc: Christoph Lameter <cl@linux.com> Cc: David Rientjes <rientjes@google.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Eric Biggers <ebiggers@google.com> Cc: Eric Biggers <ebiggers@kernel.org> Cc: Eric Dumazet <edumazet@google.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Herbert Xu <herbert@gondor.apana.org.au> Cc: Ilya Leoshkevich <iii@linux.ibm.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Kees Cook <keescook@chromium.org> Cc: Marco Elver <elver@google.com> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Michael S. Tsirkin <mst@redhat.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Petr Mladek <pmladek@suse.com> Cc: Stephen Rothwell <sfr@canb.auug.org.au> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: Vegard Nossum <vegard.nossum@oracle.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-09-29bpf: Add __bpf_prog_{enter,exit}_struct_ops for struct_ops trampolineMartin KaFai Lau1-0/+23
The struct_ops prog is to allow using bpf to implement the functions in a struct (eg. kernel module). The current usage is to implement the tcp_congestion. The kernel does not call the tcp-cc's ops (ie. the bpf prog) in a recursive way. The struct_ops is sharing the tracing-trampoline's enter/exit function which tracks prog->active to avoid recursion. It is needed for tracing prog. However, it turns out the struct_ops bpf prog will hit this prog->active and unnecessarily skipped running the struct_ops prog. eg. The '.ssthresh' may run in_task() and then interrupted by softirq that runs the same '.ssthresh'. Skip running the '.ssthresh' will end up returning random value to the caller. The patch adds __bpf_prog_{enter,exit}_struct_ops for the struct_ops trampoline. They do not track the prog->active to detect recursion. One exception is when the tcp_congestion's '.init' ops is doing bpf_setsockopt(TCP_CONGESTION) and then recurs to the same '.init' ops. This will be addressed in the following patches. Fixes: ca06f55b9002 ("bpf: Add per-program recursion prevention mechanism") Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org> Link: https://lore.kernel.org/r/20220929070407.965581-2-martin.lau@linux.dev Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-09-29Merge branch 'v6.0-rc7'Peter Zijlstra4-9/+12
Merge upstream to get RAPTORLAKE_S Signed-off-by: Peter Zijlstra <peterz@infradead.org>
2022-09-28bpf: Handle show_fdinfo for the parameterized task BPF iteratorsKui-Feng Lee1-0/+18
Show information of iterators in the respective files under /proc/<pid>/fdinfo/. For example, for a task file iterator with 1723 as the value of tid parameter, its fdinfo would look like the following lines. pos: 0 flags: 02000000 mnt_id: 14 ino: 38 link_type: iter link_id: 51 prog_tag: a590ac96db22b825 prog_id: 299 target_name: task_file task_type: TID tid: 1723 This patch add the last three fields. task_type is the type of the task parameter. TID means the iterator visit only the thread specified by tid. The value of tid in the above example is 1723. For the case of PID task_type, it means the iterator visits only threads of a process and will show the pid value of the process instead of a tid. Signed-off-by: Kui-Feng Lee <kuifeng@fb.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Acked-by: Yonghong Song <yhs@fb.com> Acked-by: Martin KaFai Lau <martin.lau@kernel.org> Link: https://lore.kernel.org/bpf/20220926184957.208194-4-kuifeng@fb.com
2022-09-28bpf: Handle bpf_link_info for the parameterized task BPF iterators.Kui-Feng Lee1-0/+18
Add new fields to bpf_link_info that users can query it through bpf_obj_get_info_by_fd(). Signed-off-by: Kui-Feng Lee <kuifeng@fb.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Acked-by: Yonghong Song <yhs@fb.com> Acked-by: Martin KaFai Lau <martin.lau@kernel.org> Link: https://lore.kernel.org/bpf/20220926184957.208194-3-kuifeng@fb.com
2022-09-28bpf: Parameterize task iterators.Kui-Feng Lee1-22/+166
Allow creating an iterator that loops through resources of one thread/process. People could only create iterators to loop through all resources of files, vma, and tasks in the system, even though they were interested in only the resources of a specific task or process. Passing the additional parameters, people can now create an iterator to go through all resources or only the resources of a task. Signed-off-by: Kui-Feng Lee <kuifeng@fb.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Acked-by: Yonghong Song <yhs@fb.com> Acked-by: Martin KaFai Lau <martin.lau@kernel.org> Link: https://lore.kernel.org/bpf/20220926184957.208194-2-kuifeng@fb.com
2022-09-26bpf: Enforce W^X for bpf trampolineSong Liu1-17/+5
Mark the trampoline as RO+X after arch_prepare_bpf_trampoline, so that the trampoine follows W^X rule strictly. This will turn off warnings like CPA refuse W^X violation: 8000000000000163 -> 0000000000000163 range: ... Also remove bpf_jit_alloc_exec_page(), since it is not used any more. Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20220926184739.3512547-3-song@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-09-26bpf: use bpf_prog_pack for bpf_dispatcherSong Liu2-8/+28
Allocate bpf_dispatcher with bpf_prog_pack_alloc so that bpf_dispatcher can share pages with bpf programs. arch_prepare_bpf_dispatcher() is updated to provide a RW buffer as working area for arch code to write to. This also fixes CPA W^X warnning like: CPA refuse W^X violation: 8000000000000163 -> 0000000000000163 range: ... Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20220926184739.3512547-2-song@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-09-26bpf: remove VMA linked listLiam R. Howlett1-5/+5
Use vma_next() and remove reference to the start of the linked list Link: https://lkml.kernel.org/r/20220906194824.2110408-51-Liam.Howlett@oracle.com Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com> Tested-by: Yu Zhao <yuzhao@google.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: David Hildenbrand <david@redhat.com> Cc: David Howells <dhowells@redhat.com> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org> Cc: SeongJae Park <sj@kernel.org> Cc: Sven Schnelle <svens@linux.ibm.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-09-21bpf: Tweak definition of KF_TRUSTED_ARGSKumar Kartikeya Dwivedi1-5/+13
Instead of forcing all arguments to be referenced pointers with non-zero reg->ref_obj_id, tweak the definition of KF_TRUSTED_ARGS to mean that only PTR_TO_BTF_ID (and socket types translated to PTR_TO_BTF_ID) have that constraint, and require their offset to be set to 0. The rest of pointer types are also accomodated in this definition of trusted pointers, but with more relaxed rules regarding offsets. The inherent meaning of setting this flag is that all kfunc pointer arguments have a guranteed lifetime, and kernel object pointers (PTR_TO_BTF_ID, PTR_TO_CTX) are passed in their unmodified form (with offset 0). In general, this is not true for PTR_TO_BTF_ID as it can be obtained using pointer walks. Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org> Link: https://lore.kernel.org/r/cdede0043c47ed7a357f0a915d16f9ce06a1d589.1663778601.git.lorenzo@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-09-21bpf: Always use raw spinlock for hash bucket lockHou Tao1-52/+14
For a non-preallocated hash map on RT kernel, regular spinlock instead of raw spinlock is used for bucket lock. The reason is that on RT kernel memory allocation is forbidden under atomic context and regular spinlock is sleepable under RT. Now hash map has been fully converted to use bpf_map_alloc, and there will be no synchronous memory allocation for non-preallocated hash map, so it is safe to always use raw spinlock for bucket lock on RT. So removing the usage of htab_use_raw_lock() and updating the comments accordingly. Signed-off-by: Hou Tao <houtao1@huawei.com> Link: https://lore.kernel.org/r/20220921073826.2365800-1-houtao@huaweicloud.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-09-21bpf: Prevent bpf program recursion for raw tracepoint probesJiri Olsa2-13/+13
We got report from sysbot [1] about warnings that were caused by bpf program attached to contention_begin raw tracepoint triggering the same tracepoint by using bpf_trace_printk helper that takes trace_printk_lock lock. Call Trace: <TASK> ? trace_event_raw_event_bpf_trace_printk+0x5f/0x90 bpf_trace_printk+0x2b/0xe0 bpf_prog_a9aec6167c091eef_prog+0x1f/0x24 bpf_trace_run2+0x26/0x90 native_queued_spin_lock_slowpath+0x1c6/0x2b0 _raw_spin_lock_irqsave+0x44/0x50 bpf_trace_printk+0x3f/0xe0 bpf_prog_a9aec6167c091eef_prog+0x1f/0x24 bpf_trace_run2+0x26/0x90 native_queued_spin_lock_slowpath+0x1c6/0x2b0 _raw_spin_lock_irqsave+0x44/0x50 bpf_trace_printk+0x3f/0xe0 bpf_prog_a9aec6167c091eef_prog+0x1f/0x24 bpf_trace_run2+0x26/0x90 native_queued_spin_lock_slowpath+0x1c6/0x2b0 _raw_spin_lock_irqsave+0x44/0x50 bpf_trace_printk+0x3f/0xe0 bpf_prog_a9aec6167c091eef_prog+0x1f/0x24 bpf_trace_run2+0x26/0x90 native_queued_spin_lock_slowpath+0x1c6/0x2b0 _raw_spin_lock_irqsave+0x44/0x50 __unfreeze_partials+0x5b/0x160 ... The can be reproduced by attaching bpf program as raw tracepoint on contention_begin tracepoint. The bpf prog calls bpf_trace_printk helper. Then by running perf bench the spin lock code is forced to take slow path and call contention_begin tracepoint. Fixing this by skipping execution of the bpf program if it's already running, Using bpf prog 'active' field, which is being currently used by trampoline programs for the same reason. Moving bpf_prog_inc_misses_counter to syscall.c because trampoline.c is compiled in just for CONFIG_BPF_JIT option. Reviewed-by: Stanislav Fomichev <sdf@google.com> Reported-by: syzbot+2251879aa068ad9c960d@syzkaller.appspotmail.com [1] https://lore.kernel.org/bpf/YxhFe3EwqchC%2FfYf@krava/T/#t Signed-off-by: Jiri Olsa <jolsa@kernel.org> Link: https://lore.kernel.org/r/20220916071914.7156-1-jolsa@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-09-21bpf: Export bpf_dynptr_get_size()Roberto Sassu1-1/+1
Export bpf_dynptr_get_size(), so that kernel code dealing with eBPF dynamic pointers can obtain the real size of data carried by this data structure. Signed-off-by: Roberto Sassu <roberto.sassu@huawei.com> Reviewed-by: Joanne Koong <joannelkoong@gmail.com> Acked-by: KP Singh <kpsingh@kernel.org> Acked-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Link: https://lore.kernel.org/r/20220920075951.929132-6-roberto.sassu@huaweicloud.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-09-21btf: Allow dynamic pointer parameters in kfuncsRoberto Sassu2-5/+38
Allow dynamic pointers (struct bpf_dynptr_kern *) to be specified as parameters in kfuncs. Also, ensure that dynamic pointers passed as argument are valid and initialized, are a pointer to the stack, and of the type local. More dynamic pointer types can be supported in the future. To properly detect whether a parameter is of the desired type, introduce the stringify_struct() macro to compare the returned structure name with the desired name. In addition, protect against structure renames, by halting the build with BUILD_BUG_ON(), so that developers have to revisit the code. To check if a dynamic pointer passed to the kfunc is valid and initialized, and if its type is local, export the existing functions is_dynptr_reg_valid_init() and is_dynptr_type_expected(). Cc: Joanne Koong <joannelkoong@gmail.com> Cc: Kumar Kartikeya Dwivedi <memxor@gmail.com> Signed-off-by: Roberto Sassu <roberto.sassu@huawei.com> Acked-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Link: https://lore.kernel.org/r/20220920075951.929132-5-roberto.sassu@huaweicloud.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-09-21bpf: Move dynptr type check to is_dynptr_type_expected()Roberto Sassu1-8/+27
Move dynptr type check to is_dynptr_type_expected() from is_dynptr_reg_valid_init(), so that callers can better determine the cause of a negative result (dynamic pointer not valid/initialized, dynamic pointer of the wrong type). It will be useful for example for BTF, to restrict which dynamic pointer types can be passed to kfuncs, as initially only the local type will be supported. Also, splitting makes the code more readable, since checking the dynamic pointer type is not necessarily related to validity and initialization. Split the validity/initialization and dynamic pointer type check also in the verifier, and adjust the expected error message in the test (a test for an unexpected dynptr type passed to a helper cannot be added due to missing suitable helpers, but this case has been tested manually). Cc: Joanne Koong <joannelkoong@gmail.com> Cc: Kumar Kartikeya Dwivedi <memxor@gmail.com> Signed-off-by: Roberto Sassu <roberto.sassu@huawei.com> Acked-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Link: https://lore.kernel.org/r/20220920075951.929132-4-roberto.sassu@huaweicloud.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-09-21btf: Export bpf_dynptr definitionRoberto Sassu1-0/+2
eBPF dynamic pointers is a new feature recently added to upstream. It binds together a pointer to a memory area and its size. The internal kernel structure bpf_dynptr_kern is not accessible by eBPF programs in user space. They instead see bpf_dynptr, which is then translated to the internal kernel structure by the eBPF verifier. The problem is that it is not possible to include at the same time the uapi include linux/bpf.h and the vmlinux BTF vmlinux.h, as they both contain the definition of some structures/enums. The compiler complains saying that the structures/enums are redefined. As bpf_dynptr is defined in the uapi include linux/bpf.h, this makes it impossible to include vmlinux.h. However, in some cases, e.g. when using kfuncs, vmlinux.h has to be included. The only option until now was to include vmlinux.h and add the definition of bpf_dynptr directly in the eBPF program source code from linux/bpf.h. Solve the problem by using the same approach as for bpf_timer (which also follows the same scheme with the _kern suffix for the internal kernel structure). Add the following line in one of the dynamic pointer helpers, bpf_dynptr_from_mem(): BTF_TYPE_EMIT(struct bpf_dynptr); Cc: stable@vger.kernel.org Cc: Joanne Koong <joannelkoong@gmail.com> Fixes: 97e03f521050c ("bpf: Add verifier support for dynptrs") Signed-off-by: Roberto Sassu <roberto.sassu@huawei.com> Acked-by: Yonghong Song <yhs@fb.com> Tested-by: KP Singh <kpsingh@kernel.org> Link: https://lore.kernel.org/r/20220920075951.929132-3-roberto.sassu@huaweicloud.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-09-21bpf: Allow kfuncs to be used in LSM programsKP Singh1-0/+1
In preparation for the addition of new kfuncs, allow kfuncs defined in the tracing subsystem to be used in LSM programs by mapping the LSM program type to the TRACING hook. Signed-off-by: KP Singh <kpsingh@kernel.org> Signed-off-by: Roberto Sassu <roberto.sassu@huawei.com> Acked-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Link: https://lore.kernel.org/r/20220920075951.929132-2-roberto.sassu@huaweicloud.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-09-21bpf: Add bpf_user_ringbuf_drain() helperDavid Vernet3-9/+235
In a prior change, we added a new BPF_MAP_TYPE_USER_RINGBUF map type which will allow user-space applications to publish messages to a ring buffer that is consumed by a BPF program in kernel-space. In order for this map-type to be useful, it will require a BPF helper function that BPF programs can invoke to drain samples from the ring buffer, and invoke callbacks on those samples. This change adds that capability via a new BPF helper function: bpf_user_ringbuf_drain(struct bpf_map *map, void *callback_fn, void *ctx, u64 flags) BPF programs may invoke this function to run callback_fn() on a series of samples in the ring buffer. callback_fn() has the following signature: long callback_fn(struct bpf_dynptr *dynptr, void *context); Samples are provided to the callback in the form of struct bpf_dynptr *'s, which the program can read using BPF helper functions for querying struct bpf_dynptr's. In order to support bpf_ringbuf_drain(), a new PTR_TO_DYNPTR register type is added to the verifier to reflect a dynptr that was allocated by a helper function and passed to a BPF program. Unlike PTR_TO_STACK dynptrs which are allocated on the stack by a BPF program, PTR_TO_DYNPTR dynptrs need not use reference tracking, as the BPF helper is trusted to properly free the dynptr before returning. The verifier currently only supports PTR_TO_DYNPTR registers that are also DYNPTR_TYPE_LOCAL. Note that while the corresponding user-space libbpf logic will be added in a subsequent patch, this patch does contain an implementation of the .map_poll() callback for BPF_MAP_TYPE_USER_RINGBUF maps. This .map_poll() callback guarantees that an epoll-waiting user-space producer will receive at least one event notification whenever at least one sample is drained in an invocation of bpf_user_ringbuf_drain(), provided that the function is not invoked with the BPF_RB_NO_WAKEUP flag. If the BPF_RB_FORCE_WAKEUP flag is provided, a wakeup notification is sent even if no sample was drained. Signed-off-by: David Vernet <void@manifault.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/bpf/20220920000100.477320-3-void@manifault.com
2022-09-21bpf: Define new BPF_MAP_TYPE_USER_RINGBUF map typeDavid Vernet2-6/+59
We want to support a ringbuf map type where samples are published from user-space, to be consumed by BPF programs. BPF currently supports a kernel -> user-space circular ring buffer via the BPF_MAP_TYPE_RINGBUF map type. We'll need to define a new map type for user-space -> kernel, as none of the helpers exported for BPF_MAP_TYPE_RINGBUF will apply to a user-space producer ring buffer, and we'll want to add one or more helper functions that would not apply for a kernel-producer ring buffer. This patch therefore adds a new BPF_MAP_TYPE_USER_RINGBUF map type definition. The map type is useless in its current form, as there is no way to access or use it for anything until we one or more BPF helpers. A follow-on patch will therefore add a new helper function that allows BPF programs to run callbacks on samples that are published to the ring buffer. Signed-off-by: David Vernet <void@manifault.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Acked-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/bpf/20220920000100.477320-2-void@manifault.com
2022-09-21bpf: Gate dynptr API behind CAP_BPFKumar Kartikeya Dwivedi1-14/+14
This has been enabled for unprivileged programs for only one kernel release, hence the expected annoyances due to this move are low. Users using ringbuf can stick to non-dynptr APIs. The actual use cases dynptr is meant to serve may not make sense in unprivileged BPF programs. Hence, gate these helpers behind CAP_BPF and limit use to privileged BPF programs. Fixes: 263ae152e962 ("bpf: Add bpf_dynptr_from_mem for local dynptrs") Fixes: bc34dee65a65 ("bpf: Dynptr support for ring buffers") Fixes: 13bbbfbea759 ("bpf: Add bpf_dynptr_read and bpf_dynptr_write") Fixes: 34d4ef5775f7 ("bpf: Add dynptr data slices") Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Link: https://lore.kernel.org/r/20220921143550.30247-1-memxor@gmail.com Acked-by: Andrii Nakryiko <andrii@kernel.org> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-09-21bpf, cgroup: Reject prog_attach_flags array when effective queryPu Lehui1-10/+18
Attach flags is only valid for attached progs of this layer cgroup, but not for effective progs. For querying with EFFECTIVE flags, exporting attach flags does not make sense. So when effective query, we reject prog_attach_flags array and don't need to populate it. Also we limit attach_flags to output 0 during effective query. Fixes: b79c9fc9551b ("bpf: implement BPF_PROG_QUERY for BPF_LSM_CGROUP") Signed-off-by: Pu Lehui <pulehui@huawei.com> Link: https://lore.kernel.org/r/20220921104604.2340580-2-pulehui@huaweicloud.com Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
2022-09-21bpf: simplify code in btf_parse_hdrWilliam Dean1-6/+1
It could directly return 'btf_check_sec_info' to simplify code. Signed-off-by: William Dean <williamsukatube@163.com> Acked-by: Yonghong Song <yhs@fb.com> Link: https://lore.kernel.org/r/20220917084248.3649-1-williamsukatube@163.com Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
2022-09-20bpf: Check whether or not node is NULL before free it in free_bulkHou Tao1-1/+2
llnode could be NULL if there are new allocations after the checking of c-free_cnt > c->high_watermark in bpf_mem_refill() and before the calling of __llist_del_first() in free_bulk (e.g. a PREEMPT_RT kernel or allocation in NMI context). And it will incur oops as shown below: BUG: kernel NULL pointer dereference, address: 0000000000000000 #PF: supervisor write access in kernel mode #PF: error_code(0x0002) - not-present page PGD 0 P4D 0 Oops: 0002 [#1] PREEMPT_RT SMP CPU: 39 PID: 373 Comm: irq_work/39 Tainted: G W 6.0.0-rc6-rt9+ #1 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996) RIP: 0010:bpf_mem_refill+0x66/0x130 ...... Call Trace: <TASK> irq_work_single+0x24/0x60 irq_work_run_list+0x24/0x30 run_irq_workd+0x18/0x20 smpboot_thread_fn+0x13f/0x2c0 kthread+0x121/0x140 ? kthread_complete_and_exit+0x20/0x20 ret_from_fork+0x1f/0x30 </TASK> Simply fixing it by checking whether or not llnode is NULL in free_bulk(). Fixes: 8d5a8011b35d ("bpf: Batch call_rcu callbacks instead of SLAB_TYPESAFE_BY_RCU.") Signed-off-by: Hou Tao <houtao1@huawei.com> Link: https://lore.kernel.org/r/20220919144811.3570825-1-houtao@huaweicloud.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-09-16bpf/btf: Use btf_type_str() whenever possiblePeilin Ye1-9/+8
We have btf_type_str(). Use it whenever possible in btf.c, instead of "btf_kind_str[BTF_INFO_KIND(t->info)]". Signed-off-by: Peilin Ye <peilin.ye@bytedance.com> Link: https://lore.kernel.org/r/20220916202800.31421-1-yepeilin.cs@gmail.com Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
2022-09-16bpf: use kvmemdup_bpfptr helperWang Yufen1-9/+4
Use kvmemdup_bpfptr helper instead of open-coding to simplify the code. Signed-off-by: Wang Yufen <wangyufen@huawei.com> Acked-by: Stanislav Fomichev <sdf@google.com> Link: https://lore.kernel.org/r/1663058433-14089-1-git-send-email-wangyufen@huawei.com Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
2022-09-16bpf: Ensure correct locking around vulnerable function find_vpid()Lee Jones1-0/+2
The documentation for find_vpid() clearly states: "Must be called with the tasklist_lock or rcu_read_lock() held." Presently we do neither for find_vpid() instance in bpf_task_fd_query(). Add proper rcu_read_lock/unlock() to fix the issue. Fixes: 41bdc4b40ed6f ("bpf: introduce bpf subcommand BPF_TASK_FD_QUERY") Signed-off-by: Lee Jones <lee@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Yonghong Song <yhs@fb.com> Link: https://lore.kernel.org/bpf/20220912133855.1218900-1-lee@kernel.org
2022-09-15bpf: Add verifier check for BPF_PTR_POISON retval and argDave Marchevsky2-10/+26
BPF_PTR_POISON was added in commit c0a5a21c25f37 ("bpf: Allow storing referenced kptr in map") to denote a bpf_func_proto btf_id which the verifier will replace with a dynamically-determined btf_id at verification time. This patch adds verifier 'poison' functionality to BPF_PTR_POISON in order to prepare for expanded use of the value to poison ret- and arg-btf_id in ongoing work, namely rbtree and linked list patchsets [0, 1]. Specifically, when the verifier checks helper calls, it assumes that BPF_PTR_POISON'ed ret type will be replaced with a valid type before - or in lieu of - the default ret_btf_id logic. Similarly for arg btf_id. If poisoned btf_id reaches default handling block for either, consider this a verifier internal error and fail verification. Otherwise a helper w/ poisoned btf_id but no verifier logic replacing the type will cause a crash as the invalid pointer is dereferenced. Also move BPF_PTR_POISON to existing include/linux/posion.h header and remove unnecessary shift. [0]: lore.kernel.org/bpf/20220830172759.4069786-1-davemarchevsky@fb.com [1]: lore.kernel.org/bpf/20220904204145.3089-1-memxor@gmail.com Signed-off-by: Dave Marchevsky <davemarchevsky@fb.com> Acked-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Link: https://lore.kernel.org/r/20220912154544.1398199-1-davemarchevsky@fb.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-09-13perf/bpf: Always use perf callchains if existNamhyung Kim1-2/+2
If the perf_event has PERF_SAMPLE_CALLCHAIN, BPF can use it for stack trace. The problematic cases like PEBS and IBS already handled in the PMU driver and they filled the callchain info in the sample data. For others, we can call perf_callchain() before the BPF handler. Signed-off-by: Namhyung Kim <namhyung@kernel.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/r/20220908214104.3851807-2-namhyung@kernel.org
2022-09-10bpf: Add verifier support for custom callback return rangeDave Marchevsky1-1/+6
Verifier logic to confirm that a callback function returns 0 or 1 was added in commit 69c087ba6225b ("bpf: Add bpf_for_each_map_elem() helper"). At the time, callback return value was only used to continue or stop iteration. In order to support callbacks with a broader return value range, such as those added in rbtree series[0] and others, add a callback_ret_range to bpf_func_state. Verifier's helpers which set in_callback_fn will also set the new field, which the verifier will later use to check return value bounds. Default to tnum_range(0, 0) instead of using tnum_unknown as a sentinel value as the latter would prevent the valid range (0, U64_MAX) being used. Previous global default tnum_range(0, 1) is explicitly set for extant callback helpers. The change to global default was made after discussion around this patch in rbtree series [1], goal here is to make it more obvious that callback_ret_range should be explicitly set. [0]: lore.kernel.org/bpf/20220830172759.4069786-1-davemarchevsky@fb.com/ [1]: lore.kernel.org/bpf/20220830172759.4069786-2-davemarchevsky@fb.com/ Signed-off-by: Dave Marchevsky <davemarchevsky@fb.com> Reviewed-by: Stanislav Fomichev <sdf@google.com> Link: https://lore.kernel.org/r/20220908230716.2751723-1-davemarchevsky@fb.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-09-10bpf: btf: fix truncated last_member_type_id in btf_struct_resolveLorenz Bauer1-1/+1
When trying to finish resolving a struct member, btf_struct_resolve saves the member type id in a u16 temporary variable. This truncates the 32 bit type id value if it exceeds UINT16_MAX. As a result, structs that have members with type ids > UINT16_MAX and which need resolution will fail with a message like this: [67414] STRUCT ff_device size=120 vlen=12 effect_owners type_id=67434 bits_offset=960 Member exceeds struct_size Fix this by changing the type of last_member_type_id to u32. Fixes: a0791f0df7d2 ("bpf: fix BTF limits") Reviewed-by: Stanislav Fomichev <sdf@google.com> Signed-off-by: Lorenz Bauer <oss@lmb.io> Link: https://lore.kernel.org/r/20220910110120.339242-1-oss@lmb.io Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-09-10bpf: Export btf_type_by_id() and bpf_log()Daniel Xu2-0/+2
These symbols will be used in nf_conntrack.ko to support direct writes to `nf_conn`. Signed-off-by: Daniel Xu <dxu@dxuuu.xyz> Link: https://lore.kernel.org/r/3c98c19dc50d3b18ea5eca135b4fc3a5db036060.1662568410.git.dxu@dxuuu.xyz Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-09-10bpf: Remove duplicate PTR_TO_BTF_ID RO checkDaniel Xu1-3/+0
Since commit 27ae7997a661 ("bpf: Introduce BPF_PROG_TYPE_STRUCT_OPS") there has existed bpf_verifier_ops:btf_struct_access. When btf_struct_access is _unset_ for a prog type, the verifier runs the default implementation, which is to enforce read only: if (env->ops->btf_struct_access) { [...] } else { if (atype != BPF_READ) { verbose(env, "only read is supported\n"); return -EACCES; } [...] } When btf_struct_access is _set_, the expectation is that btf_struct_access has full control over accesses, including if writes are allowed. Rather than carve out an exception for each prog type that may write to BTF ptrs, delete the redundant check and give full control to btf_struct_access. Signed-off-by: Daniel Xu <dxu@dxuuu.xyz> Acked-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Link: https://lore.kernel.org/r/962da2bff1238746589e332ff1aecc49403cd7ce.1662568410.git.dxu@dxuuu.xyz Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-09-10bpf: Simplify code by using for_each_cpu_wrap()Punit Agrawal1-32/+16
In the percpu freelist code, it is a common pattern to iterate over the possible CPUs mask starting with the current CPU. The pattern is implemented using a hand rolled while loop with the loop variable increment being open-coded. Simplify the code by using for_each_cpu_wrap() helper to iterate over the possible cpus starting with the current CPU. As a result, some of the special-casing in the loop also gets simplified. No functional change intended. Signed-off-by: Punit Agrawal <punit.agrawal@bytedance.com> Acked-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20220907155746.1750329-1-punit.agrawal@bytedance.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-09-10bpf: add missing percpu_counter_destroy() in htab_map_alloc()Tetsuo Handa1-0/+2
syzbot is reporting ODEBUG bug in htab_map_alloc() [1], for commit 86fe28f7692d96d2 ("bpf: Optimize element count in non-preallocated hash map.") added percpu_counter_init() to htab_map_alloc() but forgot to add percpu_counter_destroy() to the error path. Link: https://syzkaller.appspot.com/bug?extid=5d1da78b375c3b5e6c2b [1] Reported-by: syzbot <syzbot+5d1da78b375c3b5e6c2b@syzkaller.appspotmail.com> Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Fixes: 86fe28f7692d96d2 ("bpf: Optimize element count in non-preallocated hash map.") Reviewed-by: Stanislav Fomichev <sdf@google.com> Link: https://lore.kernel.org/r/e2e4cc0e-9d36-4ca1-9bfa-ce23e6f8310b@I-love.SAKURA.ne.jp Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-09-07bpf: Add helper macro bpf_for_each_reg_in_vstateKumar Kartikeya Dwivedi1-107/+28
For a lot of use cases in future patches, we will want to modify the state of registers part of some same 'group' (e.g. same ref_obj_id). It won't just be limited to releasing reference state, but setting a type flag dynamically based on certain actions, etc. Hence, we need a way to easily pass a callback to the function that iterates over all registers in current bpf_verifier_state in all frames upto (and including) the curframe. While in C++ we would be able to easily use a lambda to pass state and the callback together, sadly we aren't using C++ in the kernel. The next best thing to avoid defining a function for each case seems like statement expressions in GNU C. The kernel already uses them heavily, hence they can passed to the macro in the style of a lambda. The statement expression will then be substituted in the for loop bodies. Variables __state and __reg are set to current bpf_func_state and reg for each invocation of the expression inside the passed in verifier state. Then, convert mark_ptr_or_null_regs, clear_all_pkt_pointers, release_reference, find_good_pkt_pointers, find_equal_scalars to use bpf_for_each_reg_in_vstate. Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Link: https://lore.kernel.org/r/20220904204145.3089-16-memxor@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-09-07bpf: Support kptrs in percpu arraymapKumar Kartikeya Dwivedi2-10/+26
Enable support for kptrs in percpu BPF arraymap by wiring up the freeing of these kptrs from percpu map elements. Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Link: https://lore.kernel.org/r/20220904204145.3089-3-memxor@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>