summaryrefslogtreecommitdiffstats
AgeCommit message (Collapse)AuthorFilesLines
2022-08-24fec: Restart PPS after link state changeCsókás Bence3-4/+77
On link state change, the controller gets reset, causing PPS to drop out and the PHC to lose its time and calibration. So we restart it if needed, restoring calibration and time registers. Changes since v2: * Add `fec_ptp_save_state()`/`fec_ptp_restore_state()` * Use `ktime_get_real_ns()` * Use `BIT()` macro Changes since v1: * More ECR #define's * Stop PPS in `fec_ptp_stop()` Signed-off-by: Csókás Bence <csokas.bence@prolan.hu> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-08-24net: neigh: don't call kfree_skb() under spin_lock_irqsave()Yang Yingliang1-3/+9
It is not allowed to call kfree_skb() from hardware interrupt context or with interrupts being disabled. So add all skb to a tmp list, then free them after spin_unlock_irqrestore() at once. Fixes: 66ba215cb513 ("neigh: fix possible DoS due to net iface start/stop loop") Suggested-by: Denis V. Lunev <den@openvz.org> Signed-off-by: Yang Yingliang <yangyingliang@huawei.com> Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-08-24netfilter: nf_defrag_ipv6: allow nf_conntrack_frag6_high_thresh increasesEric Dumazet1-1/+0
Currently, net.netfilter.nf_conntrack_frag6_high_thresh can only be lowered. I found this issue while investigating a probable kernel issue causing flakes in tools/testing/selftests/net/ip_defrag.sh In particular, these sysctl changes were ignored: ip netns exec "${NETNS}" sysctl -w net.netfilter.nf_conntrack_frag6_high_thresh=9000000 >/dev/null 2>&1 ip netns exec "${NETNS}" sysctl -w net.netfilter.nf_conntrack_frag6_low_thresh=7000000 >/dev/null 2>&1 This change is inline with commit 836196239298 ("net/ipfrag: let ip[6]frag_high_thresh in ns be higher than in init_net") Fixes: 8db3d41569bb ("netfilter: nf_defrag_ipv6: use net_generic infra") Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2022-08-24netfilter: flowtable: fix stuck flows on cleanup due to pending workPablo Neira Ayuso3-4/+13
To clear the flow table on flow table free, the following sequence normally happens in order: 1) gc_step work is stopped to disable any further stats/del requests. 2) All flow table entries are set to teardown state. 3) Run gc_step which will queue HW del work for each flow table entry. 4) Waiting for the above del work to finish (flush). 5) Run gc_step again, deleting all entries from the flow table. 6) Flow table is freed. But if a flow table entry already has pending HW stats or HW add work step 3 will not queue HW del work (it will be skipped), step 4 will wait for the pending add/stats to finish, and step 5 will queue HW del work which might execute after freeing of the flow table. To fix the above, this patch flushes the pending work, then it sets the teardown flag to all flows in the flowtable and it forces a garbage collector run to queue work to remove the flows from hardware, then it flushes this new pending work and (finally) it forces another garbage collector run to remove the entry from the software flowtable. Stack trace: [47773.882335] BUG: KASAN: use-after-free in down_read+0x99/0x460 [47773.883634] Write of size 8 at addr ffff888103b45aa8 by task kworker/u20:6/543704 [47773.885634] CPU: 3 PID: 543704 Comm: kworker/u20:6 Not tainted 5.12.0-rc7+ #2 [47773.886745] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009) [47773.888438] Workqueue: nf_ft_offload_del flow_offload_work_handler [nf_flow_table] [47773.889727] Call Trace: [47773.890214] dump_stack+0xbb/0x107 [47773.890818] print_address_description.constprop.0+0x18/0x140 [47773.892990] kasan_report.cold+0x7c/0xd8 [47773.894459] kasan_check_range+0x145/0x1a0 [47773.895174] down_read+0x99/0x460 [47773.899706] nf_flow_offload_tuple+0x24f/0x3c0 [nf_flow_table] [47773.907137] flow_offload_work_handler+0x72d/0xbe0 [nf_flow_table] [47773.913372] process_one_work+0x8ac/0x14e0 [47773.921325] [47773.921325] Allocated by task 592159: [47773.922031] kasan_save_stack+0x1b/0x40 [47773.922730] __kasan_kmalloc+0x7a/0x90 [47773.923411] tcf_ct_flow_table_get+0x3cb/0x1230 [act_ct] [47773.924363] tcf_ct_init+0x71c/0x1156 [act_ct] [47773.925207] tcf_action_init_1+0x45b/0x700 [47773.925987] tcf_action_init+0x453/0x6b0 [47773.926692] tcf_exts_validate+0x3d0/0x600 [47773.927419] fl_change+0x757/0x4a51 [cls_flower] [47773.928227] tc_new_tfilter+0x89a/0x2070 [47773.936652] [47773.936652] Freed by task 543704: [47773.937303] kasan_save_stack+0x1b/0x40 [47773.938039] kasan_set_track+0x1c/0x30 [47773.938731] kasan_set_free_info+0x20/0x30 [47773.939467] __kasan_slab_free+0xe7/0x120 [47773.940194] slab_free_freelist_hook+0x86/0x190 [47773.941038] kfree+0xce/0x3a0 [47773.941644] tcf_ct_flow_table_cleanup_work Original patch description and stack trace by Paul Blakey. Fixes: c29f74e0df7a ("netfilter: nf_flow_table: hardware offload support") Reported-by: Paul Blakey <paulb@nvidia.com> Tested-by: Paul Blakey <paulb@nvidia.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2022-08-24netfilter: flowtable: add function to invoke garbage collection immediatelyPablo Neira Ayuso2-3/+10
Expose nf_flow_table_gc_run() to force a garbage collector run from the offload infrastructure. Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2022-08-24netfilter: nf_tables: disallow binding to already bound chainPablo Neira Ayuso1-0/+2
Update nft_data_init() to report EINVAL if chain is already bound. Fixes: d0e2c7de92c7 ("netfilter: nf_tables: add NFT_CHAIN_BINDING") Reported-by: Gwangun Jung <exsociety@gmail.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2022-08-24netfilter: nft_tunnel: restrict it to netdev familyPablo Neira Ayuso1-0/+1
Only allow to use this expression from NFPROTO_NETDEV family. Fixes: af308b94a2a4 ("netfilter: nf_tables: add tunnel support") Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2022-08-24netfilter: nft_osf: restrict osf to ipv4, ipv6 and inet familiesPablo Neira Ayuso1-3/+15
As it was originally intended, restrict extension to supported families. Fixes: b96af92d6eaf ("netfilter: nf_tables: implement Passive OS fingerprint module in nft_osf") Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2022-08-24netfilter: nf_tables: do not leave chain stats enabled on errorPablo Neira Ayuso1-2/+4
Error might occur later in the nf_tables_addchain() codepath, enable static key only after transaction has been created. Fixes: 9f08ea848117 ("netfilter: nf_tables: keep chain counters away from hot path") Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2022-08-24netfilter: nft_payload: do not truncate csum_offset and csum_typePablo Neira Ayuso1-6/+13
Instead report ERANGE if csum_offset is too long, and EOPNOTSUPP if type is not support. Fixes: 7ec3f7b47b8d ("netfilter: nft_payload: add packet mangling support") Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2022-08-24netfilter: nft_payload: report ERANGE for too long offset and lengthPablo Neira Ayuso1-2/+8
Instead of offset and length are truncation to u8, report ERANGE. Fixes: 96518518cc41 ("netfilter: add nftables") Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2022-08-24netfilter: nf_tables: make table handle allocation per-netns friendlyPablo Neira Ayuso2-2/+2
mutex is per-netns, move table_netns to the pernet area. *read-write* to 0xffffffff883a01e8 of 8 bytes by task 6542 on cpu 0: nf_tables_newtable+0x6dc/0xc00 net/netfilter/nf_tables_api.c:1221 nfnetlink_rcv_batch net/netfilter/nfnetlink.c:513 [inline] nfnetlink_rcv_skb_batch net/netfilter/nfnetlink.c:634 [inline] nfnetlink_rcv+0xa6a/0x13a0 net/netfilter/nfnetlink.c:652 netlink_unicast_kernel net/netlink/af_netlink.c:1319 [inline] netlink_unicast+0x652/0x730 net/netlink/af_netlink.c:1345 netlink_sendmsg+0x643/0x740 net/netlink/af_netlink.c:1921 Fixes: f102d66b335a ("netfilter: nf_tables: use dedicated mutex to guard transactions") Reported-by: Abhishek Shah <abhishek.shah@columbia.edu> Reviewed-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2022-08-24netfilter: nf_tables: disallow updates of implicit chainPablo Neira Ayuso1-0/+3
Updates on existing implicit chain make no sense, disallow this. Fixes: d0e2c7de92c7 ("netfilter: nf_tables: add NFT_CHAIN_BINDING") Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2022-08-23Merge tag 'cgroup-for-6.0-rc2-fixes' of ↵Linus Torvalds5-39/+61
git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup Pull cgroup fixes from Tejun Heo: - The psi data structure was changed to be allocated dynamically but it wasn't being cleared leading to it reporting garbage values and triggering spurious oom kills. - A deadlock involving cpuset and cpu hotplug. - When a controller is moved across cgroup hierarchies, css->rstat_css_node didn't get RCU drained properly from the previous list. * tag 'cgroup-for-6.0-rc2-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: cgroup: Fix race condition at rebind_subsystems() cgroup: Fix threadgroup_rwsem <-> cpus_read_lock() deadlock sched/psi: Remove redundant cgroup_psi() when !CONFIG_CGROUPS sched/psi: Remove unused parameter nbytes of psi_trigger_create() sched/psi: Zero the memory of struct psi_group
2022-08-23Merge tag 'audit-pr-20220823' of ↵Linus Torvalds1-0/+1
git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/audit Pull audit fix from Paul Moore: "A single fix for a potential double-free on a fsnotify error path" * tag 'audit-pr-20220823' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/audit: audit: fix potential double free on error path from fsnotify_add_inode_mark
2022-08-23Merge tag 'fs.fixes.v6.0-rc3' of ↵Linus Torvalds1-6/+8
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/idmapping Pull file_remove_privs() fix from Christian Brauner: "As part of Stefan's and Jens' work to add async buffered write support to xfs we refactored file_remove_privs() and added __file_remove_privs() to avoid calling __remove_privs() when IOCB_NOWAIT is passed. While debugging a recent performance regression report I found that during review we missed that commit faf99b563558 ("fs: add __remove_file_privs() with flags parameter") accidently changed behavior when dentry_needs_remove_privs() returns zero. Before the commit it would still call inode_has_no_xattr() setting the S_NOSEC bit and thereby avoiding even calling into dentry_needs_remove_privs() the next time this function is called. After that commit inode_has_no_xattr() would only be called if __remove_privs() had to be called. Restore the old behavior. This is likely the cause of the performance regression" * tag 'fs.fixes.v6.0-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/idmapping: fs: __file_remove_privs(): restore call to inode_has_no_xattr()
2022-08-23Merge tag 'mlx5-fixes-2022-08-22' of ↵Jakub Kicinski11-42/+65
git://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux Saeed Mahameed says: ==================== mlx5 fixes 2022-08-22 This series provides bug fixes to mlx5 driver. * tag 'mlx5-fixes-2022-08-22' of git://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux: net/mlx5: Unlock on error in mlx5_sriov_enable() net/mlx5e: Fix use after free in mlx5e_fs_init() net/mlx5e: kTLS, Use _safe() iterator in mlx5e_tls_priv_tx_list_cleanup() net/mlx5: unlock on error path in esw_vfs_changed_event_handler() net/mlx5e: Fix wrong tc flag used when set hw-tc-offload off net/mlx5e: TC, Add missing policer validation net/mlx5e: Fix wrong application of the LRO state net/mlx5: Avoid false positive lockdep warning by adding lock_class_key net/mlx5: Fix cmd error logging for manage pages cmd net/mlx5: Disable irq when locking lag_lock net/mlx5: Eswitch, Fix forwarding decision to uplink net/mlx5: LAG, fix logic over MLX5_LAG_FLAG_NDEVS_READY net/mlx5e: Properly disable vlan strip on non-UL reps ==================== Link: https://lore.kernel.org/r/20220822195917.216025-1-saeed@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-08-23Merge branch '100GbE' of ↵Jakub Kicinski4-29/+54
git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/net-queue Tony Nguyen says: ==================== ice: xsk: reduced queue count fixes Maciej Fijalkowski says: this small series is supposed to fix the issues around AF_XDP usage with reduced queue count on interface. Due to the XDP rings setup, some configurations can result in sockets not seeing traffic flowing. More about this in description of patch 2. * '100GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/net-queue: ice: xsk: use Rx ring's XDP ring when picking NAPI context ice: xsk: prohibit usage of non-balanced queue id ==================== Link: https://lore.kernel.org/r/20220822163257.2382487-1-anthony.l.nguyen@intel.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-08-23Merge branch 'linus'Andrew Morton10-31/+68
2022-08-23Merge branch 'bnxt_en-bug-fixes'Jakub Kicinski5-7/+12
Michael Chan says: ==================== bnxt_en: Bug fixes This series includes 2 fixes for regressions introduced by the XDP multi-buffer feature, 1 devlink reload bug fix, and 1 SRIOV resource accounting bug fix. ==================== Link: https://lore.kernel.org/r/1661180814-19350-1-git-send-email-michael.chan@broadcom.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-08-23bnxt_en: fix LRO/GRO_HW features in ndo_fix_features callbackVikas Gupta1-4/+1
LRO/GRO_HW should be disabled if there is an attached XDP program. BNXT_FLAG_TPA is the current setting of the LRO/GRO_HW. Using BNXT_FLAG_TPA to disable LRO/GRO_HW will cause these features to be permanently disabled once they are disabled. Fixes: 1dc4c557bfed ("bnxt: adding bnxt_xdp_build_skb to build skb from multibuffer xdp_buff") Signed-off-by: Vikas Gupta <vikas.gupta@broadcom.com> Signed-off-by: Michael Chan <michael.chan@broadcom.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-08-23bnxt_en: fix NQ resource accounting during vf creation on 57500 chipsVikas Gupta1-1/+1
There are 2 issues: 1. We should decrement hw_resc->max_nqs instead of hw_resc->max_irqs with the number of NQs assigned to the VFs. The IRQs are fixed on each function and cannot be re-assigned. Only the NQs are being assigned to the VFs. 2. vf_msix is the total number of NQs to be assigned to the VFs. So we should decrement vf_msix from hw_resc->max_nqs. Fixes: b16b68918674 ("bnxt_en: Add SR-IOV support for 57500 chips.") Signed-off-by: Vikas Gupta <vikas.gupta@broadcom.com> Signed-off-by: Michael Chan <michael.chan@broadcom.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-08-23bnxt_en: set missing reload flag in devlink featuresVikas Gupta1-0/+1
Add missing devlink_set_features() API for callbacks reload_down and reload_up to function. Fixes: 228ea8c187d8 ("bnxt_en: implement devlink dev reload driver_reinit") Reviewed-by: Somnath Kotur <somnath.kotur@broadcom.com> Signed-off-by: Vikas Gupta <vikas.gupta@broadcom.com> Signed-off-by: Michael Chan <michael.chan@broadcom.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-08-23bnxt_en: Use PAGE_SIZE to init buffer when multi buffer XDP is not in usePavan Chebbi2-2/+9
Using BNXT_PAGE_MODE_BUF_SIZE + offset as buffer length value is not sufficient when running single buffer XDP programs doing redirect operations. The stack will complain on missing skb tail room. Fix it by using PAGE_SIZE when calling xdp_init_buff() for single buffer programs. Fixes: b231c3f3414c ("bnxt: refactor bnxt_rx_xdp to separate xdp_init_buff/xdp_prepare_buff") Reviewed-by: Somnath Kotur <somnath.kotur@broadcom.com> Signed-off-by: Pavan Chebbi <pavan.chebbi@broadcom.com> Signed-off-by: Michael Chan <michael.chan@broadcom.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-08-23net: dsa: microchip: make learning configurable and keep it off while standaloneVladimir Oltean2-1/+45
Address learning should initially be turned off by the driver for port operation in standalone mode, then the DSA core handles changes to it via ds->ops->port_bridge_flags(). Leaving address learning enabled while ports are standalone breaks any kind of communication which involves port B receiving what port A has sent. Notably it breaks the ksz9477 driver used with a (non offloaded, ports act as if standalone) bonding interface in active-backup mode, when the ports are connected together through external switches, for redundancy purposes. This fixes a major design flaw in the ksz9477 and ksz8795 drivers, which unconditionally leave address learning enabled even while ports operate as standalone. Fixes: b987e98e50ab ("dsa: add DSA switch driver for Microchip KSZ9477") Link: https://lore.kernel.org/netdev/CAFZh4h-JVWt80CrQWkFji7tZJahMfOToUJQgKS5s0_=9zzpvYQ@mail.gmail.com/ Reported-by: Brian Hutchinson <b.hutchman@gmail.com> Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Link: https://lore.kernel.org/r/20220818164809.3198039-1-vladimir.oltean@nxp.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-08-23Merge tag 'parisc-for-6.0-2' of ↵Linus Torvalds6-30/+59
git://git.kernel.org/pub/scm/linux/kernel/git/deller/parisc-linux Pull parisc fixes from Helge Deller: "Some interesting background to the current patchset: It turned out that the fldw instruction (which loads a 32-bit word from memory into one half of a FP register) failed on unaligned addresses and even trashed some other random FP register instead. It's a trivial one-liner fix in the exception handler but this failure dates back to the very beginnings of the parisc-port. It's strange that it was never noticed before. Another patch fixes an annoyance noticed by Randy Dunlap. Running "make ARCH=parisc64 randconfig" always returned a 32-bit config, although one would expect a 64-bit config. Masahiro Yamada suggested to mimik sparc Kconfig code, which fixed the issue nicely. This allowed to drop some compiler build checks too. Third, it's possible to build an optimized 32-bit kernel for PA8X00 (64-bit) CPUs, which then wouldn't start on 32-bit-only (PA1.x) machines. I've added a bootup check which prevents that and which prints a message to the console. This can be tested with qemu, which currently only supports 32-bit emulation. The other patches are usual clean-up stuff like added return value checks and typo fixes in comments. Summary: - Fix emulation of fldw instruction on unaligned addresses - Fix "make ARCH=parisc64 randconfig" to return a 64-bit config - Prevent boot if trying to boot a 32-bit kernel compiled for PA8X00 CPUs on 32-bit only machines - ccio-dma: Handle kmalloc failure in ccio_init_resources()" * tag 'parisc-for-6.0-2' of git://git.kernel.org/pub/scm/linux/kernel/git/deller/parisc-linux: parisc: Add runtime check to prevent PA2.0 kernels on PA1.x machines parisc: ccio-dma: Handle kmalloc failure in ccio_init_resources() parisc: led: Move from strlcpy with unused retval to strscpy parisc: ccio-dma: Fix typo in comment Revert "parisc: Show error if wrong 32/64-bit compiler is being used" parisc: Make CONFIG_64BIT available for ARCH=parisc64 only parisc: Fix exception handler for fldw and fstw instructions
2022-08-23Merge tag 'mm-hotfixes-stable-2022-08-22' of ↵Linus Torvalds18-119/+240
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull misc fixes from Andrew Morton: "Thirteen fixes, almost all for MM. Seven of these are cc:stable and the remainder fix up the changes which went into this -rc cycle" * tag 'mm-hotfixes-stable-2022-08-22' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: kprobes: don't call disarm_kprobe() for disabled kprobes mm/shmem: shmem_replace_page() remember NR_SHMEM mm/shmem: tmpfs fallocate use file_modified() mm/shmem: fix chattr fsflags support in tmpfs mm/hugetlb: support write-faults in shared mappings mm/hugetlb: fix hugetlb not supporting softdirty tracking mm/uffd: reset write protection when unregister with wp-mode mm/smaps: don't access young/dirty bit if pte unpresent mm: add DEVICE_ZONE to FOR_ALL_ZONES kernel/sys_ni: add compat entry for fadvise64_64 mm/gup: fix FOLL_FORCE COW security issue and remove FOLL_COW Revert "zram: remove double compression logic" get_maintainer: add Alan to .get_maintainer.ignore
2022-08-23Merge tag 'linux-kselftest-kunit-fixes-6.0-rc3' of ↵Linus Torvalds2-1/+2
git://git.kernel.org/pub/scm/linux/kernel/git/shuah/linux-kselftest Pull KUnit fixes from Shuah Khan: "Fix for a mmc test and to load .kunit_test_suites section when CONFIG_KUNIT=m, and not just when KUnit is built-in" * tag 'linux-kselftest-kunit-fixes-6.0-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/shuah/linux-kselftest: module: kunit: Load .kunit_test_suites section when CONFIG_KUNIT=m mmc: sdhci-of-aspeed: test: Fix dependencies when KUNIT=m
2022-08-23Merge tag 'linux-kselftest-fixes-6.0-rc3' of ↵Linus Torvalds2-0/+7
git://git.kernel.org/pub/scm/linux/kernel/git/shuah/linux-kselftest Pull Kselftest fixes from Shuah Khan: "Fixes to vm and sgx test builds" * tag 'linux-kselftest-fixes-6.0-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/shuah/linux-kselftest: selftests/vm: fix inability to build any vm tests selftests/sgx: Ignore OpenSSL 3.0 deprecated functions warning
2022-08-23netfilter: nft_tproxy: restrict to prerouting hookFlorian Westphal1-0/+8
TPROXY is only allowed from prerouting, but nft_tproxy doesn't check this. This fixes a crash (null dereference) when using tproxy from e.g. output. Fixes: 4ed8eb6570a4 ("netfilter: nf_tables: Add native tproxy support") Reported-by: Shell Chen <xierch@gmail.com> Signed-off-by: Florian Westphal <fw@strlen.de>
2022-08-23cgroup: Fix race condition at rebind_subsystems()Jing-Ting Wu1-0/+1
Root cause: The rebind_subsystems() is no lock held when move css object from A list to B list,then let B's head be treated as css node at list_for_each_entry_rcu(). Solution: Add grace period before invalidating the removed rstat_css_node. Reported-by: Jing-Ting Wu <jing-ting.wu@mediatek.com> Suggested-by: Michal Koutný <mkoutny@suse.com> Signed-off-by: Jing-Ting Wu <jing-ting.wu@mediatek.com> Tested-by: Jing-Ting Wu <jing-ting.wu@mediatek.com> Link: https://lore.kernel.org/linux-arm-kernel/d8f0bc5e2fb6ed259f9334c83279b4c011283c41.camel@mediatek.com/T/ Acked-by: Mukesh Ojha <quic_mojha@quicinc.com> Fixes: a7df69b81aac ("cgroup: rstat: support cgroup1") Cc: stable@vger.kernel.org # v5.13+ Signed-off-by: Tejun Heo <tj@kernel.org>
2022-08-23netfilter: conntrack: work around exceeded receive windowFlorian Westphal1-0/+31
When a TCP sends more bytes than allowed by the receive window, all future packets can be marked as invalid. This can clog up the conntrack table because of 5-day default timeout. Sequence of packets: 01 initiator > responder: [S], seq 171, win 5840, options [mss 1330,sackOK,TS val 63 ecr 0,nop,wscale 1] 02 responder > initiator: [S.], seq 33211, ack 172, win 65535, options [mss 1460,sackOK,TS val 010 ecr 63,nop,wscale 8] 03 initiator > responder: [.], ack 33212, win 2920, options [nop,nop,TS val 068 ecr 010], length 0 04 initiator > responder: [P.], seq 172:240, ack 33212, win 2920, options [nop,nop,TS val 279 ecr 010], length 68 Window is 5840 starting from 33212 -> 39052. 05 responder > initiator: [.], ack 240, win 256, options [nop,nop,TS val 872 ecr 279], length 0 06 responder > initiator: [.], seq 33212:34530, ack 240, win 256, options [nop,nop,TS val 892 ecr 279], length 1318 This is fine, conntrack will flag the connection as having outstanding data (UNACKED), which lowers the conntrack timeout to 300s. 07 responder > initiator: [.], seq 34530:35848, ack 240, win 256, options [nop,nop,TS val 892 ecr 279], length 1318 08 responder > initiator: [.], seq 35848:37166, ack 240, win 256, options [nop,nop,TS val 892 ecr 279], length 1318 09 responder > initiator: [.], seq 37166:38484, ack 240, win 256, options [nop,nop,TS val 892 ecr 279], length 1318 10 responder > initiator: [.], seq 38484:39802, ack 240, win 256, options [nop,nop,TS val 892 ecr 279], length 1318 Packet 10 is already sending more than permitted, but conntrack doesn't validate this (only seq is tested vs. maxend, not 'seq+len'). 38484 is acceptable, but only up to 39052, so this packet should not have been sent (or only 568 bytes, not 1318). At this point, connection is still in '300s' mode. Next packet however will get flagged: 11 responder > initiator: [P.], seq 39802:40128, ack 240, win 256, options [nop,nop,TS val 892 ecr 279], length 326 nf_ct_proto_6: SEQ is over the upper bound (over the window of the receiver) .. LEN=378 .. SEQ=39802 ACK=240 ACK PSH .. Now, a couple of replies/acks comes in: 12 initiator > responder: [.], ack 34530, win 4368, [.. irrelevant acks removed ] 16 initiator > responder: [.], ack 39802, win 8712, options [nop,nop,TS val 296201291 ecr 2982371892], length 0 This ack is significant -- this acks the last packet send by the responder that conntrack considered valid. This means that ack == td_end. This will withdraw the 'unacked data' flag, the connection moves back to the 5-day timeout of established conntracks. 17 initiator > responder: ack 40128, win 10030, ... This packet is also flagged as invalid. Because conntrack only updates state based on packets that are considered valid, packet 11 'did not exist' and that gets us: nf_ct_proto_6: ACK is over upper bound 39803 (ACKed data not seen yet) .. SEQ=240 ACK=40128 WINDOW=10030 RES=0x00 ACK URG Because this received and processed by the endpoints, the conntrack entry remains in a bad state, no packets will ever be considered valid again: 30 responder > initiator: [F.], seq 40432, ack 2045, win 391, .. 31 initiator > responder: [.], ack 40433, win 11348, .. 32 initiator > responder: [F.], seq 2045, ack 40433, win 11348 .. ... all trigger 'ACK is over bound' test and we end up with non-early-evictable 5-day default timeout. NB: This patch triggers a bunch of checkpatch warnings because of silly indent. I will resend the cleanup series linked below to reduce the indent level once this change has propagated to net-next. I could route the cleanup via nf but that causes extra backport work for stable maintainers. Link: https://lore.kernel.org/netfilter-devel/20220720175228.17880-1-fw@strlen.de/T/#mb1d7147d36294573cc4f81d00f9f8dadfdd06cd8 Signed-off-by: Florian Westphal <fw@strlen.de>
2022-08-23netfilter: ebtables: reject blobs that don't provide all entry pointsFlorian Westphal5-35/+1
Harshit Mogalapalli says: In ebt_do_table() function dereferencing 'private->hook_entry[hook]' can lead to NULL pointer dereference. [..] Kernel panic: general protection fault, probably for non-canonical address 0xdffffc0000000005: 0000 [#1] PREEMPT SMP KASAN KASAN: null-ptr-deref in range [0x0000000000000028-0x000000000000002f] [..] RIP: 0010:ebt_do_table+0x1dc/0x1ce0 Code: 89 fa 48 c1 ea 03 80 3c 02 00 0f 85 5c 16 00 00 48 b8 00 00 00 00 00 fc ff df 49 8b 6c df 08 48 8d 7d 2c 48 89 fa 48 c1 ea 03 <0f> b6 14 02 48 89 f8 83 e0 07 83 c0 03 38 d0 7c 08 84 d2 0f 85 88 [..] Call Trace: nf_hook_slow+0xb1/0x170 __br_forward+0x289/0x730 maybe_deliver+0x24b/0x380 br_flood+0xc6/0x390 br_dev_xmit+0xa2e/0x12c0 For some reason ebtables rejects blobs that provide entry points that are not supported by the table, but what it should instead reject is the opposite: blobs that DO NOT provide an entry point supported by the table. t->valid_hooks is the bitmask of hooks (input, forward ...) that will see packets. Providing an entry point that is not support is harmless (never called/used), but the inverse isn't: it results in a crash because the ebtables traverser doesn't expect a NULL blob for a location its receiving packets for. Instead of fixing all the individual checks, do what iptables is doing and reject all blobs that differ from the expected hooks. Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2") Reported-by: Harshit Mogalapalli <harshit.m.mogalapalli@oracle.com> Reported-by: syzkaller <syzkaller@googlegroups.com> Signed-off-by: Florian Westphal <fw@strlen.de>
2022-08-23net: dsa: don't dereference NULL extack in dsa_slave_changeupper()Vladimir Oltean1-1/+1
When a driver returns -EOPNOTSUPP in dsa_port_bridge_join() but failed to provide a reason for it, DSA attempts to set the extack to say that software fallback will kick in. The problem is, when we use brctl and the legacy bridge ioctls, the extack will be NULL, and DSA dereferences it in the process of setting it. Sergei Antonov proves this using the following stack trace: Unable to handle kernel NULL pointer dereference at virtual address 00000000 PC is at dsa_slave_changeupper+0x5c/0x158 dsa_slave_changeupper from raw_notifier_call_chain+0x38/0x6c raw_notifier_call_chain from __netdev_upper_dev_link+0x198/0x3b4 __netdev_upper_dev_link from netdev_master_upper_dev_link+0x50/0x78 netdev_master_upper_dev_link from br_add_if+0x430/0x7f4 br_add_if from br_ioctl_stub+0x170/0x530 br_ioctl_stub from br_ioctl_call+0x54/0x7c br_ioctl_call from dev_ifsioc+0x4e0/0x6bc dev_ifsioc from dev_ioctl+0x2f8/0x758 dev_ioctl from sock_ioctl+0x5f0/0x674 sock_ioctl from sys_ioctl+0x518/0xe40 sys_ioctl from ret_fast_syscall+0x0/0x1c Fix the problem by only overriding the extack if non-NULL. Fixes: 1c6e8088d9a7 ("net: dsa: allow port_bridge_join() to override extack message") Link: https://lore.kernel.org/netdev/CABikg9wx7vB5eRDAYtvAm7fprJ09Ta27a4ZazC=NX5K4wn6pWA@mail.gmail.com/ Reported-by: Sergei Antonov <saproj@gmail.com> Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Reviewed-by: Florian Fainelli <f.fainelli@gmail.com> Tested-by: Sergei Antonov <saproj@gmail.com> Link: https://lore.kernel.org/r/20220819173925.3581871-1-vladimir.oltean@nxp.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-08-23net: ipvtap - add __init/__exit annotations to module init/exit funcsMaciej Żenczykowski1-2/+2
Looks to have been left out in an oversight. Cc: Mahesh Bandewar <maheshb@google.com> Cc: Sainath Grandhi <sainath.grandhi@intel.com> Fixes: 235a9d89da97 ('ipvtap: IP-VLAN based tap driver') Signed-off-by: Maciej Żenczykowski <maze@google.com> Link: https://lore.kernel.org/r/20220821130808.12143-1-zenczykowski@gmail.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2022-08-22Merge branch 'bonding-802-3ad-fix-no-transmission-of-lacpdus'Jakub Kicinski9-28/+108
Jonathan Toppins says: ==================== bonding: 802.3ad: fix no transmission of LACPDUs Configuring a bond in a specific order can leave the bond in a state where it never transmits LACPDUs. The first patch adds some kselftest infrastructure and the reproducer that demonstrates the problem. The second patch fixes the issue. The new third patch makes ad_ticks_per_sec a static const and removes the passing of this variable via the stack. ==================== Link: https://lore.kernel.org/r/cover.1660919940.git.jtoppins@redhat.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-08-22bonding: 3ad: make ad_ticks_per_sec a constJonathan Toppins3-10/+5
The value is only ever set once in bond_3ad_initialize and only ever read otherwise. There seems to be no reason to set the variable via bond_3ad_initialize when setting the global variable will do. Change ad_ticks_per_sec to a const to enforce its read-only usage. Signed-off-by: Jonathan Toppins <jtoppins@redhat.com> Acked-by: Jay Vosburgh <jay.vosburgh@canonical.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-08-22bonding: 802.3ad: fix no transmission of LACPDUsJonathan Toppins1-22/+16
This is caused by the global variable ad_ticks_per_sec being zero as demonstrated by the reproducer script discussed below. This causes all timer values in __ad_timer_to_ticks to be zero, resulting in the periodic timer to never fire. To reproduce: Run the script in `tools/testing/selftests/drivers/net/bonding/bond-break-lacpdu-tx.sh` which puts bonding into a state where it never transmits LACPDUs. line 44: ip link add fbond type bond mode 4 miimon 200 \ xmit_hash_policy 1 ad_actor_sys_prio 65535 lacp_rate fast setting bond param: ad_actor_sys_prio given: params.ad_actor_system = 0 call stack: bond_option_ad_actor_sys_prio() -> bond_3ad_update_ad_actor_settings() -> set ad.system.sys_priority = bond->params.ad_actor_sys_prio -> ad.system.sys_mac_addr = bond->dev->dev_addr; because params.ad_actor_system == 0 results: ad.system.sys_mac_addr = bond->dev->dev_addr line 48: ip link set fbond address 52:54:00:3B:7C:A6 setting bond MAC addr call stack: bond->dev->dev_addr = new_mac line 52: ip link set fbond type bond ad_actor_sys_prio 65535 setting bond param: ad_actor_sys_prio given: params.ad_actor_system = 0 call stack: bond_option_ad_actor_sys_prio() -> bond_3ad_update_ad_actor_settings() -> set ad.system.sys_priority = bond->params.ad_actor_sys_prio -> ad.system.sys_mac_addr = bond->dev->dev_addr; because params.ad_actor_system == 0 results: ad.system.sys_mac_addr = bond->dev->dev_addr line 60: ip link set veth1-bond down master fbond given: params.ad_actor_system = 0 params.mode = BOND_MODE_8023AD ad.system.sys_mac_addr == bond->dev->dev_addr call stack: bond_enslave -> bond_3ad_initialize(); because first slave -> if ad.system.sys_mac_addr != bond->dev->dev_addr return results: Nothing is run in bond_3ad_initialize() because dev_addr equals sys_mac_addr leaving the global ad_ticks_per_sec zero as it is never initialized anywhere else. The if check around the contents of bond_3ad_initialize() is no longer needed due to commit 5ee14e6d336f ("bonding: 3ad: apply ad_actor settings changes immediately") which sets ad.system.sys_mac_addr if any one of the bonding parameters whos set function calls bond_3ad_update_ad_actor_settings(). This is because if ad.system.sys_mac_addr is zero it will be set to the current bond mac address, this causes the if check to never be true. Fixes: 5ee14e6d336f ("bonding: 3ad: apply ad_actor settings changes immediately") Signed-off-by: Jonathan Toppins <jtoppins@redhat.com> Acked-by: Jay Vosburgh <jay.vosburgh@canonical.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-08-22selftests: include bonding tests into the kselftest infraJonathan Toppins6-0/+91
This creates a test collection in drivers/net/bonding for bonding specific kernel selftests. The first test is a reproducer that provisions a bond and given the specific order in how the ip-link(8) commands are issued the bond never transmits an LACPDU frame on any of its slaves. Signed-off-by: Jonathan Toppins <jtoppins@redhat.com> Acked-by: Jay Vosburgh <jay.vosburgh@canonical.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-08-22net: moxa: get rid of asymmetry in DMA mapping/unmappingSergei Antonov1-5/+6
Since priv->rx_mapping[i] is maped in moxart_mac_open(), we should unmap it from moxart_mac_stop(). Fixes 2 warnings. 1. During error unwinding in moxart_mac_probe(): "goto init_fail;", then moxart_mac_free_memory() calls dma_unmap_single() with priv->rx_mapping[i] pointers zeroed. WARNING: CPU: 0 PID: 1 at kernel/dma/debug.c:963 check_unmap+0x704/0x980 DMA-API: moxart-ethernet 92000000.mac: device driver tries to free DMA memory it has not allocated [device address=0x0000000000000000] [size=1600 bytes] CPU: 0 PID: 1 Comm: swapper Not tainted 5.19.0+ #60 Hardware name: Generic DT based system unwind_backtrace from show_stack+0x10/0x14 show_stack from dump_stack_lvl+0x34/0x44 dump_stack_lvl from __warn+0xbc/0x1f0 __warn from warn_slowpath_fmt+0x94/0xc8 warn_slowpath_fmt from check_unmap+0x704/0x980 check_unmap from debug_dma_unmap_page+0x8c/0x9c debug_dma_unmap_page from moxart_mac_free_memory+0x3c/0xa8 moxart_mac_free_memory from moxart_mac_probe+0x190/0x218 moxart_mac_probe from platform_probe+0x48/0x88 platform_probe from really_probe+0xc0/0x2e4 2. After commands: ip link set dev eth0 down ip link set dev eth0 up WARNING: CPU: 0 PID: 55 at kernel/dma/debug.c:570 add_dma_entry+0x204/0x2ec DMA-API: moxart-ethernet 92000000.mac: cacheline tracking EEXIST, overlapping mappings aren't supported CPU: 0 PID: 55 Comm: ip Not tainted 5.19.0+ #57 Hardware name: Generic DT based system unwind_backtrace from show_stack+0x10/0x14 show_stack from dump_stack_lvl+0x34/0x44 dump_stack_lvl from __warn+0xbc/0x1f0 __warn from warn_slowpath_fmt+0x94/0xc8 warn_slowpath_fmt from add_dma_entry+0x204/0x2ec add_dma_entry from dma_map_page_attrs+0x110/0x328 dma_map_page_attrs from moxart_mac_open+0x134/0x320 moxart_mac_open from __dev_open+0x11c/0x1ec __dev_open from __dev_change_flags+0x194/0x22c __dev_change_flags from dev_change_flags+0x14/0x44 dev_change_flags from devinet_ioctl+0x6d4/0x93c devinet_ioctl from inet_ioctl+0x1ac/0x25c v1 -> v2: Extraneous change removed. Fixes: 6c821bd9edc9 ("net: Add MOXA ART SoCs ethernet driver") Signed-off-by: Sergei Antonov <saproj@gmail.com> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Link: https://lore.kernel.org/r/20220819110519.1230877-1-saproj@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-08-22net: phy: Don't WARN for PHY_READY state in mdio_bus_phy_resume()Xiaolei Wang1-4/+4
For some MAC drivers, they set the mac_managed_pm to true in its ->ndo_open() callback. So before the mac_managed_pm is set to true, we still want to leverage the mdio_bus_phy_suspend()/resume() for the phy device suspend and resume. In this case, the phy device is in PHY_READY, and we shouldn't warn about this. It also seems that the check of mac_managed_pm in WARN_ON is redundant since we already check this in the entry of mdio_bus_phy_resume(), so drop it. Fixes: 744d23c71af3 ("net: phy: Warn about incorrect mdio_bus_phy_resume() state") Signed-off-by: Xiaolei Wang <xiaolei.wang@windriver.com> Acked-by: Florian Fainelli <f.fainelli@gmail.com> Link: https://lore.kernel.org/r/20220819082451.1992102-1-xiaolei.wang@windriver.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-08-22net: ipa: don't assume SMEM is page-alignedAlex Elder1-1/+1
In ipa_smem_init(), a Qualcomm SMEM region is allocated (if needed) and then its virtual address is fetched using qcom_smem_get(). The physical address associated with that region is also fetched. The physical address is adjusted so that it is page-aligned, and an attempt is made to update the size of the region to compensate for any non-zero adjustment. But that adjustment isn't done properly. The physical address is aligned twice, and as a result the size is never actually adjusted. Fix this by *not* aligning the "addr" local variable, and instead making the "phys" local variable be the adjusted "addr" value. Fixes: a0036bb413d5b ("net: ipa: define SMEM memory region for IPA") Signed-off-by: Alex Elder <elder@linaro.org> Link: https://lore.kernel.org/r/20220818134206.567618-1-elder@linaro.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-08-22net: dsa: microchip: keep compatibility with device tree blobs with no phy-modeVladimir Oltean1-1/+7
DSA has multiple ways of specifying a MAC connection to an internal PHY. One requires a DT description like this: port@0 { reg = <0>; phy-handle = <&internal_phy>; phy-mode = "internal"; }; (which is IMO the recommended approach, as it is the clearest description) but it is also possible to leave the specification as just: port@0 { reg = <0>; } and if the driver implements ds->ops->phy_read and ds->ops->phy_write, the DSA framework "knows" it should create a ds->slave_mii_bus, and it should connect to a non-OF-based internal PHY on this MDIO bus, at an MDIO address equal to the port address. There is also an intermediary way of describing things: port@0 { reg = <0>; phy-handle = <&internal_phy>; }; In case 2, DSA calls phylink_connect_phy() and in case 3, it calls phylink_of_phy_connect(). In both cases, phylink_create() has been called with a phy_interface_t of PHY_INTERFACE_MODE_NA, and in both cases, PHY_INTERFACE_MODE_NA is translated into phy->interface. It is important to note that phy_device_create() initializes dev->interface = PHY_INTERFACE_MODE_GMII, and so, when we use phylink_create(PHY_INTERFACE_MODE_NA), no one will override this, and we will end up with a PHY_INTERFACE_MODE_GMII interface inherited from the PHY. All this means that in order to maintain compatibility with device tree blobs where the phy-mode property is missing, we need to allow the "gmii" phy-mode and treat it as "internal". Fixes: 2c709e0bdad4 ("net: dsa: microchip: ksz8795: add phylink support") Link: https://bugzilla.kernel.org/show_bug.cgi?id=216320 Reported-by: Craig McQueen <craig@mcqueen.id.au> Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Reviewed-by: Alvin Šipraga <alsi@bang-olufsen.dk> Tested-by: Rasmus Villemoes <rasmus.villemoes@prevas.dk> Link: https://lore.kernel.org/r/20220818143250.2797111-1-vladimir.oltean@nxp.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-08-22Merge branch 'mm-hotfixes-stable'Andrew Morton18-119/+240
2022-08-22audit: fix potential double free on error path from fsnotify_add_inode_markGaosheng Cui1-0/+1
Audit_alloc_mark() assign pathname to audit_mark->path, on error path from fsnotify_add_inode_mark(), fsnotify_put_mark will free memory of audit_mark->path, but the caller of audit_alloc_mark will free the pathname again, so there will be double free problem. Fix this by resetting audit_mark->path to NULL pointer on error path from fsnotify_add_inode_mark(). Cc: stable@vger.kernel.org Fixes: 7b1293234084d ("fsnotify: Add group pointer in fsnotify_init_mark()") Signed-off-by: Gaosheng Cui <cuigaosheng1@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Paul Moore <paul@paul-moore.com>
2022-08-22net/mlx5: Unlock on error in mlx5_sriov_enable()Dan Carpenter1-1/+1
Unlock before returning if mlx5_device_enable_sriov() fails. Fixes: 84a433a40d0e ("net/mlx5: Lock mlx5 devlink reload callbacks") Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2022-08-22net/mlx5e: Fix use after free in mlx5e_fs_init()Dan Carpenter1-2/+3
Call mlx5e_fs_vlan_free(fs) before kvfree(fs). Fixes: af8bbf730068 ("net/mlx5e: Convert mlx5e_flow_steering member of mlx5e_priv to pointer") Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Reviewed-by: Tariq Toukan <tariqt@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2022-08-22net/mlx5e: kTLS, Use _safe() iterator in mlx5e_tls_priv_tx_list_cleanup()Dan Carpenter1-2/+2
Use the list_for_each_entry_safe() macro to prevent dereferencing "obj" after it has been freed. Fixes: c4dfe704f53f ("net/mlx5e: kTLS, Recycle objects of device-offloaded TLS TX connections") Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Reviewed-by: Tariq Toukan <tariqt@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2022-08-22net/mlx5: unlock on error path in esw_vfs_changed_event_handler()Dan Carpenter1-1/+3
Unlock before returning on this error path. Fixes: f1bc646c9a06 ("net/mlx5: Use devl_ API in mlx5_esw_offloads_devlink_port_register") Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2022-08-22net/mlx5e: Fix wrong tc flag used when set hw-tc-offload offMaor Dickman1-1/+3
The cited commit reintroduced the ability to set hw-tc-offload in switchdev mode by reusing NIC mode calls without modifying it to support both modes, this can cause an illegal memory access when trying to turn hw-tc-offload off. Fix this by using the right TC_FLAG when checking if tc rules are installed while disabling hw-tc-offload. Fixes: d3cbd4254df8 ("net/mlx5e: Add ndo_set_feature for uplink representor") Signed-off-by: Maor Dickman <maord@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>