Age | Commit message | Author | Files | Lines
2022-05-08 | parisc: Change MAX_ADDRESS to become unsigned long long | Helge Deller | 1 | -1/+1
Dave noticed that for the 32-bit kernel MAX_ADDRESS should be a ULL, otherwise this define would become 0:

MAX_ADDRESS (1UL << MAX_ADDRBITS)

It has no real effect on the kernel.

Signed-off-by: Helge Deller <deller@gmx.de>
Noticed-by: John David Anglin <dave.anglin@bell.net>
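The truncation described above can be sketched in portable C. This is only an illustration, not the kernel's definition: `DEMO_ADDRBITS` is a hypothetical stand-in for `MAX_ADDRBITS`, and the 32-bit `unsigned long` is modelled by narrowing a 64-bit result to `uint32_t`, since shifting a 32-bit type by its full width is undefined in C.

```c
#include <stdint.h>

/* Hypothetical value for illustration only; the real MAX_ADDRBITS
 * comes from the parisc headers. */
#define DEMO_ADDRBITS 32

/* Models what a 32-bit `unsigned long` can hold: the single set bit
 * does not fit in 32 bits, so the stored value is 0. */
static uint32_t demo_max_address_ul32(void)
{
    return (uint32_t)((uint64_t)1 << DEMO_ADDRBITS);
}

/* Models `1ULL << bits`: 64-bit arithmetic keeps the set bit. */
static uint64_t demo_max_address_ull(void)
{
    return (uint64_t)1 << DEMO_ADDRBITS;
}
```

With the assumed width of 32 bits, the first helper yields 0 and the second yields 2^32, which is the difference the commit message points at.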
2022-05-08 | parisc: Merge model and model name into one line in /proc/cpuinfo | Helge Deller | 1 | -2/+1
The Linux tool "lscpu" shows double the number of CPUs if "model" and "model name" are on two different lines in /proc/cpuinfo. This change combines the model and the model name into one line.

Signed-off-by: Helge Deller <deller@gmx.de>
Cc: stable@vger.kernel.org
2022-05-08 | parisc: Re-enable GENERIC_CPU_DEVICES for !SMP | Helge Deller | 1 | -0/+1
In commit 62773112acc5 ("parisc: Switch from GENERIC_CPU_DEVICES to GENERIC_ARCH_TOPOLOGY") GENERIC_CPU_DEVICES was unconditionally turned off, but this triggers a warning in topology_add_dev(). Turning it back on for the !SMP case avoids this warning.

Reported-by: Guenter Roeck <linux@roeck-us.net>
Tested-by: Guenter Roeck <linux@roeck-us.net>
Fixes: 62773112acc5 ("parisc: Switch from GENERIC_CPU_DEVICES to GENERIC_ARCH_TOPOLOGY")
Signed-off-by: Helge Deller <deller@gmx.de>
2022-05-08 | parisc: Update 32- and 64-bit defconfigs | Helge Deller | 2 | -2/+5
Enable CONFIG_CGROUPS=y on 32-bit defconfig for systemd-support, and enable CONFIG_NAMESPACES and CONFIG_USER_NS.

Signed-off-by: Helge Deller <deller@gmx.de>
2022-05-08 | parisc: Only list existing CPUs in cpu_possible_mask | Helge Deller | 1 | -0/+8
The inventory knows which CPUs are in the system, so this bitmask should be in cpu_possible_mask instead of the bitmask based on CONFIG_NR_CPUS.

Reset the cpu_possible_mask before scanning the system for CPUs, and mark each existing CPU as possible during initialization of that CPU.

This avoids those warnings later on too:

register_cpu_capacity_sysctl: too early to get CPU4 device!

Signed-off-by: Helge Deller <deller@gmx.de>
Noticed-by: John David Anglin <dave.anglin@bell.net>
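The reset-then-mark sequence can be pictured with a toy bitmask. This is a userspace sketch only: `DEMO_NR_CPUS` and the `demo_*` helpers are hypothetical and do not reproduce the kernel's cpumask API.

```c
#include <stdint.h>

#define DEMO_NR_CPUS 8   /* hypothetical stand-in for CONFIG_NR_CPUS */

/* Starts out with every configured CPU marked possible. */
static uint32_t demo_possible_mask = (1u << DEMO_NR_CPUS) - 1u;

/* Clear the mask before the inventory scan. */
static void demo_reset_possible(void)
{
    demo_possible_mask = 0;
}

/* Mark one CPU that the scan actually found. */
static void demo_set_possible(unsigned int cpu)
{
    if (cpu < DEMO_NR_CPUS)
        demo_possible_mask |= 1u << cpu;
}

/* Count the CPUs currently marked possible. */
static unsigned int demo_num_possible(void)
{
    unsigned int n = 0;
    uint32_t m = demo_possible_mask;

    while (m) {
        n += m & 1u;
        m >>= 1;
    }
    return n;
}
```

After a reset followed by marking CPUs 0 and 1, only those two bits remain set, so later code iterating over the mask never touches a CPU that does not exist.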
2022-05-08 | Revert "parisc: Fix patch code locking and flushing" | Helge Deller | 1 | -11/+14
This reverts commit a9fe7fa7d874a536e0540469f314772c054a0323.

Leads to segfaults on 32bit kernel.

Signed-off-by: Helge Deller <deller@gmx.de>
2022-05-08 | Revert "parisc: Mark sched_clock unstable only if clocks are not syncronized" | Helge Deller | 2 | -3/+6
This reverts commit d97180ad68bdb7ee10f327205a649bc2f558741d.

It triggers RCU stalls at boot with a 32-bit kernel.

Signed-off-by: Helge Deller <deller@gmx.de>
Noticed-by: John David Anglin <dave.anglin@bell.net>
Cc: stable@vger.kernel.org # v5.15+
2022-05-08 | Revert "parisc: Mark cr16 CPU clocksource unstable on all SMP machines" | Helge Deller | 1 | -8/+22
This reverts commit afdb4a5b1d340e4afffc65daa21cc71890d7d589.

It triggers RCU stalls at boot with a 32-bit kernel.

Signed-off-by: Helge Deller <deller@gmx.de>
Noticed-by: John David Anglin <dave.anglin@bell.net>
Cc: stable@vger.kernel.org # v5.16+
2022-05-07 | Merge tag 'gpio-fixes-for-v5.18-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/brgl/linux | Linus Torvalds | 5 | -16/+6
Pull gpio fixes from Bartosz Golaszewski:

 - fix the bounds check for the 'gpio-reserved-ranges' device property in gpiolib-of
 - drop the assignment of the pwm base number in gpio-mvebu (this was missed by the patch doing it globally for all pwm drivers)
 - fix the fwnode assignment (use own fwnode, not the parent's one) for the GPIO irqchip in gpio-visconti
 - update the irq_stat field before checking the trigger field in gpio-pca953x
 - update GPIO entry in MAINTAINERS

* tag 'gpio-fixes-for-v5.18-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/brgl/linux:
  gpio: pca953x: fix irq_stat not updated when irq is disabled (irq_mask not set)
  gpio: visconti: Fix fwnode of GPIO IRQ
  MAINTAINERS: update the GPIO git tree entry
  gpio: mvebu: drop pwm base assignment
  gpiolib: of: fix bounds check for 'gpio-reserved-ranges'
2022-05-07 | Merge tag 'block-5.18-2022-05-06' of git://git.kernel.dk/linux-block | Linus Torvalds | 4 | -18/+51
Pull block fixes from Jens Axboe:
 "A single revert for a change that isn't needed in 5.18, and a small series for s390/dasd"

* tag 'block-5.18-2022-05-06' of git://git.kernel.dk/linux-block:
  s390/dasd: Use kzalloc instead of kmalloc/memset
  s390/dasd: Fix read inconsistency for ESE DASD devices
  s390/dasd: Fix read for ESE with blksize < 4k
  s390/dasd: prevent double format of tracks for ESE devices
  s390/dasd: fix data corruption for ESE devices
  Revert "block: release rq qos structures for queue without disk"
2022-05-07 | Merge tag 'io_uring-5.18-2022-05-06' of git://git.kernel.dk/linux-block | Linus Torvalds | 1 | -1/+6
Pull io_uring fix from Jens Axboe:
 "Just a single file assignment fix this week"

* tag 'io_uring-5.18-2022-05-06' of git://git.kernel.dk/linux-block:
  io_uring: assign non-fixed early for async work
2022-05-06 | Merge tag 'for-5.18-rc5-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux | Linus Torvalds | 4 | -34/+53
Pull btrfs fixes from David Sterba:
 "Regression fixes in zone activation:

   - move a loop invariant out of the loop to avoid checking space status
   - properly handle unlimited activation

  Other fixes:

   - for subpage, force the free space v2 mount to avoid a warning and make it easy to switch a filesystem on different page size systems
   - export sysfs status of exclusive operation 'balance paused', so the user space tools can recognize it and allow adding a device with paused balance
   - fix assertion failure when logging directory key range item"

* tag 'for-5.18-rc5-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
  btrfs: sysfs: export the balance paused state of exclusive operation
  btrfs: fix assertion failure when logging directory key range item
  btrfs: zoned: activate block group properly on unlimited active zone device
  btrfs: zoned: move non-changing condition check out of the loop
  btrfs: force v2 space cache usage for subpage mount
2022-05-06 | Merge tag 'nfs-for-5.18-3' of git://git.linux-nfs.org/projects/trondmy/linux-nfs | Linus Torvalds | 5 | -10/+54
Pull NFS client fixes from Trond Myklebust:
 "Highlights include:

  Stable fixes:

   - Fix a socket leak when setting up an AF_LOCAL RPC client
   - Ensure that knfsd connects to the gss-proxy daemon on setup

  Bugfixes:

   - Fix a refcount leak when migrating a task off an offlined transport
   - Don't gratuitously invalidate inode attributes on delegation return
   - Don't leak sockets in xs_local_connect()
   - Ensure timely close of disconnected AF_LOCAL sockets"

* tag 'nfs-for-5.18-3' of git://git.linux-nfs.org/projects/trondmy/linux-nfs:
  Revert "SUNRPC: attempt AF_LOCAL connect on setup"
  SUNRPC: Ensure gss-proxy connects on setup
  SUNRPC: Ensure timely close of disconnected AF_LOCAL sockets
  SUNRPC: Don't leak sockets in xs_local_connect()
  NFSv4: Don't invalidate inode attributes on delegation return
  SUNRPC release the transport of a relocated task with an assigned transport
2022-05-06 | Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm | Linus Torvalds | 9 | -69/+190
Pull kvm fixes from Paolo Bonzini:
 "x86:

   - Account for family 17h event renumberings in AMD PMU emulation
   - Remove CPUID leaf 0xA on AMD processors
   - Fix lockdep issue with locking all vCPUs
   - Fix loss of A/D bits in SPTEs
   - Fix syzkaller issue with invalid guest state"

* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm:
  KVM: VMX: Exit to userspace if vCPU has injected exception and invalid state
  KVM: SEV: Mark nested locking of vcpu->lock
  kvm: x86/cpuid: Only provide CPUID leaf 0xA if host has architectural PMU
  KVM: x86/svm: Account for family 17h event renumberings in amd_pmc_perf_hw_id
  KVM: x86/mmu: Use atomic XCHG to write TDP MMU SPTEs with volatile bits
  KVM: x86/mmu: Move shadow-present check out of spte_has_volatile_bits()
  KVM: x86/mmu: Don't treat fully writable SPTEs as volatile (modulo A/D)
2022-05-06 | Merge tag 'riscv-for-linus-5.18-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/riscv/linux | Linus Torvalds | 1 | -2/+19
Pull RISC-V fix from Palmer Dabbelt:

 - A fix to relocate the DTB early in boot, in cases where the bootloader doesn't put the DTB in a region that will end up mapped by the kernel. This manifests as a crash early in boot on a handful of configurations.

* tag 'riscv-for-linus-5.18-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/riscv/linux:
  RISC-V: relocate DTB if it's outside memory region
2022-05-06 | KVM: VMX: Exit to userspace if vCPU has injected exception and invalid state | Sean Christopherson | 1 | -1/+1
Exit to userspace with an emulation error if KVM encounters an injected exception with invalid guest state, in addition to the existing check of bailing if there's a pending exception (KVM doesn't support emulating exceptions except when emulating real mode via vm86).

In theory, KVM should never get to such a situation as KVM is supposed to exit to userspace before injecting an exception with invalid guest state. But in practice, userspace can intervene and manually inject an exception and/or stuff registers to force invalid guest state while a previously injected exception is awaiting reinjection.

Fixes: fc4fad79fc3d ("KVM: VMX: Reject KVM_RUN if emulation is required with pending exception")
Reported-by: syzbot+cfafed3bb76d3e37581b@syzkaller.appspotmail.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20220502221850.131873-1-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-05-06 | KVM: SEV: Mark nested locking of vcpu->lock | Peter Gonda | 1 | -4/+38
svm_vm_migrate_from() uses sev_lock_vcpus_for_migration() to lock all source and target vcpu->locks. Unfortunately there is an 8 subclass limit, so a new subclass cannot be used for each vCPU. Instead maintain ownership of the first vcpu's mutex.dep_map using a role specific subclass: source vs target. Release the other vcpu's mutex.dep_maps.

Fixes: b56639318bb2b ("KVM: SEV: Add support for SEV intra host migration")
Reported-by: John Sperbeck <jsperbeck@google.com>
Suggested-by: David Rientjes <rientjes@google.com>
Suggested-by: Sean Christopherson <seanjc@google.com>
Suggested-by: Paolo Bonzini <pbonzini@redhat.com>
Cc: Hillf Danton <hdanton@sina.com>
Cc: kvm@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Peter Gonda <pgonda@google.com>
Message-Id: <20220502165807.529624-1-pgonda@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-05-06 | Merge tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma | Linus Torvalds | 6 | -96/+85
Pull rdma fixes from Jason Gunthorpe:
 "A few recent regressions in rxe's multicast code, and some old driver bugs:

   - Error case unwind bug in rxe for rkeys
   - Do not call netdev functions under a spinlock in rxe multicast code
   - Use the proper BH lock type in rxe multicast code
   - Fix irdma deadlock and crash
   - Add a missing flush to drain irdma QPs when in error
   - Fix high userspace latency in irdma during destroy due to synchronize_rcu()
   - Rare race in siw MPA processing"

* tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma:
  RDMA/rxe: Change mcg_lock to a _bh lock
  RDMA/rxe: Do not call dev_mc_add/del() under a spinlock
  RDMA/siw: Fix a condition race issue in MPA request processing
  RDMA/irdma: Fix possible crash due to NULL netdev in notifier
  RDMA/irdma: Reduce iWARP QP destroy time
  RDMA/irdma: Flush iWARP QP if modified to ERR from RTR state
  RDMA/rxe: Recheck the MR in when generating a READ reply
  RDMA/irdma: Fix deadlock in irdma_cleanup_cm_core()
  RDMA/rxe: Fix "Replace mr by rkey in responder resources"
2022-05-06 | Merge tag 'mmc-v5.18-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/ulfh/mmc | Linus Torvalds | 3 | -6/+64
Pull mmc fixes from Ulf Hansson:
 "MMC core:

   - Fix initialization for eMMC's HS200/HS400 mode

  MMC host:

   - sdhci-msm: Reset GCC_SDCC_BCR register to prevent timeout issues
   - sunxi-mmc: Fix DMA descriptors allocated above 32 bits"

* tag 'mmc-v5.18-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/ulfh/mmc:
  mmc: sdhci-msm: Reset GCC_SDCC_BCR register for SDHC
  mmc: sunxi-mmc: Fix DMA descriptors allocated above 32 bits
  mmc: core: Set HS clock speed before sending HS CMD13
2022-05-06 | Merge tag 'drm-fixes-2022-05-06' of git://anongit.freedesktop.org/drm/drm | Linus Torvalds | 7 | -21/+9
Pull drm fixes from Dave Airlie:
 "A pretty quiet week, one fbdev, msm, kconfig, and two amdgpu fixes, about what I'd expect for rc6.

  fbdev:
   - hotunplugging fix

  amdgpu:
   - Fix a xen dom0 regression on APUs
   - Fix a potential array overflow if a receiver were to send an erroneous audio channel count

  msm:
   - lockdep fix.

  it6505:
   - kconfig fix"

* tag 'drm-fixes-2022-05-06' of git://anongit.freedesktop.org/drm/drm:
  drm/amd/display: Avoid reading audio pattern past AUDIO_CHANNELS_COUNT
  drm/amdgpu: do not use passthrough mode in Xen dom0
  drm/bridge: ite-it6505: add missing Kconfig option select
  fbdev: Make fb_release() return -ENODEV if fbdev was unregistered
  drm/msm/dp: remove fail safe mode related code
2022-05-06 | gpio: pca953x: fix irq_stat not updated when irq is disabled (irq_mask not set) | Puyou Lu | 1 | -2/+2
When one port's input state gets inverted (e.g. from low to high) after pca953x_irq_setup but before irq_mask is set (by some other driver such as "gpio-keys"), the next inversion of this port (e.g. from high to low) will not be triggered any more, because irq_stat was not updated the first time. This commit fixes the issue.

Fixes: 89ea8bbe9c3e ("gpio: pca953x.c: add interrupt handling capability")
Signed-off-by: Puyou Lu <puyou.lu@gmail.com>
Signed-off-by: Bartosz Golaszewski <brgl@bgdev.pl>
2022-05-05 | s390/dasd: Use kzalloc instead of kmalloc/memset | Haowen Bai | 1 | -4/+1
Use kzalloc rather than duplicating its implementation, which makes the code simpler and easier to understand.

Signed-off-by: Haowen Bai <baihaowen@meizu.com>
Reviewed-by: Sven Schnelle <svens@linux.ibm.com>
Signed-off-by: Stefan Haberland <sth@linux.ibm.com>
Link: https://lore.kernel.org/r/20220505141733.1989450-6-sth@linux.ibm.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
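The same simplification has a userspace analogue, shown here only as an illustration: `malloc()` followed by `memset()` collapses into a single `calloc()` call, much as `kmalloc()` plus `memset()` collapses into `kzalloc()`. The `demo_buf` type and helper names are hypothetical and are not taken from the dasd driver.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical payload type, for illustration only. */
struct demo_buf {
    int id;
    char name[16];
};

/* Two-step pattern: allocate, then clear by hand. */
static struct demo_buf *demo_alloc_then_clear(void)
{
    struct demo_buf *b = malloc(sizeof(*b));

    if (!b)
        return NULL;
    memset(b, 0, sizeof(*b));
    return b;
}

/* One-step pattern: the allocator returns zeroed memory. */
static struct demo_buf *demo_alloc_zeroed(void)
{
    return calloc(1, sizeof(struct demo_buf));
}
```

Both helpers return identical zero-filled objects; the second simply has no separate clearing step that can be forgotten or given the wrong size.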
2022-05-05 | s390/dasd: Fix read inconsistency for ESE DASD devices | Jan Höppner | 1 | -2/+1
Read requests that return with NRF error are partially completed in dasd_eckd_ese_read(). The function keeps track of the amount of processed bytes and the driver will eventually return this information back to the block layer for further processing via __dasd_cleanup_cqr() when the request is in the final stage of processing (from the driver's perspective).

For this, blk_update_request() is used which requires the number of bytes to complete the request. As per documentation the nr_bytes parameter is described as follows:

"number of bytes to complete for @req".

This was mistakenly interpreted as "number of bytes _left_ for @req", leading to new requests with incorrect data length. The consequences are inconsistent and completely wrong read requests, as data from random memory areas is read back.

Fix this by correctly specifying the amount of bytes that should be used to complete the request.

Fixes: 5e6bdd37c552 ("s390/dasd: fix data corruption for thin provisioned devices")
Cc: stable@vger.kernel.org # 5.3+
Signed-off-by: Jan Höppner <hoeppner@linux.ibm.com>
Reviewed-by: Stefan Haberland <sth@linux.ibm.com>
Link: https://lore.kernel.org/r/20220505141733.1989450-5-sth@linux.ibm.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
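The difference between "bytes to complete" and "bytes left" can be seen with a toy request model. This is a sketch under stated assumptions: `demo_req` and `demo_update_request()` are hypothetical and only mimic the documented meaning of the nr_bytes parameter; they are not the block layer's implementation.

```c
#include <stddef.h>

/* Hypothetical request: total size and bytes completed so far. */
struct demo_req {
    size_t total;
    size_t done;
};

/* nr_bytes is the amount to complete now, mirroring the quoted
 * description "number of bytes to complete for @req".
 * Returns 1 if bytes are still outstanding, 0 if the request is done. */
static int demo_update_request(struct demo_req *r, size_t nr_bytes)
{
    size_t left = r->total - r->done;

    if (nr_bytes > left)
        nr_bytes = left;
    r->done += nr_bytes;
    return r->done < r->total;
}
```

For a 4096-byte request of which the driver processed 1024 bytes, passing 1024 leaves 3072 bytes outstanding, which is correct. Passing the 3072 bytes that are left instead would mark 3072 bytes as complete and leave only 1024 outstanding, so the follow-up request would cover the wrong range.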
2022-05-05 | s390/dasd: Fix read for ESE with blksize < 4k | Jan Höppner | 1 | -4/+3
When reading unformatted tracks on ESE devices, the corresponding memory areas are simply set to zero for each segment. This is done incorrectly for blocksizes < 4096.

There are two problems. First, the increment of dst is done using the counter of the loop (off), which is increased by blksize every iteration. This leads to a much bigger increment for dst than actually intended. Second, the increment of dst is done before the memory area is set to 0, skipping a significant amount of bytes of memory.

This leads to illegal overwriting of memory and ultimately to a kernel panic.

This is not a problem with 4k blocksize because blk_queue_max_segment_size is set to PAGE_SIZE, always resulting in a single iteration for the inner segment loop (bv.bv_len == blksize). The incorrectly used 'off' value to increment dst is 0 and the correct memory area is used.

In order to fix this for blksize < 4k, increment dst correctly using the blksize and only do it at the end of the loop.

Fixes: 5e2b17e712cf ("s390/dasd: Add dynamic formatting support for ESE volumes")
Cc: stable@vger.kernel.org # v5.3+
Signed-off-by: Jan Höppner <hoeppner@linux.ibm.com>
Reviewed-by: Stefan Haberland <sth@linux.ibm.com>
Link: https://lore.kernel.org/r/20220505141733.1989450-4-sth@linux.ibm.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
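The two problems can be made concrete with a small userspace model. It is a sketch, not the driver code: the function names are hypothetical, and the buggy variant only computes where the last write would land so that the example never touches memory out of bounds.

```c
#include <stddef.h>
#include <string.h>

/* Fixed pattern: zero the current block, then advance dst by blksize.
 * Returns the number of bytes cleared. */
static size_t demo_zero_segment_fixed(char *dst, size_t seg_len, size_t blksize)
{
    size_t off, cleared = 0;

    for (off = 0; off < seg_len; off += blksize) {
        memset(dst, 0, blksize);
        dst += blksize;          /* advance only after the write */
        cleared += blksize;
    }
    return cleared;
}

/* Buggy pattern, modelled without writing: dst is advanced by the loop
 * counter `off` before each write. Returns the byte offset, relative to
 * the segment start, that the final memset would have targeted. */
static size_t demo_buggy_last_offset(size_t seg_len, size_t blksize)
{
    size_t off, pos = 0;

    for (off = 0; off < seg_len; off += blksize)
        pos += off;              /* grows by 0, blksize, 2*blksize, ... */
    return pos;
}
```

For a 4096-byte segment with 512-byte blocks the buggy pattern ends up 14336 bytes past the segment start, far outside the buffer. With a 4096-byte block size the loop runs once with off equal to 0, so the offset stays 0, which matches the explanation of why 4k devices were unaffected.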
2022-05-05 | s390/dasd: prevent double format of tracks for ESE devices | Stefan Haberland | 3 | -2/+26
For ESE devices we get an error for write operations on an unformatted track. Afterwards the track will be formatted and the IO operation restarted.

When using alias devices a track might be accessed by multiple requests simultaneously and there is a race window that a track gets formatted twice resulting in data loss.

Prevent this by remembering the amount of formatted tracks when starting a request and comparing this number before actually formatting a track on the fly. If the number has changed there is a chance that the current track was finally formatted in between. As a result do not format the track and restart the current IO to check.

The number of formatted tracks does not match the overall number of formatted tracks on the device and it might wrap around but this is no problem. It is only needed to recognize that a track has been formatted at all in between.

Fixes: 5e2b17e712cf ("s390/dasd: Add dynamic formatting support for ESE volumes")
Cc: stable@vger.kernel.org # 5.3+
Signed-off-by: Stefan Haberland <sth@linux.ibm.com>
Reviewed-by: Jan Hoeppner <hoeppner@linux.ibm.com>
Link: https://lore.kernel.org/r/20220505141733.1989450-3-sth@linux.ibm.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
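The counter comparison can be sketched as follows. All names here are hypothetical and locking is omitted; the sketch only shows the "remember at start, compare before formatting" idea from the description, not the driver's actual data structures.

```c
/* Hypothetical device state: a counter bumped on every on-the-fly format. */
struct demo_dev {
    unsigned int format_count;
};

/* Hypothetical request state: the counter value seen when the request started. */
struct demo_io {
    unsigned int seen_count;
};

/* Remember the current counter when the request starts. */
static void demo_start_io(struct demo_io *io, const struct demo_dev *dev)
{
    io->seen_count = dev->format_count;
}

/* Returns 1 if no track was formatted since the request started, so
 * formatting may proceed; returns 0 if the request should be restarted
 * to re-check whether the track is still unformatted. */
static int demo_may_format(const struct demo_io *io, const struct demo_dev *dev)
{
    return io->seen_count == dev->format_count;
}

/* Format a track and record that a format happened. Unsigned wrap-around
 * is harmless because only inequality matters. */
static void demo_format_track(struct demo_dev *dev)
{
    dev->format_count++;
}
```

If two requests start against the same unformatted track and the first one formats it, the second sees a changed counter and restarts its IO rather than formatting the track a second time.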
2022-05-05 | s390/dasd: fix data corruption for ESE devices | Stefan Haberland | 3 | -2/+20
For ESE devices we get an error when accessing an unformatted track. The handling of this error will return zero data for read requests and format the track on demand before writing to it. To do this the code needs to distinguish between read and write requests. This is done with data from the blocklayer request. A pointer to the blocklayer request is stored in the CQR.

If there is an error on the device an ERP request is built to do error recovery. While the ERP request is mostly a copy of the original CQR the pointer to the blocklayer request is not copied to not accidentally pass it back to the blocklayer without cleanup.

This leads to the error that during ESE handling after an ERP request was built it is not possible to determine the IO direction. This leads to the formatting of a track for read requests which might in turn lead to data corruption.

Fixes: 5e2b17e712cf ("s390/dasd: Add dynamic formatting support for ESE volumes")
Cc: stable@vger.kernel.org # 5.3+
Signed-off-by: Stefan Haberland <sth@linux.ibm.com>
Reviewed-by: Jan Hoeppner <hoeppner@linux.ibm.com>
Link: https://lore.kernel.org/r/20220505141733.1989450-2-sth@linux.ibm.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-05-06 | Merge tag 'drm-msm-fixes-2022-04-30' of https://gitlab.freedesktop.org/drm/msm into drm-fixes | Dave Airlie | 3 | -18/+0
single lockdep fix.

Signed-off-by: Dave Airlie <airlied@redhat.com>
From: Rob Clark <robdclark@gmail.com>
Link: https://patchwork.freedesktop.org/patch/msgid/CAF6AEGtkzqzxDLp82OaKXVrWd7nWZtkxKsuOK1wOGCDz7qF-dA@mail.gmail.com
2022-05-06 | Merge tag 'drm-misc-fixes-2022-05-05' of git://anongit.freedesktop.org/drm/drm-misc into drm-fixes | Dave Airlie | 2 | -1/+5
drm-misc-fixes for v5.18-rc6:
- Small fix for hot-unplugging fb devices.
- Kconfig fix for it6505.

Signed-off-by: Dave Airlie <airlied@redhat.com>
From: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/69e51773-8c6f-4ff7-9a06-5c2922a43999@linux.intel.com
2022-05-06 | Merge tag 'amd-drm-fixes-5.18-2022-05-04' of https://gitlab.freedesktop.org/agd5f/linux into drm-fixes | Dave Airlie | 2 | -2/+4
amd-drm-fixes-5.18-2022-05-04:

amdgpu:
- Fix a xen dom0 regression on APUs
- Fix a potential array overflow if a receiver were to send an erroneous audio channel count

Signed-off-by: Dave Airlie <airlied@redhat.com>
From: Alex Deucher <alexander.deucher@amd.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20220504190439.5723-1-alexander.deucher@amd.com
2022-05-05 | Merge tag 'folio-5.18f' of git://git.infradead.org/users/willy/pagecache | Linus Torvalds | 2 | -7/+13
Pull folio fixes from Matthew Wilcox:
 "Two folio fixes for 5.18.

  Darrick and Brian have done amazing work debugging the race I created in the folio BIO iterator. The readahead problem was deterministic, so easy to fix.

   - Fix a race when we were calling folio_next() in the BIO folio iter without holding a reference, meaning the folio could be split or freed, and we'd jump to the next page instead of the intended next folio.
   - Fix readahead creating single-page folios instead of the intended large folios when doing reads that are not a power of two in size"

* tag 'folio-5.18f' of git://git.infradead.org/users/willy/pagecache:
  mm/readahead: Fix readahead with large folios
  block: Do not call folio_next() on an unreferenced folio
2022-05-05 | Merge tag 'devicetree-fixes-for-5.18-3' of git://git.kernel.org/pub/scm/linux/kernel/git/robh/linux | Linus Torvalds | 17 | -75/+18
Pull devicetree fixes from Rob Herring:

 - Drop unused 'max-link-speed' in Apple PCIe
 - More redundant 'maxItems/minItems' schema fixes
 - Support values for pinctrl 'drive-push-pull' and 'drive-open-drain'
 - Fix redundant 'unevaluatedProperties' in MT6360 LEDs binding
 - Add missing 'power-domains' property to Cadence UFSHC

* tag 'devicetree-fixes-for-5.18-3' of git://git.kernel.org/pub/scm/linux/kernel/git/robh/linux:
  dt-bindings: pci: apple,pcie: Drop max-link-speed from example
  dt-bindings: Drop redundant 'maxItems/minItems' in if/then schemas
  dt-bindings: pinctrl: Allow values for drive-push-pull and drive-open-drain
  dt-bindings: leds-mt6360: Drop redundant 'unevaluatedProperties'
  dt-bindings: ufs: cdns,ufshc: Add power-domains
2022-05-05 | btrfs: sysfs: export the balance paused state of exclusive operation | David Sterba | 1 | -0/+3
The new state allowing device addition with paused balance is not exported to user space so it can't recognize it and actually start the operation.

Fixes: efc0e69c2fea ("btrfs: introduce exclusive operation BALANCE_PAUSED state")
CC: stable@vger.kernel.org # 5.17
Signed-off-by: David Sterba <dsterba@suse.com>
2022-05-05 | btrfs: fix assertion failure when logging directory key range item | Filipe Manana | 1 | -14/+25
When inserting a key range item (BTRFS_DIR_LOG_INDEX_KEY) while logging a directory, we don't expect the insertion to fail with -EEXIST, because we are holding the directory's log_mutex and we have dropped all existing BTRFS_DIR_LOG_INDEX_KEY keys from the log tree before we started to log the directory. However it's possible that during the logging we attempt to insert the same BTRFS_DIR_LOG_INDEX_KEY key twice, but for this to happen we need to race with insertions of items from other inodes in the subvolume's tree while we are logging a directory. Here's how this can happen:

1) We are logging a directory with inode number 1000 that has its items spread across 3 leaves in the subvolume's tree:

leaf A - has index keys from the range 2 to 20 for example. The last item in the leaf corresponds to a dir item for index number 20. All these dir items were created in a past transaction.

leaf B - has index keys from the range 22 to 100 for example. It has no keys from other inodes, all its keys are dir index keys for our directory inode number 1000. Its first key is for the dir item with a sequence number of 22. All these dir items were also created in a past transaction.

leaf C - has index keys for our directory for the range 101 to 120 for example. This leaf also has items from other inodes, and its first item corresponds to the dir item for index number 101 for our directory with inode number 1000;

2) When we finish processing the items from leaf A at log_dir_items(), we log a BTRFS_DIR_LOG_INDEX_KEY key with an offset of 21 and a last offset of 21, meaning the log is authoritative for the index range from 21 to 21 (a single sequence number).
At this point leaf B was not yet modified in the current transaction;

3) When we return from log_dir_items() we have released our read lock on leaf B, and have set *last_offset_ret to 21 (index number of the first item on leaf B minus 1);

4) Some other task inserts an item for other inode (inode number 1001 for example) into leaf C. That resulted in pushing some items from leaf C into leaf B, in order to make room for the new item, so now leaf B has dir index keys for the sequence number range from 22 to 102 and leaf C has the dir items for the sequence number range 103 to 120;

5) At log_directory_changes() we call log_dir_items() again, passing it a 'min_offset' / 'min_key' value of 22 (*last_offset_ret from step 3 plus 1, so 21 + 1). Then btrfs_search_forward() leaves us at slot 0 of leaf B, since leaf B was modified in the current transaction.

We have also initialized 'last_old_dentry_offset' to 20 after calling btrfs_previous_item() at log_dir_items(), as it left us at the last item of leaf A, which refers to the dir item with sequence number 20;

6) We then call process_dir_items_leaf() to process the dir items of leaf B, and when we process the first item, corresponding to slot 0, sequence number 22, we notice the dir item was created in a past transaction and its sequence number is greater than the value of *last_old_dentry_offset + 1 (20 + 1), so we decide to log again a BTRFS_DIR_LOG_INDEX_KEY key with an offset of 21 and an end range of 21 (key.offset - 1 == 22 - 1 == 21), which results in an -EEXIST error from insert_dir_log_key(), as we have already inserted that key at step 2, triggering the assertion at process_dir_items_leaf().

The trace produced in dmesg is like the following:

assertion failed: ret != -EEXIST, in fs/btrfs/tree-log.c:3857
[198255.980839][ T7460] ------------[ cut here ]------------
[198255.981666][ T7460] kernel BUG at fs/btrfs/ctree.h:3617!
[198255.983141][ T7460] invalid opcode: 0000 [#1] PREEMPT SMP KASAN PTI [198255.984080][ T7460] CPU: 0 PID: 7460 Comm: repro-ghost-dir Not tainted 5.18.0-5314c78ac373-misc-next+ [198255.986027][ T7460] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.14.0-2 04/01/2014 [198255.988600][ T7460] RIP: 0010:assertfail.constprop.0+0x1c/0x1e [198255.989465][ T7460] Code: 8b 4c 89 (...) [198255.992599][ T7460] RSP: 0018:ffffc90007387188 EFLAGS: 00010282 [198255.993414][ T7460] RAX: 000000000000003d RBX: 0000000000000065 RCX: 0000000000000000 [198255.996056][ T7460] RDX: 0000000000000001 RSI: ffffffff8b62b180 RDI: fffff52000e70e24 [198255.997668][ T7460] RBP: ffffc90007387188 R08: 000000000000003d R09: ffff8881f0e16507 [198255.999199][ T7460] R10: ffffed103e1c2ca0 R11: 0000000000000001 R12: 00000000ffffffef [198256.000683][ T7460] R13: ffff88813befc630 R14: ffff888116c16e70 R15: ffffc90007387358 [198256.007082][ T7460] FS: 00007fc7f7c24640(0000) GS:ffff8881f0c00000(0000) knlGS:0000000000000000 [198256.009939][ T7460] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [198256.014133][ T7460] CR2: 0000560bb16d0b78 CR3: 0000000140b34005 CR4: 0000000000170ef0 [198256.015239][ T7460] Call Trace: [198256.015674][ T7460] <TASK> [198256.016313][ T7460] log_dir_items.cold+0x16/0x2c [198256.018858][ T7460] ? replay_one_extent+0xbf0/0xbf0 [198256.025932][ T7460] ? release_extent_buffer+0x1d2/0x270 [198256.029658][ T7460] ? rcu_read_lock_sched_held+0x16/0x80 [198256.031114][ T7460] ? lock_acquired+0xbe/0x660 [198256.032633][ T7460] ? rcu_read_lock_sched_held+0x16/0x80 [198256.034386][ T7460] ? lock_release+0xcf/0x8a0 [198256.036152][ T7460] log_directory_changes+0xf9/0x170 [198256.036993][ T7460] ? log_dir_items+0xba0/0xba0 [198256.037661][ T7460] ? do_raw_write_unlock+0x7d/0xe0 [198256.038680][ T7460] btrfs_log_inode+0x233b/0x26d0 [198256.041294][ T7460] ? log_directory_changes+0x170/0x170 [198256.042864][ T7460] ? 
btrfs_attach_transaction_barrier+0x60/0x60 [198256.045130][ T7460] ? rcu_read_lock_sched_held+0x16/0x80 [198256.046568][ T7460] ? lock_release+0xcf/0x8a0 [198256.047504][ T7460] ? lock_downgrade+0x420/0x420 [198256.048712][ T7460] ? ilookup5_nowait+0x81/0xa0 [198256.049747][ T7460] ? lock_downgrade+0x420/0x420 [198256.050652][ T7460] ? do_raw_spin_unlock+0xa9/0x100 [198256.051618][ T7460] ? __might_resched+0x128/0x1c0 [198256.052511][ T7460] ? __might_sleep+0x66/0xc0 [198256.053442][ T7460] ? __kasan_check_read+0x11/0x20 [198256.054251][ T7460] ? iget5_locked+0xbd/0x150 [198256.054986][ T7460] ? run_delayed_iput_locked+0x110/0x110 [198256.055929][ T7460] ? btrfs_iget+0xc7/0x150 [198256.056630][ T7460] ? btrfs_orphan_cleanup+0x4a0/0x4a0 [198256.057502][ T7460] ? free_extent_buffer+0x13/0x20 [198256.058322][ T7460] btrfs_log_inode+0x2654/0x26d0 [198256.059137][ T7460] ? log_directory_changes+0x170/0x170 [198256.060020][ T7460] ? rcu_read_lock_sched_held+0x16/0x80 [198256.060930][ T7460] ? rcu_read_lock_sched_held+0x16/0x80 [198256.061905][ T7460] ? lock_contended+0x770/0x770 [198256.062682][ T7460] ? btrfs_log_inode_parent+0xd04/0x1750 [198256.063582][ T7460] ? lock_downgrade+0x420/0x420 [198256.064432][ T7460] ? preempt_count_sub+0x18/0xc0 [198256.065550][ T7460] ? __mutex_lock+0x580/0xdc0 [198256.066654][ T7460] ? stack_trace_save+0x94/0xc0 [198256.068008][ T7460] ? __kasan_check_write+0x14/0x20 [198256.072149][ T7460] ? __mutex_unlock_slowpath+0x12a/0x430 [198256.073145][ T7460] ? mutex_lock_io_nested+0xcd0/0xcd0 [198256.074341][ T7460] ? wait_for_completion_io_timeout+0x20/0x20 [198256.075345][ T7460] ? lock_downgrade+0x420/0x420 [198256.076142][ T7460] ? lock_contended+0x770/0x770 [198256.076939][ T7460] ? do_raw_spin_lock+0x1c0/0x1c0 [198256.078401][ T7460] ? btrfs_sync_file+0x5e6/0xa40 [198256.080598][ T7460] btrfs_log_inode_parent+0x523/0x1750 [198256.081991][ T7460] ? wait_current_trans+0xc8/0x240 [198256.083320][ T7460] ? 
lock_downgrade+0x420/0x420 [198256.085450][ T7460] ? btrfs_end_log_trans+0x70/0x70 [198256.086362][ T7460] ? rcu_read_lock_sched_held+0x16/0x80 [198256.087544][ T7460] ? lock_release+0xcf/0x8a0 [198256.088305][ T7460] ? lock_downgrade+0x420/0x420 [198256.090375][ T7460] ? dget_parent+0x8e/0x300 [198256.093538][ T7460] ? do_raw_spin_lock+0x1c0/0x1c0 [198256.094918][ T7460] ? lock_downgrade+0x420/0x420 [198256.097815][ T7460] ? do_raw_spin_unlock+0xa9/0x100 [198256.101822][ T7460] ? dget_parent+0xb7/0x300 [198256.103345][ T7460] btrfs_log_dentry_safe+0x48/0x60 [198256.105052][ T7460] btrfs_sync_file+0x629/0xa40 [198256.106829][ T7460] ? start_ordered_ops.constprop.0+0x120/0x120 [198256.109655][ T7460] ? __fget_files+0x161/0x230 [198256.110760][ T7460] vfs_fsync_range+0x6d/0x110 [198256.111923][ T7460] ? start_ordered_ops.constprop.0+0x120/0x120 [198256.113556][ T7460] __x64_sys_fsync+0x45/0x70 [198256.114323][ T7460] do_syscall_64+0x5c/0xc0 [198256.115084][ T7460] ? syscall_exit_to_user_mode+0x3b/0x50 [198256.116030][ T7460] ? do_syscall_64+0x69/0xc0 [198256.116768][ T7460] ? do_syscall_64+0x69/0xc0 [198256.117555][ T7460] ? do_syscall_64+0x69/0xc0 [198256.118324][ T7460] ? sysvec_call_function_single+0x57/0xc0 [198256.119308][ T7460] ? asm_sysvec_call_function_single+0xa/0x20 [198256.120363][ T7460] entry_SYSCALL_64_after_hwframe+0x44/0xae [198256.121334][ T7460] RIP: 0033:0x7fc7fe97b6ab [198256.122067][ T7460] Code: 0f 05 48 (...) 
[198256.125198][ T7460] RSP: 002b:00007fc7f7c23950 EFLAGS: 00000293 ORIG_RAX: 000000000000004a
[198256.126568][ T7460] RAX: ffffffffffffffda RBX: 00007fc7f7c239f0 RCX: 00007fc7fe97b6ab
[198256.127942][ T7460] RDX: 0000000000000002 RSI: 000056167536bcf0 RDI: 0000000000000004
[198256.129302][ T7460] RBP: 0000000000000004 R08: 0000000000000000 R09: 000000007ffffeb8
[198256.130670][ T7460] R10: 00000000000001ff R11: 0000000000000293 R12: 0000000000000001
[198256.132046][ T7460] R13: 0000561674ca8140 R14: 00007fc7f7c239d0 R15: 000056167536dab8
[198256.133403][ T7460] </TASK>

Fix this by treating -EEXIST as expected at insert_dir_log_key() and have it update the item with an end offset corresponding to the maximum between the previously logged end offset and the new requested end offset. The end offsets may be different due to dir index key deletions that happened as part of unlink operations while we are logging a directory (triggered when fsyncing some other inode parented by the directory) or during renames which always attempt to log a single dir index deletion.

Reported-by: Zygo Blaxell <ce3g8jdj@umail.furryterror.org>
Link: https://lore.kernel.org/linux-btrfs/YmyefE9mc2xl5ZMz@hungrycats.org/
Fixes: 732d591a5d6c12 ("btrfs: stop copying old dir items when logging a directory")
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
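The "treat -EEXIST as expected and keep the larger end offset" idea can be sketched with a toy single-slot log. This is an illustration under stated assumptions: `demo_log_item`, `demo_raw_insert()` and `demo_insert_dir_log_key()` are hypothetical helpers and do not reproduce the btrfs tree-log code.

```c
#include <errno.h>
#include <stdint.h>

/* Hypothetical one-entry log: a range item keyed by its first offset. */
struct demo_log_item {
    int present;
    uint64_t first_offset;
    uint64_t last_offset;
};

/* Plain insertion that refuses duplicates, like a tree insert would. */
static int demo_raw_insert(struct demo_log_item *slot, uint64_t first, uint64_t last)
{
    if (slot->present && slot->first_offset == first)
        return -EEXIST;
    slot->present = 1;
    slot->first_offset = first;
    slot->last_offset = last;
    return 0;
}

/* Wrapper that treats -EEXIST as expected: the existing item is kept and
 * its end offset becomes the maximum of the old and the requested one. */
static int demo_insert_dir_log_key(struct demo_log_item *slot, uint64_t first, uint64_t last)
{
    int ret = demo_raw_insert(slot, first, last);

    if (ret == -EEXIST) {
        if (last > slot->last_offset)
            slot->last_offset = last;
        ret = 0;
    }
    return ret;
}
```

In the scenario above, logging the range 21 to 21 a second time succeeds and leaves the item unchanged, while a later request with a larger end offset widens the logged range and a smaller one never shrinks it.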
2022-05-05btrfs: zoned: activate block group properly on unlimited active zone deviceNaohiro Aota1-14/+8
btrfs_zone_activate() checks if it activated all the underlying zones in the loop. However, that check never hits on an unlimited active zone device (max_active_zones == 0). Fortunately, it still works without ENOSPC because btrfs_zone_activate() returns true in the end, even if block_group->zone_is_active == 0. But it is confusing to have a block group without zone_is_active set still usable for allocation. Also, we are wasting CPU time iterating the loop every time btrfs_zone_activate() is called for the block groups. Since the error case in the loop is handled by out_unlock, we can just set zone_is_active and do the list stuff after the loop. Fixes: f9a912a3c45f ("btrfs: zoned: make zone activation multi stripe capable") Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-05-05btrfs: zoned: move non-changing condition check out of the loopNaohiro Aota1-6/+6
btrfs_zone_activate() checks if block_group->alloc_offset == block_group->zone_capacity every time it iterates the loop. But, it is not depending on the index. Move out the check and do it only once. Fixes: f9a912a3c45f ("btrfs: zoned: make zone activation multi stripe capable") Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>
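The two zoned changes above (set zone_is_active once after the loop, and test the full-block-group condition once before it) can be illustrated with a small user-space sketch. This is not the btrfs code: struct bg_sketch, activate_sketch() and the zone_ok[] array are hypothetical stand-ins, and the sketch assumes that a block group whose alloc_offset equals zone_capacity is reported as not activatable.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified stand-in for a block group; not the btrfs struct. */
struct bg_sketch {
	unsigned long long alloc_offset;
	unsigned long long zone_capacity;
	bool zone_is_active;
};

/*
 * Activate the zone of every stripe, then mark the block group active once.
 * zone_ok[i] stands in for "activating the zone of stripe i succeeded".
 */
bool activate_sketch(struct bg_sketch *bg, const bool *zone_ok, size_t nr_stripes)
{
	size_t i;

	if (bg->zone_is_active)
		return true;

	/* Loop-invariant: does not depend on i, so test it once up front. */
	if (bg->alloc_offset == bg->zone_capacity)
		return false;

	for (i = 0; i < nr_stripes; i++) {
		if (!zone_ok[i])
			return false;	/* error path: the group stays inactive */
	}

	/* Every stripe succeeded, so the flag is set after the loop. */
	bg->zone_is_active = true;
	return true;
}
```

With this shape the flag is set even when no per-device active-zone limit applies, which is the case the first commit describes as never hitting the in-loop check.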
2022-05-05btrfs: force v2 space cache usage for subpage mountQu Wenruo1-0/+11
[BUG] For a btrfs with 4K sector size and v1 cache enabled that has only been mounted on systems with 4K page size, mounting it on a subpage (64K page size) system can cause the following warning on the v1 space cache: BTRFS error (device dm-1): csum mismatch on free space cache BTRFS warning (device dm-1): failed to load free space cache for block group 84082688, rebuilding it now Although not a big deal, as the kernel can rebuild it without problem, such a warning will bother end users, especially if they want to switch the same btrfs seamlessly between systems with different page sizes. [CAUSE] V1 free space cache is still using a fixed PAGE_SIZE for various bitmaps, like BITS_PER_BITMAP. Such hard-coded PAGE_SIZE usage will cause various mismatches, from v1 cache size to checksum. Thus the kernel will always reject a v1 cache written with a different PAGE_SIZE, reporting a csum mismatch. [FIX] Although we should fix v1 cache, it's already going to be marked deprecated soon. And we have v2 cache based on metadata (which is already fully subpage compatible), and it is superior to v1 cache in almost every respect. So just force subpage mount to use v2 cache on mount. Reported-by: Matt Corallo <blnxfsl@bluematt.me> CC: stable@vger.kernel.org # 5.15+ Link: https://lore.kernel.org/linux-btrfs/61aa27d1-30fc-c1a9-f0f4-9df544395ec3@bluematt.me/ Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-05-05Merge tag 's390-5.18-4' of ↵Linus Torvalds3-1/+27
git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux Pull s390 fixes from Heiko Carstens: - Disable -Warray-bounds warning for gcc12, since the only known way to workaround false positive warnings on lowcore accesses would result in worse code on fast paths. - Avoid lockdep_assert_held() warning in kvm vm memop code. - Reduce overhead within gmap_rmap code to get rid of long latencies when e.g. shutting down 2nd level guests. * tag 's390-5.18-4' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux: KVM: s390: vsie/gmap: reduce gmap_rmap overhead KVM: s390: Fix lockdep issue in vm memop s390: disable -Warray-bounds
2022-05-05Merge tag 'mips-fixes_5.18_1' of ↵Linus Torvalds2-12/+7
git://git.kernel.org/pub/scm/linux/kernel/git/mips/linux Pull MIPS fix from Thomas Bogendoerfer: "Extend R4000/R4400 CPU erratum workaround to all revisions" * tag 'mips-fixes_5.18_1' of git://git.kernel.org/pub/scm/linux/kernel/git/mips/linux: MIPS: Fix CP0 counter erratum detection for R4k CPUs
2022-05-05Merge tag 'net-5.18-rc6' of ↵Linus Torvalds76-386/+724
git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net Pull networking fixes from Paolo Abeni: "Including fixes from can, rxrpc and wireguard. Previous releases - regressions: - igmp: respect RCU rules in ip_mc_source() and ip_mc_msfilter() - mld: respect RCU rules in ip6_mc_source() and ip6_mc_msfilter() - rds: acquire netns refcount on TCP sockets - rxrpc: enable IPv6 checksums on transport socket - nic: hinic: fix bug of wq out of bound access - nic: thunder: don't use pci_irq_vector() in atomic context - nic: bnxt_en: fix possible bnxt_open() failure caused by wrong RFS flag - nic: mlx5e: - lag, fix use-after-free in fib event handler - fix deadlock in sync reset flow Previous releases - always broken: - tcp: fix insufficient TCP source port randomness - can: grcan: grcan_close(): fix deadlock - nfc: reorder destructive operations in to avoid bugs Misc: - wireguard: improve selftests reliability" * tag 'net-5.18-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (63 commits) NFC: netlink: fix sleep in atomic bug when firmware download timeout selftests: ocelot: tc_flower_chains: specify conform-exceed action for policer tcp: drop the hash_32() part from the index calculation tcp: increase source port perturb table to 2^16 tcp: dynamically allocate the perturb table used by source ports tcp: add small random increments to the source port tcp: resalt the secret every 10 seconds tcp: use different parts of the port_offset for index and offset secure_seq: use the 64 bits of the siphash for port offset calculation wireguard: selftests: set panic_on_warn=1 from cmdline wireguard: selftests: bump package deps wireguard: selftests: restore support for ccache wireguard: selftests: use newer toolchains to fill out architectures wireguard: selftests: limit parallelism to $(nproc) tests at once wireguard: selftests: make routing loop test non-fatal net/mlx5: Fix matching on inner TTC net/mlx5: Avoid double clear or set of sync reset requested net/mlx5: Fix 
deadlock in sync reset flow net/mlx5e: Fix trust state reset in reload net/mlx5e: Avoid checking offload capability in post_parse action ...
2022-05-05gpio: visconti: Fix fwnode of GPIO IRQNobuhiro Iwamatsu1-5/+2
The fwnode of the GPIO IRQ must be set to its own fwnode, not the fwnode of the parent IRQ. Therefore, set the GPIO IRQ's fwnode to its own fwnode instead of the parent IRQ's fwnode. Fixes: 2ad74f40dacc ("gpio: visconti: Add Toshiba Visconti GPIO support") Signed-off-by: Nobuhiro Iwamatsu <nobuhiro1.iwamatsu@toshiba.co.jp> Reviewed-by: Linus Walleij <linus.walleij@linaro.org> Signed-off-by: Bartosz Golaszewski <brgl@bgdev.pl>
2022-05-05MAINTAINERS: update the GPIO git tree entryBartosz Golaszewski1-1/+1
My git tree has become the de facto main GPIO tree. Update the MAINTAINERS file to reflect that. Signed-off-by: Bartosz Golaszewski <brgl@bgdev.pl> Reported-by: Baruch Siach <baruch@tkos.co.il> Reviewed-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Reviewed-by: Linus Walleij <linus.walleij@linaro.org>
2022-05-05NFC: netlink: fix sleep in atomic bug when firmware download timeoutDuoming Zhou1-2/+2
There is a sleep-in-atomic bug that could cause a kernel panic during the firmware download process. The root cause is that nlmsg_new() with the GFP_KERNEL parameter is called in fw_dnld_timeout, which is a timer handler. The call trace is shown below: BUG: sleeping function called from invalid context at include/linux/sched/mm.h:265 Call Trace: kmem_cache_alloc_node __alloc_skb nfc_genl_fw_download_done call_timer_fn __run_timers.part.0 run_timer_softirq __do_softirq ... nlmsg_new() with the GFP_KERNEL parameter may sleep during memory allocation, and the timer handler runs as the result of a "software interrupt", so it must not call any function that could sleep. This patch changes the allocation mode of the netlink message from GFP_KERNEL to GFP_ATOMIC in order to prevent the sleep-in-atomic bug. The GFP_ATOMIC flag allows the memory allocation to be used in atomic context. Fixes: 9674da8759df ("NFC: Add firmware upload netlink command") Fixes: 9ea7187c53f6 ("NFC: netlink: Rename CMD_FW_UPLOAD to CMD_FW_DOWNLOAD") Signed-off-by: Duoming Zhou <duoming@zju.edu.cn> Reviewed-by: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org> Link: https://lore.kernel.org/r/20220504055847.38026-1-duoming@zju.edu.cn Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2022-05-05mm/readahead: Fix readahead with large foliosMatthew Wilcox (Oracle)1-6/+9
Reading 100KB chunks from a big file (eg dd bs=100K) leads to poor readahead behaviour. Studying the traces in detail, I noticed two problems. The first is that we were setting the readahead flag on the folio which contains the last byte read from the block. This is wrong because we will trigger readahead at the end of the read without waiting to see if a subsequent read is going to use the pages we just read. Instead, we need to set the readahead flag on the first folio _after_ the one which contains the last byte that we're reading. The second is that we were looking for the index of the folio with the readahead flag set to exactly match the start + size - async_size. If we've rounded this, either down (as previously) or up (as now), we'll think we hit a folio marked as readahead by a different read, and try to read the wrong pages. So round the expected index to the order of the folio we hit. Reported-by: Guo Xuenan <guoxuenan@huawei.com> Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
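The second readahead problem above, comparing a folio index against an unrounded start + size - async_size, can be sketched in user-space C. The helper names round_up_pow2() and hit_own_marker() are hypothetical and this is not the mm/readahead.c code; the sketch assumes the expected index is rounded up to the folio size (2^order pages), following the "up (as now)" direction mentioned in the message.

```c
#include <stdbool.h>

/* Round x up to a multiple of align, where align is a power of two. */
unsigned long round_up_pow2(unsigned long x, unsigned long align)
{
	return (x + align - 1) & ~(align - 1);
}

/*
 * Hypothetical helper: is the folio found at `index` (2^order pages large)
 * the one this readahead window marked? All values are page indices/counts.
 */
bool hit_own_marker(unsigned long index, unsigned int order,
		    unsigned long start, unsigned long size,
		    unsigned long async_size)
{
	/* Align the expected position to the granularity of the folio we hit. */
	unsigned long expected = round_up_pow2(start + size - async_size,
					       1UL << order);

	return index == expected;
}
```

For a window with start 0, size 50 and async_size 25, an order-2 folio would be expected at index 28 rather than 25; comparing against the unrounded 25 would make the marker look like it belonged to a different read.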
2022-05-05block: Do not call folio_next() on an unreferenced folioMatthew Wilcox (Oracle)1-1/+4
It is unsafe to call folio_next() on a folio unless you hold a reference on it that prevents it from being split or freed. After returning from the iterator, iomap calls folio_end_writeback() which may drop the last reference to the page, or allow the page to be split. If that happens, the iterator will not advance far enough through the bio_vec, leading to assertion failures like the BUG() in folio_end_writeback() that checks we're not trying to end writeback on a page not currently under writeback. Other assertion failures were also seen, but they're all explained by this one bug. Fix the bug by remembering where the next folio starts before returning from the iterator. There are other ways of fixing this bug, but this seems the simplest. Reported-by: Darrick J. Wong <djwong@kernel.org> Tested-by: Darrick J. Wong <djwong@kernel.org> Reported-by: Brian Foster <bfoster@redhat.com> Tested-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
2022-05-04selftests: ocelot: tc_flower_chains: specify conform-exceed action for policerVladimir Oltean1-1/+1
As discussed here with Ido Schimmel: https://patchwork.kernel.org/project/netdevbpf/patch/20220224102908.5255-2-jianbol@nvidia.com/ the default conform-exceed action is "reclassify", for a reason we don't really understand. The point is that hardware can't offload that police action, so not specifying "conform-exceed" was always wrong, even though the command used to work in hardware (but not in software) until the kernel started adding validation for it. Fix the command used by the selftest by making the policer drop on exceed, and pass the packet to the next action (goto) on conform. Fixes: 8cd6b020b644 ("selftests: ocelot: add some example VCAP IS1, IS2 and ES0 tc offloads") Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Reviewed-by: Ido Schimmel <idosch@nvidia.com> Link: https://lore.kernel.org/r/20220503121428.842906-1-vladimir.oltean@nxp.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-05-04Merge branch 'insufficient-tcp-source-port-randomness'Jakub Kicinski5-25/+43
Willy Tarreau says: ==================== insufficient TCP source port randomness In a not-yet-published paper, Moshe Kol, Amit Klein, and Yossi Gilad report being able to accurately identify a client by forcing it to emit only 40 times more connections than the number of entries in the table_perturb[] table, which is indexed by hashing the connection tuple. The current 2^8 setting allows them to perform that attack with only 10k connections, which is not hard to achieve in a few seconds. Eric, Amit and I have been working on this for a few weeks now, imagining, testing and eliminating a number of approaches that Amit and his team were still able to break or that were found to be too risky or too expensive, and ended up with the simple improvements in this series, which resist the attack, don't degrade the performance, and preserve a reliable port selection algorithm to avoid connection failures, including the odd/even port selection preference that allows bind() to always find a port quickly even under strong connect() stress. The approach relies on several factors: - resalting the hash secret that's used to choose the table_perturb[] entry every 10 seconds to eliminate slow attacks and force the attacker to forget everything that was learned after this delay. This already eliminates most of the problem because if a client stays silent for more than 10 seconds there's no link between the previous and the next patterns, and 10s isn't yet frequent enough to cause too frequent repetition of the same port that may induce a connection failure ; - adding small random increments to the source port. Previously, a random 0 or 1 was added every 16 ports. Now a random 0 to 7 is added after each port. This means that with the default 32768-60999 range, a worst case rollover happens after 1764 connections, and an average of 3137. This doesn't stop statistical attacks but requires significantly more iterations of the same attack to confirm a guess. 
- increasing the table_perturb[] size from 2^8 to 2^16, which Amit says will require 2.6 million connections to be attacked with the changes above, making it pointless to get a fingerprint that will only last 10 seconds. Due to the size, the table was made dynamic. - a few minor improvements on the bits used from the hash, to eliminate some unfortunate correlations that may possibly have been exploited to design future attack models. These changes were tested under the most extreme conditions, up to 1.1 million connections per second to one and a few targets, showing no performance regression, and only 2 connection failures within 13 billion, which is less than 2^-32 and perfectly within usual values. The series is split into small reviewable changes and was already reviewed by Amit and Eric. ==================== Link: https://lore.kernel.org/r/20220502084614.24123-1-w@1wt.eu Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-05-04tcp: drop the hash_32() part from the index calculationWilly Tarreau1-1/+1
In commit 190cc82489f4 ("tcp: change source port randomizarion at connect() time"), the table_perturb[] array was introduced and an index was taken from the port_offset via hash_32(). But it turns out that hash_32() performs a multiplication, while the input here comes from the output of SipHash in secure_seq, which is already well distributed enough to avoid the need for yet another hash. Suggested-by: Amit Klein <aksecurity@gmail.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Willy Tarreau <w@1wt.eu> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-05-04tcp: increase source port perturb table to 2^16Willy Tarreau1-4/+5
Moshe Kol, Amit Klein, and Yossi Gilad reported being able to accurately identify a client by forcing it to emit only 40 times more connections than there are entries in the table_perturb[] table. The previous two improvements, consisting of resalting the secret every 10s and adding randomness to each port selection, only slightly improved the situation, and the current value of 2^8 was too small as it's not very difficult to make a client emit 10k connections in less than 10 seconds. Thus we're increasing the perturb table from 2^8 to 2^16 so that the same precision now requires 2.6M connections, which is more difficult in this time frame and harder to hide as a background activity. The impact is that the table now uses 256 kB instead of 1 kB, which could mostly affect devices making frequent outgoing connections. However such components usually target a small set of destinations (load balancers, database clients, perf assessment tools), and in practice only a few entries will be visited, like before. A live test at 1 million connections per second showed no performance difference from the previous value. Reported-by: Moshe Kol <moshe.kol@mail.huji.ac.il> Reported-by: Yossi Gilad <yossi.gilad@mail.huji.ac.il> Reported-by: Amit Klein <aksecurity@gmail.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Willy Tarreau <w@1wt.eu> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
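The figures quoted above (10k versus 2.6M connections, 1 kB versus 256 kB) follow from simple arithmetic, sketched here. The function names are hypothetical, the factor of 40 is the one reported in the message, and the 4-byte entry size is an assumption inferred from the quoted table sizes.

```c
/* Connections an attacker needs, at the reported 40 per table entry. */
unsigned long attack_connections(unsigned int table_bits)
{
	return 40UL << table_bits;
}

/* Table footprint in bytes, assuming 4-byte entries (an inference). */
unsigned long table_bytes(unsigned int table_bits)
{
	return 4UL << table_bits;
}
```

attack_connections(8) is 10240 (about 10k) and attack_connections(16) is 2621440 (about 2.6M); table_bytes(8) is 1024 and table_bytes(16) is 262144, matching the 1 kB and 256 kB in the text.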
2022-05-04tcp: dynamically allocate the perturb table used by source portsWilly Tarreau1-2/+10
We'll need to further increase the size of this table and it's likely that at some point its size will not be suitable anymore for a static table. Let's allocate it on boot from inet_hashinfo2_init(), which is called from tcp_init(). Cc: Moshe Kol <moshe.kol@mail.huji.ac.il> Cc: Yossi Gilad <yossi.gilad@mail.huji.ac.il> Cc: Amit Klein <aksecurity@gmail.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Willy Tarreau <w@1wt.eu> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-05-04tcp: add small random increments to the source portWilly Tarreau1-4/+5
Here we're adding a random increment between 0 and 7 to the selected source port in order to add some noise to the source port selection, which makes the next port less predictable. With the default port range of 32768-60999 this means a worst case reuse scenario of 14116/8=1764 connections between two consecutive uses of the same port, with an average of 14116/4.5=3137. This code was stressed at more than 800000 connections per second to a fixed target with all connections closed by the client using RSTs (worst condition) and only 2 connections failed among 13 billion, despite the hash being reseeded every 10 seconds, indicating a perfectly safe situation. Cc: Moshe Kol <moshe.kol@mail.huji.ac.il> Cc: Yossi Gilad <yossi.gilad@mail.huji.ac.il> Cc: Amit Klein <aksecurity@gmail.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Willy Tarreau <w@1wt.eu> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
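The reuse figures above can be reproduced with a short calculation. The function names are hypothetical; the sketch assumes that 14116 is half of the 28232 ports in the default range (consistent with the odd/even port preference mentioned in the series description) and that each connection advances by 1 plus a random 0 to 7, giving a maximum step of 8 and a mean step of 4.5, which matches the divisors in the message.

```c
/* Ports of one parity in the inclusive range [lo, hi]. */
unsigned int same_parity_ports(unsigned int lo, unsigned int hi)
{
	return (hi - lo + 1) / 2;
}

/* Fewest connections before a port can repeat: every step is the maximum, 8. */
unsigned int worst_case_reuse(unsigned int ports)
{
	return ports / 8;
}

/* Expected connections before a repeat: mean step is 1 + 3.5 = 4.5. */
double average_reuse(unsigned int ports)
{
	return ports / 4.5;
}
```

same_parity_ports(32768, 60999) gives 14116, worst_case_reuse(14116) gives 1764, and average_reuse(14116) is roughly 3136.9, which rounds to the 3137 quoted in the text.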