summaryrefslogtreecommitdiffstats
path: root/fs
AgeCommit message (Collapse)AuthorFilesLines
2022-12-13Merge tag 'mm-stable-2022-12-13' of ↵Linus Torvalds17-207/+265
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull MM updates from Andrew Morton: - More userfaultfs work from Peter Xu - Several convert-to-folios series from Sidhartha Kumar and Huang Ying - Some filemap cleanups from Vishal Moola - David Hildenbrand added the ability to selftest anon memory COW handling - Some cpuset simplifications from Liu Shixin - Addition of vmalloc tracing support by Uladzislau Rezki - Some pagecache folioifications and simplifications from Matthew Wilcox - A pagemap cleanup from Kefeng Wang: we have VM_ACCESS_FLAGS, so use it - Miguel Ojeda contributed some cleanups for our use of the __no_sanitize_thread__ gcc keyword. This series should have been in the non-MM tree, my bad - Naoya Horiguchi improved the interaction between memory poisoning and memory section removal for huge pages - DAMON cleanups and tuneups from SeongJae Park - Tony Luck fixed the handling of COW faults against poisoned pages - Peter Xu utilized the PTE marker code for handling swapin errors - Hugh Dickins reworked compound page mapcount handling, simplifying it and making it more efficient - Removal of the autonuma savedwrite infrastructure from Nadav Amit and David Hildenbrand - zram support for multiple compression streams from Sergey Senozhatsky - David Hildenbrand reworked the GUP code's R/O long-term pinning so that drivers no longer need to use the FOLL_FORCE workaround which didn't work very well anyway - Mel Gorman altered the page allocator so that local IRQs can remnain enabled during per-cpu page allocations - Vishal Moola removed the try_to_release_page() wrapper - Stefan Roesch added some per-BDI sysfs tunables which are used to prevent network block devices from dirtying excessive amounts of pagecache - David Hildenbrand did some cleanup and repair work on KSM COW breaking - Nhat Pham and Johannes Weiner have implemented writeback in zswap's zsmalloc backend - Brian Foster has fixed a longstanding corner-case oddity in file[map]_write_and_wait_range() - sparse-vmemmap changes for MIPS, LoongArch and NIOS2 from Feiyang Chen - Shiyang Ruan has done some work on fsdax, to make its reflink mode work better under xfstests. Better, but still not perfect - Christoph Hellwig has removed the .writepage() method from several filesystems. They only need .writepages() - Yosry Ahmed wrote a series which fixes the memcg reclaim target beancounting - David Hildenbrand has fixed some of our MM selftests for 32-bit machines - Many singleton patches, as usual * tag 'mm-stable-2022-12-13' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (313 commits) mm/hugetlb: set head flag before setting compound_order in __prep_compound_gigantic_folio mm: mmu_gather: allow more than one batch of delayed rmaps mm: fix typo in struct pglist_data code comment kmsan: fix memcpy tests mm: add cond_resched() in swapin_walk_pmd_entry() mm: do not show fs mm pc for VM_LOCKONFAULT pages selftests/vm: ksm_functional_tests: fixes for 32bit selftests/vm: cow: fix compile warning on 32bit selftests/vm: madv_populate: fix missing MADV_POPULATE_(READ|WRITE) definitions mm/gup_test: fix PIN_LONGTERM_TEST_READ with highmem mm,thp,rmap: fix races between updates of subpages_mapcount mm: memcg: fix swapcached stat accounting mm: add nodes= arg to memory.reclaim mm: disable top-tier fallback to reclaim on proactive reclaim selftests: cgroup: make sure reclaim target memcg is unprotected selftests: cgroup: refactor proactive reclaim code to reclaim_until() mm: memcg: fix stale protection of reclaim target memcg mm/mmap: properly unaccount memory on mas_preallocate() failure omfs: remove ->writepage jfs: remove ->writepage ...
2022-12-13Merge tag 'efi-next-for-v6.2' of ↵Linus Torvalds2-0/+7
git://git.kernel.org/pub/scm/linux/kernel/git/efi/efi Pull EFI updates from Ard Biesheuvel: "Another fairly sizable pull request, by EFI subsystem standards. Most of the work was done by me, some of it in collaboration with the distro and bootloader folks (GRUB, systemd-boot), where the main focus has been on removing pointless per-arch differences in the way EFI boots a Linux kernel. - Refactor the zboot code so that it incorporates all the EFI stub logic, rather than calling the decompressed kernel as a EFI app. - Add support for initrd= command line option to x86 mixed mode. - Allow initrd= to be used with arbitrary EFI accessible file systems instead of just the one the kernel itself was loaded from. - Move some x86-only handling and manipulation of the EFI memory map into arch/x86, as it is not used anywhere else. - More flexible handling of any random seeds provided by the boot environment (i.e., systemd-boot) so that it becomes available much earlier during the boot. - Allow improved arch-agnostic EFI support in loaders, by setting a uniform baseline of supported features, and adding a generic magic number to the DOS/PE header. This should allow loaders such as GRUB or systemd-boot to reduce the amount of arch-specific handling substantially. - (arm64) Run EFI runtime services from a dedicated stack, and use it to recover from synchronous exceptions that might occur in the firmware code. - (arm64) Ensure that we don't allocate memory outside of the 48-bit addressable physical range. - Make EFI pstore record size configurable - Add support for decoding CXL specific CPER records" * tag 'efi-next-for-v6.2' of git://git.kernel.org/pub/scm/linux/kernel/git/efi/efi: (43 commits) arm64: efi: Recover from synchronous exceptions occurring in firmware arm64: efi: Execute runtime services from a dedicated stack arm64: efi: Limit allocations to 48-bit addressable physical region efi: Put Linux specific magic number in the DOS header efi: libstub: Always enable initrd command line loader and bump version efi: stub: use random seed from EFI variable efi: vars: prohibit reading random seed variables efi: random: combine bootloader provided RNG seed with RNG protocol output efi/cper, cxl: Decode CXL Error Log efi/cper, cxl: Decode CXL Protocol Error Section efi: libstub: fix efi_load_initrd_dev_path() kernel-doc comment efi: x86: Move EFI runtime map sysfs code to arch/x86 efi: runtime-maps: Clarify purpose and enable by default for kexec efi: pstore: Add module parameter for setting the record size efi: xen: Set EFI_PARAVIRT for Xen dom0 boot on all architectures efi: memmap: Move manipulation routines into x86 arch tree efi: memmap: Move EFI fake memmap support into x86 arch tree efi: libstub: Undeprecate the command line initrd loader efi: libstub: Add mixed mode support to command line initrd loader efi: libstub: Permit mixed mode return types other than efi_status_t ...
2022-12-13Merge tag 'for-6.2/block-2022-12-08' of git://git.kernel.dk/linuxLinus Torvalds1-9/+5
Pull block updates from Jens Axboe: - NVMe pull requests via Christoph: - Support some passthrough commands without CAP_SYS_ADMIN (Kanchan Joshi) - Refactor PCIe probing and reset (Christoph Hellwig) - Various fabrics authentication fixes and improvements (Sagi Grimberg) - Avoid fallback to sequential scan due to transient issues (Uday Shankar) - Implement support for the DEAC bit in Write Zeroes (Christoph Hellwig) - Allow overriding the IEEE OUI and firmware revision in configfs for nvmet (Aleksandr Miloserdov) - Force reconnect when number of queue changes in nvmet (Daniel Wagner) - Minor fixes and improvements (Uros Bizjak, Joel Granados, Sagi Grimberg, Christoph Hellwig, Christophe JAILLET) - Fix and cleanup nvme-fc req allocation (Chaitanya Kulkarni) - Use the common tagset helpers in nvme-pci driver (Christoph Hellwig) - Cleanup the nvme-pci removal path (Christoph Hellwig) - Use kstrtobool() instead of strtobool (Christophe JAILLET) - Allow unprivileged passthrough of Identify Controller (Joel Granados) - Support io stats on the mpath device (Sagi Grimberg) - Minor nvmet cleanup (Sagi Grimberg) - MD pull requests via Song: - Code cleanups (Christoph) - Various fixes - Floppy pull request from Denis: - Fix a memory leak in the init error path (Yuan) - Series fixing some batch wakeup issues with sbitmap (Gabriel) - Removal of the pktcdvd driver that was deprecated more than 5 years ago, and subsequent removal of the devnode callback in struct block_device_operations as no users are now left (Greg) - Fix for partition read on an exclusively opened bdev (Jan) - Series of elevator API cleanups (Jinlong, Christoph) - Series of fixes and cleanups for blk-iocost (Kemeng) - Series of fixes and cleanups for blk-throttle (Kemeng) - Series adding concurrent support for sync queues in BFQ (Yu) - Series bringing drbd a bit closer to the out-of-tree maintained version (Christian, Joel, Lars, Philipp) - Misc drbd fixes (Wang) - blk-wbt fixes and tweaks for enable/disable (Yu) - Fixes for mq-deadline for zoned devices (Damien) - Add support for read-only and offline zones for null_blk (Shin'ichiro) - Series fixing the delayed holder tracking, as used by DM (Yu, Christoph) - Series enabling bio alloc caching for IRQ based IO (Pavel) - Series enabling userspace peer-to-peer DMA (Logan) - BFQ waker fixes (Khazhismel) - Series fixing elevator refcount issues (Christoph, Jinlong) - Series cleaning up references around queue destruction (Christoph) - Series doing quiesce by tagset, enabling cleanups in drivers (Christoph, Chao) - Series untangling the queue kobject and queue references (Christoph) - Misc fixes and cleanups (Bart, David, Dawei, Jinlong, Kemeng, Ye, Yang, Waiman, Shin'ichiro, Randy, Pankaj, Christoph) * tag 'for-6.2/block-2022-12-08' of git://git.kernel.dk/linux: (247 commits) blktrace: Fix output non-blktrace event when blk_classic option enabled block: sed-opal: Don't include <linux/kernel.h> sed-opal: allow using IOC_OPAL_SAVE for locking too blk-cgroup: Fix typo in comment block: remove bio_set_op_attrs nvmet: don't open-code NVME_NS_ATTR_RO enumeration nvme-pci: use the tagset alloc/free helpers nvme: add the Apple shared tag workaround to nvme_alloc_io_tag_set nvme: only set reserved_tags in nvme_alloc_io_tag_set for fabrics controllers nvme: consolidate setting the tagset flags nvme: pass nr_maps explicitly to nvme_alloc_io_tag_set block: bio_copy_data_iter nvme-pci: split out a nvme_pci_ctrl_is_dead helper nvme-pci: return early on ctrl state mismatch in nvme_reset_work nvme-pci: rename nvme_disable_io_queues nvme-pci: cleanup nvme_suspend_queue nvme-pci: remove nvme_pci_disable nvme-pci: remove nvme_disable_admin_queue nvme: merge nvme_shutdown_ctrl into nvme_disable_ctrl nvme: use nvme_wait_ready in nvme_shutdown_ctrl ...
2022-12-13Merge tag 'for-6.2/io_uring-2022-12-08' of git://git.kernel.dk/linuxLinus Torvalds2-24/+31
Pull io_uring updates from Jens Axboe: - Always ensure proper ordering in case of CQ ring overflow, which then means we can remove some work-arounds for that (Dylan) - Support completion batching for multishot, greatly increasing the efficiency for those (Dylan) - Flag epoll/eventfd wakeups done from io_uring, so that we can easily tell if we're recursing into io_uring again. Previously, this would have resulted in repeated multishot notifications if we had a dependency there. That could happen if an eventfd was registered as the ring eventfd, and we multishot polled for events on it. Or if an io_uring fd was added to epoll, and io_uring had a multishot request for the epoll fd. Test cases here: https://git.kernel.dk/cgit/liburing/commit/?id=919755a7d0096fda08fb6d65ac54ad8d0fe027cd Previously these got terminated when the CQ ring eventually overflowed, now it's handled gracefully (me). - Tightening of the IOPOLL based completions (Pavel) - Optimizations of the networking zero-copy paths (Pavel) - Various tweaks and fixes (Dylan, Pavel) * tag 'for-6.2/io_uring-2022-12-08' of git://git.kernel.dk/linux: (41 commits) io_uring: keep unlock_post inlined in hot path io_uring: don't use complete_post in kbuf io_uring: spelling fix io_uring: remove io_req_complete_post_tw io_uring: allow multishot polled reqs to defer completion io_uring: remove overflow param from io_post_aux_cqe io_uring: add lockdep assertion in io_fill_cqe_aux io_uring: make io_fill_cqe_aux static io_uring: add io_aux_cqe which allows deferred completion io_uring: allow defer completion for aux posted cqes io_uring: defer all io_req_complete_failed io_uring: always lock in io_apoll_task_func io_uring: remove iopoll spinlock io_uring: iopoll protect complete_post io_uring: inline __io_req_complete_put() io_uring: remove io_req_tw_post_queue io_uring: use io_req_task_complete() in timeout io_uring: hold locks for io_req_complete_failed io_uring: add completion locking for iopoll io_uring: kill io_cqring_ev_posted() and __io_cq_unlock_post() ...
2022-12-13Merge tag 'iomap-6.2-merge-1' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linuxLinus Torvalds1-2/+1
Pull iomap update from Darrick Wong: - Minor code cleanup to eliminate unnecessary bit shifting * tag 'iomap-6.2-merge-1' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux: iomap: directly use logical block size
2022-12-13Merge tag 'vfs-6.2-merge-1' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linuxLinus Torvalds1-5/+2
Pull vfs remap_range update from Darrick Wong: - Make some minor adjustments to the remap range preparation function to skip file updates when the request length is adjusted downwards to zero. * tag 'vfs-6.2-merge-1' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux: fs/remap_range: avoid spurious writeback on zero length request
2022-12-13Merge tag 'fs.xattr.simple.rework.rbtree.rwlock.v6.2' of ↵Linus Torvalds1-67/+250
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/idmapping Pull simple-xattr updates from Christian Brauner: "This ports the simple xattr infrastucture to rely on a simple rbtree protected by a read-write lock instead of a linked list protected by a spinlock. A while ago we received reports about scaling issues for filesystems using the simple xattr infrastructure that also support setting a larger number of xattrs. Specifically, cgroups and tmpfs. Both cgroupfs and tmpfs can be mounted by unprivileged users in unprivileged containers and root in an unprivileged container can set an unrestricted number of security.* xattrs and privileged users can also set unlimited trusted.* xattrs. A few more words on further that below. Other xattrs such as user.* are restricted for kernfs-based instances to a fairly limited number. As there are apparently users that have a fairly large number of xattrs we should scale a bit better. Using a simple linked list protected by a spinlock used for set, get, and list operations doesn't scale well if users use a lot of xattrs even if it's not a crazy number. Let's switch to a simple rbtree protected by a rwlock. It scales way better and gets rid of the perf issues some people reported. We originally had fancier solutions even using an rcu+seqlock protected rbtree but we had concerns about being to clever and also that deletion from an rbtree with rcu+seqlock isn't entirely safe. The rbtree plus rwlock is perfectly fine. By far the most common operation is getting an xattr. While setting an xattr is not and should be comparatively rare. And listxattr() often only happens when copying xattrs between files or together with the contents to a new file. Holding a lock across listxattr() is unproblematic because it doesn't list the values of xattrs. It can only be used to list the names of all xattrs set on a file. And the number of xattr names that can be listed with listxattr() is limited to XATTR_LIST_MAX aka 65536 bytes. If a larger buffer is passed then vfs_listxattr() caps it to XATTR_LIST_MAX and if more xattr names are found it will return -E2BIG. In short, the maximum amount of memory that can be retrieved via listxattr() is limited and thus listxattr() bounded. Of course, the API is broken as documented on xattr(7) already. While I have no idea how the xattr api ended up in this state we should probably try to come up with something here at some point. An iterator pattern similar to readdir() as an alternative to listxattr() or something else. Right now it is extremly strange that users can set millions of xattrs but then can't use listxattr() to know which xattrs are actually set. And it's really trivial to do: for i in {1..1000000}; do setfattr -n security.$i -v $i ./file1; done And around 5000 xattrs it's impossible to use listxattr() to figure out which xattrs are actually set. So I have suggested that we try to limit the number of xattrs for simple xattrs at least. But that's a future patch and I don't consider it very urgent. A bonus of this port to rbtree+rwlock is that we shrink the memory consumption for users of the simple xattr infrastructure. This also adds kernel documentation to all the functions" * tag 'fs.xattr.simple.rework.rbtree.rwlock.v6.2' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/idmapping: xattr: use rbtree for simple_xattrs
2022-12-13Merge tag 'lsm-pr-20221212' of ↵Linus Torvalds3-3/+8
git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/lsm Pull lsm updates from Paul Moore: - Improve the error handling in the device cgroup such that memory allocation failures when updating the access policy do not potentially alter the policy. - Some minor fixes to reiserfs to ensure that it properly releases LSM-related xattr values. - Update the security_socket_getpeersec_stream() LSM hook to take sockptr_t values. Previously the net/BPF folks updated the getsockopt code in the network stack to leverage the sockptr_t type to make it easier to pass both kernel and __user pointers, but unfortunately when they did so they didn't convert the LSM hook. While there was/is no immediate risk by not converting the LSM hook, it seems like this is a mistake waiting to happen so this patch proactively does the LSM hook conversion. - Convert vfs_getxattr_alloc() to return an int instead of a ssize_t and cleanup the callers. Internally the function was never going to return anything larger than an int and the callers were doing some very odd things casting the return value; this patch fixes all that and helps bring a bit of sanity to vfs_getxattr_alloc() and its callers. - More verbose, and helpful, LSM debug output when the system is booted with "lsm.debug" on the command line. There are examples in the commit description, but the quick summary is that this patch provides better information about which LSMs are enabled and the ordering in which they are processed. - General comment and kernel-doc fixes and cleanups. * tag 'lsm-pr-20221212' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/lsm: lsm: Fix description of fs_context_parse_param lsm: Add/fix return values in lsm_hooks.h and fix formatting lsm: Clarify documentation of vm_enough_memory hook reiserfs: Add missing calls to reiserfs_security_free() lsm,fs: fix vfs_getxattr_alloc() return type and caller error paths device_cgroup: Roll back to original exceptions after copy failure LSM: Better reporting of actual LSMs at boot lsm: make security_socket_getpeersec_stream() sockptr_t safe audit: Fix some kernel-doc warnings lsm: remove obsoleted comments for security hooks fs: edit a comment made in bad taste
2022-12-13Merge tag 'landlock-6.2-rc1' of ↵Linus Torvalds2-2/+2
git://git.kernel.org/pub/scm/linux/kernel/git/mic/linux Pull landlock updates from Mickaël Salaün: "This adds file truncation support to Landlock, contributed by Günther Noack. As described by Günther [1], the goal of these patches is to work towards a more complete coverage of file system operations that are restrictable with Landlock. The known set of currently unsupported file system operations in Landlock is described at [2]. Out of the operations listed there, truncate is the only one that modifies file contents, so these patches should make it possible to prevent the direct modification of file contents with Landlock. The new LANDLOCK_ACCESS_FS_TRUNCATE access right covers both the truncate(2) and ftruncate(2) families of syscalls, as well as open(2) with the O_TRUNC flag. This includes usages of creat() in the case where existing regular files are overwritten. Additionally, this introduces a new Landlock security blob associated with opened files, to track the available Landlock access rights at the time of opening the file. This is in line with Unix's general approach of checking the read and write permissions during open(), and associating this previously checked authorization with the opened file. An ongoing patch documents this use case [3]. In order to treat truncate(2) and ftruncate(2) calls differently in an LSM hook, we split apart the existing security_path_truncate hook into security_path_truncate (for truncation by path) and security_file_truncate (for truncation of previously opened files)" Link: https://lore.kernel.org/r/20221018182216.301684-1-gnoack3000@gmail.com [1] Link: https://www.kernel.org/doc/html/v6.1/userspace-api/landlock.html#filesystem-flags [2] Link: https://lore.kernel.org/r/20221209193813.972012-1-mic@digikod.net [3] * tag 'landlock-6.2-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/mic/linux: samples/landlock: Document best-effort approach for LANDLOCK_ACCESS_FS_REFER landlock: Document Landlock's file truncation support samples/landlock: Extend sample tool to support LANDLOCK_ACCESS_FS_TRUNCATE selftests/landlock: Test ftruncate on FDs created by memfd_create(2) selftests/landlock: Test FD passing from restricted to unrestricted processes selftests/landlock: Locally define __maybe_unused selftests/landlock: Test open() and ftruncate() in multiple scenarios selftests/landlock: Test file truncation support landlock: Support file truncation landlock: Document init_layer_masks() helper landlock: Refactor check_access_path_dual() into is_access_to_paths_allowed() security: Create file_truncate hook from path_truncate hook
2022-12-13Merge tag 'configfs-6.2-2022-12-13' of ↵Linus Torvalds1-0/+2
git://git.infradead.org/users/hch/configfs Pull configfs updates from Christoph Hellwig: - fix a memory leak in configfs_create_dir (Chen Zhongjin) - remove mentions of committable items that were implemented (Bartosz Golaszewski) * tag 'configfs-6.2-2022-12-13' of git://git.infradead.org/users/hch/configfs: configfs: remove mentions of committable items configfs: fix possible memory leak in configfs_create_dir()
2022-12-13Merge tag 'nfs-for-6.2-1' of git://git.linux-nfs.org/projects/trondmy/linux-nfsLinus Torvalds15-67/+105
Pull NFS client updates from Trond Myklebust "Bugfixes: - Fix NULL pointer dereference in the mount parser - Fix memory stomp in decode_attr_security_label - Fix credential leak in _nfs4_discover_trunking() - Fix buffer leak in rpcrdma_req_create() - Fix leaked socket in rpc_sockname() - Fix deadlock between nfs4_open_recover_helper() and delegreturn - Fix an Oops in nfs_d_automount() - Fix potential race in nfs_call_unlink() - Multiple fixes for the open context mode - NFSv4.2 READ_PLUS fixes - Fix a regression in which small rsize/wsize values are being forbidden - Fail client initialisation if the NFSv4.x state manager thread can't run - Avoid spurious warning of lost lock that is being unlocked. - Ensure the initialisation of struct nfs4_label Features and cleanups: - Trigger the "ls -l" readdir heuristic sooner - Clear the file access cache upon login to ensure supplementary group info is in sync between the client and server - pnfs: Fix up the logging of layout stateids - NFSv4.2: Change the default KConfig value for READ_PLUS - Use sysfs_emit() instead of scnprintf() where appropriate" * tag 'nfs-for-6.2-1' of git://git.linux-nfs.org/projects/trondmy/linux-nfs: (24 commits) NFSv4.2: Change the default KConfig value for READ_PLUS NFSv4.x: Fail client initialisation if state manager thread can't run fs: nfs: sysfs: use sysfs_emit() to instead of scnprintf() NFS: use sysfs_emit() to instead of scnprintf() NFS: Allow very small rsize & wsize again NFSv4.2: Fix up READ_PLUS alignment NFSv4.2: Set the correct size scratch buffer for decoding READ_PLUS SUNRPC: Fix missing release socket in rpc_sockname() xprtrdma: Fix regbuf data not freed in rpcrdma_req_create() NFS: avoid spurious warning of lost lock that is being unlocked. nfs: fix possible null-ptr-deref when parsing param NFSv4: check FMODE_EXEC from open context mode in nfs4_opendata_access() NFS: make sure open context mode have FMODE_EXEC when file open for exec NFS4.x/pnfs: Fix up logging of layout stateids NFS: Fix a race in nfs_call_unlink() NFS: Fix an Oops in nfs_d_automount() NFSv4: Fix a deadlock between nfs4_open_recover_helper() and delegreturn NFSv4: Fix a credential leak in _nfs4_discover_trunking() NFS: Trigger the "ls -l" readdir heuristic sooner NFSv4.2: Fix initialisation of struct nfs4_label ...
2022-12-12Merge tag 'nfsd-6.2' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linuxLinus Torvalds34-754/+1205
Pull nfsd updates from Chuck Lever: "This release introduces support for the CB_RECALL_ANY operation. NFSD can send this operation to request that clients return any delegations they choose. The server uses this operation to handle low memory scenarios or indicate to a client when that client has reached the maximum number of delegations the server supports. The NFSv4.2 READ_PLUS operation has been simplified temporarily whilst support for sparse files in local filesystems and the VFS is improved. Two major data structure fixes appear in this release: - The nfs4_file hash table is replaced with a resizable hash table to reduce the latency of NFSv4 OPEN operations. - Reference counting in the NFSD filecache has been hardened against races. In furtherance of removing support for NFSv2 in a subsequent kernel release, a new Kconfig option enables server-side support for NFSv2 to be left out of a kernel build. MAINTAINERS has been updated to indicate that changes to fs/exportfs should go through the NFSD tree" * tag 'nfsd-6.2' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux: (49 commits) NFSD: Avoid clashing function prototypes SUNRPC: Fix crasher in unwrap_integ_data() SUNRPC: Make the svc_authenticate tracepoint conditional NFSD: Use only RQ_DROPME to signal the need to drop a reply SUNRPC: Clean up xdr_write_pages() SUNRPC: Don't leak netobj memory when gss_read_proxy_verf() fails NFSD: add CB_RECALL_ANY tracepoints NFSD: add delegation reaper to react to low memory condition NFSD: add support for sending CB_RECALL_ANY NFSD: refactoring courtesy_client_reaper to a generic low memory shrinker trace: Relocate event helper files NFSD: pass range end to vfs_fsync_range() instead of count lockd: fix file selection in nlmsvc_cancel_blocked lockd: ensure we use the correct file descriptor when unlocking lockd: set missing fl_flags field when retrieving args NFSD: Use struct_size() helper in alloc_session() nfsd: return error if nfs4_setacl fails lockd: set other missing fields when unlocking files NFSD: Add an nfsd_file_fsync tracepoint sunrpc: svc: Remove an unused static function svc_ungetu32() ...
2022-12-12Merge tag 'for-6.2-tag' of ↵Linus Torvalds118-9422/+10924
git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux Pull btrfs updates from David Sterba: "This round there are a lot of cleanups and moved code so the diffstat looks huge, otherwise there are some nice performance improvements and an update to raid56 reliability. User visible features: - raid56 reliability vs performance trade off: - fix destructive RMW for raid5 data (raid6 still needs work): do full checksum verification for all data during RMW cycle, this should prevent rewriting potentially corrupted data without notice - stripes are cached in memory which should reduce the performance impact but still can hurt some workloads - checksums are verified after repair again - this is the last option without introducing additional features (write intent bitmap, journal, another tree), the extra checksum read/verification was supposed to be avoided by the original implementation exactly for performance reasons but that caused all the reliability problems - discard=async by default for devices that support it - implement emergency flush reserve to avoid almost all unnecessary transaction aborts due to ENOSPC in cases where there are too many delayed refs or delayed allocation - skip block group synchronization if there's no change in used bytes, can reduce transaction commit count for some workloads Performance improvements: - fiemap and lseek: - overall speedup due to skipping unnecessary or duplicate searches (-40% run time) - cache some data structures and sharedness of extents (-30% run time) - send: - faster backref resolution when finding clones - cached leaf to root mapping for faster backref walking - improved clone/sharing detection - overall run time improvements (-70%) Core: - module initialization converted to a table of function pointers run in a sequence - preparation for fscrypt, extend passing file names across calls, dir item can store encryption status - raid56 updates: - more accurate error tracking of sectors within stripe - simplify recovery path and remove dedicated endio worker kthread - simplify scrub call paths - refactoring to support the extra data checksum verification during RMW cycle - tree block parentness checks consolidated and done at metadata read time - improved error handling - cleanups: - move a lot of code for better synchronization between kernel and user space sources, split big files - enum cleanups - GFP flag cleanups - header file cleanups, prototypes, dependencies - redundant parameter cleanups - inline extent handling simplifications - inode parameter conversion - data structure cleanups, reductions, renames, merges" * tag 'for-6.2-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: (249 commits) btrfs: print transaction aborted messages with an error level btrfs: sync some cleanups from progs into uapi/btrfs.h btrfs: do not BUG_ON() on ENOMEM when dropping extent items for a range btrfs: fix extent map use-after-free when handling missing device in read_one_chunk btrfs: remove outdated logic from overwrite_item() and add assertion btrfs: unify overwrite_item() and do_overwrite_item() btrfs: replace strncpy() with strscpy() btrfs: fix uninitialized variable in find_first_clear_extent_bit btrfs: fix uninitialized parent in insert_state btrfs: add might_sleep() annotations btrfs: add stack helpers for a few btrfs items btrfs: add nr_global_roots to the super block definition btrfs: remove BTRFS_LEAF_DATA_OFFSET btrfs: add helpers for manipulating leaf items and data btrfs: add eb to btrfs_node_key_ptr_offset btrfs: pass the extent buffer for the btrfs_item_nr helpers btrfs: move the csum helpers into ctree.h btrfs: move eb offset helpers into extent_io.h btrfs: move file_extent_item helpers into file-item.h btrfs: move leaf_data_end into ctree.c ...
2022-12-12Merge tag 'dlm-6.2' of ↵Linus Torvalds19-1243/+1152
git://git.kernel.org/pub/scm/linux/kernel/git/teigland/linux-dlm Pull dlm updates from David Teigland: "These patches include the usual cleanups and minor fixes, the removal of code that is no longer needed due to recent improvements, and improvements to processing large volumes of messages during heavy locking activity. Summary: - Misc code cleanup - Fix a couple of socket handling bugs: a double release on an error path and a data-ready race in an accept loop - Remove code for resending dir-remove messages. This code is no longer needed since the midcomms layer now ensures the messages are resent if needed - Add tracepoints for dlm messages - Improve callback queueing by replacing the fixed array with a list - Simplify the handling of a remove message followed by a lookup message by sending both without releasing a spinlock in between - Improve the concurrency of sending and receiving messages by holding locks for a shorter time, and changing how workqueues are used - Remove old code for shutting down sockets, which is no longer needed with the reliable connection handling that was recently added" * tag 'dlm-6.2' of git://git.kernel.org/pub/scm/linux/kernel/git/teigland/linux-dlm: (37 commits) fs: dlm: fix building without lockdep fs: dlm: parallelize lowcomms socket handling fs: dlm: don't init error value fs: dlm: use saved sk_error_report() fs: dlm: use sock2con without checking null fs: dlm: remove dlm_node_addrs lookup list fs: dlm: don't put dlm_local_addrs on heap fs: dlm: cleanup listen sock handling fs: dlm: remove socket shutdown handling fs: dlm: use listen sock as dlm running indicator fs: dlm: use list_first_entry_or_null fs: dlm: remove twice INIT_WORK fs: dlm: add midcomms init/start functions fs: dlm: add dst nodeid for msg tracing fs: dlm: rename seq to h_seq for msg tracing fs: dlm: rename DLM_IFL_NEED_SCHED to DLM_IFL_CB_PENDING fs: dlm: ast do WARN_ON_ONCE() on hotpath fs: dlm: drop lkb ref in bug case fs: dlm: avoid false-positive checker warning fs: dlm: use WARN_ON_ONCE() instead of WARN_ON() ...
2022-12-12Merge tag 'jfs-6.2' of https://github.com/kleikamp/linux-shaggyLinus Torvalds9-22/+31
Pull jfs updates from David Kleikamp: "Assorted JFS fixes for 6.2" * tag 'jfs-6.2' of https://github.com/kleikamp/linux-shaggy: jfs: makes diUnmount/diMount in jfs_mount_rw atomic jfs: Fix a typo in function jfs_umount fs: jfs: fix shift-out-of-bounds in dbDiscardAG jfs: Fix fortify moan in symlink jfs: remove redundant assignments to ipaimap and ipaimap2 jfs: remove unused declarations for jfs fs/jfs/jfs_xattr.h: Fix spelling typo in comment MAINTAINERS: git://github -> https://github.com for kleikamp fs/jfs: replace ternary operator with min_t() fs: jfs: fix shift-out-of-bounds in dbAllocAG
2022-12-12Merge tag 'fixes_for_v6.2-rc1' of ↵Linus Torvalds9-119/+91
git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs Pull udf and ext2 fixes from Jan Kara: - a couple of smaller cleanups and fixes for ext2 - fixes of a data corruption issues in udf when handling holes and preallocation extents - fixes and cleanups of several smaller issues in udf - add maintainer entry for isofs * tag 'fixes_for_v6.2-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs: udf: Fix extending file within last block udf: Discard preallocation before extending file with a hole udf: Do not bother looking for prealloc extents if i_lenExtents matches i_size udf: Fix preallocation discarding at indirect extent boundary udf: Increase UDF_MAX_READ_VERSION to 0x0260 fs/ext2: Fix code indentation ext2: unbugger ext2_empty_dir() udf: remove ->writepage ext2: remove ->writepage ext2: Don't flush page immediately for DIRSYNC directories ext2: Fix some kernel-doc warnings maintainers: Add ISOFS entry udf: Avoid double brelse() in udf_rename() fs: udf: Optimize udf_free_in_core_inode and udf_find_fileset function
2022-12-12Merge tag 'fs.xattr.simple.noaudit.v6.2' of ↵Linus Torvalds1-1/+1
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/idmapping Pull xattr audit fix from Seth Forshee: "This is a single patch to remove auditing of the capability check in simple_xattr_list(). This check is done to check whether trusted xattrs should be included by listxattr(2). SELinux will normally log a denial when capable() is called and the task's SELinux context doesn't have the corresponding capability permission allowed, which can end up spamming the log. Since a failed check here cannot be used to infer malicious intent, auditing is of no real value, and it makes sense to stop auditing the capability check" * tag 'fs.xattr.simple.noaudit.v6.2' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/idmapping: fs: don't audit the capability check in simple_xattr_list()
2022-12-12Merge tag 'fs.idmapped.squashfs.v6.2' of ↵Linus Torvalds1-1/+1
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/idmapping Pull squashfs update from Seth Forshee: "This is a simple patch to enable idmapped mounts for squashfs. All functionality squashfs needs to support idmapped mounts is already implemented in generic VFS code, so all that is needed is to set FS_ALLOW_IDMAP in fs_flags" * tag 'fs.idmapped.squashfs.v6.2' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/idmapping: squashfs: enable idmapped mounts
2022-12-12Merge tag 'fuse-update-6.2' of ↵Linus Torvalds6-33/+73
git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse Pull fuse update from Miklos Szeredi: - Allow some write requests to proceed in parallel - Fix a performance problem with allow_sys_admin_access - Add a special kind of invalidation that doesn't immediately purge submounts - On revalidation treat the target of rename(RENAME_NOREPLACE) the same as open(O_EXCL) - Use type safe helpers for some mnt_userns transformations * tag 'fuse-update-6.2' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse: fuse: Rearrange fuse_allow_current_process checks fuse: allow non-extending parallel direct writes on the same file fuse: remove the unneeded result variable fuse: port to vfs{g,u}id_t and associated helpers fuse: Remove user_ns check for FUSE_DEV_IOC_CLONE fuse: always revalidate rename target dentry fuse: add "expire only" mode to FUSE_NOTIFY_INVAL_ENTRY fs/fuse: Replace kmap() with kmap_local_page()
2022-12-12Merge tag 'ovl-update-6.2' of ↵Linus Torvalds9-67/+86
git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs Pull overlayfs update from Miklos Szeredi: - Fix a couple of bugs found by syzbot - Don't ingore some open flags set by fcntl(F_SETFL) - Fix failure to create a hard link in certain cases - Use type safe helpers for some mnt_userns transformations - Improve performance of mount - Misc cleanups * tag 'ovl-update-6.2' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs: ovl: Kconfig: Fix spelling mistake "undelying" -> "underlying" ovl: use inode instead of dentry where possible ovl: Add comment on upperredirect reassignment ovl: use plain list filler in indexdir and workdir cleanup ovl: do not reconnect upper index records in ovl_indexdir_cleanup() ovl: fix comment typos ovl: port to vfs{g,u}id_t and associated helpers ovl: Use ovl mounter's fsuid and fsgid in ovl_link() ovl: Use "buf" flexible array for memcpy() destination ovl: update ->f_iocb_flags when ovl_change_flags() modifies ->f_flags ovl: fix use inode directly in rcu-walk mode
2022-12-12Merge tag 'erofs-for-6.2-rc1' of ↵Linus Torvalds9-318/+297
git://git.kernel.org/pub/scm/linux/kernel/git/xiang/erofs Pull erofs updates from Gao Xiang: "In this cycle, large folios are now enabled in the iomap/fscache mode for uncompressed files first. In order to do that, we've also cleaned up better interfaces between erofs and fscache, which are acked by fscache/netfs folks and included in this pull request. Other than that, there are random fixes around erofs over fscache and crafted images by syzbot, minor cleanups and documentation updates. Summary: - Enable large folios for iomap/fscache mode - Avoid sysfs warning due to mounting twice with the same fsid and domain_id in fscache mode - Refine fscache interface among erofs, fscache, and cachefiles - Use kmap_local_page() only for metabuf - Fixes around crafted images found by syzbot - Minor cleanups and documentation updates" * tag 'erofs-for-6.2-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/xiang/erofs: erofs: validate the extent length for uncompressed pclusters erofs: fix missing unmap if z_erofs_get_extent_compressedlen() fails erofs: Fix pcluster memleak when its block address is zero erofs: use kmap_local_page() only for erofs_bread() erofs: enable large folios for fscache mode erofs: support large folios for fscache mode erofs: switch to prepare_ondemand_read() in fscache mode fscache,cachefiles: add prepare_ondemand_read() callback erofs: clean up cached I/O strategies erofs: update documentation erofs: check the uniqueness of fsid in shared domain in advance erofs: enable large folios for iomap mode
2022-12-12Merge tag 'fsverity-for-linus' of ↵Linus Torvalds7-82/+85
git://git.kernel.org/pub/scm/fs/fscrypt/fscrypt Pull fsverity updates from Eric Biggers: "The main change this cycle is to stop using the PG_error flag to track verity failures, and instead just track failures at the bio level. This follows a similar fscrypt change that went into 6.1, and it is a step towards freeing up PG_error for other uses. There's also one other small cleanup" * tag 'fsverity-for-linus' of git://git.kernel.org/pub/scm/fs/fscrypt/fscrypt: fsverity: simplify fsverity_get_digest() fsverity: stop using PG_error to track error status
2022-12-12Merge tag 'fscrypt-for-linus' of git://git.kernel.org/pub/scm/fs/fscrypt/fscryptLinus Torvalds4-18/+38
Pull fscrypt updates from Eric Biggers: "This release adds SM4 encryption support, contributed by Tianjia Zhang. SM4 is a Chinese block cipher that is an alternative to AES. I recommend against using SM4, but (according to Tianjia) some people are being required to use it. Since SM4 has been turning up in many other places (crypto API, wireless, TLS, OpenSSL, ARMv8 CPUs, etc.), it hasn't been very controversial, and some people have to use it, I don't think it would be fair for me to reject this optional feature. Besides the above, there are a couple cleanups" * tag 'fscrypt-for-linus' of git://git.kernel.org/pub/scm/fs/fscrypt/fscrypt: fscrypt: add additional documentation for SM4 support fscrypt: remove unused Speck definitions fscrypt: Add SM4 XTS/CTS symmetric algorithm support blk-crypto: Add support for SM4-XTS blk crypto mode fscrypt: add comment for fscrypt_valid_enc_modes_v1() fscrypt: pass super_block to fscrypt_put_master_key_activeref()
2022-12-12Merge tag 'ext4_for_linus' of ↵Linus Torvalds25-329/+487
git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4 Pull ext4 updates from Ted Ts'o: "A large number of cleanups and bug fixes, with many of the bug fixes found by Syzbot and fuzzing. (Many of the bug fixes involve less-used ext4 features such as fast_commit, inline_data and bigalloc) In addition, remove the writepage function for ext4, since the medium-term plan is to remove ->writepage() entirely. (The VM doesn't need or want writepage() for writeback, since it is fine with ->writepages() so long as ->migrate_folio() is implemented)" * tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (58 commits) ext4: fix reserved cluster accounting in __es_remove_extent() ext4: fix inode leak in ext4_xattr_inode_create() on an error path ext4: allocate extended attribute value in vmalloc area ext4: avoid unaccounted block allocation when expanding inode ext4: initialize quota before expanding inode in setproject ioctl ext4: stop providing .writepage hook mm: export buffer_migrate_folio_norefs() ext4: switch to using write_cache_pages() for data=journal writeout jbd2: switch jbd2_submit_inode_data() to use fs-provided hook for data writeout ext4: switch to using ext4_do_writepages() for ordered data writeout ext4: move percpu_rwsem protection into ext4_writepages() ext4: provide ext4_do_writepages() ext4: add support for writepages calls that cannot map blocks ext4: drop pointless IO submission from ext4_bio_write_page() ext4: remove nr_submitted from ext4_bio_write_page() ext4: move keep_towrite handling to ext4_bio_write_page() ext4: handle redirtying in ext4_bio_write_page() ext4: fix kernel BUG in 'ext4_write_inline_data_end()' ext4: make ext4_mb_initialize_context return void ext4: fix deadlock due to mbcache entry corruption ...
2022-12-12Merge tag 'fs.idmapped.mnt_idmap.v6.2' of ↵Linus Torvalds4-66/+177
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/idmapping Pull idmapping updates from Christian Brauner: "Last cycle we've already made the interaction with idmapped mounts more robust and type safe by introducing the vfs{g,u}id_t type. This cycle we concluded the conversion and removed the legacy helpers. Currently we still pass around the plain namespace that was attached to a mount. This is in general pretty convenient but it makes it easy to conflate namespaces that are relevant on the filesystem - with namespaces that are relevent on the mount level. Especially for filesystem developers without detailed knowledge in this area this can be a potential source for bugs. Instead of passing the plain namespace we introduce a dedicated type struct mnt_idmap and replace the pointer with a pointer to a struct mnt_idmap. There are no semantic or size changes for the mount struct caused by this. We then start converting all places aware of idmapped mounts to rely on struct mnt_idmap. Once the conversion is done all helpers down to the really low-level make_vfs{g,u}id() and from_vfs{g,u}id() will take a struct mnt_idmap argument instead of two namespace arguments. This way it becomes impossible to conflate the two removing and thus eliminating the possibility of any bugs. Fwiw, I fixed some issues in that area a while ago in ntfs3 and ksmbd in the past. Afterwards only low-level code can ultimately use the associated namespace for any permission checks. Even most of the vfs can be completely obivious about this ultimately and filesystems will never interact with it in any form in the future. A struct mnt_idmap currently encompasses a simple refcount and pointer to the relevant namespace the mount is idmapped to. If a mount isn't idmapped then it will point to a static nop_mnt_idmap and if it doesn't that it is idmapped. As usual there are no allocations or anything happening for non-idmapped mounts. Everthing is carefully written to be a nop for non-idmapped mounts as has always been the case. If an idmapped mount is created a struct mnt_idmap is allocated and a reference taken on the relevant namespace. Each mount that gets idmapped or inherits the idmap simply bumps the reference count on struct mnt_idmap. Just a reminder that we only allow a mount to change it's idmapping a single time and only if it hasn't already been attached to the filesystems and has no active writers. The actual changes are fairly straightforward but this will have huge benefits for maintenance and security in the long run even if it causes some churn. Note that this also makes it possible to extend struct mount_idmap in the future. For example, it would be possible to place the namespace pointer in an anonymous union together with an idmapping struct. This would allow us to expose an api to userspace that would let it specify idmappings directly instead of having to go through the detour of setting up namespaces at all" * tag 'fs.idmapped.mnt_idmap.v6.2' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/idmapping: acl: conver higher-level helpers to rely on mnt_idmap fs: introduce dedicated idmap type for mounts
2022-12-12Merge tag 'fs.vfsuid.conversion.v6.2' of ↵Linus Torvalds8-40/+48
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/idmapping Pull vfsuid updates from Christian Brauner: "Last cycle we introduced the vfs{g,u}id_t types and associated helpers to gain type safety when dealing with idmapped mounts. That initial work already converted a lot of places over but there were still some left, This converts all remaining places that still make use of non-type safe idmapping helpers to rely on the new type safe vfs{g,u}id based helpers. Afterwards it removes all the old non-type safe helpers" * tag 'fs.vfsuid.conversion.v6.2' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/idmapping: fs: remove unused idmapping helpers ovl: port to vfs{g,u}id_t and associated helpers fuse: port to vfs{g,u}id_t and associated helpers ima: use type safe idmapping helpers apparmor: use type safe idmapping helpers caps: use type safe idmapping helpers fs: use type safe idmapping helpers mnt_idmapping: add missing helpers
2022-12-12Merge tag 'fs.ovl.setgid.v6.2' of ↵Linus Torvalds7-53/+137
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/idmapping Pull setgid inheritance updates from Christian Brauner: "This contains the work to make setgid inheritance consistent between modifying a file and when changing ownership or mode as this has been a repeated source of very subtle bugs. The gist is that we perform the same permission checks in the write path as we do in the ownership and mode changing paths after this series where we're currently doing different things. We've already made setgid inheritance a lot more consistent and reliable in the last releases by moving setgid stripping from the individual filesystems up into the vfs. This aims to make the logic even more consistent and easier to understand and also to fix long-standing overlayfs setgid inheritance bugs. Miklos was nice enough to just let me carry the trivial overlayfs patches from Amir too. Below is a more detailed explanation how the current difference in setgid handling lead to very subtle bugs exemplified via overlayfs which is a victim of the current rules. I hope this explains why I think taking the regression risk here is worth it. A long while ago I found a few setgid inheritance bugs in overlayfs in the write path in certain conditions. Amir recently picked this back up in [1] and I jumped on board to fix this more generally. On the surface all that overlayfs would need to fix setgid inheritance would be to call file_remove_privs() or file_modified() but actually that isn't enough because the setgid inheritance api is wildly inconsistent in that area. Before this pr setgid stripping in file_remove_privs()'s old should_remove_suid() helper was inconsistent with other parts of the vfs. Specifically, it only raises ATTR_KILL_SGID if the inode is S_ISGID and S_IXGRP but not if the inode isn't in the caller's groups and the caller isn't privileged over the inode although we require this already in setattr_prepare() and setattr_copy() and so all filesystem implement this requirement implicitly because they have to use setattr_{prepare,copy}() anyway. But the inconsistency shows up in setgid stripping bugs for overlayfs in xfstests (e.g., generic/673, generic/683, generic/685, generic/686, generic/687). For example, we test whether suid and setgid stripping works correctly when performing various write-like operations as an unprivileged user (fallocate, reflink, write, etc.): echo "Test 1 - qa_user, non-exec file $verb" setup_testfile chmod a+rws $junk_file commit_and_check "$qa_user" "$verb" 64k 64k The test basically creates a file with 6666 permissions. While the file has the S_ISUID and S_ISGID bits set it does not have the S_IXGRP set. On a regular filesystem like xfs what will happen is: sys_fallocate() -> vfs_fallocate() -> xfs_file_fallocate() -> file_modified() -> __file_remove_privs() -> dentry_needs_remove_privs() -> should_remove_suid() -> __remove_privs() newattrs.ia_valid = ATTR_FORCE | kill; -> notify_change() -> setattr_copy() In should_remove_suid() we can see that ATTR_KILL_SUID is raised unconditionally because the file in the test has S_ISUID set. But we also see that ATTR_KILL_SGID won't be set because while the file is S_ISGID it is not S_IXGRP (see above) which is a condition for ATTR_KILL_SGID being raised. So by the time we call notify_change() we have attr->ia_valid set to ATTR_KILL_SUID | ATTR_FORCE. Now notify_change() sees that ATTR_KILL_SUID is set and does: ia_valid = attr->ia_valid |= ATTR_MODE attr->ia_mode = (inode->i_mode & ~S_ISUID); which means that when we call setattr_copy() later we will definitely update inode->i_mode. Note that attr->ia_mode still contains S_ISGID. Now we call into the filesystem's ->setattr() inode operation which will end up calling setattr_copy(). Since ATTR_MODE is set we will hit: if (ia_valid & ATTR_MODE) { umode_t mode = attr->ia_mode; vfsgid_t vfsgid = i_gid_into_vfsgid(mnt_userns, inode); if (!vfsgid_in_group_p(vfsgid) && !capable_wrt_inode_uidgid(mnt_userns, inode, CAP_FSETID)) mode &= ~S_ISGID; inode->i_mode = mode; } and since the caller in the test is neither capable nor in the group of the inode the S_ISGID bit is stripped. But assume the file isn't suid then ATTR_KILL_SUID won't be raised which has the consequence that neither the setgid nor the suid bits are stripped even though it should be stripped because the inode isn't in the caller's groups and the caller isn't privileged over the inode. If overlayfs is in the mix things become a bit more complicated and the bug shows up more clearly. When e.g., ovl_setattr() is hit from ovl_fallocate()'s call to file_remove_privs() then ATTR_KILL_SUID and ATTR_KILL_SGID might be raised but because the check in notify_change() is questioning the ATTR_KILL_SGID flag again by requiring S_IXGRP for it to be stripped the S_ISGID bit isn't removed even though it should be stripped: sys_fallocate() -> vfs_fallocate() -> ovl_fallocate() -> file_remove_privs() -> dentry_needs_remove_privs() -> should_remove_suid() -> __remove_privs() newattrs.ia_valid = ATTR_FORCE | kill; -> notify_change() -> ovl_setattr() /* TAKE ON MOUNTER'S CREDS */ -> ovl_do_notify_change() -> notify_change() /* GIVE UP MOUNTER'S CREDS */ /* TAKE ON MOUNTER'S CREDS */ -> vfs_fallocate() -> xfs_file_fallocate() -> file_modified() -> __file_remove_privs() -> dentry_needs_remove_privs() -> should_remove_suid() -> __remove_privs() newattrs.ia_valid = attr_force | kill; -> notify_change() The fix for all of this is to make file_remove_privs()'s should_remove_suid() helper perform the same checks as we already require in setattr_prepare() and setattr_copy() and have notify_change() not pointlessly requiring S_IXGRP again. It doesn't make any sense in the first place because the caller must calculate the flags via should_remove_suid() anyway which would raise ATTR_KILL_SGID Note that some xfstests will now fail as these patches will cause the setgid bit to be lost in certain conditions for unprivileged users modifying a setgid file when they would've been kept otherwise. I think this risk is worth taking and I explained and mentioned this multiple times on the list [2]. Enforcing the rules consistently across write operations and chmod/chown will lead to losing the setgid bit in cases were it might've been retained before. While I've mentioned this a few times but it's worth repeating just to make sure that this is understood. For the sake of maintainability, consistency, and security this is a risk worth taking. If we really see regressions for workloads the fix is to have special setgid handling in the write path again with different semantics from chmod/chown and possibly additional duct tape for overlayfs. I'll update the relevant xfstests with if you should decide to merge this second setgid cleanup. Before that people should be aware that there might be failures for fstests where unprivileged users modify a setgid file" Link: https://lore.kernel.org/linux-fsdevel/20221003123040.900827-1-amir73il@gmail.com [1] Link: https://lore.kernel.org/linux-fsdevel/20221122142010.zchf2jz2oymx55qi@wittgenstein [2] * tag 'fs.ovl.setgid.v6.2' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/idmapping: fs: use consistent setgid checks in is_sxid() ovl: remove privs in ovl_fallocate() ovl: remove privs in ovl_copyfile() attr: use consistent sgid stripping checks attr: add setattr_should_drop_sgid() fs: move should_remove_suid() attr: add in_group_or_capable()
2022-12-12Merge tag 'fs.acl.rework.v6.2' of ↵Linus Torvalds91-1012/+1390
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/idmapping Pull VFS acl updates from Christian Brauner: "This contains the work that builds a dedicated vfs posix acl api. The origins of this work trace back to v5.19 but it took quite a while to understand the various filesystem specific implementations in sufficient detail and also come up with an acceptable solution. As we discussed and seen multiple times the current state of how posix acls are handled isn't nice and comes with a lot of problems: The current way of handling posix acls via the generic xattr api is error prone, hard to maintain, and type unsafe for the vfs until we call into the filesystem's dedicated get and set inode operations. It is already the case that posix acls are special-cased to death all the way through the vfs. There are an uncounted number of hacks that operate on the uapi posix acl struct instead of the dedicated vfs struct posix_acl. And the vfs must be involved in order to interpret and fixup posix acls before storing them to the backing store, caching them, reporting them to userspace, or for permission checking. Currently a range of hacks and duct tape exist to make this work. As with most things this is really no ones fault it's just something that happened over time. But the code is hard to understand and difficult to maintain and one is constantly at risk of introducing bugs and regressions when having to touch it. Instead of continuing to hack posix acls through the xattr handlers this series builds a dedicated posix acl api solely around the get and set inode operations. Going forward, the vfs_get_acl(), vfs_remove_acl(), and vfs_set_acl() helpers must be used in order to interact with posix acls. They operate directly on the vfs internal struct posix_acl instead of abusing the uapi posix acl struct as we currently do. In the end this removes all of the hackiness, makes the codepaths easier to maintain, and gets us type safety. This series passes the LTP and xfstests suites without any regressions. For xfstests the following combinations were tested: - xfs - ext4 - btrfs - overlayfs - overlayfs on top of idmapped mounts - orangefs - (limited) cifs There's more simplifications for posix acls that we can make in the future if the basic api has made it. A few implementation details: - The series makes sure to retain exactly the same security and integrity module permission checks. Especially for the integrity modules this api is a win because right now they convert the uapi posix acl struct passed to them via a void pointer into the vfs struct posix_acl format to perform permission checking on the mode. There's a new dedicated security hook for setting posix acls which passes the vfs struct posix_acl not a void pointer. Basing checking on the posix acl stored in the uapi format is really unreliable. The vfs currently hacks around directly in the uapi struct storing values that frankly the security and integrity modules can't correctly interpret as evidenced by bugs we reported and fixed in this area. It's not necessarily even their fault it's just that the format we provide to them is sub optimal. - Some filesystems like 9p and cifs need access to the dentry in order to get and set posix acls which is why they either only partially or not even at all implement get and set inode operations. For example, cifs allows setxattr() and getxattr() operations but doesn't allow permission checking based on posix acls because it can't implement a get acl inode operation. Thus, this patch series updates the set acl inode operation to take a dentry instead of an inode argument. However, for the get acl inode operation we can't do this as the old get acl method is called in e.g., generic_permission() and inode_permission(). These helpers in turn are called in various filesystem's permission inode operation. So passing a dentry argument to the old get acl inode operation would amount to passing a dentry to the permission inode operation which we shouldn't and probably can't do. So instead of extending the existing inode operation Christoph suggested to add a new one. He also requested to ensure that the get and set acl inode operation taking a dentry are consistently named. So for this version the old get acl operation is renamed to ->get_inode_acl() and a new ->get_acl() inode operation taking a dentry is added. With this we can give both 9p and cifs get and set acl inode operations and in turn remove their complex custom posix xattr handlers. In the future I hope to get rid of the inode method duplication but it isn't like we have never had this situation. Readdir is just one example. And frankly, the overall gain in type safety and the more pleasant api wise are simply too big of a benefit to not accept this duplication for a while. - We've done a full audit of every codepaths using variant of the current generic xattr api to get and set posix acls and surprisingly it isn't that many places. There's of course always a chance that we might have missed some and if so I'm sure we'll find them soon enough. The crucial codepaths to be converted are obviously stacking filesystems such as ecryptfs and overlayfs. For a list of all callers currently using generic xattr api helpers see [2] including comments whether they support posix acls or not. - The old vfs generic posix acl infrastructure doesn't obey the create and replace semantics promised on the setxattr(2) manpage. This patch series doesn't address this. It really is something we should revisit later though. The patches are roughly organized as follows: (1) Change existing set acl inode operation to take a dentry argument (Intended to be a non-functional change) (2) Rename existing get acl method (Intended to be a non-functional change) (3) Implement get and set acl inode operations for filesystems that couldn't implement one before because of the missing dentry. That's mostly 9p and cifs (Intended to be a non-functional change) (4) Build posix acl api, i.e., add vfs_get_acl(), vfs_remove_acl(), and vfs_set_acl() including security and integrity hooks (Intended to be a non-functional change) (5) Implement get and set acl inode operations for stacking filesystems (Intended to be a non-functional change) (6) Switch posix acl handling in stacking filesystems to new posix acl api now that all filesystems it can stack upon support it. (7) Switch vfs to new posix acl api (semantical change) (8) Remove all now unused helpers (9) Additional regression fixes reported after we merged this into linux-next Thanks to Seth for a lot of good discussion around this and encouragement and input from Christoph" * tag 'fs.acl.rework.v6.2' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/idmapping: (36 commits) posix_acl: Fix the type of sentinel in get_acl orangefs: fix mode handling ovl: call posix_acl_release() after error checking evm: remove dead code in evm_inode_set_acl() cifs: check whether acl is valid early acl: make vfs_posix_acl_to_xattr() static acl: remove a slew of now unused helpers 9p: use stub posix acl handlers cifs: use stub posix acl handlers ovl: use stub posix acl handlers ecryptfs: use stub posix acl handlers evm: remove evm_xattr_acl_change() xattr: use posix acl api ovl: use posix acl api ovl: implement set acl method ovl: implement get acl method ecryptfs: implement set acl method ecryptfs: implement get acl method ksmbd: use vfs_remove_acl() acl: add vfs_remove_acl() ...
2022-12-12Merge tag 'pull-misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfsLinus Torvalds4-58/+13
Pull misc vfs updates from Al Viro: "misc pile" * tag 'pull-misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: fs: sysv: Fix sysv_nblocks() returns wrong value get rid of INT_LIMIT, use type_max() instead btrfs: replace INT_LIMIT(loff_t) with OFFSET_MAX fs: simplify vfs_get_super fs: drop useless condition from inode_needs_update_time
2022-12-12Merge tag 'pull-namespace' of ↵Linus Torvalds1-1/+2
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull namespace fix from Al Viro: "Fix weird corner case in copy_mnt_ns()" * tag 'pull-namespace' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: copy_mnt_ns(): handle a corner case (overmounted mntns bindings) saner
2022-12-12Merge tag 'pull-iov_iter' of ↵Linus Torvalds31-72/+72
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull iov_iter updates from Al Viro: "iov_iter work; most of that is about getting rid of direction misannotations and (hopefully) preventing more of the same for the future" * tag 'pull-iov_iter' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: use less confusing names for iov_iter direction initializers iov_iter: saner checks for attempt to copy to/from iterator [xen] fix "direction" argument of iov_iter_kvec() [vhost] fix 'direction' argument of iov_iter_{init,bvec}() [target] fix iov_iter_bvec() "direction" argument [s390] memcpy_real(): WRITE is "data source", not destination... [s390] zcore: WRITE is "data source", not destination... [infiniband] READ is "data destination", not source... [fsi] WRITE is "data source", not destination... [s390] copy_oldmem_kernel() - WRITE is "data source", not destination csum_and_copy_to_iter(): handle ITER_DISCARD get rid of unlikely() on page_copy_sane() calls
2022-12-12Merge tag 'pull-elfcore' of ↵Linus Torvalds2-221/+51
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull elf coredumping updates from Al Viro: "Unification of regset and non-regset sides of ELF coredump handling. Collecting per-thread register values is the only thing that needs to be ifdefed there..." * tag 'pull-elfcore' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: [elf] get rid of get_note_info_size() [elf] unify regset and non-regset cases [elf][non-regset] use elf_core_copy_task_regs() for dumper as well [elf][non-regset] uninline elf_core_copy_task_fpregs() (and lose pt_regs argument) elf_core_copy_task_regs(): task_pt_regs is defined everywhere [elf][regset] simplify thread list handling in fill_note_info() [elf][regset] clean fill_note_info() a bit kill extern of vsyscall32_sysctl kill coredump_params->regs kill signal_pt_regs()
2022-12-12Merge tag 'mm-nonmm-stable-2022-12-12' of ↵Linus Torvalds37-120/+446
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull non-MM updates from Andrew Morton: - A ptrace API cleanup series from Sergey Shtylyov - Fixes and cleanups for kexec from ye xingchen - nilfs2 updates from Ryusuke Konishi - squashfs feature work from Xiaoming Ni: permit configuration of the filesystem's compression concurrency from the mount command line - A series from Akinobu Mita which addresses bound checking errors when writing to debugfs files - A series from Yang Yingliang to address rapidio memory leaks - A series from Zheng Yejian to address possible overflow errors in encode_comp_t() - And a whole shower of singleton patches all over the place * tag 'mm-nonmm-stable-2022-12-12' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (79 commits) ipc: fix memory leak in init_mqueue_fs() hfsplus: fix bug causing custom uid and gid being unable to be assigned with mount rapidio: devices: fix missing put_device in mport_cdev_open kcov: fix spelling typos in comments hfs: Fix OOB Write in hfs_asc2mac hfs: fix OOB Read in __hfs_brec_find relay: fix type mismatch when allocating memory in relay_create_buf() ocfs2: always read both high and low parts of dinode link count io-mapping: move some code within the include guarded section kernel: kcsan: kcsan_test: build without structleak plugin mailmap: update email for Iskren Chernev eventfd: change int to __u64 in eventfd_signal() ifndef CONFIG_EVENTFD rapidio: fix possible UAF when kfifo_alloc() fails relay: use strscpy() is more robust and safer cpumask: limit visibility of FORCE_NR_CPUS acct: fix potential integer overflow in encode_comp_t() acct: fix accuracy loss for input value of encode_comp_t() linux/init.h: include <linux/build_bug.h> and <linux/stringify.h> rapidio: rio: fix possible name leak in rio_register_mport() rapidio: fix possible name leaks when rio_add_device() fails ...
2022-12-12Merge tag 'random-6.2-rc1-for-linus' of ↵Linus Torvalds14-34/+27
git://git.kernel.org/pub/scm/linux/kernel/git/crng/random Pull random number generator updates from Jason Donenfeld: - Replace prandom_u32_max() and various open-coded variants of it, there is now a new family of functions that uses fast rejection sampling to choose properly uniformly random numbers within an interval: get_random_u32_below(ceil) - [0, ceil) get_random_u32_above(floor) - (floor, U32_MAX] get_random_u32_inclusive(floor, ceil) - [floor, ceil] Coccinelle was used to convert all current users of prandom_u32_max(), as well as many open-coded patterns, resulting in improvements throughout the tree. I'll have a "late" 6.1-rc1 pull for you that removes the now unused prandom_u32_max() function, just in case any other trees add a new use case of it that needs to converted. According to linux-next, there may be two trivial cases of prandom_u32_max() reintroductions that are fixable with a 's/.../.../'. So I'll have for you a final conversion patch doing that alongside the removal patch during the second week. This is a treewide change that touches many files throughout. - More consistent use of get_random_canary(). - Updates to comments, documentation, tests, headers, and simplification in configuration. - The arch_get_random*_early() abstraction was only used by arm64 and wasn't entirely useful, so this has been replaced by code that works in all relevant contexts. - The kernel will use and manage random seeds in non-volatile EFI variables, refreshing a variable with a fresh seed when the RNG is initialized. The RNG GUID namespace is then hidden from efivarfs to prevent accidental leakage. These changes are split into random.c infrastructure code used in the EFI subsystem, in this pull request, and related support inside of EFISTUB, in Ard's EFI tree. These are co-dependent for full functionality, but the order of merging doesn't matter. - Part of the infrastructure added for the EFI support is also used for an improvement to the way vsprintf initializes its siphash key, replacing an sleep loop wart. - The hardware RNG framework now always calls its correct random.c input function, add_hwgenerator_randomness(), rather than sometimes going through helpers better suited for other cases. - The add_latent_entropy() function has long been called from the fork handler, but is a no-op when the latent entropy gcc plugin isn't used, which is fine for the purposes of latent entropy. But it was missing out on the cycle counter that was also being mixed in beside the latent entropy variable. So now, if the latent entropy gcc plugin isn't enabled, add_latent_entropy() will expand to a call to add_device_randomness(NULL, 0), which adds a cycle counter, without the absent latent entropy variable. - The RNG is now reseeded from a delayed worker, rather than on demand when used. Always running from a worker allows it to make use of the CPU RNG on platforms like S390x, whose instructions are too slow to do so from interrupts. It also has the effect of adding in new inputs more frequently with more regularity, amounting to a long term transcript of random values. Plus, it helps a bit with the upcoming vDSO implementation (which isn't yet ready for 6.2). - The jitter entropy algorithm now tries to execute on many different CPUs, round-robining, in hopes of hitting even more memory latencies and other unpredictable effects. It also will mix in a cycle counter when the entropy timer fires, in addition to being mixed in from the main loop, to account more explicitly for fluctuations in that timer firing. And the state it touches is now kept within the same cache line, so that it's assured that the different execution contexts will cause latencies. * tag 'random-6.2-rc1-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/crng/random: (23 commits) random: include <linux/once.h> in the right header random: align entropy_timer_state to cache line random: mix in cycle counter when jitter timer fires random: spread out jitter callback to different CPUs random: remove extraneous period and add a missing one in comments efi: random: refresh non-volatile random seed when RNG is initialized vsprintf: initialize siphash key using notifier random: add back async readiness notifier random: reseed in delayed work rather than on-demand random: always mix cycle counter in add_latent_entropy() hw_random: use add_hwgenerator_randomness() for early entropy random: modernize documentation comment on get_random_bytes() random: adjust comment to account for removed function random: remove early archrandom abstraction random: use random.trust_{bootloader,cpu} command line option only stackprotector: actually use get_random_canary() stackprotector: move get_random_canary() into stackprotector.h treewide: use get_random_u32_inclusive() when possible treewide: use get_random_u32_{above,below}() instead of manual loop treewide: use get_random_u32_below() instead of deprecated function ...
2022-12-12Merge tag 'printk-for-6.2' of ↵Linus Torvalds1-3/+18
git://git.kernel.org/pub/scm/linux/kernel/git/printk/linux Pull printk updates from Petr Mladek: - Add NMI-safe SRCU reader API. It uses atomic_inc() instead of this_cpu_inc() on strong load-store architectures. - Introduce new console_list_lock to synchronize a manipulation of the list of registered consoles and their flags. This is a first step in removing the big-kernel-lock-like behavior of console_lock(). This semaphore still serializes console->write() calbacks against: - each other. It primary prevents potential races between early and proper console drivers using the same device. - suspend()/resume() callbacks and init() operations in some drivers. - various other operations in the tty/vt and framebufer susbsystems. It is likely that console_lock() serializes even operations that are not directly conflicting with the console->write() callbacks here. This is the most complicated big-kernel-lock aspect of the console_lock() that will be hard to untangle. - Introduce new console_srcu lock that is used to safely iterate and access the registered console drivers under SRCU read lock. This is a prerequisite for introducing atomic console drivers and console kthreads. It will reduce the complexity of serialization against normal consoles and console_lock(). Also it should remove the risk of deadlock during critical situations, like Oops or panic, when only atomic consoles are registered. - Check whether the console is registered instead of enabled on many locations. It was a historical leftover. - Cleanly force a preferred console in xenfb code instead of a dirty hack. - A lot of code and comment clean ups and improvements. * tag 'printk-for-6.2' of git://git.kernel.org/pub/scm/linux/kernel/git/printk/linux: (47 commits) printk: htmldocs: add missing description tty: serial: sh-sci: use setup() callback for early console printk: relieve console_lock of list synchronization duties tty: serial: kgdboc: use console_list_lock to trap exit tty: serial: kgdboc: synchronize tty_find_polling_driver() and register_console() tty: serial: kgdboc: use console_list_lock for list traversal tty: serial: kgdboc: use srcu console list iterator proc: consoles: use console_list_lock for list iteration tty: tty_io: use console_list_lock for list synchronization printk, xen: fbfront: create/use safe function for forcing preferred netconsole: avoid CON_ENABLED misuse to track registration usb: early: xhci-dbc: use console_is_registered() tty: serial: xilinx_uartps: use console_is_registered() tty: serial: samsung_tty: use console_is_registered() tty: serial: pic32_uart: use console_is_registered() tty: serial: earlycon: use console_is_registered() tty: hvc: use console_is_registered() efi: earlycon: use console_is_registered() tty: nfcon: use console_is_registered() serial_core: replace uart_console_enabled() with uart_console_registered() ...
2022-12-12Merge tag 'locks-v6.2' of ↵Linus Torvalds10-26/+52
git://git.kernel.org/pub/scm/linux/kernel/git/jlayton/linux Pull file locking updates from Jeff Layton: "The main change here is to add the new locks_inode_context helper, and convert all of the places that dereference inode->i_flctx directly to use that instead. There is a new helper to indicate whether any locks are held on an inode. This is mostly for Ceph but may be usable elsewhere too. Andi Kleen requested that we print the PID when the LOCK_MAND warning fires, to help track down applications trying to use it. Finally, we added some new warnings to some of the file locking functions that fire when the ->fl_file and filp arguments differ. This helped us find some long-standing bugs in lockd. Patches for those are in Chuck Lever's tree and should be in his v6.2 PR. After that patch, people using NFSv2/v3 locking may see some warnings fire until those go in. Happy Holidays!" * tag 'locks-v6.2' of git://git.kernel.org/pub/scm/linux/kernel/git/jlayton/linux: Add process name and pid to locks warning nfsd: use locks_inode_context helper nfs: use locks_inode_context helper lockd: use locks_inode_context helper ksmbd: use locks_inode_context helper cifs: use locks_inode_context helper ceph: use locks_inode_context helper filelock: add a new locks_inode_context accessor function filelock: new helper: vfs_inode_has_locks filelock: WARN_ON_ONCE when ->fl_file and filp don't match
2022-12-12Merge tag 'execve-v6.2-rc1' of ↵Linus Torvalds4-36/+45
git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux Pull execve updates from Kees Cook: "Most are small refactorings and bug fixes, but three things stand out: switching timens (which got reverted before) looks solid now, FOLL_FORCE has been removed (no failures seen yet across several weeks in -next), and some whitespace cleanups (which are long overdue). - Add timens support (when switching mm). This version has survived in -next for the entire cycle (Andrei Vagin) - Various small bug fixes, refactoring, and readability improvements (Bernd Edlinger, Rolf Eike Beer, Bo Liu, Li Zetao Liu Shixin) - Remove FOLL_FORCE for stack setup (Kees Cook) - Whitespace cleanups (Rolf Eike Beer, Kees Cook)" * tag 'execve-v6.2-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux: binfmt_misc: fix shift-out-of-bounds in check_special_flags binfmt: Fix error return code in load_elf_fdpic_binary() exec: Remove FOLL_FORCE for stack setup binfmt_elf: replace IS_ERR() with IS_ERR_VALUE() binfmt_elf: simplify error handling in load_elf_phdrs() binfmt_elf: fix documented return value for load_elf_phdrs() exec: simplify initial stack size expansion binfmt: Fix whitespace issues exec: Add comments on check_unsafe_exec() fs counting ELF uapi: add spaces before '{' selftests/timens: add a test for vfork+exit fs/exec: switch timens when a task gets a new mm
2022-12-12Merge tag 'pstore-v6.2-rc1' of ↵Linus Torvalds5-29/+160
git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux Pull pstore updates from Kees Cook: "A small collection of bug fixes, refactorings, and general improvements: - Reporting improvements and return path fixes (Guilherme G. Piccoli, Wang Yufen, Kees Cook) - Clean up kmsg_bytes module parameter usage (Guilherme G. Piccoli) - Add Guilherme to pstore MAINTAINERS entry - Choose friendlier allocation flags (Qiujun Huang, Stephen Boyd)" * tag 'pstore-v6.2-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux: pstore: Avoid kcore oops by vmap()ing with VM_IOREMAP pstore/ram: Fix error return code in ramoops_probe() pstore: Alert on backend write error MAINTAINERS: Update pstore maintainers pstore/ram: Set freed addresses to NULL pstore/ram: Move internal definitions out of kernel-wide include pstore/ram: Move pmsg init earlier pstore/ram: Consolidate kfree() paths efi: pstore: Follow convention for the efi-pstore backend name pstore: Inform unregistered backend names as well pstore: Expose kmsg_bytes as a module parameter pstore: Improve error reporting in case of backend overlap pstore/zone: Use GFP_ATOMIC to allocate zone buffer
2022-12-11hfsplus: fix bug causing custom uid and gid being unable to be assigned with ↵Aditya Garg3-2/+8
mount Despite specifying UID and GID in mount command, the specified UID and GID were not being assigned. This patch fixes this issue. Link: https://lkml.kernel.org/r/C0264BF5-059C-45CF-B8DA-3A3BD2C803A2@live.com Signed-off-by: Aditya Garg <gargaditya08@live.com> Reviewed-by: Viacheslav Dubeyko <slava@dubeyko.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-12-11hfs: Fix OOB Write in hfs_asc2macZhangPeng1-1/+1
Syzbot reported a OOB Write bug: loop0: detected capacity change from 0 to 64 ================================================================== BUG: KASAN: slab-out-of-bounds in hfs_asc2mac+0x467/0x9a0 fs/hfs/trans.c:133 Write of size 1 at addr ffff88801848314e by task syz-executor391/3632 Call Trace: <TASK> __dump_stack lib/dump_stack.c:88 [inline] dump_stack_lvl+0x1b1/0x28e lib/dump_stack.c:106 print_address_description+0x74/0x340 mm/kasan/report.c:284 print_report+0x107/0x1f0 mm/kasan/report.c:395 kasan_report+0xcd/0x100 mm/kasan/report.c:495 hfs_asc2mac+0x467/0x9a0 fs/hfs/trans.c:133 hfs_cat_build_key+0x92/0x170 fs/hfs/catalog.c:28 hfs_lookup+0x1ab/0x2c0 fs/hfs/dir.c:31 lookup_open fs/namei.c:3391 [inline] open_last_lookups fs/namei.c:3481 [inline] path_openat+0x10e6/0x2df0 fs/namei.c:3710 do_filp_open+0x264/0x4f0 fs/namei.c:3740 If in->len is much larger than HFS_NAMELEN(31) which is the maximum length of an HFS filename, a OOB write could occur in hfs_asc2mac(). In that case, when the dst reaches the boundary, the srclen is still greater than 0, which causes a OOB write. Fix this by adding a check on dstlen in while() before writing to dst address. Link: https://lkml.kernel.org/r/20221202030038.1391945-1-zhangpeng362@huawei.com Fixes: 328b92278650 ("[PATCH] hfs: NLS support") Signed-off-by: ZhangPeng <zhangpeng362@huawei.com> Reviewed-by: Viacheslav Dubeyko <slava@dubeyko.com> Reported-by: <syzbot+dc3b1cf9111ab5fe98e7@syzkaller.appspotmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-12-11hfs: fix OOB Read in __hfs_brec_findZhangPeng1-0/+2
Syzbot reported a OOB read bug: ================================================================== BUG: KASAN: slab-out-of-bounds in hfs_strcmp+0x117/0x190 fs/hfs/string.c:84 Read of size 1 at addr ffff88807eb62c4e by task kworker/u4:1/11 CPU: 1 PID: 11 Comm: kworker/u4:1 Not tainted 6.1.0-rc6-syzkaller-00308-g644e9524388a #0 Workqueue: writeback wb_workfn (flush-7:0) Call Trace: <TASK> __dump_stack lib/dump_stack.c:88 [inline] dump_stack_lvl+0x1b1/0x28e lib/dump_stack.c:106 print_address_description+0x74/0x340 mm/kasan/report.c:284 print_report+0x107/0x1f0 mm/kasan/report.c:395 kasan_report+0xcd/0x100 mm/kasan/report.c:495 hfs_strcmp+0x117/0x190 fs/hfs/string.c:84 __hfs_brec_find+0x213/0x5c0 fs/hfs/bfind.c:75 hfs_brec_find+0x276/0x520 fs/hfs/bfind.c:138 hfs_write_inode+0x34c/0xb40 fs/hfs/inode.c:462 write_inode fs/fs-writeback.c:1440 [inline] If the input inode of hfs_write_inode() is incorrect: struct inode struct hfs_inode_info struct hfs_cat_key struct hfs_name u8 len # len is greater than HFS_NAMELEN(31) which is the maximum length of an HFS filename OOB read occurred: hfs_write_inode() hfs_brec_find() __hfs_brec_find() hfs_cat_keycmp() hfs_strcmp() # OOB read occurred due to len is too large Fix this by adding a Check on len in hfs_write_inode() before calling hfs_brec_find(). Link: https://lkml.kernel.org/r/20221130065959.2168236-1-zhangpeng362@huawei.com Signed-off-by: ZhangPeng <zhangpeng362@huawei.com> Reported-by: <syzbot+e836ff7133ac02be825f@syzkaller.appspotmail.com> Cc: Damien Le Moal <damien.lemoal@opensource.wdc.com> Cc: Ira Weiny <ira.weiny@intel.com> Cc: Jeff Layton <jlayton@kernel.org> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Nanyong Sun <sunnanyong@huawei.com> Cc: Viacheslav Dubeyko <slava@dubeyko.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-12-11ocfs2: always read both high and low parts of dinode link countAlexey Asemov1-2/+1
When filesystem is using indexed-dirs feature, maximum link count values can spill over to i_links_count_hi, up to OCFS2_DX_LINK_MAX links. ocfs2_read_links_count() checks for OCFS2_INDEXED_DIR_FL flag in dinode, but this flag is only valid for directories so for files the check causes high part of the link count not being read back from file dinodes resulting in wrong link count value when file has >65535 links. As ocfs2_set_links_count() always writes both high and low parts of link count, the flag check on reading may be removed. Link: https://lkml.kernel.org/r/cbfca02b-b39f-89de-e1a8-904a6c60407e@alex-at.net Signed-off-by: Alexey Asemov <alex@alex-at.net> Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com> Cc: Mark Fasheh <mark@fasheh.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Junxiao Bi <junxiao.bi@oracle.com> Cc: Changwei Ge <gechangwei@live.cn> Cc: Gang He <ghe@suse.com> Cc: Jun Piao <piaojun@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-12-11mm: do not show fs mm pc for VM_LOCKONFAULT pagesJason A. Donenfeld1-0/+1
When VM_LOCKONFAULT was added, /proc/PID/smaps wasn't hooked up to it, so looking at /proc/PID/smaps, it shows '??' instead of something intelligable. This can be reached by userspace by simply calling `mlock2(..., MLOCK_ONFAULT);`. Fix this by adding "lf" to denote VM_LOCKONFAULT. Link: https://lkml.kernel.org/r/20221205173007.580210-1-Jason@zx2c4.com Fixes: de60f5f10c58 ("mm: introduce VM_LOCKONFAULT") Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Eric B Munson <emunson@akamai.com> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-12-11omfs: remove ->writepageChristoph Hellwig1-6/+1
->writepage is a very inefficient method to write back data, and only used through write_cache_pages or a a fallback when no ->migrate_folio method is present. Set ->migrate_folio to the generic buffer_head based helper, and remove the ->writepage implementation. Link: https://lkml.kernel.org/r/20221202102644.770505-8-hch@lst.de Signed-off-by: Christoph Hellwig <hch@lst.de> Acked-by: Bob Copeland <me@bobcopeland.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-12-11jfs: remove ->writepageChristoph Hellwig1-6/+1
->writepage is a very inefficient method to write back data, and only used through write_cache_pages or a a fallback when no ->migrate_folio method is present. Set ->migrate_folio to the generic buffer_head based helper, and remove the ->writepage implementation. Link: https://lkml.kernel.org/r/20221202102644.770505-7-hch@lst.de Signed-off-by: Christoph Hellwig <hch@lst.de> Acked-by: Dave Kleikamp <dave.kleikamp@oracle.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-12-11hpfs: remove ->writepageChristoph Hellwig1-7/+2
->writepage is a very inefficient method to write back data, and only used through write_cache_pages or a a fallback when no ->migrate_folio method is present. Set ->migrate_folio to the generic buffer_head based helper, and remove the ->writepage implementation. Link: https://lkml.kernel.org/r/20221202102644.770505-6-hch@lst.de Signed-off-by: Christoph Hellwig <hch@lst.de> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-12-11hfsplus: remove ->writepageChristoph Hellwig1-1/+1
->writepage is a very inefficient method to write back data, and only used through write_cache_pages or a a fallback when no ->migrate_folio method is present. Set ->migrate_folio to the generic buffer_head based helper, and stop wiring up ->writepage for hfsplus_aops. Link: https://lkml.kernel.org/r/20221202102644.770505-5-hch@lst.de Signed-off-by: Christoph Hellwig <hch@lst.de> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-12-11hfs: remove ->writepageChristoph Hellwig1-1/+1
->writepage is a very inefficient method to write back data, and only used through write_cache_pages or a a fallback when no ->migrate_folio method is present. Set ->migrate_folio to the generic buffer_head based helper, and stop wiring up ->writepage for hfs_aops. Link: https://lkml.kernel.org/r/20221202102644.770505-4-hch@lst.de Signed-off-by: Christoph Hellwig <hch@lst.de> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-12-11fat: remove ->writepageChristoph Hellwig1-7/+2
->writepage is a very inefficient method to write back data, and only used through write_cache_pages or a a fallback when no ->migrate_folio method is present. Set ->migrate_folio to the generic buffer_head based helper, and remove the ->writepage implementation. Link: https://lkml.kernel.org/r/20221202102644.770505-3-hch@lst.de Signed-off-by: Christoph Hellwig <hch@lst.de> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-12-11extfat: remove ->writepageChristoph Hellwig1-7/+2
Patch series "start removing writepage instances v2". The VM doesn't need or want ->writepage for writeback and is fine with just having ->writepages as long as ->migrate_folio is implemented. This series removes all ->writepage instances that use block_write_full_page directly and also have a plain mpage_writepages based ->writepages. This patch (of 7): ->writepage is a very inefficient method to write back data, and only used through write_cache_pages or a a fallback when no ->migrate_folio method is present. Set ->migrate_folio to the generic buffer_head based helper, and remove the ->writepage implementation. Link: https://lkml.kernel.org/r/20221202102644.770505-1-hch@lst.de Link: https://lkml.kernel.org/r/20221202102644.770505-2-hch@lst.de Signed-off-by: Christoph Hellwig <hch@lst.de> Acked-by: Namjae Jeon <linkinjeon@kernel.org> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Bob Copeland <me@bobcopeland.com> Cc: Dave Kleikamp <shaggy@kernel.org> Cc: Jan Kara <jack@suse.com> Cc: Mikulas Patocka <mikulas@artax.karlin.mff.cuni.cz> Cc: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp> Cc: Sungjong Seo <sj1557.seo@samsung.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>