Age | Commit message (Collapse) | Author | Files | Lines |
|
Without memcg, there is a one-to-one mapping between the bdi and
bdi_writeback structures. In this world, things are fairly
straightforward; the first thing bdi_unregister() does is to shutdown
the bdi_writeback structure (or wb), and part of that writeback ensures
that no other work queued against the wb, and that the wb is fully
drained.
With memcg, however, there is a one-to-many relationship between the bdi
and bdi_writeback structures; that is, there are multiple wb objects
which can all point to a single bdi. There is a refcount which prevents
the bdi object from being released (and hence, unregistered). So in
theory, the bdi_unregister() *should* only get called once its refcount
goes to zero (bdi_put will drop the refcount, and when it is zero,
release_bdi gets called, which calls bdi_unregister).
Unfortunately, del_gendisk() in block/gen_hd.c never got the memo about
the Brave New memcg World, and calls bdi_unregister directly. It does
this without informing the file system, or the memcg code, or anything
else. This causes the root wb associated with the bdi to be
unregistered, but none of the memcg-specific wb's are shutdown. So when
one of these wb's are woken up to do delayed work, they try to
dereference their wb->bdi->dev to fetch the device name, but
unfortunately bdi->dev is now NULL, thanks to the bdi_unregister()
called by del_gendisk(). As a result, *boom*.
Fortunately, it looks like the rest of the writeback path is perfectly
happy with bdi->dev and bdi->owner being NULL, so the simplest fix is to
create a bdi_dev_name() function which can handle bdi->dev being NULL.
This also allows us to bulletproof the writeback tracepoints to prevent
them from dereferencing a NULL pointer and crashing the kernel if one is
tracing with memcg's enabled, and an iSCSI device dies or a USB storage
stick is pulled.
The most common way of triggering this will be hotremoval of a device
while writeback with memcg enabled is going on. It was triggering
several times a day in a heavily loaded production environment.
Google Bug Id: 145475544
Link: https://lore.kernel.org/r/20191227194829.150110-1-tytso@mit.edu
Link: http://lkml.kernel.org/r/20191228005211.163952-1-tytso@mit.edu
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: Chris Mason <clm@fb.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux
Pull thread management updates from Christian Brauner:
"Sargun Dhillon over the last cycle has worked on the pidfd_getfd()
syscall.
This syscall allows for the retrieval of file descriptors of a process
based on its pidfd. A task needs to have ptrace_may_access()
permissions with PTRACE_MODE_ATTACH_REALCREDS (suggested by Oleg and
Andy) on the target.
One of the main use-cases is in combination with seccomp's user
notification feature. As a reminder, seccomp's user notification
feature was made available in v5.0. It allows a task to retrieve a
file descriptor for its seccomp filter. The file descriptor is usually
handed of to a more privileged supervising process. The supervisor can
then listen for syscall events caught by the seccomp filter of the
supervisee and perform actions in lieu of the supervisee, usually
emulating syscalls. pidfd_getfd() is needed to expand its uses.
There are currently two major users that wait on pidfd_getfd() and one
future user:
- Netflix, Sargun said, is working on a service mesh where users
should be able to connect to a dns-based VIP. When a user connects
to e.g. 1.2.3.4:80 that runs e.g. service "foo" they will be
redirected to an envoy process. This service mesh uses seccomp user
notifications and pidfd to intercept all connect calls and instead
of connecting them to 1.2.3.4:80 connects them to e.g.
127.0.0.1:8080.
- LXD uses the seccomp notifier heavily to intercept and emulate
mknod() and mount() syscalls for unprivileged containers/processes.
With pidfd_getfd() more uses-cases e.g. bridging socket connections
will be possible.
- The patchset has also seen some interest from the browser corner.
Right now, Firefox is using a SECCOMP_RET_TRAP sandbox managed by a
broker process. In the future glibc will start blocking all signals
during dlopen() rendering this type of sandbox impossible. Hence,
in the future Firefox will switch to a seccomp-user-nofication
based sandbox which also makes use of file descriptor retrieval.
The thread for this can be found at
https://sourceware.org/ml/libc-alpha/2019-12/msg00079.html
With pidfd_getfd() it is e.g. possible to bridge socket connections
for the supervisee (binding to a privileged port) and taking actions
on file descriptors on behalf of the supervisee in general.
Sargun's first version was using an ioctl on pidfds but various people
pushed for it to be a proper syscall which he duely implemented as
well over various review cycles. Selftests are of course included.
I've also added instructions how to deal with merge conflicts below.
There's also a small fix coming from the kernel mentee project to
correctly annotate struct sighand_struct with __rcu to fix various
sparse warnings. We've received a few more such fixes and even though
they are mostly trivial I've decided to postpone them until after -rc1
since they came in rather late and I don't want to risk introducing
build warnings.
Finally, there's a new prctl() command PR_{G,S}ET_IO_FLUSHER which is
needed to avoid allocation recursions triggerable by storage drivers
that have userspace parts that run in the IO path (e.g. dm-multipath,
iscsi, etc). These allocation recursions deadlock the device.
The new prctl() allows such privileged userspace components to avoid
allocation recursions by setting the PF_MEMALLOC_NOIO and
PF_LESS_THROTTLE flags. The patch carries the necessary acks from the
relevant maintainers and is routed here as part of prctl()
thread-management."
* tag 'threads-v5.6' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux:
prctl: PR_{G,S}ET_IO_FLUSHER to support controlling memory reclaim
sched.h: Annotate sighand_struct with __rcu
test: Add test for pidfd getfd
arch: wire up pidfd_getfd syscall
pid: Implement pidfd_getfd syscall
vfs, fdtable: Add fget_task helper
|
|
Pull io_uring updates from Jens Axboe:
- Support for various new opcodes (fallocate, openat, close, statx,
fadvise, madvise, openat2, non-vectored read/write, send/recv, and
epoll_ctl)
- Faster ring quiesce for fileset updates
- Optimizations for overflow condition checking
- Support for max-sized clamping
- Support for probing what opcodes are supported
- Support for io-wq backend sharing between "sibling" rings
- Support for registering personalities
- Lots of little fixes and improvements
* tag 'for-5.6/io_uring-vfs-2020-01-29' of git://git.kernel.dk/linux-block: (64 commits)
io_uring: add support for epoll_ctl(2)
eventpoll: support non-blocking do_epoll_ctl() calls
eventpoll: abstract out epoll_ctl() handler
io_uring: fix linked command file table usage
io_uring: support using a registered personality for commands
io_uring: allow registering credentials
io_uring: add io-wq workqueue sharing
io-wq: allow grabbing existing io-wq
io_uring/io-wq: don't use static creds/mm assignments
io-wq: make the io_wq ref counted
io_uring: fix refcounting with batched allocations at OOM
io_uring: add comment for drain_next
io_uring: don't attempt to copy iovec for READ/WRITE
io_uring: honor IOSQE_ASYNC for linked reqs
io_uring: prep req when do IOSQE_ASYNC
io_uring: use labeled array init in io_op_defs
io_uring: optimise sqe-to-req flags translation
io_uring: remove REQ_F_IO_DRAINED
io_uring: file switch work needs to get flushed on exit
io_uring: hide uring_fd in ctx
...
|
|
Pull SCSI updates from James Bottomley:
"This series is slightly unusual because it includes Arnd's compat
ioctl tree here:
1c46a2cf2dbd Merge tag 'block-ioctl-cleanup-5.6' into 5.6/scsi-queue
Excluding Arnd's changes, this is mostly an update of the usual
drivers: megaraid_sas, mpt3sas, qla2xxx, ufs, lpfc, hisi_sas.
There are a couple of core and base updates around error propagation
and atomicity in the attribute container base we use for the SCSI
transport classes.
The rest is minor changes and updates"
* tag 'scsi-misc' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi: (149 commits)
scsi: hisi_sas: Rename hisi_sas_cq.pci_irq_mask
scsi: hisi_sas: Add prints for v3 hw interrupt converge and automatic affinity
scsi: hisi_sas: Modify the file permissions of trigger_dump to write only
scsi: hisi_sas: Replace magic number when handle channel interrupt
scsi: hisi_sas: replace spin_lock_irqsave/spin_unlock_restore with spin_lock/spin_unlock
scsi: hisi_sas: use threaded irq to process CQ interrupts
scsi: ufs: Use UFS device indicated maximum LU number
scsi: ufs: Add max_lu_supported in struct ufs_dev_info
scsi: ufs: Delete is_init_prefetch from struct ufs_hba
scsi: ufs: Inline two functions into their callers
scsi: ufs: Move ufshcd_get_max_pwr_mode() to ufshcd_device_params_init()
scsi: ufs: Split ufshcd_probe_hba() based on its called flow
scsi: ufs: Delete struct ufs_dev_desc
scsi: ufs: Fix ufshcd_probe_hba() reture value in case ufshcd_scsi_add_wlus() fails
scsi: ufs-mediatek: enable low-power mode for hibern8 state
scsi: ufs: export some functions for vendor usage
scsi: ufs-mediatek: add dbg_register_dump implementation
scsi: qla2xxx: Fix a NULL pointer dereference in an error path
scsi: qla1280: Make checking for 64bit support consistent
scsi: megaraid_sas: Update driver version to 07.713.01.00-rc1
...
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/shuah/linux-kselftest
Pull Kselftest kunit updates from Shuah Khan:
"This kunit update consists of:
- Support for building kunit as a module from Alan Maguire
- AppArmor KUnit tests for policy unpack from Mike Salvatore"
* tag 'linux-kselftest-5.6-rc1-kunit' of git://git.kernel.org/pub/scm/linux/kernel/git/shuah/linux-kselftest:
kunit: building kunit as a module breaks allmodconfig
kunit: update documentation to describe module-based build
kunit: allow kunit to be loaded as a module
kunit: remove timeout dependence on sysctl_hung_task_timeout_seconds
kunit: allow kunit tests to be loaded as a module
kunit: hide unexported try-catch interface in try-catch-impl.h
kunit: move string-stream.h to lib/kunit
apparmor: add AppArmor KUnit tests for policy unpack
|
|
git://git.kernel.org:/pub/scm/linux/kernel/git/arnd/playground
Pull y2038 updates from Arnd Bergmann:
"Core, driver and file system changes
These are updates to device drivers and file systems that for some
reason or another were not included in the kernel in the previous
y2038 series.
I've gone through all users of time_t again to make sure the kernel is
in a long-term maintainable state, replacing all remaining references
to time_t with safe alternatives.
Some related parts of the series were picked up into the nfsd, xfs,
alsa and v4l2 trees. A final set of patches in linux-mm removes the
now unused time_t/timeval/timespec types and helper functions after
all five branches are merged for linux-5.6, ensuring that no new users
get merged.
As a result, linux-5.6, or my backport of the patches to 5.4 [1],
should be the first release that can serve as a base for a 32-bit
system designed to run beyond year 2038, with a few remaining caveats:
- All user space must be compiled with a 64-bit time_t, which will be
supported in the coming musl-1.2 and glibc-2.32 releases, along
with installed kernel headers from linux-5.6 or higher.
- Applications that use the system call interfaces directly need to
be ported to use the time64 syscalls added in linux-5.1 in place of
the existing system calls. This impacts most users of futex() and
seccomp() as well as programming languages that have their own
runtime environment not based on libc.
- Applications that use a private copy of kernel uapi header files or
their contents may need to update to the linux-5.6 version, in
particular for sound/asound.h, xfs/xfs_fs.h, linux/input.h,
linux/elfcore.h, linux/sockios.h, linux/timex.h and
linux/can/bcm.h.
- A few remaining interfaces cannot be changed to pass a 64-bit
time_t in a compatible way, so they must be configured to use
CLOCK_MONOTONIC times or (with a y2106 problem) unsigned 32-bit
timestamps. Most importantly this impacts all users of 'struct
input_event'.
- All y2038 problems that are present on 64-bit machines also apply
to 32-bit machines. In particular this affects file systems with
on-disk timestamps using signed 32-bit seconds: ext4 with
ext3-style small inodes, ext2, xfs (to be fixed soon) and ufs"
[1] https://git.kernel.org/pub/scm/linux/kernel/git/arnd/playground.git/log/?h=y2038-endgame
* tag 'y2038-drivers-for-v5.6-signed' of git://git.kernel.org:/pub/scm/linux/kernel/git/arnd/playground: (21 commits)
Revert "drm/etnaviv: reject timeouts with tv_nsec >= NSEC_PER_SEC"
y2038: sh: remove timeval/timespec usage from headers
y2038: sparc: remove use of struct timex
y2038: rename itimerval to __kernel_old_itimerval
y2038: remove obsolete jiffies conversion functions
nfs: fscache: use timespec64 in inode auxdata
nfs: fix timstamp debug prints
nfs: use time64_t internally
sunrpc: convert to time64_t for expiry
drm/etnaviv: avoid deprecated timespec
drm/etnaviv: reject timeouts with tv_nsec >= NSEC_PER_SEC
drm/msm: avoid using 'timespec'
hfs/hfsplus: use 64-bit inode timestamps
hostfs: pass 64-bit timestamps to/from user space
packet: clarify timestamp overflow
tsacct: add 64-bit btime field
acct: stop using get_seconds()
um: ubd: use 64-bit time_t where possible
xtensa: ISS: avoid struct timeval
dlm: use SO_SNDTIMEO_NEW instead of SO_SNDTIMEO_OLD
...
|
|
This adds IORING_OP_EPOLL_CTL, which can perform the same work as the
epoll_ctl(2) system call.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Also make it available outside of epoll, along with the helper that
decides if we need to copy the passed in epoll_event.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
No functional changes in this patch.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
We're not consistent in how the file table is grabbed and assigned if we
have a command linked that requires the use of it.
Add ->file_table to the io_op_defs[] array, and use that to determine
when to grab the table instead of having the handlers set it if they
need to defer. This also means we can kill the IO_WQ_WORK_NEEDS_FILES
flag. We always initialize work->files, so io-wq can just check for
that.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/xiang/erofs
Pull erofs updates from Gao Xiang:
"A regression fix, several cleanups and (maybe) plus an upcoming new
mount api convert patch as a part of vfs update are considered
available for this cycle.
All commits have been in linux-next and tested with no smoke out.
Summary:
- fix an out-of-bound read access introduced in v5.3, which could
rarely cause data corruption
- various cleanup patches"
* tag 'erofs-for-5.6-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/xiang/erofs:
erofs: clean up z_erofs_submit_queue()
erofs: fold in postsubmit_is_all_bypassed()
erofs: fix out-of-bound read for shifted uncompressed block
erofs: remove void tagging/untagging of workgroup pointers
erofs: remove unused tag argument while registering a workgroup
erofs: remove unused tag argument while finding a workgroup
erofs: correct indentation of an assigned structure inside a function
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull adfs updates from Al Viro:
"adfs stuff for this cycle"
* 'work.adfs' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (42 commits)
fs/adfs: bigdir: Fix an error code in adfs_fplus_read()
Documentation: update adfs filesystem documentation
fs/adfs: mostly divorse inode number from indirect disc address
fs/adfs: super: add support for E and E+ floppy image formats
fs/adfs: super: extract filesystem block probe
fs/adfs: dir: remove debug in adfs_dir_update()
fs/adfs: super: fix inode dropping
fs/adfs: bigdir: implement directory update support
fs/adfs: bigdir: calculate and validate directory checkbyte
fs/adfs: bigdir: directory validation strengthening
fs/adfs: bigdir: extract directory validation
fs/adfs: bigdir: factor out directory entry offset calculation
fs/adfs: newdir: split out directory commit from update
fs/adfs: newdir: clean up adfs_f_update()
fs/adfs: newdir: merge adfs_dir_read() into adfs_f_read()
fs/adfs: newdir: improve directory validation
fs/adfs: newdir: factor out directory format validation
fs/adfs: dir: use pointers to access directory head/tails
fs/adfs: dir: add more efficient iterate() per-format method
fs/adfs: dir: switch to iterate_shared method
...
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull openat2 support from Al Viro:
"This is the openat2() series from Aleksa Sarai.
I'm afraid that the rest of namei stuff will have to wait - it got
zero review the last time I'd posted #work.namei, and there had been a
leak in the posted series I'd caught only last weekend. I was going to
repost it on Monday, but the window opened and the odds of getting any
review during that... Oh, well.
Anyway, openat2 part should be ready; that _did_ get sane amount of
review and public testing, so here it comes"
From Aleksa's description of the series:
"For a very long time, extending openat(2) with new features has been
incredibly frustrating. This stems from the fact that openat(2) is
possibly the most famous counter-example to the mantra "don't silently
accept garbage from userspace" -- it doesn't check whether unknown
flags are present[1].
This means that (generally) the addition of new flags to openat(2) has
been fraught with backwards-compatibility issues (O_TMPFILE has to be
defined as __O_TMPFILE|O_DIRECTORY|[O_RDWR or O_WRONLY] to ensure old
kernels gave errors, since it's insecure to silently ignore the
flag[2]). All new security-related flags therefore have a tough road
to being added to openat(2).
Furthermore, the need for some sort of control over VFS's path
resolution (to avoid malicious paths resulting in inadvertent
breakouts) has been a very long-standing desire of many userspace
applications.
This patchset is a revival of Al Viro's old AT_NO_JUMPS[3] patchset
(which was a variant of David Drysdale's O_BENEATH patchset[4] which
was a spin-off of the Capsicum project[5]) with a few additions and
changes made based on the previous discussion within [6] as well as
others I felt were useful.
In line with the conclusions of the original discussion of
AT_NO_JUMPS, the flag has been split up into separate flags. However,
instead of being an openat(2) flag it is provided through a new
syscall openat2(2) which provides several other improvements to the
openat(2) interface (see the patch description for more details). The
following new LOOKUP_* flags are added:
LOOKUP_NO_XDEV:
Blocks all mountpoint crossings (upwards, downwards, or through
absolute links). Absolute pathnames alone in openat(2) do not
trigger this. Magic-link traversal which implies a vfsmount jump is
also blocked (though magic-link jumps on the same vfsmount are
permitted).
LOOKUP_NO_MAGICLINKS:
Blocks resolution through /proc/$pid/fd-style links. This is done
by blocking the usage of nd_jump_link() during resolution in a
filesystem. The term "magic-links" is used to match with the only
reference to these links in Documentation/, but I'm happy to change
the name.
It should be noted that this is different to the scope of
~LOOKUP_FOLLOW in that it applies to all path components. However,
you can do openat2(NO_FOLLOW|NO_MAGICLINKS) on a magic-link and it
will *not* fail (assuming that no parent component was a
magic-link), and you will have an fd for the magic-link.
In order to correctly detect magic-links, the introduction of a new
LOOKUP_MAGICLINK_JUMPED state flag was required.
LOOKUP_BENEATH:
Disallows escapes to outside the starting dirfd's
tree, using techniques such as ".." or absolute links. Absolute
paths in openat(2) are also disallowed.
Conceptually this flag is to ensure you "stay below" a certain
point in the filesystem tree -- but this requires some additional
to protect against various races that would allow escape using
"..".
Currently LOOKUP_BENEATH implies LOOKUP_NO_MAGICLINKS, because it
can trivially beam you around the filesystem (breaking the
protection). In future, there might be similar safety checks done
as in LOOKUP_IN_ROOT, but that requires more discussion.
In addition, two new flags are added that expand on the above ideas:
LOOKUP_NO_SYMLINKS:
Does what it says on the tin. No symlink resolution is allowed at
all, including magic-links. Just as with LOOKUP_NO_MAGICLINKS this
can still be used with NOFOLLOW to open an fd for the symlink as
long as no parent path had a symlink component.
LOOKUP_IN_ROOT:
This is an extension of LOOKUP_BENEATH that, rather than blocking
attempts to move past the root, forces all such movements to be
scoped to the starting point. This provides chroot(2)-like
protection but without the cost of a chroot(2) for each filesystem
operation, as well as being safe against race attacks that
chroot(2) is not.
If a race is detected (as with LOOKUP_BENEATH) then an error is
generated, and similar to LOOKUP_BENEATH it is not permitted to
cross magic-links with LOOKUP_IN_ROOT.
The primary need for this is from container runtimes, which
currently need to do symlink scoping in userspace[7] when opening
paths in a potentially malicious container.
There is a long list of CVEs that could have bene mitigated by
having RESOLVE_THIS_ROOT (such as CVE-2017-1002101,
CVE-2017-1002102, CVE-2018-15664, and CVE-2019-5736, just to name a
few).
In order to make all of the above more usable, I'm working on
libpathrs[8] which is a C-friendly library for safe path resolution.
It features a userspace-emulated backend if the kernel doesn't support
openat2(2). Hopefully we can get userspace to switch to using it, and
thus get openat2(2) support for free once it's ready.
Future work would include implementing things like
RESOLVE_NO_AUTOMOUNT and possibly a RESOLVE_NO_REMOTE (to allow
programs to be sure they don't hit DoSes though stale NFS handles)"
* 'work.openat2' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
Documentation: path-lookup: include new LOOKUP flags
selftests: add openat2(2) selftests
open: introduce openat2(2) syscall
namei: LOOKUP_{IN_ROOT,BENEATH}: permit limited ".." resolution
namei: LOOKUP_IN_ROOT: chroot-like scoped resolution
namei: LOOKUP_BENEATH: O_BENEATH-like scoped resolution
namei: LOOKUP_NO_XDEV: block mountpoint crossing
namei: LOOKUP_NO_MAGICLINKS: block magic-link resolution
namei: LOOKUP_NO_SYMLINKS: block symlink resolution
namei: allow set_root() to produce errors
namei: allow nd_jump_link() to produce errors
nsfs: clean-up ns_get_path() signature to return int
namei: only return -ECHILD from follow_dotdot_rcu()
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core
Pull driver core updates from Greg KH:
"Here is a small set of changes for 5.6-rc1 for the driver core and
some firmware subsystem changes.
Included in here are:
- device.h splitup like you asked for months ago
- devtmpfs minor cleanups
- firmware core minor changes
- debugfs fix for lockdown mode
- kernfs cleanup fix
- cpu topology minor fix
All of these have been in linux-next for a while with no reported
issues"
* tag 'driver-core-5.6-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core: (22 commits)
firmware: Rename FW_OPT_NOFALLBACK to FW_OPT_NOFALLBACK_SYSFS
devtmpfs: factor out common tail of devtmpfs_{create,delete}_node
devtmpfs: initify a bit
devtmpfs: simplify initialization of mount_dev
devtmpfs: factor out setup part of devtmpfsd()
devtmpfs: fix theoretical stale pointer deref in devtmpfsd()
driver core: platform: fix u32 greater or equal to zero comparison
cpu-topology: Don't error on more than CONFIG_NR_CPUS CPUs in device tree
debugfs: Return -EPERM when locked down
driver core: Print device when resources present in really_probe()
driver core: Fix test_async_driver_probe if NUMA is disabled
driver core: platform: Prevent resouce overflow from causing infinite loops
fs/kernfs/dir.c: Clean code by removing always true condition
component: do not dereference opaque pointer in debugfs
drivers/component: remove modular code
debugfs: Fix warnings when building documentation
device.h: move 'struct driver' stuff out to device/driver.h
device.h: move 'struct class' stuff out to device/class.h
device.h: move 'struct bus' stuff out to device/bus.h
device.h: move dev_printk()-like functions to dev_printk.h
...
|
|
For personalities previously registered via IORING_REGISTER_PERSONALITY,
allow any command to select them. This is done through setting
sqe->personality to the id returned from registration, and then flagging
sqe->flags with IOSQE_PERSONALITY.
Reviewed-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
If an application wants to use a ring with different kinds of
credentials, it can register them upfront. We don't lookup credentials,
the credentials of the task calling IORING_REGISTER_PERSONALITY is used.
An 'id' is returned for the application to use in subsequent personality
support.
Reviewed-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
If IORING_SETUP_ATTACH_WQ is set, it expects wq_fd in io_uring_params to
be a valid io_uring fd io-wq of which will be shared with the newly
created io_uring instance. If the flag is set but it can't share io-wq,
it fails.
This allows creation of "sibling" io_urings, where we prefer to keep the
SQ/CQ private, but want to share the async backend to minimize the amount
of overhead associated with having multiple rings that belong to the same
backend.
Reported-by: Jens Axboe <axboe@kernel.dk>
Reported-by: Daurnimator <quae@daurnimator.com>
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Export a helper to attach to an existing io-wq, rather than setting up
a new one. This is doable now that we have reference counted io_wq's.
Reported-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
We currently setup the io_wq with a static set of mm and creds. Even for
a single-use io-wq per io_uring, this is suboptimal as we have may have
multiple enters of the ring. For sharing the io-wq backend, it doesn't
work at all.
Switch to passing in the creds and mm when the work item is setup. This
means that async work is no longer deferred to the io_uring mm and creds,
it is done with the current mm and creds.
Flag this behavior with IORING_FEAT_CUR_PERSONALITY, so applications know
they can rely on the current personality (mm and creds) being the same
for direct issue and async issue.
Reviewed-by: Stefan Metzmacher <metze@samba.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6
Pull crypto updates from Herbert Xu:
"API:
- Removed CRYPTO_TFM_RES flags
- Extended spawn grabbing to all algorithm types
- Moved hash descsize verification into API code
Algorithms:
- Fixed recursive pcrypt dead-lock
- Added new 32 and 64-bit generic versions of poly1305
- Added cryptogams implementation of x86/poly1305
Drivers:
- Added support for i.MX8M Mini in caam
- Added support for i.MX8M Nano in caam
- Added support for i.MX8M Plus in caam
- Added support for A33 variant of SS in sun4i-ss
- Added TEE support for Raven Ridge in ccp
- Added in-kernel API to submit TEE commands in ccp
- Added AMD-TEE driver
- Added support for BCM2711 in iproc-rng200
- Added support for AES256-GCM based ciphers for chtls
- Added aead support on SEC2 in hisilicon"
* 'linus' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6: (244 commits)
crypto: arm/chacha - fix build failured when kernel mode NEON is disabled
crypto: caam - add support for i.MX8M Plus
crypto: x86/poly1305 - emit does base conversion itself
crypto: hisilicon - fix spelling mistake "disgest" -> "digest"
crypto: chacha20poly1305 - add back missing test vectors and test chunking
crypto: x86/poly1305 - fix .gitignore typo
tee: fix memory allocation failure checks on drv_data and amdtee
crypto: ccree - erase unneeded inline funcs
crypto: ccree - make cc_pm_put_suspend() void
crypto: ccree - split overloaded usage of irq field
crypto: ccree - fix PM race condition
crypto: ccree - fix FDE descriptor sequence
crypto: ccree - cc_do_send_request() is void func
crypto: ccree - fix pm wrongful error reporting
crypto: ccree - turn errors to debug msgs
crypto: ccree - fix AEAD decrypt auth fail
crypto: ccree - fix typo in comment
crypto: ccree - fix typos in error msgs
crypto: atmel-{aes,sha,tdes} - Retire crypto_platform_data
crypto: x86/sha - Eliminate casts on asm implementations
...
|
|
git://git.samba.org/sfrench/cifs-2.6
Pull cifs updates from Steve French:
"Various SMB3/CIFS fixes including four for stable.
- Improvement to fallocate (enables 3 additional xfstests)
- Fix for file creation when mounting with modefromsid
- Add ability to backup/restore dos attributes and creation time
- DFS failover and reconnect fixes
- performance optimization for readir
Note that due to the upcoming SMB3 Test Event (at SNIA SDC next week)
there will likely be more changesets near the end of the merge window
(since we will be testing heavily next week, I held off on some
patches and I expect some additional multichannel patches as well as
patches to enable some additional xfstests)"
* tag '5.6-smb3-fixes-and-dfs-and-readdir-improvements' of git://git.samba.org/sfrench/cifs-2.6: (24 commits)
CIFS: Fix task struct use-after-free on reconnect
cifs: use PTR_ERR_OR_ZERO() to simplify code
cifs: add support for fallocate mode 0 for non-sparse files
cifs: fix NULL dereference in match_prepath
smb3: fix default permissions on new files when mounting with modefromsid
CIFS: Add support for setting owner info, dos attributes, and create time
cifs: remove set but not used variable 'server'
cifs: Fix memory allocation in __smb2_handle_cancelled_cmd()
cifs: Fix mount options set in automount
cifs: fix unitialized variable poential problem with network I/O cache lock patch
cifs: Fix return value in __update_cache_entry
cifs: Avoid doing network I/O while holding cache lock
cifs: Fix potential deadlock when updating vol in cifs_reconnect()
cifs: Merge is_path_valid() into get_normalized_path()
cifs: Introduce helpers for finding TCP connection
cifs: Get rid of kstrdup_const()'d paths
cifs: Clean up DFS referral cache
cifs: Don't use iov_iter::type directly
cifs: set correct max-buffer-size for smb2_ioctl_init()
cifs: use compounding for open and first query-dir for readdir()
...
|
|
git://git.kernel.org/pub/scm/fs/fscrypt/fscrypt
Pull fsverity updates from Eric Biggers:
- Optimize fs-verity sequential read performance by implementing
readahead of Merkle tree pages. This allows the Merkle tree to be
read in larger chunks.
- Optimize FS_IOC_ENABLE_VERITY performance in the uncached case by
implementing readahead of data pages.
- Allocate the hash requests from a mempool in order to eliminate the
possibility of allocation failures during I/O.
* tag 'fsverity-for-linus' of git://git.kernel.org/pub/scm/fs/fscrypt/fscrypt:
fs-verity: use u64_to_user_ptr()
fs-verity: use mempool for hash requests
fs-verity: implement readahead of Merkle tree pages
fs-verity: implement readahead for FS_IOC_ENABLE_VERITY
|
|
Pull fscrypt updates from Eric Biggers:
- Extend the FS_IOC_ADD_ENCRYPTION_KEY ioctl to allow the raw key to be
provided via a keyring key.
- Prepare for the new dirhash method (SipHash of plaintext name) that
will be used by directories that are both encrypted and casefolded.
- Switch to a new format for "no-key names" that prepares for the new
dirhash method, and also fixes a longstanding bug where multiple
filenames could map to the same no-key name.
- Allow the crypto algorithms used by fscrypt to be built as loadable
modules when the fscrypt-capable filesystems are.
- Optimize fscrypt_zeroout_range().
- Various cleanups.
* tag 'fscrypt-for-linus' of git://git.kernel.org/pub/scm/fs/fscrypt/fscrypt: (26 commits)
fscrypt: improve format of no-key names
ubifs: allow both hash and disk name to be provided in no-key names
ubifs: don't trigger assertion on invalid no-key filename
fscrypt: clarify what is meant by a per-file key
fscrypt: derive dirhash key for casefolded directories
fscrypt: don't allow v1 policies with casefolding
fscrypt: add "fscrypt_" prefix to fname_encrypt()
fscrypt: don't print name of busy file when removing key
ubifs: use IS_ENCRYPTED() instead of ubifs_crypt_is_encrypted()
fscrypt: document gfp_flags for bounce page allocation
fscrypt: optimize fscrypt_zeroout_range()
fscrypt: remove redundant bi_status check
fscrypt: Allow modular crypto algorithms
fscrypt: include <linux/ioctl.h> in UAPI header
fscrypt: don't check for ENOKEY from fscrypt_get_encryption_info()
fscrypt: remove fscrypt_is_direct_key_policy()
fscrypt: move fscrypt_valid_enc_modes() to policy.c
fscrypt: check for appropriate use of DIRECT_KEY flag earlier
fscrypt: split up fscrypt_supported_policy() by policy version
fscrypt: introduce fscrypt_needs_contents_encryption()
...
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull fs deduplication fix from David Sterba:
"This is a fix for deduplication bug: the last block of two files is
allowed to deduplicated. This got broken in 5.1 by lifting some
generic checks to VFS layer. The affected filesystems are btrfs and
xfs.
The patches are marked for stable as the bug decreases deduplication
effectivity"
* tag 'fs-dedupe-last-block-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
Btrfs: make deduplication with range including the last block work
fs: allow deduplication of eof block into the end of the destination file
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs updates from David Sterba:
"Features, highlights:
- async discard
- "mount -o discard=async" to enable it
- freed extents are not discarded immediatelly, but grouped
together and trimmed later, with IO rate limiting
- the "sync" mode submits short extents that could have been
ignored completely by the device, for SATA prior to 3.1 the
requests are unqueued and have a big impact on performance
- the actual discard IO requests have been moved out of
transaction commit to a worker thread, improving commit latency
- IO rate and request size can be tuned by sysfs files, for now
enabled only with CONFIG_BTRFS_DEBUG as we might need to
add/delete the files and don't have a stable-ish ABI for
general use, defaults are conservative
- export device state info in sysfs, eg. missing, writeable
- no discard of extents known to be untouched on disk (eg. after
reservation)
- device stats reset is logged with process name and PID that called
the ioctl
Fixes:
- fix missing hole after hole punching and fsync when using NO_HOLES
- writeback: range cyclic mode could miss some dirty pages and lead
to OOM
- two more corner cases for metadata_uuid change after power loss
during the change
- fix infinite loop during fsync after mix of rename operations
Core changes:
- qgroup assign returns ENOTCONN when quotas not enabled, used to
return EINVAL that was confusing
- device closing does not need to allocate memory anymore
- snapshot aware code got removed, disabled for years due to
performance problems, reimplmentation will allow to select wheter
defrag breaks or does not break COW on shared extents
- tree-checker:
- check leaf chunk item size, cross check against number of
stripes
- verify location keys for DIR_ITEM, DIR_INDEX and XATTR items
- new self test for physical -> logical mapping code, used for super
block range exclusion
- assertion helpers/macros updated to avoid objtool "unreachable
code" reports on older compilers or config option combinations"
* tag 'for-5.6-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: (84 commits)
btrfs: free block groups after free'ing fs trees
btrfs: Fix split-brain handling when changing FSID to metadata uuid
btrfs: Handle another split brain scenario with metadata uuid feature
btrfs: Factor out metadata_uuid code from find_fsid.
btrfs: Call find_fsid from find_fsid_inprogress
Btrfs: fix infinite loop during fsync after rename operations
btrfs: set trans->drity in btrfs_commit_transaction
btrfs: drop log root for dropped roots
btrfs: sysfs, add devid/dev_state kobject and device attributes
btrfs: Refactor btrfs_rmap_block to improve readability
btrfs: Add self-tests for btrfs_rmap_block
btrfs: selftests: Add support for dummy devices
btrfs: Move and unexport btrfs_rmap_block
btrfs: separate definition of assertion failure handlers
btrfs: device stats, log when stats are zeroed
btrfs: fix improper setting of scanned for range cyclic write cache pages
btrfs: safely advance counter when looking up bio csums
btrfs: remove unused member btrfs_device::work
btrfs: remove unnecessary wrapper get_alloc_profile
btrfs: add correction to handle -1 edge case in async discard
...
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 resource control updates from Ingo Molnar:
"The main change in this tree is the extension of the resctrl procfs
ABI with a new file that helps tooling to navigate from tasks back to
resctrl groups: /proc/{pid}/cpu_resctrl_groups.
Also fix static key usage for certain feature combinations and
simplify the task exit resctrl case"
* 'x86-cache-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/resctrl: Add task resctrl information display
x86/resctrl: Check monitoring static key in the MBM overflow handler
x86/resctrl: Do not reconfigure exiting tasks
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler updates from Ingo Molnar:
"These were the main changes in this cycle:
- More -rt motivated separation of CONFIG_PREEMPT and
CONFIG_PREEMPTION.
- Add more low level scheduling topology sanity checks and warnings
to filter out nonsensical topologies that break scheduling.
- Extend uclamp constraints to influence wakeup CPU placement
- Make the RT scheduler more aware of asymmetric topologies and CPU
capacities, via uclamp metrics, if CONFIG_UCLAMP_TASK=y
- Make idle CPU selection more consistent
- Various fixes, smaller cleanups, updates and enhancements - please
see the git log for details"
* 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (58 commits)
sched/fair: Define sched_idle_cpu() only for SMP configurations
sched/topology: Assert non-NUMA topology masks don't (partially) overlap
idle: fix spelling mistake "iterrupts" -> "interrupts"
sched/fair: Remove redundant call to cpufreq_update_util()
sched/psi: create /proc/pressure and /proc/pressure/{io|memory|cpu} only when psi enabled
sched/fair: Fix sgc->{min,max}_capacity calculation for SD_OVERLAP
sched/fair: calculate delta runnable load only when it's needed
sched/cputime: move rq parameter in irqtime_account_process_tick
stop_machine: Make stop_cpus() static
sched/debug: Reset watchdog on all CPUs while processing sysrq-t
sched/core: Fix size of rq::uclamp initialization
sched/uclamp: Fix a bug in propagating uclamp value in new cgroups
sched/fair: Load balance aggressively for SCHED_IDLE CPUs
sched/fair : Improve update_sd_pick_busiest for spare capacity case
watchdog: Remove soft_lockup_hrtimer_cnt and related code
sched/rt: Make RT capacity-aware
sched/fair: Make EAS wakeup placement consider uclamp restrictions
sched/fair: Make task_fits_capacity() consider uclamp restrictions
sched/uclamp: Rename uclamp_util_with() into uclamp_rq_util_with()
sched/uclamp: Make uclamp util helpers use and return UL values
...
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull perf updates from Ingo Molnar:
"Kernel side changes:
- Ftrace is one of the last W^X violators (after this only KLP is
left). These patches move it over to the generic text_poke()
interface and thereby get rid of this oddity. This requires a
surprising amount of surgery, by Peter Zijlstra.
- x86/AMD PMUs: add support for 'Large Increment per Cycle Events' to
count certain types of events that have a special, quirky hw ABI
(by Kim Phillips)
- kprobes fixes by Masami Hiramatsu
Lots of tooling updates as well, the following subcommands were
updated: annotate/report/top, c2c, clang, record, report/top TUI,
sched timehist, tests; plus updates were done to the gtk ui, libperf,
headers and the parser"
* 'perf-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (57 commits)
perf/x86/amd: Add support for Large Increment per Cycle Events
perf/x86/amd: Constrain Large Increment per Cycle events
perf/x86/intel/rapl: Add Comet Lake support
tracing: Initialize ret in syscall_enter_define_fields()
perf header: Use last modification time for timestamp
perf c2c: Fix return type for histogram sorting comparision functions
perf beauty sockaddr: Fix augmented syscall format warning
perf/ui/gtk: Fix gtk2 build
perf ui gtk: Add missing zalloc object
perf tools: Use %define api.pure full instead of %pure-parser
libperf: Setup initial evlist::all_cpus value
perf report: Fix no libunwind compiled warning break s390 issue
perf tools: Support --prefix/--prefix-strip
perf report: Clarify in help that --children is default
tools build: Fix test-clang.cpp with Clang 8+
perf clang: Fix build with Clang 9
kprobes: Fix optimize_kprobe()/unoptimize_kprobe() cancellation logic
tools lib: Fix builds when glibc contains strlcpy()
perf report/top: Make 'e' visible in the help and make it toggle showing callchains
perf report/top: Do not offer annotation for symbols without samples
...
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull core SMP updates from Thomas Gleixner:
"A small set of SMP core code changes:
- Rework the smp function call core code to avoid the allocation of
an additional cpumask
- Remove the not longer required GFP argument from on_each_cpu_cond()
and on_each_cpu_cond_mask() and fixup the callers"
* tag 'smp-core-2020-01-28' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
smp: Remove allocation mask from on_each_cpu_cond.*()
smp: Add a smp_cond_func_t argument to smp_call_function_many()
smp: Use smp_cond_func_t as type for the conditional function
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull timer updates from Thomas Gleixner:
"The timekeeping and timers departement provides:
- Time namespace support:
If a container migrates from one host to another then it expects
that clocks based on MONOTONIC and BOOTTIME are not subject to
disruption. Due to different boot time and non-suspended runtime
these clocks can differ significantly on two hosts, in the worst
case time goes backwards which is a violation of the POSIX
requirements.
The time namespace addresses this problem. It allows to set offsets
for clock MONOTONIC and BOOTTIME once after creation and before
tasks are associated with the namespace. These offsets are taken
into account by timers and timekeeping including the VDSO.
Offsets for wall clock based clocks (REALTIME/TAI) are not provided
by this mechanism. While in theory possible, the overhead and code
complexity would be immense and not justified by the esoteric
potential use cases which were discussed at Plumbers '18.
The overhead for tasks in the root namespace (ie where host time
offsets = 0) is in the noise and great effort was made to ensure
that especially in the VDSO. If time namespace is disabled in the
kernel configuration the code is compiled out.
Kudos to Andrei Vagin and Dmitry Sofanov who implemented this
feature and kept on for more than a year addressing review
comments, finding better solutions. A pleasant experience.
- Overhaul of the alarmtimer device dependency handling to ensure
that the init/suspend/resume ordering is correct.
- A new clocksource/event driver for Microchip PIT64
- Suspend/resume support for the Hyper-V clocksource
- The usual pile of fixes, updates and improvements mostly in the
driver code"
* tag 'timers-core-2020-01-27' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (71 commits)
alarmtimer: Make alarmtimer_get_rtcdev() a stub when CONFIG_RTC_CLASS=n
alarmtimer: Use wakeup source from alarmtimer platform device
alarmtimer: Make alarmtimer platform device child of RTC device
alarmtimer: Update alarmtimer_get_rtcdev() docs to reflect reality
hrtimer: Add missing sparse annotation for __run_timer()
lib/vdso: Only read hrtimer_res when needed in __cvdso_clock_getres()
MIPS: vdso: Define BUILD_VDSO32 when building a 32bit kernel
clocksource/drivers/hyper-v: Set TSC clocksource as default w/ InvariantTSC
clocksource/drivers/hyper-v: Untangle stimers and timesync from clocksources
clocksource/drivers/timer-microchip-pit64b: Fix sparse warning
clocksource/drivers/exynos_mct: Rename Exynos to lowercase
clocksource/drivers/timer-ti-dm: Fix uninitialized pointer access
clocksource/drivers/timer-ti-dm: Switch to platform_get_irq
clocksource/drivers/timer-ti-dm: Convert to devm_platform_ioremap_resource
clocksource/drivers/em_sti: Fix variable declaration in em_sti_probe
clocksource/drivers/em_sti: Convert to devm_platform_ioremap_resource
clocksource/drivers/bcm2835_timer: Fix memory leak of timer
clocksource/drivers/cadence-ttc: Use ttc driver as platform driver
clocksource/drivers/timer-microchip-pit64b: Add Microchip PIT64B support
clocksource/drivers/hyper-v: Reserve PAGE_SIZE space for tsc page
...
|
|
In preparation for sharing an io-wq across different users, add a
reference count that manages destruction of it.
Reviewed-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
In case of out of memory the second argument of percpu_ref_put_many() in
io_submit_sqes() may evaluate into "nr - (-EAGAIN)", that is clearly
wrong.
Fixes: 2b85edfc0c90 ("io_uring: batch getting pcpu references")
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Draining the middle of a link is tricky, so leave a comment there
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
For the non-vectored variant of READV/WRITEV, we don't need to setup an
async io context, and we flag that appropriately in the io_op_defs
array. However, in fixing this for the 5.5 kernel in commit 74566df3a71c
we didn't have these opcodes, so the check there was added just for the
READ_FIXED and WRITE_FIXED opcodes. Replace that check with just a
single check for needing async context, that covers all four of these
read/write variants that don't use an iovec.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
The task which created the MID may be gone by the time cifsd attempts to
call the callbacks on MIDs from cifs_reconnect().
This leads to a use-after-free of the task struct in cifs_wake_up_task:
==================================================================
BUG: KASAN: use-after-free in __lock_acquire+0x31a0/0x3270
Read of size 8 at addr ffff8880103e3a68 by task cifsd/630
CPU: 0 PID: 630 Comm: cifsd Not tainted 5.5.0-rc6+ #119
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1 04/01/2014
Call Trace:
dump_stack+0x8e/0xcb
print_address_description.constprop.5+0x1d3/0x3c0
? __lock_acquire+0x31a0/0x3270
__kasan_report+0x152/0x1aa
? __lock_acquire+0x31a0/0x3270
? __lock_acquire+0x31a0/0x3270
kasan_report+0xe/0x20
__lock_acquire+0x31a0/0x3270
? __wake_up_common+0x1dc/0x630
? _raw_spin_unlock_irqrestore+0x4c/0x60
? mark_held_locks+0xf0/0xf0
? _raw_spin_unlock_irqrestore+0x39/0x60
? __wake_up_common_lock+0xd5/0x130
? __wake_up_common+0x630/0x630
lock_acquire+0x13f/0x330
? try_to_wake_up+0xa3/0x19e0
_raw_spin_lock_irqsave+0x38/0x50
? try_to_wake_up+0xa3/0x19e0
try_to_wake_up+0xa3/0x19e0
? cifs_compound_callback+0x178/0x210
? set_cpus_allowed_ptr+0x10/0x10
cifs_reconnect+0xa1c/0x15d0
? generic_ip_connect+0x1860/0x1860
? rwlock_bug.part.0+0x90/0x90
cifs_readv_from_socket+0x479/0x690
cifs_read_from_socket+0x9d/0xe0
? cifs_readv_from_socket+0x690/0x690
? mempool_resize+0x690/0x690
? rwlock_bug.part.0+0x90/0x90
? memset+0x1f/0x40
? allocate_buffers+0xff/0x340
cifs_demultiplex_thread+0x388/0x2a50
? cifs_handle_standard+0x610/0x610
? rcu_read_lock_held_common+0x120/0x120
? mark_lock+0x11b/0xc00
? __lock_acquire+0x14ed/0x3270
? __kthread_parkme+0x78/0x100
? lockdep_hardirqs_on+0x3e8/0x560
? lock_downgrade+0x6a0/0x6a0
? lockdep_hardirqs_on+0x3e8/0x560
? _raw_spin_unlock_irqrestore+0x39/0x60
? cifs_handle_standard+0x610/0x610
kthread+0x2bb/0x3a0
? kthread_create_worker_on_cpu+0xc0/0xc0
ret_from_fork+0x3a/0x50
Allocated by task 649:
save_stack+0x19/0x70
__kasan_kmalloc.constprop.5+0xa6/0xf0
kmem_cache_alloc+0x107/0x320
copy_process+0x17bc/0x5370
_do_fork+0x103/0xbf0
__x64_sys_clone+0x168/0x1e0
do_syscall_64+0x9b/0xec0
entry_SYSCALL_64_after_hwframe+0x49/0xbe
Freed by task 0:
save_stack+0x19/0x70
__kasan_slab_free+0x11d/0x160
kmem_cache_free+0xb5/0x3d0
rcu_core+0x52f/0x1230
__do_softirq+0x24d/0x962
The buggy address belongs to the object at ffff8880103e32c0
which belongs to the cache task_struct of size 6016
The buggy address is located 1960 bytes inside of
6016-byte region [ffff8880103e32c0, ffff8880103e4a40)
The buggy address belongs to the page:
page:ffffea000040f800 refcount:1 mapcount:0 mapping:ffff8880108da5c0
index:0xffff8880103e4c00 compound_mapcount: 0
raw: 4000000000010200 ffffea00001f2208 ffffea00001e3408 ffff8880108da5c0
raw: ffff8880103e4c00 0000000000050003 00000001ffffffff 0000000000000000
page dumped because: kasan: bad access detected
Memory state around the buggy address:
ffff8880103e3900: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
ffff8880103e3980: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
>ffff8880103e3a00: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
^
ffff8880103e3a80: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
ffff8880103e3b00: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
==================================================================
This can be reliably reproduced by adding the below delay to
cifs_reconnect(), running find(1) on the mount, restarting the samba
server while find is running, and killing find during the delay:
spin_unlock(&GlobalMid_Lock);
mutex_unlock(&server->srv_mutex);
+ msleep(10000);
+
cifs_dbg(FYI, "%s: issuing mid callbacks\n", __func__);
list_for_each_safe(tmp, tmp2, &retry_list) {
mid_entry = list_entry(tmp, struct mid_q_entry, qhead);
Fix this by holding a reference to the task struct until the MID is
freed.
Signed-off-by: Vincent Whitchurch <vincent.whitchurch@axis.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
CC: Stable <stable@vger.kernel.org>
Reviewed-by: Paulo Alcantara (SUSE) <pc@cjr.nz>
Reviewed-by: Pavel Shilovsky <pshilov@microsoft.com>
|
|
PTR_ERR_OR_ZERO contains if(IS_ERR(...)) + PTR_ERR, just use
PTR_ERR_OR_ZERO directly.
Signed-off-by: Chen Zhou <chenzhou10@huawei.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
Reviewed-by: Paulo Alcantara (SUSE) <pc@cjr.nz>
|
|
RHBZ 1336264
When we extend a file we must also force the size to be updated.
This fixes an issue with holetest in xfs-tests which performs the following
sequence :
1, create a new file
2, use fallocate mode==0 to populate the file
3, mmap the file
4, touch each page by reading the mmapped region.
Signed-off-by: Ronnie Sahlberg <lsahlber@redhat.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
|
|
RHBZ: 1760879
Fix an oops in match_prepath() by making sure that the prepath string is not
NULL before we pass it into strcmp().
This is similar to other checks we make for example in cifs_root_iget()
Signed-off-by: Ronnie Sahlberg <lsahlber@redhat.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
|
|
When mounting with "modefromsid" mount parm most servers will require
that some default permissions are given to users in the ACL on newly
created files, files created with the new 'sd context' - when passing in
an sd context on create, permissions are not inherited from the parent
directory, so in addition to the ACE with the special SID which contains
the mode, we also must pass in an ACE allowing users to access the file
(GENERIC_ALL for authenticated users seemed like a reasonable default,
although later we could allow a mount option or config switch to make
it GENERIC_ALL for EVERYONE special sid).
CC: Stable <stable@vger.kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
Reviewed-By: Ronnie Sahlberg <lsahlber@redhat.com>
Reviewed-by: Pavel Shilovsky <pshilov@microsoft.com>
|
|
This is needed for backup/restore scenarios among others.
Add extended attribute "system.cifs_ntsd" (and alias "system.smb3_ntsd")
to allow for setting owner and DACL in the security descriptor. This is in
addition to the existing "system.cifs_acl" and "system.smb3_acl" attributes
that allow for setting DACL only. Add support for setting creation time and
dos attributes using set_file_info() calls to complement the existing
support for getting these attributes via query_path_info() calls.
Signed-off-by: Boris Protopopov <bprotopopov@hotmail.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
|
|
fs/cifs/smb2pdu.c: In function 'SMB2_query_directory':
fs/cifs/smb2pdu.c:4444:26: warning:
variable 'server' set but not used [-Wunused-but-set-variable]
struct TCP_Server_Info *server;
It is not used, so remove it.
Reported-by: Hulk Robot <hulkci@huawei.com>
Signed-off-by: YueHaibing <yuehaibing@huawei.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
|
|
__smb2_handle_cancelled_cmd() is called under a spin lock held in
cifs_mid_q_entry_release(), so make its memory allocation GFP_ATOMIC.
This issue was observed when running xfstests generic/028:
[ 1722.589204] CIFS VFS: \\192.168.30.26 Cancelling wait for mid 72064 cmd: 5
[ 1722.590687] CIFS VFS: \\192.168.30.26 Cancelling wait for mid 72065 cmd: 17
[ 1722.593529] CIFS VFS: \\192.168.30.26 Cancelling wait for mid 72066 cmd: 6
[ 1723.039014] BUG: sleeping function called from invalid context at mm/slab.h:565
[ 1723.040710] in_atomic(): 1, irqs_disabled(): 0, non_block: 0, pid: 30877, name: cifsd
[ 1723.045098] CPU: 3 PID: 30877 Comm: cifsd Not tainted 5.5.0-rc4+ #313
[ 1723.046256] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.12.0-59-gc9ba527-rebuilt.opensuse.org 04/01/2014
[ 1723.048221] Call Trace:
[ 1723.048689] dump_stack+0x97/0xe0
[ 1723.049268] ___might_sleep.cold+0xd1/0xe1
[ 1723.050069] kmem_cache_alloc_trace+0x204/0x2b0
[ 1723.051051] __smb2_handle_cancelled_cmd+0x40/0x140 [cifs]
[ 1723.052137] smb2_handle_cancelled_mid+0xf6/0x120 [cifs]
[ 1723.053247] cifs_mid_q_entry_release+0x44d/0x630 [cifs]
[ 1723.054351] ? cifs_reconnect+0x26a/0x1620 [cifs]
[ 1723.055325] cifs_demultiplex_thread+0xad4/0x14a0 [cifs]
[ 1723.056458] ? cifs_handle_standard+0x2c0/0x2c0 [cifs]
[ 1723.057365] ? kvm_sched_clock_read+0x14/0x30
[ 1723.058197] ? sched_clock+0x5/0x10
[ 1723.058838] ? sched_clock_cpu+0x18/0x110
[ 1723.059629] ? lockdep_hardirqs_on+0x17d/0x250
[ 1723.060456] kthread+0x1ab/0x200
[ 1723.061149] ? cifs_handle_standard+0x2c0/0x2c0 [cifs]
[ 1723.062078] ? kthread_create_on_node+0xd0/0xd0
[ 1723.062897] ret_from_fork+0x3a/0x50
Signed-off-by: Paulo Alcantara (SUSE) <pc@cjr.nz>
Fixes: 9150c3adbf24 ("CIFS: Close open handle after interrupted close")
Cc: Stable <stable@vger.kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
Reviewed-by: Pavel Shilovsky <pshilov@microsoft.com>
|
|
Starting from 4a367dc04435, we must set the mount options based on the
DFS full path rather than the resolved target, that is, cifs_mount()
will be responsible for resolving the DFS link (cached) as well as
performing failover to any other targets in the referral.
Signed-off-by: Paulo Alcantara (SUSE) <pc@cjr.nz>
Reported-by: Martijn de Gouw <martijn.de.gouw@prodrive-technologies.com>
Fixes: 4a367dc04435 ("cifs: Add support for failover in cifs_mount()")
Link: https://lore.kernel.org/linux-cifs/39643d7d-2abb-14d3-ced6-c394fab9a777@prodrive-technologies.com
Tested-by: Martijn de Gouw <martijn.de.gouw@prodrive-technologies.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
|
|
patch
static analysis with Coverity detected an issue with the following
commit:
Author: Paulo Alcantara (SUSE) <pc@cjr.nz>
Date: Wed Dec 4 17:38:03 2019 -0300
cifs: Avoid doing network I/O while holding cache lock
Addresses-Coverity: ("Uninitialized pointer read")
Reported-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: Paulo Alcantara (SUSE) <pc@cjr.nz>
Signed-off-by: Steve French <stfrench@microsoft.com>
|
|
copy_ref_data() may return error, it should be
returned to upstream caller.
Fixes: 03535b72873b ("cifs: Avoid doing network I/O while holding cache lock")
Signed-off-by: YueHaibing <yuehaibing@huawei.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
|
|
When creating or updating a cache entry, we need to get an DFS
referral (get_dfs_referral), so avoid holding any locks during such
network operation.
To prevent that, do the following:
* change cache hashtable sync method from RCU sync to a read/write
lock.
* use GFP_ATOMIC in memory allocations.
Signed-off-by: Paulo Alcantara (SUSE) <pc@cjr.nz>
Signed-off-by: Steve French <stfrench@microsoft.com>
|
|
We can't acquire volume lock while refreshing the DFS cache because
cifs_reconnect() may call dfs_cache_update_vol() while we are walking
through the volume list.
To prevent that, make vol_info refcounted, create a temp list with all
volumes eligible for refreshing, and then use it without any locks
held.
Besides, replace vol_lock with a spinlock and protect cache_ttl from
concurrent accesses or changes.
Signed-off-by: Paulo Alcantara (SUSE) <pc@cjr.nz>
Signed-off-by: Steve French <stfrench@microsoft.com>
|
|
Just do the trivial path validation in get_normalized_path().
Signed-off-by: Paulo Alcantara (SUSE) <pc@cjr.nz>
Reviewed-by: Aurelien Aptel <aaptel@suse.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
|
|
Add helpers for finding TCP connections that are good candidates for
being used by DFS refresh worker.
Signed-off-by: Paulo Alcantara (SUSE) <pc@cjr.nz>
Reviewed-by: Aurelien Aptel <aaptel@suse.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
|
|
The DFS cache API is mostly used with heap allocated strings.
Signed-off-by: Paulo Alcantara (SUSE) <pc@cjr.nz>
Reviewed-by: Aurelien Aptel <aaptel@suse.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
|