summaryrefslogtreecommitdiffstats
path: root/fs/namei.c
AgeCommit message (Collapse)AuthorFilesLines
2019-12-06Merge branch 'fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfsLinus Torvalds1-24/+32
Pull vfs d_inode/d_flags memory ordering fixes from Al Viro: "Fallout from tree-wide audit for ->d_inode/->d_flags barriers use. Basically, the problem is that negative pinned dentries require careful treatment - unless ->d_lock is locked or parent is held at least shared, another thread can make them positive right under us. Most of the uses turned out to be safe - the main surprises as far as filesystems are concerned were - race in dget_parent() fastpath, that might end up with the caller observing the returned dentry _negative_, due to insufficient barriers. It is positive in memory, but we could end up seeing the wrong value of ->d_inode in CPU cache. Fixed. - manual checks that result of lookup_one_len_unlocked() is positive (and rejection of negatives). Again, insufficient barriers (we might end up with inconsistent observed values of ->d_inode and ->d_flags). Fixed by switching to a new primitive that does the checks itself and returns ERR_PTR(-ENOENT) instead of a negative dentry. That way we get rid of boilerplate converting negatives into ERR_PTR(-ENOENT) in the callers and have a single place to deal with the barrier-related mess - inside fs/namei.c rather than in every caller out there. The guts of pathname resolution *do* need to be careful - the race found by Ritesh is real, as well as several similar races. Fortunately, it turns out that we can take care of that with fairly local changes in there. The tree-wide audit had not been fun, and I hate the idea of repeating it. I think the right approach would be to annotate the places where we are _not_ guaranteed ->d_inode/->d_flags stability and have sparse catch regressions. But I'm still not sure what would be the least invasive way of doing that and it's clearly the next cycle fodder" * 'fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: fs/namei.c: fix missing barriers when checking positivity fix dget_parent() fastpath race new helper: lookup_positive_unlocked() fs/namei.c: pull positivity check into follow_managed()
2019-11-15fs/namei.c: fix missing barriers when checking positivityAl Viro1-3/+3
Pinned negative dentries can, generally, be made positive by another thread. Conditions that prevent that are * ->d_lock on dentry in question * parent directory held at least shared * nobody else could have observed the address of dentry Most of the places working with those fall into one of those categories; however, d_lookup() and friends need to be used with some care. Fortunately, there's not a lot of call sites, and with few exceptions all of those fall under one of the cases above. Exceptions are all in fs/namei.c - in lookup_fast(), lookup_dcache() and mountpoint_last(). Another one is lookup_slow() - there dcache lookup is done with parent held shared, but the result is used after we'd drop the lock. The same happens in do_last() - the lookup (in lookup_one()) is done with parent locked, but result is used after unlocking. lookup_fast(), do_last() and mountpoint_last() flat-out reject negatives. Most of lookup_dcache() calls are made with parent locked at least shared; the only exception is lookup_one_len_unlocked(). It might return pinned negative, needs serious care from callers. Fortunately, almost nobody calls it directly anymore; all but two callers have converted to lookup_positive_unlocked(), which rejects negatives. lookup_slow() is called by the same lookup_one_len_unlocked() (see above), mountpoint_last() and walk_component(). In those two negatives are rejected. In other words, there is a small set of places where we need to check carefully if a pinned potentially negative dentry is, in fact, positive. After that check we want to be sure that both ->d_inode and type bits in ->d_flags are stable and observed. The set consists of follow_managed() (where the rejection happens for lookup_fast(), walk_component() and do_last()), last_mountpoint() and lookup_positive_unlocked(). Solution: 1) transition from negative to positive (in __d_set_inode_and_type()) stores ->d_inode, then uses smp_store_release() to set ->d_flags type bits. 2) aforementioned 3 places in fs/namei.c fetch ->d_flags with smp_load_acquire() and bugger off if it type bits say "negative". That way anyone downstream of those checks has dentry know positive pinned, with ->d_inode and type bits of ->d_flags stable and observed. I considered splitting off d_lookup_positive(), so that the checks could be done right there, under ->d_lock. However, that leads to massive duplication of rather subtle code in fs/namei.c and fs/dcache.c. It's worse than it might seem, thanks to autofs ->d_manage() getting involved ;-/ No matter what, autofs_d_manage()/autofs_d_automount() must live with the possibility of pinned negative dentry passed their way, becoming positive under them - that's the intended behaviour when lookup comes in the middle of automount in progress, so we can't keep them out of the area that has to deal with those, more's the pity... Reported-by: Ritesh Harjani <riteshh@linux.ibm.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2019-11-15new helper: lookup_positive_unlocked()Al Viro1-0/+20
Most of the callers of lookup_one_len_unlocked() treat negatives are ERR_PTR(-ENOENT). Provide a helper that would do just that. Note that a pinned positive dentry remains positive - it's ->d_inode is stable, etc.; a pinned _negative_ dentry can become positive at any point as long as you are not holding its parent at least shared. So using lookup_one_len_unlocked() needs to be careful; lookup_positive_unlocked() is safer and that's what the callers end up open-coding anyway. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2019-11-15fs/namei.c: pull positivity check into follow_managed()Al Viro1-23/+11
There are 4 callers; two proceed to check if result is positive and fail with ENOENT if it isn't; one (in handle_lookup_down()) is guaranteed to yield positive and one (in lookup_fast()) is _preceded_ by positivity check. However, follow_managed() on a negative dentry is a (fairly cheap) no-op on anything other than autofs. And negative autofs dentries are never hashed, so lookup_fast() is not going to run into one of those. Moreover, successful follow_managed() on a _positive_ dentry never yields a negative one (and we significantly rely upon that in callers of lookup_fast()). In other words, we can easily transpose the positivity check and the call of follow_managed() in lookup_fast(). And that allows to fold the positivity check *into* follow_managed(), simplifying life for the code downstream of its calls. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2019-10-03audit: Report suspicious O_CREAT usageKees Cook1-2/+6
This renames the very specific audit_log_link_denied() to audit_log_path_denied() and adds the AUDIT_* type as an argument. This allows for the creation of the new AUDIT_ANOM_CREAT that can be used to report the fifo/regular file creation restrictions that were introduced in commit 30aba6656f61 ("namei: allow restricted O_CREAT of FIFOs and regular files"). Signed-off-by: Kees Cook <keescook@chromium.org> Signed-off-by: Paul Moore <paul@paul-moore.com>
2019-09-03fs/namei.c: keep track of nd->root refcount statusAl Viro1-26/+15
The rules for nd->root are messy: * if we have LOOKUP_ROOT, it doesn't contribute to refcounts * if we have LOOKUP_RCU, it doesn't contribute to refcounts * if nd->root.mnt is NULL, it doesn't contribute to refcounts * otherwise it does contribute terminate_walk() needs to drop the references if they are contributing. So everything else should be careful not to confuse it, leading to rather convoluted code. It's easier to keep track of whether we'd grabbed the reference(s) explicitly. Use a new flag for that. Don't bother with zeroing nd->root.mnt on unlazy failures and in terminate_walk - it's not needed anymore (terminate_walk() won't care and the next path_init() will zero nd->root in !LOOKUP_ROOT case anyway). Resulting rules for nd->root refcounts are much simpler: they are contributing iff LOOKUP_ROOT_GRABBED is set in nd->flags. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2019-08-30fs/namei.c: new helper - legitimize_root()Al Viro1-10/+13
identical logics in unlazy_walk() and unlazy_child() Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2019-07-21audit_inode(): switch to passing AUDIT_INODE_...Al Viro1-3/+3
don't bother with remapping LOOKUP_... values - all callers pass constants and we can just as well pass the right ones from the very beginning. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2019-07-21filename_mountpoint(): make LOOKUP_NO_EVAL unconditional thereAl Viro1-1/+1
user_path_mountpoint_at() always gets it and the reasons to have it there (i.e. in umount(2)) apply to kern_path_mountpoint() callers as well. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2019-07-21filename_lookup(): audit_inode() argument is always 0Al Viro1-1/+1
We hadn't been passing LOOKUP_PARENT in flags to that thing since filename_parentat() had been split off back in 2015. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2019-06-20fsnotify: add empty fsnotify_{unlink,rmdir}() hooksAmir Goldstein1-0/+2
We would like to move fsnotify_nameremove() calls from d_delete() into a higher layer where the hook makes more sense and so we can consider every d_delete() call site individually. Start by creating empty hook fsnotify_{unlink,rmdir}() and place them in the proper VFS call sites. After all d_delete() call sites will be converted to use the new hook, the new hook will generate the delete events and fsnotify_nameremove() hook will be removed. Signed-off-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Jan Kara <jack@suse.cz>
2019-05-07Merge tag 'fscrypt_for_linus' of git://git.kernel.org/pub/scm/fs/fscrypt/fscryptLinus Torvalds1-2/+2
Pull fscrypt updates from Ted Ts'o: "Clean up fscrypt's dcache revalidation support, and other miscellaneous cleanups" * tag 'fscrypt_for_linus' of git://git.kernel.org/pub/scm/fs/fscrypt/fscrypt: fscrypt: cache decrypted symlink target in ->i_link vfs: use READ_ONCE() to access ->i_link fscrypt: fix race where ->lookup() marks plaintext dentry as ciphertext fscrypt: only set dentry_operations on ciphertext dentries fs, fscrypt: clear DCACHE_ENCRYPTED_NAME when unaliasing directory fscrypt: fix race allowing rename() and link() of ciphertext dentries fscrypt: clean up and improve dentry revalidation fscrypt: use READ_ONCE() to access ->i_crypt_info fscrypt: remove WARN_ON_ONCE() when decryption fails fscrypt: drop inode argument from fscrypt_get_ctx()
2019-04-26switch fsnotify_move() to passing const struct qstr * for old_nameAl Viro1-2/+2
note that in the second (RENAME_EXCHANGE) call of fsnotify_move() in vfs_rename() the old_dentry->d_name is guaranteed to be unchanged throughout the evaluation of fsnotify_move() (by the fact that the parent directory is locked exclusive), so we don't need to fetch old_dentry->d_name.name in the caller. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2019-04-26ovl_lookup_real_one(): don't bother with strlen()Al Viro1-1/+1
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2019-04-17vfs: use READ_ONCE() to access ->i_linkEric Biggers1-2/+2
Use 'READ_ONCE(inode->i_link)' to explicitly support filesystems caching the symlink target in ->i_link later if it was unavailable at iget() time, or wasn't easily available. I'll be doing this in fscrypt, to improve the performance of encrypted symlinks on ext4, f2fs, and ubifs. ->i_link will start NULL and may later be set to a non-NULL value by a smp_store_release() or cmpxchg_release(). READ_ONCE() is needed on the read side. smp_load_acquire() is unnecessary because only a data dependency barrier is required. (Thanks to Al for pointing this out.) Acked-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Eric Biggers <ebiggers@google.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2019-03-12Merge branch 'work.mount' of ↵Linus Torvalds1-2/+2
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull vfs mount infrastructure updates from Al Viro: "The rest of core infrastructure; no new syscalls in that pile, but the old parts are switched to new infrastructure. At that point conversions of individual filesystems can happen independently; some are done here (afs, cgroup, procfs, etc.), there's also a large series outside of that pile dealing with NFS (quite a bit of option-parsing stuff is getting used there - it's one of the most convoluted filesystems in terms of mount-related logics), but NFS bits are the next cycle fodder. It got seriously simplified since the last cycle; documentation is probably the weakest bit at the moment - I considered dropping the commit introducing Documentation/filesystems/mount_api.txt (cutting the size increase by quarter ;-), but decided that it would be better to fix it up after -rc1 instead. That pile allows to do followup work in independent branches, which should make life much easier for the next cycle. fs/super.c size increase is unpleasant; there's a followup series that allows to shrink it considerably, but I decided to leave that until the next cycle" * 'work.mount' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (41 commits) afs: Use fs_context to pass parameters over automount afs: Add fs_context support vfs: Add some logging to the core users of the fs_context log vfs: Implement logging through fs_context vfs: Provide documentation for new mount API vfs: Remove kern_mount_data() hugetlbfs: Convert to fs_context cpuset: Use fs_context kernfs, sysfs, cgroup, intel_rdt: Support fs_context cgroup: store a reference to cgroup_ns into cgroup_fs_context cgroup1_get_tree(): separate "get cgroup_root to use" into a separate helper cgroup_do_mount(): massage calling conventions cgroup: stash cgroup_root reference into cgroup_fs_context cgroup2: switch to option-by-option parsing cgroup1: switch to option-by-option parsing cgroup: take options parsing into ->parse_monolithic() cgroup: fold cgroup1_mount() into cgroup1_get_tree() cgroup: start switching to fs_context ipc: Convert mqueue fs to fs_context proc: Add fs_context support to procfs ...
2019-03-10Merge branch 'next-integrity' of ↵Linus Torvalds1-0/+1
git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security Pull integrity updates from James Morris: "Mimi Zohar says: 'Linux 5.0 introduced the platform keyring to allow verifying the IMA kexec kernel image signature using the pre-boot keys. This pull request similarly makes keys on the platform keyring accessible for verifying the PE kernel image signature. Also included in this pull request is a new IMA hook that tags tmp files, in policy, indicating the file hash needs to be calculated. The remaining patches are cleanup'" * 'next-integrity' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security: evm: Use defined constant for UUID representation ima: define ima_post_create_tmpfile() hook and add missing call evm: remove set but not used variable 'xattr' encrypted-keys: fix Opt_err/Opt_error = -1 kexec, KEYS: Make use of platform keyring for signature verify integrity, KEYS: add a reference to platform keyring
2019-03-07Merge branch 'akpm' (patches from Andrew)Linus Torvalds1-2/+0
Merge more updates from Andrew Morton: - some of the rest of MM - various misc things - dynamic-debug updates - checkpatch - some epoll speedups - autofs - rapidio - lib/, lib/lzo/ updates * emailed patches from Andrew Morton <akpm@linux-foundation.org>: (83 commits) samples/mic/mpssd/mpssd.h: remove duplicate header kernel/fork.c: remove duplicated include include/linux/relay.h: fix percpu annotation in struct rchan arch/nios2/mm/fault.c: remove duplicate include unicore32: stop printing the virtual memory layout MAINTAINERS: fix GTA02 entry and mark as orphan mm: create the new vm_fault_t type arm, s390, unicore32: remove oneliner wrappers for memblock_alloc() arch: simplify several early memory allocations openrisc: simplify pte_alloc_one_kernel() sh: prefer memblock APIs returning virtual address microblaze: prefer memblock API returning virtual address powerpc: prefer memblock APIs returning virtual address lib/lzo: separate lzo-rle from lzo lib/lzo: implement run-length encoding lib/lzo: fast 8-byte copy on arm64 lib/lzo: 64-bit CTZ on arm64 lib/lzo: tidy-up ifdefs ipc/sem.c: replace kvmalloc/memset with kvzalloc and use struct_size ipc: annotate implicit fall through ...
2019-03-07linux/fs.h: move member alignment check next to definition of struct filenameRasmus Villemoes1-2/+0
Instead of doing this compile-time check in some slightly arbitrary user of struct filename, put it next to the definition. Link: http://lkml.kernel.org/r/20190208203015.29702-3-linux@rasmusvillemoes.dk Signed-off-by: Rasmus Villemoes <linux@rasmusvillemoes.dk> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Kees Cook <keescook@chromium.org> Cc: Luc Van Oostenryck <luc.vanoostenryck@gmail.com> Cc: Masahiro Yamada <yamada.masahiro@socionext.com> Cc: Nick Desaulniers <ndesaulniers@google.com> Cc: Alexey Dobriyan <adobriyan@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-02-28vfs: Add configuration parser helpersDavid Howells1-2/+2
Because the new API passes in key,value parameters, match_token() cannot be used with it. Instead, provide three new helpers to aid with parsing: (1) fs_parse(). This takes a parameter and a simple static description of all the parameters and maps the key name to an ID. It returns 1 on a match, 0 on no match if unknowns should be ignored and some other negative error code on a parse error. The parameter description includes a list of key names to IDs, desired parameter types and a list of enumeration name -> ID mappings. [!] Note that for the moment I've required that the key->ID mapping array is expected to be sorted and unterminated. The size of the array is noted in the fsconfig_parser struct. This allows me to use bsearch(), but I'm not sure any performance gain is worth the hassle of requiring people to keep the array sorted. The parameter type array is sized according to the number of parameter IDs and is indexed directly. The optional enum mapping array is an unterminated, unsorted list and the size goes into the fsconfig_parser struct. The function can do some additional things: (a) If it's not ambiguous and no value is given, the prefix "no" on a key name is permitted to indicate that the parameter should be considered negatory. (b) If the desired type is a single simple integer, it will perform an appropriate conversion and store the result in a union in the parse result. (c) If the desired type is an enumeration, {key ID, name} will be looked up in the enumeration list and the matching value will be stored in the parse result union. (d) Optionally generate an error if the key is unrecognised. This is called something like: enum rdt_param { Opt_cdp, Opt_cdpl2, Opt_mba_mpbs, nr__rdt_params }; const struct fs_parameter_spec rdt_param_specs[nr__rdt_params] = { [Opt_cdp] = { fs_param_is_bool }, [Opt_cdpl2] = { fs_param_is_bool }, [Opt_mba_mpbs] = { fs_param_is_bool }, }; const const char *const rdt_param_keys[nr__rdt_params] = { [Opt_cdp] = "cdp", [Opt_cdpl2] = "cdpl2", [Opt_mba_mpbs] = "mba_mbps", }; const struct fs_parameter_description rdt_parser = { .name = "rdt", .nr_params = nr__rdt_params, .keys = rdt_param_keys, .specs = rdt_param_specs, .no_source = true, }; int rdt_parse_param(struct fs_context *fc, struct fs_parameter *param) { struct fs_parse_result parse; struct rdt_fs_context *ctx = rdt_fc2context(fc); int ret; ret = fs_parse(fc, &rdt_parser, param, &parse); if (ret < 0) return ret; switch (parse.key) { case Opt_cdp: ctx->enable_cdpl3 = true; return 0; case Opt_cdpl2: ctx->enable_cdpl2 = true; return 0; case Opt_mba_mpbs: ctx->enable_mba_mbps = true; return 0; } return -EINVAL; } (2) fs_lookup_param(). This takes a { dirfd, path, LOOKUP_EMPTY? } or string value and performs an appropriate path lookup to convert it into a path object, which it will then return. If the desired type was a blockdev, the type of the looked up inode will be checked to make sure it is one. This can be used like: enum foo_param { Opt_source, nr__foo_params }; const struct fs_parameter_spec foo_param_specs[nr__foo_params] = { [Opt_source] = { fs_param_is_blockdev }, }; const char *char foo_param_keys[nr__foo_params] = { [Opt_source] = "source", }; const struct constant_table foo_param_alt_keys[] = { { "device", Opt_source }, }; const struct fs_parameter_description foo_parser = { .name = "foo", .nr_params = nr__foo_params, .nr_alt_keys = ARRAY_SIZE(foo_param_alt_keys), .keys = foo_param_keys, .alt_keys = foo_param_alt_keys, .specs = foo_param_specs, }; int foo_parse_param(struct fs_context *fc, struct fs_parameter *param) { struct fs_parse_result parse; struct foo_fs_context *ctx = foo_fc2context(fc); int ret; ret = fs_parse(fc, &foo_parser, param, &parse); if (ret < 0) return ret; switch (parse.key) { case Opt_source: return fs_lookup_param(fc, &foo_parser, param, &parse, &ctx->source); default: return -EINVAL; } } (3) lookup_constant(). This takes a table of named constants and looks up the given name within it. The table is expected to be sorted such that bsearch() be used upon it. Possibly I should require the table be terminated and just use a for-loop to scan it instead of using bsearch() to reduce hassle. Tables look something like: static const struct constant_table bool_names[] = { { "0", false }, { "1", true }, { "false", false }, { "no", false }, { "true", true }, { "yes", true }, }; and a lookup is done with something like: b = lookup_constant(bool_names, param->string, -1); Additionally, optional validation routines for the parameter description are provided that can be enabled at compile time. A later patch will invoke these when a filesystem is registered. Signed-off-by: David Howells <dhowells@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2019-02-04ima: define ima_post_create_tmpfile() hook and add missing callMimi Zohar1-0/+1
If tmpfiles can be made persistent, then newly created tmpfiles need to be treated like any other new files in policy. This patch indicates which newly created tmpfiles are in policy, causing the file hash to be calculated on __fput(). Reported-by: Ignaz Forster <ignaz.forster@gmx.de> [rgoldwyn@suse.com: Call ima_post_create_tmpfile() in vfs_tmpfile() as opposed to do_tmpfile(). This will help the case for overlayfs where copy_up is denied while overwriting a file.] Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com> Signed-off-by: Mimi Zohar <zohar@linux.ibm.com>
2019-01-30audit: ignore fcaps on umountRichard Guy Briggs1-1/+1
Don't fetch fcaps when umount2 is called to avoid a process hang while it waits for the missing resource to (possibly never) re-appear. Note the comment above user_path_mountpoint_at(): * A umount is a special case for path walking. We're not actually interested * in the inode in this situation, and ESTALE errors can be a problem. We * simply want track down the dentry and vfsmount attached at the mountpoint * and avoid revalidating the last component. This can happen on ceph, cifs, 9p, lustre, fuse (gluster) or NFS. Please see the github issue tracker https://github.com/linux-audit/audit-kernel/issues/100 Signed-off-by: Richard Guy Briggs <rgb@redhat.com> [PM: merge fuzz in audit_log_fcaps()] Signed-off-by: Paul Moore <paul@paul-moore.com>
2018-12-22Revert "vfs: Allow userns root to call mknod on owned filesystems."Christian Brauner1-2/+1
This reverts commit 55956b59df336f6738da916dbb520b6e37df9fbd. commit 55956b59df33 ("vfs: Allow userns root to call mknod on owned filesystems.") enabled mknod() in user namespaces for userns root if CAP_MKNOD is available. However, these device nodes are useless since any filesystem mounted from a non-initial user namespace will set the SB_I_NODEV flag on the filesystem. Now, when a device node s created in a non-initial user namespace a call to open() on said device node will fail due to: bool may_open_dev(const struct path *path) { return !(path->mnt->mnt_flags & MNT_NODEV) && !(path->mnt->mnt_sb->s_iflags & SB_I_NODEV); } The problem with this is that as of the aforementioned commit mknod() creates partially functional device nodes in non-initial user namespaces. In particular, it has the consequence that as of the aforementioned commit open() will be more privileged with respect to device nodes than mknod(). Before it was the other way around. Specifically, if mknod() succeeded then it was transparent for any userspace application that a fatal error must have occured when open() failed. All of this breaks multiple userspace workloads and a widespread assumption about how to handle mknod(). Basically, all container runtimes and systemd live by the slogan "ask for forgiveness not permission" when running user namespace workloads. For mknod() the assumption is that if the syscall succeeds the device nodes are useable irrespective of whether it succeeds in a non-initial user namespace or not. This logic was chosen explicitly to allow for the glorious day when mknod() will actually be able to create fully functional device nodes in user namespaces. A specific problem people are already running into when running 4.18 rc kernels are failing systemd services. For any distro that is run in a container systemd services started with the PrivateDevices= property set will fail to start since the device nodes in question cannot be opened (cf. the arguments in [1]). Full disclosure, Seth made the very sound argument that it is already possible to end up with partially functional device nodes. Any filesystem mounted with MS_NODEV set will allow mknod() to succeed but will not allow open() to succeed. The difference to the case here is that the MS_NODEV case is transparent to userspace since it is an explicitly set mount option while the SB_I_NODEV case is an implicit property enforced by the kernel and hence opaque to userspace. [1]: https://github.com/systemd/systemd/pull/9483 Signed-off-by: Christian Brauner <christian@brauner.io> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Seth Forshee <seth.forshee@canonical.com> Cc: Serge Hallyn <serge@hallyn.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-23namei: allow restricted O_CREAT of FIFOs and regular filesSalvatore Mesoraca1-3/+50
Disallows open of FIFOs or regular files not owned by the user in world writable sticky directories, unless the owner is the same as that of the directory or the file is opened without the O_CREAT flag. The purpose is to make data spoofing attacks harder. This protection can be turned on and off separately for FIFOs and regular files via sysctl, just like the symlinks/hardlinks protection. This patch is based on Openwall's "HARDEN_FIFO" feature by Solar Designer. This is a brief list of old vulnerabilities that could have been prevented by this feature, some of them even allow for privilege escalation: CVE-2000-1134 CVE-2007-3852 CVE-2008-0525 CVE-2009-0416 CVE-2011-4834 CVE-2015-1838 CVE-2015-7442 CVE-2016-7489 This list is not meant to be complete. It's difficult to track down all vulnerabilities of this kind because they were often reported without any mention of this particular attack vector. In fact, before hardlinks/symlinks restrictions, fifos/regular files weren't the favorite vehicle to exploit them. [s.mesoraca16@gmail.com: fix bug reported by Dan Carpenter] Link: https://lkml.kernel.org/r/20180426081456.GA7060@mwanda Link: http://lkml.kernel.org/r/1524829819-11275-1-git-send-email-s.mesoraca16@gmail.com [keescook@chromium.org: drop pr_warn_ratelimited() in favor of audit changes in the future] [keescook@chromium.org: adjust commit subjet] Link: http://lkml.kernel.org/r/20180416175918.GA13494@beast Signed-off-by: Salvatore Mesoraca <s.mesoraca16@gmail.com> Signed-off-by: Kees Cook <keescook@chromium.org> Suggested-by: Solar Designer <solar@openwall.com> Suggested-by: Kees Cook <keescook@chromium.org> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-21Merge tag 'ovl-update-4.19' of ↵Linus Torvalds1-1/+1
git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs Pull overlayfs updates from Miklos Szeredi: "This contains two new features: - Stack file operations: this allows removal of several hacks from the VFS, proper interaction of read-only open files with copy-up, possibility to implement fs modifying ioctls properly, and others. - Metadata only copy-up: when file is on lower layer and only metadata is modified (except size) then only copy up the metadata and continue to use the data from the lower file" * tag 'ovl-update-4.19' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs: (66 commits) ovl: Enable metadata only feature ovl: Do not do metacopy only for ioctl modifying file attr ovl: Do not do metadata only copy-up for truncate operation ovl: add helper to force data copy-up ovl: Check redirect on index as well ovl: Set redirect on upper inode when it is linked ovl: Set redirect on metacopy files upon rename ovl: Do not set dentry type ORIGIN for broken hardlinks ovl: Add an inode flag OVL_CONST_INO ovl: Treat metacopy dentries as type OVL_PATH_MERGE ovl: Check redirects for metacopy files ovl: Move some dir related ovl_lookup_single() code in else block ovl: Do not expose metacopy only dentry from d_real() ovl: Open file with data except for the case of fsync ovl: Add helper ovl_inode_realdata() ovl: Store lower data inode in ovl_inode ovl: Fix ovl_getattr() to get number of blocks from lower ovl: Add helper ovl_dentry_lowerdata() to get lower data dentry ovl: Copy up meta inode data from lowest data inode ovl: Modify ovl_lookup() and friends to lookup metacopy dentry ...
2018-08-13Merge branches 'work.misc' and 'work.dcache' of ↵Linus Torvalds1-24/+12
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull misc vfs updates from Al Viro: "Misc cleanups from various folks all over the place I expected more fs/dcache.c cleanups this cycle, so that went into a separate branch. Said cleanups have missed the window, so in the hindsight it could've gone into work.misc instead. Decided not to cherry-pick, thus the 'work.dcache' branch" * 'work.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: fs: dcache: Use true and false for boolean values fold generic_readlink() into its only caller fs: shave 8 bytes off of struct inode fs: Add more kernel-doc to the produced documentation fs: Fix attr.c kernel-doc removed extra extern file_fdatawait_range * 'work.dcache' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: kill dentry_update_name_case()
2018-07-19fold generic_readlink() into its only callerAl Viro1-24/+12
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2018-07-18Revert "vfs: update ovl inode before relatime check"Miklos Szeredi1-1/+1
This reverts commit 598e3c8f72f5b77c84d2cb26cfd936ffb3cfdbaa. Overlayfs no longer relies on the vfs correct atime handling. Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2018-07-12few more cleanups of link_path_walk() callersAl Viro1-19/+11
Acked-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2018-07-12allow link_path_walk() to take ERR_PTR()Al Viro1-33/+5
There is a check for IS_ERR(name) immediately upstream of each call of link_path_walk(name, nd), with positives treated as if link_path_walk() failed with PTR_ERR(name). Taking that check into link_path_walk() itself simplifies things nicely. Acked-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2018-07-12make path_init() unconditionally paired with terminate_walk()Al Viro1-16/+16
including the failure exits Acked-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2018-07-12switch atomic_open() and lookup_open() to returning 0 in all success casesAl Viro1-17/+17
caller can tell "opened" from "open it yourself" by looking at ->f_mode. Acked-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2018-07-12->atomic_open(): return 0 in all success casesAl Viro1-15/+15
FMODE_OPENED can be used to distingusish "successful open" from the "called finish_no_open(), do it yourself" cases. Since finish_no_open() has been adjusted, no changes in the instances were actually needed. The caller has been adjusted. Acked-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2018-07-12get rid of 'opened' in path_openat() and the helpers downstreamAl Viro1-14/+8
unused now Acked-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2018-07-12get rid of 'opened' argument of ->atomic_open() - part 3Al Viro1-2/+1
now it can be done... Acked-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2018-07-12getting rid of 'opened' argument of ->atomic_open() - part 1Al Viro1-1/+1
'opened' argument of finish_open() is unused. Kill it. Signed-off-by Al Viro <viro@zeniv.linux.org.uk>
2018-07-12IMA: don't propagate opened through the entire thingAl Viro1-2/+1
just check ->f_mode in ima_appraise_measurement() Acked-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2018-07-12introduce FMODE_CREATED and switch to itAl Viro1-7/+8
Parallel to FILE_CREATED, goes into ->f_mode instead of *opened. NFS is a bit of a wart here - it doesn't have file at the point where FILE_CREATED used to be set, so we need to propagate it there (for now). IMA is another one (here and everywhere)... Note that this needs do_dentry_open() to leave old bits in ->f_mode alone - we want it to preserve FMODE_CREATED if it had been already set (no other bit can be there). Acked-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2018-07-12switch all remaining checks for FILE_OPENED to FMODE_OPENEDAl Viro1-5/+2
... and don't bother with setting FILE_OPENED at all. Acked-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2018-07-12now we can fold open_check_o_direct() into do_dentry_open()Al Viro1-6/+1
These checks are better off in do_dentry_open(); the reason we couldn't put them there used to be that callers couldn't tell what kind of cleanup would do_dentry_open() failure call for. Now that we have FMODE_OPENED, cleanup is the same in all cases - it's simply fput(). So let's fold that into do_dentry_open(), as Christoph's patch tried to. Acked-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2018-07-12lift fput() on late failures into path_openat()Al Viro1-16/+12
Acked-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2018-07-12fold put_filp() into fput()Al Viro1-2/+2
Just check FMODE_OPENED in __fput() and be done with that... Acked-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2018-07-12get rid of cred argument of vfs_open() and do_dentry_open()Al Viro1-2/+2
always equal to ->f_cred Acked-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2018-07-12pass ->f_flags value to alloc_empty_file()Al Viro1-3/+1
... and have it set the f_flags-derived part of ->f_mode. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2018-07-12pass creds to get_empty_filp(), make sure dentry_open() passes the right credsAl Viro1-1/+1
... and rename get_empty_filp() to alloc_empty_file(). dentry_open() gets creds as argument, but the only thing that sees those is security_file_open() - file->f_cred still ends up with current_cred(). For almost all callers it's the same thing, but there are several broken cases. Acked-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2018-06-16Merge branch 'afs-proc' of ↵Linus Torvalds1-0/+29
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull AFS updates from Al Viro: "Assorted AFS stuff - ended up in vfs.git since most of that consists of David's AFS-related followups to Christoph's procfs series" * 'afs-proc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: afs: Optimise callback breaking by not repeating volume lookup afs: Display manually added cells in dynamic root mount afs: Enable IPv6 DNS lookups afs: Show all of a server's addresses in /proc/fs/afs/servers afs: Handle CONFIG_PROC_FS=n proc: Make inline name size calculation automatic afs: Implement network namespacing afs: Mark afs_net::ws_cell as __rcu and set using rcu functions afs: Fix a Sparse warning in xdr_decode_AFSFetchStatus() proc: Add a way to make network proc files writable afs: Rearrange fs/afs/proc.c to remove remaining predeclarations. afs: Rearrange fs/afs/proc.c to move the show routines up afs: Rearrange fs/afs/proc.c by moving fops and open functions down afs: Move /proc management functions to the end of the file
2018-06-15afs: Display manually added cells in dynamic root mountDavid Howells1-0/+29
Alter the dynroot mount so that cells created by manipulation of /proc/fs/afs/cells and /proc/fs/afs/rootcell and by specification of a root cell as a module parameter will cause directories for those cells to be created in the dynamic root superblock for the network namespace[*]. To this end: (1) Only one dynamic root superblock is now created per network namespace and this is shared between all attempts to mount it. This makes it easier to find the superblock to modify. (2) When a dynamic root superblock is created, the list of cells is walked and directories created for each cell already defined. (3) When a new cell is added, if a dynamic root superblock exists, a directory is created for it. (4) When a cell is destroyed, the directory is removed. (5) These directories are created by calling lookup_one_len() on the root dir which automatically creates them if they don't exist. [*] Inasmuch as network namespaces are currently supported here. Signed-off-by: David Howells <dhowells@redhat.com>
2018-06-12treewide: kmalloc() -> kmalloc_array()Kees Cook1-2/+2
The kmalloc() function has a 2-factor argument form, kmalloc_array(). This patch replaces cases of: kmalloc(a * b, gfp) with: kmalloc_array(a * b, gfp) as well as handling cases of: kmalloc(a * b * c, gfp) with: kmalloc(array3_size(a, b, c), gfp) as it's slightly less ugly than: kmalloc_array(array_size(a, b), c, gfp) This does, however, attempt to ignore constant size factors like: kmalloc(4 * 1024, gfp) though any constants defined via macros get caught up in the conversion. Any factors with a sizeof() of "unsigned char", "char", and "u8" were dropped, since they're redundant. The tools/ directory was manually excluded, since it has its own implementation of kmalloc(). The Coccinelle script used for this was: // Fix redundant parens around sizeof(). @@ type TYPE; expression THING, E; @@ ( kmalloc( - (sizeof(TYPE)) * E + sizeof(TYPE) * E , ...) | kmalloc( - (sizeof(THING)) * E + sizeof(THING) * E , ...) ) // Drop single-byte sizes and redundant parens. @@ expression COUNT; typedef u8; typedef __u8; @@ ( kmalloc( - sizeof(u8) * (COUNT) + COUNT , ...) | kmalloc( - sizeof(__u8) * (COUNT) + COUNT , ...) | kmalloc( - sizeof(char) * (COUNT) + COUNT , ...) | kmalloc( - sizeof(unsigned char) * (COUNT) + COUNT , ...) | kmalloc( - sizeof(u8) * COUNT + COUNT , ...) | kmalloc( - sizeof(__u8) * COUNT + COUNT , ...) | kmalloc( - sizeof(char) * COUNT + COUNT , ...) | kmalloc( - sizeof(unsigned char) * COUNT + COUNT , ...) ) // 2-factor product with sizeof(type/expression) and identifier or constant. @@ type TYPE; expression THING; identifier COUNT_ID; constant COUNT_CONST; @@ ( - kmalloc + kmalloc_array ( - sizeof(TYPE) * (COUNT_ID) + COUNT_ID, sizeof(TYPE) , ...) | - kmalloc + kmalloc_array ( - sizeof(TYPE) * COUNT_ID + COUNT_ID, sizeof(TYPE) , ...) | - kmalloc + kmalloc_array ( - sizeof(TYPE) * (COUNT_CONST) + COUNT_CONST, sizeof(TYPE) , ...) | - kmalloc + kmalloc_array ( - sizeof(TYPE) * COUNT_CONST + COUNT_CONST, sizeof(TYPE) , ...) | - kmalloc + kmalloc_array ( - sizeof(THING) * (COUNT_ID) + COUNT_ID, sizeof(THING) , ...) | - kmalloc + kmalloc_array ( - sizeof(THING) * COUNT_ID + COUNT_ID, sizeof(THING) , ...) | - kmalloc + kmalloc_array ( - sizeof(THING) * (COUNT_CONST) + COUNT_CONST, sizeof(THING) , ...) | - kmalloc + kmalloc_array ( - sizeof(THING) * COUNT_CONST + COUNT_CONST, sizeof(THING) , ...) ) // 2-factor product, only identifiers. @@ identifier SIZE, COUNT; @@ - kmalloc + kmalloc_array ( - SIZE * COUNT + COUNT, SIZE , ...) // 3-factor product with 1 sizeof(type) or sizeof(expression), with // redundant parens removed. @@ expression THING; identifier STRIDE, COUNT; type TYPE; @@ ( kmalloc( - sizeof(TYPE) * (COUNT) * (STRIDE) + array3_size(COUNT, STRIDE, sizeof(TYPE)) , ...) | kmalloc( - sizeof(TYPE) * (COUNT) * STRIDE + array3_size(COUNT, STRIDE, sizeof(TYPE)) , ...) | kmalloc( - sizeof(TYPE) * COUNT * (STRIDE) + array3_size(COUNT, STRIDE, sizeof(TYPE)) , ...) | kmalloc( - sizeof(TYPE) * COUNT * STRIDE + array3_size(COUNT, STRIDE, sizeof(TYPE)) , ...) | kmalloc( - sizeof(THING) * (COUNT) * (STRIDE) + array3_size(COUNT, STRIDE, sizeof(THING)) , ...) | kmalloc( - sizeof(THING) * (COUNT) * STRIDE + array3_size(COUNT, STRIDE, sizeof(THING)) , ...) | kmalloc( - sizeof(THING) * COUNT * (STRIDE) + array3_size(COUNT, STRIDE, sizeof(THING)) , ...) | kmalloc( - sizeof(THING) * COUNT * STRIDE + array3_size(COUNT, STRIDE, sizeof(THING)) , ...) ) // 3-factor product with 2 sizeof(variable), with redundant parens removed. @@ expression THING1, THING2; identifier COUNT; type TYPE1, TYPE2; @@ ( kmalloc( - sizeof(TYPE1) * sizeof(TYPE2) * COUNT + array3_size(COUNT, sizeof(TYPE1), sizeof(TYPE2)) , ...) | kmalloc( - sizeof(TYPE1) * sizeof(THING2) * (COUNT) + array3_size(COUNT, sizeof(TYPE1), sizeof(TYPE2)) , ...) | kmalloc( - sizeof(THING1) * sizeof(THING2) * COUNT + array3_size(COUNT, sizeof(THING1), sizeof(THING2)) , ...) | kmalloc( - sizeof(THING1) * sizeof(THING2) * (COUNT) + array3_size(COUNT, sizeof(THING1), sizeof(THING2)) , ...) | kmalloc( - sizeof(TYPE1) * sizeof(THING2) * COUNT + array3_size(COUNT, sizeof(TYPE1), sizeof(THING2)) , ...) | kmalloc( - sizeof(TYPE1) * sizeof(THING2) * (COUNT) + array3_size(COUNT, sizeof(TYPE1), sizeof(THING2)) , ...) ) // 3-factor product, only identifiers, with redundant parens removed. @@ identifier STRIDE, SIZE, COUNT; @@ ( kmalloc( - (COUNT) * STRIDE * SIZE + array3_size(COUNT, STRIDE, SIZE) , ...) | kmalloc( - COUNT * (STRIDE) * SIZE + array3_size(COUNT, STRIDE, SIZE) , ...) | kmalloc( - COUNT * STRIDE * (SIZE) + array3_size(COUNT, STRIDE, SIZE) , ...) | kmalloc( - (COUNT) * (STRIDE) * SIZE + array3_size(COUNT, STRIDE, SIZE) , ...) | kmalloc( - COUNT * (STRIDE) * (SIZE) + array3_size(COUNT, STRIDE, SIZE) , ...) | kmalloc( - (COUNT) * STRIDE * (SIZE) + array3_size(COUNT, STRIDE, SIZE) , ...) | kmalloc( - (COUNT) * (STRIDE) * (SIZE) + array3_size(COUNT, STRIDE, SIZE) , ...) | kmalloc( - COUNT * STRIDE * SIZE + array3_size(COUNT, STRIDE, SIZE) , ...) ) // Any remaining multi-factor products, first at least 3-factor products, // when they're not all constants... @@ expression E1, E2, E3; constant C1, C2, C3; @@ ( kmalloc(C1 * C2 * C3, ...) | kmalloc( - (E1) * E2 * E3 + array3_size(E1, E2, E3) , ...) | kmalloc( - (E1) * (E2) * E3 + array3_size(E1, E2, E3) , ...) | kmalloc( - (E1) * (E2) * (E3) + array3_size(E1, E2, E3) , ...) | kmalloc( - E1 * E2 * E3 + array3_size(E1, E2, E3) , ...) ) // And then all remaining 2 factors products when they're not all constants, // keeping sizeof() as the second factor argument. @@ expression THING, E1, E2; type TYPE; constant C1, C2, C3; @@ ( kmalloc(sizeof(THING) * C2, ...) | kmalloc(sizeof(TYPE) * C2, ...) | kmalloc(C1 * C2 * C3, ...) | kmalloc(C1 * C2, ...) | - kmalloc + kmalloc_array ( - sizeof(TYPE) * (E2) + E2, sizeof(TYPE) , ...) | - kmalloc + kmalloc_array ( - sizeof(TYPE) * E2 + E2, sizeof(TYPE) , ...) | - kmalloc + kmalloc_array ( - sizeof(THING) * (E2) + E2, sizeof(THING) , ...) | - kmalloc + kmalloc_array ( - sizeof(THING) * E2 + E2, sizeof(THING) , ...) | - kmalloc + kmalloc_array ( - (E1) * E2 + E1, E2 , ...) | - kmalloc + kmalloc_array ( - (E1) * (E2) + E1, E2 , ...) | - kmalloc + kmalloc_array ( - E1 * E2 + E1, E2 , ...) ) Signed-off-by: Kees Cook <keescook@chromium.org>
2018-06-04Merge branch 'userns-linus' of ↵Linus Torvalds1-4/+12
git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace Pull userns updates from Eric Biederman: "This is the last couple of vfs bits to enable root in a user namespace to mount and manipulate a filesystem with backing store (AKA not a virtual filesystem like proc, but a filesystem where the unprivileged user controls the content). The target filesystem for this work is fuse, and Miklos should be sending you the pull request for the fuse bits this merge window. The two key patches are "evm: Don't update hmacs in user ns mounts" and "vfs: Don't allow changing the link count of an inode with an invalid uid or gid". Those close small gaps in the vfs that would be a problem if an unprivileged fuse filesystem is mounted. The rest of the changes are things that are now safe to allow a root user in a user namespace to do with a filesystem they have mounted. The most interesting development is that remount is now safe" * 'userns-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace: fs: Allow CAP_SYS_ADMIN in s_user_ns to freeze and thaw filesystems capabilities: Allow privileged user in s_user_ns to set security.* xattrs fs: Allow superblock owner to access do_remount_sb() fs: Allow superblock owner to replace invalid owners of inodes vfs: Allow userns root to call mknod on owned filesystems. vfs: Don't allow changing the link count of an inode with an invalid uid or gid evm: Don't update hmacs in user ns mounts
2018-06-04Merge branch 'work.misc' of ↵Linus Torvalds1-3/+1
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull misc vfs updates from Al Viro: "Misc bits and pieces not fitting into anything more specific" * 'work.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: vfs: delete unnecessary assignment in vfs_listxattr Documentation: filesystems: update filesystem locking documentation vfs: namei: use path_equal() in follow_dotdot() fs.h: fix outdated comment about file flags __inode_security_revalidate() never gets NULL opt_dentry make xattr_getsecurity() static vfat: simplify checks in vfat_lookup() get rid of dead code in d_find_alias() it's SB_BORN, not MS_BORN... msdos_rmdir(): kill BS comment remove rpc_rmdir() fs: avoid fdput() after failed fdget() in vfs_dedupe_file_range()