summaryrefslogtreecommitdiffstats
path: root/fs
AgeCommit message (Collapse)AuthorFilesLines
2014-01-21/proc/meminfo: provide estimated available memoryRik van Riel1-0/+37
Many load balancing and workload placing programs check /proc/meminfo to estimate how much free memory is available. They generally do this by adding up "free" and "cached", which was fine ten years ago, but is pretty much guaranteed to be wrong today. It is wrong because Cached includes memory that is not freeable as page cache, for example shared memory segments, tmpfs, and ramfs, and it does not include reclaimable slab memory, which can take up a large fraction of system memory on mostly idle systems with lots of files. Currently, the amount of memory that is available for a new workload, without pushing the system into swap, can be estimated from MemFree, Active(file), Inactive(file), and SReclaimable, as well as the "low" watermarks from /proc/zoneinfo. However, this may change in the future, and user space really should not be expected to know kernel internals to come up with an estimate for the amount of free memory. It is more convenient to provide such an estimate in /proc/meminfo. If things change in the future, we only have to change it in one place. Signed-off-by: Rik van Riel <riel@redhat.com> Reported-by: Erik Mouw <erik.mouw_2@nxp.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-01-21fs/ramfs: don't use module_init for non-modular core codePaul Gortmaker1-1/+1
The ramfs is always built in. It will never be modular, so using module_init as an alias for __initcall is rather misleading. Fix this up now, so that we can relocate module_init from init.h into module.h in the future. If we don't do this, we'd have to add module.h to obviously non-modular code, and that would be a worse thing. Note that direct use of __initcall is discouraged, vs. one of the priority categorized subgroups. As __initcall gets mapped onto device_initcall, our use of fs_initcall (which makes sense for fs code) will thus change this registration from level 6-device to level 5-fs (i.e. slightly earlier). However no observable impact of that small difference has been observed during testing, or is expected. Also note that this change uncovers a missing semicolon bug in the registration of the initcall. Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-01-21fs/super.c: fix WARN on alloc_super() fail pathVladimir Davydov1-1/+2
On fail path alloc_super() calls destroy_super(), which issues a warning if the sb's s_mounts list is not empty, in particular if it has not been initialized. That said s_mounts must be initialized in alloc_super() before any possible failure, but currently it is initialized close to the end of the function leading to a useless warning dumped to log if either percpu_counter_init() or list_lru_init() fails. Let's fix this. Signed-off-by: Vladimir Davydov <vdavydov@parallels.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-01-21fs/read_write.c:compat_readv(): remove bogus area verifyCorey Minyard1-4/+0
The compat_do_readv_writev() function was doing a verify_area on the incoming iov, but the nr_segs value is not checked. If someone passes in a -1 for nr_segs, for instance, the function should return an EINVAL. However, it returns a EFAULT because the verify_area fails because it is checking an array of size MAX_UINT. The check is bogus, anyway, because the next check, compat_rw_copy_check_uvector(), will do all the necessary checking, anyway. The non-compat do_readv_writev() function doesn't do this check, so I think it's safe to just remove the code. Signed-off-by: Corey Minyard <cminyard@mvista.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-01-21fs/compat_ioctl.c: fix an underflow issue (harmless)Dan Carpenter1-1/+2
We cap "nmsgs" at I2C_RDRW_IOCTL_MAX_MSGS (42) but the current code allows negative values. It's harmless but it makes my static checker upset so I've made nsmgs unsigned. Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-01-21posix_acl: uninliningAndrew Morton1-5/+79
Uninline vast tracts of nested inline functions in include/linux/posix_acl.h. This reduces the text+data+bss size of x86_64 allyesconfig vmlinux by 8026 bytes. The patch also regularises the positioning of the EXPORT_SYMBOLs in posix_acl.c. Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: J. Bruce Fields <bfields@fieldses.org> Cc: Trond Myklebust <Trond.Myklebust@netapp.com> Tested-by: Benny Halevy <bhalevy@primarydata.com> Cc: Benny Halevy <bhalevy@panasas.com> Cc: Andreas Gruenbacher <agruen@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-01-21ocfs2: fix NULL pointer dereference when dismount and ocfs2rec simultaneouslyYiwen Jiang1-2/+6
2 nodes cluster, say Node A and Node B, mount the same ocfs2 volume, and create a file 1. Node A Node B open 1, get open lock rm 1, and then add 1 to orphan_dir storage link down, o2hb_write_timeout ->o2quo_disk_timeout ->emergency_restart at the moment, Node B dismount and do ocfs2rec simultaneously 1) ocfs2_dismount_volume ->ocfs2_recovery_exit ->wait_event(osb->recovery_event) ->flush_workqueue(ocfs2_wq) 2) ocfs2rec ->queue_work(&journal->j_recovery_work) ->ocfs2_recover_orphans ->ocfs2_commit_truncate ->queue_delayed_work(&osb->osb_truncate_log_wq) In ocfs2_recovery_exit, it flushes workqueue and then releases system inodes. When doing ocfs2rec, it will call ocfs2_flush_truncate_log which will try to get sys_root_inode, and NULL pointer dereference occurs. Signed-off-by: Yiwen Jiang <jiangyiwen@huawei.com> Signed-off-by: joyce <xuejiufei@huawei.com> Signed-off-by: Joseph Qi <joseph.qi@huawei.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Mark Fasheh <mfasheh@suse.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-01-21ocfs2: punch hole should return EINVAL if the length argument in ioctl is ↵Tariq Saeed1-1/+2
negative An unreserve space ioctl OCFS2_IOC_UNRESVSP/64 should reject a negative length. Orabug:14789508 Signed-off-by: Tariq Saseed <tariq.x.saeed@oracle.com> Signed-off-by: Srinivas Eeda <srinivas.eeda@oracle.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Mark Fasheh <mfasheh@suse.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-01-21ocfs2: fix sparse non static symbol warningWei Yongjun1-1/+1
Fixes the following sparse warning: fs/ocfs2/stack_user.c:930:32: warning: symbol 'ocfs2_ls_ops' was not declared. Should it be static? Signed-off-by: Wei Yongjun <yongjun_wei@trendmicro.com.cn> Cc: Mark Fasheh <mfasheh@suse.com> Cc: Joel Becker <jlbec@evilplan.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-01-21ocfs2: adjust minlen with discard_granularity in the FITRIM ioctlJie Liu1-0/+2
Adjust minlen with discard_granularity for FITRIM ioctl(2) if the given minimum size in bytes is less than it because, discard granularity is used to tell us that the minimum size of extent that can be discarded by the storage device. This is inspired by ext4 commit 5c2ed62fd447 ("ext4: Adjust minlen with discard_granularity in the FITRIM ioctl") from Lukas Czerner. Signed-off-by: Jie Liu <jeff.liu@oracle.com> Cc: Mark Fasheh <mfasheh@suse.com> Cc: Joel Becker <jlbec@evilplan.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-01-21ocfs2: return EINVAL if the given range to discard is less than block sizeJie Liu1-7/+3
For FITRIM ioctl(2), we should not keep silence if the given range length ls less than a block size as there is no data blocks would be discareded. Hence it should return EINVAL instead. This issue can be verified via xfstests/generic/288 which is used for FITRIM argument handling tests. Signed-off-by: Jie Liu <jeff.liu@oracle.com> Cc: Mark Fasheh <mfasheh@suse.com> Cc: Joel Becker <jlbec@evilplan.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-01-21ocfs2: return EOPNOTSUPP if the device does not support discardJie Liu1-0/+5
For FITRIM ioctl(2), we should return EOPNOTSUPP to inform the user that the storage device does not support discard if it is, otherwise return success would confuse the user even though there is no free blocks were trimmed at all. Signed-off-by: Jie Liu <jeff.liu@oracle.com> Cc: Mark Fasheh <mfasheh@suse.com> Cc: Joel Becker <jlbec@evilplan.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-01-21ocfs2: remove redundant ocfs2_alloc_dinode_update_counts() and ↵Younger Liu3-87/+14
ocfs2_block_group_set_bits() ocfs2_alloc_dinode_update_counts() and ocfs2_block_group_set_bits() are already provided in suballoc.c. So, the same functions in move_extents.c are not needed any more. Declare the functions in suballoc.h and remove redundant functions in move_extents.c. Signed-off-by: Younger Liu <liuyiyang@hisense.com> Cc: Younger Liu <younger.liucn@gmail.com> Cc: Mark Fasheh <mfasheh@suse.com> Cc: Joel Becker <jlbec@evilplan.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-01-21ocfs2: use the new DLM operation callbacks while requesting new lockspaceGoldwyn Rodrigues1-24/+102
Attempt to use the new DLM operations. If it is not supported, use the traditional ocfs2_controld. To exchange ocfs2 versioning, we use the LVB of the version dlm lock. It first attempts to take the lock in EX mode (non-blocking). If successful (which means it is the first mount), it writes the version number and downconverts to PR lock. If it is unsuccessful, it reads the version from the lock. If this becomes the standard (with o2cb as well), it could simplify userspace tools to check if the filesystem is mounted on other nodes. Dan: Since ocfs2_protocol_version are two u8 values, the additional checks with LONG* don't make sense. Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com> Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Reviewed-by: Mark Fasheh <mfasheh@suse.de> Cc: Joel Becker <jlbec@evilplan.org> Cc: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-01-21ocfs2: framework for version LVBGoldwyn Rodrigues1-0/+101
Use the native DLM locks for version control negotiation. Most of the framework is taken from gfs2/lock_dlm.c Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com> Reviewed-by: Mark Fasheh <mfasheh@suse.de> Cc: Joel Becker <jlbec@evilplan.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-01-21ocfs2: pass ocfs2_cluster_connection to ocfs2_this_nodeGoldwyn Rodrigues5-8/+24
This is done to differentiate between using and not using controld and use the connection information accordingly. We need to be backward compatible. So, we use a new enum ocfs2_connection_type to identify when controld is used and when it is not. Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com> Reviewed-by: Mark Fasheh <mfasheh@suse.de> Cc: Joel Becker <jlbec@evilplan.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-01-21ocfs2: shift allocation ocfs2_live_connection to user_connect()Goldwyn Rodrigues1-19/+18
We perform this because the DLM recovery callbacks will require the ocfs2_live_connection structure to record the node information when dlm_new_lockspace() is updated (in the last patch of the series). Before calling dlm_new_lockspace(), we need the structure ready for the .recover_done() callback, which would set oc_this_node. This is the reason we allocate ocfs2_live_connection beforehand in user_connect(). [AKPM] rc initialization is not required because it assigned in case of errors. It will be cleared by compiler anyways. Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com> Reveiwed-by: Mark Fasheh <mfasheh@suse.de> Cc: Joel Becker <jlbec@evilplan.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-01-21ocfs2: add DLM recovery callbacksGoldwyn Rodrigues1-0/+38
These are the callbacks called by the fs/dlm code in case the membership changes. If there is a failure while/during calling any of these, the DLM creates a new membership and relays to the rest of the nodes. - recover_prep() is called when DLM understands a node is down. - recover_slot() is called once all nodes have acknowledged recover_prep and recovery can begin. - recover_done() is called once the recovery is complete. It returns the new membership. Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com> Reviewed-by: Mark Fasheh <mfasheh@suse.de> Cc: Joel Becker <jlbec@evilplan.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-01-21ocfs2: add clustername to cluster connectionGoldwyn Rodrigues5-7/+24
This is an effort of removing ocfs2_controld.pcmk and getting ocfs2 DLM handling up to the times with respect to DLM (>=4.0.1) and corosync (2.3.x). AFAIK, cman also is being phased out for a unified corosync cluster stack. fs/dlm performs all the functions with respect to fencing and node management and provides the API's to do so for ocfs2. For all future references, DLM stands for fs/dlm code. The advantages are: + No need to run an additional userspace daemon (ocfs2_controld) + No controld device handling and controld protocol + Shifting responsibilities of node management to DLM layer For backward compatibility, we are keeping the controld handling code. Once enough time has passed we can remove a significant portion of the code. This was tested by using the kernel with changes on older unmodified tools. The kernel used ocfs2_controld as expected, and displayed the appropriate warning message. This feature requires modification in the userspace ocfs2-tools. The changes can be found at: https://github.com/goldwynr/ocfs2-tools branch: nocontrold Currently, not many checks are present in the userspace code, but that would change soon. This patch (of 6): Add clustername to cluster connection. Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com> Reviewed-by: Mark Fasheh <mfasheh@suse.de> Cc: Joel Becker <jlbec@evilplan.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-01-21ocfs2: remove versioning informationGoldwyn Rodrigues16-310/+7
The versioning information is confusing for end-users. The numbers are stuck at 1.5.0 when the tools version have moved to 1.8.2. Remove the versioning system in the OCFS2 modules and let the kernel version be the guide to debug issues. Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com> Acked-by: Sunil Mushran <sunil.mushran@gmail.com> Cc: Mark Fasheh <mfasheh@suse.com> Acked-by: Joel Becker <jlbec@evilplan.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-01-21fsnotify: remove pointless NULL initializersJan Kara2-4/+0
We usually rely on the fact that struct members not specified in the initializer are set to NULL. So do that with fsnotify function pointers as well. Signed-off-by: Jan Kara <jack@suse.cz> Reviewed-by: Christoph Hellwig <hch@lst.de> Cc: Eric Paris <eparis@parisplace.org> Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-01-21fsnotify: remove .should_send_event callbackJan Kara4-48/+21
After removing event structure creation from the generic layer there is no reason for separate .should_send_event and .handle_event callbacks. So just remove the first one. Signed-off-by: Jan Kara <jack@suse.cz> Reviewed-by: Christoph Hellwig <hch@lst.de> Cc: Eric Paris <eparis@parisplace.org> Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-01-21fsnotify: do not share events between notification groupsJan Kara10-611/+279
Currently fsnotify framework creates one event structure for each notification event and links this event into all interested notification groups. This is done so that we save memory when several notification groups are interested in the event. However the need for event structure shared between inotify & fanotify bloats the event structure so the result is often higher memory consumption. Another problem is that fsnotify framework keeps path references with outstanding events so that fanotify can return open file descriptors with its events. This has the undesirable effect that filesystem cannot be unmounted while there are outstanding events - a regression for inotify compared to a situation before it was converted to fsnotify framework. For fanotify this problem is hard to avoid and users of fanotify should kind of expect this behavior when they ask for file descriptors from notified files. This patch changes fsnotify and its users to create separate event structure for each group. This allows for much simpler code (~400 lines removed by this patch) and also smaller event structures. For example on 64-bit system original struct fsnotify_event consumes 120 bytes, plus additional space for file name, additional 24 bytes for second and each subsequent group linking the event, and additional 32 bytes for each inotify group for private data. After the conversion inotify event consumes 48 bytes plus space for file name which is considerably less memory unless file names are long and there are several groups interested in the events (both of which are uncommon). Fanotify event fits in 56 bytes after the conversion (fanotify doesn't care about file names so its events don't have to have it allocated). A win unless there are four or more fanotify groups interested in the event. The conversion also solves the problem with unmount when only inotify is used as we don't have to grab path references for inotify events. [hughd@google.com: fanotify: fix corruption preventing startup] Signed-off-by: Jan Kara <jack@suse.cz> Reviewed-by: Christoph Hellwig <hch@lst.de> Cc: Eric Paris <eparis@parisplace.org> Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Hugh Dickins <hughd@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-01-21inotify: provide function for name length roundingJan Kara1-20/+21
Rounding of name length when passing it to userspace was done in several places. Provide a function to do it and use it in all places. Signed-off-by: Jan Kara <jack@suse.cz> Reviewed-by: Christoph Hellwig <hch@lst.de> Cc: Eric Paris <eparis@parisplace.org> Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-01-20Merge tag 'driver-core-3.14-rc1' of ↵Linus Torvalds17-2760/+3114
git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core Pull driver core / sysfs patches from Greg KH: "Here's the big driver core and sysfs patch set for 3.14-rc1. There's a lot of work here moving sysfs logic out into a "kernfs" to allow other subsystems to also have a virtual filesystem with the same attributes of sysfs (handle device disconnect, dynamic creation / removal as needed / unneeded, etc) This is primarily being done for the cgroups filesystem, but the goal is to also move debugfs to it when it is ready, solving all of the known issues in that filesystem as well. The code isn't completed yet, but all should be stable now (there is a big section that was reverted due to problems found when testing) There's also some other smaller fixes, and a driver core addition that allows for a "collection" of objects, that the DRM people will be using soon (it's in this tree to make merges after -rc1 easier) All of this has been in linux-next with no reported issues" * tag 'driver-core-3.14-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core: (113 commits) kernfs: associate a new kernfs_node with its parent on creation kernfs: add struct dentry declaration in kernfs.h kernfs: fix get_active failure handling in kernfs_seq_*() Revert "kernfs: fix get_active failure handling in kernfs_seq_*()" Revert "kernfs: replace kernfs_node->u.completion with kernfs_root->deactivate_waitq" Revert "kernfs: remove KERNFS_ACTIVE_REF and add kernfs_lockdep()" Revert "kernfs: remove KERNFS_REMOVED" Revert "kernfs: restructure removal path to fix possible premature return" Revert "kernfs: invoke kernfs_unmap_bin_file() directly from __kernfs_remove()" Revert "kernfs: remove kernfs_addrm_cxt" Revert "kernfs: make kernfs_get_active() block if the node is deactivated but not removed" Revert "kernfs: implement kernfs_{de|re}activate[_self]()" Revert "kernfs, sysfs, driver-core: implement kernfs_remove_self() and its wrappers" Revert "pci: use device_remove_file_self() instead of device_schedule_callback()" Revert "scsi: use device_remove_file_self() instead of device_schedule_callback()" Revert "s390: use device_remove_file_self() instead of device_schedule_callback()" Revert "sysfs, driver-core: remove unused {sysfs|device}_schedule_callback_owner()" Revert "kernfs: remove unnecessary NULL check in __kernfs_remove()" kernfs: remove unnecessary NULL check in __kernfs_remove() drivers/base: provide an infrastructure for componentised subsystems ...
2014-01-17Merge branch 'for-linus' of ↵Linus Torvalds2-2/+7
git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace Pull namespace fixes from Eric Biederman: "This is a set of 3 regression fixes. This fixes /proc/mounts when using "ip netns add <netns>" to display the actual mount point. This fixes a regression in clone that broke lxc-attach. This fixes a regression in the permission checks for mounting /proc that made proc unmountable if binfmt_misc was in use. Oops. My apologies for sending this pull request so late. Al Viro gave interesting review comments about the d_path fix that I wanted to address in detail before I sent this pull request. Unfortunately a bad round of colds kept from addressing that in detail until today. The executive summary of the review was: Al: Is patching d_path really sufficient? The prepend_path, d_path, d_absolute_path, and __d_path family of functions is a really mess. Me: Yes, patching d_path is really sufficient. Yes, the code is mess. No it is not appropriate to rewrite all of d_path for a regression that has existed for entirely too long already, when a two line change will do" * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace: vfs: Fix a regression in mounting proc fork: Allow CLONE_PARENT after setns(CLONE_NEWPID) vfs: In d_path don't call d_dname on a mount point
2014-01-17kernfs: associate a new kernfs_node with its parent on creationTejun Heo4-24/+34
Once created, a kernfs_node is always destroyed by kernfs_put(). Since ba7443bc656e ("sysfs, kernfs: implement kernfs_create/destroy_root()"), kernfs_put() depends on kernfs_root() to locate the ino_ida. kernfs_root() in turn depends on kernfs_node->parent being set for !dir nodes. This means that kernfs_put() of a !dir node requires its ->parent to be initialized. This leads to oops when a newly created !dir node is destroyed without going through kernfs_add_one() or after failing kernfs_add_one() before ->parent is set. kernfs_root() invoked from kernfs_put() will try to dereference NULL parent. Fix it by moving parent association to kernfs_new_node() from kernfs_add_one(). kernfs_new_node() now takes @parent instead of @root and determines the root from the parent and also sets the new node's parent properly. @parent parameter is removed from kernfs_add_one(). As there's no parent when creating the root node, __kernfs_new_node() which takes @root as before and doesn't set the parent is used in that case. This ensures that a kernfs_node in any stage in its life has its parent associated and thus can be put. Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-01-16Merge tag 'writeback-fixes' of ↵Linus Torvalds1-6/+9
git://git.kernel.org/pub/scm/linux/kernel/git/wfg/linux Pull writeback fix from Wu Fengguang: "Fix data corruption on NFS writeback. It has been in linux-next for one month" * tag 'writeback-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/wfg/linux: writeback: Fix data corruption on NFS
2014-01-15nilfs2: fix segctor bug that causes file system corruptionAndreas Rohner1-4/+6
There is a bug in the function nilfs_segctor_collect, which results in active data being written to a segment, that is marked as clean. It is possible, that this segment is selected for a later segment construction, whereby the old data is overwritten. The problem shows itself with the following kernel log message: nilfs_sufile_do_cancel_free: segment 6533 must be clean Usually a few hours later the file system gets corrupted: NILFS: bad btree node (blocknr=8748107): level = 0, flags = 0x0, nchildren = 0 NILFS error (device sdc1): nilfs_bmap_last_key: broken bmap (inode number=114660) The issue can be reproduced with a file system that is nearly full and with the cleaner running, while some IO intensive task is running. Although it is quite hard to reproduce. This is what happens: 1. The cleaner starts the segment construction 2. nilfs_segctor_collect is called 3. sc_stage is on NILFS_ST_SUFILE and segments are freed 4. sc_stage is on NILFS_ST_DAT current segment is full 5. nilfs_segctor_extend_segments is called, which allocates a new segment 6. The new segment is one of the segments freed in step 3 7. nilfs_sufile_cancel_freev is called and produces an error message 8. Loop around and the collection starts again 9. sc_stage is on NILFS_ST_SUFILE and segments are freed including the newly allocated segment, which will contain active data and can be allocated at a later time 10. A few hours later another segment construction allocates the segment and causes file system corruption This can be prevented by simply reordering the statements. If nilfs_sufile_cancel_freev is called before nilfs_segctor_extend_segments the freed segments are marked as dirty and cannot be allocated any more. Signed-off-by: Andreas Rohner <andreas.rohner@gmx.net> Reviewed-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp> Tested-by: Andreas Rohner <andreas.rohner@gmx.net> Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-01-14kernfs: fix get_active failure handling in kernfs_seq_*()Tejun Heo1-7/+44
When kernfs_seq_start() fails to obtain an active reference, it returns ERR_PTR(-ENODEV). kernfs_seq_stop() is then invoked with the error pointer value; however, it still proceeds to invoke kernfs_put_active() on the node leading to unbalanced put. If kernfs_seq_stop() is called even after active ref failure, it should skip invocation of @ops->seq_stop() and put_active. Unfortunately, this is a bit complicated because active ref failure isn't the only thing which may fail with ERR_PTR(-ENODEV). @ops->seq_start/next() may also fail with the error value and kernfs_seq_stop() doesn't have a way to tell apart those failures. Work it around by factoring out the active part of kernfs_seq_stop() into kernfs_seq_stop_active() and invoking it directly if @ops->seq_start/next() fail with ERR_PTR(-ENODEV) and updating kernfs_seq_stop() to skip kernfs_seq_stop_active() on ERR_PTR(-ENODEV). This is a bit nasty but ensures that the active put is skipped iff get_active failed in kernfs_seq_start(). tj: This was originally committed as d92d2e6bd72b but got reverted by 683bb2761fbf along with other kernfs self removal patches. However, this one is an independent fix and shouldn't have been reverted together. Reinstate the change. Sorry about the mess. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Sasha Levin <sasha.levin@oracle.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-01-13Revert "kernfs: fix get_active failure handling in kernfs_seq_*()"Greg Kroah-Hartman1-44/+7
This reverts commit d92d2e6bd72b653f9811e0c9c46307c743b3fc58. Tejun writes: I'm sorry but can you please revert the whole series? get_active() waiting while a node is deactivated has potential to lead to deadlock and that deactivate/reactivate interface is something fundamentally flawed and that cgroup will have to work with the remove_self() like everybody else. IOW, I think the first posting was correct. Cc: Tejun Heo <tj@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-01-13Revert "kernfs: replace kernfs_node->u.completion with ↵Greg Kroah-Hartman1-11/+16
kernfs_root->deactivate_waitq" This reverts commit ea1c472dfeada211a0100daa7976e8e8e779b858. Tejun writes: I'm sorry but can you please revert the whole series? get_active() waiting while a node is deactivated has potential to lead to deadlock and that deactivate/reactivate interface is something fundamentally flawed and that cgroup will have to work with the remove_self() like everybody else. IOW, I think the first posting was correct. Cc: Tejun Heo <tj@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-01-13Revert "kernfs: remove KERNFS_ACTIVE_REF and add kernfs_lockdep()"Greg Kroah-Hartman1-20/+11
This reverts commit a69d001cfc712b96ec9d7ba44d6285702a38dabf. Tejun writes: I'm sorry but can you please revert the whole series? get_active() waiting while a node is deactivated has potential to lead to deadlock and that deactivate/reactivate interface is something fundamentally flawed and that cgroup will have to work with the remove_self() like everybody else. IOW, I think the first posting was correct. Cc: Tejun Heo <tj@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-01-13Revert "kernfs: remove KERNFS_REMOVED"Greg Kroah-Hartman4-60/+42
This reverts commit ae34372eb8408b3d07e870f1939f99007a730d28. Tejun writes: I'm sorry but can you please revert the whole series? get_active() waiting while a node is deactivated has potential to lead to deadlock and that deactivate/reactivate interface is something fundamentally flawed and that cgroup will have to work with the remove_self() like everybody else. IOW, I think the first posting was correct. Cc: Tejun Heo <tj@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-01-13Revert "kernfs: restructure removal path to fix possible premature return"Greg Kroah-Hartman1-86/+53
This reverts commit 45a140e587f3d32d8d424ed940dffb61e1739047. Tejun writes: I'm sorry but can you please revert the whole series? get_active() waiting while a node is deactivated has potential to lead to deadlock and that deactivate/reactivate interface is something fundamentally flawed and that cgroup will have to work with the remove_self() like everybody else. IOW, I think the first posting was correct. Cc: Tejun Heo <tj@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-01-13Revert "kernfs: invoke kernfs_unmap_bin_file() directly from __kernfs_remove()"Greg Kroah-Hartman3-35/+14
This reverts commit f601f9a2bf7dc1f7ee18feece4c4e2fc6845d6c4. Tejun writes: I'm sorry but can you please revert the whole series? get_active() waiting while a node is deactivated has potential to lead to deadlock and that deactivate/reactivate interface is something fundamentally flawed and that cgroup will have to work with the remove_self() like everybody else. IOW, I think the first posting was correct. Cc: Tejun Heo <tj@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-01-13Revert "kernfs: remove kernfs_addrm_cxt"Greg Kroah-Hartman4-29/+117
This reverts commit 99177a34110889a8f2c36420c34e3bcc9bfd8a70. Tejun writes: I'm sorry but can you please revert the whole series? get_active() waiting while a node is deactivated has potential to lead to deadlock and that deactivate/reactivate interface is something fundamentally flawed and that cgroup will have to work with the remove_self() like everybody else. IOW, I think the first posting was correct. Cc: Tejun Heo <tj@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-01-13Revert "kernfs: make kernfs_get_active() block if the node is deactivated ↵Greg Kroah-Hartman1-21/+4
but not removed" This reverts commit 895a068a524e134900b9d98b519309b7aae7bbb1. Tejun writes: I'm sorry but can you please revert the whole series? get_active() waiting while a node is deactivated has potential to lead to deadlock and that deactivate/reactivate interface is something fundamentally flawed and that cgroup will have to work with the remove_self() like everybody else. IOW, I think the first posting was correct. Cc: Tejun Heo <tj@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-01-13Revert "kernfs: implement kernfs_{de|re}activate[_self]()"Greg Kroah-Hartman1-117/+1
This reverts commit 9f010c2ad5194a4b682e747984477850fabd03be. Tejun writes: I'm sorry but can you please revert the whole series? get_active() waiting while a node is deactivated has potential to lead to deadlock and that deactivate/reactivate interface is something fundamentally flawed and that cgroup will have to work with the remove_self() like everybody else. IOW, I think the first posting was correct. Cc: Tejun Heo <tj@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-01-13Revert "kernfs, sysfs, driver-core: implement kernfs_remove_self() and its ↵Greg Kroah-Hartman2-95/+0
wrappers" This reverts commit 1ae06819c77cff1ea2833c94f8c093fe8a5c79db. Tejun writes: I'm sorry but can you please revert the whole series? get_active() waiting while a node is deactivated has potential to lead to deadlock and that deactivate/reactivate interface is something fundamentally flawed and that cgroup will have to work with the remove_self() like everybody else. IOW, I think the first posting was correct. Cc: Tejun Heo <tj@kernel.org> Cc: Alan Stern <stern@rowland.harvard.edu> Cc: kbuild test robot <fengguang.wu@intel.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-01-13Revert "sysfs, driver-core: remove unused ↵Greg Kroah-Hartman1-0/+92
{sysfs|device}_schedule_callback_owner()" This reverts commit d1ba277e79889085a2faec3b68b91ce89c63f888. Tejun writes: I'm sorry but can you please revert the whole series? get_active() waiting while a node is deactivated has potential to lead to deadlock and that deactivate/reactivate interface is something fundamentally flawed and that cgroup will have to work with the remove_self() like everybody else. IOW, I think the first posting was correct. Cc: Tejun Heo <tj@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-01-13Revert "kernfs: remove unnecessary NULL check in __kernfs_remove()"Greg Kroah-Hartman1-0/+3
This reverts commit 88533f990c616cf50c2fe585ea03f75c806a293d. Tejun writes: I'm sorry but can you please revert the whole series? get_active() waiting while a node is deactivated has potential to lead to deadlock and that deactivate/reactivate interface is something fundamentally flawed and that cgroup will have to work with the remove_self() like everybody else. IOW, I think the first posting was correct. Cc: Tejun Heo <tj@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-01-11kernfs: remove unnecessary NULL check in __kernfs_remove()Tejun Heo1-3/+0
895a068a524e ("kernfs: make kernfs_get_active() block if the node is deactivated but not removed") added "struct kernfs_root *root = kernfs_root(kn);" at the head of the function; however, the parameter @kn is checked for later implying that the function may be called with NULL. This means that we may end up invoking kernfs_root() with NULL which will oops. None of the existing users invokes removal with NULL @kn, so this bug doesn't actually trigger. We can relocate kernfs_root() invocation after NULL check; however, allowing NULL param tends to cause more confusion than actually helping anything. As there's no existing user, let's remove the spurious NULL check. This bug was detected by smatch. Signed-off-by: Tejun Heo <tj@kernel.org> Reported-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-01-10sysfs, driver-core: remove unused {sysfs|device}_schedule_callback_owner()Tejun Heo1-92/+0
All device_schedule_callback_owner() users are converted to use device_remove_file_self(). Remove now unused {sysfs|device}_schedule_callback_owner(). Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-01-11Merge tag 'xfs-for-linus-v3.13-rc8' of git://oss.sgi.com/xfs/xfsLinus Torvalds2-1/+2
Pull xfs bugfixes from Ben Myers: "Here we have a bugfix for an off-by-one in the remote attribute verifier that results in a forced shutdown which you can hit with v5 superblock by creating a 64k xattr, and a fix for a missing destroy_work_on_stack() in the allocation worker. It's a bit late, but they are both fairly straightforward" * tag 'xfs-for-linus-v3.13-rc8' of git://oss.sgi.com/xfs/xfs: xfs: Calling destroy_work_on_stack() to pair with INIT_WORK_ONSTACK() xfs: fix off-by-one error in xfs_attr3_rmt_verify
2014-01-10kernfs, sysfs, driver-core: implement kernfs_remove_self() and its wrappersTejun Heo2-0/+95
Sometimes it's necessary to implement a node which wants to delete nodes including itself. This isn't straightforward because of kernfs active reference. While a file operation is in progress, an active reference is held and kernfs_remove() waits for all such references to drain before completing. For a self-deleting node, this is a deadlock as kernfs_remove() ends up waiting for an active reference that itself is sitting on top of. This currently is worked around in the sysfs layer using sysfs_schedule_callback() which makes such removals asynchronous. While it works, it's rather cumbersome and inherently breaks synchronicity of the operation - the file operation which triggered the operation may complete before the removal is finished (or even started) and the removal may fail asynchronously. If a removal operation is immmediately followed by another operation which expects the specific name to be available (e.g. removal followed by rename onto the same name), there's no way to make the latter operation reliable. The thing is there's no inherent reason for this to be asynchrnous. All that's necessary to do this synchronous is a dedicated operation which drops its own active ref and deactivates self. This patch implements kernfs_remove_self() and its wrappers in sysfs and driver core. kernfs_remove_self() is to be called from one of the file operations, drops the active ref and deactivates using __kernfs_deactivate_self(), removes the self node, and restores active ref to the dead node using __kernfs_reactivate_self() so that the ref is balanced afterwards. __kernfs_remove() is updated so that it takes an early exit if the target node is already fully removed so that the active ref restored by kernfs_remove_self() after removal doesn't confuse the deactivation path. This makes implementing self-deleting nodes very easy. The normal removal path doesn't even need to be changed to use kernfs_remove_self() for the self-deleting node. The method can invoke kernfs_remove_self() on itself before proceeding the normal removal path. kernfs_remove() invoked on the node by the normal deletion path will simply be ignored. This will replace sysfs_schedule_callback(). A subtle feature of sysfs_schedule_callback() is that it collapses multiple invocations - even if multiple removals are triggered, the removal callback is run only once. An equivalent effect can be achieved by testing the return value of kernfs_remove_self() - only the one which gets %true return value should proceed with actual deletion. All other instances of kernfs_remove_self() will wait till the enclosing kernfs operation which invoked the winning instance of kernfs_remove_self() finishes and then return %false. This trivially makes all users of kernfs_remove_self() automatically show correct synchronous behavior even when there are multiple concurrent operations - all "echo 1 > delete" instances will finish only after the whole operation is completed by one of the instances. v2: For !CONFIG_SYSFS, dummy version kernfs_remove_self() was missing and sysfs_remove_file_self() had incorrect return type. Fix it. Reported by kbuild test bot. v3: Updated to use __kernfs_{de|re}activate_self(). Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Alan Stern <stern@rowland.harvard.edu> Cc: kbuild test robot <fengguang.wu@intel.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-01-10kernfs: implement kernfs_{de|re}activate[_self]()Tejun Heo1-1/+117
This patch implements four functions to manipulate deactivation state - deactivate, reactivate and the _self suffixed pair. A new fields kernfs_node->deact_depth is added so that concurrent and nested deactivations are handled properly. kernfs_node->hash is moved so that it's paired with the new field so that it doesn't increase the size of kernfs_node. A kernfs user's lock would normally nest inside active ref but during removal the user may want to perform kernfs_remove() while holding the said lock, which would introduce a reverse locking dependency. This function can be used to break such reverse dependency by allowing deactivation step to performed separately outside user's critical section. This will also be used implement kernfs_remove_self(). Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-01-10kernfs: make kernfs_get_active() block if the node is deactivated but not ↵Tejun Heo1-4/+21
removed Currently, kernfs_get_active() fails if the target node is deactivated. This is fine as a node always gets removed after deactivation; however, we're gonna add reactivation so the assumption won't hold. It'd be incorrect for kernfs_get_active() to fail for a node which was deactivated only temporarily. This patch makes kernfs_get_active() block if the node is deactivated but not removed. If the node gets reactivated (not yet implemented), it will be retried and succeed. If the node gets removed, it will be woken up and fail. Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-01-10kernfs: remove kernfs_addrm_cxtTejun Heo4-117/+29
kernfs_addrm_cxt and the accompanying kernfs_addrm_start/finish() were added because there were operations which should be performed outside kernfs_mutex after adding and removing kernfs_nodes. The necessary operations were recorded in kernfs_addrm_cxt and performed by kernfs_addrm_finish(); however, after the recent changes which relocated deactivation and unmapping so that they're performed directly during removal, the only operation kernfs_addrm_finish() performs is kernfs_put(), which can be moved inside the removal path too. This patch moves the kernfs_put() of the base ref to __kernfs_remove() and remove kernfs_addrm_cxt and kernfs_addrm_start/finish(). * kernfs_add_one() is updated to grab and release the parent's active ref and kernfs_mutex itself. kernfs_get/put_active() and kernfs_addrm_start/finish() invocations around it are removed from all users. * __kernfs_remove() puts an unlinked node directly instead of chaining it to kernfs_addrm_cxt. Its callers are updated to grab and release kernfs_mutex instead of calling kernfs_addrm_start/finish() around it. v2: Updated to fit the v2 restructuring of removal path. Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-01-10kernfs: invoke kernfs_unmap_bin_file() directly from __kernfs_remove()Tejun Heo3-14/+35
kernfs_unmap_bin_file() is supposed to unmap all memory mappings of the target file before kernfs_remove() finishes; however, it currently is being called from kernfs_addrm_finish() and has the same race problem as the original implementation of deactivation when there are multiple removers - only the remover which snatches the node to its addrm_cxt->removed list is guaranteed to wait for its completion before returning. It can be fixed by moving kernfs_unmap_bin_file() invocation from kernfs_addrm_finish() to __kernfs_remove(). The function may be called multiple times but that shouldn't do any harm. We end up dropping kernfs_mutex in the removal loop and the node may be removed inbetween by someone else. kernfs_unlink_sibling() is updated to test whether the node has already been removed and return accordingly. __kernfs_remove() in turn performs post-unlinking cleanup only if it actually unlinked the node. KERNFS_HAS_MMAP test is moved out of the unmap function into __kernfs_remove() so that we don't unlock kernfs_mutex unnecessarily. While at it, drop the now meaningless "bin" qualifier from the function name. v2: Rewritten to fit the v2 restructuring of removal path. HAS_MMAP test relocated. Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>