summaryrefslogtreecommitdiffstats
path: root/fs
AgeCommit message (Collapse)AuthorFilesLines
2016-01-14kmemcg: account certain kmem allocations to memcgVladimir Davydov52-68/+82
Mark those kmem allocations that are known to be easily triggered from userspace as __GFP_ACCOUNT/SLAB_ACCOUNT, which makes them accounted to memcg. For the list, see below: - threadinfo - task_struct - task_delay_info - pid - cred - mm_struct - vm_area_struct and vm_region (nommu) - anon_vma and anon_vma_chain - signal_struct - sighand_struct - fs_struct - files_struct - fdtable and fdtable->full_fds_bits - dentry and external_name - inode for all filesystems. This is the most tedious part, because most filesystems overwrite the alloc_inode method. The list is far from complete, so feel free to add more objects. Nevertheless, it should be close to "account everything" approach and keep most workloads within bounds. Malevolent users will be able to breach the limit, but this was possible even with the former "account everything" approach (simply because it did not account everything in fact). [akpm@linux-foundation.org: coding-style fixes] Signed-off-by: Vladimir Davydov <vdavydov@virtuozzo.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Tejun Heo <tj@kernel.org> Cc: Greg Thelen <gthelen@google.com> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-01-14Revert "kernfs: do not account ino_ida allocations to memcg"Vladimir Davydov1-8/+1
Currently, all kmem allocations (namely every kmem_cache_alloc, kmalloc, alloc_kmem_pages call) are accounted to memory cgroup automatically. Callers have to explicitly opt out if they don't want/need accounting for some reason. Such a design decision leads to several problems: - kmalloc users are highly sensitive to failures, many of them implicitly rely on the fact that kmalloc never fails, while memcg makes failures quite plausible. - A lot of objects are shared among different containers by design. Accounting such objects to one of containers is just unfair. Moreover, it might lead to pinning a dead memcg along with its kmem caches, which aren't tiny, which might result in noticeable increase in memory consumption for no apparent reason in the long run. - There are tons of short-lived objects. Accounting them to memcg will only result in slight noise and won't change the overall picture, but we still have to pay accounting overhead. For more info, see - http://lkml.kernel.org/r/20151105144002.GB15111%40dhcp22.suse.cz - http://lkml.kernel.org/r/20151106090555.GK29259@esperanza Therefore this patchset switches to the white list policy. Now kmalloc users have to explicitly opt in by passing __GFP_ACCOUNT flag. Currently, the list of accounted objects is quite limited and only includes those allocations that (1) are known to be easily triggered from userspace and (2) can fail gracefully (for the full list see patch no. 6) and it still misses many object types. However, accounting only those objects should be a satisfactory approximation of the behavior we used to have for most sane workloads. This patch (of 6): Revert 499611ed451508a42d1d7d ("kernfs: do not account ino_ida allocations to memcg"). Black-list kmem accounting policy (aka __GFP_NOACCOUNT) turned out to be fragile and difficult to maintain, because there seem to be many more allocations that should not be accounted than those that should be. Besides, false accounting an allocation might result in much worse consequences than not accounting at all, namely increased memory consumption due to pinned dead kmem caches. So it was decided to switch to the white-list policy. This patch reverts bits introducing the black-list policy. The white-list policy will be introduced later in the series. Signed-off-by: Vladimir Davydov <vdavydov@virtuozzo.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Tejun Heo <tj@kernel.org> Cc: Greg Thelen <gthelen@google.com> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-01-14ocfs2/dlm: cleanup redunant lksb flags in dlmcommon.hJoseph Qi1-11/+0
lksb flags are defined both in dlmapi.h and dlmcommon.h. So clean them up from dlmcommon.h. Signed-off-by: Joseph Qi <joseph.qi@huawei.com> Reviewed-by: Jiufei Xue <xuejiufei@huawei.com> Cc: Mark Fasheh <mfasheh@suse.de> Cc: Joel Becker <jlbec@evilplan.org> Cc: Junxiao Bi <junxiao.bi@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-01-14ocfs2: dlm: remove redundant codeJunxiao Bi1-5/+1
Found this when do patch review, remove to make it clear and save a little cpu time. Signed-off-by: Junxiao Bi <junxiao.bi@oracle.com> Cc: Mark Fasheh <mfasheh@suse.de> Cc: Joel Becker <jlbec@evilplan.org> Cc: Joseph Qi <joseph.qi@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-01-14ocfs2: access orphan dinode before delete entry in ocfs2_orphan_delJoseph Qi1-9/+9
In ocfs2_orphan_del, currently it finds and deletes entry first, and then access orphan dir dinode. This will have a problem once ocfs2_journal_access_di fails. In this case, entry will be removed from orphan dir, but in deed the inode hasn't been deleted successfully. In other words, the file is missing but not actually deleted. So we should access orphan dinode first like unlink and rename. Signed-off-by: Joseph Qi <joseph.qi@huawei.com> Reviewed-by: Jiufei Xue <xuejiufei@huawei.com> Cc: Mark Fasheh <mfasheh@suse.de> Cc: Joel Becker <jlbec@evilplan.org> Reviewed-by: Junxiao Bi <junxiao.bi@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-01-14ocfs2/dlm: do not insert a new mle when another process is already migratingxuejiufei1-2/+3
When two processes are migrating the same lockres, dlm_add_migration_mle() return -EEXIST, but insert a new mle in hash list. dlm_migrate_lockres() will detach the old mle and free the new one which is already in hash list, that will destroy the list. Signed-off-by: Jiufei Xue <xuejiufei@huawei.com> Reviewed-by: Joseph Qi <joseph.qi@huawei.com> Cc: Mark Fasheh <mfasheh@suse.de> Cc: Joel Becker <jlbec@evilplan.org> Reviewed-by: Junxiao Bi <junxiao.bi@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-01-14ocfs2/dlm: ignore cleaning the migration mle that is inusexuejiufei1-11/+15
We have found that migration source will trigger a BUG that the refcount of mle is already zero before put when the target is down during migration. The situation is as follows: dlm_migrate_lockres dlm_add_migration_mle dlm_mark_lockres_migrating dlm_get_mle_inuse <<<<<< Now the refcount of the mle is 2. dlm_send_one_lockres and wait for the target to become the new master. <<<<<< o2hb detect the target down and clean the migration mle. Now the refcount is 1. dlm_migrate_lockres woken, and put the mle twice when found the target goes down which trigger the BUG with the following message: "ERROR: bad mle: ". Signed-off-by: Jiufei Xue <xuejiufei@huawei.com> Reviewed-by: Joseph Qi <joseph.qi@huawei.com> Cc: Mark Fasheh <mfasheh@suse.de> Cc: Joel Becker <jlbec@evilplan.org> Cc: Junxiao Bi <junxiao.bi@oracle.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-01-14ocfs2: do not lock/unlock() inode DLM lockGoldwyn Rodrigues1-8/+0
DLM does not cache locks. So, blocking lock and unlock will only make the performance worse where contention over the locks is high. Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com> Cc: Mark Fasheh <mfasheh@suse.de> Cc: Joel Becker <jlbec@evilplan.org> Reviewed-by: Junxiao Bi <junxiao.bi@oracle.com> Cc: Joseph Qi <joseph.qi@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-01-14ocfs2: fix slot overwritten if storage link down during mountjiangyiwen1-1/+10
The following case will lead to slot overwritten. N1 N2 mount ocfs2 volume, find and allocate slot 0, then set osb->slot_num to 0, begin to write slot info to disk mount ocfs2 volume, wait for super lock write block fail because of storage link down, unlock super lock got super lock and also allocate slot 0 then unlock super lock mount fail and then dismount, since osb->slot_num is 0, try to put invalid slot to disk. And it will succeed if storage link restores. N2 slot info is now overwritten Once another node say N3 mount, it will find and allocate slot 0 again, which will lead to mount hung because journal has already been locked by N2. so when write slot info failed, invalidate slot in advance to avoid overwrite slot. [akpm@linux-foundation.org: coding-style fixes] Signed-off-by: Yiwen Jiang <jiangyiwen@huawei.com> Reviewed-by: Joseph Qi <joseph.qi@huawei.com> Cc: Mark Fasheh <mfasheh@suse.de> Cc: Joel Becker <jlbec@evilplan.org> Cc: Junxiao Bi <junxiao.bi@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-01-14ocfs2/dlm: return appropriate value when dlm_grab() returns NULLXue jiufei2-2/+2
dlm_grab() may return NULL when the node is doing unmount. When doing code review, we found that some dlm handlers may return error to caller when dlm_grab() returns NULL and make caller BUG or other problems. Here is an example: Node 1 Node 2 receives migration message from node 3, and send migrate request to others start unmounting receives migrate request from node 1 and call dlm_migrate_request_handler() unmount thread unregisters domain handlers and removes dlm_context from dlm_domains dlm_migrate_request_handlers() returns -EINVAL to node 1 Exit migration neither clearing the migration state nor sending assert master message to node 3 which cause node 3 hung. Signed-off-by: Jiufei Xue <xuejiufei@huawei.com> Reviewed-by: Joseph Qi <joseph.qi@huawei.com> Reviewed-by: Yiwen Jiang <jiangyiwen@huawei.com> Cc: Mark Fasheh <mfasheh@suse.de> Cc: Joel Becker <jlbec@evilplan.org> Cc: Junxiao Bi <junxiao.bi@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-01-14ocfs2: clean up redundant NULL check before iputJoseph Qi7-25/+11
Since iput will take care the NULL check itself, NULL check before calling it is redundant. So clean them up. Signed-off-by: Joseph Qi <joseph.qi@huawei.com> Cc: Mark Fasheh <mfasheh@suse.de> Cc: Joel Becker <jlbec@evilplan.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-01-14ocfs2/dlm: wait until DLM_LOCK_RES_SETREF_INPROG is cleared in ↵jiangyiwen1-1/+1
dlm_deref_lockres_worker Commit f3f854648de6 ("ocfs2_dlm: Ensure correct ordering of set/clear refmap bit on lockres") still exists a race which can't ensure the ordering is exactly correct. Node1 Node2 Node3 umount, migrate lockres to Node2 migrate finished, send migrate request to Node3 received migrate request, create a migration_mle, respond to Node2. set DLM_LOCK_RES_SETREF_INPROG and send assert master to Node3 delete migration_mle in assert_master_handler, Node3 umount without response dlm_thread purge this lockres, send drop deref message to Node2 found the flag of DLM_LOCK_RES_SETREF_INPROG is set, dispatch dlm_deref_lockres_worker to clear refmap, but in function of dlm_deref_lockres_worker, only if node in refmap it wait DLM_LOCK_RES_SETREF_INPROG to be cleared. So worker is done successfully purge lockres, send assert master response to Node1, and finish umount set Node3 in refmap, and it won't be cleared forever, thus lead to umount hung so wait until DLM_LOCK_RES_SETREF_INPROG is cleared in dlm_deref_lockres_worker. Signed-off-by: Yiwen Jiang <jiangyiwen@huawei.com> Reviewed-by: Joseph Qi <joseph.qi@huawei.com> Reviewed-by: Junxiao Bi <junxiao.bi@oracle.com> Cc: Mark Fasheh <mfasheh@suse.de> Cc: Joel Becker <jlbec@evilplan.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-01-14ocfs2: constify ocfs2_extent_tree_operations structuresJulia Lawall2-7/+7
The ocfs2_extent_tree_operations structures are never modified, so declare them as const. Done with the help of Coccinelle. Signed-off-by: Julia Lawall <Julia.Lawall@lip6.fr> Cc: Mark Fasheh <mfasheh@suse.de> Cc: Joel Becker <jlbec@evilplan.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-01-14ocfs2/dlm: fix a race between purge and migrationXue jiufei1-1/+8
We found a race between purge and migration when doing code review. Node A put lockres to purgelist before receiving the migrate message from node B which is the master. Node A call dlm_mig_lockres_handler to handle this message. dlm_mig_lockres_handler dlm_lookup_lockres >>>>>> race window, dlm_run_purge_list may run and send deref message to master, waiting the response spin_lock(&res->spinlock); res->state |= DLM_LOCK_RES_MIGRATING; spin_unlock(&res->spinlock); dlm_mig_lockres_handler returns >>>>>> dlm_thread receives the response from master for the deref message and triggers the BUG because the lockres has the state DLM_LOCK_RES_MIGRATING with the following message: dlm_purge_lockres:209 ERROR: 6633EB681FA7474A9C280A4E1A836F0F: res M0000000000000000030c0300000000 in use after deref Signed-off-by: Jiufei Xue <xuejiufei@huawei.com> Reviewed-by: Joseph Qi <joseph.qi@huawei.com> Reviewed-by: Yiwen Jiang <jiangyiwen@huawei.com> Cc: Mark Fasheh <mfasheh@suse.de> Cc: Joel Becker <jlbec@evilplan.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-01-14ocfs2: o2hb: increase unsteady iterationsJunxiao Bi1-2/+2
When run multiple xattr test of ocfs2-test on a three-nodes cluster, mount failed sometimes with the following message. o2hb: Unable to stabilize heartbeart on region D18B775E758D4D80837E8CF3D086AD4A (xvdb) Stabilize heartbeat depends on the timing order to mount ocfs2 from cluster nodes and how fast the tcp connections are established. So increase unsteady interations to leave more time for it. Signed-off-by: Junxiao Bi <junxiao.bi@oracle.com> Cc: Mark Fasheh <mfasheh@suse.de> Cc: Joel Becker <jlbec@evilplan.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-01-14ocfs2: return non-zero st_blocks for inline dataJohn Haxby1-0/+8
Some versions of tar assume that files with st_blocks == 0 do not contain any data and will skip reading them entirely. See also commit 9206c561554c ("ext4: return non-zero st_blocks for inline data"). Signed-off-by: John Haxby <john.haxby@oracle.com> Reviewed-by: Mark Fasheh <mfasheh@suse.de> Cc: Joel Becker <jlbec@evilplan.org> Acked-by: Gang He <ghe@suse.com> Reviewed-by: Junxiao Bi <junxiao.bi@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-01-14ocfs2: optimize bad declarations and redundant assignmentNorton.Zhu1-6/+2
In ocfs2_parse_options, a) it's better to declare variables(small size) outside of while loop; b) 'option' will be set by match_int, 'option = 0;' makes no sense, if match_int failed, it just goto bail and return. Signed-off-by: Norton.Zhu <norton.zhu@huawei.com> Reviewed-by: Joseph Qi <joseph.qi@huawei.com> Cc: Gang He <ghe@suse.com> Cc: Mark Fasheh <mfasheh@suse.de> Acked-by: Joel Becker <jlbec@evilplan.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-01-14logfs: fix logfs build errors and dependenciesArnd Bergmann1-1/+1
Fix build errors that happen when CONFIG_LOGFS=y and CONFIG_MTD=m: fs/built-in.o: In function `logfs_mount': super.c:(.text+0x92a6f): undefined reference to `logfs_get_sb_mtd' fs/built-in.o: In function `logfs_get_sb_bdev': (.text+0x93530): undefined reference to `logfs_get_sb_mtd' This patch avoids the error by changing the dependencies of logfs in a way that we can no longer configure logfs as built-in when the MTD core is a loadable module, while leaving the dependency to require at least one of MTD or BLOCK to be enabled. Signed-off-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Randy Dunlap <rdunlap@infradead.org> Cc: Michal Marek <mmarek@suse.cz> Cc: Peter Chen <peter.chen@freescale.com> Cc: Randy Dunlap <rdunlap@infradead.org> Cc: Joern Engel <joern@logfs.org> Cc: Prasad Joshi <prasadjoshi.linux@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-01-14fsnotify: destroy marks with call_srcu instead of dedicated threadJeff Layton1-52/+14
At the time that this code was originally written, call_srcu didn't exist, so this thread was required to ensure that we waited for that SRCU grace period to settle before finally freeing the object. It does exist now however and we can much more efficiently use call_srcu to handle this. That also allows us to potentially use srcu_barrier to ensure that they are all of the callbacks have run before proceeding. In order to conserve space, we union the rcu_head with the g_list. This will be necessary for nfsd which will allocate marks from a dedicated slabcache. We have to be able to ensure that all of the objects are destroyed before destroying the cache. That's fairly Signed-off-by: Jeff Layton <jeff.layton@primarydata.com> Cc: Eric Paris <eparis@parisplace.org> Reviewed-by: Jan Kara <jack@suse.com> Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-01-14fs/notify/inode_mark.c: use list_next_entry in fsnotify_unmount_inodesGeliang Tang1-2/+1
To make the intention clearer, use list_next_entry instead of list_entry. Signed-off-by: Geliang Tang <geliangtang@163.com> Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-01-13Merge tag 'xfs-for-linus-4.5' of ↵Linus Torvalds45-443/+852
git://git.kernel.org/pub/scm/linux/kernel/git/dgc/linux-xfs Pull xfs updates from Dave Chinner: "There's not a lot in this - the main addition is the CRC validation of the entire region of the log that the will be recovered, along with several log recovery fixes. Most of the rest is small bug fixes and cleanups. I have three bug fixes still pending, all that address recently fixed regressions that I will send to next week after they've had some time in for-next. Summary: - extensive CRC validation during log recovery - several log recovery bug fixes - Various DAX support fixes - AGFL size calculation fix - various cleanups in preparation for new functionality - project quota ENOSPC notification via netlink - tracing and debug improvements" * tag 'xfs-for-linus-4.5' of git://git.kernel.org/pub/scm/linux/kernel/git/dgc/linux-xfs: (26 commits) xfs: handle dquot buffer readahead in log recovery correctly xfs: inode recovery readahead can race with inode buffer creation xfs: eliminate committed arg from xfs_bmap_finish xfs: bmapbt checking on debug kernels too expensive xfs: add tracepoints to readpage calls xfs: debug mode log record crc error injection xfs: detect and trim torn writes during log recovery xfs: fix recursive splice read locking with DAX xfs: Don't use reserved blocks for data blocks with DAX XFS: Use a signed return type for suffix_kstrtoint() libxfs: refactor short btree block verification libxfs: pack the agfl header structure so XFS_AGFL_SIZE is correct libxfs: use a convenience variable instead of open-coding the fork xfs: fix log ticket type printing libxfs: make xfs_alloc_fix_freelist non-static xfs: make xfs_buf_ioend_async() static xfs: send warning of project quota to userspace via netlink xfs: get mp from bma->ip in xfs_bmap code xfs: print name of verifier if it fails libxfs: Optimize the loop for xfs_bitmap_empty ...
2016-01-13Merge tag 'for-f2fs-4.5' of ↵Linus Torvalds19-725/+1214
git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs Pull f2fs updates from Jaegeuk Kim: "This series adds two ioctls to control cached data and fragmented files. Most of the rest fixes missing error cases and bugs that we have not covered so far. Summary: Enhancements: - support an ioctl to execute online file defragmentation - support an ioctl to flush cached data - speed up shrinking of extent_cache entries - handle broken superblock - refector dirty inode management infra - revisit f2fs_map_blocks to handle more cases - reduce global lock coverage - add detecting user's idle time Major bug fixes: - fix data race condition on cached nat entries - fix error cases of volatile and atomic writes" * tag 'for-f2fs-4.5' of git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs: (87 commits) f2fs: should unset atomic flag after successful commit f2fs: fix wrong memory condition check f2fs: monitor the number of background checkpoint f2fs: detect idle time depending on user behavior f2fs: introduce time and interval facility f2fs: skip releasing nodes in chindless extent tree f2fs: use atomic type for node count in extent tree f2fs: recognize encrypted data in f2fs_fiemap f2fs: clean up f2fs_balance_fs f2fs: remove redundant calls f2fs: avoid unnecessary f2fs_balance_fs calls f2fs: check the page status filled from disk f2fs: introduce __get_node_page to reuse common code f2fs: check node id earily when readaheading node page f2fs: read isize while holding i_mutex in fiemap Revert "f2fs: check the node block address of newly allocated nid" f2fs: cover more area with nat_tree_lock f2fs: introduce max_file_blocks in sbi f2fs crypto: check CONFIG_F2FS_FS_XATTR for encrypted symlink f2fs: introduce zombie list for fast shrinking extent trees ...
2016-01-13Merge tag 'libnvdimm-for-4.5' of ↵Linus Torvalds1-15/+107
git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm Pull libnvdimm updates from Dan Williams: "The bulk of this has appeared in -next and independently received a build success notification from the kbuild robot. The 'for-4.5/block- dax' topic branch was rebased over the weekend to drop the "block device end-of-life" rework that Al would like to see re-implemented with a notifier, and to address bug reports against the badblocks integration. There is pending feedback against "libnvdimm: Add a poison list and export badblocks" received last week. Linda identified some localized fixups that we will handle incrementally. Summary: - Media error handling: The 'badblocks' implementation that originated in md-raid is up-levelled to a generic capability of a block device. This initial implementation is limited to being consulted in the pmem block-i/o path. Later, 'badblocks' will be consulted when creating dax mappings. - Raw block device dax: For virtualization and other cases that want large contiguous mappings of persistent memory, add the capability to dax-mmap a block device directly. - Increased /dev/mem restrictions: Add an option to treat all io-memory as IORESOURCE_EXCLUSIVE, i.e. disable /dev/mem access while a driver is actively using an address range. This behavior is controlled via the new CONFIG_IO_STRICT_DEVMEM option and can be overridden by the existing "iomem=relaxed" kernel command line option. - Miscellaneous fixes include a 'pfn'-device huge page alignment fix, block device shutdown crash fix, and other small libnvdimm fixes" * tag 'libnvdimm-for-4.5' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm: (32 commits) block: kill disk_{check|set|clear|alloc}_badblocks libnvdimm, pmem: nvdimm_read_bytes() badblocks support pmem, dax: disable dax in the presence of bad blocks pmem: fail io-requests to known bad blocks libnvdimm: convert to statically allocated badblocks libnvdimm: don't fail init for full badblocks list block, badblocks: introduce devm_init_badblocks block: clarify badblocks lifetime badblocks: rename badblocks_free to badblocks_exit libnvdimm, pmem: move definition of nvdimm_namespace_add_poison to nd.h libnvdimm: Add a poison list and export badblocks nfit_test: Enable DSMs for all test NFITs md: convert to use the generic badblocks code block: Add badblock management for gendisks badblocks: Add core badblock management code block: fix del_gendisk() vs blkdev_ioctl crash block: enable dax for raw block devices block: introduce bdev_file_inode() restrict /dev/mem to idle io memory ranges arch: consolidate CONFIG_STRICT_DEVM in lib/Kconfig.debug ...
2016-01-13Merge tag 'for-linus-20160112' of git://git.infradead.org/linux-mtdLinus Torvalds1-1/+1
Pull MTD updates from Brian Norris: "Generic MTD: - populate the MTD device 'of_node' field (and get a proper 'of_node' symlink in sysfs) This yielded some new helper functions, and changes across a variety of drivers - partitioning cleanups, to prepare for better device-tree based partitioning in the future Eliminate a lot of boilerplate for drivers that want to use OF-based partition parsing The DT bindings for this didn't settle yet, so most non-cleanup portions are deferred for a future release NAND: - embed a struct mtd_info inside struct nand_chip This is really long overdue; too many drivers have to do the same silly boilerplate to allocate and link up two "independent" structs, when in fact, everyone is assuming there is an exact 1:1 relationship between a NAND chips struct and its underlying MTD. This aids improved helpers and should make certain abstractions easier in the future. Also causes a lot of churn, helped along by some automated code transformations - add more core support for detecting (and "correcting") bitflips in erased pages; requires opt-in by drivers, but at least we kill a few bad implementations and hopefully stave off future ones - pxa3xx_nand: cleanups, a few fixes, and PM improvements - new JZ4780 NAND driver SPI NOR: - provide default erase function, for controllers that just want to send the SECTOR_ERASE command directly - fix some module auto-loading issues with device tree ("jedec,spi-nor") - error handling fixes - new Mediatek QSPI flash driver Other: - cfi: force valid geometry Kconfig (finally!) This one used to trip up randconfigs occasionally, since bots aren't deterred by big scary "advanced configuration" menus More? Probably. See the commit logs" * tag 'for-linus-20160112' of git://git.infradead.org/linux-mtd: (168 commits) mtd: jz4780_nand: replace if/else blocks with switch/case mtd: nand: jz4780: Update ecc correction error codes mtd: nandsim: use nand_get_controller_data() mtd: jz4780_nand: remove useless mtd->priv = chip assignment staging: mt29f_spinand: make use of nand_set/get_controller_data() helpers mtd: nand: make use of nand_set/get_controller_data() helpers ARM: make use of nand_set/get_controller_data() helpers mtd: nand: add helpers to access ->priv mtd: nand: jz4780: driver for NAND devices on JZ4780 SoCs mtd: nand: jz4740: remove custom 'erased check' implementation mtd: nand: diskonchip: remove custom 'erased check' implementation mtd: nand: davinci: remove custom 'erased check' implementation mtd: nand: use nand_check_erased_ecc_chunk in default ECC read functions mtd: nand: return consistent error codes in ecc.correct() implementations doc: dt: mtd: new binding for jz4780-{nand,bch} mtd: cfi_cmdset_0001: fixing memory leak and handling failed kmalloc mtd: spi-nor: wait until lock/unlock operations are ready mtd: tests: consolidate kmalloc/memset 0 call to kzalloc jffs2: use to_delayed_work mtd: nand: assign reasonable default name for NAND drivers ...
2016-01-12Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-nextLinus Torvalds1-0/+46
Pull networking updates from Davic Miller: 1) Support busy polling generically, for all NAPI drivers. From Eric Dumazet. 2) Add byte/packet counter support to nft_ct, from Floriani Westphal. 3) Add RSS/XPS support to mvneta driver, from Gregory Clement. 4) Implement IPV6_HDRINCL socket option for raw sockets, from Hannes Frederic Sowa. 5) Add support for T6 adapter to cxgb4 driver, from Hariprasad Shenai. 6) Add support for VLAN device bridging to mlxsw switch driver, from Ido Schimmel. 7) Add driver for Netronome NFP4000/NFP6000, from Jakub Kicinski. 8) Provide hwmon interface to mlxsw switch driver, from Jiri Pirko. 9) Reorganize wireless drivers into per-vendor directories just like we do for ethernet drivers. From Kalle Valo. 10) Provide a way for administrators "destroy" connected sockets via the SOCK_DESTROY socket netlink diag operation. From Lorenzo Colitti. 11) Add support to add/remove multicast routes via netlink, from Nikolay Aleksandrov. 12) Make TCP keepalive settings per-namespace, from Nikolay Borisov. 13) Add forwarding and packet duplication facilities to nf_tables, from Pablo Neira Ayuso. 14) Dead route support in MPLS, from Roopa Prabhu. 15) TSO support for thunderx chips, from Sunil Goutham. 16) Add driver for IBM's System i/p VNIC protocol, from Thomas Falcon. 17) Rationalize, consolidate, and more completely document the checksum offloading facilities in the networking stack. From Tom Herbert. 18) Support aborting an ongoing scan in mac80211/cfg80211, from Vidyullatha Kanchanapally. 19) Use per-bucket spinlock for bpf hash facility, from Tom Leiming. * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (1375 commits) net: bnxt: always return values from _bnxt_get_max_rings net: bpf: reject invalid shifts phonet: properly unshare skbs in phonet_rcv() dwc_eth_qos: Fix dma address for multi-fragment skbs phy: remove an unneeded condition mdio: remove an unneed condition mdio_bus: NULL dereference on allocation error net: Fix typo in netdev_intersect_features net: freescale: mac-fec: Fix build error from phy_device API change net: freescale: ucc_geth: Fix build error from phy_device API change bonding: Prevent IPv6 link local address on enslaved devices IB/mlx5: Add flow steering support net/mlx5_core: Export flow steering API net/mlx5_core: Make ipv4/ipv6 location more clear net/mlx5_core: Enable flow steering support for the IB driver net/mlx5_core: Initialize namespaces only when supported by device net/mlx5_core: Set priority attributes net/mlx5_core: Connect flow tables net/mlx5_core: Introduce modify flow table command net/mlx5_core: Managing root flow table ...
2016-01-12Merge tag 'upstream-4.5-rc1' of git://git.infradead.org/linux-ubifsLinus Torvalds2-2/+8
Pull UBI/UBIFS updates from Richard Weinberger: "This contains three changes - two cleanups and one UBI wear leveling improvement by Sebastian Siewior" * tag 'upstream-4.5-rc1' of git://git.infradead.org/linux-ubifs: ubifs: Use XATTR_*_PREFIX_LEN UBIFS: add a comment in key.h for unused parameter mtd: ubi: wl: avoid erasing a PEB which is empty
2016-01-12Merge tag 'configfs-for-linus' of git://git.infradead.org/users/hch/configfsLinus Torvalds4-13/+276
Pull configfs updates from Christoph Hellwig: "I'm assisting Joel as co-maintainer and patch monkey now, and you will see pull reuquests from me for a while. Besides the MAINTAINERS update there is just a single change, which adds support for binary attributes to configfs, which are very similar to the sysfs binary attributes. Thanks to Pantelis Antoniou! You will see another actually bigger set of configfs changes in the SCSI target pull from Nic - those were merged before this new tree even existed" * tag 'configfs-for-linus' of git://git.infradead.org/users/hch/configfs: configfs: add myself as co-maintainer, updated git tree configfs: implement binary attributes
2016-01-12Merge tag 'gfs2-merge-window' of ↵Linus Torvalds21-241/+446
git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2 Pull GFS2 updates from Bob Peterson: "Here is a list of patches we've accumulated for GFS2 for the current upstream merge window. Last window's set was short, but I warned that this one would be bigger, and so it is. We've got 19 patches: - A patch from Abhi Das to propagate the GFS2_DIF_SYSTEM bit so that newly added journals don't get flagged, deleted, and recreated by fsck.gfs2. - Two patches from Andreas Gruenbacher to improve GFS2 performance where extended attributes are involved. - A patch from Andy Price to fix a suspicious rcu dereference error. - Two patches from Ben Marzinski that rework how GFS2's NFS cookies are managed. This fixes readdir problems with nfs-over-gfs2. - A patch from Ben Marzinski that fixes a race in unmounting GFS2. - A set of four patches from me to move the resource group reservations inside the gfs2 inode to improve performance and fix a bug whereby get_write_access improperly prevented some operations like chown. - A patch from me to spinlock-protect the setting of system statfs file data. This was causing small discrepancies between df and du. - A patch from me to reintroduce a timeout while clearing glocks which was accidentally dropped some time ago. - A patch from me to wait for iopen glock dequeues in order to improve deleting of files that were unlinked from a different cluster node. - A patch from me to ensure metadata address spaces get truncated when an inode is evicted. - A patch from me to fix a bug in which a memory leak could occur in some error cases when inodes were trying to be created. - A patch to consistently use iopen glocks to transition from the unlinked state to the deleted state. - A patch to fix a glock reference count error when inode creation fails. - A patch from Junxiao Bi to fix an flock panic. - A patch from Markus Elfring that removes an unnecessary if" * tag 'gfs2-merge-window' of git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2: gfs2: fix flock panic issue GFS2: Don't do glock put on when inode creation fails GFS2: Always use iopen glock for gl_deletes GFS2: Release iopen glock in gfs2_create_inode error cases GFS2: Truncate address space mapping when deleting an inode GFS2: Wait for iopen glock dequeues gfs2: clear journal live bit in gfs2_log_flush gfs2: change gfs2 readdir cookie gfs2: keep offset when splitting dir leaf blocks GFS2: Reintroduce a timeout in function gfs2_gl_hash_clear GFS2: Update master statfs buffer with sd_statfs_spin locked GFS2: Reduce size of incore inode GFS2: Make rgrp reservations part of the gfs2_inode structure GFS2: Extract quota data from reservations structure (revert 5407e24) gfs2: Extended attribute readahead optimization gfs2: Extended attribute readahead GFS2: Use rht_for_each_entry_rcu in glock_hash_walk GFS2: Delete an unnecessary check before the function call "iput" gfs2: Automatically set GFS2_DIF_SYSTEM flag on system files
2016-01-12Merge branch 'work.misc' of ↵Linus Torvalds61-384/+379
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull misc vfs updates from Al Viro: "All kinds of stuff. That probably should've been 5 or 6 separate branches, but by the time I'd realized how large and mixed that bag had become it had been too close to -final to play with rebasing. Some fs/namei.c cleanups there, memdup_user_nul() introduction and switching open-coded instances, burying long-dead code, whack-a-mole of various kinds, several new helpers for ->llseek(), assorted cleanups and fixes from various people, etc. One piece probably deserves special mention - Neil's lookup_one_len_unlocked(). Similar to lookup_one_len(), but gets called without ->i_mutex and tries to avoid ever taking it. That, of course, means that it's not useful for any directory modifications, but things like getting inode attributes in nfds readdirplus are fine with that. I really should've asked for moratorium on lookup-related changes this cycle, but since I hadn't done that early enough... I *am* asking for that for the coming cycle, though - I'm going to try and get conversion of i_mutex to rwsem with ->lookup() done under lock taken shared. There will be a patch closer to the end of the window, along the lines of the one Linus had posted last May - mechanical conversion of ->i_mutex accesses to inode_lock()/inode_unlock()/inode_trylock()/ inode_is_locked()/inode_lock_nested(). To quote Linus back then: ----- | This is an automated patch using | | sed 's/mutex_lock(&\(.*\)->i_mutex)/inode_lock(\1)/' | sed 's/mutex_unlock(&\(.*\)->i_mutex)/inode_unlock(\1)/' | sed 's/mutex_lock_nested(&\(.*\)->i_mutex,[ ]*I_MUTEX_\([A-Z0-9_]*\))/inode_lock_nested(\1, I_MUTEX_\2)/' | sed 's/mutex_is_locked(&\(.*\)->i_mutex)/inode_is_locked(\1)/' | sed 's/mutex_trylock(&\(.*\)->i_mutex)/inode_trylock(\1)/' | | with a very few manual fixups ----- I'm going to send that once the ->i_mutex-affecting stuff in -next gets mostly merged (or when Linus says he's about to stop taking merges)" * 'work.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (63 commits) nfsd: don't hold i_mutex over userspace upcalls fs:affs:Replace time_t with time64_t fs/9p: use fscache mutex rather than spinlock proc: add a reschedule point in proc_readfd_common() logfs: constify logfs_block_ops structures fcntl: allow to set O_DIRECT flag on pipe fs: __generic_file_splice_read retry lookup on AOP_TRUNCATED_PAGE fs: xattr: Use kvfree() [s390] page_to_phys() always returns a multiple of PAGE_SIZE nbd: use ->compat_ioctl() fs: use block_device name vsprintf helper lib/vsprintf: add %*pg format specifier fs: use gendisk->disk_name where possible poll: plug an unused argument to do_poll amdkfd: don't open-code memdup_user() cdrom: don't open-code memdup_user() rsxx: don't open-code memdup_user() mtip32xx: don't open-code memdup_user() [um] mconsole: don't open-code memdup_user_nul() [um] hostaudio: don't open-code memdup_user() ...
2016-01-12Merge branch 'work.copy_file_range' of ↵Linus Torvalds18-324/+657
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull vfs copy_file_range updates from Al Viro: "Several series around copy_file_range/CLONE" * 'work.copy_file_range' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: btrfs: use new dedupe data function pointer vfs: hoist the btrfs deduplication ioctl to the vfs vfs: wire up compat ioctl for CLONE/CLONE_RANGE cifs: avoid unused variable and label nfsd: implement the NFSv4.2 CLONE operation nfsd: Pass filehandle to nfs4_preprocess_stateid_op() vfs: pull btrfs clone API to vfs layer locks: new locks_mandatory_area calling convention vfs: Add vfs_copy_file_range() support for pagecache copies btrfs: add .copy_file_range file operation x86: add sys_copy_file_range to syscall tables vfs: add copy_file_range syscall and vfs helper
2016-01-12Merge tag 'locks-v4.5-1' of git://git.samba.org/jlayton/linuxLinus Torvalds4-47/+113
Pull file locking updates from Jeff Layton: "File locking related changes for v4.5 (pile #1) Highlights: - new Kconfig option to allow disabling mandatory locking (which is racy anyway) - new tracepoints for setlk and close codepaths - fix for a long-standing bug in code that handles races between setting a POSIX lock and close()" * tag 'locks-v4.5-1' of git://git.samba.org/jlayton/linux: locks: rename __posix_lock_file to posix_lock_inode locks: prink more detail when there are leaked locks locks: pass inode pointer to locks_free_lock_context locks: sprinkle some tracepoints around the file locking code locks: don't check for race with close when setting OFD lock locks: fix unlock when fcntl_setlk races with a close fs: make locks.c explicitly non-modular locks: use list_first_entry_or_null() locks: Don't allow mounts in user namespaces to enable mandatory locking locks: Allow disabling mandatory locking at compile time
2016-01-12Merge branch 'for-linus-4.5-rc1' of ↵Linus Torvalds1-3/+1
git://git.kernel.org/pub/scm/linux/kernel/git/rw/uml Pull UML updates from Richard Weinberger: "This contains beside of random fixes/cleanups two bigger changes: - seccomp support by Mickaël Salaün - IRQ rework by Anton Ivanov" * 'for-linus-4.5-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rw/uml: um: Use race-free temporary file creation um: Do not set unsecure permission for temporary file um: Fix build error and kconfig for i386 um: Add seccomp support um: Add full asm/syscall.h support selftests/seccomp: Remove the need for HAVE_ARCH_TRACEHOOK um: Fix ptrace GETREGS/SETREGS bugs um: link with -lpthread um: Update UBD to use pread/pwrite family of functions um: Do not change hard IRQ flags in soft IRQ processing um: Prevent IRQ handler reentrancy uml: flush stdout before forking uml: fix hostfs mknod()
2016-01-11Merge branch 'x86-cleanups-for-linus' of ↵Linus Torvalds1-1/+1
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 cleanups from Ingo Molnar: "The main changes in this cycle were: - code patching and cpu_has cleanups (Borislav Petkov) - paravirt cleanups (Juergen Gross) - TSC cleanup (Thomas Gleixner) - ptrace cleanup (Chen Gang)" * 'x86-cleanups-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: arch/x86/kernel/ptrace.c: Remove unused arg_offs_table x86/mm: Align macro defines x86/cpu: Provide a config option to disable static_cpu_has x86/cpufeature: Remove unused and seldomly used cpu_has_xx macros x86/cpufeature: Cleanup get_cpu_cap() x86/cpufeature: Move some of the scattered feature bits to x86_capability x86/paravirt: Remove paravirt ops pmd_update[_defer] and pte_update_defer x86/paravirt: Remove unused pv_apic_ops structure x86/tsc: Remove unused tsc_pre_init() hook x86: Remove unused function cpu_has_ht_siblings() x86/paravirt: Kill some unused patching functions
2016-01-11f2fs: should unset atomic flag after successful commitJaegeuk Kim1-1/+3
If there is an error during commit, we should keep the flag in order to abort it. Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-01-11f2fs: fix wrong memory condition checkJaegeuk Kim1-2/+2
This patch fixes wrong decision for avaliable_free_memory. The return valus is already set as false, so we should consider true condition below only. Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-01-11f2fs: monitor the number of background checkpointJaegeuk Kim3-2/+6
This patch adds to show the number of background checkpoint. Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-01-11f2fs: detect idle time depending on user behaviorJaegeuk Kim9-10/+38
This patch adds last time that user requested filesystem operations. This information is used to detect whether system is idle or not later. Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-01-11f2fs: introduce time and interval facilityJaegeuk Kim4-7/+25
This patch adds time and interval arrays to store some timing variables. Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-01-11Merge branch 'work.xattr' of ↵Linus Torvalds44-948/+457
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull vfs xattr updates from Al Viro: "Andreas' xattr cleanup series. It's a followup to his xattr work that went in last cycle; -0.5KLoC" * 'work.xattr' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: xattr handlers: Simplify list operation ocfs2: Replace list xattr handler operations nfs: Move call to security_inode_listsecurity into nfs_listxattr xfs: Change how listxattr generates synthetic attributes tmpfs: listxattr should include POSIX ACL xattrs tmpfs: Use xattr handler infrastructure btrfs: Use xattr handler infrastructure vfs: Distinguish between full xattr names and proper prefixes posix acls: Remove duplicate xattr name definitions gfs2: Remove gfs2_xattr_acl_chmod vfs: Remove vfs_xattr_cmp
2016-01-11Merge branch 'work.symlinks' of ↵Linus Torvalds87-417/+456
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull vfs RCU symlink updates from Al Viro: "Replacement of ->follow_link/->put_link, allowing to stay in RCU mode even if the symlink is not an embedded one. No changes since the mailbomb on Jan 1" * 'work.symlinks' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: switch ->get_link() to delayed_call, kill ->put_link() kill free_page_put_link() teach nfs_get_link() to work in RCU mode teach proc_self_get_link()/proc_thread_self_get_link() to work in RCU mode teach shmem_get_link() to work in RCU mode teach page_get_link() to work in RCU mode replace ->follow_link() with new method that could stay in RCU mode don't put symlink bodies in pagecache into highmem namei: page_getlink() and page_follow_link_light() are the same thing ufs: get rid of ->setattr() for symlinks udf: don't duplicate page_symlink_inode_operations logfs: don't duplicate page_symlink_inode_operations switch befs long symlinks to page_symlink_operations
2016-01-12Merge branch 'xfs-misc-fixes-for-4.5-2' into for-nextDave Chinner18-234/+159
2016-01-12xfs: handle dquot buffer readahead in log recovery correctlyDave Chinner5-9/+41
When we do dquot readahead in log recovery, we do not use a verifier as the underlying buffer may not have dquots in it. e.g. the allocation operation hasn't yet been replayed. Hence we do not want to fail recovery because we detect an operation to be replayed has not been run yet. This problem was addressed for inodes in commit d891400 ("xfs: inode buffers may not be valid during recovery readahead") but the problem was not recognised to exist for dquots and their buffers as the dquot readahead did not have a verifier. The result of not using a verifier is that when the buffer is then next read to replay a dquot modification, the dquot buffer verifier will only be attached to the buffer if *readahead is not complete*. Hence we can read the buffer, replay the dquot changes and then add it to the delwri submission list without it having a verifier attached to it. This then generates warnings in xfs_buf_ioapply(), which catches and warns about this case. Fix this and make it handle the same readahead verifier error cases as for inode buffers by adding a new readahead verifier that has a write operation as well as a read operation that marks the buffer as not done if any corruption is detected. Also make sure we don't run readahead if the dquot buffer has been marked as cancelled by recovery. This will result in readahead either succeeding and the buffer having a valid write verifier, or readahead failing and the buffer state requiring the subsequent read to resubmit the IO with the new verifier. In either case, this will result in the buffer always ending up with a valid write verifier on it. Note: we also need to fix the inode buffer readahead error handling to mark the buffer with EIO. Brian noticed the code I copied from there wrong during review, so fix it at the same time. Add comments linking the two functions that handle readahead verifier errors together so we don't forget this behavioural link in future. cc: <stable@vger.kernel.org> # 3.12 - current Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-01-12xfs: inode recovery readahead can race with inode buffer creationDave Chinner2-5/+14
When we do inode readahead in log recovery, we do can do the readahead before we've replayed the icreate transaction that stamps the buffer with inode cores. The inode readahead verifier catches this and marks the buffer as !done to indicate that it doesn't yet contain valid inodes. In adding buffer error notification (i.e. setting b_error = -EIO at the same time as as we clear the done flag) to such a readahead verifier failure, we can then get subsequent inode recovery failing with this error: XFS (dm-0): metadata I/O error: block 0xa00060 ("xlog_recover_do..(read#2)") error 5 numblks 32 This occurs when readahead completion races with icreate item replay such as: inode readahead find buffer lock buffer submit RA io .... icreate recovery xfs_trans_get_buffer find buffer lock buffer <blocks on RA completion> ..... <ra completion> fails verifier clear XBF_DONE set bp->b_error = -EIO release and unlock buffer <icreate gains lock> icreate initialises buffer marks buffer as done adds buffer to delayed write queue releases buffer At this point, we have an initialised inode buffer that is up to date but has an -EIO state registered against it. When we finally get to recovering an inode in that buffer: inode item recovery xfs_trans_read_buffer find buffer lock buffer sees XBF_DONE is set, returns buffer sees bp->b_error is set fail log recovery! Essentially, we need xfs_trans_get_buf_map() to clear the error status of the buffer when doing a lookup. This function returns uninitialised buffers, so the buffer returned can not be in an error state and none of the code that uses this function expects b_error to be set on return. Indeed, there is an ASSERT(!bp->b_error); in the transaction case in xfs_trans_get_buf_map() that would have caught this if log recovery used transactions.... This patch firstly changes the inode readahead failure to set -EIO on the buffer, and secondly changes xfs_buf_get_map() to never return a buffer with an error state set so this first change doesn't cause unexpected log recovery failures. cc: <stable@vger.kernel.org> # 3.12 - current Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-01-11xfs: eliminate committed arg from xfs_bmap_finishEric Sandeen10-218/+68
Calls to xfs_bmap_finish() and xfs_trans_ijoin(), and the associated comments were replicated several times across the attribute code, all dealing with what to do if the transaction was or wasn't committed. And in that replicated code, an ASSERT() test of an uninitialized variable occurs in several locations: error = xfs_attr_thing(&args); if (!error) { error = xfs_bmap_finish(&args.trans, args.flist, &committed); } if (error) { ASSERT(committed); If the first xfs_attr_thing() failed, we'd skip the xfs_bmap_finish, never set "committed", and then test it in the ASSERT. Fix this up by moving the committed state internal to xfs_bmap_finish, and add a new inode argument. If an inode is passed in, it is passed through to __xfs_trans_roll() and joined to the transaction there if the transaction was committed. xfs_qm_dqalloc() was a little unique in that it called bjoin rather than ijoin, but as Dave points out we can detect the committed state but checking whether (*tpp != tp). Addresses-Coverity-Id: 102360 Addresses-Coverity-Id: 102361 Addresses-Coverity-Id: 102363 Addresses-Coverity-Id: 102364 Signed-off-by: Eric Sandeen <sandeen@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-01-10uml: fix hostfs mknod()Vegard Nossum1-3/+1
An inverted return value check in hostfs_mknod() caused the function to return success after handling it as an error (and cleaning up). It resulted in the following segfault when trying to bind() a named unix socket: Pid: 198, comm: a.out Not tainted 4.4.0-rc4 RIP: 0033:[<0000000061077df6>] RSP: 00000000daae5d60 EFLAGS: 00010202 RAX: 0000000000000000 RBX: 000000006092a460 RCX: 00000000dfc54208 RDX: 0000000061073ef1 RSI: 0000000000000070 RDI: 00000000e027d600 RBP: 00000000daae5de0 R08: 00000000da980ac0 R09: 0000000000000000 R10: 0000000000000003 R11: 00007fb1ae08f72a R12: 0000000000000000 R13: 000000006092a460 R14: 00000000daaa97c0 R15: 00000000daaa9a88 Kernel panic - not syncing: Kernel mode fault at addr 0x40, ip 0x61077df6 CPU: 0 PID: 198 Comm: a.out Not tainted 4.4.0-rc4 #1 Stack: e027d620 dfc54208 0000006f da981398 61bee000 0000c1ed daae5de0 0000006e e027d620 dfcd4208 00000005 6092a460 Call Trace: [<60dedc67>] SyS_bind+0xf7/0x110 [<600587be>] handle_syscall+0x7e/0x80 [<60066ad7>] userspace+0x3e7/0x4e0 [<6006321f>] ? save_registers+0x1f/0x40 [<6006c88e>] ? arch_prctl+0x1be/0x1f0 [<60054985>] fork_handler+0x85/0x90 Let's also get rid of the "cosmic ray protection" while we're at it. Fixes: e9193059b1b3 "hostfs: fix races in dentry_name() and inode_name()" Signed-off-by: Vegard Nossum <vegard.nossum@oracle.com> Cc: Jeff Dike <jdike@addtoit.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: stable@vger.kernel.org Signed-off-by: Richard Weinberger <richard@nod.at>
2016-01-10ubifs: Use XATTR_*_PREFIX_LENRichard Weinberger1-2/+2
...instead of open coding it. Signed-off-by: Richard Weinberger <richard@nod.at>
2016-01-10UBIFS: add a comment in key.h for unused parameterDongsheng Yang1-0/+6
Add a comment in key.h to explain why we keep an unused parameter in key helpers. Signed-off-by: Dongsheng Yang <yangds.fnst@cn.fujitsu.com> Signed-off-by: Richard Weinberger <richard@nod.at>
2016-01-09block: enable dax for raw block devicesDan Williams1-8/+95
If an application wants exclusive access to all of the persistent memory provided by an NVDIMM namespace it can use this raw-block-dax facility to forgo establishing a filesystem. This capability is targeted primarily to hypervisors wanting to provision persistent memory for guests. It can be disabled / enabled dynamically via the new BLKDAXSET ioctl. Cc: Jeff Moyer <jmoyer@redhat.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Dave Chinner <david@fromorbit.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Ross Zwisler <ross.zwisler@linux.intel.com> Reported-by: kbuild test robot <fengguang.wu@intel.com> Reviewed-by: Jan Kara <jack@suse.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2016-01-09block: introduce bdev_file_inode()Dan Williams1-7/+12
Similar to the file_inode() helper, provide a helper to lookup the inode for a raw block device itself. Cc: Al Viro <viro@zeniv.linux.org.uk> Suggested-by: Jan Kara <jack@suse.cz> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Jeff Moyer <jmoyer@redhat.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2016-01-09nfsd: don't hold i_mutex over userspace upcallsNeilBrown4-19/+85
We need information about exports when crossing mountpoints during lookup or NFSv4 readdir. If we don't already have that information cached, we may have to ask (and wait for) rpc.mountd. In both cases we currently hold the i_mutex on the parent of the directory we're asking rpc.mountd about. We've seen situations where rpc.mountd performs some operation on that directory that tries to take the i_mutex again, resulting in deadlock. With some care, we may be able to avoid that in rpc.mountd. But it seems better just to avoid holding a mutex while waiting on userspace. It appears that lookup_one_len is pretty much the only operation that needs the i_mutex. So we could just drop the i_mutex elsewhere and do something like mutex_lock() lookup_one_len() mutex_unlock() In many cases though the lookup would have been cached and not required the i_mutex, so it's more efficient to create a lookup_one_len() variant that only takes the i_mutex when necessary. Signed-off-by: NeilBrown <neilb@suse.de> Signed-off-by: J. Bruce Fields <bfields@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>