summaryrefslogtreecommitdiffstats
path: root/fs
AgeCommit message (Collapse)AuthorFilesLines
2014-12-18Merge branch 'akpm' (patches from Andrew)Linus Torvalds11-52/+133
Merge misc patches from Andrew Morton: "A few stragglers" * emailed patches from Andrew Morton <akpm@linux-foundation.org>: tools/testing/selftests/Makefile: alphasort the TARGETS list mm/zsmalloc: adjust order of functions ocfs2: fix journal commit deadlock ocfs2/dlm: fix race between dispatched_work and dlm_lockres_grab_inflight_worker ocfs2: reflink: fix slow unlink for refcounted file mm/memory.c:do_shared_fault(): add comment .mailmap: Santosh Shilimkar has moved .mailmap: update akpm@osdl.org lib/show_mem.c: add cma reserved information fs/proc/meminfo.c: include cma info in proc/meminfo mm: cma: split cma-reserved in dmesg log hfsplus: fix longname handling mm/mempolicy.c: remove unnecessary is_valid_nodemask()
2014-12-18ocfs2: fix journal commit deadlockJunxiao Bi1-2/+14
For buffer write, page lock will be got in write_begin and released in write_end, in ocfs2_write_end_nolock(), before it unlock the page in ocfs2_free_write_ctxt(), it calls ocfs2_run_deallocs(), this will ask for the read lock of journal->j_trans_barrier. Holding page lock and ask for journal->j_trans_barrier breaks the locking order. This will cause a deadlock with journal commit threads, ocfs2cmt will get write lock of journal->j_trans_barrier first, then it wakes up kjournald2 to do the commit work, at last it waits until done. To commit journal, kjournald2 needs flushing data first, it needs get the cache page lock. Since some ocfs2 cluster locks are holding by write process, this deadlock may hung the whole cluster. unlock pages before ocfs2_run_deallocs() can fix the locking order, also put unlock before ocfs2_commit_trans() to make page lock is unlocked before j_trans_barrier to preserve unlocking order. Signed-off-by: Junxiao Bi <junxiao.bi@oracle.com> Reviewed-by: Wengang Wang <wen.gang.wang@oracle.com> Cc: <stable@vger.kernel.org> Reviewed-by: Mark Fasheh <mfasheh@suse.de> Cc: Joel Becker <jlbec@evilplan.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-12-18ocfs2/dlm: fix race between dispatched_work and dlm_lockres_grab_inflight_workerJoseph Qi1-9/+3
Commit ac4fef4d23ed ("ocfs2/dlm: do not purge lockres that is queued for assert master") may have the following possible race case: dlm_dispatch_assert_master dlm_wq ======================================================================== queue_work(dlm->quedlm_worker, &dlm->dispatched_work); dispatch work, dlm_lockres_drop_inflight_worker *BUG_ON(res->inflight_assert_workers == 0)* dlm_lockres_grab_inflight_worker inflight_assert_workers++ So ensure inflight_assert_workers to be increased first. Signed-off-by: Joseph Qi <joseph.qi@huawei.com> Signed-off-by: Xue jiufei <xuejiufei@huawei.com> Cc: Joel Becker <jlbec@evilplan.org> Reviewed-by: Mark Fasheh <mfasheh@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-12-18ocfs2: reflink: fix slow unlink for refcounted fileJunxiao Bi4-10/+24
When running ocfs2 test suite multiple nodes reflink stress test, for a 4 nodes cluster, every unlink() for refcounted file needs about 700s. The slow unlink is caused by the contention of refcount tree lock since all nodes are unlink files using the same refcount tree. When the unlinking file have many extents(over 1600 in our test), most of the extents has refcounted flag set. In ocfs2_commit_truncate(), it will execute the following call trace for every extents. This means it needs get and released refcount tree lock about 1600 times. And when several nodes are do this at the same time, the performance will be very low. ocfs2_remove_btree_range() -- ocfs2_lock_refcount_tree() ---- ocfs2_refcount_lock() ------ __ocfs2_cluster_lock() ocfs2_refcount_lock() is costly, move it to ocfs2_commit_truncate() to do lock/unlock once can improve a lot performance. Signed-off-by: Junxiao Bi <junxiao.bi@oracle.com> Cc: Wengang <wen.gang.wang@oracle.com> Reviewed-by: Mark Fasheh <mfasheh@suse.de> Cc: Joel Becker <jlbec@evilplan.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-12-18fs/proc/meminfo.c: include cma info in proc/meminfoPintu Kumar1-2/+13
This patch include CMA info (CMATotal, CMAFree) in /proc/meminfo. Currently, in a CMA enabled system, if somebody wants to know the total CMA size declared, there is no way to tell, other than the dmesg or /var/log/messages logs. With this patch we are showing the CMA info as part of meminfo, so that it can be determined at any point of time. This will be populated only when CMA is enabled. Below is the sample output from a ARM based device with RAM:512MB and CMA:16MB. MemTotal: 471172 kB MemFree: 111712 kB MemAvailable: 271172 kB . . . CmaTotal: 16384 kB CmaFree: 6144 kB This patch also fix below checkpatch errors that were found during these changes. ERROR: space required after that ',' (ctx:ExV) 199: FILE: fs/proc/meminfo.c:199: + ,atomic_long_read(&num_poisoned_pages) << (PAGE_SHIFT - 10) ^ ERROR: space required after that ',' (ctx:ExV) 202: FILE: fs/proc/meminfo.c:202: + ,K(global_page_state(NR_ANON_TRANSPARENT_HUGEPAGES) * ^ ERROR: space required after that ',' (ctx:ExV) 206: FILE: fs/proc/meminfo.c:206: + ,K(totalcma_pages) ^ total: 3 errors, 0 warnings, 2 checks, 236 lines checked Signed-off-by: Pintu Kumar <pintu.k@samsung.com> Signed-off-by: Vishnu Pratap Singh <vishnu.ps@samsung.com> Acked-by: Michal Nazarewicz <mina86@mina86.com> Cc: Rafael Aquini <aquini@redhat.com> Cc: Jerome Marchand <jmarchan@redhat.com> Cc: Marek Szyprowski <m.szyprowski@samsung.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-12-18hfsplus: fix longname handlingSougata Santra4-29/+79
Longname is not correctly handled by hfsplus driver. If an attempt to create a longname(>255) file/directory is made, it succeeds by creating a file/directory with HFSPLUS_MAX_STRLEN and incorrect catalog key. Thus leaving the volume in an inconsistent state. This patch fixes this issue. Although lookup is always called first to create a negative entry, so just doing a check in lookup would probably fix this issue. I choose to propagate error to other iops as well. Please NOTE: I have factored out hfsplus_cat_build_key_with_cnid from hfsplus_cat_build_key, to avoid unncessary branching. Thanks a lot. TEST: ------ dir="TEST_DIR" cdir=`pwd` name255="_123456789_123456789_123456789_123456789_123456789_123456789\ _123456789_123456789_123456789_123456789_123456789_123456789_123456789\ _123456789_123456789_123456789_123456789_123456789_123456789_123456789\ _123456789_123456789_123456789_123456789_123456789_1234" name256="${name255}5" mkdir $dir cd $dir touch $name255 rm -f $name255 touch $name256 ls -la cd $cdir rm -rf $dir RESULT: ------- [sougata@ultrabook tmp]$ cdir=`pwd` [sougata@ultrabook tmp]$ name255="_123456789_123456789_123456789_123456789_123456789_123456789\ > _123456789_123456789_123456789_123456789_123456789_123456789_123456789\ > _123456789_123456789_123456789_123456789_123456789_123456789_123456789\ > _123456789_123456789_123456789_123456789_123456789_1234" [sougata@ultrabook tmp]$ name256="${name255}5" [sougata@ultrabook tmp]$ [sougata@ultrabook tmp]$ mkdir $dir [sougata@ultrabook tmp]$ cd $dir [sougata@ultrabook TEST_DIR]$ touch $name255 [sougata@ultrabook TEST_DIR]$ rm -f $name255 [sougata@ultrabook TEST_DIR]$ touch $name256 [sougata@ultrabook TEST_DIR]$ ls -la ls: cannot access _123456789_123456789_123456789_123456789_123456789_123456789_123456789_123456789_123456789_123456789_123456789_123456789_123456789_123456789_123456789_123456789_123456789_123456789_123456789_123456789_123456789_123456789_123456789_123456789_123456789_1234: No such file or directory total 0 drwxrwxr-x 1 sougata sougata 3 Feb 20 19:56 . drwxrwxrwx 1 root root 6 Feb 20 19:56 .. -????????? ? ? ? ? ? _123456789_123456789_123456789_123456789_123456789_123456789_123456789_123456789_123456789_123456789_123456789_123456789_123456789_123456789_123456789_123456789_123456789_123456789_123456789_123456789_123456789_123456789_123456789_123456789_123456789_1234 [sougata@ultrabook TEST_DIR]$ cd $cdir [sougata@ultrabook tmp]$ rm -rf $dir rm: cannot remove `TEST_DIR': Directory not empty -ENAMETOOLONG returned from hfsplus_asc2uni was not propaged to iops. This allowed hfsplus to create files/directories with HFSPLUS_MAX_STRLEN and incorrect keys, leaving the FS in an inconsistent state. This patch fixes this issue. Signed-off-by: Sougata Santra <sougata@tuxera.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Cc: Vyacheslav Dubeyko <slava@dubeyko.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-12-18mnt: Fix a memory stomp in umountEric W. Biederman1-0/+2
While reviewing the code of umount_tree I realized that when we append to a preexisting unmounted list we do not change pprev of the former first item in the list. Which means later in namespace_unlock hlist_del_init(&mnt->mnt_hash) on the former first item of the list will stomp unmounted.first leaving it set to some random mount point which we are likely to free soon. This isn't likely to hit, but if it does I don't know how anyone could track it down. [ This happened because we don't have all the same operations for hlist's as we do for normal doubly-linked lists. In particular, list_splice() is easy on our standard doubly-linked lists, while hlist_splice() doesn't exist and needs both start/end entries of the hlist. And commit 38129a13e6e7 incorrectly open-coded that missing hlist_splice(). We should think about making these kinds of "mindless" conversions easier to get right by adding the missing hlist helpers - Linus ] Fixes: 38129a13e6e71f666e0468e99fdd932a687b4d7e switch mnt_hash to hlist Cc: stable@vger.kernel.org Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-12-17Ceph: remove left-over reject fileLinus Torvalds1-10/+0
Neither Sage nor I noticed that Zheng Yan had mistakenly committed fs/ceph/super.h.rej as part of commit 31c542a199d7 ("ceph: add inline data to pagecache"). Remove it. Requested-by: Yan, Zheng <ukernel@gmail.com> Cc: Sage Weil <sweil@redhat.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-12-17Merge branch 'for-linus' of ↵Linus Torvalds13-116/+712
git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client Pull ceph updates from Sage Weil: "The big item here is support for inline data for CephFS and for message signatures from Zheng. There are also several bug fixes, including interrupted flock request handling, 0-length xattrs, mksnap, cached readdir results, and a message version compat field. Finally there are several cleanups from Ilya, Dan, and Markus. Note that there is another series coming soon that fixes some bugs in the RBD 'lingering' requests, but it isn't quite ready yet" * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client: (27 commits) ceph: fix setting empty extended attribute ceph: fix mksnap crash ceph: do_sync is never initialized libceph: fixup includes in pagelist.h ceph: support inline data feature ceph: flush inline version ceph: convert inline data to normal data before data write ceph: sync read inline data ceph: fetch inline data when getting Fcr cap refs ceph: use getattr request to fetch inline data ceph: add inline data to pagecache ceph: parse inline data in MClientReply and MClientCaps libceph: specify position of extent operation libceph: add CREATE osd operation support libceph: add SETXATTR/CMPXATTR osd operations support rbd: don't treat CEPH_OSD_OP_DELETE as extent op ceph: remove unused stringification macros libceph: require cephx message signature by default ceph: introduce global empty snap context ceph: message versioning fixes ...
2014-12-17Merge branch 'for-linus' of ↵Linus Torvalds3-3/+69
git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace Pull user namespace related fixes from Eric Biederman: "As these are bug fixes almost all of thes changes are marked for backporting to stable. The first change (implicitly adding MNT_NODEV on remount) addresses a regression that was created when security issues with unprivileged remount were closed. I go on to update the remount test to make it easy to detect if this issue reoccurs. Then there are a handful of mount and umount related fixes. Then half of the changes deal with the a recently discovered design bug in the permission checks of gid_map. Unix since the beginning has allowed setting group permissions on files to less than the user and other permissions (aka ---rwx---rwx). As the unix permission checks stop as soon as a group matches, and setgroups allows setting groups that can not later be dropped, results in a situtation where it is possible to legitimately use a group to assign fewer privileges to a process. Which means dropping a group can increase a processes privileges. The fix I have adopted is that gid_map is now no longer writable without privilege unless the new file /proc/self/setgroups has been set to permanently disable setgroups. The bulk of user namespace using applications even the applications using applications using user namespaces without privilege remain unaffected by this change. Unfortunately this ix breaks a couple user space applications, that were relying on the problematic behavior (one of which was tools/selftests/mount/unprivileged-remount-test.c). To hopefully prevent needing a regression fix on top of my security fix I rounded folks who work with the container implementations mostly like to be affected and encouraged them to test the changes. > So far nothing broke on my libvirt-lxc test bed. :-) > Tested with openSUSE 13.2 and libvirt 1.2.9. > Tested-by: Richard Weinberger <richard@nod.at> > Tested on Fedora20 with libvirt 1.2.11, works fine. > Tested-by: Chen Hanxiao <chenhanxiao@cn.fujitsu.com> > Ok, thanks - yes, unprivileged lxc is working fine with your kernels. > Just to be sure I was testing the right thing I also tested using > my unprivileged nsexec testcases, and they failed on setgroup/setgid > as now expected, and succeeded there without your patches. > Tested-by: Serge Hallyn <serge.hallyn@ubuntu.com> > I tested this with Sandstorm. It breaks as is and it works if I add > the setgroups thing. > Tested-by: Andy Lutomirski <luto@amacapital.net> # breaks things as designed :(" * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace: userns: Unbreak the unprivileged remount tests userns; Correct the comment in map_write userns: Allow setting gid_maps without privilege when setgroups is disabled userns: Add a knob to disable setgroups on a per user namespace basis userns: Rename id_map_mutex to userns_state_mutex userns: Only allow the creator of the userns unprivileged mappings userns: Check euid no fsuid when establishing an unprivileged uid mapping userns: Don't allow unprivileged creation of gid mappings userns: Don't allow setgroups until a gid mapping has been setablished userns: Document what the invariant required for safe unprivileged mappings. groups: Consolidate the setgroups permission checks mnt: Clear mnt_expire during pivot_root mnt: Carefully set CL_UNPRIVILEGED in clone_mnt mnt: Move the clear of MNT_LOCKED from copy_tree to it's callers. umount: Do not allow unmounting rootfs. umount: Disallow unprivileged mount force mnt: Update unprivileged remount test mnt: Implicitly add MNT_NODEV on remount when it was implicitly added by mount
2014-12-17Merge tag 'for-linus-20141215' of git://git.infradead.org/linux-mtdLinus Torvalds2-1/+2
Pull MTD updates from Brian Norris: "Summary: - Add device tree support for DoC3 - SPI NOR: Refactoring, for better layering between spi-nor.c and its driver users (e.g., m25p80.c) New flash device support Support 6-byte ID strings - NAND: New NAND driver for Allwinner SoC's (sunxi) GPMI NAND: add support for raw (no ECC) access, for testing purposes Add ATO manufacturer ID A few odd driver fixes - MTD tests: Allow testers to compensate for OOB bitflips in oobtest Fix a torturetest regression - nandsim: Support longer ID byte strings And more" * tag 'for-linus-20141215' of git://git.infradead.org/linux-mtd: (63 commits) mtd: tests: abort torturetest on erase errors mtd: physmap_of: fix potential NULL dereference mtd: spi-nor: allow NULL as chip name and try to auto detect it mtd: nand: gpmi: add raw oob access functions mtd: nand: gpmi: add proper raw access support mtd: nand: gpmi: add gpmi_copy_bits function mtd: spi-nor: factor out write_enable() for erase commands mtd: spi-nor: add support for s25fl128s mtd: spi-nor: remove the jedec_id/ext_id mtd: spi-nor: add id/id_len for flash_info{} mtd: nand: correct the comment of function nand_block_isreserved() jffs2: Drop bogus if in comment mtd: atmel_nand: replace memcpy32_toio/memcpy32_fromio with memcpy mtd: cafe_nand: drop duplicate .write_page implementation mtd: m25p80: Add support for serial flash Spansion S25FL132K MTD: m25p80: fix inconsistency in m25p_ids compared to spi_nor_ids mtd: spi-nor: improve wait-till-ready timeout loop mtd: delete unnecessary checks before two function calls mtd: nand: omap: Fix NAND enumeration on 3430 LDP mtd: nand: add ATO manufacturer info ...
2014-12-17Merge branch 'for-linus' of ↵Linus Torvalds6-524/+359
git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse Pull fuse update from Miklos Szeredi: "The first part makes sure we don't hold up umount with pending async requests. In addition to being a cleanup, this is a small behavioral change (for the better) and unlikely to break anything. The second part prepares for a cleanup of the fuse device I/O code by adding a helper for simple request submission, with some savings in line numbers already realized" * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse: fuse: use file_inode() in fuse_file_fallocate() fuse: introduce fuse_simple_request() helper fuse: reduce max out args fuse: hold inode instead of path after release fuse: flush requests on umount fuse: don't wake up reserved req in fuse_conn_kill()
2014-12-17ceph: fix setting empty extended attributeYan, Zheng1-2/+5
make sure 'value' is not null. otherwise __ceph_setxattr will remove the extended attribute. Signed-off-by: Yan, Zheng <zyan@redhat.com> Reviewed-by: Sage Weil <sage@redhat.com>
2014-12-17ceph: fix mksnap crashYan, Zheng1-1/+3
mksnap reply only contain 'target', does not contain 'dentry'. So it's wrong to use req->r_reply_info.head->is_dentry to detect traceless reply. Signed-off-by: Yan, Zheng <zyan@redhat.com> Reviewed-by: Sage Weil <sage@redhat.com>
2014-12-17ceph: do_sync is never initializedDan Carpenter1-1/+1
Probably this code was syncing a lot more often then intended because the do_sync variable wasn't set to zero. Cc: stable@vger.kernel.org # v3.11+ Fixes: c62988ec0910 ('ceph: avoid meaningless calling ceph_caps_revoking if sync_mode == WB_SYNC_ALL.') Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Ilya Dryomov <idryomov@redhat.com>
2014-12-17ceph: support inline data featureYan, Zheng1-1/+2
Signed-off-by: Yan, Zheng <zyan@redhat.com>
2014-12-17ceph: flush inline versionYan, Zheng3-4/+23
After converting inline data to normal data, client need to flush the new i_inline_version (CEPH_INLINE_NONE) to MDS. This commit makes cap messages (sent to MDS) contain inline_version and inline_data. Client always converts inline data to normal data before data write, so the inline data length part is always zero. Signed-off-by: Yan, Zheng <zyan@redhat.com>
2014-12-17ceph: convert inline data to normal data before data writeYan, Zheng3-3/+161
Before any data write, convert inline data to normal data and set i_inline_version to CEPH_INLINE_NONE. The OSD request that saves inline data to object contains 3 operations (CMPXATTR, WRITE and SETXATTR). It compares a xattr named 'inline_version' to prevent old data overwrites newer data. Signed-off-by: Yan, Zheng <zyan@redhat.com>
2014-12-17ceph: sync read inline dataYan, Zheng2-13/+116
we can't use getattr to fetch inline data while holding Fr cap, because it can cause deadlock. If we need to sync read inline data, drop cap refs first, then use getattr to fetch inline data. Signed-off-by: Yan, Zheng <zyan@redhat.com>
2014-12-17ceph: fetch inline data when getting Fcr cap refsYan, Zheng3-18/+63
we can't use getattr to fetch inline data after getting Fcr caps, because it can cause deadlock. The solution is try bringing inline data to page cache when not holding any cap, and hope the inline data page is still there after getting the Fcr caps. If the page is still there, pin it in page cache for later IO. Signed-off-by: Yan, Zheng <zyan@redhat.com>
2014-12-17ceph: use getattr request to fetch inline dataYan, Zheng3-10/+32
Add a new parameter 'locked_page' to ceph_do_getattr(). If inline data in getattr reply will be copied to the page. Signed-off-by: Yan, Zheng <zyan@redhat.com>
2014-12-17ceph: add inline data to pagecacheYan, Zheng5-1/+84
Request reply and cap message can contain inline data. add inline data to the page cache if there is Fc cap. Signed-off-by: Yan, Zheng <zyan@redhat.com>
2014-12-17ceph: parse inline data in MClientReply and MClientCapsYan, Zheng3-11/+36
Signed-off-by: Yan, Zheng <zyan@redhat.com>
2014-12-17libceph: specify position of extent operationYan, Zheng2-6/+11
allow specifying position of extent operation in multi-operations osd request. This is required for cephfs to convert inline data to normal data (compare xattr, then write object). Signed-off-by: Yan, Zheng <zyan@redhat.com> Reviewed-by: Ilya Dryomov <idryomov@redhat.com>
2014-12-17ceph: remove unused stringification macrosIlya Dryomov1-3/+0
These were used to report git versions a long time ago. Signed-off-by: Ilya Dryomov <idryomov@redhat.com>
2014-12-17ceph: introduce global empty snap contextYan, Zheng3-3/+35
Current snaphost code does not properly handle moving inode from one empty snap realm to another empty snap realm. After changing inode's snap realm, some dirty pages' snap context can be not equal to inode's i_head_snap. This can trigger BUG() in ceph_put_wrbuffer_cap_refs() The fix is introduce a global empty snap context for all empty snap realm. This avoids triggering the BUG() for filesystem with no snapshot. Fixes: http://tracker.ceph.com/issues/9928 Signed-off-by: Yan, Zheng <zyan@redhat.com> Reviewed-by: Ilya Dryomov <idryomov@redhat.com>
2014-12-17ceph: message versioning fixesJohn Spray1-2/+5
There were two places we were assigning version in host byte order instead of network byte order. Also in MSG_CLIENT_SESSION we weren't setting compat_version in the header to reflect continued compatability with older MDSs. Fixes: http://tracker.ceph.com/issues/9945 Signed-off-by: John Spray <john.spray@redhat.com> Reviewed-by: Sage Weil <sage@redhat.com>
2014-12-17libceph: message signature supportYan, Zheng1-0/+16
Signed-off-by: Yan, Zheng <zyan@redhat.com>
2014-12-17ceph, rbd: delete unnecessary checks before two function callsSF Markus Elfring3-12/+6
The functions ceph_put_snap_context() and iput() test whether their argument is NULL and then return immediately. Thus the test around the call is not needed. This issue was detected by using the Coccinelle software. Signed-off-by: Markus Elfring <elfring@users.sourceforge.net> [idryomov@redhat.com: squashed rbd.c hunk, changelog] Signed-off-by: Ilya Dryomov <idryomov@redhat.com>
2014-12-17ceph: introduce a new inode flag indicating if cached dentries are orderedYan, Zheng3-19/+55
After creating/deleting/renaming file, offsets of sibling dentries may change. So we can not use cached dentries to satisfy readdir. But we can still use the cached dentries to conclude -ENOENT for lookup. This patch introduces a new inode flag indicating if child dentries are ordered. The flag is set at the same time marking a directory complete. After creating/deleting/renaming file, we clear the flag on directory inode. This prevents ceph_readdir() from using cached dentries to satisfy readdir syscall. Signed-off-by: Yan, Zheng <zyan@redhat.com>
2014-12-17ceph: fix file lock interruptionYan, Zheng3-10/+62
When a lock operation is interrupted, current code sends a unlock request to MDS to undo the lock operation. This method does not work as expected because the unlock request can drop locks that have already been acquired. The fix is use the newly introduced CEPH_LOCK_FCNTL_INTR/CEPH_LOCK_FLOCK_INTR requests to interrupt blocked file lock request. These requests do not drop locks that have alread been acquired, they only interrupt blocked file lock request. Signed-off-by: Yan, Zheng <zyan@redhat.com>
2014-12-16Merge branch 'for-linus' of ↵Linus Torvalds11-233/+267
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull vfs pile #2 from Al Viro: "Next pile (and there'll be one or two more). The large piece in this one is getting rid of /proc/*/ns/* weirdness; among other things, it allows to (finally) make nameidata completely opaque outside of fs/namei.c, making for easier further cleanups in there" * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: coda_venus_readdir(): use file_inode() fs/namei.c: fold link_path_walk() call into path_init() path_init(): don't bother with LOOKUP_PARENT in argument fs/namei.c: new helper (path_cleanup()) path_init(): store the "base" pointer to file in nameidata itself make default ->i_fop have ->open() fail with ENXIO make nameidata completely opaque outside of fs/namei.c kill proc_ns completely take the targets of /proc/*/ns/* symlinks to separate fs bury struct proc_ns in fs/proc copy address of proc_ns_ops into ns_common new helpers: ns_alloc_inum/ns_free_inum make proc_ns_operations work with struct ns_common * instead of void * switch the rest of proc_ns_operations to working with &...->ns netns: switch ->get()/->put()/->install()/->inum() to working with &net->ns make mntns ->get()/->put()/->install()/->inum() work with &mnt_ns->ns common object embedded into various struct ....ns
2014-12-16Merge branch 'for_linus' of ↵Linus Torvalds2-0/+9
git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs Pull isofs and reiserfs fixes from Jan Kara: "A reiserfs and an isofs fix. They arrived after I sent you my first pull request and I don't want to delay them unnecessarily till rc2" * 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs: isofs: Fix infinite looping over CE entries reiserfs: destroy allocated commit workqueue
2014-12-16Merge branch 'for-3.19' of git://linux-nfs.org/~bfields/linuxLinus Torvalds15-65/+186
Pull nfsd updates from Bruce Fields: "A comparatively quieter cycle for nfsd this time, but still with two larger changes: - RPC server scalability improvements from Jeff Layton (using RCU instead of a spinlock to find idle threads). - server-side NFSv4.2 ALLOCATE/DEALLOCATE support from Anna Schumaker, enabling fallocate on new clients" * 'for-3.19' of git://linux-nfs.org/~bfields/linux: (32 commits) nfsd4: fix xdr4 count of server in fs_location4 nfsd4: fix xdr4 inclusion of escaped char sunrpc/cache: convert to use string_escape_str() sunrpc: only call test_bit once in svc_xprt_received fs: nfsd: Fix signedness bug in compare_blob sunrpc: add some tracepoints around enqueue and dequeue of svc_xprt sunrpc: convert to lockless lookup of queued server threads sunrpc: fix potential races in pool_stats collection sunrpc: add a rcu_head to svc_rqst and use kfree_rcu to free it sunrpc: require svc_create callers to pass in meaningful shutdown routine sunrpc: have svc_wake_up only deal with pool 0 sunrpc: convert sp_task_pending flag to use atomic bitops sunrpc: move rq_cachetype field to better optimize space sunrpc: move rq_splice_ok flag into rq_flags sunrpc: move rq_dropme flag into rq_flags sunrpc: move rq_usedeferral flag to rq_flags sunrpc: move rq_local field to rq_flags sunrpc: add a generic rq_flags field to svc_rqst and move rq_secure to it nfsd: minor off by one checks in __write_versions() sunrpc: release svc_pool_map reference when serv allocation fails ...
2014-12-15isofs: Fix infinite looping over CE entriesJan Kara1-0/+6
Rock Ridge extensions define so called Continuation Entries (CE) which define where is further space with Rock Ridge data. Corrupted isofs image can contain arbitrarily long chain of these, including a one containing loop and thus causing kernel to end in an infinite loop when traversing these entries. Limit the traversal to 32 entries which should be more than enough space to store all the Rock Ridge data. Reported-by: P J P <ppandit@redhat.com> CC: stable@vger.kernel.org Signed-off-by: Jan Kara <jack@suse.cz>
2014-12-14Merge branch 'next' of ↵Linus Torvalds1-6/+18
git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security Pull security layer updates from James Morris: "In terms of changes, there's general maintenance to the Smack, SELinux, and integrity code. The IMA code adds a new kconfig option, IMA_APPRAISE_SIGNED_INIT, which allows IMA appraisal to require signatures. Support for reading keys from rootfs before init is call is also added" * 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security: (23 commits) selinux: Remove security_ops extern security: smack: fix out-of-bounds access in smk_parse_smack() VFS: refactor vfs_read() ima: require signature based appraisal integrity: provide a hook to load keys when rootfs is ready ima: load x509 certificate from the kernel integrity: provide a function to load x509 certificate from the kernel integrity: define a new function integrity_read_file() Security: smack: replace kzalloc with kmem_cache for inode_smack Smack: Lock mode for the floor and hat labels ima: added support for new kernel cmdline parameter ima_template_fmt ima: allocate field pointers array on demand in template_desc_init_fields() ima: don't allocate a copy of template_fmt in template_desc_init_fields() ima: display template format in meas. list if template name length is zero ima: added error messages to template-related functions ima: use atomic bit operations to protect policy update interface ima: ignore empty and with whitespaces policy lines ima: no need to allocate entry for comment ima: report policy load status ima: use path names cache ...
2014-12-14Merge tag 'driver-core-3.19-rc1' of ↵Linus Torvalds4-33/+154
git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core Pull driver core update from Greg KH: "Here's the set of driver core patches for 3.19-rc1. They are dominated by the removal of the .owner field in platform drivers. They touch a lot of files, but they are "simple" changes, just removing a line in a structure. Other than that, a few minor driver core and debugfs changes. There are some ath9k patches coming in through this tree that have been acked by the wireless maintainers as they relied on the debugfs changes. Everything has been in linux-next for a while" * tag 'driver-core-3.19-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core: (324 commits) Revert "ath: ath9k: use debugfs_create_devm_seqfile() helper for seq_file entries" fs: debugfs: add forward declaration for struct device type firmware class: Deletion of an unnecessary check before the function call "vunmap" firmware loader: fix hung task warning dump devcoredump: provide a one-way disable function device: Add dev_<level>_once variants ath: ath9k: use debugfs_create_devm_seqfile() helper for seq_file entries ath: use seq_file api for ath9k debugfs files debugfs: add helper function to create device related seq_file drivers/base: cacheinfo: remove noisy error boot message Revert "core: platform: add warning if driver has no owner" drivers: base: support cpu cache information interface to userspace via sysfs drivers: base: add cpu_device_create to support per-cpu devices topology: replace custom attribute macros with standard DEVICE_ATTR* cpumask: factor out show_cpumap into separate helper function driver core: Fix unbalanced device reference in drivers_probe driver core: fix race with userland in device_add() sysfs/kernfs: make read requests on pre-alloc files use the buffer. sysfs/kernfs: allow attributes to request write buffer be pre-allocated. fs: sysfs: return EGBIG on write if offset is larger than file size ...
2014-12-14Merge tag 'squashfs-updates' of ↵Linus Torvalds6-0/+170
git://git.kernel.org/pub/scm/linux/kernel/git/pkl/squashfs-next Pull squashfs update from Phillip Lougher: "These patches optionally add LZ4 compression support to Squashfs. LZ4 is a lightweight compression algorithm which can be used on embedded systems to reduce CPU and memory overhead (in comparison to the standard zlib compression). These patches add the wrapper code to allow Squashfs to use the existing LZ4 decompression code, and the necessary configuration option" * tag 'squashfs-updates' of git://git.kernel.org/pub/scm/linux/kernel/git/pkl/squashfs-next: Squashfs: Add LZ4 compression configuration option Squashfs: add LZ4 compression support
2014-12-14Merge git://git.kvack.org/~bcrl/aio-nextLinus Torvalds1-2/+31
Pull aio updates from Benjamin LaHaise. * git://git.kvack.org/~bcrl/aio-next: aio: Skip timer for io_getevents if timeout=0 aio: Make it possible to remap aio ring
2014-12-13aio: Skip timer for io_getevents if timeout=0Fam Zheng1-2/+6
In this case, it is basically a polling. Let's not involve timer at all because that would hurt performance for application event loops. In an arbitrary test I've done, io_getevents syscall elapsed time reduces from 50000+ nanoseconds to a few hundereds. Signed-off-by: Fam Zheng <famz@redhat.com> Signed-off-by: Benjamin LaHaise <bcrl@kvack.org>
2014-12-13aio: Make it possible to remap aio ringPavel Emelyanov1-0/+25
There are actually two issues this patch addresses. Let me start with the one I tried to solve in the beginning. So, in the checkpoint-restore project (criu) we try to dump tasks' state and restore one back exactly as it was. One of the tasks' state bits is rings set up with io_setup() call. There's (almost) no problems in dumping them, there's a problem restoring them -- if I dump a task with aio ring originally mapped at address A, I want to restore one back at exactly the same address A. Unfortunately, the io_setup() does not allow for that -- it mmaps the ring at whatever place mm finds appropriate (it calls do_mmap_pgoff() with zero address and without the MAP_FIXED flag). To make restore possible I'm going to mremap() the freshly created ring into the address A (under which it was seen before dump). The problem is that the ring's virtual address is passed back to the user-space as the context ID and this ID is then used as search key by all the other io_foo() calls. Reworking this ID to be just some integer doesn't seem to work, as this value is already used by libaio as a pointer using which this library accesses memory for aio meta-data. So, to make restore work we need to make sure that a) ring is mapped at desired virtual address b) kioctx->user_id matches this value Having said that, the patch makes mremap() on aio region update the kioctx's user_id and mmap_base values. Here appears the 2nd issue I mentioned in the beginning of this mail. If (regardless of the C/R dances I do) someone creates an io context with io_setup(), then mremap()-s the ring and then destroys the context, the kill_ioctx() routine will call munmap() on wrong (old) address. This will result in a) aio ring remaining in memory and b) some other vma get unexpectedly unmapped. What do you think? Signed-off-by: Pavel Emelyanov <xemul@parallels.com> Acked-by: Dmitry Monakhov <dmonakhov@openvz.org> Signed-off-by: Benjamin LaHaise <bcrl@kvack.org>
2014-12-13Merge branch 'for-3.19/core' of git://git.kernel.dk/linux-blockLinus Torvalds1-7/+22
Pull block driver core update from Jens Axboe: "This is the pull request for the core block IO changes for 3.19. Not a huge round this time, mostly lots of little good fixes: - Fix a bug in sysfs blktrace interface causing a NULL pointer dereference, when enabled/disabled through that API. From Arianna Avanzini. - Various updates/fixes/improvements for blk-mq: - A set of updates from Bart, mostly fixing buts in the tag handling. - Cleanup/code consolidation from Christoph. - Extend queue_rq API to be able to handle batching issues of IO requests. NVMe will utilize this shortly. From me. - A few tag and request handling updates from me. - Cleanup of the preempt handling for running queues from Paolo. - Prevent running of unmapped hardware queues from Ming Lei. - Move the kdump memory limiting check to be in the correct location, from Shaohua. - Initialize all software queues at init time from Takashi. This prevents a kobject warning when CPUs are brought online that weren't online when a queue was registered. - Single writeback fix for I_DIRTY clearing from Tejun. Queued with the core IO changes, since it's just a single fix. - Version X of the __bio_add_page() segment addition retry from Maurizio. Hope the Xth time is the charm. - Documentation fixup for IO scheduler merging from Jan. - Introduce (and use) generic IO stat accounting helpers for non-rq drivers, from Gu Zheng. - Kill off artificial limiting of max sectors in a request from Christoph" * 'for-3.19/core' of git://git.kernel.dk/linux-block: (26 commits) bio: modify __bio_add_page() to accept pages that don't start a new segment blk-mq: Fix uninitialized kobject at CPU hotplugging blktrace: don't let the sysfs interface remove trace from running list blk-mq: Use all available hardware queues blk-mq: Micro-optimize bt_get() blk-mq: Fix a race between bt_clear_tag() and bt_get() blk-mq: Avoid that __bt_get_word() wraps multiple times blk-mq: Fix a use-after-free blk-mq: prevent unmapped hw queue from being scheduled blk-mq: re-check for available tags after running the hardware queue blk-mq: fix hang in bt_get() blk-mq: move the kdump check to blk_mq_alloc_tag_set blk-mq: cleanup tag free handling blk-mq: use 'nr_cpu_ids' as highest CPU ID count for hwq <-> cpu map blk: introduce generic io stat accounting help function blk-mq: handle the single queue case in blk_mq_hctx_next_cpu genhd: check for int overflow in disk_expand_part_tbl() blk-mq: add blk_mq_free_hctx_request() blk-mq: export blk_mq_free_request() blk-mq: use get_cpu/put_cpu instead of preempt_disable/preempt_enable ...
2014-12-13fsnotify: remove destroy_list from fsnotify_markJan Kara1-4/+4
destroy_list is used to track marks which still need waiting for srcu period end before they can be freed. However by the time mark is added to destroy_list it isn't in group's list of marks anymore and thus we can reuse fsnotify_mark->g_list for queueing into destroy_list. This saves two pointers for each fsnotify_mark. Signed-off-by: Jan Kara <jack@suse.cz> Cc: Eric Paris <eparis@redhat.com> Cc: Heinrich Schuchardt <xypron.glpk@gmx.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-12-13fsnotify: unify inode and mount marks handlingJan Kara9-201/+148
There's a lot of common code in inode and mount marks handling. Factor it out to a common helper function. Signed-off-by: Jan Kara <jack@suse.cz> Cc: Eric Paris <eparis@redhat.com> Cc: Heinrich Schuchardt <xypron.glpk@gmx.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-12-13fallocate: create FAN_MODIFY and IN_MODIFY eventsHeinrich Schuchardt1-0/+11
The fanotify and the inotify API can be used to monitor changes of the file system. System call fallocate() modifies files. Hence it should trigger the corresponding fanotify (FAN_MODIFY) and inotify (IN_MODIFY) events. The most interesting case is FALLOC_FL_COLLAPSE_RANGE because this value allows to create arbitrary file content from random data. This patch adds the missing call to fsnotify_modify(). The FAN_MODIFY and IN_MODIFY event will be created when fallocate() succeeds. It will even be created if the file length remains unchanged, e.g. when calling fanotify with flag FALLOC_FL_KEEP_SIZE. This logic was primarily chosen to keep the coding simple. It resembles the logic of the write() system call. When we call write() we always create a FAN_MODIFY event, even in the case of overwriting with identical data. Events FAN_MODIFY and IN_MODIFY do not provide any guarantee that data was actually changed. Furthermore even if if the filesize remains unchanged, fallocate() may influence whether a subsequent write() will succeed and hence the fallocate() call may be considered a modification. The fallocate(2) man page teaches: After a successful call, subsequent writes into the range specified by offset and len are guaranteed not to fail because of lack of disk space. So calling fallocate(fd, FALLOC_FL_KEEP_SIZE, offset, len) may result in different outcomes of a subsequent write depending on the values of offset and len. Signed-off-by: Heinrich Schuchardt <xypron.glpk@gmx.de> Reviewed-by: Jan Kara <jack@suse.cz> Cc: Jan Kara <jack@suse.cz> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Eric Paris <eparis@parisplace.org> Cc: John McCutchan <john@johnmccutchan.com> Cc: Robert Love <rlove@rlove.org> Cc: Michael Kerrisk <mtk.manpages@gmail.com> Cc: Theodore Ts'o <tytso@mit.edu> Cc: Dave Chinner <david@fromorbit.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-12-13fs/affs/file.c: remove obsolete pagesize checkFabian Frederick1-4/+0
linux kernel doesn't manage page sizes below 4kb. Signed-off-by: Fabian Frederick <fabf@skynet.be> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-12-13fs/affs/file.c: add support to O_DIRECTFabian Frederick1-0/+18
Based on ext2_direct_IO Tested with O_DIRECT file open and sysbench/mariadb with 1% written queries improvement (update_non_index test) on a volume created with mkaffs. Signed-off-by: Fabian Frederick <fabf@skynet.be> Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-12-13fs/affs/amigaffs.c: use va_format instead of buffer/vnsprintfFabian Frederick3-21/+25
-Remove ErrorBuffer and use %pV -Add __printf to enable argument mistmatch warnings Original patch by Joe Perches. Signed-off-by: Fabian Frederick <fabf@skynet.be> Cc: Joe Perches <joe@perches.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-12-13fs/affs/file.c: forward declaration clean-upFabian Frederick1-22/+16
-Move file_operations to avoid forward declarations. -Remove unused declarations. Signed-off-by: Fabian Frederick <fabf@skynet.be> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-12-13syscalls: implement execveat() system callDavid Drysdale5-14/+119
This patchset adds execveat(2) for x86, and is derived from Meredydd Luff's patch from Sept 2012 (https://lkml.org/lkml/2012/9/11/528). The primary aim of adding an execveat syscall is to allow an implementation of fexecve(3) that does not rely on the /proc filesystem, at least for executables (rather than scripts). The current glibc version of fexecve(3) is implemented via /proc, which causes problems in sandboxed or otherwise restricted environments. Given the desire for a /proc-free fexecve() implementation, HPA suggested (https://lkml.org/lkml/2006/7/11/556) that an execveat(2) syscall would be an appropriate generalization. Also, having a new syscall means that it can take a flags argument without back-compatibility concerns. The current implementation just defines the AT_EMPTY_PATH and AT_SYMLINK_NOFOLLOW flags, but other flags could be added in future -- for example, flags for new namespaces (as suggested at https://lkml.org/lkml/2006/7/11/474). Related history: - https://lkml.org/lkml/2006/12/27/123 is an example of someone realizing that fexecve() is likely to fail in a chroot environment. - http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=514043 covered documenting the /proc requirement of fexecve(3) in its manpage, to "prevent other people from wasting their time". - https://bugzilla.redhat.com/show_bug.cgi?id=241609 described a problem where a process that did setuid() could not fexecve() because it no longer had access to /proc/self/fd; this has since been fixed. This patch (of 4): Add a new execveat(2) system call. execveat() is to execve() as openat() is to open(): it takes a file descriptor that refers to a directory, and resolves the filename relative to that. In addition, if the filename is empty and AT_EMPTY_PATH is specified, execveat() executes the file to which the file descriptor refers. This replicates the functionality of fexecve(), which is a system call in other UNIXen, but in Linux glibc it depends on opening "/proc/self/fd/<fd>" (and so relies on /proc being mounted). The filename fed to the executed program as argv[0] (or the name of the script fed to a script interpreter) will be of the form "/dev/fd/<fd>" (for an empty filename) or "/dev/fd/<fd>/<filename>", effectively reflecting how the executable was found. This does however mean that execution of a script in a /proc-less environment won't work; also, script execution via an O_CLOEXEC file descriptor fails (as the file will not be accessible after exec). Based on patches by Meredydd Luff. Signed-off-by: David Drysdale <drysdale@google.com> Cc: Meredydd Luff <meredydd@senatehouse.org> Cc: Shuah Khan <shuah.kh@samsung.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Andy Lutomirski <luto@amacapital.net> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@redhat.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Kees Cook <keescook@chromium.org> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Rich Felker <dalias@aerifal.cx> Cc: Christoph Hellwig <hch@infradead.org> Cc: Michael Kerrisk <mtk.manpages@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>