summaryrefslogtreecommitdiffstats
path: root/fs
AgeCommit message (Collapse)AuthorFilesLines
2010-07-30NFS: kswapd must not block in nfs_release_pageTrond Myklebust2-4/+13
See https://bugzilla.kernel.org/show_bug.cgi?id=16056 If other processes are blocked waiting for kswapd to free up some memory so that they can make progress, then we cannot allow kswapd to block on those processes. Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com> Cc: stable@kernel.org
2010-07-30nfs: include space for the NUL in root pathDan Carpenter1-1/+1
In root_nfs_name() it does the following: if (strlen(buf) + strlen(cp) > NFS_MAXPATHLEN) { printk(KERN_ERR "Root-NFS: Pathname for remote directory too long.\n"); return -1; } sprintf(nfs_export_path, buf, cp); In the original code if (strlen(buf) + strlen(cp) == NFS_MAXPATHLEN) then the sprintf() would lead to an overflow. Generally the rest of the code assumes that the path can have NFS_MAXPATHLEN (1024) characters and a NUL terminator so the fix is to add space to the nfs_export_path[] buffer. Signed-off-by: Dan Carpenter <error27@gmail.com> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2010-07-29CRED: Fix get_task_cred() and task_state() to not resurrect dead credentialsDavid Howells1-1/+1
It's possible for get_task_cred() as it currently stands to 'corrupt' a set of credentials by incrementing their usage count after their replacement by the task being accessed. What happens is that get_task_cred() can race with commit_creds(): TASK_1 TASK_2 RCU_CLEANER -->get_task_cred(TASK_2) rcu_read_lock() __cred = __task_cred(TASK_2) -->commit_creds() old_cred = TASK_2->real_cred TASK_2->real_cred = ... put_cred(old_cred) call_rcu(old_cred) [__cred->usage == 0] get_cred(__cred) [__cred->usage == 1] rcu_read_unlock() -->put_cred_rcu() [__cred->usage == 1] panic() However, since a tasks credentials are generally not changed very often, we can reasonably make use of a loop involving reading the creds pointer and using atomic_inc_not_zero() to attempt to increment it if it hasn't already hit zero. If successful, we can safely return the credentials in the knowledge that, even if the task we're accessing has released them, they haven't gone to the RCU cleanup code. We then change task_state() in procfs to use get_task_cred() rather than calling get_cred() on the result of __task_cred(), as that suffers from the same problem. Without this change, a BUG_ON in __put_cred() or in put_cred_rcu() can be tripped when it is noticed that the usage count is not zero as it ought to be, for example: kernel BUG at kernel/cred.c:168! invalid opcode: 0000 [#1] SMP last sysfs file: /sys/kernel/mm/ksm/run CPU 0 Pid: 2436, comm: master Not tainted 2.6.33.3-85.fc13.x86_64 #1 0HR330/OptiPlex 745 RIP: 0010:[<ffffffff81069881>] [<ffffffff81069881>] __put_cred+0xc/0x45 RSP: 0018:ffff88019e7e9eb8 EFLAGS: 00010202 RAX: 0000000000000001 RBX: ffff880161514480 RCX: 00000000ffffffff RDX: 00000000ffffffff RSI: ffff880140c690c0 RDI: ffff880140c690c0 RBP: ffff88019e7e9eb8 R08: 00000000000000d0 R09: 0000000000000000 R10: 0000000000000001 R11: 0000000000000040 R12: ffff880140c690c0 R13: ffff88019e77aea0 R14: 00007fff336b0a5c R15: 0000000000000001 FS: 00007f12f50d97c0(0000) GS:ffff880007400000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007f8f461bc000 CR3: 00000001b26ce000 CR4: 00000000000006f0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 Process master (pid: 2436, threadinfo ffff88019e7e8000, task ffff88019e77aea0) Stack: ffff88019e7e9ec8 ffffffff810698cd ffff88019e7e9ef8 ffffffff81069b45 <0> ffff880161514180 ffff880161514480 ffff880161514180 0000000000000000 <0> ffff88019e7e9f28 ffffffff8106aace 0000000000000001 0000000000000246 Call Trace: [<ffffffff810698cd>] put_cred+0x13/0x15 [<ffffffff81069b45>] commit_creds+0x16b/0x175 [<ffffffff8106aace>] set_current_groups+0x47/0x4e [<ffffffff8106ac89>] sys_setgroups+0xf6/0x105 [<ffffffff81009b02>] system_call_fastpath+0x16/0x1b Code: 48 8d 71 ff e8 7e 4e 15 00 85 c0 78 0b 8b 75 ec 48 89 df e8 ef 4a 15 00 48 83 c4 18 5b c9 c3 55 8b 07 8b 07 48 89 e5 85 c0 74 04 <0f> 0b eb fe 65 48 8b 04 25 00 cc 00 00 48 3b b8 58 04 00 00 75 RIP [<ffffffff81069881>] __put_cred+0xc/0x45 RSP <ffff88019e7e9eb8> ---[ end trace df391256a100ebdd ]--- Signed-off-by: David Howells <dhowells@redhat.com> Acked-by: Jiri Olsa <jolsa@redhat.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-07-28ecryptfs: Bugfix for error related to ecryptfs_hash_bucketsAndre Osterhues1-8/+9
The function ecryptfs_uid_hash wrongly assumes that the second parameter to hash_long() is the number of hash buckets instead of the number of hash bits. This patch fixes that and renames the variable ecryptfs_hash_buckets to ecryptfs_hash_bits to make it clearer. Fixes: CVE-2010-2492 Signed-off-by: Andre Osterhues <aosterhues@escrypt.com> Signed-off-by: Tyler Hicks <tyhicks@linux.vnet.ibm.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-07-28Merge branch 'for-linus' of ↵Linus Torvalds9-36/+50
git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client: ceph: use complete_all and wake_up_all ceph: Correct obvious typo of Kconfig variable "CRYPTO_AES" ceph: fix dentry lease release ceph: fix leak of dentry in ceph_init_dentry() error path ceph: fix pg_mapping leak on pg_temp updates ceph: fix d_release dop for snapdir, snapped dentries ceph: avoid dcache readdir for snapdir
2010-07-28GFS2: Use kmalloc when possible for ->readdir()Steven Whitehouse1-6/+25
If we don't need a huge amount of memory in ->readdir() then we can use kmalloc rather than vmalloc to allocate it. This should cut down on the greater overheads associated with vmalloc for smaller directories. We may be able to eliminate vmalloc entirely at some stage, but this is easy to do right away. Also using GFP_NOFS to avoid any issues wrt to deleting inodes while under a glock, and suggestion from Linus to factor out the alloc/dealloc. I've given this a test with a variety of different sized directories and it seems to work ok. Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Nick Piggin <npiggin@suse.de> Cc: Prarit Bhargava <prarit@redhat.com> Signed-off-by: Steven Whitehouse <swhiteho@redhat.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-07-27ceph: use complete_all and wake_up_allYehuda Sadeh6-20/+20
This fixes an issue triggered by running concurrent syncs. One of the syncs would go through while the other would just hang indefinitely. In any case, we never actually want to wake a single waiter, so the *_all functions should be used. Signed-off-by: Yehuda Sadeh <yehuda@hq.newdream.net> Signed-off-by: Sage Weil <sage@newdream.net>
2010-07-279p: Pass the correct end of buffer to p9stat_readLatchesar Ionkov1-1/+1
Pass the correct end of the buffer to p9stat_read. Signed-off-by: Latchesar Ionkov <lucho@ionkov.net> Signed-off-by: Eric Van Hensbergen <ericvh@gmail.com>
2010-07-26xfs simplify and speed up direct I/O completionsChristoph Hellwig1-82/+76
Our current handling of direct I/O completions is rather suboptimal, because we defer it to a workqueue more often than needed, and we perform a much to aggressive flush of the workqueue in case unwritten extent conversions happen. This patch changes the direct I/O reads to not even use a completion handler, as we don't bother to use it at all, and to perform the unwritten extent conversions in caller context for synchronous direct I/O. For a small I/O size direct I/O workload on a consumer grade SSD, such as the untar of a kernel tree inside qemu this patch gives speedups of about 5%. Getting us much closer to the speed of a native block device, or a fully allocated XFS file. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Alex Elder <aelder@sgi.com>
2010-07-26xfs: move aio completion after unwritten extent conversionChristoph Hellwig1-3/+16
If we write into an unwritten extent using AIO we need to complete the AIO request after the extent conversion has finished. Without that a read could race to see see the extent still unwritten and return zeros. For synchronous I/O we already take care of that by flushing the xfsconvertd workqueue (which might be a bit of overkill). To do that add iocb and result fields to struct xfs_ioend, so that we can call aio_complete from xfs_end_io after the extent conversion has happened. Note that we need a new result field as io_error is used for positive errno values, while the AIO code can return negative error values and positive transfer sizes. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Alex Elder <aelder@sgi.com>
2010-07-26direct-io: move aio_complete into ->end_ioChristoph Hellwig5-17/+35
Filesystems with unwritten extent support must not complete an AIO request until the transaction to convert the extent has been commited. That means the aio_complete calls needs to be moved into the ->end_io callback so that the filesystem can control when to call it exactly. This makes a bit of a mess out of dio_complete and the ->end_io callback prototype even more complicated. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Alex Elder <aelder@sgi.com>
2010-07-26xfs: fix big endian buildDave Chinner1-1/+2
Commit 0fd7275cc42ab734eaa1a2c747e65479bd1e42af ("xfs: fix gcc 4.6 set but not read and unused statement warnings") failed to convert some code inside XFS_NATIVE_HOST (big endian host code only) and hence fails to build on such machines. Fix it. Signed-off-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Alex Elder <aelder@sgi.com>
2010-07-26sysfs: allow creating symlinks from untagged to tagged directoriesEric W. Biederman1-1/+2
Supporting symlinks from untagged to tagged directories is reasonable, and needed to support CONFIG_SYSFS_DEPRECATED. So don't fail a prior allowing that case to work. Signed-off-by: Eric W. Biederman <ebiederm@xmission.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2010-07-26sysfs: sysfs_delete_link handle symlinks from untagged to tagged directories.Eric W. Biederman1-1/+1
This happens for network devices when SYSFS_DEPRECATED is enabled. Signed-off-by: Eric W. Biederman <ebiederm@xmission.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2010-07-26sysfs: Don't allow the creation of symlinks we can't removeEric W. Biederman1-5/+18
Recently my tagged sysfs support revealed a flaw in the device core that a few rare drivers are running into such that we don't always put network devices in a class subdirectory named net/. Since we are not creating the class directory the network devices wind up in a non-tagged directory, but the symlinks to the network devices from /sys/class/net are in a tagged directory. All of which works until we go to remove or rename the symlink. When we remove or rename a symlink we look in the namespace of the target of the symlink. Since the target of the symlink is in a non-tagged sysfs directory we don't have a namespace to look in, and we fail to remove the symlink. Detect this problem up front and simply don't create symlinks we won't be able to remove later. This prevents symlink leakage and fails in a much clearer and more understandable way. Signed-off-by: Eric W. Biederman <ebiederm@xmission.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Rafael J. Wysocki <rjw@sisk.pl> Cc: Maciej W. Rozycki <macro@linux-mips.org> Cc: Kay Sievers <kay.sievers@vrfy.org> Cc: Johannes Berg <johannes@sipsolutions.net> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2010-07-26xfs: clean up xfs_bmap_get_bpChristoph Hellwig1-25/+18
Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2010-07-26xfs: simplify xfs_truncate_fileChristoph Hellwig3-102/+52
xfs_truncate_file is only used for truncating quota files. Move it to xfs_qm_syscalls.c so it can be marked static and take advatange of the fact by removing the unused page cache validation and taking the iget into the helper. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2010-07-26xfs: kill the b_strat callback in xfs_bufChristoph Hellwig4-17/+3
The b_strat callback is used by xfs_buf_iostrategy to perform additional checks before submitting a buffer. It is used in xfs_bwrite and when writing out delayed buffers. In xfs_bwrite it we can de-virtualize the call easily as b_strat is set a few lines above the call to xfs_buf_iostrategy. For the delayed buffers the rationale is a bit more complicated: - there are three callers of xfs_buf_delwri_queue, which places buffers on the delwri list: (1) xfs_bdwrite - this sets up b_strat, so it's fine (2) xfs_buf_iorequest. None of the callers can have XBF_DELWRI set: - xlog_bdstrat is only used for log buffers, which are never delwri - _xfs_buf_read explicitly clears the delwri flag - xfs_buf_iodone_work retries log buffers only - xfsbdstrat - only used for reads, superblock writes without the delwri flag, log I/O and file zeroing with explicitly allocated buffers. - xfs_buf_iostrategy - only calls xfs_buf_iorequest if b_strat is not set (3) xfs_buf_unlock - only puts the buffer on the delwri list if the DELWRI flag is already set. The DELWRI flag is only ever set in xfs_bwrite, xfs_buf_iodone_callbacks, or xfs_trans_log_buf. For xfs_buf_iodone_callbacks and xfs_trans_log_buf we require an initialized buf item, which means b_strat was set to xfs_bdstrat_cb in xfs_buf_item_init. Conclusion: we can just get rid of the callback and replace it with explicit calls to xfs_bdstrat_cb. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2010-07-26xfs: remove obsolete osyncisosync mount optionChristoph Hellwig2-8/+4
Since Linux 2.6.33 the kernel has support for real O_SYNC, which made the osyncisosync option a no-op. Warn the users about this and remove the mount flag for it. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2010-07-26xfs: clean up filestreams helpersChristoph Hellwig2-85/+77
Move xfs_filestream_peek_ag, xxfs_filestream_get_ag and xfs_filestream_put_ag from xfs_filestream.h to xfs_filestream.c where it's only callers are, and remove the inline marker while we're at it to let the compiler decide on the inlining. Also don't return a value from xfs_filestream_put_ag because we don't need it. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2010-07-26xfs: fix gcc 4.6 set but not read and unused statement warningsChristoph Hellwig7-34/+15
[hch: dropped a few hunks that need structural changes instead] Signed-off-by: Andi Kleen <ak@linux.intel.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2010-07-26xfs: Fix build when CONFIG_XFS_POSIX_ACL=nTony Luck1-0/+2
When CONFIG_XFS_POSIX_ACL is not set "xfs_check_acl" is #defined to NULL - which breaks the code attempting to add a tracepoint on this function. Only define the tracepoint when the function exists. Signed-off-by: Tony Luck <tony.luck@intel.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Dave Chinner <david@fromorbit.com>
2010-07-26xfs: fix unsigned underflow in xfs_free_eofblocksKulikov Vasiliy1-2/+2
map_len is unsigned. Checking map_len <= 0 is buggy when it should be below zero. So, check exact expression instead of map_len. Signed-off-by: Kulikov Vasiliy <segooon@gmail.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Dave Chinner <david@fromorbit.com>
2010-07-26xfs: use GFP_NOFS for page cache allocationDave Chinner1-2/+2
Avoid a lockdep warning by preventing page cache allocation from recursing back into the filesystem during memory reclaim. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Alex Elder <aelder@sgi.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2010-07-26xfs: fix memory reclaim recursion deadlock on locked inode bufferDave Chinner1-9/+11
Calling into memory reclaim with a locked inode buffer can deadlock if memory reclaim tries to lock the inode buffer during inode teardown. Convert the relevant memory allocations to use KM_NOFS to avoid this deadlock condition. Reported-by: Peter Watkins <treestem@gmail.com> Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Alex Elder <aelder@sgi.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2010-07-26xfs: fix xfs_trans_add_item() lockdep warningsDave Chinner1-1/+1
xfs_trans_add_item() is called with ip->i_ilock held, which means it is unsafe for memory reclaim to recurse back into the filesystem (ilock is required in writeback). Hence the allocation needs to be KM_NOFS to avoid recursion. Lockdep report indicating memory allocation being called with the ip->i_ilock held is as follows: [ 1749.866796] ================================= [ 1749.867788] [ INFO: inconsistent lock state ] [ 1749.868327] 2.6.35-rc3-dgc+ #25 [ 1749.868741] --------------------------------- [ 1749.868741] inconsistent {IN-RECLAIM_FS-W} -> {RECLAIM_FS-ON-W} usage. [ 1749.868741] dd/2835 [HC0[0]:SC0[0]:HE1:SE1] takes: [ 1749.868741] (&(&ip->i_lock)->mr_lock){++++?.}, at: [<ffffffff813170fb>] xfs_ilock+0x10b/0x190 [ 1749.868741] {IN-RECLAIM_FS-W} state was registered at: [ 1749.868741] [<ffffffff810b3a97>] __lock_acquire+0x437/0x1450 [ 1749.868741] [<ffffffff810b4b56>] lock_acquire+0xa6/0x160 [ 1749.868741] [<ffffffff810a20b5>] down_write_nested+0x65/0xb0 [ 1749.868741] [<ffffffff813170fb>] xfs_ilock+0x10b/0x190 [ 1749.868741] [<ffffffff8134e819>] xfs_reclaim_inode+0x99/0x310 [ 1749.868741] [<ffffffff8134f56b>] xfs_inode_ag_walk+0x8b/0x150 [ 1749.868741] [<ffffffff8134f6bb>] xfs_inode_ag_iterator+0x8b/0xf0 [ 1749.868741] [<ffffffff8134f7a8>] xfs_reclaim_inode_shrink+0x88/0x90 [ 1749.868741] [<ffffffff81119d07>] shrink_slab+0x137/0x1a0 [ 1749.868741] [<ffffffff8111bbe1>] balance_pgdat+0x421/0x6a0 [ 1749.868741] [<ffffffff8111bf7d>] kswapd+0x11d/0x320 [ 1749.868741] [<ffffffff8109ce56>] kthread+0x96/0xa0 [ 1749.868741] [<ffffffff81035de4>] kernel_thread_helper+0x4/0x10 [ 1749.868741] irq event stamp: 4234335 [ 1749.868741] hardirqs last enabled at (4234335): [<ffffffff81147d25>] kmem_cache_free+0x115/0x220 [ 1749.868741] hardirqs last disabled at (4234334): [<ffffffff81147c4d>] kmem_cache_free+0x3d/0x220 [ 1749.868741] softirqs last enabled at (4233112): [<ffffffff81084dd2>] __do_softirq+0x142/0x260 [ 1749.868741] softirqs last disabled at (4233095): [<ffffffff81035edc>] call_softirq+0x1c/0x50 [ 1749.868741] [ 1749.868741] other info that might help us debug this: [ 1749.868741] 2 locks held by dd/2835: [ 1749.868741] #0: (&(&ip->i_iolock)->mr_lock#2){+.+.+.}, at: [<ffffffff81316edd>] xfs_ilock_nowait+0xed/0x200 [ 1749.868741] #1: (&(&ip->i_lock)->mr_lock){++++?.}, at: [<ffffffff813170fb>] xfs_ilock+0x10b/0x190 [ 1749.868741] [ 1749.868741] stack backtrace: [ 1749.868741] Pid: 2835, comm: dd Not tainted 2.6.35-rc3-dgc+ #25 [ 1749.868741] Call Trace: [ 1749.868741] [<ffffffff810b1faa>] print_usage_bug+0x18a/0x190 [ 1749.868741] [<ffffffff8104264f>] ? save_stack_trace+0x2f/0x50 [ 1749.868741] [<ffffffff810b2400>] ? check_usage_backwards+0x0/0xf0 [ 1749.868741] [<ffffffff810b2f11>] mark_lock+0x331/0x400 [ 1749.868741] [<ffffffff810b3047>] mark_held_locks+0x67/0x90 [ 1749.868741] [<ffffffff810b3111>] lockdep_trace_alloc+0xa1/0xe0 [ 1749.868741] [<ffffffff81147419>] kmem_cache_alloc+0x39/0x1e0 [ 1749.868741] [<ffffffff8133f954>] kmem_zone_alloc+0x94/0xe0 [ 1749.868741] [<ffffffff8133f9be>] kmem_zone_zalloc+0x1e/0x50 [ 1749.868741] [<ffffffff81335f02>] xfs_trans_add_item+0x72/0xb0 [ 1749.868741] [<ffffffff81339e41>] xfs_trans_ijoin+0xa1/0xd0 [ 1749.868741] [<ffffffff81319f82>] xfs_itruncate_finish+0x312/0x5d0 [ 1749.868741] [<ffffffff8133cb87>] xfs_free_eofblocks+0x227/0x280 [ 1749.868741] [<ffffffff8133cd18>] xfs_release+0x138/0x190 [ 1749.868741] [<ffffffff813464c5>] xfs_file_release+0x15/0x20 [ 1749.868741] [<ffffffff81150ebf>] fput+0x13f/0x260 [ 1749.868741] [<ffffffff8114d8c2>] filp_close+0x52/0x80 [ 1749.868741] [<ffffffff8114d9a9>] sys_close+0xb9/0x120 [ 1749.868741] [<ffffffff81034ff2>] system_call_fastpath+0x16/0x1b Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Alex Elder <aelder@sgi.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2010-07-26xfs: simplify and remove xfs_ireclaimDave Chinner3-54/+32
xfs_ireclaim has to get and put te pag structure because it is only called with the inode to reclaim. The one caller of this function already has a reference on the pag and a pointer to is, so move the radix tree delete to the caller and remove xfs_ireclaim completely. This avoids a xfs_perag_get/put on every inode being reclaimed. The overhead was noticed in a bug report at: https://bugzilla.kernel.org/show_bug.cgi?id=16348 Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Alex Elder <aelder@sgi.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2010-07-26xfs: don't block on buffer read errorsDave Chinner1-4/+15
xfs_buf_read() fails to detect dispatch errors before attempting to wait on sychronous IO. If there was an error, it will get stuck forever, waiting for an I/O that was never started. Make sure the error is detected correctly. Further, such a failure can leave locked pages in the page cache which will cause a later operation to hang on the page. Ensure that we correctly process pages in the buffers when we get a dispatch error. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Dave Chinner <david@fromorbit.com>
2010-07-26xfs: move inode shrinker unregister even earlierDave Chinner1-5/+5
I missed Dave Chinner's second revision of this change, and pushed his first version out to the repository instead. commit a476c59ebb279d738718edc0e3fb76aab3687114 Author: Dave Chinner <dchinner@redhat.com> This commit compensates for that by moving a block of code up a bit further, with a result that matches the the effect of Dave's second version. Dave's first version was: Reviewed-by: Eric Sandeen <sandeen@redhat.com> Dave's second version was: Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Alex Elder <aelder@sgi.com> Reviewed-by: Eric Sandeen <sandeen@redhat.com>
2010-07-26xfs: remove a dmapi leftoverChristoph Hellwig1-3/+0
The open_exec file operation is only added by the external dmapi patch. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Alex Elder <aelder@sgi.com> Signed-off-by: Alex Elder <aelder@sgi.com>
2010-07-26xfs: writepage always has buffersChristoph Hellwig1-7/+0
These days we always have buffers thanks to ->page_mkwrite. And we already have an assert a few lines above tripping in case that was not true due to a bug. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Alex Elder <aelder@sgi.com>
2010-07-26xfs: allow writeback from kswapdChristoph Hellwig1-5/+4
We only need disable I/O from direct or memcg reclaim. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Alex Elder <aelder@sgi.com>
2010-07-26xfs: remove incorrect log write optimizationChristoph Hellwig1-5/+2
We do need a barrier for the first buffer of a split log write. Otherwise we might incorrectly stamp the tail LSN into transactions in the first part of the split write, or not flush data I/O before updating the inode size. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Alex Elder <aelder@sgi.com>
2010-07-26xfs: unregister inode shrinker before freeing filesystem structuresDave Chinner1-1/+5
Currently we don't remove the XFS mount from the shrinker list until late in the unmount path. By this time, we have already torn down the internals of the filesystem (e.g. the per-ag structures), and hence if the shrinker is executed between the teardown and the unregistering, the shrinker will get NULL per-ag structure pointers and panic trying to dereference them. Fix this by removing the xfs mount from the shrinker list before tearing down it's internal structures. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Eric Sandeen <sandeen@redhat.com> Signed-off-by: Alex Elder <aelder@sgi.com>
2010-07-26xfs: split xfs_itrace_entryChristoph Hellwig11-52/+113
Replace the xfs_itrace_entry catchall with specific trace points. For most simple callers we now use the simple inode class, which used to be the iget class, but add more details tracing for namespace events, which now includes the name of the directory entries manipulated. Remove the xfs_inactive trace point, which is a duplicate of the clear_inode one, and the xfs_change_file_space trace point, which is immediately followed by the more specific alloc/free space trace points. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com>
2010-07-26xfs: remove xfs_iputChristoph Hellwig6-26/+17
xfs_iput is just a small wrapper for xfs_iunlock + IRELE. Having this out of line wrapper means the trace events in those two can't track their caller properly. So just remove the wrapper and opencode the unlock + rele in the few callers. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com>
2010-07-26xfs: remove xfs_iput_newChristoph Hellwig3-28/+3
We never get an i_mode of 0 or a locked VFS inode until we pass in the XFS_IGET_CREATE flag to xfs_iget, which makes xfs_iput_new equivalent to xfs_iput for the only caller. In addition to that xfs_nfs_get_inode does not even need to lock the inode given that the generation never changes for a life inode, so just pass a 0 lock_flags to xfs_iget and release the inode using IRELE in the error path. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com>
2010-07-26xfs: some iget tracing cleanups / fixesChristoph Hellwig2-6/+7
The xfs_iget_alloc/found tracepoints are a bit misnamed and misplaced. Rename them to xfs_iget_hit/xfs_iget_miss and move them to the beggining of the xfs_iget_cache_hit/miss functions. Add a new xfs_iget_reclaim_fail tracepoint for the case where we fail to re-initialize a VFS inode, and add a second instance of the xfs_iget_skip tracepoint for the case of a failed igrab() call. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com>
2010-07-26xfs: do not use emums for flags used in tracingChristoph Hellwig3-71/+71
The tracing code can't print flags defined as enums. Most flags that we want to print are defines as macros already, but move the few remaining ones over to make the trace output more useful. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com>
2010-07-26xfs: remove explicit xfs_sync_data/xfs_sync_attr calls on umountChristoph Hellwig3-17/+2
On the final put of a superblock the VFS already calls sync_filesystem for us to write out all data and wait for it. No need to start another asynchronous writeback inside ->put_super. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com>
2010-07-26xfs: small cleanups for xfs_iomap / __xfs_get_blocksChristoph Hellwig3-11/+10
Remove the flags argument to __xfs_get_blocks as we can easily derive it from the direct argument, and remove the unused BMAPI_MMAP flag. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com>
2010-07-26xfs: reduce stack usage in xfs_iomapChristoph Hellwig1-24/+28
xfs_iomap passes a xfs_bmbt_irec pointer to xfs_iomap_write_direct and xfs_iomap_write_allocate to give them the results of our read-only xfs_bmapi query. Instead of allocating a new xfs_bmbt_irec on stack for the next call to xfs_bmapi re use the one we got passed as it's not used after this point. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com>
2010-07-26xfs: avoid synchronous transaction in xfs_fs_write_inodeChristoph Hellwig1-30/+22
We already rely on the fact that the sync code will cause a synchronous log force later on (currently via xfs_fs_sync_fs -> xfs_quiesce_data -> xfs_sync_data), so no need to do this here. This allows us to avoid a lot of synchronous log forces during sync, which pays of especially with delayed logging enabled. Some compilebench numbers that show this: xfs (delayed logging, 256k logbufs) =================================== intial create 25.94 MB/s 25.75 MB/s 25.64 MB/s create 8.54 MB/s 9.12 MB/s 9.15 MB/s patch 2.47 MB/s 2.47 MB/s 3.17 MB/s compile 29.65 MB/s 30.51 MB/s 27.33 MB/s clean 90.92 MB/s 98.83 MB/s 128.87 MB/s read tree 11.90 MB/s 11.84 MB/s 8.56 MB/s read compiled 28.75 MB/s 29.96 MB/s 24.25 MB/s delete tree 8.39 seconds 8.12 seconds 8.46 seconds delete compiled 8.35 seconds 8.44 seconds 5.11 seconds stat tree 6.03 seconds 5.59 seconds 5.19 seconds stat compiled tree 9.00 seconds 9.52 seconds 8.49 seconds xfs + write_inode log_force removal =================================== intial create 25.87 MB/s 25.76 MB/s 25.87 MB/s create 15.18 MB/s 14.80 MB/s 14.94 MB/s patch 3.13 MB/s 3.14 MB/s 3.11 MB/s compile 36.74 MB/s 37.17 MB/s 36.84 MB/s clean 226.02 MB/s 222.58 MB/s 217.94 MB/s read tree 15.14 MB/s 15.02 MB/s 15.14 MB/s read compiled tree 29.30 MB/s 29.31 MB/s 29.32 MB/s delete tree 6.22 seconds 6.14 seconds 6.15 seconds delete compiled tree 5.75 seconds 5.92 seconds 5.81 seconds stat tree 4.60 seconds 4.51 seconds 4.56 seconds stat compiled tree 4.07 seconds 3.87 seconds 3.96 seconds In addition to that also remove the delwri inode flush that is unessecary now that bulkstat is always coherent. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com>
2010-07-26xfs: simplify xfs_vm_writepageChristoph Hellwig3-102/+49
The writepage implementation in XFS still tries to deal with dirty but unmapped buffers which used to caused by writes through shared mmaps. Since the introduction of ->page_mkwrite these can't happen anymore, so remove the code dealing with them. Note that the all_bh variable which causes us to start I/O on all buffers on the pages was controlled by the count of unmapped buffers, which also included those not actually dirty. It's now unconditionally initialized to 0 but set to 1 for the case of small file size extensions. It probably can be removed entirely, but that's left for another patch. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com>
2010-07-26xfs: simplify xfs_vm_releasepageChristoph Hellwig1-247/+107
Currently the xfs releasepage implementation has code to deal with converting delayed allocated and unwritten space. But we never get called for those as we always convert delayed and unwritten space when cleaning a page, or drop the state from the buffers in block_invalidatepage. We still keep a WARN_ON on those cases for now, but remove all the case dealing with it, which allows to fold xfs_page_state_convert into xfs_vm_writepage and remove the !startio case from the whole writeback path. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com>
2010-07-26xfs: fix corruption case for block size < page sizeEric Sandeen1-0/+10
xfstests 194 first truncats a file back and then extends it again by truncating it to a larger size. This causes discard_buffer to drop the mapped, but not the uptodate bit and thus creates something that xfs_page_state_convert takes for unmapped space created by mmap because it doesn't check for the dirty bit, which also gets cleared by discard_buffer and checked by other ->writepage implementations like block_write_full_page. Handle this kind of buffers early, and unlike Eric's first version of the patch simply ASSERT that the buffers is dirty, given that the mmap write case can't happen anymore since the introduction of ->page_mkwrite. The now dead code dealing with that will be deleted in a follow on patch. Signed-off-by: Eric Sandeen <sandeen@sandeen.net> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com>
2010-07-26xfs: remove unused delta tracking code in xfs_bmapiChristoph Hellwig15-260/+59
This code was introduced four years ago in commit 3e57ecf640428c01ba1ed8c8fc538447ada1715b without any review and has been unused since. Remove it just as the rest of the code introduced in that commit to reduce that stack usage and complexity in this central piece of code. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com>
2010-07-26xfs: remove unused XFS_BMAPI_ flagsChristoph Hellwig3-20/+8
Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com>
2010-07-26xfs: remove the unused XFS_TRANS_NOSLEEP/XFS_TRANS_WAIT flagsChristoph Hellwig1-2/+0
Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com>
2010-07-26xfs: remove the unused XFS_LOG_SLEEP and XFS_LOG_NOSLEEP flagsChristoph Hellwig2-5/+0
Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com>