summaryrefslogtreecommitdiffstats
path: root/fs
AgeCommit message (Collapse)AuthorFilesLines
2012-03-20GFS2: Change truncate page allocation to be GFP_NOFSBob Peterson2-3/+3
This patch changes the page allocation in gfs2_block_truncate_page and two others to GFP_NOFS to avoid deadlock in low-memory conditions. Signed-off-by: Bob Peterson <rpeterso@redhat.com> Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2012-03-09GFS2: call gfs2_write_alloc_required for each chunkBenjamin Marzinski1-5/+8
gfs2_fallocate was calling gfs2_write_alloc_required() once at the start of the function. This caused problems since gfs2_write_alloc_required used a long unsigned int for the len, but gfs2_fallocate could allocate a much larger amount. This patch will move the call into the loop where the chunks are actually allocated and zeroed out. This will keep the allocation size under the limit, and also allow gfs2_fallocate to quickly skip over sections of the file that are already completely allocated. fallcate_chunk was also not correctly setting the file size. It was using the len veriable to find the last block written to, but by the time it was setting the size, the len variable had already been decremented to 0. Signed-off-by: Benjamin Marzinski <bmarzins@redhat.com> Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2012-03-09GFS2: Clean up log flush header writingSteven Whitehouse1-65/+66
We already send both a pre and post flush to the block device when writing a journal header. There is no need to wait for the previous I/O specifically when we do this, unless we've turned "barriers" off. As a side effect, this also cleans up the code path for flushing the journal and makes it more readable. Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2012-03-08GFS2: Remove a __GFP_NOFAIL allocationSteven Whitehouse4-2/+25
In order to ensure that we've got enough buffer heads for flushing the journal, the orignal code used __GFP_NOFAIL when performing this allocation. Here we dispense with that in favour of using a mempool. This should improve efficiency in low memory conditions since flushing the journal is a good way to get memory back, we don't want to be spinning, waiting on memory allocations. The buffers which are allocated via this mempool are fairly short lived, so that we'll recycle them pretty quickly. Although there are other memory allocations which occur during the journal flush process, this is the one which can potentially require the most memory, so the most important one to fix. The amount of memory reserved is a fixed amount, and we should not need to scale it when there are a greater number of filesystems in use. Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2012-03-07GFS2: Flush pending glock work when evicting an inodeSteven Whitehouse1-0/+1
This ensures that we will not try to access the inode thats being flushed via the glock after it has been freed. Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2012-03-05GFS2: make sure rgrps are up to date in func gfs2_blk2rgrpdBob Peterson1-10/+4
This patch adds a call to gfs2_rindex_update from function gfs2_blk2rgrpd and removes calls to it that are made redundant by it. The problem is that a gfs2_grow can add rgrps to the rindex, then put those rgrps into use, thus rendering the rindex we read in at mount time incomplete. Signed-off-by: Bob Peterson <rpeterso@redhat.com> Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2012-03-05GFS2: Eliminate sd_rindex_mutexBob Peterson3-14/+10
Over time, we've slowly eliminated the use of sd_rindex_mutex. Up to this point, it was only used in two places: function gfs2_ri_total (which totals the file system size by reading and parsing the rindex file) and function gfs2_rindex_update which updates the rgrps in memory. Both of these functions have the rindex glock to protect them, so the rindex is unnecessary. Since gfs2_grow writes to the rindex via the meta_fs, the mutex is in the wrong order according to the normal rules. This patch eliminates the mutex entirely to avoid the problem. Signed-off-by: Bob Peterson <rpeterso@redhat.com> Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2012-03-01GFS2: Unlock rindex mutex on glock errorBob Peterson1-1/+2
This patch fixes an error path in function gfs2_rindex_update that leaves the rindex mutex held. Signed-off-by: Bob Peterson <rpeterso@redhat.com> Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2012-02-28GFS2: Make bd_cmp() staticSteven Whitehouse1-1/+1
Add missing static to bd_cmp() Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2012-02-28GFS2: Sort the ordered write listBob Peterson1-0/+16
This patch sorts the ordered write list for GFS2 writes. This increases the throughput for simultaneous writes. For example, if you have ten processes, all doing: dd if=/dev/zero of=/mnt/gfs2/fileX on different files, the throughput will be much better. Signed-off-by: Bob Peterson <rpeterso@redhat.com> Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2012-02-28GFS2: FITRIM ioctl supportSteven Whitehouse7-36/+152
The FITRIM ioctl provides an alternative way to send discard requests to the underlying device. Using the discard mount option results in every freed block generating a discard request to the block device. This can be slow, since many block devices can only process discard requests of larger sizes, and also such operations can be time consuming. Rather than using the discard mount option, FITRIM allows a sweep of the filesystem on an occasional basis, and also to optionally avoid sending down discard requests for smaller regions. In GFS2 FITRIM will work at resource group granularity. There is a flag for each resource group which keeps track of which resource groups have been trimmed. This flag is reset whenever a deallocation occurs in the resource group, and set whenever a successful FITRIM of that resource group has taken place. This helps to reduce repeated discard requests for the same block ranges, again improving performance. Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2012-02-28GFS2: Move two functions from log.c to lops.cSteven Whitehouse3-101/+97
gfs2_log_get_buf() and gfs2_log_fake_buf() are both used only in lops.c, so move them next to their callers and they can then become static. Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2012-02-28GFS2: glock statistics gatheringSteven Whitehouse5-19/+431
The stats are divided into two sets: those relating to the super block and those relating to an individual glock. The super block stats are done on a per cpu basis in order to try and reduce the overhead of gathering them. They are also further divided by glock type. In the case of both the super block and glock statistics, the same information is gathered in each case. The super block statistics are used to provide default values for most of the glock statistics, so that newly created glocks should have, as far as possible, a sensible starting point. The statistics are divided into three pairs of mean and variance, plus two counters. The mean/variance pairs are smoothed exponential estimates and the algorithm used is one which will be very familiar to those used to calculation of round trip times in network code. The three pairs of mean/variance measure the following things: 1. DLM lock time (non-blocking requests) 2. DLM lock time (blocking requests) 3. Inter-request time (again to the DLM) A non-blocking request is one which will complete right away, whatever the state of the DLM lock in question. That currently means any requests when (a) the current state of the lock is exclusive (b) the requested state is either null or unlocked or (c) the "try lock" flag is set. A blocking request covers all the other lock requests. There are two counters. The first is there primarily to show how many lock requests have been made, and thus how much data has gone into the mean/variance calculations. The other counter is counting queueing of holders at the top layer of the glock code. Hopefully that number will be a lot larger than the number of dlm lock requests issued. So why gather these statistics? There are several reasons we'd like to get a better idea of these timings: 1. To be able to better set the glock "min hold time" 2. To spot performance issues more easily 3. To improve the algorithm for selecting resource groups for allocation (to base it on lock wait time, rather than blindly using a "try lock") Due to the smoothing action of the updates, a step change in some input quantity being sampled will only fully be taken into account after 8 samples (or 4 for the variance) and this needs to be carefully considered when interpreting the results. Knowing both the time it takes a lock request to complete and the average time between lock requests for a glock means we can compute the total percentage of the time for which the node is able to use a glock vs. time that the rest of the cluster has its share. That will be very useful when setting the lock min hold time. The other point to remember is that all times are in nanoseconds. Great care has been taken to ensure that we measure exactly the quantities that we want, as accurately as possible. There are always inaccuracies in any measuring system, but I hope this is as accurate as we can reasonably make it. Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2012-02-28GFS2: Read resource groups on mountSteven Whitehouse4-20/+17
This makes mount take slightly longer, but at the same time, the first write to the filesystem will be faster too. It also means that if there is a problem in the resource index, then we can refuse to mount rather than having to try and report that when the first write occurs. In addition, to avoid recursive locking, we hvae to take account of instances when the rindex glock may already be held when we are trying to update the rbtree of resource groups. Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2012-02-28GFS2: Ensure rindex is uptodate for fallocateBob Peterson1-0/+5
This patch fixes a problem whereby gfs2_grow was failing and causing GFS2 to assert. The problem was that when GFS2's fallocate operation tried to acquire an "allocation" it made sure the rindex was up to date, and if not, it called gfs2_rindex_update. However, if the file being fallocated was the rindex itself, it was already locked at that point. By calling gfs2_rindex_update at an earlier point in time, we bring rindex up to date and thereby avoid trying to lock it when the "allocation" is acquired. Signed-off-by: Bob Peterson <rpeterso@redhat.com> Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2012-02-28GFS2: Read in rindex if necessary during unlinkBob Peterson1-2/+7
This patch fixes a problem whereby you were unable to delete files until other file system operations were done (such as statfs, touch, writes, etc.) that caused the rindex to be read in. Signed-off-by: Bob Peterson <rpeterso@redhat.com> Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2012-02-28GFS2: Fix race between lru_list and glock ref countSteven Whitehouse1-4/+10
This patch fixes a narrow race window between the glock ref count hitting zero and glocks being removed from the lru_list. Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2012-02-27Merge branch 'master' of /Volumes/CaseSensitiveDisk/linuxAnton Altaparmakov24-137/+319
2012-02-25autofs: work around unhappy compat problem on x86-64Ian Kent4-3/+23
When the autofs protocol version 5 packet type was added in commit 5c0a32fc2cd0 ("autofs4: add new packet type for v5 communications"), it obvously tried quite hard to be word-size agnostic, and uses explicitly sized fields that are all correctly aligned. However, with the final "char name[NAME_MAX+1]" array at the end, the actual size of the structure ends up being not very well defined: because the struct isn't marked 'packed', doing a "sizeof()" on it will align the size of the struct up to the biggest alignment of the members it has. And despite all the members being the same, the alignment of them is different: a "__u64" has 4-byte alignment on x86-32, but native 8-byte alignment on x86-64. And while 'NAME_MAX+1' ends up being a nice round number (256), the name[] array starts out a 4-byte aligned. End result: the "packed" size of the structure is 300 bytes: 4-byte, but not 8-byte aligned. As a result, despite all the fields being in the same place on all architectures, sizeof() will round up that size to 304 bytes on architectures that have 8-byte alignment for u64. Note that this is *not* a problem for 32-bit compat mode on POWER, since there __u64 is 8-byte aligned even in 32-bit mode. But on x86, 32-bit and 64-bit alignment is different for 64-bit entities, and as a result the structure that has exactly the same layout has different sizes. So on x86-64, but no other architecture, we will just subtract 4 from the size of the structure when running in a compat task. That way we will write the properly sized packet that user mode expects. Not pretty. Sadly, this very subtle, and unnecessary, size difference has been encoded in user space that wants to read packets of *exactly* the right size, and will refuse to touch anything else. Reported-and-tested-by: Thomas Meyer <thomas@m3y3r.de> Signed-off-by: Ian Kent <raven@themaw.net> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-02-24epoll: ep_unregister_pollwait() can use the freed pwq->wheadOleg Nesterov2-4/+32
signalfd_cleanup() ensures that ->signalfd_wqh is not used, but this is not enough. eppoll_entry->whead still points to the memory we are going to free, ep_unregister_pollwait()->remove_wait_queue() is obviously unsafe. Change ep_poll_callback(POLLFREE) to set eppoll_entry->whead = NULL, change ep_unregister_pollwait() to check pwq->whead != NULL under rcu_read_lock() before remove_wait_queue(). We add the new helper, ep_remove_wait_queue(), for this. This works because sighand_cachep is SLAB_DESTROY_BY_RCU and because ->signalfd_wqh is initialized in sighand_ctor(), not in copy_sighand. ep_unregister_pollwait()->remove_wait_queue() can play with already freed and potentially reused ->sighand, but this is fine. This memory must have the valid ->signalfd_wqh until rcu_read_unlock(). Reported-by: Maxime Bizon <mbizon@freebox.fr> Cc: <stable@kernel.org> Signed-off-by: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-02-24epoll: introduce POLLFREE to flush ->signalfd_wqh before kfree()Oleg Nesterov2-0/+15
This patch is intentionally incomplete to simplify the review. It ignores ep_unregister_pollwait() which plays with the same wqh. See the next change. epoll assumes that the EPOLL_CTL_ADD'ed file controls everything f_op->poll() needs. In particular it assumes that the wait queue can't go away until eventpoll_release(). This is not true in case of signalfd, the task which does EPOLL_CTL_ADD uses its ->sighand which is not connected to the file. This patch adds the special event, POLLFREE, currently only for epoll. It expects that init_poll_funcptr()'ed hook should do the necessary cleanup. Perhaps it should be defined as EPOLLFREE in eventpoll. __cleanup_sighand() is changed to do wake_up_poll(POLLFREE) if ->signalfd_wqh is not empty, we add the new signalfd_cleanup() helper. ep_poll_callback(POLLFREE) simply does list_del_init(task_list). This make this poll entry inconsistent, but we don't care. If you share epoll fd which contains our sigfd with another process you should blame yourself. signalfd is "really special". I simply do not know how we can define the "right" semantics if it used with epoll. The main problem is, epoll calls signalfd_poll() once to establish the connection with the wait queue, after that signalfd_poll(NULL) returns the different/inconsistent results depending on who does EPOLL_CTL_MOD/signalfd_read/etc. IOW: apart from sigmask, signalfd has nothing to do with the file, it works with the current thread. In short: this patch is the hack which tries to fix the symptoms. It also assumes that nobody can take tasklist_lock under epoll locks, this seems to be true. Note: - we do not have wake_up_all_poll() but wake_up_poll() is fine, poll/epoll doesn't use WQ_FLAG_EXCLUSIVE. - signalfd_cleanup() uses POLLHUP along with POLLFREE, we need a couple of simple changes in eventpoll.c to make sure it can't be "lost". Reported-by: Maxime Bizon <mbizon@freebox.fr> Cc: <stable@kernel.org> Signed-off-by: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-02-24Merge branch 'for-linus' of ↵Linus Torvalds17-131/+250
git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs Quoth Chris: "This is later than I wanted because I got backed up running through btrfs bugs from the Oracle QA teams. But they are all bug fixes that we've queued and tested since rc1. Nothing in particular stands out, this just reflects bug fixing and QA done in parallel by all the btrfs developers. The most user visible of these is: Btrfs: clear the extent uptodate bits during parent transid failures Because that helps deal with out of date drives (say an iscsi disk that has gone away and come back). The old code wasn't always properly retrying the other mirror for this type of failure." * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: (24 commits) Btrfs: fix compiler warnings on 32 bit systems Btrfs: increase the global block reserve estimates Btrfs: clear the extent uptodate bits during parent transid failures Btrfs: add extra sanity checks on the path names in btrfs_mksubvol Btrfs: make sure we update latest_bdev Btrfs: improve error handling for btrfs_insert_dir_item callers Btrfs: be less strict on finding next node in clear_extent_bit Btrfs: fix a bug on overcommit stuff Btrfs: kick out redundant stuff in convert_extent_bit Btrfs: skip states when they does not contain bits to clear Btrfs: check return value of lookup_extent_mapping() correctly Btrfs: fix deadlock on page lock when doing auto-defragment Btrfs: fix return value check of extent_io_ops btrfs: honor umask when creating subvol root btrfs: silence warning in raid array setup btrfs: fix structs where bitfields and spinlock/atomic share 8B word btrfs: delalloc for page dirtied out-of-band in fixup worker Btrfs: fix memory leak in load_free_space_cache() btrfs: don't check DUP chunks twice Btrfs: fix trim 0 bytes after a device delete ...
2012-02-24Btrfs: fix compiler warnings on 32 bit systemsChris Mason4-20/+26
The enospc tracing code added some interesting uses of u64 pointer casts. Signed-off-by: Chris Mason <chris.mason@oracle.com>
2012-02-24Merge branch 'master' of /Volumes/CaseSensitiveDisk/linuxAnton Altaparmakov8-99/+86
2012-02-24NTFS: Correct two spelling errors "dealocate" to "deallocate" in mft.c.Anton Altaparmakov1-3/+3
From: Masanari Iida <standby24x7@gmail.com> Signed-off-by: Anton Altaparmakov <anton@tuxera.com>
2012-02-23Restore direct_io / truncate locking APIAnton Altaparmakov1-2/+2
With kernel 3.1, Christoph removed i_alloc_sem and replaced it with calls (namely inode_dio_wait() and inode_dio_done()) which are EXPORT_SYMBOL_GPL() thus they cannot be used by non-GPL file systems and further inode_dio_wait() was pushed from notify_change() into the file system ->setattr() method but no non-GPL file system can make this call. That means non-GPL file systems cannot exist any more unless they do not use any VFS functionality related to reading/writing as far as I can tell or at least as long as they want to implement direct i/o. Both Linus and Al (and others) have said on LKML that this breakage of the VFS API should not have happened and that the change was simply missed as it was not documented in the change logs of the patches that did those changes. This patch changes the two function exports in question to be EXPORT_SYMBOL() thus restoring the VFS API as it used to be - accessible for all modules. Christoph, who introduced the two functions and exported them GPL-only is CC-ed on this patch to give him the opportunity to object to the symbols being changed in this manner if he did indeed intend them to be GPL-only and does not want them to become available to all modules. Signed-off-by: Anton Altaparmakov <anton@tuxera.com> CC: Christoph Hellwig <hch@infradead.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-02-23Merge branch 'for-linus' of git://oss.sgi.com/xfs/xfsLinus Torvalds5-23/+25
A fix from Jesper Juhl removes an assignment in an ASSERT when a compare is intended. Two fixes from Mitsuo Hayasaka address off-by-ones in XFS quota enforcement. * 'for-linus' of git://oss.sgi.com/xfs/xfs: xfs: make inode quota check more general xfs: change available ranges of softlimit and hardlimit in quota check XFS: xfs_trans_add_item() - don't assign in ASSERT() when compare is intended
2012-02-23Btrfs: increase the global block reserve estimatesLiu Bo1-1/+1
When doing IO with large amounts of data fragmentation, the global block reserve calulations are too low. This increases them to avoid ENOSPC crashes. Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>
2012-02-23Btrfs: clear the extent uptodate bits during parent transid failuresChris Mason1-4/+3
If btrfs reads a block and finds a parent transid mismatch, it clears the uptodate flags on the extent buffer, and the pages inside it. But we only clear the uptodate bits in the state tree if the block straddles more than one page. This is from an old optimization from to reduce contention on the extent state tree. But it is buggy because the code that retries a read from a different copy of the block is going to find the uptodate state bits set and skip the IO. The end result of the bug is that we'll never actually read the good copy (if there is one). The fix here is to always clear the uptodate state bits, which is safe because this code is only called when the parent transid fails. Signed-off-by: Chris Mason <chris.mason@oracle.com>
2012-02-23Btrfs: add extra sanity checks on the path names in btrfs_mksubvolChris Mason1-0/+6
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2012-02-23Btrfs: make sure we update latest_bdevChris Mason2-1/+22
When we are setting up the mount, we close all the devices that were not actually part of the metadata we found. But, we don't make sure that one of those devices wasn't fs_devices->latest_bdev, which means we can do a use after free on the one we closed. This updates latest_bdev as it goes. Signed-off-by: Chris Mason <chris.mason@oracle.com>
2012-02-23Btrfs: improve error handling for btrfs_insert_dir_item callersChris Mason2-7/+26
This allows us to gracefully continue if we aren't able to insert directory items, both for normal files/dirs and snapshots. Signed-off-by: Chris Mason <chris.mason@oracle.com>
2012-02-22Merge tag 'nfs-for-3.3-4' of git://git.linux-nfs.org/projects/trondmy/linux-nfsLinus Torvalds3-76/+61
Bugfixes for the NFS client. Fix a nasty Oops in the NFSv4 getacl code, another source of infinite loops in the NFSv4 state recovery code, and a regression in NFSv4.1 session initialisation. Also deal with an NFSv4.1 memory leak. * tag 'nfs-for-3.3-4' of git://git.linux-nfs.org/projects/trondmy/linux-nfs: NFSv4: fix server_scope memory leak NFSv4.1: Fix a NFSv4.1 session initialisation regression NFSv4: Ensure we throw out bad delegation stateids on NFS4ERR_BAD_STATEID NFSv4: Fix an Oops in the NFSv4 getacl code
2012-02-22NTFS: Do not dereference pointer before checking for NULL.Anton Altaparmakov1-3/+3
Found by Coverity software (http://scan.coverity.com). Signed-off-by: Anton Altaparmakov <anton@tuxera.com>
2012-02-22NTFS: Remove unused variable.Anton Altaparmakov1-3/+1
Found by Coverity software (http://scan.coverity.com). Signed-off-by: Anton Altaparmakov <anton@tuxera.com>
2012-02-21sys_poll: fix incorrect type for 'timeout' parameterLinus Torvalds1-1/+1
The 'poll()' system call timeout parameter is supposed to be 'int', not 'long'. Now, the reason this matters is that right now 32-bit compat mode is broken on at least x86-64, because the 32-bit code just calls 'sys_poll()' directly on x86-64, and the 32-bit argument will have been zero-extended, turning a signed 'int' into a large unsigned 'long' value. We could just introduce a 'compat_sys_poll()' function for this, and that may eventually be what we have to do, but since the actual standard poll() semantics is *supposed* to be 'int', and since at least on x86-64 glibc sign-extends the argument before invocing the system call (so nobody can actually use a 64-bit timeout value in user space _anyway_, even in 64-bit binaries), the simpler solution would seem to be to just fix the definition of the system call to match what it should have been from the very start. If it turns out that somebody somehow circumvents the user-level libc 64-bit sign extension and actually uses a large unsigned 64-bit timeout despite that not being how poll() is supposed to work, we will need to do the compat_sys_poll() approach. Reported-by: Thomas Meyer <thomas@m3y3r.de> Acked-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-02-21xfs: make inode quota check more generalMitsuo Hayasaka1-2/+4
The xfs checks quota when reserving disk blocks and inodes. In the block reservation, it checks if the total number of blocks including current usage and new reservation exceed quota. In the inode reservation, it checks using the total number of inodes including only current usage without new reservation. However, this inode quota check works well since the caller of xfs_trans_dquot() always sets the argument of the number of new inode reservation to 1 or 0 and inode is reserved one by one in current xfs. To make it more general, this patch changes it to the same way as the block quota check. Signed-off-by: Mitsuo Hayasaka <mitsuo.hayasaka.hu@hitachi.com> Cc: Ben Myers <bpm@sgi.com> Cc: Alex Elder <elder@kernel.org> Cc: Christoph Hellwig <hch@lst.de> Reviewed-by: Mark Tinguely <tinguely@sgi.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Ben Myers <bpm@sgi.com>
2012-02-21xfs: change available ranges of softlimit and hardlimit in quota checkMitsuo Hayasaka4-19/+19
In general, quota allows us to use disk blocks and inodes up to each limit, that is, they are available if they don't exceed their limitations. Current xfs sets their available ranges to lower than them except disk inode quota check. So, this patch changes the ranges to not beyond them. Signed-off-by: Mitsuo Hayasaka <mitsuo.hayasaka.hu@hitachi.com> Cc: Ben Myers <bpm@sgi.com> Cc: Alex Elder <elder@kernel.org> Cc: Christoph Hellwig <hch@lst.de> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Mark Tinguely <tinguely@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>
2012-02-21Btrfs: be less strict on finding next node in clear_extent_bitLiu Bo1-2/+1
In clear_extent_bit, it is enough that next node is adjacent in tree level. Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com>
2012-02-20Merge branch 'for-linus' of ↵Linus Torvalds8-44/+82
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Assorted fixes, sat in -next for a week or so... * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: ocfs2: deal with wraparounds of i_nlink in ocfs2_rename() vfs: fix compat_sys_stat() handling of overflows in st_nlink quota: Fix deadlock with suspend and quotas vfs: Provide function to get superblock and wait for it to thaw vfs: fix panic in __d_lookup() with high dentry hashtable counts autofs4 - fix lockdep splat in autofs vfs: fix d_inode_lookup() dentry ref leak
2012-02-17NFSv4: fix server_scope memory leakWeston Andros Adamson1-6/+9
server_scope would never be freed if nfs4_check_cl_exchange_flags() returned non-zero Signed-off-by: Weston Andros Adamson <dros@netapp.com> Cc: stable@vger.kernel.org Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2012-02-17NFSv4.1: Fix a NFSv4.1 session initialisation regressionTrond Myklebust1-65/+42
Commit aacd553 (NFSv4.1: cleanup init and reset of session slot tables) introduces a regression in the session initialisation code. New tables now find their sequence ids initialised to 0, rather than the mandated value of 1 (see RFC5661). Fix the problem by merging nfs4_reset_slot_table() and nfs4_init_slot_table(). Since the tbl->max_slots is initialised to 0, the test in nfs4_reset_slot_table for max_reqs != tbl->max_slots will automatically pass for an empty table. Reported-by: Vitaliy Gusev <gusev.vitaliy@nexenta.com> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2012-02-16ecryptfs: remove the second argument of k[un]map_atomic()Cong Wang2-4/+4
Signed-off-by: Cong Wang <amwang@redhat.com> Signed-off-by: Tyler Hicks <tyhicks@canonical.com>
2012-02-16eCryptfs: Copy up lower inode attrs after setting lower xattrTyler Hicks1-0/+2
After passing through a ->setxattr() call, eCryptfs needs to copy the inode attributes from the lower inode to the eCryptfs inode, as they may have changed in the lower filesystem's ->setxattr() path. One example is if an extended attribute containing a POSIX Access Control List is being set. The new ACL may cause the lower filesystem to modify the mode of the lower inode and the eCryptfs inode would need to be updated to reflect the new mode. https://launchpad.net/bugs/926292 Signed-off-by: Tyler Hicks <tyhicks@canonical.com> Reported-by: Sebastien Bacher <seb128@ubuntu.com> Cc: John Johansen <john.johansen@canonical.com> Cc: <stable@vger.kernel.org>
2012-02-16eCryptfs: Improve statfs reportingTyler Hicks4-14/+83
statfs() calls on eCryptfs files returned the wrong filesystem type and, when using filename encryption, the wrong maximum filename length. If mount-wide filename encryption is enabled, the cipher block size and the lower filesystem's max filename length will determine the max eCryptfs filename length. Pre-tested, known good lengths are used when the lower filesystem's namelen is 255 and a cipher with 8 or 16 byte block sizes is used. In other, less common cases, we fall back to a safe rounded-down estimate when determining the eCryptfs namelen. https://launchpad.net/bugs/885744 Signed-off-by: Tyler Hicks <tyhicks@canonical.com> Reported-by: Kees Cook <keescook@chromium.org> Reviewed-by: Kees Cook <keescook@chromium.org> Reviewed-by: John Johansen <john.johansen@canonical.com>
2012-02-16Btrfs: fix a bug on overcommit stuffLiu Bo1-1/+4
When overcommitting, we should check the sum of pinned space and bytes for delayed item. Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com>
2012-02-16Btrfs: kick out redundant stuff in convert_extent_bitLiu Bo1-5/+0
clear_state_bit will do merge_state for us, so kick out the redundant one. Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com>
2012-02-16Btrfs: skip states when they does not contain bits to clearLiu Bo1-5/+10
Clearing a range's bits is different with setting them, since we don't need to touch them when states do not contain bits we want. Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com>
2012-02-16Btrfs: check return value of lookup_extent_mapping() correctlyTsutomu Itoh3-2/+4
This patch corrects error checking of lookup_extent_mapping(). Signed-off-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com>
2012-02-16Btrfs: fix deadlock on page lock when doing auto-defragmentMiao Xie1-24/+29
When I ran xfstests circularly on a auto-defragment btrfs, the deadlock happened. Steps to reproduce: [tty0] # export MOUNT_OPTIONS="-o autodefrag" # export TEST_DEV=<partition1> # export TEST_DIR=<mountpoint1> # export SCRATCH_DEV=<partition2> # export SCRATCH_MNT=<mountpoint2> # while [ 1 ] > do > ./check 091 127 263 > sleep 1 > done [tty1] # while [ 1 ] > do > echo 3 > /proc/sys/vm/drop_caches > done Several hours later, the test processes will hang on, and the deadlock will happen on page lock. The reason is that: Auto defrag task Flush thread Test task btrfs_writepages() add ordered extent (including page 1, 2) set page 1 writeback set page 2 writeback endio_fn() end page 2 writeback release page 2 lock page 1 alloc and lock page 2 page 2 is not uptodate btrfs_readpage() start ordered extent() btrfs_writepages() try to lock page 1 so deadlock happens. Fix this bug by unlocking the page which is in writeback, and re-locking it after the writeback end. Signed-off-by: Miao Xie <miax@cn.fujitsu.com>