summaryrefslogtreecommitdiffstats
path: root/fs
AgeCommit message (Collapse)AuthorFilesLines
2013-10-30xfs: vectorise encoding/decoding directory headersDave Chinner11-281/+338
Conversion from on-disk structures to in-core header structures currently relies on magic number checks. If the magic number is wrong, but one of the supported values, we do the wrong thing with the encode/decode operation. Split these functions so that there are discrete operations for the specific directory format we are handling. In doing this, move all the header encode/decode functions to xfs_da_format.c as they are directly manipulating the on-disk format. It should be noted that all the growth in binary size is from xfs_da_format.c - the rest of the code actaully shrinks. text data bss dec hex filename 794490 96802 1096 892388 d9de4 fs/xfs/xfs.o.orig 792986 96802 1096 890884 d9804 fs/xfs/xfs.o.p1 792350 96802 1096 890248 d9588 fs/xfs/xfs.o.p2 789293 96802 1096 887191 d8997 fs/xfs/xfs.o.p3 789005 96802 1096 886903 d8997 fs/xfs/xfs.o.p4 789061 96802 1096 886959 d88af fs/xfs/xfs.o.p5 789733 96802 1096 887631 d8b4f fs/xfs/xfs.o.p6 791421 96802 1096 889319 d91e7 fs/xfs/xfs.o.p7 Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Ben Myers <bpm@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>
2013-10-30xfs: vectorise DA btree operationsDave Chinner10-81/+139
The remaining non-vectorised code for the directory structure is the node format blocks. This is shared with the attribute tree, and so is slightly more complex to vectorise. Introduce a "non-directory" directory ops structure that is attached to all non-directory inodes so that attribute operations can be vectorised for all inodes. Once we do this, we can vectorise all the da btree operations. Because this patch adds more infrastructure than it removes the binary size does not decrease: text data bss dec hex filename 794490 96802 1096 892388 d9de4 fs/xfs/xfs.o.orig 792986 96802 1096 890884 d9804 fs/xfs/xfs.o.p1 792350 96802 1096 890248 d9588 fs/xfs/xfs.o.p2 789293 96802 1096 887191 d8997 fs/xfs/xfs.o.p3 789005 96802 1096 886903 d8997 fs/xfs/xfs.o.p4 789061 96802 1096 886959 d88af fs/xfs/xfs.o.p5 789733 96802 1096 887631 d8b4f fs/xfs/xfs.o.p6 Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Ben Myers <bpm@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>
2013-10-30xfs: vectorise directory leaf operationsDave Chinner10-159/+218
Next step in the vectorisation process is the leaf block encode/decode operations. Most of the operations on leaves are handled by the data block vectors, so there are relatively few of them here. Because of all the shuffling of code and having to pass more state to some functions, this patch doesn't directly reduce the size of the binary. It does open up many more opportunities for factoring and optimisation, however. text data bss dec hex filename 794490 96802 1096 892388 d9de4 fs/xfs/xfs.o.orig 792986 96802 1096 890884 d9804 fs/xfs/xfs.o.p1 792350 96802 1096 890248 d9588 fs/xfs/xfs.o.p2 789293 96802 1096 887191 d8997 fs/xfs/xfs.o.p3 789005 96802 1096 886903 d8997 fs/xfs/xfs.o.p4 789061 96802 1096 886959 d88af fs/xfs/xfs.o.p5 Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Ben Myers <bpm@sgi.com>
2013-10-30xfs: vectorise directory data operations part 2Dave Chinner10-137/+186
Convert the rest of the directory data block encode/decode operations to vector format. This further reduces the size of the built binary: text data bss dec hex filename 794490 96802 1096 892388 d9de4 fs/xfs/xfs.o.orig 792986 96802 1096 890884 d9804 fs/xfs/xfs.o.p1 792350 96802 1096 890248 d9588 fs/xfs/xfs.o.p2 789293 96802 1096 887191 d8997 fs/xfs/xfs.o.p3 789005 96802 1096 886903 d8997 fs/xfs/xfs.o.p4 Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Ben Myers <bpm@sgi.com>
2013-10-30xfs: vectorise directory data operationsDave Chinner9-205/+329
Following from the initial patches to vectorise the shortform directory encode/decode operations, convert half the data block operations to use the vector. The rest will be done in a second patch. This further reduces the size of the built binary: text data bss dec hex filename 794490 96802 1096 892388 d9de4 fs/xfs/xfs.o.orig 792986 96802 1096 890884 d9804 fs/xfs/xfs.o.p1 792350 96802 1096 890248 d9588 fs/xfs/xfs.o.p2 789293 96802 1096 887191 d8997 fs/xfs/xfs.o.p3 Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Ben Myers <bpm@sgi.com>
2013-10-30xfs: vectorise remaining shortform dir2 opsDave Chinner6-183/+199
Following from the initial patch to introduce the directory operations vector, convert the rest of the shortform directory operations to use vectored ops rather than superblock feature checks. This further reduces the size of the built binary: text data bss dec hex filename 794490 96802 1096 892388 d9de4 fs/xfs/xfs.o.orig 792986 96802 1096 890884 d9804 fs/xfs/xfs.o.p1 792350 96802 1096 890248 d9588 fs/xfs/xfs.o.p2 Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Ben Myers <bpm@sgi.com>
2013-10-30xfs: abstract the differences in dir2/dir3 via an ops vectorDave Chinner12-45/+132
Lots of the dir code now goes through switches to determine what is the correct on-disk format to parse. It generally involves a "xfs_sbversion_hasfoo" check, deferencing the superblock version and feature fields and hence touching several cache lines per operation in the process. Some operations do multiple checks because they nest conditional operations and they don't pass the information in a direct fashion between each other. Hence, add an ops vector to the xfs_inode structure that is configured when the inode is initialised to point to all the correct decode and encoding operations. This will significantly reduce the branchiness and cacheline footprint of the directory object decoding and encoding. This is the first patch in a series of conversion patches. It will introduce the ops structure, the setup of it and add the first operation to the vector. Subsequent patches will convert directory ops one at a time to keep the changes simple and obvious. Just this patch shows the benefit of such an approach on code size. Just converting the two shortform dir operations as this patch does decreases the built binary size by ~1500 bytes: $ size fs/xfs/xfs.o.orig fs/xfs/xfs.o.p1 text data bss dec hex filename 794490 96802 1096 892388 d9de4 fs/xfs/xfs.o.orig 792986 96802 1096 890884 d9804 fs/xfs/xfs.o.p1 $ That's a significant decrease in the instruction cache footprint of the directory code for such a simple change, and indicates that this approach is definitely worth pursuing further. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Ben Myers <bpm@sgi.com>
2013-10-23xfs: split xfs_rtalloc.c for userspace sanityDave Chinner4-1248/+1292
xfs_rtalloc.c is partially shared with userspace. Split the file up into two parts - one that is kernel private and the other which is wholly shared with userspace. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Ben Myers <bpm@sgi.com>
2013-10-23xfs: decouple inode and bmap btree header filesDave Chinner73-583/+403
Currently the xfs_inode.h header has a dependency on the definition of the BMAP btree records as the inode fork includes an array of xfs_bmbt_rec_host_t objects in it's definition. Move all the btree format definitions from xfs_btree.h, xfs_bmap_btree.h, xfs_alloc_btree.h and xfs_ialloc_btree.h to xfs_format.h to continue the process of centralising the on-disk format definitions. With this done, the xfs inode definitions are no longer dependent on btree header files. The enables a massive culling of unnecessary includes, with close to 200 #include directives removed from the XFS kernel code base. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Ben Myers <bpm@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>
2013-10-23xfs: decouple log and transaction headersDave Chinner75-239/+276
xfs_trans.h has a dependency on xfs_log.h for a couple of structures. Most code that does transactions doesn't need to know anything about the log, but this dependency means that they have to include xfs_log.h. Decouple the xfs_trans.h and xfs_log.h header files and clean up the includes to be in dependency order. In doing this, remove the direct include of xfs_trans_reserve.h from xfs_trans.h so that we remove the dependency between xfs_trans.h and xfs_mount.h. Hence the xfs_trans.h include can be moved to the indicate the actual dependencies other header files have on it. Note that these are kernel only header files, so this does not translate to any userspace changes at all. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Ben Myers <bpm@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>
2013-10-23xfs: remove unused transaction callback variablesDave Chinner1-7/+0
We don't do callbacks at transaction commit time, no do we have any infrastructure to set up or run such callbacks, so remove the variables and typedefs for these operations. If we ever need to add callbacks, we can reintroduce the variables at that time. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Ben Myers <bpm@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>
2013-10-23xfs: split dquot buffer operations outDave Chinner8-266/+303
Parts of userspace want to be able to read and modify dquot buffers (e.g. xfs_db) so we need to split out the reading and writing of these buffers so it is easy to shared code with libxfs in userspace. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Ben Myers <bpm@sgi.com>
2013-10-23xfs: unify directory/attribute format definitionsDave Chinner33-426/+437
The on-disk format definitions for the directory and attribute structures are spread across 3 header files right now, only one of which is dedicated to defining on-disk structures and their manipulation (xfs_dir2_format.h). Pull all the format definitions into a single header file - xfs_da_format.h - and switch all the code over to point at that. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Ben Myers <bpm@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>
2013-10-23xfs: create a shared header file for format-related informationDave Chinner54-232/+289
All of the buffer operations structures are needed to be exported for xfs_db, so move them all to a common location rather than spreading them all over the place. They are verifying the on-disk format, so while xfs_format.h might be a good place, it is not part of the on disk format. Hence we need to create a new header file that we centralise these related definitions. Start by moving the bffer operations structures, and then also move all the other definitions that have crept into xfs_log_format.h and xfs_format.h as there was no other shared header file to put them in. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Ben Myers <bpm@sgi.com>
2013-10-21xfs: fold xfs_change_file_space into xfs_ioc_spaceChristoph Hellwig4-191/+126
Now that only one caller of xfs_change_file_space is left it can be merged into said caller. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Ben Myers <bpm@sgi.com>
2013-10-21xfs: simplify the fallocate pathChristoph Hellwig3-63/+56
Call xfs_alloc_file_space or xfs_free_file_space directly from xfs_file_fallocate instead of going through xfs_change_file_space. This simplified the code by removing the unessecary marshalling of the arguments into an xfs_flock64_t structure and allows removing checks that are already done in the VFS code. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Ben Myers <bpm@sgi.com>
2013-10-21xfs: always hold the iolock when calling xfs_change_file_spaceChristoph Hellwig4-49/+27
Currently fallocate always holds the iolock when calling into xfs_change_file_space, while the ioctl path lets some of the lower level functions take it, but leave it out in others. This patch makes sure the ioctl path also always holds the iolock and thus introduces consistent locking for the preallocation operations while simplifying the code and allowing to kill the now unused XFS_ATTR_NOLOCK flag. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Ben Myers <bpm@sgi.com>
2013-10-21xfs: remove the unused XFS_ATTR_NONBLOCK flagChristoph Hellwig2-4/+0
Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Ben Myers <bpm@sgi.com>
2013-10-21xfs: always take the iolock around xfs_setattr_sizeChristoph Hellwig4-22/+24
There is no reason to conditionally take the iolock inside xfs_setattr_size when we can let the caller handle it unconditionally, which just incrases the lock hold time for the case where it was previously taken internally by a few instructions. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Ben Myers <bpm@sgi.com>
2013-10-17xfs: don't break from growfs ag update loop on errorEric Sandeen1-9/+13
When xfs_growfs_data_private() is updating backup superblocks, it bails out on the first error encountered, whether reading or writing: * If we get an error writing out the alternate superblocks, * just issue a warning and continue. The real work is * already done and committed. This can cause a problem later during repair, because repair looks at all superblocks, and picks the most prevalent one as correct. If we bail out early in the backup superblock loop, we can end up with more "bad" matching superblocks than good, and a post-growfs repair may revert the filesystem to the old geometry. With the combination of superblock verifiers and old bugs, we're more likely to encounter read errors due to verification. And perhaps even worse, we don't even properly write any of the newly-added superblocks in the new AGs. Even with this change, growfs will still say: xfs_growfs: XFS_IOC_FSGROWFSDATA xfsctl failed: Structure needs cleaning data blocks changed from 319815680 to 335216640 which might be confusing to the user, but it at least communicates that something has gone wrong, and dmesg will probably highlight the need for an xfs_repair. And this is still best-effort; if verifiers fail on more than half the backup supers, they may still "win" - but that's probably best left to repair to more gracefully handle by doing its own strict verification as part of the backup super "voting." Signed-off-by: Eric Sandeen <sandeen@redhat.com> Acked-by: Dave Chinner <david@fromorbit.com> Reviewed-by: Mark Tinguely <tinguely@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>
2013-10-17xfs: don't emit corruption noise on fs probesEric Sandeen1-2/+3
If we get EWRONGFS due to probing of non-xfs filesystems, there's no need to issue the scary corruption error and backtrace. Signed-off-by: Eric Sandeen <sandeen@redhat.com> Reviewed-by: Mark Tinguely <tinguely@sgi.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Ben Myers <bpm@sgi.com>
2013-10-17xfs: remove newlines from strings passed to __xfs_printkEric Sandeen10-20/+20
__xfs_printk adds its own "\n". Having it in the original string leads to unintentional blank lines from these messages. Most format strings have no newline, but a few do, leading to i.e.: [ 7347.119911] XFS (sdb2): Access to block zero in inode 132 start_block: 0 start_off: 0 blkcnt: 0 extent-state: 0 lastx: 1a05 [ 7347.119911] [ 7347.119919] XFS (sdb2): Access to block zero in inode 132 start_block: 0 start_off: 0 blkcnt: 0 extent-state: 0 lastx: 1a05 [ 7347.119919] Fix them all. Signed-off-by: Eric Sandeen <sandeen@redhat.com> Reviewed-by: Mark Tinguely <tinguely@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>
2013-10-17xfs: prevent deadlock trying to cover an active logDave Chinner3-25/+47
Recent analysis of a deadlocked XFS filesystem from a kernel crash dump indicated that the filesystem was stuck waiting for log space. The short story of the hang on the RHEL6 kernel is this: - the tail of the log is pinned by an inode - the inode has been pushed by the xfsaild - the inode has been flushed to it's backing buffer and is currently flush locked and hence waiting for backing buffer IO to complete and remove it from the AIL - the backing buffer is marked for write - it is on the delayed write queue - the inode buffer has been modified directly and logged recently due to unlinked inode list modification - the backing buffer is pinned in memory as it is in the active CIL context. - the xfsbufd won't start buffer writeback because it is pinned - xfssyncd won't force the log because it sees the log as needing to be covered and hence wants to issue a dummy transaction to move the log covering state machine along. Hence there is no trigger to force the CIL to the log and hence unpin the inode buffer and therefore complete the inode IO, remove it from the AIL and hence move the tail of the log along, allowing transactions to start again. Mainline kernels also have the same deadlock, though the signature is slightly different - the inode buffer never reaches the delayed write lists because xfs_buf_item_push() sees that it is pinned and hence never adds it to the delayed write list that the xfsaild flushes. There are two possible solutions here. The first is to simply force the log before trying to cover the log and so ensure that the CIL is emptied before we try to reserve space for the dummy transaction in the xfs_log_worker(). While this might work most of the time, it is still racy and is no guarantee that we don't get stuck in xfs_trans_reserve waiting for log space to come free. Hence it's not the best way to solve the problem. The second solution is to modify xfs_log_need_covered() to be aware of the CIL. We only should be attempting to cover the log if there is no current activity in the log - covering the log is the process of ensuring that the head and tail in the log on disk are identical (i.e. the log is clean and at idle). Hence, by definition, if there are items in the CIL then the log is not at idle and so we don't need to attempt to cover it. When we don't need to cover the log because it is active or idle, we issue a log force from xfs_log_worker() - if the log is idle, then this does nothing. However, if the log is active due to there being items in the CIL, it will force the items in the CIL to the log and unpin them. In the case of the above deadlock scenario, instead of xfs_log_worker() getting stuck in xfs_trans_reserve() attempting to cover the log, it will instead force the log, thereby unpinning the inode buffer, allowing IO to be issued and complete and hence removing the inode that was pinning the tail of the log from the AIL. At that point, everything will start moving along again. i.e. the xfs_log_worker turns back into a watchdog that can alleviate deadlocks based around pinned items that prevent the tail of the log from being moved... Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Eric Sandeen <sandeen@redhat.com> Signed-off-by: Ben Myers <bpm@sgi.com>
2013-10-08xfs: clean up xfs_inactive() error handling, kill VN_INACTIVE_[NO]CACHEBrian Foster3-26/+12
The xfs_inactive() return value is meaningless. Turn xfs_inactive() into a void function and clean up the error handling appropriately. Kill the VN_INACTIVE_[NO]CACHE directives as they are not relevant to Linux. Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Ben Myers <bpm@sgi.com>
2013-10-08xfs: push down inactive transaction mgmt for ifreeBrian Foster1-50/+71
Push the inode free work performed during xfs_inactive() down into a new xfs_inactive_ifree() helper. This clears xfs_inactive() from all inode locking and transaction management more directly associated with freeing the inode xattrs, extents and the inode itself. Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Ben Myers <bpm@sgi.com>
2013-10-08xfs: push down inactive transaction mgmt for truncateBrian Foster1-49/+68
Create the new xfs_inactive_truncate() function to handle the truncate portion of xfs_inactive(). Push the locking and transaction management into the new function. Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Ben Myers <bpm@sgi.com>
2013-10-08xfs: push down inactive transaction mgmt for remote symlinksBrian Foster3-54/+49
Push down the transaction management for remote symlinks from xfs_inactive() down to xfs_inactive_symlink_rmt(). The latter is cleaned up to avoid transaction management intended for the calling context (i.e., trans duplication, reservation, item attachment). Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Ben Myers <bpm@sgi.com>
2013-10-08xfs: add the inode directory type support to XFS_IOC_FSGEOMMark Tinguely2-3/+5
Add the inode type directory type support to XFS_IOC_FSGEOM so that xfs_repair/xfs_info knows if the superblock v4 filesystem enabled the feature. Signed-off-by: Mark Tinguely <tinguely@sgi.com> Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com> Signed-off-by: Ben Myers <bpm@sgi.com>
2013-10-01xfs: remove usage of is_bad_inodeBen Myers3-17/+1
XFS never calls mark_inode_bad or iget_failed, so it will never see a bad inode. Remove all checks for is_bad_inode because they are unnecessary. Signed-off-by: Ben Myers <bpm@sgi.com> Reviewed-by: Dave Chinner <dchinner@redhat.com>
2013-10-01xfs: fix the wrong new_size/rnew_size at xfs_iext_realloc_direct()Jie Liu1-7/+2
At xfs_iext_realloc_direct(), the new_size is changed by adding if_bytes if originally the extent records are stored at the inline extent buffer, and we have to switch from it to a direct extent list for those new allocated extents, this is wrong. e.g, Create a file with three extents which was showing as following, xfs_io -f -c "truncate 100m" /xfs/testme for i in $(seq 0 5 10); do offset=$(($i * $((1 << 20)))) xfs_io -c "pwrite $offset 1m" /xfs/testme done Inline ------ irec: if_bytes bytes_diff new_size 1st 0 16 16 2nd 16 16 32 Switching --------- rnew_size 3rd 32 16 48 + 32 = 80 roundup=128 In this case, the desired value of new_size should be 48, and then it will be roundup to 64 and be assigned to rnew_size. However, this issue has been covered by resetting the if_bytes to the new_size which is calculated at the begnning of xfs_iext_add() before leaving out this function, and in turn make the rnew_size correctly again. Hence, this can not be detected via xfstestes. This patch fix above problem and revise the new_size comments at xfs_iext_realloc_direct() to make it more readable. Also, fix the comments while switching from the inline extent buffer to a direct extent list to reflect this change. Signed-off-by: Jie Liu <jeff.liu@oracle.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Ben Myers <bpm@sgi.com>
2013-10-01xfs: get rid of count from xfs_iomap_write_allocate()Jie Liu3-6/+5
Get rid of function variable count from xfs_iomap_write_allocate() as it is unused. Additionally, checkpatch warn me of the following for this change: WARNING: extern prototypes should be avoided in .h files +extern int xfs_iomap_write_allocate(struct xfs_inode *, xfs_off_t, So this patch also remove all extern function prototypes at xfs_iomap.h to suppress it to make this code style in consistent manner in this file. Signed-off-by: Jie Liu <jeff.liu@oracle.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Ben Myers <bpm@sgi.com>
2013-10-01xfs: Use kmem_free() instead of free()Thierry Reding1-1/+1
This fixes a build failure caused by calling the free() function which does not exist in the Linux kernel. Signed-off-by: Thierry Reding <treding@nvidia.com> Reviewed-by: Mark Tinguely <tinguely@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>
2013-09-30xfs: fix memory leak in xlog_recover_add_to_transtinguely@sgi.com1-0/+1
Free the memory in error path of xlog_recover_add_to_trans(). Normally this memory is freed in recovery pass2, but is leaked in the error path. Signed-off-by: Mark Tinguely <tinguely@sgi.com> Reviewed-by: Eric Sandeen <sandeen@redhat.com> Signed-off-by: Ben Myers <bpm@sgi.com>
2013-09-30xfs: dirent dtype presence is dependent on directory magic numbersDave Chinner4-39/+28
The determination of whether a directory entry contains a dtype field originally was dependent on the filesystem having CRCs enabled. This meant that the format for dtype beign enabled could be determined by checking the directory block magic number rather than doing a feature bit check. This was useful in that it meant that we didn't need to pass a struct xfs_mount around to functions that were already supplied with a directory block header. Unfortunately, the introduction of dtype fields into the v4 structure via a feature bit meant this "use the directory block magic number" method of discriminating the dirent entry sizes is broken. Hence we need to convert the places that use magic number checks to use feature bit checks so that they work correctly and not by chance. The current code works on v4 filesystems only because the dirent size roundup covers the extra byte needed by the dtype field in the places where this problem occurs. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Ben Myers <bpm@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>
2013-09-30xfs: lockdep needs to know about 3 dquot-deep nestingDave Chinner1-3/+16
Michael Semon reported that xfs/299 generated this lockdep warning: ============================================= [ INFO: possible recursive locking detected ] 3.12.0-rc2+ #2 Not tainted --------------------------------------------- touch/21072 is trying to acquire lock: (&xfs_dquot_other_class){+.+...}, at: [<c12902fb>] xfs_trans_dqlockedjoin+0x57/0x64 but task is already holding lock: (&xfs_dquot_other_class){+.+...}, at: [<c12902fb>] xfs_trans_dqlockedjoin+0x57/0x64 other info that might help us debug this: Possible unsafe locking scenario: CPU0 ---- lock(&xfs_dquot_other_class); lock(&xfs_dquot_other_class); *** DEADLOCK *** May be due to missing lock nesting notation 7 locks held by touch/21072: #0: (sb_writers#10){++++.+}, at: [<c11185b6>] mnt_want_write+0x1e/0x3e #1: (&type->i_mutex_dir_key#4){+.+.+.}, at: [<c11078ee>] do_last+0x245/0xe40 #2: (sb_internal#2){++++.+}, at: [<c122c9e0>] xfs_trans_alloc+0x1f/0x35 #3: (&(&ip->i_lock)->mr_lock/1){+.+...}, at: [<c126cd1b>] xfs_ilock+0x100/0x1f1 #4: (&(&ip->i_lock)->mr_lock){++++-.}, at: [<c126cf52>] xfs_ilock_nowait+0x105/0x22f #5: (&dqp->q_qlock){+.+...}, at: [<c12902fb>] xfs_trans_dqlockedjoin+0x57/0x64 #6: (&xfs_dquot_other_class){+.+...}, at: [<c12902fb>] xfs_trans_dqlockedjoin+0x57/0x64 The lockdep annotation for dquot lock nesting only understands locking for user and "other" dquots, not user, group and quota dquots. Fix the annotations to match the locking heirarchy we now have. Reported-by: Michael L. Semon <mlsemon35@gmail.com> Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Ben Myers <bpm@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>
2013-09-26xfs: fix node forward in xfs_node_toosmallMark Tinguely1-2/+3
Commit f5ea1100 cleans up the disk to host conversions for node directory entries, but because a variable is reused in xfs_node_toosmall() the next node is not correctly found. If the original node is small enough (<= 3/8 of the node size), this change may incorrectly cause a node collapse when it should not. That will cause an assert in xfstest generic/319: Assertion failed: first <= last && last < BBTOB(bp->b_length), file: /root/newest/xfs/fs/xfs/xfs_trans_buf.c, line: 569 Keep the original node header to get the correct forward node. (When a node is considered for a merge with a sibling, it overwrites the sibling pointers of the original incore nodehdr with the sibling's pointers. This leads to loop considering the original node as a merge candidate with itself in the second pass, and so it incorrectly determines a merge should occur.) Signed-off-by: Mark Tinguely <tinguely@sgi.com> Reviewed-by: Ben Myers <bpm@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com> [v3: added Dave Chinner's (slightly modified) suggestion to the commit header, cleaned up whitespace. -bpm]
2013-09-24xfs: log recovery lsn ordering needs uuid checkDave Chinner1-14/+59
After a fair number of xfstests runs, xfs/182 started to fail regularly with a corrupted directory - a directory read verifier was failing after recovery because it found a block with a XARM magic number (remote attribute block) rather than a directory data block. The first time I saw this repeated failure I did /something/ and the problem went away, so I was never able to find the underlying problem. Test xfs/182 failed again today, and I found the root cause before I did /something else/ that made it go away. Tracing indicated that the block in question was being correctly logged, the log was being flushed by sync, but the buffer was not being written back before the shutdown occurred. Tracing also indicated that log recovery was also reading the block, but then never writing it before log recovery invalidated the cache, indicating that it was not modified by log recovery. More detailed analysis of the corpse indicated that the filesystem had a uuid of "a4131074-1872-4cac-9323-2229adbcb886" but the XARM block had a uuid of "8f32f043-c3c9-e7f8-f947-4e7f989c05d3", which indicated it was a block from an older filesystem. The reason that log recovery didn't replay it was that the LSN in the XARM block was larger than the LSN of the transaction being replayed, and so the block was not overwritten by log recovery. Hence, log recovery cant blindly trust the magic number and LSN in the block - it must verify that it belongs to the filesystem being recovered before using the LSN. i.e. if the UUIDs don't match, we need to unconditionally recovery the change held in the log. This patch was first tested on a block device that was repeatedly causing xfs/182 to fail with the same failure on the same block with the same directory read corruption signature (i.e. XARM block). It did not fail, and hasn't failed since. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Ben Myers <bpm@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>
2013-09-24xfs: fix XFS_IOC_FREE_EOFBLOCKS definitionDave Chinner1-1/+1
It uses a kernel internal structure in it's definition rather than the user visible structure that is passed to the ioctl. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Mark Tinguely <tinguely@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>
2013-09-24xfs: asserting lock not held during freeing not validDave Chinner1-5/+4
When we free an inode, we do so via RCU. As an RCU lookup can occur at any time before we free an inode, and that lookup takes the inode flags lock, we cannot safely assert that the flags lock is not held just before marking it dead and running call_rcu() to free the inode. We check on allocation of a new inode structre that the lock is not held, so we still have protection against locks being leaked and hence not correctly initialised when allocated out of the slab. Hence just remove the assert... Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Mark Tinguely <tinguely@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>
2013-09-24xfs: lock the AIL before removing the buffer itemDave Chinner1-0/+1
Regression introduced by commit 46f9d2e ("xfs: aborted buf items can be in the AIL") which fails to lock the AIL before removing the item. Spinlock debugging throws a warning about this. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Mark Tinguely <tinguely@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>
2013-09-16Merge branch 'for-next' of git://git.samba.org/sfrench/cifs-2.6Linus Torvalds1-5/+10
Pull CIFS fixes from Steve French: "Two minor cifs fixes and a minor documentation cleanup for cifs.txt" * 'for-next' of git://git.samba.org/sfrench/cifs-2.6: cifs: update cifs.txt and remove some outdated infos cifs: Avoid calling unlock_page() twice in cifs_readpage() when using fscache cifs: Do not take a reference to the page in cifs_readpage_worker()
2013-09-16Merge tag 'upstream-3.12-rc1' of git://git.infradead.org/linux-ubifsLinus Torvalds1-3/+4
Pull ubifs fix from Artem Bityutskiy: "Just one patch which fixes the power-cut recovery testing mode. I'll start using a single UBI/UBIFS tree instead of 2 trees from now on. So in the future you'll get 1 small pull request instead of 2 tiny ones" * tag 'upstream-3.12-rc1' of git://git.infradead.org/linux-ubifs: UBIFS: remove invalid warn msg with tst_recovery enabled
2013-09-15vfs: fix typo in comment in recent dentry workLinus Torvalds1-1/+1
Sedat points out that I transposed some letters in "LRU" and wrote "RLU" instead in one of the new comments explaining the flow. Let's just fix it. Reported-by: Sedat Dilek <sedat.dilek@jpberlin.de> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-09-13Merge tag 'writeback-fixes' of ↵Linus Torvalds2-4/+6
git://git.kernel.org/pub/scm/linux/kernel/git/wfg/linux Pull writeback fix from Wu Fengguang: "A trivial writeback fix" * tag 'writeback-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/wfg/linux: writeback: Do not sort b_io list only because of block device inode
2013-09-13vfs: fix dentry LRU list handling and nr_dentry_unused accountingLinus Torvalds1-27/+101
The LRU list changes interacted badly with our nr_dentry_unused accounting, and even worse with the new DCACHE_LRU_LIST bit logic. This introduces helper functions to make sure everything follows the proper dcache d_lru list rules: the dentry cache is complicated by the fact that some of the hotpaths don't even want to look at the LRU list at all, and the fact that we use the same list entry in the dentry for both the LRU list and for our temporary shrinking lists when removing things from the LRU. The helper functions temporarily have some extra sanity checking for the flag bits that have to match the current LRU state of the dentry. We'll remove that before the final 3.12 release, but considering how easy it is to get wrong, this first cleanup version has some very particular sanity checking. Acked-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-09-13cifs: Avoid calling unlock_page() twice in cifs_readpage() when using fscacheSachin Prabhu1-3/+7
When reading a single page with cifs_readpage(), we make a call to fscache_read_or_alloc_page() which once done, asynchronously calls the completion function cifs_readpage_from_fscache_complete(). This completion function unlocks the page once it has been populated from cache. The module then attempts to unlock the page a second time in cifs_readpage() which leads to warning messages. In case of a successful call to fscache_read_or_alloc_page() we should skip the second unlock_page() since this will be called by the cifs_readpage_from_fscache_complete() once the page has been populated by fscache. With the modifications to cifs_readpage_worker(), we will need to re-grab the page lock in cifs_write_begin(). The problem was first noticed when testing new fscache patches for cifs. https://bugzilla.redhat.com/show_bug.cgi?id=1005737 Signed-off-by: Sachin Prabhu <sprabhu@redhat.com> Reviewed-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: Steve French <smfrench@gmail.com>
2013-09-13cifs: Do not take a reference to the page in cifs_readpage_worker()Sachin Prabhu1-2/+3
We do not need to take a reference to the pagecache in cifs_readpage_worker() since the calling function will have already taken one before passing the pointer to the page as an argument to the function. Signed-off-by: Sachin Prabhu <sprabhu@redhat.com> Reviewed-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: Steve French <smfrench@gmail.com>
2013-09-13Merge git://git.kvack.org/~bcrl/aio-nextLinus Torvalds7-269/+537
Pull aio changes from Ben LaHaise: "First off, sorry for this pull request being late in the merge window. Al had raised a couple of concerns about 2 items in the series below. I addressed the first issue (the race introduced by Gu's use of mm_populate()), but he has not provided any further details on how he wants to rework the anon_inode.c changes (which were sent out months ago but have yet to be commented on). The bulk of the changes have been sitting in the -next tree for a few months, with all the issues raised being addressed" * git://git.kvack.org/~bcrl/aio-next: (22 commits) aio: rcu_read_lock protection for new rcu_dereference calls aio: fix race in ring buffer page lookup introduced by page migration support aio: fix rcu sparse warnings introduced by ioctx table lookup patch aio: remove unnecessary debugging from aio_free_ring() aio: table lookup: verify ctx pointer staging/lustre: kiocb->ki_left is removed aio: fix error handling and rcu usage in "convert the ioctx list to table lookup v3" aio: be defensive to ensure request batching is non-zero instead of BUG_ON() aio: convert the ioctx list to table lookup v3 aio: double aio_max_nr in calculations aio: Kill ki_dtor aio: Kill ki_users aio: Kill unneeded kiocb members aio: Kill aio_rw_vect_retry() aio: Don't use ctx->tail unnecessarily aio: io_cancel() no longer returns the io_event aio: percpu ioctx refcount aio: percpu reqs_available aio: reqs_active -> reqs_available aio: fix build when migration is disabled ...
2013-09-12Merge tag 'xfs-for-linus-v3.12-rc1-2' of git://oss.sgi.com/xfs/xfsLinus Torvalds25-166/+461
Pull xfs update #2 from Ben Myers: "Here we have defrag support for v5 superblock, a number of bugfixes and a cleanup or two. - defrag support for CRC filesystems - fix endian worning in xlog_recover_get_buf_lsn - fixes for sparse warnings - fix for assert in xfs_dir3_leaf_hdr_from_disk - fix for log recovery of remote symlinks - fix for log recovery of btree root splits - fixes formemory allocation failures with ACLs - fix for assert in xfs_buf_item_relse - fix for assert in xfs_inode_buf_verify - fix an assignment in an assert that should be a test in xfs_bmbt_change_owner - remove dead code in xlog_recover_inode_pass2" * tag 'xfs-for-linus-v3.12-rc1-2' of git://oss.sgi.com/xfs/xfs: xfs: remove dead code from xlog_recover_inode_pass2 xfs: = vs == typo in ASSERT() xfs: don't assert fail on bad inode numbers xfs: aborted buf items can be in the AIL. xfs: factor all the kmalloc-or-vmalloc fallback allocations xfs: fix memory allocation failures with ACLs xfs: ensure we copy buffer type in da btree root splits xfs: set remote symlink buffer type for recovery xfs: recovery of swap extents operations for CRC filesystems xfs: swap extents operations for CRC filesystems xfs: check magic numbers in dir3 leaf verifier first xfs: fix some minor sparse warnings xfs: fix endian warning in xlog_recover_get_buf_lsn()
2013-09-12Merge branch 'akpm' (patches from Andrew Morton)Linus Torvalds27-42/+28
Merge more patches from Andrew Morton: "The rest of MM. Plus one misc cleanup" * emailed patches from Andrew Morton <akpm@linux-foundation.org>: (35 commits) mm/Kconfig: add MMU dependency for MIGRATION. kernel: replace strict_strto*() with kstrto*() mm, thp: count thp_fault_fallback anytime thp fault fails thp: consolidate code between handle_mm_fault() and do_huge_pmd_anonymous_page() thp: do_huge_pmd_anonymous_page() cleanup thp: move maybe_pmd_mkwrite() out of mk_huge_pmd() mm: cleanup add_to_page_cache_locked() thp: account anon transparent huge pages into NR_ANON_PAGES truncate: drop 'oldsize' truncate_pagecache() parameter mm: make lru_add_drain_all() selective memcg: document cgroup dirty/writeback memory statistics memcg: add per cgroup writeback pages accounting memcg: check for proper lock held in mem_cgroup_update_page_stat memcg: remove MEMCG_NR_FILE_MAPPED memcg: reduce function dereference memcg: avoid overflow caused by PAGE_ALIGN memcg: rename RESOURCE_MAX to RES_COUNTER_MAX memcg: correct RESOURCE_MAX to ULLONG_MAX mm: memcg: do not trap chargers with full callstack on OOM mm: memcg: rework and document OOM waiting and wakeup ...