Age  Commit message  Author  Files  Lines
2016-05-20Merge branch 'xfs-4.7-inode-reclaim' into for-nextDave Chinner4-199/+250
2016-05-20Merge branch 'xfs-4.7-error-cfg' into for-nextDave Chinner8-54/+450
2016-05-20Merge branch 'xfs-4.7-misc-fixes' into for-nextDave Chinner10-29/+32
2016-05-20Merge branch 'xfs-4.7-cleanup-attr-listent' into for-nextDave Chinner3-67/+39
2016-05-20Merge branch 'xfs-4.7-optimise-inline-symlinks' into for-nextDave Chinner10-85/+125
2016-05-20Merge branch 'xfs-4.7-trans-type-cleanup' into for-nextDave Chinner28-508/+200
2016-05-20Merge branch 'xfs-4.7-writeback-bio' into for-nextDave Chinner3-191/+184
2016-05-20xfs: fix warning in xfs_finish_page_writeback for non-debug buildsChristoph Hellwig1-3/+2
blockmask is unused if ASSERTs are disabled. Reported-by: kbuild test robot <fengguang.wu@intel.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
2016-05-18xfs: move reclaim tagging functionsDave Chinner1-118/+116
Rearrange the inode tagging functions so that they are higher up in xfs_cache.c and so there is no need for forward prototypes to be defined. This is purely code movement, no other change. Signed-off-by: Dave Chinner <dchinner@redhat.com>
2016-05-18xfs: simplify inode reclaim tagging interfacesDave Chinner1-48/+50
Inode radix tree tagging for reclaim passes a lot of unnecessary variables around. Over time the xfs-perag has grown a xfs_mount backpointer, and an internal agno so we don't need to pass other variables into the tagging functions to supply this information. Rework the functions to pass the minimal variable set required and simplify the internal logic and flow. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-05-18xfs: rename variables in xfs_iflush_cluster for clarityDave Chinner1-37/+37
The cluster inode variable uses unconventional naming - iq - which makes it hard to distinguish from the inode passed into the function - ip - and that is a vector for mistakes to be made. Rename all the cluster inode variables to use more conventional prefixes to reduce potential future confusion (cilist, cilist_size, cip). Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-05-18xfs: xfs_iflush_cluster has range issuesDave Chinner1-2/+11
xfs_iflush_cluster() does a gang lookup on the radix tree, meaning it can find inodes beyond the current cluster if there is sparse cache population. gang lookups return results in ascending index order, so stop trying to cluster inodes once the first inode outside the cluster mask is detected. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Dave Chinner <david@fromorbit.com>
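The early exit described in this commit can be sketched in userspace C roughly as follows. This is only an illustration under stated assumptions: the function name, the flat sorted array standing in for the radix tree gang lookup result, and the power-of-two cluster size are all hypothetical and are not taken from the kernel source.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical stand-in for a gang lookup result: inode numbers in
 * ascending order, starting at or after first_index. Counts how many of
 * them belong to the cluster beginning at first_index and stops at the
 * first entry outside the cluster. cluster_size is assumed to be a
 * power of two.
 */
size_t count_cluster_inodes(const uint64_t *inos, size_t n,
                            uint64_t first_index, uint64_t cluster_size)
{
    uint64_t mask = ~(cluster_size - 1);
    size_t found = 0;

    for (size_t i = 0; i < n; i++) {
        /*
         * Results are in ascending order, so once one entry falls
         * outside the cluster every later entry does too.
         */
        if ((inos[i] & mask) != first_index)
            break;
        found++;
    }
    return found;
}
```

With a sparsely populated cache the lookup may return entries from later clusters; the `break` keeps those from being treated as part of the current one.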
2016-05-18xfs: mark reclaimed inodes invalid earlierDave Chinner2-12/+47
The last thing we do before using call_rcu() on an xfs_inode to be freed is mark it as invalid. This means there is a window between when we know for certain that the inode is going to be freed and when we do actually mark it as "freed". This is important in the context of RCU lookups - we can look up the inode, find that it is valid, and then use it as such not realising that it is in the final stages of being freed. As such, mark the inode as being invalid the moment we know it is going to be reclaimed. This can be done while we still hold the XFS_ILOCK_EXCL and the flush lock in xfs_inode_reclaim, meaning that it occurs well before we remove it from the radix tree, and that the i_flags_lock, the XFS_ILOCK and the inode flush lock all act as synchronisation points for detecting that an inode is about to go away. For defensive purposes, this allows us to add a further check to xfs_iflush_cluster to ensure we skip inodes that are being freed after we grab the XFS_ILOCK_SHARED and the flush lock - we know that if the inode number is valid while we have these locks held, it has not progressed through reclaim to the point where it is clean and is about to be freed. [bfoster: fixed __xfs_inode_clear_reclaim() using ip->i_ino after it had already been zeroed.] Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-05-18xfs: xfs_inode_free() isn't RCU safeDave Chinner1-7/+7
The xfs_inode freed in xfs_inode_free() has multiple allocated structures attached to it. We free these in xfs_inode_free() before we mark the inode as invalid, and before we run call_rcu() to queue the structure for freeing. Unfortunately, this freeing can race with other accesses that are in the current RCU grace period that have found the inode in the radix tree with a valid state. This includes xfs_iflush_cluster(), which calls xfs_inode_clean(), and that accesses the inode log item on the xfs_inode. The log item structure is freed in xfs_inode_free(), so there is the possibility we can be accessing freed memory in xfs_iflush_cluster() after validating the xfs_inode structure as being valid for this RCU context. Hence we can get spuriously incorrect clean state returned from such checks. This can lead to us thinking the inode is dirty when it is, in fact, clean, and so incorrectly attaching it to the buffer for IO and completion processing. This then leads to use-after-free situations on the xfs_inode itself if the IO completes after the current RCU grace period expires. The buffer callbacks will access the xfs_inode and try to do all sorts of things it shouldn't with freed memory. IOWs, xfs_iflush_cluster() only works correctly when racing with inode reclaim if the inode log item is present and correctly stating the inode is clean. If the inode is being freed, then reclaim has already made sure the inode is clean, and hence xfs_iflush_cluster can skip it. However, we are accessing the inode under RCU read lock protection and so also must ensure that all dynamically allocated memory we reference in this context is not freed until the RCU grace period expires. To fix this, move all the potential memory freeing into xfs_inode_free_callback() so that we are guaranteed that RCU protected lookup code will always have the memory structures it needs available during the RCU grace period that lookup races can occur in.
Discovered-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-05-18xfs: optimise xfs_iext_destroyAlex Lyakas1-8/+19
When unmounting XFS, we call:
xfs_inode_free => xfs_idestroy_fork => xfs_iext_destroy
This goes over the whole indirection array and calls xfs_iext_irec_remove for each one of the erps (from the last one to the first one). As a result, we keep shrinking (reallocating actually) the indirection array until we shrink out all of its elements. When we have files with huge numbers of extents, umount takes 30-80 sec, depending on the amount of files that XFS loaded and the amount of indirection entries of each file. The unmount stack looks like:
[<ffffffffc0b6d200>] xfs_iext_realloc_indirect+0x40/0x60 [xfs]
[<ffffffffc0b6cd8e>] xfs_iext_irec_remove+0xee/0xf0 [xfs]
[<ffffffffc0b6cdcd>] xfs_iext_destroy+0x3d/0xb0 [xfs]
[<ffffffffc0b6cef6>] xfs_idestroy_fork+0xb6/0xf0 [xfs]
[<ffffffffc0b87002>] xfs_inode_free+0xb2/0xc0 [xfs]
[<ffffffffc0b87260>] xfs_reclaim_inode+0x250/0x340 [xfs]
[<ffffffffc0b87583>] xfs_reclaim_inodes_ag+0x233/0x370 [xfs]
[<ffffffffc0b8823d>] xfs_reclaim_inodes+0x1d/0x20 [xfs]
[<ffffffffc0b96feb>] xfs_unmountfs+0x7b/0x1a0 [xfs]
[<ffffffffc0b98e4d>] xfs_fs_put_super+0x2d/0x70 [xfs]
[<ffffffff811e9e36>] generic_shutdown_super+0x76/0x100
[<ffffffff811ea207>] kill_block_super+0x27/0x70
[<ffffffff811ea519>] deactivate_locked_super+0x49/0x60
[<ffffffff811eaaee>] deactivate_super+0x4e/0x70
[<ffffffff81207593>] cleanup_mnt+0x43/0x90
[<ffffffff81207632>] __cleanup_mnt+0x12/0x20
[<ffffffff8108f8e7>] task_work_run+0xa7/0xe0
[<ffffffff81014ff7>] do_notify_resume+0x97/0xb0
[<ffffffff81717c6f>] int_signal+0x12/0x17
Further, this reallocation prevents us from freeing the extent list from a RCU callback as allocation can block. Hence if the extent list is in indirect format, optimise the freeing of the extent list to only use kmem_free calls by freeing entire extent buffer pages at a time, rather than extent by extent.
[dchinner: simplified freeing loop based on Christoph's suggestion] Signed-off-by: Alex Lyakas <alex@zadarastorage.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
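The idea of tearing down an indirection array using only free calls, with no intermediate reallocation, might look something like the userspace sketch below. The structure and function names are hypothetical, and `free()` stands in for kmem_free; the real xfs_iext_destroy differs in detail.

```c
#include <stdlib.h>

/* Hypothetical indirection array: nbufs pointers to extent buffers. */
struct ext_indirect {
    void  **bufs;
    size_t  nbufs;
};

/*
 * Free every extent buffer and then the array itself. The array is
 * never shrunk or reallocated along the way, so this path allocates
 * nothing. Returns the number of free() calls made, for illustration.
 */
size_t ext_indirect_destroy(struct ext_indirect *ind)
{
    size_t frees = 0;

    for (size_t i = 0; i < ind->nbufs; i++) {
        free(ind->bufs[i]);
        frees++;
    }
    free(ind->bufs);
    frees++;

    ind->bufs = NULL;
    ind->nbufs = 0;
    return frees;
}
```

The work is one free per buffer plus one for the array, rather than one array reallocation per removed entry.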
2016-05-18xfs: skip stale inodes in xfs_iflush_clusterDave Chinner1-0/+1
We don't write back stale inodes so we should skip them in xfs_iflush_cluster, too. cc: <stable@vger.kernel.org> # 3.10.x- Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-05-18xfs: fix inode validity check in xfs_iflush_clusterDave Chinner1-4/+4
Some careless idiot(*) wrote crap code in commit 1a3e8f3 ("xfs: convert inode cache lookups to use RCU locking") back in late 2010, and so xfs_iflush_cluster checks the wrong inode for whether it is still valid under RCU protection. Fix it to lock and check the correct inode. (*) Careless-idiot: Dave Chinner <dchinner@redhat.com> cc: <stable@vger.kernel.org> # 3.10.x- Discovered-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-05-18xfs: xfs_iflush_cluster fails to abort on errorDave Chinner1-4/+13
When a failure due to an inode buffer occurs, the error handling fails to abort the inode writeback correctly. This can result in the inode being reclaimed whilst still in the AIL, leading to use-after-free situations as well as filesystems that cannot be unmounted as the inode log items left in the AIL never get removed. Fix this by ensuring fatal errors from xfs_imap_to_bp() result in the inode flush being aborted correctly. cc: <stable@vger.kernel.org> # 3.10.x- Reported-by: Shyam Kaushik <shyam@zadarastorage.com> Diagnosed-by: Shyam Kaushik <shyam@zadarastorage.com> Tested-by: Shyam Kaushik <shyam@zadarastorage.com> Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-05-18xfs: remove xfs_fs_evict_inode()Dave Chinner1-21/+7
Joe Lawrence reported a list_add corruption with 4.6-rc1 when testing some custom md administration code that made its own block device nodes for the md array. The simple test loop of:
for i in {0..100}; do
    mknod --mode=0600 $tmp/tmp_node b $MAJOR $MINOR
    mdadm --detail --export $tmp/tmp_node > /dev/null
    rm -f $tmp/tmp_node
done
Would produce this warning in bd_acquire() when mdadm opened the device node:
list_add double add: new=ffff88043831c7b8, prev=ffff8804380287d8, next=ffff88043831c7b8.
And then produce this from bd_forget from kdevtmpfs evicting a block dev inode:
list_del corruption. prev->next should be ffff8800bb83eb10, but was ffff88043831c7b8
This is a regression caused by commit c19b3b05 ("xfs: move di_mode to vfs inode"). The issue is that xfs_inactive() frees the unlinked inode, and the above commit meant that this freeing zeroed the mode in the struct inode. The problem is that after evict() has called ->evict_inode, it expects the i_mode to be intact so that it can call bd_forget() or cd_forget() to drop the reference to the block device inode attached to the XFS inode. In reality, the only thing we do in xfs_fs_evict_inode() that is not generic is call xfs_inactive(). We can move the xfs_inactive() call to xfs_fs_destroy_inode() without any problems at all, and this will leave the VFS inode intact until it is completely done with it. So, remove xfs_fs_evict_inode(), and do the work it used to do in ->destroy_inode instead. cc: <stable@vger.kernel.org> # 4.6 Reported-by: Joe Lawrence <joe.lawrence@stratus.com> Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-05-18xfs: add "fail at unmount" error handling configurationCarlos Maiolino4-0/+64
If we take "retry forever" literally on metadata IO errors, we can hang at unmount, since it retries those writes forever. This is the default behavior, unfortunately. Add an error configuration option for this behavior and default it to "fail" so that an unmount will trigger actual errors and a shutdown, and allow the unmount to succeed. It will be noisy, though, as it will log the errors and the shutdown that occurs. To fix this, we need to mark the filesystem as being in the process of unmounting. Do this with a mount flag that is added at the appropriate time (i.e. before the blocking AIL sync). We also need to add this flag if mount fails after the initial phase of log recovery has been run. Signed-off-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Carlos Maiolino <cmaiolino@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-05-18xfs: add configuration handlers for specific errorsCarlos Maiolino2-1/+24
Now that most of the infrastructure is in place, we can start adding support for configuring specific errors such as ENODEV, ENOSPC, EIO, etc. Add these error configurations and configure them all to have appropriate behaviours. That is, all will be configured to retry forever by default, except for ENODEV, which is an unrecoverable error, so it will be configured to not retry on error. Signed-off-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Carlos Maiolino <cmaiolino@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
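The defaults described in this commit could be laid out as a small lookup table, sketched below in userspace C. The structure, the table name, the lookup helper, and the use of -1 as a "retry forever" sentinel are assumptions made for illustration; they are not the kernel's actual definitions.

```c
#include <errno.h>
#include <stddef.h>

#define RETRY_FOREVER (-1)   /* assumed sentinel for "retry indefinitely" */

/* Hypothetical per-error default; errnum 0 marks the catch-all entry. */
struct err_default {
    int errnum;
    int max_retries;
};

static const struct err_default err_defaults[] = {
    { 0,      RETRY_FOREVER },  /* any error without a specific entry */
    { EIO,    RETRY_FOREVER },
    { ENOSPC, RETRY_FOREVER },
    { ENODEV, 0 },              /* unrecoverable: do not retry */
};

/* Return the default retry limit for errnum, or the catch-all default. */
int default_max_retries(int errnum)
{
    size_t n = sizeof(err_defaults) / sizeof(err_defaults[0]);

    for (size_t i = 1; i < n; i++) {
        if (err_defaults[i].errnum == errnum)
            return err_defaults[i].max_retries;
    }
    return err_defaults[0].max_retries;
}
```

Keeping the defaults in one table means a new error code needs only a new row, not a change to the initialization logic.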
2016-05-18xfs: add configuration of error failure speedCarlos Maiolino4-7/+111
On reception of an error, we can fail immediately, perform some bounded number of retries, or retry indefinitely. The current behaviour is to retry forever. However, we'd like the ability to choose how long the filesystem should keep trying after an error: it can either fail immediately, retry a few times, or retry forever. This is implemented using the max_retries sysfs attribute, which holds the number of times we allow the filesystem to retry after an error, with -1 being a special case where the filesystem will retry indefinitely. Add both a maximum retry count and a retry timeout so that we can bound by time and/or physical IO attempts. Finally, plumb these into xfs_buf_iodone error processing so that the error behaviour follows the selected configuration. Signed-off-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Carlos Maiolino <cmaiolino@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
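A retry decision bounded by both a count and a timeout might be sketched as follows. The structure, the function name, and in particular the convention that a timeout of 0 means "no time limit" are assumptions for this illustration; only the meaning of -1 for max_retries comes from the commit message.

```c
#include <stdbool.h>

/* Hypothetical error configuration. */
struct err_cfg {
    int  max_retries;    /* -1: retry forever, 0: fail immediately */
    long retry_timeout;  /* seconds; 0 is assumed to mean "no time limit" */
};

/*
 * Decide whether another retry is allowed, given how many retries have
 * already been made and how long ago the first failure occurred.
 */
bool should_retry(const struct err_cfg *cfg, int retries_done, long elapsed)
{
    /* Bound by the number of physical IO attempts, unless unlimited. */
    if (cfg->max_retries >= 0 && retries_done >= cfg->max_retries)
        return false;

    /* Bound by time, if a timeout has been configured. */
    if (cfg->retry_timeout > 0 && elapsed >= cfg->retry_timeout)
        return false;

    return true;
}
```

Either bound alone is enough to stop retrying, which matches the "time and/or physical IO attempts" wording.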
2016-05-18xfs: introduce table-based init for error behaviorsCarlos Maiolino1-12/+60
Before we start expanding the number of error classes and errors we can configure behaviour for, we need a simple and clear way to define the default behaviour that we initialized each mount with. Introduce a table based method for keeping the initial configuration in, and apply that to the existing initialization code. Signed-off-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Carlos Maiolino <cmaiolino@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-05-18xfs: add configurable error support to metadata buffersCarlos Maiolino5-46/+88
With the error configuration handle for async metadata write errors in place, we can now add initial support to the IO error processing in xfs_buf_iodone_error(). Add an infrastructure function to look up the configuration handle, and rearrange the error handling to prepare the way for different error handling configurations to be used. Signed-off-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Carlos Maiolino <cmaiolino@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-05-18xfs: introduce metadata IO error classCarlos Maiolino2-0/+37
Now we have the basic infrastructure, add the first error class so we can build up the infrastructure in a meaningful way. Add the metadata async write IO error class and sysfs entry, and introduce a default configuration that matches the existing "retry forever" behavior for async write metadata buffers. Signed-off-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Carlos Maiolino <cmaiolino@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-05-18xfs: configurable error behavior via sysfsCarlos Maiolino4-2/+84
We need to be able to change the way XFS behaves in error conditions depending on the type of underlying storage. This is necessary for handling non-traditional block devices with extended error cases, such as thin provisioned devices that can return ENOSPC as an IO error. Introduce the basic sysfs infrastructure needed to define and configure error behaviours. This is done to be generic enough to extend to configuring behaviour in other error conditions, such as ENOMEM, which also has different desired behaviours according to machine configuration. Signed-off-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Carlos Maiolino <cmaiolino@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-05-18xfs: buffer ->bi_end_io function requires irq-safe lockBrian Foster1-8/+4
Reports have surfaced of a lockdep splat complaining about an irq-safe -> irq-unsafe locking order in the xfs_buf_bio_end_io() bio completion handler. This only occurs when I/O errors are present because bp->b_lock is only acquired in this context to protect setting an error on the buffer. The problem is that this lock can be acquired with the (request_queue) q->queue_lock held. See scsi_end_request() or ata_qc_schedule_eh(), for example. Replace the locked test/set of b_io_error with a cmpxchg() call. This eliminates the need for the lock and thus the lock ordering problem goes away. Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
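Replacing a locked test-and-set with a compare-and-exchange can be illustrated in userspace with C11 atomics. This is an analogy only: the kernel's cmpxchg() is a different primitive from `atomic_compare_exchange_strong`, and the structure and function names below are hypothetical.

```c
#include <stdatomic.h>

/* Hypothetical buffer with an error field; 0 means no error recorded. */
struct buf {
    atomic_int io_error;
};

/*
 * Record the first error seen on the buffer. The compare-and-exchange
 * only stores `error` if the field is still 0, so later errors do not
 * overwrite the first one and no lock is taken.
 */
void buf_set_io_error(struct buf *bp, int error)
{
    int expected = 0;

    atomic_compare_exchange_strong(&bp->io_error, &expected, error);
}
```

Because no lock is acquired, there is no lock ordering to get wrong when the function is called from a completion context.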
2016-04-06xfs: mute some sparse warningsEryu Guan3-2/+4
These three warnings are fixed: fs/xfs/xfs_inode.c:1033:44: warning: Using plain integer as NULL pointer fs/xfs/xfs_inode_item.c:525:20: warning: context imbalance in 'xfs_inode_item_push' - unexpected unlock fs/xfs/xfs_dquot.c:696:1: warning: symbol 'xfs_dq_get_next_id' was not declared. Should it be static? Signed-off-by: Eryu Guan <guaneryu@gmail.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-04-06xfs: improve kmem_reallocChristoph Hellwig5-21/+20
Use krealloc to implement our realloc function. This helps to avoid new allocations if we are still in the slab bucket. At least for the bmap btree root that's actually the common case. This also allows removing the now unused oldsize argument. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-04-06xfs: Add caller function output to xfs_log_force tracepointCarlos Maiolino2-6/+8
I had sent this patch yesterday, but for some reason it didn't reach the xfs list, so I am sending it again. Outputting the caller of xfs_log_force might be useful when tracing log checkpoint problems without the need to build the kernel with DEBUG. Signed-off-by: Carlos Maiolino <cmaiolino@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-04-06xfs: remove transaction typesChristoph Hellwig8-162/+10
These aren't used for CIL-style logging and can be dropped. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-04-06xfs: better xfs_trans_alloc interfaceChristoph Hellwig22-347/+191
Merge xfs_trans_reserve and xfs_trans_alloc into a single function call that returns a transaction with all the required log and block reservations, and which allows passing transaction flags directly to avoid the cumbersome _xfs_trans_alloc interface. While we're at it we also get rid of the transaction type argument that has been superfluous since we stopped supporting the non-CIL logging mode. The guts of it will be removed in another patch. [dchinner: fixed transaction leak in error path in xfs_setattr_nonsize] Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-04-06xfs: optimize bio handling in the buffer writeback pathChristoph Hellwig3-165/+123
This patch implements two closely related changes: First it embeds a bio in the ioend structure so that we don't have to allocate one separately. Second it uses the block layer bio chaining mechanism to chain additional bios off this first one if needed instead of manually accounting for multiple bio completions in the ioend structure. Together this removes a memory allocation per ioend and greatly simplifies the ioend setup and I/O completion path. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-04-06xfs: don't release bios on completion immediatelyDave Chinner2-29/+71
Completion of an ioend requires us to walk the bufferhead list to end writeback on all the bufferheads. This, in turn, is needed so that we can end writeback on all the pages we just did IO on. To remove our dependency on bufferheads in writeback, we need to turn this around the other way - we need to walk the pages we've just completed IO on, and then walk the buffers attached to the pages and complete their IO. In doing this, we remove the requirement for the ioend to track bufferheads directly. To enable IO completion to walk all the pages we've submitted IO on, we need to keep the bios that we used for IO around until the ioend has been completed. We can do this simply by chaining the bios to the ioend at completion time, and then walking their pages directly just before destroying the ioend. Signed-off-by: Dave Chinner <dchinner@redhat.com> [hch: changed the xfs_finish_page_writeback calling convention] Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-04-06xfs: build bios directly in xfs_add_to_ioendDave Chinner2-38/+32
Currently adding a buffer to the ioend and then building a bio from the buffer list are two separate operations. We don't build the bios and submit them until the ioend is submitted, and this places a fixed dependency on bufferhead chaining in the ioend. The first step to removing the bufferhead chaining in the ioend is on the IO submission side. We can build the bio directly as we add the buffers to the ioend chain, thereby removing the need for a later "buffer-to-bio" submission loop. This allows us to submit bios on large ioends as soon as we cannot add more data to the bio. These bios then get captured by the active plug, and hence will be dispatched as soon as either the plug overflows or we schedule away from the writeback context. This will reduce submission latency for large IOs, but will also allow more timely request queue based writeback blocking when the device becomes congested. Signed-off-by: Dave Chinner <dchinner@redhat.com> [hch: various small updates] Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-04-06xfs: collapse cases in xfs_attr3_leaf_list_intEric Sandeen1-17/+17
Consolidate the 2 calls to ->put_listent in xfs_attr3_leaf_list_int(), by setting up name, namelen, and valuelen for the local vs remote cases, then call ->put_listent and do the error handling all in one spot. Signed-off-by: Eric Sandeen <sandeen@redhat.com>
2016-04-06xfs: remove put_value from attr ->put_listent contextEric Sandeen2-29/+3
The put_value context member is never set; remove it and the conditional test in xfs_attr3_leaf_list_int(). Signed-off-by: Eric Sandeen <sandeen@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-04-06xfs: don't pass value into attr ->put_listentEric Sandeen3-15/+8
The value is not used; only names and value lengths are returned. Remove the argument. Signed-off-by: Eric Sandeen <sandeen@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-04-06xfs: only return -errno or success from attr ->put_listentEric Sandeen3-9/+14
Today, the put_listent formatters return either 1 or 0; if they return 1, some callers treat this as an error and return it up the stack, despite "1" not being a valid (negative) error code. The intent seems to be that if the input buffer is full, we set seen_enough or set count = -1, and return 1; but some callers check the return before checking the seen_enough or count fields of the context. Fix this by only returning non-zero for actual errors encountered, and rely on the caller to first check the return value, then check the values in the context to decide what to do. Signed-off-by: Eric Sandeen <sandeen@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Dave Chinner <david@fromorbit.com>
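The calling convention described here (return 0 or a negative errno, and signal "buffer full" only through the context) can be sketched as below. The context layout, the formatter, and the caller are simplified, hypothetical stand-ins for the XFS attr list code.

```c
#include <errno.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical listing context. */
struct list_ctx {
    char  *buf;
    size_t bufsize;
    size_t used;
    int    seen_enough;  /* set when the output buffer is full */
};

/*
 * Formatter: returns 0 on success, including the "buffer full" case,
 * and a negative errno only for a real error.
 */
int put_listent(struct list_ctx *ctx, const char *name)
{
    size_t len;

    if (name == NULL)
        return -EINVAL;

    len = strlen(name) + 1;
    if (ctx->used + len > ctx->bufsize) {
        ctx->seen_enough = 1;   /* not an error: just stop listing */
        return 0;
    }
    memcpy(ctx->buf + ctx->used, name, len);
    ctx->used += len;
    return 0;
}

/* Caller: check the return value first, then the context state. */
int list_names(struct list_ctx *ctx, const char *const *names, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        int error = put_listent(ctx, names[i]);

        if (error)
            return error;
        if (ctx->seen_enough)
            break;
    }
    return 0;
}
```

With this split, a positive value such as 1 never travels up the stack as if it were an error code.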
2016-04-06xfs: optimize inline symlinksChristoph Hellwig4-15/+49
By overallocating the in-core inode fork data buffer and zero terminating the link target in xfs_init_local_fork we can avoid the memory allocation in ->follow_link. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
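The overallocation trick can be shown with a minimal userspace sketch: allocate one byte more than the stored link target and terminate it, so the buffer can later be handed out as a C string without copying. The function name is hypothetical and `malloc` stands in for the kernel allocator; this is not the body of xfs_init_local_fork.

```c
#include <stdlib.h>
#include <string.h>

/*
 * Copy `size` bytes of on-disk link target into a buffer that is one
 * byte larger than needed, and NUL-terminate it. Returns NULL if the
 * allocation fails.
 */
char *init_local_symlink(const void *data, size_t size)
{
    char *buf = malloc(size + 1);

    if (buf == NULL)
        return NULL;
    if (size > 0)
        memcpy(buf, data, size);
    buf[size] = '\0';   /* the extra byte makes this a valid C string */
    return buf;
}
```

The cost is one extra byte per inline symlink, paid once at fork initialization rather than an allocation on every lookup.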
2016-04-06xfs: use ->readlink to implement the readlink_by_handle ioctlChristoph Hellwig2-17/+2
Also drop the now unused readlink_copy export. [dchinner: use d_inode(dentry) rather than dentry->d_inode] Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-04-06xfs: set up inode operation vectors laterChristoph Hellwig3-23/+42
In the next patch we'll set up different inode operations for inline vs out of line symlinks, for that we need to make sure the flags are already set up properly. [dchinner: added xfs_setup_iops() call to xfs_rename_alloc_whiteout()] Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-04-06xfs: factor out a helper to initialize a local format inode forkChristoph Hellwig4-34/+36
Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-04-06xfs: add missing break in xfs_parseargs()Eryu Guan1-0/+1
Commit 2e74af0e1189 ("xfs: convert mount option parsing to tokens") missed a 'break;' in xfs_parseargs() which causes mount to fail with "-o pqnoenforce" option when mounting a v4 filesystem. xfs/050 catches this failure: XFS (vda6): Super block does not support project and group quota together Fixes: 2e74af0e1189 ("xfs: convert mount option parsing to tokens") Signed-off-by: Eryu Guan <guaneryu@gmail.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Eric Sandeen <sandeen@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
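The effect of a missing `break` in a token switch can be illustrated with the sketch below, which shows the corrected shape. The flag names, token names, and the assumption that the project-quota case sits directly above a group-quota case are hypothetical; the commit message does not say which case the fall-through reached.

```c
/* Hypothetical quota flags. */
#define QF_PQUOTA_ACCT (1u << 0)
#define QF_PQUOTA_ENFD (1u << 1)
#define QF_GQUOTA_ACCT (1u << 2)
#define QF_GQUOTA_ENFD (1u << 3)

/* Hypothetical mount option tokens. */
enum opt_token { OPT_PQNOENFORCE, OPT_GQUOTA };

/* Apply one parsed option to the flag word and return the result. */
unsigned int apply_quota_opt(unsigned int flags, enum opt_token tok)
{
    switch (tok) {
    case OPT_PQNOENFORCE:
        flags |= QF_PQUOTA_ACCT;
        flags &= ~QF_PQUOTA_ENFD;
        break;  /* without this, control falls into the next case */
    case OPT_GQUOTA:
        flags |= QF_GQUOTA_ACCT | QF_GQUOTA_ENFD;
        break;
    }
    return flags;
}
```

If the first `break` were dropped, a project-quota option would also switch on group quota, which is the kind of combination the quoted error message rejects.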
2016-04-06xfs: Don't wrap growfs AGFL indexesDave Chinner1-2/+2
Commit 96f859d ("libxfs: pack the agfl header structure so XFS_AGFL_SIZE is correct") allowed the freelist to use the empty slot at the end of the freelist on 64 bit systems that was not being used due to sizeof() rounding up the structure size. This has caused versions of xfs_repair prior to 4.5.0 (which also has the fix) to report this as a corruption once the filesystem has been grown. Older kernels can also have problems (seen from a whacky container/vm management environment) mounting filesystems grown on a system with a newer kernel than the vm/container it is deployed on. To avoid this problem, change the initial free list indexes not to wrap across the end of the AGFL, hence avoiding the initialisation of agf_fllast to the last index in the AGFL. cc: <stable@vger.kernel.org> # 4.4-4.5 Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-04-06xfs: disallow rw remount on fs with unknown ro-compat featuresEric Sandeen1-0/+10
Today, a kernel which refuses to mount a filesystem read-write due to unknown ro-compat features can still transition to read-write via the remount path. The old kernel is most likely none the wiser, because it's unaware of the new feature, and isn't using it. However, writing to the filesystem may well corrupt metadata related to that new feature, and moving to a newer kernel which understands the feature will have problems. Right now the only ro-compat feature we have is the free inode btree, which showed up in v3.16. It would be good to push this back to all the active stable kernels, I think, so that if anyone is using newer mkfs (which enables the finobt feature) with older kernel releases, they'll be protected. Cc: <stable@vger.kernel.org> # 3.10.x- Signed-off-by: Eric Sandeen <sandeen@redhat.com> Reviewed-by: Bill O'Donnell <billodo@redhat.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
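The remount check can be sketched as a simple mask test. The bit value, the macro names, the function name, and the choice of -EINVAL as the error are assumptions for illustration and are not taken from the patch.

```c
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical bit assignment for the one known ro-compat feature. */
#define ROCOMPAT_FINOBT (1u << 0)
#define ROCOMPAT_KNOWN  (ROCOMPAT_FINOBT)

/*
 * Refuse a read-only to read-write remount when the superblock carries
 * ro-compat feature bits this kernel does not understand.
 */
int check_remount_rw(uint32_t sb_ro_compat, bool currently_ro, bool want_rw)
{
    if (currently_ro && want_rw && (sb_ro_compat & ~ROCOMPAT_KNOWN))
        return -EINVAL;   /* assumed error code */
    return 0;
}
```

The same mask test that blocks an initial read-write mount is applied on the remount path, closing the gap the commit describes.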
2016-03-26Linux 4.6-rc1v4.6-rc1Linus Torvalds1-2/+2
2016-03-26Merge branch 'for-linus' of ↵Linus Torvalds22-519/+811
git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client Pull Ceph updates from Sage Weil: "There is quite a bit here, including some overdue refactoring and cleanup on the mon_client and osd_client code from Ilya, scattered writeback support for CephFS and a pile of bug fixes from Zheng, and a few random cleanups and fixes from others" [ I already decided not to pull this because of it having been rebased recently, but ended up changing my mind after all. Next time I'll really hold people to it. Oh well. - Linus ] * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client: (34 commits) libceph: use KMEM_CACHE macro ceph: use kmem_cache_zalloc rbd: use KMEM_CACHE macro ceph: use lookup request to revalidate dentry ceph: kill ceph_get_dentry_parent_inode() ceph: fix security xattr deadlock ceph: don't request vxattrs from MDS ceph: fix mounting same fs multiple times ceph: remove unnecessary NULL check ceph: avoid updating directory inode's i_size accidentally ceph: fix race during filling readdir cache libceph: use sizeof_footer() more ceph: kill ceph_empty_snapc ceph: fix a wrong comparison ceph: replace CURRENT_TIME by current_fs_time() ceph: scattered page writeback libceph: add helper that duplicates last extent operation libceph: enable large, variable-sized OSD requests libceph: osdc->req_mempool should be backed by a slab pool libceph: make r_request msg_size calculation clearer ...
2016-03-26Merge tag 'ofs-pull-tag-1' of ↵Linus Torvalds33-0/+11243
git://git.kernel.org/pub/scm/linux/kernel/git/hubcap/linux Pull orangefs filesystem from Mike Marshall. This finally merges the long-pending orangefs filesystem, which has been much cleaned up with input from Al Viro over the last six months. From the documentation file: "OrangeFS is an LGPL userspace scale-out parallel storage system. It is ideal for large storage problems faced by HPC, BigData, Streaming Video, Genomics, Bioinformatics. Orangefs, originally called PVFS, was first developed in 1993 by Walt Ligon and Eric Blumer as a parallel file system for Parallel Virtual Machine (PVM) as part of a NASA grant to study the I/O patterns of parallel programs. Orangefs features include: - Distributes file data among multiple file servers - Supports simultaneous access by multiple clients - Stores file data and metadata on servers using local file system and access methods - Userspace implementation is easy to install and maintain - Direct MPI support - Stateless" see Documentation/filesystems/orangefs.txt for more in-depth details. 
* tag 'ofs-pull-tag-1' of git://git.kernel.org/pub/scm/linux/kernel/git/hubcap/linux: (174 commits) orangefs: fix orangefs_superblock locking orangefs: fix do_readv_writev() handling of error halfway through orangefs: have ->kill_sb() evict the VFS side of things first orangefs: sanitize ->llseek() orangefs-bufmap.h: trim unused junk orangefs: saner calling conventions for getting a slot orangefs_copy_{to,from}_bufmap(): don't pass bufmap pointer orangefs: get rid of readdir_handle_s ornagefs: ensure that truncate has an up to date inode size orangefs: move code which sets i_link to orangefs_inode_getattr orangefs: remove needless wrapper around GFP_KERNEL orangefs: remove wrapper around mutex_lock(&inode->i_mutex) orangefs: refactor inode type or link_target change detection orangefs: use new getattr for revalidate and remove old getattr orangefs: use new getattr in inode getattr and permission orangefs: use new orangefs_inode_getattr to get size in write and llseek orangefs: use new orangefs_inode_getattr to create new inodes orangefs: rename orangefs_inode_getattr to orangefs_inode_old_getattr orangefs: remove inode->i_lock wrapper orangefs: put register_chrdev immediately before register_filesystem ...
2016-03-26Merge tag 'ntb-4.6' of git://github.com/jonmason/ntbLinus Torvalds4-70/+79
Pull NTB bug fixes from Jon Mason: "NTB bug fixes for tasklet from spinning forever, link errors, translation window setup, NULL ptr dereference, and ntb-perf errors. Also, a modification to the driver API that makes _addr functions optional" * tag 'ntb-4.6' of git://github.com/jonmason/ntb: NTB: Remove _addr functions from ntb_hw_amd NTB: Make _addr functions optional in the API NTB: Fix incorrect clean up routine in ntb_perf NTB: Fix incorrect return check in ntb_perf ntb: fix possible NULL dereference ntb: add missing setup of translation window ntb: stop link work when we do not have memory ntb: stop tasklet from spinning forever during shutdown. ntb: perf test: fix address space confusion