summaryrefslogtreecommitdiffstats
path: root/fs/xfs/libxfs/xfs_alloc.c
AgeCommit message (Collapse)AuthorFilesLines
2017-06-27xfs: remove unneeded parameter from XFS_TEST_ERRORDarrick J. Wong1-4/+2
Since we moved the injected error frequency controls to the mountpoint, we can get rid of the last argument to XFS_TEST_ERROR. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
2017-06-19xfs: export various function for the online scrubberDarrick J. Wong1-1/+1
Export various internal functions so that the online scrubber can use them to check the state of metadata. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Brian Foster <bfoster@redhat.com>
2017-04-03xfs: create a function to query all records in a btreeDarrick J. Wong1-0/+15
Create a helper function that will query all records in a btree. This will be used by the online repair functions to examine every record in a btree to rebuild a second btree. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Brian Foster <bfoster@redhat.com>
2017-04-03xfs: provide a query_range function for freespace btreesDarrick J. Wong1-0/+42
Implement a query_range function for the bnobt and cntbt. This will be used for getfsmap fallback if there is no rmapbt and by the online scrub and repair code. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Brian Foster <bfoster@redhat.com>
2017-02-17xfs: remove XFS_ALLOCTYPE_ANY_AG and XFS_ALLOCTYPE_START_AGChristoph Hellwig1-14/+2
XFS_ALLOCTYPE_ANY_AG was only used for the RT allocator and is unused now, and XFS_ALLOCTYPE_START_AG has been unused for a while. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2017-02-09xfs: improve handling of busy extents in the low-level allocatorChristoph Hellwig1-44/+49
Currently we force the log and simply try again if we hit a busy extent, but especially with online discard enabled it might take a while after the log force for the busy extents to disappear, and we might have already completed our second pass. So instead we add a new waitqueue and a generation counter to the pag structure so that we can do wakeups once we've removed busy extents, and we replace the single retry with an unconditional one - after all we hold the AGF buffer lock, so no other allocations or frees can be racing with us in this AG. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2017-01-09xfs: don't rely on ->total in xfs_alloc_space_availableChristoph Hellwig1-3/+4
->total is a bit of an odd parameter passed down to the low-level allocator all the way from the high-level callers. It's supposed to contain the maximum number of blocks to be allocated for the whole transaction [1]. But in xfs_iomap_write_allocate we only convert existing delayed allocations and thus only have a minimal block reservation for the current transaction, so xfs_alloc_space_available can't use it for the allocation decisions. Use the maximum of args->total and the calculated block requirement to make a decision. We probably should get rid of args->total eventually and instead apply ->minleft more broadly, but that will require some extensive changes all over. [1] which creates lots of confusion as most callers don't decrement it once doing a first allocation. But that's for a separate series. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2017-01-09xfs: adjust allocation length in xfs_alloc_space_availableChristoph Hellwig1-64/+17
We must decide in xfs_alloc_fix_freelist if we can perform an allocation from a given AG is possible or not based on the available space, and should not fail the allocation past that point on a healthy file system. But currently we have two additional places that second-guess xfs_alloc_fix_freelist: xfs_alloc_ag_vextent tries to adjust the maxlen parameter to remove the reservation before doing the allocation (but ignores the various minium freespace requirements), and xfs_alloc_fix_minleft tries to fix up the allocated length after we've found an extent, but ignores the reservations and also doesn't take the AGFL into account (and thus fails allocations for not matching minlen in some cases). Remove all these later fixups and just correct the maxlen argument inside xfs_alloc_fix_freelist once we have the AGF buffer locked. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2017-01-09xfs: fix bogus minleft manipulationsChristoph Hellwig1-17/+7
We can't just set minleft to 0 when we're low on space - that's exactly what we need minleft for: to protect space in the AG for btree block allocations when we are low on free space. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2017-01-09xfs: bump up reserved blocks in xfs_alloc_set_asideChristoph Hellwig1-4/+1
Setting aside 4 blocks globally for bmbt splits isn't all that useful, as different threads can allocate space in parallel. Bump it to 4 blocks per AG to allow each thread that is currently doing an allocation to dip into it separately. Without that we may no have enough reserved blocks if there are enough parallel transactions in an almost out space file system that all run into bmap btree splits. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2016-12-05xfs: forbid AG btrees with level == 0Darrick J. Wong1-3/+7
There is no such thing as a zero-level AG btree since even a single-node zero-records btree has one level. Btree cursor constructors read cur_nlevels straight from disk and then access things like cur_bufs[cur_nlevels - 1] which is /really/ bad if cur_nlevels is zero! Therefore, strengthen the verifiers to prevent this possibility. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-10-03xfs: reserve AG space for the refcount btree rootDarrick J. Wong1-0/+2
Reduce the max AG usable space size so that we always have space for the refcount btree root. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de>
2016-10-03xfs: add refcount btree operationsDarrick J. Wong1-0/+3
Implement the generic btree operations required to manipulate refcount btree blocks. The implementation is similar to the bmapbt, though it will only allocate and free blocks from the AG. Since the refcount root and level fields are separate from the existing roots and levels array, they need a separate logging flag. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> [hch: fix logging of AGF refcount btree fields] Signed-off-by: Christoph Hellwig <hch@lst.de>
2016-10-03xfs: refcount btree add more reserved blocksDarrick J. Wong1-0/+13
Since XFS reserves a small amount of space in each AG as the minimum free space needed for an operation, save some more space in case we touch the refcount btree. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de>
2016-10-03xfs: introduce refcount btree definitionsDarrick J. Wong1-0/+5
Add new per-AG refcount btree definitions to the per-AG structures. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Christoph Hellwig <hch@lst.de>
2016-10-03Merge branch 'xfs-4.9-log-recovery-fixes' into for-nextDave Chinner1-11/+12
2016-09-26xfs: remote attribute blocks aren't really userdataDave Chinner1-11/+12
When adding a new remote attribute, we write the attribute to the new extent before the allocation transaction is committed. This means we cannot reuse busy extents as that violates crash consistency semantics. Hence we currently treat remote attribute extent allocation like userdata because it has the same overwrite ordering constraints as userdata. Unfortunately, this also allows the allocator to incorrectly apply extent size hints to the remote attribute extent allocation. This results in interesting failures, such as transaction block reservation overruns and in-memory inode attribute fork corruption. To fix this, we need to separate the busy extent reuse configuration from the userdata configuration. This changes the definition of XFS_BMAPI_METADATA slightly - it now means that allocation is metadata and reuse of busy extents is acceptible due to the metadata ordering semantics of the journal. If this flag is not set, it means the allocation is that has unordered data writeback, and hence busy extent reuse is not allowed. It no longer implies the allocation is for user data, just that the data write will not be strictly ordered. This matches the semantics for both user data and remote attribute block allocation. As such, This patch changes the "userdata" field to a "datatype" field, and adds a "no busy reuse" flag to the field. When we detect an unordered data extent allocation, we immediately set the no reuse flag. We then set the "user data" flags based on the inode fork we are allocating the extent to. Hence we only set userdata flags on data fork allocations now and consider attribute fork remote extents to be an unordered metadata extent. The result is that remote attribute extents now have the expected allocation semantics, and the data fork allocation behaviour is completely unchanged. It should be noted that there may be other ways to fix this (e.g. use ordered metadata buffers for the remote attribute extent data write) but they are more invasive and difficult to validate both from a design and implementation POV. Hence this patch takes the simple, obvious route to fixing the problem... Reported-and-tested-by: Ross Zwisler <ross.zwisler@linux.intel.com> Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-09-19xfs: set up per-AG free space reservationsDarrick J. Wong1-34/+78
One unfortunate quirk of the reference count and reverse mapping btrees -- they can expand in size when blocks are written to *other* allocation groups if, say, one large extent becomes a lot of tiny extents. Since we don't want to start throwing errors in the middle of CoWing, we need to reserve some blocks to handle future expansion. The transaction block reservation counters aren't sufficient here because we have to have a reserve of blocks in every AG, not just somewhere in the filesystem. Therefore, create two per-AG block reservation pools. One feeds the AGFL so that rmapbt expansion always succeeds, and the other feeds all other metadata so that refcountbt expansion never fails. Use the count of how many reserved blocks we need to have on hand to create a virtual reservation in the AG. Through selective clamping of the maximum length of allocation requests and of the length of the longest free extent, we can make it look like there's less free space in the AG unless the reservation owner is asking for blocks. In other words, play some accounting tricks in-core to make sure that we always have blocks available. On the plus side, there's nothing to clean up if we crash, which is contrast to the strategy that the rough draft used (actually removing extents from the freespace btrees). Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-08-26xfs: don't log the entire end of the AGFDarrick J. Wong1-0/+2
When we're logging the last non-spare field in the AGF, we don't need to log the spare fields, so plumb in a new AGF logging flag to help us avoid that. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-08-17xfs: remove OWN_AG rmap when allocating a block from the AGFLDarrick J. Wong1-0/+13
When we're really tight on space, xfs_alloc_ag_vextent_small() can allocate a block from the AGFL and give it to the caller. Since the caller is never the AGFL-fixing method, we must remove the OWN_AG reverse mapping because it will clash with whatever rmap the caller wants to set up. This bug was discovered by running generic/299 repeatedly. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-08-17xfs: store rmapbt block count in the AGFDarrick J. Wong1-0/+1
Track the number of blocks used for the rmapbt in the AGF. When we get to the AG reservation code we need this counter to quickly make our reservation during mount. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-08-03xfs: move (and rename) the deferred bmap-free tracepointsDarrick J. Wong1-2/+0
Rename the deferred bmap-free to extent_free and make them only trigger when we're really running deferred ops. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-08-03xfs: don't update rmapbt when fixing agflDarrick J. Wong1-4/+14
Allow a caller of xfs_alloc_fix_freelist to disable rmapbt updates when fixing the AG freelist. xfs_repair needs this during phase 5 to be able to adjust the freelist while it's reconstructing the rmap btree; the missing entries will be added back at the very end of phase 5 once the AGFL contents settle down. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-08-03xfs: rmap btree requires more reserved free spaceDarrick J. Wong1-0/+69
Originally-From: Dave Chinner <dchinner@redhat.com> The rmap btree is allocated from the AGFL, which means we have to ensure ENOSPC is reported to userspace before we run out of free space in each AG. The last allocation in an AG can cause a full height rmap btree split, and that means we have to reserve at least this many blocks *in each AG* to be placed on the AGFL at ENOSPC. Update the various space calculation functions to handle this. Also, because the macros are now executing conditional code and are called quite frequently, convert them to functions that initialise variables in the struct xfs_mount, use the new variables everywhere and document the calculations better. [darrick.wong@oracle.com: don't reserve blocks if !rmap] [dchinner@redhat.com: update m_ag_max_usable after growfs] Signed-off-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-08-03xfs: introduce rmap extent operation stubsDarrick J. Wong1-2/+17
Originally-From: Dave Chinner <dchinner@redhat.com> Add the stubs into the extent allocation and freeing paths that the rmap btree implementation will hook into. While doing this, add the trace points that will be used to track rmap btree extent manipulations. [darrick.wong@oracle.com: Extend the stubs to take full owner info.] Signed-off-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-08-03xfs: add owner field to extent allocation and freeingDarrick J. Wong1-10/+16
For the rmap btree to work, we have to feed the extent owner information to the the allocation and freeing functions. This information is what will end up in the rmap btree that tracks allocated extents. While we technically don't need the owner information when freeing extents, passing it allows us to validate that the extent we are removing from the rmap btree actually belonged to the owner we expected it to belong to. We also define a special set of owner values for internal metadata that would otherwise have no owner. This allows us to tell the difference between metadata owned by different per-ag btrees, as well as static fs metadata (e.g. AG headers) and internal journal blocks. There are also a couple of special cases we need to take care of - during EFI recovery, we don't actually know who the original owner was, so we need to pass a wildcard to indicate that we aren't checking the owner for validity. We also need special handling in growfs, as we "free" the space in the last AG when extending it, but because it's new space it has no actual owner... While touching the xfs_bmap_add_free() function, re-order the parameters to put the struct xfs_mount first. Extend the owner field to include both the owner type and some sort of index within the owner. The index field will be used to support reverse mappings when reflink is enabled. When we're freeing extents from an EFI, we don't have the owner information available (rmap updates have their own redo items). xfs_free_extent therefore doesn't need to do an rmap update. Make sure that the log replay code signals this correctly. This is based upon a patch originally from Dave Chinner. It has been extended to add more owner information with the intent of helping recovery operations when things go wrong (e.g. offset of user data block in a file). [dchinner: de-shout the xfs_rmap_*_owner helpers] [darrick: minor style fixes suggested by Christoph Hellwig] Signed-off-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-08-03xfs: rmap btree add more reserved blocksDarrick J. Wong1-0/+11
Originally-From: Dave Chinner <dchinner@redhat.com> XFS reserves a small amount of space in each AG for the minimum number of free blocks needed for operation. Adding the rmap btree increases the number of reserved blocks, but it also increases the complexity of the calculation as the free inode btree is optional (like the rmbt). Rather than calculate the prealloc blocks every time we need to check it, add a function to calculate it at mount time and store it in the struct xfs_mount, and convert the XFS_PREALLOC_BLOCKS macro just to use the xfs-mount variable directly. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-08-03xfs: introduce rmap btree definitionsDarrick J. Wong1-0/+6
Originally-From: Dave Chinner <dchinner@redhat.com> Add new per-ag rmap btree definitions to the per-ag structures. The rmap btree will sit in the empty slots on disk after the free space btrees, and hence form a part of the array of space management btrees. This requires the definition of the btree to be contiguous with the free space btrees. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-08-03xfs: add tracepoints and error injection for deferred extent freeingDarrick J. Wong1-0/+7
Add a couple of tracepoints for the deferred extent free operation and a site for injecting errors while finishing the operation. This makes it easier to debug deferred ops and test log redo. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-08-03xfs: rework xfs_bmap_free callers to use xfs_defer_opsDarrick J. Wong1-0/+1
Restructure everything that used xfs_bmap_free to use xfs_defer_ops instead. For now we'll just remove the old symbols and play some cpp magic to make it work; in the next patch we'll actually rename everything. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-06-21Merge branch 'xfs-4.8-misc-fixes-2' into for-nextDave Chinner1-43/+56
2016-06-21xfs: refactor btree maxlevels computationDarrick J. Wong1-13/+2
Create a common function to calculate the maximum height of a per-AG btree. This will eventually be used by the rmapbt and refcountbt code to calculate appropriate maxlevels values for each. This is important because the verifiers and the transaction block reservations depend on accurate estimates of how many blocks are needed to satisfy a btree split. We were mistakenly using the max bnobt height for all the btrees, which creates a dangerous situation since the larger records and keys in an rmapbt make it very possible that the rmapbt will be taller than the bnobt and so we can run out of transaction block reservation. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-06-21xfs: separate freelist fixing into a separate helperDave Chinner1-30/+54
Break up xfs_free_extent() into a helper that fixes the freelist. This helper will be used subsequently to ensure the freelist during deferred rmap processing. [darrick: refactor to put this at the head of the patchset] Signed-off-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-06-01xfs: make several functions staticEric Sandeen1-1/+1
Al Viro noticed that xfs_lock_inodes should be static, and that led to ... a few more. These are just the easy ones, others require moving functions higher in source files, so that's not done here to keep this review simple. Signed-off-by: Eric Sandeen <sandeen@sandeen.net> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-01-04libxfs: make xfs_alloc_fix_freelist non-staticDarrick J. Wong1-1/+1
Since xfs_repair wants to use xfs_alloc_fix_freelist, remove the static designation. xfsprogs already has this; this simply brings the kernel up to date. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-01-04xfs: print name of verifier if it failsEric Sandeen1-0/+2
This adds a name to each buf_ops structure, so that if a verifier fails we can print the type of verifier that failed it. Should be a slight debugging aid, I hope. Signed-off-by: Eric Sandeen <sandeen@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2015-11-03Merge branch 'xfs-dax-updates' into for-nextDave Chinner1-1/+9
2015-11-03xfs: introduce BMAPI_ZERO for allocating zeroed extentsDave Chinner1-1/+9
To enable DAX to do atomic allocation of zeroed extents, we need to drive the block zeroing deep into the allocator. Because xfs_bmapi_write() can return merged extents on allocation that were only partially allocated (i.e. requested range spans allocated and hole regions, allocation into the hole was contiguous), we cannot zero the extent returned from xfs_bmapi_write() as that can overwrite existing data with zeros. Hence we have to drive the extent zeroing into the allocation code, prior to where we merge the extents into the BMBT and return the resultant map. This means we need to propagate this need down to the xfs_alloc_vextent() and issue the block zeroing at this point. While this functionality is being introduced for DAX, there is no reason why it is specific to DAX - we can per-zero blocks during the allocation transaction on any type of device. It's just slow (and usually slower than unwritten allocation and conversion) on traditional block devices so doesn't tend to get used. We can, however, hook hardware zeroing optimisations via sb_issue_zeroout() to this operation, so it may be useful in future and hence the "allocate zeroed blocks" API needs to be implementation neutral. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2015-10-12Merge branch 'xfs-logging-fixes' into for-nextDave Chinner1-3/+9
2015-10-12xfs: per-filesystem stats counter implementationBill O'Donnell1-4/+4
This patch modifies the stats counting macros and the callers to those macros to properly increment, decrement, and add-to the xfs stats counts. The counts for global and per-fs stats are correctly advanced, and cleared by writing a "1" to the corresponding clear file. global counts: /sys/fs/xfs/stats/stats per-fs counts: /sys/fs/xfs/sda*/stats/stats global clear: /sys/fs/xfs/stats/stats_clear per-fs clear: /sys/fs/xfs/sda*/stats/stats_clear [dchinner: cleaned up macro variables, removed CONFIG_FS_PROC around stats structures and macros. ] Signed-off-by: Bill O'Donnell <billodo@redhat.com> Reviewed-by: Eric Sandeen <sandeen@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2015-10-12xfs: validate metadata LSNs against log on v5 superblocksBrian Foster1-3/+9
Since the onset of v5 superblocks, the LSN of the last modification has been included in a variety of on-disk data structures. This LSN is used to provide log recovery ordering guarantees (e.g., to ensure an older log recovery item is not replayed over a newer target data structure). While this works correctly from the point a filesystem is formatted and mounted, userspace tools have some problematic behaviors that defeat this mechanism. For example, xfs_repair historically zeroes out the log unconditionally (regardless of whether corruption is detected). If this occurs, the LSN of the filesystem is reset and the log is now in a problematic state with respect to on-disk metadata structures that might have a larger LSN. Until either the log catches up to the highest previously used metadata LSN or each affected data structure is modified and written out without incident (which resets the metadata LSN), log recovery is susceptible to filesystem corruption. This problem is ultimately addressed and repaired in the associated userspace tools. The kernel is still responsible to detect the problem and notify the user that something is wrong. Check the superblock LSN at mount time and fail the mount if it is invalid. From that point on, trigger verifier failure on any metadata I/O where an invalid LSN is detected. This results in a filesystem shutdown and guarantees that we do not log metadata changes with invalid LSNs on disk. Since this is a known issue with a known recovery path, present a warning to instruct the user how to recover. Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2015-08-25xfs: Fix uninitialized return value in xfs_alloc_fix_freelist()Jan Kara1-1/+1
xfs_alloc_fix_freelist() can sometimes jump to out_agbp_relse without ever setting value of 'error' variable which is then returned. This can happen e.g. when pag->pagf_init is set but AG is for metadata and we want to allocate user data. Fix the problem by initializing 'error' to 0, which is the desired return value when we decide to skip this group. CC: xfs@oss.sgi.com Coverity-id: 1309714 Signed-off-by: Jan Kara <jack@suse.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2015-07-29xfs: create new metadata UUID field and incompat flagEric Sandeen1-2/+2
This adds a new superblock field, sb_meta_uuid. If set, along with a new incompat flag, the code will use that field on a V5 filesystem to compare to metadata UUIDs, which allows us to change the user- visible UUID at will. Userspace handles the setting and clearing of the incompat flag as appropriate, as the UUID gets changed; i.e. setting the user-visible UUID back to the original UUID (as stored in the new field) will remove the incompatible feature flag. If the incompat flag is not set, this copies the user-visible UUID into into the meta_uuid slot in memory when the superblock is read from disk; the meta_uuid field is not written back to disk in this case. The remainder of this patch simply switches verifiers, initializers, etc to use the new sb_meta_uuid field. Signed-off-by: Eric Sandeen <sandeen@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2015-06-23Merge branch 'xfs-freelist-cleanup' into for-nextDave Chinner1-107/+132
2015-06-22xfs: clean up XFS_MIN_FREELIST macrosDave Chinner1-3/+19
We no longer calculate the minimum freelist size from the on-disk AGF, so we don't need the macros used for this. That means the nested macros can be cleaned up, and turn this into an actual function so the logic is clear and concise. This will make it much easier to add support for the rmap btree when the time comes. This also gets rid of the XFS_AG_MAXLEVELS macro used by these freelist macros as it is simply a wrapper around a single variable. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2015-06-22xfs: sanitise error handling in xfs_alloc_fix_freelistDave Chinner1-56/+55
The error handling is currently an inconsistent mess as every error condition handles return values and releasing buffers individually. Clean this up by using gotos and a sane error label stack. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2015-06-22xfs: factor out free space extent length checkDave Chinner1-27/+44
The longest extent length checks in xfs_alloc_fix_freelist() are now essentially identical. Factor them out into a helper function, so we know they are checking exactly the same thing before and after we lock the AGF. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2015-06-22xfs: xfs_alloc_fix_freelist() can use incore perag structuresDave Chinner1-38/+31
At the moment, xfs_alloc_fix_freelist() uses a mix of per-ag based access and agf buffer based access to freelist and space usage information. However, once the AGF buffer is locked inside this function, it is guaranteed that both the in-memory and on-disk values are identical. xfs_alloc_fix_freelist() doesn't modify the values in the structures directly, so it is a read-only user of the infomration, and hence can use the per-ag structure exclusively for determining what it should do. This opens up an avenue for cleaning up a lot of duplicated logic whose only difference is the structure it gets the data from, and in doing so removes a lot of needless byte swapping overhead when fixing up the free list. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2015-05-29xfs: support min/max agbno args in block allocatorBrian Foster1-5/+37
The block allocator supports various arguments to tweak block allocation behavior and set allocation requirements. The sparse inode chunk feature introduces a new requirement not supported by the current arguments. Sparse inode allocations must convert or merge into an inode record that describes a fixed length chunk (64 inodes x inodesize). Full inode chunk allocations by definition always result in valid inode records. Sparse chunk allocations are smaller and the associated records can refer to blocks not owned by the inode chunk. This model can result in invalid inode records in certain cases. For example, if a sparse allocation occurs near the start of an AG, the aligned inode record for that chunk might refer to agbno 0. If an allocation occurs towards the end of the AG and the AG size is not aligned, the inode record could refer to blocks beyond the end of the AG. While neither of these scenarios directly result in corruption, they both insert invalid inode records and at minimum cause repair to complain, are unlikely to merge into full chunks over time and set land mines for other areas of code. To guarantee sparse inode chunk allocation creates valid inode records, support the ability to specify an agbno range limit for XFS_ALLOCTYPE_NEAR_BNO block allocations. The min/max agbno's are specified in the allocation arguments and limit the block allocation algorithms to that range. The starting 'agbno' hint is clamped to the range if the specified agbno is out of range. If no sufficient extent is available within the range, the allocation fails. For backwards compatibility, the min/max fields can be initialized to 0 to disable range limiting (e.g., equivalent to min=0,max=agsize). Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2015-02-24xfs: xfs_alloc_fix_minleft can underflow near ENOSPCDave Chinner1-1/+3
Test generic/224 is failing with a corruption being detected on one of Michael's test boxes. Debug that Michael added is indicating that the minleft trimming is resulting in an underflow: ..... before fixup: rlen 1 args->len 0 after xfs_alloc_fix_len : rlen 1 args->len 1 before goto out_nominleft: rlen 1 args->len 0 before fixup: rlen 1 args->len 0 after xfs_alloc_fix_len : rlen 1 args->len 1 after fixup: rlen 1 args->len 1 before fixup: rlen 1 args->len 0 after xfs_alloc_fix_len : rlen 1 args->len 1 after fixup: rlen 4294967295 args->len 4294967295 XFS: Assertion failed: fs_is_ok, file: fs/xfs/libxfs/xfs_alloc.c, line: 1424 The "goto out_nominleft:" indicates that we are getting close to ENOSPC in the AG, and a couple of allocations later we underflow and the corruption check fires in xfs_alloc_ag_vextent_size(). The issue is that the extent length fixups comaprisons are done with variables of xfs_extlen_t types. These are unsigned so an underflow looks like a really big value and hence is not detected as being smaller than the minimum length allowed for the extent. Hence the corruption check fires as it is noticing that the returned length is longer than the original extent length passed in. This can be easily fixed by ensuring we do the underflow test on signed values, the same way xfs_alloc_fix_len() prevents underflow. So we realise in future that these casts prevent underflows from going undetected, add comments to the code indicating this. Reported-by: Michael L. Semon <mlsemon35@gmail.com> Tested-by: Michael L. Semon <mlsemon35@gmail.com> Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>