summaryrefslogtreecommitdiffstats
path: root/fs/xfs/xfs_mount.h
AgeCommit message (Collapse)AuthorFilesLines
2021-06-02xfs: move perag structure and setup to libxfs/xfs_ag.[ch]Dave Chinner1-108/+3
Move the xfs_perag infrastructure to the libxfs files that contain all the per AG infrastructure. This helps set up for passing perags around all the code instead of bare agnos with minimal extra includes for existing files. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org>
2021-06-02xfs: prepare for moving perag definitions and support to libxfsDave Chinner1-9/+10
The perag structures really need to be defined with the rest of the AG support infrastructure. The struct xfs_perag and init/teardown has been placed in xfs_mount.[ch] because there are differences in the structure between kernel and userspace. Mainly that userspace doesn't have a lot of the internal stuff that the kernel has for caches and discard and other such structures. However, it makes more sense to move this to libxfs than to keep this separation because we are now moving to use struct perags everywhere in the code instead of passing raw agnumber_t values about. Hence we shoudl really move the support infrastructure to libxfs/xfs_ag.[ch]. To do this without breaking userspace, first we need to rearrange the structures and code so that all the kernel specific code is located together. This makes it simple for userspace to ifdef out the all the parts it does not need, minimising the code differences between kernel and userspace. The next commit will do the move... Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org>
2021-04-29xfs: introduce in-core global counter of allocbt blocksBrian Foster1-0/+6
Introduce an in-core counter to track the sum of all allocbt blocks used by the filesystem. This value is currently tracked per-ag via the ->agf_btreeblks field in the AGF, which also happens to include rmapbt blocks. A global, in-core count of allocbt blocks is required to identify the subset of global ->m_fdblocks that consists of unavailable blocks currently used for allocation btrees. To support this calculation at block reservation time, construct a similar global counter for allocbt blocks, populate it on first read of each AGF and update it as allocbt blocks are used and released. Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Chandan Babu R <chandanrlinux@gmail.com> Reviewed-by: Allison Henderson <allison.henderson@oracle.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2021-03-25xfs: rename the blockgc workqueueDarrick J. Wong1-1/+1
Since we're about to start using the blockgc workqueue to dispose of inactivated inodes, strip the "block" prefix from the name; now it's merely the general garbage collection (gc) workqueue. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
2021-02-03xfs: parallelize block preallocation garbage collectionDarrick J. Wong1-2/+3
Split the block preallocation garbage collection work into per-AG work items so that we can take advantage of parallelization. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
2021-02-03xfs: consolidate the eofblocks and cowblocks workersDarrick J. Wong1-4/+2
Remove the separate cowblocks work items and knob so that we can control and run everything from a single blockgc work queue. Note that the speculative_prealloc_lifetime sysfs knob retains its historical name even though the functions move to prefix xfs_blockgc_*. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
2021-01-22xfs: fold sbcount quiesce logging into log coveringBrian Foster1-1/+0
xfs_log_sbcount() calls xfs_sync_sb() to sync superblock counters to disk when lazy superblock accounting is enabled. This occurs on unmount, freeze, and read-only (re)mount and ensures the final values are calculated and persisted to disk before each form of quiesce completes. Now that log covering occurs in all of these contexts and uses the same xfs_sync_sb() mechanism to update log state, there is no need to log the superblock separately for any reason. Update the log quiesce path to sync the superblock at least once for any mount where lazy superblock accounting is enabled. If the log is already covered, it will remain in the covered state. Otherwise, the next sync as part of the normal covering sequence will carry the associated superblock update with it. Remove xfs_log_sbcount() now that it is no longer needed. Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Allison Henderson <allison.henderson@oracle.com>
2020-09-15xfs: remove xfs_getsbChristoph Hellwig1-1/+0
Merge xfs_getsb into its only caller, and clean that one up a little bit as well. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2020-07-07xfs: allow multiple reclaimers per AGDave Chinner1-1/+0
Inode reclaim will still throttle direct reclaim on the per-ag reclaim locks. This is no longer necessary as reclaim can run non-blocking now. Hence we can remove these locks so that we don't arbitrarily block reclaimers just because there are more direct reclaimers than there are AGs. This can result in multiple reclaimers working on the same range of an AG, but this doesn't cause any apparent issues. Optimising the spread of concurrent reclaimers for best efficiency can be done in a future patchset. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2020-05-27xfs: remove the m_active_trans counterDave Chinner1-1/+0
It's a global atomic counter, and we are hitting it at a rate of half a million transactions a second, so it's bouncing the counter cacheline all over the place on large machines. We don't actually need it anymore - it used to be required because the VFS freeze code could not track/prevent filesystem transactions that were running, but that problem no longer exists. Hence to remove the counter, we simply have to ensure that nothing calls xfs_sync_sb() while we are trying to quiesce the filesytem. That only happens if the log worker is still running when we call xfs_quiesce_attr(). The log worker is cancelled at the end of xfs_quiesce_attr() by calling xfs_log_quiesce(), so just call it early here and then we can remove the counter altogether. Concurrent create, 50 million inodes, identical 16p/16GB virtual machines on different physical hosts. Machine A has twice the CPU cores per socket of machine B: unpatched patched machine A: 3m16s 2m00s machine B: 4m04s 4m05s Create rates: unpatched patched machine A: 282k+/-31k 468k+/-21k machine B: 231k+/-8k 233k+/-11k Concurrent rm of same 50 million inodes: unpatched patched machine A: 6m42s 2m33s machine B: 4m47s 4m47s The transaction rate on the fast machine went from just under 300k/sec to 700k/sec, which indicates just how much of a bottleneck this atomic counter was. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2020-05-27xfs: separate read-only variables in struct xfs_mountDave Chinner1-66/+82
Seeing massive cpu usage from xfs_agino_range() on one machine; instruction level profiles look similar to another machine running the same workload, only one machine is consuming 10x as much CPU as the other and going much slower. The only real difference between the two machines is core count per socket. Both are running identical 16p/16GB virtual machine configurations Machine A: 25.83% [k] xfs_agino_range 12.68% [k] __xfs_dir3_data_check 6.95% [k] xfs_verify_ino 6.78% [k] xfs_dir2_data_entry_tag_p 3.56% [k] xfs_buf_find 2.31% [k] xfs_verify_dir_ino 2.02% [k] xfs_dabuf_map.constprop.0 1.65% [k] xfs_ag_block_count And takes around 13 minutes to remove 50 million inodes. Machine B: 13.90% [k] __pv_queued_spin_lock_slowpath 3.76% [k] do_raw_spin_lock 2.83% [k] xfs_dir3_leaf_check_int 2.75% [k] xfs_agino_range 2.51% [k] __raw_callee_save___pv_queued_spin_unlock 2.18% [k] __xfs_dir3_data_check 2.02% [k] xfs_log_commit_cil And takes around 5m30s to remove 50 million inodes. Suspect is cacheline contention on m_sectbb_log which is used in one of the macros in xfs_agino_range. This is a read-only variable but shares a cacheline with m_active_trans which is a global atomic that gets bounced all around the machine. The workload is trying to run hundreds of thousands of transactions per second and hence cacheline contention will be occurring on this atomic counter. Hence xfs_agino_range() is likely just be an innocent bystander as the cache coherency protocol fights over the cacheline between CPU cores and sockets. On machine A, this rearrangement of the struct xfs_mount results in the profile changing to: 9.77% [kernel] [k] xfs_agino_range 6.27% [kernel] [k] __xfs_dir3_data_check 5.31% [kernel] [k] __pv_queued_spin_lock_slowpath 4.54% [kernel] [k] xfs_buf_find 3.79% [kernel] [k] do_raw_spin_lock 3.39% [kernel] [k] xfs_verify_ino 2.73% [kernel] [k] __raw_callee_save___pv_queued_spin_unlock Vastly less CPU usage in xfs_agino_range(), but still 3x the amount of machine B and still runs substantially slower than it should. Current rm -rf of 50 million files: vanilla patched machine A 13m20s 6m42s machine B 5m30s 5m02s It's an improvement, hence indicating that separation and further optimisation of read-only global filesystem data is worthwhile, but it clearly isn't the underlying issue causing this specific performance degradation. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2020-05-27xfs: reduce free inode accounting overheadDave Chinner1-2/+0
Shaokun Zhang reported that XFS was using substantial CPU time in percpu_count_sum() when running a single threaded benchmark on a high CPU count (128p) machine from xfs_mod_ifree(). The issue is that the filesystem is empty when the benchmark runs, so inode allocation is running with a very low inode free count. With the percpu counter batching, this means comparisons when the counter is less that 128 * 256 = 32768 use the slow path of adding up all the counters across the CPUs, and this is expensive on high CPU count machines. The summing in xfs_mod_ifree() is only used to fire an assert if an underrun occurs. The error is ignored by the higher level code. Hence this is really just debug code and we don't need to run it on production kernels, nor do we need such debug checks to return error values just to trigger an assert. Finally, xfs_mod_icount/xfs_mod_ifree are only called from xfs_trans_unreserve_and_mod_sb(), so get rid of them and just directly call the percpu_counter_add/percpu_counter_compare functions. The compare functions are now run only on debug builds as they are internal to ASSERT() checks and so only compiled in when ASSERTs are active (CONFIG_XFS_DEBUG=y or CONFIG_XFS_WARN=y). Reported-by: Shaokun Zhang <zhangshaokun@hisilicon.com> Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2020-05-07xfs: remove unused shutdown typesBrian Foster1-2/+0
Both types control shutdown messaging and neither is used in the current codebase. Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Allison Collins <allison.henderson@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2020-05-04fs/xfs: Make DAX mount option a tri-stateIra Weiny1-0/+1
As agreed upon[1]. We make the dax mount option a tri-state. '-o dax' continues to operate the same. We add 'always', 'never', and 'inode' (default). [1] https://lore.kernel.org/lkml/20200405061945.GA94792@iweiny-DESK2.sc.intel.com/ Signed-off-by: Ira Weiny <ira.weiny@intel.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2020-05-04fs/xfs: Change XFS_MOUNT_DAX to XFS_MOUNT_DAX_ALWAYSIra Weiny1-2/+1
In prep for the new tri-state mount option which then introduces XFS_MOUNT_DAX_NEVER. Reviewed-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Ira Weiny <ira.weiny@intel.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2020-04-16xfs: move inode flush to the sync workqueueDarrick J. Wong1-1/+5
Move the inode dirty data flushing to a workqueue so that multiple threads can take advantage of a single thread's flushing work. The ratelimiting technique used in bdd4ee4 was not successful, because threads that skipped the inode flush scan due to ratelimiting would ENOSPC early, which caused occasional (but noticeable) changes in behavior and sporadic fstest regressions. Therefore, make all the writer threads wait on a single inode flush, which eliminates both the stampeding hordes of flushers and the small window in which a write could fail with ENOSPC because it lost the ratelimit race after even another thread freed space. Fixes: c6425702f21e ("xfs: ratelimit inode flush on buffered write ENOSPC") Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Brian Foster <bfoster@redhat.com>
2020-03-31xfs: ratelimit inode flush on buffered write ENOSPCDarrick J. Wong1-0/+1
A customer reported rcu stalls and softlockup warnings on a computer with many CPU cores and many many more IO threads trying to write to a filesystem that is totally out of space. Subsequent analysis pointed to the many many IO threads calling xfs_flush_inodes -> sync_inodes_sb, which causes a lot of wb_writeback_work to be queued. The writeback worker spends so much time trying to wake the many many threads waiting for writeback completion that it trips the softlockup detector, and (in this case) the system automatically reboots. In addition, they complain that the lengthy xfs_flush_inodes scan traps all of those threads in uninterruptible sleep, which hampers their ability to kill the program or do anything else to escape the situation. If there's thousands of threads trying to write to files on a full filesystem, each of those threads will start separate copies of the inode flush scan. This is kind of pointless since we only need one scan, so rate limit the inode flush. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Dave Chinner <dchinner@redhat.com>
2019-11-13xfs: remove unused structure members & simple typedefsEric Sandeen1-1/+0
Remove some unused typedef'd simple types, and some unused structure members. Signed-off-by: Eric Sandeen <sandeen@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-11-13xfs: devirtualize ->m_dirnameopsChristoph Hellwig1-2/+0
Instead of causing a relatively expensive indirect call for each hashing and comparism of a file name in a directory just use an inline function and a simple branch on the ASCII CI bit. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> [darrick: fix unused variable warning] Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-11-13xfs: remove the unused m_chsize fieldChristoph Hellwig1-1/+0
Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-11-10xfs: remove the now unused dir ops infrastructureChristoph Hellwig1-2/+0
Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-11-10xfs: move the node header size to struct xfs_da_geometryChristoph Hellwig1-1/+0
Move the node header size field to struct xfs_da_geometry, and remove the now unused non-directory dir ops infrastructure. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-11-05xfs: use super s_id instead of struct xfs_mount m_fsnameIan Kent1-1/+0
Eliminate struct xfs_mount field m_fsname by using the super block s_id field directly. Signed-off-by: Ian Kent <raven@themaw.net> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-11-05xfs: remove unused struct xfs_mount field m_fsname_lenIan Kent1-1/+0
The struct xfs_mount field m_fsname_len is not used anywhere, remove it. Signed-off-by: Ian Kent <raven@themaw.net> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-10-29xfs: reverse the polarity of XFS_MOUNT_COMPAT_IOSIZEChristoph Hellwig1-1/+1
Replace XFS_MOUNT_COMPAT_IOSIZE with an inverted XFS_MOUNT_LARGEIO flag that makes the usage more clear. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Eric Sandeen <sandeen@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-10-29xfs: rename the XFS_MOUNT_DFLT_IOSIZE option toChristoph Hellwig1-1/+1
Make the flag match the mount option and usage. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-10-29xfs: simplify parsing of allocsize mount optionChristoph Hellwig1-6/+0
Rework xfs_parseargs to fill out the default value and then parse the option directly into the mount structure, similar to what we do for other updates, and open code the now trivial updates based on on the on-disk superblock directly into xfs_mountfs. Note that this change rejects the allocsize=0 mount option that has been documented as invalid for a long time instead of just ignoring it. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-10-29xfs: rename the m_writeio_* fields in struct xfs_mountChristoph Hellwig1-2/+2
Use the allocsize name to match the mount option and usage instead. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-10-29xfs: remove the m_readio_* fields in struct xfs_mountChristoph Hellwig1-4/+1
m_readio_blocks is entirely unused, and m_readio_blocks is only used in xfs_stat_blksize in a max statements that is a no-op as it always has the same value as m_writeio_log. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-10-29xfs: don't use a different allocsice for -o wsyncChristoph Hellwig1-7/+0
The -o wsync allocsize overwrite overwrite was part of a special hack for NFSv2 servers in IRIX and has no real purpose in modern Linux, so remove it. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-10-29xfs: cleanup calculating the stat optimal I/O sizeChristoph Hellwig1-24/+0
Move xfs_preferred_iosize to xfs_iops.c, unobsfucate it and also handle the realtime special case in the helper. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Eric Sandeen <sandeen@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-08-26xfs: add kmem allocation trace pointsDave Chinner1-7/+0
When trying to correlate XFS kernel allocations to memory reclaim behaviour, it is useful to know what allocations XFS is actually attempting. This information is not directly available from tracepoints in the generic memory allocation and reclaim tracepoints, so these new trace points provide a high level indication of what the XFS memory demand actually is. There is no per-filesystem context in this code, so we just trace the type of allocation, the size and the allocation constraints. The kmem code also doesn't include much of the common XFS headers, so there are a few definitions that need to be added to the trace headers and a couple of types that need to be made common to avoid needing to include the whole world in the kmem code. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-06-28xfs: move the log ioend workqueue to struct xlogChristoph Hellwig1-1/+0
Move the workqueue used for log I/O completions from struct xfs_mount to struct xlog to keep it self contained in the log code. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> [darrick: destroy the log workqueue after ensuring log ios are done] Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-06-12xfs: remove unused flags arg from getsb interfacesEric Sandeen1-1/+1
The flags value is always passed as 0 so remove the argument. Signed-off-by: Eric Sandeen <sandeen@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-06-12xfs: separate inode geometryDarrick J. Wong1-16/+3
Separate the inode geometry information into a distinct structure. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Dave Chinner <dchinner@redhat.com>
2019-04-26xfs: track delayed allocation reservations across the filesystemDarrick J. Wong1-0/+7
Add a percpu counter to track the number of blocks directly reserved for delayed allocations on the data device. This counter (in contrast to i_delayed_blks) does not track allocated CoW staging extents or anything going on with the realtime device. It will be used in the upcoming summary counter scrub function to check the free block counts without having to freeze the filesystem or walk all the inodes to find the delayed allocations. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Dave Chinner <dchinner@redhat.com>
2019-04-16xfs: remove unused m_data_workqueueDarrick J. Wong1-1/+0
Now that we're no longer using m_data_workqueue, remove it. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Brian Foster <bfoster@redhat.com>
2019-04-14xfs: replace the BAD_SUMMARY mount flag with the equivalent health codeDarrick J. Wong1-1/+0
Replace the BAD_SUMMARY mount flag with calls to the equivalent health tracking code. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Brian Foster <bfoster@redhat.com>
2019-04-14xfs: track metadata health statusDarrick J. Wong1-0/+23
Add the necessary in-core metadata fields to keep track of which parts of the filesystem have been observed and which parts were observed to be unhealthy, and print a warning at unmount time if we have unfixed problems. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Brian Foster <bfoster@redhat.com>
2019-02-21xfs: introduce an always_cow modeChristoph Hellwig1-0/+1
Add a mode where XFS never overwrites existing blocks in place. This is to aid debugging our COW code, and also put infatructure in place for things like possible future support for zoned block devices, which can't support overwrites. This mode is enabled globally by doing a: echo 1 > /sys/fs/xfs/debug/always_cow Note that the parameter is global to allow running all tests in xfstests easily in this mode, which would not easily be possible with a per-fs sysfs file. In always_cow mode persistent preallocations are disabled, and fallocate will fail when called with a 0 mode (with our without FALLOC_FL_KEEP_SIZE), and not create unwritten extent for zeroed space when called with FALLOC_FL_ZERO_RANGE or FALLOC_FL_UNSHARE_RANGE. There are a few interesting xfstests failures when run in always_cow mode: - generic/392 fails because the bytes used in the file used to test hole punch recovery are less after the log replay. This is because the blocks written and then punched out are only freed with a delay due to the logging mechanism. - xfs/170 will fail as the already fragile file streams mechanism doesn't seem to interact well with the COW allocator - xfs/180 xfs/182 xfs/192 xfs/198 xfs/204 and xfs/208 will claim the file system is badly fragmented, but there is not much we can do to avoid that when always writing out of place - xfs/205 fails because overwriting a file in always_cow mode will require new space allocation and the assumption in the test thus don't work anymore. - xfs/326 fails to modify the file at all in always_cow mode after injecting the refcount error, leading to an unexpected md5sum after the remount, but that again is expected Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-02-14xfs: rename m_inotbt_nores to m_finobt_noresDarrick J. Wong1-1/+1
Rename this flag variable to imply more strongly that it's related to the free inode btree (finobt) operation. No functional changes. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com>
2019-02-11xfs: cache unlinked pointers in an rhashtableDarrick J. Wong1-0/+7
Use a rhashtable to cache the unlinked list incore. This should speed up unlinked processing considerably when there are a lot of inodes on the unlinked list because iunlink_remove no longer has to traverse an entire bucket list to find which inode points to the one being removed. The incore list structure records "X.next_unlinked = Y" relations, with the rhashtable using Y to index the records. This makes finding the inode X that points to a inode Y very quick. If our cache fails to find anything we can always fall back on the old method. FWIW this drastically reduces the amount of time it takes to remove inodes from the unlinked list. I wrote a program to open a lot of O_TMPFILE files and then close them in the same order, which takes a very long time if we have to traverse the unlinked lists. With the ptach, I see: + /d/t/tmpfile/tmpfile Opened 193531 files in 6.33s. Closed 193531 files in 5.86s real 0m12.192s user 0m0.064s sys 0m11.619s + cd / + umount /mnt real 0m0.050s user 0m0.004s sys 0m0.030s And without the patch: + /d/t/tmpfile/tmpfile Opened 193588 files in 6.35s. Closed 193588 files in 751.61s real 12m38.853s user 0m0.084s sys 12m34.470s + cd / + umount /mnt real 0m0.086s user 0m0.000s sys 0m0.060s Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Brian Foster <bfoster@redhat.com>
2018-12-12xfs: cache minimum realtime summary levelOmar Sandoval1-0/+7
The realtime summary is a two-dimensional array on disk, effectively: u32 rsum[log2(number of realtime extents) + 1][number of blocks in the bitmap] rsum[log][bbno] is the number of extents of size 2**log which start in bitmap block bbno. xfs_rtallocate_extent_near() uses xfs_rtany_summary() to check whether rsum[log][bbno] != 0 for any log level. However, the summary array is stored in row-major order (i.e., like an array in C), so all of these entries are not adjacent, but rather spread across the entire summary file. In the worst case (a full bitmap block), xfs_rtany_summary() has to check every level. This means that on a moderately-used realtime device, an allocation will waste a lot of time finding, reading, and releasing buffers for the realtime summary. In particular, one of our storage services (which runs on servers with 8 very slow CPUs and 15 8 TB XFS realtime filesystems) spends almost 5% of its CPU cycles in xfs_rtbuf_get() and xfs_trans_brelse() called from xfs_rtany_summary(). One solution would be to also store the summary with the dimensions swapped. However, this would require a disk format change to a very old component of XFS. Instead, we can cache the minimum size which contains any extents. We do so lazily; rather than guaranteeing that the cache contains the precise minimum, it always contains a loose lower bound which we tighten when we read or update a summary block. This only uses a few kilobytes of memory and is already serialized via the realtime bitmap and summary inode locks, so the cost is minimal. With this change, the same workload only spends 0.2% of its CPU cycles in the realtime allocator. Signed-off-by: Omar Sandoval <osandov@fb.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-12-12xfs: precalculate cluster alignment in inodes and blocksDarrick J. Wong1-0/+2
Store the inode cluster alignment information in units of inodes and blocks in the mount data so that we don't have to keep recalculating them. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Brian Foster <bfoster@redhat.com>
2018-12-12xfs: precalculate inodes and blocks per inode clusterDarrick J. Wong1-0/+2
Store the number of inodes and blocks per inode cluster in the mount data so that we don't have to keep recalculating them. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Brian Foster <bfoster@redhat.com>
2018-07-26xfs: remove deprecated barrier/nobarrier mountEric Sandeen1-1/+0
The barrier mount options have been no-ops and deprecated since 4cf4573 xfs: deprecate barrier/nobarrier mount option i.e. kernel 4.10 / December 2016, with a stated deprecation schedule after v4.15. Should be fair game to remove them now. Signed-off-by: Eric Sandeen <sandeen@redhat.com> Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-07-23xfs: force summary counter recalc at next mountDarrick J. Wong1-0/+1
Use the "bad summary count" mount flag from the previous patch to skip writing the unmount record to force log recovery at the next mount, which will recalculate the summary counters for us. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de>
2018-07-23xfs: detect and fix bad summary counts at mountDarrick J. Wong1-0/+1
Filippo Giunchedi complained that xfs doesn't even perform basic sanity checks of the fs summary counters at mount time. Therefore, recalculate the summary counters from the AGFs after log recovery if the counts were bad (or we had to recover the fs). Enhance the recalculation routine to fail the mount entirely if the new values are also obviously incorrect. We use a mount state flag to record the "bad summary count" state so that the (subsequent) online fsck patches can detect subtlely incorrect counts and set the flag; clear it userspace asks for a repair; or force a recalculation at the next mount if nobody fixes it by unmount time. Reported-by: Filippo Giunchedi <fgiunchedi@wikimedia.org> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de>
2018-06-08xfs: clean up MIN/MAXDave Chinner1-1/+1
Get rid of the MIN/MAX macros and just use the native min/max macros directly in the XFS code. Signed-Off-By: Dave Chinner <dchinner@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-06-06xfs: convert to SPDX license tagsDave Chinner1-13/+1
Remove the verbose license text from XFS files and replace them with SPDX tags. This does not change the license of any of the code, merely refers to the common, up-to-date license files in LICENSES/ This change was mostly scripted. fs/xfs/Makefile and fs/xfs/libxfs/xfs_fs.h were modified by hand, the rest were detected and modified by the following command: for f in `git grep -l "GNU General" fs/xfs/` ; do echo $f cat $f | awk -f hdr.awk > $f.new mv -f $f.new $f done And the hdr.awk script that did the modification (including detecting the difference between GPL-2.0 and GPL-2.0+ licenses) is as follows: $ cat hdr.awk BEGIN { hdr = 1.0 tag = "GPL-2.0" str = "" } /^ \* This program is free software/ { hdr = 2.0; next } /any later version./ { tag = "GPL-2.0+" next } /^ \*\// { if (hdr > 0.0) { print "// SPDX-License-Identifier: " tag print str print $0 str="" hdr = 0.0 next } print $0 next } /^ \* / { if (hdr > 1.0) next if (hdr > 0.0) { if (str != "") str = str "\n" str = str $0 next } print $0 next } /^ \*/ { if (hdr > 0.0) next print $0 next } // { if (hdr > 0.0) { if (str != "") str = str "\n" str = str $0 next } print $0 } END { } $ Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>