|
Pull XFS updates from Darrick Wong:
"Here are some changes for you for 4.13. For the most part it's fixes
for bugs and deadlock problems, and preparation for online fsck in
some future merge window.
- Avoid quotacheck deadlocks
- Fix transaction overflows when bunmapping fragmented files
- Refactor directory readahead
- Allow admin to configure if ASSERT is fatal
- Improve transaction usage detail logging during overflows
- Minor cleanups
- Don't leak log items when the log shuts down
- Remove double-underscore typedefs
- Various preparation for online scrubbing
- Introduce new error injection configuration sysfs knobs
- Refactor dq_get_next to use extent map directly
- Fix problems with iterating the page cache for unwritten data
- Implement SEEK_{HOLE,DATA} via iomap
- Refactor XFS to use iomap SEEK_HOLE and SEEK_DATA
- Don't use MAXPATHLEN to check on-disk symlink target lengths"
* tag 'xfs-4.13-merge-5' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux: (48 commits)
xfs: don't crash on unexpected holes in dir/attr btrees
xfs: rename MAXPATHLEN to XFS_SYMLINK_MAXLEN
xfs: fix contiguous dquot chunk iteration livelock
xfs: Switch to iomap for SEEK_HOLE / SEEK_DATA
vfs: Add iomap_seek_hole and iomap_seek_data helpers
vfs: Add page_cache_seek_hole_data helper
xfs: remove a whitespace-only line from xfs_fs_get_nextdqblk
xfs: rewrite xfs_dq_get_next_id using xfs_iext_lookup_extent
xfs: Check for m_errortag initialization in xfs_errortag_test
xfs: grab dquots without taking the ilock
xfs: fix semicolon.cocci warnings
xfs: Don't clear SGID when inheriting ACLs
xfs: free cowblocks and retry on buffered write ENOSPC
xfs: replace log_badcrc_factor knob with error injection tag
xfs: convert drop_writes to use the errortag mechanism
xfs: remove unneeded parameter from XFS_TEST_ERROR
xfs: expose errortag knobs via sysfs
xfs: make errortag a per-mountpoint structure
xfs: free uncommitted transactions during log recovery
xfs: don't allow bmap on rt files
...
|
|
Pull core block/IO updates from Jens Axboe:
"This is the main pull request for the block layer for 4.13. Not a huge
round in terms of features, but there's a lot of churn related to some
core cleanups.
Note this depends on the UUID tree pull request, that Christoph
already sent out.
This pull request contains:
- A series from Christoph, unifying the error/stats codes in the
block layer. We now use blk_status_t everywhere, instead of using
different schemes for different places.
- Also from Christoph, some cleanups around request allocation and IO
scheduler interactions in blk-mq.
- And yet another series from Christoph, cleaning up how we handle
and do bounce buffering in the block layer.
- A blk-mq debugfs series from Bart, further improving on the support
we have for exporting internal information to aid debugging IO
hangs or stalls.
- Also from Bart, a series that cleans up the request initialization
differences across types of devices.
- A series from Goldwyn Rodrigues, allowing the block layer to return
failure if we will block and the user asked for non-blocking.
- Patch from Hannes for supporting setting loop devices block size to
that of the underlying device.
- Two series of patches from Javier, fixing various issues with
lightnvm, particularly around pblk.
- A series from me, adding support for write hints. This comes with
NVMe support as well, so applications can help guide data placement
on flash to improve performance, latencies, and write
amplification.
- A series from Ming, improving and hardening blk-mq support for
stopping/starting and quiescing hardware queues.
- Two pull requests for NVMe updates. Nothing major on the feature
side, but lots of cleanups and bug fixes. From the usual crew.
- A series from Neil Brown, greatly improving the bio rescue set
support. Most notably, this kills the bio rescue work queues, if we
don't really need them.
- Lots of other little bug fixes that are all over the place"
* 'for-4.13/block' of git://git.kernel.dk/linux-block: (217 commits)
lightnvm: pblk: set line bitmap check under debug
lightnvm: pblk: verify that cache read is still valid
lightnvm: pblk: add initialization check
lightnvm: pblk: remove target using async. I/Os
lightnvm: pblk: use vmalloc for GC data buffer
lightnvm: pblk: use right metadata buffer for recovery
lightnvm: pblk: schedule if data is not ready
lightnvm: pblk: remove unused return variable
lightnvm: pblk: fix double-free on pblk init
lightnvm: pblk: fix bad le64 assignations
nvme: Makefile: remove dead build rule
blk-mq: map all HWQ also in hyperthreaded system
nvmet-rdma: register ib_client to not deadlock in device removal
nvme_fc: fix error recovery on link down.
nvmet_fc: fix crashes on bad opcodes
nvme_fc: Fix crash when nvme controller connection fails.
nvme_fc: replace ioabort msleep loop with completion
nvme_fc: fix double calls to nvme_cleanup_cmd()
nvme-fabrics: verify that a controller returns the correct NQN
nvme: simplify nvme_dev_attrs_are_visible
...
|
|
Reviewed-by: Andreas Dilger <adilger@dilger.ca>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
bmap returns a dumb LBA address but not the block device that goes with
that LBA. Swapfiles don't care about this and will blindly assume that
the data volume is the correct blockdev, which is totally bogus for
files on the rt subvolume. This results in the swap code doing IOs to
arbitrary locations on the data device(!) if the passed in mapping is a
realtime file, so just turn off bmap for rt files.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
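The problem described above can be sketched in a small user-space model.
This is only an illustration, not the XFS code: struct model_inode and
model_bmap() are hypothetical names, and the "one contiguous extent"
layout is an assumption made to keep the sketch short. The point is that
a bare block number carries no device information, so the lookup reports
"no mapping" (0) for realtime files.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified inode: only the fields this sketch needs. */
struct model_inode {
	bool     is_realtime;  /* file data lives on the rt subvolume */
	uint64_t first_block;  /* assume the file is one contiguous extent */
};

/*
 * Model of a bmap-style lookup. Returning 0 means "no mapping", which
 * callers such as swapfile activation have to treat as unusable.
 */
uint64_t model_bmap(const struct model_inode *ip, uint64_t file_block)
{
	assert(ip != NULL);

	/* The caller would assume the data device, which is wrong here. */
	if (ip->is_realtime)
		return 0;

	return ip->first_block + file_block;
}
```

With this model, a lookup on a regular file yields a block number, while
the same lookup on a realtime file yields 0.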
|
|
This is a purely mechanical patch that removes the private
__{u,}int{8,16,32,64}_t typedefs in favor of using the system
{u,}int{8,16,32,64}_t typedefs. This is the sed script used to perform
the transformation and fix the resulting whitespace and indentation
errors:
s/typedef\t__uint8_t/typedef __uint8_t\t/g
s/typedef\t__uint/typedef __uint/g
s/typedef\t__int\([0-9]*\)_t/typedef int\1_t\t/g
s/__uint8_t\t/__uint8_t\t\t/g
s/__uint/uint/g
s/__int\([0-9]*\)_t\t/__int\1_t\t\t/g
s/__int/int/g
/^typedef.*int[0-9]*_t;$/d
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
|
|
Replace bi_error with a new bi_status to allow for a clear conversion.
Note that device mapper overloaded bi_error with a private value, which
we'll have to keep around at least for now and thus propagate to a
proper blk_status_t value.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
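A rough sketch of what a dedicated status type buys, written as a
user-space model. The type model_status_t, the MODEL_STS_* values and both
helper names are hypothetical; the kernel's blk_status_t uses its own
values and its own conversion helpers. The idea shown is only that errno
values are translated at the boundary, and that anything without a
dedicated status (for example a driver-private value) collapses into a
generic I/O error.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical status type; not the kernel's blk_status_t values. */
typedef unsigned char model_status_t;

enum {
	MODEL_STS_OK = 0,
	MODEL_STS_NOSPC,
	MODEL_STS_TIMEOUT,
	MODEL_STS_IOERR,
};

/* Translate a negative errno (or 0) into a status code. */
model_status_t model_errno_to_status(int err)
{
	assert(err <= 0);

	switch (err) {
	case 0:          return MODEL_STS_OK;
	case -ENOSPC:    return MODEL_STS_NOSPC;
	case -ETIMEDOUT: return MODEL_STS_TIMEOUT;
	default:         return MODEL_STS_IOERR;  /* no dedicated status */
	}
}

/* Translate a status code back into a negative errno (or 0). */
int model_status_to_errno(model_status_t sts)
{
	switch (sts) {
	case MODEL_STS_OK:      return 0;
	case MODEL_STS_NOSPC:   return -ENOSPC;
	case MODEL_STS_TIMEOUT: return -ETIMEDOUT;
	default:                return -EIO;
	}
}
```

Round-tripping -ENOSPC preserves it, while an errno without a matching
status comes back as -EIO.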
|
|
Pull xfs updates from Darrick Wong:
"Here are the XFS changes for 4.12. The big new feature for this
release is the new space mapping ioctl that we've been discussing
since LSF2016, but other than that most of the patches are larger bug
fixes, memory corruption prevention, and other cleanups.
Summary:
- various code cleanups
- introduce GETFSMAP ioctl
- various refactoring
- avoid dio reads past eof
- fix memory corruption and other errors with fragmented directory blocks
- fix accidental userspace memory corruptions
- publish fs uuid in superblock
- make fstrim terminatable
- fix race between quotaoff and in-core inode creation
- avoid use-after-free when finishing up w/ buffer heads
- reserve enough space to handle bmap tree resizing during cow remap"
* tag 'xfs-4.12-merge-7' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux: (53 commits)
xfs: fix use-after-free in xfs_finish_page_writeback
xfs: reserve enough blocks to handle btree splits when remapping
xfs: wait on new inodes during quotaoff dquot release
xfs: update ag iterator to support wait on new inodes
xfs: support ability to wait on new inodes
xfs: publish UUID in struct super_block
xfs: Allow user to kill fstrim process
xfs: better log intent item refcount checking
xfs: fix up quotacheck buffer list error handling
xfs: remove xfs_trans_ail_delete_bulk
xfs: don't use bool values in trace buffers
xfs: fix getfsmap userspace memory corruption while setting OF_LAST
xfs: fix __user annotations for xfs_ioc_getfsmap
xfs: corruption needs to respect endianess too!
xfs: use NULL instead of 0 to initialize a pointer in xfs_ioc_getfsmap
xfs: use NULL instead of 0 to initialize a pointer in xfs_getfsmap
xfs: simplify validation of the unwritten extent bit
xfs: remove unused values from xfs_exntst_t
xfs: remove the unused XFS_MAXLINK_1 define
xfs: more do_div cleanups
...
|
|
Commit 28b783e47ad7 ("xfs: bufferhead chains are invalid after
end_page_writeback") fixed one use-after-free issue by
pre-calculating the loop conditionals before calling bh->b_end_io()
in the end_io processing loop, but it assigned 'next' pointer before
checking end offset boundary & breaking the loop, at which point the
bh might be freed already, and caused use-after-free.
This is caught by KASAN when running fstests generic/127 on sub-page
block size XFS.
[ 2517.244502] run fstests generic/127 at 2017-04-27 07:30:50
[ 2747.868840] ==================================================================
[ 2747.876949] BUG: KASAN: use-after-free in xfs_destroy_ioend+0x3d3/0x4e0 [xfs] at addr ffff8801395ae698
...
[ 2747.918245] Call Trace:
[ 2747.920975] dump_stack+0x63/0x84
[ 2747.924673] kasan_object_err+0x21/0x70
[ 2747.928950] kasan_report+0x271/0x530
[ 2747.933064] ? xfs_destroy_ioend+0x3d3/0x4e0 [xfs]
[ 2747.938409] ? end_page_writeback+0xce/0x110
[ 2747.943171] __asan_report_load8_noabort+0x19/0x20
[ 2747.948545] xfs_destroy_ioend+0x3d3/0x4e0 [xfs]
[ 2747.953724] xfs_end_io+0x1af/0x2b0 [xfs]
[ 2747.958197] process_one_work+0x5ff/0x1000
[ 2747.962766] worker_thread+0xe4/0x10e0
[ 2747.966946] kthread+0x2d3/0x3d0
[ 2747.970546] ? process_one_work+0x1000/0x1000
[ 2747.975405] ? kthread_create_on_node+0xc0/0xc0
[ 2747.980457] ? syscall_return_slowpath+0xe6/0x140
[ 2747.985706] ? do_page_fault+0x30/0x80
[ 2747.989887] ret_from_fork+0x2c/0x40
[ 2747.993874] Object at ffff8801395ae690, in cache buffer_head size: 104
[ 2748.001155] Allocated:
[ 2748.003782] PID = 8327
[ 2748.006411] save_stack_trace+0x1b/0x20
[ 2748.010688] save_stack+0x46/0xd0
[ 2748.014383] kasan_kmalloc+0xad/0xe0
[ 2748.018370] kasan_slab_alloc+0x12/0x20
[ 2748.022648] kmem_cache_alloc+0xb8/0x1b0
[ 2748.027024] alloc_buffer_head+0x22/0xc0
[ 2748.031399] alloc_page_buffers+0xd1/0x250
[ 2748.035968] create_empty_buffers+0x30/0x410
[ 2748.040730] create_page_buffers+0x120/0x1b0
[ 2748.045493] __block_write_begin_int+0x17a/0x1800
[ 2748.050740] iomap_write_begin+0x100/0x2f0
[ 2748.055308] iomap_zero_range_actor+0x253/0x5c0
[ 2748.060362] iomap_apply+0x157/0x270
[ 2748.064347] iomap_zero_range+0x5a/0x80
[ 2748.068624] iomap_truncate_page+0x6b/0xa0
[ 2748.073227] xfs_setattr_size+0x1f7/0xa10 [xfs]
[ 2748.078312] xfs_vn_setattr_size+0x68/0x140 [xfs]
[ 2748.083589] xfs_file_fallocate+0x4ac/0x820 [xfs]
[ 2748.088838] vfs_fallocate+0x2cf/0x780
[ 2748.093021] SyS_fallocate+0x48/0x80
[ 2748.097006] do_syscall_64+0x18a/0x430
[ 2748.101186] return_from_SYSCALL_64+0x0/0x6a
[ 2748.105948] Freed:
[ 2748.108189] PID = 8327
[ 2748.110816] save_stack_trace+0x1b/0x20
[ 2748.115093] save_stack+0x46/0xd0
[ 2748.118788] kasan_slab_free+0x73/0xc0
[ 2748.122969] kmem_cache_free+0x7a/0x200
[ 2748.127247] free_buffer_head+0x41/0x80
[ 2748.131524] try_to_free_buffers+0x178/0x250
[ 2748.136316] xfs_vm_releasepage+0x2e9/0x3d0 [xfs]
[ 2748.141563] try_to_release_page+0x100/0x180
[ 2748.146325] invalidate_inode_pages2_range+0x7da/0xcf0
[ 2748.152087] xfs_shift_file_space+0x37d/0x6e0 [xfs]
[ 2748.157557] xfs_collapse_file_space+0x49/0x120 [xfs]
[ 2748.163223] xfs_file_fallocate+0x2a7/0x820 [xfs]
[ 2748.168462] vfs_fallocate+0x2cf/0x780
[ 2748.172642] SyS_fallocate+0x48/0x80
[ 2748.176629] do_syscall_64+0x18a/0x430
[ 2748.180810] return_from_SYSCALL_64+0x0/0x6a
Fix it by checking the offset against end and breaking out first, and
dereferencing bh only if there are still bufferheads to process.
Signed-off-by: Eryu Guan <eguan@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
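The loop ordering described in the fix can be sketched with a user-space
model. All names here (struct model_bh, struct model_page,
model_finish_range() and so on) are hypothetical, and "freeing" is only
simulated with a flag so the sketch stays safe to run. The block size is
read once up front, the range check comes first, and the next pointer is
loaded only after that check has passed.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MODEL_NR_BH 4

struct model_page;

struct model_bh {
	struct model_bh   *next;   /* circular list, like b_this_page */
	struct model_page *page;
	size_t             size;
	bool               freed;  /* simulated "released" marker */
};

struct model_page {
	int             pending;   /* buffers still under writeback */
	struct model_bh bh[MODEL_NR_BH];
};

void model_page_init(struct model_page *pg, size_t bsize, int pending)
{
	for (int i = 0; i < MODEL_NR_BH; i++) {
		pg->bh[i].next = &pg->bh[(i + 1) % MODEL_NR_BH];
		pg->bh[i].page = pg;
		pg->bh[i].size = bsize;
		pg->bh[i].freed = false;
	}
	pg->pending = pending;
}

/* Completing the last pending buffer lets the whole chain go away. */
static void model_end_io(struct model_bh *bh)
{
	struct model_page *pg = bh->page;

	if (--pg->pending == 0)
		for (int i = 0; i < MODEL_NR_BH; i++)
			pg->bh[i].freed = true;
}

/* Complete every buffer whose offset lies in [start, end]. */
int model_finish_range(struct model_bh *head, size_t start, size_t end)
{
	struct model_bh *bh = head, *next;
	size_t bsize = head->size;  /* read once, while head is valid */
	size_t off = 0;
	int completed = 0;

	do {
		if (off > end)
			break;          /* stop before touching bh at all */
		assert(!bh->freed);     /* a stale dereference would trip here */
		next = bh->next;
		if (off >= start) {
			model_end_io(bh);
			completed++;
		}
		off += bsize;
	} while ((bh = next) != head);

	return completed;
}
```

Moving the range check below the next = bh->next load would make the
model dereference a buffer already marked as freed, which is the ordering
the fix avoids.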
|
|
xfs has defined PF_FSTRANS to declare a scoped GFP_NOFS semantic quite
some time ago. We would like to make this concept more generic and use
it for other filesystems as well. Let's start by giving the flag a more
generic name PF_MEMALLOC_NOFS which is in line with an existing
PF_MEMALLOC_NOIO already used for the same purpose for GFP_NOIO
contexts. Replace all PF_FSTRANS usage from the xfs code in the first
step before we introduce a full API for it as xfs uses the flag directly
anyway.
This patch doesn't introduce any functional change.
Link: http://lkml.kernel.org/r/20170306131408.9828-4-mhocko@kernel.org
Signed-off-by: Michal Hocko <mhocko@suse.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Theodore Ts'o <tytso@mit.edu>
Cc: Chris Mason <clm@fb.com>
Cc: David Sterba <dsterba@suse.cz>
Cc: Jan Kara <jack@suse.cz>
Cc: Nikolay Borisov <nborisov@suse.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
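What a scoped NOFS flag means can be sketched as a user-space model. This
patch only renames the flag, so the save/restore helpers below are
hypothetical, as are struct model_task and the bit values. The sketch
shows the concept: while the per-task flag is set, every allocation in
that scope has the "may recurse into the filesystem" bit masked off, and
nested scopes restore the previous state on exit.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical bit values chosen for this sketch only. */
#define MODEL_PF_MEMALLOC_NOFS 0x1u  /* per-task scope flag */
#define MODEL_GFP_FS           0x2u  /* allocation may enter the fs */

struct model_task {
	unsigned int flags;
};

/* Enter a NOFS scope; returns the previous state of the flag. */
unsigned int model_nofs_save(struct model_task *t)
{
	unsigned int old = t->flags & MODEL_PF_MEMALLOC_NOFS;

	t->flags |= MODEL_PF_MEMALLOC_NOFS;
	return old;
}

/* Leave a NOFS scope, restoring whatever the outer scope had set. */
void model_nofs_restore(struct model_task *t, unsigned int old)
{
	assert((old & ~MODEL_PF_MEMALLOC_NOFS) == 0);
	t->flags = (t->flags & ~MODEL_PF_MEMALLOC_NOFS) | old;
}

/* Mask the allocation flags according to the current task scope. */
unsigned int model_effective_gfp(const struct model_task *t,
				 unsigned int gfp)
{
	assert(t != NULL);
	if (t->flags & MODEL_PF_MEMALLOC_NOFS)
		gfp &= ~MODEL_GFP_FS;
	return gfp;
}
```

Because the saved value is passed back on restore, an inner scope ending
does not clear the flag for an outer scope that is still active.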
|
|
Opencoding the trivial checks makes it much easier to read (and grep..).
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
|
|
This checks for all the non-normal extent types, including handling both
encodings of delayed allocations.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
|
|
There are two different cases of buffered I/O errors:
- first we can have an already shutdown fs. In that case we should skip
any on-disk operations and just clean up the append transaction if
present and destroy the ioend
- a real I/O error. In that case we should cleanup any lingering COW
blocks. This gets skipped in the current code and is fixed by this
patch.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
|
|
We only want to reclaim preallocations from our periodic work item.
Currently this is achieved by looking for a dirty inode, but that check
is rather fragile. Instead add a flag to xfs_reflink_cancel_cow_* so
that the caller can ask for just cancelling unwritten extents in the COW
fork.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
[darrick: fix typos in commit message]
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
|
|
Replace all 1 << inode->i_blkbits and (1 << inode->i_blkbits) in fs
branch.
This patch also fixes multiple checkpatch warnings: WARNING: Prefer
'unsigned int' to bare use of 'unsigned'
Thanks to Andrew Morton for suggesting a more appropriate function
instead of a macro.
[geliangtang@gmail.com: truncate: use i_blocksize()]
Link: http://lkml.kernel.org/r/9c8b2cd83c8f5653805d43debde9fa8817e02fc4.1484895804.git.geliangtang@gmail.com
Link: http://lkml.kernel.org/r/1481319905-10126-1-git-send-email-fabf@skynet.be
Signed-off-by: Fabian Frederick <fabf@skynet.be>
Signed-off-by: Geliang Tang <geliangtang@gmail.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
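The replacement amounts to wrapping the open-coded shift in a small typed
helper. A minimal user-space sketch follows, with a hypothetical
struct model_inode standing in for the real inode and the helper renamed
model_i_blocksize() to make clear it is not the kernel definition:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the one inode field used here. */
struct model_inode {
	unsigned char i_blkbits;  /* log2 of the block size */
};

/* Block size in bytes, replacing open-coded "1 << inode->i_blkbits". */
static inline unsigned int model_i_blocksize(const struct model_inode *node)
{
	assert(node != NULL && node->i_blkbits < 32);
	return 1U << node->i_blkbits;
}
```

A function gives the result a fixed unsigned type and a single name to
grep for, which a macro or a repeated shift expression does not.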
|
|
Christoph Hellwig pointed out that there's a potentially nasty race when
performing simultaneous nearby directio cow writes:
"Thread 1 writes a range from B to C
" B --------- C
p
"a little later thread 2 writes from A to B
" A --------- B
p
[editor's note: the 'p' marks denote cowextsize boundaries, which I added to
make this more clear]
"but the code preallocates beyond B into the range where thread
"1 has just written, but ->end_io hasn't been called yet.
"But once ->end_io is called thread 2 has already allocated
"up to the extent size hint into the write range of thread 1,
"so the end_io handler will splice the uninitialized blocks from
"that preallocation back into the file right after B."
We can avoid this race by ensuring that thread 1 cannot accidentally
remap the blocks that thread 2 allocated (as part of speculative
preallocation) as part of t2's write preparation in t1's end_io handler.
The way we make this happen is by taking advantage of the unwritten
extent flag as an intermediate step.
Recall that when we begin the process of writing data to shared blocks,
we create a delayed allocation extent in the CoW fork:
D: --RRRRRRSSSRRRRRRRR---
C: ------DDDDDDD---------
When a thread prepares to CoW some dirty data out to disk, it will now
convert the delalloc reservation into an /unwritten/ allocated extent in
the cow fork. The da conversion code tries to opportunistically
allocate as much of a (speculatively prealloc'd) extent as possible, so
we may end up allocating a larger extent than we're actually writing
out:
D: --RRRRRRSSSRRRRRRRR---
U: ------UUUUUUU---------
Next, we convert only the part of the extent that we're actively
planning to write to normal (i.e. not unwritten) status:
D: --RRRRRRSSSRRRRRRRR---
U: ------UURRUUU---------
If the write succeeds, the end_cow function will now scan the relevant
range of the CoW fork for real extents and remap only the real extents
into the data fork:
D: --RRRRRRRRSRRRRRRRR---
U: ------UU--UUU---------
This ensures that we never obliterate valid data fork extents with
unwritten blocks from the CoW fork.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
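The state diagrams above can be replayed with a small user-space model in
which each fork is a string with one character per block ('-' hole,
'R' real, 'S' shared, 'U' unwritten). The function names are hypothetical
and this is not how XFS stores extents; the sketch only shows that
remapping skips unwritten CoW blocks, so preallocation from another
writer never lands in the data fork.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Mark the blocks we are about to write as real in the CoW fork. */
void model_convert_cow(char *cow, size_t start, size_t len)
{
	assert(start + len <= strlen(cow));

	for (size_t i = start; i < start + len; i++)
		if (cow[i] == 'U')
			cow[i] = 'R';
}

/*
 * After a successful write, move only the real CoW blocks in the
 * scanned range into the data fork. Unwritten blocks stay behind.
 */
void model_end_cow(char *data, char *cow, size_t start, size_t len)
{
	assert(strlen(data) == strlen(cow));
	assert(start + len <= strlen(cow));

	for (size_t i = start; i < start + len; i++) {
		if (cow[i] != 'R')
			continue;
		data[i] = 'R';  /* the new private block replaces the old */
		cow[i] = '-';   /* and leaves the CoW fork */
	}
}
```

Even when the scanned range is wider than the write itself, only the two
blocks that were converted to real status are remapped.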
|
|
Commit 99579ccec4e2 "xfs: skip dirty pages in ->releasepage()" started
to skip dirty pages in xfs_vm_releasepage() which also has the effect
that if a dirty page is truncated, it does not get freed by
block_invalidatepage() and is lingering in LRU list waiting for reclaim.
So a simple loop like:
while true; do
dd if=/dev/zero of=file bs=1M count=100
rm file
done
will keep using more and more memory until we hit low watermarks and
start pagecache reclaim which will eventually also reclaim the truncated
pages. Keeping these truncated (and thus never usable) pages in memory
is just a waste of memory, is unnecessarily stressing page cache
reclaim, and reportedly also leads to anonymous mmap(2) returning ENOMEM
prematurely.
So instead of just skipping dirty pages in xfs_vm_releasepage(), return
to old behavior of skipping them only if they have delalloc or unwritten
buffers and fix the spurious warnings by warning only if the page is
clean.
CC: stable@vger.kernel.org
CC: Brian Foster <bfoster@redhat.com>
CC: Vlastimil Babka <vbabka@suse.cz>
Reported-by: Petr Tůma <petr.tuma@d3s.mff.cuni.cz>
Fixes: 99579ccec4e271c3d4d4e7c946058766812afdab
Signed-off-by: Jan Kara <jack@suse.cz>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
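The restored decision can be sketched as a user-space predicate. The
struct and function names are hypothetical, and returning true merely
stands in for handing the page on to be freed. A page is kept only if it
still has delalloc or unwritten buffers, and a warning is raised only
when such a page is clean.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical summary of the page state examined by the sketch. */
struct model_page_state {
	bool dirty;
	bool has_delalloc;
	bool has_unwritten;
};

/*
 * Returns true if the page's buffers may be released. *warn is set
 * when the state is unexpected (clean page with pending buffers).
 */
bool model_releasepage(const struct model_page_state *pg, bool *warn)
{
	assert(pg != NULL && warn != NULL);

	*warn = false;
	if (pg->has_delalloc || pg->has_unwritten) {
		*warn = !pg->dirty;  /* only a clean page is surprising */
		return false;
	}
	return true;  /* dirty truncated pages are released again */
}
```

A dirty, truncated page without delalloc or unwritten buffers is released
right away instead of lingering on the LRU.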
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/dgc/linux-xfs
Pull xfs updates from Dave Chinner:
"There is quite a varied bunch of stuff in this update, and some of it
you will have already merged through the ext4 tree which imported the
dax-4.10-iomap-pmd topic branch from the XFS tree.
There is also a new direct IO implementation that uses the iomap
infrastructure. It's much simpler, faster, and has lower IO latency
than the existing direct IO infrastructure.
Summary:
- DAX PMD faults via iomap infrastructure
- Direct-io support in iomap infrastructure
- removal of now-redundant XFS inode iolock, replaced with VFS
i_rwsem
- synchronisation with fixes and changes in userspace libxfs code
- extent tree lookup helpers
- lots of little corruption detection improvements to verifiers
- optimised CRC calculations
- faster buffer cache lookups
- deprecation of barrier/nobarrier mount options - we always use
REQ_FUA/REQ_FLUSH where appropriate for data integrity now
- cleanups to speculative preallocation
- miscellaneous minor bug fixes and cleanups"
* tag 'xfs-for-linus-4.10-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/dgc/linux-xfs: (63 commits)
xfs: nuke unused tracepoint definitions
xfs: use GPF_NOFS when allocating btree cursors
xfs: use xfs_vn_setattr_size to check on new size
xfs: deprecate barrier/nobarrier mount option
xfs: Always flush caches when integrity is required
xfs: ignore leaf attr ichdr.count in verifier during log replay
xfs: use rhashtable to track buffer cache
xfs: optimise CRC updates
xfs: make xfs btree stats less huge
xfs: don't cap maximum dedupe request length
xfs: don't allow di_size with high bit set
xfs: error out if trying to add attrs and anextents > 0
xfs: don't crash if reading a directory results in an unexpected hole
xfs: complain if we don't get nextents bmap records
xfs: check for bogus values in btree block headers
xfs: forbid AG btrees with level == 0
xfs: several xattr functions can be void
xfs: handle cow fork in xfs_bmap_trace_exlist
xfs: pass state not whichfork to trace_xfs_extlist
xfs: Move AGI buffer type setting to xfs_read_agi
...
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4
Pull ext4 updates from Ted Ts'o:
"This merge request includes the dax-4.10-iomap-pmd branch which is
needed for both ext4 and xfs dax changes to use iomap for DAX. It also
includes the fscrypt branch which is needed for ubifs encryption work
as well as ext4 encryption and fscrypt cleanups.
Lots of cleanups and bug fixes, especially making sure ext4 is robust
against maliciously corrupted file systems --- especially maliciously
corrupted xattr blocks and a maliciously corrupted superblock. Also
fix ext4 support for 64k block sizes so it works well on ppcle. Fixed
mbcache so we don't miss some common xattr blocks that can be merged"
* tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (86 commits)
dax: Fix sleep in atomic contex in grab_mapping_entry()
fscrypt: Rename FS_WRITE_PATH_FL to FS_CTX_HAS_BOUNCE_BUFFER_FL
fscrypt: Delay bounce page pool allocation until needed
fscrypt: Cleanup page locking requirements for fscrypt_{decrypt,encrypt}_page()
fscrypt: Cleanup fscrypt_{decrypt,encrypt}_page()
fscrypt: Never allocate fscrypt_ctx on in-place encryption
fscrypt: Use correct index in decrypt path.
fscrypt: move the policy flags and encryption mode definitions to uapi header
fscrypt: move non-public structures and constants to fscrypt_private.h
fscrypt: unexport fscrypt_initialize()
fscrypt: rename get_crypt_info() to fscrypt_get_crypt_info()
fscrypto: move ioctl processing more fully into common code
fscrypto: remove unneeded Kconfig dependencies
MAINTAINERS: fscrypto: recommend linux-fsdevel for fscrypto patches
ext4: do not perform data journaling when data is encrypted
ext4: return -ENOMEM instead of success
ext4: reject inodes with negative size
ext4: remove another test in ext4_alloc_file_blocks()
Documentation: fix description of ext4's block_validity mount option
ext4: fix checks for data=ordered and journal_async_commit options
...
|
|
Straight switch over to using iomap for direct I/O - we already have the
non-COW dio path in write_begin for DAX and files with extent size hints,
so nothing to add there. The COW path is ported over from the old
get_blocks version and a bit of a mess, but I have some work in progress
to make it look more like the buffered I/O COW path.
This gets rid of xfs_get_blocks_direct and the last caller of
xfs_get_blocks with the create flag set, so all that code can be removed.
Last but not least I've removed a comment in xfs_filemap_fault that
refers to xfs_get_blocks entirely instead of updating it - while the
reference is correct, the whole DAX fault path looks different than
the non-DAX one, so it seems rather pointless.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Tested-by: Jens Axboe <axboe@fb.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
|
|
This patch drops the XFS-own i_iolock and uses the VFS i_rwsem which
recently replaced i_mutex instead. This means we only have to take
one lock instead of two in many fast path operations, and we can
also shrink the xfs_inode structure. Thanks to the xfs_ilock family
there is very little churn, the only thing of note is that we need
to switch to use the lock_two_directory helper for taking the i_rwsem
on two inodes in a few places to make sure our lock order matches
the one used in the VFS.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Tested-by: Jens Axboe <axboe@fb.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
|
|
Use xfs_iext_lookup_extent to look up the extent, drop a useless check,
drop an unneeded return value and clean up the general style a little bit.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
|
|
We've had reports of generic/095 causing XFS to BUG() in
__xfs_get_blocks() due to the existence of delalloc blocks on a
direct I/O read. generic/095 issues a mix of various types of I/O,
including direct and memory mapped I/O to a single file. This is
clearly not supported behavior and is known to lead to such
problems. E.g., the lack of exclusion between the direct I/O and
write fault paths means that a write fault can allocate delalloc
blocks in a region of a file that was previously a hole after the
direct read has attempted to flush/inval the file range, but before
it actually reads the block mapping. In turn, the direct read
discovers a delalloc extent and cannot proceed.
While the appropriate solution here is to not mix direct and memory
mapped I/O to the same regions of the same file, the current
BUG_ON() behavior is probably overkill as it can crash the entire
system. Instead, localize the failure to the I/O in question by
returning an error for a direct I/O that cannot be handled safely
due to delalloc blocks. Be careful to allow the case of a direct
write to post-eof delalloc blocks. This can occur due to speculative
preallocation and is safe as post-eof blocks are not accompanied by
dirty pages in pagecache (conversely, preallocation within eof must
have been zeroed, and thus dirtied, before the inode size could have
been increased beyond said blocks).
Finally, provide an additional warning if a direct I/O write occurs
while the file is memory mapped. This may not catch all problematic
scenarios, but provides a hint that some known-to-be-problematic I/O
methods are in use.
Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
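The policy described above can be sketched as a user-space check. The
struct, the function name and the choice of -EIO are assumptions made for
illustration; the message only says that an error is returned. Delalloc
blocks fail the direct I/O unless it is a write that starts at or beyond
EOF.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical description of one direct I/O mapping request. */
struct model_dio {
	bool     is_write;
	bool     is_delalloc;  /* mapping found is a delalloc extent */
	uint64_t offset;       /* file offset of the I/O */
	uint64_t isize;        /* current inode size (EOF) */
};

/* Returns 0 if the direct I/O may proceed, a negative errno if not. */
int model_dio_check(const struct model_dio *io)
{
	assert(io != NULL);

	if (!io->is_delalloc)
		return 0;

	/* Post-EOF speculative preallocation has no dirty pagecache. */
	if (io->is_write && io->offset >= io->isize)
		return 0;

	return -EIO;  /* assumed errno: fail this I/O, not the system */
}
```

A direct read that finds delalloc blocks fails with an error confined to
that I/O, while a direct write into post-EOF preallocation goes ahead.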
|
|
Switch xfs_filemap_pmd_fault() from using dax_pmd_fault() to the new and
improved dax_iomap_pmd_fault(). Also, now that it has no more users,
remove xfs_get_blocks_dax_fault().
Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Dave Chinner <david@fromorbit.com>
|
|
Add wbc_to_write_flags(), which returns the write modifier flags to use,
based on a struct writeback_control. No functional changes in this
patch, but it prepares us for factoring other wbc fields for write type.
Signed-off-by: Jens Axboe <axboe@fb.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Christoph Hellwig <hch@lst.de>
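A sketch of the helper's shape as a user-space model. The types and the
MODEL_REQ_SYNC bit value are hypothetical, and the mapping shown
(data-integrity writeback asks for a sync write, everything else adds no
flags) is an assumption about what the helper computes rather than a copy
of it.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for the writeback control fields used here. */
enum model_sync_mode {
	MODEL_WB_SYNC_NONE,  /* background writeback, nobody waits */
	MODEL_WB_SYNC_ALL,   /* data-integrity writeback, caller waits */
};

struct model_wbc {
	enum model_sync_mode sync_mode;
};

#define MODEL_REQ_SYNC 0x1  /* hypothetical bit value */

/* Derive write modifier flags from the writeback control. */
int model_wbc_to_write_flags(const struct model_wbc *wbc)
{
	assert(wbc != NULL);

	if (wbc->sync_mode == MODEL_WB_SYNC_ALL)
		return MODEL_REQ_SYNC;
	return 0;
}
```

Keeping the decision in one helper means a later change that looks at
further wbc fields only has to touch this function.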
|
|
Remove the WRITE_* and READ_SYNC wrappers, and just use the flags
directly. Where applicable this also drops usage of the
bio_set_op_attrs wrapper.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
|
|
We need to splice COW blocks we've completed in xfs_end_io_direct_write
into the data fork before converting unwritten extents. Otherwise
xfs_bmapi_write might first allocate blocks for any holes in the data
fork, which isn't only not needed but also harmful as it might cause
reserved block underruns in the transaction.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
|
|
For O_DIRECT writes to shared blocks, we have to CoW them just like
we would with buffered writes. For writes that are not block-aligned,
just bounce them to the page cache.
For block-aligned writes, however, we can do better than that. Use
the same mechanisms that we employ for buffered CoW to set up a
delalloc reservation, allocate all the blocks at once, issue the
writes against the new blocks and use the same ioend functions to
remap the blocks after the write. This should be fairly performant.
Christoph discovered that xfs_reflink_allocate_cow_range may stumble
over invalid entries in the extent array given that it drops the ilock
but still expects the index to be stable. Simply fixing it to a new
lookup for every iteration still isn't correct given that
xfs_bmapi_allocate will trigger a BUG_ON() if hitting a hole, and
there is nothing preventing a xfs_bunmapi_cow call removing extents
once we dropped the ilock either.
This patch duplicates the inner loop of xfs_bmapi_allocate into a
helper for xfs_reflink_allocate_cow_range so that it can be done under
the same ilock critical section as our CoW fork delayed allocation.
The directio CoW warts will be revisited in a later patch.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
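The choice between the two paths in the first paragraphs can be sketched
as a user-space function. The enum and function names are hypothetical
and the sketch covers only the decision, not the allocation or remapping.
Writes to unshared blocks go straight to disk, unaligned writes to shared
blocks fall back to the page cache, and block-aligned writes to shared
blocks take the direct CoW path.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical labels for the three outcomes. */
enum model_dio_path {
	MODEL_DIO_DIRECT,    /* blocks not shared: plain direct write */
	MODEL_DIO_COW,       /* shared and aligned: allocate, write, remap */
	MODEL_DIO_BUFFERED,  /* shared and unaligned: bounce to page cache */
};

enum model_dio_path model_pick_write_path(uint64_t pos, uint64_t len,
					  unsigned int blkbits, bool shared)
{
	uint64_t mask;

	assert(blkbits < 64);
	mask = ((uint64_t)1 << blkbits) - 1;

	if (!shared)
		return MODEL_DIO_DIRECT;

	/* Either end off a block boundary means a partial-block CoW. */
	if ((pos | len) & mask)
		return MODEL_DIO_BUFFERED;

	return MODEL_DIO_COW;
}
```

Checking pos | len against the block mask tests both the start and the
length for alignment in one step.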
|
|
Report shared extents through the iomap interface so that FIEMAP flags
shared blocks accurately. Have xfs_vm_bmap return zero for reflinked
files because the bmap-based swap code requires static block mappings,
which is incompatible with copy on write.
NOTE: Existing userspace bmap users such as lilo will have the same
problem with reflink files.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
|
|
After the write component of a copy-write operation finishes, clean up
the bookkeeping left behind. On error, we simply free the new blocks
and pass the error up. If we succeed, however, then we must remove
the old data fork mapping and move the cow fork mapping to the data
fork.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
[hch: Call the CoW failure function during xfs_cancel_ioend]
Signed-off-by: Christoph Hellwig <hch@lst.de>
|
|
Modify the writepage handler to find and convert pending delalloc
extents to real allocations. Furthermore, when we're doing non-cow
writes to a part of a file that already has a CoW reservation (the
cowextsz hint that we set up in a subsequent patch facilitates this),
promote the write to copy-on-write so that the entire extent can get
written out as a single extent on disk, thereby reducing post-CoW
fragmentation.
Christoph moved the CoW support code in _map_blocks to a separate helper
function, refactored other functions, and reduced the number of CoW fork
lookups, so I merged those changes here to reduce churn.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
|
|
Modify xfs_bmap_add_extent_delay_real() so that we can convert delayed
allocation extents in the CoW fork to real allocations, and wire this
up all the way back to xfs_iomap_write_allocate(). In a subsequent
patch, we'll modify the writepage handler to call this.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
|
|
Rename the current function to __xfs_setfilesize and add a non-static
wrapper that also takes care of creating the transaction. This new
helper will be used by the new iomap-based DAX path.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/dgc/linux-xfs
Pull xfs updates from Dave Chinner:
"The major addition is the new iomap based block mapping
infrastructure. We've been kicking this about locally for years, but
other filesystems want to use it too (e.g. gfs2). It is now fully
working, reviewed, and ready to be merged and used by other
filesystems.
There are a lot of other fixes and cleanups in the tree, but those are
XFS internal things and none are of the scale or visibility of the
iomap changes. See below for details.
I am likely to send another pull request next week - we're just about
ready to merge some new functionality (on disk block->owner reverse
mapping infrastructure), but that's a huge chunk of code (74 files
changed, 7283 insertions(+), 1114 deletions(-)) so I'm keeping that
separate to all the "normal" pull request changes so they don't get
lost in the noise.
Summary of changes in this update:
- generic iomap based IO path infrastructure
- generic iomap based fiemap implementation
- xfs iomap based IO path implementation
- buffer error handling fixes
- tracking of in flight buffer IO for unmount serialisation
- direct IO and DAX IO path separation and simplification
- shortform directory format definition changes for wider platform
compatibility
- various buffer cache fixes
- cleanups in preparation for rmap merge
- error injection cleanups and fixes
- log item format buffer memory allocation restructuring to prevent
rare OOM reclaim deadlocks
- sparse inode chunks are now fully supported"
* tag 'xfs-for-linus-4.8-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/dgc/linux-xfs: (53 commits)
xfs: remove EXPERIMENTAL tag from sparse inode feature
xfs: bufferhead chains are invalid after end_page_writeback
xfs: allocate log vector buffers outside CIL context lock
libxfs: directory node splitting does not have an extra block
xfs: remove dax code from object file when disabled
xfs: skip dirty pages in ->releasepage()
xfs: remove __arch_pack
xfs: kill xfs_dir2_inou_t
xfs: kill xfs_dir2_sf_off_t
xfs: split direct I/O and DAX path
xfs: direct calls in the direct I/O path
xfs: stop using generic_file_read_iter for direct I/O
xfs: split xfs_file_read_iter into buffered and direct I/O helpers
xfs: remove s_maxbytes enforcement in xfs_file_read_iter
xfs: kill ioflags
xfs: don't pass ioflags around in the ioctl path
xfs: track and serialize in-flight async buffers against unmount
xfs: exclude never-released buffers from buftarg I/O accounting
xfs: don't reset b_retries to 0 on every failure
xfs: remove extraneous buffer flag changes
...
|
|
|
|
In xfs_finish_page_writeback(), we have a loop that looks like this:
	do {
		if (off < bvec->bv_offset)
			goto next_bh;
		if (off > end)
			break;
		bh->b_end_io(bh, !error);
	next_bh:
		off += bh->b_size;
	} while ((bh = bh->b_this_page) != head);
The b_end_io function is end_buffer_async_write(), which will call
end_page_writeback() once all the buffers have been marked as no
longer under IO. The issue here is that the only thing currently
protecting both the bufferhead chain and the page from being
reclaimed is the PageWriteback state held on the page.
While we attempt to limit the loop to just the buffers covered by
the IO, we still read from the buffer size and follow the next
pointer in the bufferhead chain. There is no guarantee that either
of these is valid after the PageWriteback flag has been cleared.
Hence, loops like this are completely unsafe, and result in
use-after-free issues. One such problem was caught by Calvin Owens
with KASAN:
.....
INFO: Freed in 0x103fc80ec age=18446651500051355200 cpu=2165122683 pid=-1
free_buffer_head+0x41/0x90
__slab_free+0x1ed/0x340
kmem_cache_free+0x270/0x300
free_buffer_head+0x41/0x90
try_to_free_buffers+0x171/0x240
xfs_vm_releasepage+0xcb/0x3b0
try_to_release_page+0x106/0x190
shrink_page_list+0x118e/0x1a10
shrink_inactive_list+0x42c/0xdf0
shrink_zone_memcg+0xa09/0xfa0
shrink_zone+0x2c3/0xbc0
.....
Call Trace:
<IRQ> [<ffffffff81e8b8e4>] dump_stack+0x68/0x94
[<ffffffff8153a995>] print_trailer+0x115/0x1a0
[<ffffffff81541174>] object_err+0x34/0x40
[<ffffffff815436e7>] kasan_report_error+0x217/0x530
[<ffffffff81543b33>] __asan_report_load8_noabort+0x43/0x50
[<ffffffff819d651f>] xfs_destroy_ioend+0x3bf/0x4c0
[<ffffffff819d69d4>] xfs_end_bio+0x154/0x220
[<ffffffff81de0c58>] bio_endio+0x158/0x1b0
[<ffffffff81dff61b>] blk_update_request+0x18b/0xb80
[<ffffffff821baf57>] scsi_end_request+0x97/0x5a0
[<ffffffff821c5558>] scsi_io_completion+0x438/0x1690
[<ffffffff821a8d95>] scsi_finish_command+0x375/0x4e0
[<ffffffff821c3940>] scsi_softirq_done+0x280/0x340
Here the access occurs during IO completion, after the buffer has
been freed by direct memory reclaim.
Prevent use-after-free accidents in this end_io processing loop by
pre-calculating the loop conditionals before calling bh->b_end_io().
The loop is already limited to just the bufferheads covered by the
IO in progress, so the offset checks are sufficient to prevent
accessing buffers in the chain after end_page_writeback() has been
called by the bh->b_end_io() callout.
Yet another example of why Bufferheads Must Die.
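A minimal userspace sketch of the pre-calculation idea follows. It is
not the kernel patch: struct buf, chain_init(), end_io() and
finish_range() are hypothetical stand-ins, and "reclaim" is modelled by
poisoning a static array once the last buffer completes, so that any
late read of a buffer field trips an assertion.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define NBUF 4

/* Hypothetical stand-in for struct buffer_head; not the kernel type. */
struct buf {
	struct buf *next;	/* circular chain, like b_this_page */
	size_t size;		/* like b_size */
};

static struct buf chain[NBUF];
static int pending;		/* buffers still under IO */

/* Build a circular chain of NBUF buffers, under_io of them under IO. */
static void chain_init(size_t size, int under_io)
{
	for (int i = 0; i < NBUF; i++) {
		chain[i].next = &chain[(i + 1) % NBUF];
		chain[i].size = size;
	}
	pending = under_io;
}

/*
 * Models b_end_io(): once the last buffer completes, the page loses its
 * writeback protection and the whole chain is poisoned, as if reclaimed.
 */
static void end_io(struct buf *b)
{
	(void)b;
	if (--pending == 0) {
		for (int i = 0; i < NBUF; i++) {
			chain[i].next = NULL;
			chain[i].size = SIZE_MAX;
		}
	}
}

/*
 * Complete the buffers whose offset lies in [start, end]. Everything the
 * loop needs from a buffer is read before end_io() runs on it, and the
 * range check comes first so buffers past the IO are never touched.
 * Returns the number of buffers completed.
 */
static int finish_range(struct buf *head, size_t start, size_t end)
{
	struct buf *b = head, *next;
	size_t off = 0, bsize;
	int done = 0;

	do {
		if (off > end)
			break;			/* past the IO: do not read b */
		next = b->next;			/* saved before the callout */
		bsize = b->size;
		assert(next != NULL && bsize != SIZE_MAX);	/* poisoned read */
		if (off >= start) {
			end_io(b);
			done++;
		}
		off += bsize;
	} while ((b = next) != head);

	return done;
}
```

With the original ordering (reading size and next after the callout),
the same model would read poisoned fields as soon as the last buffer in
the IO completes.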
cc: <stable@vger.kernel.org> # 4.7
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reported-and-Tested-by: Calvin Owens <calvinowens@fb.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
|
|
XFS has had scattered reports of delalloc blocks present at
->releasepage() time. This results in a warning with a stack trace
similar to the following:
...
Call Trace:
[<ffffffffa23c5b8f>] dump_stack+0x63/0x84
[<ffffffffa20837a7>] warn_slowpath_common+0x97/0xe0
[<ffffffffa208380a>] warn_slowpath_null+0x1a/0x20
[<ffffffffa2326caf>] xfs_vm_releasepage+0x10f/0x140
[<ffffffffa218c680>] ? page_mkclean_one+0xd0/0xd0
[<ffffffffa218d3a0>] ? anon_vma_prepare+0x150/0x150
[<ffffffffa21521c2>] try_to_release_page+0x32/0x50
[<ffffffffa2166b2e>] shrink_active_list+0x3ce/0x3e0
[<ffffffffa21671c7>] shrink_lruvec+0x687/0x7d0
[<ffffffffa21673ec>] shrink_zone+0xdc/0x2c0
[<ffffffffa2168539>] kswapd+0x4f9/0x970
[<ffffffffa2168040>] ? mem_cgroup_shrink_node_zone+0x1a0/0x1a0
[<ffffffffa20a0d99>] kthread+0xc9/0xe0
[<ffffffffa20a0cd0>] ? kthread_stop+0x100/0x100
[<ffffffffa26b404f>] ret_from_fork+0x3f/0x70
[<ffffffffa20a0cd0>] ? kthread_stop+0x100/0x100
This occurs because it is possible for shrink_active_list() to send
pages marked dirty to ->releasepage() when certain buffer_head threshold
conditions are met. shrink_active_list() doesn't check the page dirty
state, apparently to handle an old ext3 corner case in which clean
pages would not have the dirty bit cleared; it is thus up to the
filesystem to determine how to handle the page.
XFS currently handles the delalloc case properly, but this behavior
makes the warning spurious. Update the XFS ->releasepage() handler to
explicitly skip dirty pages. Retain the existing delalloc/unwritten
checks so we continue to warn if such buffers exist on clean pages when
they shouldn't.
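The resulting check order can be sketched as below. This is a
userspace model under stated assumptions, not the handler itself:
struct page_model and release_page() are hypothetical, and the warning
is reduced to a counter the caller passes in.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical summary of the page state the handler looks at. */
struct page_model {
	bool dirty;		/* page is marked dirty */
	int delalloc;		/* delalloc buffers attached to the page */
	int unwritten;		/* unwritten buffers attached to the page */
};

/*
 * Decide whether the page's buffers may be released. Dirty pages are
 * skipped silently; delalloc or unwritten buffers on a clean page are
 * unexpected, so they bump *warnings and the release is refused.
 */
static bool release_page(const struct page_model *pg, int *warnings)
{
	assert(pg != NULL && warnings != NULL);

	if (pg->dirty)
		return false;		/* not an error, just not releasable */
	if (pg->delalloc > 0 || pg->unwritten > 0) {
		(*warnings)++;		/* should not happen on a clean page */
		return false;
	}
	return true;			/* safe to try freeing the buffers */
}
```

A dirty page with delalloc buffers therefore no longer produces a
warning, while a clean page with such buffers still does.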
Diagnosed-by: Dave Chinner <david@fromorbit.com>
Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
|
|
|
|
We control both the callers and callees of ->direct_IO, so remove the
indirect calls.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
|
|
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
|
|
Convert XFS to use the new iomap based multipage write path. This involves
implementing the ->iomap_begin and ->iomap_end methods, and switching the
buffered file write, page_mkwrite and xfs_iozero paths to the new iomap
helpers.
With this change __xfs_get_blocks will never be used for buffered writes,
and the code handling them can be removed.
Based on earlier code from Dave Chinner.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
|
|
Separate the op from the rq_flag_bits and have xfs
set/get the bio using bio_set_op_attrs/bio_op.
Signed-off-by: Mike Christie <mchristi@redhat.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
|
|
This has callers of submit_bio/submit_bio_wait set bio->bi_rw
instead of passing it in. This makes their use consistent with
generic_make_request and with how we set the other bio fields.
Signed-off-by: Mike Christie <mchristi@redhat.com>
Fixed up fs/ext4/crypto.c
Signed-off-by: Jens Axboe <axboe@fb.com>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/dgc/linux-xfs
Pull xfs updates from Dave Chinner:
"A pretty average collection of fixes, cleanups and improvements in
this request.
Summary:
- fixes for mount line parsing, sparse warnings, read-only compat
feature remount behaviour
- allow fast path symlink lookups for inline symlinks.
- attribute listing cleanups
- writeback goes direct to bios rather than indirecting through
bufferheads
- transaction allocation cleanup
- optimised kmem_realloc
- added configurable error handling for metadata write errors,
changed default error handling behaviour from "retry forever" to
"retry until unmount then fail"
- fixed several inode cluster writeback lookup vs reclaim race
conditions
- fixed inode cluster writeback checking wrong inode after lookup
- fixed bugs where struct xfs_inode freeing wasn't actually RCU safe
- cleaned up inode reclaim tagging"
* tag 'xfs-for-linus-4.7-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/dgc/linux-xfs: (39 commits)
xfs: fix warning in xfs_finish_page_writeback for non-debug builds
xfs: move reclaim tagging functions
xfs: simplify inode reclaim tagging interfaces
xfs: rename variables in xfs_iflush_cluster for clarity
xfs: xfs_iflush_cluster has range issues
xfs: mark reclaimed inodes invalid earlier
xfs: xfs_inode_free() isn't RCU safe
xfs: optimise xfs_iext_destroy
xfs: skip stale inodes in xfs_iflush_cluster
xfs: fix inode validity check in xfs_iflush_cluster
xfs: xfs_iflush_cluster fails to abort on error
xfs: remove xfs_fs_evict_inode()
xfs: add "fail at unmount" error handling configuration
xfs: add configuration handlers for specific errors
xfs: add configuration of error failure speed
xfs: introduce table-based init for error behaviors
xfs: add configurable error support to metadata buffers
xfs: introduce metadata IO error class
xfs: configurable error behavior via sysfs
xfs: buffer ->bi_end_io function requires irq-safe lock
...
|
|
|
|
blockmask is unused if ASSERTs are disabled.
Reported-by: kbuild test robot <fengguang.wu@intel.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
|
|
Including blkdev_direct_IO and dax_do_io. It has to be ki_pos to actually
work, so eliminate the superfluous argument.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
|
Merge xfs_trans_reserve and xfs_trans_alloc into a single function call
that returns a transaction with all the required log and block reservations,
and which allows passing transaction flags directly to avoid the cumbersome
_xfs_trans_alloc interface.
While we're at it we also get rid of the transaction type argument that has
been superfluous since we stopped supporting the non-CIL logging mode. The
guts of it will be removed in another patch.
[dchinner: fixed transaction leak in error path in xfs_setattr_nonsize]
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
|
|
This patch implements two closely related changes: First, it embeds
a bio in the ioend structure so that we don't have to allocate one
separately. Second, it uses the block layer bio chaining mechanism
to chain additional bios off this first one if needed instead of
manually accounting for multiple bio completions in the ioend
structure. Together this removes a memory allocation per ioend and
greatly simplifies the ioend setup and I/O completion path.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
|