summaryrefslogtreecommitdiffstats
path: root/fs/btrfs/compression.c
AgeCommit message (Collapse)AuthorFilesLines
2021-07-28btrfs: mark compressed range uptodate only if all bio succeedGoldwyn Rodrigues1-1/+1
In compression write endio sequence, the range which the compressed_bio writes is marked as uptodate if the last bio of the compressed (sub)bios is completed successfully. There could be previous bio which may have failed which is recorded in cb->errors. Set the writeback range as uptodate only if cb->errors is zero, as opposed to checking only the last bio's status. Backporting notes: in all versions up to 4.4 the last argument is always replaced by "!cb->errors". CC: stable@vger.kernel.org # 4.4+ Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2021-06-22btrfs: remove a stale comment for btrfs_decompress_bio()Qu Wenruo1-14/+0
Since commit 8140dc30a432 ("btrfs: btrfs_decompress_bio() could accept compressed_bio instead"), btrfs_decompress_bio() accepts "struct compressed_bio" other than open-coded parameter list. Thus the comments for the parameter list is no longer needed. Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2021-06-21btrfs: pass btrfs_inode to btrfs_writepage_endio_finish_ordered()Qu Wenruo1-3/+1
There is a pretty bad abuse of btrfs_writepage_endio_finish_ordered() in end_compressed_bio_write(). It passes compressed pages to btrfs_writepage_endio_finish_ordered(), which is only supposed to accept inode pages. Thankfully the important info here is the inode, so let's pass btrfs_inode directly into btrfs_writepage_endio_finish_ordered(), and make @page parameter optional. By this, end_compressed_bio_write() can happily pass page=NULL while still getting everything done properly. Also, to cooperate with such modification, replace @page parameter for trace_btrfs_writepage_end_io_hook() with btrfs_inode. Although this removes page_index info, the existing start/len should be enough for most usage. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2021-06-21btrfs: fix comment about max_out in btrfs_compress_pagesAnand Jain1-3/+0
Commit e5d74902362f ("btrfs: derive maximum output size in the compression implementation") removed @max_out argument in btrfs_compress_pages() but its comment remained, remove it. Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2021-06-21btrfs: optimize variables size in btrfs_submit_compressed_writeAnand Jain1-3/+3
Patch "btrfs: reduce compressed_bio member's types" reduced some member's size. Function arguments @len, @compressed_len and @nr_pages can be declared as unsigned int. Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2021-06-21btrfs: optimize variables size in btrfs_submit_compressed_readAnand Jain1-3/+3
Patch "btrfs: reduce compressed_bio member's types" reduced some member's size. Declare the variables @compressed_len, @nr_pages and @pg_index size as an unsigned int in the function btrfs_submit_compressed_read. Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2021-06-21btrfs: reduce the variable size to fit nr_pagesAnand Jain1-3/+3
Patch "btrfs: reduce compressed_bio member's types" reduced the @nr_pages size to unsigned int, its cascading effects are updated here. Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2021-06-21btrfs: reduce compressed_bio members' typesDavid Sterba1-1/+1
Several members of compressed_bio are of type that's unnecessarily big for the values that they'd hold: - the size of the uncompressed and compressed data is 128K now, we can keep is as int - same for number of pages - the compress type fits to a byte - the errors is 0/1 The size of the unpatched structure is 80 bytes with several holes. Reordering nr_pages next to the pages the hole after pending_bios is filled and the resulting size is 56 bytes. This keeps the csums array aligned to 8 bytes, which is nice. Further size optimizations may be possible but right now it looks good to me: struct compressed_bio { refcount_t pending_bios; /* 0 4 */ unsigned int nr_pages; /* 4 4 */ struct page * * compressed_pages; /* 8 8 */ struct inode * inode; /* 16 8 */ u64 start; /* 24 8 */ unsigned int len; /* 32 4 */ unsigned int compressed_len; /* 36 4 */ u8 compress_type; /* 40 1 */ u8 errors; /* 41 1 */ /* XXX 2 bytes hole, try to pack */ int mirror_num; /* 44 4 */ struct bio * orig_bio; /* 48 8 */ u8 sums[]; /* 56 0 */ /* size: 56, cachelines: 1, members: 12 */ /* sum members: 54, holes: 1, sum holes: 2 */ /* last cacheline: 56 bytes */ }; Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
2021-06-21btrfs: zoned: factor out zoned device lookupJohannes Thumshirn1-12/+4
To be able to construct a zone append bio we need to look up the btrfs_device. The code doing the chunk map lookup to get the device is present in btrfs_submit_compressed_write and submit_extent_page. Factor out the lookup calls into a helper and use it in the submission paths. Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2021-06-03Merge tag 'for-5.13-rc4-tag' of ↵Linus Torvalds1-5/+12
git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux Pull btrfs fixes from David Sterba: "Error handling improvements, caught by error injection: - handle errors during checksum deletion - set error on mapping when ordered extent io cannot be finished - inode link count fixup in tree-log - missing return value checks for inode updates in tree-log - abort transaction in rename exchange if adding second reference fails Fixes: - fix fsync failure after writes to prealloc extents - fix deadlock when cloning inline extents and low on available space - fix compressed writes that cross stripe boundary" * tag 'for-5.13-rc4-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: MAINTAINERS: add btrfs IRC link btrfs: fix deadlock when cloning inline extents and low on available space btrfs: fix fsync failure and transaction abort after writes to prealloc extents btrfs: abort in rename_exchange if we fail to insert the second ref btrfs: check error value from btrfs_update_inode in tree log btrfs: fixup error handling in fixup_inode_link_counts btrfs: mark ordered extent and inode with error if we fail to finish btrfs: return errors from btrfs_del_csums in cleanup_ref_head btrfs: fix error handling in btrfs_del_csums btrfs: fix compressed writes that cross stripe boundary
2021-05-27btrfs: fix compressed writes that cross stripe boundaryQu Wenruo1-5/+12
[BUG] When running btrfs/027 with "-o compress" mount option, it always crashes with the following call trace: BTRFS critical (device dm-4): mapping failed logical 298901504 bio len 12288 len 8192 ------------[ cut here ]------------ kernel BUG at fs/btrfs/volumes.c:6651! invalid opcode: 0000 [#1] PREEMPT SMP NOPTI CPU: 5 PID: 31089 Comm: kworker/u24:10 Tainted: G OE 5.13.0-rc2-custom+ #26 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015 Workqueue: btrfs-delalloc btrfs_work_helper [btrfs] RIP: 0010:btrfs_map_bio.cold+0x58/0x5a [btrfs] Call Trace: btrfs_submit_compressed_write+0x2d7/0x470 [btrfs] submit_compressed_extents+0x3b0/0x470 [btrfs] ? mark_held_locks+0x49/0x70 btrfs_work_helper+0x131/0x3e0 [btrfs] process_one_work+0x28f/0x5d0 worker_thread+0x55/0x3c0 ? process_one_work+0x5d0/0x5d0 kthread+0x141/0x160 ? __kthread_bind_mask+0x60/0x60 ret_from_fork+0x22/0x30 ---[ end trace 63113a3a91f34e68 ]--- [CAUSE] The critical message before the crash means we have a bio at logical bytenr 298901504 length 12288, but only 8192 bytes can fit into one stripe, the remaining 4096 bytes go to another stripe. In btrfs, all bios are properly split to avoid cross stripe boundary, but commit 764c7c9a464b ("btrfs: zoned: fix parallel compressed writes") changed the behavior for compressed writes. Previously if we find our new page can't be fitted into current stripe, ie. "submit == 1" case, we submit current bio without adding current page. submit = btrfs_bio_fits_in_stripe(page, PAGE_SIZE, bio, 0); page->mapping = NULL; if (submit || bio_add_page(bio, page, PAGE_SIZE, 0) < PAGE_SIZE) { But after the modification, we will add the page no matter if it crosses stripe boundary, leading to the above crash. submit = btrfs_bio_fits_in_stripe(page, PAGE_SIZE, bio, 0); if (pg_index == 0 && use_append) len = bio_add_zone_append_page(bio, page, PAGE_SIZE, 0); else len = bio_add_page(bio, page, PAGE_SIZE, 0); page->mapping = NULL; if (submit || len < PAGE_SIZE) { [FIX] It's no longer possible to revert to the original code style as we have two different bio_add_*_page() calls now. The new fix is to skip the bio_add_*_page() call if @submit is true. Also to avoid @len to be uninitialized, always initialize it to zero. If @submit is true, @len will not be checked. If @submit is not true, @len will be the return value of bio_add_*_page() call. Either way, the behavior is still the same as the old code. Reported-by: Josef Bacik <josef@toxicpanda.com> Fixes: 764c7c9a464b ("btrfs: zoned: fix parallel compressed writes") Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2021-05-21Merge tag 'for-5.13-rc2-tag' of ↵Linus Torvalds1-4/+38
git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux Pull btrfs fixes from David Sterba: "A few more fixes: - fix unaligned compressed writes in zoned mode - fix false positive lockdep warning when cloning inline extent - remove wrong BUG_ON in tree-log error handling" * tag 'for-5.13-rc2-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: btrfs: zoned: fix parallel compressed writes btrfs: zoned: pass start block to btrfs_use_zone_append btrfs: do not BUG_ON in link_to_fixup_dir btrfs: release path before starting transaction when cloning inline extent
2021-05-20btrfs: zoned: fix parallel compressed writesJohannes Thumshirn1-4/+38
When multiple processes write data to the same block group on a compressed zoned filesystem, the underlying device could report I/O errors and data corruption is possible. This happens because on a zoned file system, compressed data writes where sent to the device via a REQ_OP_WRITE instead of a REQ_OP_ZONE_APPEND operation. But with REQ_OP_WRITE and parallel submission it cannot be guaranteed that the data is always submitted aligned to the underlying zone's write pointer. The change to using REQ_OP_ZONE_APPEND instead of REQ_OP_WRITE on a zoned filesystem is non intrusive on a regular file system or when submitting to a conventional zone on a zoned filesystem, as it is guarded by btrfs_use_zone_append. Reported-by: David Sterba <dsterba@suse.com> Fixes: 9d294a685fbc ("btrfs: zoned: enable to mount ZONED incompat flag") CC: stable@vger.kernel.org # 5.12.x: e380adfc213a13: btrfs: zoned: pass start block to btrfs_use_zone_append CC: stable@vger.kernel.org # 5.12.x Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>
2021-05-05btrfs: use memzero_page() instead of open coded kmap patternIra Weiny1-4/+1
There are many places where kmap/memset/kunmap patterns occur. Use the newly lifted memzero_page() to eliminate direct uses of kmap and leverage the new core functions use of kmap_local_page(). The development of this patch was aided by the following coccinelle script: // <smpl> // SPDX-License-Identifier: GPL-2.0-only // Find kmap/memset/kunmap pattern and replace with memset*page calls // // NOTE: Offsets and other expressions may be more complex than what the script // will automatically generate. Therefore a catchall rule is provided to find // the pattern which then must be evaluated by hand. // // Confidence: Low // Copyright: (C) 2021 Intel Corporation // URL: http://coccinelle.lip6.fr/ // Comments: // Options: // // Then the memset pattern // @ memset_rule1 @ expression page, V, L, Off; identifier ptr; type VP; @@ ( -VP ptr = kmap(page); | -ptr = kmap(page); | -VP ptr = kmap_atomic(page); | -ptr = kmap_atomic(page); ) <+... ( -memset(ptr, 0, L); +memzero_page(page, 0, L); | -memset(ptr + Off, 0, L); +memzero_page(page, Off, L); | -memset(ptr, V, L); +memset_page(page, V, 0, L); | -memset(ptr + Off, V, L); +memset_page(page, V, Off, L); ) ...+> ( -kunmap(page); | -kunmap_atomic(ptr); ) // Remove any pointers left unused @ depends on memset_rule1 @ identifier memset_rule1.ptr; type VP, VP1; @@ -VP ptr; ... when != ptr; ? VP1 ptr; // // Catch all // @ memset_rule2 @ expression page; identifier ptr; expression GenTo, GenSize, GenValue; type VP; @@ ( -VP ptr = kmap(page); | -ptr = kmap(page); | -VP ptr = kmap_atomic(page); | -ptr = kmap_atomic(page); ) <+... ( // // Some call sites have complex expressions within the memset/memcpy // The follow are catch alls which need to be evaluated by hand. // -memset(GenTo, 0, GenSize); +memzero_pageExtra(page, GenTo, GenSize); | -memset(GenTo, GenValue, GenSize); +memset_pageExtra(page, GenValue, GenTo, GenSize); ) ...+> ( -kunmap(page); | -kunmap_atomic(ptr); ) // Remove any pointers left unused @ depends on memset_rule2 @ identifier memset_rule2.ptr; type VP, VP1; @@ -VP ptr; ... when != ptr; ? VP1 ptr; // </smpl> Link: https://lkml.kernel.org/r/20210309212137.2610186-4-ira.weiny@intel.com Signed-off-by: Ira Weiny <ira.weiny@intel.com> Reviewed-by: David Sterba <dsterba@suse.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com> Cc: Chris Mason <clm@fb.com> Cc: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-04-19btrfs: handle remount to no compress during compressionQu Wenruo1-3/+8
[BUG] When running btrfs/071 with inode_need_compress() removed from compress_file_range(), we got the following crash: BUG: kernel NULL pointer dereference, address: 0000000000000018 #PF: supervisor read access in kernel mode #PF: error_code(0x0000) - not-present page Workqueue: btrfs-delalloc btrfs_work_helper [btrfs] RIP: 0010:compress_file_range+0x476/0x7b0 [btrfs] Call Trace: ? submit_compressed_extents+0x450/0x450 [btrfs] async_cow_start+0x16/0x40 [btrfs] btrfs_work_helper+0xf2/0x3e0 [btrfs] process_one_work+0x278/0x5e0 worker_thread+0x55/0x400 ? process_one_work+0x5e0/0x5e0 kthread+0x168/0x190 ? kthread_create_worker_on_cpu+0x70/0x70 ret_from_fork+0x22/0x30 ---[ end trace 65faf4eae941fa7d ]--- This is already after the patch "btrfs: inode: fix NULL pointer dereference if inode doesn't need compression." [CAUSE] @pages is firstly created by kcalloc() in compress_file_extent(): pages = kcalloc(nr_pages, sizeof(struct page *), GFP_NOFS); Then passed to btrfs_compress_pages() to be utilized there: ret = btrfs_compress_pages(... pages, &nr_pages, ...); btrfs_compress_pages() will initialize each page as output, in zlib_compress_pages() we have: pages[nr_pages] = out_page; nr_pages++; Normally this is completely fine, but there is a special case which is in btrfs_compress_pages() itself: switch (type) { default: return -E2BIG; } In this case, we didn't modify @pages nor @out_pages, leaving them untouched, then when we cleanup pages, the we can hit NULL pointer dereference again: if (pages) { for (i = 0; i < nr_pages; i++) { WARN_ON(pages[i]->mapping); put_page(pages[i]); } ... } Since pages[i] are all initialized to zero, and btrfs_compress_pages() doesn't change them at all, accessing pages[i]->mapping would lead to NULL pointer dereference. This is not possible for current kernel, as we check inode_need_compress() before doing pages allocation. But if we're going to remove that inode_need_compress() in compress_file_extent(), then it's going to be a problem. [FIX] When btrfs_compress_pages() hits its default case, modify @out_pages to 0 to prevent such problem from happening. Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=212331 CC: stable@vger.kernel.org # 5.10+ Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2021-04-19btrfs: convert kmap to kmap_local_page, simple casesIra Weiny1-2/+2
Use a simple coccinelle script to help convert the most common kmap()/kunmap() patterns to kmap_local_page()/kunmap_local(). Note that some kmaps which were caught by this script needed to be handled by hand because of the strict unmapping order of kunmap_local() so they are not included in this patch. But this script got us started. There's another temp variable added for the final length write to the first page so it does not interfere with cpage_out that is used for mapping other pages. The development of this patch was aided by the follow script: // <smpl> // SPDX-License-Identifier: GPL-2.0-only // Find kmap and replace with kmap_local_page then mark kunmap // // Confidence: Low // Copyright: (C) 2021 Intel Corporation // URL: http://coccinelle.lip6.fr/ @ catch_all @ expression e, e2; @@ ( -kmap(e) +kmap_local_page(e) ) ... ( -kunmap(...) +kunmap_local() ) // </smpl> Signed-off-by: Ira Weiny <ira.weiny@intel.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2021-03-01Merge branch 'kmap-conversion-for-5.12' of ↵Linus Torvalds1-4/+2
git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux Pull kmap conversion updates from David Sterba: "This contains changes regarding kmap API use and eg conversion from kmap_atomic to kmap_local_page. The API belongs to memory management but to save cross-tree dependency headaches we've agreed to take it through the btrfs tree because there are some trivial conversions possible, while the rest will need some time and getting the easy cases out of the way would be convenient. The changes can be grouped: - function exports, new helpers - new VM_BUG_ON for additional verification; it's been discussed if it should be VM_BUG_ON or BUG_ON, the former was chosen due to performance reasons - code replaced by relevant helpers" [ This is an updated version of a request that originally came in during the merge window, but I asked for some updates: https://lore.kernel.org/lkml/cover.1614090658.git.dsterba@suse.com/ which is why this got merge after the merge window closed. - Linus ] * 'kmap-conversion-for-5.12' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: btrfs: use copy_highpage() instead of 2 kmaps() btrfs: use memcpy_[to|from]_page() and kmap_local_page() mm/highmem: Add VM_BUG_ON() to mem*_page() calls mm/highmem: Introduce memcpy_page(), memmove_page(), and memset_page() mm/highmem: Convert memcpy_[to|from]_page() to kmap_local_page() mm/highmem: Lift memcpy_[to|from]_page to core
2021-02-26btrfs: use memcpy_[to|from]_page() and kmap_local_page()Ira Weiny1-4/+2
There are many places where the pattern kmap/memcpy/kunmap occurs. This pattern was lifted to the core common functions memcpy_[to|from]_page(). Use these new functions to reduce the code, eliminate direct uses of kmap, and leverage the new core functions use of kmap_local_page(). Also, there is 1 place where a kmap/memcpy is followed by an optional memset. Here we leave the kmap open coded to avoid remapping the page but use kmap_local_page() directly. Development of this patch was aided by the coccinelle script: // <smpl> // SPDX-License-Identifier: GPL-2.0-only // Find kmap/memcpy/kunmap pattern and replace with memcpy*page calls // // NOTE: Offsets and other expressions may be more complex than what the script // will automatically generate. Therefore a catchall rule is provided to find // the pattern which then must be evaluated by hand. // // Confidence: Low // Copyright: (C) 2021 Intel Corporation // URL: http://coccinelle.lip6.fr/ // Comments: // Options: // // simple memcpy version // @ memcpy_rule1 @ expression page, T, F, B, Off; identifier ptr; type VP; @@ ( -VP ptr = kmap(page); | -ptr = kmap(page); | -VP ptr = kmap_atomic(page); | -ptr = kmap_atomic(page); ) <+... ( -memcpy(ptr + Off, F, B); +memcpy_to_page(page, Off, F, B); | -memcpy(ptr, F, B); +memcpy_to_page(page, 0, F, B); | -memcpy(T, ptr + Off, B); +memcpy_from_page(T, page, Off, B); | -memcpy(T, ptr, B); +memcpy_from_page(T, page, 0, B); ) ...+> ( -kunmap(page); | -kunmap_atomic(ptr); ) // Remove any pointers left unused @ depends on memcpy_rule1 @ identifier memcpy_rule1.ptr; type VP, VP1; @@ -VP ptr; ... when != ptr; ? VP1 ptr; // // Some callers kmap without a temp pointer // @ memcpy_rule2 @ expression page, T, Off, F, B; @@ <+... ( -memcpy(kmap(page) + Off, F, B); +memcpy_to_page(page, Off, F, B); | -memcpy(kmap(page), F, B); +memcpy_to_page(page, 0, F, B); | -memcpy(T, kmap(page) + Off, B); +memcpy_from_page(T, page, Off, B); | -memcpy(T, kmap(page), B); +memcpy_from_page(T, page, 0, B); ) ...+> -kunmap(page); // No need for the ptr variable removal // // Catch all // @ memcpy_rule3 @ expression page; expression GenTo, GenFrom, GenSize; identifier ptr; type VP; @@ ( -VP ptr = kmap(page); | -ptr = kmap(page); | -VP ptr = kmap_atomic(page); | -ptr = kmap_atomic(page); ) <+... ( // // Some call sites have complex expressions within the memcpy // match a catch all to be evaluated by hand. // -memcpy(GenTo, GenFrom, GenSize); +memcpy_to_pageExtra(page, GenTo, GenFrom, GenSize); +memcpy_from_pageExtra(GenTo, page, GenFrom, GenSize); ) ...+> ( -kunmap(page); | -kunmap_atomic(ptr); ) // Remove any pointers left unused @ depends on memcpy_rule3 @ identifier memcpy_rule3.ptr; type VP, VP1; @@ -VP ptr; ... when != ptr; ? VP1 ptr; // <smpl> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Ira Weiny <ira.weiny@intel.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2021-02-22btrfs: make check_compressed_csum() to be subpage compatibleQu Wenruo1-13/+26
Currently check_compressed_csum() completely relies on sectorsize == PAGE_SIZE to do checksum verification for compressed extents. To make it subpage compatible, this patch will: - Do extra calculation for the csum range Since we have multiple sectors inside a page, we need to only hash the range we want, not the full page anymore. - Do sector-by-sector hash inside the page With this patch and previous conversion on btrfs_submit_compressed_read(), now we can read subpage compressed extents properly, and do proper csum verification. Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2021-02-22btrfs: make btrfs_submit_compressed_read() subpage compatibleQu Wenruo1-6/+17
For compressed read, we always submit page read using page size. This doesn't work well with subpage, as for subpage one page can contain several sectors. Such submission will read range out of what we want, and cause problems. Thankfully to make it subpage compatible, we only need to change how the last page of the compressed extent is read. Instead of always adding a full page to the compressed read bio, if we're at the last page, calculate the size using compressed length, so that we only add part of the range into the compressed read bio. Since we are here, also change the PAGE_SIZE used in lookup_extent_mapping() to sectorsize. This modification won't cause any functional change, as lookup_extent_mapping() can handle the case where the search range is larger than found extent range. Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2021-02-08btrfs: introduce btrfs_subpage for data inodesQu Wenruo1-2/+8
To support subpage sector size, data also need extra info to make sure which sectors in a page are uptodate/dirty/... This patch will make pages for data inodes get btrfs_subpage structure attached, and detached when the page is freed. This patch also slightly changes the timing when set_page_extent_mapped() is called to make sure: - We have page->mapping set page->mapping->host is used to grab btrfs_fs_info, thus we can only call this function after page is mapped to an inode. One call site attaches pages to inode manually, thus we have to modify the timing of set_page_extent_mapped() a bit. - As soon as possible, before other operations Since memory allocation can fail, we have to do extra error handling. Calling set_page_extent_mapped() as soon as possible can simply the error handling for several call sites. The idea is pretty much the same as iomap_page, but with more bitmaps for btrfs specific cases. Currently the plan is to switch iomap if iomap can provide sector aligned write back (only write back dirty sectors, but not the full page, data balance require this feature). So we will stick to btrfs specific bitmap for now. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2020-12-09btrfs: refactor btrfs_lookup_bio_sums to handle out-of-order bvecsQu Wenruo1-3/+2
Refactor btrfs_lookup_bio_sums() by: - Remove the @file_offset parameter There are two factors making the @file_offset parameter useless: * For csum lookup in csum tree, file offset makes no sense We only need disk_bytenr, which is unrelated to file_offset * page_offset (file offset) of each bvec is not contiguous. Pages can be added to the same bio as long as their on-disk bytenr is contiguous, meaning we could have pages at different file offsets in the same bio. Thus passing file_offset makes no sense any more. The only user of file_offset is for data reloc inode, we will use a new function, search_file_offset_in_bio(), to handle it. - Extract the csum tree lookup into search_csum_tree() The new function will handle the csum search in csum tree. The return value is the same as btrfs_find_ordered_sum(), returning the number of found sectors which have checksum. - Change how we do the main loop The only needed info from bio is: * the on-disk bytenr * the length After extracting the above info, we can do the search without bio at all, which makes the main loop much simpler: for (cur_disk_bytenr = orig_disk_bytenr; cur_disk_bytenr < orig_disk_bytenr + orig_len; cur_disk_bytenr += count * sectorsize) { /* Lookup csum tree */ count = search_csum_tree(fs_info, path, cur_disk_bytenr, search_len, csum_dst); if (!count) { /* Csum hole handling */ } } - Use single variable as the source to calculate all other offsets Instead of all different type of variables, we use only one main variable, cur_disk_bytenr, which represents the current disk bytenr. All involved values can be calculated from that variable, and all those variable will only be visible in the inner loop. The above refactoring makes btrfs_lookup_bio_sums() way more robust than it used to be, especially related to the file offset lookup. Now file_offset lookup is only related to data reloc inode, otherwise we don't need to bother file_offset at all. Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2020-12-09btrfs: drop casts of bio bi_sectorDavid Sterba1-2/+2
Since commit 72deb455b5ec ("block: remove CONFIG_LBDAF") (5.2) the sector_t type is u64 on all arches and configs so we don't need to typecast it. It used to be unsigned long and the result of sector size shifts were not guaranteed to fit in the type. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>
2020-12-08btrfs: remove unnecessary local variables for checksum sizeDavid Sterba1-4/+2
Remove local variable that is then used just once and replace it with fs_info::csum_size. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>
2020-12-08btrfs: switch cached fs_info::csum_size from u16 to u32David Sterba1-4/+3
The fs_info value is 32bit, switch also the local u16 variables. This leads to a better assembly code generated due to movzwl. This simple change will shave some bytes on x86_64 and release config: text data bss dec hex filename 1090000 17980 14912 1122892 11224c pre/btrfs.ko 1089794 17980 14912 1122686 11217e post/btrfs.ko DELTA: -206 Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>
2020-12-08btrfs: use cached value of fs_info::csum_size everywhereDavid Sterba1-3/+3
btrfs_get_16 shows up in the system performance profiles (helper to read 16bit values from on-disk structures). This is partially because of the checksum size that's frequently read along with data reads/writes, other u16 uses are from item size or directory entries. Replace all calls to btrfs_super_csum_size by the cached value from fs_info. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2020-12-08btrfs: introduce mount option rescue=ignorebadrootsJosef Bacik1-1/+1
In the face of extent root corruption, or any other core fs wide root corruption we will fail to mount the file system. This makes recovery kind of a pain, because you need to fall back to userspace tools to scrape off data. Instead provide a mechanism to gracefully handle bad roots, so we can at least mount read-only and possibly recover data from the file system. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2020-12-08btrfs: push the NODATASUM check into btrfs_lookup_bio_sumsJosef Bacik1-9/+5
When we move to being able to handle NULL csum_roots it'll be cleaner to just check in btrfs_lookup_bio_sums instead of at all of the caller locations, so push the NODATASUM check into it as well so it's unified. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2020-10-07btrfs: compression: move declarations to headerDavid Sterba1-35/+0
The declarations of compression algorithm callbacks are defined in the .c file as they're used from there. Compiler warns that there are no declarations for public functions when compiling lzo.c/zlib.c/zstd.c. Fix that by moving the declarations to the header as it's the common place for all of them. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>
2020-07-27btrfs: remove fail label in check_compressed_csumNikolay Borisov1-7/+2
Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2020-07-27btrfs: increment corrupt device counter during compressed readNikolay Borisov1-3/+7
If a compressed read fails due to checksum error only a line is printed to dmesg, device corrupt counter is not modified. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2020-07-27btrfs: remove needless ASSERT check of orig_bio in end_compressed_bio_readNikolay Borisov1-1/+0
compressed_bio::orig_bio is always set in btrfs_submit_compressed_read before any bio submission is performed. Since that function is always called with a valid bio it renders the ASSERT unnecessary. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2020-07-27btrfs: make btrfs_submit_compressed_write take btrfs_inodeNikolay Borisov1-8/+7
Majority of its uses are for btrfs_inode so take it as an argument directly. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2020-07-27btrfs: make btrfs_csum_one_bio takae btrfs_inodeNikolay Borisov1-2/+3
Will enable converting btrfs_submit_compressed_write to btrfs_inode more easily. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2020-05-25btrfs: unexport btrfs_compress_set_level()Anand Jain1-16/+16
btrfs_compress_set_level() can be static function in the file compression.c. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2020-05-25btrfs: use crypto_shash_digest() instead of open codingEric Biggers1-3/+1
Use crypto_shash_digest() instead of crypto_shash_init() + crypto_shash_update() + crypto_shash_final(). This is more efficient. Signed-off-by: Eric Biggers <ebiggers@google.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2020-01-31btrfs: use larger zlib buffer for s390 hardware compressionMikhail Zaslonko1-1/+1
In order to benefit from s390 zlib hardware compression support, increase the btrfs zlib workspace buffer size from 1 to 4 pages (if s390 zlib hardware support is enabled on the machine). This brings up to 60% better performance in hardware on s390 compared to the PAGE_SIZE buffer and much more compared to the software zlib processing in btrfs. In case of memory pressure, fall back to a single page buffer during workspace allocation. The data compressed with larger input buffers will still conform to zlib standard and thus can be decompressed also on a systems that uses only PAGE_SIZE buffer for btrfs zlib. Link: http://lkml.kernel.org/r/20200108105103.29028-1-zaslonko@linux.ibm.com Signed-off-by: Mikhail Zaslonko <zaslonko@linux.ibm.com> Reviewed-by: David Sterba <dsterba@suse.com> Cc: Chris Mason <clm@fb.com> Cc: Josef Bacik <josef@toxicpanda.com> Cc: David Sterba <dsterba@suse.com> Cc: Richard Purdie <rpurdie@rpsys.net> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: Christian Borntraeger <borntraeger@de.ibm.com> Cc: Eduard Shishkin <edward6@linux.ibm.com> Cc: Ilya Leoshkevich <iii@linux.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-01-20btrfs: get rid of at_offset parameter to btrfs_lookup_bio_sums()Omar Sandoval1-2/+2
We can encode this in the offset parameter: -1 means use the page offsets, anything else is a valid offset. Signed-off-by: Omar Sandoval <osandov@fb.com> Signed-off-by: David Sterba <dsterba@suse.com>
2020-01-20btrfs: get rid of trivial __btrfs_lookup_bio_sums() wrappersOmar Sandoval1-2/+2
Currently, we have two wrappers for __btrfs_lookup_bio_sums(): btrfs_lookup_bio_sums_dio(), which is used for direct I/O, and btrfs_lookup_bio_sums(), which is used everywhere else. The only difference is that the _dio variant looks up csums starting at the given offset instead of using the page index, which isn't actually direct I/O-specific. Let's clean up the signature and return value of __btrfs_lookup_bio_sums(), rename it to btrfs_lookup_bio_sums(), and get rid of the trivial helpers. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Omar Sandoval <osandov@fb.com> Signed-off-by: David Sterba <dsterba@suse.com>
2019-12-30btrfs: fix compressed write bio blkcg attributionDennis Zhou1-4/+5
Bio attribution is handled at bio_set_dev() as once we have a device, we have a corresponding request_queue and then can derive the current css. In special cases, we want to attribute to bio to someone else. This can be done by calling bio_associate_blkg_from_css() or kthread_associate_blkcg() depending on the scenario. Btrfs does this for compressed writeback as they are handled by kworkers, so the latter can be done here. Commit 1a41802701ec ("btrfs: drop bio_set_dev where not needed") removes early bio_set_dev() calls prior to submit_stripe_bio(). This breaks the above assumption that we'll have a request_queue when we are doing association. To fix this, switch to using kthread_associate_blkcg(). Without this, we crash in btrfs/024: [ 3052.093088] BUG: kernel NULL pointer dereference, address: 0000000000000510 [ 3052.107013] #PF: supervisor read access in kernel mode [ 3052.107014] #PF: error_code(0x0000) - not-present page [ 3052.107015] PGD 0 P4D 0 [ 3052.107021] Oops: 0000 [#1] SMP [ 3052.138904] CPU: 42 PID: 201270 Comm: kworker/u161:0 Kdump: loaded Not tainted 5.5.0-rc1-00062-g4852d8ac90a9 #712 [ 3052.138905] Hardware name: Quanta Tioga Pass Single Side 01-0032211004/Tioga Pass Single Side, BIOS F08_3A18 12/20/2018 [ 3052.138912] Workqueue: btrfs-delalloc btrfs_work_helper [ 3052.191375] RIP: 0010:bio_associate_blkg_from_css+0x1e/0x3c0 [ 3052.191379] RSP: 0018:ffffc900210cfc90 EFLAGS: 00010282 [ 3052.191380] RAX: 0000000000000000 RBX: ffff88bfe5573c00 RCX: 0000000000000000 [ 3052.191382] RDX: ffff889db48ec2f0 RSI: ffff88bfe5573c00 RDI: ffff889db48ec2f0 [ 3052.191386] RBP: 0000000000000800 R08: 0000000000203bb0 R09: ffff889db16b2400 [ 3052.293364] R10: 0000000000000000 R11: ffff88a07fffde80 R12: ffff889db48ec2f0 [ 3052.293365] R13: 0000000000001000 R14: ffff889de82bc000 R15: ffff889e2b7bdcc8 [ 3052.293367] FS: 0000000000000000(0000) GS:ffff889ffba00000(0000) knlGS:0000000000000000 [ 3052.293368] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 3052.293369] CR2: 0000000000000510 CR3: 0000000002611001 CR4: 00000000007606e0 [ 3052.293370] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [ 3052.293371] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [ 3052.293372] PKRU: 55555554 [ 3052.293376] Call Trace: [ 3052.402552] btrfs_submit_compressed_write+0x137/0x390 [ 3052.402558] submit_compressed_extents+0x40f/0x4c0 [ 3052.422401] btrfs_work_helper+0x246/0x5a0 [ 3052.422408] process_one_work+0x200/0x570 [ 3052.438601] ? process_one_work+0x180/0x570 [ 3052.438605] worker_thread+0x4c/0x3e0 [ 3052.438614] kthread+0x103/0x140 [ 3052.460735] ? process_one_work+0x570/0x570 [ 3052.460737] ? kthread_mod_delayed_work+0xc0/0xc0 [ 3052.460744] ret_from_fork+0x24/0x30 Fixes: 1a41802701ec ("btrfs: drop bio_set_dev where not needed") Reported-by: Chris Murphy <chris@colorremedies.com> Signed-off-by: Dennis Zhou <dennis@kernel.org> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2019-12-30btrfs: punt all bios created in btrfs_submit_compressed_write()Dennis Zhou1-0/+4
Compressed writes happen in the background via kworkers. However, this causes bios to be attributed to root bypassing any cgroup limits from the actual writer. We tag the first bio with REQ_CGROUP_PUNT, which will punt the bio to an appropriate cgroup specific workqueue and attribute the IO properly. However, if btrfs_submit_compressed_write() creates a new bio, we don't tag it the same way. Add the appropriate tagging for subsequent bios. Fixes: ec39f7696ccfa ("Btrfs: use REQ_CGROUP_PUNT for worker thread submitted bios") Reviewed-by: Chris Mason <clm@fb.com> Signed-off-by: Dennis Zhou <dennis@kernel.org> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2019-11-18btrfs: drop bio_set_dev where not neededDavid Sterba1-10/+0
bio_set_dev sets a bdev to a bio and is not only setting a pointer bug also changing some state bits if there was a different bdev set before. This is one thing that's not needed. Another thing is that setting a bdev at bio allocation time is too early and actually does not work with plain redundancy profiles, where each time we submit a bio to a device, the bdev is set correctly. In many places the bio bdev is set to latest_bdev that seems to serve as a stub pointer "just to put something to bio". But we don't have to do that. Where do we know which bdev to set: * for regular IO: submit_stripe_bio that's called by btrfs_map_bio * repair IO: repair_io_failure, read or write from specific device * super block write (using buffer_heads but uses raw bdev) and barriers * scrub: this does not use all regular IO paths as it needs to reach all copies, verify and fixup eventually, and for that all bdev management is independent * raid56: rbio_add_io_page, for the RMW write * integrity-checker: does it's own low-level block tracking Signed-off-by: David Sterba <dsterba@suse.com>
2019-11-18btrfs: compression: remove ops pointer from workspace_managerDavid Sterba1-4/+2
We can infer the ops from the type that is now passed to all functions that would need it, this makes workspace_manager::ops redundant and can be removed. Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2019-11-18btrfs: compression: inline free_workspaceDavid Sterba1-3/+18
Replace indirect calls to free_workspace by switch and calls to the specific callbacks. This is mainly to get rid of the indirection due to spectre vulnerability mitigations. Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2019-11-18btrfs: compression: pass type to btrfs_put_workspaceDavid Sterba1-7/+6
We can infer the workspace_manager from type and the type will be used in the following patch to call a common helper for free_workspace. Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2019-11-18btrfs: compression: inline alloc_workspaceDavid Sterba1-3/+18
Replace indirect calls to alloc_workspace by switch and calls to the specific callbacks. This is mainly to get rid of the indirection due to spectre vulnerability mitigations. Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2019-11-18btrfs: compression: pass type to btrfs_get_workspaceDavid Sterba1-7/+5
We can infer the workspace_manager from type and the type will be used in the following patch to call a common helper for alloc_workspace. Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2019-11-18btrfs: compression: inline put_workspaceDavid Sterba1-9/+15
Similar to get_workspace, majority of the callbacks is trivial, we don't gain anything by the indirection, so replace them by a switch function. Trivial callback implementations use the helper. Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2019-11-18btrfs: compression: inline get_workspaceDavid Sterba1-8/+15
Majority of the callbacks is trivial, we don't gain anything by the indirection, so replace them by a switch function. ZLIB needs to adjust level in the callback and ZSTD workspace management is complex, the rest is call to the helper. Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2019-11-18btrfs: compression: export alloc/free/get/put callbacks of all algosDavid Sterba1-0/+12
The indirect calls will be replaced by a switch in compression.c. (Switch is faster than indirect calls with when Spectre mitigations are enabled). Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>