summaryrefslogtreecommitdiffstats
path: root/fs/ubifs
AgeCommit message (Collapse)AuthorFilesLines
2022-03-31Merge tag 'for-linus-5.18-rc1' of ↵Linus Torvalds6-114/+228
git://git.kernel.org/pub/scm/linux/kernel/git/rw/ubifs Pull JFFS2, UBI and UBIFS updates from Richard Weinberger: "JFFS2: - Fixes for various memory issues UBI: - Fix for a race condition in cdev ioctl handler UBIFS: - Fixes for O_TMPFILE and whiteout handling - Fixes for various memory issues" * tag 'for-linus-5.18-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rw/ubifs: ubifs: rename_whiteout: correct old_dir size computing jffs2: fix memory leak in jffs2_scan_medium jffs2: fix memory leak in jffs2_do_mount_fs jffs2: fix use-after-free in jffs2_clear_xattr_subsystem fs/jffs2: fix comments mentioning i_mutex ubi: fastmap: Return error code if memory allocation fails in add_aeb() ubifs: Fix to add refcount once page is set private ubifs: Fix read out-of-bounds in ubifs_wbuf_write_nolock() ubifs: setflags: Make dirtied_ino_d 8 bytes aligned ubifs: Rectify space amount budget for mkdir/tmpfile operations ubifs: Fix 'ui->dirty' race between do_tmpfile() and writeback work ubifs: Rename whiteout atomically ubifs: Add missing iput if do_tmpfile() failed in rename whiteout ubifs: Fix wrong number of inodes locked by ui_mutex in ubifs_inode comment ubifs: Fix deadlock in concurrent rename whiteout and inode writeback ubifs: rename_whiteout: Fix double free for whiteout_ui->data ubi: Fix race condition between ctrl_cdev_ioctl and ubi_cdev_ioctl
2022-03-28Merge tag 'driver-core-5.18-rc1' of ↵Linus Torvalds1-1/+2
git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core Pull driver core updates from Greg KH: "Here is the set of driver core changes for 5.18-rc1. Not much here, primarily it was a bunch of cleanups and small updates: - kobj_type cleanups for default_groups - documentation updates - firmware loader minor changes - component common helper added and take advantage of it in many drivers (the largest part of this pull request). All of these have been in linux-next for a while with no reported problems" * tag 'driver-core-5.18-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core: (54 commits) Documentation: update stable review cycle documentation drivers/base/dd.c : Remove the initial value of the global variable Documentation: update stable tree link Documentation: add link to stable release candidate tree devres: fix typos in comments Documentation: add note block surrounding security patch note samples/kobject: Use sysfs_emit instead of sprintf base: soc: Make soc_device_match() simpler and easier to read driver core: dd: fix return value of __setup handler driver core: Refactor sysfs and drv/bus remove hooks driver core: Refactor multiple copies of device cleanup scripts: get_abi.pl: Fix typo in help message kernfs: fix typos in comments kernfs: remove unneeded #if 0 guard ALSA: hda/realtek: Make use of the helper component_compare_dev_name video: omapfb: dss: Make use of the helper component_compare_dev power: supply: ab8500: Make use of the helper component_compare_dev ASoC: codecs: wcd938x: Make use of the helper component_compare/release_of iommu/mediatek: Make use of the helper component_compare/release_of drm: of: Make use of the helper component_release_of ...
2022-03-22Merge tag 'folio-5.18b' of git://git.infradead.org/users/willy/pagecacheLinus Torvalds1-17/+17
Pull filesystem folio updates from Matthew Wilcox: "Primarily this series converts some of the address_space operations to take a folio instead of a page. Notably: - a_ops->is_partially_uptodate() takes a folio instead of a page and changes the type of the 'from' and 'count' arguments to make it obvious they're bytes. - a_ops->invalidatepage() becomes ->invalidate_folio() and has a similar type change. - a_ops->launder_page() becomes ->launder_folio() - a_ops->set_page_dirty() becomes ->dirty_folio() and adds the address_space as an argument. There are a couple of other misc changes up front that weren't worth separating into their own pull request" * tag 'folio-5.18b' of git://git.infradead.org/users/willy/pagecache: (53 commits) fs: Remove aops ->set_page_dirty fb_defio: Use noop_dirty_folio() fs: Convert __set_page_dirty_no_writeback to noop_dirty_folio fs: Convert __set_page_dirty_buffers to block_dirty_folio nilfs: Convert nilfs_set_page_dirty() to nilfs_dirty_folio() mm: Convert swap_set_page_dirty() to swap_dirty_folio() ubifs: Convert ubifs_set_page_dirty to ubifs_dirty_folio f2fs: Convert f2fs_set_node_page_dirty to f2fs_dirty_node_folio f2fs: Convert f2fs_set_data_page_dirty to f2fs_dirty_data_folio f2fs: Convert f2fs_set_meta_page_dirty to f2fs_dirty_meta_folio afs: Convert afs_dir_set_page_dirty() to afs_dir_dirty_folio() btrfs: Convert extent_range_redirty_for_io() to use folios fs: Convert trivial uses of __set_page_dirty_nobuffers to filemap_dirty_folio btrfs: Convert from set_page_dirty to dirty_folio fscache: Convert fscache_set_page_dirty() to fscache_dirty_folio() fs: Add aops->dirty_folio fs: Remove aops->launder_page orangefs: Convert launder_page to launder_folio nfs: Convert from launder_page to launder_folio fuse: Convert from launder_page to launder_folio ...
2022-03-22fs: allocate inode by using alloc_inode_sb()Muchun Song1-1/+1
The inode allocation is supposed to use alloc_inode_sb(), so convert kmem_cache_alloc() of all filesystems to alloc_inode_sb(). Link: https://lkml.kernel.org/r/20220228122126.37293-5-songmuchun@bytedance.com Signed-off-by: Muchun Song <songmuchun@bytedance.com> Acked-by: Theodore Ts'o <tytso@mit.edu> [ext4] Acked-by: Roman Gushchin <roman.gushchin@linux.dev> Cc: Alex Shi <alexs@kernel.org> Cc: Anna Schumaker <Anna.Schumaker@Netapp.com> Cc: Chao Yu <chao@kernel.org> Cc: Dave Chinner <david@fromorbit.com> Cc: Fam Zheng <fam.zheng@bytedance.com> Cc: Jaegeuk Kim <jaegeuk@kernel.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kari Argillander <kari.argillander@gmail.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Qi Zheng <zhengqi.arch@bytedance.com> Cc: Shakeel Butt <shakeelb@google.com> Cc: Trond Myklebust <trond.myklebust@hammerspace.com> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Wei Yang <richard.weiyang@gmail.com> Cc: Xiongchun Duan <duanxiongchun@bytedance.com> Cc: Yang Shi <shy828301@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2022-03-17ubifs: rename_whiteout: correct old_dir size computingBaokun Li1-0/+3
When renaming the whiteout file, the old whiteout file is not deleted. Therefore, we add the old dentry size to the old dir like XFS. Otherwise, an error may be reported due to `fscki->calc_sz != fscki->size` in check_indes. Fixes: 9e0a1fff8db56ea ("ubifs: Implement RENAME_WHITEOUT") Reported-by: Zhihao Cheng <chengzhihao1@huawei.com> Signed-off-by: Baokun Li <libaokun1@huawei.com> Signed-off-by: Richard Weinberger <richard@nod.at>
2022-03-15ubifs: Convert ubifs_set_page_dirty to ubifs_dirty_folioMatthew Wilcox (Oracle)1-7/+7
Removes a call to __set_page_dirty_nobuffers(). Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Tested-by: Damien Le Moal <damien.lemoal@opensource.wdc.com> Acked-by: Damien Le Moal <damien.lemoal@opensource.wdc.com> Tested-by: Mike Marshall <hubcap@omnibond.com> # orangefs Tested-by: David Howells <dhowells@redhat.com> # afs
2022-03-15ubifs: Convert from invalidatepage to invalidate_folioMatthew Wilcox (Oracle)1-10/+10
This is a straightfoward conversion. Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Tested-by: Damien Le Moal <damien.lemoal@opensource.wdc.com> Acked-by: Damien Le Moal <damien.lemoal@opensource.wdc.com> Tested-by: Mike Marshall <hubcap@omnibond.com> # orangefs Tested-by: David Howells <dhowells@redhat.com> # afs
2022-01-26ubifs: use default_groups in kobj_typeGreg Kroah-Hartman1-1/+2
There are currently 2 ways to create a set of sysfs files for a kobj_type, through the default_attrs field, and the default_groups field. Move the ubifs sysfs code to use default_groups field which has been the preferred way since aa30f47cf666 ("kobject: Add support for default attribute groups to kobj_type") so that we can soon get rid of the obsolete default_attrs field. Cc: Stefan Schaeckeler <schaecsn@gmx.net> Cc: linux-mtd@lists.infradead.org Acked-by: Richard Weinberger <richard@nod.at> Link: https://lore.kernel.org/r/20220114104820.1340879-1-gregkh@linuxfoundation.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-01-10ubifs: Fix to add refcount once page is set privateZhihao Cheng1-7/+7
MM defined the rule [1] very clearly that once page was set with PG_private flag, we should increment the refcount in that page, also main flows like pageout(), migrate_page() will assume there is one additional page reference count if page_has_private() returns true. Otherwise, we may get a BUG in page migration: page:0000000080d05b9d refcount:-1 mapcount:0 mapping:000000005f4d82a8 index:0xe2 pfn:0x14c12 aops:ubifs_file_address_operations [ubifs] ino:8f1 dentry name:"f30e" flags: 0x1fffff80002405(locked|uptodate|owner_priv_1|private|node=0| zone=1|lastcpupid=0x1fffff) page dumped because: VM_BUG_ON_PAGE(page_count(page) != 0) ------------[ cut here ]------------ kernel BUG at include/linux/page_ref.h:184! invalid opcode: 0000 [#1] SMP CPU: 3 PID: 38 Comm: kcompactd0 Not tainted 5.15.0-rc5 RIP: 0010:migrate_page_move_mapping+0xac3/0xe70 Call Trace: ubifs_migrate_page+0x22/0xc0 [ubifs] move_to_new_page+0xb4/0x600 migrate_pages+0x1523/0x1cc0 compact_zone+0x8c5/0x14b0 kcompactd+0x2bc/0x560 kthread+0x18c/0x1e0 ret_from_fork+0x1f/0x30 Before the time, we should make clean a concept, what does refcount means in page gotten from grab_cache_page_write_begin(). There are 2 situations: Situation 1: refcount is 3, page is created by __page_cache_alloc. TYPE_A - the write process is using this page TYPE_B - page is assigned to one certain mapping by calling __add_to_page_cache_locked() TYPE_C - page is added into pagevec list corresponding current cpu by calling lru_cache_add() Situation 2: refcount is 2, page is gotten from the mapping's tree TYPE_B - page has been assigned to one certain mapping TYPE_A - the write process is using this page (by calling page_cache_get_speculative()) Filesystem releases one refcount by calling put_page() in xxx_write_end(), the released refcount corresponds to TYPE_A (write task is using it). If there are any processes using a page, page migration process will skip the page by judging whether expected_page_refs() equals to page refcount. The BUG is caused by following process: PA(cpu 0) kcompactd(cpu 1) compact_zone ubifs_write_begin page_a = grab_cache_page_write_begin add_to_page_cache_lru lru_cache_add pagevec_add // put page into cpu 0's pagevec (refcnf = 3, for page creation process) ubifs_write_end SetPagePrivate(page_a) // doesn't increase page count ! unlock_page(page_a) put_page(page_a) // refcnt = 2 [...] PB(cpu 0) filemap_read filemap_get_pages add_to_page_cache_lru lru_cache_add __pagevec_lru_add // traverse all pages in cpu 0's pagevec __pagevec_lru_add_fn SetPageLRU(page_a) isolate_migratepages isolate_migratepages_block get_page_unless_zero(page_a) // refcnt = 3 list_add(page_a, from_list) migrate_pages(from_list) __unmap_and_move move_to_new_page ubifs_migrate_page(page_a) migrate_page_move_mapping expected_page_refs get 3 (migration[1] + mapping[1] + private[1]) release_pages put_page_testzero(page_a) // refcnt = 3 page_ref_freeze // refcnt = 0 page_ref_dec_and_test(0 - 1 = -1) page_ref_unfreeze VM_BUG_ON_PAGE(-1 != 0, page) UBIFS doesn't increase the page refcount after setting private flag, which leads to page migration task believes the page is not used by any other processes, so the page is migrated. This causes concurrent accessing on page refcount between put_page() called by other process(eg. read process calls lru_cache_add) and page_ref_unfreeze() called by migration task. Actually zhangjun has tried to fix this problem [2] by recalculating page refcnt in ubifs_migrate_page(). It's better to follow MM rules [1], because just like Kirill suggested in [2], we need to check all users of page_has_private() helper. Like f2fs does in [3], fix it by adding/deleting refcount when setting/clearing private for a page. BTW, according to [4], we set 'page->private' as 1 because ubifs just simply SetPagePrivate(). And, [5] provided a common helper to set/clear page private, ubifs can use this helper following the example of iomap, afs, btrfs, etc. Jump [6] to find a reproducer. [1] https://lore.kernel.org/lkml/2b19b3c4-2bc4-15fa-15cc-27a13e5c7af1@aol.com [2] https://www.spinics.net/lists/linux-mtd/msg04018.html [3] http://lkml.iu.edu/hypermail/linux/kernel/1903.0/03313.html [4] https://lore.kernel.org/linux-f2fs-devel/20210422154705.GO3596236@casper.infradead.org [5] https://lore.kernel.org/all/20200517214718.468-1-guoqing.jiang@cloud.ionos.com [6] https://bugzilla.kernel.org/show_bug.cgi?id=214961 Fixes: 1e51764a3c2ac0 ("UBIFS: add new flash file system") Signed-off-by: Zhihao Cheng <chengzhihao1@huawei.com> Signed-off-by: Richard Weinberger <richard@nod.at>
2022-01-10ubifs: Fix read out-of-bounds in ubifs_wbuf_write_nolock()Zhihao Cheng1-4/+30
Function ubifs_wbuf_write_nolock() may access buf out of bounds in following process: ubifs_wbuf_write_nolock(): aligned_len = ALIGN(len, 8); // Assume len = 4089, aligned_len = 4096 if (aligned_len <= wbuf->avail) ... // Not satisfy if (wbuf->used) { ubifs_leb_write() // Fill some data in avail wbuf len -= wbuf->avail; // len is still not 8-bytes aligned aligned_len -= wbuf->avail; } n = aligned_len >> c->max_write_shift; if (n) { n <<= c->max_write_shift; err = ubifs_leb_write(c, wbuf->lnum, buf + written, wbuf->offs, n); // n > len, read out of bounds less than 8(n-len) bytes } , which can be catched by KASAN: ========================================================= BUG: KASAN: slab-out-of-bounds in ecc_sw_hamming_calculate+0x1dc/0x7d0 Read of size 4 at addr ffff888105594ff8 by task kworker/u8:4/128 Workqueue: writeback wb_workfn (flush-ubifs_0_0) Call Trace: kasan_report.cold+0x81/0x165 nand_write_page_swecc+0xa9/0x160 ubifs_leb_write+0xf2/0x1b0 [ubifs] ubifs_wbuf_write_nolock+0x421/0x12c0 [ubifs] write_head+0xdc/0x1c0 [ubifs] ubifs_jnl_write_inode+0x627/0x960 [ubifs] wb_workfn+0x8af/0xb80 Function ubifs_wbuf_write_nolock() accepts that parameter 'len' is not 8 bytes aligned, the 'len' represents the true length of buf (which is allocated in 'ubifs_jnl_xxx', eg. ubifs_jnl_write_inode), so ubifs_wbuf_write_nolock() must handle the length read from 'buf' carefully to write leb safely. Fetch a reproducer in [Link]. Fixes: 1e51764a3c2ac0 ("UBIFS: add new flash file system") Link: https://bugzilla.kernel.org/show_bug.cgi?id=214785 Reported-by: Chengsong Ke <kechengsong@huawei.com> Signed-off-by: Zhihao Cheng <chengzhihao1@huawei.com> Signed-off-by: Richard Weinberger <richard@nod.at>
2022-01-10ubifs: setflags: Make dirtied_ino_d 8 bytes alignedZhihao Cheng1-1/+1
Make 'ui->data_len' aligned with 8 bytes before it is assigned to dirtied_ino_d. Since 8871d84c8f8b0c6b("ubifs: convert to fileattr") applied, 'setflags()' only affects regular files and directories, only xattr inode, symlink inode and special inode(pipe/char_dev/block_dev) have none- zero 'ui->data_len' field, so assertion '!(req->dirtied_ino_d & 7)' cannot fail in ubifs_budget_space(). To avoid assertion fails in future evolution(eg. setflags can operate special inodes), it's better to make dirtied_ino_d 8 bytes aligned, after all aligned size is still zero for regular files. Fixes: 1e51764a3c2ac05a ("UBIFS: add new flash file system") Signed-off-by: Zhihao Cheng <chengzhihao1@huawei.com> Signed-off-by: Richard Weinberger <richard@nod.at>
2022-01-10ubifs: Rectify space amount budget for mkdir/tmpfile operationsZhihao Cheng1-4/+8
UBIFS should make sure the flash has enough space to store dirty (Data that is newer than disk) data (in memory), space budget is exactly designed to do that. If space budget calculates less data than we need, 'make_reservation()' will do more work(return -ENOSPC if no free space lelf, sometimes we can see "cannot reserve xxx bytes in jhead xxx, error -28" in ubifs error messages) with ubifs inodes locked, which may effect other syscalls. A simple way to decide how much space do we need when make a budget: See how much space is needed by 'make_reservation()' in ubifs_jnl_xxx() function according to corresponding operation. It's better to report ENOSPC in ubifs_budget_space(), as early as we can. Fixes: 474b93704f32163 ("ubifs: Implement O_TMPFILE") Fixes: 1e51764a3c2ac05 ("UBIFS: add new flash file system") Signed-off-by: Zhihao Cheng <chengzhihao1@huawei.com> Signed-off-by: Richard Weinberger <richard@nod.at>
2022-01-10ubifs: Fix 'ui->dirty' race between do_tmpfile() and writeback workZhihao Cheng1-30/+30
'ui->dirty' is not protected by 'ui_mutex' in function do_tmpfile() which may race with ubifs_write_inode[wb_workfn] to access/update 'ui->dirty', finally dirty space is released twice. open(O_TMPFILE) wb_workfn do_tmpfile ubifs_budget_space(ino_req = { .dirtied_ino = 1}) d_tmpfile // mark inode(tmpfile) dirty ubifs_jnl_update // without holding tmpfile's ui_mutex mark_inode_clean(ui) if (ui->dirty) ubifs_release_dirty_inode_budget(ui) // release first time ubifs_write_inode mutex_lock(&ui->ui_mutex) ubifs_release_dirty_inode_budget(ui) // release second time mutex_unlock(&ui->ui_mutex) ui->dirty = 0 Run generic/476 can reproduce following message easily (See reproducer in [Link]): UBIFS error (ubi0:0 pid 2578): ubifs_assert_failed [ubifs]: UBIFS assert failed: c->bi.dd_growth >= 0, in fs/ubifs/budget.c:554 UBIFS warning (ubi0:0 pid 2578): ubifs_ro_mode [ubifs]: switched to read-only mode, error -22 Workqueue: writeback wb_workfn (flush-ubifs_0_0) Call Trace: ubifs_ro_mode+0x54/0x60 [ubifs] ubifs_assert_failed+0x4b/0x80 [ubifs] ubifs_release_budget+0x468/0x5a0 [ubifs] ubifs_release_dirty_inode_budget+0x53/0x80 [ubifs] ubifs_write_inode+0x121/0x1f0 [ubifs] ... wb_workfn+0x283/0x7b0 Fix it by holding tmpfile ubifs inode lock during ubifs_jnl_update(). Similar problem exists in whiteout renaming, but previous fix("ubifs: Rename whiteout atomically") has solved the problem. Fixes: 474b93704f32163 ("ubifs: Implement O_TMPFILE") Link: https://bugzilla.kernel.org/show_bug.cgi?id=214765 Signed-off-by: Zhihao Cheng <chengzhihao1@huawei.com> Signed-off-by: Richard Weinberger <richard@nod.at>
2022-01-10ubifs: Rename whiteout atomicallyZhihao Cheng2-60/+136
Currently, rename whiteout has 3 steps: 1. create tmpfile(which associates old dentry to tmpfile inode) for whiteout, and store tmpfile to disk 2. link whiteout, associate whiteout inode to old dentry agagin and store old dentry, old inode, new dentry on disk 3. writeback dirty whiteout inode to disk Suddenly power-cut or error occurring(eg. ENOSPC returned by budget, memory allocation failure) during above steps may cause kinds of problems: Problem 1: ENOSPC returned by whiteout space budget (before step 2), old dentry will disappear after rename syscall, whiteout file cannot be found either. ls dir // we get file, whiteout rename(dir/file, dir/whiteout, REANME_WHITEOUT) ENOSPC = ubifs_budget_space(&wht_req) // return ls dir // empty (no file, no whiteout) Problem 2: Power-cut happens before step 3, whiteout inode with 'nlink=1' is not stored on disk, whiteout dentry(old dentry) is written on disk, whiteout file is lost on next mount (We get "dead directory entry" after executing 'ls -l' on whiteout file). Now, we use following 3 steps to finish rename whiteout: 1. create an in-mem inode with 'nlink = 1' as whiteout 2. ubifs_jnl_rename (Write on disk to finish associating old dentry to whiteout inode, associating new dentry with old inode) 3. iput(whiteout) Rely writing in-mem inode on disk by ubifs_jnl_rename() to finish rename whiteout, which avoids middle disk state caused by suddenly power-cut and error occurring. Fixes: 9e0a1fff8db56ea ("ubifs: Implement RENAME_WHITEOUT") Signed-off-by: Zhihao Cheng <chengzhihao1@huawei.com> Signed-off-by: Richard Weinberger <richard@nod.at>
2022-01-09ubifs: Add missing iput if do_tmpfile() failed in rename whiteoutZhihao Cheng1-0/+2
whiteout inode should be put when do_tmpfile() failed if inode has been initialized. Otherwise we will get following warning during umount: UBIFS error (ubi0:0 pid 1494): ubifs_assert_failed [ubifs]: UBIFS assert failed: c->bi.dd_growth == 0, in fs/ubifs/super.c:1930 VFS: Busy inodes after unmount of ubifs. Self-destruct in 5 seconds. Fixes: 9e0a1fff8db56ea ("ubifs: Implement RENAME_WHITEOUT") Signed-off-by: Zhihao Cheng <chengzhihao1@huawei.com> Suggested-by: Sascha Hauer <s.hauer@pengutronix.de> Signed-off-by: Richard Weinberger <richard@nod.at>
2022-01-09ubifs: Fix wrong number of inodes locked by ui_mutex in ubifs_inode commentZhihao Cheng1-1/+1
Since 9ec64962afb1702f75b("ubifs: Implement RENAME_EXCHANGE") and 9e0a1fff8db56eaaebb("ubifs: Implement RENAME_WHITEOUT") are applied, ubifs_rename locks and changes 4 ubifs inodes, correct the comment for ui_mutex in ubifs_inode. Signed-off-by: Zhihao Cheng <chengzhihao1@huawei.com> Signed-off-by: Richard Weinberger <richard@nod.at>
2022-01-09ubifs: Fix deadlock in concurrent rename whiteout and inode writebackZhihao Cheng1-10/+15
Following hung tasks: [ 77.028764] task:kworker/u8:4 state:D stack: 0 pid: 132 [ 77.028820] Call Trace: [ 77.029027] schedule+0x8c/0x1b0 [ 77.029067] mutex_lock+0x50/0x60 [ 77.029074] ubifs_write_inode+0x68/0x1f0 [ubifs] [ 77.029117] __writeback_single_inode+0x43c/0x570 [ 77.029128] writeback_sb_inodes+0x259/0x740 [ 77.029148] wb_writeback+0x107/0x4d0 [ 77.029163] wb_workfn+0x162/0x7b0 [ 92.390442] task:aa state:D stack: 0 pid: 1506 [ 92.390448] Call Trace: [ 92.390458] schedule+0x8c/0x1b0 [ 92.390461] wb_wait_for_completion+0x82/0xd0 [ 92.390469] __writeback_inodes_sb_nr+0xb2/0x110 [ 92.390472] writeback_inodes_sb_nr+0x14/0x20 [ 92.390476] ubifs_budget_space+0x705/0xdd0 [ubifs] [ 92.390503] do_rename.cold+0x7f/0x187 [ubifs] [ 92.390549] ubifs_rename+0x8b/0x180 [ubifs] [ 92.390571] vfs_rename+0xdb2/0x1170 [ 92.390580] do_renameat2+0x554/0x770 , are caused by concurrent rename whiteout and inode writeback processes: rename_whiteout(Thread 1) wb_workfn(Thread2) ubifs_rename do_rename lock_4_inodes (Hold ui_mutex) ubifs_budget_space make_free_space shrink_liability __writeback_inodes_sb_nr bdi_split_work_to_wbs (Queue new wb work) wb_do_writeback(wb work) __writeback_single_inode ubifs_write_inode LOCK(ui_mutex) ↑ wb_wait_for_completion (Wait wb work) <-- deadlock! Reproducer (Detail program in [Link]): 1. SYS_renameat2("/mp/dir/file", "/mp/dir/whiteout", RENAME_WHITEOUT) 2. Consume out of space before kernel(mdelay) doing budget for whiteout Fix it by doing whiteout space budget before locking ubifs inodes. BTW, it also fixes wrong goto tag 'out_release' in whiteout budget error handling path(It should at least recover dir i_size and unlock 4 ubifs inodes). Fixes: 9e0a1fff8db56ea ("ubifs: Implement RENAME_WHITEOUT") Link: https://bugzilla.kernel.org/show_bug.cgi?id=214733 Signed-off-by: Zhihao Cheng <chengzhihao1@huawei.com> Signed-off-by: Richard Weinberger <richard@nod.at>
2022-01-09ubifs: rename_whiteout: Fix double free for whiteout_ui->dataZhihao Cheng1-2/+0
'whiteout_ui->data' will be freed twice if space budget fail for rename whiteout operation as following process: rename_whiteout dev = kmalloc whiteout_ui->data = dev kfree(whiteout_ui->data) // Free first time iput(whiteout) ubifs_free_inode kfree(ui->data) // Double free! KASAN reports: ================================================================== BUG: KASAN: double-free or invalid-free in ubifs_free_inode+0x4f/0x70 Call Trace: kfree+0x117/0x490 ubifs_free_inode+0x4f/0x70 [ubifs] i_callback+0x30/0x60 rcu_do_batch+0x366/0xac0 __do_softirq+0x133/0x57f Allocated by task 1506: kmem_cache_alloc_trace+0x3c2/0x7a0 do_rename+0x9b7/0x1150 [ubifs] ubifs_rename+0x106/0x1f0 [ubifs] do_syscall_64+0x35/0x80 Freed by task 1506: kfree+0x117/0x490 do_rename.cold+0x53/0x8a [ubifs] ubifs_rename+0x106/0x1f0 [ubifs] do_syscall_64+0x35/0x80 The buggy address belongs to the object at ffff88810238bed8 which belongs to the cache kmalloc-8 of size 8 ================================================================== Let ubifs_free_inode() free 'whiteout_ui->data'. BTW, delete unused assignment 'whiteout_ui->data_len = 0', process 'ubifs_evict_inode() -> ubifs_jnl_delete_inode() -> ubifs_jnl_write_inode()' doesn't need it (because 'inc_nlink(whiteout)' won't be excuted by 'goto out_release', and the nlink of whiteout inode is 0). Fixes: 9e0a1fff8db56ea ("ubifs: Implement RENAME_WHITEOUT") Signed-off-by: Zhihao Cheng <chengzhihao1@huawei.com> Signed-off-by: Richard Weinberger <richard@nod.at>
2021-12-23ubifs: read-only if LEB may always be taken in ubifs_garbage_collectBaokun Li1-1/+10
If ubifs_garbage_collect_leb() returns -EAGAIN and ubifs_return_leb returns error, a LEB will always has a "taken" flag. In this case, set the ubifs to read-only to prevent a worse situation. Signed-off-by: Baokun Li <libaokun1@huawei.com> Signed-off-by: Richard Weinberger <richard@nod.at>
2021-12-23ubifs: fix double return leb in ubifs_garbage_collectBaokun Li1-0/+2
If ubifs_garbage_collect_leb() returns -EAGAIN and enters the "out" branch, ubifs_return_leb will execute twice on the same lnum. This can cause data loss in concurrency situations. Reported-by: Hulk Robot <hulkci@huawei.com> Signed-off-by: Baokun Li <libaokun1@huawei.com> Signed-off-by: Richard Weinberger <richard@nod.at>
2021-12-23ubifs: fix slab-out-of-bounds in ubifs_change_lpBaokun Li1-1/+5
Hulk Robot reported a KASAN report about slab-out-of-bounds: ================================================================== BUG: KASAN: slab-out-of-bounds in ubifs_change_lp+0x3a9/0x1390 [ubifs] Read of size 8 at addr ffff888101c961f8 by task fsstress/1068 [...] Call Trace: check_memory_region+0x1c1/0x1e0 ubifs_change_lp+0x3a9/0x1390 [ubifs] ubifs_change_one_lp+0x170/0x220 [ubifs] ubifs_garbage_collect+0x7f9/0xda0 [ubifs] ubifs_budget_space+0xfe4/0x1bd0 [ubifs] ubifs_write_begin+0x528/0x10c0 [ubifs] Allocated by task 1068: kmemdup+0x25/0x50 ubifs_lpt_lookup_dirty+0x372/0xb00 [ubifs] ubifs_update_one_lp+0x46/0x260 [ubifs] ubifs_tnc_end_commit+0x98b/0x1720 [ubifs] do_commit+0x6cb/0x1950 [ubifs] ubifs_run_commit+0x15a/0x2b0 [ubifs] ubifs_budget_space+0x1061/0x1bd0 [ubifs] ubifs_write_begin+0x528/0x10c0 [ubifs] [...] ================================================================== In ubifs_garbage_collect(), if ubifs_find_dirty_leb returns an error, lp is an uninitialized variable. But lp.num might be used in the out branch, which is a random value. If the value is -1 or another value that can pass the check, soob may occur in the ubifs_change_lp() in the following procedure. To solve this problem, we initialize lp.lnum to -1, and then initialize it correctly in ubifs_find_dirty_leb, which is not equal to -1, and ubifs_return_leb is executed only when lp.lnum != -1. if find a retained or indexing LEB and continue to next loop, but break before find another LEB, the "taken" flag of this LEB will be cleaned in ubi_return_lebi(). This bug has also been fixed in this patch. Reported-by: Hulk Robot <hulkci@huawei.com> Signed-off-by: Baokun Li <libaokun1@huawei.com> Signed-off-by: Richard Weinberger <richard@nod.at>
2021-12-23ubifs: fix snprintf() length checkDan Carpenter1-1/+1
The snprintf() function returns the number of bytes (not including the NUL terminator) which would have been printed if there were enough space. So it can be greater than UBIFS_DFS_DIR_LEN. And actually if it equals UBIFS_DFS_DIR_LEN then that's okay so this check is too strict. Fixes: 9a620291fc01 ("ubifs: Export filesystem error counters") Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Richard Weinberger <richard@nod.at>
2021-12-23ubifs: Export filesystem error countersStefan Schaeckeler5-2/+225
Not all ubifs filesystem errors are propagated to userspace. Export bad magic, bad node and crc errors via sysfs. This allows userspace to notice filesystem errors: /sys/fs/ubifs/ubiX_Y/errors_magic /sys/fs/ubifs/ubiX_Y/errors_node /sys/fs/ubifs/ubiX_Y/errors_crc The counters are reset to 0 with a remount. Signed-off-by: Stefan Schaeckeler <sschaeck@cisco.com> Signed-off-by: Richard Weinberger <richard@nod.at>
2021-12-23ubifs: Error path in ubifs_remount_rw() seems to wrongly free write buffersPetr Cvachoucek1-1/+0
it seems freeing the write buffers in the error path of the ubifs_remount_rw() is wrong. It leads later to a kernel oops like this: [10016.431274] UBIFS (ubi0:0): start fixing up free space [10090.810042] UBIFS (ubi0:0): free space fixup complete [10090.814623] UBIFS error (ubi0:0 pid 512): ubifs_remount_fs: cannot spawn "ubifs_bgt0_0", error -4 [10101.915108] UBIFS (ubi0:0): background thread "ubifs_bgt0_0" started, PID 517 [10105.275498] Unable to handle kernel NULL pointer dereference at virtual address 0000000000000030 [10105.284352] Mem abort info: [10105.287160] ESR = 0x96000006 [10105.290252] EC = 0x25: DABT (current EL), IL = 32 bits [10105.295592] SET = 0, FnV = 0 [10105.298652] EA = 0, S1PTW = 0 [10105.301848] Data abort info: [10105.304723] ISV = 0, ISS = 0x00000006 [10105.308573] CM = 0, WnR = 0 [10105.311564] user pgtable: 4k pages, 48-bit VAs, pgdp=00000000f03d1000 [10105.318034] [0000000000000030] pgd=00000000f6cee003, pud=00000000f4884003, pmd=0000000000000000 [10105.326783] Internal error: Oops: 96000006 [#1] PREEMPT SMP [10105.332355] Modules linked in: ath10k_pci ath10k_core ath mac80211 libarc4 cfg80211 nvme nvme_core cryptodev(O) [10105.342468] CPU: 3 PID: 518 Comm: touch Tainted: G O 5.4.3 #1 [10105.349517] Hardware name: HYPEX CPU (DT) [10105.353525] pstate: 40000005 (nZcv daif -PAN -UAO) [10105.358324] pc : atomic64_try_cmpxchg_acquire.constprop.22+0x8/0x34 [10105.364596] lr : mutex_lock+0x1c/0x34 [10105.368253] sp : ffff000075633aa0 [10105.371563] x29: ffff000075633aa0 x28: 0000000000000001 [10105.376874] x27: ffff000076fa80c8 x26: 0000000000000004 [10105.382185] x25: 0000000000000030 x24: 0000000000000000 [10105.387495] x23: 0000000000000000 x22: 0000000000000038 [10105.392807] x21: 000000000000000c x20: ffff000076fa80c8 [10105.398119] x19: ffff000076fa8000 x18: 0000000000000000 [10105.403429] x17: 0000000000000000 x16: 0000000000000000 [10105.408741] x15: 0000000000000000 x14: fefefefefefefeff [10105.414052] x13: 0000000000000000 x12: 0000000000000fe0 [10105.419364] x11: 0000000000000fe0 x10: ffff000076709020 [10105.424675] x9 : 0000000000000000 x8 : 00000000000000a0 [10105.429986] x7 : ffff000076fa80f4 x6 : 0000000000000030 [10105.435297] x5 : 0000000000000000 x4 : 0000000000000000 [10105.440609] x3 : 0000000000000000 x2 : ffff00006f276040 [10105.445920] x1 : ffff000075633ab8 x0 : 0000000000000030 [10105.451232] Call trace: [10105.453676] atomic64_try_cmpxchg_acquire.constprop.22+0x8/0x34 [10105.459600] ubifs_garbage_collect+0xb4/0x334 [10105.463956] ubifs_budget_space+0x398/0x458 [10105.468139] ubifs_create+0x50/0x180 [10105.471712] path_openat+0x6a0/0x9b0 [10105.475284] do_filp_open+0x34/0x7c [10105.478771] do_sys_open+0x78/0xe4 [10105.482170] __arm64_sys_openat+0x1c/0x24 [10105.486180] el0_svc_handler+0x84/0xc8 [10105.489928] el0_svc+0x8/0xc [10105.492808] Code: 52800013 17fffffb d2800003 f9800011 (c85ffc05) [10105.498903] ---[ end trace 46b721d93267a586 ]--- To reproduce the problem: 1. Filesystem initially mounted read-only, free space fixup flag set. 2. mount -o remount,rw <mountpoint> 3. it takes some time (free space fixup running) ... try to terminate running mount by CTRL-C ... does not respond, only after free space fixup is complete ... then "ubifs_remount_fs: cannot spawn "ubifs_bgt0_0", error -4" 4. mount -o remount,rw <mountpoint> ... now finished instantly (fixup already done). 5. Create file or just unmount the filesystem and we get the oops. Cc: <stable@vger.kernel.org> Fixes: b50b9f408502 ("UBIFS: do not free write-buffers when in R/O mode") Signed-off-by: Petr Cvachoucek <cvachoucek@gmail.com> Signed-off-by: Richard Weinberger <richard@nod.at>
2021-12-23ubifs: Make use of the helper macro kthread_run()Cai Huoqing1-4/+2
Repalce kthread_create/wake_up_process() with kthread_run() to simplify the code. Signed-off-by: Cai Huoqing <caihuoqing@baidu.com> Signed-off-by: Richard Weinberger <richard@nod.at>
2021-12-23ubifs: Fix spelling mistakesAlexander Dahl2-3/+3
Found with `codespell -i 3 -w fs/ubifs/**` and proof reading that parts. Signed-off-by: Alexander Dahl <ada@thorsis.com> Signed-off-by: Richard Weinberger <richard@nod.at>
2021-09-20fscrypt: remove fscrypt_operations::max_namelenEric Biggers1-1/+0
The max_namelen field is unnecessary, as it is set to 255 (NAME_MAX) on all filesystems that support fscrypt (or plan to support fscrypt). For simplicity, just use NAME_MAX directly instead. Link: https://lore.kernel.org/r/20210909184513.139281-1-ebiggers@kernel.org Signed-off-by: Eric Biggers <ebiggers@google.com>
2021-07-25ubifs: report correct st_size for encrypted symlinksEric Biggers1-1/+12
The stat() family of syscalls report the wrong size for encrypted symlinks, which has caused breakage in several userspace programs. Fix this by calling fscrypt_symlink_getattr() after ubifs_getattr() for encrypted symlinks. This function computes the correct size by reading and decrypting the symlink target (if it's not already cached). For more details, see the commit which added fscrypt_symlink_getattr(). Fixes: ca7f85be8d6c ("ubifs: Add support for encrypted symlinks") Cc: stable@vger.kernel.org Link: https://lore.kernel.org/r/20210702065350.209646-5-ebiggers@kernel.org Signed-off-by: Eric Biggers <ebiggers@google.com>
2021-06-22ubifs: Set/Clear I_LINKABLE under i_lock for whiteout inodeZhihao Cheng1-0/+7
xfstests-generic/476 reports a warning message as below: WARNING: CPU: 2 PID: 30347 at fs/inode.c:361 inc_nlink+0x52/0x70 Call Trace: do_rename+0x502/0xd40 [ubifs] ubifs_rename+0x8b/0x180 [ubifs] vfs_rename+0x476/0x1080 do_renameat2+0x67c/0x7b0 __x64_sys_renameat2+0x6e/0x90 do_syscall_64+0x66/0xe0 entry_SYSCALL_64_after_hwframe+0x44/0xae Following race case can cause this: rename_whiteout(Thread 1) wb_workfn(Thread 2) ubifs_rename do_rename __writeback_single_inode spin_lock(&inode->i_lock) whiteout->i_state |= I_LINKABLE inode->i_state &= ~dirty; ---- How race happens on i_state: (tmp = whiteout->i_state | I_LINKABLE) (tmp = inode->i_state & ~dirty) (whiteout->i_state = tmp) (inode->i_state = tmp) ---- spin_unlock(&inode->i_lock) inc_nlink(whiteout) WARN_ON(!(inode->i_state & I_LINKABLE)) !!! Fix to add i_lock to avoid i_state update race condition. Fixes: 9e0a1fff8db56ea ("ubifs: Implement RENAME_WHITEOUT") Signed-off-by: Zhihao Cheng <chengzhihao1@huawei.com> Signed-off-by: Richard Weinberger <richard@nod.at>
2021-06-22ubifs: Fix spelling mistakesZheng Yongjun6-6/+6
Fix some spelling mistakes in comments: withoug ==> without numer ==> number aswell ==> as well referes ==> refers childs ==> children unnecesarry ==> unnecessary Signed-off-by: Zheng Yongjun <zhengyongjun3@huawei.com> Reviewed-by: Alexander Dahl <ada@thorsis.com> Signed-off-by: Richard Weinberger <richard@nod.at>
2021-06-18ubifs: Remove ui_mutex in ubifs_xattr_get and change_xattrZhihao Cheng1-4/+0
Since ubifs_xattr_get and ubifs_xattr_set cannot being executed parallelly after importing @host_ui->xattr_sem, now we can remove ui_mutex imported by commit ab92a20bce3b4c2 ("ubifs: make ubifs_[get|set]xattr atomic"). @xattr_size, @xattr_names and @xattr_cnt can't be out of protection by @host_ui->mutex yet, they are sill accesed in other places, such as pack_inode() called by ubifs_write_inode() triggered by page-writeback. Signed-off-by: Zhihao Cheng <chengzhihao1@huawei.com> Signed-off-by: Richard Weinberger <richard@nod.at>
2021-06-18ubifs: Fix races between xattr_{set|get} and listxattr operationsZhihao Cheng3-11/+36
UBIFS may occur some problems with concurrent xattr_{set|get} and listxattr operations, such as assertion failure, memory corruption, stale xattr value[1]. Fix it by importing a new rw-lock in @ubifs_inode to serilize write operations on xattr, concurrent read operations are still effective, just like ext4. [1] https://lore.kernel.org/linux-mtd/20200630130438.141649-1-houtao1@huawei.com Fixes: 1e51764a3c2ac05a23 ("UBIFS: add new flash file system") Cc: stable@vger.kernel.org # v2.6+ Signed-off-by: Zhihao Cheng <chengzhihao1@huawei.com> Reviewed-by: Sascha Hauer <s.hauer@pengutronix.de> Signed-off-by: Richard Weinberger <richard@nod.at>
2021-06-18ubifs: fix snprintf() checkingDan Carpenter1-1/+1
The snprintf() function returns the number of characters (not counting the NUL terminator) that it would have printed if we had space. This buffer has UBIFS_DFS_DIR_LEN characters plus one extra for the terminator. Printing UBIFS_DFS_DIR_LEN is okay but anything higher will result in truncation. Thus the comparison needs to be change from == to >. These strings are compile time constants so this patch doesn't affect runtime. Fixes: ae380ce04731 ("UBIFS: lessen the size of debugging info data structure") Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Reviewed-by: Alexander Dahl <ada@thorsis.com> Signed-off-by: Richard Weinberger <richard@nod.at>
2021-06-18ubifs: journal: Fix error return code in ubifs_jnl_write_inode()Zhen Lei1-0/+1
Fix to return a negative error code from the error handling case instead of 0, as done elsewhere in this function. Fixes: 9ca2d7326444 ("ubifs: Limit number of xattrs per inode") Reported-by: Hulk Robot <hulkci@huawei.com> Signed-off-by: Zhen Lei <thunder.leizhen@huawei.com> Signed-off-by: Richard Weinberger <richard@nod.at>
2021-05-04Merge tag 'for-linus-5.13-rc1' of ↵Linus Torvalds3-3/+9
git://git.kernel.org/pub/scm/linux/kernel/git/rw/ubifs Pull JFFS2, UBI and UBIFS updates from Richard Weinberger: "JFFS2: - Use splice_write() - Fix for a slab-out-of-bounds bug UBI: - Fix for clang related warnings - Code cleanup UBIFS: - Fix for inode rebirth at replay - Set s_uuid - Use zstd for default filesystem" * tag 'for-linus-5.13-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rw/ubifs: ubi: Remove unnecessary struct declaration jffs2: Hook up splice_write callback jffs2: avoid Wempty-body warnings jffs2: Fix kasan slab-out-of-bounds problem ubi: Fix fall-through warnings for Clang ubifs: Report max LEB count at mount time ubifs: Set s_uuid in super block to support ima/evm uuid options ubifs: Default to zstd compression ubifs: Only check replay with inode type to judge if inode linked
2021-04-27Merge branch 'miklos.fileattr' of ↵Linus Torvalds4-42/+43
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull fileattr conversion updates from Miklos Szeredi via Al Viro: "This splits the handling of FS_IOC_[GS]ETFLAGS from ->ioctl() into a separate method. The interface is reasonably uniform across the filesystems that support it and gives nice boilerplate removal" * 'miklos.fileattr' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (23 commits) ovl: remove unneeded ioctls fuse: convert to fileattr fuse: add internal open/release helpers fuse: unsigned open flags fuse: move ioctl to separate source file vfs: remove unused ioctl helpers ubifs: convert to fileattr reiserfs: convert to fileattr ocfs2: convert to fileattr nilfs2: convert to fileattr jfs: convert to fileattr hfsplus: convert to fileattr efivars: convert to fileattr xfs: convert to fileattr orangefs: convert to fileattr gfs2: convert to fileattr f2fs: convert to fileattr ext4: convert to fileattr ext2: convert to fileattr btrfs: convert to fileattr ...
2021-04-15ubifs: Report max LEB count at mount timeMartin Devera1-2/+2
There is no other way to directly report/query this quantity. It is useful when planing how given filesystem can be resized. Signed-off-by: Martin Devera <devik@eaxlabs.cz> Signed-off-by: Richard Weinberger <richard@nod.at>
2021-04-15ubifs: Set s_uuid in super block to support ima/evm uuid optionsSteffen Trumtrar1-0/+2
This is required to provide uuid based integrity functionality for: ima_policy (fsuuid option) and the 'evmctl' command ('--uuid' option). Co-developed-by: Oleksij Rempel <o.rempel@pengutronix.de> Co-developed-by: Juergen Borleis <jbe@pengutronix.de> Signed-off-by: Steffen Trumtrar <s.trumtrar@pengutronix.de> Reviewed-by: Andy Shevchenko <andy.shevchenko@gmail.com> Signed-off-by: Richard Weinberger <richard@nod.at>
2021-04-15ubifs: Default to zstd compressionRui Salvaterra1-0/+3
Compared to lzo and zlib, zstd is the best all-around performer, both in terms of speed and compression ratio. Set it as the default, if available. Signed-off-by: Rui Salvaterra <rsalvaterra@gmail.com> Signed-off-by: Richard Weinberger <richard@nod.at>
2021-04-15ubifs: Only check replay with inode type to judge if inode linkedGuochun Mao1-1/+2
Conside the following case, it just write a big file into flash, when complete writing, delete the file, and then power off promptly. Next time power on, we'll get a replay list like: ... LEB 1105:211344 len 4144 deletion 0 sqnum 428783 key type 1 inode 80 LEB 15:233544 len 160 deletion 1 sqnum 428785 key type 0 inode 80 LEB 1105:215488 len 4144 deletion 0 sqnum 428787 key type 1 inode 80 ... In the replay list, data nodes' deletion are 0, and the inode node's deletion is 1. In current logic, the file's dentry will be removed, but inode and the flash space it occupied will be reserved. User will see that much free space been disappeared. We only need to check the deletion value of the following inode type node of the replay entry. Fixes: e58725d51fa8 ("ubifs: Handle re-linking of inodes correctly while recovery") Cc: stable@vger.kernel.org Signed-off-by: Guochun Mao <guochun.mao@mediatek.com> Signed-off-by: Richard Weinberger <richard@nod.at>
2021-04-12ubifs: convert to fileattrMiklos Szeredi4-42/+43
Use the fileattr API to let the VFS handle locking, permission checking and conversion. Signed-off-by: Miklos Szeredi <mszeredi@redhat.com> Cc: Richard Weinberger <richard@nod.at>
2021-04-08treewide: Change list_sort to use const pointersSami Tolvanen2-5/+6
list_sort() internally casts the comparison function passed to it to a different type with constant struct list_head pointers, and uses this pointer to call the functions, which trips indirect call Control-Flow Integrity (CFI) checking. Instead of removing the consts, this change defines the list_cmp_func_t type and changes the comparison function types of all list_sort() callers to use const pointers, thus avoiding type mismatches. Suggested-by: Nick Desaulniers <ndesaulniers@google.com> Signed-off-by: Sami Tolvanen <samitolvanen@google.com> Reviewed-by: Nick Desaulniers <ndesaulniers@google.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Kees Cook <keescook@chromium.org> Tested-by: Nick Desaulniers <ndesaulniers@google.com> Tested-by: Nathan Chancellor <nathan@kernel.org> Signed-off-by: Kees Cook <keescook@chromium.org> Link: https://lore.kernel.org/r/20210408182843.1754385-10-samitolvanen@google.com
2021-02-23Merge tag 'idmapped-mounts-v5.12' of ↵Linus Torvalds5-19/+24
git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux Pull idmapped mounts from Christian Brauner: "This introduces idmapped mounts which has been in the making for some time. Simply put, different mounts can expose the same file or directory with different ownership. This initial implementation comes with ports for fat, ext4 and with Christoph's port for xfs with more filesystems being actively worked on by independent people and maintainers. Idmapping mounts handle a wide range of long standing use-cases. Here are just a few: - Idmapped mounts make it possible to easily share files between multiple users or multiple machines especially in complex scenarios. For example, idmapped mounts will be used in the implementation of portable home directories in systemd-homed.service(8) where they allow users to move their home directory to an external storage device and use it on multiple computers where they are assigned different uids and gids. This effectively makes it possible to assign random uids and gids at login time. - It is possible to share files from the host with unprivileged containers without having to change ownership permanently through chown(2). - It is possible to idmap a container's rootfs and without having to mangle every file. For example, Chromebooks use it to share the user's Download folder with their unprivileged containers in their Linux subsystem. - It is possible to share files between containers with non-overlapping idmappings. - Filesystem that lack a proper concept of ownership such as fat can use idmapped mounts to implement discretionary access (DAC) permission checking. - They allow users to efficiently changing ownership on a per-mount basis without having to (recursively) chown(2) all files. In contrast to chown (2) changing ownership of large sets of files is instantenous with idmapped mounts. This is especially useful when ownership of a whole root filesystem of a virtual machine or container is changed. With idmapped mounts a single syscall mount_setattr syscall will be sufficient to change the ownership of all files. - Idmapped mounts always take the current ownership into account as idmappings specify what a given uid or gid is supposed to be mapped to. This contrasts with the chown(2) syscall which cannot by itself take the current ownership of the files it changes into account. It simply changes the ownership to the specified uid and gid. This is especially problematic when recursively chown(2)ing a large set of files which is commong with the aforementioned portable home directory and container and vm scenario. - Idmapped mounts allow to change ownership locally, restricting it to specific mounts, and temporarily as the ownership changes only apply as long as the mount exists. Several userspace projects have either already put up patches and pull-requests for this feature or will do so should you decide to pull this: - systemd: In a wide variety of scenarios but especially right away in their implementation of portable home directories. https://systemd.io/HOME_DIRECTORY/ - container runtimes: containerd, runC, LXD:To share data between host and unprivileged containers, unprivileged and privileged containers, etc. The pull request for idmapped mounts support in containerd, the default Kubernetes runtime is already up for quite a while now: https://github.com/containerd/containerd/pull/4734 - The virtio-fs developers and several users have expressed interest in using this feature with virtual machines once virtio-fs is ported. - ChromeOS: Sharing host-directories with unprivileged containers. I've tightly synced with all those projects and all of those listed here have also expressed their need/desire for this feature on the mailing list. For more info on how people use this there's a bunch of talks about this too. Here's just two recent ones: https://www.cncf.io/wp-content/uploads/2020/12/Rootless-Containers-in-Gitpod.pdf https://fosdem.org/2021/schedule/event/containers_idmap/ This comes with an extensive xfstests suite covering both ext4 and xfs: https://git.kernel.org/brauner/xfstests-dev/h/idmapped_mounts It covers truncation, creation, opening, xattrs, vfscaps, setid execution, setgid inheritance and more both with idmapped and non-idmapped mounts. It already helped to discover an unrelated xfs setgid inheritance bug which has since been fixed in mainline. It will be sent for inclusion with the xfstests project should you decide to merge this. In order to support per-mount idmappings vfsmounts are marked with user namespaces. The idmapping of the user namespace will be used to map the ids of vfs objects when they are accessed through that mount. By default all vfsmounts are marked with the initial user namespace. The initial user namespace is used to indicate that a mount is not idmapped. All operations behave as before and this is verified in the testsuite. Based on prior discussions we want to attach the whole user namespace and not just a dedicated idmapping struct. This allows us to reuse all the helpers that already exist for dealing with idmappings instead of introducing a whole new range of helpers. In addition, if we decide in the future that we are confident enough to enable unprivileged users to setup idmapped mounts the permission checking can take into account whether the caller is privileged in the user namespace the mount is currently marked with. The user namespace the mount will be marked with can be specified by passing a file descriptor refering to the user namespace as an argument to the new mount_setattr() syscall together with the new MOUNT_ATTR_IDMAP flag. The system call follows the openat2() pattern of extensibility. The following conditions must be met in order to create an idmapped mount: - The caller must currently have the CAP_SYS_ADMIN capability in the user namespace the underlying filesystem has been mounted in. - The underlying filesystem must support idmapped mounts. - The mount must not already be idmapped. This also implies that the idmapping of a mount cannot be altered once it has been idmapped. - The mount must be a detached/anonymous mount, i.e. it must have been created by calling open_tree() with the OPEN_TREE_CLONE flag and it must not already have been visible in the filesystem. The last two points guarantee easier semantics for userspace and the kernel and make the implementation significantly simpler. By default vfsmounts are marked with the initial user namespace and no behavioral or performance changes are observed. The manpage with a detailed description can be found here: https://git.kernel.org/brauner/man-pages/c/1d7b902e2875a1ff342e036a9f866a995640aea8 In order to support idmapped mounts, filesystems need to be changed and mark themselves with the FS_ALLOW_IDMAP flag in fs_flags. The patches to convert individual filesystem are not very large or complicated overall as can be seen from the included fat, ext4, and xfs ports. Patches for other filesystems are actively worked on and will be sent out separately. The xfstestsuite can be used to verify that port has been done correctly. The mount_setattr() syscall is motivated independent of the idmapped mounts patches and it's been around since July 2019. One of the most valuable features of the new mount api is the ability to perform mounts based on file descriptors only. Together with the lookup restrictions available in the openat2() RESOLVE_* flag namespace which we added in v5.6 this is the first time we are close to hardened and race-free (e.g. symlinks) mounting and path resolution. While userspace has started porting to the new mount api to mount proper filesystems and create new bind-mounts it is currently not possible to change mount options of an already existing bind mount in the new mount api since the mount_setattr() syscall is missing. With the addition of the mount_setattr() syscall we remove this last restriction and userspace can now fully port to the new mount api, covering every use-case the old mount api could. We also add the crucial ability to recursively change mount options for a whole mount tree, both removing and adding mount options at the same time. This syscall has been requested multiple times by various people and projects. There is a simple tool available at https://github.com/brauner/mount-idmapped that allows to create idmapped mounts so people can play with this patch series. I'll add support for the regular mount binary should you decide to pull this in the following weeks: Here's an example to a simple idmapped mount of another user's home directory: u1001@f2-vm:/$ sudo ./mount --idmap both:1000:1001:1 /home/ubuntu/ /mnt u1001@f2-vm:/$ ls -al /home/ubuntu/ total 28 drwxr-xr-x 2 ubuntu ubuntu 4096 Oct 28 22:07 . drwxr-xr-x 4 root root 4096 Oct 28 04:00 .. -rw------- 1 ubuntu ubuntu 3154 Oct 28 22:12 .bash_history -rw-r--r-- 1 ubuntu ubuntu 220 Feb 25 2020 .bash_logout -rw-r--r-- 1 ubuntu ubuntu 3771 Feb 25 2020 .bashrc -rw-r--r-- 1 ubuntu ubuntu 807 Feb 25 2020 .profile -rw-r--r-- 1 ubuntu ubuntu 0 Oct 16 16:11 .sudo_as_admin_successful -rw------- 1 ubuntu ubuntu 1144 Oct 28 00:43 .viminfo u1001@f2-vm:/$ ls -al /mnt/ total 28 drwxr-xr-x 2 u1001 u1001 4096 Oct 28 22:07 . drwxr-xr-x 29 root root 4096 Oct 28 22:01 .. -rw------- 1 u1001 u1001 3154 Oct 28 22:12 .bash_history -rw-r--r-- 1 u1001 u1001 220 Feb 25 2020 .bash_logout -rw-r--r-- 1 u1001 u1001 3771 Feb 25 2020 .bashrc -rw-r--r-- 1 u1001 u1001 807 Feb 25 2020 .profile -rw-r--r-- 1 u1001 u1001 0 Oct 16 16:11 .sudo_as_admin_successful -rw------- 1 u1001 u1001 1144 Oct 28 00:43 .viminfo u1001@f2-vm:/$ touch /mnt/my-file u1001@f2-vm:/$ setfacl -m u:1001:rwx /mnt/my-file u1001@f2-vm:/$ sudo setcap -n 1001 cap_net_raw+ep /mnt/my-file u1001@f2-vm:/$ ls -al /mnt/my-file -rw-rwxr--+ 1 u1001 u1001 0 Oct 28 22:14 /mnt/my-file u1001@f2-vm:/$ ls -al /home/ubuntu/my-file -rw-rwxr--+ 1 ubuntu ubuntu 0 Oct 28 22:14 /home/ubuntu/my-file u1001@f2-vm:/$ getfacl /mnt/my-file getfacl: Removing leading '/' from absolute path names # file: mnt/my-file # owner: u1001 # group: u1001 user::rw- user:u1001:rwx group::rw- mask::rwx other::r-- u1001@f2-vm:/$ getfacl /home/ubuntu/my-file getfacl: Removing leading '/' from absolute path names # file: home/ubuntu/my-file # owner: ubuntu # group: ubuntu user::rw- user:ubuntu:rwx group::rw- mask::rwx other::r--" * tag 'idmapped-mounts-v5.12' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux: (41 commits) xfs: remove the possibly unused mp variable in xfs_file_compat_ioctl xfs: support idmapped mounts ext4: support idmapped mounts fat: handle idmapped mounts tests: add mount_setattr() selftests fs: introduce MOUNT_ATTR_IDMAP fs: add mount_setattr() fs: add attr_flags_to_mnt_flags helper fs: split out functions to hold writers namespace: only take read lock in do_reconfigure_mnt() mount: make {lock,unlock}_mount_hash() static namespace: take lock_mount_hash() directly when changing flags nfs: do not export idmapped mounts overlayfs: do not mount on top of idmapped mounts ecryptfs: do not mount on top of idmapped mounts ima: handle idmapped mounts apparmor: handle idmapped mounts fs: make helpers idmap mount aware exec: handle idmapped mounts would_dump: handle idmapped mounts ...
2021-02-13ubifs: Fix error return code in alloc_wbufs()Wang ShaoBo1-1/+3
Fix to return PTR_ERR() error code from the error handling case instead fo 0 in function alloc_wbufs(), as done elsewhere in this function. Fixes: 6a98bc4614de ("ubifs: Add authentication nodes to journal") Signed-off-by: Wang ShaoBo <bobo.shaobowang@huawei.com> Reviewed-by: Sascha Hauer <s.hauer@pengutronix.de> Signed-off-by: Richard Weinberger <richard@nod.at>
2021-02-12ubifs: Fix off-by-one errorSascha Hauer2-2/+2
An inode is allowed to have ubifs_xattr_max_cnt() xattrs, so we must complain only when an inode has more xattrs, having exactly ubifs_xattr_max_cnt() xattrs is fine. With this the maximum number of xattrs can be created without hitting the "has too many xattrs" warning when removing it. Signed-off-by: Sascha Hauer <s.hauer@pengutronix.de> Signed-off-by: Richard Weinberger <richard@nod.at>
2021-02-12ubifs: replay: Fix high stack usage, againArnd Bergmann1-1/+3
An earlier commit moved out some functions to not be inlined by gcc, but after some other rework to remove one of those, clang started inlining the other one and ran into the same problem as gcc did before: fs/ubifs/replay.c:1174:5: error: stack frame size of 1152 bytes in function 'ubifs_replay_journal' [-Werror,-Wframe-larger-than=] Mark the function as noinline_for_stack to ensure it doesn't happen again. Fixes: f80df3851246 ("ubifs: use crypto_shash_tfm_digest()") Fixes: eb66eff6636d ("ubifs: replay: Fix high stack usage") Signed-off-by: Arnd Bergmann <arnd@arndb.de> Reviewed-by: Nathan Chancellor <natechancellor@gmail.com> Signed-off-by: Richard Weinberger <richard@nod.at>
2021-02-12ubifs: Fix memleak in ubifs_init_authenticationDinghao Liu1-1/+1
When crypto_shash_digestsize() fails, c->hmac_tfm has not been freed before returning, which leads to memleak. Fixes: 49525e5eecca5 ("ubifs: Add helper functions for authentication support") Signed-off-by: Dinghao Liu <dinghao.liu@zju.edu.cn> Reviewed-by: Zhihao Cheng <chengzhihao1@huawei.com> Signed-off-by: Richard Weinberger <richard@nod.at>
2021-01-24fs: make helpers idmap mount awareChristian Brauner3-15/+19
Extend some inode methods with an additional user namespace argument. A filesystem that is aware of idmapped mounts will receive the user namespace the mount has been marked with. This can be used for additional permission checking and also to enable filesystems to translate between uids and gids if they need to. We have implemented all relevant helpers in earlier patches. As requested we simply extend the exisiting inode method instead of introducing new ones. This is a little more code churn but it's mostly mechanical and doesnt't leave us with additional inode methods. Link: https://lore.kernel.org/r/20210121131959.646623-25-christian.brauner@ubuntu.com Cc: Christoph Hellwig <hch@lst.de> Cc: David Howells <dhowells@redhat.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: linux-fsdevel@vger.kernel.org Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
2021-01-24stat: handle idmapped mountsChristian Brauner1-1/+1
The generic_fillattr() helper fills in the basic attributes associated with an inode. Enable it to handle idmapped mounts. If the inode is accessed through an idmapped mount map it into the mount's user namespace before we store the uid and gid. If the initial user namespace is passed nothing changes so non-idmapped mounts will see identical behavior as before. Link: https://lore.kernel.org/r/20210121131959.646623-12-christian.brauner@ubuntu.com Cc: Christoph Hellwig <hch@lst.de> Cc: David Howells <dhowells@redhat.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: linux-fsdevel@vger.kernel.org Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: James Morris <jamorris@linux.microsoft.com> Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
2021-01-24acl: handle idmapped mountsChristian Brauner1-0/+1
The posix acl permission checking helpers determine whether a caller is privileged over an inode according to the acls associated with the inode. Add helpers that make it possible to handle acls on idmapped mounts. The vfs and the filesystems targeted by this first iteration make use of posix_acl_fix_xattr_from_user() and posix_acl_fix_xattr_to_user() to translate basic posix access and default permissions such as the ACL_USER and ACL_GROUP type according to the initial user namespace (or the superblock's user namespace) to and from the caller's current user namespace. Adapt these two helpers to handle idmapped mounts whereby we either map from or into the mount's user namespace depending on in which direction we're translating. Similarly, cap_convert_nscap() is used by the vfs to translate user namespace and non-user namespace aware filesystem capabilities from the superblock's user namespace to the caller's user namespace. Enable it to handle idmapped mounts by accounting for the mount's user namespace. In addition the fileystems targeted in the first iteration of this patch series make use of the posix_acl_chmod() and, posix_acl_update_mode() helpers. Both helpers perform permission checks on the target inode. Let them handle idmapped mounts. These two helpers are called when posix acls are set by the respective filesystems to handle this case we extend the ->set() method to take an additional user namespace argument to pass the mount's user namespace down. Link: https://lore.kernel.org/r/20210121131959.646623-9-christian.brauner@ubuntu.com Cc: Christoph Hellwig <hch@lst.de> Cc: David Howells <dhowells@redhat.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: linux-fsdevel@vger.kernel.org Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>