|
If we fail to merge inline data into the inode page while flushing an inline
inode, we will skip invoking inode_dec_dirty_pages, which makes the dirty page
count incorrect and results in a panic in ->evict_inode. Fix it.
------------[ cut here ]------------
kernel BUG at /home/yuchao/git/devf2fs/inode.c:336!
invalid opcode: 0000 [#1] PREEMPT SMP
CPU: 3 PID: 10004 Comm: umount Tainted: G O 4.6.0-rc5+ #17
Hardware name: innotek GmbH VirtualBox/VirtualBox, BIOS VirtualBox 12/01/2006
task: f0c33000 ti: c5212000 task.ti: c5212000
EIP: 0060:[<f89aacb5>] EFLAGS: 00010202 CPU: 3
EIP is at f2fs_evict_inode+0x85/0x490 [f2fs]
EAX: 00000001 EBX: c4529ea0 ECX: 00000001 EDX: 00000000
ESI: c0131000 EDI: f89dd0a0 EBP: c5213e9c ESP: c5213e78
DS: 007b ES: 007b FS: 00d8 GS: 0033 SS: 0068
CR0: 80050033 CR2: b75878c0 CR3: 1a36a700 CR4: 000406f0
Stack:
c4529ea0 c4529ef4 c5213e8c c176d45c c4529ef4 00000000 c4529ea0 c4529fac
f89dd0a0 c5213eb0 c1204a68 c5213ed8 c452a2b4 c6680930 c5213ec0 c1204b64
c6680d44 c6680620 c5213eec c120588d ee84b000 ee84b5c0 c5214000 ee84b5e0
Call Trace:
[<c176d45c>] ? _raw_spin_unlock+0x2c/0x50
[<c1204a68>] evict+0xa8/0x170
[<c1204b64>] dispose_list+0x34/0x50
[<c120588d>] evict_inodes+0x10d/0x130
[<c11ea941>] generic_shutdown_super+0x41/0xe0
[<c1185190>] ? unregister_shrinker+0x40/0x50
[<c1185190>] ? unregister_shrinker+0x40/0x50
[<c11eac52>] kill_block_super+0x22/0x70
[<f89af23e>] kill_f2fs_super+0x1e/0x20 [f2fs]
[<c11eae1d>] deactivate_locked_super+0x3d/0x70
[<c11eb383>] deactivate_super+0x43/0x60
[<c1208ec9>] cleanup_mnt+0x39/0x80
[<c1208f50>] __cleanup_mnt+0x10/0x20
[<c107d091>] task_work_run+0x71/0x90
[<c105725a>] exit_to_usermode_loop+0x72/0x9e
[<c1001c7c>] do_fast_syscall_32+0x19c/0x1c0
[<c176dd48>] sysenter_past_esp+0x45/0x74
EIP: [<f89aacb5>] f2fs_evict_inode+0x85/0x490 [f2fs] SS:ESP 0068:c5213e78
---[ end trace d30536330b7fdc58 ]---
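The shape of the fix, as a minimal sketch (the demo_* names and the failure
path are illustrative stand-ins, not the actual f2fs code): the dirty-page
counter must be decremented whether or not the inline merge succeeds,
otherwise the leftover count trips the BUG_ON in ->evict_inode at umount time.

struct demo_inode { long dirty_pages; };

static void demo_inode_dec_dirty_pages(struct demo_inode *inode)
{
    inode->dirty_pages--;
}

static int demo_flush_inline_inode(struct demo_inode *inode, int merge_ok)
{
    int err = merge_ok ? 0 : -1;    /* pretend the inline-data merge may fail */

    /* Before the fix, this decrement was skipped when err != 0,
     * leaving dirty_pages > 0 when the inode was evicted. */
    demo_inode_dec_dirty_pages(inode);

    return err;
}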
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
|
|
This patch enables reading node blocks in advance when truncating large
data blocks.
> time rm $MNT/testfile (500GB) after drop_caches
Before : 9.422 s
After : 4.821 s
Reported-by: Stephen Bates <stephen.bates@microsemi.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
|
|
This patch removes an obsolete variable used in add_free_nid.
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
|
|
This patch injects ENOSPC failures.
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
|
|
This patch converts grab_cache_page to f2fs_grab_cache_page.
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
|
|
For foreground GC, we cache node blocks in the victim section and set them
dirty, then we call sync_node_pages to flush these node pages; but meanwhile,
node pages which do not locate in the victim section will be flushed together,
so more bandwidth and continuous free space would be occupied.
So for this condition, it's better to leave those unrelated node pages in
cache for further write hits, and let CP or VM flush them afterward.
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
|
|
In order to provide atomic writes, we should consider a power failure during
sync_node_pages in fsync.
So, this patch marks the fsync flag only in the last dnode block.
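As a rough sketch of the idea (illustrative types and names; the real code
walks the dnode pages selected by fsync): only the last dnode page written
for this fsync carries the fsync mark, so recovery after a power failure sees
a single, atomic end mark rather than a partially written set.

struct demo_page { int fsync_mark; };

static void demo_write_fsync_dnodes(struct demo_page *pages, int npages)
{
    for (int i = 0; i < npages; i++) {
        /* mark only the last dnode block of this fsync */
        pages[i].fsync_mark = (i == npages - 1);
        /* ...submit the page for write here... */
    }
}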
Acked-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
|
|
fsync_node_pages should return success or failure so that the user can know
whether fsync completed or not.
Acked-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
|
|
This patch splits the existing sync_node_pages into (f)sync_node_pages.
The fsync_node_pages is used for f2fs_sync_file only.
Acked-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
|
|
When fsync is called, sync_node_pages finds the proper direct node pages to
flush. But it also locks unrelated direct node pages unnecessarily.
Acked-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
|
|
This patch adds a BUG_ON instead of a retry loop.
In the case of node pages, we already got this inode page, but unlocked it.
Given that we don't truncate any node pages during operations, the page's
mapping should be unchangeable.
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
|
|
Previously, after trylock_page succeeds, we don't check the page's mapping.
In order to fix that, we can just pass FGP_LOCK to pagecache_get_page.
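A sketch of the resulting pattern (simplified stand-ins; in the kernel,
pagecache_get_page() with the lock flag returns the page already locked, and
the caller then verifies the mapping before using the page):

struct demo_page { void *mapping; int locked; };

static struct demo_page *demo_grab_locked_page(struct demo_page *page,
                                               void *expected_mapping)
{
    page->locked = 1;                   /* models getting the page back locked */

    if (page->mapping != expected_mapping) {
        /* page was truncated or reused while we waited for the lock */
        page->locked = 0;
        return 0;
    }
    return page;
}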
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
|
|
PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} macros were introduced a *long* time
ago with the promise that one day it would be possible to implement the page
cache with bigger chunks than PAGE_SIZE.
This promise never materialized, and it is unlikely it ever will.
We have many places where PAGE_CACHE_SIZE is assumed to be equal to
PAGE_SIZE, and it's a constant source of confusion about whether
PAGE_CACHE_* or PAGE_* constants should be used in a particular case,
especially on the border between fs and mm.
Globally switching to PAGE_CACHE_SIZE != PAGE_SIZE would cause too much
breakage to be doable.
Let's stop pretending that pages in the page cache are special. They are
not.
The changes are pretty straight-forward:
- <foo> << (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;
- <foo> >> (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;
- PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} -> PAGE_{SIZE,SHIFT,MASK,ALIGN};
- page_cache_get() -> get_page();
- page_cache_release() -> put_page();
This patch contains automated changes generated with coccinelle using the
script below. For some reason, coccinelle doesn't patch header files;
I've called spatch for them manually.
The only adjustment after coccinelle is a revert of the changes to the
PAGE_CACHE_ALIGN definition: we are going to drop it later.
There are a few places in the code that coccinelle didn't reach. I'll
fix them manually in a separate patch. Comments and documentation will
also be addressed in a separate patch.
virtual patch
@@
expression E;
@@
- E << (PAGE_CACHE_SHIFT - PAGE_SHIFT)
+ E
@@
expression E;
@@
- E >> (PAGE_CACHE_SHIFT - PAGE_SHIFT)
+ E
@@
@@
- PAGE_CACHE_SHIFT
+ PAGE_SHIFT
@@
@@
- PAGE_CACHE_SIZE
+ PAGE_SIZE
@@
@@
- PAGE_CACHE_MASK
+ PAGE_MASK
@@
expression E;
@@
- PAGE_CACHE_ALIGN(E)
+ PAGE_ALIGN(E)
@@
expression E;
@@
- page_cache_get(E)
+ get_page(E)
@@
expression E;
@@
- page_cache_release(E)
+ put_page(E)
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
If many threads call fsync with data writes, we don't need to flush every
bio containing node page writes.
f2fs_wait_on_page_writeback will flush its bios when the page is really
needed.
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
|
|
Just to avoid sparse warnings.
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
|
|
ra_node_page() is used to read ahead one node page. Compared to a regular
read, it's faster because it doesn't wait for IO completion.
But if it is called twice for the same block, and the IO request from the
first call hasn't completed before the second call, the second call will
have to wait until the read is over.
Here we use the approach of __do_page_cache_readahead() to solve this
problem: it does nothing when someone else has already put the page in the
mapping; the status of the page should be assured by whoever put it there.
This implementation also avoids altering the page reference count.
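A minimal model of the change (illustrative; the real code works on the node
mapping and submits an asynchronous read): if the block is already present in
the mapping, possibly with IO still in flight from an earlier readahead, do
nothing rather than look the page up and wait on its lock.

#include <stdbool.h>

struct demo_mapping { bool present[64]; };

static void demo_ra_node_page(struct demo_mapping *map, unsigned int idx)
{
    if (map->present[idx])
        return;                 /* someone already put it there; don't wait on it */

    map->present[idx] = true;   /* allocate the page and submit an async read */
}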
Signed-off-by: Fan li <fanofcode.li@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
|
|
When testing with fsstress, kworker and user threads were both blocked:
INFO: task kworker/u16:1:16580 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
kworker/u16:1 D ffff8803f2595390 0 16580 2 0x00000000
Workqueue: writeback bdi_writeback_workfn (flush-251:0)
ffff8802730e5760 0000000000000046 ffff880274729fc0 0000000000012440
ffff8802730e5fd8 ffff8802730e4010 0000000000012440 0000000000012440
ffff8802730e5fd8 0000000000012440 ffff880274729fc0 ffff88026eb50000
Call Trace:
[<ffffffff816fe9d9>] schedule+0x29/0x70
[<ffffffff816ff895>] rwsem_down_read_failed+0xa5/0xf9
[<ffffffff81378584>] call_rwsem_down_read_failed+0x14/0x30
[<ffffffffa0694feb>] f2fs_write_data_page+0x31b/0x420 [f2fs]
[<ffffffffa0690f1a>] __f2fs_writepage+0x1a/0x50 [f2fs]
[<ffffffffa06922a0>] f2fs_write_data_pages+0xe0/0x290 [f2fs]
[<ffffffff811473b3>] do_writepages+0x23/0x40
[<ffffffff811cc3ee>] __writeback_single_inode+0x4e/0x250
[<ffffffff811cd4f1>] writeback_sb_inodes+0x2c1/0x470
[<ffffffff811cd73e>] __writeback_inodes_wb+0x9e/0xd0
[<ffffffff811cda0b>] wb_writeback+0x1fb/0x2d0
[<ffffffff811cdb7c>] wb_do_writeback+0x9c/0x220
[<ffffffff811ce232>] bdi_writeback_workfn+0x72/0x1c0
[<ffffffff8106b74e>] process_one_work+0x1de/0x5b0
[<ffffffff8106e78f>] worker_thread+0x11f/0x3e0
[<ffffffff810750ce>] kthread+0xde/0xf0
[<ffffffff817093f8>] ret_from_fork+0x58/0x90
fsstress thread stack:
[<ffffffff81139f0e>] sleep_on_page+0xe/0x20
[<ffffffff81139ef7>] __lock_page+0x67/0x70
[<ffffffff8113b100>] find_lock_page+0x50/0x80
[<ffffffff8113b24f>] find_or_create_page+0x3f/0xb0
[<ffffffffa06983a9>] sync_node_pages+0x259/0x810 [f2fs]
[<ffffffffa068d874>] write_checkpoint+0x1a4/0xce0 [f2fs]
[<ffffffffa0686b0c>] f2fs_sync_fs+0x7c/0xd0 [f2fs]
[<ffffffffa067c813>] f2fs_sync_file+0x143/0x5f0 [f2fs]
[<ffffffff811d301b>] vfs_fsync_range+0x2b/0x40
[<ffffffff811d304c>] vfs_fsync+0x1c/0x20
[<ffffffff811d3291>] do_fsync+0x41/0x70
[<ffffffff811d32d3>] SyS_fdatasync+0x13/0x20
[<ffffffff817094a2>] system_call_fastpath+0x16/0x1b
[<ffffffffffffffff>] 0xffffffffffffffff
The reason for this issue is:
CPU0:                                  CPU1:
 - f2fs_write_data_pages
                                        - f2fs_sync_fs
                                         - write_checkpoint
                                          - block_operations
                                           - f2fs_lock_all
                                            - down_write(sbi->cp_rwsem)
  - lock_page(page)
   - f2fs_write_data_page
                                          - sync_node_pages
                                           - flush_inline_data
                                            - pagecache_get_page(page, GFP_LOCK)
    - f2fs_lock_op
     - down_read(sbi->cp_rwsem)
This patch changes flush_inline_data to use trylock_page, fixing this ABBA
deadlock issue.
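A userspace model of the avoidance (pthreads stand in for the kernel page
lock; this is a sketch of the pattern, not the f2fs code): the
flush_inline_data side takes the page lock with a trylock and simply skips
the page on failure, so it can never sleep on a page lock held by a thread
that is itself waiting on cp_rwsem.

#include <pthread.h>
#include <stdbool.h>

static pthread_mutex_t page_lock = PTHREAD_MUTEX_INITIALIZER;

static bool demo_flush_inline_data(void)
{
    if (pthread_mutex_trylock(&page_lock) != 0)
        return false;           /* page locked by the writer path; skip it */

    /* ...merge inline data while holding the page lock... */
    pthread_mutex_unlock(&page_lock);
    return true;
}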
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
|
|
1. The inode mapping tree can index pages in the range [0, ULONG_MAX];
however, in some places f2fs only searches or iterates pages in the range
[0, LONG_MAX], resulting in misses in the page cache.
2. filemap_fdatawait_range accepts range parameters in units of bytes, so
the max range it covers should be [0, LLONG_MAX]; if we use [0, LONG_MAX]
as the range for waiting on writeback, a big number of pages will not be
covered.
This patch corrects the above two issues.
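In other words (a small reminder sketch, constants only, nothing
f2fs-specific): page indexes are pgoff_t (unsigned long), so the full index
range is [0, ULONG_MAX], while byte offsets passed to
filemap_fdatawait_range() are loff_t, so the full byte range is
[0, LLONG_MAX]; using LONG_MAX for either one silently halves the covered range.

#include <limits.h>

static const unsigned long DEMO_MAX_PAGE_INDEX  = ULONG_MAX;  /* not LONG_MAX */
static const long long     DEMO_MAX_BYTE_OFFSET = LLONG_MAX;  /* not LONG_MAX */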
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
|
|
This patch enables tracing the old block address of a CoWed page for better
debugging.
f2fs_submit_page_mbio: dev = (1,0), ino = 1, page_index = 0x1d4f0, oldaddr = 0xfe8ab, newaddr = 0xfee90 rw = WRITE_SYNC, type = NODE
f2fs_submit_page_mbio: dev = (1,0), ino = 1, page_index = 0x1d4f8, oldaddr = 0xfe8b0, newaddr = 0xfee91 rw = WRITE_SYNC, type = NODE
f2fs_submit_page_mbio: dev = (1,0), ino = 1, page_index = 0x1d4fa, oldaddr = 0xfe8ae, newaddr = 0xfee92 rw = WRITE_SYNC, type = NODE
f2fs_submit_page_mbio: dev = (1,0), ino = 134824, page_index = 0x96, oldaddr = 0xf049b, newaddr = 0x2bbe rw = WRITE, type = DATA
f2fs_submit_page_mbio: dev = (1,0), ino = 134824, page_index = 0x97, oldaddr = 0xf049c, newaddr = 0x2bbf rw = WRITE, type = DATA
f2fs_submit_page_mbio: dev = (1,0), ino = 134824, page_index = 0x98, oldaddr = 0xf049d, newaddr = 0x2bc0 rw = WRITE, type = DATA
f2fs_submit_page_mbio: dev = (1,0), ino = 135260, page_index = 0x47, oldaddr = 0xffffffff, newaddr = 0xf2631 rw = WRITE, type = DATA
f2fs_submit_page_mbio: dev = (1,0), ino = 135260, page_index = 0x48, oldaddr = 0xffffffff, newaddr = 0xf2632 rw = WRITE, type = DATA
f2fs_submit_page_mbio: dev = (1,0), ino = 135260, page_index = 0x49, oldaddr = 0xffffffff, newaddr = 0xf2633 rw = WRITE, type = DATA
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
|
|
When flushing node pages, if the current node page is an inline inode page,
we will try to merge inline data from the data page into the inline inode
page and then skip flushing the current node page. This decreases the number
of nodes flushed in batch in this round, which may lead to worse performance.
This patch gives a chance to flush just-merged inline inode pages for better
performance.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
|
|
When looking up a nat entry in cache_nat_entry, if we fail to hit the nat
cache, we try to load nat entries a) from the journal of the current segment
cache or b) from NAT pages for updating. During the process, the write lock
of nat_tree_lock will be held to avoid inconsistency between the nid cache
and the nat cache caused by racing among the nat entry shrinker,
checkpointer, and nat entry updater.
But this way may be inefficient when updating the nat cache, because it
serializes accessing the journal cache or reading NAT pages.
Here, we reorder the lock and update flow as below to enhance access
concurrency:
- get_node_info
- down_read(nat_tree_lock)
- lookup nat cache --- hit -> unlock & return
- lookup journal cache --- hit -> unlock & goto update
- up_read(nat_tree_lock)
update:
- down_write(nat_tree_lock)
- cache_nat_entry
- lookup nat cache --- nohit -> update
- up_write(nat_tree_lock)
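A userspace model of that flow (a pthread rwlock stands in for
nat_tree_lock and a boolean stands in for the cache lookup): look up under
the read lock and return on a hit, take the write lock only for the update,
and re-check there because another thread may have filled the cache in between.

#include <pthread.h>
#include <stdbool.h>

static pthread_rwlock_t nat_tree_lock = PTHREAD_RWLOCK_INITIALIZER;
static bool cached;                     /* stands in for the nat/journal cache hit */

static void demo_get_node_info(void)
{
    pthread_rwlock_rdlock(&nat_tree_lock);
    if (cached) {                       /* nat cache (or journal) hit */
        pthread_rwlock_unlock(&nat_tree_lock);
        return;
    }
    pthread_rwlock_unlock(&nat_tree_lock);

    pthread_rwlock_wrlock(&nat_tree_lock);
    if (!cached)                        /* re-check: nobody raced with us */
        cached = true;                  /* load from journal/NAT page and insert */
    pthread_rwlock_unlock(&nat_tree_lock);
}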
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
|
|
In the curseg cache, f2fs caches two different parts:
- data of the current summary block, i.e. summary entries and footer info.
- journal info, i.e. sparse nat/sit entries or io stat info.
With this approach, 1) it may cause higher lock contention when we access or
update both parts of the cache, since we use the same mutex lock
curseg_mutex to protect the cache; 2) the current summary block with the
last journal info will be written back to the device as a normal summary
block when flushing, but we treat journal info as valid only in the current
summary, so most normal summary blocks contain junk journal data, which
wastes the remaining space of the summary block.
So, in order to fix the above issues, we split the curseg cache into two parts:
a) the current summary block, protected by the original mutex lock curseg_mutex
b) the journal cache, protected by a newly introduced r/w semaphore journal_rwsem
When loading the curseg cache during ->mount, we store summary info and
journal info into different caches; when doing a checkpoint, we combine the
data of the two caches into the current summary block for persisting.
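The split, as a structural sketch (field and lock names follow the
description above, not the exact source): the summary-block part keeps the
existing mutex, while the journal part gets its own rwsem, so journal readers
no longer contend with summary updates and vice versa.

#include <pthread.h>

struct demo_curseg_info {
    pthread_mutex_t curseg_mutex;       /* protects the summary block part */
    /* summary entries + footer live here */

    pthread_rwlock_t journal_rwsem;     /* protects the journal part */
    /* journal (nat/sit entries, io stat) lives here */
};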
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
|
|
Introduce a new structure f2fs_journal to wrap journal info in struct
f2fs_summary_block for readability.
struct f2fs_journal {
    union {
        __le16 n_nats;
        __le16 n_sits;
    };
    union {
        struct nat_journal nat_j;
        struct sit_journal sit_j;
        struct f2fs_extra_info info;
    };
} __packed;

struct f2fs_summary_block {
    struct f2fs_summary entries[ENTRIES_IN_SUM];
    struct f2fs_journal journal;
    struct summary_footer footer;
} __packed;
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
|
|
Fix missing skipped pages info in the f2fs_writepages trace event.
Signed-off-by: Yunlei He <heyunlei@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
|
|
f2fs uses a single bio buffer per data type (META/NODE/DATA) to cache
writes located at contiguous block addresses as much as possible; after
submitting, these writes may still be cached in the bio buffer, so we have
to flush cached writes in the bio buffer by calling f2fs_submit_merged_bio.
Unfortunately, under high concurrency, the bio buffer could be flushed by
someone else before we submit it, for the following reasons:
a) there is no space left in the bio buffer.
b) a request of a different type (SYNC, ASYNC) is added.
c) a discontinuous block address is added.
In this condition, f2fs_submit_merged_bio can be devastating, because it
could break the subsequent merging of writes in the bio buffer, splitting
one big bio into two smaller ones.
This patch introduces f2fs_submit_merged_bio_cond, which submits the bio
buffer conditionally; before submitting, it checks whether:
- a page in the DATA type bio buffer matches the specified page;
- a page in the DATA type bio buffer belongs to the specified inode;
- a page in the NODE type bio buffer belongs to the specified inode;
If there is no eligible page in the bio buffer, we skip the submitting step,
gaining more chances to merge consecutive block IOs in the bio cache.
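A simplified model of the conditional submit (stand-in types and helpers;
the real code checks the pages merged into the per-type bio buffer): only
flush the buffer if it actually contains a page we care about, i.e. the
given page or any page of the given inode.

#include <stdbool.h>

struct demo_bio_buf { const void *pages[16]; const void *inodes[16]; int nr; };

static bool demo_has_merged_page(struct demo_bio_buf *buf,
                                 const void *inode, const void *page)
{
    for (int i = 0; i < buf->nr; i++)
        if ((page && buf->pages[i] == page) ||
            (inode && buf->inodes[i] == inode))
            return true;
    return false;
}

static void demo_submit_merged_bio_cond(struct demo_bio_buf *buf,
                                        const void *inode, const void *page)
{
    if (!demo_has_merged_page(buf, inode, page))
        return;         /* nothing of ours in the buffer: keep merging */

    buf->nr = 0;        /* otherwise submit (flush) the buffered bio */
}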
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
|
|
As with f2fs_write_cache_pages, let's do the same for node and meta pages.
Especially for node blocks, we should do this before marking their fsync
and dentry flags.
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
|
|
When seeking data in ->llseek, if we encounter a big hole which covers
several dnode pages, we will try to seek data from the index of the page
which is the first page of the next dnode page; at most we could skip
searching (ADDRS_PER_BLOCK - 1) pages.
However, this is still not efficient: if the indirect/double-indirect
pointer is NULL, there is no dnode page in the tree that the indirect/
double-indirect pointer points to, so it's not necessary to search that
whole region.
This patch introduces get_next_page_offset to calculate the next page offset
based on the current searching level and the max searching level returned
from get_dnode_of_data; with this, we can skip searching the entire area
whose indirect or double-indirect node block does not exist.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
|
|
There are redundant pointer conversions in the following call stack:
- at position a, the inode is converted to f2fs_inode_info.
- at position b, the f2fs_inode_info is converted back to an inode.
- truncate_blocks(inode,..)
 - fi = F2FS_I(inode) ---a
 - ADDRS_PER_PAGE(node_page, fi)
  - addrs_per_inode(fi)
   - inode = &fi->vfs_inode ---b
   - f2fs_has_inline_xattr(inode)
    - fi = F2FS_I(inode)
     - is_inode_flag_set(fi,..)
In order to avoid the unneeded conversions, alter ADDRS_PER_PAGE and
addrs_per_inode to accept a parameter of inode pointer type.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
|
|
In write_begin, if the storage supports stable pages, we don't need to wait
for writeback to update its contents.
This patch uses wait_for_stable_page instead of wait_on_page_writeback.
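The gist of the difference, in a simplified model (stand-in types; in the
kernel, wait_for_stable_page() only waits when the backing device requires
stable page writes, while wait_on_page_writeback() always waits):

#include <stdbool.h>

struct demo_page { bool under_writeback; bool bdi_needs_stable_writes; };

static void demo_wait_on_page_writeback(struct demo_page *p)
{
    while (p->under_writeback)
        ;   /* stands in for sleeping on the writeback bit */
}

static void demo_wait_for_stable_page(struct demo_page *p)
{
    if (p->bdi_needs_stable_writes)     /* otherwise no wait is needed */
        demo_wait_on_page_writeback(p);
}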
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
|
|
The scenario is:
1. create fully node blocks
2. flush node blocks
3. write inline_data for all the node blocks again
4. flush node blocks redundantly
So, this patch tries to flush inline_data when flushing node blocks.
Reviewed-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
|
|
This patch exports a new sysfs entry 'dirty_nat_ratio' to control the
threshold of dirty nat entries; if the current ratio exceeds the configured
threshold, a checkpoint will be triggered in f2fs_balance_fs_bg to flush
dirty nats.
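A sketch of the threshold check (illustrative names and formula; the real
check reads the sysfs-configured ratio in the background balancing path):

#include <stdbool.h>

static bool demo_excess_dirty_nats(unsigned int dirty_nats,
                                   unsigned int max_nats,
                                   unsigned int dirty_nat_ratio /* percent */)
{
    return dirty_nats >= (unsigned long long)max_nats * dirty_nat_ratio / 100;
}

/* When this returns true, f2fs_balance_fs_bg would trigger a checkpoint
 * to flush the dirty nat entries. */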
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
|
|
This patch fixes a wrong decision in available_free_memory.
The return value is already set to false, so we should only consider the
true condition below.
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
|
|
Only when a node page is newly dirtied do we need to check whether f2fs_gc
is required.
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
|
|
There is duplicated code between get_node_page and get_node_page_ra.
Introduce __get_node_page to hold the common parts of these two, and
implement get_node_page and get_node_page_ra by reusing __get_node_page.
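The shape of the refactoring (signatures are illustrative, not the actual
f2fs prototypes): one internal helper carries the shared logic and the two
exported functions become thin wrappers over it.

struct demo_page;

static struct demo_page *demo__get_node_page(unsigned int nid,
                                             struct demo_page *parent, int start)
{
    /* common part: read the node page, lock it, verify nid and mapping,
     * and optionally read ahead sibling nodes when parent is given */
    return 0;
}

static struct demo_page *demo_get_node_page(unsigned int nid)
{
    return demo__get_node_page(nid, 0, 0);
}

static struct demo_page *demo_get_node_page_ra(struct demo_page *parent, int start)
{
    /* nid would be taken from the parent page at offset start */
    return demo__get_node_page(0, parent, start);
}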
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
|
|
Add a node id check in ra_node_page and get_node_page_ra, as in get_node_page.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
|
|
Original issue is fixed by:
f2fs: cover more area with nat_tree_lock
This reverts commit 24928634f81b1592e83b37dcd89ed45c28f12feb.
|
|
There was a subtle bug in nat cache management which incurred wrong nid
allocation or wrong block addresses when try_to_free_nats was triggered heavily.
This patch enlarges the previous coverage of nat_tree_lock to avoid the data race.
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
|
|
When testing ioc_shutdown, put_super can hang waiting for pages under
writeback, as follows.
INFO: task umount:2723 blocked for more than 120 seconds.
Tainted: G O 4.4.0-rc3+ #8
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
umount D ffff88000859f9d8 0 2723 2110 0x00000000
ffff88000859f9d8 0000000000000000 0000000000000000 ffffffff81e11540
ffff880078c225c0 ffff8800085a0000 ffff88007fc17440 7fffffffffffffff
ffffffff818239f0 ffff88000859fb48 ffff88000859f9f0 ffffffff8182310c
Call Trace:
[<ffffffff818239f0>] ? bit_wait+0x50/0x50
[<ffffffff8182310c>] schedule+0x3c/0x90
[<ffffffff81827fb9>] schedule_timeout+0x2d9/0x430
[<ffffffff810e0f8f>] ? mark_held_locks+0x6f/0xa0
[<ffffffff8111614d>] ? ktime_get+0x7d/0x140
[<ffffffff818239f0>] ? bit_wait+0x50/0x50
[<ffffffff8106a655>] ? kvm_clock_get_cycles+0x25/0x30
[<ffffffff8111617c>] ? ktime_get+0xac/0x140
[<ffffffff818239f0>] ? bit_wait+0x50/0x50
[<ffffffff81822564>] io_schedule_timeout+0xa4/0x110
[<ffffffff81823a25>] bit_wait_io+0x35/0x50
[<ffffffff818235bd>] __wait_on_bit+0x5d/0x90
[<ffffffff811b9e8b>] wait_on_page_bit+0xcb/0xf0
[<ffffffff810d5f90>] ? autoremove_wake_function+0x40/0x40
[<ffffffff811cf84c>] truncate_inode_pages_range+0x4bc/0x840
[<ffffffff811cfc3d>] truncate_inode_pages_final+0x4d/0x60
[<ffffffffc023ced5>] f2fs_evict_inode+0x75/0x400 [f2fs]
[<ffffffff812639bc>] evict+0xbc/0x190
[<ffffffff81263d19>] iput+0x229/0x2c0
[<ffffffffc0241885>] f2fs_put_super+0x105/0x1a0 [f2fs]
[<ffffffff8124756a>] generic_shutdown_super+0x6a/0xf0
[<ffffffff812478f7>] kill_block_super+0x27/0x70
[<ffffffffc0241290>] kill_f2fs_super+0x20/0x30 [f2fs]
[<ffffffff81247b03>] deactivate_locked_super+0x43/0x70
[<ffffffff81247f4c>] deactivate_super+0x5c/0x60
[<ffffffff81268d2f>] cleanup_mnt+0x3f/0x90
[<ffffffff81268dc2>] __cleanup_mnt+0x12/0x20
[<ffffffff810ac463>] task_work_run+0x73/0xa0
[<ffffffff810032ac>] exit_to_usermode_loop+0xcc/0xd0
[<ffffffff81003e7c>] syscall_return_slowpath+0xcc/0xe0
[<ffffffff81829ea2>] int_ret_from_sys_call+0x25/0x9f
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
|
|
Sometimes we stay silent when an IO error occurs in a lower layer device, so
the user will not receive any error return value for some operations even
though the operation actually did not succeed.
This should be avoided, so this patch reports such errors to the user.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
|
|
If get_node_page() gets a zero nid, we can return early without getting a
wrong page. For example, get_dnode_of_data() can try to do that.
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
|
|
This patch introduces recording node block allocation in dnode_of_data.
This information helps to figure out whether any node block is allocated during
specific file operations.
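As a sketch, the dnode lookup context would carry a flag like the following
(field name is illustrative), set whenever a node block gets allocated, so
callers can check it after the operation:

#include <stdbool.h>

struct demo_dnode_of_data {
    /* ...existing lookup fields... */
    bool node_changed;          /* set when a node block was allocated */
};

static void demo_alloc_node_block(struct demo_dnode_of_data *dn)
{
    /* ...allocate the new node block... */
    dn->node_changed = true;
}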
Reviewed-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
|
|
It would be better to use atomic variable for total_extent_tree.
Reviewed-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
|
|
If read_node_page returns LOCKED_PAGE, its caller had better a) skip the
unneeded 'Update' flag and mapping info verification; b) check the nid value
stored in the footer structure of the node page.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
|
|
After finishing building the free nid cache, we will try to asynchronously
read ahead 4 more pages for the next reloading; the count of readahead nid
pages is fixed.
In some cases, like on SMR drives, reading a fixed, small number of sectors
each time we trigger readahead may be inefficient, since we will face high
seek overhead, so we'd better let the user configure this parameter from
sysfs for their specific workload.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
|
|
When there is no free nid in the nid cache, all new node allocators stop
their work to wait for the reloading of free nids; however, reloading is
synchronous, as we will read 4 NAT pages to build the nid cache, which
causes long latency.
This patch tries to read ahead more NAT pages with the READA request flag
after reloading free nids. It helps to improve performance when users
allocate node ids intensively.
Env: Sandisk 32G sd card
time for i in `seq 1 60000`; { echo -n > /mnt/f2fs/$i; echo XXXXXX > /mnt/f2fs/$i;}
Before:
real 0m2.814s
user 0m1.220s
sys 0m1.536s
After:
real 0m2.711s
user 0m1.136s
sys 0m1.568s
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
|
|
Now, we use ra_meta_pages to read contiguous physical blocks as much as
possible to improve the performance of subsequent reads. However,
ra_meta_pages uses a synchronous readahead approach, submitting the bio
with READ; as READ has high priority, it cannot be used for preloading
blocks, where it is not certain when the readahead pages will be used.
This patch supports asynchronous readahead in ra_meta_pages by tagging the
bio with the READA flag in order to allow preloading.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
|
|
In the recovery or checkpoint flow, we grab pages temporarily in the meta
inode's mapping for caching temporary data. The data in these pages is not
f2fs meta data, but we still tag them with the REQ_META flag, and a lower
device like eMMC may do some optimization for data of that type.
So, in order to avoid wrong optimization, we'd better remove such a flag
for temporary non-meta pages.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
|
|
The periodic checkpoint can resolve the previous issue.
So, now we can use this again to improve the reported performance regression:
https://lkml.org/lkml/2015/10/8/20
This reverts commit 15bec0ff5a9ba6d203178fa8772259df6207942a.
|
|
Previously, we skip dentry block writes when wbc is SYNC_NONE with no
memory pressure and the number of dirty pages is pretty small.
But we didn't skip normal data writes, so this doesn't give us much impact
on overall performance.
Moreover, by skipping some data writes, kworker falls into an infinite loop
trying to write blocks, when many dir inodes have only one dentry block.
So, this patch removes the skipping of data writes.
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
|
|
This number is referenced by checkpoint under node_write lock.
Reviewed-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
|