summaryrefslogtreecommitdiffstats
path: root/fs/ext4/inode.c
AgeCommit message (Collapse)AuthorFilesLines
2013-03-21Merge tag 'ext4_for_linue' of ↵Linus Torvalds1-8/+174
git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4 Pull ext4 fixes from Ted Ts'o: "Fix a number of regression and other bugs in ext4, most of which were relatively obscure cornercases or races that were found using regression tests." * tag 'ext4_for_linue' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (21 commits) ext4: fix data=journal fast mount/umount hang ext4: fix ext4_evict_inode() racing against workqueue processing code ext4: fix memory leakage in mext_check_coverage ext4: use s_extent_max_zeroout_kb value as number of kb ext4: use atomic64_t for the per-flexbg free_clusters count jbd2: fix use after free in jbd2_journal_dirty_metadata() ext4: reserve metadata block for every delayed write ext4: update reserved space after the 'correction' ext4: do not use yield() ext4: remove unused variable in ext4_free_blocks() ext4: fix WARN_ON from ext4_releasepage() ext4: fix the wrong number of the allocated blocks in ext4_split_extent() ext4: update extent status tree after an extent is zeroed out ext4: fix wrong m_len value after unwritten extent conversion ext4: add self-testing infrastructure to do a sanity check ext4: avoid a potential overflow in ext4_es_can_be_merged() ext4: invalidate extent status tree during extent migration ext4: remove unnecessary wait for extent conversion in ext4_fallocate() ext4: add warning to ext4_convert_unwritten_extents_endio ext4: disable merging of uninitialized extents ...
2013-03-20ext4: fix data=journal fast mount/umount hangTheodore Ts'o1-1/+2
In data=journal mode, if we unmount the file system before a transaction has a chance to complete, when the journal inode is being evicted, we can end up calling into jbd2_log_wait_commit() for the last transaction, after the journalling machinery has been shut down. Arguably we should adjust ext4_should_journal_data() to return FALSE for the journal inode, but the only place it matters is ext4_evict_inode(), and so to save a bit of CPU time, and to make the patch much more obviously correct by inspection(tm), we'll fix it by explicitly not trying to waiting for a journal commit when we are evicting the journal inode, since it's guaranteed to never succeed in this case. This can be easily replicated via: mount -t ext4 -o data=journal /dev/vdb /vdb ; umount /vdb ------------[ cut here ]------------ WARNING: at /usr/projects/linux/ext4/fs/jbd2/journal.c:542 __jbd2_log_start_commit+0xba/0xcd() Hardware name: Bochs JBD2: bad log_start_commit: 3005630206 3005630206 0 0 Modules linked in: Pid: 2909, comm: umount Not tainted 3.8.0-rc3 #1020 Call Trace: [<c015c0ef>] warn_slowpath_common+0x68/0x7d [<c02b7e7d>] ? __jbd2_log_start_commit+0xba/0xcd [<c015c177>] warn_slowpath_fmt+0x2b/0x2f [<c02b7e7d>] __jbd2_log_start_commit+0xba/0xcd [<c02b8075>] jbd2_log_start_commit+0x24/0x34 [<c0279ed5>] ext4_evict_inode+0x71/0x2e3 [<c021f0ec>] evict+0x94/0x135 [<c021f9aa>] iput+0x10a/0x110 [<c02b7836>] jbd2_journal_destroy+0x190/0x1ce [<c0175284>] ? bit_waitqueue+0x50/0x50 [<c028d23f>] ext4_put_super+0x52/0x294 [<c020efe3>] generic_shutdown_super+0x48/0xb4 [<c020f071>] kill_block_super+0x22/0x60 [<c020f3e0>] deactivate_locked_super+0x22/0x49 [<c020f5d6>] deactivate_super+0x30/0x33 [<c0222795>] mntput_no_expire+0x107/0x10c [<c02233a7>] sys_umount+0x2cf/0x2e0 [<c02233ca>] sys_oldumount+0x12/0x14 [<c08096b8>] syscall_call+0x7/0xb ---[ end trace 6a954cc790501c1f ]--- jbd2_log_wait_commit: error: j_commit_request=-1289337090, tid=0 Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Reviewed-by: Jan Kara <jack@suse.cz> Cc: stable@vger.kernel.org
2013-03-20ext4: fix ext4_evict_inode() racing against workqueue processing codeTheodore Ts'o1-2/+2
Commit 84c17543ab56 (ext4: move work from io_end to inode) triggered a regression when running xfstest #270 when the file system is mounted with dioread_nolock. The problem is that after ext4_evict_inode() calls ext4_ioend_wait(), this guarantees that last io_end structure has been freed, but it does not guarantee that the workqueue structure, which was moved into the inode by commit 84c17543ab56, is actually finished. Once ext4_flush_completed_IO() calls ext4_free_io_end() on CPU #1, this will allow ext4_ioend_wait() to return on CPU #2, at which point the evict_inode() codepath can race against the workqueue code on CPU #1 accessing EXT4_I(inode)->i_unwritten_work to find the next item of work to do. Fix this by calling cancel_work_sync() in ext4_ioend_wait(), which will be renamed ext4_ioend_shutdown(), since it is only used by ext4_evict_inode(). Also, move the call to ext4_ioend_shutdown() until after truncate_inode_pages() and filemap_write_and_wait() are called, to make sure all dirty pages have been written back and flushed from the page cache first. BUG: unable to handle kernel NULL pointer dereference at (null) IP: [<c01dda6a>] cwq_activate_delayed_work+0x3b/0x7e *pdpt = 0000000030bc3001 *pde = 0000000000000000 Oops: 0000 [#1] SMP DEBUG_PAGEALLOC Modules linked in: Pid: 6, comm: kworker/u:0 Not tainted 3.8.0-rc3-00013-g84c1754-dirty #91 Bochs Bochs EIP: 0060:[<c01dda6a>] EFLAGS: 00010046 CPU: 0 EIP is at cwq_activate_delayed_work+0x3b/0x7e EAX: 00000000 EBX: 00000000 ECX: f505fe54 EDX: 00000000 ESI: ed5b697c EDI: 00000006 EBP: f64b7e8c ESP: f64b7e84 DS: 007b ES: 007b FS: 00d8 GS: 0000 SS: 0068 CR0: 8005003b CR2: 00000000 CR3: 30bc2000 CR4: 000006f0 DR0: 00000000 DR1: 00000000 DR2: 00000000 DR3: 00000000 DR6: ffff0ff0 DR7: 00000400 Process kworker/u:0 (pid: 6, ti=f64b6000 task=f64b4160 task.ti=f64b6000) Stack: f505fe00 00000006 f64b7e9c c01de3d7 f6435540 00000003 f64b7efc c01def1d f6435540 00000002 00000000 0000008a c16d0808 c040a10b c16d07d8 c16d08b0 f505fe00 c16d0780 00000000 00000000 ee153df4 c1ce4a30 c17d0e30 00000000 Call Trace: [<c01de3d7>] cwq_dec_nr_in_flight+0x71/0xfb [<c01def1d>] process_one_work+0x5d8/0x637 [<c040a10b>] ? ext4_end_bio+0x300/0x300 [<c01e3105>] worker_thread+0x249/0x3ef [<c01ea317>] kthread+0xd8/0xeb [<c01e2ebc>] ? manage_workers+0x4bb/0x4bb [<c023a370>] ? trace_hardirqs_on+0x27/0x37 [<c0f1b4b7>] ret_from_kernel_thread+0x1b/0x28 [<c01ea23f>] ? __init_kthread_worker+0x71/0x71 Code: 01 83 15 ac ff 6c c1 00 31 db 89 c6 8b 00 a8 04 74 12 89 c3 30 db 83 05 b0 ff 6c c1 01 83 15 b4 ff 6c c1 00 89 f0 e8 42 ff ff ff <8b> 13 89 f0 83 05 b8 ff 6c c1 6c c1 00 31 c9 83 EIP: [<c01dda6a>] cwq_activate_delayed_work+0x3b/0x7e SS:ESP 0068:f64b7e84 CR2: 0000000000000000 ---[ end trace a1923229da53d8a4 ]--- Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Cc: Jan Kara <jack@suse.cz>
2013-03-10ext4: reserve metadata block for every delayed writeLukas Czerner1-2/+61
Currently we only reserve space (data+metadata) in delayed allocation if we're allocating from new cluster (which is always in non-bigalloc file system) which is ok for data blocks, because we reserve the whole cluster. However we have to reserve metadata for every delayed block we're going to write because every block could potentially require metedata block when we need to grow the extent tree. Signed-off-by: Lukas Czerner <lczerner@redhat.com>
2013-03-10ext4: do not use yield()Lukas Czerner1-1/+1
Using yield() is strongly discouraged (see sched/core.c) especially since we can just use cond_resched(). Replace all use of yield() with cond_resched(). Signed-off-by: Lukas Czerner <lczerner@redhat.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-03-10ext4: fix WARN_ON from ext4_releasepage()Jan Kara1-2/+2
ext4_releasepage() warns when it is passed a page with PageChecked set. However this can correctly happen when invalidate_inode_pages2_range() invalidates pages - and we should fail the release in that case. Since the page was dirty anyway, it won't be discarded and no harm has happened but it's good to be safe. Also remove bogus page_has_buffers() check - we are guaranteed page has buffers in this function. Reported-by: Zheng Liu <gnehzuil.liu@gmail.com> Tested-by: Zheng Liu <wenqing.lz@taobao.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Signed-off-by: Jan Kara <jack@suse.cz>
2013-03-10ext4: update extent status tree after an extent is zeroed outZheng Liu1-0/+10
When we try to split an extent, this extent could be zeroed out and mark as initialized. But we don't know this in ext4_map_blocks because it only returns a length of allocated extent. Meanwhile we will mark this extent as uninitialized because we only check m_flags. This commit update extent status tree when we try to split an unwritten extent. We don't need to worry about the status of this extent because we always mark it as initialized. Signed-off-by: Zheng Liu <wenqing.lz@taobao.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Cc: Dmitry Monakhov <dmonakhov@openvz.org>
2013-03-10ext4: add self-testing infrastructure to do a sanity checkDmitry Monakhov1-0/+96
This commit adds a self-testing infrastructure like extent tree does to do a sanity check for extent status tree. After status tree is as a extent cache, we'd better to make sure that it caches right result. After applied this commit, we will get a lot of messages when we run xfstests as below. ... kernel: ES len assertation failed for inode: 230 retval 1 != map->m_len 3 in ext4_map_blocks (allocation) ... kernel: ES cache assertation failed for inode: 230 es_cached ex [974/2/4781/20] != found ex [974/1/4781/1000] ... kernel: ES insert assertation failed for inode: 635 ex_status [0/45/21388/w] != es_status [44/1/21432/u] ... Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org> Signed-off-by: Zheng Liu <wenqing.lz@taobao.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-02-28Merge tag 'writeback-fixes' of ↵Linus Torvalds1-6/+2
git://git.kernel.org/pub/scm/linux/kernel/git/wfg/linux Pull writeback fixes from Wu Fengguang: "Two writeback fixes - fix negative (setpoint - dirty) in 32bit archs - use down_read_trylock() in writeback_inodes_sb(_nr)_if_idle()" * tag 'writeback-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/wfg/linux: Negative (setpoint-dirty) in bdi_position_ratio() vfs: re-implement writeback_inodes_sb(_nr)_if_idle() and rename them
2013-02-26Merge branch 'for-linus' of ↵Linus Torvalds1-3/+3
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull vfs pile (part one) from Al Viro: "Assorted stuff - cleaning namei.c up a bit, fixing ->d_name/->d_parent locking violations, etc. The most visible changes here are death of FS_REVAL_DOT (replaced with "has ->d_weak_revalidate()") and a new helper getting from struct file to inode. Some bits of preparation to xattr method interface changes. Misc patches by various people sent this cycle *and* ocfs2 fixes from several cycles ago that should've been upstream right then. PS: the next vfs pile will be xattr stuff." * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (46 commits) saner proc_get_inode() calling conventions proc: avoid extra pde_put() in proc_fill_super() fs: change return values from -EACCES to -EPERM fs/exec.c: make bprm_mm_init() static ocfs2/dlm: use GFP_ATOMIC inside a spin_lock ocfs2: fix possible use-after-free with AIO ocfs2: Fix oops in ocfs2_fast_symlink_readpage() code path get_empty_filp()/alloc_file() leave both ->f_pos and ->f_version zero target: writev() on single-element vector is pointless export kernel_write(), convert open-coded instances fs: encode_fh: return FILEID_INVALID if invalid fid_type kill f_vfsmnt vfs: kill FS_REVAL_DOT by adding a d_weak_revalidate dentry op nfsd: handle vfs_getattr errors in acl protocol switch vfs_getattr() to struct path default SET_PERSONALITY() in linux/elf.h ceph: prepopulate inodes only when request is aborted d_hash_and_lookup(): export, switch open-coded instances 9p: switch v9fs_set_create_acl() to inode+fid, do it before d_instantiate() 9p: split dropping the acls from v9fs_set_create_acl() ...
2013-02-26Merge tag 'ext4_for_linus' of ↵Linus Torvalds1-367/+297
git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4 Pull ext4 updates from Theodore Ts'o: "The one new feature added in this patch series is the ability to use the "punch hole" functionality for inodes that are not using extent maps. In the bug fix category, we fixed some races in the AIO and fstrim code, and some potential NULL pointer dereferences and memory leaks in error handling code paths. In the optimization category, we fixed a performance regression in the jbd2 layer introduced by commit d9b01934d56a ("jbd: fix fsync() tid wraparound bug", introduced in v3.0) which shows up in the AIM7 benchmark. We also further optimized jbd2 by minimize the amount of time that transaction handles are held active. This patch series also features some additional enhancement of the extent status tree, which is now used to cache extent information in a more efficient/compact form than what we use on-disk." * tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (65 commits) ext4: fix free clusters calculation in bigalloc filesystem ext4: no need to remove extent if len is 0 in ext4_es_remove_extent() ext4: fix xattr block allocation/release with bigalloc ext4: reclaim extents from extent status tree ext4: adjust some functions for reclaiming extents from extent status tree ext4: remove single extent cache ext4: lookup block mapping in extent status tree ext4: track all extent status in extent status tree ext4: let ext4_ext_map_blocks return EXT4_MAP_UNWRITTEN flag ext4: rename and improbe ext4_es_find_extent() ext4: add physical block and status member into extent status tree ext4: refine extent status tree ext4: use ERR_PTR() abstraction for ext4_append() ext4: refactor code to read directory blocks into ext4_read_dirblock() ext4: add debugging context for warning in ext4_da_update_reserve_space() ext4: use KERN_WARNING for warning messages jbd2: use module parameters instead of debugfs for jbd_debug ext4: use module parameters instead of debugfs for mballoc_debug ext4: start handle at the last possible moment when creating inodes ext4: fix the number of credits needed for acl ops with inline data ...
2013-02-22new helper: file_inode(file)Al Viro1-3/+3
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-02-21mm: only enforce stable page writes if the backing device requires itDarrick J. Wong1-1/+1
Create a helper function to check if a backing device requires stable page writes and, if so, performs the necessary wait. Then, make it so that all points in the memory manager that handle making pages writable use the helper function. This should provide stable page write support to most filesystems, while eliminating unnecessary waiting for devices that don't require the feature. Before this patchset, all filesystems would block, regardless of whether or not it was necessary. ext3 would wait, but still generate occasional checksum errors. The network filesystems were left to do their own thing, so they'd wait too. After this patchset, all the disk filesystems except ext3 and btrfs will wait only if the hardware requires it. ext3 (if necessary) snapshots pages instead of blocking, and btrfs provides its own bdi so the mm will never wait. Network filesystems haven't been touched, so either they provide their own stable page guarantees or they don't block at all. The blocking behavior is back to what it was before 3.0 if you don't have a disk requiring stable page writes. Here's the result of using dbench to test latency on ext2: 3.8.0-rc3: Operation Count AvgLat MaxLat ---------------------------------------- WriteX 109347 0.028 59.817 ReadX 347180 0.004 3.391 Flush 15514 29.828 287.283 Throughput 57.429 MB/sec 4 clients 4 procs max_latency=287.290 ms 3.8.0-rc3 + patches: WriteX 105556 0.029 4.273 ReadX 335004 0.005 4.112 Flush 14982 30.540 298.634 Throughput 55.4496 MB/sec 4 clients 4 procs max_latency=298.650 ms As you can see, the maximum write latency drops considerably with this patch enabled. The other filesystems (ext3/ext4/xfs/btrfs) behave similarly, but see the cover letter for those results. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Acked-by: Steven Whitehouse <swhiteho@redhat.com> Reviewed-by: Jan Kara <jack@suse.cz> Cc: Adrian Hunter <adrian.hunter@intel.com> Cc: Andy Lutomirski <luto@amacapital.net> Cc: Artem Bityutskiy <dedekind1@gmail.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Mark Fasheh <mfasheh@suse.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Eric Van Hensbergen <ericvh@gmail.com> Cc: Ron Minnich <rminnich@sandia.gov> Cc: Latchesar Ionkov <lucho@ionkov.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-02-18ext4: lookup block mapping in extent status treeZheng Liu1-2/+64
After tracking all extent status, we already have a extent cache in memory. Every time we want to lookup a block mapping, we can first try to lookup it in extent status tree to avoid a potential disk I/O. A new function called ext4_es_lookup_extent is defined to finish this work. When we try to lookup a block mapping, we always call ext4_map_blocks and/or ext4_da_map_blocks. So in these functions we first try to lookup a block mapping in extent status tree. A new flag EXT4_GET_BLOCKS_NO_PUT_HOLE is used in ext4_da_map_blocks in order not to put a hole into extent status tree because this hole will be converted to delayed extent in the tree immediately. Signed-off-by: Zheng Liu <wenqing.lz@taobao.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Cc: Jan kara <jack@suse.cz>
2013-02-18ext4: track all extent status in extent status treeZheng Liu1-25/+51
By recording the phycisal block and status, extent status tree is able to track the status of every extents. When we call _map_blocks functions to lookup an extent or create a new written/unwritten/delayed extent, this extent will be inserted into extent status tree. We don't load all extents from disk in alloc_inode() because it costs too much memory, and if a file is opened and closed frequently it will takes too much time to load all extent information. So currently when we create/lookup an extent, this extent will be inserted into extent status tree. Hence, the extent status tree may not comprehensively contain all of the extents found in the file. Here a condition we need to take care is that an extent might contains unwritten and delayed status simultaneously because an extent is delayed allocated and could be allocated by fallocate. At this time we need to keep delayed status because later we need to update delayed reservation space using it. Signed-off-by: Zheng Liu <wenqing.lz@taobao.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Cc: Jan kara <jack@suse.cz>
2013-02-18ext4: let ext4_ext_map_blocks return EXT4_MAP_UNWRITTEN flagZheng Liu1-9/+3
This commit lets ext4_ext_map_blocks return EXT4_MAP_UNWRITTEN flag because in later commit ext4_map_blocks needs to use this flag to determine the extent status. Signed-off-by: Zheng Liu <wenqing.lz@taobao.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Reviewed-by: Jan Kara <jack@suse.cz>
2013-02-18ext4: add physical block and status member into extent status treeZheng Liu1-1/+2
This commit adds two members in extent_status structure to let it record physical block and extent status. Here es_pblk is used to record both of them because physical block only has 48 bits. So extent status could be stashed into it so that we can save some memory. Now written, unwritten, delayed and hole are defined as status. Due to new member is added into extent status tree, all interfaces need to be adjusted. Signed-off-by: Zheng Liu <wenqing.lz@taobao.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Reviewed-by: Jan Kara <jack@suse.cz>
2013-02-15ext4: use ERR_PTR() abstraction for ext4_append()Theodore Ts'o1-0/+2
Use ERR_PTR()/IS_ERR() abstraction instead of passing in a separate pointer to an integer for the error code, as a code cleanup. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-02-14ext4: add debugging context for warning in ext4_da_update_reserve_space()Theodore Ts'o1-5/+7
Print some additional debugging context to hopefully help to debug a warning which is getting triggered by xfstests #74. Also remove extraneous newlines from when printk's were converted to ext4_warning() and ext4_msg(). Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-02-14ext4: use KERN_WARNING for warning messagesTheodore Ts'o1-3/+3
Some messages printed related to a WARN_ON(1) were printed using KERN_NOTICE. Use KERN_WARNING or ext4_warning() instead so that context related to the WARN_ON() is printed at the same printk warning level (and log files, etc.) Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-02-09ext4: grab page before starting transaction handle in write_begin()Theodore Ts'o1-43/+68
The grab_cache_page_write_begin() function can potentially sleep for a long time, since it may need to do memory allocation which can block if the system is under significant memory pressure, and because it may be blocked on page writeback. If it does take a long time to grab the page, it's better that we not hold an active jbd2 handle. So grab a handle on the page first, and _then_ start the transaction handle. This commit fixes the following long transaction handle hold time: postmark-2917 [000] .... 196.435786: jbd2_handle_stats: dev 254,32 tid 570 type 2 line_no 2541 interval 311 sync 0 requested_blocks 1 dirtied_blocks 0 Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Reviewed-by: Jan Kara <jack@suse.cz>
2013-02-08ext4: pass context information to jbd2__journal_start()Theodore Ts'o1-13/+20
So we can better understand what bits of ext4 are responsible for long-running jbd2 handles, use jbd2__journal_start() so we can pass context information for logging purposes. The recommended way for finding the longer-running handles is: T=/sys/kernel/debug/tracing EVENT=$T/events/jbd2/jbd2_handle_stats echo "interval > 5" > $EVENT/filter echo 1 > $EVENT/enable ./run-my-fs-benchmark cat $T/trace > /tmp/problem-handles This will list handles that were active for longer than 20ms. Having longer-running handles is bad, because a commit started at the wrong time could stall for those 20+ milliseconds, which could delay an fsync() or an O_SYNC operation. Here is an example line from the trace file describing a handle which lived on for 311 jiffies, or over 1.2 seconds: postmark-2917 [000] .... 196.435786: jbd2_handle_stats: dev 254,32 tid 570 type 2 line_no 2541 interval 311 sync 0 requested_blocks 1 dirtied_blocks 0 Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-01-29ext4: fix possible use-after-free with AIOJan Kara1-1/+1
Running AIO is pinning inode in memory using file reference. Once AIO is completed using aio_complete(), file reference is put and inode can be freed from memory. So we have to be sure that calling aio_complete() is the last thing we do with the inode. CC: stable@vger.kernel.org Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com> Acked-by: Jeff Moyer <jmoyer@redhat.com> Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-01-28ext4: fix ext4_writepage() to achieve data=ordered guaranteesJan Kara1-12/+18
So far ext4_writepage() skipped writing pages that had any delayed or unwritten buffers attached. When blocksize < pagesize this breaks data=ordered mode guarantees as we can have a page with one freshly allocated buffer whose allocation is part of the committing transaction and another buffer in the page which is delayed or unwritten. So fix this problem by calling ext4_bio_writepage() anyway. It will submit mapped buffers and leave others alone. Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-01-28ext4: simplify mpage_add_bh_to_extent()Jan Kara1-20/+9
The argument b_size of mpage_add_bh_to_extent() was bogus since it was always == blocksize (which we can easily derive from inode->i_blkbits). Also second branch of condition: if (nrblocks >= EXT4_MAX_TRANS_DATA) { } else if ((nrblocks + (b_size >> mpd->inode->i_blkbits)) > EXT4_MAX_TRANS_DATA) { } was never taken because (b_size >> mpd->inode->i_blkbits) == 1. Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-01-28ext4: dirty page has always buffers attachedJan Kara1-109/+38
ext4_writepage(), write_cache_pages_da(), and mpage_da_submit_io() doesn't have to deal with the case when page doesn't have buffers. We attach buffers to a page in ->write_begin() and ->page_mkwrite() which covers all places where a page can become dirty. Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-01-28ext4: remove __ext4_journalled_writepage() from mpage_da_submit_io()Jan Kara1-11/+2
We don't support delayed allocation in data=journal mode. So checking for it in mpage_da_submit_io() doesn't make really sence. If we ever decide to extend delayed allocation support to data=journal mode, adding __ext4_journalled_writepage() call will be the least of problems we have to solve. Most likely we'd have to implement separate writepages call anyways because we don't have transaction credits for writing more than a single page so mapping of page buffers would have to be done differently. Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-01-28ext4: Always use ext4_bio_write_page() for writeoutJan Kara1-115/+14
Currently we sometimes used block_write_full_page() and sometimes ext4_bio_write_page() for writeback (depending on mount options and call path). Let's always use ext4_bio_write_page() to simplify things a bit. Reviewed-by: Zheng Liu <wenqing.lz@taobao.com> Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-01-28ext4: add punching hole support for non-extent-mapped filesZheng Liu1-4/+2
This patch add supports for indirect file support punching hole. It is almost the same as ext4_ext_punch_hole. First, we invalidate all pages between this hole, and then we try to deallocate all blocks of this hole. A recursive function is used to handle deallocation of blocks. In this function, it iterates over the entries in inode's i_blocks or indirect blocks, and try to free the block for each one of them. After applying this patch, xfstest #255 will not pass w/o extent because indirect-based file doesn't support unwritten extents. Signed-off-by: Zheng Liu <wenqing.lz@taobao.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-01-16ext4: add tracepoint in punching holeZheng Liu1-0/+2
This patch adds a tracepoint in ext4_punch_hole. CC: Lukas Czerner <lczerner@redhat.com> Signed-off-by: Zheng Liu <wenqing.lz@taobao.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-01-12ext4: use unlikely to improve the efficiency of the kernelWang Shilong1-3/+3
Because the function 'sb_getblk' seldomly fails to return NULL value,it will be better to use 'unlikely' to optimize it. Signed-off-by: Wang Shilong <wangsl-fnst@cn.fujitsu.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-01-12ext4: return ENOMEM if sb_getblk() failsTheodore Ts'o1-6/+3
The only reason for sb_getblk() failing is if it can't allocate the buffer_head. So ENOMEM is more appropriate than EIO. In addition, make sure that the file system is marked as being inconsistent if sb_getblk() fails. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Cc: stable@vger.kernel.org
2013-01-12vfs: re-implement writeback_inodes_sb(_nr)_if_idle() and rename themMiao Xie1-6/+2
writeback_inodes_sb(_nr)_if_idle() is re-implemented by replacing down_read() with down_read_trylock() because - If ->s_umount is write locked, then the sb is not idle. That is writeback_inodes_sb(_nr)_if_idle() needn't wait for the lock. - writeback_inodes_sb(_nr)_if_idle() grabs s_umount lock when it want to start writeback, it may bring us deadlock problem when doing umount. In order to fix the problem, ext4 and btrfs implemented their own writeback functions instead of writeback_inodes_sb(_nr)_if_idle(), but it introduced the redundant code, it is better to implement a new writeback_inodes_sb(_nr)_if_idle(). The name of these two functions is cumbersome, so rename them to try_to_writeback_inodes_sb(_nr). This idea came from Christoph Hellwig. Some code is from the patch of Kamal Mostafa. Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Fengguang Wu <fengguang.wu@intel.com>
2012-12-25ext4: fix deadlock in journal_unmap_buffer()Jan Kara1-11/+71
We cannot wait for transaction commit in journal_unmap_buffer() because we hold page lock which ranks below transaction start. We solve the issue by bailing out of journal_unmap_buffer() and jbd2_journal_invalidatepage() with -EBUSY. Caller is then responsible for waiting for transaction commit to finish and try invalidation again. Since the issue can happen only for page stradding i_size, it is simple enough to manually call jbd2_journal_invalidatepage() for such page from ext4_setattr(), check the return value and wait if necessary. Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2012-12-25ext4: split off ext4_journalled_invalidatepage()Jan Kara1-7/+16
In data=journal mode we don't need delalloc or DIO handling in invalidatepage and similarly in other modes we don't need the journal handling. So split invalidatepage implementations. Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2012-12-10ext4: let ext4_truncate handle inline data correctlyTao Ma1-0/+8
Signed-off-by: Robin Dong <sanbai@taobao.com> Signed-off-by: Tao Ma <boyu.mt@taobao.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2012-12-10ext4: add delalloc support for inline dataTao Ma1-9/+54
For delayed allocation mode, we write to inline data if the file is small enough. And in case of we write to some offset larger than the inline size, the 1st page is dirtied, so that ext4_da_writepages can handle the conversion. When the 1st page is initialized with blocks, the inline part is removed. Signed-off-by: Tao Ma <boyu.mt@taobao.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2012-12-10ext4: add journalled write support for inline dataTao Ma1-20/+49
Signed-off-by: Tao Ma <boyu.mt@taobao.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2012-12-10ext4: add normal write support for inline dataTao Ma1-41/+62
For a normal write case (not journalled write, not delayed allocation), we write to the inline if the file is small and convert it to an extent based file when the write is larger than the max inline size. Signed-off-by: Tao Ma <boyu.mt@taobao.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2012-12-10ext4: add read support for inline dataTao Ma1-1/+30
Let readpage and readpages handle the case when we want to read an inlined file. Signed-off-by: Tao Ma <boyu.mt@taobao.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2012-12-10ext4: add the basic function for inline data supportTao Ma1-1/+4
Implement inline data with xattr. Now we use "system.data" to store xattr, and the xattr will be extended if the i_size is increased while we don't release the space during truncate. Signed-off-by: Tao Ma <boyu.mt@taobao.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2012-12-02ext4: move extra inode read to a new functionTao Ma1-5/+11
Currently, in ext4_iget we do a simple check to see whether there does exist some information starting from the end of i_extra_size. With inline data added, this procedure is more complicated. So move it to a new function named ext4_iget_extra_inode. Signed-off-by: Tao Ma <boyu.mt@taobao.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2012-11-29ext4: restructure ext4_ext_direct_IO()Theodore Ts'o1-108/+103
Remove a level of indentation by moving the DIO read and extending write case to the beginning of the file. This results in no actual programmatic changes to the file, but makes it easier to read/understand. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2012-11-15ext4: remove calls to ext4_jbd2_file_inode() from delalloc write pathTheodore Ts'o1-19/+1
The calls to ext4_jbd2_file_inode() are needed to guarantee that we do not expose stale data in the data=ordered mode. However, they are not necessary because in all of the cases where we have newly allocated blocks in the delayed allocation write path, we immediately submit the dirty pages for I/O. Hence, we can avoid the overhead of adding the inode to the list of inodes whose data pages will be to be flushed out to disk completely during the next commit operation. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2012-11-14ext4: init pagevec in ext4_da_block_invalidatepagesEric Sandeen1-0/+1
ext4_da_block_invalidatepages is missing a pagevec_init(), which means that pvec->cold contains random garbage. This affects whether the page goes to the front or back of the LRU when ->cold makes it to free_hot_cold_page() Reviewed-by: Lukas Czerner <lczerner@redhat.com> Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com> Signed-off-by: Eric Sandeen <sandeen@redhat.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Cc: stable@vger.kernel.org
2012-11-08ext4: reimplement ext4_find_delay_alloc_range on extent status treeZheng Liu1-52/+1
Signed-off-by: Yongqiang Yang <xiaoqiangnk@gmail.com> Signed-off-by: Allison Henderson <achender@linux.vnet.ibm.com> Signed-off-by: Zheng Liu <wenqing.lz@taobao.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2012-11-08ext4: let ext4 maintain extent status treeZheng Liu1-3/+35
This patch lets ext4 maintain extent status tree. Currently it only tracks delay extent status in extent status tree. When a delay allocation is issued, the related delay extent will be inserted into extent status tree. When a delay extent is written out or invalidated, it will be removed from this tree. Signed-off-by: Yongqiang Yang <xiaoqiangnk@gmail.com> Signed-off-by: Allison Henderson <achender@linux.vnet.ibm.com> Signed-off-by: Zheng Liu <wenqing.lz@taobao.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2012-11-08ext4: remove code duplication in ext4_get_block_write_nolock()Anatol Pomozov1-39/+24
729f52c6be51013 introduced function ext4_get_block_write_nolock() that is very similar to _ext4_get_block(). Eliminate code duplication by passing different flags to _ext4_get_block() Tested: xfs tests Reviewed-by: Zheng Liu <wenqing.lz@taobao.com> Signed-off-by: Anatol Pomozov <anatol.pomozov@gmail.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2012-10-08Merge tag 'ext4_for_linus' of ↵Linus Torvalds1-41/+42
git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4 Pull ext4 updates from Ted Ts'o: "The big new feature added this time is supporting online resizing using the meta_bg feature. This allows us to resize file systems which are greater than 16TB. In addition, the speed of online resizing has been improved in general. We also fix a number of races, some of which could lead to deadlocks, in ext4's Asynchronous I/O and online defrag support, thanks to good work by Dmitry Monakhov. There are also a large number of more minor bug fixes and cleanups from a number of other ext4 contributors, quite of few of which have submitted fixes for the first time." * tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (69 commits) ext4: fix ext4_flush_completed_IO wait semantics ext4: fix mtime update in nodelalloc mode ext4: fix ext_remove_space for punch_hole case ext4: punch_hole should wait for DIO writers ext4: serialize truncate with owerwrite DIO workers ext4: endless truncate due to nonlocked dio readers ext4: serialize unlocked dio reads with truncate ext4: serialize dio nonlocked reads with defrag workers ext4: completed_io locking cleanup ext4: fix unwritten counter leakage ext4: give i_aiodio_unwritten a more appropriate name ext4: ext4_inode_info diet ext4: convert to use leXX_add_cpu() ext4: ext4_bread usage audit fs: reserve fallocate flag codepoint ext4: remove redundant offset check in mext_check_arguments() ext4: don't clear orphan list on ro mount with errors jbd2: fix assertion failure in commit code due to lacking transaction credits ext4: release donor reference when EXT4_IOC_MOVE_EXT ioctl fails ext4: enable FITRIM ioctl on bigalloc file system ...
2012-10-01Merge branch 'for-linus' of ↵Linus Torvalds1-7/+7
git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial Pull the trivial tree from Jiri Kosina: "Tiny usual fixes all over the place" * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial: (34 commits) doc: fix old config name of kprobetrace fs/fs-writeback.c: cleanup riteback_sb_inodes kerneldoc btrfs: fix the commment for the action flags in delayed-ref.h btrfs: fix trivial typo for the comment of BTRFS_FREE_INO_OBJECTID vfs: fix kerneldoc for generic_fh_to_parent() treewide: fix comment/printk/variable typos ipr: fix small coding style issues doc: fix broken utf8 encoding nfs: comment fix platform/x86: fix asus_laptop.wled_type module parameter mfd: printk/comment fixes doc: getdelays.c: remember to close() socket on error in create_nl_socket() doc: aliasing-test: close fd on write error mmc: fix comment typos dma: fix comments spi: fix comment/printk typos in spi Coccinelle: fix typo in memdup_user.cocci tmiofb: missing NULL pointer checks tools: perf: Fix typo in tools/perf tools/testing: fix comment / output typos ...