From e861b5e9a47bd8c6a7491a2b9f6e9a230b1b8e86 Mon Sep 17 00:00:00 2001 From: Theodore Ts'o Date: Thu, 20 Feb 2014 12:54:05 -0500 Subject: ext4: avoid possible overflow in ext4_map_blocks() The ext4_map_blocks() function returns the number of blocks which satisfying the caller's request. This number of blocks requested by the caller is specified by an unsigned integer, but the return value of ext4_map_blocks() is a signed integer (to accomodate error codes per the kernel's standard error signalling convention). Historically, overflows could never happen since mballoc() will refuse to allocate more than 2048 blocks at a time (which is something we should fix), and if the blocks were already allocated, the fact that there would be some number of intervening metadata blocks pretty much guaranteed that there could never be a contiguous region of data blocks that was greater than 2**31 blocks. However, this is now possible if there is a file system which is a bit bigger than 8TB, and is created using the new mke2fs hugeblock feature, which can create a perfectly contiguous file. In that case, if a userspace program attempted to call fallocate() on this already fully allocated file, it's possible that ext4_map_blocks() could return a number large enough that it would overflow a signed integer, resulting in a ext4 thinking that the ext4_map_blocks() call had failed with some strange error code. Since ext4_map_blocks() is always free to return a smaller number of blocks than what was requested by the caller, fix this by capping the number of blocks that ext4_map_blocks() will ever try to map to 2**31 - 1. In practice this should never get hit, except by someone deliberately trying to provke the above-described bug. Thanks to the PaX team for asking whethre this could possibly happen in some off-line discussions about using some static code checking technology they are developing to find bugs in kernel code. Signed-off-by: "Theodore Ts'o" --- fs/ext4/inode.c | 6 ++++++ 1 file changed, 6 insertions(+) (limited to 'fs/ext4/inode.c') diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c index 6e39895a91b8..113458c9d08b 100644 --- a/fs/ext4/inode.c +++ b/fs/ext4/inode.c @@ -514,6 +514,12 @@ int ext4_map_blocks(handle_t *handle, struct inode *inode, "logical block %lu\n", inode->i_ino, flags, map->m_len, (unsigned long) map->m_lblk); + /* + * ext4_map_blocks returns an int, and m_len is an unsigned int + */ + if (unlikely(map->m_len > INT_MAX)) + map->m_len = INT_MAX; + /* Lookup extent status tree firstly */ if (ext4_es_lookup_extent(inode, map->m_lblk, &es)) { ext4_es_lru_add(inode); -- cgit v1.2.3 From e251f9bca99c0f219eff9c76034476c2b17d3dba Mon Sep 17 00:00:00 2001 From: Maxim Patlasov Date: Thu, 20 Feb 2014 16:58:05 -0500 Subject: ext4: avoid exposure of stale data in ext4_punch_hole() While handling punch-hole fallocate, it's useless to truncate page cache before removing the range from extent tree (or block map in indirect case) because page cache can be re-populated (by read-ahead or read(2) or mmap-ed read) immediately after truncating page cache, but before updating extent tree (or block map). In that case the user will see stale data even after fallocate is completed. Until the problem of data corruption resulting from pages backed by already freed blocks is fully resolved, the simple thing we can do now is to add another truncation of pagecache after punch hole is done. Signed-off-by: Maxim Patlasov Signed-off-by: "Theodore Ts'o" Reviewed-by: Jan Kara --- fs/ext4/inode.c | 6 ++++++ 1 file changed, 6 insertions(+) (limited to 'fs/ext4/inode.c') diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c index 113458c9d08b..5324a38d848d 100644 --- a/fs/ext4/inode.c +++ b/fs/ext4/inode.c @@ -3614,6 +3614,12 @@ int ext4_punch_hole(struct inode *inode, loff_t offset, loff_t length) up_write(&EXT4_I(inode)->i_data_sem); if (IS_SYNC(inode)) ext4_handle_sync(handle); + + /* Now release the pages again to reduce race window */ + if (last_block_offset > first_block_offset) + truncate_pagecache_range(inode, first_block_offset, + last_block_offset); + inode->i_mtime = inode->i_ctime = ext4_current_time(inode); ext4_mark_inode_dirty(handle, inode); out_stop: -- cgit v1.2.3 From 10542c229a4e8e25b40357beea66abe9dacda2c0 Mon Sep 17 00:00:00 2001 From: Jan Kara Date: Tue, 4 Mar 2014 10:50:50 -0500 Subject: ext4: Speedup WB_SYNC_ALL pass called from sync(2) When doing filesystem wide sync, there's no need to force transaction commit (or synchronously write inode buffer) separately for each inode because ext4_sync_fs() takes care of forcing commit at the end (VFS takes care of flushing buffer cache, respectively). Most of the time this slowness doesn't manifest because previous WB_SYNC_NONE writeback doesn't leave much to write but when there are processes aggressively creating new files and several filesystems to sync, the sync slowness can be noticeable. In the following test script sync(1) takes around 6 minutes when there are two ext4 filesystems mounted on a standard SATA drive. After this patch sync takes a couple of seconds so we have about two orders of magnitude improvement. function run_writers { for (( i = 0; i < 10; i++ )); do mkdir $1/dir$i for (( j = 0; j < 40000; j++ )); do dd if=/dev/zero of=$1/dir$i/$j bs=4k count=4 &>/dev/null done & done } for dir in "$@"; do run_writers $dir done sleep 40 time sync Signed-off-by: Jan Kara Signed-off-by: "Theodore Ts'o" --- fs/ext4/inode.c | 13 +++++++++++-- 1 file changed, 11 insertions(+), 2 deletions(-) (limited to 'fs/ext4/inode.c') diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c index 5324a38d848d..ab3e8357929d 100644 --- a/fs/ext4/inode.c +++ b/fs/ext4/inode.c @@ -4455,7 +4455,12 @@ int ext4_write_inode(struct inode *inode, struct writeback_control *wbc) return -EIO; } - if (wbc->sync_mode != WB_SYNC_ALL) + /* + * No need to force transaction in WB_SYNC_NONE mode. Also + * ext4_sync_fs() will force the commit after everything is + * written. + */ + if (wbc->sync_mode != WB_SYNC_ALL || wbc->for_sync) return 0; err = ext4_force_commit(inode->i_sb); @@ -4465,7 +4470,11 @@ int ext4_write_inode(struct inode *inode, struct writeback_control *wbc) err = __ext4_get_inode_loc(inode, &iloc, 0); if (err) return err; - if (wbc->sync_mode == WB_SYNC_ALL) + /* + * sync(2) will flush the whole buffer cache. No need to do + * it here separately for each inode. + */ + if (wbc->sync_mode == WB_SYNC_ALL && !wbc->for_sync) sync_dirty_buffer(iloc.bh); if (buffer_req(iloc.bh) && !buffer_uptodate(iloc.bh)) { EXT4_ERROR_INODE_BLOCK(inode, iloc.bh->b_blocknr, -- cgit v1.2.3 From b8a8684502a0fc852afa0056c6bb2a9273f6fcc0 Mon Sep 17 00:00:00 2001 From: Lukas Czerner Date: Tue, 18 Mar 2014 18:05:35 -0400 Subject: ext4: Introduce FALLOC_FL_ZERO_RANGE flag for fallocate Introduce new FALLOC_FL_ZERO_RANGE flag for fallocate. This has the same functionality as xfs ioctl XFS_IOC_ZERO_RANGE. It can be used to convert a range of file to zeros preferably without issuing data IO. Blocks should be preallocated for the regions that span holes in the file, and the entire range is preferable converted to unwritten extents This can be also used to preallocate blocks past EOF in the same way as with fallocate. Flag FALLOC_FL_KEEP_SIZE which should cause the inode size to remain the same. Also add appropriate tracepoints. Signed-off-by: Lukas Czerner Signed-off-by: "Theodore Ts'o" --- fs/ext4/ext4.h | 2 + fs/ext4/extents.c | 273 +++++++++++++++++++++++++++++++++++++++++--- fs/ext4/inode.c | 17 ++- include/trace/events/ext4.h | 68 +++++------ 4 files changed, 307 insertions(+), 53 deletions(-) (limited to 'fs/ext4/inode.c') diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h index beec42750a8c..1b3cbf8cacf9 100644 --- a/fs/ext4/ext4.h +++ b/fs/ext4/ext4.h @@ -568,6 +568,8 @@ enum { #define EXT4_GET_BLOCKS_NO_LOCK 0x0100 /* Do not put hole in extent cache */ #define EXT4_GET_BLOCKS_NO_PUT_HOLE 0x0200 + /* Convert written extents to unwritten */ +#define EXT4_GET_BLOCKS_CONVERT_UNWRITTEN 0x0400 /* * The bit position of these flags must not overlap with any of the diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c index 2db2d77769a2..464e95da716e 100644 --- a/fs/ext4/extents.c +++ b/fs/ext4/extents.c @@ -3602,6 +3602,8 @@ out: * b> Splits in two extents: Write is happening at either end of the extent * c> Splits in three extents: Somone is writing in middle of the extent * + * This works the same way in the case of initialized -> unwritten conversion. + * * One of more index blocks maybe needed if the extent tree grow after * the uninitialized extent split. To prevent ENOSPC occur at the IO * complete, we need to split the uninitialized extent before DIO submit @@ -3612,7 +3614,7 @@ out: * * Returns the size of uninitialized extent to be written on success. */ -static int ext4_split_unwritten_extents(handle_t *handle, +static int ext4_split_convert_extents(handle_t *handle, struct inode *inode, struct ext4_map_blocks *map, struct ext4_ext_path *path, @@ -3624,9 +3626,9 @@ static int ext4_split_unwritten_extents(handle_t *handle, unsigned int ee_len; int split_flag = 0, depth; - ext_debug("ext4_split_unwritten_extents: inode %lu, logical" - "block %llu, max_blocks %u\n", inode->i_ino, - (unsigned long long)map->m_lblk, map->m_len); + ext_debug("%s: inode %lu, logical block %llu, max_blocks %u\n", + __func__, inode->i_ino, + (unsigned long long)map->m_lblk, map->m_len); eof_block = (inode->i_size + inode->i_sb->s_blocksize - 1) >> inode->i_sb->s_blocksize_bits; @@ -3641,14 +3643,73 @@ static int ext4_split_unwritten_extents(handle_t *handle, ee_block = le32_to_cpu(ex->ee_block); ee_len = ext4_ext_get_actual_len(ex); - split_flag |= ee_block + ee_len <= eof_block ? EXT4_EXT_MAY_ZEROOUT : 0; - split_flag |= EXT4_EXT_MARK_UNINIT2; - if (flags & EXT4_GET_BLOCKS_CONVERT) - split_flag |= EXT4_EXT_DATA_VALID2; + /* Convert to unwritten */ + if (flags & EXT4_GET_BLOCKS_CONVERT_UNWRITTEN) { + split_flag |= EXT4_EXT_DATA_VALID1; + /* Convert to initialized */ + } else if (flags & EXT4_GET_BLOCKS_CONVERT) { + split_flag |= ee_block + ee_len <= eof_block ? + EXT4_EXT_MAY_ZEROOUT : 0; + split_flag |= (EXT4_EXT_MARK_UNINIT2 | EXT4_EXT_DATA_VALID2); + } flags |= EXT4_GET_BLOCKS_PRE_IO; return ext4_split_extent(handle, inode, path, map, split_flag, flags); } +static int ext4_convert_initialized_extents(handle_t *handle, + struct inode *inode, + struct ext4_map_blocks *map, + struct ext4_ext_path *path) +{ + struct ext4_extent *ex; + ext4_lblk_t ee_block; + unsigned int ee_len; + int depth; + int err = 0; + + depth = ext_depth(inode); + ex = path[depth].p_ext; + ee_block = le32_to_cpu(ex->ee_block); + ee_len = ext4_ext_get_actual_len(ex); + + ext_debug("%s: inode %lu, logical" + "block %llu, max_blocks %u\n", __func__, inode->i_ino, + (unsigned long long)ee_block, ee_len); + + if (ee_block != map->m_lblk || ee_len > map->m_len) { + err = ext4_split_convert_extents(handle, inode, map, path, + EXT4_GET_BLOCKS_CONVERT_UNWRITTEN); + if (err < 0) + goto out; + ext4_ext_drop_refs(path); + path = ext4_ext_find_extent(inode, map->m_lblk, path, 0); + if (IS_ERR(path)) { + err = PTR_ERR(path); + goto out; + } + depth = ext_depth(inode); + ex = path[depth].p_ext; + } + + err = ext4_ext_get_access(handle, inode, path + depth); + if (err) + goto out; + /* first mark the extent as uninitialized */ + ext4_ext_mark_uninitialized(ex); + + /* note: ext4_ext_correct_indexes() isn't needed here because + * borders are not changed + */ + ext4_ext_try_to_merge(handle, inode, path, ex); + + /* Mark modified extent as dirty */ + err = ext4_ext_dirty(handle, inode, path + path->p_depth); +out: + ext4_ext_show_leaf(inode, path); + return err; +} + + static int ext4_convert_unwritten_extents_endio(handle_t *handle, struct inode *inode, struct ext4_map_blocks *map, @@ -3682,8 +3743,8 @@ static int ext4_convert_unwritten_extents_endio(handle_t *handle, inode->i_ino, (unsigned long long)ee_block, ee_len, (unsigned long long)map->m_lblk, map->m_len); #endif - err = ext4_split_unwritten_extents(handle, inode, map, path, - EXT4_GET_BLOCKS_CONVERT); + err = ext4_split_convert_extents(handle, inode, map, path, + EXT4_GET_BLOCKS_CONVERT); if (err < 0) goto out; ext4_ext_drop_refs(path); @@ -3883,6 +3944,38 @@ get_reserved_cluster_alloc(struct inode *inode, ext4_lblk_t lblk_start, return allocated_clusters; } +static int +ext4_ext_convert_initialized_extent(handle_t *handle, struct inode *inode, + struct ext4_map_blocks *map, + struct ext4_ext_path *path, int flags, + unsigned int allocated, ext4_fsblk_t newblock) +{ + int ret = 0; + int err = 0; + + /* + * Make sure that the extent is no bigger than we support with + * uninitialized extent + */ + if (map->m_len > EXT_UNINIT_MAX_LEN) + map->m_len = EXT_UNINIT_MAX_LEN / 2; + + ret = ext4_convert_initialized_extents(handle, inode, map, + path); + if (ret >= 0) { + ext4_update_inode_fsync_trans(handle, inode, 1); + err = check_eofblocks_fl(handle, inode, map->m_lblk, + path, map->m_len); + } else + err = ret; + map->m_flags |= EXT4_MAP_UNWRITTEN; + if (allocated > map->m_len) + allocated = map->m_len; + map->m_len = allocated; + + return err ? err : allocated; +} + static int ext4_ext_handle_uninitialized_extents(handle_t *handle, struct inode *inode, struct ext4_map_blocks *map, @@ -3910,8 +4003,8 @@ ext4_ext_handle_uninitialized_extents(handle_t *handle, struct inode *inode, /* get_block() before submit the IO, split the extent */ if ((flags & EXT4_GET_BLOCKS_PRE_IO)) { - ret = ext4_split_unwritten_extents(handle, inode, map, - path, flags); + ret = ext4_split_convert_extents(handle, inode, map, + path, flags | EXT4_GET_BLOCKS_CONVERT); if (ret <= 0) goto out; /* @@ -4199,6 +4292,7 @@ int ext4_ext_map_blocks(handle_t *handle, struct inode *inode, ext4_fsblk_t ee_start = ext4_ext_pblock(ex); unsigned short ee_len; + /* * Uninitialized extents are treated as holes, except that * we split out initialized portions during a write. @@ -4215,7 +4309,17 @@ int ext4_ext_map_blocks(handle_t *handle, struct inode *inode, ext_debug("%u fit into %u:%d -> %llu\n", map->m_lblk, ee_block, ee_len, newblock); - if (!ext4_ext_is_uninitialized(ex)) + /* + * If the extent is initialized check whether the + * caller wants to convert it to unwritten. + */ + if ((!ext4_ext_is_uninitialized(ex)) && + (flags & EXT4_GET_BLOCKS_CONVERT_UNWRITTEN)) { + allocated = ext4_ext_convert_initialized_extent( + handle, inode, map, path, flags, + allocated, newblock); + goto out2; + } else if (!ext4_ext_is_uninitialized(ex)) goto out; ret = ext4_ext_handle_uninitialized_extents( @@ -4604,6 +4708,144 @@ retry: return ret > 0 ? ret2 : ret; } +static long ext4_zero_range(struct file *file, loff_t offset, + loff_t len, int mode) +{ + struct inode *inode = file_inode(file); + handle_t *handle = NULL; + unsigned int max_blocks; + loff_t new_size = 0; + int ret = 0; + int flags; + int partial; + loff_t start, end; + ext4_lblk_t lblk; + struct address_space *mapping = inode->i_mapping; + unsigned int blkbits = inode->i_blkbits; + + trace_ext4_zero_range(inode, offset, len, mode); + + /* + * Write out all dirty pages to avoid race conditions + * Then release them. + */ + if (mapping->nrpages && mapping_tagged(mapping, PAGECACHE_TAG_DIRTY)) { + ret = filemap_write_and_wait_range(mapping, offset, + offset + len - 1); + if (ret) + return ret; + } + + /* + * Round up offset. This is not fallocate, we neet to zero out + * blocks, so convert interior block aligned part of the range to + * unwritten and possibly manually zero out unaligned parts of the + * range. + */ + start = round_up(offset, 1 << blkbits); + end = round_down((offset + len), 1 << blkbits); + + if (start < offset || end > offset + len) + return -EINVAL; + partial = (offset + len) & ((1 << blkbits) - 1); + + lblk = start >> blkbits; + max_blocks = (end >> blkbits); + if (max_blocks < lblk) + max_blocks = 0; + else + max_blocks -= lblk; + + flags = EXT4_GET_BLOCKS_CREATE_UNINIT_EXT | + EXT4_GET_BLOCKS_CONVERT_UNWRITTEN; + if (mode & FALLOC_FL_KEEP_SIZE) + flags |= EXT4_GET_BLOCKS_KEEP_SIZE; + + mutex_lock(&inode->i_mutex); + + /* + * Indirect files do not support unwritten extnets + */ + if (!(ext4_test_inode_flag(inode, EXT4_INODE_EXTENTS))) { + ret = -EOPNOTSUPP; + goto out_mutex; + } + + if (!(mode & FALLOC_FL_KEEP_SIZE) && + offset + len > i_size_read(inode)) { + new_size = offset + len; + ret = inode_newsize_ok(inode, new_size); + if (ret) + goto out_mutex; + /* + * If we have a partial block after EOF we have to allocate + * the entire block. + */ + if (partial) + max_blocks += 1; + } + + if (max_blocks > 0) { + + /* Now release the pages and zero block aligned part of pages*/ + truncate_pagecache_range(inode, start, end - 1); + + /* Wait all existing dio workers, newcomers will block on i_mutex */ + ext4_inode_block_unlocked_dio(inode); + inode_dio_wait(inode); + + /* + * Remove entire range from the extent status tree. + */ + ret = ext4_es_remove_extent(inode, lblk, max_blocks); + if (ret) + goto out_dio; + + ret = ext4_alloc_file_blocks(file, lblk, max_blocks, flags, + mode); + if (ret) + goto out_dio; + } + + handle = ext4_journal_start(inode, EXT4_HT_MISC, 4); + if (IS_ERR(handle)) { + ret = PTR_ERR(handle); + ext4_std_error(inode->i_sb, ret); + goto out_dio; + } + + inode->i_mtime = inode->i_ctime = ext4_current_time(inode); + + if (!ret && new_size) { + if (new_size > i_size_read(inode)) + i_size_write(inode, new_size); + if (new_size > EXT4_I(inode)->i_disksize) + ext4_update_i_disksize(inode, new_size); + } else if (!ret && !new_size) { + /* + * Mark that we allocate beyond EOF so the subsequent truncate + * can proceed even if the new size is the same as i_size. + */ + if ((offset + len) > i_size_read(inode)) + ext4_set_inode_flag(inode, EXT4_INODE_EOFBLOCKS); + } + + ext4_mark_inode_dirty(handle, inode); + + /* Zero out partial block at the edges of the range */ + ret = ext4_zero_partial_blocks(handle, inode, offset, len); + + if (file->f_flags & O_SYNC) + ext4_handle_sync(handle); + + ext4_journal_stop(handle); +out_dio: + ext4_inode_resume_unlocked_dio(inode); +out_mutex: + mutex_unlock(&inode->i_mutex); + return ret; +} + /* * preallocate space for a file. This implements ext4's fallocate file * operation, which gets called from sys_fallocate system call. @@ -4625,7 +4867,7 @@ long ext4_fallocate(struct file *file, int mode, loff_t offset, loff_t len) /* Return error if mode is not supported */ if (mode & ~(FALLOC_FL_KEEP_SIZE | FALLOC_FL_PUNCH_HOLE | - FALLOC_FL_COLLAPSE_RANGE)) + FALLOC_FL_COLLAPSE_RANGE | FALLOC_FL_ZERO_RANGE)) return -EOPNOTSUPP; if (mode & FALLOC_FL_PUNCH_HOLE) @@ -4645,6 +4887,9 @@ long ext4_fallocate(struct file *file, int mode, loff_t offset, loff_t len) if (!(ext4_test_inode_flag(inode, EXT4_INODE_EXTENTS))) return -EOPNOTSUPP; + if (mode & FALLOC_FL_ZERO_RANGE) + return ext4_zero_range(file, offset, len, mode); + trace_ext4_fallocate_enter(inode, offset, len, mode); lblk = offset >> blkbits; /* diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c index ab3e8357929d..7cc24555eca8 100644 --- a/fs/ext4/inode.c +++ b/fs/ext4/inode.c @@ -503,6 +503,7 @@ int ext4_map_blocks(handle_t *handle, struct inode *inode, { struct extent_status es; int retval; + int ret = 0; #ifdef ES_AGGRESSIVE_TEST struct ext4_map_blocks orig_map; @@ -558,7 +559,6 @@ int ext4_map_blocks(handle_t *handle, struct inode *inode, EXT4_GET_BLOCKS_KEEP_SIZE); } if (retval > 0) { - int ret; unsigned int status; if (unlikely(retval != map->m_len)) { @@ -585,7 +585,7 @@ int ext4_map_blocks(handle_t *handle, struct inode *inode, found: if (retval > 0 && map->m_flags & EXT4_MAP_MAPPED) { - int ret = check_block_validity(inode, map); + ret = check_block_validity(inode, map); if (ret != 0) return ret; } @@ -602,7 +602,13 @@ found: * with buffer head unmapped. */ if (retval > 0 && map->m_flags & EXT4_MAP_MAPPED) - return retval; + /* + * If we need to convert extent to unwritten + * we continue and do the actual work in + * ext4_ext_map_blocks() + */ + if (!(flags & EXT4_GET_BLOCKS_CONVERT_UNWRITTEN)) + return retval; /* * Here we clear m_flags because after allocating an new extent, @@ -658,7 +664,6 @@ found: ext4_clear_inode_state(inode, EXT4_STATE_DELALLOC_RESERVED); if (retval > 0) { - int ret; unsigned int status; if (unlikely(retval != map->m_len)) { @@ -693,7 +698,7 @@ found: has_zeroout: up_write((&EXT4_I(inode)->i_data_sem)); if (retval > 0 && map->m_flags & EXT4_MAP_MAPPED) { - int ret = check_block_validity(inode, map); + ret = check_block_validity(inode, map); if (ret != 0) return ret; } @@ -3507,7 +3512,7 @@ int ext4_punch_hole(struct inode *inode, loff_t offset, loff_t length) if (!S_ISREG(inode->i_mode)) return -EOPNOTSUPP; - trace_ext4_punch_hole(inode, offset, length); + trace_ext4_punch_hole(inode, offset, length, 0); /* * Write out all dirty pages to avoid race conditions diff --git a/include/trace/events/ext4.h b/include/trace/events/ext4.h index e9d7ee77d3a1..010ea89eeb0e 100644 --- a/include/trace/events/ext4.h +++ b/include/trace/events/ext4.h @@ -21,6 +21,10 @@ struct extent_status; #define FALLOC_FL_COLLAPSE_RANGE 0x08 #endif +#ifndef FALLOC_FL_ZERO_RANGE +#define FALLOC_FL_ZERO_RANGE 0x10 +#endif + #define EXT4_I(inode) (container_of(inode, struct ext4_inode_info, vfs_inode)) #define show_mballoc_flags(flags) __print_flags(flags, "|", \ @@ -77,7 +81,8 @@ struct extent_status; { FALLOC_FL_KEEP_SIZE, "KEEP_SIZE"}, \ { FALLOC_FL_PUNCH_HOLE, "PUNCH_HOLE"}, \ { FALLOC_FL_NO_HIDE_STALE, "NO_HIDE_STALE"}, \ - { FALLOC_FL_COLLAPSE_RANGE, "COLLAPSE_RANGE"}) + { FALLOC_FL_COLLAPSE_RANGE, "COLLAPSE_RANGE"}, \ + { FALLOC_FL_ZERO_RANGE, "ZERO_RANGE"}) TRACE_EVENT(ext4_free_inode, @@ -1339,7 +1344,7 @@ TRACE_EVENT(ext4_direct_IO_exit, __entry->rw, __entry->ret) ); -TRACE_EVENT(ext4_fallocate_enter, +DECLARE_EVENT_CLASS(ext4__fallocate_mode, TP_PROTO(struct inode *inode, loff_t offset, loff_t len, int mode), TP_ARGS(inode, offset, len, mode), @@ -1347,23 +1352,45 @@ TRACE_EVENT(ext4_fallocate_enter, TP_STRUCT__entry( __field( dev_t, dev ) __field( ino_t, ino ) - __field( loff_t, pos ) - __field( loff_t, len ) + __field( loff_t, offset ) + __field( loff_t, len ) __field( int, mode ) ), TP_fast_assign( __entry->dev = inode->i_sb->s_dev; __entry->ino = inode->i_ino; - __entry->pos = offset; + __entry->offset = offset; __entry->len = len; __entry->mode = mode; ), - TP_printk("dev %d,%d ino %lu pos %lld len %lld mode %s", + TP_printk("dev %d,%d ino %lu offset %lld len %lld mode %s", MAJOR(__entry->dev), MINOR(__entry->dev), - (unsigned long) __entry->ino, __entry->pos, - __entry->len, show_falloc_mode(__entry->mode)) + (unsigned long) __entry->ino, + __entry->offset, __entry->len, + show_falloc_mode(__entry->mode)) +); + +DEFINE_EVENT(ext4__fallocate_mode, ext4_fallocate_enter, + + TP_PROTO(struct inode *inode, loff_t offset, loff_t len, int mode), + + TP_ARGS(inode, offset, len, mode) +); + +DEFINE_EVENT(ext4__fallocate_mode, ext4_punch_hole, + + TP_PROTO(struct inode *inode, loff_t offset, loff_t len, int mode), + + TP_ARGS(inode, offset, len, mode) +); + +DEFINE_EVENT(ext4__fallocate_mode, ext4_zero_range, + + TP_PROTO(struct inode *inode, loff_t offset, loff_t len, int mode), + + TP_ARGS(inode, offset, len, mode) ); TRACE_EVENT(ext4_fallocate_exit, @@ -1395,31 +1422,6 @@ TRACE_EVENT(ext4_fallocate_exit, __entry->ret) ); -TRACE_EVENT(ext4_punch_hole, - TP_PROTO(struct inode *inode, loff_t offset, loff_t len), - - TP_ARGS(inode, offset, len), - - TP_STRUCT__entry( - __field( dev_t, dev ) - __field( ino_t, ino ) - __field( loff_t, offset ) - __field( loff_t, len ) - ), - - TP_fast_assign( - __entry->dev = inode->i_sb->s_dev; - __entry->ino = inode->i_ino; - __entry->offset = offset; - __entry->len = len; - ), - - TP_printk("dev %d,%d ino %lu offset %lld len %lld", - MAJOR(__entry->dev), MINOR(__entry->dev), - (unsigned long) __entry->ino, - __entry->offset, __entry->len) -); - TRACE_EVENT(ext4_unlink_enter, TP_PROTO(struct inode *parent, struct dentry *dentry), -- cgit v1.2.3 From c4f65706056e9f0c2cf126b29c6920a179d91150 Mon Sep 17 00:00:00 2001 From: Theodore Ts'o Date: Thu, 20 Mar 2014 00:32:57 -0400 Subject: ext4: kill i_version support for Hurd-castrated file systems The Hurd file system uses uses the inode field which is now used for i_version for its translator block. This means that ext2 file systems that are formatted for GNU Hurd can't be used to support NFSv4. Given that Hurd file systems don't support extents, and a huge number of modern file system features, this is no great loss. If we don't do this, the attempt to update the i_version field will stomp over the translator block field, which will cause file system corruption for Hurd file systems. This can be replicated via: mke2fs -t ext2 -o hurd /dev/vdc mount -t ext4 /dev/vdc /vdc touch /vdc/bug0000 umount /dev/vdc e2fsck -f /dev/vdc Addresses-Debian-Bug: #738758 Reported-By: Gabriele Giacone <1o5g4r8o@gmail.com> Signed-off-by: "Theodore Ts'o" --- fs/ext4/inode.c | 29 ++++++++++++++++++----------- 1 file changed, 18 insertions(+), 11 deletions(-) (limited to 'fs/ext4/inode.c') diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c index 7cc24555eca8..ed2c13a7f293 100644 --- a/fs/ext4/inode.c +++ b/fs/ext4/inode.c @@ -4168,11 +4168,14 @@ struct inode *ext4_iget(struct super_block *sb, unsigned long ino) EXT4_INODE_GET_XTIME(i_atime, inode, raw_inode); EXT4_EINODE_GET_XTIME(i_crtime, ei, raw_inode); - inode->i_version = le32_to_cpu(raw_inode->i_disk_version); - if (EXT4_INODE_SIZE(inode->i_sb) > EXT4_GOOD_OLD_INODE_SIZE) { - if (EXT4_FITS_IN_INODE(raw_inode, ei, i_version_hi)) - inode->i_version |= - (__u64)(le32_to_cpu(raw_inode->i_version_hi)) << 32; + if (EXT4_SB(inode->i_sb)->s_es->s_creator_os != + cpu_to_le32(EXT4_OS_HURD)) { + inode->i_version = le32_to_cpu(raw_inode->i_disk_version); + if (EXT4_INODE_SIZE(inode->i_sb) > EXT4_GOOD_OLD_INODE_SIZE) { + if (EXT4_FITS_IN_INODE(raw_inode, ei, i_version_hi)) + inode->i_version |= + (__u64)(le32_to_cpu(raw_inode->i_version_hi)) << 32; + } } ret = 0; @@ -4388,12 +4391,16 @@ static int ext4_do_update_inode(handle_t *handle, raw_inode->i_block[block] = ei->i_data[block]; } - raw_inode->i_disk_version = cpu_to_le32(inode->i_version); - if (ei->i_extra_isize) { - if (EXT4_FITS_IN_INODE(raw_inode, ei, i_version_hi)) - raw_inode->i_version_hi = - cpu_to_le32(inode->i_version >> 32); - raw_inode->i_extra_isize = cpu_to_le16(ei->i_extra_isize); + if (EXT4_SB(inode->i_sb)->s_es->s_creator_os != + cpu_to_le32(EXT4_OS_HURD)) { + raw_inode->i_disk_version = cpu_to_le32(inode->i_version); + if (ei->i_extra_isize) { + if (EXT4_FITS_IN_INODE(raw_inode, ei, i_version_hi)) + raw_inode->i_version_hi = + cpu_to_le32(inode->i_version >> 32); + raw_inode->i_extra_isize = + cpu_to_le16(ei->i_extra_isize); + } } ext4_inode_csum_set(inode, raw_inode, ei); -- cgit v1.2.3 From ed3654eb981fd44694b4d2a636e13f998bc10e7f Mon Sep 17 00:00:00 2001 From: Theodore Ts'o Date: Mon, 24 Mar 2014 14:09:06 -0400 Subject: ext4: optimize Hurd tests when reading/writing inodes Set a in-memory superblock flag to indicate whether the file system is designed to support the Hurd. Also, add a sanity check to make sure the 64-bit feature is not set for Hurd file systems, since i_file_acl_high conflicts with a Hurd-specific field. Signed-off-by: "Theodore Ts'o" --- fs/ext4/ext4.h | 2 ++ fs/ext4/inode.c | 9 +++------ fs/ext4/super.c | 10 ++++++++++ 3 files changed, 15 insertions(+), 6 deletions(-) (limited to 'fs/ext4/inode.c') diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h index f4f889e6df83..e01135d791ca 100644 --- a/fs/ext4/ext4.h +++ b/fs/ext4/ext4.h @@ -1001,6 +1001,8 @@ struct ext4_inode_info { #define EXT4_MOUNT2_STD_GROUP_SIZE 0x00000002 /* We have standard group size of blocksize * 8 blocks */ +#define EXT4_MOUNT2_HURD_COMPAT 0x00000004 /* Support HURD-castrated + file systems */ #define clear_opt(sb, opt) EXT4_SB(sb)->s_mount_opt &= \ ~EXT4_MOUNT_##opt diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c index ed2c13a7f293..b5e182acf9b9 100644 --- a/fs/ext4/inode.c +++ b/fs/ext4/inode.c @@ -4168,8 +4168,7 @@ struct inode *ext4_iget(struct super_block *sb, unsigned long ino) EXT4_INODE_GET_XTIME(i_atime, inode, raw_inode); EXT4_EINODE_GET_XTIME(i_crtime, ei, raw_inode); - if (EXT4_SB(inode->i_sb)->s_es->s_creator_os != - cpu_to_le32(EXT4_OS_HURD)) { + if (likely(!test_opt2(inode->i_sb, HURD_COMPAT))) { inode->i_version = le32_to_cpu(raw_inode->i_disk_version); if (EXT4_INODE_SIZE(inode->i_sb) > EXT4_GOOD_OLD_INODE_SIZE) { if (EXT4_FITS_IN_INODE(raw_inode, ei, i_version_hi)) @@ -4345,8 +4344,7 @@ static int ext4_do_update_inode(handle_t *handle, goto out_brelse; raw_inode->i_dtime = cpu_to_le32(ei->i_dtime); raw_inode->i_flags = cpu_to_le32(ei->i_flags & 0xFFFFFFFF); - if (EXT4_SB(inode->i_sb)->s_es->s_creator_os != - cpu_to_le32(EXT4_OS_HURD)) + if (likely(!test_opt2(inode->i_sb, HURD_COMPAT))) raw_inode->i_file_acl_high = cpu_to_le16(ei->i_file_acl >> 32); raw_inode->i_file_acl_lo = cpu_to_le32(ei->i_file_acl); @@ -4391,8 +4389,7 @@ static int ext4_do_update_inode(handle_t *handle, raw_inode->i_block[block] = ei->i_data[block]; } - if (EXT4_SB(inode->i_sb)->s_es->s_creator_os != - cpu_to_le32(EXT4_OS_HURD)) { + if (likely(!test_opt2(inode->i_sb, HURD_COMPAT))) { raw_inode->i_disk_version = cpu_to_le32(inode->i_version); if (ei->i_extra_isize) { if (EXT4_FITS_IN_INODE(raw_inode, ei, i_version_hi)) diff --git a/fs/ext4/super.c b/fs/ext4/super.c index 5a51af7d0335..f3c667091618 100644 --- a/fs/ext4/super.c +++ b/fs/ext4/super.c @@ -3580,6 +3580,16 @@ static int ext4_fill_super(struct super_block *sb, void *data, int silent) "feature flags set on rev 0 fs, " "running e2fsck is recommended"); + if (es->s_creator_os == cpu_to_le32(EXT4_OS_HURD)) { + set_opt2(sb, HURD_COMPAT); + if (EXT4_HAS_INCOMPAT_FEATURE(sb, + EXT4_FEATURE_INCOMPAT_64BIT)) { + ext4_msg(sb, KERN_ERR, + "The Hurd can't support 64-bit file systems"); + goto failed_mount; + } + } + if (IS_EXT2_SB(sb)) { if (ext2_feature_set_ok(sb)) ext4_msg(sb, KERN_INFO, "mounting ext2 file system " -- cgit v1.2.3 From 5f16f3225b06242a9ee876f07c1c9b6ed36a22b6 Mon Sep 17 00:00:00 2001 From: Theodore Ts'o Date: Mon, 24 Mar 2014 14:43:12 -0400 Subject: ext4: atomically set inode->i_flags in ext4_set_inode_flags() Use cmpxchg() to atomically set i_flags instead of clearing out the S_IMMUTABLE, S_APPEND, etc. flags and then setting them from the EXT4_IMMUTABLE_FL, EXT4_APPEND_FL flags, since this opens up a race where an immutable file has the immutable flag cleared for a brief window of time. Reported-by: John Sullivan Signed-off-by: "Theodore Ts'o" Cc: stable@kernel.org --- fs/ext4/inode.c | 14 ++++++++------ fs/inode.c | 31 +++++++++++++++++++++++++++++++ include/linux/fs.h | 3 +++ 3 files changed, 42 insertions(+), 6 deletions(-) (limited to 'fs/ext4/inode.c') diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c index b5e182acf9b9..df067c3c6c93 100644 --- a/fs/ext4/inode.c +++ b/fs/ext4/inode.c @@ -3938,18 +3938,20 @@ int ext4_get_inode_loc(struct inode *inode, struct ext4_iloc *iloc) void ext4_set_inode_flags(struct inode *inode) { unsigned int flags = EXT4_I(inode)->i_flags; + unsigned int new_fl = 0; - inode->i_flags &= ~(S_SYNC|S_APPEND|S_IMMUTABLE|S_NOATIME|S_DIRSYNC); if (flags & EXT4_SYNC_FL) - inode->i_flags |= S_SYNC; + new_fl |= S_SYNC; if (flags & EXT4_APPEND_FL) - inode->i_flags |= S_APPEND; + new_fl |= S_APPEND; if (flags & EXT4_IMMUTABLE_FL) - inode->i_flags |= S_IMMUTABLE; + new_fl |= S_IMMUTABLE; if (flags & EXT4_NOATIME_FL) - inode->i_flags |= S_NOATIME; + new_fl |= S_NOATIME; if (flags & EXT4_DIRSYNC_FL) - inode->i_flags |= S_DIRSYNC; + new_fl |= S_DIRSYNC; + inode_set_flags(inode, new_fl, + S_SYNC|S_APPEND|S_IMMUTABLE|S_NOATIME|S_DIRSYNC); } /* Propagate flags from i_flags to EXT4_I(inode)->i_flags */ diff --git a/fs/inode.c b/fs/inode.c index 4bcdad3c9361..26f95ceb6250 100644 --- a/fs/inode.c +++ b/fs/inode.c @@ -1899,3 +1899,34 @@ void inode_dio_done(struct inode *inode) wake_up_bit(&inode->i_state, __I_DIO_WAKEUP); } EXPORT_SYMBOL(inode_dio_done); + +/* + * inode_set_flags - atomically set some inode flags + * + * Note: the caller should be holding i_mutex, or else be sure that + * they have exclusive access to the inode structure (i.e., while the + * inode is being instantiated). The reason for the cmpxchg() loop + * --- which wouldn't be necessary if all code paths which modify + * i_flags actually followed this rule, is that there is at least one + * code path which doesn't today --- for example, + * __generic_file_aio_write() calls file_remove_suid() without holding + * i_mutex --- so we use cmpxchg() out of an abundance of caution. + * + * In the long run, i_mutex is overkill, and we should probably look + * at using the i_lock spinlock to protect i_flags, and then make sure + * it is so documented in include/linux/fs.h and that all code follows + * the locking convention!! + */ +void inode_set_flags(struct inode *inode, unsigned int flags, + unsigned int mask) +{ + unsigned int old_flags, new_flags; + + WARN_ON_ONCE(flags & ~mask); + do { + old_flags = ACCESS_ONCE(inode->i_flags); + new_flags = (old_flags & ~mask) | flags; + } while (unlikely(cmpxchg(&inode->i_flags, old_flags, + new_flags) != old_flags)); +} +EXPORT_SYMBOL(inode_set_flags); diff --git a/include/linux/fs.h b/include/linux/fs.h index 60829565e552..5d1f6fa8daed 100644 --- a/include/linux/fs.h +++ b/include/linux/fs.h @@ -2556,6 +2556,9 @@ static inline ssize_t blockdev_direct_IO(int rw, struct kiocb *iocb, void inode_dio_wait(struct inode *inode); void inode_dio_done(struct inode *inode); +extern void inode_set_flags(struct inode *inode, unsigned int flags, + unsigned int mask); + extern const struct file_operations generic_ro_fops; #define special_file(m) (S_ISCHR(m)||S_ISBLK(m)||S_ISFIFO(m)||S_ISSOCK(m)) -- cgit v1.2.3 From 94350ab5c34166f08ef67aaca3a01e6b420891c9 Mon Sep 17 00:00:00 2001 From: Matthew Wilcox Date: Mon, 24 Mar 2014 15:09:16 -0400 Subject: ext4: make ext4_block_zero_page_range static It's only called within inode.c, so make it static, remove its prototype from ext4.h and move it above all of its callers so it doesn't need a prototype within inode.c. Signed-off-by: Matthew Wilcox Signed-off-by: "Theodore Ts'o" --- fs/ext4/ext4.h | 2 -- fs/ext4/inode.c | 42 +++++++++++++++++++++--------------------- 2 files changed, 21 insertions(+), 23 deletions(-) (limited to 'fs/ext4/inode.c') diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h index e01135d791ca..f1c65dc7cc0a 100644 --- a/fs/ext4/ext4.h +++ b/fs/ext4/ext4.h @@ -2139,8 +2139,6 @@ extern int ext4_writepage_trans_blocks(struct inode *); extern int ext4_chunk_trans_blocks(struct inode *, int nrblocks); extern int ext4_block_truncate_page(handle_t *handle, struct address_space *mapping, loff_t from); -extern int ext4_block_zero_page_range(handle_t *handle, - struct address_space *mapping, loff_t from, loff_t length); extern int ext4_zero_partial_blocks(handle_t *handle, struct inode *inode, loff_t lstart, loff_t lend); extern int ext4_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf); diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c index df067c3c6c93..f03a9e7094bc 100644 --- a/fs/ext4/inode.c +++ b/fs/ext4/inode.c @@ -3322,26 +3322,6 @@ void ext4_set_aops(struct inode *inode) inode->i_mapping->a_ops = &ext4_aops; } -/* - * ext4_block_truncate_page() zeroes out a mapping from file offset `from' - * up to the end of the block which corresponds to `from'. - * This required during truncate. We need to physically zero the tail end - * of that block so it doesn't yield old data if the file is later grown. - */ -int ext4_block_truncate_page(handle_t *handle, - struct address_space *mapping, loff_t from) -{ - unsigned offset = from & (PAGE_CACHE_SIZE-1); - unsigned length; - unsigned blocksize; - struct inode *inode = mapping->host; - - blocksize = inode->i_sb->s_blocksize; - length = blocksize - (offset & (blocksize - 1)); - - return ext4_block_zero_page_range(handle, mapping, from, length); -} - /* * ext4_block_zero_page_range() zeros out a mapping of length 'length' * starting from file offset 'from'. The range to be zero'd must @@ -3349,7 +3329,7 @@ int ext4_block_truncate_page(handle_t *handle, * the end of the block it will be shortened to end of the block * that cooresponds to 'from' */ -int ext4_block_zero_page_range(handle_t *handle, +static int ext4_block_zero_page_range(handle_t *handle, struct address_space *mapping, loff_t from, loff_t length) { ext4_fsblk_t index = from >> PAGE_CACHE_SHIFT; @@ -3439,6 +3419,26 @@ unlock: return err; } +/* + * ext4_block_truncate_page() zeroes out a mapping from file offset `from' + * up to the end of the block which corresponds to `from'. + * This required during truncate. We need to physically zero the tail end + * of that block so it doesn't yield old data if the file is later grown. + */ +int ext4_block_truncate_page(handle_t *handle, + struct address_space *mapping, loff_t from) +{ + unsigned offset = from & (PAGE_CACHE_SIZE-1); + unsigned length; + unsigned blocksize; + struct inode *inode = mapping->host; + + blocksize = inode->i_sb->s_blocksize; + length = blocksize - (offset & (blocksize - 1)); + + return ext4_block_zero_page_range(handle, mapping, from, length); +} + int ext4_zero_partial_blocks(handle_t *handle, struct inode *inode, loff_t lstart, loff_t length) { -- cgit v1.2.3 From e04027e887c37b670e30a3f29fde8bfbeba56abc Mon Sep 17 00:00:00 2001 From: Matthew Wilcox Date: Mon, 24 Mar 2014 15:15:07 -0400 Subject: ext4: fix comment typo Signed-off-by: Matthew Wilcox Signed-off-by: "Theodore Ts'o" --- fs/ext4/inode.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) (limited to 'fs/ext4/inode.c') diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c index f03a9e7094bc..0171c19a5467 100644 --- a/fs/ext4/inode.c +++ b/fs/ext4/inode.c @@ -3698,7 +3698,7 @@ void ext4_truncate(struct inode *inode) /* * There is a possibility that we're either freeing the inode - * or it completely new indode. In those cases we might not + * or it's a completely new inode. In those cases we might not * have i_mutex locked because it's not necessary. */ if (!(inode->i_state & (I_NEW|I_FREEING))) -- cgit v1.2.3