summaryrefslogtreecommitdiffstats
path: root/fs/ext4
AgeCommit message (Collapse)AuthorFilesLines
2012-08-17Merge tag 'ext4_for_linus_stable' of ↵Linus Torvalds4-26/+44
git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4 Pull ext4 bug fixes from Ted Ts'o: "The following are all bug fixes and regressions. The most notable are the ones which cause problems for ext4 on RAID --- a performance problem when mounting very large filesystems, and a kernel OOPS when doing an rm -rf on large directory hierarchies on fast devices." * tag 'ext4_for_linus_stable' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: ext4: fix kernel BUG on large-scale rm -rf commands ext4: fix long mount times on very big file systems ext4: don't call ext4_error while block group is locked ext4: avoid kmemcheck complaint from reading uninitialized memory ext4: make sure the journal sb is written in ext4_clear_journal_err()
2012-08-17ext4: fix kernel BUG on large-scale rm -rf commandsTheodore Ts'o1-0/+1
Commit 968dee7722: "ext4: fix hole punch failure when depth is greater than 0" introduced a regression in v3.5.1/v3.6-rc1 which caused kernel crashes when users ran run "rm -rf" on large directory hierarchy on ext4 filesystems on RAID devices: BUG: unable to handle kernel NULL pointer dereference at 0000000000000028 Process rm (pid: 18229, threadinfo ffff8801276bc000, task ffff880123631710) Call Trace: [<ffffffff81236483>] ? __ext4_handle_dirty_metadata+0x83/0x110 [<ffffffff812353d3>] ext4_ext_truncate+0x193/0x1d0 [<ffffffff8120a8cf>] ? ext4_mark_inode_dirty+0x7f/0x1f0 [<ffffffff81207e05>] ext4_truncate+0xf5/0x100 [<ffffffff8120cd51>] ext4_evict_inode+0x461/0x490 [<ffffffff811a1312>] evict+0xa2/0x1a0 [<ffffffff811a1513>] iput+0x103/0x1f0 [<ffffffff81196d84>] do_unlinkat+0x154/0x1c0 [<ffffffff8118cc3a>] ? sys_newfstatat+0x2a/0x40 [<ffffffff81197b0b>] sys_unlinkat+0x1b/0x50 [<ffffffff816135e9>] system_call_fastpath+0x16/0x1b Code: 8b 4d 20 0f b7 41 02 48 8d 04 40 48 8d 04 81 49 89 45 18 0f b7 49 02 48 83 c1 01 49 89 4d 00 e9 ae f8 ff ff 0f 1f 00 49 8b 45 28 <48> 8b 40 28 49 89 45 20 e9 85 f8 ff ff 0f 1f 80 00 00 00 RIP [<ffffffff81233164>] ext4_ext_remove_space+0xa34/0xdf0 This could be reproduced as follows: The problem in commit 968dee7722 was that caused the variable 'i' to be left uninitialized if the truncate required more space than was available in the journal. This resulted in the function ext4_ext_truncate_extend_restart() returning -EAGAIN, which caused ext4_ext_remove_space() to restart the truncate operation after starting a new jbd2 handle. Reported-by: Maciej Żenczykowski <maze@google.com> Reported-by: Marti Raudsepp <marti@juffo.org> Tested-by: Fengguang Wu <fengguang.wu@intel.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Cc: stable@vger.kernel.org
2012-08-17ext4: fix long mount times on very big file systemsTheodore Ts'o1-0/+4
Commit 8aeb00ff85a: "ext4: fix overhead calculation used by ext4_statfs()" introduced a O(n**2) calculation which makes very large file systems take forever to mount. Fix this with an optimization for non-bigalloc file systems. (For bigalloc file systems the overhead needs to be set in the the superblock.) Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Cc: stable@vger.kernel.org
2012-08-17ext4: don't call ext4_error while block group is lockedTheodore Ts'o2-26/+37
While in ext4_validate_block_bitmap(), if an block allocation bitmap is found to be invalid, we call ext4_error() while the block group is still locked. This causes ext4_commit_super() to call a function which might sleep while in an atomic context. There's no need to keep the block group locked at this point, so hoist the ext4_error() call up to ext4_validate_block_bitmap() and release the block group spinlock before calling ext4_error(). The reported stack trace can be found at: http://article.gmane.org/gmane.comp.file-systems.ext4/33731 Reported-by: Dave Jones <davej@redhat.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Cc: stable@vger.kernel.org
2012-08-05ext4: avoid kmemcheck complaint from reading uninitialized memoryTheodore Ts'o1-0/+1
Commit 03179fe923 introduced a kmemcheck complaint in ext4_da_get_block_prep() because we save and restore ei->i_da_metadata_calc_last_lblock even though it is left uninitialized in the case where i_da_metadata_calc_len is zero. This doesn't hurt anything, but silencing the kmemcheck complaint makes it easier for people to find real bugs. Addresses https://bugzilla.kernel.org/show_bug.cgi?id=45631 (which is marked as a regression). Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Cc: stable@vger.kernel.org
2012-08-05ext4: make sure the journal sb is written in ext4_clear_journal_err()Theodore Ts'o1-0/+1
After we transfer set the EXT4_ERROR_FS bit in the file system superblock, it's not enough to call jbd2_journal_clear_err() to clear the error indication from journal superblock --- we need to call jbd2_journal_update_sb_errno() as well. Otherwise, when the root file system is mounted read-only, the journal is replayed, and the error indicator is transferred to the superblock --- but the s_errno field in the jbd2 superblock is left set (since although we cleared it in memory, we never flushed it out to disk). This can end up confusing e2fsck. We should make e2fsck more robust in this case, but the kernel shouldn't be leaving things in this confused state, either. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Cc: stable@kernel.org
2012-08-04ext4: nuke pdflush from commentsArtem Bityutskiy1-1/+1
The pdflush thread is long gone, so this patch removes references to pdflush from ext4 comments. Cc: "Theodore Ts'o" <tytso@mit.edu> Cc: Andreas Dilger <adilger.kernel@dilger.ca> Signed-off-by: Artem Bityutskiy <artem.bityutskiy@linux.intel.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-08-04ext4: nuke write_super from commentsArtem Bityutskiy2-19/+0
The '->write_super' superblock method is gone, and this patch removes all the references to 'write_super' from ext3. Cc: "Theodore Ts'o" <tytso@mit.edu> Cc: Andreas Dilger <adilger.kernel@dilger.ca> Signed-off-by: Artem Bityutskiy <artem.bityutskiy@linux.intel.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-08-01Merge branch 'for-linus' of ↵Linus Torvalds3-29/+23
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull second vfs pile from Al Viro: "The stuff in there: fsfreeze deadlock fixes by Jan (essentially, the deadlock reproduced by xfstests 068), symlink and hardlink restriction patches, plus assorted cleanups and fixes. Note that another fsfreeze deadlock (emergency thaw one) is *not* dealt with - the series by Fernando conflicts a lot with Jan's, breaks userland ABI (FIFREEZE semantics gets changed) and trades the deadlock for massive vfsmount leak; this is going to be handled next cycle. There probably will be another pull request, but that stuff won't be in it." Fix up trivial conflicts due to unrelated changes next to each other in drivers/{staging/gdm72xx/usb_boot.c, usb/gadget/storage_common.c} * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (54 commits) delousing target_core_file a bit Documentation: Correct s_umount state for freeze_fs/unfreeze_fs fs: Remove old freezing mechanism ext2: Implement freezing btrfs: Convert to new freezing mechanism nilfs2: Convert to new freezing mechanism ntfs: Convert to new freezing mechanism fuse: Convert to new freezing mechanism gfs2: Convert to new freezing mechanism ocfs2: Convert to new freezing mechanism xfs: Convert to new freezing code ext4: Convert to new freezing mechanism fs: Protect write paths by sb_start_write - sb_end_write fs: Skip atime update on frozen filesystem fs: Add freezing handling to mnt_want_write() / mnt_drop_write() fs: Improve filesystem freezing handling switch the protection of percpu_counter list to spinlock nfsd: Push mnt_want_write() outside of i_mutex btrfs: Push mnt_want_write() outside of i_mutex fat: Push mnt_want_write() outside of i_mutex ...
2012-07-31ext4: Convert to new freezing mechanismJan Kara3-29/+23
We remove most of frozen checks since upper layer takes care of blocking all writes. We have to handle protection in ext4_page_mkwrite() in a special way because we cannot use generic block_page_mkwrite(). Also we add a freeze protection to ext4_evict_inode() so that iput() of unlinked inode cannot modify a frozen filesystem (we cannot easily instrument ext4_journal_start() / ext4_journal_stop() with freeze protection because we are missing the superblock pointer in ext4_journal_stop() in nojournal mode). CC: linux-ext4@vger.kernel.org CC: "Theodore Ts'o" <tytso@mit.edu> BugLink: https://bugs.launchpad.net/bugs/897421 Tested-by: Kamal Mostafa <kamal@canonical.com> Tested-by: Peter M. Petrakis <peter.petrakis@canonical.com> Tested-by: Dann Frazier <dann.frazier@canonical.com> Tested-by: Massimo Morana <massimo.morana@canonical.com> Acked-by: "Theodore Ts'o" <tytso@mit.edu> Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-07-30ext4: use memweight()Akinobu Mita1-8/+1
Convert ext4_count_free() to use memweight() instead of table lookup based counting clear bits implementation. This change only affects the code segments enabled by EXT4FS_DEBUG. Note that this memweight() call can't be replaced with a single bitmap_weight() call, although the pointer to the memory area is aligned to long-word boundary. Because the size of the memory area may not be a multiple of BITS_PER_LONG, then it returns wrong value on big-endian architecture. This also includes the following change. - Remove unnecessary map == NULL check in ext4_count_free() which always takes non-null pointer as the memory area. Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com> Cc: "Theodore Ts'o" <tytso@mit.edu> Cc: Andreas Dilger <adilger.kernel@dilger.ca> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-07-27Merge tag 'ext4_for_linus' of ↵Linus Torvalds14-196/+563
git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4 Pull ext4 updates from Ted Ts'o: "The usual collection of bug fixes and optimizations. Perhaps of greatest note is a speed up for parallel, non-allocating DIO writes, since we no longer take the i_mutex lock in that case. For bug fixes, we fix an incorrect overhead calculation which caused slightly incorrect results for df(1) and statfs(2). We also fixed bugs in the metadata checksum feature." * tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (23 commits) ext4: undo ext4_calc_metadata_amount if we fail to claim space ext4: don't let i_reserved_meta_blocks go negative ext4: fix hole punch failure when depth is greater than 0 ext4: remove unnecessary argument from __ext4_handle_dirty_metadata() ext4: weed out ext4_write_super ext4: remove unnecessary superblock dirtying ext4: convert last user of ext4_mark_super_dirty() to ext4_handle_dirty_super() ext4: remove useless marking of superblock dirty ext4: fix ext4 mismerge back in January ext4: remove dynamic array size in ext4_chksum() ext4: remove unused variable in ext4_update_super() ext4: make quota as first class supported feature ext4: don't take the i_mutex lock when doing DIO overwrites ext4: add a new nolock flag in ext4_map_blocks ext4: split ext4_file_write into buffered IO and direct IO ext4: remove an unused statement in ext4_mb_get_buddy_page_lock() ext4: fix out-of-date comments in extents.c ext4: use s_csum_seed instead of i_csum_seed for xattr block ext4: use proper csum calculation in ext4_rename ext4: fix overhead calculation used by ext4_statfs() ...
2012-07-23Merge branch 'for-linus-2' of ↵Linus Torvalds6-80/+32
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull the big VFS changes from Al Viro: "This one is *big* and changes quite a few things around VFS. What's in there: - the first of two really major architecture changes - death to open intents. The former is finally there; it was very long in making, but with Miklos getting through really hard and messy final push in fs/namei.c, we finally have it. Unlike his variant, this one doesn't introduce struct opendata; what we have instead is ->atomic_open() taking preallocated struct file * and passing everything via its fields. Instead of returning struct file *, it returns -E... on error, 0 on success and 1 in "deal with it yourself" case (e.g. symlink found on server, etc.). See comments before fs/namei.c:atomic_open(). That made a lot of goodies finally possible and quite a few are in that pile: ->lookup(), ->d_revalidate() and ->create() do not get struct nameidata * anymore; ->lookup() and ->d_revalidate() get lookup flags instead, ->create() gets "do we want it exclusive" flag. With the introduction of new helper (kern_path_locked()) we are rid of all struct nameidata instances outside of fs/namei.c; it's still visible in namei.h, but not for long. Come the next cycle, declaration will move either to fs/internal.h or to fs/namei.c itself. [me, miklos, hch] - The second major change: behaviour of final fput(). Now we have __fput() done without any locks held by caller *and* not from deep in call stack. That obviously lifts a lot of constraints on the locking in there. Moreover, it's legal now to call fput() from atomic contexts (which has immediately simplified life for aio.c). We also don't need anti-recursion logics in __scm_destroy() anymore. There is a price, though - the damn thing has become partially asynchronous. For fput() from normal process we are guaranteed that pending __fput() will be done before the caller returns to userland, exits or gets stopped for ptrace. For kernel threads and atomic contexts it's done via schedule_work(), so theoretically we might need a way to make sure it's finished; so far only one such place had been found, but there might be more. There's flush_delayed_fput() (do all pending __fput()) and there's __fput_sync() (fput() analog doing __fput() immediately). I hope we won't need them often; see warnings in fs/file_table.c for details. [me, based on task_work series from Oleg merged last cycle] - sync series from Jan - large part of "death to sync_supers()" work from Artem; the only bits missing here are exofs and ext4 ones. As far as I understand, those are going via the exofs and ext4 trees resp.; once they are in, we can put ->write_super() to the rest, along with the thread calling it. - preparatory bits from unionmount series (from dhowells). - assorted cleanups and fixes all over the place, as usual. This is not the last pile for this cycle; there's at least jlayton's ESTALE work and fsfreeze series (the latter - in dire need of fixes, so I'm not sure it'll make the cut this cycle). I'll probably throw symlink/hardlink restrictions stuff from Kees into the next pile, too. Plus there's a lot of misc patches I hadn't thrown into that one - it's large enough as it is..." * 'for-linus-2' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (127 commits) ext4: switch EXT4_IOC_RESIZE_FS to mnt_want_write_file() btrfs: switch btrfs_ioctl_balance() to mnt_want_write_file() switch dentry_open() to struct path, make it grab references itself spufs: shift dget/mntget towards dentry_open() zoran: don't bother with struct file * in zoran_map ecryptfs: don't reinvent the wheels, please - use struct completion don't expose I_NEW inodes via dentry->d_inode tidy up namei.c a bit unobfuscate follow_up() a bit ext3: pass custom EOF to generic_file_llseek_size() ext4: use core vfs llseek code for dir seeks vfs: allow custom EOF in generic_file_llseek code vfs: Avoid unnecessary WB_SYNC_NONE writeback during sys_sync and reorder sync passes vfs: Remove unnecessary flushing of block devices vfs: Make sys_sync writeout also block device inodes vfs: Create function for iterating over block devices vfs: Reorder operations during sys_sync quota: Move quota syncing to ->sync_fs method quota: Split dquot_quota_sync() to writeback and cache flushing part vfs: Move noop_backing_dev_info check from sync into writeback ...
2012-07-23ext4: undo ext4_calc_metadata_amount if we fail to claim spaceTheodore Ts'o1-11/+21
The function ext4_calc_metadata_amount() has side effects, although it's not obvious from its function name. So if we fail to claim space, regardless of whether we retry to claim the space again, or return an error, we need to undo these side effects. Otherwise we can end up incorrectly calculating the number of metadata blocks needed for the operation, which was responsible for an xfstests failure for test #271 when using an ext2 file system with delalloc enabled. Reported-by: Brian Foster <bfoster@redhat.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Cc: stable@vger.kernel.org
2012-07-22ext4: don't let i_reserved_meta_blocks go negativeBrian Foster1-0/+9
If we hit a condition where we have allocated metadata blocks that were not appropriately reserved, we risk underflow of ei->i_reserved_meta_blocks. In turn, this can throw sbi->s_dirtyclusters_counter significantly out of whack and undermine the nondelalloc fallback logic in ext4_nonda_switch(). Warn if this occurs and set i_allocated_meta_blocks to avoid this problem. This condition is reproduced by xfstests 270 against ext2 with delalloc enabled: Mar 28 08:58:02 localhost kernel: [ 171.526344] EXT4-fs (loop1): delayed block allocation failed for inode 14 at logical offset 64486 with max blocks 64 with error -28 Mar 28 08:58:02 localhost kernel: [ 171.526346] EXT4-fs (loop1): This should not happen!! Data will be lost 270 ultimately fails with an inconsistent filesystem and requires an fsck to repair. The cause of the error is an underflow in ext4_da_update_reserve_space() due to an unreserved meta block allocation. Signed-off-by: Brian Foster <bfoster@redhat.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Cc: stable@vger.kernel.org
2012-07-22ext4: fix hole punch failure when depth is greater than 0Ashish Sangwan1-17/+29
Whether to continue removing extents or not is decided by the return value of function ext4_ext_more_to_rm() which checks 2 conditions: a) if there are no more indexes to process. b) if the number of entries are decreased in the header of "depth -1". In case of hole punch, if the last block to be removed is not part of the last extent index than this index will not be deleted, hence the number of valid entries in the extent header of "depth - 1" will remain as it is and ext4_ext_more_to_rm will return 0 although the required blocks are not yet removed. This patch fixes the above mentioned problem as instead of removing the extents from the end of file, it starts removing the blocks from the particular extent from which removing blocks is actually required and continue backward until done. Signed-off-by: Ashish Sangwan <ashish.sangwan2@gmail.com> Signed-off-by: Namjae Jeon <linkinjeon@gmail.com> Reviewed-by: Lukas Czerner <lczerner@redhat.com> Cc: stable@vger.kernel.org
2012-07-22ext4: remove unnecessary argument from __ext4_handle_dirty_metadata()Artem Bityutskiy5-11/+7
The '__ext4_handle_dirty_metadata()' does not need the 'now' argument anymore and we can kill it. Signed-off-by: Artem Bityutskiy <artem.bityutskiy@linux.intel.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Reviewed-by: Jan Kara <jack@suse.cz>
2012-07-22ext4: weed out ext4_write_superArtem Bityutskiy1-10/+0
We do not depend on VFS's '->write_super()' anymore and do not need the 's_dirt' flag anymore, so weed out 'ext4_write_super()' and 's_dirt'. Signed-off-by: Artem Bityutskiy <artem.bityutskiy@linux.intel.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Reviewed-by: Jan Kara <jack@suse.cz>
2012-07-22ext4: remove unnecessary superblock dirtyingArtem Bityutskiy2-4/+3
This patch changes the 'ext4_handle_dirty_super()' function which submits the superblock for I/O in the following cases: 1. When creating the first large file on a file system without EXT4_FEATURE_RO_COMPAT_LARGE_FILE feature. 2. When re-sizing the file-system. 3. When creating an xattr on a file-system without the EXT4_FEATURE_COMPAT_EXT_ATTR feature. If the file-system has journal enabled, the superblock is written via the journal. We do not modify this path. If the file-system has no journal, this function, falls back to just marking the superblock as dirty using the 's_dirt' superblock flag. This means that it delays the actual superblock I/O submission by 5 seconds (default setting). Namely, the 'sync_supers()' kernel thread will call 'ext4_write_super()' later and will actually submit the superblock for I/O. And this is the behavior this patch modifies: we stop using 's_dirt' and just mark the superblock buffer as dirty right away. Indeed, all 3 cases above are extremely rare and it does not add any value to delay the I/O submission for them. Note: 'ext4_handle_dirty_super()' executes '__ext4_handle_dirty_super()' with 'now = 0'. This patch basically makes the 'now' argument unneeded and it will be deleted in one of the next patches. This patch also removes 's_dirt' condition on the unmount path because we never set it anymore, so we should not test it. Tested using xfstests for both journalled and non-journalled ext4. Signed-off-by: Artem Bityutskiy <artem.bityutskiy@linux.intel.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Reviewed-by: Jan Kara <jack@suse.cz>
2012-07-22ext4: convert last user of ext4_mark_super_dirty() to ext4_handle_dirty_super()Jan Kara2-10/+13
The last user of ext4_mark_super_dirty() in ext4_file_open() is so rare it can well be modifying the superblock properly by journalling the change. Change it and get rid of ext4_mark_super_dirty() as it's not needed anymore. Artem: small amendments. Artem: tested using xfstests for both journalled and non-journalled ext4. Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Artem Bityutskiy <artem.bityutskiy@linux.intel.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Tested-by: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>
2012-07-22ext4: remove useless marking of superblock dirtyJan Kara2-4/+0
Commit a0375156 properly notes that superblock doesn't need to be marked as dirty when only number of free inodes / blocks / number of directories changes since that is recomputed on each mount anyway. However that comment leaves some unnecessary markings as dirty in place. Remove these. Artem: tested using xfstests for both journalled and non-journalled ext4. Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Artem Bityutskiy <artem.bityutskiy@linux.intel.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Tested-by: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>
2012-07-22ext4: fix ext4 mismerge back in JanuaryAl Viro1-1/+0
Duplicate caused, AFAICS, by mismerge in ff9cb1c4eead5e4c292e75cd3170a82d66944101> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Cc: stable@vger.kernel.org
2012-07-22ext4: remove dynamic array size in ext4_chksum()Theodore Ts'o1-1/+3
The ext4_checksum() inline function was using a dynamic array size, which is not legal C. (It is a gcc extension). Remove it. Cc: "Darrick J. Wong" <djwong@us.ibm.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2012-07-22ext4: remove unused variable in ext4_update_super()Theodore Ts'o1-1/+1
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2012-07-22ext4: make quota as first class supported featureAditya Kali3-10/+150
This patch adds support for quotas as a first class feature in ext4; which is to say, the quota files are stored in hidden inodes as file system metadata, instead of as separate files visible in the file system directory hierarchy. It is based on the proposal at: https://ext4.wiki.kernel.org/index.php/Design_For_1st_Class_Quota_in_Ext4 This patch introduces a new feature - EXT4_FEATURE_RO_COMPAT_QUOTA which, when turned on, enables quota accounting at mount time iteself. Also, the quota inodes are stored in two additional superblock fields. Some changes introduced by this patch that should be pointed out are: 1) Two new ext4-superblock fields - s_usr_quota_inum and s_grp_quota_inum for storing the quota inodes in use. 2) Default quota inodes are: inode#3 for tracking userquota and inode#4 for tracking group quota. The superblock fields can be set to use other inodes as well. 3) If the QUOTA feature and corresponding quota inodes are set in superblock, the quota usage tracking is turned on at mount time. On 'quotaon' ioctl, the quota limits enforcement is turned on. 'quotaoff' ioctl turns off only the limits enforcement in this case. 4) When QUOTA feature is in use, the quota mount options 'quota', 'usrquota', 'grpquota' are ignored by the kernel. 5) mke2fs or tune2fs can be used to set the QUOTA feature and initialize quota inodes. The default reserved inodes will not be visible to user as regular files. 6) The quota-tools will need to be modified to support hidden quota files on ext4. E2fsprogs will also include support for creating and fixing quota files. 7) Support is only for the new V2 quota file format. Tested-by: Jan Kara <jack@suse.cz> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Johann Lombardi <johann@whamcloud.com> Signed-off-by: Aditya Kali <adityakali@google.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2012-07-22ext4: don't take the i_mutex lock when doing DIO overwritesZheng Liu2-4/+71
Aligned and overwrite direct I/O can be parallelized. In ext4_file_dio_write, we first check whether these conditions are satisfied or not. If so, we take i_data_sem and release i_mutex lock directly. Meanwhile iocb->private is set to indicate that this is a dio overwrite, and it will be handled in ext4_ext_direct_IO. [ Added fix from Dan Carpenter to fix locking bug on the error path. ] CC: Tao Ma <tm@tao.ma> CC: Eric Sandeen <sandeen@redhat.com> CC: Robin Dong <hao.bigrat@gmail.com> Signed-off-by: Zheng Liu <wenqing.lz@taobao.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
2012-07-23ext4: switch EXT4_IOC_RESIZE_FS to mnt_want_write_file()Al Viro1-2/+2
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-07-23don't expose I_NEW inodes via dentry->d_inodeAl Viro1-2/+2
d_instantiate(dentry, inode); unlock_new_inode(inode); is a bad idea; do it the other way round... Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-07-23ext4: use core vfs llseek code for dir seeksEric Sandeen2-64/+17
Use the new functionality in generic_file_llseek_size() to accept a custom EOF position, and un-cut-and-paste all the vfs llseek code from ext4. Also fix up comments on ext4_llseek() to reflect reality. Signed-off-by: Eric Sandeen <sandeen@redaht.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-07-23vfs: allow custom EOF in generic_file_llseek codeEric Sandeen1-1/+2
For ext3/4 htree directories, using the vfs llseek function with SEEK_END goes to i_size like for any other file, but in reality we want the maximum possible hash value. Recent changes in ext4 have cut & pasted generic_file_llseek() back into fs/ext4/dir.c, but replicating this core code seems like a bad idea, especially since the copy has already diverged from the vfs. This patch updates generic_file_llseek_size to accept both a custom maximum offset, and a custom EOF position. With this in place, ext4_dir_llseek can pass in the appropriate maximum hash position for both maxsize and eof, and get what it wants. As far as I know, this does not fix any bugs - nfs in the kernel doesn't use SEEK_END, and I don't know of any user who does. But some ext4 folks seem keen on doing the right thing here, and I can't really argue. (Patch also fixes up some comments slightly) Signed-off-by: Eric Sandeen <sandeen@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-07-22quota: Move quota syncing to ->sync_fs methodJan Kara1-0/+5
Since the moment writes to quota files are using block device page cache and space for quota structures is reserved at the moment they are first accessed we have no reason to sync quota before inode writeback. In fact this order is now only harmful since quota information can easily change during inode writeback (either because conversion of delayed-allocated extents or simply because of allocation of new blocks for simple filesystems not using page_mkwrite). So move syncing of quota information after writeback of inodes into ->sync_fs method. This way we do not have to use ->quota_sync callback which is primarily intended for use by quotactl syscall anyway and we get rid of calling ->sync_fs() twice unnecessarily. We skip quota syncing for OCFS2 since it does proper quota journalling in all cases (unlike ext3, ext4, and reiserfs which also support legacy non-journalled quotas) and thus there are no dirty quota structures. CC: "Theodore Ts'o" <tytso@mit.edu> CC: Joel Becker <jlbec@evilplan.org> CC: reiserfs-devel@vger.kernel.org Acked-by: Steven Whitehouse <swhiteho@redhat.com> Acked-by: Dave Kleikamp <shaggy@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-07-18ext4: fix duplicated mnt_drop_write call in EXT4_IOC_MOVE_EXTAl Viro1-1/+0
Caused, AFAICS, by mismerge in commit ff9cb1c4eead ("Merge branch 'for_linus' into for_linus_merged") Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Cc: Theodore Ts'o <tytso@mit.edu> Cc: stable@vger.kernel.org # 3.3+ Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-07-14don't pass nameidata to ->create()Al Viro1-1/+1
boolean "does it have to be exclusive?" flag is passed instead; Local filesystem should just ignore it - the object is guaranteed not to be there yet. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-07-14stop passing nameidata to ->lookup()Al Viro1-1/+1
Just the flags; only NFS cares even about that, but there are legitimate uses for such argument. And getting rid of that completely would require splitting ->lookup() into a couple of methods (at least), so let's leave that alone for now... Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-07-14vfs: switch i_dentry/d_alias to hlistAl Viro1-1/+1
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-07-14ext4: get rid of open-coded d_find_any_alias()Al Viro1-8/+1
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-07-09ext4: add a new nolock flag in ext4_map_blocksZheng Liu2-10/+51
EXT4_GET_BLOCKS_NO_LOCK flag is added to indicate that we don't need to acquire i_data_sem lock in ext4_map_blocks. Meanwhile, it changes ext4_get_block() to not start a new journal because when we do a overwrite dio, there is no any metadata that needs to be modified. We define a new function called ext4_get_block_write_nolock, which is used in dio overwrite nolock. In this function, it doesn't try to acquire i_data_sem lock and doesn't start a new journal as it does a lookup. CC: Tao Ma <tm@tao.ma> CC: Eric Sandeen <sandeen@redhat.com> CC: Robin Dong <hao.bigrat@gmail.com> Signed-off-by: Zheng Liu <wenqing.lz@taobao.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2012-07-09ext4: split ext4_file_write into buffered IO and direct IOZheng Liu1-22/+38
ext4_file_dio_write is defined in order to split buffered IO and direct IO in ext4. This patch just refactor some stuff in write path. CC: Tao Ma <tm@tao.ma> CC: Eric Sandeen <sandeen@redhat.com> CC: Robin Dong <hao.bigrat@gmail.com> Signed-off-by: Zheng Liu <wenqing.lz@taobao.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2012-07-09ext4: remove an unused statement in ext4_mb_get_buddy_page_lock()Haibo Liu1-1/+0
In this patch, the statement "poff = block % blocks_per_page" in ext4_mb_get_buddy_page_lock has no effect. It will be optimized out by the compiler, but it's better to remove it. Signed-off-by: Haibo Liu <HaiboLiu6@gmail.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2012-07-09ext4: fix out-of-date comments in extents.cHaiboLiu1-2/+1
In this patch, ext4_ext_try_to_merge has been change to merge an extent both left and right. So we need to update the comment in here. Signed-off-by: HaiboLiu <HaiboLiu6@gmail.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2012-07-09ext4: use s_csum_seed instead of i_csum_seed for xattr blockTao Ma1-7/+4
In xattr block operation, we use h_refcount to indicate whether the xattr block is shared among many inodes. And xattr block csum uses s_csum_seed if it is shared and i_csum_seed if it belongs to one inode. But this has a problem. So consider the block is shared first bewteen inode A and B, and B has some xattr update and CoW the xattr block. When it updates the *old* xattr block(because of the h_refcount change) and calls ext4_xattr_release_block, we has no idea that inode A is the real owner of the *old* xattr block and we can't use the i_csum_seed of inode A either in xattr block csum calculation. And I don't think we have an easy way to find inode A. So this patch just removes the tricky i_csum_seed and we now uses s_csum_seed every time for the xattr block csum. The corresponding patch for the e2fsprogs will be sent in another patch. This is spotted by xfstests 117. Signed-off-by: Tao Ma <boyu.mt@taobao.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Acked-by: Darrick J. Wong <djwong@us.ibm.com>
2012-07-09ext4: use proper csum calculation in ext4_renameTao Ma1-2/+9
In ext4_rename, when the old name is a dir, we need to change ".." to its new parent and journal the change, so with metadata_csum enabled, we have to re-calc the csum. As the first block of the dir can be either a htree root or a normal directory block and we have different csum calculation for these 2 types, we have to choose the right one in ext4_rename. btw, it is found by xfstests 013. Signed-off-by: Tao Ma <boyu.mt@taobao.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Acked-by: Darrick J. Wong <djwong@us.ibm.com>
2012-07-09ext4: fix overhead calculation used by ext4_statfs()Theodore Ts'o4-57/+132
Commit f975d6bcc7a introduced bug which caused ext4_statfs() to miscalculate the number of file system overhead blocks. This causes the f_blocks field in the statfs structure to be larger than it should be. This would in turn cause the "df" output to show the number of data blocks in the file system and the number of data blocks used to be larger than they should be. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Cc: stable@kernel.org
2012-06-30ext4: pass a char * to ext4_count_free() instead of a buffer_head ptrTheodore Ts'o4-8/+8
Make it possible for ext4_count_free to operate on buffers and not just data in buffer_heads. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Cc: stable@kernel.org
2012-06-30ext4: honor O_(D)SYNC semantic in ext4_fallocate()Zheng Liu1-0/+2
Ext4 must make sure the transaction to be commited to the disk when user opens a file with O_(D)SYNC flag and do a fallocate(2) call. This problem had been reported by Christoph Hellwig in this thread: http://www.spinics.net/lists/linux-btrfs/msg13621.html Reported-by: Christoph Hellwig <hch@infradead.org> Signed-off-by: Zheng Liu <wenqing.lz@taobao.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2012-06-30ext4: avoid uneeded calls to ext4_mb_load_buddy() while reading mb_groupsAditya Kali1-8/+15
Currently ext4_mb_load_buddy is called for every group, irrespective of whether the group info is already in memory, while reading /proc/fs/ext4/<partition>/mb_groups proc file. For the purpose of mb_groups proc file, it is unnecessary to load the file group info from disk if it was loaded in past. These calls to ext4_mb_load_buddy make reading the mb_groups proc file expensive. Also, the locks around ext4_get_group_info are not required. This patch modifies the code to call ext4_mb_load_buddy only if the group info had never been loaded into memory in past. It also removes the mb group locking around ext4_get_group_info call. Signed-off-by: Aditya Kali <adityakali@google.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2012-06-07ext4: don't set i_flags in EXT4_IOC_SETFLAGSTao Ma1-1/+0
Commit 7990696 uses the ext4_{set,clear}_inode_flags() functions to change the i_flags automatically but fails to remove the error setting of i_flags. So we still have the problem of trashing state flags. Fix this by removing the assignment. Signed-off-by: Tao Ma <boyu.mt@taobao.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Cc: stable@kernel.org
2012-06-07ext4: fix the free blocks calculation for ext3 file systems w/ uninit_bgTheodore Ts'o1-4/+4
Ext3 filesystems that are converted to use as many ext4 file system features as possible will enable uninit_bg to speed up e2fsck times. These file systems will have a native ext3 layout of inode tables and block allocation bitmaps (as opposed to ext4's flex_bg layout). Unfortunately, in these cases, when first allocating a block in an uninitialized block group, ext4 would incorrectly calculate the number of free blocks in that block group, and then errorneously report that the file system was corrupt: EXT4-fs error (device vdd): ext4_mb_generate_buddy:741: group 30, 32254 clusters in bitmap, 32258 in gd This problem can be reproduced via: mke2fs -q -t ext4 -O ^flex_bg /dev/vdd 5g mount -t ext4 /dev/vdd /mnt fallocate -l 4600m /mnt/test The problem was caused by a bone headed mistake in the check to see if a particular metadata block was part of the block group. Many thanks to Kees Cook for finding and bisecting the buggy commit which introduced this bug (commit fd034a84e1, present since v3.2). Reported-by: Sander Eikelenboom <linux@eikelenboom.it> Reported-by: Kees Cook <keescook@chromium.org> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Tested-by: Kees Cook <keescook@chromium.org> Cc: stable@kernel.org
2012-06-01Merge tag 'ext4_for_linus' of ↵Linus Torvalds20-182/+1377
git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4 Pull Ext4 updates from Theodore Ts'o: "The major new feature added in this update is Darrick J Wong's metadata checksum feature, which adds crc32 checksums to ext4's metadata fields. There is also the usual set of cleanups and bug fixes." * tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (44 commits) ext4: hole-punch use truncate_pagecache_range jbd2: use kmem_cache_zalloc wrapper instead of flag ext4: remove mb_groups before tearing down the buddy_cache ext4: add ext4_mb_unload_buddy in the error path ext4: don't trash state flags in EXT4_IOC_SETFLAGS ext4: let getattr report the right blocks in delalloc+bigalloc ext4: add missing save_error_info() to ext4_error() ext4: add debugging trigger for ext4_error() ext4: protect group inode free counting with group lock ext4: use consistent ssize_t type in ext4_file_write() ext4: fix format flag in ext4_ext_binsearch_idx() ext4: cleanup in ext4_discard_allocated_blocks() ext4: return ENOMEM when mounts fail due to lack of memory ext4: remove redundundant "(char *) bh->b_data" casts ext4: disallow hard-linked directory in ext4_lookup ext4: fix potential integer overflow in alloc_flex_gd() ext4: remove needs_recovery in ext4_mb_init() ext4: force ro mount if ext4_setup_super() fails ext4: fix potential NULL dereference in ext4_free_inodes_counts() ext4/jbd2: add metadata checksumming to the list of supported features ...
2012-06-01ext4: hole-punch use truncate_pagecache_rangeHugh Dickins1-2/+2
When truncating a file, we unmap pages from userspace first, as that's usually more efficient than relying, page by page, on the fallback in truncate_inode_page() - particularly if the file is mapped many times. Do the same when punching a hole: 3.4 added truncate_pagecache_range() to do the unmap and trunc, so use it in ext4_ext_punch_hole(), instead of calling truncate_inode_pages_range() directly. Signed-off-by: Hugh Dickins <hughd@google.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>