summaryrefslogtreecommitdiffstats
path: root/fs/gfs2
AgeCommit message (Collapse)AuthorFilesLines
2015-12-14gfs2: clear journal live bit in gfs2_log_flushBenjamin Marzinski2-4/+3
When gfs2 was unmounting filesystems or changing them to read-only it was clearing the SDF_JOURNAL_LIVE bit before the final log flush. This caused a race. If an inode glock got demoted in the gap between clearing the bit and the shutdown flush, it would be unable to reserve log space to clear out the active items list in inode_go_sync, causing an error in inode_go_inval because the glock was still dirty. To solve this, the SDF_JOURNAL_LIVE bit is now cleared inside the shutdown log flush. This means that, because of the locking on the log blocks, either inode_go_sync will be able to reserve space to clean the glock before the shutdown flush, or the shutdown flush will clean the glock itself, before inode_go_sync fails to reserve the space. Either way, the glock will be clean before inode_go_inval. Signed-off-by: Benjamin Marzinski <bmarzins@redhat.com> Signed-off-by: Bob Peterson <rpeterso@redhat.com>
2015-12-14gfs2: change gfs2 readdir cookieBenjamin Marzinski4-20/+89
gfs2 currently returns 31 bits of filename hash as a cookie that readdir uses for an offset into the directory. When there are a large number of directory entries, the likelihood of a collision goes up way too quickly. GFS2 will now return cookies that are guaranteed unique for a while, and then fail back to using 30 bits of filename hash. Specifically, the directory leaf blocks are divided up into chunks based on the minimum size of a gfs2 directory entry (48 bytes). Each entry's cookie is based off the chunk where it starts, in the linked list of leaf blocks that it hashes to (there are 131072 hash buckets). Directory entries will have unique names until they take reach chunk 8192. Assuming the largest filenames possible, and the least efficient spacing possible, this new method will still be able to return unique names when the previous method has statistically more than a 99% chance of a collision. The non-unique names it fails back to are guaranteed to not collide with the unique names. unique cookies will be in this format: - 1 bit "0" to make sure the the returned cookie is positive - 17 bits for the hash table index - 1 bit for the mode "0" - 13 bits for the offset non-unique cookies will be in this format: - 1 bit "0" to make sure the the returned cookie is positive - 17 bits for the hash table index - 1 bit for the mode "1" - 13 more bits of the name hash Another benefit of location based cookies, is that once a directory's exhash table is fully extended (so that multiple hash table indexs do not use the same leaf blocks), gfs2 can skip sorting the directory entries until it reaches the non-unique ones, and then it only needs to sort these. This provides a significant speed up for directory reads of very large directories. The only issue is that for these cookies to continue to point to the correct entry as files are added and removed from the directory, gfs2 must keep the entries at the same offset in the leaf block when they are split (see my previous patch). This means that until all the nodes in a cluster are running with code that will split the directory leaf blocks this way, none of the nodes can use the new cookie code. To deal with this, gfs2 now has the mount option loccookie, which, if set, will make it return these new location based cookies. This option must not be set until all nodes in the cluster are at least running this version of the kernel code, and you have guaranteed that there are no outstanding cookies required by other software, such as NFS. gfs2 uses some of the extra space at the end of the gfs2_dirent structure to store the calculated readdir cookies. This keeps us from needing to allocate a seperate array to hold these values. gfs2 recomputes the cookie stored in de_cookie for every readdir call. The time it takes to do so is small, and if gfs2 expected this value to be saved on disk, the new code wouldn't work correctly on filesystems created with an earlier version of gfs2. One issue with adding de_cookie to the union in the gfs2_dirent structure is that it caused the union to align itself to a 4 byte boundary, instead of its previous 2 byte boundary. This changed the offset of de_rahead. To solve that, I pulled de_rahead out of the union, since it does not need to be there. Signed-off-by: Benjamin Marzinski <bmarzins@redhat.com> Signed-off-by: Bob Peterson <rpeterso@redhat.com>
2015-12-14gfs2: keep offset when splitting dir leaf blocksBenjamin Marzinski1-16/+53
Currently, when gfs2 splits a directory leaf block, the dirents that need to be copied to the new leaf block are packed into the start of it. This is good for space efficiency. However, if gfs2 were to copy those dirents into the exact same offset in the new leaf block as they had in the old block, it would be able to generate a readdir cookie based on the dirent location, that would be guaranteed to be unique up well past where the current code is statistically almost guaranteed to have collisions. So, gfs2 now keeps the dirent's offset in the block the same when it copies it to the new leaf block. Signed-off-by: Benjamin Marzinski <bmarzins@redhat.com> Signed-off-by: Bob Peterson <rpeterso@redhat.com>
2015-12-14GFS2: Reintroduce a timeout in function gfs2_gl_hash_clearBob Peterson1-1/+3
At some point in the past, we used to have a timeout when GFS2 was unmounting, trying to clear out its glocks. If the timeout expires, it would dump the remaining glocks to the kernel messages so that developers can debug the problem. That timeout was eliminated, probably by accident. This patch reintroduces it. Signed-off-by: Bob Peterson <rpeterso@redhat.com>
2015-12-14GFS2: Update master statfs buffer with sd_statfs_spin lockedBob Peterson1-3/+2
Before this patch, function update_statfs called gfs2_statfs_change_out to update the master statfs buffer without the sd_statfs_spin held. In theory, another process could call gfs2_statfs_sync, which takes the sd_statfs_spin lock and re-reads m_sc from the buffer. So there's a theoretical timing window in which one process could write the master statfs buffer, then another comes along and re-reads it, wiping out the changes. Signed-off-by: Bob Peterson <rpeterso@redhat.com>
2015-12-14GFS2: Reduce size of incore inodeBob Peterson5-26/+26
This patch makes no functional changes. Its goal is to reduce the size of the gfs2 inode in memory by rearranging structures and changing the size of some variables within the structure. Signed-off-by: Bob Peterson <rpeterso@redhat.com>
2015-12-14GFS2: Make rgrp reservations part of the gfs2_inode structureBob Peterson12-82/+33
Before this patch, multi-block reservation structures were allocated from a special slab. This patch folds the structure into the gfs2_inode structure. The disadvantage is that the gfs2_inode needs more memory, even when a file is opened read-only. The advantages are: (a) we don't need the special slab and the extra time it takes to allocate and deallocate from it. (b) we no longer need to worry that the structure exists for things like quota management. (c) This also allows us to remove the calls to get_write_access and put_write_access since we know the structure will exist. Signed-off-by: Bob Peterson <rpeterso@redhat.com>
2015-12-08replace ->follow_link() with new method that could stay in RCU modeAl Viro1-5/+10
new method: ->get_link(); replacement of ->follow_link(). The differences are: * inode and dentry are passed separately * might be called both in RCU and non-RCU mode; the former is indicated by passing it a NULL dentry. * when called that way it isn't allowed to block and should return ERR_PTR(-ECHILD) if it needs to be called in non-RCU mode. It's a flagday change - the old method is gone, all in-tree instances converted. Conversion isn't hard; said that, so far very few instances do not immediately bail out when called in RCU mode. That'll change in the next commits. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2015-12-06posix acls: Remove duplicate xattr name definitionsAndreas Gruenbacher2-4/+2
Remove POSIX_ACL_XATTR_{ACCESS,DEFAULT} and GFS2_POSIX_ACL_{ACCESS,DEFAULT} and replace them with the definitions in <include/uapi/linux/xattr.h>. Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com> Reviewed-by: James Morris <james.l.morris@oracle.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2015-12-06gfs2: Remove gfs2_xattr_acl_chmodAndreas Gruenbacher2-51/+0
Function gfs2_xattr_acl_chmod is unused since commit e01580bf. Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com> Reviewed-by: James Morris <james.l.morris@oracle.com> Acked-by: Bob Peterson <rpeterso@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2015-11-24GFS2: Extract quota data from reservations structure (revert 5407e24)Bob Peterson13-63/+125
This patch basically reverts the majority of patch 5407e24. That patch eliminated the gfs2_qadata structure in favor of just using the reservations structure. The problem with doing that is that it increases the size of the reservations structure. That is not an issue until it comes time to fold the reservations structure into the inode in memory so we know it's always there. By separating out the quota structure again, we aren't punishing the non-quota users by making all the inodes bigger, requiring more slab space. This patch creates a new slab area to allocate the quota stuff so it's managed a little more sanely. Signed-off-by: Bob Peterson <rpeterso@redhat.com>
2015-11-18gfs2: Extended attribute readahead optimizationAndreas Gruenbacher1-18/+63
Instead of submitting a READ_SYNC bio for the inode and a READA bio for the inode's extended attributes through submit_bh, submit a single READ_SYNC bio for both through submit_bio when possible. This can be more efficient on some kinds of block devices. Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com> Signed-off-by: Bob Peterson <rpeterso@redhat.com>
2015-11-16gfs2: Extended attribute readaheadAndreas Gruenbacher8-14/+46
When gfs2 allocates an inode and its extended attribute block next to each other at inode create time, the inode's directory entry indicates that in de_rahead. In that case, we can readahead the extended attribute block when we read in the inode. Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com> Signed-off-by: Bob Peterson <rpeterso@redhat.com>
2015-11-16GFS2: Use rht_for_each_entry_rcu in glock_hash_walkAndrew Price1-2/+2
This lockdep splat was being triggered on umount: [55715.973122] =============================== [55715.980169] [ INFO: suspicious RCU usage. ] [55715.981021] 4.3.0-11553-g8d3de01-dirty #15 Tainted: G W [55715.982353] ------------------------------- [55715.983301] fs/gfs2/glock.c:1427 suspicious rcu_dereference_protected() usage! The code it refers to is the rht_for_each_entry_safe usage in glock_hash_walk. The condition that triggers the warning is lockdep_rht_bucket_is_held(tbl, hash) which is checked in the __rcu_dereference_protected macro. The rhashtable buckets are not changed in glock_hash_walk so it's safe to rely on the rcu protection. Replace the rht_for_each_entry_safe() usage with rht_for_each_entry_rcu(), which doesn't care whether the bucket lock is held if the rcu read lock is held. Signed-off-by: Andrew Price <anprice@redhat.com> Signed-off-by: Bob Peterson <rpeterso@redhat.com> Acked-by: Steven Whitehouse <swhiteho@redhat.com>
2015-11-16GFS2: Delete an unnecessary check before the function call "iput"Markus Elfring1-2/+1
The iput() function tests whether its argument is NULL and then returns immediately. Thus the test around the call is not needed. This issue was detected by using the Coccinelle software. Signed-off-by: Markus Elfring <elfring@users.sourceforge.net> Signed-off-by: Bob Peterson <rpeterso@redhat.com>
2015-11-13Merge branch 'for-linus-3' of ↵Linus Torvalds1-5/+8
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull vfs xattr cleanups from Al Viro. * 'for-linus-3' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: f2fs: xattr simplifications squashfs: xattr simplifications 9p: xattr simplifications xattr handlers: Pass handler to operations instead of flags jffs2: Add missing capability check for listing trusted xattrs hfsplus: Remove unused xattr handler list operations ubifs: Remove unused security xattr handler vfs: Fix the posix_acl_xattr_list return value vfs: Check attribute names in posix acl xattr handers
2015-11-13xattr handlers: Pass handler to operations instead of flagsAndreas Gruenbacher1-5/+8
The xattr_handler operations are currently all passed a file system specific flags value which the operations can use to disambiguate between different handlers; some file systems use that to distinguish the xattr namespace, for example. In some oprations, it would be useful to also have access to the handler prefix. To allow that, pass a pointer to the handler to operations instead of the flags value alone. Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2015-11-10gfs2: Automatically set GFS2_DIF_SYSTEM flag on system filesAbhi Das2-2/+7
When new files and directories are created inside a parent directory we automatically inherit the GFS2_DIF_SYSTEM flag (if set) and assign it to the new file/dirs. All new system files/dirs created in the metafs by, say gfs2_jadd, will have this flag set because they will have parent directories in the metafs whose GFS2_DIF_SYSTEM flag has already been set (most likely by a previous mkfs.gfs2) Signed-off-by: Abhi Das <adas@redhat.com> Signed-off-by: Bob Peterson <rpeterso@redhat.com>
2015-11-09Merge branch 'akpm' (patches from Andrew)Linus Torvalds1-1/+1
Merge third patch-bomb from Andrew Morton: "We're pretty much done over here - I'm still waiting for a nouveau merge so I can cleanly finish up Christoph's dma-mapping rework. - bunch of small misc stuff - fold abs64() into abs(), remove abs64() - new_valid_dev() cleanups - binfmt_elf_fdpic feature work" * emailed patches from Andrew Morton <akpm@linux-foundation.org>: (24 commits) fs/binfmt_elf_fdpic.c: provide NOMMU loader for regular ELF binaries fs/stat.c: remove unnecessary new_valid_dev() check fs/reiserfs/namei.c: remove unnecessary new_valid_dev() check fs/nilfs2/namei.c: remove unnecessary new_valid_dev() check fs/ncpfs/dir.c: remove unnecessary new_valid_dev() check fs/jfs: remove unnecessary new_valid_dev() checks fs/hpfs/namei.c: remove unnecessary new_valid_dev() check fs/f2fs/namei.c: remove unnecessary new_valid_dev() check fs/ext2/namei.c: remove unnecessary new_valid_dev() check fs/exofs/namei.c: remove unnecessary new_valid_dev() check fs/btrfs/inode.c: remove unnecessary new_valid_dev() check fs/9p: remove unnecessary new_valid_dev() checks include/linux/kdev_t.h: old/new_valid_dev() can return bool include/linux/kdev_t.h: remove unused huge_valid_dev() kmap_atomic_to_page() has no users, remove it drivers/scsi/cxgbi: fix build with EXTRA_CFLAGS dma: remove external references to dma_supported Documentation/sysctl/vm.txt: fix misleading code reference of overcommit_memory remove abs64() kernel.h: make abs() work with 64-bit types ...
2015-11-09Merge tag 'gfs2-merge-window' of ↵Linus Torvalds10-59/+70
git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2 Pull gfs2 updates from Bob Peterson: "Here is a list of patches we've accumulated for GFS2 for the current upstream merge window. There are only six patches this time: 1. A cleanup patch from Andreas to remove the gl_spin #define in favor of its value for the sake of clarity. 2. A fix from Andy Price to mark the inode dirty during fallocate. 3. A fix from Andy Price to set s_mode on mount failures to prevent a stack trace. 4 A patch from me to prevent a kernel BUG() in trans_add_meta/trans_add_data due to uninitialized storage. 5. A patch from me to protecting our freeing of the in-core directory hash table to prevent double-free. 6. A fix for a page/block rounding problem that resulted in a metadata coherency problem when the block size != page size" I've got a lot more patches in various stages of review and testing, but I'm afraid they'll have to wait until the next merge window. So next time we're likely to have a lot more" * tag 'gfs2-merge-window' of git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2: GFS2: Fix rgrp end rounding problem for bsize < page size GFS2: Protect freeing directory hash table with i_lock spin_lock gfs2: Remove gl_spin define gfs2: Add missing else in trans_add_meta/data GFS2: Set s_mode before parsing mount options GFS2: fallocate: do not rely on file_update_time to mark the inode dirty
2015-11-09remove abs64()Andrew Morton1-1/+1
Switch everything to the new and more capable implementation of abs(). Mainly to give the new abs() a bit of a workout. Cc: Michal Nazarewicz <mina86@mina86.com> Cc: John Stultz <john.stultz@linaro.org> Cc: Ingo Molnar <mingo@kernel.org> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-11-09GFS2: Fix rgrp end rounding problem for bsize < page sizeBob Peterson1-2/+3
This patch fixes a bug introduced by commit 7005c3e. That patch tries to map a vm range for resource groups, but the calculation breaks down when the block size is less than the page size. Signed-off-by: Bob Peterson <rpeterso@redhat.com>
2015-11-04GFS2: Protect freeing directory hash table with i_lock spin_lockBob Peterson1-1/+6
This patch changes function gfs2_dir_hash_inval so it uses the i_lock spin_lock to protect the in-core hash table, i_hash_cache. This will prevent double-frees due to a race between gfs2_evict_inode and inode invalidation. Signed-off-by: Bob Peterson <rpeterso@redhat.com>
2015-10-29gfs2: Remove gl_spin defineAndreas Gruenbacher6-54/+53
Commit e66cf161 replaced the gl_spin spinlock in struct gfs2_glock with a gl_lockref lockref and defined gl_spin as gl_lockref.lock (the spinlock in gl_lockref). Remove that define to make the references to gl_lockref.lock more obvious. Signed-off-by: Andreas Gruenbacher <andreas.gruenbacher@gmail.com> Signed-off-by: Bob Peterson <rpeterso@redhat.com>
2015-10-22Move locks API users to locks_lock_inode_wait()Benjamin Coddington1-4/+4
Instead of having users check for FL_POSIX or FL_FLOCK to call the correct locks API function, use the check within locks_lock_inode_wait(). This allows for some later cleanup. Signed-off-by: Benjamin Coddington <bcodding@redhat.com> Signed-off-by: Jeff Layton <jeff.layton@primarydata.com>
2015-10-01gfs2: Add missing else in trans_add_meta/dataBob Peterson1-0/+4
This patch fixes a timing window that causes a segfault. The problem is that bd can remain NULL throughout the function and then reference that NULL pointer if the bh->b_private starts out NULL, then someone sets it to non-NULL inside the locking. In that case, bd still needs to be set. Signed-off-by: Bob Peterson <rpeterso@redhat.com>
2015-09-23GFS2: Set s_mode before parsing mount optionsAndrew Price1-1/+3
In the generic mount_bdev() function, deactivate_locked_super() is called after the fill_super() call fails, at which point s_mode has been set. kill_block_super() expects this and dumps a warning when FMODE_EXCL is not set in s_mode. In gfs2_mount() we call deactivate_locked_super() on failure of gfs2_mount_args(), at which point s_mode has not yet been set. This causes kill_block_super() to dump a stack trace when gfs2 fails to mount with invalid options. Set s_mode earlier in gfs2_mount() to avoid that. Signed-off-by: Andrew Price <anprice@redhat.com> Signed-off-by: Bob Peterson <rpeterso@redhat.com>
2015-09-22GFS2: fallocate: do not rely on file_update_time to mark the inode dirtyAndrew Price1-1/+1
Previously __gfs2_fallocate() relied on file_update_time() marking the inode dirty, but that's not a safe assumption as that function doesn't dirty the inode in some cases. Mark the inode dirty explicitly. Signed-off-by: Andrew Price <anprice@redhat.com> Signed-off-by: Bob Peterson <rpeterso@redhat.com>
2015-09-11Merge tag 'gfs2-merge-window' of ↵Linus Torvalds11-285/+212
git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2 Pull GFS2 updates from Bob Peterson: "Here is a list of patches we've accumulated for GFS2 for the current upstream merge window. This time we've only got six patches, many of which are very small: - three cleanups from Andreas Gruenbacher, including a nice cleanup of the sequence file code for the sbstats debugfs file. - a patch from Ben Hutchings that changes statistics variables from signed to unsigned. - two patches from me that increase GFS2's glock scalability by switching from a conventional hash table to rhashtable" * tag 'gfs2-merge-window' of git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2: gfs2: A minor "sbstats" cleanup gfs2: Fix a typo in a comment gfs2: Make statistics unsigned, suitable for use with do_div() GFS2: Use resizable hash table for glocks GFS2: Move glock superblock pointer to field gl_name gfs2: Simplify the seq file code for "sbstats"
2015-09-04fs: create and use seq_show_option for escapingKees Cook1-3/+3
Many file systems that implement the show_options hook fail to correctly escape their output which could lead to unescaped characters (e.g. new lines) leaking into /proc/mounts and /proc/[pid]/mountinfo files. This could lead to confusion, spoofed entries (resulting in things like systemd issuing false d-bus "mount" notifications), and who knows what else. This looks like it would only be the root user stepping on themselves, but it's possible weird things could happen in containers or in other situations with delegated mount privileges. Here's an example using overlay with setuid fusermount trusting the contents of /proc/mounts (via the /etc/mtab symlink). Imagine the use of "sudo" is something more sneaky: $ BASE="ovl" $ MNT="$BASE/mnt" $ LOW="$BASE/lower" $ UP="$BASE/upper" $ WORK="$BASE/work/ 0 0 none /proc fuse.pwn user_id=1000" $ mkdir -p "$LOW" "$UP" "$WORK" $ sudo mount -t overlay -o "lowerdir=$LOW,upperdir=$UP,workdir=$WORK" none /mnt $ cat /proc/mounts none /root/ovl/mnt overlay rw,relatime,lowerdir=ovl/lower,upperdir=ovl/upper,workdir=ovl/work/ 0 0 none /proc fuse.pwn user_id=1000 0 0 $ fusermount -u /proc $ cat /proc/mounts cat: /proc/mounts: No such file or directory This fixes the problem by adding new seq_show_option and seq_show_option_n helpers, and updating the vulnerable show_option handlers to use them as needed. Some, like SELinux, need to be open coded due to unusual existing escape mechanisms. [akpm@linux-foundation.org: add lost chunk, per Kees] [keescook@chromium.org: seq_show_option should be using const parameters] Signed-off-by: Kees Cook <keescook@chromium.org> Acked-by: Serge Hallyn <serge.hallyn@canonical.com> Acked-by: Jan Kara <jack@suse.com> Acked-by: Paul Moore <paul@paul-moore.com> Cc: J. R. Okajima <hooanon05g@gmail.com> Signed-off-by: Kees Cook <keescook@chromium.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-09-03gfs2: A minor "sbstats" cleanupAndreas Gruenbacher1-7/+6
It seems cleaner to avoid the temporary value here. Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com> Signed-off-by: Bob Peterson <rpeterso@redhat.com>
2015-09-03gfs2: Fix a typo in a commentAndreas Gruenbacher1-1/+1
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com> Signed-off-by: Bob Peterson <rpeterso@redhat.com>
2015-09-03gfs2: Make statistics unsigned, suitable for use with do_div()Ben Hutchings4-24/+24
None of these statistics can meaningfully be negative, and the numerator for do_div() must have the type u64. The generic implementation of do_div() used on some 32-bit architectures asserts that, resulting in a compiler error in gfs2_rgrp_congested(). Fixes: 0166b197c2ed ("GFS2: Average in only non-zero round-trip times ...") Signed-off-by: Ben Hutchings <ben@decadent.org.uk> Signed-off-by: Bob Peterson <rpeterso@redhat.com> Acked-by: Andreas Gruenbacher <agruenba@redhat.com>
2015-09-03GFS2: Use resizable hash table for glocksBob Peterson2-166/+102
This patch changes the glock hash table from a normal hash table to a resizable hash table, which scales better. This also simplifies a lot of code. Signed-off-by: Bob Peterson <rpeterso@redhat.com> Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com> Acked-by: Steven Whitehouse <swhiteho@redhat.com>
2015-09-03GFS2: Move glock superblock pointer to field gl_nameBob Peterson11-74/+75
What uniquely identifies a glock in the glock hash table is not gl_name, but gl_name and its superblock pointer. This patch makes the gl_name field correspond to a unique glock identifier. That will allow us to simplify hashing with a future patch, since the hash algorithm can then take the gl_name and hash its components in one operation. Signed-off-by: Bob Peterson <rpeterso@redhat.com> Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com> Acked-by: Steven Whitehouse <swhiteho@redhat.com>
2015-09-03gfs2: Simplify the seq file code for "sbstats"Andreas Gruenbacher1-20/+11
Don't use struct gfs2_glock_iter as the helper data structure for iterating through "sbstats"; we are not iterating through glocks here. Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com> Signed-off-by: Bob Peterson <rpeterso@redhat.com>
2015-08-13block: remove bio_get_nr_vecs()Kent Overstreet1-8/+1
We can always fill up the bio now, no need to estimate the possible size based on queue parameters. Acked-by: Steven Whitehouse <swhiteho@redhat.com> Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com> [hch: rebased and wrote a changelog] Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Ming Lin <ming.l@ssi.samsung.com> Signed-off-by: Jens Axboe <axboe@fb.com>
2015-07-29block: add a bi_error field to struct bioChristoph Hellwig2-8/+8
Currently we have two different ways to signal an I/O error on a BIO: (1) by clearing the BIO_UPTODATE flag (2) by returning a Linux errno value to the bi_end_io callback The first one has the drawback of only communicating a single possible error (-EIO), and the second one has the drawback of not beeing persistent when bios are queued up, and are not passed along from child to parent bio in the ever more popular chaining scenario. Having both mechanisms available has the additional drawback of utterly confusing driver authors and introducing bugs where various I/O submitters only deal with one of them, and the others have to add boilerplate code to deal with both kinds of error returns. So add a new bi_error field to store an errno value directly in struct bio and remove the existing mechanisms to clean all this up. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: NeilBrown <neilb@suse.com> Signed-off-by: Jens Axboe <axboe@fb.com>
2015-06-27Merge tag 'gfs2-merge-window' of ↵Linus Torvalds11-145/+435
git://git.kernel.org:/pub/scm/linux/kernel/git/gfs2/linux-gfs2 Pull GFS2 updates from Bob Peterson: "Here are the patches we've accumulated for GFS2 for the current upstream merge window. We have a good mixture this time. Here are some of the features: - Fix a problem with RO mounts writing to the journal. - Further improvements to quotas on GFS2. - Added support for rename2 and RENAME_EXCHANGE on GFS2. - Increase performance by making glock lru_list less of a bottleneck. - Increase performance by avoiding unnecessary buffer_head releases. - Increase performance by using average glock round trip time from all CPUs. - Fixes for some compiler warnings and minor white space issues. - Other misc bug fixes" * tag 'gfs2-merge-window' of git://git.kernel.org:/pub/scm/linux/kernel/git/gfs2/linux-gfs2: GFS2: Don't brelse rgrp buffer_heads every allocation GFS2: Don't add all glocks to the lru gfs2: Don't support fallocate on jdata files gfs2: s64 cast for negative quota value gfs2: limit quota log messages gfs2: fix quota updates on block boundaries gfs2: fix shadow warning in gfs2_rbm_find() gfs2: kerneldoc warning fixes gfs2: convert simple_str to kstr GFS2: make sure S_NOSEC flag isn't overwritten GFS2: add support for rename2 and RENAME_EXCHANGE gfs2: handle NULL rgd in set_rgrp_preferences GFS2: inode.c: indent with TABs, not spaces GFS2: mark the journal idle to fix ro mounts GFS2: Average in only non-zero round-trip times for congestion stats GFS2: Use average srttb value in congestion calculations
2015-06-25Merge branch 'for-4.2/writeback' of git://git.kernel.dk/linux-blockLinus Torvalds1-1/+1
Pull cgroup writeback support from Jens Axboe: "This is the big pull request for adding cgroup writeback support. This code has been in development for a long time, and it has been simmering in for-next for a good chunk of this cycle too. This is one of those problems that has been talked about for at least half a decade, finally there's a solution and code to go with it. Also see last weeks writeup on LWN: http://lwn.net/Articles/648292/" * 'for-4.2/writeback' of git://git.kernel.dk/linux-block: (85 commits) writeback, blkio: add documentation for cgroup writeback support vfs, writeback: replace FS_CGROUP_WRITEBACK with SB_I_CGROUPWB writeback: do foreign inode detection iff cgroup writeback is enabled v9fs: fix error handling in v9fs_session_init() bdi: fix wrong error return value in cgwb_create() buffer: remove unusued 'ret' variable writeback: disassociate inodes from dying bdi_writebacks writeback: implement foreign cgroup inode bdi_writeback switching writeback: add lockdep annotation to inode_to_wb() writeback: use unlocked_inode_to_wb transaction in inode_congested() writeback: implement unlocked_inode_to_wb transaction and use it for stat updates writeback: implement [locked_]inode_to_wb_and_lock_list() writeback: implement foreign cgroup inode detection writeback: make writeback_control track the inode being written back writeback: relocate wb[_try]_get(), wb_put(), inode_{attach|detach}_wb() mm: vmscan: disable memcg direct reclaim stalling if cgroup writeback support is in use writeback: implement memcg writeback domain based throttling writeback: reset wb_domain->dirty_limit[_tstmp] when memcg domain size changes writeback: implement memcg wb_domain writeback: update wb_over_bg_thresh() to use wb_domain aware operations ...
2015-06-19GFS2: Don't brelse rgrp buffer_heads every allocationBob Peterson3-7/+31
This patch allows the block allocation code to retain the buffers for the resource groups so they don't need to be re-read from buffer cache with every request. This is a performance improvement that's especially noticeable when resource groups are very large. For example, with 2GB resource groups and 4K blocks, there can be 33 blocks for every resource group. This patch allows those 33 buffers to be kept around and not read in and thrown away with every operation. The buffers are released when the resource group is either synced or invalidated. Signed-off-by: Bob Peterson <rpeterso@redhat.com> Reviewed-by: Steven Whitehouse <swhiteho@redhat.com> Reviewed-by: Benjamin Marzinski <bmarzins@redhat.com>
2015-06-18GFS2: Don't add all glocks to the lruBob Peterson3-3/+7
The glocks used for resource groups often come and go hundreds of thousands of times per second. Adding them to the lru list just adds unnecessary contention for the lru_lock spin_lock, especially considering we're almost certainly going to re-use the glock and take it back off the lru microseconds later. We never want the glock shrinker to cull them anyway. This patch adds a new bit in the glops that determines which glock types get put onto the lru list and which ones don't. Signed-off-by: Bob Peterson <rpeterso@redhat.com> Acked-by: Steven Whitehouse <swhiteho@redhat.com>
2015-06-09gfs2: Don't support fallocate on jdata filesAbhi Das1-1/+1
We cannot provide an efficient implementation due to the headers on the data blocks, so there doesn't seem much point in having it. Signed-off-by: Abhi Das <adas@redhat.com> Signed-off-by: Bob Peterson <rpeterso@redhat.com>
2015-06-08gfs2: s64 cast for negative quota valueAbhi Das1-1/+1
One-line fix to cast quota value to s64 before comparison. By default the quantity is treated as u64. Signed-off-by: Abhi Das <adas@redhat.com> Signed-off-by: Bob Peterson <rpeterso@redhat.com> Acked-by: Steven Whitehouse <swhiteho@redhat.com>
2015-06-02gfs2: limit quota log messagesAbhi Das2-4/+12
This patch makes the quota subsystem only report once that a particular user/group has exceeded their allotted quota. Previously, it was possible for a program to continuously try exceeding quota (despite receiving EDQUOT) and in turn trigger gfs2 to issue a kernel log message about quota exceed. In theory, this could get out of hand and flood the log and the filesystem hosting the log files. Signed-off-by: Abhi Das <adas@redhat.com> Signed-off-by: Bob Peterson <rpeterso@redhat.com>
2015-06-02gfs2: fix quota updates on block boundariesAbhi Das2-80/+119
For smaller block sizes (512B, 1K, 2K), some quotas straddle block boundaries such that the usage value is on one block and the rest of the quota is on the previous block. In such cases, the value does not get updated correctly. This patch fixes that by addressing the boundary conditions correctly. This patch also adds a (s64) cast that was missing in a call to gfs2_quota_change() in inode.c Signed-off-by: Abhi Das <adas@redhat.com> Signed-off-by: Bob Peterson <rpeterso@redhat.com>
2015-06-02writeback: move bandwidth related fields from backing_dev_info into ↵Tejun Heo1-1/+1
bdi_writeback Currently, a bdi (backing_dev_info) embeds single wb (bdi_writeback) and the role of the separation is unclear. For cgroup support for writeback IOs, a bdi will be updated to host multiple wb's where each wb serves writeback IOs of a different cgroup on the bdi. To achieve that, a wb should carry all states necessary for servicing writeback IOs for a cgroup independently. This patch moves bandwidth related fields from backing_dev_info into bdi_writeback. * The moved fields are: bw_time_stamp, dirtied_stamp, written_stamp, write_bandwidth, avg_write_bandwidth, dirty_ratelimit, balanced_dirty_ratelimit, completions and dirty_exceeded. * writeback_chunk_size() and over_bground_thresh() now take @wb instead of @bdi. * bdi_writeout_fraction(bdi, ...) -> wb_writeout_fraction(wb, ...) bdi_dirty_limit(bdi, ...) -> wb_dirty_limit(wb, ...) bdi_position_ration(bdi, ...) -> wb_position_ratio(wb, ...) bdi_update_writebandwidth(bdi, ...) -> wb_update_write_bandwidth(wb, ...) [__]bdi_update_bandwidth(bdi, ...) -> [__]wb_update_bandwidth(wb, ...) bdi_{max|min}_pause(bdi, ...) -> wb_{max|min}_pause(wb, ...) bdi_dirty_limits(bdi, ...) -> wb_dirty_limits(wb, ...) * Init/exits of the relocated fields are moved to bdi_wb_init/exit() respectively. Note that explicit zeroing is dropped in the process as wb's are cleared in entirety anyway. * As there's still only one bdi_writeback per backing_dev_info, all uses of bdi->stat[] are mechanically replaced with bdi->wb.stat[] introducing no behavior changes. v2: Typo in description fixed as suggested by Jan. Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Jan Kara <jack@suse.cz> Cc: Jens Axboe <axboe@kernel.dk> Cc: Wu Fengguang <fengguang.wu@intel.com> Cc: Jaegeuk Kim <jaegeuk@kernel.org> Cc: Steven Whitehouse <swhiteho@redhat.com> Signed-off-by: Jens Axboe <axboe@fb.com>
2015-05-18gfs2: fix shadow warning in gfs2_rbm_find()Fabian Frederick1-3/+1
bi was already declared and initialized globally in gfs2_rbm_find() Signed-off-by: Fabian Frederick <fabf@skynet.be> Signed-off-by: Bob Peterson <rpeterso@redhat.com>
2015-05-10don't pass nameidata to ->follow_link()Al Viro1-1/+1
its only use is getting passed to nd_jump_link(), which can obtain it from current->nameidata Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2015-05-10new ->follow_link() and ->put_link() calling conventionsAl Viro1-5/+5
a) instead of storing the symlink body (via nd_set_link()) and returning an opaque pointer later passed to ->put_link(), ->follow_link() _stores_ that opaque pointer (into void * passed by address by caller) and returns the symlink body. Returning ERR_PTR() on error, NULL on jump (procfs magic symlinks) and pointer to symlink body for normal symlinks. Stored pointer is ignored in all cases except the last one. Storing NULL for opaque pointer (or not storing it at all) means no call of ->put_link(). b) the body used to be passed to ->put_link() implicitly (via nameidata). Now only the opaque pointer is. In the cases when we used the symlink body to free stuff, ->follow_link() now should store it as opaque pointer in addition to returning it. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>