summaryrefslogtreecommitdiffstats
path: root/fs
AgeCommit message (Collapse)AuthorFilesLines
2019-05-16afs: Don't invalidate callback if AFS_VNODE_DIR_VALID not setDavid Howells1-5/+2
Don't invalidate the callback promise on a directory if the AFS_VNODE_DIR_VALID flag is not set (which indicates that the directory contents are invalid, due to edit failure, callback break, page reclaim). The directory will be reloaded next time the directory is accessed, so clearing the callback flag at this point may race with a reload of the directory and cancel it's recorded callback promise. Fixes: f3ddee8dc4e2 ("afs: Fix directory handling") Signed-off-by: David Howells <dhowells@redhat.com>
2019-05-16afs: Fix order-1 allocation in afs_do_lookup()David Howells5-50/+45
afs_do_lookup() will do an order-1 allocation to allocate status records if there are more than 39 vnodes to stat. Fix this by allocating an array of {status,callback} records for each vnode we want to examine using vmalloc() if larger than a page. This not only gets rid of the order-1 allocation, but makes it easier to grow beyond 50 records for YFS servers. It also allows us to move to {status,callback} tuples for other calls too and makes it easier to lock across the application of the status and the callback to the vnode. Fixes: 5cf9dd55a0ec ("afs: Prospectively look up extra files when doing a single lookup") Signed-off-by: David Howells <dhowells@redhat.com>
2019-05-16afs: Fix calculation of callback expiry timeDavid Howells2-57/+51
Fix the calculation of the expiry time of a callback promise, as obtained from operations like FS.FetchStatus and FS.FetchData. The time should be based on the timestamp of the first DATA packet in the reply and the calculation needs to turn the ktime_t timestamp into a time64_t. Fixes: c435ee34551e ("afs: Overhaul the callback handling") Signed-off-by: David Howells <dhowells@redhat.com>
2019-05-16afs: Get rid of afs_call::reply[]David Howells10-303/+293
Replace the afs_call::reply[] array with a bunch of typed members so that the compiler can use type-checking on them. It's also easier for the eye to see what's going on. Signed-off-by: David Howells <dhowells@redhat.com>
2019-05-16afs: Make dynamic root population wait uninterruptibly for proc_cells_lockDavid Howells1-2/+1
Make dynamic root population wait uninterruptibly for proc_cells_lock. Signed-off-by: David Howells <dhowells@redhat.com>
2019-05-16afs: Don't pass the vnode pointer through into the inline bulk status opDavid Howells2-15/+2
Don't pass the vnode pointer through into the inline bulk status op. We want to process the status records outside of it anyway. Signed-off-by: David Howells <dhowells@redhat.com>
2019-05-16afs: Make some RPC operations non-interruptibleDavid Howells14-35/+99
Make certain RPC operations non-interruptible, including: (*) Set attributes (*) Store data We don't want to get interrupted during a flush on close, flush on unlock, writeback or an inode update, leaving us in a state where we still need to do the writeback or update. (*) Extend lock (*) Release lock We don't want to get lock extension interrupted as the file locks on the server are time-limited. Interruption during lock release is less of an issue since the lock is time-limited, but it's better to complete the release to avoid a several-minute wait to recover it. *Setting* the lock isn't a problem if it's interrupted since we can just return to the user and tell them they were interrupted - at which point they can elect to retry. (*) Silly unlink We want to remove silly unlink files if we can, rather than leaving them for the salvager to clear up. Note that whilst these calls are no longer interruptible, they do have timeouts on them, so if the server stops responding the call will fail with something like ETIME or ECONNRESET. Without this, the following: kAFS: Unexpected error from FS.StoreData -512 appears in dmesg when a pending store data gets interrupted and some processes may just hang. Additionally, make the code that checks/updates the server record ignore failure due to interruption if the main call is uninterruptible and if the server has an address list. The next op will check it again since the expiration time on the old list has past. Fixes: d2ddc776a458 ("afs: Overhaul volume and server record caching and fileserver rotation") Reported-by: Jonathan Billings <jsbillings@jsbillings.org> Reported-by: Marc Dionne <marc.dionne@auristor.com> Signed-off-by: David Howells <dhowells@redhat.com>
2019-05-16rxrpc: Allow the kernel to mark a call as being non-interruptibleDavid Howells1-0/+1
Allow kernel services using AF_RXRPC to indicate that a call should be non-interruptible. This allows kafs to make things like lock-extension and writeback data storage calls non-interruptible. If this is set, signals will be ignored for operations on that call where possible - such as waiting to get a call channel on an rxrpc connection. It doesn't prevent UDP sendmsg from being interrupted, but that will be handled by packet retransmission. rxrpc_kernel_recv_data() isn't affected by this since that never waits, preferring instead to return -EAGAIN and leave the waiting to the caller. Userspace initiated calls can't be set to be uninterruptible at this time. Signed-off-by: David Howells <dhowells@redhat.com>
2019-05-16afs: Fix error propagation from server record check/updateDavid Howells1-3/+3
afs_check/update_server_record() should be setting fc->error rather than fc->ac.error as they're called from within the cursor iteration function. afs_fs_cursor::error is where the error code of the attempt to call the operation on multiple servers is integrated and is the final result, whereas afs_addr_cursor::error is used to hold the error from individual iterations of the call loop. (Note there's also an afs_vl_cursor which also wraps afs_addr_cursor for accessing VL servers rather than file servers). Fix this by setting fc->error in the afs_check/update_server_record() so that any error incurred whilst talking to the VL server correctly propagates to the final result. This results in: kAFS: Unexpected error from FS.StoreData -512 being seen, even though the store-data op is non-interruptible. The error is actually coming from the server record update getting interrupted. Fixes: d2ddc776a458 ("afs: Overhaul volume and server record caching and fileserver rotation") Signed-off-by: David Howells <dhowells@redhat.com>
2019-05-16afs: Fix the maximum lifespan of VL and probe callsDavid Howells5-0/+13
If an older AFS server doesn't support an operation, it may accept the call and then sit on it forever, happily responding to pings that make kafs think that the call is still alive. Fix this by setting the maximum lifespan of Volume Location service calls in particular and probe calls in general so that they don't run on endlessly if they're not supported. Signed-off-by: David Howells <dhowells@redhat.com>
2019-05-16afs: Fix "kAFS: AFS vnode with undefined type 0"David Howells1-0/+2
Under some circumstances afs_select_fileserver() can return without setting an error in fc->error. The problem is in the no_more_servers segment where the accumulated errors from attempts to contact various servers are integrated into an afs_error-type variable 'e'. The resultant error code is, however, then abandoned. Fix this by getting the error out of e.error and putting it in 'error' so that the next part will store it into fc->error. Not doing this causes a report like the following: kAFS: AFS vnode with undefined type 0 kAFS: A=0 m=0 s=0 v=0 kAFS: vnode 20000025:1:1 because the code following the server selection loop then sees what it thinks is a successful invocation because fc.error is 0. However, it can't apply the status record because it's all zeros. The report is followed on the first instance with a trace looking something like: dump_stack+0x67/0x8e afs_inode_init_from_status.isra.2+0x21b/0x487 afs_fetch_status+0x119/0x1df afs_iget+0x130/0x295 afs_get_tree+0x31d/0x595 vfs_get_tree+0x1f/0xe8 fc_mount+0xe/0x36 afs_d_automount+0x328/0x3c3 follow_managed+0x109/0x20a lookup_fast+0x3bf/0x3f8 do_last+0xc3/0x6a4 path_openat+0x1af/0x236 do_filp_open+0x51/0xae ? _raw_spin_unlock+0x24/0x2d ? __alloc_fd+0x1a5/0x1b7 do_sys_open+0x13b/0x1e8 do_syscall_64+0x7d/0x1b3 entry_SYSCALL_64_after_hwframe+0x49/0xbe Fixes: 4584ae96ae30 ("afs: Fix missing net error handling") Signed-off-by: David Howells <dhowells@redhat.com>
2019-05-16io_uring: use wait_event_interruptible for cq_wait conditional waitJackie Liu1-15/+2
The previous patch has ensured that io_cqring_events contain smp_rmb memory barriers, Now we can use wait_event_interruptible to keep the code simple. Signed-off-by: Jackie Liu <liuyun01@kylinos.cn> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-05-16io_uring: adjust smp_rmb inside io_cqring_eventsJackie Liu1-4/+2
Whenever smp_rmb is required to use io_cqring_events, keep smp_rmb inside the function io_cqring_events. Signed-off-by: Jackie Liu <liuyun01@kylinos.cn> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-05-16io_uring: fix infinite wait in khread_park() on io_finish_async()Roman Penyaev1-7/+8
This fixes couple of races which lead to infinite wait of park completion with the following backtraces: [20801.303319] Call Trace: [20801.303321] ? __schedule+0x284/0x650 [20801.303323] schedule+0x33/0xc0 [20801.303324] schedule_timeout+0x1bc/0x210 [20801.303326] ? schedule+0x3d/0xc0 [20801.303327] ? schedule_timeout+0x1bc/0x210 [20801.303329] ? preempt_count_add+0x79/0xb0 [20801.303330] wait_for_completion+0xa5/0x120 [20801.303331] ? wake_up_q+0x70/0x70 [20801.303333] kthread_park+0x48/0x80 [20801.303335] io_finish_async+0x2c/0x70 [20801.303336] io_ring_ctx_wait_and_kill+0x95/0x180 [20801.303338] io_uring_release+0x1c/0x20 [20801.303339] __fput+0xad/0x210 [20801.303341] task_work_run+0x8f/0xb0 [20801.303342] exit_to_usermode_loop+0xa0/0xb0 [20801.303343] do_syscall_64+0xe0/0x100 [20801.303349] entry_SYSCALL_64_after_hwframe+0x44/0xa9 [20801.303380] Call Trace: [20801.303383] ? __schedule+0x284/0x650 [20801.303384] schedule+0x33/0xc0 [20801.303386] io_sq_thread+0x38a/0x410 [20801.303388] ? __switch_to_asm+0x40/0x70 [20801.303390] ? wait_woken+0x80/0x80 [20801.303392] ? _raw_spin_lock_irqsave+0x17/0x40 [20801.303394] ? io_submit_sqes+0x120/0x120 [20801.303395] kthread+0x112/0x130 [20801.303396] ? kthread_create_on_node+0x60/0x60 [20801.303398] ret_from_fork+0x35/0x40 o kthread_park() waits for park completion, so io_sq_thread() loop should check kthread_should_park() along with khread_should_stop(), otherwise if kthread_park() is called before prepare_to_wait() the following schedule() never returns: CPU#0 CPU#1 io_sq_thread_stop(): io_sq_thread(): while(!kthread_should_stop() && !ctx->sqo_stop) { ctx->sqo_stop = 1; kthread_park() prepare_to_wait(); if (kthread_should_stop() { } schedule(); <<< nobody checks park flag, <<< so schedule and never return o if the flag ctx->sqo_stop is observed by the io_sq_thread() loop it is quite possible, that kthread_should_park() check and the following kthread_parkme() is never called, because kthread_park() has not been yet called, but few moments later is is called and waits there for park completion, which never happens, because kthread has already exited: CPU#0 CPU#1 io_sq_thread_stop(): io_sq_thread(): ctx->sqo_stop = 1; while(!kthread_should_stop() && !ctx->sqo_stop) { <<< observe sqo_stop and exit the loop } if (kthread_should_park()) kthread_parkme(); <<< never called, since was <<< never parked kthread_park() <<< waits forever for park completion In the current patch we quit the loop by only kthread_should_park() check (kthread_park() is synchronous, so kthread_should_stop() is never observed), and we abandon ->sqo_stop flag, since it is racy. At the end of the io_sq_thread() we unconditionally call parmke(), since we've exited the loop by the park flag. Signed-off-by: Roman Penyaev <rpenyaev@suse.de> Cc: Jens Axboe <axboe@kernel.dk> Cc: linux-block@vger.kernel.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-05-16Btrfs: tree-checker: detect file extent items with overlapping rangesFilipe Manana1-4/+45
Having file extent items with ranges that overlap each other is a serious issue that leads to all sorts of corruptions and crashes (like a BUG_ON() during the course of __btrfs_drop_extents() when it traims file extent items). Therefore teach the tree checker to detect such cases. This is motivated by a recently fixed bug (race between ranged full fsync and writeback or adjacent ranges). Reviewed-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2019-05-16Btrfs: fix race between ranged fsync and writeback of adjacent rangesFilipe Manana1-0/+12
When we do a full fsync (the bit BTRFS_INODE_NEEDS_FULL_SYNC is set in the inode) that happens to be ranged, which happens during a msync() or writes for files opened with O_SYNC for example, we can end up with a corrupt log, due to different file extent items representing ranges that overlap with each other, or hit some assertion failures. When doing a ranged fsync we only flush delalloc and wait for ordered exents within that range. If while we are logging items from our inode ordered extents for adjacent ranges complete, we end up in a race that can make us insert the file extent items that overlap with others we logged previously and the assertion failures. For example, if tree-log.c:copy_items() receives a leaf that has the following file extents items, all with a length of 4K and therefore there is an implicit hole in the range 68K to 72K - 1: (257 EXTENT_ITEM 64K), (257 EXTENT_ITEM 72K), (257 EXTENT_ITEM 76K), ... It copies them to the log tree. However due to the need to detect implicit holes, it may release the path, in order to look at the previous leaf to detect an implicit hole, and then later it will search again in the tree for the first file extent item key, with the goal of locking again the leaf (which might have changed due to concurrent changes to other inodes). However when it locks again the leaf containing the first key, the key corresponding to the extent at offset 72K may not be there anymore since there is an ordered extent for that range that is finishing (that is, somewhere in the middle of btrfs_finish_ordered_io()), and it just removed the file extent item but has not yet replaced it with a new file extent item, so the part of copy_items() that does hole detection will decide that there is a hole in the range starting from 68K to 76K - 1, and therefore insert a file extent item to represent that hole, having a key offset of 68K. After that we now have a log tree with 2 different extent items that have overlapping ranges: 1) The file extent item copied before copy_items() released the path, which has a key offset of 72K and a length of 4K, representing the file range 72K to 76K - 1. 2) And a file extent item representing a hole that has a key offset of 68K and a length of 8K, representing the range 68K to 76K - 1. This item was inserted after releasing the path, and overlaps with the extent item inserted before. The overlapping extent items can cause all sorts of unpredictable and incorrect behaviour, either when replayed or if a fast (non full) fsync happens later, which can trigger a BUG_ON() when calling btrfs_set_item_key_safe() through __btrfs_drop_extents(), producing a trace like the following: [61666.783269] ------------[ cut here ]------------ [61666.783943] kernel BUG at fs/btrfs/ctree.c:3182! [61666.784644] invalid opcode: 0000 [#1] PREEMPT SMP (...) [61666.786253] task: ffff880117b88c40 task.stack: ffffc90008168000 [61666.786253] RIP: 0010:btrfs_set_item_key_safe+0x7c/0xd2 [btrfs] [61666.786253] RSP: 0018:ffffc9000816b958 EFLAGS: 00010246 [61666.786253] RAX: 0000000000000000 RBX: 000000000000000f RCX: 0000000000030000 [61666.786253] RDX: 0000000000000000 RSI: ffffc9000816ba4f RDI: ffffc9000816b937 [61666.786253] RBP: ffffc9000816b998 R08: ffff88011dae2428 R09: 0000000000001000 [61666.786253] R10: 0000160000000000 R11: 6db6db6db6db6db7 R12: ffff88011dae2418 [61666.786253] R13: ffffc9000816ba4f R14: ffff8801e10c4118 R15: ffff8801e715c000 [61666.786253] FS: 00007f6060a18700(0000) GS:ffff88023f5c0000(0000) knlGS:0000000000000000 [61666.786253] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [61666.786253] CR2: 00007f6060a28000 CR3: 0000000213e69000 CR4: 00000000000006e0 [61666.786253] Call Trace: [61666.786253] __btrfs_drop_extents+0x5e3/0xaad [btrfs] [61666.786253] ? time_hardirqs_on+0x9/0x14 [61666.786253] btrfs_log_changed_extents+0x294/0x4e0 [btrfs] [61666.786253] ? release_extent_buffer+0x38/0xb4 [btrfs] [61666.786253] btrfs_log_inode+0xb6e/0xcdc [btrfs] [61666.786253] ? lock_acquire+0x131/0x1c5 [61666.786253] ? btrfs_log_inode_parent+0xee/0x659 [btrfs] [61666.786253] ? arch_local_irq_save+0x9/0xc [61666.786253] ? btrfs_log_inode_parent+0x1f5/0x659 [btrfs] [61666.786253] btrfs_log_inode_parent+0x223/0x659 [btrfs] [61666.786253] ? arch_local_irq_save+0x9/0xc [61666.786253] ? lockref_get_not_zero+0x2c/0x34 [61666.786253] ? rcu_read_unlock+0x3e/0x5d [61666.786253] btrfs_log_dentry_safe+0x60/0x7b [btrfs] [61666.786253] btrfs_sync_file+0x317/0x42c [btrfs] [61666.786253] vfs_fsync_range+0x8c/0x9e [61666.786253] SyS_msync+0x13c/0x1c9 [61666.786253] entry_SYSCALL_64_fastpath+0x18/0xad A sample of a corrupt log tree leaf with overlapping extents I got from running btrfs/072: item 14 key (295 108 200704) itemoff 2599 itemsize 53 extent data disk bytenr 0 nr 0 extent data offset 0 nr 458752 ram 458752 item 15 key (295 108 659456) itemoff 2546 itemsize 53 extent data disk bytenr 4343541760 nr 770048 extent data offset 606208 nr 163840 ram 770048 item 16 key (295 108 663552) itemoff 2493 itemsize 53 extent data disk bytenr 4343541760 nr 770048 extent data offset 610304 nr 155648 ram 770048 item 17 key (295 108 819200) itemoff 2440 itemsize 53 extent data disk bytenr 4334788608 nr 4096 extent data offset 0 nr 4096 ram 4096 The file extent item at offset 659456 (item 15) ends at offset 823296 (659456 + 163840) while the next file extent item (item 16) starts at offset 663552. Another different problem that the race can trigger is a failure in the assertions at tree-log.c:copy_items(), which expect that the first file extent item key we found before releasing the path exists after we have released path and that the last key we found before releasing the path also exists after releasing the path: $ cat -n fs/btrfs/tree-log.c 4080 if (need_find_last_extent) { 4081 /* btrfs_prev_leaf could return 1 without releasing the path */ 4082 btrfs_release_path(src_path); 4083 ret = btrfs_search_slot(NULL, inode->root, &first_key, 4084 src_path, 0, 0); 4085 if (ret < 0) 4086 return ret; 4087 ASSERT(ret == 0); (...) 4103 if (i >= btrfs_header_nritems(src_path->nodes[0])) { 4104 ret = btrfs_next_leaf(inode->root, src_path); 4105 if (ret < 0) 4106 return ret; 4107 ASSERT(ret == 0); 4108 src = src_path->nodes[0]; 4109 i = 0; 4110 need_find_last_extent = true; 4111 } (...) The second assertion implicitly expects that the last key before the path release still exists, because the surrounding while loop only stops after we have found that key. When this assertion fails it produces a stack like this: [139590.037075] assertion failed: ret == 0, file: fs/btrfs/tree-log.c, line: 4107 [139590.037406] ------------[ cut here ]------------ [139590.037707] kernel BUG at fs/btrfs/ctree.h:3546! [139590.038034] invalid opcode: 0000 [#1] SMP DEBUG_PAGEALLOC PTI [139590.038340] CPU: 1 PID: 31841 Comm: fsstress Tainted: G W 5.0.0-btrfs-next-46 #1 (...) [139590.039354] RIP: 0010:assfail.constprop.24+0x18/0x1a [btrfs] (...) [139590.040397] RSP: 0018:ffffa27f48f2b9b0 EFLAGS: 00010282 [139590.040730] RAX: 0000000000000041 RBX: ffff897c635d92c8 RCX: 0000000000000000 [139590.041105] RDX: 0000000000000000 RSI: ffff897d36a96868 RDI: ffff897d36a96868 [139590.041470] RBP: ffff897d1b9a0708 R08: 0000000000000000 R09: 0000000000000000 [139590.041815] R10: 0000000000000008 R11: 0000000000000000 R12: 0000000000000013 [139590.042159] R13: 0000000000000227 R14: ffff897cffcbba88 R15: 0000000000000001 [139590.042501] FS: 00007f2efc8dee80(0000) GS:ffff897d36a80000(0000) knlGS:0000000000000000 [139590.042847] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [139590.043199] CR2: 00007f8c064935e0 CR3: 0000000232252002 CR4: 00000000003606e0 [139590.043547] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [139590.043899] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [139590.044250] Call Trace: [139590.044631] copy_items+0xa3f/0x1000 [btrfs] [139590.045009] ? generic_bin_search.constprop.32+0x61/0x200 [btrfs] [139590.045396] btrfs_log_inode+0x7b3/0xd70 [btrfs] [139590.045773] btrfs_log_inode_parent+0x2b3/0xce0 [btrfs] [139590.046143] ? do_raw_spin_unlock+0x49/0xc0 [139590.046510] btrfs_log_dentry_safe+0x4a/0x70 [btrfs] [139590.046872] btrfs_sync_file+0x3b6/0x440 [btrfs] [139590.047243] btrfs_file_write_iter+0x45b/0x5c0 [btrfs] [139590.047592] __vfs_write+0x129/0x1c0 [139590.047932] vfs_write+0xc2/0x1b0 [139590.048270] ksys_write+0x55/0xc0 [139590.048608] do_syscall_64+0x60/0x1b0 [139590.048946] entry_SYSCALL_64_after_hwframe+0x49/0xbe [139590.049287] RIP: 0033:0x7f2efc4be190 (...) [139590.050342] RSP: 002b:00007ffe743243a8 EFLAGS: 00000246 ORIG_RAX: 0000000000000001 [139590.050701] RAX: ffffffffffffffda RBX: 0000000000008d58 RCX: 00007f2efc4be190 [139590.051067] RDX: 0000000000008d58 RSI: 00005567eca0f370 RDI: 0000000000000003 [139590.051459] RBP: 0000000000000024 R08: 0000000000000003 R09: 0000000000008d60 [139590.051863] R10: 0000000000000078 R11: 0000000000000246 R12: 0000000000000003 [139590.052252] R13: 00000000003d3507 R14: 00005567eca0f370 R15: 0000000000000000 (...) [139590.055128] ---[ end trace 193f35d0215cdeeb ]--- So fix this race between a full ranged fsync and writeback of adjacent ranges by flushing all delalloc and waiting for all ordered extents to complete before logging the inode. This is the simplest way to solve the problem because currently the full fsync path does not deal with ranges at all (it assumes a full range from 0 to LLONG_MAX) and it always needs to look at adjacent ranges for hole detection. For use cases of ranged fsyncs this can make a few fsyncs slower but on the other hand it can make some following fsyncs to other ranges do less work or no need to do anything at all. A full fsync is rare anyway and happens only once after loading/creating an inode and once after less common operations such as a shrinking truncate. This is an issue that exists for a long time, and was often triggered by generic/127, because it does mmap'ed writes and msync (which triggers a ranged fsync). Adding support for the tree checker to detect overlapping extents (next patch in the series) and trigger a WARN() when such cases are found, and then calling btrfs_check_leaf_full() at the end of btrfs_insert_file_extent() made the issue much easier to detect. Running btrfs/072 with that change to the tree checker and making fsstress open files always with O_SYNC made it much easier to trigger the issue (as triggering it with generic/127 is very rare). CC: stable@vger.kernel.org # 3.16+ Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2019-05-16Btrfs: avoid fallback to transaction commit during fsync of files with holesFilipe Manana1-0/+1
When we are doing a full fsync (bit BTRFS_INODE_NEEDS_FULL_SYNC set) of a file that has holes and has file extent items spanning two or more leafs, we can end up falling to back to a full transaction commit due to a logic bug that leads to failure to insert a duplicate file extent item that is meant to represent a hole between the last file extent item of a leaf and the first file extent item in the next leaf. The failure (EEXIST error) leads to a transaction commit (as most errors when logging an inode do). For example, we have the two following leafs: Leaf N: ----------------------------------------------- | ..., ..., ..., (257, FILE_EXTENT_ITEM, 64K) | ----------------------------------------------- The file extent item at the end of leaf N has a length of 4Kb, representing the file range from 64K to 68K - 1. Leaf N + 1: ----------------------------------------------- | (257, FILE_EXTENT_ITEM, 72K), ..., ..., ... | ----------------------------------------------- The file extent item at the first slot of leaf N + 1 has a length of 4Kb too, representing the file range from 72K to 76K - 1. During the full fsync path, when we are at tree-log.c:copy_items() with leaf N as a parameter, after processing the last file extent item, that represents the extent at offset 64K, we take a look at the first file extent item at the next leaf (leaf N + 1), and notice there's a 4K hole between the two extents, and therefore we insert a file extent item representing that hole, starting at file offset 68K and ending at offset 72K - 1. However we don't update the value of *last_extent, which is used to represent the end offset (plus 1, non-inclusive end) of the last file extent item inserted in the log, so it stays with a value of 68K and not with a value of 72K. Then, when copy_items() is called for leaf N + 1, because the value of *last_extent is smaller then the offset of the first extent item in the leaf (68K < 72K), we look at the last file extent item in the previous leaf (leaf N) and see it there's a 4K gap between it and our first file extent item (again, 68K < 72K), so we decide to insert a file extent item representing the hole, starting at file offset 68K and ending at offset 72K - 1, this insertion will fail with -EEXIST being returned from btrfs_insert_file_extent() because we already inserted a file extent item representing a hole for this offset (68K) in the previous call to copy_items(), when processing leaf N. The -EEXIST error gets propagated to the fsync callback, btrfs_sync_file(), which falls back to a full transaction commit. Fix this by adjusting *last_extent after inserting a hole when we had to look at the next leaf. Fixes: 4ee3fad34a9c ("Btrfs: fix fsync after hole punching when using no-holes feature") Cc: stable@vger.kernel.org # 4.14+ Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2019-05-16btrfs: extent-tree: Fix a bug that btrfs is unable to add pinned bytesQu Wenruo1-5/+7
Commit ddf30cf03fb5 ("btrfs: extent-tree: Use btrfs_ref to refactor add_pinned_bytes()") refactored add_pinned_bytes(), but during that refactor, there are two callers which add the pinned bytes instead of subtracting. That refactor misses those two caller, causing incorrect pinned bytes calculation and resulting unexpected ENOSPC error. Fix it by adding a new parameter @sign to restore the original behavior. Reported-by: kernel test robot <rong.a.chen@intel.com> Fixes: ddf30cf03fb5 ("btrfs: extent-tree: Use btrfs_ref to refactor add_pinned_bytes()") Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2019-05-16btrfs: sysfs: don't leak memory when failing add fsidTobin C. Harding1-1/+6
A failed call to kobject_init_and_add() must be followed by a call to kobject_put(). Currently in the error path when adding fs_devices we are missing this call. This could be fixed by calling btrfs_sysfs_remove_fsid() if btrfs_sysfs_add_fsid() returns an error or by adding a call to kobject_put() directly in btrfs_sysfs_add_fsid(). Here we choose the second option because it prevents the slightly unusual error path handling requirements of kobject from leaking out into btrfs functions. Add a call to kobject_put() in the error path of kobject_add_and_init(). This causes the release method to be called if kobject_init_and_add() fails. open_tree() is the function that calls btrfs_sysfs_add_fsid() and the error code in this function is already written with the assumption that the release method is called during the error path of open_tree() (as seen by the call to btrfs_sysfs_remove_fsid() under the fail_fsdev_sysfs label). Cc: stable@vger.kernel.org # v4.4+ Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Tobin C. Harding <tobin@kernel.org> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2019-05-16btrfs: sysfs: Fix error path kobject memory leakTobin C. Harding1-2/+1
If a call to kobject_init_and_add() fails we must call kobject_put() otherwise we leak memory. Calling kobject_put() when kobject_init_and_add() fails drops the refcount back to 0 and calls the ktype release method (which in turn calls the percpu destroy and kfree). Add call to kobject_put() in the error path of call to kobject_init_and_add(). Cc: stable@vger.kernel.org # v4.4+ Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Tobin C. Harding <tobin@kernel.org> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2019-05-16afs: Fix cell DNS lookupDavid Howells3-75/+130
Currently, once configured, AFS cells are looked up in the DNS at regular intervals - which is a waste of resources if those cells aren't being used. It also leads to a problem where cells preloaded, but not configured, before the network is brought up end up effectively statically configured with no VL servers and are unable to get any. Fix this by not doing the DNS lookup until the first time a cell is touched. It is waited for if we don't have any cached records yet, otherwise the DNS lookup to maintain the record is done in the background. This has the downside that the first time you touch a cell, you now have to wait for the upcall to do the required DNS lookups rather than them already being cached. Further, the record is not replaced if the old record has at least one server in it and the new record doesn't have any. Fixes: 0a5143f2f89c ("afs: Implement VL server rotation") Signed-off-by: David Howells <dhowells@redhat.com>
2019-05-15cifs: add support for SEEK_DATA and SEEK_HOLERonnie Sahlberg3-0/+99
Add llseek op for SEEK_DATA and SEEK_HOLE. Improves xfstests/285,286,436,445,448 and 490 Signed-off-by: Ronnie Sahlberg <lsahlber@redhat.com> Signed-off-by: Steve French <stfrench@microsoft.com>
2019-05-15Fixed https://bugzilla.kernel.org/show_bug.cgi?id=202935 allow write on the ↵Kovtunenko Oleksandr1-5/+0
same file Copychunk allows source and target to be on the same file. For details on restrictions see MS-SMB2 3.3.5.15.6 Signed-off-by: Kovtunenko Oleksandr <alexander198961@gmail.com> Signed-off-by: Steve French <stfrench@microsoft.com>
2019-05-15cifs: Allocate memory for all iovs in smb2_ioctlLong Li1-2/+19
An IOCTL uses up to 2 iovs. The 1st iov is the command itself, the 2nd iov is optional data for that command. The 1st iov is always allocated on the heap but the 2nd iov may point to a variable on the stack. This will trigger an error when passing the 2nd iov for RDMA I/O. Fix this by allocating a buffer for the 2nd iov. Signed-off-by: Long Li <longli@microsoft.com> Signed-off-by: Steve French <stfrench@microsoft.com> Reviewed-by: Pavel Shilovsky <pshilov@microsoft.com> Reviewed-by: Ronnie sahlberg <lsahlber@redhat.com>
2019-05-15cifs: Don't match port on SMBDirect transportLong Li1-0/+4
SMBDirect manages its own ports in the transport layer, there is no need to check the port to find a connection. Signed-off-by: Long Li <longli@microsoft.com> Signed-off-by: Steve French <stfrench@microsoft.com> Reviewed-by: Ronnie sahlberg <lsahlber@redhat.com>
2019-05-15Merge tag 'nfsd-5.2' of git://linux-nfs.org/~bfields/linuxLinus Torvalds23-196/+818
Pull nfsd updates from Bruce Fields: "This consists mostly of nfsd container work: Scott Mayhew revived an old api that communicates with a userspace daemon to manage some on-disk state that's used to track clients across server reboots. We've been using a usermode_helper upcall for that, but it's tough to run those with the right namespaces, so a daemon is much friendlier to container use cases. Trond fixed nfsd's handling of user credentials in user namespaces. He also contributed patches that allow containers to support different sets of NFS protocol versions. The only remaining container bug I'm aware of is that the NFS reply cache is shared between all containers. If anyone's aware of other gaps in our container support, let me know. The rest of this is miscellaneous bugfixes" * tag 'nfsd-5.2' of git://linux-nfs.org/~bfields/linux: (23 commits) nfsd: update callback done processing locks: move checks from locks_free_lock() to locks_release_private() nfsd: fh_drop_write in nfsd_unlink nfsd: allow fh_want_write to be called twice nfsd: knfsd must use the container user namespace SUNRPC: rsi_parse() should use the current user namespace SUNRPC: Fix the server AUTH_UNIX userspace mappings lockd: Pass the user cred from knfsd when starting the lockd server SUNRPC: Temporary sockets should inherit the cred from their parent SUNRPC: Cache the process user cred in the RPC server listener nfsd: Allow containers to set supported nfs versions nfsd: Add custom rpcbind callbacks for knfsd SUNRPC: Allow further customisation of RPC program registration SUNRPC: Clean up generic dispatcher code SUNRPC: Add a callback to initialise server requests SUNRPC/nfs: Fix return value for nfs4_callback_compound() nfsd: handle legacy client tracking records sent by nfsdcld nfsd: re-order client tracking method selection nfsd: keep a tally of RECLAIM_COMPLETE operations when using nfsdcld nfsd: un-deprecate nfsdcld ...
2019-05-15ubifs: Convert xattr inum to host orderRichard Weinberger1-1/+1
UBIFS stores inode numbers as LE64 integers. We have to convert them to host oder, otherwise BE hosts won't be able to use the integer correctly. Reported-by: kbuild test robot <lkp@intel.com> Fixes: 9ca2d7326444 ("ubifs: Limit number of xattrs per inode") Signed-off-by: Richard Weinberger <richard@nod.at>
2019-05-15ubifs: Use correct config name for encryptionRichard Weinberger1-2/+2
CONFIG_UBIFS_FS_ENCRYPTION is gone, fscrypt is now controlled via CONFIG_FS_ENCRYPTION. This problem slipped into the tree because of a mis-merge on my side. Reported-by: Eric Biggers <ebiggers@kernel.org> Fixes: eea2c05d927b ("ubifs: Remove #ifdef around CONFIG_FS_ENCRYPTION") Signed-off-by: Richard Weinberger <richard@nod.at>
2019-05-15ubifs: Fix build error without CONFIG_UBIFS_FS_XATTRYueHaibing1-1/+5
Fix gcc build error while CONFIG_UBIFS_FS_XATTR is not set fs/ubifs/dir.o: In function `ubifs_unlink': dir.c:(.text+0x260): undefined reference to `ubifs_purge_xattrs' fs/ubifs/dir.o: In function `do_rename': dir.c:(.text+0x1edc): undefined reference to `ubifs_purge_xattrs' fs/ubifs/dir.o: In function `ubifs_rmdir': dir.c:(.text+0x2638): undefined reference to `ubifs_purge_xattrs' Reported-by: Hulk Robot <hulkci@huawei.com> Fixes: 9ca2d7326444 ("ubifs: Limit number of xattrs per inode") Signed-off-by: YueHaibing <yuehaibing@huawei.com> Signed-off-by: Richard Weinberger <richard@nod.at>
2019-05-15io_uring: remove 'ev_flags' argumentJens Axboe1-14/+14
We always pass in 0 for the cqe flags argument, since the support for "this read hit page cache" hint was dropped. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-05-15dns_resolver: Allow used keys to be invalidatedDavid Howells4-4/+4
Allow used DNS resolver keys to be invalidated after use if the caller is doing its own caching of the results. This reduces the amount of resources required. Fix AFS to invalidate DNS results to kill off permanent failure records that get lodged in the resolver keyring and prevent future lookups from happening. Fixes: 0a5143f2f89c ("afs: Implement VL server rotation") Signed-off-by: David Howells <dhowells@redhat.com>
2019-05-15afs: Fix afs_cell records to always have a VL server list recordDavid Howells4-24/+25
Fix it such that afs_cell records always have a VL server list record attached, even if it's a dummy one, so that various checks can be removed. Signed-off-by: David Howells <dhowells@redhat.com>
2019-05-15afs: Fix missing lock when replacing VL server listDavid Howells1-3/+2
When afs_update_cell() replaces the cell->vl_servers list, it uses RCU protocol so that proc is protected, but doesn't take ->vl_servers_lock to protect afs_start_vl_iteration() (which does actually take a shared lock). Fix this by making afs_update_cell() take an exclusive lock when replacing ->vl_servers. Fixes: 0a5143f2f89c ("afs: Implement VL server rotation") Signed-off-by: David Howells <dhowells@redhat.com>
2019-05-15afs: Fix afs_xattr_get_yfs() to not try freeing an error valueDavid Howells3-64/+53
afs_xattr_get_yfs() tries to free yacl, which may hold an error value (say if yfs_fs_fetch_opaque_acl() failed and returned an error). Fix this by allocating yacl up front (since it's a fixed-length struct, unlike afs_acl) and passing it in to the RPC function. This also allows the flags to be placed in the object rather than passing them through to the RPC function. Fixes: ae46578b963f ("afs: Get YFS ACLs and information through xattrs") Signed-off-by: David Howells <dhowells@redhat.com>
2019-05-15afs: Fix incorrect error handling in afs_xattr_get_acl()David Howells1-5/+4
Fix incorrect error handling in afs_xattr_get_acl() where there appears to be a redundant assignment before return, but in fact the return should be a goto to the error handling at the end of the function. Fixes: 260f082bae6d ("afs: Get an AFS3 ACL as an xattr") Addresses-Coverity: ("Unused Value") Reported-by: Colin Ian King <colin.king@canonical.com> Signed-off-by: David Howells <dhowells@redhat.com> cc: Joe Perches <joe@perches.com>
2019-05-15afs: Fix key leak in afs_release() and afs_evict_inode()David Howells2-3/+5
Fix afs_release() to go through the cleanup part of the function if FMODE_WRITE is set rather than exiting through vfs_fsync() (which skips the cleanup). The cleanup involves discarding the refs on the key used for file ops and the writeback key record. Also fix afs_evict_inode() to clean up any left over wb keys attached to the inode/vnode when it is removed. Fixes: 5a8132761609 ("afs: Do better accretion of small writes on newly created content") Signed-off-by: David Howells <dhowells@redhat.com>
2019-05-15ext4: fix block validity checks for journal inodes using indirect blocksTheodore Ts'o1-0/+5
Commit 345c0dbf3a30 ("ext4: protect journal inode's blocks using block_validity") failed to add an exception for the journal inode in ext4_check_blockref(), which is the function used by ext4_get_branch() for indirect blocks. This caused attempts to read from the ext3-style journals to fail with: [ 848.968550] EXT4-fs error (device sdb7): ext4_get_branch:171: inode #8: block 30343695: comm jbd2/sdb7-8: invalid block Fix this by adding the missing exception check. Fixes: 345c0dbf3a30 ("ext4: protect journal inode's blocks using block_validity") Reported-by: Arthur Marsh <arthur.marsh@internode.on.net> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2019-05-14fs/block_dev.c: Remove duplicate headerSabyasachi Gupta1-1/+0
linux/dax.h is included more than once. Link: http://lkml.kernel.org/r/5c867e95.1c69fb81.4f15a.e5e4@mx.google.com Signed-off-by: Sabyasachi Gupta <sabyasachi.linux@gmail.com> Acked-by: Souptick Joarder <jrdr.linux@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-05-14fs/cachefiles/namei.c: remove duplicate headerSabyasachi Gupta1-1/+0
linux/xattr.h is included more than once. Link: http://lkml.kernel.org/r/5c86803d.1c69fb81.1a7c6.2b78@mx.google.com Signed-off-by: Sabyasachi Gupta <sabyasachi.linux@gmail.com> Acked-by: Souptick Joarder <jrdr.linux@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-05-14fs/coda/psdev.c: remove duplicate headerSabyasachi Gupta1-1/+0
linux/poll.h is included more than once. Link: http://lkml.kernel.org/r/5c86820f.1c69fb81.149f0.0834@mx.google.com Signed-off-by: Sabyasachi Gupta <sabyasachi.linux@gmail.com> Acked-by: Souptick Joarder <jrdr.linux@gmail.com> Cc: Jan Harkes <jaharkes@cs.cmu.edu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-05-14fs/eventfd.c: make eventfd_ida staticYueHaibing1-1/+1
Fix sparse warning: fs/eventfd.c:26:1: warning: symbol 'eventfd_ida' was not declared. Should it be static? Link: http://lkml.kernel.org/r/20190413142348.34716-1-yuehaibing@huawei.com Signed-off-by: YueHaibing <yuehaibing@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-05-14eventfd: present id to userspace via fdinfoMasatake YAMATO1-0/+8
Finding endpoints of an IPC channel is one of essential task to understand how a user program works. Procfs and netlink socket provide enough hints to find endpoints for IPC channels like pipes, unix sockets, and pseudo terminals. However, there is no simple way to find endpoints for an eventfd file from userland. An inode number doesn't hint. Unlike pipe, all eventfd files share the same inode object. To provide the way to find endpoints of an eventfd file, this patch adds "eventfd-id" field to /proc/PID/fdinfo of eventfd as identifier. Integers managed by an IDA are used as ids. A tool like lsof can utilize the information to print endpoints. Link: http://lkml.kernel.org/r/20190327181823.20222-1-yamato@redhat.com Signed-off-by: Masatake YAMATO <yamato@redhat.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Kees Cook <keescook@chromium.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-05-14fs/exec.c: move ->recursion_depth out of critical sectionsAlexey Dobriyan1-1/+3
->recursion_depth is changed only by current, therefore decrementing can be done without taking any locks. Link: http://lkml.kernel.org/r/20190417213150.GA26474@avx2 Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-05-14fs/fat/file.c: issue flush after the writeback of FATHou Tao1-3/+8
fsync() needs to make sure the data & meta-data of file are persistent after the return of fsync(), even when a power-failure occurs later. In the case of fat-fs, the FAT belongs to the meta-data of file, so we need to issue a flush after the writeback of FAT instead before. Also bail out early when any stage of fsync fails. Link: http://lkml.kernel.org/r/20190409030158.136316-1-houtao1@huawei.com Signed-off-by: Hou Tao <houtao1@huawei.com> Acked-by: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Jan Kara <jack@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-05-14reiserfs: add comment to explain endianness issue in xattr_hashBharath Vedartham1-0/+9
csum_partial() gives different results for little-endian and big-endian hosts. This causes images created on little-endian hosts and mounted on big endian hosts to see csum mismatches. This causes an endianness bug. Sparse gives a warning as csum_partial returns a restricted integer type __wsum_t and xattr_hash expects __u32. This warning acts as a reminder for this bug and should not be suppressed. This comment aims to convey these endianness issues. [akpm@linux-foundation.org: coding-style fixes] Link: http://lkml.kernel.org/r/20190423161831.GA15387@bharath12345-Inspiron-5559 Signed-off-by: Bharath Vedartham <linux.bhar@gmail.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Jann Horn <jannh@google.com> Cc: Jeff Mahoney <jeffm@suse.com> Cc: Jan Kara <jack@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-05-14binfmt_elf: move brk out of mmap when doing direct loader execKees Cook1-0/+11
Commmit eab09532d400 ("binfmt_elf: use ELF_ET_DYN_BASE only for PIE"), made changes in the rare case when the ELF loader was directly invoked (e.g to set a non-inheritable LD_LIBRARY_PATH, testing new versions of the loader), by moving into the mmap region to avoid both ET_EXEC and PIE binaries. This had the effect of also moving the brk region into mmap, which could lead to the stack and brk being arbitrarily close to each other. An unlucky process wouldn't get its requested stack size and stack allocations could end up scribbling on the heap. This is illustrated here. In the case of using the loader directly, brk (so helpfully identified as "[heap]") is allocated with the _loader_ not the binary. For example, with ASLR entirely disabled, you can see this more clearly: $ /bin/cat /proc/self/maps 555555554000-55555555c000 r-xp 00000000 ... /bin/cat 55555575b000-55555575c000 r--p 00007000 ... /bin/cat 55555575c000-55555575d000 rw-p 00008000 ... /bin/cat 55555575d000-55555577e000 rw-p 00000000 ... [heap] ... 7ffff7ff7000-7ffff7ffa000 r--p 00000000 ... [vvar] 7ffff7ffa000-7ffff7ffc000 r-xp 00000000 ... [vdso] 7ffff7ffc000-7ffff7ffd000 r--p 00027000 ... /lib/x86_64-linux-gnu/ld-2.27.so 7ffff7ffd000-7ffff7ffe000 rw-p 00028000 ... /lib/x86_64-linux-gnu/ld-2.27.so 7ffff7ffe000-7ffff7fff000 rw-p 00000000 ... 7ffffffde000-7ffffffff000 rw-p 00000000 ... [stack] $ /lib/x86_64-linux-gnu/ld-2.27.so /bin/cat /proc/self/maps ... 7ffff7bcc000-7ffff7bd4000 r-xp 00000000 ... /bin/cat 7ffff7bd4000-7ffff7dd3000 ---p 00008000 ... /bin/cat 7ffff7dd3000-7ffff7dd4000 r--p 00007000 ... /bin/cat 7ffff7dd4000-7ffff7dd5000 rw-p 00008000 ... /bin/cat 7ffff7dd5000-7ffff7dfc000 r-xp 00000000 ... /lib/x86_64-linux-gnu/ld-2.27.so 7ffff7fb2000-7ffff7fd6000 rw-p 00000000 ... 7ffff7ff7000-7ffff7ffa000 r--p 00000000 ... [vvar] 7ffff7ffa000-7ffff7ffc000 r-xp 00000000 ... [vdso] 7ffff7ffc000-7ffff7ffd000 r--p 00027000 ... /lib/x86_64-linux-gnu/ld-2.27.so 7ffff7ffd000-7ffff7ffe000 rw-p 00028000 ... /lib/x86_64-linux-gnu/ld-2.27.so 7ffff7ffe000-7ffff8020000 rw-p 00000000 ... [heap] 7ffffffde000-7ffffffff000 rw-p 00000000 ... [stack] The solution is to move brk out of mmap and into ELF_ET_DYN_BASE since nothing is there in the direct loader case (and ET_EXEC is still far away at 0x400000). Anything that ran before should still work (i.e. the ultimately-launched binary already had the brk very far from its text, so this should be no different from a COMPAT_BRK standpoint). The only risk I see here is that if someone started to suddenly depend on the entire memory space lower than the mmap region being available when launching binaries via a direct loader execs which seems highly unlikely, I'd hope: this would mean a binary would _not_ work when exec()ed normally. (Note that this is only done under CONFIG_ARCH_HAS_ELF_RANDOMIZATION when randomization is turned on.) Link: http://lkml.kernel.org/r/20190422225727.GA21011@beast Link: https://lkml.kernel.org/r/CAGXu5jJ5sj3emOT2QPxQkNQk0qbU6zEfu9=Omfhx_p0nCKPSjA@mail.gmail.com Fixes: eab09532d400 ("binfmt_elf: use ELF_ET_DYN_BASE only for PIE") Signed-off-by: Kees Cook <keescook@chromium.org> Reported-by: Ali Saidi <alisaidi@amazon.com> Cc: Ali Saidi <alisaidi@amazon.com> Cc: Guenter Roeck <linux@roeck-us.net> Cc: Michal Hocko <mhocko@suse.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Jann Horn <jannh@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-05-14elf: init pt_regs pointer laterAlexey Dobriyan1-1/+2
Get "current_pt_regs" pointer right before usage. Space savings on x86_64: add/remove: 0/0 grow/shrink: 0/1 up/down: 0/-180 (-180) Function old new delta load_elf_binary 5806 5626 -180 !!! Looks like the compiler doesn't know that "current_pt_regs" is stable pointer (because it doesn't know ->stack isn't) even though it knows that "current" is stable pointer. So it saves it in the very beginning and then tries to carry it through a lot of code. Here is what happens here: load_elf_binary() ... mov rax,QWORD PTR gs:0x14c00 mov r13,QWORD PTR [rax+0x18] r13 = current->stack call kmem_cache_alloc # first kmalloc [980 bytes later!] # let's spill that sucker because we need a register # for "load_bias" calculations at # # if (interpreter) { # load_bias = ELF_ET_DYN_BASE; # if (current->flags & PF_RANDOMIZE) # load_bias += arch_mmap_rnd(); # elf_flags |= elf_fixed; # } mov QWORD PTR [rsp+0x68],r13 If this is not _the_ root cause it is still eeeeh. After the patch things become much simpler: mov rax, QWORD PTR gs:0x14c00 # current mov rdx, QWORD PTR [rax+0x18] # current->stack movq [rdx+0x3fb8], 0 # fill pt_regs ... call finalize_exec Link: http://lkml.kernel.org/r/20190419200343.GA19788@avx2 Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Tested-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-05-14fs/binfmt_elf.c: extract PROT_* calculationsAlexey Dobriyan1-14/+16
There are two places where mapping protections are calculated: one for executable, another one for interpreter -- take them out. ELF read and execute permissions are interchanged with Linux PROT_READ and PROT_EXEC, microoptimizations are welcome! Link: http://lkml.kernel.org/r/20190417213413.GB26474@avx2 Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-05-14fs//binfmt_elf.c: move variables initialization closer to their usageAlexey Dobriyan1-8/+8
Link: http://lkml.kernel.org/r/20190416202002.GB24304@avx2 Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-05-14fs/binfmt_elf.c: save 1 indent levelAlexey Dobriyan1-57/+54
Rewrite for (...) { if (->p_type == PT_INTERP) { ... break; } } loop into for (...) { if (->p_type != PT_INTERP) continue; ... break; } Link: http://lkml.kernel.org/r/20190416201906.GA24304@avx2 Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>