path: root/fs/fuse/dev.c
2019-09-24  fuse: Make fuse_args_to_req static  (YueHaibing)  [1 file, -1/+1]
Fix sparse warning:

    fs/fuse/dev.c:468:6: warning: symbol 'fuse_args_to_req' was not declared. Should it be static?

Reported-by: Hulk Robot <hulkci@huawei.com>
Signed-off-by: YueHaibing <yuehaibing@huawei.com>
Fixes: 68583165f962 ("fuse: add pages to fuse_args")
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2019-09-24  fuse: unexport fuse_put_request  (Arnd Bergmann)  [1 file, -1/+0]
This function has been made static, which now causes a compile-time warning:

    WARNING: "fuse_put_request" [vmlinux] is a static EXPORT_SYMBOL_GPL

Remove the unneeded export.

Fixes: 66abc3599c3c ("fuse: unexport request ops")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2019-09-12  fuse: separate fuse device allocation and installation in fuse_conn  (Vivek Goyal)  [1 file, -1/+1]
As of now fuse_dev_alloc() both allocates a fuse device and installs it in
the fuse_conn list. fuse_dev_alloc() can fail if the fuse device allocation
fails.

virtio-fs needs to initialize multiple fuse devices (one per virtio queue).
It initializes one fuse device as part of the call to
fuse_fill_super_common(), and the rest of the devices are allocated and
installed after that. But we can't afford to fail after calling
fuse_fill_super_common(), as we have no way to undo all the actions done by
it. So, to avoid failures after the call to fuse_fill_super_common(),
pre-allocate all fuse devices early and install them into the fuse
connection later.

This patch provides two separate helpers for fuse device allocation and fuse
device installation in fuse_conn.

Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
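A sketch of how the split API could be used (helper names assumed for
illustration; the message only states that allocation and installation
become separate steps):

    struct fuse_dev *fud;

    fud = fuse_dev_alloc();      /* pure allocation; the only step that may fail */
    if (!fud)
        return -ENOMEM;          /* failing here is safe: nothing installed yet */

    /* ... pre-allocate the remaining per-queue devices the same way ... */

    fuse_dev_install(fud, fc);   /* just links fud into the fuse_conn; cannot fail */

With this split, virtio-fs can pre-allocate one fuse device per virtio queue
before fuse_fill_super_common() commits the superblock.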
2019-09-12  fuse: add fuse_iqueue_ops callbacks  (Stefan Hajnoczi)  [1 file, -16/+30]
The /dev/fuse device uses fiq->waitq and fasync to signal that requests are
available. These mechanisms do not apply to virtio-fs. This patch introduces
callbacks so alternative behavior can be used.

Note that queue_interrupt() changes along these lines:

      spin_lock(&fiq->waitq.lock);
      wake_up_locked(&fiq->waitq);
    + kill_fasync(&fiq->fasync, SIGIO, POLL_IN);
      spin_unlock(&fiq->waitq.lock);
    - kill_fasync(&fiq->fasync, SIGIO, POLL_IN);

Since queue_request() and queue_forget() also call kill_fasync() inside the
spinlock, this should be safe.

Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
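A sketch of what such an ops table can look like (member names assumed; the
point is that the existing waitq/fasync behaviour moves behind callbacks
that virtio-fs can override):

    struct fuse_iqueue_ops {
        /* each hook is called with fiq->waitq.lock held and must release it */
        void (*wake_forget_and_unlock)(struct fuse_iqueue *fiq);
        void (*wake_interrupt_and_unlock)(struct fuse_iqueue *fiq);
        void (*wake_pending_and_unlock)(struct fuse_iqueue *fiq);
    };

    /* /dev/fuse keeps the old behaviour behind the callbacks: */
    static void fuse_dev_wake_and_unlock(struct fuse_iqueue *fiq)
    {
        wake_up_locked(&fiq->waitq);
        kill_fasync(&fiq->fasync, SIGIO, POLL_IN);
        spin_unlock(&fiq->waitq.lock);
    }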
2019-09-12  fuse: export fuse_dequeue_forget() function  (Vivek Goyal)  [1 file, -6/+7]
File systems like virtio-fs should not have to manipulate the forget-list
data structures directly; there is a helper function that can be used
instead. Rename dequeue_forget() to fuse_dequeue_forget() and export it so
that stacked filesystems can use it.

Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2019-09-12  fuse: export fuse_get_unique()  (Stefan Hajnoczi)  [1 file, -1/+2]
virtio-fs will need unique IDs for FORGET requests from outside
fs/fuse/dev.c. Make the symbol visible.

Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2019-09-12  fuse: export fuse_len_args()  (Stefan Hajnoczi)  [1 file, -4/+5]
virtio-fs will need to query the length of fuse_arg lists. Make the symbol
visible.

Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2019-09-12  fuse: export fuse_end_request()  (Stefan Hajnoczi)  [1 file, -9/+10]
virtio-fs will need to complete requests from outside fs/fuse/dev.c. Make
the symbol visible.

Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2019-09-10  fuse: stop copying pages to fuse_req  (Miklos Szeredi)  [1 file, -12/+6]
The page array pointers are also duplicated across fuse_args_pages and
fuse_req. Get rid of the fuse_req ones.

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2019-09-10  fuse: stop copying args to fuse_req  (Miklos Szeredi)  [1 file, -54/+26]
No need to duplicate the argument arrays in fuse_req, so just dereference
req->args instead of copying to the fuse_req internal ones. This allows
further cleanup of the fuse_req structure.

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2019-09-10  fuse: simplify request allocation  (Miklos Szeredi)  [1 file, -50/+11]
Page arrays are not allocated together with the request anymore. Get rid of
the dead code.

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2019-09-10  fuse: unexport request ops  (Miklos Szeredi)  [1 file, -42/+9]
All requests are now sent with one of the fuse_simple_... helpers. Get rid
of the old api from the fuse internal header.

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2019-09-10  fuse: convert retrieve to simple api  (Miklos Szeredi)  [1 file, -30/+62]
Rename fuse_request_send_notify_reply() to fuse_simple_notify_reply() and
convert it to passing fuse_args instead of fuse_req.

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2019-09-10  fuse: convert writepages to simple api  (Miklos Szeredi)  [1 file, -27/+0]
Derive fuse_writepage_args from fuse_io_args.

Sending the request is tricky since it was done with fi->lock held, hence we
must either use atomic allocation or release the lock. Both are possible, so
try atomic first and, if it fails, release the lock and do the regular
allocation with GFP_NOFS and __GFP_NOFAIL. Both flags are necessary for
correct operation.

Move the page realloc function from dev.c to file.c and convert it to using
fuse_writepage_args.

The last caller of fuse_write_fill() is gone, so get rid of it.

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2019-09-10  fuse: add simple background helper  (Miklos Szeredi)  [1 file, -0/+45]
Create a helper named fuse_simple_background() that is similar to
fuse_simple_request(). Unlike the latter, it returns immediately and calls
the supplied 'end' callback when the reply is received.

The supplied 'args' pointer is stored in 'fuse_req', which allows the
callback to interpret the output arguments decoded from the reply.

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
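A sketch of the helper's contract as described (signature assumed from the
description; 'end' is invoked once the reply has been decoded):

    int fuse_simple_background(struct fuse_conn *fc, struct fuse_args *args,
                               gfp_t gfp_flags);

    /* caller side (callback name hypothetical): */
    args->end = my_read_end;                           /* runs on reply */
    err = fuse_simple_background(fc, args, GFP_KERNEL);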
2019-09-10  fuse: convert ioctl to simple api  (Miklos Szeredi)  [1 file, -2/+2]
fuse_simple_request() is converted to return the length of the last (instead
of a single) out arg, since FUSE_IOCTL_OUT has two out args, the second of
which is variable length.

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2019-09-10  fuse: move page alloc  (Miklos Szeredi)  [1 file, -15/+2]
fuse_req_pages_alloc() is moved to file.c, since its internal use by the
device code will eventually be removed.

Rename it to fuse_pages_alloc() to signify that it's not only usable for the
fuse_req page array.

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2019-09-10  fuse: add pages to fuse_args  (Miklos Szeredi)  [1 file, -10/+31]
Derive fuse_args_pages from fuse_args. This is used to handle requests which
use pages for input or output. The related flags are added to fuse_args.

A new FR_ALLOC_PAGES flag is added to indicate whether the page arrays in
fuse_req need to be freed by fuse_put_request() or not.

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
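A sketch of the derived structure (field names assumed for illustration):

    struct fuse_args_pages {
        struct fuse_args args;    /* base struct embedded first, so code that
                                   * handles fuse_args can recover the page
                                   * variant via container_of() */
        struct page **pages;
        struct fuse_page_desc *descs;
        unsigned int num_pages;
    };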
2019-09-10  fuse: add nocreds to fuse_args  (Miklos Szeredi)  [1 file, -29/+15]
In some cases it makes no sense to set the pid/uid/gid fields in the request
header. Allow fuse_simple_background() to omit these. This is only required
in the "force" case, so for now just WARN if set otherwise.

Fold fuse_get_req_nofail_nopages() into its only caller. The comment is
obsolete anyway.

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2019-09-10  fuse: convert fuse_force_forget() to simple api  (Miklos Szeredi)  [1 file, -21/+0]
Move this function to readdir.c, where its only caller resides.

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2019-09-10  fuse: add noreply to fuse_args  (Miklos Szeredi)  [1 file, -1/+3]
This will be used by fuse_force_forget().

We can expand fuse_request_send() into fuse_simple_request(). The FR_WAITING
bit has already been set, no need to check.

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2019-09-10  fuse: convert flush to simple api  (Miklos Szeredi)  [1 file, -4/+9]
Add 'force' to fuse_args and use fuse_get_req_nofail_nopages() to allocate
the request in that case.

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2019-09-10  fuse: simplify 'nofail' request  (Miklos Szeredi)  [1 file, -60/+4]
Instead of complex games with a reserved request, just use __GFP_NOFAIL.

Both callers (flush, readdir) guarantee that the connection was already
initialized, so there is no need to wait for fc->initialized.

Also remove the unneeded clearing of the FR_BACKGROUND flag.

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2019-09-10  fuse: flatten 'struct fuse_args'  (Miklos Szeredi)  [1 file, -19/+19]
...to make future expansion simpler. The hierarchical structure is a
historical thing that does not serve any practical purpose. The generated
code is exactly the same before and after the patch.

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2019-09-10  fuse: fix deadlock with aio poll and fuse_iqueue::waitq.lock  (Eric Biggers)  [1 file, -45/+48]
When IOCB_CMD_POLL is used on the FUSE device, aio_poll() disables IRQs and
takes kioctx::ctx_lock, then fuse_iqueue::waitq.lock.

This may have to wait for fuse_iqueue::waitq.lock to be released by one of
many places that take it with IRQs enabled. Since the IRQ handler may take
kioctx::ctx_lock, lockdep reports that a deadlock is possible.

Fix it by protecting the state of struct fuse_iqueue with a separate
spinlock, and only accessing fuse_iqueue::waitq using the versions of the
waitqueue functions which do IRQ-safe locking internally.

Reproducer:

    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/mount.h>
    #include <sys/stat.h>
    #include <sys/syscall.h>
    #include <unistd.h>
    #include <linux/aio_abi.h>

    int main()
    {
        char opts[128];
        int fd = open("/dev/fuse", O_RDWR);
        aio_context_t ctx = 0;
        struct iocb cb = { .aio_lio_opcode = IOCB_CMD_POLL, .aio_fildes = fd };
        struct iocb *cbp = &cb;

        sprintf(opts, "fd=%d,rootmode=040000,user_id=0,group_id=0", fd);
        mkdir("mnt", 0700);
        mount("foo", "mnt", "fuse", 0, opts);
        syscall(__NR_io_setup, 1, &ctx);
        syscall(__NR_io_submit, ctx, 1, &cbp);
    }

Beginning of lockdep output:

    =====================================================
    WARNING: SOFTIRQ-safe -> SOFTIRQ-unsafe lock order detected
    5.3.0-rc5 #9 Not tainted
    -----------------------------------------------------
    syz_fuse/135 [HC0[0]:SC0[0]:HE0:SE1] is trying to acquire:
    000000003590ceda (&fiq->waitq){+.+.}, at: spin_lock include/linux/spinlock.h:338 [inline]
    000000003590ceda (&fiq->waitq){+.+.}, at: aio_poll fs/aio.c:1751 [inline]
    000000003590ceda (&fiq->waitq){+.+.}, at: __io_submit_one.constprop.0+0x203/0x5b0 fs/aio.c:1825

    and this task is already holding:
    0000000075037284 (&(&ctx->ctx_lock)->rlock){..-.}, at: spin_lock_irq include/linux/spinlock.h:363 [inline]
    0000000075037284 (&(&ctx->ctx_lock)->rlock){..-.}, at: aio_poll fs/aio.c:1749 [inline]
    0000000075037284 (&(&ctx->ctx_lock)->rlock){..-.}, at: __io_submit_one.constprop.0+0x1f4/0x5b0 fs/aio.c:1825
    which would create a new lock dependency:
     (&(&ctx->ctx_lock)->rlock){..-.} -> (&fiq->waitq){+.+.}

    but this new dependency connects a SOFTIRQ-irq-safe lock:
     (&(&ctx->ctx_lock)->rlock){..-.}
    [...]

Reported-by: syzbot+af05535bb79520f95431@syzkaller.appspotmail.com
Reported-by: syzbot+d86c4426a01f60feddc7@syzkaller.appspotmail.com
Fixes: bfe4037e722e ("aio: implement IOCB_CMD_POLL")
Cc: <stable@vger.kernel.org> # v4.19+
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
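The shape of the fix, sketched (assuming the new field is named fiq->lock):
queue state is protected by a plain spinlock that is never taken from IRQ
context, and the waitqueue is only touched through wrappers that do IRQ-safe
locking internally:

    struct fuse_iqueue {
        spinlock_t lock;            /* protects pending/interrupts/forget lists */
        wait_queue_head_t waitq;    /* touched only via IRQ-safe helpers */
        /* ... */
    };

    spin_lock(&fiq->lock);
    list_add_tail(&req->list, &fiq->pending);
    spin_unlock(&fiq->lock);
    wake_up(&fiq->waitq);    /* takes waitq.lock with IRQs disabled internally */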
2019-09-02  fuse: require /dev/fuse reads to have enough buffer capacity (take 2)  (Kirill Smelkov)  [1 file, -0/+18]
[ This retries commit d4b13963f217 ("fuse: require /dev/fuse reads to have
  enough buffer capacity"), which was reverted. In this version we require
  only `sizeof(fuse_in_header) + sizeof(fuse_write_in)` instead of 4K for
  FUSE request header room, because, contrary to libfuse and kernel client
  behaviour, GlusterFS actually provides only so much room for the request
  header. ]

A FUSE filesystem server queues /dev/fuse sys_read calls to get filesystem
requests to handle. It does not know in advance what that request will be,
as it can be anything the client issues - LOOKUP, READ, WRITE, ... Many
requests are short and retrieve data from the filesystem. However, WRITE and
NOTIFY_REPLY write data into the filesystem.

Before getting into the operation phase, the FUSE filesystem server and
kernel client negotiate what should be the maximum write size the client
will ever issue. After negotiation, the contract between server and client
is that the filesystem server should queue /dev/fuse sys_read calls with
enough buffer capacity to receive any client request - WRITE in particular -
while the FUSE client should not, in particular, send WRITE requests with a
payload larger than the negotiated max_write.

The FUSE client in the kernel and libfuse historically reserve 4K for the
request header. However, an existing filesystem server - GlusterFS - was
found which reserves only 80 bytes for header room
(= `sizeof(fuse_in_header) + sizeof(fuse_write_in)`).

Since

    sizeof(fuse_in_header) + sizeof(fuse_write_in)
    == sizeof(fuse_in_header) + sizeof(fuse_read_in)
    == sizeof(fuse_in_header) + sizeof(fuse_notify_retrieve_in)

is the absolute minimum any sane filesystem should be using for header room,
the contract is that the filesystem server should queue sys_reads with a
`sizeof(fuse_in_header) + sizeof(fuse_write_in)` + max_write buffer.

If the filesystem server does not follow this contract, what can happen is
that fuse_dev_do_read will see that the request size is larger than the
buffer size, and then it will return EIO to the client who issued the
request, but won't indicate in any way to the filesystem server that there
is a problem. This can be hard to diagnose, because for some requests, e.g.
for NOTIFY_REPLY which mimics WRITE, there is no client thread waiting for
request completion and that EIO goes nowhere, while on the filesystem server
side things look like the kernel is not replying back after a successful
NOTIFY_RETRIEVE request made by the server.

We can make the problem easy to diagnose if we indicate via an error return
to the filesystem server when it is violating the contract. This should not
practically cause problems, because if a filesystem server is using a
shorter buffer, writes to it were already very likely to cause EIO, and if
the filesystem is read-only it should already be following the
FUSE_MIN_READ_BUFFER minimum buffer size.

Please see [1] for context where the problem of a stuck filesystem was hit
for real (because the kernel client was incorrectly sending more than
max_write data with NOTIFY_REPLY; see also the previous patch), how the
situation was traced, and for a more involved patch that did not make it
into the tree.

[1] https://marc.info/?l=linux-fsdevel&m=155057023600853&w=2

Signed-off-by: Kirill Smelkov <kirr@nexedi.com>
Tested-by: Sander Eikelenboom <linux@eikelenboom.it>
Cc: Han-Wen Nienhuys <hanwen@google.com>
Cc: Jakob Unterwurzacher <jakobunt@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
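The check being added is essentially the following (a sketch; the exact form
in the patch may differ):

    /* fuse_dev_do_read(), sketch: fail reads whose buffer cannot hold the
     * largest request the client is allowed to send */
    if (nbytes < max_t(size_t, FUSE_MIN_READ_BUFFER,
                       sizeof(struct fuse_in_header) +
                       sizeof(struct fuse_write_in) + fc->max_write))
        return -EINVAL;

This way a server violating the contract sees an immediate error on its
sys_read instead of its clients receiving unexplained EIOs.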
2019-06-11  Revert "fuse: require /dev/fuse reads to have enough buffer capacity"  (Miklos Szeredi)  [1 file, -10/+0]
This reverts commit d4b13963f217dd947da5c0cabd1569e914d21699.

The commit introduced a regression in glusterfs-fuse.

Reported-by: Sander Eikelenboom <linux@eikelenboom.it>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2019-04-24  fuse: require /dev/fuse reads to have enough buffer capacity  (Kirill Smelkov)  [1 file, -0/+10]
A FUSE filesystem server queues /dev/fuse sys_read calls to get filesystem
requests to handle. It does not know in advance what that request will be,
as it can be anything the client issues - LOOKUP, READ, WRITE, ... Many
requests are short and retrieve data from the filesystem. However, WRITE and
NOTIFY_REPLY write data into the filesystem.

Before getting into the operation phase, the FUSE filesystem server and
kernel client negotiate what should be the maximum write size the client
will ever issue. After negotiation, the contract between server and client
is that the filesystem server should queue /dev/fuse sys_read calls with
enough buffer capacity to receive any client request - WRITE in particular -
while the FUSE client should not, in particular, send WRITE requests with a
payload larger than the negotiated max_write. The FUSE client in the kernel
and libfuse historically reserve 4K for the request header. This way the
contract is that the filesystem server should queue sys_reads with a
4K+max_write buffer.

If the filesystem server does not follow this contract, what can happen is
that fuse_dev_do_read will see that the request size is larger than the
buffer size, and then it will return EIO to the client who issued the
request, but won't indicate in any way to the filesystem server that there
is a problem. This can be hard to diagnose, because for some requests, e.g.
for NOTIFY_REPLY which mimics WRITE, there is no client thread waiting for
request completion and that EIO goes nowhere, while on the filesystem server
side things look like the kernel is not replying back after a successful
NOTIFY_RETRIEVE request made by the server.

We can make the problem easy to diagnose if we indicate via an error return
to the filesystem server when it is violating the contract. This should not
practically cause problems, because if a filesystem server is using a
shorter buffer, writes to it were already very likely to cause EIO, and if
the filesystem is read-only it should already be following the
FUSE_MIN_READ_BUFFER minimum buffer size.

Please see [1] for context where the problem of a stuck filesystem was hit
for real (because the kernel client was incorrectly sending more than
max_write data with NOTIFY_REPLY; see also the previous patch), how the
situation was traced, and for a more involved patch that did not make it
into the tree.

[1] https://marc.info/?l=linux-fsdevel&m=155057023600853&w=2

Signed-off-by: Kirill Smelkov <kirr@nexedi.com>
Cc: Han-Wen Nienhuys <hanwen@google.com>
Cc: Jakob Unterwurzacher <jakobunt@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2019-04-24  fuse: retrieve: cap requested size to negotiated max_write  (Kirill Smelkov)  [1 file, -1/+1]
The FUSE filesystem server and kernel client negotiate, during the
initialization phase, what should be the maximum write size the client will
ever issue. Correspondingly, the filesystem server then queues sys_read
calls to read requests with buffer capacity large enough to carry the
request header plus max_write bytes.

A filesystem server is free to set its max_write anywhere in the range
[1*page, fc->max_pages*page]. In particular go-fuse[2] sets max_write by
default to 64K, whereas the default fc->max_pages corresponds to 128K.
Libfuse also allows users to configure max_write, but by default presets it
to the possible maximum.

If max_write is < fc->max_pages*page, and the NOTIFY_RETRIEVE handler allows
retrieving more than max_write bytes, the corresponding prepared
NOTIFY_REPLY will be thrown away by fuse_dev_do_read, because the filesystem
server, in full correspondence with the server/client contract, will only be
queuing sys_read with ~max_write buffer capacity, and fuse_dev_do_read
throws away requests that cannot fit into the server request buffer. In turn
the filesystem server could get stuck waiting indefinitely for NOTIFY_REPLY,
since the NOTIFY_RETRIEVE handler returned OK, which clients understand to
mean that NOTIFY_REPLY was queued and will be sent back.

Cap the requested size to the negotiated max_write to avoid the problem.
This aligns with the way the NOTIFY_RETRIEVE handler works, which already
unconditionally caps the requested retrieve size to fuse_conn->max_pages.
This way it should not hurt NOTIFY_RETRIEVE semantics if we return less data
than was originally requested.

Please see [1] for context where the problem of a stuck filesystem was hit
for real, how the situation was traced, and for a more involved patch that
did not make it into the tree.

[1] https://marc.info/?l=linux-fsdevel&m=155057023600853&w=2
[2] https://github.com/hanwen/go-fuse

Signed-off-by: Kirill Smelkov <kirr@nexedi.com>
Cc: Han-Wen Nienhuys <hanwen@google.com>
Cc: Jakob Unterwurzacher <jakobunt@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
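The cap amounts to a one-line clamp in the retrieve path (a sketch; variable
names assumed):

    /* never prepare more NOTIFY_REPLY payload than the server is
     * guaranteed to have read-buffer room for */
    num = min(outarg->size, fc->max_write);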
2019-04-24  fuse: convert printk -> pr_*  (Kirill Smelkov)  [1 file, -2/+2]
Functions like pr_err() are a more modern variant of printing compared to
printk(). They can be used to denoise sources by encoding the needed level
in the print function name, and by automatically inserting a per-driver /
per-function / ... print prefix as defined by the pr_fmt macro. The pr_*
functions are also recommended by Documentation/process/coding-style.rst,
and more recent code - for example overlayfs - uses them instead of printk.

Convert CUSE and FUSE to use the new pr_* functions. CUSE output stays
completely unchanged, while FUSE output is amended a bit for the "trying to
steal weird page" warning - the second line now also comes with the "fuse:"
prefix. I hope it is ok.

Suggested-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Signed-off-by: Kirill Smelkov <kirr@nexedi.com>
Reviewed-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
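The conversion follows the standard pr_fmt pattern; a minimal sketch (the
exact prefix definition is assumed, and the log level is illustrative):

    /* fuse_i.h: every pr_*() call in the module picks up this prefix */
    #define pr_fmt(fmt) "fuse: " fmt

    /* so a printk(KERN_WARNING "fuse: trying to steal weird page\n")
     * call becomes simply: */
    pr_warn("trying to steal weird page\n");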
2019-04-14  Merge branch 'page-refs' (page ref overflow)  (Linus Torvalds)  [1 file, -6/+6]
Merge page ref overflow branch.

Jann Horn reported that he can overflow the page ref count with sufficient
memory (and a filesystem that is intentionally extremely slow).

Admittedly it's not exactly easy. To have more than four billion references
to a page requires a minimum of 32GB of kernel memory just for the pointers
to the pages, much less any metadata to keep track of those pointers. Jann
needed a total of 140GB of memory and a specially crafted filesystem that
leaves all reads pending (in order to not ever free the page references and
just keep adding more).

Still, we have a fairly straightforward way to limit the two obvious
user-controllable sources of page references: direct-IO like page references
gotten through get_user_pages(), and the splice pipe page duplication. So
let's just do that.

* branch page-refs:
  fs: prevent page refcount overflow in pipe_buf_get
  mm: prevent get_user_pages() from overflowing page refcount
  mm: add 'try_get_page()' helper function
  mm: make page ref count overflow check tighter and more explicit
2019-04-14  fs: prevent page refcount overflow in pipe_buf_get  (Matthew Wilcox)  [1 file, -6/+6]
Change pipe_buf_get() to return a bool indicating whether it succeeded in
raising the refcount of the page (if the thing in the pipe is a page). This
removes another mechanism for overflowing the page refcount. All callers
converted to handle a failure.

Reported-by: Jann Horn <jannh@google.com>
Signed-off-by: Matthew Wilcox <willy@infradead.org>
Cc: stable@kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
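A sketch of a converted caller, here the fuse splice path that duplicates a
pipe buffer (error value and surrounding names assumed):

    if (!pipe_buf_get(pipe, ibuf)) {
        ret = -EIO;     /* refcount could not be raised safely; give up */
        goto out_free;
    }
    *obuf = *ibuf;      /* duplicate the buffer only after the ref is held */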
2019-02-13  fuse: clean up aborted  (Miklos Szeredi)  [1 file, -5/+4]
The only caller that needs fc->aborted set is fuse_conn_abort_write().

Setting fc->aborted is now racy (fuse_abort_conn() may already be in
progress or finished), but there's no reason to care.

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2019-02-13  fuse: Protect ff->reserved_req via corresponding fi->lock  (Kirill Tkhai)  [1 file, -4/+6]
This is a rather natural action after the previous patches, and it just
decreases the load on fc->lock.

Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2019-02-13  fuse: Verify userspace asks to requeue interrupt that we really sent  (Kirill Tkhai)  [1 file, -3/+10]
When queue_interrupt() is called from fuse_dev_do_write(), the request id
came directly from userspace. Userspace may pass any request id, even the id
of a request we have not interrupted (or even of a background request). This
patch adds a sanity check to make the kernel safe against that.

Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2019-02-13  fuse: Do some refactoring in fuse_dev_do_write()  (Kirill Tkhai)  [1 file, -25/+23]
This is needed for the next patch.

Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2019-02-13  fuse: Wake up req->waitq only if not background  (Kirill Tkhai)  [1 file, -1/+4]
Currently, we wait on req->waitq in the request_wait_answer() function only,
and it is never used for background requests. Since wake_up() is not a
lightweight macro but unfolds into a real function call, whose locking
operations take some CPU cycles, let's avoid the call in the case where we
know for certain it is useless.

Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2019-02-13  fuse: Optimize request_end() by not taking fiq->waitq.lock  (Kirill Tkhai)  [1 file, -8/+20]
We take the global fiq->waitq.lock every time we are in this function, but
interrupted requests are just a small subset of all requests. This patch
optimizes request_end() and makes it take the lock only when it is really
needed.

queue_interrupt() needs a small change for that: after req is linked to the
interrupt list, we do smp_mb() and check FR_FINISHED again. If the
FR_FINISHED bit has appeared, we remove req and leave the function.

Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
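A sketch of the resulting pairing (following the description above; the
exact code may differ):

    /* request_end(), sketch: take the lock only for interrupted requests */
    if (test_and_set_bit(FR_FINISHED, &req->flags))
        return;
    /* test_and_set_bit() implies a barrier, pairing with the smp_mb() below */
    if (!list_empty(&req->intr_entry)) {
        spin_lock(&fiq->waitq.lock);
        list_del_init(&req->intr_entry);
        spin_unlock(&fiq->waitq.lock);
    }

    /* queue_interrupt(), sketch: recheck FR_FINISHED after linking */
    spin_lock(&fiq->waitq.lock);
    if (list_empty(&req->intr_entry)) {
        list_add_tail(&req->intr_entry, &fiq->interrupts);
        smp_mb();
        if (test_bit(FR_FINISHED, &req->flags)) {
            list_del_init(&req->intr_entry);
            spin_unlock(&fiq->waitq.lock);
            return;
        }
        wake_up_locked(&fiq->waitq);
    }
    spin_unlock(&fiq->waitq.lock);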
2019-02-13  fuse: Kill fasync only if interrupt is queued in queue_interrupt()  (Kirill Tkhai)  [1 file, -1/+1]
We should send the signal only if the interrupt is really queued. Not a real
problem, but this makes the code clearer and more intuitive.

Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2019-02-13  fuse: Remove stale comment in end_requests()  (Kirill Tkhai)  [1 file, -5/+1]
Function end_requests() does not take fc->lock.

Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2019-01-16  fuse: call pipe_buf_release() under pipe lock  (Jann Horn)  [1 file, -0/+2]
Some of the pipe_buf_release() handlers seem to assume that the pipe is
locked - in particular, anon_pipe_buf_release() accesses pipe->tmp_page
without taking any extra locks. From a glance through the callers of
pipe_buf_release(), it looks like FUSE is the only one that calls
pipe_buf_release() without having the pipe locked.

This bug should only lead to a memory leak, nothing terrible.

Fixes: dd3bb14f44a6 ("fuse: support splice() writing to fuse device")
Cc: stable@vger.kernel.org
Signed-off-by: Jann Horn <jannh@google.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2019-01-16  fuse: handle zero sized retrieve correctly  (Miklos Szeredi)  [1 file, -1/+1]
Dereferencing req->page_descs[0] will Oops if req->max_pages is zero.

Reported-by: syzbot+c1e36d30ee3416289cc0@syzkaller.appspotmail.com
Tested-by: syzbot+c1e36d30ee3416289cc0@syzkaller.appspotmail.com
Fixes: b2430d7567a3 ("fuse: add per-page descriptor <offset, length> to fuse_req")
Cc: <stable@vger.kernel.org> # v3.9
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2018-11-09  fuse: fix possibly missed wake-up after abort  (Miklos Szeredi)  [1 file, -3/+9]
In the current fuse_drop_waiting() implementation it's possible that
fuse_wait_aborted() will not be woken up in the unlikely case that
fuse_abort_conn() + fuse_wait_aborted() runs in between checking
fc->connected and calling atomic_dec(&fc->num_waiting).

Do the atomic_dec_and_test() unconditionally, which also provides the
necessary barrier against reordering with the fc->connected check.

The explicit smp_mb() in fuse_wait_aborted() is not actually needed, since
the spin_unlock() in fuse_abort_conn() provides the necessary RELEASE
barrier after resetting fc->connected. However, this is not a performance
sensitive path, and adding the explicit barrier makes it easier to document.

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Fixes: b8f95e5d13f5 ("fuse: umount should wait for all requests")
Cc: <stable@vger.kernel.org> #v4.19
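A sketch matching that description (helper shape assumed; the comment
paraphrases the reasoning above):

    static void fuse_drop_waiting(struct fuse_conn *fc)
    {
        /*
         * atomic_dec_and_test() implies a full barrier, so the
         * fc->connected load cannot be reordered before the decrement.
         */
        if (atomic_dec_and_test(&fc->num_waiting) &&
            !READ_ONCE(fc->connected)) {
            /* wake up aborters */
            wake_up_all(&fc->blocked_waitq);
        }
    }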
2018-11-09  fuse: fix leaked notify reply  (Miklos Szeredi)  [1 file, -1/+3]
fuse_request_send_notify_reply() may fail if the connection was reset for
some reason (e.g. fs was unmounted). Don't leak the request reference in
this case. Besides leaking memory, this also resulted in fc->num_waiting not
being decremented and hence fuse_wait_aborted() being left in a hanging and
unkillable state.

Fixes: 2d45ba381a74 ("fuse: add retrieve request")
Fixes: b8f95e5d13f5 ("fuse: umount should wait for all requests")
Reported-and-tested-by: syzbot+6339eda9cb4ebbc4c37b@syzkaller.appspotmail.com
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Cc: <stable@vger.kernel.org> #v2.6.36
2018-10-01  fuse: realloc page array  (Miklos Szeredi)  [1 file, -4/+45]
Writeback caching currently allocates requests with the maximum number of
possible pages, while the actual number of pages per request depends on a
couple of factors that cannot be determined when the request is allocated
(whether the page is already under writeback, whether the page is contiguous
with previous pages already added to a request).

This patch allows such requests to start with no page allocation (all pages
inline) and grow the page array on demand.

If the max_pages tunable remains the default value, then this will mean just
one allocation that is the same size as before. If the tunable is larger,
then this adds at most 3 additional memory allocations (which is generously
compensated by the improved performance from the larger request).

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2018-10-01  fuse: add max_pages to init_out  (Constantine Shulyupin)  [1 file, -2/+3]
Replace FUSE_MAX_PAGES_PER_REQ with the configurable parameter max_pages to
improve performance.

Old RFC with detailed description of the problem and many fixes by Mitsuo
Hayasaka (mitsuo.hayasaka.hu@hitachi.com):
 - https://lkml.org/lkml/2012/7/5/136

We've encountered performance degradation and fixed it on a big and complex
virtual environment.

Environment to reproduce degradation and improvement:

1. Add lag to user mode FUSE: add

       nanosleep(&(struct timespec){ 0, 1000 }, NULL);

   to xmp_write_buf in passthrough_fh.c

2. Patch UM fuse with the configurable max_pages parameter. The patch will
   be provided later.

3. Run the test script and perform the test on tmpfs:

       fuse_test()
       {
           cd /tmp
           mkdir -p fusemnt
           passthrough_fh -o max_pages=$1 /tmp/fusemnt
           grep fuse /proc/self/mounts
           dd conv=fdatasync oflag=dsync if=/dev/zero of=fusemnt/tmp/tmp \
               count=1K bs=1M 2>&1 | grep -v records
           rm fusemnt/tmp/tmp
           killall passthrough_fh
       }

Test results:

    passthrough_fh /tmp/fusemnt fuse.passthrough_fh \
        rw,nosuid,nodev,relatime,user_id=0,group_id=0 0 0
    1073741824 bytes (1.1 GB) copied, 1.73867 s, 618 MB/s

    passthrough_fh /tmp/fusemnt fuse.passthrough_fh \
        rw,nosuid,nodev,relatime,user_id=0,group_id=0,max_pages=256 0 0
    1073741824 bytes (1.1 GB) copied, 1.15643 s, 928 MB/s

Obviously with bigger lag the difference between 'before' and 'after' will
be more significant.

Mitsuo Hayasaka, in 2012 (https://lkml.org/lkml/2012/7/5/136), observed
improvement from 400-550 to 520-740.

Signed-off-by: Constantine Shulyupin <const@MakeLinux.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2018-10-01  fuse: allocate page array more efficiently  (Miklos Szeredi)  [1 file, -26/+18]
When allocating the page array for a request, the array for the page
pointers and the array for page descriptors are allocated by two separate
kmalloc() calls. Merge these into one allocation.

Also, instead of initializing the request and the page arrays with memset(),
use the zeroing allocation variants.

Reserved requests never carry pages (page array size is zero). Make that
explicit by initializing the page array pointers to NULL, and make sure the
assumption remains true by adding a WARN_ON().

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
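A sketch of the merged allocation (layout as described: the pointer array
first, the descriptor array right behind it; helper name hypothetical):

    static struct page **fuse_pages_alloc_sketch(unsigned int npages,
                                                 gfp_t flags,
                                                 struct fuse_page_desc **descs)
    {
        struct page **pages;

        /* one zeroed allocation holds both arrays back to back */
        pages = kzalloc(npages * (sizeof(struct page *) +
                                  sizeof(struct fuse_page_desc)), flags);
        if (pages)
            *descs = (struct fuse_page_desc *)(pages + npages);
        return pages;
    }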
2018-09-28  fuse: Use hash table to link processing request  (Kirill Tkhai)  [1 file, -4/+17]
We noticed a performance bottleneck in FUSE running our Virtuozzo storage
over rdma. On some types of workload we observe 20% of time spent in
request_find() in the profiler. This function iterates over a long list of
requests, and it scales badly.

The patch introduces a hash table to reduce the number of iterations we do
in this function. The hash generating algorithm is taken from the hash_add()
function, while a 256-entry table is used to store pending requests. This
fixes the problem and improves the performance.

Reported-by: Alexey Kuznetsov <kuznet@virtuozzo.com>
Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
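A sketch of the scheme (constant and function names assumed; 256 buckets as
described):

    #define FUSE_PQ_HASH_BITS 8
    #define FUSE_PQ_HASH_SIZE (1 << FUSE_PQ_HASH_BITS)   /* 256 buckets */

    static unsigned int fuse_req_hash(u64 unique)
    {
        /* same hashing as hash_add(); the low bit marks interrupt
         * requests (see the ID-allocation commit below), so mask it off */
        return hash_long(unique & ~FUSE_INT_REQ_BIT, FUSE_PQ_HASH_BITS);
    }

    /* lookup walks one short bucket instead of one long list: */
    head = &fpq->processing[fuse_req_hash(oh.unique)];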
2018-09-28  fuse: kill req->intr_unique  (Kirill Tkhai)  [1 file, -6/+5]
This field is not needed after the previous patch, since we can easily
convert a request ID to an interrupt request ID and vice versa.

Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2018-09-28  fuse: change interrupt requests allocation algorithm  (Kirill Tkhai)  [1 file, -2/+7]
Using two unconnected IDs, req->in.h.unique and req->intr_unique, does not
allow linking requests to a hash table: we can't use either of them as a key
to calculate the hash.

This patch changes the algorithm of ID allocation for a request. Plain
requests obtain an even ID, while interrupt requests are encoded in the low
bit. So, in the next patches we will be able to use the rest of the ID bits
to calculate the hash, and the hash will be the same for plain and interrupt
requests.

Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
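A sketch of the encoding (constant names assumed; the scheme is the even/odd
split described above):

    #define FUSE_INT_REQ_BIT (1ULL << 0)   /* set on interrupt request IDs */
    #define FUSE_REQ_ID_STEP (1ULL << 1)   /* plain IDs advance by 2: always even */

    static u64 fuse_get_unique(struct fuse_iqueue *fiq)
    {
        fiq->reqctr += FUSE_REQ_ID_STEP;
        return fiq->reqctr;
    }

    /* the interrupt for a request reuses its ID with the low bit set: */
    arg.unique = req->in.h.unique | FUSE_INT_REQ_BIT;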