Age | Commit message (Collapse) | Author | Files | Lines |
|
icq_to_bic is a container_of operation, so we need to check for NULL
before it. Also move the check outside the spinlock while we're at
it.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Merge three functions only tail-called by blk_mq_free_request into
blk_mq_free_request.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
No need to have two different callouts of bfq vs kyber.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Having these as separate helpers in a header really does not help
readability, or my chances to refactor this code sanely.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Having them out of line in blk-mq-sched.c just makes the code flow
unnecessarily complicated.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Pull NVMe changes for 4.13 from Christoph:
Highlights:
- UUID identifier support from Johannes
- Lots of cleanups from Sagi
- Host Memory Buffer support from me
And lots of cleanups and smaller fixes of course.
Note that the UUID identifier changes are based on top of the uuid tree.
I am the maintainer of that tree and will send it to Linus as soon as
4.12 is released as various other trees depend on it as well (and the
diffstat includes those changes unfortunately)
|
|
This patch fixes two sparse warnings introduced by the "dedicated
error codes for the block layer V3" patch series. These changes
have not been tested.
Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
|
|
nvme-base
|
|
Should be a blk_status_t type, not an integer.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
We've already got a few conflicts and upcoming work depends on some of the
changes that have gone into mainline as regression fixes for this series.
Pull in 4.12-rc5 to resolve these conflicts and make it easier on down stream
trees to continue working on 4.13 changes.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Replace bi_error with a new bi_status to allow for a clear conversion.
Note that device mapper overloaded bi_error with a private value, which
we'll have to keep arround at least for now and thus propagate to a
proper blk_status_t value.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
|
|
Use the same values for use for request completion errors as the return
value from ->queue_rq. BLK_STS_RESOURCE is special cased to cause
a requeue, and all the others are completed as-is.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
|
|
Currently we use nornal Linux errno values in the block layer, and while
we accept any error a few have overloaded magic meanings. This patch
instead introduces a new blk_status_t value that holds block layer specific
status codes and explicitly explains their meaning. Helpers to convert from
and to the previous special meanings are provided for now, but I suspect
we want to get rid of them in the long run - those drivers that have a
errno input (e.g. networking) usually get errnos that don't know about
the special block layer overloads, and similarly returning them to userspace
will usually return somethings that strictly speaking isn't correct
for file system operations, but that's left as an exercise for later.
For now the set of errors is a very limited set that closely corresponds
to the previous overloaded errno values, but there is some low hanging
fruite to improve it.
blk_status_t (ab)uses the sparse __bitwise annotations to allow for sparse
typechecking, so that we can easily catch places passing the wrong values.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
|
|
In blk-cgroup, operations on blkg objects are protected with the
request_queue lock. This is no more the lock that protects
I/O-scheduler operations in blk-mq. In fact, the latter are now
protected with a finer-grained per-scheduler-instance lock. As a
consequence, although blkg lookups are also rcu-protected, blk-mq I/O
schedulers may see inconsistent data when they access blkg and
blkg-related objects. BFQ does access these objects, and does incur
this problem, in the following case.
The blkg_lookup performed in bfq_get_queue, being protected (only)
through rcu, may happen to return the address of a copy of the
original blkg. If this is the case, then the blkg_get performed in
bfq_get_queue, to pin down the blkg, is useless: it does not prevent
blk-cgroup code from destroying both the original blkg and all objects
directly or indirectly referred by the copy of the blkg. BFQ accesses
these objects, which typically causes a crash for NULL-pointer
dereference of memory-protection violation.
Some additional protection mechanism should be added to blk-cgroup to
address this issue. In the meantime, this commit provides a quick
temporary fix for BFQ: cache (when safe) blkg data that might
disappear right after a blkg_lookup.
In particular, this commit exploits the following facts to achieve its
goal without introducing further locks. Destroy operations on a blkg
invoke, as a first step, hooks of the scheduler associated with the
blkg. And these hooks are executed with bfqd->lock held for BFQ. As a
consequence, for any blkg associated with the request queue an
instance of BFQ is attached to, we are guaranteed that such a blkg is
not destroyed, and that all the pointers it contains are consistent,
while that instance is holding its bfqd->lock. A blkg_lookup performed
with bfqd->lock held then returns a fully consistent blkg, which
remains consistent until this lock is held. In more detail, this holds
even if the returned blkg is a copy of the original one.
Finally, also the object describing a group inside BFQ needs to be
protected from destruction on the blkg_free of the original blkg
(which invokes bfq_pd_free). This commit adds private refcounting for
this object, to let it disappear only after no bfq_queue refers to it
any longer.
This commit also removes or updates some stale comments on locking
issues related to blk-cgroup operations.
Reported-by: Tomas Konir <tomas.konir@gmail.com>
Reported-by: Lee Tibbert <lee.tibbert@gmail.com>
Reported-by: Marco Piazza <mpiazza@gmail.com>
Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
Tested-by: Tomas Konir <tomas.konir@gmail.com>
Tested-by: Lee Tibbert <lee.tibbert@gmail.com>
Tested-by: Marco Piazza <mpiazza@gmail.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
|
|
hard disk IO latency varies a lot depending on spindle move. The latency
range could be from several microseconds to several milliseconds. It's
pretty hard to get the baseline latency used by io.low.
We will use a different stragety here. The idea is only using IO with
spindle move to determine if cgroup IO is in good state. For HD, if io
latency is small (< 1ms), we ignore the IO. Such IO is likely from
sequential IO, and is helpless to help determine if a cgroup's IO is
impacted by other cgroups. With this, we only account IO with big
latency. Then we can choose a hardcoded baseline latency for HD (4ms,
which is typical IO latency with seek). With all these settings, the
io.low latency works for both HD and SSD.
Signed-off-by: Shaohua Li <shli@fb.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
|
|
I have encountered a NULL pointer dereference in
throtl_schedule_pending_timer:
[ 413.735396] BUG: unable to handle kernel NULL pointer dereference at 0000000000000038
[ 413.735535] IP: [<ffffffff812ebbbf>] throtl_schedule_pending_timer+0x3f/0x210
[ 413.735643] PGD 22c8cf067 PUD 22cb34067 PMD 0
[ 413.735713] Oops: 0000 [#1] SMP
......
This is caused by the following case:
blk_throtl_bio
throtl_schedule_next_dispatch <= sq is top level one without parent
throtl_schedule_pending_timer
sq_to_tg(sq)->td->throtl_slice <= sq_to_tg(sq) returns NULL
Fix it by using sq_to_td instead of sq_to_tg(sq)->td, which will always
return a valid td.
Fixes: 297e3d854784 ("blk-throttle: make throtl_slice tunable")
Signed-off-by: Joseph Qi <qijiang.qj@alibaba-inc.com>
Reviewed-by: Shaohua Li <shli@fb.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
|
|
If queue is stopped, we shouldn't dispatch request into driver and
hardware, unfortunately the check is removed in bd166ef183c2(blk-mq-sched:
add framework for MQ capable IO schedulers).
This patch fixes the issue by moving the check back into
__blk_mq_try_issue_directly().
This patch fixes request use-after-free[1][2] during canceling requets
of NVMe in nvme_dev_disable(), which can be triggered easily during
NVMe reset & remove test.
[1] oops kernel log when CONFIG_BLK_DEV_INTEGRITY is on
[ 103.412969] BUG: unable to handle kernel NULL pointer dereference at 000000000000000a
[ 103.412980] IP: bio_integrity_advance+0x48/0xf0
[ 103.412981] PGD 275a88067
[ 103.412981] P4D 275a88067
[ 103.412982] PUD 276c43067
[ 103.412983] PMD 0
[ 103.412984]
[ 103.412986] Oops: 0000 [#1] SMP
[ 103.412989] Modules linked in: vfat fat intel_rapl sb_edac x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm irqbypass crct10dif_pclmul crc32_pclmul ghash_clmulni_intel pcbc aesni_intel crypto_simd cryptd ipmi_ssif iTCO_wdt iTCO_vendor_support mxm_wmi glue_helper dcdbas ipmi_si mei_me pcspkr mei sg ipmi_devintf lpc_ich ipmi_msghandler shpchp acpi_power_meter wmi nfsd auth_rpcgss nfs_acl lockd grace sunrpc ip_tables xfs libcrc32c sd_mod mgag200 i2c_algo_bit drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops ttm drm crc32c_intel nvme ahci nvme_core libahci libata tg3 i2c_core megaraid_sas ptp pps_core dm_mirror dm_region_hash dm_log dm_mod
[ 103.413035] CPU: 0 PID: 102 Comm: kworker/0:2 Not tainted 4.11.0+ #1
[ 103.413036] Hardware name: Dell Inc. PowerEdge R730xd/072T6D, BIOS 2.2.5 09/06/2016
[ 103.413041] Workqueue: events nvme_remove_dead_ctrl_work [nvme]
[ 103.413043] task: ffff9cc8775c8000 task.stack: ffffc033c252c000
[ 103.413045] RIP: 0010:bio_integrity_advance+0x48/0xf0
[ 103.413046] RSP: 0018:ffffc033c252fc10 EFLAGS: 00010202
[ 103.413048] RAX: 0000000000000000 RBX: ffff9cc8720a8cc0 RCX: ffff9cca72958240
[ 103.413049] RDX: ffff9cca72958000 RSI: 0000000000000008 RDI: ffff9cc872537f00
[ 103.413049] RBP: ffffc033c252fc28 R08: 0000000000000000 R09: ffffffffb963a0d5
[ 103.413050] R10: 000000000000063e R11: 0000000000000000 R12: ffff9cc8720a8d18
[ 103.413051] R13: 0000000000001000 R14: ffff9cc872682e00 R15: 00000000fffffffb
[ 103.413053] FS: 0000000000000000(0000) GS:ffff9cc877c00000(0000) knlGS:0000000000000000
[ 103.413054] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 103.413055] CR2: 000000000000000a CR3: 0000000276c41000 CR4: 00000000001406f0
[ 103.413056] Call Trace:
[ 103.413063] bio_advance+0x2a/0xe0
[ 103.413067] blk_update_request+0x76/0x330
[ 103.413072] blk_mq_end_request+0x1a/0x70
[ 103.413074] blk_mq_dispatch_rq_list+0x370/0x410
[ 103.413076] ? blk_mq_flush_busy_ctxs+0x94/0xe0
[ 103.413080] blk_mq_sched_dispatch_requests+0x173/0x1a0
[ 103.413083] __blk_mq_run_hw_queue+0x8e/0xa0
[ 103.413085] __blk_mq_delay_run_hw_queue+0x9d/0xa0
[ 103.413088] blk_mq_start_hw_queue+0x17/0x20
[ 103.413090] blk_mq_start_hw_queues+0x32/0x50
[ 103.413095] nvme_kill_queues+0x54/0x80 [nvme_core]
[ 103.413097] nvme_remove_dead_ctrl_work+0x1f/0x40 [nvme]
[ 103.413103] process_one_work+0x149/0x360
[ 103.413105] worker_thread+0x4d/0x3c0
[ 103.413109] kthread+0x109/0x140
[ 103.413111] ? rescuer_thread+0x380/0x380
[ 103.413113] ? kthread_park+0x60/0x60
[ 103.413120] ret_from_fork+0x2c/0x40
[ 103.413121] Code: 08 4c 8b 63 50 48 8b 80 80 00 00 00 48 8b 90 d0 03 00 00 31 c0 48 83 ba 40 02 00 00 00 48 8d 8a 40 02 00 00 48 0f 45 c1 c1 ee 09 <0f> b6 48 0a 0f b6 40 09 41 89 f5 83 e9 09 41 d3 ed 44 0f af e8
[ 103.413145] RIP: bio_integrity_advance+0x48/0xf0 RSP: ffffc033c252fc10
[ 103.413146] CR2: 000000000000000a
[ 103.413157] ---[ end trace cd6875d16eb5a11e ]---
[ 103.455368] Kernel panic - not syncing: Fatal exception
[ 103.459826] Kernel Offset: 0x37600000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
[ 103.850916] ---[ end Kernel panic - not syncing: Fatal exception
[ 103.857637] sched: Unexpected reschedule of offline CPU#1!
[ 103.863762] ------------[ cut here ]------------
[2] kernel hang in blk_mq_freeze_queue_wait() when CONFIG_BLK_DEV_INTEGRITY is off
[ 247.129825] INFO: task nvme-test:1772 blocked for more than 120 seconds.
[ 247.137311] Not tainted 4.12.0-rc2.upstream+ #4
[ 247.142954] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 247.151704] Call Trace:
[ 247.154445] __schedule+0x28a/0x880
[ 247.158341] schedule+0x36/0x80
[ 247.161850] blk_mq_freeze_queue_wait+0x4b/0xb0
[ 247.166913] ? remove_wait_queue+0x60/0x60
[ 247.171485] blk_freeze_queue+0x1a/0x20
[ 247.175770] blk_cleanup_queue+0x7f/0x140
[ 247.180252] nvme_ns_remove+0xa3/0xb0 [nvme_core]
[ 247.185503] nvme_remove_namespaces+0x32/0x50 [nvme_core]
[ 247.191532] nvme_uninit_ctrl+0x2d/0xa0 [nvme_core]
[ 247.196977] nvme_remove+0x70/0x110 [nvme]
[ 247.201545] pci_device_remove+0x39/0xc0
[ 247.205927] device_release_driver_internal+0x141/0x200
[ 247.211761] device_release_driver+0x12/0x20
[ 247.216531] pci_stop_bus_device+0x8c/0xa0
[ 247.221104] pci_stop_and_remove_bus_device_locked+0x1a/0x30
[ 247.227420] remove_store+0x7c/0x90
[ 247.231320] dev_attr_store+0x18/0x30
[ 247.235409] sysfs_kf_write+0x3a/0x50
[ 247.239497] kernfs_fop_write+0xff/0x180
[ 247.243867] __vfs_write+0x37/0x160
[ 247.247757] ? selinux_file_permission+0xe5/0x120
[ 247.253011] ? security_file_permission+0x3b/0xc0
[ 247.258260] vfs_write+0xb2/0x1b0
[ 247.261964] ? syscall_trace_enter+0x1d0/0x2b0
[ 247.266924] SyS_write+0x55/0xc0
[ 247.270540] do_syscall_64+0x67/0x150
[ 247.274636] entry_SYSCALL64_slow_path+0x25/0x25
[ 247.279794] RIP: 0033:0x7f5c96740840
[ 247.283785] RSP: 002b:00007ffd00e87ee8 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
[ 247.292238] RAX: ffffffffffffffda RBX: 0000000000000002 RCX: 00007f5c96740840
[ 247.300194] RDX: 0000000000000002 RSI: 00007f5c97060000 RDI: 0000000000000001
[ 247.308159] RBP: 00007f5c97060000 R08: 000000000000000a R09: 00007f5c97059740
[ 247.316123] R10: 0000000000000001 R11: 0000000000000246 R12: 00007f5c96a14400
[ 247.324087] R13: 0000000000000002 R14: 0000000000000001 R15: 0000000000000000
[ 370.016340] INFO: task nvme-test:1772 blocked for more than 120 seconds.
Fixes: 12d70958a2e8(blk-mq: don't fail allocating driver tag for stopped hw queue)
Cc: stable@vger.kernel.org
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Bart Van Assche <Bart.VanAssche@sandisk.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
|
|
When direct issue is done on request picked up from plug list,
the hctx need to be updated with the actual hw queue, otherwise
wrong hctx is used and may hurt performance, especially when
wrong SRCU readlock is acquired/released
Reported-by: Bart Van Assche <bart.vanassche@sandisk.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
|
|
And the uuid helpers.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Reviewed-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
|
|
If bio has no data, such as ones from blkdev_issue_flush(),
then we have nothing to protect.
This patch prevent bugon like follows:
kfree_debugcheck: out of range ptr ac1fa1d106742a5ah
kernel BUG at mm/slab.c:2773!
invalid opcode: 0000 [#1] SMP
Modules linked in: bcache
CPU: 0 PID: 4428 Comm: xfs_io Tainted: G W 4.11.0-rc4-ext4-00041-g2ef0043-dirty #43
Hardware name: Virtuozzo KVM, BIOS seabios-1.7.5-11.vz7.4 04/01/2014
task: ffff880137786440 task.stack: ffffc90000ba8000
RIP: 0010:kfree_debugcheck+0x25/0x2a
RSP: 0018:ffffc90000babde0 EFLAGS: 00010082
RAX: 0000000000000034 RBX: ac1fa1d106742a5a RCX: 0000000000000007
RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff88013f3ccb40
RBP: ffffc90000babde8 R08: 0000000000000000 R09: 0000000000000000
R10: 00000000fcb76420 R11: 00000000725172ed R12: 0000000000000282
R13: ffffffff8150e766 R14: ffff88013a145e00 R15: 0000000000000001
FS: 00007fb09384bf40(0000) GS:ffff88013f200000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007fd0172f9e40 CR3: 0000000137fa9000 CR4: 00000000000006f0
Call Trace:
kfree+0xc8/0x1b3
bio_integrity_free+0xc3/0x16b
bio_free+0x25/0x66
bio_put+0x14/0x26
blkdev_issue_flush+0x7a/0x85
blkdev_fsync+0x35/0x42
vfs_fsync_range+0x8e/0x9f
vfs_fsync+0x1c/0x1e
do_fsync+0x31/0x4a
SyS_fsync+0x10/0x14
entry_SYSCALL_64_fastpath+0x1f/0xc2
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
Signed-off-by: Jens Axboe <axboe@fb.com>
|
|
Since BSG only supports request queues for which struct scsi_request
is the first member of their private request data, refuse to register
block layer queues for which struct scsi_request is not the first
member of their private data.
References: commit bd1599d931ca ("scsi_transport_sas: fix BSG ioctl memory corruption")
References: commit 82ed4db499b8 ("block: split scsi_request out of struct request")
Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Cc: Omar Sandoval <osandov@fb.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
|
|
From the context where a SCSI command is submitted it is not always
possible to figure out whether or not the queue the command is
submitted to has struct scsi_request as the first member of its
private data. Hence introduce the flag QUEUE_FLAG_SCSI_PASSTHROUGH.
Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Cc: Omar Sandoval <osandov@fb.com>
Cc: Don Brace <don.brace@microsemi.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
|
|
Since the introduction of .init_rq_fn() and .exit_rq_fn() it is
essential that the memory allocated for struct request_queue
stays around until all blk_exit_rl() calls have finished. Hence
make blk_init_rl() take a reference on struct request_queue.
This patch fixes the following crash:
general protection fault: 0000 [#2] SMP
CPU: 3 PID: 28 Comm: ksoftirqd/3 Tainted: G D 4.12.0-rc2-dbg+ #2
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.0.0-prebuilt.qemu-project.org 04/01/2014
task: ffff88013a108040 task.stack: ffffc9000071c000
RIP: 0010:free_request_size+0x1a/0x30
RSP: 0018:ffffc9000071fd38 EFLAGS: 00010202
RAX: 6b6b6b6b6b6b6b6b RBX: ffff880067362a88 RCX: 0000000000000003
RDX: ffff880067464178 RSI: ffff880067362a88 RDI: ffff880135ea4418
RBP: ffffc9000071fd40 R08: 0000000000000000 R09: 0000000100180009
R10: ffffc9000071fd38 R11: ffffffff81110800 R12: ffff88006752d3d8
R13: ffff88006752d3d8 R14: ffff88013a108040 R15: 000000000000000a
FS: 0000000000000000(0000) GS:ffff88013fd80000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007fa8ec1edb00 CR3: 0000000138ee8000 CR4: 00000000001406e0
Call Trace:
mempool_destroy.part.10+0x21/0x40
mempool_destroy+0xe/0x10
blk_exit_rl+0x12/0x20
blkg_free+0x4d/0xa0
__blkg_release_rcu+0x59/0x170
rcu_process_callbacks+0x260/0x4e0
__do_softirq+0x116/0x250
smpboot_thread_fn+0x123/0x1e0
kthread+0x109/0x140
ret_from_fork+0x31/0x40
Fixes: commit e9c787e65c0c ("scsi: allocate scsi_cmnd structures as part of struct request")
Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com>
Acked-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Cc: Jan Kara <jack@suse.cz>
Cc: <stable@vger.kernel.org> # v4.11+
Signed-off-by: Jens Axboe <axboe@fb.com>
|
|
Running a queue causes the block layer to examine the per-CPU and
hw queues but not the requeue list. Hence add a 'kick' operation
that also examines the requeue list.
Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Eduardo Valentin <eduval@amazon.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Hannes Reinecke <hare@suse.com>
Cc: Omar Sandoval <osandov@fb.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
|
|
Requests that got stuck in a block driver are neither on
blk_mq_ctx.rq_list nor on any hw dispatch queue. Make these
visible in debugfs through the "busy" attribute.
Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com>
Reviewed-by: Eduardo Valentin <eduval@amazon.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Hannes Reinecke <hare@suse.com>
Cc: Omar Sandoval <osandov@fb.com>
Cc: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
|
|
When verifying whether or not a blk-mq driver forgot to kick the
requeue list after having requeued a request it is important to
be able to verify the contents of the requeue list. Hence export
that list through debugfs.
Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Eduardo Valentin <eduval@amazon.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Omar Sandoval <osandov@fb.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
|
|
When analyzing e.g. queue lockups it is important to know whether
or not a request has already been started. Hence also show the
atomic request flags.
Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Eduardo Valentin <eduval@amazon.com>
Cc: Omar Sandoval <osandov@fb.com>
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
|
|
When adding a cfq_group into the cfq service tree, we use CFQ_IDLE_DELAY
as the delay of cfq_group's vdisktime if there have been other cfq_groups
already.
When cfq is under iops mode, commit 9a7f38c42c2b ("cfq-iosched: Convert
from jiffies to nanoseconds") could result in a large iops delay and
lead to an abnormal io schedule delay for the added cfq_group. To fix
it, we just need to revert to the old CFQ_IDLE_DELAY value: HZ / 5
when iops mode is enabled.
Despite having the same value, the delay of a cfq_queue in idle class
and the delay of cfq_group are different things, so I define two new
macros for the delay of a cfq_group under time-slice mode and iops mode.
Fixes: 9a7f38c42c2b ("cfq-iosched: Convert from jiffies to nanoseconds")
Cc: <stable@vger.kernel.org> # 4.8+
Signed-off-by: Hou Tao <houtao1@huawei.com>
Acked-by: Jan Kara <jack@suse.cz>
Signed-off-by: Jens Axboe <axboe@fb.com>
|
|
The tagset lock needs to be held when iterating the tag_list, so a
lockdep assert was added when updating number of hardware queues. The
drivers calling this API, however, were unaware of the new requirement,
so are failing the assertion.
This patch takes the lock within the blk-mq function so the drivers do
not have to be modified in order to be safe.
Fixes: 705cda97e ("blk-mq: Make it safe to use RCU to iterate over blk_mq_tag_set.tag_list")
Reported-by: Gabriel Krisman Bertazi <krisman@collabora.co.uk>
Reviewed-by: Bart Van Assche <Bart.VanAssche@sandisk.com>
Signed-off-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
|
|
This fixes the following warning when building with clang:
block/cfq-iosched.c:970:19: error: unused function 'min_vdisktime'
[-Werror,-Wunused-function]
Signed-off-by: Matthias Kaehlcke <mka@chromium.org>
Signed-off-by: Jens Axboe <axboe@fb.com>
|
|
Because what the per-sw-queue bio merge does is basically same with
scheduler's .bio_merge(), this patch makes per-sw-queue bio merge
as the default .bio_merge if no scheduler is used or io scheduler
doesn't provide .bio_merge().
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
|
|
Before blk-mq is introduced, I/O is merged to elevator
before being putted into plug queue, but blk-mq changed the
order and makes merging to sw queue basically impossible.
Then it is observed that throughput of sequential I/O is degraded
about 10%~20% on virtio-blk in the test[1] if mq-deadline isn't used.
This patch moves the bio merging per sw queue before plugging,
like what blk_queue_bio() does, and the performance regression is
fixed under this situation.
[1]. test script:
sudo fio --direct=1 --size=128G --bsrange=4k-4k --runtime=40 --numjobs=16 --ioengine=libaio --iodepth=64 --group_reporting=1 --filename=/dev/vdb --name=virtio_blk-test-$RW --rw=$RW --output-format=json
RW=read or write
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
|
|
Christoph writes:
"A couple of fixes for the next rc on the nvme front. Various FC fixes
from James, controller removal fixes from Ming (including a block layer
patch), a APST related device quirk from Andy, a RDMA fix for small
queue depth device from Marta, as well as fixes for the lack of
metadata support in non-PCIe drivers and the printk logging format from
me."
|
|
The code in blk-mq-debugfs.c assumes that it is working on a blk-mq
queue and is not intended to work on a blk-sq queue. Hence only
register blk-mq debugfs attributes for blk-mq queues.
Fixes: commit 9c1051aacde8 ("blk-mq: untangle debugfs and sysfs")
Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
|
|
The code in block/partitions/msdos.c recognizes FreeBSD, OpenBSD
and NetBSD partitions and does a reasonable job picking out OpenBSD
and NetBSD UFS subpartitions.
But for FreeBSD the subpartitions are always "bad".
Kernel: <bsd:bad subpartition - ignored
Though all 3 of these BSD systems use UFS as a file system, only
FreeBSD uses relative start addresses in the subpartition
declarations.
The following patch fixes this for FreeBSD partitions and leaves
the code for OpenBSD and NetBSD intact:
Signed-off-by: Richard Narron <comet.berkeley@gmail.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
|
|
We don't set an error code on this path. It means that we return NULL
instead of an error pointer and the caller does a NULL dereference.
Fixes: 6d1d8050b4bc ("block, partition: add partition_meta_info to hd_struct")
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
|
|
Default value of io.low limit is 0. If user doesn't configure the limit,
last patch makes cgroup be throttled to very tiny bps/iops, which could
stall the system. A cgroup with default settings of io.low limit really
means nothing, so we force user to configure all settings, otherwise
io.low limit doesn't take effect. With this stragety, default setting of
latency/idle isn't important, so just set them to very conservative and
safe value.
Signed-off-by: Shaohua Li <shli@fb.com>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@fb.com>
|
|
If a cgroup with low limit 0 for both bps/iops, the cgroup's low limit
is ignored and we throttle the cgroup with its max limit. In this way,
other cgroups with a low limit will not get protected. To fix this, we
don't do the exception any more. cgroup will be throttled to a limit 0
if it uese default setting. To avoid completed stall, we give such
cgroup tiny IO resources.
Signed-off-by: Shaohua Li <shli@fb.com>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@fb.com>
|
|
These info are important to understand what's happening and help debug.
Signed-off-by: Shaohua Li <shli@fb.com>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@fb.com>
|
|
For idle time, children's setting should not be bigger than parent's.
For latency target, children's setting should not be smaller than
parent's. The leaf nodes will adjust their settings according to the
hierarchy and compare their IO with the settings and do
upgrade/downgrade. parents nodes don't need to track their IO
latency/idle time.
Signed-off-by: Shaohua Li <shli@fb.com>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@fb.com>
|
|
No one uses it any more, so remove it.
Reviewed-by: Keith Busch <keith.busch@intel.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm
Pull libnvdimm fixes from Dan Williams:
"Incremental fixes and a small feature addition on top of the main
libnvdimm 4.12 pull request:
- Geert noticed that tinyconfig was bloated by BLOCK selecting DAX.
The size regression is fixed by moving all dax helpers into the
dax-core and only specifying "select DAX" for FS_DAX and
dax-capable drivers. He also asked for clarification of the
NR_DEV_DAX config option which, on closer look, does not need to be
a config option at all. Mike also throws in a DEV_DAX_PMEM fixup
for good measure.
- Ben's attention to detail on -stable patch submissions caught a
case where the recent fixes to arch_copy_from_iter_pmem() missed a
condition where we strand dirty data in the cache. This is tagged
for -stable and will also be included in the rework of the pmem api
to a proposed {memcpy,copy_user}_flushcache() interface for 4.13.
- Vishal adds a feature that missed the initial pull due to pending
review feedback. It allows the kernel to clear media errors when
initializing a BTT (atomic sector update driver) instance on a pmem
namespace.
- Ross noticed that the dax_device + dax_operations conversion broke
__dax_zero_page_range(). The nvdimm unit tests fail to check this
path, but xfstests immediately trips over it. No excuse for missing
this before submitting the 4.12 pull request.
These all pass the nvdimm unit tests and an xfstests spot check. The
set has received a build success notification from the kbuild robot"
* 'libnvdimm-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm:
filesystem-dax: fix broken __dax_zero_page_range() conversion
libnvdimm, btt: ensure that initializing metadata clears poison
libnvdimm: add an atomic vs process context flag to rw_bytes
x86, pmem: Fix cache flushing for iovec write < 8 bytes
device-dax: kill NR_DEV_DAX
block, dax: move "select DAX" from BLOCK to FS_DAX
device-dax: Tell kbuild DEV_DAX_PMEM depends on DEV_DAX
|
|
SCSI devices can return short writes on Write Same just like for normal
writes, so we need to handle this case for our special payload requests
as well.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reported-by: Abdul Haleem <abdhalee@linux.vnet.ibm.com>
Tested-by: Abdul Haleem <abdhalee@linux.vnet.ibm.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
|
|
When formatting NVMe to 512B/4K + T10 DIf/DIX, dd with split op returns
"Input/output error". Looks block layer split the bio after calling
bio_integrity_prep(bio). This patch fixes the issue.
Below is how we debug this issue:
(1)format nvme to 4K block # size with type 2 DIF
(2)dd with block size bigger than 1024k.
oflag=direct
dd: error writing '/dev/nvme0n1': Input/output error
We added some debug code in nvme device driver. It showed us the first
op and the second op have the same bi and pi address. This is not
correct.
1st op: nvme0n1 Op:Wr slba 0x505 length 0x100, PI ctrl=0x1400,
dsmgmt=0x0, AT=0x0 & RT=0x505
Guard 0x00b1, AT 0x0000, RT physical 0x00000505 RT virtual 0x00002828
2nd op: nvme0n1 Op:Wr slba 0x605 length 0x1, PI ctrl=0x1400, dsmgmt=0x0,
AT=0x0 & RT=0x605 ==> This op fails and subsequent 5 retires..
Guard 0x00b1, AT 0x0000, RT physical 0x00000605 RT virtual 0x00002828
With the fix, It showed us both of the first op and the second op have
correct bi and pi address.
1st op: nvme2n1 Op:Wr slba 0x505 length 0x100, PI ctrl=0x1400,
dsmgmt=0x0, AT=0x0 & RT=0x505
Guard 0x5ccb, AT 0x0000, RT physical 0x00000505 RT virtual
0x00002828
2nd op: nvme2n1 Op:Wr slba 0x605 length 0x1, PI ctrl=0x1400, dsmgmt=0x0,
AT=0x0 & RT=0x605
Guard 0xab4c, AT 0x0000, RT physical 0x00000605 RT virtual
0x00003028
Signed-off-by: Wen Xiong <wenxiong@linux.vnet.ibm.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
|
|
If PREEMPT_RCU is enabled, rcu_read_lock() isn't strong enough
for us to use this_cpu_ptr() in that section. Use the safer
get/put_cpu_ptr() variants instead.
Reported-by: Mike Galbraith <efault@gmx.de>
Fixes: 34dbad5d26e2 ("blk-stat: convert to callback-based statistics reporting")
Signed-off-by: Jens Axboe <axboe@fb.com>
|
|
We warn twice for switching to a scheduler, if that switch fails.
As we also report the failure in the return value to the
sysfs write, remove the dmesg induced failures.
Keep the failure print for warning to switch to the kconfig
selected IO scheduler, as we can't report errors for that in
any other way.
Signed-off-by: Jens Axboe <axboe@fb.com>
|
|
The introduction of the BFQ and Kyber I/O schedulers has triggered a
new wave of I/O benchmarks. Unfortunately, comments and discussions on
these benchmarks confirm that there is still little awareness that it
is very hard to achieve, at the same time, a low latency and a high
throughput. In particular, virtually all benchmarks measure
throughput, or throughput-related figures of merit, but, for BFQ, they
use the scheduler in its default configuration. This configuration is
geared, instead, toward a low latency. This is evidently a sign that
BFQ documentation is still too unclear on this important aspect. This
commit addresses this issue by stressing how BFQ configuration must be
(easily) changed if the only goal is maximum throughput.
Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
Signed-off-by: Jens Axboe <axboe@fb.com>
|
|
In the function __bfq_deactivate_entity, the pointer
entity->sched_data could happen to be used before being properly
initialized. This led to a NULL pointer dereference. This commit fixes
this bug by just using this pointer only where it is safe to do so.
Reported-by: Tom Harrison <l12436.tw@gmail.com>
Tested-by: Tom Harrison <l12436.tw@gmail.com>
Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
Signed-off-by: Jens Axboe <axboe@fb.com>
|