summaryrefslogtreecommitdiffstats
path: root/Documentation
diff options
context:
space:
mode:
authorLinus Torvalds <torvalds@linux-foundation.org>2016-12-13 10:19:16 -0800
committerLinus Torvalds <torvalds@linux-foundation.org>2016-12-13 10:19:16 -0800
commit36869cb93d36269f34800b3384ba7991060a69cf (patch)
tree1ff266dcb3386bb1403494aa89647a96fd2396cd /Documentation
parent9439b3710df688d853eb6cb4851256f2c92b1797 (diff)
parent7cd54aa8438947602cf68eda1db327822b9b8e6b (diff)
downloadlinux-36869cb93d36269f34800b3384ba7991060a69cf.tar.bz2
Merge branch 'for-4.10/block' of git://git.kernel.dk/linux-block
Pull block layer updates from Jens Axboe: "This is the main block pull request this series. Contrary to previous release, I've kept the core and driver changes in the same branch. We always ended up having dependencies between the two for obvious reasons, so makes more sense to keep them together. That said, I'll probably try and keep more topical branches going forward, especially for cycles that end up being as busy as this one. The major parts of this pull request is: - Improved support for O_DIRECT on block devices, with a small private implementation instead of using the pig that is fs/direct-io.c. From Christoph. - Request completion tracking in a scalable fashion. This is utilized by two components in this pull, the new hybrid polling and the writeback queue throttling code. - Improved support for polling with O_DIRECT, adding a hybrid mode that combines pure polling with an initial sleep. From me. - Support for automatic throttling of writeback queues on the block side. This uses feedback from the device completion latencies to scale the queue on the block side up or down. From me. - Support from SMR drives in the block layer and for SD. From Hannes and Shaun. - Multi-connection support for nbd. From Josef. - Cleanup of request and bio flags, so we have a clear split between which are bio (or rq) private, and which ones are shared. From Christoph. - A set of patches from Bart, that improve how we handle queue stopping and starting in blk-mq. - Support for WRITE_ZEROES from Chaitanya. - Lightnvm updates from Javier/Matias. - Supoort for FC for the nvme-over-fabrics code. From James Smart. - A bunch of fixes from a whole slew of people, too many to name here" * 'for-4.10/block' of git://git.kernel.dk/linux-block: (182 commits) blk-stat: fix a few cases of missing batch flushing blk-flush: run the queue when inserting blk-mq flush elevator: make the rqhash helpers exported blk-mq: abstract out blk_mq_dispatch_rq_list() helper blk-mq: add blk_mq_start_stopped_hw_queue() block: improve handling of the magic discard payload blk-wbt: don't throttle discard or write zeroes nbd: use dev_err_ratelimited in io path nbd: reset the setup task for NBD_CLEAR_SOCK nvme-fabrics: Add FC LLDD loopback driver to test FC-NVME nvme-fabrics: Add target support for FC transport nvme-fabrics: Add host support for FC transport nvme-fabrics: Add FC transport LLDD api definitions nvme-fabrics: Add FC transport FC-NVME definitions nvme-fabrics: Add FC transport error codes to nvme.h Add type 0x28 NVME type code to scsi fc headers nvme-fabrics: patch target code in prep for FC transport support nvme-fabrics: set sqe.command_id in core not transports parser: add u64 number parser nvme-rdma: align to generic ib_event logging helper ...
Diffstat (limited to 'Documentation')
-rw-r--r--Documentation/ABI/testing/sysfs-block42
-rw-r--r--Documentation/block/biodoc.txt6
-rw-r--r--Documentation/block/cfq-iosched.txt32
-rw-r--r--Documentation/block/null_blk.txt2
-rw-r--r--Documentation/block/queue-sysfs.txt23
5 files changed, 85 insertions, 20 deletions
diff --git a/Documentation/ABI/testing/sysfs-block b/Documentation/ABI/testing/sysfs-block
index 71d184dbb70d..2da04ce6aeef 100644
--- a/Documentation/ABI/testing/sysfs-block
+++ b/Documentation/ABI/testing/sysfs-block
@@ -235,3 +235,45 @@ Description:
write_same_max_bytes is 0, write same is not supported
by the device.
+What: /sys/block/<disk>/queue/write_zeroes_max_bytes
+Date: November 2016
+Contact: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
+Description:
+ Devices that support write zeroes operation in which a
+ single request can be issued to zero out the range of
+ contiguous blocks on storage without having any payload
+ in the request. This can be used to optimize writing zeroes
+ to the devices. write_zeroes_max_bytes indicates how many
+ bytes can be written in a single write zeroes command. If
+ write_zeroes_max_bytes is 0, write zeroes is not supported
+ by the device.
+
+What: /sys/block/<disk>/queue/zoned
+Date: September 2016
+Contact: Damien Le Moal <damien.lemoal@hgst.com>
+Description:
+ zoned indicates if the device is a zoned block device
+ and the zone model of the device if it is indeed zoned.
+ The possible values indicated by zoned are "none" for
+ regular block devices and "host-aware" or "host-managed"
+ for zoned block devices. The characteristics of
+ host-aware and host-managed zoned block devices are
+ described in the ZBC (Zoned Block Commands) and ZAC
+ (Zoned Device ATA Command Set) standards. These standards
+ also define the "drive-managed" zone model. However,
+ since drive-managed zoned block devices do not support
+ zone commands, they will be treated as regular block
+ devices and zoned will report "none".
+
+What: /sys/block/<disk>/queue/chunk_sectors
+Date: September 2016
+Contact: Hannes Reinecke <hare@suse.com>
+Description:
+ chunk_sectors has different meaning depending on the type
+ of the disk. For a RAID device (dm-raid), chunk_sectors
+ indicates the size in 512B sectors of the RAID volume
+ stripe segment. For a zoned block device, either
+ host-aware or host-managed, chunk_sectors indicates the
+ size of 512B sectors of the zones of the device, with
+ the eventual exception of the last zone of the device
+ which may be smaller.
diff --git a/Documentation/block/biodoc.txt b/Documentation/block/biodoc.txt
index 918e1e0d0e78..01ddeaf64b0f 100644
--- a/Documentation/block/biodoc.txt
+++ b/Documentation/block/biodoc.txt
@@ -348,7 +348,7 @@ Drivers can now specify a request prepare function (q->prep_rq_fn) that the
block layer would invoke to pre-build device commands for a given request,
or perform other preparatory processing for the request. This is routine is
called by elv_next_request(), i.e. typically just before servicing a request.
-(The prepare function would not be called for requests that have REQ_DONTPREP
+(The prepare function would not be called for requests that have RQF_DONTPREP
enabled)
Aside:
@@ -553,8 +553,8 @@ struct request {
struct request_list *rl;
}
-See the rq_flag_bits definitions for an explanation of the various flags
-available. Some bits are used by the block layer or i/o scheduler.
+See the req_ops and req_flag_bits definitions for an explanation of the various
+flags available. Some bits are used by the block layer or i/o scheduler.
The behaviour of the various sector counts are almost the same as before,
except that since we have multi-segment bios, current_nr_sectors refers
diff --git a/Documentation/block/cfq-iosched.txt b/Documentation/block/cfq-iosched.txt
index 1e4f835a659d..895bd3813115 100644
--- a/Documentation/block/cfq-iosched.txt
+++ b/Documentation/block/cfq-iosched.txt
@@ -240,11 +240,11 @@ All cfq queues doing synchronous sequential IO go on to sync-idle tree.
On this tree we idle on each queue individually.
All synchronous non-sequential queues go on sync-noidle tree. Also any
-request which are marked with REQ_NOIDLE go on this service tree. On this
-tree we do not idle on individual queues instead idle on the whole group
-of queues or the tree. So if there are 4 queues waiting for IO to dispatch
-we will idle only once last queue has dispatched the IO and there is
-no more IO on this service tree.
+synchronous write request which is not marked with REQ_IDLE goes on this
+service tree. On this tree we do not idle on individual queues instead idle
+on the whole group of queues or the tree. So if there are 4 queues waiting
+for IO to dispatch we will idle only once last queue has dispatched the IO
+and there is no more IO on this service tree.
All async writes go on async service tree. There is no idling on async
queues.
@@ -257,17 +257,17 @@ tree idling provides isolation with buffered write queues on async tree.
FAQ
===
-Q1. Why to idle at all on queues marked with REQ_NOIDLE.
+Q1. Why to idle at all on queues not marked with REQ_IDLE.
-A1. We only do tree idle (all queues on sync-noidle tree) on queues marked
- with REQ_NOIDLE. This helps in providing isolation with all the sync-idle
+A1. We only do tree idle (all queues on sync-noidle tree) on queues not marked
+ with REQ_IDLE. This helps in providing isolation with all the sync-idle
queues. Otherwise in presence of many sequential readers, other
synchronous IO might not get fair share of disk.
For example, if there are 10 sequential readers doing IO and they get
- 100ms each. If a REQ_NOIDLE request comes in, it will be scheduled
- roughly after 1 second. If after completion of REQ_NOIDLE request we
- do not idle, and after a couple of milli seconds a another REQ_NOIDLE
+ 100ms each. If a !REQ_IDLE request comes in, it will be scheduled
+ roughly after 1 second. If after completion of !REQ_IDLE request we
+ do not idle, and after a couple of milli seconds a another !REQ_IDLE
request comes in, again it will be scheduled after 1second. Repeat it
and notice how a workload can lose its disk share and suffer due to
multiple sequential readers.
@@ -276,16 +276,16 @@ A1. We only do tree idle (all queues on sync-noidle tree) on queues marked
context of fsync, and later some journaling data is written. Journaling
data comes in only after fsync has finished its IO (atleast for ext4
that seemed to be the case). Now if one decides not to idle on fsync
- thread due to REQ_NOIDLE, then next journaling write will not get
+ thread due to !REQ_IDLE, then next journaling write will not get
scheduled for another second. A process doing small fsync, will suffer
badly in presence of multiple sequential readers.
- Hence doing tree idling on threads using REQ_NOIDLE flag on requests
+ Hence doing tree idling on threads using !REQ_IDLE flag on requests
provides isolation from multiple sequential readers and at the same
time we do not idle on individual threads.
-Q2. When to specify REQ_NOIDLE
-A2. I would think whenever one is doing synchronous write and not expecting
+Q2. When to specify REQ_IDLE
+A2. I would think whenever one is doing synchronous write and expecting
more writes to be dispatched from same context soon, should be able
- to specify REQ_NOIDLE on writes and that probably should work well for
+ to specify REQ_IDLE on writes and that probably should work well for
most of the cases.
diff --git a/Documentation/block/null_blk.txt b/Documentation/block/null_blk.txt
index d8880ca30af4..3140dbd860d8 100644
--- a/Documentation/block/null_blk.txt
+++ b/Documentation/block/null_blk.txt
@@ -72,4 +72,4 @@ use_per_node_hctx=[0/1]: Default: 0
queue for each CPU node in the system.
use_lightnvm=[0/1]: Default: 0
- Register device with LightNVM. Requires blk-mq to be used.
+ Register device with LightNVM. Requires blk-mq and CONFIG_NVM to be enabled.
diff --git a/Documentation/block/queue-sysfs.txt b/Documentation/block/queue-sysfs.txt
index 2a3904030dea..51642159aedb 100644
--- a/Documentation/block/queue-sysfs.txt
+++ b/Documentation/block/queue-sysfs.txt
@@ -58,6 +58,20 @@ When read, this file shows the total number of block IO polls and how
many returned success. Writing '0' to this file will disable polling
for this device. Writing any non-zero value will enable this feature.
+io_poll_delay (RW)
+------------------
+If polling is enabled, this controls what kind of polling will be
+performed. It defaults to -1, which is classic polling. In this mode,
+the CPU will repeatedly ask for completions without giving up any time.
+If set to 0, a hybrid polling mode is used, where the kernel will attempt
+to make an educated guess at when the IO will complete. Based on this
+guess, the kernel will put the process issuing IO to sleep for an amount
+of time, before entering a classic poll loop. This mode might be a
+little slower than pure classic polling, but it will be more efficient.
+If set to a value larger than 0, the kernel will put the process issuing
+IO to sleep for this amont of microseconds before entering classic
+polling.
+
iostats (RW)
-------------
This file is used to control (on/off) the iostats accounting of the
@@ -169,5 +183,14 @@ This is the number of bytes the device can write in a single write-same
command. A value of '0' means write-same is not supported by this
device.
+wb_lat_usec (RW)
+----------------
+If the device is registered for writeback throttling, then this file shows
+the target minimum read latency. If this latency is exceeded in a given
+window of time (see wb_window_usec), then the writeback throttling will start
+scaling back writes. Writing a value of '0' to this file disables the
+feature. Writing a value of '-1' to this file resets the value to the
+default setting.
+
Jens Axboe <jens.axboe@oracle.com>, February 2009