diff options
Diffstat (limited to 'Documentation')
-rw-r--r-- | Documentation/ABI/testing/sysfs-block | 12 | ||||
-rw-r--r-- | Documentation/admin-guide/cgroup-v2.rst | 8 | ||||
-rw-r--r-- | Documentation/block/biodoc.txt | 88 | ||||
-rw-r--r-- | Documentation/block/cfq-iosched.txt | 291 | ||||
-rw-r--r-- | Documentation/block/queue-sysfs.txt | 29 | ||||
-rw-r--r-- | Documentation/scsi/scsi-parameters.txt | 5 |
6 files changed, 43 insertions, 390 deletions
diff --git a/Documentation/ABI/testing/sysfs-block b/Documentation/ABI/testing/sysfs-block index dea212db9df3..7710d4022b19 100644 --- a/Documentation/ABI/testing/sysfs-block +++ b/Documentation/ABI/testing/sysfs-block @@ -244,7 +244,7 @@ Description: What: /sys/block/<disk>/queue/zoned Date: September 2016 -Contact: Damien Le Moal <damien.lemoal@hgst.com> +Contact: Damien Le Moal <damien.lemoal@wdc.com> Description: zoned indicates if the device is a zoned block device and the zone model of the device if it is indeed zoned. @@ -259,6 +259,14 @@ Description: zone commands, they will be treated as regular block devices and zoned will report "none". +What: /sys/block/<disk>/queue/nr_zones +Date: November 2018 +Contact: Damien Le Moal <damien.lemoal@wdc.com> +Description: + nr_zones indicates the total number of zones of a zoned block + device ("host-aware" or "host-managed" zone model). For regular + block devices, the value is always 0. + What: /sys/block/<disk>/queue/chunk_sectors Date: September 2016 Contact: Hannes Reinecke <hare@suse.com> @@ -268,6 +276,6 @@ Description: indicates the size in 512B sectors of the RAID volume stripe segment. For a zoned block device, either host-aware or host-managed, chunk_sectors indicates the - size of 512B sectors of the zones of the device, with + size in 512B sectors of the zones of the device, with the eventual exception of the last zone of the device which may be smaller. diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst index 476722b7b636..baf19bf28385 100644 --- a/Documentation/admin-guide/cgroup-v2.rst +++ b/Documentation/admin-guide/cgroup-v2.rst @@ -1879,8 +1879,10 @@ following two functions. wbc_init_bio(@wbc, @bio) Should be called for each bio carrying writeback data and - associates the bio with the inode's owner cgroup. Can be - called anytime between bio allocation and submission. + associates the bio with the inode's owner cgroup and the + corresponding request queue. This must be called after + a queue (device) has been associated with the bio and + before submission. wbc_account_io(@wbc, @page, @bytes) Should be called for each data segment being written out. @@ -1899,7 +1901,7 @@ the configuration, the bio may be executed at a lower priority and if the writeback session is holding shared resources, e.g. a journal entry, may lead to priority inversion. There is no one easy solution for the problem. Filesystems can try to work around specific problem -cases by skipping wbc_init_bio() or using bio_associate_blkcg() +cases by skipping wbc_init_bio() and using bio_associate_blkg() directly. diff --git a/Documentation/block/biodoc.txt b/Documentation/block/biodoc.txt index 207eca58efaa..ac18b488cb5e 100644 --- a/Documentation/block/biodoc.txt +++ b/Documentation/block/biodoc.txt @@ -65,7 +65,6 @@ Description of Contents: 3.2.3 I/O completion 3.2.4 Implications for drivers that do not interpret bios (don't handle multiple segments) - 3.2.5 Request command tagging 3.3 I/O submission 4. The I/O scheduler 5. Scalability related changes @@ -708,93 +707,6 @@ is crossed on completion of a transfer. (The end*request* functions should be used if only if the request has come down from block/bio path, not for direct access requests which only specify rq->buffer without a valid rq->bio) -3.2.5 Generic request command tagging - -3.2.5.1 Tag helpers - -Block now offers some simple generic functionality to help support command -queueing (typically known as tagged command queueing), ie manage more than -one outstanding command on a queue at any given time. - - blk_queue_init_tags(struct request_queue *q, int depth) - - Initialize internal command tagging structures for a maximum - depth of 'depth'. - - blk_queue_free_tags((struct request_queue *q) - - Teardown tag info associated with the queue. This will be done - automatically by block if blk_queue_cleanup() is called on a queue - that is using tagging. - -The above are initialization and exit management, the main helpers during -normal operations are: - - blk_queue_start_tag(struct request_queue *q, struct request *rq) - - Start tagged operation for this request. A free tag number between - 0 and 'depth' is assigned to the request (rq->tag holds this number), - and 'rq' is added to the internal tag management. If the maximum depth - for this queue is already achieved (or if the tag wasn't started for - some other reason), 1 is returned. Otherwise 0 is returned. - - blk_queue_end_tag(struct request_queue *q, struct request *rq) - - End tagged operation on this request. 'rq' is removed from the internal - book keeping structures. - -To minimize struct request and queue overhead, the tag helpers utilize some -of the same request members that are used for normal request queue management. -This means that a request cannot both be an active tag and be on the queue -list at the same time. blk_queue_start_tag() will remove the request, but -the driver must remember to call blk_queue_end_tag() before signalling -completion of the request to the block layer. This means ending tag -operations before calling end_that_request_last()! For an example of a user -of these helpers, see the IDE tagged command queueing support. - -3.2.5.2 Tag info - -Some block functions exist to query current tag status or to go from a -tag number to the associated request. These are, in no particular order: - - blk_queue_tagged(q) - - Returns 1 if the queue 'q' is using tagging, 0 if not. - - blk_queue_tag_request(q, tag) - - Returns a pointer to the request associated with tag 'tag'. - - blk_queue_tag_depth(q) - - Return current queue depth. - - blk_queue_tag_queue(q) - - Returns 1 if the queue can accept a new queued command, 0 if we are - at the maximum depth already. - - blk_queue_rq_tagged(rq) - - Returns 1 if the request 'rq' is tagged. - -3.2.5.2 Internal structure - -Internally, block manages tags in the blk_queue_tag structure: - - struct blk_queue_tag { - struct request **tag_index; /* array or pointers to rq */ - unsigned long *tag_map; /* bitmap of free tags */ - struct list_head busy_list; /* fifo list of busy tags */ - int busy; /* queue depth */ - int max_depth; /* max queue depth */ - }; - -Most of the above is simple and straight forward, however busy_list may need -a bit of explaining. Normally we don't care too much about request ordering, -but in the event of any barrier requests in the tag queue we need to ensure -that requests are restarted in the order they were queue. - 3.3 I/O Submission The routine submit_bio() is used to submit a single io. Higher level i/o diff --git a/Documentation/block/cfq-iosched.txt b/Documentation/block/cfq-iosched.txt deleted file mode 100644 index 895bd3813115..000000000000 --- a/Documentation/block/cfq-iosched.txt +++ /dev/null @@ -1,291 +0,0 @@ -CFQ (Complete Fairness Queueing) -=============================== - -The main aim of CFQ scheduler is to provide a fair allocation of the disk -I/O bandwidth for all the processes which requests an I/O operation. - -CFQ maintains the per process queue for the processes which request I/O -operation(synchronous requests). In case of asynchronous requests, all the -requests from all the processes are batched together according to their -process's I/O priority. - -CFQ ioscheduler tunables -======================== - -slice_idle ----------- -This specifies how long CFQ should idle for next request on certain cfq queues -(for sequential workloads) and service trees (for random workloads) before -queue is expired and CFQ selects next queue to dispatch from. - -By default slice_idle is a non-zero value. That means by default we idle on -queues/service trees. This can be very helpful on highly seeky media like -single spindle SATA/SAS disks where we can cut down on overall number of -seeks and see improved throughput. - -Setting slice_idle to 0 will remove all the idling on queues/service tree -level and one should see an overall improved throughput on faster storage -devices like multiple SATA/SAS disks in hardware RAID configuration. The down -side is that isolation provided from WRITES also goes down and notion of -IO priority becomes weaker. - -So depending on storage and workload, it might be useful to set slice_idle=0. -In general I think for SATA/SAS disks and software RAID of SATA/SAS disks -keeping slice_idle enabled should be useful. For any configurations where -there are multiple spindles behind single LUN (Host based hardware RAID -controller or for storage arrays), setting slice_idle=0 might end up in better -throughput and acceptable latencies. - -back_seek_max -------------- -This specifies, given in Kbytes, the maximum "distance" for backward seeking. -The distance is the amount of space from the current head location to the -sectors that are backward in terms of distance. - -This parameter allows the scheduler to anticipate requests in the "backward" -direction and consider them as being the "next" if they are within this -distance from the current head location. - -back_seek_penalty ------------------ -This parameter is used to compute the cost of backward seeking. If the -backward distance of request is just 1/back_seek_penalty from a "front" -request, then the seeking cost of two requests is considered equivalent. - -So scheduler will not bias toward one or the other request (otherwise scheduler -will bias toward front request). Default value of back_seek_penalty is 2. - -fifo_expire_async ------------------ -This parameter is used to set the timeout of asynchronous requests. Default -value of this is 248ms. - -fifo_expire_sync ----------------- -This parameter is used to set the timeout of synchronous requests. Default -value of this is 124ms. In case to favor synchronous requests over asynchronous -one, this value should be decreased relative to fifo_expire_async. - -group_idle ------------ -This parameter forces idling at the CFQ group level instead of CFQ -queue level. This was introduced after a bottleneck was observed -in higher end storage due to idle on sequential queue and allow dispatch -from a single queue. The idea with this parameter is that it can be run with -slice_idle=0 and group_idle=8, so that idling does not happen on individual -queues in the group but happens overall on the group and thus still keeps the -IO controller working. -Not idling on individual queues in the group will dispatch requests from -multiple queues in the group at the same time and achieve higher throughput -on higher end storage. - -Default value for this parameter is 8ms. - -low_latency ------------ -This parameter is used to enable/disable the low latency mode of the CFQ -scheduler. If enabled, CFQ tries to recompute the slice time for each process -based on the target_latency set for the system. This favors fairness over -throughput. Disabling low latency (setting it to 0) ignores target latency, -allowing each process in the system to get a full time slice. - -By default low latency mode is enabled. - -target_latency --------------- -This parameter is used to calculate the time slice for a process if cfq's -latency mode is enabled. It will ensure that sync requests have an estimated -latency. But if sequential workload is higher(e.g. sequential read), -then to meet the latency constraints, throughput may decrease because of less -time for each process to issue I/O request before the cfq queue is switched. - -Though this can be overcome by disabling the latency_mode, it may increase -the read latency for some applications. This parameter allows for changing -target_latency through the sysfs interface which can provide the balanced -throughput and read latency. - -Default value for target_latency is 300ms. - -slice_async ------------ -This parameter is same as of slice_sync but for asynchronous queue. The -default value is 40ms. - -slice_async_rq --------------- -This parameter is used to limit the dispatching of asynchronous request to -device request queue in queue's slice time. The maximum number of request that -are allowed to be dispatched also depends upon the io priority. Default value -for this is 2. - -slice_sync ----------- -When a queue is selected for execution, the queues IO requests are only -executed for a certain amount of time(time_slice) before switching to another -queue. This parameter is used to calculate the time slice of synchronous -queue. - -time_slice is computed using the below equation:- -time_slice = slice_sync + (slice_sync/5 * (4 - prio)). To increase the -time_slice of synchronous queue, increase the value of slice_sync. Default -value is 100ms. - -quantum -------- -This specifies the number of request dispatched to the device queue. In a -queue's time slice, a request will not be dispatched if the number of request -in the device exceeds this parameter. This parameter is used for synchronous -request. - -In case of storage with several disk, this setting can limit the parallel -processing of request. Therefore, increasing the value can improve the -performance although this can cause the latency of some I/O to increase due -to more number of requests. - -CFQ Group scheduling -==================== - -CFQ supports blkio cgroup and has "blkio." prefixed files in each -blkio cgroup directory. It is weight-based and there are four knobs -for configuration - weight[_device] and leaf_weight[_device]. -Internal cgroup nodes (the ones with children) can also have tasks in -them, so the former two configure how much proportion the cgroup as a -whole is entitled to at its parent's level while the latter two -configure how much proportion the tasks in the cgroup have compared to -its direct children. - -Another way to think about it is assuming that each internal node has -an implicit leaf child node which hosts all the tasks whose weight is -configured by leaf_weight[_device]. Let's assume a blkio hierarchy -composed of five cgroups - root, A, B, AA and AB - with the following -weights where the names represent the hierarchy. - - weight leaf_weight - root : 125 125 - A : 500 750 - B : 250 500 - AA : 500 500 - AB : 1000 500 - -root never has a parent making its weight is meaningless. For backward -compatibility, weight is always kept in sync with leaf_weight. B, AA -and AB have no child and thus its tasks have no children cgroup to -compete with. They always get 100% of what the cgroup won at the -parent level. Considering only the weights which matter, the hierarchy -looks like the following. - - root - / | \ - A B leaf - 500 250 125 - / | \ - AA AB leaf - 500 1000 750 - -If all cgroups have active IOs and competing with each other, disk -time will be distributed like the following. - -Distribution below root. The total active weight at this level is -A:500 + B:250 + C:125 = 875. - - root-leaf : 125 / 875 =~ 14% - A : 500 / 875 =~ 57% - B(-leaf) : 250 / 875 =~ 28% - -A has children and further distributes its 57% among the children and -the implicit leaf node. The total active weight at this level is -AA:500 + AB:1000 + A-leaf:750 = 2250. - - A-leaf : ( 750 / 2250) * A =~ 19% - AA(-leaf) : ( 500 / 2250) * A =~ 12% - AB(-leaf) : (1000 / 2250) * A =~ 25% - -CFQ IOPS Mode for group scheduling -=================================== -Basic CFQ design is to provide priority based time slices. Higher priority -process gets bigger time slice and lower priority process gets smaller time -slice. Measuring time becomes harder if storage is fast and supports NCQ and -it would be better to dispatch multiple requests from multiple cfq queues in -request queue at a time. In such scenario, it is not possible to measure time -consumed by single queue accurately. - -What is possible though is to measure number of requests dispatched from a -single queue and also allow dispatch from multiple cfq queue at the same time. -This effectively becomes the fairness in terms of IOPS (IO operations per -second). - -If one sets slice_idle=0 and if storage supports NCQ, CFQ internally switches -to IOPS mode and starts providing fairness in terms of number of requests -dispatched. Note that this mode switching takes effect only for group -scheduling. For non-cgroup users nothing should change. - -CFQ IO scheduler Idling Theory -=============================== -Idling on a queue is primarily about waiting for the next request to come -on same queue after completion of a request. In this process CFQ will not -dispatch requests from other cfq queues even if requests are pending there. - -The rationale behind idling is that it can cut down on number of seeks -on rotational media. For example, if a process is doing dependent -sequential reads (next read will come on only after completion of previous -one), then not dispatching request from other queue should help as we -did not move the disk head and kept on dispatching sequential IO from -one queue. - -CFQ has following service trees and various queues are put on these trees. - - sync-idle sync-noidle async - -All cfq queues doing synchronous sequential IO go on to sync-idle tree. -On this tree we idle on each queue individually. - -All synchronous non-sequential queues go on sync-noidle tree. Also any -synchronous write request which is not marked with REQ_IDLE goes on this -service tree. On this tree we do not idle on individual queues instead idle -on the whole group of queues or the tree. So if there are 4 queues waiting -for IO to dispatch we will idle only once last queue has dispatched the IO -and there is no more IO on this service tree. - -All async writes go on async service tree. There is no idling on async -queues. - -CFQ has some optimizations for SSDs and if it detects a non-rotational -media which can support higher queue depth (multiple requests at in -flight at a time), then it cuts down on idling of individual queues and -all the queues move to sync-noidle tree and only tree idle remains. This -tree idling provides isolation with buffered write queues on async tree. - -FAQ -=== -Q1. Why to idle at all on queues not marked with REQ_IDLE. - -A1. We only do tree idle (all queues on sync-noidle tree) on queues not marked - with REQ_IDLE. This helps in providing isolation with all the sync-idle - queues. Otherwise in presence of many sequential readers, other - synchronous IO might not get fair share of disk. - - For example, if there are 10 sequential readers doing IO and they get - 100ms each. If a !REQ_IDLE request comes in, it will be scheduled - roughly after 1 second. If after completion of !REQ_IDLE request we - do not idle, and after a couple of milli seconds a another !REQ_IDLE - request comes in, again it will be scheduled after 1second. Repeat it - and notice how a workload can lose its disk share and suffer due to - multiple sequential readers. - - fsync can generate dependent IO where bunch of data is written in the - context of fsync, and later some journaling data is written. Journaling - data comes in only after fsync has finished its IO (atleast for ext4 - that seemed to be the case). Now if one decides not to idle on fsync - thread due to !REQ_IDLE, then next journaling write will not get - scheduled for another second. A process doing small fsync, will suffer - badly in presence of multiple sequential readers. - - Hence doing tree idling on threads using !REQ_IDLE flag on requests - provides isolation from multiple sequential readers and at the same - time we do not idle on individual threads. - -Q2. When to specify REQ_IDLE -A2. I would think whenever one is doing synchronous write and expecting - more writes to be dispatched from same context soon, should be able - to specify REQ_IDLE on writes and that probably should work well for - most of the cases. diff --git a/Documentation/block/queue-sysfs.txt b/Documentation/block/queue-sysfs.txt index 2c1e67058fd3..39e286d7afc9 100644 --- a/Documentation/block/queue-sysfs.txt +++ b/Documentation/block/queue-sysfs.txt @@ -64,7 +64,7 @@ guess, the kernel will put the process issuing IO to sleep for an amount of time, before entering a classic poll loop. This mode might be a little slower than pure classic polling, but it will be more efficient. If set to a value larger than 0, the kernel will put the process issuing -IO to sleep for this amont of microseconds before entering classic +IO to sleep for this amount of microseconds before entering classic polling. iostats (RW) @@ -194,4 +194,31 @@ blk-throttle makes decision based on the samplings. Lower time means cgroups have more smooth throughput, but higher CPU overhead. This exists only when CONFIG_BLK_DEV_THROTTLING_LOW is enabled. +zoned (RO) +---------- +This indicates if the device is a zoned block device and the zone model of the +device if it is indeed zoned. The possible values indicated by zoned are +"none" for regular block devices and "host-aware" or "host-managed" for zoned +block devices. The characteristics of host-aware and host-managed zoned block +devices are described in the ZBC (Zoned Block Commands) and ZAC +(Zoned Device ATA Command Set) standards. These standards also define the +"drive-managed" zone model. However, since drive-managed zoned block devices +do not support zone commands, they will be treated as regular block devices +and zoned will report "none". + +nr_zones (RO) +------------- +For zoned block devices (zoned attribute indicating "host-managed" or +"host-aware"), this indicates the total number of zones of the device. +This is always 0 for regular block devices. + +chunk_sectors (RO) +------------------ +This has different meaning depending on the type of the block device. +For a RAID device (dm-raid), chunk_sectors indicates the size in 512B sectors +of the RAID volume stripe segment. For a zoned block device, either host-aware +or host-managed, chunk_sectors indicates the size in 512B sectors of the zones +of the device, with the eventual exception of the last zone of the device which +may be smaller. + Jens Axboe <jens.axboe@oracle.com>, February 2009 diff --git a/Documentation/scsi/scsi-parameters.txt b/Documentation/scsi/scsi-parameters.txt index 92999d4e0cb8..25a4b4cf04a6 100644 --- a/Documentation/scsi/scsi-parameters.txt +++ b/Documentation/scsi/scsi-parameters.txt @@ -97,11 +97,6 @@ parameters may be changed at runtime by the command allowing boot to proceed. none ignores them, expecting user space to do the scan. - scsi_mod.use_blk_mq= - [SCSI] use blk-mq I/O path by default - See SCSI_MQ_DEFAULT in drivers/scsi/Kconfig. - Format: <y/n> - sim710= [SCSI,HW] See header of drivers/scsi/sim710.c. |