Age | Commit message (Collapse) | Author | Files | Lines |
|
Writes tend to starve reads. bfq counters this problem by overcharging
writes with an inflated service w.r.t. the actual service (number of
sector written) they receive.
Yet his overcharging is useless, and actually causes unfairness in the
opposite direction, when bfq happens to be enforcing strong I/O control.
bfq does this enforcing when the scenario is asymmetric, i.e., when some
bfq_queue or group of bfq_queues is to be granted a different bandwidth
than some other bfq_queue or group of bfq_queues. So, in such a
scenario, this commit disables write overcharging.
Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
The original commit is commit 1a1238a7dd48 ("cfq-iosched: improve hw_tag
detection") and has the following commit message:
If active queue hasn't enough requests and idle window opens, cfq will
not dispatch sufficient requests to hardware. In such situation, current
code will zero hw_tag. But this is because cfq doesn't dispatch enough
requests instead of hardware queue doesn't work. Don't zero hw_tag in
such case.
Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
bfq simple heuristic from cfq for detecting whether the drive performs
command queueing: check whether the average number of in-flight requests
is above a given threshold. Unfortunately this heuristic does fail to
detect queueing (on drives with queueing) if processes doing I/O are few
and issue I/O with a low depth.
To reduce false negatives, this commit lowers the threshold.
Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
bfq maintains an ordered list, through a red-black tree, of unique
weights of active bfq_queues. This list is used to detect whether there
are active queues with differentiated weights. The weight of a queue is
removed from the list when both the following two conditions become
true:
(1) the bfq_queue is flagged as inactive
(2) the has no in-flight request any longer;
Unfortunately, in the rare cases where condition (2) becomes true before
condition (1), the removal fails, because the function to remove the
weight of the queue (bfq_weights_tree_remove) is rightly invoked in the
path that deactivates the bfq_queue, but mistakenly invoked *before* the
function that actually performs the deactivation (bfq_deactivate_bfqq).
This commits moves the invocation of bfq_weights_tree_remove for
condition (1) to after bfq_deactivate_bfqq. As a consequence of this
move, it is necessary to add a further reference to the queue when the
weight of a queue is added, because the queue might otherwise be freed
before bfq_weights_tree_remove is invoked. This commit adds this
reference and makes all related modifications.
Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
In bfq_update_peak_rate, to check whether an I/O request rq is
sequential, only the seek distance of rq w.r.t. the last request
dispatched is controlled. This is not sufficient for non-rotational
storage, where the size of rq is at least as relevant. This commit adds
the missing control.
Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
bfq detects the creation of multiple bfq_queues shortly after each
other, namely a burst of queue creations in the terminology used in the
code. If the burst is large, then no queue in the burst is granted
- either I/O-dispatch plugging when the queue remains temporarily idle
while in service;
- or weight raising, because it causes even longer plugging.
In fact, such a plugging tends to lower throughput, while these bursts
are typically due to applications or services that spawn multiple
processes, to reach a common goal as soon as possible. Examples are a
"git grep" or the booting of a system.
Unfortunately, disabling plugging may cause a loss of service guarantees
in asymmetric scenarios, i.e., if queue weights are differentiated or if
more than one group is active.
This commit addresses this issue by no longer disabling I/O-dispatch
plugging for queues in large bursts.
Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
If the in-service bfq_queue is sync and remains temporarily idle, then
I/O dispatching (from other queues) may be plugged. It may be dome for
two reasons: either to boost throughput, or to preserve the bandwidth
share of the in-service queue. In the first case, if the I/O of the
in-service queue, when it finally arrives, consists only of one small
I/O request, then it makes sense to plug even the I/O of the in-service
queue. In fact, serving such a small request immediately is likely to
lower throughput instead of boosting it, whereas waiting a little bit is
likely to let that request grow, thanks to request merging, and become
more profitable in terms of throughput (this is likely to happen exactly
because the I/O of the queue has been detected to boost throughput).
On the opposite end, if I/O dispatching is being plugged only to
preserve the bandwidth of the in-service queue, then it would be better
not to plug also the I/O of the in-service queue, because such a
plugging is likely to cause only loss of bandwidth for the queue.
Unfortunately, no distinction is made between the two cases, and the I/O
of the in-service queue is always plugged in case just a small I/O
request arrives. This commit draws this missing distinction and does not
perform harmful plugging.
Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
This is a preparatory commit for commits that need to check only one of
the two main reasons for idling. This change should also improve the
quality of the code a little bit, by splitting a function that contains
very long, non-trivial and little related comments.
Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
In asymmetric scenarios, i.e., when some bfq_queue or bfq_group needs to
be guaranteed a different bandwidth than other bfq_queues or bfq_groups,
these service guaranteed can be provided only by plugging I/O dispatch,
completely or partially, when the queue in service remains temporarily
empty. A case where asymmetry is particularly strong is when some active
bfq_queues belong to a higher-priority class than some other active
bfq_queues. Unfortunately, this important case is not considered at all
in the code for detecting asymmetric scenarios. This commit adds the
missing logic.
Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Before commit 18e5a57d7987 ("block, bfq: postpone rq preparation to
insert or merge"), the destination queue for a request was chosen by a
different hook than the one that then inserted the request. So, between
the execution of the two hooks, the bic of the process generating the
request could happen to be redirected to a different bfq_queue. As a
consequence, the destination bfq_queue stored in the request could be
wrong. Such an event does not need to ba handled any longer.
Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
With some unlucky sequences of events, the function bfq_updated_next_req
updates the current budget of a bfq_queue to a lower value than the
service received by the queue using such a budget. Unfortunately, if
this happens, then the return value of the function bfq_bfqq_budget_left
becomes inconsistent. This commit solves this problem by lower-bounding
the budget computed in bfq_updated_next_req to the service currently
charged to the queue.
Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
To boost throughput on devices with internal queueing and in scenarios
where device idling is not strictly needed, bfq immediately starts
serving a new bfq_queue if the in-service bfq_queue remains without
pending I/O, even if new I/O may arrive soon for the latter queue. Then,
if such I/O actually arrives soon, bfq preempts the new in-service
bfq_queue so as to give the previous queue a chance to go on being
served (in case the previous queue should actually be the one to be
served, according to its timestamps).
However, the in-service bfq_queue, say Q, may also be without further
budget when it remains also pending I/O. Since bfq changes budgets
dynamically to fit the needs of bfq_queues, this happens more often than
one may expect. If this happens, then there is no point in trying to go
on serving Q when new I/O arrives for it soon: Q would be expired
immediately after being selected for service. This would only cause
useless overhead. This commit avoids such a useless selection.
Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
The speed at which a bfq_queue receives I/O is one of the parameters by
which bfq decides whether the queue is soft real-time (i.e., whether the
queue contains the I/O of a soft real-time application). In particular,
when a bfq_queue remains without outstanding I/O requests, bfq computes
the minimum time instant, named soft_rt_next_start, at which the next
request of the queue may arrive for the queue to be deemed as soft real
time.
Unfortunately this filtering may cause problems with a queue in
interactive weight raising. In fact, such a queue may be conveying the
I/O needed to load a soft real-time application. The latter will
actually exhibit a soft real-time I/O pattern after it finally starts
doing its job. But, if soft_rt_next_start is updated for an interactive
bfq_queue, and the queue has received a lot of service before remaining
with no outstanding request (likely to happen on a fast device), then
soft_rt_next_start is assigned such a high value that, for a very long
time, the queue is prevented from being possibly considered as soft real
time.
This commit removes the updating of soft_rt_next_start for bfq_queues in
interactive weight raising.
Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Pull in v4.20-rc6 to resolve the conflict in NVMe, but also to get the
two corruption fixes. We're going to be overhauling the direct dispatch
path, and we need to do that on top of the changes we made for that
in mainline.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
The bio_blkcg() function turns out to be inconsistent and consequently
dangerous to use. The first part returns a blkcg where a reference is
owned by the bio meaning it does not need to be rcu protected. However,
the third case, the last line, is problematic:
return css_to_blkcg(task_css(current, io_cgrp_id));
This can race against task migration and the cgroup dying. It is also
semantically different as it must be called rcu protected and is
susceptible to failure when trying to get a reference to it.
This patch adds association ahead of calling bio_blkcg() rather than
after. This makes association a required and explicit step along the
code paths for calling bio_blkcg(). In blk-iolatency, association is
moved above the bio_blkcg() call to ensure it will not return %NULL.
BFQ uses the old bio_blkcg() function, but I do not want to address it
in this series due to the complexity. I have created a private version
documenting the inconsistency and noting not to use it.
Signed-off-by: Dennis Zhou <dennis@kernel.org>
Acked-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Since commit '2d29c9f89fcd ("block, bfq: improve asymmetric scenarios
detection")', if there are process groups with I/O requests waiting for
completion, then BFQ tags the scenario as 'asymmetric'. This detection
is needed for preserving service guarantees (for details, see comments
on the computation * of the variable asymmetric_scenario in the
function bfq_better_to_idle).
Unfortunately, commit '2d29c9f89fcd ("block, bfq: improve asymmetric
scenarios detection")' contains an error exactly in the updating of
the number of groups with I/O requests waiting for completion: if a
group has more than one descendant process, then the above number of
groups, which is renamed from num_active_groups to a more appropriate
num_groups_with_pending_reqs by this commit, may happen to be wrongly
decremented multiple times, namely every time one of the descendant
processes gets all its pending I/O requests completed.
A correct, complete solution should work as follows. Consider a group
that is inactive, i.e., that has no descendant process with pending
I/O inside BFQ queues. Then suppose that num_groups_with_pending_reqs
is still accounting for this group, because the group still has some
descendant process with some I/O request still in
flight. num_groups_with_pending_reqs should be decremented when the
in-flight request of the last descendant process is finally completed
(assuming that nothing else has changed for the group in the meantime,
in terms of composition of the group and active/inactive state of
child groups and processes). To accomplish this, an additional
pending-request counter must be added to entities, and must be
updated correctly.
To avoid this additional field and operations, this commit resorts to
the following tradeoff between simplicity and accuracy: for an
inactive group that is still counted in num_groups_with_pending_reqs,
this commit decrements num_groups_with_pending_reqs when the first
descendant process of the group remains with no request waiting for
completion.
This simplified scheme provides a fix to the unbalanced decrements
introduced by 2d29c9f89fcd. Since this error was also caused by lack
of comments on this non-trivial issue, this commit also adds related
comments.
Fixes: 2d29c9f89fcd ("block, bfq: improve asymmetric scenarios detection")
Reported-by: Steven Barrett <steven@liquorix.net>
Tested-by: Steven Barrett <steven@liquorix.net>
Tested-by: Lucjan Lucjanov <lucjan.lucjanov@gmail.com>
Reviewed-by: Federico Motta <federico@willer.it>
Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
With the legacy request path gone there is no good reason to keep
queue_lock as a pointer, we can always use the embedded lock now.
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Fixed floppy and blk-cgroup missing conversions and half done edits.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
This is a remnant of when we had ops for both SQ and MQ
schedulers. Now it's just MQ, so get rid of the union.
Reviewed-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
This removes a bunch of core and elevator related code. On the core
front, we remove anything related to queue running, draining,
initialization, plugging, and congestions. We also kill anything
related to request allocation, merging, retrieval, and completion.
Remove any checking for single queue IO schedulers, as they no
longer exist. This means we can also delete a bunch of code related
to request issue, adding, completion, etc - and all the SQ related
ops and helpers.
Also kill the load_default_modules(), as all that did was provide
for a way to load the default single queue elevator.
Tested-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
This reverts a series committed earlier due to null pointer exception
bug report in [1]. It seems there are edge case interactions that I did
not consider and will need some time to understand what causes the
adverse interactions.
The original series can be found in [2] with a follow up series in [3].
[1] https://www.spinics.net/lists/cgroups/msg20719.html
[2] https://lore.kernel.org/lkml/20180911184137.35897-1-dennisszhou@gmail.com/
[3] https://lore.kernel.org/lkml/20181020185612.51587-1-dennis@kernel.org/
This reverts the following commits:
d459d853c2ed, b2c3fa546705, 101246ec02b5, b3b9f24f5fcc, e2b0989954ae,
f0fcb3ec89f3, c839e7a03f92, bdc2491708c4, 74b7c02a9bc1, 5bf9a1f3b4ef,
a7b39b4e961c, 07b05bcc3213, 49f4c2dc2b50, 27e6fa996c53
Signed-off-by: Dennis Zhou <dennis@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
bfq defines as asymmetric a scenario where an active entity, say E
(representing either a single bfq_queue or a group of other entities),
has a higher weight than some other entities. If the entity E does sync
I/O in such a scenario, then bfq plugs the dispatch of the I/O of the
other entities in the following situation: E is in service but
temporarily has no pending I/O request. In fact, without this plugging,
all the times that E stops being temporarily idle, it may find the
internal queues of the storage device already filled with an
out-of-control number of extra requests, from other entities. So E may
have to wait for the service of these extra requests, before finally
having its own requests served. This may easily break service
guarantees, with E getting less than its fair share of the device
throughput. Usually, the end result is that E gets the same fraction of
the throughput as the other entities, instead of getting more, according
to its higher weight.
Yet there are two other more subtle cases where E, even if its weight is
actually equal to or even lower than the weight of any other active
entities, may get less than its fair share of the throughput in case the
above I/O plugging is not performed:
1. other entities issue larger requests than E;
2. other entities contain more active child entities than E (or in
general tend to have more backlog than E).
In the first case, other entities may get more service than E because
they get larger requests, than those of E, served during the temporary
idle periods of E. In the second case, other entities get more service
because, by having many child entities, they have many requests ready
for dispatching while E is temporarily idle.
This commit addresses this issue by extending the definition of
asymmetric scenario: a scenario is asymmetric when
- active entities representing bfq_queues have differentiated weights,
as in the original definition
or (inclusive)
- one or more entities representing groups of entities are active.
This broader definition makes sure that I/O plugging will be performed
in all the above cases, provided that there is at least one active
group. Of course, this definition is very coarse, so it will trigger
I/O plugging also in cases where it is not needed, such as, e.g.,
multiple active entities with just one child each, and all with the same
I/O-request size. The reason for this coarse definition is just that a
finer-grained definition would be rather heavy to compute.
On the opposite end, even this new definition does not trigger I/O
plugging in all cases where there is no active group, and all bfq_queues
have the same weight. So, in these cases some unfairness may occur if
there are asymmetries in I/O-request sizes. We made this choice because
I/O plugging may lower throughput, and probably a user that has not
created any group cares more about throughput than about perfect
fairness. At any rate, as for possible applications that may care about
service guarantees, bfq already guarantees a high responsiveness and a
low latency to soft real-time applications automatically.
Signed-off-by: Federico Motta <federico@willer.it>
Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
The accessor function bio_blkcg either returns the blkcg associated with
the bio or finds one in the current context. This can cause an issue
when trying to associate a bio with a blkcg. Particularly, it's the
third case that is problematic:
return css_to_blkcg(task_css(current, io_cgrp_id));
As the above may race against task migration and the cgroup exiting, it
is not always ok to take a reference on the blkcg returned from
bio_blkcg.
This patch adds association ahead of calling bio_blkcg rather than
after. This makes association a required and explicit step along the
code paths for calling bio_blkcg. blk_get_rl is modified as well to get
a reference to the blkcg it may use and blk_put_rl will always put the
reference back. Association is also moved above the bio_blkcg call to
ensure it will not return NULL in blk-iolatency.
BFQ and CFQ utilize this flaw, but due to the complexity, I do not want
to address this in this series. I've created a private version of the
function with notes not to use it describing the flaw. Hopefully soon,
that code can be cleaned up.
Signed-off-by: Dennis Zhou <dennisszhou@gmail.com>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
To reduce latency for interactive and soft real-time applications, bfq
privileges the bfq_queues containing the I/O of these
applications. These privileged queues, referred-to as weight-raised
queues, get a much higher share of the device throughput
w.r.t. non-privileged queues. To preserve this higher share, the I/O
of any non-weight-raised queue must be plugged whenever a sync
weight-raised queue, while being served, remains temporarily empty. To
attain this goal, bfq simply plugs any I/O (from any queue), if a sync
weight-raised queue remains empty while in service.
Unfortunately, this plugging typically lowers throughput with random
I/O, on devices with internal queueing (because it reduces the filling
level of the internal queues of the device).
This commit addresses this issue by restricting the cases where
plugging is performed: if a sync weight-raised queue remains empty
while in service, then I/O plugging is performed only if some of the
active bfq_queues are *not* weight-raised (which is actually the only
circumstance where plugging is needed to preserve the higher share of
the throughput of weight-raised queues). This restriction proved able
to boost throughput in really many use cases needing only maximum
throughput.
Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
The Achilles' heel of BFQ is its failing to reach a high throughput
with sync random I/O on flash storage with internal queueing, in case
the processes doing I/O have differentiated weights.
The cause of this failure is as follows. If at least two processes do
sync I/O, and have a different weight from each other, then BFQ plugs
I/O dispatching every time one of these processes, while it is being
served, remains temporarily without pending I/O requests. This
plugging is necessary to guarantee that every process enjoys a
bandwidth proportional to its weight; but it empties the internal
queue(s) of the drive. And this kills throughput with random I/O. So,
if some processes have differentiated weights and do both sync and
random I/O, the end result is a throughput collapse.
This commit tries to counter this problem by injecting the service of
other processes, in a controlled way, while the process in service
happens to have no I/O. This injection is performed only if the medium
is non rotational and performs internal queueing, and the process in
service does random I/O (service injection might be beneficial for
sequential I/O too, we'll work on that).
As an example of the benefits of this commit, on a PLEXTOR PX-256M5S
SSD, and with five processes having differentiated weights and doing
sync random 4KB I/O, this commit makes the throughput with bfq grow by
400%, from 25 to 100MB/s. This higher throughput is 10MB/s lower than
that reached with none. As some less random I/O is added to the mix,
the throughput becomes equal to or higher than that with none.
This commit is a very first attempt to recover throughput without
losing control, and certainly has many limitations. One is, e.g., that
the processes whose service is injected are not chosen so as to
distribute the extra bandwidth they receive in accordance to their
weights. Thus there might be loss of weighted fairness in some
cases. Anyway, this loss concerns extra service, which would not have
been received at all without this commit. Other limitations and issues
will probably show up with usage.
Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
When a sync request is dispatched, the queue that contains that
request, and all the ancestor entities of that queue, are charged with
the number of sectors of the request. In constrast, if the request is
async, then the queue and its ancestor entities are charged with the
number of sectors of the request, multiplied by an overcharge
factor. This throttles the bandwidth for async I/O, w.r.t. to sync
I/O, and it is done to counter the tendency of async writes to steal
I/O throughput to reads.
On the opposite end, the lower this parameter, the stabler I/O
control, in the following respect. The lower this parameter is, the
less the bandwidth enjoyed by a group decreases
- when the group does writes, w.r.t. to when it does reads;
- when other groups do reads, w.r.t. to when they do writes.
The fixes "block, bfq: always update the budget of an entity when
needed" and "block, bfq: readd missing reset of parent-entity service"
improved I/O control in bfq to such an extent that it has been
possible to revise this overcharge factor downwards. This commit
introduces the resulting, new value.
Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
The received-service counter needs to be equal to 0 when an entity is
set in service. Unfortunately, commit "block, bfq: fix service being
wrongly set to zero in case of preemption" mistakenly removed the
resetting of this counter for the parent entities of the bfq_queue
being set in service. This commit fixes this issue by resetting
service for parent entities, directly on the expiration of the
in-service bfq_queue.
Fixes: 9fae8dd59ff3 ("block, bfq: fix service being wrongly set to zero in case of preemption")
Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
The actual goal of the function bfq_bfqq_may_idle is to tell whether
it is better to perform device idling (more precisely: I/O-dispatch
plugging) for the input bfq_queue, either to boost throughput or to
preserve service guarantees. This commit improves the name of the
function accordingly.
Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com>
Tested-by: Oleksandr Natalenko <oleksandr@natalenko.name>
Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
If
- a bfq_queue Q preempts another queue, because one request of Q
arrives in time,
- but, after this preemption, Q is not the queue that is set in service,
then Q->entity.service is set to 0 when Q is eventually set in
service. But Q should have continued receiving service with its old
budget (which is why preemption has occurred) and its old service.
This commit addresses this issue by resetting service on queue real
expiration.
Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com>
Tested-by: Oleksandr Natalenko <oleksandr@natalenko.name>
Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
For some bfq_queues, BFQ plugs I/O dispatching when the queue becomes
idle, and keeps the plug until a new request of the queue arrives, or
a timeout fires. BFQ does so either to boost throughput or to preserve
service guarantees for the queue.
More precisely, for such a queue, plugging starts when the queue
happens to have either no request enqueued, or no request in flight,
that is, no request already dispatched but not yet completed.
On the opposite end, BFQ may happen to expire a queue with no request
enqueued, without doing any plugging, if the queue still has some
request in flight. Unfortunately, such a premature expiration causes
the queue to lose its chance to enjoy dispatch plugging a moment
later, i.e., when its in-flight requests finally get completed. This
breaks service guarantees for the queue.
This commit prevents BFQ from expiring an empty queue if the latter
still has in-flight requests.
Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com>
Tested-by: Oleksandr Natalenko <oleksandr@natalenko.name>
Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
To keep I/O throughput high as often as possible, BFQ performs
I/O-dispatch plugging (aka device idling) only when beneficial exactly
for throughput, or when needed for service guarantees (low latency,
fairness). An important case where the latter condition holds is when
the scenario is 'asymmetric' in terms of weights: i.e., when some
bfq_queue or whole group of queues has a higher weight, and thus has
to receive more service, than other queues or groups. Without dispatch
plugging, lower-weight queues/groups may unjustly steal bandwidth to
higher-weight queues/groups.
To detect asymmetric scenarios, BFQ checks some sufficient
conditions. One of these conditions is that active groups have
different weights. BFQ controls this condition by maintaining a
special set of unique weights of active groups
(group_weights_tree). To this purpose, in the function
bfq_active_insert/bfq_active_extract BFQ adds/removes the weight of a
group to/from this set.
Unfortunately, the function bfq_active_extract may happen to be
invoked also for a group that is still active (to preserve the correct
update of the next queue to serve, see comments in function
bfq_no_longer_next_in_service() for details). In this case, removing
the weight of the group makes the set group_weights_tree
inconsistent. Service-guarantee violations follow.
This commit addresses this issue by moving group_weights_tree
insertions from their previous location (in bfq_active_insert) into
the function __bfq_activate_entity, and by moving group_weights_tree
extractions from bfq_active_extract to when the entity that represents
a group remains throughly idle, i.e., with no request either enqueued
or dispatched.
Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com>
Tested-by: Oleksandr Natalenko <oleksandr@natalenko.name>
Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
BFQ can deem a bfq_queue as soft real-time only if the queue
- periodically becomes completely idle, i.e., empty and with
no still-outstanding I/O request;
- after becoming idle, gets new I/O only after a special reference
time soft_rt_next_start.
In this respect, after commit "block, bfq: consider also past I/O in
soft real-time detection", the value of soft_rt_next_start can never
decrease. This causes a problem with the following special updating
case for soft_rt_next_start: to prevent queues that are not completely
idle to be wrongly detected as soft real-time (when they become
non-empty again), soft_rt_next_start is temporarily set to infinity
for empty queues with still outstanding I/O requests. But, if such an
update is actually performed, then, because of the above commit,
soft_rt_next_start will be stuck at infinity forever, and the queue
will have no more chance to be considered soft real-time.
On slow systems, this problem does cause actual soft real-time
applications to be occasionally not detected as such.
This commit addresses this issue by eliminating the pushing of
soft_rt_next_start to infinity, and by changing the way non-empty
queues are prevented from being wrongly detected as soft
real-time. Simply, a queue that becomes non-empty again can now be
detected as soft real-time only if it has no outstanding I/O request.
Signed-off-by: Davide Sapienza <sapienza.dav@gmail.com>
Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
The maximum possible duration of the weight-raising period for
interactive applications is limited to 13 seconds, as this is the time
needed to load the largest application that we considered when tuning
weight raising. Unfortunately, in such an evaluation, we did not
consider the case of very slow virtual machines.
For example, on a QEMU/KVM virtual machine
- running in a slow PC;
- with a virtual disk stacked on a slow low-end 5400rpm HDD;
- serving a heavy I/O workload, such as the sequential reading of
several files;
mplayer takes 23 seconds to start, if constantly weight-raised.
To address this issue, this commit conservatively sets the upper limit
for weight-raising duration to 25 seconds.
Signed-off-by: Davide Sapienza <sapienza.dav@gmail.com>
Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
BFQ computes the duration of weight raising for interactive
applications automatically, using some reference parameters. In
particular, BFQ uses the best durations (see comments in the code for
how these durations have been assessed) for two classes of systems:
slow and fast ones. Examples of slow systems are old phones or systems
using micro HDDs. Fast systems are all the remaining ones. Using these
parameters, BFQ computes the actual duration of the weight raising,
for the system at hand, as a function of the relative speed of the
system w.r.t. the speed of a reference system, belonging to the same
class of systems as the system at hand.
This slow vs fast differentiation proved to be useful in the past, but
happens to have little meaning with current hardware. Even worse, it
does cause problems in virtual systems, where the speed of the system
can vary frequently, and so widely to just confuse the class-detection
mechanism, and, as we have verified experimentally, to cause BFQ to
compute non-sensical weight-raising durations.
This commit addresses this issue by removing the slow class and the
class-detection mechanism.
Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
A description of how weight raising works is missing in BFQ
sources. In addition, the code for handling weight raising is
scattered across a few functions. This makes it rather hard to
understand the mechanism and its rationale. This commits adds such a
description at the beginning of the main source file.
Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Since bfq_finish_request() is always called on the request 'next',
after bfq_requests_merged() is finished, and bfq_finish_request()
removes 'next' from its bfq_queue if needed, it isn't necessary to do
such a removal in advance in bfq_merged_requests().
This commit removes such a useless 'next' removal.
Signed-off-by: Filippo Muzzini <filippo.muzzini@outlook.it>
Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
The request rq passed to the function bfq_requests_merged is always in
a bfq_queue, so the check !RB_EMPTY_NODE(&rq->rb_node) at the
beginning of bfq_requests_merged always succeeds, and the control
flow systematically skips to the end of the function. This implies
that the body of the function is never executed, i.e., the
repositioning of rq is never performed.
On the opposite end, a control is missing in the body of the function:
'next' must be removed only if it is inside a bfq_queue.
This commit removes the wrong check on rq, and adds the missing check
on 'next'. In addition, this commit adds comments on
bfq_requests_merged.
Signed-off-by: Filippo Muzzini <filippo.muzzini@outlook.it>
Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
In bfq_requests_merged(), there is a deadlock because the lock on
bfqq->bfqd->lock is held by the calling function, but the code of
this function tries to grab the lock again.
This deadlock is currently hidden by another bug (fixed by next commit
for this source file), which causes the body of bfq_requests_merged()
to be never executed.
This commit removes the deadlock by removing the lock/unlock pair.
Signed-off-by: Filippo Muzzini <filippo.muzzini@outlook.it>
Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
If our shallow depth is smaller than the wake batching of sbitmap,
we can introduce hangs. Ensure that sbitmap knows how low we'll go.
Acked-by: Paolo Valente <paolo.valente@linaro.org>
Reviewed-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
bfqd->sb_shift was attempted used as a cache for the sbitmap queue
shift, but we don't need it, as it never changes. Kill it with fire.
Acked-by: Paolo Valente <paolo.valente@linaro.org>
Reviewed-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
It doesn't change, so don't put it in the per-IO hot path.
Acked-by: Paolo Valente <paolo.valente@linaro.org>
Reviewed-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Reserved tags are used for error handling, we don't need to
care about them for regular IO. The core won't call us for these
anyway.
Acked-by: Paolo Valente <paolo.valente@linaro.org>
Reviewed-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
When invoked for an I/O request rq, the prepare_request hook of bfq
increments reference counters in the destination bfq_queue for rq. In
this respect, after this hook has been invoked, rq may still be
transformed into a request with no icq attached, i.e., for bfq, a
request not associated with any bfq_queue. No further hook is invoked
to signal this tranformation to bfq (in general, to the destination
elevator for rq). This leads bfq into an inconsistent state, because
bfq has no chance to correctly lower these counters back. This
inconsistency may in its turn cause incorrect scheduling and hangs. It
certainly causes memory leaks, by making it impossible for bfq to free
the involved bfq_queue.
On the bright side, no transformation can still happen for rq after rq
has been inserted into bfq, or merged with another, already inserted,
request. Exploiting this fact, this commit addresses the above issue
by delaying the preparation of an I/O request to when the request is
inserted or merged.
This change also gives a performance bonus: a lock-contention point
gets removed. To prepare a request, bfq needs to hold its scheduler
lock. After postponing request preparation to insertion or merging, no
lock needs to be grabbed any longer in the prepare_request hook, while
the lock already taken to perform insertion or merging is used to
preparare the request as well.
Tested-by: Oleksandr Natalenko <oleksandr@natalenko.name>
Tested-by: Bart Van Assche <bart.vanassche@wdc.com>
Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Currently, struct request has four timestamp fields:
- A start time, set at get_request time, in jiffies, used for iostats
- An I/O start time, set at start_request time, in ktime nanoseconds,
used for blk-stats (i.e., wbt, kyber, hybrid polling)
- Another start time and another I/O start time, used for cfq and bfq
These can all be consolidated into one start time and one I/O start
time, both in ktime nanoseconds, shaving off up to 16 bytes from struct
request depending on the kernel config.
Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Even if we don't have an IO context attached to a request, we still
need to clear the priv[0..1] pointers, as they could be pointing
to previously used bic/bfqq structures. If we don't do so, we'll
either corrupt memory on dispatching a request, or cause an
imbalance in counters.
Inspired by a fix from Kees.
Reported-by: Oleksandr Natalenko <oleksandr@natalenko.name>
Reported-by: Kees Cook <keescook@chromium.org>
Cc: stable@vger.kernel.org
Fixes: aee69d78dec0 ("block, bfq: introduce the BFQ-v0 I/O scheduler as an extra scheduler")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
If a storage device handled by BFQ happens to be slower than 7.5 KB/s
for a certain amount of time (in the order of a second), then the
estimated peak rate of the device, maintained in BFQ, becomes equal to
0. The reason is the limited precision with which the rate is
represented (details on the range of representable values in the
comments introduced by this commit). This leads to a division-by-zero
error where the estimated peak rate is used as divisor. Such a type of
failure has been reported in [1].
This commit addresses this issue by:
1. Lower-bounding the estimated peak rate to 1
2. Adding and improving comments on the range of rates representable
[1] https://www.spinics.net/lists/kernel/msg2739205.html
Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Commit 'a6a252e64914 ("blk-mq-sched: decide how to handle flush rq via
RQF_FLUSH_SEQ")' makes all non-flush re-prepared requests for a device
be re-inserted into the active I/O scheduler for that device. As a
consequence, I/O schedulers may get the same request inserted again,
even several times, without a finish_request invoked on that request
before each re-insertion.
This fact is the cause of the failure reported in [1]. For an I/O
scheduler, every re-insertion of the same re-prepared request is
equivalent to the insertion of a new request. For schedulers like
mq-deadline or kyber, this fact causes no harm. In contrast, it
confuses a stateful scheduler like BFQ, which keeps state for an I/O
request, until the finish_request hook is invoked on the request. In
particular, BFQ may get stuck, waiting forever for the number of
request dispatches, of the same request, to be balanced by an equal
number of request completions (while there will be one completion for
that request). In this state, BFQ may refuse to serve I/O requests
from other bfq_queues. The hang reported in [1] then follows.
However, the above re-prepared requests undergo a requeue, thus the
requeue_request hook of the active elevator is invoked for these
requests, if set. This commit then addresses the above issue by
properly implementing the hook requeue_request in BFQ.
[1] https://marc.info/?l=linux-block&m=151211117608676
Reported-by: Ivan Kozik <ivan@ludios.org>
Reported-by: Alban Browaeys <alban.browaeys@gmail.com>
Tested-by: Mike Galbraith <efault@gmx.de>
Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
Signed-off-by: Serena Ziviani <ziviani.serena@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
To maximise responsiveness, BFQ raises the weight, and performs device
idling, for bfq_queues associated with processes deemed as
interactive. In particular, weight raising has a maximum duration,
equal to the time needed to start a large application. If a
weight-raised process goes on doing I/O beyond this maximum duration,
it loses weight-raising.
This mechanism is evidently vulnerable to the following false
positives: I/O-bound applications that will go on doing I/O for much
longer than the duration of weight-raising. These applications have
basically no benefit from being weight-raised at the beginning of
their I/O. On the opposite end, while being weight-raised, these
applications
a) unjustly steal throughput to applications that may truly need
low latency;
b) make BFQ uselessly perform device idling; device idling results
in loss of device throughput with most flash-based storage, and may
increase latencies when used purposelessly.
This commit adds a countermeasure to reduce both the above
problems. To introduce this countermeasure, we provide the following
extra piece of information (full details in the comments added by this
commit). During the start-up of the large application used as a
reference to set the duration of weight-raising, involved processes
transfer at most ~110K sectors each. Accordingly, a process initially
deemed as interactive has no right to be weight-raised any longer,
once transferred 110K sectors or more.
Basing on this consideration, this commit early-ends weight-raising
for a bfq_queue if the latter happens to have received an amount of
service at least equal to 110K sectors (actually, a little bit more,
to keep a safety margin). I/O-bound applications that reach a high
throughput, such as file copy, get to this threshold much before the
allowed weight-raising period finishes. Thus this early ending of
weight-raising reduces the amount of time during which these
applications cause the problems described above.
Tested-by: Oleksandr Natalenko <oleksandr@natalenko.name>
Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com>
Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Asynchronous I/O can easily starve synchronous I/O (both sync reads
and sync writes), by consuming all request tags. Similarly, storms of
synchronous writes, such as those that sync(2) may trigger, can starve
synchronous reads. In their turn, these two problems may also cause
BFQ to loose control on latency for interactive and soft real-time
applications. For example, on a PLEXTOR PX-256M5S SSD, LibreOffice
Writer takes 0.6 seconds to start if the device is idle, but it takes
more than 45 seconds (!) if there are sequential writes in the
background.
This commit addresses this issue by limiting the maximum percentage of
tags that asynchronous I/O requests and synchronous write requests can
consume. In particular, this commit grants a higher threshold to
synchronous writes, to prevent the latter from being starved by
asynchronous I/O.
According to the above test, LibreOffice Writer now starts in about
1.2 seconds on average, regardless of the background workload, and
apart from some rare outlier. To check this improvement, run, e.g.,
sudo ./comm_startup_lat.sh bfq 5 5 seq 10 "lowriter --terminate_after_init"
for the comm_startup_lat benchmark in the S suite [1].
[1] https://github.com/Algodev-github/S
Tested-by: Oleksandr Natalenko <oleksandr@natalenko.name>
Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com>
Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Commit '7b9e93616399' ("blk-mq-sched: unify request finished methods")
changed the old name of current bfq_finish_request method, but left it
unchanged elsewhere in the code (related comments, part of function
name bfq_put_rq_priv_body).
This commit fixes all occurrences of the old name of this method by
changing them into the current name.
Fixes: 7b9e93616399 ("blk-mq-sched: unify request finished methods")
Reviewed-by: Paolo Valente <paolo.valente@linaro.org>
Signed-off-by: Federico Motta <federico@willer.it>
Signed-off-by: Chiara Bruschi <bruschi.chiara@outlook.it>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
It's not available if we don't have group io scheduling set, and
there's no need to call it.
Fixes: 0d52af590552 ("block, bfq: release oom-queue ref to root group on exit")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|