2022-12-09 09:59:07

by Paolo Valente

Subject: [PATCH V10 0/8] block, bfq: extend bfq to support multi-actuator drives

Hi,
here is V10. It differs from V9 in that it applies Damien's
recommendation in [2].

Here is the whole description of this patch series again. This
extension addresses the following issue. Single-LUN multi-actuator
SCSI drives, as well as all multi-actuator SATA drives, appear as a
single device to the I/O subsystem [1]. Yet they address commands to
different actuators internally, as a function of the Logical Block
Address (LBA). A given sector is reachable by only one of the
actuators. For example, Seagate’s Serial Advanced Technology
Attachment (SATA) version contains two actuators and maps the lower
half of the SATA LBA space to the lower actuator and the upper half to
the upper actuator.

Evidently, to fully utilize such drives, no actuator should be left idle
or underutilized while there is pending I/O for it. To reach this
goal, the block layer must somehow control the load of each actuator
individually. This series enriches BFQ with such a per-actuator
control, as a first step. Then it also adds a simple mechanism for
guaranteeing that actuators with pending I/O are never left idle.

See [1] for a more detailed overview of the problem and of the
solutions implemented in this patch series. There you will also find
some preliminary performance results.

Thanks,
Paolo

[1] https://www.linaro.org/blog/budget-fair-queueing-bfq-linux-io-scheduler-optimizations-for-multi-actuator-sata-hard-drives/
[2] https://lore.kernel.org/lkml/[email protected]/T/#t

Davide Zini (3):
block, bfq: split also async bfq_queues on a per-actuator basis
block, bfq: inject I/O to underutilized actuators
block, bfq: balance I/O injection among underutilized actuators

Federico Gavioli (1):
block, bfq: retrieve independent access ranges from request queue

Paolo Valente (4):
block, bfq: split sync bfq_queues on a per-actuator basis
block, bfq: forbid stable merging of queues associated with different
actuators
block, bfq: move io_cq-persistent bfqq data into a dedicated struct
block, bfq: turn bfqq_data into an array in bfq_io_cq

block/bfq-cgroup.c | 94 +++----
block/bfq-iosched.c | 584 ++++++++++++++++++++++++++++++--------------
block/bfq-iosched.h | 142 ++++++++---
block/bfq-wf2q.c | 2 +-
4 files changed, 566 insertions(+), 256 deletions(-)

--
2.20.1


2022-12-09 10:00:35

by Paolo Valente

Subject: [PATCH V10 2/8] block, bfq: forbid stable merging of queues associated with different actuators

If queues associated with different actuators are merged, then control
is lost on each actuator. Therefore some actuator may be
underutilized, and throughput may decrease. This problem cannot occur
with basic queue merging, because the latter is triggered by spatial
locality, and sectors for different actuators are not close to each
other. Yet it may happen with stable merging. To address this issue,
this commit prevents stable merging from occurring among queues
associated with different actuators.

Reviewed-by: Damien Le Moal <[email protected]>
Signed-off-by: Paolo Valente <[email protected]>
---
block/bfq-iosched.c | 13 +++++++++----
1 file changed, 9 insertions(+), 4 deletions(-)

diff --git a/block/bfq-iosched.c b/block/bfq-iosched.c
index 16dbf52ce74b..456b5f5c24fd 100644
--- a/block/bfq-iosched.c
+++ b/block/bfq-iosched.c
@@ -5708,9 +5708,13 @@ static struct bfq_queue *bfq_do_or_sched_stable_merge(struct bfq_data *bfqd,
* it has been set already, but too long ago, then move it
* forward to bfqq. Finally, move also if bfqq belongs to a
* different group than last_bfqq_created, or if bfqq has a
- * different ioprio or ioprio_class. If none of these
- * conditions holds true, then try an early stable merge or
- * schedule a delayed stable merge.
+ * different ioprio, ioprio_class or actuator_idx. If none of
+ * these conditions holds true, then try an early stable merge
+ * or schedule a delayed stable merge. As for the condition on
+ * actuator_idx, the reason is that, if queues associated with
+ * different actuators are merged, then control is lost on
+ * each actuator. Therefore some actuator may be
+ * underutilized, and throughput may decrease.
*
* A delayed merge is scheduled (instead of performing an
* early merge), in case bfqq might soon prove to be more
@@ -5728,7 +5732,8 @@ static struct bfq_queue *bfq_do_or_sched_stable_merge(struct bfq_data *bfqd,
bfqq->creation_time) ||
bfqq->entity.parent != last_bfqq_created->entity.parent ||
bfqq->ioprio != last_bfqq_created->ioprio ||
- bfqq->ioprio_class != last_bfqq_created->ioprio_class)
+ bfqq->ioprio_class != last_bfqq_created->ioprio_class ||
+ bfqq->actuator_idx != last_bfqq_created->actuator_idx)
*source_bfqq = bfqq;
else if (time_after_eq(last_bfqq_created->creation_time +
bfqd->bfq_burst_interval,
--
2.20.1

2022-12-09 10:00:45

by Paolo Valente

Subject: [PATCH V10 7/8] block, bfq: inject I/O to underutilized actuators

From: Davide Zini <[email protected]>

The main service scheme of BFQ for sync I/O is serving one sync
bfq_queue at a time, for a while. In particular, BFQ enforces this
scheme when it deems it necessary to boost throughput or to
preserve service guarantees. Unfortunately, when BFQ enforces this
policy, only one actuator at a time gets served for a while,
because each bfq_queue contains I/O for only one actuator. The
other actuators may remain underutilized.

Actually, BFQ may serve (inject) extra I/O, taken from other
bfq_queues, in parallel with that of the in-service queue. This
injection mechanism can also provide a basis for dealing with
the above actuator-underutilization problem. Yet BFQ does not take
the actuator load into account when choosing which queue to pick
extra I/O from. In addition, BFQ may happen to inject extra I/O
only when the in-service queue is temporarily empty.

In view of these facts, this commit extends the
injection mechanism so that it:
(1) also takes the actuator load into account;
(2) checks that load on each dispatch, and injects I/O for an
underutilized actuator, if there is one and there is I/O for it.

To perform the check in (2), this commit introduces a load
threshold, currently set to 4. A linear scan of each actuator is
performed, until an actuator is found for which the following two
conditions hold: the load of the actuator is below the threshold,
and there is at least one non-in-service queue that contains I/O
for that actuator. If such a pair (actuator, queue) is found, then
the head request of that queue is returned for dispatch, instead
of the head request of the in-service queue.

We have set the threshold, empirically, to the minimum possible
value for which an actuator is fully utilized, or close to being
fully utilized. By doing so, injected I/O 'steals' as few
drive-queue slots as possible from the in-service queue. This
reduces as much as possible the probability that the service of
I/O from the in-service bfq_queue gets delayed because of slot
exhaustion, i.e., because all the slots of the drive queue are
filled with I/O injected from other queues (NCQ provides for 32
slots).

This new mechanism also counters actuator underutilization in the
case of asymmetric configurations of bfq_queues: namely, when there
are few bfq_queues containing I/O for some actuators and many
bfq_queues containing I/O for other actuators, or when the
bfq_queues containing I/O for some actuators have lower weights
than the other bfq_queues.

Reviewed-by: Damien Le Moal <[email protected]>
Signed-off-by: Paolo Valente <[email protected]>
Signed-off-by: Davide Zini <[email protected]>
---
block/bfq-cgroup.c | 2 +-
block/bfq-iosched.c | 143 ++++++++++++++++++++++++++++++++------------
block/bfq-iosched.h | 39 +++++++++++-
block/bfq-wf2q.c | 2 +-
4 files changed, 144 insertions(+), 42 deletions(-)

diff --git a/block/bfq-cgroup.c b/block/bfq-cgroup.c
index 7ea77f994424..5783601f3d7a 100644
--- a/block/bfq-cgroup.c
+++ b/block/bfq-cgroup.c
@@ -698,7 +698,7 @@ void bfq_bfqq_move(struct bfq_data *bfqd, struct bfq_queue *bfqq,
bfq_activate_bfqq(bfqd, bfqq);
}

- if (!bfqd->in_service_queue && !bfqd->rq_in_driver)
+ if (!bfqd->in_service_queue && !bfqd->tot_rq_in_driver)
bfq_schedule_dispatch(bfqd);
/* release extra ref taken above, bfqq may happen to be freed now */
bfq_put_queue(bfqq);
diff --git a/block/bfq-iosched.c b/block/bfq-iosched.c
index b5ed666be0e6..527def05ce44 100644
--- a/block/bfq-iosched.c
+++ b/block/bfq-iosched.c
@@ -2303,9 +2303,9 @@ static void bfq_add_request(struct request *rq)
* elapsed.
*/
if (bfqq == bfqd->in_service_queue &&
- (bfqd->rq_in_driver == 0 ||
+ (bfqd->tot_rq_in_driver == 0 ||
(bfqq->last_serv_time_ns > 0 &&
- bfqd->rqs_injected && bfqd->rq_in_driver > 0)) &&
+ bfqd->rqs_injected && bfqd->tot_rq_in_driver > 0)) &&
time_is_before_eq_jiffies(bfqq->decrease_time_jif +
msecs_to_jiffies(10))) {
bfqd->last_empty_occupied_ns = ktime_get_ns();
@@ -2329,7 +2329,7 @@ static void bfq_add_request(struct request *rq)
* will be set in case injection is performed
* on bfqq before rq is completed).
*/
- if (bfqd->rq_in_driver == 0)
+ if (bfqd->tot_rq_in_driver == 0)
bfqd->rqs_injected = false;
}
}
@@ -2427,15 +2427,18 @@ static sector_t get_sdist(sector_t last_pos, struct request *rq)
static void bfq_activate_request(struct request_queue *q, struct request *rq)
{
struct bfq_data *bfqd = q->elevator->elevator_data;
+ unsigned int act_idx = bfq_actuator_index(bfqd, rq->bio);

- bfqd->rq_in_driver++;
+ bfqd->tot_rq_in_driver++;
+ bfqd->rq_in_driver[act_idx]++;
}

static void bfq_deactivate_request(struct request_queue *q, struct request *rq)
{
struct bfq_data *bfqd = q->elevator->elevator_data;

- bfqd->rq_in_driver--;
+ bfqd->tot_rq_in_driver--;
+ bfqd->rq_in_driver[bfq_actuator_index(bfqd, rq->bio)]--;
}
#endif

@@ -2710,11 +2713,14 @@ void bfq_end_wr_async_queues(struct bfq_data *bfqd,
static void bfq_end_wr(struct bfq_data *bfqd)
{
struct bfq_queue *bfqq;
+ int i;

spin_lock_irq(&bfqd->lock);

- list_for_each_entry(bfqq, &bfqd->active_list, bfqq_list)
- bfq_bfqq_end_wr(bfqq);
+ for (i = 0; i < bfqd->num_actuators; i++) {
+ list_for_each_entry(bfqq, &bfqd->active_list[i], bfqq_list)
+ bfq_bfqq_end_wr(bfqq);
+ }
list_for_each_entry(bfqq, &bfqd->idle_list, bfqq_list)
bfq_bfqq_end_wr(bfqq);
bfq_end_wr_async(bfqd);
@@ -3671,13 +3677,13 @@ static void bfq_update_peak_rate(struct bfq_data *bfqd, struct request *rq)
* - start a new observation interval with this dispatch
*/
if (now_ns - bfqd->last_dispatch > 100*NSEC_PER_MSEC &&
- bfqd->rq_in_driver == 0)
+ bfqd->tot_rq_in_driver == 0)
goto update_rate_and_reset;

/* Update sampling information */
bfqd->peak_rate_samples++;

- if ((bfqd->rq_in_driver > 0 ||
+ if ((bfqd->tot_rq_in_driver > 0 ||
now_ns - bfqd->last_completion < BFQ_MIN_TT)
&& !BFQ_RQ_SEEKY(bfqd, bfqd->last_position, rq))
bfqd->sequential_samples++;
@@ -3942,10 +3948,8 @@ static bool idling_needed_for_service_guarantees(struct bfq_data *bfqd,
return false;

return (bfqq->wr_coeff > 1 &&
- (bfqd->wr_busy_queues <
- tot_busy_queues ||
- bfqd->rq_in_driver >=
- bfqq->dispatched + 4)) ||
+ (bfqd->wr_busy_queues < tot_busy_queues ||
+ bfqd->tot_rq_in_driver >= bfqq->dispatched + 4)) ||
bfq_asymmetric_scenario(bfqd, bfqq) ||
tot_busy_queues == 1;
}
@@ -4716,6 +4720,8 @@ bfq_choose_bfqq_for_injection(struct bfq_data *bfqd)
{
struct bfq_queue *bfqq, *in_serv_bfqq = bfqd->in_service_queue;
unsigned int limit = in_serv_bfqq->inject_limit;
+ int i;
+
/*
* If
* - bfqq is not weight-raised and therefore does not carry
@@ -4747,7 +4753,7 @@ bfq_choose_bfqq_for_injection(struct bfq_data *bfqd)
)
limit = 1;

- if (bfqd->rq_in_driver >= limit)
+ if (bfqd->tot_rq_in_driver >= limit)
return NULL;

/*
@@ -4762,11 +4768,12 @@ bfq_choose_bfqq_for_injection(struct bfq_data *bfqd)
* (and re-added only if it gets new requests, but then it
* is assigned again enough budget for its new backlog).
*/
- list_for_each_entry(bfqq, &bfqd->active_list, bfqq_list)
- if (!RB_EMPTY_ROOT(&bfqq->sort_list) &&
- (in_serv_always_inject || bfqq->wr_coeff > 1) &&
- bfq_serv_to_charge(bfqq->next_rq, bfqq) <=
- bfq_bfqq_budget_left(bfqq)) {
+ for (i = 0; i < bfqd->num_actuators; i++) {
+ list_for_each_entry(bfqq, &bfqd->active_list[i], bfqq_list)
+ if (!RB_EMPTY_ROOT(&bfqq->sort_list) &&
+ (in_serv_always_inject || bfqq->wr_coeff > 1) &&
+ bfq_serv_to_charge(bfqq->next_rq, bfqq) <=
+ bfq_bfqq_budget_left(bfqq)) {
/*
* Allow for only one large in-flight request
* on non-rotational devices, for the
@@ -4791,22 +4798,69 @@ bfq_choose_bfqq_for_injection(struct bfq_data *bfqd)
else
limit = in_serv_bfqq->inject_limit;

- if (bfqd->rq_in_driver < limit) {
+ if (bfqd->tot_rq_in_driver < limit) {
bfqd->rqs_injected = true;
return bfqq;
}
}
+ }
+
+ return NULL;
+}
+
+static struct bfq_queue *
+bfq_find_active_bfqq_for_actuator(struct bfq_data *bfqd, int idx)
+{
+ struct bfq_queue *bfqq;
+
+ if (bfqd->in_service_queue &&
+ bfqd->in_service_queue->actuator_idx == idx)
+ return bfqd->in_service_queue;
+
+ list_for_each_entry(bfqq, &bfqd->active_list[idx], bfqq_list) {
+ if (!RB_EMPTY_ROOT(&bfqq->sort_list) &&
+ bfq_serv_to_charge(bfqq->next_rq, bfqq) <=
+ bfq_bfqq_budget_left(bfqq)) {
+ return bfqq;
+ }
+ }
+
+ return NULL;
+}
+
+/*
+ * Perform a linear scan of each actuator, until an actuator is found
+ * for which the following two conditions hold: the load of the
+ * actuator is below the threshold (see comments on actuator_load_threshold
+ * for details), and there is a queue that contains I/O for that
+ * actuator. On success, return that queue.
+ */
+static struct bfq_queue *
+bfq_find_bfqq_for_underused_actuator(struct bfq_data *bfqd)
+{
+ int i;
+
+ for (i = 0; i < bfqd->num_actuators; i++) {
+ if (bfqd->rq_in_driver[i] < bfqd->actuator_load_threshold) {
+ struct bfq_queue *bfqq =
+ bfq_find_active_bfqq_for_actuator(bfqd, i);
+
+ if (bfqq)
+ return bfqq;
+ }
+ }

return NULL;
}

+
/*
* Select a queue for service. If we have a current queue in service,
* check whether to continue servicing it, or retrieve and set a new one.
*/
static struct bfq_queue *bfq_select_queue(struct bfq_data *bfqd)
{
- struct bfq_queue *bfqq;
+ struct bfq_queue *bfqq, *inject_bfqq;
struct request *next_rq;
enum bfqq_expiration reason = BFQQE_BUDGET_TIMEOUT;

@@ -4828,6 +4882,15 @@ static struct bfq_queue *bfq_select_queue(struct bfq_data *bfqd)
goto expire;

check_queue:
+ /*
+ * If some actuator is underutilized, but the in-service
+ * queue does not contain I/O for that actuator, then try to
+ * inject I/O for that actuator.
+ */
+ inject_bfqq = bfq_find_bfqq_for_underused_actuator(bfqd);
+ if (inject_bfqq && inject_bfqq != bfqq)
+ return inject_bfqq;
+
/*
* This loop is rarely executed more than once. Even when it
* happens, it is much more convenient to re-execute this loop
@@ -5183,11 +5246,11 @@ static struct request *__bfq_dispatch_request(struct blk_mq_hw_ctx *hctx)

/*
* We exploit the bfq_finish_requeue_request hook to
- * decrement rq_in_driver, but
+ * decrement tot_rq_in_driver, but
* bfq_finish_requeue_request will not be invoked on
* this request. So, to avoid unbalance, just start
- * this request, without incrementing rq_in_driver. As
- * a negative consequence, rq_in_driver is deceptively
+ * this request, without incrementing tot_rq_in_driver. As
+ * a negative consequence, tot_rq_in_driver is deceptively
* lower than it should be while this request is in
* service. This may cause bfq_schedule_dispatch to be
* invoked uselessly.
@@ -5196,7 +5259,7 @@ static struct request *__bfq_dispatch_request(struct blk_mq_hw_ctx *hctx)
* bfq_finish_requeue_request hook, if defined, is
* probably invoked also on this request. So, by
* exploiting this hook, we could 1) increment
- * rq_in_driver here, and 2) decrement it in
+ * tot_rq_in_driver here, and 2) decrement it in
* bfq_finish_requeue_request. Such a solution would
* let the value of the counter be always accurate,
* but it would entail using an extra interface
@@ -5225,7 +5288,7 @@ static struct request *__bfq_dispatch_request(struct blk_mq_hw_ctx *hctx)
* Of course, serving one request at a time may cause loss of
* throughput.
*/
- if (bfqd->strict_guarantees && bfqd->rq_in_driver > 0)
+ if (bfqd->strict_guarantees && bfqd->tot_rq_in_driver > 0)
goto exit;

bfqq = bfq_select_queue(bfqd);
@@ -5236,7 +5299,8 @@ static struct request *__bfq_dispatch_request(struct blk_mq_hw_ctx *hctx)

if (rq) {
inc_in_driver_start_rq:
- bfqd->rq_in_driver++;
+ bfqd->rq_in_driver[bfqq->actuator_idx]++;
+ bfqd->tot_rq_in_driver++;
start_rq:
rq->rq_flags |= RQF_STARTED;
}
@@ -6304,7 +6368,7 @@ static void bfq_update_hw_tag(struct bfq_data *bfqd)
struct bfq_queue *bfqq = bfqd->in_service_queue;

bfqd->max_rq_in_driver = max_t(int, bfqd->max_rq_in_driver,
- bfqd->rq_in_driver);
+ bfqd->tot_rq_in_driver);

if (bfqd->hw_tag == 1)
return;
@@ -6315,7 +6379,7 @@ static void bfq_update_hw_tag(struct bfq_data *bfqd)
* sum is not exact, as it's not taking into account deactivated
* requests.
*/
- if (bfqd->rq_in_driver + bfqd->queued <= BFQ_HW_QUEUE_THRESHOLD)
+ if (bfqd->tot_rq_in_driver + bfqd->queued <= BFQ_HW_QUEUE_THRESHOLD)
return;

/*
@@ -6326,7 +6390,7 @@ static void bfq_update_hw_tag(struct bfq_data *bfqd)
if (bfqq && bfq_bfqq_has_short_ttime(bfqq) &&
bfqq->dispatched + bfqq->queued[0] + bfqq->queued[1] <
BFQ_HW_QUEUE_THRESHOLD &&
- bfqd->rq_in_driver < BFQ_HW_QUEUE_THRESHOLD)
+ bfqd->tot_rq_in_driver < BFQ_HW_QUEUE_THRESHOLD)
return;

if (bfqd->hw_tag_samples++ < BFQ_HW_QUEUE_SAMPLES)
@@ -6347,7 +6411,8 @@ static void bfq_completed_request(struct bfq_queue *bfqq, struct bfq_data *bfqd)

bfq_update_hw_tag(bfqd);

- bfqd->rq_in_driver--;
+ bfqd->rq_in_driver[bfqq->actuator_idx]--;
+ bfqd->tot_rq_in_driver--;
bfqq->dispatched--;

if (!bfqq->dispatched && !bfq_bfqq_busy(bfqq)) {
@@ -6466,7 +6531,7 @@ static void bfq_completed_request(struct bfq_queue *bfqq, struct bfq_data *bfqd)
BFQQE_NO_MORE_REQUESTS);
}

- if (!bfqd->rq_in_driver)
+ if (!bfqd->tot_rq_in_driver)
bfq_schedule_dispatch(bfqd);
}

@@ -6597,13 +6662,13 @@ static void bfq_update_inject_limit(struct bfq_data *bfqd,
* conditions to do it, or we can lower the last base value
* computed.
*
- * NOTE: (bfqd->rq_in_driver == 1) means that there is no I/O
+ * NOTE: (bfqd->tot_rq_in_driver == 1) means that there is no I/O
* request in flight, because this function is in the code
* path that handles the completion of a request of bfqq, and,
* in particular, this function is executed before
- * bfqd->rq_in_driver is decremented in such a code path.
+ * bfqd->tot_rq_in_driver is decremented in such a code path.
*/
- if ((bfqq->last_serv_time_ns == 0 && bfqd->rq_in_driver == 1) ||
+ if ((bfqq->last_serv_time_ns == 0 && bfqd->tot_rq_in_driver == 1) ||
tot_time_ns < bfqq->last_serv_time_ns) {
if (bfqq->last_serv_time_ns == 0) {
/*
@@ -6613,7 +6678,7 @@ static void bfq_update_inject_limit(struct bfq_data *bfqd,
bfqq->inject_limit = max_t(unsigned int, 1, old_limit);
}
bfqq->last_serv_time_ns = tot_time_ns;
- } else if (!bfqd->rqs_injected && bfqd->rq_in_driver == 1)
+ } else if (!bfqd->rqs_injected && bfqd->tot_rq_in_driver == 1)
/*
* No I/O injected and no request still in service in
* the drive: these are the exact conditions for
@@ -7260,7 +7325,8 @@ static int bfq_init_queue(struct request_queue *q, struct elevator_type *e)
bfqd->queue_weights_tree = RB_ROOT_CACHED;
bfqd->num_groups_with_pending_reqs = 0;

- INIT_LIST_HEAD(&bfqd->active_list);
+ INIT_LIST_HEAD(&bfqd->active_list[0]);
+ INIT_LIST_HEAD(&bfqd->active_list[1]);
INIT_LIST_HEAD(&bfqd->idle_list);
INIT_HLIST_HEAD(&bfqd->burst_list);

@@ -7305,6 +7371,9 @@ static int bfq_init_queue(struct request_queue *q, struct elevator_type *e)
ref_wr_duration[blk_queue_nonrot(bfqd->queue)];
bfqd->peak_rate = ref_rate[blk_queue_nonrot(bfqd->queue)] * 2 / 3;

+ /* see comments on the definition of the next field inside bfq_data */
+ bfqd->actuator_load_threshold = 4;
+
spin_lock_init(&bfqd->lock);

/*
diff --git a/block/bfq-iosched.h b/block/bfq-iosched.h
index 953980de6b4b..be725d9143fa 100644
--- a/block/bfq-iosched.h
+++ b/block/bfq-iosched.h
@@ -586,7 +586,12 @@ struct bfq_data {
/* number of queued requests */
int queued;
/* number of requests dispatched and waiting for completion */
- int rq_in_driver;
+ int tot_rq_in_driver;
+ /*
+ * number of requests dispatched and waiting for completion
+ * for each actuator
+ */
+ int rq_in_driver[BFQ_MAX_ACTUATORS];

/* true if the device is non rotational and performs queueing */
bool nonrot_with_queueing;
@@ -680,8 +685,13 @@ struct bfq_data {
/* maximum budget allotted to a bfq_queue before rescheduling */
int bfq_max_budget;

- /* list of all the bfq_queues active on the device */
- struct list_head active_list;
+ /*
+ * List of all the bfq_queues active for a specific actuator
+ * on the device. Keeping active queues separate on a
+ * per-actuator basis helps implement per-actuator
+ * injection more efficiently.
+ */
+ struct list_head active_list[BFQ_MAX_ACTUATORS];
/* list of all the bfq_queues idle on the device */
struct list_head idle_list;

@@ -817,6 +827,29 @@ struct bfq_data {
sector_t sector[BFQ_MAX_ACTUATORS];
sector_t nr_sectors[BFQ_MAX_ACTUATORS];
struct blk_independent_access_range ia_ranges[BFQ_MAX_ACTUATORS];
+
+ /*
+ * If the number of I/O requests queued in the device for a
+ * given actuator is below the following threshold, then the
+ * actuator is deemed underutilized. If this condition is found to
+ * hold for some actuator upon a dispatch, but (i) the
+ * in-service queue does not contain I/O for that actuator,
+ * while (ii) some other queue does contain I/O for that
+ * actuator, then the head I/O request of the latter queue is
+ * returned (injected), instead of the head request of the
+ * currently in-service queue.
+ *
+ * We set the threshold, empirically, to the minimum possible
+ * value for which an actuator is fully utilized, or close to
+ * being fully utilized. By doing so, injected I/O 'steals' as
+ * few drive-queue slots as possible from the in-service
+ * queue. This reduces as much as possible the probability
+ * that the service of I/O from the in-service bfq_queue gets
+ * delayed because of slot exhaustion, i.e., because all the
+ * slots of the drive queue are filled with I/O injected from
+ * other queues (NCQ provides for 32 slots).
+ */
+ unsigned int actuator_load_threshold;
};

enum bfqq_state_flags {
diff --git a/block/bfq-wf2q.c b/block/bfq-wf2q.c
index 8fc3da4c23bb..ec0273e2cd07 100644
--- a/block/bfq-wf2q.c
+++ b/block/bfq-wf2q.c
@@ -477,7 +477,7 @@ static void bfq_active_insert(struct bfq_service_tree *st,
bfqd = (struct bfq_data *)bfqg->bfqd;
#endif
if (bfqq)
- list_add(&bfqq->bfqq_list, &bfqq->bfqd->active_list);
+ list_add(&bfqq->bfqq_list, &bfqq->bfqd->active_list[bfqq->actuator_idx]);
#ifdef CONFIG_BFQ_GROUP_IOSCHED
if (bfqg != bfqd->root_group)
bfqg->active_entities++;
--
2.20.1

2022-12-09 10:19:05

by Paolo Valente

Subject: [PATCH V10 4/8] block, bfq: turn bfqq_data into an array in bfq_io_cq

When a bfq_queue Q is merged with another queue, several pieces of
information are saved about Q. These pieces are stored in the
bfqq_data field in the bfq_io_cq data structure of the process
associated with Q.

Yet, with a multi-actuator drive, a process may get associated with
multiple bfq_queues: one queue for each of the N actuators. Each of
these queues may undergo a merge. So, the bfq_io_cq data structure
must be able to accommodate the above information for N queues.

This commit solves this problem by turning the bfqq_data scalar field
into an array of N elements (and by changing the code so as to handle
this array).

This solution is written under the assumption that bfq_queues
associated with different actuators cannot be cross-merged. This
assumption holds naturally with basic queue merging: the latter is
triggered by spatial locality, and sectors for different actuators are
not close to each other (apart from the corner case of the last
sectors served by a given actuator and the first sectors served by the
next actuator). As for stable cross-merging, it is disabled by an
earlier commit in this series (patch 2/8).

Reviewed-by: Damien Le Moal <[email protected]>
Signed-off-by: Gabriele Felici <[email protected]>
Signed-off-by: Gianmarco Lusvardi <[email protected]>
Signed-off-by: Giulio Barabino <[email protected]>
Signed-off-by: Emiliano Maccaferri <[email protected]>
Signed-off-by: Paolo Valente <[email protected]>
---
block/bfq-iosched.c | 100 ++++++++++++++++++++++++++------------------
block/bfq-iosched.h | 12 ++++--
2 files changed, 67 insertions(+), 45 deletions(-)

diff --git a/block/bfq-iosched.c b/block/bfq-iosched.c
index 0d6b35ef3d3f..48a2a57097c9 100644
--- a/block/bfq-iosched.c
+++ b/block/bfq-iosched.c
@@ -406,7 +406,7 @@ void bic_set_bfqq(struct bfq_io_cq *bic,
* we cancel the stable merge if
* bic->stable_merge_bfqq == bfqq.
*/
- struct bfq_iocq_bfqq_data *bfqq_data = &bic->bfqq_data;
+ struct bfq_iocq_bfqq_data *bfqq_data = &bic->bfqq_data[actuator_idx];

if (is_sync)
bic->bfqq[1][actuator_idx] = bfqq;
@@ -1181,9 +1181,10 @@ static void
bfq_bfqq_resume_state(struct bfq_queue *bfqq, struct bfq_data *bfqd,
struct bfq_io_cq *bic, bool bfq_already_existing)
{
- struct bfq_iocq_bfqq_data *bfqq_data = &bic->bfqq_data;
unsigned int old_wr_coeff = 1;
bool busy = bfq_already_existing && bfq_bfqq_busy(bfqq);
+ unsigned int a_idx = bfqq->actuator_idx;
+ struct bfq_iocq_bfqq_data *bfqq_data = &bic->bfqq_data[a_idx];

if (bfqq_data->saved_has_short_ttime)
bfq_mark_bfqq_has_short_ttime(bfqq);
@@ -1864,7 +1865,9 @@ static void bfq_bfqq_handle_idle_busy_switch(struct bfq_data *bfqd,
arrived_in_time = ktime_get_ns() <=
bfqq->ttime.last_end_request +
bfqd->bfq_slice_idle * 3;
-
+ unsigned int act_idx = bfq_actuator_index(bfqd, rq->bio);
+ bool bfqq_non_merged_or_stably_merged =
+ bfqq->bic || RQ_BIC(rq)->bfqq_data[act_idx].stably_merged;

/*
* bfqq deserves to be weight-raised if:
@@ -1898,9 +1901,8 @@ static void bfq_bfqq_handle_idle_busy_switch(struct bfq_data *bfqd,
*/
wr_or_deserves_wr = bfqd->low_latency &&
(bfqq->wr_coeff > 1 ||
- (bfq_bfqq_sync(bfqq) &&
- (bfqq->bic || RQ_BIC(rq)->bfqq_data.stably_merged) &&
- (*interactive || soft_rt)));
+ (bfq_bfqq_sync(bfqq) && bfqq_non_merged_or_stably_merged &&
+ (*interactive || soft_rt)));

/*
* Using the last flag, update budget and check whether bfqq
@@ -2888,6 +2890,35 @@ static bool bfq_may_be_close_cooperator(struct bfq_queue *bfqq,
static bool idling_boosts_thr_without_issues(struct bfq_data *bfqd,
struct bfq_queue *bfqq);

+static struct bfq_queue *
+bfq_setup_stable_merge(struct bfq_data *bfqd, struct bfq_queue *bfqq,
+ struct bfq_queue *stable_merge_bfqq,
+ struct bfq_iocq_bfqq_data *bfqq_data)
+{
+ int proc_ref = min(bfqq_process_refs(bfqq),
+ bfqq_process_refs(stable_merge_bfqq));
+ struct bfq_queue *new_bfqq;
+
+ if (idling_boosts_thr_without_issues(bfqd, bfqq) ||
+ proc_ref == 0)
+ return NULL;
+
+ /* next function will take at least one ref */
+ new_bfqq = bfq_setup_merge(bfqq, stable_merge_bfqq);
+
+ if (new_bfqq) {
+ bfqq_data->stably_merged = true;
+ if (new_bfqq->bic) {
+ unsigned int new_a_idx = new_bfqq->actuator_idx;
+ struct bfq_iocq_bfqq_data *new_bfqq_data =
+ &new_bfqq->bic->bfqq_data[new_a_idx];
+
+ new_bfqq_data->stably_merged = true;
+ }
+ }
+ return new_bfqq;
+}
+
/*
* Attempt to schedule a merge of bfqq with the currently in-service
* queue or with a close queue among the scheduled queues. Return
@@ -2913,7 +2944,8 @@ bfq_setup_cooperator(struct bfq_data *bfqd, struct bfq_queue *bfqq,
void *io_struct, bool request, struct bfq_io_cq *bic)
{
struct bfq_queue *in_service_bfqq, *new_bfqq;
- struct bfq_iocq_bfqq_data *bfqq_data = &bic->bfqq_data;
+ unsigned int a_idx = bfqq->actuator_idx;
+ struct bfq_iocq_bfqq_data *bfqq_data = &bic->bfqq_data[a_idx];

/* if a merge has already been setup, then proceed with that first */
if (bfqq->new_bfqq)
@@ -2943,29 +2975,15 @@ bfq_setup_cooperator(struct bfq_data *bfqd, struct bfq_queue *bfqq,
msecs_to_jiffies(bfq_late_stable_merging))) {
struct bfq_queue *stable_merge_bfqq =
bfqq_data->stable_merge_bfqq;
- int proc_ref = min(bfqq_process_refs(bfqq),
- bfqq_process_refs(stable_merge_bfqq));

/* deschedule stable merge, because done or aborted here */
bfq_put_stable_ref(stable_merge_bfqq);

bfqq_data->stable_merge_bfqq = NULL;

- if (!idling_boosts_thr_without_issues(bfqd, bfqq) &&
- proc_ref > 0) {
- /* next function will take at least one ref */
- struct bfq_queue *new_bfqq =
- bfq_setup_merge(bfqq, stable_merge_bfqq);
-
- if (new_bfqq) {
- bfqq_data->stably_merged = true;
- if (new_bfqq->bic)
- new_bfqq->bic->bfqq_data.stably_merged =
- true;
- }
- return new_bfqq;
- } else
- return NULL;
+ return bfq_setup_stable_merge(bfqd, bfqq,
+ stable_merge_bfqq,
+ bfqq_data);
}
}

@@ -3060,7 +3078,8 @@ bfq_setup_cooperator(struct bfq_data *bfqd, struct bfq_queue *bfqq,
static void bfq_bfqq_save_state(struct bfq_queue *bfqq)
{
struct bfq_io_cq *bic = bfqq->bic;
- struct bfq_iocq_bfqq_data *bfqq_data = &bic->bfqq_data;
+ unsigned int a_idx = bfqq->actuator_idx;
+ struct bfq_iocq_bfqq_data *bfqq_data = &bic->bfqq_data[a_idx];

/*
* If !bfqq->bic, the queue is already shared or its requests
@@ -3071,7 +3090,7 @@ static void bfq_bfqq_save_state(struct bfq_queue *bfqq)
return;

bfqq_data->saved_last_serv_time_ns = bfqq->last_serv_time_ns;
- bfqq_data->saved_inject_limit = bfqq->inject_limit;
+ bfqq_data->saved_inject_limit = bfqq->inject_limit;
bfqq_data->saved_decrease_time_jif = bfqq->decrease_time_jif;

bfqq_data->saved_weight = bfqq->entity.orig_weight;
@@ -5438,7 +5457,7 @@ static void bfq_exit_icq(struct io_cq *icq)
* therefore on its unused per-actuator fields being NULL.
*/
unsigned int num_actuators = BFQ_MAX_ACTUATORS;
- struct bfq_iocq_bfqq_data *bfqq_data = &bic->bfqq_data;
+ struct bfq_iocq_bfqq_data *bfqq_data = bic->bfqq_data;

/*
* bfqd is NULL if scheduler already exited, and in that case
@@ -5449,10 +5468,10 @@ static void bfq_exit_icq(struct io_cq *icq)
num_actuators = bfqd->num_actuators;
}

- if (bfqq_data->stable_merge_bfqq)
- bfq_put_stable_ref(bfqq_data->stable_merge_bfqq);
-
for (act_idx = 0; act_idx < num_actuators; act_idx++) {
+ if (bfqq_data[act_idx].stable_merge_bfqq)
+ bfq_put_stable_ref(bfqq_data[act_idx].stable_merge_bfqq);
+
bfq_exit_icq_bfqq(bic, true, act_idx);
bfq_exit_icq_bfqq(bic, false, act_idx);
}
@@ -5639,16 +5658,16 @@ bfq_do_early_stable_merge(struct bfq_data *bfqd, struct bfq_queue *bfqq,
struct bfq_io_cq *bic,
struct bfq_queue *last_bfqq_created)
{
+ unsigned int a_idx = last_bfqq_created->actuator_idx;
struct bfq_queue *new_bfqq =
bfq_setup_merge(bfqq, last_bfqq_created);
- struct bfq_iocq_bfqq_data *bfqq_data = &bic->bfqq_data;

if (!new_bfqq)
return bfqq;

if (new_bfqq->bic)
- new_bfqq->bic->bfqq_data.stably_merged = true;
- bfqq_data->stably_merged = true;
+ new_bfqq->bic->bfqq_data[a_idx].stably_merged = true;
+ bic->bfqq_data[a_idx].stably_merged = true;

/*
* Reusing merge functions. This implies that
@@ -5717,7 +5736,6 @@ static struct bfq_queue *bfq_do_or_sched_stable_merge(struct bfq_data *bfqd,
&bfqd->last_bfqq_created;

struct bfq_queue *last_bfqq_created = *source_bfqq;
- struct bfq_iocq_bfqq_data *bfqq_data = &bic->bfqq_data;

/*
* If last_bfqq_created has not been set yet, then init it. If
@@ -5779,7 +5797,8 @@ static struct bfq_queue *bfq_do_or_sched_stable_merge(struct bfq_data *bfqd,
/*
* Record the bfqq to merge to.
*/
- bfqq_data->stable_merge_bfqq = last_bfqq_created;
+ bic->bfqq_data[last_bfqq_created->actuator_idx].stable_merge_bfqq =
+ last_bfqq_created;
}
}

@@ -6699,7 +6718,7 @@ static struct bfq_queue *bfq_get_bfqq_handle_split(struct bfq_data *bfqd,
{
unsigned int act_idx = bfq_actuator_index(bfqd, bio);
struct bfq_queue *bfqq = bic_to_bfqq(bic, is_sync, act_idx);
- struct bfq_iocq_bfqq_data *bfqq_data = &bic->bfqq_data;
+ struct bfq_iocq_bfqq_data *bfqq_data = &bic->bfqq_data[act_idx];

if (likely(bfqq && bfqq != &bfqd->oom_bfqq))
return bfqq;
@@ -6807,7 +6826,7 @@ static struct bfq_queue *bfq_init_rq(struct request *rq)
struct bfq_queue *bfqq;
bool new_queue = false;
bool bfqq_already_existing = false, split = false;
- struct bfq_iocq_bfqq_data *bfqq_data;
+ unsigned int a_idx = bfq_actuator_index(bfqd, bio);

if (unlikely(!rq->elv.icq))
return NULL;
@@ -6831,17 +6850,16 @@ static struct bfq_queue *bfq_init_rq(struct request *rq)
bfqq = bfq_get_bfqq_handle_split(bfqd, bic, bio, false, is_sync,
&new_queue);

- bfqq_data = &bic->bfqq_data;
-
if (likely(!new_queue)) {
/* If the queue was seeky for too long, break it apart. */
if (bfq_bfqq_coop(bfqq) && bfq_bfqq_split_coop(bfqq) &&
- !bfqq_data->stably_merged) {
+ !bic->bfqq_data[a_idx].stably_merged) {
struct bfq_queue *old_bfqq = bfqq;

/* Update bic before losing reference to bfqq */
if (bfq_bfqq_in_large_burst(bfqq))
- bfqq_data->saved_in_large_burst = true;
+ bic->bfqq_data[a_idx].saved_in_large_burst =
+ true;

bfqq = bfq_split_bfqq(bic, bfqq);
split = true;
diff --git a/block/bfq-iosched.h b/block/bfq-iosched.h
index 9d77d964620f..a685aa32b037 100644
--- a/block/bfq-iosched.h
+++ b/block/bfq-iosched.h
@@ -416,7 +416,7 @@ struct bfq_queue {
struct bfq_iocq_bfqq_data {
/*
* Snapshot of the has_short_time flag before merging; taken
- * to remember its value while the queue is merged, so as to
+ * to remember its values while the queue is merged, so as to
* be able to restore it in case of split.
*/
bool saved_has_short_ttime;
@@ -430,7 +430,7 @@ struct bfq_iocq_bfqq_data {
u64 saved_tot_idle_time;

/*
- * Same purpose as the previous fields for the value of the
+ * Same purpose as the previous fields for the values of the
* field keeping the queue's belonging to a large burst
*/
bool saved_in_large_burst;
@@ -493,8 +493,12 @@ struct bfq_io_cq {
uint64_t blkcg_serial_nr; /* the current blkcg serial */
#endif

- /* persistent data for associated synchronous process queue */
- struct bfq_iocq_bfqq_data bfqq_data;
+ /*
+ * Persistent data for associated synchronous process queues
+ * (one queue per actuator, see field bfqq above). In
+ * particular, each of these queues may undergo a merge.
+ */
+ struct bfq_iocq_bfqq_data bfqq_data[BFQ_MAX_ACTUATORS];

unsigned int requests; /* Number of requests this process has in flight */
};
--
2.20.1

2022-12-13 16:02:41

by Jens Axboe

Subject: Re: [PATCH V10 0/8] block, bfq: extend bfq to support multi-actuator drives

On 12/13/22 8:40 AM, Paolo Valente wrote:
> Hi Jens, Damien,
> can we consider this for 6.2?

No, it's too late to queue up for 6.2, even when it was posted on
Friday. Bigger changes like that should be in my tree at least a week
before the merge window opens, preferably two (or somewhere in between).

I already tagged the main 6.2 block changes on Friday, and the pull
request has been sent out.

--
Jens Axboe

2022-12-13 16:03:08

by Paolo Valente

Subject: Re: [PATCH V10 0/8] block, bfq: extend bfq to support multi-actuator drives

Hi Jens, Damien,
can we consider this for 6.2?

Thanks,
Paolo

> On 9 Dec 2022, at 10:44, Paolo Valente <[email protected]> wrote:
>
> Hi,
> here is V10. It differs from V9 in that it applies Damien's
> recommendation in [2].
>
> Here is the whole description of this patch series again. This
> extension addresses the following issue. Single-LUN multi-actuator
> SCSI drives, as well as all multi-actuator SATA drives, appear as a
> single device to the I/O subsystem [1]. Yet they address commands to
> different actuators internally, as a function of the Logical Block
> Address (LBA). A given sector is reachable by only one of the
> actuators. For example, Seagate’s Serial Advanced Technology
> Attachment (SATA) version contains two actuators and maps the lower
> half of the SATA LBA space to the lower actuator and the upper half to
> the upper actuator.
>
> Evidently, to fully utilize such drives, no actuator should be left idle
> or underutilized while there is pending I/O for it. To reach this
> goal, the block layer must somehow control the load of each actuator
> individually. This series enriches BFQ with such a per-actuator
> control, as a first step. Then it also adds a simple mechanism for
> guaranteeing that actuators with pending I/O are never left idle.
>
> See [1] for a more detailed overview of the problem and of the
> solutions implemented in this patch series. There you will also find
> some preliminary performance results.
>
> Thanks,
> Paolo
>
> [1] https://www.linaro.org/blog/budget-fair-queueing-bfq-linux-io-scheduler-optimizations-for-multi-actuator-sata-hard-drives/
> [2] https://lore.kernel.org/lkml/[email protected]/T/#t
>
> Davide Zini (3):
> block, bfq: split also async bfq_queues on a per-actuator basis
> block, bfq: inject I/O to underutilized actuators
> block, bfq: balance I/O injection among underutilized actuators
>
> Federico Gavioli (1):
> block, bfq: retrieve independent access ranges from request queue
>
> Paolo Valente (4):
> block, bfq: split sync bfq_queues on a per-actuator basis
> block, bfq: forbid stable merging of queues associated with different
> actuators
> block, bfq: move io_cq-persistent bfqq data into a dedicated struct
> block, bfq: turn bfqq_data into an array in bfq_io_cq
>
> block/bfq-cgroup.c | 94 +++----
> block/bfq-iosched.c | 584 ++++++++++++++++++++++++++++++--------------
> block/bfq-iosched.h | 142 ++++++++---
> block/bfq-wf2q.c | 2 +-
> 4 files changed, 566 insertions(+), 256 deletions(-)
>
> --
> 2.20.1

2022-12-13 17:31:38

by Arie van der Hoeven

Subject: Re: [PATCH V10 0/8] block, bfq: extend bfq to support multi-actuator drives

We understand being conservative, but the code paths only impact a product that is not yet on the market. This is version 10, spanning months with many gaps waiting on review. It's an interesting case study.

-- Arie van der Hoeven


From: Jens Axboe <[email protected]>
Sent: Tuesday, December 13, 2022 7:43 AM
To: Paolo Valente <[email protected]>
Cc: linux-block <[email protected]>; linux-kernel <[email protected]>; Arie van der Hoeven <[email protected]>; Rory Chen <[email protected]>; Glen Valante <[email protected]>; Damien Le Moal <[email protected]>
Subject: Re: [PATCH V10 0/8] block, bfq: extend bfq to support multi-actuator drives




On 12/13/22 8:40 AM, Paolo Valente wrote:
> Hi Jens, Damien,
> can we consider this for 6.2?

No, it's too late to queue up for 6.2, even when it was posted on
Friday. Bigger changes like that should be in my tree at least a week
before the merge window opens, preferably two (or somewhere in between).

I already tagged the main 6.2 block changes on Friday, and the pull
request has been sent out.

--
Jens Axboe


2022-12-13 17:48:29

by Jens Axboe

Subject: Re: [PATCH V10 0/8] block, bfq: extend bfq to support multi-actuator drives

Please don't top post...

On 12/13/22 10:10 AM, Arie van der Hoeven wrote:
> We understand being conservative, but the code paths only impact a
> product that is not yet on the market. This is version 10, spanning
> months with many gaps waiting on review. It's an interesting case study.

That's a nice theory, but that's not how code works. As mentioned, the
last version was posted 1-2 weeks later than would've been appropriate
for inclusion.

--
Jens Axboe

2022-12-15 15:44:00

by Jens Axboe

Subject: Re: [PATCH V10 0/8] block, bfq: extend bfq to support multi-actuator drives

On 12/15/22 8:04 AM, Paolo Valente wrote:
>
>
>> On 13 Dec 2022, at 18:17, Jens Axboe <[email protected]> wrote:
>>
>> Please don't top post...
>>
>> On 12/13/22 10:10 AM, Arie van der Hoeven wrote:
>>> We understand being conservative, but the code paths only impact a
>>> product that is not yet on the market. This is version 10, spanning
>>> months with many gaps waiting on review. It's an interesting case study.
>>
>> That's a nice theory, but that's not how code works. As mentioned, the
>> last version was posted 1-2 weeks later than would've been appropriate
>> for inclusion.
>>
>
> So, what's the plan?

Looks like 1/8 and 8/8 still need Damien's review; then this can be
queued up for 6.3 when ready. Not sure why this is even a question; it
just means that inclusion is pushed back one release, as it missed the
current merge window.

--
Jens Axboe


2022-12-15 15:47:05

by Paolo Valente

Subject: Re: [PATCH V10 0/8] block, bfq: extend bfq to support multi-actuator drives



> On 13 Dec 2022, at 18:17, Jens Axboe <[email protected]> wrote:
>
> Please don't top post...
>
> On 12/13/22 10:10 AM, Arie van der Hoeven wrote:
>> We understand being conservative, but the code paths only impact a
>> product that is not yet on the market. This is version 10, spanning
>> months with many gaps waiting on review. It's an interesting case study.
>
> That's a nice theory, but that's not how code works. As mentioned, the
> last version was posted 1-2 weeks later than would've been appropriate
> for inclusion.
>

So, what's the plan?

Thanks,
Paolo

> --
> Jens Axboe
>