2015-11-02 14:49:07

by Paolo Valente

[permalink] [raw]
Subject: [PATCH BUGFIX 0/3] null_blk: fix throughput losses and hangs

Hi,
while doing some tests with the null_blk device driver, we bumped into
two problems: first, unjustified and in some cases high throughput
losses; second, actual hangs. These problems seem to be the
consequence of the combination of three causes, and this patchset
introduces a fix for each of these causes. In particular, changes
address:
. an apparent flaw in the logic with which delayed completions are
implemented: this flaw causes, with unlucky but non-pathological
workloads, actual request-completion delays to become arbitrarily
larger than the configured delay;
. the missing restart of the device queue on the completion of a request in
single-queue non-delayed mode;
. the overflow of the request-delay parameter, when extremely high values
are used (e.g., to spot bugs).

To avoid possible confusion, we stress that these fixes *do not* have
anything to do with the problems highlighted in [1] (tests of the
multiqueue xen-blkfront and xen-blkback modules with null_blk).

You can find more details in the patch descriptions.

Thanks,
Paolo and Arianna

[1] https://lkml.org/lkml/2015/8/19/181

Arianna Avanzini (2):
null_blk: guarantee device restart in all irq modes
null_blk: change type of completion_nsec to unsigned long

Paolo Valente (1):
null_blk: set a separate timer for each command

drivers/block/null_blk.c | 94 +++++++++++++++++-------------------------------
1 file changed, 33 insertions(+), 61 deletions(-)

--
1.9.1


2015-11-02 14:52:30

by Paolo Valente

[permalink] [raw]
Subject: [PATCH BUGFIX 1/3] null_blk: set a separate timer for each command

For the Timer IRQ mode (i.e., when command completions are delayed),
there is one timer for each CPU. Each of these timers
. has a completion queue associated with it, containing all the
command completions to be executed when the timer fires;
. is set, and a new completion-to-execute is inserted into its
completion queue, every time the dispatch code for a new command
happens to be executed on the CPU related to the timer.

This implies that, if the dispatch of a new command happens to be
executed on a CPU whose timer has already been set, but has not yet
fired, then the timer is set again, to the completion time of the
newly arrived command. When the timer eventually fires, all its queued
completions are executed.

This way of handling delayed command completions entails the following
problem: if more than one command completion is inserted into the
queue of a timer before the timer fires, then the expiration time for
the timer is moved forward every time each of these completions is
enqueued. As a consequence, only the last completion enqueued enjoys a
correct execution time, while all previous completions are unjustly
delayed until the last completion is executed (and at that time they
are executed all together).

Specifically, if all the above completions are enqueued almost at the
same time, then the problem is negligible. On the opposite end, if
every completion is enqueued a while after the previous completion was
enqueued (in the extreme case, it is enqueued only right before the
timer would have expired), then every enqueued completion, except for
the last one, experiences an inflated delay, proportional to the number
of completions enqueued after it. In the end, commands, and thus I/O
requests, may be completed at an arbitrarily lower rate than the
desired one.

This commit addresses this issue by replacing per-CPU timers with
per-command timers, i.e., by associating an individual timer with each
command.

Signed-off-by: Paolo Valente <[email protected]>
Signed-off-by: Arianna Avanzini <[email protected]>
---
drivers/block/null_blk.c | 79 +++++++++++++++---------------------------------
1 file changed, 24 insertions(+), 55 deletions(-)

diff --git a/drivers/block/null_blk.c b/drivers/block/null_blk.c
index 1c9e4fe..7db5a77 100644
--- a/drivers/block/null_blk.c
+++ b/drivers/block/null_blk.c
@@ -17,6 +17,7 @@ struct nullb_cmd {
struct bio *bio;
unsigned int tag;
struct nullb_queue *nq;
+ struct hrtimer timer;
};

struct nullb_queue {
@@ -46,17 +47,6 @@ static struct mutex lock;
static int null_major;
static int nullb_indexes;

-struct completion_queue {
- struct llist_head list;
- struct hrtimer timer;
-};
-
-/*
- * These are per-cpu for now, they will need to be configured by the
- * complete_queues parameter and appropriately mapped.
- */
-static DEFINE_PER_CPU(struct completion_queue, completion_queues);
-
enum {
NULL_IRQ_NONE = 0,
NULL_IRQ_SOFTIRQ = 1,
@@ -173,6 +163,8 @@ static void free_cmd(struct nullb_cmd *cmd)
put_tag(cmd->nq, cmd->tag);
}

+static enum hrtimer_restart null_cmd_timer_expired(struct hrtimer *timer);
+
static struct nullb_cmd *__alloc_cmd(struct nullb_queue *nq)
{
struct nullb_cmd *cmd;
@@ -183,6 +175,11 @@ static struct nullb_cmd *__alloc_cmd(struct nullb_queue *nq)
cmd = &nq->cmds[tag];
cmd->tag = tag;
cmd->nq = nq;
+ if (irqmode == NULL_IRQ_TIMER) {
+ hrtimer_init(&cmd->timer, CLOCK_MONOTONIC,
+ HRTIMER_MODE_REL);
+ cmd->timer.function = null_cmd_timer_expired;
+ }
return cmd;
}

@@ -231,47 +228,28 @@ static void end_cmd(struct nullb_cmd *cmd)

static enum hrtimer_restart null_cmd_timer_expired(struct hrtimer *timer)
{
- struct completion_queue *cq;
- struct llist_node *entry;
- struct nullb_cmd *cmd;
-
- cq = &per_cpu(completion_queues, smp_processor_id());
-
- while ((entry = llist_del_all(&cq->list)) != NULL) {
- entry = llist_reverse_order(entry);
- do {
- struct request_queue *q = NULL;
+ struct nullb_cmd *cmd = container_of(timer, struct nullb_cmd, timer);
+ struct request_queue *q = NULL;

- cmd = container_of(entry, struct nullb_cmd, ll_list);
- entry = entry->next;
- if (cmd->rq)
- q = cmd->rq->q;
- end_cmd(cmd);
+ if (cmd->rq)
+ q = cmd->rq->q;

- if (q && !q->mq_ops && blk_queue_stopped(q)) {
- spin_lock(q->queue_lock);
- if (blk_queue_stopped(q))
- blk_start_queue(q);
- spin_unlock(q->queue_lock);
- }
- } while (entry);
+ if (q && !q->mq_ops && blk_queue_stopped(q)) {
+ spin_lock(q->queue_lock);
+ if (blk_queue_stopped(q))
+ blk_start_queue(q);
+ spin_unlock(q->queue_lock);
}
+ end_cmd(cmd);

return HRTIMER_NORESTART;
}

static void null_cmd_end_timer(struct nullb_cmd *cmd)
{
- struct completion_queue *cq = &per_cpu(completion_queues, get_cpu());
-
- cmd->ll_list.next = NULL;
- if (llist_add(&cmd->ll_list, &cq->list)) {
- ktime_t kt = ktime_set(0, completion_nsec);
+ ktime_t kt = ktime_set(0, completion_nsec);

- hrtimer_start(&cq->timer, kt, HRTIMER_MODE_REL_PINNED);
- }
-
- put_cpu();
+ hrtimer_start(&cmd->timer, kt, HRTIMER_MODE_REL);
}

static void null_softirq_done_fn(struct request *rq)
@@ -368,6 +346,10 @@ static int null_queue_rq(struct blk_mq_hw_ctx *hctx,
{
struct nullb_cmd *cmd = blk_mq_rq_to_pdu(bd->rq);

+ if (irqmode == NULL_IRQ_TIMER) {
+ hrtimer_init(&cmd->timer, CLOCK_MONOTONIC, HRTIMER_MODE_REL);
+ cmd->timer.function = null_cmd_timer_expired;
+ }
cmd->rq = bd->rq;
cmd->nq = hctx->driver_data;

@@ -637,19 +619,6 @@ static int __init null_init(void)

mutex_init(&lock);

- /* Initialize a separate list for each CPU for issuing softirqs */
- for_each_possible_cpu(i) {
- struct completion_queue *cq = &per_cpu(completion_queues, i);
-
- init_llist_head(&cq->list);
-
- if (irqmode != NULL_IRQ_TIMER)
- continue;
-
- hrtimer_init(&cq->timer, CLOCK_MONOTONIC, HRTIMER_MODE_REL);
- cq->timer.function = null_cmd_timer_expired;
- }
-
null_major = register_blkdev(0, "nullb");
if (null_major < 0)
return null_major;
--
1.9.1

2015-11-02 14:52:33

by Paolo Valente

[permalink] [raw]
Subject: [PATCH BUGFIX 2/3] null_blk: guarantee device restart in all irq modes

From: Arianna Avanzini <[email protected]>

In single-queue (block layer) mode, the function null_rq_prep_fn stops
the device if alloc_cmd fails. Then, once stopped, the device must be
restarted on the next command completion, so that the request(s) for
which alloc_cmd failed can be requeued. Otherwise the device hangs.

Unfortunately, device restart is currently performed only for delayed
completions, i.e., in irqmode==2. For the above reasons, this causes
hangs with the other irqmodes in combination with the single-queue
block layer.

This commit addresses the issue by making sure that, if stopped, the
device is properly restarted for all irqmodes on command completion.

Signed-off-by: Paolo Valente <[email protected]>
Signed-off-by: Arianna Avanzini <[email protected]>
---
drivers/block/null_blk.c | 29 ++++++++++++++++-------------
1 file changed, 16 insertions(+), 13 deletions(-)

diff --git a/drivers/block/null_blk.c b/drivers/block/null_blk.c
index 7db5a77..caaff67 100644
--- a/drivers/block/null_blk.c
+++ b/drivers/block/null_blk.c
@@ -210,6 +210,8 @@ static struct nullb_cmd *alloc_cmd(struct nullb_queue *nq, int can_wait)

static void end_cmd(struct nullb_cmd *cmd)
{
+ struct request_queue *q = NULL;
+
switch (queue_mode) {
case NULL_Q_MQ:
blk_mq_end_request(cmd->rq, 0);
@@ -220,27 +222,28 @@ static void end_cmd(struct nullb_cmd *cmd)
break;
case NULL_Q_BIO:
bio_endio(cmd->bio);
- break;
+ goto free_cmd;
}

- free_cmd(cmd);
-}
-
-static enum hrtimer_restart null_cmd_timer_expired(struct hrtimer *timer)
-{
- struct nullb_cmd *cmd = container_of(timer, struct nullb_cmd, timer);
- struct request_queue *q = NULL;
-
if (cmd->rq)
q = cmd->rq->q;

- if (q && !q->mq_ops && blk_queue_stopped(q)) {
- spin_lock(q->queue_lock);
+ /* Restart queue if needed, as we are freeing a tag */
+ if (q && blk_queue_stopped(q)) {
+ unsigned long flags;
+
+ spin_lock_irqsave(q->queue_lock, flags);
if (blk_queue_stopped(q))
blk_start_queue(q);
- spin_unlock(q->queue_lock);
+ spin_unlock_irqrestore(q->queue_lock, flags);
}
- end_cmd(cmd);
+free_cmd:
+ free_cmd(cmd);
+}
+
+static enum hrtimer_restart null_cmd_timer_expired(struct hrtimer *timer)
+{
+ end_cmd(container_of(timer, struct nullb_cmd, timer));

return HRTIMER_NORESTART;
}
--
1.9.1

2015-11-02 14:47:58

by Paolo Valente

[permalink] [raw]
Subject: [PATCH BUGFIX 3/3] null_blk: change type of completion_nsec to unsigned long

From: Arianna Avanzini <[email protected]>

This commit at least doubles the maximum value for
completion_nsec. This helps in special cases where one wants or needs
to emulate extremely slow I/O (for example, to spot bugs).

Signed-off-by: Paolo Valente <[email protected]>
Signed-off-by: Arianna Avanzini <[email protected]>
---
drivers/block/null_blk.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/block/null_blk.c b/drivers/block/null_blk.c
index caaff67..6f0bb2f 100644
--- a/drivers/block/null_blk.c
+++ b/drivers/block/null_blk.c
@@ -125,8 +125,8 @@ static const struct kernel_param_ops null_irqmode_param_ops = {
device_param_cb(irqmode, &null_irqmode_param_ops, &irqmode, S_IRUGO);
MODULE_PARM_DESC(irqmode, "IRQ completion handler. 0-none, 1-softirq, 2-timer");

-static int completion_nsec = 10000;
-module_param(completion_nsec, int, S_IRUGO);
+static unsigned long completion_nsec = 10000;
+module_param(completion_nsec, ulong, S_IRUGO);
MODULE_PARM_DESC(completion_nsec, "Time in ns to complete a request in hardware. Default: 10,000ns");

static int hw_queue_depth = 64;
--
1.9.1

2015-11-02 16:14:57

by Jens Axboe

[permalink] [raw]
Subject: Re: [PATCH BUGFIX 1/3] null_blk: set a separate timer for each command

On 11/02/2015 07:31 AM, Paolo Valente wrote:
> For the Timer IRQ mode (i.e., when command completions are delayed),
> there is one timer for each CPU. Each of these timers
> . has a completion queue associated with it, containing all the
> command completions to be executed when the timer fires;
> . is set, and a new completion-to-execute is inserted into its
> completion queue, every time the dispatch code for a new command
> happens to be executed on the CPU related to the timer.
>
> This implies that, if the dispatch of a new command happens to be
> executed on a CPU whose timer has already been set, but has not yet
> fired, then the timer is set again, to the completion time of the
> newly arrived command. When the timer eventually fires, all its queued
> completions are executed.
>
> This way of handling delayed command completions entails the following
> problem: if more than one command completion is inserted into the
> queue of a timer before the timer fires, then the expiration time for
> the timer is moved forward every time each of these completions is
> enqueued. As a consequence, only the last completion enqueued enjoys a
> correct execution time, while all previous completions are unjustly
> delayed until the last completion is executed (and at that time they
> are executed all together).
>
> Specifically, if all the above completions are enqueued almost at the
> same time, then the problem is negligible. On the opposite end, if
> every completion is enqueued a while after the previous completion was
> enqueued (in the extreme case, it is enqueued only right before the
> timer would have expired), then every enqueued completion, except for
> the last one, experiences an inflated delay, proportional to the number
> of completions enqueued after it. In the end, commands, and thus I/O
> requests, may be completed at an arbitrarily lower rate than the
> desired one.
>
> This commit addresses this issue by replacing per-CPU timers with
> per-command timers, i.e., by associating an individual timer with each
> command.

Functionally the patch looks fine. My only worry is that a timer per
command would be an unnecessary slowdown compared to pushing one timer
forward. The problem should be fixable by still doing that, just
maintaining next-expire instead. Maybe something that would still
roughly be precise enough, while still getting some completion batching
going? Or maybe that would be slower, and the individual timers are
still better.

Comments?

--
Jens Axboe

2015-11-02 16:25:43

by Jens Axboe

[permalink] [raw]
Subject: Re: [PATCH BUGFIX 2/3] null_blk: guarantee device restart in all irq modes

On 11/02/2015 07:31 AM, Paolo Valente wrote:
> From: Arianna Avanzini <[email protected]>
>
> In single-queue (block layer) mode, the function null_rq_prep_fn stops
> the device if alloc_cmd fails. Then, once stopped, the device must be
> restarted on the next command completion, so that the request(s) for
> which alloc_cmd failed can be requeued. Otherwise the device hangs.
>
> Unfortunately, device restart is currently performed only for delayed
> completions, i.e., in irqmode==2. This fact causes hangs, for the
> above reasons, with the other irqmodes in combination with single-queue
> block layer.
>
> This commit addresses this issue by making sure that, if stopped, the
> device is properly restarted for all irqmodes on completions.

This looks good. I did a double take at the removal of the q->mq_ops
check before blk_queue_stopped(), but we can't get there in MQ mode.
Perhaps a comment would be warranted for that; the ->mq_ops check served
as a bit of documentation before.

--
Jens Axboe

2015-11-03 09:02:24

by Paolo Valente

[permalink] [raw]
Subject: Re: [PATCH BUGFIX 1/3] null_blk: set a separate timer for each command


Il giorno 02/nov/2015, alle ore 17:14, Jens Axboe <[email protected]> ha scritto:

> On 11/02/2015 07:31 AM, Paolo Valente wrote:
>> For the Timer IRQ mode (i.e., when command completions are delayed),
>> there is one timer for each CPU. Each of these timers
>> . has a completion queue associated with it, containing all the
>> command completions to be executed when the timer fires;
>> . is set, and a new completion-to-execute is inserted into its
>> completion queue, every time the dispatch code for a new command
>> happens to be executed on the CPU related to the timer.
>>
>> This implies that, if the dispatch of a new command happens to be
>> executed on a CPU whose timer has already been set, but has not yet
>> fired, then the timer is set again, to the completion time of the
>> newly arrived command. When the timer eventually fires, all its queued
>> completions are executed.
>>
>> This way of handling delayed command completions entails the following
>> problem: if more than one command completion is inserted into the
>> queue of a timer before the timer fires, then the expiration time for
>> the timer is moved forward every time each of these completions is
>> enqueued. As a consequence, only the last completion enqueued enjoys a
>> correct execution time, while all previous completions are unjustly
>> delayed until the last completion is executed (and at that time they
>> are executed all together).
>>
>> Specifically, if all the above completions are enqueued almost at the
>> same time, then the problem is negligible. On the opposite end, if
>> every completion is enqueued a while after the previous completion was
>> enqueued (in the extreme case, it is enqueued only right before the
>> timer would have expired), then every enqueued completion, except for
>> the last one, experiences an inflated delay, proportional to the number
>> of completions enqueued after it. In the end, commands, and thus I/O
>> requests, may be completed at an arbitrarily lower rate than the
>> desired one.
>>
>> This commit addresses this issue by replacing per-CPU timers with
>> per-command timers, i.e., by associating an individual timer with each
>> command.
>
> Functionally the patch looks fine. My only worry is that a timer per command would be an unnecessary slowdown compared to pushing one timer forward. The problem should be fixable by still doing that, just maintaining next-expire instead. Maybe something that would still roughly be precise enough, while still getting some completion batching going? Or maybe that would be slower, and the individual timers are still better.
>
> Comments?
>

I have tried to think about these questions. Unfortunately I was not able to go beyond the following general considerations.

Given that:
1) hrtimer_start is very efficient;
2) to implement a batch-by-batch execution of command completions, more complex code would of course be needed in null_blk;
I think that, to get a perceptible improvement, command completions should probably be executed in batches of at least 3 or 4 commands each.

In this respect, commands can accumulate in a batch only if the time granularity with which delayed commands are guaranteed to be executed, i.e., the granularity with which timers eventually fire, happens to be coarser than completion_nsec. But this would lead to the exact uncontrolled throughput-loss problem addressed by this patch, though to a lesser extent. The reason is as follows.

Consider a batch of 3 or 4 commands, executed at a certain timer expiration. If the arrival time of these commands is close to the timer expiration, then the commands happen to be executed with a delay close to the expected one. If, on the other hand, their arrival time is far from the timer expiration, then they are completed with a larger delay. In the worst case, their delay is inflated by a factor proportional to the size of the batch, namely 3 or 4.

In the end, depending on the command arrival pattern, the throughput of the emulated device may vary by 3 or 4 times. This would be a reduced version of the issue addressed by this patch. In fact, with the current implementation of delayed completions (per-cpu timers), the throughput may vary, in the worst case, by a factor equal to the queue depth (64 by default).

Finally, a side note: as for whether the increased efficiency of batched delayed completions is worth the additional code complexity, as well as the throughput losses/fluctuations, the low delays for which this efficiency becomes important match the latencies of new very fast devices (or are at least of the same order). But, exactly because of the high speed of these devices, what typically matters (AFAIK) in tests with them is understanding whether the rest of the system can cope with their speed. And for this type of measurement, the best/typical choice is (again, AFAIK) not to delay request completions at all in null_blk.

I hope these considerations may be somehow useful.

Thanks,
Paolo

> --
> Jens Axboe


--
Paolo Valente
Algogroup
Dipartimento di Fisica, Informatica e Matematica
Via Campi, 213/B
41125 Modena - Italy
homepage: http://algogroup.unimore.it/people/paolo/

2015-11-03 09:02:53

by Paolo Valente

[permalink] [raw]
Subject: Re: [PATCH BUGFIX 2/3] null_blk: guarantee device restart in all irq modes


Il giorno 02/nov/2015, alle ore 17:25, Jens Axboe <[email protected]> ha scritto:

> On 11/02/2015 07:31 AM, Paolo Valente wrote:
>> From: Arianna Avanzini <[email protected]>
>>
>> In single-queue (block layer) mode, the function null_rq_prep_fn stops
>> the device if alloc_cmd fails. Then, once stopped, the device must be
>> restarted on the next command completion, so that the request(s) for
>> which alloc_cmd failed can be requeued. Otherwise the device hangs.
>>
>> Unfortunately, device restart is currently performed only for delayed
>> completions, i.e., in irqmode==2. This fact causes hangs, for the
>> above reasons, with the other irqmodes in combination with single-queue
>> block layer.
>>
>> This commit addresses this issue by making sure that, if stopped, the
>> device is properly restarted for all irqmodes on completions.
>
> This looks good. I did a double take at the removal of the q->mq_ops check before blk_queue_stopped(), but we can't get there in MQ mode. Perhaps a comment would be warranted for that; the ->mq_ops check served as a bit of documentation before.
>

Honestly, I removed that additional check by mistake. I will put it back in the next version of this patchset, once (and if) I have the green light on patch 1/3.

Thanks,
Paolo

> --
> Jens Axboe
>


--
Paolo Valente
Algogroup
Dipartimento di Fisica, Informatica e Matematica
Via Campi, 213/B
41125 Modena - Italy
homepage: http://algogroup.unimore.it/people/paolo/

2015-11-29 17:28:10

by Paolo Valente

[permalink] [raw]
Subject: Re: [PATCH BUGFIX 1/3] null_blk: set a separate timer for each command

Hi Jens,
this is just to ask you whether you have made any decision about these patches, even if the decision is simply not to apply them.

Thanks,
Paolo

Il giorno 03/nov/2015, alle ore 10:01, Paolo Valente <[email protected]> ha scritto:

>
> Il giorno 02/nov/2015, alle ore 17:14, Jens Axboe <[email protected]> ha scritto:
>
>> On 11/02/2015 07:31 AM, Paolo Valente wrote:
>>> For the Timer IRQ mode (i.e., when command completions are delayed),
>>> there is one timer for each CPU. Each of these timers
>>> . has a completion queue associated with it, containing all the
>>> command completions to be executed when the timer fires;
>>> . is set, and a new completion-to-execute is inserted into its
>>> completion queue, every time the dispatch code for a new command
>>> happens to be executed on the CPU related to the timer.
>>>
>>> This implies that, if the dispatch of a new command happens to be
>>> executed on a CPU whose timer has already been set, but has not yet
>>> fired, then the timer is set again, to the completion time of the
>>> newly arrived command. When the timer eventually fires, all its queued
>>> completions are executed.
>>>
>>> This way of handling delayed command completions entails the following
>>> problem: if more than one command completion is inserted into the
>>> queue of a timer before the timer fires, then the expiration time for
>>> the timer is moved forward every time each of these completions is
>>> enqueued. As a consequence, only the last completion enqueued enjoys a
>>> correct execution time, while all previous completions are unjustly
>>> delayed until the last completion is executed (and at that time they
>>> are executed all together).
>>>
>>> Specifically, if all the above completions are enqueued almost at the
>>> same time, then the problem is negligible. On the opposite end, if
>>> every completion is enqueued a while after the previous completion was
>>> enqueued (in the extreme case, it is enqueued only right before the
>>> timer would have expired), then every enqueued completion, except for
>>> the last one, experiences an inflated delay, proportional to the number
>>> of completions enqueued after it. In the end, commands, and thus I/O
>>> requests, may be completed at an arbitrarily lower rate than the
>>> desired one.
>>>
>>> This commit addresses this issue by replacing per-CPU timers with
>>> per-command timers, i.e., by associating an individual timer with each
>>> command.
>>
>> Functionally the patch looks fine. My only worry is that a timer per command would be an unnecessary slowdown compared to pushing one timer forward. The problem should be fixable by still doing that, just maintaining next-expire instead. Maybe something that would still roughly be precise enough, while still getting some completion batching going? Or maybe that would be slower, and the individual timers are still better.
>>
>> Comments?
>>
>
> I have tried to think about these questions. Unfortunately I was not able to go beyond the following general considerations.
>
> Given that:
> 1) hrtimer_start is very efficient;
> 2) to implement a batch-by-batch execution of command completions, more complex code would of course be needed in null_blk;
> I think that, to get a perceptible improvement, command completions should probably be executed in batches of at least 3 or 4 commands each.
>
> In this respect, commands can accumulate in a batch only if the time granularity with which delayed commands are guaranteed to be executed, i.e., the granularity with which timers eventually fire, happens to be coarser than completion_nsec. But this would lead to the exact uncontrolled throughput-loss problem addressed by this patch, though to a lesser extent. The reason is as follows.
>
> Consider a batch of 3 or 4 commands, executed at a certain timer expiration. If the arrival time of these commands is close to the timer expiration, then the commands happen to be executed with a delay close to the expected one. If, on the other hand, their arrival time is far from the timer expiration, then they are completed with a larger delay. In the worst case, their delay is inflated by a factor proportional to the size of the batch, namely 3 or 4.
>
> In the end, depending on the command arrival pattern, the throughput of the emulated device may vary by 3 or 4 times. This would be a reduced version of the issue addressed by this patch. In fact, with the current implementation of delayed completions (per-cpu timers), the throughput may vary, in the worst case, by a factor equal to the queue depth (64 by default).
>
> Finally, a side note: as for whether the increased efficiency of batched delayed completions is worth the additional code complexity, as well as the throughput losses/fluctuations, the low delays for which this efficiency becomes important match the latencies of new very fast devices (or are at least of the same order). But, exactly because of the high speed of these devices, what typically matters (AFAIK) in tests with them is understanding whether the rest of the system can cope with their speed. And for this type of measurement, the best/typical choice is (again, AFAIK) not to delay request completions at all in null_blk.
>
> I hope these considerations may be somehow useful.
>
> Thanks,
> Paolo
>
>> --
>> Jens Axboe
>
>
> --
> Paolo Valente
> Algogroup
> Dipartimento di Fisica, Informatica e Matematica
> Via Campi, 213/B
> 41125 Modena - Italy
> homepage: http://algogroup.unimore.it/people/paolo/


--
Paolo Valente
Algogroup
Dipartimento di Fisica, Informatica e Matematica
Via Campi, 213/B
41125 Modena - Italy
homepage: http://algogroup.unimore.it/people/paolo/

2015-11-30 15:56:09

by Jens Axboe

[permalink] [raw]
Subject: Re: [PATCH BUGFIX 1/3] null_blk: set a separate timer for each command

On 11/29/2015 10:27 AM, Paolo Valente wrote:
> Hi Jens,
> this is just to ask you whether you have made any decision about these patches, even if the decision is simply not to apply them.

Let's move forward with them; I got hung up on potential drops in
performance. We can fix that up as we go. Are you going to respin #2 to
reinstate the mq_ops check?

--
Jens Axboe

2015-12-01 10:48:59

by Paolo Valente

[permalink] [raw]
Subject: [PATCH BUGFIX V2 0/3] null_blk: fix throughput losses and hangs

Hi,
here is an updated version of the patchset, differing from the
previous version only in that it reinstates the missing extra check
pointed out in [2]. For your convenience, the content of the cover
letter for the previous version follows.

While doing some tests with the null_blk device driver, we bumped into
two problems: first, unjustified and in some cases high throughput
losses; second, actual hangs. These problems seem to be the
consequence of the combination of three causes, and this patchset
introduces a fix for each of these causes. In particular, changes
address:
. an apparent flaw in the logic with which delayed completions are
implemented: this flaw causes, with unlucky but non-pathological
workloads, actual request-completion delays to become arbitrarily
larger than the configured delay;
. the missing restart of the device queue on the completion of a request in
single-queue non-delayed mode;
. the overflow of the request-delay parameter, when extremely high values
are used (e.g., to spot bugs).

To avoid possible confusion, we stress that these fixes *do not* have
anything to do with the problems highlighted in [1] (tests of the
multiqueue xen-blkfront and xen-blkback modules with null_blk).

You can find more details in the patch descriptions.

Thanks,
Paolo and Arianna

[1] https://lkml.org/lkml/2015/8/19/181
[2] https://lkml.org/lkml/2015/11/2/433

Arianna Avanzini (2):
null_blk: guarantee device restart in all irq modes
null_blk: change type of completion_nsec to unsigned long

Paolo Valente (1):
null_blk: set a separate timer for each command

drivers/block/null_blk.c | 94 +++++++++++++++++-------------------------------
1 file changed, 33 insertions(+), 61 deletions(-)

--
1.9.1

2015-12-01 10:49:42

by Paolo Valente

[permalink] [raw]
Subject: [PATCH BUGFIX V2 1/3] null_blk: set a separate timer for each command

For the Timer IRQ mode (i.e., when command completions are delayed),
there is one timer for each CPU. Each of these timers
. has a completion queue associated with it, containing all the
command completions to be executed when the timer fires;
. is set, and a new completion-to-execute is inserted into its
completion queue, every time the dispatch code for a new command
happens to be executed on the CPU related to the timer.

This implies that, if the dispatch of a new command happens to be
executed on a CPU whose timer has already been set, but has not yet
fired, then the timer is set again, to the completion time of the
newly arrived command. When the timer eventually fires, all its queued
completions are executed.

This way of handling delayed command completions entails the following
problem: if more than one command completion is inserted into the
queue of a timer before the timer fires, then the expiration time for
the timer is moved forward every time each of these completions is
enqueued. As a consequence, only the last completion enqueued enjoys a
correct execution time, while all previous completions are unjustly
delayed until the last completion is executed (and at that time they
are executed all together).

Specifically, if all the above completions are enqueued almost at the
same time, then the problem is negligible. On the opposite end, if
every completion is enqueued a while after the previous completion was
enqueued (in the extreme case, it is enqueued only right before the
timer would have expired), then every enqueued completion, except for
the last one, experiences an inflated delay, proportional to the number
of completions enqueued after it. In the end, commands, and thus I/O
requests, may be completed at an arbitrarily lower rate than the
desired one.

This commit addresses this issue by replacing per-CPU timers with
per-command timers, i.e., by associating an individual timer with each
command.

Signed-off-by: Paolo Valente <[email protected]>
Signed-off-by: Arianna Avanzini <[email protected]>
---
drivers/block/null_blk.c | 79 +++++++++++++++---------------------------------
1 file changed, 24 insertions(+), 55 deletions(-)

diff --git a/drivers/block/null_blk.c b/drivers/block/null_blk.c
index 5c8ba54..08932f5 100644
--- a/drivers/block/null_blk.c
+++ b/drivers/block/null_blk.c
@@ -18,6 +18,7 @@ struct nullb_cmd {
struct bio *bio;
unsigned int tag;
struct nullb_queue *nq;
+ struct hrtimer timer;
};

struct nullb_queue {
@@ -49,17 +50,6 @@ static int null_major;
static int nullb_indexes;
static struct kmem_cache *ppa_cache;

-struct completion_queue {
- struct llist_head list;
- struct hrtimer timer;
-};
-
-/*
- * These are per-cpu for now, they will need to be configured by the
- * complete_queues parameter and appropriately mapped.
- */
-static DEFINE_PER_CPU(struct completion_queue, completion_queues);
-
enum {
NULL_IRQ_NONE = 0,
NULL_IRQ_SOFTIRQ = 1,
@@ -180,6 +170,8 @@ static void free_cmd(struct nullb_cmd *cmd)
put_tag(cmd->nq, cmd->tag);
}

+static enum hrtimer_restart null_cmd_timer_expired(struct hrtimer *timer);
+
static struct nullb_cmd *__alloc_cmd(struct nullb_queue *nq)
{
struct nullb_cmd *cmd;
@@ -190,6 +182,11 @@ static struct nullb_cmd *__alloc_cmd(struct nullb_queue *nq)
cmd = &nq->cmds[tag];
cmd->tag = tag;
cmd->nq = nq;
+ if (irqmode == NULL_IRQ_TIMER) {
+ hrtimer_init(&cmd->timer, CLOCK_MONOTONIC,
+ HRTIMER_MODE_REL);
+ cmd->timer.function = null_cmd_timer_expired;
+ }
return cmd;
}

@@ -238,47 +235,28 @@ static void end_cmd(struct nullb_cmd *cmd)

static enum hrtimer_restart null_cmd_timer_expired(struct hrtimer *timer)
{
- struct completion_queue *cq;
- struct llist_node *entry;
- struct nullb_cmd *cmd;
-
- cq = &per_cpu(completion_queues, smp_processor_id());
-
- while ((entry = llist_del_all(&cq->list)) != NULL) {
- entry = llist_reverse_order(entry);
- do {
- struct request_queue *q = NULL;
+ struct nullb_cmd *cmd = container_of(timer, struct nullb_cmd, timer);
+ struct request_queue *q = NULL;

- cmd = container_of(entry, struct nullb_cmd, ll_list);
- entry = entry->next;
- if (cmd->rq)
- q = cmd->rq->q;
- end_cmd(cmd);
+ if (cmd->rq)
+ q = cmd->rq->q;

- if (q && !q->mq_ops && blk_queue_stopped(q)) {
- spin_lock(q->queue_lock);
- if (blk_queue_stopped(q))
- blk_start_queue(q);
- spin_unlock(q->queue_lock);
- }
- } while (entry);
+ if (q && !q->mq_ops && blk_queue_stopped(q)) {
+ spin_lock(q->queue_lock);
+ if (blk_queue_stopped(q))
+ blk_start_queue(q);
+ spin_unlock(q->queue_lock);
}
+ end_cmd(cmd);

return HRTIMER_NORESTART;
}

static void null_cmd_end_timer(struct nullb_cmd *cmd)
{
- struct completion_queue *cq = &per_cpu(completion_queues, get_cpu());
-
- cmd->ll_list.next = NULL;
- if (llist_add(&cmd->ll_list, &cq->list)) {
- ktime_t kt = ktime_set(0, completion_nsec);
+ ktime_t kt = ktime_set(0, completion_nsec);

- hrtimer_start(&cq->timer, kt, HRTIMER_MODE_REL_PINNED);
- }
-
- put_cpu();
+ hrtimer_start(&cmd->timer, kt, HRTIMER_MODE_REL);
}

static void null_softirq_done_fn(struct request *rq)
@@ -376,6 +354,10 @@ static int null_queue_rq(struct blk_mq_hw_ctx *hctx,
{
struct nullb_cmd *cmd = blk_mq_rq_to_pdu(bd->rq);

+ if (irqmode == NULL_IRQ_TIMER) {
+ hrtimer_init(&cmd->timer, CLOCK_MONOTONIC, HRTIMER_MODE_REL);
+ cmd->timer.function = null_cmd_timer_expired;
+ }
cmd->rq = bd->rq;
cmd->nq = hctx->driver_data;

@@ -813,19 +795,6 @@ static int __init null_init(void)

mutex_init(&lock);

- /* Initialize a separate list for each CPU for issuing softirqs */
- for_each_possible_cpu(i) {
- struct completion_queue *cq = &per_cpu(completion_queues, i);
-
- init_llist_head(&cq->list);
-
- if (irqmode != NULL_IRQ_TIMER)
- continue;
-
- hrtimer_init(&cq->timer, CLOCK_MONOTONIC, HRTIMER_MODE_REL);
- cq->timer.function = null_cmd_timer_expired;
- }
-
null_major = register_blkdev(0, "nullb");
if (null_major < 0)
return null_major;
--
1.9.1

2015-12-01 10:48:58

by Paolo Valente

[permalink] [raw]
Subject: [PATCH BUGFIX V2 2/3] null_blk: guarantee device restart in all irq modes

From: Arianna Avanzini <[email protected]>

In single-queue (block layer) mode, the function null_rq_prep_fn stops
the device if alloc_cmd fails. Then, once stopped, the device must be
restarted on the next command completion, so that the request(s) for
which alloc_cmd failed can be requeued. Otherwise the device hangs.

Unfortunately, device restart is currently performed only for delayed
completions, i.e., in irqmode==2. This causes hangs, for the above
reasons, with the other irqmodes when they are used in combination
with the single-queue block layer.

This commit addresses this issue by making sure that, if stopped, the
device is properly restarted for all irqmodes on completions.

Signed-off-by: Paolo Valente <[email protected]>
Signed-off-by: Arianna Avanzini <[email protected]>
---
Changes V1->V2
- reinstated mq_ops check

drivers/block/null_blk.c | 27 +++++++++++++++------------
1 file changed, 15 insertions(+), 12 deletions(-)

diff --git a/drivers/block/null_blk.c b/drivers/block/null_blk.c
index 08932f5..cf65619 100644
--- a/drivers/block/null_blk.c
+++ b/drivers/block/null_blk.c
@@ -217,6 +217,8 @@ static struct nullb_cmd *alloc_cmd(struct nullb_queue *nq, int can_wait)

static void end_cmd(struct nullb_cmd *cmd)
{
+ struct request_queue *q = NULL;
+
switch (queue_mode) {
case NULL_Q_MQ:
blk_mq_end_request(cmd->rq, 0);
@@ -227,27 +229,28 @@ static void end_cmd(struct nullb_cmd *cmd)
break;
case NULL_Q_BIO:
bio_endio(cmd->bio);
- break;
+ goto free_cmd;
}

- free_cmd(cmd);
-}
-
-static enum hrtimer_restart null_cmd_timer_expired(struct hrtimer *timer)
-{
- struct nullb_cmd *cmd = container_of(timer, struct nullb_cmd, timer);
- struct request_queue *q = NULL;
-
if (cmd->rq)
q = cmd->rq->q;

+ /* Restart queue if needed, as we are freeing a tag */
if (q && !q->mq_ops && blk_queue_stopped(q)) {
- spin_lock(q->queue_lock);
+ unsigned long flags;
+
+ spin_lock_irqsave(q->queue_lock, flags);
if (blk_queue_stopped(q))
blk_start_queue(q);
- spin_unlock(q->queue_lock);
+ spin_unlock_irqrestore(q->queue_lock, flags);
}
- end_cmd(cmd);
+free_cmd:
+ free_cmd(cmd);
+}
+
+static enum hrtimer_restart null_cmd_timer_expired(struct hrtimer *timer)
+{
+ end_cmd(container_of(timer, struct nullb_cmd, timer));

return HRTIMER_NORESTART;
}
--
1.9.1

2015-12-01 10:49:18

by Paolo Valente

[permalink] [raw]
Subject: [PATCH BUGFIX V2 3/3] null_blk: change type of completion_nsec to unsigned long

From: Arianna Avanzini <[email protected]>

This commit at least doubles the maximum value for
completion_nsec. This helps in special cases where one wants/needs to
emulate an extremely slow I/O (for example to spot bugs).

Signed-off-by: Paolo Valente <[email protected]>
Signed-off-by: Arianna Avanzini <[email protected]>
---
drivers/block/null_blk.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/block/null_blk.c b/drivers/block/null_blk.c
index cf65619..0c3940e 100644
--- a/drivers/block/null_blk.c
+++ b/drivers/block/null_blk.c
@@ -132,8 +132,8 @@ static const struct kernel_param_ops null_irqmode_param_ops = {
device_param_cb(irqmode, &null_irqmode_param_ops, &irqmode, S_IRUGO);
MODULE_PARM_DESC(irqmode, "IRQ completion handler. 0-none, 1-softirq, 2-timer");

-static int completion_nsec = 10000;
-module_param(completion_nsec, int, S_IRUGO);
+static unsigned long completion_nsec = 10000;
+module_param(completion_nsec, ulong, S_IRUGO);
MODULE_PARM_DESC(completion_nsec, "Time in ns to complete a request in hardware. Default: 10,000ns");

static int hw_queue_depth = 64;
--
1.9.1

2015-12-01 17:52:52

by Jens Axboe

[permalink] [raw]
Subject: Re: [PATCH BUGFIX V2 0/3] null_blk: fix throughput losses and hangs

On 12/01/2015 03:48 AM, Paolo Valente wrote:
> Hi,
> here is an updated version of the patchset, differing from the
> previous version only in that it reinstates the missing extra check
> pointed out in [2]. For your convenience, the content of the cover
> letter for the previous version follows.
>
> While doing some tests with the null_blk device driver, we bumped into
> two problems: first, unjustified and in some cases high throughput
> losses; second, actual hangs. These problems seem to be the
> consequence of the combination of three causes, and this patchset
> introduces a fix for each of these causes. In particular, changes
> address:
> . an apparent flaw in the logic with which delayed completions are
> implemented: this flaw causes, with unlucky but non-pathological
> workloads, actual request-completion delays to become arbitrarily
> larger than the configured delay;
> . the missing restart of the device queue on the completion of a request in
> single-queue non-delayed mode;
> . the overflow of the request-delay parameter, when extremely high values
> are used (e.g., to spot bugs).
>
> To avoid possible confusion, we stress that these fixes *do not* have
> anything to do with the problems highlighted in [1] (tests of the
> multiqueue xen-blkfront and xen-blkback modules with null_blk).
>
> You can find more details in the patch descriptions.

Thanks Paolo, added.

--
Jens Axboe