2017-09-04 09:09:50

by Peter Zijlstra

[permalink] [raw]
Subject: [PATCH] blk-mq: Start to fix memory ordering...


Attempt to untangle the ordering in blk-mq. The patch introducing the
single smp_mb__before_atomic() is obviously broken in that it doesn't
clearly specify a pairing barrier and an obtained guarantee.

The comment is further misleading in that it hints that the
deadline store and the COMPLETE store also need to be ordered, but
AFAICT there is no such dependency. However what does appear to be
important is the clear happening _after_ the store, and that worked by
pure accident.

This clarifies blk_mq_start_request() -- we should not get there with
STARTING set -- this simplifies the code and makes the barrier usage
sane (the old code could be read to allow not having _any_ atomic after
the barrier, in which case the barrier hasn't got anything to order). We
then also introduce the missing pairing barrier for it.

And it documents the STARTING vs COMPLETE ordering. Although I've not
been entirely successful in reverse engineering the blk-mq state
machine so there might still be more funnies around timeout vs
requeue.

If I got anything wrong, feel free to educate me by adding comments to
clarify things ;-)

Cc: Alan Stern <[email protected]>
Cc: Will Deacon <[email protected]>
Cc: Ming Lei <[email protected]>
Cc: Jens Axboe <[email protected]>
Cc: Andrea Parri <[email protected]>
Cc: Boqun Feng <[email protected]>
Cc: "Paul E. McKenney" <[email protected]>
Cc: Christoph Hellwig <[email protected]>
Cc: Bart Van Assche <[email protected]>
Fixes: 538b75341835 ("blk-mq: request deadline must be visible before marking rq as started")
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
---
block/blk-mq.c | 48 +++++++++++++++++++++++++++++++++++++-----------
1 file changed, 37 insertions(+), 11 deletions(-)

--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -558,22 +558,29 @@ void blk_mq_start_request(struct request

blk_add_timer(rq);

- /*
- * Ensure that ->deadline is visible before set the started
- * flag and clear the completed flag.
- */
- smp_mb__before_atomic();
+ WARN_ON_ONCE(test_bit(REQ_ATOM_STARTED, &rq->atomic_flags));

/*
* Mark us as started and clear complete. Complete might have been
* set if requeue raced with timeout, which then marked it as
* complete. So be sure to clear complete again when we start
* the request, otherwise we'll ignore the completion event.
+ *
+ * Ensure that ->deadline is visible before set STARTED, such that
+ * blk_mq_check_expired() is guaranteed to observe our ->deadline
+ * when it observes STARTED.
*/
- if (!test_bit(REQ_ATOM_STARTED, &rq->atomic_flags))
- set_bit(REQ_ATOM_STARTED, &rq->atomic_flags);
- if (test_bit(REQ_ATOM_COMPLETE, &rq->atomic_flags))
+ smp_mb__before_atomic();
+ set_bit(REQ_ATOM_STARTED, &rq->atomic_flags);
+ if (test_bit(REQ_ATOM_COMPLETE, &rq->atomic_flags)) {
+ /*
+ * Coherence order guarantees these consequtive stores to a
+ * singe variable propagate in the specified order. Thus the
+ * clear_bit() is ordered _after_ the set bit. See
+ * blk_mq_check_expired().
+ */
clear_bit(REQ_ATOM_COMPLETE, &rq->atomic_flags);
+ }

if (q->dma_drain_size && blk_rq_bytes(rq)) {
/*
@@ -744,11 +751,20 @@ static void blk_mq_check_expired(struct
struct request *rq, void *priv, bool reserved)
{
struct blk_mq_timeout_data *data = priv;
+ unsigned long deadline;

if (!test_bit(REQ_ATOM_STARTED, &rq->atomic_flags))
return;

/*
+ * Ensures that if we see STARTED we must also see our
+ * up-to-date deadline, see blk_mq_start_request().
+ */
+ smp_rmb();
+
+ deadline = READ_ONCE(rq->deaedline);
+
+ /*
* The rq being checked may have been freed and reallocated
* out already here, we avoid this race by checking rq->deadline
* and REQ_ATOM_COMPLETE flag together:
@@ -761,10 +777,20 @@ static void blk_mq_check_expired(struct
* and clearing the flag in blk_mq_start_request(), so
* this rq won't be timed out too.
*/
- if (time_after_eq(jiffies, rq->deadline)) {
- if (!blk_mark_rq_complete(rq))
+ if (time_after_eq(jiffies, deadline)) {
+ if (!blk_mark_rq_complete(rq)) {
+ /*
+ * Relies on the implied MB from test_and_clear() to
+ * order the COMPLETE load against the STARTED load.
+ * Orders against the coherence order in
+ * blk_mq_start_request().
+ *
+ * This ensures that if we see !COMPLETE we must see
+ * STARTED and ignore this timeout.
+ */
blk_mq_rq_timed_out(rq, reserved);
- } else if (!data->next_set || time_after(data->next, rq->deadline)) {
+ }
+ } else if (!data->next_set || time_after(data->next, deadline)) {
data->next = rq->deadline;
data->next_set = 1;
}


2017-09-05 14:52:16

by Bart Van Assche

[permalink] [raw]
Subject: Re: [PATCH] blk-mq: Start to fix memory ordering...

On Mon, 2017-09-04 at 11:09 +0200, Peter Zijlstra wrote:
> /*
> * Mark us as started and clear complete. Complete might have been
> * set if requeue raced with timeout, which then marked it as
> * complete. So be sure to clear complete again when we start
> * the request, otherwise we'll ignore the completion event.
> + *
> + * Ensure that ->deadline is visible before set STARTED, such that

It seems like there is something wrong with the grammar in the above
sentence? Did you perhaps mean "before we set STARTED"?

> + * blk_mq_check_expired() is guaranteed to observe our ->deadline
> + * when it observes STARTED.
> */
> - if (!test_bit(REQ_ATOM_STARTED, &rq->atomic_flags))
> - set_bit(REQ_ATOM_STARTED, &rq->atomic_flags);
> - if (test_bit(REQ_ATOM_COMPLETE, &rq->atomic_flags))
> + smp_mb__before_atomic();
> + set_bit(REQ_ATOM_STARTED, &rq->atomic_flags);
> + if (test_bit(REQ_ATOM_COMPLETE, &rq->atomic_flags)) {
> + /*
> + * Coherence order guarantees these consequtive stores to a
> + * singe variable propagate in the specified order. Thus the
> + * clear_bit() is ordered _after_ the set bit. See
> + * blk_mq_check_expired().
> + */
> clear_bit(REQ_ATOM_COMPLETE, &rq->atomic_flags);
> + }

Is this new comment really useful? If you want to keep it please spell
"consecutive" correctly.

> if (q->dma_drain_size && blk_rq_bytes(rq)) {
> /*
> @@ -744,11 +751,20 @@ static void blk_mq_check_expired(struct
> struct request *rq, void *priv, bool reserved)
> {
> struct blk_mq_timeout_data *data = priv;
> + unsigned long deadline;
>
> if (!test_bit(REQ_ATOM_STARTED, &rq->atomic_flags))
> return;
>
> /*
> + * Ensures that if we see STARTED we must also see our
> + * up-to-date deadline, see blk_mq_start_request().
> + */
> + smp_rmb();
> +
> + deadline = READ_ONCE(rq->deaedline);

"deaedline" is a spelling error. Has this patch been tested?

> + /*
> * The rq being checked may have been freed and reallocated
> * out already here, we avoid this race by checking rq->deadline
> * and REQ_ATOM_COMPLETE flag together:
> @@ -761,10 +777,20 @@ static void blk_mq_check_expired(struct
> * and clearing the flag in blk_mq_start_request(), so
> * this rq won't be timed out too.
> */
> - if (time_after_eq(jiffies, rq->deadline)) {
> - if (!blk_mark_rq_complete(rq))
> + if (time_after_eq(jiffies, deadline)) {
> + if (!blk_mark_rq_complete(rq)) {
> + /*
> + * Relies on the implied MB from test_and_clear() to
> + * order the COMPLETE load against the STARTED load.
> + * Orders against the coherence order in
> + * blk_mq_start_request().
> + *
> + * This ensures that if we see !COMPLETE we must see
> + * STARTED and ignore this timeout.
> + */
> blk_mq_rq_timed_out(rq, reserved);
> - } else if (!data->next_set || time_after(data->next, rq->deadline)) {
> + }
> + } else if (!data->next_set || time_after(data->next, deadline)) {
> data->next = rq->deadline;
> data->next_set = 1;
> }

Apparently a READ_ONCE(rq->deadline) statement has been added but not all
rq->deadline reads have been changed into reads of the local variable
"deadline"? Was that really your intention?

Bart.

2017-09-06 07:13:18

by Andrea Parri

[permalink] [raw]
Subject: Re: [PATCH] blk-mq: Start to fix memory ordering...

On Mon, Sep 04, 2017 at 11:09:32AM +0200, Peter Zijlstra wrote:
>
> Attempt to untangle the ordering in blk-mq. The patch introducing the
> single smp_mb__before_atomic() is obviously broken in that it doesn't
> clearly specify a pairing barrier and an obtained guarantee.
>
> The comment is further misleading in that it hints that the
> deadline store and the COMPLETE store also need to be ordered, but
> AFAICT there is no such dependency. However what does appear to be
> important is the clear happening _after_ the store, and that worked by
> pure accident.
>
> This clarifies blk_mq_start_request() -- we should not get there with
> STARTING set -- this simplifies the code and makes the barrier usage
> sane (the old code could be read to allow not having _any_ atomic after
> the barrier, in which case the barrier hasn't got anything to order). We
> then also introduce the missing pairing barrier for it.
>
> And it documents the STARTING vs COMPLETE ordering. Although I've not
> been entirely successful in reverse engineering the blk-mq state
> machine so there might still be more funnies around timeout vs
> requeue.
>
> If I got anything wrong, feel free to educate me by adding comments to
> clarify things ;-)
>
> Cc: Alan Stern <[email protected]>
> Cc: Will Deacon <[email protected]>
> Cc: Ming Lei <[email protected]>
> Cc: Jens Axboe <[email protected]>
> Cc: Andrea Parri <[email protected]>
> Cc: Boqun Feng <[email protected]>
> Cc: "Paul E. McKenney" <[email protected]>
> Cc: Christoph Hellwig <[email protected]>
> Cc: Bart Van Assche <[email protected]>
> Fixes: 538b75341835 ("blk-mq: request deadline must be visible before marking rq as started")
> Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
> ---
> block/blk-mq.c | 48 +++++++++++++++++++++++++++++++++++++-----------
> 1 file changed, 37 insertions(+), 11 deletions(-)
>
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -558,22 +558,29 @@ void blk_mq_start_request(struct request
>
> blk_add_timer(rq);
>
> - /*
> - * Ensure that ->deadline is visible before set the started
> - * flag and clear the completed flag.
> - */
> - smp_mb__before_atomic();
> + WARN_ON_ONCE(test_bit(REQ_ATOM_STARTED, &rq->atomic_flags));
>
> /*
> * Mark us as started and clear complete. Complete might have been
> * set if requeue raced with timeout, which then marked it as
> * complete. So be sure to clear complete again when we start
> * the request, otherwise we'll ignore the completion event.
> + *
> + * Ensure that ->deadline is visible before set STARTED, such that
> + * blk_mq_check_expired() is guaranteed to observe our ->deadline
> + * when it observes STARTED.
> */
> - if (!test_bit(REQ_ATOM_STARTED, &rq->atomic_flags))
> - set_bit(REQ_ATOM_STARTED, &rq->atomic_flags);
> - if (test_bit(REQ_ATOM_COMPLETE, &rq->atomic_flags))
> + smp_mb__before_atomic();

I am wondering whether we should be using smp_wmb() instead: this would
provide the above guarantee and save a full barrier on powerpc/arm64.


> + set_bit(REQ_ATOM_STARTED, &rq->atomic_flags);
> + if (test_bit(REQ_ATOM_COMPLETE, &rq->atomic_flags)) {
> + /*
> + * Coherence order guarantees these consequtive stores to a
> + * singe variable propagate in the specified order. Thus the
> + * clear_bit() is ordered _after_ the set bit. See
> + * blk_mq_check_expired().
> + */
> clear_bit(REQ_ATOM_COMPLETE, &rq->atomic_flags);

It could be useful to stress that set_bit(), clear_bit() must "act" on
the same subword of the unsigned long (whatever "subword" means at this
level...) to rely on the coherence order (c.f., alpha's implementation).


> + }
>
> if (q->dma_drain_size && blk_rq_bytes(rq)) {
> /*
> @@ -744,11 +751,20 @@ static void blk_mq_check_expired(struct
> struct request *rq, void *priv, bool reserved)
> {
> struct blk_mq_timeout_data *data = priv;
> + unsigned long deadline;
>
> if (!test_bit(REQ_ATOM_STARTED, &rq->atomic_flags))
> return;
>
> /*
> + * Ensures that if we see STARTED we must also see our
> + * up-to-date deadline, see blk_mq_start_request().
> + */
> + smp_rmb();
> +
> + deadline = READ_ONCE(rq->deaedline);
> +
> + /*
> * The rq being checked may have been freed and reallocated
> * out already here, we avoid this race by checking rq->deadline
> * and REQ_ATOM_COMPLETE flag together:
> @@ -761,10 +777,20 @@ static void blk_mq_check_expired(struct
> * and clearing the flag in blk_mq_start_request(), so
> * this rq won't be timed out too.
> */
> - if (time_after_eq(jiffies, rq->deadline)) {
> - if (!blk_mark_rq_complete(rq))
> + if (time_after_eq(jiffies, deadline)) {
> + if (!blk_mark_rq_complete(rq)) {
> + /*
> + * Relies on the implied MB from test_and_clear() to
> + * order the COMPLETE load against the STARTED load.
> + * Orders against the coherence order in
> + * blk_mq_start_request().

I understand "from test_and_set_bit()" (in blk_mark_rq_complete()) and
that the interested cycle is:

/* in blk_mq_start_request() */
[STORE STARTED bit = 1 into atomic_flags]
-->co [STORE COMPLETE bit = 0 into atomic_flags]
/* in blk_mq_check_expired() */
-->rf [LOAD COMPLETE bit = 0 from atomic_flags]
-->po-loc [LOAD STARTED bit = 0 from atomic_flags]
/* in blk_mq_start_request() again */
-->fr [STORE STARTED bit = 1 into atomic_flags]

(N.B. Assume all accesses happen to/from the same subword.)

This cycle being forbidden by the "coherence check", I'd say we do not
need to rely on the MB mentioned by the comment; what am I missing?

Andrea


> + *
> + * This ensures that if we see !COMPLETE we must see
> + * STARTED and ignore this timeout.
> + */
> blk_mq_rq_timed_out(rq, reserved);
> - } else if (!data->next_set || time_after(data->next, rq->deadline)) {
> + }
> + } else if (!data->next_set || time_after(data->next, deadline)) {
> data->next = rq->deadline;
> data->next_set = 1;
> }

2017-09-06 07:22:37

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH] blk-mq: Start to fix memory ordering...

On Tue, Sep 05, 2017 at 02:51:25PM +0000, Bart Van Assche wrote:
> "deaedline" is a spelling error. Has this patch been tested?

Bah.. So I ran with a previous version for a while, but then redid the
whole patch (as its mostly comments anyway) and clearly made a giant
mess of it.

I'll respin.

2017-09-06 08:48:54

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH] blk-mq: Start to fix memory ordering...

On Wed, Sep 06, 2017 at 09:13:04AM +0200, Andrea Parri wrote:
> > + smp_mb__before_atomic();
>
> I am wondering whether we should be using smp_wmb() instead: this would
> provide the above guarantee and save a full barrier on powerpc/arm64.

Right, did that.

> > + set_bit(REQ_ATOM_STARTED, &rq->atomic_flags);
> > + if (test_bit(REQ_ATOM_COMPLETE, &rq->atomic_flags)) {
> > + /*
> > + * Coherence order guarantees these consequtive stores to a
> > + * singe variable propagate in the specified order. Thus the
> > + * clear_bit() is ordered _after_ the set bit. See
> > + * blk_mq_check_expired().
> > + */
> > clear_bit(REQ_ATOM_COMPLETE, &rq->atomic_flags);
>
> It could be useful to stress that set_bit(), clear_bit() must "act" on
> the same subword of the unsigned long (whatever "subword" means at this
> level...) to rely on the coherence order (c.f., alpha's implementation).

As I wrote to your initial reply (which I now saw was private) I went
through the architectures again and found h8300 to use byte ops to
implement all the bitops.

So subword here means byte :/

The last time we looked at this was for PG_waiter and back then I think
we settled on u32 (with Alpha for example using 32bit ll/sc ops). Linus
moved PG_waiters to the same byte though, so that is all fine.

b91e1302ad9b ("mm: optimize PageWaiters bit use for unlock_page()")

> > + if (time_after_eq(jiffies, deadline)) {
> > + if (!blk_mark_rq_complete(rq)) {
> > + /*
> > + * Relies on the implied MB from test_and_clear() to
> > + * order the COMPLETE load against the STARTED load.
> > + * Orders against the coherence order in
> > + * blk_mq_start_request().
>
> I understand "from test_and_set_bit()" (in blk_mark_rq_complete()) and
> that the interested cycle is:
>
> /* in blk_mq_start_request() */
> [STORE STARTED bit = 1 into atomic_flags]
> -->co [STORE COMPLETE bit = 0 into atomic_flags]
> /* in blk_mq_check_expired() */
> -->rf [LOAD COMPLETE bit = 0 from atomic_flags]
> -->po-loc [LOAD STARTED bit = 0 from atomic_flags]
> /* in blk_mq_start_request() again */
> -->fr [STORE STARTED bit = 1 into atomic_flags]
>
> (N.B. Assume all accesses happen to/from the same subword.)
>
> This cycle being forbidden by the "coherence check", I'd say we do not
> need to rely on the MB mentioned by the comment; what am I missing?

Nothing, I forgot about the read-after-read thing but did spot the MB.
Either one suffices to guarantee the order we need. It just needs to be
documented as being relied upon.