Date: Wed, 6 Sep 2017 09:13:04 +0200
From: Andrea Parri
To: Peter Zijlstra
Cc: Jens Axboe, linux-kernel@vger.kernel.org, stern@rowland.harvard.edu,
    will.deacon@arm.com, tom.leiming@gmail.com, boqun.feng@gmail.com,
    paulmck@linux.vnet.ibm.com, hch@lst.de, bart.vanassche@wdc.com
Subject: Re: [PATCH] blk-mq: Start to fix memory ordering...
Message-ID: <20170906071304.GA3519@andrea>
References: <20170904090932.ngfefw2clot3sbqi@hirez.programming.kicks-ass.net>
In-Reply-To: <20170904090932.ngfefw2clot3sbqi@hirez.programming.kicks-ass.net>

On Mon, Sep 04, 2017 at 11:09:32AM +0200, Peter Zijlstra wrote:
> 
> Attempt to untangle the ordering in blk-mq. The patch introducing the
> single smp_mb__before_atomic() is obviously broken in that it doesn't
> clearly specify a pairing barrier and an obtained guarantee.
> 
> The comment is further misleading in that it hints that the
> deadline store and the COMPLETE store also need to be ordered, but
> AFAICT there is no such dependency. However what does appear to be
> important is the clear happening _after_ the store, and that worked by
> pure accident.
> 
> This clarifies blk_mq_start_request() -- we should not get there with
> STARTING set -- this simplifies the code and makes the barrier usage
> sane (the old code could be read to allow not having _any_ atomic after
> the barrier, in which case the barrier hasn't got anything to order). We
> then also introduce the missing pairing barrier for it.
> 
> And it documents the STARTING vs COMPLETE ordering. Although I've not
> been entirely successful in reverse engineering the blk-mq state
> machine so there might still be more funnies around timeout vs
> requeue.
> 
> If I got anything wrong, feel free to educate me by adding comments to
> clarify things ;-)
> 
> Cc: Alan Stern
> Cc: Will Deacon
> Cc: Ming Lei
> Cc: Jens Axboe
> Cc: Andrea Parri
> Cc: Boqun Feng
> Cc: "Paul E. McKenney"
> Cc: Christoph Hellwig
> Cc: Bart Van Assche
> Fixes: 538b75341835 ("blk-mq: request deadline must be visible before marking rq as started")
> Signed-off-by: Peter Zijlstra (Intel)
> ---
>  block/blk-mq.c | 48 +++++++++++++++++++++++++++++++++++++-----------
>  1 file changed, 37 insertions(+), 11 deletions(-)
> 
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -558,22 +558,29 @@ void blk_mq_start_request(struct request
> 
>  	blk_add_timer(rq);
> 
> -	/*
> -	 * Ensure that ->deadline is visible before set the started
> -	 * flag and clear the completed flag.
> -	 */
> -	smp_mb__before_atomic();
> +	WARN_ON_ONCE(test_bit(REQ_ATOM_STARTED, &rq->atomic_flags));
> 
>  	/*
>  	 * Mark us as started and clear complete. Complete might have been
>  	 * set if requeue raced with timeout, which then marked it as
>  	 * complete. So be sure to clear complete again when we start
>  	 * the request, otherwise we'll ignore the completion event.
> +	 *
> +	 * Ensure that ->deadline is visible before set STARTED, such that
> +	 * blk_mq_check_expired() is guaranteed to observe our ->deadline
> +	 * when it observes STARTED.
>  	 */
> -	if (!test_bit(REQ_ATOM_STARTED, &rq->atomic_flags))
> -		set_bit(REQ_ATOM_STARTED, &rq->atomic_flags);
> -	if (test_bit(REQ_ATOM_COMPLETE, &rq->atomic_flags))
> +	smp_mb__before_atomic();

I am wondering whether we should be using smp_wmb() instead: this would
still provide the above guarantee, while saving a full barrier on
powerpc/arm64.
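To make the intended pairing a bit more explicit, here is a litmus-test
sketch of the pattern I have in mind (the names and the plain
WRITE_ONCE()/READ_ONCE() accesses are mine: they stand for the ->deadline
store done under blk_add_timer(), the set_bit()/test_bit() of STARTED,
and the READ_ONCE() of ->deadline added below); with the smp_wmb()/
smp_rmb() pairing, the "exists" clause should be forbidden:

  C blk-mq-deadline-started-sketch

  {}

  P0(int *deadline, int *started)
  {
  	/* blk_add_timer(rq): publish the new deadline */
  	WRITE_ONCE(*deadline, 1);
  	/* proposed in place of smp_mb__before_atomic() */
  	smp_wmb();
  	/* set_bit(REQ_ATOM_STARTED, &rq->atomic_flags) */
  	WRITE_ONCE(*started, 1);
  }

  P1(int *deadline, int *started)
  {
  	int r0;
  	int r1;

  	/* test_bit(REQ_ATOM_STARTED, &rq->atomic_flags) */
  	r0 = READ_ONCE(*started);
  	smp_rmb();
  	/* READ_ONCE(rq->deadline) */
  	r1 = READ_ONCE(*deadline);
  }

  exists (1:r0=1 /\ 1:r1=0)

That is, if the timeout path observes STARTED set, it must not observe a
stale ->deadline.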
> +	set_bit(REQ_ATOM_STARTED, &rq->atomic_flags);
> +	if (test_bit(REQ_ATOM_COMPLETE, &rq->atomic_flags)) {
> +		/*
> +		 * Coherence order guarantees these consecutive stores to a
> +		 * single variable propagate in the specified order. Thus the
> +		 * clear_bit() is ordered _after_ the set bit. See
> +		 * blk_mq_check_expired().
> +		 */
>  		clear_bit(REQ_ATOM_COMPLETE, &rq->atomic_flags);

It could be useful to stress that set_bit() and clear_bit() must "act" on
the same subword of the unsigned long (whatever "subword" means at this
level...) in order to rely on the coherence order (cf. Alpha's
implementation).

> +	}
> 
>  	if (q->dma_drain_size && blk_rq_bytes(rq)) {
>  		/*
> @@ -744,11 +751,20 @@ static void blk_mq_check_expired(struct
>  		struct request *rq, void *priv, bool reserved)
>  {
>  	struct blk_mq_timeout_data *data = priv;
> +	unsigned long deadline;
> 
>  	if (!test_bit(REQ_ATOM_STARTED, &rq->atomic_flags))
>  		return;
> 
>  	/*
> +	 * Ensures that if we see STARTED we must also see our
> +	 * up-to-date deadline, see blk_mq_start_request().
> +	 */
> +	smp_rmb();
> +
> +	deadline = READ_ONCE(rq->deadline);
> +
> +	/*
>  	 * The rq being checked may have been freed and reallocated
>  	 * out already here, we avoid this race by checking rq->deadline
>  	 * and REQ_ATOM_COMPLETE flag together:
> @@ -761,10 +777,20 @@ static void blk_mq_check_expired(struct
>  	 * and clearing the flag in blk_mq_start_request(), so
>  	 * this rq won't be timed out too.
>  	 */
> -	if (time_after_eq(jiffies, rq->deadline)) {
> -		if (!blk_mark_rq_complete(rq))
> +	if (time_after_eq(jiffies, deadline)) {
> +		if (!blk_mark_rq_complete(rq)) {
> +			/*
> +			 * Relies on the implied MB from test_and_clear() to
> +			 * order the COMPLETE load against the STARTED load.
> +			 * Orders against the coherence order in
> +			 * blk_mq_start_request().

I understand this as "from test_and_set_bit()" (in blk_mark_rq_complete()),
and that the cycle of interest is:

   /* in blk_mq_start_request() */
   [STORE STARTED bit = 1 into atomic_flags]
      -->co      [STORE COMPLETE bit = 0 into atomic_flags]

   /* in blk_mq_check_expired() */
      -->rf      [LOAD COMPLETE bit = 0 from atomic_flags]
      -->po-loc  [LOAD STARTED bit = 0 from atomic_flags]

   /* in blk_mq_start_request() again */
      -->fr      [STORE STARTED bit = 1 into atomic_flags]

(N.B. Assume all accesses happen to/from the same subword.)

Since this cycle is forbidden by the "coherence check", I'd say that we do
not need to rely on the MB mentioned by the comment; what am I missing?

  Andrea

> +			 *
> +			 * This ensures that if we see !COMPLETE we must see
> +			 * STARTED and ignore this timeout.
> +			 */
>  			blk_mq_rq_timed_out(rq, reserved);
> -	} else if (!data->next_set || time_after(data->next, rq->deadline)) {
> +		}
> +	} else if (!data->next_set || time_after(data->next, deadline)) {
>  		data->next = rq->deadline;
>  		data->next_set = 1;
>  	}
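P.S. For completeness, here is a litmus-test sketch of the cycle above
(the encoding is mine: "flags" stands for rq->atomic_flags, with STARTED
as bit 0 and COMPLETE as bit 1, both assumed to live in the same word;
plain WRITE_ONCE()/READ_ONCE() accesses stand for the bit operations, and
the initial value 2 corresponds to COMPLETE having been set by the earlier
timeout). The "exists" clause should be forbidden by read-read coherence
alone, with no barrier required:

  C blk-mq-coherence-sketch

  {
  	flags=2;
  }

  P0(int *flags)
  {
  	/* set_bit(REQ_ATOM_STARTED, &rq->atomic_flags) */
  	WRITE_ONCE(*flags, 3);
  	/* clear_bit(REQ_ATOM_COMPLETE, &rq->atomic_flags) */
  	WRITE_ONCE(*flags, 1);
  }

  P1(int *flags)
  {
  	int r0;
  	int r1;

  	/* observes COMPLETE clear ... */
  	r0 = READ_ONCE(*flags);
  	/* ... and then STARTED clear? */
  	r1 = READ_ONCE(*flags);
  }

  exists (1:r0=1 /\ 1:r1=2)

If the two bits lived in different words, on the other hand, the analogous
outcome would be allowed in the absence of additional ordering; hence the
remark about the "subword" above.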