Date: Wed, 6 Sep 2017 10:48:41 +0200
From: Peter Zijlstra <peterz@infradead.org>
To: Andrea Parri <parri.andrea@gmail.com>
Cc: Jens Axboe <axboe@fb.com>, linux-kernel@vger.kernel.org,
        stern@rowland.harvard.edu, will.deacon@arm.com, tom.leiming@gmail.com,
        boqun.feng@gmail.com, paulmck@linux.vnet.ibm.com, hch@lst.de,
        bart.vanassche@wdc.com
Subject: Re: [PATCH] blk-mq: Start to fix memory ordering...
Message-ID: <20170906084841.upozocusxerkpqdp@hirez.programming.kicks-ass.net>
References: <20170904090932.ngfefw2clot3sbqi@hirez.programming.kicks-ass.net>
 <20170906071304.GA3519@andrea>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <20170906071304.GA3519@andrea>
User-Agent: NeoMutt/20170609 (1.8.3)
Sender: linux-kernel-owner@vger.kernel.org
Content-Length: 2611
Lines: 62

On Wed, Sep 06, 2017 at 09:13:04AM +0200, Andrea Parri wrote:
> > +	smp_mb__before_atomic();
> 
> I am wondering whether we should be using smp_wmb() instead: this would
> provide the above guarantee and save a full barrier on powerpc/arm64.

Right, did that.

> > +	set_bit(REQ_ATOM_STARTED, &rq->atomic_flags);
> > +	if (test_bit(REQ_ATOM_COMPLETE, &rq->atomic_flags)) {
> > +		/*
> > +		 * Coherence order guarantees these consequtive stores to a
> > +		 * singe variable propagate in the specified order. Thus the
> > +		 * clear_bit() is ordered _after_ the set bit. See
> > +		 * blk_mq_check_expired().
> > +		 */
> >  		clear_bit(REQ_ATOM_COMPLETE, &rq->atomic_flags);
> 
> It could be useful to stress that set_bit(), clear_bit()  must "act" on
> the same subword of the unsigned long (whatever "subword" means at this
> level...) to rely on the coherence order (c.f., alpha's implementation).

As I wrote to your initial reply (which I now saw was private) I went
through the architectures again and found h8300 to use byte ops to
implement all the bitops.

So subword here means byte :/

The last time we looked at this was for PG_waiter and back then I think
we settled on u32 (with Alpha for example using 32bit ll/sc ops). Linus
moved PG_waiters to the same byte though, so that is all fine.

  b91e1302ad9b ("mm: optimize PageWaiters bit use for unlock_page()")

> > +	if (time_after_eq(jiffies, deadline)) {
> > +		if (!blk_mark_rq_complete(rq)) {
> > +			/*
> > +			 * Relies on the implied MB from test_and_clear() to
> > +			 * order the COMPLETE load against the STARTED load.
> > +			 * Orders against the coherence order in
> > +			 * blk_mq_start_request().
> 
> I understand "from test_and_set_bit()" (in blk_mark_rq_complete()) and
> that the interested cycle is:
> 
>    /* in blk_mq_start_request() */
>    [STORE STARTED bit = 1 into atomic_flags]
>       -->co [STORE COMPLETE bit = 0 into atomic_flags]
>          /* in blk_mq_check_expired() */
>          -->rf [LOAD COMPLETE bit = 0 from atomic_flags]
>             -->po-loc [LOAD STARTED bit = 0 from atomic_flags]
>                /* in blk_mq_start_request() again */
>                -->fr [STORE STARTED bit = 1 into atomic_flags]
> 
>    (N.B. Assume all accesses happen to/from the same subword.)
> 
> This cycle being forbidden by the "coherence check", I'd say we do not
> need to rely on the MB mentioned by the comment; what am I missing?

Nothing, I forgot about the read-after-read thing but did spot the MB.
Either one suffices to guarantee the order we need. It just needs to be
documented as being relied upon.