Date: Thu, 12 Dec 2019 14:04:21 +0100
From: Peter Zijlstra
To: David Laight
Cc: "Luck, Tony", Andy Lutomirski, "Yu, Fenghua", Ingo Molnar,
    Thomas Gleixner, Borislav Petkov, H Peter Anvin, "Raj, Ashok",
    "Shankar, Ravi V", linux-kernel, x86, Will Deacon
Subject: Re: [PATCH v10 6/6] x86/split_lock: Enable split lock detection by kernel parameter

On Thu, Dec 12, 2019 at 10:36:27AM +0000, David Laight wrote:

> On x86 'xchg' is always 'locked' regardless of whether there is a 'lock' prefix.

Sure, irrelevant here though.

> set_bit() (etc) include the 'lock' prefix (dunno why this decision was made...).

Because it is the atomic set-bit function; we have __set_bit() if you
want the non-atomic one. Atomic bitops are (obviously) useful if you
have concurrent changes to your bitmap.

Lots of people seem confused on this though, as evidenced by a lot of
the broken crap we keep finding (then again, them using __set_bit()
would still be broken due to the endian thing).
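To make the distinction concrete -- an illustrative kernel-style
sketch, not code from this series:

#include <linux/bitops.h>

static unsigned long flags[BITS_TO_LONGS(64)];

static void mark_ready(int n)
{
	/* Atomic RMW (a LOCK-prefixed op on x86): safe vs concurrent updaters. */
	set_bit(n, flags);
}

static void mark_ready_unlocked(int n)
{
	/*
	 * Plain load/modify/store: two CPUs doing this concurrently on
	 * bits of the same word can lose an update.  Only valid when
	 * the caller already serialises access to the bitmap.
	 */
	__set_bit(n, flags);
}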
> For locked operations (including misaligned ones) that don't cross cache-line
> boundaries the read operation almost certainly locks the cache line (against
> a snoop) until the write has updated the cache line.

Note your use of 'almost'. Almost isn't good enough.

Note that other architectures allow the store from atomic operations to
hit the store buffer, and I strongly suspect x86 does the same. Waiting
for a store-buffer drain is *expensive*.

Try timing:

	LOCK INC (ptr);

vs

	LOCK INC (ptr);
	MFENCE

My guess is that the second one is *far* more expensive. MFENCE drains
the store buffer (and waits for completion thereof) -- it must, since
it fences against non-coherent stuff. I suppose ARM's DMB vs DSB is a
similar distinction.
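Something like the following should show it -- an untested userspace
sketch, assuming GCC/Clang on x86_64; raw TSC counts are rough, but
plenty to expose a store-buffer drain:

#include <stdio.h>
#include <stdint.h>
#include <x86intrin.h>

#define ITERS 10000000UL

static long counter;

static uint64_t bench(int with_mfence)
{
	uint64_t t0 = __rdtsc();
	unsigned long i;

	for (i = 0; i < ITERS; i++) {
		asm volatile("lock incq %0" : "+m" (counter));
		if (with_mfence)
			asm volatile("mfence" ::: "memory");
	}
	return __rdtsc() - t0;
}

int main(void)
{
	printf("lock inc        : %lu cycles/op\n", bench(0) / ITERS);
	printf("lock inc; mfence: %lu cycles/op\n", bench(1) / ITERS);
	return 0;
}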
> This won't happen until the write 'drains' from the store buffer.
> (I suspect that locked read requests act like write requests in ensuring
> that no other cpu has a dirty copy of the cache line, and also marking it dirty.)
> Although this will delay the response to the snoop it will only
> stall the cpu (or other bus master), not the entire memory 'bus'.

I really don't think so. The commit I pointed to earlier in the thread,
the one that replaced MFENCE with LOCK ADD $0, -4(%RSP) for smp_mb(),
strongly indicates LOCK-prefixed instructions do _NOT_ flush the store
buffer.

All barriers impose is order; if your store buffer can preserve order,
it should all just work. One possible way would be to tag each entry
and increment the tag on a barrier, then ensure that all smaller tags
are flushed before allowing a higher-tagged entry to leave.

> If you read the description of 'lock btr' you'll see that it always does the
> write cycle (to complete the atomic RMW expected by the memory
> subsystem) even when the bit is clear.

I know it does, but I don't see how that is relevant here.

> Remote store buffers are irrelevant to locked accesses.

They are not in general, and I've seen nothing to indicate they are on
x86.

> (If you are doing concurrent locked and unlocked accesses to the same
> memory location something is badly broken.)

It is actually quite common.

> It really can't matter whether one access is a mis-aligned 64bit word
> and the other a byte. Both do atomic RMW updates so the result
> cannot be unexpected.

Expectations are often violated, especially when talking about memory
ordering.

> In principle two separate 8 bit RMW cycles could be done concurrently
> to two halves of a 16 bit 'flag' word without losing any bits or any reads
> returning any of the expected 4 values.
> Not that any memory system would support such updates.

I'm thinking you ought to go read that paper on mixed-size concurrency
I referenced earlier in this thread. IIRC the conclusion was that
PowerPC does exactly that, and ARM64 allows for it but it hasn't been
observed yet.

Anyway, I'm not saying x86 behaves this way; I'm saying that I have
lots of questions and very few answers. I'm also saying that the
variant with non-overlapping atomics could conceivably misbehave, while
the variant with overlapping atomics is guaranteed not to.

Specifically, smp_mb()/SYNC on PowerPC cannot restore Sequential
Consistency under mixed-size operations. How's that for expectations?
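Purely for illustration -- this is not the code under discussion -- the
non-overlapping vs overlapping shapes, written with GCC's __atomic
builtins, look something like:

#include <stdint.h>

union flags {
	uint16_t word;
	uint8_t  byte[2];
};

static union flags f;

/*
 * Non-overlapping: each thread does an atomic RMW on its own byte.
 * The two accesses never hit the same location, so how other
 * observers see the combined 16-bit word is a mixed-size question
 * the architecture may leave open.
 */
static void a_nonoverlapping(void)
{
	__atomic_fetch_or(&f.byte[0], 0x01, __ATOMIC_RELAXED);
}

static void b_nonoverlapping(void)
{
	__atomic_fetch_or(&f.byte[1], 0x01, __ATOMIC_RELAXED);
}

/*
 * Overlapping: both threads do an atomic RMW on the full word, so the
 * two operations contend on one location and must serialise; cache
 * coherence alone gives a single order everyone agrees on.
 */
static void a_overlapping(void)
{
	__atomic_fetch_or(&f.word, 0x0001, __ATOMIC_RELAXED);
}

static void b_overlapping(void)
{
	__atomic_fetch_or(&f.word, 0x0100, __ATOMIC_RELAXED);
}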