2019-04-29 14:54:06

by Jan Glauber

Subject: [RFC] Disable lockref on arm64

Hi Catalin & Will,

I've been looking into performance issues that were reported for several
test-cases, for instance an nginx benchmark.

It turned out the issue we have on ThunderX2 is the file open-close sequence
with small read sizes. If the used files are opened read-only the
lockref code (enabled by ARCH_USE_CMPXCHG_LOCKREF) is used.

The lockref CMPXCHG_LOOP uses an unbound (as long as the associated
spinlock isn't taken) while loop to change the lock count. This behaves
badly under heavy contention (~25x retries for one cmpxchg to succeed
with 28 threads operating on the same file). In case of a NUMA system
it also behaves badly as the access from the other socket is much slower.

The fact that on ThunderX2 cpu_relax() turns into only one NOP
instruction doesn't help either. On Intel, pause seems to block the thread
much longer, thereby avoiding the heavy contention.
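
For reference, the two cpu_relax() flavours being compared look roughly like
this (reconstructed from memory of the kernel sources of that era, so treat
the paths and details as approximate rather than authoritative):

/* arch/arm64/include/asm/processor.h (roughly): a bare yield hint */
static inline void cpu_relax(void)
{
        asm volatile("yield" ::: "memory");
}

/* arch/x86/include/asm/processor.h (roughly): PAUSE, spelled "rep; nop" */
static __always_inline void rep_nop(void)
{
        asm volatile("rep; nop" ::: "memory");
}

static __always_inline void cpu_relax(void)
{
        rep_nop();
}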

With the queued spinlocks implementation I can see a major improvement
when I disable lockref. A trivial open-close test-case improves by a
factor of 2 while system time also decreases 2x. Kernel compile and
dbench numbers didn't show any regression with lockref disabled.

Can we simply disable lockref? Is anyone else seeing this issue? Is there
an arm64 platform that actually implements yield?

Thanks,
Jan


2019-05-01 16:04:18

by Will Deacon

Subject: Re: [RFC] Disable lockref on arm64

Hi Jan,

[+Peter and Linus, since they enjoy this stuff]

On Mon, Apr 29, 2019 at 02:52:11PM +0000, Jan Glauber wrote:
> I've been looking into performance issues that were reported for several
> test-cases, for instance an nginx benchmark.

Could you share enough specifics here so that we can reproduce the issue
locally, please? That would help us in our attempts to develop a fix without
simply disabling the option for everybody else.

> It turned out the issue we have on ThunderX2 is the file open-close sequence
> with small read sizes. If the used files are opened read-only the
> lockref code (enabled by ARCH_USE_CMPXCHG_LOCKREF) is used.
>
> The lockref CMPXCHG_LOOP uses an unbound (as long as the associated
> spinlock isn't taken) while loop to change the lock count. This behaves
> badly under heavy contention (~25x retries for one cmpxchg to succeed
> with 28 threads operating on the same file). In case of a NUMA system
> it also behaves badly as the access from the other socket is much slower.

It's surprising that this hasn't been reported on x86. I suspect their
implementation of cmpxchg is a little more forgiving under contention.

> The fact that on ThunderX2 cpu_relax() turns into only one NOP
> instruction doesn't help either. On Intel, pause seems to block the thread
> much longer, thereby avoiding the heavy contention.

NOPing out the yield instruction seems like a poor choice for an SMT CPU
such as TX2. That said, the yield was originally added to cpu_relax() as
a scheduling hint for QEMU.

> With the queued spinlocks implementation I can see a major improvement
> when I disable lockref. A trivial open-close test-case improves by a
> factor of 2 while system time also decreases 2x. Kernel compile and
> dbench numbers didn't show any regression with lockref disabled.
>
> Can we simply disable lockref? Is anyone else seeing this issue? Is there
> an arm64 platform that actually implements yield?

There are two issues with disabling lockref like this:

1. It's a compile-time thing, so systems that would benefit from the code
are unfairly penalised.

2. You're optimising for the contended case at the cost of the
uncontended case, which should actually be the common case as well.

Now, nobody expects contended CAS to scale well, so the middle ground
probably involves backing off to the lock under contention, a bit like
an optimistic trylock(). Unfortunately, that will need some tuning, hence
my initial request for a reproducer.

Cheers,

Will

2019-05-01 16:42:40

by Linus Torvalds

Subject: Re: [RFC] Disable lockref on arm64

On Mon, Apr 29, 2019 at 7:52 AM Jan Glauber <[email protected]> wrote:
>
> It turned out the issue we have on ThunderX2 is the file open-close sequence
> with small read sizes. If the used files are opened read-only the
> lockref code (enabled by ARCH_USE_CMPXCHG_LOCKREF) is used.
>
> The lockref CMPXCHG_LOOP uses an unbound (as long as the associated
> spinlock isn't taken) while loop to change the lock count. This behaves
> badly under heavy contention

Ok, excuse me when I rant a bit.

Since you're at Marvell, maybe you can forward this rant to the proper
guilty parties?

Who was the absolute *GENIUS* who went

Step 1: "Oh, we have a middling CPU that isn't world-class on its own"

Step 2: "BUT! We can put a lot of them on a die, because that's 'easy'"

Step 3: "But let's make sure the interconnect isn't all that special,
because that would negate the whole 'easy' part, and really strong
interconnects are even harder than CPUs and use even more power, so
that wouldn't work"

Step 4: "I wonder why this thing scales badly?"

Seriously. Why are you guys doing this? Has nobody ever looked at the
fundamental thought process above and gone "Hmm"?

If you try to compensate for a not-great core by putting twice the
number of them in a system, you need a cache system and interconnect
between them that is more than twice as good as the competition.

And honestly, from everything that I hear, you don't have it. The
whole chip is designed for "throughput when there is no contention".
Is it really a huge surprise that it then falls flat on its face when
there's something fancy going on?

So now you want to penalize everybody else in the ARM community
because you have a badly balanced system?

Ok, rant over.

The good news is that we can easily fix _this_ particular case by just
limiting the CMPXCHG_LOOP to a maximum number of retries, since the
loop is already designed to fail quickly if the spin lock value isn't
unlocked, and all the lockref code is already organized to fall back
to spinlocks.
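
For reference, that spinlock fallback is already visible in the shape of the
lockref operations themselves; lockref_get() in lib/lockref.c is roughly the
following (quoted from memory, so treat it as a sketch rather than the exact
source):

void lockref_get(struct lockref *lockref)
{
        /* Fast path: bump the count with a 64-bit cmpxchg while the
         * embedded spinlock is observed to be unlocked. */
        CMPXCHG_LOOP(
                new.count++;
        ,
                return;
        );

        /* Slow path: take the spinlock and update the count under it. */
        spin_lock(&lockref->lock);
        lockref->count++;
        spin_unlock(&lockref->lock);
}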

So the attached three-liner patch may just work for you. Once _one_
thread hits the maximum retry case and goes into the spinlocked case,
everybody else will also fall back to spinlocks because they now see
that the lockref is contended. So the "retry" value probably isn't all
that important, but let's make it big enough that it probably never
happens on a well-balanced system.

But seriously: the whole "let's just do lots of CPU cores because it's
easy" needs to stop. It's fine if you have a network processor and
you're doing independent things, but it's not a GP processor approach.

Your hardware people need to improve on your CPU core (maybe the
server version of Cortex A76 is starting to approach being good
enough?) and your interconnect (seriously!) instead of just slapping
32 cores on a die and calling it a day.

Linus "not a fan of the flock of chickens" Torvalds


Attachments:
patch.diff (649.00 B)

2019-05-02 08:29:49

by Jan Glauber

Subject: Re: [RFC] Disable lockref on arm64

On Wed, May 01, 2019 at 09:41:08AM -0700, Linus Torvalds wrote:
> On Mon, Apr 29, 2019 at 7:52 AM Jan Glauber <[email protected]> wrote:
> >
> > It turned out the issue we have on ThunderX2 is the file open-close sequence
> > with small read sizes. If the used files are opened read-only the
> > lockref code (enabled by ARCH_USE_CMPXCHG_LOCKREF) is used.
> >
> > The lockref CMPXCHG_LOOP uses an unbound (as long as the associated
> > spinlock isn't taken) while loop to change the lock count. This behaves
> > badly under heavy contention
>
> Ok, excuse me when I rant a bit.
>
> Since you're at Marvell, maybe you can forward this rant to the proper
> guilty parties?

Sure :)

> Who was the absolute *GENIUS* who went
>
> Step 1: "Oh, we have a middling CPU that isn't world-class on its own"
>
> Step 2: "BUT! We can put a lot of them on a die, because that's 'easy'"
>
> Step 3: "But let's make sure the interconnect isn't all that special,
> because that would negate the whole 'easy' part, and really strong
> interconnects are even harder than CPUs and use even more power, so
> that wouldn't work"
>
> Step 4: "I wonder why this thing scales badly?"
>
> Seriously. Why are you guys doing this? Has nobody ever looked at the
> fundamental thought process above and gone "Hmm"?
>
> If you try to compensate for a not-great core by putting twice the
> number of them in a system, you need a cache system and interconnect
> between them that is more than twice as good as the competition.
>
> And honestly, from everything that I hear, you don't have it. The
> whole chip is designed for "throughput when there is no contention".
> Is it really a huge surprise that it then falls flat on its face when
> there's something fancy going on?

I'll see how x86 runs the same testcase, I thought that playing
cacheline ping-pong is not the optimal use case for any CPU.

My assumption was that x86 probably doesn't suffer that much because
the cpu_relax() -> pause insn could slow down the retry rate.

> So now you want to penalize everybody else in the ARM community
> because you have a badly balanced system?

Not really, as I intentionally did not include a patch and sent this as
RFC.

> Ok, rant over.
>
> The good news is that we can easily fix _this_ particular case by just
> limiting the CMPXCHG_LOOP to a maximum number of retries, since the
> loop is already designed to fail quickly if the spin lock value isn't
> unlocked, and all the lockref code is already organized to fall back
> to spinlocks.
>
> So the attached three-liner patch may just work for you. Once _one_
> thread hits the maximum retry case and goes into the spinlocked case,
> everybody else will also fall back to spinlocks because they now see
> that the lockref is contended. So the "retry" value probably isn't all
> that important, but let's make it big enough that it probably never
> happens on a well-balanced system.

Agreed, your patch would solve the issue for ThunderX2. Limiting the
retry attempts was one of the things I tried besides extending the number
of NOPs in cpu_relax().

> But seriously: the whole "let's just do lots of CPU cores because it's
> easy" needs to stop. It's fine if you have a network processor and
> you're doing independent things, but it's not a GP processor approach.
>
> Your hardware people need to improve on your CPU core (maybe the
> server version of Cortex A76 is starting to approach being good
> enough?) and your interconnect (seriously!) instead of just slapping
> 32 cores on a die and calling it a day.
>
> Linus "not a fan of the flock of chickens" Torvalds

> lib/lockref.c | 3 +++
> 1 file changed, 3 insertions(+)
>
> diff --git a/lib/lockref.c b/lib/lockref.c
> index 3d468b53d4c9..a6762f8f45c9 100644
> --- a/lib/lockref.c
> +++ b/lib/lockref.c
> @@ -9,6 +9,7 @@
> * failure case.
> */
> #define CMPXCHG_LOOP(CODE, SUCCESS) do { \
> + int retry = 15; /* Guaranteed random number */ \
> struct lockref old; \
> BUILD_BUG_ON(sizeof(old) != 8); \
> old.lock_count = READ_ONCE(lockref->lock_count); \
> @@ -21,6 +22,8 @@
> if (likely(old.lock_count == prev.lock_count)) { \
> SUCCESS; \
> } \
> + if (!--retry) \
> + break; \
> cpu_relax(); \
> } \
> } while (0)

2019-05-02 08:40:07

by Jan Glauber

Subject: Re: [RFC] Disable lockref on arm64

On Wed, May 01, 2019 at 05:01:40PM +0100, Will Deacon wrote:
> Hi Jan,
>
> [+Peter and Linus, since they enjoy this stuff]
>
> On Mon, Apr 29, 2019 at 02:52:11PM +0000, Jan Glauber wrote:
> > I've been looking into performance issues that were reported for several
> > test-cases, for instance an nginx benchmark.
>
> Could you share enough specifics here so that we can reproduce the issue
> locally, please? That would help us in our attempts to develop a fix without
> simply disabling the option for everybody else.

I can send my test-case which is a trivial open-read-close loop with one
thread per CPU and increasing read sizes.
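
For readers who want to experiment, a minimal user-space sketch of that kind
of benchmark could look like the code below. This is not Jan's actual test
case (which isn't included in the thread); the file path, read size and
iteration count are made up for illustration. Build with -lpthread.

#include <fcntl.h>
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

#define ITERATIONS 100000                       /* arbitrary */

static const char *path = "/etc/hostname";      /* any small, readable file */
static size_t read_size = 1;                    /* grown across runs in the real test */

static void *worker(void *arg)
{
        char *buf = malloc(read_size);

        (void)arg;
        for (int i = 0; i < ITERATIONS; i++) {
                int fd = open(path, O_RDONLY);  /* hits the dentry lockref */

                if (fd < 0) {
                        perror("open");
                        exit(1);
                }
                if (read(fd, buf, read_size) < 0)
                        perror("read");
                close(fd);
        }
        free(buf);
        return NULL;
}

int main(void)
{
        long ncpus = sysconf(_SC_NPROCESSORS_ONLN);
        pthread_t *threads = calloc(ncpus, sizeof(*threads));

        for (long i = 0; i < ncpus; i++)        /* one thread per CPU */
                pthread_create(&threads[i], NULL, worker, NULL);
        for (long i = 0; i < ncpus; i++)
                pthread_join(threads[i], NULL);
        free(threads);
        return 0;
}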

> > It turned out the issue we have on ThunderX2 is the file open-close sequence
> > with small read sizes. If the used files are opened read-only the
> > lockref code (enabled by ARCH_USE_CMPXCHG_LOCKREF) is used.
> >
> > The lockref CMPXCHG_LOOP uses an unbound (as long as the associated
> > spinlock isn't taken) while loop to change the lock count. This behaves
> > badly under heavy contention (~25x retries for one cmpxchg to succeed
> > with 28 threads operating on the same file). In case of a NUMA system
> > it also behaves badly as the access from the other socket is much slower.
>
> It's surprising that this hasn't been reported on x86. I suspect their
> implementation of cmpxchg is a little more forgiving under contention.
>
> > The fact that on ThunderX2 cpu_relax() turns into only one NOP
> > instruction doesn't help either. On Intel, pause seems to block the thread
> > much longer, thereby avoiding the heavy contention.
>
> NOPing out the yield instruction seems like a poor choice for an SMT CPU
> such as TX2. That said, the yield was originally added to cpu_relax() as
> a scheduling hint for QEMU.

The issue is not limited to SMT; it also shows up without SMT.

> > With the queued spinlocks implementation I can see a major improvement
> > when I disable lockref. A trivial open-close test-case improves by a
> > factor of 2 while system time also decreases 2x. Kernel compile and
> > dbench numbers didn't show any regression with lockref disabled.
> >
> > Can we simply disable lockref? Is anyone else seeing this issue? Is there
> > an arm64 platform that actually implements yield?
>
> There are two issues with disabling lockref like this:
>
> 1. It's a compile-time thing, so systems that would benefit from the code
> are unfairly penalised.
>
> 2. You're optimising for the contended case at the cost of the
> uncontended case, which should actually be the common case as well.

I completely agree with 2). Nevertheless limiting the retry attempts
like Linus suggested looks like a fair change that should not penalize
anyone and would still help the contended case.

--Jan

> Now, nobody expects contended CAS to scale well, so the middle ground
> probably involves backing off to the lock under contention, a bit like
> an optimistic trylock(). Unfortunately, that will need some tuning, hence
> my initial request for a reproducer.
>
> Cheers,
>
> Will

2019-05-02 16:14:12

by Linus Torvalds

Subject: Re: [RFC] Disable lockref on arm64

On Thu, May 2, 2019 at 1:27 AM Jan Glauber <[email protected]> wrote:
>
> I'll see how x86 runs the same testcase, I thought that playing
> cacheline ping-pong is not the optimal use case for any CPU.

Oh, ping-pong is always bad.

But from past experience, x86 tends to be able to always do a tight
cmpxchg loop without failing more than once or twice, which is all
you need for things like this.

And it's "easy" to do in hardware on a CPU: all you need to do is
guarantee that when you have a cmpxchg loop, the cacheline is sticky
enough that it stays around at the local CPU for the duration of one
loop entry (ie from one cmpxchg to the next).

Obviously you can do that wrong too, and make cachelines *too* sticky,
and then you get fairness issues.

But it really sounds like what happens in your ThunderX2 case is that the
different CPUs steal each other's cachelines so quickly that even when
you get the cacheline, you don't then get to update it.

Does ThunderX2 do LSE atomics? Are the acquire/release versions really
slow, perhaps, and more or less serializing (maybe it does the
"release" logic even when the store _fails_?), so that doing two
back-to-back cmpxchg ends up taking the core a "long" time, so that
the cache subsystem then steals it easily in between cmpxchg's in a
loop? Does the L1 cache maybe have no way to keep a line around from
one cmpxchg to the next?

This is (one example) where having a CPU and an interconnect that
works together matters. And yes, it probably needs a few generations
of hardware tuning where people see problems and fix them.

Linus

by Jayachandran Chandrasekharan Nair

Subject: Re: [RFC] Disable lockref on arm64

On Thu, May 02, 2019 at 09:12:18AM -0700, Linus Torvalds wrote:
> On Thu, May 2, 2019 at 1:27 AM Jan Glauber <[email protected]> wrote:
> >
> > I'll see how x86 runs the same testcase, I thought that playing
> > cacheline ping-pong is not the optimal use case for any CPU.
>
> Oh, ping-pong is always bad.
>
> But from past experience, x86 tends to be able to always do a tight
> cmpxchg loop without failing more than once or twice, which is all
> you need for things like this.

I don't really see the point you are making about hardware. If you
look at the test case, you have about 64 cores doing CAS to the same
location. At any point one of them will succeed and the other 63 will
fail - and in our case since cpu_relax is a nop, they sit in a tight
loop mostly failing.

And further due to the nature of the test case, the successful thread
will come back almost immediately with another CAS.

> And it's "easy" to do in hardware on a CPU: all you need to do is
> guarantee that when you have a cmpxchg loop, the cacheline is sticky
> enough that it stays around at the local CPU for the duration of one
> loop entry (ie from one cmpxchg to the next).
>
> Obviously you can do that wrong too, and make cachelines *too* sticky,
> and then you get fairness issues.

This is certainly not the case; we are not bouncing around without making
any progress at all. We have all 64 cores hitting the same location in a
very tight loop, slowing the system down. And you will get fairness
issues anyway about which of the failing cores succeeds next.

The testcase does not hang indefinitely, it eventually completes. The
scaling loss is, in my opinion, due to the naive lockref implementation,
rather than due to a hardware limitation.

Are you expecting the hardware cache coherency implementation to have
the equivalent of queued locks and block potentially failing CAS?

After speaking to the folks doing performance comparisons here, x86
suffers in the same test case as well, when there are enough cores.

Your patch that switches to spinlock (in this case queued) works nicely
in case of high contention. Is this something that will be merged to
mainline? We can provide some testing results if that will help.

> But it really sounds like what happens in your ThunderX2 case is that the
> different CPUs steal each other's cachelines so quickly that even when
> you get the cacheline, you don't then get to update it.
>
> Does ThunderX2 do LSE atomics? Are the acquire/release versions really
> slow, perhaps, and more or less serializing (maybe it does the
> "release" logic even when the store _fails_?), so that doing two
> back-to-back cmpxchg ends up taking the core a "long" time, so that
> the cache subsystem then steals it easily in between cmpxchg's in a
> loop? Does the L1 cache maybe have no way to keep a line around from
> one cmpxchg to the next?

ThunderX2 has LSE atomics. It also has full out-of-order execution with
weak ordering for loads/stores, so a barrier will slow down execution.

Also to address some points on the earlier rant: ThunderX2 is a fairly
beefy processor (based on Broadcom Vulcan), it compares well on per-core
performance with x86 (and with A76 from what I hear even though A76 was
out a few years later). There are more cores per socket due to the fact
that there is no ISA baggage, and not really due to a weaker core.

> This is (one example) where having a CPU and an interconnect that
> works together matters. And yes, it probably needs a few generations
> of hardware tuning where people see problems and fix them.

The next generation ThunderX3 is in the works, and it will have even
more cores, it is going to be fun :)

JC

2019-05-03 19:50:13

by Linus Torvalds

Subject: Re: [RFC] Disable lockref on arm64

On Thu, May 2, 2019 at 4:19 PM Jayachandran Chandrasekharan Nair
<[email protected]> wrote:
>>
> I don't really see the point you are making about hardware. If you
> look at the test case, you have about 64 cores doing CAS to the same
> location. At any point one of them will succeed and the other 63 will
> fail - and in our case since cpu_relax is a nop, they sit in a tight
> loop mostly failing.

No.

My point is that the others will *not* fail, if your cache coherency acts sane.

Here's the deal: with a cmpxchg loop, no cacheline should *ever* be in
shared mode as part of the loop. Agreed? Even if the cmpxchg is done
with ldx/stx, the ldx should do a read-for-write cycle, so at no
single time will you ever have a shared cacheline.

And once one CPU gets ownership of the line, it doesn't lose it
immediately, so the next cmpxchg will *succeed*.

So at most, the *first* cmpxchg will fail (because that's the one that
was fed not by a previous cmpxchg, but by a regular load (which we'd
*like* to do as a "load-for-ownership" load, but we don't have the
interfaces to do that)). But the second cmpxchg should basically always
succeed, unless something exceptional happened (maybe an interrupt,
maybe something big like that).

Ergo: if you have a case of failing cmpxchg a lot, your cache
coherency is simply bad. Your hardware people should be ashamed of
themselves for letting go of the cacheline without just letting the
next cmpxchg succeed.

Notice how there is *NO* ping-pong. Sure, the cacheline moves around,
but every time it moves around just once, a thread makes progress.
None of this "for every progress, there are 63 threads that fail"
garbage that you're claiming is normal.

It's not normal, and it's not inevitable.

If it really happens, it's a sign of bad hardware. Just own it, and
talk to the hw people, and make sure it gets fixed in ThunderX3. Ok?

Linus

by Jayachandran Chandrasekharan Nair

Subject: Re: [EXT] Re: [RFC] Disable lockref on arm64

On Fri, May 03, 2019 at 12:40:34PM -0700, Linus Torvalds wrote:
> On Thu, May 2, 2019 at 4:19 PM Jayachandran Chandrasekharan Nair
> <[email protected]> wrote:
> >>
> > I don't really see the point you are making about hardware. If you
> > look at the test case, you have about 64 cores doing CAS to the same
> > location. At any point one of them will succeed and the other 63 will
> > fail - and in our case since cpu_relax is a nop, they sit in a tight
> > loop mostly failing.
>
> No.
>
> My point is that the others will *not* fail, if your cache coherency acts sane.
>
> Here's the deal: with a cmpxchg loop, no cacheline should *ever* be in
> shared mode as part of the loop. Agreed? Even if the cmpxchg is done
> with ldx/stx, the ldx should do a read-for-write cycle, so at no
> single time will you ever have a shared cacheline.
>
> And once one CPU gets ownership of the line, it doesn't lose it
> immediately, so the next cmpxchg will *succeed*.
>
> So at most, the *first* cmpxchg will fail (because that's the one that
> was fed not by a previous cmpxchg, but by a regular load (which we'd
> *like* to do as a "load-for-ownership" load, but we don't have the
> interfaces to do that)). But the second cmpxchg should basically always
> succeed, unless something exceptional happened (maybe an interrupt,
> maybe something big like that).
>
> Ergo: if you have a case of failing cmpxchg a lot, your cache
> coherency is simply bad. Your hardware people should be ashamed of
> themselves for letting go of the cacheline without just letting the
> next cmpxchg succeed.
>
> Notice how there is *NO* ping-pong. Sure, the cacheline moves around,
> but every time it moves around just once, a thread makes progress.
> None of this "for every progress, there are 63 threads that fail"
> garbage that you're claiming is normal.
>
> It's not normal, and it's not inevitable.

If you look at the code, the CAS failure is followed by a yield
before retrying the CAS. Yield on arm64 is expected to be a hint
to release resources so that other threads/cores can make progress.
Under heavy contention, I expect the current code to behave the way
I noted in my last mail, with the issue with fairness as well.

Perhaps someone from ARM can chime in here how the cas/yield combo
is expected to work when there is contention. ThunderX2 does not
do much with the yield, but I don't expect any ARM implementation
to treat YIELD as a hint not to yield, but to get/keep exclusive
access to the last failed CAS location.

> If it really happens, it's a sign of bad hardware. Just own it, and
> talk to the hw people, and make sure it gets fixed in ThunderX3. Ok?

Also, I tested the lockref code on a fairly high core count x86
system with SMT. The worst case number of loops taken is higher than
your guaranteed random number of 15, but the average number of loops
is fairly low (about 3-4, and double that for SMT). On x86,
I suppose there has been some coevolution between the software and
hardware on locking with cmpxchg and pause, so by now both are
optimized for each other.

Your larger point seems to be that the hardware has to be smarter to
scale standard locking implementations when adding cores, and
be graceful even in extremely high contention cases. Yes, this
is something we should be looking at for ThunderX3.

This whole discussion has been difficult since this has nothing to
do with the core capability which you originally talked about. There
are quite a few low-powered ARM64 cores (some of them in server space
too), but ThunderX2 is certainly not one. I say this from first hand
experience from using a ThunderX2 workstation as my primary system
for a while now. Kernel builds, git operations and running multiple
VMs work extremely well and are pretty fast compared to my earlier
x86 based system.

Anyway, I will talk to hardware folks on locking patterns and see
what can be done about cas & yield in ThunderX3. Thanks for your
suggestions.

JC

2019-05-06 17:15:11

by Linus Torvalds

Subject: Re: [EXT] Re: [RFC] Disable lockref on arm64

On Sun, May 5, 2019 at 11:13 PM Jayachandran Chandrasekharan Nair
<[email protected]> wrote:
>
> > It's not normal, and it's not inevitable.
>
> If you look at the code, the CAS failure is followed by a yield
> before retrying the CAS. Yield on arm64 is expected to be a hint
> to release resources so that other threads/cores can make progress.
> Under heavy contention, I expect the current code to behave the way
> I noted in my last mail, with the issue with fairness as well.

Yes, this is a good point.

It's entirely possible that _particularly_ for CAS loops - where we
get an updated value that caused the CAS to fail - we should not yield
in between every CAS. Exactly so that we don't force the CPU to flush
the cacheline that it just got and make the CAS loop do lots of
unnecessary work and just keep looping.

That said, I think right now all ARM implementations actually treat
the yield instruction as a pure nop, so I assume this isn't actually
the root of the ThunderX2 problems.

But we could do something like "unroll the cmpxchg loop twice, only
yield every other iteration" exactly for the case where cpu_relax()
might encourage the CPU to actually release the cacheline.
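
A minimal user-space sketch of that idea, written with C11 atomics rather
than the kernel's cmpxchg helpers (the function names here are made up for
illustration, not kernel APIs):

#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

static inline void yield_hint(void)
{
#if defined(__aarch64__)
        __asm__ __volatile__("yield" ::: "memory");     /* cpu_relax() analogue */
#else
        __asm__ __volatile__("" ::: "memory");          /* plain compiler barrier */
#endif
}

/* Try to increment *val with CAS, but only hint a yield on every other
 * failed attempt, so a freshly acquired exclusive cacheline is not given
 * up between back-to-back attempts. Returns false after max_retry
 * failures so the caller can fall back to a lock. */
static bool cas_inc_half_yield(_Atomic uint64_t *val, int max_retry)
{
        uint64_t old = atomic_load_explicit(val, memory_order_relaxed);

        for (int attempt = 0; attempt < max_retry; attempt++) {
                /* On failure, compare_exchange reloads 'old' with the current value. */
                if (atomic_compare_exchange_weak_explicit(val, &old, old + 1,
                                                          memory_order_relaxed,
                                                          memory_order_relaxed))
                        return true;
                if (attempt & 1)
                        yield_hint();
        }
        return false;
}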

If you find that something like that would help ThunderX2, we can most
certainly try that kind of thing. It makes sense, and unrolling the
cmpxchg loop once might even help with branch prediction (ie the
common case might be that the *first* cmpxchg succeeds, and that's
what the branch predictors get primed for, but then if the first one
actually fails, we fall through to the contention case, and now maybe
the branch predictor ends up being primed for *that* case for the
second one).

Yes, the above is wild handwaving - but it's the kind of thing we
could easily do if the hardware people have a good argument for them.

I'm not asking for _perfect_ hardware. I'm just asking for hardware to
not be actively antagonistic to something fundamental like a cmpxchg
loop.

> Your larger point seems to be that the hardware has to be smarter to
> scale standard locking implementations when adding cores, and
> be graceful even in extremely high contention cases. Yes, this
> is something we should be looking at for ThunderX3.

Yes. Also, my point really is that no core should ever be in the
situation that it fetches a cache-line, only to then release it
immediately again without making any progress. You do want a *certain*
amount of stickiness to the cachelines.

Of course, you very much do not want to overdo it - I'm talking about
keeping the cacheline around for tens of cycles, not for hundreds of
cycles. Enough that if you have small code snippets that are in and
out quickly, you don't see bouncing at that level.

Because you *don't* want the stickiness to be such that once one CPU
has the cacheline, it will stay with that CPU (or that socket) as long
as the CPU hammers it all the time.

So for example, in the case of a cmpxchg loop, maybe you'll get a
couple of branch mispredicts the first few times round the loop (first
because the original value read was a plain read and the cacheline was
shared rather than exclusive, but then after that because the branch
predictor saw previous cases where there was no contention, and the
loop exited immediately).

So maybe the first couple of iterations the core might be "slow
enough" to not get the cmpxchg loop working well. But not a lot more
than that (again, unless something special happens, like an interrupt
storm that keeps interrupting the loop).

End result: I do think the cache coherency needs to be a *bit* smarter
than just a "oh, another core wants this line, I will now immediately
try to satisfy it". And yes, those kinds of things get more important
the more cores you have. You don't need a whole lot of smarts if you
have just a couple of cores, because you'll be making progress
regardless.

And note that I really think this kind of support absolutely falls on
hardware. Software simply _cannot_ make some good tuning decision
using backoff or whatever. Yes, yes, I know people have done things
like that historically (Sparc and SunOS were famous for it), but part
of doing them historically was that

(a) they didn't know better

(b) the software was often tuned to particular hardware and loads (or
at least a very small set of hardware)

(c) there wasn't necessarily better hardware options available

but none of the above is true today.

Things like backoff loops in software are not only hw specific, they
tend to have seriously bad fairness behavior (and they depend on
everybody honoring the backoff, which can be a security issue), so
then you have to add even more complexity to deal with the fairness
issues, and your cmpxchg loop becomes something truly disgusting.

Which is why I say: "do it right in hardware".

Linus

2019-05-06 18:12:01

by Will Deacon

Subject: Re: [EXT] Re: [RFC] Disable lockref on arm64

On Mon, May 06, 2019 at 06:13:12AM +0000, Jayachandran Chandrasekharan Nair wrote:
> Perhaps someone from ARM can chime in here how the cas/yield combo
> is expected to work when there is contention. ThunderX2 does not
> do much with the yield, but I don't expect any ARM implementation
> to treat YIELD as a hint not to yield, but to get/keep exclusive
> access to the last failed CAS location.

Just picking up on this as "someone from ARM".

The yield instruction in our implementation of cpu_relax() is *only* there
as a scheduling hint to QEMU so that it can treat it as an internal
scheduling hint and run some other thread; see 1baa82f48030 ("arm64:
Implement cpu_relax as yield"). We can't use WFE or WFI blindly here, as it
could be a long time before we see a wake-up event such as an interrupt. Our
implementation of smp_cond_load_acquire() is much better for that kind of
thing, but doesn't help at all for a contended CAS loop where the variable
is actually changing constantly.

Implementing yield in the CPU may generally be beneficial for SMT designs so
that the hardware resources aren't wasted when spinning round a busy loop.
For this particular discussion (i.e. lockref), however, it seems as though
the cpu_relax() call is questionable to start with.

Will

by Jayachandran Chandrasekharan Nair

Subject: Re: [RFC] Disable lockref on arm64

On Mon, May 06, 2019 at 07:10:40PM +0100, Will Deacon wrote:
> On Mon, May 06, 2019 at 06:13:12AM +0000, Jayachandran Chandrasekharan Nair wrote:
> > Perhaps someone from ARM can chime in here how the cas/yield combo
> > is expected to work when there is contention. ThunderX2 does not
> > do much with the yield, but I don't expect any ARM implementation
> > to treat YIELD as a hint not to yield, but to get/keep exclusive
> > access to the last failed CAS location.
>
> Just picking up on this as "someone from ARM".
>
> The yield instruction in our implementation of cpu_relax() is *only* there
> as a scheduling hint to QEMU so that it can treat it as an internal
> scheduling hint and run some other thread; see 1baa82f48030 ("arm64:
> Implement cpu_relax as yield"). We can't use WFE or WFI blindly here, as it
> could be a long time before we see a wake-up event such as an interrupt. Our
> implementation of smp_cond_load_acquire() is much better for that kind of
> thing, but doesn't help at all for a contended CAS loop where the variable
> is actually changing constantly.

Looking thru the perf output of this case (open/close of a file from
multiple CPUs), I see that refcount is a significant factor in most
kernel configurations - and that too uses cmpxchg (without yield).
x86 has an optimized inline version of refcount that helps
significantly. Do you think this is worth looking at for arm64?
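
For context, the REFCOUNT_FULL code of that era was itself a cmpxchg loop;
the checked refcount_inc_not_zero() in lib/refcount.c was roughly the
following (reconstructed from memory, so names and details may differ
slightly from the exact source):

bool refcount_inc_not_zero(refcount_t *r)
{
        unsigned int new, val = atomic_read(&r->refs);

        do {
                new = val + 1;

                if (!val)               /* already zero: do not resurrect */
                        return false;

                if (unlikely(!new))     /* wrapped: stay saturated at UINT_MAX */
                        return true;

        } while (!atomic_try_cmpxchg_relaxed(&r->refs, &val, new));

        WARN_ONCE(new == UINT_MAX, "refcount_t: saturated; leaking memory.\n");

        return true;
}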

> Implementing yield in the CPU may generally be beneficial for SMT designs so
> that the hardware resources aren't wasted when spinning round a busy loop.

Yield is probably used in sub-optimal implementations of delay or wait.
It is going to be different across multiple implementations and
revisions (given the description in ARM spec). Having a more yielding(?)
implementation would be equally problematic especially in the lockref
case.

> For this particular discussion (i.e. lockref), however, it seems as though
> the cpu_relax() call is questionable to start with.

In case of lockref, taking out the yield/pause and dropping to queued
spinlock after some cycles appears to me to be a better approach.
Relying on the quality of cpu_relax() on the specific processor to
mitigate against contention is going to be tricky anyway.

We will do some more work here, but would appreciate any pointers
based on your experience here.

Thanks,
JC

2019-05-18 10:02:25

by Ard Biesheuvel

Subject: Re: [RFC] Disable lockref on arm64

On Sat, 18 May 2019 at 06:25, Jayachandran Chandrasekharan Nair
<[email protected]> wrote:
>
> On Mon, May 06, 2019 at 07:10:40PM +0100, Will Deacon wrote:
> > On Mon, May 06, 2019 at 06:13:12AM +0000, Jayachandran Chandrasekharan Nair wrote:
> > > Perhaps someone from ARM can chime in here how the cas/yield combo
> > > is expected to work when there is contention. ThunderX2 does not
> > > do much with the yield, but I don't expect any ARM implementation
> > > to treat YIELD as a hint not to yield, but to get/keep exclusive
> > > access to the last failed CAS location.
> >
> > Just picking up on this as "someone from ARM".
> >
> > The yield instruction in our implementation of cpu_relax() is *only* there
> > as a scheduling hint to QEMU so that it can treat it as an internal
> > scheduling hint and run some other thread; see 1baa82f48030 ("arm64:
> > Implement cpu_relax as yield"). We can't use WFE or WFI blindly here, as it
> > could be a long time before we see a wake-up event such as an interrupt. Our
> > implementation of smp_cond_load_acquire() is much better for that kind of
> > thing, but doesn't help at all for a contended CAS loop where the variable
> > is actually changing constantly.
>
> Looking thru the perf output of this case (open/close of a file from
> multiple CPUs), I see that refcount is a significant factor in most
> kernel configurations - and that too uses cmpxchg (without yield).
> x86 has an optimized inline version of refcount that helps
> significantly. Do you think this is worth looking at for arm64?
>

I looked into this a while ago [0], but at the time, we decided to
stick with the generic implementation until we encountered a use case
that benefits from it. Worth a try, I suppose ...

[0] https://lore.kernel.org/linux-arm-kernel/[email protected]/

> > Implementing yield in the CPU may generally be beneficial for SMT designs so
> > that the hardware resources aren't wasted when spinning round a busy loop.
>
> Yield is probably used in sub-optimal implementations of delay or wait.
> It is going to be different across multiple implementations and
> revisions (given the description in ARM spec). Having a more yielding(?)
> implementation would be equally problematic especially in the lockref
> case.
>
> > For this particular discussion (i.e. lockref), however, it seems as though
> > the cpu_relax() call is questionable to start with.
>
> In case of lockref, taking out the yield/pause and dropping to queued
> spinlock after some cycles appears to me to be a better approach.
> Relying on the quality of cpu_relax() on the specific processor to
> mitigate against contention is going to be tricky anyway.
>
> We will do some more work here, but would appreciate any pointers
> based on your experience here.
>
> Thanks,
> JC
>

2019-05-22 17:08:27

by Will Deacon

Subject: Re: [RFC] Disable lockref on arm64

On Sat, May 18, 2019 at 12:00:34PM +0200, Ard Biesheuvel wrote:
> On Sat, 18 May 2019 at 06:25, Jayachandran Chandrasekharan Nair
> <[email protected]> wrote:
> >
> > On Mon, May 06, 2019 at 07:10:40PM +0100, Will Deacon wrote:
> > > On Mon, May 06, 2019 at 06:13:12AM +0000, Jayachandran Chandrasekharan Nair wrote:
> > > > Perhaps someone from ARM can chime in here how the cas/yield combo
> > > > is expected to work when there is contention. ThunderX2 does not
> > > > do much with the yield, but I don't expect any ARM implementation
> > > > to treat YIELD as a hint not to yield, but to get/keep exclusive
> > > > access to the last failed CAS location.
> > >
> > > Just picking up on this as "someone from ARM".
> > >
> > > The yield instruction in our implementation of cpu_relax() is *only* there
> > > as a scheduling hint to QEMU so that it can treat it as an internal
> > > scheduling hint and run some other thread; see 1baa82f48030 ("arm64:
> > > Implement cpu_relax as yield"). We can't use WFE or WFI blindly here, as it
> > > could be a long time before we see a wake-up event such as an interrupt. Our
> > > implementation of smp_cond_load_acquire() is much better for that kind of
> > > thing, but doesn't help at all for a contended CAS loop where the variable
> > > is actually changing constantly.
> >
> > Looking thru the perf output of this case (open/close of a file from
> > multiple CPUs), I see that refcount is a significant factor in most
> > kernel configurations - and that too uses cmpxchg (without yield).
> > x86 has an optimized inline version of refcount that helps
> > significantly. Do you think this is worth looking at for arm64?
> >
>
> I looked into this a while ago [0], but at the time, we decided to
> stick with the generic implementation until we encountered a use case
> that benefits from it. Worth a try, I suppose ...
>
> [0] https://lore.kernel.org/linux-arm-kernel/[email protected]/

If JC can show that we benefit from this, it would be interesting to see if
we can implement the refcount-full saturating arithmetic using the
LDMIN/LDMAX instructions instead of the current cmpxchg() loops.

Will

2019-06-05 13:51:45

by Jan Glauber

Subject: [PATCH] lockref: Limit number of cmpxchg loop retries

The lockref cmpxchg loop is unbound as long as the spinlock is not
taken. Depending on the hardware implementation of compare-and-swap
a high number of loop retries might happen.

Add an upper bound to the loop to force the fallback to spinlocks
after some time. A retry value of 100 should not impact any hardware
that does not have this issue.

With the retry limit the performance of an open-close testcase
improved between 60-70% on ThunderX2.

Suggested-by: Linus Torvalds <[email protected]>
Signed-off-by: Jan Glauber <[email protected]>
---
lib/lockref.c | 3 +++
1 file changed, 3 insertions(+)

diff --git a/lib/lockref.c b/lib/lockref.c
index 3d468b53d4c9..5b34bbd3eba8 100644
--- a/lib/lockref.c
+++ b/lib/lockref.c
@@ -9,6 +9,7 @@
* failure case.
*/
#define CMPXCHG_LOOP(CODE, SUCCESS) do { \
+ int retry = 100; \
struct lockref old; \
BUILD_BUG_ON(sizeof(old) != 8); \
old.lock_count = READ_ONCE(lockref->lock_count); \
@@ -21,6 +22,8 @@
if (likely(old.lock_count == prev.lock_count)) { \
SUCCESS; \
} \
+ if (!--retry) \
+ break; \
cpu_relax(); \
} \
} while (0)
--
2.21.0

2019-06-05 20:19:41

by Linus Torvalds

Subject: Re: [PATCH] lockref: Limit number of cmpxchg loop retries

On Wed, Jun 5, 2019 at 6:49 AM Jan Glauber <[email protected]> wrote:
>
> Add an upper bound to the loop to force the fallback to spinlocks
> after some time. A retry value of 100 should not impact any hardware
> that does not have this issue.
>
> With the retry limit the performance of an open-close testcase
> improved between 60-70% on ThunderX2.

Btw, did you do any kind of performance analysis across different
retry limit values?

I'm perfectly happy to just pick a random number and '100' looks fine
to me, so this is mainly out of curiosity.

Linus

2019-06-06 09:43:39

by Will Deacon

Subject: Re: [PATCH] lockref: Limit number of cmpxchg loop retries

On Thu, Jun 06, 2019 at 08:03:27AM +0000, Jan Glauber wrote:
> On Wed, Jun 05, 2019 at 01:16:46PM -0700, Linus Torvalds wrote:
> > On Wed, Jun 5, 2019 at 6:49 AM Jan Glauber <[email protected]> wrote:
> > >
> > > Add an upper bound to the loop to force the fallback to spinlocks
> > > after some time. A retry value of 100 should not impact any hardware
> > > that does not have this issue.
> > >
> > > With the retry limit the performance of an open-close testcase
> > > improved between 60-70% on ThunderX2.
> >
> > Btw, did you do any kind of performance analysis across different
> > retry limit values?
>
> I tried 15/50/100/200/500, results were largely identical up to 100.
> For SMT=4 a higher retry value might be better, but unless we can add a
> sysctl value 100 looked like a good compromise to me.

Perhaps I'm just getting confused pre-morning-coffee, but I thought the
original complaint (and the reason for this patch even existing) was that
when many CPUs were hammering the lockref then performance tanked? In which
case, increasing the threshold as the number of CPUs increases seems
counter-intuitive to me because it suggests that the larger the system,
the harder we should try to make the cmpxchg work.

Will

2019-06-06 10:30:28

by Jan Glauber

Subject: Re: [PATCH] lockref: Limit number of cmpxchg loop retries

On Thu, Jun 06, 2019 at 10:41:54AM +0100, Will Deacon wrote:
> On Thu, Jun 06, 2019 at 08:03:27AM +0000, Jan Glauber wrote:
> > On Wed, Jun 05, 2019 at 01:16:46PM -0700, Linus Torvalds wrote:
> > > On Wed, Jun 5, 2019 at 6:49 AM Jan Glauber <[email protected]> wrote:
> > > >
> > > > Add an upper bound to the loop to force the fallback to spinlocks
> > > > after some time. A retry value of 100 should not impact any hardware
> > > > that does not have this issue.
> > > >
> > > > With the retry limit the performance of an open-close testcase
> > > > improved between 60-70% on ThunderX2.
> > >
> > > Btw, did you do any kind of performance analysis across different
> > > retry limit values?
> >
> > I tried 15/50/100/200/500, results were largely identical up to 100.
> > For SMT=4 a higher retry value might be better, but unless we can add a
> > sysctl value 100 looked like a good compromise to me.
>
> Perhaps I'm just getting confused pre-morning-coffee, but I thought the
> original complaint (and the reason for this patch even existing) was that
> when many CPUs were hammering the lockref then performance tanked? In which
> case, increasing the threshold as the number of CPUs increases seems
> counter-intuitive to me because it suggests that the larger the system,
> the harder we should try to make the cmpxchg work.

For SMT=4 the top hit I see is queued_spin_lock_slowpath(). Maybe this is more
costly with more threads, so trying harder to use lockref-cmpxchg makes
the microbenchmark faster in that case?

--Jan

2019-06-06 11:06:54

by Jan Glauber

Subject: Re: [PATCH] lockref: Limit number of cmpxchg loop retries

On Wed, Jun 05, 2019 at 01:16:46PM -0700, Linus Torvalds wrote:
> On Wed, Jun 5, 2019 at 6:49 AM Jan Glauber <[email protected]> wrote:
> >
> > Add an upper bound to the loop to force the fallback to spinlocks
> > after some time. A retry value of 100 should not impact any hardware
> > that does not have this issue.
> >
> > With the retry limit the performance of an open-close testcase
> > improved between 60-70% on ThunderX2.
>
> Btw, did you do any kind of performance analysis across different
> retry limit values?

I tried 15/50/100/200/500, results were largely identical up to 100.
For SMT=4 a higher retry value might be better, but unless we can add a
sysctl value 100 looked like a good compromise to me.

--Jan

> I'm perfectly happy to just pick a random number and '100' looks fine
> to me, so this is mainly out of curiosity.
>
> Linus

2019-06-07 08:30:27

by Jan Glauber

Subject: Re: [PATCH] lockref: Limit number of cmpxchg loop retries

On Thu, Jun 06, 2019 at 10:28:12AM +0000, Jan Glauber wrote:
> On Thu, Jun 06, 2019 at 10:41:54AM +0100, Will Deacon wrote:
> > On Thu, Jun 06, 2019 at 08:03:27AM +0000, Jan Glauber wrote:
> > > On Wed, Jun 05, 2019 at 01:16:46PM -0700, Linus Torvalds wrote:
> > > > On Wed, Jun 5, 2019 at 6:49 AM Jan Glauber <[email protected]> wrote:
> > > > >
> > > > > Add an upper bound to the loop to force the fallback to spinlocks
> > > > > after some time. A retry value of 100 should not impact any hardware
> > > > > that does not have this issue.
> > > > >
> > > > > With the retry limit the performance of an open-close testcase
> > > > > improved between 60-70% on ThunderX2.
> > > >
> > > > Btw, did you do any kind of performance analysis across different
> > > > retry limit values?
> > >
> > > I tried 15/50/100/200/500, results were largely identical up to 100.
> > > For SMT=4 a higher retry value might be better, but unless we can add a
> > > sysctl value 100 looked like a good compromise to me.
> >
> > Perhaps I'm just getting confused pre-morning-coffee, but I thought the
> > original complaint (and the reason for this patch even existing) was that
> > when many CPUs were hammering the lockref then performance tanked? In which
> > case, increasing the threshold as the number of CPUs increases seems
> > counter-intuitive to me because it suggests that the larger the system,
> > the harder we should try to make the cmpxchg work.
>
> For SMT=4 the top hit I see is queued_spin_lock_slowpath(). Maybe this is more
> costly with more threads, so trying harder to use lockref-cmpxchg makes
> the microbenchmark faster in that case?

To clarify, with 224 threads & CPUs queued_spin_lock_slowpath is the top hit
even without a retry limit in lockref. This could be unrelated to the lockref
fallback, it looks like it's coming from the spinlock in:
do_sys_open -> get_unused_fd_flags -> __alloc_fd

--Jan

2019-06-07 20:26:51

by Linus Torvalds

Subject: Re: [PATCH] lockref: Limit number of cmpxchg loop retries

On Fri, Jun 7, 2019 at 12:27 AM Jan Glauber <[email protected]> wrote:
>
> To clarify, with 224 threads & CPUs queued_spin_lock_slowpath is the top hit
> even without a retry limit in lockref. This could be unrelated to the lockref
> fallback, it looks like it's coming from the spinlock in:
> do_sys_open -> get_unused_fd_flags -> __alloc_fd

At some point I stop worrying about microbenchmarks just because it's
easy to hit some locking paths in them, without it necessarily being
relevant in real loads.

But I'll apply the lockref patch because I think the "limit cmpxchg
loop" is conceptually a valid model, and while I think the "hitting
the same dentry lockref over and over again" is likely also an
artifact of a microbenchmark, I could at least imagine that it happens
with some common dentries (root, cwd) in some situations.

Linus

by Jayachandran Chandrasekharan Nair

Subject: Re: [RFC] Disable lockref on arm64

On Wed, May 22, 2019 at 05:04:17PM +0100, Will Deacon wrote:
> On Sat, May 18, 2019 at 12:00:34PM +0200, Ard Biesheuvel wrote:
> > On Sat, 18 May 2019 at 06:25, Jayachandran Chandrasekharan Nair
> > <[email protected]> wrote:
> > >
> > > On Mon, May 06, 2019 at 07:10:40PM +0100, Will Deacon wrote:
> > > > On Mon, May 06, 2019 at 06:13:12AM +0000, Jayachandran Chandrasekharan Nair wrote:
> > > > > Perhaps someone from ARM can chime in here how the cas/yield combo
> > > > > is expected to work when there is contention. ThunderX2 does not
> > > > > do much with the yield, but I don't expect any ARM implementation
> > > > > to treat YIELD as a hint not to yield, but to get/keep exclusive
> > > > > access to the last failed CAS location.
> > > >
> > > > Just picking up on this as "someone from ARM".
> > > >
> > > > The yield instruction in our implementation of cpu_relax() is *only* there
> > > > as a scheduling hint to QEMU so that it can treat it as an internal
> > > > scheduling hint and run some other thread; see 1baa82f48030 ("arm64:
> > > > Implement cpu_relax as yield"). We can't use WFE or WFI blindly here, as it
> > > > could be a long time before we see a wake-up event such as an interrupt. Our
> > > > implementation of smp_cond_load_acquire() is much better for that kind of
> > > > thing, but doesn't help at all for a contended CAS loop where the variable
> > > > is actually changing constantly.
> > >
> > > Looking thru the perf output of this case (open/close of a file from
> > > multiple CPUs), I see that refcount is a significant factor in most
> > > kernel configurations - and that too uses cmpxchg (without yield).
> > > x86 has an optimized inline version of refcount that helps
> > > significantly. Do you think this is worth looking at for arm64?
> > >
> >
> > I looked into this a while ago [0], but at the time, we decided to
> > stick with the generic implementation until we encountered a use case
> > that benefits from it. Worth a try, I suppose ...
> >
> > [0] https://lore.kernel.org/linux-arm-kernel/[email protected]/
>
> If JC can show that we benefit from this, it would be interesting to see if
> we can implement the refcount-full saturating arithmetic using the
> LDMIN/LDMAX instructions instead of the current cmpxchg() loops.

Now that the lockref change is mainline, I think we need to take another
look at this patch.

Using a fixed up version of Ard's patch above along with Jan's lockref
change upstream, I get significant improvement in scaling for my file
open/read/close testcase[1]. Like I wrote earlier, if I take a
standard Ubuntu arm64 kernel configuration, most of the time for my
test[1] is spent in refcount operations.

With Ard's changes applied[2], I see that the lockref CAS code becomes
the top function and then the retry limit will kick in as expected. In
my testcase, I see that the queued spinlock case is about 2.5 times
faster than the unbound CAS loop when 224 CPUs are enabled (SMT 4,
28core, 2socket).

JC

[1] https://github.com/jchandra-cavm/refcount-test
[2] https://github.com/jchandra-cavm/linux/commits/refcount-fixes

2019-06-12 09:36:56

by Will Deacon

Subject: Re: [RFC] Disable lockref on arm64

Hi JC,

On Wed, Jun 12, 2019 at 04:10:20AM +0000, Jayachandran Chandrasekharan Nair wrote:
> On Wed, May 22, 2019 at 05:04:17PM +0100, Will Deacon wrote:
> > On Sat, May 18, 2019 at 12:00:34PM +0200, Ard Biesheuvel wrote:
> > > On Sat, 18 May 2019 at 06:25, Jayachandran Chandrasekharan Nair
> > > <[email protected]> wrote:
> > > > Looking thru the perf output of this case (open/close of a file from
> > > > multiple CPUs), I see that refcount is a significant factor in most
> > > > kernel configurations - and that too uses cmpxchg (without yield).
> > > > x86 has an optimized inline version of refcount that helps
> > > > significantly. Do you think this is worth looking at for arm64?
> > > >
> > >
> > > I looked into this a while ago [0], but at the time, we decided to
> > > stick with the generic implementation until we encountered a use case
> > > that benefits from it. Worth a try, I suppose ...
> > >
> > > [0] https://lore.kernel.org/linux-arm-kernel/[email protected]/
> >
> > If JC can show that we benefit from this, it would be interesting to see if
> > we can implement the refcount-full saturating arithmetic using the
> > LDMIN/LDMAX instructions instead of the current cmpxchg() loops.
>
> Now that the lockref change is mainline, I think we need to take another
> look at this patch.

Before we get too involved with this, I really don't want to start a trend of
"let's try to rewrite all code using cmpxchg() in Linux because of TX2". At
some point, the hardware needs to play ball. However...

Ard's refcount patch was about moving the overflow check out-of-line. A
side-effect of this, is that we avoid the cmpxchg() operation from many of
the operations (atomic_add_unless() disappears), and it's /this/ which helps
you. So there may well be a middle ground where we avoid the complexity of
the out-of-line {over,under}flow handling but do the saturation post-atomic
inline.
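
A rough sketch of that middle ground, with the function name and the
REFCOUNT_SATURATED value assumed here rather than taken from any actual
patch: do the plain atomic first, then clamp inline if the old value
indicates an overflow.

#define REFCOUNT_SATURATED      (INT_MIN / 2)   /* assumed saturation value */

static inline void refcount_inc_saturating(refcount_t *r)
{
        int old = atomic_fetch_add_relaxed(1, &r->refs);

        /* The saturation check happens after the atomic, inline and cheap:
         * a negative old value, or an increment that wraps into the
         * negative range, pins the counter at the saturation value. */
        if (unlikely(old < 0 || old + 1 < 0))
                atomic_set(&r->refs, REFCOUNT_SATURATED);
}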

I was hoping we could use LDMIN/LDMAX to maintain the semantics of
REFCOUNT_FULL, but now that I think about it I can't see how we could keep
the arithmetic atomic in that case. Hmm.

Whatever we do, I prefer to keep REFCOUNT_FULL the default option for arm64,
so if we can't keep the semantics when we remove the cmpxchg, you'll need to
opt into this at config time.

Will

2019-06-13 15:41:01

by Hanjun Guo

Subject: Re: [RFC] Disable lockref on arm64

On 2019/6/12 12:10, Jayachandran Chandrasekharan Nair wrote:
> On Wed, May 22, 2019 at 05:04:17PM +0100, Will Deacon wrote:
>> On Sat, May 18, 2019 at 12:00:34PM +0200, Ard Biesheuvel wrote:
>>> On Sat, 18 May 2019 at 06:25, Jayachandran Chandrasekharan Nair
>>> <[email protected]> wrote:
>>>>
>>>> On Mon, May 06, 2019 at 07:10:40PM +0100, Will Deacon wrote:
>>>>> On Mon, May 06, 2019 at 06:13:12AM +0000, Jayachandran Chandrasekharan Nair wrote:
>>>>>> Perhaps someone from ARM can chime in here how the cas/yield combo
>>>>>> is expected to work when there is contention. ThunderX2 does not
>>>>>> do much with the yield, but I don't expect any ARM implementation
>>>>>> to treat YIELD as a hint not to yield, but to get/keep exclusive
>>>>>> access to the last failed CAS location.
>>>>>
>>>>> Just picking up on this as "someone from ARM".
>>>>>
>>>>> The yield instruction in our implementation of cpu_relax() is *only* there
>>>>> as a scheduling hint to QEMU so that it can treat it as an internal
>>>>> scheduling hint and run some other thread; see 1baa82f48030 ("arm64:
>>>>> Implement cpu_relax as yield"). We can't use WFE or WFI blindly here, as it
>>>>> could be a long time before we see a wake-up event such as an interrupt. Our
>>>>> implementation of smp_cond_load_acquire() is much better for that kind of
>>>>> thing, but doesn't help at all for a contended CAS loop where the variable
>>>>> is actually changing constantly.
>>>>
>>>> Looking thru the perf output of this case (open/close of a file from
>>>> multiple CPUs), I see that refcount is a significant factor in most
>>>> kernel configurations - and that too uses cmpxchg (without yield).
>>>> x86 has an optimized inline version of refcount that helps
>>>> significantly. Do you think this is worth looking at for arm64?
>>>>
>>>
>>> I looked into this a while ago [0], but at the time, we decided to
>>> stick with the generic implementation until we encountered a use case
>>> that benefits from it. Worth a try, I suppose ...
>>>
>>> [0] https://lore.kernel.org/linux-arm-kernel/[email protected]/
>>
>> If JC can show that we benefit from this, it would be interesting to see if
>> we can implement the refcount-full saturating arithmetic using the
>> LDMIN/LDMAX instructions instead of the current cmpxchg() loops.
>
> Now that the lockref change is mainline, I think we need to take another
> look at this patch.
>
> Using a fixed up version of Ard's patch above along with Jan's lockref
> change upstream, I get significant improvement in scaling for my file
> open/read/close testcase[1]. Like I wrote earlier, if I take a
> standard Ubuntu arm64 kernel configuration, most of the time for my
> test[1] is spent in refcount operations.
>
> With Ard's changes applied[2], I see that the lockref CAS code becomes
> the top function and then the retry limit will kick in as expected. In
> my testcase, I see that the queued spinlock case is about 2.5 times
> faster than the unbound CAS loop when 224 CPUs are enabled (SMT 4,
> 28core, 2socket).
>
> JC
>
> [1] https://github.com/jchandra-cavm/refcount-test
> [2] https://github.com/jchandra-cavm/linux/commits/refcount-fixes

FWIW, with the patch (Ard's patch plus fixes) above, running the
same testcase on an ARM64 Kunpeng920 96-core system, I can see about a 50%
performance boost.

I also tested Jan's lockref change without Ard's patch; performance
is almost the same.

Thanks
Hanjun

by Jayachandran Chandrasekharan Nair

[permalink] [raw]
Subject: Re: [RFC] Disable lockref on arm64

On Wed, Jun 12, 2019 at 10:31:53AM +0100, Will Deacon wrote:
> Hi JC,
>
> On Wed, Jun 12, 2019 at 04:10:20AM +0000, Jayachandran Chandrasekharan Nair wrote:
> > On Wed, May 22, 2019 at 05:04:17PM +0100, Will Deacon wrote:
> > > On Sat, May 18, 2019 at 12:00:34PM +0200, Ard Biesheuvel wrote:
> > > > On Sat, 18 May 2019 at 06:25, Jayachandran Chandrasekharan Nair
> > > > <[email protected]> wrote:
> > > > > Looking thru the perf output of this case (open/close of a file from
> > > > > multiple CPUs), I see that refcount is a significant factor in most
> > > > > kernel configurations - and that too uses cmpxchg (without yield).
> > > > > x86 has an optimized inline version of refcount that helps
> > > > > significantly. Do you think this is worth looking at for arm64?
> > > > >
> > > >
> > > > I looked into this a while ago [0], but at the time, we decided to
> > > > stick with the generic implementation until we encountered a use case
> > > > that benefits from it. Worth a try, I suppose ...
> > > >
> > > > [0] https://lore.kernel.org/linux-arm-kernel/[email protected]/
> > >
> > > If JC can show that we benefit from this, it would be interesting to see if
> > > we can implement the refcount-full saturating arithmetic using the
> > > LDMIN/LDMAX instructions instead of the current cmpxchg() loops.
> >
> > Now that the lockref change is mainline, I think we need to take another
> > look at this patch.
>
> Before we get too involved with this, I really don't want to start a trend of
> "let's try to rewrite all code using cmpxchg() in Linux because of TX2".

x86 added an arch-specific fast refcount implementation - and the commit
specifically notes that it is faster than cmpxchg-based code[1].

There seems to be an ongoing effort to move more and more subsystems over
from atomic_t to refcount_t (e.g. [2]), specifically because refcount_t on
x86 is fast enough and you get some error checking that atomic_t does not
have.
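
For illustration, that kind of conversion typically looks like the
following (a rough sketch with a made-up struct foo, not taken from any
particular subsystem):

#include <linux/refcount.h>
#include <linux/slab.h>

struct foo {
	refcount_t refs;		/* was: atomic_t refs; */
	/* ... */
};

static void foo_get(struct foo *f)
{
	/* warns and saturates instead of silently wrapping */
	refcount_inc(&f->refs);		/* was: atomic_inc(&f->refs); */
}

static void foo_put(struct foo *f)
{
	/* warns on underflow instead of silently going negative */
	if (refcount_dec_and_test(&f->refs))	/* was: atomic_dec_and_test() */
		kfree(f);
}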

> At some point, the hardware needs to play ball. However...

Even on a totally baller CPU, REFCOUNT_FULL is going to be slow :)
On TX2, this specific benchmark just highlights the issue, but the
difference is significant even on x86 (as noted above).

> Ard's refcount patch was about moving the overflow check out-of-line. A
> side-effect of this, is that we avoid the cmpxchg() operation from many of
> the operations (atomic_add_unless() disappears), and it's /this/ which helps
> you. So there may well be a middle ground where we avoid the complexity of
> the out-of-line {over,under}flow handling but do the saturation post-atomic
> inline.

Right.

> I was hoping we could use LDMIN/LDMAX to maintain the semantics of
> REFCOUNT_FULL, but now that I think about it I can't see how we could keep
> the arithmetic atomic in that case. Hmm.

Do you think Ard's patch needs changes before it can be considered? I
can take a look at that.

> Whatever we do, I prefer to keep REFCOUNT_FULL the default option for arm64,
> so if we can't keep the semantics when we remove the cmpxchg, you'll need to
> opt into this at config time.

Only arm64 and arm select REFCOUNT_FULL in the default config. So please
reconsider this! This is going to slow down arm64 vs. other archs, and it
will become worse as more code adopts refcount_t.

JC
[1] https://www.mail-archive.com/[email protected]/msg1451350.html
[2] https://www.mail-archive.com/[email protected]/msg1336955.html

2019-06-14 10:00:14

by Will Deacon

[permalink] [raw]
Subject: Re: [RFC] Disable lockref on arm64

[+Kees]

On Fri, Jun 14, 2019 at 07:09:26AM +0000, Jayachandran Chandrasekharan Nair wrote:
> On Wed, Jun 12, 2019 at 10:31:53AM +0100, Will Deacon wrote:
> > On Wed, Jun 12, 2019 at 04:10:20AM +0000, Jayachandran Chandrasekharan Nair wrote:
> > > Now that the lockref change is mainline, I think we need to take another
> > > look at this patch.
> >
> > Before we get too involved with this, I really don't want to start a trend of
> > "let's try to rewrite all code using cmpxchg() in Linux because of TX2".
>
> x86 added a arch-specific fast refcount implementation - and the commit
> specifically notes that it is faster than cmpxchg based code[1].
>
> There seems to be an ongoing effort to move over more and more subsystems
> from atomic_t to refcount_t(e.g.[2]), specifically because refcount_t on
> x86 is fast enough and you get some error checking atomic_t that does not
> have.

Correct, but there are also some cases that are only caught by
REFCOUNT_FULL.

> > At some point, the hardware needs to play ball. However...
>
> Even on a totally baller CPU, REFCOUNT_FULL is going to be slow :)
> On TX2, this specific benchmark just highlights the issue, but the
> difference is significant even on x86 (as noted above).

My point was more general than that. If you want scalable concurrent code,
then you end up having to move away from the serialisation introduced by
locking. The main trick in the toolbox is cmpxchg() so, in the absence of
a zoo of special-purpose atomic instructions, it really needs to do better
than serialising.

> > I was hoping we could use LDMIN/LDMAX to maintain the semantics of
> > REFCOUNT_FULL, but now that I think about it I can't see how we could keep
> > the arithmetic atomic in that case. Hmm.
>
> Do you think Ard's patch needs changes before it can be considered? I
> can take a look at that.

I would like to see how it performs if we keep the checking inline, yes.
I suspect Ard could spin this in short order.

> > Whatever we do, I prefer to keep REFCOUNT_FULL the default option for arm64,
> > so if we can't keep the semantics when we remove the cmpxchg, you'll need to
> > opt into this at config time.
>
> Only arm64 and arm selects REFCOUNT_FULL in the default config. So please
> reconsider this! This is going to slow down arm64 vs. other archs and it
> will become worse when more code adopts refcount_t.

Maybe, but faced with the choice between your micro-benchmark results and
security-by-default for people using the arm64 Linux kernel, I really think
that's a no-brainer. I'm well aware that not everybody agrees with me on
that.

Will

2019-06-14 10:25:37

by Ard Biesheuvel

[permalink] [raw]
Subject: Re: [RFC] Disable lockref on arm64

On Fri, 14 Jun 2019 at 11:58, Will Deacon <[email protected]> wrote:
>
> [+Kees]
>
> On Fri, Jun 14, 2019 at 07:09:26AM +0000, Jayachandran Chandrasekharan Nair wrote:
> > On Wed, Jun 12, 2019 at 10:31:53AM +0100, Will Deacon wrote:
> > > On Wed, Jun 12, 2019 at 04:10:20AM +0000, Jayachandran Chandrasekharan Nair wrote:
> > > > Now that the lockref change is mainline, I think we need to take another
> > > > look at this patch.
> > >
> > > Before we get too involved with this, I really don't want to start a trend of
> > > "let's try to rewrite all code using cmpxchg() in Linux because of TX2".
> >
> > x86 added a arch-specific fast refcount implementation - and the commit
> > specifically notes that it is faster than cmpxchg based code[1].
> >
> > There seems to be an ongoing effort to move over more and more subsystems
> > from atomic_t to refcount_t(e.g.[2]), specifically because refcount_t on
> > x86 is fast enough and you get some error checking atomic_t that does not
> > have.
>
> Correct, but there are also some cases that are only caught by
> REFCOUNT_FULL.
>

Yes, but do note that my arm64 implementation catches
increment-from-zero as well.

> > > At some point, the hardware needs to play ball. However...
> >
> > Even on a totally baller CPU, REFCOUNT_FULL is going to be slow :)
> > On TX2, this specific benchmark just highlights the issue, but the
> > difference is significant even on x86 (as noted above).
>
> My point was more general than that. If you want scalable concurrent code,
> then you end up having to move away from the serialisation introduced by
> locking. The main trick in the toolbox is cmpxchg() so, in the absence of
> a zoo of special-purpose atomic instructions, it really needs to do better
> than serialising.
>
> > > I was hoping we could use LDMIN/LDMAX to maintain the semantics of
> > > REFCOUNT_FULL, but now that I think about it I can't see how we could keep
> > > the arithmetic atomic in that case. Hmm.
> >
> > Do you think Ard's patch needs changes before it can be considered? I
> > can take a look at that.
>
> I would like to see how it performs if we keep the checking inline, yes.
> I suspect Ard could spin this in short order.
>

Moving the post checks before the stores you mean? That shouldn't be
too difficult, I suppose, but it will certainly cost performance.

> > > Whatever we do, I prefer to keep REFCOUNT_FULL the default option for arm64,
> > > so if we can't keep the semantics when we remove the cmpxchg, you'll need to
> > > opt into this at config time.
> >
> > Only arm64 and arm selects REFCOUNT_FULL in the default config. So please
> > reconsider this! This is going to slow down arm64 vs. other archs and it
> > will become worse when more code adopts refcount_t.
>
> Maybe, but faced with the choice between your micro-benchmark results and
> security-by-default for people using the arm64 Linux kernel, I really think
> that's a no-brainer. I'm well aware that not everybody agrees with me on
> that.

I think the question of whether the benchmark is valid is justified, but
otoh, we are obsessed with hackbench, which is not that representative
of a real workload either. It would be better to discuss these changes
in the context of known real-world use cases where refcounts are a
true bottleneck.

Also, I'd like to have Kees's view on the gap between REFCOUNT_FULL
and the fast version on arm64. I'm not convinced the cases we are not
covering are such a big deal.

2019-06-14 10:39:37

by Will Deacon

[permalink] [raw]
Subject: Re: [RFC] Disable lockref on arm64

Hi Ard,

On Fri, Jun 14, 2019 at 12:24:54PM +0200, Ard Biesheuvel wrote:
> On Fri, 14 Jun 2019 at 11:58, Will Deacon <[email protected]> wrote:
> > On Fri, Jun 14, 2019 at 07:09:26AM +0000, Jayachandran Chandrasekharan Nair wrote:
> > > x86 added a arch-specific fast refcount implementation - and the commit
> > > specifically notes that it is faster than cmpxchg based code[1].
> > >
> > > There seems to be an ongoing effort to move over more and more subsystems
> > > from atomic_t to refcount_t(e.g.[2]), specifically because refcount_t on
> > > x86 is fast enough and you get some error checking atomic_t that does not
> > > have.
> >
> > Correct, but there are also some cases that are only caught by
> > REFCOUNT_FULL.
> >
> Yes, but do note that my arm64 implementation catches
> increment-from-zero as well.

Ok, so it's just the silly racy cases that are problematic?

> > > Do you think Ard's patch needs changes before it can be considered? I
> > > can take a look at that.
> >
> > I would like to see how it performs if we keep the checking inline, yes.
> > I suspect Ard could spin this in short order.
>
> Moving the post checks before the stores you mean? That shouldn't be
> too difficult, I suppose, but it will certainly cost performance.

That's what I'd like to assess, since the major complaint seems to be the
use of cmpxchg() as opposed to inline branching.

> > > > Whatever we do, I prefer to keep REFCOUNT_FULL the default option for arm64,
> > > > so if we can't keep the semantics when we remove the cmpxchg, you'll need to
> > > > opt into this at config time.
> > >
> > > Only arm64 and arm selects REFCOUNT_FULL in the default config. So please
> > > reconsider this! This is going to slow down arm64 vs. other archs and it
> > > will become worse when more code adopts refcount_t.
> >
> > Maybe, but faced with the choice between your micro-benchmark results and
> > security-by-default for people using the arm64 Linux kernel, I really think
> > that's a no-brainer. I'm well aware that not everybody agrees with me on
> > that.
>
> I think the question whether the benchmark is valid is justified, but
> otoh, we are obsessed with hackbench which is not that representative
> of a real workload either. It would be better to discuss these changes
> in the context of known real-world use cases where refcounts are a
> true bottleneck.

I wasn't calling into question the validity of the benchmark (I really have
no clue about that), but rather that you can't have your cake and eat it.
Faced with the choice, I'd err on the security side because it's far easier
to explain to somebody that the default is full mitigation at a cost than it
is to explain why a partial mitigation is acceptable (and in the end it's
often subjective because people have different thresholds).

> Also, I'd like to have Kees's view on the gap between REFCOUNT_FULL
> and the fast version on arm64. I'm not convinced the cases we are not
> covering are such a big deal.

Fair enough, but if the conclusion is that it's not a big deal then we
should just remove REFCOUNT_FULL altogether, because it's the choice that
is the problem here.

Will

2019-06-15 04:24:21

by Kees Cook

[permalink] [raw]
Subject: Re: [RFC] Disable lockref on arm64

tl;dr: if arm/arm64 can catch overflow, untested dec-to-zero, and
inc-from-zero, while performing better than existing REFCOUNT_FULL,
it's a no-brainer to switch. Minimum parity to x86 would be to catch
overflow and untested dec-to-zero. Minimum viable protection would be to
catch overflow. LKDTM is your friend.

Details below...

On Fri, Jun 14, 2019 at 11:38:50AM +0100, Will Deacon wrote:
> On Fri, Jun 14, 2019 at 12:24:54PM +0200, Ard Biesheuvel wrote:
> > On Fri, 14 Jun 2019 at 11:58, Will Deacon <[email protected]> wrote:
> > > On Fri, Jun 14, 2019 at 07:09:26AM +0000, Jayachandran Chandrasekharan Nair wrote:
> > > > x86 added a arch-specific fast refcount implementation - and the commit
> > > > specifically notes that it is faster than cmpxchg based code[1].
> > > >
> > > > There seems to be an ongoing effort to move over more and more subsystems
> > > > from atomic_t to refcount_t(e.g.[2]), specifically because refcount_t on
> > > > x86 is fast enough and you get some error checking atomic_t that does not
> > > > have.

For clarity: the choices on x86 are: full or fast, where both catch
the condition that leads to use-after-free that can be unconditionally
mitigated (i.e. refcount overflow-wrapping to zero: the common missing
ref count decrement). The _underflow_ case (the less common missing ref
count increment) can be exploited but nothing can be done to mitigate
it. Only a later increment from zero can indicate that something went
wrong _in the past_.

There is not a way to build x86 without the overflow protection, and
that was matched on arm/arm64 by making REFCOUNT_FULL unconditionally
enabled. So, from the perspective of my take on weakening the protection
level, I'm totally fine if arm/arm64 falls back to a non-FULL
implementation as long as it catches the overflow case (which the prior
"fast" patches totally did).

> > > Correct, but there are also some cases that are only caught by
> > > REFCOUNT_FULL.
> > >
> > Yes, but do note that my arm64 implementation catches
> > increment-from-zero as well.

FWIW, the vast majority of bugs that refcount_t has found have been
inc-from-zero (the overflow case doesn't tend to ever get exercised,
but it's easy for syzkaller and other fuzzers to underflow when such a
path is found). And those are only found on REFCOUNT_FULL kernels
presently, so it'd be nice to have that case covered in the "fast"
arm/arm64 case too.

> Ok, so it's just the silly racy cases that are problematic?
>
> > > > Do you think Ard's patch needs changes before it can be considered? I
> > > > can take a look at that.
> > >
> > > I would like to see how it performs if we keep the checking inline, yes.
> > > I suspect Ard could spin this in short order.
> >
> > Moving the post checks before the stores you mean? That shouldn't be
> > too difficult, I suppose, but it will certainly cost performance.
>
> That's what I'd like to assess, since the major complaint seems to be the
> use of cmpxchg() as opposed to inline branching.
>
> > > > > Whatever we do, I prefer to keep REFCOUNT_FULL the default option for arm64,
> > > > > so if we can't keep the semantics when we remove the cmpxchg, you'll need to
> > > > > opt into this at config time.
> > > >
> > > > Only arm64 and arm selects REFCOUNT_FULL in the default config. So please
> > > > reconsider this! This is going to slow down arm64 vs. other archs and it
> > > > will become worse when more code adopts refcount_t.
> > >
> > > Maybe, but faced with the choice between your micro-benchmark results and
> > > security-by-default for people using the arm64 Linux kernel, I really think
> > > that's a no-brainer. I'm well aware that not everybody agrees with me on
> > > that.
> >
> > I think the question whether the benchmark is valid is justified, but
> > otoh, we are obsessed with hackbench which is not that representative
> > of a real workload either. It would be better to discuss these changes
> > in the context of known real-world use cases where refcounts are a
> > true bottleneck.
>
> I wasn't calling into question the validity of the benchmark (I really have
> no clue about that), but rather that you can't have your cake and eat it.
> Faced with the choice, I'd err on the security side because it's far easier
> to explain to somebody that the default is full mitigation at a cost than it
> is to explain why a partial mitigation is acceptable (and in the end it's
> often subjective because people have different thresholds).

I'm happy to call into question the validity of the benchmark though! ;)
Seriously, it came up repeatedly in the x86 port, where there was a
claim of "it's slower" (which is certainly objectively true: more cycles
are spent), but no one could present a real-world workload where the
difference was measurable.

> > Also, I'd like to have Kees's view on the gap between REFCOUNT_FULL
> > and the fast version on arm64. I'm not convinced the cases we are not
> > covering are such a big deal.
>
> Fair enough, but if the conclusion is that it's not a big deal then we
> should just remove REFCOUNT_FULL altogether, because it's the choice that
> is the problem here.

The coverage difference on x86 is that inc-from-zero is only caught in
the FULL case. Additionally there is the internal difference around how
"saturation" of the value happens. e.g. under FULL a count gets pinned
either to INT_MAX or to zero.
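
As a rough sketch of that difference (plain C, not the kernel code, and
simplified to take the would-be result as a 64-bit value):

#include <limits.h>

/* FULL-style pinning: clamp at the boundary that was about to be crossed. */
static int saturate_full(long long would_be)
{
	if (would_be > INT_MAX)
		return INT_MAX;		/* pinned at the top */
	if (would_be < 0)
		return 0;		/* pinned at zero */
	return (int)would_be;
}

/* x86-style saturation: anything that strays outside [0, INT_MAX] is
 * parked at INT_MIN / 2, far from both zero and the wrap point, so
 * racing increments/decrements cannot drag it back to a sane value. */
static int saturate_fast(long long would_be)
{
	if (would_be < 0 || would_be > INT_MAX)
		return INT_MIN / 2;
	return (int)would_be;
}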

Since the "fast" arm patch caught inc-from-zero, I would say sure
ditch FULL in favor of it (though check that "dec-to-zero" is caught:
i.e. _dec() hitting zero -- instead of dec_and_test() hitting zero). LKDTM
has extensive behavioral tests for refcount_t, so if the tests show the
same results before/after, go for it. :) Though note that the logic may
need tweaking depending on the saturation behavior: right now it expects
either FULL (INT_MAX/0 pinning) or the x86 saturation (INT_MIN / 2).

Note also that LKDTM has a refcount benchmark as well, in case you want
to measure the difference between atomic_t and refcount_t in the most
microbenchmark-y way possible. This is what was used for the numbers in
commit 7a46ec0e2f48 ("locking/refcounts, x86/asm: Implement fast
refcount overflow protection"):

2147483646 refcount_inc()s and 2147483647 refcount_dec_and_test()s:
                    cycles          protections
atomic_t            82249267387     none
refcount_t-fast     82211446892     overflow, untested dec-to-zero
refcount_t-full     144814735193    overflow, untested dec-to-zero, inc-from-zero

Also note that the x86 fast implementations adjusted memory ordering
slightly later on in commit 47b8f3ab9c49 ("refcount_t: Add ACQUIRE
ordering on success for dec(sub)_and_test() variants").

--
Kees Cook

2019-06-15 08:49:34

by Ard Biesheuvel

[permalink] [raw]
Subject: Re: [RFC] Disable lockref on arm64

On Sat, 15 Jun 2019 at 06:21, Kees Cook <[email protected]> wrote:
>
> tl;dr: if arm/arm64 can catch overflow, untested dec-to-zero, and
> inc-from-zero, while performing better than existing REFCOUNT_FULL,
> it's a no-brainer to switch. Minimum parity to x86 would be to catch
> overflow and untested dec-to-zero. Minimum viable protection would be to
> catch overflow. LKDTM is your friend.
>
> Details below...
>
> On Fri, Jun 14, 2019 at 11:38:50AM +0100, Will Deacon wrote:
> > On Fri, Jun 14, 2019 at 12:24:54PM +0200, Ard Biesheuvel wrote:
> > > On Fri, 14 Jun 2019 at 11:58, Will Deacon <[email protected]> wrote:
> > > > On Fri, Jun 14, 2019 at 07:09:26AM +0000, Jayachandran Chandrasekharan Nair wrote:
> > > > > x86 added a arch-specific fast refcount implementation - and the commit
> > > > > specifically notes that it is faster than cmpxchg based code[1].
> > > > >
> > > > > There seems to be an ongoing effort to move over more and more subsystems
> > > > > from atomic_t to refcount_t(e.g.[2]), specifically because refcount_t on
> > > > > x86 is fast enough and you get some error checking atomic_t that does not
> > > > > have.
>
> For clarity: the choices on x86 are: full or fast, where both catch
> the condition that leads to use-after-free that can be unconditionally
> mitigated (i.e. refcount overflow-wrapping to zero: the common missing
> ref count decrement). The _underflow_ case (the less common missing ref
> count increment) can be exploited but nothing can be done to mitigate
> it. Only a later increment from zero can indicate that something went
> wrong _in the past_.
>
> There is not a way to build x86 without the overflow protection, and
> that was matched on arm/arm64 by making REFCOUNT_FULL unconditionally
> enabled. So, from the perspective of my take on weakening the protection
> level, I'm totally fine if arm/arm64 falls back to a non-FULL
> implementation as long as it catches the overflow case (which the prior
> "fast" patches totally did).
>
> > > > Correct, but there are also some cases that are only caught by
> > > > REFCOUNT_FULL.
> > > >
> > > Yes, but do note that my arm64 implementation catches
> > > increment-from-zero as well.
>
> FWIW, the vast majority of bugs that refcount_t has found has been
> inc-from-zero (the overflow case doesn't tend to ever get exercised,
> but it's easy for syzkaller and other fuzzers to underflow when such a
> path is found). And those are only found on REFCOUNT_FULL kernels
> presently, so it'd be nice to have that case covered in the "fast"
> arm/arm64 case too.
>
> > Ok, so it's just the silly racy cases that are problematic?
> >
> > > > > Do you think Ard's patch needs changes before it can be considered? I
> > > > > can take a look at that.
> > > >
> > > > I would like to see how it performs if we keep the checking inline, yes.
> > > > I suspect Ard could spin this in short order.
> > >
> > > Moving the post checks before the stores you mean? That shouldn't be
> > > too difficult, I suppose, but it will certainly cost performance.
> >
> > That's what I'd like to assess, since the major complaint seems to be the
> > use of cmpxchg() as opposed to inline branching.
> >
> > > > > > Whatever we do, I prefer to keep REFCOUNT_FULL the default option for arm64,
> > > > > > so if we can't keep the semantics when we remove the cmpxchg, you'll need to
> > > > > > opt into this at config time.
> > > > >
> > > > > Only arm64 and arm selects REFCOUNT_FULL in the default config. So please
> > > > > reconsider this! This is going to slow down arm64 vs. other archs and it
> > > > > will become worse when more code adopts refcount_t.
> > > >
> > > > Maybe, but faced with the choice between your micro-benchmark results and
> > > > security-by-default for people using the arm64 Linux kernel, I really think
> > > > that's a no-brainer. I'm well aware that not everybody agrees with me on
> > > > that.
> > >
> > > I think the question whether the benchmark is valid is justified, but
> > > otoh, we are obsessed with hackbench which is not that representative
> > > of a real workload either. It would be better to discuss these changes
> > > in the context of known real-world use cases where refcounts are a
> > > true bottleneck.
> >
> > I wasn't calling into question the validity of the benchmark (I really have
> > no clue about that), but rather that you can't have your cake and eat it.
> > Faced with the choice, I'd err on the security side because it's far easier
> > to explain to somebody that the default is full mitigation at a cost than it
> > is to explain why a partial mitigation is acceptable (and in the end it's
> > often subjective because people have different thresholds).
>
> I'm happy to call into question the validity of the benchmark though! ;)
> Seriously, it came up repeatedly in the x86 port, where there was a
> claim of "it's slower" (which is certainly objectively true: more cycles
> are spent), but no one could present a real-world workload where the
> difference was measurable.
>
> > > Also, I'd like to have Kees's view on the gap between REFCOUNT_FULL
> > > and the fast version on arm64. I'm not convinced the cases we are not
> > > covering are such a big deal.
> >
> > Fair enough, but if the conclusion is that it's not a big deal then we
> > should just remove REFCOUNT_FULL altogether, because it's the choice that
> > is the problem here.
>
> The coverage difference on x86 is that inc-from-zero is only caught in
> the FULL case. Additionally there is the internal difference around how
> "saturation" of the value happens. e.g. under FULL a count gets pinned
> either to INT_MAX or to zero.
>
> Since the "fast" arm patch caught inc-from-zero, I would say sure
> ditch FULL in favor of it (though check that "dec-to-zero" is caught:
> i.e. _dec() hitting zero -- instead of dec_and_test() hitting zero). LKDTM
> has extensive behavioral tests for refcount_t, so if the tests show the
> same results before/after, go for it. :) Though note that the logic may
> need tweaking depending on the saturation behavior: right now it expects
> either FULL (INT_MAX/0 pinning) or the x86 saturation (INT_MIN / 2).
>
> Note also that LKDTM has a refcount benchmark as well, in case you want
> to measure the difference between atomic_t and refcount_t in the most
> microbenchmark-y way possible. This is what was used for the numbers in
> commit 7a46ec0e2f48 ("locking/refcounts, x86/asm: Implement fast
> refcount overflow protection"):
>
> 2147483646 refcount_inc()s and 2147483647 refcount_dec_and_test()s:
> cycles protections
> atomic_t 82249267387 none
> refcount_t-fast 82211446892 overflow, untested dec-to-zero
> refcount_t-full 144814735193 overflow, untested dec-to-zero, inc-from-zero
>
> Also note that the x86 fast implementations adjusted memory ordering
> slightly later on in commit 47b8f3ab9c49 ("refcount_t: Add ACQUIRE
> ordering on success for dec(sub)_and_test() variants").
>

Thanks Kees.

That acquire ordering patch appears to carry over cleanly. So the
remaining question Will had was whether it makes sense to do the
condition checks before doing the actual store, to avoid having a time
window where the refcount assumes an illegal value. Since arm64 does
not have memory operands, the instruction count wouldn't change, but
it will definitely result in a performance hit on out-of-order CPUs.

2019-06-15 14:01:33

by Kees Cook

[permalink] [raw]
Subject: Re: [RFC] Disable lockref on arm64

On Sat, Jun 15, 2019 at 10:47:19AM +0200, Ard Biesheuvel wrote:
> remaining question Will had was whether it makes sense to do the
> condition checks before doing the actual store, to avoid having a time
> window where the refcount assumes its illegal value. Since arm64 does
> not have memory operands, the instruction count wouldn't change, but
> it will definitely result in a performance hit on out-of-order CPUs.

What do the races end up looking like? Is it possible to have two
threads ordered in a way that a second thread could _un_saturate a
counter?

CPU 1                         CPU 2
inc()
  load INT_MAX-1
  about to overflow?
  yes
                              dec()
                                load INT_MAX-1
  set to INT_MAX
                                set to INT_MAX-2

Or would you use the same INT_MIN/2 saturation point done on x86?

As for performance, it should be easy to measure with the LKDTM test
to find out exactly what the differences are.

--
Kees Cook

2019-06-15 14:19:22

by Ard Biesheuvel

[permalink] [raw]
Subject: Re: [RFC] Disable lockref on arm64

On Sat, 15 Jun 2019 at 15:59, Kees Cook <[email protected]> wrote:
>
> On Sat, Jun 15, 2019 at 10:47:19AM +0200, Ard Biesheuvel wrote:
> > remaining question Will had was whether it makes sense to do the
> > condition checks before doing the actual store, to avoid having a time
> > window where the refcount assumes its illegal value. Since arm64 does
> > not have memory operands, the instruction count wouldn't change, but
> > it will definitely result in a performance hit on out-of-order CPUs.
>
> What do the races end up looking like? Is it possible to have two
> threads ordered in a way that a second thread could _un_saturate a
> counter?
>
> CPU 1 CPU 2
> inc()
> load INT_MAX-1
> about to overflow?
> yes
> dec()
> load INT_MAX-1
> set to INT_MAX
> set to INT_MAX-2
>
> Or would you use the same INT_MIN/2 saturation point done on x86?
>

Yes, I am using the same saturation point as x86. In this example, I
am not entirely sure I understand why it matters, though: the atomics
guarantee that the write by CPU2 fails if CPU1 changed the value in
the mean time, regardless of which value it wrote.

I think the concern is more related to the likelihood of another CPU
doing something nasty between the moment that the refcount overflows
and the moment that the handler pins it at INT_MIN/2, e.g.,

> CPU 1                         CPU 2
> inc()
>   load INT_MAX
>   about to overflow?
>   yes
>
>   set to 0
>                               <insert exploit here>
>   set to INT_MIN/2


> As for performance, it should be easy to measure with the LKDTM test
> to find out exactly the differences.
>

Yes, I intend to look into this on Monday.

2019-06-16 21:32:35

by Kees Cook

[permalink] [raw]
Subject: Re: [RFC] Disable lockref on arm64

On Sat, Jun 15, 2019 at 04:18:21PM +0200, Ard Biesheuvel wrote:
> Yes, I am using the same saturation point as x86. In this example, I
> am not entirely sure I understand why it matters, though: the atomics
> guarantee that the write by CPU2 fails if CPU1 changed the value in
> the mean time, regardless of which value it wrote.
>
> I think the concern is more related to the likelihood of another CPU
> doing something nasty between the moment that the refcount overflows
> and the moment that the handler pins it at INT_MIN/2, e.g.,
>
> > CPU 1 CPU 2
> > inc()
> > load INT_MAX
> > about to overflow?
> > yes
> >
> > set to 0
> > <insert exploit here>
> > set to INT_MIN/2

Ah, gotcha, but the "set to 0" is really "set to INT_MAX+1" (not zero)
if you're using the same saturation.

--
Kees Cook

2019-06-17 11:35:22

by Ard Biesheuvel

[permalink] [raw]
Subject: Re: [RFC] Disable lockref on arm64

On Sun, 16 Jun 2019 at 23:31, Kees Cook <[email protected]> wrote:
>
> On Sat, Jun 15, 2019 at 04:18:21PM +0200, Ard Biesheuvel wrote:
> > Yes, I am using the same saturation point as x86. In this example, I
> > am not entirely sure I understand why it matters, though: the atomics
> > guarantee that the write by CPU2 fails if CPU1 changed the value in
> > the mean time, regardless of which value it wrote.
> >
> > I think the concern is more related to the likelihood of another CPU
> > doing something nasty between the moment that the refcount overflows
> > and the moment that the handler pins it at INT_MIN/2, e.g.,
> >
> > > CPU 1 CPU 2
> > > inc()
> > > load INT_MAX
> > > about to overflow?
> > > yes
> > >
> > > set to 0
> > > <insert exploit here>
> > > set to INT_MIN/2
>
> Ah, gotcha, but the "set to 0" is really "set to INT_MAX+1" (not zero)
> if you're using the same saturation.
>

Of course. So there is no issue here: whatever manipulations are
racing with the overflow handler can never cause the counter to
unsaturate.

And actually, moving the checks before the stores is not as trivial as
I thought. E.g., for the LSE refcount_add case, we have

" ldadd %w[i], w30, %[cval]\n" \
" adds %w[i], %w[i], w30\n" \
REFCOUNT_PRE_CHECK_ ## pre (w30)) \
REFCOUNT_POST_CHECK_ ## post \

and changing this into load/test/store defeats the purpose of using
the LSE atomics in the first place.
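
In other words, the pre-checked variant ends up looking roughly like the
following (a C-level sketch using C11 atomics, not the actual macro), i.e.
a CAS loop again instead of a single LDADD:

#include <limits.h>
#include <stdatomic.h>

/* Pre-checked add (assumes i > 0): the result must be inspected before
 * it becomes visible, so the store can only be done via compare-and-swap. */
static void refcount_add_prechecked(atomic_int *r, int i)
{
	int old = atomic_load_explicit(r, memory_order_relaxed);
	int new;

	do {
		if (old < 0 || old > INT_MAX - i)
			new = INT_MIN / 2;	/* only the pinned value is ever stored */
		else
			new = old + i;
	} while (!atomic_compare_exchange_weak(r, &old, new));
}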

On my single core TX2, the comparative performance is as follows:

Baseline: REFCOUNT_TIMING test using REFCOUNT_FULL (LSE cmpxchg)
    191057942484  cycles        # 2.207 GHz
    148447589402  instructions  # 0.78 insn per cycle

    86.568269904 seconds time elapsed

Upper bound: ATOMIC_TIMING
    116252672661  cycles        # 2.207 GHz
     28089216452  instructions  # 0.24 insn per cycle

    52.689793525 seconds time elapsed

REFCOUNT_TIMING test using LSE atomics
    127060259162  cycles        # 2.207 GHz
               0  instructions  # 0.00 insn per cycle

    57.243690077 seconds time elapsed

2019-06-17 18:30:36

by Will Deacon

[permalink] [raw]
Subject: Re: [RFC] Disable lockref on arm64

On Mon, Jun 17, 2019 at 01:33:19PM +0200, Ard Biesheuvel wrote:
> On Sun, 16 Jun 2019 at 23:31, Kees Cook <[email protected]> wrote:
> > On Sat, Jun 15, 2019 at 04:18:21PM +0200, Ard Biesheuvel wrote:
> > > Yes, I am using the same saturation point as x86. In this example, I
> > > am not entirely sure I understand why it matters, though: the atomics
> > > guarantee that the write by CPU2 fails if CPU1 changed the value in
> > > the mean time, regardless of which value it wrote.
> > >
> > > I think the concern is more related to the likelihood of another CPU
> > > doing something nasty between the moment that the refcount overflows
> > > and the moment that the handler pins it at INT_MIN/2, e.g.,
> > >
> > > > CPU 1 CPU 2
> > > > inc()
> > > > load INT_MAX
> > > > about to overflow?
> > > > yes
> > > >
> > > > set to 0
> > > > <insert exploit here>
> > > > set to INT_MIN/2
> >
> > Ah, gotcha, but the "set to 0" is really "set to INT_MAX+1" (not zero)
> > if you're using the same saturation.
> >
>
> Of course. So there is no issue here: whatever manipulations are
> racing with the overflow handler can never result in the counter to
> unsaturate.
>
> And actually, moving the checks before the stores is not as trivial as
> I thought, E.g., for the LSE refcount_add case, we have
>
> " ldadd %w[i], w30, %[cval]\n" \
> " adds %w[i], %w[i], w30\n" \
> REFCOUNT_PRE_CHECK_ ## pre (w30)) \
> REFCOUNT_POST_CHECK_ ## post \
>
> and changing this into load/test/store defeats the purpose of using
> the LSE atomics in the first place.
>
> On my single core TX2, the comparative performance is as follows
>
> Baseline: REFCOUNT_TIMING test using REFCOUNT_FULL (LSE cmpxchg)
> 191057942484 cycles # 2.207 GHz
> 148447589402 instructions # 0.78 insn per
> cycle
>
> 86.568269904 seconds time elapsed
>
> Upper bound: ATOMIC_TIMING
> 116252672661 cycles # 2.207 GHz
> 28089216452 instructions # 0.24 insn per
> cycle
>
> 52.689793525 seconds time elapsed
>
> REFCOUNT_TIMING test using LSE atomics
> 127060259162 cycles # 2.207 GHz

Ok, so assuming JC's complaint is valid, then these numbers are compelling.
In particular, my understanding of this thread is that your optimised
implementation doesn't actually sacrifice any precision; it just changes
the saturation behaviour in a way that has no material impact. Kees, is that
right?

If so, I'm not against having this for arm64, with the premise that we can
hide the REFCOUNT_FULL option entirely given that it would only serve to
confuse if exposed.

Will

by Jayachandran Chandrasekharan Nair

[permalink] [raw]
Subject: Re: [RFC] Disable lockref on arm64

On Mon, Jun 17, 2019 at 06:26:20PM +0100, Will Deacon wrote:
> On Mon, Jun 17, 2019 at 01:33:19PM +0200, Ard Biesheuvel wrote:
> > On Sun, 16 Jun 2019 at 23:31, Kees Cook <[email protected]> wrote:
> > > On Sat, Jun 15, 2019 at 04:18:21PM +0200, Ard Biesheuvel wrote:
> > > > Yes, I am using the same saturation point as x86. In this example, I
> > > > am not entirely sure I understand why it matters, though: the atomics
> > > > guarantee that the write by CPU2 fails if CPU1 changed the value in
> > > > the mean time, regardless of which value it wrote.
> > > >
> > > > I think the concern is more related to the likelihood of another CPU
> > > > doing something nasty between the moment that the refcount overflows
> > > > and the moment that the handler pins it at INT_MIN/2, e.g.,
> > > >
> > > > > CPU 1 CPU 2
> > > > > inc()
> > > > > load INT_MAX
> > > > > about to overflow?
> > > > > yes
> > > > >
> > > > > set to 0
> > > > > <insert exploit here>
> > > > > set to INT_MIN/2
> > >
> > > Ah, gotcha, but the "set to 0" is really "set to INT_MAX+1" (not zero)
> > > if you're using the same saturation.
> > >
> >
> > Of course. So there is no issue here: whatever manipulations are
> > racing with the overflow handler can never result in the counter to
> > unsaturate.
> >
> > And actually, moving the checks before the stores is not as trivial as
> > I thought, E.g., for the LSE refcount_add case, we have
> >
> > " ldadd %w[i], w30, %[cval]\n" \
> > " adds %w[i], %w[i], w30\n" \
> > REFCOUNT_PRE_CHECK_ ## pre (w30)) \
> > REFCOUNT_POST_CHECK_ ## post \
> >
> > and changing this into load/test/store defeats the purpose of using
> > the LSE atomics in the first place.
> >
> > On my single core TX2, the comparative performance is as follows
> >
> > Baseline: REFCOUNT_TIMING test using REFCOUNT_FULL (LSE cmpxchg)
> > 191057942484 cycles # 2.207 GHz
> > 148447589402 instructions # 0.78 insn per
> > cycle
> >
> > 86.568269904 seconds time elapsed
> >
> > Upper bound: ATOMIC_TIMING
> > 116252672661 cycles # 2.207 GHz
> > 28089216452 instructions # 0.24 insn per
> > cycle
> >
> > 52.689793525 seconds time elapsed
> >
> > REFCOUNT_TIMING test using LSE atomics
> > 127060259162 cycles # 2.207 GHz
>
> Ok, so assuming JC's complaint is valid, then these numbers are compelling.

Let me try to point out the issues I see again, to make sure that we are
not talking just about micro-benchmarks.

The first issue: on arm64, REFCOUNT_FULL is 'select'-ed. There is
no option to go to the atomics implementation or an x86-like compromise
implementation without patching the kernel. Currently we are stuck with
a function call for what should be a single atomic instruction.

The second issue is that REFCOUNT_FULL uses an unbounded CAS loop, which is
problematic when the core count increases and when there is contention.
Up to some level of contention, the CAS loop works fine. But when we
go to the order of a hundred CPUs this becomes an issue. The LDADD
series of atomics can be handled fairly well by hardware even with
heavy contention, but in the case of CAS (or LDXR/STXR) loops, getting it
right in hardware is much more difficult. There is nothing in the
arm64 ISA to manage this correctly. As discussed earlier in the thread,
WFE does not work, YIELD is troublesome, and there is no 'wait in low
power for exclusive access' hint instruction. This is not a TX2-specific
issue.
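
As a rough illustration of the two access patterns (C11 atomics, not the
kernel code):

#include <stdatomic.h>

/* Unbounded CAS retry loop: every failed attempt is wasted work, and the
 * failure rate grows with the number of CPUs hammering the same line. */
static void inc_cas_loop(atomic_uint *r)
{
	unsigned int old = atomic_load_explicit(r, memory_order_relaxed);

	while (!atomic_compare_exchange_weak(r, &old, old + 1))
		;	/* 'old' is refreshed on each failure, then we retry */
}

/* Single fetch-add: maps to one LDADD when LSE atomics are available,
 * so the hardware arbitrates the contention and forward progress is
 * guaranteed. */
static void inc_fetch_add(atomic_uint *r)
{
	atomic_fetch_add_explicit(r, 1, memory_order_relaxed);
}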

The testcase I provided was not really a microbenchmark. It was a
simplified webserver testcase where multiple threads read a small file
in parallel. With the Ubuntu configuration (AppArmor enabled) and when
other things line up (I had made the file & dir non-writable), you
can see that refcount is the top function. I expect this kind of
situation to become more frequent as more subsystems move to refcount_t.

> In particular, my understanding of this thread is that your optimised
> implementation doesn't actually sacrifice any precision; it just changes
> the saturation behaviour in a way that has no material impact. Kees, is that
> right?
>
> If so, I'm not against having this for arm64, with the premise that we can
> hide the REFCOUNT_FULL option entirely given that it would only serve to
> confuse if exposed.

Thanks for looking into this! From the discussion it seems likely
that we can get a version of Ard's patch in, one which does not have a
CAS loop in most cases.

JC

2019-06-18 07:34:16

by Kees Cook

[permalink] [raw]
Subject: Re: [RFC] Disable lockref on arm64

On Mon, Jun 17, 2019 at 06:26:20PM +0100, Will Deacon wrote:
> On Mon, Jun 17, 2019 at 01:33:19PM +0200, Ard Biesheuvel wrote:
> > On my single core TX2, the comparative performance is as follows
> >
> > Baseline: REFCOUNT_TIMING test using REFCOUNT_FULL (LSE cmpxchg)
> > 191057942484 cycles # 2.207 GHz
> > 148447589402 instructions # 0.78 insn per
> > cycle
> >
> > 86.568269904 seconds time elapsed
> >
> > Upper bound: ATOMIC_TIMING
> > 116252672661 cycles # 2.207 GHz
> > 28089216452 instructions # 0.24 insn per
> > cycle
> >
> > 52.689793525 seconds time elapsed
> >
> > REFCOUNT_TIMING test using LSE atomics
> > 127060259162 cycles # 2.207 GHz
>
> Ok, so assuming JC's complaint is valid, then these numbers are compelling.
> In particular, my understanding of this thread is that your optimised
> implementation doesn't actually sacrifice any precision; it just changes
> the saturation behaviour in a way that has no material impact. Kees, is that
> right?

That is my understanding, yes. There is no loss of detection precision.
But for clarity, I should point out it has one behavioral change, which is
the same change as on x86: the counter is now effectively a 31-bit counter,
not a 32-bit counter, as the sign bit is being used for saturation.

> If so, I'm not against having this for arm64, with the premise that we can
> hide the REFCOUNT_FULL option entirely given that it would only serve to
> confuse if exposed.

If the LSE atomics version has overflow, dec-to-zero, and inc-from-zero
protections, then as far as I'm concerned, REFCOUNT_FULL doesn't need
to exist for arm64. On the Kconfig front, as long as there isn't a way
to revert refcount_t to atomic_t, I'm happy. :)

--
Kees Cook