2022-05-25 19:11:13

by Uros Bizjak

Subject: [PATCH 1/2] locking/lockref: Use try_cmpxchg64 in CMPXCHG_LOOP macro

Use try_cmpxchg64 instead of cmpxchg64 in the CMPXCHG_LOOP macro.
The x86 CMPXCHG instruction returns success in the ZF flag, so this
change saves a compare after the cmpxchg (and the related move
instruction in front of the cmpxchg). The main loop of lockref_get
improves from:

13: 48 89 c1 mov %rax,%rcx
16: 48 c1 f9 20 sar $0x20,%rcx
1a: 83 c1 01 add $0x1,%ecx
1d: 48 89 ce mov %rcx,%rsi
20: 89 c1 mov %eax,%ecx
22: 48 89 d0 mov %rdx,%rax
25: 48 c1 e6 20 shl $0x20,%rsi
29: 48 09 f1 or %rsi,%rcx
2c: f0 48 0f b1 4d 00 lock cmpxchg %rcx,0x0(%rbp)
32: 48 39 d0 cmp %rdx,%rax
35: 75 17 jne 4e <lockref_get+0x4e>

to:

13: 48 89 ca mov %rcx,%rdx
16: 48 c1 fa 20 sar $0x20,%rdx
1a: 83 c2 01 add $0x1,%edx
1d: 48 89 d6 mov %rdx,%rsi
20: 89 ca mov %ecx,%edx
22: 48 c1 e6 20 shl $0x20,%rsi
26: 48 09 f2 or %rsi,%rdx
29: f0 48 0f b1 55 00 lock cmpxchg %rdx,0x0(%rbp)
2f: 75 02 jne 33 <lockref_get+0x33>
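
At the C level, the difference boils down to the sketch below
(illustrative kernel-style code with made-up helper names, not the
actual CMPXCHG_LOOP macro; the real change is in the diff that
follows):

/* Old pattern: keep a copy of the expected value and compare the
 * value returned by cmpxchg against it. */
static void inc_old(u64 *ptr)
{
        u64 old = READ_ONCE(*ptr);

        for (;;) {
                u64 prev = old;

                old = cmpxchg64_relaxed(ptr, old, old + 1);
                if (old == prev)        /* the compare this patch removes */
                        break;
        }
}

/* New pattern: try_cmpxchg64 reports success directly (ZF on x86) and,
 * on failure, writes the value it observed back into 'old'. */
static void inc_new(u64 *ptr)
{
        u64 old = READ_ONCE(*ptr);

        while (!try_cmpxchg64_relaxed(ptr, &old, old + 1))
                ;       /* 'old' was refreshed by the failed attempt */
}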

Signed-off-by: Uros Bizjak <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: [email protected]
Cc: [email protected]
---
lib/lockref.c | 9 ++++-----
1 file changed, 4 insertions(+), 5 deletions(-)

diff --git a/lib/lockref.c b/lib/lockref.c
index 5b34bbd3eba8..c6f0b183b937 100644
--- a/lib/lockref.c
+++ b/lib/lockref.c
@@ -14,12 +14,11 @@
BUILD_BUG_ON(sizeof(old) != 8); \
old.lock_count = READ_ONCE(lockref->lock_count); \
while (likely(arch_spin_value_unlocked(old.lock.rlock.raw_lock))) { \
- struct lockref new = old, prev = old; \
+ struct lockref new = old; \
CODE \
- old.lock_count = cmpxchg64_relaxed(&lockref->lock_count, \
- old.lock_count, \
- new.lock_count); \
- if (likely(old.lock_count == prev.lock_count)) { \
+ if (likely(try_cmpxchg64_relaxed(&lockref->lock_count, \
+ &old.lock_count, \
+ new.lock_count))) { \
SUCCESS; \
} \
if (!--retry) \
--
2.35.3
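
For context, the patched macro is used roughly like this in
lib/lockref.c (quoted from memory, so treat it as approximate rather
than part of the diff): CODE is the speculative update, SUCCESS is the
fast-path exit, and the spinlock is only taken once the cmpxchg loop
gives up after its retry budget.

void lockref_get(struct lockref *lockref)
{
        CMPXCHG_LOOP(
                new.count++;
        ,
                return;
        );

        spin_lock(&lockref->lock);
        lockref->count++;
        spin_unlock(&lockref->lock);
}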



2022-05-26 14:41:07

by Michael Ellerman

Subject: Re: [PATCH 1/2] locking/lockref: Use try_cmpxchg64 in CMPXCHG_LOOP macro

Linus Torvalds <[email protected]> writes:
> On Wed, May 25, 2022 at 7:40 AM Uros Bizjak <[email protected]> wrote:
>>
>> Use try_cmpxchg64 instead of cmpxchg64 in CMPXCHG_LOOP macro.
>> x86 CMPXCHG instruction returns success in ZF flag, so this
>> change saves a compare after cmpxchg (and related move instruction
>> in front of cmpxchg). The main loop of lockref_get improves from:
>
> Ack on this one regardless of the 32-bit x86 question.
>
> HOWEVER.
>
> I'd like other architectures to pipe up too, because I think right now
> x86 is the only one that implements that "arch_try_cmpxchg()" family
> of operations natively, and I think the generic fallback for when it
> is missing might be kind of nasty.
>
> Maybe it ends up generating ok code, but it's also possible that it
> just didn't matter when it was only used in one place in the
> scheduler.

This patch seems to generate slightly *better* code on powerpc.

I see one register-to-register move that gets shifted slightly later, so
that it's skipped on the path that returns directly via the SUCCESS
case.

So LGTM.

> The lockref_get() case can be quite hot under some loads, it would be
> sad if this made other architectures worse.

Do you know of a benchmark that shows it up? I tried a few things but
couldn't get lockref_get() to count for more than 1-2%.

cheers

2022-05-27 11:49:59

by Linus Torvalds

Subject: Re: [PATCH 1/2] locking/lockref: Use try_cmpxchg64 in CMPXCHG_LOOP macro

On Thu, May 26, 2022 at 5:15 AM Michael Ellerman <[email protected]> wrote:
>
> Do you know of a benchmark that shows it up? I tried a few things but
> couldn't get lockref_get() to count for more than 1-2%.

Heh. 1% for a small instruction sequence that is only a handful of
instructions and is used in just a couple of places counts as "very
hot" for me.

I assume you did the natural thing: threaded pathname lookup (with
paths being of the longer variety to not be dominated by system call
etc costs).
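
A hypothetical sketch of that kind of test (path depth, thread count
and iteration count picked arbitrarily; the directory chain has to
exist beforehand, e.g. created with mkdir -p):

/* Build with something like: gcc -O2 -pthread lookup-bench.c */
#include <pthread.h>
#include <sys/stat.h>

#define NTHREADS        8
#define ITERS           1000000

/* Longish path so that path walk dominates over syscall entry/exit. */
static const char *path = "a/b/c/d/e/f/g/h/i/j/k/l/m/n/o/p/file";

static void *worker(void *arg)
{
        struct stat st;

        (void)arg;
        for (long i = 0; i < ITERS; i++)
                stat(path, &st);        /* each component lookup hits the dcache */
        return NULL;
}

int main(void)
{
        pthread_t tid[NTHREADS];

        for (int i = 0; i < NTHREADS; i++)
                pthread_create(&tid[i], NULL, worker, NULL);
        for (int i = 0; i < NTHREADS; i++)
                pthread_join(tid[i], NULL);
        return 0;
}

Run it under perf record and lockref_get() should show up in the
path-walk profile, though how much it counts for will depend on the
machine and the path length.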

Linus

2022-05-27 18:55:15

by Uros Bizjak

Subject: Re: [PATCH 1/2] locking/lockref: Use try_cmpxchg64 in CMPXCHG_LOOP macro

On Wed, May 25, 2022 at 6:48 PM Linus Torvalds
<[email protected]> wrote:
>
> On Wed, May 25, 2022 at 7:40 AM Uros Bizjak <[email protected]> wrote:
> >
> > Use try_cmpxchg64 instead of cmpxchg64 in CMPXCHG_LOOP macro.
> > x86 CMPXCHG instruction returns success in ZF flag, so this
> > change saves a compare after cmpxchg (and related move instruction
> > in front of cmpxchg). The main loop of lockref_get improves from:
>
> Ack on this one regardless of the 32-bit x86 question.
>
> HOWEVER.
>
> I'd like other architectures to pipe up too, because I think right now
> x86 is the only one that implements that "arch_try_cmpxchg()" family
> of operations natively, and I think the generic fallback for when it
> is missing might be kind of nasty.
>
> Maybe it ends up generating ok code, but it's also possible that it
> just didn't matter when it was only used in one place in the
> scheduler.
>
> The lockref_get() case can be quite hot under some loads, it would be
> sad if this made other architectures worse.
>
> Anyway, maybe that try_cmpxchg() fallback is fine, and works out well
> on architectures that use load-locked / store-conditional as-is.

Please find attached the testcase that analyses the various
CMPXCHG_LOOPs. It contains the old, the fallback and the new cmpxchg
loop, together with the corresponding lockref_get_* functions.

The testcase models the x86 cmpxchg8b and can be compiled for 64-bit
as well as 32-bit targets. As can be seen from the experiment, the
try_cmpxchg fallback creates EXACTLY THE SAME code for the 64-bit
target as the unpatched code. For the 32-bit target, one extra dead
reg-reg 32-bit move remains in the generated fallback assembly (a
compiler artefact of gcc-10.3 handling double-word 64-bit moves on the
x86_32 target).
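
For readers without the attachment, the model is roughly of the shape
below (an illustrative stand-in built on the GCC __atomic builtins,
not the attached lockref-test.c itself; the *_model helpers and the
lockref_get_* names here are made up). Compile with -m64 and with
-m32 -march=i686 and compare the generated assembly:

#include <stdint.h>
#include <stdbool.h>

struct lockref {
        union {
                uint64_t lock_count;
                struct {
                        uint32_t lock;          /* models the (unlocked) spinlock word */
                        uint32_t count;
                };
        };
};

/* cmpxchg64_relaxed()-style helper: returns the value found in memory. */
static inline uint64_t cmpxchg64_model(uint64_t *ptr, uint64_t old, uint64_t new)
{
        __atomic_compare_exchange_n(ptr, &old, new, false,
                                    __ATOMIC_RELAXED, __ATOMIC_RELAXED);
        return old;
}

/* try_cmpxchg64_relaxed()-style helper: returns success, updates *old on failure. */
static inline bool try_cmpxchg64_model(uint64_t *ptr, uint64_t *old, uint64_t new)
{
        return __atomic_compare_exchange_n(ptr, old, new, false,
                                           __ATOMIC_RELAXED, __ATOMIC_RELAXED);
}

/* Old CMPXCHG_LOOP body: compare the returned value against a saved copy. */
void lockref_get_cmpxchg(struct lockref *lockref)
{
        struct lockref old, new;

        old.lock_count = __atomic_load_n(&lockref->lock_count, __ATOMIC_RELAXED);
        for (;;) {
                uint64_t prev = old.lock_count;

                new = old;
                new.count++;
                old.lock_count = cmpxchg64_model(&lockref->lock_count,
                                                 old.lock_count, new.lock_count);
                if (old.lock_count == prev)
                        return;                 /* SUCCESS */
        }
}

/* New CMPXCHG_LOOP body: the CAS itself reports success. */
void lockref_get_try_cmpxchg(struct lockref *lockref)
{
        struct lockref old, new;

        old.lock_count = __atomic_load_n(&lockref->lock_count, __ATOMIC_RELAXED);
        for (;;) {
                new = old;
                new.count++;
                if (try_cmpxchg64_model(&lockref->lock_count,
                                        &old.lock_count, new.lock_count))
                        return;                 /* SUCCESS */
        }
}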

From the above experiment, we can conclude that the patched lockref.c
compiles to the same code with the try_cmpxchg fallback as the
original code. I expect the same to hold for other targets, too.

When the new code involving try_cmpxchg is compiled, impressive size
gains for x86_32 can be seen: the main loop shrinks from 44 bytes to
30 bytes.

Several transitions from cmpxchg to try_cmpxchg can already be found
in the git repository. The above experiment confirms that the
generated fallback assembly is at least as good as the original
unpatched version, and can be better when the target provides a
native try_cmpxchg instruction. It also looks to me like several
other hot spots throughout the code could be improved by converting
them from cmpxchg to try_cmpxchg.

Uros.


Attachments:
lockref-test.c (2.54 kB)

2022-05-28 18:52:35

by Heiko Carstens

Subject: Re: [PATCH 1/2] locking/lockref: Use try_cmpxchg64 in CMPXCHG_LOOP macro

On Thu, May 26, 2022 at 01:42:35PM +0100, Mark Rutland wrote:
> On Thu, May 26, 2022 at 10:14:59PM +1000, Michael Ellerman wrote:
> > Linus Torvalds <[email protected]> writes:
> > > On Wed, May 25, 2022 at 7:40 AM Uros Bizjak <[email protected]> wrote:
> > >>
> > >> Use try_cmpxchg64 instead of cmpxchg64 in CMPXCHG_LOOP macro.
> > >> x86 CMPXCHG instruction returns success in ZF flag, so this
> > >> change saves a compare after cmpxchg (and related move instruction
> > >> in front of cmpxchg). The main loop of lockref_get improves from:
> > >
> > > Ack on this one regardless of the 32-bit x86 question.
> > >
> > > HOWEVER.
> > >
> > > I'd like other architectures to pipe up too, because I think right now
> > > x86 is the only one that implements that "arch_try_cmpxchg()" family
> > > of operations natively, and I think the generic fallback for when it
> > > is missing might be kind of nasty.
> > >
> > > Maybe it ends up generating ok code, but it's also possible that it
> > > just didn't matter when it was only used in one place in the
> > > scheduler.
> >
> > This patch seems to generate slightly *better* code on powerpc.
> >
> > I see one register-to-register move that gets shifted slightly later, so
> > that it's skipped on the path that returns directly via the SUCCESS
> > case.
>
> FWIW, I see the same on arm64; a register-to-register move gets moved out of
> the success path. That changes the register allocation, resulting in one
> fewer move, but otherwise the code generation is the same.

Just for the record: s390 code generation changes in the same way as on
powerpc, so it looks good.

2022-05-28 19:19:04

by Mark Rutland

Subject: Re: [PATCH 1/2] locking/lockref: Use try_cmpxchg64 in CMPXCHG_LOOP macro

On Thu, May 26, 2022 at 10:14:59PM +1000, Michael Ellerman wrote:
> Linus Torvalds <[email protected]> writes:
> > On Wed, May 25, 2022 at 7:40 AM Uros Bizjak <[email protected]> wrote:
> >>
> >> Use try_cmpxchg64 instead of cmpxchg64 in CMPXCHG_LOOP macro.
> >> x86 CMPXCHG instruction returns success in ZF flag, so this
> >> change saves a compare after cmpxchg (and related move instruction
> >> in front of cmpxchg). The main loop of lockref_get improves from:
> >
> > Ack on this one regardless of the 32-bit x86 question.
> >
> > HOWEVER.
> >
> > I'd like other architectures to pipe up too, because I think right now
> > x86 is the only one that implements that "arch_try_cmpxchg()" family
> > of operations natively, and I think the generic fallback for when it
> > is missing might be kind of nasty.
> >
> > Maybe it ends up generating ok code, but it's also possible that it
> > just didn't matter when it was only used in one place in the
> > scheduler.
>
> This patch seems to generate slightly *better* code on powerpc.
>
> I see one register-to-register move that gets shifted slightly later, so
> that it's skipped on the path that returns directly via the SUCCESS
> case.

FWIW, I see the same on arm64; a register-to-register move gets moved out of
the success path. That changes the register allocation, resulting in one
fewer move, but otherwise the code generation is the same.

Thanks,
Mark.

2022-05-28 19:44:36

by Linus Torvalds

Subject: Re: [PATCH 1/2] locking/lockref: Use try_cmpxchg64 in CMPXCHG_LOOP macro

On Wed, May 25, 2022 at 7:40 AM Uros Bizjak <[email protected]> wrote:
>
> Use try_cmpxchg64 instead of cmpxchg64 in CMPXCHG_LOOP macro.
> x86 CMPXCHG instruction returns success in ZF flag, so this
> change saves a compare after cmpxchg (and related move instruction
> in front of cmpxchg). The main loop of lockref_get improves from:

Ack on this one regardless of the 32-bit x86 question.

HOWEVER.

I'd like other architectures to pipe up too, because I think right now
x86 is the only one that implements that "arch_try_cmpxchg()" family
of operations natively, and I think the generic fallback for when it
is missing might be kind of nasty.

Maybe it ends up generating ok code, but it's also possible that it
just didn't matter when it was only used in one place in the
scheduler.

The lockref_get() case can be quite hot under some loads, it would be
sad if this made other architectures worse.

Anyway, maybe that try_cmpxchg() fallback is fine, and works out well
on architectures that use load-locked / store-conditional as-is.
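
For reference, the generic fallback is generated from the atomic
fallback templates and has roughly this shape (paraphrased from
memory, not an exact quote of the generated header):

#ifndef arch_try_cmpxchg64_relaxed
#define arch_try_cmpxchg64_relaxed(_ptr, _oldp, _new)                   \
({                                                                      \
        typeof(*(_ptr)) *___op = (_oldp), ___o = *___op, ___r;          \
        ___r = arch_cmpxchg64_relaxed((_ptr), ___o, (_new));            \
        if (unlikely(___r != ___o))                                     \
                *___op = ___r;                                          \
        likely(___r == ___o);                                           \
})
#endif

That is, it does the plain cmpxchg, derives the boolean from a compare,
and writes the observed value back into the caller's 'old' on failure.
In principle that is the same compare the open-coded loop already had,
so architectures without a native arch_try_cmpxchg() should end up
with much the same code as before.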

But just to verify, I'm adding arm/powerpc/s390/mips people to the cc. See

https://lore.kernel.org/all/[email protected]/

for the original email and the x86-64 code example.

Linus

2022-05-28 19:54:27

by Linus Torvalds

Subject: Re: [PATCH 1/2] locking/lockref: Use try_cmpxchg64 in CMPXCHG_LOOP macro

On Wed, May 25, 2022 at 9:47 AM Linus Torvalds
<[email protected]> wrote:
>
> Ack on this one regardless of the 32-bit x86 question.

.. and with confirmation from Michael and Mark (and the analysis from
Uros on the fallback code), I've applied this to my tree.

NOTE!

I have *not* applied the second patch with the x86-32 change to enable
this on 32-bit with CMPXCHG8B enabled. I think that's independent, and
will leave it up to the x86 maintainers. Plus I think the patch
should be simplified.

Linus