Use try_cmpxchg instead of cmpxchg (*ptr, old, new) == old in
check_cpu_stall. x86 CMPXCHG instruction returns success in ZF flag, so
this change saves a compare after cmpxchg (and related move instruction in
front of cmpxchg).
No functional change intended.
Cc: "Paul E. McKenney" <[email protected]>
Cc: Frederic Weisbecker <[email protected]>
Cc: Neeraj Upadhyay <[email protected]>
Cc: Josh Triplett <[email protected]>
Cc: Steven Rostedt <[email protected]>
Cc: Mathieu Desnoyers <[email protected]>
Cc: Lai Jiangshan <[email protected]>
Cc: Joel Fernandes <[email protected]>
Signed-off-by: Uros Bizjak <[email protected]>
---
kernel/rcu/tree_stall.h | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/kernel/rcu/tree_stall.h b/kernel/rcu/tree_stall.h
index b10b8349bb2a..d81c88e66b42 100644
--- a/kernel/rcu/tree_stall.h
+++ b/kernel/rcu/tree_stall.h
@@ -760,7 +760,7 @@ static void check_cpu_stall(struct rcu_data *rdp)
jn = jiffies + ULONG_MAX / 2;
if (rcu_gp_in_progress() &&
(READ_ONCE(rnp->qsmask) & rdp->grpmask) &&
- cmpxchg(&rcu_state.jiffies_stall, js, jn) == js) {
+ try_cmpxchg(&rcu_state.jiffies_stall, &js, jn)) {
/*
* If a virtual machine is stopped by the host it can look to
@@ -778,7 +778,7 @@ static void check_cpu_stall(struct rcu_data *rdp)
} else if (rcu_gp_in_progress() &&
ULONG_CMP_GE(j, js + RCU_STALL_RAT_DELAY) &&
- cmpxchg(&rcu_state.jiffies_stall, js, jn) == js) {
+ try_cmpxchg(&rcu_state.jiffies_stall, &js, jn)) {
/*
* If a virtual machine is stopped by the host it can look to
--
2.39.2
On Tue, Feb 28, 2023 at 04:51:21PM +0100, Uros Bizjak wrote:
> Use try_cmpxchg instead of cmpxchg (*ptr, old, new) == old in
> check_cpu_stall. x86 CMPXCHG instruction returns success in ZF flag, so
> this change saves a compare after cmpxchg (and related move instruction in
> front of cmpxchg).
In my codegen, I am not seeing mov instruction before the cmp removed, how
can that be? The rax has to be populated with a mov before cmpxchg right?
So try_cmpxchg gives: mov, cmpxchg, cmp, jne
Whereas cmpxchg gives: mov, cmpxchg, mov, jne
So yeah you got rid of compare, but I am not seeing reduction in moves.
Either way, I think it is an improvement due to dropping cmp so:
Acked-by: Joel Fernandes (Google) <[email protected]>
thanks,
- Joel
>
> No functional change intended.
>
> Cc: "Paul E. McKenney" <[email protected]>
> Cc: Frederic Weisbecker <[email protected]>
> Cc: Neeraj Upadhyay <[email protected]>
> Cc: Josh Triplett <[email protected]>
> Cc: Steven Rostedt <[email protected]>
> Cc: Mathieu Desnoyers <[email protected]>
> Cc: Lai Jiangshan <[email protected]>
> Cc: Joel Fernandes <[email protected]>
> Signed-off-by: Uros Bizjak <[email protected]>
> ---
> kernel/rcu/tree_stall.h | 4 ++--
> 1 file changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/kernel/rcu/tree_stall.h b/kernel/rcu/tree_stall.h
> index b10b8349bb2a..d81c88e66b42 100644
> --- a/kernel/rcu/tree_stall.h
> +++ b/kernel/rcu/tree_stall.h
> @@ -760,7 +760,7 @@ static void check_cpu_stall(struct rcu_data *rdp)
> jn = jiffies + ULONG_MAX / 2;
> if (rcu_gp_in_progress() &&
> (READ_ONCE(rnp->qsmask) & rdp->grpmask) &&
> - cmpxchg(&rcu_state.jiffies_stall, js, jn) == js) {
> + try_cmpxchg(&rcu_state.jiffies_stall, &js, jn)) {
>
> /*
> * If a virtual machine is stopped by the host it can look to
> @@ -778,7 +778,7 @@ static void check_cpu_stall(struct rcu_data *rdp)
>
> } else if (rcu_gp_in_progress() &&
> ULONG_CMP_GE(j, js + RCU_STALL_RAT_DELAY) &&
> - cmpxchg(&rcu_state.jiffies_stall, js, jn) == js) {
> + try_cmpxchg(&rcu_state.jiffies_stall, &js, jn)) {
>
> /*
> * If a virtual machine is stopped by the host it can look to
> --
> 2.39.2
>
> On Feb 28, 2023, at 3:39 PM, Joel Fernandes <[email protected]> wrote:
>
> On Tue, Feb 28, 2023 at 04:51:21PM +0100, Uros Bizjak wrote:
>> Use try_cmpxchg instead of cmpxchg (*ptr, old, new) == old in
>> check_cpu_stall. x86 CMPXCHG instruction returns success in ZF flag, so
>> this change saves a compare after cmpxchg (and related move instruction in
>> front of cmpxchg).
>
> In my codegen, I am not seeing mov instruction before the cmp removed, how
> can that be? The rax has to be populated with a mov before cmpxchg right?
>
> So try_cmpxchg gives: mov, cmpxchg, cmp, jne
> Whereas cmpxchg gives: mov, cmpxchg, mov, jne
>
> So yeah you got rid of compare, but I am not seeing reduction in moves.
> Either way, I think it is an improvement due to dropping cmp so:
>
> Acked-by: Joel Fernandes (Google) <[email protected]>
Ah, there probably is one fewer mov to hold the value to compare against,
but maybe not always?
Would be nice to clarify that in the changelog anyway.
Thanks,
- Joel
>
> thanks,
>
> - Joel
>
>
>>
>> No functional change intended.
>>
>> Cc: "Paul E. McKenney" <[email protected]>
>> Cc: Frederic Weisbecker <[email protected]>
>> Cc: Neeraj Upadhyay <[email protected]>
>> Cc: Josh Triplett <[email protected]>
>> Cc: Steven Rostedt <[email protected]>
>> Cc: Mathieu Desnoyers <[email protected]>
>> Cc: Lai Jiangshan <[email protected]>
>> Cc: Joel Fernandes <[email protected]>
>> Signed-off-by: Uros Bizjak <[email protected]>
>> ---
>> kernel/rcu/tree_stall.h | 4 ++--
>> 1 file changed, 2 insertions(+), 2 deletions(-)
>>
>> diff --git a/kernel/rcu/tree_stall.h b/kernel/rcu/tree_stall.h
>> index b10b8349bb2a..d81c88e66b42 100644
>> --- a/kernel/rcu/tree_stall.h
>> +++ b/kernel/rcu/tree_stall.h
>> @@ -760,7 +760,7 @@ static void check_cpu_stall(struct rcu_data *rdp)
>> jn = jiffies + ULONG_MAX / 2;
>> if (rcu_gp_in_progress() &&
>> (READ_ONCE(rnp->qsmask) & rdp->grpmask) &&
>> - cmpxchg(&rcu_state.jiffies_stall, js, jn) == js) {
>> + try_cmpxchg(&rcu_state.jiffies_stall, &js, jn)) {
>>
>> /*
>> * If a virtual machine is stopped by the host it can look to
>> @@ -778,7 +778,7 @@ static void check_cpu_stall(struct rcu_data *rdp)
>>
>> } else if (rcu_gp_in_progress() &&
>> ULONG_CMP_GE(j, js + RCU_STALL_RAT_DELAY) &&
>> - cmpxchg(&rcu_state.jiffies_stall, js, jn) == js) {
>> + try_cmpxchg(&rcu_state.jiffies_stall, &js, jn)) {
>>
>> /*
>> * If a virtual machine is stopped by the host it can look to
>> --
>> 2.39.2
>>
On Tue, 28 Feb 2023 20:39:30 +0000
Joel Fernandes <[email protected]> wrote:
> On Tue, Feb 28, 2023 at 04:51:21PM +0100, Uros Bizjak wrote:
> > Use try_cmpxchg instead of cmpxchg (*ptr, old, new) == old in
> > check_cpu_stall. x86 CMPXCHG instruction returns success in ZF flag, so
> > this change saves a compare after cmpxchg (and related move instruction in
> > front of cmpxchg).
>
> In my codegen, I am not seeing mov instruction before the cmp removed, how
> can that be? The rax has to be populated with a mov before cmpxchg right?
>
> So try_cmpxchg gives: mov, cmpxchg, cmp, jne
> Whereas cmpxchg gives: mov, cmpxchg, mov, jne
>
> So yeah you got rid of compare, but I am not seeing reduction in moves.
> Either way, I think it is an improvement due to dropping cmp so:
Did you get the above backwards?
Anyway, when looking at the conversion of cmpxchg() to try_cmpxchg() that
Uros sent to me for the ring buffer, the code went from:
0000000000000070 <ring_buffer_record_off>:
70: 48 8d 4f 08 lea 0x8(%rdi),%rcx
74: 8b 57 08 mov 0x8(%rdi),%edx
77: 89 d6 mov %edx,%esi
79: 89 d0 mov %edx,%eax
7b: 81 ce 00 00 10 00 or $0x100000,%esi
81: f0 0f b1 31 lock cmpxchg %esi,(%rcx)
85: 39 d0 cmp %edx,%eax
87: 75 eb jne 74 <ring_buffer_record_off+0x4>
89: e9 00 00 00 00 jmp 8e <ring_buffer_record_off+0x1e>
8a: R_X86_64_PLT32 __x86_return_thunk-0x4
8e: 66 90 xchg %ax,%ax
To
00000000000001a0 <ring_buffer_record_off>:
1a0: 8b 47 08 mov 0x8(%rdi),%eax
1a3: 48 8d 4f 08 lea 0x8(%rdi),%rcx
1a7: 89 c2 mov %eax,%edx
1a9: 81 ca 00 00 10 00 or $0x100000,%edx
1af: f0 0f b1 57 08 lock cmpxchg %edx,0x8(%rdi)
1b4: 75 05 jne 1bb <ring_buffer_record_off+0x1b>
1b6: e9 00 00 00 00 jmp 1bb <ring_buffer_record_off+0x1b>
1b7: R_X86_64_PLT32 __x86_return_thunk-0x4
1bb: 89 c2 mov %eax,%edx
1bd: 81 ca 00 00 10 00 or $0x100000,%edx
1c3: f0 0f b1 11 lock cmpxchg %edx,(%rcx)
1c7: 75 f2 jne 1bb <ring_buffer_record_off+0x1b>
1c9: e9 00 00 00 00 jmp 1ce <ring_buffer_record_off+0x2e>
1ca: R_X86_64_PLT32 __x86_return_thunk-0x4
1ce: 66 90 xchg %ax,%ax
It does add a bit more code, but the fast path seems better (where the
cmpxchg succeeds). That would be:
00000000000001a0 <ring_buffer_record_off>:
1a0: 8b 47 08 mov 0x8(%rdi),%eax
1a3: 48 8d 4f 08 lea 0x8(%rdi),%rcx
1a7: 89 c2 mov %eax,%edx
1a9: 81 ca 00 00 10 00 or $0x100000,%edx
1af: f0 0f b1 57 08 lock cmpxchg %edx,0x8(%rdi)
1b4: 75 05 jne 1bb <ring_buffer_record_off+0x1b>
1b6: e9 00 00 00 00 jmp 1bb <ring_buffer_record_off+0x1b>
1b7: R_X86_64_PLT32 __x86_return_thunk-0x4
Where there are only two moves and no cmp, whereas the former has 3 moves and a
cmp in the fast path.
-- Steve
On Tue, Feb 28, 2023 at 04:03:24PM -0500, Steven Rostedt wrote:
> On Tue, 28 Feb 2023 20:39:30 +0000
> Joel Fernandes <[email protected]> wrote:
>
> > On Tue, Feb 28, 2023 at 04:51:21PM +0100, Uros Bizjak wrote:
> > > Use try_cmpxchg instead of cmpxchg (*ptr, old, new) == old in
> > > check_cpu_stall. x86 CMPXCHG instruction returns success in ZF flag, so
> > > this change saves a compare after cmpxchg (and related move instruction in
> > > front of cmpxchg).
> >
> > In my codegen, I am not seeing mov instruction before the cmp removed, how
> > can that be? The rax has to be populated with a mov before cmpxchg right?
> >
> > So try_cmpxchg gives: mov, cmpxchg, cmp, jne
> > Whereas cmpxchg gives: mov, cmpxchg, mov, jne
> >
> > So yeah you got rid of compare, but I am not seeing reduction in moves.
> > Either way, I think it is an improvement due to dropping cmp so:
>
> Did you get the above backwards?
>
> Anyway, when looking at the conversion of cmpxchg() to try_cmpxchg() that
> Uros sent to me for the ring buffer, the code went from:
>
> 0000000000000070 <ring_buffer_record_off>:
> 70: 48 8d 4f 08 lea 0x8(%rdi),%rcx
> 74: 8b 57 08 mov 0x8(%rdi),%edx
> 77: 89 d6 mov %edx,%esi
> 79: 89 d0 mov %edx,%eax
> 7b: 81 ce 00 00 10 00 or $0x100000,%esi
> 81: f0 0f b1 31 lock cmpxchg %esi,(%rcx)
> 85: 39 d0 cmp %edx,%eax
> 87: 75 eb jne 74 <ring_buffer_record_off+0x4>
> 89: e9 00 00 00 00 jmp 8e <ring_buffer_record_off+0x1e>
> 8a: R_X86_64_PLT32 __x86_return_thunk-0x4
> 8e: 66 90 xchg %ax,%ax
>
>
> To
>
> 00000000000001a0 <ring_buffer_record_off>:
> 1a0: 8b 47 08 mov 0x8(%rdi),%eax
> 1a3: 48 8d 4f 08 lea 0x8(%rdi),%rcx
> 1a7: 89 c2 mov %eax,%edx
> 1a9: 81 ca 00 00 10 00 or $0x100000,%edx
> 1af: f0 0f b1 57 08 lock cmpxchg %edx,0x8(%rdi)
> 1b4: 75 05 jne 1bb <ring_buffer_record_off+0x1b>
> 1b6: e9 00 00 00 00 jmp 1bb <ring_buffer_record_off+0x1b>
> 1b7: R_X86_64_PLT32 __x86_return_thunk-0x4
> 1bb: 89 c2 mov %eax,%edx
> 1bd: 81 ca 00 00 10 00 or $0x100000,%edx
> 1c3: f0 0f b1 11 lock cmpxchg %edx,(%rcx)
> 1c7: 75 f2 jne 1bb <ring_buffer_record_off+0x1b>
> 1c9: e9 00 00 00 00 jmp 1ce <ring_buffer_record_off+0x2e>
> 1ca: R_X86_64_PLT32 __x86_return_thunk-0x4
> 1ce: 66 90 xchg %ax,%ax
>
>
> It does add a bit more code, but the fast path seems better (where the
> cmpxchg succeeds). That would be:
>
> 00000000000001a0 <ring_buffer_record_off>:
> 1a0: 8b 47 08 mov 0x8(%rdi),%eax
> 1a3: 48 8d 4f 08 lea 0x8(%rdi),%rcx
> 1a7: 89 c2 mov %eax,%edx
> 1a9: 81 ca 00 00 10 00 or $0x100000,%edx
> 1af: f0 0f b1 57 08 lock cmpxchg %edx,0x8(%rdi)
> 1b4: 75 05 jne 1bb <ring_buffer_record_off+0x1b>
> 1b6: e9 00 00 00 00 jmp 1bb <ring_buffer_record_off+0x1b>
> 1b7: R_X86_64_PLT32 __x86_return_thunk-0x4
>
>
> Where there are only two moves and no cmp, whereas the former has 3 moves and a
> cmp in the fast path.
All well and good, but the stall-warning code is nowhere near a fastpath.
Is try_cmpxchg() considered more readable in this context?
Thanx, Paul
On Tue, 28 Feb 2023 13:29:11 -0800
"Paul E. McKenney" <[email protected]> wrote:
> All well and good, but the stall-warning code is nowhere near a fastpath.
>
> Is try_cmpxchg() considered more readable in this context?
- cmpxchg(&rcu_state.jiffies_stall, js, jn) == js) {
+ try_cmpxchg(&rcu_state.jiffies_stall, &js, jn)) {
It's basically the same :-/
But looking at this use case, I'd actually NAK it, as it is misleading.
As try_cmpxchg() is used to get rid of the updating of the old value. As in
the ring buffer code we had:
void ring_buffer_record_off(struct trace_buffer *buffer)
{
	unsigned int rd;
	unsigned int new_rd;

	do {
		rd = atomic_read(&buffer->record_disabled);
		new_rd = rd | RB_BUFFER_OFF;
	} while (!atomic_cmpxchg(&buffer->record_disabled, &rd, new_rd) != rd);
}
and the try_cmpxchg() converted it to:
void ring_buffer_record_off(struct trace_buffer *buffer)
{
	unsigned int rd;
	unsigned int new_rd;

	rd = atomic_read(&buffer->record_disabled);
	do {
		new_rd = rd | RB_BUFFER_OFF;
	} while (!atomic_try_cmpxchg(&buffer->record_disabled, &rd, new_rd));
}
Which got rid of the need to constantly update the rd variable (cmpxchg
will load rax with the value read, so it removes the need for an extra
move).
But in your case, we don't need to update js, in which case the
try_cmpxchg() does.
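To make that side effect concrete, here is roughly what try_cmpxchg()
does (a simplified sketch of the generic fallback, not the actual arch
implementation):

	static inline bool try_cmpxchg_sketch(unsigned long *ptr,
					      unsigned long *old,
					      unsigned long new)
	{
		unsigned long cur = cmpxchg(ptr, *old, new);

		if (cur == *old)
			return true;
		*old = cur;	/* a failed exchange rewrites the caller's "old" */
		return false;
	}

So on a failed exchange, js gets silently rewritten as a side effect,
which is the misleading part here.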
The patch that Uros sent me for the ring buffer code also does some of
that, which I feel is wrong.
So with that, I would nack the patch.
-- Steve
On Tue, Feb 28, 2023 at 04:41:24PM -0500, Steven Rostedt wrote:
> On Tue, 28 Feb 2023 13:29:11 -0800
> "Paul E. McKenney" <[email protected]> wrote:
>
> > All well and good, but the stall-warning code is nowhere near a fastpath.
> >
> > Is try_cmpxchg() considered more readable in this context?
>
>
> - cmpxchg(&rcu_state.jiffies_stall, js, jn) == js) {
> + try_cmpxchg(&rcu_state.jiffies_stall, &js, jn)) {
>
> It's basically the same :-/
That was my assessment. ;-)
> But looking at this use case, I'd actually NAK it, as it is misleading.
>
> As try_cmpxchg() is used to get rid of the updating of the old value. As in
> the ring buffer code we had:
>
> void ring_buffer_record_off(struct trace_buffer *buffer)
> {
> unsigned int rd;
> unsigned int new_rd;
>
> do {
> rd = atomic_read(&buffer->record_disabled);
> new_rd = rd | RB_BUFFER_OFF;
> } while (!atomic_cmpxchg(&buffer->record_disabled, &rd, new_rd) != rd);
> }
>
> and the try_cmpxchg() converted it to:
>
> void ring_buffer_record_off(struct trace_buffer *buffer)
> {
> unsigned int rd;
> unsigned int new_rd;
>
> rd = atomic_read(&buffer->record_disabled);
> do {
> new_rd = rd | RB_BUFFER_OFF;
> } while (!atomic_try_cmpxchg(&buffer->record_disabled, &rd, new_rd));
> }
>
> Which got rid of the need to constantly update the rd variable (cmpxchg
> will load rax with the value read, so it removes the need for an extra
> move).
>
> But in your case, we don't need to update js, in which case the
> try_cmpxchg() does.
>
> The patch that Uros sent me for the ring buffer code also does some of
> that, which I feel is wrong.
>
> So with that, I would nack the patch.
OK, I will leave this one out.
Thanx, Paul
Hey Steve,
On Tue, Feb 28, 2023 at 4:41 PM Steven Rostedt <[email protected]> wrote:
>
> On Tue, 28 Feb 2023 13:29:11 -0800
> "Paul E. McKenney" <[email protected]> wrote:
>
> > All well and good, but the stall-warning code is nowhere near a fastpath.
> >
> > Is try_cmpxchg() considered more readable in this context?
>
>
> - cmpxchg(&rcu_state.jiffies_stall, js, jn) == js) {
> + try_cmpxchg(&rcu_state.jiffies_stall, &js, jn)) {
>
> It's basically the same :-/
>
> But looking at this use case, I'd actually NAK it, as it is misleading.
I'm trying to parse this. You are saying it is misleading, because it
updates js when it doesn't need to?
> As try_cmpxchg() is used to get rid of the updating of the old value. As in
> the ring buffer code we had:
>
> void ring_buffer_record_off(struct trace_buffer *buffer)
> {
> unsigned int rd;
> unsigned int new_rd;
>
> do {
> rd = atomic_read(&buffer->record_disabled);
> new_rd = rd | RB_BUFFER_OFF;
> } while (!atomic_cmpxchg(&buffer->record_disabled, &rd, new_rd) != rd);
Here you actually meant "rd" as the second parameter without the & ?
> }
>
> and the try_cmpxchg() converted it to:
>
> void ring_buffer_record_off(struct trace_buffer *buffer)
> {
> unsigned int rd;
> unsigned int new_rd;
>
> rd = atomic_read(&buffer->record_disabled);
> do {
> new_rd = rd | RB_BUFFER_OFF;
> } while (!atomic_try_cmpxchg(&buffer->record_disabled, &rd, new_rd));
> }
>
> Which got rid of the need to constantly update the rd variable (cmpxchg
> will load rax with the value read, so it removes the need for an extra
> move).
So that's a good thing?
>
> But in your case, we don't need to update js, in which case the
> try_cmpxchg() does.
Right, it has lesser value here but I'm curious why you feel it also
doesn't belong in that ring buffer loop you shared (or did you mean,
it does belong there but not in other ftrace code modified by Uros?).
> The patch that Uros sent me for the ring buffer code also does some of
> that, which I feel is wrong.
I am guessing that is code *other than* the ring buffer loop above.
> So with that, I would nack the patch.
Dropping it for RCU is fine with me as well.
(And yes, for the other thread, I had it backwards; try_cmpxchg is what
drops the need for the cmp instruction -- sorry.)
thanks,
-Joel
On Tue, 28 Feb 2023 18:30:14 -0500
Joel Fernandes <[email protected]> wrote:
> >
> > But looking at this use case, I'd actually NAK it, as it is misleading.
>
> I'm trying to parse this. You are saying it is misleading, because it
> updates js when it doesn't need to?
Correct.
>
> > As try_cmpxchg() is used to get rid of the updating of the old value. As in
> > the ring buffer code we had:
> >
> > void ring_buffer_record_off(struct trace_buffer *buffer)
> > {
> > unsigned int rd;
> > unsigned int new_rd;
> >
> > do {
> > rd = atomic_read(&buffer->record_disabled);
> > new_rd = rd | RB_BUFFER_OFF;
> > } while (!atomic_cmpxchg(&buffer->record_disabled, &rd, new_rd) != rd);
>
> Here you actually meant "rd" as the second parameter without the & ?
Yes, I cut and pasted the updated code and incorrectly tried to revert it in
this example :-p
>
> > }
> >
> > and the try_cmpxchg() converted it to:
> >
> > void ring_buffer_record_off(struct trace_buffer *buffer)
> > {
> > unsigned int rd;
> > unsigned int new_rd;
> >
> > rd = atomic_read(&buffer->record_disabled);
> > do {
> > new_rd = rd | RB_BUFFER_OFF;
> > } while (!atomic_try_cmpxchg(&buffer->record_disabled, &rd, new_rd));
> > }
> >
> > Which got rid of the need to constantly update the rd variable (cmpxchg
> > will load rax with the value read, so it removes the need for an extra
> > move).
>
> So that's a good thing?
Yes. For looping, try_cmpxchg() is the proper function to use. But in the
RCU case (and other cases in the ring-buffer patch) there is no loop, and
no need to modify the "old" variable.
>
> >
> > But in your case, we don't need to update js, in which case the
> > try_cmpxchg() does.
>
> Right, it has lesser value here but I'm curious why you feel it also
> doesn't belong in that ring buffer loop you shared (or did you mean,
> it does belong there but not in other ftrace code modified by Uros?).
The ring buffer patch had more than one change, where half the updates were
fine, and half were not.
-- Steve
Hey Steve,
On Tue, Feb 28, 2023 at 7:08 PM Steven Rostedt <[email protected]> wrote:
[...]
> >
> > >
> > > But in your case, we don't need to update js, in which case the
> > > try_cmpxchg() does.
> >
> > Right, it has lesser value here but I'm curious why you feel it also
> > doesn't belong in that ring buffer loop you shared (or did you mean,
> > it does belong there but not in other ftrace code modified by Uros?).
>
> The ring buffer patch had more than one change, where half the updates were
> fine, and half were not.
Thanks for all the clarifications. Good to know such loop design
patterns can benefit from it!
- Joel
On Tue, Feb 28, 2023 at 9:39 PM Joel Fernandes <[email protected]> wrote:
>
> On Tue, Feb 28, 2023 at 04:51:21PM +0100, Uros Bizjak wrote:
> > Use try_cmpxchg instead of cmpxchg (*ptr, old, new) == old in
> > check_cpu_stall. x86 CMPXCHG instruction returns success in ZF flag, so
> > this change saves a compare after cmpxchg (and related move instruction in
> > front of cmpxchg).
>
> In my codegen, I am not seeing mov instruction before the cmp removed, how
> can that be? The rax has to be populated with a mov before cmpxchg right?
>
> So try_cmpxchg gives: mov, cmpxchg, cmp, jne
> Whereas cmpxchg gives: mov, cmpxchg, mov, jne
(I assume the above is reversed)
You have quite a new compiler (e.g. gcc-12+), which is able to reorder
basic blocks, reschedule instructions and create fast paths through
the code based on the likely/unlikely annotations in the definition of
try_cmpxchg. [On a related note, some code growth is allowed to the
compiler in the hot path since the kernel is compiled with -O2, not
-Os.] With gcc-10.3.1, the change improves the code generated from tree.c from:
a1c5: 0f 84 53 03 00 00 je a51e <rcu_sched_clock_irq+0x70e>
a1cb: 48 89 c8 mov %rcx,%rax
a1ce: f0 48 0f b1 35 00 00 lock cmpxchg %rsi,0x0(%rip)
# a1d7 <rcu_sched_clock_irq+0x3c7>
a1d5: 00 00
a1d3: R_X86_64_PC32 .data+0xf9c
a1d7: 48 39 c1 cmp %rax,%rcx
a1da: 0f 85 3e 03 00 00 jne a51e <rcu_sched_clock_irq+0x70e>
to:
a1d0: 0f 84 49 03 00 00 je a51f <rcu_sched_clock_irq+0x70f>
a1d6: f0 48 0f b1 35 00 00 lock cmpxchg %rsi,0x0(%rip)
# a1df <rcu_sched_clock_irq+0x3cf>
a1dd: 00 00
a1db: R_X86_64_PC32 .data+0xf9c
a1df: 0f 85 3a 03 00 00 jne a51f <rcu_sched_clock_irq+0x70f>
The change brings the following code size improvement:
text data bss dec hex filename
57712 9945 86 67743 1089f tree-new.o
57760 9945 86 67791 108cf tree-old.o
Small change, but the result of effectively mechanical one-line
substitutions.
Uros.
> So yeah you got rid of compare, but I am not seeing reduction in moves.
> Either way, I think it is an improvement due to dropping cmp so:
>
> Acked-by: Joel Fernandes (Google) <[email protected]>
>
> thanks,
>
> - Joel
>
>
> >
> > No functional change intended.
> >
> > Cc: "Paul E. McKenney" <[email protected]>
> > Cc: Frederic Weisbecker <[email protected]>
> > Cc: Neeraj Upadhyay <[email protected]>
> > Cc: Josh Triplett <[email protected]>
> > Cc: Steven Rostedt <[email protected]>
> > Cc: Mathieu Desnoyers <[email protected]>
> > Cc: Lai Jiangshan <[email protected]>
> > Cc: Joel Fernandes <[email protected]>
> > Signed-off-by: Uros Bizjak <[email protected]>
> > ---
> > kernel/rcu/tree_stall.h | 4 ++--
> > 1 file changed, 2 insertions(+), 2 deletions(-)
> >
> > diff --git a/kernel/rcu/tree_stall.h b/kernel/rcu/tree_stall.h
> > index b10b8349bb2a..d81c88e66b42 100644
> > --- a/kernel/rcu/tree_stall.h
> > +++ b/kernel/rcu/tree_stall.h
> > @@ -760,7 +760,7 @@ static void check_cpu_stall(struct rcu_data *rdp)
> > jn = jiffies + ULONG_MAX / 2;
> > if (rcu_gp_in_progress() &&
> > (READ_ONCE(rnp->qsmask) & rdp->grpmask) &&
> > - cmpxchg(&rcu_state.jiffies_stall, js, jn) == js) {
> > + try_cmpxchg(&rcu_state.jiffies_stall, &js, jn)) {
> >
> > /*
> > * If a virtual machine is stopped by the host it can look to
> > @@ -778,7 +778,7 @@ static void check_cpu_stall(struct rcu_data *rdp)
> >
> > } else if (rcu_gp_in_progress() &&
> > ULONG_CMP_GE(j, js + RCU_STALL_RAT_DELAY) &&
> > - cmpxchg(&rcu_state.jiffies_stall, js, jn) == js) {
> > + try_cmpxchg(&rcu_state.jiffies_stall, &js, jn)) {
> >
> > /*
> > * If a virtual machine is stopped by the host it can look to
> > --
> > 2.39.2
> >
On Wed, Mar 1, 2023 at 1:08 AM Steven Rostedt <[email protected]> wrote:
>
> On Tue, 28 Feb 2023 18:30:14 -0500
> Joel Fernandes <[email protected]> wrote:
> > >
> > > But looking at this use case, I'd actually NAK it, as it is misleading.
> >
> > I'm trying to parse this. You are saying it is misleading, because it
> > updates js when it doesn't need to?
>
> Correct.
I'm a bit late to the discussion (well, I have to sleep from time to
time, too), but in the hope that everybody interested in this issue
will find the reply, I'll try to clarify the "updates" claim:
The try_cmpxchg is written in such a way that it benefits loops as well
as linear code; in the latter case it depends on the compiler to
eliminate the dead assignment.
When changing linear code from cmpxchg to try_cmpxchg, one has to take
care that the variable passed by reference is unused after the cmpxchg,
so it can be considered a temporary variable (as said elsewhere,
the alternative is to copy the value to a local temporary variable and
pass a pointer to that variable to try_cmpxchg; the compiler will
eliminate the assignment if the original variable is unused).
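For illustration, a sketch of that alternative using the jiffies_stall
site from this patch ("old" being the extra local):

	unsigned long old = js;	/* js itself stays untouched */

	if (try_cmpxchg(&rcu_state.jiffies_stall, &old, jn)) {
		/* exchange succeeded */
	}
	/* "old" is dead past this point, so the compiler can
	   eliminate the failed-case store into it. */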
Even in linear code, the conversion from cmpxchg to try_cmpxchg is
able to eliminate assignment and compare, as can be seen when the code
is compiled with gcc-10.3.1:
a1c5: 0f 84 53 03 00 00 je a51e <rcu_sched_clock_irq+0x70e>
a1cb: 48 89 c8 mov %rcx,%rax
a1ce: f0 48 0f b1 35 00 00 lock cmpxchg %rsi,0x0(%rip)
# a1d7 <rcu_sched_clock_irq+0x3c7>
a1d5: 00 00
a1d3: R_X86_64_PC32 .data+0xf9c
a1d7: 48 39 c1 cmp %rax,%rcx
a1da: 0f 85 3e 03 00 00 jne a51e <rcu_sched_clock_irq+0x70e>
to:
a1d0: 0f 84 49 03 00 00 je a51f <rcu_sched_clock_irq+0x70f>
a1d6: f0 48 0f b1 35 00 00 lock cmpxchg %rsi,0x0(%rip)
# a1df <rcu_sched_clock_irq+0x3cf>
a1dd: 00 00
a1db: R_X86_64_PC32 .data+0xf9c
a1df: 0f 85 3a 03 00 00 jne a51f <rcu_sched_clock_irq+0x70f>
Newer compilers (e.g. gcc-12+) are able to use the likely/unlikely
annotations to reorder the code, so the change is less visible. But
due to the reordering, even targets that don't define try_cmpxchg natively
benefit from the change; please see the thread at [1].
These benefits are the reason the change to try_cmpxchg was accepted
also in the linear code elsewhere in the linux kernel, e.g. [2,3] to
name a few commits, with a thumbs-up and a claim that the new code is
actually *clearer* at the merge commit [4].
I really think that the above demonstrates various improvements, and it
would be unfortunate not to consider them.
[1] https://lore.kernel.org/lkml/[email protected]/
[2] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=4e1da8fe031303599e78f88e0dad9f44272e4f99
[3] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=8baceabca656d5ef4494cdeb3b9b9fbb844ac613
[4] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=91bc559d8d3aed488b4b50e9eba1d7ebb1da7bbf
Uros.
> >
> > > As try_cmpxchg() is used to get rid of the updating of the old value. As in
> > > the ring buffer code we had:
> > >
> > > void ring_buffer_record_off(struct trace_buffer *buffer)
> > > {
> > > unsigned int rd;
> > > unsigned int new_rd;
> > >
> > > do {
> > > rd = atomic_read(&buffer->record_disabled);
> > > new_rd = rd | RB_BUFFER_OFF;
> > > } while (!atomic_cmpxchg(&buffer->record_disabled, &rd, new_rd) != rd);
> >
> > Here you actually meant "rd" as the second parameter without the & ?
>
> Yes, I cut and pasted the updated code and incorrectly tried to revert it in
> this example :-p
>
> >
> > > }
> > >
> > > and the try_cmpxchg() converted it to:
> > >
> > > void ring_buffer_record_off(struct trace_buffer *buffer)
> > > {
> > > unsigned int rd;
> > > unsigned int new_rd;
> > >
> > > rd = atomic_read(&buffer->record_disabled);
> > > do {
> > > new_rd = rd | RB_BUFFER_OFF;
> > > } while (!atomic_try_cmpxchg(&buffer->record_disabled, &rd, new_rd));
> > > }
> > >
> > > Which got rid of the need to constantly update the rd variable (cmpxchg
> > > will load rax with the value read, so it removes the need for an extra
> > > move).
> >
> > So that's a good thing?
>
> Yes. For looping, try_cmpxchg() is the proper function to use. But in the
> RCU case (and other cases in the ring-buffer patch) there is no loop, and
> no need to modify the "old" variable.
>
> >
> > >
> > > But in your case, we don't need to update js, in which case the
> > > try_cmpxchg() does.
> >
> > Right, it has lesser value here but I'm curious why you feel it also
> > doesn't belong in that ring buffer loop you shared (or did you mean,
> > it does belong there but not in other ftrace code modified by Uros?).
>
> The ring buffer patch had more than one change, where half the updates were
> fine, and half were not.
>
> -- Steve
On Wed, 1 Mar 2023 11:28:47 +0100
Uros Bizjak <[email protected]> wrote:
> These benefits are the reason the change to try_cmpxchg was accepted
> also in the linear code elsewhere in the linux kernel, e.g. [2,3] to
> name a few commits, with a thumbs-up and a claim that the new code is
> actually *clearer* at the merge commit [4].
I'll say it here too. I really like Joel's suggestion of having a
cmpxchg_success() that does not have the added side effect of modifying the
old variable.
I think that would allow for the arch optimizations that you are trying to
achieve, as well as remove the side effect that might cause issues down the
road.
-- Steve
On Wed, Mar 1, 2023 at 5:38 PM Steven Rostedt <[email protected]> wrote:
>
> On Wed, 1 Mar 2023 11:28:47 +0100
> Uros Bizjak <[email protected]> wrote:
>
> > These benefits are the reason the change to try_cmpxchg was accepted
> > also in the linear code elsewhere in the linux kernel, e.g. [2,3] to
> > name a few commits, with a thumbs-up and a claim that the new code is
> > actually *clearer* at the merge commit [4].
>
> I'll say it here too. I really like Joel's suggestion of having a
> cmpxchg_success() that does not have the added side effect of modifying the
> old variable.
>
> I think that would allow for the arch optimizations that you are trying to
> achieve, as well as remove the side effect that might cause issues down the
> road.
Attached patch implements this suggestion.
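(Its core is this wrapper; the full patch is quoted later in the thread:)

	#define cmpxchg_success(ptr, old, new)			\
	({							\
		__typeof__(*(ptr)) __tmp = (old);		\
		try_cmpxchg((ptr), &__tmp, (new));		\
	})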
Uros.
On Wed, 1 Mar 2023 19:43:34 +0100
Uros Bizjak <[email protected]> wrote:
> On Wed, Mar 1, 2023 at 5:38 PM Steven Rostedt <[email protected]> wrote:
> >
> > On Wed, 1 Mar 2023 11:28:47 +0100
> > Uros Bizjak <[email protected]> wrote:
> >
> > > These benefits are the reason the change to try_cmpxchg was accepted
> > > also in the linear code elsewhere in the linux kernel, e.g. [2,3] to
> > > name a few commits, with a thumbs-up and a claim that the new code is
> > > actually *clearer* at the merge commit [4].
> >
> > I'll say it here too. I really like Joel's suggestion of having a
> > cmpxchg_success() that does not have the added side effect of modifying the
> > old variable.
> >
> > I think that would allow for the arch optimizations that you are trying to
> > achieve, as well as remove the side effect that might cause issues down the
> > road.
>
> Attached patch implements this suggestion.
I like it!
Any way to make this more generic?
-- Steve
On Wed, Mar 1, 2023 at 7:52 PM Steven Rostedt <[email protected]> wrote:
>
> On Wed, 1 Mar 2023 19:43:34 +0100
> Uros Bizjak <[email protected]> wrote:
>
> > On Wed, Mar 1, 2023 at 5:38 PM Steven Rostedt <[email protected]> wrote:
> > >
> > > On Wed, 1 Mar 2023 11:28:47 +0100
> > > Uros Bizjak <[email protected]> wrote:
> > >
> > > > These benefits are the reason the change to try_cmpxchg was accepted
> > > > also in the linear code elsewhere in the linux kernel, e.g. [2,3] to
> > > > name a few commits, with a thumbs-up and a claim that the new code is
> > > > actually *clearer* at the merge commit [4].
> > >
> > > I'll say it here too. I really like Joel's suggestion of having a
> > > cmpxchg_success() that does not have the added side effect of modifying the
> > > old variable.
> > >
> > > I think that would allow for the arch optimizations that you are trying to
> > > achieve, as well as remove the side effect that might cause issues down the
> > > road.
> >
> > Attached patch implements this suggestion.
>
> I like it!
Thanks!
> Any way to make this more generic?
If we want to put the definition in generic headers, then we also need
to define acquire/release/relaxed and 64bit variants. ATM, we have two
sites that require this definition and I think that for now we could
live with two instances of the same definition in two separate
subsystems. But this would definitely be a good future addition. There
is some code in the form of
if (cmpxchg (ptr, 0, 1) == 0)
that can not be converted to try_cmpxchg, but can use cmpxchg_success.
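For example (a sketch; the names are only illustrative):

	/* The constant 0 cannot be passed by reference, so try_cmpxchg()
	   would need an explicit temporary: */
	int expected = 0;

	if (try_cmpxchg(ptr, &expected, 1))
		/* we won the race */;

	/* whereas cmpxchg_success() hides that temporary: */
	if (cmpxchg_success(ptr, 0, 1))
		/* we won the race */;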
Uros.
On Wed, 1 Mar 2023 20:18:11 +0100
Uros Bizjak <[email protected]> wrote:
> If we want to put the definition in generic headers, then we also need
> to define acquire/release/relaxed and 64bit variants. ATM, we have two
> sites that require this definition and I think that for now we could
> live with two instances of the same definition in two separate
> subsystems. But this would definitely be a good future addition. There
> is some code in the form of
>
> if (cmpxchg (ptr, 0, 1) == 0)
>
> that can not be converted to try_cmpxchg, but can use cmpxchg_success.
And even modify code that uses temp variables. For example, where you
modified the ring buffer code to use try_cmpxchg(), I could convert your:
static int rb_head_page_replace(struct buffer_page *old,
				struct buffer_page *new)
{
	unsigned long *ptr = (unsigned long *)&old->list.prev->next;
	unsigned long val;

	val = *ptr & ~RB_FLAG_MASK;
	val |= RB_PAGE_HEAD;

	return try_cmpxchg(ptr, &val, (unsigned long)&new->list);
}
Into just:
static int rb_head_page_replace(struct buffer_page *old,
				struct buffer_page *new)
{
	unsigned long *ptr = (unsigned long *)&old->list.prev->next;
	unsigned long val;

	val = *ptr & ~RB_FLAG_MASK;

	return cmpxchg_success(ptr, val | RB_PAGE_HEAD, (unsigned long)&new->list);
}
-- Steve
> On Mar 1, 2023, at 2:18 PM, Uros Bizjak <[email protected]> wrote:
>
> On Wed, Mar 1, 2023 at 7:52 PM Steven Rostedt <[email protected]> wrote:
>>
>>> On Wed, 1 Mar 2023 19:43:34 +0100
>>> Uros Bizjak <[email protected]> wrote:
>>>
>>> On Wed, Mar 1, 2023 at 5:38 PM Steven Rostedt <[email protected]> wrote:
>>>>
>>>> On Wed, 1 Mar 2023 11:28:47 +0100
>>>> Uros Bizjak <[email protected]> wrote:
>>>>
>>>>> These benefits are the reason the change to try_cmpxchg was accepted
>>>>> also in the linear code elsewhere in the linux kernel, e.g. [2,3] to
>>>>> name a few commits, with a thumbs-up and a claim that the new code is
>>>>> actually *clearer* at the merge commit [4].
>>>>
>>>> I'll say it here too. I really like Joel's suggestion of having a
>>>> cmpxchg_success() that does not have the added side effect of modifying the
>>>> old variable.
>>>>
>>>> I think that would allow for the arch optimizations that you are trying to
>>>> achieve, as well as remove the side effect that might cause issues down the
>>>> road.
>>>
>>> Attached patch implements this suggestion.
>>
>> I like it!
Me too :)
> Thanks!
>
>> Any way to make this more generic?
>
> If we want to put the definition in generic headers, then we also need
> to define acquire/release/relaxed and 64bit variants. ATM, we have two
> sites that require this definition and I think that for now we could
> live with two instances of the same definition in two separate
> subsystems. But this would definitely be a good future addition. There
> is some code in the form of
>
> if (cmpxchg (ptr, 0, 1) == 0)
>
> that can not be converted to try_cmpxchg, but can use cmpxchg_success.
I would prefer if we can put it in generic headers instead of duplicating
across ftrace and RCU.
thanks,
- Joel
>
> Uros.
> On Mar 1, 2023, at 2:43 PM, Steven Rostedt <[email protected]> wrote:
>
> On Wed, 1 Mar 2023 20:18:11 +0100
> Uros Bizjak <[email protected]> wrote:
>
>> If we want to put the definition in generic headers, then we also need
>> to define acquire/release/relaxed and 64bit variants. ATM, we have two
>> sites that require this definition and I think that for now we could
>> live with two instances of the same definition in two separate
>> subsystems. But this would definitely be a good future addition. There
>> is some code in the form of
>>
>> if (cmpxchg (ptr, 0, 1) == 0)
>>
>> that can not be converted to try_cmpxchg, but can use cmpxchg_success.
>
> And even modify code that uses temp variables. For example, where you
> modified the ring buffer code to use try_cmpxchg(), I could convert your:
>
> static int rb_head_page_replace(struct buffer_page *old,
> struct buffer_page *new)
> {
> unsigned long *ptr = (unsigned long *)&old->list.prev->next;
> unsigned long val;
>
> val = *ptr & ~RB_FLAG_MASK;
> val |= RB_PAGE_HEAD;
>
> return try_cmpxchg(ptr, &val, (unsigned long)&new->list);
> }
>
> Into just:
>
> static int rb_head_page_replace(struct buffer_page *old,
> struct buffer_page *new)
> {
> unsigned long *ptr = (unsigned long *)&old->list.prev->next;
> unsigned long val;
>
> val = *ptr & ~RB_FLAG_MASK;
>
> return cmpxchg_success(ptr, val | RB_PAGE_HEAD, (unsigned long)&new->list);
> }
Nice! Looks much better.
Thanks,
- Joel
>
> -- Steve
On Wed, Mar 01, 2023 at 07:43:34PM +0100, Uros Bizjak wrote:
> On Wed, Mar 1, 2023 at 5:38 PM Steven Rostedt <[email protected]> wrote:
> >
> > On Wed, 1 Mar 2023 11:28:47 +0100
> > Uros Bizjak <[email protected]> wrote:
> >
> > > These benefits are the reason the change to try_cmpxchg was accepted
> > > also in the linear code elsewhere in the linux kernel, e.g. [2,3] to
> > > name a few commits, with a thumbs-up and a claim that the new code is
> > > actually *clearer* at the merge commit [4].
> >
> > I'll say it here too. I really like Joel's suggestion of having a
> > cmpxchg_success() that does not have the added side effect of modifying the
> > old variable.
> >
> > I think that would allow for the arch optimizations that you are trying to
> > achieve, as well as remove the side effect that might cause issues down the
> > road.
>
> Attached patch implements this suggestion.
Please help me out here.
Why on earth are we even discussing making this change to code that
normally never executes? Performance is not a consideration here.
What am I missing here? Is there some sort of forward-progress
issue that this change addresses?
Thanx, Paul
> Uros.
> diff --git a/kernel/rcu/tree_stall.h b/kernel/rcu/tree_stall.h
> index b10b8349bb2a..229263ebba3b 100644
> --- a/kernel/rcu/tree_stall.h
> +++ b/kernel/rcu/tree_stall.h
> @@ -709,6 +709,12 @@ static void print_cpu_stall(unsigned long gps)
> set_preempt_need_resched();
> }
>
> +#define cmpxchg_success(ptr, old, new) \
> +({ \
> + __typeof__(*(ptr)) __tmp = (old); \
> + try_cmpxchg((ptr), &__tmp, (new)); \
> +})
> +
> static void check_cpu_stall(struct rcu_data *rdp)
> {
> bool didstall = false;
> @@ -760,7 +766,7 @@ static void check_cpu_stall(struct rcu_data *rdp)
> jn = jiffies + ULONG_MAX / 2;
> if (rcu_gp_in_progress() &&
> (READ_ONCE(rnp->qsmask) & rdp->grpmask) &&
> - cmpxchg(&rcu_state.jiffies_stall, js, jn) == js) {
> + cmpxchg_success(&rcu_state.jiffies_stall, js, jn)) {
>
> /*
> * If a virtual machine is stopped by the host it can look to
> @@ -778,7 +784,7 @@ static void check_cpu_stall(struct rcu_data *rdp)
>
> } else if (rcu_gp_in_progress() &&
> ULONG_CMP_GE(j, js + RCU_STALL_RAT_DELAY) &&
> - cmpxchg(&rcu_state.jiffies_stall, js, jn) == js) {
> + cmpxchg_success(&rcu_state.jiffies_stall, js, jn)) {
>
> /*
> * If a virtual machine is stopped by the host it can look to
On Wed, 1 Mar 2023 12:08:20 -0800
"Paul E. McKenney" <[email protected]> wrote:
> > Attached patch implements this suggestion.
>
> Please help me out here.
>
> Why on earth are we even discussing making this change to code that
> normally never executes? Performance is not a consideration here.
>
> What am I missing here? Is there some sort of forward-progress
> issue that this change addresses?
Well, we sorta hijacked this thread. It turned into a more general
discussion, as there is code that this change will be useful for
(ring_buffer.c), but we just happen to be having the discussion here.
Where it will at most remove some text and give you back a few extra bytes
of memory ;-)
But if we do use cmpxchg_success(), IMHO it does improve readability.
> - cmpxchg(&rcu_state.jiffies_stall, js, jn) == js) {
> + cmpxchg_success(&rcu_state.jiffies_stall, js, jn)) {
-- Steve
> On Mar 1, 2023, at 3:08 PM, Paul E. McKenney <[email protected]> wrote:
>
> On Wed, Mar 01, 2023 at 07:43:34PM +0100, Uros Bizjak wrote:
>>> On Wed, Mar 1, 2023 at 5:38 PM Steven Rostedt <[email protected]> wrote:
>>>
>>> On Wed, 1 Mar 2023 11:28:47 +0100
>>> Uros Bizjak <[email protected]> wrote:
>>>
>>>> These benefits are the reason the change to try_cmpxchg was accepted
>>>> also in the linear code elsewhere in the linux kernel, e.g. [2,3] to
>>>> name a few commits, with a thumbs-up and a claim that the new code is
>>>> actually *clearer* at the merge commit [4].
>>>
>>> I'll say it here too. I really like Joel's suggestion of having a
>>> cmpxchg_success() that does not have the added side effect of modifying the
>>> old variable.
>>>
>>> I think that would allow for the arch optimizations that you are trying to
>>> achieve, as well as remove the side effect that might cause issues down the
>>> road.
>>
>> Attached patch implements this suggestion.
>
> Please help me out here.
>
> Why on earth are we even discussing making this change to code that
> normally never executes? Performance is not a consideration here.
>
> What am I missing here? Is there some sort of forward-progress
> issue that this change addresses?
I do not think it has anything to do with performance. The suggestion just
makes the code easier to read. In the case of ftrace (not RCU), it results
in further deleted lines of code.
Maybe it got confusing because we are discussing the change as it
applies to both ftrace and RCU.
You could argue that it has to do with performance in the fast path, but it
is probably down in the noise.
- Joel
>
> Thanx, Paul
>
>> Uros.
>
>> diff --git a/kernel/rcu/tree_stall.h b/kernel/rcu/tree_stall.h
>> index b10b8349bb2a..229263ebba3b 100644
>> --- a/kernel/rcu/tree_stall.h
>> +++ b/kernel/rcu/tree_stall.h
>> @@ -709,6 +709,12 @@ static void print_cpu_stall(unsigned long gps)
>> set_preempt_need_resched();
>> }
>>
>> +#define cmpxchg_success(ptr, old, new) \
>> +({ \
>> + __typeof__(*(ptr)) __tmp = (old); \
>> + try_cmpxchg((ptr), &__tmp, (new)); \
>> +})
>> +
>> static void check_cpu_stall(struct rcu_data *rdp)
>> {
>> bool didstall = false;
>> @@ -760,7 +766,7 @@ static void check_cpu_stall(struct rcu_data *rdp)
>> jn = jiffies + ULONG_MAX / 2;
>> if (rcu_gp_in_progress() &&
>> (READ_ONCE(rnp->qsmask) & rdp->grpmask) &&
>> - cmpxchg(&rcu_state.jiffies_stall, js, jn) == js) {
>> + cmpxchg_success(&rcu_state.jiffies_stall, js, jn)) {
>>
>> /*
>> * If a virtual machine is stopped by the host it can look to
>> @@ -778,7 +784,7 @@ static void check_cpu_stall(struct rcu_data *rdp)
>>
>> } else if (rcu_gp_in_progress() &&
>> ULONG_CMP_GE(j, js + RCU_STALL_RAT_DELAY) &&
>> - cmpxchg(&rcu_state.jiffies_stall, js, jn) == js) {
>> + cmpxchg_success(&rcu_state.jiffies_stall, js, jn)) {
>>
>> /*
>> * If a virtual machine is stopped by the host it can look to
>
On Wed, Mar 01, 2023 at 03:18:26PM -0500, Steven Rostedt wrote:
> On Wed, 1 Mar 2023 12:08:20 -0800
> "Paul E. McKenney" <[email protected]> wrote:
>
> > > Attached patch implements this suggestion.
> >
> > Please help me out here.
> >
> > Why on earth are we even discussing making this change to code that
> > normally never executes? Performance is not a consideration here.
> >
> > What am I missing here? Is there some sort of forward-progress
> > issue that this change addresses?
>
> Well, we sorta hijacked this thread. It turned into a more general
> discussion, as there is code that this change will be useful for
> (ring_buffer.c), but we just happen to be having the discussion here.
>
> Where it will at most remove some text and give you back a few extra bytes
> of memory ;-)
>
> But if we do use cmpxchg_success() IMHO, it does improve readability.
>
> > - cmpxchg(&rcu_state.jiffies_stall, js, jn) == js) {
> > + cmpxchg_success(&rcu_state.jiffies_stall, js, jn)) {
Some years down the road, should cmpxchg_success() be on the tip of
the tongue of every kernel hacker, perhaps. Or perhaps not.
In the meantime, we have yet another abysmally documented atomic
operation that is not well known throughout the community. And then the
people coming across this curse everyone who had anything to do with it,
as they search the source code, dig through assembly output, and so on
trying to work out exactly what this thing does.
Sorry, but no way.
Again, unless there is some sort of forward-progress argument or
similar convincing argument.
Thanx, Paul
On Wed, 1 Mar 2023 12:36:45 -0800
"Paul E. McKenney" <[email protected]> wrote:
> Some years down the road, should cmpxchg_success() be on the tip of
> the tongue of every kernel hacker, perhaps. Or perhaps not.
A bit of a catch-22 I would say. It will only become something everyone
knows if it exists.
>
> In the meantime, we have yet another abysmally documented atomic
Is it?
> operation that is not well known throughout the community. And then the
> people coming across this curse everyone who had anything to do with it,
> as they search the source code, dig through assembly output, and so on
> trying to work out exactly what this thing does.
>
> Sorry, but no way.
>
> Again, unless there is some sort of forward-progress argument or
> similar convincing argument.
Speaking of forward progress...
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/Documentation/atomic_t.txt#n316
Anyway, I'm guessing this will not become part of rcu any time soon. But
for the ring buffer, I would happily take it.
-- Steve
On Wed, Mar 01, 2023 at 03:46:41PM -0500, Steven Rostedt wrote:
> On Wed, 1 Mar 2023 12:36:45 -0800
> "Paul E. McKenney" <[email protected]> wrote:
>
> > Some years down the road, should cmpxchg_success() be on the tip of
> > the tongue of every kernel hacker, perhaps. Or perhaps not.
>
> A bit of a catch-22 I would say. It will only become something everyone
> knows if it exists.
>
> > In the meantime, we have yet another abysmally documented atomic
>
> Is it?
It sure is.
> > operation that is not well known throughout the community. And then the
> > people coming across this curse everyone who had anything to do with it,
> > as they search the source code, dig through assembly output, and so on
> > trying to work out exactly what this thing does.
> >
> > Sorry, but no way.
> >
> > Again, unless there is some sort of forward-progress argument or
> > similar convincing argument.
>
> Speaking of forward progress...
>
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/Documentation/atomic_t.txt#n316
Yes, that is probably the best starting point.
And if you have been around long enough, you know that this is in fact
the best starting point. Plus if you can correctly interpret the words
in that document. And if you are familiar with the entire document.
But otherwise, good luck learning the semantics for something like
atomic_fetch_add_release().
> Anyway, I'm guessing this will not become part of rcu any time soon. But
> for the ring buffer, I would happily take it.
Certainly not unless someone comes up with a good reason for it.
Thanx, Paul