Date: Tue, 28 Feb 2023 13:29:11 -0800
From: "Paul E. McKenney"
Reply-To: paulmck@kernel.org
To: Steven Rostedt
Cc: Joel Fernandes, Uros Bizjak, rcu@vger.kernel.org,
    linux-kernel@vger.kernel.org, Frederic Weisbecker, Neeraj Upadhyay,
    Josh Triplett, Mathieu Desnoyers, Lai Jiangshan
Subject: Re: [PATCH] rcu: use try_cmpxchg in check_cpu_stall
Message-ID: <20230228212911.GX2948950@paulmck-ThinkPad-P17-Gen-1>
References: <20230228155121.3416-1-ubizjak@gmail.com>
 <20230228160324.2a7c1012@gandalf.local.home>
In-Reply-To: <20230228160324.2a7c1012@gandalf.local.home>

On Tue, Feb 28, 2023 at 04:03:24PM -0500, Steven Rostedt wrote:
> On Tue, 28 Feb 2023 20:39:30 +0000
> Joel Fernandes wrote:
> 
> > On Tue, Feb 28, 2023 at 04:51:21PM +0100, Uros Bizjak wrote:
> > > Use try_cmpxchg instead of cmpxchg (*ptr, old, new) == old in
> > > check_cpu_stall. x86 CMPXCHG instruction returns success in ZF flag, so
> > > this change saves a compare after cmpxchg (and related move instruction
> > > in front of cmpxchg).
> > 
> > In my codegen, I am not seeing the mov instruction before the cmp
> > removed; how can that be? The rax has to be populated with a mov before
> > cmpxchg, right?
> > 
> > So try_cmpxchg gives: mov, cmpxchg, cmp, jne
> > Whereas cmpxchg gives: mov, cmpxchg, mov, jne
> > 
> > So yeah, you got rid of the compare, but I am not seeing a reduction in
> > moves. Either way, I think it is an improvement due to dropping the cmp,
> > so:
> 
> Did you get the above backwards?
> Anyway, when looking at the conversion of cmpxchg() to try_cmpxchg() that
> Uros sent to me for the ring buffer, the code went from:
> 
> 0000000000000070 :
>   70: 48 8d 4f 08             lea    0x8(%rdi),%rcx
>   74: 8b 57 08                mov    0x8(%rdi),%edx
>   77: 89 d6                   mov    %edx,%esi
>   79: 89 d0                   mov    %edx,%eax
>   7b: 81 ce 00 00 10 00       or     $0x100000,%esi
>   81: f0 0f b1 31             lock cmpxchg %esi,(%rcx)
>   85: 39 d0                   cmp    %edx,%eax
>   87: 75 eb                   jne    74
>   89: e9 00 00 00 00          jmp    8e
>                    8a: R_X86_64_PLT32 __x86_return_thunk-0x4
>   8e: 66 90                   xchg   %ax,%ax
> 
> to:
> 
> 00000000000001a0 :
>   1a0: 8b 47 08                mov    0x8(%rdi),%eax
>   1a3: 48 8d 4f 08             lea    0x8(%rdi),%rcx
>   1a7: 89 c2                   mov    %eax,%edx
>   1a9: 81 ca 00 00 10 00       or     $0x100000,%edx
>   1af: f0 0f b1 57 08          lock cmpxchg %edx,0x8(%rdi)
>   1b4: 75 05                   jne    1bb
>   1b6: e9 00 00 00 00          jmp    1bb
>                    1b7: R_X86_64_PLT32 __x86_return_thunk-0x4
>   1bb: 89 c2                   mov    %eax,%edx
>   1bd: 81 ca 00 00 10 00       or     $0x100000,%edx
>   1c3: f0 0f b1 11             lock cmpxchg %edx,(%rcx)
>   1c7: 75 f2                   jne    1bb
>   1c9: e9 00 00 00 00          jmp    1ce
>                    1ca: R_X86_64_PLT32 __x86_return_thunk-0x4
>   1ce: 66 90                   xchg   %ax,%ax
> 
> It does add a bit more code, but the fast path (where the cmpxchg
> succeeds) seems better. That would be:
> 
> 00000000000001a0 :
>   1a0: 8b 47 08                mov    0x8(%rdi),%eax
>   1a3: 48 8d 4f 08             lea    0x8(%rdi),%rcx
>   1a7: 89 c2                   mov    %eax,%edx
>   1a9: 81 ca 00 00 10 00       or     $0x100000,%edx
>   1af: f0 0f b1 57 08          lock cmpxchg %edx,0x8(%rdi)
>   1b4: 75 05                   jne    1bb
>   1b6: e9 00 00 00 00          jmp    1bb
>                    1b7: R_X86_64_PLT32 __x86_return_thunk-0x4
> 
> where there are only two moves and no cmp, whereas the former has three
> moves and a cmp in the fast path.

All well and good, but the stall-warning code is nowhere near a fastpath.
Is try_cmpxchg() considered more readable in this context?

							Thanx, Paul
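For concreteness, here is a minimal user-space sketch of the two idioms the
thread is comparing. It models the kernel's cmpxchg() and try_cmpxchg()
calling conventions on top of C11's atomic_compare_exchange_strong(); the
names my_cmpxchg(), my_try_cmpxchg(), set_flag_cmpxchg(),
set_flag_try_cmpxchg(), and RB_FLAG are made up for illustration and are not
kernel APIs, and this sketches only the loop shapes, not the kernel's
arch-specific implementations.

    #include <stdatomic.h>
    #include <stdbool.h>
    #include <stdio.h>

    /* Hypothetical flag, standing in for the ring-buffer bit above. */
    #define RB_FLAG 0x100000u

    /* Models cmpxchg(): returns the value that was in memory, which the
     * caller must compare against the expected value itself. */
    static unsigned int my_cmpxchg(atomic_uint *ptr, unsigned int old,
                                   unsigned int new)
    {
            atomic_compare_exchange_strong(ptr, &old, new);
            return old;     /* on failure, the value actually observed */
    }

    /* Models try_cmpxchg(): returns success directly and, on failure,
     * writes the observed value back through *old. */
    static bool my_try_cmpxchg(atomic_uint *ptr, unsigned int *old,
                               unsigned int new)
    {
            return atomic_compare_exchange_strong(ptr, old, new);
    }

    /* Old idiom: reload, then compare the CAS return value yourself;
     * on x86 that comparison is the extra cmp in the first listing. */
    static void set_flag_cmpxchg(atomic_uint *ptr)
    {
            unsigned int old;

            do {
                    old = atomic_load(ptr);
            } while (my_cmpxchg(ptr, old, old | RB_FLAG) != old);
    }

    /* New idiom: a failed CAS already hands back the fresh value in
     * 'old', so the retry loop needs no reload and no separate cmp. */
    static void set_flag_try_cmpxchg(atomic_uint *ptr)
    {
            unsigned int old = atomic_load(ptr);

            while (!my_try_cmpxchg(ptr, &old, old | RB_FLAG))
                    ;       /* 'old' was refreshed by the failed CAS */
    }

    int main(void)
    {
            atomic_uint v = 42;

            set_flag_cmpxchg(&v);
            printf("after cmpxchg idiom:     %#x\n", atomic_load(&v));

            atomic_store(&v, 42);
            set_flag_try_cmpxchg(&v);
            printf("after try_cmpxchg idiom: %#x\n", atomic_load(&v));
            return 0;
    }

The structural difference matches the disassembly above: in the
try_cmpxchg() form the failed CAS itself returns the freshly observed value,
so the loop has neither a reload nor a separate compare, and on x86 the
success test comes straight from the ZF flag set by lock cmpxchg.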