Received-SPF: pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 23.128.96.18 as permitted sender) client-ip=23.128.96.18;
Date:   Sat, 25 Jul 2020 12:39:09 -0700
From:   "Paul E. McKenney" <paulmck@kernel.org>
To:     Peter Zijlstra <peterz@infradead.org>
Cc:     Marco Elver <elver@google.com>,
        Andrey Konovalov <andreyknvl@google.com>,
        Alexander Potapenko <glider@google.com>,
        Dmitry Vyukov <dvyukov@google.com>,
        kasan-dev <kasan-dev@googlegroups.com>,
        LKML <linux-kernel@vger.kernel.org>
Subject: Re: [PATCH] kcsan: Add option to allow watcher interruptions
Message-ID: <20200725193909.GB9247@paulmck-ThinkPad-P72>
Reply-To: paulmck@kernel.org
References: <20200220141551.166537-1-elver@google.com>
 <20200220185855.GY2935@paulmck-ThinkPad-P72>
 <20200220213317.GA35033@google.com>
 <20200725145623.GZ9247@paulmck-ThinkPad-P72>
 <CANpmjNPhuvrhRHAiuv2Zju1VNSe7dO0aaYn+1TB99OF2Hv0S_A@mail.gmail.com>
 <20200725174430.GH10769@hirez.programming.kicks-ass.net>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <20200725174430.GH10769@hirez.programming.kicks-ass.net>
User-Agent: Mutt/1.9.4 (2018-02-28)
Sender: linux-kernel-owner@vger.kernel.org
Precedence: bulk

On Sat, Jul 25, 2020 at 07:44:30PM +0200, Peter Zijlstra wrote:
> On Sat, Jul 25, 2020 at 05:17:43PM +0200, Marco Elver wrote:
> > On Sat, 25 Jul 2020 at 16:56, Paul E. McKenney <paulmck@kernel.org> wrote:
> > > On Thu, Feb 20, 2020 at 10:33:17PM +0100, Marco Elver wrote:
> > > > On Thu, 20 Feb 2020, Paul E. McKenney wrote:
> > > > > On Thu, Feb 20, 2020 at 03:15:51PM +0100, Marco Elver wrote:
> > > > > > Add option to allow interrupts while a watchpoint is set up. This can be
> > > > > > enabled either via CONFIG_KCSAN_INTERRUPT_WATCHER or via the boot
> > > > > > parameter 'kcsan.interrupt_watcher=1'.
> > [...]
> > > > > > As an example, the first data race that this found:
> > > > > >
> > > > > > write to 0xffff88806b3324b8 of 4 bytes by interrupt on cpu 0:
> > > > > >  rcu_preempt_read_enter kernel/rcu/tree_plugin.h:353 [inline]
> > > > > >  __rcu_read_lock+0x3c/0x50 kernel/rcu/tree_plugin.h:373
> > [...]
> > > > > > read to 0xffff88806b3324b8 of 4 bytes by task 6131 on cpu 0:       |
> > > > > >  rcu_preempt_read_enter kernel/rcu/tree_plugin.h:353 [inline]  ----+
> > [...]
> > > > > >
> > > > > > The writer is doing 'current->rcu_read_lock_nesting++'. The read is as
> > > > > > vulnerable to compiler optimizations and would therefore conclude this
> > > > > > is a valid data race.
> > > > >
> > > > > Heh!  That one is a fun one!  It is on a very hot fastpath.  READ_ONCE()
> > > > > and WRITE_ONCE() are likely to be measurable at the system level.
> > > > >
> > > > > Thoughts on other options?
> 
> > > > diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h
> > > > index c6ea81cd41890..e0595abd50c0f 100644
> > > > --- a/kernel/rcu/tree_plugin.h
> > > > +++ b/kernel/rcu/tree_plugin.h
> > > > @@ -350,17 +350,17 @@ static int rcu_preempt_blocked_readers_cgp(struct rcu_node *rnp)
> > > >
> > > >  static void rcu_preempt_read_enter(void)
> > > >  {
> > > > -     current->rcu_read_lock_nesting++;
> > > > +     local_inc(&current->rcu_read_lock_nesting);
> > > >  }
> > > >
> > > >  static void rcu_preempt_read_exit(void)
> > > >  {
> > > > -     current->rcu_read_lock_nesting--;
> > > > +     local_dec(&current->rcu_read_lock_nesting);
> > > >  }
> > > >
> > > >  static void rcu_preempt_depth_set(int val)
> > > >  {
> > > > -     current->rcu_read_lock_nesting = val;
> > > > +     local_set(&current->rcu_read_lock_nesting, val);
> > 
> > > I agree that this removes the data races, and that the code for x86 is
> > > quite nice, but aren't rcu_read_lock() and rcu_read_unlock() going to
> > > have heavyweight atomic operations on many CPUs?
> > >
> > > Maybe I am stuck with arch-specific code in rcu_read_lock() and
> > > rcu_preempt_read_exit().  I suppose worse things could happen.
> > 
> > Peter also mentioned to me that while local_t on x86 generates
> > reasonable code, on other architectures it's terrible. So I think
> > something else is needed, and feel free to discard the above idea.
> > With sufficient enough reasoning, how bad would a 'data_race(..)' be?
> 
> Right, so local_t it atrocious on many architectures, they fall back to
> atomic_t.
> 
> Even architectures that have optimized variants (eg. Power), they're
> quite a lot more expensive than what we actually need here.
> 
> Only architectures like x86 that have single instruction memops can
> generate anywhere near the code that we'd want here.
> 
> So the thing is, since RCU count is 0 per context (an IRQ must have an
> equal amount of rcu_read_unlock() as it has rcu_read_lock()), interrupts
> are not in fact a problem, even on load-store (RISC) architectures
> (preempt_count has the same thing).

True enough!

> So the addition/subtraction in rcu_preempt_read_{enter,exit}() doesn't
> need to be atomic vs interrupts. The only thing we really do need is
> them being single-copy-atomic.
> 
> The problem with READ/WRITE_ONCE is that if we were to use it, we'd end
> up with a load-store, even on x86, which is sub-optimal.

Agreed.

> I suppose the 'correct' code here would be something like:
> 
> 	*((volatile int *)&current->rcu_read_lock_nesting)++;
> 
> then the compiler can either do a single memop (x86 and the like) or a
> load-store that is free from tearing.

Hah!!!  That is the original ACCESS_ONCE(), isn't it?  ;-)

	ACCESS_ONCE(current->rcu_read_lock_nesting)++;

But open-coding makes sense unless a lot of other places need something
similar.  Besides, open-coding allows me to defer bikeshedding on the
name, given that there are actually two accesses.  :-/

Ah, but that gets compiler warnings:

kernel/rcu/tree_plugin.h:354:52: error: lvalue required as increment operand
  *((volatile int *)&current->rcu_read_lock_nesting)++;

Let's try the old ACCESS_ONCE().  Dialing back to v3.0:

#define ACCESS_ONCE(x) (*(volatile typeof(x) *)&(x))

So:
	(*(volatile int *)&(current->rcu_read_lock_nesting))++;

This gets me the following for __rcu_read_lock():

00000000000000e0 <__rcu_read_lock>:
      e0:	48 8b 14 25 00 00 00 	mov    0x0,%rdx
      e7:	00 
      e8:	8b 82 e0 02 00 00    	mov    0x2e0(%rdx),%eax
      ee:	83 c0 01             	add    $0x1,%eax
      f1:	89 82 e0 02 00 00    	mov    %eax,0x2e0(%rdx)
      f7:	c3                   	retq   
      f8:	0f 1f 84 00 00 00 00 	nopl   0x0(%rax,%rax,1)
      ff:	00 

One might hope for a dec instruction, but this isn't bad.  We do lose
a few instructions compared to the C-language case due to differences
in address calculation:

00000000000000e0 <__rcu_read_lock>:
      e0:	48 8b 04 25 00 00 00 	mov    0x0,%rax
      e7:	00 
      e8:	83 80 e0 02 00 00 01 	addl   $0x1,0x2e0(%rax)
      ef:	c3                   	retq   

For the relevant portion of __rcu_read_unlock(), this gets us
the following:

00000000000027f0 <__rcu_read_unlock>:
    27f0:	48 8b 3c 25 00 00 00 	mov    0x0,%rdi
    27f7:	00 
    27f8:	53                   	push   %rbx
    27f9:	8b 87 e0 02 00 00    	mov    0x2e0(%rdi),%eax
    27ff:	8d 50 ff             	lea    -0x1(%rax),%edx
    2802:	85 c0                	test   %eax,%eax
    2804:	89 97 e0 02 00 00    	mov    %edx,0x2e0(%rdi)
    280a:	75 0a                	jne    2816 <__rcu_read_unlock+0x26>

Here we have a load-subtract-store, but given that we need to test
the value, this seems reasonable to me.  We again lose a few instructions
compared to the C-language case, and again due to address calculation:

00000000000027e0 <__rcu_read_unlock>:
    27e0:	53                   	push   %rbx
    27e1:	48 8b 3c 25 00 00 00 	mov    0x0,%rdi
    27e8:	00 
    27e9:	83 af e0 02 00 00 01 	subl   $0x1,0x2e0(%rdi)
    27f0:	75 0a                	jne    27fc <__rcu_read_unlock+0x1c>

Thoughts?

							Thanx, Paul