2020-05-07 22:56:39

by Doug Anderson

[permalink] [raw]
Subject: [PATCH] kgdb: Avoid suspicious RCU usage warning

At times when I'm using kgdb I see a splat on my console about
suspicious RCU usage. I managed to come up with a case that could
reproduce this that looked like this:

WARNING: suspicious RCU usage
5.7.0-rc4+ #609 Not tainted
-----------------------------
kernel/pid.c:395 find_task_by_pid_ns() needs rcu_read_lock() protection!

other info that might help us debug this:

rcu_scheduler_active = 2, debug_locks = 1
3 locks held by swapper/0/1:
#0: ffffff81b6b8e988 (&dev->mutex){....}-{3:3}, at: __device_attach+0x40/0x13c
#1: ffffffd01109e9e8 (dbg_master_lock){....}-{2:2}, at: kgdb_cpu_enter+0x20c/0x7ac
#2: ffffffd01109ea90 (dbg_slave_lock){....}-{2:2}, at: kgdb_cpu_enter+0x3ec/0x7ac

stack backtrace:
CPU: 7 PID: 1 Comm: swapper/0 Not tainted 5.7.0-rc4+ #609
Hardware name: Google Cheza (rev3+) (DT)
Call trace:
dump_backtrace+0x0/0x1b8
show_stack+0x1c/0x24
dump_stack+0xd4/0x134
lockdep_rcu_suspicious+0xf0/0x100
find_task_by_pid_ns+0x5c/0x80
getthread+0x8c/0xb0
gdb_serial_stub+0x9d4/0xd04
kgdb_cpu_enter+0x284/0x7ac
kgdb_handle_exception+0x174/0x20c
kgdb_brk_fn+0x24/0x30
call_break_hook+0x6c/0x7c
brk_handler+0x20/0x5c
do_debug_exception+0x1c8/0x22c
el1_sync_handler+0x3c/0xe4
el1_sync+0x7c/0x100
rpmh_rsc_probe+0x38/0x420
platform_drv_probe+0x94/0xb4
really_probe+0x134/0x300
driver_probe_device+0x68/0x100
__device_attach_driver+0x90/0xa8
bus_for_each_drv+0x84/0xcc
__device_attach+0xb4/0x13c
device_initial_probe+0x18/0x20
bus_probe_device+0x38/0x98
device_add+0x38c/0x420

If I understand properly we should just be able to blanket kgdb under
one big RCU read lock and the problem should go away. We'll add it to
the beast-of-a-function known as kgdb_cpu_enter().

With this I no longer get any splats and things seem to work fine.

Signed-off-by: Douglas Anderson <[email protected]>
---

kernel/debug/debug_core.c | 5 +++++
1 file changed, 5 insertions(+)

diff --git a/kernel/debug/debug_core.c b/kernel/debug/debug_core.c
index 2b7c9b67931d..5155cf06731b 100644
--- a/kernel/debug/debug_core.c
+++ b/kernel/debug/debug_core.c
@@ -564,6 +564,8 @@ static int kgdb_cpu_enter(struct kgdb_state *ks, struct pt_regs *regs,
int online_cpus = num_online_cpus();
u64 time_left;

+ rcu_read_lock();
+
kgdb_info[ks->cpu].enter_kgdb++;
kgdb_info[ks->cpu].exception_state |= exception_state;

@@ -635,6 +637,7 @@ static int kgdb_cpu_enter(struct kgdb_state *ks, struct pt_regs *regs,
atomic_dec(&slaves_in_kgdb);
dbg_touch_watchdogs();
local_irq_restore(flags);
+ rcu_read_unlock();
return 0;
}
cpu_relax();
@@ -773,6 +776,8 @@ static int kgdb_cpu_enter(struct kgdb_state *ks, struct pt_regs *regs,
dbg_touch_watchdogs();
local_irq_restore(flags);

+ rcu_read_unlock();
+
return kgdb_info[cpu].ret_state;
}

--
2.26.2.645.ge9eca65c58-goog


2020-05-19 10:44:23

by Daniel Thompson

[permalink] [raw]
Subject: Re: [PATCH] kgdb: Avoid suspicious RCU usage warning

On Thu, May 07, 2020 at 03:53:58PM -0700, Douglas Anderson wrote:
> At times when I'm using kgdb I see a splat on my console about
> suspicious RCU usage. I managed to come up with a case that could
> reproduce this that looked like this:
>
> WARNING: suspicious RCU usage
> 5.7.0-rc4+ #609 Not tainted
> -----------------------------
> kernel/pid.c:395 find_task_by_pid_ns() needs rcu_read_lock() protection!
>
> other info that might help us debug this:
>
> rcu_scheduler_active = 2, debug_locks = 1
> 3 locks held by swapper/0/1:
> #0: ffffff81b6b8e988 (&dev->mutex){....}-{3:3}, at: __device_attach+0x40/0x13c
> #1: ffffffd01109e9e8 (dbg_master_lock){....}-{2:2}, at: kgdb_cpu_enter+0x20c/0x7ac
> #2: ffffffd01109ea90 (dbg_slave_lock){....}-{2:2}, at: kgdb_cpu_enter+0x3ec/0x7ac
>
> stack backtrace:
> CPU: 7 PID: 1 Comm: swapper/0 Not tainted 5.7.0-rc4+ #609
> Hardware name: Google Cheza (rev3+) (DT)
> Call trace:
> dump_backtrace+0x0/0x1b8
> show_stack+0x1c/0x24
> dump_stack+0xd4/0x134
> lockdep_rcu_suspicious+0xf0/0x100
> find_task_by_pid_ns+0x5c/0x80
> getthread+0x8c/0xb0
> gdb_serial_stub+0x9d4/0xd04
> kgdb_cpu_enter+0x284/0x7ac
> kgdb_handle_exception+0x174/0x20c
> kgdb_brk_fn+0x24/0x30
> call_break_hook+0x6c/0x7c
> brk_handler+0x20/0x5c
> do_debug_exception+0x1c8/0x22c
> el1_sync_handler+0x3c/0xe4
> el1_sync+0x7c/0x100
> rpmh_rsc_probe+0x38/0x420
> platform_drv_probe+0x94/0xb4
> really_probe+0x134/0x300
> driver_probe_device+0x68/0x100
> __device_attach_driver+0x90/0xa8
> bus_for_each_drv+0x84/0xcc
> __device_attach+0xb4/0x13c
> device_initial_probe+0x18/0x20
> bus_probe_device+0x38/0x98
> device_add+0x38c/0x420
>
> If I understand properly we should just be able to blanket kgdb under
> one big RCU read lock and the problem should go away. We'll add it to
> the beast-of-a-function known as kgdb_cpu_enter().
>
> With this I no longer get any splats and things seem to work fine.
>
> Signed-off-by: Douglas Anderson <[email protected]>

In principle this looks OK but I'm curious why we don't cuddle these
calls up to the local interrupt locking (and also whether we want to
keep hold of the lock during stepping). If nothing else that would make
review easier.


Daniel.


> ---
>
> kernel/debug/debug_core.c | 5 +++++
> 1 file changed, 5 insertions(+)
>
> diff --git a/kernel/debug/debug_core.c b/kernel/debug/debug_core.c
> index 2b7c9b67931d..5155cf06731b 100644
> --- a/kernel/debug/debug_core.c
> +++ b/kernel/debug/debug_core.c
> @@ -564,6 +564,8 @@ static int kgdb_cpu_enter(struct kgdb_state *ks, struct pt_regs *regs,
> int online_cpus = num_online_cpus();
> u64 time_left;
>
> + rcu_read_lock();
> +
> kgdb_info[ks->cpu].enter_kgdb++;
> kgdb_info[ks->cpu].exception_state |= exception_state;
>
> @@ -635,6 +637,7 @@ static int kgdb_cpu_enter(struct kgdb_state *ks, struct pt_regs *regs,
> atomic_dec(&slaves_in_kgdb);
> dbg_touch_watchdogs();
> local_irq_restore(flags);
> + rcu_read_unlock();
> return 0;
> }
> cpu_relax();
> @@ -773,6 +776,8 @@ static int kgdb_cpu_enter(struct kgdb_state *ks, struct pt_regs *regs,
> dbg_touch_watchdogs();
> local_irq_restore(flags);
>
> + rcu_read_unlock();
> +
> return kgdb_info[cpu].ret_state;
> }
>
> --
> 2.26.2.645.ge9eca65c58-goog
>

2020-05-28 00:08:03

by Doug Anderson

[permalink] [raw]
Subject: Re: [PATCH] kgdb: Avoid suspicious RCU usage warning

Hi,

On Tue, May 19, 2020 at 3:41 AM Daniel Thompson
<[email protected]> wrote:
>
> On Thu, May 07, 2020 at 03:53:58PM -0700, Douglas Anderson wrote:
> > At times when I'm using kgdb I see a splat on my console about
> > suspicious RCU usage. I managed to come up with a case that could
> > reproduce this that looked like this:
> >
> > WARNING: suspicious RCU usage
> > 5.7.0-rc4+ #609 Not tainted
> > -----------------------------
> > kernel/pid.c:395 find_task_by_pid_ns() needs rcu_read_lock() protection!
> >
> > other info that might help us debug this:
> >
> > rcu_scheduler_active = 2, debug_locks = 1
> > 3 locks held by swapper/0/1:
> > #0: ffffff81b6b8e988 (&dev->mutex){....}-{3:3}, at: __device_attach+0x40/0x13c
> > #1: ffffffd01109e9e8 (dbg_master_lock){....}-{2:2}, at: kgdb_cpu_enter+0x20c/0x7ac
> > #2: ffffffd01109ea90 (dbg_slave_lock){....}-{2:2}, at: kgdb_cpu_enter+0x3ec/0x7ac
> >
> > stack backtrace:
> > CPU: 7 PID: 1 Comm: swapper/0 Not tainted 5.7.0-rc4+ #609
> > Hardware name: Google Cheza (rev3+) (DT)
> > Call trace:
> > dump_backtrace+0x0/0x1b8
> > show_stack+0x1c/0x24
> > dump_stack+0xd4/0x134
> > lockdep_rcu_suspicious+0xf0/0x100
> > find_task_by_pid_ns+0x5c/0x80
> > getthread+0x8c/0xb0
> > gdb_serial_stub+0x9d4/0xd04
> > kgdb_cpu_enter+0x284/0x7ac
> > kgdb_handle_exception+0x174/0x20c
> > kgdb_brk_fn+0x24/0x30
> > call_break_hook+0x6c/0x7c
> > brk_handler+0x20/0x5c
> > do_debug_exception+0x1c8/0x22c
> > el1_sync_handler+0x3c/0xe4
> > el1_sync+0x7c/0x100
> > rpmh_rsc_probe+0x38/0x420
> > platform_drv_probe+0x94/0xb4
> > really_probe+0x134/0x300
> > driver_probe_device+0x68/0x100
> > __device_attach_driver+0x90/0xa8
> > bus_for_each_drv+0x84/0xcc
> > __device_attach+0xb4/0x13c
> > device_initial_probe+0x18/0x20
> > bus_probe_device+0x38/0x98
> > device_add+0x38c/0x420
> >
> > If I understand properly we should just be able to blanket kgdb under
> > one big RCU read lock and the problem should go away. We'll add it to
> > the beast-of-a-function known as kgdb_cpu_enter().
> >
> > With this I no longer get any splats and things seem to work fine.
> >
> > Signed-off-by: Douglas Anderson <[email protected]>
>
> In principle this looks OK but I'm curious why we don't cuddle these
> calls up to the local interrupt locking (and also whether we want to
> keep hold of the lock during stepping). If nothing else that would make
> review easier.

It probably wouldn't hurt to keep hold of the lock during single
stepping but I don't think there's any real reason we'd want to.
Specifically the only real reason we're calling rcu_read_lock() is to
avoid the warning. Since we're a stop-the-world debugger it's not
like something else could be messing with state at the same time.

I'm looking at the whole function though and I don't really understand
all the comments about interrupts being restored by the 'trap return'
code, do you? Specifically: as far as I can tell we _always_ restore
interrupts when exiting the function. There are only two return
statements and both have "local_irq_restore(flags);" right before
them. We never modify the flags directly and the one other usage of
"flags" is effectively the statement "local_irq_restore(flags);
local_irq_save(flags);" which will, I guess, allow any interrupts that
were already pending to take place. Are you saying that you want me
to match that and do a "rcu_read_unlock(); rcu_read_lock()" there?

If I understand things correctly (and there's maybe a better chance
after I read Wei Li's recent patches) the disabling of IRQs for single
stepping happens in a different way. It looks like we update the
"struct pt_regs" of the task we're stepping so that when we exit kgdb
and start running the task again that the interrupts are off. That
seems reasonable to me and this function has nothing to do with it.

...and further confusion on my part: does the whole saving / restoring
of interrupts in kgdb_cpu_enter() make any sense anyway? Is this
function ever called from a context that's not an interrupt context?
How do we get the pt_regs in that case? Just for fun, I tried doing
this:

local_irq_save(flags);
+ if (!arch_irqs_disabled_flags(flags))
+ pr_warn("I was wrong\n");

...and I never saw "I was wrong" on my system. Maybe it matters for
something not arm64? ...or, maybe, this is from when kgdb worked in a
completely different way?


In general I made my patch by:
* Calling rcu_read_lock() at the start of the function.
* Calling rcu_read_unlock() right before all 2 of the "return" calls of
the function.

...I was hoping that would actually make it easier to reason about
even if the function is a beast.


Hopefully the above makes sense. I wouldn't rule out me just being
utterly confused, but I _think_ I reasoned through it all. ;-) If it
all makes sense, I'm inclined to:

1. Leave my patch the way it is.

2. Perhaps remove the whole irq saving / restoring in kgdb_cpu_enter().


-Doug

2020-06-01 16:22:36

by Daniel Thompson

[permalink] [raw]
Subject: Re: [PATCH] kgdb: Avoid suspicious RCU usage warning

On Wed, May 27, 2020 at 05:02:27PM -0700, Doug Anderson wrote:
> Hi,
>
> On Tue, May 19, 2020 at 3:41 AM Daniel Thompson
> <[email protected]> wrote:
> >
> > On Thu, May 07, 2020 at 03:53:58PM -0700, Douglas Anderson wrote:
> > > At times when I'm using kgdb I see a splat on my console about
> > > suspicious RCU usage. I managed to come up with a case that could
> > > reproduce this that looked like this:
> > >
> > > WARNING: suspicious RCU usage
> > > 5.7.0-rc4+ #609 Not tainted
> > > -----------------------------
> > > kernel/pid.c:395 find_task_by_pid_ns() needs rcu_read_lock() protection!
> > >
> > > other info that might help us debug this:
> > >
> > > rcu_scheduler_active = 2, debug_locks = 1
> > > 3 locks held by swapper/0/1:
> > > #0: ffffff81b6b8e988 (&dev->mutex){....}-{3:3}, at: __device_attach+0x40/0x13c
> > > #1: ffffffd01109e9e8 (dbg_master_lock){....}-{2:2}, at: kgdb_cpu_enter+0x20c/0x7ac
> > > #2: ffffffd01109ea90 (dbg_slave_lock){....}-{2:2}, at: kgdb_cpu_enter+0x3ec/0x7ac
> > >
> > > stack backtrace:
> > > CPU: 7 PID: 1 Comm: swapper/0 Not tainted 5.7.0-rc4+ #609
> > > Hardware name: Google Cheza (rev3+) (DT)
> > > Call trace:
> > > dump_backtrace+0x0/0x1b8
> > > show_stack+0x1c/0x24
> > > dump_stack+0xd4/0x134
> > > lockdep_rcu_suspicious+0xf0/0x100
> > > find_task_by_pid_ns+0x5c/0x80
> > > getthread+0x8c/0xb0
> > > gdb_serial_stub+0x9d4/0xd04
> > > kgdb_cpu_enter+0x284/0x7ac
> > > kgdb_handle_exception+0x174/0x20c
> > > kgdb_brk_fn+0x24/0x30
> > > call_break_hook+0x6c/0x7c
> > > brk_handler+0x20/0x5c
> > > do_debug_exception+0x1c8/0x22c
> > > el1_sync_handler+0x3c/0xe4
> > > el1_sync+0x7c/0x100
> > > rpmh_rsc_probe+0x38/0x420
> > > platform_drv_probe+0x94/0xb4
> > > really_probe+0x134/0x300
> > > driver_probe_device+0x68/0x100
> > > __device_attach_driver+0x90/0xa8
> > > bus_for_each_drv+0x84/0xcc
> > > __device_attach+0xb4/0x13c
> > > device_initial_probe+0x18/0x20
> > > bus_probe_device+0x38/0x98
> > > device_add+0x38c/0x420
> > >
> > > If I understand properly we should just be able to blanket kgdb under
> > > one big RCU read lock and the problem should go away. We'll add it to
> > > the beast-of-a-function known as kgdb_cpu_enter().
> > >
> > > With this I no longer get any splats and things seem to work fine.
> > >
> > > Signed-off-by: Douglas Anderson <[email protected]>
> >
> > In principle this looks OK but I'm curious why we don't cuddle these
> > calls up to the local interrupt locking (and also whether we want to
> > keep hold of the lock during stepping). If nothing else that would make
> > review easier.
>
> It probably wouldn't hurt to keep hold of the lock during single
> stepping but I don't think there's any real reason we'd want to.
> Specifically the only real reason we're calling rcu_read_lock() is to
> avoid the warning. Since we're a stop-the-world debugger it's not
> like something else could be messing with state at the same time.
>
> I'm looking at the whole function though and I don't really understand
> all the comments about interrupts being restored by the 'trap return'
> code, do you?

Not as much as I'd like but, in principle at least, I think the trap
handler is written to make it an architecture decision to whether it
will be entered with interrupt disabled or not (which would in turn
depend on how the architecture manages exception stacks and what else
it uses the breakpoint machinery for).


> Specifically: as far as I can tell we _always_ restore
> interrupts when exiting the function. There are only two return
> statements and both have "local_irq_restore(flags);" right before
> them. We never modify the flags directly and the one other usage of
> "flags" is effectively the statement "local_irq_restore(flags);
> local_irq_save(flags);" which will, I guess, allow any interrupts that
> were already pending to take place. Are you saying that you want me
> to match that and do a "rcu_read_unlock(); rcu_read_lock()" there?
>
> If I understand things correctly (and there's maybe a better chance
> after I read Wei Li's recent patches) the disabling of IRQs for single
> stepping happens in a different way. It looks like we update the
> "struct pt_regs" of the task we're stepping so that when we exit kgdb
> and start running the task again that the interrupts are off. That
> seems reasonable to me and this function has nothing to do with it.
>
> ...and further confusion on my part: does the whole saving / restoring
> of interrupts in kgdb_cpu_enter() make any sense anyway? Is this
> function ever called from a context that's not an interrupt context?
> How do we get the pt_regs in that case? Just for fun, I tried doing
> this:
>
> local_irq_save(flags);
> + if (!arch_irqs_disabled_flags(flags))
> + pr_warn("I was wrong\n");
>
> ...and I never saw "I was wrong" on my system. Maybe it matters for
> something not arm64? ...or, maybe, this is from when kgdb worked in a
> completely different way?

I'm not yet in a position to test kgdb on all architectures. Mostly
this is because it can be hard to discover suitable qemu stanzas to
launch an emulator... and in one case because buildroot doesn't support
cross-gdb yet. Anyhow, right now, it is difficult to tour the niche
architectures to give a definitive answer.


> In general I made my patch by:
> * Calling rcu_read_lock() at the start of the function.
> * Calling rcu_read_unlock() right before all 2 of the "return" calls of
> the function.
>
> ...I was hoping that would actually make it easier to reason about
> even if the function is a beast.
>
>
> Hopefully the above makes sense. I wouldn't rule out me just being
> utterly confused, but I _think_ I reasoned through it all. ;-) If it
> all makes sense, I'm inclined to:
>
> 1. Leave my patch the way it is.

I'm still a little reluctant on this.

It's not that I think the current patch is buggy, more that I view the
interrupt locks as the place in the function that we start to "assert"
what our execution context is going to be. Seeing all the entry/exit
logic in one place makes reasoning easier and should make the eventual
refactoring of "the beast" a little easier too.


> 2. Perhaps remove the whole irq saving / restoring in kgdb_cpu_enter().

Are you feeling lucky?

I think there will come a time when bravery is called for but I'd rather
see this as part of a bigger rewrite instead of a single high risk
change.


Daniel.

2020-06-02 23:16:23

by Doug Anderson

[permalink] [raw]
Subject: Re: [PATCH] kgdb: Avoid suspicious RCU usage warning

Hi,

On Mon, Jun 1, 2020 at 9:20 AM Daniel Thompson
<[email protected]> wrote:
>
> On Wed, May 27, 2020 at 05:02:27PM -0700, Doug Anderson wrote:
> > Hi,
> >
> > On Tue, May 19, 2020 at 3:41 AM Daniel Thompson
> > <[email protected]> wrote:
> > >
> > > On Thu, May 07, 2020 at 03:53:58PM -0700, Douglas Anderson wrote:
> > > > At times when I'm using kgdb I see a splat on my console about
> > > > suspicious RCU usage. I managed to come up with a case that could
> > > > reproduce this that looked like this:
> > > >
> > > > WARNING: suspicious RCU usage
> > > > 5.7.0-rc4+ #609 Not tainted
> > > > -----------------------------
> > > > kernel/pid.c:395 find_task_by_pid_ns() needs rcu_read_lock() protection!
> > > >
> > > > other info that might help us debug this:
> > > >
> > > > rcu_scheduler_active = 2, debug_locks = 1
> > > > 3 locks held by swapper/0/1:
> > > > #0: ffffff81b6b8e988 (&dev->mutex){....}-{3:3}, at: __device_attach+0x40/0x13c
> > > > #1: ffffffd01109e9e8 (dbg_master_lock){....}-{2:2}, at: kgdb_cpu_enter+0x20c/0x7ac
> > > > #2: ffffffd01109ea90 (dbg_slave_lock){....}-{2:2}, at: kgdb_cpu_enter+0x3ec/0x7ac
> > > >
> > > > stack backtrace:
> > > > CPU: 7 PID: 1 Comm: swapper/0 Not tainted 5.7.0-rc4+ #609
> > > > Hardware name: Google Cheza (rev3+) (DT)
> > > > Call trace:
> > > > dump_backtrace+0x0/0x1b8
> > > > show_stack+0x1c/0x24
> > > > dump_stack+0xd4/0x134
> > > > lockdep_rcu_suspicious+0xf0/0x100
> > > > find_task_by_pid_ns+0x5c/0x80
> > > > getthread+0x8c/0xb0
> > > > gdb_serial_stub+0x9d4/0xd04
> > > > kgdb_cpu_enter+0x284/0x7ac
> > > > kgdb_handle_exception+0x174/0x20c
> > > > kgdb_brk_fn+0x24/0x30
> > > > call_break_hook+0x6c/0x7c
> > > > brk_handler+0x20/0x5c
> > > > do_debug_exception+0x1c8/0x22c
> > > > el1_sync_handler+0x3c/0xe4
> > > > el1_sync+0x7c/0x100
> > > > rpmh_rsc_probe+0x38/0x420
> > > > platform_drv_probe+0x94/0xb4
> > > > really_probe+0x134/0x300
> > > > driver_probe_device+0x68/0x100
> > > > __device_attach_driver+0x90/0xa8
> > > > bus_for_each_drv+0x84/0xcc
> > > > __device_attach+0xb4/0x13c
> > > > device_initial_probe+0x18/0x20
> > > > bus_probe_device+0x38/0x98
> > > > device_add+0x38c/0x420
> > > >
> > > > If I understand properly we should just be able to blanket kgdb under
> > > > one big RCU read lock and the problem should go away. We'll add it to
> > > > the beast-of-a-function known as kgdb_cpu_enter().
> > > >
> > > > With this I no longer get any splats and things seem to work fine.
> > > >
> > > > Signed-off-by: Douglas Anderson <[email protected]>
> > >
> > > In principle this looks OK but I'm curious why we don't cuddle these
> > > calls up to the local interrupt locking (and also whether we want to
> > > keep hold of the lock during stepping). If nothing else that would make
> > > review easier.
> >
> > It probably wouldn't hurt to keep hold of the lock during single
> > stepping but I don't think there's any real reason we'd want to.
> > Specifically the only real reason we're calling rcu_read_lock() is to
> > avoid the warning. Since we're a stop-the-world debugger it's not
> > like something else could be messing with state at the same time.
> >
> > I'm looking at the whole function though and I don't really understand
> > all the comments about interrupts being restored by the 'trap return'
> > code, do you?
>
> Not as much as I'd like but, in principle at least, I think the trap
> handler is written to make it an architecture decision to whether it
> will be entered with interrupt disabled or not (which would in turn
> depend on how the architecture manages exception stacks and what else
> it uses the breakpoint machinery for).
>
>
> > Specifically: as far as I can tell we _always_ restore
> > interrupts when exiting the function. There are only two return
> > statements and both have "local_irq_restore(flags);" right before
> > them. We never modify the flags directly and the one other usage of
> > "flags" is effectively the statement "local_irq_restore(flags);
> > local_irq_save(flags);" which will, I guess, allow any interrupts that
> > were already pending to take place. Are you saying that you want me
> > to match that and do a "rcu_read_unlock(); rcu_read_lock()" there?
> >
> > If I understand things correctly (and there's maybe a better chance
> > after I read Wei Li's recent patches) the disabling of IRQs for single
> > stepping happens in a different way. It looks like we update the
> > "struct pt_regs" of the task we're stepping so that when we exit kgdb
> > and start running the task again that the interrupts are off. That
> > seems reasonable to me and this function has nothing to do with it.
> >
> > ...and further confusion on my part: does the whole saving / restoring
> > of interrupts in kgdb_cpu_enter() make any sense anyway? Is this
> > function ever called from a context that's not an interrupt context?
> > How do we get the pt_regs in that case? Just for fun, I tried doing
> > this:
> >
> > local_irq_save(flags);
> > + if (!arch_irqs_disabled_flags(flags))
> > + pr_warn("I was wrong\n");
> >
> > ...and I never saw "I was wrong" on my system. Maybe it matters for
> > something not arm64? ...or, maybe, this is from when kgdb worked in a
> > completely different way?
>
> I'm not yet in a position to test kgdb on all architectures. Mostly
> this is because it can be hard to discover suitable qemu stanzas to
> launch an emulator... and in one case because buildroot doesn't support
> cross-gdb yet. Anyhow, right now, it is difficult to tour the niche
> architectures to give a definitive answer.

OK, fair enough. I guess I just really don't understand how it could
work if an architecture didn't do this, but perhaps it would be clear
if there was code demonstrating it. ;-)


> > In general I made my patch by:
> > * Calling rcu_read_lock() at the start of the function.
> > * Calling rcu_read_unlock() right before all 2 of the "return" calls of
> > the function.
> >
> > ...I was hoping that would actually make it easier to reason about
> > even if the function is a beast.
> >
> >
> > Hopefully the above makes sense. I wouldn't rule out me just being
> > utterly confused, but I _think_ I reasoned through it all. ;-) If it
> > all makes sense, I'm inclined to:
> >
> > 1. Leave my patch the way it is.
>
> I'm still a little reluctant on this.
>
> It's not that I think the current patch is buggy, more that I view the
> interrupt locks as the place in the function that we start to "assert"
> what our execution context is going to be. Seeing all the entry/exit
> logic in one place makes reasoning easier and should make the eventual
> refactoring of "the beast" a little easier too.

OK, I give in. Posted v2 doing it how I think you want it.


> > 2. Perhaps remove the whole irq saving / restoring in kgdb_cpu_enter().
>
> Are you feeling lucky?
>
> I think there will come a time when bravery is called for but I'd rather
> see this as part of a bigger rewrite instead of a single high risk
> change.

Hrm, maybe. I guess it depends on whether we want to take baby steps
there or try to do it all at once. If we take baby steps we will
occasionally fall down but we'll slowly start getting things cleaned
up. If we wait for a full rewrite then we might be waiting for a long
time. It'll also be harder to figure out which of the big changes in
the major rewrite broken someone. ...or if the major rewrite comes in
20 small/bisectable patches it may be hard to revert patch 2 out of 20
if the future patches all build upon it. If we do one small high-risk
change and then wait before building upon it then it'll be easy for
someone to bisect and then yell for a revert.


-Doug

2020-06-03 12:03:58

by Daniel Thompson

[permalink] [raw]
Subject: Re: [PATCH] kgdb: Avoid suspicious RCU usage warning

On Tue, Jun 02, 2020 at 03:56:33PM -0700, Doug Anderson wrote:
> > > 2. Perhaps remove the whole irq saving / restoring in kgdb_cpu_enter().
> >
> > Are you feeling lucky?
> >
> > I think there will come a time when bravery is called for but I'd rather
> > see this as part of a bigger rewrite instead of a single high risk
> > change.
>
> Hrm, maybe. I guess it depends on whether we want to take baby steps
> there or try to do it all at once. If we take baby steps we will
> occasionally fall down but we'll slowly start getting things cleaned
> up. If we wait for a full rewrite then we might be waiting for a long
> time. It'll also be harder to figure out which of the big changes in
> the major rewrite broken someone. ...or if the major rewrite comes in
> 20 small/bisectable patches it may be hard to revert patch 2 out of 20
> if the future patches all build upon it. If we do one small high-risk
> change and then wait before building upon it then it'll be easy for
> someone to bisect and then yell for a revert.

My views are a bit too nuanced for me to agree or disagree with this.
I'm not against baby steps and I definitely *don't* want kgdb to
continue to be preserved in aspic.

However I'm still reluctant to start our baby steps with a "let's see
if this breaks something" patch given we know it could be a very large
number of kernel cycles before we get an answer. I would be much
happier if those baby steps started, for example, with refactoring to
decompose the beast into clearer (and dare I say better documented)
functions.

Or put another way, even if someone sent me 20 small bisectable patches
in a single kernel cycle I'd still want the high risk bits to be
towards the end of the patch set.


Daniel.

2020-06-04 17:49:46

by Doug Anderson

[permalink] [raw]
Subject: Re: [PATCH] kgdb: Avoid suspicious RCU usage warning

Hi,

On Wed, Jun 3, 2020 at 5:00 AM Daniel Thompson
<[email protected]> wrote:
>
> On Tue, Jun 02, 2020 at 03:56:33PM -0700, Doug Anderson wrote:
> > > > 2. Perhaps remove the whole irq saving / restoring in kgdb_cpu_enter().
> > >
> > > Are you feeling lucky?
> > >
> > > I think there will come a time when bravery is called for but I'd rather
> > > see this as part of a bigger rewrite instead of a single high risk
> > > change.
> >
> > Hrm, maybe. I guess it depends on whether we want to take baby steps
> > there or try to do it all at once. If we take baby steps we will
> > occasionally fall down but we'll slowly start getting things cleaned
> > up. If we wait for a full rewrite then we might be waiting for a long
> > time. It'll also be harder to figure out which of the big changes in
> > the major rewrite broken someone. ...or if the major rewrite comes in
> > 20 small/bisectable patches it may be hard to revert patch 2 out of 20
> > if the future patches all build upon it. If we do one small high-risk
> > change and then wait before building upon it then it'll be easy for
> > someone to bisect and then yell for a revert.
>
> My views are a bit too nuanced for me to agree or disagree with this.
> I'm not against baby steps and I definitely *don't* want kgdb to
> continue to be preserved in aspic.
>
> However I'm still reluctant to start our baby steps with a "let's see
> if this breaks something" patch given we know it could be a very large
> number of kernel cycles before we get an answer.

Yeah, it's kinda amazing how much of a delay there is sometimes.
Perhaps all of the kgdb users are off on downstream kernels so they
only notice changes when they re-sync up.


> I would be much
> happier if those baby steps started, for example, with refactoring to
> decompose the beast into clearer (and dare I say better documented)
> functions.

OK, makes sense.


> Or put another way, even if someone sent me 20 small bisectable patches
> in a single kernel cycle I'd still want the high risk bits to be
> towards the end of the patch set.

OK, fair enough. You're the maintainer so it's your view on the
matter that's the most important. I guess one worry I have is that if
neither you nor I really understand the code path that a theoretical
system would take if it didn't already have its interrupts disabled
it'll be hard to keep from breaking it in making other changes to kgdb
in the future.

Anyway, enough discussion for now. ;-) My v2 should work like you
suggested and I'm not planning on any other short term changes to this
function, so we should be all good right now.

-Doug