2023-09-12 08:15:03

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [PATCH v2 1/2] maple_tree: Disable mas_wr_append() when other readers are possible

On Mon, Sep 11, 2023 at 07:54:52PM -0400, Liam R. Howlett wrote:
> * Paul E. McKenney <[email protected]> [230906 14:03]:
> > On Wed, Sep 06, 2023 at 01:29:54PM -0400, Liam R. Howlett wrote:
> > > * Paul E. McKenney <[email protected]> [230906 13:24]:
> > > > On Wed, Sep 06, 2023 at 11:23:25AM -0400, Liam R. Howlett wrote:
> > > > > (Adding Paul & Shanker to Cc list.. please see below for why)
> > > > >
> > > > > Apologies on the late response, I was away and have been struggling to
> > > > > get a working PPC32 test environment.
> > > > >
> > > > > * Geert Uytterhoeven <[email protected]> [230829 12:42]:
> > > > > > Hi Liam,
> > > > > >
> > > > > > On Fri, 18 Aug 2023, Liam R. Howlett wrote:
> > > > > > > The current implementation of append may cause duplicate data and/or
> > > > > > > incorrect ranges to be returned to a reader during an update. Although
> > > > > > > this has not been reported or seen, disable the append write operation
> > > > > > > while the tree is in rcu mode out of an abundance of caution.
> > > > >
> > > > > ...
> > > > > > >
>
> ...
>
> > > > > > RCU-related configs:
> > > > > >
> > > > > > $ grep RCU .config
> > > > > > # RCU Subsystem
> > > > > > CONFIG_TINY_RCU=y

I must have been asleep last time I looked at this. I was looking at
Tree RCU. Please accept my apologies for my lapse. :-/

However, Tiny RCU's call_rcu() also avoids enabling IRQs, so I would
have said the same thing, albeit after looking at a lot less RCU code.

TL;DR:

1. Try making the __setup_irq() function's call to mutex_lock()
instead be as follows:

if (!mutex_trylock(&desc->request_mutex))
mutex_lock(&desc->request_mutex);

This might fail if __setup_irq() has other dependencies on a
fully operational scheduler.

2. Move that ppc32 call to __setup_irq() much later, most definitely
after interrupts have been enabled and the scheduler is fully
operational. Invoking mutex_lock() before that time is not a
good idea. ;-)

For more detail, please read on!

> > > > > > # CONFIG_RCU_EXPERT is not set
> > > > > > CONFIG_TINY_SRCU=y
> > > > > > # end of RCU Subsystem
> > > > > > # RCU Debugging
> > > > > > # CONFIG_RCU_SCALE_TEST is not set
> > > > > > # CONFIG_RCU_TORTURE_TEST is not set
> > > > > > # CONFIG_RCU_REF_SCALE_TEST is not set
> > > > > > # CONFIG_RCU_TRACE is not set
> > > > > > # CONFIG_RCU_EQS_DEBUG is not set
> > > > > > # end of RCU Debugging
> > > > >
> > > > > I used the configuration from debian 8 and ran 'make oldconfig' to build
> > > > > my kernel. I have attached the configuration.
>
> ...
>
> > > > > It appears to be something to do with struct maple_tree sparse_irqs. If
> > > > > you drop the rcu flag from that maple tree, then my configuration boots
> > > > > without the warning.
> > > > >
> > > > > I *think* this is because we will reuse a lot more nodes. And I *think*
> > > > > the rcu flag is not needed, since there is a comment about reading the
> > > > > tree being protected by the mutex sparse_irq_lock within the
> > > > > kernel/irq/irqdesc.c file. Shanker, can you comment on that?
> > > > >
> > > > > I wonder if there is a limit to the number of RCU free events before
> > > > > something is triggered to flush them out which could trigger IRQ
> > > > > enabling? Paul, could this be the case?
> > > >
> > > > Are you asking if call_rcu() will re-enable interrupts in the following
> > > > use case?
> > > >
> > > > local_irq_disable();
> > > > call_rcu(&p->rh, my_cb_func);
> > > > local_irq_enable();
>
> I am not.
>
> ...
>
> > > >
> > > > Or am I missing your point?
> > >
> > > This is very early in the boot sequence when interrupts have not been
> > > enabled. What we are seeing is a WARN_ON() that is triggered by
> > > interrupts being enabled before they should be enabled.
> > >
> > > I was wondering if, for example, I called call_rcu() a lot *before*
> > > interrupts were enabled, that something could trigger that would either
> > > enable interrupts or indicate the task needs rescheduling?
> >
> > You aren't doing call_rcu() enough to hit OOM, are you? The actual RCU
> > callback invocations won't happen until some time after the scheduler
> > starts up.
>
> I am not, it's just a detection of IRQs being enabled early.
>
> >
> > > Specifically the rescheduling part is suspect. I tracked down the call
> > > to a mutex_lock() which calls cond_resched(), so could rcu be
> > > 'encouraging' the rcu window by a reschedule request?
> >
> > During boot before interrupts are enabled, RCU has not yet spawned any of
> > its kthreads. Therefore, all of its attempts to do wakeups would notice
> > a NULL task_struct pointer and refrain from actually doing the wakeup.
> > If it did do the wakeup, you would see a NULL-pointer exception. See
> > for example, invoke_rcu_core_kthread(), though that won't happen unless
> > you booted with rcutree.use_softirq=0.
> >
> > Besides, since when did doing a wakeup enable interrupts? That would
> > make it hard to do wakeups from hardware interrupt handlers, not?
>
> Taking the mutex lock in kernel/irq/manage.c __setup_irq() is calling a
> cond_resched().
>
> >From what Michael said [1] in this thread, since something has already
> set TIF_NEED_RESCHED, it will eventually enable interrupts on us.
>
> I've traced this to running call_rcu() in kernel/rcu/tiny.c and
> is_idle_task(current) is true, which means rcu runs:
> /* force scheduling for rcu_qs() */
> resched_cpu(0);
>
> the task is set idle in sched_init() -> init_idle() and never changed,
> afaict.

Yes, because RCU eventually needs a context switch in order to make
a grace period happen. And Maple Tree isn't the only thing invoking
call_rcu() early, so this has been in place for a very long time.

> Removing the RCU option from the maple tree in kernel/irq/irqdesc.c
> fixes the issue by avoiding the maple tree running call_rcu(). I am not
> sure on the locking of the tree so I feel this change may cause other
> issues...also it's before lockdep_init(), so any issue I introduce may
> not be detected.
>
> When CONFIG_DEBUG_ATOMIC_SLEEP is configured, it seems that rcu does the
> same thing, but the IRQs are not enabled on return. So, resched_cpu(0)
> is called, but the IRQs warning of enabled isn't triggered. I failed to
> find a reason why.

Here you mean IRQs being enabled upon return from __setup_irq(), correct?

But yes, __setup_irq() does call mutex_lock(). Which will call
preempt_schedule_common() via might_sleep() and __cond_resched(), even
though that is clearly a very bad thing this early. And let's face it,
the whole purpose of mutex_lock() is to block when needed. And a big
purpose of that might_sleep() is to yell at people invoking this with
interrupts disabled.

And most of the wrappers around __setup_irq() look to be intended
for much later, after interrupts have been enabled. One exception is
setup_percpu_irq(), which says that it is for "the early boot process",
whenever that might be. But this is only invoked from mips in v6.5.

The __request_percpu_irq() function is wrappered by request_percpu_irq(),
and its header comment suggests that it is to be called after there are
multiple CPUs. I am not seeing a call that is obviously from ppc32,
but there are a number of drivers using this with which I am unfamiliar.

The request_percpu_nmi() has to be followed up on each CPU using
prepare_percpu_nmi() and enable_percpu_nmi(), so it is not clear that
it is useful to invoke this before interrupts are enabled. But this is
used by ARM, not ppc32 from what I can see.

So even though I am not seeing how ppc32 invokes __setup_irq() early,
your testing clearly indicates that it is doing so.

> I am not entirely sure what makes ppc32 different than other platforms
> in that the initial task is configured to an idle task and the first
> call to call_rcu (tiny!) would cause the observed behaviour.

Maybe something like this in __setup_irq(), right before the
mutex_lock()?

WARN_ON_ONCE(irqs_disabled());

This will dump the stack trace showing how __setup_irq() is being invoked
in early boot on ppc32.

Again, given that __setup_irq() invokes mutex_lock(), invoking this
function in its current form before interrupts have been enabled is a
bad idea.

> Non-tiny rcu calls (as I am sure you know, but others may not)
> kernel/rcu/tree.c which in turn calls __call_rcu_common(). That
> function is far more complex than the tiny version. Maybe it's part of
> why we see different behaviour based on platforms? I don't see an idle
> check in that version of call_rcu().

Tree RCU can rely on the grace-period kthread starting up later in boot.
This kthread will handle any need for a grace period from early boot
invocation of call_rcu(). Tiny RCU does not have such a kthread, hence
the resched_cpu(). Which is fine, as long as nothing (incorrectly!)
invokes a blocking operation before the scheduler has been initialized.

> Or maybe PPC32 has something set incorrectly to cause this failure in
> early boot and I've just found something that needs to be set
> differently?

This might be a failure of imagination on my part, but I certainly don't
see why ppc32 should expect to get away with invoking __setup_irq()
anywhere near that early! ;-)

But if ppc32 cannot be adjusted so as to invoke __setup_irq() later,
the following change to __setup_irq() might work:

if (!mutex_trylock(&desc->request_mutex))
mutex_lock(&desc->request_mutex);

The reason for "might" rather than "will" is that __setup_irq() might
well have other dependencies on the scheduler being up and running.

It would be cleaner for that ppc32 call to __setup_irq() to be moved
later, after the scheduler is up and running.

> > But why not put some WARN_ON_ONCE(!irqs_disabled()) calls in the areas
> > of greatest suspicion, starting from the stack trace generated by that
> > mutex_lock()? A stray interrupt-enable could be pretty much anywhere.
> >
> > But where are those call_rcu() invocations? Before rcu_init()?
>
> During init_IRQ(), which is after rcu_init() but before rcu_init_nohz(),
> srcu_init(), and softirq_init() in init/main.c start_kernel().

That is a perfectly legal time and place for a call_rcu() invocation.
It is also legal before rcu_init(), but it has to be late enough that
things like per-CPU variables are accessible. Which is still very early.

> > Presumably before init is spawned and the early_init() calls.
> >
> > And what is the RCU-related Kconfig and boot-parameter setup?
>
> The .config was attached to the email I sent, and it matches what was
> quoted above in the "RCU-related configs" section.
>
> [1] https://lore.kernel.org/linux-mm/[email protected]/

Again, apologies for having misread that .config snippet!!!

Thanx, Paul


2023-09-12 08:36:29

by Geert Uytterhoeven

[permalink] [raw]
Subject: Re: [PATCH v2 1/2] maple_tree: Disable mas_wr_append() when other readers are possible

Hi Paul,

On Tue, Sep 12, 2023 at 10:14 AM Paul E. McKenney <[email protected]> wrote:
> On Mon, Sep 11, 2023 at 07:54:52PM -0400, Liam R. Howlett wrote:
> > * Paul E. McKenney <[email protected]> [230906 14:03]:
> > > On Wed, Sep 06, 2023 at 01:29:54PM -0400, Liam R. Howlett wrote:
> > > > * Paul E. McKenney <[email protected]> [230906 13:24]:
> > > > > On Wed, Sep 06, 2023 at 11:23:25AM -0400, Liam R. Howlett wrote:
> > > > > > (Adding Paul & Shanker to Cc list.. please see below for why)
> > > > > >
> > > > > > Apologies on the late response, I was away and have been struggling to
> > > > > > get a working PPC32 test environment.
> > > > > >
> > > > > > * Geert Uytterhoeven <[email protected]> [230829 12:42]:
> > > > > > > Hi Liam,
> > > > > > >
> > > > > > > On Fri, 18 Aug 2023, Liam R. Howlett wrote:
> > > > > > > > The current implementation of append may cause duplicate data and/or
> > > > > > > > incorrect ranges to be returned to a reader during an update. Although
> > > > > > > > this has not been reported or seen, disable the append write operation
> > > > > > > > while the tree is in rcu mode out of an abundance of caution.
> > > > > >
> > > > > > ...
> > > > > > > >
> >
> > ...
> >
> > > > > > > RCU-related configs:
> > > > > > >
> > > > > > > $ grep RCU .config
> > > > > > > # RCU Subsystem
> > > > > > > CONFIG_TINY_RCU=y
>
> I must have been asleep last time I looked at this. I was looking at
> Tree RCU. Please accept my apologies for my lapse. :-/
>
> However, Tiny RCU's call_rcu() also avoids enabling IRQs, so I would
> have said the same thing, albeit after looking at a lot less RCU code.
>
> TL;DR:
>
> 1. Try making the __setup_irq() function's call to mutex_lock()
> instead be as follows:
>
> if (!mutex_trylock(&desc->request_mutex))
> mutex_lock(&desc->request_mutex);
>
> This might fail if __setup_irq() has other dependencies on a
> fully operational scheduler.
>
> 2. Move that ppc32 call to __setup_irq() much later, most definitely
> after interrupts have been enabled and the scheduler is fully
> operational. Invoking mutex_lock() before that time is not a
> good idea. ;-)

There is no call to __setup_irq() from arch/powerpc/?

Note that there are (possibly different) issues seen on ppc32 and on arm32
(Renesas RZ/A in particular, but not on other Renesas ARM systems).

I saw an issue on arm32 with cfeb6ae8bcb96ccf, but not with cfeb6ae8bcb96ccf^.
Other people saw an issue on ppc32 with both cfeb6ae8bcb96ccf and
cfeb6ae8bcb96ccf^.

Thanks!

Gr{oetje,eeting}s,

Geert

--
Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- [email protected]

In personal conversations with technical people, I call myself a hacker. But
when I'm talking to journalists I just say "programmer" or something like that.
-- Linus Torvalds

2023-09-12 09:13:42

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [PATCH v2 1/2] maple_tree: Disable mas_wr_append() when other readers are possible

On Tue, Sep 12, 2023 at 10:23:37AM +0200, Geert Uytterhoeven wrote:
> Hi Paul,
>
> On Tue, Sep 12, 2023 at 10:14 AM Paul E. McKenney <[email protected]> wrote:
> > On Mon, Sep 11, 2023 at 07:54:52PM -0400, Liam R. Howlett wrote:
> > > * Paul E. McKenney <[email protected]> [230906 14:03]:
> > > > On Wed, Sep 06, 2023 at 01:29:54PM -0400, Liam R. Howlett wrote:
> > > > > * Paul E. McKenney <[email protected]> [230906 13:24]:
> > > > > > On Wed, Sep 06, 2023 at 11:23:25AM -0400, Liam R. Howlett wrote:
> > > > > > > (Adding Paul & Shanker to Cc list.. please see below for why)
> > > > > > >
> > > > > > > Apologies on the late response, I was away and have been struggling to
> > > > > > > get a working PPC32 test environment.
> > > > > > >
> > > > > > > * Geert Uytterhoeven <[email protected]> [230829 12:42]:
> > > > > > > > Hi Liam,
> > > > > > > >
> > > > > > > > On Fri, 18 Aug 2023, Liam R. Howlett wrote:
> > > > > > > > > The current implementation of append may cause duplicate data and/or
> > > > > > > > > incorrect ranges to be returned to a reader during an update. Although
> > > > > > > > > this has not been reported or seen, disable the append write operation
> > > > > > > > > while the tree is in rcu mode out of an abundance of caution.
> > > > > > >
> > > > > > > ...
> > > > > > > > >
> > >
> > > ...
> > >
> > > > > > > > RCU-related configs:
> > > > > > > >
> > > > > > > > $ grep RCU .config
> > > > > > > > # RCU Subsystem
> > > > > > > > CONFIG_TINY_RCU=y
> >
> > I must have been asleep last time I looked at this. I was looking at
> > Tree RCU. Please accept my apologies for my lapse. :-/
> >
> > However, Tiny RCU's call_rcu() also avoids enabling IRQs, so I would
> > have said the same thing, albeit after looking at a lot less RCU code.
> >
> > TL;DR:
> >
> > 1. Try making the __setup_irq() function's call to mutex_lock()
> > instead be as follows:
> >
> > if (!mutex_trylock(&desc->request_mutex))
> > mutex_lock(&desc->request_mutex);
> >
> > This might fail if __setup_irq() has other dependencies on a
> > fully operational scheduler.
> >
> > 2. Move that ppc32 call to __setup_irq() much later, most definitely
> > after interrupts have been enabled and the scheduler is fully
> > operational. Invoking mutex_lock() before that time is not a
> > good idea. ;-)
>
> There is no call to __setup_irq() from arch/powerpc/?

Glad it is not just me, given that I didn't see a direct call, either. So
later in this email, I asked Liam to put a WARN_ON_ONCE(irqs_disabled())
just before that mutex_lock() in __setup_irq().

Either way, invoking mutex_lock() early in boot before interrupts have
been enabled is a bad idea. ;-)

> Note that there are (possibly different) issues seen on ppc32 and on arm32
> (Renesas RZ/A in particular, but not on other Renesas ARM systems).
>
> I saw an issue on arm32 with cfeb6ae8bcb96ccf, but not with cfeb6ae8bcb96ccf^.
> Other people saw an issue on ppc32 with both cfeb6ae8bcb96ccf and
> cfeb6ae8bcb96ccf^.

I look forward to hearing what is the issue in both cases.

Thanx, Paul

> Thanks!
>
> Gr{oetje,eeting}s,
>
> Geert
>
> --
> Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- [email protected]
>
> In personal conversations with technical people, I call myself a hacker. But
> when I'm talking to journalists I just say "programmer" or something like that.
> -- Linus Torvalds

2023-09-13 01:28:29

by Geert Uytterhoeven

[permalink] [raw]
Subject: Re: [PATCH v2 1/2] maple_tree: Disable mas_wr_append() when other readers are possible

Hi Paul,

On Tue, Sep 12, 2023 at 10:30 AM Paul E. McKenney <[email protected]> wrote:
> On Tue, Sep 12, 2023 at 10:23:37AM +0200, Geert Uytterhoeven wrote:
> > On Tue, Sep 12, 2023 at 10:14 AM Paul E. McKenney <[email protected]> wrote:
> > > On Mon, Sep 11, 2023 at 07:54:52PM -0400, Liam R. Howlett wrote:
> > > > * Paul E. McKenney <[email protected]> [230906 14:03]:
> > > > > On Wed, Sep 06, 2023 at 01:29:54PM -0400, Liam R. Howlett wrote:
> > > > > > * Paul E. McKenney <[email protected]> [230906 13:24]:
> > > > > > > On Wed, Sep 06, 2023 at 11:23:25AM -0400, Liam R. Howlett wrote:
> > > > > > > > (Adding Paul & Shanker to Cc list.. please see below for why)
> > > > > > > >
> > > > > > > > Apologies on the late response, I was away and have been struggling to
> > > > > > > > get a working PPC32 test environment.
> > > > > > > >
> > > > > > > > * Geert Uytterhoeven <[email protected]> [230829 12:42]:
> > > > > > > > > Hi Liam,
> > > > > > > > >
> > > > > > > > > On Fri, 18 Aug 2023, Liam R. Howlett wrote:
> > > > > > > > > > The current implementation of append may cause duplicate data and/or
> > > > > > > > > > incorrect ranges to be returned to a reader during an update. Although
> > > > > > > > > > this has not been reported or seen, disable the append write operation
> > > > > > > > > > while the tree is in rcu mode out of an abundance of caution.
> > > > > > > >
> > > > > > > > ...
> > > > > > > > > >
> > > >
> > > > ...
> > > >
> > > > > > > > > RCU-related configs:
> > > > > > > > >
> > > > > > > > > $ grep RCU .config
> > > > > > > > > # RCU Subsystem
> > > > > > > > > CONFIG_TINY_RCU=y
> > >
> > > I must have been asleep last time I looked at this. I was looking at
> > > Tree RCU. Please accept my apologies for my lapse. :-/
> > >
> > > However, Tiny RCU's call_rcu() also avoids enabling IRQs, so I would
> > > have said the same thing, albeit after looking at a lot less RCU code.
> > >
> > > TL;DR:
> > >
> > > 1. Try making the __setup_irq() function's call to mutex_lock()
> > > instead be as follows:
> > >
> > > if (!mutex_trylock(&desc->request_mutex))
> > > mutex_lock(&desc->request_mutex);
> > >
> > > This might fail if __setup_irq() has other dependencies on a
> > > fully operational scheduler.
> > >
> > > 2. Move that ppc32 call to __setup_irq() much later, most definitely
> > > after interrupts have been enabled and the scheduler is fully
> > > operational. Invoking mutex_lock() before that time is not a
> > > good idea. ;-)
> >
> > There is no call to __setup_irq() from arch/powerpc/?
>
> Glad it is not just me, given that I didn't see a direct call, either. So
> later in this email, I asked Liam to put a WARN_ON_ONCE(irqs_disabled())
> just before that mutex_lock() in __setup_irq().
>
> Either way, invoking mutex_lock() early in boot before interrupts have
> been enabled is a bad idea. ;-)

I'll add that WARN_ON_ONCE() too, and will report back later today...

> > Note that there are (possibly different) issues seen on ppc32 and on arm32
> > (Renesas RZ/A in particular, but not on other Renesas ARM systems).
> >
> > I saw an issue on arm32 with cfeb6ae8bcb96ccf, but not with cfeb6ae8bcb96ccf^.
> > Other people saw an issue on ppc32 with both cfeb6ae8bcb96ccf and
> > cfeb6ae8bcb96ccf^.
>
> I look forward to hearing what is the issue in both cases.

For RZ/A, my problem report is
https://lore.kernel.org/all/[email protected]/

Thanks!

Gr{oetje,eeting}s,

Geert

--
Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- [email protected]

In personal conversations with technical people, I call myself a hacker. But
when I'm talking to journalists I just say "programmer" or something like that.
-- Linus Torvalds

2023-09-16 20:17:59

by Christophe Leroy

[permalink] [raw]
Subject: Re: [PATCH v2 1/2] maple_tree: Disable mas_wr_append() when other readers are possible



Le 12/09/2023 à 10:14, Paul E. McKenney a écrit :
> On Mon, Sep 11, 2023 at 07:54:52PM -0400, Liam R. Howlett wrote:
>> * Paul E. McKenney <[email protected]> [230906 14:03]:
>>> On Wed, Sep 06, 2023 at 01:29:54PM -0400, Liam R. Howlett wrote:
>>>> * Paul E. McKenney <[email protected]> [230906 13:24]:
>>>>> On Wed, Sep 06, 2023 at 11:23:25AM -0400, Liam R. Howlett wrote:
>>>>>> (Adding Paul & Shanker to Cc list.. please see below for why)
>>>>>>
>>>>>> Apologies on the late response, I was away and have been struggling to
>>>>>> get a working PPC32 test environment.
>>>>>>
>>>>>> * Geert Uytterhoeven <[email protected]> [230829 12:42]:
>>>>>>> Hi Liam,
>>>>>>>
>>>>>>> On Fri, 18 Aug 2023, Liam R. Howlett wrote:
>>>>>>>> The current implementation of append may cause duplicate data and/or
>>>>>>>> incorrect ranges to be returned to a reader during an update. Although
>>>>>>>> this has not been reported or seen, disable the append write operation
>>>>>>>> while the tree is in rcu mode out of an abundance of caution.
>>>>>>
>>>>>> ...
>>>>>>>>
>>
>> ...
>>
>>>>>>> RCU-related configs:
>>>>>>>
>>>>>>> $ grep RCU .config
>>>>>>> # RCU Subsystem
>>>>>>> CONFIG_TINY_RCU=y
>
> I must have been asleep last time I looked at this. I was looking at
> Tree RCU. Please accept my apologies for my lapse. :-/
>
> However, Tiny RCU's call_rcu() also avoids enabling IRQs, so I would
> have said the same thing, albeit after looking at a lot less RCU code.
>
> TL;DR:
>
> 1. Try making the __setup_irq() function's call to mutex_lock()
> instead be as follows:
>
> if (!mutex_trylock(&desc->request_mutex))
> mutex_lock(&desc->request_mutex);
>
> This might fail if __setup_irq() has other dependencies on a
> fully operational scheduler.
>
> 2. Move that ppc32 call to __setup_irq() much later, most definitely
> after interrupts have been enabled and the scheduler is fully
> operational. Invoking mutex_lock() before that time is not a
> good idea. ;-)
>
> For more detail, please read on!
>
>>>>>>> # CONFIG_RCU_EXPERT is not set
>>>>>>> CONFIG_TINY_SRCU=y
>>>>>>> # end of RCU Subsystem
>>>>>>> # RCU Debugging
>>>>>>> # CONFIG_RCU_SCALE_TEST is not set
>>>>>>> # CONFIG_RCU_TORTURE_TEST is not set
>>>>>>> # CONFIG_RCU_REF_SCALE_TEST is not set
>>>>>>> # CONFIG_RCU_TRACE is not set
>>>>>>> # CONFIG_RCU_EQS_DEBUG is not set
>>>>>>> # end of RCU Debugging
>>>>>>
>>>>>> I used the configuration from debian 8 and ran 'make oldconfig' to build
>>>>>> my kernel. I have attached the configuration.
>>
>> ...
>>
>>>>>> It appears to be something to do with struct maple_tree sparse_irqs. If
>>>>>> you drop the rcu flag from that maple tree, then my configuration boots
>>>>>> without the warning.
>>>>>>
>>>>>> I *think* this is because we will reuse a lot more nodes. And I *think*
>>>>>> the rcu flag is not needed, since there is a comment about reading the
>>>>>> tree being protected by the mutex sparse_irq_lock within the
>>>>>> kernel/irq/irqdesc.c file. Shanker, can you comment on that?
>>>>>>
>>>>>> I wonder if there is a limit to the number of RCU free events before
>>>>>> something is triggered to flush them out which could trigger IRQ
>>>>>> enabling? Paul, could this be the case?
>>>>>
>>>>> Are you asking if call_rcu() will re-enable interrupts in the following
>>>>> use case?
>>>>>
>>>>> local_irq_disable();
>>>>> call_rcu(&p->rh, my_cb_func);
>>>>> local_irq_enable();
>>
>> I am not.
>>
>> ...
>>
>>>>>
>>>>> Or am I missing your point?
>>>>
>>>> This is very early in the boot sequence when interrupts have not been
>>>> enabled. What we are seeing is a WARN_ON() that is triggered by
>>>> interrupts being enabled before they should be enabled.
>>>>
>>>> I was wondering if, for example, I called call_rcu() a lot *before*
>>>> interrupts were enabled, that something could trigger that would either
>>>> enable interrupts or indicate the task needs rescheduling?
>>>
>>> You aren't doing call_rcu() enough to hit OOM, are you? The actual RCU
>>> callback invocations won't happen until some time after the scheduler
>>> starts up.
>>
>> I am not, it's just a detection of IRQs being enabled early.
>>
>>>
>>>> Specifically the rescheduling part is suspect. I tracked down the call
>>>> to a mutex_lock() which calls cond_resched(), so could rcu be
>>>> 'encouraging' the rcu window by a reschedule request?
>>>
>>> During boot before interrupts are enabled, RCU has not yet spawned any of
>>> its kthreads. Therefore, all of its attempts to do wakeups would notice
>>> a NULL task_struct pointer and refrain from actually doing the wakeup.
>>> If it did do the wakeup, you would see a NULL-pointer exception. See
>>> for example, invoke_rcu_core_kthread(), though that won't happen unless
>>> you booted with rcutree.use_softirq=0.
>>>
>>> Besides, since when did doing a wakeup enable interrupts? That would
>>> make it hard to do wakeups from hardware interrupt handlers, not?
>>
>> Taking the mutex lock in kernel/irq/manage.c __setup_irq() is calling a
>> cond_resched().
>>
>> >From what Michael said [1] in this thread, since something has already
>> set TIF_NEED_RESCHED, it will eventually enable interrupts on us.
>>
>> I've traced this to running call_rcu() in kernel/rcu/tiny.c and
>> is_idle_task(current) is true, which means rcu runs:
>> /* force scheduling for rcu_qs() */
>> resched_cpu(0);
>>
>> the task is set idle in sched_init() -> init_idle() and never changed,
>> afaict.
>
> Yes, because RCU eventually needs a context switch in order to make
> a grace period happen. And Maple Tree isn't the only thing invoking
> call_rcu() early, so this has been in place for a very long time.
>
>> Removing the RCU option from the maple tree in kernel/irq/irqdesc.c
>> fixes the issue by avoiding the maple tree running call_rcu(). I am not
>> sure on the locking of the tree so I feel this change may cause other
>> issues...also it's before lockdep_init(), so any issue I introduce may
>> not be detected.
>>
>> When CONFIG_DEBUG_ATOMIC_SLEEP is configured, it seems that rcu does the
>> same thing, but the IRQs are not enabled on return. So, resched_cpu(0)
>> is called, but the IRQs warning of enabled isn't triggered. I failed to
>> find a reason why.
>
> Here you mean IRQs being enabled upon return from __setup_irq(), correct?
>
> But yes, __setup_irq() does call mutex_lock(). Which will call
> preempt_schedule_common() via might_sleep() and __cond_resched(), even
> though that is clearly a very bad thing this early. And let's face it,
> the whole purpose of mutex_lock() is to block when needed. And a big
> purpose of that might_sleep() is to yell at people invoking this with
> interrupts disabled.
>
> And most of the wrappers around __setup_irq() look to be intended
> for much later, after interrupts have been enabled. One exception is
> setup_percpu_irq(), which says that it is for "the early boot process",
> whenever that might be. But this is only invoked from mips in v6.5.
>
> The __request_percpu_irq() function is wrappered by request_percpu_irq(),
> and its header comment suggests that it is to be called after there are
> multiple CPUs. I am not seeing a call that is obviously from ppc32,
> but there are a number of drivers using this with which I am unfamiliar.
>
> The request_percpu_nmi() has to be followed up on each CPU using
> prepare_percpu_nmi() and enable_percpu_nmi(), so it is not clear that
> it is useful to invoke this before interrupts are enabled. But this is
> used by ARM, not ppc32 from what I can see.
>
> So even though I am not seeing how ppc32 invokes __setup_irq() early,
> your testing clearly indicates that it is doing so.
>
>> I am not entirely sure what makes ppc32 different than other platforms
>> in that the initial task is configured to an idle task and the first
>> call to call_rcu (tiny!) would cause the observed behaviour.
>
> Maybe something like this in __setup_irq(), right before the
> mutex_lock()?
>
> WARN_ON_ONCE(irqs_disabled());
>
> This will dump the stack trace showing how __setup_irq() is being invoked
> in early boot on ppc32.
>
> Again, given that __setup_irq() invokes mutex_lock(), invoking this
> function in its current form before interrupts have been enabled is a
> bad idea.
>

No trigger of that WARN_ON() I added in __setup_irq() as instructed
above, still getting (pmac32_defconfig on MAC99 QEMU)

------------[ cut here ]------------
Interrupts were enabled early
WARNING: CPU: 0 PID: 0 at init/main.c:992 start_kernel+0x4d8/0x5c0
Modules linked in:
CPU: 0 PID: 0 Comm: swapper Not tainted 6.6.0-rc1-dirty #481
Hardware name: PowerMac3,1 7400 0xc0209 PowerMac
NIP: c0a6052c LR: c0a6052c CTR: 00000000
REGS: c0c4dee0 TRAP: 0700 Not tainted (6.6.0-rc1-dirty)
MSR: 00029032 <EE,ME,IR,DR,RI> CR: 24000282 XER: 20000000

GPR00: c0a6052c c0c4dfa0 c0b92580 0000001d c0b9d128 00000001 c0b9d148
3ffffdff
GPR08: c0ba80f0 00000000 00000000 3ffffe00 44000282 00000000 00000000
0199abfc
GPR16: 0199b0a4 7fde7fa4 7fc5ac0c 000000bb 41000000 01c690c8 c0b92014
c09b4bdc
GPR24: c0c55220 c0ac0000 00000000 efff9109 efff9100 0000000a c0c6d000
c0b920a0
NIP [c0a6052c] start_kernel+0x4d8/0x5c0
LR [c0a6052c] start_kernel+0x4d8/0x5c0
Call Trace:
[c0c4dfa0] [c0a6052c] start_kernel+0x4d8/0x5c0 (unreliable)
[c0c4dff0] [00003540] 0x3540
Code: 480037b1 48023c05 4bab88ed 90620260 480139e9 4b657ced 7d2000a6
71298000 41a20014 3c60c09a 3863b78c 4b5e9ccd <0fe00000> 39200000
99380008 7d2000a6
---[ end trace 0000000000000000 ]---