2020-05-28 11:54:00

by Joerg Vehlow

[permalink] [raw]
Subject: [BUG RT] dump-capture kernel not executed for panic in interrupt context

Hi,

I think I found a bug in the kernel with rt patches (or maybe even without).
This applies to all kernels propably starting at 2.6.27.

When a kernel panic is triggered from an interrupt handler, the dump-capture
kernel is not started, instead the system acts as if it was not installed.
The reason for this is, that panic calls __crash_kexec, which is protected
by a mutex. On an rt kernel this mutex is an rt mutex and when trylock
is called
on an rt mutex, the first check is whether the current kthread is in an
nmi or
irq handler. If it is, the function just returns 0 -> locking failed.

According to rt_mutex_trylock documentation, it is not allowed to call this
function from an irq handler, but panic can be called from everywhere
and thus
rt_mutex_trylock can be called from everywhere. Actually even
mutex_trylock has
the comment, that it is not supposed to be used from interrupt context,
but it
still locks the mutex. I guess this could also be a bug in the non-rt
kernel.

I found this problem using a test module, that triggers the softlock
detection.
It is a pretty simple module, that creates a kthread, that disables
preemption,
spins 60 seconds in an endless loop and then reenables preemption and
terminates
the thread. This reliably triggers the softlock detection and if
kernel.softlockup_panic=0, the system resumes perfectly fine afterwards. If
kernel.softlockup_panic=1 I would expect the dump-capture kernel to be
executed,
but it is not due to the bug (without rt patches it works), instead the
panic
function is executed until the end to the endless loop.


A stacktrace captured at the trylock call inside kexec_code looks like this:
#0  __rt_mutex_trylock (lock=0xffffffff81701aa0 <kexec_mutex>) at
/usr/src/kernel/kernel/locking/rtmutex.c:2110
#1  0xffffffff8087601a in _mutex_trylock (lock=<optimised out>) at
/usr/src/kernel/kernel/locking/mutex-rt.c:185
#2  0xffffffff803022a0 in __crash_kexec (regs=0x0 <irq_stack_union>) at
/usr/src/kernel/kernel/kexec_core.c:941
#3  0xffffffff8027af59 in panic (fmt=0xffffffff80fa3d66 "softlockup:
hung tasks") at /usr/src/kernel/kernel/panic.c:198
#4  0xffffffff80325b6d in watchdog_timer_fn (hrtimer=<optimised out>) at
/usr/src/kernel/kernel/watchdog.c:464
#5  0xffffffff802e6b90 in __run_hrtimer (flags=<optimised out>,
now=<optimised out>, timer=<optimised out>, base=<optimised out>,
cpu_base=<optimised out>) at /usr/src/kernel/kernel/time/hrtimer.c:1417
#6  __hrtimer_run_queues (cpu_base=0xffff88807db1c000, now=<optimised
out>, flags=<optimised out>, active_mask=<optimised out>) at
/usr/src/kernel/kernel/time/hrtimer.c:1479
#7  0xffffffff802e7704 in hrtimer_interrupt (dev=<optimised out>) at
/usr/src/kernel/kernel/time/hrtimer.c:1539
#8  0xffffffff80a020f2 in local_apic_timer_interrupt () at
/usr/src/kernel/arch/x86/kernel/apic/apic.c:1067
#9  smp_apic_timer_interrupt (regs=<optimised out>) at
/usr/src/kernel/arch/x86/kernel/apic/apic.c:1092
#10 0xffffffff80a015df in apic_timer_interrupt () at
/usr/src/kernel/arch/x86/entry/entry_64.S:909


Obviously and as expected the panic was triggered in the context of the apic
interrupt. So in_irq() is true and trylock fails.


About 12 years ago this was not implemented using a mutex, but using xchg.
See: 8c5a1cf0ad3ac5fcdf51314a63b16a440870f6a2


Since my knowledege about mutexes inside the kernel is very limited, I
do not
know how this can be fixed and whether it should be fixed in the rt
patches or
if this really is a bug in mainline kernel (because trylock is also not
allowed
to be used in interrupt handlers.


Jörg


2020-05-28 12:48:53

by Steven Rostedt

[permalink] [raw]
Subject: Re: [BUG RT] dump-capture kernel not executed for panic in interrupt context


Hi Joerg,

This does look like Andrew's commit (from 2008) is buggy (and this is a
mainline bug, not an RT one). (top posting this so Andrew knows to look
further ;-)

On Thu, 28 May 2020 13:41:08 +0200
Joerg Vehlow <[email protected]> wrote:

> Hi,
>
> I think I found a bug in the kernel with rt patches (or maybe even without).
> This applies to all kernels propably starting at 2.6.27.
>
> When a kernel panic is triggered from an interrupt handler, the dump-capture
> kernel is not started, instead the system acts as if it was not installed.
> The reason for this is, that panic calls __crash_kexec, which is protected
> by a mutex. On an rt kernel this mutex is an rt mutex and when trylock
> is called
> on an rt mutex, the first check is whether the current kthread is in an
> nmi or
> irq handler. If it is, the function just returns 0 -> locking failed.
>
> According to rt_mutex_trylock documentation, it is not allowed to call this
> function from an irq handler, but panic can be called from everywhere
> and thus
> rt_mutex_trylock can be called from everywhere. Actually even
> mutex_trylock has
> the comment, that it is not supposed to be used from interrupt context,
> but it
> still locks the mutex. I guess this could also be a bug in the non-rt
> kernel.
>
> I found this problem using a test module, that triggers the softlock
> detection.
> It is a pretty simple module, that creates a kthread, that disables
> preemption,
> spins 60 seconds in an endless loop and then reenables preemption and
> terminates
> the thread. This reliably triggers the softlock detection and if
> kernel.softlockup_panic=0, the system resumes perfectly fine afterwards. If
> kernel.softlockup_panic=1 I would expect the dump-capture kernel to be
> executed,
> but it is not due to the bug (without rt patches it works), instead the
> panic
> function is executed until the end to the endless loop.
>
>
> A stacktrace captured at the trylock call inside kexec_code looks like this:
> #0  __rt_mutex_trylock (lock=0xffffffff81701aa0 <kexec_mutex>) at
> /usr/src/kernel/kernel/locking/rtmutex.c:2110
> #1  0xffffffff8087601a in _mutex_trylock (lock=<optimised out>) at
> /usr/src/kernel/kernel/locking/mutex-rt.c:185
> #2  0xffffffff803022a0 in __crash_kexec (regs=0x0 <irq_stack_union>) at
> /usr/src/kernel/kernel/kexec_core.c:941
> #3  0xffffffff8027af59 in panic (fmt=0xffffffff80fa3d66 "softlockup:
> hung tasks") at /usr/src/kernel/kernel/panic.c:198
> #4  0xffffffff80325b6d in watchdog_timer_fn (hrtimer=<optimised out>) at
> /usr/src/kernel/kernel/watchdog.c:464
> #5  0xffffffff802e6b90 in __run_hrtimer (flags=<optimised out>,
> now=<optimised out>, timer=<optimised out>, base=<optimised out>,
> cpu_base=<optimised out>) at /usr/src/kernel/kernel/time/hrtimer.c:1417
> #6  __hrtimer_run_queues (cpu_base=0xffff88807db1c000, now=<optimised
> out>, flags=<optimised out>, active_mask=<optimised out>) at
> /usr/src/kernel/kernel/time/hrtimer.c:1479
> #7  0xffffffff802e7704 in hrtimer_interrupt (dev=<optimised out>) at
> /usr/src/kernel/kernel/time/hrtimer.c:1539
> #8  0xffffffff80a020f2 in local_apic_timer_interrupt () at
> /usr/src/kernel/arch/x86/kernel/apic/apic.c:1067
> #9  smp_apic_timer_interrupt (regs=<optimised out>) at
> /usr/src/kernel/arch/x86/kernel/apic/apic.c:1092
> #10 0xffffffff80a015df in apic_timer_interrupt () at
> /usr/src/kernel/arch/x86/entry/entry_64.S:909
>
>
> Obviously and as expected the panic was triggered in the context of the apic
> interrupt. So in_irq() is true and trylock fails.
>
>
> About 12 years ago this was not implemented using a mutex, but using xchg.
> See: 8c5a1cf0ad3ac5fcdf51314a63b16a440870f6a2

Yes, that commit is wrong, because mutex_trylock() is not to be taken in
interrupt context, where crash_kexec() looks like it can be called.

Unless back then crash_kexec() wasn't called in interrupt context, then the
commit that calls it from that combined with this commit is the issue.

-- Steve

>
>
> Since my knowledege about mutexes inside the kernel is very limited, I
> do not
> know how this can be fixed and whether it should be fixed in the rt
> patches or
> if this really is a bug in mainline kernel (because trylock is also not
> allowed
> to be used in interrupt handlers.
>
>
> Jörg

2020-07-22 04:44:07

by Joerg Vehlow

[permalink] [raw]
Subject: Re: [BUG RT] dump-capture kernel not executed for panic in interrupt context

Hi Andrew,

it's been two month now and no reaction from you. Maybe you did not see
this mail from Steven.
Please look at this issue.

Greets,

Jörg

On 5/28/2020 2:46 PM, Steven Rostedt wrote:
> Hi Joerg,
>
> This does look like Andrew's commit (from 2008) is buggy (and this is a
> mainline bug, not an RT one). (top posting this so Andrew knows to look
> further ;-)
>
> On Thu, 28 May 2020 13:41:08 +0200
> Joerg Vehlow <[email protected]> wrote:
>
>> Hi,
>>
>> I think I found a bug in the kernel with rt patches (or maybe even without).
>> This applies to all kernels propably starting at 2.6.27.
>>
>> When a kernel panic is triggered from an interrupt handler, the dump-capture
>> kernel is not started, instead the system acts as if it was not installed.
>> The reason for this is, that panic calls __crash_kexec, which is protected
>> by a mutex. On an rt kernel this mutex is an rt mutex and when trylock
>> is called
>> on an rt mutex, the first check is whether the current kthread is in an
>> nmi or
>> irq handler. If it is, the function just returns 0 -> locking failed.
>>
>> According to rt_mutex_trylock documentation, it is not allowed to call this
>> function from an irq handler, but panic can be called from everywhere
>> and thus
>> rt_mutex_trylock can be called from everywhere. Actually even
>> mutex_trylock has
>> the comment, that it is not supposed to be used from interrupt context,
>> but it
>> still locks the mutex. I guess this could also be a bug in the non-rt
>> kernel.
>>
>> I found this problem using a test module, that triggers the softlock
>> detection.
>> It is a pretty simple module, that creates a kthread, that disables
>> preemption,
>> spins 60 seconds in an endless loop and then reenables preemption and
>> terminates
>> the thread. This reliably triggers the softlock detection and if
>> kernel.softlockup_panic=0, the system resumes perfectly fine afterwards. If
>> kernel.softlockup_panic=1 I would expect the dump-capture kernel to be
>> executed,
>> but it is not due to the bug (without rt patches it works), instead the
>> panic
>> function is executed until the end to the endless loop.
>>
>>
>> A stacktrace captured at the trylock call inside kexec_code looks like this:
>> #0  __rt_mutex_trylock (lock=0xffffffff81701aa0 <kexec_mutex>) at
>> /usr/src/kernel/kernel/locking/rtmutex.c:2110
>> #1  0xffffffff8087601a in _mutex_trylock (lock=<optimised out>) at
>> /usr/src/kernel/kernel/locking/mutex-rt.c:185
>> #2  0xffffffff803022a0 in __crash_kexec (regs=0x0 <irq_stack_union>) at
>> /usr/src/kernel/kernel/kexec_core.c:941
>> #3  0xffffffff8027af59 in panic (fmt=0xffffffff80fa3d66 "softlockup:
>> hung tasks") at /usr/src/kernel/kernel/panic.c:198
>> #4  0xffffffff80325b6d in watchdog_timer_fn (hrtimer=<optimised out>) at
>> /usr/src/kernel/kernel/watchdog.c:464
>> #5  0xffffffff802e6b90 in __run_hrtimer (flags=<optimised out>,
>> now=<optimised out>, timer=<optimised out>, base=<optimised out>,
>> cpu_base=<optimised out>) at /usr/src/kernel/kernel/time/hrtimer.c:1417
>> #6  __hrtimer_run_queues (cpu_base=0xffff88807db1c000, now=<optimised
>> out>, flags=<optimised out>, active_mask=<optimised out>) at
>> /usr/src/kernel/kernel/time/hrtimer.c:1479
>> #7  0xffffffff802e7704 in hrtimer_interrupt (dev=<optimised out>) at
>> /usr/src/kernel/kernel/time/hrtimer.c:1539
>> #8  0xffffffff80a020f2 in local_apic_timer_interrupt () at
>> /usr/src/kernel/arch/x86/kernel/apic/apic.c:1067
>> #9  smp_apic_timer_interrupt (regs=<optimised out>) at
>> /usr/src/kernel/arch/x86/kernel/apic/apic.c:1092
>> #10 0xffffffff80a015df in apic_timer_interrupt () at
>> /usr/src/kernel/arch/x86/entry/entry_64.S:909
>>
>>
>> Obviously and as expected the panic was triggered in the context of the apic
>> interrupt. So in_irq() is true and trylock fails.
>>
>>
>> About 12 years ago this was not implemented using a mutex, but using xchg.
>> See: 8c5a1cf0ad3ac5fcdf51314a63b16a440870f6a2
> Yes, that commit is wrong, because mutex_trylock() is not to be taken in
> interrupt context, where crash_kexec() looks like it can be called.
>
> Unless back then crash_kexec() wasn't called in interrupt context, then the
> commit that calls it from that combined with this commit is the issue.
>
> -- Steve
>
>>
>> Since my knowledege about mutexes inside the kernel is very limited, I
>> do not
>> know how this can be fixed and whether it should be fixed in the rt
>> patches or
>> if this really is a bug in mainline kernel (because trylock is also not
>> allowed
>> to be used in interrupt handlers.
>>
>>
>> Jörg

2020-07-22 20:52:19

by Steven Rostedt

[permalink] [raw]
Subject: Re: [BUG RT] dump-capture kernel not executed for panic in interrupt context

On Wed, 22 Jul 2020 06:30:53 +0200
Joerg Vehlow <[email protected]> wrote:

> Hi Andrew,
>
> it's been two month now and no reaction from you. Maybe you did not see
> this mail from Steven.
> Please look at this issue.
>

Perhaps you need to send the report again without the RT (just [BUG])
to get Andrew's attention ;-)

-- Steve

2020-07-27 23:39:30

by Andrew Morton

[permalink] [raw]
Subject: Re: [BUG RT] dump-capture kernel not executed for panic in interrupt context

On Wed, 22 Jul 2020 06:30:53 +0200 Joerg Vehlow <[email protected]> wrote:

> >> About 12 years ago this was not implemented using a mutex, but using xchg.
> >> See: 8c5a1cf0ad3ac5fcdf51314a63b16a440870f6a2
> > Yes, that commit is wrong, because mutex_trylock() is not to be taken in
> > interrupt context, where crash_kexec() looks like it can be called.

Yup, mutex_trylock() from interrupt is improper. Well dang, that's a
bit silly. Presumably the 2006 spin_lock_mutex() wasn't taken with
irqs-off.

Ho hum, did you look at switching the kexec code back to the xchg
approach?

2020-08-21 10:28:58

by Joerg Vehlow

[permalink] [raw]
Subject: Re: [BUG RT] dump-capture kernel not executed for panic in interrupt context

Hi Andrew and Others (please read at least the part with @RT developers),

> Yup, mutex_trylock() from interrupt is improper. Well dang, that's a
> bit silly. Presumably the 2006 spin_lock_mutex() wasn't taken with
> irqs-off.
>
> Ho hum, did you look at switching the kexec code back to the xchg
> approach?
>
I looked into reverting to the xchg approach, but that seems to be
not a good solution anymore, because the mutex is used in many places,
a lot with waiting locks and I guess that would require spinning now,
if we do this with bare xchg.

Instead I thought about using a spinlock, because they are supposed
to be used in interrupt context as well, if I understand the documentation
correctly ([1]).
@RT developers
Unfortunately the rt patches seem to interpret it a bit different and
spin_trylock uses __rt_mutex_trylock again, with the same consequences as
with the current code.

I tried raw_spinlocks, but it looks like they result in a deadlock at
least in the rt kernel. Thiy may be because of memory allocations in the
critical sections, that are not allowed if I understand it correctly.

I have no clue how to fix it at this point.

Jörg

[1] https://kernel.readthedocs.io/en/sphinx-samples/kernel-locking.html

2020-08-21 15:10:23

by Steven Rostedt

[permalink] [raw]
Subject: Re: [BUG RT] dump-capture kernel not executed for panic in interrupt context

On Fri, 21 Aug 2020 12:25:33 +0200
Joerg Vehlow <[email protected]> wrote:

> Hi Andrew and Others (please read at least the part with @RT developers),
>
> > Yup, mutex_trylock() from interrupt is improper. Well dang, that's a
> > bit silly. Presumably the 2006 spin_lock_mutex() wasn't taken with
> > irqs-off.
> >
> > Ho hum, did you look at switching the kexec code back to the xchg
> > approach?
> >
> I looked into reverting to the xchg approach, but that seems to be
> not a good solution anymore, because the mutex is used in many places,
> a lot with waiting locks and I guess that would require spinning now,
> if we do this with bare xchg.
>
> Instead I thought about using a spinlock, because they are supposed
> to be used in interrupt context as well, if I understand the documentation
> correctly ([1]).
> @RT developers
> Unfortunately the rt patches seem to interpret it a bit different and
> spin_trylock uses __rt_mutex_trylock again, with the same consequences as
> with the current code.
>
> I tried raw_spinlocks, but it looks like they result in a deadlock at
> least in the rt kernel. Thiy may be because of memory allocations in the
> critical sections, that are not allowed if I understand it correctly.
>
> I have no clue how to fix it at this point.
>
> Jörg
>
> [1] https://kernel.readthedocs.io/en/sphinx-samples/kernel-locking.html

There's only two places that wait on the mutex, and all other places
try to get it, and if it fails, it simply exits.

What I would do is introduce a kexec_busy counter, and have something
like this:

For the two locations that actually wait on the mutex:

loop:
mutex_lock(&kexec_mutex);
ret = atomic_inc_return(&kexec_busy);
if (ret > 1) {
/* Atomic context is busy on this counter, spin */
atomic_dec(&kexec_busy);
mutex_unlock(&kexec_mutex);
goto loop;
}
[..]
atomic_dec(&kexec_busy);
mutex_unlock(&kexec_mutex);

And then all the other places that do the trylock:

cant_sleep();
ret = atomic_inc_return(&kexec_busy);
if (ret > 1) {
atomic_dec(&kexec_busy);
return;
}
[..]
atomic_dec(&kexec_busy);


-- Steve

2020-08-21 20:51:46

by Andrew Morton

[permalink] [raw]
Subject: Re: [BUG RT] dump-capture kernel not executed for panic in interrupt context

On Fri, 21 Aug 2020 11:08:48 -0400 Steven Rostedt <[email protected]> wrote:

> On Fri, 21 Aug 2020 12:25:33 +0200
> Joerg Vehlow <[email protected]> wrote:
>
> > Hi Andrew and Others (please read at least the part with @RT developers),
> >
> > > Yup, mutex_trylock() from interrupt is improper. Well dang, that's a
> > > bit silly. Presumably the 2006 spin_lock_mutex() wasn't taken with
> > > irqs-off.
> > >
> > > Ho hum, did you look at switching the kexec code back to the xchg
> > > approach?
> > >
> > I looked into reverting to the xchg approach, but that seems to be
> > not a good solution anymore, because the mutex is used in many places,
> > a lot with waiting locks and I guess that would require spinning now,
> > if we do this with bare xchg.
> >
> > Instead I thought about using a spinlock, because they are supposed
> > to be used in interrupt context as well, if I understand the documentation
> > correctly ([1]).
> > @RT developers
> > Unfortunately the rt patches seem to interpret it a bit different and
> > spin_trylock uses __rt_mutex_trylock again, with the same consequences as
> > with the current code.
> >
> > I tried raw_spinlocks, but it looks like they result in a deadlock at
> > least in the rt kernel. Thiy may be because of memory allocations in the
> > critical sections, that are not allowed if I understand it correctly.
> >
> > I have no clue how to fix it at this point.
> >
> > J?rg
> >
> > [1] https://kernel.readthedocs.io/en/sphinx-samples/kernel-locking.html
>
> There's only two places that wait on the mutex, and all other places
> try to get it, and if it fails, it simply exits.
>
> What I would do is introduce a kexec_busy counter, and have something
> like this:
>
> For the two locations that actually wait on the mutex:
>
> loop:
> mutex_lock(&kexec_mutex);
> ret = atomic_inc_return(&kexec_busy);
> if (ret > 1) {
> /* Atomic context is busy on this counter, spin */
> atomic_dec(&kexec_busy);
> mutex_unlock(&kexec_mutex);
> goto loop;
> }
> [..]
> atomic_dec(&kexec_busy);
> mutex_unlock(&kexec_mutex);
>
> And then all the other places that do the trylock:
>
> cant_sleep();
> ret = atomic_inc_return(&kexec_busy);
> if (ret > 1) {
> atomic_dec(&kexec_busy);
> return;
> }
> [..]
> atomic_dec(&kexec_busy);

Aw gee. Hide all this in include/linux/rostedt_lock.h...

Sigh. Is it too hard to make mutex_trylock() usable from interrupt
context?

2020-08-21 21:06:57

by Steven Rostedt

[permalink] [raw]
Subject: Re: [BUG RT] dump-capture kernel not executed for panic in interrupt context

On Fri, 21 Aug 2020 13:47:53 -0700
Andrew Morton <[email protected]> wrote:

> On Fri, 21 Aug 2020 11:08:48 -0400 Steven Rostedt <[email protected]> wrote:
>
> > On Fri, 21 Aug 2020 12:25:33 +0200
> > Joerg Vehlow <[email protected]> wrote:
> >
> > > Hi Andrew and Others (please read at least the part with @RT developers),
> > >
> > > > Yup, mutex_trylock() from interrupt is improper. Well dang, that's a
> > > > bit silly. Presumably the 2006 spin_lock_mutex() wasn't taken with
> > > > irqs-off.
> > > >
> > > > Ho hum, did you look at switching the kexec code back to the xchg
> > > > approach?
> > > >
> > > I looked into reverting to the xchg approach, but that seems to be
> > > not a good solution anymore, because the mutex is used in many places,
> > > a lot with waiting locks and I guess that would require spinning now,
> > > if we do this with bare xchg.
> > >
> > > Instead I thought about using a spinlock, because they are supposed
> > > to be used in interrupt context as well, if I understand the documentation
> > > correctly ([1]).
> > > @RT developers
> > > Unfortunately the rt patches seem to interpret it a bit different and
> > > spin_trylock uses __rt_mutex_trylock again, with the same consequences as
> > > with the current code.
> > >
> > > I tried raw_spinlocks, but it looks like they result in a deadlock at
> > > least in the rt kernel. Thiy may be because of memory allocations in the
> > > critical sections, that are not allowed if I understand it correctly.
> > >
> > > I have no clue how to fix it at this point.
> > >
> > > Jörg
> > >
> > > [1] https://kernel.readthedocs.io/en/sphinx-samples/kernel-locking.html
> >
> > There's only two places that wait on the mutex, and all other places
> > try to get it, and if it fails, it simply exits.
> >
> > What I would do is introduce a kexec_busy counter, and have something
> > like this:
> >
> > For the two locations that actually wait on the mutex:
> >
> > loop:
> > mutex_lock(&kexec_mutex);
> > ret = atomic_inc_return(&kexec_busy);
> > if (ret > 1) {
> > /* Atomic context is busy on this counter, spin */
> > atomic_dec(&kexec_busy);
> > mutex_unlock(&kexec_mutex);
> > goto loop;
> > }
> > [..]
> > atomic_dec(&kexec_busy);
> > mutex_unlock(&kexec_mutex);
> >
> > And then all the other places that do the trylock:
> >
> > cant_sleep();
> > ret = atomic_inc_return(&kexec_busy);
> > if (ret > 1) {
> > atomic_dec(&kexec_busy);
> > return;
> > }
> > [..]
> > atomic_dec(&kexec_busy);
>
> Aw gee. Hide all this in include/linux/rostedt_lock.h...

Heh, if this was the way to go, I would have definitely recommended
packaging that up in static inline functions in some local header. Not
necessarily rostedt_lock.h, but I'll use that if people let me :-)

>
> Sigh. Is it too hard to make mutex_trylock() usable from interrupt
> context?


That's a question for Thomas and Peter Z.

-- Steve

2020-08-22 12:39:46

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [BUG RT] dump-capture kernel not executed for panic in interrupt context

On Fri, Aug 21, 2020 at 05:03:34PM -0400, Steven Rostedt wrote:

> > Sigh. Is it too hard to make mutex_trylock() usable from interrupt
> > context?
>
>
> That's a question for Thomas and Peter Z.

You should really know that too, the TL;DR answer is it's fundamentally
buggered, can't work.

The problem is that RT relies on being able to PI boost the mutex owner.

ISTR we had a thread about all this last year or so, let me see if I can
find that.

Here goes:

https://lkml.kernel.org/r/[email protected]

Kexec even gets mentioned:

https://lkml.kernel.org/r/[email protected]


2020-08-23 00:40:27

by Steven Rostedt

[permalink] [raw]
Subject: Re: [BUG RT] dump-capture kernel not executed for panic in interrupt context

On Sat, 22 Aug 2020 14:32:52 +0200
[email protected] wrote:

> On Fri, Aug 21, 2020 at 05:03:34PM -0400, Steven Rostedt wrote:
>
> > > Sigh. Is it too hard to make mutex_trylock() usable from interrupt
> > > context?
> >
> >
> > That's a question for Thomas and Peter Z.
>
> You should really know that too, the TL;DR answer is it's fundamentally
> buggered, can't work.

I knew there was an issue but I couldn't remember the reasoning, and
figured you could easily answer it without having to look back at the
code.

>
> The problem is that RT relies on being able to PI boost the mutex owner.
>
> ISTR we had a thread about all this last year or so, let me see if I can
> find that.
>
> Here goes:
>
> https://lkml.kernel.org/r/[email protected]

From this email:

> The problem happens when that owner is the idle task, this can happen
> when the irq/softirq hits the idle task, in that case the contending
> mutex_lock() will try and PI boost the idle task, and that is a big
> no-no.

What's wrong with priority boosting the idle task? It's not obvious,
and I can't find comments in the code saying it would be bad.

I looked around the code to see if I could find "why this is bad" but
couldn't find it. There's lots of places that say "Do not use
mutex_trylock in interrupt context, the implementation is not safe to
do so" but I can't find where it says "why" it is not safe to do so.

The idle task is not mentioned at all in rtmutex.c and not mentioned in
kernel/locking except for some comments about RCU in lockdep.

I see that in the idle code the prio_change method does a BUG(), but
there's no comment to say why it does so.

The commit that added that BUG, doesn't explain why it can't happen:

a8941d7ec8167 ("sched: Simplify the idle scheduling class")

I may have once known the rationale behind all this, but it's been a
long time since I worked on the PI code, and it's out of my cache.


-- Steve

2020-09-07 10:53:21

by Joerg Vehlow

[permalink] [raw]
Subject: Re: [BUG RT] dump-capture kernel not executed for panic in interrupt context

Hi,

I guess there is currently no other way than to use something like Steven
proposed. I implemented and tested the attached patch with a module,
that triggers the soft lockup detection and it works as expected.
I did not use inline functions, but normal function implemented in
kexec_core,
because there is nothing time critical here.
I also added the mutex_lock to the trylock variant, because then the unlock
function can be the same for both lock functions.

What do you think?

Jörg

---
 kernel/kexec.c                     |  8 ++---
 kernel/kexec_core.c                | 55 ++++++++++++++++++++++++------
 kernel/kexec_file.c                |  6 ++--
 kernel/kexec_internal.h            |  6 +++-
 security/integrity/ima/ima_kexec.c |  2 +-
 5 files changed, 58 insertions(+), 19 deletions(-)

diff --git a/kernel/kexec.c b/kernel/kexec.c
index f977786fe498..118a012aeac2 100644
--- a/kernel/kexec.c
+++ b/kernel/kexec.c
@@ -255,12 +255,12 @@ SYSCALL_DEFINE4(kexec_load, unsigned long, entry,
unsigned long, nr_segments,
      *
      * KISS: always take the mutex.
      */
-    if (!mutex_trylock(&kexec_mutex))
+    if (!kexec_trylock())
         return -EBUSY;

     result = do_kexec_load(entry, nr_segments, segments, flags);

-    mutex_unlock(&kexec_mutex);
+    kexec_unlock();

     return result;
 }
@@ -309,12 +309,12 @@ COMPAT_SYSCALL_DEFINE4(kexec_load, compat_ulong_t,
entry,
      *
      * KISS: always take the mutex.
      */
-    if (!mutex_trylock(&kexec_mutex))
+    if (!kexec_trylock())
         return -EBUSY;

     result = do_kexec_load(entry, nr_segments, ksegments, flags);

-    mutex_unlock(&kexec_mutex);
+    kexec_unlock();

     return result;
 }
diff --git a/kernel/kexec_core.c b/kernel/kexec_core.c
index c19c0dad1ebe..a8337e6114fb 100644
--- a/kernel/kexec_core.c
+++ b/kernel/kexec_core.c
@@ -45,7 +45,8 @@
 #include <crypto/sha.h>
 #include "kexec_internal.h"

-DEFINE_MUTEX(kexec_mutex);
+static atomic_t kexec_busy = ATOMIC_INIT(0);
+static DEFINE_MUTEX(kexec_mutex);

 /* Per cpu memory for storing cpu states in case of system crash. */
 note_buf_t __percpu *crash_notes;
@@ -70,6 +71,40 @@ struct resource crashk_low_res = {
     .desc  = IORES_DESC_CRASH_KERNEL
 };

+int kexec_trylock(void)
+{
+    int ret = atomic_inc_return(&kexec_busy);
+    if (ret > 1) {
+        atomic_dec(&kexec_busy);
+        return 0;
+    }
+    mutex_lock(&kexec_mutex);
+    return 1;
+}
+
+void kexec_lock(void)
+{
+    int ret;
+    /* Prevent spinning, if the lock is busy */
+    mutex_lock(&kexec_mutex);
+    while (1) {
+        ret = atomic_inc_return(&kexec_busy);
+        if (ret > 1) {
+            atomic_dec(&kexec_busy);
+            mutex_unlock(&kexec_mutex);
+        } else {
+            break;
+        }
+    }
+}
+
+void kexec_unlock(void)
+{
+    BUG_ON(atomic_read(&kexec_busy) <= 0);
+    atomic_dec(&kexec_busy);
+    mutex_unlock(&kexec_mutex);
+}
+
 int kexec_should_crash(struct task_struct *p)
 {
     /*
@@ -943,7 +978,7 @@ int kexec_load_disabled;
  */
 void __noclone __crash_kexec(struct pt_regs *regs)
 {
-    /* Take the kexec_mutex here to prevent sys_kexec_load
+    /* Take the kexec_lock here to prevent sys_kexec_load
      * running on one cpu from replacing the crash kernel
      * we are using after a panic on a different cpu.
      *
@@ -951,7 +986,7 @@ void __noclone __crash_kexec(struct pt_regs *regs)
      * of memory the xchg(&kexec_crash_image) would be
      * sufficient.  But since I reuse the memory...
      */
-    if (mutex_trylock(&kexec_mutex)) {
+    if (kexec_trylock()) {
         if (kexec_crash_image) {
             struct pt_regs fixed_regs;

@@ -960,7 +995,7 @@ void __noclone __crash_kexec(struct pt_regs *regs)
             machine_crash_shutdown(&fixed_regs);
             machine_kexec(kexec_crash_image);
         }
-        mutex_unlock(&kexec_mutex);
+        kexec_unlock();
     }
 }
 STACK_FRAME_NON_STANDARD(__crash_kexec);
@@ -993,10 +1028,10 @@ size_t crash_get_memory_size(void)
 {
     size_t size = 0;

-    mutex_lock(&kexec_mutex);
+    kexec_lock();
     if (crashk_res.end != crashk_res.start)
         size = resource_size(&crashk_res);
-    mutex_unlock(&kexec_mutex);
+    kexec_unlock();
     return size;
 }

@@ -1016,7 +1051,7 @@ int crash_shrink_memory(unsigned long new_size)
     unsigned long old_size;
     struct resource *ram_res;

-    mutex_lock(&kexec_mutex);
+    kexec_lock();

     if (kexec_crash_image) {
         ret = -ENOENT;
@@ -1054,7 +1089,7 @@ int crash_shrink_memory(unsigned long new_size)
     insert_resource(&iomem_resource, ram_res);

 unlock:
-    mutex_unlock(&kexec_mutex);
+    kexec_unlock();
     return ret;
 }

@@ -1126,7 +1161,7 @@ int kernel_kexec(void)
 {
     int error = 0;

-    if (!mutex_trylock(&kexec_mutex))
+    if (!kexec_trylock())
         return -EBUSY;
     if (!kexec_image) {
         error = -EINVAL;
@@ -1203,7 +1238,7 @@ int kernel_kexec(void)
 #endif

  Unlock:
-    mutex_unlock(&kexec_mutex);
+    kexec_unlock();
     return error;
 }

diff --git a/kernel/kexec_file.c b/kernel/kexec_file.c
index faa74d5f6941..52ac7573626e 100644
--- a/kernel/kexec_file.c
+++ b/kernel/kexec_file.c
@@ -384,7 +384,7 @@ SYSCALL_DEFINE5(kexec_file_load, int, kernel_fd,
int, initrd_fd,

     image = NULL;

-    if (!mutex_trylock(&kexec_mutex))
+    if (!kexec_trylock())
         return -EBUSY;

     dest_image = &kexec_image;
@@ -456,7 +456,7 @@ SYSCALL_DEFINE5(kexec_file_load, int, kernel_fd,
int, initrd_fd,
     if ((flags & KEXEC_FILE_ON_CRASH) && kexec_crash_image)
         arch_kexec_protect_crashkres();

-    mutex_unlock(&kexec_mutex);
+    kexec_unlock();
     kimage_free(image);
     return ret;
 }
@@ -656,7 +656,7 @@ int kexec_locate_mem_hole(struct kexec_buf *kbuf)
  * kexec_add_buffer - place a buffer in a kexec segment
  * @kbuf:    Buffer contents and memory parameters.
  *
- * This function assumes that kexec_mutex is held.
+ * This function assumes that kexec_lock is held.
  * On successful return, @kbuf->mem will have the physical address of
  * the buffer in memory.
  *
diff --git a/kernel/kexec_internal.h b/kernel/kexec_internal.h
index 39d30ccf8d87..c26d26b3c00e 100644
--- a/kernel/kexec_internal.h
+++ b/kernel/kexec_internal.h
@@ -3,6 +3,11 @@
 #define LINUX_KEXEC_INTERNAL_H

 #include <linux/kexec.h>
+#include <linux/atomic.h>
+
+int kexec_trylock(void);
+void kexec_lock(void);
+void kexec_unlock(void);

 struct kimage *do_kimage_alloc_init(void);
 int sanity_check_segment_list(struct kimage *image);
@@ -15,7 +20,6 @@ int kimage_is_destination_range(struct kimage *image,

 int machine_kexec_post_load(struct kimage *image);

-extern struct mutex kexec_mutex;

 #ifdef CONFIG_KEXEC_FILE
 #include <linux/purgatory.h>
diff --git a/security/integrity/ima/ima_kexec.c
b/security/integrity/ima/ima_kexec.c
index 121de3e04af2..a26cd17b346c 100644
--- a/security/integrity/ima/ima_kexec.c
+++ b/security/integrity/ima/ima_kexec.c
@@ -75,7 +75,7 @@ static int ima_dump_measurement_list(unsigned long
*buffer_size, void **buffer,
  * Called during kexec_file_load so that IMA can add a segment to the
kexec
  * image for the measurement list for the next kernel.
  *
- * This function assumes that kexec_mutex is held.
+ * This function assumes that kexec_lock is held.
  */
 void ima_add_kexec_buffer(struct kimage *image)
 {
--
2.25.1


2020-09-07 11:45:32

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [BUG RT] dump-capture kernel not executed for panic in interrupt context

On Sat, Aug 22, 2020 at 07:49:28PM -0400, Steven Rostedt wrote:

> From this email:
>
> > The problem happens when that owner is the idle task, this can happen
> > when the irq/softirq hits the idle task, in that case the contending
> > mutex_lock() will try and PI boost the idle task, and that is a big
> > no-no.
>
> What's wrong with priority boosting the idle task? It's not obvious,
> and I can't find comments in the code saying it would be bad.

> The idle task is not mentioned at all in rtmutex.c and not mentioned in
> kernel/locking except for some comments about RCU in lockdep.

There used to be a giant BUG() and comment somewhere in the PI code I
think.. but that's vage memories.

> I see that in the idle code the prio_change method does a BUG(), but
> there's no comment to say why it does so.
>
> The commit that added that BUG, doesn't explain why it can't happen:
>
> a8941d7ec8167 ("sched: Simplify the idle scheduling class")

That's like almost a decade ago ...

> I may have once known the rationale behind all this, but it's been a
> long time since I worked on the PI code, and it's out of my cache.

I suffer much the same problem.

So cenceptually there's the problem that idle must always be runnable,
and the moment you boost it, it becomes subject to a different
scheduling class.

Imagine for example what happens when we boost it to RT and then have it
be subject to throttling. What are we going to run when the idle task
is no longer elegible to run.

(it might all work out by accident, but ISTR we had a whole bunch of fun
in the earlier days of RT due to things like that)

2020-09-07 11:50:35

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [BUG RT] dump-capture kernel not executed for panic in interrupt context

On Mon, Sep 07, 2020 at 12:51:37PM +0200, Joerg Vehlow wrote:
> Hi,
>
> I guess there is currently no other way than to use something like Steven
> proposed. I implemented and tested the attached patch with a module,
> that triggers the soft lockup detection and it works as expected.
> I did not use inline functions, but normal function implemented in
> kexec_core,
> because there is nothing time critical here.
> I also added the mutex_lock to the trylock variant, because then the unlock
> function can be the same for both lock functions.
>
> What do you think?

I think it's too complicated for that is needed, did you see my
suggestion from a year ago? Did i miss something obvious?


2020-09-07 12:15:43

by Joerg Vehlow

[permalink] [raw]
Subject: Re: [BUG RT] dump-capture kernel not executed for panic in interrupt context



On 9/7/2020 1:46 PM, [email protected] wrote:
> I think it's too complicated for that is needed, did you see my
> suggestion from a year ago? Did i miss something obvious?
>
This one?
https://lore.kernel.org/linux-fsdevel/[email protected]/

I think it may be a bit incorrect?
According to the original comment in __crash_kexec, the mutex was used to
prevent a sys_kexec_load, while crash_kexec is executed. Your proposed
patch
does not lock the mutex in crash_kexec. This does not cover the original
use
case anymore. The only thing that is protected now are two panicing cores at
the same time.
Actually, this implementation feels even more hacky to me....

Jörg

2020-09-07 16:26:42

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [BUG RT] dump-capture kernel not executed for panic in interrupt context

On Mon, Sep 07, 2020 at 02:03:09PM +0200, Joerg Vehlow wrote:
>
>
> On 9/7/2020 1:46 PM, [email protected] wrote:
> > I think it's too complicated for that is needed, did you see my
> > suggestion from a year ago? Did i miss something obvious?
> >
> This one? https://lore.kernel.org/linux-fsdevel/[email protected]/
>
> I think it may be a bit incorrect?
> According to the original comment in __crash_kexec, the mutex was used to
> prevent a sys_kexec_load, while crash_kexec is executed. Your proposed patch
> does not lock the mutex in crash_kexec.

Sure, but any mutex taker will (spin) wait for panic_cpu==CPU_INVALID.
And if the mutex is already held, we'll not run __crash_kexec() just
like the trylock() would do today.

> This does not cover the original use
> case anymore. The only thing that is protected now are two panicing cores at
> the same time.

I'm not following. AFAICT it does exactly what the old code did.
Although maybe I didn't replace all kexec_mutex users, I now see that
thing isn't static.

> Actually, this implementation feels even more hacky to me....

It's more minimal ;-) It's simpler in that it only provides the required
semantics (as I understand them) and does not attempt to implement a
more general trylock() like primitive that isn't needed.

Also, read the kexec_lock() implementation you posted and explain to me
what happens when kexec_busy is elevated. Also note the lack of
confusing loops in my code.

2020-09-07 17:49:28

by Valentin Schneider

[permalink] [raw]
Subject: Re: [BUG RT] dump-capture kernel not executed for panic in interrupt context


On 07/09/20 12:41, [email protected] wrote:
> So cenceptually there's the problem that idle must always be runnable,
> and the moment you boost it, it becomes subject to a different
> scheduling class.
>
> Imagine for example what happens when we boost it to RT and then have it
> be subject to throttling. What are we going to run when the idle task
> is no longer elegible to run.
>
> (it might all work out by accident, but ISTR we had a whole bunch of fun
> in the earlier days of RT due to things like that)

Doesn't that become a non-problem (conceptually) with proxy exec? In that
case the idle task remains forever runnable, it just runs with its own
scheduling context during the throttling window.

Not trying to sell PE as the Billy Mays-backed solution to all kernel
issues, but I think it's another thing for me to ponder on & play with.

2020-09-08 05:51:05

by Joerg Vehlow

[permalink] [raw]
Subject: Re: [BUG RT] dump-capture kernel not executed for panic in interrupt context

Hi Peter

On 9/7/2020 6:23 PM, [email protected] wrote:
>> According to the original comment in __crash_kexec, the mutex was used to
>> prevent a sys_kexec_load, while crash_kexec is executed. Your proposed patch
>> does not lock the mutex in crash_kexec.
> Sure, but any mutex taker will (spin) wait for panic_cpu==CPU_INVALID.
> And if the mutex is already held, we'll not run __crash_kexec() just
> like the trylock() would do today.
Yes you are right, it should work.
>> This does not cover the original use
>> case anymore. The only thing that is protected now are two panicing cores at
>> the same time.
> I'm not following. AFAICT it does exactly what the old code did.
> Although maybe I didn't replace all kexec_mutex users, I now see that
> thing isn't static.
Same thing here.
>
>> Actually, this implementation feels even more hacky to me....
> It's more minimal ;-) It's simpler in that it only provides the required
> semantics (as I understand them) and does not attempt to implement a
> more general trylock() like primitive that isn't needed.
Here I cannot agree with you. There is a second trylock in kernel_kexec,
that cannot
be protected using the panic_cpu, but it actually could still use
mutex_trylock and check
the panic_cpu. This should work I guess:

int kexec_trylock(void) {
    if (!mutex_trylock(&kexec_mutex)) {
        return 0;
    }
    smp_mb();
    if (panic_cpu != PANIC_CPU_INVALID) {
         mutex_unlock(&kexec_mutex);
         return 0;
    }
    return 1;
}

Or do I miss something now? All functions protected by mutex_lock cannot
be executed, after
kexec_trylock resturned 1. kexec_crash will execute up to
mutex_is_locked and then roll back.
The only thing that can go wrong now is: kexec_trylock executes up to
smb_mb. At the same time
kexec_crash executes mutex_is_locked, which returns false now and then
before panic_cpu is reset,
kexec_trylock executes the panic_cpu check, and returns. Now both
functions did not get the lock and
nothing is executed.

Does that sound right to you? If you have no further objections I will
post it here

Jörg

2020-09-09 05:48:10

by Joerg Vehlow

[permalink] [raw]
Subject: Re: [BUG RT] dump-capture kernel not executed for panic in interrupt context

Hi,

here is the new version of the patch based on Peters suggestion
It looks like it works fine. I added the BUG_ON to __crash_kexec,
because it is a precondition, that panic_cpu is set correctly, otherwise
the whole locking logic fails.

The mutex_trylock can still be used, because it is only in syscall
context and no interrupt context.

Jörg

---
 kernel/kexec.c          |  8 ++--
 kernel/kexec_core.c     | 86 +++++++++++++++++++++++++++--------------
 kernel/kexec_file.c     |  4 +-
 kernel/kexec_internal.h |  6 ++-
 4 files changed, 69 insertions(+), 35 deletions(-)

diff --git a/kernel/kexec.c b/kernel/kexec.c
index f977786fe498..118a012aeac2 100644
--- a/kernel/kexec.c
+++ b/kernel/kexec.c
@@ -255,12 +255,12 @@ SYSCALL_DEFINE4(kexec_load, unsigned long, entry,
unsigned long, nr_segments,
      *
      * KISS: always take the mutex.
      */
-    if (!mutex_trylock(&kexec_mutex))
+    if (!kexec_trylock())
         return -EBUSY;

     result = do_kexec_load(entry, nr_segments, segments, flags);

-    mutex_unlock(&kexec_mutex);
+    kexec_unlock();

     return result;
 }
@@ -309,12 +309,12 @@ COMPAT_SYSCALL_DEFINE4(kexec_load, compat_ulong_t,
entry,
      *
      * KISS: always take the mutex.
      */
-    if (!mutex_trylock(&kexec_mutex))
+    if (!kexec_trylock())
         return -EBUSY;

     result = do_kexec_load(entry, nr_segments, ksegments, flags);

-    mutex_unlock(&kexec_mutex);
+    kexec_unlock();

     return result;
 }
diff --git a/kernel/kexec_core.c b/kernel/kexec_core.c
index c19c0dad1ebe..71682a33b1ba 100644
--- a/kernel/kexec_core.c
+++ b/kernel/kexec_core.c
@@ -45,7 +45,7 @@
 #include <crypto/sha.h>
 #include "kexec_internal.h"

-DEFINE_MUTEX(kexec_mutex);
+static DEFINE_MUTEX(kexec_mutex);

 /* Per cpu memory for storing cpu states in case of system crash. */
 note_buf_t __percpu *crash_notes;
@@ -70,6 +70,43 @@ struct resource crashk_low_res = {
     .desc  = IORES_DESC_CRASH_KERNEL
 };

+void kexec_lock(void)
+{
+    /*
+     * LOCK kexec_mutex        cmpxchg(&panic_cpu, INVALID, cpu)
+     *   MB                  MB
+     * panic_cpu == INVALID        kexec_mutex == LOCKED
+     *
+     * Ensures either we observe the cmpxchg, or crash_kernel() observes
+     * our lock acquisition.
+     */
+    mutex_lock(&kexec_mutex);
+    smp_mb();
+    atomic_cond_read_acquire(&panic_cpu, VAL == PANIC_CPU_INVALID);
+}
+
+int kexec_trylock(void) {
+    if (!mutex_trylock(&kexec_mutex)) {
+        return 0;
+    }
+    smp_mb();
+    if (atomic_read(&panic_cpu) != PANIC_CPU_INVALID) {
+         mutex_unlock(&kexec_mutex);
+         return 0;
+    }
+    return 1;
+}
+
+void kexec_unlock(void)
+{
+    mutex_unlock(&kexec_mutex);
+}
+
+int kexec_is_locked(void)
+{
+    return mutex_is_locked(&kexec_mutex);
+}
+
 int kexec_should_crash(struct task_struct *p)
 {
     /*
@@ -943,24 +980,15 @@ int kexec_load_disabled;
  */
 void __noclone __crash_kexec(struct pt_regs *regs)
 {
-    /* Take the kexec_mutex here to prevent sys_kexec_load
-     * running on one cpu from replacing the crash kernel
-     * we are using after a panic on a different cpu.
-     *
-     * If the crash kernel was not located in a fixed area
-     * of memory the xchg(&kexec_crash_image) would be
-     * sufficient.  But since I reuse the memory...
-     */
-    if (mutex_trylock(&kexec_mutex)) {
-        if (kexec_crash_image) {
-            struct pt_regs fixed_regs;
-
-            crash_setup_regs(&fixed_regs, regs);
-            crash_save_vmcoreinfo();
-            machine_crash_shutdown(&fixed_regs);
-            machine_kexec(kexec_crash_image);
-        }
-        mutex_unlock(&kexec_mutex);
+    BUG_ON(atomic_read(&panic_cpu) != raw_smp_processor_id());
+
+    if (!kexec_is_locked() && kexec_crash_image) {
+        struct pt_regs fixed_regs;
+
+        crash_setup_regs(&fixed_regs, regs);
+        crash_save_vmcoreinfo();
+        machine_crash_shutdown(&fixed_regs);
+        machine_kexec(kexec_crash_image);
     }
 }
 STACK_FRAME_NON_STANDARD(__crash_kexec);
@@ -977,9 +1005,11 @@ void crash_kexec(struct pt_regs *regs)
     this_cpu = raw_smp_processor_id();
     old_cpu = atomic_cmpxchg(&panic_cpu, PANIC_CPU_INVALID, this_cpu);
     if (old_cpu == PANIC_CPU_INVALID) {
-        /* This is the 1st CPU which comes here, so go ahead. */
-        printk_safe_flush_on_panic();
-        __crash_kexec(regs);
+        if (!kexec_is_locked()) {
+            /* This is the 1st CPU which comes here, so go ahead. */
+            printk_safe_flush_on_panic();
+            __crash_kexec(regs);
+        }

         /*
          * Reset panic_cpu to allow another panic()/crash_kexec()
@@ -993,10 +1023,10 @@ size_t crash_get_memory_size(void)
 {
     size_t size = 0;

-    mutex_lock(&kexec_mutex);
+    kexec_lock();
     if (crashk_res.end != crashk_res.start)
         size = resource_size(&crashk_res);
-    mutex_unlock(&kexec_mutex);
+    kexec_unlock();
     return size;
 }

@@ -1016,7 +1046,7 @@ int crash_shrink_memory(unsigned long new_size)
     unsigned long old_size;
     struct resource *ram_res;

-    mutex_lock(&kexec_mutex);
+    kexec_lock();

     if (kexec_crash_image) {
         ret = -ENOENT;
@@ -1054,7 +1084,7 @@ int crash_shrink_memory(unsigned long new_size)
     insert_resource(&iomem_resource, ram_res);

 unlock:
-    mutex_unlock(&kexec_mutex);
+    kexec_unlock();
     return ret;
 }

@@ -1126,7 +1156,7 @@ int kernel_kexec(void)
 {
     int error = 0;

-    if (!mutex_trylock(&kexec_mutex))
+    if (!kexec_trylock())
         return -EBUSY;
     if (!kexec_image) {
         error = -EINVAL;
@@ -1203,7 +1233,7 @@ int kernel_kexec(void)
 #endif

  Unlock:
-    mutex_unlock(&kexec_mutex);
+    kexec_unlock();
     return error;
 }

diff --git a/kernel/kexec_file.c b/kernel/kexec_file.c
index ca40bef75a61..d40b0aedc187 100644
--- a/kernel/kexec_file.c
+++ b/kernel/kexec_file.c
@@ -362,7 +362,7 @@ SYSCALL_DEFINE5(kexec_file_load, int, kernel_fd,
int, initrd_fd,

     image = NULL;

-    if (!mutex_trylock(&kexec_mutex))
+    if (!kexec_trylock())
         return -EBUSY;

     dest_image = &kexec_image;
@@ -434,7 +434,7 @@ SYSCALL_DEFINE5(kexec_file_load, int, kernel_fd,
int, initrd_fd,
     if ((flags & KEXEC_FILE_ON_CRASH) && kexec_crash_image)
         arch_kexec_protect_crashkres();

-    mutex_unlock(&kexec_mutex);
+    kexec_unlock();
     kimage_free(image);
     return ret;
 }
diff --git a/kernel/kexec_internal.h b/kernel/kexec_internal.h
index 39d30ccf8d87..2c1683cb1082 100644
--- a/kernel/kexec_internal.h
+++ b/kernel/kexec_internal.h
@@ -15,7 +15,11 @@ int kimage_is_destination_range(struct kimage *image,

 int machine_kexec_post_load(struct kimage *image);

-extern struct mutex kexec_mutex;
+void kexec_lock(void);
+int kexec_trylock(void);
+void kexec_unlock(void);
+int kexec_is_locked(void);
+

 #ifdef CONFIG_KEXEC_FILE
 #include <linux/purgatory.h>
--
2.25.1


2020-09-11 22:50:22

by Eric W. Biederman

[permalink] [raw]
Subject: Re: [BUG RT] dump-capture kernel not executed for panic in interrupt context

Joerg Vehlow <[email protected]> writes:

> Hi,
>
> here is the new version of the patch based on Peters suggestion
> It looks like it works fine. I added the BUG_ON to __crash_kexec, because it is
> a precondition, that panic_cpu is set correctly, otherwise the whole locking
> logic fails.
>
> The mutex_trylock can still be used, because it is only in syscall context and
> no interrupt context.

What is this patch supposed to be doing?

What bug is it fixing?

A BUG_ON that triggers inside of BUG_ONs seems not just suspect but
outright impossible to make use of.


I get the feeling skimming this that it is time to sort out and simplify
the locking here, rather than make it more complex, and more likely to
fail.

I get the feeling that over the years somehow the assumption that the
rest of the kernel is broken and that we need to get out of the broken
kernel as fast and as simply as possible has been lost.

Eric



> ---
>  kernel/kexec.c          |  8 ++--
>  kernel/kexec_core.c     | 86 +++++++++++++++++++++++++++--------------
>  kernel/kexec_file.c     |  4 +-
>  kernel/kexec_internal.h |  6 ++-
>  4 files changed, 69 insertions(+), 35 deletions(-)
>
> diff --git a/kernel/kexec.c b/kernel/kexec.c
> index f977786fe498..118a012aeac2 100644
> --- a/kernel/kexec.c
> +++ b/kernel/kexec.c
> @@ -255,12 +255,12 @@ SYSCALL_DEFINE4(kexec_load, unsigned long, entry, unsigned
> long, nr_segments,
>       *
>       * KISS: always take the mutex.
>       */
> -    if (!mutex_trylock(&kexec_mutex))
> +    if (!kexec_trylock())
>          return -EBUSY;
>
>      result = do_kexec_load(entry, nr_segments, segments, flags);
>
> -    mutex_unlock(&kexec_mutex);
> +    kexec_unlock();
>
>      return result;
>  }
> @@ -309,12 +309,12 @@ COMPAT_SYSCALL_DEFINE4(kexec_load, compat_ulong_t, entry,
>       *
>       * KISS: always take the mutex.
>       */
> -    if (!mutex_trylock(&kexec_mutex))
> +    if (!kexec_trylock())
>          return -EBUSY;
>
>      result = do_kexec_load(entry, nr_segments, ksegments, flags);
>
> -    mutex_unlock(&kexec_mutex);
> +    kexec_unlock();
>
>      return result;
>  }
> diff --git a/kernel/kexec_core.c b/kernel/kexec_core.c
> index c19c0dad1ebe..71682a33b1ba 100644
> --- a/kernel/kexec_core.c
> +++ b/kernel/kexec_core.c
> @@ -45,7 +45,7 @@
>  #include <crypto/sha.h>
>  #include "kexec_internal.h"
>
> -DEFINE_MUTEX(kexec_mutex);
> +static DEFINE_MUTEX(kexec_mutex);
>
>  /* Per cpu memory for storing cpu states in case of system crash. */
>  note_buf_t __percpu *crash_notes;
> @@ -70,6 +70,43 @@ struct resource crashk_low_res = {
>      .desc  = IORES_DESC_CRASH_KERNEL
>  };
>
> +void kexec_lock(void)
> +{
> +    /*
> +     * LOCK kexec_mutex        cmpxchg(&panic_cpu, INVALID, cpu)
> +     *   MB                  MB
> +     * panic_cpu == INVALID        kexec_mutex == LOCKED
> +     *
> +     * Ensures either we observe the cmpxchg, or crash_kernel() observes
> +     * our lock acquisition.
> +     */
> +    mutex_lock(&kexec_mutex);
> +    smp_mb();
> +    atomic_cond_read_acquire(&panic_cpu, VAL == PANIC_CPU_INVALID);
> +}
> +
> +int kexec_trylock(void) {
> +    if (!mutex_trylock(&kexec_mutex)) {
> +        return 0;
> +    }
> +    smp_mb();
> +    if (atomic_read(&panic_cpu) != PANIC_CPU_INVALID) {
> +         mutex_unlock(&kexec_mutex);
> +         return 0;
> +    }
> +    return 1;
> +}
> +
> +void kexec_unlock(void)
> +{
> +    mutex_unlock(&kexec_mutex);
> +}
> +
> +int kexec_is_locked(void)
> +{
> +    return mutex_is_locked(&kexec_mutex);
> +}
> +
>  int kexec_should_crash(struct task_struct *p)
>  {
>      /*
> @@ -943,24 +980,15 @@ int kexec_load_disabled;
>   */
>  void __noclone __crash_kexec(struct pt_regs *regs)
>  {
> -    /* Take the kexec_mutex here to prevent sys_kexec_load
> -     * running on one cpu from replacing the crash kernel
> -     * we are using after a panic on a different cpu.
> -     *
> -     * If the crash kernel was not located in a fixed area
> -     * of memory the xchg(&kexec_crash_image) would be
> -     * sufficient.  But since I reuse the memory...
> -     */
> -    if (mutex_trylock(&kexec_mutex)) {
> -        if (kexec_crash_image) {
> -            struct pt_regs fixed_regs;
> -
> -            crash_setup_regs(&fixed_regs, regs);
> -            crash_save_vmcoreinfo();
> -            machine_crash_shutdown(&fixed_regs);
> -            machine_kexec(kexec_crash_image);
> -        }
> -        mutex_unlock(&kexec_mutex);
> +    BUG_ON(atomic_read(&panic_cpu) != raw_smp_processor_id());
> +
> +    if (!kexec_is_locked() && kexec_crash_image) {
> +        struct pt_regs fixed_regs;
> +
> +        crash_setup_regs(&fixed_regs, regs);
> +        crash_save_vmcoreinfo();
> +        machine_crash_shutdown(&fixed_regs);
> +        machine_kexec(kexec_crash_image);
>      }
>  }
>  STACK_FRAME_NON_STANDARD(__crash_kexec);
> @@ -977,9 +1005,11 @@ void crash_kexec(struct pt_regs *regs)
>      this_cpu = raw_smp_processor_id();
>      old_cpu = atomic_cmpxchg(&panic_cpu, PANIC_CPU_INVALID, this_cpu);
>      if (old_cpu == PANIC_CPU_INVALID) {
> -        /* This is the 1st CPU which comes here, so go ahead. */
> -        printk_safe_flush_on_panic();
> -        __crash_kexec(regs);
> +        if (!kexec_is_locked()) {
> +            /* This is the 1st CPU which comes here, so go ahead. */
> +            printk_safe_flush_on_panic();
> +            __crash_kexec(regs);
> +        }
>
>          /*
>           * Reset panic_cpu to allow another panic()/crash_kexec()
> @@ -993,10 +1023,10 @@ size_t crash_get_memory_size(void)
>  {
>      size_t size = 0;
>
> -    mutex_lock(&kexec_mutex);
> +    kexec_lock();
>      if (crashk_res.end != crashk_res.start)
>          size = resource_size(&crashk_res);
> -    mutex_unlock(&kexec_mutex);
> +    kexec_unlock();
>      return size;
>  }
>
> @@ -1016,7 +1046,7 @@ int crash_shrink_memory(unsigned long new_size)
>      unsigned long old_size;
>      struct resource *ram_res;
>
> -    mutex_lock(&kexec_mutex);
> +    kexec_lock();
>
>      if (kexec_crash_image) {
>          ret = -ENOENT;
> @@ -1054,7 +1084,7 @@ int crash_shrink_memory(unsigned long new_size)
>      insert_resource(&iomem_resource, ram_res);
>
>  unlock:
> -    mutex_unlock(&kexec_mutex);
> +    kexec_unlock();
>      return ret;
>  }
>
> @@ -1126,7 +1156,7 @@ int kernel_kexec(void)
>  {
>      int error = 0;
>
> -    if (!mutex_trylock(&kexec_mutex))
> +    if (!kexec_trylock())
>          return -EBUSY;
>      if (!kexec_image) {
>          error = -EINVAL;
> @@ -1203,7 +1233,7 @@ int kernel_kexec(void)
>  #endif
>
>   Unlock:
> -    mutex_unlock(&kexec_mutex);
> +    kexec_unlock();
>      return error;
>  }
>
> diff --git a/kernel/kexec_file.c b/kernel/kexec_file.c
> index ca40bef75a61..d40b0aedc187 100644
> --- a/kernel/kexec_file.c
> +++ b/kernel/kexec_file.c
> @@ -362,7 +362,7 @@ SYSCALL_DEFINE5(kexec_file_load, int, kernel_fd, int,
> initrd_fd,
>
>      image = NULL;
>
> -    if (!mutex_trylock(&kexec_mutex))
> +    if (!kexec_trylock())
>          return -EBUSY;
>
>      dest_image = &kexec_image;
> @@ -434,7 +434,7 @@ SYSCALL_DEFINE5(kexec_file_load, int, kernel_fd, int,
> initrd_fd,
>      if ((flags & KEXEC_FILE_ON_CRASH) && kexec_crash_image)
>          arch_kexec_protect_crashkres();
>
> -    mutex_unlock(&kexec_mutex);
> +    kexec_unlock();
>      kimage_free(image);
>      return ret;
>  }
> diff --git a/kernel/kexec_internal.h b/kernel/kexec_internal.h
> index 39d30ccf8d87..2c1683cb1082 100644
> --- a/kernel/kexec_internal.h
> +++ b/kernel/kexec_internal.h
> @@ -15,7 +15,11 @@ int kimage_is_destination_range(struct kimage *image,
>
>  int machine_kexec_post_load(struct kimage *image);
>
> -extern struct mutex kexec_mutex;
> +void kexec_lock(void);
> +int kexec_trylock(void);
> +void kexec_unlock(void);
> +int kexec_is_locked(void);
> +
>
>  #ifdef CONFIG_KEXEC_FILE
>  #include <linux/purgatory.h>

2020-09-14 06:07:27

by Joerg Vehlow

[permalink] [raw]
Subject: Re: [BUG RT] dump-capture kernel not executed for panic in interrupt context

Hi Eric,
> What is this patch supposed to be doing?
>
> What bug is it fixing?
This information is part in the first message of this mail thread.
The patch was intendedfor the active discussion in this thread,
not for a broad review.
A short summary: In the rt kernel, a panic in an interrupt context does
not start the dump-capture kernel, because there is a mutex_trylock in
__crash_kexe. If this is called in interrupt context, it always fails.
In the non-rt kernel calling mutex_trylock is not allowed according to
the comment of the function, but it still works.

> A BUG_ON that triggers inside of BUG_ONs seems not just suspect but
> outright impossible to make use of.
I am not entirely sure what would happen here. But even if it gets in
some kind ofendless loop, I guess this is ok, because it allows finding
the problem. A piece of code in the function, that ensures the precondition
is a lot better than relying on only a comment.
If this was in mtex_trylock, the bug described above wouldn't have sneaked
in 12 years ago...
> I get the feeling skimming this that it is time to sort out and simplify
> the locking here, rather than make it more complex, and more likely to
> fail.
I would very much like that, but sadly it looks like it is not possible.
Either it wouldrequire blocking locks, that may fail, or not locking at
all, that may also fail.Using a different kind of lock (like spinlock)
is also not possible, becausespinlock_trylock again uses mutex_trylock
in the rt kernel.
>
> I get the feeling that over the years somehow the assumption that the
> rest of the kernel is broken and that we need to get out of the broken
> kernel as fast and as simply as possible has been lost.
Yes I also have the feeling, that the mutexes need fixing, but I wouldn't
to post any patch for that. At the moment, given the interface of the mutex,
this is clearly a bug in kexec, even if it works in the non-rt kernel.

Jörg

2020-09-14 17:05:05

by Eric W. Biederman

[permalink] [raw]
Subject: Re: [BUG RT] dump-capture kernel not executed for panic in interrupt context


Adding the kexec list as well.

Joerg Vehlow <[email protected]> writes:

> Hi Eric,
>> What is this patch supposed to be doing?
>>
>> What bug is it fixing?
> This information is part in the first message of this mail thread.
> The patch was intendedfor the active discussion in this thread,
> not for a broad review.

> A short summary: In the rt kernel, a panic in an interrupt context does
> not start the dump-capture kernel, because there is a mutex_trylock in
> __crash_kexe. If this is called in interrupt context, it always fails.
> In the non-rt kernel calling mutex_trylock is not allowed according to
> the comment of the function, but it still works.

Thanks. For whatever reason I did not see the rest of this thread
when I was replying to your patch.

I get the feeling the rt kernel is breaking this case deliberately.
I don't know of any reason why a trylock couldn't work.

That said I won't propose fixing up the locks that way.

>> A BUG_ON that triggers inside of BUG_ONs seems not just suspect but
>> outright impossible to make use of.
> I am not entirely sure what would happen here. But even if it gets in
> some kind ofendless loop, I guess this is ok, because it allows finding
> the problem. A piece of code in the function, that ensures the precondition
> is a lot better than relying on only a comment.
> If this was in mtex_trylock, the bug described above wouldn't have sneaked
> in 12 years ago...

BUG_ON's are more likely to hide a problem then to show it.
Sometimes they are appropriate but the should be avoided as much as
possible.


>> I get the feeling skimming this that it is time to sort out and simplify
>> the locking here, rather than make it more complex, and more likely to
>> fail.
> I would very much like that, but sadly it looks like it is not possible.
> Either it wouldrequire blocking locks, that may fail, or not locking at
> all, that may also fail.Using a different kind of lock (like spinlock)
> is also not possible, becausespinlock_trylock again uses mutex_trylock
> in the rt kernel.

I think it is possible but the locking needs to be relooked at.

>> I get the feeling that over the years somehow the assumption that the
>> rest of the kernel is broken and that we need to get out of the broken
>> kernel as fast and as simply as possible has been lost.
> Yes I also have the feeling, that the mutexes need fixing, but I wouldn't
> to post any patch for that. At the moment, given the interface of the mutex,
> this is clearly a bug in kexec, even if it works in the non-rt kernel.

Cleanups that break the code. Sigh.

The code was written correctly for this case and was fine until
8c5a1cf0ad3a ("kexec: use a mutex for locking rather than xchg()").

Mostly because I didn't trust locks given their comparatively high level
of abstraction and what do you know that turned out to be correct in
this case.

It definitely looks time to see how the locking can be improved on the
kexec on panic code path.

Eric

2020-09-14 19:01:49

by Steven Rostedt

[permalink] [raw]
Subject: Re: [BUG RT] dump-capture kernel not executed for panic in interrupt context

On Mon, 7 Sep 2020 13:41:40 +0200
[email protected] wrote:

> > I may have once known the rationale behind all this, but it's been a
> > long time since I worked on the PI code, and it's out of my cache.

Now I'm trying to cache all this in from a long PTO ;-)

>
> I suffer much the same problem.
>
> So cenceptually there's the problem that idle must always be runnable,
> and the moment you boost it, it becomes subject to a different
> scheduling class.
>
> Imagine for example what happens when we boost it to RT and then have it
> be subject to throttling. What are we going to run when the idle task
> is no longer elegible to run.
>
> (it might all work out by accident, but ISTR we had a whole bunch of fun
> in the earlier days of RT due to things like that)

I'm thinking if a mutex_trylock() happens in an interrupt context, which I
believe is the only way idle could trigger it, it could not be throttled if
boosted, because it's in interrupt context. And it would have to release
the mutex before leaving interrupt context putting it back to the idle
state before throttling could take effect.

Thus, I'm still not convinced that mutex_trylock() is still bad in interrupt
context yet. At least not from what I can tell.

-- Steve