LinuxLists.cc - [RFC] How to test panic handlers, without crashing the kernel

2024-03-01 11:15:49

by Jocelyn Falempe

Subject: [RFC] How to test panic handlers, without crashing the kernel

Hi,

While writing a panic handler for drm devices [1], I needed a way to
test it without crashing the machine.
So from debugfs, I called
atomic_notifier_call_chain(&panic_notifier_list, ...), but it has the
side effect of calling all other panic notifiers registered.
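
A minimal userspace sketch of that side effect, written in Python as a
model only (the real chain is C code in the kernel). The NOTIFY_*
values match include/linux/notifier.h; the NotifierChain class,
make_notifier() and the name "some_other_driver" are hypothetical and
exist only for illustration.

```python
# Userspace model of a notifier chain; not kernel code.
NOTIFY_DONE = 0x0000
NOTIFY_OK = 0x0001
NOTIFY_STOP_MASK = 0x8000


class NotifierChain:
    def __init__(self):
        self._blocks = []  # list of (priority, callback)

    def register(self, callback, priority=0):
        self._blocks.append((priority, callback))
        # Higher priority runs first. The sort is stable, so callbacks
        # with equal priority keep their registration order.
        self._blocks.sort(key=lambda block: -block[0])

    def call_chain(self, event, data=None):
        ret = NOTIFY_DONE
        for _priority, callback in self._blocks:
            ret = callback(event, data)
            if ret & NOTIFY_STOP_MASK:
                break  # a callback asked to stop the chain
        return ret


calls = []


def make_notifier(name):
    def notifier(event, data):
        calls.append(name)  # record that this notifier ran
        return NOTIFY_DONE
    return notifier


panic_notifier_list = NotifierChain()
panic_notifier_list.register(make_notifier("drm_panic"))
panic_notifier_list.register(make_notifier("hv_panic_vmbus_unload"))
panic_notifier_list.register(make_notifier("some_other_driver"))

# The test only wanted drm_panic, but the whole chain runs.
panic_notifier_list.call_chain(event=0)
print(calls)
```

Calling the chain has no way to select a single callback, which is why
a test trigger built on it reaches every registered notifier.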

So Sima suggested to move that to the generic panic code, and test all
panic notifiers with a dedicated debugfs interface.

I can move that code to kernel/, but before doing that, I would like to
know if you think that's the right way to test the panic code.

The second question is how to simulate a panic context in a
non-destructive way, so we can test the panic notifiers in CI, without
crashing the machine. The worst case for a panic notifier, is when the
panic occurs in NMI context, but I don't know how to simulate that. The
goal would be to find early if a panic notifier tries to sleep, or do
other things that are not allowed in a panic context.

Best regards,

--

Jocelyn

[1] https://patchwork.freedesktop.org/patch/580183/?series=122244&rev=8

2024-03-04 21:13:07

by John Ogness

Subject: Re: [RFC] How to test panic handlers, without crashing the kernel

[Added printk maintainer and kdb folks]

Hi Jocelyn,

On 2024-03-01, Jocelyn Falempe <[email protected]> wrote:
> While writing a panic handler for drm devices [1], I needed a way to
> test it without crashing the machine.
> So from debugfs, I called
> atomic_notifier_call_chain(&panic_notifier_list, ...), but it has the
> side effect of calling all other panic notifiers registered.
>
> So Sima suggested to move that to the generic panic code, and test all
> panic notifiers with a dedicated debugfs interface.
>
> I can move that code to kernel/, but before doing that, I would like to
> know if you think that's the right way to test the panic code.

One major event that happens before the panic notifiers is
panic_other_cpus_shutdown(). This can cause special situations because
CPUs can be stopped while holding resources (such as raw spin
locks). And these are the situations that make it so tricky to have safe
and reliable notifiers. If triggered from debugfs, these situations will
never occur.

My concern is that the tests via debugfs will always succeed, but in the
real world panic notifiers are failing/hanging/exploding. IMHO useful
panic testing requires real panic'ing.
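
A rough userspace illustration of why a debugfs-triggered test can pass
while a real panic hangs. This is a Python model, not kernel code:
threading.Lock stands in for a raw spin lock, the timeout stands in for
"spins forever", and both notifier functions are hypothetical. The
trylock variant is one defensive pattern, shown here only as an
assumption about how a notifier might cope.

```python
import threading

resource_lock = threading.Lock()  # stands in for a raw spin lock


def blocking_notifier(timeout=0.1):
    # A real spin_lock() would spin forever; the timeout only lets
    # this demo return and report the hang.
    if not resource_lock.acquire(timeout=timeout):
        return "hang"
    try:
        return "ok"
    finally:
        resource_lock.release()


def trylock_notifier():
    # Give up immediately if the lock owner may never run again.
    if not resource_lock.acquire(blocking=False):
        return "skipped"
    try:
        return "ok"
    finally:
        resource_lock.release()


# debugfs-style trigger: no CPU was stopped inside a critical section.
debugfs_result = blocking_notifier()

# panic-style situation: a CPU was stopped while holding the lock and
# will never release it.
resource_lock.acquire()
panic_blocking_result = blocking_notifier()
panic_trylock_result = trylock_notifier()

print(debugfs_result, panic_blocking_result, panic_trylock_result)
```

The same blocking notifier succeeds in the first case and hangs in the
second, which is the gap a debugfs-only test cannot see.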

For my printk panic tests I trigger unknown NMIs while booting with
"unknown_nmi_panic". Particularly with Qemu this is quite easy and
amazingly effective at catching problems. In fact, a recent printk
series [0] fixed seven issues that were found through this method of
panic testing.

> The second question is how to simulate a panic context in a
> non-destructive way, so we can test the panic notifiers in CI, without
> crashing the machine.

I'm wondering if a "fake panic" can be implemented that quiesces all the
other CPUs via NMI (similar to kdb) and then calls the panic
notifiers. And finally releases everything back to normal. That might
produce a fairly realistic panic situation and should be fairly
non-destructive (depending on what the notifiers do and how long they
take).
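
A toy model of that sequence (quiesce, call notifiers, release),
assuming Python threads as stand-ins for CPUs and Events as stand-ins
for the NMI. Nothing here is an existing kernel interface; the
FakePanic class and all of its members are hypothetical.

```python
import threading
import time


class FakePanic:
    """Userspace model: threads play CPUs, Events play the NMI."""

    def __init__(self, n_other_cpus):
        self.n_other_cpus = n_other_cpus
        self.stop = threading.Event()    # ask the other CPUs to park
        self.resume = threading.Event()  # release the parked CPUs
        self.exit = threading.Event()    # end of the demo
        self.parked = set()
        self.parked_lock = threading.Lock()
        self.parked_count = threading.Semaphore(0)

    def cpu_loop(self, cpu):
        while not self.exit.is_set():
            if self.stop.is_set():
                with self.parked_lock:
                    self.parked.add(cpu)
                self.parked_count.release()  # report "I am parked"
                self.resume.wait()           # stay parked until released
                with self.parked_lock:
                    self.parked.discard(cpu)
            time.sleep(0.001)  # stands in for normal work

    def run(self, notifiers):
        self.resume.clear()
        self.stop.set()
        # Wait until every other CPU has parked.
        for _ in range(self.n_other_cpus):
            self.parked_count.acquire()
        try:
            with self.parked_lock:
                snapshot = sorted(self.parked)
            return [notifier(snapshot) for notifier in notifiers]
        finally:
            # Clear the stop request first so released CPUs do not
            # park again, then let them go.
            self.stop.clear()
            self.resume.set()


fake_panic = FakePanic(n_other_cpus=3)
cpus = [threading.Thread(target=fake_panic.cpu_loop, args=(i,))
        for i in range(1, 4)]
for t in cpus:
    t.start()

# The notifier just reports which CPUs were parked while it ran.
seen = fake_panic.run([lambda parked: parked])

fake_panic.exit.set()
for t in cpus:
    t.join(timeout=2)
print(seen)
```

Unlike a real panic, the parked threads here stop only at a point of
their own choosing, so the model does not reproduce CPUs frozen while
holding locks.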

> The worst case for a panic notifier, is when the panic occurs in NMI
> context, but I don't know how to simulate that. The goal would be to
> find early if a panic notifier tries to sleep, or do other things that
> are not allowed in a panic context.

Maybe with a new boot argument "unknown_nmi_fake_panic" that triggers
the fake panic instead?

John Ogness

[0] https://lore.kernel.org/lkml/[email protected]

2024-03-04 21:43:54

by Guilherme G. Piccoli

Subject: Re: [RFC] How to test panic handlers, without crashing the kernel

On 04/03/2024 18:12, John Ogness wrote:
> [...]
>> The second question is how to simulate a panic context in a
>> non-destructive way, so we can test the panic notifiers in CI, without
>> crashing the machine.
>
> I'm wondering if a "fake panic" can be implemented that quiesces all the
> other CPUs via NMI (similar to kdb) and then calls the panic
> notifiers. And finally releases everything back to normal. That might
> produce a fairly realistic panic situation and should be fairly
> non-destructive (depending on what the notifiers do and how long they
> take).
>

Hi Jocelyn / John,

one concern here is that the panic notifiers are kind of a no man's
land, so we can have very simple / safe ones, while others are
destructive in nature.

An example of a well-behaved notifier that is destructive is the
Hyper-V one, that destroys an essential host-guest interface (called
"vmbus connection"). What happens if we trigger this one just for
testing purposes in a debugfs interface? Likely the guest would die...

[+CCing Michael Kelley here since he seems interested in panic and is
also expert in Hyper-V, just in case my example is bogus.]

So, maybe the problem could be split in 2: the non-notifiers portion of
the panic path, and the notifiers; maybe restricting the notifiers
you'd run is a way to circumvent the risks, like if you could pass a
list of the specific notifiers you aim to test, this could be
interesting. Let's see what the others think and thanks for your work in
the DRM panic notifier =)

Cheers,

Guilherme

2024-03-05 16:27:36

by Michael Kelley

Subject: RE: [RFC] How to test panic handlers, without crashing the kernel

From: Guilherme G. Piccoli <[email protected]> Sent: Monday, March 4, 2024 1:43 PM
>
> On 04/03/2024 18:12, John Ogness wrote:
> > [...]
> >> The second question is how to simulate a panic context in a
> >> non-destructive way, so we can test the panic notifiers in CI, without
> >> crashing the machine.
> >
> > I'm wondering if a "fake panic" can be implemented that quiesces all the
> > other CPUs via NMI (similar to kdb) and then calls the panic
> > notifiers. And finally releases everything back to normal. That might
> > produce a fairly realistic panic situation and should be fairly
> > non-destructive (depending on what the notifiers do and how long they
> > take).
> >
>
> Hi Jocelyn / John,
>
> one concern here is that the panic notifiers are kind of a no man's
> land, so we can have very simple / safe ones, while others are
> destructive in nature.
>
> An example of a well-behaved notifier that is destructive is the
> Hyper-V one, that destroys an essential host-guest interface (called
> "vmbus connection"). What happens if we trigger this one just for
> testing purposes in a debugfs interface? Likely the guest would die...
>
> [+CCing Michael Kelley here since he seems interested in panic and is
> also expert in Hyper-V, just in case my example is bogus.]

The Hyper-V example is valid. After hv_panic_vmbus_unload()
is called, the VM won't be able to do any disk, network, or graphics
frame buffer I/O. There's no recovery short of restarting the VM.

Michael

[I have retired from Microsoft. I'm still occasionally contributing
to Linux kernel work with email [email protected].]

>
> So, maybe the problem could be split in 2: the non-notifiers portion of
> the panic path, and the notifiers; maybe restricting the notifiers
> you'd run is a way to circumvent the risks, like if you could pass a
> list of the specific notifiers you aim to test, this could be
> interesting. Let's see what the others think and thanks for your work in
> the DRM panic notifier =)
>
> Cheers,
>
>
> Guilherme

2024-03-05 16:32:34

by Jocelyn Falempe

Subject: Re: [RFC] How to test panic handlers, without crashing the kernel

On 04/03/2024 22:12, John Ogness wrote:
> [Added printk maintainer and kdb folks]
>
> Hi Jocelyn,
>
> On 2024-03-01, Jocelyn Falempe <[email protected]> wrote:
>> While writing a panic handler for drm devices [1], I needed a way to
>> test it without crashing the machine.
>> So from debugfs, I called
>> atomic_notifier_call_chain(&panic_notifier_list, ...), but it has the
>> side effect of calling all other panic notifiers registered.
>>
>> So Sima suggested to move that to the generic panic code, and test all
>> panic notifiers with a dedicated debugfs interface.
>>
>> I can move that code to kernel/, but before doing that, I would like to
>> know if you think that's the right way to test the panic code.
>
> One major event that happens before the panic notifiers is
> panic_other_cpus_shutdown(). This can cause special situations because
> CPUs can be stopped while holding resources (such as raw spin
> locks). And these are the situations that make it so tricky to have safe
> and reliable notifiers. If triggered from debugfs, these situations will
> never occur.
>
> My concern is that the tests via debugfs will always succeed, but in the
> real world panic notifiers are failing/hanging/exploding. IMHO useful
> panic testing requires real panic'ing.

Yes, but for the drm panic, it's still useful to check that the output
is working (i.e., make sure the color format and the framebuffer address
are good). Also I've reworked the debugfs patch, so I don't have to call
all panic notifiers. It's now per device, so you can trigger the
drm_panic handler on a specific GPU.

>
> For my printk panic tests I trigger unknown NMIs while booting with
> "unknown_nmi_panic". Particularly with Qemu this is quite easy and
> amazingly effective at catching problems. In fact, a recent printk
> series [0] fixed seven issues that were found through this method of
> panic testing.

Thanks for this tip, I used to test with "echo c > /proc/sysrq-trigger"
in the guest, but that's more permissive. I'm now testing with virsh
inject-nmi, and drm_panic is still working.
>
>> The second question is how to simulate a panic context in a
>> non-destructive way, so we can test the panic notifiers in CI, without
>> crashing the machine.
>
> I'm wondering if a "fake panic" can be implemented that quiesces all the
> other CPUs via NMI (similar to kdb) and then calls the panic
> notifiers. And finally releases everything back to normal. That might
> produce a fairly realistic panic situation and should be fairly
> non-destructive (depending on what the notifiers do and how long they
> take).
>
>> The worst case for a panic notifier, is when the panic occurs in NMI
>> context, but I don't know how to simulate that. The goal would be to
>> find early if a panic notifier tries to sleep, or do other things that
>> are not allowed in a panic context.
>
> Maybe with a new boot argument "unknown_nmi_fake_panic" that triggers
> the fake panic instead?
>
> John Ogness
>
> [0] https://lore.kernel.org/lkml/[email protected]
>

Best regards,

--

Jocelyn

2024-03-05 16:52:54

by Jocelyn Falempe

Subject: Re: [RFC] How to test panic handlers, without crashing the kernel

On 05/03/2024 17:23, Michael Kelley wrote:
> From: Guilherme G. Piccoli <[email protected]> Sent: Monday, March 4, 2024 1:43 PM
>>
>> On 04/03/2024 18:12, John Ogness wrote:
>>> [...]
>>>> The second question is how to simulate a panic context in a
>>>> non-destructive way, so we can test the panic notifiers in CI, without
>>>> crashing the machine.
>>>
>>> I'm wondering if a "fake panic" can be implemented that quiesces all the
>>> other CPUs via NMI (similar to kdb) and then calls the panic
>>> notifiers. And finally releases everything back to normal. That might
>>> produce a fairly realistic panic situation and should be fairly
>>> non-destructive (depending on what the notifiers do and how long they
>>> take).
>>>
>>
>> Hi Jocelyn / John,
>>
>> one concern here is that the panic notifiers are kind of a no man's
>> land, so we can have very simple / safe ones, while others are
>> destructive in nature.
>>
>> An example of a well-behaved notifier that is destructive is the
>> Hyper-V one, that destroys an essential host-guest interface (called
>> "vmbus connection"). What happens if we trigger this one just for
>> testing purposes in a debugfs interface? Likely the guest would die...
>>
>> [+CCing Michael Kelley here since he seems interested in panic and is
>> also expert in Hyper-V, just in case my example is bogus.]
>
> The Hyper-V example is valid. After hv_panic_vmbus_unload()
> is called, the VM won't be able to do any disk, network, or graphics
> frame buffer I/O. There's no recovery short of restarting the VM.

Thanks for the confirmation.
>
> Michael
>
> [I have retired from Microsoft. I'm still occasionally contributing
> to Linux kernel work with email [email protected].]
>
>>
>> So, maybe the problem could be split in 2: the non-notifiers portion of
>> the panic path, and the notifiers; maybe restricting the notifiers
>> you'd run is a way to circumvent the risks, like if you could pass a
>> list of the specific notifiers you aim to test, this could be
>> interesting. Let's see what the others think and thanks for your work in
>> the DRM panic notifier =)

Or maybe have two lists of panic notifiers, the safe and the destructive
list. So in case of fake panic, we can only call the safe notifiers.
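
A small Python sketch of that selection logic, combined with the
earlier suggestion of passing a list of notifiers to test. It is a
userspace model under stated assumptions: the Notifier record, its
"destructive" flag, fake_panic_notifiers() and the name
"hypothetical_logger" are all made up for illustration and are not
kernel interfaces.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Optional


@dataclass
class Notifier:
    name: str
    callback: Callable[[], str]
    destructive: bool = False  # hypothetical flag, not a kernel field


def fake_panic_notifiers(notifiers: Iterable[Notifier],
                         only: Optional[set] = None):
    """Run the safe notifiers, optionally limited to an allow-list."""
    results = []
    for n in notifiers:
        if n.destructive:
            continue  # never run destructive notifiers in a fake panic
        if only is not None and n.name not in only:
            continue  # not requested by the tester
        results.append((n.name, n.callback()))
    return results


registered = [
    Notifier("drm_panic", lambda: "drew panic screen"),
    Notifier("hv_panic_vmbus_unload", lambda: "vmbus unloaded",
             destructive=True),
    Notifier("hypothetical_logger", lambda: "logged"),
]

all_safe = fake_panic_notifiers(registered)
only_drm = fake_panic_notifiers(registered, only={"drm_panic"})
# Naming a destructive notifier explicitly still does not run it.
forced = fake_panic_notifiers(registered, only={"hv_panic_vmbus_unload"})
print(all_safe, only_drm, forced)
```

The hard part in practice is deciding which notifiers count as safe,
which this sketch simply takes as given.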

>>
>> Cheers,
>>
>>
>> Guilherme
>

2024-03-05 17:52:11

by Guilherme G. Piccoli

Subject: Re: [RFC] How to test panic handlers, without crashing the kernel

On 05/03/2024 13:52, Jocelyn Falempe wrote:
> [...]
> Or maybe have two lists of panic notifiers, the safe and the destructive
> list. So in case of fake panic, we can only call the safe notifiers.
>

I tried something like that:
https://lore.kernel.org/lkml/[email protected]/

There were many suggestions, a complete refactor of the idea (panic
lists are not really seen as reliable things).

Given that, I'm not really sure splitting in lists gonna fly; maybe
restricting the test infrastructure to drm_panic plus some paths of
panic would be enough for this debugfs interface, in principle? I mean,
to unblock your work on the drm panic stuff.

Cheers,

Guilherme

2024-03-07 17:23:49

by Jocelyn Falempe

Subject: Re: [RFC] How to test panic handlers, without crashing the kernel

On 05/03/2024 18:50, Guilherme G. Piccoli wrote:
> On 05/03/2024 13:52, Jocelyn Falempe wrote:
>> [...]
>> Or maybe have two lists of panic notifiers, the safe and the destructive
>> list. So in case of fake panic, we can only call the safe notifiers.
>>
>
> I tried something like that:
> https://lore.kernel.org/lkml/[email protected]/
>
> There were many suggestions, a complete refactor of the idea (panic
> lists are not really seen as reliable things).

Thanks for sharing this; it's much more complex than I thought.
>
> Given that, I'm not really sure splitting in lists gonna fly; maybe
> restricting the test infrastructure to drm_panic plus some paths of
> panic would be enough for this debugfs interface, in principle? I mean,
> to unblock your work on the drm panic stuff.

For drm_panic, I changed the way the debugfs is calling the drm_panic
functions in the last version:
https://patchwork.freedesktop.org/patch/581845/?series=122244&rev=9

It doesn't use the panic notifier list, but creates a file for each plane
of each device directly.
It allows testing the panic handler, not in a real panic condition, but
that's still better than nothing.
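
The per-plane layout can be pictured with a short Python model. The
path format, panic_debugfs_files() and trigger() are assumptions made
for illustration and are not taken from the patch; the point is only
that each file addresses one plane of one device.

```python
def panic_debugfs_files(devices):
    """Map an illustrative debugfs path to a (device, plane) pair.

    `devices` maps a device index to the list of its plane indices.
    """
    files = {}
    for dev_index, planes in devices.items():
        for plane_index in planes:
            # Hypothetical path layout, for illustration only.
            path = (f"/sys/kernel/debug/dri/{dev_index}/"
                    f"drm_panic_plane_{plane_index}")
            files[path] = (dev_index, plane_index)
    return files


def trigger(files, path, draw_panic_screen):
    """Run the panic handler for the single plane behind `path`."""
    dev_index, plane_index = files[path]
    return draw_panic_screen(dev_index, plane_index)


drawn = []


def draw_panic_screen(dev_index, plane_index):
    drawn.append((dev_index, plane_index))  # record which plane was hit
    return True


# Two GPUs: the first with two planes, the second with one.
files = panic_debugfs_files({0: [0, 1], 1: [0]})
trigger(files, "/sys/kernel/debug/dri/0/drm_panic_plane_1",
        draw_panic_screen)
print(sorted(files))
print(drawn)
```

Only the selected plane is touched, and no other panic notifier runs.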

>
> Cheers,
>
>
> Guilherme
>

Best regards,

--

Jocelyn

2024-03-07 17:31:36

by Guilherme G. Piccoli

Subject: Re: [RFC] How to test panic handlers, without crashing the kernel

On 07/03/2024 14:22, Jocelyn Falempe wrote:
> [...]
>
> For drm_panic, I changed the way the debugfs is calling the drm_panic
> functions in the last version:
> https://patchwork.freedesktop.org/patch/581845/?series=122244&rev=9
>
> It doesn't use the panic notifier list, but creates a file for each plane
> of each device directly.
> It allows testing the panic handler, not in a real panic condition, but
> that's still better than nothing.
>

Nice! Seems a very good idea, at least as a first step to unblock the
work you're doing.

Thanks again for the effort, much appreciated =)