2015-05-13 14:25:56

by 王龙

Subject: [RFC] how to perform a safe NMI stack trace on all CPUs on x86?

Hi all,

In kernels before 3.19, when trigger_all_cpu_backtrace() is called on x86,
it triggers an NMI on each CPU and calls show_regs(). But this can lead
to a hard lockup if the NMI interrupts another printk().

Commit a9edc88093287183ac934be44f295f183b2c62dd (x86/nmi: Perform a safe
NMI stack trace on all CPUs) fixes this problem in mainline. When the NMI
triggers, it switches the printk routine for that CPU to an NMI-safe printk
function that records the output in a per-CPU seq_buf descriptor. After all
NMIs have finished recording their data, the seq_bufs are printed from a safe
context. But how do we fix this problem in older kernels (e.g., 3.10 stable)?
3.10 stable has neither the "switch printk routine" nor the "seq_buf" infrastructure.
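
To make the mechanism described above concrete, here is a minimal userspace sketch in C. It is an illustration under stated assumptions, not the code from the mainline commit: there are no real NMIs, the "CPUs" are just array indices, and every name in it (cpu_buf, printk_route, sketch_printk, fake_nmi_handler, flush_nmi_bufs, NR_CPUS, BUF_SIZE) is hypothetical. It only shows the idea of rerouting a CPU's print routine to a private buffer and printing that buffer later from an ordinary context.

```c
#include <assert.h>
#include <stdarg.h>
#include <stdio.h>
#include <string.h>

#define NR_CPUS  4      /* hypothetical CPU count for the simulation */
#define BUF_SIZE 256    /* hypothetical per-CPU buffer size */

/* Hypothetical stand-in for a per-CPU seq_buf. */
struct cpu_buf {
    char   data[BUF_SIZE];
    size_t len;
};

typedef int (*vprintk_fn)(int cpu, const char *fmt, va_list ap);

static struct cpu_buf nmi_buf[NR_CPUS];
static vprintk_fn printk_route[NR_CPUS];  /* NULL means "use the direct path" */

/* Direct path: a real kernel would take a console/log lock around here. */
static int direct_vprintk(int cpu, const char *fmt, va_list ap)
{
    (void)cpu;
    return vprintf(fmt, ap);
}

/* NMI path: append to this CPU's private buffer; no shared lock involved. */
static int nmi_vprintk(int cpu, const char *fmt, va_list ap)
{
    struct cpu_buf *b = &nmi_buf[cpu];
    size_t room = BUF_SIZE - b->len;          /* always >= 1 */
    int n = vsnprintf(b->data + b->len, room, fmt, ap);

    if (n < 0)
        return 0;
    if ((size_t)n >= room)                    /* output was truncated */
        n = (int)(room - 1);
    b->len += (size_t)n;
    return n;
}

/* Print through whichever routine is currently selected for this CPU. */
static int sketch_printk(int cpu, const char *fmt, ...)
{
    va_list ap;
    int n;

    assert(cpu >= 0 && cpu < NR_CPUS);
    va_start(ap, fmt);
    n = (printk_route[cpu] ? printk_route[cpu] : direct_vprintk)(cpu, fmt, ap);
    va_end(ap);
    return n;
}

/* Simulated NMI: reroute printing, record the output, restore the route. */
static void fake_nmi_handler(int cpu)
{
    vprintk_fn saved = printk_route[cpu];

    printk_route[cpu] = nmi_vprintk;
    sketch_printk(cpu, "NMI backtrace for cpu %d\n", cpu);
    printk_route[cpu] = saved;
}

/* Safe context: emit everything the handlers recorded, then reset. */
static void flush_nmi_bufs(FILE *out)
{
    for (int cpu = 0; cpu < NR_CPUS; cpu++) {
        fputs(nmi_buf[cpu].data, out);
        nmi_buf[cpu].len = 0;
        nmi_buf[cpu].data[0] = '\0';
    }
}
```

A caller would run fake_nmi_handler() once per simulated CPU and then call flush_nmi_bufs(stdout) from ordinary context. The point of the design is that the handler never touches anything another printk() on the same CPU might already hold.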

Could anyone give me some ideas?

Best Regards
Wang Long


2015-05-13 14:22:54

by Steven Rostedt

Subject: Re: [RFC] how to perform a safe NMI stack trace on all CPUs on x86?

On Wed, 13 May 2015 22:14:54 +0800
"王龙" <[email protected]> wrote:


> context. But how do we fix this problem in older kernels (e.g., 3.10 stable)?
> 3.10 stable has neither the "switch printk routine" nor the "seq_buf" infrastructure.
>
> Could anyone give me some ideas?
>

Backport the necessary patches.

-- Steve

2015-05-13 14:26:12

by Jiri Kosina

Subject: Re: [RFC] how to perform a safe NMI stack trace on all CPUs on x86?

On Wed, 13 May 2015, 王龙 wrote:

> Hi all,
>
> In kernels before 3.19, when trigger_all_cpu_backtrace() is called on x86,
> it triggers an NMI on each CPU and calls show_regs(). But this can lead
> to a hard lockup if the NMI interrupts another printk().
>
> Commit a9edc88093287183ac934be44f295f183b2c62dd (x86/nmi: Perform a safe
> NMI stack trace on all CPUs) fixes this problem in mainline. When the NMI
> triggers, it switches the printk routine for that CPU to an NMI-safe printk
> function that records the output in a per-CPU seq_buf descriptor. After all
> NMIs have finished recording their data, the seq_bufs are printed from a safe
> context. But how do we fix this problem in older kernels (e.g., 3.10 stable)?
> 3.10 stable has neither the "switch printk routine" nor the "seq_buf" infrastructure.
>
> Could anyone give me some ideas?

Either you backport the seq_buf-based approach to the older kernel, or, if you
are working on a 3.4 kernel or earlier (basically any kernel preceding the
printk() revamp that happened in 7ff9554bb57 and after), you can use a
slightly simpler approach.

It's the approach we used initially when we first found the issue, and it
is proven to work as well (but it's not applicable after Kay added all the
complexity to printk()).

You can see it in our SLE11 kernel tree, available on

http://kernel.suse.com/cgit/kernel/commit/?h=SLE11-SP4&id=8d62ae68ff61d77ae3c4899f05dbd9c9742b14c9

for example.

It's up to you to judge which is the least painful way :)

--
Jiri Kosina
SUSE Labs

2015-05-14 11:18:03

by Wang Long

Subject: Re: [RFC] how to perform a safe NMI stack trace on all CPUs on x86?

On 2015/5/13 22:22, Steven Rostedt wrote:
> On Wed, 13 May 2015 22:14:54 +0800
> "王龙" <[email protected]> wrote:
>
>
>> context. But how do we fix this problem in older kernels (e.g., 3.10 stable)?
>> 3.10 stable has neither the "switch printk routine" nor the "seq_buf" infrastructure.
>>
>> Could anyone give me some ideas?
>>
>
> Backport the necessary patches.
>
> -- Steve
>
Hi Steve,

Thank you for your reply. I will backport the necessary patches to 3.10 stable.
You are welcome to review my backport patches.

Best Regards
Wang Long

2015-05-14 12:13:07

by Wang Long

Subject: Re: [RFC] how to perform a safe NMI stack trace on all CPUs on x86?

On 2015/5/13 22:26, Jiri Kosina wrote:
> On Wed, 13 May 2015, 王龙 wrote:
>
>> Hi all,
>>
>> In kernels before 3.19, when trigger_all_cpu_backtrace() is called on x86,
>> it triggers an NMI on each CPU and calls show_regs(). But this can lead
>> to a hard lockup if the NMI interrupts another printk().
>>
>> Commit a9edc88093287183ac934be44f295f183b2c62dd (x86/nmi: Perform a safe
>> NMI stack trace on all CPUs) fixes this problem in mainline. When the NMI
>> triggers, it switches the printk routine for that CPU to an NMI-safe printk
>> function that records the output in a per-CPU seq_buf descriptor. After all
>> NMIs have finished recording their data, the seq_bufs are printed from a safe
>> context. But how do we fix this problem in older kernels (e.g., 3.10 stable)?
>> 3.10 stable has neither the "switch printk routine" nor the "seq_buf" infrastructure.
>>
>> Could anyone give me some ideas?
>
> Either you backport the seq_buf-based approach to the older kernel, or, if you
> are working on a 3.4 kernel or earlier (basically any kernel preceding the
> printk() revamp that happened in 7ff9554bb57 and after), you can use a
> slightly simpler approach.
>
> It's the approach we used initially when we first found the issue, and it
> is proven to work as well (but it's not applicable after Kay added all the
> complexity to printk()).
>
> You can see it in our SLE11 kernel tree, available on
>
> http://kernel.suse.com/cgit/kernel/commit/?h=SLE11-SP4&id=8d62ae68ff61d77ae3c4899f05dbd9c9742b14c9
>
> for example.
>
> It's up to you to judge which is the least painful way :)
>

Hi Jiri Kosina,

For 3.10 stable, the only way to solve this problem is to backport the seq_buf-based approach.

I will backport the necessary patches to 3.10 stable. You are welcome to review my backport patches.

Best Regards
Wang Long