2013-08-06 09:21:06

by Hatayama, Daisuke

Subject: [Help Test] kdump, x86, acpi: Reproduce CPU0 SMI corruption issue after unsetting BSP flag

Hello,

I've been addressing the kdump restriction that only one cpu is
available in the kdump 2nd kernel. Now I need to check whether the CPU0
SMI corruption issue fixed in the following commit can be reproduced
again by unsetting the BSP flag of the boot cpu:

commit 74b5820808215f65b70b05a099d6d3c969b82689
Author: Bjorn Helgaas <[email protected]>
Date: Wed Jul 29 15:54:25 2009 -0600

ACPI: bind workqueues to CPU 0 to avoid SMI corruption

On some machines, a software-initiated SMI causes corruption unless the
SMI runs on CPU 0. An SMI can be initiated by any AML, but typically it's
done in GPE-related methods that are run via workqueues, so we can avoid
the known corruption cases by binding the workqueues to CPU 0.

References:
http://bugzilla.kernel.org/show_bug.cgi?id=13751
https://bugs.launchpad.net/bugs/157171
https://bugs.launchpad.net/bugs/157691

Signed-off-by: Bjorn Helgaas <[email protected]>
Signed-off-by: Len Brown <[email protected]>

The reason is that I currently have two ideas for dealing with the
above kdump restriction:

1) Disable BSP at the 2nd kernel, posted at:
[PATCH v1 0/2] x86, apic: Disable BSP if boot cpu is AP
https://lkml.org/lkml/2012/10/16/15

2) Unset BSP flag at the 1st kernel, suggested by Eric Biederman
during the discussion of the idea 1).

With idea 1), the BSP is disabled in the kdump 2nd kernel. My conclusion
is that we have no method to reset the BSP, i.e. recover the BSP's
healthy state, while we can recover an AP by means of INIT as described
in the MP specification.

Idea 2) is simpler. We unset the BSP flag of the boot cpu in the 1st
kernel. The behaviour on receiving INIT depends on whether the BSP flag
is set in the cpu's MSR; we can set and unset the BSP flag in the MSR
freely at runtime. (I don't mean we should.)
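
For readers unfamiliar with the flag: on x86 the BSP flag is bit 8 of
the IA32_APIC_BASE MSR (address 0x1B). The following is only a minimal
user-space sketch of how that bit can be inspected, not the code of the
attached modules. It assumes the Linux msr driver is loaded so that
/dev/cpu/N/msr exists; the function names are made up for illustration.

```python
import os
import struct

IA32_APIC_BASE = 0x1B   # MSR address of IA32_APIC_BASE
BSP_FLAG_BIT = 8        # bit 8 is set on the bootstrap processor


def is_bsp(apic_base_value):
    """Return True if the BSP flag (bit 8) is set in an IA32_APIC_BASE value."""
    return bool((apic_base_value >> BSP_FLAG_BIT) & 1)


def clear_bsp(apic_base_value):
    """Return the same value with only the BSP flag cleared."""
    return apic_base_value & ~(1 << BSP_FLAG_BIT)


def read_apic_base(cpu):
    """Read IA32_APIC_BASE on one cpu via the msr driver (needs root).

    Assumption: /dev/cpu/<cpu>/msr is available (modprobe msr).
    The file offset selects the MSR; each read returns 8 bytes.
    """
    fd = os.open("/dev/cpu/%d/msr" % cpu, os.O_RDONLY)
    try:
        return struct.unpack("<Q", os.pread(fd, 8, IA32_APIC_BASE))[0]
    finally:
        os.close(fd)
```

As root, is_bsp(read_apic_base(0)) reports whether cpu 0 currently
carries the flag. The sketch deliberately does not write the MSR; as the
rest of this thread shows, clearing the flag can hang some machines.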

So, the next thing I should do is to evaluate the risk of idea 2). In
fact, during the discussion of idea 1), HPA pointed out that some
firmware is affected if the BSP flag is unset. Also, perhaps for the
same reason, the recently introduced cpu0 hot-plugging feature by
Fenghua Yu doesn't appear to unset the BSP flag.

The biggest problem now is that I don't have any of the machines
reported in the bugzilla entries; this issue inherently depends on
firmware.

So, could anyone help test idea 2) above if you have any of the
following machines? (or other ones that can lead to the same bug)

- HP Compaq 6910p
- HP Compaq 6710b
- HP Compaq 6710s
- HP Compaq 6510b
- HP Compaq 2510p

I prepared some small programs for this test. See the attached file.
The steps to try to reproduce the bug are as follows:

1. $ tar xf bsp_flag_modules.tar.gz; cd bsp_flag_modules
2. $ make # to build these programs
3. $ insmod unsetbspflag.ko # to unset BSP flag of the boot cpu
4. $ insmod getcpuinfo.ko # to confirm if BSP flag of the boot cpu has
# been unset.
$ dmesg | tail
5. Close the lid of the machine.
6. Wait some minutes if necessary.
7. Open the lid and you can see oops on the screen if bug has
successfully been reproduced.

--
Thanks.
HATAYAMA, Daisuke


Attachments:
bsp_flag_modules.tar.gz (8.97 kB)

2013-08-06 16:26:11

by Bjorn Helgaas

Subject: Re: [Help Test] kdump, x86, acpi: Reproduce CPU0 SMI corruption issue after unsetting BSP flag

On Tue, Aug 6, 2013 at 3:19 AM, HATAYAMA Daisuke
<[email protected]> wrote:
> So, could anyone help testing the idea 2) above if you have which of
> the following machines? (or other ones that can lead to the same bug)
>
> - HP Compaq 6910p
> - HP Compaq 6710b
> - HP Compaq 6710s
> - HP Compaq 6510b
> - HP Compaq 2510p

Sorry, I don't have access to any of these machines any more, and I
don't have any useful advice for you.

2013-08-07 10:05:29

by Hatayama, Daisuke

Subject: Re: [Help Test] kdump, x86, acpi: Reproduce CPU0 SMI corruption issue after unsetting BSP flag

(2013/08/07 1:25), Bjorn Helgaas wrote:
> On Tue, Aug 6, 2013 at 3:19 AM, HATAYAMA Daisuke
> <[email protected]> wrote:
>> So, could anyone help testing the idea 2) above if you have which of
>> the following machines? (or other ones that can lead to the same bug)
>>
>> - HP Compaq 6910p
>> - HP Compaq 6710b
>> - HP Compaq 6710s
>> - HP Compaq 6510b
>> - HP Compaq 2510p
>
> Sorry, I don't have access to any of these machines any more, and I
> don't have any useful advice for you.
>

I had guessed that answer from the change in your email address...

But thanks for your records in the commit and in bugzilla. They are
helpful for following the details of this bug.

--
Thanks.
HATAYAMA, Daisuke

2013-08-13 10:55:38

by Jingbai Ma

Subject: Re: [Help Test] kdump, x86, acpi: Reproduce CPU0 SMI corruption issue after unsetting BSP flag

On 08/06/2013 05:19 PM, HATAYAMA Daisuke wrote:
> Hello,
>
> I've addressing kdump restriction that there's only one cpu available
> on the kdump 2nd kernel. Now I need to check if the following CPU0 SMI
> corruption issue fixed in the following commit can again be reproduced
> by unsetting BSP flag of the boot cpu:
>
> commit 74b5820808215f65b70b05a099d6d3c969b82689
> Author: Bjorn Helgaas<[email protected]>
> Date: Wed Jul 29 15:54:25 2009 -0600
>
> ACPI: bind workqueues to CPU 0 to avoid SMI corruption
>
> On some machines, a software-initiated SMI causes corruption unless the
> SMI runs on CPU 0. An SMI can be initiated by any AML, but typically it's
> done in GPE-related methods that are run via workqueues, so we can avoid
> the known corruption cases by binding the workqueues to CPU 0.
>
> References:
> http://bugzilla.kernel.org/show_bug.cgi?id=13751
> https://bugs.launchpad.net/bugs/157171
> https://bugs.launchpad.net/bugs/157691
>
> Signed-off-by: Bjorn Helgaas<[email protected]>
> Signed-off-by: Len Brown<[email protected]>
>
> The reason is that in the current situation, I have two ideas to deal
> with the avove kdump restriction:
>
> 1) Disable BSP at the 2nd kernel, posted at:
> [PATCH v1 0/2] x86, apic: Disable BSP if boot cpu is AP
> https://lkml.org/lkml/2012/10/16/15
>
> 2) Unset BSP flag at the 1st kernel, suggested by Eric Biederman
> during the discussion of the idea 1).
>
> On the idea 1), BSP is disabled on the kdump 2nd kernel. My conclusion
> is that we have no method to reset BSP, i.e. recover BPS's healthy
> state, while we can recover AP by means of INIT as described in MP
> specification.
>
> The idea 2) is simpler. We unset BSP flag of the boot cpu at 1st
> kernel. The behaviour when receiving INIT depends on whether or not
> BSP flag is set or not on its MSR; we can set and unset BSP flag of
> MSR freely at runtime. (I don't mean we should).
>
> So, next thing I should do is to evalute risk of the idea 2). In fact,
> during the discussion of the idea 1), HPA pointed out that some kind
> of firmware affects if BSP flag is unset. Also, maybe from the same
> reason, recently introduced cpu0 hot-plugging feature by Fenghua Yu
> doesn't appear to unset BSP flag.
>
> The biggest problem next is that I don't have any machines reported in
> the bugzilla articles; this issue inherently depends on firmware.
>
> So, could anyone help testing the idea 2) above if you have which of
> the following machines? (or other ones that can lead to the same bug)
>
> - HP Compaq 6910p
> - HP Compaq 6710b
> - HP Compaq 6710s
> - HP Compaq 6510b
> - HP Compaq 2510p
>
> I prepared a small programs for this test. See the attached file.
> The steps to try to reproduce the bug is as follows:
>
> 1. $ tar xf bsp_flag_modules.tar.gz; cd bsp_flag_modules
> 2. $ make # to build these programs
> 3. $ insmod unsetbspflag.ko # to unset BSP flag of the boot cpu
> 4. $ insmod getcpuinfo.ko # to confirm if BSP flag of the boot cpu has
> # been unset.
> $ dmesg | tail
> 5. Close the lid of the machine.
> 6. Wait some minutes if necessary.
> 7. Open the lid and you can see oops on the screen if bug has
> successfully been reproduced.
>

I couldn't find any of the models listed above, but I found one HP
EliteBook 6930p. I tested this machine with kernel 2.6.30 first. After
resuming from suspend, the system hung.

Then I tested with kernel 3.11.0-rc5. It worked well and could resume
from suspend without any problem.

Next, I tested your program to clear the BSP flag. I found that
unsetbspflag.ko didn't work every time; sometimes I had to execute
insmod/rmmod several times to clear the BSP flag. (I used your
getcpuinfo.ko to check the BSP flag.)

cpu: 0 bios_apic: 0 apic: 0 AP
cpu: 1 bios_apic: 1 apic: 1 AP

I suspended it, and then resumed it. This machine resumed from suspend
successfully, but the BSP flag had been set back:

cpu: 0 bios_apic: 0 apic: 0 BSP
cpu: 1 bios_apic: 1 apic: 1 AP
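
A check like the one above can be automated. The following is only a
sketch: it assumes the report lines look exactly like the ones quoted
in this message (possibly with a dmesg timestamp in front), and the
function names are invented for illustration.

```python
import re

# Matches lines such as "cpu: 0 bios_apic: 0 apic: 0 BSP".
# search() is used so that a leading dmesg timestamp is tolerated.
LINE_RE = re.compile(
    r"cpu:\s*(\d+)\s+bios_apic:\s*(\d+)\s+apic:\s*(\d+)\s+(BSP|AP)\b"
)


def parse_cpuinfo(text):
    """Map cpu number -> "BSP" or "AP" for every matching report line."""
    roles = {}
    for line in text.splitlines():
        match = LINE_RE.search(line)
        if match:
            roles[int(match.group(1))] = match.group(4)
    return roles


def bsp_cpus(text):
    """Return the sorted list of cpus that report the BSP flag."""
    return sorted(cpu for cpu, role in parse_cpuinfo(text).items()
                  if role == "BSP")


before_suspend = "cpu: 0 bios_apic: 0 apic: 0 AP\ncpu: 1 bios_apic: 1 apic: 1 AP\n"
after_resume = "cpu: 0 bios_apic: 0 apic: 0 BSP\ncpu: 1 bios_apic: 1 apic: 1 AP\n"

print(bsp_cpus(before_suspend))  # [] : no cpu carries the flag
print(bsp_cpus(after_resume))    # [0] : the flag is back on cpu 0
```

An empty list means no cpu currently carries the BSP flag, which is the
state the test in this thread tries to create.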

That's all my observation. Hope it's helpful.

--
Thanks,
Jingbai Ma

2013-08-14 09:13:15

by Jingbai Ma

Subject: Re: [Help Test] kdump, x86, acpi: Reproduce CPU0 SMI corruption issue after unsetting BSP flag

On 08/13/2013 06:55 PM, Jingbai Ma wrote:
> On 08/06/2013 05:19 PM, HATAYAMA Daisuke wrote:
>> Hello,
>>
>> I've addressing kdump restriction that there's only one cpu available
>> on the kdump 2nd kernel. Now I need to check if the following CPU0 SMI
>> corruption issue fixed in the following commit can again be reproduced
>> by unsetting BSP flag of the boot cpu:
>>
>> commit 74b5820808215f65b70b05a099d6d3c969b82689
>> Author: Bjorn Helgaas<[email protected]>
>> Date: Wed Jul 29 15:54:25 2009 -0600
>>
>> ACPI: bind workqueues to CPU 0 to avoid SMI corruption
>>
>> On some machines, a software-initiated SMI causes corruption unless the
>> SMI runs on CPU 0. An SMI can be initiated by any AML, but typically it's
>> done in GPE-related methods that are run via workqueues, so we can avoid
>> the known corruption cases by binding the workqueues to CPU 0.
>>
>> References:
>> http://bugzilla.kernel.org/show_bug.cgi?id=13751
>> https://bugs.launchpad.net/bugs/157171
>> https://bugs.launchpad.net/bugs/157691
>>
>> Signed-off-by: Bjorn Helgaas<[email protected]>
>> Signed-off-by: Len Brown<[email protected]>
>>
>> The reason is that in the current situation, I have two ideas to deal
>> with the avove kdump restriction:
>>
>> 1) Disable BSP at the 2nd kernel, posted at:
>> [PATCH v1 0/2] x86, apic: Disable BSP if boot cpu is AP
>> https://lkml.org/lkml/2012/10/16/15
>>
>> 2) Unset BSP flag at the 1st kernel, suggested by Eric Biederman
>> during the discussion of the idea 1).
>>
>> On the idea 1), BSP is disabled on the kdump 2nd kernel. My conclusion
>> is that we have no method to reset BSP, i.e. recover BPS's healthy
>> state, while we can recover AP by means of INIT as described in MP
>> specification.
>>
>> The idea 2) is simpler. We unset BSP flag of the boot cpu at 1st
>> kernel. The behaviour when receiving INIT depends on whether or not
>> BSP flag is set or not on its MSR; we can set and unset BSP flag of
>> MSR freely at runtime. (I don't mean we should).
>>
>> So, next thing I should do is to evalute risk of the idea 2). In fact,
>> during the discussion of the idea 1), HPA pointed out that some kind
>> of firmware affects if BSP flag is unset. Also, maybe from the same
>> reason, recently introduced cpu0 hot-plugging feature by Fenghua Yu
>> doesn't appear to unset BSP flag.
>>
>> The biggest problem next is that I don't have any machines reported in
>> the bugzilla articles; this issue inherently depends on firmware.
>>
>> So, could anyone help testing the idea 2) above if you have which of
>> the following machines? (or other ones that can lead to the same bug)
>>
>> - HP Compaq 6910p
>> - HP Compaq 6710b
>> - HP Compaq 6710s
>> - HP Compaq 6510b
>> - HP Compaq 2510p
>>
>> I prepared a small programs for this test. See the attached file.
>> The steps to try to reproduce the bug is as follows:
>>
>> 1. $ tar xf bsp_flag_modules.tar.gz; cd bsp_flag_modules
>> 2. $ make # to build these programs
>> 3. $ insmod unsetbspflag.ko # to unset BSP flag of the boot cpu
>> 4. $ insmod getcpuinfo.ko # to confirm if BSP flag of the boot cpu has
>> # been unset.
>> $ dmesg | tail
>> 5. Close the lid of the machine.
>> 6. Wait some minutes if necessary.
>> 7. Open the lid and you can see oops on the screen if bug has
>> successfully been reproduced.
>>
>
> I couldn't find any model list above, but found one HP EliteBook 6930p.
> I tested this machine with kernel 2.6.30 first. After resuming from
> suspend, system hang.
>
> Then, I tested with kernel 3.11.0-rc5, it worked well, could resume from
> suspend without any problem.
>
> Next, I tested your program to clear BSP flag, I found the
> unsetbspflag.ko didn't work everytime, sometimes I have to execute
> insmod/rmmod several times to clear the BSP flag. (I used your
> getcpuinfo.ko to check the BSP flag)
>
> cpu: 0 bios_apic: 0 apic: 0 AP
> cpu: 1 bios_apic: 1 apic: 1 AP
>
> I suspended it, and them resumed it. This machine resumed from suspend
> successfully, but the BSP flag has been set back:
>
> cpu: 0 bios_apic: 0 apic: 0 BSP
> cpu: 1 bios_apic: 1 apic: 1 AP
>
> That's all my observation. Hope it's helpful.
>

I found a side effect of unsetting the BSP flag.
It affects system reboot: once the BSP flag has been removed and a
reboot command is issued, the system hangs after the message:
Restarting system.
A hardware reset is then needed to recover.

I have reproduced this problem on the following systems:
HP EliteBook 6930p
HP Compaq DC7700
HP ProLiant DL980 (4 sockets, 40 cores)

I have an idea: to avoid this kind of issue, we can unset the BSP flag
in the first kernel during crash processing, and restore it in the
second kernel during AP initialization.

--
Thanks,
Jingbai Ma

2013-08-14 19:46:23

by Eric W. Biederman

Subject: Re: [Help Test] kdump, x86, acpi: Reproduce CPU0 SMI corruption issue after unsetting BSP flag

Jingbai Ma <[email protected]> writes:

> I found a side effect of unsetting BSP flag.
> It affected system rebooting, once the BSP flags been removed, and issue
> reboot command, system will hang after message:
> Restarting system.
> And have to do a hardware reset to recover it.
>
> I have reproduced this problem on the following systems:
> HP EliteBook 6930p
> HP Compaq DC7700
> HP ProLiant DL980 (4 sockets, 40 cores)
>
> I have an idea: To avoid such kind of issue, we can unset BSP flag in
> the first kernel during crash processing, and restore it in the second
> kernel in the APs initializing.

The premise was that clearing the BSP flag would not be an issue. If we
could reliably count on unsetting the BSP flag during crash processing,
we could just switch to the BSP and be done, totally avoiding this
problem.

Given that there are real world issues with clearing the BSP flag,
I believe the alternate suggestion was to simply never attempt to start
the bootstrap processor during processor bring up.

If as normal we are running on the bootstrap processor everything will
work the same, but if we are in the kdump scenario we will be short one
core. Being short one core seems like a reasonable tradeoff between
reliability and performance.

Eric

2013-08-19 01:58:45

by Hatayama, Daisuke

Subject: Re: [Help Test] kdump, x86, acpi: Reproduce CPU0 SMI corruption issue after unsetting BSP flag

(2013/08/14 18:13), Jingbai Ma wrote:
> On 08/13/2013 06:55 PM, Jingbai Ma wrote:
>> On 08/06/2013 05:19 PM, HATAYAMA Daisuke wrote:
>>> Hello,
>>>
>>> I've addressing kdump restriction that there's only one cpu available
>>> on the kdump 2nd kernel. Now I need to check if the following CPU0 SMI
>>> corruption issue fixed in the following commit can again be reproduced
>>> by unsetting BSP flag of the boot cpu:
>>>
>>> commit 74b5820808215f65b70b05a099d6d3c969b82689
>>> Author: Bjorn Helgaas<[email protected]>
>>> Date: Wed Jul 29 15:54:25 2009 -0600
>>>
>>> ACPI: bind workqueues to CPU 0 to avoid SMI corruption
>>>
>>> On some machines, a software-initiated SMI causes corruption unless the
>>> SMI runs on CPU 0. An SMI can be initiated by any AML, but typically it's
>>> done in GPE-related methods that are run via workqueues, so we can avoid
>>> the known corruption cases by binding the workqueues to CPU 0.
>>>
>>> References:
>>> http://bugzilla.kernel.org/show_bug.cgi?id=13751
>>> https://bugs.launchpad.net/bugs/157171
>>> https://bugs.launchpad.net/bugs/157691
>>>
>>> Signed-off-by: Bjorn Helgaas<[email protected]>
>>> Signed-off-by: Len Brown<[email protected]>
>>>
>>> The reason is that in the current situation, I have two ideas to deal
>>> with the avove kdump restriction:
>>>
>>> 1) Disable BSP at the 2nd kernel, posted at:
>>> [PATCH v1 0/2] x86, apic: Disable BSP if boot cpu is AP
>>> https://lkml.org/lkml/2012/10/16/15
>>>
>>> 2) Unset BSP flag at the 1st kernel, suggested by Eric Biederman
>>> during the discussion of the idea 1).
>>>
>>> On the idea 1), BSP is disabled on the kdump 2nd kernel. My conclusion
>>> is that we have no method to reset BSP, i.e. recover BPS's healthy
>>> state, while we can recover AP by means of INIT as described in MP
>>> specification.
>>>
>>> The idea 2) is simpler. We unset BSP flag of the boot cpu at 1st
>>> kernel. The behaviour when receiving INIT depends on whether or not
>>> BSP flag is set or not on its MSR; we can set and unset BSP flag of
>>> MSR freely at runtime. (I don't mean we should).
>>>
>>> So, next thing I should do is to evalute risk of the idea 2). In fact,
>>> during the discussion of the idea 1), HPA pointed out that some kind
>>> of firmware affects if BSP flag is unset. Also, maybe from the same
>>> reason, recently introduced cpu0 hot-plugging feature by Fenghua Yu
>>> doesn't appear to unset BSP flag.
>>>
>>> The biggest problem next is that I don't have any machines reported in
>>> the bugzilla articles; this issue inherently depends on firmware.
>>>
>>> So, could anyone help testing the idea 2) above if you have which of
>>> the following machines? (or other ones that can lead to the same bug)
>>>
>>> - HP Compaq 6910p
>>> - HP Compaq 6710b
>>> - HP Compaq 6710s
>>> - HP Compaq 6510b
>>> - HP Compaq 2510p
>>>
>>> I prepared a small programs for this test. See the attached file.
>>> The steps to try to reproduce the bug is as follows:
>>>
>>> 1. $ tar xf bsp_flag_modules.tar.gz; cd bsp_flag_modules
>>> 2. $ make # to build these programs
>>> 3. $ insmod unsetbspflag.ko # to unset BSP flag of the boot cpu
>>> 4. $ insmod getcpuinfo.ko # to confirm if BSP flag of the boot cpu has
>>> # been unset.
>>> $ dmesg | tail
>>> 5. Close the lid of the machine.
>>> 6. Wait some minutes if necessary.
>>> 7. Open the lid and you can see oops on the screen if bug has
>>> successfully been reproduced.
>>>
>>
>> I couldn't find any model list above, but found one HP EliteBook 6930p.
>> I tested this machine with kernel 2.6.30 first. After resuming from
>> suspend, system hang.
>>
>> Then, I tested with kernel 3.11.0-rc5, it worked well, could resume from
>> suspend without any problem.
>>
>> Next, I tested your program to clear BSP flag, I found the
>> unsetbspflag.ko didn't work everytime, sometimes I have to execute
>> insmod/rmmod several times to clear the BSP flag. (I used your
>> getcpuinfo.ko to check the BSP flag)
>>
>> cpu: 0 bios_apic: 0 apic: 0 AP
>> cpu: 1 bios_apic: 1 apic: 1 AP
>>
>> I suspended it, and them resumed it. This machine resumed from suspend
>> successfully, but the BSP flag has been set back:
>>
>> cpu: 0 bios_apic: 0 apic: 0 BSP
>> cpu: 1 bios_apic: 1 apic: 1 AP
>>
>> That's all my observation. Hope it's helpful.
>>
>
> I found a side effect of unsetting BSP flag.
> It affected system rebooting, once the BSP flags been removed, and issue
> reboot command, system will hang after message:
> Restarting system.
> And have to do a hardware reset to recover it.
>
> I have reproduced this problem on the following systems:
> HP EliteBook 6930p
> HP Compaq DC7700
> HP ProLiant DL980 (4 sockets, 40 cores)
>

# Sorry for the delayed response. I was on vacation last week.

Thanks for your help, Ma. This result is enough to indicate the risk of
unsetting the BSP flag in the 1st kernel.

BTW, I have a question: does normal kdump work well if the crash happens
on some AP? I wonder whether the same issue could happen in the 2nd
kernel.

> I have an idea: To avoid such kind of issue, we can unset BSP flag in
> the first kernel during crash processing, and restore it in the second
> kernel in the APs initializing.
>

As Eric has already suggested, we cannot rely on the kdump crash path.
There is certainly code in the kdump crash path that tries to
disable/reset a variety of CPU features. However, it is only best
effort. In the worst catastrophic case, even the reset code can be
broken. That is the same reason why my first patch, which switched to
the BSP if the crash happened on an AP, was nacked.

--
Thanks.
HATAYAMA, Daisuke

2013-08-19 02:30:10

by Hatayama, Daisuke

Subject: Re: [Help Test] kdump, x86, acpi: Reproduce CPU0 SMI corruption issue after unsetting BSP flag

(2013/08/15 4:45), Eric W. Biederman wrote:
> Jingbai Ma <[email protected]> writes:
>
>> I found a side effect of unsetting BSP flag.
>> It affected system rebooting, once the BSP flags been removed, and issue
>> reboot command, system will hang after message:
>> Restarting system.
>> And have to do a hardware reset to recover it.
>>
>> I have reproduced this problem on the following systems:
>> HP EliteBook 6930p
>> HP Compaq DC7700
>> HP ProLiant DL980 (4 sockets, 40 cores)
>>
>> I have an idea: To avoid such kind of issue, we can unset BSP flag in
>> the first kernel during crash processing, and restore it in the second
>> kernel in the APs initializing.
>
> The premise was clearing BSP would not be an issue. If we could
> reliably count on unsetting the BSP during crash processing we could
> just switch to the BSP and be done totally avoid this problem.
>
> Given that there are reald world issues with clearing the BSP flag,
> I believe the alternate suggestion was to simply never attempt to start
> the bootstrap processor during processor bring up.
>
> If as normal we are running on the bootstrap processor everything will
> work the same, but if we are in the kdump scenario we will be short one
> core. Being short one core seems like a reasonable tradeoff between
> reliability and performance.
>
> Eric

Sorry Eric, I'm not clear on what you mean by ``short one core''...
Which are you suggesting? That disabling the BSP if the crash happens on
an AP is reasonable? Or that restricting cpus to a single one, just as
in the current kdump configuration, is reasonable?

--
Thanks.
HATAYAMA, Daisuke

2013-08-19 03:36:57

by Eric W. Biederman

Subject: Re: [Help Test] kdump, x86, acpi: Reproduce CPU0 SMI corruption issue after unsetting BSP flag




>
>Sorry Eric, I'm not clear to what you mean by ``short one core''...
>Which are you suggesting? Disabling BSP if crash happens on AP is
>reasonable?
>Or restricting cpus to a single one only just as the current kdump
>configuration is reasonable?

I am suggesting we start every cpu except the BSP from the AP we started on.

N-1 cpus seems like a good tradeoff between performance and reliability for those who need it.

Eric

2013-08-19 09:14:36

by Hatayama, Daisuke

Subject: Re: [Help Test] kdump, x86, acpi: Reproduce CPU0 SMI corruption issue after unsetting BSP flag

(2013/08/19 11:59), Eric W. Biederman wrote:
>
>
>
>>
>> Sorry Eric, I'm not clear to what you mean by ``short one core''...
>> Which are you suggesting? Disabling BSP if crash happens on AP is
>> reasonable?
>> Or restricting cpus to a single one only just as the current kdump
>> configuration is reasonable?
>
> I am suggesting we start every cpu except the BSP from the AP we started on.
>
> N-1 cpus seems like a good tradeoff between performance and reliability for those who need it.
>
> Eric
>

Thanks. It's clear to me now.

Well, I'll post a version 2 patch after updating it for the current
upstream kernel. But it will be next week or later.

--
Thanks.
HATAYAMA, Daisuke

2013-08-19 13:46:33

by Petr Tesařík

Subject: Re: [Help Test] kdump, x86, acpi: Reproduce CPU0 SMI corruption issue after unsetting BSP flag

On Sun, 18 Aug 2013 19:59:53 -0700
"Eric W. Biederman" <[email protected]> wrote:

>
>
>
> >
> >Sorry Eric, I'm not clear to what you mean by ``short one core''...
> >Which are you suggesting? Disabling BSP if crash happens on AP is
> >reasonable?
> >Or restricting cpus to a single one only just as the current kdump
> >configuration is reasonable?
>
> I am suggesting we start every cpu except the BSP from the AP we started on.
>
> N-1 cpus seems like a good tradeoff between performance and reliability for those who need it.

FWIW a large customer of ours is fine with such a limitation. And I
have already tested this approach manually (starting the kdump kernel
with maxcpus=1 and hot-plugging the remaining APs from user-space).
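
For illustration, the manual procedure could look roughly like the
sketch below. It relies on the standard sysfs cpu hot-plug interface
(/sys/devices/system/cpu/cpuN/online). Which logical cpu number the
original BSP gets in the kdump kernel is not stated in this thread, so
the skip set is a hypothetical parameter the caller has to supply, and
the function names are invented.

```python
import glob
import os
import re

SYSFS_CPU_DIR = "/sys/devices/system/cpu"


def online_path(cpu, base=SYSFS_CPU_DIR):
    """Path of the sysfs control file that onlines/offlines one cpu."""
    return os.path.join(base, "cpu%d" % cpu, "online")


def hotpluggable_cpus(base=SYSFS_CPU_DIR):
    """Cpu numbers that expose an 'online' control file."""
    cpus = []
    for path in glob.glob(os.path.join(base, "cpu[0-9]*", "online")):
        match = re.search(r"cpu(\d+)$", os.path.dirname(path))
        if match:
            cpus.append(int(match.group(1)))
    return sorted(cpus)


def cpus_to_online(candidates, skip):
    """All candidate cpus except those in 'skip' (e.g. the old BSP)."""
    return [cpu for cpu in sorted(candidates) if cpu not in skip]


def bring_online(cpu, base=SYSFS_CPU_DIR):
    """Ask the kernel to online one cpu (needs root)."""
    with open(online_path(cpu, base), "w") as control:
        control.write("1")


# Example use in the kdump kernel, run as root. The value 2 for the old
# BSP is purely hypothetical:
#
#     for cpu in cpus_to_online(hotpluggable_cpus(), skip={2}):
#         bring_online(cpu)
```

The cpu the kernel booted on usually has no 'online' file or is already
online, so in practice only the remaining cpus are touched.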

Now that this approach is in line with upstream efforts, I'm going to
test it on some more machines and see if there are any troubles.

@Hatayama-san:
> BTW, I have question that does normal kdump work well if crash happens
> on some AP? I wonder the same issue could happen on the 2nd kernel.

I'm not sure what you mean. Normal kdump starts with "maxcpus=1", and
yes, that works even if the secondary kernel is booted from an AP. OTOH
I suspect that not having any BSP in the system may be the cause of some
mysterious random reboots and/or hangs experienced by some customers.

I'll try setting the BSP flag on the boot CPU unconditionally and see
if it makes any difference.

Petr Tesarik

2013-08-20 03:14:15

by Hatayama, Daisuke

Subject: Re: [Help Test] kdump, x86, acpi: Reproduce CPU0 SMI corruption issue after unsetting BSP flag

(2013/08/19 22:46), Petr Tesarik wrote:
> On Sun, 18 Aug 2013 19:59:53 -0700
> "Eric W. Biederman" <[email protected]> wrote:
>
>>
>>
>>
>>>
>>> Sorry Eric, I'm not clear to what you mean by ``short one core''...
>>> Which are you suggesting? Disabling BSP if crash happens on AP is
>>> reasonable?
>>> Or restricting cpus to a single one only just as the current kdump
>>> configuration is reasonable?
>>
>> I am suggesting we start every cpu except the BSP from the AP we started on.
>>
>> N-1 cpus seems like a good tradeoff between performance and reliability for those who need it.
>
> FWIW a large customers of ours is fine with such a limitation. And I
> have already tested this approach manually (starting the kdump kernel
> with maxcpus=1 and hot-plugging the remaining APs from user-space).
>

Is this the workaround I suggested previously on this mailing list?
The additional merits of disabling the BSP on the kernel side in the 2nd
kernel are:

- We can assign the memory for the BSP to another CPU that is available.
This is more efficient in memory consumption. It's the same reason why
distros use nr_cpus=1 instead of maxcpus=1. If we don't disable the BSP,
we allocate some amount of kernel-space memory for it although we never
use it. The 2nd kernel should have as small an amount of memory as
possible.

- It removes the BSP from the set of hot-pluggable CPUs. Keeping the BSP
after a crash happens on an AP means keeping a potential risk of
triggering a system hang from user-space. It also removes the awkward
configuration of hot-adding APs from user-space while avoiding the BSP.
This seems less important than the above.

In a practical configuration, it's necessary to decide how many cpus to
use in the 2nd kernel as a trade-off between performance and an
acceptable amount of memory for the additional CPUs. I think in most
cases it would simply be determined by the number of disks and the
number of makedumpfile threads.

> Now that this approach is in line with upstream efforts, I'm going to
> test it on some more machines and see if there are any troubles.
>
> @Hatayama-san:
>> BTW, I have question that does normal kdump work well if crash happens
>> on some AP? I wonder the same issue could happen on the 2nd kernel.
>
> I'm not sure what you mean. Normal kdump starts with "maxcpus=1", and
> yes, that works even if the secondary kernel is booted from an AP. OTOH
> I suspect that not having any BSP in the system may be the cause of some
> mysterious random reboots and/or hangs experienced by some customers.
>
> I'll try setting the BSP flag on the boot CPU unconditionally and see
> if it makes any difference.
>
> Petr Tesarik
>

Ma saw a hang when he tried to reboot his HP systems in the 1st kernel
under the condition that the BSP flag was not set on any existing CPU.
I thought that condition is similar to the 2nd kernel after a crash
happens on an AP, in the sense that there is no BSP among the online
CPUs. If the hang he saw was caused by running reboot on a CPU without
the BSP flag, I guess the same situation could already happen in the
current single-cpu configuration.

--
Thanks.
HATAYAMA, Daisuke