Linux has logical cpu offline capability. That can be triggered by:
# echo 0 > /sys/devices/system/cpu/cpuX/online
In Intel Architecture, MCE's are broadcasted to all CPUs in the system.
This includes the CPUs marked offline by Linux. Unless the CPU's were removed
via an ACPI notification, in which case the cpu's are removed from the
cpu_present_map.
This patch ensures offline CPU's don't participate in MCE rendezvous, but
simply perform clearing some status bits to ensure a second MCE wont cause
automatic shutdown.
Without the patch, mce_start will increment mce_callin, but mce_start would
wait for all online_cpus. So offline cpu's should avoid participating in the
rendezvous process.
Reviewed-by: Tony Luck <[email protected]>
Signed-off-by: Ashok Raj <[email protected]>
---
arch/x86/kernel/cpu/mcheck/mce.c | 11 ++++++++++-
1 file changed, 10 insertions(+), 1 deletion(-)
diff --git a/arch/x86/kernel/cpu/mcheck/mce.c b/arch/x86/kernel/cpu/mcheck/mce.c
index c5b0d56..82a0c8b 100644
--- a/arch/x86/kernel/cpu/mcheck/mce.c
+++ b/arch/x86/kernel/cpu/mcheck/mce.c
@@ -998,6 +998,7 @@ void do_machine_check(struct pt_regs *regs, long error_code)
u64 recover_paddr = ~0ull;
int flags = MF_ACTION_REQUIRED;
int lmce = 0;
+ unsigned int cpu = smp_processor_id();
ist_enter(regs);
@@ -1008,6 +1009,14 @@ void do_machine_check(struct pt_regs *regs, long error_code)
mce_gather_info(&m, regs);
+ /*
+ * if this cpu is offline, just bail out.
+ * TBD: looking into adding any logs this offline CPU might have,
+ * to be collected and reported by the rendezvous master.
+ */
+ if (cpu_is_offline(cpu) && (m.mcgstatus & MCG_STATUS_RIPV))
+ goto out;
+
final = this_cpu_ptr(&mces_seen);
*final = m;
@@ -1142,8 +1151,8 @@ void do_machine_check(struct pt_regs *regs, long error_code)
if (worst > 0)
mce_report_event(regs);
- mce_wrmsrl(MSR_IA32_MCG_STATUS, 0);
out:
+ mce_wrmsrl(MSR_IA32_MCG_STATUS, 0);
sync_core();
if (recover_paddr == ~0ull)
--
2.4.3
On Thu, Dec 03, 2015 at 07:16:10PM -0500, Ashok Raj wrote:
> Linux has logical cpu offline capability. That can be triggered by:
>
> # echo 0 > /sys/devices/system/cpu/cpuX/online
>
> In Intel Architecture, MCE's are broadcasted to all CPUs in the system.
>
> This includes the CPUs marked offline by Linux. Unless the CPU's were removed
> via an ACPI notification, in which case the cpu's are removed from the
> cpu_present_map.
>
> This patch ensures offline CPU's don't participate in MCE rendezvous, but
> simply perform clearing some status bits to ensure a second MCE wont cause
> automatic shutdown.
>
> Without the patch, mce_start will increment mce_callin, but mce_start would
> wait for all online_cpus. So offline cpu's should avoid participating in the
> rendezvous process.
>
> Reviewed-by: Tony Luck <[email protected]>
> Signed-off-by: Ashok Raj <[email protected]>
> ---
> arch/x86/kernel/cpu/mcheck/mce.c | 11 ++++++++++-
> 1 file changed, 10 insertions(+), 1 deletion(-)
>
<formletter>
This is not the correct way to submit patches for inclusion in the
stable kernel tree. Please read Documentation/stable_kernel_rules.txt
for how to do this properly.
</formletter>
On Thu, Dec 03, 2015 at 07:16:10PM -0500, Ashok Raj wrote:
> Linux has logical cpu offline capability. That can be triggered by:
>
> # echo 0 > /sys/devices/system/cpu/cpuX/online
>
> In Intel Architecture, MCE's are broadcasted to all CPUs in the system.
>
> This includes the CPUs marked offline by Linux. Unless the CPU's were removed
> via an ACPI notification, in which case the cpu's are removed from the
> cpu_present_map.
>
> This patch ensures offline CPU's don't participate in MCE rendezvous, but
> simply perform clearing some status bits to ensure a second MCE wont cause
> automatic shutdown.
>
> Without the patch, mce_start will increment mce_callin, but mce_start would
> wait for all online_cpus. So offline cpu's should avoid participating in the
> rendezvous process.
>
> Reviewed-by: Tony Luck <[email protected]>
> Signed-off-by: Ashok Raj <[email protected]>
> ---
> arch/x86/kernel/cpu/mcheck/mce.c | 11 ++++++++++-
> 1 file changed, 10 insertions(+), 1 deletion(-)
>
> diff --git a/arch/x86/kernel/cpu/mcheck/mce.c b/arch/x86/kernel/cpu/mcheck/mce.c
> index c5b0d56..82a0c8b 100644
> --- a/arch/x86/kernel/cpu/mcheck/mce.c
> +++ b/arch/x86/kernel/cpu/mcheck/mce.c
> @@ -998,6 +998,7 @@ void do_machine_check(struct pt_regs *regs, long error_code)
> u64 recover_paddr = ~0ull;
> int flags = MF_ACTION_REQUIRED;
> int lmce = 0;
> + unsigned int cpu = smp_processor_id();
>
> ist_enter(regs);
>
> @@ -1008,6 +1009,14 @@ void do_machine_check(struct pt_regs *regs, long error_code)
>
> mce_gather_info(&m, regs);
>
> + /*
> + * if this cpu is offline, just bail out.
> + * TBD: looking into adding any logs this offline CPU might have,
> + * to be collected and reported by the rendezvous master.
> + */
> + if (cpu_is_offline(cpu) && (m.mcgstatus & MCG_STATUS_RIPV))
> + goto out;
This CPU - it being offline and all - is not doing the minimal amount of
work possible IMO.
Why does it have to do ist_enter(), this_cpu_inc(mce_exception_count),
etc?
IMO the only things it should do is this:
if (cpu_is_offline(smp_processor_id())) {
mce_wrmsrl(MSR_IA32_MCG_STATUS, 0);
return;
}
and that should be at the very beginning of do_machine_check(). So
that the hardware is happy. Concerning Linux, it is offline so no data
structures on it are valid.
Hmmm?
P.S., please don't put stable@ to CC - add it as a "CC: " line in the
SOB section instead.
--
Regards/Gruss,
Boris.
ECO tip #101: Trim your mails when you reply.
On Fri, Dec 04, 2015 at 12:14:20PM -0500, Raj, Ashok wrote:
> Yes, thats possible to not do ist_enter() and the exception count.
>
> I tried to keep most of the part as is and leveraging code already
> doing the reading of MCG_STATUS. Architecturally we need to also check RIPV
> and if clear we should initiate shutdown.
So add that check too.
> When we add the logging from offline cpus as next step it would be safe to
> use interrupt stack, and the offline
Franky, I'm not sure at all and very very wary of adding *any* code
which runs on an offlined CPU. Because *no one* does that and it hasn't
been tested at all. So who knows what happens.
What we should be doing is execute the *minimal* amount of code possible
and get out. No counting, no per-cpu variables. No nothing.
> I liked the observability part keeping the exception count. if and
> when we online the cpu again, it might look as it noticed nothing. Now
> we can check /proc/interrupts and see the offline cpu also observed
> the MCE.
And? Tell us what? That SMM fondled the hardware under our feet. TBH,
I'd tend to be much more drastic here and even taint the kernel. I mean,
seriously, what kind of MCEs which happen as a result of OS execution
are you expecting to get reported on an offlined CPU?
I can't think of very any.
Because we have been considering offlining a core as one possible RAS
action. So what happens is a user or a RAS agent offlines a core and
yet, that offlined core still reports MCEs. Something's terribly wrong
with that picture, IMO.
--
Regards/Gruss,
Boris.
ECO tip #101: Trim your mails when you reply.
Hi Boris
On Fri, Dec 04, 2015 at 03:34:04PM +0100, Borislav Petkov wrote:
> > @@ -1008,6 +1009,14 @@ void do_machine_check(struct pt_regs *regs, long error_code)
> > + if (cpu_is_offline(cpu) && (m.mcgstatus & MCG_STATUS_RIPV))
> > + goto out;
>
> This CPU - it being offline and all - is not doing the minimal amount of
> work possible IMO.
>
> Why does it have to do ist_enter(), this_cpu_inc(mce_exception_count),
> etc?
Yes, thats possible to not do ist_enter() and the exception count.
I tried to keep most of the part as is and leveraging code already
doing the reading of MCG_STATUS. Architecturally we need to also check RIPV
and if clear we should initiate shutdown.
When we add the logging from offline cpus as next step it would be safe to
use interrupt stack, and the offline
I liked the observability part keeping the exception count. if and when we
online the cpu again, it might look as it noticed nothing. Now we can
check /proc/interrupts and see the offline cpu also observed the MCE.
>
> IMO the only things it should do is this:
>
> if (cpu_is_offline(smp_processor_id())) {
> mce_wrmsrl(MSR_IA32_MCG_STATUS, 0);
> return;
> }
>
> and that should be at the very beginning of do_machine_check(). So
> that the hardware is happy. Concerning Linux, it is offline so no data
> structures on it are valid.
>
> P.S., please don't put stable@ to CC - add it as a "CC: " line in the
> SOB section instead.
Let me know what you think, i can resend with the Cc: stable line.. I
Did add the stable line in the right section in an earlier version, but
deleting some extraneous commit messages accidently got to this one :(.
Cheers,
Ashok
> Franky, I'm not sure at all and very very wary of adding *any* code
> which runs on an offlined CPU. Because *no one* does that and it hasn't
> been tested at all. So who knows what happens.
>
> What we should be doing is execute the *minimal* amount of code possible
> and get out. No counting, no per-cpu variables. No nothing.
The minimal code requires we use:
smp_processor_id() [to get our cpu number]
cpu_is_offline() [to find out the cpu is offline]
The first of those looks more dangerous in that it accesses a per-cpu variable.
I don't think we need to be totally paranoid here. We know that the offline cpus
were once online and went through normal kernel initialization code (if they didn't,
then we can't possibly be executing this code ... their CR4.MCE bit would be zero so their
response to a machine check would have been to reset the system).
> Because we have been considering offlining a core as one possible RAS
> action. So what happens is a user or a RAS agent offlines a core and
> yet, that offlined core still reports MCEs. Something's terribly wrong
> with that picture, IMO.
Agreed. It would be more pleasant if we had some way to *really* offline a cpu,
including telling the rest of the system not to send it any more broadcast events
like MCE, SMI. But the h/w guys like to give the s/w guys job security by making
these corner cases that we have to work around in s/w :-)
-Tony
????{.n?+???????+%?????ݶ??w??{.n?+????{??G?????{ay?ʇڙ?,j??f???h?????????z_??(?階?ݢj"???m??????G????????????&???~???iO???z??v?^?m????????????I?
On Fri, Dec 04, 2015 at 05:23:18PM +0000, Luck, Tony wrote:
> > Franky, I'm not sure at all and very very wary of adding *any* code
> > which runs on an offlined CPU. Because *no one* does that and it hasn't
> > been tested at all. So who knows what happens.
> >
> > What we should be doing is execute the *minimal* amount of code possible
> > and get out. No counting, no per-cpu variables. No nothing.
>
> The minimal code requires we use:
>
> smp_processor_id() [to get our cpu number]
> cpu_is_offline() [to find out the cpu is offline]
>
> The first of those looks more dangerous in that it accesses a per-cpu variable.
>
> I don't think we need to be totally paranoid here. We know that the offline cpus
> were once online and went through normal kernel initialization code (if they didn't,
> then we can't possibly be executing this code ... their CR4.MCE bit would be zero so their
> response to a machine check would have been to reset the system).
I don't mean that - I mean the stuff we do before we call
cpu_is_offline() like ist_enter, this_cpu_inc(mce_exception_count),
etc. Then we do a whole another bunch of stuff at the "out:" label like
printk and whatnot which shouldn't run on an offlined CPU.
I.e., the check whether a CPU is offline should be the first thing we do
in do_machine_check and get the hell out if so.
> Agreed. It would be more pleasant if we had some way to *really* offline a cpu,
> including telling the rest of the system not to send it any more broadcast events
> like MCE, SMI. But the h/w guys like to give the s/w guys job security by making
> these corner cases that we have to work around in s/w :-)
Mind you, this is unintentional from the hw guys. But ha(!), I know
*exactly* what you mean.
:-)
--
Regards/Gruss,
Boris.
ECO tip #101: Trim your mails when you reply.
> I don't mean that - I mean the stuff we do before we call
> cpu_is_offline() like ist_enter, this_cpu_inc(mce_exception_count),
> etc. Then we do a whole another bunch of stuff at the "out:" label like
> printk and whatnot which shouldn't run on an offlined CPU.
ist_enter() is black magic to me. Andy? Would you be worried about executing
ist_{enter,exit}() on a cpu that was once online, but is currently marked offline
by Linux?
Bumping mce_exception_count doesn't look like a big deal either way. It is visible in
/proc/interrupts so I'd like to keep that honest (if the cpu comes back online again).
But we could do the offline check before this.
There will be no printk() executed in the tail of the function. after we clear MCG_STATUS
at the (new location of) the out: label we will see recover_paddr is still ~0ull and "goto done".
-Tony
????{.n?+???????+%?????ݶ??w??{.n?+????{??G?????{ay?ʇڙ?,j??f???h?????????z_??(?階?ݢj"???m??????G????????????&???~???iO???z??v?^?m????????????I?
On Fri, Dec 04, 2015 at 05:53:33PM +0000, Luck, Tony wrote:
> > I don't mean that - I mean the stuff we do before we call
> > cpu_is_offline() like ist_enter, this_cpu_inc(mce_exception_count),
> > etc. Then we do a whole another bunch of stuff at the "out:" label like
> > printk and whatnot which shouldn't run on an offlined CPU.
>
> ist_enter() is black magic to me. Andy? Would you be worried about executing
> ist_{enter,exit}() on a cpu that was once online, but is currently marked offline
> by Linux?
ist_enter() is context tracking functionality.
> Bumping mce_exception_count doesn't look like a big deal either way. It is visible in
> /proc/interrupts so I'd like to keep that honest (if the cpu comes back online again).
> But we could do the offline check before this.
>
> There will be no printk() executed in the tail of the function. after we clear MCG_STATUS
> at the (new location of) the out: label we will see recover_paddr is still ~0ull and "goto done".
Whether it is kosher or not is beside the point. Why should an offlined
CPU even noodle through all that code if it doesn't need/have to? It can
return immediately instead.
--
Regards/Gruss,
Boris.
ECO tip #101: Trim your mails when you reply.
> Whether it is kosher or not is beside the point. Why should an offlined
> CPU even noodle through all that code if it doesn't need/have to? It can
> return immediately instead.
Ashok wants to move in stage 2 to having the offline cpu scan banks and report
any errors seen there. To do that we'll have to run through a fair bit of the
do_machine_check() code.
But ... if you want a super safe version to put the stable tag on ... we could
just have something like this at the head of do_machine_check()
int cpu = smp_processor_id();
if (cpu_is_offline(cpu)) {
rdmsr(MCG_STATUS);
if (RIPV bit set) {
wrmsr(MCG_STATUS, 0);
return;
}
// can we do anything here? Offline cpu has no place to return to.
// There are no good answers ... falling into the regular code is
// what we did historically
}
-Tony
????{.n?+???????+%?????ݶ??w??{.n?+????{??G?????{ay?ʇڙ?,j??f???h?????????z_??(?階?ݢj"???m??????G????????????&???~???iO???z??v?^?m????????????I?
On Fri, Dec 04, 2015 at 06:30:39PM +0000, Luck, Tony wrote:
> Ashok wants to move in stage 2 to having the offline cpu scan banks and report
> any errors seen there. To do that we'll have to run through a fair bit of the
> do_machine_check() code.
I'm still very sceptical whether logging those errors are worth the risk
of running code on an offlined CPU.
> But ... if you want a super safe version to put the stable tag on ... we could
> just have something like this at the head of do_machine_check()
It would be prudent IMO. We don't want to disrupt stable kernels
unnecessarily.
--
Regards/Gruss,
Boris.
ECO tip #101: Trim your mails when you reply.
On Fri, Dec 4, 2015 at 9:53 AM, Luck, Tony <[email protected]> wrote:
>> I don't mean that - I mean the stuff we do before we call
>> cpu_is_offline() like ist_enter, this_cpu_inc(mce_exception_count),
>> etc. Then we do a whole another bunch of stuff at the "out:" label like
>> printk and whatnot which shouldn't run on an offlined CPU.
>
> ist_enter() is black magic to me. Andy? Would you be worried about executing
> ist_{enter,exit}() on a cpu that was once online, but is currently marked offline
> by Linux?
Offline CPUs are black magic to me. But as long as the CPU works the
way that the normal specs say it should, then ist_enter is fair game.
In any event, if context tracking blows up on an offline CPU, I'd
argue that's a context tracking bug and needs to be fixed.
But maybe offlined CPUs are supposed to have all interrupts off
(including MCE?) and the argument goes the other way? Dunno.
--Andy
On Fri, Dec 4, 2015 at 4:08 PM, Raj, Ashok <[email protected]> wrote:
> On Fri, Dec 04, 2015 at 02:34:52PM -0800, Andy Lutomirski wrote:
>> On Fri, Dec 4, 2015 at 9:53 AM, Luck, Tony <[email protected]> wrote:
>> > ist_enter() is black magic to me. Andy? Would you be worried about executing
>> > ist_{enter,exit}() on a cpu that was once online, but is currently marked offline
>> > by Linux?
>>
>> Offline CPUs are black magic to me. But as long as the CPU works the
>> way that the normal specs say it should, then ist_enter is fair game.
>> In any event, if context tracking blows up on an offline CPU, I'd
>> argue that's a context tracking bug and needs to be fixed.
>>
>> But maybe offlined CPUs are supposed to have all interrupts off
>> (including MCE?) and the argument goes the other way? Dunno.
>
> MCE's are broadcast by the hardware and cannot be blocked. Offline
> is only a Linux specific state. Now if the offline was a result of an ACPI
> event (eject) that triggered the CPU removal (offline in Linux, as it would
> have in a platform that supports true hotplug) then the platform would
> remove this cpu from the broadcast list.
>
> if kernel were to set CR4.MCE=0 that would cause system shutdown when
> an MCE is broadcast and hits this cpu.
I meant "supposed" as in Linux might expect arch code to prevent the
CPU from receiving interrupts.
Anyway, I think that would be silly and we should just expect
ist_enter to work regardless of online state.
This does mean that if we plug in a new CPU and online it, then
there's a window before we set up percpu memory and enable CR4.MCE in
which an MCE on any CPU will kill the system, at least on hardware for
which MCE broadcast can't be turned off.
--Andy
On Fri, Dec 04, 2015 at 02:34:52PM -0800, Andy Lutomirski wrote:
> On Fri, Dec 4, 2015 at 9:53 AM, Luck, Tony <[email protected]> wrote:
> > ist_enter() is black magic to me. Andy? Would you be worried about executing
> > ist_{enter,exit}() on a cpu that was once online, but is currently marked offline
> > by Linux?
>
> Offline CPUs are black magic to me. But as long as the CPU works the
> way that the normal specs say it should, then ist_enter is fair game.
> In any event, if context tracking blows up on an offline CPU, I'd
> argue that's a context tracking bug and needs to be fixed.
>
> But maybe offlined CPUs are supposed to have all interrupts off
> (including MCE?) and the argument goes the other way? Dunno.
MCE's are broadcast by the hardware and cannot be blocked. Offline
is only a Linux specific state. Now if the offline was a result of an ACPI
event (eject) that triggered the CPU removal (offline in Linux, as it would
have in a platform that supports true hotplug) then the platform would
remove this cpu from the broadcast list.
if kernel were to set CR4.MCE=0 that would cause system shutdown when
an MCE is broadcast and hits this cpu.
Cheers,
Ashok