2024-05-28 06:55:12

by Tony W Wang-oc

Subject: [PATCH] x86/hpet: Read HPET directly if panic in progress

When the clocksource of the system is HPET, a CPU executing read_hpet()
may be interrupted by an exception whose handler ends up calling
panic(); this can lead to a read_hpet() dead loop:

CPU x                            CPU x
----                             ----
read_hpet()
 arch_spin_trylock(&hpet.lock)
[CPU x got the hpet.lock]        #MCE happened
                                 do_machine_check()
                                  mce_panic()
                                   panic()
                                    kmsg_dump()
                                     pstore_dump()
                                      pstore_record_init()
                                       ktime_get_real_fast_ns()
                                        read_hpet()
                                        [dead loops]

To avoid this dead loop, read the HPET directly if a panic is in progress.

Signed-off-by: Tony W Wang-oc <[email protected]>
---
arch/x86/kernel/hpet.c | 6 ++++++
1 file changed, 6 insertions(+)

diff --git a/arch/x86/kernel/hpet.c b/arch/x86/kernel/hpet.c
index c96ae8fee95e..ecadd0698d6a 100644
--- a/arch/x86/kernel/hpet.c
+++ b/arch/x86/kernel/hpet.c
@@ -804,6 +804,12 @@ static u64 read_hpet(struct clocksource *cs)
 	if (in_nmi())
 		return (u64)hpet_readl(HPET_COUNTER);
 
+	/*
+	 * Read HPET directly if panic in progress.
+	 */
+	if (unlikely(atomic_read(&panic_cpu) != PANIC_CPU_INVALID))
+		return (u64)hpet_readl(HPET_COUNTER);
+
 	/*
 	 * Read the current state of the lock and HPET value atomically.
 	 */
--
2.25.1



2024-05-28 14:19:05

by Dave Hansen

Subject: Re: [PATCH] x86/hpet: Read HPET directly if panic in progress

On 5/27/24 23:38, Tony W Wang-oc wrote:
...
> diff --git a/arch/x86/kernel/hpet.c b/arch/x86/kernel/hpet.c
> index c96ae8fee95e..ecadd0698d6a 100644
> --- a/arch/x86/kernel/hpet.c
> +++ b/arch/x86/kernel/hpet.c
> @@ -804,6 +804,12 @@ static u64 read_hpet(struct clocksource *cs)
> if (in_nmi())
> return (u64)hpet_readl(HPET_COUNTER);
>
> + /*
> + * Read HPET directly if panic in progress.
> + */
> + if (unlikely(atomic_read(&panic_cpu) != PANIC_CPU_INVALID))
> + return (u64)hpet_readl(HPET_COUNTER);
> +

There is literally one other piece of the code in the kernel doing
something similar: the printk() implementation. There's no other
clocksource or timekeeping code that does this on any architecture.

Why doesn't this problem apply to any other clock sources?

Why should the problem be fixed in the clock sources themselves? Why
doesn't printk() deadlock on systems using the HPET?

In other words, I think we should fix pstore to be more like printk
rather than hacking around this in each clock source.
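
For reference, the check printk carries today looks roughly like this
(a from-memory sketch of kernel/printk/printk.c, not a verbatim copy):

	/* Sketch: printk's existing "is a panic underway?" helper. */
	static bool panic_in_progress(void)
	{
		return unlikely(atomic_read(&panic_cpu) != PANIC_CPU_INVALID);
	}

printk uses that in a handful of places, e.g. to avoid spinning on the
console lock once a panic is underway.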

2024-05-28 22:13:15

by Thomas Gleixner

Subject: Re: [PATCH] x86/hpet: Read HPET directly if panic in progress

On Tue, May 28 2024 at 07:18, Dave Hansen wrote:
> On 5/27/24 23:38, Tony W Wang-oc wrote:
> ...
>> diff --git a/arch/x86/kernel/hpet.c b/arch/x86/kernel/hpet.c
>> index c96ae8fee95e..ecadd0698d6a 100644
>> --- a/arch/x86/kernel/hpet.c
>> +++ b/arch/x86/kernel/hpet.c
>> @@ -804,6 +804,12 @@ static u64 read_hpet(struct clocksource *cs)
>> if (in_nmi())
>> return (u64)hpet_readl(HPET_COUNTER);
>>
>> + /*
>> + * Read HPET directly if panic in progress.
>> + */
>> + if (unlikely(atomic_read(&panic_cpu) != PANIC_CPU_INVALID))
>> + return (u64)hpet_readl(HPET_COUNTER);
>> +
>
> There is literally one other piece of the code in the kernel doing
> something similar: the printk() implementation. There's no other
> clocksource or timekeeping code that does this on any architecture.
>
> Why doesn't this problem apply to any other clock sources?

In principle it applies to any clocksource which needs a spinlock to
serialize access. HPET is not the only insanity here.

Think about i8253 :)

Most real clocksources, like TSC and the majority of the preferred clock
sources on other architectures are perfectly fine. They just read and be
done.

> Why should the problem be fixed in the clock sources themselves? Why
> doesn't printk() deadlock on systems using the HPET?

Because regular printk()s are deferred to irq work when in NMI and
similar contexts, but that obviously does not apply to panic
situations. Also NMI is treated special even in the HPET code. See
below.
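
For reference, the printk entry point does roughly this (simplified,
from memory, kernel/printk/printk_safe.c):

	asmlinkage int vprintk(const char *fmt, va_list args)
	{
		/*
		 * In NMI and other printk-unsafe contexts the message is
		 * only stored and later flushed via irq_work, instead of
		 * taking console locks right here.
		 */
		if (this_cpu_read(printk_context) || in_nmi())
			return vprintk_deferred(fmt, args);

		return vprintk_default(fmt, args);
	}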

> In other words, I think we should fix pstore to be more like printk
> rather than hacking around this in each clock source.

pstore is perfectly fine. It uses a NMI safe time accessor function
which is then tripping over the HPET lock. That's really a HPET specific
problem.

Though what I read out of the changelog is that the MCE hits the same
CPU 'x' which holds the lock. But that's fairy tale material as you can
see in the patch above:

        if (in_nmi())
                return (u64)hpet_readl(HPET_COUNTER);

For that particular case the dead lock, which would actually be a live
lock, cannot happen because in kernel MCEs are NMI class exceptions and
therefore in_nmi() evaluates to true and that new voodoo can't be
reached at all.

Now there are two other scenarios which really can make that happen:

 1) A non-NMI class exception within the lock held region

    CPU A
    acquire(hpet_lock);
    ...                 <- #PF, #GP, #DE, ... -> panic()

    If any of that happens within that lock held section then the live
    lock on the hpet_lock is the least of your worries. Seriously, I
    don't care about this at all.

 2) The actual scenario is:

    CPU A                       CPU B
    lock(hpet_lock)
                                MCE hits user space
                                ...
                                exc_machine_check_user()
                                  irqentry_enter_from_user_mode(regs);

    irqentry_enter_from_user_mode() obviously does not mark the
    exception as NMI class, so in_nmi() evaluates to false. That would
    actually dead lock if CPU A is not making progress and releases
    hpet_lock.

    Sounds unlikely to happen, right? But in reality it can because of
    MCE broadcast. Assume that both CPUs go into MCE:

    CPU A                       CPU B
    lock(hpet_lock)
                                exc_machine_check_user()
                                  irqentry_enter_from_user_mode();
    exc_machine_check_kernel()    do_machine_check()
      irqentry_nmi_enter();         mce_panic()
      do_machine_check()            if (atomic_inc_return(&mce_panicked) > 1)
        mce_panic()                     wait_for_panic(); <- Not taken

        if (atomic_inc_return(&mce_panicked) > 1)
            wait_for_panic(); <- Taken

                                    ....
                                    hpet_read()

    -> Dead lock because in_nmi() evaluates to false on CPU B and CPU A
       obviously can't release the lock.
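
To spell out why CPU B never gets out of there: the non-NMI path of
read_hpet() either takes the lock or spins until the holder publishes a
new counter value. Roughly (simplified from arch/x86/kernel/hpet.c, irq
disabling omitted):

	old.lockval = READ_ONCE(hpet.lockval);

	if (!arch_spin_is_locked(&old.lock) && arch_spin_trylock(&hpet.lock)) {
		new.value = hpet_readl(HPET_COUNTER);
		WRITE_ONCE(hpet.value, new.value);
		arch_spin_unlock(&hpet.lock);
		return (u64)new.value;
	}

	/*
	 * Contended: wait for the lock holder to store a new value or to
	 * drop the lock. With CPU A parked in panic() this never ends.
	 */
	do {
		cpu_relax();
		new.lockval = READ_ONCE(hpet.lockval);
	} while ((new.value == old.value) && arch_spin_is_locked(&new.lock));

	return (u64)new.value;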

So the proposed patch makes sense to some extent. But it only cures the
symptom. The real underlying questions are:

1) Should we provide a panic mode read callback for clocksources which
are affected by this?

2) Is it correct to claim that a MCE which hits user space and ends up in
mce_panic() is still just a regular exception or should we upgrade to
NMI class context when we enter mce_panic() or even go as far to
upgrade to NMI class context for any panic() invocation?

#1 Solves it at the clocksource level. It still needs HPET specific
changes.

#2 Solves a whole class of issues

... while potentially introducing new ones :)

To me upgrading any panic() invocation to NMI class context makes a
lot of sense because in that case all bets are off.

in_nmi() is used in quite some places to avoid such problems. IOW,
that would kill a whole class of issues instead of "curing" the HPET
problem locally for the price of an extra conditional. Not that the
extra conditional matters much if HPET is the clocksource as that's
awfully slow anyway and I really don't care about that.

But I very much care about avoiding to sprinkle panic_cpu checks all
over the place.
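
If we went that route, the core of it would be something like the below
completely untested sketch (the helper name is made up; nmi_enter() does
the same preempt count accounting for real NMIs):

	#include <linux/preempt.h>

	/* Untested sketch: make in_nmi() return true on the panicking CPU. */
	static void panic_upgrade_to_nmi_context(void)
	{
		__preempt_count_add(NMI_OFFSET);
	}

panic() would invoke that early, before kmsg_dump() and friends, so that
every in_nmi() check downstream takes its lockless emergency path.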

Thanks,

tglx

2024-05-28 23:22:32

by Linus Torvalds

Subject: Re: [PATCH] x86/hpet: Read HPET directly if panic in progress

On Tue, 28 May 2024 at 15:12, Thomas Gleixner <[email protected]> wrote:
>
> In principle it applies to any clocksource which needs a spinlock to
> serialize access. HPET is not the only insanity here.

HPET may be the main / only one we care about.

Because:

> Think about i8253 :)

I see the smiley, but yeah, I don't think we really care about it.

> 1) Should we provide a panic mode read callback for clocksources which
> are affected by this?

The current patch under discussion may be ugly, but looks workable.
Local ugliness isn't necessarily a show-stopper.

So if the HPET is the *only* case which has this situation, I vote for
just doing the ugly thing.

Now, if *other* cases exist, and can't be worked around in similar
ways, then that argues for a more "proper" fix.

And no, I don't think i8253 is a strong enough argument. I don't
actually believe you can realistically find a machine that doesn't
have HPET or the TSC and really falls back on the i8253 any more. And
if you *do* find hw like that, is it SMP-capable? And can you find
somebody who cares?

> 2) Is it correct to claim that a MCE which hits user space and ends up in
> mce_panic() is still just a regular exception or should we upgrade to
> NMI class context when we enter mce_panic() or even go as far to
> upgrade to NMI class context for any panic() invocation?

I do think that an NMI in user space should be considered mostly just
a normal exception. From a kernel perspective, the NMI'ness just
doesn't matter.

That said, I find your suggestion of making 'panic()' just basically
act as an NMI context intriguing. And cleaner than the
atomic_read(&panic_cpu) thing.

Are there any other situations than this odd HPET thing where that
would change semantics?

Linus

2024-05-29 04:39:35

by Tony W Wang-oc

Subject: Re: [PATCH] x86/hpet: Read HPET directly if panic in progress



On 2024/5/29 06:12, Thomas Gleixner wrote:
> On Tue, May 28 2024 at 07:18, Dave Hansen wrote:
>> On 5/27/24 23:38, Tony W Wang-oc wrote:
>> ...
>>> diff --git a/arch/x86/kernel/hpet.c b/arch/x86/kernel/hpet.c
>>> index c96ae8fee95e..ecadd0698d6a 100644
>>> --- a/arch/x86/kernel/hpet.c
>>> +++ b/arch/x86/kernel/hpet.c
>>> @@ -804,6 +804,12 @@ static u64 read_hpet(struct clocksource *cs)
>>> if (in_nmi())
>>> return (u64)hpet_readl(HPET_COUNTER);
>>>
>>> + /*
>>> + * Read HPET directly if panic in progress.
>>> + */
>>> + if (unlikely(atomic_read(&panic_cpu) != PANIC_CPU_INVALID))
>>> + return (u64)hpet_readl(HPET_COUNTER);
>>> +
>>
>> There is literally one other piece of the code in the kernel doing
>> something similar: the printk() implementation. There's no other
>> clocksource or timekeeping code that does this on any architecture.
>>
>> Why doesn't this problem apply to any other clock sources?
>
> In principle it applies to any clocksource which needs a spinlock to
> serialize access. HPET is not the only insanity here.
>
> Think about i8253 :)
>
> Most real clocksources, like TSC and the majority of the preferred clock
> sources on other architectures are perfectly fine. They just read and be
> done.
>
>> Why should the problem be fixed in the clock sources themselves? Why
>> doesn't printk() deadlock on systems using the HPET?
>
> Because regular printk()s are deferred to irq work when in NMI and
> similar contexts, but that obviously does not apply to panic
> situations. Also NMI is treated special even in the HPET code. See
> below.
>
>> In other words, I think we should fix pstore to be more like printk
>> rather than hacking around this in each clock source.
>
> pstore is perfectly fine. It uses a NMI safe time accessor function
> which is then tripping over the HPET lock. That's really a HPET specific
> problem.
>
> Though what I read out of the changelog is that the MCE hits the same
> CPU 'x' which holds the lock. But that's fairy tale material as you can
> see in the patch above:
>
> if (in_nmi())
> return (u64)hpet_readl(HPET_COUNTER);
>
> For that particular case the dead lock, which would actually be a live
> lock, cannot happen because in kernel MCEs are NMI class exceptions and
> therefore in_nmi() evaluates to true and that new voodoo can't be
> reached at all.
>
> Now there are two other scenarios which really can make that happen:
>
> 1) A non-NMI class exception within the lock held region
>
> CPU A
> acquire(hpet_lock);
> ... <- #PF, #GP, #DE, ... -> panic()
>
> If any of that happens within that lock held section then the live
> lock on the hpet_lock is the least of your worries. Seriously, I
> don't care about this at all.
>
> 2) The actual scenario is:
>
> CPU A CPU B
> lock(hpet_lock)
> MCE hits user space
> ...
> exc_machine_check_user()
> irqentry_enter_from_user_mode(regs);
>
> irqentry_enter_from_user_mode() obviously does not mark the
> exception as NMI class, so in_nmi() evaluates to false. That would
> actually dead lock if CPU A is not making progress and releases
> hpet_lock.
>
> Sounds unlikely to happen, right? But in reality it can because of
> MCE broadcast. Assume that both CPUs go into MCE:
>
> CPU A CPU B
> lock(hpet_lock)
> exc_machine_check_user()
> irqentry_enter_from_user_mode();
> exc_machine_check_kernel() do_machine_check()
> irqentry_nmi_enter(); mce_panic()
> do_machine_check() if (atomic_inc_return(&mce_panicked) > 1)
> mce_panic() wait_for_panic(); <- Not taken
>
> if (atomic_inc_return(&mce_panicked) > 1)
> wait_for_panic(); <- Taken
>
> ....
> hpet_read()
>
> -> Dead lock because in_nmi() evaluates to false on CPU B and CPU A
> obviously can't release the lock.
>

Because the MCE handler calls printk() before calling panic(), a
printk() deadlock may happen in this scenario:

CPU A                            CPU B
printk()
  lock(console_owner_lock)
                                 exc_machine_check_user()
                                   irqentry_enter_from_user_mode()
exc_machine_check_kernel()         do_machine_check()
  irqentry_nmi_enter()               mce_panic()
  do_machine_check()                 printk_mce()  #A
    mce_panic()                      ...
      wait_for_panic()               panic()

A printk() deadlock will happen at #A because in_nmi() evaluates to
false on CPU B and CPU B has not yet entered panic() at #A.

Is upgrading the user space MCE handler to NMI class context the
preferred fix?
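
Roughly what I mean, as a completely untested sketch of
arch/x86/kernel/cpu/mce/core.c (just to illustrate the question):

	static __always_inline void exc_machine_check_user(struct pt_regs *regs)
	{
		irqentry_state_t irq_state;

		irqentry_enter_from_user_mode(regs);

		/*
		 * Sketch of the idea: additionally mark the handler as NMI
		 * class so that in_nmi() users (printk, read_hpet, ...)
		 * take their lockless paths.
		 */
		irq_state = irqentry_nmi_enter(regs);
		do_machine_check(regs);
		irqentry_nmi_exit(regs, irq_state);

		irqentry_exit_to_user_mode(regs);
	}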

Sincerely
TonyWWang-oc

> So the proposed patch makes sense to some extent. But it only cures the
> symptom. The real underlying questions are:
>
> 1) Should we provide a panic mode read callback for clocksources which
> are affected by this?
>
> 2) Is it correct to claim that a MCE which hits user space and ends up in
> mce_panic() is still just a regular exception or should we upgrade to
> NMI class context when we enter mce_panic() or even go as far to
> upgrade to NMI class context for any panic() invocation?
>
> #1 Solves it at the clocksource level. It still needs HPET specific
> changes.
>
> #2 Solves a whole class of issues
>
> ... while potentially introducing new ones :)
>
> To me upgrading any panic() invocation to NMI class context makes a
> lot of sense because in that case all bets are off.
>
> in_nmi() is used in quite some places to avoid such problems. IOW,
> that would kill a whole class of issues instead of "curing" the HPET
> problem locally for the price of an extra conditional. Not that the
> extra conditional matters much if HPET is the clocksource as that's
> awfully slow anyway and I really don't care about that.
>
> But I very much care about avoiding to sprinkle panic_cpu checks all
> over the place.
>
> Thanks,
>
> tglx

2024-05-29 06:46:16

by Tony W Wang-oc

Subject: Re: [PATCH] x86/hpet: Read HPET directly if panic in progress



On 2024/5/29 12:39, Tony W Wang-oc wrote:
>
>
> On 2024/5/29 06:12, Thomas Gleixner wrote:
>> On Tue, May 28 2024 at 07:18, Dave Hansen wrote:
>>> On 5/27/24 23:38, Tony W Wang-oc wrote:
>>> ...
>>>> diff --git a/arch/x86/kernel/hpet.c b/arch/x86/kernel/hpet.c
>>>> index c96ae8fee95e..ecadd0698d6a 100644
>>>> --- a/arch/x86/kernel/hpet.c
>>>> +++ b/arch/x86/kernel/hpet.c
>>>> @@ -804,6 +804,12 @@ static u64 read_hpet(struct clocksource *cs)
>>>>       if (in_nmi())
>>>>               return (u64)hpet_readl(HPET_COUNTER);
>>>>
>>>> +    /*
>>>> +     * Read HPET directly if panic in progress.
>>>> +     */
>>>> +    if (unlikely(atomic_read(&panic_cpu) != PANIC_CPU_INVALID))
>>>> +            return (u64)hpet_readl(HPET_COUNTER);
>>>> +
>>>
>>> There is literally one other piece of the code in the kernel doing
>>> something similar: the printk() implementation.  There's no other
>>> clocksource or timekeeping code that does this on any architecture.
>>>
>>> Why doesn't this problem apply to any other clock sources?
>>
>> In principle it applies to any clocksource which needs a spinlock to
>> serialize access. HPET is not the only insanity here.
>>
>> Think about i8253 :)
>>
>> Most real clocksources, like TSC and the majority of the preferred clock
>> sources on other architectures are perfectly fine. They just read and be
>> done.
>>
>>> Why should the problem be fixed in the clock sources themselves?  Why
>>> doesn't printk() deadlock on systems using the HPET?
>>
>> Because regular printk()s are deferred to irq work when in NMI and
>> similar contexts, but that obviously does not apply to panic
>> situations. Also NMI is treated special even in the HPET code. See
>> below.
>>
>>> In other words, I think we should fix pstore to be more like printk
>>> rather than hacking around this in each clock source.
>>
>> pstore is perfectly fine. It uses a NMI safe time accessor function
>> which is then tripping over the HPET lock. That's really a HPET specific
>> problem.
>>
>> Though what I read out of the changelog is that the MCE hits the same
>> CPU 'x' which holds the lock. But that's fairy tale material as you can
>> see in the patch above:
>>
>>          if (in_nmi())
>>                  return (u64)hpet_readl(HPET_COUNTER);
>>
>> For that particular case the dead lock, which would actually be a live
>> lock, cannot happen because in kernel MCEs are NMI class exceptions and
>> therefore in_nmi() evaluates to true and that new voodoo can't be
>> reached at all.
>>
>> Now there are two other scenarios which really can make that happen:
>>
>>   1) A non-NMI class exception within the lock held region
>>
>>      CPU A
>>      acquire(hpet_lock);
>>      ...                 <- #PF, #GP, #DE, ... -> panic()
>>
>>      If any of that happens within that lock held section then the live
>>      lock on the hpet_lock is the least of your worries. Seriously, I
>>      don't care about this at all.
>>

Actually, this scenario is what this patch is trying to solve.

We encountered the hpet_lock deadlock from the call path of the MCE
handler, and this hpet_lock deadlock scenario may also happen when other
exception handlers, like #PF/#GP..., call panic(). So read_hpet() should
avoid the deadlock if a panic is in progress.

Sincerely
TonyWWang-oc

>>   2) The actual scenario is:
>>
>>      CPU A                       CPU B
>>      lock(hpet_lock)
>>                                  MCE hits user space
>>                                  ...
>>                                  exc_machine_check_user()
>>                                    irqentry_enter_from_user_mode(regs);
>>
>>      irqentry_enter_from_user_mode() obviously does not mark the
>>      exception as NMI class, so in_nmi() evaluates to false. That would
>>      actually dead lock if CPU A is not making progress and releases
>>      hpet_lock.
>>
>>      Sounds unlikely to happen, right? But in reality it can because of
>>      MCE broadcast. Assume that both CPUs go into MCE:
>>
>>      CPU A                       CPU B
>>      lock(hpet_lock)
>>                                  exc_machine_check_user()
>>                                    irqentry_enter_from_user_mode();
>>      exc_machine_check_kernel()    do_machine_check()
>>        irqentry_nmi_enter();         mce_panic()
>>        do_machine_check()            if
>> (atomic_inc_return(&mce_panicked) > 1)
>>          mce_panic()                     wait_for_panic(); <- Not taken
>>
>>          if (atomic_inc_return(&mce_panicked) > 1)
>>              wait_for_panic(); <- Taken
>>
>>                                      ....
>>                                      hpet_read()
>>
>>      -> Dead lock because in_nmi() evaluates to false on CPU B and CPU A
>>         obviously can't release the lock.
>>
>
> Because the MCE handler calls printk() before calling panic(), a
> printk() deadlock may happen in this scenario:
>
> CPU A                            CPU B
> printk()
>   lock(console_owner_lock)
>                                  exc_machine_check_user()
>                                    irqentry_enter_from_user_mode()
> exc_machine_check_kernel()         do_machine_check()
>   irqentry_nmi_enter()               mce_panic()
>   do_machine_check()                 printk_mce()  #A
>     mce_panic()                      ...
>       wait_for_panic()               panic()
>
> A printk() deadlock will happen at #A because in_nmi() evaluates to
> false on CPU B and CPU B has not yet entered panic() at #A.
>
> Is upgrading the user space MCE handler to NMI class context the
> preferred fix?
>
> Sincerely
> TonyWWang-oc
>
>> So the proposed patch makes sense to some extent. But it only cures the
>> symptom. The real underlying questions are:
>>
>>    1) Should we provide a panic mode read callback for clocksources which
>>       are affected by this?
>>
>>    2) Is it correct to claim that a MCE which hits user space and ends
>> up in
>>       mce_panic() is still just a regular exception or should we
>> upgrade to
>>       NMI class context when we enter mce_panic() or even go as far to
>>       upgrade to NMI class context for any panic() invocation?
>>
>> #1 Solves it at the clocksource level. It still needs HPET specific
>>     changes.
>>
>> #2 Solves a whole class of issues
>>
>>     ... while potentially introducing new ones :)
>>
>>     To me upgrading any panic() invocation to NMI class context makes a
>>     lot of sense because in that case all bets are off.
>>
>>     in_nmi() is used in quite some places to avoid such problems. IOW,
>>     that would kill a whole class of issues instead of "curing" the HPET
>>     problem locally for the price of an extra conditional. Not that the
>>     extra conditional matters much if HPET is the clocksource as that's
>>     awfully slow anyway and I really don't care about that.
>>
>>     But I very much care about avoiding to sprinkle panic_cpu checks all
>>     over the place.
>>
>> Thanks,
>>
>>          tglx

2024-05-29 07:44:47

by Thomas Gleixner

Subject: Re: [PATCH] x86/hpet: Read HPET directly if panic in progress

On Wed, May 29 2024 at 12:39, Tony W Wang-oc wrote:
> A printk() deadlock will happen at #A because in_nmi() evaluates to
> false on CPU B and CPU B has not yet entered panic() at #A.
>
> Is upgrading the user space MCE handler to NMI class context the
> preferred fix?

No.

2024-05-29 07:45:34

by Thomas Gleixner

Subject: Re: [PATCH] x86/hpet: Read HPET directly if panic in progress

On Wed, May 29 2024 at 14:44, Tony W Wang-oc wrote:
> Actually, this scenario is what this patch is trying to solve.
>
> We encountered the hpet_lock deadlock from the call path of the MCE
> handler, and this hpet_lock deadlock scenario may also happen when other
> exception handlers, like #PF/#GP..., call panic(). So read_hpet() should
> avoid the deadlock if a panic is in progress.

That's not what your changelog says.

2024-05-29 07:51:05

by Thomas Gleixner

Subject: Re: [PATCH] x86/hpet: Read HPET directly if panic in progress

Linus!

On Tue, May 28 2024 at 16:22, Linus Torvalds wrote:
> On Tue, 28 May 2024 at 15:12, Thomas Gleixner <[email protected]> wrote:
> I see the smiley, but yeah, I don't think we really care about it.

Indeed. But the same problem exists on other architectures as
well. drivers/clocksource alone has 4 examples aside of i8253

>> 1) Should we provide a panic mode read callback for clocksources which
>> are affected by this?
>
> The current patch under discussion may be ugly, but looks workable.
> Local ugliness isn't necessarily a show-stopper.
>
> So if the HPET is the *only* case which has this situation, I vote for
> just doing the ugly thing.
>
> Now, if *other* cases exist, and can't be worked around in similar
> ways, then that argues for a more "proper" fix.
>
> And no, I don't think i8253 is a strong enough argument. I don't
> actually believe you can realistically find a machine that doesn't
> have HPET or the TSC and really falls back on the i8253 any more. And
> if you *do* find hw like that, is it SMP-capable? And can you find
> somebody who cares?

Probably not.

>> 2) Is it correct to claim that a MCE which hits user space and ends up in
>> mce_panic() is still just a regular exception or should we upgrade to
>> NMI class context when we enter mce_panic() or even go as far to
>> upgrade to NMI class context for any panic() invocation?
>
> I do think that an NMI in user space should be considered mostly just
> a normal exception. From a kernel perspective, the NMI'ness just
> doesn't matter.

That's correct. I don't want to change that at all especially not for
recoverable MCEs.

> That said, I find your suggestion of making 'panic()' just basically
> act as an NMI context intriguing. And cleaner than the
> atomic_read(&panic_cpu) thing.
>
> Are there any other situations than this odd HPET thing where that
> would change semantics?

I need to go and stare at this some more.

Thanks,

tglx

2024-05-29 08:21:43

by Tony W Wang-oc

Subject: Re: [PATCH] x86/hpet: Read HPET directly if panic in progress



On 2024/5/29 15:45, Thomas Gleixner wrote:
> On Wed, May 29 2024 at 14:44, Tony W Wang-oc wrote:
>> Actually, this scenario is what this patch is trying to solve.
>>
>> We encountered the hpet_lock deadlock from the call path of the MCE
>> handler, and this hpet_lock deadlock scenario may also happen when other
>> exception handlers, like #PF/#GP..., call panic(). So read_hpet() should
>> avoid the deadlock if a panic is in progress.
>
> That's not what your changelog says.

Yes.
The example flow I gave with the MCE handler in my changelog is misleading.

2024-06-05 06:24:06

by Tony W Wang-oc

Subject: Re: [PATCH] x86/hpet: Read HPET directly if panic in progress



On 2024/5/29 15:42, Thomas Gleixner wrote:
> Linus!
>
> On Tue, May 28 2024 at 16:22, Linus Torvalds wrote:
>> On Tue, 28 May 2024 at 15:12, Thomas Gleixner <[email protected]> wrote:
>> I see the smiley, but yeah, I don't think we really care about it.
>
> Indeed. But the same problem exists on other architectures as
> well. drivers/clocksource alone has 4 examples aside of i8253
>
>>> 1) Should we provide a panic mode read callback for clocksources which
>>> are affected by this?
>>
>> The current patch under discussion may be ugly, but looks workable.
>> Local ugliness isn't necessarily a show-stopper.
>>
>> So if the HPET is the *only* case which has this situation, I vote for
>> just doing the ugly thing.
>>
>> Now, if *other* cases exist, and can't be worked around in similar
>> ways, then that argues for a more "proper" fix.
>>
>> And no, I don't think i8253 is a strong enough argument. I don't
>> actually believe you can realistically find a machine that doesn't
>> have HPET or the TSC and really falls back on the i8253 any more. And
>> if you *do* find hw like that, is it SMP-capable? And can you find
>> somebody who cares?
>
> Probably not.
>
>>> 2) Is it correct to claim that a MCE which hits user space and ends up in
>>> mce_panic() is still just a regular exception or should we upgrade to
>>> NMI class context when we enter mce_panic() or even go as far to
>>> upgrade to NMI class context for any panic() invocation?
>>

After an MCE has occurred, it is possible for the MCE handler to execute
the add_taint() function without panicking, for example when fake_panic
is configured.

So the above patch method does not seem to be able to cover the printk
deadlock caused by the add_taint() function in the MCE handler when an
MCE occurs in user space.

Sincerely
TonyWWang-oc

>> I do think that an NMI in user space should be considered mostly just
>> a normal exception. From a kernel perspective, the NMI'ness just
>> doesn't matter.
>
> That's correct. I don't want to change that at all especially not for
> recoverable MCEs.
>
>> That said, I find your suggestion of making 'panic()' just basically
>> act as an NMI context intriguing. And cleaner than the
>> atomic_read(&panic_cpu) thing.
>>
>> Are there any other situations than this odd HPET thing where that
>> would change semantics?
>
> I need to go and stare at this some more.
>
> Thanks,
>
> tglx

2024-06-05 09:40:19

by Borislav Petkov

Subject: Re: [PATCH] x86/hpet: Read HPET directly if panic in progress

On Wed, Jun 05, 2024 at 02:23:32PM +0800, Tony W Wang-oc wrote:
> After an MCE has occurred, it is possible for the MCE handler to execute
> the add_taint() function without panicking, for example when fake_panic
> is configured.

fake_panic is an ancient debugging leftover. If you set it, you get what
you deserve.

--
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette

2024-06-05 10:10:33

by Tony W Wang-oc

Subject: Re: [PATCH] x86/hpet: Read HPET directly if panic in progress



On 2024/6/5 17:36, Borislav Petkov wrote:
> On Wed, Jun 05, 2024 at 02:23:32PM +0800, Tony W Wang-oc wrote:
>> After an MCE has occurred, it is possible for the MCE handler to execute
>> the add_taint() function without panicking, for example when fake_panic
>> is configured.
>
> fake_panic is an ancient debugging leftover. If you set it, you get what
> you deserve.
>

It may also happen without setting fake_panic, for example when an MCE
error of the UCNA/SRAO type occurs?

Sincerely
TonyWWang-oc

> --
> Regards/Gruss,
> Boris.
>
> https://people.kernel.org/tglx/notes-about-netiquette

2024-06-05 11:34:05

by Borislav Petkov

Subject: Re: [PATCH] x86/hpet: Read HPET directly if panic in progress

On Wed, Jun 05, 2024 at 06:10:07PM +0800, Tony W Wang-oc wrote:
> It may also happen without setting fake_panic, for example when an MCE
> error of the UCNA/SRAO type occurs?

Which types exactly do you mean when you're looking at the severities[]
array in severity.c?

And what scenario are you talking about?

To get an #MC exception and detect only UCNA/SRAO errors? Can that even
happen on any hardware?

--
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette

2024-06-05 12:34:25

by Tony W Wang-oc

Subject: Re: [PATCH] x86/hpet: Read HPET directly if panic in progress



On 2024/6/5 19:33, Borislav Petkov wrote:
> On Wed, Jun 05, 2024 at 06:10:07PM +0800, Tony W Wang-oc wrote:
>> It may also happen without setting fake_panic, for example when an MCE
>> error of the UCNA/SRAO type occurs?
>
> Which types exactly do you mean when you're looking at the severities[]
> array in severity.c?
>
> And what scenario are you talking about?
>
> To get an #MC exception and detect only UCNA/SRAO errors? Can that even
> happen on any hardware?
>

Yes, I mean an #MC exception happened and only SRAO-type errors, like
the ones below, were detected:

MCESEV(
        AO, "Action optional: memory scrubbing error",
        SER, MASK(MCI_UC_AR|MCACOD_SCRUBMSK,
                  MCI_STATUS_UC|MCACOD_SCRUB)
        ),
MCESEV(
        AO, "Action optional: last level cache writeback error",
        SER, MASK(MCI_UC_AR|MCACOD, MCI_STATUS_UC|MCACOD_L3WB)
        ),

I think these errors are actually encountered on some platforms that
support reporting these types of errors via #MC.

Sincerely
TonyWWang-oc
> --
> Regards/Gruss,
> Boris.
>
> https://people.kernel.org/tglx/notes-about-netiquette

2024-06-05 15:52:05

by Luck, Tony

Subject: RE: [PATCH] x86/hpet: Read HPET directly if panic in progress

> > Which types exactly do you mean when you're looking at the severities[]
> > array in severity.c?
> >
> > And what scenario are you talking about?
> >
> > To get an #MC exception and detect only UCNA/SRAO errors? Can that even
> > happen on any hardware?
> >
>
> Yes, I mean an #MC exception happened and only SRAO-type errors, like
> the ones below, were detected:
>
> MCESEV(
> AO, "Action optional: memory scrubbing error",
> SER, MASK(MCI_UC_AR|MCACOD_SCRUBMSK,
> MCI_STATUS_UC|MCACOD_SCRUB)
> ),
> MCESEV(
> AO, "Action optional: last level cache writeback error",
> SER, MASK(MCI_UC_AR|MCACOD, MCI_STATUS_UC|MCACOD_L3WB)
> ),
>
> I think these errors are actually encountered on some platforms that
> support reporting these types of errors via #MC.

Intel servers from Nehalem through Cascade Lake reported memory controller
patrol scrub uncorrected error with #MC and SRAO signature.

Icelake and newer use CMCI with a UCNA signature.

-Tony

2024-06-06 08:45:18

by Tony W Wang-oc

Subject: Re: [PATCH] x86/hpet: Read HPET directly if panic in progress



On 2024/6/5 23:51, Luck, Tony wrote:
>>> Which types exactly do you mean when you're looking at the severities[]
>>> array in severity.c?
>>>
>>> And what scenario are you talking about?
>>>
>>> To get an #MC exception and detect only UCNA/SRAO errors? Can that even
>>> happen on any hardware?
>>>
>>
>> Yes, I mean an #MC exception happened and only SRAO-type errors, like
>> the ones below, were detected:
>>
>> MCESEV(
>> AO, "Action optional: memory scrubbing error",
>> SER, MASK(MCI_UC_AR|MCACOD_SCRUBMSK,
>> MCI_STATUS_UC|MCACOD_SCRUB)
>> ),
>> MCESEV(
>> AO, "Action optional: last level cache writeback error",
>> SER, MASK(MCI_UC_AR|MCACOD, MCI_STATUS_UC|MCACOD_L3WB)
>> ),
>>
>> I think these errors are actually encountered on some platforms that
>> support reporting these types of errors via #MC.
>
> Intel servers from Nehalem through Cascade Lake reported memory controller
> patrol scrub uncorrected error with #MC and SRAO signature.
>
> Icelake and newer use CMCI with a UCNA signature.
>

I have a question, does Intel use #MC to report UCNA errors?

Sincerely
TonyWWang-oc

2024-06-06 17:01:28

by Luck, Tony

Subject: RE: [PATCH] x86/hpet: Read HPET directly if panic in progress

>> Icelake and newer use CMCI with a UCNA signature.
>>
>
> I have a question, does Intel use #MC to report UCNA errors?

No. They are reported with CMCI[1] (assuming it is enabled by IA32_MCi_CTL2 bit 30).

-Tony

[1] Usage evolved and naming did not keep up. An "Uncorrected" error is being signaled
using "Corrected Machine Check Interrupt".