2019-01-10 08:05:41

by Zhenzhong Duan

Subject: Question about qspinlock nest

Hi Maintainer,


There is a question that has confused me for days; I'd appreciate an answer.

In the code below, the comment says we never have more than 4 nested contexts.

What happens if debug and MCE exceptions nest on top of those four, or do we
ensure that can never happen?


/*
 * Per-CPU queue node structures; we can never have more than 4 nested
 * contexts: task, softirq, hardirq, nmi.
 *
 * Exactly fits one 64-byte cacheline on a 64-bit architecture.
 *
 * PV doubles the storage and uses the second cacheline for PV state.
 */
static DEFINE_PER_CPU_ALIGNED(struct qnode, qnodes[MAX_NODES]);
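
For reference, the slowpath indexes into this array by per-CPU nesting depth;
my rough paraphrase of queued_spin_lock_slowpath() (details may be slightly off):

	node = this_cpu_ptr(&qnodes[0].mcs);
	idx  = node->count++;			/* nesting depth on this CPU */
	tail = encode_tail(smp_processor_id(), idx);

	node = grab_mcs_node(node, idx);	/* i.e. qnodes[idx] of this CPU */

So a fifth nested contended lock would index past the array, if I read it right.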

--
Thanks
Zhenzhong



2019-01-10 14:46:21

by Waiman Long

Subject: Re: Question about qspinlock nest

On 01/10/2019 03:02 AM, Zhenzhong Duan wrote:
> Hi Maintainer,
>
>
> There is a question confused me for days. Appreciate an answer.
>
> In below code, the comment says we never have more than 4 nested
> contexts.
>
> What happen if debug and mce exceptions nest with the four, or we
> ensure it never happen?
>
>
> /*
>  * Per-CPU queue node structures; we can never have more than 4 nested
>  * contexts: task, softirq, hardirq, nmi.
>  *
>  * Exactly fits one 64-byte cacheline on a 64-bit architecture.
>  *
>  * PV doubles the storage and uses the second cacheline for PV state.
>  */
> static DEFINE_PER_CPU_ALIGNED(struct qnode, qnodes[MAX_NODES]);
>
Yes, both debug and MCE exceptions are a kind of NMI, so theoretically it
is possible to have more than four. Are you aware of any debug or MCE
exception handlers that need to take a spinlock for synchronization?

Cheers,
Longman



2019-01-10 19:24:52

by Waiman Long

Subject: Re: Question about qspinlock nest

On 01/10/2019 01:25 PM, James Morse wrote:
> Hi Longman, Zhenzhong,
>
> On 10/01/2019 14:43, Waiman Long wrote:
>> On 01/10/2019 03:02 AM, Zhenzhong Duan wrote:
>>> There is a question confused me for days. Appreciate an answer.
>>>
>>> In below code, the comment says we never have more than 4 nested
>>> contexts.
>>>
>>> What happen if debug and mce exceptions nest with the four, or we
>>> ensure it never happen?
>>>
>>>
>>> /*
>>>  * Per-CPU queue node structures; we can never have more than 4 nested
>>>  * contexts: task, softirq, hardirq, nmi.
>>>  *
>>>  * Exactly fits one 64-byte cacheline on a 64-bit architecture.
>>>  *
>>>  * PV doubles the storage and uses the second cacheline for PV state.
>>>  */
>>> static DEFINE_PER_CPU_ALIGNED(struct qnode, qnodes[MAX_NODES]);
>> Yes, both debug and mce exceptions are some kind of NMIs. So
>> theoretically, it is possible to have more than four. Are you aware of
>> any debug and MCE exception handlers that need to take a spinlock for
>> synchronization?
> On arm64 if all the RAS and psuedo-NMI patches land, our worst-case interleaving
> jumps to at least 7. The culprit is APEI using spinlocks to protect fixmap slots.
>
> I have an RFC to bump the number of node bits from 2 to 3, but as this is APEI
> four times, it may be preferable to make it use something other than spinlocks.
>
>
> The worst-case order is below. Each one masks those before it:
> 1. process context
> 2. soft-irq
> 3. hard-irq
> 4. psuedo-nmi [0]
> - using the irqchip priorities to configure some IRQs as NMI.
> 5. SError [1]
> - a bit like an asynchronous MCE. ACPI allows this to convey CPER records,
> requiring an APEI call.
> 6&7. SDEI [2]
> - a firmware triggered software interrupt, only its two of them, either of
> which could convey CPER records.
> 8. Synchronous external abort

Nested spinlocks aren't a problem as long as no more than 4 of them are
contended.

Adding one more bit for the node index would halve the number of CPUs
supported without falling back to the slower version with a single pending bit.
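
For reference, the current split of the lock word is roughly (from memory; see
qspinlock_types.h for the authoritative values):

  0- 7: locked byte
  8-15: pending byte (only while NR_CPUS < 16K)
 16-17: tail index  (2 bits -> the 4 per-CPU qnodes)
 18-31: tail CPU + 1 (14 bits)

Stealing one more bit for the tail index leaves 13 CPU bits, i.e. about 8K
CPUs, before we have to fall back to the single pending bit layout.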

Cheers,
Longman


2019-01-10 21:17:23

by Peter Zijlstra

Subject: Re: Question about qspinlock nest

On Thu, Jan 10, 2019 at 06:25:57PM +0000, James Morse wrote:

> On arm64 if all the RAS and psuedo-NMI patches land, our worst-case interleaving
> jumps to at least 7. The culprit is APEI using spinlocks to protect fixmap slots.
>
> I have an RFC to bump the number of node bits from 2 to 3, but as this is APEI
> four times, it may be preferable to make it use something other than spinlocks.
>
>
> The worst-case order is below. Each one masks those before it:
> 1. process context
> 2. soft-irq
> 3. hard-irq
> 4. psuedo-nmi [0]
> - using the irqchip priorities to configure some IRQs as NMI.
> 5. SError [1]
> - a bit like an asynchronous MCE. ACPI allows this to convey CPER records,
> requiring an APEI call.
> 6&7. SDEI [2]
> - a firmware triggered software interrupt, only its two of them, either of
> which could convey CPER records.
> 8. Synchronous external abort
> - again, similar to MCE. There are systems using this with APEI.

The thing is, everything non-maskable (NMI like) really should not be
using spinlocks at all.

I otherwise have no clue about wth APEI is, but it sounds like horrible
crap ;-)

2019-01-10 22:20:56

by James Morse

Subject: Re: Question about qspinlock nest

Hi Longman, Zhenzhong,

On 10/01/2019 14:43, Waiman Long wrote:
> On 01/10/2019 03:02 AM, Zhenzhong Duan wrote:
>> There is a question confused me for days. Appreciate an answer.
>>
>> In below code, the comment says we never have more than 4 nested
>> contexts.
>>
>> What happen if debug and mce exceptions nest with the four, or we
>> ensure it never happen?
>>
>>
>> /*
>>  * Per-CPU queue node structures; we can never have more than 4 nested
>>  * contexts: task, softirq, hardirq, nmi.
>>  *
>>  * Exactly fits one 64-byte cacheline on a 64-bit architecture.
>>  *
>>  * PV doubles the storage and uses the second cacheline for PV state.
>>  */
>> static DEFINE_PER_CPU_ALIGNED(struct qnode, qnodes[MAX_NODES]);

> Yes, both debug and mce exceptions are some kind of NMIs. So
> theoretically, it is possible to have more than four. Are you aware of
> any debug and MCE exception handlers that need to take a spinlock for
> synchronization?

On arm64, if all the RAS and pseudo-NMI patches land, our worst-case interleaving
jumps to at least 7. The culprit is APEI using spinlocks to protect fixmap slots.

I have an RFC to bump the number of node bits from 2 to 3, but as APEI accounts
for four of those contexts, it may be preferable to make it use something other
than spinlocks.


The worst-case order is below. Each one masks those before it:
1. process context
2. soft-irq
3. hard-irq
4. pseudo-NMI [0]
- using the irqchip priorities to configure some IRQs as NMI.
5. SError [1]
- a bit like an asynchronous MCE. ACPI allows this to convey CPER records,
requiring an APEI call.
6&7. SDEI [2]
- a firmware-triggered software interrupt, only there are two of them, either of
which could convey CPER records.
8. Synchronous external abort
- again, similar to MCE. There are systems using this with APEI.


Thanks,

James


[0] https://lore.kernel.org/r/[email protected]
[1] https://lore.kernel.org/r/[email protected]
[2] https://lore.kernel.org/linux-arm-kernel/[email protected]/


2019-01-10 23:07:20

by Peter Zijlstra

Subject: Re: Question about qspinlock nest

On Thu, Jan 10, 2019 at 04:02:53PM +0800, Zhenzhong Duan wrote:
> Hi Maintainer,
>
>
> There is a question confused me for days. Appreciate an answer.
>
> In below code, the comment says we never have more than 4 nested contexts.
>
> What happen if debug and mce exceptions nest with the four, or we ensure it
> never happen?

MCE and NMI really should not use spinlocks at all since they cannot be
masked.


2019-01-11 09:41:07

by Peter Zijlstra

Subject: Re: Question about qspinlock nest

On Fri, Jan 11, 2019 at 04:06:52PM +0800, Zhenzhong Duan wrote:
>
> On 2019/1/10 22:43, Waiman Long wrote:
> > On 01/10/2019 03:02 AM, Zhenzhong Duan wrote:
> > > Hi Maintainer,
> > >
> > >
> > > There is a question confused me for days. Appreciate an answer.
> > >
> > > In below code, the comment says we never have more than 4 nested
> > > contexts.
> > >
> > > What happen if debug and mce exceptions nest with the four, or we
> > > ensure it never happen?
> > >
> > >
> > > /*
> > >  * Per-CPU queue node structures; we can never have more than 4 nested
> > >  * contexts: task, softirq, hardirq, nmi.
> > >  *
> > >  * Exactly fits one 64-byte cacheline on a 64-bit architecture.
> > >  *
> > >  * PV doubles the storage and uses the second cacheline for PV state.
> > >  */
> > > static DEFINE_PER_CPU_ALIGNED(struct qnode, qnodes[MAX_NODES]);
> > >
> > Yes, both debug and mce exceptions are some kind of NMIs. So
> > theoretically, it is possible to have more than four. Are you aware of
> > any debug and MCE exception handlers that need to take a spinlock for
> > synchronization?
>
> Not for debug exception, for MCE exception handler I found below two:
>
> do_machine_check->mce_report_event->schedule_work
> do_machine_check->force_sig->force_sig_info
>
> schedule_work() and force_sig_info() take spinlocks.

Boris, how can this be?

2019-01-11 20:57:32

by Borislav Petkov

Subject: Re: Question about qspinlock nest

On Fri, Jan 11, 2019 at 10:16:38AM +0100, Peter Zijlstra wrote:
> > Not for debug exception, for MCE exception handler I found below two:
> >
> > do_machine_check->mce_report_event->schedule_work

That has been the case at least since 2009. We probably never hit it, or at
least I've never seen a deadlock caused by it.

> > do_machine_check->force_sig->force_sig_info

This does:

ist_begin_non_atomic(regs);
local_irq_enable();

before forcing the signal. And there's a blurb above
ist_begin_non_atomic() explaining why it is OK to schedule when the
exception came from user mode.
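
I.e., roughly this shape (paraphrasing the tail of do_machine_check(), not a
verbatim quote):

	if ((m.cs & 3) == 3) {			/* #MC came from user mode */
		ist_begin_non_atomic(regs);	/* "normal" work is OK from here */
		local_irq_enable();

		/* ... memory_failure() handling, and then: */
		force_sig(SIGBUS, current);

		local_irq_disable();
		ist_end_non_atomic();
	}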

--
Regards/Gruss,
Boris.

Good mailing practices for 400: avoid top-posting and trim the reply.

2019-01-11 20:59:40

by James Morse

Subject: Re: Question about qspinlock nest

Hi Peter,

On 10/01/2019 20:12, Peter Zijlstra wrote:
> On Thu, Jan 10, 2019 at 06:25:57PM +0000, James Morse wrote:
>
>> On arm64 if all the RAS and psuedo-NMI patches land, our worst-case interleaving
>> jumps to at least 7. The culprit is APEI using spinlocks to protect fixmap slots.
>>
>> I have an RFC to bump the number of node bits from 2 to 3, but as this is APEI
>> four times, it may be preferable to make it use something other than spinlocks.

>> The worst-case order is below. Each one masks those before it:
>> 1. process context
>> 2. soft-irq
>> 3. hard-irq
>> 4. psuedo-nmi [0]
>> - using the irqchip priorities to configure some IRQs as NMI.
>> 5. SError [1]
>> - a bit like an asynchronous MCE. ACPI allows this to convey CPER records,
>> requiring an APEI call.
>> 6&7. SDEI [2]
>> - a firmware triggered software interrupt, only its two of them, either of
>> which could convey CPER records.
>> 8. Synchronous external abort
>> - again, similar to MCE. There are systems using this with APEI.

> The thing is, everything non-maskable (NMI like) really should not be
> using spinlocks at all.
>
> I otherwise have no clue about wth APEI is, but it sounds like horrible
> crap ;-)

I think you've called it that before! It's that GHES thing in drivers/acpi/apei.

What is the alternative? bit_spin_lock()?
These things can happen independently on multiple CPUs. On arm64 these NMI-like
things don't affect all CPUs the way they seem to on x86.


Thanks,

James

2019-01-11 21:42:10

by Waiman Long

Subject: Re: Question about qspinlock nest

On 01/11/2019 12:06 AM, Zhenzhong Duan wrote:
>
>
> On 2019/1/10 22:43, Waiman Long wrote:
>> On 01/10/2019 03:02 AM, Zhenzhong Duan wrote:
>>> Hi Maintainer,
>>>
>>>
>>> There is a question confused me for days. Appreciate an answer.
>>>
>>> In below code, the comment says we never have more than 4 nested
>>> contexts.
>>>
>>> What happen if debug and mce exceptions nest with the four, or we
>>> ensure it never happen?
>>>
>>>
>>> /*
>>>  * Per-CPU queue node structures; we can never have more than 4 nested
>>>  * contexts: task, softirq, hardirq, nmi.
>>>  *
>>>  * Exactly fits one 64-byte cacheline on a 64-bit architecture.
>>>  *
>>>  * PV doubles the storage and uses the second cacheline for PV state.
>>>  */
>>> static DEFINE_PER_CPU_ALIGNED(struct qnode, qnodes[MAX_NODES]);
>>>
>> Yes, both debug and mce exceptions are some kind of NMIs. So
>> theoretically, it is possible to have more than four. Are you aware of
>> any debug and MCE exception handlers that need to take a spinlock for
>> synchronization?
> Not for debug exception, for MCE exception handler I found below two:
>
> do_machine_check->mce_report_event->schedule_work
> do_machine_check->force_sig->force_sig_info
>
> schedule_work() and force_sig_info() take spinlocks.
> --
> Thanks
> Zhenzhong

The comment for do_machine_check() already states that:

 * This is executed in NMI context not subject to normal locking rules. This
 * implies that most kernel services cannot be safely used. Don't even
 * think about putting a printk in there!

So even if it doesn't exceed the MAX_NODES limit, it could hit deadlocks
and other kinds of locking hazards. Supporting MCE is therefore not a
strong enough reason to extend MAX_NODES.

In hindsight, we should have added a "BUG_ON(idx >= MAX_NODES);" check
to avoid silent corruption because of that issue.
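
Something like this, right where the node index is picked (illustrative
placement only, not a tested patch):

	idx = node->count++;
	/*
	 * Trap a fifth nesting level before it silently walks off the
	 * end of the qnodes[] array:
	 */
	BUG_ON(idx >= MAX_NODES);
	tail = encode_tail(smp_processor_id(), idx);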

Thanks,
Longman



2019-01-14 09:27:51

by Zhenzhong Duan

Subject: Re: Question about qspinlock nest


----- [email protected] wrote:

> On 01/11/2019 12:06 AM, Zhenzhong Duan wrote:
> >
> >
> > On 2019/1/10 22:43, Waiman Long wrote:
> >> On 01/10/2019 03:02 AM, Zhenzhong Duan wrote:
> >>> Hi Maintainer,
> >>>
> >>>
> >>> There is a question confused me for days. Appreciate an answer.
> >>>
> >>> In below code, the comment says we never have more than 4 nested
> >>> contexts.
> >>>
> >>> What happen if debug and mce exceptions nest with the four, or we
> >>> ensure it never happen?
> >>>
> >>>
> >>> /*
> >>>  * Per-CPU queue node structures; we can never have more than 4
> nested
> >>>  * contexts: task, softirq, hardirq, nmi.
> >>>  *
> >>>  * Exactly fits one 64-byte cacheline on a 64-bit architecture.
> >>>  *
> >>>  * PV doubles the storage and uses the second cacheline for PV
> state.
> >>>  */
> >>> static DEFINE_PER_CPU_ALIGNED(struct qnode, qnodes[MAX_NODES]);
> >>>
> >> Yes, both debug and mce exceptions are some kind of NMIs. So
> >> theoretically, it is possible to have more than four. Are you aware
> of
> >> any debug and MCE exception handlers that need to take a spinlock
> for
> >> synchronization?
> > Not for debug exception, for MCE exception handler I found below
> two:
> >
> > do_machine_check->mce_report_event->schedule_work
> > do_machine_check->force_sig->force_sig_info
> >
> > schedule_work() and force_sig_info() take spinlocks.
> > --
> > Thanks
> > Zhenzhong
>
> The comment for do_machine_scheck() has already state that:
>
>  * This is executed in NMI context not subject to normal locking
> rules. This
>  * implies that most kernel services cannot be safely used. Don't
> even
>  * think about putting a printk in there!
>
> So even if it doesn't exceed the MAX_NODES limit, it could hit
> deadlock
> and other kind of locking hazards. So supporting MCE is not a reason
> strong enough to extend MAX_NODES.

Agree.

>
> In hindsight, we should have added a "BUG_ON(idx >= MAX_NODES);"
> check
> to avoid silent corruption because of that issue.

Looks like a good idea if it's hard to avoid using spinlocks in the MCE handler.

Thanks
Zhenzhong

2019-01-14 13:19:29

by Peter Zijlstra

Subject: Re: Question about qspinlock nest

On Fri, Jan 11, 2019 at 06:32:58PM +0000, James Morse wrote:
> Hi Peter,
>
> On 10/01/2019 20:12, Peter Zijlstra wrote:
> > On Thu, Jan 10, 2019 at 06:25:57PM +0000, James Morse wrote:
> >
> >> On arm64 if all the RAS and psuedo-NMI patches land, our worst-case interleaving
> >> jumps to at least 7. The culprit is APEI using spinlocks to protect fixmap slots.
> >>
> >> I have an RFC to bump the number of node bits from 2 to 3, but as this is APEI
> >> four times, it may be preferable to make it use something other than spinlocks.
>
> >> The worst-case order is below. Each one masks those before it:
> >> 1. process context
> >> 2. soft-irq
> >> 3. hard-irq
> >> 4. psuedo-nmi [0]
> >> - using the irqchip priorities to configure some IRQs as NMI.
> >> 5. SError [1]
> >> - a bit like an asynchronous MCE. ACPI allows this to convey CPER records,
> >> requiring an APEI call.
> >> 6&7. SDEI [2]
> >> - a firmware triggered software interrupt, only its two of them, either of
> >> which could convey CPER records.
> >> 8. Synchronous external abort
> >> - again, similar to MCE. There are systems using this with APEI.
>
> > The thing is, everything non-maskable (NMI like) really should not be
> > using spinlocks at all.
> >
> > I otherwise have no clue about wth APEI is, but it sounds like horrible
> > crap ;-)
>
> I think you've called it that before!: its that GHES thing in drivers/acpi/apei.
>
> What is the alternative? bit_spin_lock()?
> These things can happen independently on multiple CPUs. On arm64 these NMIlike
> things don't affect all CPUs like they seem to on x86.

It has nothing to do with how many CPUs are affected. It has everything
to do with not being maskable.

What avoids the trivial self-recursion:

spin_lock(&x)
<NMI>
spin_lock(&x)
... wait forever more ...
</NMI>
spin_unlock(&x)

?

Normally for actual maskable interrupts, we use:

spin_lock_irq(&x)
// our IRQ cannot happen here because: masked
spin_unlock_irq(&x)

But non-maskable has, per definition, a wee issue there.

Non-maskable MUST NOT _EVAH_ use any form of spinlocks, they're
fundamentally incompatible. Non-maskable interrupts must employ
wait-free atomic constructs.
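
That is, something of this shape (purely illustrative, made-up names):

/*
 * Claim the resource with a single atomic op; either we get it or we bail
 * out immediately. There is nothing to wait on, so nothing to deadlock on.
 */
static atomic_t slot_owner = ATOMIC_INIT(-1);

static bool nmi_claim_slot(int cpu)
{
	return atomic_cmpxchg(&slot_owner, -1, cpu) == -1;
}

static void nmi_release_slot(void)
{
	atomic_set(&slot_owner, -1);
}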

2019-01-14 13:56:16

by James Morse

Subject: Re: Question about qspinlock nest

Hi Peter,

On 14/01/2019 13:16, Peter Zijlstra wrote:
> On Fri, Jan 11, 2019 at 06:32:58PM +0000, James Morse wrote:
>> On 10/01/2019 20:12, Peter Zijlstra wrote:
>>> On Thu, Jan 10, 2019 at 06:25:57PM +0000, James Morse wrote:
>>> The thing is, everything non-maskable (NMI like) really should not be
>>> using spinlocks at all.
>>>
>>> I otherwise have no clue about wth APEI is, but it sounds like horrible
>>> crap ;-)
>>
>> I think you've called it that before!: its that GHES thing in drivers/acpi/apei.
>>
>> What is the alternative? bit_spin_lock()?

>> These things can happen independently on multiple CPUs. On arm64 these NMIlike
>> things don't affect all CPUs like they seem to on x86.
>
> It has nothing to do with how many CPUs are affected. It has everything
> to do with not being maskable.

(sorry, I didn't include any of the context, let me back-up a bit here:)

> What avoids the trivial self-recursion:
>
> spin_lock(&x)
> <NMI>
> spin_lock(&x)
> ... wait forever more ...
> </NMI>
> spin_unlock(&x)
>
> ?

If it's trying to take the same lock, I agree it's deadlocked.
If the sequence above started with <NMI>, I agree it's deadlocked.

APEI/GHES is doing neither of these things. It takes a lock that is only ever
taken in_nmi(). nmi_enter()'s BUG_ON(in_nmi()) means these never become re-entrant.

What is the lock doing? Protecting the 'NMI' fixmap slot in the unlikely case
that two CPUs end up in here at the same time.

(I thought x86's NMI masked NMIs until the next iret?)


This is murkier on arm64 as we have multiple things that behave like this, but
there is an order to them, and none of them can interrupt themselves.
e.g. we can't take an SError during the SError handler.

But we can take this SError/NMI on another CPU while the first one is still
running the handler.

These multiple NMI-like notifications mean having multiple locks/fixmap-slots,
one per notification. This is where the qspinlock node limit comes in, as we
could have more than 4 contexts.
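
The shape of it is roughly this (hypothetical names, not the actual GHES code):

/*
 * One lock per NMI-like notification, only ever taken from that
 * notification's handler, so it can't self-recurse but can still be
 * contended across CPUs.
 */
static DEFINE_RAW_SPINLOCK(sea_fixmap_lock);

static void sea_report_cper(phys_addr_t paddr)
{
	raw_spin_lock(&sea_fixmap_lock);	/* cross-CPU contention only */
	/* map paddr through this notification's fixmap slot, read the CPER records */
	raw_spin_unlock(&sea_fixmap_lock);
}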


Thanks,

James

> Normally for actual maskable interrupts, we use:
>
> spin_lock_irq(&x)
> // our IRQ cannot happen here because: masked
> spin_unlock_irq(&x)
>
> But non-maskable, has, per definition, a wee issue there.

> Non-maskable MUST NOT _EVAH_ use any form of spinlocks, they're
> fundamentally incompatible. Non-maskable interrupts must employ
> wait-free atomic constructs.






2019-01-14 21:10:23

by Waiman Long

Subject: Re: Question about qspinlock nest

On 01/14/2019 08:54 AM, James Morse wrote:
> Hi Peter,
>
> On 14/01/2019 13:16, Peter Zijlstra wrote:
>> On Fri, Jan 11, 2019 at 06:32:58PM +0000, James Morse wrote:
>>> On 10/01/2019 20:12, Peter Zijlstra wrote:
>>>> On Thu, Jan 10, 2019 at 06:25:57PM +0000, James Morse wrote:
>>>> The thing is, everything non-maskable (NMI like) really should not be
>>>> using spinlocks at all.
>>>>
>>>> I otherwise have no clue about wth APEI is, but it sounds like horrible
>>>> crap ;-)
>>> I think you've called it that before!: its that GHES thing in drivers/acpi/apei.
>>>
>>> What is the alternative? bit_spin_lock()?
>>> These things can happen independently on multiple CPUs. On arm64 these NMIlike
>>> things don't affect all CPUs like they seem to on x86.
>> It has nothing to do with how many CPUs are affected. It has everything
>> to do with not being maskable.
> (sorry, I didn't include any of the context, let me back-up a bit here:)
>
>> What avoids the trivial self-recursion:
>>
>> spin_lock(&x)
>> <NMI>
>> spin_lock(&x)
>> ... wait forever more ...
>> </NMI>
>> spin_unlock(&x)
>>
>> ?
> If its trying to take the same lock, I agree its deadlocked.
> If the sequence above started with <NMI>, I agree its deadlocked.
>
> APEI/GHES is doing neither of these things. It take a lock that is only ever
> taken in_nmi(). nmi_enter()s BUG_ON(in_nmi()) means these never become re-entrant.
>
> What is the lock doing? Protecting the 'NMI' fixmap slot in the unlikely case
> that two CPUs end up in here at the same time.
>
> (I though x86's NMI masked NMI until the next iret?)
>
>
> This is murkier on arm64 as we have multiple things that behave like this, but
> there is an order to them, and none of them can interrupt themselves.
> e.g. We can't take an SError during the SError handler.
>
> But we can take this SError/NMI on another CPU while the first one is still
> running the handler.
>
> These multiple NMIlike notifications mean having multiple locks/fixmap-slots,
> one per notification. This is where the qspinlock node limit comes in, as we
> could have more than 4 contexts.

Still, it will be extremely unlikely to have more than 4 nested spinlock
acquisitions with contention. Do you think it will be helpful to make
the MAX_NODES parameter configurable to either 4 or 8? For x86, I think
we can live with 4. On arm64, we can opt for 8 if you think there is a
decent chance that more than 4 could be needed under certain
circumstances. This will, of course, reduce the max NR_CPUS by half.
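
Something along these lines, with a hypothetical Kconfig symbol just to
illustrate the idea:

#ifdef CONFIG_QUEUED_SPINLOCK_DEEP_NESTING	/* hypothetical option */
#define MAX_NODES	8	/* costs one extra tail index bit */
#else
#define MAX_NODES	4
#endif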

Cheers,
Longman


2019-01-18 10:05:03

by Peter Zijlstra

Subject: Re: Question about qspinlock nest

On Mon, Jan 14, 2019 at 01:54:49PM +0000, James Morse wrote:
> On 14/01/2019 13:16, Peter Zijlstra wrote:

> > What avoids the trivial self-recursion:
> >
> > spin_lock(&x)
> > <NMI>
> > spin_lock(&x)
> > ... wait forever more ...
> > </NMI>
> > spin_unlock(&x)
> >
> > ?
>
> If its trying to take the same lock, I agree its deadlocked.
> If the sequence above started with <NMI>, I agree its deadlocked.
>
> APEI/GHES is doing neither of these things. It take a lock that is only ever
> taken in_nmi(). nmi_enter()s BUG_ON(in_nmi()) means these never become re-entrant.

Urgh.. yes. I abhor that spinlock usage, but you're correct. If they're
only ever used in the NMI then it ought to work.

/me digs around... Bugger we have more like that :/

> What is the lock doing? Protecting the 'NMI' fixmap slot in the unlikely case
> that two CPUs end up in here at the same time.
>
> (I though x86's NMI masked NMI until the next iret?)

Correct; x86 has this 'feature' where IRET will unmask the NMI, so we
have something quite terrible to deal with that; don't ask and I shall
not have to explain :-)

> This is murkier on arm64 as we have multiple things that behave like this, but
> there is an order to them, and none of them can interrupt themselves.

Well, x86 too has multiple non-maskable vectors, and afaik only the
actual NMI vector is covered in trickery. But our MCE vector is
non-maskable too (and I have vague memories of there being more).

Boris, Rostedt, WTH happens if our MCE code goes and hits a #BP ? (not
unlikely with this proliferation of self-modifying code)

Anyway, the idea is that they can indeed not interrupt themselves, but I
would not be surprised if the whole MCE thing is riddled with fail (on
x86).

> e.g. We can't take an SError during the SError handler.
>
> But we can take this SError/NMI on another CPU while the first one is still
> running the handler.
>
> These multiple NMIlike notifications mean having multiple locks/fixmap-slots,
> one per notification. This is where the qspinlock node limit comes in, as we
> could have more than 4 contexts.

Right; so Waiman was going to do a patch that reverts to test-and-set or
something along those lines once we hit the queue limit, which seems
like a good way out. Actually hitting that nesting level should be
exceedingly rare.
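
Roughly this, I think (sketch only, not the actual patch):

	idx = node->count++;
	if (unlikely(idx >= MAX_NODES)) {
		/* Out of per-CPU nodes: don't queue, just spin on trylock. */
		while (!queued_spin_trylock(lock))
			cpu_relax();
		goto release;
	}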

2019-01-18 10:27:54

by Borislav Petkov

Subject: Re: Question about qspinlock nest

On Fri, Jan 18, 2019 at 11:02:29AM +0100, Peter Zijlstra wrote:
> Well, x86 too has multiple non-maskable vectors, and afaik only the
> actual NMI vector is covered in tricky. But our MCE vector is
> non-maskable too (and I have vague memories of there being more).
>
> Boris, Rostedt, WTH happens if our MCE code goes and hits a #BP ? (not
> unlikely with this proliferation of self-modifying code)
>
> Anyway, the idea is that they can indeed not interrupt themselves, but I
> would not be surprised if the whole MCE thing is riddled with fail (on
> x86).

As we discussed on IRC: we'll apply a mixture of "So don't do that
then!" hints (i.e., for things like kprobing MCE code), and fix the issues
like the schedule_work() invocation.

In general, I'd like to make/keep the #MC handler as simple and
as idiot-proof as possible.

--
Regards/Gruss,
Boris.

Good mailing practices for 400: avoid top-posting and trim the reply.

2019-01-18 15:43:52

by Waiman Long

Subject: Re: Question about qspinlock nest

On 01/18/2019 05:02 AM, Peter Zijlstra wrote:
>
>> e.g. We can't take an SError during the SError handler.
>>
>> But we can take this SError/NMI on another CPU while the first one is still
>> running the handler.
>>
>> These multiple NMIlike notifications mean having multiple locks/fixmap-slots,
>> one per notification. This is where the qspinlock node limit comes in, as we
>> could have more than 4 contexts.
> Right; so Waiman was going to do a patch that reverts to test-and-set or
> something along those lines once we hit the queue limit, which seems
> like a good way out. Actually hitting that nesting level should be
> exceedingly rare.

Yes, I am working on a patch to support arbitrary levels of nesting. It
is easy for PV qspinlock as lock stealing is supported.

For native qspinlock, we cannot do lock stealing without incurring a
certain amount of overhead in the regular slowpath code. It was up to
10% in my own testing. So I am exploring an alternative that can do the
job without incurring any noticeable performance degradation in the
slowpath. I ran into a race condition whose origin I am still trying to
track down. Hopefully, I will have something to post next
week.

Cheers,
Longman




2019-01-18 20:08:48

by Peter Zijlstra

Subject: Re: Question about qspinlock nest

On Fri, Jan 18, 2019 at 09:50:12AM -0500, Waiman Long wrote:
> On 01/18/2019 05:02 AM, Peter Zijlstra wrote:
> >
> >> e.g. We can't take an SError during the SError handler.
> >>
> >> But we can take this SError/NMI on another CPU while the first one is still
> >> running the handler.
> >>
> >> These multiple NMIlike notifications mean having multiple locks/fixmap-slots,
> >> one per notification. This is where the qspinlock node limit comes in, as we
> >> could have more than 4 contexts.
> > Right; so Waiman was going to do a patch that reverts to test-and-set or
> > something along those lines once we hit the queue limit, which seems
> > like a good way out. Actually hitting that nesting level should be
> > exceedingly rare.
>
> Yes, I am working on a patch to support arbitrary levels of nesting. It
> is easy for PV qspinlock as lock stealing is supported.
>
> For native qspinlock, we cannot do lock stealing without incurring a
> certain amount of overhead in the regular slowpath code. It was up to
> 10% in my own testing. So I am exploring an alternative that can do the
> job without incurring any noticeable performance degradation in the
> slowpath. I ran into a race condition which I am still trying to find
> out where that comes from. Hopefully, I will have something to post next
> week.

Where does the overhead come from? Surely that's not just checking that
bound?

2019-01-18 21:32:08

by Waiman Long

Subject: Re: Question about qspinlock nest

On 01/18/2019 03:06 PM, Peter Zijlstra wrote:
> On Fri, Jan 18, 2019 at 09:50:12AM -0500, Waiman Long wrote:
>> On 01/18/2019 05:02 AM, Peter Zijlstra wrote:
>>>> e.g. We can't take an SError during the SError handler.
>>>>
>>>> But we can take this SError/NMI on another CPU while the first one is still
>>>> running the handler.
>>>>
>>>> These multiple NMIlike notifications mean having multiple locks/fixmap-slots,
>>>> one per notification. This is where the qspinlock node limit comes in, as we
>>>> could have more than 4 contexts.
>>> Right; so Waiman was going to do a patch that reverts to test-and-set or
>>> something along those lines once we hit the queue limit, which seems
>>> like a good way out. Actually hitting that nesting level should be
>>> exceedingly rare.
>> Yes, I am working on a patch to support arbitrary levels of nesting. It
>> is easy for PV qspinlock as lock stealing is supported.
>>
>> For native qspinlock, we cannot do lock stealing without incurring a
>> certain amount of overhead in the regular slowpath code. It was up to
>> 10% in my own testing. So I am exploring an alternative that can do the
>> job without incurring any noticeable performance degradation in the
>> slowpath. I ran into a race condition which I am still trying to find
>> out where that comes from. Hopefully, I will have something to post next
>> week.
> Where does the overhead come from? Surely that's not just checking that
> bound?

It is not about checking the bound; it is about how to acquire the lock
without using an MCS node. The overhead comes from using an atomic
instruction to acquire the lock, instead of a non-atomic one, in order to
allow lock stealing.
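
To illustrate (sketch only, not the code I am actually testing):

	/*
	 * Today the queue head is the only possible next owner, so it can
	 * claim the lock with a plain byte store:
	 */
	WRITE_ONCE(lock->locked, _Q_LOCKED_VAL);

	/*
	 * If waiters that ran out of qnodes were allowed to steal the lock,
	 * the queue head would have to win a race instead, e.g.:
	 */
	for (;;) {
		val = atomic_read(&lock->val);
		if (!(val & _Q_LOCKED_PENDING_MASK) &&
		    atomic_cmpxchg_acquire(&lock->val, val,
					   val | _Q_LOCKED_VAL) == val)
			break;
		cpu_relax();
	}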

Cheers,
Longman