2014-01-21 13:28:42

by Jan Kiszka

[permalink] [raw]
Subject: x86: Inconsistent xAPIC synchronization in arch_irq_work_raise?

Hi all,

while trying to plug a race in the CPU hotplug code on xAPIC systems, I
was analyzing IPI transmission patterns. The handlers in
arch/x86/include/asm/ipi.h first wait for ICR, then send. In contrast,
arch_irq_work_raise sends the self-IPI directly and then waits. This
looks inconsistent. Is it intended?

BTW, the races are in wakeup_secondary_cpu_via_init and
wakeup_secondary_cpu_via_nmi (lacking IRQ disable around ICR accesses).
There we also send first, then wait for completion. But I guess that is
due to the code originally only being used during boot. Will send fixes
for those once the sync pattern is clear to me.

Jan

--
Siemens AG, Corporate Technology, CT RTC ITP SES-DE
Corporate Competence Center Embedded Linux


2014-01-21 14:01:37

by Peter Zijlstra

[permalink] [raw]
Subject: Re: x86: Inconsistent xAPIC synchronization in arch_irq_work_raise?

On Tue, Jan 21, 2014 at 02:02:06PM +0100, Jan Kiszka wrote:
> Hi all,
>
> while trying to plug a race in the CPU hotplug code on xAPIC systems, I
> was analyzing IPI transmission patterns. The handlers in
> arch/x86/include/asm/ipi.h first wait for ICR, then send. In contrast,
> arch_irq_work_raise sends the self-IPI directly and then waits. This
> looks inconsistent. Is it intended?
>
> BTW, the races are in wakeup_secondary_cpu_via_init and
> wakeup_secondary_cpu_via_nmi (lacking IRQ disable around ICR accesses).
> There we also send first, then wait for completion. But I guess that is
> due to the code originally only being used during boot. Will send fixes
> for those once the sync pattern is clear to me.

Could be I had no clue what I was doing and copy/pasted the code until
it compiled and ran.

In fact, I've got no clue what an ICR is.

2014-01-21 14:11:17

by Ingo Molnar

[permalink] [raw]
Subject: Re: x86: Inconsistent xAPIC synchronization in arch_irq_work_raise?


* Peter Zijlstra <[email protected]> wrote:

> On Tue, Jan 21, 2014 at 02:02:06PM +0100, Jan Kiszka wrote:
> > Hi all,
> >
> > while trying to plug a race in the CPU hotplug code on xAPIC systems, I
> > was analyzing IPI transmission patterns. The handlers in
> > arch/x86/include/asm/ipi.h first wait for ICR, then send. In contrast,
> > arch_irq_work_raise sends the self-IPI directly and then waits. This
> > looks inconsistent. Is it intended?
> >
> > BTW, the races are in wakeup_secondary_cpu_via_init and
> > wakeup_secondary_cpu_via_nmi (lacking IRQ disable around ICR accesses).
> > There we also send first, then wait for completion. But I guess that is
> > due to the code originally only being used during boot. Will send fixes
> > for those once the sync pattern is clear to me.
>
> Could be I had no clue what I was doing and copy/pasted the code until
> it compiled and ran.
>
> In fact, I've got no clue what an ICR is.

APIC ICR = Interrupt Command Register - this is the MMIO-mapped
register of the local APIC that (when written to) triggers the sending
of IPIs and (when read from) shows the status how the IPI is going or
whether a new IPI can be sent.

( Not to be confused with the APIC timer ICR, which is 'Initial Count
Register' and does something entirely different. )

Thanks,

Ingo

2014-01-21 14:12:06

by Jan Kiszka

[permalink] [raw]
Subject: Re: x86: Inconsistent xAPIC synchronization in arch_irq_work_raise?

On 2014-01-21 15:01, Peter Zijlstra wrote:
> On Tue, Jan 21, 2014 at 02:02:06PM +0100, Jan Kiszka wrote:
>> Hi all,
>>
>> while trying to plug a race in the CPU hotplug code on xAPIC systems, I
>> was analyzing IPI transmission patterns. The handlers in
>> arch/x86/include/asm/ipi.h first wait for ICR, then send. In contrast,
>> arch_irq_work_raise sends the self-IPI directly and then waits. This
>> looks inconsistent. Is it intended?
>>
>> BTW, the races are in wakeup_secondary_cpu_via_init and
>> wakeup_secondary_cpu_via_nmi (lacking IRQ disable around ICR accesses).
>> There we also send first, then wait for completion. But I guess that is
>> due to the code originally only being used during boot. Will send fixes
>> for those once the sync pattern is clear to me.
>
> Could be I had no clue what I was doing and copy/pasted the code until
> it compiled and ran.
>
> In fact, I've got no clue what an ICR is.

Old xAPIC requires you to only send IPIs, when the APIC signals it is
done with sending the previous one. Therefore we wait for availability
in the other IPI transmission services before writing to ICR.

OK, then I will write a separate patch for arch_irq_work_raise to switch
the ordering.

Jan

--
Siemens AG, Corporate Technology, CT RTC ITP SES-DE
Corporate Competence Center Embedded Linux

2014-01-21 14:35:34

by Jan Kiszka

[permalink] [raw]
Subject: Re: x86: Inconsistent xAPIC synchronization in arch_irq_work_raise?

On 2014-01-21 15:11, Jan Kiszka wrote:
> On 2014-01-21 15:01, Peter Zijlstra wrote:
>> On Tue, Jan 21, 2014 at 02:02:06PM +0100, Jan Kiszka wrote:
>>> Hi all,
>>>
>>> while trying to plug a race in the CPU hotplug code on xAPIC systems, I
>>> was analyzing IPI transmission patterns. The handlers in
>>> arch/x86/include/asm/ipi.h first wait for ICR, then send. In contrast,
>>> arch_irq_work_raise sends the self-IPI directly and then waits. This
>>> looks inconsistent. Is it intended?
>>>
>>> BTW, the races are in wakeup_secondary_cpu_via_init and
>>> wakeup_secondary_cpu_via_nmi (lacking IRQ disable around ICR accesses).
>>> There we also send first, then wait for completion. But I guess that is
>>> due to the code originally only being used during boot. Will send fixes
>>> for those once the sync pattern is clear to me.
>>
>> Could be I had no clue what I was doing and copy/pasted the code until
>> it compiled and ran.
>>
>> In fact, I've got no clue what an ICR is.
>
> Old xAPIC requires you to only send IPIs, when the APIC signals it is
> done with sending the previous one. Therefore we wait for availability
> in the other IPI transmission services before writing to ICR.
>
> OK, then I will write a separate patch for arch_irq_work_raise to switch
> the ordering.

Hmm, missed that we do have synchronization already via
apic->send_IPI_self -> default_send_IPI_self ->
__default_send_IPI_shortcut. So the closing wait would only be relevant
if we need to settle the APIC because we may have interrupted a
wait_icr_idle + write_icr sequence. But shouldnt those sequences be
atomic (except for the problematic wakeup_secondary path)?

Still confused: What is the official locking model around wait_icr_idle,
write(ICR), and also write(IRC2)? IRQ disable around ICR2+ICR accesses
and preempt_disable around wait + write? That is also important to fix
the SMP boot-up code properly.

Jan

--
Siemens AG, Corporate Technology, CT RTC ITP SES-DE
Corporate Competence Center Embedded Linux

2014-01-21 14:51:33

by Peter Zijlstra

[permalink] [raw]
Subject: Re: x86: Inconsistent xAPIC synchronization in arch_irq_work_raise?

On Tue, Jan 21, 2014 at 03:01:13PM +0100, Peter Zijlstra wrote:
> On Tue, Jan 21, 2014 at 02:02:06PM +0100, Jan Kiszka wrote:
> > Hi all,
> >
> > while trying to plug a race in the CPU hotplug code on xAPIC systems, I
> > was analyzing IPI transmission patterns. The handlers in
> > arch/x86/include/asm/ipi.h first wait for ICR, then send. In contrast,
> > arch_irq_work_raise sends the self-IPI directly and then waits. This
> > looks inconsistent. Is it intended?
> >
> > BTW, the races are in wakeup_secondary_cpu_via_init and
> > wakeup_secondary_cpu_via_nmi (lacking IRQ disable around ICR accesses).
> > There we also send first, then wait for completion. But I guess that is
> > due to the code originally only being used during boot. Will send fixes
> > for those once the sync pattern is clear to me.
>
> Could be I had no clue what I was doing and copy/pasted the code until
> it compiled and ran.
>
> In fact, I've got no clue what an ICR is.

I dug about a bit, I borrowed that code from:

lkml.kernel.org/r/[email protected]

Huang Ying, can you explain to Jan why you do the wait afterwards?

2014-01-21 23:20:24

by Huang, Ying

[permalink] [raw]
Subject: Re: x86: Inconsistent xAPIC synchronization in arch_irq_work_raise?

On Tue, 2014-01-21 at 15:51 +0100, Peter Zijlstra wrote:
> On Tue, Jan 21, 2014 at 03:01:13PM +0100, Peter Zijlstra wrote:
> > On Tue, Jan 21, 2014 at 02:02:06PM +0100, Jan Kiszka wrote:
> > > Hi all,
> > >
> > > while trying to plug a race in the CPU hotplug code on xAPIC systems, I
> > > was analyzing IPI transmission patterns. The handlers in
> > > arch/x86/include/asm/ipi.h first wait for ICR, then send. In contrast,
> > > arch_irq_work_raise sends the self-IPI directly and then waits. This
> > > looks inconsistent. Is it intended?
> > >
> > > BTW, the races are in wakeup_secondary_cpu_via_init and
> > > wakeup_secondary_cpu_via_nmi (lacking IRQ disable around ICR accesses).
> > > There we also send first, then wait for completion. But I guess that is
> > > due to the code originally only being used during boot. Will send fixes
> > > for those once the sync pattern is clear to me.
> >
> > Could be I had no clue what I was doing and copy/pasted the code until
> > it compiled and ran.
> >
> > In fact, I've got no clue what an ICR is.
>
> I dug about a bit, I borrowed that code from:
>
> lkml.kernel.org/r/[email protected]
>
> Huang Ying, can you explain to Jan why you do the wait afterwards?

I borrow the code from the original MCE report event code.

Andi, could you help us to explain it?

Best Regards,
Huang Ying

2014-01-22 18:43:28

by Andi Kleen

[permalink] [raw]
Subject: Re: x86: Inconsistent xAPIC synchronization in arch_irq_work_raise?

> > Huang Ying, can you explain to Jan why you do the wait afterwards?
>
> I borrow the code from the original MCE report event code.
>
> Andi, could you help us to explain it?

I don't recall all the details, but I believe i also just copied
it from the APIC code. I don't think I did any particular ordering
intentionally.

-Andi

--
[email protected] -- Speaking for myself only.

2014-01-23 18:52:38

by Jan Kiszka

[permalink] [raw]
Subject: Re: x86: Inconsistent xAPIC synchronization in arch_irq_work_raise?

On 2014-01-22 19:43, Andi Kleen wrote:
>>> Huang Ying, can you explain to Jan why you do the wait afterwards?
>>
>> I borrow the code from the original MCE report event code.
>>
>> Andi, could you help us to explain it?
>
> I don't recall all the details, but I believe i also just copied
> it from the APIC code. I don't think I did any particular ordering
> intentionally.

OK, then let me summarize my current understanding so that we can derive
a consistent usage:

The xAPIC requires us to only write to ICR (both low and high part) if
ICR.DS is cleared - correct? ICR.DS checking as well as ICR writing must
only work against the same CPU, naturally.

Both __default_send_IPI_shortcut and __default_send_IPI_dest_field check
ICR.DS first, then write, but do not wait for ICR.DS to become 0 again -
not needed if this pattern is used consistently. Moreover,
default_send_IPI_mask* disables interrupts around these steps, thus
ensure atomicity. But shorthand IPI transmitters
(default_send_IPI_allbutself, default_send_IPI_all,
default_send_IPI_self) do not disable interrupts themselves. I didn't
check their call sites yet, maybe it's there.

Next we have x86's arch_irq_work_raise which does wait-write-wait,
either by chance or in order to work around a missing atomicity of
wait+write somewhere else. Preemption is off, interrupts remain on.

And then there is apic_icr_write, used while onlining CPUs, not only
during boot, that runs without any protection - that's the race I
originally stumbled over (INIT/SIPI or "just" NMI signals can end up on
the wrong CPU).

So now I'm looking for consistent locking rules (which type of lock, who
is responsible when issuing IPIs?) and a good (ie. also efficient) way
to apply them.

Jan

--
Siemens AG, Corporate Technology, CT RTC ITP SES-DE
Corporate Competence Center Embedded Linux

2014-01-23 19:12:06

by Peter Zijlstra

[permalink] [raw]
Subject: Re: x86: Inconsistent xAPIC synchronization in arch_irq_work_raise?

On Thu, Jan 23, 2014 at 07:51:40PM +0100, Jan Kiszka wrote:
> Next we have x86's arch_irq_work_raise which does wait-write-wait,
> either by chance or in order to work around a missing atomicity of
> wait+write somewhere else. Preemption is off, interrupts remain on.

Note that arch_irq_work_raise() is 'special' in that its used from NMI
context, so no amount of interrupts disabling will serialize things.

2014-01-23 19:23:05

by Andi Kleen

[permalink] [raw]
Subject: Re: x86: Inconsistent xAPIC synchronization in arch_irq_work_raise?

> So now I'm looking for consistent locking rules (which type of lock, who
> is responsible when issuing IPIs?) and a good (ie. also efficient) way
> to apply them.

It has to be lockless, the machine checks run as NMIs.
The whole point of the self nmi is to get back to a lockable state.

-Andi
--
[email protected] -- Speaking for myself only.

2014-01-23 19:52:06

by Jan Kiszka

[permalink] [raw]
Subject: Re: x86: Inconsistent xAPIC synchronization in arch_irq_work_raise?

On 2014-01-23 20:22, Andi Kleen wrote:
>> So now I'm looking for consistent locking rules (which type of lock, who
>> is responsible when issuing IPIs?) and a good (ie. also efficient) way
>> to apply them.
>
> It has to be lockless, the machine checks run as NMIs.
> The whole point of the self nmi is to get back to a lockable state.

"It" is arch_irq_work_raise, or more?

arch_irq_work_raise looks consistent then, and we now know why there is
the trailing wait.

Remains the question if shorthand IPIs are always locked by their
callers. Some do (those in kernel/apic/io_apic.c e.g.), other seem to be
more relaxed with this (e.g. kernel/apic/hw_nmi.c or
native_send_call_func_single_ipi), unless I miss something again.

Jan

--
Siemens AG, Corporate Technology, CT RTC ITP SES-DE
Corporate Competence Center Embedded Linux

2014-01-23 20:17:54

by Andi Kleen

[permalink] [raw]
Subject: Re: x86: Inconsistent xAPIC synchronization in arch_irq_work_raise?

> > It has to be lockless, the machine checks run as NMIs.
> > The whole point of the self nmi is to get back to a lockable state.
>
> "It" is arch_irq_work_raise, or more?

Really irq_work_queue, but it boils down to arch_irq_work_raise

-Andi