The Xen-specific queued spinlock wait function has two issues that can
result in a hanging system.
They have a similar root cause: a pending wakeup of a waiting vcpu is
cleared, and the vcpu later goes to sleep waiting for the just-cleared
wakeup event, which of course will never arrive.
Juergen Gross (2):
xen: fix race in xen_qlock_wait()
xen: make xen_qlock_wait() nestable
arch/x86/xen/spinlock.c | 33 ++++++++++++---------------------
1 file changed, 12 insertions(+), 21 deletions(-)
Cc: [email protected]
Cc: [email protected]
--
2.16.4
In the following situation a vcpu waiting for a lock might not be
woken up from xen_poll_irq():
CPU 1:                CPU 2:                      CPU 3:
takes a spinlock
                      tries to get lock
                        -> xen_qlock_wait()
                          -> xen_clear_irq_pending()
frees the lock
  -> xen_qlock_kick(cpu2)
takes lock again
                                                  tries to get lock
                                                    -> *lock = _Q_SLOW_VAL
                        -> *lock == _Q_SLOW_VAL ?
                          -> xen_poll_irq()
frees the lock
  -> xen_qlock_kick(cpu3)
And cpu 2 will sleep forever.
This can be avoided easily by modifying xen_qlock_wait() to call
xen_poll_irq() only if the related irq was not pending and to call
xen_clear_irq_pending() only if it was pending.
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Signed-off-by: Juergen Gross <[email protected]>
---
arch/x86/xen/spinlock.c | 15 +++++----------
1 file changed, 5 insertions(+), 10 deletions(-)
diff --git a/arch/x86/xen/spinlock.c b/arch/x86/xen/spinlock.c
index 973f10e05211..cd210a4ba7b1 100644
--- a/arch/x86/xen/spinlock.c
+++ b/arch/x86/xen/spinlock.c
@@ -45,17 +45,12 @@ static void xen_qlock_wait(u8 *byte, u8 val)
if (irq == -1)
return;
- /* clear pending */
- xen_clear_irq_pending(irq);
- barrier();
+ /* If irq pending already clear it and return. */
+ if (xen_test_irq_pending(irq)) {
+ xen_clear_irq_pending(irq);
+ return;
+ }
- /*
- * We check the byte value after clearing pending IRQ to make sure
- * that we won't miss a wakeup event because of the clearing.
- *
- * The sync_clear_bit() call in xen_clear_irq_pending() is atomic.
- * So it is effectively a memory barrier for x86.
- */
if (READ_ONCE(*byte) != val)
return;
--
2.16.4
xen_qlock_wait() isn't safe for nested calls due to interrupts. A call
of xen_qlock_kick() might be ignored if a deeper nesting level was
active right before the call of xen_poll_irq():
CPU 1:                                  CPU 2:
spin_lock(lock1)
                                        spin_lock(lock1)
                                          -> xen_qlock_wait()
                                            -> xen_clear_irq_pending()
                                          Interrupt happens
spin_unlock(lock1)
  -> xen_qlock_kick(CPU 2)
spin_lock_irqsave(lock2)
                                        spin_lock_irqsave(lock2)
                                          -> xen_qlock_wait()
                                            -> xen_clear_irq_pending()
                                               clears kick for lock1
                                            -> xen_poll_irq()
spin_unlock_irqrestore(lock2)
  -> xen_qlock_kick(CPU 2)
                                        wakes up
                                        spin_unlock_irqrestore(lock2)
                                        IRET
                                          resumes in xen_qlock_wait()
                                            -> xen_poll_irq()
                                               never wakes up
The solution is to disable interrupts in xen_qlock_wait() and not to
poll for the irq in case xen_qlock_wait() is called in NMI context.
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Signed-off-by: Juergen Gross <[email protected]>
---
arch/x86/xen/spinlock.c | 24 ++++++++++--------------
1 file changed, 10 insertions(+), 14 deletions(-)
diff --git a/arch/x86/xen/spinlock.c b/arch/x86/xen/spinlock.c
index cd210a4ba7b1..e8d880e98057 100644
--- a/arch/x86/xen/spinlock.c
+++ b/arch/x86/xen/spinlock.c
@@ -39,29 +39,25 @@ static void xen_qlock_kick(int cpu)
*/
static void xen_qlock_wait(u8 *byte, u8 val)
{
+ unsigned long flags;
int irq = __this_cpu_read(lock_kicker_irq);
/* If kicker interrupts not initialized yet, just spin */
- if (irq == -1)
+ if (irq == -1 || in_nmi())
return;
- /* If irq pending already clear it and return. */
+ /* Guard against reentry. */
+ local_irq_save(flags);
+
+ /* If irq pending already clear it. */
if (xen_test_irq_pending(irq)) {
xen_clear_irq_pending(irq);
- return;
+ } else if (READ_ONCE(*byte) == val) {
+ /* Block until irq becomes pending (or a spurious wakeup) */
+ xen_poll_irq(irq);
}
- if (READ_ONCE(*byte) != val)
- return;
-
- /*
- * If an interrupt happens here, it will leave the wakeup irq
- * pending, which will cause xen_poll_irq() to return
- * immediately.
- */
-
- /* Block until irq becomes pending (or perhaps a spurious wakeup) */
- xen_poll_irq(irq);
+ local_irq_restore(flags);
}
static irqreturn_t dummy_handler(int irq, void *dev_id)
--
2.16.4
Correcting Waiman's mail address
On 01/10/2018 09:16, Juergen Gross wrote:
> [...]
Correcting Waiman's mail address
On 01/10/2018 09:16, Juergen Gross wrote:
> [...]
Correcting Waiman's mail address
On 01/10/2018 09:16, Juergen Gross wrote:
> [...]
>>> On 01.10.18 at 09:16, <[email protected]> wrote:
> In the following situation a vcpu waiting for a lock might not be
> woken up from xen_poll_irq():
>
> CPU 1:                CPU 2:                      CPU 3:
> takes a spinlock
>                       tries to get lock
>                         -> xen_qlock_wait()
>                           -> xen_clear_irq_pending()
Doesn't the last line above ...
> frees the lock
>   -> xen_qlock_kick(cpu2)
... need to be below here?
> takes lock again
>                                                   tries to get lock
>                                                     -> *lock = _Q_SLOW_VAL
>                         -> *lock == _Q_SLOW_VAL ?
>                           -> xen_poll_irq()
> frees the lock
>   -> xen_qlock_kick(cpu3)
>
> And cpu 2 will sleep forever.
>
> This can be avoided easily by modifying xen_qlock_wait() to call
> xen_poll_irq() only if the related irq was not pending and to call
> xen_clear_irq_pending() only if it was pending.
>
> Cc: [email protected]
> Cc: [email protected]
> Cc: [email protected]
> Signed-off-by: Juergen Gross <[email protected]>
For the patch itself:
Reviewed-by: Jan Beulich <[email protected]>
Jan
>>> On 01.10.18 at 09:16, <[email protected]> wrote:
> xen_qlock_wait() isn't safe for nested calls due to interrupts. A call
> of xen_qlock_kick() might be ignored in case a deeper nesting level
> was active right before the call of xen_poll_irq():
>
> CPU 1:                                  CPU 2:
> spin_lock(lock1)
>                                         spin_lock(lock1)
>                                           -> xen_qlock_wait()
>                                             -> xen_clear_irq_pending()
>                                           Interrupt happens
> spin_unlock(lock1)
>   -> xen_qlock_kick(CPU 2)
> spin_lock_irqsave(lock2)
>                                         spin_lock_irqsave(lock2)
>                                           -> xen_qlock_wait()
>                                             -> xen_clear_irq_pending()
>                                                clears kick for lock1
>                                             -> xen_poll_irq()
> spin_unlock_irqrestore(lock2)
>   -> xen_qlock_kick(CPU 2)
>                                         wakes up
>                                         spin_unlock_irqrestore(lock2)
>                                         IRET
>                                           resumes in xen_qlock_wait()
>                                             -> xen_poll_irq()
>                                                never wakes up
>
> The solution is to disable interrupts in xen_qlock_wait() and not to
> poll for the irq in case xen_qlock_wait() is called in NMI context.
Are precautions against NMI really worthwhile? Locks acquired both in
NMI context and outside of it are liable to deadlock anyway, aren't
they?
Jan
On Mon, 2018-10-01 at 09:16 +0200, Juergen Gross wrote:
> [...]
> Cc: [email protected]
> Cc: [email protected]
LGTM. Both these should be Cc:[email protected], yes?
Thanks.
On 09/10/2018 16:40, David Woodhouse wrote:
> On Mon, 2018-10-01 at 09:16 +0200, Juergen Gross wrote:
>> [...]
>
> LGTM. Both these should be Cc:[email protected], yes?
Yes, they are.
I have them already queued in the Xen tree. As the bug is rather
old I didn't want to rush the patches in so late in the rc phase of
4.19.
Juergen
On Mon, 2018-10-01 at 09:16 +0200, Juergen Gross wrote:
> - /* If irq pending already clear it and return. */
> + /* Guard against reentry. */
> + local_irq_save(flags);
> +
> + /* If irq pending already clear it. */
> if (xen_test_irq_pending(irq)) {
> xen_clear_irq_pending(irq);
> - return;
> + } else if (READ_ONCE(*byte) == val) {
> + /* Block until irq becomes pending (or a spurious wakeup) */
> + xen_poll_irq(irq);
> }
Does this still allow other IRQs to wake it from xen_poll_irq()?
In the case where process-context code is spinning for a lock without
disabling interrupts, we *should* allow interrupts to occur still...
does this?
On Wed, 10 Oct 2018, David Woodhouse wrote:
> On Mon, 2018-10-01 at 09:16 +0200, Juergen Gross wrote:
> > [...]
>
>
> Does this still allow other IRQs to wake it from xen_poll_irq()?
>
> In the case where process-context code is spinning for a lock without
> disabling interrupts, we *should* allow interrupts to occur still...
> does this?
Yes. Look at it like idle HLT or WFI. You have to disable interrupts
before checking the condition, and then the hardware, or in this case
the hypervisor, has to bring you back when an interrupt is raised.
If that didn't work, the check would be racy: the interrupt could hit
and be handled after the check but before going into HLT/WFI/the
hypercall, and then the CPU is out until the next interrupt comes
along, which might be never.
Thanks,
tglx
On Wed, 2018-10-10 at 14:30 +0200, Thomas Gleixner wrote:
> On Wed, 10 Oct 2018, David Woodhouse wrote:
>
> > On Mon, 2018-10-01 at 09:16 +0200, Juergen Gross wrote:
> > > [...]
> >
> >
> > Does this still allow other IRQs to wake it from xen_poll_irq()?
> >
> > In the case where process-context code is spinning for a lock without
> > disabling interrupts, we *should* allow interrupts to occur still...
> > does this?
>
> Yes. Look at it like idle HLT or WFI. You have to disable interrupts
> before checking the condition, and then the hardware, or in this case
> the hypervisor, has to bring you back when an interrupt is raised.
>
> If that didn't work, the check would be racy: the interrupt could hit
> and be handled after the check but before going into HLT/WFI/the
> hypercall, and then the CPU is out until the next interrupt comes
> along, which might be never.
Right, but in this case we're calling into the hypervisor to poll for
one *specific* IRQ. Everything you say is true for that specific IRQ.
My question is what happens to *other* IRQs. We want them, but are they
masked? I'm staring at the Xen do_poll() code and haven't quite worked
that out...
On Wed, 10 Oct 2018, David Woodhouse wrote:
> On Wed, 2018-10-10 at 14:30 +0200, Thomas Gleixner wrote:
> > On Wed, 10 Oct 2018, David Woodhouse wrote:
> >
> > > On Mon, 2018-10-01 at 09:16 +0200, Juergen Gross wrote:
> > > > [...]
> > >
> > >
> > > Does this still allow other IRQs to wake it from xen_poll_irq()?
> > >
> > > In the case where process-context code is spinning for a lock without
> > > disabling interrupts, we *should* allow interrupts to occur still...
> > > does this?
> >
>>> Yes. Look at it like idle HLT or WFI. You have to disable interrupts
>>> before checking the condition, and then the hardware, or in this case
>>> the hypervisor, has to bring you back when an interrupt is raised.
>>>
>>> If that didn't work, the check would be racy: the interrupt could hit
>>> and be handled after the check but before going into HLT/WFI/the
>>> hypercall, and then the CPU is out until the next interrupt comes
>>> along, which might be never.
>
> Right, but in this case we're calling into the hypervisor to poll for
> one *specific* IRQ. Everything you say is true for that specific IRQ.
>
> My question is what happens to *other* IRQs. We want them, but are they
> masked? I'm staring at the Xen do_poll() code and haven't quite worked
> that out...
Ah, sorry. That of course has to come back like HLT/WFI for any interrupt,
but I have no idea what the Xen HV is doing there.
Thanks,
tglx
On 10/10/2018 14:47, Thomas Gleixner wrote:
> On Wed, 10 Oct 2018, David Woodhouse wrote:
>> On Wed, 2018-10-10 at 14:30 +0200, Thomas Gleixner wrote:
>>> On Wed, 10 Oct 2018, David Woodhouse wrote:
>>>
>>>> On Mon, 2018-10-01 at 09:16 +0200, Juergen Gross wrote:
>>>>> [...]
>>>>
>>>>
>>>> Does this still allow other IRQs to wake it from xen_poll_irq()?
>>>>
>>>> In the case where process-context code is spinning for a lock without
>>>> disabling interrupts, we *should* allow interrupts to occur still...
>>>> does this?
>>>
>>> Yes. Look at it like idle HLT or WFI. You have to disable interrupt before
>>> checking the condition and then the hardware or in this case the hypervisor
>>> has to bring you back when an interrupt is raised.
>>>
>>> If that would not work then the check would be racy, because the interrupt
>>> could hit and be handled after the check and before going into
>>> HLT/WFI/hypercall and then the thing is out until the next interrupt comes
>>> along, which might be never.
>>
>> Right, but in this case we're calling into the hypervisor to poll for
>> one *specific* IRQ. Everything you say is true for that specific IRQ.
>>
>> My question is what happens to *other* IRQs. We want them, but are they
>> masked? I'm staring at the Xen do_poll() code and haven't quite worked
>> that out...
>
> Ah, sorry. That of course has to come back like HLT/WFI for any interrupt,
> but I have no idea what the Xen HV is doing there.
The Xen HV is doing it right. It is blocking the vcpu in do_poll() and
any interrupt will unblock it.
Juergen
> The Xen HV is doing it right. It is blocking the vcpu in do_poll() and
> any interrupt will unblock it.
Great. Thanks for the confirmation.
--
dwmw2