2019-11-23 22:52:55

by Woody Suwalski

Subject: kernel 5.2+: suspend freeze in VMware Player.

Rafael, Thomas, this is the same VMware Player 15.2 freeze on suspend issue
I have been discussing with you in August.

It has surfaced after Thomas Gleixner's change in kernel 5.2
dfe0cf8b  x86/ioapic: Implement irq_get_irqchip_state() callback

It is still with us in 5.4, 100% repeatable on a second suspend after a
reboot.

I have traced it down to the ioapic_irq_get_chip_state() function, where
rentry.irr is stuck high.

On the first suspend I can see that for IRQ9 the test exits with irr=0,
trigger=1, but on second and consecutive suspends it is returning
irr=1 trigger=1, so *state=1, and this results in a never-ending loop
in __synchronize_hardirq(), because inprogress is always 1.
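
The condition can be modelled in user space to make the two cases concrete.
This is only a sketch under assumptions: struct rte_bits and
model_chip_active() are hypothetical names, not the kernel's own code, and the
only behavior taken from the report above is that the state is reported as
active when both irr and trigger are set.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the two redirection-entry bits discussed here. */
struct rte_bits {
	unsigned int irr;     /* remote IRR bit as read back from the entry */
	unsigned int trigger; /* 1 = level, 0 = edge */
};

/*
 * Report "active" if any pin has both irr and trigger set, which is the
 * condition described above as producing *state = 1.
 */
static bool model_chip_active(const struct rte_bits *pins, size_t count)
{
	assert(pins != NULL || count == 0);

	for (size_t i = 0; i < count; i++) {
		if (pins[i].irr && pins[i].trigger)
			return true;
	}
	return false;
}
```

With the values seen for IRQ9, {irr=0, trigger=1} gives false on the first
suspend and {irr=1, trigger=1} gives true afterwards, so a caller that loops
until the result is false cannot exit while irr stays set.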

I have been using a "fix" that times out in __synchronize_hardirq() after
64 iterations, and that seems to work OK (no side effects noticed),
but of course it does not address the underlying problem.
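
The idea of such a safety exit can be sketched in user space as follows. This
is not the actual patch: wait_idle_bounded(), the callback type and the two
example callbacks are hypothetical, and only the cap of 64 iterations comes
from the description above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define SYNC_MAX_ITERATIONS 64 /* the cap mentioned above */

/* Hypothetical callback standing in for the "is it still in progress?" query. */
typedef bool (*inprogress_fn)(void *ctx);

/*
 * Poll until the callback reports idle or the cap is reached.
 * Returns true if idle was observed, false if the loop gave up.
 * The number of polls is stored in *iterations when it is non-NULL.
 */
static bool wait_idle_bounded(inprogress_fn query, void *ctx,
			      unsigned int *iterations)
{
	unsigned int n = 0;
	bool inprogress;

	assert(query != NULL);

	do {
		inprogress = query(ctx);
		n++;
	} while (inprogress && n < SYNC_MAX_ITERATIONS);

	if (iterations)
		*iterations = n;
	return !inprogress;
}

/* Example: a bit that never clears, as in the stuck irr case. */
static bool stuck_high(void *ctx)
{
	(void)ctx;
	return true;
}

/* Example: a bit that reads as set for *ctx polls and then clears. */
static bool clears_after(void *ctx)
{
	int *left = ctx;

	return (*left)-- > 0;
}
```

With stuck_high() the loop gives up after 64 polls and returns false; with
clears_after() and a counter of 3 it returns true on the fourth poll. Returning
a status lets the caller decide whether giving up is acceptable, which the
unbounded loop cannot express.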

And the problem may be somewhere in VMware emulation code, returning bad
data?

Would you have ideas as to what should be the right setting for
IRQ9 in VM environment?  Edge or level?
And which part of code is reading the "hardware" state from VMware?

OTOH, current implementation is not really safe, as the wait loop should
have a timeout, or else it may get stuck. Should I provide my safety-exit patch?

Thanks, Woody


2019-11-25 18:23:22

by Rafael J. Wysocki

Subject: Re: kernel 5.2+: suspend freeze in VMware Player.

On Saturday, November 23, 2019 11:51:19 PM CET Woody Suwalski wrote:
> Rafael, Thomas, this is the same VMware Player 15.2 freeze on suspend issue
> I have been discussing with you in August.
>
> It has surfaced after Thomas Gleixner's change in kernel 5.2
> dfe0cf8b x86/ioapic: Implement irq_get_irqchip_state() callback
>
> It is still with us in 5.4, 100% repeatable on a second suspend after a
> reboot.
>
> I have traced it down to the ioapic_irq_get_chip_state() function, where
> rentry.irr is stuck high.
>
> On the first suspend I can see that for IRQ9 the test exits with irr=0,
> trigger=1, but on second and consecutive suspends it is returning
> irr=1 trigger=1, so *state=1, and this results in a never-ending loop
> in __synchronize_hardirq(), because inprogress is always 1.
>
> I have been using a "fix" that times out in __synchronize_hardirq() after
> 64 iterations, and that seems to work OK (no side effects noticed),
> but of course it does not address the underlying problem.
>
> And the problem may be somewhere in VMware emulation code, returning bad
> data?
>
> Would you have ideas as to what should be the right setting for
> IRQ9 in VM environment? Edge or level?
> And which part of code is reading the "hardware" state from VMware?
>
> OTOH, current implementation is not really safe, as the wait loop should

It is not clear to me what exactly you mean by "current implementation" here.

> have a timeout, or else it may get stuck. Should I provide my safety-exit patch?

Thanks!



2019-11-26 02:54:26

by Woody Suwalski

Subject: Re: kernel 5.2+: suspend freeze in VMware Player.

Rafael J. Wysocki wrote:
> On Saturday, November 23, 2019 11:51:19 PM CET Woody Suwalski wrote:
>> Rafael, Thomas, this is the same VMware Player 15.2 freeze on suspend issue
>> I have been discussing with you in August.
>>
>> It has surfaced after Thomas Gleixner's change in kernel 5.2
>> dfe0cf8b x86/ioapic: Implement irq_get_irqchip_state() callback
>>
>> It is still with us in 5.4, 100% repeatable on a second suspend after a
>> reboot.
>>
>> I have traced it down to the ioapic_irq_get_chip_state() function, where
>> rentry.irr is stuck high.
>>
>> On the first suspend I can see that for IRQ9 the test exits with irr=0,
>> trigger=1, but on second and consecutive suspends it is returning
>> irr=1 trigger=1, so *state=1, and this results in a never-ending loop
>> in __synchronize_hardirq(), because inprogress is always 1.
>>
>> I have been using a "fix" that times out in __synchronize_hardirq() after
>> 64 iterations, and that seems to work OK (no side effects noticed),
>> but of course it does not address the underlying problem.
>>
>> And the problem may be somewhere in VMware emulation code, returning bad
>> data?
>>
>> Would you have ideas as to what should be the right setting for
>> IRQ9 in VM environment? Edge or level?
>> And which part of code is reading the "hardware" state from VMware?
>>
>> OTOH, current implementation is not really safe, as the wait loop should
> It is not clear to me what exactly you mean by "current implementation" here.

Sorry, by "implementation" I meant the source code of the potentially
never-ending loop, where suspend may be blocked indefinitely by a flaky
hardware bit. The result is a frozen VM (see kernel/irq/manage.c, line 73,
in version 5.4).

>> have a timeout, or else it may get stuck. Should I provide my safety-exit patch?
> Thanks!