2022-04-05 01:21:14

by Evan Green

Subject: Lost MSIs during hibernate

Hi Thomas et al,
To my surprise, I'm back with another MSI problem, and hoping to get
some advice on how to approach fixing it.

Summary: I think MSIs are lost across the hibernate freeze/thaw
transition on the way down because __pci_write_msi_msg() drops the
write if the device is not in D0.

Details:
I've been playing with hibernation on an Alder Lake device, and have
been running into problems where the freeze/thaw transition that
should generate the hibernate image ends up hanging (i.e. before we
even start writing the image out to disk). When it hangs, I find it
stuck in usb_kill_urb(), on an error path taken after a failed attempt
to send a control packet to a hub port from usb_resume().
Occasionally, I see the classic "HC died; cleaning up" message
instead. XHCI in general appears to be very sensitive to lost MSIs, so
I started down that road.

I printed the three major paths through __pci_write_msi_msg() so I
could see what the XHCI controller was ending up with when it hung.
You can see a full boot and hibernate attempt sequence that results in
a hang here (sorry there's other cruft in there):

https://pastebin.com/PFd3x1k0

What worries me is those IRQ "no longer affine" messages, as well as
my "EVAN don't touch hw" prints, indicating that requests to change
the MSI are being dropped. These ignored requests are coming in when
we try to migrate all IRQs off of the non-boot CPU, and they get
ignored because all devices are "frozen" at this point, and presumably
not in D0.
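
For reference, here's roughly the guard that produces those drops,
paraphrased from __pci_write_msi_msg() in drivers/pci/msi.c (a sketch
with the details trimmed, not the verbatim kernel code; the first
branch is the one my "don't touch hw" print replaces):

```
void __pci_write_msi_msg(struct msi_desc *entry, struct msi_msg *msg)
{
	struct pci_dev *dev = msi_desc_to_pci_dev(entry);

	if (dev->current_state != PCI_D0 || pci_dev_is_disconnected(dev)) {
		/* Don't touch the hardware now -- the write is dropped */
	} else if (entry->msi_attrib.is_msix) {
		/* write address_lo/address_hi/data into the MSI-X table slot */
	} else {
		/* write the message into the MSI capability in config space */
	}
	/* Only the cached copy of the message is updated regardless */
	entry->msg = *msg;
}
```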

So my theory is that XHCI, for whatever reason, boots with its IRQ
affinitized to a non-boot CPU. We go through pci_pm_freeze(), then try
to take the non-boot CPUs down. The request to move the MSI off of the
dying CPU is ignored, and then XHCI generates an interrupt during the
window while that non-boot CPU is offline.

To further try and prove that theory, I wrote a script to do the
hibernate prepare image step in a loop, but messed with XHCI's IRQ
affinity beforehand. If I move the IRQ to core 0, so far I have never
seen a hang. But if I move it to another core, I can usually get a
hang in the first attempt. I also very occasionally see wifi splats
when trying this, and those "no longer affine" prints are all the wifi
queue IRQs. So I think a wifi packet coming in at the wrong time can
do the same thing.
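
In case it's useful, the loop was along these lines (a from-memory
sketch rather than my exact script: the IRQ number is a placeholder
for XHCI's, the CPU mask is arbitrary, and it leans on the pm_test
debug hook from CONFIG_PM_DEBUG to exercise the freeze/thaw transition
without actually writing an image):

```
#include <stdio.h>
#include <stdlib.h>

/* Write a string to a sysfs/procfs file, bailing out on failure. */
static void write_str(const char *path, const char *val)
{
	FILE *f = fopen(path, "w");

	if (!f || fputs(val, f) == EOF || fclose(f) == EOF) {
		perror(path);
		exit(1);
	}
}

int main(void)
{
	int i;

	/* Pin the XHCI IRQ to CPU2 (mask 0x4). IRQ 123 is a placeholder;
	 * the real number comes from /proc/interrupts. */
	write_str("/proc/irq/123/smp_affinity", "4");

	/* Stop each hibernate attempt after the test stage instead of
	 * writing the image out (requires CONFIG_PM_DEBUG). */
	write_str("/sys/power/pm_test", "core");

	for (i = 0; i < 20; i++)
		write_str("/sys/power/state", "disk");

	return 0;
}
```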

I wanted to see what thoughts you might have on this. Should I try to
make a patch that moves all IRQs to CPU 0 *before* the devices all
freeze? Sounds a little unpleasant. Or should PCI be doing something
different to avoid this combination of "you're not allowed to modify
my MSIs, but I might still generate interrupts that must not be lost"?

-Evan


2022-04-06 11:46:44

by Evan Green

Subject: Re: Lost MSIs during hibernate

On Tue, Apr 5, 2022 at 7:06 AM Thomas Gleixner <[email protected]> wrote:
>
> Evan!
>
> On Mon, Apr 04 2022 at 12:11, Evan Green wrote:
> > To my surprise, I'm back with another MSI problem, and hoping to get
> > some advice on how to approach fixing it.
>
> Why am I not surprised?

I swear I don't intersect with this stuff that often. Maybe the
reality is I always intersect with this stuff; it just usually works
perfectly :)

>
> > What worries me is those IRQ "no longer affine" messages, as well as
> > my "EVAN don't touch hw" prints, indicating that requests to change
> > the MSI are being dropped. These ignored requests are coming in when
> > we try to migrate all IRQs off of the non-boot CPU, and they get
> > ignored because all devices are "frozen" at this point, and presumably
> > not in D0.
>
> They are disabled at that point.
>
> > To further try and prove that theory, I wrote a script to do the
> > hibernate prepare image step in a loop, but messed with XHCI's IRQ
> > affinity beforehand. If I move the IRQ to core 0, so far I have never
> > seen a hang. But if I move it to another core, I can usually get a
> > hang in the first attempt. I also very occasionally see wifi splats
> > when trying this, and those "no longer affine" prints are all the wifi
> > queue IRQs. So I think a wifi packet coming in at the wrong time can
> > do the same thing.
> >
> > I wanted to see what thoughts you might have on this. Should I try to
> > make a patch that moves all IRQs to CPU 0 *before* the devices all
> > freeze? Sounds a little unpleasant. Or should PCI be doing something
> > different to avoid this combination of "you're not allowed to modify
> > my MSIs, but I might still generate interrupts that must not be lost"?
>
> PCI cannot do much here and moving interrupts around is papering over
> the underlying problem.
>
> xhci_hcd 0000:00:0d.0: EVAN Write MSI 0 fee1e000 4023
>
> This sets up the interrupt when the driver is loaded
>
> xhci_hcd 0000:00:14.0: EVAN Write MSI 0 fee01000 4024
>
> Ditto
>
> xhci_hcd 0000:00:0d.0: calling pci_pm_freeze+0x0/0xad @ 423, parent: pci0000:00
> xhci_hcd 0000:00:14.0: calling pci_pm_freeze+0x0/0xad @ 4644, parent: pci0000:00
> xhci_hcd 0000:00:14.0: pci_pm_freeze+0x0/0xad returned 0 after 0 usecs
> xhci_hcd 0000:00:0d.0: EVAN Write MSI 0 fee1e000 4023
> xhci_hcd 0000:00:0d.0: pci_pm_freeze+0x0/0xad returned 0 after 196000 usecs
>
> Those freeze() calls end up in xhci_suspend(), which tears down the XHCI
> and ensures that no interrupts are in flight.

Your hint here about xhci_suspend() was helpful. It turns out it is
not actually called in the freeze path; usb_hcd_pci_pm_ops just calls
check_root_hub_suspended(). The documentation in devices.rst is pretty
clear about this:

```
The ``->freeze`` methods should quiesce the device so that it doesn't
generate IRQs or DMA
```

So I think you're right that the PM layer is doing everything
correctly (though with a bit of a footgun: if you mess up and generate
an interrupt after freeze, it may just be gone forever), and that USB
core is at fault here. I've been testing with this patch (mangled in
email), and so far the issue seems to be gone:

@@ -614,10 +622,10 @@ const struct dev_pm_ops usb_hcd_pci_pm_ops = {
.suspend_noirq = hcd_pci_suspend_noirq,
.resume_noirq = hcd_pci_resume_noirq,
.resume = hcd_pci_resume,
- .freeze = check_root_hub_suspended,
- .freeze_noirq = check_root_hub_suspended,
- .thaw_noirq = NULL,
- .thaw = NULL,
+ .freeze = hcd_pci_suspend,
+ .freeze_noirq = hcd_pci_suspend_noirq,
+ .thaw_noirq = hcd_pci_resume_noirq,
+ .thaw = hcd_pci_resume,
.poweroff = hcd_pci_suspend,
.poweroff_noirq = hcd_pci_suspend_noirq,
.restore_noirq = hcd_pci_resume_noirq,


As an aside, one might wonder "why don't we see this everywhere
then?". I think that's because Intel missed a patch enabling runtime
PM on one of these XHCI controllers (8086:51ed). See the quirks below:
the missing 2 in the 00:14.0 value is XHCI_DEFAULT_PM_RUNTIME_ALLOW:
# dmesg | grep quirks
[    2.804073] xhci_hcd 0000:00:0d.0: hcc params 0x20007fc1 hci version 0x120 quirks 0x0000000200009810
[    3.108045] xhci_hcd 0000:00:14.0: hcc params 0x20007fc1 hci version 0x120 quirks 0x0000000000009810

If the XHCI controller were usually in runtime suspend when freeze()
got called, it would already be fully quiesced and would not lose its
interrupt. I had noticed this missing quirk earlier, and enabling it
did reduce the repro rate, but not to zero. Now it makes sense why.
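
For completeness, the missing runtime PM enable presumably wants a
one-liner in xhci_pci_quirks(), something like the below (a
hypothetical sketch mirroring how the existing Intel device IDs get
the flag; I haven't actually tested this hunk):

```
/* Hypothetical addition to xhci_pci_quirks() in drivers/usb/host/xhci-pci.c:
 * let the 8086:51ed controller default to runtime PM like its siblings. */
if (pdev->vendor == PCI_VENDOR_ID_INTEL && pdev->device == 0x51ed)
	xhci->quirks |= XHCI_DEFAULT_PM_RUNTIME_ALLOW;
```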

I think I have enough info to go make a USB patch now. Thank you for your help!
-Evan

>
> xhci_hcd 0000:00:0d.0: calling pci_pm_freeze_noirq+0x0/0xb2 @ 4645, parent: pci0000:00
> xhci_hcd 0000:00:0d.0: pci_pm_freeze_noirq+0x0/0xb2 returned 0 after 30 usecs
> xhci_hcd 0000:00:14.0: calling pci_pm_freeze_noirq+0x0/0xb2 @ 4644, parent: pci0000:00
> xhci_hcd 0000:00:14.0: pci_pm_freeze_noirq+0x0/0xb2 returned 0 after 3118 usecs
>
> Now the devices are disabled and not accessible
>
> xhci_hcd 0000:00:14.0: EVAN Don't touch hw 0 fee00000 4024
> xhci_hcd 0000:00:0d.0: EVAN Don't touch hw 0 fee1e000 4045
> xhci_hcd 0000:00:0d.0: EVAN Don't touch hw 0 fee00000 4045
> xhci_hcd 0000:00:14.0: calling pci_pm_thaw_noirq+0x0/0x70 @ 9, parent: pci0000:00
> xhci_hcd 0000:00:14.0: EVAN Write MSI 0 fee00000 4024
>
> This is the early restore _before_ the XHCI resume code is called.
> This interrupt is targeted at CPU0 (it's the one which could not be
> written above).
>
> xhci_hcd 0000:00:14.0: pci_pm_thaw_noirq+0x0/0x70 returned 0 after 5272 usecs
> xhci_hcd 0000:00:0d.0: calling pci_pm_thaw_noirq+0x0/0x70 @ 1123, parent: pci0000:00
> xhci_hcd 0000:00:0d.0: EVAN Write MSI 0 fee00000 4045
>
> Ditto
>
> xhci_hcd 0000:00:0d.0: pci_pm_thaw_noirq+0x0/0x70 returned 0 after 623 usecs
> xhci_hcd 0000:00:14.0: calling pci_pm_thaw+0x0/0x7c @ 3856, parent: pci0000:00
> xhci_hcd 0000:00:14.0: pci_pm_thaw+0x0/0x7c returned 0 after 0 usecs
> xhci_hcd 0000:00:0d.0: calling pci_pm_thaw+0x0/0x7c @ 4664, parent: pci0000:00
> xhci_hcd 0000:00:0d.0: pci_pm_thaw+0x0/0x7c returned 0 after 0 usecs
>
> That means the suspend/resume logic is doing the right thing.
>
> How the XHCI ends up being confused here is a mystery. Cc'ed a few more folks.
>
> Thanks,
>
> tglx
>
>
