2019-10-04 12:41:12

by Mika Westerberg

[permalink] [raw]
Subject: [PATCH v2 1/2] PCI: Introduce pcie_wait_for_link_delay()

This is otherwise similar to pcie_wait_for_link() but allows passing
custom activation delay in milliseconds.

Signed-off-by: Mika Westerberg <[email protected]>
---
drivers/pci/pci.c | 21 ++++++++++++++++++---
1 file changed, 18 insertions(+), 3 deletions(-)

diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
index e7982af9a5d8..bfd92e018925 100644
--- a/drivers/pci/pci.c
+++ b/drivers/pci/pci.c
@@ -4607,14 +4607,17 @@ static int pci_pm_reset(struct pci_dev *dev, int probe)

return pci_dev_wait(dev, "PM D3->D0", PCIE_RESET_READY_POLL_MS);
}
+
/**
- * pcie_wait_for_link - Wait until link is active or inactive
+ * pcie_wait_for_link_delay - Wait until link is active or inactive
* @pdev: Bridge device
* @active: waiting for active or inactive?
+ * @delay: Delay to wait after link has become active (in ms)
*
* Use this to wait till link becomes active or inactive.
*/
-bool pcie_wait_for_link(struct pci_dev *pdev, bool active)
+static bool pcie_wait_for_link_delay(struct pci_dev *pdev, bool active,
+ int delay)
{
int timeout = 1000;
bool ret;
@@ -4651,13 +4654,25 @@ bool pcie_wait_for_link(struct pci_dev *pdev, bool active)
timeout -= 10;
}
if (active && ret)
- msleep(100);
+ msleep(delay);
else if (ret != active)
pci_info(pdev, "Data Link Layer Link Active not %s in 1000 msec\n",
active ? "set" : "cleared");
return ret == active;
}

+/**
+ * pcie_wait_for_link - Wait until link is active or inactive
+ * @pdev: Bridge device
+ * @active: waiting for active or inactive?
+ *
+ * Use this to wait till link becomes active or inactive.
+ */
+bool pcie_wait_for_link(struct pci_dev *pdev, bool active)
+{
+ return pcie_wait_for_link_delay(pdev, active, 100);
+}
+
void pci_reset_secondary_bus(struct pci_dev *dev)
{
u16 ctrl;
--
2.23.0


2020-08-08 20:55:04

by Marc MERLIN

[permalink] [raw]
Subject: Re: [PATCH v2 1/2] PCI: Introduce pcie_wait_for_link_delay()

On Fri, Oct 04, 2019 at 03:39:46PM +0300, Mika Westerberg wrote:
> This is otherwise similar to pcie_wait_for_link() but allows passing
> custom activation delay in milliseconds.
>
> Signed-off-by: Mika Westerberg <[email protected]>
> ---
> drivers/pci/pci.c | 21 ++++++++++++++++++---
> 1 file changed, 18 insertions(+), 3 deletions(-)
>
> diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
> index e7982af9a5d8..bfd92e018925 100644

Hi Mika,

So, I have a thinkpad P73 with thunderbolt, and while I don't boot
often, my last boots have been unreliable at best (was only able to boot
5.7 once, and 5.8 did not succeed either).

5.6 was working for a while, but couldn't boot it either this morning,
so I had to go back to 5.5. This does not mean 5.5 does not have the
problem, just that it booted this morning, while 5.6 didn't when I
tried.
Once the kernel is booted, the problem does not seem to occur much, or
at all.

Basically, I'm getting the same thing than this person with a P53 (which
is a mostly identical lenovo thinkpad, to mine)
kernel: pcieport 0000:00:01.0: PME: Spurious native interrupt!
kernel: pcieport 0000:00:01.0: PME: Spurious native interrupt!
kernel: pcieport 0000:00:01.0: PME: Spurious native interrupt!
kernel: pcieport 0000:00:01.0: PME: Spurious native interrupt!
kernel: pcieport 0000:00:01.0: PME: Spurious native interrupt!
https://bbs.archlinux.org/viewtopic.php?id=250658

The kernel boots eventually, but it takes minutes, and everything is so
super slow, that I just can't reasonably use the machine.

This shows similar issues with 5.3, 5.4.
https://forum.proxmox.com/threads/pme-spurious-native-interrupt-kernel-meldungen.62850/

Another report here with 5.6:
https://bugzilla.redhat.com/show_bug.cgi?id=1831899

My current kernel is running your patch above, and I haven't done a lot
of research yet to confirm whether going back to a kernel before it was
merged, fixes the problem. Unfortunately the problem is not consistent,
so it makes things harder to test/debug, especially on my main laptop
that I do all my work on :)

I noticed this older patch of yours:
http://patchwork.ozlabs.org/project/linux-pci/patch/0113014581dbe2d1f938813f1783905bd81b79db.1560079442.git.lukas@wunner.de/
This patch is not in my kernel, is it worth adding?

Can I get you more info to help debug this?

If that helps:
sauron:/usr/src/linux-5.7.11-amd64-preempt-sysrq-20190816/drivers/pci# lspci
00:00.0 Host bridge: Intel Corporation Device 3e20 (rev 0d)
00:01.0 PCI bridge: Intel Corporation Xeon E3-1200 v5/E3-1500 v5/6th Gen Core Processor PCIe Controller (x16) (rev 0d)
00:02.0 VGA compatible controller: Intel Corporation UHD Graphics 630 (Mobile) (rev 02)
00:04.0 Signal processing controller: Intel Corporation Xeon E3-1200 v5/E3-1500 v5/6th Gen Core Processor Thermal Subsystem (rev 0d)
00:08.0 System peripheral: Intel Corporation Xeon E3-1200 v5/v6 / E3-1500 v5 / 6th/7th/8th Gen Core Processor Gaussian Mixture Model
00:12.0 Signal processing controller: Intel Corporation Cannon Lake PCH Thermal Controller (rev 10)
00:14.0 USB controller: Intel Corporation Cannon Lake PCH USB 3.1 xHCI Host Controller (rev 10)
00:14.2 RAM memory: Intel Corporation Cannon Lake PCH Shared SRAM (rev 10)
00:15.0 Serial bus controller [0c80]: Intel Corporation Cannon Lake PCH Serial IO I2C Controller #0 (rev 10)
00:15.1 Serial bus controller [0c80]: Intel Corporation Cannon Lake PCH Serial IO I2C Controller #1 (rev 10)
00:16.0 Communication controller: Intel Corporation Cannon Lake PCH HECI Controller (rev 10)
00:17.0 SATA controller: Intel Corporation Cannon Lake Mobile PCH SATA AHCI Controller (rev 10)
00:1b.0 PCI bridge: Intel Corporation Cannon Lake PCH PCI Express Root Port #17 (rev f0)
00:1c.0 PCI bridge: Intel Corporation Cannon Lake PCH PCI Express Root Port #1 (rev f0)
00:1c.5 PCI bridge: Intel Corporation Cannon Lake PCH PCI Express Root Port #6 (rev f0)
00:1c.7 PCI bridge: Intel Corporation Cannon Lake PCH PCI Express Root Port #8 (rev f0)
00:1e.0 Communication controller: Intel Corporation Cannon Lake PCH Serial IO UART Host Controller (rev 10)
00:1f.0 ISA bridge: Intel Corporation Cannon Lake LPC Controller (rev 10)
00:1f.3 Audio device: Intel Corporation Cannon Lake PCH cAVS (rev 10)
00:1f.4 SMBus: Intel Corporation Cannon Lake PCH SMBus Controller (rev 10)
00:1f.5 Serial bus controller [0c80]: Intel Corporation Cannon Lake PCH SPI Controller (rev 10)
00:1f.6 Ethernet controller: Intel Corporation Ethernet Connection (7) I219-LM (rev 10)
01:00.0 VGA compatible controller: NVIDIA Corporation TU104GLM [Quadro RTX 4000 Mobile / Max-Q] (rev a1)
01:00.1 Audio device: NVIDIA Corporation TU104 HD Audio Controller (rev a1)
01:00.2 USB controller: NVIDIA Corporation TU104 USB 3.1 Host Controller (rev a1)
01:00.3 Serial bus controller [0c80]: NVIDIA Corporation TU104 USB Type-C UCSI Controller (rev a1)
02:00.0 Non-Volatile memory controller: Samsung Electronics Co Ltd NVMe SSD Controller SM981/PM981/PM983
04:00.0 PCI bridge: Intel Corporation JHL7540 Thunderbolt 3 Bridge [Titan Ridge 4C 2018] (rev 06)
05:00.0 PCI bridge: Intel Corporation JHL7540 Thunderbolt 3 Bridge [Titan Ridge 4C 2018] (rev 06)
05:01.0 PCI bridge: Intel Corporation JHL7540 Thunderbolt 3 Bridge [Titan Ridge 4C 2018] (rev 06)
05:02.0 PCI bridge: Intel Corporation JHL7540 Thunderbolt 3 Bridge [Titan Ridge 4C 2018] (rev 06)
05:04.0 PCI bridge: Intel Corporation JHL7540 Thunderbolt 3 Bridge [Titan Ridge 4C 2018] (rev 06)
06:00.0 System peripheral: Intel Corporation JHL7540 Thunderbolt 3 NHI [Titan Ridge 4C 2018] (rev 06)
2c:00.0 USB controller: Intel Corporation JHL7540 Thunderbolt 3 USB Controller [Titan Ridge 4C 2018] (rev 06)
52:00.0 Network controller: Intel Corporation Wi-Fi 6 AX200 (rev 1a)
54:00.0 Unassigned class [ff00]: Realtek Semiconductor Co., Ltd. RTS525A PCI Express Card Reader (rev 01)


sauron:/usr/src/linux-5.7.11-amd64-preempt-sysrq-20190816/drivers/pci# lsusb -t
/: Bus 06.Port 1: Dev 1, Class=root_hub, Driver=xhci_hcd/2p, 10000M
/: Bus 05.Port 1: Dev 1, Class=root_hub, Driver=xhci_hcd/2p, 480M
/: Bus 04.Port 1: Dev 1, Class=root_hub, Driver=xhci_hcd/4p, 10000M
/: Bus 03.Port 1: Dev 1, Class=root_hub, Driver=xhci_hcd/2p, 480M
/: Bus 02.Port 1: Dev 1, Class=root_hub, Driver=xhci_hcd/10p, 10000M
/: Bus 01.Port 1: Dev 1, Class=root_hub, Driver=xhci_hcd/16p, 480M
|__ Port 2: Dev 2, If 0, Class=Human Interface Device, Driver=usbhid, 12M
|__ Port 8: Dev 3, If 3, Class=Video, Driver=uvcvideo, 480M
|__ Port 8: Dev 3, If 1, Class=Video, Driver=uvcvideo, 480M
|__ Port 8: Dev 3, If 2, Class=Video, Driver=uvcvideo, 480M
|__ Port 8: Dev 3, If 0, Class=Video, Driver=uvcvideo, 480M
|__ Port 9: Dev 4, If 0, Class=Vendor Specific Class, Driver=, 12M
|__ Port 14: Dev 6, If 0, Class=Wireless, Driver=btusb, 12M
|__ Port 14: Dev 6, If 1, Class=Wireless, Driver=btusb, 12M

Thanks,
Marc
--
"A mouse is a device used to point at the xterm you want to type in" - A.S.R.

Home page: http://marc.merlins.org/ | PGP 7F55D5F27AAF9D08

2020-08-08 20:55:55

by Marc MERLIN

[permalink] [raw]
Subject: Re: [PATCH v2 1/2] PCI: Introduce pcie_wait_for_link_delay()

I forgot to add that my mostly hanging boots look like this:
https://photos.app.goo.gl/HJvTraYYZbiNTNE39

Marc
--
"A mouse is a device used to point at the xterm you want to type in" - A.S.R.

Home page: http://marc.merlins.org/

2020-08-09 16:35:55

by Marc MERLIN

[permalink] [raw]
Subject: Re: [PATCH v2 1/2] PCI: Introduce pcie_wait_for_link_delay()

On Sat, Aug 08, 2020 at 01:22:02PM -0700, Marc MERLIN wrote:
> Basically, I'm getting the same thing than this person with a P53 (which
> is a mostly identical lenovo thinkpad, to mine)
> kernel: pcieport 0000:00:01.0: PME: Spurious native interrupt!
> kernel: pcieport 0000:00:01.0: PME: Spurious native interrupt!
> kernel: pcieport 0000:00:01.0: PME: Spurious native interrupt!
> kernel: pcieport 0000:00:01.0: PME: Spurious native interrupt!
> kernel: pcieport 0000:00:01.0: PME: Spurious native interrupt!
> https://bbs.archlinux.org/viewtopic.php?id=250658

I had to reboot today and tried my 5.7.11 kernel 6 times.
It never booted and each time got stuck on
pcieport 0000:00:01.0: PME: Spurious native interrupt!

This is the nvidia driver and claimed by nouveau (I don't use nvidia graphics
but I'm forced to use nouveau to turn the nvidia chip down so that it
doesn't drain my batteries).

I ended up being able to boot the 7th time after removing the yubikey in my USB-C
port, which is also thunderbolt.
PME messages shown below. Let me know if you'd like further data.

Thanks,
Marc

[ 4.142484] acpi PNP0A08:00: _OSC: OS now controls [PCIeHotplug PME PCIeCapability LTR DPC]
[ 4.151715] pci 0000:00:01.0: PME# supported from D0 D3hot D3cold
[ 4.151727] pci 0000:00:01.0: PME# disabled
[ 4.165979] pci 0000:00:14.0: PME# supported from D3hot D3cold
[ 4.166000] pci 0000:00:14.0: PME# disabled
[ 4.177746] pci 0000:00:16.0: PME# supported from D3hot
[ 4.177767] pci 0000:00:16.0: PME# disabled
[ 4.180850] pci 0000:00:17.0: PME# supported from D3hot
[ 4.180862] pci 0000:00:17.0: PME# disabled
[ 4.183830] pci 0000:00:1b.0: PME# supported from D0 D3hot D3cold
[ 4.183847] pci 0000:00:1b.0: PME# disabled
[ 4.189643] pci 0000:00:1c.0: PME# supported from D0 D3hot D3cold
[ 4.189660] pci 0000:00:1c.0: PME# disabled
[ 4.193085] pci 0000:00:1c.5: PME# supported from D0 D3hot D3cold
[ 4.193101] pci 0000:00:1c.5: PME# disabled
[ 4.196462] pci 0000:00:1c.7: PME# supported from D0 D3hot D3cold
[ 4.196478] pci 0000:00:1c.7: PME# disabled
[ 4.206057] pci 0000:00:1f.3: PME# supported from D3hot D3cold
[ 4.206079] pci 0000:00:1f.3: PME# disabled
[ 4.214993] pci 0000:00:1f.6: PME# supported from D0 D3hot D3cold
[ 4.215015] pci 0000:00:1f.6: PME# disabled
[ 4.217978] pci 0000:01:00.0: PME# supported from D0 D3hot
[ 4.217991] pci 0000:01:00.0: PME# disabled
[ 4.219129] pci 0000:01:00.2: PME# supported from D0 D3hot
[ 4.219142] pci 0000:01:00.2: PME# disabled
[ 4.219578] pci 0000:01:00.3: PME# supported from D0 D3hot
[ 4.219591] pci 0000:01:00.3: PME# disabled
[ 4.221398] pci 0000:04:00.0: PME# supported from D0 D1 D2 D3hot D3cold
[ 4.221433] pci 0000:04:00.0: PME# disabled
[ 4.222282] pci 0000:05:00.0: PME# supported from D0 D1 D2 D3hot D3cold
[ 4.222297] pci 0000:05:00.0: PME# disabled
[ 4.222792] pci 0000:05:01.0: PME# supported from D0 D1 D2 D3hot D3cold
[ 4.222806] pci 0000:05:01.0: PME# disabled
[ 4.223289] pci 0000:05:02.0: PME# supported from D0 D1 D2 D3hot D3cold
[ 4.223304] pci 0000:05:02.0: PME# disabled
[ 4.223839] pci 0000:05:04.0: PME# supported from D0 D1 D2 D3hot D3cold
[ 4.223854] pci 0000:05:04.0: PME# disabled
[ 4.224645] pci 0000:06:00.0: PME# supported from D0 D1 D2 D3hot D3cold
[ 4.224661] pci 0000:06:00.0: PME# disabled
[ 4.225644] pci 0000:2c:00.0: PME# supported from D0 D1 D2 D3hot D3cold
[ 4.225661] pci 0000:2c:00.0: PME# disabled
[ 4.227557] pci 0000:52:00.0: PME# supported from D0 D3hot D3cold
[ 4.227708] pci 0000:52:00.0: PME# disabled
[ 4.229139] pci 0000:54:00.0: PME# supported from D1 D2 D3hot D3cold
[ 4.229155] pci 0000:54:00.0: PME# disabled
[ 7.238126] pcieport 0000:00:01.0: PME: Signaling with IRQ 122
[ 7.239208] pcieport 0000:00:1b.0: PME: Signaling with IRQ 123
[ 7.239861] pcieport 0000:00:1c.0: PME: Signaling with IRQ 124
[ 7.241522] pcieport 0000:00:1c.5: PME: Signaling with IRQ 125
[ 7.242499] pcieport 0000:00:1c.7: PME: Signaling with IRQ 126
[ 7.401422] pcieport 0000:05:01.0: PME# enabled
[ 7.401868] pcieport 0000:05:04.0: PME# enabled
[ 8.985668] xhci_hcd 0000:01:00.2: PME# enabled
[ 8.988738] xhci_hcd 0000:2c:00.0: PME# enabled
[ 9.008649] pcieport 0000:05:02.0: PME# enabled
[ 12.378450] nvidia-gpu 0000:01:00.3: PME# enabled
[ 25.610848] thunderbolt 0000:06:00.0: PME# enabled
[ 25.628766] pcieport 0000:05:00.0: PME# enabled
[ 25.648762] pcieport 0000:04:00.0: PME# enabled
[ 25.668889] pcieport 0000:00:1c.0: PME# enabled
[ 179.608847] nvidia-gpu 0000:01:00.3: PME# disabled
[ 179.608873] pcieport 0000:00:01.0: PME: Spurious native interrupt!
[ 183.359454] nvidia-gpu 0000:01:00.3: PME# enabled
[ 183.396832] nvidia-gpu 0000:01:00.3: PME# disabled
[ 183.396859] pcieport 0000:00:01.0: PME: Spurious native interrupt!
[ 187.147398] nvidia-gpu 0000:01:00.3: PME# enabled
[ 187.184826] nvidia-gpu 0000:01:00.3: PME# disabled
[ 187.184852] pcieport 0000:00:01.0: PME: Spurious native interrupt!
[ 190.932329] nvidia-gpu 0000:01:00.3: PME# enabled
[ 190.972359] nvidia-gpu 0000:01:00.3: PME# disabled
[ 190.972366] pcieport 0000:00:01.0: PME: Spurious native interrupt!
[ 192.351330] snd_hda_intel 0000:00:1f.3: PME# enabled
[ 192.468365] snd_hda_intel 0000:00:1f.3: PME# disabled
[ 192.736342] xhci_hcd 0000:01:00.2: PME# disabled
[ 194.296431] pcieport 0000:00:1c.0: PME# disabled
[ 194.432427] pcieport 0000:04:00.0: PME# disabled
[ 194.432591] pcieport 0000:05:01.0: PME# disabled
[ 194.432771] pcieport 0000:05:00.0: PME# disabled
[ 194.432943] pcieport 0000:05:02.0: PME# disabled
[ 194.433102] pcieport 0000:05:04.0: PME# disabled
[ 194.556299] pcieport 0000:05:01.0: PME# enabled
[ 194.556417] pcieport 0000:05:04.0: PME# enabled
[ 195.560440] thunderbolt 0000:06:00.0: PME# disabled


--
"A mouse is a device used to point at the xterm you want to type in" - A.S.R.

Home page: http://marc.merlins.org/

2020-09-06 18:57:39

by Marc MERLIN

[permalink] [raw]
Subject: Re: pcieport 0000:00:01.0: PME: Spurious native interrupt (nvidia with nouveau and thunderbolt on thinkpad P73)

Ok, I have an update to this problem. I added the nouveau list because
I can't quite tell if the issue is:
- the PCIe changes that went in 5.6 I think (or 5.5?), referenced below

- a new issue with thunderbold on thinkpad P73, that seems to be
triggered if I have a USB-C yubikey in the port. With 5.7, my issues
went away if I removed the USB key during boot, showing an interaction
between nouveau and thunderbolt

- changes in the nouveau driver. Mika told me the PCIe regression
"pcieport 0000:00:01.0: PME: Spurious native interrupt!" is supposed
to be fixed in 5.8, but I still get a 4mn hang or so during boot and
with 5.8, removing the USB key, didn't help make the boot faster

I don't otherwise use the nvidia chip I so wish I didn't have, I only
use intel graphics on that laptop, but I must apparently use the nouveau
driver to manage the nouveau chip so that it's turned off and not
burning 60W doing nothing.

lspci is in the quoted message below, I won't copy it here again, but
here's the nvidia bit:
01:00.0 VGA compatible controller: NVIDIA Corporation TU104GLM [Quadro RTX 4000 Mobile / Max-Q] (rev a1)
01:00.1 Audio device: NVIDIA Corporation TU104 HD Audio Controller (rev a1)
01:00.2 USB controller: NVIDIA Corporation TU104 USB 3.1 Host Controller (rev a1)
01:00.3 Serial bus controller [0c80]: NVIDIA Corporation TU104 USB Type-C UCSI Controller (rev a1)

Here are 5 boots, 4 on 5.8.5:

dmesg.1_hang_but_no_warning.txt https://pastebin.com/Y5NaH08n
Boot hung for quite a while, but no clear output

dmesg.2_pme_spurious.txt https://pastebin.com/dX19aCpj
[ 8.185808] nvidia-gpu 0000:01:00.3: runtime IRQ mapping not provided by arch
[ 8.185989] nvidia-gpu 0000:01:00.3: enabling device (0000 -> 0002)
[ 8.188986] nvidia-gpu 0000:01:00.3: enabling bus mastering
[ 11.936507] nvidia-gpu 0000:01:00.3: PME# enabled
[ 11.975985] nvidia-gpu 0000:01:00.3: PME# disabled
[ 11.976011] pcieport 0000:00:01.0: PME: Spurious native interrupt!

dmesg.3_usb_key_yanked.txt https://pastebin.com/m7QLnCZt
I yanked the USB key during boot, that seemed to help unlock things with
5.7, but did not with 5.8. It's hung on a loop of:
[ 11.262854] nvidia-gpu 0000:01:00.3: saving config space at offset 0x0 (reading 0x1ad910de)
[ 11.262863] nvidia-gpu 0000:01:00.3: saving config space at offset 0x4 (reading 0x100406)
[ 11.262869] nvidia-gpu 0000:01:00.3: saving config space at offset 0x8 (reading 0xc8000a1)
[ 11.262874] nvidia-gpu 0000:01:00.3: saving config space at offset 0xc (reading 0x800000)
[ 11.262880] nvidia-gpu 0000:01:00.3: saving config space at offset 0x10 (reading 0xce054000)
[ 11.262885] nvidia-gpu 0000:01:00.3: saving config space at offset 0x14 (reading 0x0)
[ 11.262890] nvidia-gpu 0000:01:00.3: saving config space at offset 0x18 (reading 0x0)
[ 11.262895] nvidia-gpu 0000:01:00.3: saving config space at offset 0x1c (reading 0x0)
[ 11.262900] nvidia-gpu 0000:01:00.3: saving config space at offset 0x20 (reading 0x0)
[ 11.262906] nvidia-gpu 0000:01:00.3: saving config space at offset 0x24 (reading 0x0)
[ 11.262911] nvidia-gpu 0000:01:00.3: saving config space at offset 0x28 (reading 0x0)
[ 11.262916] nvidia-gpu 0000:01:00.3: saving config space at offset 0x2c (reading 0x229b17aa)
[ 11.262921] nvidia-gpu 0000:01:00.3: saving config space at offset 0x30 (reading 0x0)
[ 11.262926] nvidia-gpu 0000:01:00.3: saving config space at offset 0x34 (reading 0x68)
[ 11.262931] nvidia-gpu 0000:01:00.3: saving config space at offset 0x38 (reading 0x0)
[ 11.262937] nvidia-gpu 0000:01:00.3: saving config space at offset 0x3c (reading 0x4ff)
[ 11.262985] nvidia-gpu 0000:01:00.3: PME# enabled
[ 11.303060] nvidia-gpu 0000:01:00.3: PME# disabled

dmesg.4_5.5_boot_fine.txt https://pastebin.com/WXgQTUYP
reference boot with 4.5, it works fine, no issues

dmesg.5_no_key_still_hang.txt https://pastebin.com/kcT8Ras0
unfortunately, booting without the USB-C key in thunderbolt, did not
allow this boot to be faster, it looks different though:
[ 6.723454] pcieport 0000:00:01.0: runtime IRQ mapping not provided by arch
[ 6.723598] pcieport 0000:00:01.0: PME: Signaling with IRQ 122
[ 6.724011] pcieport 0000:00:01.0: saving config space at offset 0x0 (reading 0x19018086)
[ 6.724016] pcieport 0000:00:01.0: saving config space at offset 0x4 (reading 0x100407)
[ 6.724021] pcieport 0000:00:01.0: saving config space at offset 0x8 (reading 0x604000d)
[ 6.724025] pcieport 0000:00:01.0: saving config space at offset 0xc (reading 0x810000)
[ 6.724029] pcieport 0000:00:01.0: saving config space at offset 0x10 (reading 0x0)
[ 6.724033] pcieport 0000:00:01.0: saving config space at offset 0x14 (reading 0x0)
[ 6.724037] pcieport 0000:00:01.0: saving config space at offset 0x18 (reading 0x10100)
[ 6.724041] pcieport 0000:00:01.0: saving config space at offset 0x1c (reading 0x20002020)
[ 6.724046] pcieport 0000:00:01.0: saving config space at offset 0x20 (reading 0xce00cd00)
[ 6.724050] pcieport 0000:00:01.0: saving config space at offset 0x24 (reading 0xb1f1a001)
[ 6.724054] pcieport 0000:00:01.0: saving config space at offset 0x28 (reading 0x0)
[ 6.724058] pcieport 0000:00:01.0: saving config space at offset 0x2c (reading 0x0)
[ 6.724062] pcieport 0000:00:01.0: saving config space at offset 0x30 (reading 0x0)
[ 6.724066] pcieport 0000:00:01.0: saving config space at offset 0x34 (reading 0x88)
[ 6.724070] pcieport 0000:00:01.0: saving config space at offset 0x38 (reading 0x0)
[ 6.724074] pcieport 0000:00:01.0: saving config space at offset 0x3c (reading 0x201ff)
[ 6.724129] pcieport 0000:00:1b.0: runtime IRQ mapping not provided by arch
[ 6.724650] pcieport 0000:00:1b.0: PME: Signaling with IRQ 123
[ 6.725021] pcieport 0000:00:1b.0: saving config space at offset 0x0 (reading 0xa3408086)
[ 6.725026] pcieport 0000:00:1b.0: saving config space at offset 0x4 (reading 0x100407)
[ 6.725031] pcieport 0000:00:1b.0: saving config space at offset 0x8 (reading 0x60400f0)
[ 6.725035] pcieport 0000:00:1b.0: saving config space at offset 0xc (reading 0x810000)
[ 6.725040] pcieport 0000:00:1b.0: saving config space at offset 0x10 (reading 0x0)
[ 6.725044] pcieport 0000:00:1b.0: saving config space at offset 0x14 (reading 0x0)
[ 6.725049] pcieport 0000:00:1b.0: saving config space at offset 0x18 (reading 0x20200)
[ 6.725053] pcieport 0000:00:1b.0: saving config space at offset 0x1c (reading 0x200000f0)
[ 6.725058] pcieport 0000:00:1b.0: saving config space at offset 0x20 (reading 0xce30ce30)
[ 6.725062] pcieport 0000:00:1b.0: saving config space at offset 0x24 (reading 0x1fff1)
[ 6.725067] pcieport 0000:00:1b.0: saving config space at offset 0x28 (reading 0x0)
[ 6.725071] pcieport 0000:00:1b.0: saving config space at offset 0x2c (reading 0x0)
[ 6.725075] pcieport 0000:00:1b.0: saving config space at offset 0x30 (reading 0x0)
[ 6.725080] pcieport 0000:00:1b.0: saving config space at offset 0x34 (reading 0x40)
[ 6.725084] pcieport 0000:00:1b.0: saving config space at offset 0x38 (reading 0x0)
[ 6.725089] pcieport 0000:00:1b.0: saving config space at offset 0x3c (reading 0x201ff)
[ 6.725154] pcieport 0000:00:1c.0: runtime IRQ mapping not provided by arch
[ 6.725284] pcieport 0000:00:1c.0: PME: Signaling with IRQ 124
[ 6.725580] pcieport 0000:00:1c.0: pciehp: Slot #0 AttnBtn- PwrCtrl- MRL- AttnInd- PwrInd- HotPlug+ Surprise+ Interlock- NoCompl+ IbPresDis- LLActRep+
[ 6.726086] pci_bus 0000:04: dev 00, created physical slot 0

Any idea what's going on?

Thanks,
Marc

On Sat, Aug 08, 2020 at 01:22:02PM -0700, Marc MERLIN wrote:
> On Fri, Oct 04, 2019 at 03:39:46PM +0300, Mika Westerberg wrote:
> > This is otherwise similar to pcie_wait_for_link() but allows passing
> > custom activation delay in milliseconds.
> >
> > Signed-off-by: Mika Westerberg <[email protected]>
> > ---
> > drivers/pci/pci.c | 21 ++++++++++++++++++---
> > 1 file changed, 18 insertions(+), 3 deletions(-)
> >
> > diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
> > index e7982af9a5d8..bfd92e018925 100644
>
> Hi Mika,
>
> So, I have a thinkpad P73 with thunderbolt, and while I don't boot
> often, my last boots have been unreliable at best (was only able to boot
> 5.7 once, and 5.8 did not succeed either).
>
> 5.6 was working for a while, but couldn't boot it either this morning,
> so I had to go back to 5.5. This does not mean 5.5 does not have the
> problem, just that it booted this morning, while 5.6 didn't when I
> tried.
> Once the kernel is booted, the problem does not seem to occur much, or
> at all.
>
> Basically, I'm getting the same thing than this person with a P53 (which
> is a mostly identical lenovo thinkpad, to mine)
> kernel: pcieport 0000:00:01.0: PME: Spurious native interrupt!
> kernel: pcieport 0000:00:01.0: PME: Spurious native interrupt!
> kernel: pcieport 0000:00:01.0: PME: Spurious native interrupt!
> kernel: pcieport 0000:00:01.0: PME: Spurious native interrupt!
> kernel: pcieport 0000:00:01.0: PME: Spurious native interrupt!
> https://bbs.archlinux.org/viewtopic.php?id=250658
>
> The kernel boots eventually, but it takes minutes, and everything is so
> super slow, that I just can't reasonably use the machine.
>
> This shows similar issues with 5.3, 5.4.
> https://forum.proxmox.com/threads/pme-spurious-native-interrupt-kernel-meldungen.62850/
>
> Another report here with 5.6:
> https://bugzilla.redhat.com/show_bug.cgi?id=1831899
>
> My current kernel is running your patch above, and I haven't done a lot
> of research yet to confirm whether going back to a kernel before it was
> merged, fixes the problem. Unfortunately the problem is not consistent,
> so it makes things harder to test/debug, especially on my main laptop
> that I do all my work on :)
>
> I noticed this older patch of yours:
> http://patchwork.ozlabs.org/project/linux-pci/patch/0113014581dbe2d1f938813f1783905bd81b79db.1560079442.git.lukas@wunner.de/
> This patch is not in my kernel, is it worth adding?
>
> Can I get you more info to help debug this?
>
> If that helps:
> sauron:/usr/src/linux-5.7.11-amd64-preempt-sysrq-20190816/drivers/pci# lspci
> 00:00.0 Host bridge: Intel Corporation Device 3e20 (rev 0d)
> 00:01.0 PCI bridge: Intel Corporation Xeon E3-1200 v5/E3-1500 v5/6th Gen Core Processor PCIe Controller (x16) (rev 0d)
> 00:02.0 VGA compatible controller: Intel Corporation UHD Graphics 630 (Mobile) (rev 02)
> 00:04.0 Signal processing controller: Intel Corporation Xeon E3-1200 v5/E3-1500 v5/6th Gen Core Processor Thermal Subsystem (rev 0d)
> 00:08.0 System peripheral: Intel Corporation Xeon E3-1200 v5/v6 / E3-1500 v5 / 6th/7th/8th Gen Core Processor Gaussian Mixture Model
> 00:12.0 Signal processing controller: Intel Corporation Cannon Lake PCH Thermal Controller (rev 10)
> 00:14.0 USB controller: Intel Corporation Cannon Lake PCH USB 3.1 xHCI Host Controller (rev 10)
> 00:14.2 RAM memory: Intel Corporation Cannon Lake PCH Shared SRAM (rev 10)
> 00:15.0 Serial bus controller [0c80]: Intel Corporation Cannon Lake PCH Serial IO I2C Controller #0 (rev 10)
> 00:15.1 Serial bus controller [0c80]: Intel Corporation Cannon Lake PCH Serial IO I2C Controller #1 (rev 10)
> 00:16.0 Communication controller: Intel Corporation Cannon Lake PCH HECI Controller (rev 10)
> 00:17.0 SATA controller: Intel Corporation Cannon Lake Mobile PCH SATA AHCI Controller (rev 10)
> 00:1b.0 PCI bridge: Intel Corporation Cannon Lake PCH PCI Express Root Port #17 (rev f0)
> 00:1c.0 PCI bridge: Intel Corporation Cannon Lake PCH PCI Express Root Port #1 (rev f0)
> 00:1c.5 PCI bridge: Intel Corporation Cannon Lake PCH PCI Express Root Port #6 (rev f0)
> 00:1c.7 PCI bridge: Intel Corporation Cannon Lake PCH PCI Express Root Port #8 (rev f0)
> 00:1e.0 Communication controller: Intel Corporation Cannon Lake PCH Serial IO UART Host Controller (rev 10)
> 00:1f.0 ISA bridge: Intel Corporation Cannon Lake LPC Controller (rev 10)
> 00:1f.3 Audio device: Intel Corporation Cannon Lake PCH cAVS (rev 10)
> 00:1f.4 SMBus: Intel Corporation Cannon Lake PCH SMBus Controller (rev 10)
> 00:1f.5 Serial bus controller [0c80]: Intel Corporation Cannon Lake PCH SPI Controller (rev 10)
> 00:1f.6 Ethernet controller: Intel Corporation Ethernet Connection (7) I219-LM (rev 10)
> 01:00.0 VGA compatible controller: NVIDIA Corporation TU104GLM [Quadro RTX 4000 Mobile / Max-Q] (rev a1)
> 01:00.1 Audio device: NVIDIA Corporation TU104 HD Audio Controller (rev a1)
> 01:00.2 USB controller: NVIDIA Corporation TU104 USB 3.1 Host Controller (rev a1)
> 01:00.3 Serial bus controller [0c80]: NVIDIA Corporation TU104 USB Type-C UCSI Controller (rev a1)
> 02:00.0 Non-Volatile memory controller: Samsung Electronics Co Ltd NVMe SSD Controller SM981/PM981/PM983
> 04:00.0 PCI bridge: Intel Corporation JHL7540 Thunderbolt 3 Bridge [Titan Ridge 4C 2018] (rev 06)
> 05:00.0 PCI bridge: Intel Corporation JHL7540 Thunderbolt 3 Bridge [Titan Ridge 4C 2018] (rev 06)
> 05:01.0 PCI bridge: Intel Corporation JHL7540 Thunderbolt 3 Bridge [Titan Ridge 4C 2018] (rev 06)
> 05:02.0 PCI bridge: Intel Corporation JHL7540 Thunderbolt 3 Bridge [Titan Ridge 4C 2018] (rev 06)
> 05:04.0 PCI bridge: Intel Corporation JHL7540 Thunderbolt 3 Bridge [Titan Ridge 4C 2018] (rev 06)
> 06:00.0 System peripheral: Intel Corporation JHL7540 Thunderbolt 3 NHI [Titan Ridge 4C 2018] (rev 06)
> 2c:00.0 USB controller: Intel Corporation JHL7540 Thunderbolt 3 USB Controller [Titan Ridge 4C 2018] (rev 06)
> 52:00.0 Network controller: Intel Corporation Wi-Fi 6 AX200 (rev 1a)
> 54:00.0 Unassigned class [ff00]: Realtek Semiconductor Co., Ltd. RTS525A PCI Express Card Reader (rev 01)
>
>
> sauron:/usr/src/linux-5.7.11-amd64-preempt-sysrq-20190816/drivers/pci# lsusb -t
> /: Bus 06.Port 1: Dev 1, Class=root_hub, Driver=xhci_hcd/2p, 10000M
> /: Bus 05.Port 1: Dev 1, Class=root_hub, Driver=xhci_hcd/2p, 480M
> /: Bus 04.Port 1: Dev 1, Class=root_hub, Driver=xhci_hcd/4p, 10000M
> /: Bus 03.Port 1: Dev 1, Class=root_hub, Driver=xhci_hcd/2p, 480M
> /: Bus 02.Port 1: Dev 1, Class=root_hub, Driver=xhci_hcd/10p, 10000M
> /: Bus 01.Port 1: Dev 1, Class=root_hub, Driver=xhci_hcd/16p, 480M
> |__ Port 2: Dev 2, If 0, Class=Human Interface Device, Driver=usbhid, 12M
> |__ Port 8: Dev 3, If 3, Class=Video, Driver=uvcvideo, 480M
> |__ Port 8: Dev 3, If 1, Class=Video, Driver=uvcvideo, 480M
> |__ Port 8: Dev 3, If 2, Class=Video, Driver=uvcvideo, 480M
> |__ Port 8: Dev 3, If 0, Class=Video, Driver=uvcvideo, 480M
> |__ Port 9: Dev 4, If 0, Class=Vendor Specific Class, Driver=, 12M
> |__ Port 14: Dev 6, If 0, Class=Wireless, Driver=btusb, 12M
> |__ Port 14: Dev 6, If 1, Class=Wireless, Driver=btusb, 12M
>
> Thanks,
> Marc
> --
> "A mouse is a device used to point at the xterm you want to type in" - A.S.R.
>
> Home page: http://marc.merlins.org/ | PGP 7F55D5F27AAF9D08

--
"A mouse is a device used to point at the xterm you want to type in" - A.S.R.

Home page: http://marc.merlins.org/ | PGP 7F55D5F27AAF9D08

2020-09-07 19:25:49

by Karol Herbst

[permalink] [raw]
Subject: Re: [Nouveau] pcieport 0000:00:01.0: PME: Spurious native interrupt (nvidia with nouveau and thunderbolt on thinkpad P73)

On Sun, Sep 6, 2020 at 8:52 PM Marc MERLIN <[email protected]> wrote:
>
> Ok, I have an update to this problem. I added the nouveau list because
> I can't quite tell if the issue is:
> - the PCIe changes that went in 5.6 I think (or 5.5?), referenced below
>
> - a new issue with thunderbold on thinkpad P73, that seems to be
> triggered if I have a USB-C yubikey in the port. With 5.7, my issues
> went away if I removed the USB key during boot, showing an interaction
> between nouveau and thunderbolt
>
> - changes in the nouveau driver. Mika told me the PCIe regression
> "pcieport 0000:00:01.0: PME: Spurious native interrupt!" is supposed
> to be fixed in 5.8, but I still get a 4mn hang or so during boot and
> with 5.8, removing the USB key, didn't help make the boot faster
>

that's the root port the GPU is attached to, no? I saw that message on
the Thinkpad P1G2 when runtime resuming the Nvidia GPU, but it does
seem to come from the root port.

> I don't otherwise use the nvidia chip I so wish I didn't have, I only
> use intel graphics on that laptop, but I must apparently use the nouveau
> driver to manage the nouveau chip so that it's turned off and not
> burning 60W doing nothing.
>

Well, you'd also need it when attaching external displays.

> lspci is in the quoted message below, I won't copy it here again, but
> here's the nvidia bit:
> 01:00.0 VGA compatible controller: NVIDIA Corporation TU104GLM [Quadro RTX 4000 Mobile / Max-Q] (rev a1)
> 01:00.1 Audio device: NVIDIA Corporation TU104 HD Audio Controller (rev a1)
> 01:00.2 USB controller: NVIDIA Corporation TU104 USB 3.1 Host Controller (rev a1)
> 01:00.3 Serial bus controller [0c80]: NVIDIA Corporation TU104 USB Type-C UCSI Controller (rev a1)
>
> Here are 5 boots, 4 on 5.8.5:
>
> dmesg.1_hang_but_no_warning.txt https://pastebin.com/Y5NaH08n
> Boot hung for quite a while, but no clear output
>
> dmesg.2_pme_spurious.txt https://pastebin.com/dX19aCpj
> [ 8.185808] nvidia-gpu 0000:01:00.3: runtime IRQ mapping not provided by arch
> [ 8.185989] nvidia-gpu 0000:01:00.3: enabling device (0000 -> 0002)
> [ 8.188986] nvidia-gpu 0000:01:00.3: enabling bus mastering
> [ 11.936507] nvidia-gpu 0000:01:00.3: PME# enabled
> [ 11.975985] nvidia-gpu 0000:01:00.3: PME# disabled
> [ 11.976011] pcieport 0000:00:01.0: PME: Spurious native interrupt!
>
> dmesg.3_usb_key_yanked.txt https://pastebin.com/m7QLnCZt
> I yanked the USB key during boot, that seemed to help unlock things with
> 5.7, but did not with 5.8. It's hung on a loop of:
> [ 11.262854] nvidia-gpu 0000:01:00.3: saving config space at offset 0x0 (reading 0x1ad910de)
> [ 11.262863] nvidia-gpu 0000:01:00.3: saving config space at offset 0x4 (reading 0x100406)
> [ 11.262869] nvidia-gpu 0000:01:00.3: saving config space at offset 0x8 (reading 0xc8000a1)
> [ 11.262874] nvidia-gpu 0000:01:00.3: saving config space at offset 0xc (reading 0x800000)
> [ 11.262880] nvidia-gpu 0000:01:00.3: saving config space at offset 0x10 (reading 0xce054000)
> [ 11.262885] nvidia-gpu 0000:01:00.3: saving config space at offset 0x14 (reading 0x0)
> [ 11.262890] nvidia-gpu 0000:01:00.3: saving config space at offset 0x18 (reading 0x0)
> [ 11.262895] nvidia-gpu 0000:01:00.3: saving config space at offset 0x1c (reading 0x0)
> [ 11.262900] nvidia-gpu 0000:01:00.3: saving config space at offset 0x20 (reading 0x0)
> [ 11.262906] nvidia-gpu 0000:01:00.3: saving config space at offset 0x24 (reading 0x0)
> [ 11.262911] nvidia-gpu 0000:01:00.3: saving config space at offset 0x28 (reading 0x0)
> [ 11.262916] nvidia-gpu 0000:01:00.3: saving config space at offset 0x2c (reading 0x229b17aa)
> [ 11.262921] nvidia-gpu 0000:01:00.3: saving config space at offset 0x30 (reading 0x0)
> [ 11.262926] nvidia-gpu 0000:01:00.3: saving config space at offset 0x34 (reading 0x68)
> [ 11.262931] nvidia-gpu 0000:01:00.3: saving config space at offset 0x38 (reading 0x0)
> [ 11.262937] nvidia-gpu 0000:01:00.3: saving config space at offset 0x3c (reading 0x4ff)
> [ 11.262985] nvidia-gpu 0000:01:00.3: PME# enabled
> [ 11.303060] nvidia-gpu 0000:01:00.3: PME# disabled
>

mhh, interesting. I heard some random comments that the Nvidia
USB-C/UCSI driver is a bit broken and can cause various issues. Mind
blacklisting i2c-nvidia-gpu and typec_nvidia (and verify they don't
get loaded) and see if that helps?

> dmesg.4_5.5_boot_fine.txt https://pastebin.com/WXgQTUYP
> reference boot with 4.5, it works fine, no issues
>
> dmesg.5_no_key_still_hang.txt https://pastebin.com/kcT8Ras0
> unfortunately, booting without the USB-C key in thunderbolt, did not
> allow this boot to be faster, it looks different though:
> [ 6.723454] pcieport 0000:00:01.0: runtime IRQ mapping not provided by arch
> [ 6.723598] pcieport 0000:00:01.0: PME: Signaling with IRQ 122
> [ 6.724011] pcieport 0000:00:01.0: saving config space at offset 0x0 (reading 0x19018086)
> [ 6.724016] pcieport 0000:00:01.0: saving config space at offset 0x4 (reading 0x100407)
> [ 6.724021] pcieport 0000:00:01.0: saving config space at offset 0x8 (reading 0x604000d)
> [ 6.724025] pcieport 0000:00:01.0: saving config space at offset 0xc (reading 0x810000)
> [ 6.724029] pcieport 0000:00:01.0: saving config space at offset 0x10 (reading 0x0)
> [ 6.724033] pcieport 0000:00:01.0: saving config space at offset 0x14 (reading 0x0)
> [ 6.724037] pcieport 0000:00:01.0: saving config space at offset 0x18 (reading 0x10100)
> [ 6.724041] pcieport 0000:00:01.0: saving config space at offset 0x1c (reading 0x20002020)
> [ 6.724046] pcieport 0000:00:01.0: saving config space at offset 0x20 (reading 0xce00cd00)
> [ 6.724050] pcieport 0000:00:01.0: saving config space at offset 0x24 (reading 0xb1f1a001)
> [ 6.724054] pcieport 0000:00:01.0: saving config space at offset 0x28 (reading 0x0)
> [ 6.724058] pcieport 0000:00:01.0: saving config space at offset 0x2c (reading 0x0)
> [ 6.724062] pcieport 0000:00:01.0: saving config space at offset 0x30 (reading 0x0)
> [ 6.724066] pcieport 0000:00:01.0: saving config space at offset 0x34 (reading 0x88)
> [ 6.724070] pcieport 0000:00:01.0: saving config space at offset 0x38 (reading 0x0)
> [ 6.724074] pcieport 0000:00:01.0: saving config space at offset 0x3c (reading 0x201ff)
> [ 6.724129] pcieport 0000:00:1b.0: runtime IRQ mapping not provided by arch
> [ 6.724650] pcieport 0000:00:1b.0: PME: Signaling with IRQ 123
> [ 6.725021] pcieport 0000:00:1b.0: saving config space at offset 0x0 (reading 0xa3408086)
> [ 6.725026] pcieport 0000:00:1b.0: saving config space at offset 0x4 (reading 0x100407)
> [ 6.725031] pcieport 0000:00:1b.0: saving config space at offset 0x8 (reading 0x60400f0)
> [ 6.725035] pcieport 0000:00:1b.0: saving config space at offset 0xc (reading 0x810000)
> [ 6.725040] pcieport 0000:00:1b.0: saving config space at offset 0x10 (reading 0x0)
> [ 6.725044] pcieport 0000:00:1b.0: saving config space at offset 0x14 (reading 0x0)
> [ 6.725049] pcieport 0000:00:1b.0: saving config space at offset 0x18 (reading 0x20200)
> [ 6.725053] pcieport 0000:00:1b.0: saving config space at offset 0x1c (reading 0x200000f0)
> [ 6.725058] pcieport 0000:00:1b.0: saving config space at offset 0x20 (reading 0xce30ce30)
> [ 6.725062] pcieport 0000:00:1b.0: saving config space at offset 0x24 (reading 0x1fff1)
> [ 6.725067] pcieport 0000:00:1b.0: saving config space at offset 0x28 (reading 0x0)
> [ 6.725071] pcieport 0000:00:1b.0: saving config space at offset 0x2c (reading 0x0)
> [ 6.725075] pcieport 0000:00:1b.0: saving config space at offset 0x30 (reading 0x0)
> [ 6.725080] pcieport 0000:00:1b.0: saving config space at offset 0x34 (reading 0x40)
> [ 6.725084] pcieport 0000:00:1b.0: saving config space at offset 0x38 (reading 0x0)
> [ 6.725089] pcieport 0000:00:1b.0: saving config space at offset 0x3c (reading 0x201ff)
> [ 6.725154] pcieport 0000:00:1c.0: runtime IRQ mapping not provided by arch
> [ 6.725284] pcieport 0000:00:1c.0: PME: Signaling with IRQ 124
> [ 6.725580] pcieport 0000:00:1c.0: pciehp: Slot #0 AttnBtn- PwrCtrl- MRL- AttnInd- PwrInd- HotPlug+ Surprise+ Interlock- NoCompl+ IbPresDis- LLActRep+
> [ 6.726086] pci_bus 0000:04: dev 00, created physical slot 0
>
> Any idea what's going on?
>
> Thanks,
> Marc
>
> On Sat, Aug 08, 2020 at 01:22:02PM -0700, Marc MERLIN wrote:
> > On Fri, Oct 04, 2019 at 03:39:46PM +0300, Mika Westerberg wrote:
> > > This is otherwise similar to pcie_wait_for_link() but allows passing
> > > custom activation delay in milliseconds.
> > >
> > > Signed-off-by: Mika Westerberg <[email protected]>
> > > ---
> > > drivers/pci/pci.c | 21 ++++++++++++++++++---
> > > 1 file changed, 18 insertions(+), 3 deletions(-)
> > >
> > > diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
> > > index e7982af9a5d8..bfd92e018925 100644
> >
> > Hi Mika,
> >
> > So, I have a thinkpad P73 with thunderbolt, and while I don't boot
> > often, my last boots have been unreliable at best (was only able to boot
> > 5.7 once, and 5.8 did not succeed either).
> >
> > 5.6 was working for a while, but couldn't boot it either this morning,
> > so I had to go back to 5.5. This does not mean 5.5 does not have the
> > problem, just that it booted this morning, while 5.6 didn't when I
> > tried.
> > Once the kernel is booted, the problem does not seem to occur much, or
> > at all.
> >
> > Basically, I'm getting the same thing than this person with a P53 (which
> > is a mostly identical lenovo thinkpad, to mine)
> > kernel: pcieport 0000:00:01.0: PME: Spurious native interrupt!
> > kernel: pcieport 0000:00:01.0: PME: Spurious native interrupt!
> > kernel: pcieport 0000:00:01.0: PME: Spurious native interrupt!
> > kernel: pcieport 0000:00:01.0: PME: Spurious native interrupt!
> > kernel: pcieport 0000:00:01.0: PME: Spurious native interrupt!
> > https://bbs.archlinux.org/viewtopic.php?id=250658
> >
> > The kernel boots eventually, but it takes minutes, and everything is so
> > super slow, that I just can't reasonably use the machine.
> >
> > This shows similar issues with 5.3, 5.4.
> > https://forum.proxmox.com/threads/pme-spurious-native-interrupt-kernel-meldungen.62850/
> >
> > Another report here with 5.6:
> > https://bugzilla.redhat.com/show_bug.cgi?id=1831899
> >
> > My current kernel is running your patch above, and I haven't done a lot
> > of research yet to confirm whether going back to a kernel before it was
> > merged, fixes the problem. Unfortunately the problem is not consistent,
> > so it makes things harder to test/debug, especially on my main laptop
> > that I do all my work on :)
> >
> > I noticed this older patch of yours:
> > http://patchwork.ozlabs.org/project/linux-pci/patch/0113014581dbe2d1f938813f1783905bd81b79db.1560079442.git.lukas@wunner.de/
> > This patch is not in my kernel, is it worth adding?
> >
> > Can I get you more info to help debug this?
> >
> > If that helps:
> > sauron:/usr/src/linux-5.7.11-amd64-preempt-sysrq-20190816/drivers/pci# lspci
> > 00:00.0 Host bridge: Intel Corporation Device 3e20 (rev 0d)
> > 00:01.0 PCI bridge: Intel Corporation Xeon E3-1200 v5/E3-1500 v5/6th Gen Core Processor PCIe Controller (x16) (rev 0d)
> > 00:02.0 VGA compatible controller: Intel Corporation UHD Graphics 630 (Mobile) (rev 02)
> > 00:04.0 Signal processing controller: Intel Corporation Xeon E3-1200 v5/E3-1500 v5/6th Gen Core Processor Thermal Subsystem (rev 0d)
> > 00:08.0 System peripheral: Intel Corporation Xeon E3-1200 v5/v6 / E3-1500 v5 / 6th/7th/8th Gen Core Processor Gaussian Mixture Model
> > 00:12.0 Signal processing controller: Intel Corporation Cannon Lake PCH Thermal Controller (rev 10)
> > 00:14.0 USB controller: Intel Corporation Cannon Lake PCH USB 3.1 xHCI Host Controller (rev 10)
> > 00:14.2 RAM memory: Intel Corporation Cannon Lake PCH Shared SRAM (rev 10)
> > 00:15.0 Serial bus controller [0c80]: Intel Corporation Cannon Lake PCH Serial IO I2C Controller #0 (rev 10)
> > 00:15.1 Serial bus controller [0c80]: Intel Corporation Cannon Lake PCH Serial IO I2C Controller #1 (rev 10)
> > 00:16.0 Communication controller: Intel Corporation Cannon Lake PCH HECI Controller (rev 10)
> > 00:17.0 SATA controller: Intel Corporation Cannon Lake Mobile PCH SATA AHCI Controller (rev 10)
> > 00:1b.0 PCI bridge: Intel Corporation Cannon Lake PCH PCI Express Root Port #17 (rev f0)
> > 00:1c.0 PCI bridge: Intel Corporation Cannon Lake PCH PCI Express Root Port #1 (rev f0)
> > 00:1c.5 PCI bridge: Intel Corporation Cannon Lake PCH PCI Express Root Port #6 (rev f0)
> > 00:1c.7 PCI bridge: Intel Corporation Cannon Lake PCH PCI Express Root Port #8 (rev f0)
> > 00:1e.0 Communication controller: Intel Corporation Cannon Lake PCH Serial IO UART Host Controller (rev 10)
> > 00:1f.0 ISA bridge: Intel Corporation Cannon Lake LPC Controller (rev 10)
> > 00:1f.3 Audio device: Intel Corporation Cannon Lake PCH cAVS (rev 10)
> > 00:1f.4 SMBus: Intel Corporation Cannon Lake PCH SMBus Controller (rev 10)
> > 00:1f.5 Serial bus controller [0c80]: Intel Corporation Cannon Lake PCH SPI Controller (rev 10)
> > 00:1f.6 Ethernet controller: Intel Corporation Ethernet Connection (7) I219-LM (rev 10)
> > 01:00.0 VGA compatible controller: NVIDIA Corporation TU104GLM [Quadro RTX 4000 Mobile / Max-Q] (rev a1)
> > 01:00.1 Audio device: NVIDIA Corporation TU104 HD Audio Controller (rev a1)
> > 01:00.2 USB controller: NVIDIA Corporation TU104 USB 3.1 Host Controller (rev a1)
> > 01:00.3 Serial bus controller [0c80]: NVIDIA Corporation TU104 USB Type-C UCSI Controller (rev a1)
> > 02:00.0 Non-Volatile memory controller: Samsung Electronics Co Ltd NVMe SSD Controller SM981/PM981/PM983
> > 04:00.0 PCI bridge: Intel Corporation JHL7540 Thunderbolt 3 Bridge [Titan Ridge 4C 2018] (rev 06)
> > 05:00.0 PCI bridge: Intel Corporation JHL7540 Thunderbolt 3 Bridge [Titan Ridge 4C 2018] (rev 06)
> > 05:01.0 PCI bridge: Intel Corporation JHL7540 Thunderbolt 3 Bridge [Titan Ridge 4C 2018] (rev 06)
> > 05:02.0 PCI bridge: Intel Corporation JHL7540 Thunderbolt 3 Bridge [Titan Ridge 4C 2018] (rev 06)
> > 05:04.0 PCI bridge: Intel Corporation JHL7540 Thunderbolt 3 Bridge [Titan Ridge 4C 2018] (rev 06)
> > 06:00.0 System peripheral: Intel Corporation JHL7540 Thunderbolt 3 NHI [Titan Ridge 4C 2018] (rev 06)
> > 2c:00.0 USB controller: Intel Corporation JHL7540 Thunderbolt 3 USB Controller [Titan Ridge 4C 2018] (rev 06)
> > 52:00.0 Network controller: Intel Corporation Wi-Fi 6 AX200 (rev 1a)
> > 54:00.0 Unassigned class [ff00]: Realtek Semiconductor Co., Ltd. RTS525A PCI Express Card Reader (rev 01)
> >
> >
> > sauron:/usr/src/linux-5.7.11-amd64-preempt-sysrq-20190816/drivers/pci# lsusb -t
> > /: Bus 06.Port 1: Dev 1, Class=root_hub, Driver=xhci_hcd/2p, 10000M
> > /: Bus 05.Port 1: Dev 1, Class=root_hub, Driver=xhci_hcd/2p, 480M
> > /: Bus 04.Port 1: Dev 1, Class=root_hub, Driver=xhci_hcd/4p, 10000M
> > /: Bus 03.Port 1: Dev 1, Class=root_hub, Driver=xhci_hcd/2p, 480M
> > /: Bus 02.Port 1: Dev 1, Class=root_hub, Driver=xhci_hcd/10p, 10000M
> > /: Bus 01.Port 1: Dev 1, Class=root_hub, Driver=xhci_hcd/16p, 480M
> > |__ Port 2: Dev 2, If 0, Class=Human Interface Device, Driver=usbhid, 12M
> > |__ Port 8: Dev 3, If 3, Class=Video, Driver=uvcvideo, 480M
> > |__ Port 8: Dev 3, If 1, Class=Video, Driver=uvcvideo, 480M
> > |__ Port 8: Dev 3, If 2, Class=Video, Driver=uvcvideo, 480M
> > |__ Port 8: Dev 3, If 0, Class=Video, Driver=uvcvideo, 480M
> > |__ Port 9: Dev 4, If 0, Class=Vendor Specific Class, Driver=, 12M
> > |__ Port 14: Dev 6, If 0, Class=Wireless, Driver=btusb, 12M
> > |__ Port 14: Dev 6, If 1, Class=Wireless, Driver=btusb, 12M
> >
> > Thanks,
> > Marc
> > --
> > "A mouse is a device used to point at the xterm you want to type in" - A.S.R.
> >
> > Home page: http://marc.merlins.org/ | PGP 7F55D5F27AAF9D08
>
> --
> "A mouse is a device used to point at the xterm you want to type in" - A.S.R.
>
> Home page: http://marc.merlins.org/ | PGP 7F55D5F27AAF9D08
> _______________________________________________
> Nouveau mailing list
> [email protected]
> https://lists.freedesktop.org/mailman/listinfo/nouveau
>

2020-09-07 20:59:40

by Marc MERLIN

[permalink] [raw]
Subject: Re: [Nouveau] pcieport 0000:00:01.0: PME: Spurious native interrupt (nvidia with nouveau and thunderbolt on thinkpad P73)

On Mon, Sep 07, 2020 at 09:14:03PM +0200, Karol Herbst wrote:
> > - changes in the nouveau driver. Mika told me the PCIe regression
> > "pcieport 0000:00:01.0: PME: Spurious native interrupt!" is supposed
> > to be fixed in 5.8, but I still get a 4mn hang or so during boot and
> > with 5.8, removing the USB key, didn't help make the boot faster
>
> that's the root port the GPU is attached to, no? I saw that message on
> the Thinkpad P1G2 when runtime resuming the Nvidia GPU, but it does
> seem to come from the root port.

Hi Karol, thanks for your answer.

00:01.0 PCI bridge: Intel Corporation Xeon E3-1200 v5/E3-1500 v5/6th Gen Core Processor PCIe Controller (x16) (rev 0d)
01:00.0 VGA compatible controller: NVIDIA Corporation TU104GLM [Quadro RTX 4000 Mobile / Max-Q] (rev a1)

> Well, you'd also need it when attaching external displays.

Indeed. I just don't need that on this laptop, but familiar with the not
so seemless procedure to turn on both GPUs, and mirror the intel one into
the nvidia one for external output.

> > [ 11.262985] nvidia-gpu 0000:01:00.3: PME# enabled
> > [ 11.303060] nvidia-gpu 0000:01:00.3: PME# disabled
>
> mhh, interesting. I heard some random comments that the Nvidia
> USB-C/UCSI driver is a bit broken and can cause various issues. Mind
> blacklisting i2c-nvidia-gpu and typec_nvidia (and verify they don't
> get loaded) and see if that helps?

Right, this one:
01:00.3 Serial bus controller [0c80]: NVIDIA Corporation TU104 USB Type-C UCSI Controller (rev a1)
Sure, I'll blacklist it. Ok, just did that, removed from initrd,
rebooted, and it was no better.

From initrd (before root gets mounted), I have this:
nouveau 1961984 0
mxm_wmi 16384 1 nouveau
hwmon 32768 1 nouveau
ttm 102400 1 nouveau
wmi 32768 2 nouveau,mxm_wmi

I still got a 2mn hang. and a nouveau probe error
[ 189.124530] nouveau: probe of 0000:01:00.0 failed with error -12


Here's what it looks like:
[ 9.693230] hid: raw HID events driver (C) Jiri Kosina
[ 9.694988] usbcore: registered new interface driver usbhid
[ 9.694989] usbhid: USB HID core driver
[ 9.696700] hid-generic 0003:1050:0200.0001: hiddev0,hidraw0: USB HID v1.00 Device [Yubico Yubico Gnubby (gnubby1)] on usb-0000:00:14.0-2/input0
[ 9.784456] Console: switching to colour frame buffer device 240x67
[ 9.816297] i915 0000:00:02.0: fb0: i915drmfb frame buffer device
[ 25.087400] thunderbolt 0000:06:00.0: saving config space at offset 0x0 (reading 0x15eb8086)
[ 25.087414] thunderbolt 0000:06:00.0: saving config space at offset 0x4 (reading 0x100406)
[ 25.087419] thunderbolt 0000:06:00.0: saving config space at offset 0x8 (reading 0x8800006)
[ 25.087424] thunderbolt 0000:06:00.0: saving config space at offset 0xc (reading 0x20)
[ 25.087430] thunderbolt 0000:06:00.0: saving config space at offset 0x10 (reading 0xcc100000)
[ 25.087435] thunderbolt 0000:06:00.0: saving config space at offset 0x14 (reading 0xcc140000)
[ 25.087440] thunderbolt 0000:06:00.0: saving config space at offset 0x18 (reading 0x0)
[ 25.087445] thunderbolt 0000:06:00.0: saving config space at offset 0x1c (reading 0x0)
[ 25.087450] thunderbolt 0000:06:00.0: saving config space at offset 0x20 (reading 0x0)
[ 25.087455] thunderbolt 0000:06:00.0: saving config space at offset 0x24 (reading 0x0)
[ 25.087460] thunderbolt 0000:06:00.0: saving config space at offset 0x28 (reading 0x0)
[ 25.087466] thunderbolt 0000:06:00.0: saving config space at offset 0x2c (reading 0x229b17aa)
[ 25.087471] thunderbolt 0000:06:00.0: saving config space at offset 0x30 (reading 0x0)
[ 25.087476] thunderbolt 0000:06:00.0: saving config space at offset 0x34 (reading 0x80)
[ 25.087481] thunderbolt 0000:06:00.0: saving config space at offset 0x38 (reading 0x0)
[ 25.087486] thunderbolt 0000:06:00.0: saving config space at offset 0x3c (reading 0x1ff)
[ 25.087571] thunderbolt 0000:06:00.0: PME# enabled
[ 25.105353] pcieport 0000:05:00.0: saving config space at offset 0x0 (reading 0x15ea8086)
[ 25.105364] pcieport 0000:05:00.0: saving config space at offset 0x4 (reading 0x100407)
[ 25.105370] pcieport 0000:05:00.0: saving config space at offset 0x8 (reading 0x6040006)
[ 25.105375] pcieport 0000:05:00.0: saving config space at offset 0xc (reading 0x10020)
[ 25.105380] pcieport 0000:05:00.0: saving config space at offset 0x10 (reading 0x0)
[ 25.105384] pcieport 0000:05:00.0: saving config space at offset 0x14 (reading 0x0)
[ 25.105389] pcieport 0000:05:00.0: saving config space at offset 0x18 (reading 0x60605)
[ 25.105394] pcieport 0000:05:00.0: saving config space at offset 0x1c (reading 0x1f1)
[ 25.105399] pcieport 0000:05:00.0: saving config space at offset 0x20 (reading 0xcc10cc10)
[ 25.105404] pcieport 0000:05:00.0: saving config space at offset 0x24 (reading 0x1fff1)
[ 25.105409] pcieport 0000:05:00.0: saving config space at offset 0x28 (reading 0x0)
[ 25.105413] pcieport 0000:05:00.0: saving config space at offset 0x2c (reading 0x0)
[ 25.105418] pcieport 0000:05:00.0: saving config space at offset 0x30 (reading 0x0)
[ 25.105423] pcieport 0000:05:00.0: saving config space at offset 0x34 (reading 0x80)
[ 25.105428] pcieport 0000:05:00.0: saving config space at offset 0x38 (reading 0x0)
[ 25.105432] pcieport 0000:05:00.0: saving config space at offset 0x3c (reading 0x201ff)
[ 25.105517] pcieport 0000:05:00.0: PME# enabled
[ 25.125367] pcieport 0000:04:00.0: saving config space at offset 0x0 (reading 0x15ea8086)
[ 25.125378] pcieport 0000:04:00.0: saving config space at offset 0x4 (reading 0x100007)
[ 25.125383] pcieport 0000:04:00.0: saving config space at offset 0x8 (reading 0x6040006)
[ 25.125388] pcieport 0000:04:00.0: saving config space at offset 0xc (reading 0x10020)
[ 25.125393] pcieport 0000:04:00.0: saving config space at offset 0x10 (reading 0x0)
[ 25.125398] pcieport 0000:04:00.0: saving config space at offset 0x14 (reading 0x0)
[ 25.125403] pcieport 0000:04:00.0: saving config space at offset 0x18 (reading 0x510504)
[ 25.125407] pcieport 0000:04:00.0: saving config space at offset 0x1c (reading 0x5141)
[ 25.125412] pcieport 0000:04:00.0: saving config space at offset 0x20 (reading 0xcc10b400)
[ 25.125417] pcieport 0000:04:00.0: saving config space at offset 0x24 (reading 0x3ff10001)
[ 25.125422] pcieport 0000:04:00.0: saving config space at offset 0x28 (reading 0x60)
[ 25.125427] pcieport 0000:04:00.0: saving config space at offset 0x2c (reading 0x60)
[ 25.125431] pcieport 0000:04:00.0: saving config space at offset 0x30 (reading 0x0)
[ 25.125436] pcieport 0000:04:00.0: saving config space at offset 0x34 (reading 0x80)
[ 25.125441] pcieport 0000:04:00.0: saving config space at offset 0x38 (reading 0x0)
[ 25.125446] pcieport 0000:04:00.0: saving config space at offset 0x3c (reading 0x201ff)
[ 25.125528] pcieport 0000:04:00.0: PME# enabled
[ 25.145423] pcieport 0000:00:1c.0: saving config space at offset 0x0 (reading 0xa3388086)
[ 25.145437] pcieport 0000:00:1c.0: saving config space at offset 0x4 (reading 0x100407)
[ 25.145445] pcieport 0000:00:1c.0: saving config space at offset 0x8 (reading 0x60400f0)
[ 25.145453] pcieport 0000:00:1c.0: saving config space at offset 0xc (reading 0x810000)
[ 25.145460] pcieport 0000:00:1c.0: saving config space at offset 0x10 (reading 0x0)
[ 25.145464] pcieport 0000:00:1c.0: saving config space at offset 0x14 (reading 0x0)
[ 25.145469] pcieport 0000:00:1c.0: saving config space at offset 0x18 (reading 0x510400)
[ 25.145476] pcieport 0000:00:1c.0: saving config space at offset 0x1c (reading 0x20006040)
[ 25.145484] pcieport 0000:00:1c.0: saving config space at offset 0x20 (reading 0xcc10b400)
[ 25.145488] pcieport 0000:00:1c.0: saving config space at offset 0x24 (reading 0x3ff10001)
[ 25.145493] pcieport 0000:00:1c.0: saving config space at offset 0x28 (reading 0x60)
[ 25.145497] pcieport 0000:00:1c.0: saving config space at offset 0x2c (reading 0x60)
[ 25.145502] pcieport 0000:00:1c.0: saving config space at offset 0x30 (reading 0x0)
[ 25.145506] pcieport 0000:00:1c.0: saving config space at offset 0x34 (reading 0x40)
[ 25.145510] pcieport 0000:00:1c.0: saving config space at offset 0x38 (reading 0x0)
[ 25.145515] pcieport 0000:00:1c.0: saving config space at offset 0x3c (reading 0x201ff)
[ 25.145604] pcieport 0000:00:1c.0: PME# enabled
[ 26.265697] pcieport 0000:00:1c.0: power state changed by ACPI to D3cold
[ 45.468365] random: crng init done
[ 105.032727] usb 1-2: USB disconnect, device number 2 <= I removed a usb key, didn't help
[ 128.495144] async_tx: api initialized (async)
[ 128.514820] device-mapper: uevent: version 1.0.3
[ 128.518186] device-mapper: ioctl: 4.42.0-ioctl (2020-02-27) initialised: [email protected]
[ 144.869445] e1000e 0000:00:1f.6 eth0: NIC Link is Down
[ 172.851384] BTRFS: device label btrfs_pool4 devid 1 transid 78270 /dev/sdb4 scanned by btrfs (1293)
[ 172.851648] BTRFS: device label btrfs_pool3 devid 1 transid 27410 /dev/sda5 scanned by btrfs (1293)
[ 172.852030] BTRFS: device fsid de9694f8-9c0d-4e9d-bd12-57adc4381cd7 devid 1 transid 41 /dev/sda3 scanned by btrfs (1293)
[ 172.852224] BTRFS: device fsid 23e1398d-e462-41aa-b85e-f574906ddc03 devid 1 transid 585 /dev/nvme0n1p4 scanned by btrfs (1293)
[ 189.124291] nouveau 0000:01:00.0: disp ctor failed, -12
[ 189.124530] nouveau: probe of 0000:01:00.0 failed with error -12

The next boot looks similar:
[ 25.161759] pcieport 0000:00:1c.0: PME# enabled
[ 26.297810] pcieport 0000:00:1c.0: power state changed by ACPI to D3cold
[ 128.427270] async_tx: api initialized (async)
[ 128.446525] device-mapper: uevent: version 1.0.3
[ 128.446691] device-mapper: ioctl: 4.42.0-ioctl (2020-02-27) initialised: [email protected]
[ 128.458120] random: cryptsetup: uninitialized urandom read (4 bytes read)
[ 138.507373] random: cryptsetup: uninitialized urandom read (4 bytes read)
[ 144.793573] e1000e 0000:00:1f.6 eth0: NIC Link is Down
[ 159.627780] random: crng init done
[ 171.814064] process '/usr/bin/fstype' started with executable stack
[ 181.949989] BTRFS: device label btrfs_boot devid 1 transid 394687 /dev/mapper/cryptroot scanned by btrfs (1063)
[ 181.953437] BTRFS: device label btrfs_pool4 devid 1 transid 78270 /dev/sdb4 scanned by btrfs (1063)
[ 181.956989] BTRFS: device label btrfs_pool3 devid 1 transid 27410 /dev/sda5 scanned by btrfs (1063)
[ 181.960473] BTRFS: device fsid de9694f8-9c0d-4e9d-bd12-57adc4381cd7 devid 1 transid 41 /dev/sda3 scanned by btrfs (1063)
[ 181.964097] BTRFS: device fsid 23e1398d-e462-41aa-b85e-f574906ddc03 devid 1 transid 585 /dev/nvme0n1p4 scanned by btrfs (1063)
[ 188.733645] nouveau 0000:01:00.0: disp ctor failed, -12
[ 188.740653] nouveau: probe of 0000:01:00.0 failed with error -12
[ 188.901070] PM: Image not found (code -22)

Does that help?

Thanks,
Marc
--
"A mouse is a device used to point at the xterm you want to type in" - A.S.R.

Home page: http://marc.merlins.org/ | PGP 7F55D5F27AAF9D08

2020-09-07 23:55:17

by Karol Herbst

[permalink] [raw]
Subject: Re: [Nouveau] pcieport 0000:00:01.0: PME: Spurious native interrupt (nvidia with nouveau and thunderbolt on thinkpad P73)

On Mon, Sep 7, 2020 at 10:58 PM Marc MERLIN <[email protected]> wrote:
>
> On Mon, Sep 07, 2020 at 09:14:03PM +0200, Karol Herbst wrote:
> > > - changes in the nouveau driver. Mika told me the PCIe regression
> > > "pcieport 0000:00:01.0: PME: Spurious native interrupt!" is supposed
> > > to be fixed in 5.8, but I still get a 4mn hang or so during boot and
> > > with 5.8, removing the USB key, didn't help make the boot faster
> >
> > that's the root port the GPU is attached to, no? I saw that message on
> > the Thinkpad P1G2 when runtime resuming the Nvidia GPU, but it does
> > seem to come from the root port.
>
> Hi Karol, thanks for your answer.
>
> 00:01.0 PCI bridge: Intel Corporation Xeon E3-1200 v5/E3-1500 v5/6th Gen Core Processor PCIe Controller (x16) (rev 0d)
> 01:00.0 VGA compatible controller: NVIDIA Corporation TU104GLM [Quadro RTX 4000 Mobile / Max-Q] (rev a1)
>
> > Well, you'd also need it when attaching external displays.
>
> Indeed. I just don't need that on this laptop, but familiar with the not
> so seemless procedure to turn on both GPUs, and mirror the intel one into
> the nvidia one for external output.
>
> > > [ 11.262985] nvidia-gpu 0000:01:00.3: PME# enabled
> > > [ 11.303060] nvidia-gpu 0000:01:00.3: PME# disabled
> >
> > mhh, interesting. I heard some random comments that the Nvidia
> > USB-C/UCSI driver is a bit broken and can cause various issues. Mind
> > blacklisting i2c-nvidia-gpu and typec_nvidia (and verify they don't
> > get loaded) and see if that helps?
>
> Right, this one:
> 01:00.3 Serial bus controller [0c80]: NVIDIA Corporation TU104 USB Type-C UCSI Controller (rev a1)
> Sure, I'll blacklist it. Ok, just did that, removed from initrd,
> rebooted, and it was no better.
>
> From initrd (before root gets mounted), I have this:
> nouveau 1961984 0
> mxm_wmi 16384 1 nouveau
> hwmon 32768 1 nouveau
> ttm 102400 1 nouveau
> wmi 32768 2 nouveau,mxm_wmi
>
> I still got a 2mn hang. and a nouveau probe error
> [ 189.124530] nouveau: probe of 0000:01:00.0 failed with error -12
>
>
> Here's what it looks like:
> [ 9.693230] hid: raw HID events driver (C) Jiri Kosina
> [ 9.694988] usbcore: registered new interface driver usbhid
> [ 9.694989] usbhid: USB HID core driver
> [ 9.696700] hid-generic 0003:1050:0200.0001: hiddev0,hidraw0: USB HID v1.00 Device [Yubico Yubico Gnubby (gnubby1)] on usb-0000:00:14.0-2/input0
> [ 9.784456] Console: switching to colour frame buffer device 240x67
> [ 9.816297] i915 0000:00:02.0: fb0: i915drmfb frame buffer device
> [ 25.087400] thunderbolt 0000:06:00.0: saving config space at offset 0x0 (reading 0x15eb8086)
> [ 25.087414] thunderbolt 0000:06:00.0: saving config space at offset 0x4 (reading 0x100406)
> [ 25.087419] thunderbolt 0000:06:00.0: saving config space at offset 0x8 (reading 0x8800006)
> [ 25.087424] thunderbolt 0000:06:00.0: saving config space at offset 0xc (reading 0x20)
> [ 25.087430] thunderbolt 0000:06:00.0: saving config space at offset 0x10 (reading 0xcc100000)
> [ 25.087435] thunderbolt 0000:06:00.0: saving config space at offset 0x14 (reading 0xcc140000)
> [ 25.087440] thunderbolt 0000:06:00.0: saving config space at offset 0x18 (reading 0x0)
> [ 25.087445] thunderbolt 0000:06:00.0: saving config space at offset 0x1c (reading 0x0)
> [ 25.087450] thunderbolt 0000:06:00.0: saving config space at offset 0x20 (reading 0x0)
> [ 25.087455] thunderbolt 0000:06:00.0: saving config space at offset 0x24 (reading 0x0)
> [ 25.087460] thunderbolt 0000:06:00.0: saving config space at offset 0x28 (reading 0x0)
> [ 25.087466] thunderbolt 0000:06:00.0: saving config space at offset 0x2c (reading 0x229b17aa)
> [ 25.087471] thunderbolt 0000:06:00.0: saving config space at offset 0x30 (reading 0x0)
> [ 25.087476] thunderbolt 0000:06:00.0: saving config space at offset 0x34 (reading 0x80)
> [ 25.087481] thunderbolt 0000:06:00.0: saving config space at offset 0x38 (reading 0x0)
> [ 25.087486] thunderbolt 0000:06:00.0: saving config space at offset 0x3c (reading 0x1ff)
> [ 25.087571] thunderbolt 0000:06:00.0: PME# enabled
> [ 25.105353] pcieport 0000:05:00.0: saving config space at offset 0x0 (reading 0x15ea8086)
> [ 25.105364] pcieport 0000:05:00.0: saving config space at offset 0x4 (reading 0x100407)
> [ 25.105370] pcieport 0000:05:00.0: saving config space at offset 0x8 (reading 0x6040006)
> [ 25.105375] pcieport 0000:05:00.0: saving config space at offset 0xc (reading 0x10020)
> [ 25.105380] pcieport 0000:05:00.0: saving config space at offset 0x10 (reading 0x0)
> [ 25.105384] pcieport 0000:05:00.0: saving config space at offset 0x14 (reading 0x0)
> [ 25.105389] pcieport 0000:05:00.0: saving config space at offset 0x18 (reading 0x60605)
> [ 25.105394] pcieport 0000:05:00.0: saving config space at offset 0x1c (reading 0x1f1)
> [ 25.105399] pcieport 0000:05:00.0: saving config space at offset 0x20 (reading 0xcc10cc10)
> [ 25.105404] pcieport 0000:05:00.0: saving config space at offset 0x24 (reading 0x1fff1)
> [ 25.105409] pcieport 0000:05:00.0: saving config space at offset 0x28 (reading 0x0)
> [ 25.105413] pcieport 0000:05:00.0: saving config space at offset 0x2c (reading 0x0)
> [ 25.105418] pcieport 0000:05:00.0: saving config space at offset 0x30 (reading 0x0)
> [ 25.105423] pcieport 0000:05:00.0: saving config space at offset 0x34 (reading 0x80)
> [ 25.105428] pcieport 0000:05:00.0: saving config space at offset 0x38 (reading 0x0)
> [ 25.105432] pcieport 0000:05:00.0: saving config space at offset 0x3c (reading 0x201ff)
> [ 25.105517] pcieport 0000:05:00.0: PME# enabled
> [ 25.125367] pcieport 0000:04:00.0: saving config space at offset 0x0 (reading 0x15ea8086)
> [ 25.125378] pcieport 0000:04:00.0: saving config space at offset 0x4 (reading 0x100007)
> [ 25.125383] pcieport 0000:04:00.0: saving config space at offset 0x8 (reading 0x6040006)
> [ 25.125388] pcieport 0000:04:00.0: saving config space at offset 0xc (reading 0x10020)
> [ 25.125393] pcieport 0000:04:00.0: saving config space at offset 0x10 (reading 0x0)
> [ 25.125398] pcieport 0000:04:00.0: saving config space at offset 0x14 (reading 0x0)
> [ 25.125403] pcieport 0000:04:00.0: saving config space at offset 0x18 (reading 0x510504)
> [ 25.125407] pcieport 0000:04:00.0: saving config space at offset 0x1c (reading 0x5141)
> [ 25.125412] pcieport 0000:04:00.0: saving config space at offset 0x20 (reading 0xcc10b400)
> [ 25.125417] pcieport 0000:04:00.0: saving config space at offset 0x24 (reading 0x3ff10001)
> [ 25.125422] pcieport 0000:04:00.0: saving config space at offset 0x28 (reading 0x60)
> [ 25.125427] pcieport 0000:04:00.0: saving config space at offset 0x2c (reading 0x60)
> [ 25.125431] pcieport 0000:04:00.0: saving config space at offset 0x30 (reading 0x0)
> [ 25.125436] pcieport 0000:04:00.0: saving config space at offset 0x34 (reading 0x80)
> [ 25.125441] pcieport 0000:04:00.0: saving config space at offset 0x38 (reading 0x0)
> [ 25.125446] pcieport 0000:04:00.0: saving config space at offset 0x3c (reading 0x201ff)
> [ 25.125528] pcieport 0000:04:00.0: PME# enabled
> [ 25.145423] pcieport 0000:00:1c.0: saving config space at offset 0x0 (reading 0xa3388086)
> [ 25.145437] pcieport 0000:00:1c.0: saving config space at offset 0x4 (reading 0x100407)
> [ 25.145445] pcieport 0000:00:1c.0: saving config space at offset 0x8 (reading 0x60400f0)
> [ 25.145453] pcieport 0000:00:1c.0: saving config space at offset 0xc (reading 0x810000)
> [ 25.145460] pcieport 0000:00:1c.0: saving config space at offset 0x10 (reading 0x0)
> [ 25.145464] pcieport 0000:00:1c.0: saving config space at offset 0x14 (reading 0x0)
> [ 25.145469] pcieport 0000:00:1c.0: saving config space at offset 0x18 (reading 0x510400)
> [ 25.145476] pcieport 0000:00:1c.0: saving config space at offset 0x1c (reading 0x20006040)
> [ 25.145484] pcieport 0000:00:1c.0: saving config space at offset 0x20 (reading 0xcc10b400)
> [ 25.145488] pcieport 0000:00:1c.0: saving config space at offset 0x24 (reading 0x3ff10001)
> [ 25.145493] pcieport 0000:00:1c.0: saving config space at offset 0x28 (reading 0x60)
> [ 25.145497] pcieport 0000:00:1c.0: saving config space at offset 0x2c (reading 0x60)
> [ 25.145502] pcieport 0000:00:1c.0: saving config space at offset 0x30 (reading 0x0)
> [ 25.145506] pcieport 0000:00:1c.0: saving config space at offset 0x34 (reading 0x40)
> [ 25.145510] pcieport 0000:00:1c.0: saving config space at offset 0x38 (reading 0x0)
> [ 25.145515] pcieport 0000:00:1c.0: saving config space at offset 0x3c (reading 0x201ff)
> [ 25.145604] pcieport 0000:00:1c.0: PME# enabled
> [ 26.265697] pcieport 0000:00:1c.0: power state changed by ACPI to D3cold
> [ 45.468365] random: crng init done
> [ 105.032727] usb 1-2: USB disconnect, device number 2 <= I removed a usb key, didn't help
> [ 128.495144] async_tx: api initialized (async)
> [ 128.514820] device-mapper: uevent: version 1.0.3
> [ 128.518186] device-mapper: ioctl: 4.42.0-ioctl (2020-02-27) initialised: [email protected]
> [ 144.869445] e1000e 0000:00:1f.6 eth0: NIC Link is Down
> [ 172.851384] BTRFS: device label btrfs_pool4 devid 1 transid 78270 /dev/sdb4 scanned by btrfs (1293)
> [ 172.851648] BTRFS: device label btrfs_pool3 devid 1 transid 27410 /dev/sda5 scanned by btrfs (1293)
> [ 172.852030] BTRFS: device fsid de9694f8-9c0d-4e9d-bd12-57adc4381cd7 devid 1 transid 41 /dev/sda3 scanned by btrfs (1293)
> [ 172.852224] BTRFS: device fsid 23e1398d-e462-41aa-b85e-f574906ddc03 devid 1 transid 585 /dev/nvme0n1p4 scanned by btrfs (1293)
> [ 189.124291] nouveau 0000:01:00.0: disp ctor failed, -12
> [ 189.124530] nouveau: probe of 0000:01:00.0 failed with error -12
>
> The next boot looks similar:
> [ 25.161759] pcieport 0000:00:1c.0: PME# enabled
> [ 26.297810] pcieport 0000:00:1c.0: power state changed by ACPI to D3cold
> [ 128.427270] async_tx: api initialized (async)
> [ 128.446525] device-mapper: uevent: version 1.0.3
> [ 128.446691] device-mapper: ioctl: 4.42.0-ioctl (2020-02-27) initialised: [email protected]
> [ 128.458120] random: cryptsetup: uninitialized urandom read (4 bytes read)
> [ 138.507373] random: cryptsetup: uninitialized urandom read (4 bytes read)
> [ 144.793573] e1000e 0000:00:1f.6 eth0: NIC Link is Down
> [ 159.627780] random: crng init done
> [ 171.814064] process '/usr/bin/fstype' started with executable stack
> [ 181.949989] BTRFS: device label btrfs_boot devid 1 transid 394687 /dev/mapper/cryptroot scanned by btrfs (1063)
> [ 181.953437] BTRFS: device label btrfs_pool4 devid 1 transid 78270 /dev/sdb4 scanned by btrfs (1063)
> [ 181.956989] BTRFS: device label btrfs_pool3 devid 1 transid 27410 /dev/sda5 scanned by btrfs (1063)
> [ 181.960473] BTRFS: device fsid de9694f8-9c0d-4e9d-bd12-57adc4381cd7 devid 1 transid 41 /dev/sda3 scanned by btrfs (1063)
> [ 181.964097] BTRFS: device fsid 23e1398d-e462-41aa-b85e-f574906ddc03 devid 1 transid 585 /dev/nvme0n1p4 scanned by btrfs (1063)
> [ 188.733645] nouveau 0000:01:00.0: disp ctor failed, -12
> [ 188.740653] nouveau: probe of 0000:01:00.0 failed with error -12

oh, I somehow missed that "disp ctor failed" message. I think that
might explain why things are a bit hanging. From the top of my head I
am not sure if that's something known or something new. But just in
case I CCed Lyude and Ben. And I think booting with
nouveau.debug=disp=trace could already show something relevant.

> [ 188.901070] PM: Image not found (code -22)
>
> Does that help?
>
> Thanks,
> Marc
> --
> "A mouse is a device used to point at the xterm you want to type in" - A.S.R.
>
> Home page: http://marc.merlins.org/ | PGP 7F55D5F27AAF9D08
>

2020-09-08 00:31:22

by Marc MERLIN

[permalink] [raw]
Subject: Re: [Nouveau] pcieport 0000:00:01.0: PME: Spurious native interrupt (nvidia with nouveau and thunderbolt on thinkpad P73)

On Tue, Sep 08, 2020 at 01:51:19AM +0200, Karol Herbst wrote:
> oh, I somehow missed that "disp ctor failed" message. I think that
> might explain why things are a bit hanging. From the top of my head I
> am not sure if that's something known or something new. But just in
> case I CCed Lyude and Ben. And I think booting with
> nouveau.debug=disp=trace could already show something relevant.

Thanks.
I've added that to my boot for next time I reboot.

I'm moving some folks to Bcc now, and let's remove the lists other than
nouveau on followups (lkml and pci). I'm just putting a warning here
so that it shows up in other list archives and anyone finding this
later knows that they should look in the nouveau archives for further
updates/resolution.

Thanks,
Marc
--
"A mouse is a device used to point at the xterm you want to type in" - A.S.R.

Home page: http://marc.merlins.org/ | PGP 7F55D5F27AAF9D08

2020-09-13 20:20:51

by Marc MERLIN

[permalink] [raw]
Subject: Re: [Nouveau] pcieport 0000:00:01.0: PME: Spurious native interrupt (nvidia with nouveau and thunderbolt on thinkpad P73)

On Mon, Sep 07, 2020 at 05:29:35PM -0700, Marc MERLIN wrote:
> On Tue, Sep 08, 2020 at 01:51:19AM +0200, Karol Herbst wrote:
> > oh, I somehow missed that "disp ctor failed" message. I think that
> > might explain why things are a bit hanging. From the top of my head I
> > am not sure if that's something known or something new. But just in
> > case I CCed Lyude and Ben. And I think booting with
> > nouveau.debug=disp=trace could already show something relevant.
>
> Thanks.
> I've added that to my boot for next time I reboot.
>
> I'm moving some folks to Bcc now, and let's remove the lists other than
> nouveau on followups (lkml and pci). I'm just putting a warning here
> so that it shows up in other list archives and anyone finding this
> later knows that they should look in the nouveau archives for further
> updates/resolution.

Hi, I didn't hear back on this issue. Did you need the nouveau.debug=disp=trace
or are you already working on the "disp ctor failed" issue?

Thanks
Marc
--
"A mouse is a device used to point at the xterm you want to type in" - A.S.R.

Home page: http://marc.merlins.org/ | PGP 7F55D5F27AAF9D08

2020-12-27 17:04:26

by Marc MERLIN

[permalink] [raw]
Subject: 5.9.11 still hanging 2mn at each boot and looping on nvidia-gpu 0000:01:00.3: PME# enabled (Quadro RTX 4000 Mobile)

This started with 5.5 and hasn't gotten better since then, despite some reports
I tried to send.

As per my previous message:
I have a Thinkpad P70 with hybrid graphics.
01:00.0 VGA compatible controller: NVIDIA Corporation GM107GLM [Quadro M600M] (rev a2)
that one works fine, I can use i915 for the main screen, and nouveau to
display on the external ports (external ports are only wired to nvidia
chip, so it's impossible to use them without turning the nvidia chip
on).

I now got a newer P73 also with the same hybrid graphics (setup as such
in the bios). It runs fine with i915, and I don't need to use external
display with nouveau for now (it almost works, but I only see the mouse
cursor on the external screen, no window or anything else can get
displayed, very weird).
01:00.0 VGA compatible controller: NVIDIA Corporation TU104GLM [Quadro RTX 4000 Mobile / Max-Q] (rev a1)


after boot, when it gets the right trigger (not sure which ones), it
loops on this evern 2 seconds, mostly forever.

I'm not sure if it's nouveau's fault or the kernel's PCI PME's fault, or something else.

Boot hangs look like this:
[ 10.659209] Console: switching to colour frame buffer device 240x67
[ 10.732353] i915 0000:00:02.0: [drm] fb0: i915drmfb frame buffer device
[ 12.101203] nvidia-gpu 0000:01:00.3: saving config space at offset 0x0 (reading 0x1ad910de)
[ 12.101212] nvidia-gpu 0000:01:00.3: saving config space at offset 0x4 (reading 0x100406)
[ 12.101217] nvidia-gpu 0000:01:00.3: saving config space at offset 0x8 (reading 0xc8000a1)
[ 12.101223] nvidia-gpu 0000:01:00.3: saving config space at offset 0xc (reading 0x800000)
[ 12.101228] nvidia-gpu 0000:01:00.3: saving config space at offset 0x10 (reading 0xce054000)
[ 12.101234] nvidia-gpu 0000:01:00.3: saving config space at offset 0x14 (reading 0x0)
[ 12.101239] nvidia-gpu 0000:01:00.3: saving config space at offset 0x18 (reading 0x0)
[ 12.101244] nvidia-gpu 0000:01:00.3: saving config space at offset 0x1c (reading 0x0)
[ 12.101249] nvidia-gpu 0000:01:00.3: saving config space at offset 0x20 (reading 0x0)
[ 12.101254] nvidia-gpu 0000:01:00.3: saving config space at offset 0x24 (reading 0x0)
[ 12.101259] nvidia-gpu 0000:01:00.3: saving config space at offset 0x28 (reading 0x0)
[ 12.101265] nvidia-gpu 0000:01:00.3: saving config space at offset 0x2c (reading 0x229b17aa)
[ 12.101270] nvidia-gpu 0000:01:00.3: saving config space at offset 0x30 (reading 0x0)
[ 12.101275] nvidia-gpu 0000:01:00.3: saving config space at offset 0x34 (reading 0x68)
[ 12.101280] nvidia-gpu 0000:01:00.3: saving config space at offset 0x38 (reading 0x0)
[ 12.101285] nvidia-gpu 0000:01:00.3: saving config space at offset 0x3c (reading 0x4ff)
[ 12.101333] nvidia-gpu 0000:01:00.3: PME# enabled
[ 25.151246] thunderbolt 0000:06:00.0: saving config space at offset 0x0 (reading 0x15eb8086)
[ 25.151260] thunderbolt 0000:06:00.0: saving config space at offset 0x4 (reading 0x100406)
[ 25.151265] thunderbolt 0000:06:00.0: saving config space at offset 0x8 (reading 0x8800006)
[ 25.151270] thunderbolt 0000:06:00.0: saving config space at offset 0xc (reading 0x20)
[ 25.151276] thunderbolt 0000:06:00.0: saving config space at offset 0x10 (reading 0xcc100000)
[ 25.151281] thunderbolt 0000:06:00.0: saving config space at offset 0x14 (reading 0xcc140000)
[ 25.151286] thunderbolt 0000:06:00.0: saving config space at offset 0x18 (reading 0x0)
[ 25.151291] thunderbolt 0000:06:00.0: saving config space at offset 0x1c (reading 0x0)
[ 25.151296] thunderbolt 0000:06:00.0: saving config space at offset 0x20 (reading 0x0)
[ 25.151301] thunderbolt 0000:06:00.0: saving config space at offset 0x24 (reading 0x0)
[ 25.151306] thunderbolt 0000:06:00.0: saving config space at offset 0x28 (reading 0x0)
[ 25.151311] thunderbolt 0000:06:00.0: saving config space at offset 0x2c (reading 0x229b17aa)
[ 25.151316] thunderbolt 0000:06:00.0: saving config space at offset 0x30 (reading 0x0)
[ 25.151322] thunderbolt 0000:06:00.0: saving config space at offset 0x34 (reading 0x80)
[ 25.151327] thunderbolt 0000:06:00.0: saving config space at offset 0x38 (reading 0x0)
[ 25.151332] thunderbolt 0000:06:00.0: saving config space at offset 0x3c (reading 0x1ff)
[ 25.151416] thunderbolt 0000:06:00.0: PME# enabled
[ 25.169204] pcieport 0000:05:00.0: saving config space at offset 0x0 (reading 0x15ea8086)
[ 25.169214] pcieport 0000:05:00.0: saving config space at offset 0x4 (reading 0x100407)
[ 25.169219] pcieport 0000:05:00.0: saving config space at offset 0x8 (reading 0x6040006)
[ 25.169224] pcieport 0000:05:00.0: saving config space at offset 0xc (reading 0x10020)
[ 25.169229] pcieport 0000:05:00.0: saving config space at offset 0x10 (reading 0x0)
[ 25.169233] pcieport 0000:05:00.0: saving config space at offset 0x14 (reading 0x0)
[ 25.169238] pcieport 0000:05:00.0: saving config space at offset 0x18 (reading 0x60605)
[ 25.169243] pcieport 0000:05:00.0: saving config space at offset 0x1c (reading 0x1f1)
[ 25.169248] pcieport 0000:05:00.0: saving config space at offset 0x20 (reading 0xcc10cc10)
[ 25.169253] pcieport 0000:05:00.0: saving config space at offset 0x24 (reading 0x1fff1)
[ 25.169258] pcieport 0000:05:00.0: saving config space at offset 0x28 (reading 0x0)
[ 25.169263] pcieport 0000:05:00.0: saving config space at offset 0x2c (reading 0x0)
[ 25.169268] pcieport 0000:05:00.0: saving config space at offset 0x30 (reading 0x0)
[ 25.169272] pcieport 0000:05:00.0: saving config space at offset 0x34 (reading 0x80)
[ 25.169277] pcieport 0000:05:00.0: saving config space at offset 0x38 (reading 0x0)
[ 25.169282] pcieport 0000:05:00.0: saving config space at offset 0x3c (reading 0x201ff)
[ 25.169367] pcieport 0000:05:00.0: PME# enabled
[ 25.189195] pcieport 0000:04:00.0: saving config space at offset 0x0 (reading 0x15ea8086)
[ 25.189206] pcieport 0000:04:00.0: saving config space at offset 0x4 (reading 0x100007)
[ 25.189212] pcieport 0000:04:00.0: saving config space at offset 0x8 (reading 0x6040006)
[ 25.189216] pcieport 0000:04:00.0: saving config space at offset 0xc (reading 0x10020)
[ 25.189221] pcieport 0000:04:00.0: saving config space at offset 0x10 (reading 0x0)
[ 25.189226] pcieport 0000:04:00.0: saving config space at offset 0x14 (reading 0x0)
[ 25.189231] pcieport 0000:04:00.0: saving config space at offset 0x18 (reading 0x510504)
[ 25.189236] pcieport 0000:04:00.0: saving config space at offset 0x1c (reading 0x5141)
[ 25.189241] pcieport 0000:04:00.0: saving config space at offset 0x20 (reading 0xcc10b400)
[ 25.189246] pcieport 0000:04:00.0: saving config space at offset 0x24 (reading 0x3ff10001)
[ 25.189251] pcieport 0000:04:00.0: saving config space at offset 0x28 (reading 0x60)
[ 25.189255] pcieport 0000:04:00.0: saving config space at offset 0x2c (reading 0x60)
[ 25.189260] pcieport 0000:04:00.0: saving config space at offset 0x30 (reading 0x0)
[ 25.189265] pcieport 0000:04:00.0: saving config space at offset 0x34 (reading 0x80)
[ 25.189270] pcieport 0000:04:00.0: saving config space at offset 0x38 (reading 0x0)
[ 25.189274] pcieport 0000:04:00.0: saving config space at offset 0x3c (reading 0x201ff)
[ 25.189358] pcieport 0000:04:00.0: PME# enabled
[ 25.209257] pcieport 0000:00:1c.0: saving config space at offset 0x0 (reading 0xa3388086)
[ 25.209271] pcieport 0000:00:1c.0: saving config space at offset 0x4 (reading 0x100407)
[ 25.209279] pcieport 0000:00:1c.0: saving config space at offset 0x8 (reading 0x60400f0)
[ 25.209287] pcieport 0000:00:1c.0: saving config space at offset 0xc (reading 0x810000)
[ 25.209291] pcieport 0000:00:1c.0: saving config space at offset 0x10 (reading 0x0)
[ 25.209299] pcieport 0000:00:1c.0: saving config space at offset 0x14 (reading 0x0)
[ 25.209303] pcieport 0000:00:1c.0: saving config space at offset 0x18 (reading 0x510400)
[ 25.209311] pcieport 0000:00:1c.0: saving config space at offset 0x1c (reading 0x20006040)
[ 25.209324] pcieport 0000:00:1c.0: saving config space at offset 0x20 (reading 0xcc10b400)
[ 25.209329] pcieport 0000:00:1c.0: saving config space at offset 0x24 (reading 0x3ff10001)
[ 25.209333] pcieport 0000:00:1c.0: saving config space at offset 0x28 (reading 0x60)
[ 25.209338] pcieport 0000:00:1c.0: saving config space at offset 0x2c (reading 0x60)
[ 25.209342] pcieport 0000:00:1c.0: saving config space at offset 0x30 (reading 0x0)
[ 25.209346] pcieport 0000:00:1c.0: saving config space at offset 0x34 (reading 0x40)
[ 25.209351] pcieport 0000:00:1c.0: saving config space at offset 0x38 (reading 0x0)
[ 25.209355] pcieport 0000:00:1c.0: saving config space at offset 0x3c (reading 0x201ff)
[ 25.209447] pcieport 0000:00:1c.0: PME# enabled
[ 26.341460] pcieport 0000:00:1c.0: power state changed by ACPI to D3cold
[ 129.257560] async_tx: api initialized (async)
[ 129.280335] device-mapper: uevent: version 1.0.3
[ 129.280466] device-mapper: ioctl: 4.42.0-ioctl (2020-02-27) initialised: [email protected]
[ 129.293087] random: cryptsetup: uninitialized urandom read (4 bytes read)
[ 139.346041] random: cryptsetup: uninitialized urandom read (4 bytes read)
[ 145.633300] e1000e 0000:00:1f.6 eth0: NIC Link is Down
[ 149.384146] random: crng init done
[ 161.435256] process '/usr/bin/fstype' started with executable stack
[ 171.578236] BTRFS: device label btrfs_boot devid 1 transid 575473 /dev/mapper/cryptroot scanned by btrfs (1069)
[ 171.583482] BTRFS: device label btrfs_pool4 devid 1 transid 117379 /dev/sdb4 scanned by btrfs (1069)
[ 171.588979] BTRFS: device label btrfs_pool3 devid 1 transid 40487 /dev/sda5 scanned by btrfs (1069)
[ 171.594484] BTRFS: device fsid de9694f8-9c0d-4e9d-bd12-57adc4381cd7 devid 1 transid 41 /dev/sda3 scanned by btrfs (1069)
[ 171.600437] BTRFS: device fsid 23e1398d-e462-41aa-b85e-f574906ddc03 devid 1 transid 585 /dev/nvme0n1p4 scanned by btrfs (1069)
[ 182.799968] PM: Image not found (code -22)
[ 189.304662] nouveau 0000:01:00.0: pmu: firmware unavailable
[ 189.312455] nouveau 0000:01:00.0: disp: destroy running...
[ 189.316552] nouveau 0000:01:00.0: disp: destroy completed in 1us
[ 189.320326] nouveau 0000:01:00.0: disp ctor failed, -12
[ 189.324214] nouveau: probe of 0000:01:00.0 failed with error -12

At runtime, it later gets into a loop like this, and that murders
battery life if I'm not plugged in:
[2140771.370888] nvidia-gpu 0000:01:00.3: saving config space at offset 0x0 (reading 0x1ad910de)
[2140771.370895] nvidia-gpu 0000:01:00.3: saving config space at offset 0x4 (reading 0x100406)
[2140771.370899] nvidia-gpu 0000:01:00.3: saving config space at offset 0x8 (reading 0xc8000a1)
[2140771.370902] nvidia-gpu 0000:01:00.3: saving config space at offset 0xc (reading 0x800000)
[2140771.370905] nvidia-gpu 0000:01:00.3: saving config space at offset 0x10 (reading 0xce054000)
[2140771.370908] nvidia-gpu 0000:01:00.3: saving config space at offset 0x14 (reading 0x0)
[2140771.370912] nvidia-gpu 0000:01:00.3: saving config space at offset 0x18 (reading 0x0)
[2140771.370915] nvidia-gpu 0000:01:00.3: saving config space at offset 0x1c (reading 0x0)
[2140771.370918] nvidia-gpu 0000:01:00.3: saving config space at offset 0x20 (reading 0x0)
[2140771.370921] nvidia-gpu 0000:01:00.3: saving config space at offset 0x24 (reading 0x0)
[2140771.370924] nvidia-gpu 0000:01:00.3: saving config space at offset 0x28 (reading 0x0)
[2140771.370927] nvidia-gpu 0000:01:00.3: saving config space at offset 0x2c (reading 0x229b17aa)
[2140771.370930] nvidia-gpu 0000:01:00.3: saving config space at offset 0x30 (reading 0x0)
[2140771.370933] nvidia-gpu 0000:01:00.3: saving config space at offset 0x34 (reading 0x68)
[2140771.370936] nvidia-gpu 0000:01:00.3: saving config space at offset 0x38 (reading 0x0)
[2140771.370939] nvidia-gpu 0000:01:00.3: saving config space at offset 0x3c (reading 0x4ff)
[2140771.370970] nvidia-gpu 0000:01:00.3: PME# enabled
[2140771.389882] pci 0000:01:00.0: saving config space at offset 0x0 (reading 0x1eb610de)
[2140771.389891] pci 0000:01:00.0: saving config space at offset 0x4 (reading 0x100403)
[2140771.389896] pci 0000:01:00.0: saving config space at offset 0x8 (reading 0x30000a1)
[2140771.389899] pci 0000:01:00.0: saving config space at offset 0xc (reading 0x800000)
[2140771.389903] pci 0000:01:00.0: saving config space at offset 0x10 (reading 0xcd000000)
[2140771.389907] pci 0000:01:00.0: saving config space at offset 0x14 (reading 0xa000000c)
[2140771.389910] pci 0000:01:00.0: saving config space at offset 0x18 (reading 0x0)
[2140771.389914] pci 0000:01:00.0: saving config space at offset 0x1c (reading 0xb000000c)
[2140771.389918] pci 0000:01:00.0: saving config space at offset 0x20 (reading 0x0)
[2140771.389922] pci 0000:01:00.0: saving config space at offset 0x24 (reading 0x2001)
[2140771.389925] pci 0000:01:00.0: saving config space at offset 0x28 (reading 0x0)
[2140771.389928] pci 0000:01:00.0: saving config space at offset 0x2c (reading 0x229b17aa)
[2140771.389932] pci 0000:01:00.0: saving config space at offset 0x30 (reading 0xfff80000)
[2140771.389935] pci 0000:01:00.0: saving config space at offset 0x34 (reading 0x60)
[2140771.389939] pci 0000:01:00.0: saving config space at offset 0x38 (reading 0x0)
[2140771.389943] pci 0000:01:00.0: saving config space at offset 0x3c (reading 0x1ff)
[2140771.390027] pcieport 0000:00:01.0: saving config space at offset 0x0 (reading 0x19018086)
[2140771.390030] pcieport 0000:00:01.0: saving config space at offset 0x4 (reading 0x100407)
[2140771.390033] pcieport 0000:00:01.0: saving config space at offset 0x8 (reading 0x604000d)
[2140771.390036] pcieport 0000:00:01.0: saving config space at offset 0xc (reading 0x810000)
[2140771.390038] pcieport 0000:00:01.0: saving config space at offset 0x10 (reading 0x0)
[2140771.390041] pcieport 0000:00:01.0: saving config space at offset 0x14 (reading 0x0)
[2140771.390044] pcieport 0000:00:01.0: saving config space at offset 0x18 (reading 0x10100)
[2140771.390046] pcieport 0000:00:01.0: saving config space at offset 0x1c (reading 0x2020)
[2140771.390049] pcieport 0000:00:01.0: saving config space at offset 0x20 (reading 0xce00cd00)
[2140771.390051] pcieport 0000:00:01.0: saving config space at offset 0x24 (reading 0xb1f1a001)
[2140771.390054] pcieport 0000:00:01.0: saving config space at offset 0x28 (reading 0x0)
[2140771.390056] pcieport 0000:00:01.0: saving config space at offset 0x2c (reading 0x0)
[2140771.390059] pcieport 0000:00:01.0: saving config space at offset 0x30 (reading 0x0)
[2140771.390061] pcieport 0000:00:01.0: saving config space at offset 0x34 (reading 0x88)
[2140771.390064] pcieport 0000:00:01.0: saving config space at offset 0x38 (reading 0x0)
[2140771.390067] pcieport 0000:00:01.0: saving config space at offset 0x3c (reading 0x201ff)
[2140771.390125] pcieport 0000:00:01.0: PME# enabled

Thanks for any help
Marc
--
"A mouse is a device used to point at the xterm you want to type in" - A.S.R.

Home page: http://marc.merlins.org/ | PGP 7F55D5F27AAF9D08

2020-12-27 18:30:57

by Ilia Mirkin

[permalink] [raw]
Subject: Re: [Nouveau] 5.9.11 still hanging 2mn at each boot and looping on nvidia-gpu 0000:01:00.3: PME# enabled (Quadro RTX 4000 Mobile)

On Sun, Dec 27, 2020 at 12:03 PM Marc MERLIN <[email protected]> wrote:
>
> This started with 5.5 and hasn't gotten better since then, despite some reports
> I tried to send.
>
> As per my previous message:
> I have a Thinkpad P70 with hybrid graphics.
> 01:00.0 VGA compatible controller: NVIDIA Corporation GM107GLM [Quadro M600M] (rev a2)
> that one works fine, I can use i915 for the main screen, and nouveau to
> display on the external ports (external ports are only wired to nvidia
> chip, so it's impossible to use them without turning the nvidia chip
> on).
>
> I now got a newer P73 also with the same hybrid graphics (setup as such
> in the bios). It runs fine with i915, and I don't need to use external
> display with nouveau for now (it almost works, but I only see the mouse
> cursor on the external screen, no window or anything else can get
> displayed, very weird).
> 01:00.0 VGA compatible controller: NVIDIA Corporation TU104GLM [Quadro RTX 4000 Mobile / Max-Q] (rev a1)

Display offload usually requires acceleration -- the copies are done
using the DMA engine. Please make sure that you have firmware
available (and a new enough mesa). The errors suggest that you don't
have firmware available at the time that nouveau loads. Depending on
your setup, that might mean the firmware has to be built into the
kernel, or available in initramfs. (Or just regular filesystem if you
don't use a complicated boot sequence. But many people go with distro
defaults, which do have this complexity.)

>
>
> after boot, when it gets the right trigger (not sure which ones), it
> loops on this evern 2 seconds, mostly forever.

The gpu suspends with runtime pm. And then gets woken up for some
reason (could be something quite silly, like lspci, or could be
something explicitly checking connectors, etc). Repeat.

Cheers,

-ilia

2020-12-29 15:56:22

by Marc MERLIN

[permalink] [raw]
Subject: Re: 5.9.11 still hanging 2mn at each boot and looping on nvidia-gpu 0000:01:00.3: PME# enabled (Quadro RTX 4000 Mobile)

On Sat, Dec 26, 2020 at 03:12:09AM -0800, Ilia Mirkin wrote:
> > after boot, when it gets the right trigger (not sure which ones), it
> > loops on this evern 2 seconds, mostly forever.
>
> The gpu suspends with runtime pm. And then gets woken up for some
> reason (could be something quite silly, like lspci, or could be
> something explicitly checking connectors, etc). Repeat.

Ah, fair point. Could it be powertop even?
How would I go towards tracing that?
Sounds like this would be a problem with all chips if userspace is able
to wake them up every second or two with a probe. Now I wonder what
broken userspace I have that could be doing this.

> Display offload usually requires acceleration -- the copies are done
> using the DMA engine. Please make sure that you have firmware
> available (and a new enough mesa). The errors suggest that you don't
> have firmware available at the time that nouveau loads. Depending on
> your setup, that might mean the firmware has to be built into the
> kernel, or available in initramfs. (Or just regular filesystem if you
> don't use a complicated boot sequence. But many people go with distro
> defaults, which do have this complexity.)

Hi Ilia, thanks for your answer.

Do you think that could be a reason why the boot would hang for 2 full minutes at every
boot ever since I upgraded to 5.5?

Also, without wanting to sound like a full newbie, where is that
firmware you're talking about? In my kernel source?

Here's what I do have:
sauron:/usr/local/bin# dpkggrep nouveau
libdrm-nouveau2:amd64 install
xserver-xorg-video-nouveau install

no nouveau-firmware package in debian:
sauron:/usr/local/bin# apt-cache search nouveau
bumblebee - NVIDIA Optimus support for Linux
libdrm-nouveau2 - Userspace interface to nouveau-specific kernel DRM services -- runtime
xfonts-jmk - Jim Knoble's character-cell fonts for X
xserver-xorg-video-nouveau - X.Org X server -- Nouveau display driver

No firmware file on my disk:
sauron:/usr/local/bin# find /lib/modules/5.9.11-amd64-preempt-sysrq-20190817/ /lib/firmware/ |grep nouveau
/lib/modules/5.9.11-amd64-preempt-sysrq-20190817/kernel/drivers/gpu/drm/nouveau
/lib/modules/5.9.11-amd64-preempt-sysrq-20190817/kernel/drivers/gpu/drm/nouveau/nouveau.ko
sauron:/usr/local/bin#

The kernel module is in my initrd:
sauron:/usr/local/bin# dd if=/boot/initrd.img-5.9.11-amd64-preempt-sysrq-20190817 bs=2966528 skip=1 | gunzip | cpio -tdv | grep nouveau
drwxr-xr-x 1 root root 0 Nov 30 15:40 usr/lib/modules/5.9.11-amd64-preempt-sysrq-20190817/kernel/drivers/gpu/drm/nouveau
-rw-r--r-- 1 root root 3691385 Nov 30 15:35 usr/lib/modules/5.9.11-amd64-preempt-sysrq-20190817/kernel/drivers/gpu/drm/nouveau/nouveau.ko
17+1 records in
17+1 records out
52566778 bytes (53 MB, 50 MiB) copied, 1.69708 s, 31.0 MB/s

What am I supposed to do/check next?

Note that ultimately I only need nouveau not to hang my boot 2mn and do
PM so that the nvidia chip goes to sleep since I don't use it.

Thanks,
Marc
--
"A mouse is a device used to point at the xterm you want to type in" - A.S.R.

Home page: http://marc.merlins.org/ | PGP 7F55D5F27AAF9D08

2020-12-29 16:35:07

by Ilia Mirkin

[permalink] [raw]
Subject: Re: 5.9.11 still hanging 2mn at each boot and looping on nvidia-gpu 0000:01:00.3: PME# enabled (Quadro RTX 4000 Mobile)

On Tue, Dec 29, 2020 at 10:52 AM Marc MERLIN <[email protected]> wrote:
>
> On Sat, Dec 26, 2020 at 03:12:09AM -0800, Ilia Mirkin wrote:
> > > after boot, when it gets the right trigger (not sure which ones), it
> > > loops on this evern 2 seconds, mostly forever.
> >
> > The gpu suspends with runtime pm. And then gets woken up for some
> > reason (could be something quite silly, like lspci, or could be
> > something explicitly checking connectors, etc). Repeat.
>
> Ah, fair point. Could it be powertop even?
> How would I go towards tracing that?
> Sounds like this would be a problem with all chips if userspace is able
> to wake them up every second or two with a probe. Now I wonder what
> broken userspace I have that could be doing this.

Well, it's a theory. Some userspace helpfully prevents the GPU from
suspending entirely, unfortunately I don't remember its name though by
messing with the attached audio device. It's very common and meant to
help... oh well.

>
> > Display offload usually requires acceleration -- the copies are done
> > using the DMA engine. Please make sure that you have firmware
> > available (and a new enough mesa). The errors suggest that you don't
> > have firmware available at the time that nouveau loads. Depending on
> > your setup, that might mean the firmware has to be built into the
> > kernel, or available in initramfs. (Or just regular filesystem if you
> > don't use a complicated boot sequence. But many people go with distro
> > defaults, which do have this complexity.)
>
> Hi Ilia, thanks for your answer.
>
> Do you think that could be a reason why the boot would hang for 2 full minutes at every
> boot ever since I upgraded to 5.5?

I'd have to check, but I'm guessing TU104 acceleration became a thing
in 5.5. I would also not be very surprised if the code didn't handle
failure extremely gracefully - there definitely have been problems
with that in the past.

>
> Also, without wanting to sound like a full newbie, where is that
> firmware you're talking about? In my kernel source?
>
> Here's what I do have:
> sauron:/usr/local/bin# dpkggrep nouveau
> libdrm-nouveau2:amd64 install
> xserver-xorg-video-nouveau install
>
> no nouveau-firmware package in debian:
> sauron:/usr/local/bin# apt-cache search nouveau
> bumblebee - NVIDIA Optimus support for Linux
> libdrm-nouveau2 - Userspace interface to nouveau-specific kernel DRM services -- runtime
> xfonts-jmk - Jim Knoble's character-cell fonts for X
> xserver-xorg-video-nouveau - X.Org X server -- Nouveau display driver
>
> No firmware file on my disk:
> sauron:/usr/local/bin# find /lib/modules/5.9.11-amd64-preempt-sysrq-20190817/ /lib/firmware/ |grep nouveau
> /lib/modules/5.9.11-amd64-preempt-sysrq-20190817/kernel/drivers/gpu/drm/nouveau
> /lib/modules/5.9.11-amd64-preempt-sysrq-20190817/kernel/drivers/gpu/drm/nouveau/nouveau.ko
> sauron:/usr/local/bin#
>
> The kernel module is in my initrd:
> sauron:/usr/local/bin# dd if=/boot/initrd.img-5.9.11-amd64-preempt-sysrq-20190817 bs=2966528 skip=1 | gunzip | cpio -tdv | grep nouveau
> drwxr-xr-x 1 root root 0 Nov 30 15:40 usr/lib/modules/5.9.11-amd64-preempt-sysrq-20190817/kernel/drivers/gpu/drm/nouveau
> -rw-r--r-- 1 root root 3691385 Nov 30 15:35 usr/lib/modules/5.9.11-amd64-preempt-sysrq-20190817/kernel/drivers/gpu/drm/nouveau/nouveau.ko
> 17+1 records in
> 17+1 records out
> 52566778 bytes (53 MB, 50 MiB) copied, 1.69708 s, 31.0 MB/s

I think that gets you out of "full newbie" land...

>
> What am I supposed to do/check next?
>
> Note that ultimately I only need nouveau not to hang my boot 2mn and do
> PM so that the nvidia chip goes to sleep since I don't use it.

I'm not extremely familiar with debian packaging, but the firmware is
provided by NVIDIA and shipped as part of linux-firmware:

https://git.kernel.org/pub/scm/linux/kernel/git/firmware/linux-firmware.git/tree/nvidia

This needs to be available at /lib/firmware/nvidia when nouveau loads.
Based on your email above, it's most likely that it would load from
the initrd - so make sure it's in there.

Of course now that I read your email a bit more carefully, it seems
your issue is with the "saving config space" messages. I'm not sure
I've seen those before. Perhaps you have some sort of debug enabled.
I'd find where in the kernel they are being produced, and what the
conditions for it are. But the failure to load firmware isn't great --
not 100% sure if it impacts runpm or not.

I just double-checked, TU10x accel came in via
afa3b96b058d87c2c44d1c83dadb2ba6998d03ce, which was first in v5.6.
Initial TU10x support came in v5.0. So that doesn't line up with your
timeline.

Anyways, I'd definitely sort the firmware situation out, but it may
not be the cause of your problem.

Cheers,

-ilia

2021-01-28 00:40:17

by Bjorn Helgaas

[permalink] [raw]
Subject: Re: 5.9.11 still hanging 2mn at each boot and looping on nvidia-gpu 0000:01:00.3: PME# enabled (Quadro RTX 4000 Mobile)

Hi Marc, I appreciate your persistence on this. I am frankly
surprised that you've put up with this so long.

On Sat, Dec 26, 2020 at 03:12:09AM -0800, Marc MERLIN wrote:
> This started with 5.5 and hasn't gotten better since then, despite
> some reports I tried to send.
>
> As per my previous message:
> I have a Thinkpad P70 with hybrid graphics.
> 01:00.0 VGA compatible controller: NVIDIA Corporation GM107GLM [Quadro M600M] (rev a2)
> that one works fine, I can use i915 for the main screen, and nouveau to
> display on the external ports (external ports are only wired to nvidia
> chip, so it's impossible to use them without turning the nvidia chip
> on).
>
> I now got a newer P73 also with the same hybrid graphics (setup as such
> in the bios). It runs fine with i915, and I don't need to use external
> display with nouveau for now (it almost works, but I only see the mouse
> cursor on the external screen, no window or anything else can get
> displayed, very weird).
> 01:00.0 VGA compatible controller: NVIDIA Corporation TU104GLM [Quadro RTX 4000 Mobile / Max-Q] (rev a1)
>
>
> after boot, when it gets the right trigger (not sure which ones), it
> loops on this evern 2 seconds, mostly forever.
>
> I'm not sure if it's nouveau's fault or the kernel's PCI PME's fault, or something else.

IIUC there are basically two problems:

1) A 2 minute delay during boot
2) Some sort of event every 2 seconds that kills your battery life

Your machine doesn't sound unusual, and I haven't seen a flood of
similar reports, so maybe there's something unusual about your config.
But I really don't have any guesses for either one.

It sounds like v5.5 worked fine and you first noticed the slow boot
problem in v5.8. We *could* try to bisect it, but I know that's a lot
of work on your part.

Grasping for any ideas for the boot delay; could you boot with
"initcall_debug" and collect your "lsmod" output? I notice async_tx
in some of your logs, but I have no idea what it is. It's from
crypto, so possibly somewhat unusual?

Bjorn

2021-01-28 21:02:13

by Bjorn Helgaas

[permalink] [raw]
Subject: Re: 5.9.11 still hanging 2mn at each boot and looping on nvidia-gpu 0000:01:00.3: PME# enabled (Quadro RTX 4000 Mobile)

On Wed, Jan 27, 2021 at 03:33:02PM -0600, Bjorn Helgaas wrote:
> On Sat, Dec 26, 2020 at 03:12:09AM -0800, Marc MERLIN wrote:
> > This started with 5.5 and hasn't gotten better since then, despite
> > some reports I tried to send.
> >
> > As per my previous message:
> > I have a Thinkpad P70 with hybrid graphics.
> > 01:00.0 VGA compatible controller: NVIDIA Corporation GM107GLM [Quadro M600M] (rev a2)
> > that one works fine, I can use i915 for the main screen, and nouveau to
> > display on the external ports (external ports are only wired to nvidia
> > chip, so it's impossible to use them without turning the nvidia chip
> > on).
> >
> > I now got a newer P73 also with the same hybrid graphics (setup as such
> > in the bios). It runs fine with i915, and I don't need to use external
> > display with nouveau for now (it almost works, but I only see the mouse
> > cursor on the external screen, no window or anything else can get
> > displayed, very weird).
> > 01:00.0 VGA compatible controller: NVIDIA Corporation TU104GLM [Quadro RTX 4000 Mobile / Max-Q] (rev a1)
> >
> >
> > after boot, when it gets the right trigger (not sure which ones), it
> > loops on this evern 2 seconds, mostly forever.
> >
> > I'm not sure if it's nouveau's fault or the kernel's PCI PME's fault, or something else.
>
> IIUC there are basically two problems:
>
> 1) A 2 minute delay during boot
> 2) Some sort of event every 2 seconds that kills your battery life
>
> Your machine doesn't sound unusual, and I haven't seen a flood of
> similar reports, so maybe there's something unusual about your config.
> But I really don't have any guesses for either one.
>
> It sounds like v5.5 worked fine and you first noticed the slow boot
> problem in v5.8. We *could* try to bisect it, but I know that's a lot
> of work on your part.
>
> Grasping for any ideas for the boot delay; could you boot with
> "initcall_debug" and collect your "lsmod" output? I notice async_tx
> in some of your logs, but I have no idea what it is. It's from
> crypto, so possibly somewhat unusual?

Another random thought: is there any chance the boot delay could be
related to crypto waiting for entropy?

2021-01-29 00:58:27

by Marc MERLIN

[permalink] [raw]
Subject: Re: 5.9.11 still hanging 2mn at each boot and looping on nvidia-gpu 0000:01:00.3: PME# enabled (Quadro RTX 4000 Mobile)

On Wed, Jan 27, 2021 at 03:33:00PM -0600, Bjorn Helgaas wrote:
> Hi Marc, I appreciate your persistence on this. I am frankly
> surprised that you've put up with this so long.

Well, been using linux for 27 years, but also it's not like I have much
of a choice outside of switching to windows, as tempting as it's getting
sometimes ;)

> > after boot, when it gets the right trigger (not sure which ones), it
> > loops on this evern 2 seconds, mostly forever.
> >
> > I'm not sure if it's nouveau's fault or the kernel's PCI PME's fault, or something else.
>
> IIUC there are basically two problems:
>
> 1) A 2 minute delay during boot
> Another random thought: is there any chance the boot delay could be
> related to crypto waiting for entropy?

So, the 2mn hang went away after I added the nouveau firwmare in initrd.
The only problem is that the nouveau driver does not give a very good
clue as to what's going on and what to do.
For comparison the intel iwlwifi driver is very clear about firmware
it's trying to load, if it can't and what exact firmware you need to
find on the internet (filename)

> 2) Some sort of event every 2 seconds that kills your battery life
> Your machine doesn't sound unusual, and I haven't seen a flood of
> similar reports, so maybe there's something unusual about your config.
> But I really don't have any guesses for either one.

Honestly, there are not too many thinpad P73 running linux out there. I
wouldn't be surprised if it's only a handful or two.

> It sounds like v5.5 worked fine and you first noticed the slow boot
> problem in v5.8. We *could* try to bisect it, but I know that's a lot
> of work on your part.

I've done that in the past, to be honest now that it works after I added
the firmware that nouveau started needing, and didn't need before, the
hang at boot is gone for sure.
The PCI PM wakeup issues on batteries happen sometimes still, but they
are much more rare now.

> Grasping for any ideas for the boot delay; could you boot with
> "initcall_debug" and collect your "lsmod" output? I notice async_tx
> in some of your logs, but I have no idea what it is. It's from
> crypto, so possibly somewhat unusual?

Is this still neeeded? I think of nouveau does a better job of helping
the user correct the issue if firmware is missing (I think intel even
gives a URL in printk), that would probably be what's needed for the
most part.

[ 12.832547] async_tx: api initialized (async) comes from ./crypto/async_tx/async_tx.c

Thanks for your answer, let me know if there is anything else useful I
can give, I think I'm otherwise mostly ok now.

Marc
--
"A mouse is a device used to point at the xterm you want to type in" - A.S.R.

Home page: http://marc.merlins.org/ | PGP 7F55D5F27AAF9D08

2021-01-29 21:23:43

by Bjorn Helgaas

[permalink] [raw]
Subject: Re: 5.9.11 still hanging 2mn at each boot and looping on nvidia-gpu 0000:01:00.3: PME# enabled (Quadro RTX 4000 Mobile)

On Thu, Jan 28, 2021 at 04:56:26PM -0800, Marc MERLIN wrote:
> On Wed, Jan 27, 2021 at 03:33:00PM -0600, Bjorn Helgaas wrote:
> > Hi Marc, I appreciate your persistence on this. I am frankly
> > surprised that you've put up with this so long.
>
> Well, been using linux for 27 years, but also it's not like I have much
> of a choice outside of switching to windows, as tempting as it's getting
> sometimes ;)
>
> > > after boot, when it gets the right trigger (not sure which ones), it
> > > loops on this evern 2 seconds, mostly forever.
> > >
> > > I'm not sure if it's nouveau's fault or the kernel's PCI PME's fault, or something else.
> >
> > IIUC there are basically two problems:
> >
> > 1) A 2 minute delay during boot
> > Another random thought: is there any chance the boot delay could be
> > related to crypto waiting for entropy?
>
> So, the 2mn hang went away after I added the nouveau firwmare in initrd.
> The only problem is that the nouveau driver does not give a very good
> clue as to what's going on and what to do.
>
> For comparison the intel iwlwifi driver is very clear about firmware
> it's trying to load, if it can't and what exact firmware you need to
> find on the internet (filename)

I guess you're referring to this in iwl_request_firmware()?

IWL_ERR(drv, "check git://git.kernel.org/pub/scm/linux/kernel/git/firmware/linux-firmware.git\n");

How can we fix this in nouveau so we don't have the debug this again?
I don't really know how firmware loading works, but "git grep -A5
request_firmware drivers/gpu/drm/nouveau/" shows that we generally
print something when request_firmware() fails.

But I didn't notice those messages in your logs, so I'm probably
barking up the wrong tree.

> > 2) Some sort of event every 2 seconds that kills your battery life
> > Your machine doesn't sound unusual, and I haven't seen a flood of
> > similar reports, so maybe there's something unusual about your config.
> > But I really don't have any guesses for either one.
>
> Honestly, there are not too many thinpad P73 running linux out there. I
> wouldn't be surprised if it's only a handful or two.
>
> > It sounds like v5.5 worked fine and you first noticed the slow boot
> > problem in v5.8. We *could* try to bisect it, but I know that's a lot
> > of work on your part.
>
> I've done that in the past, to be honest now that it works after I added
> the firmware that nouveau started needing, and didn't need before, the
> hang at boot is gone for sure.
> The PCI PM wakeup issues on batteries happen sometimes still, but they
> are much more rare now.

So maybe the wakeups are related to having vs not having the nouveau
firmware? I'm still curious about that, and it smells like a bug to
me, but probably something to do with nouveau where I have no hope of
debugging it.

> > Grasping for any ideas for the boot delay; could you boot with
> > "initcall_debug" and collect your "lsmod" output? I notice async_tx
> > in some of your logs, but I have no idea what it is. It's from
> > crypto, so possibly somewhat unusual?
>
> Is this still neeeded? I think of nouveau does a better job of helping
> the user correct the issue if firmware is missing (I think intel even
> gives a URL in printk), that would probably be what's needed for the
> most part.

Nope, don't bother with this, thanks.

Bjorn

2021-01-30 04:09:07

by Marc MERLIN

[permalink] [raw]
Subject: Re: 5.9.11 still hanging 2mn at each boot and looping on nvidia-gpu 0000:01:00.3: PME# enabled (Quadro RTX 4000 Mobile)

On Fri, Jan 29, 2021 at 03:20:32PM -0600, Bjorn Helgaas wrote:
> > For comparison the intel iwlwifi driver is very clear about firmware
> > it's trying to load, if it can't and what exact firmware you need to
> > find on the internet (filename)
>
> I guess you're referring to this in iwl_request_firmware()?
>
> IWL_ERR(drv, "check git://git.kernel.org/pub/scm/linux/kernel/git/firmware/linux-firmware.git\n");

Yes :)

> How can we fix this in nouveau so we don't have the debug this again?
> I don't really know how firmware loading works, but "git grep -A5
> request_firmware drivers/gpu/drm/nouveau/" shows that we generally
> print something when request_firmware() fails.

Well, have a look at https://pastebin.com/dX19aCpj
do you see any warning whatsoever?

> But I didn't notice those messages in your logs, so I'm probably
> barking up the wrong tree.

you're not It seems that newer kernels are a bit better:
[ 189.304662] nouveau 0000:01:00.0: pmu: firmware unavailable
[ 189.312455] nouveau 0000:01:00.0: disp: destroy running...
[ 189.316552] nouveau 0000:01:00.0: disp: destroy completed in 1us
[ 189.320326] nouveau 0000:01:00.0: disp ctor failed, -12
[ 189.324214] nouveau: probe of 0000:01:00.0 failed with error -12

So, it probably got better, but that message got displayed after the 2mn
hang that having the firmware, stops from happening.

whichever developer with the right hardware can probably easily
reproduce this by removing the firmware and looking at the boot
messages.

At the very least, it should print something more clear "driver will not
function properly", and a URL to where one can get the driver, would be
awesome.

> So maybe the wakeups are related to having vs not having the nouveau
> firmware? I'm still curious about that, and it smells like a bug to
> me, but probably something to do with nouveau where I have no hope of
> debugging it.

Right. Honestly, given the time I've lost with this, and now that it
seems gone with the firmware, I'm happy to leave well enough alone :)

I'm not sure how you are involved with the driver, but are you able to
help improve the dmesg output?

Thanks,
Marc
--
"A mouse is a device used to point at the xterm you want to type in" - A.S.R.

Home page: http://marc.merlins.org/ | PGP 7F55D5F27AAF9D08