2023-07-20 04:13:53

by Woody Suwalski

[permalink] [raw]
Subject: Kernel 6.5-rc2: system crash on suspend bisected


Laptop shows a kernel crash trace after a first suspend to ram, on a
second attempt to suspend it becomes frozen solid. This is 100%
repeatable with a 6.5-rc2 kernel, not happening with a 6.4 kernel - see
the attached dmesg output.

I have bisected the kernel uilds and it points to :
[40613da52b13fb21c5566f10b287e0ca8c12c4e9] PCI: acpiphp: Reassign
resources on bridge if necessary

Reversing this patch seems to fix the kernel crash problem on my laptop.

Happy to test some proper fix patches...

Thanks, Woody


Attachments:
dmesg-6.4.txt (67.11 kB)
dmesg-6.5.txt (71.21 kB)
dmesg-6.5-2.txt (73.20 kB)
Download all attachments

2023-07-20 20:59:49

by Bjorn Helgaas

[permalink] [raw]
Subject: Re: Kernel 6.5-rc2: system crash on suspend bisected

[+cc regressions list]

On Wed, Jul 19, 2023 at 11:36:51PM -0400, Woody Suwalski wrote:
> Laptop shows a kernel crash trace after a first suspend to ram, on a second
> attempt to suspend it becomes frozen solid. This is 100% repeatable with a
> 6.5-rc2 kernel, not happening with a 6.4 kernel - see the attached dmesg
> output.
>
> I have bisected the kernel uilds and it points to :
> [40613da52b13fb21c5566f10b287e0ca8c12c4e9] PCI: acpiphp: Reassign resources
> on bridge if necessary
>
> Reversing this patch seems to fix the kernel crash problem on my laptop.

Thank you very much for all your work debugging, bisecting, and
reporting this! This is incredibly helpful.

Original report, including complete dmesg logs for both v6.4 and
v6.5-rc2:
https://lore.kernel.org/r/[email protected]

I queued up a revert of 40613da52b13 ("PCI: acpiphp: Reassign
resources on bridge if necessary") (on my for-linus branch for v6.5).

It looks like a NULL pointer dereference; hopefully the fix is obvious
and I can drop the revert and replace it with the fix.

Bjorn

2023-07-23 09:34:35

by Thorsten Leemhuis

[permalink] [raw]
Subject: Re: Kernel 6.5-rc2: system crash on suspend bisected

[CCing the regression list, as it should be in the loop for regressions:
https://docs.kernel.org/admin-guide/reporting-regressions.html]

[TLDR: I'm adding this report to the list of tracked Linux kernel
regressions; the text you find below is based on a few templates
paragraphs you might have encountered already in similar form.
See link in footer if these mails annoy you.]

On 20.07.23 05:36, Woody Suwalski wrote:
>
> Laptop shows a kernel crash trace after a first suspend to ram, on a
> second attempt to suspend it becomes frozen solid. This is 100%
> repeatable with a 6.5-rc2 kernel, not happening with a 6.4 kernel - see
> the attached dmesg output.
>
> I have bisected the kernel uilds and it points to :
> [40613da52b13fb21c5566f10b287e0ca8c12c4e9] PCI: acpiphp: Reassign
> resources on bridge if necessary
>
> Reversing this patch seems to fix the kernel crash problem on my laptop.
>
> Happy to test some proper fix patches...
>

Thanks for the report. To be sure the issue doesn't fall through the
cracks unnoticed, I'm adding it to regzbot, the Linux kernel regression
tracking bot:

#regzbot ^introduced 40613da52b13fb21c5566f10b287e0ca8c12c
#regzbot title PCI: acpiphp: Oops on first attempt to suspend, freeze on
second
#regzbot ignore-activity

This isn't a regression? This issue or a fix for it are already
discussed somewhere else? It was fixed already? You want to clarify when
the regression started to happen? Or point out I got the title or
something else totally wrong? Then just reply and tell me -- ideally
while also telling regzbot about it, as explained by the page listed in
the footer of this mail.

Developers: When fixing the issue, remember to add 'Link:' tags pointing
to the report (the parent of this mail). See page linked in footer for
details.

Ciao, Thorsten (wearing his 'the Linux kernel's regression tracker' hat)
--
Everything you wanna know about Linux kernel regression tracking:
https://linux-regtracking.leemhuis.info/about/#tldr
That page also explains what to do if mails like this annoy you.

2023-07-24 10:12:33

by Igor Mammedov

[permalink] [raw]
Subject: Re: Kernel 6.5-rc2: system crash on suspend bisected

On Thu, 20 Jul 2023 15:21:10 -0500
Bjorn Helgaas <[email protected]> wrote:

> [+cc regressions list]
>
> On Wed, Jul 19, 2023 at 11:36:51PM -0400, Woody Suwalski wrote:
> > Laptop shows a kernel crash trace after a first suspend to ram, on a second
> > attempt to suspend it becomes frozen solid. This is 100% repeatable with a
> > 6.5-rc2 kernel, not happening with a 6.4 kernel - see the attached dmesg
> > output.
> >
> > I have bisected the kernel uilds and it points to :
> > [40613da52b13fb21c5566f10b287e0ca8c12c4e9] PCI: acpiphp: Reassign resources
> > on bridge if necessary
> >
> > Reversing this patch seems to fix the kernel crash problem on my laptop.
>
> Thank you very much for all your work debugging, bisecting, and
> reporting this! This is incredibly helpful.
>
> Original report, including complete dmesg logs for both v6.4 and
> v6.5-rc2:
> https://lore.kernel.org/r/[email protected]
>
> I queued up a revert of 40613da52b13 ("PCI: acpiphp: Reassign
> resources on bridge if necessary") (on my for-linus branch for v6.5).
>
> It looks like a NULL pointer dereference; hopefully the fix is obvious
> and I can drop the revert and replace it with the fix.

it happens here:

2145 void pci_assign_unassigned_bridge_resources(struct pci_dev *bridge)
2146 {
2147 struct pci_bus *parent = bridge->subordinate;

Let's see if it reproducable on Lenovo laptop and what reading
involved code yields.
If I can't figure it out anyways, I'll come up with a patch
to trace issue.

>
> Bjorn
>


2023-07-24 14:34:02

by Igor Mammedov

[permalink] [raw]
Subject: [PATCH] hack to debug acpiphp crash


Woody thanks for testing,

can you try following patch which will try to workaround NULL bus->self if it's
a really cuplrit and print an extra debug information.
Add following to kernel command line(make sure that CONFIG_DYNAMIC_DEBUG is enabled):

dyndbg="file drivers/pci/access.c +p; file drivers/pci/hotplug/acpiphp_glue.c +p; file drivers/pci/bus.c +p; file drivers/pci/pci.c +p; file drivers/pci/setup-bus.c +p" ignore_loglevel

What I find odd in you logs is that enable_slot() is called while native PCIe
should be used. Additional info might help to understand what's going on:
1: 'lspci' output
2: DSDT and all SSDT ACPI tables (you can use 'acpidump -b' to get them).

Signed-off-by: Igor Mammedov <[email protected]>
---
drivers/pci/hotplug/acpiphp_glue.c | 18 +++++++++++++++++-
1 file changed, 17 insertions(+), 1 deletion(-)

diff --git a/drivers/pci/hotplug/acpiphp_glue.c b/drivers/pci/hotplug/acpiphp_glue.c
index 328d1e416014..9ce3fd9d72a9 100644
--- a/drivers/pci/hotplug/acpiphp_glue.c
+++ b/drivers/pci/hotplug/acpiphp_glue.c
@@ -485,7 +485,10 @@ static void enable_slot(struct acpiphp_slot *slot, bool bridge)
struct pci_bus *bus = slot->bus;
struct acpiphp_func *func;

+WARN(1, "enable_slot");
+pci_info(bus, "enable_slot bus\n");
if (bridge && bus->self && hotplug_is_native(bus->self)) {
+pr_err("enable_slot: bridge branch\n");
/*
* If native hotplug is used, it will take care of hotplug
* slot management and resource allocation for hotplug
@@ -498,8 +501,10 @@ static void enable_slot(struct acpiphp_slot *slot, bool bridge)
acpiphp_native_scan_bridge(dev);
}
} else {
+ LIST_HEAD(add_list);
int max, pass;

+pr_err("enable_slot: acpiphp_rescan_slot branch\n");
acpiphp_rescan_slot(slot);
max = acpiphp_max_busnr(bus);
for (pass = 0; pass < 2; pass++) {
@@ -508,13 +513,23 @@ static void enable_slot(struct acpiphp_slot *slot, bool bridge)
continue;

max = pci_scan_bridge(bus, dev, max, pass);
+pci_info(dev, "enable_slot: pci_scan_bridge: max: %d\n", max);
if (pass && dev->subordinate) {
check_hotplug_bridge(slot, dev);
pcibios_resource_survey_bus(dev->subordinate);
+ if (bus->self)
+ __pci_bus_size_bridges(dev->subordinate,
+ &add_list);
}
}
}
- pci_assign_unassigned_bridge_resources(bus->self);
+ if (bus->self) {
+pci_info(bus->self, "enable_slot: pci_assign_unassigned_bridge_resources:\n");
+ pci_assign_unassigned_bridge_resources(bus->self);
+ } else {
+pci_info(bus, "enable_slot: __pci_bus_assign_resources:\n");
+ __pci_bus_assign_resources(bus, &add_list, NULL);
+ }
}

acpiphp_sanitize_bus(bus);
@@ -541,6 +556,7 @@ static void enable_slot(struct acpiphp_slot *slot, bool bridge)
}
pci_dev_put(dev);
}
+pr_err("enable_slot: end\n");
}

/**
--
2.39.3


2023-07-25 03:02:17

by Woody Suwalski

[permalink] [raw]
Subject: Re: [PATCH] hack to debug acpiphp crash

Igor Mammedov wrote:
> Woody thanks for testing,
>
> can you try following patch which will try to workaround NULL bus->self if it's
> a really cuplrit and print an extra debug information.
> Add following to kernel command line(make sure that CONFIG_DYNAMIC_DEBUG is enabled):
>
> dyndbg="file drivers/pci/access.c +p; file drivers/pci/hotplug/acpiphp_glue.c +p; file drivers/pci/bus.c +p; file drivers/pci/pci.c +p; file drivers/pci/setup-bus.c +p" ignore_loglevel
>
> What I find odd in you logs is that enable_slot() is called while native PCIe
> should be used. Additional info might help to understand what's going on:
> 1: 'lspci' output
> 2: DSDT and all SSDT ACPI tables (you can use 'acpidump -b' to get them).
>
> Signed-off-by: Igor Mammedov <[email protected]>
> ---
> drivers/pci/hotplug/acpiphp_glue.c | 18 +++++++++++++++++-
> 1 file changed, 17 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/pci/hotplug/acpiphp_glue.c b/drivers/pci/hotplug/acpiphp_glue.c
> index 328d1e416014..9ce3fd9d72a9 100644
> --- a/drivers/pci/hotplug/acpiphp_glue.c
> +++ b/drivers/pci/hotplug/acpiphp_glue.c
> @@ -485,7 +485,10 @@ static void enable_slot(struct acpiphp_slot *slot, bool bridge)
> struct pci_bus *bus = slot->bus;
> struct acpiphp_func *func;
>
> +WARN(1, "enable_slot");
> +pci_info(bus, "enable_slot bus\n");
> if (bridge && bus->self && hotplug_is_native(bus->self)) {
> +pr_err("enable_slot: bridge branch\n");
> /*
> * If native hotplug is used, it will take care of hotplug
> * slot management and resource allocation for hotplug
> @@ -498,8 +501,10 @@ static void enable_slot(struct acpiphp_slot *slot, bool bridge)
> acpiphp_native_scan_bridge(dev);
> }
> } else {
> + LIST_HEAD(add_list);
> int max, pass;
>
> +pr_err("enable_slot: acpiphp_rescan_slot branch\n");
> acpiphp_rescan_slot(slot);
> max = acpiphp_max_busnr(bus);
> for (pass = 0; pass < 2; pass++) {
> @@ -508,13 +513,23 @@ static void enable_slot(struct acpiphp_slot *slot, bool bridge)
> continue;
>
> max = pci_scan_bridge(bus, dev, max, pass);
> +pci_info(dev, "enable_slot: pci_scan_bridge: max: %d\n", max);
> if (pass && dev->subordinate) {
> check_hotplug_bridge(slot, dev);
> pcibios_resource_survey_bus(dev->subordinate);
> + if (bus->self)
> + __pci_bus_size_bridges(dev->subordinate,
> + &add_list);
> }
> }
> }
> - pci_assign_unassigned_bridge_resources(bus->self);
> + if (bus->self) {
> +pci_info(bus->self, "enable_slot: pci_assign_unassigned_bridge_resources:\n");
> + pci_assign_unassigned_bridge_resources(bus->self);
> + } else {
> +pci_info(bus, "enable_slot: __pci_bus_assign_resources:\n");
> + __pci_bus_assign_resources(bus, &add_list, NULL);
> + }
> }
>
> acpiphp_sanitize_bus(bus);
> @@ -541,6 +556,7 @@ static void enable_slot(struct acpiphp_slot *slot, bool bridge)
> }
> pci_dev_put(dev);
> }
> +pr_err("enable_slot: end\n");
> }
>
> /**
Unfortunately the patch above does not seem to prevent the kernel crash.
Here comes the requested diagnostic info: dmesg's before and after,
choice of lspci's and acpi tables. Hope that will help :-)

Thanks, Woody



Attachments:
pcidebug.tar.xz (60.19 kB)

2023-07-25 08:28:16

by Igor Mammedov

[permalink] [raw]
Subject: Re: [PATCH] hack to debug acpiphp crash

On Mon, 24 Jul 2023 21:52:34 -0400
Woody Suwalski <[email protected]> wrote:

> Igor Mammedov wrote:
> > Woody thanks for testing,
> >
> > can you try following patch which will try to workaround NULL bus->self if it's
> > a really cuplrit and print an extra debug information.
> > Add following to kernel command line(make sure that CONFIG_DYNAMIC_DEBUG is enabled):
> >
> > dyndbg="file drivers/pci/access.c +p; file drivers/pci/hotplug/acpiphp_glue.c +p; file drivers/pci/bus.c +p; file drivers/pci/pci.c +p; file drivers/pci/setup-bus.c +p" ignore_loglevel
> >
> > What I find odd in you logs is that enable_slot() is called while native PCIe
> > should be used. Additional info might help to understand what's going on:
> > 1: 'lspci' output
> > 2: DSDT and all SSDT ACPI tables (you can use 'acpidump -b' to get them).
> >
> > Signed-off-by: Igor Mammedov <[email protected]>
[...]
> >
> > /**
> Unfortunately the patch above does not seem to prevent the kernel crash.
> Here comes the requested diagnostic info: dmesg's before and after,
> choice of lspci's and acpi tables. Hope that will help :-)

Looking at dmesg-6.5-debug_after.txt
there aren't "BUG: kernel NULL pointer dereference" line anymore
The call traces you see are induced by WARN(), which purpose is
to show call path that calls enable_slot().

Let me split potential fix from debug and repost that as separate
patches for you to try.
I'd like to see debug output without 'fix' to track down which
root port/device causes NULL pointer dereference. And hopefully
in a few roundtrips figure out why old code doesn't crash.

PS:
What happens is that on resume firmware (likely EC),
issues ACPI bus check on root ports which (bus check) is
wired to acpiphp module (though pciehp module was initialized
at boot to manage root ports), it's likely firmware bug.

I'd guess the intent behind this was to check if PCIe devices
were hotplugged while laptop has been asleep, and for
some reason they didn't use native PCIe hotplug to handle that.
However looking at laptop specs you can't hotplug PCIe
devices via external ports. Given how old laptop is
it isn't going to be fixed, so we would need a workaround
or fixup DSDT to skip buscheck.

The options I see is to keep old kernel as for such case,
or bail out early from bus check/enable_slot since root port
is managed by pciehp module (and let it handle hotplug).

> Thanks, Woody
>
>


2023-07-25 10:06:08

by Igor Mammedov

[permalink] [raw]
Subject: Re: [PATCH] hack to debug acpiphp crash

On Tue, 25 Jul 2023 10:06:44 +0200
Igor Mammedov <[email protected]> wrote:

> PS:
> What happens is that on resume firmware (likely EC),
> issues ACPI bus check on root ports which (bus check) is
> wired to acpiphp module (though pciehp module was initialized
> at boot to manage root ports), it's likely firmware bug.
>
> I'd guess the intent behind this was to check if PCIe devices
> were hotplugged while laptop has been asleep, and for
> some reason they didn't use native PCIe hotplug to handle that.
> However looking at laptop specs you can't hotplug PCIe
> devices via external ports. Given how old laptop is
> it isn't going to be fixed, so we would need a workaround
> or fixup DSDT to skip buscheck.
>
> The options I see is to keep old kernel as for such case,
> or bail out early from bus check/enable_slot since root port
> is managed by pciehp module (and let it handle hotplug).

scratch all of above out (it's wrong). Looking at DSDT
firmware sends Notify(rpxx, 2 /* Wake */) event. Which
according to spec needs to be handed down to the native
device driver.



2023-07-25 12:26:54

by Igor Mammedov

[permalink] [raw]
Subject: Re: [PATCH] hack to debug acpiphp crash

On Tue, 25 Jul 2023 07:45:08 -0400
Woody Suwalski <[email protected]> wrote:

> Igor Mammedov wrote:
> > On Tue, 25 Jul 2023 10:06:44 +0200
> > Igor Mammedov <[email protected]> wrote:
> >
> >> PS:
> >> What happens is that on resume firmware (likely EC),
> >> issues ACPI bus check on root ports which (bus check) is
> >> wired to acpiphp module (though pciehp module was initialized
> >> at boot to manage root ports), it's likely firmware bug.
> >>
> >> I'd guess the intent behind this was to check if PCIe devices
> >> were hotplugged while laptop has been asleep, and for
> >> some reason they didn't use native PCIe hotplug to handle that.
> >> However looking at laptop specs you can't hotplug PCIe
> >> devices via external ports. Given how old laptop is
> >> it isn't going to be fixed, so we would need a workaround
> >> or fixup DSDT to skip buscheck.
> >>
> >> The options I see is to keep old kernel as for such case,
> >> or bail out early from bus check/enable_slot since root port
> >> is managed by pciehp module (and let it handle hotplug).
> > scratch all of above out (it's wrong). Looking at DSDT
> > firmware sends Notify(rpxx, 2 /* Wake */) event. Which
> > according to spec needs to be handed down to the native
> > device driver.
> >
> >
> I agree that this laptop is a tricky one. I had to adjust my kernel
> config NOHZ just to make it suspend to ram, otherwise it was waking back
> right after going to sleep (and the same nohz kernel worked on all my
> other machines)...

Blaming laptop is likely red herring in this case after some more reading.
Anyways I've just sent a new round of patches to test.


2023-07-25 12:28:45

by Woody Suwalski

[permalink] [raw]
Subject: Re: [PATCH] hack to debug acpiphp crash

Igor Mammedov wrote:
> On Tue, 25 Jul 2023 10:06:44 +0200
> Igor Mammedov <[email protected]> wrote:
>
>> PS:
>> What happens is that on resume firmware (likely EC),
>> issues ACPI bus check on root ports which (bus check) is
>> wired to acpiphp module (though pciehp module was initialized
>> at boot to manage root ports), it's likely firmware bug.
>>
>> I'd guess the intent behind this was to check if PCIe devices
>> were hotplugged while laptop has been asleep, and for
>> some reason they didn't use native PCIe hotplug to handle that.
>> However looking at laptop specs you can't hotplug PCIe
>> devices via external ports. Given how old laptop is
>> it isn't going to be fixed, so we would need a workaround
>> or fixup DSDT to skip buscheck.
>>
>> The options I see is to keep old kernel as for such case,
>> or bail out early from bus check/enable_slot since root port
>> is managed by pciehp module (and let it handle hotplug).
> scratch all of above out (it's wrong). Looking at DSDT
> firmware sends Notify(rpxx, 2 /* Wake */) event. Which
> according to spec needs to be handed down to the native
> device driver.
>
>
I agree that this laptop is a tricky one. I had to adjust my kernel
config NOHZ just to make it suspend to ram, otherwise it was waking back
right after going to sleep (and the same nohz kernel worked on all my
other machines)...


2023-07-27 06:41:11

by Michael S. Tsirkin

[permalink] [raw]
Subject: Re: Kernel 6.5-rc2: system crash on suspend bisected

On Thu, Jul 20, 2023 at 03:21:10PM -0500, Bjorn Helgaas wrote:
> [+cc regressions list]
>
> On Wed, Jul 19, 2023 at 11:36:51PM -0400, Woody Suwalski wrote:
> > Laptop shows a kernel crash trace after a first suspend to ram, on a second
> > attempt to suspend it becomes frozen solid. This is 100% repeatable with a
> > 6.5-rc2 kernel, not happening with a 6.4 kernel - see the attached dmesg
> > output.
> >
> > I have bisected the kernel uilds and it points to :
> > [40613da52b13fb21c5566f10b287e0ca8c12c4e9] PCI: acpiphp: Reassign resources
> > on bridge if necessary
> >
> > Reversing this patch seems to fix the kernel crash problem on my laptop.
>
> Thank you very much for all your work debugging, bisecting, and
> reporting this! This is incredibly helpful.
>
> Original report, including complete dmesg logs for both v6.4 and
> v6.5-rc2:
> https://lore.kernel.org/r/[email protected]
>
> I queued up a revert of 40613da52b13 ("PCI: acpiphp: Reassign
> resources on bridge if necessary") (on my for-linus branch for v6.5).
>
> It looks like a NULL pointer dereference; hopefully the fix is obvious
> and I can drop the revert and replace it with the fix.
>
> Bjorn

Patch on list now:
https://lore.kernel.org/all/20230726123518.2361181-1-imammedo%40redhat.com

--
MST


2023-07-27 12:24:00

by Woody Suwalski

[permalink] [raw]
Subject: Re: Kernel 6.5-rc2: system crash on suspend bisected

Michael S. Tsirkin wrote:
> On Thu, Jul 20, 2023 at 03:21:10PM -0500, Bjorn Helgaas wrote:
>> [+cc regressions list]
>>
>> On Wed, Jul 19, 2023 at 11:36:51PM -0400, Woody Suwalski wrote:
>>> Laptop shows a kernel crash trace after a first suspend to ram, on a second
>>> attempt to suspend it becomes frozen solid. This is 100% repeatable with a
>>> 6.5-rc2 kernel, not happening with a 6.4 kernel - see the attached dmesg
>>> output.
>>>
>>> I have bisected the kernel uilds and it points to :
>>> [40613da52b13fb21c5566f10b287e0ca8c12c4e9] PCI: acpiphp: Reassign resources
>>> on bridge if necessary
>>>
>>> Reversing this patch seems to fix the kernel crash problem on my laptop.
>> Thank you very much for all your work debugging, bisecting, and
>> reporting this! This is incredibly helpful.
>>
>> Original report, including complete dmesg logs for both v6.4 and
>> v6.5-rc2:
>> https://lore.kernel.org/r/[email protected]
>>
>> I queued up a revert of 40613da52b13 ("PCI: acpiphp: Reassign
>> resources on bridge if necessary") (on my for-linus branch for v6.5).
>>
>> It looks like a NULL pointer dereference; hopefully the fix is obvious
>> and I can drop the revert and replace it with the fix.
>>
>> Bjorn
> Patch on list now:
> https://lore.kernel.org/all/20230726123518.2361181-1-imammedo%40redhat.com
Confirm works OK.
--
Tested-by: Woody Suwalski <[email protected]>
--

Thanks, Woody