2023-07-25 12:29:26

by Igor Mammedov

Subject: [RFC 0/3] acpipcihp: fix kernel crash on 2nd resume


Changelog:
* split out debug patch into a separate one with extra printk added
* fixed inverted bus->self check (probably a reason why it didn't work before)


1/3 debug patch
2/3 offending patch
3/3 potential fix

I added more files to trace; add the following to the kernel command line:
dyndbg="file drivers/pci/access.c +p; file drivers/pci/hotplug/acpiphp_glue.c +p; file drivers/pci/bus.c +p; file drivers/pci/pci.c +p; file drivers/pci/setup-bus.c +p; file drivers/acpi/bus.c +p" ignore_loglevel

should be applied on top of
e8afd0d9fccc PCI: pciehp: Cancel bringup sequence if card is not present

Apply the patches one by one, then run the testcase and capture dmesg after each patch.
One should end up with 3 dmesg logs to analyse:
1st - old behaviour - no crash
2nd - crash
3rd - no crash hopefully

Igor Mammedov (3):
acpiphp: extra debug hack
PCI: acpiphp: Reassign resources on bridge if necessary
acpipcihp: use __pci_bus_assign_resources() if bus doesn't have bridge

drivers/pci/hotplug/acpiphp_glue.c | 23 ++++++++++++++++++-----
1 file changed, 18 insertions(+), 5 deletions(-)

--
2.39.3



2023-07-25 14:21:00

by Woody Suwalski

Subject: Re: [RFC 0/3] acpipcihp: fix kernel crash on 2nd resume

Igor Mammedov wrote:
> Changelog:
> * split out debug patch into a separate one with extra printk added
> * fixed inverted bus->self check (probably a reason why it didn't work before)
>
>
> 1/3 debug patch
> 2/3 offending patch
> 3/3 potential fix
>
> I added more files to trace, add following to kernel CLI
> dyndbg="file drivers/pci/access.c +p; file drivers/pci/hotplug/acpiphp_glue.c +p; file drivers/pci/bus.c +p; file drivers/pci/pci.c +p; file drivers/pci/setup-bus.c +p; file drivers/acpi/bus.c +p" ignore_loglevel
>
> should be applied on top of
> e8afd0d9fccc PCI: pciehp: Cancel bringup sequence if card is not present
>
> apply a patch one by one and run testcase + capture dmesg after each patch
> one should end up with 3 dmesg to analyse
> 1st - old behaviour - no crash
> 2nd - crash
> 3rd - no crash hopefully
>
> Igor Mammedov (3):
> acpiphp: extra debug hack
> PCI: acpiphp: Reassign resources on bridge if necessary
> acpipcihp: use __pci_bus_assign_resources() if bus doesn't have bridge
>
> drivers/pci/hotplug/acpiphp_glue.c | 23 ++++++++++++++++++-----
> 1 file changed, 18 insertions(+), 5 deletions(-)
>
Actually, applying patch1 alone already triggers the crash (why???), so I
have also added dmesg-6.5-0.txt, which shows the working state at
git e8afd0d9fccc (acpiphp_glue as in kernel 6.4).

Patch3 did not fix the issue; it seems that the culprit is somewhere
else, triggered by the "benign" patch1 :-(

Also, a note on the trigger description in patch3: the dmesg trace on
the Inspiron laptop is collected after the first wake from suspend to RAM.
The subsequent attempt to sleep results in a frozen system.

Thanks, Woody


Attachments:
rfc.tar.xz (34.71 kB)

2023-07-25 15:44:50

by Igor Mammedov

Subject: Re: [RFC 0/3] acpipcihp: fix kernel crash on 2nd resume

On Tue, 25 Jul 2023 09:51:53 -0400
Woody Suwalski wrote:

> Igor Mammedov wrote:
> > Changelog:
> > * split out debug patch into a separate one with extra printk added
> > * fixed inverted bus->self check (probably a reason why it didn't work before)
> >
> >
> > 1/3 debug patch
> > 2/3 offending patch
> > 3/3 potential fix
> >
> > I added more files to trace, add following to kernel CLI
> > dyndbg="file drivers/pci/access.c +p; file drivers/pci/hotplug/acpiphp_glue.c +p; file drivers/pci/bus.c +p; file drivers/pci/pci.c +p; file drivers/pci/setup-bus.c +p; file drivers/acpi/bus.c +p" ignore_loglevel
> >
> > should be applied on top of
> > e8afd0d9fccc PCI: pciehp: Cancel bringup sequence if card is not present
> >
> > apply a patch one by one and run testcase + capture dmesg after each patch
> > one should end up with 3 dmesg to analyse
> > 1st - old behaviour - no crash
> > 2nd - crash
> > 3rd - no crash hopefully
> >
> > Igor Mammedov (3):
> > acpiphp: extra debug hack
> > PCI: acpiphp: Reassign resources on bridge if necessary
> > acpipcihp: use __pci_bus_assign_resources() if bus doesn't have bridge
> >
> > drivers/pci/hotplug/acpiphp_glue.c | 23 ++++++++++++++++++-----
> > 1 file changed, 18 insertions(+), 5 deletions(-)
> >
> Actually applying patch1 is already creating the crash (why???),
It's probably due to an extra debug line I've added.
I dropped the suspicious one; can you try again and see if it works?

> hence I
> have added also dmesg-6.5-0.txt which shows a working condition based on
> git e8afd0d9fccc level (acpiphp_glue in kernel 6.4)
>
> Patch3 did not fix the issue, it seems that the culprit is somewhere
> else triggered by  "benign" patch1 :-(
>
> Also note about the trigger description in patch3: the dmesg trace on
> Inspiron laptop is collected after the first wake from suspend to ram.
> The consecutive  attempt to sleep results in a frozen system.

Thanks for the clarification; I'll correct the commit message once the culprit
is found.

>
> Thanks, Woody
>


2023-07-25 16:05:49

by Woody Suwalski

Subject: Re: [RFC 0/3] acpipcihp: fix kernel crash on 2nd resume

Woody Suwalski wrote:
> Igor Mammedov wrote:
>> Changelog:
>>    * split out debug patch into a separate one with extra printk added
>>    * fixed inverted bus->self check (probably a reason why it didn't
>> work before)
>>
>>
>> 1/3 debug patch
>> 2/3 offending patch
>> 3/3 potential fix
>>    I added more files to trace, add following to kernel CLI
>>     dyndbg="file drivers/pci/access.c +p; file
>> drivers/pci/hotplug/acpiphp_glue.c +p; file drivers/pci/bus.c +p;
>> file drivers/pci/pci.c +p; file drivers/pci/setup-bus.c +p; file
>> drivers/acpi/bus.c +p" ignore_loglevel
>>
>> should be applied on top of
>>     e8afd0d9fccc PCI: pciehp: Cancel bringup sequence if card is not
>> present
>>
>> apply a patch one by one and run testcase + capture dmesg after each
>> patch
>> one should end up with 3 dmesg to analyse
>>   1st - old behaviour - no crash
>>   2nd - crash
>>   3rd - no crash hopefully
>>
>> Igor Mammedov (3):
>>    acpiphp: extra debug hack
>>    PCI: acpiphp: Reassign resources on bridge if necessary
>>    acpipcihp: use __pci_bus_assign_resources() if bus doesn't have
>> bridge
>>
>>   drivers/pci/hotplug/acpiphp_glue.c | 23 ++++++++++++++++++-----
>>   1 file changed, 18 insertions(+), 5 deletions(-)
>>
> Actually applying patch1 is already creating the crash (why???), hence
> I have added also dmesg-6.5-0.txt which shows a working condition
> based on git e8afd0d9fccc level (acpiphp_glue in kernel 6.4)
>
> Patch3 did not fix the issue, it seems that the culprit is somewhere
> else triggered by  "benign" patch1 :-(
>
> Also note about the trigger description in patch3: the dmesg trace on
> Inspiron laptop is collected after the first wake from suspend to ram.
> The consecutive  attempt to sleep results in a frozen system.
>
> Thanks, Woody
>
I think that in patch1 there is a problem in your debug statement
acpi_handle_debug(...slot_name...) - it is masking the "old" issue.
When I commented out that line in hotplug_event(), it worked OK (as
was expected). I will redo the testing in ~2 hours...

Woody


2023-07-25 16:22:08

by Woody Suwalski

Subject: Re: [RFC 0/3] acpipcihp: fix kernel crash on 2nd resume

Igor Mammedov wrote:
> On Tue, 25 Jul 2023 09:51:53 -0400
> Woody Suwalski wrote:
>
>> Igor Mammedov wrote:
>>> Changelog:
>>> * split out debug patch into a separate one with extra printk added
>>> * fixed inverted bus->self check (probably a reason why it didn't work before)
>>>
>>>
>>> 1/3 debug patch
>>> 2/3 offending patch
>>> 3/3 potential fix
>>>
>>> I added more files to trace, add following to kernel CLI
>>> dyndbg="file drivers/pci/access.c +p; file drivers/pci/hotplug/acpiphp_glue.c +p; file drivers/pci/bus.c +p; file drivers/pci/pci.c +p; file drivers/pci/setup-bus.c +p; file drivers/acpi/bus.c +p" ignore_loglevel
>>>
>>> should be applied on top of
>>> e8afd0d9fccc PCI: pciehp: Cancel bringup sequence if card is not present
>>>
>>> apply a patch one by one and run testcase + capture dmesg after each patch
>>> one should end up with 3 dmesg to analyse
>>> 1st - old behaviour - no crash
>>> 2nd - crash
>>> 3rd - no crash hopefully
>>>
>>> Igor Mammedov (3):
>>> acpiphp: extra debug hack
>>> PCI: acpiphp: Reassign resources on bridge if necessary
>>> acpipcihp: use __pci_bus_assign_resources() if bus doesn't have bridge
>>>
>>> drivers/pci/hotplug/acpiphp_glue.c | 23 ++++++++++++++++++-----
>>> 1 file changed, 18 insertions(+), 5 deletions(-)
>>>
>> Actually applying patch1 is already creating the crash (why???),
> probably it's due to an extra debug line, I've added.
> I dropped suspicious one, can you try again and see if it works.
>
>> hence I
>> have added also dmesg-6.5-0.txt which shows a working condition based on
>> git e8afd0d9fccc level (acpiphp_glue in kernel 6.4)
>>
>> Patch3 did not fix the issue, it seems that the culprit is somewhere
>> else triggered by  "benign" patch1 :-(
>>
>> Also note about the trigger description in patch3: the dmesg trace on
>> Inspiron laptop is collected after the first wake from suspend to ram.
>> The consecutive  attempt to sleep results in a frozen system.
> Thanks for clarification, I'll correct commit message once culprit
> is found.
>
Good news. After removing the botched debug statement, which was masking
the original issue, the testing went as you predicted, and with patch
3 the system suspends to RAM OK.

Here are the 3 requested dmesg outputs; #2 is from the bad run.

I can retest with a final version of the patch once you have it ready...

Thanks, Woody


Attachments:
rfc1.tar.xz (31.52 kB)

2023-07-26 09:05:43

by Igor Mammedov

Subject: Re: [RFC 0/3] acpipcihp: fix kernel crash on 2nd resume

On Tue, 25 Jul 2023 11:59:56 -0400
Woody Suwalski wrote:

> Igor Mammedov wrote:
> > On Tue, 25 Jul 2023 09:51:53 -0400
> > Woody Suwalski wrote:
> >
> >> Igor Mammedov wrote:
> >>> Changelog:
> >>> * split out debug patch into a separate one with extra printk added
> >>> * fixed inverted bus->self check (probably a reason why it didn't work before)
> >>>
> >>>
> >>> 1/3 debug patch
> >>> 2/3 offending patch
> >>> 3/3 potential fix
> >>>
> >>> I added more files to trace, add following to kernel CLI
> >>> dyndbg="file drivers/pci/access.c +p; file drivers/pci/hotplug/acpiphp_glue.c +p; file drivers/pci/bus.c +p; file drivers/pci/pci.c +p; file drivers/pci/setup-bus.c +p; file drivers/acpi/bus.c +p" ignore_loglevel
> >>>
> >>> should be applied on top of
> >>> e8afd0d9fccc PCI: pciehp: Cancel bringup sequence if card is not present
> >>>
> >>> apply a patch one by one and run testcase + capture dmesg after each patch
> >>> one should end up with 3 dmesg to analyse
> >>> 1st - old behaviour - no crash
> >>> 2nd - crash
> >>> 3rd - no crash hopefully
> >>>
> >>> Igor Mammedov (3):
> >>> acpiphp: extra debug hack
> >>> PCI: acpiphp: Reassign resources on bridge if necessary
> >>> acpipcihp: use __pci_bus_assign_resources() if bus doesn't have bridge
> >>>
> >>> drivers/pci/hotplug/acpiphp_glue.c | 23 ++++++++++++++++++-----
> >>> 1 file changed, 18 insertions(+), 5 deletions(-)
> >>>
> >> Actually applying patch1 is already creating the crash (why???),
> > probably it's due to an extra debug line, I've added.
> > I dropped suspicious one, can you try again and see if it works.
> >
> >> hence I
> >> have added also dmesg-6.5-0.txt which shows a working condition based on
> >> git e8afd0d9fccc level (acpiphp_glue in kernel 6.4)
> >>
> >> Patch3 did not fix the issue, it seems that the culprit is somewhere
> >> else triggered by  "benign" patch1 :-(
> >>
> >> Also note about the trigger description in patch3: the dmesg trace on
> >> Inspiron laptop is collected after the first wake from suspend to ram.
> >> The consecutive  attempt to sleep results in a frozen system.
> > Thanks for clarification, I'll correct commit message once culprit
> > is found.
> >
> Good news. After removing the botched debug statement which was masking
> the original issue, the testing went as you have predicted, and on patch
> 3 system suspends to RAM OK.
Thanks for the confirmation,
I'll post a cleaned-up 3/3 patch today.

>
> Here are the requested 3 dmesg outputs, #2 is for the bad run.
>
> I can retest with a final version of the patch once you have it ready...
>
> Thanks, Woody
>