2023-05-28 07:20:25

by Salvatore Bonaccorso

Subject: Regression from "ACPI: OSI: Remove Linux-Dell-Video _OSI string"? (was: Re: Bug#1036530: linux-signed-amd64: Hard lock up of system)

Hi Mario

Nick Hastings reported in Debian in https://bugs.debian.org/1036530
lockups from his system after updating from a 6.0 based version to
6.1.y.

#regzbot ^introduced 24867516f06d

he bisected the issue and tracked it down to:

On Sun, May 28, 2023 at 10:14:51AM +0900, Nick Hastings wrote:
> Control: tags -1 - moreinfo
>
> Hi,
>
> I repeated the git bisect, and the bad commit seems to be:
>
> (git)-[v6.1-rc1~206^2~4^5~3|bisect] % git bisect bad
> 24867516f06dabedef3be7eea0ef0846b91538bc is the first bad commit
> commit 24867516f06dabedef3be7eea0ef0846b91538bc
> Author: Mario Limonciello <[email protected]>
> Date: Tue Aug 23 13:51:31 2022 -0500
>
> ACPI: OSI: Remove Linux-Dell-Video _OSI string
>
> This string was introduced because drivers for NVIDIA hardware
> had bugs supporting RTD3 in the past.
>
> Before proprietary NVIDIA driver started to support RTD3, Ubuntu had
> had a mechanism for switching PRIME on and off, though it had required
> to logout/login to make the library switch happen.
>
> When the PRIME had been off, the mechanism had unloaded the NVIDIA
> driver and put the device into D3cold, but the GPU had never come back
> to D0 again which is why ODMs used the _OSI to expose an old _DSM
> method to switch the power on/off.
>
> That has been fixed by commit 5775b843a619 ("PCI: Restore config space
> on runtime resume despite being unbound"). so vendors shouldn't be
> using this string to modify ASL any more.
>
> Reviewed-by: Lyude Paul <[email protected]>
> Signed-off-by: Mario Limonciello <[email protected]>
> Signed-off-by: Rafael J. Wysocki <[email protected]>
>
> drivers/acpi/osi.c | 9 ---------
> 1 file changed, 9 deletions(-)
>
> This machine is a Dell with an nvidia chip so it looks like this really
> could be the commit that is causing the problems. The description
> of the commit also seems (to my untrained eye) to be consistent with the
> error reported on the console when the lockup occurs:
>
> [ 58.729863] ACPI Error: Aborting method \_SB.PCI0.PGON due to previous error (AE_AML_LOOP_TIMEOUT) (20220331/psparse-529)
> [ 58.729904] ACPI Error: Aborting method \_SB.PCI0.PEG0.PG00._ON due to previous error (AE_AML_LOOP_TIMEOUT) (20220331/psparse-529)
> [ 60.083261] vfio-pci 0000:01:00.0 Unable to change power state from D3cold to D0, device inaccessible
>
> Hopefully this is enough information for experts to resolve this.

Does this ring some bell for you? Do you need any further information
from Nick?

Regards,
Salvatore


2023-05-28 13:03:18

by Mario Limonciello

Subject: Re: Regression from "ACPI: OSI: Remove Linux-Dell-Video _OSI string"? (was: Re: Bug#1036530: linux-signed-amd64: Hard lock up of system)

Hi Salvatore,

Have Nick try using "pcie_port_pm=off" and see if it helps the issue.

Does this happen in the latest 6.4 RC as well?

I think we need to see a full dmesg and acpidump to better characterize it.
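
For reference, one way to confirm that a test boot really carried the suggested option is to inspect the kernel command line, which Linux exposes at /proc/cmdline. The following is a minimal Python sketch, not something posted in the thread. The helper names parse_cmdline and has_param are made up for illustration, and the plain whitespace split ignores quoted values that contain spaces.

```python
from typing import Dict, Optional


def parse_cmdline(cmdline: str) -> Dict[str, Optional[str]]:
    """Split a kernel command line into {name: value}.

    Bare flags such as "quiet" map to None. Quoted values containing
    spaces are NOT handled; this is a simplification for illustration.
    """
    params: Dict[str, Optional[str]] = {}
    for token in cmdline.split():
        name, sep, value = token.partition("=")
        params[name] = value if sep else None
    return params


def has_param(cmdline: str, name: str, value: str) -> bool:
    """Return True if `name=value` appears on the given command line."""
    return parse_cmdline(cmdline).get(name) == value
```

On the affected machine this could be called as has_param(open("/proc/cmdline").read(), "pcie_port_pm", "off") after rebooting with the option set.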

2023-05-29 01:24:44

by Nick Hastings

Subject: Re: Regression from "ACPI: OSI: Remove Linux-Dell-Video _OSI string"? (was: Re: Bug#1036530: linux-signed-amd64: Hard lock up of system)

Hi,

* Mario Limonciello <[email protected]> [230528 21:44]:
> Have Nick try using "pcie_port_pm=off" and see if it helps the issue.

I booted into a 6.1 kernel with this option. It has been running without
problems for 1.5 hours. Usually I would expect the lockup to have
occurred by now.

> Does this happen in the latest 6.4 RC as well?

I have compiled that kernel and will boot into it after running this one
with the pcie_port_pm=off for another hour or so.

> I think we need to see a full dmesg and acpidump to better
> characterize it.

Please find attached. Let me know if there is anything else I can provide.

Regards,

Nick.


Attachments:
(No filename) (3.54 kB)
dmesg-20230529T082455.log.gz (24.82 kB)
acpidump-20230529T082605.log.gz (386.91 kB)

2023-05-29 01:24:53

by Mario Limonciello

Subject: Re: Regression from "ACPI: OSI: Remove Linux-Dell-Video _OSI string"? (was: Re: Bug#1036530: linux-signed-amd64: Hard lock up of system)

On 5/28/23 19:56, Nick Hastings wrote:
> Hi,
>
> * Mario Limonciello <[email protected]> [230528 21:44]:
>> Have Nick try using "pcie_port_pm=off" and see if it helps the issue.
>
> I booted into a 6.1 kernel with this option. It has been running without
> problems for 1.5 hours. Usually I would expect the lockup to have
> occurred by now.
>
>> Does this happen in the latest 6.4 RC as well?
>
> I have compiled that kernel and will boot into it after running this one
> with the pcie_port_pm=off for another hour or so.
>
>> I think we need to see a full dmesg and acpidump to better
>> characterize it.
>
> Please find attached. Let me know if there is anything else I can provide.
>
> Regards,
>
> Nick.

I don't see nouveau loading, are you explicitly preventing it from
loading? Can I see the journal from a boot when it reproduced?

2023-05-29 03:56:53

by Nick Hastings

Subject: Re: Regression from "ACPI: OSI: Remove Linux-Dell-Video _OSI string"? (was: Re: Bug#1036530: linux-signed-amd64: Hard lock up of system)

* Mario Limonciello <[email protected]> [230529 10:14]:
> On 5/28/23 19:56, Nick Hastings wrote:
> > Hi,
> >
> > * Mario Limonciello <[email protected]> [230528 21:44]:
> > > Have Nick try using "pcie_port_pm=off" and see if it helps the issue.
> >
> > I booted into a 6.1 kernel with this option. It has been running without
> > problems for 1.5 hours. Usually I would expect the lockup to have
> > occurred by now.

I let this run for 3 hours without issue.

> > > Does this happen in the latest 6.4 RC as well?
> >
> > I have compiled that kernel and will boot into it after running this one
> > with the pcie_port_pm=off for another hour or so.

I'm now running 6.4.0-rc4 without seeing the problem after 1 hour.

I did however see two unrelated problems that I include here for
completeness:
1. iwlwifi module did not automatically load
2. Xwayland used a huge amount of CPU even though I was not running any X
programs. Recompiling my wayland compositor without XWayland support
"fixed" this.

> > > I think we need to see a full dmesg and acpidump to better
> > > characterize it.
> >
> > Please find attached. Let me know if there is anything else I can provide.
> >
> > Regards,
> >
> > Nick.
>
> I don't see nouveau loading, are you explicitly preventing it from
> loading?

Yes nouveau is blacklisted.

> Can I see the journal from a boot when it reproduced?

Hmm not sure which n for "journalctl -b n" maps to which kernel (is that
what you are requesting?). The commit hash does not seem to be
listed. I may have to boot into a bad kernel again.

Regards,

Nick.



2023-05-29 23:37:35

by Nick Hastings

Subject: Re: Regression from "ACPI: OSI: Remove Linux-Dell-Video _OSI string"? (was: Re: Bug#1036530: linux-signed-amd64: Hard lock up of system)

Hi,

* Nick Hastings <[email protected]> [230529 12:51]:
> * Mario Limonciello <[email protected]> [230529 10:14]:
> > On 5/28/23 19:56, Nick Hastings wrote:
> > > Hi,
> > >
> > > * Mario Limonciello <[email protected]> [230528 21:44]:
> > > > Have Nick try using "pcie_port_pm=off" and see if it helps the issue.
> > >
> > > I booted into a 6.1 kernel with this option. It has been running without
> > > problems for 1.5 hours. Usually I would expect the lockup to have
> > > occurred by now.
>
> I let this run for 3 hours without issue.
>
> > > > Does this happen in the latest 6.4 RC as well?
> > >
> > > I have compiled that kernel and will boot into it after running this one
> > > with the pcie_port_pm=off for another hour or so.
>
> I'm now running 6.4.0-rc4 without seeing the problem after 1 hour.

I did eventually see a lockup of this kernel. On the console I saw:

[ 151.035036] vfio-pci 0000:01:00.0 Unable to change power state from D3cold to D0, device inaccessible

I did not see the other two lines that were present in earlier lock ups

> I did however see two unrelated problems that I include here for
> completeness:
> 1. iwlwifi module did not automatically load
> 2. Xwayland used a huge amount of CPU even though I was not running any X
> programs. Recompiling my wayland compositor without XWayland support
> "fixed" this.
>
> > > > I think we need to see a full dmesg and acpidump to better
> > > > characterize it.
> > >
> > > Please find attached. Let me know if there is anything else I can provide.
> > >
> > > Regards,
> > >
> > > Nick.
> >
> > I don't see nouveau loading, are you explicitly preventing it from
> > loading?
>
> Yes nouveau is blacklisted.
>
> > Can I see the journal from a boot when it reproduced?
>
> Hmm not sure which n for "journalctl -b n" maps to which kernel (is that
> what you are requesting?). The commit hash does not seem to be
> listed. I may have to boot into a bad kernel again.

Please find attached the output from a "journalctl --system -bN" for a
kernel that has this issue.

Regards,

Nick.
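
As an aside for anyone comparing boots, the console messages quoted in this thread can be picked out of a saved dmesg or journal dump mechanically. The sketch below is only an illustration and was not posted in the thread. The function name scan_log is invented here, the patterns are modelled on the three lines from the original report, and treating the colon after the PCI address as optional is an assumption about how other logs might format the line.

```python
import re

# Matches the ACPI method name in lines such as:
#   ACPI Error: Aborting method \_SB.PCI0.PGON due to previous error (AE_AML_LOOP_TIMEOUT)
ACPI_TIMEOUT_RE = re.compile(
    r"ACPI Error: Aborting method (\S+) due to previous error "
    r"\(AE_AML_LOOP_TIMEOUT\)"
)

# Matches the driver and PCI address in lines such as:
#   vfio-pci 0000:01:00.0 Unable to change power state from D3cold to D0
# The colon after the address is optional (assumption, see lead-in).
D3COLD_RE = re.compile(
    r"(\S+) (\S+?):? Unable to change power state from D3cold to D0"
)


def scan_log(text):
    """Return the ACPI methods that hit AE_AML_LOOP_TIMEOUT and the
    (driver, device) pairs that failed to leave D3cold, in log order."""
    return {
        "acpi_timeout_methods": ACPI_TIMEOUT_RE.findall(text),
        "d3cold_failures": D3COLD_RE.findall(text),
    }
```

Run against the three lines from the original report, this should list both ACPI methods plus the vfio-pci device. Run against the single line seen on 6.4.0-rc4, it should list only the device, which matches the difference described above.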


Attachments:
(No filename) (5.38 kB)
journalctl.log.gz (45.65 kB)

2023-05-30 04:21:20

by Mario Limonciello

Subject: Re: Regression from "ACPI: OSI: Remove Linux-Dell-Video _OSI string"? (was: Re: Bug#1036530: linux-signed-amd64: Hard lock up of system)

On 5/29/23 18:01, Nick Hastings wrote:
> Hi,
>
> * Nick Hastings <[email protected]> [230529 12:51]:
>> * Mario Limonciello <[email protected]> [230529 10:14]:
>>> On 5/28/23 19:56, Nick Hastings wrote:
>>>> Hi,
>>>>
>>>> * Mario Limonciello <[email protected]> [230528 21:44]:
>>>>> Have Nick try using "pcie_port_pm=off" and see if it helps the issue.
>>>>
>>>> I booted into a 6.1 kernel with this option. It has been running without
>>>> problems for 1.5 hours. Usually I would expect the lockup to have
>>>> occurred by now.
>>
>> I let this run for 3 hours without issue.
>>
>>>>> Does this happen in the latest 6.4 RC as well?
>>>>
>>>> I have compiled that kernel and will boot into it after running this one
>>>> with the pcie_port_pm=off for another hour or so.
>>
>> I'm now running 6.4.0-rc4 without seeing the problem after 1 hour.
>
> I did eventually see a lockup of this kernel. On the console I saw:
>
> [ 151.035036] vfio-pci 0000:01:00.0 Unable to change power state from D3cold to D0, device inaccessible
>
> I did not see the other two lines that were present in earlier lock ups
>
>> I did however see two unrelated problems that I include here for
>> completeness:
>> 1. iwlwifi module did not automatically load
>> 2. Xwayland used a huge amount of CPU even though I was not running any X
>> programs. Recompiling my wayland compositor without XWayland support
>> "fixed" this.
>>
>>>>> I think we need to see a full dmesg and acpidump to better
>>>>> characterize it.
>>>>
>>>> Please find attached. Let me know if there is anything else I can provide.
>>>>
>>>> Regards,
>>>>
>>>> Nick.
>>>
>>> I don't see nouveau loading, are you explicitly preventing it from
>>> loading?
>>
>> Yes nouveau is blacklisted.
>>
>>> Can I see the journal from a boot when it reproduced?
>>
>> Hmm not sure which n for "journalctl -b n" maps to which kernel (is that
>> what you are requesting?). The commit hash does not seem to be
>> listed. I may have to boot into a bad kernel again.
>
> Please find attached the output from a "journalctl --system -bN" for a
> kernel that has this issue.
>
> Regards,
>
> Nick.

In this log I see nouveau loaded, but I also don't see the failure
occurring.

As you're actually loading nouveau, can you please try nouveau.runpm=0
on the kernel command line?

If that helps the issue, I strongly suggest you cross reference the
latest kernel to see if this bug still exists.

2023-05-30 07:13:02

by Nick Hastings

Subject: Re: Regression from "ACPI: OSI: Remove Linux-Dell-Video _OSI string"? (was: Re: Bug#1036530: linux-signed-amd64: Hard lock up of system)

Hi,

* Mario Limonciello <[email protected]> [230530 13:00]:
> On 5/29/23 18:01, Nick Hastings wrote:
> > Hi,
> >
> > * Nick Hastings <[email protected]> [230529 12:51]:
> > > * Mario Limonciello <[email protected]> [230529 10:14]:
> > > > On 5/28/23 19:56, Nick Hastings wrote:
> > > > > Hi,
> > > > >
> > > > > * Mario Limonciello <[email protected]> [230528 21:44]:
> > > > > > On 5/28/23 01:49, Salvatore Bonaccorso wrote:
> > > > > > > Hi Mario
> > > > > > >
> > > > > > > Nick Hastings reported in Debian in https://bugs.debian.org/1036530
> > > > > > > lockups from his system after updating from a 6.0 based version to
> > > > > > > 6.1.y. >
> > > > > > > #regzbot ^introduced 24867516f06d
> > > > > > >
> > > > > > > he bisected the issue and tracked it down to:
> > > > > > >
> > > > > > > On Sun, May 28, 2023 at 10:14:51AM +0900, Nick Hastings wrote:
> > > > > > > > Control: tags -1 - moreinfo
> > > > > > > >
> > > > > > > > Hi,
> > > > > > > >
> > > > > > > > I repeated the git bisect, and the bad commit seems to be:
> > > > > > > >
> > > > > > > > (git)-[v6.1-rc1~206^2~4^5~3|bisect] % git bisect bad
> > > > > > > > 24867516f06dabedef3be7eea0ef0846b91538bc is the first bad commit
> > > > > > > > commit 24867516f06dabedef3be7eea0ef0846b91538bc
> > > > > > > > Author: Mario Limonciello <[email protected]>
> > > > > > > > Date: Tue Aug 23 13:51:31 2022 -0500
> > > > > > > >
> > > > > > > > ACPI: OSI: Remove Linux-Dell-Video _OSI string
> > > > > > > > This string was introduced because drivers for NVIDIA hardware
> > > > > > > > had bugs supporting RTD3 in the past.
> > > > > > > > Before proprietary NVIDIA driver started to support RTD3, Ubuntu had
> > > > > > > > had a mechanism for switching PRIME on and off, though it had required
> > > > > > > > to logout/login to make the library switch happen.
> > > > > > > > When the PRIME had been off, the mechanism had unloaded the NVIDIA
> > > > > > > > driver and put the device into D3cold, but the GPU had never come back
> > > > > > > > to D0 again which is why ODMs used the _OSI to expose an old _DSM
> > > > > > > > method to switch the power on/off.
> > > > > > > > That has been fixed by commit 5775b843a619 ("PCI: Restore config space
> > > > > > > > on runtime resume despite being unbound"). so vendors shouldn't be
> > > > > > > > using this string to modify ASL any more.
> > > > > > > > Reviewed-by: Lyude Paul <[email protected]>
> > > > > > > > Signed-off-by: Mario Limonciello <[email protected]>
> > > > > > > > Signed-off-by: Rafael J. Wysocki <[email protected]>
> > > > > > > >
> > > > > > > > drivers/acpi/osi.c | 9 ---------
> > > > > > > > 1 file changed, 9 deletions(-)
> > > > > > > >
> > > > > > > > This machine is a Dell with an nvidia chip so it looks like this really
> > > > > > > > could be the commit that is causing the problems. The description
> > > > > > > > of the commit also seems (to my untrained eye) to be consistent with the
> > > > > > > > error reported on the console when the lockup occurs:
> > > > > > > >
> > > > > > > > [ 58.729863] ACPI Error: Aborting method \_SB.PCI0.PGON due to previous error (AE_AML_LOOP_TIMEOUT) (20220331/psparse-529)
> > > > > > > > [ 58.729904] ACPI Error: Aborting method \_SB.PCI0.PEG0.PG00._ON due to previous error (AE_AML_LOOP_TIMEOUT) (20220331/psparse-529)
> > > > > > > > [ 60.083261] vfio-pci 0000:01:00.0 Unable to change power state from D3cold to D0, device inaccessible
> > > > > > > >
> > > > > > > > Hopefully this is enough information for experts to resolve this.
> > > > > > >
> > > > > > > Does this ring some bell for you? Do you need any further information
> > > > > > > from Nick?
> > > > > > >
> > > > > > > Regards,
> > > > > > > Salvatore
> > > > > >
> > > > >
> > > > > > Have Nick try using "pcie_port_pm=off" and see if it helps the issue.
> > > > >
> > > > > I booted into a 6.1 kernel with this option. It has been running without
> > > > > problems for 1.5 hours. Usually I would expect the lockup to have
> > > > > occurred by now.
> > >
> > > I let this run for 3 hours without issue.
> > >
> > > > > > Does this happen in the latest 6.4 RC as well?
> > > > >
> > > > > I have compiled that kernel and will boot into it after running this one
> > > > > with the pcie_port_pm=off for another hour or so.
> > >
> > > I'm now running 6.4.0-rc4 without seeing the problem after 1 hour.
> >
> > I did eventually see a lockup of this kernel. On the console I saw:
> >
> > [ 151.035036] vfio-pci 0000:01:00.0 Unable to change power state from D3cold to D0, device inaccessible
> >
> > I did not see the other two lines that were present in earlier lock ups.
> >
> > > I did however see two unrelated problems that I include here for
> > > completeness:
> > > 1. iwlwifi module did not automatically load
> > > 2. Xwayland used huge amount of CPU even though was not running any X
> > > programs. Recompiling my wayland compositor without XWayland support
> > > "fixed" this.
> > >
> > > > > > I think we need to see a full dmesg and acpidump to better
> > > > > > characterize it.
> > > > >
> > > > > Please find attached. Let me know if there is anything else I can provide.
> > > > >
> > > > > Regards,
> > > > >
> > > > > Nick.
> > > >
> > > > I don't see nouveau loading, are you explicitly preventing it from
> > > > loading?
> > >
> > > Yes nouveau is blacklisted.
> > >
> > > > Can I see the journal from a boot when it reproduced?
> > >
> > > Hmm not sure which n for "journalctl -b n" maps to which kernel (is that
> > > what you are requesting?). The commit hash does not seem to be
> > > listed. I may have to boot into a bad kernel again.
> >
> > Please find attached the output from a "journalctl --system -bN" for a
> > kernel that has this issue.
> >
> > Regards,
> >
> > Nick.
>
> In this log I see nouveau loaded, but I also don't see the failure
> occurring.

I never saw anything in the logs from a lockup either. I had assumed it
was no longer able to write to disk. The failure did occur on that
occasion.

> As you're actually loading nouveau, can you please try nouveau.runpm=0 on
> the kernel command line?

I'm not intentionally loading it. This machine also has intel graphics
which is what I prefer. Checking my
/etc/modprobe.d/blacklist-nvidia-nouveau.conf
I see:

blacklist nvidia
blacklist nvidia-drm
blacklist nvidia-modeset
blacklist nvidia-uvm
blacklist ipmi_msghandler
blacklist ipmi_devintf

So I thought I had blacklisted it but it seems I did not. Since I do not
want to use it maybe it is better to check if the lock up occurs with
nouveau blacklisted. I will try that now.
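
The check described above (reading the file and noticing that nouveau is not in it) can be sketched in a few lines. This is only an illustration, assuming plain "blacklist <module>" lines as in the file quoted above; the helper names are hypothetical, and "-" and "_" are treated as equivalent in module names, as modprobe does.

```python
def blacklisted_modules(conf_text):
    """Return the module names found on 'blacklist <module>' lines."""
    modules = set()
    for raw in conf_text.splitlines():
        # Simplification: treat anything after '#' as a comment.
        line = raw.split("#", 1)[0].strip()
        parts = line.split()
        if len(parts) == 2 and parts[0] == "blacklist":
            # modprobe does not distinguish '-' from '_' in module names.
            modules.add(parts[1].replace("-", "_"))
    return modules


def is_blacklisted(module, conf_text):
    """True if 'module' appears on a blacklist line of conf_text."""
    return module.replace("-", "_") in blacklisted_modules(conf_text)


# Contents of the file as quoted in this message.
CONF = """\
blacklist nvidia
blacklist nvidia-drm
blacklist nvidia-modeset
blacklist nvidia-uvm
blacklist ipmi_msghandler
blacklist ipmi_devintf
"""

print("nouveau blacklisted:", is_blacklisted("nouveau", CONF))
print("nvidia_drm blacklisted:", is_blacklisted("nvidia_drm", CONF))
```

With the file as quoted, nouveau is reported as not blacklisted, which matches what was found by reading it.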

> If that helps the issue; I strongly suggest you cross reference the latest
> kernel to see if this bug still exists.

I did. See above.

Regards,

Nick.


2023-05-30 11:42:56

by Salvatore Bonaccorso

Subject: Re: Bug#1036530: Regression from "ACPI: OSI: Remove Linux-Dell-Video _OSI string"? (was: Re: Bug#1036530: linux-signed-amd64: Hard lock up of system)

Hi Nick,

Thanks to you both for triaging the issue!

On Tue, May 30, 2023 at 04:01:04PM +0900, Nick Hastings wrote:
> Hi,
>
> * Mario Limonciello <[email protected]> [230530 13:00]:
> > On 5/29/23 18:01, Nick Hastings wrote:
> > > Hi,
> > >
> > > * Nick Hastings <[email protected]> [230529 12:51]:
> > > > * Mario Limonciello <[email protected]> [230529 10:14]:
> > > > > On 5/28/23 19:56, Nick Hastings wrote:
> > > > > > Hi,
> > > > > >
> > > > > > * Mario Limonciello <[email protected]> [230528 21:44]:
> > > > > > > On 5/28/23 01:49, Salvatore Bonaccorso wrote:
> > > > > > > > Hi Mario
> > > > > > > >
> > > > > > > > Nick Hastings reported in Debian in https://bugs.debian.org/1036530
> > > > > > > > lockups from his system after updating from a 6.0 based version to
> > > > > > > > 6.1.y.
> > > > > > > >
> > > > > > > > #regzbot ^introduced 24867516f06d
> > > > > > > >
> > > > > > > > he bisected the issue and tracked it down to:
> > > > > > > >
> > > > > > > > On Sun, May 28, 2023 at 10:14:51AM +0900, Nick Hastings wrote:
> > > > > > > > > Control: tags -1 - moreinfo
> > > > > > > > >
> > > > > > > > > Hi,
> > > > > > > > >
> > > > > > > > > I repeated the git bisect, and the bad commit seems to be:
> > > > > > > > >
> > > > > > > > > (git)-[v6.1-rc1~206^2~4^5~3|bisect] % git bisect bad
> > > > > > > > > 24867516f06dabedef3be7eea0ef0846b91538bc is the first bad commit
> > > > > > > > > commit 24867516f06dabedef3be7eea0ef0846b91538bc
> > > > > > > > > Author: Mario Limonciello <[email protected]>
> > > > > > > > > Date: Tue Aug 23 13:51:31 2022 -0500
> > > > > > > > >
> > > > > > > > > ACPI: OSI: Remove Linux-Dell-Video _OSI string
> > > > > > > > >
> > > > > > > > > This string was introduced because drivers for NVIDIA hardware
> > > > > > > > > had bugs supporting RTD3 in the past.
> > > > > > > > >
> > > > > > > > > Before proprietary NVIDIA driver started to support RTD3, Ubuntu had
> > > > > > > > > had a mechanism for switching PRIME on and off, though it had required
> > > > > > > > > to logout/login to make the library switch happen.
> > > > > > > > >
> > > > > > > > > When the PRIME had been off, the mechanism had unloaded the NVIDIA
> > > > > > > > > driver and put the device into D3cold, but the GPU had never come back
> > > > > > > > > to D0 again which is why ODMs used the _OSI to expose an old _DSM
> > > > > > > > > method to switch the power on/off.
> > > > > > > > >
> > > > > > > > > That has been fixed by commit 5775b843a619 ("PCI: Restore config space
> > > > > > > > > on runtime resume despite being unbound"). so vendors shouldn't be
> > > > > > > > > using this string to modify ASL any more.
> > > > > > > > >
> > > > > > > > > Reviewed-by: Lyude Paul <[email protected]>
> > > > > > > > > Signed-off-by: Mario Limonciello <[email protected]>
> > > > > > > > > Signed-off-by: Rafael J. Wysocki <[email protected]>
> > > > > > > > >
> > > > > > > > > drivers/acpi/osi.c | 9 ---------
> > > > > > > > > 1 file changed, 9 deletions(-)
> > > > > > > > >
> > > > > > > > > This machine is a Dell with an nvidia chip so it looks like this really
> > > > > > > > > could be the commit that is causing the problems. The description
> > > > > > > > > of the commit also seems (to my untrained eye) to be consistent with the
> > > > > > > > > error reported on the console when the lockup occurs:
> > > > > > > > >
> > > > > > > > > [ 58.729863] ACPI Error: Aborting method \_SB.PCI0.PGON due to previous error (AE_AML_LOOP_TIMEOUT) (20220331/psparse-529)
> > > > > > > > > [ 58.729904] ACPI Error: Aborting method \_SB.PCI0.PEG0.PG00._ON due to previous error (AE_AML_LOOP_TIMEOUT) (20220331/psparse-529)
> > > > > > > > > [ 60.083261] vfio-pci 0000:01:00.0 Unable to change power state from D3cold to D0, device inaccessible
> > > > > > > > >
> > > > > > > > > Hopefully this is enough information for experts to resolve this.
> > > > > > > >
> > > > > > > > Does this ring some bell for you? Do you need any further information
> > > > > > > > from Nick?
> > > > > > > >
> > > > > > > > Regards,
> > > > > > > > Salvatore
> > > > > > >
> > > > > >
> > > > > > > Have Nick try using "pcie_port_pm=off" and see if it helps the issue.
> > > > > >
> > > > > > I booted into a 6.1 kernel with this option. It has been running without
> > > > > > problems for 1.5 hours. Usually I would expect the lockup to have
> > > > > > occurred by now.
> > > >
> > > > I let this run for 3 hours without issue.
> > > >
> > > > > > > Does this happen in the latest 6.4 RC as well?
> > > > > >
> > > > > > I have compiled that kernel and will boot into it after running this one
> > > > > > with the pcie_port_pm=off for another hour or so.
> > > >
> > > > I'm now running 6.4.0-rc4 without seeing the problem after 1 hour.
> > >
> > > I did eventually see a lockup of this kernel. On the console I saw:
> > >
> > > [ 151.035036] vfio-pci 0000:01:00.0 Unable to change power state from D3cold to D0, device inaccessible
> > >
> > > I did not see the other two lines that were present in earlier lock ups.
> > >
> > > > I did however see two unrelated problems that I include here for
> > > > completeness:
> > > > 1. iwlwifi module did not automatically load
> > > > 2. Xwayland used huge amount of CPU even though was not running any X
> > > > programs. Recompiling my wayland compositor without XWayland support
> > > > "fixed" this.
> > > >
> > > > > > > I think we need to see a full dmesg and acpidump to better
> > > > > > > characterize it.
> > > > > >
> > > > > > Please find attached. Let me know if there is anything else I can provide.
> > > > > >
> > > > > > Regards,
> > > > > >
> > > > > > Nick.
> > > > >
> > > > > I don't see nouveau loading, are you explicitly preventing it from
> > > > > loading?
> > > >
> > > > Yes nouveau is blacklisted.
> > > >
> > > > > Can I see the journal from a boot when it reproduced?
> > > >
> > > > Hmm not sure which n for "journalctl -b n" maps to which kernel (is that
> > > > what you are requesting?). The commit hash does not seem to be
> > > > listed. I may have to boot into a bad kernel again.
> > >
> > > Please find attached the output from a "journalctl --system -bN" for a
> > > kernel that has this issue.
> > >
> > > Regards,
> > >
> > > Nick.
> >
> > In this log I see nouveau loaded, but I also don't see the failure
> > occurring.
>
> I never saw anything in the logs from a lockup either. I had assumed it
> was no longer able to write to disk. The failure did occur on that
> occasion.

Can you check whether netconsole captures more of the failure?

https://www.kernel.org/doc/html/latest/networking/netconsole.html
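
As a rough sketch of what that setup involves: netconsole is configured with a parameter of the form "[src-port]@[src-ip]/[dev],[tgt-port]@<tgt-ip>/[tgt-macaddr]" (see the documentation linked above) and sends kernel messages as UDP packets to the target. The helper names below are hypothetical, and the addresses, ports, and interface are placeholders to be replaced with local values.

```python
import socket


def netconsole_param(tgt_ip, tgt_port=None, src_ip="", src_port=None,
                     dev="", tgt_mac=""):
    """Build a netconsole= parameter string; omitted fields stay empty."""
    src = f"{'' if src_port is None else src_port}@{src_ip}/{dev}"
    tgt = f"{'' if tgt_port is None else tgt_port}@{tgt_ip}/{tgt_mac}"
    return f"netconsole={src},{tgt}"


def receive_messages(port, max_packets, host="0.0.0.0"):
    """Collect max_packets UDP datagrams on the receiving machine."""
    messages = []
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.bind((host, port))
        for _ in range(max_packets):
            data, _addr = sock.recvfrom(65535)
            messages.append(data.decode("utf-8", errors="replace"))
    return messages


# Placeholder values, not taken from the affected machine.
print(netconsole_param("10.0.0.2", 9353, "10.0.0.1", 4444, "eth1"))
```

The printed string would go on the kernel command line of the machine that locks up, while something like receive_messages(9353, ...) runs on a second machine, so that the last messages before the lockup are kept even if nothing reaches the disk.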

Regards,
Salvatore

2023-05-31 23:50:04

by Nick Hastings

Subject: Re: Regression from "ACPI: OSI: Remove Linux-Dell-Video _OSI string"? (was: Re: Bug#1036530: linux-signed-amd64: Hard lock up of system)

Hi,

* Nick Hastings <[email protected]> [230530 16:01]:
>
> * Mario Limonciello <[email protected]> [230530 13:00]:
<snip>
> > As you're actually loading nouveau, can you please try nouveau.runpm=0 on
> > the kernel command line?
>
> I'm not intentionally loading it. This machine also has intel graphics
> which is what I prefer. Checking my
> /etc/modprobe.d/blacklist-nvidia-nouveau.conf
> I see:
>
> blacklist nvidia
> blacklist nvidia-drm
> blacklist nvidia-modeset
> blacklist nvidia-uvm
> blacklist ipmi_msghandler
> blacklist ipmi_devintf
>
> So I thought I had blacklisted it but it seems I did not. Since I do not
> want to use it maybe it is better to check if the lock up occurs with
> nouveau blacklisted. I will try that now.

I blacklisted nouveau and booted into a 6.1 kernel:
% uname -a
Linux xps 6.1.0-9-amd64 #1 SMP PREEMPT_DYNAMIC Debian 6.1.27-1 (2023-05-08) x86_64 GNU/Linux

It has been running without problems for nearly two days now:
% uptime
08:34:48 up 1 day, 16:22, 2 users, load average: 1.33, 1.26, 1.27
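
To confirm that the blacklist actually took effect on a given boot, the list of loaded modules can be checked. A minimal sketch, assuming the usual /proc/modules layout where the module name is the first field on each line; the sample text is made up for illustration and is not from this machine.

```python
def loaded_modules(proc_modules_text):
    """Return the set of module names listed in /proc/modules-style text."""
    return {
        line.split()[0]
        for line in proc_modules_text.splitlines()
        if line.strip()
    }


def read_loaded_modules(path="/proc/modules"):
    """Read the live module list from the running kernel."""
    with open(path) as handle:
        return loaded_modules(handle.read())


# Illustrative sample only; sizes and addresses are invented.
SAMPLE = """\
i915 3051520 25 - Live 0x0000000000000000
vfio_pci 16384 0 - Live 0x0000000000000000
"""

print("nouveau loaded:", "nouveau" in loaded_modules(SAMPLE))
```

On the real system, "nouveau" in read_loaded_modules() would be the check to run after booting.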

Regards,

Nick.


2023-06-01 16:31:31

by Mario Limonciello

Subject: Re: Regression from "ACPI: OSI: Remove Linux-Dell-Video _OSI string"? (was: Re: Bug#1036530: linux-signed-amd64: Hard lock up of system)

+Lyude, Lukas, Karol

On 5/31/2023 6:40 PM, Nick Hastings wrote:
> Hi,
>
> * Nick Hastings <[email protected]> [230530 16:01]:
>> * Mario Limonciello <[email protected]> [230530 13:00]:
> <snip>
>>> As you're actually loading nouveau, can you please try nouveau.runpm=0 on
>>> the kernel command line?
>> I'm not intentionally loading it. This machine also has intel graphics
>> which is what I prefer. Checking my
>> /etc/modprobe.d/blacklist-nvidia-nouveau.conf
>> I see:
>>
>> blacklist nvidia
>> blacklist nvidia-drm
>> blacklist nvidia-modeset
>> blacklist nvidia-uvm
>> blacklist ipmi_msghandler
>> blacklist ipmi_devintf
>>
>> So I thought I had blacklisted it but it seems I did not. Since I do not
>> want to use it maybe it is better to check if the lock up occurs with
>> nouveau blacklisted. I will try that now.
> I blacklisted nouveau and booted into a 6.1 kernel:
> % uname -a
> Linux xps 6.1.0-9-amd64 #1 SMP PREEMPT_DYNAMIC Debian 6.1.27-1 (2023-05-08) x86_64 GNU/Linux
>
> It has been running without problems for nearly two days now:
> % uptime
> 08:34:48 up 1 day, 16:22, 2 users, load average: 1.33, 1.26, 1.27
>
> Regards,
>
> Nick.

Thanks, that makes a lot more sense now.

Nick, can you please test whether nouveau works with runtime PM in the
latest 6.4-rc?

If it works in 6.4-rc, there are probably nouveau commits that need
to be backported to 6.1 LTS.

If it's still broken in 6.4-rc, I believe you should file a bug:

https://gitlab.freedesktop.org/drm/nouveau/
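
One way to observe whether runtime PM is active for the GPU during such a test is to read the device's power attributes in sysfs. The sketch below assumes the standard power/control and power/runtime_status files and uses the 0000:01:00.0 address from the console messages earlier in the thread; the function name and the sysfs_root parameter are hypothetical conveniences.

```python
from pathlib import Path


def runtime_pm_state(bdf, sysfs_root="/sys"):
    """Read the runtime PM attributes of a PCI device from sysfs.

    bdf is the PCI address, e.g. "0000:01:00.0". sysfs_root exists only
    so the function can be pointed at a test directory.
    """
    power = Path(sysfs_root) / "bus" / "pci" / "devices" / bdf / "power"
    return {
        name: (power / name).read_text().strip()
        for name in ("control", "runtime_status")
    }


def runtime_pm_allowed(state):
    """'auto' in power/control means the kernel may runtime-suspend it."""
    return state["control"] == "auto"


# On the affected machine this could be run as:
#   state = runtime_pm_state("0000:01:00.0")
#   print(state, runtime_pm_allowed(state))
```

A control value of "auto" together with a runtime_status of "suspended" would indicate the GPU was runtime-suspended; resuming from that state is the step that fails in the reported lockups.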


Lyude, Lukas, Karol

This thread is in relation to this commit:

24867516f06d ("ACPI: OSI: Remove Linux-Dell-Video _OSI string")

Nick has found that runtime PM is *not* working for nouveau.

If you recall we did 24867516f06d because 5775b843a619 was
supposed to have fixed it.


2023-06-01 16:58:17

by Karol Herbst

Subject: Re: Regression from "ACPI: OSI: Remove Linux-Dell-Video _OSI string"? (was: Re: Bug#1036530: linux-signed-amd64: Hard lock up of system)

On Thu, Jun 1, 2023 at 6:18 PM Limonciello, Mario
<[email protected]> wrote:
>
> +Lyude, Lukas, Karol
>
> On 5/31/2023 6:40 PM, Nick Hastings wrote:
> > Hi,
> >
> > * Nick Hastings <[email protected]> [230530 16:01]:
> >> * Mario Limonciello <[email protected]> [230530 13:00]:
> > <snip>
> >>> As you're actually loading nouveau, can you please try nouveau.runpm=0 on
> >>> the kernel command line?
> >> I'm not intentionally loading it. This machine also has intel graphics
> >> which is what I prefer. Checking my
> >> /etc/modprobe.d/blacklist-nvidia-nouveau.conf
> >> I see:
> >>
> >> blacklist nvidia
> >> blacklist nvidia-drm
> >> blacklist nvidia-modeset
> >> blacklist nvidia-uvm
> >> blacklist ipmi_msghandler
> >> blacklist ipmi_devintf
> >>
> >> So I thought I had blacklisted it but it seems I did not. Since I do not
> >> want to use it maybe it is better to check if the lock up occurs with
> >> nouveau blacklisted. I will try that now.
> > I blacklisted nouveau and booted into a 6.1 kernel:
> > % uname -a
> > Linux xps 6.1.0-9-amd64 #1 SMP PREEMPT_DYNAMIC Debian 6.1.27-1 (2023-05-08) x86_64 GNU/Linux
> >
> > It has been running without problems for nearly two days now:
> > % uptime
> > 08:34:48 up 1 day, 16:22, 2 users, load average: 1.33, 1.26, 1.27
> >
> > Regards,
> >
> > Nick.
>
> Thanks, that makes a lot more sense now.
>
> Nick, Can you please test if nouveau works with runtime PM in the
> latest 6.4-rc?
>
> If it works in 6.4-rc, there are probably nouveau commits that need
> to be backported to 6.1 LTS.
>
> If it's still broken in 6.4-rc, I believe you should file a bug:
>
> https://gitlab.freedesktop.org/drm/nouveau/
>
>
> Lyude, Lukas, Karol
>
> This thread is in relation to this commit:
>
> 24867516f06d ("ACPI: OSI: Remove Linux-Dell-Video _OSI string")
>
> Nick has found that runtime PM is *not* working for nouveau.
>

keep in mind we have a list of PCIe controllers where we apply a
workaround: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/gpu/drm/nouveau/nouveau_drm.c?h=v6.4-rc4#n682

And I suspect there might be one or two more IDs we'll have to add
there. Do we have any logs? And could anybody test if adding the
controller in play here does resolve the problem?
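
For readers unfamiliar with that kind of list, the logic is roughly "if the upstream bridge is one of these known controllers, apply the workaround". The sketch below only illustrates the idea in Python and is not the kernel code: the names are hypothetical, and the device ID in the set is an assumption that should be checked against the nouveau_drm.c source linked above.

```python
INTEL_VENDOR_ID = 0x8086

# Assumed example entry; verify against the linked nouveau_drm.c.
QUIRKED_BRIDGE_DEVICE_IDS = frozenset({0x1901})


def needs_runpm_workaround(bridge_vendor, bridge_device,
                           quirk_ids=QUIRKED_BRIDGE_DEVICE_IDS):
    """True if the upstream bridge matches a listed Intel controller."""
    return bridge_vendor == INTEL_VENDOR_ID and bridge_device in quirk_ids


def parse_sysfs_id(text):
    """Convert a sysfs 'vendor' or 'device' value such as '0x8086\\n'."""
    return int(text.strip(), 16)


# Testing an extra controller means extending the set with its ID
# (0x1234 is a made-up placeholder, not a real candidate):
extended = QUIRKED_BRIDGE_DEVICE_IDS | {0x1234}
print(needs_runpm_workaround(INTEL_VENDOR_ID, 0x1234))
print(needs_runpm_workaround(INTEL_VENDOR_ID, 0x1234, extended))
```

In the kernel the equivalent test would be a one-line addition to the existing list followed by a rebuild.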

> If you recall we did 24867516f06d because 5775b843a619 was
> supposed to have fixed it.
>


2023-06-01 17:25:29

by Mario Limonciello

Subject: RE: Regression from "ACPI: OSI: Remove Linux-Dell-Video _OSI string"? (was: Re: Bug#1036530: linux-signed-amd64: Hard lock up of system)

[AMD Official Use Only - General]

> -----Original Message-----
> From: Karol Herbst <[email protected]>
> Sent: Thursday, June 1, 2023 11:33 AM
> To: Limonciello, Mario <[email protected]>
> Cc: Nick Hastings <[email protected]>; Lyude Paul
> <[email protected]>; Lukas Wunner <[email protected]>; Salvatore
> Bonaccorso <[email protected]>; [email protected]; Rafael J.
> Wysocki <[email protected]>; Len Brown <[email protected]>; linux-
> [email protected]; [email protected];
> [email protected]
> Subject: Re: Regression from "ACPI: OSI: Remove Linux-Dell-Video _OSI
> string"? (was: Re: Bug#1036530: linux-signed-amd64: Hard lock up of system)
>
> On Thu, Jun 1, 2023 at 6:18 PM Limonciello, Mario
> <[email protected]> wrote:
> >
> > +Lyude, Lukas, Karol
> >
> > On 5/31/2023 6:40 PM, Nick Hastings wrote:
> > > Hi,
> > >
> > > * Nick Hastings <[email protected]> [230530 16:01]:
> > >> * Mario Limonciello <[email protected]> [230530 13:00]:
> > > <snip>
> > >>> As you're actually loading nouveau, can you please try nouveau.runpm=0 on
> > >>> the kernel command line?
> > >> I'm not intentionally loading it. This machine also has intel graphics
> > >> which is what I prefer. Checking my
> > >> /etc/modprobe.d/blacklist-nvidia-nouveau.conf
> > >> I see:
> > >>
> > >> blacklist nvidia
> > >> blacklist nvidia-drm
> > >> blacklist nvidia-modeset
> > >> blacklist nvidia-uvm
> > >> blacklist ipmi_msghandler
> > >> blacklist ipmi_devintf
> > >>
> > >> So I thought I had blacklisted it but it seems I did not. Since I do not
> > >> want to use it maybe it is better to check if the lock up occurs with
> > >> nouveau blacklisted. I will try that now.
> > > I blacklisted nouveau and booted into a 6.1 kernel:
> > > % uname -a
> > > Linux xps 6.1.0-9-amd64 #1 SMP PREEMPT_DYNAMIC Debian 6.1.27-1 (2023-05-08) x86_64 GNU/Linux
> > >
> > > It has been running without problems for nearly two days now:
> > > % uptime
> > > 08:34:48 up 1 day, 16:22, 2 users, load average: 1.33, 1.26, 1.27
> > >
> > > Regards,
> > >
> > > Nick.
> >
> > Thanks, that makes a lot more sense now.
> >
> > Nick, Can you please test if nouveau works with runtime PM in the
> > latest 6.4-rc?
> >
> > If it works in 6.4-rc, there are probably nouveau commits that need
> > to be backported to 6.1 LTS.
> >
> > If it's still broken in 6.4-rc, I believe you should file a bug:
> >
> > https://gitlab.freedesktop.org/drm/nouveau/
> >
> >
> > Lyude, Lukas, Karol
> >
> > This thread is in relation to this commit:
> >
> > 24867516f06d ("ACPI: OSI: Remove Linux-Dell-Video _OSI string")
> >
> > Nick has found that runtime PM is *not* working for nouveau.
> >
>
> keep in mind we have a list of PCIe controllers where we apply a
> workaround:
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/gpu/drm/nouveau/nouveau_drm.c?h=v6.4-rc4#n682
>
> And I suspect there might be one or two more IDs we'll have to add
> there. Do we have any logs?

There are some archived on the distro bug. Search this page for "journalctl.log.gz":
https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=1036530

> And could anybody test if adding the
> controller in play here does resolve the problem?
>
> > If you recall we did 24867516f06d because 5775b843a619 was
> > supposed to have fixed it.
> >

2023-06-01 18:02:07

by Karol Herbst

Subject: Re: Regression from "ACPI: OSI: Remove Linux-Dell-Video _OSI string"? (was: Re: Bug#1036530: linux-signed-amd64: Hard lock up of system)

On Thu, Jun 1, 2023 at 6:54 PM Limonciello, Mario
<[email protected]> wrote:
>
> [AMD Official Use Only - General]
>
> > -----Original Message-----
> > From: Karol Herbst <[email protected]>
> > Sent: Thursday, June 1, 2023 11:33 AM
> > To: Limonciello, Mario <[email protected]>
> > Cc: Nick Hastings <[email protected]>; Lyude Paul
> > <[email protected]>; Lukas Wunner <[email protected]>; Salvatore
> > Bonaccorso <[email protected]>; [email protected]; Rafael J.
> > Wysocki <[email protected]>; Len Brown <[email protected]>; linux-
> > [email protected]; [email protected];
> > [email protected]
> > Subject: Re: Regression from "ACPI: OSI: Remove Linux-Dell-Video _OSI
> > string"? (was: Re: Bug#1036530: linux-signed-amd64: Hard lock up of system)
> >
> > On Thu, Jun 1, 2023 at 6:18 PM Limonciello, Mario
> > <[email protected]> wrote:
> > >
> > > +Lyude, Lukas, Karol
> > >
> > > On 5/31/2023 6:40 PM, Nick Hastings wrote:
> > > > Hi,
> > > >
> > > > * Nick Hastings <[email protected]> [230530 16:01]:
> > > >> * Mario Limonciello <[email protected]> [230530 13:00]:
> > > > <snip>
> > > > >>> As you're actually loading nouveau, can you please try nouveau.runpm=0 on
> > > > >>> the kernel command line?
> > > >> I'm not intentionally loading it. This machine also has intel graphics
> > > >> which is what I prefer. Checking my
> > > >> /etc/modprobe.d/blacklist-nvidia-nouveau.conf
> > > >> I see:
> > > >>
> > > >> blacklist nvidia
> > > >> blacklist nvidia-drm
> > > >> blacklist nvidia-modeset
> > > >> blacklist nvidia-uvm
> > > >> blacklist ipmi_msghandler
> > > >> blacklist ipmi_devintf
> > > >>
> > > >> So I thought I had blacklisted it but it seems I did not. Since I do not
> > > >> want to use it maybe it is better to check if the lock up occurs with
> > > >> nouveau blacklisted. I will try that now.
> > > > I blacklisted nouveau and booted into a 6.1 kernel:
> > > > % uname -a
> > > > > Linux xps 6.1.0-9-amd64 #1 SMP PREEMPT_DYNAMIC Debian 6.1.27-1 (2023-05-08) x86_64 GNU/Linux
> > > >
> > > > It has been running without problems for nearly two days now:
> > > > % uptime
> > > > 08:34:48 up 1 day, 16:22, 2 users, load average: 1.33, 1.26, 1.27
> > > >
> > > > Regards,
> > > >
> > > > Nick.
> > >
> > > Thanks, that makes a lot more sense now.
> > >
> > > Nick, Can you please test if nouveau works with runtime PM in the
> > > latest 6.4-rc?
> > >
> > > If it works in 6.4-rc, there are probably nouveau commits that need
> > > to be backported to 6.1 LTS.
> > >
> > > If it's still broken in 6.4-rc, I believe you should file a bug:
> > >
> > > https://gitlab.freedesktop.org/drm/nouveau/
> > >
> > >
> > > Lyude, Lukas, Karol
> > >
> > > This thread is in relation to this commit:
> > >
> > > 24867516f06d ("ACPI: OSI: Remove Linux-Dell-Video _OSI string")
> > >
> > > Nick has found that runtime PM is *not* working for nouveau.
> > >
> >
> > keep in mind we have a list of PCIe controllers where we apply a
> > workaround:
> > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/gpu/drm/nouveau/nouveau_drm.c?h=v6.4-rc4#n682
> >
> > And I suspect there might be one or two more IDs we'll have to add
> > there. Do we have any logs?
>
> There's some archived onto the distro bug. Search this page for "journalctl.log.gz"
> https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=1036530
>

interesting.. It seems to be the same controller used here. I wonder
if the pci topology is different or if the workaround is applied at
all.

But yeah, I'd kinda love for somebody with better knowledge on all of
this to figure out what exactly is going wrong, but every time this
gets investigated Intel says "our hardware has no bugs", the ACPI
folks dig for months and find nothing and I end up figuring out some
weirdo workaround I don't understand. And apparently also nobody is
able to hand out docs explaining in detail how that runtime
suspend/resume stuff is supposed to work.

I have a Dell XPS 9560 where the added workaround in nouveau fixed the
problem and I know it's fixed on a bunch of other systems. So if
anybody is willing to publish docs and/or actually debug it with
domain knowledge, please go ahead.

> > And could anybody test if adding the
> > controller in play here does resolve the problem?
> >
> > > If you recall we did 24867516f06d because 5775b843a619 was
> > > supposed to have fixed it.
> > >
>


2023-06-01 18:04:16

by Mario Limonciello

Subject: RE: Regression from "ACPI: OSI: Remove Linux-Dell-Video _OSI string"? (was: Re: Bug#1036530: linux-signed-amd64: Hard lock up of system)

[AMD Official Use Only - General]

> -----Original Message-----
> From: Karol Herbst <[email protected]>
> Sent: Thursday, June 1, 2023 12:19 PM
> To: Limonciello, Mario <[email protected]>
> Cc: Nick Hastings <[email protected]>; Lyude Paul
> <[email protected]>; Lukas Wunner <[email protected]>; Salvatore
> Bonaccorso <[email protected]>; [email protected]; Rafael J.
> Wysocki <[email protected]>; Len Brown <[email protected]>; linux-
> [email protected]; [email protected];
> [email protected]
> Subject: Re: Regression from "ACPI: OSI: Remove Linux-Dell-Video _OSI
> string"? (was: Re: Bug#1036530: linux-signed-amd64: Hard lock up of system)
>
> On Thu, Jun 1, 2023 at 6:54 PM Limonciello, Mario
> <[email protected]> wrote:
> >
> > [AMD Official Use Only - General]
> >
> > > -----Original Message-----
> > > From: Karol Herbst <[email protected]>
> > > Sent: Thursday, June 1, 2023 11:33 AM
> > > To: Limonciello, Mario <[email protected]>
> > > Cc: Nick Hastings <[email protected]>; Lyude Paul
> > > <[email protected]>; Lukas Wunner <[email protected]>; Salvatore
> > > Bonaccorso <[email protected]>; [email protected]; Rafael J.
> > > Wysocki <[email protected]>; Len Brown <[email protected]>; linux-
> > > [email protected]; [email protected];
> > > [email protected]
> > > Subject: Re: Regression from "ACPI: OSI: Remove Linux-Dell-Video _OSI
> > > > string"? (was: Re: Bug#1036530: linux-signed-amd64: Hard lock up of system)
> > >
> > > On Thu, Jun 1, 2023 at 6:18 PM Limonciello, Mario
> > > <[email protected]> wrote:
> > > >
> > > > +Lyude, Lukas, Karol
> > > >
> > > > On 5/31/2023 6:40 PM, Nick Hastings wrote:
> > > > > Hi,
> > > > >
> > > > > * Nick Hastings <[email protected]> [230530 16:01]:
> > > > >> * Mario Limonciello <[email protected]> [230530 13:00]:
> > > > > <snip>
> > > > > >>> As you're actually loading nouveau, can you please try nouveau.runpm=0 on
> > > > > >>> the kernel command line?
> > > > >> I'm not intentionally loading it. This machine also has intel graphics
> > > > >> which is what I prefer. Checking my
> > > > >> /etc/modprobe.d/blacklist-nvidia-nouveau.conf
> > > > >> I see:
> > > > >>
> > > > >> blacklist nvidia
> > > > >> blacklist nvidia-drm
> > > > >> blacklist nvidia-modeset
> > > > >> blacklist nvidia-uvm
> > > > >> blacklist ipmi_msghandler
> > > > >> blacklist ipmi_devintf
> > > > >>
> > > > >> So I thought I had blacklisted it but it seems I did not. Since I do not
> > > > >> want to use it maybe it is better to check if the lock up occurs with
> > > > >> nouveau blacklisted. I will try that now.
> > > > > I blacklisted nouveau and booted into a 6.1 kernel:
> > > > > % uname -a
> > > > > > Linux xps 6.1.0-9-amd64 #1 SMP PREEMPT_DYNAMIC Debian 6.1.27-1 (2023-05-08) x86_64 GNU/Linux
> > > > >
> > > > > It has been running without problems for nearly two days now:
> > > > > % uptime
> > > > > 08:34:48 up 1 day, 16:22, 2 users, load average: 1.33, 1.26, 1.27
> > > > >
> > > > > Regards,
> > > > >
> > > > > Nick.
> > > >
> > > > Thanks, that makes a lot more sense now.
> > > >
> > > > Nick, Can you please test if nouveau works with runtime PM in the
> > > > latest 6.4-rc?
> > > >
> > > > If it works in 6.4-rc, there are probably nouveau commits that need
> > > > to be backported to 6.1 LTS.
> > > >
> > > > If it's still broken in 6.4-rc, I believe you should file a bug:
> > > >
> > > > https://gitlab.freedesktop.org/drm/nouveau/
> > > >
> > > >
> > > > Lyude, Lukas, Karol
> > > >
> > > > This thread is in relation to this commit:
> > > >
> > > > 24867516f06d ("ACPI: OSI: Remove Linux-Dell-Video _OSI string")
> > > >
> > > > Nick has found that runtime PM is *not* working for nouveau.
> > > >
> > >
> > > keep in mind we have a list of PCIe controllers where we apply a
> > > workaround:
> > >
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers
> > > /gpu/drm/nouveau/nouveau_drm.c?h=v6.4-rc4#n682
> > >
> > > And I suspect there might be one or two more IDs we'll have to add
> > > there. Do we have any logs?
> >
> > > There's some archived onto the distro bug. Search this page for "journalctl.log.gz"
> > https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=1036530
> >
>
> interesting.. It seems to be the same controller used here. I wonder
> if the pci topology is different or if the workaround is applied at
> all.

I didn't see the message about the workaround being applied in that
log, so I guess a PCI topology difference is a likely suspect.

>
> But yeah, I'd kinda love for somebody with better knowledge on all of
> this to figure out what exactly is going wrong, but everytime this
> gets investigated Intel says "our hardware has no bugs", the ACPI
> folks dig for months and find nothing and I end up figuring out some
> weirdo workaround I don't understand. And apparently also nobody is
> able to hand out docs explaining in detail how that runtime
> suspend/resume stuff is supposed to work.
>
> I have a Dell XPS 9560 where the added workaround in nouveau fixed the
> problem and I know it's fixed on a bunch of other systems. So if
> anybody is willing to publish docs and/or actually debug it with
> domain knowledge, please go ahead.
>
> > > And could anybody test if adding the
> > > controller in play here does resolve the problem?
> > >
> > > > If you recall we did 24867516f06d because 5775b843a619 was
> > > > supposed to have fixed it.
> > > >
> >

2023-06-01 18:27:00

by Karol Herbst

Subject: Re: Regression from "ACPI: OSI: Remove Linux-Dell-Video _OSI string"? (was: Re: Bug#1036530: linux-signed-amd64: Hard lock up of system)

On Thu, Jun 1, 2023 at 7:21 PM Limonciello, Mario
<[email protected]> wrote:
>
> [AMD Official Use Only - General]
>
> > -----Original Message-----
> > From: Karol Herbst <[email protected]>
> > Sent: Thursday, June 1, 2023 12:19 PM
> > To: Limonciello, Mario <[email protected]>
> > Cc: Nick Hastings <[email protected]>; Lyude Paul
> > <[email protected]>; Lukas Wunner <[email protected]>; Salvatore
> > Bonaccorso <[email protected]>; [email protected]; Rafael J.
> > Wysocki <[email protected]>; Len Brown <[email protected]>; linux-
> > [email protected]; [email protected];
> > [email protected]
> > Subject: Re: Regression from "ACPI: OSI: Remove Linux-Dell-Video _OSI
> > string"? (was: Re: Bug#1036530: linux-signed-amd64: Hard lock up of system)
> >
> > On Thu, Jun 1, 2023 at 6:54 PM Limonciello, Mario
> > <[email protected]> wrote:
> > >
> > > [AMD Official Use Only - General]
> > >
> > > > -----Original Message-----
> > > > From: Karol Herbst <[email protected]>
> > > > Sent: Thursday, June 1, 2023 11:33 AM
> > > > To: Limonciello, Mario <[email protected]>
> > > > Cc: Nick Hastings <[email protected]>; Lyude Paul
> > > > <[email protected]>; Lukas Wunner <[email protected]>; Salvatore
> > > > Bonaccorso <[email protected]>; [email protected]; Rafael J.
> > > > Wysocki <[email protected]>; Len Brown <[email protected]>; linux-
> > > > [email protected]; [email protected];
> > > > [email protected]
> > > > Subject: Re: Regression from "ACPI: OSI: Remove Linux-Dell-Video _OSI
> > > > string"? (was: Re: Bug#1036530: linux-signed-amd64: Hard lock up of
> > system)
> > > >
> > > > On Thu, Jun 1, 2023 at 6:18 PM Limonciello, Mario
> > > > <[email protected]> wrote:
> > > > >
> > > > > +Lyude, Lukas, Karol
> > > > >
> > > > > On 5/31/2023 6:40 PM, Nick Hastings wrote:
> > > > > > Hi,
> > > > > >
> > > > > > * Nick Hastings <[email protected]> [230530 16:01]:
> > > > > >> * Mario Limonciello <[email protected]> [230530 13:00]:
> > > > > > <snip>
> > > > > >>> As you're actually loading nouveau, can you please try
> > > > nouveau.runpm=0 on
> > > > > >>> the kernel command line?
> > > > > >> I'm not intentionally loading it. This machine also has intel graphics
> > > > > >> which is what I prefer. Checking my
> > > > > >> /etc/modprobe.d/blacklist-nvidia-nouveau.conf
> > > > > >> I see:
> > > > > >>
> > > > > >> blacklist nvidia
> > > > > >> blacklist nvidia-drm
> > > > > >> blacklist nvidia-modeset
> > > > > >> blacklist nvidia-uvm
> > > > > >> blacklist ipmi_msghandler
> > > > > >> blacklist ipmi_devintf
> > > > > >>
> > > > > >> So I thought I had blacklisted it but it seems I did not. Since I do not
> > > > > >> want to use it maybe it is better to check if the lock up occurs with
> > > > > >> nouveau blacklisted. I will try that now.
> > > > > > I blacklisted nouveau and booted into a 6.1 kernel:
> > > > > > % uname -a
> > > > > > Linux xps 6.1.0-9-amd64 #1 SMP PREEMPT_DYNAMIC Debian 6.1.27-1
> > > > (2023-05-08) x86_64 GNU/Linux
> > > > > >
> > > > > > It has been running without problems for nearly two days now:
> > > > > > % uptime
> > > > > > 08:34:48 up 1 day, 16:22, 2 users, load average: 1.33, 1.26, 1.27
> > > > > >
> > > > > > Regards,
> > > > > >
> > > > > > Nick.
> > > > >
> > > > > Thanks, that makes a lot more sense now.
> > > > >
> > > > > Nick, Can you please test if nouveau works with runtime PM in the
> > > > > latest 6.4-rc?
> > > > >
> > > > > If it works in 6.4-rc, there are probably nouveau commits that need
> > > > > to be backported to 6.1 LTS.
> > > > >
> > > > > If it's still broken in 6.4-rc, I believe you should file a bug:
> > > > >
> > > > > https://gitlab.freedesktop.org/drm/nouveau/
> > > > >
> > > > >
> > > > > Lyude, Lukas, Karol
> > > > >
> > > > > This thread is in relation to this commit:
> > > > >
> > > > > 24867516f06d ("ACPI: OSI: Remove Linux-Dell-Video _OSI string")
> > > > >
> > > > > Nick has found that runtime PM is *not* working for nouveau.
> > > > >
> > > >
> > > > keep in mind we have a list of PCIe controllers where we apply a
> > > > workaround:
> > > >
> > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers
> > > > /gpu/drm/nouveau/nouveau_drm.c?h=v6.4-rc4#n682
> > > >
> > > > And I suspect there might be one or two more IDs we'll have to add
> > > > there. Do we have any logs?
> > >
> > > There's some archived onto the distro bug. Search this page for
> > "journalctl.log.gz"
> > > https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=1036530
> > >
> >
> > interesting.. It seems to be the same controller used here. I wonder
> > if the pci topology is different or if the workaround is applied at
> > all.
>
> I didn't see the message in the log about the workaround being applied
> in that log, so I guess PCI topology difference is a likely suspect.
>

Yeah, but I also couldn't find a log with the usual nouveau messages,
so it's kinda weird.

Anyway, the output of `lspci -tvnn` would help.
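
For anyone who wants to check the bridge without reading the tree by eye, the device directly above the GPU can also be read from sysfs. This is only a rough sketch: the helper name `upstream_bridge_ids` and the `sysfs_root` parameter are made up for illustration, and the GPU address `0000:01:00.0` in the docstring is an assumed example, not something the driver uses.

```python
import os


def upstream_bridge_ids(gpu_addr, sysfs_root="/sys"):
    """Return (vendor, device) of the PCI bridge directly above gpu_addr.

    gpu_addr is a full PCI address such as "0000:01:00.0" (an assumed
    example). Returns None when the parent is not a PCI device, e.g.
    when the device sits directly on the root bus.
    """
    link = os.path.join(sysfs_root, "bus", "pci", "devices", gpu_addr)
    # The entry is a symlink into .../devices/...; the parent directory
    # of its target is the upstream bridge.
    parent = os.path.dirname(os.path.realpath(link))
    ids = []
    for name in ("vendor", "device"):
        path = os.path.join(parent, name)
        if not os.path.isfile(path):
            return None
        with open(path) as f:
            # sysfs stores the IDs as hex strings such as "0x8086".
            ids.append(int(f.read().strip(), 16))
    return tuple(ids)
```

Calling `upstream_bridge_ids("0000:01:00.0")` on the affected machine should give the vendor and device ID of whatever port the GPU hangs off, which can then be compared against the list in nouveau_drm.c.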

> >
> > But yeah, I'd kinda love for somebody with better knowledge on all of
> > this to figure out what exactly is going wrong, but everytime this
> > gets investigated Intel says "our hardware has no bugs", the ACPI
> > folks dig for months and find nothing and I end up figuring out some
> > weirdo workaround I don't understand. And apparently also nobody is
> > able to hand out docs explaining in detail how that runtime
> > suspend/resume stuff is supposed to work.
> >
> > I have a Dell XPS 9560 where the added workaround in nouveau fixed the
> > problem and I know it's fixed on a bunch of other systems. So if
> > anybody is willing to publish docs and/or actually debug it with
> > domain knowledge, please go ahead.
> >
> > > > And could anybody test if adding the
> > > > controller in play here does resolve the problem?
> > > >
> > > > > If you recall we did 24867516f06d because 5775b843a619 was
> > > > > supposed to have fixed it.
> > > > >
> > >
>


2023-06-02 00:14:33

by Nick Hastings

[permalink] [raw]
Subject: Re: Regression from "ACPI: OSI: Remove Linux-Dell-Video _OSI string"? (was: Re: Bug#1036530: linux-signed-amd64: Hard lock up of system)

Hi,

* Karol Herbst <[email protected]> [230602 03:10]:
> On Thu, Jun 1, 2023 at 7:21 PM Limonciello, Mario
> <[email protected]> wrote:
> > > -----Original Message-----
> > > From: Karol Herbst <[email protected]>
> > > Sent: Thursday, June 1, 2023 12:19 PM
> > > To: Limonciello, Mario <[email protected]>
> > > Cc: Nick Hastings <[email protected]>; Lyude Paul
> > > <[email protected]>; Lukas Wunner <[email protected]>; Salvatore
> > > Bonaccorso <[email protected]>; [email protected]; Rafael J.
> > > Wysocki <[email protected]>; Len Brown <[email protected]>; linux-
> > > [email protected]; [email protected];
> > > [email protected]
> > > Subject: Re: Regression from "ACPI: OSI: Remove Linux-Dell-Video _OSI
> > > string"? (was: Re: Bug#1036530: linux-signed-amd64: Hard lock up of system)
> > >
> > > On Thu, Jun 1, 2023 at 6:54 PM Limonciello, Mario
> > > <[email protected]> wrote:
> > > >
> > > > [AMD Official Use Only - General]
> > > >
> > > > > -----Original Message-----
> > > > > From: Karol Herbst <[email protected]>
> > > > > Sent: Thursday, June 1, 2023 11:33 AM
> > > > > To: Limonciello, Mario <[email protected]>
> > > > > Cc: Nick Hastings <[email protected]>; Lyude Paul
> > > > > <[email protected]>; Lukas Wunner <[email protected]>; Salvatore
> > > > > Bonaccorso <[email protected]>; [email protected]; Rafael J.
> > > > > Wysocki <[email protected]>; Len Brown <[email protected]>; linux-
> > > > > [email protected]; [email protected];
> > > > > [email protected]
> > > > > Subject: Re: Regression from "ACPI: OSI: Remove Linux-Dell-Video _OSI
> > > > > string"? (was: Re: Bug#1036530: linux-signed-amd64: Hard lock up of
> > > system)
> > > > >
> > > > > On Thu, Jun 1, 2023 at 6:18 PM Limonciello, Mario
> > > > > >
> > > > > > Lyude, Lukas, Karol
> > > > > >
> > > > > > This thread is in relation to this commit:
> > > > > >
> > > > > > 24867516f06d ("ACPI: OSI: Remove Linux-Dell-Video _OSI string")
> > > > > >
> > > > > > Nick has found that runtime PM is *not* working for nouveau.
> > > > > >
> > > > >
> > > > > keep in mind we have a list of PCIe controllers where we apply a
> > > > > workaround:
> > > > >
> > > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers
> > > > > /gpu/drm/nouveau/nouveau_drm.c?h=v6.4-rc4#n682
> > > > >
> > > > > And I suspect there might be one or two more IDs we'll have to add
> > > > > there. Do we have any logs?
> > > >
> > > > There's some archived onto the distro bug. Search this page for
> > > "journalctl.log.gz"
> > > > https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=1036530
> > > >
> > >
> > > interesting.. It seems to be the same controller used here. I wonder
> > > if the pci topology is different or if the workaround is applied at
> > > all.
> >
> > I didn't see the message in the log about the workaround being applied
> > in that log, so I guess PCI topology difference is a likely suspect.
> >
>
> yeah, but I also couldn't see a log with the usual nouveau messages,
> so it's kinda weird.
>
> Anyway, the output of `lspci -tvnn` would help

% lspci -tvnn
-[0000:00]-+-00.0 Intel Corporation Device [8086:3e20]
           +-01.0-[01]----00.0 NVIDIA Corporation TU117M [GeForce GTX 1650 Mobile / Max-Q] [10de:1f91]
           +-02.0 Intel Corporation CoffeeLake-H GT2 [UHD Graphics 630] [8086:3e9b]
           +-04.0 Intel Corporation Xeon E3-1200 v5/E3-1500 v5/6th Gen Core Processor Thermal Subsystem [8086:1903]
           +-08.0 Intel Corporation Xeon E3-1200 v5/v6 / E3-1500 v5 / 6th/7th/8th Gen Core Processor Gaussian Mixture Model [8086:1911]
           +-12.0 Intel Corporation Cannon Lake PCH Thermal Controller [8086:a379]
           +-14.0 Intel Corporation Cannon Lake PCH USB 3.1 xHCI Host Controller [8086:a36d]
           +-14.2 Intel Corporation Cannon Lake PCH Shared SRAM [8086:a36f]
           +-15.0 Intel Corporation Cannon Lake PCH Serial IO I2C Controller #0 [8086:a368]
           +-15.1 Intel Corporation Cannon Lake PCH Serial IO I2C Controller #1 [8086:a369]
           +-16.0 Intel Corporation Cannon Lake PCH HECI Controller [8086:a360]
           +-17.0 Intel Corporation Cannon Lake Mobile PCH SATA AHCI Controller [8086:a353]
           +-1b.0-[02-3a]----00.0-[03-3a]--+-00.0-[04]----00.0 Intel Corporation JHL6340 Thunderbolt 3 NHI (C step) [Alpine Ridge 2C 2016] [8086:15d9]
           |                               +-01.0-[05-39]--
           |                               \-02.0-[3a]----00.0 Intel Corporation JHL6340 Thunderbolt 3 USB 3.1 Controller (C step) [Alpine Ridge 2C 2016] [8086:15db]
           +-1c.0-[3b]----00.0 Intel Corporation Wi-Fi 6 AX200 [8086:2723]
           +-1c.4-[3c]----00.0 Realtek Semiconductor Co., Ltd. RTS525A PCI Express Card Reader [10ec:525a]
           +-1d.0-[3d]----00.0 Samsung Electronics Co Ltd NVMe SSD Controller SM981/PM981/PM983 [144d:a808]
           +-1f.0 Intel Corporation Cannon Lake LPC Controller [8086:a30e]
           +-1f.3 Intel Corporation Cannon Lake PCH cAVS [8086:a348]
           +-1f.4 Intel Corporation Cannon Lake PCH SMBus Controller [8086:a323]
           \-1f.5 Intel Corporation Cannon Lake PCH SPI Controller [8086:a324]


Regards,

Nick.


2023-06-02 00:20:43

by Nick Hastings

[permalink] [raw]
Subject: Re: Regression from "ACPI: OSI: Remove Linux-Dell-Video _OSI string"? (was: Re: Bug#1036530: linux-signed-amd64: Hard lock up of system)

Hi,

* Limonciello, Mario <[email protected]> [230602 01:18]:
> +Lyude, Lukas, Karol
>
> On 5/31/2023 6:40 PM, Nick Hastings wrote:
> >
> > * Nick Hastings <[email protected]> [230530 16:01]:
> > > * Mario Limonciello <[email protected]> [230530 13:00]:
> > <snip>
> > > > As you're actually loading nouveau, can you please try nouveau.runpm=0 on
> > > > the kernel command line?
> > > I'm not intentionally loading it. This machine also has intel graphics
> > > which is what I prefer. Checking my
> > > /etc/modprobe.d/blacklist-nvidia-nouveau.conf
> > > I see:
> > >
> > > blacklist nvidia
> > > blacklist nvidia-drm
> > > blacklist nvidia-modeset
> > > blacklist nvidia-uvm
> > > blacklist ipmi_msghandler
> > > blacklist ipmi_devintf
> > >
> > > So I thought I had blacklisted it but it seems I did not. Since I do not
> > > want to use it maybe it is better to check if the lock up occurs with
> > > nouveau blacklisted. I will try that now.
> > I blacklisted nouveau and booted into a 6.1 kernel:
> > % uname -a
> > Linux xps 6.1.0-9-amd64 #1 SMP PREEMPT_DYNAMIC Debian 6.1.27-1 (2023-05-08) x86_64 GNU/Linux
> >
> > It has been running without problems for nearly two days now:
> > % uptime
> > 08:34:48 up 1 day, 16:22, 2 users, load average: 1.33, 1.26, 1.27
> >
> > Regards,
> >
> > Nick.
>
> Thanks, that makes a lot more sense now.
>
> Nick, Can you please test if nouveau works with runtime PM in the
> latest 6.4-rc?

I reported this twice already. I guess it was lost since for some
reason emails in this thread are not being trimmed. I'll repeat here:

I did eventually see a lockup of this kernel. On the console I saw:

[ 151.035036] vfio-pci 0000:01:00.0 Unable to change power state from D3cold to D0, device inaccessible

I did not see the other two lines that were present in earlier lock ups.
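
In case it helps others check their own logs for the same symptom, here is a small sketch that scans log lines for the failed D3cold to D0 transition quoted above. The function name `find_d3cold_failures` is hypothetical, and the pattern assumes the message wording shown in this mail; the colon after the PCI address is treated as optional because the line above was written without one.

```python
import re

# Captures the driver prefix and the PCI address so that messages from
# nouveau, vfio-pci, etc. can be told apart.
D3COLD_RE = re.compile(
    r"(?P<driver>[\w-]+) (?P<addr>[0-9a-f]{4}:[0-9a-f]{2}:[0-9a-f]{2}\.[0-7]):?"
    r" Unable to change power state from D3cold to D0"
)


def find_d3cold_failures(lines):
    """Return (driver, pci_address) for every failed D3cold -> D0 resume."""
    hits = []
    for line in lines:
        m = D3COLD_RE.search(line)
        if m:
            hits.append((m.group("driver"), m.group("addr")))
    return hits
```

It can be fed the lines of a saved `journalctl -k` or `dmesg` dump; an empty list only means the message was not logged, not that the device resumed cleanly.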

Regards,

Nick.


2023-06-02 01:15:53

by Mario Limonciello

[permalink] [raw]
Subject: RE: Regression from "ACPI: OSI: Remove Linux-Dell-Video _OSI string"? (was: Re: Bug#1036530: linux-signed-amd64: Hard lock up of system)

[AMD Official Use Only - General]

> -----Original Message-----
> From: Nick Hastings <[email protected]>
> Sent: Thursday, June 1, 2023 7:02 PM
> To: Karol Herbst <[email protected]>
> Cc: Limonciello, Mario <[email protected]>; Lyude Paul
> <[email protected]>; Lukas Wunner <[email protected]>; Salvatore
> Bonaccorso <[email protected]>; [email protected]; Rafael J.
> Wysocki <[email protected]>; Len Brown <[email protected]>; linux-
> [email protected]; [email protected];
> [email protected]
> Subject: Re: Regression from "ACPI: OSI: Remove Linux-Dell-Video _OSI
> string"? (was: Re: Bug#1036530: linux-signed-amd64: Hard lock up of system)
>
> Hi,
>
> * Karol Herbst <[email protected]> [230602 03:10]:
> > On Thu, Jun 1, 2023 at 7:21 PM Limonciello, Mario
> > <[email protected]> wrote:
> > > > -----Original Message-----
> > > > From: Karol Herbst <[email protected]>
> > > > Sent: Thursday, June 1, 2023 12:19 PM
> > > > To: Limonciello, Mario <[email protected]>
> > > > Cc: Nick Hastings <[email protected]>; Lyude Paul
> > > > <[email protected]>; Lukas Wunner <[email protected]>; Salvatore
> > > > Bonaccorso <[email protected]>; [email protected]; Rafael J.
> > > > Wysocki <[email protected]>; Len Brown <[email protected]>; linux-
> > > > [email protected]; [email protected];
> > > > [email protected]
> > > > Subject: Re: Regression from "ACPI: OSI: Remove Linux-Dell-Video _OSI
> > > > string"? (was: Re: Bug#1036530: linux-signed-amd64: Hard lock up of
> system)
> > > >
> > > > On Thu, Jun 1, 2023 at 6:54 PM Limonciello, Mario
> > > > <[email protected]> wrote:
> > > > >
> > > > > [AMD Official Use Only - General]
> > > > >
> > > > > > -----Original Message-----
> > > > > > From: Karol Herbst <[email protected]>
> > > > > > Sent: Thursday, June 1, 2023 11:33 AM
> > > > > > To: Limonciello, Mario <[email protected]>
> > > > > > Cc: Nick Hastings <[email protected]>; Lyude Paul
> > > > > > <[email protected]>; Lukas Wunner <[email protected]>; Salvatore
> > > > > > Bonaccorso <[email protected]>; [email protected]; Rafael
> J.
> > > > > > Wysocki <[email protected]>; Len Brown <[email protected]>; linux-
> > > > > > [email protected]; [email protected];
> > > > > > [email protected]
> > > > > > Subject: Re: Regression from "ACPI: OSI: Remove Linux-Dell-Video
> _OSI
> > > > > > string"? (was: Re: Bug#1036530: linux-signed-amd64: Hard lock up of
> > > > system)
> > > > > >
> > > > > > On Thu, Jun 1, 2023 at 6:18 PM Limonciello, Mario
> > > > > > >
> > > > > > > Lyude, Lukas, Karol
> > > > > > >
> > > > > > > This thread is in relation to this commit:
> > > > > > >
> > > > > > > 24867516f06d ("ACPI: OSI: Remove Linux-Dell-Video _OSI string")
> > > > > > >
> > > > > > > Nick has found that runtime PM is *not* working for nouveau.
> > > > > > >
> > > > > >
> > > > > > keep in mind we have a list of PCIe controllers where we apply a
> > > > > > workaround:
> > > > > >
> > > >
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers
> > > > > > /gpu/drm/nouveau/nouveau_drm.c?h=v6.4-rc4#n682
> > > > > >
> > > > > > And I suspect there might be one or two more IDs we'll have to add
> > > > > > there. Do we have any logs?
> > > > >
> > > > > There's some archived onto the distro bug. Search this page for
> > > > "journalctl.log.gz"
> > > > > https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=1036530
> > > > >
> > > >
> > > > interesting.. It seems to be the same controller used here. I wonder
> > > > if the pci topology is different or if the workaround is applied at
> > > > all.
> > >
> > > I didn't see the message in the log about the workaround being applied
> > > in that log, so I guess PCI topology difference is a likely suspect.
> > >
> >
> > yeah, but I also couldn't see a log with the usual nouveau messages,
> > so it's kinda weird.
> >
> > Anyway, the output of `lspci -tvnn` would help
>
> % lspci -tvnn
> -[0000:00]-+-00.0 Intel Corporation Device [8086:3e20]
> +-01.0-[01]----00.0 NVIDIA Corporation TU117M [GeForce GTX 1650
> Mobile / Max-Q] [10de:1f91]

So the bridge it's connected to is the same one the quirk *should have been* triggering on.

May 29 15:02:42 xps kernel: pci 0000:00:01.0: [8086:1901] type 01 class 0x060400

Since the quirk isn't working and this is still a problem in 6.4-rc4 I suggest opening a
Nouveau drm bug to figure out why.
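
As a rough illustration of the comparison made above, the sketch below pulls the vendor/device ID out of a PCI enumeration log line and checks it against a set of bridge IDs. The names `QUIRK_BRIDGE_IDS`, `parse_pci_enum_line` and `is_quirk_candidate` are hypothetical, and the single ID in the set is simply the one from the log line above; it is not copied from nouveau_drm.c.

```python
import re

# Bridge ID taken from the log line in this message. Whether it matches
# the list in nouveau_drm.c is assumed from the thread, not verified here.
QUIRK_BRIDGE_IDS = {(0x8086, 0x1901)}

PCI_ENUM_RE = re.compile(
    r"pci (?P<addr>[0-9a-f]{4}:[0-9a-f]{2}:[0-9a-f]{2}\.[0-7]): "
    r"\[(?P<vendor>[0-9a-f]{4}):(?P<device>[0-9a-f]{4})\] "
    r"type (?P<type>[0-9a-f]{2}) class 0x(?P<cls>[0-9a-f]{6})"
)


def parse_pci_enum_line(line):
    """Parse a 'pci <addr>: [vvvv:dddd] type tt class 0x......' log line."""
    m = PCI_ENUM_RE.search(line)
    if not m:
        return None
    return {
        "addr": m.group("addr"),
        "vendor": int(m.group("vendor"), 16),
        "device": int(m.group("device"), 16),
        # Header type 01 denotes a PCI-to-PCI bridge.
        "is_bridge": m.group("type") == "01",
        "class": int(m.group("cls"), 16),
    }


def is_quirk_candidate(info):
    """True if the parsed device is a bridge whose ID is in QUIRK_BRIDGE_IDS."""
    if info is None or not info["is_bridge"]:
        return False
    return (info["vendor"], info["device"]) in QUIRK_BRIDGE_IDS
```

A match here only says the ID is one the quirk is meant to cover; it does not show whether the driver actually applied the workaround on a given boot.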

> +-02.0 Intel Corporation CoffeeLake-H GT2 [UHD Graphics 630]
> [8086:3e9b]
> +-04.0 Intel Corporation Xeon E3-1200 v5/E3-1500 v5/6th Gen Core
> Processor Thermal Subsystem [8086:1903]
> +-08.0 Intel Corporation Xeon E3-1200 v5/v6 / E3-1500 v5 /
> 6th/7th/8th Gen Core Processor Gaussian Mixture Model [8086:1911]
> +-12.0 Intel Corporation Cannon Lake PCH Thermal Controller
> [8086:a379]
> +-14.0 Intel Corporation Cannon Lake PCH USB 3.1 xHCI Host Controller
> [8086:a36d]
> +-14.2 Intel Corporation Cannon Lake PCH Shared SRAM [8086:a36f]
> +-15.0 Intel Corporation Cannon Lake PCH Serial IO I2C Controller #0
> [8086:a368]
> +-15.1 Intel Corporation Cannon Lake PCH Serial IO I2C Controller #1
> [8086:a369]
> +-16.0 Intel Corporation Cannon Lake PCH HECI Controller [8086:a360]
> +-17.0 Intel Corporation Cannon Lake Mobile PCH SATA AHCI Controller
> [8086:a353]
> +-1b.0-[02-3a]----00.0-[03-3a]--+-00.0-[04]----00.0 Intel Corporation
> JHL6340 Thunderbolt 3 NHI (C step) [Alpine Ridge 2C 2016] [8086:15d9]
> | +-01.0-[05-39]--
> | \-02.0-[3a]----00.0 Intel Corporation JHL6340
> Thunderbolt 3 USB 3.1 Controller (C step) [Alpine Ridge 2C 2016]
> [8086:15db]
> +-1c.0-[3b]----00.0 Intel Corporation Wi-Fi 6 AX200 [8086:2723]
> +-1c.4-[3c]----00.0 Realtek Semiconductor Co., Ltd. RTS525A PCI
> Express Card Reader [10ec:525a]
> +-1d.0-[3d]----00.0 Samsung Electronics Co Ltd NVMe SSD Controller
> SM981/PM981/PM983 [144d:a808]
> +-1f.0 Intel Corporation Cannon Lake LPC Controller [8086:a30e]
> +-1f.3 Intel Corporation Cannon Lake PCH cAVS [8086:a348]
> +-1f.4 Intel Corporation Cannon Lake PCH SMBus Controller
> [8086:a323]
> \-1f.5 Intel Corporation Cannon Lake PCH SPI Controller
> [8086:a324]
>
>
> Regards,
>
> Nick.

Subject: Re: Regression from "ACPI: OSI: Remove Linux-Dell-Video _OSI string"? (was: Re: Bug#1036530: linux-signed-amd64: Hard lock up of system)

Hi, Thorsten here, the Linux kernel's regression tracker. Top-posting
for once, to make this easily accessible to everyone.

Nick, what's the status/was there any progress? Did you do what Mario
suggested and file a nouveau bug?

I ask, as I still have this on my list of regressions and it seems there
was no progress in three+ weeks now.

Ciao, Thorsten (wearing his 'the Linux kernel's regression tracker' hat)
--
Everything you wanna know about Linux kernel regression tracking:
https://linux-regtracking.leemhuis.info/about/#tldr
If I did something stupid, please tell me, as explained on that page.

#regzbot backburner: slow progress, likely just affects one machine
#regzbot poke



2023-06-26 22:48:43

by Nick Hastings

[permalink] [raw]
Subject: Re: Regression from "ACPI: OSI: Remove Linux-Dell-Video _OSI string"? (was: Re: Bug#1036530: linux-signed-amd64: Hard lock up of system)

Hi Thorsten,

* Linux regression tracking (Thorsten Leemhuis) <[email protected]> [230626 21:09]:
> Hi, Thorsten here, the Linux kernel's regression tracker. Top-posting
> for once, to make this easily accessible to everyone.
>
> Nick, what's the status/was there any progress? Did you do what Mario
> suggested and file a nouveau bug?

It was not apparent that the suggestion to open "a Nouveau drm bug" was
addressed to me.

> I ask, as I still have this on my list of regressions and it seems there
> was no progress in three+ weeks now.

I have not pursued this further since as far as I could tell I already
provided all requested information and I don't actually use nouveau, so
I blacklisted it.

Regards,

Nick.

> Ciao, Thorsten (wearing his 'the Linux kernel's regression tracker' hat)
> --
> Everything you wanna know about Linux kernel regression tracking:
> https://linux-regtracking.leemhuis.info/about/#tldr
> If I did something stupid, please tell me, as explained on that page.
>
> #regzbot backburner: slow progress, likely just affects one machine
> #regzbot poke
>
>
> On 02.06.23 02:57, Limonciello, Mario wrote:
> > [AMD Official Use Only - General]
> >
> >> -----Original Message-----
> >> From: Nick Hastings <[email protected]>
> >> Sent: Thursday, June 1, 2023 7:02 PM
> >> To: Karol Herbst <[email protected]>
> >> Cc: Limonciello, Mario <[email protected]>; Lyude Paul
> >> <[email protected]>; Lukas Wunner <[email protected]>; Salvatore
> >> Bonaccorso <[email protected]>; [email protected]; Rafael J.
> >> Wysocki <[email protected]>; Len Brown <[email protected]>; linux-
> >> [email protected]; [email protected];
> >> [email protected]
> >> Subject: Re: Regression from "ACPI: OSI: Remove Linux-Dell-Video _OSI
> >> string"? (was: Re: Bug#1036530: linux-signed-amd64: Hard lock up of system)
> >>
> >> Hi,
> >>
> >> * Karol Herbst <[email protected]> [230602 03:10]:
> >>> On Thu, Jun 1, 2023 at 7:21 PM Limonciello, Mario
> >>> <[email protected]> wrote:
> >>>>> -----Original Message-----
> >>>>> From: Karol Herbst <[email protected]>
> >>>>> Sent: Thursday, June 1, 2023 12:19 PM
> >>>>> To: Limonciello, Mario <[email protected]>
> >>>>> Cc: Nick Hastings <[email protected]>; Lyude Paul
> >>>>> <[email protected]>; Lukas Wunner <[email protected]>; Salvatore
> >>>>> Bonaccorso <[email protected]>; [email protected]; Rafael J.
> >>>>> Wysocki <[email protected]>; Len Brown <[email protected]>; linux-
> >>>>> [email protected]; [email protected];
> >>>>> [email protected]
> >>>>> Subject: Re: Regression from "ACPI: OSI: Remove Linux-Dell-Video _OSI
> >>>>> string"? (was: Re: Bug#1036530: linux-signed-amd64: Hard lock up of
> >> system)
> >>>>>
> >>>>> On Thu, Jun 1, 2023 at 6:54 PM Limonciello, Mario
> >>>>> <[email protected]> wrote:
> >>>>>>
> >>>>>> [AMD Official Use Only - General]
> >>>>>>
> >>>>>>> -----Original Message-----
> >>>>>>> From: Karol Herbst <[email protected]>
> >>>>>>> Sent: Thursday, June 1, 2023 11:33 AM
> >>>>>>> To: Limonciello, Mario <[email protected]>
> >>>>>>> Cc: Nick Hastings <[email protected]>; Lyude Paul
> >>>>>>> <[email protected]>; Lukas Wunner <[email protected]>; Salvatore
> >>>>>>> Bonaccorso <[email protected]>; [email protected]; Rafael
> >> J.
> >>>>>>> Wysocki <[email protected]>; Len Brown <[email protected]>; linux-
> >>>>>>> [email protected]; [email protected];
> >>>>>>> [email protected]
> >>>>>>> Subject: Re: Regression from "ACPI: OSI: Remove Linux-Dell-Video
> >> _OSI
> >>>>>>> string"? (was: Re: Bug#1036530: linux-signed-amd64: Hard lock up of
> >>>>> system)
> >>>>>>>
> >>>>>>> On Thu, Jun 1, 2023 at 6:18 PM Limonciello, Mario
> >>>>>>>>
> >>>>>>>> Lyude, Lukas, Karol
> >>>>>>>>
> >>>>>>>> This thread is in relation to this commit:
> >>>>>>>>
> >>>>>>>> 24867516f06d ("ACPI: OSI: Remove Linux-Dell-Video _OSI string")
> >>>>>>>>
> >>>>>>>> Nick has found that runtime PM is *not* working for nouveau.
> >>>>>>>>
> >>>>>>>
> >>>>>>> keep in mind we have a list of PCIe controllers where we apply a
> >>>>>>> workaround:
> >>>>>>>
> >>>>>
> >> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers
> >>>>>>> /gpu/drm/nouveau/nouveau_drm.c?h=v6.4-rc4#n682
> >>>>>>>
> >>>>>>> And I suspect there might be one or two more IDs we'll have to add
> >>>>>>> there. Do we have any logs?
> >>>>>>
> >>>>>> There's some archived onto the distro bug. Search this page for
> >>>>> "journalctl.log.gz"
> >>>>>> https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=1036530
> >>>>>>
> >>>>>
> >>>>> interesting.. It seems to be the same controller used here. I wonder
> >>>>> if the pci topology is different or if the workaround is applied at
> >>>>> all.
> >>>>
> >>>> I didn't see the message in the log about the workaround being applied
> >>>> in that log, so I guess PCI topology difference is a likely suspect.
> >>>>
> >>>
> >>> yeah, but I also couldn't see a log with the usual nouveau messages,
> >>> so it's kinda weird.
> >>>
> >>> Anyway, the output of `lspci -tvnn` would help
> >>
> >> % lspci -tvnn
> >> -[0000:00]-+-00.0 Intel Corporation Device [8086:3e20]
> >> +-01.0-[01]----00.0 NVIDIA Corporation TU117M [GeForce GTX 1650
> >> Mobile / Max-Q] [10de:1f91]
> >
> > So the bridge it's connected to is the same that the quirk *should have been* triggering.
> >
> > May 29 15:02:42 xps kernel: pci 0000:00:01.0: [8086:1901] type 01 class 0x060400
> >
> > Since the quirk isn't working and this is still a problem in 6.4-rc4 I suggest opening a
> > Nouveau drm bug to figure out why.
> >
> >> +-02.0 Intel Corporation CoffeeLake-H GT2 [UHD Graphics 630]
> >> [8086:3e9b]
> >> +-04.0 Intel Corporation Xeon E3-1200 v5/E3-1500 v5/6th Gen Core
> >> Processor Thermal Subsystem [8086:1903]
> >> +-08.0 Intel Corporation Xeon E3-1200 v5/v6 / E3-1500 v5 /
> >> 6th/7th/8th Gen Core Processor Gaussian Mixture Model [8086:1911]
> >> +-12.0 Intel Corporation Cannon Lake PCH Thermal Controller
> >> [8086:a379]
> >> +-14.0 Intel Corporation Cannon Lake PCH USB 3.1 xHCI Host Controller
> >> [8086:a36d]
> >> +-14.2 Intel Corporation Cannon Lake PCH Shared SRAM [8086:a36f]
> >> +-15.0 Intel Corporation Cannon Lake PCH Serial IO I2C Controller #0
> >> [8086:a368]
> >> +-15.1 Intel Corporation Cannon Lake PCH Serial IO I2C Controller #1
> >> [8086:a369]
> >> +-16.0 Intel Corporation Cannon Lake PCH HECI Controller [8086:a360]
> >> +-17.0 Intel Corporation Cannon Lake Mobile PCH SATA AHCI Controller
> >> [8086:a353]
> >> +-1b.0-[02-3a]----00.0-[03-3a]--+-00.0-[04]----00.0 Intel Corporation
> >> JHL6340 Thunderbolt 3 NHI (C step) [Alpine Ridge 2C 2016] [8086:15d9]
> >> | +-01.0-[05-39]--
> >> | \-02.0-[3a]----00.0 Intel Corporation JHL6340
> >> Thunderbolt 3 USB 3.1 Controller (C step) [Alpine Ridge 2C 2016]
> >> [8086:15db]
> >> +-1c.0-[3b]----00.0 Intel Corporation Wi-Fi 6 AX200 [8086:2723]
> >> +-1c.4-[3c]----00.0 Realtek Semiconductor Co., Ltd. RTS525A PCI
> >> Express Card Reader [10ec:525a]
> >> +-1d.0-[3d]----00.0 Samsung Electronics Co Ltd NVMe SSD Controller
> >> SM981/PM981/PM983 [144d:a808]
> >> +-1f.0 Intel Corporation Cannon Lake LPC Controller [8086:a30e]
> >> +-1f.3 Intel Corporation Cannon Lake PCH cAVS [8086:a348]
> >> +-1f.4 Intel Corporation Cannon Lake PCH SMBus Controller
> >> [8086:a323]
> >> \-1f.5 Intel Corporation Cannon Lake PCH SPI Controller
> >> [8086:a324]
> >>
> >>
> >> Regards,
> >>
> >> Nick.
> >
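
Editorial sketch of the bridge check discussed above: Mario compares the bridge above the NVIDIA GPU, as reported in the kernel log, with the ID the nouveau workaround is expected to match. The snippet below is a hypothetical helper, not nouveau code. It parses a kernel PCI enumeration line of the form quoted in this thread and tests the vendor:device pair against a set of IDs. The set contains only 8086:1901, the one ID taken from this thread; the function names and the set itself are assumptions for illustration.

```python
import re

# Assumption: bridge IDs the workaround is expected to match.
# Only 8086:1901 comes from this thread; verify against the driver source.
QUIRKED_BRIDGE_IDS = {("8086", "1901")}

# Matches kernel enumeration lines such as:
#   pci 0000:00:01.0: [8086:1901] type 01 class 0x060400
PCI_LINE = re.compile(
    r"pci (?P<addr>[0-9a-f]{4}:[0-9a-f]{2}:[0-9a-f]{2}\.[0-7]): "
    r"\[(?P<vendor>[0-9a-f]{4}):(?P<device>[0-9a-f]{4})\]"
)


def parse_pci_line(line):
    """Return (address, vendor, device) from a PCI enumeration line, or None."""
    match = PCI_LINE.search(line)
    if match is None:
        return None
    return match.group("addr"), match.group("vendor"), match.group("device")


def bridge_is_quirked(line):
    """True if the line describes a device whose ID is in QUIRKED_BRIDGE_IDS."""
    parsed = parse_pci_line(line)
    return parsed is not None and (parsed[1], parsed[2]) in QUIRKED_BRIDGE_IDS


# Usage with the log line quoted in the thread.
sample = ("May 29 15:02:42 xps kernel: pci 0000:00:01.0: "
          "[8086:1901] type 01 class 0x060400")
print(parse_pci_line(sample), bridge_is_quirked(sample))
```

A matching ID only shows that the bridge is one the workaround targets. It does not show that the workaround actually ran, which is the open question in the thread.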


Subject: Re: Regression from "ACPI: OSI: Remove Linux-Dell-Video _OSI string"? (was: Re: Bug#1036530: linux-signed-amd64: Hard lock up of system)

On 27.06.23 00:34, Nick Hastings wrote:
> * Linux regression tracking (Thorsten Leemhuis) <[email protected]> [230626 21:09]:
>> Hi, Thorsten here, the Linux kernel's regression tracker. Top-posting
>> for once, to make this easily accessible to everyone.
>>
>> Nick, what's the status/was there any progress? Did you do what Mario
>> suggested and file a nouveau bug?
>
> It was not apparent that the suggestion to open "a Nouveau drm bug" was
> addressed to me.

I wish things were easier for reporters, but from what I can see this
is the only way forward if you or some silent bystander cares.

>> I ask, as I still have this on my list of regressions and it seems there
>> was no progress in three+ weeks now.
>
> I have not pursued this further since as far as I could tell I already
> provided all requested information and I don't actually use nouveau, so
> I blacklisted it.

I doubt any developer cares enough to take a closer look[1] without a
proper nouveau bug and some help & prodding from someone affected. And
looks to me like reverting the culprit now might create even bigger
problems for users.

Hence I guess this won't be fixed in the end. In an ideal world this
would not happen, but we don't live in one and all have just 24 hours in
a day. :-/

Nevertheless: thx for your report and your help throughout this thread.

[1] some points on the following page kinda explain this
https://linux-regtracking.leemhuis.info/post/frequent-reasons-why-linux-kernel-bug-reports-are-ignored/

Ciao, Thorsten (wearing his 'the Linux kernel's regression tracker' hat)
--
Everything you wanna know about Linux kernel regression tracking:
https://linux-regtracking.leemhuis.info/about/#tldr
If I did something stupid, please tell me, as explained on that page.

#regzbot inconclusive: reporting deadlock (see thread for details)




2023-06-30 13:39:17

by Karol Herbst

[permalink] [raw]
Subject: Re: Regression from "ACPI: OSI: Remove Linux-Dell-Video _OSI string"? (was: Re: Bug#1036530: linux-signed-amd64: Hard lock up of system)

On Fri, Jun 30, 2023 at 3:02 PM Thorsten Leemhuis
<[email protected]> wrote:
>
> On 27.06.23 00:34, Nick Hastings wrote:
> > * Linux regression tracking (Thorsten Leemhuis) <[email protected]> [230626 21:09]:
> >> Hi, Thorsten here, the Linux kernel's regression tracker. Top-posting
> >> for once, to make this easily accessible to everyone.
> >>
> >> Nick, what's the status/was there any progress? Did you do what Mario
> >> suggested and file a nouveau bug?
> >
> > It was not apparent that the suggestion to open "a Nouveau drm bug" was
> > addressed to me.
>
> I wish things were easier for reporters, but from what I can see this
> is the only way forward if you or some silent bystander cares.
>
> >> I ask, as I still have this on my list of regressions and it seems there
> >> was no progress in three+ weeks now.
> >
> > I have not pursued this further since as far as I could tell I already
> > provided all requested information and I don't actually use nouveau, so
> > I blacklisted it.
>
> I doubt any developer cares enough to take a closer look[1] without a
> proper nouveau bug and some help & prodding from someone affected. And
> looks to me like reverting the culprit now might create even bigger
> problems for users.
>
> Hence I guess this won't be fixed in the end. In an ideal world this
> would not happen, but we don't live in one and all have just 24 hours in
> a day. :-/
>

We recently merged this commit:
https://gitlab.freedesktop.org/drm/nouveau/-/commit/11d24327c2d7ad7f24fcc44fb00e1fa91ebf6525

It might resolve the problem, so it is at least worth testing. I can't
remember whether this was a hybrid AMD/Nvidia system, but I think it was?

> Nevertheless: thx for your report and your help throughout this thread.
>
> [1] some points on the following page kinda explain this
> https://linux-regtracking.leemhuis.info/post/frequent-reasons-why-linux-kernel-bug-reports-are-ignored/
>
> Ciao, Thorsten (wearing his 'the Linux kernel's regression tracker' hat)
> --
> Everything you wanna know about Linux kernel regression tracking:
> https://linux-regtracking.leemhuis.info/about/#tldr
> If I did something stupid, please tell me, as explained on that page.
>
> #regzbot inconclusive: reporting deadlock (see thread for details)


2023-06-30 21:52:04

by Nick Hastings

[permalink] [raw]
Subject: Re: Regression from "ACPI: OSI: Remove Linux-Dell-Video _OSI string"? (was: Re: Bug#1036530: linux-signed-amd64: Hard lock up of system)

Hi,

* Thorsten Leemhuis <[email protected]> [230630 22:02]:
> On 27.06.23 00:34, Nick Hastings wrote:
> > * Linux regression tracking (Thorsten Leemhuis) <[email protected]> [230626 21:09]:
> >> Hi, Thorsten here, the Linux kernel's regression tracker. Top-posting
> >> for once, to make this easily accessible to everyone.
> >>
> >> Nick, what's the status/was there any progress? Did you do what Mario
> >> suggested and file a nouveau bug?
> >
> > It was not apparent that the suggestion to open "a Nouveau drm bug" was
> > addressed to me.
>
> I wish things were easier for reporters, but from what I can see this
> is the only way forward if you or some silent bystander cares.

In principle I can open another bug report, but I don't know how or
where to report "a Nouveau drm bug". Please keep in mind that I'm just
an end user. I learnt to use git bisect specifically because of this
bug. Prior to that, I hadn't compiled a kernel in about 15 years.

> >> I ask, as I still have this on my list of regressions and it seems there
> >> was no progress in three+ weeks now.
> >
> > I have not pursued this further since as far as I could tell I already
> > provided all requested information and I don't actually use nouveau, so
> > I blacklisted it.
>
> I doubt any developer cares enough to take a closer look[1] without a
> proper nouveau bug and some help & prodding from someone affected. And
> looks to me like reverting the culprit now might create even bigger
> problems for users.

If someone can point me to some docs about reporting nouveau bugs, I
can look into it.

> Hence I guess this won't be fixed in the end. In an ideal world this
> would not happen, but we don't live in one and all have just 24 hours in
> a day. :-/

The Dell XPS 15 7590 is a very common machine, so I expect many people
could experience this issue. Or maybe, like me, they only use the Intel GPU.

> Nevertheless: thx for your report and your help throughout this thread.

No problem. I am willing to try to do more, but right now I don't know
how to do what has been suggested.

Cheers,

Nick.

> [1] some points on the following page kinda explain this
> https://linux-regtracking.leemhuis.info/post/frequent-reasons-why-linux-kernel-bug-reports-are-ignored/
>
> Ciao, Thorsten (wearing his 'the Linux kernel's regression tracker' hat)
> --
> Everything you wanna know about Linux kernel regression tracking:
> https://linux-regtracking.leemhuis.info/about/#tldr
> If I did something stupid, please tell me, as explained on that page.
>
> #regzbot inconclusive: reporting deadlock (see thread for details)


2023-06-30 22:00:18

by Mario Limonciello

[permalink] [raw]
Subject: Re: Regression from "ACPI: OSI: Remove Linux-Dell-Video _OSI string"? (was: Re: Bug#1036530: linux-signed-amd64: Hard lock up of system)


>> Nevertheless: thx for your report and your help throughout this thread.
>
> No problem. I am willing to try to do more, but right now I don't know
> how to do what has been suggested.
>

Here is where to report Nouveau bugs:

https://gitlab.freedesktop.org/drm/nouveau/-/issues/


2023-06-30 22:21:25

by Nick Hastings

[permalink] [raw]
Subject: Re: Regression from "ACPI: OSI: Remove Linux-Dell-Video _OSI string"? (was: Re: Bug#1036530: linux-signed-amd64: Hard lock up of system)

Hi,

* Limonciello, Mario <[email protected]> [230701 06:40]:
>
> > > Nevertheless: thx for your report and your help throughout this thread.
> >
> > No problem. I am willing to try to do more, but right now I don't know
> > how to do what has been suggested.
> >
>
> Here is where to report Nouveau bugs:
>
> https://gitlab.freedesktop.org/drm/nouveau/-/issues/

Thanks.

Done: https://gitlab.freedesktop.org/drm/nouveau/-/issues/241

Cheers,

Nick.


2023-07-07 22:10:46

by Lyude Paul

[permalink] [raw]
Subject: Re: Regression from "ACPI: OSI: Remove Linux-Dell-Video _OSI string"? (was: Re: Bug#1036530: linux-signed-amd64: Hard lock up of system)

On Thu, 2023-06-01 at 11:18 -0500, Limonciello, Mario wrote:
> +Lyude, Lukas, Karol
>
> On 5/31/2023 6:40 PM, Nick Hastings wrote:
> > Hi,
> >
> > * Nick Hastings <[email protected]> [230530 16:01]:
> > > * Mario Limonciello <[email protected]> [230530 13:00]:
> > <snip>
> > > > As you're actually loading nouveau, can you please try nouveau.runpm=0 on
> > > > the kernel command line?
> > > I'm not intentionally loading it. This machine also has intel graphics
> > > which is what I prefer. Checking my
> > > /etc/modprobe.d/blacklist-nvidia-nouveau.conf
> > > I see:
> > >
> > > blacklist nvidia
> > > blacklist nvidia-drm
> > > blacklist nvidia-modeset
> > > blacklist nvidia-uvm
> > > blacklist ipmi_msghandler
> > > blacklist ipmi_devintf
> > >
> > > So I thought I had blacklisted it but it seems I did not. Since I do not
> > > want to use it maybe it is better to check if the lock up occurs with
> > > nouveau blacklisted. I will try that now.
> > I blacklisted nouveau and booted into a 6.1 kernel:
> > % uname -a
> > Linux xps 6.1.0-9-amd64 #1 SMP PREEMPT_DYNAMIC Debian 6.1.27-1 (2023-05-08) x86_64 GNU/Linux
> >
> > It has been running without problems for nearly two days now:
> > % uptime
> > 08:34:48 up 1 day, 16:22, 2 users, load average: 1.33, 1.26, 1.27
> >
> > Regards,
> >
> > Nick.
>
> Thanks, that makes a lot more sense now.
>
> Nick, Can you please test if nouveau works with runtime PM in the
> latest 6.4-rc?
>
> If it works in 6.4-rc, there are probably nouveau commits that need
> to be backported to 6.1 LTS.
>
> If it's still broken in 6.4-rc, I believe you should file a bug:
>
> https://gitlab.freedesktop.org/drm/nouveau/
>
>
> Lyude, Lukas, Karol
>
> This thread is in relation to this commit:
>
> 24867516f06d ("ACPI: OSI: Remove Linux-Dell-Video _OSI string")
>
> Nick has found that runtime PM is *not* working for nouveau.
>
> If you recall we did 24867516f06d because 5775b843a619 was
> supposed to have fixed it.

Gotcha. Please keep me updated, since from what I gathered here it seems
like things -might- be working? Happy to look further if 6.4-rc turns out
to be broken, though.

>

--
Cheers,
Lyude Paul (she/her)
Software Engineer at Red Hat
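
Editorial sketch of the blacklist check from the quoted exchange above: the file Nick posted listed the nvidia modules but not nouveau, which is why nouveau was still being loaded. The snippet below is a hypothetical helper that parses modprobe.d-style text and reports whether a given module has a `blacklist` entry. The function names are assumptions, and the parsing is deliberately simplified: it recognises only `blacklist <name>` lines and ignores other directives such as `install` or `alias`.

```python
def blacklisted_modules(conf_text):
    """Collect module names from 'blacklist <name>' lines in modprobe.d-style text."""
    names = set()
    for raw in conf_text.splitlines():
        # Simplification: treat everything after '#' as a comment.
        line = raw.split("#", 1)[0].strip()
        parts = line.split()
        if len(parts) == 2 and parts[0] == "blacklist":
            names.add(parts[1])
    return names


def is_blacklisted(conf_text, module):
    """True if the module has a blacklist entry ('-' and '_' are treated alike)."""
    def normalise(name):
        return name.replace("-", "_")

    return normalise(module) in {normalise(n) for n in blacklisted_modules(conf_text)}


# The file contents as quoted in the thread.
conf = """\
blacklist nvidia
blacklist nvidia-drm
blacklist nvidia-modeset
blacklist nvidia-uvm
blacklist ipmi_msghandler
blacklist ipmi_devintf
"""
print(is_blacklisted(conf, "nouveau"))
print(is_blacklisted(conf + "blacklist nouveau\n", "nouveau"))
```

A blacklist entry only stops automatic loading by alias, so the module can still be loaded explicitly. The check answers "is there an entry?", not "is the module loaded?".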