2023-01-18 21:39:54

by Chris Clayton

Subject: linux-6.2-rc4+ hangs on poweroff/reboot: Bisected

Hi.

I built and installed the latest development kernel earlier this week. I've found that when I try to shut the laptop down (or
reboot it), it hangs right at the end of closing the current session. The last line I see on the screen when rebooting is:

sd 4:0:0:0: [sda] Synchronising SCSI cache

when closing down I see one additional line:

sd 4:0:0:0 [sda]Stopping disk

In both cases the machine then hangs and I have to hold down the power button for a few seconds to switch it off.

Linux 6.1 is OK but 6.2-rc1 hangs, so I bisected between the two and landed on:

# first bad commit: [0e44c21708761977dcbea9b846b51a6fb684907a] drm/nouveau/flcn: new code to load+boot simple HS FWs
(VPR scrubber)

I built and installed a kernel with f15cde64b66161bfa74fb58f4e5697d8265b802e (the parent of the bad commit) checked out
and that shuts down and reboots fine. I then did the same with the bad commit checked out and that does indeed hang, so
I'm confident the bisect outcome is OK.
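
For readers unfamiliar with the procedure, the bisection described above can be pictured as a binary search between a known-good and a known-bad kernel. The following is a minimal Python sketch that assumes a strictly linear history. `first_bad`, `history` and `is_bad` are illustrative names and not part of git, and the commit list is heavily shortened. Real `git bisect` also has to cope with merges, and each test means building and booting a kernel.

```python
from typing import Callable, Sequence


def first_bad(commits: Sequence[str], is_bad: Callable[[str], bool]) -> str:
    """Return the first bad commit in a linear history.

    Assumes commits are ordered oldest to newest, commits[0] is known
    good and commits[-1] is known bad (as with v6.1 and v6.2-rc1 here).
    """
    lo, hi = 0, len(commits) - 1      # lo: newest known good, hi: oldest known bad
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if is_bad(commits[mid]):      # e.g. "does this kernel hang on poweroff?"
            hi = mid
        else:
            lo = mid
    return commits[hi]


# Hypothetical, heavily shortened history: only the endpoints and the two
# commits named in the report are listed; everything in between is omitted.
history = ["v6.1", "f15cde64b661", "0e44c2170876", "v6.2-rc1"]


def is_bad(commit: str) -> bool:
    # Stand-in for the manual test (build, install, try to shut down).
    return history.index(commit) >= history.index("0e44c2170876")


print(first_bad(history, is_bad))
```

Checking the parent and the first bad commit by hand, as done above, corresponds to confirming the final `lo` and `hi` of this search.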

Kernels 6.1.6 and 5.15.88 are also OK.

My system has dual GPUs - one Intel and one NVIDIA. The relevant extracts from 'lspci -v' are:

00:02.0 VGA compatible controller: Intel Corporation CometLake-H GT2 [UHD Graphics] (rev 05) (prog-if 00 [VGA controller])
Subsystem: CLEVO/KAPOK Computer CometLake-H GT2 [UHD Graphics]
Flags: bus master, fast devsel, latency 0, IRQ 142
Memory at c2000000 (64-bit, non-prefetchable) [size=16M]
Memory at a0000000 (64-bit, prefetchable) [size=256M]
I/O ports at 5000 [size=64]
Expansion ROM at 000c0000 [virtual] [disabled] [size=128K]
Capabilities: [40] Vendor Specific Information: Len=0c <?>
Capabilities: [70] Express Root Complex Integrated Endpoint, MSI 00
Capabilities: [ac] MSI: Enable+ Count=1/1 Maskable- 64bit-
Capabilities: [d0] Power Management version 2
Kernel driver in use: i915
Kernel modules: i915

01:00.0 VGA compatible controller: NVIDIA Corporation TU117M [GeForce GTX 1650 Ti Mobile] (rev a1) (prog-if 00 [VGA
controller])
Subsystem: CLEVO/KAPOK Computer TU117M [GeForce GTX 1650 Ti Mobile]
Flags: bus master, fast devsel, latency 0, IRQ 141
Memory at c4000000 (32-bit, non-prefetchable) [size=16M]
Memory at b0000000 (64-bit, prefetchable) [size=256M]
Memory at c0000000 (64-bit, prefetchable) [size=32M]
I/O ports at 4000 [size=128]
Expansion ROM at c3000000 [disabled] [size=512K]
Capabilities: [60] Power Management version 3
Capabilities: [68] MSI: Enable+ Count=1/1 Maskable- 64bit+
Capabilities: [78] Express Legacy Endpoint, MSI 00
Kernel driver in use: nouveau
Kernel modules: nouveau

DRI_PRIME=1 is exported in one of my init scripts (yes, I am still using sysvinit).

I've attached the bisect.log, but please let me know if I can provide any other diagnostics. Please cc me as I'm not
subscribed.


Chris


Attachments:
bisect-log.txt (2.58 kB)

Subject: Re: linux-6.2-rc4+ hangs on poweroff/reboot: Bisected

[adding various lists and the two other nouveau maintainers to the list
of recipients]

For the rest of this mail:

[TLDR: I'm adding this report to the list of tracked Linux kernel
regressions; the text you find below is based on a few template
paragraphs you might have encountered already in similar form.
See link in footer if these mails annoy you.]

On 18.01.23 21:59, Chris Clayton wrote:
> Hi.
> [...]

Thanks for the report. To be sure the issue doesn't fall through the
cracks unnoticed, I'm adding it to regzbot, the Linux kernel regression
tracking bot:

#regzbot ^introduced e44c2170876197
#regzbot title drm: nouveau: hangs on poweroff/reboot
#regzbot ignore-activity

This isn't a regression? This issue or a fix for it are already
discussed somewhere else? It was fixed already? You want to clarify when
the regression started to happen? Or point out I got the title or
something else totally wrong? Then just reply and tell me -- ideally
while also telling regzbot about it, as explained by the page listed in
the footer of this mail.

Developers: When fixing the issue, remember to add 'Link:' tags pointing
to the report (the parent of this mail). See page linked in footer for
details.

Ciao, Thorsten (wearing his 'the Linux kernel's regression tracker' hat)
--
Everything you wanna know about Linux kernel regression tracking:
https://linux-regtracking.leemhuis.info/about/#tldr
That page also explains what to do if mails like this annoy you.

Subject: Re: linux-6.2-rc4+ hangs on poweroff/reboot: Bisected

[TLDR: This mail is primarily relevant for Linux kernel regression
tracking. See link in footer if these mails annoy you.]

On 19.01.23 15:33, Linux kernel regression tracking (Thorsten Leemhuis)
wrote:
> On 18.01.23 21:59, Chris Clayton wrote:
>>
>> # first bad commit: [0e44c21708761977dcbea9b846b51a6fb684907a] drm/nouveau/flcn: new code to load+boot simple HS FWs
>> (VPR scrubber)
>
> #regzbot ^introduced e44c2170876197

/me wonders if he failed to spot or cut'n'paste the leading 0
/me wonders if he needs glasses
#sigh

Sorry for the noise!

#regzbot 0e44c21708761977dc
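
The slip is easy to make because an abbreviated commit ID is only valid if it is a leading prefix of the full ID. A small Python sketch of that check follows. The helper name `abbreviates` is made up for illustration, and the minimum length of 4 is an assumption mirroring git's smallest accepted abbreviation. It does not check whether the abbreviation is unique in a repository.

```python
FULL_ID = "0e44c21708761977dcbea9b846b51a6fb684907a"  # first bad commit from the report


def abbreviates(short_id: str, full_id: str = FULL_ID) -> bool:
    """Return True if short_id is a leading prefix of full_id.

    Hypothetical helper: it only compares strings and says nothing about
    whether the abbreviation is unambiguous in an actual repository.
    """
    short_id = short_id.strip().lower()
    return len(short_id) >= 4 and full_id.startswith(short_id)


print(abbreviates("e44c2170876197"))      # the ID from the first mail, leading 0 missing
print(abbreviates("0e44c21708761977dc"))  # the corrected ID
```

The first call prints `False` and the second `True`, which matches the correction made here.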

> #regzbot title drm: nouveau: hangs on poweroff/reboot
> #regzbot ignore-activity

Ciao, Thorsten (wearing his 'the Linux kernel's regression tracker' hat)
--
Everything you wanna know about Linux kernel regression tracking:
https://linux-regtracking.leemhuis.info/about/#tldr
That page also explains what to do if mails like this annoy you.

#regzbot ignore-activity

2023-01-27 12:12:00

by Karol Herbst

Subject: Re: linux-6.2-rc4+ hangs on poweroff/reboot: Bisected

Where was the original email sent to anyway, because I don't have it at all.

Anyhow, I suspect we want to fetch logs to see what's happening, but
due to the nature of this bug it might get difficult.

I'm checking whether I can reproduce this issue on the laptops I have
here, but I think all of mine with Turing GPUs are fine.

Maybe Ben has an idea what might be wrong with
0e44c21708761977dcbea9b846b51a6fb684907a, or whether this is an issue
that is already fixed by patches not yet upstreamed; I think I remember
Ben talking about something like that recently.

Karol



Subject: Re: linux-6.2-rc4+ hangs on poweroff/reboot: Bisected

Hi, this is your Linux kernel regression tracker. Top-posting for once,
to make this easily accessible to everyone.

@nouveau-maintainers, did anyone take a look at this? The report is
already 8 days old and I don't see a single reply. Sure, we'll likely
get a -rc8, but still it would be good to not fix this on the finish line.

Chris, btw, did you try if you can revert the commit on top of latest
mainline? And if so, does it fix the problem?

Ciao, Thorsten (wearing his 'the Linux kernel's regression tracker' hat)
--
Everything you wanna know about Linux kernel regression tracking:
https://linux-regtracking.leemhuis.info/about/#tldr
If I did something stupid, please tell me, as explained on that page.

#regzbot poke


2023-01-27 19:44:23

by Chris Clayton

Subject: Re: linux-6.2-rc4+ hangs on poweroff/reboot: Bisected

[Resend because the mail client on my phone decided to turn HTML on behind my back, so my reply got bounced.]

Thanks Karol.

I sent the original report to Ben and LKML. Thorsten then added you, Lyude Paul and the dri-devel and nouveau mailing
lists. So you should have received this report on or about January 19.

Chris


2023-01-27 19:52:04

by Chris Clayton

Subject: Re: linux-6.2-rc4+ hangs on poweroff/reboot: Bisected

[Resend because the mail client on my phone decided to turn HTML on behind my back, so my reply got bounced.]

Thanks Thorsten.

I did try to revert it, but it didn't revert cleanly and I don't have the knowledge to fix it up.

The patch was part of a merge that included a number of related patches. Tomorrow, I'll try to revert the lot and report
back.

Chris




Subject: Re: linux-6.2-rc4+ hangs on poweroff/reboot: Bisected

On 27.01.23 20:46, Chris Clayton wrote:
> [Resend because the mail client on my phone decided to turn HTML on behind my back, so my reply got bounced.]
>
> Thanks Thorsten.
>
> I did try to revert but it didnt revert cleanly and I don't have the knowledge to fix it up.
>
> The patch was part of a merge that included a number of related patches. Tomorrow, I'll try to revert the lot and report
> back.

You are free to do so, but there is no need for that from my side. I
only wanted to know if a simple revert would do the trick; if it
doesn't, in my experience it is often best to leave things to the
developers of the code in question, as they know it best and thus have a
better idea which hidden side effects a more complex revert might have.

Ciao, Thorsten

> On 27/01/2023 11:20, Linux kernel regression tracking (Thorsten Leemhuis) wrote:
>> Hi, this is your Linux kernel regression tracker. Top-posting for once,
>> to make this easily accessible to everyone.
>>
>> @nouveau-maintainers, did anyone take a look at this? The report is
>> already 8 days old and I don't see a single reply. Sure, we'll likely
>> get a -rc8, but still it would be good to not fix this on the finish line.
>>
>> Chris, btw, did you try if you can revert the commit on top of latest
>> mainline? And if so, does it fix the problem?
>>
>> Ciao, Thorsten (wearing his 'the Linux kernel's regression tracker' hat)
>> --
>> Everything you wanna know about Linux kernel regression tracking:
>> https://linux-regtracking.leemhuis.info/about/#tldr
>> If I did something stupid, please tell me, as explained on that page.
>>
>> #regzbot poke
>
>

2023-01-28 11:29:44

by Chris Clayton

[permalink] [raw]
Subject: Re: linux-6.2-rc4+ hangs on poweroff/reboot: Bisected



On 28/01/2023 05:42, Linux kernel regression tracking (Thorsten Leemhuis) wrote:
> On 27.01.23 20:46, Chris Clayton wrote:
>> [Resend because the mail client on my phone decided to turn HTML on behind my back, so my reply got bounced.]
>>
>> Thanks Thorsten.
>>
>> I did try to revert but it didnt revert cleanly and I don't have the knowledge to fix it up.
>>
>> The patch was part of a merge that included a number of related patches. Tomorrow, I'll try to revert the lot and report
>> back.
>
> You are free to do so, but there is no need for that from my side. I
> only wanted to know if a simple revert would do the trick; if it
> doesn't, it in my experience often is best to leave things to the
> developers of the code in question,

Sound advice, Thorsten. Way too many conflicts for me to resolve.

> as they know it best and thus have a
> better idea which hidden side effect a more complex revert might have.
>
> Ciao, Thorsten

2023-01-30 01:09:45

by Ben Skeggs

[permalink] [raw]
Subject: Re: linux-6.2-rc4+ hangs on poweroff/reboot: Bisected

On Sat, 28 Jan 2023 at 21:29, Chris Clayton wrote:
>
>
>
> On 28/01/2023 05:42, Linux kernel regression tracking (Thorsten Leemhuis) wrote:
> > On 27.01.23 20:46, Chris Clayton wrote:
> >> [Resend because the mail client on my phone decided to turn HTML on behind my back, so my reply got bounced.]
> >>
> >> Thanks Thorsten.
> >>
> >> I did try to revert but it didnt revert cleanly and I don't have the knowledge to fix it up.
> >>
> >> The patch was part of a merge that included a number of related patches. Tomorrow, I'll try to revert the lot and report
> >> back.
> >
> > You are free to do so, but there is no need for that from my side. I
> > only wanted to know if a simple revert would do the trick; if it
> > doesn't, it in my experience often is best to leave things to the
> > developers of the code in question,
>
> > Sound advice, Thorsten. Way too many conflicts for me to resolve.

Hey,

This is a complete shot-in-the-dark, as I don't see this behaviour on
*any* of my boards. Could you try the attached patch please?

Thanks,
Ben.

>
> as they know it best and thus have a
> > better idea which hidden side effect a more complex revert might have.
> >
> > Ciao, Thorsten


Attachments:
nvdec0-reset.diff (849.00 B)

2023-01-30 20:19:59

by Chris Clayton

[permalink] [raw]
Subject: Re: linux-6.2-rc4+ hangs on poweroff/reboot: Bisected

Thanks, Ben.

On 30/01/2023 01:09, Ben Skeggs wrote:
> On Sat, 28 Jan 2023 at 21:29, Chris Clayton wrote:
>>
>>
>>
>> On 28/01/2023 05:42, Linux kernel regression tracking (Thorsten Leemhuis) wrote:
>>> On 27.01.23 20:46, Chris Clayton wrote:
>>>> [Resend because the mail client on my phone decided to turn HTML on behind my back, so my reply got bounced.]
>>>>
>>>> Thanks Thorsten.
>>>>
>>>> I did try to revert but it didnt revert cleanly and I don't have the knowledge to fix it up.
>>>>
>>>> The patch was part of a merge that included a number of related patches. Tomorrow, I'll try to revert the lot and report
>>>> back.
>>>
>>> You are free to do so, but there is no need for that from my side. I
>>> only wanted to know if a simple revert would do the trick; if it
>>> doesn't, it in my experience often is best to leave things to the
>>> developers of the code in question,
>>
>> Sound advice, Thorsten. Way too many conflicts for me to resolve.
> Hey,
>
> This is a complete shot-in-the-dark, as I don't see this behaviour on
> *any* of my boards. Could you try the attached patch please?

Unfortunately, the patch made no difference.

I've been looking at how the graphics on my laptop are set up, and have a bit of a worry about whether the firmware might
be playing a part in this problem. In order to offload video decoding to the NVidia TU117 GPU, it seems the scrubber
firmware must be available, but as far as I know, that has not been released by NVidia. To get it to work, I followed
what Ubuntu have done, and the scrubber in /lib/firmware/nvidia/tu117/nvdec/ is a symlink to
../../tu116/nvdec/scrubber.bin. That, of course, means that some of the firmware being loaded is for a different card.
I note that processing related to firmware is being changed in the patch. Might my setup be at the root of my
problem?
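
For anyone wanting to inspect a similar setup, here is a minimal sketch (not something posted in this thread) that lists every symlink under a firmware directory together with its target. The function name list_fw_symlinks is made up for illustration, and the directory passed to it is an assumption based on the path mentioned above.

```shell
#!/bin/sh
# Hypothetical helper: print "link -> target" for each symlink below a directory.
# $1 = directory to scan, for example /lib/firmware/nvidia/tu117 (assumed path).
list_fw_symlinks() {
    find "$1" -type l | sort | while IFS= read -r link; do
        # readlink prints the stored target without resolving it further
        printf '%s -> %s\n' "$link" "$(readlink "$link")"
    done
}

# Example call (commented out so that sourcing this file has no side effects):
# list_fw_symlinks /lib/firmware/nvidia/tu117
```

The output shows at a glance which files in one GPU's directory are really provided by another GPU's directory.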

I'll have a fiddle and see what I can work out.

Chris

>
> Thanks,
> Ben.

2023-01-30 23:09:27

by Chris Clayton

[permalink] [raw]
Subject: Re: linux-6.2-rc4+ hangs on poweroff/reboot: Bisected

Hi again.

On 30/01/2023 20:19, Chris Clayton wrote:
> Thanks, Ben.

<snip>

>> Hey,
>>
>> This is a complete shot-in-the-dark, as I don't see this behaviour on
>> *any* of my boards. Could you try the attached patch please?
>
> Unfortunately, the patch made no difference.
>
> I've been looking at how the graphics on my laptop are set up, and have a bit of a worry about whether the firmware might
> be playing a part in this problem. In order to offload video decoding to the NVidia TU117 GPU, it seems the scrubber
> firmware must be available, but as far as I know, that has not been released by NVidia. To get it to work, I followed
> what Ubuntu have done, and the scrubber in /lib/firmware/nvidia/tu117/nvdec/ is a symlink to
> ../../tu116/nvdec/scrubber.bin. That, of course, means that some of the firmware being loaded is for a different card.
> I note that processing related to firmware is being changed in the patch. Might my setup be at the root of my
> problem?
>
> I'll have a fiddle and see what I can work out.
>
> Chris
>
>>
>> Thanks,
>> Ben.
>>
>>>

Well, my fiddling has got my system rebooting and shutting down successfully again. I found that if I delete the symlink
to the scrubber firmware, reboot and shutdown work again. There are, however, a number of other files in the tu117
firmware directory tree that are symlinks to actual files in its tu116 counterpart. So I deleted all of those too.
Unfortunately, the absence of one or more of those symlinks causes Xorg to fail to start. I've reinstated all the links
except scrubber and I now have a system that works as it did until I tried to run a kernel that includes the bad commit
I identified in my bisection. That includes offloading video decoding to the NVidia card, so whatever I read that said
the scrubber firmware was needed seems to have been wrong. I get a new message (nouveau 0000:01:00.0: fb: VPR
locked, but no scrubber binary!), but, hey, we can't have everything.
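
An experiment like the one above can be made reversible by moving a symlink aside rather than deleting it. This is only a sketch: the helper names disable_fw_link and restore_fw_link and the ".disabled" suffix are hypothetical, and it assumes the firmware is read from /lib/firmware at boot rather than from a copy embedded in an initramfs.

```shell
#!/bin/sh
# Hypothetical helpers: park a firmware symlink under a different name and
# put it back later. Renaming acts on the link itself, not on its target.

# $1 = path of the symlink to park, for example
#      /lib/firmware/nvidia/tu117/nvdec/scrubber.bin (assumed path)
disable_fw_link() {
    [ -L "$1" ] || { echo "not a symlink: $1" >&2; return 1; }
    mv -- "$1" "$1.disabled"
}

# $1 = original path of a symlink previously parked by disable_fw_link
restore_fw_link() {
    [ -L "$1.disabled" ] || { echo "nothing to restore for: $1" >&2; return 1; }
    mv -- "$1.disabled" "$1"
}
```

Both helpers refuse to touch regular files, so the real firmware blobs in the other GPU's directory are left alone.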

If you still want to get to the bottom of this, let me know what you need me to provide and I'll do my best. I suspect
you might want to because there will be an awful lot of Ubuntu-based systems out there with that scrubber.bin symlink in
place. On the other hand, it could be quite a while before Ubuntu are deploying 6.2 or later kernels.

Thanks,

Chris

<snip>

2023-01-30 23:28:14

by Ben Skeggs

[permalink] [raw]
Subject: Re: linux-6.2-rc4+ hangs on poweroff/reboot: Bisected

On Tue, 31 Jan 2023 at 09:09, Chris Clayton wrote:
>
> Hi again.
>
> On 30/01/2023 20:19, Chris Clayton wrote:
> > Thanks, Ben.
>
> <snip>
>
> >> Hey,
> >>
> >> This is a complete shot-in-the-dark, as I don't see this behaviour on
> >> *any* of my boards. Could you try the attached patch please?
> >
> > Unfortunately, the patch made no difference.
> >
> > I've been looking at how the graphics on my laptop are set up, and have a bit of a worry about whether the firmware might
> > be playing a part in this problem. In order to offload video decoding to the NVidia TU117 GPU, it seems the scrubber
> > firmware must be available, but as far as I know, that has not been released by NVidia. To get it to work, I followed
> > what Ubuntu have done, and the scrubber in /lib/firmware/nvidia/tu117/nvdec/ is a symlink to
> > ../../tu116/nvdec/scrubber.bin. That, of course, means that some of the firmware being loaded is for a different card.
> > I note that processing related to firmware is being changed in the patch. Might my setup be at the root of my
> > problem?
> >
> > I'll have a fiddle and see what I can work out.
> >
> > Chris
> >
> >>
> >> Thanks,
> >> Ben.
> >>
> >>>
>
> Well, my fiddling has got my system rebooting and shutting down successfully again. I found that if I delete the symlink
> to the scrubber firmware, reboot and shutdown work again. There are, however, a number of other files in the tu117
> firmware directory tree that are symlinks to actual files in its tu116 counterpart. So I deleted all of those too.
> Unfortunately, the absence of one or more of those symlinks causes Xorg to fail to start. I've reinstated all the links
> except scrubber and I now have a system that works as it did until I tried to run a kernel that includes the bad commit
> I identified in my bisection. That includes offloading video decoding to the NVidia card, so whatever I read that said
> the scrubber firmware was needed seems to have been wrong. I get a new message (nouveau 0000:01:00.0: fb: VPR
> locked, but no scrubber binary!), but, hey, we can't have everything.
>
> If you still want to get to the bottom of this, let me know what you need me to provide and I'll do my best. I suspect
> you might want to because there will be an awful lot of Ubuntu-based systems out there with that scrubber.bin symlink in
> place. On the other hand, it could be quite a while before Ubuntu are deploying 6.2 or later kernels.

The symlinks are correct - whole groups of GPUs share the same FW, and
we use symlinks in linux-firmware to represent this.

I don't really have any ideas how/why this patch causes issues with
shutdown - it's a path that only gets executed during initialisation.
Can you try and capture the kernel log during shutdown ("dmesg -w"
over ssh? netconsole?), and see if there's any relevant messages
providing a hint at what's going on? Alternatively, you could try
unloading the module (you will have to stop X/wayland/gdm/etc/etc
first) and seeing if that hangs too.
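
For the netconsole route suggested above, a minimal sketch of the setup might look like the following. The ports, IP addresses, interface name and MAC address are hypothetical placeholders, not values from this thread; the parameter layout is the one described in the kernel's netconsole documentation. The script only prints the modprobe command rather than running it, since loading the module needs root.

```shell
#!/bin/sh
# Build the netconsole= module parameter, whose layout is:
#   [src-port]@[src-ip]/[dev],[tgt-port]@<tgt-ip>/[tgt-macaddr]
netconsole_param() {
    # $1 src-port, $2 src-ip, $3 dev, $4 tgt-port, $5 tgt-ip, $6 tgt-mac
    printf '%s@%s/%s,%s@%s/%s\n' "$1" "$2" "$3" "$4" "$5" "$6"
}

# All values below are placeholders for illustration only.
PARAM=$(netconsole_param 6665 192.168.1.10 eth0 6666 192.168.1.20 aa:bb:cc:dd:ee:ff)

# Print the command to run as root on the machine that hangs.
echo "modprobe netconsole netconsole=$PARAM"
```

The receiving machine then needs a UDP listener on the target port (netcat can do this, though its listen options differ between variants), and the network on the sending machine has to stay up until the very end of shutdown for the last messages to arrive.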

Ben.

>
> Thanks,
>
> Chris
>
> <snip>

2023-02-01 13:51:15

by Chris Clayton

Subject: Re: linux-6.2-rc4+ hangs on poweroff/reboot: Bisected



On 30/01/2023 23:27, Ben Skeggs wrote:
> On Tue, 31 Jan 2023 at 09:09, Chris Clayton <[email protected]> wrote:
>>
>> Hi again.
>>
>> On 30/01/2023 20:19, Chris Clayton wrote:
>>> Thanks, Ben.
>>
>> <snip>
>>
>>>> Hey,
>>>>
>>>> This is a complete shot-in-the-dark, as I don't see this behaviour on
>>>> *any* of my boards. Could you try the attached patch please?
>>>
>>> Unfortunately, the patch made no difference.
>>>
>>> I've been looking at how the graphics on my laptop is set up, and have a bit of a worry about whether the firmware might
>>> be playing a part in this problem. In order to offload video decoding to the NVidia TU117 GPU, it seems the scrubber
>>> firmware must be available, but as far as I know,that has not been released by NVidia. To get it to work, I followed
>>> what ubuntu have done and the scrubber in /lib/firmware/nvidia/tu117/nvdec/ is a symlink to
>>> ../../tu116/nvdev/scrubber.bin. That, of course, means that some of the firmware loaded is for a different card is being
>>> loaded. I note that processing related to firmware is being changed in the patch. Might my set up be at the root of my
>>> problem?
>>>
>>> I'll have a fiddle an see what I can work out.
>>>
>>> Chris
>>>
>>>>
>>>> Thanks,
>>>> Ben.
>>>>
>>>>>
>>
>> Well, my fiddling has got my system rebooting and shutting down successfully again. I found that if I delete the symlink
>> to the scrubber firmware, reboot and shutdown work again. There are however, a number of other files in the tu117
>> firmware directory tree that that are symlinks to actual files in its tu116 counterpart. So I deleted all of those too.
>> Unfortunately, the absence of one or more of those symlinks causes Xorg to fail to start. I've reinstated all the links
>> except scrubber and I now have a system that works as it did until I tried to run a kernel that includes the bad commit
>> I identified in my bisection. That includes offloading video decoding to the NVidia card, so what ever I read that said
>> the scrubber firmware was needed seems to have been wrong. I get a new message that (nouveau 0000:01:00.0: fb: VPR
>> locked, but no scrubber binary!), but, hey, we can't have everything.
>>
>> If you still want to get to the bottom of this, let me know what you need me to provide and I'll do my best. I suspect
>> you might want to because there will a n awful lot of Ubuntu-based systems out there with that scrubber.bin symlink in
>> place. On the other hand,m it could but quite a while before ubuntu are deploying 6.2 or later kernels.
> The symlinks are correct - whole groups of GPUs share the same FW, and
> we use symlinks in linux-firmware to represent this.
>
> I don't really have any ideas how/why this patch causes issues with
> shutdown - it's a path that only gets executed during initialisation.
> Can you try and capture the kernel log during shutdown ("dmesg -w"
> over ssh? netconsole?), and see if there's any relevant messages
> providing a hint at what's going on? Alternatively, you could try
> unloading the module (you will have to stop X/wayland/gdm/etc/etc
> first) and seeing if that hangs too.
>
> Ben.

Sorry for the delay - I've been learning about netconsole and netcat. However, I had no success with ssh, and netconsole
produced a log with nothing unusual in it.

Simply stopping Xorg and removing the nouveau module succeeds.

So, I rebuilt rc6+ after a pull from Linus' tree this morning and set the nouveau debug level to 7. I then booted to a
console before doing a reboot (with Ctrl+Alt+Del). As expected, the machine locked up just before it would ordinarily
restart. The last few lines on the console might be helpful:

...
nouveau 0000:01:00:0 fifo: preinit running...
nouveau 0000:01:00:0 fifo: preinit completed in 4us
nouveau 0000:01:00:0 gr: preinit running...
nouveau 0000:01:00:0 gr: preinit completed in 0us
nouveau 0000:01:00:0 nvdec0: preinit running...
nouveau 0000:01:00:0 nvdec0: preinit completed in 0us
nouveau 0000:01:00:0 nvdec0: preinit running...
nouveau 0000:01:00:0 nvdec0: preinit completed in 0us
nouveau 0000:01:00:0 sec2: preinit running...
nouveau 0000:01:00:0 sec2: preinit completed in 0us
nouveau 0000:01:00:0 fb: VPR locked, running scrubber binary

These messages appear after the "sd 4:0:0:0 [sda] Stopping disk" I reported in my initial email.

After the "running scrubber" line appears, the machine is locked up and I have to hold down the power button to recover. I
get the same outcome from running "halt -dip", "poweroff -di" and "shutdown -h -P now". I guess it's no surprise that
all three result in the same outcome, because invocations of halt, poweroff and reboot (without the -f argument) from a
runlevel other than 0 result in shutdown being run. Switching to runlevel 0 with "telinit 0" results in the same
messages from nouveau followed by the lockup.

Let me know if you need any additional diagnostics.

Chris

>
>>
>> Thanks,
>>
>> Chris
>>
>> <snip>

2023-02-08 08:48:34

by Chris Clayton

Subject: Re: linux-6.2-rc4+ hangs on poweroff/reboot: Bisected

Hi.

I'm assuming that we are not going to see a fix for this regression before 6.2 is released. Consequently, I've
implemented a (very simple) workaround. All that happens is that in the (sysv) init script that starts and stops SDDM,
the nouveau module is removed once SDDM is stopped. With that in place, my system no longer freezes on reboot or poweroff.
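
A workaround of this kind could be sketched roughly as follows. This is a hypothetical excerpt, not the actual script from this system: the way SDDM is stopped and the RUN dry-run variable are assumptions made for illustration. Only the idea of unloading nouveau after the display manager has stopped comes from the description above.

```shell
#!/bin/sh
# Hypothetical excerpt of a SysV init script for SDDM.
# Set RUN=echo to print the commands instead of executing them.
RUN=${RUN:-}

stop_sddm() {
    # Stop the display manager (assumed mechanism; real scripts vary).
    $RUN killall sddm
    # Workaround: unload nouveau once nothing is using the GPU, so the
    # driver is no longer loaded when the final shutdown steps run.
    $RUN modprobe -r nouveau
}

case "${1:-}" in
    stop) stop_sddm ;;
esac
```

Running the script with RUN=echo and the stop argument shows the two commands without touching the system, which is a cheap way to check the ordering before relying on it at shutdown.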

Let me know if I can provide any additional diagnostics, although with the problem seemingly occurring so late in the
shutdown process, I may need help on how to go about capturing them.

Chris

On 02/02/2023 20:45, Chris Clayton wrote:
>
>
> On 01/02/2023 13:51, Chris Clayton wrote:
>>
>>
>> On 30/01/2023 23:27, Ben Skeggs wrote:
>>> On Tue, 31 Jan 2023 at 09:09, Chris Clayton <[email protected]> wrote:
>>>>
>>>> Hi again.
>>>>
>>>> On 30/01/2023 20:19, Chris Clayton wrote:
>>>>> Thanks, Ben.
>>>>
>>>> <snip>
>>>>
>>>>>> Hey,
>>>>>>
>>>>>> This is a complete shot-in-the-dark, as I don't see this behaviour on
>>>>>> *any* of my boards. Could you try the attached patch please?
>>>>>
>>>>> Unfortunately, the patch made no difference.
>>>>>
>>>>> I've been looking at how the graphics on my laptop is set up, and have a bit of a worry about whether the firmware might
>>>>> be playing a part in this problem. In order to offload video decoding to the NVidia TU117 GPU, it seems the scrubber
>>>>> firmware must be available, but as far as I know,that has not been released by NVidia. To get it to work, I followed
>>>>> what ubuntu have done and the scrubber in /lib/firmware/nvidia/tu117/nvdec/ is a symlink to
>>>>> ../../tu116/nvdev/scrubber.bin. That, of course, means that some of the firmware loaded is for a different card is being
>>>>> loaded. I note that processing related to firmware is being changed in the patch. Might my set up be at the root of my
>>>>> problem?
>>>>>
>>>>> I'll have a fiddle an see what I can work out.
>>>>>
>>>>> Chris
>>>>>
>>>>>>
>>>>>> Thanks,
>>>>>> Ben.
>>>>>>
>>>>>>>
>>>>
>>>> Well, my fiddling has got my system rebooting and shutting down successfully again. I found that if I delete the symlink
>>>> to the scrubber firmware, reboot and shutdown work again. There are however, a number of other files in the tu117
>>>> firmware directory tree that that are symlinks to actual files in its tu116 counterpart. So I deleted all of those too.
>>>> Unfortunately, the absence of one or more of those symlinks causes Xorg to fail to start. I've reinstated all the links
>>>> except scrubber and I now have a system that works as it did until I tried to run a kernel that includes the bad commit
>>>> I identified in my bisection. That includes offloading video decoding to the NVidia card, so what ever I read that said
>>>> the scrubber firmware was needed seems to have been wrong. I get a new message that (nouveau 0000:01:00.0: fb: VPR
>>>> locked, but no scrubber binary!), but, hey, we can't have everything.
>>>>
>>>> If you still want to get to the bottom of this, let me know what you need me to provide and I'll do my best. I suspect
>>>> you might want to because there will a n awful lot of Ubuntu-based systems out there with that scrubber.bin symlink in
>>>> place. On the other hand,m it could but quite a while before ubuntu are deploying 6.2 or later kernels.
>>> The symlinks are correct - whole groups of GPUs share the same FW, and
>>> we use symlinks in linux-firmware to represent this.
>>>
>>> I don't really have any ideas how/why this patch causes issues with
>>> shutdown - it's a path that only gets executed during initialisation.
>>> Can you try and capture the kernel log during shutdown ("dmesg -w"
>>> over ssh? netconsole?), and see if there's any relevant messages
>>> providing a hint at what's going on? Alternatively, you could try
>>> unloading the module (you will have to stop X/wayland/gdm/etc/etc
>>> first) and seeing if that hangs too.
>>>
>>> Ben.
>>
>> Sorry for the delay - I've been learning about netconsole and netcat. However, I had no success with ssh and netconsole
>> produced a log with nothing unusual in it.
>>
>> Simply stopping Xorg and removing the nouveau module succeeds.
>>
>> So, I rebuilt rc6+ after a pull from linus' tree this morning and set the nouveau debug level to 7. I then booted to a
>> console before doing a reboot (with Ctl+Alt+Del). As expected the machine locked up just before it would ordinarily
>> restart. The last few lines on the console might be helpful:
>>
>> ...
>> nouveau 0000:01:00:0 fifo: preinit running...
>> nouveau 0000:01:00:0 fifo: preinit completed in 4us
>> nouveau 0000:01:00:0 gr: preinit running...
>> nouveau 0000:01:00:0 gr: preinit completed in 0us
>> nouveau 0000:01:00:0 nvdec0: preinit running...
>> nouveau 0000:01:00:0 nvdec0: preinit completed in 0us
>> nouveau 0000:01:00:0 nvdec0: preinit running...
>> nouveau 0000:01:00:0 nvdec0: preinit completed in 0us
>> nouveau 0000:01:00:0 sec2: preinit running...
>> nouveau 0000:01:00:0 sec2: preinit completed in 0us
>> nouveau 0000:01:00:0 fb:.VPR locked, running scrubber binary
>>
>> These messages appear after the "sd 4:0:0:0 [sda] Stopping disk" I reported in my initial email.
>>
>> After the "running scrubber" line appears the machine is locked and I have to hold down the power button to recover. I
>> get the same outcome from running "halt -dip", "poweroff -di" and "shutdown -h -P now". I guess it's no surprise that
>> all three result in the same outcome because invocations halt, poweroff and reboot (without the -f argument)from a
>> runlevel other than 0 resukt in shutdown being run. switching to runlevel 0 with "telenit 0" results in the same
>> messages from nouveau followed by the lockup.
>>
>> Let me know if you need any additional diagnostics.
>>
>> Chris
>>
>
> I've done some more investigation and found that I hadn't done sufficient amemdment the scripts run at shutdown to
> prevent the network being shutdown. I've now got netconsole captures for 6.2.0-rc6+
> (9f266ccaa2f5228bfe67ad58a94ca4e0109b954a) and, for comparison, 6.1.9. These two logs are attached.
>
> Chris
>
>>>
>>>>
>>>> Thanks,
>>>>
>>>> Chris
>>>>
>>>> <snip>

2023-02-02 20:46:11

by Chris Clayton

Subject: Re: linux-6.2-rc4+ hangs on poweroff/reboot: Bisected



On 01/02/2023 13:51, Chris Clayton wrote:
>
>
> On 30/01/2023 23:27, Ben Skeggs wrote:
>> On Tue, 31 Jan 2023 at 09:09, Chris Clayton <[email protected]> wrote:
>>>
>>> Hi again.
>>>
>>> On 30/01/2023 20:19, Chris Clayton wrote:
>>>> Thanks, Ben.
>>>
>>> <snip>
>>>
>>>>> Hey,
>>>>>
>>>>> This is a complete shot-in-the-dark, as I don't see this behaviour on
>>>>> *any* of my boards. Could you try the attached patch please?
>>>>
>>>> Unfortunately, the patch made no difference.
>>>>
>>>> I've been looking at how the graphics on my laptop is set up, and have a bit of a worry about whether the firmware might
>>>> be playing a part in this problem. In order to offload video decoding to the NVidia TU117 GPU, it seems the scrubber
>>>> firmware must be available, but as far as I know,that has not been released by NVidia. To get it to work, I followed
>>>> what ubuntu have done and the scrubber in /lib/firmware/nvidia/tu117/nvdec/ is a symlink to
>>>> ../../tu116/nvdev/scrubber.bin. That, of course, means that some of the firmware loaded is for a different card is being
>>>> loaded. I note that processing related to firmware is being changed in the patch. Might my set up be at the root of my
>>>> problem?
>>>>
>>>> I'll have a fiddle an see what I can work out.
>>>>
>>>> Chris
>>>>
>>>>>
>>>>> Thanks,
>>>>> Ben.
>>>>>
>>>>>>
>>>
>>> Well, my fiddling has got my system rebooting and shutting down successfully again. I found that if I delete the symlink
>>> to the scrubber firmware, reboot and shutdown work again. There are however, a number of other files in the tu117
>>> firmware directory tree that that are symlinks to actual files in its tu116 counterpart. So I deleted all of those too.
>>> Unfortunately, the absence of one or more of those symlinks causes Xorg to fail to start. I've reinstated all the links
>>> except scrubber and I now have a system that works as it did until I tried to run a kernel that includes the bad commit
>>> I identified in my bisection. That includes offloading video decoding to the NVidia card, so what ever I read that said
>>> the scrubber firmware was needed seems to have been wrong. I get a new message that (nouveau 0000:01:00.0: fb: VPR
>>> locked, but no scrubber binary!), but, hey, we can't have everything.
>>>
>>> If you still want to get to the bottom of this, let me know what you need me to provide and I'll do my best. I suspect
>>> you might want to because there will a n awful lot of Ubuntu-based systems out there with that scrubber.bin symlink in
>>> place. On the other hand,m it could but quite a while before ubuntu are deploying 6.2 or later kernels.
>> The symlinks are correct - whole groups of GPUs share the same FW, and
>> we use symlinks in linux-firmware to represent this.
>>
>> I don't really have any ideas how/why this patch causes issues with
>> shutdown - it's a path that only gets executed during initialisation.
>> Can you try and capture the kernel log during shutdown ("dmesg -w"
>> over ssh? netconsole?), and see if there's any relevant messages
>> providing a hint at what's going on? Alternatively, you could try
>> unloading the module (you will have to stop X/wayland/gdm/etc/etc
>> first) and seeing if that hangs too.
>>
>> Ben.
>
> Sorry for the delay - I've been learning about netconsole and netcat. However, I had no success with ssh and netconsole
> produced a log with nothing unusual in it.
>
> Simply stopping Xorg and removing the nouveau module succeeds.
>
> So, I rebuilt rc6+ after a pull from linus' tree this morning and set the nouveau debug level to 7. I then booted to a
> console before doing a reboot (with Ctl+Alt+Del). As expected the machine locked up just before it would ordinarily
> restart. The last few lines on the console might be helpful:
>
> ...
> nouveau 0000:01:00:0 fifo: preinit running...
> nouveau 0000:01:00:0 fifo: preinit completed in 4us
> nouveau 0000:01:00:0 gr: preinit running...
> nouveau 0000:01:00:0 gr: preinit completed in 0us
> nouveau 0000:01:00:0 nvdec0: preinit running...
> nouveau 0000:01:00:0 nvdec0: preinit completed in 0us
> nouveau 0000:01:00:0 nvdec0: preinit running...
> nouveau 0000:01:00:0 nvdec0: preinit completed in 0us
> nouveau 0000:01:00:0 sec2: preinit running...
> nouveau 0000:01:00:0 sec2: preinit completed in 0us
> nouveau 0000:01:00:0 fb:.VPR locked, running scrubber binary
>
> These messages appear after the "sd 4:0:0:0 [sda] Stopping disk" I reported in my initial email.
>
> After the "running scrubber" line appears the machine is locked and I have to hold down the power button to recover. I
> get the same outcome from running "halt -dip", "poweroff -di" and "shutdown -h -P now". I guess it's no surprise that
> all three result in the same outcome because invocations halt, poweroff and reboot (without the -f argument)from a
> runlevel other than 0 resukt in shutdown being run. switching to runlevel 0 with "telenit 0" results in the same
> messages from nouveau followed by the lockup.
>
> Let me know if you need any additional diagnostics.
>
> Chris
>

I've done some more investigation and found that I hadn't sufficiently amended the scripts run at shutdown to
prevent the network being shut down. I've now got netconsole captures for 6.2.0-rc6+
(9f266ccaa2f5228bfe67ad58a94ca4e0109b954a) and, for comparison, 6.1.9. These two logs are attached.

Chris

>>
>>>
>>> Thanks,
>>>
>>> Chris
>>>
>>> <snip>


Attachments:
netconsole-6.1.9.log (432.58 kB)
netconsole-6.2.0-rc6+.log (604.93 kB)
Subject: Re: linux-6.2-rc4+ hangs on poweroff/reboot: Bisected

On 08.02.23 09:48, Chris Clayton wrote:
>
> I'm assuming that we are not going to see a fix for this regression before 6.2 is released.

Yeah, looks like it. That's unfortunate, but it happens. But there is still
time to fix it and there is one thing I wonder:

Did any of the nouveau developers look at the netconsole captures Chris
posted more than a week ago to check if they somehow help to track down
the root of this problem?

Ciao, Thorsten (wearing his 'the Linux kernel's regression tracker' hat)
--
Everything you wanna know about Linux kernel regression tracking:
https://linux-regtracking.leemhuis.info/about/#tldr
If I did something stupid, please tell me, as explained on that page.

> Consequently, I've
> implemented a (very simple) workaround. All that happens is that in the (sysv) init script that starts and stops SDDM,
> the nouveau module is removed once SDDM is stopped. With that in place, my system no longer freezes on reboot or poweroff.
>
> Let me know if I can provide any additional diagnostics although, with the problem seemingly occurring so late in the
> shutdown process, I may need help on how to go about capturing.
>
> Chris
>
> On 02/02/2023 20:45, Chris Clayton wrote:
>>
>>
>> On 01/02/2023 13:51, Chris Clayton wrote:
>>>
>>>
>>> On 30/01/2023 23:27, Ben Skeggs wrote:
>>>> On Tue, 31 Jan 2023 at 09:09, Chris Clayton <[email protected]> wrote:
>>>>>
>>>>> Hi again.
>>>>>
>>>>> On 30/01/2023 20:19, Chris Clayton wrote:
>>>>>> Thanks, Ben.
>>>>>
>>>>> <snip>
>>>>>
>>>>>>> Hey,
>>>>>>>
>>>>>>> This is a complete shot-in-the-dark, as I don't see this behaviour on
>>>>>>> *any* of my boards. Could you try the attached patch please?
>>>>>>
>>>>>> Unfortunately, the patch made no difference.
>>>>>>
>>>>>> I've been looking at how the graphics on my laptop is set up, and have a bit of a worry about whether the firmware might
>>>>>> be playing a part in this problem. In order to offload video decoding to the NVidia TU117 GPU, it seems the scrubber
>>>>>> firmware must be available, but as far as I know,that has not been released by NVidia. To get it to work, I followed
>>>>>> what ubuntu have done and the scrubber in /lib/firmware/nvidia/tu117/nvdec/ is a symlink to
>>>>>> ../../tu116/nvdev/scrubber.bin. That, of course, means that some of the firmware loaded is for a different card is being
>>>>>> loaded. I note that processing related to firmware is being changed in the patch. Might my set up be at the root of my
>>>>>> problem?
>>>>>>
>>>>>> I'll have a fiddle an see what I can work out.
>>>>>>
>>>>>> Chris
>>>>>>
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Ben.
>>>>>>>
>>>>>>>>
>>>>>
>>>>> Well, my fiddling has got my system rebooting and shutting down successfully again. I found that if I delete the symlink
>>>>> to the scrubber firmware, reboot and shutdown work again. There are however, a number of other files in the tu117
>>>>> firmware directory tree that that are symlinks to actual files in its tu116 counterpart. So I deleted all of those too.
>>>>> Unfortunately, the absence of one or more of those symlinks causes Xorg to fail to start. I've reinstated all the links
>>>>> except scrubber and I now have a system that works as it did until I tried to run a kernel that includes the bad commit
>>>>> I identified in my bisection. That includes offloading video decoding to the NVidia card, so what ever I read that said
>>>>> the scrubber firmware was needed seems to have been wrong. I get a new message that (nouveau 0000:01:00.0: fb: VPR
>>>>> locked, but no scrubber binary!), but, hey, we can't have everything.
>>>>>
>>>>> If you still want to get to the bottom of this, let me know what you need me to provide and I'll do my best. I suspect
>>>>> you might want to because there will a n awful lot of Ubuntu-based systems out there with that scrubber.bin symlink in
>>>>> place. On the other hand,m it could but quite a while before ubuntu are deploying 6.2 or later kernels.
>>>> The symlinks are correct - whole groups of GPUs share the same FW, and
>>>> we use symlinks in linux-firmware to represent this.
>>>>
>>>> I don't really have any ideas how/why this patch causes issues with
>>>> shutdown - it's a path that only gets executed during initialisation.
>>>> Can you try and capture the kernel log during shutdown ("dmesg -w"
>>>> over ssh? netconsole?), and see if there's any relevant messages
>>>> providing a hint at what's going on? Alternatively, you could try
>>>> unloading the module (you will have to stop X/wayland/gdm/etc/etc
>>>> first) and seeing if that hangs too.
>>>>
>>>> Ben.
>>>
>>> Sorry for the delay - I've been learning about netconsole and netcat. However, I had no success with ssh and netconsole
>>> produced a log with nothing unusual in it.
>>>
>>> Simply stopping Xorg and removing the nouveau module succeeds.
>>>
>>> So, I rebuilt rc6+ after a pull from linus' tree this morning and set the nouveau debug level to 7. I then booted to a
>>> console before doing a reboot (with Ctl+Alt+Del). As expected the machine locked up just before it would ordinarily
>>> restart. The last few lines on the console might be helpful:
>>>
>>> ...
>>> nouveau 0000:01:00:0 fifo: preinit running...
>>> nouveau 0000:01:00:0 fifo: preinit completed in 4us
>>> nouveau 0000:01:00:0 gr: preinit running...
>>> nouveau 0000:01:00:0 gr: preinit completed in 0us
>>> nouveau 0000:01:00:0 nvdec0: preinit running...
>>> nouveau 0000:01:00:0 nvdec0: preinit completed in 0us
>>> nouveau 0000:01:00:0 nvdec0: preinit running...
>>> nouveau 0000:01:00:0 nvdec0: preinit completed in 0us
>>> nouveau 0000:01:00:0 sec2: preinit running...
>>> nouveau 0000:01:00:0 sec2: preinit completed in 0us
>>> nouveau 0000:01:00:0 fb:.VPR locked, running scrubber binary
>>>
>>> These messages appear after the "sd 4:0:0:0 [sda] Stopping disk" I reported in my initial email.
>>>
>>> After the "running scrubber" line appears the machine is locked and I have to hold down the power button to recover. I
>>> get the same outcome from running "halt -dip", "poweroff -di" and "shutdown -h -P now". I guess it's no surprise that
>>> all three result in the same outcome because invocations halt, poweroff and reboot (without the -f argument)from a
>>> runlevel other than 0 resukt in shutdown being run. switching to runlevel 0 with "telenit 0" results in the same
>>> messages from nouveau followed by the lockup.
>>>
>>> Let me know if you need any additional diagnostics.
>>>
>>> Chris
>>>
>>
>> I've done some more investigation and found that I hadn't done sufficient amemdment the scripts run at shutdown to
>> prevent the network being shutdown. I've now got netconsole captures for 6.2.0-rc6+
>> (9f266ccaa2f5228bfe67ad58a94ca4e0109b954a) and, for comparison, 6.1.9. These two logs are attached.
>>
>> Chris
>>
>>>>
>>>>>
>>>>> Thanks,
>>>>>
>>>>> Chris
>>>>>
>>>>> <snip>
>
>

2023-02-10 19:02:33

by Karol Herbst

Subject: Re: linux-6.2-rc4+ hangs on poweroff/reboot: Bisected

On Fri, Feb 10, 2023 at 7:35 PM Linux regression tracking (Thorsten
Leemhuis) <[email protected]> wrote:
>
> On 08.02.23 09:48, Chris Clayton wrote:
> >
> > I'm assuming that we are not going to see a fix for this regression before 6.2 is released.
>
> Yeah, looks like it. That's unfortunate, but happens. But there is still
> time to fix it and there is one thing I wonder:
>
> Did any of the nouveau developers look at the netconsole captures Chris
> posted more than a week ago to check if they somehow help to track down
> the root of this problem?
>

I did now and I can't spot anything. I think at this point it would
make sense to dump the active tasks/threads via sysrq keys to see if
any is in a weird state preventing the machine from shutting down.
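
As a rough sketch of what that could look like from a shell: the helper name sysrq_send is made up for illustration, while /proc/sysrq-trigger and the t (dump all tasks) and w (dump blocked tasks) commands are standard kernel SysRq facilities. The output goes to the kernel log, so it would still need to be captured, e.g. via netconsole.

```shell
#!/bin/sh
# Write each SysRq command character to a trigger file, one at a time.
# Pass /proc/sysrq-trigger for real use, or any other path for a dry run.
sysrq_send() {
    trigger=$1
    shift
    for key in "$@"; do
        printf '%s' "$key" > "$trigger"
    done
}

# Needs root. Uncomment to run on the affected machine:
# echo 1 > /proc/sys/kernel/sysrq      # allow all SysRq functions from the keyboard
# sysrq_send /proc/sysrq-trigger t w   # t: all tasks, w: blocked (D state) tasks
```

Since the hang happens at the very end of shutdown, a shell may no longer be available at that point; with the sysctl enabled beforehand, the same dumps can be requested from the keyboard with Alt+SysRq+t and Alt+SysRq+w, provided the kernel is still handling keyboard interrupts.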

> Ciao, Thorsten (wearing his 'the Linux kernel's regression tracker' hat)
> --
> Everything you wanna know about Linux kernel regression tracking:
> https://linux-regtracking.leemhuis.info/about/#tldr
> If I did something stupid, please tell me, as explained on that page.
>
> > Consequently, I've
> > implemented a (very simple) workaround. All that happens is that in the (sysv) init script that starts and stops SDDM,
> > the nouveau module is removed once SDDM is stopped. With that in place, my system no longer freezes on reboot or poweroff.
> >
> > Let me know if I can provide any additional diagnostics although, with the problem seemingly occurring so late in the
> > shutdown process, I may need help on how to go about capturing.
> >
> > Chris
> >
> > On 02/02/2023 20:45, Chris Clayton wrote:
> >>
> >>
> >> On 01/02/2023 13:51, Chris Clayton wrote:
> >>>
> >>>
> >>> On 30/01/2023 23:27, Ben Skeggs wrote:
> >>>> On Tue, 31 Jan 2023 at 09:09, Chris Clayton <[email protected]> wrote:
> >>>>>
> >>>>> Hi again.
> >>>>>
> >>>>> On 30/01/2023 20:19, Chris Clayton wrote:
> >>>>>> Thanks, Ben.
> >>>>>
> >>>>> <snip>
> >>>>>
> >>>>>>> Hey,
> >>>>>>>
> >>>>>>> This is a complete shot-in-the-dark, as I don't see this behaviour on
> >>>>>>> *any* of my boards. Could you try the attached patch please?
> >>>>>>
> >>>>>> Unfortunately, the patch made no difference.
> >>>>>>
> >>>>>> I've been looking at how the graphics on my laptop is set up, and have a bit of a worry about whether the firmware might
> >>>>>> be playing a part in this problem. In order to offload video decoding to the NVidia TU117 GPU, it seems the scrubber
> >>>>>> firmware must be available, but as far as I know,that has not been released by NVidia. To get it to work, I followed
> >>>>>> what ubuntu have done and the scrubber in /lib/firmware/nvidia/tu117/nvdec/ is a symlink to
> >>>>>> ../../tu116/nvdev/scrubber.bin. That, of course, means that some of the firmware loaded is for a different card is being
> >>>>>> loaded. I note that processing related to firmware is being changed in the patch. Might my set up be at the root of my
> >>>>>> problem?
> >>>>>>
> >>>>>> I'll have a fiddle an see what I can work out.
> >>>>>>
> >>>>>> Chris
> >>>>>>
> >>>>>>>
> >>>>>>> Thanks,
> >>>>>>> Ben.
> >>>>>>>
> >>>>>>>>
> >>>>>
> >>>>> Well, my fiddling has got my system rebooting and shutting down successfully again. I found that if I delete the symlink
> >>>>> to the scrubber firmware, reboot and shutdown work again. There are however, a number of other files in the tu117
> >>>>> firmware directory tree that that are symlinks to actual files in its tu116 counterpart. So I deleted all of those too.
> >>>>> Unfortunately, the absence of one or more of those symlinks causes Xorg to fail to start. I've reinstated all the links
> >>>>> except scrubber and I now have a system that works as it did until I tried to run a kernel that includes the bad commit
> >>>>> I identified in my bisection. That includes offloading video decoding to the NVidia card, so what ever I read that said
> >>>>> the scrubber firmware was needed seems to have been wrong. I get a new message that (nouveau 0000:01:00.0: fb: VPR
> >>>>> locked, but no scrubber binary!), but, hey, we can't have everything.
> >>>>>
> >>>>> If you still want to get to the bottom of this, let me know what you need me to provide and I'll do my best. I suspect
> >>>>> you might want to because there will a n awful lot of Ubuntu-based systems out there with that scrubber.bin symlink in
> >>>>> place. On the other hand,m it could but quite a while before ubuntu are deploying 6.2 or later kernels.
> >>>> The symlinks are correct - whole groups of GPUs share the same FW, and
> >>>> we use symlinks in linux-firmware to represent this.
> >>>>
> >>>> I don't really have any ideas how/why this patch causes issues with
> >>>> shutdown - it's a path that only gets executed during initialisation.
> >>>> Can you try and capture the kernel log during shutdown ("dmesg -w"
> >>>> over ssh? netconsole?), and see if there's any relevant messages
> >>>> providing a hint at what's going on? Alternatively, you could try
> >>>> unloading the module (you will have to stop X/wayland/gdm/etc/etc
> >>>> first) and seeing if that hangs too.
> >>>>
> >>>> Ben.
> >>>
> >>> Sorry for the delay - I've been learning about netconsole and netcat. However, I had no success with ssh and netconsole
> >>> produced a log with nothing unusual in it.
> >>>
> >>> Simply stopping Xorg and removing the nouveau module succeeds.
> >>>
> >>> So, I rebuilt rc6+ after a pull from linus' tree this morning and set the nouveau debug level to 7. I then booted to a
> >>> console before doing a reboot (with Ctl+Alt+Del). As expected the machine locked up just before it would ordinarily
> >>> restart. The last few lines on the console might be helpful:
> >>>
> >>> ...
> >>> nouveau 0000:01:00:0 fifo: preinit running...
> >>> nouveau 0000:01:00:0 fifo: preinit completed in 4us
> >>> nouveau 0000:01:00:0 gr: preinit running...
> >>> nouveau 0000:01:00:0 gr: preinit completed in 0us
> >>> nouveau 0000:01:00:0 nvdec0: preinit running...
> >>> nouveau 0000:01:00:0 nvdec0: preinit completed in 0us
> >>> nouveau 0000:01:00:0 nvdec0: preinit running...
> >>> nouveau 0000:01:00:0 nvdec0: preinit completed in 0us
> >>> nouveau 0000:01:00:0 sec2: preinit running...
> >>> nouveau 0000:01:00:0 sec2: preinit completed in 0us
> >>> nouveau 0000:01:00:0 fb:.VPR locked, running scrubber binary
> >>>
> >>> These messages appear after the "sd 4:0:0:0 [sda] Stopping disk" I reported in my initial email.
> >>>
> >>> After the "running scrubber" line appears the machine is locked and I have to hold down the power button to recover. I
> >>> get the same outcome from running "halt -dip", "poweroff -di" and "shutdown -h -P now". I guess it's no surprise that
> >>> all three result in the same outcome because invocations halt, poweroff and reboot (without the -f argument)from a
> >>> runlevel other than 0 resukt in shutdown being run. switching to runlevel 0 with "telenit 0" results in the same
> >>> messages from nouveau followed by the lockup.
> >>>
> >>> Let me know if you need any additional diagnostics.
> >>>
> >>> Chris
> >>>
> >>
> >> I've done some more investigation and found that I hadn't done sufficient amemdment the scripts run at shutdown to
> >> prevent the network being shutdown. I've now got netconsole captures for 6.2.0-rc6+
> >> (9f266ccaa2f5228bfe67ad58a94ca4e0109b954a) and, for comparison, 6.1.9. These two logs are attached.
> >>
> >> Chris
> >>
> >>>>
> >>>>>
> >>>>> Thanks,
> >>>>>
> >>>>> Chris
> >>>>>
> >>>>> <snip>
> >
> >
>


Subject: Re: linux-6.2-rc4+ hangs on poweroff/reboot: Bisected

On 10.02.23 20:01, Karol Herbst wrote:
> On Fri, Feb 10, 2023 at 7:35 PM Linux regression tracking (Thorsten
> Leemhuis) <[email protected]> wrote:
>>
>> On 08.02.23 09:48, Chris Clayton wrote:
>>>
>>> I'm assuming that we are not going to see a fix for this regression before 6.2 is released.
>>
>> Yeah, looks like it. That's unfortunate, but happens. But there is still
>> time to fix it and there is one thing I wonder:
>>
>> Did any of the nouveau developers look at the netconsole captures Chris
>> posted more than a week ago to check if they somehow help to track down
>> the root of this problem?
>
> I did now and I can't spot anything. I think at this point it would
> make sense to dump the active tasks/threads via sysrq keys to see if
> any is in a weird state preventing the machine from shutting down.

Many thx for looking into it!

Ciao, Thorsten

>> Ciao, Thorsten (wearing his 'the Linux kernel's regression tracker' hat)
>> --
>> Everything you wanna know about Linux kernel regression tracking:
>> https://linux-regtracking.leemhuis.info/about/#tldr
>> If I did something stupid, please tell me, as explained on that page.
>>
>>> Consequently, I've
>>> implemented a (very simple) workaround. All that happens is that in the (sysv) init script that starts and stops SDDM,
>>> the nouveau module is removed once SDDM is stopped. With that in place, my system no longer freezes on reboot or poweroff.
>>>
>>> Let me know if I can provide any additional diagnostics although, with the problem seemingly occurring so late in the
>>> shutdown process, I may need help on how to go about capturing.
>>>
>>> Chris
>>>
>>> On 02/02/2023 20:45, Chris Clayton wrote:
>>>>
>>>>
>>>> On 01/02/2023 13:51, Chris Clayton wrote:
>>>>>
>>>>>
>>>>> On 30/01/2023 23:27, Ben Skeggs wrote:
>>>>>> On Tue, 31 Jan 2023 at 09:09, Chris Clayton <[email protected]> wrote:
>>>>>>>
>>>>>>> Hi again.
>>>>>>>
>>>>>>> On 30/01/2023 20:19, Chris Clayton wrote:
>>>>>>>> Thanks, Ben.
>>>>>>>
>>>>>>> <snip>
>>>>>>>
>>>>>>>>> Hey,
>>>>>>>>>
>>>>>>>>> This is a complete shot-in-the-dark, as I don't see this behaviour on
>>>>>>>>> *any* of my boards. Could you try the attached patch please?
>>>>>>>>
>>>>>>>> Unfortunately, the patch made no difference.
>>>>>>>>
>>>>>>>> I've been looking at how the graphics on my laptop is set up, and have a bit of a worry about whether the firmware might
>>>>>>>> be playing a part in this problem. In order to offload video decoding to the NVidia TU117 GPU, it seems the scrubber
>>>>>>>> firmware must be available, but as far as I know, that has not been released by NVidia. To get it to work, I followed
>>>>>>>> what Ubuntu have done and the scrubber in /lib/firmware/nvidia/tu117/nvdec/ is a symlink to
>>>>>>>> ../../tu116/nvdec/scrubber.bin. That, of course, means that some of the firmware being loaded is for a different
>>>>>>>> card. I note that processing related to firmware is being changed in the patch. Might my setup be at the root of my
>>>>>>>> problem?
>>>>>>>>
>>>>>>>> I'll have a fiddle and see what I can work out.
>>>>>>>>
>>>>>>>> Chris
>>>>>>>>
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>> Ben.
>>>>>>>>>
>>>>>>>>>>
>>>>>>>
>>>>>>> Well, my fiddling has got my system rebooting and shutting down successfully again. I found that if I delete the symlink
>>>>>>> to the scrubber firmware, reboot and shutdown work again. There are, however, a number of other files in the tu117
>>>>>>> firmware directory tree that are symlinks to actual files in its tu116 counterpart. So I deleted all of those too.
>>>>>>> Unfortunately, the absence of one or more of those symlinks causes Xorg to fail to start. I've reinstated all the links
>>>>>>> except scrubber and I now have a system that works as it did until I tried to run a kernel that includes the bad commit
>>>>>>> I identified in my bisection. That includes offloading video decoding to the NVidia card, so whatever I read that said
>>>>>>> the scrubber firmware was needed seems to have been wrong. I get a new message (nouveau 0000:01:00.0: fb: VPR
>>>>>>> locked, but no scrubber binary!), but, hey, we can't have everything.
>>>>>>>
>>>>>>> If you still want to get to the bottom of this, let me know what you need me to provide and I'll do my best. I suspect
>>>>>>> you might want to because there will be an awful lot of Ubuntu-based systems out there with that scrubber.bin symlink in
>>>>>>> place. On the other hand, it could be quite a while before Ubuntu are deploying 6.2 or later kernels.
>>>>>> The symlinks are correct - whole groups of GPUs share the same FW, and
>>>>>> we use symlinks in linux-firmware to represent this.
>>>>>>
>>>>>> I don't really have any ideas how/why this patch causes issues with
>>>>>> shutdown - it's a path that only gets executed during initialisation.
>>>>>> Can you try and capture the kernel log during shutdown ("dmesg -w"
>>>>>> over ssh? netconsole?), and see if there's any relevant messages
>>>>>> providing a hint at what's going on? Alternatively, you could try
>>>>>> unloading the module (you will have to stop X/wayland/gdm/etc/etc
>>>>>> first) and seeing if that hangs too.
>>>>>>
>>>>>> Ben.
>>>>>
>>>>> Sorry for the delay - I've been learning about netconsole and netcat. However, I had no success with ssh and netconsole
>>>>> produced a log with nothing unusual in it.
>>>>>
>>>>> Simply stopping Xorg and removing the nouveau module succeeds.
>>>>>
>>>>> So, I rebuilt rc6+ after a pull from linus' tree this morning and set the nouveau debug level to 7. I then booted to a
>>>>> console before doing a reboot (with Ctl+Alt+Del). As expected the machine locked up just before it would ordinarily
>>>>> restart. The last few lines on the console might be helpful:
>>>>>
>>>>> ...
>>>>> nouveau 0000:01:00:0 fifo: preinit running...
>>>>> nouveau 0000:01:00:0 fifo: preinit completed in 4us
>>>>> nouveau 0000:01:00:0 gr: preinit running...
>>>>> nouveau 0000:01:00:0 gr: preinit completed in 0us
>>>>> nouveau 0000:01:00:0 nvdec0: preinit running...
>>>>> nouveau 0000:01:00:0 nvdec0: preinit completed in 0us
>>>>> nouveau 0000:01:00:0 nvdec0: preinit running...
>>>>> nouveau 0000:01:00:0 nvdec0: preinit completed in 0us
>>>>> nouveau 0000:01:00:0 sec2: preinit running...
>>>>> nouveau 0000:01:00:0 sec2: preinit completed in 0us
>>>>> nouveau 0000:01:00:0 fb:.VPR locked, running scrubber binary
>>>>>
>>>>> These messages appear after the "sd 4:0:0:0 [sda] Stopping disk" I reported in my initial email.
>>>>>
>>>>> After the "running scrubber" line appears the machine is locked and I have to hold down the power button to recover. I
>>>>> get the same outcome from running "halt -dip", "poweroff -di" and "shutdown -h -P now". I guess it's no surprise that
>>>>> all three result in the same outcome because invocations of halt, poweroff and reboot (without the -f argument) from a
>>>>> runlevel other than 0 result in shutdown being run. Switching to runlevel 0 with "telinit 0" results in the same
>>>>> messages from nouveau followed by the lockup.
>>>>>
>>>>> Let me know if you need any additional diagnostics.
>>>>>
>>>>> Chris
>>>>>
>>>>
>>>> I've done some more investigation and found that I hadn't sufficiently amended the scripts run at shutdown to
>>>> prevent the network being shut down. I've now got netconsole captures for 6.2.0-rc6+
>>>> (9f266ccaa2f5228bfe67ad58a94ca4e0109b954a) and, for comparison, 6.1.9. These two logs are attached.
>>>>
>>>> Chris
>>>>
>>>>>>
>>>>>>>
>>>>>>> Thanks,
>>>>>>>
>>>>>>> Chris
>>>>>>>
>>>>>>> <snip>
>>>
>>>
>>
>
>
>

2023-02-11 13:43:47

by Chris Clayton

Subject: Re: linux-6.2-rc4+ hangs on poweroff/reboot: Bisected



On 10/02/2023 19:33, Linux regression tracking (Thorsten Leemhuis) wrote:
> On 10.02.23 20:01, Karol Herbst wrote:
>> On Fri, Feb 10, 2023 at 7:35 PM Linux regression tracking (Thorsten
>> Leemhuis) <[email protected]> wrote:
>>>
>>> On 08.02.23 09:48, Chris Clayton wrote:
>>>>
>>>> I'm assuming that we are not going to see a fix for this regression before 6.2 is released.
>>>
>>> Yeah, looks like it. That's unfortunate, but happens. But there is still
>>> time to fix it and there is one thing I wonder:
>>>
>>> Did any of the nouveau developers look at the netconsole captures Chris
>>> posted more than a week ago to check if they somehow help to track down
>>> the root of this problem?
>>
>> I did now and I can't spot anything. I think at this point it would
>> make sense to dump the active tasks/threads via sysrq keys to see if
>> any is in a weird state preventing the machine from shutting down.
>
> Many thx for looking into it!

Yes, thanks Karol.

Attached is the output from dmesg when this block of code:

/bin/mount /dev/sda7 /mnt/sda7
/bin/mountpoint /proc || /bin/mount /proc
/bin/dmesg -w > /mnt/sda7/sysrq.dmesg.log &
/bin/echo t > /proc/sysrq-trigger
/bin/sleep 1
/bin/sync
/bin/sleep 1
kill $(pidof dmesg)
/bin/umount /mnt/sda7

is executed immediately before /sbin/reboot is called as the final step of rebooting my system.

I hope this is what you were looking for, but if not, please let me know what you need.
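
For anyone repeating this capture, the steps above could be wrapped up roughly as in the sketch below. It is not the script that produced the attached log: the function name capture_task_dump and the SYSRQ_TRIGGER variable are made up for illustration, and it assumes the target filesystem and /proc are already mounted. The one deliberate change is that it stops the dmesg it started itself (via $!) rather than every dmesg found by pidof.

```shell
#!/bin/sh
# Sketch only: capture a SysRq-t task dump into a log file.
# SYSRQ_TRIGGER and capture_task_dump are illustrative names.

SYSRQ_TRIGGER="${SYSRQ_TRIGGER:-/proc/sysrq-trigger}"

capture_task_dump() {
    log="$1"

    # Follow the kernel log in the background, as in the block above.
    dmesg -w > "$log" &
    dmesg_pid=$!

    # Writing 't' asks the kernel to dump the state of every task.
    echo t > "$SYSRQ_TRIGGER"

    # Give the dump a moment to reach the log, as the original does.
    sleep 1

    # Stop only the dmesg started here, then flush the log to disk.
    kill "$dmesg_pid" 2>/dev/null || true
    wait "$dmesg_pid" 2>/dev/null || true
    sync
}
```

It would be called as, for example, capture_task_dump /mnt/sda7/sysrq.dmesg.log, between the mount and umount steps.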

Chris

>
> Ciao, Thorsten
>
>>> Ciao, Thorsten (wearing his 'the Linux kernel's regression tracker' hat)
>>> --
>>> Everything you wanna know about Linux kernel regression tracking:
>>> https://linux-regtracking.leemhuis.info/about/#tldr
>>> If I did something stupid, please tell me, as explained on that page.
>>>
>>>> Consequently, I've
>>>> implemented a (very simple) workaround. All that happens is that in the (sysv) init script that starts and stops SDDM,
>>>> the nouveau module is removed once SDDM is stopped. With that in place, my system no longer freezes on reboot or poweroff.
>>>>
>>>> Let me know if I can provide any additional diagnostics although, with the problem seemingly occurring so late in the
>>>> shutdown process, I may need help on how to go about capturing.
>>>>
>>>> Chris
>>>>
>>>> On 02/02/2023 20:45, Chris Clayton wrote:
>>>>>
>>>>>
>>>>> On 01/02/2023 13:51, Chris Clayton wrote:
>>>>>>
>>>>>>
>>>>>> On 30/01/2023 23:27, Ben Skeggs wrote:
>>>>>>> On Tue, 31 Jan 2023 at 09:09, Chris Clayton <[email protected]> wrote:
>>>>>>>>
>>>>>>>> Hi again.
>>>>>>>>
>>>>>>>> On 30/01/2023 20:19, Chris Clayton wrote:
>>>>>>>>> Thanks, Ben.
>>>>>>>>
>>>>>>>> <snip>
>>>>>>>>
>>>>>>>>>> Hey,
>>>>>>>>>>
>>>>>>>>>> This is a complete shot-in-the-dark, as I don't see this behaviour on
>>>>>>>>>> *any* of my boards. Could you try the attached patch please?
>>>>>>>>>
>>>>>>>>> Unfortunately, the patch made no difference.
>>>>>>>>>
>>>>>>>>> I've been looking at how the graphics on my laptop is set up, and have a bit of a worry about whether the firmware might
>>>>>>>>> be playing a part in this problem. In order to offload video decoding to the NVidia TU117 GPU, it seems the scrubber
>>>>>>>>> firmware must be available, but as far as I know, that has not been released by NVidia. To get it to work, I followed
>>>>>>>>> what Ubuntu have done and the scrubber in /lib/firmware/nvidia/tu117/nvdec/ is a symlink to
>>>>>>>>> ../../tu116/nvdec/scrubber.bin. That, of course, means that some of the firmware being loaded is for a different
>>>>>>>>> card. I note that processing related to firmware is being changed in the patch. Might my setup be at the root of my
>>>>>>>>> problem?
>>>>>>>>>
>>>>>>>>> I'll have a fiddle and see what I can work out.
>>>>>>>>>
>>>>>>>>> Chris
>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Thanks,
>>>>>>>>>> Ben.
>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>
>>>>>>>> Well, my fiddling has got my system rebooting and shutting down successfully again. I found that if I delete the symlink
>>>>>>>> to the scrubber firmware, reboot and shutdown work again. There are, however, a number of other files in the tu117
>>>>>>>> firmware directory tree that are symlinks to actual files in its tu116 counterpart. So I deleted all of those too.
>>>>>>>> Unfortunately, the absence of one or more of those symlinks causes Xorg to fail to start. I've reinstated all the links
>>>>>>>> except scrubber and I now have a system that works as it did until I tried to run a kernel that includes the bad commit
>>>>>>>> I identified in my bisection. That includes offloading video decoding to the NVidia card, so whatever I read that said
>>>>>>>> the scrubber firmware was needed seems to have been wrong. I get a new message (nouveau 0000:01:00.0: fb: VPR
>>>>>>>> locked, but no scrubber binary!), but, hey, we can't have everything.
>>>>>>>>
>>>>>>>> If you still want to get to the bottom of this, let me know what you need me to provide and I'll do my best. I suspect
>>>>>>>> you might want to because there will be an awful lot of Ubuntu-based systems out there with that scrubber.bin symlink in
>>>>>>>> place. On the other hand, it could be quite a while before Ubuntu are deploying 6.2 or later kernels.
>>>>>>> The symlinks are correct - whole groups of GPUs share the same FW, and
>>>>>>> we use symlinks in linux-firmware to represent this.
>>>>>>>
>>>>>>> I don't really have any ideas how/why this patch causes issues with
>>>>>>> shutdown - it's a path that only gets executed during initialisation.
>>>>>>> Can you try and capture the kernel log during shutdown ("dmesg -w"
>>>>>>> over ssh? netconsole?), and see if there's any relevant messages
>>>>>>> providing a hint at what's going on? Alternatively, you could try
>>>>>>> unloading the module (you will have to stop X/wayland/gdm/etc/etc
>>>>>>> first) and seeing if that hangs too.
>>>>>>>
>>>>>>> Ben.
>>>>>>
>>>>>> Sorry for the delay - I've been learning about netconsole and netcat. However, I had no success with ssh and netconsole
>>>>>> produced a log with nothing unusual in it.
>>>>>>
>>>>>> Simply stopping Xorg and removing the nouveau module succeeds.
>>>>>>
>>>>>> So, I rebuilt rc6+ after a pull from linus' tree this morning and set the nouveau debug level to 7. I then booted to a
>>>>>> console before doing a reboot (with Ctl+Alt+Del). As expected the machine locked up just before it would ordinarily
>>>>>> restart. The last few lines on the console might be helpful:
>>>>>>
>>>>>> ...
>>>>>> nouveau 0000:01:00:0 fifo: preinit running...
>>>>>> nouveau 0000:01:00:0 fifo: preinit completed in 4us
>>>>>> nouveau 0000:01:00:0 gr: preinit running...
>>>>>> nouveau 0000:01:00:0 gr: preinit completed in 0us
>>>>>> nouveau 0000:01:00:0 nvdec0: preinit running...
>>>>>> nouveau 0000:01:00:0 nvdec0: preinit completed in 0us
>>>>>> nouveau 0000:01:00:0 nvdec0: preinit running...
>>>>>> nouveau 0000:01:00:0 nvdec0: preinit completed in 0us
>>>>>> nouveau 0000:01:00:0 sec2: preinit running...
>>>>>> nouveau 0000:01:00:0 sec2: preinit completed in 0us
>>>>>> nouveau 0000:01:00:0 fb:.VPR locked, running scrubber binary
>>>>>>
>>>>>> These messages appear after the "sd 4:0:0:0 [sda] Stopping disk" I reported in my initial email.
>>>>>>
>>>>>> After the "running scrubber" line appears the machine is locked and I have to hold down the power button to recover. I
>>>>>> get the same outcome from running "halt -dip", "poweroff -di" and "shutdown -h -P now". I guess it's no surprise that
>>>>>> all three result in the same outcome because invocations of halt, poweroff and reboot (without the -f argument) from a
>>>>>> runlevel other than 0 result in shutdown being run. Switching to runlevel 0 with "telinit 0" results in the same
>>>>>> messages from nouveau followed by the lockup.
>>>>>>
>>>>>> Let me know if you need any additional diagnostics.
>>>>>>
>>>>>> Chris
>>>>>>
>>>>>
>>>>> I've done some more investigation and found that I hadn't sufficiently amended the scripts run at shutdown to
>>>>> prevent the network being shut down. I've now got netconsole captures for 6.2.0-rc6+
>>>>> (9f266ccaa2f5228bfe67ad58a94ca4e0109b954a) and, for comparison, 6.1.9. These two logs are attached.
>>>>>
>>>>> Chris
>>>>>
>>>>>>>
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>>
>>>>>>>> Chris
>>>>>>>>
>>>>>>>> <snip>
>>>>
>>>>
>>>
>>
>>
>>


Attachments:
sysrq-t.dmesg.log (214.42 kB)

2023-02-13 02:57:51

by Dave Airlie

Subject: Re: linux-6.2-rc4+ hangs on poweroff/reboot: Bisected

On Sun, 12 Feb 2023 at 00:43, Chris Clayton <[email protected]> wrote:
>
>
>
> On 10/02/2023 19:33, Linux regression tracking (Thorsten Leemhuis) wrote:
> > On 10.02.23 20:01, Karol Herbst wrote:
> >> On Fri, Feb 10, 2023 at 7:35 PM Linux regression tracking (Thorsten
> >> Leemhuis) <[email protected]> wrote:
> >>>
> >>> On 08.02.23 09:48, Chris Clayton wrote:
> >>>>
> >>>> I'm assuming that we are not going to see a fix for this regression before 6.2 is released.
> >>>
> >>> Yeah, looks like it. That's unfortunate, but happens. But there is still
> >>> time to fix it and there is one thing I wonder:
> >>>
> >>> Did any of the nouveau developers look at the netconsole captures Chris
> >>> posted more than a week ago to check if they somehow help to track down
> >>> the root of this problem?
> >>
> >> I did now and I can't spot anything. I think at this point it would
> >> make sense to dump the active tasks/threads via sysrq keys to see if
> >> any is in a weird state preventing the machine from shutting down.
> >
> > Many thx for looking into it!
>
> Yes, thanks Karol.
>
> Attached is the output from dmesg when this block of code:
>
> /bin/mount /dev/sda7 /mnt/sda7
> /bin/mountpoint /proc || /bin/mount /proc
> /bin/dmesg -w > /mnt/sda7/sysrq.dmesg.log &
> /bin/echo t > /proc/sysrq-trigger
> /bin/sleep 1
> /bin/sync
> /bin/sleep 1
> kill $(pidof dmesg)
> /bin/umount /mnt/sda7
>
> is executed immediately before /sbin/reboot is called as the final step of rebooting my system.
>
> I hope this is what you were looking for, but if not, please let me know what you need

Another shot in the dark, but does nouveau.runpm=0 help at all?
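
For reference, a rough sketch of how runpm=0 might be tried. It assumes nouveau is built as a module; the helper names and the file name nouveau-runpm.conf are invented for the example. For a one-off test, nouveau.runpm=0 can instead be appended to the kernel command line from the boot loader.

```shell
#!/bin/sh
# Sketch only: persistently disable nouveau runtime power management.
# write_runpm_conf, show_runpm and nouveau-runpm.conf are made-up names.

# Write a modprobe.d entry so the option applies when the module loads.
write_runpm_conf() {
    conf_dir="${1:-/etc/modprobe.d}"
    printf 'options nouveau runpm=0\n' > "$conf_dir/nouveau-runpm.conf"
}

# Print the value the loaded module is using, when it can be read.
show_runpm() {
    param="${1:-/sys/module/nouveau/parameters/runpm}"
    if [ -r "$param" ]; then
        cat "$param"
    else
        echo "runpm not readable (module not loaded, or not root)"
    fi
}
```

If nouveau is loaded from an initramfs, the image would presumably need regenerating before the modprobe.d entry takes effect.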

Dave.

2023-02-13 09:15:01

by Chris Clayton

Subject: Re: linux-6.2-rc4+ hangs on poweroff/reboot: Bisected



On 13/02/2023 02:57, Dave Airlie wrote:
> On Sun, 12 Feb 2023 at 00:43, Chris Clayton <[email protected]> wrote:
>>
>>
>>
>> On 10/02/2023 19:33, Linux regression tracking (Thorsten Leemhuis) wrote:
>>> On 10.02.23 20:01, Karol Herbst wrote:
>>>> On Fri, Feb 10, 2023 at 7:35 PM Linux regression tracking (Thorsten
>>>> Leemhuis) <[email protected]> wrote:
>>>>>
>>>>> On 08.02.23 09:48, Chris Clayton wrote:
>>>>>>
>>>>>> I'm assuming that we are not going to see a fix for this regression before 6.2 is released.
>>>>>
>>>>> Yeah, looks like it. That's unfortunate, but happens. But there is still
>>>>> time to fix it and there is one thing I wonder:
>>>>>
>>>>> Did any of the nouveau developers look at the netconsole captures Chris
>>>>> posted more than a week ago to check if they somehow help to track down
>>>>> the root of this problem?
>>>>
>>>> I did now and I can't spot anything. I think at this point it would
>>>> make sense to dump the active tasks/threads via sysrq keys to see if
>>>> any is in a weird state preventing the machine from shutting down.
>>>
>>> Many thx for looking into it!
>>
>> Yes, thanks Karol.
>>
>> Attached is the output from dmesg when this block of code:
>>
>> /bin/mount /dev/sda7 /mnt/sda7
>> /bin/mountpoint /proc || /bin/mount /proc
>> /bin/dmesg -w > /mnt/sda7/sysrq.dmesg.log &
>> /bin/echo t > /proc/sysrq-trigger
>> /bin/sleep 1
>> /bin/sync
>> /bin/sleep 1
>> kill $(pidof dmesg)
>> /bin/umount /mnt/sda7
>>
>> is executed immediately before /sbin/reboot is called as the final step of rebooting my system.
>>
>> I hope this is what you were looking for, but if not, please let me know what you need
>

Thanks Dave.
> Another shot in the dark, but does nouveau.runpm=0 help at all?
>
> Dave.

Subject: Re: linux-6.2-rc4+ hangs on poweroff/reboot: Bisected

On 13.02.23 10:14, Chris Clayton wrote:
> On 13/02/2023 02:57, Dave Airlie wrote:
>> On Sun, 12 Feb 2023 at 00:43, Chris Clayton <[email protected]> wrote:
>>>
>>>
>>>
>>> On 10/02/2023 19:33, Linux regression tracking (Thorsten Leemhuis) wrote:
>>>> On 10.02.23 20:01, Karol Herbst wrote:
>>>>> On Fri, Feb 10, 2023 at 7:35 PM Linux regression tracking (Thorsten
>>>>> Leemhuis) <[email protected]> wrote:
>>>>>>
>>>>>> On 08.02.23 09:48, Chris Clayton wrote:
>>>>>>>
>>>>>>> I'm assuming that we are not going to see a fix for this regression before 6.2 is released.
>>>>>>
>>>>>> Yeah, looks like it. That's unfortunate, but happens. But there is still
>>>>>> time to fix it and there is one thing I wonder:
>>>>>>
>>>>>> Did any of the nouveau developers look at the netconsole captures Chris
>>>>>> posted more than a week ago to check if they somehow help to track down
>>>>>> the root of this problem?
>>>>>
>>>>> I did now and I can't spot anything. I think at this point it would
>>>>> make sense to dump the active tasks/threads via sysrq keys to see if
>>>>> any is in a weird state preventing the machine from shutting down.
>>>>
>>>> Many thx for looking into it!
>>>
>>> Yes, thanks Karol.
>>>
>>> Attached is the output from dmesg when this block of code:
>>>
>>> /bin/mount /dev/sda7 /mnt/sda7
>>> /bin/mountpoint /proc || /bin/mount /proc
>>> /bin/dmesg -w > /mnt/sda7/sysrq.dmesg.log &
>>> /bin/echo t > /proc/sysrq-trigger
>>> /bin/sleep 1
>>> /bin/sync
>>> /bin/sleep 1
>>> kill $(pidof dmesg)
>>> /bin/umount /mnt/sda7
>>>
>>> is executed immediately before /sbin/reboot is called as the final step of rebooting my system.
>>>
>>> I hope this is what you were looking for, but if not, please let me know what you need
>
> Thanks Dave. [...]
FWIW, in case anyone strands here in the archives: the msg was
truncated. The full post can be found in a new thread:

https://lore.kernel.org/lkml/[email protected]/

Sadly it seems the info "With runpm=0, both reboot and poweroff work on
my laptop." didn't bring us much further to a solution. :-/ I don't
really like it, but for regression tracking I'm now putting this on the
back-burner, as a fix is not in sight.

#regzbot monitor:
https://lore.kernel.org/lkml/[email protected]/
#regzbot backburner: hard to debug and apparently rare
#regzbot ignore-activity

Ciao, Thorsten (wearing his 'the Linux kernel's regression tracker' hat)
--
Everything you wanna know about Linux kernel regression tracking:
https://linux-regtracking.leemhuis.info/about/#tldr
That page also explains what to do if mails like this annoy you.

#regzbot ignore-activity

2023-02-15 11:10:54

by Karol Herbst

Subject: Re: linux-6.2-rc4+ hangs on poweroff/reboot: Bisected

On Wed, Feb 15, 2023 at 11:36 AM Linux regression tracking #update
(Thorsten Leemhuis) <[email protected]> wrote:
>
> On 13.02.23 10:14, Chris Clayton wrote:
> > On 13/02/2023 02:57, Dave Airlie wrote:
> >> On Sun, 12 Feb 2023 at 00:43, Chris Clayton <[email protected]> wrote:
> >>>
> >>>
> >>>
> >>> On 10/02/2023 19:33, Linux regression tracking (Thorsten Leemhuis) wrote:
> >>>> On 10.02.23 20:01, Karol Herbst wrote:
> >>>>> On Fri, Feb 10, 2023 at 7:35 PM Linux regression tracking (Thorsten
> >>>>> Leemhuis) <[email protected]> wrote:
> >>>>>>
> >>>>>> On 08.02.23 09:48, Chris Clayton wrote:
> >>>>>>>
> >>>>>>> I'm assuming that we are not going to see a fix for this regression before 6.2 is released.
> >>>>>>
> >>>>>> Yeah, looks like it. That's unfortunate, but happens. But there is still
> >>>>>> time to fix it and there is one thing I wonder:
> >>>>>>
> >>>>>> Did any of the nouveau developers look at the netconsole captures Chris
> >>>>>> posted more than a week ago to check if they somehow help to track down
> >>>>>> the root of this problem?
> >>>>>
> >>>>> I did now and I can't spot anything. I think at this point it would
> >>>>> make sense to dump the active tasks/threads via sysrq keys to see if
> >>>>> any is in a weird state preventing the machine from shutting down.
> >>>>
> >>>> Many thx for looking into it!
> >>>
> >>> Yes, thanks Karol.
> >>>
> >>> Attached is the output from dmesg when this block of code:
> >>>
> >>> /bin/mount /dev/sda7 /mnt/sda7
> >>> /bin/mountpoint /proc || /bin/mount /proc
> >>> /bin/dmesg -w > /mnt/sda7/sysrq.dmesg.log &
> >>> /bin/echo t > /proc/sysrq-trigger
> >>> /bin/sleep 1
> >>> /bin/sync
> >>> /bin/sleep 1
> >>> kill $(pidof dmesg)
> >>> /bin/umount /mnt/sda7
> >>>
> >>> is executed immediately before /sbin/reboot is called as the final step of rebooting my system.
> >>>
> >>> I hope this is what you were looking for, but if not, please let me know what you need
> >
> > Thanks Dave. [...]
> FWIW, in case anyone strands here in the archives: the msg was
> truncated. The full post can be found in a new thread:
>
> https://lore.kernel.org/lkml/[email protected]/
>
> Sadly it seems the info "With runpm=0, both reboot and poweroff work on
> my laptop." didn't bring us much further to a solution. :-/ I don't
> really like it, but for regression tracking I'm now putting this on the
> back-burner, as a fix is not in sight.
>
> #regzbot monitor:
> https://lore.kernel.org/lkml/[email protected]/
> #regzbot backburner: hard to debug and apparently rare
> #regzbot ignore-activity
>

yeah.. this bug looks a little annoying. Sadly the only Turing based
laptop I got doesn't work on Nouveau because of firmware related
issues and we probably need to get updated ones from Nvidia here :(

But it's a bit weird that the kernel doesn't shut down, because I don't
see anything in the logs which would prevent that from happening.
Unless it's waiting on one of the tasks to complete, but none of them
looked in any way nouveau related.

If somebody else has any fancy kernel debugging tips here to figure
out why it hangs, that would be very helpful...

> Ciao, Thorsten (wearing his 'the Linux kernel's regression tracker' hat)
> --
> Everything you wanna know about Linux kernel regression tracking:
> https://linux-regtracking.leemhuis.info/about/#tldr
> That page also explains what to do if mails like this annoy you.
>
> #regzbot ignore-activity
>


2023-02-18 12:22:34

by Chris Clayton

Subject: Re: linux-6.2-rc4+ hangs on poweroff/reboot: Bisected



On 15/02/2023 11:09, Karol Herbst wrote:
> On Wed, Feb 15, 2023 at 11:36 AM Linux regression tracking #update
> (Thorsten Leemhuis) <[email protected]> wrote:
>>
>> On 13.02.23 10:14, Chris Clayton wrote:
>>> On 13/02/2023 02:57, Dave Airlie wrote:
>>>> On Sun, 12 Feb 2023 at 00:43, Chris Clayton <[email protected]> wrote:
>>>>>
>>>>>
>>>>>
>>>>> On 10/02/2023 19:33, Linux regression tracking (Thorsten Leemhuis) wrote:
>>>>>> On 10.02.23 20:01, Karol Herbst wrote:
>>>>>>> On Fri, Feb 10, 2023 at 7:35 PM Linux regression tracking (Thorsten
>>>>>>> Leemhuis) <[email protected]> wrote:
>>>>>>>>
>>>>>>>> On 08.02.23 09:48, Chris Clayton wrote:
>>>>>>>>>
>>>>>>>>> I'm assuming that we are not going to see a fix for this regression before 6.2 is released.
>>>>>>>>
>>>>>>>> Yeah, looks like it. That's unfortunate, but happens. But there is still
>>>>>>>> time to fix it and there is one thing I wonder:
>>>>>>>>
>>>>>>>> Did any of the nouveau developers look at the netconsole captures Chris
>>>>>>>> posted more than a week ago to check if they somehow help to track down
>>>>>>>> the root of this problem?
>>>>>>>
>>>>>>> I did now and I can't spot anything. I think at this point it would
>>>>>>> make sense to dump the active tasks/threads via sysrq keys to see if
>>>>>>> any is in a weird state preventing the machine from shutting down.
>>>>>>
>>>>>> Many thx for looking into it!
>>>>>
>>>>> Yes, thanks Karol.
>>>>>
>>>>> Attached is the output from dmesg when this block of code:
>>>>>
>>>>> /bin/mount /dev/sda7 /mnt/sda7
>>>>> /bin/mountpoint /proc || /bin/mount /proc
>>>>> /bin/dmesg -w > /mnt/sda7/sysrq.dmesg.log &
>>>>> /bin/echo t > /proc/sysrq-trigger
>>>>> /bin/sleep 1
>>>>> /bin/sync
>>>>> /bin/sleep 1
>>>>> kill $(pidof dmesg)
>>>>> /bin/umount /mnt/sda7
>>>>>
>>>>> is executed immediately before /sbin/reboot is called as the final step of rebooting my system.
>>>>>
>>>>> I hope this is what you were looking for, but if not, please let me know what you need
>>>
>>> Thanks Dave. [...]
>> FWIW, in case anyone strands here in the archives: the msg was
>> truncated. The full post can be found in a new thread:
>>
>> https://lore.kernel.org/lkml/[email protected]/
>>
>> Sadly it seems the info "With runpm=0, both reboot and poweroff work on
>> my laptop." didn't bring us much further to a solution. :-/ I don't
>> really like it, but for regression tracking I'm now putting this on the
>> back-burner, as a fix is not in sight.
>>
>> #regzbot monitor:
>> https://lore.kernel.org/lkml/[email protected]/
>> #regzbot backburner: hard to debug and apparently rare
>> #regzbot ignore-activity
>>
>
> yeah.. this bug looks a little annoying. Sadly the only Turing based
> laptop I got doesn't work on Nouveau because of firmware related
> issues and we probably need to get updated ones from Nvidia here :(
>
> But it's a bit weird that the kernel doesn't shut down, because I don't
> see anything in the logs which would prevent that from happening.
> Unless it's waiting on one of the tasks to complete, but none of them
> looked in any way nouveau related.
>
> If somebody else has any fancy kernel debugging tips here to figure
> out why it hangs, that would be very helpful...
>

I think I've figured this out. It's to do with how my system is configured. I do have an initrd, but the only thing on
it is the CPU microcode which, it is recommended, should be loaded early. The absence of the NVidia firmware from the
initrd doesn't matter because the drivers for the hardware that need to load firmware are all built as modules. So, by
the time the devices are configured via udev, the root partition is mounted and the drivers can get at the firmware.

I've found, by turning on nouveau debug and taking a video of the screen as the system shuts down, that nouveau seems to
be trying to run the scrubber very late in the shutdown process. The problem is that by this time, I think the root
partition, and thus the scrubber binary, have become inaccessible.

I seem to have two choices - either make the firmware accessible on an initrd or unload the module in a shutdown script
before the scrubber binary becomes inaccessible. The latter of these is the workaround I have implemented whilst the
problem I reported has been under investigation. For simplicity, I think I'll promote my workaround to being the
permanent solution.
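
A minimal sketch of the second option, unloading the module from a shutdown script. It is illustrative only and not the actual init script: the function name unload_nouveau and the PROC_MODULES and MODPROBE variables are assumptions, added so the logic can be dry-run without touching the real module.

```shell
#!/bin/sh
# Sketch only: unload nouveau once the display manager has stopped.
# unload_nouveau, PROC_MODULES and MODPROBE are illustrative names.

PROC_MODULES="${PROC_MODULES:-/proc/modules}"
MODPROBE="${MODPROBE:-modprobe}"

unload_nouveau() {
    # /proc/modules lists one loaded module per line, name first.
    if grep -q '^nouveau ' "$PROC_MODULES"; then
        "$MODPROBE" -r nouveau
    fi
}

# In the "stop" branch of the init script, after SDDM has exited:
#     unload_nouveau || echo "could not unload nouveau" >&2
```

modprobe -r refuses to remove a module that is still in use, so the call has to come after the display manager has exited.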

So, apologies (and thanks) to everyone whose time I have taken up with this non-bug.

Chris

>> Ciao, Thorsten (wearing his 'the Linux kernel's regression tracker' hat)
>> --
>> Everything you wanna know about Linux kernel regression tracking:
>> https://linux-regtracking.leemhuis.info/about/#tldr
>> That page also explains what to do if mails like this annoy you.
>>
>> #regzbot ignore-activity
>>
>

2023-02-18 12:26:48

by Karol Herbst

Subject: Re: linux-6.2-rc4+ hangs on poweroff/reboot: Bisected

On Sat, Feb 18, 2023 at 1:22 PM Chris Clayton <[email protected]> wrote:
>
>
>
> On 15/02/2023 11:09, Karol Herbst wrote:
> > On Wed, Feb 15, 2023 at 11:36 AM Linux regression tracking #update
> > (Thorsten Leemhuis) <[email protected]> wrote:
> >>
> >> On 13.02.23 10:14, Chris Clayton wrote:
> >>> On 13/02/2023 02:57, Dave Airlie wrote:
> >>>> On Sun, 12 Feb 2023 at 00:43, Chris Clayton <[email protected]> wrote:
> >>>>>
> >>>>>
> >>>>>
> >>>>> On 10/02/2023 19:33, Linux regression tracking (Thorsten Leemhuis) wrote:
> >>>>>> On 10.02.23 20:01, Karol Herbst wrote:
> >>>>>>> On Fri, Feb 10, 2023 at 7:35 PM Linux regression tracking (Thorsten
> >>>>>>> Leemhuis) <[email protected]> wrote:
> >>>>>>>>
> >>>>>>>> On 08.02.23 09:48, Chris Clayton wrote:
> >>>>>>>>>
> >>>>>>>>> I'm assuming that we are not going to see a fix for this regression before 6.2 is released.
> >>>>>>>>
> >>>>>>>> Yeah, looks like it. That's unfortunate, but happens. But there is still
> >>>>>>>> time to fix it and there is one thing I wonder:
> >>>>>>>>
> >>>>>>>> Did any of the nouveau developers look at the netconsole captures Chris
> >>>>>>>> posted more than a week ago to check if they somehow help to track down
> >>>>>>>> the root of this problem?
> >>>>>>>
> >>>>>>> I did now and I can't spot anything. I think at this point it would
> >>>>>>> make sense to dump the active tasks/threads via sysrq keys to see if
> >>>>>>> any is in a weird state preventing the machine from shutting down.
> >>>>>>
> >>>>>> Many thx for looking into it!
> >>>>>
> >>>>> Yes, thanks Karol.
> >>>>>
> >>>>> Attached is the output from dmesg when this block of code:
> >>>>>
> >>>>> /bin/mount /dev/sda7 /mnt/sda7
> >>>>> /bin/mountpoint /proc || /bin/mount /proc
> >>>>> /bin/dmesg -w > /mnt/sda7/sysrq.dmesg.log &
> >>>>> /bin/echo t > /proc/sysrq-trigger
> >>>>> /bin/sleep 1
> >>>>> /bin/sync
> >>>>> /bin/sleep 1
> >>>>> kill $(pidof dmesg)
> >>>>> /bin/umount /mnt/sda7
> >>>>>
> >>>>> is executed immediately before /sbin/reboot is called as the final step of rebooting my system.
> >>>>>
> >>>>> I hope this is what you were looking for, but if not, please let me know what you need
> >>>
> >>> Thanks Dave. [...]
> >> FWIW, in case anyone strands here in the archives: the msg was
> >> truncated. The full post can be found in a new thread:
> >>
> >> https://lore.kernel.org/lkml/[email protected]/
> >>
> >> Sadly it seems the info "With runpm=0, both reboot and poweroff work on
> >> my laptop." didn't bring us much further to a solution. :-/ I don't
> >> really like it, but for regression tracking I'm now putting this on the
> >> back-burner, as a fix is not in sight.
> >>
> >> #regzbot monitor:
> >> https://lore.kernel.org/lkml/[email protected]/
> >> #regzbot backburner: hard to debug and apparently rare
> >> #regzbot ignore-activity
> >>
> >
> > yeah.. this bug looks a little annoying. Sadly the only Turing based
> > laptop I got doesn't work on Nouveau because of firmware related
> > issues and we probably need to get updated ones from Nvidia here :(
> >
> > But it's a bit weird that the kernel doesn't shutdown, because I don't
> > see anything in the logs which would prevent that from happening.
> > Unless it's waiting on one of the tasks to complete, but none of them
> > looked in any way nouveau related.
> >
> > If somebody else has any fancy kernel debugging tips here to figure
> > out why it hangs, that would be very helpful...
> >
>
> I think I've figured this out. It's to do with how my system is configured. I do have an initrd, but the only thing on
> it is the cpu microcode which, it is recommended, should be loaded early. The absence of the NVidia firmware from an
> initrd doesn't matter because the drivers for the hardware that need to load firmware are all built as modules. So, by
> the time the devices are configured via udev, the root partition is mounted and the drivers can get at the firmware.
>
> I've found, by turning on nouveau debug and taking a video of the screen as the system shuts down, that nouveau seems to
> be trying to run the scrubber very very late in the shutdown process. The problem is that by this time, I think the root
> partition, and thus the scrubber binary, have become inaccessible.
>
> I seem to have two choices - either make the firmware accessible on an initrd or unload the module in a shutdown script
> before the scrubber binary becomes inaccessible. The latter of these is the workaround I have implemented whilst the
> problem I reported has been under investigation. For simplicity, I think I'll promote my workaround to being the
> permanent solution.
>
> So, apologies (and thanks) to everyone whose time I have taken up with this non-bug.
>

Well.. nouveau shouldn't prevent the system from shutting down if the
firmware file isn't available. Or at least it should print a
warning/error. Mind messing with the code a little to see if skipping
it kind of works? I probably can also come up with a patch by next
week.

> Chris
>
> >> Ciao, Thorsten (wearing his 'the Linux kernel's regression tracker' hat)
> >> --
> >> Everything you wanna know about Linux kernel regression tracking:
> >> https://linux-regtracking.leemhuis.info/about/#tldr
> >> That page also explains what to do if mails like this annoy you.
> >>
> >> #regzbot ignore-activity
> >>
> >
>


2023-02-18 15:20:06

by Chris Clayton

[permalink] [raw]
Subject: Re: linux-6.2-rc4+ hangs on poweroff/reboot: Bisected



On 18/02/2023 12:25, Karol Herbst wrote:
> On Sat, Feb 18, 2023 at 1:22 PM Chris Clayton <[email protected]> wrote:
>>
>>
>>
>> On 15/02/2023 11:09, Karol Herbst wrote:
>>> On Wed, Feb 15, 2023 at 11:36 AM Linux regression tracking #update
>>> (Thorsten Leemhuis) <[email protected]> wrote:
>>>>
>>>> On 13.02.23 10:14, Chris Clayton wrote:
>>>>> On 13/02/2023 02:57, Dave Airlie wrote:
>>>>>> On Sun, 12 Feb 2023 at 00:43, Chris Clayton <[email protected]> wrote:
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On 10/02/2023 19:33, Linux regression tracking (Thorsten Leemhuis) wrote:
>>>>>>>> On 10.02.23 20:01, Karol Herbst wrote:
>>>>>>>>> On Fri, Feb 10, 2023 at 7:35 PM Linux regression tracking (Thorsten
>>>>>>>>> Leemhuis) <[email protected]> wrote:
>>>>>>>>>>
>>>>>>>>>> On 08.02.23 09:48, Chris Clayton wrote:
>>>>>>>>>>>
>>>>>>>>>>> I'm assuming that we are not going to see a fix for this regression before 6.2 is released.
>>>>>>>>>>
>>>>>>>>>> Yeah, looks like it. That's unfortunate, but happens. But there is still
>>>>>>>>>> time to fix it and there is one thing I wonder:
>>>>>>>>>>
>>>>>>>>>> Did any of the nouveau developers look at the netconsole captures Chris
>>>>>>>>>> posted more than a week ago to check if they somehow help to track down
>>>>>>>>>> the root of this problem?
>>>>>>>>>
>>>>>>>>> I did now and I can't spot anything. I think at this point it would
>>>>>>>>> make sense to dump the active tasks/threads via sysrq keys to see if
>>>>>>>>> any is in a weird state preventing the machine from shutting down.
>>>>>>>>
>>>>>>>> Many thx for looking into it!
>>>>>>>
>>>>>>> Yes, thanks Karol.
>>>>>>>
>>>>>>> Attached is the output from dmesg when this block of code:
>>>>>>>
>>>>>>> /bin/mount /dev/sda7 /mnt/sda7
>>>>>>> /bin/mountpoint /proc || /bin/mount /proc
>>>>>>> /bin/dmesg -w > /mnt/sda7/sysrq.dmesg.log &
>>>>>>> /bin/echo t > /proc/sysrq-trigger
>>>>>>> /bin/sleep 1
>>>>>>> /bin/sync
>>>>>>> /bin/sleep 1
>>>>>>> kill $(pidof dmesg)
>>>>>>> /bin/umount /mnt/sda7
>>>>>>>
>>>>>>> is executed immediately before /sbin/reboot is called as the final step of rebooting my system.
>>>>>>>
>>>>>>> I hope this is what you were looking for, but if not, please let me know what you need
>>>>>
>>>>> Thanks Dave. [...]
>>>> FWIW, in case anyone strands here in the archives: the msg was
>>>> truncated. The full post can be found in a new thread:
>>>>
>>>> https://lore.kernel.org/lkml/[email protected]/
>>>>
>>>> Sadly it seems the info "With runpm=0, both reboot and poweroff work on
>>>> my laptop." didn't bring us much further to a solution. :-/ I don't
>>>> really like it, but for regression tracking I'm now putting this on the
>>>> back-burner, as a fix is not in sight.
>>>>
>>>> #regzbot monitor:
>>>> https://lore.kernel.org/lkml/[email protected]/
>>>> #regzbot backburner: hard to debug and apparently rare
>>>> #regzbot ignore-activity
>>>>
>>>
>>> yeah.. this bug looks a little annoying. Sadly the only Turing based
>>> laptop I got doesn't work on Nouveau because of firmware related
>>> issues and we probably need to get updated ones from Nvidia here :(
>>>
>>> But it's a bit weird that the kernel doesn't shutdown, because I don't
>>> see anything in the logs which would prevent that from happening.
>>> Unless it's waiting on one of the tasks to complete, but none of them
>>> looked in any way nouveau related.
>>>
>>> If somebody else has any fancy kernel debugging tips here to figure
>>> out why it hangs, that would be very helpful...
>>>
>>
>> I think I've figured this out. It's to do with how my system is configured. I do have an initrd, but the only thing on
>> it is the cpu microcode which, it is recommended, should be loaded early. The absence of the NVidia firmware from an
>> initrd doesn't matter because the drivers for the hardware that need to load firmware are all built as modules. So, by
>> the time the devices are configured via udev, the root partition is mounted and the drivers can get at the firmware.
>>
>> I've found, by turning on nouveau debug and taking a video of the screen as the system shuts down, that nouveau seems to
>> be trying to run the scrubber very very late in the shutdown process. The problem is that by this time, I think the root
>> partition, and thus the scrubber binary, have become inaccessible.
>>
>> I seem to have two choices - either make the firmware accessible on an initrd or unload the module in a shutdown script
>> before the scrubber binary becomes inaccessible. The latter of these is the workaround I have implemented whilst the
>> problem I reported has been under investigation. For simplicity, I think I'll promote my workaround to being the
>> permanent solution.
>>
>> So, apologies (and thanks) to everyone whose time I have taken up with this non-bug.
>>
>
> Well.. nouveau shouldn't prevent the system from shutting down if the
> firmware file isn't available. Or at least it should print a
> warning/error. Mind messing with the code a little to see if skipping
> it kind of works? I probably can also come up with a patch by next
> week.
>
Well, I'd love to but a quick glance at the code caused me to bump into this obscenity:

int
gm200_flcn_reset_wait_mem_scrubbing(struct nvkm_falcon *falcon)
{
	nvkm_falcon_mask(falcon, 0x040, 0x00000000, 0x00000000);

	if (nvkm_msec(falcon->owner->device, 10,
		if (!(nvkm_falcon_rd32(falcon, 0x10c) & 0x00000006))
			break;
	) < 0)
		return -ETIMEDOUT;

	return 0;
}

nvkm_msec is #defined to nvkm_usec, which in turn is #defined to nvkm_nsec, which is where the loop that the break
belongs to appears.

>> Chris
>>
>>>> Ciao, Thorsten (wearing his 'the Linux kernel's regression tracker' hat)
>>>> --
>>>> Everything you wanna know about Linux kernel regression tracking:
>>>> https://linux-regtracking.leemhuis.info/about/#tldr
>>>> That page also explains what to do if mails like this annoy you.
>>>>
>>>> #regzbot ignore-activity
>>>>
>>>
>>
>

2023-02-18 18:55:57

by Chris Clayton

[permalink] [raw]
Subject: Re: linux-6.2-rc4+ hangs on poweroff/reboot: Bisected



On 18/02/2023 15:19, Chris Clayton wrote:
>
>
> On 18/02/2023 12:25, Karol Herbst wrote:
>> On Sat, Feb 18, 2023 at 1:22 PM Chris Clayton <[email protected]> wrote:
>>>
>>>
>>>
>>> On 15/02/2023 11:09, Karol Herbst wrote:
>>>> On Wed, Feb 15, 2023 at 11:36 AM Linux regression tracking #update
>>>> (Thorsten Leemhuis) <[email protected]> wrote:
>>>>>
>>>>> On 13.02.23 10:14, Chris Clayton wrote:
>>>>>> On 13/02/2023 02:57, Dave Airlie wrote:
>>>>>>> On Sun, 12 Feb 2023 at 00:43, Chris Clayton <[email protected]> wrote:
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> On 10/02/2023 19:33, Linux regression tracking (Thorsten Leemhuis) wrote:
>>>>>>>>> On 10.02.23 20:01, Karol Herbst wrote:
>>>>>>>>>> On Fri, Feb 10, 2023 at 7:35 PM Linux regression tracking (Thorsten
>>>>>>>>>> Leemhuis) <[email protected]> wrote:
>>>>>>>>>>>
>>>>>>>>>>> On 08.02.23 09:48, Chris Clayton wrote:
>>>>>>>>>>>>
>>>>>>>>>>>> I'm assuming that we are not going to see a fix for this regression before 6.2 is released.
>>>>>>>>>>>
>>>>>>>>>>> Yeah, looks like it. That's unfortunate, but happens. But there is still
>>>>>>>>>>> time to fix it and there is one thing I wonder:
>>>>>>>>>>>
>>>>>>>>>>> Did any of the nouveau developers look at the netconsole captures Chris
>>>>>>>>>>> posted more than a week ago to check if they somehow help to track down
>>>>>>>>>>> the root of this problem?
>>>>>>>>>>
>>>>>>>>>> I did now and I can't spot anything. I think at this point it would
>>>>>>>>>> make sense to dump the active tasks/threads via sysrq keys to see if
>>>>>>>>>> any is in a weird state preventing the machine from shutting down.
>>>>>>>>>
>>>>>>>>> Many thx for looking into it!
>>>>>>>>
>>>>>>>> Yes, thanks Karol.
>>>>>>>>
>>>>>>>> Attached is the output from dmesg when this block of code:
>>>>>>>>
>>>>>>>> /bin/mount /dev/sda7 /mnt/sda7
>>>>>>>> /bin/mountpoint /proc || /bin/mount /proc
>>>>>>>> /bin/dmesg -w > /mnt/sda7/sysrq.dmesg.log &
>>>>>>>> /bin/echo t > /proc/sysrq-trigger
>>>>>>>> /bin/sleep 1
>>>>>>>> /bin/sync
>>>>>>>> /bin/sleep 1
>>>>>>>> kill $(pidof dmesg)
>>>>>>>> /bin/umount /mnt/sda7
>>>>>>>>
>>>>>>>> is executed immediately before /sbin/reboot is called as the final step of rebooting my system.
>>>>>>>>
>>>>>>>> I hope this is what you were looking for, but if not, please let me know what you need
>>>>>>
>>>>>> Thanks Dave. [...]
>>>>> FWIW, in case anyone strands here in the archives: the msg was
>>>>> truncated. The full post can be found in a new thread:
>>>>>
>>>>> https://lore.kernel.org/lkml/[email protected]/
>>>>>
>>>>> Sadly it seems the info "With runpm=0, both reboot and poweroff work on
>>>>> my laptop." didn't bring us much further to a solution. :-/ I don't
>>>>> really like it, but for regression tracking I'm now putting this on the
>>>>> back-burner, as a fix is not in sight.
>>>>>
>>>>> #regzbot monitor:
>>>>> https://lore.kernel.org/lkml/[email protected]/
>>>>> #regzbot backburner: hard to debug and apparently rare
>>>>> #regzbot ignore-activity
>>>>>
>>>>
>>>> yeah.. this bug looks a little annoying. Sadly the only Turing based
>>>> laptop I got doesn't work on Nouveau because of firmware related
>>>> issues and we probably need to get updated ones from Nvidia here :(
>>>>
>>>> But it's a bit weird that the kernel doesn't shutdown, because I don't
>>>> see anything in the logs which would prevent that from happening.
>>>> Unless it's waiting on one of the tasks to complete, but none of them
>>>> looked in any way nouveau related.
>>>>
>>>> If somebody else has any fancy kernel debugging tips here to figure
>>>> out why it hangs, that would be very helpful...
>>>>
>>>
>>> I think I've figured this out. It's to do with how my system is configured. I do have an initrd, but the only thing on
>>> it is the cpu microcode which, it is recommended, should be loaded early. The absence of the NVidia firmware from an
>>> initrd doesn't matter because the drivers for the hardware that need to load firmware are all built as modules. So, by
>>> the time the devices are configured via udev, the root partition is mounted and the drivers can get at the firmware.
>>>
>>> I've found, by turning on nouveau debug and taking a video of the screen as the system shuts down, that nouveau seems to
>>> be trying to run the scrubber very very late in the shutdown process. The problem is that by this time, I think the root
>>> partition, and thus the scrubber binary, have become inaccessible.
>>>
>>> I seem to have two choices - either make the firmware accessible on an initrd or unload the module in a shutdown script
>>> before the scrubber binary becomes inaccessible. The latter of these is the workaround I have implemented whilst the
>>> problem I reported has been under investigation. For simplicity, I think I'll promote my workaround to being the
>>> permanent solution.
>>>
>>> So, apologies (and thanks) to everyone whose time I have taken up with this non-bug.
>>>
>>
>> Well.. nouveau shouldn't prevent the system from shutting down if the
>> firmware file isn't available. Or at least it should print a
>> warning/error. Mind messing with the code a little to see if skipping
>> it kind of works? I probably can also come up with a patch by next
>> week.
>>
> Well, I'd love to but a quick glance at the code caused me to bump into this obscenity:
>
> int
> gm200_flcn_reset_wait_mem_scrubbing(struct nvkm_falcon *falcon)
> {
> 	nvkm_falcon_mask(falcon, 0x040, 0x00000000, 0x00000000);
>
> 	if (nvkm_msec(falcon->owner->device, 10,
> 		if (!(nvkm_falcon_rd32(falcon, 0x10c) & 0x00000006))
> 			break;
> 	) < 0)
> 		return -ETIMEDOUT;
>
> 	return 0;
> }
>
> nvkm_msec is #defined to nvkm_usec which in turn is #defined to nvkm_nsec where the loop that the break is related to
> appears

I think someone who knows the code needs to look at this. What I can confirm is that after a freeze, I waited for 90
seconds for a timeout to occur, but it didn't.
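
For readers unfamiliar with the pattern quoted above: the function passes a block of statements, including a
`break`, as a macro argument, and the macro pastes that block into its own polling loop. What follows is a minimal
user-space sketch of that shape, not the nouveau implementation. The names poll_nsec, poll_usec, poll_msec,
read_status, fake_status and wait_status_clear are invented for illustration. It assumes GNU C extensions (statement
expressions and named variadic macro parameters, so gcc or clang) and uses clock() as a stand-in time source.

```c
#include <stdint.h>
#include <time.h>

/* CPU time in nanoseconds; adequate here because the loop busy-waits. */
static int64_t now_ns(void)
{
	return (int64_t)clock() * (1000000000LL / CLOCKS_PER_SEC);
}

/*
 * Hypothetical poll_nsec(): execute `cond` repeatedly until it runs `break`
 * or more than `n` nanoseconds have elapsed. The whole expression evaluates
 * to a value >= 0 if `cond` broke out, or to -1 on timeout.
 */
#define poll_nsec(n, cond...) ({				\
	int64_t _limit = (n);					\
	int64_t _start = now_ns();				\
	int64_t _taken = 0;					\
	do {							\
		cond						\
		_taken = now_ns() - _start;			\
		if (_taken > _limit)				\
			_taken = -1;				\
	} while (_taken >= 0);					\
	_taken;							\
})

/* Thin wrappers that only scale the time unit, as in the quoted chain. */
#define poll_usec(u, cond...) poll_nsec((u) * 1000LL, ##cond)
#define poll_msec(m, cond...) poll_usec((m) * 1000LL, ##cond)

/* Hypothetical stand-in for a hardware status register. */
static volatile uint32_t fake_status = 0x6;

static uint32_t read_status(void)
{
	return fake_status;
}

/* Same shape as the quoted function: 0 on success, -1 on timeout. */
static int wait_status_clear(int timeout_ms)
{
	if (poll_msec(timeout_ms,
		if (!(read_status() & 0x6))
			break;
	) < 0)
		return -1;

	return 0;
}
```

In this sketch the `break` leaves the do/while inside poll_nsec, so wait_status_clear() returns 0 as soon as the
status bits are clear and -1 once the time limit has passed. That is only an observation about the sketch; it says
nothing about where the real hang occurs.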


>>> Chris
>>>
>>>>> Ciao, Thorsten (wearing his 'the Linux kernel's regression tracker' hat)
>>>>> --
>>>>> Everything you wanna know about Linux kernel regression tracking:
>>>>> https://linux-regtracking.leemhuis.info/about/#tldr
>>>>> That page also explains what to do if mails like this annoy you.
>>>>>
>>>>> #regzbot ignore-activity
>>>>>
>>>>
>>>
>>

2023-02-20 05:35:26

by Ben Skeggs

[permalink] [raw]
Subject: Re: linux-6.2-rc4+ hangs on poweroff/reboot: Bisected

On Sun, 19 Feb 2023 at 04:55, Chris Clayton <[email protected]> wrote:
>
>
>
> On 18/02/2023 15:19, Chris Clayton wrote:
> >
> >
> > On 18/02/2023 12:25, Karol Herbst wrote:
> >> On Sat, Feb 18, 2023 at 1:22 PM Chris Clayton <[email protected]> wrote:
> >>>
> >>>
> >>>
> >>> On 15/02/2023 11:09, Karol Herbst wrote:
> >>>> On Wed, Feb 15, 2023 at 11:36 AM Linux regression tracking #update
> >>>> (Thorsten Leemhuis) <[email protected]> wrote:
> >>>>>
> >>>>> On 13.02.23 10:14, Chris Clayton wrote:
> >>>>>> On 13/02/2023 02:57, Dave Airlie wrote:
> >>>>>>> On Sun, 12 Feb 2023 at 00:43, Chris Clayton <[email protected]> wrote:
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> On 10/02/2023 19:33, Linux regression tracking (Thorsten Leemhuis) wrote:
> >>>>>>>>> On 10.02.23 20:01, Karol Herbst wrote:
> >>>>>>>>>> On Fri, Feb 10, 2023 at 7:35 PM Linux regression tracking (Thorsten
> >>>>>>>>>> Leemhuis) <[email protected]> wrote:
> >>>>>>>>>>>
> >>>>>>>>>>> On 08.02.23 09:48, Chris Clayton wrote:
> >>>>>>>>>>>>
> >>>>>>>>>>>> I'm assuming that we are not going to see a fix for this regression before 6.2 is released.
> >>>>>>>>>>>
> >>>>>>>>>>> Yeah, looks like it. That's unfortunate, but happens. But there is still
> >>>>>>>>>>> time to fix it and there is one thing I wonder:
> >>>>>>>>>>>
> >>>>>>>>>>> Did any of the nouveau developers look at the netconsole captures Chris
> >>>>>>>>>>> posted more than a week ago to check if they somehow help to track down
> >>>>>>>>>>> the root of this problem?
> >>>>>>>>>>
> >>>>>>>>>> I did now and I can't spot anything. I think at this point it would
> >>>>>>>>>> make sense to dump the active tasks/threads via sysrq keys to see if
> >>>>>>>>>> any is in a weird state preventing the machine from shutting down.
> >>>>>>>>>
> >>>>>>>>> Many thx for looking into it!
> >>>>>>>>
> >>>>>>>> Yes, thanks Karol.
> >>>>>>>>
> >>>>>>>> Attached is the output from dmesg when this block of code:
> >>>>>>>>
> >>>>>>>> /bin/mount /dev/sda7 /mnt/sda7
> >>>>>>>> /bin/mountpoint /proc || /bin/mount /proc
> >>>>>>>> /bin/dmesg -w > /mnt/sda7/sysrq.dmesg.log &
> >>>>>>>> /bin/echo t > /proc/sysrq-trigger
> >>>>>>>> /bin/sleep 1
> >>>>>>>> /bin/sync
> >>>>>>>> /bin/sleep 1
> >>>>>>>> kill $(pidof dmesg)
> >>>>>>>> /bin/umount /mnt/sda7
> >>>>>>>>
> >>>>>>>> is executed immediately before /sbin/reboot is called as the final step of rebooting my system.
> >>>>>>>>
> >>>>>>>> I hope this is what you were looking for, but if not, please let me know what you need
> >>>>>>
> >>>>>> Thanks Dave. [...]
> >>>>> FWIW, in case anyone strands here in the archives: the msg was
> >>>>> truncated. The full post can be found in a new thread:
> >>>>>
> >>>>> https://lore.kernel.org/lkml/[email protected]/
> >>>>>
> >>>>> Sadly it seems the info "With runpm=0, both reboot and poweroff work on
> >>>>> my laptop." didn't bring us much further to a solution. :-/ I don't
> >>>>> really like it, but for regression tracking I'm now putting this on the
> >>>>> back-burner, as a fix is not in sight.
> >>>>>
> >>>>> #regzbot monitor:
> >>>>> https://lore.kernel.org/lkml/[email protected]/
> >>>>> #regzbot backburner: hard to debug and apparently rare
> >>>>> #regzbot ignore-activity
> >>>>>
> >>>>
> >>>> yeah.. this bug looks a little annoying. Sadly the only Turing based
> >>>> laptop I got doesn't work on Nouveau because of firmware related
> >>>> issues and we probably need to get updated ones from Nvidia here :(
> >>>>
> >>>> But it's a bit weird that the kernel doesn't shutdown, because I don't
> >>>> see anything in the logs which would prevent that from happening.
> >>>> Unless it's waiting on one of the tasks to complete, but none of them
> >>>> looked in any way nouveau related.
> >>>>
> >>>> If somebody else has any fancy kernel debugging tips here to figure
> >>>> out why it hangs, that would be very helpful...
> >>>>
> >>>
> >>> I think I've figured this out. It's to do with how my system is configured. I do have an initrd, but the only thing on
> >>> it is the cpu microcode which, it is recommended, should be loaded early. The absence of the NVidia firmware from an
> >>> initrd doesn't matter because the drivers for the hardware that need to load firmware are all built as modules. So, by
> >>> the time the devices are configured via udev, the root partition is mounted and the drivers can get at the firmware.
> >>>
> >>> I've found, by turning on nouveau debug and taking a video of the screen as the system shuts down, that nouveau seems to
> >>> be trying to run the scrubber very very late in the shutdown process. The problem is that by this time, I think the root
> >>> partition, and thus the scrubber binary, have become inaccessible.
> >>>
> >>> I seem to have two choices - either make the firmware accessible on an initrd or unload the module in a shutdown script
> >>> before the scrubber binary becomes inaccessible. The latter of these is the workaround I have implemented whilst the
> >>> problem I reported has been under investigation. For simplicity, I think I'll promote my workaround to being the
> >>> permanent solution.
> >>>
> >>> So, apologies (and thanks) to everyone whose time I have taken up with this non-bug.
> >>>
> >>
> >> Well.. nouveau shouldn't prevent the system from shutting down if the
> >> firmware file isn't available. Or at least it should print a
> >> warning/error. Mind messing with the code a little to see if skipping
> >> it kind of works? I probably can also come up with a patch by next
> >> week.
> >>
> > Well, I'd love to but a quick glance at the code caused me to bump into this obscenity:
> >
> > int
> > gm200_flcn_reset_wait_mem_scrubbing(struct nvkm_falcon *falcon)
> > {
> > 	nvkm_falcon_mask(falcon, 0x040, 0x00000000, 0x00000000);
> >
> > 	if (nvkm_msec(falcon->owner->device, 10,
> > 		if (!(nvkm_falcon_rd32(falcon, 0x10c) & 0x00000006))
> > 			break;
> > 	) < 0)
> > 		return -ETIMEDOUT;
> >
> > 	return 0;
> > }
> >
> > nvkm_msec is #defined to nvkm_usec which in turn is #defined to nvkm_nsec where the loop that the break is related to
> > appears
>
> I think someone who knows the code needs to look at this. What I can confirm is that after a freeze, I waited for 90
> seconds for a timeout to occur, but it didn't.
Hey,

Are you able to try the attached patch for me please?

Thanks,
Ben.

>
>
> >>> Chris
> >>>
> >>>>> Ciao, Thorsten (wearing his 'the Linux kernel's regression tracker' hat)
> >>>>> --
> >>>>> Everything you wanna know about Linux kernel regression tracking:
> >>>>> https://linux-regtracking.leemhuis.info/about/#tldr
> >>>>> That page also explains what to do if mails like this annoy you.
> >>>>>
> >>>>> #regzbot ignore-activity
> >>>>>
> >>>>
> >>>
> >>


Attachments:
0001-drm-nouveau-fb-gp102-cache-scrubber-binary-on-first-.patch (9.12 kB)
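
The attachment itself is not reproduced in this archive. Going only by its title ("cache scrubber binary on first
load"), the general idea can be sketched in user space as below. This is an illustration under stated assumptions,
not the attached patch: struct blob_cache and blob_get are invented names, and plain stdio stands in for the
kernel's firmware loading.

```c
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical cache holding one firmware-like blob in memory. */
struct blob_cache {
	unsigned char *data;	/* NULL until the first successful load */
	size_t size;
};

/*
 * Return 0 if the blob is available in memory, -1 otherwise. The file is
 * read only on the first successful call; later calls reuse the cached copy
 * and never touch the filesystem. An empty file is treated as a failure.
 */
static int blob_get(struct blob_cache *cache, const char *path)
{
	unsigned char *buf;
	long len;
	FILE *f;

	if (cache->data)
		return 0;	/* already cached */

	f = fopen(path, "rb");
	if (!f)
		return -1;

	if (fseek(f, 0, SEEK_END) != 0)
		goto fail;
	len = ftell(f);
	if (len <= 0 || fseek(f, 0, SEEK_SET) != 0)
		goto fail;

	buf = malloc((size_t)len);
	if (!buf)
		goto fail;
	if (fread(buf, 1, (size_t)len, f) != (size_t)len) {
		free(buf);
		goto fail;
	}

	fclose(f);
	cache->data = buf;
	cache->size = (size_t)len;
	return 0;

fail:
	fclose(f);
	return -1;
}
```

With a cache like this, a second blob_get() call still succeeds after the backing file has gone away, which is the
situation described earlier in the thread, where the root partition is no longer accessible late in shutdown.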

2023-02-20 10:51:59

by Chris Clayton

[permalink] [raw]
Subject: Re: linux-6.2-rc4+ hangs on poweroff/reboot: Bisected



On 20/02/2023 05:35, Ben Skeggs wrote:
> On Sun, 19 Feb 2023 at 04:55, Chris Clayton <[email protected]> wrote:
>>
>>
>>
>> On 18/02/2023 15:19, Chris Clayton wrote:
>>>
>>>
>>> On 18/02/2023 12:25, Karol Herbst wrote:
>>>> On Sat, Feb 18, 2023 at 1:22 PM Chris Clayton <[email protected]> wrote:
>>>>>
>>>>>
>>>>>
>>>>> On 15/02/2023 11:09, Karol Herbst wrote:
>>>>>> On Wed, Feb 15, 2023 at 11:36 AM Linux regression tracking #update
>>>>>> (Thorsten Leemhuis) <[email protected]> wrote:
>>>>>>>
>>>>>>> On 13.02.23 10:14, Chris Clayton wrote:
>>>>>>>> On 13/02/2023 02:57, Dave Airlie wrote:
>>>>>>>>> On Sun, 12 Feb 2023 at 00:43, Chris Clayton <[email protected]> wrote:
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On 10/02/2023 19:33, Linux regression tracking (Thorsten Leemhuis) wrote:
>>>>>>>>>>> On 10.02.23 20:01, Karol Herbst wrote:
>>>>>>>>>>>> On Fri, Feb 10, 2023 at 7:35 PM Linux regression tracking (Thorsten
>>>>>>>>>>>> Leemhuis) <[email protected]> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>> On 08.02.23 09:48, Chris Clayton wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I'm assuming that we are not going to see a fix for this regression before 6.2 is released.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Yeah, looks like it. That's unfortunate, but happens. But there is still
>>>>>>>>>>>>> time to fix it and there is one thing I wonder:
>>>>>>>>>>>>>
>>>>>>>>>>>>> Did any of the nouveau developers look at the netconsole captures Chris
>>>>>>>>>>>>> posted more than a week ago to check if they somehow help to track down
>>>>>>>>>>>>> the root of this problem?
>>>>>>>>>>>>
>>>>>>>>>>>> I did now and I can't spot anything. I think at this point it would
>>>>>>>>>>>> make sense to dump the active tasks/threads via sysrq keys to see if
>>>>>>>>>>>> any is in a weird state preventing the machine from shutting down.
>>>>>>>>>>>
>>>>>>>>>>> Many thx for looking into it!
>>>>>>>>>>
>>>>>>>>>> Yes, thanks Karol.
>>>>>>>>>>
>>>>>>>>>> Attached is the output from dmesg when this block of code:
>>>>>>>>>>
>>>>>>>>>> /bin/mount /dev/sda7 /mnt/sda7
>>>>>>>>>> /bin/mountpoint /proc || /bin/mount /proc
>>>>>>>>>> /bin/dmesg -w > /mnt/sda7/sysrq.dmesg.log &
>>>>>>>>>> /bin/echo t > /proc/sysrq-trigger
>>>>>>>>>> /bin/sleep 1
>>>>>>>>>> /bin/sync
>>>>>>>>>> /bin/sleep 1
>>>>>>>>>> kill $(pidof dmesg)
>>>>>>>>>> /bin/umount /mnt/sda7
>>>>>>>>>>
>>>>>>>>>> is executed immediately before /sbin/reboot is called as the final step of rebooting my system.
>>>>>>>>>>
>>>>>>>>>> I hope this is what you were looking for, but if not, please let me know what you need
>>>>>>>>
>>>>>>>> Thanks Dave. [...]
>>>>>>> FWIW, in case anyone strands here in the archives: the msg was
>>>>>>> truncated. The full post can be found in a new thread:
>>>>>>>
>>>>>>> https://lore.kernel.org/lkml/[email protected]/
>>>>>>>
>>>>>>> Sadly it seems the info "With runpm=0, both reboot and poweroff work on
>>>>>>> my laptop." didn't bring us much further to a solution. :-/ I don't
>>>>>>> really like it, but for regression tracking I'm now putting this on the
>>>>>>> back-burner, as a fix is not in sight.
>>>>>>>
>>>>>>> #regzbot monitor:
>>>>>>> https://lore.kernel.org/lkml/[email protected]/
>>>>>>> #regzbot backburner: hard to debug and apparently rare
>>>>>>> #regzbot ignore-activity
>>>>>>>
>>>>>>
>>>>>> yeah.. this bug looks a little annoying. Sadly the only Turing based
>>>>>> laptop I got doesn't work on Nouveau because of firmware related
>>>>>> issues and we probably need to get updated ones from Nvidia here :(
>>>>>>
>>>>>> But it's a bit weird that the kernel doesn't shutdown, because I don't
>>>>>> see anything in the logs which would prevent that from happening.
>>>>>> Unless it's waiting on one of the tasks to complete, but none of them
>>>>>> looked in any way nouveau related.
>>>>>>
>>>>>> If somebody else has any fancy kernel debugging tips here to figure
>>>>>> out why it hangs, that would be very helpful...
>>>>>>
>>>>>
>>>>> I think I've figured this out. It's to do with how my system is configured. I do have an initrd, but the only thing on
>>>>> it is the cpu microcode which, it is recommended, should be loaded early. The absence of the NVidia firmware from an
>>>>> initrd doesn't matter because the drivers for the hardware that need to load firmware are all built as modules. So, by
>>>>> the time the devices are configured via udev, the root partition is mounted and the drivers can get at the firmware.
>>>>>
>>>>> I've found, by turning on nouveau debug and taking a video of the screen as the system shuts down, that nouveau seems to
>>>>> be trying to run the scrubber very very late in the shutdown process. The problem is that by this time, I think the root
>>>>> partition, and thus the scrubber binary, have become inaccessible.
>>>>>
>>>>> I seem to have two choices - either make the firmware accessible on an initrd or unload the module in a shutdown script
>>>>> before the scrubber binary becomes inaccessible. The latter of these is the workaround I have implemented whilst the
>>>>> problem I reported has been under investigation. For simplicity, I think I'll promote my workaround to being the
>>>>> permanent solution.
>>>>>
>>>>> So, apologies (and thanks) to everyone whose time I have taken up with this non-bug.
>>>>>
>>>>
>>>> Well.. nouveau shouldn't prevent the system from shutting down if the
>>>> firmware file isn't available. Or at least it should print a
>>>> warning/error. Mind messing with the code a little to see if skipping
>>>> it kind of works? I probably can also come up with a patch by next
>>>> week.
>>>>
>>> Well, I'd love to but a quick glance at the code caused me to bump into this obscenity:
>>>
>>> int
>>> gm200_flcn_reset_wait_mem_scrubbing(struct nvkm_falcon *falcon)
>>> {
>>> 	nvkm_falcon_mask(falcon, 0x040, 0x00000000, 0x00000000);
>>>
>>> 	if (nvkm_msec(falcon->owner->device, 10,
>>> 		if (!(nvkm_falcon_rd32(falcon, 0x10c) & 0x00000006))
>>> 			break;
>>> 	) < 0)
>>> 		return -ETIMEDOUT;
>>>
>>> 	return 0;
>>> }
>>>
>>> nvkm_msec is #defined to nvkm_usec which in turn is #defined to nvkm_nsec where the loop that the break is related to
>>> appears
>>
>> I think someone who knows the code needs to look at this. What I can confirm is that after a freeze, I waited for 90
>> seconds for a timeout to occur, but it didn't.
> Hey,
>
> Are you able to try the attached patch for me please?
>
> Thanks,
> Ben.
>

Thanks Ben.

Yes, this patch fixes the lockup on reboot and poweroff that I've been seeing on my laptop. As you would expect,
offloaded rendering is still working and the discrete GPU is being powered on and off as required.

Thanks.

Reported-by: Chris Clayton <[email protected]>
Tested-by: Chris Clayton <[email protected]>

>>
>>
>>>>> Chris
>>>>>
>>>>>>> Ciao, Thorsten (wearing his 'the Linux kernel's regression tracker' hat)
>>>>>>> --
>>>>>>> Everything you wanna know about Linux kernel regression tracking:
>>>>>>> https://linux-regtracking.leemhuis.info/about/#tldr
>>>>>>> That page also explains what to do if mails like this annoy you.
>>>>>>>
>>>>>>> #regzbot ignore-activity
>>>>>>>
>>>>>>
>>>>>
>>>>

2023-02-20 11:28:09

by Karol Herbst

[permalink] [raw]
Subject: Re: linux-6.2-rc4+ hangs on poweroff/reboot: Bisected

On Mon, Feb 20, 2023 at 11:51 AM Chris Clayton <[email protected]> wrote:
>
>
>
> On 20/02/2023 05:35, Ben Skeggs wrote:
> > On Sun, 19 Feb 2023 at 04:55, Chris Clayton <[email protected]> wrote:
> >> I think someone who knows the code needs to look at this. What I can confirm is that after a freeze, I waited for 90
> >> seconds for a timeout to occur, but it didn't.
> > Hey,
> >
> > Are you able to try the attached patch for me please?
> >
> > Thanks,
> > Ben.
> >
>
> Thanks Ben.
>
> Yes, this patch fixes the lockup on reboot and poweroff that I've been seeing on my laptop. As you would expect,
> offloaded rendering is still working and the discrete GPU is being powered on and off as required.
>
> Thanks.
>
> Reported-by: Chris Clayton <[email protected]>
> Tested-by: Chris Clayton <[email protected]>
>

Ben, did you manage to get push rights to drm-misc by now or should I
just pick the patch and push it through -fixes?



2023-02-20 22:16:56

by Ben Skeggs

Subject: Re: linux-6.2-rc4+ hangs on poweroff/reboot: Bisected

On Mon, 20 Feb 2023 at 21:27, Karol Herbst <[email protected]> wrote:
>
> On Mon, Feb 20, 2023 at 11:51 AM Chris Clayton <[email protected]> wrote:
> >
> >
> >
> > On 20/02/2023 05:35, Ben Skeggs wrote:
> > > On Sun, 19 Feb 2023 at 04:55, Chris Clayton <[email protected]> wrote:
> > >> I think someone who knows the code needs to look at this. What I can confirm is that after a freeze, I waited for 90
> > >> seconds for a timeout to occur, but it didn't.
> > > Hey,
> > >
> > > Are you able to try the attached patch for me please?
> > >
> > > Thanks,
> > > Ben.
> > >
> >
> > Thanks Ben.
> >
> > Yes, this patch fixes the lockup on reboot and poweroff that I've been seeing on my laptop. As you would expect,
> > offloaded rendering is still working and the discrete GPU is being powered on and off as required.
> >
> > Thanks.
> >
> > Reported-by: Chris Clayton <[email protected]>
> > Tested-by: Chris Clayton <[email protected]>
> >
>
> Ben, did you manage to get push rights to drm-misc by now or should I
> just pick the patch and push it through -fixes?
Feel free to pick it up!

Thank you,
Ben.


2023-03-10 09:29:50

by Chris Clayton

Subject: Re: linux-6.2-rc4+ hangs on poweroff/reboot: Bisected

Hi.

Is it likely that this fix will be submitted to mainline during the ongoing 6.3 development cycle?

Chris

On 20/02/2023 22:16, Ben Skeggs wrote:
> On Mon, 20 Feb 2023 at 21:27, Karol Herbst <[email protected]> wrote:
>>
>> On Mon, Feb 20, 2023 at 11:51 AM Chris Clayton <[email protected]> wrote:
>>>
>>>
>>>
>>> On 20/02/2023 05:35, Ben Skeggs wrote:
>>>> On Sun, 19 Feb 2023 at 04:55, Chris Clayton <[email protected]> wrote:
>>>>> I think someone who knows the code needs to look at this. What I can confirm is that after a freeze, I waited for 90
>>>>> seconds for a timeout to occur, but it didn't.
>>>> Hey,
>>>>
>>>> Are you able to try the attached patch for me please?
>>>>
>>>> Thanks,
>>>> Ben.
>>>>
>>>
>>> Thanks Ben.
>>>
>>> Yes, this patch fixes the lockup on reboot and poweroff that I've been seeing on my laptop. As you would expect,
>>> offloaded rendering is still working and the discrete GPU is being powered on and off as required.
>>>
>>> Thanks.
>>>
>>> Reported-by: Chris Clayton <[email protected]>
>>> Tested-by: Chris Clayton <[email protected]>
>>>
>>
>> Ben, did you manage to get push rights to drm-misc by now or should I
>> just pick the patch and push it through -fixes?
> Feel free to pick it up!
>
> Thank you,
> Ben.
>

2023-03-10 10:21:12

by Karol Herbst

Subject: Re: linux-6.2-rc4+ hangs on poweroff/reboot: Bisected

On Fri, Mar 10, 2023 at 10:26 AM Chris Clayton <[email protected]> wrote:
>
> Hi.
>
> Is it likely that this fix will be submitted to mainline during the ongoing 6.3 development cycle?
>

Yes, it's already pushed to drm-misc-fixes, which will then go into
the current devel cycle. I just don't know when it will next be pushed
upwards, but it should get there eventually. And because it also
contains a Fixes tag, it will be backported to older branches as well.

> Chris
>
> On 20/02/2023 22:16, Ben Skeggs wrote:
> > On Mon, 20 Feb 2023 at 21:27, Karol Herbst <[email protected]> wrote:
> >>
> >> On Mon, Feb 20, 2023 at 11:51 AM Chris Clayton <[email protected]> wrote:
> >>>
> >>>
> >>>
> >>> On 20/02/2023 05:35, Ben Skeggs wrote:
> >>>> On Sun, 19 Feb 2023 at 04:55, Chris Clayton <[email protected]> wrote:
> >>>>> I think someone who knows the code needs to look at this. What I can confirm is that after a freeze, I waited for 90
> >>>>> seconds for a timeout to occur, but it didn't.
> >>>> Hey,
> >>>>
> >>>> Are you able to try the attached patch for me please?
> >>>>
> >>>> Thanks,
> >>>> Ben.
> >>>>
> >>>
> >>> Thanks Ben.
> >>>
> >>> Yes, this patch fixes the lockup on reboot and poweroff that I've been seeing on my laptop. As you would expect,
> >>> offloaded rendering is still working and the discrete GPU is being powered on and off as required.
> >>>
> >>> Thanks.
> >>>
> >>> Reported-by: Chris Clayton <[email protected]>
> >>> Tested-by: Chris Clayton <[email protected]>
> >>>
> >>
> >> Ben, did you manage to get push rights to drm-misc by now or should I
> >> just pick the patch and push it through -fixes?
> > Feel free to pick it up!
> >
> > Thank you,
> > Ben.
> >


Subject: Re: linux-6.2-rc4+ hangs on poweroff/reboot: Bisected

On 10.03.23 11:20, Karol Herbst wrote:
> On Fri, Mar 10, 2023 at 10:26 AM Chris Clayton <[email protected]> wrote:
>>
>> Is it likely that this fix will be submitted to mainline during the ongoing 6.3 development cycle?
>>
>
> yes, it's already pushed to drm-misc-fixes, which then will go into
> the current devel cycle. I just don't know when it will next be
> pushed upwards, but it should get there eventually.

FWIW, the fix has now landed as 1b9b4f922f96; sadly without a Link: tag to
the report, hence I have to mark this manually as resolved:

#regzbot fix: 1b9b4f922f96108da3bb5d87b2d603f5dfbc5650

> And
> because it also contains a Fixes tag it will be backported to older
> branches as well.

FWIW, nope, that's not enough: you have to tag those explicitly to ensure
backporting, as explained in
Documentation/process/stable-kernel-rules.rst. Greg points that out every
few weeks, recently here for example:

https://lore.kernel.org/all/[email protected]/
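
To illustrate, an explicitly tagged patch would carry trailers roughly like
the ones below. This is only a sketch following the format described in
stable-kernel-rules.rst, not the trailers of the actual fix; the "# v6.2+"
note is an assumption based on the regression first showing up in 6.2-rc1.

```
Fixes: 0e44c2170876 ("drm/nouveau/flcn: new code to load+boot simple HS FWs (VPR scrubber)")
Cc: <stable@vger.kernel.org> # v6.2+
```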

Ciao, Thorsten (wearing his 'the Linux kernel's regression tracker' hat)
--
Everything you wanna know about Linux kernel regression tracking:
https://linux-regtracking.leemhuis.info/about/#tldr
If I did something stupid, please tell me, as explained on that page.

>> Chris
>>
>> On 20/02/2023 22:16, Ben Skeggs wrote:
>>> On Mon, 20 Feb 2023 at 21:27, Karol Herbst <[email protected]> wrote:
>>>>
>>>> On Mon, Feb 20, 2023 at 11:51 AM Chris Clayton <[email protected]> wrote:
>>>>>
>>>>>
>>>>>
>>>>> On 20/02/2023 05:35, Ben Skeggs wrote:
>>>>>> On Sun, 19 Feb 2023 at 04:55, Chris Clayton <[email protected]> wrote:
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On 18/02/2023 15:19, Chris Clayton wrote:
>>>>>>>>
>>>>>>>>
>>>>>>>> On 18/02/2023 12:25, Karol Herbst wrote:
>>>>>>>>> On Sat, Feb 18, 2023 at 1:22 PM Chris Clayton <[email protected]> wrote:
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On 15/02/2023 11:09, Karol Herbst wrote:
>>>>>>>>>>> On Wed, Feb 15, 2023 at 11:36 AM Linux regression tracking #update
>>>>>>>>>>> (Thorsten Leemhuis) <[email protected]> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>> On 13.02.23 10:14, Chris Clayton wrote:
>>>>>>>>>>>>> On 13/02/2023 02:57, Dave Airlie wrote:
>>>>>>>>>>>>>> On Sun, 12 Feb 2023 at 00:43, Chris Clayton <[email protected]> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On 10/02/2023 19:33, Linux regression tracking (Thorsten Leemhuis) wrote:
>>>>>>>>>>>>>>>> On 10.02.23 20:01, Karol Herbst wrote:
>>>>>>>>>>>>>>>>> On Fri, Feb 10, 2023 at 7:35 PM Linux regression tracking (Thorsten
>>>>>>>>>>>>>>>>> Leemhuis) <[email protected]> wrote:
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> On 08.02.23 09:48, Chris Clayton wrote:
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> I'm assuming that we are not going to see a fix for this regression before 6.2 is released.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Yeah, looks like it. That's unfortunate, but happens. But there is still
>>>>>>>>>>>>>>>>>> time to fix it and there is one thing I wonder:
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Did any of the nouveau developers look at the netconsole captures Chris
>>>>>>>>>>>>>>>>>> posted more than a week ago to check if they somehow help to track down
>>>>>>>>>>>>>>>>>> the root of this problem?
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> I did now and I can't spot anything. I think at this point it would
>>>>>>>>>>>>>>>>> make sense to dump the active tasks/threads via sysrq keys to see if
>>>>>>>>>>>>>>>>> any is in a weird state preventing the machine from shutting down.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Many thx for looking into it!
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Yes, thanks Karol.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Attached is the output from dmesg when this block of code:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> /bin/mount /dev/sda7 /mnt/sda7
>>>>>>>>>>>>>>> /bin/mountpoint /proc || /bin/mount /proc
>>>>>>>>>>>>>>> /bin/dmesg -w > /mnt/sda7/sysrq.dmesg.log &
>>>>>>>>>>>>>>> /bin/echo t > /proc/sysrq-trigger
>>>>>>>>>>>>>>> /bin/sleep 1
>>>>>>>>>>>>>>> /bin/sync
>>>>>>>>>>>>>>> /bin/sleep 1
>>>>>>>>>>>>>>> kill $(pidof dmesg)
>>>>>>>>>>>>>>> /bin/umount /mnt/sda7
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> is executed immediately before /sbin/reboot is called as the final step of rebooting my system.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I hope this is what you were looking for, but if not, please let me know what you need
>>>>>>>>>>>>>
>>>>>>>>>>>>> Thanks Dave. [...]
>>>>>>>>>>>> FWIW, in case anyone strands here in the archives: the msg was
>>>>>>>>>>>> truncated. The full post can be found in a new thread:
>>>>>>>>>>>>
>>>>>>>>>>>> https://lore.kernel.org/lkml/[email protected]/
>>>>>>>>>>>>
>>>>>>>>>>>> Sadly it seems the info "With runpm=0, both reboot and poweroff work on
>>>>>>>>>>>> my laptop." didn't bring us much further to a solution. :-/ I don't
>>>>>>>>>>>> really like it, but for regression tracking I'm now putting this on the
>>>>>>>>>>>> back-burner, as a fix is not in sight.
>>>>>>>>>>>>
>>>>>>>>>>>> #regzbot monitor:
>>>>>>>>>>>> https://lore.kernel.org/lkml/[email protected]/
>>>>>>>>>>>> #regzbot backburner: hard to debug and apparently rare
>>>>>>>>>>>> #regzbot ignore-activity
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> yeah.. this bug looks a little annoying. Sadly the only Turing based
>>>>>>>>>>> laptop I got doesn't work on Nouveau because of firmware related
>>>>>>>>>>> issues and we probably need to get updated ones from Nvidia here :(
>>>>>>>>>>>
>>>>>>>>>>> But it's a bit weird that the kernel doesn't shut down, because I don't
>>>>>>>>>>> see anything in the logs which would prevent that from happening.
>>>>>>>>>>> Unless it's waiting on one of the tasks to complete, but none of them
>>>>>>>>>>> looked in any way nouveau related.
>>>>>>>>>>>
>>>>>>>>>>> If somebody else has any fancy kernel debugging tips here to figure
>>>>>>>>>>> out why it hangs, that would be very helpful...
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> I think I've figured this out. It's to do with how my system is configured. I do have an initrd, but the only thing on
>>>>>>>>>> it is the cpu microcode which, it is recommended, should be loaded early. The absence of the NVidia firmware from an
>>>>>>>>>> initrd doesn't matter because the drivers for the hardware that need to load firmware are all built as modules. So, by
>>>>>>>>>> the time the devices are configured via udev, the root partition is mounted and the drivers can get at the firmware.
>>>>>>>>>>
>>>>>>>>>> I've found, by turning on nouveau debug and taking a video of the screen as the system shuts down, that nouveau seems to
>>>>>>>>>> be trying to run the scrubber very very late in the shutdown process. The problem is that by this time, I think the root
>>>>>>>>>> partition, and thus the scrubber binary, have become inaccessible.
>>>>>>>>>>
>>>>>>>>>> I seem to have two choices - either make the firmware accessible on an initrd or unload the module in a shutdown script
>>>>>>>>>> before the scrubber binary becomes inaccessible. The latter of these is the workaround I have implemented whilst the
>>>>>>>>>> problem I reported has been under investigation. For simplicity, I think I'll promote my workaround to being the
>>>>>>>>>> permanent solution.
>>>>>>>>>>
>>>>>>>>>> So, apologies (and thanks) to everyone whose time I have taken up with this non-bug.
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Well.. nouveau shouldn't prevent the system from shutting down if the
>>>>>>>>> firmware file isn't available. Or at least it should print a
>>>>>>>>> warning/error. Mind messing with the code a little to see if skipping
>>>>>>>>> it kind of works? I probably can also come up with a patch by next
>>>>>>>>> week.
>>>>>>>>>
>>>>>>>> Well, I'd love to but a quick glance at the code caused me to bump into this obscenity:
>>>>>>>>
>>>>>>>> int
>>>>>>>> gm200_flcn_reset_wait_mem_scrubbing(struct nvkm_falcon *falcon)
>>>>>>>> {
>>>>>>>>         nvkm_falcon_mask(falcon, 0x040, 0x00000000, 0x00000000);
>>>>>>>>
>>>>>>>>         if (nvkm_msec(falcon->owner->device, 10,
>>>>>>>>                 if (!(nvkm_falcon_rd32(falcon, 0x10c) & 0x00000006))
>>>>>>>>                         break;
>>>>>>>>         ) < 0)
>>>>>>>>                 return -ETIMEDOUT;
>>>>>>>>
>>>>>>>>         return 0;
>>>>>>>> }
>>>>>>>>
>>>>>>>> nvkm_msec is #defined to nvkm_usec, which in turn is #defined to nvkm_nsec, where the loop that the break relates to
>>>>>>>> appears.
>>>>>>>
>>>>>>> I think someone who knows the code needs to look at this. What I can confirm is that after a freeze, I waited for 90
>>>>>>> seconds for a timeout to occur, but it didn't.
>>>>>> Hey,
>>>>>>
>>>>>> Are you able to try the attached patch for me please?
>>>>>>
>>>>>> Thanks,
>>>>>> Ben.
>>>>>>
>>>>>
>>>>> Thanks Ben.
>>>>>
>>>>> Yes, this patch fixes the lockup on reboot and poweroff that I've been seeing on my laptop. As you would expect,
>>>>> offloaded rendering is still working and the discrete GPU is being powered on and off as required.
>>>>>
>>>>> Thanks.
>>>>>
>>>>> Reported-by: Chris Clayton <[email protected]>
>>>>> Tested-by: Chris Clayton <[email protected]>
>>>>>
>>>>
>>>> Ben, did you manage to get push rights to drm-misc by now or should I
>>>> just pick the patch and push it through -fixes?
>>> Feel free to pick it up!
>>>
>>> Thank you,
>>> Ben.
>>>
>>>>
>>>>>>>
>>>>>>>
>>>>>>> .> Chris
>>>>>>>>>>
>>>>>>>>>>>> Ciao, Thorsten (wearing his 'the Linux kernel's regression tracker' hat)
>>>>>>>>>>>> --
>>>>>>>>>>>> Everything you wanna know about Linux kernel regression tracking:
>>>>>>>>>>>> https://linux-regtracking.leemhuis.info/about/#tldr
>>>>>>>>>>>> That page also explains what to do if mails like this annoy you.
>>>>>>>>>>>>
>>>>>>>>>>>> #regzbot ignore-activity
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>
>>>>
>>
>
>
>