LinuxLists.cc - Re: Regression in Kernel 6.0: System partially freezes with "nvme controller is down"

2023-01-12 15:17:23

by Linux regression tracking (Thorsten Leemhuis)

Subject: Re: Regression in Kernel 6.0: System partially freezes with "nvme controller is down"

[adding the nvme maintainers and the regressions mailing list to the
list of recipients]

[TLDR: I'm adding this report to the list of tracked Linux kernel
regressions; the text you find below is based on a few template
paragraphs you might have encountered already in similar form.
See link in footer if these mails annoy you.]

On 11.01.23 23:11, Julian Groß wrote:
> Dear Maintainer,
>
> when running Linux Kernel version 6.0.12, 6.0.10, 6.0-rc7, or 6.1.4, my
> system seemingly randomly freezes due to the file system being set to
> read-only due to an issue with my NVMe controller.
> The issue does *not* appear on Linux Kernel version 5.19.11 or lower.
>
> Through network logging I am able to catch the issue:
> ```
> Jan 8 14:50:16 x299-desktop kernel: [ 1461.259288] nvme nvme0:
> controller is down; will reset: CSTS=0xffffffff, PCI_STATUS=0x10
> Jan 8 14:50:16 x299-desktop kernel: [ 1461.259293] nvme nvme0: Does
> your device have a faulty power saving mode enabled?
> Jan 8 14:50:16 x299-desktop kernel: [ 1461.259293] nvme nvme0: Try
> "nvme_core.default_ps_max_latency_us=0 pcie_aspm=off" and report a bug
> Jan 8 14:50:16 x299-desktop kernel: [ 1461.331360] nvme 0000:01:00.0:
> enabling device (0000 -> 0002)
> Jan 8 14:50:16 x299-desktop kernel: [ 1461.331458] nvme nvme0: Removing
> after probe failure status: -19
> Jan 8 14:50:16 x299-desktop kernel: [ 1461.371389] nvme0n1: detected
> capacity change from 1953525168 to 0
> Jan 8 14:50:16 x299-desktop kernel: [ 1461.371389] BTRFS error (device
> nvme0n1p4): bdev /dev/nvme0n1p4 errs: wr 1, rd 0, flush 0, corrupt 0, gen 0
> Jan 8 14:50:16 x299-desktop kernel: [ 1461.371389] BTRFS error (device
> nvme0n1p4): bdev /dev/nvme0n1p4 errs: wr 2, rd 0, flush 0, corrupt 0, gen 0
> Jan 8 14:50:16 x299-desktop kernel: [ 1461.371392] BTRFS error (device
> nvme0n1p4): bdev /dev/nvme0n1p4 errs: wr 3, rd 0, flush 0, corrupt 0, gen 0
> Jan 8 14:50:16 x299-desktop kernel: [ 1461.371394] BTRFS error (device
> nvme0n1p4): bdev /dev/nvme0n1p4 errs: wr 5, rd 0, flush 0, corrupt 0, gen 0
> Jan 8 14:50:16 x299-desktop kernel: [ 1461.371405] BTRFS error (device
> nvme0n1p4): bdev /dev/nvme0n1p4 errs: wr 6, rd 0, flush 0, corrupt 0, gen 0
> Jan 8 14:50:16 x299-desktop kernel: [ 1461.371406] BTRFS error (device
> nvme0n1p4): bdev /dev/nvme0n1p4 errs: wr 7, rd 0, flush 0, corrupt 0, gen 0
> Jan 8 14:50:16 x299-desktop kernel: [ 1461.371411] BTRFS error (device
> nvme0n1p4): bdev /dev/nvme0n1p4 errs: wr 8, rd 0, flush 0, corrupt 0, gen 0
> Jan 8 14:50:16 x299-desktop kernel: [ 1461.371419] BTRFS error (device
> nvme0n1p4): bdev /dev/nvme0n1p4 errs: wr 9, rd 0, flush 0, corrupt 0, gen 0
> Jan 8 14:50:16 x299-desktop kernel: [ 1461.371425] BTRFS error (device
> nvme0n1p4): bdev /dev/nvme0n1p4 errs: wr 10, rd 0, flush 0, corrupt 0,
> gen 0
> Jan 8 14:50:16 x299-desktop kernel: [ 1461.371426] BTRFS error (device
> nvme0n1p4): bdev /dev/nvme0n1p4 errs: wr 11, rd 0, flush 0, corrupt 0,
> gen 0
> ```
>
> I have tried the suggestion in the log without luck.
>
> Attached is a log that includes two system freezes, as well as a list of
> PCI(e) devices created by Debian reportbug.
> The first freeze happens at "Jan 8 04:26:28" and the second freeze
> happens at "Jan 8 14:50:16".
>
> Currently, I am using git bisect to narrow down the window of possible
> commits, but since the issue appears seemingly random, it will take many
> months to identify the offending commit this way.
>
> The original Debian bug report is here:
> https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=1028309

Thanks for the report. To be sure the issue doesn't fall through the
cracks unnoticed, I'm adding it to regzbot, the Linux kernel regression
tracking bot:

#regzbot ^introduced v5.19..v6.0-rc7
#regzbot title nvme: system partially freezes with "nvme controller is down"
#regzbot ignore-activity

This isn't a regression? This issue or a fix for it are already
discussed somewhere else? It was fixed already? You want to clarify when
the regression started to happen? Or point out I got the title or
something else totally wrong? Then just reply and tell me -- ideally
while also telling regzbot about it, as explained by the page listed in
the footer of this mail.

Developers: When fixing the issue, remember to add 'Link:' tags pointing
to the report (the parent of this mail). See page linked in footer for
details.

Ciao, Thorsten (wearing his 'the Linux kernel's regression tracker' hat)
--
Everything you wanna know about Linux kernel regression tracking:
https://linux-regtracking.leemhuis.info/about/#tldr
That page also explains what to do if mails like this annoy you.
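
The quoted report says the workaround suggested in the log was tried without luck. Before ruling it out, it can help to confirm that both parameters actually reached the running kernel. The following is a minimal sketch, not something from the thread: `has_param` and `check_workaround` are hypothetical helper names, and the sysfs path is assumed to be exposed by the `nvme_core` module on the affected system.

```shell
#!/bin/sh
# Sketch: check whether the workaround parameters from the log are active.
# has_param and check_workaround are illustrative names, not existing tools.

# Return 0 if the exact token $2 appears in the space-separated string $1.
has_param() {
    case " $1 " in
        *" $2 "*) return 0 ;;
    esac
    return 1
}

# Print one "present:" or "missing:" line per workaround parameter.
check_workaround() {
    cmdline=$1
    for p in nvme_core.default_ps_max_latency_us=0 pcie_aspm=off; do
        if has_param "$cmdline" "$p"; then
            echo "present: $p"
        else
            echo "missing: $p"
        fi
    done
}

# On a live Linux system, inspect the command line the kernel booted with.
if [ -r /proc/cmdline ]; then
    check_workaround "$(cat /proc/cmdline)"
fi

# The value the nvme_core module is using, if the file is exposed.
f=/sys/module/nvme_core/parameters/default_ps_max_latency_us
if [ -r "$f" ]; then
    echo "default_ps_max_latency_us=$(cat "$f")"
fi
```

A "missing" line would suggest the bootloader configuration was not regenerated after editing, in which case the workaround has not really been tested yet.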

2023-01-12 17:21:37

by Bjorn Helgaas


Subject: Re: Regression in Kernel 6.0: System partially freezes with "nvme controller is down"

On Thu, Jan 12, 2023 at 03:48:46PM +0100, Linux kernel regression tracking (Thorsten Leemhuis) wrote:
> ...
> On 11.01.23 23:11, Julian Groß wrote:
> > Dear Maintainer,
> >
> > when running Linux Kernel version 6.0.12, 6.0.10, 6.0-rc7, or 6.1.4, my
> > system seemingly randomly freezes due to the file system being set to
> > read-only due to an issue with my NVMe controller.
> > The issue does *not* appear on Linux Kernel version 5.19.11 or lower.
> >
> > Through network logging I am able to catch the issue:
> > ```
> > Jan 8 14:50:16 x299-desktop kernel: [ 1461.259288] nvme nvme0:
> > controller is down; will reset: CSTS=0xffffffff, PCI_STATUS=0x10
> > Jan 8 14:50:16 x299-desktop kernel: [ 1461.259293] nvme nvme0: Does
> > your device have a faulty power saving mode enabled?
> > Jan 8 14:50:16 x299-desktop kernel: [ 1461.259293] nvme nvme0: Try
> > "nvme_core.default_ps_max_latency_us=0 pcie_aspm=off" and report a bug
> > Jan 8 14:50:16 x299-desktop kernel: [ 1461.331360] nvme 0000:01:00.0:
> > enabling device (0000 -> 0002)
> > ...
> >
> > I have tried the suggestion in the log without luck.
> >
> > Attached is a log that includes two system freezes, as well as a list of
> > PCI(e) devices created by Debian reportbug.
> > The first freeze happens at "Jan 8 04:26:28" and the second freeze
> > happens at "Jan 8 14:50:16".
> >
> > Currently, I am using git bisect to narrow down the window of possible
> > commits, but since the issue appears seemingly random, it will take many
> > months to identify the offending commit this way.
> >
> > The original Debian bug report is here:
> > https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=1028309

For some reason the log [1] has very little of the kernel dmesg log.
It does seem like the freeze is partial (I see messages for hundreds
or thousands of seconds after the nvme reset), but requires a reboot
to recover.

The lspci information [2] shows the 00:1b.0 Root Port leading to the
01:00.0 NVMe device.

Is it possible to collect lspci output after the nvme freeze? If so,
please save the output of:

sudo lspci -vv -s00:1b.0
sudo lspci -vv -s01:00.0

Make sure to run lspci as root so we can see the error logging
registers for these devices.

If you can collect more of the dmesg log after the freeze, e.g., via
the "dmesg" command, that might be helpful, too.

Bjorn

[1] https://bugs.debian.org/cgi-bin/bugreport.cgi?att=1;bug=1028309;filename=x299-desktop_crash.log.xz;msg=5
[2] https://bugs.debian.org/cgi-bin/bugreport.cgi?att=0;bug=1028309;msg=5
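
The collection steps requested above could be wrapped in a small script so that they are quick to run once the freeze has happened. This is only a sketch under some assumptions: `collect_nvme_state` is a made-up function name, the script is expected to run as root (so it calls `lspci` directly instead of through `sudo`), and the output directory has to live somewhere that is still writable after the NVMe device disappears.

```shell
#!/bin/sh
# Sketch: save the requested lspci and dmesg output after a freeze.
# Run as root so lspci can read the error logging registers.
# collect_nvme_state is an illustrative name, not an existing tool.

# Usage: collect_nvme_state OUTDIR DEVICE...
collect_nvme_state() {
    outdir=$1
    shift
    mkdir -p "$outdir" || return 1
    for dev in "$@"; do
        # -vv: very verbose; -s: select one device by bus:device.function
        lspci -vv -s "$dev" > "$outdir/lspci-$dev.txt" 2>&1
    done
    # Keep whatever is still in the kernel ring buffer.
    dmesg > "$outdir/dmesg.txt" 2>&1
}

# Example call (as root) for the Root Port and the NVMe device named above:
#   collect_nvme_state /dev/shm/nvme-freeze 00:1b.0 01:00.0
```

`/dev/shm` in the example call is an assumption: it is usually a tmpfs, so writes there should not depend on the failed drive, but any other writable location (a USB stick, a network share) would do as well.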

2023-02-17 12:40:04

by Linux regression tracking (Thorsten Leemhuis)


Subject: Re: Regression in Kernel 6.0: System partially freezes with "nvme controller is down"

Hi, this is your Linux kernel regression tracker. Top-posting for once,
to make this easily accessible to everyone.

I might be missing something, but it looks like this discussion stalled.
I wonder why.

Julian, did you ever share the data Bjorn asked for? Or try a
bisection, as suggested by Keith? Or did you stop caring for some
reason? Does everything maybe work fine these days?

Ciao, Thorsten (wearing his 'the Linux kernel's regression tracker' hat)
--
Everything you wanna know about Linux kernel regression tracking:
https://linux-regtracking.leemhuis.info/about/#tldr
If I did something stupid, please tell me, as explained on that page.

#regzbot poke
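
For readers unsure what the bisection mentioned above would involve, here is a rough sketch over the range tracked for this report (v5.19 as good, v6.0-rc7 as bad). The `git` commands are shown as comments because they need a kernel source tree; `bisect_steps` is a hypothetical helper that only estimates how many kernels would have to be built and tested.

```shell
#!/bin/sh
# Sketch: bisecting the tracked range, plus a rough step estimate.
#
# In a mainline kernel checkout:
#   git bisect start
#   git bisect bad v6.0-rc7
#   git bisect good v5.19
#   ... build, boot, test, then mark each step with
#   git bisect good    # or: git bisect bad
#
# Number of commits in the range:
#   git rev-list --count v5.19..v6.0-rc7

# bisect_steps N: rough number of halving steps for N commits,
# i.e. the base-2 logarithm of N, rounded up.
bisect_steps() {
    n=$1
    steps=0
    while [ "$n" -gt 1 ]; do
        n=$(( (n + 1) / 2 ))
        steps=$(( steps + 1 ))
    done
    echo "$steps"
}

# Example with an assumed commit count (not measured on the real range):
#   bisect_steps 16000
```

The step count itself is small. What makes this case slow is that the freeze appears at random, so every "good" verdict is only as trustworthy as the time the kernel was left running without a failure.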


2023-02-17 15:03:26

by Linux regression tracking (Thorsten Leemhuis)


Subject: Re: Regression in Kernel 6.0: System partially freezes with "nvme controller is down"

On 12.01.23 15:48, Linux kernel regression tracking (Thorsten Leemhuis)
wrote:
> On 11.01.23 23:11, Julian Groß wrote:
>>
>> when running Linux Kernel version 6.0.12, 6.0.10, 6.0-rc7, or 6.1.4, my
>> system seemingly randomly freezes due to the file system being set to
>> read-only due to an issue with my NVMe controller.
>> The issue does *not* appear on Linux Kernel version 5.19.11 or lower.
>>
>> Through network logging I am able to catch the issue:
>
> [...]
>
> #regzbot ^introduced v5.19..v6.0-rc7
> #regzbot title nvme: system partially freezes with "nvme controller is down"
> #regzbot ignore-activity

Stop tracking this for now:

#regzbot inconclusive: stalled and might be a hw issue
#regzbot ignore-activity

For details see:

https://lore.kernel.org/all/[email protected]/

Ciao, Thorsten (wearing his 'the Linux kernel's regression tracker' hat)