2022-04-06 14:37:00

by Jason A. Donenfeld

Subject: Re: [bug report]nvme0: Admin Cmd(0x6), I/O Error (sct 0x0 / sc 0x2) MORE DNR observed during blktests

Hi Alan,

I too am seeing this. Tracking it down to the same commit, I decided to
enable NVME_VERBOSE_ERRORS to get some more information. Now on boot and
every time I wake up from sleep, I see:

[ 89.098578] nvme nvme0: Shutdown timeout set to 8 seconds
[ 89.098683] nvme0: Identify(0x6), Invalid Field in Command (sct 0x0 / sc 0x2) MORE
[ 89.119363] nvme nvme0: 16/0/0 default/read/poll queues

With that middle line in red.

Question is: is this actually an error? If not, maybe it shouldn't be
printed as a KERN_ERR. And if it's printed as a KERN_INFO, maybe it
should only do so when CONFIG_NVME_VERBOSE_ERRORS=y? Or do you think
there is actually some other diagnostic value in having this print
always?

Using a Samsung SSD 970 EVO Plus 2TB, firmware version 2B2QEXM7, in case
that's useful info.

I also noticed a ~2 second boot delay on 5.18-rc1:

[ 0.917631] pstore: Using crash dump compression: deflate
[ 0.917807] Key type encrypted registered
[ 0.951840] ACPI: battery: Slot [BAT0] (battery present)
[ 3.146765] nvme nvme0: Shutdown timeout set to 8 seconds
[ 3.146918] nvme0: Identify(0x6), Invalid Field in Command (sct 0x0 / sc 0x2) MORE
[ 3.188852] nvme nvme0: 16/0/0 default/read/poll queues
[ 3.198163] nvme0n1: p1 p2
[ 3.199554] Freeing unused kernel image (initmem) memory: 12952K

I haven't looked into it much, but I assume it's also NVMe related? Or
maybe the vconsole is just initializing faster so I see text where
before I didn't. Not sure.

Regards,
Jason


2022-06-09 08:28:51

by Jason A. Donenfeld

Subject: 2 second nvme initialization delay regression in 5.18 [Was: Re: [bug report]nvme0: Admin Cmd(0x6), I/O Error (sct 0x0 / sc 0x2) MORE DNR observed during blktests]

Hi folks,

On Tue, Apr 5, 2022 at 10:00 PM Jason A. Donenfeld wrote:
> Using a Samsung SSD 970 EVO Plus 2TB, firmware version 2B2QEXM7, in case
> that's useful info.
>
> I also noticed a ~2 second boot delay on 5.18-rc1:

Just FYI, I am still seeing this delay in 5.19-rc1.

Boot lines from 5.17:

[ 0.882680] nvme nvme1: missing or invalid SUBNQN field.
[ 0.882719] nvme nvme1: Shutdown timeout set to 10 seconds
[ 0.885227] nvme nvme1: 8/0/0 default/read/poll queues
[ 0.887910] nvme1n1: p1 p2 p3
[ 0.888317] nvme nvme0: missing or invalid SUBNQN field.
[ 0.888361] nvme nvme0: Shutdown timeout set to 8 seconds
[ 0.906301] nvme nvme0: 16/0/0 default/read/poll queues
[ 0.910087] nvme0n1: p1 p2

Boot lines from 5.18 & 5.19:

[ 0.846827] nvme nvme1: missing or invalid SUBNQN field.
[ 0.846857] nvme nvme1: Shutdown timeout set to 10 seconds
[ 0.849043] nvme nvme1: 8/0/0 default/read/poll queues
[ 0.851595] nvme1n1: p1 p2 p3
[ 3.226962] nvme nvme0: Shutdown timeout set to 8 seconds
[ 3.253890] nvme nvme0: 16/0/0 default/read/poll queues
[ 3.263778] nvme0n1: p1 p2

The Samsung 970 EVO Plus has a ~2 second delay that wasn't there in 5.17.

Any idea what's going on?

Thanks,
Jason

2022-06-09 09:15:15

by Jason A. Donenfeld

Subject: Re: 2 second nvme initialization delay regression in 5.18 [Was: Re: [bug report]nvme0: Admin Cmd(0x6), I/O Error (sct 0x0 / sc 0x2) MORE DNR observed during blktests]

Hey again,

Figured it out. 2.3 seconds to be exact... It looks like this is caused by:

bc360b0b1611 ("nvme-pci: add quirks for Samsung X5 SSDs")
https://lore.kernel.org/all/[email protected]/

This commit doesn't have any justification and got applied without
much discussion. Perhaps Monish could supply some more info about why
this is needed here? FTR, I have no issues on my system when reverting
that. Perhaps it should be reverted. (I can send a revert commit for
that if necessary.)

Looking further, however, the PCIe ID is said to be for a "Samsung
X5", which Google says is a portable thunderbolt drive. Is the PCIe ID
correct? On my system, this is the PCIe ID of a Samsung 970 EVO Plus.
Is it possible that Monish copied and pasted the wrong PCIe ID? Or has
Samsung *reused* the same PCIe ID on both devices? In which case, we'd
need some additional data for that quirk to avoid the delay.

Also note that this (potentially errant) commit has been backported to stable.

Jason

2022-06-09 09:57:48

by Jason A. Donenfeld

Subject: Re: 2 second nvme initialization delay regression in 5.18 [Was: Re: [bug report]nvme0: Admin Cmd(0x6), I/O Error (sct 0x0 / sc 0x2) MORE DNR observed during blktests]

Hi Monish,

On Thu, Jun 09, 2022 at 09:32:02AM +0000, R, Monish Kumar wrote:
> Hi Jason,
>
> I would like to provide the justification for the Samsung X5 SSD fix.
> We were facing an SSD enumeration issue: after a cold or warm reboot with
> the device connected, we end up with probe failures.
>
> When I debugged this issue, I found that the device was not enumerating
> once the system had booted. Moreover, the enumeration issue was
> specific to this device.
>
> Based on analysis, the device fails to enumerate because of its deep power
> state. So I added the following quirks as a workaround, and they help the
> device enumerate after a cold/warm reboot. If new Samsung X5 SSDs work as
> expected, we can remove the fix.

FWIW, all of that should have been in the commit message. Also, "based
on analysis" - what analysis exactly? I have no way of thinking more
about the issue at hand other than, "Monish said things are like this in
a lab".

In any case, I believe the 970 ID predates that of the X5, and
destroying battery on those laptops and introducing boot time delays
isn't really okay. So let's just revert this until somebody can work out
better how to differentiate drives that need a quirk from drives that
don't need a quirk.

I sent this in: https://lore.kernel.org/lkml/[email protected]/

Jason

2022-06-09 10:11:17

by R, Monish Kumar

Subject: RE: 2 second nvme initialization delay regression in 5.18 [Was: Re: [bug report]nvme0: Admin Cmd(0x6), I/O Error (sct 0x0 / sc 0x2) MORE DNR observed during blktests]

Hi Jason,

I would like to provide the justification for the Samsung X5 SSD fix.
We were facing an SSD enumeration issue: after a cold or warm reboot with
the device connected, we end up with probe failures.

When I debugged this issue, I found that the device was not enumerating
once the system had booted. Moreover, the enumeration issue was
specific to this device.

Based on analysis, the device fails to enumerate because of its deep power
state. So I added the following quirks as a workaround, and they help the
device enumerate after a cold/warm reboot. If new Samsung X5 SSDs work as
expected, we can remove the fix.

Regarding the PCI IDs, I have confirmed from the logs that the vendor ID is
0x144d and the device ID is 0xa808. I am not sure why the Samsung 970 EVO
Plus has the same PCI IDs.

Logs for reference :
After connecting Samsung X5 SSD.

lspci
04:00.0 Non-Volatile memory controller: Samsung Electronics Co Ltd NVMe SSD Controller SM981/PM981/PM983

dmesg
Line 1478: <6>[ 112.838998] pci 0000:04:00.0: [144d:a808] type 00 class 0x010802
Line 1479: <6>[ 112.845765] pci 0000:04:00.0: reg 0x10: [mem 0x00000000-0x00003fff 64bit]
Line 1480: <6>[ 112.853715] pci 0000:04:00.0: 8.000 Gb/s available PCIe bandwidth, limited by 2.5 GT/s PCIe x4 link at 0000:00:07.0 (capable of 31.504 Gb/s with 8.0 GT/s PCIe x4 link)
Line 1481: <6>[ 112.870536] pci 0000:04:00.0: Adding to iommu group 22
Line 1498: <6>[ 113.019698] pci 0000:04:00.0: BAR 0: assigned [mem 0x83000000-0x83003fff 64bit]

Regards,
Monish Kumar R

2022-06-10 07:04:53

by Christoph Hellwig

Subject: Re: 2 second nvme initialization delay regression in 5.18 [Was: Re: [bug report]nvme0: Admin Cmd(0x6), I/O Error (sct 0x0 / sc 0x2) MORE DNR observed during blktests]

On Thu, Jun 09, 2022 at 11:38:47AM +0200, Jason A. Donenfeld wrote:
> FWIW, all of that should have been in the commit message. Also, "based
> on analysis" - what analysis exactly? I have no way of thinking more
> about the issue at hand other than, "Monish said things are like this in
> a lab".

Please calm down a bit. His report is at least as good as your new
report here..

> In any case, I believe the 970 ID predates that of the X5, and

Huh?

The 970 seems to actually be very slightly newer than the X5. What
I suspect is that they actually are the same m.2 SSD or at least a
very similar one and Samsung decided to ship it in the thunderbolt
attached versions first. Maybe one of the Samsung folks here can
confirm.

That leaves us with two plausible theories:

- the problems could be due to an earlier firmware version or
ASIC stepping
- the problems are due to the thunderbolt attachment

Monish and Jason, can you please send me the output of nvme id-ctrl
/dev/nvmeX (where /dev/nvmeX is the actual device number)?

Monish, can you check if you are using the latest available firmware
and if not update it and check if you still need the quirks.


> destroying battery on those laptops and introducing boot time delays
> isn't really okay. So let's just revert this until somebody can work out
> better how to differentiate drives that need a quirk from drives that
> don't need a quirk.

While I'd really like to fix those issues, they are less severe than
not being able to use a device at all. And just as a reminder: if you
want to get anything please be nice to people and try to work with them
productively.

2022-06-10 10:05:39

by Jason A. Donenfeld

Subject: Re: 2 second nvme initialization delay regression in 5.18 [Was: Re: [bug report]nvme0: Admin Cmd(0x6), I/O Error (sct 0x0 / sc 0x2) MORE DNR observed during blktests]

Hi Christoph,

On Fri, Jun 10, 2022 at 08:14:49AM +0200, Christoph Hellwig wrote:
> That leaves us with two plausible theories:
>
> - the problems could be due to an earlier firmware version or
> ASIC stepping
> - the problems are due to the thunderbolt attachment

Right, that seems like the set of variance we're dealing with. If it's a
firmware version issue, then we revert because people can update? Or can
we quirk firmware version numbers too? If it's ASIC stepping, I guess we
need to quirk that. And likewise thunderbolt, but that seems more
awkward to quirk around, because afaik, it all just appears as PCIe?

> Monish and Jason, can you please send me the output of nvme id-ctrl
> /dev/nvmeX (where /dev/nvmeX is the actual device number)?

NVME Identify Controller:
vid : 0x144d
ssvid : 0x144d
sn : <redacted>
mn : Samsung SSD 970 EVO Plus 2TB
fr : 2B2QEXM7
rab : 2
ieee : 002538
cmic : 0
mdts : 9
cntlid : 0x4
ver : 0x10300
rtd3r : 0x30d40
rtd3e : 0x7a1200
oaes : 0
ctratt : 0
rrls : 0
cntrltype : 0
fguid :
crdt1 : 0
crdt2 : 0
crdt3 : 0
nvmsr : 0
vwci : 0
mec : 0
oacs : 0x17
acl : 7
aerl : 3
frmw : 0x16
lpa : 0x3
elpe : 63
npss : 4
avscc : 0x1
apsta : 0x1
wctemp : 358
cctemp : 358
mtfa : 0
hmpre : 0
hmmin : 0
tnvmcap : 2000398934016
unvmcap : 0
rpmbs : 0
edstt : 35
dsto : 0
fwug : 0
kas : 0
hctma : 0x1
mntmt : 356
mxtmt : 358
sanicap : 0
hmminds : 0
hmmaxd : 0
nsetidmax : 0
endgidmax : 0
anatt : 0
anacap : 0
anagrpmax : 0
nanagrpid : 0
pels : 0
domainid : 0
megcap : 0
sqes : 0x66
cqes : 0x44
maxcmd : 0
nn : 1
oncs : 0x5f
fuses : 0
fna : 0x5
vwc : 0x1
awun : 1023
awupf : 0
icsvscc : 1
nwpc : 0
acwu : 0
ocfs : 0
sgls : 0
mnan : 0
maxdna : 0
maxcna : 0
subnqn :
ioccsz : 0
iorcsz : 0
icdoff : 0
fcatt : 0
msdbd : 0
ofcs : 0
ps 0 : mp:7.50W operational enlat:0 exlat:0 rrt:0 rrl:0
rwt:0 rwl:0 idle_power:- active_power:-
ps 1 : mp:5.90W operational enlat:0 exlat:0 rrt:1 rrl:1
rwt:1 rwl:1 idle_power:- active_power:-
ps 2 : mp:3.60W operational enlat:0 exlat:0 rrt:2 rrl:2
rwt:2 rwl:2 idle_power:- active_power:-
ps 3 : mp:0.0700W non-operational enlat:210 exlat:1200 rrt:3 rrl:3
rwt:3 rwl:3 idle_power:- active_power:-
ps 4 : mp:0.0050W non-operational enlat:2000 exlat:8000 rrt:4 rrl:4
rwt:4 rwl:4 idle_power:- active_power:-

Jason

2022-06-10 12:26:58

by Pankaj Raghav

Subject: Re: 2 second nvme initialization delay regression in 5.18 [Was: Re: [bug report]nvme0: Admin Cmd(0x6), I/O Error (sct 0x0 / sc 0x2) MORE DNR observed during blktests]

On Fri, Jun 10, 2022 at 08:14:49AM +0200, Christoph Hellwig wrote:
> The 970 seems to actually be very slightly newer than the X5. What
> I suspect is that they actually are the same m.2 SSD or at least a
> very similar one and Samsung decided to ship it in the thunderbolt
> attached versions first. Maybe one of the Samsung folks here can
> confirm.
>
> That leaves us with two plausible theories:
>
> - the problems could be due to an earlier firmware version or
> ASIC stepping
> - the problems are due to the thunderbolt attachment
>
I have forwarded this report internally within Samsung and I will post an
update once I have more information about this issue.

Cheers,
Pankaj

2022-06-13 06:38:17

by R, Monish Kumar

Subject: RE: 2 second nvme initialization delay regression in 5.18 [Was: Re: [bug report]nvme0: Admin Cmd(0x6), I/O Error (sct 0x0 / sc 0x2) MORE DNR observed during blktests]

Hi Christoph,

Please see the below nvme id-ctrl response of Samsung X5 SSD.

NVME Identify Controller:
vid : 0x144d
ssvid : 0x144d
sn : <redacted>
mn : Samsung Portable SSD X5
fr : 1P3QEXE7
rab : 2
ieee : 002538
cmic : 0
mdts : 9
cntlid : 4
ver : 10300
rtd3r : 30d40
rtd3e : 7a1200
oaes : 0
ctratt : 0
rrls : 0
oacs : 0x7
acl : 7
aerl : 3
frmw : 0x16
lpa : 0x3
elpe : 63
npss : 4
avscc : 0x1
apsta : 0x1
wctemp : 329
cctemp : 330
mtfa : 0
hmpre : 0
hmmin : 0
tnvmcap : 500107862016
unvmcap : 0
rpmbs : 0
edstt : 0
dsto : 0
fwug : 0
kas : 0
hctma : 0
mntmt : 0
mxtmt : 0
sanicap : 0
hmminds : 0
hmmaxd : 0
nsetidmax : 0
sqes : 0x66
cqes : 0x44
maxcmd : 0
nn : 1
oncs : 0x1f
fuses : 0
fna : 0x5
vwc : 0x1
awun : 127
awupf : 0
nvscc : 1
acwu : 0
sgls : 0
subnqn :
ioccsz : 0
iorcsz : 0
icdoff : 0
ctrattr : 0
msdbd : 0
ps 0 : mp:6.20W operational enlat:0 exlat:0 rrt:0 rrl:0
rwt:0 rwl:0 idle_power:- active_power:-
ps 1 : mp:4.30W operational enlat:0 exlat:0 rrt:1 rrl:1
rwt:1 rwl:1 idle_power:- active_power:-
ps 2 : mp:2.10W operational enlat:0 exlat:0 rrt:2 rrl:2
rwt:2 rwl:2 idle_power:- active_power:-
ps 3 : mp:0.0400W non-operational enlat:210 exlat:1200 rrt:3 rrl:3
rwt:3 rwl:3 idle_power:- active_power:-
ps 4 : mp:0.0050W non-operational enlat:2000 exlat:8000 rrt:4 rrl:4
rwt:4 rwl:4 idle_power:- active_power:-

Regards,
Monish Kumar R

2022-06-13 17:00:25

by R, Monish Kumar

Subject: RE: 2 second nvme initialization delay regression in 5.18 [Was: Re: [bug report]nvme0: Admin Cmd(0x6), I/O Error (sct 0x0 / sc 0x2) MORE DNR observed during blktests]

Jason,

I am not sure whether the running firmware (fr: 1P3QEXE7) is the latest one.
Moreover, this device is a newer one; we bought it a couple of months back.

When I tried to update the SSD firmware using the Samsung Portable SSD Software,
it failed to communicate with the Samsung servers.

Perhaps Pankaj Raghav can get back on this and confirm the firmware version,
as he reported this issue internally to Samsung.

Regards,
Monish Kumar R

2022-06-13 17:22:26

by Jason A. Donenfeld

Subject: Re: 2 second nvme initialization delay regression in 5.18 [Was: Re: [bug report]nvme0: Admin Cmd(0x6), I/O Error (sct 0x0 / sc 0x2) MORE DNR observed during blktests]

On Mon, Jun 13, 2022 at 06:36:44AM +0000, R, Monish Kumar wrote:
> mn : Samsung Portable SSD X5
> fr : 1P3QEXE7

Isn't the latest firmware for the QEXE7 line 2B2QEXE7?

Jason

2022-06-13 18:36:22

by Christoph Hellwig

Subject: Re: 2 second nvme initialization delay regression in 5.18 [Was: Re: [bug report]nvme0: Admin Cmd(0x6), I/O Error (sct 0x0 / sc 0x2) MORE DNR observed during blktests]

On Fri, Jun 10, 2022 at 11:19:31AM +0200, Jason A. Donenfeld wrote:
> Right, that seems like the set of variance we're dealing with. If it's a
> firmware version issue, then we revert because people can update? Or can
> we quirk firmware version numbers too?

We can quirk on firmware version and model number as well. Those quirks
need to go into the core nvme module and not just the PCI driver, though.

> If it's ASIC stepping, I guess we
> need to quirk that. And likewise thunderbolt, but that seems more
> awkward to quirk around, because afaik, it all just appears as PCIe?

It all appears as PCIe, but the pci_dev has an is_thunderbolt flag.

Thanks to both of you for the information. I'd like to wait until the
end of the week or so if we can hear something from Samsung, and if we
don't we'll have to quirk based on the model number.

2022-06-15 10:52:04

by Pankaj Raghav

Subject: Re: 2 second nvme initialization delay regression in 5.18 [Was: Re: [bug report]nvme0: Admin Cmd(0x6), I/O Error (sct 0x0 / sc 0x2) MORE DNR observed during blktests]

Hi Christoph,
On Mon, Jun 13, 2022 at 03:55:49PM +0200, Christoph Hellwig wrote:
> It all appears as PCIe, but the pci_dev has an is_thunderbolt flag.
>
> Thanks to both of you for the information. I'd like to wait until the
> end of the week or so if we can hear something from Samsung, and if we
> don't we'll have to quirk based on the model number.

Our FW team has started looking into the issue. They said they will try
to come up with a solution before 5.20. If not, we can add this quirk
based on the FW version, and a proper solution can be added by 5.21.

2022-06-15 11:48:08

by Christoph Hellwig

Subject: Re: 2 second nvme initialization delay regression in 5.18 [Was: Re: [bug report]nvme0: Admin Cmd(0x6), I/O Error (sct 0x0 / sc 0x2) MORE DNR observed during blktests]

On Wed, Jun 15, 2022 at 12:27:57PM +0200, Pankaj Raghav wrote:
> Hi Christoph,
> On Mon, Jun 13, 2022 at 03:55:49PM +0200, Christoph Hellwig wrote:
> > It all appears as PCIe, but the pci_dev has an is_thunderbolt flag.
> >
> > Thanks to both of you for the information. I'd like to wait until the
> > end of the week or so if we can hear something from Samsung, and if we
> > don't we'll have to quirk based on the model number.
> Our FW team has started looking into the issue. They said they will try
> to come up with a solution before 5.20. If not, we can add this quirk
> based on the FW ver. and a proper solution can be added by 5.21.

I don't think we can wait for 5.20 - the "offending" commit is in
5.19-rc and -stable. So I'll plan to prepare a patch based on the model
number for now, still hoping we can come up with something better
eventually.

2022-06-20 03:58:41

by SungHwan Jung

Subject: Re: 2 second nvme initialization delay regression in 5.18 [Was: Re: [bug report]nvme0: Admin Cmd(0x6), I/O Error (sct 0x0 / sc 0x2) MORE DNR observed during blktests]

> I don't think we can wait for 5.20 - the "offending" commit is in
> 5.19-rc and -stable. So I'll plan to prepare a patch based on the model
> number for now, still hoping we can come up with something better
> eventually.

Hi,
Some Samsung OEM SSDs also have identical PCI IDs and are affected by this quirk,
but they have different subsystem IDs.

For example,

model number: MZQLB1T9HAJR-00V3 (for Lenovo)
vendor: 144d
device: a808
subvendor: 1d49
subdevice: 403b

model number: MZVLB256HBHQ-00000
vendor: 144d
device: a808
subvendor: 144d
subdevice: a801

Adding the X5's subsystem IDs to the pci_device_id entry (as below) may solve this problem.

diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index 17aeb7d5c48522..92fd3b1d88fc95 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -3475,7 +3475,7 @@ static const struct pci_device_id nvme_id_table[] = {
 				NVME_QUIRK_128_BYTES_SQES |
 				NVME_QUIRK_SHARED_TAGS |
 				NVME_QUIRK_SKIP_CID_GEN },
-	{ PCI_DEVICE(0x144d, 0xa808),   /* Samsung X5 */
+	{ PCI_DEVICE_SUB(0x144d, 0xa808, {X5 subvendor?}, {X5 subdevice?}),   /* Samsung X5 */
 		.driver_data =  NVME_QUIRK_DELAY_BEFORE_CHK_RDY|
 				NVME_QUIRK_NO_DEEPEST_PS |
 				NVME_QUIRK_IGNORE_DEV_SUBNQN, },


But I don't know the X5's subsystem IDs. Can someone provide them?

2022-06-20 04:43:51

by SungHwan Jung

Subject: Re: 2 second nvme initialization delay regression in 5.18 [Was: Re: [bug report]nvme0: Admin Cmd(0x6), I/O Error (sct 0x0 / sc 0x2) MORE DNR observed during blktests]

Hi again,
Sorry for sending another mail.

I think we should confirm with Samsung the subsystem IDs of all X5 models, and that these IDs are not used by other SSDs.

But I have no idea how to contact them. Please forward this to them, or if you already have this information, please test the patch in my previous mail.

2022-06-20 07:42:04

by Christoph Hellwig

Subject: Re: 2 second nvme initialization delay regression in 5.18 [Was: Re: [bug report]nvme0: Admin Cmd(0x6), I/O Error (sct 0x0 / sc 0x2) MORE DNR observed during blktests]

On Mon, Jun 20, 2022 at 12:36:27PM +0900, onenowy wrote:
> Some Samsung OEM SSDs also have identical PCI IDs and are affected by this quirk,
> but they have different subsystem IDs.

> Adding the X5's subsystem IDs to the pci_device_id entry (as below) may solve this problem.

Monish, can you look into that?