2023-07-31 19:45:49

by Keith Busch

[permalink] [raw]
Subject: Re: [PATCH] nvme: Don't fail to resume if NSIDs change

On Mon, Jul 31, 2023 at 01:51:03PM -0500, Mario Limonciello wrote:
> Samsung PM9B1 has problems after resume because NSID has changed.
> This has been reported in the past on OEM varities of PM9B1 parts
> and fixed by firmware updates on 'some' of those parts.
>
> However this same issue also happens on 'retail' PM9B1 parts which
> Samsung has not released firmware updates for.
>
> As the check has been relaxed at startup for multiple disks with
> duplicate NSIDs with commit ac522fc6c3165 ("nvme: don't reject
> probe due to duplicate IDs for single-ported PCIe devices") also
> relax the check that runs on resume for NSIDs and mark them bogus
> if this occurs on resume.

How could the driver tell the difference between the device needing a
quirk compared to a rapid delete-create-attach namespace sequence?
Proceeding with the namespace now may get dirty writes intended for the
previous namespace, corrupting the new one.

The commit you mentioned tries to constrain allowing duplication where
we can reasonably assume the quirk is needed. If we need to do similiar
for this condition, one possible constraint might be that the device
doesn't report OACS bit 3 (Namespace Management).


2023-07-31 20:09:35

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [PATCH] nvme: Don't fail to resume if NSIDs change

On Mon, Jul 31, 2023 at 01:10:11PM -0600, Keith Busch wrote:
> > As the check has been relaxed at startup for multiple disks with
> > duplicate NSIDs with commit ac522fc6c3165 ("nvme: don't reject
> > probe due to duplicate IDs for single-ported PCIe devices") also
> > relax the check that runs on resume for NSIDs and mark them bogus
> > if this occurs on resume.
>
> How could the driver tell the difference between the device needing a
> quirk compared to a rapid delete-create-attach namespace sequence?
> Proceeding with the namespace now may get dirty writes intended for the
> previous namespace, corrupting the new one.
>
> The commit you mentioned tries to constrain allowing duplication where
> we can reasonably assume the quirk is needed. If we need to do similiar
> for this condition, one possible constraint might be that the device
> doesn't report OACS bit 3 (Namespace Management).

Yes, this patch as-is looks really dangerous. I don't think we should
just ignore the fact that IDs change when queried again.

2023-07-31 20:34:03

by August Wikerfors

[permalink] [raw]
Subject: Re: [PATCH] nvme: Don't fail to resume if NSIDs change

On 2023-07-31 21:10, Keith Busch wrote:
> On Mon, Jul 31, 2023 at 01:51:03PM -0500, Mario Limonciello wrote:
>> Samsung PM9B1 has problems after resume because NSID has changed.
>> This has been reported in the past on OEM varities of PM9B1 parts
>> and fixed by firmware updates on 'some' of those parts.
>>
>> However this same issue also happens on 'retail' PM9B1 parts which
>> Samsung has not released firmware updates for.
>>
>> As the check has been relaxed at startup for multiple disks with
>> duplicate NSIDs with commit ac522fc6c3165 ("nvme: don't reject
>> probe due to duplicate IDs for single-ported PCIe devices") also
>> relax the check that runs on resume for NSIDs and mark them bogus
>> if this occurs on resume.
>
> How could the driver tell the difference between the device needing a
> quirk compared to a rapid delete-create-attach namespace sequence?
> Proceeding with the namespace now may get dirty writes intended for the
> previous namespace, corrupting the new one.
>
> The commit you mentioned tries to constrain allowing duplication where
> we can reasonably assume the quirk is needed. If we need to do similiar
> for this condition, one possible constraint might be that the device
> doesn't report OACS bit 3 (Namespace Management).

It looks like that would work for the PM9B1:
> $ sudo nvme id-ctrl -H /dev/nvme0
> [...] > oacs : 0x17
> [10:10] : 0 Lockdown Command and Feature Not Supported
> [9:9] : 0 Get LBA Status Capability Not Supported
> [8:8] : 0 Doorbell Buffer Config Not Supported
> [7:7] : 0 Virtualization Management Not Supported
> [6:6] : 0 NVMe-MI Send and Receive Not Supported
> [5:5] : 0 Directives Not Supported
> [4:4] : 0x1 Device Self-test Supported
> [3:3] : 0 NS Management and Attachment Not Supported
> [2:2] : 0x1 FW Commit and Download Supported
> [1:1] : 0x1 Format NVM Supported
> [0:0] : 0x1 Security Send and Receive Supported

Regards,
August Wikerfors