2023-06-26 02:00:21

by Bagas Sanjaya

[permalink] [raw]
Subject: Fwd: Need NVME QUIRK BOGUS for SAMSUNG MZ1WV480HCGL-000MV (Samsung SM-953 Datacenter SSD)

Hi,

I notice a regression report on Bugzilla [1]. Quoting from it:

> Hello,
> since upgrading one of my hosts to KErnel 6.2 i have a problem with 6 nvme disks (SAMSUNG MZ1WV480HCGL-000MV).
>
> This problem didn't occur with Kernel 5.15.
>
> Here is log from dmesg | grep nvme:
>
> nvme nvme0: pci function 0000:04:00.0
> nvme nvme1: pci function 0000:05:00.0
> nvme nvme2: pci function 0000:06:00.0
> nvme nvme3: pci function 0000:07:00.0
> nvme nvme4: pci function 0000:0a:00.0
> nvme nvme5: pci function 0000:0b:00.0
> nvme nvme0: 8/0/0 default/read/poll queues
> nvme nvme1: 8/0/0 default/read/poll queues
> nvme nvme2: 8/0/0 default/read/poll queues
> nvme nvme3: 8/0/0 default/read/poll queues
> nvme nvme4: 8/0/0 default/read/poll queues
> nvme nvme5: 8/0/0 default/read/poll queues
> nvme nvme1: globally duplicate IDs for nsid 1
> nvme nvme1: VID:DID 144d:a802 model:SAMSUNG MZ1WV480HCGL-000MV firmware:BXU87M9Q
> nvme nvme2: globally duplicate IDs for nsid 1
> nvme nvme2: VID:DID 144d:a802 model:SAMSUNG MZ1WV480HCGL-000MV firmware:BXU87M9Q
> nvme nvme4: globally duplicate IDs for nsid 1
> nvme nvme3: globally duplicate IDs for nsid 1
> nvme nvme4: VID:DID 144d:a802 model:SAMSUNG MZ1WV480HCGL-000MV firmware:BXU87M9Q
> nvme nvme3: VID:DID 144d:a802 model:SAMSUNG MZ1WV480HCGL-000MV firmware:BXU87M9Q
> nvme nvme5: globally duplicate IDs for nsid 1
> nvme nvme5: VID:DID 144d:a802 model:SAMSUNG MZ1WV480HCGL-000MV firmware:BXU87M9Q
> nvme0n1: p1 p9
>
> Only one drive, out of 6 is usable.
>
> These drives are rather common datacenter drives (Samsung SM-953).
>
> From another Bug-Report (https://bugzilla.kernel.org/show_bug.cgi?id=217384):
>
> To fix it, please add two strings in pci.c
>
> { PCI_DEVICE(0x144d, 0xa802), /* SAMSUNG MZ1WV480HCGL-000MV */
> .driver_data = NVME_QUIRK_BOGUS_NID, },
>
> Thanks in advance.

See Bugzilla for the full thread.

The reporter had a quirk (see above) that fixed this regression,
nevertheless I'm adding it to regzbot to make sure it doesn't fall
through cracks unnoticed:

#regzbot introduced: 86c2457a8e8112f https://bugzilla.kernel.org/show_bug.cgi?id=217593
#regzbot title: NVME_QUIRK_BOGUS_NID is needed for SAMSUNG MZ1WV480HCGL-000MV

Thanks.

[1]: https://bugzilla.kernel.org/show_bug.cgi?id=217593

--
An old man doll... just what I always wanted! - Clara


Subject: Re: Fwd: Need NVME QUIRK BOGUS for SAMSUNG MZ1WV480HCGL-000MV (Samsung SM-953 Datacenter SSD)

On 26.06.23 17:39, Chaitanya Kulkarni wrote:
> On 6/25/2023 6:15 PM, Bagas Sanjaya wrote:
>>
>> I notice a regression report on Bugzilla [1]. Quoting from it:
>
> [...]
>
>>> Only one drive, out of 6 is usable.
>>>
>>> These drives are rather common datacenter drives (Samsung SM-953).
>>>
>>> From another Bug-Report (https://bugzilla.kernel.org/show_bug.cgi?id=217384):
>>>
>>> To fix it, please add two strings in pci.c
>>>
>>> { PCI_DEVICE(0x144d, 0xa802), /* SAMSUNG MZ1WV480HCGL-000MV */
>>> .driver_data = NVME_QUIRK_BOGUS_NID, },
>
> Please send a formal patch I'll be happy to review it ..

Huh, sorry, but why?

You might not know this, but Bagas is somebody that helps with
regression tracking (many thx for this BTW), so he's not even affected
by the problem -- and most likely is neither familiar with the code or
has the hardware to test the quirk entry.

Sure, asking others for help is fine. But from what I see this is
something that should be handled by those that developed the culprit.

Ciao, Thorsten (wearing his 'the Linux kernel's regression tracker' hat)
--
Everything you wanna know about Linux kernel regression tracking:
https://linux-regtracking.leemhuis.info/about/#tldr
If I did something stupid, please tell me, as explained on that page.


2023-06-27 16:23:23

by Keith Busch

[permalink] [raw]
Subject: Re: Fwd: Need NVME QUIRK BOGUS for SAMSUNG MZ1WV480HCGL-000MV (Samsung SM-953 Datacenter SSD)

On Mon, Jun 26, 2023 at 08:15:49AM +0700, Bagas Sanjaya wrote:
> See Bugzilla for the full thread.
>
> The reporter had a quirk (see above) that fixed this regression,
> nevertheless I'm adding it to regzbot to make sure it doesn't fall
> through cracks unnoticed:
>
> #regzbot introduced: 86c2457a8e8112f https://bugzilla.kernel.org/show_bug.cgi?id=217593
> #regzbot title: NVME_QUIRK_BOGUS_NID is needed for SAMSUNG MZ1WV480HCGL-000MV

These bug reports really should go to the vendors that created the
broken device with non-unique "unique" fields. It's easy enough for me
to write the quirk patch, but it's not the ideal solution and may harm
devices/firmwares with the same VID:DID that don't have this problem.
Not being the vendor, I'm not in a postition to know that, so CC'ing
some Samsung folks.

2023-06-30 15:28:47

by Clemens Springsguth

[permalink] [raw]
Subject: Re: Fwd: Need NVME QUIRK BOGUS for SAMSUNG MZ1WV480HCGL-000MV (Samsung SM-953 Datacenter SSD)

Hi Pankaj,

i would be very very grateful if you would create and submit the quirk
as a proper patch.

kind regards
Clemens


On Fri, Jun 30, 2023 at 10:40 AM Pankaj Raghav <[email protected]> wrote:
>
> On 2023-06-27 18:10, Keith Busch wrote:
> > On Mon, Jun 26, 2023 at 08:15:49AM +0700, Bagas Sanjaya wrote:
> >> See Bugzilla for the full thread.
> >>
> >> The reporter had a quirk (see above) that fixed this regression,
> >> nevertheless I'm adding it to regzbot to make sure it doesn't fall
> >> through cracks unnoticed:
> >>
> >> #regzbot introduced: 86c2457a8e8112f https://bugzilla.kernel.org/show_bug.cgi?id=217593
> >> #regzbot title: NVME_QUIRK_BOGUS_NID is needed for SAMSUNG MZ1WV480HCGL-000MV
> >
> > These bug reports really should go to the vendors that created the
> > broken device with non-unique "unique" fields. It's easy enough for me
> > to write the quirk patch, but it's not the ideal solution and may harm
> > devices/firmwares with the same VID:DID that don't have this problem.
> > Not being the vendor, I'm not in a postition to know that, so CC'ing
> > some Samsung folks.
>
> I got a reply from our team internally. There is no plan on releasing
> a new firmware for this model and the related models with the same VID:DID
> pair. We can add a quirk.
>
> @Clemens: Let me know if you want me to send the quirk as a proper patch.

2023-06-30 16:32:41

by Pankaj Raghav

[permalink] [raw]
Subject: Re: Fwd: Need NVME QUIRK BOGUS for SAMSUNG MZ1WV480HCGL-000MV (Samsung SM-953 Datacenter SSD)

On 2023-06-30 17:18, Clemens Springsguth wrote:
> Hi Pankaj,
>
> i would be very very grateful if you would create and submit the quirk
> as a proper patch.
>

No problem. I will send something soon, but could you test it and
give me a tested-by tag when I send the patch? I don't have that
model at my disposal.


2023-07-03 13:55:44

by Clemens Springsguth

[permalink] [raw]
Subject: Re: Fwd: Need NVME QUIRK BOGUS for SAMSUNG MZ1WV480HCGL-000MV (Samsung SM-953 Datacenter SSD)

On Fri, Jun 30, 2023 at 6:21 PM Pankaj Raghav <[email protected]> wrote:
>
> On 2023-06-30 17:18, Clemens Springsguth wrote:
> > Hi Pankaj,
> >
> > i would be very very grateful if you would create and submit the quirk
> > as a proper patch.
> >
>
> No problem. I will send something soon, but could you test it and
> give me a tested-by tag when I send the patch? I don't have that
> model at my disposal.
>

Hi Pankaj,

I am available for testing with pleasure.

kind regards
Clemens

Subject: Re: Fwd: Need NVME QUIRK BOGUS for SAMSUNG MZ1WV480HCGL-000MV (Samsung SM-953 Datacenter SSD)

On 27.06.23 18:10, Keith Busch wrote:
> On Mon, Jun 26, 2023 at 08:15:49AM +0700, Bagas Sanjaya wrote:
>> See Bugzilla for the full thread.
>>
>> The reporter had a quirk (see above) that fixed this regression,
>> nevertheless I'm adding it to regzbot to make sure it doesn't fall
>> through cracks unnoticed:
>>
>> #regzbot introduced: 86c2457a8e8112f https://bugzilla.kernel.org/show_bug.cgi?id=217593
>> #regzbot title: NVME_QUIRK_BOGUS_NID is needed for SAMSUNG MZ1WV480HCGL-000MV
>
> These bug reports really should go to the vendors that created the
> broken device with non-unique "unique" fields.

I understand that, but I think we need middlemen for that, as I or Bagas
don't have the contacts -- and it's IMHO also a bit much too ask us for
in general, as regression tracking is hard enough already. At least
unless this becomes something that happen regularly, then a list of
persons we could contact would be fine I guess. But we simply can't deal
with too many subsystem specific special cases.

> It's easy enough for me
> to write the quirk patch, but it's not the ideal solution and may harm
> devices/firmwares with the same VID:DID that don't have this problem.
> Not being the vendor, I'm not in a postition to know that, so CC'ing
> some Samsung folks.

Another request came in today, even with a pseudo-patch:
https://bugzilla.kernel.org/show_bug.cgi?id=217649

To quote:
```
As with numerous NVMe controllers these days, Samsung's
MZAL41T0HBLB-00BL2, which Lenovo builds into their 16ARP8 also suffers
from invalid IDs, breaking suspend and hibernate also on the latest
kernel 6.4.2.

The following change restores this functionality:

File: root/drivers/nvme/host/pci.c
Change:

- { PCI_DEVICE(0x144d, 0xa80b), /* Samsung PM9B1 256G and 512G */
- .driver_data = NVME_QUIRK_DISABLE_WRITE_ZEROES, },

+ { PCI_DEVICE(0x144d, 0xa80b), /* Samsung PM9B1 256G, 512G and 1TB */
+ .driver_data = NVME_QUIRK_BOGUS_NID |
+ NVME_QUIRK_DISABLE_WRITE_ZEROES, },
```

Ciao, Thorsten

2023-07-10 17:19:19

by Keith Busch

[permalink] [raw]
Subject: Re: Fwd: Need NVME QUIRK BOGUS for SAMSUNG MZ1WV480HCGL-000MV (Samsung SM-953 Datacenter SSD)

On Mon, Jul 10, 2023 at 10:52:52AM +0200, Linux regression tracking (Thorsten Leemhuis) wrote:
> On 27.06.23 18:10, Keith Busch wrote:
> > On Mon, Jun 26, 2023 at 08:15:49AM +0700, Bagas Sanjaya wrote:
> >> See Bugzilla for the full thread.
> >>
> >> The reporter had a quirk (see above) that fixed this regression,
> >> nevertheless I'm adding it to regzbot to make sure it doesn't fall
> >> through cracks unnoticed:
> >>
> >> #regzbot introduced: 86c2457a8e8112f https://bugzilla.kernel.org/show_bug.cgi?id=217593
> >> #regzbot title: NVME_QUIRK_BOGUS_NID is needed for SAMSUNG MZ1WV480HCGL-000MV
> >
> > These bug reports really should go to the vendors that created the
> > broken device with non-unique "unique" fields.
>
> I understand that, but I think we need middlemen for that, as I or Bagas
> don't have the contacts -- and it's IMHO also a bit much too ask us for
> in general, as regression tracking is hard enough already. At least
> unless this becomes something that happen regularly, then a list of
> persons we could contact would be fine I guess. But we simply can't deal
> with too many subsystem specific special cases.

I'm not asking the Linux regression trackers to fill that role, though.
I'm asking people who experience these issues report it to their vendor
directly because these device makers apparently have zero clue that
their spec non-compliance is causing painful experiences for their
customers and annoyance for maintainers. They keep pumping out more and
more devices with the same breakage.

This particular vendor has been great at engaging with Linux, but that's
not necessarily normal among all device makers, and I don't have
contacts with the majority of the vendors we've had to quirk for this
issue.

We did complain to the NVMe spec workgroup that their complaince cert
suite is not testing for this. There was a little initial interest in
fixing that gap, but it fizzled out...

> Another request came in today, even with a pseudo-patch:
> https://bugzilla.kernel.org/show_bug.cgi?id=217649
>
> To quote:
> ```
> As with numerous NVMe controllers these days, Samsung's
> MZAL41T0HBLB-00BL2, which Lenovo builds into their 16ARP8 also suffers
> from invalid IDs, breaking suspend and hibernate also on the latest
> kernel 6.4.2.
>
> The following change restores this functionality:
>
> File: root/drivers/nvme/host/pci.c
> Change:
>
> - { PCI_DEVICE(0x144d, 0xa80b), /* Samsung PM9B1 256G and 512G */
> - .driver_data = NVME_QUIRK_DISABLE_WRITE_ZEROES, },
>
> + { PCI_DEVICE(0x144d, 0xa80b), /* Samsung PM9B1 256G, 512G and 1TB */
> + .driver_data = NVME_QUIRK_BOGUS_NID |
> + NVME_QUIRK_DISABLE_WRITE_ZEROES, },

Panjaj, okay with this one too?

Subject: Re: Fwd: Need NVME QUIRK BOGUS for SAMSUNG MZ1WV480HCGL-000MV (Samsung SM-953 Datacenter SSD)

[CCing Linus for the "whack a mole" aspect in the second half]

On 11.07.23 08:54, Pankaj Raghav wrote:
>>> I understand that, but I think we need middlemen for that, as I or Bagas
>>> don't have the contacts -- and it's IMHO also a bit much too ask us for
>>> in general, as regression tracking is hard enough already. At least
>>> unless this becomes something that happen regularly, then a list of
>>> persons we could contact would be fine I guess. But we simply can't deal
>>> with too many subsystem specific special cases.
>>
>> I'm not asking the Linux regression trackers to fill that role, though.

Well, during our work we often encounter those bugs -- often from people
that are no regular developers that already had a hard time
understanding the issue and reporting it to us somehow. Asking those to...

>> I'm asking people who experience these issues report it to their vendor

...find the right destination and format to report their Linux problems
to the vendors is unlikely to fly I suspect. And I'm not sure if that is
in our interest, as then it might take a lot longer to get those quirk
entries into the kernel source.

But whatever, the main reason why I write this mail is different:

>> directly because these device makers apparently have zero clue that
>> their spec non-compliance is causing painful experiences for their
>> customers and annoyance for maintainers. They keep pumping out more and
>> more devices with the same breakage.
>>
>> This particular vendor has been great at engaging with Linux, but that's
>> not necessarily normal among all device makers, and I don't have
>> contacts with the majority of the vendors we've had to quirk for this
>> issue.
>>
>> We did complain to the NVMe spec workgroup that their complaince cert
>> suite is not testing for this. There was a little initial interest in
>> fixing that gap, but it fizzled out...

Preface: this is not my area of expertise, and maybe I should keep my
mouth shut. But whatever.

Well, that "They keep pumping out more and more devices with the same
breakage" and the "new device" comment from Pankaj below bear the
question: should we stop trying to play "whack a mole" with all those
quirk entries and handle devices with duplicate ids just like Windows does?

That would "make things just work"(tm).

And yes, I suspect there are good reasons why we went down the "quirk"
route or why abandoning it might be hard. But maybe it's time to
reconsider that path, as from my outside point of view things sound a
lot like they are somewhat similar to the ACPI problems we dealt with
~15 years ago: we learned that we have to deal with broken ACPI
implementations and somehow use them in a way similar to how Windows
uses them, as that's the OS the machine was designed for and tested with.

Ciao, Thorsten

>>> Another request came in today, even with a pseudo-patch:
>>> https://bugzilla.kernel.org/show_bug.cgi?id=217649
>>>
>>> To quote:
>>> ```
>>> As with numerous NVMe controllers these days, Samsung's
>>> MZAL41T0HBLB-00BL2, which Lenovo builds into their 16ARP8 also suffers
>>> from invalid IDs, breaking suspend and hibernate also on the latest
>>> kernel 6.4.2.
> [...]
>> Panjaj, okay with this one too?
>
> This looks a like a new device that might have a firmware update. I will ping
> internally first.
>
> As you mentioned, the recent addition of globally unique ID check
> is breaking a lot of devices because of non-compliant firmware. I will try to create
> some awareness about this issue internally as well.

2023-07-11 12:19:25

by Christoph Hellwig

[permalink] [raw]
Subject: Re: Fwd: Need NVME QUIRK BOGUS for SAMSUNG MZ1WV480HCGL-000MV (Samsung SM-953 Datacenter SSD)

On Tue, Jul 11, 2023 at 11:39:11AM +0200, Linux regression tracking (Thorsten Leemhuis) wrote:
> Well, that "They keep pumping out more and more devices with the same
> breakage" and the "new device" comment from Pankaj below bear the
> question: should we stop trying to play "whack a mole" with all those
> quirk entries and handle devices with duplicate ids just like Windows does?

As far as I can tell Windows completely ignores the IDs. Which, looking
back, I'd love to be able to do as well, but they are already used
by udev for the /dev/disk/by-id/ links. Those are usually not used
on desktop systems, as they use the file system labels and UUIDs, but
that doesn't work for non-file system uses.

And all this has been working really well with the good old enterprise
SSDs, it's just that the cheap consumer devices keep fucking it up.

If we'd take it away now we'd break existing users, which puts us between
a rock and a hard place.


2023-07-11 12:19:31

by Sagi Grimberg

[permalink] [raw]
Subject: Re: Fwd: Need NVME QUIRK BOGUS for SAMSUNG MZ1WV480HCGL-000MV (Samsung SM-953 Datacenter SSD)


>> Well, that "They keep pumping out more and more devices with the same
>> breakage" and the "new device" comment from Pankaj below bear the
>> question: should we stop trying to play "whack a mole" with all those
>> quirk entries and handle devices with duplicate ids just like Windows does?
>
> As far as I can tell Windows completely ignores the IDs. Which, looking
> back, I'd love to be able to do as well, but they are already used
> by udev for the /dev/disk/by-id/ links. Those are usually not used
> on desktop systems, as they use the file system labels and UUIDs, but
> that doesn't work for non-file system uses.
>
> And all this has been working really well with the good old enterprise
> SSDs, it's just that the cheap consumer devices keep fucking it up.
>
> If we'd take it away now we'd break existing users, which puts us between
> a rock and a hard place.

Maybe the compromise would be to add a modparam that tells the driver
to ignore it altogether (like allow_bogus_identifiers) that would
default to false. Then people can just workaround the problem instead
of having the back-and-fourth with the vendor?

2023-07-11 14:22:04

by Greg KH

[permalink] [raw]
Subject: Re: Fwd: Need NVME QUIRK BOGUS for SAMSUNG MZ1WV480HCGL-000MV (Samsung SM-953 Datacenter SSD)

On Tue, Jul 11, 2023 at 03:14:54PM +0300, Sagi Grimberg wrote:
>
> > > Well, that "They keep pumping out more and more devices with the same
> > > breakage" and the "new device" comment from Pankaj below bear the
> > > question: should we stop trying to play "whack a mole" with all those
> > > quirk entries and handle devices with duplicate ids just like Windows does?
> >
> > As far as I can tell Windows completely ignores the IDs. Which, looking
> > back, I'd love to be able to do as well, but they are already used
> > by udev for the /dev/disk/by-id/ links. Those are usually not used
> > on desktop systems, as they use the file system labels and UUIDs, but
> > that doesn't work for non-file system uses.
> >
> > And all this has been working really well with the good old enterprise
> > SSDs, it's just that the cheap consumer devices keep fucking it up.
> >
> > If we'd take it away now we'd break existing users, which puts us between
> > a rock and a hard place.
>
> Maybe the compromise would be to add a modparam that tells the driver
> to ignore it altogether (like allow_bogus_identifiers) that would
> default to false. Then people can just workaround the problem instead
> of having the back-and-fourth with the vendor?
>

Module parameters do not work on a per-device basis, sorry. This isn't
the 1990's anymore, please do not attempt to add new ones :)

thanks,

greg k-h

2023-07-11 14:45:00

by Sagi Grimberg

[permalink] [raw]
Subject: Re: Fwd: Need NVME QUIRK BOGUS for SAMSUNG MZ1WV480HCGL-000MV (Samsung SM-953 Datacenter SSD)


>>>> Well, that "They keep pumping out more and more devices with the same
>>>> breakage" and the "new device" comment from Pankaj below bear the
>>>> question: should we stop trying to play "whack a mole" with all those
>>>> quirk entries and handle devices with duplicate ids just like Windows does?
>>>
>>> As far as I can tell Windows completely ignores the IDs. Which, looking
>>> back, I'd love to be able to do as well, but they are already used
>>> by udev for the /dev/disk/by-id/ links. Those are usually not used
>>> on desktop systems, as they use the file system labels and UUIDs, but
>>> that doesn't work for non-file system uses.
>>>
>>> And all this has been working really well with the good old enterprise
>>> SSDs, it's just that the cheap consumer devices keep fucking it up.
>>>
>>> If we'd take it away now we'd break existing users, which puts us between
>>> a rock and a hard place.
>>
>> Maybe the compromise would be to add a modparam that tells the driver
>> to ignore it altogether (like allow_bogus_identifiers) that would
>> default to false. Then people can just workaround the problem instead
>> of having the back-and-fourth with the vendor?
>>
>
> Module parameters do not work on a per-device basis, sorry. This isn't
> the 1990's anymore, please do not attempt to add new ones :)

Don't get me wrong, I don't like adding this. But the source of this
is that there are simply too many breakages of non-compliant consumer
drives out there that maybe a compromise would be "globally relax
compliance check in this specific area" as a workaround.

Right now each time this issue is seen in the wild, the only
resolution is either the vendor fixing it, or the driver adds
a quirk, which is positive and exactly what we want. But more
and more users complain, and there is no immediate workaround.

2023-07-11 15:43:30

by Keith Busch

[permalink] [raw]
Subject: Re: Fwd: Need NVME QUIRK BOGUS for SAMSUNG MZ1WV480HCGL-000MV (Samsung SM-953 Datacenter SSD)

On Tue, Jul 11, 2023 at 02:06:09PM +0200, Christoph Hellwig wrote:
> On Tue, Jul 11, 2023 at 11:39:11AM +0200, Linux regression tracking (Thorsten Leemhuis) wrote:
> > Well, that "They keep pumping out more and more devices with the same
> > breakage" and the "new device" comment from Pankaj below bear the
> > question: should we stop trying to play "whack a mole" with all those
> > quirk entries and handle devices with duplicate ids just like Windows does?
>
> As far as I can tell Windows completely ignores the IDs. Which, looking
> back, I'd love to be able to do as well, but they are already used
> by udev for the /dev/disk/by-id/ links. Those are usually not used
> on desktop systems, as they use the file system labels and UUIDs, but
> that doesn't work for non-file system uses.

It's also an inescapable requirement for supporting multipath, and a
number of the devices being quirked bizarrely also report CMIC/NMIC
capabilities.

2023-07-11 16:56:27

by Linus Torvalds

[permalink] [raw]
Subject: Re: Fwd: Need NVME QUIRK BOGUS for SAMSUNG MZ1WV480HCGL-000MV (Samsung SM-953 Datacenter SSD)

On Tue, 11 Jul 2023 at 05:06, Christoph Hellwig <[email protected]> wrote:
>
> As far as I can tell Windows completely ignores the IDs. Which, looking
> back, I'd love to be able to do as well, but they are already used
> by udev for the /dev/disk/by-id/ links. Those are usually not used
> on desktop systems, as they use the file system labels and UUIDs, but
> that doesn't work for non-file system uses.

The thing is, the nvme code seems to actively do completely stuipid
things in this area.

> And all this has been working really well with the good old enterprise
> SSDs, it's just that the cheap consumer devices keep f*cking it up.

Christoph, deal with reality, not with what you think things should look like.

Anybody who expected unique ID's is frankly completely incompetent.
People should have *known* not to do this.

"Those Who Do Not Learn History Are Doomed To Repeat It"
- Santayana

and we have NEVER EVER seen devices with reliably unique IDs. Really.
We've had these uuid's before (ask Greg about USB devices one day, and
that was *recent*).

We've always known that vendors will fill in a fixed value, and
somebody still decided to make this a correctness issue?

Christoph, don't blame vendors. Somebody did indeed f*ck up. But it was you.

> If we'd take it away now we'd break existing users, which puts us between
> a rock and a hard place.

Well, here's a suggestion: stop making it worse.

For example, we have this completely unacceptable garbage:

ret = nvme_global_check_duplicate_ids(ctrl->subsys, &info->ids);
if (ret) {
dev_err(ctrl->device,
"globally duplicate IDs for nsid %d\n", info->nsid);
nvme_print_device_info(ctrl);
return ret;
}

iow, the code even checks for and *notices* that there are duplicate
IDs, and what does it do? It then errors out.

Then expecting people TO WAIT FOR A NEW KERNEL VERSION when you
noticed something wrong? What an absolute crock.

So stop blaming anybody else.

I think the code should *default* to "unreliable uuid", and then if
you're sure it's actually ok, then you use it. Then some rare
enterprise user with multipathing - who is going to be very very
careful about which device to use anyway - can use the "approved
list".

Or "Oh, I noticed a non-unique UUID, let me generate one for you based
on physical location".

But this "my disk doesn't work in v6.0 and later because some clown
added a duplicate check that shouldn't be there" is not a good thing
to then try to make excuses for.

Linus

2023-07-11 17:42:38

by Keith Busch

[permalink] [raw]
Subject: Re: Fwd: Need NVME QUIRK BOGUS for SAMSUNG MZ1WV480HCGL-000MV (Samsung SM-953 Datacenter SSD)

On Tue, Jul 11, 2023 at 09:47:00AM -0700, Linus Torvalds wrote:
> On Tue, 11 Jul 2023 at 05:06, Christoph Hellwig <[email protected]> wrote:
> For example, we have this completely unacceptable garbage:
>
> ret = nvme_global_check_duplicate_ids(ctrl->subsys, &info->ids);
> if (ret) {
> dev_err(ctrl->device,
> "globally duplicate IDs for nsid %d\n", info->nsid);
> nvme_print_device_info(ctrl);
> return ret;
> }
>
> iow, the code even checks for and *notices* that there are duplicate
> IDs, and what does it do? It then errors out.

This check came from a recent half-baked spec feature called "Dispersed
Namespaces" that caused breakage and data corruption when used in Linux.
Rather than attempt to support that mostly vendor specific feature, the
driver attempted to fence that off as unmaintainable. This check wasn't
aimed at enforcing "correctness", but it certainly found a lot of that
as collatoral damage. Let's see if we can find a better way to detect
the difference with a sane fallback as you suggest.

2023-07-11 22:23:30

by John Meneghini

[permalink] [raw]
Subject: Re: Fwd: Need NVME QUIRK BOGUS for SAMSUNG MZ1WV480HCGL-000MV (Samsung SM-953 Datacenter SSD)

Yes, this is what I thought. This is all the result of the duplicate NID check added to deal with TP4034 Dispersed Namespaces.

One suggestion I have would be to limit this check to nvme-of subsystems only. These are the only devices I am aware of out
there which support TP4034. Moreover, all nvme-of devices report a valid NID. It's required with NVMe Over Fabrics. The PCIe
devices, I expect, don't care. You don't really need a valid NID with a private namespace - which is what most PCIe devices are.

I'll wager that if you change nvme_global_check_duplicate_ids() to check only nvme-of subsystems, and simply continue with PCIe
subsystems, 90% of these nvme quirks can be removed.

John Meneghini
Senior Principal Platform Storage Engineer
RHEL SST - Platform Storage Group
[email protected]

On 7/11/23 13:21, Keith Busch wrote:
> On Tue, Jul 11, 2023 at 09:47:00AM -0700, Linus Torvalds wrote:
>> On Tue, 11 Jul 2023 at 05:06, Christoph Hellwig <[email protected]> wrote:
>> For example, we have this completely unacceptable garbage:
>>
>> ret = nvme_global_check_duplicate_ids(ctrl->subsys, &info->ids);
>> if (ret) {
>> dev_err(ctrl->device,
>> "globally duplicate IDs for nsid %d\n", info->nsid);
>> nvme_print_device_info(ctrl);
>> return ret;
>> }
>>
>> iow, the code even checks for and *notices* that there are duplicate
>> IDs, and what does it do? It then errors out.
>
> This check came from a recent half-baked spec feature called "Dispersed
> Namespaces" that caused breakage and data corruption when used in Linux.
> Rather than attempt to support that mostly vendor specific feature, the
> driver attempted to fence that off as unmaintainable. This check wasn't
> aimed at enforcing "correctness", but it certainly found a lot of that
> as collatoral damage. Let's see if we can find a better way to detect
> the difference with a sane fallback as you suggest.
>


2023-07-12 07:50:30

by Sagi Grimberg

[permalink] [raw]
Subject: Re: Fwd: Need NVME QUIRK BOGUS for SAMSUNG MZ1WV480HCGL-000MV (Samsung SM-953 Datacenter SSD)


>> For example, we have this completely unacceptable garbage:
>>
>> ret = nvme_global_check_duplicate_ids(ctrl->subsys, &info->ids);
>> if (ret) {
>> dev_err(ctrl->device,
>> "globally duplicate IDs for nsid %d\n", info->nsid);
>> nvme_print_device_info(ctrl);
>> return ret;
>> }
>>
>> iow, the code even checks for and *notices* that there are duplicate
>> IDs, and what does it do? It then errors out.
>
> This check came from a recent half-baked spec feature called "Dispersed
> Namespaces" that caused breakage and data corruption when used in Linux.
> Rather than attempt to support that mostly vendor specific feature, the
> driver attempted to fence that off as unmaintainable. This check wasn't
> aimed at enforcing "correctness", but it certainly found a lot of that
> as collatoral damage. Let's see if we can find a better way to detect
> the difference with a sane fallback as you suggest.

Perhaps we could fallback to what we do in wwid_show()?

"nvme.%04x-%*phN-%*phN-%08x\n", subsys->vendor_id, serial_len,
subsys->serial, model_len, subsys->model, head->ns_id)

2023-07-12 17:00:48

by Christoph Hellwig

[permalink] [raw]
Subject: Re: Fwd: Need NVME QUIRK BOGUS for SAMSUNG MZ1WV480HCGL-000MV (Samsung SM-953 Datacenter SSD)

On Tue, Jul 11, 2023 at 09:47:00AM -0700, Linus Torvalds wrote:
> Anybody who expected unique ID's is frankly completely incompetent.
> People should have *known* not to do this.

Except that storage software really does rely on them. Not your
laptop (or mine) but all kinds of data center software. Where these
IDs have worked fine for decades. It just turns out nvme has the
misfortune of trying to deal with both that and cheap consumer crap.

> and we have NEVER EVER seen devices with reliably unique IDs. Really.

Sorry, but that's bullshit.

> iow, the code even checks for and *notices* that there are duplicate
> IDs, and what does it do? It then errors out.

Yes. Because we have applications which will lose data if they
suddently get the wrong device.

> I think the code should *default* to "unreliable uuid", and then if
> you're sure it's actually ok, then you use it. Then some rare
> enterprise user with multipathing - who is going to be very very
> careful about which device to use anyway - can use the "approved
> list".

That doesn't scale either.

> Or "Oh, I noticed a non-unique UUID, let me generate one for you based
> on physical location".
>
> But this "my disk doesn't work in v6.0 and later because some clown
> added a duplicate check that shouldn't be there" is not a good thing
> to then try to make excuses for.

Well, let's try something like this (co-developed with Sagi) that
allows it for non-multipath PCIe devices, hich covers the quirks
we've added so far, but also add a big fat warning, because we know
people rely on the /dev/disk/by-id/ links. Probably not on these
devices, but who knows.

diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index 47d7ba2827ff29..37b6fa74666204 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -3431,10 +3431,40 @@ static int nvme_init_ns_head(struct nvme_ns *ns, struct nvme_ns_info *info)

ret = nvme_global_check_duplicate_ids(ctrl->subsys, &info->ids);
if (ret) {
- dev_err(ctrl->device,
- "globally duplicate IDs for nsid %d\n", info->nsid);
+ /*
+ * We've found two different namespaces on two different
+ * subsystems that report the same ID. This is pretty nasty
+ * for anything that actually requires unique device
+ * identification. In the kernel we need this for multipathing,
+ * and in user space the /dev/disk/by-id/ links rely on it.
+ *
+ * If the device also claims to be multi-path capable back off
+ * here now and refuse the probe the second device as this is a
+ * recipe for data corruption. If not this is probably a
+ * cheap consumer device if on the PCIe bus, so let the user
+ * proceed and use the shiny toy, but warn that with changing
+ * probing order (which due to our async probing could just be
+ * device taking longer to startup) the other device could show
+ * up at any time.
+ */
nvme_print_device_info(ctrl);
- return ret;
+ if ((ns->ctrl->ops->flags & NVME_F_FABRICS) || /* !PCIe */
+ ((ns->ctrl->subsys->cmic & NVME_CTRL_CMIC_MULTI_CTRL) &&
+ info->is_shared)) {
+ dev_err(ctrl->device,
+ "ignoring nsid %d because of duplicate IDs\n",
+ info->nsid);
+ return ret;
+ }
+
+ dev_err(ctrl->device,
+ "clearing duplicate IDs for nsid %d\n", info->nsid);
+ dev_err(ctrl->device,
+ "use of /dev/disk/by-id/ may cause data corruption\n");
+ memset(&info->ids.nguid, 0, sizeof(info->ids.nguid));
+ memset(&info->ids.uuid, 0, sizeof(info->ids.uuid));
+ memset(&info->ids.eui64, 0, sizeof(info->ids.eui64));
+ ctrl->quirks |= NVME_QUIRK_BOGUS_NID;
}

mutex_lock(&ctrl->subsys->lock);

2023-07-12 17:04:42

by Linus Torvalds

[permalink] [raw]
Subject: Re: Fwd: Need NVME QUIRK BOGUS for SAMSUNG MZ1WV480HCGL-000MV (Samsung SM-953 Datacenter SSD)

On Wed, 12 Jul 2023 at 00:38, Sagi Grimberg <[email protected]> wrote:
>
> Perhaps we could fallback to what we do in wwid_show()?
>
> "nvme.%04x-%*phN-%*phN-%08x\n", subsys->vendor_id, serial_len,
> subsys->serial, model_len, subsys->model, head->ns_id)

That's not going to be any more unique.

Yes, there is *slightly* more hope that a simpler "serial number"
might be different between devices, but honestly, when I told
Christoph to learn from history and not assume UUID's are unique for
hardware, that "history" is literally decades of serial numbers.

Why do you think things like dm/md generate the UUID and literally
write it to disk? It's because those unique serial numbers have never
ever existed.

And absolutely NONE OF THIS IS NEW. This is not a nvme issue. SCSI
disks allegedly has a VPD page 0x80 with a serial number. Not only
won't many devices report that VPD in the first place, even if they do
there's no guarantee that it's actually unique.

And USB devices are *supposed* to have unique device serial numbers
("iSerial"). I think it's literally been in the spec since day#1. And
some even do. Quite a lot absolutely do NOT.

It's just fundamentally cheaper for hardware manufacturers that do
mass manufacturing to do completely identical pieces of hardware.
Serial numbers and UUID's are _fundamentally_ hard and expensive at
some level.

If you sell overpriced garbage to enterprise users, then a serial
number or UUID will most definitely be part of the package. Because
that is _literally_ what "Enterprise hardware" means: "Overpriced
garbage with a provenance and somebody else to blame".

But if you sell to the mass market, and aren't going for people who
pay extra for provenance, a serial number or a UUID is pure overhead.

Sometimes it's easy enough to do, and people do it.

But if you *rely* on it, *you* are the problem. Not the vendor that
was churning out a million devices for real users, and that was
spending all the effort on just making it work.

Basically, I think all of these things are purely for fabrics and the
nvme target side.

Expecting serial numbers from a PCIe device is pure fantasy.

And, more importantly, it's stupid and wrong. Exactly because we've
been here before. Don't repeat mistakes from the past.

By all means _report_ things like serial numbers. They can be very
useful for tracking. But never *ever* rely on them for basic
functionality. People absolutely will have two identical devices from
the same batch, and unless there is some actual hardware compliance
verification, they will not have unique serial numbers.

In many cases, the best you can hope for is that the duplicate ones
are *obviously* duplicate (ie "all zeroes" and similar). But you
shouldn't expect or depend on even that to be the case.

Linus

2023-07-12 17:20:45

by Linus Torvalds

[permalink] [raw]
Subject: Re: Fwd: Need NVME QUIRK BOGUS for SAMSUNG MZ1WV480HCGL-000MV (Samsung SM-953 Datacenter SSD)

On Wed, 12 Jul 2023 at 09:45, Christoph Hellwig <[email protected]> wrote:
>
> On Tue, Jul 11, 2023 at 09:47:00AM -0700, Linus Torvalds wrote:
>
> > and we have NEVER EVER seen devices with reliably unique IDs. Really.
>
> Sorry, but that's bullshit.

Christoph, you are *literally* involved in a discussion where this is the case.

What you call "bullshit" is what everybody else here calls REALITY.

The fact that *some* devices have serial numbers in no way implies
that all of them do.

And if you think that they all do, you are the problem. Literally in this case.

So you had better really internalize this "there are no reliable
serial numbers in general". Because it's simply a FACT.

> Well, let's try something like this (co-developed with Sagi) that
> allows it for non-multipath PCIe devices, which covers the quirks
> we've added so far, but also add a big fat warning, because we know
> people rely on the /dev/disk/by-id/ links. Probably not on these
> devices, but who knows.

Looks sane to me.

Linus

2023-07-12 17:37:56

by Christoph Hellwig

[permalink] [raw]
Subject: Re: Fwd: Need NVME QUIRK BOGUS for SAMSUNG MZ1WV480HCGL-000MV (Samsung SM-953 Datacenter SSD)

On Wed, Jul 12, 2023 at 09:51:52AM -0700, Linus Torvalds wrote:
> On Wed, 12 Jul 2023 at 09:45, Christoph Hellwig <[email protected]> wrote:
> >
> > On Tue, Jul 11, 2023 at 09:47:00AM -0700, Linus Torvalds wrote:
> >
> > > and we have NEVER EVER seen devices with reliably unique IDs. Really.
> >
> > Sorry, but that's bullshit.
>
> Christoph, you are *literally* involved in a discussion where this is the case.

Yes. But that's not never ever. Enterprise storage devices have been
pretty very good at it, because it is part of the purchase specs.

So don't claim NEVER EVER, which is just BS. Claim we've been way
to optimistic in that even cheap devices would get something so
basic, and you're spot on.

2023-07-12 17:50:34

by Linus Torvalds

[permalink] [raw]
Subject: Re: Fwd: Need NVME QUIRK BOGUS for SAMSUNG MZ1WV480HCGL-000MV (Samsung SM-953 Datacenter SSD)

On Wed, 12 Jul 2023 at 09:57, Christoph Hellwig <[email protected]> wrote:
>
> On Wed, Jul 12, 2023 at 09:51:52AM -0700, Linus Torvalds wrote:
> > Christoph, you are *literally* involved in a discussion where this is the case.
>
> Yes. But that's not never ever. Enterprise storage devices have been
> pretty very good at it, because it is part of the purchase specs.

Your basic logic is lacking.

"Humans are animals" and "Some animals have four legs" are both
undeniabvly true.

That still does *not* mean "Humans have four legs".

And yet hat is *literally* the kind of nonsense you are spouting.

I very much say that "some devices will not have UUID's".

That's a *FACT*.

You then say "Some devices have UUID's, so all devices must have UUIDs".

That's *STUPID*.

Can you really not tell the difference?

Linus

2023-07-13 11:52:22

by Christoph Hellwig

[permalink] [raw]
Subject: Re: Fwd: Need NVME QUIRK BOGUS for SAMSUNG MZ1WV480HCGL-000MV (Samsung SM-953 Datacenter SSD)

On Wed, Jul 12, 2023 at 10:34:05AM -0700, Linus Torvalds wrote:
> That's a *FACT*.
>
> You then say "Some devices have UUID's, so all devices must have UUIDs".

Please stop putting words in my mouth. Maybe instead of shouting it
would help to actually read the text?

You said we've never seen devices with reliably IDs, which simply isn't
true. And that doesn't mean I'm saying all devices have reliably IDs,
which somehow you're not trying to make me say.

2023-07-13 17:09:48

by Linus Torvalds

[permalink] [raw]
Subject: Re: Fwd: Need NVME QUIRK BOGUS for SAMSUNG MZ1WV480HCGL-000MV (Samsung SM-953 Datacenter SSD)

On Thu, 13 Jul 2023 at 04:44, Christoph Hellwig <[email protected]> wrote:
>
> Please stop putting words in my mouth. Maybe instead of shouting it
> would help to actually read the text?

I have read the text. You spout unbelievable garbage. Even in this email:

> You said we've never seen devices with reliably IDs, which simply isn't
> true. And that doesn't mean I'm saying all devices have reliably IDs,
> which somehow you're not trying to make me say.

When I say "devices", I obviously mean in general.

You seem to argue against it.

OF COURSE a _single_ device that has shown to have a UUID is likely to
then repeat that UUID reliably. But that doesn't mean it's true in
general.

The fact is, any driver subsystem that deals with more than one device
manufacturer had better not rely on any serial number for uniqueness.

If some device layer does, and there is a device without serial
numbers, then the bug is squarely on the driver, not the device.
Because the driver author clearly didn't think things through.

THAT is what I mean by "devices do not EVER reliably have uuids".

I just checked. I have on my desk access to three machines with NVMe
drives. Of them

(a) one didn't have any UUID at all, didn't report namespaces and the
"wwid" exposed in /proc is that

return sysfs_emit(buf, "nvme.%04x-%*phN-%*phN-%08x\n",
subsys->vendor_id,
serial_len, subsys->serial, model_len, subsys->model,
head->ns_id);

thing.

(b) the other two had a NULL uuid and nguid, but seem to have a eui64,
whatever the hell that is.

So just from my limited test, one in three didn't have any UUID at all.

Deal with it. Stop claiming it's somehow "reliable". Two out of three
out of a random selection is not "reliable".

And it has never been reliable, judging by those quirks. When we have
manufacturers like Samsung, Phison, ADATA, Lexar, Sandisk, Micron, and
Kingston mentioned in the quirk list, we probably have covered most of
the consumer SSD manufacturers.

It's a sad that NVMe - that started out as a "lean mean low-overhead
disk interface definition" results in this kind of mess.

Linus

2023-07-16 19:30:40

by August Wikerfors

[permalink] [raw]
Subject: Re: Fwd: Need NVME QUIRK BOGUS for SAMSUNG MZ1WV480HCGL-000MV (Samsung SM-953 Datacenter SSD)

On 2023-07-11 08:54, Pankaj Raghav wrote:
>>> Another request came in today, even with a pseudo-patch:
>>> https://bugzilla.kernel.org/show_bug.cgi?id=217649
>>>
>>> To quote:
>>> ```
>>> As with numerous NVMe controllers these days, Samsung's
>>> MZAL41T0HBLB-00BL2, which Lenovo builds into their 16ARP8 also suffers
>>> from invalid IDs, breaking suspend and hibernate also on the latest
>>> kernel 6.4.2.
>>>
>>> The following change restores this functionality:
>>>
>>> File: root/drivers/nvme/host/pci.c
>>> Change:
>>>
>>> - { PCI_DEVICE(0x144d, 0xa80b), /* Samsung PM9B1 256G and 512G */
>>> - .driver_data = NVME_QUIRK_DISABLE_WRITE_ZEROES, },
>>>
>>> + { PCI_DEVICE(0x144d, 0xa80b), /* Samsung PM9B1 256G, 512G and 1TB */
>>> + .driver_data = NVME_QUIRK_BOGUS_NID |
>>> + NVME_QUIRK_DISABLE_WRITE_ZEROES, },
>>
>> Panjaj, okay with this one too?
>
> This looks a like a new device that might have a firmware update. I will ping
> internally first.

(Note that this is a different issue from the regression in this thread)

I reported this back in November [1] and was told by Kanchan Joshi at
Samsung that it was fixed in new firmware [2]. Lenovo was also contacted
and said they were working on the update in December [3]. I'm not sure
what happened then, but in March, Mark Pearson at Lenovo wrote [4]:

> I'm stuck on this one - the FW team reached out to Samsung to see if
> there were fixes that we should be picking up and Samsung reported back
> that there are no Linux issues reported against this part :(

The release process then seems to have started over [5] and the latest
update as of May is that the update is supposed to be released this month.

[1] https://lore.kernel.org/all/[email protected]/
[2] https://lore.kernel.org/all/20221206055928.GB24451@test-zns/
[3] https://github.com/tomsom/yoga-linux/issues/9#issuecomment-1368013293
[4] https://github.com/fwupd/firmware-lenovo/issues/308#issuecomment-1466631468
[5] https://forums.lenovo.com/topic/findpost/27/5196929/5984302