LinuxLists.cc - [PATCH] nvme-pci: Prevent mmio reads if pci channel offline

[permalink] [raw]

Subject: Re: [PATCH] nvme-pci: Prevent mmio reads if pci channel offline

On Thu, Feb 21, 2019 at 5:07 PM Jon Derrick <[email protected]> wrote:
>
> Some platforms don't seem to easily tolerate non-posted mmio reads on
> lost (hot removed) devices. This has been noted in previous
> modifications to other layers where an mmio read to a lost device could
> cause an undesired firmware intervention [1][2].

This is broken, and whatever platform that requires this is broken.

This has absolutely nothing to do with nvme, and should not be handled
by a driver.

The platform code should be fixed.

What broken platform is this, and why is it causing problems?

None of this wishy-washy "some platforms". Name them, and let's get them fixed.

Linus

2019-02-22 22:00:19

[permalink] [raw]

Subject: Re: [PATCH] nvme-pci: Prevent mmio reads if pci channel offline

On Fri, Feb 22, 2019 at 01:28:42PM -0800, Linus Torvalds wrote:
> On Thu, Feb 21, 2019 at 5:07 PM Jon Derrick <[email protected]> wrote:
> >
> > Some platforms don't seem to easily tolerate non-posted mmio reads on
> > lost (hot removed) devices. This has been noted in previous
> > modifications to other layers where an mmio read to a lost device could
> > cause an undesired firmware intervention [1][2].
>
> This is broken, and whatever platform that requires this is broken.
>
> This has absolutely nothing to do with nvme, and should not be handled
> by a driver.
>
> The platform code should be fixed.

This is, of course, the correct answer. We just find platform firmware
uncooperative, so we see these attempts to avoid them.

I really don't like this driver piecemeal approach if we're going to
quirk around these platforms, though. I'd rather see the invalidated
address ranges remapped to a fault handler fixup exception once and be
done with it.

Or we can say you don't get to use this feature if you bought that
hardware.

2019-02-24 20:38:17

[permalink] [raw]

Subject: Re: [PATCH] nvme-pci: Prevent mmio reads if pci channel offline

On 2/22/19 3:29 PM, Linus Torvalds wrote:
> On Thu, Feb 21, 2019 at 5:07 PM Jon Derrick <[email protected]> wrote:
>>
>> Some platforms don't seem to easily tolerate non-posted mmio reads on
>> lost (hot removed) devices. This has been noted in previous
>> modifications to other layers where an mmio read to a lost device could
>> cause an undesired firmware intervention [1][2].
>
> This is broken, and whatever platform that requires this is broken.
>
> This has absolutely nothing to do with nvme, and should not be handled
> by a driver.
>
> The platform code should be fixed.
>
> What broken platform is this, and why is it causing problems?
>
> None of this wishy-washy "some platforms". Name them, and let's get them fixed.

Dell r740xd to name one. r640 is even worse -- they probably didn't give
me one because I'd have too much stuff to complain about.

On the above machines, firmware-first (FFS) tries to guess when there's
a SURPRISE!!! removal of a PCIe card and supress any errors reported to
the OS. When the OS keeps firing IO over the dead link, FFS doesn't know
if it can safely supress the error. It reports is via NMI, and
drivers/acpi/apei/ghes.c panics whenever that happens.

Does this sound like horrible code?

As I see it, there's a more fundamental problem. As long as we accept
platforms where firmware does some things first (FFS), we have much less
control over what happens. The best we can do is wishy-washy fixes like
this one.

Did I mention that FFS's explicit intent on the above machines is to
make the OS crash?

Alex

2019-02-24 22:43:56

[permalink] [raw]

Subject: Re: [PATCH] nvme-pci: Prevent mmio reads if pci channel offline

On Sun, Feb 24, 2019 at 12:37 PM <[email protected]> wrote:
>
> Dell r740xd to name one. r640 is even worse -- they probably didn't give
> me one because I'd have too much stuff to complain about.
>
> On the above machines, firmware-first (FFS) tries to guess when there's
> a SURPRISE!!! removal of a PCIe card and supress any errors reported to
> the OS. When the OS keeps firing IO over the dead link, FFS doesn't know
> if it can safely supress the error. It reports is via NMI, and
> drivers/acpi/apei/ghes.c panics whenever that happens.

Can we just fix that ghes driver?

It's not useful to panic just for random reasons. I realize that some
of the RAS people have the mindset that "hey, I don't know what's
wrong, so I'd better kill the machine than continue", but that's
bogus.

What happens if we just fix that part?

> As I see it, there's a more fundamental problem. As long as we accept
> platforms where firmware does some things first (FFS), we have much less
> control over what happens. The best we can do is wishy-washy fixes like
> this one.

Oh, I agree that platforms with random firmware things are horrid. But
we've been able to handle them just fine before, without making every
single possible hotplug pci driver have nasty problems and
workarounds.

I suspect we'd be much better off having the ghes driver just not panic.

What is the actual ghes error? Is it the "unknown, just panic" case,
or something else?

Linus

2019-02-24 23:28:21

[permalink] [raw]

Subject: Re: [PATCH] nvme-pci: Prevent mmio reads if pci channel offline

On 2/24/19 4:42 PM, Linus Torvalds wrote:
> On Sun, Feb 24, 2019 at 12:37 PM <[email protected]> wrote:
>>
>> Dell r740xd to name one. r640 is even worse -- they probably didn't give
>> me one because I'd have too much stuff to complain about.
>>
>> On the above machines, firmware-first (FFS) tries to guess when there's
>> a SURPRISE!!! removal of a PCIe card and supress any errors reported to
>> the OS. When the OS keeps firing IO over the dead link, FFS doesn't know
>> if it can safely supress the error. It reports is via NMI, and
>> drivers/acpi/apei/ghes.c panics whenever that happens.
>
> Can we just fix that ghes driver?
>
> It's not useful to panic just for random reasons. I realize that some
> of the RAS people have the mindset that "hey, I don't know what's
> wrong, so I'd better kill the machine than continue", but that's
> bogus.

That's the first thing I tried, but Borislav didn't like it. And he's
right in the strictest sense of the ACPI spec: a fatal GHES error must
result in a machine reboot [1].

> What happens if we just fix that part?

On rx740xd, on a NVMe hotplug bay, the upstream port stops sending
hotplug interrupts. We could fix that with a quirk by clearing a
proprietary bit in the switch. However, FFS won't re-arm itself to
receive any further errors, so we'd never get notified in case there is
a genuine error.

>> As I see it, there's a more fundamental problem. As long as we accept
>> platforms where firmware does some things first (FFS), we have much less
>> control over what happens. The best we can do is wishy-washy fixes like
>> this one.
>
> Oh, I agree that platforms with random firmware things are horrid. But
> we've been able to handle them just fine before, without making every
> single possible hotplug pci driver have nasty problems and
> workarounds.
>
> I suspect we'd be much better off having the ghes driver just not panic.

Keith Busch of Intel at some point suggested remapping all MMIO
resources of a dead PCIe device to a read-only page that returns all
F's. Neither of us were too sure how to do that, or how to handle the
problem of in-flight DMA, which wouldn't hit the page tables.

> What is the actual ghes error? Is it the "unknown, just panic" case,
> or something else?

More like "fatal error, just panic". It looks like this (from a serial
console):

[ 57.680494] {1}[Hardware Error]: Hardware error from APEI Generic
Hardware Error Source: 1
[ 57.680495] {1}[Hardware Error]: event severity: fatal
[ 57.680496] {1}[Hardware Error]: Error 0, type: fatal
[ 57.680496] {1}[Hardware Error]: section_type: PCIe error
[ 57.680497] {1}[Hardware Error]: port_type: 6, downstream switch port
[ 57.680498] {1}[Hardware Error]: version: 3.0
[ 57.680498] {1}[Hardware Error]: command: 0x0407, status: 0x0010
[ 57.680499] {1}[Hardware Error]: device_id: 0000:3c:07.0
[ 57.680499] {1}[Hardware Error]: slot: 1
[ 57.680500] {1}[Hardware Error]: secondary_bus: 0x40
[ 57.680500] {1}[Hardware Error]: vendor_id: 0x10b5, device_id: 0x9733
[ 57.680501] {1}[Hardware Error]: class_code: 000406
[ 57.680502] {1}[Hardware Error]: bridge: secondary_status: 0x0000,
control: 0x0003
[ 57.680503] Kernel panic - not syncing: Fatal hardware error!
[ 57.680572] Kernel Offset: 0x2a000000 from 0xffffffff81000000
(relocation range: 0xffffffff80000000-0xffffffffbfffffff)

Alex

[1] ACPI 6.3 - 18.1 Hardware Errors and Error Sources

"A fatal hardware error is an uncorrected or uncontained error condition
that is determined to be unrecoverable by the hardware. When a fatal
uncorrected error occurs, the system is restarted to prevent propagation
of the error."

2019-02-25 00:44:51

[permalink] [raw]

Subject: Re: [PATCH] nvme-pci: Prevent mmio reads if pci channel offline

On Sun, Feb 24, 2019 at 3:27 PM <[email protected]> wrote:
>
> >
> > It's not useful to panic just for random reasons. I realize that some
> > of the RAS people have the mindset that "hey, I don't know what's
> > wrong, so I'd better kill the machine than continue", but that's
> > bogus.
>
> That's the first thing I tried, but Borislav didn't like it. And he's
> right in the strictest sense of the ACPI spec: a fatal GHES error must
> result in a machine reboot [1].
>
> > What happens if we just fix that part?
>
> On rx740xd, on a NVMe hotplug bay, the upstream port stops sending
> hotplug interrupts. We could fix that with a quirk by clearing a
> proprietary bit in the switch. However, FFS won't re-arm itself to
> receive any further errors, so we'd never get notified in case there is
> a genuine error.

But this is not a genuine fatal error.

When spec and reality collide, the spec is just so much toilet paper.

In fact, the spec is worth _less_ than toilet paper, because at least
toilet paper is useful for wiping your butt clean. The spec? Not so
much.

> Keith Busch of Intel at some point suggested remapping all MMIO
> resources of a dead PCIe device to a read-only page that returns all
> F's. Neither of us were too sure how to do that, or how to handle the
> problem of in-flight DMA, which wouldn't hit the page tables.

I agree that that would be a really cute and smart way to fix things,
but no, right now I don't think we have any kind of infrastructure in
place to do something like that.

> > What is the actual ghes error? Is it the "unknown, just panic" case,
> > or something else?
>
> More like "fatal error, just panic". It looks like this (from a serial
> console):
>
> [ 57.680494] {1}[Hardware Error]: Hardware error from APEI Generic
> Hardware Error Source: 1
> [ 57.680495] {1}[Hardware Error]: event severity: fatal

Ok, so the ghes information is actively wrong, and tries to kill the
machine when it shouldn't be killed.

I seriously think that the correct thing is to fix the problem at the
*source* - ie the ghes driver. That's the only driver that should care
about "this platform is broken and sends invalid fatal errors".

So instead of adding hacks to the nvme driver, I think the hacks
should be in the ghes driver. Possibly just a black-list of "this
platform is known broken, don't even enable the ghes driver for it".
Or possibly a bit more fine-grained in the sense that it knows that
"ok, this particular kind of error is due to a hotplug event, the
driver will handle it without help from us, so ignore it".

Linus

2019-02-25 15:55:34

[permalink] [raw]

Subject: Re: [PATCH] nvme-pci: Prevent mmio reads if pci channel offline

On Sun, Feb 24, 2019 at 03:27:09PM -0800, [email protected] wrote:
>
> More like "fatal error, just panic". It looks like this (from a serial
> console):
>
> [ 57.680494] {1}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 1
> [ 57.680495] {1}[Hardware Error]: event severity: fatal
> [ 57.680496] {1}[Hardware Error]: Error 0, type: fatal
> [ 57.680496] {1}[Hardware Error]: section_type: PCIe error
> [ 57.680497] {1}[Hardware Error]: port_type: 6, downstream switch port
> [ 57.680498] {1}[Hardware Error]: version: 3.0
> [ 57.680498] {1}[Hardware Error]: command: 0x0407, status: 0x0010
> [ 57.680499] {1}[Hardware Error]: device_id: 0000:3c:07.0
> [ 57.680499] {1}[Hardware Error]: slot: 1
> [ 57.680500] {1}[Hardware Error]: secondary_bus: 0x40
> [ 57.680500] {1}[Hardware Error]: vendor_id: 0x10b5, device_id: 0x9733
> [ 57.680501] {1}[Hardware Error]: class_code: 000406
> [ 57.680502] {1}[Hardware Error]: bridge: secondary_status: 0x0000, > control: 0x0003

This is a reaction to a ERR_FATAL message, right? What happens if we
ignore firmware first and override control of the AER masking with a
set to the Unsupported Request Error Mask in the root and downstream
ports? You can do a quick test like this for the above's hardware:

# setpci -s 3c:07.0 ECAP_AER+8.l=100000:100000

You'd probably have to do the same command to the root port BDf, and any
other switches you have them in the hierarchy.

2019-02-26 22:38:08

[permalink] [raw]

Subject: Re: [PATCH] nvme-pci: Prevent mmio reads if pci channel offline

On 2/25/19 9:55 AM, Keith Busch wrote:
> On Sun, Feb 24, 2019 at 03:27:09PM -0800, [email protected] wrote:
>> [ 57.680494] {1}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 1
>> [ 57.680495] {1}[Hardware Error]: event severity: fatal
>> [ 57.680496] {1}[Hardware Error]: Error 0, type: fatal
>> [ 57.680496] {1}[Hardware Error]: section_type: PCIe error
>> [ 57.680497] {1}[Hardware Error]: port_type: 6, downstream switch port
>> [ 57.680498] {1}[Hardware Error]: version: 3.0
>> [ 57.680498] {1}[Hardware Error]: command: 0x0407, status: 0x0010
>> [ 57.680499] {1}[Hardware Error]: device_id: 0000:3c:07.0
>> [ 57.680499] {1}[Hardware Error]: slot: 1
>> [ 57.680500] {1}[Hardware Error]: secondary_bus: 0x40
>> [ 57.680500] {1}[Hardware Error]: vendor_id: 0x10b5, device_id: 0x9733
>> [ 57.680501] {1}[Hardware Error]: class_code: 000406
>> [ 57.680502] {1}[Hardware Error]: bridge: secondary_status: 0x0000, > control: 0x0003
>
> This is a reaction to a ERR_FATAL message, right? What happens if we
> ignore firmware first and override control of the AER masking with a
> set to the Unsupported Request Error Mask in the root and downstream
> ports? You can do a quick test like this for the above's hardware:
>
> # setpci -s 3c:07.0 ECAP_AER+8.l=100000:100000
>
> You'd probably have to do the same command to the root port BDf, and any
> other switches you have them in the hierarchy.

Then nobody gets the (error) message. You can go a bit further and try
'pcie_ports=native". Again, nobody gets the memo. ):

Alex

2019-02-27 01:03:44

[permalink] [raw]

Subject: Re: [PATCH] nvme-pci: Prevent mmio reads if pci channel offline

On Tue, Feb 26, 2019 at 2:37 PM <[email protected]> wrote:
>
> Then nobody gets the (error) message. You can go a bit further and try
> 'pcie_ports=native". Again, nobody gets the memo. ):

So? The error was bogus to begin with. Why would we care?

Yes, yes, PCI bridges have the ability to return errors in accesses to
non-existent devices. But that was always bogus, and is never useful.
The whole "you get an interrupt or NMI on a bad access" is simply a
horribly broken model. It's not useful.

We already have long depended on hotplug drivers noticing the "oh, I'm
getting all-ff returns, the device may be gone". It's usually trivial,
and works a whole lot better.

It's not an error. Trying to force it to be an NMI or SCI or machine
check is bogus. It causes horrendous pain, because asynchronous
reporting doesn't work reliably anyway, and *synchronous* reporting is
impossible to sanely handle without crazy problems.

So the only sane model for hotplug devices is "IO still works, and
returns all ones". Maybe with an async one-time and *recoverable*
machine check or other reporting the access after the fact.

Anything else is simply broken. It would be broken even if firmware
wasn't involved, but obviously firmware people tend to often make a
bad situation even worse.

Linus

2019-02-27 16:43:09

[permalink] [raw]

Subject: Re: [PATCH] nvme-pci: Prevent mmio reads if pci channel offline

On 2/26/19 7:02 PM, Linus Torvalds wrote:
> On Tue, Feb 26, 2019 at 2:37 PM <[email protected]> wrote:
>>
>> Then nobody gets the (error) message. You can go a bit further and try
>> 'pcie_ports=native". Again, nobody gets the memo. ):
>
> So? The error was bogus to begin with. Why would we care?

Of course nobody cares about that. We care about actual errors that we
now know we won't be notified of. Imagine if we didn't get the memo that
a piece of data is corrupt, and imagine the reaction of RAS folk.

And I know the counter to that is a panic() is much more likely to cause
data corruption, and we're trading one piece of crap for an even
stinkier one. Whatever we end up doing, we have to do better than
silence errors and pretend nothing happened.

> Yes, yes, PCI bridges have the ability to return errors in accesses to
> non-existent devices. But that was always bogus, and is never useful.
> The whole "you get an interrupt or NMI on a bad access" is simply a
> horribly broken model. It's not useful.
>
> We already have long depended on hotplug drivers noticing the "oh, I'm
> getting all-ff returns, the device may be gone". It's usually trivial,
> and works a whole lot better.

And that's been working great, hasn't it? I think you're thinking
strictly about hotplug. There are other situations where things are all
F'd, but the hardware isn't sending all F's. (example: ECRC errors)

> It's not an error. Trying to force it to be an NMI or SCI or machine
> check is bogus. It causes horrendous pain, because asynchronous
> reporting doesn't work reliably anyway, and *synchronous* reporting is
> impossible to sanely handle without crazy problems.
>
> So the only sane model for hotplug devices is "IO still works, and
> returns all ones". Maybe with an async one-time and *recoverable*
> machine check or other reporting the access after the fact.

Exactly!!! A notification (not calling it an 'error') that something
unusual has happened is good. Treating these things like errors is so
obvious, even a caveman wouldn't do it.
In a world with FFS, we don't always get to have that model. Oh, FFS!

> Anything else is simply broken. It would be broken even if firmware
> wasn't involved, but obviously firmware people tend to often make a
> bad situation even worse.

Linus, be nice to firmware people. I've met a few, and I can vouch that
they're very kind and nice. They're also very scared, especially when OS
people want to ask them a few questions.

I think FFS should get out of the way when OS advertises it's capable of
handling XYZ. There are some good arguments why this hasn't happened,
but I won't get into details. I do think it's unlikely that machines
will be moving back to an OS-controlled model.

And Linus, keep in mind, when these machines were developed, OSes
couldn't handle recovery properly. None of this was ever an issue. It's
our fault that we've changed the OS after the machines are on the market.

Alex

2019-02-27 17:53:16

[permalink] [raw]

Subject: Re: [PATCH] nvme-pci: Prevent mmio reads if pci channel offline

On Wed, Feb 27, 2019 at 04:42:05PM +0000, [email protected] wrote:
> On 2/26/19 7:02 PM, Linus Torvalds wrote:
> > On Tue, Feb 26, 2019 at 2:37 PM <[email protected]> wrote:
> >>
> >> Then nobody gets the (error) message. You can go a bit further and try
> >> 'pcie_ports=native". Again, nobody gets the memo. ):
> >
> > So? The error was bogus to begin with. Why would we care?
>
> Of course nobody cares about that. We care about actual errors that we
> now know we won't be notified of. Imagine if we didn't get the memo that
> a piece of data is corrupt, and imagine the reaction of RAS folk.
>
> And I know the counter to that is a panic() is much more likely to cause
> data corruption, and we're trading one piece of crap for an even
> stinkier one. Whatever we end up doing, we have to do better than
> silence errors and pretend nothing happened.
>
>
> > Yes, yes, PCI bridges have the ability to return errors in accesses to
> > non-existent devices. But that was always bogus, and is never useful.
> > The whole "you get an interrupt or NMI on a bad access" is simply a
> > horribly broken model. It's not useful.
> >
> > We already have long depended on hotplug drivers noticing the "oh, I'm
> > getting all-ff returns, the device may be gone". It's usually trivial,
> > and works a whole lot better.
>
> And that's been working great, hasn't it? I think you're thinking
> strictly about hotplug. There are other situations where things are all
> F'd, but the hardware isn't sending all F's. (example: ECRC errors)
>
>
> > It's not an error. Trying to force it to be an NMI or SCI or machine
> > check is bogus. It causes horrendous pain, because asynchronous
> > reporting doesn't work reliably anyway, and *synchronous* reporting is
> > impossible to sanely handle without crazy problems.
> >
> > So the only sane model for hotplug devices is "IO still works, and
> > returns all ones". Maybe with an async one-time and *recoverable*
> > machine check or other reporting the access after the fact.
>
> Exactly!!! A notification (not calling it an 'error') that something
> unusual has happened is good. Treating these things like errors is so
> obvious, even a caveman wouldn't do it.
> In a world with FFS, we don't always get to have that model. Oh, FFS!
>
>
> > Anything else is simply broken. It would be broken even if firmware
> > wasn't involved, but obviously firmware people tend to often make a
> > bad situation even worse.
>
> Linus, be nice to firmware people. I've met a few, and I can vouch that
> they're very kind and nice. They're also very scared, especially when OS
> people want to ask them a few questions.
>
> I think FFS should get out of the way when OS advertises it's capable of
> handling XYZ. There are some good arguments why this hasn't happened,
> but I won't get into details. I do think it's unlikely that machines
> will be moving back to an OS-controlled model.
>
> And Linus, keep in mind, when these machines were developed, OSes
> couldn't handle recovery properly. None of this was ever an issue. It's
> our fault that we've changed the OS after the machines are on the market.
>
> Alex

I can't tell where you're going with this. It doesn't sound like you're
talking about hotplug anymore, at least.

2019-02-27 17:56:57

[permalink] [raw]

Subject: Re: [PATCH] nvme-pci: Prevent mmio reads if pci channel offline

On 2/27/2019 10:42 AM, Gagniuc, Alexandru - Dell Team wrote:
>
> [EXTERNAL EMAIL]
>
> On 2/26/19 7:02 PM, Linus Torvalds wrote:
>> On Tue, Feb 26, 2019 at 2:37 PM <[email protected]> wrote:
>>>
>>> Then nobody gets the (error) message. You can go a bit further and try
>>> 'pcie_ports=native". Again, nobody gets the memo. ):
>>
>> So? The error was bogus to begin with. Why would we care?
>
> Of course nobody cares about that. We care about actual errors that we
> now know we won't be notified of. Imagine if we didn't get the memo that
> a piece of data is corrupt, and imagine the reaction of RAS folk.
>
> And I know the counter to that is a panic() is much more likely to cause
> data corruption, and we're trading one piece of crap for an even
> stinkier one. Whatever we end up doing, we have to do better than
> silence errors and pretend nothing happened.
>
>
>> Yes, yes, PCI bridges have the ability to return errors in accesses to
>> non-existent devices. But that was always bogus, and is never useful.
>> The whole "you get an interrupt or NMI on a bad access" is simply a
>> horribly broken model. It's not useful.
>>
>> We already have long depended on hotplug drivers noticing the "oh, I'm
>> getting all-ff returns, the device may be gone". It's usually trivial,
>> and works a whole lot better.
>
> And that's been working great, hasn't it? I think you're thinking
> strictly about hotplug. There are other situations where things are all
> F'd, but the hardware isn't sending all F's. (example: ECRC errors)
>
>
>> It's not an error. Trying to force it to be an NMI or SCI or machine
>> check is bogus. It causes horrendous pain, because asynchronous
>> reporting doesn't work reliably anyway, and *synchronous* reporting is
>> impossible to sanely handle without crazy problems.
>>
>> So the only sane model for hotplug devices is "IO still works, and
>> returns all ones". Maybe with an async one-time and *recoverable*
>> machine check or other reporting the access after the fact.
>
> Exactly!!! A notification (not calling it an 'error') that something
> unusual has happened is good. Treating these things like errors is so
> obvious, even a caveman wouldn't do it.
> In a world with FFS, we don't always get to have that model. Oh, FFS!
>
>
>> Anything else is simply broken. It would be broken even if firmware
>> wasn't involved, but obviously firmware people tend to often make a
>> bad situation even worse.
>
> Linus, be nice to firmware people. I've met a few, and I can vouch that
> they're very kind and nice. They're also very scared, especially when OS
> people want to ask them a few questions.
>
> I think FFS should get out of the way when OS advertises it's capable of
> handling XYZ. There are some good arguments why this hasn't happened,
> but I won't get into details. I do think it's unlikely that machines
> will be moving back to an OS-controlled model.
>
> And Linus, keep in mind, when these machines were developed, OSes
> couldn't handle recovery properly. None of this was ever an issue. It's
> our fault that we've changed the OS after the machines are on the market.

Just to add some background on these particular systems... at the time
they were designed none of the Linux distros we use supported the PCI
Error Recovery Services or maybe they did and we simply didn't know
about it. We found out rather by accident after the systems were
shipped that Linux had this ability. It was a pleasant surprise as
we've been asking OSVs to try to recover from PCI/PCIe errors for years.
Linux is the first (and still only) OS we use that can has a PCIe
error recovery model so kudos on that!

100% agree that crashing the system on PCIe errors like this is terrible
and we want to get to a recovery model. It was too late for the
generation of systems being discussed as it is a big paradigm shift and
lots of changes/validation that folks are not comfortable doing on
shipping products. But it's coming in future products.

We also noticed that there was no mechanism in the recovery models for
system firmware and OS to interact and come to know if recovery services
are available and no way for OS to inform platform if error recovery was
successful or not and no way to establish control of Downstream Port
Containment. This model - which PCI-SIG is calling Containment Error
Recovery - has been added in the relevant PCI-SIG documents and ACPI
spec and I believe Intel will start pushing patches soon. Hopefully
this will provide the industry standard solution needed to get to a full
error recovery model for PCIe-related errors.

There are other hardware additions in PCIe that could help with
synchronizing errors such as the RP PIO synchronous exception handling
added to PCIe as part of eDPC. Hardware is sparse but shipping AMD
Naples products already support this new hardware. I expect more
hardware to support as the industry shifts to Downstream Port
Containment/Containment Error Recovery model.

For these issues on existing platforms, I'm not sure what could be done.
Continuing to crash may be necessary until the CER solution is in place.

BTW, this patch in particular is complaining about an error for a
removed device. The Dell servers referenced in this chain will check if
the device is removed and if so it will suppress the error so I don't
think they are susceptible to this particular issue and I agree it is
broken if they do. If that is the case we can and will fix it in firmware.

>
> Alex
>
> _______________________________________________
> Linux-nvme mailing list
> [email protected]
> http://lists.infradead.org/mailman/listinfo/linux-nvme
>

2019-02-27 18:09:02

[permalink] [raw]

Subject: Re: [PATCH] nvme-pci: Prevent mmio reads if pci channel offline

On 2/27/19 11:51 AM, Keith Busch wrote:
> I can't tell where you're going with this. It doesn't sound like you're
> talking about hotplug anymore, at least.

We're trying to fix an issue related to hotplug. However, the proposed
fixes may have unintended consequences and side-effects. I want to make
sure that we're aware of those potential problems.

Alex

2019-02-27 20:12:29

[permalink] [raw]

Subject: Re: [PATCH] nvme-pci: Prevent mmio reads if pci channel offline

On 2/27/2019 11:56 AM, Bolen, Austin wrote:
>
> BTW, this patch in particular is complaining about an error for a
> removed device. The Dell servers referenced in this chain will check if
> the device is removed and if so it will suppress the error so I don't
> think they are susceptible to this particular issue and I agree it is
> broken if they do. If that is the case we can and will fix it in firmware.
>

Confirmed this issue does not apply to the referenced Dell servers so I
don't not have a stake in how this should be handled for those systems.
It may be they just don't support surprise removal. I know in our case
all the Linux distributions we qualify (RHEL, SLES, Ubuntu Server) have
told us they do not support surprise removal. So I'm guessing that any
issues found with surprise removal could potentially fall under the
category of "unsupported".

Still though, the larger issue of recovering from other types of PCIe
errors that are not due to device removal is still important. I would
expect many system from many platform makers to not be able to recover
PCIe errors in general and hopefully the new DPC CER model will help
address this and provide added protection for cases like above as well.

Thanks,
Austin

2019-02-28 14:19:23

by Christoph Hellwig

[permalink] [raw]

Subject: Re: [PATCH] nvme-pci: Prevent mmio reads if pci channel offline

On Wed, Feb 27, 2019 at 08:04:35PM +0000, [email protected] wrote:
> Confirmed this issue does not apply to the referenced Dell servers so I
> don't not have a stake in how this should be handled for those systems.
> It may be they just don't support surprise removal. I know in our case
> all the Linux distributions we qualify (RHEL, SLES, Ubuntu Server) have
> told us they do not support surprise removal. So I'm guessing that any
> issues found with surprise removal could potentially fall under the
> category of "unsupported".
>
> Still though, the larger issue of recovering from other types of PCIe
> errors that are not due to device removal is still important. I would
> expect many system from many platform makers to not be able to recover
> PCIe errors in general and hopefully the new DPC CER model will help
> address this and provide added protection for cases like above as well.

FYI, a related issue I saw about a year two ago with Dell servers was
with a dual ported NVMe add-in (non U.2) card, is that once you did
a subsystem reset, which would cause both controller to retrain the link
you'd run into Firmware First error handling issue that would instantly
crash the system. I don't really have the hardware anymore, but the
end result was that I think the affected product ended up shipping
with subsystem resets only enabled for the U.2 form factor.

2019-02-28 23:33:32

[permalink] [raw]

Subject: Re: [PATCH] nvme-pci: Prevent mmio reads if pci channel offline

On Thu, Feb 28, 2019 at 11:10:11PM +0000, [email protected] wrote:
> I'd also note that in PCIe, things that intentionally take the link down
> like SBR or Link Disable suppress surprise down error reporting. But
> NSSR doesn't have this requirement to suppress surprise down reporting.
> I think that's a gap on the part of the NVMe spec.

SBR and Link Disable are done from the down stream port, though, so the
host can still communicate with the function that took the link down.
That's entirely different than taking the link down from the end device,
so I'm not sure how NVMe can fix that.

But I can't even recall why NVMe defined NSSR to require PCIe LTSSM
Detect. That seems entirely unnecessary and is just asking for trouble.

2019-02-28 23:46:53

[permalink] [raw]

Subject: Re: [PATCH] nvme-pci: Prevent mmio reads if pci channel offline

On 2/28/2019 5:20 PM, Keith Busch wrote:
>
> [EXTERNAL EMAIL]
>
> On Thu, Feb 28, 2019 at 11:10:11PM +0000, [email protected] wrote:
>> I'd also note that in PCIe, things that intentionally take the link down
>> like SBR or Link Disable suppress surprise down error reporting. But
>> NSSR doesn't have this requirement to suppress surprise down reporting.
>> I think that's a gap on the part of the NVMe spec.
>
> SBR and Link Disable are done from the down stream port, though, so the
> host can still communicate with the function that took the link down.
> That's entirely different than taking the link down from the end device,
> so I'm not sure how NVMe can fix that.
>

Agreed it is different. Here is one way they could have solved it: host
writes magic value to NSSRC but device latches this instead of
resetting. Then require host to do SBR. When device sees SBR with
magic value in NSSRC it does an NSSR.

> But I can't even recall why NVMe defined NSSR to require PCIe LTSSM
> Detect. That seems entirely unnecessary and is just asking for trouble.
>

Not sure why but maybe they wanted a full reset of everything including
the PCIe block. I can ask around. Also agree it is asking for trouble
to have device take the link down without parent bridge knowing what's
going on. There are new mechanism being proposed in NVMe that would
also allow the device to initiate link down with no co-ordination with
the parent bridge so may need to think on ways to avoid this issue for
these new similar mechanisms.

-Austin

2019-03-01 01:22:29

[permalink] [raw]

Subject: Re: [PATCH] nvme-pci: Prevent mmio reads if pci channel offline

On 2/28/2019 8:17 AM, Christoph Hellwig wrote:
>
> [EXTERNAL EMAIL]
>
> On Wed, Feb 27, 2019 at 08:04:35PM +0000, [email protected] wrote:
>> Confirmed this issue does not apply to the referenced Dell servers so I
>> don't not have a stake in how this should be handled for those systems.
>> It may be they just don't support surprise removal. I know in our case
>> all the Linux distributions we qualify (RHEL, SLES, Ubuntu Server) have
>> told us they do not support surprise removal. So I'm guessing that any
>> issues found with surprise removal could potentially fall under the
>> category of "unsupported".
>>
>> Still though, the larger issue of recovering from other types of PCIe
>> errors that are not due to device removal is still important. I would
>> expect many system from many platform makers to not be able to recover
>> PCIe errors in general and hopefully the new DPC CER model will help
>> address this and provide added protection for cases like above as well.
>
> FYI, a related issue I saw about a year two ago with Dell servers was
> with a dual ported NVMe add-in (non U.2) card, is that once you did
> a subsystem reset, which would cause both controller to retrain the link
> you'd run into Firmware First error handling issue that would instantly
> crash the system. I don't really have the hardware anymore, but the
> end result was that I think the affected product ended up shipping
> with subsystem resets only enabled for the U.2 form factor.
>

Yes, that's another good one. For add-in cards, they are not
hot-pluggable and so the platform will not set the Hot-Plug Surprise bit
in the port above them. So when the surprise link down happens the
platform will generate a fatal error. For U.2, the Hot-Plug Surprise
bit is set on these platforms which suppresses the fatal error. It's ok
to suppress in this case since OS will get notified via hot-plug
interrupt. In the case of the add-in card there is no hot-plug
interrupt and so the platform has no idea if the OS will handle the
surprise link down or not so platform has to err on the side of caution.
This is another case where the new containment error recovery model
will help by allowing platform to know if OS can recover from this error
or not.

Even if the system sets the Hot-Plug Surprise bit, the system can still
crater if OS does an NSSR and then some sort of MMIO is generated to the
downed port. Platforms that suppress errors for removed devices will
still escalate this error as fatal since the device is still present.
But again the error containment model should protect this case as well.

I'd also note that in PCIe, things that intentionally take the link down
like SBR or Link Disable suppress surprise down error reporting. But
NSSR doesn't have this requirement to suppress surprise down reporting.
I think that's a gap on the part of the NVMe spec.

-Austin

2019-03-01 01:59:47

[permalink] [raw]

Subject: Re: [PATCH] nvme-pci: Prevent mmio reads if pci channel offline

On Thu, Feb 28, 2019 at 11:43:46PM +0000, [email protected] wrote:
> On 2/28/2019 5:20 PM, Keith Busch wrote:
> > SBR and Link Disable are done from the down stream port, though, so the
> > host can still communicate with the function that took the link down.
> > That's entirely different than taking the link down from the end device,
> > so I'm not sure how NVMe can fix that.
> >
>
> Agreed it is different. Here is one way they could have solved it: host
> writes magic value to NSSRC but device latches this instead of
> resetting. Then require host to do SBR. When device sees SBR with
> magic value in NSSRC it does an NSSR.

For single port drives, yes, but that wouldn't work so well for multi-port
devices connected to different busses, maybe even across multiple hosts.
The equivalent of an FLR across all ports should have been sufficient,
IMO.

2019-03-01 02:13:41