2023-09-18 09:51:37

by Shuai Xue

[permalink] [raw]
Subject: Questions: Should kernel panic when PCIe fatal error occurs?

Hi, all folks,

Error reporting and recovery are one of the important features of PCIe, and
the kernel has been supporting them since version 2.6, 17 years ago.
I am very curious about the expected behavior of the software.
I first recap the error classification and then list my questions bellow it.

## Recap: Error classification

- Fatal Errors

Fatal errors are uncorrectable error conditions which render the particular
Link and related hardware unreliable. For Fatal errors, a reset of the
components on the Link may be required to return to reliable operation.
Platform handling of Fatal errors, and any efforts to limit the effects of
these errors, is platform implementation specific. (PCIe 6.0.1, sec
6.2.2.2.1 Fatal Errors).

- Non-Fatal Errors

Non-fatal errors are uncorrectable errors which cause a particular
transaction to be unreliable but the Link is otherwise fully functional.
Isolating Non-fatal from Fatal errors provides Requester/Receiver logic in
a device or system management software the opportunity to recover from the
error without resetting the components on the Link and disturbing other
transactions in progress. Devices not associated with the transaction in
error are not impacted by the error. (PCIe 6.0.1, sec 6.2.2.2.1 Non-Fatal
Errors).

## What the kernel do?

The Linux kernel supports both the OS native and firmware first modes in
AER and DPC drivers. The error recovery API is defined in `struct
pci_error_handlers`, and the recovery process is performed in several
stages in pcie_do_recovery(). One main difference in handling PCIe errors
is that the kernel only resets the link when a fatal error is detected.

## Questions

1. Should kernel panic when fatal errors occur without AER recovery?

IMHO, the answer is NO. The AER driver handles both fatal and non-fatal
errors, and I have not found any panic changes in the recovery path in OS
native mode.

As far as I know, on many X86 platforms, struct `acpi_hest_generic_status::error_severity`
is set as CPER_SEV_FATAL in firmware first mode. As a result, kernel will
panic immediately in ghes_proc() when fatal AER errors occur, and there
is no chance to handle the error and perform recovery in AER driver.

For fatal and non-fatal errors, struct `acpi_hest_generic_status::error_severity`
should as CPER_SEV_RECOVERABLE, and struct `acpi_hest_generic_data::error_severity`
should reflect its real severity. Then, the kernel is equivalent to handling
PCIe errors in Firmware first mode as it does in OS native mode.
Please correct me if I am wrong.

However, I have changed my mind on this issue as I encounter a case where
a error propagation is detected due to fatal DLLP (Data Link Protocol
Error) error. A DLLP error occurred in the Compute node, causing the
node to panic because `struct acpi_hest_generic_status::error_severity` was
set as CPER_SEV_FATAL. However, data corruption was still detected in the
storage node by CRC.

2. Should kernel panic when AER recovery failed?

This question is actually a TODO that was added when the AER driver was
first upstreamed 17 years ago, and it is still relevant today. The kernel
does not proactively panic regardless of the error types occurring in OS
native mode. The DLLP error propagation case indicates that the kernel
might should panic when recovery failed?

3. Should DPC be enabled by default to contain fatal and non-fatal error?

According to the PCIe specification, DPC halts PCIe traffic below a
Downstream Port after an unmasked uncorrectable error is detected at or
below the Port, avoiding the potential spread of any data corruption.

The kernel configures DPC to be triggered only on ERR_FATAL. Literally
speaking, only fatal error have the potential spread of any data
corruption? In addition, the AER Severity is programable by the
Uncorrectable Error Severity Register (Offset 0Ch in PCIe AER cap). If a
default fatal error, e.g. DLLP, set as non-fatal, DPC will not be
triggered.


Looking forward to any comments and reply :)

Thank you.

Best Regards,
Shuai


[1] https://github.com/torvalds/linux/commit/6c2b374d74857e892080ee726184ec1d15e7d4e4#diff-fea64904d30501b59d2e948189bbedc476fc270ed4c15e4ae29d7f0efd06771aR438










2023-09-20 23:07:22

by Bjorn Helgaas

[permalink] [raw]
Subject: Re: Questions: Should kernel panic when PCIe fatal error occurs?

On Mon, Sep 18, 2023 at 05:39:58PM +0800, Shuai Xue wrote:
> Hi, all folks,
>
> Error reporting and recovery are one of the important features of PCIe, and
> the kernel has been supporting them since version 2.6, 17 years ago.
> I am very curious about the expected behavior of the software.
> I first recap the error classification and then list my questions bellow it.
>
> ## Recap: Error classification
>
> - Fatal Errors
>
> Fatal errors are uncorrectable error conditions which render the particular
> Link and related hardware unreliable. For Fatal errors, a reset of the
> components on the Link may be required to return to reliable operation.
> Platform handling of Fatal errors, and any efforts to limit the effects of
> these errors, is platform implementation specific. (PCIe 6.0.1, sec
> 6.2.2.2.1 Fatal Errors).
>
> - Non-Fatal Errors
>
> Non-fatal errors are uncorrectable errors which cause a particular
> transaction to be unreliable but the Link is otherwise fully functional.
> Isolating Non-fatal from Fatal errors provides Requester/Receiver logic in
> a device or system management software the opportunity to recover from the
> error without resetting the components on the Link and disturbing other
> transactions in progress. Devices not associated with the transaction in
> error are not impacted by the error. (PCIe 6.0.1, sec 6.2.2.2.1 Non-Fatal
> Errors).
>
> ## What the kernel do?
>
> The Linux kernel supports both the OS native and firmware first modes in
> AER and DPC drivers. The error recovery API is defined in `struct
> pci_error_handlers`, and the recovery process is performed in several
> stages in pcie_do_recovery(). One main difference in handling PCIe errors
> is that the kernel only resets the link when a fatal error is detected.
>
> ## Questions
>
> 1. Should kernel panic when fatal errors occur without AER recovery?
>
> IMHO, the answer is NO. The AER driver handles both fatal and
> non-fatal errors, and I have not found any panic changes in the
> recovery path in OS native mode.
>
> As far as I know, on many X86 platforms, struct
> `acpi_hest_generic_status::error_severity` is set as CPER_SEV_FATAL
> in firmware first mode. As a result, kernel will panic immediately
> in ghes_proc() when fatal AER errors occur, and there is no chance
> to handle the error and perform recovery in AER driver.

UEFI r2.10, sec N.2.1,, defines CPER_SEV_FATAL, and platform firmware
decides which Error Severity to put in the error record. I don't see
anything in UEFI about how the OS should handle fatal errors.

ACPI r6.5, sec 18.1, says on fatal uncorrected error, the system
should be restarted to prevent propagation of the error. For
CPER_SEV_FATAL errors, it looks like ghes_proc() panics even before
trying AER recovery.

I guess your point is that for CPER_SEV_FATAL errors, the APEI/GHES
path always panics but the native path never does, and that maybe both
paths should work the same way?

It would be nice if they worked the same, but I suspect that vendors
may rely on the fact that CPER_SEV_FATAL forces a restart/panic as
part of their system integrity story.

It doesn't seem like the native path should always panic. If we can
tell that data was corrupted, we may want to panic, but otherwise I
don't think we should crash the entire system even if some device is
permanently broken.

> For fatal and non-fatal errors, struct
> `acpi_hest_generic_status::error_severity` should as
> CPER_SEV_RECOVERABLE, and struct
> `acpi_hest_generic_data::error_severity` should reflect its real
> severity. Then, the kernel is equivalent to handling PCIe errors in
> Firmware first mode as it does in OS native mode. Please correct me
> if I am wrong.

I don't know enough to comment on how Error Severity should be used in
the Generic Error Status Block vs the Generic Error Data Entry.

> However, I have changed my mind on this issue as I encounter a case where
> a error propagation is detected due to fatal DLLP (Data Link Protocol
> Error) error. A DLLP error occurred in the Compute node, causing the
> node to panic because `struct acpi_hest_generic_status::error_severity` was
> set as CPER_SEV_FATAL. However, data corruption was still detected in the
> storage node by CRC.

The only mention of Data Link Protocol Error that looks relevant is
PCIe r6.0, sec 3.6.2.2, which basically says a DLLP with an unexpected
Sequence Number should be discarded:

For Ack and Nak DLLPs, the following steps are followed (see Figure
3-21):

- If the Sequence Number specified by the AckNak_Seq_Num does not
correspond to an unacknowledged TLP, or to the value in
ACKD_SEQ, the DLLP is discarded

- This is a Data Link Protocol Error, which is a reported error
associated with the Port (see Section 6.2).

So data from that DLLP should not have made it to memory, although of
course the DMA may not have been completed. But it sounds like you
did see corrupted data written to memory?

I assume it is not reproducible and we have no reason to think the
receiver of the DLLP has a design defect, e.g., it reported the error
but failed to drop the DLLP?

> 2. Should kernel panic when AER recovery failed?
>
> This question is actually a TODO that was added when the AER driver was
> first upstreamed 17 years ago, and it is still relevant today. The kernel
> does not proactively panic regardless of the error types occurring in OS
> native mode. The DLLP error propagation case indicates that the kernel
> might should panic when recovery failed?

I'm not a hardware engineer, but I'm not yet convinced that a Data
Link Protocol Error should cause a panic because sec 3.6.2.2 suggests
that this error should not cause data corruption. Certainly willing
to be proved wrong!

> 3. Should DPC be enabled by default to contain fatal and non-fatal error?
>
> According to the PCIe specification, DPC halts PCIe traffic below a
> Downstream Port after an unmasked uncorrectable error is detected at or
> below the Port, avoiding the potential spread of any data corruption.
>
> The kernel configures DPC to be triggered only on ERR_FATAL. Literally
> speaking, only fatal error have the potential spread of any data
> corruption?

Sec 6.2.2.2 talks about fatal vs non-fatal but only in terms of
whether the error affects a particular transaction (non-fatal) or
everything related to a Link (fatal). Unless there's more detail
elsewhere, I would assume either could corrupt data.

> In addition, the AER Severity is programable by the
> Uncorrectable Error Severity Register (Offset 0Ch in PCIe AER cap). If a
> default fatal error, e.g. DLLP, set as non-fatal, DPC will not be
> triggered.

Sec 6.2.7 and 7.8.4.4 suggest the Data Link Protocol Error should be
a fatal error by default.

I don't think Linux changes PCI_ERR_UNC_DLP (unless there's an _HPX or
similar method), so I would expect it to be set as fatal.

Bjorn

> [1] https://github.com/torvalds/linux/commit/6c2b374d74857e892080ee726184ec1d15e7d4e4#diff-fea64904d30501b59d2e948189bbedc476fc270ed4c15e4ae29d7f0efd06771aR438

2023-09-21 22:35:37

by David Laight

[permalink] [raw]
Subject: RE: Questions: Should kernel panic when PCIe fatal error occurs?

> It would be nice if they worked the same, but I suspect that vendors
> may rely on the fact that CPER_SEV_FATAL forces a restart/panic as
> part of their system integrity story.

The file system errors created by a panic (especially an NMI panic)
could easily be more problematic than a failed PCIe data transfer.
Evan a read that returned ~0u - which can be checked for.

Panicking a system that is converting TDM telephony to RTP for the
911 emergency service because a PCIe cable/riser connecting one of the
TDM board has become loose doesn't seem ideal.
(Or because the TDM board's fpga has decided it isn't going to respond
to any accesses until the BARs are setup again...)

The system can carry on with some TDM connections disabled - but that
is ok because they are all duplicated in case a cable gets cuit.

(Yes - that is a live system...)

David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)

2023-09-21 23:24:05

by Bjorn Helgaas

[permalink] [raw]
Subject: Re: Questions: Should kernel panic when PCIe fatal error occurs?

On Thu, Sep 21, 2023 at 08:10:19PM +0800, Shuai Xue wrote:
> On 2023/9/21 07:02, Bjorn Helgaas wrote:
> > On Mon, Sep 18, 2023 at 05:39:58PM +0800, Shuai Xue wrote:
> ...

> > I guess your point is that for CPER_SEV_FATAL errors, the APEI/GHES
> > path always panics but the native path never does, and that maybe both
> > paths should work the same way?
>
> Yes, exactly. Both OS native and APEI/GHES firmware first are notifications
> used to handles PCIe AER errors, and IMHO, they should ideally work in the
> same way.

I agree, that would be nice, but the whole point of the APEI/GHES
functionality is vendor value-add, so I'm not sure we can achieve that
ideal.

> ...
> As a result, AER driver only does recovery for non-fatal PCIe error.

This is only true for the APEI/GHES path, right? For *native* AER
handling, we attempt recovery for both fatal and non-fatal errors.

> > It doesn't seem like the native path should always panic. If we can
> > tell that data was corrupted, we may want to panic, but otherwise I
> > don't think we should crash the entire system even if some device is
> > permanently broken.
>
> Got it. But how can we tell if the data is corrupted with OS native?

I naively expect that by PCIe protocol, corrupted DLLPs or TLPs
detected by CRC, sequence number errors, etc, would be discarded
before corrupting memory, so I doubt we'd get an uncorrectable error
that means "sorry, I just corrupted your data."

But DPC is advertised as "avoiding the potential spread of any data
corruption," so there must be some mechanisms of corruption, and since
DPC is triggered by either ERR_FATAL or ERR_NONFATAL, I guess maybe
the errors could tell us something. I'm going to quit speculating
because I obviously don't know enough about this area.

> >> However, I have changed my mind on this issue as I encounter a case where
> >> a error propagation is detected due to fatal DLLP (Data Link Protocol
> >> Error) error. A DLLP error occurred in the Compute node, causing the
> >> node to panic because `struct acpi_hest_generic_status::error_severity` was
> >> set as CPER_SEV_FATAL. However, data corruption was still detected in the
> >> storage node by CRC.
> >
> > The only mention of Data Link Protocol Error that looks relevant is
> > PCIe r6.0, sec 3.6.2.2, which basically says a DLLP with an unexpected
> > Sequence Number should be discarded:
> >
> > For Ack and Nak DLLPs, the following steps are followed (see Figure
> > 3-21):
> >
> > - If the Sequence Number specified by the AckNak_Seq_Num does not
> > correspond to an unacknowledged TLP, or to the value in
> > ACKD_SEQ, the DLLP is discarded
> >
> > - This is a Data Link Protocol Error, which is a reported error
> > associated with the Port (see Section 6.2).
> >
> > So data from that DLLP should not have made it to memory, although of
> > course the DMA may not have been completed. But it sounds like you
> > did see corrupted data written to memory?
>
> The storage node use RDMA to directly access remote compute node.
> And a error detected by CRC in the storage node. So I suspect yes.

When doing the CRC, can you distinguish between corrupted data and
data that was not written because a DMA was only partially completed?

> ...
> I tried to inject Data Link Protocol Error on some platform. The mechanism
> behind is that rootport controls the sequence number of the specific TLPs
> and ACK/NAK DLLPs. Data Link Protocol Error will be detected at the Rx side
> of ACK/NAK DLLPs.
>
> In such case, NIC and NVMe recovered on fatal and non-fatal DLLP
> errors.

I'm guessing this error injection directly writes the AER status bit,
which would probably only test the reporting (sending an ERR_FATAL
message), AER interrupt generation, firmware or OS interrupt handling,
etc.

It probably would not actually generate a DLLP with a bad sequence
number, so it probably does not test the hardware behavior of
discarding the DLLP if the sequence number is bad. Just my guess
though.

> ...
> My point is that how kernel could recover from non-fatal and fatal
> errors in firmware first without DPC? If CPER_SEV_FATAL is used to
> report fatal PCIe error, kernel will panic in APEI/GHES driver.

The platform decides whether to use CPER_SEV_FATAL, so we can't change
that. We *could* change whether Linux panics when the platform says
an error is CPER_SEV_FATAL. That happens in drivers/acpi, so it's
really up to Rafael.

Personally I would want to hear from vendors who use the APEI/GHES
path. Poking around the web for logs that mention HEST and related
things, it looks like at least Dell, HP, and Lenovo use it. And there
are drivers/acpi/apei commits from nxp.com, alibaba.com, amd.com,
arm.com huawei.com, etc., so some of them probably care, too.

Bjorn

2023-09-22 00:31:32

by David Laight

[permalink] [raw]
Subject: RE: Questions: Should kernel panic when PCIe fatal error occurs?

...
I've got a target to generate AER errors by generating read cycles
that are inside the address range that the bridge forwards but
outside of any BAR because there are 2 different sized BARs.
(Pretty easy to setup.)
On the system I was using they didn't get propagated all the way
to the root bridge - but were visible in the lower bridge.
It would be nice for a driver to be able to detect/clear such
a flag if it gets an unexpected ~0u read value.
(I'm not sure an error callback helps.)

OTOH a 'nebs compliant' server routed any kind of PCIe link error
through to some 'system management' logic that then raised an NMI.
I'm not sure who thought an NMI was a good idea - they are pretty
impossible to handle in the kernel and too late to be of use to
the code performing the access.

In any case we were getting one after 'echo 1 >xxx/remove' and
then taking the PCIe link down by reprogramming the fpga.
So the link going down was entirely expected, but there seemed
to be nothing we could do to stop the kernel crashing.

I'm sure 'nebs compliant' ought to contain some requirements for
resilience to hardware failures!

David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)

2023-09-22 02:48:24

by Shuai Xue

[permalink] [raw]
Subject: Re: Questions: Should kernel panic when PCIe fatal error occurs?

+ @Rafael for the APEI/GHES part.

On 2023/9/22 05:52, Bjorn Helgaas wrote:
> On Thu, Sep 21, 2023 at 08:10:19PM +0800, Shuai Xue wrote:
>> On 2023/9/21 07:02, Bjorn Helgaas wrote:
>>> On Mon, Sep 18, 2023 at 05:39:58PM +0800, Shuai Xue wrote:
>> ...
>
>>> I guess your point is that for CPER_SEV_FATAL errors, the APEI/GHES
>>> path always panics but the native path never does, and that maybe both
>>> paths should work the same way?
>>
>> Yes, exactly. Both OS native and APEI/GHES firmware first are notifications
>> used to handles PCIe AER errors, and IMHO, they should ideally work in the
>> same way.
>
> I agree, that would be nice, but the whole point of the APEI/GHES
> functionality is vendor value-add, so I'm not sure we can achieve that
> ideal.
>
>> ...
>> As a result, AER driver only does recovery for non-fatal PCIe error.
>
> This is only true for the APEI/GHES path, right? For *native* AER
> handling, we attempt recovery for both fatal and non-fatal errors.

Yes, exactly.

>
>>> It doesn't seem like the native path should always panic. If we can
>>> tell that data was corrupted, we may want to panic, but otherwise I
>>> don't think we should crash the entire system even if some device is
>>> permanently broken.
>>
>> Got it. But how can we tell if the data is corrupted with OS native?
>
> I naively expect that by PCIe protocol, corrupted DLLPs or TLPs
> detected by CRC, sequence number errors, etc, would be discarded
> before corrupting memory, so I doubt we'd get an uncorrectable error
> that means "sorry, I just corrupted your data."
>
> But DPC is advertised as "avoiding the potential spread of any data
> corruption," so there must be some mechanisms of corruption, and since
> DPC is triggered by either ERR_FATAL or ERR_NONFATAL, I guess maybe
> the errors could tell us something. I'm going to quit speculating
> because I obviously don't know enough about this area.
>
>>>> However, I have changed my mind on this issue as I encounter a case where
>>>> a error propagation is detected due to fatal DLLP (Data Link Protocol
>>>> Error) error. A DLLP error occurred in the Compute node, causing the
>>>> node to panic because `struct acpi_hest_generic_status::error_severity` was
>>>> set as CPER_SEV_FATAL. However, data corruption was still detected in the
>>>> storage node by CRC.
>>>
>>> The only mention of Data Link Protocol Error that looks relevant is
>>> PCIe r6.0, sec 3.6.2.2, which basically says a DLLP with an unexpected
>>> Sequence Number should be discarded:
>>>
>>> For Ack and Nak DLLPs, the following steps are followed (see Figure
>>> 3-21):
>>>
>>> - If the Sequence Number specified by the AckNak_Seq_Num does not
>>> correspond to an unacknowledged TLP, or to the value in
>>> ACKD_SEQ, the DLLP is discarded
>>>
>>> - This is a Data Link Protocol Error, which is a reported error
>>> associated with the Port (see Section 6.2).
>>>
>>> So data from that DLLP should not have made it to memory, although of
>>> course the DMA may not have been completed. But it sounds like you
>>> did see corrupted data written to memory?
>>
>> The storage node use RDMA to directly access remote compute node.
>> And a error detected by CRC in the storage node. So I suspect yes.
>
> When doing the CRC, can you distinguish between corrupted data and
> data that was not written because a DMA was only partially completed?

Yes, the receiving application layer will perform length verification.
So the data length is definitely correct.

>
>> ...
>> I tried to inject Data Link Protocol Error on some platform. The mechanism
>> behind is that rootport controls the sequence number of the specific TLPs
>> and ACK/NAK DLLPs. Data Link Protocol Error will be detected at the Rx side
>> of ACK/NAK DLLPs.
>>
>> In such case, NIC and NVMe recovered on fatal and non-fatal DLLP
>> errors.
>
> I'm guessing this error injection directly writes the AER status bit,
> which would probably only test the reporting (sending an ERR_FATAL
> message), AER interrupt generation, firmware or OS interrupt handling,
> etc.
>
> It probably would not actually generate a DLLP with a bad sequence
> number, so it probably does not test the hardware behavior of
> discarding the DLLP if the sequence number is bad. Just my guess
> though.

No, we don't touch AER status bit. The Root port controller provides Error
Injection Function to trigger a real DLLP error. For example,

- set a bad Bad sequence number, assuming 3
- enable error injection
- send a TLP from the controller's Application Interface, assuming SEQ#5 is
given to the TLP
- the SEQ# is Changed to #2 by the Error Injection Function in Layer2.

>
>> ...
>> My point is that how kernel could recover from non-fatal and fatal
>> errors in firmware first without DPC? If CPER_SEV_FATAL is used to
>> report fatal PCIe error, kernel will panic in APEI/GHES driver.
>
> The platform decides whether to use CPER_SEV_FATAL, so we can't change
> that. We *could* change whether Linux panics when the platform says
> an error is CPER_SEV_FATAL. That happens in drivers/acpi, so it's
> really up to Rafael.

Agreed. Of course, we are talking about and focus on the desired behavior
of Linux kernel.

>
> Personally I would want to hear from vendors who use the APEI/GHES
> path. Poking around the web for logs that mention HEST and related
> things, it looks like at least Dell, HP, and Lenovo use it. And there
> are drivers/acpi/apei commits from nxp.com, alibaba.com, amd.com,
> arm.com huawei.com, etc., so some of them probably care, too.
>

Aha, sure. In OS native mode, kernel controls everything, directly acting
on the MSI, checking AER status and do recovery for CE, non-fatal and fatal
errors. While for firmware first, as it name indicates, the kernel and
firmware co-work together to handle processor, memory, PCIe and
platform specific errors.

Actually, this is a question from my colleague from firmware team.
The original question is that:

"Should I set CPER_SEV_FATAL for Generic Error Status Block when a
PCIe fatal error is detected? If set, kernel will always panic.
Otherwise, kernel will always not panic."

So I pull a question about desired behavior of Linux kernel first :)
From the perspective of the kernel, CPER_SEV_FATAL for Generic Error
Status Block is not reasonable. The kernel will attempt to recover
Fatal errors, although recovery may fail.

Best Regards,
Shuai




2023-09-25 01:44:08

by Shuai Xue

[permalink] [raw]
Subject: Re: Questions: Should kernel panic when PCIe fatal error occurs?



On 2023/9/21 21:20, David Laight wrote:
> ...
> I've got a target to generate AER errors by generating read cycles
> that are inside the address range that the bridge forwards but
> outside of any BAR because there are 2 different sized BARs.
> (Pretty easy to setup.)
> On the system I was using they didn't get propagated all the way
> to the root bridge - but were visible in the lower bridge.

So how did you observe it? If the error message does not propagate
to the root bridge, I think no AER interrupt will be trigger.

> It would be nice for a driver to be able to detect/clear such
> a flag if it gets an unexpected ~0u read value.
> (I'm not sure an error callback helps.)

IMHO, a general model is that error detected at endpoint should be
routed to upstream port for example: RCiEP route error message to RCEC,
so that the AER port service could handle the error, the device driver
only have to implement error handler callback.

>
> OTOH a 'nebs compliant' server routed any kind of PCIe link error
> through to some 'system management' logic that then raised an NMI.
> I'm not sure who thought an NMI was a good idea - they are pretty
> impossible to handle in the kernel and too late to be of use to
> the code performing the access.

I think it is the responsibility of the device to prevent the spread of
errors while reporting that errors have been detected. For example, drop
the current, (drain submit queue) and report error in completion record.
Both NMI and MSI are asynchronous interrupts.

>
> In any case we were getting one after 'echo 1 >xxx/remove' and
> then taking the PCIe link down by reprogramming the fpga.
> So the link going down was entirely expected, but there seemed
> to be nothing we could do to stop the kernel crashing.
>
> I'm sure 'nebs compliant' ought to contain some requirements for
> resilience to hardware failures!

How the kernel crash after a link down? Did the system detect a surprise
down error?

Best Regards,
Shuai

2023-09-25 08:24:55

by David Laight

[permalink] [raw]
Subject: RE: Questions: Should kernel panic when PCIe fatal error occurs?

From: Shuai Xue
> Sent: 25 September 2023 02:44
>
> On 2023/9/21 21:20, David Laight wrote:
> > ...
> > I've got a target to generate AER errors by generating read cycles
> > that are inside the address range that the bridge forwards but
> > outside of any BAR because there are 2 different sized BARs.
> > (Pretty easy to setup.)
> > On the system I was using they didn't get propagated all the way
> > to the root bridge - but were visible in the lower bridge.
>
> So how did you observe it? If the error message does not propagate
> to the root bridge, I think no AER interrupt will be trigger.

I looked at the internal registers (IIRC in PCIe config space)
of the intermediate bridge.
I don't think the root bridge on that system supported AER.
(I was testing the generation of AER indications by our fpga.)

>
> > It would be nice for a driver to be able to detect/clear such
> > a flag if it gets an unexpected ~0u read value.
> > (I'm not sure an error callback helps.)
>
> IMHO, a general model is that error detected at endpoint should be
> routed to upstream port for example: RCiEP route error message to RCEC,
> so that the AER port service could handle the error, the device driver
> only have to implement error handler callback.

The problem is that that and callback is too late for something
triggered by a PCIe read.
The driver has to detect that the value is 'dubious' and wants
a method of detecting whether there was an associated AER (or other)
error.
If the AER indication is routed through some external entity (like
board management hardware) there will be additional latency that
means that the associated interrupt (even if an NMI) may not have
been processed when the driver code is trying to determine what
happened.
This can only be made worse by the interrupt coming in on a
different cpu.

> > OTOH a 'nebs compliant' server routed any kind of PCIe link error
> > through to some 'system management' logic that then raised an NMI.
> > I'm not sure who thought an NMI was a good idea - they are pretty
> > impossible to handle in the kernel and too late to be of use to
> > the code performing the access.
>
> I think it is the responsibility of the device to prevent the spread of
> errors while reporting that errors have been detected. For example, drop
> the current, (drain submit queue) and report error in completion record.

Eh?
I can generate two types of PCIe error:
- Read/write requests for addresses that aren't inside a BAR.
- Link failures that cause retraining and might need config
space reconfiguring.

> Both NMI and MSI are asynchronous interrupts.

Indeed, which makes neither of them suitable for any indication
relating to a bus cycle failure.

> > In any case we were getting one after 'echo 1 >xxx/remove' and
> > then taking the PCIe link down by reprogramming the fpga.
> > So the link going down was entirely expected, but there seemed
> > to be nothing we could do to stop the kernel crashing.
> >
> > I'm sure 'nebs compliant' ought to contain some requirements for
> > resilience to hardware failures!
>
> How the kernel crash after a link down? Did the system detect a surprise
> down error?

It was a couple of years ago..
IIRC the 'link down' cause the hub to generate an AER error.
The root hub forwarded it to some 'board management hardware/software'
that then raised and NMI.
The kernel crashed because of an unexpected NMI.

David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)

2023-09-25 09:16:43

by Oliver O'Halloran

[permalink] [raw]
Subject: Re: Questions: Should kernel panic when PCIe fatal error occurs?

On Fri, Sep 22, 2023 at 8:23 AM David Laight <[email protected]> wrote:
>
> > It would be nice if they worked the same, but I suspect that vendors
> > may rely on the fact that CPER_SEV_FATAL forces a restart/panic as
> > part of their system integrity story.
>
> The file system errors created by a panic (especially an NMI panic)
> could easily be more problematic than a failed PCIe data transfer.
> Evan a read that returned ~0u - which can be checked for.
>
> Panicking a system that is converting TDM telephony to RTP for the
> 911 emergency service because a PCIe cable/riser connecting one of the
> TDM board has become loose doesn't seem ideal.

For kernel native AER the default reaction to errors is
reset-and-reinit which probably isn't much better for your case.
Sounds like you would want a knob to suppress everything except error
reporting so you can handle it in userspace?

> (Or because the TDM board's fpga has decided it isn't going to respond
> to any accesses until the BARs are setup again...)
>
> The system can carry on with some TDM connections disabled - but that
> is ok because they are all duplicated in case a cable gets cuit.

Well that's a relief :)

Oliver

2023-09-26 23:47:55

by Bjorn Helgaas

[permalink] [raw]
Subject: Re: Questions: Should kernel panic when PCIe fatal error occurs?

On Fri, Sep 22, 2023 at 10:46:36AM +0800, Shuai Xue wrote:
> ...

> Actually, this is a question from my colleague from firmware team.
> The original question is that:
>
> "Should I set CPER_SEV_FATAL for Generic Error Status Block when a
> PCIe fatal error is detected? If set, kernel will always panic.
> Otherwise, kernel will always not panic."
>
> So I pull a question about desired behavior of Linux kernel first :)
> From the perspective of the kernel, CPER_SEV_FATAL for Generic Error
> Status Block is not reasonable. The kernel will attempt to recover
> Fatal errors, although recovery may fail.

I don't know the semantics of CPER_SEV_FATAL or why it's there.
With CPER, we have *two* error severities: a "native" one defined by
the PCIe spec and another defined by the platform via CPER.

I speculate that the reason for the CPER severity could be to provide
a severity for error sources that don't have a "native" severity like
AER does, or for the vendor to force the OS to restart (for
CPER_SEV_FATAL, anyway) in cases where it might not otherwise.

In the native case, we only have the PCIe severity and don't have the
CPER severity at all, and I suspect that unless there's uncontained
data corruption, we would rather handle even the most severe PCIe
fatal error by disabling the specific device(s) instead of panicking
and restarting the whole machine.

So for PCIe errors, I'm not sure setting CPER_SEV_FATAL is beneficial
unless the platform wants to force the OS to panic, e.g., maybe the
platform knows about data corruption and/or the vendor wants the OS to
panic as part of a reliability story.

Presumably the platform has already logged the error, and I assume the
platform *could* restart without even returning to the OS, but maybe
it wants the OS to do a crashdump or shutdown in a more orderly way.

Bjorn

2023-09-27 03:58:49

by Shuai Xue

[permalink] [raw]
Subject: Re: Questions: Should kernel panic when PCIe fatal error occurs?



On 2023/9/27 07:02, Bjorn Helgaas wrote:
> On Fri, Sep 22, 2023 at 10:46:36AM +0800, Shuai Xue wrote:
>> ...
>
>> Actually, this is a question from my colleague from firmware team.
>> The original question is that:
>>
>> "Should I set CPER_SEV_FATAL for Generic Error Status Block when a
>> PCIe fatal error is detected? If set, kernel will always panic.
>> Otherwise, kernel will always not panic."
>>
>> So I pull a question about desired behavior of Linux kernel first :)
>> From the perspective of the kernel, CPER_SEV_FATAL for Generic Error
>> Status Block is not reasonable. The kernel will attempt to recover
>> Fatal errors, although recovery may fail.
>
> I don't know the semantics of CPER_SEV_FATAL or why it's there.
> With CPER, we have *two* error severities: a "native" one defined by
> the PCIe spec and another defined by the platform via CPER.
>
> I speculate that the reason for the CPER severity could be to provide
> a severity for error sources that don't have a "native" severity like
> AER does, or for the vendor to force the OS to restart (for
> CPER_SEV_FATAL, anyway) in cases where it might not otherwise.

Agreed, it is the key point.

Per ACPI 6.5 18.1 Hardware Errors and Error Sources[1]:

- An uncorrected error is a hardware error condition that cannot be
corrected by the hardware or by the firmware. Uncorrected errors
are either fatal or non-fatal.

- A fatal hardware error is an uncorrected or uncontained error
condition that is determined to be unrecoverable by the hardware.
When a fatal uncorrected error occurs, the system is restarted to
prevent propagation of the error.

A non-fatal hardware error is an uncorrected error condition from
which OSPM can attempt recovery by trying to correct the error.
These are also referred to as correctable or recoverable errors.

Based on our discussion and the PCIe and APCI Spec:

- Native AER fatal error defined in PCIe does not indate that there's
uncontained data corruption.
- The kernel is capable of handle native AER fatal and non-fatal errors.
- When a CPER_SEV_FATAL error nofitied by firmware, it indicates the
platform wants to force the OS to restart, and the APEI/GHES driver follows
the Spec now.

(Please correct me if I misunderstand any)


>
> In the native case, we only have the PCIe severity and don't have the
> CPER severity at all, and I suspect that unless there's uncontained
> data corruption, we would rather handle even the most severe PCIe
> fatal error by disabling the specific device(s) instead of panicking
> and restarting the whole machine.
>
> So for PCIe errors, I'm not sure setting CPER_SEV_FATAL is beneficial
> unless the platform wants to force the OS to panic, e.g., maybe the
> platform knows about data corruption and/or the vendor wants the OS to
> panic as part of a reliability story.

So back to the original question, I think your above comments are clear enough.

>
> Presumably the platform has already logged the error, and I assume the
> platform *could* restart without even returning to the OS, but maybe
> it wants the OS to do a crashdump or shutdown in a more orderly way.
>

If the system is reset in platform without even returning to the OS,
it is not visible to end user. IMHO, it always a bad choice.
The OS can provide enhanced debuggability, for example:

- providing details about the runtime context through crashdump
- saving error information to persistent storage

Thank you for your patience and valuable feedback. It is greatly appreciated
and truly helpful.

Best Regards and Cheers.
Shuai




[1] https://uefi.org/specs/ACPI/6.5/18_Platform_Error_Interfaces.html#hardware-errors-and-error-sources

2023-09-27 05:50:19

by Oliver O'Halloran

[permalink] [raw]
Subject: Re: Questions: Should kernel panic when PCIe fatal error occurs?

On Wed, Sep 27, 2023 at 9:03 AM Bjorn Helgaas <[email protected]> wrote:
>
> On Fri, Sep 22, 2023 at 10:46:36AM +0800, Shuai Xue wrote:
> > ...
>
> > Actually, this is a question from my colleague from firmware team.
> > The original question is that:
> >
> > "Should I set CPER_SEV_FATAL for Generic Error Status Block when a
> > PCIe fatal error is detected? If set, kernel will always panic.
> > Otherwise, kernel will always not panic."
> >
> > So I pull a question about desired behavior of Linux kernel first :)
> > From the perspective of the kernel, CPER_SEV_FATAL for Generic Error
> > Status Block is not reasonable. The kernel will attempt to recover
> > Fatal errors, although recovery may fail.
>
> I don't know the semantics of CPER_SEV_FATAL or why it's there.
> With CPER, we have *two* error severities: a "native" one defined by
> the PCIe spec and another defined by the platform via CPER.
>
> I speculate that the reason for the CPER severity could be to provide
> a severity for error sources that don't have a "native" severity like
> AER does, or for the vendor to force the OS to restart (for
> CPER_SEV_FATAL, anyway) in cases where it might not otherwise.
>
> In the native case, we only have the PCIe severity and don't have the
> CPER severity at all, and I suspect that unless there's uncontained
> data corruption, we would rather handle even the most severe PCIe
> fatal error by disabling the specific device(s) instead of panicking
> and restarting the whole machine.

From a user's point of view disabling a device is often worse than a
reboot. If you get a fatal error from a system's only network card
then disabling the card may result in the system being uncontactable
until someone manually recovers it. Similarly, if the disk hosting the
root filesystem disappears the system may not crash immediately (most
of what it needs will be in page cache), but there's no guarantee that
it can do anything useful in that state. In both cases forcing a
reboot will probably bring the system back into a usable state.

> So for PCIe errors, I'm not sure setting CPER_SEV_FATAL is beneficial
> unless the platform wants to force the OS to panic, e.g., maybe the
> platform knows about data corruption and/or the vendor wants the OS to
> panic as part of a reliability story.

The PCIe spec is (intentionally?) vague about the causes of fatal
errors. For all we know a device is reporting one because the embedded
OS it was running crashed and as a side effect it's been DMAing junk
into system memory for the past hour. If you know something about the
device in question maybe you can make those assumptions, but in
general it's impossible to assess the actual severity of an error.
Even in the case of a noisy link causing bit flips (it's possible,
LCRC is only 16bit and ECEC is optional) if we get corruption of the
address bits of the TLP header then the DMA might have overwritten
something important to the OS. From a hardware vendor's point of view
just forcing a reboot makes a lot of sense.

Paranoia aside, in a lot of cases PCI device errors are nothing major
and can be resolved by just restarting the device. However, there's no
way for generic kernel code to make that assessment and we probably
shouldn't have the kernel guess. I'd say the safest option would be to
punt that decision to userspace and provide some way to whitelist
devices that we can ignore errors from. I'm not familiar enough with
the ACPI to know if we have enough details in the notification it
sends to actually implement that though.

Oliver