by Duan, Zhenzhong

[permalink] [raw]

Subject: RE: [PATCH v4 3/3] PCI/AER: Clear UNCOR_STATUS bits that might be ANFE

>-----Original Message-----
>From: Kuppuswamy Sathyanarayanan
><[email protected]>
>Subject: Re: [PATCH v4 3/3] PCI/AER: Clear UNCOR_STATUS bits that might
>be ANFE
>
>
>On 6/13/24 7:40 PM, Duan, Zhenzhong wrote:
>>
>>> -----Original Message-----
>>> From: Kuppuswamy, Sathyanarayanan
>>> Subject: Re: [PATCH v4 3/3] PCI/AER: Clear UNCOR_STATUS bits that
>might
>>> be ANFE
>>>
>>>
>>> On 5/9/24 1:48 AM, Zhenzhong Duan wrote:
>>>> When processing an ANFE, ideally both correctable error(CE) status and
>>>> uncorrectable error(UE) status should be cleared. However, there is no
>>>> way to fully identify the UE associated with ANFE. Even worse, Non-Fatal
>>>> Error(NFE) may set the same UE status bit as ANFE. Treating an ANFE as
>>>> NFE will bring some issues, i.e., breaking softwore probing; treating
>>> /s/softwore/software
>> Good catch, will fix. It's strange 'checkpatch --codespell' doesn't catch this.
>>
>>> May be this is already discussed. But can you explain why treating
>>> AFNE as non-fatal error will bring probing issues?
>> Copied below from spec 6.1, 6.2.3.2.4, says it can results in a System Error.
>>
>> In some cases the detector of a non-fatal error is not the most appropriate
>agent to determine whether the error is
>> recoverable or not, or if it even needs any recovery action at all. For
>example, if software attempts to perform a
>> configuration read from a non-existent device or Function, the resulting UR
>Status in the Completion will signal the error
>> to software, and software does not need for the Completer in addition to
>signal the error by sending an ERR_NONFATAL
>> Message. In fact, on some platforms, signaling the error with
>ERR_NONFATAL results in a System Error, which breaks
>> normal software probing.
>>
>>>> NFE as ANFE will make us ignoring some UEs which need active recover
>>> /s/ignoring/ignore
>> Will fix.
>>
>>>> operation. To avoid clearing UEs that are not ANFE by accident, the
>>>> most conservative route is taken here: If any of the NFE Detected bits
>>>> is set in Device Status, do not touch UE status, they should be cleared
>>>> later by the UE handler. Otherwise, a specific set of UEs that may be
>>>> raised as ANFE according to the PCIe specification will be cleared if
>>>> their corresponding severity is Non-Fatal.
>>>>
>>>> For instance, previously when kernel receives an ANFE with Poisoned
>TLP
>>>> in OS native AER mode, only status of CE will be reported and cleared:
>>>>
>>>> AER: Correctable error message received from 0000:b7:02.0
>>>> PCIe Bus Error: severity=Correctable, type=Transaction Layer, (Receiver
>ID)
>>>> device [8086:0db0] error status/mask=00002000/00000000
>>>> [13] NonFatalErr
>>>>
>>>> If the kernel receives a Malformed TLP after that, two UEs will be
>>>> reported, which is unexpected. Malformed TLP Header is lost since
>>>> the previous ANFE gated the TLP header logs:
>>>>
>>>> PCIe Bus Error: severity="Uncorrectable (Fatal), type=Transaction Layer,
>>> (Receiver ID)
>>>> device [8086:0db0] error status/mask=00041000/00180020
>>>> [12] TLP (First)
>>>> [18] MalfTLP
>>>>
>>>> Now, for the same scenario, both CE status and related UE status will be
>>>> reported and cleared after ANFE:
>>>>
>>>> AER: Correctable error message received from 0000:b7:02.0
>>>> PCIe Bus Error: severity=Correctable, type=Transaction Layer, (Receiver
>ID)
>>>> device [8086:0db0] error status/mask=00002000/00000000
>>>> [13] NonFatalErr
>>>> Uncorrectable errors that may cause Advisory Non-Fatal:
>>>> [18] TLP
>>>>
>>>> Tested-by: Yudong Wang <[email protected]>
>>>> Co-developed-by: "Wang, Qingshun" <[email protected]>
>>>> Signed-off-by: "Wang, Qingshun" <[email protected]>
>>>> Signed-off-by: Zhenzhong Duan <[email protected]>
>>>> ---
>>>> drivers/pci/pcie/aer.c | 7 ++++++-
>>>> 1 file changed, 6 insertions(+), 1 deletion(-)
>>>>
>>>> diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
>>>> index ed435f09ac27..6a6a3a40569a 100644
>>>> --- a/drivers/pci/pcie/aer.c
>>>> +++ b/drivers/pci/pcie/aer.c
>>>> @@ -1115,9 +1115,14 @@ static void pci_aer_handle_error(struct
>>> pci_dev *dev, struct aer_err_info *info)
>>>> * Correctable error does not need software intervention.
>>>> * No need to go through error recovery process.
>>>> */
>>>> - if (aer)
>>>> + if (aer) {
>>>> pci_write_config_dword(dev, aer +
>>> PCI_ERR_COR_STATUS,
>>>> info->status);
>>>> + if (info->anfe_status)
>>>> + pci_write_config_dword(dev,
>>>> + aer +
>>> PCI_ERR_UNCOR_STATUS,
>>>> + info->anfe_status);
>>>> + }
>>> Why split the handling part and storing part into two patches? Why not
>>> merge
>>> this part of patch 1/3.
>> This is based on Bjorn's suggestion at https://www.spinics.net/lists/linux-
>pci/msg149012.html,
>> clearing UNCOR_STATUS might be more important, deserve to raise out.
>
>I think Bjorn's suggestion is to divide it into two logical patches.
>One for printing the error and another to clear the UNCOR_STATUS
>properly. But currently you have split the UNCOR_STATUS status caching and
>clearing process into two patches. IMO, your first patch can store ANFE
>status and clear it. You can add print support in the second patch.

OK, I'll merge patch1 with patch3 in next version.

>
>
>Code wise it looks fine to me. You can add my Reviewed-by after fixing
>the typos
>
>Reviewed-by: Kuppuswamy Sathyanarayanan
><[email protected]>

Thanks
zhenzhong

>
>>
>> Thanks
>> Zhenzhong
>
>--
>Sathyanarayanan Kuppuswamy
>Linux Kernel Developer