LinuxLists.cc - Re: Race b/w pciehp and FirmwareFirst DPC->EDR flow

2023-11-27 14:44:28

Subject: Re: Race b/w pciehp and FirmwareFirst DPC->EDR flow - Reg

Hi Bjorn and others,
Could you please share your thoughts on this?

Thanks,
Vidya Sagar

On 11/10/2023 10:31 PM, Vidya Sagar wrote:
> There was a typo. Corrected it.
>
> s/If DPC->EDR flow checks DPC also/If DPC->EDR flow checks 'PD Change' also
>
> On 11/10/2023 6:30 PM, Vidya Sagar wrote:
>> Hi folks,
>> Here are the platform details.
>> - System with a Firmware First approach for both AER and DPC
>> - CPERs are sent to the OS for the AER events
>> - DPC is notified to the OS through EDR mechanism
>> - System doesn't have support for in-band PD and supports only OOB PD
>> where writing to a private register would set the PD state
>> - System has this design where PD state gets cleared whenever there is
>> a link down (without even writing to the private register)
>>
>> In this system, if the root port has to experience a DPC because of
>> any right reasons (say SW trigger of DPC for time being), I'm
>> observing the following flow which is causing a race.
>> 1. Since DPC brings the link down, there is a DLLSC and an MSI is sent
>> to the OS hence PCIe HP ISR is invoked.
>> 2. ISR then takes stock of the Slot_Status register to see what all
>> events caused the MSI generation.
>> 3. It sees both DLLSC and PDC bits getting set.
>> 4. PDC is set because of the aforementioned hardware design where for
>> every link down, PD state gets cleared and since this is considered as
>> a change in the PD state, PDC also gets set.
>> 5. PCIe HP ISR flow transitions to the PCIe HP IST (interrupt
>> thread/bottom half) and waits for the DPC_EDR to complete (because
>> DLLSC is one of the events)
>> 6. Parallel to the PCIe HP ISR/IST, DPC interrupt is raised to the
>> firmware and that would then send an EDR event to the OS. Firmware
>> also sets the PD state to '1' before that, as the endpoint device is
>> still available.
>> 7. Firmware programming of the private register to set the PD state
>> raises second interrupt and PCIe HP ISR takes stock of the events and
>> this time it is only the PDC and not DLLSC. ISR sends a wake to the
>> IST, but since there is an IST in progress already, nothing much
>> happens at this point.
>> 8. Once the DPC is completed and link comes back up again, DPC
>> framework asks the endpoint drivers to handle it by calling the
>> 'pci_error_handlers' callabacks registered by the endpoint device
>> drivers.
>> 9. The PCIe HP IST (of the very first MSI) resumes after receiving the
>> completion from the DPC framework saying that DPC recovery is successful.
>> 10. Since PDC (Presence Detect Change) bit is also set for the first
>> interrupt, IST attempts to remove the devices (as part of
>> pciehp_handle_presence_or_link_change())
>>
>> At this point, there is a race between the device driver that is
>> trying to work with the device (through pci_error_handlers callback)
>> and the IST that is trying to remove the device.
>> To be fair to pciehp_handle_presence_or_link_change(), after removing
>> the devices, it checks for the link-up/PD being '1' and scans the
>> devices again if the device is still available. But unfortunately, IST
>> is deadlocked (with the device driver) while removing the devices
>> itself and won't go to the next step.
>>
>> The rationale given in the form of a comment in
>> pciehp_handle_presence_or_link_change() for removing the devices first
>> (without checking PD/link-up) and adding them back after checking
>> link-up/PD is that, since there is a change in PD, the new link-up
>> need not be with the same old device rather it could be with a
>> different device. So, it justifies in removing the existing devices
>> and then scanning for new ones. But this is causing deadlock with the
>> already initiated DPC recovery process.
>>
>> I see two ways to avoid the deadlock in this scenario.
>> 1. When PCIe HP IST is looking at both DLLSC and PDC, why should
>> DPC->EDR flow look at only DLLSC and not DPC? If DPC->EDR flow checks
>> 'PD Change' also (along with DLL) and declares that the DPC recovery
>> can't happen (as there could be a change of device) hence returning
>> DPC recovery failure, then, the device driver's pci_error_handlers
>> callbacks won't be called and there won't be any deadlock.
>> 2. Check for the possibility of a common lock so that PCIe HP IST and
>> device driver's pci_error_handlers callbacks don't race.
>>
>> Let me know your comments on this.
>>
>> Thanks,
>> Vidya Sagar

2023-11-27 17:14:08

by Kuppuswamy Sathyanarayanan

[permalink] [raw]

Subject: Re: Race b/w pciehp and FirmwareFirst DPC->EDR flow - Reg

On 11/27/2023 6:43 AM, Vidya Sagar wrote:
> Hi Bjorn and others,
> Could you please share your thoughts on this?
>
> Thanks,
> Vidya Sagar
>
> On 11/10/2023 10:31 PM, Vidya Sagar wrote:
>> There was a typo. Corrected it.
>>
>> s/If DPC->EDR flow checks DPC also/If DPC->EDR flow checks 'PD Change' also
>>
>> On 11/10/2023 6:30 PM, Vidya Sagar wrote:
>>> Hi folks,
>>> Here are the platform details.
>>> - System with a Firmware First approach for both AER and DPC
>>> - CPERs are sent to the OS for the AER events
>>> - DPC is notified to the OS through EDR mechanism
>>> - System doesn't have support for in-band PD and supports only OOB PD where writing to a private register would set the PD state
>>> - System has this design where PD state gets cleared whenever there is a link down (without even writing to the private register)
>>>

Is clearing the PD state when link goes down normal? Is this a bug in the hardware?

>>> In this system, if the root port has to experience a DPC because of any right reasons (say SW trigger of DPC for time being), I'm observing the following flow which is causing a race.
>>> 1. Since DPC brings the link down, there is a DLLSC and an MSI is sent to the OS hence PCIe HP ISR is invoked.
>>> 2. ISR then takes stock of the Slot_Status register to see what all events caused the MSI generation.
>>> 3. It sees both DLLSC and PDC bits getting set.
>>> 4. PDC is set because of the aforementioned hardware design where for every link down, PD state gets cleared and since this is considered as a change in the PD state, PDC also gets set.
>>> 5. PCIe HP ISR flow transitions to the PCIe HP IST (interrupt thread/bottom half) and waits for the DPC_EDR to complete (because DLLSC is one of the events)
>>> 6. Parallel to the PCIe HP ISR/IST, DPC interrupt is raised to the firmware and that would then send an EDR event to the OS. Firmware also sets the PD state to '1' before that, as the endpoint device is still available.
>>> 7. Firmware programming of the private register to set the PD state raises second interrupt and PCIe HP ISR takes stock of the events and this time it is only the PDC and not DLLSC. ISR sends a wake to the IST, but since there is an IST in progress already, nothing much happens at this point.
>>> 8. Once the DPC is completed and link comes back up again, DPC framework asks the endpoint drivers to handle it by calling the 'pci_error_handlers' callabacks registered by the endpoint device drivers.
>>> 9. The PCIe HP IST (of the very first MSI) resumes after receiving the completion from the DPC framework saying that DPC recovery is successful.
>>> 10. Since PDC (Presence Detect Change) bit is also set for the first interrupt, IST attempts to remove the devices (as part of pciehp_handle_presence_or_link_change())
>>>
>>> At this point, there is a race between the device driver that is trying to work with the device (through pci_error_handlers callback) and the IST that is trying to remove the device.

IIUC, this step is after DPC recovery, which means the device is up and error_handlers are not used anymore, right? Why do you say there is a race between IST and error handler?

>>> To be fair to pciehp_handle_presence_or_link_change(), after removing the devices, it checks for the link-up/PD being '1' and scans the devices again if the device is still available. But unfortunately, IST is deadlocked (with the device driver) while removing the devices itself and won't go to the next step.
>>>
>>> The rationale given in the form of a comment in pciehp_handle_presence_or_link_change() for removing the devices first (without checking PD/link-up) and adding them back after checking link-up/PD is that, since there is a change in PD, the new link-up need not be with the same old device rather it could be with a different device. So, it justifies in removing the existing devices and then scanning for new ones. But this is causing deadlock with the already initiated DPC recovery process.
>>>
>>> I see two ways to avoid the deadlock in this scenario.
>>> 1. When PCIe HP IST is looking at both DLLSC and PDC, why should DPC->EDR flow look at only DLLSC and not DPC? If DPC->EDR flow checks 'PD Change' also (along with DLL) and declares that the DPC recovery can't happen (as there could be a change of device) hence returning DPC recovery failure, then, the device driver's pci_error_handlers callbacks won't be called and there won't be any deadlock.
>>> 2. Check for the possibility of a common lock so that PCIe HP IST and device driver's pci_error_handlers callbacks don't race.
>>>
>>> Let me know your comments on this.
>>>
>>> Thanks,
>>> Vidya Sagar

--
Sathyanarayanan Kuppuswamy
Linux Kernel Developer