2023-04-18 21:06:34

by Smita Koralahalli

Subject: [PATCH v2 0/2] PCI: pciehp: Add support for native AER and DPC handling on async remove

This series of patches adds native support for handling AER and DPC/EDR
events that occur as a side effect of an async remove.

Link to v1:
https://lore.kernel.org/all/[email protected]/

Smita Koralahalli (2):
PCI: pciehp: Add support for async hotplug with native AER and DPC/EDR
PCI: pciehp: Clear the optional capabilities in DEVCTL2 on a hot-plug

drivers/pci/hotplug/pciehp_pci.c | 1 +
drivers/pci/pcie/dpc.c | 50 ++++++++++++++++++++++++++++++++
2 files changed, 51 insertions(+)

--
2.17.1


2023-04-18 21:07:46

by Smita Koralahalli

Subject: [PATCH v2 1/2] PCI: pciehp: Add support for async hotplug with native AER and DPC/EDR

According to Section 6.7.6 of the PCIe Base Specification [1], async removal
with DPC and EDR may be unexpected and may result in a Surprise Down error.
This error is just a side effect of hot remove. Most of the time, these
errors are hidden from the OS, as current systems rely on the Firmware-First
model for AER and DPC, where the error handling (side effects of async
remove) and other necessary HW sequencing actions are taken care of by the
FW and are invisible to the OS. However, the FW-First model poses issues
when rolling out updates or fixing bugs, as the servers need to be brought
down for firmware updates.

Add support for async hot-plug with native AER and DPC/EDR. Here, the OS is
responsible for handling async add and remove, along with handling the AER
and DPC events generated as a side effect of async remove.

The implementation is as follows: on an async remove, a DPC is triggered
along with a Presence Detect State change. Determine that it is an async
remove by checking that the DPC Trigger Status in the DPC Status Register
and the Surprise Down Error Status in the AER Uncorrectable Error Status
Register are non-zero. If true, treat the DPC event as a side effect of
async remove, clear the error status registers and continue with the
hot-plug tear-down routines. If not, follow the existing routine to handle
AER and DPC errors.

Dmesg before:

pcieport 0000:00:01.4: DPC: containment event, status:0x1f01 source:0x0000
pcieport 0000:00:01.4: DPC: unmasked uncorrectable error detected
pcieport 0000:00:01.4: PCIe Bus Error: severity=Uncorrected (Fatal), type=Transaction Layer, (Receiver ID)
pcieport 0000:00:01.4: device [1022:14ab] error status/mask=00000020/04004000
pcieport 0000:00:01.4: [ 5] SDES (First)
nvme nvme2: frozen state error detected, reset controller
pcieport 0000:00:01.4: DPC: Data Link Layer Link Active not set in 1000 msec
pcieport 0000:00:01.4: AER: subordinate device reset failed
pcieport 0000:00:01.4: AER: device recovery failed
pcieport 0000:00:01.4: pciehp: Slot(16): Link Down
nvme2n1: detected capacity change from 1953525168 to 0
pci 0000:04:00.0: Removing from iommu group 49

Dmesg after:

pcieport 0000:00:01.4: pciehp: Slot(16): Link Down
nvme1n1: detected capacity change from 1953525168 to 0
pci 0000:04:00.0: Removing from iommu group 37

[1] PCI Express Base Specification Revision 6.0, Dec 16 2021.
https://members.pcisig.com/wg/PCI-SIG/document/16609

Signed-off-by: Smita Koralahalli <[email protected]>
---
v2:
Indentation is taken care of. (Bjorn)
Irrelevant dmesg logs are removed. (Bjorn)
Rephrased commit message to be clear on native vs FW-First
handling. (Bjorn and Sathyanarayanan)
Prefix changed from pciehp_ to dpc_. (Lukas)
Clearing ARI and AtomicOp Requester is performed as part of
(de-)enumeration in pciehp_unconfigure_device(). (Lukas)
Changed to clearing all optional capabilities in DEVCTL2.
OS-First -> native. (Sathyanarayanan)

Please note that I have provided an explanation of why I'm not setting the
Surprise Down bit in the uncorrectable error mask register in AER.
https://lore.kernel.org/all/[email protected]/

Also, while testing I noticed that PCI_STATUS and PCI_EXP_DEVSTA will be set
on an async remove and will not be cleared while the device is brought
down. I have included clearing them here in order to mask any appearance
that there was an error, as well as to duplicate our BIOS functionality.
I can remove this if it's not necessary.

On AMD systems we observe a Presence Detect State change along with the DPC
event on an async remove. Hence, the errors observed are benign on AMD
systems and the device will be brought down normally with PDSC. But the
errors logged might confuse users.
---
drivers/pci/pcie/dpc.c | 50 ++++++++++++++++++++++++++++++++++++++++++
1 file changed, 50 insertions(+)

diff --git a/drivers/pci/pcie/dpc.c b/drivers/pci/pcie/dpc.c
index a5d7c69b764e..78559188b9ac 100644
--- a/drivers/pci/pcie/dpc.c
+++ b/drivers/pci/pcie/dpc.c
@@ -293,10 +293,60 @@ void dpc_process_error(struct pci_dev *pdev)
}
}

+static void pci_clear_surpdn_errors(struct pci_dev *pdev)
+{
+ u16 reg16;
+ u32 reg32;
+
+ pci_read_config_dword(pdev, pdev->dpc_cap + PCI_EXP_DPC_RP_PIO_STATUS, &reg32);
+ pci_write_config_dword(pdev, pdev->dpc_cap + PCI_EXP_DPC_RP_PIO_STATUS, reg32);
+
+ pci_read_config_word(pdev, PCI_STATUS, &reg16);
+ pci_write_config_word(pdev, PCI_STATUS, reg16);
+
+ pcie_capability_read_word(pdev, PCI_EXP_DEVSTA, &reg16);
+ pcie_capability_write_word(pdev, PCI_EXP_DEVSTA, reg16);
+}
+
+static void dpc_handle_surprise_removal(struct pci_dev *pdev)
+{
+ if (pdev->dpc_rp_extensions && dpc_wait_rp_inactive(pdev))
+ return;
+
+ /*
+ * According to Section 6.7.6 of the PCIe Base Spec 6.0, since async
+ * removal might be unexpected, errors might be reported as a side
+ * effect of the event and software should handle them as an expected
+ * part of this event.
+ */
+ pci_aer_raw_clear_status(pdev);
+ pci_clear_surpdn_errors(pdev);
+
+ pci_write_config_word(pdev, pdev->dpc_cap + PCI_EXP_DPC_STATUS,
+ PCI_EXP_DPC_STATUS_TRIGGER);
+}
+
+static bool dpc_is_surprise_removal(struct pci_dev *pdev)
+{
+ u16 status;
+
+ pci_read_config_word(pdev, pdev->aer_cap + PCI_ERR_UNCOR_STATUS, &status);
+
+ if (!(status & PCI_ERR_UNC_SURPDN))
+ return false;
+
+ dpc_handle_surprise_removal(pdev);
+
+ return true;
+}
+
static irqreturn_t dpc_handler(int irq, void *context)
{
struct pci_dev *pdev = context;

+ if (dpc_is_surprise_removal(pdev))
+ return IRQ_HANDLED;
+
dpc_process_error(pdev);

/* We configure DPC so it only triggers on ERR_FATAL */
--
2.17.1

2023-04-18 21:07:59

by Smita Koralahalli

Subject: [PATCH v2 2/2] PCI: pciehp: Clear the optional capabilities in DEVCTL2 on a hot-plug

Clear all capabilities in the Device Control 2 register as they are optional
and it is not determined whether the next device inserted will support
these capabilities. Moreover, Section 6.13 of the PCIe Base
Specification [1] recommends clearing the ARI Forwarding Enable bit on
a hot-plug event as it's not guaranteed that the newly added component
is in fact an ARI device.

[1] PCI Express Base Specification Revision 6.0, Dec 16 2021.
https://members.pcisig.com/wg/PCI-SIG/document/16609

Signed-off-by: Smita Koralahalli <[email protected]>
---
v2:
Clear all optional capabilities in the Device Control 2 register
instead of individually clearing ARI Forwarding Enable,
AtomicOp Requester Enable and 10-Bit Tag Requester Enable.
---
drivers/pci/hotplug/pciehp_pci.c | 1 +
1 file changed, 1 insertion(+)

diff --git a/drivers/pci/hotplug/pciehp_pci.c b/drivers/pci/hotplug/pciehp_pci.c
index d17f3bf36f70..aabf7884ff30 100644
--- a/drivers/pci/hotplug/pciehp_pci.c
+++ b/drivers/pci/hotplug/pciehp_pci.c
@@ -104,6 +104,7 @@ void pciehp_unconfigure_device(struct controller *ctrl, bool presence)
list_for_each_entry_safe_reverse(dev, temp, &parent->devices,
bus_list) {
pci_dev_get(dev);
+ pcie_capability_clear_word(dev, PCI_EXP_DEVCTL2, 0xffff);
pci_stop_and_remove_bus_device(dev);
/*
* Ensure that no new Requests will be generated from
--
2.17.1

2023-05-09 21:17:21

by Smita Koralahalli

Subject: Re: [PATCH v2 0/2] PCI: pciehp: Add support for native AER and DPC handling on async remove

Hi,

Could I please get review comments on my patch set? I had re-based it on
the latest tree in v2 as I did not hear back on my questions in v1.

Thanks,
Smita

On 4/18/2023 2:05 PM, Smita Koralahalli wrote:
> This series of patches adds native support for handling AER and DPC/EDR
> events that occur as a side effect of an async remove.
>
> Link to v1:
> https://lore.kernel.org/all/[email protected]/
>
> Smita Koralahalli (2):
> PCI: pciehp: Add support for async hotplug with native AER and DPC/EDR
> PCI: pciehp: Clear the optional capabilities in DEVCTL2 on a hot-plug
>
> drivers/pci/hotplug/pciehp_pci.c | 1 +
> drivers/pci/pcie/dpc.c | 50 ++++++++++++++++++++++++++++++++
> 2 files changed, 51 insertions(+)
>

Subject: Re: [PATCH v2 1/2] PCI: pciehp: Add support for async hotplug with native AER and DPC/EDR



On 4/18/23 2:05 PM, Smita Koralahalli wrote:
> According to Section 6.7.6 of the PCIe Base Specification [1], async removal
> with DPC and EDR may be unexpected and may result in a Surprise Down error.
> This error is just a side effect of hot remove. Most of the time, these
> errors are hidden from the OS, as current systems rely on the Firmware-First
> model for AER and DPC, where the error handling (side effects of async
> remove) and other necessary HW sequencing actions are taken care of by the
> FW and are invisible to the OS. However, the FW-First model poses issues
> when rolling out updates or fixing bugs, as the servers need to be brought
> down for firmware updates.
>
> Add support for async hot-plug with native AER and DPC/EDR. Here, the OS is
> responsible for handling async add and remove, along with handling the AER
> and DPC events generated as a side effect of async remove.

PCIe spec r6.0, sec 6.7.6 mentions that the async removal can be handled
via DPC. So why treat it as a special case here? What do we gain with this
patch other than preventing the error recovery process?

>
> The implementation is as follows: on an async remove, a DPC is triggered
> along with a Presence Detect State change. Determine that it is an async
> remove by checking that the DPC Trigger Status in the DPC Status Register
> and the Surprise Down Error Status in the AER Uncorrectable Error Status
> Register are non-zero. If true, treat the DPC event as a side effect of
> async remove, clear the error status registers and continue with the
> hot-plug tear-down routines. If not, follow the existing routine to handle
> AER and DPC errors.
>
> Dmesg before:
>
> pcieport 0000:00:01.4: DPC: containment event, status:0x1f01 source:0x0000
> pcieport 0000:00:01.4: DPC: unmasked uncorrectable error detected
> pcieport 0000:00:01.4: PCIe Bus Error: severity=Uncorrected (Fatal), type=Transaction Layer, (Receiver ID)
> pcieport 0000:00:01.4: device [1022:14ab] error status/mask=00000020/04004000
> pcieport 0000:00:01.4: [ 5] SDES (First)
> nvme nvme2: frozen state error detected, reset controller
> pcieport 0000:00:01.4: DPC: Data Link Layer Link Active not set in 1000 msec
> pcieport 0000:00:01.4: AER: subordinate device reset failed
> pcieport 0000:00:01.4: AER: device recovery failed
> pcieport 0000:00:01.4: pciehp: Slot(16): Link Down
> nvme2n1: detected capacity change from 1953525168 to 0
> pci 0000:04:00.0: Removing from iommu group 49
>
> Dmesg after:
>
> pcieport 0000:00:01.4: pciehp: Slot(16): Link Down
> nvme1n1: detected capacity change from 1953525168 to 0
> pci 0000:04:00.0: Removing from iommu group 37
>
> [1] PCI Express Base Specification Revision 6.0, Dec 16 2021.
> https://members.pcisig.com/wg/PCI-SIG/document/16609
>
> Signed-off-by: Smita Koralahalli <[email protected]>
> ---
> v2:
> Indentation is taken care of. (Bjorn)
> Irrelevant dmesg logs are removed. (Bjorn)
> Rephrased commit message to be clear on native vs FW-First
> handling. (Bjorn and Sathyanarayanan)
> Prefix changed from pciehp_ to dpc_. (Lukas)
> Clearing ARI and AtomicOp Requester is performed as part of
> (de-)enumeration in pciehp_unconfigure_device(). (Lukas)
> Changed to clearing all optional capabilities in DEVCTL2.
> OS-First -> native. (Sathyanarayanan)
>
> Please note that I have provided an explanation of why I'm not setting the
> Surprise Down bit in the uncorrectable error mask register in AER.
> https://lore.kernel.org/all/[email protected]/
>
> Also, while testing I noticed that PCI_STATUS and PCI_EXP_DEVSTA will be set
> on an async remove and will not be cleared while the device is brought
> down. I have included clearing them here in order to mask any appearance
> that there was an error, as well as to duplicate our BIOS functionality.
> I can remove this if it's not necessary.
>
> On AMD systems we observe a Presence Detect State change along with the DPC
> event on an async remove. Hence, the errors observed are benign on AMD
> systems and the device will be brought down normally with PDSC. But the
> errors logged might confuse users.
> ---
> drivers/pci/pcie/dpc.c | 50 ++++++++++++++++++++++++++++++++++++++++++
> 1 file changed, 50 insertions(+)
>
> diff --git a/drivers/pci/pcie/dpc.c b/drivers/pci/pcie/dpc.c
> index a5d7c69b764e..78559188b9ac 100644
> --- a/drivers/pci/pcie/dpc.c
> +++ b/drivers/pci/pcie/dpc.c
> @@ -293,10 +293,60 @@ void dpc_process_error(struct pci_dev *pdev)
> }
> }
>
> +static void pci_clear_surpdn_errors(struct pci_dev *pdev)
> +{
> + u16 reg16;
> + u32 reg32;
> +
> + pci_read_config_dword(pdev, pdev->dpc_cap + PCI_EXP_DPC_RP_PIO_STATUS, &reg32);
> + pci_write_config_dword(pdev, pdev->dpc_cap + PCI_EXP_DPC_RP_PIO_STATUS, reg32);
> +
> + pci_read_config_word(pdev, PCI_STATUS, &reg16);
> + pci_write_config_word(pdev, PCI_STATUS, reg16);
> +
> + pcie_capability_read_word(pdev, PCI_EXP_DEVSTA, &reg16);
> + pcie_capability_write_word(pdev, PCI_EXP_DEVSTA, reg16);
> +}
> +
> +static void dpc_handle_surprise_removal(struct pci_dev *pdev)
> +{
> + if (pdev->dpc_rp_extensions && dpc_wait_rp_inactive(pdev))
> + return;
> +
> + /*
> + * According to Section 6.7.6 of the PCIe Base Spec 6.0, since async
> + * removal might be unexpected, errors might be reported as a side
> + * effect of the event and software should handle them as an expected
> + * part of this event.
> + */
> + pci_aer_raw_clear_status(pdev);
> + pci_clear_surpdn_errors(pdev);
> +
> + pci_write_config_word(pdev, pdev->dpc_cap + PCI_EXP_DPC_STATUS,
> + PCI_EXP_DPC_STATUS_TRIGGER);
> +}
> +
> +static bool dpc_is_surprise_removal(struct pci_dev *pdev)
> +{
> + u16 status;
> +
> + pci_read_config_word(pdev, pdev->aer_cap + PCI_ERR_UNCOR_STATUS, &status);
> +
> + if (!(status & PCI_ERR_UNC_SURPDN))
> + return false;
> +
> + dpc_handle_surprise_removal(pdev);
> +
> + return true;
> +}
> +
> static irqreturn_t dpc_handler(int irq, void *context)
> {
> struct pci_dev *pdev = context;
>
> + if (dpc_is_surprise_removal(pdev))
> + return IRQ_HANDLED;
> +
> dpc_process_error(pdev);
>
> /* We configure DPC so it only triggers on ERR_FATAL */

--
Sathyanarayanan Kuppuswamy
Linux Kernel Developer

2023-05-11 07:02:22

by Lukas Wunner

Subject: Re: [PATCH v2 1/2] PCI: pciehp: Add support for async hotplug with native AER and DPC/EDR

On Wed, May 10, 2023 at 02:42:13PM -0700, Smita Koralahalli wrote:
> As far as I can see, async removal solely with DPC is not handled properly
> in Linux.

The dpc driver can react to a DPC event and attempt reset recovery.

But it doesn't do de-enumeration or re-enumeration of subordinate devices.
It also doesn't do slot handling (enable/disable Power Controller etc).
That's only implemented in the hotplug driver.

PCIe r6.0.1 contains appendix I.2 which basically suggests to "use DPC"
for async hot-plug but that doesn't really seem to make sense.


> On AMD systems, PDSC is triggered along with DPC on an async remove. And
> this PDSC event (hotplug handler) will unconfigure and uninitialize the
> driver and device. This is one thing I wanted clarity on, as per my
> question in v1: do all systems trigger PDSC on an async remove along
> with DPC?

In principle, yes. Actually the hotplug driver will see both a DLLSC
*and* a PDC event and will react to whichever comes first. Experience
has shown that the two events may occur in arbitrary order and with
significant delays in-between.

There are systems which erroneously hardwire Presence Detect to zero.
The hotplug driver works even with those. It solely relies on the
DLLSC event then, see commit 80696f991424 ("PCI: pciehp: Tolerate
Presence Detect hardwired to zero").


> I feel there are two approaches going forward. Since the hotplug handler is
> also triggered with PDSC, rely on it to bring down the device and prevent
> calling the error_recovery process in the dpc handler as it's not a true
> error event. I have taken this approach.
>
> Or, don't call the hotplug handler at all and rely solely on DPC to bring
> down the device. But here, there should be additional callbacks to
> unconfigure and uninitialize the pcie driver, and currently I only see
> report_slot_reset() being called from error_recovery(), and I don't think
> it unconfigures the driver/device.

The latter approach doesn't really make sense to me because we'd have to
duplicate all the slot handling and device de-/re-enumeration in the dpc
driver.

Let's try masking Surprise Down Errors first and see how that goes.
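
Roughly, the masking would amount to something like this (just a sketch,
not a tested patch; assumes the port has an AER capability at
pdev->aer_cap):

	u32 mask;

	pci_read_config_dword(pdev, pdev->aer_cap + PCI_ERR_UNCOR_MASK, &mask);
	mask |= PCI_ERR_UNC_SURPDN;	/* suppress Surprise Down reporting */
	pci_write_config_dword(pdev, pdev->aer_cap + PCI_ERR_UNCOR_MASK, mask);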

Thanks,

Lukas

2023-05-11 11:49:31

by Lukas Wunner

Subject: Re: [PATCH v2 2/2] PCI: pciehp: Clear the optional capabilities in DEVCTL2 on a hot-plug

On Tue, Apr 18, 2023 at 09:05:26PM +0000, Smita Koralahalli wrote:
> Clear all capabilities in the Device Control 2 register as they are optional
> and it is not determined whether the next device inserted will support
> these capabilities. Moreover, Section 6.13 of the PCIe Base
> Specification [1] recommends clearing the ARI Forwarding Enable bit on
> a hot-plug event as it's not guaranteed that the newly added component
> is in fact an ARI device.

Clearing ARI Forwarding Enable sounds reasonable, but I don't see
why all the other bits in the Device Control 2 register need to be
cleared. If there isn't a reason to clear them, I'd be in favor of
leaving them alone.

As for clearing ARI Forwarding Enable, it seems commit b0cc6020e1cc
("PCI: Enable ARI if dev and upstream bridge support it; disable
otherwise") already solved this problem. Quoth its commit message:

"Currently, we enable ARI in a device's upstream bridge if the bridge and
the device support it. But we never disable ARI, even if the device is
removed and replaced with a device that doesn't support ARI.

This means that if we hot-remove an ARI device and replace it with a
non-ARI multi-function device, we find only function 0 of the new device
because the upstream bridge still has ARI enabled, and next_ari_fn()
only returns function 0 for the new non-ARI device.

This patch disables ARI in the upstream bridge if the device doesn't
support ARI. See the PCIe spec, r3.0, sec 6.13."

My superficial understanding of that patch is that we do find function 0,
while enumerating it we clear the ARI Forwarding Enable bit and thus the
remaining functions become accessible and are subsequently enumerated.

Are you seeing issues when removing an ARI-capable endpoint from a
hotplug slot and replacing it with a non-ARI-capable device?
If you do, the question is why the above-quoted commit doesn't avoid them.


> --- a/drivers/pci/hotplug/pciehp_pci.c
> +++ b/drivers/pci/hotplug/pciehp_pci.c
> @@ -104,6 +104,7 @@ void pciehp_unconfigure_device(struct controller *ctrl, bool presence)
> list_for_each_entry_safe_reverse(dev, temp, &parent->devices,
> bus_list) {
> pci_dev_get(dev);
> + pcie_capability_clear_word(dev, PCI_EXP_DEVCTL2, 0xffff);
> pci_stop_and_remove_bus_device(dev);
> /*
> * Ensure that no new Requests will be generated from

This clears the Device Control 2 register on the hotplugged device,
but to clear ARI Forwarding Enable, you'd have to clear the register
of the hotplug port, i.e. the *parent* of the hotplugged device.
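
For illustration, that would be something like this in
pciehp_unconfigure_device() (sketch only; "port" being the parent of the
removed device):

	struct pci_dev *port = dev->bus->self;

	if (port)
		pcie_capability_clear_word(port, PCI_EXP_DEVCTL2,
					   PCI_EXP_DEVCTL2_ARI);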

Also, this patch doesn't apply cleanly to v6.4-rc1 because of a context
change by f5eff5591b8f ("PCI: pciehp: Fix AB-BA deadlock between
reset_lock and device_lock").

Thanks,

Lukas

2023-05-16 10:12:37

by Lukas Wunner

Subject: Re: [PATCH v2 1/2] PCI: pciehp: Add support for async hotplug with native AER and DPC/EDR

On Tue, Apr 18, 2023 at 09:05:25PM +0000, Smita Koralahalli wrote:
> According to Section 6.7.6 of the PCIe Base Specification [1], async removal
> with DPC and EDR may be unexpected and may result in a Surprise Down error.
> This error is just a side effect of hot remove. Most of the time, these
> errors are hidden from the OS, as current systems rely on the Firmware-First
> model for AER and DPC, where the error handling (side effects of async
> remove) and other necessary HW sequencing actions are taken care of by the
> FW and are invisible to the OS. However, the FW-First model poses issues
> when rolling out updates or fixing bugs, as the servers need to be brought
> down for firmware updates.
>
> Add support for async hot-plug with native AER and DPC/EDR. Here, the OS is
> responsible for handling async add and remove, along with handling the AER
> and DPC events generated as a side effect of async remove.

I think you can probably leave out the details about Firmware-First.
Pointing to 6.7.6 and the fact that Surprise Down Errors may occur as
an expected side-effect of surprise removal is sufficient. They should
be treated as such.

You also want to point out what you're trying to achieve here, i.e. get
rid of irritating log messages and a 1 sec delay while pciehp waits for
DPC recovery.


> Please note that I have provided an explanation of why I'm not setting the
> Surprise Down bit in the uncorrectable error mask register in AER.
> https://lore.kernel.org/all/[email protected]/

Add an explanation to the commit message that masking Surprise Down Errors
was explored as an alternative approach, but scrapped due to the odd
behavior that masking only avoids the interrupt, but still records an
error per PCIe r6.0.1 sec 6.2.3.2.2. That stale error is going to be
reported the next time some error other than Surprise Down is handled.


> Also, while testing I noticed that PCI_STATUS and PCI_EXP_DEVSTA will be set
> on an async remove and will not be cleared while the device is brought
> down. I have included clearing them here in order to mask any appearance
> that there was an error, as well as to duplicate our BIOS functionality.
> I can remove this if it's not necessary.

Which bits are set exactly? Can you constrain the register write to
those bits?


> +static void dpc_handle_surprise_removal(struct pci_dev *pdev)
> +{
> + if (pdev->dpc_rp_extensions && dpc_wait_rp_inactive(pdev))
> + return;

Emit an error message here?


> + /*
> + * According to Section 6.7.6 of the PCIe Base Spec 6.0, since async
> + * removal might be unexpected, errors might be reported as a side
> + * effect of the event and software should handle them as an expected
> + * part of this event.
> + */

I'd move that code comment to into dpc_handler() above the
"if (dpc_is_surprise_removal(pdev))" check.


> + pci_aer_raw_clear_status(pdev);
> + pci_clear_surpdn_errors(pdev);
> +
> + pci_write_config_word(pdev, pdev->dpc_cap + PCI_EXP_DPC_STATUS,
> + PCI_EXP_DPC_STATUS_TRIGGER);
> +}

Do you need a "wake_up_all(&dpc_completed_waitqueue);" at the end
of the function to wake up a pciehp handler waiting for DPC recovery?


> +static bool dpc_is_surprise_removal(struct pci_dev *pdev)
> +{
> + u16 status;
> +
> + pci_read_config_word(pdev, pdev->aer_cap + PCI_ERR_UNCOR_STATUS, &status);
> +
> + if (!(status & PCI_ERR_UNC_SURPDN))
> + return false;
> +

You need an additional check for pdev->is_hotplug_bridge here.

And you need to read PCI_EXP_SLTCAP and check for PCI_EXP_SLTCAP_HPS.

Return false if either of them isn't set.
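
I.e., something along these lines at the top of dpc_is_surprise_removal()
(a sketch of the suggested checks):

	u32 sltcap;

	if (!pdev->is_hotplug_bridge)
		return false;

	pcie_capability_read_dword(pdev, PCI_EXP_SLTCAP, &sltcap);
	if (!(sltcap & PCI_EXP_SLTCAP_HPS))
		return false;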


> + dpc_handle_surprise_removal(pdev);
> +
> + return true;
> +}

A function named "..._is_..." should not have any side effects.
So move the dpc_handle_surprise_removal() call down into dpc_handler()
before the "return IRQ_HANDLED;" statement.


What about ports which support AER but not DPC? Don't you need to
amend aer.c in that case? I suppose you don't have hardware without
DPC to test that?

Thanks,

Lukas

2023-05-22 22:44:27

by Smita Koralahalli

Subject: Re: [PATCH v2 2/2] PCI: pciehp: Clear the optional capabilities in DEVCTL2 on a hot-plug

Hi Lukas,

Sorry for the delay in replying to this patch. We had some internal
discussions after the review comments. Please find the responses inline..

On 5/11/2023 4:19 AM, Lukas Wunner wrote:
> On Tue, Apr 18, 2023 at 09:05:26PM +0000, Smita Koralahalli wrote:
>> Clear all capabilities in the Device Control 2 register as they are optional
>> and it is not determined whether the next device inserted will support
>> these capabilities. Moreover, Section 6.13 of the PCIe Base
>> Specification [1] recommends clearing the ARI Forwarding Enable bit on
>> a hot-plug event as it's not guaranteed that the newly added component
>> is in fact an ARI device.
>
> Clearing ARI Forwarding Enable sounds reasonable, but I don't see
> why all the other bits in the Device Control 2 register need to be
> cleared. If there isn't a reason to clear them, I'd be in favor of
> leaving them alone.

I understand. The SPEC doesn't "clearly" state to clear the other bits
except ARI on a hot-plug event.

But, we came across issues when a device with 10-bit tags was removed
and replaced with a device that didn't support 10-bit tags. The device
became inaccessible and the port was not able to be recovered without a
system reset. So, we thought it would be better to cherry-pick all bits
that were negotiated between endpoint and root port and decided that we
should clear them all on removal. Hence, my first revision of this patch
series had aimed to clear only ARI, AtomicOp Requester and 10-bit tags,
as these were the negotiated settings between endpoint and root port.
Ideally, these settings should be re-negotiated and set up for optimal
operation on a hot add.
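
For reference, the v1 approach amounted to roughly the following on
removal (a sketch; the 10-Bit Tag Requester Enable value is spelled out
from the spec in case pci_regs.h doesn't carry a define for it):

#ifndef PCI_EXP_DEVCTL2_10BIT_TAG_REQ_EN
#define PCI_EXP_DEVCTL2_10BIT_TAG_REQ_EN 0x1000	/* DEVCTL2 bit 12 */
#endif

	pcie_capability_clear_word(dev, PCI_EXP_DEVCTL2,
				   PCI_EXP_DEVCTL2_ARI |
				   PCI_EXP_DEVCTL2_ATOMIC_REQ |
				   PCI_EXP_DEVCTL2_10BIT_TAG_REQ_EN);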

We had some internal discussions to understand whether the SPEC has it
documented somewhere. And we could see that Section 2.2.6.2 implies:
[i] If a Requester sends a 10-Bit Tag Request to a Completer that lacks
10-Bit Completer capability, the returned Completion(s) will have Tags
with Tag[9:8] equal to 00b. Since the Requester is forbidden to generate
these Tag values for 10-Bit Tags, such Completions will be handled as
Unexpected Completions, which by default are Advisory Non-Fatal Errors.
The Requester must follow standard PCI Express error handling requirements.
[ii] In configurations where a Requester with 10-Bit Tag Requester
capability needs to target multiple Completers, one needs to ensure that
the Requester sends 10-Bit Tag Requests only to Completers that have
10-Bit Tag Completer capability.

Now, we might wonder why clear them (especially 10-bit tags and
AtomicOps) if Linux hasn't enabled them at all, as the "10-Bit Tag
Requester Enable" bit is not defined in Linux currently. But these
features might be enabled by Platform FW for "performance reasons" if
the endpoint supports them, and now it is the responsibility of the
operating system to disable them on a hot remove event.

According to implementation notes in 2.2.6.2: "For platforms where the
RC supports 10-Bit Tag Completer capability, it is highly recommended
for platform firmware or operating software that configures PCIe
hierarchies to Set the 10-Bit Tag Requester Enable bit automatically in
Endpoints with 10-Bit Tag Requester capability. This enables the
important class of 10-Bit Tag capable adapters that send Memory Read
Requests only to host memory." So if the endpoint and root port both
support 10-bit tags, the BIOS is enabling it at boot time..

I ran a quick check to see how the DEVCTL2 register of the root port
looks with a 10-bit tag capable NVMe device.

pci 0000:80:05.1: DEVCTL2 0x1726 (Bit 12: 10-bit tag is enabled..)
pci 0000:80:05.1: DEVCAP2 0x7f19ff

So, it seems like BIOS has enabled 10-bit tags at boot time even though
Linux hasn't enabled it.

A couple of ways we think this could be handled:
[1] Check if these bits are enabled by the Platform at boot time and clear
them during the hotplug flow only if they are set.
[2] Clear them unconditionally as I did..
[3] Enable 10-bit tags in Linux when a device is probed, just like how
we do for ARI..

Similarly, call pci_enable_atomic_ops_to_root() during a hot add..

Let me know what you think..
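
For concreteness, option [3] might look roughly like the sketch below,
mirroring what pci_configure_ari() does for ARI at probe time. The helper
name is hypothetical and the 10-Bit Tag values are taken from the spec
(guarded in case pci_regs.h doesn't carry defines for them):

#ifndef PCI_EXP_DEVCTL2_10BIT_TAG_REQ_EN
#define PCI_EXP_DEVCAP2_10BIT_TAG_COMP	 0x00010000	/* DEVCAP2 bit 16 */
#define PCI_EXP_DEVCAP2_10BIT_TAG_REQ	 0x00020000	/* DEVCAP2 bit 17 */
#define PCI_EXP_DEVCTL2_10BIT_TAG_REQ_EN 0x00001000	/* DEVCTL2 bit 12 */
#endif

static void pci_configure_10bit_tags(struct pci_dev *dev)
{
	struct pci_dev *bridge = pci_upstream_bridge(dev);
	u32 dev_cap2, br_cap2;

	if (!pci_is_pcie(dev) || !bridge)
		return;

	pcie_capability_read_dword(dev, PCI_EXP_DEVCAP2, &dev_cap2);
	pcie_capability_read_dword(bridge, PCI_EXP_DEVCAP2, &br_cap2);

	/*
	 * Simplified: only the upstream bridge is checked as Completer,
	 * not every Completer the Requester might target (see [ii] above).
	 */
	if ((dev_cap2 & PCI_EXP_DEVCAP2_10BIT_TAG_REQ) &&
	    (br_cap2 & PCI_EXP_DEVCAP2_10BIT_TAG_COMP))
		pcie_capability_set_word(dev, PCI_EXP_DEVCTL2,
					 PCI_EXP_DEVCTL2_10BIT_TAG_REQ_EN);
	else
		pcie_capability_clear_word(dev, PCI_EXP_DEVCTL2,
					   PCI_EXP_DEVCTL2_10BIT_TAG_REQ_EN);
}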

>
> As for clearing ARI Forwarding Enable, it seems commit b0cc6020e1cc
> ("PCI: Enable ARI if dev and upstream bridge support it; disable
> otherwise") already solved this problem. Quoth its commit message:
>
> "Currently, we enable ARI in a device's upstream bridge if the bridge and
> the device support it. But we never disable ARI, even if the device is
> removed and replaced with a device that doesn't support ARI.
>
> This means that if we hot-remove an ARI device and replace it with a
> non-ARI multi-function device, we find only function 0 of the new device
> because the upstream bridge still has ARI enabled, and next_ari_fn()
> only returns function 0 for the new non-ARI device.
>
> This patch disables ARI in the upstream bridge if the device doesn't
> support ARI. See the PCIe spec, r3.0, sec 6.13."
>
> My superficial understanding of that patch is that we do find function 0,
> while enumerating it we clear the ARI Forwarding Enable bit and thus the
> remaining functions become accessible and are subsequently enumerated.
>
> Are you seeing issues when removing an ARI-capable endpoint from a
> hotplug slot and replacing it with a non-ARI-capable device?
> If you do, the question is why the above-quoted commit doesn't avoid them.

Yeah! Sorry I missed this. ARI is already checked and enabled during
device initialization.

>
>
>> --- a/drivers/pci/hotplug/pciehp_pci.c
>> +++ b/drivers/pci/hotplug/pciehp_pci.c
>> @@ -104,6 +104,7 @@ void pciehp_unconfigure_device(struct controller *ctrl, bool presence)
>> list_for_each_entry_safe_reverse(dev, temp, &parent->devices,
>> bus_list) {
>> pci_dev_get(dev);
>> + pcie_capability_clear_word(dev, PCI_EXP_DEVCTL2, 0xffff);
>> pci_stop_and_remove_bus_device(dev);
>> /*
>> * Ensure that no new Requests will be generated from
>
> This clears the Device Control 2 register on the hotplugged device,
> but to clear ARI Forwarding Enable, you'd have to clear the register
> of the hotplug port, i.e. the *parent* of the hotplugged device.

I agree!

Thanks,
Smita
>
> Also, this patch doesn't apply cleanly to v6.4-rc1 because of a context
> change by f5eff5591b8f ("PCI: pciehp: Fix AB-BA deadlock between
> reset_lock and device_lock").
>
> Thanks,
>
> Lukas


2023-05-22 22:48:05

by Smita Koralahalli

Subject: Re: [PATCH v2 1/2] PCI: pciehp: Add support for async hotplug with native AER and DPC/EDR

Hi,

On 5/16/2023 3:10 AM, Lukas Wunner wrote:
> On Tue, Apr 18, 2023 at 09:05:25PM +0000, Smita Koralahalli wrote:
>> According to Section 6.7.6 of the PCIe Base Specification [1], async removal
>> with DPC and EDR may be unexpected and may result in a Surprise Down error.
>> This error is just a side effect of hot remove. Most of the time, these
>> errors are hidden from the OS, as current systems rely on the Firmware-First
>> model for AER and DPC, where the error handling (side effects of async
>> remove) and other necessary HW sequencing actions are taken care of by the
>> FW and are invisible to the OS. However, the FW-First model poses issues
>> when rolling out updates or fixing bugs, as the servers need to be brought
>> down for firmware updates.
>>
>> Add support for async hot-plug with native AER and DPC/EDR. Here, the OS is
>> responsible for handling async add and remove, along with handling the AER
>> and DPC events generated as a side effect of async remove.
>
> I think you can probably leave out the details about Firmware-First.
> Pointing to 6.7.6 and the fact that Surprise Down Errors may occur as
> an expected side-effect of surprise removal is sufficient. They should
> be treated as such.
>
> You also want to point out what you're trying to achieve here, i.e. get
> rid of irritating log messages and a 1 sec delay while pciehp waits for
> DPC recovery.

Okay.

>
>
>> Please note that I have provided an explanation of why I'm not setting the
>> Surprise Down bit in the uncorrectable error mask register in AER.
>> https://lore.kernel.org/all/[email protected]/
>
> Add an explanation to the commit message that masking Surprise Down Errors
> was explored as an alternative approach, but scrapped due to the odd
> behavior that masking only avoids the interrupt, but still records an
> error per PCIe r6.0.1 sec 6.2.3.2.2. That stale error is going to be
> reported the next time some error other than Surprise Down is handled.
>
>

Okay.

>> Also, while testing I noticed that PCI_STATUS and PCI_EXP_DEVSTA will be set
>> on an async remove and will not be cleared while the device is brought
>> down. I have included clearing them here in order to mask any appearance
>> that there was an error, as well as to duplicate our BIOS functionality.
>> I can remove this if it's not necessary.
>
> Which bits are set exactly? Can you constrain the register write to
> those bits?
>
Hmm, I was mostly trying to follow the approach done for AER in
pci_aer_raw_clear_status(), where the status registers are cleared
unconditionally. Also, I thought it might be better if we are dealing
with legacy endpoints or so.

I see these bits set in status registers:
PCI_ERR_UNCOR_STATUS 0x20 (Surprise Down)
PCI_EXP_DPC_RP_PIO_STATUS 0x10000 (Memory Request received URCompletion)
PCI_EXP_DEVSTA 0x604 (fatal error detected)

>
>> +static void dpc_handle_surprise_removal(struct pci_dev *pdev)
>> +{
>> + if (pdev->dpc_rp_extensions && dpc_wait_rp_inactive(pdev))
>> + return;
>
> Emit an error message here?

Okay

>
>
>> + /*
>> + * According to Section 6.7.6 of the PCIe Base Spec 6.0, since async
>> + * removal might be unexpected, errors might be reported as a side
>> + * effect of the event and software should handle them as an expected
>> + * part of this event.
>> + */
>
> I'd move that code comment to into dpc_handler() above the
> "if (dpc_is_surprise_removal(pdev))" check.

Okay.

>
>
>> + pci_aer_raw_clear_status(pdev);
>> + pci_clear_surpdn_errors(pdev);
>> +
>> + pci_write_config_word(pdev, pdev->dpc_cap + PCI_EXP_DPC_STATUS,
>> + PCI_EXP_DPC_STATUS_TRIGGER);
>> +}
>
> Do you need a "wake_up_all(&dpc_completed_waitqueue);" at the end
> of the function to wake up a pciehp handler waiting for DPC recovery?

I don't think so. The pciehp handler is, however, invoked simultaneously
due to a PDSC or DLLSC state change, right? Let me know if I'm missing
anything here.

>
>
>> +static bool dpc_is_surprise_removal(struct pci_dev *pdev)
>> +{
>> + u16 status;
>> +
>> + pci_read_config_word(pdev, pdev->aer_cap + PCI_ERR_UNCOR_STATUS, &status);
>> +
>> + if (!(status & PCI_ERR_UNC_SURPDN))
>> + return false;
>> +
>
> You need an additional check for pdev->is_hotplug_bridge here.
>
> And you need to read PCI_EXP_SLTCAP and check for PCI_EXP_SLTCAP_HPS.
>
> Return false if either of them isn't set.

Return false only if PCI_EXP_SLTCAP isn't set, correct?
PCI_EXP_SLTCAP_HPS should be disabled if DPC is enabled.

The implementation notes in 6.7.6 say that:
"The Hot-Plug Surprise (HPS) mechanism, as indicated by the Hot-Plug
Surprise bit in the Slot Capabilities Register being Set, is deprecated
for use with async hot-plug. DPC is the recommended mechanism for
supporting async hot-plug."

Platform FW will disable the SLTCAP_HPS bit at boot time to enable async
hotplug on AMD devices.

Probably check if SLTCAP_HPS bit is set and return false?

>
>
>> + dpc_handle_surprise_removal(pdev);
>> +
>> + return true;
>> +}
>
> A function named "..._is_..." should not have any side effects.
> So move the dpc_handle_surprise_removal() call down into dpc_handler()
> before the "return IRQ_HANDLED;" statement.

Okay.

>
>
> What about ports which support AER but not DPC? Don't you need to
> amend aer.c in that case? I suppose you don't have hardware without
> DPC to test that?

Yeah, right. Also, if DPC isn't supported we leave the HPS bit set and
we won't see the DPC event at that time.

Thanks,
Smita

>
> Thanks,
>
> Lukas


2023-06-12 18:12:18

by Bjorn Helgaas

Subject: Re: [PATCH v2 0/2] PCI: pciehp: Add support for native AER and DPC handling on async remove

On Tue, May 09, 2023 at 01:58:37PM -0700, Smita Koralahalli wrote:
> Hi,
>
> Could I please get review comments on my patch set? I had re-based
> it on the latest tree in v2 as I did not hear back on my questions
> in v1.

Thanks for persevering. It looks like there may still be some ongoing
discussion with Lukas? And a few of his comments that you plan to
address? Thanks for looking into the 10-bit tag issue. I know those
tag widths are non-trivial to handle correctly and it would be good to
clean them up.

Bjorn

2023-06-12 19:52:33

by Smita Koralahalli

Subject: Re: [PATCH v2 0/2] PCI: pciehp: Add support for native AER and DPC handling on async remove

On 6/12/2023 10:54 AM, Bjorn Helgaas wrote:
> On Tue, May 09, 2023 at 01:58:37PM -0700, Smita Koralahalli wrote:
>> Hi,
>>
>> Could I please get review comments on my patch set? I had re-based
>> it on the latest tree in v2 as I did not hear back on my questions
>> in v1.
>
> Thanks for persevering. It looks like there may still be some ongoing
> discussion with Lukas? And a few of his comments that you plan to
> address? Thanks for looking into the 10-bit tag issue. I know those
> tag widths are non-trivial to handle correctly and it would be good to
> clean them up.
>
> Bjorn

Yeah. Since we didn't come to a conclusion on how we should attempt
clearing 10-bit tags and AtomicOp Requester, I held off on preparing v3.

https://lore.kernel.org/all/[email protected]/

But I can go ahead and send v3 for just the first patch,
https://lore.kernel.org/all/[email protected]/ and
wait on Lukas for the 10-bit tag issue.

Thanks,
Smita

2023-06-16 17:50:32

by Lukas Wunner

Subject: Re: [PATCH v2 1/2] PCI: pciehp: Add support for async hotplug with native AER and DPC/EDR

On Mon, May 22, 2023 at 03:23:57PM -0700, Smita Koralahalli wrote:
> On 5/16/2023 3:10 AM, Lukas Wunner wrote:
> > On Tue, Apr 18, 2023 at 09:05:25PM +0000, Smita Koralahalli wrote:
> > > Also, while testing I noticed that PCI_STATUS and PCI_EXP_DEVSTA will be set
> > > on an async remove and will not be cleared while the device is brought
> > > down. I have included clearing them here in order to mask any appearance
> > > that there was an error, as well as to duplicate our BIOS functionality.
> > > I can remove this if it's not necessary.
> >
> > Which bits are set exactly? Can you constrain the register write to
> > those bits?
>
> Hmm, I was mostly trying to follow the approach done for AER in
> pci_aer_raw_clear_status(), where the status registers are cleared
> unconditionally. Also, I thought it might be better if we are dealing
> with legacy endpoints or so.
>
> I see these bits set in status registers:
> PCI_ERR_UNCOR_STATUS 0x20 (Surprise Down)
> PCI_EXP_DPC_RP_PIO_STATUS 0x10000 (Memory Request received URCompletion)
> PCI_EXP_DEVSTA 0x604 (fatal error detected)

I'd recommend clearing only PCI_EXP_DEVSTA_FED in PCI_EXP_DEVSTA.

As for PCI_EXP_DPC_RP_PIO_STATUS, PCIe r6.1 sec 2.9.3 says that
during DPC, either UR or CA completions are returned depending on
the DPC Completion Control bit in the DPC Control register.
The kernel doesn't touch that bit, so it will contain whatever value
the BIOS has set. It seems fine to me to just clear all bits in
PCI_EXP_DPC_RP_PIO_STATUS, as you've done in your patch.

However, the RP PIO Status register is present only in Root Ports
that support RP Extensions for DPC, per PCIe r6.1 sec 7.9.14.6.
So you need to constrain that to "if (pdev->dpc_rp_extensions)".
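
Concretely, pci_clear_surpdn_errors() might then become (a sketch of the
suggested changes, not tested):

static void pci_clear_surpdn_errors(struct pci_dev *pdev)
{
	u16 reg16;

	/* The RP PIO Status register exists only with RP Extensions for DPC */
	if (pdev->dpc_rp_extensions) {
		u32 reg32;

		pci_read_config_dword(pdev, pdev->dpc_cap +
				      PCI_EXP_DPC_RP_PIO_STATUS, &reg32);
		pci_write_config_dword(pdev, pdev->dpc_cap +
				       PCI_EXP_DPC_RP_PIO_STATUS, reg32);
	}

	pci_read_config_word(pdev, PCI_STATUS, &reg16);
	pci_write_config_word(pdev, PCI_STATUS, reg16);

	/* Constrain the Device Status write to Fatal Error Detected */
	pcie_capability_write_word(pdev, PCI_EXP_DEVSTA, PCI_EXP_DEVSTA_FED);
}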


> > > + pci_aer_raw_clear_status(pdev);
> > > + pci_clear_surpdn_errors(pdev);
> > > +
> > > + pci_write_config_word(pdev, pdev->dpc_cap + PCI_EXP_DPC_STATUS,
> > > + PCI_EXP_DPC_STATUS_TRIGGER);
> > > +}
> >
> > Do you need a "wake_up_all(&dpc_completed_waitqueue);" at the end
> > of the function to wake up a pciehp handler waiting for DPC recovery?
>
> I don't think so. The pciehp handler is, however, invoked simultaneously
> due to a PDSC or DLLSC state change, right? Let me know if I'm missing
> anything here.

I think you need to follow the procedure in dpc_reset_link().

That function first waits for the link to go down, in accordance with
PCIe r6.1 sec 6.2.11:

if (!pcie_wait_for_link(pdev, false))
...

Note that the link should not come back up due to a newly hot-added
device until DPC Trigger Status is cleared.

The function then waits for the Root Port to quiesce:

if (pdev->dpc_rp_extensions && dpc_wait_rp_inactive(pdev))
...

And only then does the function clear DPC Trigger Status.

You definitely need to wake_up_all(&dpc_completed_waitqueue) because
pciehp may be waiting for DPC Trigger Status to clear.

And you need to "clear_bit(PCI_DPC_RECOVERED, &pdev->priv_flags)"
before calling wake_up_all().
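
Put together, the handler would then follow the dpc_reset_link() sequence
roughly like this (a sketch, untested):

static void dpc_handle_surprise_removal(struct pci_dev *pdev)
{
	/* Wait for the link to go down, per PCIe r6.1 sec 6.2.11 */
	if (!pcie_wait_for_link(pdev, false))
		goto out;

	/* Wait for the Root Port to quiesce */
	if (pdev->dpc_rp_extensions && dpc_wait_rp_inactive(pdev))
		goto out;

	pci_aer_raw_clear_status(pdev);
	pci_clear_surpdn_errors(pdev);

	/* Only now release DPC so the link may retrain on a hot-add */
	pci_write_config_word(pdev, pdev->dpc_cap + PCI_EXP_DPC_STATUS,
			      PCI_EXP_DPC_STATUS_TRIGGER);

out:
	clear_bit(PCI_DPC_RECOVERED, &pdev->priv_flags);
	wake_up_all(&dpc_completed_waitqueue);
}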


> > > +static bool dpc_is_surprise_removal(struct pci_dev *pdev)
> > > +{
> > > + u16 status;
> > > +
> > > + pci_read_config_word(pdev, pdev->aer_cap + PCI_ERR_UNCOR_STATUS, &status);
> > > +
> > > + if (!(status & PCI_ERR_UNC_SURPDN))
> > > + return false;
> > > +
> >
> > You need an additional check for pdev->is_hotplug_bridge here.
> >
> > And you need to read PCI_EXP_SLTCAP and check for PCI_EXP_SLTCAP_HPS.
> >
> > Return false if either of them isn't set.
>
> Return false only if PCI_EXP_SLTCAP isn't set, correct?
> PCI_EXP_SLTCAP_HPS should be disabled if DPC is enabled.
>
> Implementation notes in 6.7.6 says that:
> "The Hot-Plug Surprise (HPS) mechanism, as indicated by the Hot-Plug
> Surprise bit in the Slot Capabilities Register being Set, is deprecated
> for use with async hot-plug. DPC is the recommended mechanism for supporting
> async hot-plug."
>
> Platform FW will disable the SLTCAP_HPS bit at boot time to enable async
> hotplug on AMD devices.

Huh, is PCI_EXP_SLTCAP_HPS not set on the hotplug port in question?

If it's not set, why do you get Surprise Down Errors in the first place?

How do you bring down the slot without surprise-removal capability?
Via sysfs?


> Probably check if SLTCAP_HPS bit is set and return false?

Quite the opposite! If it's not set, return false.


Thanks,

Lukas

2023-06-16 18:30:51

by Lukas Wunner

Subject: Re: [PATCH v2 2/2] PCI: pciehp: Clear the optional capabilities in DEVCTL2 on a hot-plug

Hi Smita,

My apologies for the delay!

On Mon, May 22, 2023 at 03:23:31PM -0700, Smita Koralahalli wrote:
> On 5/11/2023 4:19 AM, Lukas Wunner wrote:
> > On Tue, Apr 18, 2023 at 09:05:26PM +0000, Smita Koralahalli wrote:
> > > Clear all capabilities in the Device Control 2 register as they are optional
> > > and it is not determined whether the next device inserted will support
> > > these capabilities. Moreover, Section 6.13 of the PCIe Base
> > > Specification [1] recommends clearing the ARI Forwarding Enable bit on
> > > a hot-plug event as it's not guaranteed that the newly added component
> > > is in fact an ARI device.
> >
> > Clearing ARI Forwarding Enable sounds reasonable, but I don't see
> > why all the other bits in the Device Control 2 register need to be
> > cleared. If there isn't a reason to clear them, I'd be in favor of
> > leaving them alone.
>
> I understand. The SPEC doesn't "clearly" state to clear the other bits
> except ARI on a hot-plug event.
>
> But, we came across issues when a device with 10-bit tags was removed and
> replaced with a device that didn't support 10-bit tags. The device became
> inaccessible and the port was not able to be recovered without a system
> reset. So, we thought it would be better to cherry pick all bits that were
> negotiated between endpoint and root port and decided that we should clear
> them all on removal. Hence, my first revision of this patch series had aimed
> to clear only ARI, AtomicOp Requestor and 10 bit tags as these were the
> negotiated settings between endpoint and root port. Ideally, these settings
> should be re-negotiated and set up for optimal operation on a hot add.

Makes total sense. I like the approach of clearing only these three
bits, as you did in v1 of the patch. I also appreciate the detailed
explanation that you've provided. Much of your e-mail can be copy-pasted
to the commit message; in my opinion it's valuable information for any
reviewer and future reader of the commit.


> We had some internal discussions to understand if SPEC has it documented
> somewhere. And we could see in Section 2.2.6.2, it implies that:
> [i] If a Requester sends a 10-Bit Tag Request to a Completer that lacks
> 10-Bit Completer capability, the returned Completion(s) will have Tags with
> Tag[9:8] equal to 00b. Since the Requester is forbidden to generate these
> Tag values for 10-Bit Tags, such Completions will be handled as Unexpected
> Completions, which by default are Advisory Non-Fatal Errors. The Requester
> must follow standard PCI Express error handling requirements.
> [ii] In configurations where a Requester with 10-Bit Tag Requester
> capability needs to target multiple Completers, one needs to ensure that the
> Requester sends 10-Bit Tag Requests only to Completers that have 10-Bit Tag
> Completer capability.
>
> Now, we might wonder, why clear them (especially 10-bit tags and AtomicOps)
> if Linux hasn't enabled them at all as the "10-Bit Tag Requester Enable" bit
> is not defined in Linux currently. But, these features might be enabled by
> Platform FW for "performance reasons" if the endpoint supports and now it is
> the responsibility of the operating system to disable it on a hot remove
> event.

Again, makes total sense.


> According to implementation notes in 2.2.6.2: "For platforms where the RC
> supports 10-Bit Tag Completer capability, it is highly recommended for
> platform firmware or operating software that configures PCIe hierarchies to
> Set the 10-Bit Tag Requester Enable bit automatically in Endpoints with
> 10-Bit Tag Requester capability. This enables the important class of 10-Bit
> Tag capable adapters that send Memory Read Requests only to host memory." So
> if the endpoint and root port both support 10-bit tags BIOS is enabling it
> at boot time..
>
> I ran a quick check to see how DEV_CTL2 registers for root port look on a
> 10-bit tag supported NVMe device.
>
> pci 0000:80:05.1: DEVCTL2 0x1726 (Bit 12: 10-bit tag is enabled..)
> pci 0000:80:05.1: DEVCAP2 0x7f19ff
>
> So, it seems like BIOS has enabled 10-bit tags at boot time even though
> Linux hasn't enabled it.
>
> A couple of ways we think this could be handled:
> [1] Check if these bits are enabled by the Platform at boot time and clear
> them during the hotplug flow only if they are set.
> [2] Clear them unconditionally as I did..
> [3] Enable 10-bit tags in Linux when a device is probed, just like how
> we do for ARI..
>
> Similarly, call pci_enable_atomic_ops_to_root() during a hot add..

Personally I'm fine with option [2]. If you or Bjorn prefer option [3],
I'm fine with that as well.


> > As for clearing ARI Forwarding Enable, it seems commit b0cc6020e1cc
> > ("PCI: Enable ARI if dev and upstream bridge support it; disable
> > otherwise") already solved this problem. Quoth its commit message:
[...]
> > My superficial understanding of that patch is that we do find function 0,
> > while enumerating it we clear the ARI Forwarding Enable bit and thus the
> > remaining functions become accessible and are subsequently enumerated.
> >
> > Are you seeing issues when removing an ARI-capable endpoint from a
> > hotplug slot and replacing it with a non-ARI-capable device?
> > If you do, the question is why the above-quoted commit doesn't avoid them.
>
> Yeah! Sorry I missed this. ARI is already checked and enabled during device
> initialization.

It doesn't hurt to additionally clear it on hot-removal.

Thanks,

Lukas

2023-06-16 23:50:41

by Smita Koralahalli

Subject: Re: [PATCH v2 1/2] PCI: pciehp: Add support for async hotplug with native AER and DPC/EDR

On 6/16/2023 10:31 AM, Lukas Wunner wrote:
> On Mon, May 22, 2023 at 03:23:57PM -0700, Smita Koralahalli wrote:
>> On 5/16/2023 3:10 AM, Lukas Wunner wrote:
>>> On Tue, Apr 18, 2023 at 09:05:25PM +0000, Smita Koralahalli wrote:

>
> I'd recommend clearing only PCI_EXP_DEVSTA_FED in PCI_EXP_DEVSTA.
>
> As for PCI_EXP_DPC_RP_PIO_STATUS, PCIe r6.1 sec 2.9.3 says that
> during DPC, either UR or CA completions are returned depending on
> the DPC Completion Control bit in the DPC Control register.
> The kernel doesn't touch that bit, so it will contain whatever value
> the BIOS has set. It seems fine to me to just clear all bits in
> PCI_EXP_DPC_RP_PIO_STATUS, as you've done in your patch.
>
> However, the RP PIO Status register is present only in Root Ports
> that support RP Extensions for DPC, per PCIe r6.1 sec 7.9.14.6.
> So you need to constrain that to "if (pdev->dpc_rp_extensions)".
>

Okay will make changes.

>
>>>> + pci_aer_raw_clear_status(pdev);
>>>> + pci_clear_surpdn_errors(pdev);
>>>> +
>>>> + pci_write_config_word(pdev, pdev->dpc_cap + PCI_EXP_DPC_STATUS,
>>>> + PCI_EXP_DPC_STATUS_TRIGGER);
>>>> +}
>>>
>>> Do you need a "wake_up_all(&dpc_completed_waitqueue);" at the end
>>> of the function to wake up a pciehp handler waiting for DPC recovery?
>>
>> I don't think so. The pciehp handler is, however, invoked simultaneously
>> due to a PDSC or DLLSC state change, right? Let me know if I'm missing
>> anything here.
>
> I think you need to follow the procedure in dpc_reset_link().
>
> That function first waits for the link to go down, in accordance with
> PCIe r6.1 sec 6.2.11:
>
> if (!pcie_wait_for_link(pdev, false))
> ...
>
> Note that the link should not come back up due to a newly hot-added
> device until DPC Trigger Status is cleared.
>
> The function then waits for the Root Port to quiesce:
>
> if (pdev->dpc_rp_extensions && dpc_wait_rp_inactive(pdev))
> ...
>
> And only then does the function clear DPC Trigger Status.
>
> You definitely need to wake_up_all(&dpc_completed_waitqueue) because
> pciehp may be waiting for DPC Trigger Status to clear.
>
> And you need to "clear_bit(PCI_DPC_RECOVERED, &pdev->priv_flags)"
> before calling wake_up_all().
>
>

Okay. I did not consider the fact that the pciehp handler "may" wait on
DPC Trigger Status to be cleared, because in my case both handlers were
invoked due to their respective bit changes and I did not come across
the case where the pciehp handler was waiting on DPC to complete.


>>>> +static bool dpc_is_surprise_removal(struct pci_dev *pdev)
>>>> +{
>>>> + u16 status;
>>>> +
>>>> + pci_read_config_word(pdev, pdev->aer_cap + PCI_ERR_UNCOR_STATUS, &status);
>>>> +
>>>> + if (!(status & PCI_ERR_UNC_SURPDN))
>>>> + return false;
>>>> +
>>>
>>> You need an additional check for pdev->is_hotplug_bridge here.
>>>
>>> And you need to read PCI_EXP_SLTCAP and check for PCI_EXP_SLTCAP_HPS.
>>>
>>> Return false if either of them isn't set.
>>
>> Return false only if PCI_EXP_SLTCAP isn't set, correct?
>> PCI_EXP_SLTCAP_HPS should be disabled if DPC is enabled.
>>
>> Implementation notes in 6.7.6 says that:
>> "The Hot-Plug Surprise (HPS) mechanism, as indicated by the Hot-Plug
>> Surprise bit in the Slot Capabilities Register being Set, is deprecated
>> for use with async hot-plug. DPC is the recommended mechanism for supporting
>> async hot-plug."
>>
>> Platform FW will disable the SLTCAP_HPS bit at boot time to enable async
>> hotplug on AMD devices.
>
> Huh, is PCI_EXP_SLTCAP_HPS not set on the hotplug port in question?
>
> If it's not set, why do you get Surprise Down Errors in the first place?
>
> How do you bring down the slot without surprise-removal capability?
> Via sysfs?
>

As per SPEC 6.7.6, "Either Downstream Port Containment (DPC) or the
Hot-Plug Surprise (HPS) mechanism may be used to support async removal
as part of an overall async hot-plug architecture".

Also, the implementation notes below convey that HPS is deprecated and
DPC is the recommended mechanism. More details can be found in Appendix
I, I.1 Async Hot-Plug Initial Configuration:
...
If DPC capability then,
If HPS bit not Set, use DPC
Else attempt to Clear HPS bit (§ Section 6.7.4.4)
If successful, use DPC
Else use HPS
...

So, this is "likely" a new feature support patch where DPC supports
async remove. The HPS bit will be disabled by the BIOS if DPC is chosen
as the recommended mechanism to handle async removal.

I see the slot is being brought down by a PDC or DLLSC event, which is
triggered alongside DPC.

pciehp_handle_presence_or_link_change() -> pciehp_disable_slot() ->
__pciehp_disable_slot() -> remove_board()..

But I want to clarify one thing: are you implying that PDC or DLLSC
shouldn't be triggered when HPS is disabled?

Thanks,
Smita

>
>> Probably check if SLTCAP_HPS bit is set and return false?
>
> Quite the opposite! If it's not set, return false.
>
>
> Thanks,
>
> Lukas


2023-06-17 00:33:11

by Smita Koralahalli

Subject: Re: [PATCH v2 2/2] PCI: pciehp: Clear the optional capabilities in DEVCTL2 on a hot-plug

Hi Lukas,

Thanks for reviewing my patch.

On 6/16/2023 11:24 AM, Lukas Wunner wrote:
> Hi Smita,
>
> My apologies for the delay!
>
> On Mon, May 22, 2023 at 03:23:31PM -0700, Smita Koralahalli wrote:
>> On 5/11/2023 4:19 AM, Lukas Wunner wrote:
>>> On Tue, Apr 18, 2023 at 09:05:26PM +0000, Smita Koralahalli wrote:
>>>> Clear all capabilities in the Device Control 2 register as they are optional
>>>> and it is not determined whether the next device inserted will support
>>>> these capabilities. Moreover, Section 6.13 of the PCIe Base
>>>> Specification [1] recommends clearing the ARI Forwarding Enable bit on
>>>> a hot-plug event as it's not guaranteed that the newly added component
>>>> is in fact an ARI device.
>>>
>>> Clearing ARI Forwarding Enable sounds reasonable, but I don't see
>>> why all the other bits in the Device Control 2 register need to be
>>> cleared. If there isn't a reason to clear them, I'd be in favor of
>>> leaving them alone.
>>
>> I understand. The SPEC doesn't "clearly" state to clear the other bits
>> except ARI on a hot-plug event.
>>
>> But, we came across issues when a device with 10-bit tags was removed and
>> replaced with a device that didn't support 10-bit tags. The device became
>> inaccessible and the port was not able to be recovered without a system
>> reset. So, we thought it would be better to cherry pick all bits that were
>> negotiated between endpoint and root port and decided that we should clear
>> them all on removal. Hence, my first revision of this patch series had aimed
>> to clear only ARI, AtomicOp Requestor and 10 bit tags as these were the
>> negotiated settings between endpoint and root port. Ideally, these settings
>> should be re-negotiated and set up for optimal operation on a hot add.
>
> Makes total sense. I like the approach of clearing only these three
> bits, as you did in v1 of the patch. I also appreciate the detailed
> explanation that you've provided. Much of your e-mail can be copy-pasted
> to the commit message, in my opinion it's valuable information to any
> reviewer and future reader of the commit.
>
>
>> We had some internal discussions to understand if SPEC has it documented
>> somewhere. And we could see in Section 2.2.6.2, it implies that:
>> [i] If a Requester sends a 10-Bit Tag Request to a Completer that lacks
>> 10-Bit Completer capability, the returned Completion(s) will have Tags with
>> Tag[9:8] equal to 00b. Since the Requester is forbidden to generate these
>> Tag values for 10-Bit Tags, such Completions will be handled as Unexpected
>> Completions, which by default are Advisory Non-Fatal Errors. The Requester
>> must follow standard PCI Express error handling requirements.
>> [ii] In configurations where a Requester with 10-Bit Tag Requester
>> capability needs to target multiple Completers, one needs to ensure that the
>> Requester sends 10-Bit Tag Requests only to Completers that have 10-Bit Tag
>> Completer capability.
>>
>> Now, we might wonder, why clear them (especially 10-bit tags and AtomicOps)
>> if Linux hasn't enabled them at all as the "10-Bit Tag Requester Enable" bit
>> is not defined in Linux currently. But, these features might be enabled by
>> Platform FW for "performance reasons" if the endpoint supports and now it is
>> the responsibility of the operating system to disable it on a hot remove
>> event.
>
> Again, makes total sense.
>
>
>> According to implementation notes in 2.2.6.2: "For platforms where the RC
>> supports 10-Bit Tag Completer capability, it is highly recommended for
>> platform firmware or operating software that configures PCIe hierarchies to
>> Set the 10-Bit Tag Requester Enable bit automatically in Endpoints with
>> 10-Bit Tag Requester capability. This enables the important class of 10-Bit
>> Tag capable adapters that send Memory Read Requests only to host memory."
>> So if the endpoint and root port both support 10-bit tags, the BIOS
>> enables them at boot time.
>>
>> I ran a quick check to see how the DEVCTL2 register of the root port looks
>> with a 10-bit-tag-capable NVMe device.
>>
>> pci 0000:80:05.1: DEVCTL2 0x1726 (bit 12: 10-Bit Tag Requester Enable is set)
>> pci 0000:80:05.1: DEVCAP2 0x7f19ff
>>
>> So, it seems the BIOS has enabled 10-bit tags at boot time even though
>> Linux hasn't.
>>
>> A couple of ways we could handle this:
>> [1] Check whether these bits were enabled by the platform at boot time and
>> clear them during the hotplug flow only if they are set.
>> [2] Clear them unconditionally, as I did.
>> [3] Enable 10-bit tags in Linux when a device is probed, just as we do
>> for ARI.
>>
>> Similarly, call pci_enable_atomic_ops_to_root() during a hot add.
>
> Personally I'm fine with option [2]. If you or Bjorn prefer option [3],
> I'm fine with that as well.

Looking forward to Bjorn's comments!
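
In the meantime, to make [2] concrete, below is a rough, untested
sketch of what I have in mind for the hot-removal path (the exact call
site and bit list are of course up for discussion; the 10-Bit Tag mask
is defined locally here since pci_regs.h may not have it yet):

#ifndef PCI_EXP_DEVCTL2_10BIT_TAG_REQ_EN
#define PCI_EXP_DEVCTL2_10BIT_TAG_REQ_EN 0x1000 /* DEVCTL2 bit 12, per spec */
#endif

static void pciehp_clear_negotiated_caps(struct pci_dev *port)
{
	/*
	 * Clear the settings negotiated with the removed device so that
	 * a hot-added device starts from a clean slate and re-negotiates
	 * them during enumeration.
	 */
	pcie_capability_clear_word(port, PCI_EXP_DEVCTL2,
				   PCI_EXP_DEVCTL2_ARI |
				   PCI_EXP_DEVCTL2_ATOMIC_REQ |
				   PCI_EXP_DEVCTL2_10BIT_TAG_REQ_EN);
}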

Thanks,
Smita
>
>
>>> As for clearing ARI Forwarding Enable, it seems commit b0cc6020e1cc
>>> ("PCI: Enable ARI if dev and upstream bridge support it; disable
>>> otherwise") already solved this problem. Quoth its commit message:
> [...]
>>> My superficial understanding of that patch is that we do find function 0,
>>> while enumerating it we clear the ARI Forwarding Enable bit and thus the
>>> remaining functions become accessible and are subsequently enumerated.
>>>
>>> Are you seeing issues when removing an ARI-capable endpoint from a
>>> hotplug slot and replacing it with a non-ARI-capable device?
>>> If you do, the question is why the above-quoted commit doesn't avoid them.
>>
>> Yeah! Sorry, I missed this. ARI is already checked and enabled during
>> device initialization.
>
> It doesn't hurt to additionally clear it on hot-removal.
>
> Thanks,
>
> Lukas


2023-06-17 08:06:08

by Lukas Wunner

[permalink] [raw]
Subject: Re: [PATCH v2 1/2] PCI: pciehp: Add support for async hotplug with native AER and DPC/EDR

On Fri, Jun 16, 2023 at 04:30:49PM -0700, Smita Koralahalli wrote:
> On 6/16/2023 10:31 AM, Lukas Wunner wrote:
> > On Mon, May 22, 2023 at 03:23:57PM -0700, Smita Koralahalli wrote:
> > > On 5/16/2023 3:10 AM, Lukas Wunner wrote:
> > > > On Tue, Apr 18, 2023 at 09:05:25PM +0000, Smita Koralahalli wrote:
> > > > > +static bool dpc_is_surprise_removal(struct pci_dev *pdev)
> > > > > +{
> > > > > + u16 status;
> > > > > +
> > > > > + pci_read_config_word(pdev, pdev->aer_cap + PCI_ERR_UNCOR_STATUS, &status);
> > > > > +
> > > > > + if (!(status & PCI_ERR_UNC_SURPDN))
> > > > > + return false;
> > > > > +
> > > >
> > > > You need an additional check for pdev->is_hotplug_bridge here.
> > > >
> > > > And you need to read PCI_EXP_SLTCAP and check for PCI_EXP_SLTCAP_HPS.
> > > >
> > > > Return false if either of them isn't set.
> > >
> > > Return false only if PCI_EXP_SLTCAP isn't set, correct?
> > > PCI_EXP_SLTCAP_HPS should be disabled if DPC is enabled.
> > >
> > > The implementation note in 6.7.6 says:
> > > "The Hot-Plug Surprise (HPS) mechanism, as indicated by the Hot-Plug
> > > Surprise bit in the Slot Capabilities Register being Set, is deprecated
> > > for use with async hot-plug. DPC is the recommended mechanism for supporting
> > > async hot-plug."
> > >
> > > Platform FW will disable the SLTCAP_HPS bit at boot time to enable async
> > > hotplug on AMD devices.
> >
> > Huh, is PCI_EXP_SLTCAP_HPS not set on the hotplug port in question?
> >
> > If it's not set, why do you get Surprise Down Errors in the first place?
> >
> > How do you bring down the slot without surprise-removal capability?
> > Via sysfs?
>
> As per SPEC 6.7.6, "Either Downstream Port Containment (DPC) or the Hot-Plug
> Surprise (HPS) mechanism may be used to support async removal as part of an
> overall async hot-plug architecture".
>
> Also, the implementation note below it conveys that HPS is deprecated and
> DPC is the recommended mechanism. More details can be found in Appendix I, I.1
> Async Hot-Plug Initial Configuration:
> ...
> If DPC capability then,
> If HPS bit not Set, use DPC
> Else attempt to Clear HPS bit (see Section 6.7.4.4)
> If successful, use DPC
> Else use HPS
> ...
>
> So, this is "likely" a new feature-support patch where DPC supports async
> remove. The HPS bit will be disabled by the BIOS if DPC is chosen as the
> recommended mechanism to handle async removal.
>
> I see the slot is being brought down by a PDC or DLLSC event, which is
> triggered alongside DPC.
>
> pciehp_handle_presence_or_link_change() -> pciehp_disable_slot() ->
> __pciehp_disable_slot() -> remove_board().
>
> But I want to clarify one thing: are you implying that PDC or DLLSC
> shouldn't be triggered when HPS is disabled?

Sorry, please ignore my advice to check PCI_EXP_SLTCAP_HPS in
dpc_is_surprise_removal().

Instead, only check for pdev->is_hotplug_bridge. The rationale is that
Surprise Down errors are par for the course on hotplug ports, but
they're an anomaly that must be reported on non-hotplug ports.
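
Concretely, I'm thinking of something along these lines (untested
sketch, just to illustrate the shape I have in mind):

static bool dpc_is_surprise_removal(struct pci_dev *pdev)
{
	u16 status;

	/* Surprise Down is expected on hotplug ports, an anomaly elsewhere */
	if (!pdev->is_hotplug_bridge)
		return false;

	if (!pdev->aer_cap)
		return false;

	pci_read_config_word(pdev, pdev->aer_cap + PCI_ERR_UNCOR_STATUS,
			     &status);

	return status & PCI_ERR_UNC_SURPDN;
}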

PCI_EXP_SLTCAP_HPS has no bearing on pciehp's behavior. If there's
a hotplug port and hotplug control was granted to the OS, pciehp will
bind to the device. pciehp will react to any DLLSC and PDC event.

I think the target audience for PCIe r6.1 sec 6.7.6 are hardware
designers and the advice given there is to add DPC capability to
hotplug ports. That's fine. It doesn't affect how Linux handles
async removal.

I think the wording in the spec to "use DPC" for async removal is
confusing. We don't want to duplicate the code for device
de-/re-enumeration and slot bringup / bringdown in the dpc driver.
We just continue letting pciehp do that (and thus retain compatibility
with older, non-DPC-capable hardware). But we wire up pciehp with
the dpc driver to add DPC-awareness for hotplug.

One part of the equation was to ignore link changes which occur
as a side effect of a DPC event from which we've successfully
recovered. That was added with commit a97396c6eb13 ("PCI: pciehp:
Ignore Link Down/Up caused by DPC").

The other part of the equation (which you're adding here) is to
ignore Surprise Down errors in the dpc driver which occur as a side
effect of async removal.
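
In other words, roughly (hand-wavy sketch; the helper name is a
placeholder and the real thing would likely have to wait for the link
to actually go down before clearing the status bits):

static irqreturn_t dpc_handle_surprise_removal(struct pci_dev *pdev)
{
	/*
	 * The Surprise Down error is merely a side effect of async
	 * removal: clear the error state and let pciehp tear down the
	 * slot via the accompanying PDC/DLLSC event.
	 */
	pci_aer_raw_clear_status(pdev);
	pci_write_config_word(pdev, pdev->dpc_cap + PCI_EXP_DPC_STATUS,
			      PCI_EXP_DPC_STATUS_TRIGGER);
	return IRQ_HANDLED;
}

dpc_handler() would call this and bail out early when
dpc_is_surprise_removal() returns true, instead of entering the
recovery path.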

I think that makes us compliant with PCIe r6.1 sec 6.7.6, although
we're interpreting "use DPC" (or async removal) somewhat liberally.
Actually I'd prefer the term "pragmatically" instead of "liberally". ;)

We don't want to duplicate code just for the sake of being word-by-word
compliant with the spec. If it works in the real world, it's fine. :)

Thanks,

Lukas

2023-06-21 07:26:15

by Lukas Wunner

[permalink] [raw]
Subject: Re: [PATCH v2 2/2] PCI: pciehp: Clear the optional capabilities in DEVCTL2 on a hot-plug

On Fri, Jun 16, 2023 at 04:34:27PM -0700, Smita Koralahalli wrote:
> On 6/16/2023 11:24 AM, Lukas Wunner wrote:
> > On Mon, May 22, 2023 at 03:23:31PM -0700, Smita Koralahalli wrote:
> > > On 5/11/2023 4:19 AM, Lukas Wunner wrote:
> > > > On Tue, Apr 18, 2023 at 09:05:26PM +0000, Smita Koralahalli wrote:
> > > A couple of ways we could handle this:
> > > [1] Check whether these bits were enabled by the platform at boot time
> > > and clear them during the hotplug flow only if they are set.
> > > [2] Clear them unconditionally, as I did.
> > > [3] Enable 10-bit tags in Linux when a device is probed, just as we do
> > > for ARI.
> > >
> > > Similarly, call pci_enable_atomic_ops_to_root() during a hot add.
> >
> > Personally I'm fine with option [2]. If you or Bjorn prefer option [3],
> > I'm fine with that as well.
>
> Looking forward to Bjorn's comments!

You may want to consider first doing [2], i.e. clear the DevCtl2 bits
on hot removal, and then in a separate step do [3], i.e. add support
for enabling 10-bit tags and atomic ops in the kernel. Having that
would certainly be useful, but it's more complex than just clearing
the DevCtl2 bits on unplug. So you may want to do the latter as a
stop-gap.
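
For [3], I'd imagine something modeled on pci_configure_ari(), along
these lines (untested sketch; the 10-Bit Tag masks follow the spec bit
positions and may not exist in pci_regs.h yet, and a real
implementation would have to consider the whole path to the Root
Complex, not just the upstream bridge):

#ifndef PCI_EXP_DEVCAP2_10BIT_TAG_COMP
#define PCI_EXP_DEVCAP2_10BIT_TAG_COMP   0x00010000 /* DEVCAP2 bit 16 */
#endif
#ifndef PCI_EXP_DEVCAP2_10BIT_TAG_REQ
#define PCI_EXP_DEVCAP2_10BIT_TAG_REQ    0x00020000 /* DEVCAP2 bit 17 */
#endif
#ifndef PCI_EXP_DEVCTL2_10BIT_TAG_REQ_EN
#define PCI_EXP_DEVCTL2_10BIT_TAG_REQ_EN 0x1000     /* DEVCTL2 bit 12 */
#endif

static void pci_configure_10bit_tags(struct pci_dev *dev)
{
	struct pci_dev *bridge = pci_upstream_bridge(dev);
	u32 cap;

	if (!bridge)
		return;

	/* The device must be a 10-Bit Tag capable Requester ... */
	pcie_capability_read_dword(dev, PCI_EXP_DEVCAP2, &cap);
	if (!(cap & PCI_EXP_DEVCAP2_10BIT_TAG_REQ))
		return;

	/* ... and the upstream port a 10-Bit Tag capable Completer. */
	pcie_capability_read_dword(bridge, PCI_EXP_DEVCAP2, &cap);
	if (!(cap & PCI_EXP_DEVCAP2_10BIT_TAG_COMP))
		return;

	pcie_capability_set_word(dev, PCI_EXP_DEVCTL2,
				 PCI_EXP_DEVCTL2_10BIT_TAG_REQ_EN);
}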

Thanks,

Lukas

2023-06-21 19:02:01

by Smita Koralahalli

[permalink] [raw]
Subject: Re: [PATCH v2 2/2] PCI: pciehp: Clear the optional capabilities in DEVCTL2 on a hot-plug

On 6/21/2023 12:12 AM, Lukas Wunner wrote:
> On Fri, Jun 16, 2023 at 04:34:27PM -0700, Smita Koralahalli wrote:
>> On 6/16/2023 11:24 AM, Lukas Wunner wrote:
>>> On Mon, May 22, 2023 at 03:23:31PM -0700, Smita Koralahalli wrote:
>>>> On 5/11/2023 4:19 AM, Lukas Wunner wrote:
>>>>> On Tue, Apr 18, 2023 at 09:05:26PM +0000, Smita Koralahalli wrote:
>>>> A couple of ways we could handle this:
>>>> [1] Check whether these bits were enabled by the platform at boot time
>>>> and clear them during the hotplug flow only if they are set.
>>>> [2] Clear them unconditionally, as I did.
>>>> [3] Enable 10-bit tags in Linux when a device is probed, just as we do
>>>> for ARI.
>>>>
>>>> Similarly, call pci_enable_atomic_ops_to_root() during a hot add.
>>>
>>> Personally I'm fine with option [2]. If you or Bjorn prefer option [3],
>>> I'm fine with that as well.
>>
>> Looking forward to Bjorn's comments!
>
> You may want to consider first doing [2], i.e. clear the DevCtl2 bits
> on hot removal, and then in a separate step do [3], i.e. add support
> for enabling 10-bit tags and atomic ops in the kernel. Having that
> would certainly be useful, but it's more complex than just clearing
> the DevCtl2 bits on unplug. So you may want to do the latter as a
> stop-gap.

Okay, I just sent v3. I will take care of this in v4 along with the
other comments on v3.

Thanks,
Smita
>
> Thanks,
>
> Lukas