2023-06-02 19:55:07

by Lucas Karpinski

Subject: [PATCH] Revert "arm64: dts: qcom: sa8540p-ride: enable pcie2a node"

This reverts commit 2eb4cdcd5aba2db83f2111de1242721eeb659f71.

The patch introduced a sporadic error where the Qdrive3 will fail to
boot occasionally due to an rcu preempt stall.
Qualcomm has disabled pcie2a downstream:
https://git.codelinaro.org/clo/la/platform/vendor/qcom-opensource/rh-patch/-/commit/447f2135909683d1385af36f95fae5e1d63a7e2f

rcu: INFO: rcu_preempt self-detected stall on CPU
rcu: 0-....: (1 GPs behind) idle=77fc/1/0x4000000000000004 softirq=841/841 fqs=2476
rcu: (t=5253 jiffies g=-175 q=2552 ncpus=8)
Call trace:
__do_softirq
____do_softirq
call_on_irq_stack
do_softirq_own_stack
__irq_exit_rcu
irq_exit_rcu

The issue occurs normally once every 3-4 boot cycles.
There is likely a race condition caused when setting up the two pcie
domains concurrently (pcie2a and pcie3a).

The issue is not present when only pcie2a is enabled or when only pcie3a
is enabled.
A workaround was found that allowed the Qdrive3 to boot with both pcie2a
and pcie3a enabled.
Set the .probe_type to PROBE_FORCE_SYNCHRONOUS and add an msleep() to
the probing function.
This is not a solution, so this patch is disabling pcie2a as it seems
Red Hat are the only ones working on the board,
we're fine with disabling the node until a root cause is found. If
anyone has further suggestions for debugging, let me know.

Signed-off-by: Lucas Karpinski <[email protected]>
---
During debugging:
- Added additional time for clock/regulator stabilization.
- Reduced the bandwidth across pcie2a and pcie3a.
- Replaced the interconnect setup from another driver.
- The 32-bit/64-bit/config-io spaces for both pcie2a and pcie3a look to be mapped correctly.
- Verified interconnects were started successfully.

arch/arm64/boot/dts/qcom/sa8540p-ride.dts | 44 -----------------------
1 file changed, 44 deletions(-)

diff --git a/arch/arm64/boot/dts/qcom/sa8540p-ride.dts b/arch/arm64/boot/dts/qcom/sa8540p-ride.dts
index 24fa449d48a6..d492723ccf7c 100644
--- a/arch/arm64/boot/dts/qcom/sa8540p-ride.dts
+++ b/arch/arm64/boot/dts/qcom/sa8540p-ride.dts
@@ -186,27 +186,6 @@ &i2c18 {
status = "okay";
};

-&pcie2a {
- ranges = <0x01000000 0x0 0x3c200000 0x0 0x3c200000 0x0 0x100000>,
- <0x02000000 0x0 0x3c300000 0x0 0x3c300000 0x0 0x1d00000>,
- <0x03000000 0x5 0x00000000 0x5 0x00000000 0x1 0x00000000>;
-
- perst-gpios = <&tlmm 143 GPIO_ACTIVE_LOW>;
- wake-gpios = <&tlmm 145 GPIO_ACTIVE_HIGH>;
-
- pinctrl-names = "default";
- pinctrl-0 = <&pcie2a_default>;
-
- status = "okay";
-};
-
-&pcie2a_phy {
- vdda-phy-supply = <&vreg_l11a>;
- vdda-pll-supply = <&vreg_l3a>;
-
- status = "okay";
-};
-
&pcie3a {
ranges = <0x01000000 0x0 0x40200000 0x0 0x40200000 0x0 0x100000>,
<0x02000000 0x0 0x40300000 0x0 0x40300000 0x0 0x20000000>,
@@ -356,29 +335,6 @@ i2c18_default: i2c18-default-state {
bias-pull-up;
};

- pcie2a_default: pcie2a-default-state {
- perst-pins {
- pins = "gpio143";
- function = "gpio";
- drive-strength = <2>;
- bias-pull-down;
- };
-
- clkreq-pins {
- pins = "gpio142";
- function = "pcie2a_clkreq";
- drive-strength = <2>;
- bias-pull-up;
- };
-
- wake-pins {
- pins = "gpio145";
- function = "gpio";
- drive-strength = <2>;
- bias-pull-up;
- };
- };
-
pcie3a_default: pcie3a-default-state {
perst-pins {
pins = "gpio151";
--
2.40.1



2023-06-07 15:58:46

by Brian Masney

Subject: Re: [PATCH] Revert "arm64: dts: qcom: sa8540p-ride: enable pcie2a node"

Hi Lucas,

On Fri, Jun 02, 2023 at 03:33:21PM -0400, Lucas Karpinski wrote:
> This reverts commit 2eb4cdcd5aba2db83f2111de1242721eeb659f71.

I am all for reverting this commit; however, I think your commit message
needs to be cleaned up.

> The patch introduced a sporadic error where the Qdrive3 will fail to
> boot occasionally due to an rcu preempt stall.
> Qualcomm has disabled pcie2a downstream:
> https://git.codelinaro.org/clo/la/platform/vendor/qcom-opensource/rh-patch/-/commit/447f2135909683d1385af36f95fae5e1d63a7e2f

Personally I'd remove the mention of the downstream kernel in this case.

Also your paragraphs are formatted weird with a newline at the end
of every sentence. Get them to flow together as a regular paragraph.
This is the relevant line that I have in my muttrc file to help.

set editor="vim -c 'set spell spelllang=en' -c 'set tw=72' -c 'set wrap'"

> rcu: INFO: rcu_preempt self-detected stall on CPU
> rcu: 0-....: (1 GPs behind) idle=77fc/1/0x4000000000000004 softirq=841/841 fqs=2476
> rcu: (t=5253 jiffies g=-175 q=2552 ncpus=8)
> Call trace:
> __do_softirq
> ____do_softirq
> call_on_irq_stack
> do_softirq_own_stack
> __irq_exit_rcu
> irq_exit_rcu
>
> The issue occurs normally once every 3-4 boot cycles.
> There is likely a race condition caused when setting up the two pcie
> domains concurrently (pcie2a and pcie3a).

I would also add that Qualcomm told us that upgrading the firmware on
the PCIe switch would correct this issue. We've upgraded the PCIe switch
to the latest firmware and this issue is still present. Apparently we
need to use a specific older version of the firmware that we can't get
from the PCIe switch vendor or Qualcomm.

Nothing is hooked up to pcie2a on the QDrive3 so there's no loss in
functionality by disabling this. We always have to remember to revert
this commit when working with an upstream kernel.

> This is not a solution, so this patch is disabling pcie2a as it seems
> Red Hat are the only ones working on the board,
> we're fine with disabling the node until a root cause is found. If
> anyone has further suggestions for debugging, let me know.

This should go under the ---.

Brian


2023-06-12 20:16:03

by Eric Chanudet

Subject: Re: [PATCH] Revert "arm64: dts: qcom: sa8540p-ride: enable pcie2a node"

On Fri, Jun 02, 2023 at 03:33:21PM -0400, Lucas Karpinski wrote:
> This reverts commit 2eb4cdcd5aba2db83f2111de1242721eeb659f71.
>
> The patch introduced a sporadic error where the Qdrive3 will fail to
> boot occasionally due to an rcu preempt stall.
> Qualcomm has disabled pcie2a downstream:
> https://git.codelinaro.org/clo/la/platform/vendor/qcom-opensource/rh-patch/-/commit/447f2135909683d1385af36f95fae5e1d63a7e2f
>
> rcu: INFO: rcu_preempt self-detected stall on CPU
> rcu: 0-....: (1 GPs behind) idle=77fc/1/0x4000000000000004 softirq=841/841 fqs=2476
> rcu: (t=5253 jiffies g=-175 q=2552 ncpus=8)
> Call trace:
> __do_softirq
> ____do_softirq
> call_on_irq_stack
> do_softirq_own_stack
> __irq_exit_rcu
> irq_exit_rcu
>
> The issue occurs normally once every 3-4 boot cycles.
> There is likely a race condition caused when setting up the two pcie
> domains concurrently (pcie2a and pcie3a).
>
> The issue is not present when only pcie2a is enabled or when only pcie3a
> is enabled.
> A workaround was found that allowed the Qdrive3 to boot with both pcie2a
> and pcie3a enabled.
> Set the .probe_type to PROBE_FORCE_SYNCHRONOUS and add an msleep() to
> the probing function.
> This is not a solution, so this patch is disabling pcie2a as it seems
> Red Hat are the only ones working on the board,
> we're fine with disabling the node until a root cause is found. If
> anyone has further suggestions for debugging, let me know.
>
> Signed-off-by: Lucas Karpinski <[email protected]>
> ---
> During debugging:
> - Added additional time for clock/regulator stabilization.
> - Reduced the bandwidth across pcie2a and pcie3a.
> - Replaced the interconnect setup from another driver.
> - The 32-bit/64-bit/config-io spaces for both pcie2a and pcie3a look to be mapped correctly.
> - Verified interconnects were started successfully.

I was looking at another issue downstream triggering a soft lockup on
CPU0, but it turns out this could be the same thing except the symptoms
are less noticeable (the 3-4 boot cycles you mention).

Using next-20230609, if I add a return kprobe on dw_handle_msi_irq:

echo 'r:dwmsi_probe dw_handle_msi_irq $retval' > /sys/kernel/debug/tracing/kprobe_events
echo 1 > /sys/kernel/debug/tracing/events/kprobes/dwmsi_probe/enable
cat /sys/kernel/debug/tracing/trace_pipe
<idle>-0 [000] d.h1. 690.417268: dwmsi_probe: (dw_chained_msi_isr+0x38/0xb8 <- dw_handle_msi_irq) arg1=0x0
<idle>-0 [000] d.h1. 690.417272: dwmsi_probe: (dw_chained_msi_isr+0x38/0xb8 <- dw_handle_msi_irq) arg1=0x0
<idle>-0 [000] d.h1. 690.417276: dwmsi_probe: (dw_chained_msi_isr+0x38/0xb8 <- dw_handle_msi_irq) arg1=0x0
<idle>-0 [000] d.h1. 690.417281: dwmsi_probe: (dw_chained_msi_isr+0x38/0xb8 <- dw_handle_msi_irq) arg1=0x0
<idle>-0 [000] d.h1. 690.417284: dwmsi_probe: (dw_chained_msi_isr+0x38/0xb8 <- dw_handle_msi_irq) arg1=0x0
<idle>-0 [000] d.h1. 690.417288: dwmsi_probe: (dw_chained_msi_isr+0x38/0xb8 <- dw_handle_msi_irq) arg1=0x0
[...]

dw_handle_msi_irq constantly fires and never returns IRQ_HANDLED. It
happens consistently for pcie2a or pcie3a, after I disable one or the
other. I presume having both might be enough to overwhelm the system and
trigger the stall?

Looking at the handler, the status is always 0 after:
	status = dw_pcie_readl_dbi(pci, PCIE_MSI_INTR0_STATUS +
				   (i * MSI_REG_CTRL_BLOCK_SIZE));

Unfortunately I do not know why that is yet.
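
To make the observation above concrete, here is a simplified userspace
model of the return-value logic, not the kernel's code. It assumes the
handler walks one 32-bit status word per MSI controller and only reports
"handled" when some word has a bit set. All model_* names are hypothetical,
and the status words are passed in as an array instead of being read from
the DBI registers.

```c
#include <assert.h>
#include <stdint.h>

#define MODEL_MAX_CTRLS     8   /* illustrative upper bound on controllers */
#define MODEL_IRQS_PER_CTRL 32  /* one status bit per MSI vector */

/* Mirrors the kernel's IRQ_NONE (0) and IRQ_HANDLED (1) values. */
enum model_irqreturn {
	MODEL_IRQ_NONE    = 0,
	MODEL_IRQ_HANDLED = 1,
};

/*
 * Walk the per-controller status words. A word of 0 is skipped, so if
 * every word reads 0 the function returns MODEL_IRQ_NONE even though the
 * interrupt line fired. *dispatched counts the vectors that would have
 * been forwarded to their handlers.
 */
enum model_irqreturn model_handle_msi_irq(const uint32_t *status,
					  unsigned int num_ctrls,
					  unsigned int *dispatched)
{
	enum model_irqreturn ret = MODEL_IRQ_NONE;
	unsigned int i, pos;

	assert(num_ctrls <= MODEL_MAX_CTRLS);
	*dispatched = 0;

	for (i = 0; i < num_ctrls; i++) {
		if (!status[i])
			continue;

		ret = MODEL_IRQ_HANDLED;
		for (pos = 0; pos < MODEL_IRQS_PER_CTRL; pos++) {
			if (status[i] & (UINT32_C(1) << pos))
				(*dispatched)++;
		}
	}

	return ret;
}
```

With every status word at 0 the model returns MODEL_IRQ_NONE and dispatches
nothing, which matches the arg1=0x0 return values in the trace, where the
entries are roughly 4 microseconds apart.
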

>
> arch/arm64/boot/dts/qcom/sa8540p-ride.dts | 44 -----------------------
> 1 file changed, 44 deletions(-)
>
> diff --git a/arch/arm64/boot/dts/qcom/sa8540p-ride.dts b/arch/arm64/boot/dts/qcom/sa8540p-ride.dts
> index 24fa449d48a6..d492723ccf7c 100644
> --- a/arch/arm64/boot/dts/qcom/sa8540p-ride.dts
> +++ b/arch/arm64/boot/dts/qcom/sa8540p-ride.dts
> @@ -186,27 +186,6 @@ &i2c18 {
> status = "okay";
> };
>
> -&pcie2a {
> - ranges = <0x01000000 0x0 0x3c200000 0x0 0x3c200000 0x0 0x100000>,
> - <0x02000000 0x0 0x3c300000 0x0 0x3c300000 0x0 0x1d00000>,
> - <0x03000000 0x5 0x00000000 0x5 0x00000000 0x1 0x00000000>;
> -
> - perst-gpios = <&tlmm 143 GPIO_ACTIVE_LOW>;
> - wake-gpios = <&tlmm 145 GPIO_ACTIVE_HIGH>;
> -
> - pinctrl-names = "default";
> - pinctrl-0 = <&pcie2a_default>;
> -
> - status = "okay";
> -};
> -
> -&pcie2a_phy {
> - vdda-phy-supply = <&vreg_l11a>;
> - vdda-pll-supply = <&vreg_l3a>;
> -
> - status = "okay";
> -};
> -
> &pcie3a {
> ranges = <0x01000000 0x0 0x40200000 0x0 0x40200000 0x0 0x100000>,
> <0x02000000 0x0 0x40300000 0x0 0x40300000 0x0 0x20000000>,
> @@ -356,29 +335,6 @@ i2c18_default: i2c18-default-state {
> bias-pull-up;
> };
>
> - pcie2a_default: pcie2a-default-state {
> - perst-pins {
> - pins = "gpio143";
> - function = "gpio";
> - drive-strength = <2>;
> - bias-pull-down;
> - };
> -
> - clkreq-pins {
> - pins = "gpio142";
> - function = "pcie2a_clkreq";
> - drive-strength = <2>;
> - bias-pull-up;
> - };
> -
> - wake-pins {
> - pins = "gpio145";
> - function = "gpio";
> - drive-strength = <2>;
> - bias-pull-up;
> - };
> - };
> -
> pcie3a_default: pcie3a-default-state {
> perst-pins {
> pins = "gpio151";
> --
> 2.40.1
>

--
Eric Chanudet