2020-03-24 19:14:37

by Dmitry Osipenko

[permalink] [raw]
Subject: [PATCH v2 1/2] i2c: tegra: Better handle case where CPU0 is busy for a long time

Boot CPU0 always handle I2C interrupt and under some rare circumstances
(like running KASAN + NFS root) it may stuck in uninterruptible state for
a significant time. In this case we will get timeout if I2C transfer is
running on a sibling CPU, despite of IRQ being raised. In order to handle
this rare condition, the IRQ status needs to be checked after completion
timeout.

Signed-off-by: Dmitry Osipenko <[email protected]>
---
drivers/i2c/busses/i2c-tegra.c | 27 +++++++++++++++------------
1 file changed, 15 insertions(+), 12 deletions(-)

diff --git a/drivers/i2c/busses/i2c-tegra.c b/drivers/i2c/busses/i2c-tegra.c
index cbc2ad49043e..0daa863fb26f 100644
--- a/drivers/i2c/busses/i2c-tegra.c
+++ b/drivers/i2c/busses/i2c-tegra.c
@@ -1000,14 +1000,13 @@ tegra_i2c_poll_completion_timeout(struct tegra_i2c_dev *i2c_dev,
do {
u32 status = i2c_readl(i2c_dev, I2C_INT_STATUS);

- if (status) {
+ if (status)
tegra_i2c_isr(i2c_dev->irq, i2c_dev);

- if (completion_done(complete)) {
- s64 delta = ktime_ms_delta(ktimeout, ktime);
+ if (completion_done(complete)) {
+ s64 delta = ktime_ms_delta(ktimeout, ktime);

- return msecs_to_jiffies(delta) ?: 1;
- }
+ return msecs_to_jiffies(delta) ?: 1;
}

ktime = ktime_get();
@@ -1034,14 +1033,18 @@ tegra_i2c_wait_completion_timeout(struct tegra_i2c_dev *i2c_dev,
disable_irq(i2c_dev->irq);

/*
- * There is a chance that completion may happen after IRQ
- * synchronization, which is done by disable_irq().
+ * Under some rare circumstances (like running KASAN +
+ * NFS root) CPU, which handles interrupt, may stuck in
+ * uninterruptible state for a significant time. In this
+ * case we will get timeout if I2C transfer is running on
+ * a sibling CPU, despite of IRQ being raised.
+ *
+ * In order to handle this rare condition, the IRQ status
+ * needs to be checked after timeout.
*/
- if (ret == 0 && completion_done(complete)) {
- dev_warn(i2c_dev->dev,
- "completion done after timeout\n");
- ret = 1;
- }
+ if (ret == 0)
+ ret = tegra_i2c_poll_completion_timeout(i2c_dev,
+ complete, 0);
}

return ret;
--
2.25.1


2020-04-16 00:21:45

by Wolfram Sang

[permalink] [raw]
Subject: Re: [PATCH v2 1/2] i2c: tegra: Better handle case where CPU0 is busy for a long time

On Tue, Mar 24, 2020 at 10:12:16PM +0300, Dmitry Osipenko wrote:
> Boot CPU0 always handle I2C interrupt and under some rare circumstances
> (like running KASAN + NFS root) it may stuck in uninterruptible state for
> a significant time. In this case we will get timeout if I2C transfer is
> running on a sibling CPU, despite of IRQ being raised. In order to handle
> this rare condition, the IRQ status needs to be checked after completion
> timeout.
>
> Signed-off-by: Dmitry Osipenko <[email protected]>

Applied to for-current, thanks!


Attachments:
(No filename) (552.00 B)
signature.asc (849.00 B)
Download all attachments

2020-04-20 19:55:31

by Jon Hunter

[permalink] [raw]
Subject: Re: [PATCH v2 1/2] i2c: tegra: Better handle case where CPU0 is busy for a long time

Hi Dmitry,

On 24/03/2020 19:12, Dmitry Osipenko wrote:
> Boot CPU0 always handle I2C interrupt and under some rare circumstances
> (like running KASAN + NFS root) it may stuck in uninterruptible state for
> a significant time. In this case we will get timeout if I2C transfer is
> running on a sibling CPU, despite of IRQ being raised. In order to handle
> this rare condition, the IRQ status needs to be checked after completion
> timeout.
>
> Signed-off-by: Dmitry Osipenko <[email protected]>
> ---
> drivers/i2c/busses/i2c-tegra.c | 27 +++++++++++++++------------
> 1 file changed, 15 insertions(+), 12 deletions(-)


I have noticed a regression on tegra30-cardhu-a04 when testing system
suspend. Git bisect is pointing to this commit and reverting it fixes
the problem. In the below console log I2C fails to resume ...

[ 40.888512] usb1_vbus: supplied by 5v0

[ 40.892408] vddio_sdmmc,avdd_vdac: supplied by 5v0

[ 40.897401] cam_1v8: disabling

[ 40.900548] modem_3v3: disabling

[ 40.903875] vdd_cam1_ldo: disabling

[ 40.907501] vdd_cam2_ldo: disabling

[ 40.911092] vdd_cam3_ldo: disabling

[ 40.914714] vdd_fuse_3v3: disabling

[ 40.918305] vddio_vid: disabling

[ 40.921623] usb1_vbus: disabling

[ 59.445032] PM: suspend entry (deep)

[ 59.448852] Filesystems sync: 0.000 seconds

[ 59.456161] Freezing user space processes ... (elapsed 0.001 seconds) done.

[ 59.457645] OOM killer disabled.

[ 59.457649] Freezing remaining freezable tasks ... (elapsed 0.001 seconds) done.

[ 59.764926] Disabling non-boot CPUs ...

[ 59.769540] IRQ 18: no longer affine to CPU1

[ 59.789070] IRQ 19: no longer affine to CPU2

[ 59.808049] IRQ 20: no longer affine to CPU3

[ 59.827113] Entering suspend state LP1

[ 59.827163] Enabling non-boot CPUs ...

[ 59.834797] CPU1 is up

[ 59.840943] CPU2 is up

[ 59.847378] CPU3 is up

[ 59.850577] tegra-i2c 7000d000.i2c: runtime resume failed -13

[ 59.856432] tegra-i2c 7000d000.i2c: runtime resume failed -13

[ 59.862231] tegra-i2c 7000d000.i2c: runtime resume failed -13

[ 59.868070] vdd_pexa,vdd_pexb: is_enabled() failed: -13

[ 59.873334] tegra-i2c 7000d000.i2c: runtime resume failed -13

[ 59.879143] vdd_pexa,vdd_pexb: is_enabled() failed: -13

[ 59.884420] Failed to enable avdd-pex-pll: -13

[ 59.888877] Failed to enable avdd-plle: -13

[ 59.893061] Failed to enable avdd-pexb: -13

[ 59.897279] Failed to enable vdd-pexb: -13

[ 59.901383] tegra-pcie 3000.pcie: failed to enable regulators: -13

[ 60.434185] clk_plle_training: timeout waiting for PLLE

[ 60.439565] tegra-pcie 3000.pcie: failed to enable CML clock: -16

[ 60.445700] ------------[ cut here ]------------

[ 60.450346] WARNING: CPU: 0 PID: 653 at /home/jonathanh/workdir/tegra/mlt-linux_next/kernel/drivers/regulator/core.c:2603 _regulator_disable+0xb8/0x1b4

[ 60.463959] unbalanced disables for vdd_pexa,vdd_pexb

[ 60.469038] Modules linked in:

[ 60.472107] CPU: 0 PID: 653 Comm: rtcwake Tainted: G W 5.7.0-rc2-next-20200420 #2

[ 60.480892] Hardware name: NVIDIA Tegra SoC (Flattened Device Tree)

[ 60.487190] [<c0111b68>] (unwind_backtrace) from [<c010bc00>] (show_stack+0x10/0x14)

[ 60.494951] [<c010bc00>] (show_stack) from [<c0480f14>] (dump_stack+0xc0/0xd4)

[ 60.502189] [<c0480f14>] (dump_stack) from [<c01234a4>] (__warn+0xe0/0xf8)

[ 60.509073] [<c01234a4>] (__warn) from [<c0123530>] (warn_slowpath_fmt+0x74/0xb8)

[ 60.516568] [<c0123530>] (warn_slowpath_fmt) from [<c0516714>] (_regulator_disable+0xb8/0x1b4)

[ 60.525191] [<c0516714>] (_regulator_disable) from [<c0516844>] (regulator_disable+0x34/0xd0)

[ 60.533729] [<c0516844>] (regulator_disable) from [<c0518488>] (regulator_bulk_disable+0x28/0xb4)

[ 60.542619] [<c0518488>] (regulator_bulk_disable) from [<c04dbc84>] (tegra_pcie_pm_resume+0xbb0/0x107c)

[ 60.552032] [<c04dbc84>] (tegra_pcie_pm_resume) from [<c05f7e44>] (dpm_run_callback+0x38/0x1d4)

[ 60.560741] [<c05f7e44>] (dpm_run_callback) from [<c05f8af8>] (device_resume_noirq+0x110/0x248)

[ 60.569451] [<c05f8af8>] (device_resume_noirq) from [<c05f93e0>] (dpm_resume_noirq+0x10c/0x36c)

[ 60.578162] [<c05f93e0>] (dpm_resume_noirq) from [<c017dd74>] (suspend_devices_and_enter+0x27c/0x9dc)

[ 60.587393] [<c017dd74>] (suspend_devices_and_enter) from [<c017e7dc>] (pm_suspend+0x308/0x370)

[ 60.596110] [<c017e7dc>] (pm_suspend) from [<c017cb30>] (state_store+0x6c/0xc8)

[ 60.603440] [<c017cb30>] (state_store) from [<c03138e4>] (kernfs_fop_write+0xf8/0x210)

[ 60.611379] [<c03138e4>] (kernfs_fop_write) from [<c0286c44>] (__vfs_write+0x2c/0x1c4)

[ 60.619310] [<c0286c44>] (__vfs_write) from [<c02886e8>] (vfs_write+0xa4/0x188)

[ 60.626632] [<c02886e8>] (vfs_write) from [<c028898c>] (ksys_write+0xa4/0xd4)

[ 60.633778] [<c028898c>] (ksys_write) from [<c01000c0>] (ret_fast_syscall+0x0/0x54)

[ 60.641437] Exception stack(0xeda91fa8 to 0xeda91ff0)

[ 60.646497] 1fa0: 0000006c 00498438 00000004 00498438 00000004 00000000

[ 60.654683] 1fc0: 0000006c 00498438 00497228 00000004 00000004 00000004 0048478c 00497228

[ 60.662866] 1fe0: 00000004 be9029b8 b6ec8c0b b6e53206

[ 60.668007] ---[ end trace 5453317048e46ae9 ]---

[ 60.672632] Failed to disable vdd-pexb: -5

[ 60.676761] tegra-pcie 3000.pcie: tegra pcie power on fail: -16

[ 60.682694] PM: dpm_run_callback(): tegra_pcie_pm_resume+0x0/0x107c returns -16

[ 60.690035] PM: Device 3000.pcie failed to resume noirq: error -16

[ 60.696859] tegra-mc 7000f000.memory-controller: fdcdwr2: write @0x877e8400: EMEM address decode error (SMMU translation error [--S])

[ 60.708876] tegra-mc 7000f000.memory-controller: fdcdwr2: write @0x877e8400: Page fault (SMMU translation error [--S])

[ 61.278965] OOM killer enabled.

[ 61.288563] Restarting tasks ... done.

[ 61.300508] PM: suspend exit

[ 63.124813] asix 1-1:1.0 eth0: link up, 100Mbps, full-duplex, lpa 0xCDE1

[ 63.740705] PM: suspend entry (deep)

[ 63.744593] Filesystems sync: 0.000 seconds

[ 63.749600] Freezing user space processes ... (elapsed 0.001 seconds) done.

[ 63.751053] OOM killer disabled.

[ 63.751057] Freezing remaining freezable tasks ... (elapsed 0.001 seconds) done.


Have you seen this?

Cheers
Jon

--
nvpublic

2020-04-20 22:17:26

by Dmitry Osipenko

[permalink] [raw]
Subject: Re: [PATCH v2 1/2] i2c: tegra: Better handle case where CPU0 is busy for a long time

Hello Jon,

20.04.2020 22:53, Jon Hunter пишет:
> Hi Dmitry,
>
> On 24/03/2020 19:12, Dmitry Osipenko wrote:
>> Boot CPU0 always handle I2C interrupt and under some rare circumstances
>> (like running KASAN + NFS root) it may stuck in uninterruptible state for
>> a significant time. In this case we will get timeout if I2C transfer is
>> running on a sibling CPU, despite of IRQ being raised. In order to handle
>> this rare condition, the IRQ status needs to be checked after completion
>> timeout.
>>
>> Signed-off-by: Dmitry Osipenko <[email protected]>
>> ---
>> drivers/i2c/busses/i2c-tegra.c | 27 +++++++++++++++------------
>> 1 file changed, 15 insertions(+), 12 deletions(-)
>
>
> I have noticed a regression on tegra30-cardhu-a04 when testing system
> suspend. Git bisect is pointing to this commit and reverting it fixes
> the problem. In the below console log I2C fails to resume ...
>
...
> [ 60.690035] PM: Device 3000.pcie failed to resume noirq: error -16

...
> [ 60.696859] tegra-mc 7000f000.memory-controller: fdcdwr2: write @0x877e8400: EMEM address decode error (SMMU translation error [--S])
>
> [ 60.708876] tegra-mc 7000f000.memory-controller: fdcdwr2: write @0x877e8400: Page fault (SMMU translation error [--S])

This looks very wrong, the error tells that 3d hardware is active and
doing something odd. Are you running some 3d tests?

...
> Have you seen this?

No, I haven't seen that. I'm not using PCIE and it looks like it's the
problem.

Looking at the PCIE driver code, seems it's not syncing the RPM state on
suspend/resume.

Please try this change:

--- >8 ---
diff --git a/drivers/pci/controller/pci-tegra.c
b/drivers/pci/controller/pci-tegra.c
index 3e64ba6a36a8..b1fcbae4109c 100644
--- a/drivers/pci/controller/pci-tegra.c
+++ b/drivers/pci/controller/pci-tegra.c
@@ -2870,8 +2870,8 @@ static int __maybe_unused
tegra_pcie_pm_resume(struct device *dev)

static const struct dev_pm_ops tegra_pcie_pm_ops = {
SET_RUNTIME_PM_OPS(tegra_pcie_pm_suspend, tegra_pcie_pm_resume, NULL)
- SET_NOIRQ_SYSTEM_SLEEP_PM_OPS(tegra_pcie_pm_suspend,
- tegra_pcie_pm_resume)
+ SET_NOIRQ_SYSTEM_SLEEP_PM_OPS(pm_runtime_force_suspend,
+ pm_runtime_force_resume)
};


static struct platform_driver tegra_pcie_driver = {
--- >8 ---

Secondly, I2C driver suspends on NOIRQ level, while APBDMA driver
suspends on default level. This is also wrong, please try to apply this
hunk as well:

--- >8 ---
diff --git a/drivers/dma/tegra20-apb-dma.c b/drivers/dma/tegra20-apb-dma.c
index f6a2f42ffc51..e682ac86bd27 100644
--- a/drivers/dma/tegra20-apb-dma.c
+++ b/drivers/dma/tegra20-apb-dma.c
@@ -1653,7 +1653,7 @@ static int __maybe_unused
tegra_dma_dev_resume(struct device *dev)
static const struct dev_pm_ops tegra_dma_dev_pm_ops = {
SET_RUNTIME_PM_OPS(tegra_dma_runtime_suspend, tegra_dma_runtime_resume,
NULL)
- SET_SYSTEM_SLEEP_PM_OPS(tegra_dma_dev_suspend, tegra_dma_dev_resume)
+ SET_NOIRQ_SYSTEM_SLEEP_PM_OPS(tegra_dma_dev_suspend, tegra_dma_dev_resume)
};

static const struct of_device_id tegra_dma_of_match[] = {
--- >8 ---

2020-04-21 00:34:30

by Dmitry Osipenko

[permalink] [raw]
Subject: Re: [PATCH v2 1/2] i2c: tegra: Better handle case where CPU0 is busy for a long time

21.04.2020 01:11, Dmitry Osipenko пишет:
> Hello Jon,
>
> 20.04.2020 22:53, Jon Hunter пишет:
>> Hi Dmitry,
>>
>> On 24/03/2020 19:12, Dmitry Osipenko wrote:
>>> Boot CPU0 always handle I2C interrupt and under some rare circumstances
>>> (like running KASAN + NFS root) it may stuck in uninterruptible state for
>>> a significant time. In this case we will get timeout if I2C transfer is
>>> running on a sibling CPU, despite of IRQ being raised. In order to handle
>>> this rare condition, the IRQ status needs to be checked after completion
>>> timeout.
>>>
>>> Signed-off-by: Dmitry Osipenko <[email protected]>
>>> ---
>>> drivers/i2c/busses/i2c-tegra.c | 27 +++++++++++++++------------
>>> 1 file changed, 15 insertions(+), 12 deletions(-)
>>
>>
>> I have noticed a regression on tegra30-cardhu-a04 when testing system
>> suspend. Git bisect is pointing to this commit and reverting it fixes
>> the problem. In the below console log I2C fails to resume ...
>>
> ...
>> [ 60.690035] PM: Device 3000.pcie failed to resume noirq: error -16
>
> ...
>> [ 60.696859] tegra-mc 7000f000.memory-controller: fdcdwr2: write @0x877e8400: EMEM address decode error (SMMU translation error [--S])
>>
>> [ 60.708876] tegra-mc 7000f000.memory-controller: fdcdwr2: write @0x877e8400: Page fault (SMMU translation error [--S])
>
> This looks very wrong, the error tells that 3d hardware is active and
> doing something odd. Are you running some 3d tests?
>
> ...
>> Have you seen this?
>
> No, I haven't seen that. I'm not using PCIE and it looks like it's the
> problem.
>
> Looking at the PCIE driver code, seems it's not syncing the RPM state on
> suspend/resume.
>
> Please try this change:
>
> --- >8 ---
> diff --git a/drivers/pci/controller/pci-tegra.c
> b/drivers/pci/controller/pci-tegra.c
> index 3e64ba6a36a8..b1fcbae4109c 100644
> --- a/drivers/pci/controller/pci-tegra.c
> +++ b/drivers/pci/controller/pci-tegra.c
> @@ -2870,8 +2870,8 @@ static int __maybe_unused
> tegra_pcie_pm_resume(struct device *dev)
>
> static const struct dev_pm_ops tegra_pcie_pm_ops = {
> SET_RUNTIME_PM_OPS(tegra_pcie_pm_suspend, tegra_pcie_pm_resume, NULL)
> - SET_NOIRQ_SYSTEM_SLEEP_PM_OPS(tegra_pcie_pm_suspend,
> - tegra_pcie_pm_resume)
> + SET_NOIRQ_SYSTEM_SLEEP_PM_OPS(pm_runtime_force_suspend,
> + pm_runtime_force_resume)
> };
>
>
> static struct platform_driver tegra_pcie_driver = {
> --- >8 ---
>
> Secondly, I2C driver suspends on NOIRQ level, while APBDMA driver
> suspends on default level. This is also wrong, please try to apply this
> hunk as well:
>
> --- >8 ---
> diff --git a/drivers/dma/tegra20-apb-dma.c b/drivers/dma/tegra20-apb-dma.c
> index f6a2f42ffc51..e682ac86bd27 100644
> --- a/drivers/dma/tegra20-apb-dma.c
> +++ b/drivers/dma/tegra20-apb-dma.c
> @@ -1653,7 +1653,7 @@ static int __maybe_unused
> tegra_dma_dev_resume(struct device *dev)
> static const struct dev_pm_ops tegra_dma_dev_pm_ops = {
> SET_RUNTIME_PM_OPS(tegra_dma_runtime_suspend, tegra_dma_runtime_resume,
> NULL)
> - SET_SYSTEM_SLEEP_PM_OPS(tegra_dma_dev_suspend, tegra_dma_dev_resume)
> + SET_NOIRQ_SYSTEM_SLEEP_PM_OPS(tegra_dma_dev_suspend, tegra_dma_dev_resume)
> };
>
> static const struct of_device_id tegra_dma_of_match[] = {
> --- >8 ---
>

Although, I'm now having a second though about the APBDMA change... I'm
recalling that there are some complications in regards to PCIE driver
suspending, requiring it to be at NOIRQ level, but this should be wrong
because PCIE driver uses voltage regulator driver at NOIRQ level, while
regulator drivers suspend on default level. The current behavior of the
PCIE driver should be wrong, I think it needs to be moved to the default
suspend-resume level somehow.

2020-04-21 09:51:22

by Jon Hunter

[permalink] [raw]
Subject: Re: [PATCH v2 1/2] i2c: tegra: Better handle case where CPU0 is busy for a long time

Hi Dmitry,
DKIM-Signature: v aa-sha256; claxed/relaxed; didia.com; s;
t87462465; bhmqRv+sNd/vlpkXkRsCrkXSkOm+NZfBA8tHeeFzi60 hPGP-Universal:Subject:To:CC:References:From:Message-ID:Date:
User-Agent:MIME-Version:In-Reply-To:X-Originating-IP:
X-ClientProxiedBy:Content-Type:Content-Language:
Content-Transfer-Encoding;
b?MZuIoMEvnocJwaAlK59ppeq2HXtg8If2WkeG7oekW2SEfUw3GtGcyqdn9hJH1oG
iWeL+2EeEngLPi+venjHFoJSPqmU2/GtzShtyfKipiSVWCY1Q5wSiH3IcS61O4jNUw
Y/qqm0avHiNNN3bfuGOS735o3X3ZLDF8lA13aKEhc6KrpXFSXrTlzMobSZsiF5/qgf
8QLaBKcYl0nQyy3ZuOXo4x4ZWmbJOnlrzq8jWUn7EuU/lQ77GDAwGa66DRoR/6T9Qx
Scu9IgO6uxJMqmVC32Jeb12Ig5aTwq1RpzBHJTaHZRh+4yZi8opT9FiPj9FgTzmEuh
J78e3AStRA77A
On 21/04/2020 01:32, Dmitry Osipenko wrote:
> 21.04.2020 01:11, Dmitry Osipenko пишет:
>> Hello Jon,
>>
>> 20.04.2020 22:53, Jon Hunter пишет:
>>> Hi Dmitry,
>>>
>>> On 24/03/2020 19:12, Dmitry Osipenko wrote:
>>>> Boot CPU0 always handle I2C interrupt and under some rare circumstances
>>>> (like running KASAN + NFS root) it may stuck in uninterruptible state for
>>>> a significant time. In this case we will get timeout if I2C transfer is
>>>> running on a sibling CPU, despite of IRQ being raised. In order to handle
>>>> this rare condition, the IRQ status needs to be checked after completion
>>>> timeout.
>>>>
>>>> Signed-off-by: Dmitry Osipenko <[email protected]>
>>>> ---
>>>> drivers/i2c/busses/i2c-tegra.c | 27 +++++++++++++++------------
>>>> 1 file changed, 15 insertions(+), 12 deletions(-)
>>>
>>>
>>> I have noticed a regression on tegra30-cardhu-a04 when testing system
>>> suspend. Git bisect is pointing to this commit and reverting it fixes
>>> the problem. In the below console log I2C fails to resume ...
>>>
>> ...
>>> [ 60.690035] PM: Device 3000.pcie failed to resume noirq: error -16
>>
>> ...
>>> [ 60.696859] tegra-mc 7000f000.memory-controller: fdcdwr2: write @0x877e8400: EMEM address decode error (SMMU translation error [--S])
>>>
>>> [ 60.708876] tegra-mc 7000f000.memory-controller: fdcdwr2: write @0x877e8400: Page fault (SMMU translation error [--S])
>>
>> This looks very wrong, the error tells that 3d hardware is active and
>> doing something odd. Are you running some 3d tests?

I am not running any GFX tests. However, I am not sure if the above is
unrelated.

>>> Have you seen this?
>>
>> No, I haven't seen that. I'm not using PCIE and it looks like it's the
>> problem.
>>
>> Looking at the PCIE driver code, seems it's not syncing the RPM state on
>> suspend/resume.
>>
>> Please try this change:
>>
>> --- >8 ---
>> diff --git a/drivers/pci/controller/pci-tegra.c
>> b/drivers/pci/controller/pci-tegra.c
>> index 3e64ba6a36a8..b1fcbae4109c 100644
>> --- a/drivers/pci/controller/pci-tegra.c
>> +++ b/drivers/pci/controller/pci-tegra.c
>> @@ -2870,8 +2870,8 @@ static int __maybe_unused
>> tegra_pcie_pm_resume(struct device *dev)
>>
>> static const struct dev_pm_ops tegra_pcie_pm_ops = {
>> SET_RUNTIME_PM_OPS(tegra_pcie_pm_suspend, tegra_pcie_pm_resume, NULL)
>> - SET_NOIRQ_SYSTEM_SLEEP_PM_OPS(tegra_pcie_pm_suspend,
>> - tegra_pcie_pm_resume)
>> + SET_NOIRQ_SYSTEM_SLEEP_PM_OPS(pm_runtime_force_suspend,
>> + pm_runtime_force_resume)
>> };
>>
>>
>> static struct platform_driver tegra_pcie_driver = {
>> --- >8 ---
>>
>> Secondly, I2C driver suspends on NOIRQ level, while APBDMA driver
>> suspends on default level. This is also wrong, please try to apply this
>> hunk as well:
>>
>> --- >8 ---
>> diff --git a/drivers/dma/tegra20-apb-dma.c b/drivers/dma/tegra20-apb-dma.c
>> index f6a2f42ffc51..e682ac86bd27 100644
>> --- a/drivers/dma/tegra20-apb-dma.c
>> +++ b/drivers/dma/tegra20-apb-dma.c
>> @@ -1653,7 +1653,7 @@ static int __maybe_unused
>> tegra_dma_dev_resume(struct device *dev)
>> static const struct dev_pm_ops tegra_dma_dev_pm_ops = {
>> SET_RUNTIME_PM_OPS(tegra_dma_runtime_suspend, tegra_dma_runtime_resume,
>> NULL)
>> - SET_SYSTEM_SLEEP_PM_OPS(tegra_dma_dev_suspend, tegra_dma_dev_resume)
>> + SET_NOIRQ_SYSTEM_SLEEP_PM_OPS(tegra_dma_dev_suspend, tegra_dma_dev_resume)
>> };
>>
>> static const struct of_device_id tegra_dma_of_match[] = {
>> --- >8 ---
>>
>
> Although, I'm now having a second though about the APBDMA change... I'm
> recalling that there are some complications in regards to PCIE driver
> suspending, requiring it to be at NOIRQ level, but this should be wrong
> because PCIE driver uses voltage regulator driver at NOIRQ level, while
> regulator drivers suspend on default level. The current behavior of the
> PCIE driver should be wrong, I think it needs to be moved to the default
> suspend-resume level somehow.

I can try the above, but I agree it would be best to avoid messing with
the suspend levels if possible.

I am adding Manikanta to get some feedback on why we moved the PCI
suspend to the NOIRQ phase because it is not clear to me if we need to
do this here.

Manikanta, can you comment on whether we really need to suspend Tegra
PCI during the noirq phase?

Cheers
Jon

--
nvpublic

2020-04-21 12:41:39

by Manikanta Maddireddy

[permalink] [raw]
Subject: Re: [PATCH v2 1/2] i2c: tegra: Better handle case where CPU0 is busy for a long time



On 21-Apr-20 3:19 PM, Jon Hunter wrote:
> Hi Dmitry,
>
> On 21/04/2020 01:32, Dmitry Osipenko wrote:
>> 21.04.2020 01:11, Dmitry Osipenko пишет:
>>> Hello Jon,
>>>
>>> 20.04.2020 22:53, Jon Hunter пишет:
>>>> Hi Dmitry,
>>>>
>>>> On 24/03/2020 19:12, Dmitry Osipenko wrote:
>>>>> Boot CPU0 always handle I2C interrupt and under some rare circumstances
>>>>> (like running KASAN + NFS root) it may stuck in uninterruptible state for
>>>>> a significant time. In this case we will get timeout if I2C transfer is
>>>>> running on a sibling CPU, despite of IRQ being raised. In order to handle
>>>>> this rare condition, the IRQ status needs to be checked after completion
>>>>> timeout.
>>>>>
>>>>> Signed-off-by: Dmitry Osipenko <[email protected]>
>>>>> ---
>>>>> drivers/i2c/busses/i2c-tegra.c | 27 +++++++++++++++------------
>>>>> 1 file changed, 15 insertions(+), 12 deletions(-)
>>>>
>>>> I have noticed a regression on tegra30-cardhu-a04 when testing system
>>>> suspend. Git bisect is pointing to this commit and reverting it fixes
>>>> the problem. In the below console log I2C fails to resume ...
>>>>
>>> ...
>>>> [ 60.690035] PM: Device 3000.pcie failed to resume noirq: error -16
>>> ...
>>>> [ 60.696859] tegra-mc 7000f000.memory-controller: fdcdwr2: write @0x877e8400: EMEM address decode error (SMMU translation error [--S])
>>>>
>>>> [ 60.708876] tegra-mc 7000f000.memory-controller: fdcdwr2: write @0x877e8400: Page fault (SMMU translation error [--S])
>>> This looks very wrong, the error tells that 3d hardware is active and
>>> doing something odd. Are you running some 3d tests?
> I am not running any GFX tests. However, I am not sure if the above is
> unrelated.
>
>>>> Have you seen this?
>>> No, I haven't seen that. I'm not using PCIE and it looks like it's the
>>> problem.
>>>
>>> Looking at the PCIE driver code, seems it's not syncing the RPM state on
>>> suspend/resume.
>>>
>>> Please try this change:
>>>
>>> --- >8 ---
>>> diff --git a/drivers/pci/controller/pci-tegra.c
>>> b/drivers/pci/controller/pci-tegra.c
>>> index 3e64ba6a36a8..b1fcbae4109c 100644
>>> --- a/drivers/pci/controller/pci-tegra.c
>>> +++ b/drivers/pci/controller/pci-tegra.c
>>> @@ -2870,8 +2870,8 @@ static int __maybe_unused
>>> tegra_pcie_pm_resume(struct device *dev)
>>>
>>> static const struct dev_pm_ops tegra_pcie_pm_ops = {
>>> SET_RUNTIME_PM_OPS(tegra_pcie_pm_suspend, tegra_pcie_pm_resume, NULL)
>>> - SET_NOIRQ_SYSTEM_SLEEP_PM_OPS(tegra_pcie_pm_suspend,
>>> - tegra_pcie_pm_resume)
>>> + SET_NOIRQ_SYSTEM_SLEEP_PM_OPS(pm_runtime_force_suspend,
>>> + pm_runtime_force_resume)
>>> };
>>>
>>>
>>> static struct platform_driver tegra_pcie_driver = {
>>> --- >8 ---
>>>
>>> Secondly, I2C driver suspends on NOIRQ level, while APBDMA driver
>>> suspends on default level. This is also wrong, please try to apply this
>>> hunk as well:
>>>
>>> --- >8 ---
>>> diff --git a/drivers/dma/tegra20-apb-dma.c b/drivers/dma/tegra20-apb-dma.c
>>> index f6a2f42ffc51..e682ac86bd27 100644
>>> --- a/drivers/dma/tegra20-apb-dma.c
>>> +++ b/drivers/dma/tegra20-apb-dma.c
>>> @@ -1653,7 +1653,7 @@ static int __maybe_unused
>>> tegra_dma_dev_resume(struct device *dev)
>>> static const struct dev_pm_ops tegra_dma_dev_pm_ops = {
>>> SET_RUNTIME_PM_OPS(tegra_dma_runtime_suspend, tegra_dma_runtime_resume,
>>> NULL)
>>> - SET_SYSTEM_SLEEP_PM_OPS(tegra_dma_dev_suspend, tegra_dma_dev_resume)
>>> + SET_NOIRQ_SYSTEM_SLEEP_PM_OPS(tegra_dma_dev_suspend, tegra_dma_dev_resume)
>>> };
>>>
>>> static const struct of_device_id tegra_dma_of_match[] = {
>>> --- >8 ---
>>>
>> Although, I'm now having a second though about the APBDMA change... I'm
>> recalling that there are some complications in regards to PCIE driver
>> suspending, requiring it to be at NOIRQ level, but this should be wrong
>> because PCIE driver uses voltage regulator driver at NOIRQ level, while
>> regulator drivers suspend on default level. The current behavior of the
>> PCIE driver should be wrong, I think it needs to be moved to the default
>> suspend-resume level somehow.
> I can try the above, but I agree it would be best to avoid messing with
> the suspend levels if possible.
>
> I am adding Manikanta to get some feedback on why we moved the PCI
> suspend to the NOIRQ phase because it is not clear to me if we need to
> do this here.
>
> Manikanta, can you comment on whether we really need to suspend Tegra
> PCI during the noirq phase?

PCIe subsystem driver implemented noirq PM callbacks, it will save & restore
endpoint config space in these PM callbacks. PCIe controller should be
available during this time, so noirq PM callbacks are implemented in Tegra
PCIe driver.

file: drivers/pci/pci-driver.c
static const struct dev_pm_ops pci_dev_pm_ops = {
...
.suspend_noirq = pci_pm_suspend_noirq,
.resume_noirq = pci_pm_resume_noirq,
...
};

Thanks,
Manikanta

>
> Cheers
> Jon
>

2020-04-21 13:09:57

by Jon Hunter

[permalink] [raw]
Subject: Re: [PATCH v2 1/2] i2c: tegra: Better handle case where CPU0 is busy for a long time



On 21/04/2020 13:39, Manikanta Maddireddy wrote:

...

>> I am adding Manikanta to get some feedback on why we moved the PCI
>> suspend to the NOIRQ phase because it is not clear to me if we need to
>> do this here.
>>
>> Manikanta, can you comment on whether we really need to suspend Tegra
>> PCI during the noirq phase?
>
> PCIe subsystem driver implemented noirq PM callbacks, it will save & restore
> endpoint config space in these PM callbacks. PCIe controller should be
> available during this time, so noirq PM callbacks are implemented in Tegra
> PCIe driver.
>
> file: drivers/pci/pci-driver.c
> static const struct dev_pm_ops pci_dev_pm_ops = {
> ...
> .suspend_noirq = pci_pm_suspend_noirq,
> .resume_noirq = pci_pm_resume_noirq,
> ...
> };

Thanks, however, it is still not clear why this needs to be done during
this phase. When you say PCIe subsystem driver, specifically which
driver are you referring too? Are you referring to the
pci_pm_suspend_noirq() in the drivers/pci/pci-driver.c driver? If so,
just out of curiosity why does this need to be handled in the noirq phase?

Thanks
Jon

--
nvpublic

2020-04-21 13:27:52

by Dmitry Osipenko

[permalink] [raw]
Subject: Re: [PATCH v2 1/2] i2c: tegra: Better handle case where CPU0 is busy for a long time

21.04.2020 12:49, Jon Hunter пишет:
...
> I can try the above, but I agree it would be best to avoid messing with
> the suspend levels if possible.

Will be awesome if you could try it and report back the result.

2020-04-21 13:52:02

by Manikanta Maddireddy

[permalink] [raw]
Subject: Re: [PATCH v2 1/2] i2c: tegra: Better handle case where CPU0 is busy for a long time



On 21-Apr-20 6:38 PM, Jon Hunter wrote:
>
> On 21/04/2020 13:39, Manikanta Maddireddy wrote:
>
> ...
>
>>> I am adding Manikanta to get some feedback on why we moved the PCI
>>> suspend to the NOIRQ phase because it is not clear to me if we need to
>>> do this here.
>>>
>>> Manikanta, can you comment on whether we really need to suspend Tegra
>>> PCI during the noirq phase?
>> PCIe subsystem driver implemented noirq PM callbacks, it will save & restore
>> endpoint config space in these PM callbacks. PCIe controller should be
>> available during this time, so noirq PM callbacks are implemented in Tegra
>> PCIe driver.
>>
>> file: drivers/pci/pci-driver.c
>> static const struct dev_pm_ops pci_dev_pm_ops = {
>> ...
>> .suspend_noirq = pci_pm_suspend_noirq,
>> .resume_noirq = pci_pm_resume_noirq,
>> ...
>> };
> Thanks, however, it is still not clear why this needs to be done during
> this phase. When you say PCIe subsystem driver, specifically which
> driver are you referring too? Are you referring to the
> pci_pm_suspend_noirq() in the drivers/pci/pci-driver.c driver? If so,
> just out of curiosity why does this need to be handled in the noirq phase?

If PCIe device is not in valid state it might cause interrupt issues and
can lead to system lockup. Please refer to below paper for complete details.
https://www.kernel.org/doc/ols/2009/ols2009-pages-319-330.pdf

>
> Thanks
> Jon
>

2020-04-21 14:42:31

by Jon Hunter

[permalink] [raw]
Subject: Re: [PATCH v2 1/2] i2c: tegra: Better handle case where CPU0 is busy for a long time


On 21/04/2020 14:25, Dmitry Osipenko wrote:
> 21.04.2020 12:49, Jon Hunter пишет:
> ...
>> I can try the above, but I agree it would be best to avoid messing with
>> the suspend levels if possible.
>
> Will be awesome if you could try it and report back the result.
>

I gave it a try but suspend still fails.

Jon

--
nvpublic

2020-04-21 15:11:22

by Dmitry Osipenko

[permalink] [raw]
Subject: Re: [PATCH v2 1/2] i2c: tegra: Better handle case where CPU0 is busy for a long time

21.04.2020 17:40, Jon Hunter пишет:
>
> On 21/04/2020 14:25, Dmitry Osipenko wrote:
>> 21.04.2020 12:49, Jon Hunter пишет:
>> ...
>>> I can try the above, but I agree it would be best to avoid messing with
>>> the suspend levels if possible.
>>
>> Will be awesome if you could try it and report back the result.
>>
>
> I gave it a try but suspend still fails.

Perhaps the RPM's -EACCES is returned from here:

https://elixir.free-electrons.com/linux/v5.7-rc2/source/drivers/base/power/runtime.c#L723

Which suggests that I2C is accessed after being suspended. I guess the
PCIe driver suspends after the I2C and somehow my change affected the
suspension order, although not sure how.

Jon, could you please try to enable PM logging and post the log? Please
also post log of the working kernel version, so that we could compare
the PM sequence.

Something like this should enable the logging: "echo 1 >
/sys/power/pm_trace" + there is RPM tracing.

2020-04-21 15:22:12

by Dmitry Osipenko

[permalink] [raw]
Subject: Re: [PATCH v2 1/2] i2c: tegra: Better handle case where CPU0 is busy for a long time

21.04.2020 17:40, Jon Hunter пишет:
>
> On 21/04/2020 14:25, Dmitry Osipenko wrote:
>> 21.04.2020 12:49, Jon Hunter пишет:
>> ...
>>> I can try the above, but I agree it would be best to avoid messing with
>>> the suspend levels if possible.
>>
>> Will be awesome if you could try it and report back the result.
>>
>
> I gave it a try but suspend still fails.

Is this regulator error gone with my changes?

[ 60.450346] WARNING: CPU: 0 PID: 653 at
/home/jonathanh/workdir/tegra/mlt-linux_next/kernel/drivers/regulator/core.c:2603
_regulator_disable+0xb8/0x1b4
[ 60.463959] unbalanced disables for vdd_pexa,vdd_pexb

2020-04-21 15:35:51

by Jon Hunter

[permalink] [raw]
Subject: Re: [PATCH v2 1/2] i2c: tegra: Better handle case where CPU0 is busy for a long time


On 21/04/2020 16:18, Dmitry Osipenko wrote:
> 21.04.2020 17:40, Jon Hunter пишет:
>>
>> On 21/04/2020 14:25, Dmitry Osipenko wrote:
>>> 21.04.2020 12:49, Jon Hunter пишет:
>>> ...
>>>> I can try the above, but I agree it would be best to avoid messing with
>>>> the suspend levels if possible.
>>>
>>> Will be awesome if you could try it and report back the result.
>>>
>>
>> I gave it a try but suspend still fails.
>
> Is this regulator error gone with my changes?
>
> [ 60.450346] WARNING: CPU: 0 PID: 653 at
> /home/jonathanh/workdir/tegra/mlt-linux_next/kernel/drivers/regulator/core.c:2603
> _regulator_disable+0xb8/0x1b4
> [ 60.463959] unbalanced disables for vdd_pexa,vdd_pexb

The above is still there with your changes.

Jon

--
nvpublic

2020-04-21 19:09:35

by Dmitry Osipenko

[permalink] [raw]
Subject: Re: [PATCH v2 1/2] i2c: tegra: Better handle case where CPU0 is busy for a long time

21.04.2020 18:34, Jon Hunter пишет:
>
> On 21/04/2020 16:18, Dmitry Osipenko wrote:
>> 21.04.2020 17:40, Jon Hunter пишет:
>>>
>>> On 21/04/2020 14:25, Dmitry Osipenko wrote:
>>>> 21.04.2020 12:49, Jon Hunter пишет:
>>>> ...
>>>>> I can try the above, but I agree it would be best to avoid messing with
>>>>> the suspend levels if possible.
>>>>
>>>> Will be awesome if you could try it and report back the result.
>>>>
>>>
>>> I gave it a try but suspend still fails.
>>
>> Is this regulator error gone with my changes?
>>
>> [ 60.450346] WARNING: CPU: 0 PID: 653 at
>> /home/jonathanh/workdir/tegra/mlt-linux_next/kernel/drivers/regulator/core.c:2603
>> _regulator_disable+0xb8/0x1b4
>> [ 60.463959] unbalanced disables for vdd_pexa,vdd_pexb
>
> The above is still there with your changes.

Interesting, hopefully the PM logs will point out the source of the problem.

2020-04-21 19:44:37

by Jon Hunter

[permalink] [raw]
Subject: Re: [PATCH v2 1/2] i2c: tegra: Better handle case where CPU0 is busy for a long time


On 21/04/2020 16:08, Dmitry Osipenko wrote:
> 21.04.2020 17:40, Jon Hunter пишет:
>>
>> On 21/04/2020 14:25, Dmitry Osipenko wrote:
>>> 21.04.2020 12:49, Jon Hunter пишет:
>>> ...
>>>> I can try the above, but I agree it would be best to avoid messing with
>>>> the suspend levels if possible.
>>>
>>> Will be awesome if you could try it and report back the result.
>>>
>>
>> I gave it a try but suspend still fails.
>
> Perhaps the RPM's -EACCES is returned from here:
>
> https://elixir.free-electrons.com/linux/v5.7-rc2/source/drivers/base/power/runtime.c#L723
>
> Which suggests that I2C is accessed after being suspended. I guess the
> PCIe driver suspends after the I2C and somehow my change affected the
> suspension order, although not sure how.
>
> Jon, could you please try to enable PM logging and post the log? Please
> also post log of the working kernel version, so that we could compare
> the PM sequence.
>
> Something like this should enable the logging: "echo 1 >
> /sys/power/pm_trace" + there is RPM tracing.

Unfortunately, after enabling that I don't any output and so no help there.

Jon

--
nvpublic

2020-04-22 13:42:05

by Dmitry Osipenko

[permalink] [raw]
Subject: Re: [PATCH v2 1/2] i2c: tegra: Better handle case where CPU0 is busy for a long time

21.04.2020 22:42, Jon Hunter пишет:
>
> On 21/04/2020 16:08, Dmitry Osipenko wrote:
>> 21.04.2020 17:40, Jon Hunter пишет:
>>>
>>> On 21/04/2020 14:25, Dmitry Osipenko wrote:
>>>> 21.04.2020 12:49, Jon Hunter пишет:
>>>> ...
>>>>> I can try the above, but I agree it would be best to avoid messing with
>>>>> the suspend levels if possible.
>>>>
>>>> Will be awesome if you could try it and report back the result.
>>>>
>>>
>>> I gave it a try but suspend still fails.
>>
>> Perhaps the RPM's -EACCES is returned from here:
>>
>> https://elixir.free-electrons.com/linux/v5.7-rc2/source/drivers/base/power/runtime.c#L723
>>
>> Which suggests that I2C is accessed after being suspended. I guess the
>> PCIe driver suspends after the I2C and somehow my change affected the
>> suspension order, although not sure how.
>>
>> Jon, could you please try to enable PM logging and post the log? Please
>> also post log of the working kernel version, so that we could compare
>> the PM sequence.
>>
>> Something like this should enable the logging: "echo 1 >
>> /sys/power/pm_trace" + there is RPM tracing.
>
> Unfortunately, after enabling that I don't any output and so no help there.

1. I now tried to check the pm_trace myself and found that it's
available only on X86, my bad.

2. Jon, could you please clarify what exactly you were trying to test?

3. Is it only the Cardhu board that is affected by this problem?

4. Could you please try to add this to the kernel's cmdline and post the
logs:

tp_printk
trace_event=rpm_suspend,rpm_resume,rpm_usage,rpm_idle,rpm_return_int


2020-04-22 14:10:18

by Dmitry Osipenko

[permalink] [raw]
Subject: Re: [PATCH v2 1/2] i2c: tegra: Better handle case where CPU0 is busy for a long time

22.04.2020 16:59, Jon Hunter пишет:
>
> On 22/04/2020 14:40, Dmitry Osipenko wrote:
>> 21.04.2020 22:42, Jon Hunter пишет:
>>>
>>> On 21/04/2020 16:08, Dmitry Osipenko wrote:
>>>> 21.04.2020 17:40, Jon Hunter пишет:
>>>>>
>>>>> On 21/04/2020 14:25, Dmitry Osipenko wrote:
>>>>>> 21.04.2020 12:49, Jon Hunter пишет:
>>>>>> ...
>>>>>>> I can try the above, but I agree it would be best to avoid messing with
>>>>>>> the suspend levels if possible.
>>>>>>
>>>>>> Will be awesome if you could try it and report back the result.
>>>>>>
>>>>>
>>>>> I gave it a try but suspend still fails.
>>>>
>>>> Perhaps the RPM's -EACCES is returned from here:
>>>>
>>>> https://elixir.free-electrons.com/linux/v5.7-rc2/source/drivers/base/power/runtime.c#L723
>>>>
>>>> Which suggests that I2C is accessed after being suspended. I guess the
>>>> PCIe driver suspends after the I2C and somehow my change affected the
>>>> suspension order, although not sure how.
>>>>
>>>> Jon, could you please try to enable PM logging and post the log? Please
>>>> also post log of the working kernel version, so that we could compare
>>>> the PM sequence.
>>>>
>>>> Something like this should enable the logging: "echo 1 >
>>>> /sys/power/pm_trace" + there is RPM tracing.
>>>
>>> Unfortunately, after enabling that I don't any output and so no help there.
>>
>> 1. I now tried to check the pm_trace myself and found that it's
>> available only on X86, my bad.
>>
>> 2. Jon, could you please clarify what exactly you were trying to test?
>>
>> 3. Is it only the Cardhu board that is affected by this problem?
>>
>> 4. Could you please try to add this to the kernel's cmdline and post the
>> logs:
>>
>> tp_printk
>> trace_event=rpm_suspend,rpm_resume,rpm_usage,rpm_idle,rpm_return_int
>
>
> So I think that part of the problem already existed prior to these
> patches. Without your patches I see ...
>
> [ 59.543528] tegra-i2c 7000d000.i2c: i2c transfer timed out

Does this I2C timeout happen with my patches? Could you please post full
logs of an older and the recent kernel versions?

> [ 59.549036] vdd_sata,avdd_plle: failed to disable
>
> [ 59.553778] Failed to disable avdd-plle: -110
>
> [ 59.558150] tegra-pcie 3000.pcie: failed to disable regulators: -110
>
>
> However, now with your patches the i2c is failing to resume and this
> breaks system suspend completely for Tegra30. So far this is the only
> board that appears to break.
>
> So the above issue needs to be fixed and I will chat to Thierry about this.

Okay

2020-04-22 14:40:35

by Jon Hunter

[permalink] [raw]
Subject: Re: [PATCH v2 1/2] i2c: tegra: Better handle case where CPU0 is busy for a long time


On 22/04/2020 14:40, Dmitry Osipenko wrote:
> 21.04.2020 22:42, Jon Hunter пишет:
>>
>> On 21/04/2020 16:08, Dmitry Osipenko wrote:
>>> 21.04.2020 17:40, Jon Hunter пишет:
>>>>
>>>> On 21/04/2020 14:25, Dmitry Osipenko wrote:
>>>>> 21.04.2020 12:49, Jon Hunter пишет:
>>>>> ...
>>>>>> I can try the above, but I agree it would be best to avoid messing with
>>>>>> the suspend levels if possible.
>>>>>
>>>>> Will be awesome if you could try it and report back the result.
>>>>>
>>>>
>>>> I gave it a try but suspend still fails.
>>>
>>> Perhaps the RPM's -EACCES is returned from here:
>>>
>>> https://elixir.free-electrons.com/linux/v5.7-rc2/source/drivers/base/power/runtime.c#L723
>>>
>>> Which suggests that I2C is accessed after being suspended. I guess the
>>> PCIe driver suspends after the I2C and somehow my change affected the
>>> suspension order, although not sure how.
>>>
>>> Jon, could you please try to enable PM logging and post the log? Please
>>> also post log of the working kernel version, so that we could compare
>>> the PM sequence.
>>>
>>> Something like this should enable the logging: "echo 1 >
>>> /sys/power/pm_trace" + there is RPM tracing.
>>
>> Unfortunately, after enabling that I don't any output and so no help there.
>
> 1. I now tried to check the pm_trace myself and found that it's
> available only on X86, my bad.
>
> 2. Jon, could you please clarify what exactly you were trying to test?
>
> 3. Is it only the Cardhu board that is affected by this problem?
>
> 4. Could you please try to add this to the kernel's cmdline and post the
> logs:
>
> tp_printk
> trace_event=rpm_suspend,rpm_resume,rpm_usage,rpm_idle,rpm_return_int


So I think that part of the problem already existed prior to these
patches. Without your patches I see ...

[ 59.543528] tegra-i2c 7000d000.i2c: i2c transfer timed out

[ 59.549036] vdd_sata,avdd_plle: failed to disable

[ 59.553778] Failed to disable avdd-plle: -110

[ 59.558150] tegra-pcie 3000.pcie: failed to disable regulators: -110


However, now with your patches the i2c is failing to resume and this
breaks system suspend completely for Tegra30. So far this is the only
board that appears to break.

So the above issue needs to be fixed and I will chat to Thierry about this.

Jon

--
nvpublic

2020-04-23 11:00:40

by Jon Hunter

[permalink] [raw]
Subject: Re: [PATCH v2 1/2] i2c: tegra: Better handle case where CPU0 is busy for a long time


On 22/04/2020 15:07, Dmitry Osipenko wrote:

...

>> So I think that part of the problem already existed prior to these
>> patches. Without your patches I see ...
>>
>> [ 59.543528] tegra-i2c 7000d000.i2c: i2c transfer timed out
>
> Does this I2C timeout happen with my patches? Could you please post full
> logs of an older and the recent kernel versions?

I believe that it does, but I need to check.

>> [ 59.549036] vdd_sata,avdd_plle: failed to disable
>>
>> [ 59.553778] Failed to disable avdd-plle: -110
>>
>> [ 59.558150] tegra-pcie 3000.pcie: failed to disable regulators: -110
>>
>>
>> However, now with your patches the i2c is failing to resume and this
>> breaks system suspend completely for Tegra30. So far this is the only
>> board that appears to break.
>>
>> So the above issue needs to be fixed and I will chat to Thierry about this.
>
> Okay

So I have been looking at this some more and this starting to all look a
bit of a mess :-(

Firstly, the Tegra PCI driver suspends late (noirq) as we know. The PCI
driver will warn if it cannot disable the regulators when suspending but
does not actually fail suspend. So this warning is just indicating that
we were unable to disable the regulators.

Now I don't see that we can ever disable the PCI regulators currently
when entering suspend because ...

1. We are in the noirq phase and so we will not get the completion
interrupt for the I2C transfer. I know that you recently added the
atomic I2C transfer support, but we can get the regulator framework
to use this (I have not looked in much detail so far).

2. Even if the regulator framework supported atomic I2C transfers, we
also have the problem that the I2C controller could be runtime-
suspended and pm_runtime_get_sync() will not work during during this
phase to resume it correctly. This is a problem that needs to be
fixed indeed!

3. Then we also have the possible dependency on the DMA driver that is
suspended during the noirq phase.

It could be argued that if the PCI regulators are never turned off
(currently) then we should not even bother and this will likely resolve
this for now. However, really we should try to fix that correctly.

What I still don't understand is why your patch breaks resume. Even if
the I2C transfer fails, and is deemed harmless by the client driver, we
should still be able to suspend and resume correctly.

Cheers
Jon

--
nvpublic

2020-04-23 16:35:42

by Dmitry Osipenko

[permalink] [raw]
Subject: Re: [PATCH v2 1/2] i2c: tegra: Better handle case where CPU0 is busy for a long time

23.04.2020 13:56, Jon Hunter пишет:
>
> On 22/04/2020 15:07, Dmitry Osipenko wrote:
>
> ...
>
>>> So I think that part of the problem already existed prior to these
>>> patches. Without your patches I see ...
>>>
>>> [ 59.543528] tegra-i2c 7000d000.i2c: i2c transfer timed out
>>
>> Does this I2C timeout happen with my patches? Could you please post full
>> logs of an older and the recent kernel versions?
>
> I believe that it does, but I need to check.
>
>>> [ 59.549036] vdd_sata,avdd_plle: failed to disable
>>>
>>> [ 59.553778] Failed to disable avdd-plle: -110
>>>
>>> [ 59.558150] tegra-pcie 3000.pcie: failed to disable regulators: -110
>>>
>>>
>>> However, now with your patches the i2c is failing to resume and this
>>> breaks system suspend completely for Tegra30. So far this is the only
>>> board that appears to break.
>>>
>>> So the above issue needs to be fixed and I will chat to Thierry about this.
>>
>> Okay
>
> So I have been looking at this some more and this starting to all look a
> bit of a mess :-(
>
> Firstly, the Tegra PCI driver suspends late (noirq) as we know. The PCI
> driver will warn if it cannot disable the regulators when suspending but
> does not actually fail suspend. So this warning is just indicating that
> we were unable to disable the regulators.
>
> Now I don't see that we can ever disable the PCI regulators currently
> when entering suspend because ...
>
> 1. We are in the noirq phase and so we will not get the completion
> interrupt for the I2C transfer. I know that you recently added the
> atomic I2C transfer support, but we can get the regulator framework
> to use this (I have not looked in much detail so far).

That's not good :) I didn't realize that *all* interrupts of every
device are disabled before .noirq is invoked. It appeared to me that the
IRQs disabling and .noirq invocation is performed for the drivers one
after another, but now I see that it's not true.

https://elixir.bootlin.com/linux/v5.7-rc2/source/drivers/base/power/main.c#L1446

> 2. Even if the regulator framework supported atomic I2C transfers, we
> also have the problem that the I2C controller could be runtime-
> suspended and pm_runtime_get_sync() will not work during during this
> phase to resume it correctly. This is a problem that needs to be
> fixed indeed!

Could you please clarify why pm_runtime_get_sync() can't be used by the
I2C driver's in NOIRQ phase?

> 3. Then we also have the possible dependency on the DMA driver that is
> suspended during the noirq phase.

Yes, this is correct.

Again, some regulator drivers may do something on suspend too, although
looks like the current upstream Tegra devices are not affected by this
potential problem.

> It could be argued that if the PCI regulators are never turned off
> (currently) then we should not even bother and this will likely resolve
> this for now. However, really we should try to fix that correctly.

Yes, keeping PCI regulators always-enabled should be a good immediate
solution.

Also, the RPM's system suspend/resume needs to fixed for the pci-tegra
driver, like I already suggested before.

> What I still don't understand is why your patch breaks resume. Even if
> the I2C transfer fails, and is deemed harmless by the client driver, we
> should still be able to suspend and resume correctly.

If DMA is getting synchronized after DMA driver being suspended, then it
could be a problem.

Could you please try to apply this hunk and see if it makes any
difference (I'll probably make it as proper patch):

--- >8 ---
diff --git a/drivers/dma/tegra20-apb-dma.c b/drivers/dma/tegra20-apb-dma.c
index a42c0b4d14ac..55fc7400f717 100644
--- a/drivers/dma/tegra20-apb-dma.c
+++ b/drivers/dma/tegra20-apb-dma.c
@@ -816,6 +816,13 @@ static bool
tegra_dma_eoc_interrupt_deasserted(struct tegra_dma_channel *tdc)
static void tegra_dma_synchronize(struct dma_chan *dc)
{
struct tegra_dma_channel *tdc = to_tegra_dma_chan(dc);
+ int err;
+
+ err = pm_runtime_get_sync(tdc->tdma->dev);
+ if (err < 0) {
+ dev_err(tdc2dev(tdc), "Failed to synchronize DMA: %d\n", err);
+ return;
+ }

/*
* CPU, which handles interrupt, could be busy in
@@ -825,6 +832,8 @@ static void tegra_dma_synchronize(struct dma_chan *dc)
wait_event(tdc->wq, tegra_dma_eoc_interrupt_deasserted(tdc));

tasklet_kill(&tdc->tasklet);
+
+ pm_runtime_put(tdc->tdma->dev);
}

static unsigned int tegra_dma_sg_bytes_xferred(struct tegra_dma_channel
*tdc,
--- >8 ---

It also could be that there is more than the suspend ordering problem,
but for now it is hard to tell without having a detailed log which
includes I2C/DMA/RPM traces.

Lastly, it should be worthwhile to try to apply the WIP ARM32 KASAN
series and see what will happen using CONFIG_KASAN=y.

https://patchwork.kernel.org/project/linux-arm-kernel/list/?series=230197

2020-04-24 07:12:15

by Jon Hunter

[permalink] [raw]
Subject: Re: [PATCH v2 1/2] i2c: tegra: Better handle case where CPU0 is busy for a long time


On 23/04/2020 17:33, Dmitry Osipenko wrote:
> 23.04.2020 13:56, Jon Hunter пишет:
>>
>> On 22/04/2020 15:07, Dmitry Osipenko wrote:
>>
>> ...
>>
>>>> So I think that part of the problem already existed prior to these
>>>> patches. Without your patches I see ...
>>>>
>>>> [ 59.543528] tegra-i2c 7000d000.i2c: i2c transfer timed out
>>>
>>> Does this I2C timeout happen with my patches? Could you please post full
>>> logs of an older and the recent kernel versions?
>>
>> I believe that it does, but I need to check.
>>
>>>> [ 59.549036] vdd_sata,avdd_plle: failed to disable
>>>>
>>>> [ 59.553778] Failed to disable avdd-plle: -110
>>>>
>>>> [ 59.558150] tegra-pcie 3000.pcie: failed to disable regulators: -110
>>>>
>>>>
>>>> However, now with your patches the i2c is failing to resume and this
>>>> breaks system suspend completely for Tegra30. So far this is the only
>>>> board that appears to break.
>>>>
>>>> So the above issue needs to be fixed and I will chat to Thierry about this.
>>>
>>> Okay
>>
>> So I have been looking at this some more and this starting to all look a
>> bit of a mess :-(
>>
>> Firstly, the Tegra PCI driver suspends late (noirq) as we know. The PCI
>> driver will warn if it cannot disable the regulators when suspending but
>> does not actually fail suspend. So this warning is just indicating that
>> we were unable to disable the regulators.
>>
>> Now I don't see that we can ever disable the PCI regulators currently
>> when entering suspend because ...
>>
>> 1. We are in the noirq phase and so we will not get the completion
>> interrupt for the I2C transfer. I know that you recently added the
>> atomic I2C transfer support, but we can get the regulator framework
>> to use this (I have not looked in much detail so far).
>
> That's not good :) I didn't realize that *all* interrupts of every
> device are disabled before .noirq is invoked. It appeared to me that the
> IRQs disabling and .noirq invocation is performed for the drivers one
> after another, but now I see that it's not true.
>
> https://elixir.bootlin.com/linux/v5.7-rc2/source/drivers/base/power/main.c#L1446
>
>> 2. Even if the regulator framework supported atomic I2C transfers, we
>> also have the problem that the I2C controller could be runtime-
>> suspended and pm_runtime_get_sync() will not work during during this
>> phase to resume it correctly. This is a problem that needs to be
>> fixed indeed!
>
> Could you please clarify why pm_runtime_get_sync() can't be used by the
> I2C driver's in NOIRQ phase?

Yes take a look at commit 1e2ef05bb8cf ("PM: Limit race conditions
between runtime PM and system sleep (v2)").

>> 3. Then we also have the possible dependency on the DMA driver that is
>> suspended during the noirq phase.
>
> Yes, this is correct.
>
> Again, some regulator drivers may do something on suspend too, although
> looks like the current upstream Tegra devices are not affected by this
> potential problem.
>
>> It could be argued that if the PCI regulators are never turned off
>> (currently) then we should not even bother and this will likely resolve
>> this for now. However, really we should try to fix that correctly.
>
> Yes, keeping PCI regulators always-enabled should be a good immediate
> solution.

I was thinking about that, and I am not sure it is. I don't think that
the failure to send the I2C command should break suspend.

> Also, the RPM's system suspend/resume needs to fixed for the pci-tegra
> driver, like I already suggested before.

Yes but I don't think that is the cause of the issue in this particular
case.

>> What I still don't understand is why your patch breaks resume. Even if
>> the I2C transfer fails, and is deemed harmless by the client driver, we
>> should still be able to suspend and resume correctly.
>
> If DMA is getting synchronized after DMA driver being suspended, then it
> could be a problem.

So I confirmed that DMA is not the issue in this case. I tested this by
ensuring that DMA is never used. However, it is a potential problem
indeed.

> Could you please try to apply this hunk and see if it makes any
> difference (I'll probably make it as proper patch):

Per my tests, I don't believe that it will as disabling DMA does not
resolve the problem.

> It also could be that there is more than the suspend ordering problem,
> but for now it is hard to tell without having a detailed log which
> includes I2C/DMA/RPM traces.

I have taken a look and I don't see any issues with ordering. I2C is
suspended after PCI. This did not change.

Cheers
Jon

--
nvpublic

2020-04-24 14:49:36

by Dmitry Osipenko

[permalink] [raw]
Subject: Re: [PATCH v2 1/2] i2c: tegra: Better handle case where CPU0 is busy for a long time

24.04.2020 10:10, Jon Hunter пишет:
...
>> Could you please clarify why pm_runtime_get_sync() can't be used by the
>> I2C driver's in NOIRQ phase?
>
> Yes take a look at commit 1e2ef05bb8cf ("PM: Limit race conditions
> between runtime PM and system sleep (v2)").

I2C driver now uses irq-safe RPM since ede2299f7 ("i2c: tegra: Support
atomic transfers"), and thus, the RPM's workqueue shouldn't be a
problem. I guess RPM should work fine in this case, don't you think so?

...
>> Yes, keeping PCI regulators always-enabled should be a good immediate
>> solution.
>
> I was thinking about that, and I am not sure it is. I don't think that
> the failure to send the I2C command should break suspend.

It shouldn't, but looks like it should be a separate problem.

....
> So I confirmed that DMA is not the issue in this case. I tested this by
> ensuring that DMA is never used. However, it is a potential problem
> indeed.
>
>> Could you please try to apply this hunk and see if it makes any
>> difference (I'll probably make it as proper patch):
>
> Per my tests, I don't believe that it will as disabling DMA does not
> resolve the problem.
>
>> It also could be that there is more than the suspend ordering problem,
>> but for now it is hard to tell without having a detailed log which
>> includes I2C/DMA/RPM traces.
>
> I have taken a look and I don't see any issues with ordering. I2C is
> suspended after PCI. This did not change.

Do you see a "completion done after timeout" messages in the KMSG log of
the v5.6 kernel?

Could you please try this hunk? Although, I'll be surprised if it
changes anything.

--- >8 ---
diff --git a/drivers/i2c/busses/i2c-tegra.c b/drivers/i2c/busses/i2c-tegra.c
index 36d7114823ce..7196084b15fd 100644
--- a/drivers/i2c/busses/i2c-tegra.c
+++ b/drivers/i2c/busses/i2c-tegra.c
@@ -1028,6 +1028,13 @@ tegra_i2c_wait_completion_timeout(struct
tegra_i2c_dev *i2c_dev,
msecs_to_jiffies(timeout_ms));
disable_irq(i2c_dev->irq);

+ /*
+ * There is a chance that completion may happen after IRQ
+ * synchronization, which is done by disable_irq().
+ */
+ if (ret == 0 && completion_done(complete))
+ ret = 1;
+
/*
* Under some rare circumstances (like running KASAN +
* NFS root) CPU, which handles interrupt, may stuck in
--- >8 ---

2020-04-24 15:23:36

by Jon Hunter

[permalink] [raw]
Subject: Re: [PATCH v2 1/2] i2c: tegra: Better handle case where CPU0 is busy for a long time


On 24/04/2020 15:45, Dmitry Osipenko wrote:
> 24.04.2020 10:10, Jon Hunter пишет:
> ...
>>> Could you please clarify why pm_runtime_get_sync() can't be used by the
>>> I2C driver's in NOIRQ phase?
>>
>> Yes take a look at commit 1e2ef05bb8cf ("PM: Limit race conditions
>> between runtime PM and system sleep (v2)").
>
> I2C driver now uses irq-safe RPM since ede2299f7 ("i2c: tegra: Support
> atomic transfers"), and thus, the RPM's workqueue shouldn't be a
> problem. I guess RPM should work fine in this case, don't you think so?

I was testing, and I did not see it using atomic transfers. I can
confirm if the RPM callbacks are called or not, but I did not think so.
However, let me confirm.

>>> Yes, keeping PCI regulators always-enabled should be a good immediate
>>> solution.
>>
>> I was thinking about that, and I am not sure it is. I don't think that
>> the failure to send the I2C command should break suspend.
>
> It shouldn't, but looks like it should be a separate problem.

Maybe but all these other problems appear to have existed for sometime
now. We need to fix all, but for the moment we need to figure out what's
best for v5.7.

>> So I confirmed that DMA is not the issue in this case. I tested this by
>> ensuring that DMA is never used. However, it is a potential problem
>> indeed.
>>
>>> Could you please try to apply this hunk and see if it makes any
>>> difference (I'll probably make it as proper patch):
>>
>> Per my tests, I don't believe that it will as disabling DMA does not
>> resolve the problem.
>>
>>> It also could be that there is more than the suspend ordering problem,
>>> but for now it is hard to tell without having a detailed log which
>>> includes I2C/DMA/RPM traces.
>>
>> I have taken a look and I don't see any issues with ordering. I2C is
>> suspended after PCI. This did not change.
>
> Do you see a "completion done after timeout" messages in the KMSG log of
> the v5.6 kernel?
>
> Could you please try this hunk? Although, I'll be surprised if it
> changes anything.

Yes I can test.

Cheers
Jon

--
nvpublic

2020-04-27 07:50:22

by Thierry Reding

[permalink] [raw]
Subject: Re: [PATCH v2 1/2] i2c: tegra: Better handle case where CPU0 is busy for a long time

On Fri, Apr 24, 2020 at 04:19:25PM +0100, Jon Hunter wrote:
>
> On 24/04/2020 15:45, Dmitry Osipenko wrote:
> > 24.04.2020 10:10, Jon Hunter пишет:
> > ...
> >>> Could you please clarify why pm_runtime_get_sync() can't be used by the
> >>> I2C driver's in NOIRQ phase?
> >>
> >> Yes take a look at commit 1e2ef05bb8cf ("PM: Limit race conditions
> >> between runtime PM and system sleep (v2)").
> >
> > I2C driver now uses irq-safe RPM since ede2299f7 ("i2c: tegra: Support
> > atomic transfers"), and thus, the RPM's workqueue shouldn't be a
> > problem. I guess RPM should work fine in this case, don't you think so?
>
> I was testing, and I did not see it using atomic transfers. I can
> confirm if the RPM callbacks are called or not, but I did not think so.
> However, let me confirm.
>
> >>> Yes, keeping PCI regulators always-enabled should be a good immediate
> >>> solution.
> >>
> >> I was thinking about that, and I am not sure it is. I don't think that
> >> the failure to send the I2C command should break suspend.
> >
> > It shouldn't, but looks like it should be a separate problem.
>
> Maybe but all these other problems appear to have existed for sometime
> now. We need to fix all, but for the moment we need to figure out what's
> best for v5.7.

To me it doesn't sound like we have a good handle on what exactly is
going on here and we're mostly just poking around.

And even if things weren't working quite properly before, it sounds to
me like this patch actually made things worse.

Given all that, I think the best course of action at this point is to
revert for v5.7 and prevent this from spreading[0]. After that we need
to look at fixing the regulator issues and make sure that suspend/resume
actually works properly and without errors. After that we should have a
better chance of isolating why exactly this patch fails.

Wolfram, can you revert the following two patches for v5.7, please?

8814044fe0fa i2c: tegra: Synchronize DMA before termination
a900aeac2537 i2c: tegra: Better handle case where CPU0 is busy for a long time

Thanks,
Thierry

[0]: https://lkml.org/lkml/2020/4/24/498


Attachments:
(No filename) (2.13 kB)
signature.asc (849.00 B)
Download all attachments

2020-04-27 08:47:49

by Wolfram Sang

[permalink] [raw]
Subject: Re: [PATCH v2 1/2] i2c: tegra: Better handle case where CPU0 is busy for a long time


> Wolfram, can you revert the following two patches for v5.7, please?
>
> 8814044fe0fa i2c: tegra: Synchronize DMA before termination
> a900aeac2537 i2c: tegra: Better handle case where CPU0 is busy for a long time

Sure, will do!


Attachments:
(No filename) (244.00 B)
signature.asc (849.00 B)
Download all attachments

2020-04-27 09:09:05

by Dmitry Osipenko

[permalink] [raw]
Subject: Re: [PATCH v2 1/2] i2c: tegra: Better handle case where CPU0 is busy for a long time

27.04.2020 11:44, Wolfram Sang пишет:
>
>> Wolfram, can you revert the following two patches for v5.7, please?
>>
>> 8814044fe0fa i2c: tegra: Synchronize DMA before termination

This patch has nothing to do with your trouble, why do you want to
revert it?

>> a900aeac2537 i2c: tegra: Better handle case where CPU0 is busy for a long time
>
> Sure, will do!
>

2020-04-27 09:54:02

by Dmitry Osipenko

[permalink] [raw]
Subject: Re: [PATCH v2 1/2] i2c: tegra: Better handle case where CPU0 is busy for a long time

27.04.2020 10:48, Thierry Reding пишет:
...
>> Maybe but all these other problems appear to have existed for sometime
>> now. We need to fix all, but for the moment we need to figure out what's
>> best for v5.7.
>
> To me it doesn't sound like we have a good handle on what exactly is
> going on here and we're mostly just poking around.
>
> And even if things weren't working quite properly before, it sounds to
> me like this patch actually made things worse.

There is a plenty of time to work on the proper fix now. To me it sounds
like you're giving up on fixing the root of the problem, sorry.

2020-04-27 10:37:38

by Wolfram Sang

[permalink] [raw]
Subject: Re: [PATCH v2 1/2] i2c: tegra: Better handle case where CPU0 is busy for a long time

On Mon, Apr 27, 2020 at 12:07:19PM +0300, Dmitry Osipenko wrote:
> 27.04.2020 11:44, Wolfram Sang пишет:
> >
> >> Wolfram, can you revert the following two patches for v5.7, please?
> >>
> >> 8814044fe0fa i2c: tegra: Synchronize DMA before termination
>
> This patch has nothing to do with your trouble, why do you want to
> revert it?

I'll wait some more before pushing out, so you can discuss it.

2020-04-27 10:40:44

by Wolfram Sang

[permalink] [raw]
Subject: Re: [PATCH v2 1/2] i2c: tegra: Better handle case where CPU0 is busy for a long time

On Mon, Apr 27, 2020 at 12:52:10PM +0300, Dmitry Osipenko wrote:
> 27.04.2020 10:48, Thierry Reding пишет:
> ...
> >> Maybe but all these other problems appear to have existed for sometime
> >> now. We need to fix all, but for the moment we need to figure out what's
> >> best for v5.7.
> >
> > To me it doesn't sound like we have a good handle on what exactly is
> > going on here and we're mostly just poking around.
> >
> > And even if things weren't working quite properly before, it sounds to
> > me like this patch actually made things worse.
>
> There is a plenty of time to work on the proper fix now. To me it sounds
> like you're giving up on fixing the root of the problem, sorry.

From what I understood, there were (at least) two regressions reported.
So, to me, it makes sense to revert the change, so for upstream users
everything stays "the same". Of course, this does not mean it should
stay like this forever and you guys can work on fixing the root causes.
I'll happily apply them for this release when you are confident with the
results.

2020-04-27 10:51:32

by Thierry Reding

[permalink] [raw]
Subject: Re: [PATCH v2 1/2] i2c: tegra: Better handle case where CPU0 is busy for a long time

On Mon, Apr 27, 2020 at 12:07:19PM +0300, Dmitry Osipenko wrote:
> 27.04.2020 11:44, Wolfram Sang пишет:
> >
> >> Wolfram, can you revert the following two patches for v5.7, please?
> >>
> >> 8814044fe0fa i2c: tegra: Synchronize DMA before termination
>
> This patch has nothing to do with your trouble, why do you want to
> revert it?

It was part of the same series and addressing the same "busy CPU"
scenario, so I think it makes sense to keep both in the same series. I
guess we could try to run some tests with only this applied and see if
that really doesn't break anything. If so, I don't have any objection
to keeping this.

Thierry


Attachments:
(No filename) (666.00 B)
signature.asc (849.00 B)
Download all attachments

2020-04-27 10:52:43

by Thierry Reding

[permalink] [raw]
Subject: Re: [PATCH v2 1/2] i2c: tegra: Better handle case where CPU0 is busy for a long time

On Mon, Apr 27, 2020 at 12:35:53PM +0200, Wolfram Sang wrote:
> On Mon, Apr 27, 2020 at 12:07:19PM +0300, Dmitry Osipenko wrote:
> > 27.04.2020 11:44, Wolfram Sang пишет:
> > >
> > >> Wolfram, can you revert the following two patches for v5.7, please?
> > >>
> > >> 8814044fe0fa i2c: tegra: Synchronize DMA before termination
> >
> > This patch has nothing to do with your trouble, why do you want to
> > revert it?
>
> I'll wait some more before pushing out, so you can discuss it.

Okay, let me run a quick test with that second patch still applied to
make sure it really is harmless.

Thierry


Attachments:
(No filename) (622.00 B)
signature.asc (849.00 B)
Download all attachments

2020-04-27 11:04:40

by Thierry Reding

[permalink] [raw]
Subject: Re: [PATCH v2 1/2] i2c: tegra: Better handle case where CPU0 is busy for a long time

On Mon, Apr 27, 2020 at 12:52:10PM +0300, Dmitry Osipenko wrote:
> 27.04.2020 10:48, Thierry Reding пишет:
> ...
> >> Maybe but all these other problems appear to have existed for sometime
> >> now. We need to fix all, but for the moment we need to figure out what's
> >> best for v5.7.
> >
> > To me it doesn't sound like we have a good handle on what exactly is
> > going on here and we're mostly just poking around.
> >
> > And even if things weren't working quite properly before, it sounds to
> > me like this patch actually made things worse.
>
> There is a plenty of time to work on the proper fix now. To me it sounds
> like you're giving up on fixing the root of the problem, sorry.

We're at -rc3 now and I haven't seen any promising progress in the last
week. All the while suspend/resume is now broken on at least one board
and that may end up hiding any other issues that could creep in in the
meantime.

Furthermore we seem to have a preexisting issue that may very well
interfere with this patch, so I think the cautious thing is to revert
for now and then fix the original issue first. We can always come back
to this once everything is back to normal.

Also, people are now looking at backporting this to v5.6. Unless we
revert this from v5.7 it may get picked up for backports to other
kernels and then I have to notify stable kernel maintainers that they
shouldn't and they have to back things out again. That's going to cause
a lot of wasted time for a lot of people.

So, sorry, I disagree. I don't think we have "plenty of time".

Thierry


Attachments:
(No filename) (1.57 kB)
signature.asc (849.00 B)
Download all attachments

2020-04-27 12:48:23

by Dmitry Osipenko

[permalink] [raw]
Subject: Re: [PATCH v2 1/2] i2c: tegra: Better handle case where CPU0 is busy for a long time

23.04.2020 13:56, Jon Hunter пишет:
>>> So I think that part of the problem already existed prior to these
>>> patches. Without your patches I see ...
>>>
>>> [ 59.543528] tegra-i2c 7000d000.i2c: i2c transfer timed out
>>> [ 59.549036] vdd_sata,avdd_plle: failed to disable
>>> [ 59.553778] Failed to disable avdd-plle: -110
>>> [ 59.558150] tegra-pcie 3000.pcie: failed to disable regulators: -110
>> Does this I2C timeout happen with my patches? Could you please post full
>> logs of an older and the recent kernel versions?
> I believe that it does, but I need to check.
>

Jon, could you please confirm that you're seeing those regulator-disable
errors with my patch? I don't see those errors in yours original log [1].

[1]
https://lore.kernel.org/lkml/[email protected]/

Again, could you please post the *full* logs?

If regulator's disabling was "failing" before without my patch because
of the I2C interrupt being force-disabled during of NOIRQ phase, and now
regulator's disabling succeeds with my patch because IRQ is manually
handled after the timeout, then this could be bad. It means that
regulator was actually getting disabled, but I2C driver was timing out
because interrupt couldn't be handled in NOIRQ phase, which should
result in a dead PCIe on a resume from suspend since regulator's core
thinks that regulator is enabled (I2C said it failed to disable), while
it is actually disabled.

Do you have anything plugged into the PCIe slot in yours testing farm?
It wouldn't surprise me if the plugged card isn't functional after
resume from suspend on a stable kernels.

2020-04-27 13:19:23

by Dmitry Osipenko

[permalink] [raw]
Subject: Re: [PATCH v2 1/2] i2c: tegra: Better handle case where CPU0 is busy for a long time

27.04.2020 13:38, Wolfram Sang пишет:
> On Mon, Apr 27, 2020 at 12:52:10PM +0300, Dmitry Osipenko wrote:
>> 27.04.2020 10:48, Thierry Reding пишет:
>> ...
>>>> Maybe but all these other problems appear to have existed for sometime
>>>> now. We need to fix all, but for the moment we need to figure out what's
>>>> best for v5.7.
>>>
>>> To me it doesn't sound like we have a good handle on what exactly is
>>> going on here and we're mostly just poking around.
>>>
>>> And even if things weren't working quite properly before, it sounds to
>>> me like this patch actually made things worse.
>>
>> There is a plenty of time to work on the proper fix now. To me it sounds
>> like you're giving up on fixing the root of the problem, sorry.
>
> From what I understood, there were (at least) two regressions reported.
> So, to me, it makes sense to revert the change, so for upstream users
> everything stays "the same". Of course, this does not mean it should
> stay like this forever and you guys can work on fixing the root causes.
> I'll happily apply them for this release when you are confident with the
> results.
>

For now it's a single regression in the PCIe driver and it's actually
not a regression, but a PCIe driver bug that needs to be fixed. The I2C
part should be okay.

By reverting the I2C patch, we're back to the PCIe bug being papered
over and I don't like this. Let's just fix the PCIe driver and the
problem is gone.. it needs to be fixed anyways.

2020-04-27 14:16:48

by Dmitry Osipenko

[permalink] [raw]
Subject: Re: [PATCH v2 1/2] i2c: tegra: Better handle case where CPU0 is busy for a long time

27.04.2020 15:46, Dmitry Osipenko пишет:
> 23.04.2020 13:56, Jon Hunter пишет:
>>>> So I think that part of the problem already existed prior to these
>>>> patches. Without your patches I see ...
>>>>
>>>> [ 59.543528] tegra-i2c 7000d000.i2c: i2c transfer timed out
>>>> [ 59.549036] vdd_sata,avdd_plle: failed to disable
>>>> [ 59.553778] Failed to disable avdd-plle: -110
>>>> [ 59.558150] tegra-pcie 3000.pcie: failed to disable regulators: -110
>>> Does this I2C timeout happen with my patches? Could you please post full
>>> logs of an older and the recent kernel versions?
>> I believe that it does, but I need to check.
>>
>
> Jon, could you please confirm that you're seeing those regulator-disable
> errors with my patch? I don't see those errors in yours original log [1].
>
> [1]
> https://lore.kernel.org/lkml/[email protected]/
>
> Again, could you please post the *full* logs?
>
> If regulator's disabling was "failing" before without my patch because
> of the I2C interrupt being force-disabled during of NOIRQ phase, and now
> regulator's disabling succeeds with my patch because IRQ is manually
> handled after the timeout, then this could be bad. It means that
> regulator was actually getting disabled, but I2C driver was timing out
> because interrupt couldn't be handled in NOIRQ phase, which should
> result in a dead PCIe on a resume from suspend since regulator's core
> thinks that regulator is enabled (I2C said it failed to disable), while
> it is actually disabled.
>
> Do you have anything plugged into the PCIe slot in yours testing farm?
> It wouldn't surprise me if the plugged card isn't functional after
> resume from suspend on a stable kernels.
>

I actually now see that interrupt is not allowed to be enabled during
the NOIRQ phase:

https://elixir.bootlin.com/linux/v5.7-rc3/source/kernel/irq/manage.c#L640

it should be worthwhile to turn it into a WARN_ON.

2020-04-27 14:22:43

by Thierry Reding

[permalink] [raw]
Subject: Re: [PATCH v2 1/2] i2c: tegra: Better handle case where CPU0 is busy for a long time

On Mon, Apr 27, 2020 at 04:15:21PM +0300, Dmitry Osipenko wrote:
> 27.04.2020 13:38, Wolfram Sang пишет:
> > On Mon, Apr 27, 2020 at 12:52:10PM +0300, Dmitry Osipenko wrote:
> >> 27.04.2020 10:48, Thierry Reding пишет:
> >> ...
> >>>> Maybe but all these other problems appear to have existed for sometime
> >>>> now. We need to fix all, but for the moment we need to figure out what's
> >>>> best for v5.7.
> >>>
> >>> To me it doesn't sound like we have a good handle on what exactly is
> >>> going on here and we're mostly just poking around.
> >>>
> >>> And even if things weren't working quite properly before, it sounds to
> >>> me like this patch actually made things worse.
> >>
> >> There is a plenty of time to work on the proper fix now. To me it sounds
> >> like you're giving up on fixing the root of the problem, sorry.
> >
> > From what I understood, there were (at least) two regressions reported.
> > So, to me, it makes sense to revert the change, so for upstream users
> > everything stays "the same". Of course, this does not mean it should
> > stay like this forever and you guys can work on fixing the root causes.
> > I'll happily apply them for this release when you are confident with the
> > results.
> >
>
> For now it's a single regression in the PCIe driver and it's actually
> not a regression, but a PCIe driver bug that needs to be fixed. The I2C
> part should be okay.
>
> By reverting the I2C patch, we're back to the PCIe bug being papered
> over and I don't like this. Let's just fix the PCIe driver and the
> problem is gone.. it needs to be fixed anyways.

Yes, that bug should be fixed anyway. But that doesn't justify breaking
suspend/resume completely, which *is* a regression.

Look, I'm not saying that we should drop this patch altogether. All I'm
saying is that we should postpone it so that we can: a) get suspend and
resume working again (and by doing so make sure no other suspend/resume
regressions silently creep in, because that always seems to happen when
you're not looking) and b) fix any preexisting issues without possibly
scrambling the result with this perhaps unrelated fix.

So, again, I think the safest road forward is to back this one out for
now, fix whatever this other bug is and once suspend/resume is working
properly again we can revisit this patch based on a known-good baseline.

Thierry


Attachments:
(No filename) (2.37 kB)
signature.asc (849.00 B)
Download all attachments

2020-04-27 14:23:41

by Dmitry Osipenko

[permalink] [raw]
Subject: Re: [PATCH v2 1/2] i2c: tegra: Better handle case where CPU0 is busy for a long time

27.04.2020 14:00, Thierry Reding пишет:
> On Mon, Apr 27, 2020 at 12:52:10PM +0300, Dmitry Osipenko wrote:
>> 27.04.2020 10:48, Thierry Reding пишет:
>> ...
>>>> Maybe but all these other problems appear to have existed for sometime
>>>> now. We need to fix all, but for the moment we need to figure out what's
>>>> best for v5.7.
>>>
>>> To me it doesn't sound like we have a good handle on what exactly is
>>> going on here and we're mostly just poking around.
>>>
>>> And even if things weren't working quite properly before, it sounds to
>>> me like this patch actually made things worse.
>>
>> There is a plenty of time to work on the proper fix now. To me it sounds
>> like you're giving up on fixing the root of the problem, sorry.
>
> We're at -rc3 now and I haven't seen any promising progress in the last
> week. All the while suspend/resume is now broken on at least one board
> and that may end up hiding any other issues that could creep in in the
> meantime.
>
> Furthermore we seem to have a preexisting issue that may very well
> interfere with this patch, so I think the cautious thing is to revert
> for now and then fix the original issue first. We can always come back
> to this once everything is back to normal.
>
> Also, people are now looking at backporting this to v5.6. Unless we
> revert this from v5.7 it may get picked up for backports to other
> kernels and then I have to notify stable kernel maintainers that they
> shouldn't and they have to back things out again. That's going to cause
> a lot of wasted time for a lot of people.
>
> So, sorry, I disagree. I don't think we have "plenty of time".

There is about a month now before the 5.7 release. It's a bit too early
to start the panic, IMO :)

Jon already proposed a reasonable simple solution: to keep PCIe
regulators always-ON. In a longer run we may want to have I2C atomic
transfers supported for a late suspend phase.

This should fix yours problem and it should go into stable kernels:

--- >8 ---
diff --git a/drivers/pci/controller/pci-tegra.c
b/drivers/pci/controller/pci-tegra.c
index 3e64ba6a36a8..6ac76323ca70 100644
--- a/drivers/pci/controller/pci-tegra.c
+++ b/drivers/pci/controller/pci-tegra.c
@@ -1533,8 +1533,16 @@ static int tegra_pcie_get_resources(struct
tegra_pcie *pcie)
goto phys_put;
}

+ err = regulator_bulk_enable(pcie->num_supplies, pcie->supplies);
+ if (err) {
+ dev_err(dev, "failed to enable regulators: %d\n", err);
+ goto irq_free;
+ }
+
return 0;

+irq_free:
+ free_irq(pcie->irq, pcie);
phys_put:
if (soc->program_uphy)
tegra_pcie_phys_put(pcie);
@@ -1545,6 +1553,12 @@ static int tegra_pcie_put_resources(struct
tegra_pcie *pcie)
{
const struct tegra_pcie_soc *soc = pcie->soc;

+ err = regulator_bulk_disable(pcie->num_supplies, pcie->supplies);
+ if (err) {
+ dev_err(pcie->dev, "failed to disable regulators: %d\n", err);
+ return err;
+ }
+
if (pcie->irq > 0)
free_irq(pcie->irq, pcie);
--- >8 ---

2020-04-27 14:48:37

by Dmitry Osipenko

[permalink] [raw]
Subject: Re: [PATCH v2 1/2] i2c: tegra: Better handle case where CPU0 is busy for a long time

27.04.2020 17:13, Dmitry Osipenko пишет:
> 27.04.2020 15:46, Dmitry Osipenko пишет:
>> 23.04.2020 13:56, Jon Hunter пишет:
>>>>> So I think that part of the problem already existed prior to these
>>>>> patches. Without your patches I see ...
>>>>>
>>>>> [ 59.543528] tegra-i2c 7000d000.i2c: i2c transfer timed out
>>>>> [ 59.549036] vdd_sata,avdd_plle: failed to disable
>>>>> [ 59.553778] Failed to disable avdd-plle: -110
>>>>> [ 59.558150] tegra-pcie 3000.pcie: failed to disable regulators: -110
>>>> Does this I2C timeout happen with my patches? Could you please post full
>>>> logs of an older and the recent kernel versions?
>>> I believe that it does, but I need to check.
>>>
>>
>> Jon, could you please confirm that you're seeing those regulator-disable
>> errors with my patch? I don't see those errors in yours original log [1].
>>
>> [1]
>> https://lore.kernel.org/lkml/[email protected]/
>>
>> Again, could you please post the *full* logs?
>>
>> If regulator's disabling was "failing" before without my patch because
>> of the I2C interrupt being force-disabled during of NOIRQ phase, and now
>> regulator's disabling succeeds with my patch because IRQ is manually
>> handled after the timeout, then this could be bad. It means that
>> regulator was actually getting disabled, but I2C driver was timing out
>> because interrupt couldn't be handled in NOIRQ phase, which should
>> result in a dead PCIe on a resume from suspend since regulator's core
>> thinks that regulator is enabled (I2C said it failed to disable), while
>> it is actually disabled.
>>
>> Do you have anything plugged into the PCIe slot in yours testing farm?
>> It wouldn't surprise me if the plugged card isn't functional after
>> resume from suspend on a stable kernels.
>>
>
> I actually now see that interrupt is not allowed to be enabled during
> the NOIRQ phase:
>
> https://elixir.bootlin.com/linux/v5.7-rc3/source/kernel/irq/manage.c#L640
>
> it should be worthwhile to turn it into a WARN_ON.
>

Oh, wait! There is already a warning there.. hmm.

2020-04-27 15:17:13

by Thierry Reding

[permalink] [raw]
Subject: Re: [PATCH v2 1/2] i2c: tegra: Better handle case where CPU0 is busy for a long time

On Mon, Apr 27, 2020 at 05:21:30PM +0300, Dmitry Osipenko wrote:
> 27.04.2020 14:00, Thierry Reding пишет:
> > On Mon, Apr 27, 2020 at 12:52:10PM +0300, Dmitry Osipenko wrote:
> >> 27.04.2020 10:48, Thierry Reding пишет:
> >> ...
> >>>> Maybe but all these other problems appear to have existed for sometime
> >>>> now. We need to fix all, but for the moment we need to figure out what's
> >>>> best for v5.7.
> >>>
> >>> To me it doesn't sound like we have a good handle on what exactly is
> >>> going on here and we're mostly just poking around.
> >>>
> >>> And even if things weren't working quite properly before, it sounds to
> >>> me like this patch actually made things worse.
> >>
> >> There is a plenty of time to work on the proper fix now. To me it sounds
> >> like you're giving up on fixing the root of the problem, sorry.
> >
> > We're at -rc3 now and I haven't seen any promising progress in the last
> > week. All the while suspend/resume is now broken on at least one board
> > and that may end up hiding any other issues that could creep in in the
> > meantime.
> >
> > Furthermore we seem to have a preexisting issue that may very well
> > interfere with this patch, so I think the cautious thing is to revert
> > for now and then fix the original issue first. We can always come back
> > to this once everything is back to normal.
> >
> > Also, people are now looking at backporting this to v5.6. Unless we
> > revert this from v5.7 it may get picked up for backports to other
> > kernels and then I have to notify stable kernel maintainers that they
> > shouldn't and they have to back things out again. That's going to cause
> > a lot of wasted time for a lot of people.
> >
> > So, sorry, I disagree. I don't think we have "plenty of time".
>
> There is about a month now before the 5.7 release. It's a bit too early
> to start the panic, IMO :)

There's no panic. A patch got merged and it broken something, so we
revert it and try again. It's very much standard procedure.

> Jon already proposed a reasonable simple solution: to keep PCIe
> regulators always-ON. In a longer run we may want to have I2C atomic
> transfers supported for a late suspend phase.

That's not really a solution, though, is it? It's just papering over
an issue that this patch introduced or uncovered. I'm much more in
favour of fixing problems at the root rather than keep papering over
until we loose track of what the actual problems are.

Thierry


Attachments:
(No filename) (2.46 kB)
signature.asc (849.00 B)
Download all attachments

2020-04-27 15:20:24

by Dmitry Osipenko

[permalink] [raw]
Subject: Re: [PATCH v2 1/2] i2c: tegra: Better handle case where CPU0 is busy for a long time

27.04.2020 18:12, Thierry Reding пишет:
> On Mon, Apr 27, 2020 at 05:21:30PM +0300, Dmitry Osipenko wrote:
>> 27.04.2020 14:00, Thierry Reding пишет:
>>> On Mon, Apr 27, 2020 at 12:52:10PM +0300, Dmitry Osipenko wrote:
>>>> 27.04.2020 10:48, Thierry Reding пишет:
>>>> ...
>>>>>> Maybe but all these other problems appear to have existed for sometime
>>>>>> now. We need to fix all, but for the moment we need to figure out what's
>>>>>> best for v5.7.
>>>>>
>>>>> To me it doesn't sound like we have a good handle on what exactly is
>>>>> going on here and we're mostly just poking around.
>>>>>
>>>>> And even if things weren't working quite properly before, it sounds to
>>>>> me like this patch actually made things worse.
>>>>
>>>> There is a plenty of time to work on the proper fix now. To me it sounds
>>>> like you're giving up on fixing the root of the problem, sorry.
>>>
>>> We're at -rc3 now and I haven't seen any promising progress in the last
>>> week. All the while suspend/resume is now broken on at least one board
>>> and that may end up hiding any other issues that could creep in in the
>>> meantime.
>>>
>>> Furthermore we seem to have a preexisting issue that may very well
>>> interfere with this patch, so I think the cautious thing is to revert
>>> for now and then fix the original issue first. We can always come back
>>> to this once everything is back to normal.
>>>
>>> Also, people are now looking at backporting this to v5.6. Unless we
>>> revert this from v5.7 it may get picked up for backports to other
>>> kernels and then I have to notify stable kernel maintainers that they
>>> shouldn't and they have to back things out again. That's going to cause
>>> a lot of wasted time for a lot of people.
>>>
>>> So, sorry, I disagree. I don't think we have "plenty of time".
>>
>> There is about a month now before the 5.7 release. It's a bit too early
>> to start the panic, IMO :)
>
> There's no panic. A patch got merged and it broken something, so we
> revert it and try again. It's very much standard procedure.
>
>> Jon already proposed a reasonable simple solution: to keep PCIe
>> regulators always-ON. In a longer run we may want to have I2C atomic
>> transfers supported for a late suspend phase.
>
> That's not really a solution, though, is it? It's just papering over
> an issue that this patch introduced or uncovered. I'm much more in
> favour of fixing problems at the root rather than keep papering over
> until we loose track of what the actual problems are.

It's not "papering over an issue". The bug can't be fixed properly
without introducing I2C atomic transfers support for a late suspend
phase, I don't see any other solutions for now. Stable kernels do not
support atomic transfers at all, that proper solution won't be backportable.

2020-04-27 15:33:38

by Wolfram Sang

[permalink] [raw]
Subject: Re: [PATCH v2 1/2] i2c: tegra: Better handle case where CPU0 is busy for a long time


> Yes, that bug should be fixed anyway. But that doesn't justify breaking
> suspend/resume completely, which *is* a regression.
>
> Look, I'm not saying that we should drop this patch altogether. All I'm
> saying is that we should postpone it so that we can: a) get suspend and
> resume working again (and by doing so make sure no other suspend/resume
> regressions silently creep in, because that always seems to happen when
> you're not looking) and b) fix any preexisting issues without possibly
> scrambling the result with this perhaps unrelated fix.
>
> So, again, I think the safest road forward is to back this one out for
> now, fix whatever this other bug is and once suspend/resume is working
> properly again we can revisit this patch based on a known-good baseline.

I am with you here. I want to add that the proper fix should be
developed without thinking too much about stable in the first place.
*When* we have a proper working fix, then we can think about making it
"more" suitable for backporting. Yet, it may also be a result that older
kernels need a different solution. Or have no solution at all, in case
they can't do atomic_transfers and this is needed.

D'accord?


Attachments:
(No filename) (1.19 kB)
signature.asc (849.00 B)
Download all attachments

2020-04-27 15:36:57

by Thierry Reding

[permalink] [raw]
Subject: Re: [PATCH v2 1/2] i2c: tegra: Better handle case where CPU0 is busy for a long time

On Mon, Apr 27, 2020 at 12:50:29PM +0200, Thierry Reding wrote:
> On Mon, Apr 27, 2020 at 12:35:53PM +0200, Wolfram Sang wrote:
> > On Mon, Apr 27, 2020 at 12:07:19PM +0300, Dmitry Osipenko wrote:
> > > 27.04.2020 11:44, Wolfram Sang пишет:
> > > >
> > > >> Wolfram, can you revert the following two patches for v5.7, please?
> > > >>
> > > >> 8814044fe0fa i2c: tegra: Synchronize DMA before termination
> > >
> > > This patch has nothing to do with your trouble, why do you want to
> > > revert it?
> >
> > I'll wait some more before pushing out, so you can discuss it.
>
> Okay, let me run a quick test with that second patch still applied to
> make sure it really is harmless.

Alright, I tested v5.7-rc3 with this patch reverted:

a900aeac2537 i2c: tegra: Better handle case where CPU0 is busy for a long time

and the results came back positive, so I think we can leave patch:

8814044fe0fa i2c: tegra: Synchronize DMA before termination

in. But then again, I see that Dmitry posted this yesterday:

https://lkml.org/lkml/2020/4/26/481

which seems like it would be related to this and potentially be a
follow-up fix for some corner cases? So I'm not sure how well this whole
set has been tested yet.

Maybe a better solution would be for the DMA synchronization patch to go
into the 5.8 queue instead to make sure we get more testing cycles.

Thierry


Attachments:
(No filename) (1.38 kB)
signature.asc (849.00 B)
Download all attachments

2020-04-27 16:05:37

by Dmitry Osipenko

[permalink] [raw]
Subject: Re: [PATCH v2 1/2] i2c: tegra: Better handle case where CPU0 is busy for a long time

27.04.2020 18:32, Thierry Reding пишет:
> On Mon, Apr 27, 2020 at 12:50:29PM +0200, Thierry Reding wrote:
>> On Mon, Apr 27, 2020 at 12:35:53PM +0200, Wolfram Sang wrote:
>>> On Mon, Apr 27, 2020 at 12:07:19PM +0300, Dmitry Osipenko wrote:
>>>> 27.04.2020 11:44, Wolfram Sang пишет:
>>>>>
>>>>>> Wolfram, can you revert the following two patches for v5.7, please?
>>>>>>
>>>>>> 8814044fe0fa i2c: tegra: Synchronize DMA before termination
>>>>
>>>> This patch has nothing to do with your trouble, why do you want to
>>>> revert it?
>>>
>>> I'll wait some more before pushing out, so you can discuss it.
>>
>> Okay, let me run a quick test with that second patch still applied to
>> make sure it really is harmless.
>
> Alright, I tested v5.7-rc3 with this patch reverted:
>
> a900aeac2537 i2c: tegra: Better handle case where CPU0 is busy for a long time
>
> and the results came back positive, so I think we can leave patch:
>
> 8814044fe0fa i2c: tegra: Synchronize DMA before termination
>
> in. But then again, I see that Dmitry posted this yesterday:
>
> https://lkml.org/lkml/2020/4/26/481
>
> which seems like it would be related to this and potentially be a
> follow-up fix for some corner cases?

This is a follow-up to my previous message in this thread:

https://lkml.org/lkml/2020/4/23/792

> So I'm not sure how well this whole set has been tested yet.

It depends on what you're meaning by the testing. We have some
yet-out-of-tree real-world devices that are using APBDMA for Bluetooth,
Audio, I2C (touchscreens) and etc peripherals. These devices were using
the DMA patches before they were posted to the ML.

> Maybe a better solution would be for the DMA synchronization patch to go
> into the 5.8 queue instead to make sure we get more testing cycles.

It should be fine to re-queue the patches for 5.8. I'm just a bit afraid
that if patches are simply dropped now, then you won't get back to it
for a year or so ;)

2020-04-27 16:34:24

by Dmitry Osipenko

[permalink] [raw]
Subject: Re: [PATCH v2 1/2] i2c: tegra: Better handle case where CPU0 is busy for a long time

27.04.2020 17:45, Dmitry Osipenko пишет:
> 27.04.2020 17:13, Dmitry Osipenko пишет:
>> 27.04.2020 15:46, Dmitry Osipenko пишет:
>>> 23.04.2020 13:56, Jon Hunter пишет:
>>>>>> So I think that part of the problem already existed prior to these
>>>>>> patches. Without your patches I see ...
>>>>>>
>>>>>> [ 59.543528] tegra-i2c 7000d000.i2c: i2c transfer timed out
>>>>>> [ 59.549036] vdd_sata,avdd_plle: failed to disable
>>>>>> [ 59.553778] Failed to disable avdd-plle: -110
>>>>>> [ 59.558150] tegra-pcie 3000.pcie: failed to disable regulators: -110
>>>>> Does this I2C timeout happen with my patches? Could you please post full
>>>>> logs of an older and the recent kernel versions?
>>>> I believe that it does, but I need to check.
>>>>
>>>
>>> Jon, could you please confirm that you're seeing those regulator-disable
>>> errors with my patch? I don't see those errors in yours original log [1].
>>>
>>> [1]
>>> https://lore.kernel.org/lkml/[email protected]/
>>>
>>> Again, could you please post the *full* logs?
>>>
>>> If regulator's disabling was "failing" before without my patch because
>>> of the I2C interrupt being force-disabled during of NOIRQ phase, and now
>>> regulator's disabling succeeds with my patch because IRQ is manually
>>> handled after the timeout, then this could be bad. It means that
>>> regulator was actually getting disabled, but I2C driver was timing out
>>> because interrupt couldn't be handled in NOIRQ phase, which should
>>> result in a dead PCIe on a resume from suspend since regulator's core
>>> thinks that regulator is enabled (I2C said it failed to disable), while
>>> it is actually disabled.
>>>
>>> Do you have anything plugged into the PCIe slot in yours testing farm?
>>> It wouldn't surprise me if the plugged card isn't functional after
>>> resume from suspend on a stable kernels.
>>>
>>
>> I actually now see that interrupt is not allowed to be enabled during
>> the NOIRQ phase:
>>
>> https://elixir.bootlin.com/linux/v5.7-rc3/source/kernel/irq/manage.c#L640
>>
>> it should be worthwhile to turn it into a WARN_ON.
>>
>
> Oh, wait! There is already a warning there.. hmm.
>

Aha, the disable depth for the I2C interrupt is 2 after
suspend_device_irq(), that's why there is no warning.

This should catch the bug and trigger the warning:

--- >8 ---
diff --git a/kernel/irq/manage.c b/kernel/irq/manage.c
index 453a8a0f4804..fe25104d8b22 100644
--- a/kernel/irq/manage.c
+++ b/kernel/irq/manage.c
@@ -653,6 +653,8 @@ void __enable_irq(struct irq_desc *desc)
break;
}
default:
+ if (desc->istate & IRQS_SUSPENDED)
+ goto err_out;
desc->depth--;
}
}
--- >8 ---

Jon could you please give it a try? Will this change produce a warning
for the I2C driver on a PCIe suspend for the v5.6 kernel?

2020-04-28 08:04:00

by Jon Hunter

[permalink] [raw]
Subject: Re: [PATCH v2 1/2] i2c: tegra: Better handle case where CPU0 is busy for a long time


On 27/04/2020 16:18, Dmitry Osipenko wrote:
> 27.04.2020 18:12, Thierry Reding пишет:
>> On Mon, Apr 27, 2020 at 05:21:30PM +0300, Dmitry Osipenko wrote:
>>> 27.04.2020 14:00, Thierry Reding пишет:
>>>> On Mon, Apr 27, 2020 at 12:52:10PM +0300, Dmitry Osipenko wrote:
>>>>> 27.04.2020 10:48, Thierry Reding пишет:
>>>>> ...
>>>>>>> Maybe but all these other problems appear to have existed for sometime
>>>>>>> now. We need to fix all, but for the moment we need to figure out what's
>>>>>>> best for v5.7.
>>>>>>
>>>>>> To me it doesn't sound like we have a good handle on what exactly is
>>>>>> going on here and we're mostly just poking around.
>>>>>>
>>>>>> And even if things weren't working quite properly before, it sounds to
>>>>>> me like this patch actually made things worse.
>>>>>
>>>>> There is a plenty of time to work on the proper fix now. To me it sounds
>>>>> like you're giving up on fixing the root of the problem, sorry.
>>>>
>>>> We're at -rc3 now and I haven't seen any promising progress in the last
>>>> week. All the while suspend/resume is now broken on at least one board
>>>> and that may end up hiding any other issues that could creep in in the
>>>> meantime.
>>>>
>>>> Furthermore we seem to have a preexisting issue that may very well
>>>> interfere with this patch, so I think the cautious thing is to revert
>>>> for now and then fix the original issue first. We can always come back
>>>> to this once everything is back to normal.
>>>>
>>>> Also, people are now looking at backporting this to v5.6. Unless we
>>>> revert this from v5.7 it may get picked up for backports to other
>>>> kernels and then I have to notify stable kernel maintainers that they
>>>> shouldn't and they have to back things out again. That's going to cause
>>>> a lot of wasted time for a lot of people.
>>>>
>>>> So, sorry, I disagree. I don't think we have "plenty of time".
>>>
>>> There is about a month now before the 5.7 release. It's a bit too early
>>> to start the panic, IMO :)
>>
>> There's no panic. A patch got merged and it broken something, so we
>> revert it and try again. It's very much standard procedure.
>>
>>> Jon already proposed a reasonable simple solution: to keep PCIe
>>> regulators always-ON. In a longer run we may want to have I2C atomic
>>> transfers supported for a late suspend phase.
>>
>> That's not really a solution, though, is it? It's just papering over
>> an issue that this patch introduced or uncovered. I'm much more in
>> favour of fixing problems at the root rather than keep papering over
>> until we loose track of what the actual problems are.
>
> It's not "papering over an issue". The bug can't be fixed properly
> without introducing I2C atomic transfers support for a late suspend
> phase, I don't see any other solutions for now. Stable kernels do not
> support atomic transfers at all, that proper solution won't be backportable.


There are a few issues here, but the issue Thierry and I are referring
to is the regression introduced by this change. Yes this exposes other
problems, but we first need to understand why this breaks resume in
general, regardless of what the PCIe driver is doing. I will look at
this a bit more later this week.

Cheers
Jon

--
nvpublic

2020-04-28 08:06:31

by Jon Hunter

[permalink] [raw]
Subject: Re: [PATCH v2 1/2] i2c: tegra: Better handle case where CPU0 is busy for a long time


On 27/04/2020 16:38, Dmitry Osipenko wrote:
> 27.04.2020 17:45, Dmitry Osipenko пишет:
>> 27.04.2020 17:13, Dmitry Osipenko пишет:
>>> 27.04.2020 15:46, Dmitry Osipenko пишет:
>>>> 23.04.2020 13:56, Jon Hunter пишет:
>>>>>>> So I think that part of the problem already existed prior to these
>>>>>>> patches. Without your patches I see ...
>>>>>>>
>>>>>>> [ 59.543528] tegra-i2c 7000d000.i2c: i2c transfer timed out
>>>>>>> [ 59.549036] vdd_sata,avdd_plle: failed to disable
>>>>>>> [ 59.553778] Failed to disable avdd-plle: -110
>>>>>>> [ 59.558150] tegra-pcie 3000.pcie: failed to disable regulators: -110
>>>>>> Does this I2C timeout happen with my patches? Could you please post full
>>>>>> logs of an older and the recent kernel versions?
>>>>> I believe that it does, but I need to check.
>>>>>
>>>>
>>>> Jon, could you please confirm that you're seeing those regulator-disable
>>>> errors with my patch? I don't see those errors in yours original log [1].
>>>>
>>>> [1]
>>>> https://lore.kernel.org/lkml/[email protected]/
>>>>
>>>> Again, could you please post the *full* logs?
>>>>
>>>> If regulator's disabling was "failing" before without my patch because
>>>> of the I2C interrupt being force-disabled during of NOIRQ phase, and now
>>>> regulator's disabling succeeds with my patch because IRQ is manually
>>>> handled after the timeout, then this could be bad. It means that
>>>> regulator was actually getting disabled, but I2C driver was timing out
>>>> because interrupt couldn't be handled in NOIRQ phase, which should
>>>> result in a dead PCIe on a resume from suspend since regulator's core
>>>> thinks that regulator is enabled (I2C said it failed to disable), while
>>>> it is actually disabled.
>>>>
>>>> Do you have anything plugged into the PCIe slot in yours testing farm?
>>>> It wouldn't surprise me if the plugged card isn't functional after
>>>> resume from suspend on a stable kernels.
>>>>
>>>
>>> I actually now see that interrupt is not allowed to be enabled during
>>> the NOIRQ phase:
>>>
>>> https://elixir.bootlin.com/linux/v5.7-rc3/source/kernel/irq/manage.c#L640
>>>
>>> it should be worthwhile to turn it into a WARN_ON.
>>>
>>
>> Oh, wait! There is already a warning there.. hmm.
>>
>
> Aha, the disable depth for the I2C interrupt is 2 after
> suspend_device_irq(), that's why there is no warning.
>
> This should catch the bug and trigger the warning:
>
> --- >8 ---
> diff --git a/kernel/irq/manage.c b/kernel/irq/manage.c
> index 453a8a0f4804..fe25104d8b22 100644
> --- a/kernel/irq/manage.c
> +++ b/kernel/irq/manage.c
> @@ -653,6 +653,8 @@ void __enable_irq(struct irq_desc *desc)
> break;
> }
> default:
> + if (desc->istate & IRQS_SUSPENDED)
> + goto err_out;
> desc->depth--;
> }
> }
> --- >8 ---
>
> Jon could you please give it a try? Will this change produce a warning
> for the I2C driver on a PCIe suspend for the v5.6 kernel?


Yes I can test, but I still want to know why resume is currently broken.

Jon

--
nvpublic

2020-04-28 12:41:53

by Dmitry Osipenko

[permalink] [raw]
Subject: Re: [PATCH v2 1/2] i2c: tegra: Better handle case where CPU0 is busy for a long time

28.04.2020 11:01, Jon Hunter пишет:
>
> On 27/04/2020 16:18, Dmitry Osipenko wrote:
>> 27.04.2020 18:12, Thierry Reding пишет:
>>> On Mon, Apr 27, 2020 at 05:21:30PM +0300, Dmitry Osipenko wrote:
>>>> 27.04.2020 14:00, Thierry Reding пишет:
>>>>> On Mon, Apr 27, 2020 at 12:52:10PM +0300, Dmitry Osipenko wrote:
>>>>>> 27.04.2020 10:48, Thierry Reding пишет:
>>>>>> ...
>>>>>>>> Maybe but all these other problems appear to have existed for sometime
>>>>>>>> now. We need to fix all, but for the moment we need to figure out what's
>>>>>>>> best for v5.7.
>>>>>>>
>>>>>>> To me it doesn't sound like we have a good handle on what exactly is
>>>>>>> going on here and we're mostly just poking around.
>>>>>>>
>>>>>>> And even if things weren't working quite properly before, it sounds to
>>>>>>> me like this patch actually made things worse.
>>>>>>
>>>>>> There is a plenty of time to work on the proper fix now. To me it sounds
>>>>>> like you're giving up on fixing the root of the problem, sorry.
>>>>>
>>>>> We're at -rc3 now and I haven't seen any promising progress in the last
>>>>> week. All the while suspend/resume is now broken on at least one board
>>>>> and that may end up hiding any other issues that could creep in in the
>>>>> meantime.
>>>>>
>>>>> Furthermore we seem to have a preexisting issue that may very well
>>>>> interfere with this patch, so I think the cautious thing is to revert
>>>>> for now and then fix the original issue first. We can always come back
>>>>> to this once everything is back to normal.
>>>>>
>>>>> Also, people are now looking at backporting this to v5.6. Unless we
>>>>> revert this from v5.7 it may get picked up for backports to other
>>>>> kernels and then I have to notify stable kernel maintainers that they
>>>>> shouldn't and they have to back things out again. That's going to cause
>>>>> a lot of wasted time for a lot of people.
>>>>>
>>>>> So, sorry, I disagree. I don't think we have "plenty of time".
>>>>
>>>> There is about a month now before the 5.7 release. It's a bit too early
>>>> to start the panic, IMO :)
>>>
>>> There's no panic. A patch got merged and it broken something, so we
>>> revert it and try again. It's very much standard procedure.
>>>
>>>> Jon already proposed a reasonable simple solution: to keep PCIe
>>>> regulators always-ON. In a longer run we may want to have I2C atomic
>>>> transfers supported for a late suspend phase.
>>>
>>> That's not really a solution, though, is it? It's just papering over
>>> an issue that this patch introduced or uncovered. I'm much more in
>>> favour of fixing problems at the root rather than keep papering over
>>> until we loose track of what the actual problems are.
>>
>> It's not "papering over an issue". The bug can't be fixed properly
>> without introducing I2C atomic transfers support for a late suspend
>> phase, I don't see any other solutions for now. Stable kernels do not
>> support atomic transfers at all, that proper solution won't be backportable.
>
>
> There are a few issues here, but the issue Thierry and I are referring
> to is the regression introduced by this change. Yes this exposes other
> problems, but we first need to understand why this breaks resume in
> general, regardless of what the PCIe driver is doing. I will look at
> this a bit more later this week.

Let's postpone the reverting by 1-3 weeks then. Likely that there will
be a proper (and trivial) solution by that time, otherwise it should be
okay to revert the I2C patch.

2020-04-28 13:45:08

by Dmitry Osipenko

[permalink] [raw]
Subject: Re: [PATCH v2 1/2] i2c: tegra: Better handle case where CPU0 is busy for a long time

20.04.2020 22:53, Jon Hunter пишет:
> Hi Dmitry,
>
> On 24/03/2020 19:12, Dmitry Osipenko wrote:
>> Boot CPU0 always handle I2C interrupt and under some rare circumstances
>> (like running KASAN + NFS root) it may stuck in uninterruptible state for
>> a significant time. In this case we will get timeout if I2C transfer is
>> running on a sibling CPU, despite of IRQ being raised. In order to handle
>> this rare condition, the IRQ status needs to be checked after completion
>> timeout.
>>
>> Signed-off-by: Dmitry Osipenko <[email protected]>
>> ---
>> drivers/i2c/busses/i2c-tegra.c | 27 +++++++++++++++------------
>> 1 file changed, 15 insertions(+), 12 deletions(-)
>
>
> I have noticed a regression on tegra30-cardhu-a04 when testing system
> suspend. Git bisect is pointing to this commit and reverting it fixes
> the problem. In the below console log I2C fails to resume ...
>> [ 40.888512] usb1_vbus: supplied by 5v0
> [ 40.892408] vddio_sdmmc,avdd_vdac: supplied by 5v0
> [ 40.897401] cam_1v8: disabling
> [ 40.900548] modem_3v3: disabling
> [ 40.903875] vdd_cam1_ldo: disabling
> [ 40.907501] vdd_cam2_ldo: disabling
> [ 40.911092] vdd_cam3_ldo: disabling
> [ 40.914714] vdd_fuse_3v3: disabling
> [ 40.918305] vddio_vid: disabling
> [ 40.921623] usb1_vbus: disabling
> [ 59.445032] PM: suspend entry (deep)
> [ 59.448852] Filesystems sync: 0.000 seconds
> [ 59.456161] Freezing user space processes ... (elapsed 0.001 seconds) done.
> [ 59.457645] OOM killer disabled.
> [ 59.457649] Freezing remaining freezable tasks ... (elapsed 0.001 seconds) done.

1. Previously, without my patch, I2C was failing here:

[ 59.543528] tegra-i2c 7000d000.i2c: i2c transfer timed out
[ 59.549036] vdd_sata,avdd_plle: failed to disable
[ 59.553778] Failed to disable avdd-plle: -110
[ 59.558150] tegra-pcie 3000.pcie: failed to disable regulators: -110

The I2C was failing because interrupts are force-disabled in the NOIRQ
suspend phase. This means that the regulator_bulk_disable() of the PCIe
driver failed on suspend and regulators were kept to the "enabled" state.

https://elixir.bootlin.com/linux/v5.7-rc3/source/drivers/regulator/core.c#L4526

Although, the vdd_sata,avdd_plle should be disabled actually, since I2C
returned a spurious error.

2. Now, with my patch, the I2C also times out, but I2C interrupt is
manually handled by the I2C driver after the timeout. Hence regulators
are getting disabled successfully on the PCIe suspend.

> [ 59.764926] Disabling non-boot CPUs ...
> [ 59.769540] IRQ 18: no longer affine to CPU1
> [ 59.789070] IRQ 19: no longer affine to CPU2
> [ 59.808049] IRQ 20: no longer affine to CPU3
> [ 59.827113] Entering suspend state LP1
> [ 59.827163] Enabling non-boot CPUs ...
> [ 59.834797] CPU1 is up
> [ 59.840943] CPU2 is up
> [ 59.847378] CPU3 is up
> [ 59.850577] tegra-i2c 7000d000.i2c: runtime resume failed -13
> [ 59.856432] tegra-i2c 7000d000.i2c: runtime resume failed -13
> [ 59.862231] tegra-i2c 7000d000.i2c: runtime resume failed -13
> [ 59.868070] vdd_pexa,vdd_pexb: is_enabled() failed: -13
> [ 59.873334] tegra-i2c 7000d000.i2c: runtime resume failed -13
> [ 59.879143] vdd_pexa,vdd_pexb: is_enabled() failed: -13
> [ 59.884420] Failed to enable avdd-pex-pll: -13
> [ 59.888877] Failed to enable avdd-plle: -13
> [ 59.893061] Failed to enable avdd-pexb: -13
> [ 59.897279] Failed to enable vdd-pexb: -13

3. This error didn't happen before because regulators were in the
enabled state on resume from suspend. Hence on each resume from suspend
the PCIe regulator's use-count should be bumped by one.

4. Now, the regulators are in the disabled state on resume from suspend.
Hence regulator_bulk_enable() of PCIe tries to enable them on resume,
but fails because I2C RPM can't be resumed by that time.

I'm not sure why RPM enable/disable behavior is inconsistent for
suspend/resume. Apparently the I2C RPM is getting disabled late on
suspend and also getting enabled late on resume. This needs a clarification.

> [ 59.901383] tegra-pcie 3000.pcie: failed to enable regulators: -13
> [ 60.434185] clk_plle_training: timeout waiting for PLLE
> [ 60.439565] tegra-pcie 3000.pcie: failed to enable CML clock: -16

5. The PCIe driver ignores the regulator_bulk_enable/disable() errors,
hence hardware can't power up properly.

https://elixir.bootlin.com/linux/v5.7-rc3/source/drivers/pci/controller/pci-tegra.c#L1231

> [ 60.445700] ------------[ cut here ]------------
> [ 60.450346] WARNING: CPU: 0 PID: 653 at /home/jonathanh/workdir/tegra/mlt-linux_next/kernel/drivers/regulator/core.c:2603 _regulator_disable+0xb8/0x1b4
> [ 60.463959] unbalanced disables for vdd_pexa,vdd_pexb

6. The CML clock failed to enable and PCIe tries to disable regulators.

https://elixir.bootlin.com/linux/v5.7-rc3/source/drivers/pci/controller/pci-tegra.c#L1258

But regulators failed to be enabled before and that error was ignored!
Hence now there is the unbalanced disable error.

> [ 60.469038] Modules linked in:
> [ 60.472107] CPU: 0 PID: 653 Comm: rtcwake Tainted: G W 5.7.0-rc2-next-20200420 #2
> [ 60.480892] Hardware name: NVIDIA Tegra SoC (Flattened Device Tree)
> [ 60.487190] [<c0111b68>] (unwind_backtrace) from [<c010bc00>] (show_stack+0x10/0x14)
> [ 60.494951] [<c010bc00>] (show_stack) from [<c0480f14>] (dump_stack+0xc0/0xd4)
> [ 60.502189] [<c0480f14>] (dump_stack) from [<c01234a4>] (__warn+0xe0/0xf8)
> [ 60.509073] [<c01234a4>] (__warn) from [<c0123530>] (warn_slowpath_fmt+0x74/0xb8)
> [ 60.516568] [<c0123530>] (warn_slowpath_fmt) from [<c0516714>] (_regulator_disable+0xb8/0x1b4)
> [ 60.525191] [<c0516714>] (_regulator_disable) from [<c0516844>] (regulator_disable+0x34/0xd0)
> [ 60.533729] [<c0516844>] (regulator_disable) from [<c0518488>] (regulator_bulk_disable+0x28/0xb4)
> [ 60.542619] [<c0518488>] (regulator_bulk_disable) from [<c04dbc84>] (tegra_pcie_pm_resume+0xbb0/0x107c)
> [ 60.552032] [<c04dbc84>] (tegra_pcie_pm_resume) from [<c05f7e44>] (dpm_run_callback+0x38/0x1d4)
> [ 60.560741] [<c05f7e44>] (dpm_run_callback) from [<c05f8af8>] (device_resume_noirq+0x110/0x248)
> [ 60.569451] [<c05f8af8>] (device_resume_noirq) from [<c05f93e0>] (dpm_resume_noirq+0x10c/0x36c)
> [ 60.578162] [<c05f93e0>] (dpm_resume_noirq) from [<c017dd74>] (suspend_devices_and_enter+0x27c/0x9dc)
> [ 60.587393] [<c017dd74>] (suspend_devices_and_enter) from [<c017e7dc>] (pm_suspend+0x308/0x370)
> [ 60.596110] [<c017e7dc>] (pm_suspend) from [<c017cb30>] (state_store+0x6c/0xc8)
> [ 60.603440] [<c017cb30>] (state_store) from [<c03138e4>] (kernfs_fop_write+0xf8/0x210)
> [ 60.611379] [<c03138e4>] (kernfs_fop_write) from [<c0286c44>] (__vfs_write+0x2c/0x1c4)
> [ 60.619310] [<c0286c44>] (__vfs_write) from [<c02886e8>] (vfs_write+0xa4/0x188)
> [ 60.626632] [<c02886e8>] (vfs_write) from [<c028898c>] (ksys_write+0xa4/0xd4)
> [ 60.633778] [<c028898c>] (ksys_write) from [<c01000c0>] (ret_fast_syscall+0x0/0x54)
> [ 60.641437] Exception stack(0xeda91fa8 to 0xeda91ff0)
> [ 60.646497] 1fa0: 0000006c 00498438 00000004 00498438 00000004 00000000
> [ 60.654683] 1fc0: 0000006c 00498438 00497228 00000004 00000004 00000004 0048478c 00497228
> [ 60.662866] 1fe0: 00000004 be9029b8 b6ec8c0b b6e53206
> [ 60.668007] ---[ end trace 5453317048e46ae9 ]---
> [ 60.672632] Failed to disable vdd-pexb: -5
> [ 60.676761] tegra-pcie 3000.pcie: tegra pcie power on fail: -16
> [ 60.682694] PM: dpm_run_callback(): tegra_pcie_pm_resume+0x0/0x107c returns -16
> [ 60.690035] PM: Device 3000.pcie failed to resume noirq: error -16
> [ 60.696859] tegra-mc 7000f000.memory-controller: fdcdwr2: write @0x877e8400: EMEM address decode error (SMMU translation error [--S])
> [ 60.708876] tegra-mc 7000f000.memory-controller: fdcdwr2: write @0x877e8400: Page fault (SMMU translation error [--S])

7. This MC error may happen because the PCIe regulators were not enabled
during the resume process and then ungating PCIe powerdomain and clocks
caused the SoC hardware malfunction.

2020-04-28 23:14:29

by Dmitry Osipenko

[permalink] [raw]
Subject: Re: [PATCH v2 1/2] i2c: tegra: Better handle case where CPU0 is busy for a long time

28.04.2020 11:02, Jon Hunter пишет:
>
> On 27/04/2020 16:38, Dmitry Osipenko wrote:
>> 27.04.2020 17:45, Dmitry Osipenko пишет:
>>> 27.04.2020 17:13, Dmitry Osipenko пишет:
>>>> 27.04.2020 15:46, Dmitry Osipenko пишет:
>>>>> 23.04.2020 13:56, Jon Hunter пишет:
>>>>>>>> So I think that part of the problem already existed prior to these
>>>>>>>> patches. Without your patches I see ...
>>>>>>>>
>>>>>>>> [ 59.543528] tegra-i2c 7000d000.i2c: i2c transfer timed out
>>>>>>>> [ 59.549036] vdd_sata,avdd_plle: failed to disable
>>>>>>>> [ 59.553778] Failed to disable avdd-plle: -110
>>>>>>>> [ 59.558150] tegra-pcie 3000.pcie: failed to disable regulators: -110
>>>>>>> Does this I2C timeout happen with my patches? Could you please post full
>>>>>>> logs of an older and the recent kernel versions?
>>>>>> I believe that it does, but I need to check.
>>>>>>
>>>>>
>>>>> Jon, could you please confirm that you're seeing those regulator-disable
>>>>> errors with my patch? I don't see those errors in yours original log [1].
>>>>>
>>>>> [1]
>>>>> https://lore.kernel.org/lkml/[email protected]/
>>>>>
>>>>> Again, could you please post the *full* logs?
>>>>>
>>>>> If regulator's disabling was "failing" before without my patch because
>>>>> of the I2C interrupt being force-disabled during of NOIRQ phase, and now
>>>>> regulator's disabling succeeds with my patch because IRQ is manually
>>>>> handled after the timeout, then this could be bad. It means that
>>>>> regulator was actually getting disabled, but I2C driver was timing out
>>>>> because interrupt couldn't be handled in NOIRQ phase, which should
>>>>> result in a dead PCIe on a resume from suspend since regulator's core
>>>>> thinks that regulator is enabled (I2C said it failed to disable), while
>>>>> it is actually disabled.
>>>>>
>>>>> Do you have anything plugged into the PCIe slot in yours testing farm?
>>>>> It wouldn't surprise me if the plugged card isn't functional after
>>>>> resume from suspend on a stable kernels.
>>>>>
>>>>
>>>> I actually now see that interrupt is not allowed to be enabled during
>>>> the NOIRQ phase:
>>>>
>>>> https://elixir.bootlin.com/linux/v5.7-rc3/source/kernel/irq/manage.c#L640
>>>>
>>>> it should be worthwhile to turn it into a WARN_ON.
>>>>
>>>
>>> Oh, wait! There is already a warning there.. hmm.
>>>
>>
>> Aha, the disable depth for the I2C interrupt is 2 after
>> suspend_device_irq(), that's why there is no warning.
>>
>> This should catch the bug and trigger the warning:
>>
>> --- >8 ---
>> diff --git a/kernel/irq/manage.c b/kernel/irq/manage.c
>> index 453a8a0f4804..fe25104d8b22 100644
>> --- a/kernel/irq/manage.c
>> +++ b/kernel/irq/manage.c
>> @@ -653,6 +653,8 @@ void __enable_irq(struct irq_desc *desc)
>> break;
>> }
>> default:
>> + if (desc->istate & IRQS_SUSPENDED)
>> + goto err_out;
>> desc->depth--;
>> }
>> }
>> --- >8 ---
>>
>> Jon could you please give it a try? Will this change produce a warning
>> for the I2C driver on a PCIe suspend for the v5.6 kernel?
>
>
> Yes I can test, but I still want to know why resume is currently broken.

BTW, I guess we could use the IRQF_NO_SUSPEND flag for I2C interrupt.
Then it should be possible to use I2C in the late suspend without the
need for atomic transfers, once RPM is resolved.

2020-04-29 08:17:45

by Thierry Reding

[permalink] [raw]
Subject: Re: [PATCH v2 1/2] i2c: tegra: Better handle case where CPU0 is busy for a long time

On Mon, Apr 27, 2020 at 06:18:34PM +0300, Dmitry Osipenko wrote:
> 27.04.2020 18:12, Thierry Reding пишет:
> > On Mon, Apr 27, 2020 at 05:21:30PM +0300, Dmitry Osipenko wrote:
> >> 27.04.2020 14:00, Thierry Reding пишет:
> >>> On Mon, Apr 27, 2020 at 12:52:10PM +0300, Dmitry Osipenko wrote:
> >>>> 27.04.2020 10:48, Thierry Reding пишет:
> >>>> ...
> >>>>>> Maybe but all these other problems appear to have existed for sometime
> >>>>>> now. We need to fix all, but for the moment we need to figure out what's
> >>>>>> best for v5.7.
> >>>>>
> >>>>> To me it doesn't sound like we have a good handle on what exactly is
> >>>>> going on here and we're mostly just poking around.
> >>>>>
> >>>>> And even if things weren't working quite properly before, it sounds to
> >>>>> me like this patch actually made things worse.
> >>>>
> >>>> There is a plenty of time to work on the proper fix now. To me it sounds
> >>>> like you're giving up on fixing the root of the problem, sorry.
> >>>
> >>> We're at -rc3 now and I haven't seen any promising progress in the last
> >>> week. All the while suspend/resume is now broken on at least one board
> >>> and that may end up hiding any other issues that could creep in in the
> >>> meantime.
> >>>
> >>> Furthermore we seem to have a preexisting issue that may very well
> >>> interfere with this patch, so I think the cautious thing is to revert
> >>> for now and then fix the original issue first. We can always come back
> >>> to this once everything is back to normal.
> >>>
> >>> Also, people are now looking at backporting this to v5.6. Unless we
> >>> revert this from v5.7 it may get picked up for backports to other
> >>> kernels and then I have to notify stable kernel maintainers that they
> >>> shouldn't and they have to back things out again. That's going to cause
> >>> a lot of wasted time for a lot of people.
> >>>
> >>> So, sorry, I disagree. I don't think we have "plenty of time".
> >>
> >> There is about a month now before the 5.7 release. It's a bit too early
> >> to start the panic, IMO :)
> >
> > There's no panic. A patch got merged and it broken something, so we
> > revert it and try again. It's very much standard procedure.
> >
> >> Jon already proposed a reasonable simple solution: to keep PCIe
> >> regulators always-ON. In a longer run we may want to have I2C atomic
> >> transfers supported for a late suspend phase.
> >
> > That's not really a solution, though, is it? It's just papering over
> > an issue that this patch introduced or uncovered. I'm much more in
> > favour of fixing problems at the root rather than keep papering over
> > until we loose track of what the actual problems are.
>
> It's not "papering over an issue". The bug can't be fixed properly
> without introducing I2C atomic transfers support for a late suspend
> phase, I don't see any other solutions for now. Stable kernels do not
> support atomic transfers at all, that proper solution won't be backportable.

Hm... on a hunch I tried something and, lo and behold, it worked. I can
get Cardhu to properly suspend/resume on top of v5.7-rc3 with the
following sequence:

revert 9f42de8d4ec2 i2c: tegra: Fix suspending in active runtime PM state
apply http://patchwork.ozlabs.org/project/linux-tegra/patch/[email protected]/

I also ran that through our test farm and I don't see any other issues.
At the time I was already skeptical about pm_runtime_force_suspend() and
pm_runtime_force_resume() and while I'm not fully certain why exactly it
doesn't work, the above on top of v5.7-rc3 seems like a good option.

I'll try to do some digging if I can find out why exactly force suspend
and resume doesn't work.

Thierry


Attachments:
(No filename) (3.71 kB)
signature.asc (849.00 B)
Download all attachments

2020-04-29 08:56:53

by Thierry Reding

[permalink] [raw]
Subject: Re: [PATCH v2 1/2] i2c: tegra: Better handle case where CPU0 is busy for a long time

On Wed, Apr 29, 2020 at 10:14:48AM +0200, Thierry Reding wrote:
> On Mon, Apr 27, 2020 at 06:18:34PM +0300, Dmitry Osipenko wrote:
> > 27.04.2020 18:12, Thierry Reding пишет:
> > > On Mon, Apr 27, 2020 at 05:21:30PM +0300, Dmitry Osipenko wrote:
> > >> 27.04.2020 14:00, Thierry Reding пишет:
> > >>> On Mon, Apr 27, 2020 at 12:52:10PM +0300, Dmitry Osipenko wrote:
> > >>>> 27.04.2020 10:48, Thierry Reding пишет:
> > >>>> ...
> > >>>>>> Maybe but all these other problems appear to have existed for sometime
> > >>>>>> now. We need to fix all, but for the moment we need to figure out what's
> > >>>>>> best for v5.7.
> > >>>>>
> > >>>>> To me it doesn't sound like we have a good handle on what exactly is
> > >>>>> going on here and we're mostly just poking around.
> > >>>>>
> > >>>>> And even if things weren't working quite properly before, it sounds to
> > >>>>> me like this patch actually made things worse.
> > >>>>
> > >>>> There is a plenty of time to work on the proper fix now. To me it sounds
> > >>>> like you're giving up on fixing the root of the problem, sorry.
> > >>>
> > >>> We're at -rc3 now and I haven't seen any promising progress in the last
> > >>> week. All the while suspend/resume is now broken on at least one board
> > >>> and that may end up hiding any other issues that could creep in in the
> > >>> meantime.
> > >>>
> > >>> Furthermore we seem to have a preexisting issue that may very well
> > >>> interfere with this patch, so I think the cautious thing is to revert
> > >>> for now and then fix the original issue first. We can always come back
> > >>> to this once everything is back to normal.
> > >>>
> > >>> Also, people are now looking at backporting this to v5.6. Unless we
> > >>> revert this from v5.7 it may get picked up for backports to other
> > >>> kernels and then I have to notify stable kernel maintainers that they
> > >>> shouldn't and they have to back things out again. That's going to cause
> > >>> a lot of wasted time for a lot of people.
> > >>>
> > >>> So, sorry, I disagree. I don't think we have "plenty of time".
> > >>
> > >> There is about a month now before the 5.7 release. It's a bit too early
> > >> to start the panic, IMO :)
> > >
> > > There's no panic. A patch got merged and it broken something, so we
> > > revert it and try again. It's very much standard procedure.
> > >
> > >> Jon already proposed a reasonable simple solution: to keep PCIe
> > >> regulators always-ON. In a longer run we may want to have I2C atomic
> > >> transfers supported for a late suspend phase.
> > >
> > > That's not really a solution, though, is it? It's just papering over
> > > an issue that this patch introduced or uncovered. I'm much more in
> > > favour of fixing problems at the root rather than keep papering over
> > > until we loose track of what the actual problems are.
> >
> > It's not "papering over an issue". The bug can't be fixed properly
> > without introducing I2C atomic transfers support for a late suspend
> > phase, I don't see any other solutions for now. Stable kernels do not
> > support atomic transfers at all, that proper solution won't be backportable.
>
> Hm... on a hunch I tried something and, lo and behold, it worked. I can
> get Cardhu to properly suspend/resume on top of v5.7-rc3 with the
> following sequence:
>
> revert 9f42de8d4ec2 i2c: tegra: Fix suspending in active runtime PM state
> apply http://patchwork.ozlabs.org/project/linux-tegra/patch/[email protected]/
>
> I also ran that through our test farm and I don't see any other issues.
> At the time I was already skeptical about pm_runtime_force_suspend() and
> pm_runtime_force_resume() and while I'm not fully certain why exactly it
> doesn't work, the above on top of v5.7-rc3 seems like a good option.
>
> I'll try to do some digging if I can find out why exactly force suspend
> and resume doesn't work.

Ah... so it looks like pm_runtime_force_resume() never actually does
anything in this case and then disable_depth remains at 1 and the first
tegra_i2c_xfer() will then fail to runtime resume the controller.

Thierry


Attachments:
(No filename) (4.12 kB)
signature.asc (849.00 B)
Download all attachments

2020-04-29 12:37:22

by Dmitry Osipenko

[permalink] [raw]
Subject: Re: [PATCH v2 1/2] i2c: tegra: Better handle case where CPU0 is busy for a long time

29.04.2020 11:55, Thierry Reding пишет:
...
>>> It's not "papering over an issue". The bug can't be fixed properly
>>> without introducing I2C atomic transfers support for a late suspend
>>> phase, I don't see any other solutions for now. Stable kernels do not
>>> support atomic transfers at all, that proper solution won't be backportable.
>>
>> Hm... on a hunch I tried something and, lo and behold, it worked. I can
>> get Cardhu to properly suspend/resume on top of v5.7-rc3 with the
>> following sequence:
>>
>> revert 9f42de8d4ec2 i2c: tegra: Fix suspending in active runtime PM state
>> apply http://patchwork.ozlabs.org/project/linux-tegra/patch/[email protected]/
>>
>> I also ran that through our test farm and I don't see any other issues.
>> At the time I was already skeptical about pm_runtime_force_suspend() and
>> pm_runtime_force_resume() and while I'm not fully certain why exactly it
>> doesn't work, the above on top of v5.7-rc3 seems like a good option.
>>
>> I'll try to do some digging if I can find out why exactly force suspend
>> and resume doesn't work.
>
> Ah... so it looks like pm_runtime_force_resume() never actually does
> anything in this case and then disable_depth remains at 1 and the first
> tegra_i2c_xfer() will then fail to runtime resume the controller.

That's the exactly expected behaviour of the RPM force suspend/resume.
The only unexpected part for me is that the tegra_i2c_xfer() runtime
resume then fails in the NOIRQ phase.

Anyways, again, today it's wrong to use I2C in the NOIRQ phase becasue
I2C interrupt is disabled. It's the PCIe driver that should be fixed.

2020-04-29 14:00:45

by Jon Hunter

[permalink] [raw]
Subject: Re: [PATCH v2 1/2] i2c: tegra: Better handle case where CPU0 is busy for a long time


On 29/04/2020 13:35, Dmitry Osipenko wrote:
> 29.04.2020 11:55, Thierry Reding пишет:
> ...
>>>> It's not "papering over an issue". The bug can't be fixed properly
>>>> without introducing I2C atomic transfers support for a late suspend
>>>> phase, I don't see any other solutions for now. Stable kernels do not
>>>> support atomic transfers at all, that proper solution won't be backportable.
>>>
>>> Hm... on a hunch I tried something and, lo and behold, it worked. I can
>>> get Cardhu to properly suspend/resume on top of v5.7-rc3 with the
>>> following sequence:
>>>
>>> revert 9f42de8d4ec2 i2c: tegra: Fix suspending in active runtime PM state
>>> apply http://patchwork.ozlabs.org/project/linux-tegra/patch/[email protected]/
>>>
>>> I also ran that through our test farm and I don't see any other issues.
>>> At the time I was already skeptical about pm_runtime_force_suspend() and
>>> pm_runtime_force_resume() and while I'm not fully certain why exactly it
>>> doesn't work, the above on top of v5.7-rc3 seems like a good option.
>>>
>>> I'll try to do some digging if I can find out why exactly force suspend
>>> and resume doesn't work.
>>
>> Ah... so it looks like pm_runtime_force_resume() never actually does
>> anything in this case and then disable_depth remains at 1 and the first
>> tegra_i2c_xfer() will then fail to runtime resume the controller.
>
> That's the exactly expected behaviour of the RPM force suspend/resume.
> The only unexpected part for me is that the tegra_i2c_xfer() runtime
> resume then fails in the NOIRQ phase.

From reading the changelog for commit 1e2ef05bb8cf ("PM: Limit race
conditions between runtime PM and system sleep (v2))", this is the
expected behaviour for runtime resume in the noirq phase.

Jon

--
nvpublic

2020-04-29 14:51:10

by Dmitry Osipenko

[permalink] [raw]
Subject: Re: [PATCH v2 1/2] i2c: tegra: Better handle case where CPU0 is busy for a long time

29.04.2020 16:57, Jon Hunter пишет:
>
> On 29/04/2020 13:35, Dmitry Osipenko wrote:
>> 29.04.2020 11:55, Thierry Reding пишет:
>> ...
>>>>> It's not "papering over an issue". The bug can't be fixed properly
>>>>> without introducing I2C atomic transfers support for a late suspend
>>>>> phase, I don't see any other solutions for now. Stable kernels do not
>>>>> support atomic transfers at all, that proper solution won't be backportable.
>>>>
>>>> Hm... on a hunch I tried something and, lo and behold, it worked. I can
>>>> get Cardhu to properly suspend/resume on top of v5.7-rc3 with the
>>>> following sequence:
>>>>
>>>> revert 9f42de8d4ec2 i2c: tegra: Fix suspending in active runtime PM state
>>>> apply http://patchwork.ozlabs.org/project/linux-tegra/patch/[email protected]/
>>>>
>>>> I also ran that through our test farm and I don't see any other issues.
>>>> At the time I was already skeptical about pm_runtime_force_suspend() and
>>>> pm_runtime_force_resume() and while I'm not fully certain why exactly it
>>>> doesn't work, the above on top of v5.7-rc3 seems like a good option.
>>>>
>>>> I'll try to do some digging if I can find out why exactly force suspend
>>>> and resume doesn't work.
>>>
>>> Ah... so it looks like pm_runtime_force_resume() never actually does
>>> anything in this case and then disable_depth remains at 1 and the first
>>> tegra_i2c_xfer() will then fail to runtime resume the controller.
>>
>> That's the exactly expected behaviour of the RPM force suspend/resume.
>> The only unexpected part for me is that the tegra_i2c_xfer() runtime
>> resume then fails in the NOIRQ phase.
>
> From reading the changelog for commit 1e2ef05bb8cf ("PM: Limit race
> conditions between runtime PM and system sleep (v2))", this is the
> expected behaviour for runtime resume in the noirq phase.

I'm curious whether there is a way to tell RPM that it's okay to do it
for a particular device, like I2C that uses IRQ-safe RPM + doesn't have
parent devices that need to be resumed.

2020-04-29 16:26:33

by Thierry Reding

[permalink] [raw]
Subject: Re: [PATCH v2 1/2] i2c: tegra: Better handle case where CPU0 is busy for a long time

On Wed, Apr 29, 2020 at 05:46:46PM +0300, Dmitry Osipenko wrote:
> 29.04.2020 16:57, Jon Hunter пишет:
> >
> > On 29/04/2020 13:35, Dmitry Osipenko wrote:
> >> 29.04.2020 11:55, Thierry Reding пишет:
> >> ...
> >>>>> It's not "papering over an issue". The bug can't be fixed properly
> >>>>> without introducing I2C atomic transfers support for a late suspend
> >>>>> phase, I don't see any other solutions for now. Stable kernels do not
> >>>>> support atomic transfers at all, that proper solution won't be backportable.
> >>>>
> >>>> Hm... on a hunch I tried something and, lo and behold, it worked. I can
> >>>> get Cardhu to properly suspend/resume on top of v5.7-rc3 with the
> >>>> following sequence:
> >>>>
> >>>> revert 9f42de8d4ec2 i2c: tegra: Fix suspending in active runtime PM state
> >>>> apply http://patchwork.ozlabs.org/project/linux-tegra/patch/[email protected]/
> >>>>
> >>>> I also ran that through our test farm and I don't see any other issues.
> >>>> At the time I was already skeptical about pm_runtime_force_suspend() and
> >>>> pm_runtime_force_resume() and while I'm not fully certain why exactly it
> >>>> doesn't work, the above on top of v5.7-rc3 seems like a good option.
> >>>>
> >>>> I'll try to do some digging if I can find out why exactly force suspend
> >>>> and resume doesn't work.
> >>>
> >>> Ah... so it looks like pm_runtime_force_resume() never actually does
> >>> anything in this case and then disable_depth remains at 1 and the first
> >>> tegra_i2c_xfer() will then fail to runtime resume the controller.
> >>
> >> That's the exactly expected behaviour of the RPM force suspend/resume.
> >> The only unexpected part for me is that the tegra_i2c_xfer() runtime
> >> resume then fails in the NOIRQ phase.
> >
> > From reading the changelog for commit 1e2ef05bb8cf ("PM: Limit race
> > conditions between runtime PM and system sleep (v2))", this is the
> > expected behaviour for runtime resume in the noirq phase.
>
> I'm curious whether there is a way to tell RPM that it's okay to do it
> for a particular device, like I2C that uses IRQ-safe RPM + doesn't have
> parent devices that need to be resumed.

Been there, done that:

http://patchwork.ozlabs.org/project/linux-tegra/patch/[email protected]/

Thierry


Attachments:
(No filename) (2.33 kB)
signature.asc (849.00 B)
Download all attachments

2020-04-29 16:32:29

by Thierry Reding

[permalink] [raw]
Subject: Re: [PATCH v2 1/2] i2c: tegra: Better handle case where CPU0 is busy for a long time

On Wed, Apr 29, 2020 at 03:35:26PM +0300, Dmitry Osipenko wrote:
> 29.04.2020 11:55, Thierry Reding пишет:
> ...
> >>> It's not "papering over an issue". The bug can't be fixed properly
> >>> without introducing I2C atomic transfers support for a late suspend
> >>> phase, I don't see any other solutions for now. Stable kernels do not
> >>> support atomic transfers at all, that proper solution won't be backportable.
> >>
> >> Hm... on a hunch I tried something and, lo and behold, it worked. I can
> >> get Cardhu to properly suspend/resume on top of v5.7-rc3 with the
> >> following sequence:
> >>
> >> revert 9f42de8d4ec2 i2c: tegra: Fix suspending in active runtime PM state
> >> apply http://patchwork.ozlabs.org/project/linux-tegra/patch/[email protected]/
> >>
> >> I also ran that through our test farm and I don't see any other issues.
> >> At the time I was already skeptical about pm_runtime_force_suspend() and
> >> pm_runtime_force_resume() and while I'm not fully certain why exactly it
> >> doesn't work, the above on top of v5.7-rc3 seems like a good option.
> >>
> >> I'll try to do some digging if I can find out why exactly force suspend
> >> and resume doesn't work.
> >
> > Ah... so it looks like pm_runtime_force_resume() never actually does
> > anything in this case and then disable_depth remains at 1 and the first
> > tegra_i2c_xfer() will then fail to runtime resume the controller.
>
> That's the exactly expected behaviour of the RPM force suspend/resume.
> The only unexpected part for me is that the tegra_i2c_xfer() runtime
> resume then fails in the NOIRQ phase.
>
> Anyways, again, today it's wrong to use I2C in the NOIRQ phase becasue
> I2C interrupt is disabled. It's the PCIe driver that should be fixed.

I don't think so. Everything works perfectly fine if we fix system
suspend/resume not to rely on pm_runtime_force_{suspend,resume}() and
directly call the runtime suspend/resume implementations.

So can we please stop deflecting and fix the damn I2C driver. From my
perspective we have two choices:

1) do what I suggested above and revert the force suspend/resume patch
and apply the "manual" suspend/resume patch instead

2) revert this patch and go back to the drawing board

I suspect that with 2) we'd end up back where we started and have to do
1) anyway.

An alternative that I'd prefer even more would be to do 2) now for v5.7
and then we do 1) for v5.8 and give this some more soaking time.

Thierry


Attachments:
(No filename) (2.50 kB)
signature.asc (849.00 B)
Download all attachments

2020-04-29 16:57:43

by Dmitry Osipenko

[permalink] [raw]
Subject: Re: [PATCH v2 1/2] i2c: tegra: Better handle case where CPU0 is busy for a long time

29.04.2020 19:30, Thierry Reding пишет:
> On Wed, Apr 29, 2020 at 03:35:26PM +0300, Dmitry Osipenko wrote:
>> 29.04.2020 11:55, Thierry Reding пишет:
>> ...
>>>>> It's not "papering over an issue". The bug can't be fixed properly
>>>>> without introducing I2C atomic transfers support for a late suspend
>>>>> phase, I don't see any other solutions for now. Stable kernels do not
>>>>> support atomic transfers at all, that proper solution won't be backportable.
>>>>
>>>> Hm... on a hunch I tried something and, lo and behold, it worked. I can
>>>> get Cardhu to properly suspend/resume on top of v5.7-rc3 with the
>>>> following sequence:
>>>>
>>>> revert 9f42de8d4ec2 i2c: tegra: Fix suspending in active runtime PM state
>>>> apply http://patchwork.ozlabs.org/project/linux-tegra/patch/[email protected]/
>>>>
>>>> I also ran that through our test farm and I don't see any other issues.
>>>> At the time I was already skeptical about pm_runtime_force_suspend() and
>>>> pm_runtime_force_resume() and while I'm not fully certain why exactly it
>>>> doesn't work, the above on top of v5.7-rc3 seems like a good option.
>>>>
>>>> I'll try to do some digging if I can find out why exactly force suspend
>>>> and resume doesn't work.
>>>
>>> Ah... so it looks like pm_runtime_force_resume() never actually does
>>> anything in this case and then disable_depth remains at 1 and the first
>>> tegra_i2c_xfer() will then fail to runtime resume the controller.
>>
>> That's the exactly expected behaviour of the RPM force suspend/resume.
>> The only unexpected part for me is that the tegra_i2c_xfer() runtime
>> resume then fails in the NOIRQ phase.
>>
>> Anyways, again, today it's wrong to use I2C in the NOIRQ phase becasue
>> I2C interrupt is disabled. It's the PCIe driver that should be fixed.
>
> I don't think so. Everything works perfectly fine if we fix system
> suspend/resume not to rely on pm_runtime_force_{suspend,resume}() and
> directly call the runtime suspend/resume implementations.

It should "work" only in conjunction with my I2C patch, otherwise you'll
get a spurious I2C timeout error. And it will "work" only because
interrupt is handled manually after the timeout, meaning that yours
suspending time will take few hundreds ms more.

> So can we please stop deflecting and fix the damn I2C driver. From my
> perspective we have two choices:
>
> 1) do what I suggested above and revert the force suspend/resume patch
> and apply the "manual" suspend/resume patch instead
>
> 2) revert this patch and go back to the drawing board
>
> I suspect that with 2) we'd end up back where we started and have to do
> 1) anyway.
>
> An alternative that I'd prefer even more would be to do 2) now for v5.7
> and then we do 1) for v5.8 and give this some more soaking time.

I2C driver isn't broken, PCIe driver is. IMO.

Both yours variants are not going to be a backportable fix for the
stable kernels, they won't fix the suspended interrupt problem. What I'm
missing?

2020-04-29 17:06:32

by Dmitry Osipenko

[permalink] [raw]
Subject: Re: [PATCH v2 1/2] i2c: tegra: Better handle case where CPU0 is busy for a long time

29.04.2020 19:24, Thierry Reding пишет:
> On Wed, Apr 29, 2020 at 05:46:46PM +0300, Dmitry Osipenko wrote:
>> 29.04.2020 16:57, Jon Hunter пишет:
>>>
>>> On 29/04/2020 13:35, Dmitry Osipenko wrote:
>>>> 29.04.2020 11:55, Thierry Reding пишет:
>>>> ...
>>>>>>> It's not "papering over an issue". The bug can't be fixed properly
>>>>>>> without introducing I2C atomic transfers support for a late suspend
>>>>>>> phase, I don't see any other solutions for now. Stable kernels do not
>>>>>>> support atomic transfers at all, that proper solution won't be backportable.
>>>>>>
>>>>>> Hm... on a hunch I tried something and, lo and behold, it worked. I can
>>>>>> get Cardhu to properly suspend/resume on top of v5.7-rc3 with the
>>>>>> following sequence:
>>>>>>
>>>>>> revert 9f42de8d4ec2 i2c: tegra: Fix suspending in active runtime PM state
>>>>>> apply http://patchwork.ozlabs.org/project/linux-tegra/patch/[email protected]/
>>>>>>
>>>>>> I also ran that through our test farm and I don't see any other issues.
>>>>>> At the time I was already skeptical about pm_runtime_force_suspend() and
>>>>>> pm_runtime_force_resume() and while I'm not fully certain why exactly it
>>>>>> doesn't work, the above on top of v5.7-rc3 seems like a good option.
>>>>>>
>>>>>> I'll try to do some digging if I can find out why exactly force suspend
>>>>>> and resume doesn't work.
>>>>>
>>>>> Ah... so it looks like pm_runtime_force_resume() never actually does
>>>>> anything in this case and then disable_depth remains at 1 and the first
>>>>> tegra_i2c_xfer() will then fail to runtime resume the controller.
>>>>
>>>> That's the exactly expected behaviour of the RPM force suspend/resume.
>>>> The only unexpected part for me is that the tegra_i2c_xfer() runtime
>>>> resume then fails in the NOIRQ phase.
>>>
>>> From reading the changelog for commit 1e2ef05bb8cf ("PM: Limit race
>>> conditions between runtime PM and system sleep (v2))", this is the
>>> expected behaviour for runtime resume in the noirq phase.
>>
>> I'm curious whether there is a way to tell RPM that it's okay to do it
>> for a particular device, like I2C that uses IRQ-safe RPM + doesn't have
>> parent devices that need to be resumed.
>
> Been there, done that:
>
> http://patchwork.ozlabs.org/project/linux-tegra/patch/[email protected]/

It should work, but it looks to me more like a hack rather than a proper
fix. At least I haven't seen any other drivers doing anything like that.

I don't have any better suggestions for now, so perhaps it should be a
good enough solution for the starter, combined with setting the
IRQF_NO_SUSPEND flag for I2C interrupt. It should allow drivers like
PCIe to use I2C in the NOIRQ phase.

Maybe it could be worthwhile to try to ask Rafael about how drivers
should handle this situation in regards to the RPM usage.

2020-04-29 17:36:12

by Dmitry Osipenko

[permalink] [raw]
Subject: Re: [PATCH v2 1/2] i2c: tegra: Better handle case where CPU0 is busy for a long time

29.04.2020 19:54, Dmitry Osipenko пишет:
> 29.04.2020 19:30, Thierry Reding пишет:
>> On Wed, Apr 29, 2020 at 03:35:26PM +0300, Dmitry Osipenko wrote:
>>> 29.04.2020 11:55, Thierry Reding пишет:
>>> ...
>>>>>> It's not "papering over an issue". The bug can't be fixed properly
>>>>>> without introducing I2C atomic transfers support for a late suspend
>>>>>> phase, I don't see any other solutions for now. Stable kernels do not
>>>>>> support atomic transfers at all, that proper solution won't be backportable.
>>>>>
>>>>> Hm... on a hunch I tried something and, lo and behold, it worked. I can
>>>>> get Cardhu to properly suspend/resume on top of v5.7-rc3 with the
>>>>> following sequence:
>>>>>
>>>>> revert 9f42de8d4ec2 i2c: tegra: Fix suspending in active runtime PM state
>>>>> apply http://patchwork.ozlabs.org/project/linux-tegra/patch/[email protected]/
>>>>>
>>>>> I also ran that through our test farm and I don't see any other issues.
>>>>> At the time I was already skeptical about pm_runtime_force_suspend() and
>>>>> pm_runtime_force_resume() and while I'm not fully certain why exactly it
>>>>> doesn't work, the above on top of v5.7-rc3 seems like a good option.
>>>>>
>>>>> I'll try to do some digging if I can find out why exactly force suspend
>>>>> and resume doesn't work.
>>>>
>>>> Ah... so it looks like pm_runtime_force_resume() never actually does
>>>> anything in this case and then disable_depth remains at 1 and the first
>>>> tegra_i2c_xfer() will then fail to runtime resume the controller.
>>>
>>> That's the exactly expected behaviour of the RPM force suspend/resume.
>>> The only unexpected part for me is that the tegra_i2c_xfer() runtime
>>> resume then fails in the NOIRQ phase.
>>>
>>> Anyways, again, today it's wrong to use I2C in the NOIRQ phase becasue
>>> I2C interrupt is disabled. It's the PCIe driver that should be fixed.
>>
>> I don't think so. Everything works perfectly fine if we fix system
>> suspend/resume not to rely on pm_runtime_force_{suspend,resume}() and
>> directly call the runtime suspend/resume implementations.
>
> It should "work" only in conjunction with my I2C patch, otherwise you'll
> get a spurious I2C timeout error. And it will "work" only because
> interrupt is handled manually after the timeout, meaning that yours
> suspending time will take few hundreds ms more.
>
>> So can we please stop deflecting and fix the damn I2C driver. From my
>> perspective we have two choices:
>>
>> 1) do what I suggested above and revert the force suspend/resume patch
>> and apply the "manual" suspend/resume patch instead
>>
>> 2) revert this patch and go back to the drawing board
>>
>> I suspect that with 2) we'd end up back where we started and have to do
>> 1) anyway.
>>
>> An alternative that I'd prefer even more would be to do 2) now for v5.7
>> and then we do 1) for v5.8 and give this some more soaking time.
>
> I2C driver isn't broken, PCIe driver is. IMO.
>
> Both yours variants are not going to be a backportable fix for the
> stable kernels, they won't fix the suspended interrupt problem. What I'm
> missing?
>

My proposal:

1. Fix PCIe driver by keeping regulator always-ON, propagate it to
stable kernels.

2. Make I2C driver usable in NOIRQ phase.

3. Make PCIe driver to handle errors properly.

4. Revert the PCIe driver "fix".

2020-05-02 14:42:36

by Dmitry Osipenko

[permalink] [raw]
Subject: Re: [PATCH v2 1/2] i2c: tegra: Better handle case where CPU0 is busy for a long time

27.04.2020 18:31, Wolfram Sang пишет:
>
>> Yes, that bug should be fixed anyway. But that doesn't justify breaking
>> suspend/resume completely, which *is* a regression.
>>
>> Look, I'm not saying that we should drop this patch altogether. All I'm
>> saying is that we should postpone it so that we can: a) get suspend and
>> resume working again (and by doing so make sure no other suspend/resume
>> regressions silently creep in, because that always seems to happen when
>> you're not looking) and b) fix any preexisting issues without possibly
>> scrambling the result with this perhaps unrelated fix.
>>
>> So, again, I think the safest road forward is to back this one out for
>> now, fix whatever this other bug is and once suspend/resume is working
>> properly again we can revisit this patch based on a known-good baseline.
>
> I am with you here. I want to add that the proper fix should be
> developed without thinking too much about stable in the first place.
> *When* we have a proper working fix, then we can think about making it
> "more" suitable for backporting. Yet, it may also be a result that older
> kernels need a different solution. Or have no solution at all, in case
> they can't do atomic_transfers and this is needed.
>
> D'accord?
>

I saw that you submitted the revert of the patches for 5.7, hopefully it
won't result in putting the PCIe driver problem into the back burner.
I'll try not to forget about these patches to resubmit them later on,
once the problem will be resolved :)

2020-05-02 14:47:15

by Wolfram Sang

[permalink] [raw]
Subject: Re: [PATCH v2 1/2] i2c: tegra: Better handle case where CPU0 is busy for a long time


> I'll try not to forget about these patches to resubmit them later on,
> once the problem will be resolved :)

Don't worry, I'll keep an eye on these issues, too :)


Attachments:
(No filename) (174.00 B)
signature.asc (849.00 B)
Download all attachments

2020-05-04 15:47:06

by Thierry Reding

[permalink] [raw]
Subject: Re: [PATCH v2 1/2] i2c: tegra: Better handle case where CPU0 is busy for a long time

On Sat, May 02, 2020 at 05:40:35PM +0300, Dmitry Osipenko wrote:
> 27.04.2020 18:31, Wolfram Sang пишет:
> >
> >> Yes, that bug should be fixed anyway. But that doesn't justify breaking
> >> suspend/resume completely, which *is* a regression.
> >>
> >> Look, I'm not saying that we should drop this patch altogether. All I'm
> >> saying is that we should postpone it so that we can: a) get suspend and
> >> resume working again (and by doing so make sure no other suspend/resume
> >> regressions silently creep in, because that always seems to happen when
> >> you're not looking) and b) fix any preexisting issues without possibly
> >> scrambling the result with this perhaps unrelated fix.
> >>
> >> So, again, I think the safest road forward is to back this one out for
> >> now, fix whatever this other bug is and once suspend/resume is working
> >> properly again we can revisit this patch based on a known-good baseline.
> >
> > I am with you here. I want to add that the proper fix should be
> > developed without thinking too much about stable in the first place.
> > *When* we have a proper working fix, then we can think about making it
> > "more" suitable for backporting. Yet, it may also be a result that older
> > kernels need a different solution. Or have no solution at all, in case
> > they can't do atomic_transfers and this is needed.
> >
> > D'accord?
> >
>
> I saw that you submitted the revert of the patches for 5.7, hopefully it
> won't result in putting the PCIe driver problem into the back burner.
> I'll try not to forget about these patches to resubmit them later on,
> once the problem will be resolved :)

I can put these two patches into a local development branch to keep
track of them. From what I said earlier, it looks like it would be fine
to apply these if we also make that runtime PM change (i.e. drop force
runtime PM and instead manually invoke runtime PM callbacks, which seems
to be in line with what the PM maintainers suggest, as pointed out
elsewhere in this thread).

How about if I put all of that into a branch and push it to linux-next
so that we can get some broader testing? I've already run it through our
internal test system, which, while not perfect, is the broadest system I
am aware of, and all tests came back positive.

I'm not exactly sure I see a real issue with the PCIe driver after those
patches are applied. The regulator errors are gone (presumably because
the regulators now do get turned off properly) and I don't observe any
other issues.

Thierry


Attachments:
(No filename) (2.52 kB)
signature.asc (849.00 B)
Download all attachments

2020-05-04 20:57:51

by Dmitry Osipenko

[permalink] [raw]
Subject: Re: [PATCH v2 1/2] i2c: tegra: Better handle case where CPU0 is busy for a long time

04.05.2020 18:42, Thierry Reding пишет:
> On Sat, May 02, 2020 at 05:40:35PM +0300, Dmitry Osipenko wrote:
>> 27.04.2020 18:31, Wolfram Sang пишет:
>>>
>>>> Yes, that bug should be fixed anyway. But that doesn't justify breaking
>>>> suspend/resume completely, which *is* a regression.
>>>>
>>>> Look, I'm not saying that we should drop this patch altogether. All I'm
>>>> saying is that we should postpone it so that we can: a) get suspend and
>>>> resume working again (and by doing so make sure no other suspend/resume
>>>> regressions silently creep in, because that always seems to happen when
>>>> you're not looking) and b) fix any preexisting issues without possibly
>>>> scrambling the result with this perhaps unrelated fix.
>>>>
>>>> So, again, I think the safest road forward is to back this one out for
>>>> now, fix whatever this other bug is and once suspend/resume is working
>>>> properly again we can revisit this patch based on a known-good baseline.
>>>
>>> I am with you here. I want to add that the proper fix should be
>>> developed without thinking too much about stable in the first place.
>>> *When* we have a proper working fix, then we can think about making it
>>> "more" suitable for backporting. Yet, it may also be a result that older
>>> kernels need a different solution. Or have no solution at all, in case
>>> they can't do atomic_transfers and this is needed.
>>>
>>> D'accord?
>>>
>>
>> I saw that you submitted the revert of the patches for 5.7, hopefully it
>> won't result in putting the PCIe driver problem into the back burner.
>> I'll try not to forget about these patches to resubmit them later on,
>> once the problem will be resolved :)
>
> I can put these two patches into a local development branch to keep
> track of them. From what I said earlier, it looks like it would be fine
> to apply these if we also make that runtime PM change (i.e. drop force
> runtime PM and instead manually invoke runtime PM callbacks, which seems
> to be in line with what the PM maintainers suggest, as pointed out
> elsewhere in this thread).
>
> How about if I put all of that into a branch and push it to linux-next
> so that we can get some broader testing? I've already run it through our
> internal test system, which, while not perfect, is the broadest system I
> am aware of, and all tests came back positive.
Will be great.

> I'm not exactly sure I see a real issue with the PCIe driver after those
> patches are applied. The regulator errors are gone (presumably because
> the regulators now do get turned off properly) and I don't observe any
> other issues.

That's probably because this I2C patch removed the "completion done
after timeout" message. You may try to re-add the message, it should pop
up on the PCIe driver's suspension. The IRQF_NO_SUSPEND flag should fix it.

My assumption was that it should be always fine handle interrupt after
timeout, and thus, the message isn't really needed. But this wasn't a
correct assumption as we see now, so it should be better to keep the
message for the debugging purposes, maybe turn it into dev_info_once().