2014-05-14 07:04:41

by Michal Kazior

[permalink] [raw]
Subject: [PATCH 0/2] ath10k: warm reset fixes 2014-05-14

Hi,

Warm reset was not able to recover from a device
crash so ath10k was falling back to cold reset
which would occasionally hang the system or
generate a data bus error with some device chips.

Warm reset now works a lot more reliably (at least
on my T430). It's able to recover from a "hard"
simulate fw crash and some firmware asserts I've
forced meaning it should recover in most cases
without falling back to the cold reset.

There's still one case of a crash warm reset is
incapable to recover from: IOMMU faults (when
supported and enabled and device tries to access
non-DMA mapped host memory) trigger device crash
and leave CE totally unresponsive before and after
the warm reset.


Michal Kazior (2):
ath10k: improve warm reset reliability
ath10k: retry warm reset a few times

drivers/net/wireless/ath/ath10k/pci.c | 53 +++++++++++++++++++++++++++++++++--
1 file changed, 50 insertions(+), 3 deletions(-)

--
1.8.5.3



2014-05-15 12:42:17

by Michal Kazior

[permalink] [raw]
Subject: Re: [PATCH 1/2] ath10k: improve warm reset reliability

On 15 May 2014 12:48, Kalle Valo <[email protected]> wrote:
> Michal Kazior <[email protected]> writes:
>
>> Warm reset is now able to recover after device
>> crashes which required a cold reset before.
>>
>> This should greatly reduce chances of getting data
>> bus errors or host system freezes due to buggy
>> cold reset on some chips.
>>
>> Signed-off-by: Michal Kazior <[email protected]>
>
> Awesome! This is very much needed. I just have a cosmetic comment:
>
>> +/* this function effectively clears target memory controller assert line */
>> +static void ath10k_pci_warm_reset_si0(struct ath10k *ar)
>> +{
>> + u32 val;
>> +
>> + val = ath10k_pci_read32(ar, RTC_SOC_BASE_ADDRESS +
>> + SOC_RESET_CONTROL_ADDRESS);
>> + ath10k_pci_write32(ar, RTC_SOC_BASE_ADDRESS + SOC_RESET_CONTROL_ADDRESS,
>> + val | SOC_RESET_CONTROL_SI0_RST_MASK);
>> + val = ath10k_pci_read32(ar, RTC_SOC_BASE_ADDRESS +
>> + SOC_RESET_CONTROL_ADDRESS);
>
> We do have the ath10k_pci_soc_ functions for accessing SOC registers, I
> would prefer to use those here. I now modified your patch with the diff
> below. Is that ok to you?

Ah, yeah. I totally forgot about these functions (again).


> --- a/drivers/net/wireless/ath/ath10k/pci.c
> +++ b/drivers/net/wireless/ath/ath10k/pci.c
> @@ -1807,20 +1807,18 @@ static void ath10k_pci_warm_reset_si0(struct ath10k *ar)
> {
> u32 val;
>
> - val = ath10k_pci_read32(ar, RTC_SOC_BASE_ADDRESS +
> - SOC_RESET_CONTROL_ADDRESS);
> - ath10k_pci_write32(ar, RTC_SOC_BASE_ADDRESS + SOC_RESET_CONTROL_ADDRESS,
> - val | SOC_RESET_CONTROL_SI0_RST_MASK);
> - val = ath10k_pci_read32(ar, RTC_SOC_BASE_ADDRESS +
> - SOC_RESET_CONTROL_ADDRESS);
> + val = ath10k_pci_soc_read32(ar, SOC_RESET_CONTROL_ADDRESS);
> + ath10k_pci_soc_write32(ar, SOC_RESET_CONTROL_ADDRESS,
> + val | SOC_RESET_CONTROL_SI0_RST_MASK);
> + val = ath10k_pci_soc_read32(ar, SOC_RESET_CONTROL_ADDRESS);
> +
> msleep(10);
>
> - val = ath10k_pci_read32(ar, RTC_SOC_BASE_ADDRESS +
> - SOC_RESET_CONTROL_ADDRESS);
> - ath10k_pci_write32(ar, RTC_SOC_BASE_ADDRESS + SOC_RESET_CONTROL_ADDRESS,
> - val & ~SOC_RESET_CONTROL_SI0_RST_MASK);
> - val = ath10k_pci_read32(ar, RTC_SOC_BASE_ADDRESS +
> - SOC_RESET_CONTROL_ADDRESS);
> + val = ath10k_pci_soc_read32(ar, SOC_RESET_CONTROL_ADDRESS);
> + ath10k_pci_soc_write32(ar, SOC_RESET_CONTROL_ADDRESS,
> + val & ~SOC_RESET_CONTROL_SI0_RST_MASK);
> + val = ath10k_pci_soc_read32(ar, SOC_RESET_CONTROL_ADDRESS);
> +
> msleep(10);
> }

Looks good to me :-)


> Full patch here:
>
> https://github.com/kvalo/ath/commit/7b52054308a371d479a9e686e1d8411d19a90fd7

You probably mean:

https://github.com/kvalo/ath/commit/136ef8110cba61752eee5ef6bd7ce170b15cf491


MichaƂ

2014-05-15 10:48:21

by Kalle Valo

[permalink] [raw]
Subject: Re: [PATCH 1/2] ath10k: improve warm reset reliability

Michal Kazior <[email protected]> writes:

> Warm reset is now able to recover after device
> crashes which required a cold reset before.
>
> This should greatly reduce chances of getting data
> bus errors or host system freezes due to buggy
> cold reset on some chips.
>
> Signed-off-by: Michal Kazior <[email protected]>

Awesome! This is very much needed. I just have a cosmetic comment:

> +/* this function effectively clears target memory controller assert line */
> +static void ath10k_pci_warm_reset_si0(struct ath10k *ar)
> +{
> + u32 val;
> +
> + val = ath10k_pci_read32(ar, RTC_SOC_BASE_ADDRESS +
> + SOC_RESET_CONTROL_ADDRESS);
> + ath10k_pci_write32(ar, RTC_SOC_BASE_ADDRESS + SOC_RESET_CONTROL_ADDRESS,
> + val | SOC_RESET_CONTROL_SI0_RST_MASK);
> + val = ath10k_pci_read32(ar, RTC_SOC_BASE_ADDRESS +
> + SOC_RESET_CONTROL_ADDRESS);

We do have the ath10k_pci_soc_ functions for accessing SOC registers, I
would prefer to use those here. I now modified your patch with the diff
below. Is that ok to you?

--- a/drivers/net/wireless/ath/ath10k/pci.c
+++ b/drivers/net/wireless/ath/ath10k/pci.c
@@ -1807,20 +1807,18 @@ static void ath10k_pci_warm_reset_si0(struct ath10k *ar)
{
u32 val;

- val = ath10k_pci_read32(ar, RTC_SOC_BASE_ADDRESS +
- SOC_RESET_CONTROL_ADDRESS);
- ath10k_pci_write32(ar, RTC_SOC_BASE_ADDRESS + SOC_RESET_CONTROL_ADDRESS,
- val | SOC_RESET_CONTROL_SI0_RST_MASK);
- val = ath10k_pci_read32(ar, RTC_SOC_BASE_ADDRESS +
- SOC_RESET_CONTROL_ADDRESS);
+ val = ath10k_pci_soc_read32(ar, SOC_RESET_CONTROL_ADDRESS);
+ ath10k_pci_soc_write32(ar, SOC_RESET_CONTROL_ADDRESS,
+ val | SOC_RESET_CONTROL_SI0_RST_MASK);
+ val = ath10k_pci_soc_read32(ar, SOC_RESET_CONTROL_ADDRESS);
+
msleep(10);

- val = ath10k_pci_read32(ar, RTC_SOC_BASE_ADDRESS +
- SOC_RESET_CONTROL_ADDRESS);
- ath10k_pci_write32(ar, RTC_SOC_BASE_ADDRESS + SOC_RESET_CONTROL_ADDRESS,
- val & ~SOC_RESET_CONTROL_SI0_RST_MASK);
- val = ath10k_pci_read32(ar, RTC_SOC_BASE_ADDRESS +
- SOC_RESET_CONTROL_ADDRESS);
+ val = ath10k_pci_soc_read32(ar, SOC_RESET_CONTROL_ADDRESS);
+ ath10k_pci_soc_write32(ar, SOC_RESET_CONTROL_ADDRESS,
+ val & ~SOC_RESET_CONTROL_SI0_RST_MASK);
+ val = ath10k_pci_soc_read32(ar, SOC_RESET_CONTROL_ADDRESS);
+
msleep(10);
}

Full patch here:

https://github.com/kvalo/ath/commit/7b52054308a371d479a9e686e1d8411d19a90fd7

--
Kalle Valo

2014-05-14 14:22:13

by Kalle Valo

[permalink] [raw]
Subject: Re: [PATCH 0/2] ath10k: warm reset fixes 2014-05-14

Michal Kazior <[email protected]> writes:

> Warm reset was not able to recover from a device
> crash so ath10k was falling back to cold reset
> which would occasionally hang the system or
> generate a data bus error with some device chips.
>
> Warm reset now works a lot more reliably (at least
> on my T430). It's able to recover from a "hard"
> simulate fw crash and some firmware asserts I've
> forced meaning it should recover in most cases
> without falling back to the cold reset.
>
> There's still one case of a crash warm reset is
> incapable to recover from: IOMMU faults (when
> supported and enabled and device tries to access
> non-DMA mapped host memory) trigger device crash
> and leave CE totally unresponsive before and after
> the warm reset.
>
>
> Michal Kazior (2):
> ath10k: improve warm reset reliability
> ath10k: retry warm reset a few times

The patches don't apply anymore and the conflict was not trivial enough
for me to fix it. Can you rebase, please?

--
Kalle Valo

2014-05-14 14:25:18

by Kalle Valo

[permalink] [raw]
Subject: Re: [PATCH 0/2] ath10k: warm reset fixes 2014-05-14

Kalle Valo <[email protected]> writes:

> Michal Kazior <[email protected]> writes:
>
>> Warm reset was not able to recover from a device
>> crash so ath10k was falling back to cold reset
>> which would occasionally hang the system or
>> generate a data bus error with some device chips.
>>
>> Warm reset now works a lot more reliably (at least
>> on my T430). It's able to recover from a "hard"
>> simulate fw crash and some firmware asserts I've
>> forced meaning it should recover in most cases
>> without falling back to the cold reset.
>>
>> There's still one case of a crash warm reset is
>> incapable to recover from: IOMMU faults (when
>> supported and enabled and device tries to access
>> non-DMA mapped host memory) trigger device crash
>> and leave CE totally unresponsive before and after
>> the warm reset.
>>
>>
>> Michal Kazior (2):
>> ath10k: improve warm reset reliability
>> ath10k: retry warm reset a few times
>
> The patches don't apply anymore and the conflict was not trivial enough
> for me to fix it. Can you rebase, please?

Sorry, I replied to the wrong thread. Please ignore what I said above!

--
Kalle Valo

2014-05-14 07:04:44

by Michal Kazior

[permalink] [raw]
Subject: [PATCH 2/2] ath10k: retry warm reset a few times

Sometimes warm reset works upon retry. It might be
related to imperfect warm reset routine, but for
now let's just do the retries.

This should improve the reliability of some chips
that hang/crash with cold reset which is used as a
last resort if warm reset fails.

Signed-off-by: Michal Kazior <[email protected]>
---
drivers/net/wireless/ath/ath10k/pci.c | 29 ++++++++++++++++++++++++++---
1 file changed, 26 insertions(+), 3 deletions(-)

diff --git a/drivers/net/wireless/ath/ath10k/pci.c b/drivers/net/wireless/ath/ath10k/pci.c
index c9b14b4..c9e482e 100644
--- a/drivers/net/wireless/ath/ath10k/pci.c
+++ b/drivers/net/wireless/ath/ath10k/pci.c
@@ -59,6 +59,7 @@ MODULE_PARM_DESC(reset_mode, "0: auto, 1: warm only (default: 0)");

/* how long wait to wait for target to initialise, in ms */
#define ATH10K_PCI_TARGET_WAIT 3000
+#define ATH10K_PCI_NUM_WARM_RESET_ATTEMPTS 3

#define QCA988X_2_0_DEVICE_ID (0x003c)

@@ -2012,6 +2013,28 @@ err:
return ret;
}

+static int ath10k_pci_hif_power_up_warm(struct ath10k *ar)
+{
+ int i, ret;
+
+ /*
+ * Sometime warm reset succeeds after retries.
+ *
+ * FIXME: It might be possible to tune ath10k_pci_warm_reset() to work
+ * at first try.
+ */
+ for (i = 0; i < ATH10K_PCI_NUM_WARM_RESET_ATTEMPTS; i++) {
+ ret = __ath10k_pci_hif_power_up(ar, false);
+ if (ret == 0)
+ break;
+
+ ath10k_warn("failed to warm reset (attempt %d out of %d): %d\n",
+ i + 1, ATH10K_PCI_NUM_WARM_RESET_ATTEMPTS, ret);
+ }
+
+ return ret;
+}
+
static int ath10k_pci_hif_power_up(struct ath10k *ar)
{
int ret;
@@ -2023,10 +2046,10 @@ static int ath10k_pci_hif_power_up(struct ath10k *ar)
* preferred (and safer) way to perform a device reset is through a
* warm reset.
*
- * Warm reset doesn't always work though (notably after a firmware
- * crash) so fall back to cold reset if necessary.
+ * Warm reset doesn't always work though so fall back to cold reset may
+ * be necessary.
*/
- ret = __ath10k_pci_hif_power_up(ar, false);
+ ret = ath10k_pci_hif_power_up_warm(ar);
if (ret) {
ath10k_warn("failed to power up target using warm reset: %d\n",
ret);
--
1.8.5.3


2014-05-16 13:49:13

by Kalle Valo

[permalink] [raw]
Subject: Re: [PATCH 0/2] ath10k: warm reset fixes 2014-05-14

Michal Kazior <[email protected]> writes:

> Hi,
>
> Warm reset was not able to recover from a device
> crash so ath10k was falling back to cold reset
> which would occasionally hang the system or
> generate a data bus error with some device chips.
>
> Warm reset now works a lot more reliably (at least
> on my T430). It's able to recover from a "hard"
> simulate fw crash and some firmware asserts I've
> forced meaning it should recover in most cases
> without falling back to the cold reset.
>
> There's still one case of a crash warm reset is
> incapable to recover from: IOMMU faults (when
> supported and enabled and device tries to access
> non-DMA mapped host memory) trigger device crash
> and leave CE totally unresponsive before and after
> the warm reset.
>
>
> Michal Kazior (2):
> ath10k: improve warm reset reliability
> ath10k: retry warm reset a few times

Thanks, both patches applied.

--
Kalle Valo

2014-05-14 07:04:44

by Michal Kazior

[permalink] [raw]
Subject: [PATCH 1/2] ath10k: improve warm reset reliability

Warm reset is now able to recover after device
crashes which required a cold reset before.

This should greatly reduce chances of getting data
bus errors or host system freezes due to buggy
cold reset on some chips.

Signed-off-by: Michal Kazior <[email protected]>
---
drivers/net/wireless/ath/ath10k/pci.c | 24 ++++++++++++++++++++++++
1 file changed, 24 insertions(+)

diff --git a/drivers/net/wireless/ath/ath10k/pci.c b/drivers/net/wireless/ath/ath10k/pci.c
index 66b1f30..c9b14b4 100644
--- a/drivers/net/wireless/ath/ath10k/pci.c
+++ b/drivers/net/wireless/ath/ath10k/pci.c
@@ -1802,6 +1802,28 @@ static void ath10k_pci_fw_interrupt_handler(struct ath10k *ar)
ath10k_pci_sleep(ar);
}

+/* this function effectively clears target memory controller assert line */
+static void ath10k_pci_warm_reset_si0(struct ath10k *ar)
+{
+ u32 val;
+
+ val = ath10k_pci_read32(ar, RTC_SOC_BASE_ADDRESS +
+ SOC_RESET_CONTROL_ADDRESS);
+ ath10k_pci_write32(ar, RTC_SOC_BASE_ADDRESS + SOC_RESET_CONTROL_ADDRESS,
+ val | SOC_RESET_CONTROL_SI0_RST_MASK);
+ val = ath10k_pci_read32(ar, RTC_SOC_BASE_ADDRESS +
+ SOC_RESET_CONTROL_ADDRESS);
+ msleep(10);
+
+ val = ath10k_pci_read32(ar, RTC_SOC_BASE_ADDRESS +
+ SOC_RESET_CONTROL_ADDRESS);
+ ath10k_pci_write32(ar, RTC_SOC_BASE_ADDRESS + SOC_RESET_CONTROL_ADDRESS,
+ val & ~SOC_RESET_CONTROL_SI0_RST_MASK);
+ val = ath10k_pci_read32(ar, RTC_SOC_BASE_ADDRESS +
+ SOC_RESET_CONTROL_ADDRESS);
+ msleep(10);
+}
+
static int ath10k_pci_warm_reset(struct ath10k *ar)
{
int ret = 0;
@@ -1860,6 +1882,8 @@ static int ath10k_pci_warm_reset(struct ath10k *ar)
SOC_RESET_CONTROL_ADDRESS);
msleep(10);

+ ath10k_pci_warm_reset_si0(ar);
+
/* debug */
val = ath10k_pci_read32(ar, SOC_CORE_BASE_ADDRESS +
PCIE_INTR_CAUSE_ADDRESS);
--
1.8.5.3