2016-01-18 09:55:42

by Matthias Fend

Subject: ath10k: host lock up after firmware crash

Hi,

I have an x86_64 system with a COMPEX WLE600VX miniPCIe module, which is based on the Qualcomm Atheros QCA9882.
Occasionally the system freezes completely; even the serial console stops working.
After many tedious attempts to reproduce this behavior, I found out what happens just before the system locks up.

These are the last words on the console:
[..]
[214618.559061] ath10k_pci 0000:03:00.0: firmware crashed! (uuid a11b4028-a801-454d-82e2-6400792883e2)
[214618.580137] ath10k_pci 0000:03:00.0: failed to read diag value at 0xf7900804: -16
[214618.587704] ath10k_pci 0000:03:00.0: failed to get memcpy hi address for firmware address 4: -16
[214618.596589] ath10k_pci 0000:03:00.0: failed to read firmware dump area: -16
[214618.900072] ath10k_pci 0000:03:00.0: failed to read diag value at 0xf7900800: -16
[214618.907641] ath10k_pci 0000:03:00.0: failed to poke copy engine: -16
[214619.001861] ath10k_pci 0000:03:00.0: failed to read diag value at 0xf7900800: -16
[214619.009438] ath10k_pci 0000:03:00.0: failed to poke copy engine: -16
[214619.110129] ath10k_pci 0000:03:00.0: failed to read diag value at 0xf7900800: -16
[214619.117696] ath10k_pci 0000:03:00.0: failed to poke copy engine: -16
[214619.967868] x
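
For reference, the -16 at the end of these messages is a negated errno value, which on Linux is EBUSY. A minimal sketch for decoding such codes, assuming it runs on a Linux host where the userspace errno numbering matches the kernel's (the helper name describe_kernel_error is made up for illustration):

```python
import errno
import os


def describe_kernel_error(code):
    """Map a negated errno from a kernel log line (e.g. -16) to (symbol, text)."""
    number = abs(code)
    # errno.errorcode maps numeric values to symbolic names on this platform.
    symbol = errno.errorcode.get(number, "UNKNOWN")
    return symbol, os.strerror(number)


symbol, text = describe_kernel_error(-16)
print(symbol, "-", text)
```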

While reading through the related kernel driver, I came across this comment:
[..]
/* FIXME: Sometimes copy engine doesn't recover after warm
* reset. In most cases this needs cold reset. In some of these
* cases the device is in such a state that a cold reset may
* lock up the host.
[..]

To me it currently looks like this is exactly my problem. Because the module firmware is closed source, it's not really possible to fix the root cause (the firmware crash). Therefore I'm looking for an alternative workaround.
To that end, I am looking for answers to the following questions:
Does anybody know what exactly is happening in such a situation?
In what ways can a PCIe module completely stall the whole system?
Is there a known (fast) way to reproduce this behavior? (Forcing a cold reset after a crash and using simulate_fw_crash as the trigger does not work.)
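
For context, simulate_fw_crash is a debugfs file exposed by ath10k. Below is a rough sketch of driving it from a script, assuming debugfs is mounted at /sys/kernel/debug and the card is phy0 (both are assumptions and differ per system). The helper name and the default crash type are illustrative only and do not reflect which mode was used in the attempt described above.

```python
from pathlib import Path

# Assumed location: the debugfs mount point and phy index vary per system.
DEFAULT_PATH = Path("/sys/kernel/debug/ieee80211/phy0/ath10k/simulate_fw_crash")

# Crash types accepted by the ath10k debugfs file, as far as I know.
CRASH_TYPES = ("soft", "hard", "assert", "hw-restart")


def simulate_fw_crash(crash_type="soft", path=DEFAULT_PATH):
    """Write a crash type to the simulate_fw_crash file (needs root on real hardware)."""
    if crash_type not in CRASH_TYPES:
        raise ValueError("unknown crash type: %r" % (crash_type,))
    # Same effect as: echo <crash_type> > <path>
    Path(path).write_text(crash_type + "\n")
```

Calling simulate_fw_crash("hard") as root would request a hard crash on phy0. Passing a different path makes it possible to try the helper against an ordinary file first.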

Of course I would also be very happy about any other hints.

Thanks,
~Matthias