2018-11-14 02:50:35

by Wen Gong

Subject: [PATCH] ath10k: Remove ATH10K_STATE_RESTARTED in simulate fw crash

Testing the simulated firmware crash easily triggers an error.
Command:
echo soft > /sys/kernel/debug/ieee80211/phyxx/ath10k/simulate_fw_crash

If the command is entered more than two times in quick succession, an
error occurs.
Error message:
ath10k_pci 0000:02:00.0: failed to set vdev 1 RX wake policy: -108
ath10k_pci 0000:02:00.0: device is wedged, will not restart

This happens because the state does not change to ATH10K_STATE_ON
immediately, so multiple simulated crash processes run at the same time
and complete/wake up some fields twice, which destroys the normal
recovery process.

Tested with QCA6174 PCI with firmware
WLAN.RM.4.4.1-00109-QCARMSWPZ-1, but this will also affect QCA9377 PCI.
It's not a regression with new firmware releases.

Signed-off-by: Wen Gong <[email protected]>
---
drivers/net/wireless/ath/ath10k/debug.c | 3 +--
1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/drivers/net/wireless/ath/ath10k/debug.c b/drivers/net/wireless/ath/ath10k/debug.c
index ada29a4..dc8700b 100644
--- a/drivers/net/wireless/ath/ath10k/debug.c
+++ b/drivers/net/wireless/ath/ath10k/debug.c
@@ -569,8 +569,7 @@ static ssize_t ath10k_write_simulate_fw_crash(struct file *file,

 	mutex_lock(&ar->conf_mutex);

-	if (ar->state != ATH10K_STATE_ON &&
-	    ar->state != ATH10K_STATE_RESTARTED) {
+	if (ar->state != ATH10K_STATE_ON) {
 		ret = -ENETDOWN;
 		goto exit;
 	}
--
1.9.1
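
For readers following the diff, here is a minimal userspace sketch of the
check before and after this patch. The enum and function names
(model_state, gate_before, gate_after) are illustrative stand-ins, not
driver code; only the -ENETDOWN return and the ON/RESTARTED comparison
come from the patch itself.

```c
#include <errno.h>

/* Simplified stand-ins for the driver states; not the real enum. */
enum model_state {
	MODEL_STATE_OFF,
	MODEL_STATE_ON,
	MODEL_STATE_RESTARTING,
	MODEL_STATE_RESTARTED,
};

/* Check before the patch: ON and RESTARTED both accept a simulated crash. */
static int gate_before(enum model_state s)
{
	if (s != MODEL_STATE_ON && s != MODEL_STATE_RESTARTED)
		return -ENETDOWN;
	return 0;
}

/* Check after the patch: only ON accepts a simulated crash. */
static int gate_after(enum model_state s)
{
	if (s != MODEL_STATE_ON)
		return -ENETDOWN;
	return 0;
}
```

In this model gate_before(MODEL_STATE_RESTARTED) returns 0 while
gate_after(MODEL_STATE_RESTARTED) returns -ENETDOWN, which is the whole
behavioral change.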



2018-11-14 07:49:04

by Michal Kazior

Subject: Re: [PATCH] ath10k: Remove ATH10K_STATE_RESTARTED in simulate fw crash

On Wed, 14 Nov 2018 at 03:51, Wen Gong <[email protected]> wrote:
>
> When test simulate firmware crash, it is easy to trigger error.
> command:
> echo soft > /sys/kernel/debug/ieee80211/phyxx/ath10k/simulate_fw_crash.
>
> If input more than two times continuously, then it will have error.
> Error message:
> ath10k_pci 0000:02:00.0: failed to set vdev 1 RX wake policy: -108
> ath10k_pci 0000:02:00.0: device is wedged, will not restart
>
> It is because the state has not changed to ATH10K_STATE_ON immediately,
> then it will have more than two simulate crash process running meanwhile,
> and complete/wakeup some field twice, it destroy the normal recovery
> process.

This was intended to allow testing not only firmware crash path (and
recovery) but also firmware crash while recovering from a firmware
crash.


Michał

2019-01-07 07:22:28

by Wen Gong

Subject: RE: [PATCH] ath10k: Remove ATH10K_STATE_RESTARTED in simulate fw crash

> > It is because the state has not changed to ATH10K_STATE_ON
> > immediately, then it will have more than two simulate crash process
> > running meanwhile, and complete/wakeup some field twice, it destroy
> > the normal recovery process.
>
> This was intended to allow testing not only firmware crash path (and
> recovery) but also firmware crash while recovering from a firmware crash.
>
If the firmware is recovering from a crash, simulating a new crash will trigger an error.
So remove the check.
>
> Michał

2019-01-07 08:35:57

by Michal Kazior

Subject: Re: [PATCH] ath10k: Remove ATH10K_STATE_RESTARTED in simulate fw crash

On Mon, 7 Jan 2019 at 08:16, Wen Gong <[email protected]> wrote:
>
> > > It is because the state has not changed to ATH10K_STATE_ON
> > > immediately, then it will have more than two simulate crash process
> > > running meanwhile, and complete/wakeup some field twice, it destroy
> > > the normal recovery process.
> >
> > This was intended to allow testing not only firmware crash path (and
> > recovery) but also firmware crash while recovering from a firmware crash.
> >
> If firmware is recovering from crash, then simulate a new crash will trigger error.
> So remove it.

That's actually a feature, not a bug. If firmware crashes while driver
is restarting after a crash then it's likely going to fail again and
again causing a crash-restart loop which can affect system performance
and responsiveness. It's better to give up and let the system admin
take over.

If it's still bothering you then please consider a crash counter
threshold so that, e.g. after 5 crash-while-restarting it's going to
give up. However I doubt it's worth the effort. My experience tells me
firmware crashes during recovery are rarely, if at all, transient.

The simulated fw crash is not representative here. It's a mere tool to
test driver code.
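
The crash counter idea above could look roughly like the following
userspace sketch. Everything in it is hypothetical: struct
restart_tracker, its helpers and MAX_NESTED_CRASHES are not part of
ath10k, and the threshold of 5 is only the example figure mentioned
above.

```c
#include <stdbool.h>

/* Hypothetical threshold; the text above only suggests "e.g. 5". */
#define MAX_NESTED_CRASHES 5

struct restart_tracker {
	bool restarting;             /* a recovery is currently in progress */
	unsigned int nested_crashes; /* crashes seen while restarting */
	bool wedged;                 /* the driver has given up */
};

/* Called on every fw crash. Returns true if a restart should be attempted. */
static bool tracker_on_crash(struct restart_tracker *t)
{
	if (t->wedged)
		return false;

	/* A crash during recovery counts toward the threshold. */
	if (t->restarting && ++t->nested_crashes >= MAX_NESTED_CRASHES) {
		t->wedged = true;
		return false;
	}

	t->restarting = true;
	return true;
}

/* Called once recovery has completed; clears the nested crash count. */
static void tracker_on_recovered(struct restart_tracker *t)
{
	t->restarting = false;
	t->nested_crashes = 0;
}
```

With a zero-initialized tracker, the first crash and the next four
crashes during recovery each request a restart; the fifth crash during
recovery marks the tracker as wedged.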


Michał

2019-01-08 08:45:33

by Wen Gong

Subject: RE: [PATCH] ath10k: Remove ATH10K_STATE_RESTARTED in simulate fw crash

> >
> > > > It is because the state has not changed to ATH10K_STATE_ON
> > > > immediately, then it will have more than two simulate crash
> > > > process running meanwhile, and complete/wakeup some field twice,
> > > > it destroy the normal recovery process.
> > >
> > > This was intended to allow testing not only firmware crash path (and
> > > recovery) but also firmware crash while recovering from a firmware crash.
> > >
> > If firmware is recovering from crash, then simulate a new crash will trigger
> error.
> > So remove it.
>
> That's actually a feature, not a bug. If firmware crashes while driver is
> restarting after a crash then its likely going to fail again and again causing a
> crash-restart loop which can affect system performance and responsiveness.
> It's better to give up and let the system admin take over.
>
> If it's still bothering you then please consider a crash counter threshold so
> that, e.g. after 5 crash-while-restarting it's going to give up. However I doubt
> it's worth the effort. My experience tells me firmware crashes during
> recovery are rarely, if at all, transient.
>
> The simulated fw crash is not representative here. It's a mere tool to test
> driver code.

The simulated fw crash is only a tool that lets the user trigger a fw crash with a command.
The purpose of this change is to prevent the user from triggering a fw crash when the fw
is not in a normal state.

If the fw is recovering, whether the recovery was triggered by the user's command or by
the fw itself, the user cannot run the command to trigger another fw crash until the fw
returns to a normal state.

>
>
> Michał

2019-02-08 13:32:18

by Kalle Valo

Subject: Re: [PATCH] ath10k: Remove ATH10K_STATE_RESTARTED in simulate fw crash

Wen Gong <[email protected]> writes:

>> > > > It is because the state has not changed to ATH10K_STATE_ON
>> > > > immediately, then it will have more than two simulate crash
>> > > > process running meanwhile, and complete/wakeup some field twice,
>> > > > it destroy the normal recovery process.
>> > >
>> > > This was intended to allow testing not only firmware crash path (and
>> > > recovery) but also firmware crash while recovering from a firmware crash.
>> > >
>> > If firmware is recovering from crash, then simulate a new crash will trigger
>> error.
>> > So remove it.
>>
>> That's actually a feature, not a bug. If firmware crashes while driver is
>> restarting after a crash then its likely going to fail again and again causing a
>> crash-restart loop which can affect system performance and responsiveness.
>> It's better to give up and let the system admin take over.
>>
>> If it's still bothering you then please consider a crash counter threshold so
>> that, e.g. after 5 crash-while-restarting it's going to give up. However I doubt
>> it's worth the effort. My experience tells me firmware crashes during
>> recovery are rarely, if at all, transient.
>>
>> The simulated fw crash is not representative here. It's a mere tool to test
>> driver code.
>
> The simulated fw crash is only a tool for user to trigger fw crash
> with command

I think Michal knows what simulate_fw_crash as he is the one who
implemented it in commit 278c4a85e626 :)

> This change's purpose is to disallow user to trigger fw crash if the fw is not in a
> Normal state.
>
> If the fw is in recovering state triggered by user's command or by fw, then it will
> disallow user to run command to trigger fw crash again until fw become to a normal
> State.

I agree with Michal here and his proposal about having a crash counter
sounds like a good to me. So I'm dropping this patch.

--
Kalle Valo

2019-02-08 13:34:11

by Kalle Valo

Subject: Re: [PATCH] ath10k: Remove ATH10K_STATE_RESTARTED in simulate fw crash

Kalle Valo <[email protected]> writes:

>> This change's purpose is to disallow user to trigger fw crash if the fw is not in a
>> Normal state.
>>
>> If the fw is in recovering state triggered by user's command or by fw, then it will
>> disallow user to run command to trigger fw crash again until fw become to a normal
>> State.
>
> I agree with Michal here and his proposal about having a crash counter
> sounds like a good to me. So I'm dropping this patch.

Bah, missed a word again. I meant "sounds like a good idea to me".

--
Kalle Valo

2019-04-01 06:17:50

by Wen Gong

Subject: RE: [PATCH] ath10k: Remove ATH10K_STATE_RESTARTED in simulate fw crash

> -----Original Message-----
> From: Michał Kazior <[email protected]>
> Sent: Monday, January 7, 2019 4:36 PM
> To: Wen Gong <[email protected]>
> Cc: Wen Gong <[email protected]>; linux-wireless <linux-
> [email protected]>; [email protected]
> Subject: [EXT] Re: [PATCH] ath10k: Remove ATH10K_STATE_RESTARTED in
> simulate fw crash
> That's actually a feature, not a bug. If firmware crashes while driver
> is restarting after a crash then its likely going to fail again and
> again causing a crash-restart loop which can affect system performance
> and responsiveness. It's better to give up and let the system admin
> take over.
>
> If it's still bothering you then please consider a crash counter
> threshold so that, e.g. after 5 crash-while-restarting it's going to
> give up. However I doubt it's worth the effort. My experience tells me
> firmware crashes during recovery are rarely, if at all, transient.
>
> The simulated fw crash is not representative here. It's a mere tool to
> test driver code.
>
Hi Michal,
There is a stress test case for the simulated fw crash. It simulates fw crashes
in quick succession, which makes the stress test fail.
The simulated fw crash processes should not run in parallel. With this patch, the
stress test case passes.
>
> Michał

2019-04-08 10:20:04

by Wen Gong

Subject: RE: [PATCH] ath10k: Remove ATH10K_STATE_RESTARTED in simulate fw crash





> -----Original Message-----
> From: Wen Gong
> Sent: Monday, April 1, 2019 2:11 PM
> To: 'Michał Kazior' <[email protected]>
> Cc: Wen Gong <[email protected]>; linux-wireless <linux-
> [email protected]>; [email protected]
> Subject: RE: [EXT] Re: [PATCH] ath10k: Remove ATH10K_STATE_RESTARTED in
> simulate fw crash
>
> >
> > If it's still bothering you then please consider a crash counter
> > threshold so that, e.g. after 5 crash-while-restarting it's going to
> > give up. However I doubt it's worth the effort. My experience tells me
> > firmware crashes during recovery are rarely, if at all, transient.
> >
> > The simulated fw crash is not representative here. It's a mere tool to
> > test driver code.
> >
> Hi Michal,
> There have a stress test case for the simulate fw crash, it will simulate fw
> crash
> in a very short time for each test, this will trigger the stress test fail.
> The simulate fw crash process should not be run parallel, after this patch, the
> Stress test case will pass.
> >

Hi Michał,
Do you have any further comments?

> > Michał

2019-04-08 17:27:20

by Michal Kazior

Subject: Re: [PATCH] ath10k: Remove ATH10K_STATE_RESTARTED in simulate fw crash

On Mon, 8 Apr 2019 at 12:20, Wen Gong <[email protected]> wrote:
> > -----Original Message-----
> > From: Wen Gong
> > Sent: Monday, April 1, 2019 2:11 PM
> > To: 'Michał Kazior' <[email protected]>
> > Cc: Wen Gong <[email protected]>; linux-wireless <linux-
> > [email protected]>; [email protected]
> > Subject: RE: [EXT] Re: [PATCH] ath10k: Remove ATH10K_STATE_RESTARTED in
> > simulate fw crash
> >
> > >
> > > If it's still bothering you then please consider a crash counter
> > > threshold so that, e.g. after 5 crash-while-restarting it's going to
> > > give up. However I doubt it's worth the effort. My experience tells me
> > > firmware crashes during recovery are rarely, if at all, transient.
> > >
> > > The simulated fw crash is not representative here. It's a mere tool to
> > > test driver code.
> > >
> > Hi Michal,
> > There have a stress test case for the simulate fw crash, it will simulate fw
> > crash
> > in a very short time for each test, this will trigger the stress test fail.
> > The simulate fw crash process should not be run parallel, after this patch, the
> > Stress test case will pass.
> > >
>
> Hi Michał,
> Do you have some new comments?

My original use case was to be able to exercise the driver's
robustness in handling nested fw crashes, IOW crash-within-a-crash.

Your test case, as far as I understand, intends to perform
consecutive, non-nested fw crash simulation stress test.

Both of these are mutually exclusive and your patch fixes your test
case at the expense of breaking my original case.

To satisfy both I would suggest you either expose ar->state via
debugfs and make your test procedure wait for that to get back into ON
state before simulating a crash again, or to extend the set of current
simulate_fw_crash commands (currently just: soft, hard, assert,
hw-restart) to something that allows expressing the intent whether
crash-in-crash prevention is intended (your case) or not (my original
case).

This could be for example something like this:
echo soft wait-ready > simulate_fw_crash

The "wait-ready" extra keyword would imply crash-in-crash prevention.
This would keep existing tools working (both behavior and syntax) and
would allow your test case to be implemented.
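
A rough sketch of how the optional keyword might be parsed, written as
plain userspace C rather than driver code. The struct, the function name
and the choice to reject unknown trailing words are assumptions; only
the four mode names and the "wait-ready" keyword come from the proposal
above.

```c
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical parsed form of a write to simulate_fw_crash. */
struct crash_request {
	char mode[16];   /* one of: soft, hard, assert, hw-restart */
	bool wait_ready; /* true when the optional "wait-ready" keyword is given */
};

/* Returns 0 on success, -1 if the input is not "<mode> [wait-ready]". */
static int parse_crash_request(const char *input, struct crash_request *req)
{
	static const char *const modes[] = {
		"soft", "hard", "assert", "hw-restart",
	};
	char mode[16], opt[16], extra[2];
	bool known = false;
	size_t i;
	int n;

	/* %15s leaves room for the terminating NUL in each 16-byte buffer. */
	n = sscanf(input, "%15s %15s %1s", mode, opt, extra);
	if (n < 1 || n > 2)
		return -1; /* empty input, or more than two words */

	for (i = 0; i < sizeof(modes) / sizeof(modes[0]); i++)
		if (strcmp(mode, modes[i]) == 0)
			known = true;
	if (!known)
		return -1;

	if (n == 2 && strcmp(opt, "wait-ready") != 0)
		return -1;

	strcpy(req->mode, mode);
	req->wait_ready = (n == 2);
	return 0;
}
```

A plain "soft" still parses with wait_ready left false, so existing
callers would keep today's crash-in-crash behavior, while "soft
wait-ready" would opt in to the prevention.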


Michał

2019-04-09 05:09:09

by Wen Gong

Subject: RE: [PATCH] ath10k: Remove ATH10K_STATE_RESTARTED in simulate fw crash

> From: Michał Kazior <[email protected]>
> Sent: Tuesday, April 9, 2019 1:27 AM
> To: Wen Gong <[email protected]>
> Cc: Wen Gong <[email protected]>; linux-wireless <linux-
> [email protected]>; [email protected]
> Subject: [EXT] Re: [PATCH] ath10k: Remove ATH10K_STATE_RESTARTED in
> simulate fw crash
>
> > > Hi Michal,
> > > There have a stress test case for the simulate fw crash, it will simulate fw
> > > crash
> > > in a very short time for each test, this will trigger the stress test fail.
> > > The simulate fw crash process should not be run parallel, after this patch,
> the
> > > Stress test case will pass.
> > > >
> >
> > Hi Michał,
> > Do you have some new comments?
>
> My original use case was to be able to exercise the driver's
> robustness in handling nested fw crashes, IOW crash-within-a-crash.
>
> Your test case, as far as I understand, intends to perform
> consecutive, non-nested fw crash simulation stress test.
>
> Both of these are mutually exclusive and your patch fixes your test
> case at the expense of breaking my original case.
>
> To satisfy both I would suggest you either expose ar->state via
> debugfs and make your test procedure wait for that to get back into ON
> state before simulating a crash again, or to extend the set of current
> simulate_fw_crash commands (currently just: soft, hard, assert,
> hw-restart) to something that allows expressing the intent whether
> crash-in-crash prevention is intended (your case) or not (my original
> case).
>
> This could be for example something like this:
> echo soft wait-ready > simulate_fw_crash
>
> The "wait-ready" extra keyword would imply crash-in-crash prevention.
> This would keep existing tools working (both behavior and syntax) and
> would allow your test case to be implemented.
>
Is it easy to change your existing tools?
I would prefer to change it to: echo soft skip-ready > simulate_fw_crash
The "skip-ready" extra keyword would allow crash-in-crash, i.e. *no* prevention.
My test tools are hard to change.

>
> Michał

2019-04-09 23:25:43

by Brian Norris

Subject: Re: [PATCH] ath10k: Remove ATH10K_STATE_RESTARTED in simulate fw crash

On Mon, Apr 8, 2019 at 10:09 PM Wen Gong <[email protected]> wrote:
> > From: Michał Kazior <[email protected]>
> > To satisfy both I would suggest you either expose ar->state via
> > debugfs and make your test procedure wait for that to get back into ON
> > state before simulating a crash again, or to extend the set of current
> > simulate_fw_crash commands (currently just: soft, hard, assert,
> > hw-restart) to something that allows expressing the intent whether
> > crash-in-crash prevention is intended (your case) or not (my original
> > case).
> >
> > This could be for example something like this:
> > echo soft wait-ready > simulate_fw_crash
> >
> > The "wait-ready" extra keyword would imply crash-in-crash prevention.
> > This would keep existing tools working (both behavior and syntax) and
> > would allow your test case to be implemented.
> >
> Is it easy to change your existing tools?
> I want to change it to: echo soft skip-ready > simulate_fw_crash
> The "skip-ready" extra keyword would imply crash-in-crash, *not* prevention.
> My test tools is hard to change.

In case you're talking about the test framework we run for ChromeOS
validation, no, it's not hard at all to change. As long as there's a
good reason.

I haven't closely followed this, but judging by the above summary,
it's probably more reasonable for our test framework to only simulate
FW crashes after the driver returns to "ready" (or at least, if we do
crash-in-crash, don't expect the driver to recover?). I expect we can
work with whatever mechanism you implement for that (exposing the
"state", or providing a new simulate_fw_crash mode).

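If the state were exposed through a debugfs file, a test procedure could
poll it along these lines. This is only a sketch: no such file is
defined in this thread, so the path passed in and the string it waits
for are hypothetical, and the helper name wait_for_state is made up for
illustration.

```c
#include <stdio.h>
#include <string.h>
#include <time.h>

/*
 * Poll a (hypothetical) state file until its first line equals `want`.
 * Returns 0 on a match, -1 if `attempts` polls pass without one.
 */
static int wait_for_state(const char *path, const char *want,
			  int attempts, long interval_ms)
{
	struct timespec delay = {
		.tv_sec = interval_ms / 1000,
		.tv_nsec = (interval_ms % 1000) * 1000000L,
	};
	char line[32];
	int i;

	for (i = 0; i < attempts; i++) {
		FILE *f = fopen(path, "r");

		if (f) {
			char *got = fgets(line, sizeof(line), f);

			fclose(f);
			if (got) {
				/* Strip the trailing newline before comparing. */
				line[strcspn(line, "\n")] = '\0';
				if (strcmp(line, want) == 0)
					return 0;
			}
		}
		nanosleep(&delay, NULL);
	}
	return -1;
}
```

A test loop would call this after each simulated crash and only write to
simulate_fw_crash again once it returns 0; a return of -1 would be
treated as a failed recovery.
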
Brian

2019-04-10 02:45:35

by Wen Gong

Subject: RE: [PATCH] ath10k: Remove ATH10K_STATE_RESTARTED in simulate fw crash

> -----Original Message-----
> From: Brian Norris <[email protected]>
> Sent: Wednesday, April 10, 2019 7:25 AM
> To: Wen Gong <[email protected]>
> Cc: Michał Kazior <[email protected]>; Wen Gong
> <[email protected]>; linux-wireless <[email protected]>;
> [email protected]
> Subject: [EXT] Re: [PATCH] ath10k: Remove ATH10K_STATE_RESTARTED in
> simulate fw crash
>
> On Mon, Apr 8, 2019 at 10:09 PM Wen Gong <[email protected]>
> wrote:
> > > From: Michał Kazior <[email protected]>
> > > To satisfy both I would suggest you either expose ar->state via
> > > debugfs and make your test procedure wait for that to get back into ON
> > > state before simulating a crash again, or to extend the set of current
> > > simulate_fw_crash commands (currently just: soft, hard, assert,
> > > hw-restart) to something that allows expressing the intent whether
> > > crash-in-crash prevention is intended (your case) or not (my original
> > > case).
> > >
> > > This could be for example something like this:
> > > echo soft wait-ready > simulate_fw_crash
> > >
> > > The "wait-ready" extra keyword would imply crash-in-crash prevention.
> > > This would keep existing tools working (both behavior and syntax) and
> > > would allow your test case to be implemented.
> > >
> > Is it easy to change your existing tools?
> > I want to change it to: echo soft skip-ready > simulate_fw_crash
> > The "skip-ready" extra keyword would imply crash-in-crash, *not*
> prevention.
> > My test tools is hard to change.
>
> In case you're talking about the test framework we run for ChromeOS
> validation, no, it's not hard at all to change. As long as there's a
> good reason.
>
> I haven't closely followed this, but judging by the above summary,
> it's probably more reasonable for our test framework to only simulate
> FW crashes after the driver returns to "ready" (or at least, if we do
> crash-in-crash, don't expect the driver to recover?). I expect we can
> work with whatever mechanism you implement for that (exposing the
> "state", or providing a new simulate_fw_crash mode).
>

If the ChromeOS tool is easy to change,
I think I will change the mechanism of simulate_fw_crash.
Then all tools will work normally.

> Brian

2019-05-28 02:50:14

by Wen Gong

Subject: RE: [PATCH] ath10k: Remove ATH10K_STATE_RESTARTED in simulate fw crash

> -----Original Message-----
> From: ath10k <[email protected]> On Behalf Of Wen Gong
> Sent: Wednesday, April 10, 2019 10:45 AM
> To: Brian Norris <[email protected]>
> Cc: Michał Kazior <[email protected]>; linux-wireless <linux-
> [email protected]>; [email protected]; Wen Gong
> <[email protected]>
> Subject: [EXT] RE: [PATCH] ath10k: Remove ATH10K_STATE_RESTARTED in
> simulate fw crash
>
> If ChromeOS is easy to change tool,
> I think I will change the mechanism of the simulate_fw_crash.
> Then all tools will work normally.
>
New patch uploaded
https://patchwork.kernel.org/patch/10897587/
[v2] ath10k: Remove ATH10K_STATE_RESTARTED in simulate fw crash