2012-09-27 14:41:10

by Sven Eckelmann

Subject: [PATCH] ath9k_hw: Handle AR_INTR_SYNC_HOST1_(FATAL|PERR) on AR9003

Interrupts with the sync_cause AR_INTR_SYNC_HOST1_FATAL and
AR_INTR_SYNC_HOST1_PERR have to be handled using a chip reset. Otherwise an
interrupt storm with unhandled interrupts will cause a hang or crash of the
machine.

Signed-off-by: Sven Eckelmann <[email protected]>
Signed-off-by: Simon Wunderlich <[email protected]>
---
I don't have any hardware documentation, so I need someone to check whether
these flags really behave as on AR9002, or whether this patch only fixes
some other problem by accident.

drivers/net/wireless/ath/ath9k/ar9003_mac.c | 17 +++++++++++++++++
1 file changed, 17 insertions(+)

diff --git a/drivers/net/wireless/ath/ath9k/ar9003_mac.c b/drivers/net/wireless/ath/ath9k/ar9003_mac.c
index d5b2e0e..301bf72 100644
--- a/drivers/net/wireless/ath/ath9k/ar9003_mac.c
+++ b/drivers/net/wireless/ath/ath9k/ar9003_mac.c
@@ -182,6 +182,7 @@ static bool ar9003_hw_get_isr(struct ath_hw *ah, enum ath9k_int *masked)
struct ath9k_hw_capabilities *pCap = &ah->caps;
struct ath_common *common = ath9k_hw_common(ah);
u32 sync_cause = 0, async_cause, async_mask = AR_INTR_MAC_IRQ;
+ bool fatal_int;

if (ath9k_hw_mci_is_enabled(ah))
async_mask |= AR_INTR_ASYNC_MASK_MCI;
@@ -310,6 +311,22 @@ static bool ar9003_hw_get_isr(struct ath_hw *ah, enum ath9k_int *masked)

if (sync_cause) {
ath9k_debug_sync_cause(common, sync_cause);
+ fatal_int =
+ (sync_cause &
+ (AR_INTR_SYNC_HOST1_FATAL | AR_INTR_SYNC_HOST1_PERR))
+ ? true : false;
+
+ if (fatal_int) {
+ if (sync_cause & AR_INTR_SYNC_HOST1_FATAL) {
+ ath_dbg(common, ANY,
+ "received PCI FATAL interrupt\n");
+ }
+ if (sync_cause & AR_INTR_SYNC_HOST1_PERR) {
+ ath_dbg(common, ANY,
+ "received PCI PERR interrupt\n");
+ }
+ *masked |= ATH9K_INT_FATAL;
+ }

if (sync_cause & AR_INTR_SYNC_RADM_CPL_TIMEOUT) {
REG_WRITE(ah, AR_RC, AR_RC_HOSTIF);
--
1.7.10.4



2012-10-05 23:48:47

by Adrian Chadd

Subject: Re: [ath9k-devel] [PATCHv2] ath9k_hw: Handle AR_INTR_SYNC_HOST1_FATAL on AR9003

On 5 October 2012 09:51, Sven Eckelmann <[email protected]> wrote:

>> Well, is it a RX chainmask thing, or is it a chip thing?
>>
>> It's totally possible to have an RX chainmask of say 0x2 or 0x4..
>
> What are you trying to tell us?

That the check "rx chainmask == 1? Definitely can't do MRC CCK"
implies "rx chainmask != 1? Definitely can do MRC CCK."
I think that's the wrong logic. It may be a general chipset problem
across some/all AR9300 and later chips that doing MRC CCK with only
one RX chain enabled is a problem, or it may be a single-chain NIC
problem.

I'm pretty sure we can configure any of the RX antennas; it doesn't
have to be "one chain == chain 0."
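The distinction drawn above can be sketched in plain C. This is only an illustration under assumptions: the helper names are hypothetical and are not ath9k functions, and the chainmask is modelled as a simple bitmask with one bit per RX chain.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper: count the RX chains enabled in a chainmask. */
static unsigned int rx_chain_count(uint8_t rx_chainmask)
{
	unsigned int count = 0;

	while (rx_chainmask) {
		count += rx_chainmask & 1u;
		rx_chainmask >>= 1;
	}
	return count;
}

/* The test from the proposed patch: true only when chain 0 alone is set. */
static bool mask_is_chain0_only(uint8_t rx_chainmask)
{
	return rx_chainmask == 1;
}

/* Alternative test: true for any mask with exactly one chain enabled. */
static bool mask_is_single_chain(uint8_t rx_chainmask)
{
	return rx_chain_count(rx_chainmask) == 1;
}
```

With masks of 0x2 or 0x4 the two tests disagree: only one chain is enabled, yet the mask is not equal to 1. Whether that case matters depends on whether the problem is tied to the chip or to the enabled chains.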

Anyway. I'll double check that.

> Maybe you missed that part of the conversation, but this was exactly what I
> did. You can even find one of my debug patches which implements it (but many
> fancy hacks are missing).

Cool. I did miss that. But if it's a TX descriptor issue it won't be
due to a register access? :)



Adrian

2012-10-03 14:51:29

by Adrian Chadd

Subject: Re: [ath9k-devel] [PATCHv2] ath9k_hw: Handle AR_INTR_SYNC_HOST1_FATAL on AR9003

On 2 October 2012 08:20, Felix Fietkau <[email protected]> wrote:

>> This sync cause 0x20 isn't handled anywhere and may be the cause of the
>> hang/crash. At least this is the symptom which can be fixed without crashing
>> the system.
> I checked the AR933x datasheet, and it says that cause 0x20 is tx
> descriptor corruption.

Ah hey, for Hornet they redefined those bits:

5: MAC_TXC_CORRUPTION_FLAG_SYNC (TX descriptor integrity flag)
6: INVALID_ADDRESS_ACCESS (invalid register access)

Good catch. That's definitely something in the right direction.
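As a small worked example of the bit numbering quoted above: bit 5 corresponds to the mask 0x20 and bit 6 to 0x40, so the observed sync_cause of 0x20 has only bit 5 set. The macro and function names below are made up for this sketch and are not taken from the ath9k headers.

```c
#include <stdbool.h>
#include <stdint.h>

/* Bit positions as quoted above for Hornet; the macro names are hypothetical. */
#define HORNET_SYNC_TXC_CORRUPTION  (1u << 5)	/* 0x20 */
#define HORNET_SYNC_INVALID_ACCESS  (1u << 6)	/* 0x40 */

/* True if the TX descriptor integrity flag (bit 5) is set. */
static bool hornet_sync_has_txc_corruption(uint32_t sync_cause)
{
	return (sync_cause & HORNET_SYNC_TXC_CORRUPTION) != 0;
}

/* True if the invalid register access flag (bit 6) is set. */
static bool hornet_sync_has_invalid_access(uint32_t sync_cause)
{
	return (sync_cause & HORNET_SYNC_INVALID_ACCESS) != 0;
}
```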



Adrian

2012-10-02 13:13:37

by Adrian Chadd

Subject: Re: [PATCHv2] ath9k_hw: Handle AR_INTR_SYNC_HOST1_FATAL on AR9003

.. well, the rule here is "You shouldn't get PERR/FATAL interrupts."

Haven't I posted a summary of what those errors are?

Ok. So they're signals from the PCIe core (named host1_fatal and
host1_perr. Helpfully.) Those errors occurred during a DMA transfer.

So the question is why you're seeing PERR interrupts when creating an
adhoc interface. That hints to me that something odd is going on..

I've seen these issues creep up when the NIC is in some way behaving
very, very badly (lots of timeouts and sync errors with little to no
traffic at all), which resulted in all kinds of odd and weird,
unstable behaviour. After replacing the NIC with another NIC (in my
case, an AR9280 -> AR9280 NIC :-) the errors went away and things
continued swimmingly.

I'd have to go digging through the PCIe core source to figure out
exactly what host1_perr and host1_fatal mean. I can if you'd like,
it'll just take some time as I'm not familiar at all with the PCIe
host interface.

Thanks,



Adrian

On 2 October 2012 03:33, Sven Eckelmann <[email protected]> wrote:
> Interrupts with the sync_cause AR_INTR_SYNC_HOST1_FATAL have to be handled
> using a chip reset. Otherwise an interrupt storm with unhandled interrupts
> will cause a hang or crash of the machine.
>
> Signed-off-by: Sven Eckelmann <[email protected]>
> ---
> I was informed that AR_INTR_SYNC_HOST1_PERR should not be handled this way
> because it can create system freezes after an adhoc interface was created.
>
> I really need some Atheros developer who can check the documentation to
> verify the interpretation of these flags. Otherwise this is just guessing
> and may lead to even bigger problems.
>
> drivers/net/wireless/ath/ath9k/ar9003_mac.c | 5 +++++
> 1 file changed, 5 insertions(+)
>
> diff --git a/drivers/net/wireless/ath/ath9k/ar9003_mac.c b/drivers/net/wireless/ath/ath9k/ar9003_mac.c
> index d5b2e0e..6031bdf 100644
> --- a/drivers/net/wireless/ath/ath9k/ar9003_mac.c
> +++ b/drivers/net/wireless/ath/ath9k/ar9003_mac.c
> @@ -311,6 +311,11 @@ static bool ar9003_hw_get_isr(struct ath_hw *ah, enum ath9k_int *masked)
> if (sync_cause) {
> ath9k_debug_sync_cause(common, sync_cause);
>
> + if (sync_cause & AR_INTR_SYNC_HOST1_FATAL) {
> + ath_dbg(common, ANY, "received PCI FATAL interrupt\n");
> + *masked |= ATH9K_INT_FATAL;
> + }
> +
> if (sync_cause & AR_INTR_SYNC_RADM_CPL_TIMEOUT) {
> REG_WRITE(ah, AR_RC, AR_RC_HOSTIF);
> REG_WRITE(ah, AR_RC, 0);
> --
> 1.7.10.4

2012-10-02 13:35:47

by Simon Wunderlich

Subject: Re: [PATCHv2] ath9k_hw: Handle AR_INTR_SYNC_HOST1_FATAL on AR9003

Hey Adrian,

On Tue, Oct 02, 2012 at 06:13:37AM -0700, Adrian Chadd wrote:
> .. well, the rule here is "You shouldn't get PERR/FATAL interrupts."
>
> Haven't I posted a summary of what those errors are?
>
> Ok. So they're signals from the PCIe core (named host1_fatal and
> host1_perr. Helpfully.) Those errors occurred during a DMA transfer.
>
> So the question is why you're seeing PERR interrupts when creating an
> adhoc interface. That hints to me that something odd is going on..

thanks for the explanation!

>
> I've seen these issues creep up when the NIC is in some way behaving
> very, very badly (lots of timeouts and sync errors with little to no
> traffic at all), which resulted in all kinds of odd and weird,
> unstable behaviour. After replacing the NIC with another NIC (in my
> case, an AR9280 -> AR9280 NIC :-) the errors went away and things
> continued swimmingly.

Sounds like a good solution, but I'm afraid it won't work for us. We
are using AR9330 SoCs (Hornet), and as long as we don't have a very sharp knife
we won't be able to replace the NIC ... And cutting a few thousand of
them won't be fun either.

I'm starting to lose a little bit of confidence in these insects ... :/

>
> I'd have to go digging through the PCIe core source to figure out
> exactly what host1_perr and host1_fatal mean. I can if you'd like,
> it'll just take some time as I'm not familiar at all with the PCIe
> host interface.

It would at least be interesting to know whether we are supposed to handle the
interrupt somehow, instead of resetting the chip.

Thanks,
Simon
>
> Thanks,
>
>
>
> Adrian
>
> On 2 October 2012 03:33, Sven Eckelmann <[email protected]> wrote:
> > Interrupts with the sync_cause AR_INTR_SYNC_HOST1_FATAL have to be handled
> > using a chip reset. Otherwise an interrupt storm with unhandled interrupts
> > will cause a hang or crash of the machine.
> >
> > Signed-off-by: Sven Eckelmann <[email protected]>
> > ---
> > I was informed that AR_INTR_SYNC_HOST1_PERR should not be handled this way
> > because it can create system freezes after an adhoc interface was created.
> >
> > I really need some Atheros developer who can check the documentation to
> > verify the interpretation of these flags. Otherwise this is just guessing
> > and may lead to even bigger problems.
> >
> > drivers/net/wireless/ath/ath9k/ar9003_mac.c | 5 +++++
> > 1 file changed, 5 insertions(+)
> >
> > diff --git a/drivers/net/wireless/ath/ath9k/ar9003_mac.c b/drivers/net/wireless/ath/ath9k/ar9003_mac.c
> > index d5b2e0e..6031bdf 100644
> > --- a/drivers/net/wireless/ath/ath9k/ar9003_mac.c
> > +++ b/drivers/net/wireless/ath/ath9k/ar9003_mac.c
> > @@ -311,6 +311,11 @@ static bool ar9003_hw_get_isr(struct ath_hw *ah, enum ath9k_int *masked)
> > if (sync_cause) {
> > ath9k_debug_sync_cause(common, sync_cause);
> >
> > + if (sync_cause & AR_INTR_SYNC_HOST1_FATAL) {
> > + ath_dbg(common, ANY, "received PCI FATAL interrupt\n");
> > + *masked |= ATH9K_INT_FATAL;
> > + }
> > +
> > if (sync_cause & AR_INTR_SYNC_RADM_CPL_TIMEOUT) {
> > REG_WRITE(ah, AR_RC, AR_RC_HOSTIF);
> > REG_WRITE(ah, AR_RC, 0);
> > --
> > 1.7.10.4



2012-10-05 16:51:59

by Sven Eckelmann

Subject: Re: [ath9k-devel] [PATCHv2] ath9k_hw: Handle AR_INTR_SYNC_HOST1_FATAL on AR9003

On Fri, Oct 05, 2012 at 09:21:40AM -0700, Adrian Chadd wrote:
> On 5 October 2012 05:34, Felix Fietkau <[email protected]> wrote:
>
> > ---
> > --- a/drivers/net/wireless/ath/ath9k/ani.c
> > +++ b/drivers/net/wireless/ath/ath9k/ani.c
> > @@ -307,7 +307,8 @@ void ath9k_ani_reset(struct ath_hw *ah,
> > if (IS_CHAN_2GHZ(chan)) {
> > ah->ani_function = (ATH9K_ANI_SPUR_IMMUNITY_LEVEL |
> > ATH9K_ANI_FIRSTEP_LEVEL);
> > - if (AR_SREV_9300_20_OR_LATER(ah))
> > + if (AR_SREV_9300_20_OR_LATER(ah) &&
> > + ah->caps.rx_chainmask != 1)
> > ah->ani_function |= ATH9K_ANI_MRC_CCK;
> > } else
> > ah->ani_function = 0;
>
> Well, is it a RX chainmask thing, or is it a chip thing?
>
> It's totally possible to have an RX chainmask of say 0x2 or 0x4..

What are you trying to tell us?

> Also to figure out which registers triggered the interrupt is likely
> going to be a bit .. special. Maybe keep a circular buffer of the last
> N register accesses and dump them in time order whenever you get an
> interrupt.

Maybe you missed that part of the conversation, but this was exactly what I
did. You can even find one of my debug patches which implements it (but many
fancy hacks are missing).

Kind regards,
Sven



2012-10-06 09:03:51

by Felix Fietkau

Subject: Re: [ath9k-devel] [PATCHv2] ath9k_hw: Handle AR_INTR_SYNC_HOST1_FATAL on AR9003

On 2012-10-06 1:48 AM, Adrian Chadd wrote:
> On 5 October 2012 09:51, Sven Eckelmann <[email protected]> wrote:
>
>>> Well, is it a RX chainmask thing, or is it a chip thing?
>>>
>>> It's totally possible to have an RX chainmask of say 0x2 or 0x4..
>>
>> What are you trying to tell us?
>
> That the check for "rx chainmask == 1? Definitely can't do MRC CCK"
> implying "rx chainmask != 1? Definitely can do MRC CCK."
> I think that's the wrong logic. It may be a general chipset problem
> across some/all AR9300 and later chips that doing MRC CCK with only
> one RX chain enabled is a problem, or it may be a single-chain NIC
> problem.
>
> I'm pretty sure we can configure any of the RX antennas; it doesn't
> have to be "one chain == chain 0."
>
> Anyway. I'll double check that.
I'm pretty sure it's an issue specific to single-stream chipsets, where
all MRC functionality was left out (and thus the register access leads
nowhere).
I don't think this needs to consider multi-stream chipsets with only one
enabled chain.

- Felix


2012-10-02 15:02:34

by Sven Eckelmann

Subject: Re: [PATCHv2] ath9k_hw: Handle AR_INTR_SYNC_HOST1_FATAL on AR9003

On Tuesday 02 October 2012 07:06:03 Adrian Chadd wrote:
> Hm, there are still issues on Hornet?

Yes, we still have problems with Hornet. The issue I am trying to "fix" with
this patch is an interrupt storm on AR9330 devices with sta interface(s).
Random devices crash after getting a stacktrace reporting __report_bad_irq.
The crash either results in a reboot or hang of the device.

[ 952.950000] irq 2: nobody cared (try booting with the "irqpoll" option)
[ 952.950000] Call Trace:
[ 952.950000] [<8026ade8>] dump_stack+0x8/0x34
[ 952.950000] [<800a75d0>] __report_bad_irq+0x44/0xf4
[ 952.950000] [<800a78ec>] note_interrupt+0x200/0x2a4
[ 952.950000] [<800a58c8>] handle_irq_event_percpu+0x19c/0x1e0
[ 952.950000] [<800a86cc>] handle_percpu_irq+0x54/0x88
[ 952.950000] [<800a501c>] generic_handle_irq+0x3c/0x4c
[ 952.950000] [<80064748>] do_IRQ+0x1c/0x34
[ 952.950000] [<80062d6c>] ret_from_irq+0x0/0x4
[ 952.950000] [<8007673c>] tasklet_action+0xb8/0xd4
[ 952.950000] [<80076c24>] __do_softirq+0xa0/0x154
[ 952.950000] [<80076e30>] do_softirq+0x48/0x68
[ 952.950000] [<80076f94>] local_bh_enable+0x94/0xb0
[ 952.950000] [<83406d60>] cfg80211_scan_done+0x670/0x6d0 [cfg80211]
[ 952.950000]
[ 952.950000] handlers:
[ 952.950000] [<83564d48>] ath_isr
[ 952.950000] Disabling IRQ #2

The test setup is using 30 AR9330 devices running OpenWRT 32727/33559. 32727
is using compat-wireless-2012-04-17 (+ many OpenWRT patches) and 33559 is
running compat-wireless-2012-09-07 (+many more patches from Felix). 1 device
is running an open AP device (standard OpenWRT settings) and 29 devices are
trying to connect. Random devices will now fail. To debug this problem, I used
one device with 8 vifs and restarted the network script again and
again to force the recreation of the vifs and a reconnect.

The stack trace doesn't seem to be very helpful. Therefore, I checked ath_isr
and noticed that the interrupts right before the device crash get the status 0
from ar9003_hw_get_isr. Digging a little bit further also revealed that the
interrupts in the interrupt storm also have async_cause 0 and sync_cause 0x20.

This sync cause 0x20 isn't handled anywhere and may be the cause of the
hang/crash. At least this is the symptom which can be fixed without crashing
the system.

I hope that helps to track down the problem.

Kind regards,
Sven



2012-10-05 16:05:16

by Sven Eckelmann

Subject: Re: [ath9k-devel] [PATCHv2] ath9k_hw: Handle AR_INTR_SYNC_HOST1_FATAL on AR9003

On Friday 05 October 2012 17:15:01 Felix Fietkau wrote:
> Thanks for testing, but for submitting I'd prefer something simple,
> how about this:
> ---
> --- a/drivers/net/wireless/ath/ath9k/ar9003_phy.c
> +++ b/drivers/net/wireless/ath/ath9k/ar9003_phy.c
> @@ -1035,6 +1035,10 @@ static bool ar9003_hw_ani_control(struct
> * is_on == 0 means MRC CCK is OFF (more noise imm)
> */
> bool is_on = param ? 1 : 0;
> +
> + if (ah->caps.rx_chainmask == 1)
> + break;
> +
> REG_RMW_FIELD(ah, AR_PHY_MRC_CCK_CTRL,
> AR_PHY_MRC_CCK_ENABLE, is_on);
> REG_RMW_FIELD(ah, AR_PHY_MRC_CCK_CTRL,

Yes, looks good.

Kind regards,
Sven



2012-10-05 15:15:10

by Felix Fietkau

Subject: Re: [ath9k-devel] [PATCHv2] ath9k_hw: Handle AR_INTR_SYNC_HOST1_FATAL on AR9003

On 2012-10-05 5:03 PM, Sven Eckelmann wrote:
> On Friday 05 October 2012 15:24:25 Felix Fietkau wrote:
>> On 2012-10-05 3:07 PM, Sven Eckelmann wrote:
>> > On Friday 05 October 2012 14:34:28 Felix Fietkau wrote:
>> >> On 2012-10-05 1:08 PM, Sven Eckelmann wrote:
>> > [...]
>> >
>> >> Please try this patch to see if it gets rid of these interrupts:
>> >> ---
>> >> --- a/drivers/net/wireless/ath/ath9k/ani.c
>> >> +++ b/drivers/net/wireless/ath/ath9k/ani.c
>> >> @@ -307,7 +307,8 @@ void ath9k_ani_reset(struct ath_hw *ah,
>> >> if (IS_CHAN_2GHZ(chan)) {
>> >> ah->ani_function = (ATH9K_ANI_SPUR_IMMUNITY_LEVEL |
>> >> ATH9K_ANI_FIRSTEP_LEVEL);
>> >> - if (AR_SREV_9300_20_OR_LATER(ah))
>> >> + if (AR_SREV_9300_20_OR_LATER(ah) &&
>> >> + ah->caps.rx_chainmask != 1)
>> >> ah->ani_function |= ATH9K_ANI_MRC_CCK;
>> >> } else
>> >> ah->ani_function = 0;
>> >
>> > Looks partially good. At least this patch fixed parts of my Friday :D
>> >
>> > I have more similar bugs, but at least this one is related to a bandwidth
>> > problem which I also wanted to check today. But it didn't fix _this_
>> > invalid register access on the client device (but I don't see it anymore
>> > on the AP device).
>>
>> Are you sure that it's still the same register access on the client
>> side? I don't see how it could still access MRC related registers with
>> this part masked out.
>
> Yes, I am sure. Let's read some code:
>
> if (ah->opmode == NL80211_IFTYPE_AP) {
> if (IS_CHAN_2GHZ(chan)) {
> ah->ani_function = (ATH9K_ANI_SPUR_IMMUNITY_LEVEL |
> ATH9K_ANI_FIRSTEP_LEVEL);
> if (AR_SREV_9300_20_OR_LATER(ah) &&
> ah->caps.rx_chainmask != 1)
> ah->ani_function |= ATH9K_ANI_MRC_CCK;
> } else
> ah->ani_function = 0;
> }
>
> Now raise your hands when you see the "ah->opmode == NL80211_IFTYPE_AP". I've
> just added the following after this block
>
> if (!AR_SREV_9300_20_OR_LATER(ah) || ah->caps.rx_chainmask == 1)
> ah->ani_function &= ~ATH9K_ANI_MRC_CCK;
>
> But maybe it is better to fix the test in __ath9k_hw_init
>
> if (!AR_SREV_9300_20_OR_LATER(ah) || ah->caps.rx_chainmask == 1)
> ah->ani_function &= ~ATH9K_ANI_MRC_CCK;
>
> The problem in __ath9k_hw_init is the value of ah->caps.rx_chainmask ... which
> is not yet initialized correctly (and therefore ends up as 0).
>
> I've attached my "please don't enable MRC CCK" version of the patch. Feel free
> to submit it because you've submitted the initial version... or other things
> with it ;)
Thanks for testing, but for submitting I'd prefer something simple,
how about this:
---
--- a/drivers/net/wireless/ath/ath9k/ar9003_phy.c
+++ b/drivers/net/wireless/ath/ath9k/ar9003_phy.c
@@ -1035,6 +1035,10 @@ static bool ar9003_hw_ani_control(struct
* is_on == 0 means MRC CCK is OFF (more noise imm)
*/
bool is_on = param ? 1 : 0;
+
+ if (ah->caps.rx_chainmask == 1)
+ break;
+
REG_RMW_FIELD(ah, AR_PHY_MRC_CCK_CTRL,
AR_PHY_MRC_CCK_ENABLE, is_on);
REG_RMW_FIELD(ah, AR_PHY_MRC_CCK_CTRL,


2012-10-05 13:07:14

by Sven Eckelmann

Subject: Re: [ath9k-devel] [PATCHv2] ath9k_hw: Handle AR_INTR_SYNC_HOST1_FATAL on AR9003

On Friday 05 October 2012 14:34:28 Felix Fietkau wrote:
> On 2012-10-05 1:08 PM, Sven Eckelmann wrote:
[...]
> Please try this patch to see if it gets rid of these interrupts:
> ---
> --- a/drivers/net/wireless/ath/ath9k/ani.c
> +++ b/drivers/net/wireless/ath/ath9k/ani.c
> @@ -307,7 +307,8 @@ void ath9k_ani_reset(struct ath_hw *ah,
> if (IS_CHAN_2GHZ(chan)) {
> ah->ani_function = (ATH9K_ANI_SPUR_IMMUNITY_LEVEL |
> ATH9K_ANI_FIRSTEP_LEVEL);
> - if (AR_SREV_9300_20_OR_LATER(ah))
> + if (AR_SREV_9300_20_OR_LATER(ah) &&
> + ah->caps.rx_chainmask != 1)
> ah->ani_function |= ATH9K_ANI_MRC_CCK;
> } else
> ah->ani_function = 0;

Looks partially good. At least this patch fixed parts of my Friday :D

I have more similar bugs, but at least this one is related to a bandwidth
problem which I also wanted to check today. But it didn't fix _this_ invalid
register access on the client device (but I don't see it anymore on the AP
device).

Kind regards,
Sven



2012-10-05 16:21:41

by Adrian Chadd

Subject: Re: [ath9k-devel] [PATCHv2] ath9k_hw: Handle AR_INTR_SYNC_HOST1_FATAL on AR9003

On 5 October 2012 05:34, Felix Fietkau <[email protected]> wrote:

> ---
> --- a/drivers/net/wireless/ath/ath9k/ani.c
> +++ b/drivers/net/wireless/ath/ath9k/ani.c
> @@ -307,7 +307,8 @@ void ath9k_ani_reset(struct ath_hw *ah,
> if (IS_CHAN_2GHZ(chan)) {
> ah->ani_function = (ATH9K_ANI_SPUR_IMMUNITY_LEVEL |
> ATH9K_ANI_FIRSTEP_LEVEL);
> - if (AR_SREV_9300_20_OR_LATER(ah))
> + if (AR_SREV_9300_20_OR_LATER(ah) &&
> + ah->caps.rx_chainmask != 1)
> ah->ani_function |= ATH9K_ANI_MRC_CCK;
> } else
> ah->ani_function = 0;

Well, is it a RX chainmask thing, or is it a chip thing?

It's totally possible to have an RX chainmask of say 0x2 or 0x4..

Also to figure out which registers triggered the interrupt is likely
going to be a bit .. special. Maybe keep a circular buffer of the last
N register accesses and dump them in time order whenever you get an
interrupt.
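A minimal userspace sketch of such a circular buffer might look like the following. It is not the debug patch discussed in this thread; the structure and function names are hypothetical, and the buffer length of 8 is an arbitrary choice for illustration.

```c
#include <stdint.h>
#include <stdio.h>

#define REG_TRACE_LEN 8	/* arbitrary size of the history, for illustration */

struct reg_access {
	uint32_t reg;		/* register offset */
	uint32_t val;		/* value read or written */
	int is_write;		/* 1 for a write, 0 for a read */
};

struct reg_trace {
	struct reg_access entries[REG_TRACE_LEN];
	unsigned int next;	/* slot that the next record will overwrite */
	unsigned int count;	/* valid entries, at most REG_TRACE_LEN */
};

/* Record one access, overwriting the oldest entry once the buffer is full. */
static void reg_trace_record(struct reg_trace *t, uint32_t reg, uint32_t val,
			     int is_write)
{
	t->entries[t->next] = (struct reg_access){ reg, val, is_write };
	t->next = (t->next + 1) % REG_TRACE_LEN;
	if (t->count < REG_TRACE_LEN)
		t->count++;
}

/* Copy the history into out[], oldest first; returns the number of entries. */
static unsigned int reg_trace_snapshot(const struct reg_trace *t,
				       struct reg_access *out)
{
	unsigned int start = (t->next + REG_TRACE_LEN - t->count) % REG_TRACE_LEN;
	unsigned int i;

	for (i = 0; i < t->count; i++)
		out[i] = t->entries[(start + i) % REG_TRACE_LEN];
	return t->count;
}

/* Print the history in time order, e.g. when an unexpected interrupt arrives. */
static void reg_trace_dump(const struct reg_trace *t)
{
	struct reg_access hist[REG_TRACE_LEN];
	unsigned int n = reg_trace_snapshot(t, hist);
	unsigned int i;

	for (i = 0; i < n; i++)
		printf("%s 0x%08x = 0x%08x\n", hist[i].is_write ? "W" : "R",
		       (unsigned int)hist[i].reg, (unsigned int)hist[i].val);
}
```

A caller would invoke reg_trace_record() from its register read and write wrappers and reg_trace_dump() from the interrupt path. In a real driver the buffer would also need protection against concurrent access, which this sketch leaves out.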

Adrian

2012-10-05 15:03:54

by Sven Eckelmann

Subject: Re: [ath9k-devel] [PATCHv2] ath9k_hw: Handle AR_INTR_SYNC_HOST1_FATAL on AR9003

On Friday 05 October 2012 15:24:25 Felix Fietkau wrote:
> On 2012-10-05 3:07 PM, Sven Eckelmann wrote:
> > On Friday 05 October 2012 14:34:28 Felix Fietkau wrote:
> >> On 2012-10-05 1:08 PM, Sven Eckelmann wrote:
> > [...]
> >
> >> Please try this patch to see if it gets rid of these interrupts:
> >> ---
> >> --- a/drivers/net/wireless/ath/ath9k/ani.c
> >> +++ b/drivers/net/wireless/ath/ath9k/ani.c
> >> @@ -307,7 +307,8 @@ void ath9k_ani_reset(struct ath_hw *ah,
> >> if (IS_CHAN_2GHZ(chan)) {
> >> ah->ani_function = (ATH9K_ANI_SPUR_IMMUNITY_LEVEL |
> >> ATH9K_ANI_FIRSTEP_LEVEL);
> >> - if (AR_SREV_9300_20_OR_LATER(ah))
> >> + if (AR_SREV_9300_20_OR_LATER(ah) &&
> >> + ah->caps.rx_chainmask != 1)
> >> ah->ani_function |= ATH9K_ANI_MRC_CCK;
> >> } else
> >> ah->ani_function = 0;
> >
> > Looks partially good. At least this patch fixed parts of my Friday :D
> >
> > I have more similar bugs, but at least this one is related to a bandwidth
> > problem which I also wanted to check today. But it didn't fix _this_
> > invalid register access on the client device (but I don't see it anymore
> > on the AP device).
>
> Are you sure that it's still the same register access on the client
> side? I don't see how it could still access MRC related registers with
> this part masked out.

Yes, I am sure. Let's read some code:

if (ah->opmode == NL80211_IFTYPE_AP) {
if (IS_CHAN_2GHZ(chan)) {
ah->ani_function = (ATH9K_ANI_SPUR_IMMUNITY_LEVEL |
ATH9K_ANI_FIRSTEP_LEVEL);
if (AR_SREV_9300_20_OR_LATER(ah) &&
ah->caps.rx_chainmask != 1)
ah->ani_function |= ATH9K_ANI_MRC_CCK;
} else
ah->ani_function = 0;
}

Now raise your hands when you see the "ah->opmode == NL80211_IFTYPE_AP". I've
just added the following after this block

if (!AR_SREV_9300_20_OR_LATER(ah) || ah->caps.rx_chainmask == 1)
ah->ani_function &= ~ATH9K_ANI_MRC_CCK;

But maybe it is better to fix the test in __ath9k_hw_init

if (!AR_SREV_9300_20_OR_LATER(ah) || ah->caps.rx_chainmask == 1)
ah->ani_function &= ~ATH9K_ANI_MRC_CCK;

The problem in __ath9k_hw_init is the value of ah->caps.rx_chainmask ... which
is not yet initialized correctly (and therefore ends up as 0).

I've attached my "please don't enable MRC CCK" version of the patch. Feel free
to submit it because you've submitted the initial version... or other things
with it ;)

And thanks a lot for the help.

> Maybe it would make sense to come up with a debugging patch that checks
> the IRQ status register on every register access to see if an error was
> reported by the last one, and if there is an error, throw a stack trace.

Already done that. But I didn't get enough useful information from this spam of
half-baked stack dumps. But storing the value of the sync_cause register
helped a lot.

Kind regards,
Sven


Attachments:
991-ani_mrc_cck.patch (1.13 kB)

2012-10-02 10:33:37

by Sven Eckelmann

Subject: [PATCHv2] ath9k_hw: Handle AR_INTR_SYNC_HOST1_FATAL on AR9003

Interrupts with the sync_cause AR_INTR_SYNC_HOST1_FATAL have to be handled
using a chip reset. Otherwise an interrupt storm with unhandled interrupts
will cause a hang or crash of the machine.

Signed-off-by: Sven Eckelmann <[email protected]>
---
I was informed that AR_INTR_SYNC_HOST1_PERR should not be handled this way
because it can create system freezes after an adhoc interface was created.

I really need some Atheros developer who can check the documentation to
verify the interpretation of these flags. Otherwise this is just guessing
and may lead to even bigger problems.

drivers/net/wireless/ath/ath9k/ar9003_mac.c | 5 +++++
1 file changed, 5 insertions(+)

diff --git a/drivers/net/wireless/ath/ath9k/ar9003_mac.c b/drivers/net/wireless/ath/ath9k/ar9003_mac.c
index d5b2e0e..6031bdf 100644
--- a/drivers/net/wireless/ath/ath9k/ar9003_mac.c
+++ b/drivers/net/wireless/ath/ath9k/ar9003_mac.c
@@ -311,6 +311,11 @@ static bool ar9003_hw_get_isr(struct ath_hw *ah, enum ath9k_int *masked)
if (sync_cause) {
ath9k_debug_sync_cause(common, sync_cause);

+ if (sync_cause & AR_INTR_SYNC_HOST1_FATAL) {
+ ath_dbg(common, ANY, "received PCI FATAL interrupt\n");
+ *masked |= ATH9K_INT_FATAL;
+ }
+
if (sync_cause & AR_INTR_SYNC_RADM_CPL_TIMEOUT) {
REG_WRITE(ah, AR_RC, AR_RC_HOSTIF);
REG_WRITE(ah, AR_RC, 0);
--
1.7.10.4


2012-10-02 15:20:15

by Felix Fietkau

Subject: Re: [ath9k-devel] [PATCHv2] ath9k_hw: Handle AR_INTR_SYNC_HOST1_FATAL on AR9003

On 2012-10-02 5:02 PM, Sven Eckelmann wrote:
> On Tuesday 02 October 2012 07:06:03 Adrian Chadd wrote:
>> Hm, there are still issues on Hornet?
>
> Yes, we still have problems with Hornet. The issue I am trying to "fix" with
> this patch is an interrupt storm on AR9330 devices with sta interface(s).
> Random devices crash after getting a stacktrace reporting __report_bad_irq.
> The crash either results in a reboot or hang of the device.
>
> [ 952.950000] irq 2: nobody cared (try booting with the "irqpoll" option)
> [ 952.950000] Call Trace:
> [ 952.950000] [<8026ade8>] dump_stack+0x8/0x34
> [ 952.950000] [<800a75d0>] __report_bad_irq+0x44/0xf4
> [ 952.950000] [<800a78ec>] note_interrupt+0x200/0x2a4
> [ 952.950000] [<800a58c8>] handle_irq_event_percpu+0x19c/0x1e0
> [ 952.950000] [<800a86cc>] handle_percpu_irq+0x54/0x88
> [ 952.950000] [<800a501c>] generic_handle_irq+0x3c/0x4c
> [ 952.950000] [<80064748>] do_IRQ+0x1c/0x34
> [ 952.950000] [<80062d6c>] ret_from_irq+0x0/0x4
> [ 952.950000] [<8007673c>] tasklet_action+0xb8/0xd4
> [ 952.950000] [<80076c24>] __do_softirq+0xa0/0x154
> [ 952.950000] [<80076e30>] do_softirq+0x48/0x68
> [ 952.950000] [<80076f94>] local_bh_enable+0x94/0xb0
> [ 952.950000] [<83406d60>] cfg80211_scan_done+0x670/0x6d0 [cfg80211]
> [ 952.950000]
> [ 952.950000] handlers:
> [ 952.950000] [<83564d48>] ath_isr
> [ 952.950000] Disabling IRQ #2
>
> The test setup is using 30 AR9330 devices running OpenWRT 32727/33559. 32727
> is using compat-wireless-2012-04-17 (+ many OpenWRT patches) and 33559 is
> running compat-wireless-2012-09-07 (+many more patches from Felix). 1 device
> is running an open AP device (standard OpenWRT settings) and 29 devices are
> trying to connect. Random devices will now fail. To debug this problem, I used
> one device with 8 vifs and restarted the network script again and
> again to force the recreation of the vifs and a reconnect.
>
> The stack trace doesn't seem to be very helpful. Therefore, I checked ath_isr
> and noticed that the interrupts right before the device crash get the status 0
> from ar9003_hw_get_isr. Digging a little bit further also revealed that the
> interrupts in the interrupt storm also have async_cause 0 and sync_cause 0x20.
>
> This sync cause 0x20 isn't handled anywhere and may be the cause of the
> hang/crash. At least this is the symptom which can be fixed without crashing
> the system.
I checked the AR933x datasheet, and it says that cause 0x20 is tx
descriptor corruption.

- Felix


2012-10-02 14:06:03

by Adrian Chadd

Subject: Re: [PATCHv2] ath9k_hw: Handle AR_INTR_SYNC_HOST1_FATAL on AR9003

Hm, there are still issues on Hornet?

And no, you're not supposed to handle the interrupt per se.. it's a
sign that things got upset and you can't trust anything from that
point forward.

Felix is right, it'd be good to log the register accesses that lead up to this.



Adrian

2012-10-05 12:34:36

by Felix Fietkau

Subject: Re: [ath9k-devel] [PATCHv2] ath9k_hw: Handle AR_INTR_SYNC_HOST1_FATAL on AR9003

On 2012-10-05 1:08 PM, Sven Eckelmann wrote:
> On Wednesday 03 October 2012 07:51:28 Adrian Chadd wrote:
>> On 2 October 2012 08:20, Felix Fietkau <[email protected]> wrote:
>> >> This sync cause 0x20 isn't handled anywhere and may be the cause of the
>> >> hang/crash. At least this is the symptom which can be fixed without
>> >> crashing the system.
>> >
>> > I checked the AR933x datasheet, and it says that cause 0x20 is tx
>> > descriptor corruption.
>>
>> Ah hey, for Hornet they redefined those bits:
>>
>> 5: MAC_TXC_CORRUPTION_FLAG_SYNC (TX descriptor integrity flag)
>> 6: INVALID_ADDRESS_ACCESS (invalid register access)
>>
>> Good catch. That's definitely something in the right direction.
>
> Ok, I've just created a dirty hack to trace some of the register reads/writes.
>
> I used the following test setup: Two Hornets, one is the "internet gateway" (just
> attached using ethernet to a test server) and is running one AP vif with
> standard OpenWRT settings. The other Hornet is placed next to it (only a few
> centimeters away) and is connected to the AP using WiFi.
>
> The client device is just trying to download a large file using HTTP. The
> serial consoles on both devices will now print the "same" log. I already
> searched for the interesting section. It is started in ath_ani_calibrate and
> contains ~42 register access operations.
>
> My best guess is the REG_RMW_FIELD for ATH9K_ANI_MRC_CCK in
> ar9003_hw_ani_control (just checked sync_cause before and after the access).
>
> So, now I need some input again from the guys with the spec. :)
Actually, this makes a lot of sense. Maximal Ratio Combining can only
be done if you have multiple inputs, and Hornet is a single-chain
device ;)

Please try this patch to see if it gets rid of these interrupts:
---
--- a/drivers/net/wireless/ath/ath9k/ani.c
+++ b/drivers/net/wireless/ath/ath9k/ani.c
@@ -307,7 +307,8 @@ void ath9k_ani_reset(struct ath_hw *ah,
if (IS_CHAN_2GHZ(chan)) {
ah->ani_function = (ATH9K_ANI_SPUR_IMMUNITY_LEVEL |
ATH9K_ANI_FIRSTEP_LEVEL);
- if (AR_SREV_9300_20_OR_LATER(ah))
+ if (AR_SREV_9300_20_OR_LATER(ah) &&
+ ah->caps.rx_chainmask != 1)
ah->ani_function |= ATH9K_ANI_MRC_CCK;
} else
ah->ani_function = 0;


2012-10-05 11:08:36

by Sven Eckelmann

Subject: Re: [ath9k-devel] [PATCHv2] ath9k_hw: Handle AR_INTR_SYNC_HOST1_FATAL on AR9003

On Wednesday 03 October 2012 07:51:28 Adrian Chadd wrote:
> On 2 October 2012 08:20, Felix Fietkau <[email protected]> wrote:
> >> This sync cause 0x20 isn't handled anywhere and may be the cause of the
> >> hang/crash. At least this is the symptom which can be fixed without
> >> crashing the system.
> >
> > I checked the AR933x datasheet, and it says that cause 0x20 is tx
> > descriptor corruption.
>
> Ah hey, for Hornet they redefined those bits:
>
> 5: MAC_TXC_CORRUPTION_FLAG_SYNC (TX descriptor integrity flag)
> 6: INVALID_ADDRESS_ACCESS (invalid register access)
>
> Good catch. That's definitely something in the right direction.
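
The redefined Hornet bits quoted above can be sketched as a small decoder. This is a minimal illustration only: the bit positions (5 and 6) come from the quoted text, while the macro and function names are made up for this sketch and are not ath9k identifiers.

```c
#include <stdint.h>
#include <stdio.h>

/* Bit positions as quoted in this thread for Hornet (AR933x).
 * The names below are hypothetical, chosen only for this sketch. */
#define HORNET_SYNC_TXC_CORRUPTION  (1u << 5) /* MAC_TXC_CORRUPTION_FLAG_SYNC */
#define HORNET_SYNC_INVALID_ADDRESS (1u << 6) /* INVALID_ADDRESS_ACCESS */

/* Print every cause this sketch knows about and return how many were set. */
static int hornet_describe_sync_cause(uint32_t sync_cause)
{
	int known = 0;

	if (sync_cause & HORNET_SYNC_TXC_CORRUPTION) {
		puts("TX descriptor integrity flag");
		known++;
	}
	if (sync_cause & HORNET_SYNC_INVALID_ADDRESS) {
		puts("invalid register access");
		known++;
	}
	return known;
}
```

With this reading, the sync cause 0x20 mentioned above has only bit 5 set, so the sketch reports it as the TX descriptor integrity flag.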

Ok, I've just created a dirty hack to trace some of the register reads/writes.

I used the following test setup: Two Hornets, one is the "internet gateway" (just
attached using ethernet to a test server) and is running one AP vif with
standard OpenWRT settings. The other Hornet is placed next to it (only a few
centimeters away) and is connected to the AP using WiFi.

The client device is just trying to download a large file using HTTP. The
serial consoles on both devices will now print the "same" log. I already
searched for the interesting section. It is started in ath_ani_calibrate and
contains ~42 register access operations.

My best guess is the REG_RMW_FIELD for ATH9K_ANI_MRC_CCK in
ar9003_hw_ani_control (just checked sync_cause before and after the access).

So, now I need some input again from the guys with the spec. :)

Kind regards,
Sven


Attachments:
999-debug_reg_write.patch (11.19 kB)
invalid_register_access.log (26.59 kB)
signature.asc (836.00 B)
This is a digitally signed message part.

2012-10-05 13:24:31

by Felix Fietkau

Subject: Re: [ath9k-devel] [PATCHv2] ath9k_hw: Handle AR_INTR_SYNC_HOST1_FATAL on AR9003

On 2012-10-05 3:07 PM, Sven Eckelmann wrote:
> On Friday 05 October 2012 14:34:28 Felix Fietkau wrote:
>> On 2012-10-05 1:08 PM, Sven Eckelmann wrote:
> [...]
>> Please try this patch to see if it gets rid of these interrupts:
>> ---
>> --- a/drivers/net/wireless/ath/ath9k/ani.c
>> +++ b/drivers/net/wireless/ath/ath9k/ani.c
>> @@ -307,7 +307,8 @@ void ath9k_ani_reset(struct ath_hw *ah,
>>          if (IS_CHAN_2GHZ(chan)) {
>>                  ah->ani_function = (ATH9K_ANI_SPUR_IMMUNITY_LEVEL |
>>                                      ATH9K_ANI_FIRSTEP_LEVEL);
>> -                if (AR_SREV_9300_20_OR_LATER(ah))
>> +                if (AR_SREV_9300_20_OR_LATER(ah) &&
>> +                    ah->caps.rx_chainmask != 1)
>>                          ah->ani_function |= ATH9K_ANI_MRC_CCK;
>>          } else
>>                  ah->ani_function = 0;
>
> Looks partially good. At least this patch fixed parts of my Friday :D
>
> I have more similar bugs, but at least this one is related to a bandwidth
> problem which I also wanted to check today. But it didn't fix _this_ invalid
> register access on the client device (but I don't see it anymore on the AP
> device).
Are you sure that it's still the same register access on the client
side? I don't see how it could still access MRC related registers with
this part masked out.

Maybe it would make sense to come up with a debugging patch that checks
the IRQ status register on every register access to see if an error was
reported by the last one, and if there is an error, throw a stack trace.

- Felix
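
The debugging idea above (check the IRQ status after every register access) could look roughly like the following userspace sketch. Everything here is an assumption made for illustration: the register file is simulated, `FAKE_BAD_REG` is an invented offset that the fake device rejects, and the helper names are not ath9k functions. The offset 0x4028 is the one read as the sync cause in the register trace posted later in this thread, and bit 6 is the invalid-access bit quoted for Hornet. In a real driver the report would more likely be a stack trace (for example via `dump_stack()`) than an `fprintf`.

```c
#include <stdint.h>
#include <stdio.h>

#define REG_SPACE      0x10000u
#define SYNC_CAUSE_REG 0x4028u   /* offset read as sync cause in the trace in this thread */
#define INVALID_ACCESS (1u << 6) /* Hornet bit 6, as quoted in this thread */
#define FAKE_BAD_REG   0x9e00u   /* hypothetical offset the simulated device rejects */

static uint32_t fake_regs[REG_SPACE / 4];
static uint32_t fake_sync_cause;

/* Simulated register read: the sync cause register reflects latched errors. */
static uint32_t hw_read(uint32_t reg)
{
	if (reg == SYNC_CAUSE_REG)
		return fake_sync_cause;
	return reg < REG_SPACE ? fake_regs[reg / 4] : 0;
}

/* Simulated register write: touching FAKE_BAD_REG latches an error bit. */
static void hw_write(uint32_t reg, uint32_t val)
{
	if (reg == FAKE_BAD_REG) {
		fake_sync_cause |= INVALID_ACCESS;
		return;
	}
	if (reg < REG_SPACE)
		fake_regs[reg / 4] = val;
}

/* Write a register and report if this access newly raised a sync cause bit.
 * Returns 0 if nothing new was raised, -1 otherwise. */
static int checked_write(uint32_t reg, uint32_t val, const char *caller)
{
	uint32_t before = hw_read(SYNC_CAUSE_REG);
	uint32_t raised;

	hw_write(reg, val);
	raised = hw_read(SYNC_CAUSE_REG) & ~before;
	if (raised) {
		fprintf(stderr, "%s: write 0x%04x <- 0x%08x raised sync_cause 0x%x\n",
			caller, (unsigned)reg, (unsigned)val, (unsigned)raised);
		return -1;
	}
	return 0;
}
```

Comparing the cause before and after the access means only the access that newly sets a bit is blamed, not every later access while the bit stays latched.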

2012-10-02 13:33:40

by Felix Fietkau

Subject: Re: [ath9k-devel] [PATCHv2] ath9k_hw: Handle AR_INTR_SYNC_HOST1_FATAL on AR9003

On 2012-10-02 3:13 PM, Adrian Chadd wrote:
> .. well, the rule here is "You shouldn't get PERR/FATAL interrupts."
>
> Haven't I posted a summary of what those errors are?
>
> Ok. So they're signals from the PCIe core (named host1_fatal and
> host1_perr. Helpfully.) Those errors occurred during a DMA transfer.
>
> So the question is why you're seeing PERR interrupts when creating an
> adhoc interface. That hints to me that something odd is going on..
>
> I've seen these issues creep up when the NIC is in some way behaving
> very, very badly (lots of timeouts and sync errors with little to no
> traffic at all), which resulted in all kinds of odd and weird,
> unstable behaviour. After replacing the NIC with another NIC (in my
> case, an AR9280 -> AR9280 NIC :-) the errors went away and things
> continued swimmingly.
>
> I'd have to go digging through the PCIe core source to figure out
> exactly what host1_perr and host1_fatal mean. I can if you'd like,
> it'll just take some time as I'm not familiar at all with the PCIe
> host interface.
According to the datasheet, AR_INTR_SYNC_HOST1_PERR is triggered by an
invalid register access, and AR_INTR_SYNC_HOST1_FATAL is triggered by
corrupt descriptors or other DMA issues.
Maybe you can get some information on the source of this PERR error if
you record the last register accesses outside of the irq context and
print them once this IRQ comes in.

- Felix
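
Recording the last register accesses and printing them when the interrupt arrives can be sketched with a small ring buffer. This is a minimal userspace illustration under stated assumptions: the names (`trace_record`, `trace_snapshot`, `trace_dump`) and the buffer size are invented for this sketch, and it has no locking, which a real driver would need because the recording side and the IRQ side run in different contexts.

```c
#include <stdint.h>
#include <stdio.h>

#define TRACE_LEN 8u /* power of two, so the index stays consistent if the counter wraps */

struct reg_access {
	char op;      /* 'R' or 'W' */
	uint32_t reg;
	uint32_t val;
};

static struct reg_access trace[TRACE_LEN];
static unsigned int trace_count; /* total number of accesses recorded so far */

/* Record one access, overwriting the oldest entry once the buffer is full. */
static void trace_record(char op, uint32_t reg, uint32_t val)
{
	struct reg_access *e = &trace[trace_count % TRACE_LEN];

	e->op = op;
	e->reg = reg;
	e->val = val;
	trace_count++;
}

/* Copy up to max of the most recent entries into out, oldest first.
 * Returns the number of entries copied. */
static unsigned int trace_snapshot(struct reg_access *out, unsigned int max)
{
	unsigned int n = trace_count < TRACE_LEN ? trace_count : TRACE_LEN;
	unsigned int i;

	if (n > max)
		n = max;
	for (i = 0; i < n; i++)
		out[i] = trace[(trace_count - n + i) % TRACE_LEN];
	return n;
}

/* What an error path might call once the unexpected sync cause shows up. */
static void trace_dump(void)
{
	struct reg_access snap[TRACE_LEN];
	unsigned int n = trace_snapshot(snap, TRACE_LEN);
	unsigned int i;

	for (i = 0; i < n; i++)
		printf("[%c] %04x %08x\n", snap[i].op,
		       (unsigned)snap[i].reg, (unsigned)snap[i].val);
}
```

The fixed-size buffer keeps the cost per access constant, so the recording can stay enabled until the rare error finally occurs.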


2013-02-21 11:26:38

by Felix Liao

Subject: RE: [ath9k-devel] [PATCHv2] ath9k_hw: Handle AR_INTR_SYNC_HOST1_FATAL on AR9003

I can reproduce this issue on my AR9300 chip using compat-wireless-3.6.8-1 with no patches applied except Sven's patch to ar9003_hw_get_isr() that handles AR_INTR_SYNC_HOST1_FATAL.

I start two hostapd instances to drive a wireless AP (ath1, see hostapd1.conf) and a guest AP (ath2, see hostapd2.conf). After I restart these APs many times, the hardware stops working and the number of ath9k interrupts stops increasing. At that point I can't discover either SSID from my client PC, even though both hostapd processes are still running fine.

Starting only one hostapd does not seem to trigger the issue. Starting two hostapd instances that drive two wireless APs still shows the issue, but it is rarely reproducible. Please see reproduce.sh in the attachment.

I can see "ath: phy0: Failed to stop TX DMA, queues=0x001!" in dmesg, which is printed when ath_isr() receives AR_INTR_SYNC_HOST1_FATAL and tries to reset the hardware.

I also recorded the register accesses as Sven did, but I think my case differs from Sven's, since nothing called ar9003_hw_ani_control(); please check the dmesg output in the attachment. I also tested Felix F's patch on ar9003_hw_ani_control(), and the issue still reproduces on my device.

Here is my register record from when the issue happened. Only the register at address 0x4028 needs to be traced: after four calls to ath9k_conf_tx(), I read an AR_INTR_SYNC_HOST1_FATAL value from it.

time, record idx, line, func, Begin|End|Write|Read, reg, val, pid, comm
[ 214.856089] 3116.[0454][ath_isr][B] 0000 00000000 0 swapper
[ 214.856102] 3117.[0760][ath9k_hw_intrpend][R] 4038 00000002 0 swapper
[ 214.856116] 3118.[0189][ar9003_hw_get_isr][R] 4038 00000002 0 swapper
[ 214.856130] 3119.[0192][ar9003_hw_get_isr][R] 7044 00000002 0 swapper
[ 214.856144] 3120.[0194][ar9003_hw_get_isr][R] 0080 81000002 0 swapper
[ 214.856157] 3121.[0198][ar9003_hw_get_isr][R] 4028 00000000 0 swapper <-- REG_READ(0x4028) = 0x0, no problem here
[ 214.856171] 3122.[0234][ar9003_hw_get_isr][R] 00c0 81000002 0 swapper
[ 214.856185] 3123.[0781][ath9k_hw_kill_interrupts][W] 0024 00000000 0 swapper
[ 214.856200] 3124.[0782][ath9k_hw_kill_interrupts][R] 0024 00000000 0 swapper
[ 214.856214] 3125.[0784][ath9k_hw_kill_interrupts][W] 403c 00000000 0 swapper
[ 214.856228] 3126.[0785][ath9k_hw_kill_interrupts][R] 403c 00000000 0 swapper
[ 214.856243] 3127.[0787][ath9k_hw_kill_interrupts][W] 402c 00000000 0 swapper
[ 214.856257] 3128.[0788][ath9k_hw_kill_interrupts][R] 402c 00000000 0 swapper
[ 214.856271] 3129.[0566][ath_isr][E] 0000 00000000 0 swapper

[ 214.856284] 3130.[0371][ath9k_tasklet][B] 0000 00000000 0 swapper
[ 214.856297] 3131.[2876][ath9k_hw_gettsf64][R] 8050 00000000 0 swapper
[ 214.856311] 3132.[2878][ath9k_hw_gettsf64][R] 804c 0004f6d1 0 swapper
[ 214.856325] 3133.[2879][ath9k_hw_gettsf64][R] 8050 00000000 0 swapper
[ 214.856339] 3134.[0445][ath9k_hw_addrxbuf_edma][W] 0078 18a38040 0 swapper
[ 214.856354] 3135.[0828][ath9k_hw_enable_interrupts][W] 0024 00000001 0 swapper
[ 214.856368] 3136.[0830][ath9k_hw_enable_interrupts][W] 403c 00000002 0 swapper
[ 214.856383] 3137.[0831][ath9k_hw_enable_interrupts][W] 4030 00000002 0 swapper
[ 214.856397] 3138.[0833][ath9k_hw_enable_interrupts][W] 402c 00023f60 0 swapper
[ 214.856412] 3139.[0834][ath9k_hw_enable_interrupts][W] 4034 00023f60 0 swapper
[ 214.856427] 3140.[0837][ath9k_hw_enable_interrupts][R] 00a0 81800175 0 swapper
[ 214.856441] 3141.[0837][ath9k_hw_enable_interrupts][R] 0024 00000001 0 swapper
[ 214.856455] 3142.[0428][ath9k_tasklet][E] 0000 00000000 0 swapper

[ 214.856469] 3143.[1396][ath9k_conf_tx][B] 0000 00000000 2639 hostapd
[ 214.856483] 3144.[0404][ath9k_hw_resettxqueue][W] 1040 00101c03 2639 hostapd
[ 214.856498] 3145.[0409][ath9k_hw_resettxqueue][W] 1080 0008200a 2639 hostapd
[ 214.856512] 3146.[0411][ath9k_hw_resettxqueue][W] 09c0 00000800 2639 hostapd
[ 214.856527] 3147.[0418][ath9k_hw_resettxqueue][W] 1100 00001102 2639 hostapd
[ 214.856541] 3148.[0436][ath9k_hw_resettxqueue][W] 10c0 001005e0 2639 hostapd
[ 214.856556] 3149.[0517][ath9k_hw_resettxqueue][W] 0a44 00000001 2639 hostapd
[ 214.856571] 3150.[0034][ath9k_hw_set_txq_interrupts][W] 00a4 0000030f 2639 hostapd
[ 214.856586] 3151.[0037][ath9k_hw_set_txq_interrupts][W] 00a8 0000030f 2639 hostapd
[ 214.856601] 3152.[0041][ath9k_hw_set_txq_interrupts][W] 00ac 00c00000 2639 hostapd
[ 214.856615] 3153.[1421][ath9k_conf_tx][E] 0000 00000000 2639 hostapd

[ 214.856629] 3154.[1396][ath9k_conf_tx][B] 0000 00000000 2639 hostapd
[ 214.856643] 3155.[0404][ath9k_hw_resettxqueue][W] 1044 00103c07 2639 hostapd
[ 214.856657] 3156.[0409][ath9k_hw_resettxqueue][W] 1084 0008200a 2639 hostapd
[ 214.856672] 3157.[0411][ath9k_hw_resettxqueue][W] 09c4 00000800 2639 hostapd
[ 214.856686] 3158.[0418][ath9k_hw_resettxqueue][W] 1104 00001102 2639 hostapd
[ 214.856701] 3159.[0436][ath9k_hw_resettxqueue][W] 10c4 00100bc0 2639 hostapd
[ 214.856715] 3160.[0517][ath9k_hw_resettxqueue][W] 0a44 00000001 2639 hostapd
[ 214.856730] 3161.[0034][ath9k_hw_set_txq_interrupts][W] 00a4 0000030f 2639 hostapd
[ 214.856745] 3162.[0037][ath9k_hw_set_txq_interrupts][W] 00a8 0000030f 2639 hostapd
[ 214.856760] 3163.[0041][ath9k_hw_set_txq_interrupts][W] 00ac 00c00000 2639 hostapd
[ 214.856774] 3164.[1421][ath9k_conf_tx][E] 0000 00000000 2639 hostapd

[ 214.856788] 3165.[1396][ath9k_conf_tx][B] 0000 00000000 2639 hostapd
[ 214.856802] 3166.[0404][ath9k_hw_resettxqueue][W] 1048 0030fc0f 2639 hostapd
[ 214.856817] 3167.[0409][ath9k_hw_resettxqueue][W] 1088 0008200a 2639 hostapd
[ 214.856831] 3168.[0411][ath9k_hw_resettxqueue][W] 09c8 00000800 2639 hostapd
[ 214.856846] 3169.[0418][ath9k_hw_resettxqueue][W] 1108 00001102 2639 hostapd
[ 214.856860] 3170.[0436][ath9k_hw_resettxqueue][W] 10c8 00000000 2639 hostapd
[ 214.856875] 3171.[0517][ath9k_hw_resettxqueue][W] 0a44 00000001 2639 hostapd
[ 214.856889] 3172.[0034][ath9k_hw_set_txq_interrupts][W] 00a4 0000030f 2639 hostapd
[ 214.856904] 3173.[0037][ath9k_hw_set_txq_interrupts][W] 00a8 0000030f 2639 hostapd
[ 214.856919] 3174.[0041][ath9k_hw_set_txq_interrupts][W] 00ac 00c00000 2639 hostapd
[ 214.856934] 3175.[1421][ath9k_conf_tx][E] 0000 00000000 2639 hostapd

[ 214.856947] 3176.[1396][ath9k_conf_tx][B] 0000 00000000 2639 hostapd
[ 214.856961] 3177.[0404][ath9k_hw_resettxqueue][W] 104c 007ffc0f 2639 hostapd
[ 214.856976] 3178.[0409][ath9k_hw_resettxqueue][W] 108c 0008200a 2639 hostapd
[ 214.856990] 3179.[0411][ath9k_hw_resettxqueue][W] 09cc 00000800 2639 hostapd
[ 214.857005] 3180.[0418][ath9k_hw_resettxqueue][W] 110c 00001102 2639 hostapd
[ 214.857019] 3181.[0436][ath9k_hw_resettxqueue][W] 10cc 00000000 2639 hostapd
[ 214.857034] 3182.[0517][ath9k_hw_resettxqueue][W] 0a44 00000001 2639 hostapd
[ 214.857048] 3183.[0034][ath9k_hw_set_txq_interrupts][W] 00a4 0000030f 2639 hostapd
[ 214.857063] 3184.[0037][ath9k_hw_set_txq_interrupts][W] 00a8 0000030f 2639 hostapd
[ 214.857078] 3185.[0041][ath9k_hw_set_txq_interrupts][W] 00ac 00c00000 2639 hostapd
[ 214.857093] 3186.[1421][ath9k_conf_tx][E] 0000 00000000 2639 hostapd

[ 214.857106] 3187.[0454][ath_isr][B] 0000 00000000 2639 hostapd
[ 214.857120] 3188.[0760][ath9k_hw_intrpend][R] 4038 00000000 2639 hostapd
[ 214.857134] 3189.[0767][ath9k_hw_intrpend][R] 4028 00000020 2639 hostapd
[ 214.857148] 3190.[0189][ar9003_hw_get_isr][R] 4038 00000000 2639 hostapd
[ 214.857162] 3191.[0198][ar9003_hw_get_isr][R] 4028 00000020 2639 hostapd <---- REG_READ(0x4028) = 0x20, raise a AR_INTR_SYNC_HOST1_FATAL error
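
A trace in this format can be searched mechanically for the first non-zero read of 0x4028. The following is a rough sketch, assuming every record follows the column layout given in the header line above (time, record idx, line, func, B|E|W|R, reg, val, pid, comm); the struct and function names are invented for this example.

```c
#include <stdio.h>

struct trace_rec {
	double time;
	unsigned int idx;  /* record index */
	unsigned int line; /* source line of the access */
	char func[64];
	char op;           /* B, E, W or R */
	unsigned int reg;
	unsigned int val;
	int pid;
	char comm[32];
};

/* Parse one trace line; returns 0 on success, -1 if the line does not match.
 * Trailing annotations after the comm field are ignored. */
static int parse_trace_line(const char *s, struct trace_rec *r)
{
	int n = sscanf(s, "[ %lf] %u.[%u][%63[^]]][%c] %x %x %d %31s",
		       &r->time, &r->idx, &r->line, r->func, &r->op,
		       &r->reg, &r->val, &r->pid, r->comm);

	return n == 9 ? 0 : -1;
}

/* Index of the first line that is a read of reg with a non-zero value,
 * or -1 if there is none. Lines that do not parse are skipped. */
static int first_nonzero_read(const char *const *lines, int count,
			      unsigned int reg)
{
	struct trace_rec r;
	int i;

	for (i = 0; i < count; i++) {
		if (parse_trace_line(lines[i], &r) != 0)
			continue;
		if (r.op == 'R' && r.reg == reg && r.val != 0)
			return i;
	}
	return -1;
}
```

Applied to the records above, the first match would be record 3189, the read in ath9k_hw_intrpend that returns 0x20.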

I tried a few things:
1. Adding delay(1000) after ath9k_conf_tx(); this did not resolve the issue.
2. Adding a spinlock, sc->sc_reset_lock, to protect the hardware between ath_reset_work() and other reset contexts; this did not resolve the issue either.

I am out of ideas now; any suggestions would be appreciated.

- Felix Liao

-----Original Message-----
From: [email protected] [mailto:[email protected]] On Behalf Of Felix Fietkau
Sent: Saturday, October 06, 2012 5:04 PM
To: Adrian Chadd
Cc: Sven Eckelmann; Simon Wunderlich; [email protected]; [email protected]; [email protected]; [email protected]; [email protected]
Subject: Re: [ath9k-devel] [PATCHv2] ath9k_hw: Handle AR_INTR_SYNC_HOST1_FATAL on AR9003

On 2012-10-06 1:48 AM, Adrian Chadd wrote:
> On 5 October 2012 09:51, Sven Eckelmann <[email protected]> wrote:
>
>>> Well, is it a RX chainmask thing, or is it a chip thing?
>>>
>>> It's totally possible to have an RX chainmask of say 0x2 or 0x4..
>>
>> What are you trying to tell us?
>
> That the check for "rx chainmask == 1? Definitely can't do MRC CCK"
> implying "rx chainmask != 1? Definitely can do MRC CCK."
> I think that's the wrong logic. It may be a general chipset problem
> across some/all AR9300 and later chips that doing MRC CCK with only
> one RX chain enabled is a problem, or it may be a single-chain NIC
> problem.
>
> I'm pretty sure we can configure any of the RX antennas; it doesn't
> have to be "one chain == chain 0."
>
> Anyway. I'll double check that.
I'm pretty sure it's an issue specific to single-stream chipsets, where
all MRC functionality was left out (and thus the register access leads
nowhere).
I don't think this needs to consider multi-stream chipsets with only one
enabled chain.

- Felix
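
The difference between the two readings discussed above can be made concrete with a small sketch. It only compares the two predicates; it does not claim which one is the right fix, and the function names are invented for this illustration. A chainmask of 0x2 or 0x4 passes the "!= 1" test from the posted patch even though only one chain is enabled.

```c
#include <stdbool.h>
#include <stdint.h>

/* Number of enabled chains, i.e. set bits in the chainmask. */
static unsigned int chain_count(uint8_t chainmask)
{
	unsigned int n = 0;

	while (chainmask) {
		n += chainmask & 1u;
		chainmask >>= 1;
	}
	return n;
}

/* The condition used in the patch posted in this thread. */
static bool mask_is_not_one(uint8_t rx_chainmask)
{
	return rx_chainmask != 1;
}

/* A stricter alternative: at least two RX chains are enabled. */
static bool has_multiple_chains(uint8_t rx_chainmask)
{
	return chain_count(rx_chainmask) > 1;
}
```

If the problem really is limited to single-stream chipsets, as argued above, the two predicates only differ for configurations that are not affected anyway.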


Attachments:
debug_failed_to_stop_tx_dma.tar.gz (39.77 kB)