2010-01-10 10:53:02

by Michael Büsch

Subject: Ath5k on 2.6.32 suddenly fails

Hi,

I just upgraded my access point to 2.6.32.3. It has an Atheros wireless card,
which basically runs fine with ath5k. Well, it did run fine until I upgraded
to 2.6.32. I upgraded from 2.6.31, but I used compat-wireless-2009-10-28 for wireless.
(I'm not exactly sure about the compat package version. Maybe it was even older. But it was stable.)

I'm currently using the vanilla ath5k driver on 2.6.32.

So yesterday, after two days of uptime, the atheros card suddenly failed to
work at all. Completely dead.
Reloading the module only resulted in:
11 Jan 10 11:11:28 quimby kernel: [163143.871947] ath5k phy0: failed to wakeup the MAC Chip

So I added a debug patch:

Index: linux-2.6.32/drivers/net/wireless/ath/ath5k/reset.c
===================================================================
--- linux-2.6.32.orig/drivers/net/wireless/ath/ath5k/reset.c 2010-01-10 11:16:00.000000000 +0100
+++ linux-2.6.32/drivers/net/wireless/ath/ath5k/reset.c 2010-01-10 11:27:03.000000000 +0100
@@ -223,6 +223,7 @@
 int ath5k_hw_set_power(struct ath5k_hw *ah, enum ath5k_power_mode mode,
                bool set_chip, u16 sleep_duration)
 {
+       struct pci_dev *pdev = ah->ah_sc->pdev;
        unsigned int i;
        u32 staid, data;

@@ -273,7 +274,7 @@
                                                        AR5K_SLEEP_CTL);
                udelay(15);

-               for (i = 200; i > 0; i--) {
+               for (i = 20000; i > 0; i--) {
                        /* Check if the chip did wake up */
                        if ((ath5k_hw_reg_read(ah, AR5K_PCICFG) &
                                        AR5K_PCICFG_SPWR_DN) == 0)
@@ -286,8 +287,13 @@
                }

                /* Fail if the chip didn't wake up */
-               if (i == 0)
+               if (i == 0) {
+                       u32 val;
+                       int res = pci_read_config_dword(pdev, PCI_STATUS, &val);
+                       printk("%d, 0x%08X\n", res, val);
                        return -EIO;
+               }
+               printk("Wakeup %d\n", i);

                break;

Which resulted in these messages:

35 Jan 10 11:28:30 quimby kernel: [164165.417931] 135, 0x00000000
36 Jan 10 11:28:30 quimby kernel: [164165.417940] ath5k phy0: failed to wakeup the MAC Chip

So the card really is completely dead. It doesn't even respond to standard PCI reads.

Only a complete reboot (warmstart) of the machine brought it back to life.
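
A note on reading the debug output: 135 is 0x87, which is the value of
PCIBIOS_BAD_REGISTER_NUMBER. As far as I can tell from the generic PCI access
code, pci_read_config_dword() rejects offsets that are not 4-byte aligned and
returns that code without issuing a config cycle, and PCI_STATUS is offset
0x06. The user-space sketch below models only that alignment check; the
function names model_read_config_dword() and model_check() are hypothetical
and no hardware is involved.

```c
#include <assert.h>
#include <stdint.h>

/* Constants with the values used in the Linux PCI headers. */
#define PCI_COMMAND                  0x04
#define PCI_STATUS                   0x06
#define PCIBIOS_SUCCESSFUL           0x00
#define PCIBIOS_BAD_REGISTER_NUMBER  0x87 /* 135 decimal */

/*
 * Hypothetical user-space model of the offset check only: a dword
 * access must be 4-byte aligned. No hardware is touched, and *val is
 * left alone when the offset is rejected.
 */
int model_read_config_dword(int where, uint32_t *val)
{
	if (where & 3)
		return PCIBIOS_BAD_REGISTER_NUMBER;
	*val = 0; /* placeholder; a real read would fill in device data */
	return PCIBIOS_SUCCESSFUL;
}

/* Usage: the misaligned PCI_STATUS read fails, an aligned one passes. */
void model_check(void)
{
	uint32_t val = 0;

	assert(model_read_config_dword(PCI_STATUS, &val) == 135);
	assert(model_read_config_dword(PCI_COMMAND, &val) == PCIBIOS_SUCCESSFUL);
}
```

Under that reading, the "135, 0x00000000" line may say more about the probe
than about the card. Reading PCI_STATUS with pci_read_config_word(), or a
dword at PCI_COMMAND (which covers the status register as well), would
probably be a more telling check.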

--
Greetings, Michael.


2010-01-21 12:05:38

by Michael Büsch

Subject: Re: Ath5k on 2.6.32 suddenly fails

So the AP failed again.

This time with an older failure that I already reported in the past.

[92588.864887] ath5k phy0: unsupported jumbo
[92860.064977] ath5k phy0: unsupported jumbo
[93019.724035] ath5k phy0: unsupported jumbo
[93444.334305] ath5k phy0: unsupported jumbo
[161109.006052] ath5k phy0: unsupported jumbo
[186430.369739] ath5k phy0: unsupported jumbo
[210174.502973] ath5k phy0: no further txbuf available, dropping packet
[210175.527224] ath5k phy0: no further txbuf available, dropping packet
[210176.551473] ath5k phy0: no further txbuf available, dropping packet
...

Note that the "unsupported jumbo" messages seem to be harmless.
The failure starts when the "dropping packet" messages appear.
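
For readers unfamiliar with the message: as far as I understand the driver,
"no further txbuf available" is printed when the list of free TX buffers is
empty at transmit time. The sketch below is a toy model of such a fixed pool,
not ath5k code; all names and the pool size of 4 are made up. It only shows
the symptom: if completed buffers stop being handed back, every later frame
is dropped. It says nothing about why completions would stop.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TXBUF_COUNT 4 /* arbitrary pool size for this sketch */

struct txbuf_pool {
	size_t free;    /* buffers currently available */
	size_t dropped; /* frames dropped because the pool was empty */
};

void pool_init(struct txbuf_pool *p)
{
	p->free = TXBUF_COUNT;
	p->dropped = 0;
}

/* Take one buffer for an outgoing frame; false means the frame is dropped. */
bool pool_tx(struct txbuf_pool *p)
{
	if (p->free == 0) {
		p->dropped++;
		return false;
	}
	p->free--;
	return true;
}

/* TX completion returns one buffer; the pool never grows past its size. */
void pool_tx_complete(struct txbuf_pool *p)
{
	if (p->free < TXBUF_COUNT)
		p->free++;
}

/* Usage: with no completions, the fifth frame is dropped. */
void pool_check(void)
{
	struct txbuf_pool p;
	int i;

	pool_init(&p);
	for (i = 0; i < TXBUF_COUNT; i++)
		assert(pool_tx(&p));
	assert(!pool_tx(&p));
	assert(p.dropped == 1);
}
```

Beacons can keep going out in such a state if they use a separate buffer,
which would match the "still beacons, but otherwise dead" observation, though
that is a guess on my part.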

Here's a wireshark log of the WM while I was trying to connect to
the AP with a device:
http://bu3sch.de/misc/quimbyfail
So it still beacons, but otherwise is completely dead.

--
Greetings, Michael.

2010-01-21 15:57:54

by Bob Copeland

Subject: Re: Ath5k on 2.6.32 suddenly fails

On Thu, Jan 21, 2010 at 7:05 AM, Michael Buesch wrote:
> So the AP failed again.
>
> This time with an older failure that I already reported in the past.
> [210174.502973] ath5k phy0: no further txbuf available, dropping packet

Ok, I posted a patch for this last night -- sorry for sitting on it so
long.

--
Bob Copeland %% http://www.bobcopeland.com

2010-01-15 22:30:37

by Michael Büsch

Subject: Re: Ath5k on 2.6.32 suddenly fails

On Monday 11 January 2010 23:36:49 Michael Buesch wrote:
> I currently have one and a half days of uptime. I think I'll first
> continue running .32 to check whether it happens again or if this was just
> some random hardware burp.
> I think it should be likely to trigger again within one or two days, if this is a bug.

mb@quimby:~$ uptime
23:23:34 up 5 days, 11:49, 1 user, load average: 0.00, 0.00, 0.00

So, it didn't trigger, yet.
I think I will assume for now that we had a hardware burp and this
is not caused by a software bug. The AP is used a lot and it currently
is rock-stable on 2.6.32.

The card is a minipci connected through a minipci->pci converter card.
I know that the converter does not have high quality contact pins, so I
currently blame the converter card for flipping a bit.
I think I'll replace it by something better soon.

So let's close this, unless I come back to you guys with new results.

--
Greetings, Michael.

2010-01-11 16:11:30

by Luis R. Rodriguez

Subject: Re: Ath5k on 2.6.32 suddenly fails

On Sun, Jan 10, 2010 at 2:52 AM, Michael Buesch wrote:
> Hi,
>
> I just upgraded my AccessPoint to 2.6.32.3. It has an atheros wireless card,
> which basically runs fine with ath5k. Well, it did run fine until I upgraded
> to 2.6.32. I upgraded from 2.6.31, but I used compat-wireless-2009-10-28 for wireless.
> (I'm not exactly sure on the compat package. Maybe it was even older. But it was stable).
>
> I'm currently using the vanilla ath5k driver on 2.6.32.
>
> So yesterday, after two days of uptime, the atheros card suddenly failed to
> work at all. Completely dead.
> Reloading the module only resulted in:
>  11 Jan 10 11:11:28 quimby kernel: [163143.871947] ath5k phy0: failed to wakeup the MAC Chip

So to be clear, this seems to be a regression from 2.6.31 to 2.6.32.
Can you confirm whether ath5k worked smoothly in AP mode without this issue
on 2.6.31, or was this a regression against some random wireless-testing based
snapshit (which seems to be the case)?

Rafael, please add to your list if Michael confirms this was working
fine on 2.6.31.

Michael, it seems this is not easy to reproduce but can you try the
latest 2.6.33-rc code to see if the issue is also there? I can make a
new compat-wireless snapshot for that soon if it helps.

> So I added a debug patch:
>
> Index: linux-2.6.32/drivers/net/wireless/ath/ath5k/reset.c
> ===================================================================
> --- linux-2.6.32.orig/drivers/net/wireless/ath/ath5k/reset.c    2010-01-10 11:16:00.000000000 +0100
> +++ linux-2.6.32/drivers/net/wireless/ath/ath5k/reset.c 2010-01-10 11:27:03.000000000 +0100
> @@ -223,6 +223,7 @@
>  int ath5k_hw_set_power(struct ath5k_hw *ah, enum ath5k_power_mode mode,
>                bool set_chip, u16 sleep_duration)
>  {
> +       struct pci_dev *pdev = ah->ah_sc->pdev;
>        unsigned int i;
>        u32 staid, data;
>
> @@ -273,7 +274,7 @@
>                                                        AR5K_SLEEP_CTL);
>                udelay(15);
>
> -               for (i = 200; i > 0; i--) {
> +               for (i = 20000; i > 0; i--) {
>                        /* Check if the chip did wake up */
>                        if ((ath5k_hw_reg_read(ah, AR5K_PCICFG) &
>                                        AR5K_PCICFG_SPWR_DN) == 0)
> @@ -286,8 +287,13 @@
>                }
>
>                /* Fail if the chip didn't wake up */
> -               if (i == 0)
> +               if (i == 0) {
> +                       u32 val;
> +                       int res = pci_read_config_dword(pdev, PCI_STATUS, &val);
> +                       printk("%d, 0x%08X\n", res, val);
>                        return -EIO;
> +               }
> +               printk("Wakeup %d\n", i);
>
>                break;
>
> Which resulted in these messages:
>
>  35 Jan 10 11:28:30 quimby kernel: [164165.417931] 135, 0x00000000
>  36 Jan 10 11:28:30 quimby kernel: [164165.417940] ath5k phy0: failed to wakeup the MAC Chip
>
> So the card really is completely dead. It doesn't even respond to standard PCI reads.
>
> Only a complete reboot (warmstart) of the machine brought it back to life.

2010-01-11 16:35:40

by Larry Finger

Subject: Re: Ath5k on 2.6.32 suddenly fails

On 01/11/2010 10:11 AM, Luis R. Rodriguez wrote:
> or was this a regression against some random wireless-testing based
> snapshit (which seems to be the case).

Typo, or editorial comment on wireless-testing?

Larry

2010-01-11 22:37:21

by Michael Büsch

Subject: Re: Ath5k on 2.6.32 suddenly fails

On Monday 11 January 2010 17:11:10 Luis R. Rodriguez wrote:
> >  11 Jan 10 11:11:28 quimby kernel: [163143.871947] ath5k phy0: failed to wakeup the MAC Chip
>
> So to be clear, this seems to be a regression from 2.6.31 to 2.6.32.
> Can you confirm if ath5k worked in AP mode smoothly without this issue
> or was this a regression against some random wireless-testing based
> snapshit (which seems to be the case).

I don't know if it is a regression between .31 and .32.
Before I upgraded to .32, I used a .31 kernel, but with compat-wireless.
So I did not use the .31 wireless bits (there were some other breakages
that I did not track down further)

I was either using
compat-wireless-2009-09-28
or
compat-wireless-2009-10-28

I don't remember exactly which one, and I don't know how to
find out.
I was using it for months and it was rock-stable.

> Michael, it seems this is not easy to reproduce but can you try the
> latest 2.6.33-rc code to see if the issue is also there? I can make a
> new compat-wireless snapshot for that soon if it helps.

Well, it's a real pain to do experiments on that machine, because
it is a production machine. Each failure will immediately result in people
yelling at me. :D So I don't feel too good about running an rc kernel on it...

I currently have one and a half days of uptime. I think I'll first
continue running .32 to check whether it happens again or if this was just
some random hardware burp.
I think it should be likely to trigger again within one or two days, if this is a bug.

--
Greetings, Michael.

2010-01-15 22:40:43

by Michael Büsch

Subject: Re: Ath5k on 2.6.32 suddenly fails

On Friday 15 January 2010 23:29:59 Michael Buesch wrote:
> On Monday 11 January 2010 23:36:49 Michael Buesch wrote:
> > I currently have one and a half days of uptime. I think I'll first
> > continue running .32 to check whether it happens again or if this was just
> > some random hardware burp.
> > I think it should be likely to trigger again within one or two days, if this is a bug.
>
> mb@quimby:~$ uptime
> 23:23:34 up 5 days, 11:49, 1 user, load average: 0.00, 0.00, 0.00
>
> So, it didn't trigger, yet.
> I think I will assume for now that we had a hardware burp and this
> is not caused by a software bug. The AP is used a lot and it currently
> is rock-stable on 2.6.32.
>
> The card is a minipci connected through a minipci->pci converter card.
> I know that the converter does not have high quality contact pins, so I
> currently blame the converter card for flipping a bit.
> I think I'll replace it by something better soon.
>
> So let's close this, unless I come back to you guys with new results.
>

Argh, so exactly one minute after sending this mail the AP died. -.-

There also are a bunch of jumbo messages in dmesg, but they are probably unrelated:

[177589.693544] ath5k phy0: unsupported jumbo
[264620.683114] ath5k phy0: unsupported jumbo
[276726.009197] ath5k phy0: unsupported jumbo
[348619.483527] ath5k phy0: unsupported jumbo
[349918.090802] ath5k phy0: unsupported jumbo
[438574.817309] ath5k phy0: unsupported jumbo
[457967.099642] ath5k phy0: unsupported jumbo
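
For orientation: the bracketed dmesg timestamps are roughly seconds since
boot, so they can be lined up with the uptime output quoted above. The small
helper below (split_uptime() is a made-up name) splits a seconds value into
days, hours, minutes and seconds. The last jumbo message at 457967 s works
out to 5 days, 7:12:47, a bit over four and a half hours before the
"up 5 days, 11:49" reading.

```c
#include <assert.h>

struct uptime {
	unsigned long days;
	unsigned int hours;
	unsigned int minutes;
	unsigned int seconds;
};

/* Split a seconds-since-boot value into days/hours/minutes/seconds. */
struct uptime split_uptime(unsigned long secs)
{
	struct uptime u;

	u.days = secs / 86400UL;
	secs %= 86400UL;
	u.hours = (unsigned int)(secs / 3600UL);
	secs %= 3600UL;
	u.minutes = (unsigned int)(secs / 60UL);
	u.seconds = (unsigned int)(secs % 60UL);
	return u;
}

/* Usage: the last "unsupported jumbo" timestamp from the log. */
void uptime_check(void)
{
	struct uptime u = split_uptime(457967UL);

	assert(u.days == 5 && u.hours == 7);
	assert(u.minutes == 12 && u.seconds == 47);
}
```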

However, it had different failure symptoms this time.
Last time it failed, the AP was completely dead. No beacons, etc..
This time it was still beaconing, but auth failed:

Trying to associate with 00:1d:0f:b9:df:2d (SSID='quimby-net' freq=2472 MHz)
Authentication with 00:1d:0f:b9:df:2d timed out.

A machine reboot was _not_ needed this time to revive the card.
Module unload cycle was enough to bring it back to life.

I'm really unsure what's going on and if these two failures are related
to each other. Probably not...

--
Greetings, Michael.

2010-01-15 22:51:01

by Michael Büsch

Subject: Re: Ath5k on 2.6.32 suddenly fails

On Friday 15 January 2010 23:47:00 Luis R. Rodriguez wrote:
> I wonder if a possible failure here might be that the box comes under
> load and some DMA allocation actually gives back less memory than what
> was requested. Hrm, but even then we'd still tell hardware it has the
> whole desired length we intended...
>
> Not sure... I haven't reviewed this code in ages.
>
> > However, it had different failure symptoms this time.
> > Last time it failed, the AP was completely dead. No beacons, etc..
> > This time it was still beaconing, but auth failed:
>
> Can you reproduce? Do you know if anything particular happened at this time?

No, there was nothing special happening. I was just sending the previous mail
and a few seconds later I noticed that wpa_supplicant lost connection
and failed to gain a new authentication.

> > Trying to associate with 00:1d:0f:b9:df:2d (SSID='quimby-net' freq=2472 MHz)
> > Authentication with 00:1d:0f:b9:df:2d timed out.
> >
> > A machine reboot was _not_ needed this time to revive the card.
> > Module unload cycle was enough to bring it back to life.
> >
> > I'm really unsure what's going on and if these two failures are related
> > to each other. Probably not...
>
> Any way you can burp your box sooner (replace the PCI connector) to
> rule that out?

Yes, I'm going to replace the extender over the weekend.

--
Greetings, Michael.

2010-01-11 16:43:45

by Luis R. Rodriguez

Subject: Re: Ath5k on 2.6.32 suddenly fails

On Mon, Jan 11, 2010 at 8:35 AM, Larry Finger wrote:
> On 01/11/2010 10:11 AM, Luis R. Rodriguez wrote:
>> or was this a regression against some random wireless-testing based
>> snapshit (which seems to be the case).
>
> Typo, or editorial comment on wireless-testing?

Hah, big typo indeed, sorry about that.

Luis

2010-01-15 22:47:21

by Luis R. Rodriguez

Subject: Re: Ath5k on 2.6.32 suddenly fails

On Fri, Jan 15, 2010 at 2:40 PM, Michael Buesch wrote:
> On Friday 15 January 2010 23:29:59 Michael Buesch wrote:
>> On Monday 11 January 2010 23:36:49 Michael Buesch wrote:
>> > I currently have one and a half days of uptime. I think I'll first
>> > continue running .32 to check whether it happens again or if this was just
>> > some random hardware burp.
>> > I think it should be likely to trigger again within one or two days, if this is a bug.
>>
>> mb@quimby:~$ uptime
>>  23:23:34 up 5 days, 11:49,  1 user,  load average: 0.00, 0.00, 0.00
>>
>> So, it didn't trigger, yet.
>> I think I will assume for now that we had a hardware burp and this
>> is not caused by a software bug. The AP is used a lot and it currently
>> is rock-stable on 2.6.32.
>>
>> The card is a minipci connected through a minipci->pci converter card.
>> I know that the converter does not have high quality contact pins, so I
>> currently blame the converter card for flipping a bit.
>> I think I'll replace it by something better soon.
>>
>> So let's close this, unless I come back to you guys with new results.
>>
>
> Argh, so exactly one minute after sending this mail the AP died. -.-
>
> There also are a bunch of jumbo messages in dmesg, but they are probably unrelated:
>
> [177589.693544] ath5k phy0: unsupported jumbo
> [264620.683114] ath5k phy0: unsupported jumbo
> [276726.009197] ath5k phy0: unsupported jumbo
> [348619.483527] ath5k phy0: unsupported jumbo
> [349918.090802] ath5k phy0: unsupported jumbo
> [438574.817309] ath5k phy0: unsupported jumbo
> [457967.099642] ath5k phy0: unsupported jumbo

So these would be seen when hardware detects that a received frame's
payload is larger than the buffer we programmed hardware to DMA
into.
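
As a rough illustration of the condition described above: a frame that does
not fit into the RX buffer cannot complete in a single descriptor, so the
hardware flags that more data follows, and the driver (which, as far as I
recall, does not reassemble such frames) logs "unsupported jumbo" and skips
it. The sketch below is a user-space model, not driver code; the struct, the
function names and the 2500-byte buffer size are all assumptions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define RX_BUF_SIZE 2500 /* assumed DMA buffer size for this sketch */

struct rx_status {
	size_t datalen; /* bytes placed into this buffer */
	bool more;      /* frame continues in a further descriptor */
};

/* Model what would be reported for a received frame of frame_len bytes. */
struct rx_status model_rx(size_t frame_len)
{
	struct rx_status rs;

	rs.more = frame_len > RX_BUF_SIZE;
	rs.datalen = rs.more ? RX_BUF_SIZE : frame_len;
	return rs;
}

/* True if the frame can be passed up; false means "unsupported jumbo". */
bool rx_accept(const struct rx_status *rs)
{
	return !rs->more;
}

/* Usage: a normal frame is accepted, an oversized one is skipped. */
void rx_check(void)
{
	struct rx_status ok = model_rx(1500);
	struct rx_status jumbo = model_rx(RX_BUF_SIZE + 1);

	assert(rx_accept(&ok));
	assert(!rx_accept(&jumbo));
}
```

Since the oversized frame is simply skipped in this model, the messages on
their own would be harmless, which fits the observation that they show up
long before any failure.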

I wonder if a possible failure here might be that the box comes under
load and some DMA allocation actually gives back less memory than what
was requested. Hrm, but even then we'd still tell hardware it has the
whole desired length we intended...

Not sure... I haven't reviewed this code in ages.

> However, it had different failure symptoms this time.
> Last time it failed, the AP was completely dead. No beacons, etc..
> This time it was still beaconing, but auth failed:

Can you reproduce? Do you know if anything particular happened at this time?

> Trying to associate with 00:1d:0f:b9:df:2d (SSID='quimby-net' freq=2472 MHz)
> Authentication with 00:1d:0f:b9:df:2d timed out.
>
> A machine reboot was _not_ needed this time to revive the card.
> Module unload cycle was enough to bring it back to life.
>
> I'm really unsure what's going on and if these two failures are related
> to each other. Probably not...

Any way you can burp your box sooner (replace the PCI connector) to
rule that out?

Luis