2005-05-09 14:28:26

by Colin Leroy

Subject: [2.6.12-rc4] network wlan connection goes down

Hi,

I upgraded to 2.6.12-rc4, and noticed something strange after that.
After a few hours, the network connection goes down. The network
connectivity is done via a USB wifi stick driven by zd1201.

After that, nothing gets through, not even a ping. ifconfig wlan0 shows
the interface UP and configured; iwconfig shows the Wifi is correctly
associated with the access point (and the access point's client list
shows the zd1201's MAC as associated). The LED on the stick is lit as
usual (when associated). The kernel log doesn't show anything useful.

The connection comes back when running my network configuration script
again. The script issues four commands:
iwconfig wlan0 essid foo channel 11 key xx:xx...:xx
ifconfig wlan0 192.168.0.11
route del default
route add default gw 192.168.0.100
(I still have to find out which of the four commands re-enables the
connection; I haven't tried yet.)
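
One way to narrow that down is to run the commands one at a time and test connectivity after each. The following is only a sketch: the helper names are made up for illustration, it assumes the default gateway 192.168.0.100 is the host to ping, and the key is left elided exactly as in the message, so the first command is a placeholder.

```python
import subprocess

# The four commands quoted above. The key is elided in the original
# message and is left elided here, so the first command is a placeholder.
STEPS = [
    "iwconfig wlan0 essid foo channel 11 key xx:xx...:xx",
    "ifconfig wlan0 192.168.0.11",
    "route del default",
    "route add default gw 192.168.0.100",
]


def run_step(command):
    """Run one configuration command (needs root on a real system)."""
    subprocess.run(command.split(), check=False)


def router_reachable(host="192.168.0.100"):
    """Return True if a single ping to the assumed router is answered."""
    result = subprocess.run(
        ["ping", "-c", "1", "-W", "2", host],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


def first_fixing_step(steps, run, link_ok):
    """Run steps in order; return the first one after which the link works."""
    for step in steps:
        run(step)
        if link_ok():
            return step
    return None  # none of the steps restored connectivity
```

While the link is down, `first_fixing_step(STEPS, run_step, router_reachable)` would name the first command after which the ping succeeds.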

Everything was fine with 2.6.12-rc3; the only zd1201 patch that went
into 2.6.12-rc4 is "USB: drivers/usb/net/zd1201.c: make some code static"
by Adrian Bunk, which I don't think can be harmful at all.

Would anyone have any hint about what could have changed in the usb
subsystem (ohci driver) or in the network subsystem, that might cause
that?

Thanks,
--
Colin


2005-05-09 15:13:21

by David Brownell

Subject: Re: [linux-usb-devel] [2.6.12-rc4] network wlan connection goes down

On Monday 09 May 2005 7:24 am, Colin Leroy wrote:
> Hi,
>
> I upgraded to 2.6.12-rc4, and noticed something strange after that.
> After a few hours, the network connection goes down. The network
> connectivity is done via a USB wifi stick driven by zd1201.
>
> After that, nothing gets through, not even a ping. ...
>
> Would anyone have any hint about what could have changed in the usb
> subsystem (ohci driver) or in the network subsystem, that might cause
> that?

The OHCI code shouldn't have changed either, unless you're
maybe using an old Compaq-brand chipset. However, if you
enable CONFIG_USB_DEBUG and send the "async" and "registers"
files from the relevant /sys/class/usb_host/usbN directory,
it should be easy to tell whether there's an issue at that
particular level.

- Dave

2005-05-10 08:44:22

by Colin Leroy

Subject: Re: [linux-usb-devel] [2.6.12-rc4] network wlan connection goes down

On Mon, 9 May 2005 08:12:48 -0700
David Brownell <[email protected]> wrote:

> On Monday 09 May 2005 7:24 am, Colin Leroy wrote:
> > Hi,
> >
> > I upgraded to 2.6.12-rc4, and noticed something strange after that.
> > After a few hours, the network connection goes down. The network
> > connectivity is done via a USB wifi stick driven by zd1201.
> >
> > After that, nothing gets through, not even a ping. ...
> >
> > Would anyone have any hint about what could have changed in the usb
> > subsystem (ohci driver) or in the network subsystem, that might
> > cause that?
>
> The OHCI code shouldn't have changed either, unless you're
> maybe using an old Compaq-brand chipset.

Nope, it's the ohci controller in the iBook g4 (Nec, I believe).

> However, if you enable CONFIG_USB_DEBUG and send the "async" and
> "registers" files from the relevant /sys/class/usb_host/usbN
> directory, it should be easy to tell whether there's an issue at that
> particular level.

ok, I reproduced it. I've put in place a crontab that checks if my
router is pingable, and if not, dumps async and registers, resets the
zd1201 (the iwconfig command is the one that puts it back into a
functioning state), and dumps async and registers.
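
For reference, the logic of that cron job is roughly the following. This is a sketch in Python, not the script actually used: the function names are invented for illustration, "usbN" is kept as written earlier in the thread, and the ping, reset and logging steps are passed in as callables so the flow can be checked without touching hardware.

```python
from pathlib import Path

# "usbN" is kept exactly as written in the thread; substitute the real bus.
DEBUG_DIR = Path("/sys/class/usb_host/usbN")


def snapshot(debug_dir=DEBUG_DIR):
    """Read the two ohci_hcd debug files named in the thread."""
    return {name: (debug_dir / name).read_text() for name in ("async", "registers")}


def watchdog_pass(link_ok, take_snapshot, reset, log):
    """One cron-style pass. Returns True if a reset was performed."""
    if link_ok():
        return False  # router answered, nothing to do
    log("before reset", take_snapshot())
    reset()  # e.g. re-run the iwconfig command
    log("after reset", take_snapshot())
    return True
```

Dumping both before and after the reset is what makes the two captures comparable.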

Here's the result:
before resetting (when nothing gets through):
async:
ed/e1f8b040 fs dev3 ep1in max 64 00401083 DATA0
td e6f44040 in 3000 cc=0 urb ddf17120 (00140000)

registers:
bus pci, device 0001:10:1b.0
NEC Corporation USB
ohci_hcd version 2004 Nov 08
OHCI 1.0, NO legacy support registers
control 0x0a3 HCFS=operational BLE CBSR=3
cmdstatus 0x00000 SOC=0
intrstatus 0x00000064 RHSC FNO SF
intrenable 0x8000001a MIE UE RD WDH
ed_controlhead 21f8b000
ed_bulkhead 21f8b040
ed_bulkcurrent 21f8b040
hcca frame 0x5468
fmintvl 0xa7782edf FIT FSMPS=0xa778 FI=0x2edf
fmremaining 0x8000227a FRT FR=0x227a
periodicstart 0x2a2f
lsthresh 0x0628
roothub.a 0a000203 POTPGT=10 NPS NDP=3
roothub.b 00000000 PPCM=0000 DR=0000
roothub.status 00008000 DRWE
roothub.portstatus [0] 0x00000103 PPS PES CCS
roothub.portstatus [1] 0x00000100 PPS
roothub.portstatus [2] 0x00000100 PPS


#iwconfig essid ...
#ping works now

async:
ed/e1f8b040 fs dev3 ep1in max 64 00401083 DATA0
td e6f44040 in 3000 cc=0 urb ddf17120 (00140000)

registers:
bus pci, device 0001:10:1b.0
NEC Corporation USB
ohci_hcd version 2004 Nov 08
OHCI 1.0, NO legacy support registers
control 0x0a3 HCFS=operational BLE CBSR=3
cmdstatus 0x00004 SOC=0 BLF
intrstatus 0x00000064 RHSC FNO SF
intrenable 0x8000001a MIE UE RD WDH
ed_controlhead 21f8b000
ed_bulkhead 21f8b040
ed_bulkcurrent 21f8b040
hcca frame 0x667b
fmintvl 0xa7782edf FIT FSMPS=0xa778 FI=0x2edf
fmremaining 0x8000258e FRT FR=0x258e
periodicstart 0x2a2f
lsthresh 0x0628
roothub.a 0a000203 POTPGT=10 NPS NDP=3
roothub.b 00000000 PPCM=0000 DR=0000
roothub.status 00008000 DRWE
roothub.portstatus [0] 0x00000103 PPS PES CCS
roothub.portstatus [1] 0x00000100 PPS
roothub.portstatus [2] 0x00000100 PPS


The registers file changes while everything works, too. For example
"BLF" isn't always present on the cmdstatus line.

Also, I saw these lines in dmesg. Sadly enough they don't appear in
syslog so I can't tell whether it happened at the same time the network
went down.
ohci_hcd 0001:10:1b.0: fminterval a7782edf
ohci_hcd 0001:10:1b.0: fminterval a7782edf
ohci_hcd 0001:10:1b.0: fminterval a7782edf
ohci_hcd 0001:10:1b.0: fminterval a7782edf

Hope this helps,
--
Colin

2005-05-10 14:07:34

by David Brownell

Subject: Re: [linux-usb-devel] [2.6.12-rc4] network wlan connection goes down

On Tuesday 10 May 2005 1:43 am, Colin Leroy wrote:
> On Mon, 9 May 2005 08:12:48 -0700
> David Brownell <[email protected]> wrote:
>
> > However, if you enable CONFIG_USB_DEBUG and send the "async" and
> > "registers" files from the relevant /sys/class/usb_host/usbN
> > directory, it should be easy to tell whether there's an issue at that
> > particular level.
>
> ok, I reproduced it. I've put in place a crontab that checks if my
> router is pingable, and if not, dumps async and registers, resets the
> zd1201 (the iwconfig command is the one that puts it back into a
> functioning state), and dumps async and registers.

Hmm, well nothing looks wrong at the OHCI level. It's possible that
the data toggle got screwed up somehow; there are devices where the
hardware doesn't reset it when it should. If that's the case, there'd
likely be some rx or tx error (does "ifconfig" show any? dmesg?) and
the wlan driver's recovery would need updating to ensure that the
various endpoints get properly reset given those quirks.

If you can report what the "iwconfig essid ..." command does down
at the USB level, that should help sort things out. It's possible
that the network TX timeout mechanism might be a good place to kick
in some driver recovery scheme. And it's not unknown for device
firmware to cause problems!

It might also be good to check whether this is a case where packets
go out, but don't come back in; or vice versa.
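
A rough way to check the direction, sketched in Python: sample the interface packet counters, send a few pings, sample again, and compare. It assumes the counters are exposed under /sys/class/net/wlan0/statistics/ on the kernel in question; the function names are made up. Note that tx_packets only reflects what the driver reports as sent, so a rising count does not prove the frames reached the air.

```python
from pathlib import Path


def read_counters(iface="wlan0", root="/sys/class/net"):
    """Read tx/rx packet counters for one interface from sysfs."""
    stats = Path(root) / iface / "statistics"
    return {key: int((stats / key).read_text()) for key in ("tx_packets", "rx_packets")}


def classify(before, after):
    """Describe which direction moved between two counter samples."""
    tx = after["tx_packets"] - before["tx_packets"]
    rx = after["rx_packets"] - before["rx_packets"]
    if tx > 0 and rx > 0:
        return "both directions moving"
    if tx > 0:
        return "tx only: packets go out, nothing comes back"
    if rx > 0:
        return "rx only: packets come in, nothing goes out"
    return "no traffic in either direction"
```

Usage would be `before = read_counters()`, a few pings, then `classify(before, read_counters())`.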


> Here's the result:
> ... <snip> ...
>
> The registers file changes while everything works, too. For example
> "BLF" isn't always present on the cmdstatus line.

BLF == "Bulk List Filled"; basically it means the wlan
driver submitted a bulk URB and you printed the schedule
out before the hardware restarted its scan of that part
of the async schedule and cleared the bit.
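
To make the bit concrete, here is a small Python sketch that decodes the cmdstatus value using the HcCommandStatus bit positions from the OHCI specification (HCR, CLF, BLF, OCR in bits 0 to 3, SOC in bits 16 and 17). The function name is made up; it is not how ohci_hcd formats the file, just the same layout spelled out.

```python
# HcCommandStatus flag bits, per the OHCI register layout.
CMDSTATUS_FLAGS = [
    (0, "HCR"),  # Host Controller Reset
    (1, "CLF"),  # Control List Filled
    (2, "BLF"),  # Bulk List Filled
    (3, "OCR"),  # Ownership Change Request
]


def decode_cmdstatus(value):
    """Return (SOC count, list of flag names set) for a cmdstatus value."""
    flags = [name for bit, name in CMDSTATUS_FLAGS if value & (1 << bit)]
    soc = (value >> 16) & 0x3  # Scheduling Overrun Count
    return soc, flags


print(decode_cmdstatus(0x00004))  # the "after reset" dump: (0, ['BLF'])
print(decode_cmdstatus(0x00000))  # the "before reset" dump: (0, [])
```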

What those dumps showed is just that there's an IN transfer
pending, with that WLAN driver polling the device to see
if there's a packet for your Linux host. USB network links
do that more or less all the time the link is up.
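
The "ed" line in the async dump can be cross-checked by hand. The sketch below (Python, hypothetical function name) assumes the hex word 00401083 on that line is the endpoint descriptor's first dword and decodes it with the field layout from the OHCI specification; the result agrees with the "fs dev3 ep1in max 64" text printed next to it.

```python
def decode_ed_info(info):
    """Decode the first dword of an OHCI endpoint descriptor."""
    # Direction field: 1 = OUT, 2 = IN, 0 and 3 = taken from the TD.
    direction = {0: "td", 1: "out", 2: "in", 3: "td"}[(info >> 11) & 0x3]
    return {
        "dev": info & 0x7F,                           # function address
        "ep": (info >> 7) & 0xF,                      # endpoint number
        "dir": direction,
        "speed": "ls" if info & (1 << 13) else "fs",  # low or full speed
        "skip": bool(info & (1 << 14)),               # sKip bit
        "iso": bool(info & (1 << 15)),                # isochronous format
        "max": (info >> 16) & 0x7FF,                  # maximum packet size
    }


print(decode_ed_info(0x00401083))
```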


> Also, I saw these lines in dmesg. Sadly enough they don't appear in
> syslog so I can't tell whether it happened at the same time the network
> went down.
> ohci_hcd 0001:10:1b.0: fminterval a7782edf
> ohci_hcd 0001:10:1b.0: fminterval a7782edf
> ohci_hcd 0001:10:1b.0: fminterval a7782edf
> ohci_hcd 0001:10:1b.0: fminterval a7782edf

Safe to ignore; there's some debug stuff that shouldn't
kick in when you "cat registers", but it does.

- Dave


>
> Hope this helps,
> --
> Colin
>

2005-05-11 07:10:48

by Colin Leroy

Subject: Re: [linux-usb-devel] [2.6.12-rc4] network wlan connection goes down

On Tue, 10 May 2005 07:07:13 -0700
David Brownell <[email protected]> wrote:

Hi David,

> Hmm, well nothing looks wrong at the OHCI level. It's possible that
> the data toggle got screwed up somehow; there are devices where the
> hardware doesn't reset it when it should. If that's the case, there'd
> likely be some rx or tx error (does "ifconfig" show any? dmesg?) and
> the wlan driver's recovery would need updating to ensure that the
> various endpoints get properly reset given those quirks.

ifconfig and dmesg don't show anything. However, the latest news is
that I rebooted (into 2.6.12-rc4 again), and it didn't reproduce. After a
few hours (plenty of time to hit this bug), I put the laptop to sleep
and resumed it a few times; I'll see whether it breaks again now. Maybe
it's sleep related; as you say, the USB layer looks fine...

> If you can report what the "iwconfig essid ..." command does down
> at the USB level, that should help sort things out.

basically my iwconfig command looks like
iwconfig mode managed essid AP channel 11 key restricted aa:bb:cc...

This results in calls to zd1201_set_mode(), zd1201_set_essid(),
zd1201_set_freq(), zd1201_set_encode() (once each, in this order). Each
of those, at the USB level, does some register reading/writing
(zd1201_setconfig16(), zd1201_getconfig16() and friends) and at the
end, calls zd1201_mac_reset(zd), which disables and re-enables the chip
using some zd1201 commands sent down an URB.
I guess this zd1201_mac_reset() call is what "fixes" it.

> It's possible that the network TX timeout mechanism might be a good
> place to kick in some driver recovery scheme.

There's a tx_timeout callback, but it logs to dmesg, and none of
its messages appear.

But I think the problem probably lies in a complex interaction between
the sleep code and the USB code, and simply finding a way to reset the
zd1201 chip when it misbehaves, instead of finding the root cause,
isn't the solution. In other words, I'll have a hard time finding this
solution :)

Thanks,
--
Colin

2005-05-11 11:56:49

by Jeroen Vreeken

Subject: Re: [linux-usb-devel] [2.6.12-rc4] network wlan connection goes down

Colin Leroy wrote:

>I guess this zd1201_mac_reset() call is what "fixes" it.
>
>
One of the results of the MAC reset is that the device reassociates with
the access point. You might just have lost your link with it, and for
some reason automagic reassociation goes wrong or doesn't happen at all.
When the link is gone, can you check what BSSID iwconfig reports?
If this is the problem, there isn't much the driver can do: this is all
done by firmware. (One hack might be to have a timer do a mac_reset
every once in a while if the link is gone.)
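
The BSSID check could also be scripted from userspace. The Python sketch below pulls the "Access Point:" field out of captured iwconfig output; the function name is made up, and treating an all-zero or all-FF address (or no address at all) as "not associated" is an assumption, since the exact text varies between wireless-tools versions and drivers.

```python
import re

# Matches the MAC address that follows "Access Point:" in iwconfig output.
MAC_RE = re.compile(r"Access Point:\s*([0-9A-Fa-f]{2}(?::[0-9A-Fa-f]{2}){5})")

# Assumed placeholders for "no access point"; adjust for the driver in use.
NULL_BSSIDS = {"00:00:00:00:00:00", "FF:FF:FF:FF:FF:FF"}


def associated_bssid(iwconfig_output):
    """Return the BSSID as an upper-case string, or None if not associated."""
    match = MAC_RE.search(iwconfig_output)
    if match is None:
        return None
    bssid = match.group(1).upper()
    return None if bssid in NULL_BSSIDS else bssid
```

A periodic job could call this on the output of `iwconfig wlan0` and re-run the iwconfig configuration command when it returns None, which would be a userspace stand-in for the timer hack rather than a driver fix.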

Jeroen