LinuxLists.cc - e1000e fails after several S3 resumes (2.6.26 Debian, TP T60)

2008-10-22 13:29:18

Subject: e1000e fails after several S3 resumes (2.6.26 Debian, TP T60)

Once in a while after resuming from S3 sleep, the Ethernet driver
gets confused, whereupon dhcp'ing for an IP address fails, e.g.

/* doing the dhcp: */
Listening on LPF/eth0/00:16:41:52:50:de
Sending on LPF/eth0/00:16:41:52:50:de
Sending on Socket/fallback
DHCPDISCOVER on eth0 to 255.255.255.255 port 67 interval 7
/* and so on with various intervals */

I work around it with

modprobe -rv e1000e ; modprobe -v e1000e
(the '-v' options to make sure the module does vanish and return)

and then try again to get an address, which works. A similar failure
mode happens with the iwl3945 driver (and a similar workaround usually
succeeds).

How can I debug this issue the next time that it happens (it's about
once every two weeks)? Using 'ethtool' or 'lspci -vvvv'?

$ uname -a
Linux approx 2.6.26-1-686 #1 SMP Thu Oct 9 15:18:09 UTC 2008 i686 GNU/Linux

It's Debian unstable's kernel 2.6.26 based on 2.6.26.4. The laptop is a
Thinkpad T60 whose network controllers are given by lspci as

02:00.0 Ethernet controller: Intel Corporation 82573L Gigabit Ethernet Controller
03:00.0 Network controller: Intel Corporation PRO/Wireless 3945ABG [Golan] Network Connection (rev 02)

Could it be caused by the kernel (and modules) getting upgraded
underneath a running system? In which case I'll just 'not do that
again' as the simplest fix, and reboot after a kernel upgrade. My
installed kernel is based on 2.6.26.6, but the running kernel is based
on 2.6.26.4 [where based on means 'with Debian's patches'].

Please CC me on any responses.

-Sanjoy

`Until lions have their historians, tales of the hunt shall always
glorify the hunters.' --African Proverb

2008-10-22 16:29:37

by Jesse Brandeburg

Subject: Re: e1000e fails after several S3 resumes (2.6.26 Debian, TP T60)

added netdev, and maintainer's list.

On Wed, Oct 22, 2008 at 6:28 AM, Sanjoy Mahajan <[email protected]> wrote:
> Once in a while after resuming from S3 sleep, the Ethernet driver
> gets confused, whereupon dhcp'ing for an IP address fails, e.g.
>
> /* doing the dhcp: */
> Listening on LPF/eth0/00:16:41:52:50:de
> Sending on LPF/eth0/00:16:41:52:50:de
> Sending on Socket/fallback
> DHCPDISCOVER on eth0 to 255.255.255.255 port 67 interval 7
> /* and so on with various intervals */

ethtool -d ethX at this point might be interesting, also we have a
debug tool called ethregs that dumps all the registers of the adapter
that would help isolate the difference in the hardware configuration.
run it once you've hung after sending a few dhcp, and then again after
you reload the driver and things are working. You can download ethregs
at prdownloads.sourceforge.net/e1000
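
The before/after comparison suggested here can be done with a plain text diff of the two dumps. A minimal sketch, assuming each dump has been saved as text; the function name, the file labels, and the sample register values are illustrative and not taken from ethregs itself:

```python
import difflib


def diff_register_dumps(before_text, after_text):
    """Return the unified-diff lines between two register dumps (as strings)."""
    before = before_text.splitlines()
    after = after_text.splitlines()
    return list(
        difflib.unified_diff(
            before, after, fromfile="before.log", tofile="after.log", lineterm=""
        )
    )


# Made-up sample values, only to show the shape of the result.
hung = "CTRL 0x18140248\nRDH 0x00000010\n"
working = "CTRL 0x18140248\nRDH 0x00000020\n"

for line in diff_register_dumps(hung, working):
    print(line)
```

Only registers whose values changed between the hung state and the working state show up as `-`/`+` pairs, which keeps a dump of several hundred lines readable.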

you'll have to build ethregs which I haven't tried to do on debian,
but it should be possible.

> I work around it with
>
> modprobe -rv e1000e ; modprobe -v e1000e
> (the '-v' options to make sure the module does vanish and return)

an ethtool -r eth0 might be sufficient.

> and then try again to get an address, which works. A similar failure
> mode happens with the iwl3945 driver (and a similar workaround usually
> succeeds).
>
> How can I debug this issue the next time that it happens (it's about
> once every two weeks)? Using 'ethtool' or 'lspci -vvvv'?

yes... :-)

> $ uname -a
> Linux approx 2.6.26-1-686 #1 SMP Thu Oct 9 15:18:09 UTC 2008 i686 GNU/Linux
>
> It's Debian unstable's kernel 2.6.26 based on 2.6.26.4. The laptop is a
> Thinkpad T60 whose network controllers are given by lspci as
>
> 02:00.0 Ethernet controller: Intel Corporation 82573L Gigabit Ethernet Controller
> 03:00.0 Network controller: Intel Corporation PRO/Wireless 3945ABG [Golan] Network Connection (rev 02)
>
> Could it be caused by the kernel (and modules) getting upgraded
> underneath a running system? In which case I'll just 'not do that
> again' as the simplest fix, and reboot after a kernel upgrade. My
> installed kernel is based on 2.6.26.6, but the running kernel is based
> on 2.6.26.4 [where based on means 'with Debian's patches'].

no, if the kernel version changes, the modules that go with it are
only compatible with that version and would not be loaded
accidentally. Also, e1000e does not get unloaded during S3 suspend,
but we do take a different init path.

There is also lots of opportunity for BIOS bugs to be affecting things,
so please make sure that you have the latest BIOS.

Jesse

2008-10-22 19:22:40

by Sanjoy Mahajan

Subject: Re: e1000e fails after several S3 resumes (2.6.26 Debian, TP T60)

Thanks. I'll use 'ethregs' and 'ethtool -d eth0' at the next
opportunity, though it might take a few weeks for the problem to recur.

> you'll have to build ethregs which I haven't tried to do on debian,
> but it should be possible.

It needed the 'libpci-dev' package and built smoothly.

> There is also lots of opportunity for BIOS bugs to be affecting things
> so please make sure that you have the latest bios.

It has BIOS 2.20 (79ETE3WW), the latest version when I last upgraded the
BIOS, but I see that 2.23 is available. I'll upgrade as soon as I
remember how on a Windowless (TM) machine.

-Sanjoy

`Until lions have their historians, tales of the hunt shall always
glorify the hunters.' --African Proverb

2008-10-22 21:17:24

by Yves-Alexis Perez

Subject: Re: e1000e fails after several S3 resumes (2.6.26 Debian, TP T60)

On mer, 2008-10-22 at 15:21 -0400, Sanjoy Mahajan wrote:
> It has BIOS 2.20 (79ETE3WW), the latest version when I last upgraded the
> BIOS, but I see that 2.23 is available. I'll upgrade as soon as I
> remember how on a Windowless (TM) machine.

Just pick the bootable cdrom
--
Yves-Alexis

2008-10-23 13:32:22

by Sanjoy Mahajan

Subject: Re: e1000e fails after several S3 resumes (2.6.26 Debian, TP T60)

MAC Registers
-------------
0x00000: CTRL (Device control register) 0x18140248
Endian mode (buffers): little
Link reset: reset
Set link up: 1
Invert Loss-Of-Signal: no
Receive flow control: enabled
Transmit flow control: enabled
VLAN mode: disabled
Auto speed detect: disabled
Speed select: 1000Mb/s
Force speed: no
Force duplex: no
0x00008: STATUS (Device status register) 0x80080783
Duplex: full
Link up: link config
TBI mode: disabled
Link speed: 1000Mb/s
Bus type: PCI Express
Port number: 0
0x00100: RCTL (Receive control register) 0x04008002
Receiver: enabled
Store bad packets: disabled
Unicast promiscuous: disabled
Multicast promiscuous: disabled
Long packet: disabled
Descriptor minimum threshold size: 1/2
Broadcast accept mode: accept
VLAN filter: disabled
Canonical form indicator: disabled
Discard pause frames: filtered
Pass MAC control frames: don't pass
Receive buffer size: 2048
0x02808: RDLEN (Receive desc length) 0x00001000
0x02810: RDH (Receive desc head) 0x00000051
0x02818: RDT (Receive desc tail) 0x0000004F
0x02820: RDTR (Receive delay timer) 0x00000000
0x00400: TCTL (Transmit ctrl register) 0x3103F0FA
Transmitter: enabled
Pad short packets: enabled
Software XOFF Transmission: disabled
Re-transmit on late collision: enabled
0x03808: TDLEN (Transmit desc length) 0x00001000
0x03810: TDH (Transmit desc head) 0x00000075
0x03818: TDT (Transmit desc tail) 0x00000075
0x03820: TIDV (Transmit delay timer) 0x00000008
PHY type: IGP2

2008-10-23 22:43:25

by Jesse Brandeburg

Subject: RE: e1000e fails after several S3 resumes (2.6.26 Debian, TP T60)

Sanjoy Mahajan wrote:
>> There is also lots of opportunity for BIOS bugs to be affecting
>> things so please make sure that you have the latest bios.
>
> I was about to burn the CD to update the bios to 2.23 when the failure
> recurred. So, with the caveat that the bios is still 2.20, I've
> attached logs from ethregs and ethtool before and after
> ethtool -r eth0
> (which fixed the dhcp).
>
> Here is the e1000e driver version:
>
> $ grep e1000e /var/log/dmesg
> [ 23.988317] e1000e: Intel(R) PRO/1000 Network Driver - 0.3.3.3-k2
> [ 23.988390] e1000e: Copyright (c) 1999-2008 Intel Corporation.
> [ 23.988505] e1000e 0000:02:00.0: Disabling L1 ASPM

hm, does your kernel have CONFIG_PM defined? if it happens again please include lspci -vvv before and after ethtool -r (see below)

> Here are diffs of the attached before and after logs:
>
> --- ethtool-before.log 2008-10-23 09:14:41.000000000 -0400
> +++ ethtool-after.log 2008-10-23 09:17:54.000000000 -0400
> @@ -33,8 +33,8 @@
> Pass MAC control frames: don't pass
> Receive buffer size: 2048
> 0x02808: RDLEN (Receive desc length) 0x00001000
> -0x02810: RDH (Receive desc head) 0x000000BB
> -0x02818: RDT (Receive desc tail) 0x000000B9
> +0x02810: RDH (Receive desc head) 0x00000051
> +0x02818: RDT (Receive desc tail) 0x0000004F

this indicates the device was actually receiving packets okay (RDH) and the
driver was returning buffers to hardware (RDT)

> 0x02820: RDTR (Receive delay timer) 0x00000000
> 0x00400: TCTL (Transmit ctrl register) 0x3103F0FA
> Transmitter: enabled
> @@ -42,7 +42,7 @@
> Software XOFF Transmission: disabled
> Re-transmit on late collision: enabled
> 0x03808: TDLEN (Transmit desc length) 0x00001000
> -0x03810: TDH (Transmit desc head) 0x00000018
> -0x03818: TDT (Transmit desc tail) 0x00000018
> +0x03810: TDH (Transmit desc head) 0x00000075
> +0x03818: TDT (Transmit desc tail) 0x00000075

the device also claimed to be transmitting successfully, so I don't know why
the DHCP packets don't work, can you tcpdump on the network or the dhcp
server by chance? I'm looking to see if the server receives the transmits
and then replies.
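
The head/tail reasoning above can be made concrete with the values from the quoted diff. A minimal sketch, assuming 16-byte descriptors and the usual ring convention in which the hardware works on the descriptors from head up to (not including) tail; the helper names are illustrative:

```python
DESC_BYTES = 16  # assumption: each e1000 descriptor occupies 16 bytes


def ring_entries(length_reg):
    """Number of descriptors in a ring whose RDLEN/TDLEN value is length_reg bytes."""
    return length_reg // DESC_BYTES


def outstanding(head, tail, entries):
    """Descriptors from head up to (not including) tail, i.e. handed to the hardware."""
    return (tail - head) % entries


entries = ring_entries(0x1000)  # RDLEN and TDLEN are both 0x1000 in the dump
for label, head, tail in [
    ("RX before reset", 0xBB, 0xB9),
    ("RX after reset", 0x51, 0x4F),
    ("TX before reset", 0x18, 0x18),
    ("TX after reset", 0x75, 0x75),
]:
    count = outstanding(head, tail, entries)
    print(f"{label}: {count} of {entries} descriptors held by hardware")
```

In both dumps the receive ring has almost all of its buffers available to the hardware, and the transmit ring is empty (head equals tail), which is consistent with a device that is receiving and has finished sending everything it was given.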

> RAL[0] 52411600
> RAH[0] 8000de50
> - RAL[1] 00003333
> + RAL[1] 005e0001
> RAH[1] 8000fb00
> - RAL[2] 52ff3333
> - RAH[2] 8000de50
> - RAL[3] 00003333
> - RAH[3] 80000100
> - RAL[4] 005e0001
> + RAL[2] 00003333
> + RAH[2] 8000fb00
> + RAL[3] 52ff3333
> + RAH[3] 8000de50
> + RAL[4] 00003333
> RAH[4] 80000100
> - RAL[5] 00000000
> - RAH[5] 00000000
> + RAL[5] 005e0001
> + RAH[5] 80000100

after resume, one multicast address is added and one is missing from the
list of addresses the adapter will listen on. I reordered but here are
the diffs
before:
RAL[5] 00000000
RAH[5] 00000000
after
RAL[5] 005e0001
RAH[5] 8000fb00

I don't know which protocol added 01:00:5e:00:00:fb as a multicast address only
after suspend.
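
The RAL/RAH pairs in the diff are easier to compare once decoded into MAC addresses. A minimal sketch, assuming the usual e1000 layout: RAL holds the first four address bytes in little-endian order, the low 16 bits of RAH hold the last two, and bit 31 of RAH is the "address valid" flag. The function name is illustrative:

```python
def decode_receive_address(ral, rah):
    """Return (mac_string, valid) for one RAL/RAH register pair."""
    raw = ral.to_bytes(4, "little") + (rah & 0xFFFF).to_bytes(2, "little")
    mac = ":".join(f"{byte:02x}" for byte in raw)
    valid = bool(rah & 0x80000000)  # bit 31: address valid (assumed layout)
    return mac, valid


# Pairs taken from the quoted diff.
for ral, rah in [
    (0x52411600, 0x8000DE50),  # RAL[0]/RAH[0]
    (0x005E0001, 0x8000FB00),  # the entry discussed above
    (0x00000000, 0x00000000),  # an unused slot
]:
    print(decode_receive_address(ral, rah))
```

The first pair decodes to 00:16:41:52:50:de, the interface's own address as reported by the DHCP client, which is a useful check that the byte order is right. 01:00:5e:00:00:fb is the Ethernet mapping of the IPv4 multicast group 224.0.0.251.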

can you ifconfig eth0 promisc before doing suspend? I'd be curious if
that fixed it.

> RAL[6] 00000000
> RAH[6] 00000000
> RAL[7] 00000000
> @@ -390,7 +390,7 @@
> GSCL_2 00000000
> GSCL_3 00000000
> GSCL_4 00000000
> - FACTPS a1041046
> + FACTPS 21041046

The FACTPS bits are reserved in our manuals, though they have to do with PCIe
power state changes. I can't help but wonder whether something is going wrong
with ASPM L0s or L1 when your system comes out of resume (we have had trouble
with that feature on your laptop). The lspci output would show us the
difference if there is one.
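
To see exactly which FACTPS bits changed, the two values from the diff can be XORed. A minimal sketch; the function name is illustrative, and no meaning is assigned to the bit here because the register layout is not given in this thread:

```python
def changed_bits(before, after, width=32):
    """Return the bit positions (0 = least significant) that differ."""
    diff = before ^ after
    return [bit for bit in range(width) if (diff >> bit) & 1]


print(changed_bits(0xA1041046, 0x21041046))
```

Only one bit, bit 31, differs between the two dumps: it is set before `ethtool -r` and clear afterwards.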

2008-10-24 14:25:51

by Sanjoy Mahajan

Subject: Re: e1000e fails after several S3 resumes (2.6.26 Debian, TP T60)

> hm, does your kernel have CONFIG_PM defined?

It does have that defined:

$ grep CONFIG_PM /boot/config-2.6.26-1-686
CONFIG_PM=y
# CONFIG_PM_DEBUG is not set
CONFIG_PM_SLEEP_SMP=y
CONFIG_PM_SLEEP=y
CONFIG_PM_STD_PARTITION=""

> if it happens again please include lspci -vvv before and after ethtool
> -r (see below)

I will. Now I'm running BIOS 2.23, so I'm curious whether that
'upgrade' fixes the problem.

I say 'upgrade' because now S3 sleep and wakeup often take 60 seconds.
I've also noticed ACPI errors in the 'dmesg'. Once I have something
reproducible I'll file a bugzilla report.

> device was also claiming successfully transmitting, so I don't know
> why the DHCP packets don't work, can you tcpdump on the network or the
> dhcp server by chance?

I'll do that too on the next failure. Is 'tcpdump host 18.38.0.1'
sufficient or do I need a few -v switches?

> can you ifconfig eth0 promisc before doing suspend? I'd be curious if
> that fixed it.

If/when it reproduces, I'll add that line to the pre-suspend code. (I
use 's2ram', which I think sleeps with 'echo mem > /sys/power/state' and
does a vt switch on wakeup).

Generally: For making debugging go smoothly, is it worth running a
vanilla kernel rather than the Debian one? I could try 2.6.26.7 or
2.6.27.3. Is running 2.6.27.y not as useful as running 2.6.26.y, in
case the bug is merely hidden but not solved in the new kernel? On the
other hand, I'm tempted to try 2.6.27.y in case it fixes the slow
suspend/resume.

-Sanjoy

`Until lions have their historians, tales of the hunt shall always
glorify the hunters.' --African Proverb

2008-10-24 16:23:34

by Jesse Brandeburg

Subject: RE: e1000e fails after several S3 resumes (2.6.26 Debian, TP T60)

Sanjoy Mahajan wrote:
>> hm, does your kernel have CONFIG_PM defined?
>
> It does have that defined:

ok

>> if it happens again please include lspci -vvv before and after
>> ethtool -r (see below)
>
> I will. Now I'm running BIOS 2.23, so I'm curious whether that
> 'upgrade' fixes the problem.
>
> I say 'upgrade' because now S3 sleep and wakeup often take 60 seconds.
> I've also noticed ACPI errors in the 'dmesg'. Once I have something
> reproducible I'll file a bugzilla report.

ick, it would be nice if the system vendors actually tested their acpi implementations on multiple OSes.

>> device was also claiming successfully transmitting, so I don't know
>> why the DHCP packets don't work, can you tcpdump on the network or
>> the dhcp server by chance?
>
> I'll do that too on the next failure. Is 'tcpdump host 18.38.0.1'
> sufficient or do I need a few -v switches?

I'm mostly looking for the conversation back and forth, so that should be fine.
Keep in mind that the first dhcp packet is usually a broadcast (not to a
particular IP)
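
Because the initial DHCPDISCOVER is a broadcast from a client that has no address yet, a filter on the server's IP alone would miss it. A minimal sketch of a capture command that filters on the DHCP ports instead, assuming the interface is eth0 and that tcpdump is run with sufficient privileges; the helper name is illustrative:

```python
import shlex


def dhcp_capture_command(interface):
    """Build a tcpdump argument list that matches DHCP traffic on its UDP ports."""
    # -n: do not resolve names; -i: capture on the given interface.
    return ["tcpdump", "-n", "-i", interface,
            "udp", "port", "67", "or", "udp", "port", "68"]


command = dhcp_capture_command("eth0")
print(shlex.join(command))
# To actually capture, pass the list to subprocess.run(command) as root.
```

Filtering on ports 67 and 68 catches both the broadcast requests and any unicast replies, so the whole conversation appears in one capture.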

>> can you ifconfig eth0 promisc before doing suspend? I'd be curious
>> if that fixed it.
>
> If/when it reproduces, I'll add that line to the pre-suspend code. (I
> use 's2ram', which I think sleeps with 'echo mem > /sys/power/state'
> and does a vt switch on wakeup).

okay, thanks

> Generally: For making debugging go smoothly, is it worth running a
> vanilla kernel rather than the Debian one? I could try 2.6.26.7 or
> 2.6.27.3. Is running 2.6.27.y not as useful as running 2.6.26.y, in
> case the bug is merely hidden but not solved in the new kernel? On
> the other hand, I'm tempted to try 2.6.27.y in case it fixes the slow
> suspend/resume.

I think you should definitely try 2.6.27.y. The e1000e versions in the kernel
are different from what is in ubuntu at least, so I'm not sure if that applies
to debian.

2008-10-24 19:55:59

by Sanjoy Mahajan

Subject: Re: e1000e fails after several S3 resumes (2.6.26 Debian, TP T60)

Brandeburg, Jesse <[email protected]> wrote:

> > I say 'upgrade' because now S3 sleep and wakeup often take 60
> > seconds. I've also noticed ACPI errors in the 'dmesg'. Once I have
> > something reproducible I'll file a bugzilla report.
>
> ick, it would be nice if the system vendors actually tested their acpi
> implementations on multiple OSes.

They do: XP, Vista, NT, ... Are there any other OS's?!

Good news for 2.6.27.3: With the latest stable kernel, the
suspend/resume is quick again, and the ACPI dmesg errors are gone!

So I'll keep running it and wait for the e1000e problem to return (or
vanish). I'll hurry it along by doing suspend/resume/dhcp lots of
times.

-Sanjoy

`Until lions have their historians, tales of the hunt shall always
glorify the hunters.' --African Proverb

2008-10-28 20:19:30

by Sanjoy Mahajan

Subject: Re: e1000e fails after several S3 resumes (2.6.26 Debian, TP T60)

> I think you should definitely try 2.6.27.y, the e1000e versions in the
> kernel are different than what is in ubuntu at least, so not sure if
> that applies to debian.

I'm running 2.6.27.4, and I haven't seen the e1000e problem yet. But it
happens quite often with wlan0 (iwl3945), whereupon unloading and
loading the module fixes it.

Is there an analogue of ethtool for wireless cards (for debugging and
resetting a la 'ethtool -r eth0')? Other than the tcpdump, what
debugging information should I collect for iwl3945? And whom should I
add to or remove from the CC before sending it out?

-Sanjoy

`Until lions have their historians, tales of the hunt shall always
glorify the hunters.' --African Proverb