LinuxLists.cc - 2.6.29-rc3: tg3 dead after resume

2009-01-29 00:33:47

by Parag Warudkar

Subject: 2.6.29-rc3: tg3 dead after resume

With 2.6.29-rc3 suspend/resume has started working on my workstation again
(did not resume with rc2 - not sure when it broke) but tg3 is dead
after resume.

This is similar to the issue reported back in Jul 2007 -
http://kerneltrap.org/mailarchive/linux-kernel/2007/8/1/154073/thread
which was fixed with a patch to unconditionally save/restore pci config
space - that one is still in tg3.c.

After resume tg3 complains that no firmware is running and eth0 is
non-existent. Rmmoding and modprobing tg3 again causes some timeouts and
errors from tg3 and the link still doesn't work.

Reboot fixes it.

Parag

2009-01-29 01:10:02

by Linus Torvalds

Subject: Re: 2.6.29-rc3: tg3 dead after resume

On Wed, 28 Jan 2009, Parag Warudkar wrote:
>
> This is similar to the issue reported back in Jul 2007 -
> http://kerneltrap.org/mailarchive/linux-kernel/2007/8/1/154073/thread
> which was fixed with a patch to unconditionally save/restore pci config
> space - that one is still in tg3.c.

In fact, the new PCI suspend/restore code should have made that
unnecessary, since the PCI layer now makes sure that a save/restore is
done even if the driver hadn't done it.

But at the same time, still having the driver do it certainly shouldn't
have _hurt_ anything either. But it's quite possible that the tg3 thing is
very sensitive to the exact order things happen in - there's a lot of
comments about bugs in there ;)

> After resume tg3 complains that no firmware is running and eth0 is
> non-existent. Rmmoding and modprobing tg3 again causes some timeouts and
> errors from tg3 and the link still doesn't work.

That seems to imply that even the reset failed, which is interesting.

But it also possibly means that the problem is not necessarily the driver
itself, but some cached state that we keep around in "struct pci_dev" even
across a module load/unload.

For example, if we get the "dev->current_state" cache wrong, then we may
not actually end up changing it when we should, because we think we
already match the target state. I don't _think_ that is it, but that's the
kind of thing that could happen.

Can you do a

lspci -vvxxx -s [tg3-device]

before-and-after suspend? Is there some state that looks like it got
corrupted?

Linus
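
The stale-cache failure mode described above can be illustrated with a small user-space sketch. This is not the kernel's code: `struct fake_dev`, `set_power_state_cached()` and `cache_is_stale()` are hypothetical names made up for the illustration, and only the idea (a cached "current state" that short-circuits the hardware write) comes from the message.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for illustration; not the kernel's definitions. */
enum power_state { STATE_D0 = 0, STATE_D3HOT = 3 };

struct fake_dev {
	enum power_state hw_state;      /* state the hardware is really in */
	enum power_state current_state; /* state the software believes it is in */
	int hw_writes;                  /* number of times the hardware was touched */
};

/* Cached variant: skips the hardware write when the cache already matches. */
static void set_power_state_cached(struct fake_dev *dev, enum power_state target)
{
	assert(dev != NULL);
	if (dev->current_state == target)
		return; /* cache says "already there": hardware is never touched */
	dev->hw_state = target;
	dev->hw_writes++;
	dev->current_state = target;
}

/* Returns true when the cache no longer reflects the hardware. */
static bool cache_is_stale(const struct fake_dev *dev)
{
	assert(dev != NULL);
	return dev->current_state != dev->hw_state;
}
```

If the hardware is really in `STATE_D3HOT` while the cache says `STATE_D0`, a request for `STATE_D0` returns early and the device stays powered down, which is the kind of mismatch that would survive a module unload and reload if the cache lives outside the driver.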

2009-01-29 01:49:45

by Parag Warudkar

Subject: Re: 2.6.29-rc3: tg3 dead after resume

On Wed, 28 Jan 2009, Linus Torvalds wrote:

> For example, if we get the "dev->current_state" cache wrong, then we may
> not actually end up changing it when we should, because we think we
> already match the target state. I don't _think_ that is it, but that's the
> kind of thing that could happen.
>
> Can you do a
>
> lspci -vvxxx -s [tg3-device]
>
> before-and-after suspend? Is there some state that looks like it got
> corrupted?

Sure, diff -u below. There are differences but not sure if they are
abnormal or expected.

Also, BTW, reverting the only tg3 specific commit -

commit 9e9fd12dc0679643c191fc9795a3021807e77de4
Author: Matt Carlson <[email protected]>
Date:   Mon Jan 19 16:57:45 2009 -0800

    tg3: Fix firmware loading

did not help.

parag@parag-desktop:~$ diff -u lspci-pre-suspend lspci-post-suspend
--- lspci-pre-suspend 2009-01-28 20:35:37.070584068 -0500
+++ lspci-post-suspend 2009-01-28 20:36:56.922471408 -0500
@@ -12,7 +12,7 @@
 	Capabilities: [50] Vital Product Data <?>
 	Capabilities: [58] Vendor Specific Information <?>
 	Capabilities: [e8] Message Signalled Interrupts: Mask- 64bit+ Queue=0/0 Enable+
-		Address: 00000000fee0f00c Data: 41c9
+		Address: 00000000fee0f00c Data: 41d1
 	Capabilities: [d0] Express (v1) Endpoint, MSI 00
 		DevCap: MaxPayload 128 bytes, PhantFunc 0, Latency L0s <4us, L1 unlimited
 			ExtTag+ AttnBtn- AttnInd- PwrInd- RBE+ FLReset-
@@ -36,15 +36,15 @@
 20: 00 00 00 00 00 00 00 00 00 00 00 00 3c 10 07 13
 30: 00 00 04 20 48 00 00 00 00 00 00 00 03 01 00 00
 40: 00 00 00 00 00 00 00 00 01 50 03 c0 08 20 00 64
-50: 03 58 fc 00 00 00 00 78 09 e8 78 00 7d c9 08 78
-60: 00 00 00 00 00 00 00 00 98 02 02 a0 00 00 18 76
-70: f2 10 00 00 c0 00 00 00 2c 00 00 00 00 00 00 00
-80: 3c 10 07 13 00 00 00 00 34 00 13 04 82 70 08 fc
-90: 19 be 00 01 00 00 00 b7 00 00 00 00 14 00 00 00
-a0: 00 00 00 00 4c 01 00 00 00 00 00 00 3e 01 00 00
-b0: 00 00 00 00 00 00 00 36 00 00 00 00 00 00 00 00
+50: 03 58 fc 00 00 00 00 78 09 e8 78 00 7e cb 08 a8
+60: 00 00 00 00 00 00 00 00 9a 02 02 a0 00 00 00 10
+70: 72 10 00 00 c0 00 00 00 2c 00 00 00 00 00 00 00
+80: 3c 10 07 13 00 00 00 00 00 00 00 00 fe 70 08 fc
+90: 11 be 00 00 00 00 00 00 00 00 00 00 00 00 00 00
+a0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
+b0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
 c0: 00 00 00 00 00 80 00 00 0e 00 00 00 00 00 00 00
 d0: 10 00 01 00 a0 8f 00 00 00 50 10 00 11 64 03 00
 e0: 40 00 11 10 00 00 00 00 05 d0 81 00 0c f0 e0 fe
-f0: 00 00 00 00 c9 41 00 00 00 00 00 00 00 00 00 00
+f0: 00 00 00 00 d1 41 00 00 00 00 00 00 00 00 00 00

Parag
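
For readers matching the decoded lspci lines against the raw hex rows: the MSI capability starts at offset 0xe8, and its fields are stored little-endian. The sketch below copies bytes 0xe8..0xf7 from the pre-suspend dump and decodes them. It assumes the standard 64-bit MSI capability layout (control word at +2, address at +4 and +8, data at +0xc); the helper names `cfg_le16()` and `cfg_le32()` are made up for this illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Read a little-endian 16-bit value from a raw config-space byte array. */
static uint16_t cfg_le16(const uint8_t *cfg, size_t off)
{
	assert(cfg != NULL);
	return (uint16_t)(cfg[off] | (cfg[off + 1] << 8));
}

/* Read a little-endian 32-bit value from a raw config-space byte array. */
static uint32_t cfg_le32(const uint8_t *cfg, size_t off)
{
	assert(cfg != NULL);
	return (uint32_t)cfg[off] |
	       ((uint32_t)cfg[off + 1] << 8) |
	       ((uint32_t)cfg[off + 2] << 16) |
	       ((uint32_t)cfg[off + 3] << 24);
}

/* Bytes 0xe8..0xf7 of the pre-suspend dump (end of row e0:, start of row f0:). */
static const uint8_t msi_cap_pre[16] = {
	0x05, 0xd0, 0x81, 0x00, 0x0c, 0xf0, 0xe0, 0xfe,
	0x00, 0x00, 0x00, 0x00, 0xc9, 0x41, 0x00, 0x00,
};
```

Decoding `msi_cap_pre` gives capability ID 0x05, control word 0x0081 (enable and 64-bit bits set), address 0x00000000fee0f00c and data 0x41c9, which agrees with the "Address: ... Data: 41c9" line. The post-suspend row differs only in the data bytes (`d1 41`).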

2009-01-29 02:11:20

by Linus Torvalds

Subject: Re: 2.6.29-rc3: tg3 dead after resume

On Wed, 28 Jan 2009, Parag Warudkar wrote:
>
> Sure, diff -u below. There are differences but not sure if they are
> abnormal or expected.

Well, they're all in the "extended set", ie not the basic registers that
the PCI layer saves. The PCI layer normally just saves the low 16 dwords,
along with the PCI[EX] capability thing.

None of the PCI save/restore routines have ever saved the extended state
(well, "ever" is a strong word - I think we long ago used to pass in how
many bytes we wanted saved, but got rid of it), and it certainly didn't
change with the recent PCI suspend/resume changes.

I get the feeling that it's some odd tg3 issue. That tg3 driver does have
that special

	/* Make sure register accesses (indirect or otherwise)
	 * will function correctly.
	 */
	pci_write_config_dword(tp->pdev,
			       TG3PCI_MISC_HOST_CTRL,
			       tp->misc_host_ctrl);

in its own version of setting the power state, and maybe that really
_must_ happen before we actually set the state back to PCI_D0. That sounds
very odd, but hey..

I added Matt Carlson to the cc, since he seems to be the main tg3
authority here.

Matt: the whole discussion is on netdev and the kernel mailing list, but
the short version is that -rc3 suspends and resumes for Parag again
(unlike -rc2), but tg3 doesn't appear to resume properly. The generic PCI
layer now does more at resume time (very early, when interrupts are still
off), see

 - pci_pm_resume_noirq ->
	pci_pm_default_resume_noirq() ->
		pci_restore_standard_config()

for more of the details (basically it always does that
"pci_restore_state()" and tries to bring the device back to PCI_D0).

Linus
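
The "low 16 dwords" remark can be checked against the dump with a few lines of C. This is only a sketch: `in_std_header()` and `first_diff()` are hypothetical helpers, and the two byte rows are the pre- and post-suspend `50:` rows copied from Parag's diff.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* The first 16 dwords (64 bytes) are the part the PCI layer saves by default. */
#define STD_HEADER_DWORDS 16
#define STD_HEADER_BYTES  (STD_HEADER_DWORDS * 4)

/* True if a byte offset falls inside the saved standard header. */
static bool in_std_header(size_t byte_off)
{
	return byte_off < STD_HEADER_BYTES;
}

/* Return the index of the first differing byte, or len if none differ. */
static size_t first_diff(const unsigned char *a, const unsigned char *b, size_t len)
{
	size_t i;

	assert(a != NULL && b != NULL);
	for (i = 0; i < len; i++)
		if (a[i] != b[i])
			return i;
	return len;
}
```

The first changed row in the diff is `50:`, and its first differing byte is at index 12, i.e. config offset 0x5c. That is past 0x40, so it lies outside the 64 bytes covered by the default save/restore.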

2009-01-29 02:20:23

by Matt Carlson

Subject: Re: 2.6.29-rc3: tg3 dead after resume

On Wed, Jan 28, 2009 at 06:10:37PM -0800, Linus Torvalds wrote:
>
>
> On Wed, 28 Jan 2009, Parag Warudkar wrote:
> >
> > Sure, diff -u below. There are differences but not sure if they are
> > abnormal or expected.
>
> Well, they're all in the "extended set", ie not the basic registers that
> the PCI layer saves. The PCI layer normally just saves the low 16 dwords,
> along with the PCI[EX] capability thing.
>
> None of the PCI save/restore routines have ever saved the extended state
> (well, "ever" is a strong word - I think we long ago used to pass in how
> many bytes we wanted saved, but got rid of it), and it certainly didn't
> change with the recent PCI suspend/resume changes.
>
> I get the feeling that it's some odd tg3 issue. That tg3 driver does have
> that special
>
> 	/* Make sure register accesses (indirect or otherwise)
> 	 * will function correctly.
> 	 */
> 	pci_write_config_dword(tp->pdev,
> 			       TG3PCI_MISC_HOST_CTRL,
> 			       tp->misc_host_ctrl);
>
> in its own version of setting the power state, and maybe that really
> _must_ happen before we actually set the state back to PCI_D0. That sounds
> very odd, but hey..
>
> I added Matt Carlson to the cc, since he seems to be the main tg3
> authority here.
>
> Matt: the whole discussion is on netdev and the kernel mailing list, but
> the short version is that -rc3 suspends and resumes for Parag again
> (unlike -rc2), but tg3 doesn't appear to resume properly. The generic PCI
> layer now does more at resume time (very early, when interrupts are still
> off), see
>
>  - pci_pm_resume_noirq ->
> 	pci_pm_default_resume_noirq() ->
> 		pci_restore_standard_config()
>
> for more of the details (basically it always does that
> "pci_restore_state()" and tries to bring the device back to PCI_D0).

Thanks Linus. I'm looking over the diffs Parag sent and I already see
some suspicious register settings. Let me think about this some more
and then I'll jump into the discussion.

2009-01-29 18:42:38

by Matt Carlson

Subject: Re: 2.6.29-rc3: tg3 dead after resume

On Wed, Jan 28, 2009 at 05:49:18PM -0800, Parag Warudkar wrote:
>
>
> On Wed, 28 Jan 2009, Linus Torvalds wrote:
>
> > For example, if we get the "dev->current_state" cache wrong, then we may
> > not actually end up changing it when we should, because we think we
> > already match the target state. I don't _think_ that is it, but that's the
> > kind of thing that could happen.
> >
> > Can you do a
> >
> > lspci -vvxxx -s [tg3-device]
> >
> > before-and-after suspend? Is there some state that looks like it got
> > corrupted?
>
> Sure, diff -u below. There are differences but not sure if they are
> abnormal or expected.
>
> Also, BTW, reverting the only tg3 specific commit -
> commit 9e9fd12dc0679643c191fc9795a3021807e77de4
> Author: Matt Carlson <[email protected]>
> Date: Mon Jan 19 16:57:45 2009 -0800
>
> tg3: Fix firmware loading
>
> did not help.
>
> parag@parag-desktop:~$ diff -u lspci-pre-suspend lspci-post-suspend
> --- lspci-pre-suspend 2009-01-28 20:35:37.070584068 -0500
> +++ lspci-post-suspend 2009-01-28 20:36:56.922471408 -0500
> @@ -12,7 +12,7 @@
>  	Capabilities: [50] Vital Product Data <?>
>  	Capabilities: [58] Vendor Specific Information <?>
>  	Capabilities: [e8] Message Signalled Interrupts: Mask- 64bit+ Queue=0/0 Enable+
> -		Address: 00000000fee0f00c Data: 41c9
> +		Address: 00000000fee0f00c Data: 41d1
>  	Capabilities: [d0] Express (v1) Endpoint, MSI 00
>  		DevCap: MaxPayload 128 bytes, PhantFunc 0, Latency L0s <4us, L1 unlimited
>  			ExtTag+ AttnBtn- AttnInd- PwrInd- RBE+ FLReset-
> @@ -36,15 +36,15 @@
>  20: 00 00 00 00 00 00 00 00 00 00 00 00 3c 10 07 13
>  30: 00 00 04 20 48 00 00 00 00 00 00 00 03 01 00 00
>  40: 00 00 00 00 00 00 00 00 01 50 03 c0 08 20 00 64
> -50: 03 58 fc 00 00 00 00 78 09 e8 78 00 7d c9 08 78
> -60: 00 00 00 00 00 00 00 00 98 02 02 a0 00 00 18 76
> -70: f2 10 00 00 c0 00 00 00 2c 00 00 00 00 00 00 00
> -80: 3c 10 07 13 00 00 00 00 34 00 13 04 82 70 08 fc
> -90: 19 be 00 01 00 00 00 b7 00 00 00 00 14 00 00 00
> -a0: 00 00 00 00 4c 01 00 00 00 00 00 00 3e 01 00 00
> -b0: 00 00 00 00 00 00 00 36 00 00 00 00 00 00 00 00
> +50: 03 58 fc 00 00 00 00 78 09 e8 78 00 7e cb 08 a8
> +60: 00 00 00 00 00 00 00 00 9a 02 02 a0 00 00 00 10
> +70: 72 10 00 00 c0 00 00 00 2c 00 00 00 00 00 00 00
> +80: 3c 10 07 13 00 00 00 00 00 00 00 00 fe 70 08 fc
> +90: 11 be 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> +a0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> +b0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
>  c0: 00 00 00 00 00 80 00 00 0e 00 00 00 00 00 00 00
>  d0: 10 00 01 00 a0 8f 00 00 00 50 10 00 11 64 03 00
>  e0: 40 00 11 10 00 00 00 00 05 d0 81 00 0c f0 e0 fe
> -f0: 00 00 00 00 c9 41 00 00 00 00 00 00 00 00 00 00
> +f0: 00 00 00 00 d1 41 00 00 00 00 00 00 00 00 00 00

O.K. These differences can probably be attributed to the driver's chip
reset failure. For some reason, the driver has lost communication with
the firmware through the device's shared memory. A cascading series of
errors will probably be the consequence.

Can you apply the following test patch and see if it helps? The patch
does two things. First, it enables a bit which should restore firmware
communication. If that fixes the problem, then let me know and I'll
spin a proper patch.

In the event that it doesn't work, the patch goes on to test the memory
mapping by simply printing the register value at offset 0x0. The value
should be the device's vendor ID and device ID. Please post the
results so that I can verify it.

diff --git a/drivers/net/tg3.c b/drivers/net/tg3.c
index 8b3f846..39fce42 100644
--- a/drivers/net/tg3.c
+++ b/drivers/net/tg3.c
@@ -7227,6 +7227,11 @@ static int tg3_init_hw(struct tg3 *tp, int reset_phy)
 {
 	tg3_switch_clocks(tp);

+	printk( KERN_NOTICE "%s: Reg value at offset 0x0 is 0x%x\n",
+		tp->dev->name, tr32(0x0) );
+
+	tw32(MEMARB_MODE, tr32(MEMARB_MODE) | MEMARB_MODE_ENABLE);
+
 	tw32(TG3PCI_MEM_WIN_BASE_ADDR, 0);

 	return tg3_reset_hw(tp, reset_phy);
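
The `tw32(MEMARB_MODE, tr32(MEMARB_MODE) | MEMARB_MODE_ENABLE)` line in the test patch is a read-modify-write: read the register, OR in one bit, write the result back. A minimal user-space sketch of that idiom follows. Everything in it is a stand-in: the register array, the `fake_*` accessors, and the register index and bit value are hypothetical and are not taken from tg3.h.

```c
#include <assert.h>
#include <stdint.h>

#define FAKE_NUM_REGS    4           /* hypothetical register count */
#define FAKE_MODE_REG    1           /* hypothetical register index */
#define FAKE_MODE_ENABLE 0x00000002u /* hypothetical enable bit */

/* Simulated register file standing in for memory-mapped device registers. */
static uint32_t fake_regs[FAKE_NUM_REGS];

static uint32_t fake_read32(unsigned int reg)
{
	assert(reg < FAKE_NUM_REGS);
	return fake_regs[reg];
}

static void fake_write32(unsigned int reg, uint32_t val)
{
	assert(reg < FAKE_NUM_REGS);
	fake_regs[reg] = val;
}

/* Set bits without disturbing the others: read, OR, write back. */
static void fake_set_bits(unsigned int reg, uint32_t bits)
{
	fake_write32(reg, fake_read32(reg) | bits);
}
```

The idiom preserves whatever other bits are already set, and repeating it is harmless. It does depend on the read returning the real register contents: if the read comes back as garbage, the write-back propagates that garbage.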

2009-01-29 22:07:33

by Parag Warudkar

Subject: Re: 2.6.29-rc3: tg3 dead after resume

On Thu, 29 Jan 2009, Matt Carlson wrote:

> Can you apply the following test patch and see if it helps? The patch
> does two things. First, it enables a bit which should restore firmware
> communication. If that fixes the problem, then let me know and I'll
> spin a proper patch.
>
> In the event that it doesn't work, the patch goes on to test the memory
> mapping by simply printing the register value at offset 0x0. The value
> should be the device's vendor ID and device ID. Please post the
> results so that I can verify it.
>
>
> diff --git a/drivers/net/tg3.c b/drivers/net/tg3.c
> index 8b3f846..39fce42 100644
> --- a/drivers/net/tg3.c
> +++ b/drivers/net/tg3.c
> @@ -7227,6 +7227,11 @@ static int tg3_init_hw(struct tg3 *tp, int reset_phy)
>  {
>  	tg3_switch_clocks(tp);
>
> +	printk( KERN_NOTICE "%s: Reg value at offset 0x0 is 0x%x\n",
> +		tp->dev->name, tr32(0x0) );
> +
> +	tw32(MEMARB_MODE, tr32(MEMARB_MODE) | MEMARB_MODE_ENABLE);
> +
>  	tw32(TG3PCI_MEM_WIN_BASE_ADDR, 0);
>
>  	return tg3_reset_hw(tp, reset_phy);
>

Hi Matt,

Thanks for the patch. It didn't help with resume - but below is the
output after patching, let me know if you need more details.

(Looks like 0xffffffff is an invalid/corrupted device ID / vendor ID?)

[ 163.856001] tg3 0000:0e:00.0: restoring config space at offset 0xc (was 0x0, writing 0x20040000)
[ 163.856001] tg3 0000:0e:00.0: restoring config space at offset 0x3 (was 0x0, writing 0x10)
[ 163.856001] tg3 0000:0e:00.0: restoring config space at offset 0x1 (was 0x100000, writing 0x100006)

[snip]

[ 164.450277] pcieport-driver 0000:1e:00.0: setting latency timer to 64
[ 164.450415] pcieport-driver 0000:1e:01.0: setting latency timer to 64
[ 164.450493] tg3 0000:0e:00.0: restoring config space at offset 0xc (was 0x0, writing 0x20040000)
[ 164.451110] serial 00:08: activated

[snip]

[ 168.913863] Restarting tasks ... done.
[ 170.332953] tg3 0000:0e:00.0: wake-up capability disabled by ACPI
[ 170.332960] tg3 0000:0e:00.0: PME# disabled
[ 170.333047] tg3 0000:0e:00.0: irq 54 for MSI/MSI-X
[ 170.333250] eth0: Reg value at offset 0x0 is 0xffffffff
[ 170.394281] [drm] Loading R500 Microcode
[ 170.394330] [drm] Num pipes: 1
[ 171.726650] tg3: eth0: No firmware running.
[ 183.119745] ADDRCONF(NETDEV_UP): eth0: link is not ready

Parag
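
Two details of the log above can be decoded mechanically. First, the "offset" in the "restoring config space" lines appears to be a dword index rather than a byte offset: index 0xc is byte offset 0x30, and the earlier lspci dump does show `00 00 04 20` (0x20040000 little-endian) at `30:`. Second, an all-ones read cannot be a valid ID pair, because 0xffff is not a valid PCI vendor ID. The sketch below encodes both checks; the function names are hypothetical, and the dword-index reading is an inference from the dump, not something stated in the thread.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumption: the logged "offset" is a dword index into the 16-dword header. */
static unsigned int dword_index_to_byte_offset(unsigned int index)
{
	assert(index < 16);
	return index * 4u;
}

/* All-ones is what a read typically returns when nothing answers on the bus. */
static bool read_looks_dead(uint32_t val)
{
	return val == 0xffffffffu;
}

/* Low 16 bits of config dword 0 hold the vendor ID, high 16 bits the device ID. */
static uint16_t vendor_id(uint32_t dword0)
{
	return (uint16_t)(dword0 & 0xffffu);
}

static uint16_t device_id(uint32_t dword0)
{
	return (uint16_t)(dword0 >> 16);
}
```

Applied to the log, indices 0xc, 0x3 and 0x1 map to byte offsets 0x30, 0x0c and 0x04, and `read_looks_dead(0xffffffff)` is true, which is consistent with the register read not reaching the device at all rather than returning a corrupted ID.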
