2011-02-03 19:48:00

by R. Herbst

[permalink] [raw]
Subject: Sun GEM PPC32 Bug?

Hello.

I may have (hopefully) found a bug that has been present at least from
kernel 2.6.27 through 2.6.37.
I will try to describe it as well as I can.
My hardware:
Apple PowerMac Dual G4 with a Sonnet upgrade to 2x1.83GHz (7447A), Gentoo
Linux (3 weeks old)
---
cat /proc/cpuinfo
processor       : 0
cpu             : 7447A, altivec supported
clock           : 1833.333326MHz
revision        : 1.1 (pvr 8003 0101)
bogomips        : 83.31

processor       : 1
cpu             : 7447A, altivec supported
clock           : 1833.333326MHz
revision        : 1.1 (pvr 8003 0101)
bogomips        : 83.31

total bogomips  : 166.63
timebase        : 41658586
platform        : PowerMac
model           : PowerMac3,6
machine         : PowerMac3,6
motherboard     : PowerMac3,6 MacRISC2 MacRISC Power Macintosh
detected as     : 129 (PowerMac G4 Windtunnel)
pmac flags      : 00000010
L2 cache        : 256K unified
pmac-generation : NewWorld
Memory          : 2048 MB
---
As long as network traffic is fairly low (below 40 MBit/s), there are no
problems at all, for example with an SFTP transfer from my IBM x345 to
my G4.
If I want more throughput and use, for example, FTP (about 200 MBit/s)
or RSYNC (about 120 MBit/s), I get the following messages in
/var/log/messages.
---
grep gem /var/log/messages

Feb  3 19:51:35 G4 kernel: gem 0002:20:0f.0: eth0: Link is up at 1000
Mbps, full-duplex
Feb  3 19:51:35 G4 kernel: gem 0002:20:0f.0: eth0: Pause is disabled
Feb  3 19:54:49 G4 kernel: gem 0002:20:0f.0: eth0: RX MAC fifo
overflow smac[00810400]
Feb  3 19:54:51 G4 kernel: gem 0002:20:0f.0: eth0: RX MAC fifo
overflow smac[00810400]
Feb  3 19:54:51 G4 kernel: gem 0002:20:0f.0: eth0: RX MAC fifo
overflow smac[00810400]
Feb  3 19:54:58 G4 kernel: gem 0002:20:0f.0: eth0: RX MAC fifo
overflow smac[00810400]
Feb  3 19:55:04 G4 kernel: gem 0002:20:0f.0: eth0: RX MAC fifo
overflow smac[02010400]
Feb  3 19:55:11 G4 kernel: gem 0002:20:0f.0: eth0: RX MAC fifo
overflow smac[00810400]
Feb  3 19:55:12 G4 kernel: gem 0002:20:0f.0: eth0: RX MAC fifo
overflow smac[00010400]
Feb  3 19:55:16 G4 kernel: gem 0002:20:0f.0: eth0: RX MAC fifo
overflow smac[00810400]
Feb  3 19:55:25 G4 kernel: gem 0002:20:0f.0: eth0: RX MAC fifo
overflow smac[02010400]
Feb  3 19:55:26 G4 kernel: gem 0002:20:0f.0: eth0: RX MAC fifo
overflow smac[02010400]
Feb  3 19:55:30 G4 kernel: gem 0002:20:0f.0: eth0: RX MAC fifo
overflow smac[00810400]
Feb  3 19:55:32 G4 kernel: gem 0002:20:0f.0: eth0: RX MAC fifo
overflow smac[02010400]
---

These messages can show up several times per second. After a short
time (sometimes a few seconds, at best 1 or 2 minutes) the whole
system freezes.
Under MacOS X 10.5 my computer runs without any problems.
The hardware is completely fine.

I had previously tried to install Debian on it. That system ran very
well as long as I did not use an SMP kernel. With SMP, the keyboard and
mouse stuttered so badly that working was no longer possible. Just
mentioning it in passing. That is the reason I went with Gentoo.

Unfortunately, I do not see any errors on my Cisco 2960G switch either.

I hope you can help me with this.

Rüdiger Herbst

PS: If any questions remain open, I will gladly answer them if I can!


2011-02-04 08:02:52

by Geert Uytterhoeven

[permalink] [raw]
Subject: Re: Sun GEM PPC32 Bug?

2011/2/3 R. Herbst <[email protected]>:
> Ich habe vielleicht/hoffentlich einen Bug gefunden, der sich
> mindestens von Kernel 2.6.27 bis 2.6.37 befindet.

Probably I have found a bug, that's present in at least kernels 2.6.27
to 2.6.37.

> Ich versuche ihn mal so gut als möglich zu beschreiben.
> Meine verwendete Hardware:
> Apple PowerMac Dual G4 mit Sonnetupgrade auf 2x1.83GHz (7447A), Gentoo
> Linux (3 Wochen alt)

3 weeks old.

> ---
> cat /proc/cpuinfo
> processor    : 0
> cpu        : 7447A, altivec supported
> clock        : 1833.333326MHz
> revision    : 1.1 (pvr 8003 0101)
> bogomips    : 83.31
>
> processor    : 1
> cpu        : 7447A, altivec supported
> clock        : 1833.333326MHz
> revision    : 1.1 (pvr 8003 0101)
> bogomips    : 83.31
>
> total bogomips    : 166.63
> timebase    : 41658586
> platform    : PowerMac
> model        : PowerMac3,6
> machine        : PowerMac3,6
> motherboard    : PowerMac3,6 MacRISC2 MacRISC Power Macintosh
> detected as    : 129 (PowerMac G4 Windtunnel)
> pmac flags    : 00000010
> L2 cache    : 256K unified
> pmac-generation    : NewWorld
> Memory        : 2048 MB
> ---
> Solange der Datenverkehr relativ gering ist, gibt es keinerlei
> Probleme (unter 40MBit), wie beispielsweise eine SFTP von meinem IBM
> x345 auf meinen G4.

As long as network traffic is low (below 40 MBit), there's no problem.
For example using SFTP from my IBM x345 to my G4.

> Möchte ich mehr Durchsatz haben und nutze beispielsweise FTP (ca.
> 200MBit/s durchsatz), oder RSYNC (ca. 120MBit/s), dann bekomme ich in
> /var/log/messages folgende Meldungen angezeigt.

If I use more bandwidth, e.g. using FTP (200 MBit/s bandwidth), or RSYNC,
then I get the following messages in /var/log/messages:

> ---
> grep gem /var/log/messages
>
> Feb  3 19:51:35 G4 kernel: gem 0002:20:0f.0: eth0: Link is up at 1000
> Mbps, full-duplex
> Feb  3 19:51:35 G4 kernel: gem 0002:20:0f.0: eth0: Pause is disabled
> Feb  3 19:54:49 G4 kernel: gem 0002:20:0f.0: eth0: RX MAC fifo
> overflow smac[00810400]
> Feb  3 19:54:51 G4 kernel: gem 0002:20:0f.0: eth0: RX MAC fifo
> overflow smac[00810400]
> Feb  3 19:54:51 G4 kernel: gem 0002:20:0f.0: eth0: RX MAC fifo
> overflow smac[00810400]
> Feb  3 19:54:58 G4 kernel: gem 0002:20:0f.0: eth0: RX MAC fifo
> overflow smac[00810400]
> Feb  3 19:55:04 G4 kernel: gem 0002:20:0f.0: eth0: RX MAC fifo
> overflow smac[02010400]
> Feb  3 19:55:11 G4 kernel: gem 0002:20:0f.0: eth0: RX MAC fifo
> overflow smac[00810400]
> Feb  3 19:55:12 G4 kernel: gem 0002:20:0f.0: eth0: RX MAC fifo
> overflow smac[00010400]
> Feb  3 19:55:16 G4 kernel: gem 0002:20:0f.0: eth0: RX MAC fifo
> overflow smac[00810400]
> Feb  3 19:55:25 G4 kernel: gem 0002:20:0f.0: eth0: RX MAC fifo
> overflow smac[02010400]
> Feb  3 19:55:26 G4 kernel: gem 0002:20:0f.0: eth0: RX MAC fifo
> overflow smac[02010400]
> Feb  3 19:55:30 G4 kernel: gem 0002:20:0f.0: eth0: RX MAC fifo
> overflow smac[00810400]
> Feb  3 19:55:32 G4 kernel: gem 0002:20:0f.0: eth0: RX MAC fifo
> overflow smac[02010400]
> ---
>
> Diese Meldungen können mehrfach pro Sekunde dann aufschalgen. Nach
> geringer Zeit (manchmal ein paar Sekunden, in besten Fall 1, 2
> Minuten) freezed dann das komplette System.

These messages can appear multiple times per second. After a while (sometimes
a few seconds, at best 1 or 2 minutes), the complete system freezes.

> Unter MacOS X 10.5 läuft mein Computer ohne jeglichen Probleme.
> Hardware ist völlig in Ordnung.

Under MacOS my computer runs fine. The hardware has no problems.

> Ich hatte vorher Debian versucht drauf zu installieren. Das System da
> ging sehr gut, solange man keinen Kernel mit SMP benutzt hat. Dann
> hakte die Tastatur und die Maus so dermaßen, dass ein arbeiten nicht
> mehr möglich war. Nur nebenbei erwähnt. Das ist der Grund warum ich
> Gentoo verwendet habe.

I also tried Debian. It worked well, as long as I did not use an SMP kernel.
With SMP, the keyboard and mouse stuttered so badly they became unusable.
That's why I went with Gentoo.

> Auf meinem Cisco 2960G Switch sehe ich leider auch keine Fehler.

I see no error messages on my Cisco switch.

> Ich hoffe, Sie können mir da weiter helfen.

I hope that you can help me.

> Rüdiger Herbst
>
> PS.: Sollten noch Fragen offen sein, beantworte ich diese gerne wenn möglich!

If you have any questions, feel free to ask!

Gr{oetje,eeting}s,

                        Geert

--
Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- [email protected]

In personal conversations with technical people, I call myself a hacker. But
when I'm talking to journalists I just say "programmer" or something like that.
                                -- Linus Torvalds

2011-02-04 16:55:43

by Matt

[permalink] [raw]
Subject: Re: Sun GEM PPC32 Bug?

Hi guys,

I myself don't have any PPC32 box but I just googled some of the
keywords Ruediger posted "gem eth0: RX MAC fifo overflow smac"

and there were even similar or related messages going back to 2004
(and kernel 2.6.9).

Slab corruption seems to be involved (in some ?) cases (e.g.
http://www.mail-archive.com/[email protected]/msg08345.html)

so it sounds serious to me (from a user's point of view).

A kind of temporary fix seems to be to rmmod and modprobe the kernel module,
according to: http://ubuntuforums.org/showthread.php?t=1428330


For the German speaking folks there's a thread over at
forums.gentoo.org (http://forums.gentoo.org/viewtopic-t-862767.html)

and 2 additional English threads which might provide additional info
on this (and another sound) issue:

http://forums.gentoo.org/viewtopic-t-862229.html "kernel: eth0: RX MAC
fifo overflow smac"

http://forums.gentoo.org/viewtopic-t-862579.html "Soundissue extreme quietly"

I'm not subscribed to the list so please CC

Thanks & Regards

Matt

2011-02-04 21:59:45

by Benjamin Herrenschmidt

[permalink] [raw]
Subject: Re: Sun GEM PPC32 Bug?

On Fri, 2011-02-04 at 16:55 +0000, Matt wrote:
> Hi guys,
>
> I myself don't have any PPC32 box but I just googled some of the
> keywords Ruediger posted "gem eth0: RX MAC fifooverflow smac"
>
> and there were even similar or related messages going back to 2004
> (and kernel 2.6.9).
>
> Slab corruption seems to be involved (in some ?) cases (e.g.
> http://www.mail-archive.com/[email protected]/msg08345.html)
>
> so it sounds serious to me (from an users point of view).
>
> A kind of temporary fix seems to rmmod and modprobe the kernel-module,
> according to:http://ubuntuforums.org/showthread.php?t=1428330
>
>
> For the German speaking folks there's a thread over at
> forums.gentoo.org (http://forums.gentoo.org/viewtopic-t-862767.html)
>
> and 2 additional English threads which might provide additional info
> on this (and another sound) issue:
>
> http://forums.gentoo.org/viewtopic-t-862229.html "kernel: eth0: RX MAC
> fifo overflow smac"
>
> http://forums.gentoo.org/viewtopic-t-862579.html "Soundissue extreme quietly"
>
> I'm not subscribed to the list so please CC

So the slab corruption doesn't seem to have ever been reported since
2.6.16, do we know if that's still a problem ?

The FIFO overflow could be a driver bug or a HW issue, there are some
known issues with the small FIFOs in that chip, but it's also possible
that we don't configure them quite right. Anybody wants to dig in and
see what's going on there ? May want to look at the Darwin sungem driver
for reference on how it configures them... However, it should generally
recover when that happens. If not, then we have a bug there.

Cheers,
Ben.

2011-02-04 22:54:34

by David Miller

[permalink] [raw]
Subject: Re: Sun GEM PPC32 Bug?

From: Benjamin Herrenschmidt <[email protected]>
Date: Sat, 05 Feb 2011 07:51:07 +1100

> The FIFO overflow could be a driver bug or a HW issue, there are some
> known issues with the small FIFOs in that chip, but it's also possible
> that we don't configure them quite right. Anybody wants to dig in and
> see what's going on there ? May want to look at the Darwin sungem driver
> for reference on how it configures them... However, it should generally
> recover when that happens. If not, then we have a bug there.

I think we're simply not resetting enough when the RX FIFO overflow
happens.

Just for fun I checked the OpenBSD GEM driver to see what they do.
When an overflow occurs, they bump the statistic, record the current
read and write fifo pointer registers, and schedule a watchdog timer
for 400ms into the future.

If the watchdog timer sees that the RX FIFO overflow bit is still set
in the RX status register, and the RX FIFO read and write pointers
have not changed, it resets the entire chip.

We unconditionally reset the RX MAC when an overflow occurs, that may
simply not be enough to unwedge this thing.
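
The OpenBSD-style check described above can be sketched roughly as
follows. This is a minimal model of the decision logic only, not code
taken from either driver: the struct, its fields and both function names
are hypothetical, and real code would read the overflow bit and the FIFO
pointers from the chip's registers and arm an actual 400ms timer.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical snapshot of RX FIFO state taken at overflow time.
 * Field names are illustrative and match no real driver structure. */
struct rx_fifo_snapshot {
	uint32_t rd_ptr;              /* RX FIFO read pointer at overflow */
	uint32_t wr_ptr;              /* RX FIFO write pointer at overflow */
	unsigned long overflow_count; /* statistic bumped on every overflow */
	bool watchdog_armed;          /* stands in for the 400ms timer */
};

/* Overflow interrupt: count the event, remember both pointers and
 * arm the watchdog. */
void on_rx_overflow(struct rx_fifo_snapshot *s,
		    uint32_t rd_ptr, uint32_t wr_ptr)
{
	s->overflow_count++;
	s->rd_ptr = rd_ptr;
	s->wr_ptr = wr_ptr;
	s->watchdog_armed = true;
}

/* Watchdog expiry: returns true when the entire chip should be reset,
 * i.e. the overflow bit is still set and neither pointer has moved
 * since the overflow was recorded. */
bool watchdog_wants_full_reset(struct rx_fifo_snapshot *s,
			       bool overflow_bit_set,
			       uint32_t rd_ptr, uint32_t wr_ptr)
{
	if (!s->watchdog_armed)
		return false;
	s->watchdog_armed = false;

	return overflow_bit_set &&
	       rd_ptr == s->rd_ptr && wr_ptr == s->wr_ptr;
}
```

If either pointer has moved by the time the watchdog fires, the receiver
is still making progress and no reset is requested in this model.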

2011-02-05 18:35:51

by R. Herbst

[permalink] [raw]
Subject: Re: Sun GEM PPC32 Bug?

Am 04.02.2011 23:55, schrieb David Miller:
>
> I think we're simply not resetting enough when the RX FIFO overflow
> happens.
>
> Just for fun I checked the OpenBSD GEM driver to see what they do.
> When an overflow occurs, they bump the statistic, record the current
> read and write fifo pointer registers, and schedule a watchdog timer
> for 400ms into the future.
>
> If the watchdog timer sees that the RX FIFO overflow bit is still set
> in the RX status register, and the RX FIFO read and write pointers
> have not changed, it resets the entire chip.
>
> We unconditionally reset the RX MAC when an overflow occurs, that may
> simply not be enough to unwedge this thing.
>
Is there a special kernel parameter (sysctl.conf) that would make the
network usable? As it is now, I can't work with it. What can I do in
the meantime?

2011-02-05 20:32:32

by Matt

[permalink] [raw]
Subject: Re: Sun GEM PPC32 Bug?

>Is there a special Kernel parameter (sysctl.conf), that makes it
>possible to use the Network. So how it is now I can't work with it. What
>can I do at the moment?

please add others in the CC

(take a look in the CC field of this mail and you'll know how to do it:
put the person you're answering in the "to" ("an") field :) )

thanks !

there was a kind of temporary solution provided in :
http://ubuntuforums.org/showpost.php?p=9002232&postcount=3

>linuxopjemac, thank you for referring to Ben. He replied and might look into it.

>Meanwhile, I put the module reloading and interface restarting into a cron job that executes every hour: sudo crontab -e

# m h dom mon dow command
0 * * * * /usr/sbin/fix-sungem

>and the script /usr/sbin/fix-sungem contains

/sbin/ifdown eth0 && /sbin/rmmod sungem && /sbin/modprobe sungem &&
/sbin/ifup eth0

Hope that helps somewhat

Regards

Matt

2011-02-05 23:21:09

by Benjamin Herrenschmidt

[permalink] [raw]
Subject: Re: Sun GEM PPC32 Bug?

On Sat, 2011-02-05 at 20:32 +0000, Matt wrote:
> >Is there a special Kernel parameter (sysctl.conf), that makes it
> >possible to use the Network. So how it is now I can't work with it. What
> >can I do at the moment?
>
> please add others in the CC
>
> (take a look in the CC field of this mail and you'll know how to do it:
> put the person you're answering in the "to" ("an") field :) )

Note that it's quite hard for me to figure out all the details of that
problem since it seems to be scattered across various email threads
referring to vastly different kernel versions etc...

Can you guys open a bug at kernel.org's bugzilla, CC me, and put
relevant information (dmesg logs, description of the problem, etc...)
with a -current- upstream kernel please ?

Cheers,
Ben.

> thanks !
>
> there was a kind of temporary solution provided in :
> http://ubuntuforums.org/showpost.php?p=9002232&postcount=3
>
> >linuxopjemac, thank you for referring to Ben. He replied and might look into it.
>
> >Meanwhile, I put the module reloading and interface restarting into a cron job that executes every hour: sudo crontab -e
>
> # m h dom mon dow command
> 0 * * * * /usr/sbin/fix-sungem
>
> >and the script /usr/sbin/fix-sungem contains
>
> /sbin/ifdown eth0 && /sbin/rmmod sungem && /sbin/modprobe sungem &&
> /sbin/ifup eth0
>
> Hope that helps somewhat
>
> Regards
>
> Matt

2011-02-05 23:39:32

by Benjamin Herrenschmidt

[permalink] [raw]
Subject: Re: Sun GEM PPC32 Bug?

On Sat, 2011-02-05 at 19:35 +0100, R. Herbst wrote:
> > I think we're simply not resetting enough when the RX FIFO overflow
> > happens.
> >
> > Just for fun I checked the OpenBSD GEM driver to see what they do.
> > When an overflow occurs, they bump the statistic, record the current
> > read and write fifo pointer registers, and schedule a watchdog timer
> > for 400ms into the future.
> >
> > If the watchdog timer sees that the RX FIFO overflow bit is still set
> > in the RX status register, and the RX FIFO read and write pointers
> > have not changed, it resets the entire chip.
> >
> > We unconditionally reset the RX MAC when an overflow occurs, that may
> > simply not be enough to unwedge this thing.

Right. It would be quite easy for us to do the same thing. Interestingly
enough, I have never observed this behaviour on any of my machines (a
wide range of 32-bit and 64-bit Apple machines).

Also, Apple's own driver does things differently. In case of overflow
interrupt, it seems to only bump some statistics. However, it has a
timeout if no packets have been received for a while (5 seconds) and the
Rx fifo overflow bit is set. In that case, they restart the receiver
(and the receiver only).

Their sequence for restarting the receiver however is a tad different
than ours (mostly slightly different ordering of things), it's hard to
tell whether that's relevant or not, but some of the things do make
sense, such as they stop the DMA before resetting the MAC and restart it
after re-enabling the MAC.
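
The Apple-style behaviour described above can be modelled roughly like
this. It is a sketch under stated assumptions, not code from the Darwin
driver: all names are hypothetical, the 5 second value is simply taken
from the description, and the four steps only record the ordering (stop
DMA, reset MAC, re-enable MAC, restart DMA) rather than touching any
hardware.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define RX_IDLE_TIMEOUT_MS 5000u /* "5 seconds" from the description */

/* Restart the receiver only when the overflow bit is set and nothing
 * has been received for the timeout. The unsigned subtraction keeps
 * the comparison correct if the millisecond counter wraps around. */
bool rx_needs_restart(uint32_t now_ms, uint32_t last_rx_ms,
		      bool overflow_bit_set)
{
	return overflow_bit_set &&
	       (uint32_t)(now_ms - last_rx_ms) >= RX_IDLE_TIMEOUT_MS;
}

/* Hypothetical step names for the receiver restart. */
enum rx_step {
	RX_STOP_DMA,
	RX_RESET_MAC,
	RX_ENABLE_MAC,
	RX_START_DMA,
	RX_STEP_COUNT
};

/* Writes the restart order into steps[] and returns the step count:
 * DMA is stopped before the MAC reset and restarted only after the
 * MAC has been re-enabled. */
size_t rx_restart_sequence(enum rx_step steps[RX_STEP_COUNT])
{
	steps[0] = RX_STOP_DMA;
	steps[1] = RX_RESET_MAC;
	steps[2] = RX_ENABLE_MAC;
	steps[3] = RX_START_DMA;
	return RX_STEP_COUNT;
}
```

Unlike the OpenBSD-style watchdog, this model never escalates to a full
chip reset; it only ever restarts the receive path.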

If I find some time tonight, else tomorrow, I'll whip up a couple of
patches:

- A simpler one re-arranging our Rx reset sequence and adding a test for
the overflow bit at the end, printing out the results, etc...

- One that basically always resets the chip on overflow.

From there we can decide what works and maybe add a bit of a timeout
to the second approach if needed etc... but how often does that overflow
happen in practice ?

Cheers,
Ben.

2011-02-05 23:45:59

by Benjamin Herrenschmidt

[permalink] [raw]
Subject: Re: Sun GEM PPC32 Bug?


> If I find some time tonight, else tomorrow, I'll whip up a couple of
> patches:
>
> - One simpler re-arranging our Rx reset sequence and adding a test for
> the overflow bit at the end, printing out the results, etc...
>
> - One that basically always reset the chip on overflow.

Actually, the second one is trivial, just modify gem_rxmac_interrupt()
as follows:

     if (rxmac_stat & MAC_RXSTAT_OFLW) {
         u32 smac = readl(gp->regs + MAC_SMACHINE);

         netdev_err(dev, "RX MAC fifo overflow smac[%08x]\n", smac);
         gp->net_stats.rx_over_errors++;
         gp->net_stats.rx_fifo_errors++;

-        ret = gem_rxmac_reset(gp);
+        ret = 1;
     }

And tell us if that makes a difference.
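
To make the intent of that one-line change easier to follow, here is a
toy model of the control flow. It rests on an assumption that is not
spelled out in this thread: that a nonzero return from the overflow
branch makes the caller schedule a full chip reset, which is what "always
reset the chip on overflow" implies. The struct and function names are
hypothetical stand-ins, not the real driver's.

```c
#include <stdbool.h>

/* Toy stand-in for driver state; this is not the real driver struct. */
struct toy_gem {
	unsigned long rx_over_errors;
	unsigned long rx_fifo_errors;
	bool rx_mac_reset_done;    /* only the RX MAC was reset */
	bool full_reset_scheduled; /* the whole chip will be reset */
};

/* Models the overflow branch. With always_full_reset set it behaves
 * like the patched code (ret = 1). Otherwise it resets only the RX MAC
 * and returns 0, assuming that the RX MAC reset succeeded. */
int toy_rxmac_overflow(struct toy_gem *gp, bool always_full_reset)
{
	gp->rx_over_errors++;
	gp->rx_fifo_errors++;

	if (always_full_reset)
		return 1;

	gp->rx_mac_reset_done = true;
	return 0;
}

/* Assumption: the caller treats a nonzero return as a request to
 * schedule a full chip reset. */
void toy_abnormal_irq(struct toy_gem *gp, bool always_full_reset)
{
	if (toy_rxmac_overflow(gp, always_full_reset))
		gp->full_reset_scheduled = true;
}
```

In this model the patched variant skips the RX-MAC-only reset entirely
and always takes the heavier full-reset path.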

Cheers,
Ben.


2011-02-06 00:20:40

by Matt

[permalink] [raw]
Subject: Re: Sun GEM PPC32 Bug?

On Sat, Feb 5, 2011 at 11:39 PM, Benjamin Herrenschmidt
<[email protected]> wrote:
> On Sat, 2011-02-05 at 19:35 +0100, R. Herbst wrote:
>> > I think we're simply not resetting enough when the RX FIFO overflow
>> > happens.
>> >
>> > Just for fun I checked the OpenBSD GEM driver to see what they do.
>> > When an overflow occurs, they bump the statistic, record the current
>> > read and write fifo pointer registers, and schedule a watchdog timer
>> > for 400ms into the future.
>> >
>> > If the watchdog timer sees that the RX FIFO overflow bit is still set
>> > in the RX status register, and the RX FIFO read and write pointers
>> > have not changed, it resets the entire chip.
>> >
>> > We unconditionally reset the RX MAC when an overflow occurs, that may
>> > simply not be enough to unwedge this thing.
>
> Right. It would be quite easy for us to do the same thing. Interestingly
> enough, I have never observed this behaviour on any of my machines (a
> wide range of 32-bit and 64-bit Apple machines).
>
> Also, Apple's own driver does things differently. In case of overflow
> interrupt, it seems to only bump some statistics. However, it has a
> timeout if no packets have been received for a while (5 seconds) and the
> Rx fifo overflow bit is set. In that case, they restart the receiver
> (and the receiver only).
>
> Their sequence for restarting the receiver however is a tad different
> than ours (mostly slightly different ordering of things), it's hard to
> tell whether that's relevant or not, but some of the things do make
> sense, such as they stop the DMA before resetting the MAC and restart it
> after re-enabling the MAC.
>
> If I find some time tonight, else tomorrow, I'll whip up a couple of
> patches:
>
>  - One simpler re-arranging our Rx reset sequence and adding a test for
> the overflow bit at the end, printing out the results, etc...
>
>  - One that basically always reset the chip on overflow.
>
> From there we can decide what works and maybe add a bit of a timeout
> to the second approach if needed etc... but how often does that overflow
> happens in practice ?
>
> Cheers,
> Ben.
>
>
>

Well,

according to this post:
http://forums.gentoo.org/viewtopic-p-6565281.html#6565281

>Okay... Es muss ein Kernelproblem sein. Mit SFTP bekomme ich nur einen Datendurchsatz von ca. 40MBit hin. Da kommen dann die Probleme nicht vor.
>Mit FTP direkt habe ich einen Datendurchsatz von ca. 180MBit/s, dann rasseln wieder die Kernelmeldungen ohne Ende.

it doesn't appear when using sftp (only achieving 40 MBit/s
throughput), whereas kernel messages continuously seem to appear when
using FTP with a throughput of around 180 MBit/s


this also seems to apply to recent kernels (2.6.37-ish) according to
http://forums.gentoo.org/viewtopic-p-6564885.html#6564885

>First of all, many thanks for this tip. Unfortunately, that did not change anything about my network problem either. Under network load the system still goes down. FREEZE.
>I will now try a very old kernel. Maybe that helps. The system used to run very well with Linux.

I don't know what Ruediger exactly means with "freeze"

@Ruediger:

did I miss anything ?

does magic sysrq key still work for you ?



This reminds me of a similar problem I had with the sky2 driver and
the 88E8053 chipset; for several kernel releases now it seems to reset
itself fine :)

Regards

Matt

2011-02-06 14:22:32

by R. Herbst

[permalink] [raw]
Subject: Re: Sun GEM PPC32 Bug?

Am 05.02.2011 21:32, schrieb Matt:
>> Is there a special Kernel parameter (sysctl.conf), that makes it
>> possible to use the Network. So how it is now I can't work with it. What
>> can I do at the moment?
>>
>
>> linuxopjemac, thank you for referring to Ben. He replied and might look into it.
>>
>
>> Meanwhile, I put the module reloading and interface restarting into a cron job that executes every hour: sudo crontab -e
>>
> # m h dom mon dow command
> 0 * * * * /usr/sbin/fix-sungem
>
>
>> and the script /usr/sbin/fix-sungem contains
>>
> /sbin/ifdown eth0 && /sbin/rmmod sungem && /sbin/modprobe sungem &&
> /sbin/ifup eth0
>
> Hope that helps somewhat
>
> Regards
>
> Matt
>
Sorry. No positive effect in my case. Same bad Kernel-Messages for Sun-Gem.

Regards
Rüdi

2011-02-06 15:01:32

by R. Herbst

[permalink] [raw]
Subject: Re: Sun GEM PPC32 Bug?

Am 06.02.2011 00:45, schrieb Benjamin Herrenschmidt:
>
>
> Actually, the second one is trivial, just modify gem_rxmac_interrupt()
> as follow:
>
> if (rxmac_stat & MAC_RXSTAT_OFLW) {
> u32 smac = readl(gp->regs + MAC_SMACHINE);
>
> netdev_err(dev, "RX MAC fifo overflow smac[%08x]\n", smac);
> gp->net_stats.rx_over_errors++;
> gp->net_stats.rx_fifo_errors++;
>
> - ret = gem_rxmac_reset(gp);
> + ret = 1;
> }
>
> And tell us if that makes a difference.
>
> Cheers,
> Ben.
>

Okay. I have made the change. The only difference is that:

In /var/log/messages
Feb 6 15:52:12 G4 kernel: gem 0002:20:0f.0: eth0: RX MAC fifo
overflow smac[00810400]
Feb 6 15:52:12 G4 kernel: gem 0002:20:0f.0: eth0: Link is up at 1000
Mbps, full-duplex
Feb 6 15:52:12 G4 kernel: gem 0002:20:0f.0: eth0: Pause is disabled
Feb 6 15:57:10 G4 kernel: NETDEV WATCHDOG: eth0 (gem): transmit queue
0 timed out
Feb 6 15:57:10 G4 kernel: ------------[ cut here ]------------
Feb 6 15:57:10 G4 kernel: WARNING: at net/sched/sch_generic.c:258
Feb 6 15:57:10 G4 kernel: Modules linked in: radeon ttm
drm_kms_helper drm hwmon power_supply ipv6 snd_pcm_oss snd_mixer_oss
snd_seq_dummy snd_seq_oss snd_seq_midi_event snd_seq snd_seq_device
snd_powermac snd_pcm snd_timer snd soundcore snd_page_alloc dm_mod
uninorth_agp sungem agpgart sungem_phy
Feb 6 15:57:10 G4 kernel: NIP: c03dceec LR: c03dceec CTR: 00000001
Feb 6 15:57:10 G4 kernel: REGS: effefe20 TRAP: 0700 Not tainted
(2.6.37-gentoo)
Feb 6 15:57:10 G4 kernel: MSR: 00029032 <EE,ME,CE,IR,DR> CR:
44200084 XER: 20000000
Feb 6 15:57:10 G4 kernel: TASK = ef854cb0[0] 'swapper' THREAD: ef878000 CPU: 1
Feb 6 15:57:10 G4 kernel: GPR00: c03dceec effefed0 ef854cb0 0000003e
00001032 ffffffff c059f182 2074696d
Feb 6 15:57:10 G4 kernel: GPR08: 000069f7 effee000 01ea1000 00000004
ffffffff fff80b18 fff80154 00000000
Feb 6 15:57:10 G4 kernel: GPR16: 00000420 c03dcd4c c0589084 00200200
c04c9786 ef888814 ef888a14 ef888c14
Feb 6 15:57:10 G4 kernel: GPR24: 00000001 ffffffff ef12e7a0 00000002
00000001 00000000 ef8141d4 ef814000
Feb 6 15:57:10 G4 kernel: NIP [c03dceec] dev_watchdog+0x1a0/0x2e4
Feb 6 15:57:10 G4 kernel: LR [c03dceec] dev_watchdog+0x1a0/0x2e4
Feb 6 15:57:10 G4 kernel: Call Trace:
Feb 6 15:57:10 G4 kernel: [effefed0] [c03dceec]
dev_watchdog+0x1a0/0x2e4 (unreliable)
Feb 6 15:57:10 G4 kernel: [effeff40] [c0043db4] run_timer_softirq+0x1ac/0x260
Feb 6 15:57:10 G4 kernel: [effeffa0] [c003d9cc] __do_softirq+0x118/0x1ec
Feb 6 15:57:10 G4 kernel: [effefff0] [c0011398] call_do_softirq+0x14/0x24
Feb 6 15:57:10 G4 kernel: [ef879ea0] [c000687c] do_softirq+0x88/0xb4
Feb 6 15:57:10 G4 kernel: [ef879ec0] [c003d178] irq_exit+0x54/0x74
Feb 6 15:57:10 G4 kernel: [ef879ed0] [c000ead4] timer_interrupt+0x154/0x190
Feb 6 15:57:10 G4 kernel: [ef879ee0] [c0012080] ret_from_except+0x0/0x14
Feb 6 15:57:10 G4 kernel: --- Exception: 901 at cpu_idle+0xe0/0x180
Feb 6 15:57:10 G4 kernel: LR = cpu_idle+0xd4/0x180
Feb 6 15:57:10 G4 kernel: [ef879fa0] [c000a4f8] cpu_idle+0x170/0x180
(unreliable)
Feb 6 15:57:10 G4 kernel: [ef879fc0] [c044952c] start_secondary+0x314/0x350
Feb 6 15:57:10 G4 kernel: [ef879ff0] [00003270] 0x3270
Feb 6 15:57:10 G4 kernel: Instruction dump:
Feb 6 15:57:10 G4 kernel: 2f800001 41be003c 38810008 7fe3fb78
38a00040 4bfe77c9 7fa6eb78 7fe4fb78
Feb 6 15:57:10 G4 kernel: 7c651b78 3c60c050 3863ed12 48068721
<0fe00000> 38000001 3d20c05c 9809d3bc
Feb 6 15:57:10 G4 kernel: ---[ end trace 876ff0d47c88271d ]---
Feb 6 15:57:10 G4 kernel: gem 0002:20:0f.0: eth0: transmit timed out, resetting
Feb 6 15:57:10 G4 kernel: gem 0002:20:0f.0: eth0:
TX_STATE[00000001:00000000:00000001]
Feb 6 15:57:10 G4 kernel: gem 0002:20:0f.0: eth0:
RX_STATE[0609441d:00000001:00000001]
Feb 6 15:57:10 G4 kernel: gem 0002:20:0f.0: eth0: Link is up at 1000
Mbps, full-duplex
Feb 6 15:57:10 G4 kernel: gem 0002:20:0f.0: eth0: Pause is disabled
---
It seems that the network dies and stalls for about 25 seconds. After a
while a call trace appears and the rsync session is dead. But the whole
system does not die.

Regards
Rüdi

2011-02-07 05:34:57

by Benjamin Herrenschmidt

[permalink] [raw]
Subject: Re: Sun GEM PPC32 Bug?

What's your machine model (cat /proc/cpuinfo) and what do you do to
trigger the problem ? I'm trying to reproduce here and so far had
no success doing so.

Cheers,
Ben.

On Sun, 2011-02-06 at 16:01 +0100, R. Herbst wrote:
> Am 06.02.2011 00:45, schrieb Benjamin Herrenschmidt:
> >
> >
> > Actually, the second one is trivial, just modify gem_rxmac_interrupt()
> > as follow:
> >
> > if (rxmac_stat & MAC_RXSTAT_OFLW) {
> > u32 smac = readl(gp->regs + MAC_SMACHINE);
> >
> > netdev_err(dev, "RX MAC fifo overflow smac[%08x]\n", smac);
> > gp->net_stats.rx_over_errors++;
> > gp->net_stats.rx_fifo_errors++;
> >
> > - ret = gem_rxmac_reset(gp);
> > + ret = 1;
> > }
> >
> > And tell us if that makes a difference.
> >
> > Cheers,
> > Ben.
> >
>
> Okay. I have made the change. The only difference is that:
>
> In /var/log/messages
> Feb 6 15:52:12 G4 kernel: gem 0002:20:0f.0: eth0: RX MAC fifo
> overflow smac[00810400]
> Feb 6 15:52:12 G4 kernel: gem 0002:20:0f.0: eth0: Link is up at 1000
> Mbps, full-duplex
> Feb 6 15:52:12 G4 kernel: gem 0002:20:0f.0: eth0: Pause is disabled
> Feb 6 15:57:10 G4 kernel: NETDEV WATCHDOG: eth0 (gem): transmit queue
> 0 timed out
> Feb 6 15:57:10 G4 kernel: ------------[ cut here ]------------
> Feb 6 15:57:10 G4 kernel: WARNING: at net/sched/sch_generic.c:258
> Feb 6 15:57:10 G4 kernel: Modules linked in: radeon ttm
> drm_kms_helper drm hwmon power_supply ipv6 snd_pcm_oss snd_mixer_oss
> snd_seq_dummy snd_seq_oss snd_seq_midi_event snd_seq snd_seq_device
> snd_powermac snd_pcm snd_timer snd soundcore snd_page_alloc dm_mod
> uninorth_agp sungem agpgart sungem_phy
> Feb 6 15:57:10 G4 kernel: NIP: c03dceec LR: c03dceec CTR: 00000001
> Feb 6 15:57:10 G4 kernel: REGS: effefe20 TRAP: 0700 Not tainted
> (2.6.37-gentoo)
> Feb 6 15:57:10 G4 kernel: MSR: 00029032 <EE,ME,CE,IR,DR> CR:
> 44200084 XER: 20000000
> Feb 6 15:57:10 G4 kernel: TASK = ef854cb0[0] 'swapper' THREAD: ef878000 CPU: 1
> Feb 6 15:57:10 G4 kernel: GPR00: c03dceec effefed0 ef854cb0 0000003e
> 00001032 ffffffff c059f182 2074696d
> Feb 6 15:57:10 G4 kernel: GPR08: 000069f7 effee000 01ea1000 00000004
> ffffffff fff80b18 fff80154 00000000
> Feb 6 15:57:10 G4 kernel: GPR16: 00000420 c03dcd4c c0589084 00200200
> c04c9786 ef888814 ef888a14 ef888c14
> Feb 6 15:57:10 G4 kernel: GPR24: 00000001 ffffffff ef12e7a0 00000002
> 00000001 00000000 ef8141d4 ef814000
> Feb 6 15:57:10 G4 kernel: NIP [c03dceec] dev_watchdog+0x1a0/0x2e4
> Feb 6 15:57:10 G4 kernel: LR [c03dceec] dev_watchdog+0x1a0/0x2e4
> Feb 6 15:57:10 G4 kernel: Call Trace:
> Feb 6 15:57:10 G4 kernel: [effefed0] [c03dceec]
> dev_watchdog+0x1a0/0x2e4 (unreliable)
> Feb 6 15:57:10 G4 kernel: [effeff40] [c0043db4] run_timer_softirq+0x1ac/0x260
> Feb 6 15:57:10 G4 kernel: [effeffa0] [c003d9cc] __do_softirq+0x118/0x1ec
> Feb 6 15:57:10 G4 kernel: [effefff0] [c0011398] call_do_softirq+0x14/0x24
> Feb 6 15:57:10 G4 kernel: [ef879ea0] [c000687c] do_softirq+0x88/0xb4
> Feb 6 15:57:10 G4 kernel: [ef879ec0] [c003d178] irq_exit+0x54/0x74
> Feb 6 15:57:10 G4 kernel: [ef879ed0] [c000ead4] timer_interrupt+0x154/0x190
> Feb 6 15:57:10 G4 kernel: [ef879ee0] [c0012080] ret_from_except+0x0/0x14
> Feb 6 15:57:10 G4 kernel: --- Exception: 901 at cpu_idle+0xe0/0x180
> Feb 6 15:57:10 G4 kernel: LR = cpu_idle+0xd4/0x180
> Feb 6 15:57:10 G4 kernel: [ef879fa0] [c000a4f8] cpu_idle+0x170/0x180
> (unreliable)
> Feb 6 15:57:10 G4 kernel: [ef879fc0] [c044952c] start_secondary+0x314/0x350
> Feb 6 15:57:10 G4 kernel: [ef879ff0] [00003270] 0x3270
> Feb 6 15:57:10 G4 kernel: Instruction dump:
> Feb 6 15:57:10 G4 kernel: 2f800001 41be003c 38810008 7fe3fb78
> 38a00040 4bfe77c9 7fa6eb78 7fe4fb78
> Feb 6 15:57:10 G4 kernel: 7c651b78 3c60c050 3863ed12 48068721
> <0fe00000> 38000001 3d20c05c 9809d3bc
> Feb 6 15:57:10 G4 kernel: ---[ end trace 876ff0d47c88271d ]---
> Feb 6 15:57:10 G4 kernel: gem 0002:20:0f.0: eth0: transmit timed out, resetting
> Feb 6 15:57:10 G4 kernel: gem 0002:20:0f.0: eth0:
> TX_STATE[00000001:00000000:00000001]
> Feb 6 15:57:10 G4 kernel: gem 0002:20:0f.0: eth0:
> RX_STATE[0609441d:00000001:00000001]
> Feb 6 15:57:10 G4 kernel: gem 0002:20:0f.0: eth0: Link is up at 1000
> Mbps, full-duplex
> Feb 6 15:57:10 G4 kernel: gem 0002:20:0f.0: eth0: Pause is disabled
> ---
> It seems that the network dies and stalls for about 25 seconds. After a
> while a call trace appears and the rsync session is dead. But the whole
> system does not die.
>
> Regards
> Rüdi

2011-02-08 18:28:37

by R. Herbst

[permalink] [raw]
Subject: Re: Sun GEM PPC32 Bug?

Here is the cpuinfo:

cat /proc/cpuinfo
processor : 0
cpu : 7447A, altivec supported
clock : 1833.333326MHz
revision : 1.1 (pvr 8003 0101)
bogomips : 83.31

processor : 1
cpu : 7447A, altivec supported
clock : 1833.333326MHz
revision : 1.1 (pvr 8003 0101)
bogomips : 83.31

total bogomips : 166.63
timebase : 41659035
platform : PowerMac
model : PowerMac3,6
machine : PowerMac3,6
motherboard : PowerMac3,6 MacRISC2 MacRISC Power Macintosh
detected as : 129 (PowerMac G4 Windtunnel)
pmac flags : 00000010
L2 cache : 256K unified
pmac-generation : NewWorld
Memory : 2048 MB

The problem only appears when network traffic goes above 40 Mbit/s. I have tested with rsync and FTP; a Samba connection shows the same problem.
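For anyone trying to reproduce this, it may help to confirm the rate at which the trouble actually starts. A minimal sketch, assuming the usual /proc/net/dev layout (received bytes in the first column after the interface name, transmitted bytes in the ninth). The function names and the sample counters are made up for illustration and are not taken from my machine:

```python
def parse_rx_tx_bytes(proc_net_dev_text, iface):
    """Return (rx_bytes, tx_bytes) for iface from /proc/net/dev-style text."""
    for line in proc_net_dev_text.splitlines():
        name, sep, rest = line.partition(":")
        # Header lines contain no colon, so sep is empty for them.
        if sep and name.strip() == iface:
            fields = rest.split()
            # Column 0 is received bytes, column 8 is transmitted bytes.
            return int(fields[0]), int(fields[8])
    raise ValueError(f"interface {iface!r} not found")


def mbit_per_s(bytes_before, bytes_after, seconds):
    """Convert a byte-counter difference over an interval into Mbit/s."""
    return (bytes_after - bytes_before) * 8 / seconds / 1e6


# Hypothetical snapshot in /proc/net/dev layout (all counters are invented).
SAMPLE = """\
Inter-|   Receive                                                |  Transmit
 face |bytes    packets errs drop fifo frame compressed multicast|bytes    packets errs drop fifo colls carrier compressed
    lo:    1000      10    0    0    0     0          0         0     1000      10    0    0    0     0       0          0
  eth0: 5000000    4000    0    0    3     0          0         0   250000    2000    0    0    0     0       0          0
"""

if __name__ == "__main__":
    rx, tx = parse_rx_tx_bytes(SAMPLE, "eth0")
    # Treat the sample as "one second after a zeroed counter" for the demo.
    print(f"rx rate: {mbit_per_s(0, rx, 1.0):.1f} Mbit/s")
```

On a real system one would read /proc/net/dev twice, a known number of seconds apart, and pass the two received-byte counters to mbit_per_s().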

Regards
Rüdiger Herbst

On Sun, 2011-02-06 at 16:01 +0100, R. Herbst wrote:

> On 06.02.2011 00:45, Benjamin Herrenschmidt wrote:
>
>>
>> Actually, the second one is trivial, just modify gem_rxmac_interrupt()
>> as follows:
>>
>>  	if (rxmac_stat & MAC_RXSTAT_OFLW) {
>>  		u32 smac = readl(gp->regs + MAC_SMACHINE);
>>
>>  		netdev_err(dev, "RX MAC fifo overflow smac[%08x]\n", smac);
>>  		gp->net_stats.rx_over_errors++;
>>  		gp->net_stats.rx_fifo_errors++;
>>
>> -		ret = gem_rxmac_reset(gp);
>> +		ret = 1;
>>  	}
>>
>> And tell us if that makes a difference.
>>
>> Cheers,
>> Ben.
>>
>>
>


On 07.02.2011 06:34, Benjamin Herrenschmidt wrote:
> What's your machine model (cat /proc/cpuinfo) and what do you do to
> trigger the problem ? I'm trying to reproduce here and so far had
> no success doing so.
>
> Cheers,
> Ben.
>
>> Okay. I have made the change. The only difference is that:
>>
>> In /var/log/messages
>> Feb 6 15:52:12 G4 kernel: gem 0002:20:0f.0: eth0: RX MAC fifo overflow smac[00810400]
>> Feb 6 15:52:12 G4 kernel: gem 0002:20:0f.0: eth0: Link is up at 1000 Mbps, full-duplex
>> Feb 6 15:52:12 G4 kernel: gem 0002:20:0f.0: eth0: Pause is disabled
>> Feb 6 15:57:10 G4 kernel: NETDEV WATCHDOG: eth0 (gem): transmit queue 0 timed out
>> Feb 6 15:57:10 G4 kernel: ------------[ cut here ]------------
>> Feb 6 15:57:10 G4 kernel: WARNING: at net/sched/sch_generic.c:258
>> Feb 6 15:57:10 G4 kernel: Modules linked in: radeon ttm drm_kms_helper drm hwmon power_supply ipv6 snd_pcm_oss snd_mixer_oss snd_seq_dummy snd_seq_oss snd_seq_midi_event snd_seq snd_seq_device snd_powermac snd_pcm snd_timer snd soundcore snd_page_alloc dm_mod uninorth_agp sungem agpgart sungem_phy
>> Feb 6 15:57:10 G4 kernel: NIP: c03dceec LR: c03dceec CTR: 00000001
>> Feb 6 15:57:10 G4 kernel: REGS: effefe20 TRAP: 0700 Not tainted (2.6.37-gentoo)
>> Feb 6 15:57:10 G4 kernel: MSR: 00029032 <EE,ME,CE,IR,DR> CR: 44200084 XER: 20000000
>> Feb 6 15:57:10 G4 kernel: TASK = ef854cb0[0] 'swapper' THREAD: ef878000 CPU: 1
>> Feb 6 15:57:10 G4 kernel: GPR00: c03dceec effefed0 ef854cb0 0000003e 00001032 ffffffff c059f182 2074696d
>> Feb 6 15:57:10 G4 kernel: GPR08: 000069f7 effee000 01ea1000 00000004 ffffffff fff80b18 fff80154 00000000
>> Feb 6 15:57:10 G4 kernel: GPR16: 00000420 c03dcd4c c0589084 00200200 c04c9786 ef888814 ef888a14 ef888c14
>> Feb 6 15:57:10 G4 kernel: GPR24: 00000001 ffffffff ef12e7a0 00000002 00000001 00000000 ef8141d4 ef814000
>> Feb 6 15:57:10 G4 kernel: NIP [c03dceec] dev_watchdog+0x1a0/0x2e4
>> Feb 6 15:57:10 G4 kernel: LR [c03dceec] dev_watchdog+0x1a0/0x2e4
>> Feb 6 15:57:10 G4 kernel: Call Trace:
>> Feb 6 15:57:10 G4 kernel: [effefed0] [c03dceec] dev_watchdog+0x1a0/0x2e4 (unreliable)
>> Feb 6 15:57:10 G4 kernel: [effeff40] [c0043db4] run_timer_softirq+0x1ac/0x260
>> Feb 6 15:57:10 G4 kernel: [effeffa0] [c003d9cc] __do_softirq+0x118/0x1ec
>> Feb 6 15:57:10 G4 kernel: [effefff0] [c0011398] call_do_softirq+0x14/0x24
>> Feb 6 15:57:10 G4 kernel: [ef879ea0] [c000687c] do_softirq+0x88/0xb4
>> Feb 6 15:57:10 G4 kernel: [ef879ec0] [c003d178] irq_exit+0x54/0x74
>> Feb 6 15:57:10 G4 kernel: [ef879ed0] [c000ead4] timer_interrupt+0x154/0x190
>> Feb 6 15:57:10 G4 kernel: [ef879ee0] [c0012080] ret_from_except+0x0/0x14
>> Feb 6 15:57:10 G4 kernel: --- Exception: 901 at cpu_idle+0xe0/0x180
>> Feb 6 15:57:10 G4 kernel: LR = cpu_idle+0xd4/0x180
>> Feb 6 15:57:10 G4 kernel: [ef879fa0] [c000a4f8] cpu_idle+0x170/0x180 (unreliable)
>> Feb 6 15:57:10 G4 kernel: [ef879fc0] [c044952c] start_secondary+0x314/0x350
>> Feb 6 15:57:10 G4 kernel: [ef879ff0] [00003270] 0x3270
>> Feb 6 15:57:10 G4 kernel: Instruction dump:
>> Feb 6 15:57:10 G4 kernel: 2f800001 41be003c 38810008 7fe3fb78 38a00040 4bfe77c9 7fa6eb78 7fe4fb78
>> Feb 6 15:57:10 G4 kernel: 7c651b78 3c60c050 3863ed12 48068721 <0fe00000> 38000001 3d20c05c 9809d3bc
>> Feb 6 15:57:10 G4 kernel: ---[ end trace 876ff0d47c88271d ]---
>> Feb 6 15:57:10 G4 kernel: gem 0002:20:0f.0: eth0: transmit timed out, resetting
>> Feb 6 15:57:10 G4 kernel: gem 0002:20:0f.0: eth0: TX_STATE[00000001:00000000:00000001]
>> Feb 6 15:57:10 G4 kernel: gem 0002:20:0f.0: eth0: RX_STATE[0609441d:00000001:00000001]
>> Feb 6 15:57:10 G4 kernel: gem 0002:20:0f.0: eth0: Link is up at 1000 Mbps, full-duplex
>> Feb 6 15:57:10 G4 kernel: gem 0002:20:0f.0: eth0: Pause is disabled
>> ---
>> It seems that the network dies and hangs for about 25 seconds. After a
>> while a call trace appears and the rsync session is dead. The rest of
>> the system keeps running.
>>
>> Regards
>> Rüdi
>>
>
>

2011-02-08 19:58:33

by Andreas Schwab

Subject: Re: Sun GEM PPC32 Bug?

Benjamin Herrenschmidt writes:

> What's your machine model (cat /proc/cpuinfo) and what do you do to
> trigger the problem ? I'm trying to reproduce here and so far had
> no success doing so.

Just today I saw the same problem on my PowerMac G5, while sending a lot
of data over LAN.

NETDEV WATCHDOG: eth0 (gem): transmit queue 0 timed out
------------[ cut here ]------------
WARNING: at net/sched/sch_generic.c:256
Modules linked in: usb_storage uas tcp_diag inet_diag firewire_sbp2 snd_pcm_oss snd_mixer_oss snd_seq snd_seq_device nfsd lockd exportfs auth_rpcgss nfs_acl sunrpc tun cpufreq_conservative cpufreq_userspace cpufreq_powersave nf_conntrack_ipv6 nf_defrag_ipv6 ip6t_REJECT ip6t_LOG ip6table_filter ip6_tables xt_TCPMSS xt_recent xt_state ipt_REJECT ipt_LOG xt_tcpudp iptable_nat nf_nat nf_conntrack_ipv4 nf_conntrack nf_defrag_ipv4 iptable_filter ip_tables x_tables loop snd_aoa_codec_tas snd_aoa_fabric_layout snd_aoa snd_aoa_i2sbus snd_aoa_soundbus sg snd_pcm firewire_ohci snd_page_alloc firewire_core sr_mod snd_timer crc_itu_t uninorth_agp cdrom sungem sungem_phy snd agpgart soundcore linear sd_mod pata_macio dm_snapshot dm_mod sata_svw libata scsi_mod
NIP: c00000000030ae50 LR: c00000000030ae4c CTR: 0000000000000001
REGS: c00000000fff3a00 TRAP: 0700 Not tainted (2.6.38-rc3)
MSR: 9000000000029032 <EE,ME,CE,IR,DR> CR: 48ffff84 XER: 20000000
TASK = c00000017a0d28c0[0] 'swapper' THREAD: c00000017a0f0000 CPU: 1
GPR00: c00000000030ae4c c00000000fff3c80 c00000000085e410 000000000000003e
GPR04: 0000000000000001 c00000000004d6f0 0000000000000000 0000000000000001
GPR08: 0000000000000000 c00000017a0d28c0 c00000000006eb04 0000000000000001
GPR12: 7472616e736d6974 c00000000ffff780 c0000001778d4400 0000000000000001
GPR16: 0000000000000000 c0000001778d4000 c00000017a119c60 0000000000000100
GPR20: c000000000869280 c00000017a119060 c00000017a119460 0000000000000001
GPR24: ffffffffffffffff c00000017a5f0780 0000000000000002 0000000000000001
GPR28: 0000000000000000 c0000001778d43a0 c0000000007f7350 c0000001778d4000
NIP [c00000000030ae50] .dev_watchdog+0x19c/0x2cc
LR [c00000000030ae4c] .dev_watchdog+0x198/0x2cc
Call Trace:
[c00000000fff3c80] [c00000000030ae4c] .dev_watchdog+0x198/0x2cc (unreliable)
[c00000000fff3d80] [c00000000005986c] .run_timer_softirq+0x1c4/0x264
[c00000000fff3ec0] [c00000000005385c] .__do_softirq+0xe8/0x1c4
[c00000000fff3f90] [c000000000017628] .call_do_softirq+0x14/0x24
[c00000017a0f39b0] [c00000000000b2bc] .do_softirq+0x78/0xc4
[c00000017a0f3a50] [c0000000000539f8] .irq_exit+0x4c/0x9c
[c00000017a0f3ad0] [c000000000014704] .timer_interrupt+0xbc/0xd4
[c00000017a0f3b60] [c000000000003c8c] decrementer_common+0x10c/0x180
--- Exception: 901 at .cpu_idle+0x110/0x1d4
LR = .cpu_idle+0x110/0x1d4
[c00000017a0f3e50] [c0000000000108fc] .cpu_idle+0x64/0x1d4 (unreliable)
[c00000017a0f3ee0] [c0000000003d22d0] .start_secondary+0x310/0x320
[c00000017a0f3f90] [c0000000000072dc] .start_secondary_prolog+0x10/0x14
Instruction dump:
41fe0040 38810070 7fe3fb78 38a00040 4bfea021 60000000 7fe4fb78 7f86e378
7c651b78 e87e8030 480c35cd 60000000 <0fe00000> e93e8018 38000001 98090008
---[ end trace cc84d3d8a2a0b1a7 ]---
gem 0001:03:0f.0: eth0: transmit timed out, resetting
gem 0001:03:0f.0: eth0: TX_STATE[003ffc05:00000001:0000001f]
gem 0001:03:0f.0: eth0: RX_STATE[0100c805:00000001:00000021]
gem 0001:03:0f.0: eth0: Link is up at 100 Mbps, full-duplex
gem 0001:03:0f.0: eth0: Pause is enabled (rxfifo: 10240 off: 7168 on: 5632)

The watchdog message happened only once, but the transmit timeouts
recurred over the whole transfer.
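To put numbers on "once" versus "recurred", something like the following could be run over the raw log. This is only a rough sketch: the match strings are copied from the messages quoted in this thread, the names PATTERNS and tally are hypothetical, and the second timeout line in the sample is invented to stand in for the repeats. It expects unwrapped log lines, so text re-wrapped by a mail client would need to be joined first.

```python
import re
from collections import Counter

# Match strings taken from the gem/netdev messages quoted in this thread.
PATTERNS = {
    "rx_overflow": re.compile(r"RX MAC fifo overflow"),
    "watchdog": re.compile(r"NETDEV WATCHDOG: \S+ \(gem\)"),
    "tx_timeout": re.compile(r"transmit timed out, resetting"),
}


def tally(log_text):
    """Count how many log lines match each known message kind."""
    counts = Counter()
    for line in log_text.splitlines():
        for kind, pattern in PATTERNS.items():
            if pattern.search(line):
                counts[kind] += 1
    return counts


# Illustrative excerpt; the repeated timeout line is an assumption.
SAMPLE_LOG = """\
NETDEV WATCHDOG: eth0 (gem): transmit queue 0 timed out
gem 0001:03:0f.0: eth0: transmit timed out, resetting
gem 0001:03:0f.0: eth0: Link is up at 100 Mbps, full-duplex
gem 0001:03:0f.0: eth0: transmit timed out, resetting
"""

if __name__ == "__main__":
    for kind, count in sorted(tally(SAMPLE_LOG).items()):
        print(kind, count)
```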

Andreas.

--
Andreas Schwab
GPG Key fingerprint = 58CA 54C7 6D53 942B 1756 01D3 44D5 214B 8276 4ED5
"And now for something completely different."

2011-02-09 00:18:28

by Benjamin Herrenschmidt

Subject: Re: Sun GEM PPC32 Bug?

On Tue, 2011-02-08 at 20:58 +0100, Andreas Schwab wrote:
> Benjamin Herrenschmidt writes:
>
> > What's your machine model (cat /proc/cpuinfo) and what do you do to
> > trigger the problem ? I'm trying to reproduce here and so far had
> > no success doing so.
>
> Just today I saw the same problem on my PowerMac G5, while sending a lot
> of data over LAN.

This isn't the same problem... this looks like a tx timeout. Or do you
have some previous messages you didn't paste indicating that it all
started with an RX overflow ? :-)
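The distinction can be checked mechanically on a log excerpt. A minimal sketch, assuming unwrapped log lines in chronological order; the function name is hypothetical and the match strings are simply the message texts quoted in this thread:

```python
def first_timeout_preceded_by_overflow(log_text):
    """Return True if an RX overflow was logged before the first transmit
    timeout, False if the timeout came first, None if there is no timeout."""
    seen_overflow = False
    for line in log_text.splitlines():
        if "RX MAC fifo overflow" in line:
            seen_overflow = True
        elif "transmit timed out" in line:
            return seen_overflow
    return None


# Excerpts reduced to the relevant lines from the two reports.
G4_LOG = """\
Feb 6 15:52:12 G4 kernel: gem 0002:20:0f.0: eth0: RX MAC fifo overflow smac[00810400]
Feb 6 15:57:10 G4 kernel: gem 0002:20:0f.0: eth0: transmit timed out, resetting
"""

G5_LOG = """\
NETDEV WATCHDOG: eth0 (gem): transmit queue 0 timed out
gem 0001:03:0f.0: eth0: transmit timed out, resetting
"""

if __name__ == "__main__":
    print("G4:", first_timeout_preceded_by_overflow(G4_LOG))
    print("G5:", first_timeout_preceded_by_overflow(G5_LOG))
```

This only shows ordering in the log, not causation. An overflow that happened before the pasted excerpt begins would be missed.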

My main G5 has tg3's but I still have a crash box with sungem. I'll
hammer it over a crossover cable and see if I can make anything happen.

Cheers,
Ben.

> NETDEV WATCHDOG: eth0 (gem): transmit queue 0 timed out
> ------------[ cut here ]------------
> WARNING: at net/sched/sch_generic.c:256
> Modules linked in: usb_storage uas tcp_diag inet_diag firewire_sbp2 snd_pcm_oss snd_mixer_oss snd_seq snd_seq_device nfsd lockd exportfs auth_rpcgss nfs_acl sunrpc tun cpufreq_conservative cpufreq_userspace cpufreq_powersave nf_conntrack_ipv6 nf_defrag_ipv6 ip6t_REJECT ip6t_LOG ip6table_filter ip6_tables xt_TCPMSS xt_recent xt_state ipt_REJECT ipt_LOG xt_tcpudp iptable_nat nf_nat nf_conntrack_ipv4 nf_conntrack nf_defrag_ipv4 iptable_filter ip_tables x_tables loop snd_aoa_codec_tas snd_aoa_fabric_layout snd_aoa snd_aoa_i2sbus snd_aoa_soundbus sg snd_pcm firewire_ohci snd_page_alloc firewire_core sr_mod snd_timer crc_itu_t uninorth_agp cdrom sungem sungem_phy snd agpgart soundcore linear sd_mod pata_macio dm_snapshot dm_mod sata_svw libata scsi_mod
> NIP: c00000000030ae50 LR: c00000000030ae4c CTR: 0000000000000001
> REGS: c00000000fff3a00 TRAP: 0700 Not tainted (2.6.38-rc3)
> MSR: 9000000000029032 <EE,ME,CE,IR,DR> CR: 48ffff84 XER: 20000000
> TASK = c00000017a0d28c0[0] 'swapper' THREAD: c00000017a0f0000 CPU: 1
> GPR00: c00000000030ae4c c00000000fff3c80 c00000000085e410 000000000000003e
> GPR04: 0000000000000001 c00000000004d6f0 0000000000000000 0000000000000001
> GPR08: 0000000000000000 c00000017a0d28c0 c00000000006eb04 0000000000000001
> GPR12: 7472616e736d6974 c00000000ffff780 c0000001778d4400 0000000000000001
> GPR16: 0000000000000000 c0000001778d4000 c00000017a119c60 0000000000000100
> GPR20: c000000000869280 c00000017a119060 c00000017a119460 0000000000000001
> GPR24: ffffffffffffffff c00000017a5f0780 0000000000000002 0000000000000001
> GPR28: 0000000000000000 c0000001778d43a0 c0000000007f7350 c0000001778d4000
> NIP [c00000000030ae50] .dev_watchdog+0x19c/0x2cc
> LR [c00000000030ae4c] .dev_watchdog+0x198/0x2cc
> Call Trace:
> [c00000000fff3c80] [c00000000030ae4c] .dev_watchdog+0x198/0x2cc (unreliable)
> [c00000000fff3d80] [c00000000005986c] .run_timer_softirq+0x1c4/0x264
> [c00000000fff3ec0] [c00000000005385c] .__do_softirq+0xe8/0x1c4
> [c00000000fff3f90] [c000000000017628] .call_do_softirq+0x14/0x24
> [c00000017a0f39b0] [c00000000000b2bc] .do_softirq+0x78/0xc4
> [c00000017a0f3a50] [c0000000000539f8] .irq_exit+0x4c/0x9c
> [c00000017a0f3ad0] [c000000000014704] .timer_interrupt+0xbc/0xd4
> [c00000017a0f3b60] [c000000000003c8c] decrementer_common+0x10c/0x180
> --- Exception: 901 at .cpu_idle+0x110/0x1d4
> LR = .cpu_idle+0x110/0x1d4
> [c00000017a0f3e50] [c0000000000108fc] .cpu_idle+0x64/0x1d4 (unreliable)
> [c00000017a0f3ee0] [c0000000003d22d0] .start_secondary+0x310/0x320
> [c00000017a0f3f90] [c0000000000072dc] .start_secondary_prolog+0x10/0x14
> Instruction dump:
> 41fe0040 38810070 7fe3fb78 38a00040 4bfea021 60000000 7fe4fb78 7f86e378
> 7c651b78 e87e8030 480c35cd 60000000 <0fe00000> e93e8018 38000001 98090008
> ---[ end trace cc84d3d8a2a0b1a7 ]---
> gem 0001:03:0f.0: eth0: transmit timed out, resetting
> gem 0001:03:0f.0: eth0: TX_STATE[003ffc05:00000001:0000001f]
> gem 0001:03:0f.0: eth0: RX_STATE[0100c805:00000001:00000021]
> gem 0001:03:0f.0: eth0: Link is up at 100 Mbps, full-duplex
> gem 0001:03:0f.0: eth0: Pause is enabled (rxfifo: 10240 off: 7168 on: 5632)
>
> The watchdog message happened only once, but the transmit timeouts
> recurred over the whole transfer.
>
> Andreas.
>

2011-02-09 17:37:48

by Andreas Schwab

Subject: Re: Sun GEM PPC32 Bug?

Benjamin Herrenschmidt writes:

> On Tue, 2011-02-08 at 20:58 +0100, Andreas Schwab wrote:
>> Benjamin Herrenschmidt writes:
>>
>> > What's your machine model (cat /proc/cpuinfo) and what do you do to
>> > trigger the problem ? I'm trying to reproduce here and so far had
>> > no success doing so.
>>
>> Just today I saw the same problem on my PowerMac G5, while sending a lot
>> of data over LAN.
>
> This isn't the same problem... this looks like a tx timeout. Or do you
> have some previous messages you didn't paste indicating that it all
> started with an RX overflow ? :-)

No, there was no RX overflow.

Andreas.

--
Andreas Schwab
GPG Key fingerprint = 58CA 54C7 6D53 942B 1756 01D3 44D5 214B 8276 4ED5
"And now for something completely different."