2006-09-08 08:35:08

by Benjamin Herrenschmidt

[permalink] [raw]
Subject: TG3 data corruption (TSO ?)

Hi !

I've been chasing with Segher a data corruption problem lately.
Basically transferring huge amount of data (several Gb) and I get
corrupted data at the rx side. I cannot tell for sure wether what I've
been observing here is the same problem that segher's been seing on is
blades, he will confirm or not. He also seemed to imply that reverting
to an older kernel on the -receiver- side fixed it, which makes me
wonder, since it's looks really like a sending side problem (see
explanation below), if some change in, for exmaple, window scaling,
might hide or trigger it.

Now, first, I've been playing with ssh from /dev/zero on one machine
to /dev/zero on the other. That allowed me to run enough tests all over
the place to have some idea of where the problem comes from since ssh
will shoke at decryption when hitting the corruption.

The base setup where it happens often is 2 Quad G5's connected to a
gigabit switch. Both were running some versions of 2.6.18-rc4 and -rc5
(some random git actually, but see below as I've reproduced the problem
with today's git snapshot which includes the TG3 tx race fix among
others).

I have reproduced with various machines as the receiver. A sungem in a
Dual G5 and a virtual ethernet in a Power5 partition (so the packets go
to an e1000 then routed through an AIX IO server to a virtual
ethernet :) are good examples of "variety" :) I haven't tested with
non-PowerPC machines so far. I've also never been able to reproduce with
TSO disabled on the emitting TG3's

Then, I've hacked tridge "socklib" test program (a simple TCP server
that pushes a known buffer and a simple TCP receiver that connects to it
and reads the data). I've added comparison of the data with what they
are supposed to be on the receiving end. The interesting thing is that
is much faster than ssh or whatever else I tried. ssh or rsync between
those 2 Quad G5s give me about 35Mb/sec while I get to 107Mb/sec average
with the small test program.

The fun thing is, I've not been able to reproduce at all that way. When
the link is pretty much saturated, the problem doesn't occur !

As soon as I introduce a small delay (some crap waiting loop) in the
sender to slow down the throughput to about 80Mb/sec, then the problem
starts occuring every now and then (I don't have precise frquency data
but I get a corruption every couple of gigabytes I'd say).

As for my previous tests, disabling TSO on the sending side "fixes" it.

Below is a dump of what the corruption look like. I've trimmed the
beginning and end of the dumped packet (the receiver does 8k reads). The
0x5a are the expected data, the rest is corruption. They look like
kernel pointers, but that isn't always the case (often though but that
might not be relevant). The size and position within the buffer of the
corrupted data is variable (doesn't seem to be specifically a page or
anything nice and round like that).

I've configured the switch to send all the traffic between the two
machines to a 3rd box and then recorded it with tcpdump (the "spy" uses
an e1000) and I can see the corrupted data in the recorded
traces (the exact same pattern as detected by the receiver). So it seems
very likely at this point that the corruption happens on the sending
side. The TCP checksums are correct I assume. I don't see any error
count on the receiving tg3 nor suspicous message in dmesg indicating
they aren't.

That's all the data I have at this point. I can't guarantee 100% that
it's a TSO bug (it might be a bug that TSO renders visible due to timing
effects) but it looks like it since I've not reproduced yet with TSO
disabled. I'll do an overnight test to confirm that though... sometimes
the bug can take it's time to show up ... I've seen it wait 20Gb before
it kicked in. Also the fact that fully loading the machine never
produced it is strange.... smells like a race.

Cheers,
Ben.

5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a
5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a
5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a
5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a
5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a
5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a
5a 5a 5a 5a 5a 5a 5a 5a 00 00 00 00 00 00 00 00
2f 63 70 75 73 00 7f 7e c0 00 00 00 01 cb 70 82
00 00 00 04 bf 1d db 4d c0 00 00 00 01 cb 92 00
c0 00 00 01 7b fe 6d 98 c0 00 00 00 01 cb 70 91
00 00 00 04 df 5d fe fd c0 00 00 00 01 cb 92 10
c0 00 00 01 7b fe 6d b8 c0 00 00 00 01 cb 71 0e
00 00 00 04 fe e2 fb cf c0 00 00 00 01 cb 92 20
c0 00 00 01 7b fe 6d d8 c0 00 00 00 01 cb 71 1f
00 00 00 04 73 69 ed ff c0 00 00 00 01 cb 92 30
c0 00 00 01 7b fe 6d f8 c0 00 00 00 01 cb 70 04
00 00 00 04 b9 fe cf ff c0 00 00 00 01 cb 92 40
c0 00 00 01 7b fe 6e 18 c0 00 00 00 00 3f 7b c8
00 00 00 05 ff df b9 bc c0 00 00 01 7b fe 6e 38
00 00 00 00 00 00 00 00 63 70 75 73 00 8d f1 ce
c0 00 00 01 7b fe 73 60 c0 00 00 00 01 cb 92 64
ff 89 d6 80 ff 89 d6 80 00 00 00 01 00 00 00 00
c0 00 00 01 7b fe 68 00 c0 00 00 01 7b fe 6e c8
c0 00 00 01 7b fe 6e e0 00 00 00 00 00 00 00 00
c0 00 00 01 7b fe 6c e8 c0 00 00 01 7b fe 73 70
c0 00 00 01 7b fe 75 48 c0 00 00 01 7b fe 73 70
c0 00 00 01 7b fe 73 70 c0 00 00 01 7b fa 9e 80
00 00 00 02 00 00 00 00 00 00 00 00 00 00 00 00
00 00 00 00 00 00 00 00 2f 63 70 75 73 2f 50 6f
77 65 72 50 43 2c 47 35 40 30 00 7e 5e 6f 4d ef
c0 00 00 00 01 cb 70 6c 00 00 00 04 7f ee b7 fe
c0 00 00 00 01 cb 92 64 c0 00 00 01 7b fe 6f 00
c0 00 00 00 01 cb 71 35 00 00 00 04 bb cc e9 67
c0 00 00 00 01 cb 92 74 c0 00 00 01 7b fe 6f 20
c0 00 00 00 01 cb 71 39 00 00 00 04 2f fc eb b9
c0 00 00 00 01 cb 92 84 c0 00 00 01 7b fe 6f 40
c0 00 00 00 01 cb 71 3e 00 00 00 04 e7 5f be de
c0 00 00 00 01 cb 92 94 c0 00 00 01 7b fe 6f 60
c0 00 00 00 01 cb 71 49 00 00 00 04 e6 73 e7 a7
c0 00 00 00 01 cb 92 a4 c0 00 00 01 7b fe 6f 80
c0 00 00 00 01 cb 71 55 00 00 00 08 1b fb 77 f9
c0 00 00 00 01 cb 92 b4 c0 00 00 01 7b fe 6f a0
c0 00 00 00 01 cb 70 9d 00 00 00 04 b6 db 59 ef
c0 00 00 00 01 cb 92 c8 c0 00 00 01 7b fe 6f c0
c0 00 00 00 01 cb 71 5b 00 00 00 04 69 6f fc da
c0 00 00 00 01 cb 92 d8 c0 00 00 01 7b fe 6f e0
c0 00 00 00 01 cb 71 69 00 00 00 04 d6 7b 66 de
c0 00 00 00 01 cb 92 e8 c0 00 00 01 7b fe 70 00
5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a
5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a
5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a
5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a
5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a
5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a
5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a
5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a





2006-09-08 15:49:52

by Michael Chan

[permalink] [raw]
Subject: Re: TG3 data corruption (TSO ?)

Copying netdev.

Benjamin Herrenschmidt wrote:
> Hi !
>
> I've been chasing with Segher a data corruption problem lately.
> Basically transferring huge amount of data (several Gb) and I get
> corrupted data at the rx side. I cannot tell for sure wether what I've
> been observing here is the same problem that segher's been seing on is
> blades, he will confirm or not. He also seemed to imply that reverting
> to an older kernel on the -receiver- side fixed it, which makes me
> wonder, since it's looks really like a sending side problem (see
> explanation below), if some change in, for exmaple, window scaling,
> might hide or trigger it.

Please send me lspci and tg3 probing output so that I know what
tg3 hardware you're using. I also want to look at the tcpdump or
ethereal on the mirrored port that shows the packet being corrupted.

>
> Now, first, I've been playing with ssh from /dev/zero on one machine
> to /dev/zero on the other. That allowed me to run enough
> tests all over
> the place to have some idea of where the problem comes from since ssh
> will shoke at decryption when hitting the corruption.
>
> The base setup where it happens often is 2 Quad G5's connected to a
> gigabit switch. Both were running some versions of 2.6.18-rc4 and -rc5
> (some random git actually, but see below as I've reproduced
> the problem
> with today's git snapshot which includes the TG3 tx race fix among
> others).
>
> I have reproduced with various machines as the receiver. A sungem in a
> Dual G5 and a virtual ethernet in a Power5 partition (so the
> packets go
> to an e1000 then routed through an AIX IO server to a virtual
> ethernet :) are good examples of "variety" :) I haven't tested with
> non-PowerPC machines so far. I've also never been able to
> reproduce with
> TSO disabled on the emitting TG3's
>
> Then, I've hacked tridge "socklib" test program (a simple TCP server
> that pushes a known buffer and a simple TCP receiver that
> connects to it
> and reads the data). I've added comparison of the data with what they
> are supposed to be on the receiving end. The interesting thing is that
> is much faster than ssh or whatever else I tried. ssh or rsync between
> those 2 Quad G5s give me about 35Mb/sec while I get to
> 107Mb/sec average
> with the small test program.
>
> The fun thing is, I've not been able to reproduce at all that
> way. When
> the link is pretty much saturated, the problem doesn't occur !
>
> As soon as I introduce a small delay (some crap waiting loop) in the
> sender to slow down the throughput to about 80Mb/sec, then the problem
> starts occuring every now and then (I don't have precise frquency data
> but I get a corruption every couple of gigabytes I'd say).
>
> As for my previous tests, disabling TSO on the sending side
> "fixes" it.
>
> Below is a dump of what the corruption look like. I've trimmed the
> beginning and end of the dumped packet (the receiver does 8k
> reads). The
> 0x5a are the expected data, the rest is corruption. They look like
> kernel pointers, but that isn't always the case (often though but that
> might not be relevant). The size and position within the buffer of the
> corrupted data is variable (doesn't seem to be specifically a page or
> anything nice and round like that).
>
> I've configured the switch to send all the traffic between the two
> machines to a 3rd box and then recorded it with tcpdump (the
> "spy" uses
> an e1000) and I can see the corrupted data in the recorded
> traces (the exact same pattern as detected by the receiver).
> So it seems
> very likely at this point that the corruption happens on the sending
> side. The TCP checksums are correct I assume. I don't see any error
> count on the receiving tg3 nor suspicous message in dmesg indicating
> they aren't.
>
> That's all the data I have at this point. I can't guarantee 100% that
> it's a TSO bug (it might be a bug that TSO renders visible
> due to timing
> effects) but it looks like it since I've not reproduced yet with TSO
> disabled. I'll do an overnight test to confirm that though...
> sometimes
> the bug can take it's time to show up ... I've seen it wait
> 20Gb before
> it kicked in. Also the fact that fully loading the machine never
> produced it is strange.... smells like a race.
>
> Cheers,
> Ben.
>
> 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a
> 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a
> 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a
> 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a
> 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a
> 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a
> 5a 5a 5a 5a 5a 5a 5a 5a 00 00 00 00 00 00 00 00
> 2f 63 70 75 73 00 7f 7e c0 00 00 00 01 cb 70 82
> 00 00 00 04 bf 1d db 4d c0 00 00 00 01 cb 92 00
> c0 00 00 01 7b fe 6d 98 c0 00 00 00 01 cb 70 91
> 00 00 00 04 df 5d fe fd c0 00 00 00 01 cb 92 10
> c0 00 00 01 7b fe 6d b8 c0 00 00 00 01 cb 71 0e
> 00 00 00 04 fe e2 fb cf c0 00 00 00 01 cb 92 20
> c0 00 00 01 7b fe 6d d8 c0 00 00 00 01 cb 71 1f
> 00 00 00 04 73 69 ed ff c0 00 00 00 01 cb 92 30
> c0 00 00 01 7b fe 6d f8 c0 00 00 00 01 cb 70 04
> 00 00 00 04 b9 fe cf ff c0 00 00 00 01 cb 92 40
> c0 00 00 01 7b fe 6e 18 c0 00 00 00 00 3f 7b c8
> 00 00 00 05 ff df b9 bc c0 00 00 01 7b fe 6e 38
> 00 00 00 00 00 00 00 00 63 70 75 73 00 8d f1 ce
> c0 00 00 01 7b fe 73 60 c0 00 00 00 01 cb 92 64
> ff 89 d6 80 ff 89 d6 80 00 00 00 01 00 00 00 00
> c0 00 00 01 7b fe 68 00 c0 00 00 01 7b fe 6e c8
> c0 00 00 01 7b fe 6e e0 00 00 00 00 00 00 00 00
> c0 00 00 01 7b fe 6c e8 c0 00 00 01 7b fe 73 70
> c0 00 00 01 7b fe 75 48 c0 00 00 01 7b fe 73 70
> c0 00 00 01 7b fe 73 70 c0 00 00 01 7b fa 9e 80
> 00 00 00 02 00 00 00 00 00 00 00 00 00 00 00 00
> 00 00 00 00 00 00 00 00 2f 63 70 75 73 2f 50 6f
> 77 65 72 50 43 2c 47 35 40 30 00 7e 5e 6f 4d ef
> c0 00 00 00 01 cb 70 6c 00 00 00 04 7f ee b7 fe
> c0 00 00 00 01 cb 92 64 c0 00 00 01 7b fe 6f 00
> c0 00 00 00 01 cb 71 35 00 00 00 04 bb cc e9 67
> c0 00 00 00 01 cb 92 74 c0 00 00 01 7b fe 6f 20
> c0 00 00 00 01 cb 71 39 00 00 00 04 2f fc eb b9
> c0 00 00 00 01 cb 92 84 c0 00 00 01 7b fe 6f 40
> c0 00 00 00 01 cb 71 3e 00 00 00 04 e7 5f be de
> c0 00 00 00 01 cb 92 94 c0 00 00 01 7b fe 6f 60
> c0 00 00 00 01 cb 71 49 00 00 00 04 e6 73 e7 a7
> c0 00 00 00 01 cb 92 a4 c0 00 00 01 7b fe 6f 80
> c0 00 00 00 01 cb 71 55 00 00 00 08 1b fb 77 f9
> c0 00 00 00 01 cb 92 b4 c0 00 00 01 7b fe 6f a0
> c0 00 00 00 01 cb 70 9d 00 00 00 04 b6 db 59 ef
> c0 00 00 00 01 cb 92 c8 c0 00 00 01 7b fe 6f c0
> c0 00 00 00 01 cb 71 5b 00 00 00 04 69 6f fc da
> c0 00 00 00 01 cb 92 d8 c0 00 00 01 7b fe 6f e0
> c0 00 00 00 01 cb 71 69 00 00 00 04 d6 7b 66 de
> c0 00 00 00 01 cb 92 e8 c0 00 00 01 7b fe 70 00
> 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a
> 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a
> 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a
> 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a
> 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a
> 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a
> 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a
> 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a
>


2006-09-08 19:31:54

by Segher Boessenkool

[permalink] [raw]
Subject: Re: TG3 data corruption (TSO ?)

>> I've been chasing with Segher a data corruption problem lately.
>> Basically transferring huge amount of data (several Gb) and I get
>> corrupted data at the rx side. I cannot tell for sure wether what
>> I've
>> been observing here is the same problem that segher's been seing
>> on is
>> blades, he will confirm or not. He also seemed to imply that
>> reverting
>> to an older kernel on the -receiver- side fixed it, which makes me
>> wonder, since it's looks really like a sending side problem (see
>> explanation below), if some change in, for exmaple, window scaling,
>> might hide or trigger it.
>
> Please send me lspci and tg3 probing output so that I know what
> tg3 hardware you're using.

I use a 5780 rev. A3, but the problem is not limited to this chip.

> I also want to look at the tcpdump or
> ethereal on the mirrored port that shows the packet being corrupted.

I don't have such, sorry.

>> That's all the data I have at this point. I can't guarantee 100% that
>> it's a TSO bug (it might be a bug that TSO renders visible
>> due to timing
>> effects) but it looks like it since I've not reproduced yet with TSO
>> disabled.

It seems to indeed to only be exposed by TSO, not actually a
bug of it /an sich/.

I've got a patch that seems so solve the problem, it needs more testing
though (maybe Ben can do this :-) ). The problem is that there should
be quite a few wmb()'s in the code that are just not there; adding some
to tg3_set_txd() seems to fix the immediate problem but more is needed
(and I don't see why those should be needed, unless tg3_set_txd() is
updating a life ring entry in place or something like that).

More testing is needed, but the problem is definitely the lack of memory
ordering.


Segher

2006-09-08 19:56:19

by Michael Chan

[permalink] [raw]
Subject: Re: TG3 data corruption (TSO ?)

On Fri, 2006-09-08 at 21:29 +0200, Segher Boessenkool wrote:

> I've got a patch that seems so solve the problem, it needs more testing
> though (maybe Ben can do this :-) ). The problem is that there should
> be quite a few wmb()'s in the code that are just not there; adding some
> to tg3_set_txd() seems to fix the immediate problem but more is needed
> (and I don't see why those should be needed, unless tg3_set_txd() is
> updating a life ring entry in place or something like that).
>
> More testing is needed, but the problem is definitely the lack of memory
> ordering.
>
Oh, we know about this. The powerpc writel() used to have memory
barriers in 2.4 kernels but not any more in 2.6 kernels. Red Hat's
version of tg3 has extra wmb()'s to fix this problem. David doesn't
think that the upstream version of tg3 should have these wmb()'s, and
the problem should instead be fixed in powerpc's writel().

2006-09-08 21:41:52

by Benjamin Herrenschmidt

[permalink] [raw]
Subject: Re: TG3 data corruption (TSO ?)


> Please send me lspci and tg3 probing output so that I know what
> tg3 hardware you're using. I also want to look at the tcpdump or
> ethereal on the mirrored port that shows the packet being corrupted.

Hi Michael !

It's the dual controller of an Apple Quad G5, thus afaik in an HT2000
chip:

0001:05:04.0 Ethernet controller: Broadcom Corporation NetXtreme BCM5780 Gigabit Ethernet (rev 03)
Subsystem: Apple Computer Inc. Unknown device 0085
Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B-
Status: Cap+ 66MHz+ UDF- FastB2B+ ParErr- DEVSEL=medium >TAbort- <TAbort- <MAbort- >SERR- <PERR-
Latency: 64 (16000ns min)
Interrupt: pin A routed to IRQ 66
Region 0: Memory at fa530000 (64-bit, non-prefetchable) [size=64K]
Region 2: Memory at fa520000 (64-bit, non-prefetchable) [size=64K]
Capabilities: [40] PCI-X non-bridge device
Command: DPERE- ERO- RBC=512 OST=1
Status: Dev=05:04.0 64bit+ 133MHz+ SCD- USC- DC=simple DMMRBC=2048 DMOST=1 DMCRS=16 RSCEM- 266MHz- 533MHz-
Capabilities: [48] Power Management version 2
Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot+,D3cold+)
Status: D0 PME-Enable+ DSel=0 DScale=1 PME-
Capabilities: [50] Vital Product Data
Capabilities: [58] Message Signalled Interrupts: Mask- 64bit+ Queue=0/3 Enable-
Address: 00011aa5c8ce4904 Data: 18d8

0001:05:04.1 Ethernet controller: Broadcom Corporation NetXtreme BCM5780 Gigabit Ethernet (rev 03)
Subsystem: Apple Computer Inc. Unknown device 0085
Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B-
Status: Cap+ 66MHz+ UDF- FastB2B+ ParErr- DEVSEL=medium >TAbort- <TAbort- <MAbort- >SERR- <PERR-
Latency: 16 (16000ns min), Cache Line Size: 64 bytes
Interrupt: pin B routed to IRQ 67
Region 0: Memory at fa510000 (64-bit, non-prefetchable) [size=64K]
Region 2: Memory at fa500000 (64-bit, non-prefetchable) [size=64K]
Capabilities: [40] PCI-X non-bridge device
Command: DPERE- ERO+ RBC=512 OST=1
Status: Dev=05:04.1 64bit+ 133MHz+ SCD- USC- DC=simple DMMRBC=2048 DMOST=1 DMCRS=16 RSCEM- 266MHz- 533MHz-
Capabilities: [48] Power Management version 2
Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot+,D3cold+)
Status: D0 PME-Enable- DSel=0 DScale=1 PME-
Capabilities: [50] Vital Product Data
Capabilities: [58] Message Signalled Interrupts: Mask- 64bit+ Queue=0/3 Enable-
Address: 4e001a0002804460 Data: 00a2

And the dmesg bits:

tg3.c:v3.65 (August 07, 2006)
eth0: Tigon3 [partno(BCM95780) rev 8003 PHY(5780)] (PCIX:133MHz:64-bit) 10/100/1000BaseT Ethernet 00:14:51:65:e6:90
eth0: RXcsums[1] LinkChgREG[0] MIirq[0] ASF[0] Split[0] WireSpeed[1] TSOcap[1]
eth0: dma_rwctrl[76144000] dma_mask[40-bit]
eth1: Tigon3 [partno(BCM95780) rev 8003 PHY(5780)] (PCIX:133MHz:64-bit) 10/100/1000BaseT Ethernet 00:14:51:65:e6:91
eth1: RXcsums[1] LinkChgREG[0] MIirq[0] ASF[0] Split[0] WireSpeed[1] TSOcap[1]
eth1: dma_rwctrl[76144000] dma_mask[40-bit]

As for the tcpdump output, well, I have a 3Gb file for now :) I need to do a bit of surgery on it to
get only the interesting part. I'll try to do that later today (but it may have to wait for monday).

Cheers,
Ben.


2006-09-08 21:46:26

by Benjamin Herrenschmidt

[permalink] [raw]
Subject: Re: TG3 data corruption (TSO ?)

On Fri, 2006-09-08 at 12:54 -0700, Michael Chan wrote:
> On Fri, 2006-09-08 at 21:29 +0200, Segher Boessenkool wrote:
>
> > I've got a patch that seems so solve the problem, it needs more testing
> > though (maybe Ben can do this :-) ). The problem is that there should
> > be quite a few wmb()'s in the code that are just not there; adding some
> > to tg3_set_txd() seems to fix the immediate problem but more is needed
> > (and I don't see why those should be needed, unless tg3_set_txd() is
> > updating a life ring entry in place or something like that).
> >
> > More testing is needed, but the problem is definitely the lack of memory
> > ordering.
> >
> Oh, we know about this. The powerpc writel() used to have memory
> barriers in 2.4 kernels but not any more in 2.6 kernels. Red Hat's
> version of tg3 has extra wmb()'s to fix this problem. David doesn't
> think that the upstream version of tg3 should have these wmb()'s, and
> the problem should instead be fixed in powerpc's writel().

The PowerPC writel has a full sync _after_ the write, mostly to prevent
it from leaking out of a spinlock, and for ordering it vs. other
writel's or readl's. It doesn't provide any ordering guarantee vs
cacheable storage (and was never intended to do so afaik). Such ordering
shall
be provided explicitely. It's possible that 2.4 used a big hammer
approach but we've since been actively fixing drivers for that. It's to
be noted that PowerPC might not be the only architecture affected as I
don't think that in general, you have ordering guarantees between
cacheable and non-cacheable stores unless you use explicit barriers.

Thus I disagree with "fixing" the powerpc writel(). The barries shall
definitely go into tg3.

Cheers,
Ben.




2006-09-08 22:09:22

by Michael Chan

[permalink] [raw]
Subject: Re: TG3 data corruption (TSO ?)

On Sat, 2006-09-09 at 07:41 +1000, Benjamin Herrenschmidt wrote:

> As for the tcpdump output, well, I have a 3Gb file for now :) I need to do a bit of surgery on it to
> get only the interesting part. I'll try to do that later today (but it may have to wait for monday).
>
Ben, We probably don't need the tcpdump anymore now that we know it's a
memory ordering issue.

2006-09-08 22:24:54

by Michael Chan

[permalink] [raw]
Subject: Re: TG3 data corruption (TSO ?)

On Sat, 2006-09-09 at 07:46 +1000, Benjamin Herrenschmidt wrote:

> The PowerPC writel has a full sync _after_ the write, mostly to prevent
> it from leaking out of a spinlock, and for ordering it vs. other
> writel's or readl's. It doesn't provide any ordering guarantee vs
> cacheable storage (and was never intended to do so afaik). Such ordering
> shall
> be provided explicitely. It's possible that 2.4 used a big hammer
> approach but we've since been actively fixing drivers for that. It's to
> be noted that PowerPC might not be the only architecture affected as I
> don't think that in general, you have ordering guarantees between
> cacheable and non-cacheable stores unless you use explicit barriers.

I think 2.4 might have an additional sync before the write which will
guarantee that the buffer descriptor is written before telling the chip
to DMA it.

>
> Thus I disagree with "fixing" the powerpc writel(). The barries shall
> definitely go into tg3.
>

You'll have to take this up with David.

2006-09-08 22:25:48

by Benjamin Herrenschmidt

[permalink] [raw]
Subject: Re: TG3 data corruption (TSO ?)

On Fri, 2006-09-08 at 15:07 -0700, Michael Chan wrote:
> On Sat, 2006-09-09 at 07:41 +1000, Benjamin Herrenschmidt wrote:
>
> > As for the tcpdump output, well, I have a 3Gb file for now :) I need to do a bit of surgery on it to
> > get only the interesting part. I'll try to do that later today (but it may have to wait for monday).
> >
> Ben, We probably don't need the tcpdump anymore now that we know it's a
> memory ordering issue.

Ok. I'm trying to figure out what's the best way with fixing that. I can
see the flamewar coming on wether stores to memory vs. writel shall be
ordered or not :)

I'm very reluctant to add another sync instruction to our writel though.
It needs one already after the stores to prevent leaking out of
spinlocks (and thus possible mmio vs. mmio order issues on SMP with
stores from different CPUs being re-ordered). Fixing the above would
require one before the store as well. We already pay a pretty high price
for that sync, having 2 would be a real shame.

(Unfortunately, there is no cheap barrier available for ordering
cacheable vs. non cacheable storage on PowerPC, they are completely
separate domains).

One option I was discussing with others would be to drop that sync after
the store, and instead start requiring drivers to use mmiowb() (as
defined by the ia64 folks) to provide ordering of writel's vs. locks.
But that probably means breaking and then having to fix a while bunch of
drivers in the tree who haven't been updated to use it...

I'd rather not have to do that, or even if I go that way, not have to
add that sync at all before the store and thus get back the few percent
of perfs lost due to those sync's on some heavy IO benchmarks.

Ben.


2006-09-08 22:42:20

by Michael Chan

[permalink] [raw]
Subject: Re: TG3 data corruption (TSO ?)

On Sat, 2006-09-09 at 08:25 +1000, Benjamin Herrenschmidt wrote:
> Ok. I'm trying to figure out what's the best way with fixing that. I can
> see the flamewar coming on wether stores to memory vs. writel shall be
> ordered or not :)
>
> I'm very reluctant to add another sync instruction to our writel though.
> It needs one already after the stores to prevent leaking out of
> spinlocks (and thus possible mmio vs. mmio order issues on SMP with
> stores from different CPUs being re-ordered). Fixing the above would
> require one before the store as well. We already pay a pretty high price
> for that sync, having 2 would be a real shame.
>
> (Unfortunately, there is no cheap barrier available for ordering
> cacheable vs. non cacheable storage on PowerPC, they are completely
> separate domains).
>
> One option I was discussing with others would be to drop that sync after
> the store, and instead start requiring drivers to use mmiowb() (as
> defined by the ia64 folks) to provide ordering of writel's vs. locks.
> But that probably means breaking and then having to fix a while bunch of
> drivers in the tree who haven't been updated to use it...
>
> I'd rather not have to do that, or even if I go that way, not have to
> add that sync at all before the store and thus get back the few percent
> of perfs lost due to those sync's on some heavy IO benchmarks.
>
Another way to fix this without requiring drivers to add all kinds of
barriers in the driver code is to add a writel_sync() variant. So on
powerpc, writel_sync() will have a sync before and after the write. On
most other architectures, writel_sync() is the same as writel() if the
ordering is guaranteed. We'll then convert tg3 and other drivers to use
writel_sync() in places where they're needed.

2006-09-08 22:49:52

by Benjamin Herrenschmidt

[permalink] [raw]
Subject: Re: TG3 data corruption (TSO ?)


> > I'd rather not have to do that, or even if I go that way, not have to
> > add that sync at all before the store and thus get back the few percent
> > of perfs lost due to those sync's on some heavy IO benchmarks.
> >
> Another way to fix this without requiring drivers to add all kinds of
> barriers in the driver code is to add a writel_sync() variant. So on
> powerpc, writel_sync() will have a sync before and after the write. On
> most other architectures, writel_sync() is the same as writel() if the
> ordering is guaranteed. We'll then convert tg3 and other drivers to use
> writel_sync() in places where they're needed.

I think the preferred approach for that sort of thing is to have writel
be the "sync" version and add special "relaxed" version. Now there have
been talks and debates about relaxed IOs but they generally map to
something different, typically IOs that are relaxed vs. DMA (PCI-X/PCIe
relaxed ordering options for example).

Adding yet another round of IO accessors sounds like a bit nasty to me,
driver writers will potentially not understand which ones to use etc...

Anyway, I think I'll let Anton and Paulus argue that one for now.

Cheers,
Ben.


2006-09-09 09:21:57

by David Miller

[permalink] [raw]
Subject: Re: TG3 data corruption (TSO ?)

From: Benjamin Herrenschmidt <[email protected]>
Date: Sat, 09 Sep 2006 07:46:02 +1000

> I don't think that in general, you have ordering guarantees between
> cacheable and non-cacheable stores unless you use explicit barriers.

In fact, on most systems you absolutely do have ordering between
MMIO and memory accesses.

So you are making an extremely poor engineering decision
by trying to fixup all the drivers to match PowerPC's
semantics. I think a smart engineer would decrease his
debugging burdon, by matching his platform's MMIO accessors
such that it matches what other platforms do and therefore
inheriting the testing coverage provided by all platforms.

Otherwise you will be hunting down these kinds of memory
barrier issues forever.

2006-09-09 22:36:41

by Benjamin Herrenschmidt

[permalink] [raw]
Subject: Re: TG3 data corruption (TSO ?)

On Sat, 2006-09-09 at 02:22 -0700, David Miller wrote:
> From: Benjamin Herrenschmidt <[email protected]>
> Date: Sat, 09 Sep 2006 07:46:02 +1000
>
> > I don't think that in general, you have ordering guarantees between
> > cacheable and non-cacheable stores unless you use explicit barriers.
>
> In fact, on most systems you absolutely do have ordering between
> MMIO and memory accesses.

Well, at least powerpc and ia64 don't in hw, I don't know about
others... out of order in general is getting fascionable in processor
design ...

> So you are making an extremely poor engineering decision
> by trying to fixup all the drivers to match PowerPC's
> semantics. I think a smart engineer would decrease his
> debugging burdon, by matching his platform's MMIO accessors
> such that it matches what other platforms do and therefore
> inheriting the testing coverage provided by all platforms.
>
> Otherwise you will be hunting down these kinds of memory
> barrier issues forever.

Well, some of you (Alan, you, etc...) seem to imply that it's always
been the rule to have a memory store followed by an MMIO write be
strongly ordered.

However, if you look at drivers like e1000, USB OHCI, or even sungem
(:-) they, all have at least wmb()'s between updating descriptor in
memory and the MMIO that triggers reading those by the chip. So it seems
that I'm not the only one to have thought otherwise ;-) More
specificaly, at least ia64 I think, like PowerPC, assumes no ordering
requirement here. So they would need fixing too.

My main problem is the cost... it's actually very expensive to do that
sort of synchronisation. I don't know for ia64 or other potentially out
of order architectures, but we do introduced a measureable performance
hit by adding the one we already have to guard against spin_unlock.

So if we decide to go the way of making writel synchronous vs. previous
MMIOs, I'd really like to have a clearly defined "relaxed" version as
well.

However, I'm not sure any of our current "relaxed" accessor have clear
semantics. At least what is implemented currently on PowerPC is the
__raw_* versions which not only have no barriers at all (they don't even
order between MMIOs, for example, readl might cross writel), and do no
endian swap. Quite a mess of semantics if you ask me... Then there has
been talks about those *_relaxed operations but those are more a match
to the relaxed PCI-X and PCI-E cycels, that is they relax ordering vs.
requests in a different direction on the bus, they have nothing to do
with storage domains on the CPU.

Ben.


2006-09-10 00:15:32

by Alan

[permalink] [raw]
Subject: Re: TG3 data corruption (TSO ?)

Ar Sul, 2006-09-10 am 08:36 +1000, ysgrifennodd Benjamin Herrenschmidt:
> Well, some of you (Alan, you, etc...) seem to imply that it's always
> been the rule to have a memory store followed by an MMIO write be
> strongly ordered.

It has always been the rule

> However, if you look at drivers like e1000, USB OHCI, or even sungem
> (:-) they, all have at least wmb()'s between updating descriptor in

Driver hacks to cope with platform authors who got read/writel wrong.

> semantics. At least what is implemented currently on PowerPC is the
> __raw_* versions which not only have no barriers at all (they don't even
> order between MMIOs, for example, readl might cross writel), and do no
> endian swap. Quite a mess of semantics if you ask me... Then there has

__writel/__readl seems more in keeping for just not locking.


2006-09-10 01:17:57

by Benjamin Herrenschmidt

[permalink] [raw]
Subject: Re: TG3 data corruption (TSO ?)


> > semantics. At least what is implemented currently on PowerPC is the
> > __raw_* versions which not only have no barriers at all (they don't even
> > order between MMIOs, for example, readl might cross writel), and do no
> > endian swap. Quite a mess of semantics if you ask me... Then there has
>
> __writel/__readl seems more in keeping for just not locking.

Not locking... you mean not ordering I suppose. Ok, so the question is
no ordering at all (that is even between MMIO read/writes, thus a
__readl can cross a __writel), or just no ordering between MMIO and
cacheable storage ?

It's an important difference and both have their use. For example, on
PowerPC, if I completely remove barriers, I get the first semantic and I
get the ability to write combine on non-cacheable storage as a benefit
(provided we add an ioremap_wc or such, as the Guarded bit we set on
normal non-cacheable space does also prevent write combining on most
implementations). However, if I keep at least ordering between MMIOs,
then I leave an eieio in there, which is not nearly as expensive than a
full sync but will not order cacheable cs. non-cacheable. However, it
will also prevent write combine as far as I remember.

Ben.


2006-09-11 04:53:44

by Benjamin Herrenschmidt

[permalink] [raw]
Subject: Re: TG3 data corruption (TSO ?)


> Oh, we know about this. The powerpc writel() used to have memory
> barriers in 2.4 kernels but not any more in 2.6 kernels. Red Hat's
> version of tg3 has extra wmb()'s to fix this problem. David doesn't
> think that the upstream version of tg3 should have these wmb()'s, and
> the problem should instead be fixed in powerpc's writel().

I've added a wmb() in tw32_rx_mbox() and tw32_tx_mbox() and can still
reproduce the problem. I've also done a 2 days run without TSO enabled
without a failure (my test program normally fails after a couple of
minutes).

Thus, do you see any other code path in the driver where a
synchronisation might be missing ? Is there any case where the chip
might use data in memory before it has been told to do so with a
mailbox write ? (There are no "OWN" bits that I can see in the
descriptors, thus I doubt it will use a transmit descriptor that is
half-built before the store to the mailbox allows using it) but who
knows....

That leads to the question that there might be an unrelated bug in the
driver. Segher thinks we might be overriding "live" descriptors, though
I haven't seen how yet. It seems to be TSO specific tho... maybe some
missing smp synchronisation in the driver itself or a problem when the
TX ring is full ?

I don't have the chip docs and I'm not familiar with the driver, so I'll
keep looking, but advice is welcome. I'll also see if I can reproduce
with some other TSO capable card, in case the problem is in the kernel
TSO code and not in the driver.

Cheers,
Ben.


2006-09-11 05:19:05

by Michael Chan

[permalink] [raw]
Subject: Re: TG3 data corruption (TSO ?)

Benjamin Herrenschmidt wrote:

> I've added a wmb() in tw32_rx_mbox() and tw32_tx_mbox() and can still
> reproduce the problem. I've also done a 2 days run without TSO enabled
> without a failure (my test program normally fails after a couple of
> minutes).
>

Hi Ben,

The code is a bit tricky. It uses function pointers for the various
register read/write methods. For the 5780, I believe it will be
assigned a simple writel() and not tg3_write32_tx_mbox(). Can you
double check to make sure you have actually added the wmb()?

It's probably easiest to just add the wmb() in tg3_xmit_dma_bug()
before the tw32_tx_mbox().

2006-09-11 05:26:04

by Benjamin Herrenschmidt

[permalink] [raw]
Subject: Re: TG3 data corruption (TSO ?)

On Sun, 2006-09-10 at 22:18 -0700, Michael Chan wrote:
> Benjamin Herrenschmidt wrote:
>
> > I've added a wmb() in tw32_rx_mbox() and tw32_tx_mbox() and can still
> > reproduce the problem. I've also done a 2 days run without TSO enabled
> > without a failure (my test program normally fails after a couple of
> > minutes).
> >
>
> Hi Ben,
>
> The code is a bit tricky. It uses function pointers for the various
> register read/write methods. For the 5780, I believe it will be
> assigned a simple writel() and not tg3_write32_tx_mbox(). Can you
> double check to make sure you have actually added the wmb()?
>
> It's probably easiest to just add the wmb() in tg3_xmit_dma_bug()
> before the tw32_tx_mbox().

I've done:

#define tw32_rx_mbox(reg, val) do { wmb(); tp->write32_rx_mbox(tp, reg, val); } while(0)
#define tw32_tx_mbox(reg, val) do { wmb(); tp->write32_tx_mbox(tp, reg, val); } while(0)

Cheers,
Ben.


2006-09-11 05:33:56

by Michael Chan

[permalink] [raw]
Subject: Re: TG3 data corruption (TSO ?)

Benjamin Herrenschmidt wrote:

> I've done:
>
> #define tw32_rx_mbox(reg, val) do { wmb();
tp->write32_rx_mbox(tp, reg, val); } while(0)
> #define tw32_tx_mbox(reg, val) do { wmb();
tp->write32_tx_mbox(tp, reg, val); } while(0)
>

That should do it.

I think we need those tcpdump after all. Can you send it to me?

2006-09-11 05:52:22

by Benjamin Herrenschmidt

[permalink] [raw]
Subject: Re: TG3 data corruption (TSO ?)

On Sun, 2006-09-10 at 22:33 -0700, Michael Chan wrote:
> Benjamin Herrenschmidt wrote:
>
> > I've done:
> >
> > #define tw32_rx_mbox(reg, val) do { wmb();
> tp->write32_rx_mbox(tp, reg, val); } while(0)
> > #define tw32_tx_mbox(reg, val) do { wmb();
> tp->write32_tx_mbox(tp, reg, val); } while(0)
> >
>
> That should do it.
>
> I think we need those tcpdump after all. Can you send it to me?

Looks like adding a sync to writel does fix it though... I'm trying to
figure out which specific writel in the driver makes a difference. I'll
then look into slicing those tcpdumps.

Ben.


2006-09-11 08:21:06

by Benjamin Herrenschmidt

[permalink] [raw]
Subject: Re: TG3 data corruption (TSO ?)

Ok, it seems like we might have more than just the missing barrier in
TG3. Possibly some IOMMU problems on some machines as well.
Unfortunately, I don't have a tg3 on a PCI-X or PCI-E card to test on a
pSeries or some other machine.

[Olof: I've disabled the new U4 DART invalidate code (reverted to the
old one) and added an unconditional barrier to dart_flush and I yet have
to reproduce the problem. I suspect a problem with the DART invalidate
one thingy, maybe a HW problem with the U4 chip. Now regarding the
barrier in flush, we'll talk about it later, I think we might have a
problem with the way we do the DART accesses (they might leak out of the
lock) though I yet have to see that cause a problem in practice due to
the round-robin nature of our allocation algorithm.]

Ben.


2006-09-11 13:54:32

by Segher Boessenkool

[permalink] [raw]
Subject: Re: TG3 data corruption (TSO ?)

>>> #define tw32_rx_mbox(reg, val) do { wmb();
>> tp->write32_rx_mbox(tp, reg, val); } while(0)
>>> #define tw32_tx_mbox(reg, val) do { wmb();
>> tp->write32_tx_mbox(tp, reg, val); } while(0)
>>>
>>
>> That should do it.
>>
>> I think we need those tcpdump after all. Can you send it to me?
>
> Looks like adding a sync to writel does fix it though... I'm trying to
> figure out which specific writel in the driver makes a difference.
> I'll
> then look into slicing those tcpdumps.

Adding a PowerPC "sync" to every writel() will slow down things by
so much that it could well just hide the problem, not solve it...

Michael, we see this problem only with TSO on, and then the failure
we see is bad data being sent out, with the correct header (but the
header is constructed by the tg3 in this case, so no surprise).

I'm theorizing that this same failure can happen with TSO off as well,
but the header sent on the wire will be bogus as well as the data,
so the receiving NIC won't pick it up. tcpdump probably won't see
it either... will need a low-level ethernet analyser.


Segher

2006-09-11 16:11:00

by Michael Chan

[permalink] [raw]
Subject: Re: TG3 data corruption (TSO ?)

On Mon, 2006-09-11 at 15:52 +1000, Benjamin Herrenschmidt wrote:

> Looks like adding a sync to writel does fix it though... I'm trying to
> figure out which specific writel in the driver makes a difference. I'll
> then look into slicing those tcpdumps.

During runtime in the fast path, the only writel()'s we do in tg3 are to
the tx mbox, rx_mbox, and the interrupt mbox. The interrupt mbox
shouldn't matter that much since it has no dependencies on other memory
writes before it.