2005-02-03 03:43:21

by Ethan Weinstein

Subject: e1000, sshd, and the infamous "Corrupted MAC on input"

Hey all,

I've been having quite a time with the e1000 driver running at gigabit
speeds. Running it at 100Fdx has never been a problem, which I've
done for a long time. Last week I picked up a gigabit switch, and that's
when the trouble began. I find that transferring large amounts of data
using scp invariably ends up with sshd spitting out "Disconnecting:
Corrupted MAC on input." After deciding I must have purchased a bum
switch, I grabbed another model, only to get the same error.
Finally, I used a crossover cable between the two boxes, which resulted
in the same error from sshd again.

Both systems run 2.6.10 with 4k stacks and regparm enabled. System 1
has an onboard Intel 82547EI, system 2 has an onboard Intel 82545EM,
and both have NAPI enabled. Oddly, running the NICs at 100Fdx does not
generate this error no matter how much pressure I put on them. I've
found a lot of scuttlebutt regarding these problems with sshd on the
net, but this appears to be a hardware/driver problem. There's mention
of a specific problem with e1000 here:
http://www.psc.edu/networking/projects/hpn-ssh but no apparent resolution.

Any suggestions are greatly appreciated.

-E


2005-02-03 07:04:27

by Matt Mackall

Subject: Re: e1000, sshd, and the infamous "Corrupted MAC on input"

On Wed, Feb 02, 2005 at 10:44:14PM -0500, Ethan Weinstein wrote:
> Hey all,
>
> I've been having quite a time with the e1000 driver running at gigabit
> speeds. Running it at 100Fdx has never been a problem, which I've
> done for a long time. Last week I picked up a gigabit switch, and that's
> when the trouble began. I find that transferring large amounts of data
> using scp invariably ends up with sshd spitting out "Disconnecting:
> Corrupted MAC on input." After deciding I must have purchased a bum
> switch, I grabbed another model, only to get the same error.
> Finally, I used a crossover cable between the two boxes, which resulted
> in the same error from sshd again.

Well ssh isn't an especially good test as it's hard to debug.

Try transferring large compressed files via netcat and comparing the
results. eg:

host1# nc -l -p 2000 > foo.bz2

host2# nc host1 2000 < foo.bz2

If the md5sums differ, follow up with a cmp -bl to see what changed.

Then we can look at the failure patterns and determine if there's some
data or alignment dependence.
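
To look for that kind of alignment dependence, it can help to reduce the
cmp output to runs of differing bytes and their offsets modulo a few
power-of-two sizes. The following is a minimal sketch in C, not a tool
from this thread: the names diff_run, find_diff_runs and print_diff_runs
are hypothetical, and the moduli 64 and 4096 are assumptions (typical
cache-line and page sizes). It expects both files to be loaded into
memory buffers of equal length.

```c
#include <stdio.h>
#include <stddef.h>

/* Hypothetical helper type: one run of consecutive differing bytes. */
struct diff_run {
    size_t offset;  /* offset of the first differing byte */
    size_t length;  /* number of consecutive differing bytes */
};

/*
 * Scan two buffers of equal length and record up to max_runs runs of
 * differing bytes. Returns the total number of runs found, which may
 * exceed max_runs (the extra runs are counted but not stored).
 */
size_t find_diff_runs(const unsigned char *a, const unsigned char *b,
                      size_t len, struct diff_run *runs, size_t max_runs)
{
    size_t count = 0;
    size_t i = 0;

    while (i < len) {
        if (a[i] == b[i]) {
            i++;
            continue;
        }
        size_t start = i;
        while (i < len && a[i] != b[i])
            i++;
        if (count < max_runs) {
            runs[count].offset = start;
            runs[count].length = i - start;
        }
        count++;
    }
    return count;
}

/* Print each stored run with its offset modulo 64 and 4096 (assumed sizes). */
void print_diff_runs(const struct diff_run *runs, size_t n)
{
    for (size_t i = 0; i < n; i++)
        printf("offset %zu len %zu (mod 64 = %zu, mod 4096 = %zu)\n",
               runs[i].offset, runs[i].length,
               runs[i].offset % 64, runs[i].offset % 4096);
}
```

If most runs start at the same offset modulo 4096, or share a length,
that would point at an alignment-dependent failure rather than random
bit flips. A single corrupted region can show up as several runs when
some of its bytes happen to match the original.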

--
Mathematics is the supreme nostalgia of our time.

2005-02-04 04:15:29

by Ethan Weinstein

Subject: Re: e1000, sshd, and the infamous "Corrupted MAC on input"

Matt Mackall wrote:
> On Wed, Feb 02, 2005 at 10:44:14PM -0500, Ethan Weinstein wrote:
...
>>Finally, I used a crossover cable between the two boxes, which resulted
>>in the same error from sshd again.
>
>
> Well ssh isn't an especially good test as it's hard to debug.
>
> Try transferring large compressed files via netcat and comparing the
> results. eg:
>
> host1# nc -l -p 2000 > foo.bz2
>
> host2# nc host1 2000 < foo.bz2
>
> If the md5sums differ, follow up with a cmp -bl to see what changed.
>
> Then we can look at the failure patterns and determine if there's some
> data or alignment dependence.
>

Excellent tip, thanks. I was able to reproduce the problem several times
using this technique with nc, however the problem was intermittent (as
nasty problems like this often are). I used a 1.3G gzipped tarball and
experienced several botched transfers along with a few good ones. To
be fair, I also switched back to 100Fdx and repeated; I didn't get a
single failure at this speed over 25 or so runs.

The results of two cmp's are here:

http://www.stinkfoot.org/e1000tests.out

What next?

-Ethan

2005-02-04 05:09:15

by Matt Mackall

Subject: Re: e1000, sshd, and the infamous "Corrupted MAC on input"

On Thu, Feb 03, 2005 at 11:16:37PM -0500, Ethan Weinstein wrote:
> Matt Mackall wrote:
> >On Wed, Feb 02, 2005 at 10:44:14PM -0500, Ethan Weinstein wrote:
> ...
> >>Finally, I used a crossover cable between the two boxes, which resulted
> >>in the same error from sshd again.
> >
> >
> >Well ssh isn't an especially good test as it's hard to debug.
> >
> >Try transferring large compressed files via netcat and comparing the
> >results. eg:
> >
> >host1# nc -l -p 2000 > foo.bz2
> >
> >host2# nc host1 2000 < foo.bz2
> >
> >If the md5sums differ, follow up with a cmp -bl to see what changed.
> >
> >Then we can look at the failure patterns and determine if there's some
> >data or alignment dependence.
> >
>
> Excellent tip, thanks. I was able to reproduce the problem several times
> using this technique with nc, however the problem was intermittent (as
> nasty problems like this often are). I used a 1.3G gzipped tarball and
> experienced several botched transfers along with a few good ones. To
> be fair, I also switched back to 100Fdx and repeated; I didn't get a
> single failure at this speed over 25 or so runs.
>
> The results of two cmp's are here:
>
> http://www.stinkfoot.org/e1000tests.out
>
> What next?

Ok, reproducible without ssh makes narrowing this down much easier.
Are you seeing errors on the interface? If not, that would indicate
problems after CRC checking on the receive side. Do errors happen in
both directions? If not, it may be CPU speed-related or specific to a
given NIC - swap them if they're not onboard.

The next test is to send patterns. Try sending yourself a gigabyte of:

#include <stdio.h>

int main(void)
{
    int i;

    /* 0x10000000 counters of 4 bytes each (32-bit int) = 1 GiB */
    for (i = 0; i < 0x10000000; i++) {
        fwrite(&i, sizeof(i), 1, stdout);
    }

    return 0;
}

If there's some sort of partial DMA transfer going on, this should
make it evident.
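
On the receiving side, the counter stream can also be checked directly
instead of comparing files. A minimal sketch, assuming both hosts share
the same byte order and a 32-bit int (true for two x86 boxes);
first_bad_word is a hypothetical helper name, not part of any existing
tool:

```c
#include <stdint.h>
#include <stddef.h>
#include <string.h>

/*
 * Check that buf holds nwords consecutive 32-bit counters in host byte
 * order, starting at the value `first`. Returns the index of the first
 * word that does not match, or nwords if the whole buffer is intact.
 */
size_t first_bad_word(const unsigned char *buf, size_t nwords, uint32_t first)
{
    for (size_t i = 0; i < nwords; i++) {
        uint32_t got;

        /* memcpy avoids unaligned access if buf is not 4-byte aligned. */
        memcpy(&got, buf + 4 * i, sizeof got);
        if (got != first + (uint32_t)i)
            return i;
    }
    return nwords;
}
```

Call it on each chunk read from the socket or file, passing the counter
value expected at the start of that chunk. The returned index times 4
gives the byte offset of the first damaged word within the chunk.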

--
Mathematics is the supreme nostalgia of our time.

2005-02-04 05:55:17

by Anthony DiSante

Subject: Re: Has anyone dumped udev for devfs?

Kevin Fries wrote:
> Any ETA on when udev is going to be ready for prime time? And, any
> clue why Fedora insists on relying on a program that does not f*(&%ing
> work!!!!
>
> I am trying to get a Microtek X12 USL scanner attached, and udev fails
> to mount it, every time. Has anyone tried uninstalling udev and
> reinstalling devfs to stop all these damn usb failures?
>
> If so, any hints on how not to make your system unstable?
>
> TIA
> Kevin Fries

I haven't gone back to devfs, but I feel your pain. udev+hal worked fine
for a couple months, until hald started intermittently locking up. Now I
can't go 2 days without a reboot, because hald so often goes into
"uninterruptible sleep" and is totally unkillable. I've upgraded udev, hal,
and my kernel a bunch of times, but nothing has fixed this. And it's not a
single piece of hardware; sometimes it's USB, sometimes Firewire, sometimes
a CDROM, that causes hald to take a nap, permanently.

-Anthony DiSante
http://nodivisions.com/

2005-02-04 06:03:52

by Willy Tarreau

Subject: Re: e1000, sshd, and the infamous "Corrupted MAC on input"

Hi,

On Thu, Feb 03, 2005 at 11:16:37PM -0500, Ethan Weinstein wrote:
(...)
> Excellent tip, thanks. I was able to reproduce the problem several times
> using this technique with nc, however the problem was intermittent (as
> nasty problems like this often are). I used a 1.3G gzipped tarball and
> experienced several botched transfers along with a few good ones. To
> be fair, I also switched back to 100Fdx and repeated; I didn't get a
> single failure at this speed over 25 or so runs.
>
> The results of two cmp's are here:
>
> http://www.stinkfoot.org/e1000tests.out
>
> What next?

I would disable rx/tx checksum offload on the cards to rule out a bug
in that part. One explanation for what you're seeing is that some frames
get corrupted at gigabit speed (possibly on one of the cards themselves),
and the receiving card either computes the checksum incorrectly or
ignores it when it's bad.

IIRC, you can do this with ethtool (substitute your interface for eth0):

# ethtool -K eth0 rx off tx off
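
For reference, the checksum that rx/tx offload hands to the NIC is the
16-bit ones' complement Internet checksum described in RFC 1071. The
following is a minimal software sketch of that sum, not the kernel's or
the e1000 driver's code; inet_checksum is a hypothetical name, and the
32-bit accumulator assumes packet-sized buffers.

```c
#include <stdint.h>
#include <stddef.h>

/* 16-bit ones' complement Internet checksum over len bytes of data. */
uint16_t inet_checksum(const unsigned char *data, size_t len)
{
    uint32_t sum = 0;

    /* Add up the data as big-endian 16-bit words. */
    while (len > 1) {
        sum += ((uint32_t)data[0] << 8) | data[1];
        data += 2;
        len -= 2;
    }

    /* A trailing odd byte is treated as the high byte of a final word. */
    if (len == 1)
        sum += (uint32_t)data[0] << 8;

    /* Fold the carries back into the low 16 bits. */
    while (sum >> 16)
        sum = (sum & 0xffff) + (sum >> 16);

    return (uint16_t)~sum;
}
```

Because the sum is order-independent, two 16-bit words that trade places
produce the same checksum, so even a correctly computed checksum cannot
catch every kind of corruption. Comparing whole files, as above, remains
the stronger test.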

Willy

2005-02-05 04:52:21

by Ethan Weinstein

Subject: Re: e1000, sshd, and the infamous "Corrupted MAC on input"

Matt Mackall wrote:
>
> Ok, reproducible without ssh makes narrowing this down much easier.
> Are you seeing errors on the interface? If not, that would indicate
> problems after CRC checking on the receive side. Do errors happen in
> both directions? If not, it may be CPU speed-related or specific to a
> given NIC - swap them if they're not onboard.
>
> The next test is to send patterns. Try sending yourself a gigabyte of:
>
> #include <stdio.h>
>
> int main(void)
> {
>     int i;
>
>     for (i = 0; i < 0x10000000; i++) {
>         fwrite(&i, 4, 1, stdout);
>     }
> }
>
> If there's some sort of partial DMA transfer going on, this should
> make it evident.
>

No errors reported on either interface.

Interesting results, though only in one direction. It seems highly likely
the problem is only with the 82545EM, as I couldn't get a botched
transfer FROM it to the 82547EI after 20 or so attempts (both of these
are onboard, unfortunately, so no swapping). Several transfers TO it did
yield bad files, though (using my big 1.6G gzipped tarball).

Now, on to the patterns. I didn't get a _single_ failure in either
direction using the output of that code snippet in over 20 attempts.
Perhaps we're failing on larger amounts of more complex data?
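
One way to test that theory while still knowing what every byte should
be is a seeded pseudo-random stream that both ends can regenerate. This
is only a sketch under stated assumptions: it uses Marsaglia's xorshift32
in place of the counter loop, and fill_random and verify_random are
hypothetical names. The seed must be nonzero and identical on both hosts.

```c
#include <stdint.h>
#include <stddef.h>

/* One step of the xorshift32 generator; *state must be nonzero. */
static uint32_t xorshift32(uint32_t *state)
{
    uint32_t x = *state;

    x ^= x << 13;
    x ^= x >> 17;
    x ^= x << 5;
    *state = x;
    return x;
}

/* Fill buf with len pseudo-random bytes, advancing *state. */
void fill_random(unsigned char *buf, size_t len, uint32_t *state)
{
    for (size_t i = 0; i < len; i++)
        buf[i] = (unsigned char)(xorshift32(state) >> 24);
}

/*
 * Compare buf against the expected stream. Returns the offset of the
 * first mismatching byte, or len if the buffer is intact. The state is
 * always advanced by len steps so later chunks stay in sync.
 */
size_t verify_random(const unsigned char *buf, size_t len, uint32_t *state)
{
    size_t bad = len;

    for (size_t i = 0; i < len; i++) {
        unsigned char want = (unsigned char)(xorshift32(state) >> 24);

        if (buf[i] != want && bad == len)
            bad = i;
    }
    return bad;
}
```

The sender would write chunks produced by fill_random to stdout and pipe
them through nc, and the receiver would run verify_random over each chunk
it reads, starting from the same seed.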

-Ethan