2005-01-21 20:47:03

by Denis Zaitsev

[permalink] [raw]
Subject: [BUG] Onboard Ethernet Pro 100 on a SMP box: a very strange errors

The long story is:

There is a Dual-processor Intel Server Board STL2 with two P-III/800
and an onboard Intel 82557-based ethernet card. The box has all the
/usr and nearly all of the /var filesystems mounted over NFS. And the
box works for months without any problems around the NFS. So, I think
that the ethernet card just works fine.

But I have some enigmatic problems when I copying _some_ files from an
NFS to the local fs: the process is freezes on the middle.

1) Only _some_ files can't be copied. There are:

gcc-testsuite-3.4-20041217.tar.bz2

krb5-1.3.6-signed.tar

X430src-1.tgz

They are the well-known sources from the well-known ftp and web
places. And I don't think that it's the full list, just the files
for which I have met the problem.

2) Only _these_ files can't be copied. Any other is copied plainly.

3) These files _never_ can be copied.

4) The copy process always freezes at the same place (per file - the
each file has its own place).

In short: it's a list of files, on which the copying is always freezes
and always freezes exactly the same way. And there are no any
exception - I have freezeng each time.

The freezing is forever. The freezed process is in D state, its
/proc/PID/wchan contains page_sync. Each such process eats 1.0 from
/proc/loadavg. And the process can't be killed by any signal.

Then, copying by dd bs=1024 ... just succeeds. After that cp succeeds
too - I think it's because of caching.

Then, there is no visual correlation with the size of the file. So,
it seems that the content of the file is involved... But it is
enigmatic.

The NIC works fine all the other time, so there no suspicions about
hardware problems.

The other NIC - 3C905 PCI external card doesn't show the problem - all
the files are just copied.

An either driver for Ethernet Pro 100 - e100 or eepro100 - show the
same result, but eepro100 logs periodicaly:

eth0: wait_for_cmd_done timeout!

e100 logs nothing.

So, it doesn't ever look like a driver bug...

The kernels tested: 2.6.8.1, 2.6.9, 2.6.10.

GLIBC used: 2.3.2.


-------------------------------------------------------
This SF.Net email is sponsored by: IntelliVIEW -- Interactive Reporting
Tool for open source databases. Create drag-&-drop reports. Save time
by over 75%! Publish reports on the web. Export to DOC, XLS, RTF, etc.
Download a FREE copy at http://www.intelliview.com/go/osdn_nl
_______________________________________________
NFS maillist - [email protected]
https://lists.sourceforge.net/lists/listinfo/nfs


2005-01-26 19:11:25

by Chris Leech

[permalink] [raw]
Subject: Re: [BUG] Onboard Ethernet Pro 100 on a SMP box: a very strange errors

On Wed, 26 Jan 2005 07:37:49 +0500, Denis Zaitsev <[email protected]> wrote:
> On Tue, Jan 25, 2005 at 11:19:11PM +0200, Denis Vlasenko wrote:
> >
> > Something corrupts packets. It can be sending NIC, switch in the middle,
> > or receiving NIC.
>
> Changing the receiving card closes the question. Doesn't it matter?

Take a look at this technical advisory for the STL2 board, I'm fairly
sure this is your issue.
http://support.intel.com/support/motherboards/server/sb/CS-010790.htm

IP fragmented packets as part of an NFS file transfer are improperly
being identified as TCO management packets and are being routed to the
BMC. The fix is to upgrade the BMC firmware.

Updated BMC firmware and BIOS images for this board are available at
http://support.intel.com/support/motherboards/server/STL2/sb/CS-007118.htm

- Chris

2005-01-24 20:15:43

by Dave van Leeuwen

[permalink] [raw]
Subject: Re: [NFS] [BUG] Onboard Ethernet Pro 100 on a SMP box: a very strange errors

Hi Denis,
I bought a Intel STL2 mobo based machine a few years ago. I'm not sure
what the network controller on it is, as it is turned off. I had a
problem where the card works fine until it gets loaded. Then things went
pear shaped. There was a bios upgrade to fix this problem. I never
installed the bios patch as I had spare network cards.
You may find that you are just reaching the point where the card fails to
work correctly.

Hope this helps,
Dave.

Dave van Leeuwen
Analyst Programmer
University of Canterbury
New Zealand

On Sat, 22 Jan 2005, Denis Zaitsev wrote:

> The long story is:
>
> There is a Dual-processor Intel Server Board STL2 with two P-III/800
> and an onboard Intel 82557-based ethernet card. The box has all the
> /usr and nearly all of the /var filesystems mounted over NFS. And the
> box works for months without any problems around the NFS. So, I think
> that the ethernet card just works fine.
>
> But I have some enigmatic problems when I copying _some_ files from an
> NFS to the local fs: the process is freezes on the middle.
>
> 1) Only _some_ files can't be copied. There are:
>
> gcc-testsuite-3.4-20041217.tar.bz2
>
> krb5-1.3.6-signed.tar
>
> X430src-1.tgz
>
> They are the well-known sources from the well-known ftp and web
> places. And I don't think that it's the full list, just the files
> for which I have met the problem.
>
> 2) Only _these_ files can't be copied. Any other is copied plainly.
>
> 3) These files _never_ can be copied.
>
> 4) The copy process always freezes at the same place (per file - the
> each file has its own place).
>
> In short: it's a list of files, on which the copying is always freezes
> and always freezes exactly the same way. And there are no any
> exception - I have freezeng each time.
>
> The freezing is forever. The freezed process is in D state, its
> /proc/PID/wchan contains page_sync. Each such process eats 1.0 from
> /proc/loadavg. And the process can't be killed by any signal.
>
> Then, copying by dd bs=1024 ... just succeeds. After that cp succeeds
> too - I think it's because of caching.
>
> Then, there is no visual correlation with the size of the file. So,
> it seems that the content of the file is involved... But it is
> enigmatic.
>
> The NIC works fine all the other time, so there no suspicions about
> hardware problems.
>
> The other NIC - 3C905 PCI external card doesn't show the problem - all
> the files are just copied.
>
> An either driver for Ethernet Pro 100 - e100 or eepro100 - show the
> same result, but eepro100 logs periodicaly:
>
> eth0: wait_for_cmd_done timeout!
>
> e100 logs nothing.
>
> So, it doesn't ever look like a driver bug...
>
> The kernels tested: 2.6.8.1, 2.6.9, 2.6.10.
>
> GLIBC used: 2.3.2.
>
>
> -------------------------------------------------------
> This SF.Net email is sponsored by: IntelliVIEW -- Interactive Reporting
> Tool for open source databases. Create drag-&-drop reports. Save time
> by over 75%! Publish reports on the web. Export to DOC, XLS, RTF, etc.
> Download a FREE copy at http://www.intelliview.com/go/osdn_nl
> _______________________________________________
> NFS maillist - [email protected]
> https://lists.sourceforge.net/lists/listinfo/nfs
>


2005-01-25 21:19:11

by Denis Vlasenko

[permalink] [raw]
Subject: Re: [BUG] Onboard Ethernet Pro 100 on a SMP box: a very strange errors

On Friday 21 January 2005 22:46, Denis Zaitsev wrote:
> The long story is:
>
> There is a Dual-processor Intel Server Board STL2 with two P-III/800
> and an onboard Intel 82557-based ethernet card. The box has all the
> /usr and nearly all of the /var filesystems mounted over NFS. And the
> box works for months without any problems around the NFS. So, I think
> that the ethernet card just works fine.
>
> But I have some enigmatic problems when I copying _some_ files from an
> NFS to the local fs: the process is freezes on the middle.
>
> 1) Only _some_ files can't be copied. There are:
>
> gcc-testsuite-3.4-20041217.tar.bz2
>
> krb5-1.3.6-signed.tar
>
> X430src-1.tgz
>
> They are the well-known sources from the well-known ftp and web
> places. And I don't think that it's the full list, just the files
> for which I have met the problem.
>
> 2) Only _these_ files can't be copied. Any other is copied plainly.
>
> 3) These files _never_ can be copied.
>
> 4) The copy process always freezes at the same place (per file - the
> each file has its own place).
>
> In short: it's a list of files, on which the copying is always freezes
> and always freezes exactly the same way. And there are no any
> exception - I have freezeng each time.
>
> The freezing is forever. The freezed process is in D state, its
> /proc/PID/wchan contains page_sync. Each such process eats 1.0 from
> /proc/loadavg. And the process can't be killed by any signal.
>
> Then, copying by dd bs=1024 ... just succeeds. After that cp succeeds
> too - I think it's because of caching.

Because this causes different packets to be sent.

> Then, there is no visual correlation with the size of the file. So,
> it seems that the content of the file is involved... But it is
> enigmatic.

Something corrupts packets. It can be sending NIC, switch in the middle,
or receiving NIC.

Do 'tcpdump -nliethN -s0 -Xx udp >log' on both ends while you do the copying
and compare last dozen of packets.
--
vda

2005-01-26 02:37:49

by Denis Zaitsev

[permalink] [raw]
Subject: Re: [BUG] Onboard Ethernet Pro 100 on a SMP box: a very strange errors

On Tue, Jan 25, 2005 at 11:19:11PM +0200, Denis Vlasenko wrote:
>
> Something corrupts packets. It can be sending NIC, switch in the middle,
> or receiving NIC.

Changing the receiving card closes the question. Doesn't it matter?