2003-03-21 10:26:15

by Daniel Pittman

Subject: Linux <-> Linux NFS issues.

I am sharing a large volume of data between two Linux machines using NFS v3,
and have a reliability problem with it: writes on the client machine
occasionally fail.


The server is running:
Linux gavroche 2.5.64 #2 Fri Mar 7 20:11:43 EST 2003

The client is running:
Linux anu 2.5.64 #6 Mon Mar 17 19:08:37 EST 2003

Both are .64 + a few csets after that, but the problem was evident with
.62 or so as well. I am not confident enough to say it started then,
though, as I believe it happened with an earlier version on the client.


The client machine reports, in dmesg:

NFS: server cheating in read reply: count 4096 > recvd 1000

The 'count' value is occasionally higher, but not often, and the 'recvd'
never seems to differ from 1000.
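
To check whether 'recvd' really is always 1000, the two numbers can be pulled out of each logged line and tallied. The following is only a minimal sketch in C: the helper name parse_cheating_line is hypothetical, and it assumes the message text is exactly as quoted above (optionally preceded by a log prefix such as a timestamp).

```c
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper, not part of the kernel or any library.
 * Extracts 'count' and 'recvd' from a log line of the form
 *   NFS: server cheating in read reply: count 4096 > recvd 1000
 * Any text before "NFS:" (for example a timestamp) is skipped.
 * Returns 1 if both numbers were parsed, 0 otherwise.
 */
int parse_cheating_line(const char *line, unsigned long *count,
                        unsigned long *recvd)
{
    const char *msg = strstr(line, "NFS: server cheating in read reply:");

    if (msg == NULL)
        return 0;

    return sscanf(msg,
                  "NFS: server cheating in read reply: count %lu > recvd %lu",
                  count, recvd) == 2;
}
```

Feeding each dmesg line through this helper and counting the distinct 'recvd' values would confirm, or refute, that it never differs from 1000.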


On the client, a write or close syscall returns an error, specifically
'Input/output error' (as printed by perror, so errno is EIO).
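
For illustration, this is roughly how an application can notice such a failure. On NFS a write error may only be reported later, at fsync() or close(), so all three return values need checking. This is a minimal sketch, not the encoder's actual code: the function names write_all and save_buffer are hypothetical, and only standard POSIX calls are used.

```c
#include <errno.h>
#include <fcntl.h>
#include <stddef.h>
#include <unistd.h>

/*
 * Hypothetical helper: write the whole buffer, retrying after short
 * writes and EINTR. Returns 0 on success or the errno value (for
 * example EIO) on failure.
 */
int write_all(int fd, const void *buf, size_t len)
{
    const char *p = buf;

    while (len > 0) {
        ssize_t n = write(fd, p, len);

        if (n < 0) {
            if (errno == EINTR)
                continue;       /* interrupted, try again */
            return errno;       /* real failure, e.g. EIO */
        }
        p += n;
        len -= (size_t)n;
    }
    return 0;
}

/*
 * Hypothetical helper: write buf to path, then fsync and close,
 * returning 0 or the first errno value seen. Deferred write errors
 * can surface at fsync() or close(), so neither result is ignored.
 */
int save_buffer(const char *path, const void *buf, size_t len)
{
    int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    int err;

    if (fd < 0)
        return errno;

    err = write_all(fd, buf, len);
    if (err == 0 && fsync(fd) < 0)
        err = errno;
    if (close(fd) < 0 && err == 0)
        err = errno;

    return err;
}
```

A caller would compare the result against EIO to tell this failure apart from, say, a full disk (ENOSPC).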


This usually happens somewhere between one and ten hours through the
encoding of a set of DV files to MPEG2, and at different spots in the
process.


Searching the archives shows that this problem has been seen before, but
did not turn up a solution.

Daniel

--
Specialized meaninglessness has come to be regarded,
in some circles, as a kind of hallmark of true science.
-- Aldous Huxley


2003-03-21 12:21:37

by Dave Jones

Subject: Re: Linux <-> Linux NFS issues.

On Fri, Mar 21, 2003 at 09:37:13PM +1100, Daniel Pittman wrote:

> The client machine reports, in dmesg:
> NFS: server cheating in read reply: count 4096 > recvd 1000
> The 'count' value is occasionally higher, but not often, and the 'recvd'
> never seems to differ from 1000.

When I was last seeing this, there was also a lot of 'crap' packets
on the wire, with bogus header lengths etc (some of which were so
malformed they broke ethereal).

I've not retried any NFS tests since 2.5.60. It sounds like the problem
is still there, so I'll do some more investigation soon.

Dave

2003-03-21 12:41:28

by Trond Myklebust

Subject: Re: Linux <-> Linux NFS issues.

>>>>> " " == Dave Jones writes:

> On Fri, Mar 21, 2003 at 09:37:13PM +1100, Daniel Pittman wrote:
>> The client machine reports, in dmesg:
>> NFS: server cheating in read reply: count 4096 > recvd 1000
>> The 'count' value is occasionally higher, but not often, and the
>> 'recvd' never seems to differ from 1000.

> When I was last seeing this, there was also a lot of 'crap'
> packets on the wire, with bogus header lengths etc (some of
> which were so malformed they broke ethereal).

> I've not retried any NFS tests since 2.5.60, sounds like the
> problem is still there, so I'll do some more investigation
> soon.

Dave,

Are you seeing bogus packets from both the 2.5.x client and the
server, or is it just the server (or just the client)?
It could also be interesting to find out if this is a UDP only
problem, or if it occurs with TCP too...

Cheers,
Trond

2003-03-21 12:46:56

by Dave Jones

Subject: Re: Linux <-> Linux NFS issues.

On Fri, Mar 21, 2003 at 01:52:08PM +0100, Trond Myklebust wrote:

> Are you seeing bogus packets from both the 2.5.x client and the
> server, or is it just the server (or just the client)?
> It could also be interesting to find out if this is a UDP only
> problem, or if it occurs with TCP too...

To be honest, I've forgotten what the exact issues were.
I'll rerun some tests this afternoon, and post the results.
From what I do recall, it was fairly simple to trigger,
an fsx run made 'bad shit' happen within a minute or two.

Dave