2002-06-01 14:02:28

by Kenneth Johansson

Subject: nfs problem 2.4.19-pre9

I have had a problem for some time that processes get stuck in D state
and I now have a way to get this to happen at will.

One way to do this is to copy a file from one NFS-mounted directory to
another. It does not happen within the same mount, nor when copying from
NFS to a local disk. To make this even more complex, it works with cp and
mv but not in mc (Midnight Commander, F6).

The NFS mounts are from the same server and the same disk on that server,
mounted using automount.

linux version is
client 2.4.19-pre9
server 2.4.19-pre8 using reiserfs on a raid1 partition.

Here is a call trace of mc when it is hung.
Trace; c0129875 <__lock_page+a1/c8>
Trace; c01298b1 <lock_page+15/1c>
Trace; c0129faa <do_generic_file_read+28e/464>
Trace; c012a466 <generic_file_read+7e/130>
Trace; c012a360 <file_read_actor+0/88>
Trace; c0173a2d <nfs_file_read+9d/ac>
Trace; c01372f7 <sys_read+8f/100>
Trace; c01088cb <system_call+33/38>

The nfs mount is unusable after this. Every process that uses it enters
D state.

I have attached an strace of mc moving the file /home/ken/3 to
/delta/kernel/.

mc 4.5.55




And now to something completely different.
(please forward to gconf people)

I also found what I would call a serious problem with gconf during the
testing. It goes something like this.

1. Have the home directory on NFS.
2. Pull the network cable while a gconf app is active (nautilus, galeon, ...).
3. Shut down.
4. Insert the network cable and restart.
5. Watch all gconf apps fail to start.

gconf 1.0.9


Attachments:
mc.log (6.24 kB)

2002-06-01 19:09:36

by Trond Myklebust

Subject: Re: nfs problem 2.4.19-pre9

>>>>> " " == Kenneth Johansson <[email protected]> writes:

> I have had a problem for some time that processes get stuck in
> D state and I now have a way to get this to happen at will.

> One way to do this is to copy a file from one nfs mounted
> directory to another. It does not happen on the same mount and
> not when copying from nfs to a local disk. To make this even
> more complex it works with cp and mv but not in mc(midnight
> commander F6 ).

Sounds like a network driver problem or something like that. UDP
appears to trigger these lockups a lot more easily than does TCP.

Try testing with a different brand of networking card...

Cheers,
Trond

2002-06-01 19:39:23

by Kenneth Johansson

Subject: Re: nfs problem 2.4.19-pre9

On Sat, 2002-06-01 at 21:09, Trond Myklebust wrote:
> >>>>> " " == Kenneth Johansson <[email protected]> writes:
>
> > I have had a problem for some time that processes get stuck in
> > D state and I now have a way to get this to happen at will.
>
> > One way to do this is to copy a file from one nfs mounted
> > directory to another. It does not happen on the same mount and
> > not when copying from nfs to a local disk. To make this even
> > more complex it works with cp and mv but not in mc(midnight
> > commander F6 ).
>
> Sounds like a network driver problem or something like that. UDP
> appears to trigger these lockups a lot more easily than does TCP.
>
> Try testing with a different brand of networking card...
>

I have three cards but they are all the same :(
3Com Corporation 3c905B 100BaseTX [Cyclone] (rev 30).

Also, why only this NFS mount? I can still telnet to other computers and
use NFS on another mount point, so it's not like I lose the network.





2002-06-01 19:50:46

by Trond Myklebust

Subject: Re: nfs problem 2.4.19-pre9

>>>>> " " == Kenneth Johansson <[email protected]> writes:

> I have three cards but they are all the same :( 3Com
> Corporation 3c905B 100BaseTX [Cyclone] (rev 30).

> Also, why only this nfs mount? I can still telnet to other
> computers and use nfs on another mount point so it's not like I
> lose the network.

Fair enough. Have you tried a tcpdump?

cheers,
Trond

2002-06-01 20:10:42

by Kenneth Johansson

Subject: Re: nfs problem 2.4.19-pre9

On Sat, 2002-06-01 at 21:50, Trond Myklebust wrote:
> >>>>> " " == Kenneth Johansson <[email protected]> writes:
>
> > I have three cards but they are all the same :( 3Com
> > Corporation 3c905B 100BaseTX [Cyclone] (rev 30).
>
> > Also, why only this nfs mount? I can still telnet to other
> > computers and use nfs on another mount point so it's not like I
> > lose the network.
>
> Fair enough. Have you tried a tcpdump?

No, for the simple reason that I would not know what to look for.

I can send you a trace if you want. I guess you only need a trace from
the first stat until the read fails, but it has to wait an hour or two;
it's not a good time to crash just now.


2002-06-02 10:52:31

by Trond Myklebust

Subject: Re: nfs problem 2.4.19-pre9

>>>>> " " == Kenneth Johansson <[email protected]> writes:

>> Fair enough. Have you tried a tcpdump?

> I can send you a trace if you want. I guess you only need a
> trace from the first stat until the read fails, but it has to
> wait an hour or two; it's not a good time to crash just now.

Problem is very apparent from the tcpdump: your client is only
receiving 2 or 3 out of the 6 UDP fragments in the NFS read
reply from the server. The rest is getting lost en route.

Check out the NFS FAQ on nfs.sourceforge.net. The relevant section is
the bit that asks questions of the form:

1) Are both server and client running on the same speed network
(i.e. are both switched 100Mbit/100Mbit or 10Mbit/10Mbit)?
2) If you are using a switch, are you also using autonegotiation,
or have you forced one or both of the cards (forcing is *bad*
if your switch/hub is autonegotiating)?

Cheers,
Trond
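
Note on the fragment count in the last message: the figure of 6 UDP fragments per NFS read reply is what one would expect if the mount uses an 8192-byte read size over Ethernet with a 1500-byte MTU. Neither value is stated in the thread, so both are assumptions here, as is the rough 128-byte allowance for RPC/NFS reply headers. The sketch below works through that arithmetic and also shows why partial loss hurts so much over UDP: the datagram can only be reassembled if every fragment arrives, so per-fragment loss compounds. The function names are made up for illustration.

```python
import math


def udp_fragment_count(payload_bytes, mtu=1500, ip_header=20, udp_header=8):
    """Number of IP fragments needed to carry one UDP datagram.

    Each fragment carries its own IP header, and the data in every
    fragment except the last must be a multiple of 8 bytes.
    """
    per_fragment = (mtu - ip_header) // 8 * 8
    return math.ceil((payload_bytes + udp_header) / per_fragment)


def datagram_survival(fragments, fragment_loss):
    """Probability that all fragments arrive, assuming independent loss."""
    return (1.0 - fragment_loss) ** fragments


# Assumed values, not taken from the thread: rsize=8192 plus roughly
# 128 bytes of RPC/NFS reply headers.
reply_bytes = 8192 + 128
n = udp_fragment_count(reply_bytes)
print(n)                                   # 6
print(round(datagram_survival(n, 0.05), 3))  # 0.735
```

With these assumptions, even 5% per-fragment loss means roughly one read reply in four cannot be reassembled and the whole request has to be retransmitted. Losing 3 or 4 of every 6 fragments, as in the tcpdump described above, means essentially no reply ever completes.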