2002-02-03 20:23:15

by Burján Gábor

Subject: 2.4.17 NFS hangup

Hello,

I have a reproducible problem with the 2.4.17 kernel's NFS client after
netbooting an RS/6000 (ppc architecture).

Immediately after boot:

partvis:/tmp# dd if=/dev/zero of=blah1 count=1
1+0 records in
1+0 records out
partvis:/tmp#

partvis:/tmp# dd if=/dev/zero of=blah2 count=2
2+0 records in
2+0 records out
nfs: server 157.181.150.31 not responding, still trying
nfs: server 157.181.150.31 not responding, still trying
nfs: task 913 can't get a request slot
... and so on

Relevant tcpdump output:

20:41:40.927855 heron.elte.hu.nfs > partvis.elte.hu.3648238371: reply ok 28 lookup ERROR: No such file or directory (DF)
20:41:40.928622 partvis.elte.hu.3648238372 > heron.elte.hu.nfs: 148 create [|nfs] (DF)
20:41:40.929271 heron.elte.hu.nfs > partvis.elte.hu.3648238372: reply ok 128 create [|nfs] (DF)
20:41:40.930655 partvis.elte.hu.3648238373 > heron.elte.hu.nfs: 100 getattr [|nfs] (DF)
20:41:40.930976 heron.elte.hu.nfs > partvis.elte.hu.3648238373: reply ok 96 getattr REG 100644 ids 0

However, reading works without any problems. Full tcpdump output from
poweron: http://www.csoma.elte.hu/~burjang/nfs-tcpdump-20010203.out.gz

buga


2002-02-03 21:07:07

by Trond Myklebust

Subject: Re: 2.4.17 NFS hangup

>>>>> Burjan Gabor writes:

> 20:41:40.927855 heron.elte.hu.nfs > partvis.elte.hu.3648238371: reply ok 28 lookup ERROR: No such file or directory (DF)
> 20:41:40.928622 partvis.elte.hu.3648238372 > heron.elte.hu.nfs: 148 create [|nfs] (DF)
> 20:41:40.929271 heron.elte.hu.nfs > partvis.elte.hu.3648238372: reply ok 128 create [|nfs] (DF)
> 20:41:40.930655 partvis.elte.hu.3648238373 > heron.elte.hu.nfs: 100 getattr [|nfs] (DF)
> 20:41:40.930976 heron.elte.hu.nfs > partvis.elte.hu.3648238373: reply ok 96 getattr REG 100644 ids 0

Nothing abnormal there or in your file. However, when you start
getting 'server not responding' messages and see no tcpdump output,
it's usually a sign that the networking layer has given up on you. Any
strange output from 'netstat -s'?

It would be useful to know what network card/driver combination you
are using. Any firewalls/netfilter setups? Any special mount options?

Cheers,
Trond

2002-02-03 21:34:39

by Burján Gábor

Subject: Re: 2.4.17 NFS hangup

Hello,

On Sun, Feb 03, Trond Myklebust wrote:

> Nothing abnormal there or in your file. However, when you start
> getting 'server not responding' messages and see no tcpdump output,
> it's usually a sign that the networking layer has given up on you. Any
> strange output from 'netstat -s'?

Output is here: http://www.csoma.elte.hu/~burjang/netstat-s-20020203.out

I think `1710 reassemblies required' may be strange after boot...
How can I figure out what causes this?

> It would be useful to know what network card/driver combination you
> are using. Any firewalls/netfilter setups? Any special mount options?

eth0: PCnet/PCI II 79C970A at 0x1020, 08 00 5a f8 82 e7
pcnet32: pcnet32_private lp=c0591000 lp_dma_addr=0x80591000 assigned IRQ 15.
pcnet32.c:v1.25kf 17.11.2001 [email protected]

(this card is an integrated AMD pcnet32 in a 43P-140)

There are no firewalls or packet filters. I didn't specify any
special mount options for nfs:

partvis:~$ cat /proc/mounts
/dev/root / nfs rw,v2,rsize=4096,wsize=4096,hard,udp,nolock,addr=157.181.150.31 0 0
proc /proc proc rw 0 0
partvis:~$
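
For anyone comparing setups, the mount line above can be picked apart with a few lines of Python. This is only a sketch: `parse_mount_line` is a made-up helper name, and it assumes the usual six whitespace-separated fields of a /proc/mounts entry with no escaped spaces in the paths.

```python
def parse_mount_line(line):
    """Split one /proc/mounts line into its fields and an options dict."""
    # Six fields: device, mount point, fs type, options, dump, pass.
    device, mountpoint, fstype, options, _dump, _passno = line.split()
    parsed = {}
    for opt in options.split(","):
        key, sep, value = opt.partition("=")
        # Flags such as "hard" or "udp" carry no value; record them as True.
        parsed[key] = value if sep else True
    return {
        "device": device,
        "mountpoint": mountpoint,
        "fstype": fstype,
        "options": parsed,
    }


# The root mount line quoted in this message.
root = parse_mount_line(
    "/dev/root / nfs rw,v2,rsize=4096,wsize=4096,hard,udp,nolock,"
    "addr=157.181.150.31 0 0"
)
print(root["options"]["rsize"], root["options"]["wsize"])
print("transport is UDP:", root["options"].get("udp", False))
```

The point of interest is that the mount is NFSv2 over UDP with 4096-byte read and write sizes, on a hard mount.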

buga

2002-02-03 21:45:41

by Alan

Subject: Re: 2.4.17 NFS hangup

> Output is here: http://www.csoma.elte.hu/~burjang/netstat-s-20020203.out
>
> I think `1710 reassemblies required' may be strange after boot...
> How can I figure out what causes this?

NFS uses large packets.
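
A rough back-of-the-envelope sketch of why that leads to reassembly, assuming a standard 1500-byte Ethernet MTU, a 20-byte IPv4 header without options, and a guessed 128-byte allowance for RPC/NFS headers (the real overhead varies per call). `ip_fragments` is a hypothetical helper, not anything from the kernel.

```python
import math

ETHERNET_MTU = 1500   # assumption: standard Ethernet MTU
IP_HEADER = 20        # IPv4 header without options
UDP_HEADER = 8
RPC_OVERHEAD = 128    # assumption: rough allowance for RPC/NFS headers


def ip_fragments(nfs_payload, mtu=ETHERNET_MTU, rpc_overhead=RPC_OVERHEAD):
    """Estimate how many IP fragments one NFS-over-UDP message needs."""
    # Everything carried inside the IP datagram: UDP header, RPC headers, data.
    datagram = UDP_HEADER + rpc_overhead + nfs_payload
    # Fragment payloads (except the last) must be a multiple of 8 bytes.
    per_fragment = (mtu - IP_HEADER) // 8 * 8
    return math.ceil(datagram / per_fragment)


for size in (1024, 4096, 8192):
    print(size, "bytes of data ->", ip_fragments(size), "fragment(s)")
```

Under these assumptions a 4096-byte transfer (the rsize/wsize in the mount above) takes 3 fragments and an 8192-byte transfer takes 6, so every full-sized read reply has to be reassembled on the client.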

2002-02-03 22:45:09

by Trond Myklebust

Subject: Re: 2.4.17 NFS hangup


Hmm... pcnet32.c seems to engage in some dubious practices. Look for
instance at the way it can call pcnet32_restart() from within the
interrupt handler.

Are you seeing any kernel log messages about 'Tx FIFO error!' that
might indicate that particular code is getting triggered?

Cheers,
Trond

2002-02-03 23:01:00

by Burján Gábor

Subject: Re: 2.4.17 NFS hangup

On Sun, Feb 03, Trond Myklebust wrote:

> Are you seeing any kernel log messages about 'Tx FIFO error!' that
> might indicate that particular code is getting triggered?

No, nothing logged except the NFS related messages. However, after NFS
hangup I cannot scp from the host, but ssh works... I am beginning to
think that this is not an NFS issue. Then what could it be?

buga

2002-02-04 13:22:26

by Athanasius

Subject: Re: 2.4.17 NFS hangup

On Mon, Feb 04, 2002 at 12:00:30AM +0100, Burján Gábor wrote:
> On Sun, Feb 03, Trond Myklebust wrote:
>
> > Are you seeing any kernel log messages about 'Tx FIFO error!' that
> > might indicate that particular code is getting triggered?
>
> No, nothing logged except the NFS related messages. However, after NFS
> hangup I cannot scp from the host, but ssh works... I am beginning to
> think that this is not an NFS issue. Then what could it be?

I'm seeing something like this as well. Two machines using
BNC/thinwire (yes, I know, waiting on finances to make this better), 2
other machines on the same segment. I use an NFS mount from the server
(jimblewix) on the workstation (emelia) for amongst other things playing
mp3s.
Machine specs:

SERVER
PII-400 @400MHz
384MB PC100 SDRAM
eth0: NE2000 (ISA) <--- internal interface
eth1: 3com509b <--- external interface, NFS traffic NOT on this
Linux jimblewix 2.4.17 #7 Sat Jan 5 16:15:44 GMT 2002 i686 unknown

WORKSTATION
AMD Athlon XP 1600+ 1.4GHz, not overclocked
512MB PC2100 DDR
eth0: NE2000 (PCI eth0: NetVin NV5000SC found at 0xdc00, IRQ 11,
00:40:95:45:91:38.)
Linux emelia 2.4.18-pre7 #3 Thu Jan 31 07:07:48 GMT 2002 i686 unknown
ALSO on 2.4.17

Repeatedly I'll have xmms stop playing an mp3 mid-file due to NFS
timeouts. I have the same problem cp'ing large files over the NFS
mounts as well. Currently these are soft mounts. If I change them to
hard mounts, then instead of an I/O error on that file and control
coming back, the app just locks hard in D state until a reboot.

NFS mounts on the WORKSTATION (as reported by mount):

192.168.0.162:/home/users on /home/users type nfs (rw,nosuid,nodev,nolock,rsize=8192,wsize=8192,soft,intr,addr=192.168.0.162)
192.168.0.162:/usr/local on /export/miggy-1/usr-local type nfs (rw,nosuid,nodev,rsize=8192,wsize=8192,soft,intr,addr=192.168.0.162)
192.168.0.162:/other on /other type nfs (rw,nosuid,nodev,rsize=8192,wsize=8192,soft,intr,addr=192.168.0.162)

That last one is usually where I'm doing the big cp'ing to/from.

I've just had the problem twice whilst typing this email:

Feb 4 13:07:31 emelia kernel: nfs: server 192.168.0.162 not responding, timed out
Feb 4 13:07:52 emelia last message repeated 2 times
Feb 4 13:12:17 emelia kernel: nfs: server 192.168.0.162 not responding, timed out
Feb 4 13:12:38 emelia last message repeated 2 times

<NOTHING in /var/log/kern.log on jimblewix>
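
To tally how often this happens, something like the following sketch could be run over kern.log. It assumes syslog's "last message repeated N times" lines refer to the timeout message directly before them; `count_nfs_timeouts` is a hypothetical helper name.

```python
import re

REPEAT = re.compile(r"last message repeated (\d+) times")


def count_nfs_timeouts(lines):
    """Count NFS 'not responding' events, expanding syslog repeat lines."""
    total = 0
    last_was_timeout = False
    for line in lines:
        if "nfs: server" in line and "not responding" in line:
            total += 1
            last_was_timeout = True
            continue
        match = REPEAT.search(line)
        if match and last_was_timeout:
            # The repeat line stands for N more copies of the timeout message.
            total += int(match.group(1))
        else:
            last_was_timeout = False
    return total


# The four log lines quoted above.
log = [
    "Feb 4 13:07:31 emelia kernel: nfs: server 192.168.0.162 not responding, timed out",
    "Feb 4 13:07:52 emelia last message repeated 2 times",
    "Feb 4 13:12:17 emelia kernel: nfs: server 192.168.0.162 not responding, timed out",
    "Feb 4 13:12:38 emelia last message repeated 2 times",
]
print(count_nfs_timeouts(log))
```

For the excerpt above this counts six timeout messages across the two incidents.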

I haven't had any of the following since this line:

kern.log.2.gz:1649:Jan 18 07:39:28 emelia kernel: nfs: task 13016 can't
get a request slot

Whilst I appreciate that thinnet/BNC isn't the best technology to be
using, this segment isn't THAT busy most of the time, and certainly not
at the majority of times mp3s cut out (ones that WILL play fine end to
end at other times, so it's not corruption in them).

If there are any patches/options (other than hard mounts without other
changes) I should be trying, please let me know.

thanks,

-Ath
--
- Athanasius = Athanasius(at)gurus.tf / http://www.clan-lovely.org/~athan/
Finger athan(at)fysh.org for PGP key
"And it's me who is my enemy. Me who beats me up.
Me who makes the monsters. Me who strips my confidence." Paula Cole - ME



2002-02-04 14:47:47

by Athanasius

Subject: Re: 2.4.17 NFS hangup

On Mon, Feb 04, 2002 at 01:21:46PM +0000, Athanasius wrote:
> I'm seeing something like this as well. Two machines using
> BNC/thinwire (yes, I know, waiting on finances to make this better), 2
> other machines on the same segment. I use an NFS mount from the server
> (jimblewix) on the workstation (emelia) for amongst other things playing
> mp3s.

Seems to be my day for this happening. A bit more data:

There are next to no collisions going on. From ifconfig eth0 on the
SERVER:

RX packets:31331103 errors:0 dropped:1 overruns:0 frame:151
TX packets:42576602 errors:0 dropped:0 overruns:0 carrier:0
collisions:33733 txqueuelen:100

and the WORKSTATION:

RX packets:301884 errors:0 dropped:0 overruns:0 frame:0
TX packets:238086 errors:0 dropped:0 overruns:0 carrier:0
collisions:397 txqueuelen:100

Also the numbers on the SERVER at least for collisions didn't increase
the last time NFS cut out on me.
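
To put the counters above in proportion, here is a quick sketch that expresses collisions as a fraction of transmitted packets. `collision_rate` is just an illustrative helper, and the ratio is only a rough indicator since the counters cover the whole uptime.

```python
def collision_rate(collisions, tx_packets):
    """Collisions as a fraction of transmitted packets."""
    return collisions / tx_packets


# Counters copied from the ifconfig output above.
server = collision_rate(33733, 42576602)
workstation = collision_rate(397, 238086)

print(f"server:      {server:.3%}")
print(f"workstation: {workstation:.3%}")
```

That works out to roughly 0.08% on the SERVER and 0.17% on the WORKSTATION.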

I'm not seeing ANY other logging in kern.log on either machine beyond
the NFS timeout reports, nothing about NICs having trouble or the like.

-Ath
--
- Athanasius = Athanasius(at)gurus.tf / http://www.clan-lovely.org/~athan/
Finger athan(at)fysh.org for PGP key
"And it's me who is my enemy. Me who beats me up.
Me who makes the monsters. Me who strips my confidence." Paula Cole - ME

