2001-11-15 21:29:59

by Birger Lammering

Subject: nfs problem: hp-server --- linux 2.4.13 client, ooops

Hi there,

I've got an interesting problem for people working on nfs:

Kernel version: 2.4.13 (up and smp), Trond's seekdir patch, ext3 patch
Hardware: P4 1700 (1 and 2), 2 GB RAM.
Compiled with RedHat 7.1 gcc (2.96-85)
nfs-Server: HP-UX 10.20

Without the NFS patch I had a lot of trouble with Irix servers. With it,
Irix, NetApp, AIX and Linux NFS servers are fine, but connecting to an HP
server leads to messages like this in the syslog:
NFS: short packet in readdir reply!

Even worse: accessing one particular directory (using ls, find, etc.) leads
to a segmentation fault in the accessing process. The machine stays up and
running, but some other processes (unrelated to the NFS mount) die a
sudden death as well.

There is nothing special about the directory on the nfs server.
There is no problem with Linux 2.2.14-5.0smp. I can check with other 2.4
versions if necessary. The problem is independent of the nfs version.

The syslog says:


Unable to handle kernel paging request at virtual address fe01e000
printing eip:
f898eaed
*pde = 00004063
*pte = 00000000
Oops: 0000
CPU: 1
EIP: 0010:[<f898eaed>] Not tainted
EFLAGS: 00010296
eax: 01000000 ebx: fe01dff8 ecx: fe01e000 edx: 00000019
esi: fe01e000 edi: 0000006c ebp: f537405c esp: f4e83c60
ds: 0018 es: 0018 ss: 0018
Process ls (pid: 11091, stackpage=f4e83000)
Stack: f4e83cec f518c200 f898ea4c f8956e07 f537405c f6a15240 f4e83d94 f4e83d40
       f4e83d54 f4e83cac 00000002 f895a18c f4e83cec f4e83cec fffffff5 f4e83cec
       f518c200 f4e82000 f4e82000 00000000 f4e82000 f4e83d58 f4e83d58 f895a498
Call Trace: [<f898ea4c>] [<f8956e07>] [<f895a18c>] [<f895a498>] [<f895646e>]
   [<f8997520>] [<f89594e0>] [<f898e09e>] [<f8997520>] [<f898b89a>] [<c01299e8>]
   [<f898bc2b>] [<f898b7c0>] [<f898eb1c>] [<c0142bb4>] [<c0142f90>] [<c01430f3>]
   [<c0142f90>] [<c0125838>] [<c0106f7b>]

Code: 8b 11 0f ca 8d 42 03 24 fc 8d 4c 01 08 81 fa ff 00 00 00 77
<1>Unable to handle kernel NULL pointer dereference at virtual address 00000000
printing eip:
00000000
*pde = 00000000
Oops: 0000
CPU: 1
EIP: 0010:[<00000000>] Not tainted
EFLAGS: 00010286
eax: 00000000 ebx: f6ff2480 ecx: 00000000 edx: 00000000
esi: f6f49b80 edi: f4e83d24 ebp: f7308180 esp: f4e83d00
ds: 0018 es: 0018 ss: 0018
Process gcc (pid: 11097, stackpage=f4e83000)
Stack: f7308180 f6ff24bc f4e83d90 f4e83d24 f6ff2480 00000000 f4e83e34 f72fae00
       f4e83d90 00010002 00000000 00000000 00000000 00000000 00000000 00000000
       00000005 0000a1ff 00000001 00000000 00000000 0000001c 00000000 00001000
Call Trace: [<c013e3de>] [<c013e850>] [<c013ee1a>] [<c013c65e>] [<c013d13e>]
   [<f898eb1c>] [<c0142bb4>] [<c0141f0a>] [<c013e14d>] [<c0105b8f>] [<c0106f7b>]


Code: Bad EIP value.
<1>Unable to handle kernel NULL pointer dereference at virtual address 00000000
printing eip:
00000000
*pde = 00000000
Oops: 0000
CPU: 0
EIP: 0010:[<00000000>] Not tainted
EFLAGS: 00010286
eax: 00000000 ebx: f6ff2480 ecx: 00000000 edx: 00000000
esi: f6f49b80 edi: f4e83d24 ebp: f7308180 esp: f4e83d00
ds: 0018 es: 0018 ss: 0018
Process gcc (pid: 11139, stackpage=f4e83000)
Stack: f7308180 f6ff24bc f4e83d90 f4e83d24 f6ff2480 00000000 f4e83e34 f72fae00
       f4e83d90 08060002 f4e82000 f6fe4100 00001812 00000000 080603ec 080600a0
       00000005 0000a1ff 00000001 00000000 00000000 0000001c 00000000 00001000
Call Trace: [<c013e3de>] [<c013e850>] [<c013ee1a>] [<c013c65e>] [<c013d13e>]
   [<c012ee22>] [<c01201e5>] [<c01202aa>] [<c0116da1>] [<c0141f0a>] [<c013e14d>]
   [<c0105b8f>] [<c0106f7b>]

Code: Bad EIP value.

If you are interested in more details or, even better, have a solution,
please also CC my address, since I'm only a casual reader of this list.

Cheers,
Birger

--
Dr. Birger Lammering
science+computing ag Tel: 089 356386-15
Geschaeftsstelle Muenchen Fax: 089 356386-37
Ingolstaedter Str. 22
80807 Muenchen email: Birger.Lammering(at)partner.bmw.de
B.Lammering(at)science-computing.de


2001-11-15 23:59:04

by Trond Myklebust

Subject: Re: nfs problem: hp-server --- linux 2.4.13 client, ooops

>>>>> " " == Birger Lammering <[email protected]> writes:

> Without the NFS patch I had a lot of trouble with Irix servers
> - now Irix, NetApp, AIX, Linux... NFS servers are fine, but
> connecting to an HP server leads to statements like this in the
> syslog: NFS: short packet in readdir reply!

<snip>
> Unable to handle kernel paging request at virtual address fe01e000
> printing eip:
> f898eaed
> *pde = 00004063
> *pte = 00000000
> Oops: 0000
> CPU: 1
> EIP: 0010:[<f898eaed>] Not tainted
> EFLAGS: 00010296
> eax: 01000000 ebx: fe01dff8 ecx: fe01e000 edx: 00000019
> esi: fe01e000 edi: 0000006c ebp: f537405c esp: f4e83c60
> ds: 0018 es: 0018 ss: 0018
> Process ls (pid: 11091, stackpage=f4e83000)
> Stack: f4e83cec f518c200 f898ea4c f8956e07 f537405c f6a15240 f4e83d94 f4e83d40
>        f4e83d54 f4e83cac 00000002 f895a18c f4e83cec f4e83cec fffffff5 f4e83cec
>        f518c200 f4e82000 f4e82000 00000000 f4e82000 f4e83d58 f4e83d58 f895a498
> Call Trace: [<f898ea4c>] [<f8956e07>] [<f895a18c>] [<f895a498>] [<f895646e>]
>    [<f8997520>] [<f89594e0>] [<f898e09e>] [<f8997520>] [<f898b89a>] [<c01299e8>]
>    [<f898bc2b>] [<f898b7c0>] [<f898eb1c>] [<c0142bb4>] [<c0142f90>] [<c01430f3>]

That particular Oops should already be fixed in 2.4.14.

Cheers,
Trond

2001-11-16 11:24:28

by Birger Lammering

Subject: nfs problem: hp|aix-server --- linux 2.4.15pre5 client

Hi Trond,

Trond Myklebust writes:
> > [<c01430f3>]
>
> That particular Oops should already be fixed in 2.4.14.

Thanks, I've now tried 2.4.15pre5 (+ seekdir patch) and could not
reproduce the oops with the HP NFS server. But still: the kernel
complains about "NFS: short packet in readdir reply!" when I access
any directory provided by an HP-UX 10.20 NFS server (using nfs2 |
nfs3). I don't notice any strange behaviour other than that, though.

But a new problem emerged: copying from a Linux (2.4.13 | 2.4.15pre5)
NFS client onto an AIX NFS server doesn't work. About 800 KB are copied,
then the cp command just hangs. Syslog says:
Nov 16 11:51:12 capc20 kernel: nfs: server caes04 not responding, still trying
This problem only occurs with nfs3, not with nfs2.

Is there a cure for this, without being too experimental?

Cheers,
Birger

PS: It seems to be quite difficult to get (and surely to write :-)
nfs3 clients that work with every conceivable platform - we don't
only have NFS problems with Linux clients in our network...



2001-11-16 11:46:19

by Trond Myklebust

Subject: nfs problem: hp|aix-server --- linux 2.4.15pre5 client

>>>>> " " == Birger Lammering <[email protected]> writes:

> not reproduce the oops with the HP NFS server. But still: the
> kernel complains about "NFS: short packet in readdir reply!"
> when I access any directory provided by an HP-UX 10.20
> NFS server (using nfs2 | nfs3). I don't notice any strange
> behaviour other than that, though.

That's because the HP is returning a READDIR reply that is larger than
the buffer size we specified. When this happens, we truncate the reply
at the last valid record before the buffer overflow, and print out the
above message.
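
As a rough illustration of that truncation (a sketch, not the kernel's actual code), the snippet below walks an NFSv2-style READDIR entry list in XDR encoding and stops at the last entry that fits entirely inside the buffer limit. The helper names `pack_entry` and `last_complete_entry` are hypothetical, and the entry layout (value_follows flag, 32-bit fileid, length-prefixed name padded to 4 bytes, 32-bit cookie) is assumed from the NFSv2 protocol description; NFSv3 uses wider fields.

```python
import struct


def pack_entry(fileid, name, cookie):
    """Encode one NFSv2-style READDIR entry as XDR (illustrative helper)."""
    pad = (-len(name)) % 4  # XDR pads opaque data to a 4-byte boundary
    return (
        struct.pack(">III", 1, fileid, len(name))  # value_follows, fileid, name length
        + name
        + b"\x00" * pad
        + struct.pack(">I", cookie)
    )


def last_complete_entry(buf, limit):
    """Return (names, end_offset) for the entries lying entirely within buf[:limit]."""
    limit = min(limit, len(buf))
    names, off = [], 0
    while off + 4 <= limit:
        (value_follows,) = struct.unpack_from(">I", buf, off)
        if not value_follows:
            break  # normal end of the entry list
        p = off + 4
        if p + 8 > limit:
            break  # fileid/length header would cross the limit
        _fileid, name_len = struct.unpack_from(">II", buf, p)
        p += 8
        end = p + ((name_len + 3) & ~3) + 4  # padded name plus cookie
        if end > limit:
            break  # entry is cut off: truncate before it
        names.append(buf[p:p + name_len])
        off = end
    return names, off
```

With three entries (".", "..", "longer-name") and a limit of 50 bytes, only the first two entries are kept and the returned offset is 40, which is where a client following this approach would cut the reply.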

> But a new problem emerged: copying from a Linux (2.4.13 |
> 2.4.15pre5) NFS client onto an AIX NFS server doesn't
> work. About 800 KB are copied, then the cp command just
> hangs. Syslog says:
> Nov 16 11:51:12 capc20 kernel: nfs: server caes04 not responding, still trying
> This problem only occurs with nfs3, not with nfs2.

> Is there a cure for this, without being too experimental?

No idea. That depends on where the problem lies...
Do you have a tcpdump for me? Preferably one from the client and one
from the server (showing the same period of time).

Cheers,
Trond

2001-11-16 12:01:32

by Miquel van Smoorenburg

Subject: Re: nfs problem: hp|aix-server --- linux 2.4.15pre5 client

In article <[email protected]>,
Trond Myklebust <[email protected]> wrote:
>>>>>> " " == Birger Lammering <[email protected]> writes:
>
> > not reproduce the oops with the HP NFS server. But still: the
> > kernel complains about "NFS: short packet in readdir reply!"
>
>That's because the HP is returning a READDIR reply that is larger than
>the buffer size we specified. When this happens, we truncate the reply
>at the last valid record before the buffer overflow, and print out the
>above message.

Shouldn't the message then be "NFS: too large packet in readdir reply!" ?

Mike.
--
"Only two things are infinite, the universe and human stupidity,
and I'm not sure about the former" -- Albert Einstein.

2001-11-16 12:25:39

by Trond Myklebust

Subject: Re: nfs problem: hp|aix-server --- linux 2.4.15pre5 client

>>>>> " " == Miquel van Smoorenburg <[email protected]> writes:

>> That's because the HP is returning a READDIR reply that is
>> larger than the buffer size we specified. When this happens, we
>> truncate the reply at the last valid record before the buffer
>> overflow, and print out the above message.

> Shouldn't the message then be "NFS: too large packet in readdir
> reply!" ?

8) Or 'NFS: truncated packet in readdir reply!', since a truncated
reply is what is actually returned by the RPC layer.

When we are sure that the code is stable, the whole message can go. It
is really just reporting a server error. As long as we handle it
correctly, there should be no need to churn out all these printks.

Cheers,
Trond

2001-11-16 13:20:15

by Birger Lammering

Subject: nfs problem: aix-server --- linux 2.4.15pre5 client

Hi,

Trond Myklebust writes:
> > But a new problem emerged: copying from a Linux (2.4.13 |
> > 2.4.15pre5) NFS client onto an AIX NFS server doesn't
> > work. About 800 KB are copied, then the cp command just
> > hangs. Syslog says:
> > Nov 16 11:51:12 capc20 kernel: nfs: server caes04 not responding, still trying
> > This problem only occurs with nfs3, not with nfs2.
> 
>
> > Is there a cure for this, without being too experimental?
>
> No idea. That depends on where the problem lies...

I should have known that. ;-)

> Do you have a tcpdump for me? Preferably one from the client and one
> from the server (showing the same period of time).

The tcpdump was taken while copying from the Linux nfs3-Client onto
the Aix nfs3-Server:

tcpdump on AIX (caes04):
13:47:28.337776317 truncated-ip - 18 bytes missing!capc25.muc.796 > caes04.muc.shilp: P 2059179904:2059180060(156) ack 4022052897 win 17520 (DF)
13:47:28.337860266 caes04.muc.shilp > capc25.muc.796: P 1:121(120) ack 156 win 60032
13:47:28.343619224 capc25.muc.796 > caes04.muc.shilp: . ack 121 win 17520 (DF)
13:47:28.344042473 truncated-ip - 50 bytes missing!capc25.muc.796 > caes04.muc.shilp: P 156:344(188) ack 121 win 17520 (DF)
13:47:28.357398139 truncated-ip - 138 bytes missing!caes04.muc.shilp > capc25.muc.796: P 121:397(276) ack 344 win 60032
13:47:28.364496982 truncated-ip - 1322 bytes missing!capc25.muc.796 > caes04.muc.shilp: . 344:1804(1460) ack 397 win 17520 (DF)
13:47:28.364872917 truncated-ip - 1322 bytes missing!capc25.muc.796 > caes04.muc.shilp: . 1804:3264(1460) ack 397 win 17520 (DF)
13:47:28.365142046 truncated-ip - 1322 bytes missing!capc25.muc.796 > caes04.muc.shilp: . 3264:4724(1460) ack 397 win 17520 (DF)
13:47:28.482136300 caes04.muc.shilp > capc25.muc.796: . ack 4724 win 55652
13:47:28.489517784 truncated-ip - 1322 bytes missing!capc25.muc.796 > caes04.muc.shilp: . 4724:6184(1460) ack 397 win 17520 (DF)
13:47:28.490018916 truncated-ip - 1322 bytes missing!capc25.muc.796 > caes04.muc.shilp: . 6184:7644(1460) ack 397 win 17520 (DF)
13:47:28.490313868 truncated-ip - 1322 bytes missing!capc25.muc.796 > caes04.muc.shilp: . 7644:9104(1460) ack 397 win 17520 (DF)
13:47:28.490566573 truncated-ip - 1322 bytes missing!capc25.muc.796 > caes04.muc.shilp: . 9104:10564(1460) ack 397 win 17520 (DF)
13:47:28.685246999 caes04.muc.shilp > capc25.muc.796: . ack 2059190468 win 49812

and on Linux (capc25):
13:47:28.324042 > capc25.muc.796 > caes04.muc.nfs: P 2059179904:2059180060(156) ack 4022052897 win 17520 (DF)
13:47:28.330599 < caes04.muc.nfs > capc25.muc.796: P 1:121(120) ack 156 win 60032
13:47:28.330620 > capc25.muc.796 > caes04.muc.nfs: . 156:156(0) ack 121 win 17520 (DF)
13:47:28.330857 > capc25.muc.796 > caes04.muc.nfs: P 156:344(188) ack 121 win 17520 (DF)
13:47:28.350291 < caes04.muc.nfs > capc25.muc.796: P 121:397(276) ack 344 win 60032
13:47:28.350556 > capc25.muc.796 > caes04.muc.nfs: . 344:1804(1460) ack 397 win 17520 (DF)
13:47:28.350569 > capc25.muc.796 > caes04.muc.nfs: . 1804:3264(1460) ack 397 win 17520 (DF)
13:47:28.350581 > capc25.muc.796 > caes04.muc.nfs: . 3264:4724(1460) ack 397 win 17520 (DF)
13:47:28.475691 < caes04.muc.nfs > capc25.muc.796: . 397:397(0) ack 4724 win 55652
13:47:28.475724 > capc25.muc.796 > caes04.muc.nfs: . 4724:6184(1460) ack 397 win 17520 (DF)
13:47:28.475734 > capc25.muc.796 > caes04.muc.nfs: . 6184:7644(1460) ack 397 win 17520 (DF)
13:47:28.475743 > capc25.muc.796 > caes04.muc.nfs: . 7644:9104(1460) ack 397 win 17520 (DF)
13:47:28.475752 > capc25.muc.796 > caes04.muc.nfs: . 9104:10564(1460) ack 397 win 17520 (DF)
13:47:28.678570 < caes04.muc.nfs > capc25.muc.796: . 397:397(0) ack 10564 win 49812
13:47:28.678604 > capc25.muc.796 > caes04.muc.nfs: . 10564:12024(1460) ack 397 win 17520 (DF)
13:47:28.678614 > capc25.muc.796 > caes04.muc.nfs: . 12024:13484(1460) ack 397 win 17520 (DF)
13:47:28.678623 > capc25.muc.796 > caes04.muc.nfs: . 13484:14944(1460) ack 397 win 17520 (DF)
13:47:28.678632 > capc25.muc.796 > caes04.muc.nfs: . 14944:16404(1460) ack 397 win 17520 (DF)
13:47:28.678642 > capc25.muc.796 > caes04.muc.nfs: . 16404:17864(1460) ack 397 win 17520 (DF)
13:47:28.884628 < caes04.muc.nfs > capc25.muc.796: . 397:397(0) ack 17864 win 42512


I can send you the whole dump if you think it's helpful.
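
For anyone who wants to summarise a client-side trace like the one above, here is a small sketch that groups outgoing data segments into bursts and measures how long the server takes to acknowledge each burst. The function name `ack_delays`, the regular expression and the `SAMPLE` excerpt are illustrative assumptions; the pattern only targets the tcpdump line format shown in this mail (with the `>` / `<` direction marker) and is not a general tcpdump parser.

```python
import re

# Matches lines such as:
# 13:47:28.350556 > capc25.muc.796 > caes04.muc.nfs: . 344:1804(1460) ack 397 win 17520 (DF)
LINE = re.compile(
    r"^(?P<h>\d+):(?P<m>\d+):(?P<s>\d+)\.(?P<us>\d{6}) (?P<dir>[<>]) "
    r"\S+ > \S+ \S+ (?P<start>\d+):(?P<end>\d+)\((?P<len>\d+)\) ack (?P<ack>\d+)"
)

SAMPLE = """\
13:47:28.350556 > capc25.muc.796 > caes04.muc.nfs: . 344:1804(1460) ack 397 win 17520 (DF)
13:47:28.350569 > capc25.muc.796 > caes04.muc.nfs: . 1804:3264(1460) ack 397 win 17520 (DF)
13:47:28.350581 > capc25.muc.796 > caes04.muc.nfs: . 3264:4724(1460) ack 397 win 17520 (DF)
13:47:28.475691 < caes04.muc.nfs > capc25.muc.796: . 397:397(0) ack 4724 win 55652
13:47:28.475724 > capc25.muc.796 > caes04.muc.nfs: . 4724:6184(1460) ack 397 win 17520 (DF)
13:47:28.475734 > capc25.muc.796 > caes04.muc.nfs: . 6184:7644(1460) ack 397 win 17520 (DF)
13:47:28.475743 > capc25.muc.796 > caes04.muc.nfs: . 7644:9104(1460) ack 397 win 17520 (DF)
13:47:28.475752 > capc25.muc.796 > caes04.muc.nfs: . 9104:10564(1460) ack 397 win 17520 (DF)
13:47:28.678570 < caes04.muc.nfs > capc25.muc.796: . 397:397(0) ack 10564 win 49812
13:47:28.678604 > capc25.muc.796 > caes04.muc.nfs: . 10564:12024(1460) ack 397 win 17520 (DF)
13:47:28.678614 > capc25.muc.796 > caes04.muc.nfs: . 12024:13484(1460) ack 397 win 17520 (DF)
13:47:28.678623 > capc25.muc.796 > caes04.muc.nfs: . 13484:14944(1460) ack 397 win 17520 (DF)
13:47:28.678632 > capc25.muc.796 > caes04.muc.nfs: . 14944:16404(1460) ack 397 win 17520 (DF)
13:47:28.678642 > capc25.muc.796 > caes04.muc.nfs: . 16404:17864(1460) ack 397 win 17520 (DF)
13:47:28.884628 < caes04.muc.nfs > capc25.muc.796: . 397:397(0) ack 17864 win 42512
"""


def ack_delays(lines):
    """Return [(acked_seq, segments_in_burst, delay_us)] for each pure server ACK."""
    result = []
    burst = 0            # data segments sent since the previous server ACK
    last_data_us = None  # timestamp of the most recent outgoing data segment
    for line in lines:
        m = LINE.match(line)
        if not m:
            continue
        t_us = ((int(m["h"]) * 60 + int(m["m"])) * 60 + int(m["s"])) * 1_000_000 + int(m["us"])
        length = int(m["len"])
        if m["dir"] == ">" and length > 0:
            burst += 1
            last_data_us = t_us
        elif m["dir"] == "<" and length == 0 and last_data_us is not None:
            result.append((int(m["ack"]), burst, t_us - last_data_us))
            burst = 0
    return result


for acked, segments, delay_us in ack_delays(SAMPLE.splitlines()):
    print(f"ack {acked}: {segments} segments, acknowledged after {delay_us / 1000:.1f} ms")
```

On this excerpt the bursts are 3, 4 and 5 segments long, and the acknowledgements arrive roughly 125 ms, 203 ms and 206 ms after the last segment of each burst. What that says about the hang is not established by this excerpt alone.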

The Linux settings:
Linux capc20 2.4.15pre5-sc1 #1 Fri Nov 16 10:26:21 CET 2001 i686 unknown

Gnu C                  2.96
Gnu make               3.79.1
binutils               2.10.91.0.2
util-linux             2.11f
mount                  2.11g
modutils               2.4.6
e2fsprogs              1.23
PPP                    2.4.0
Linux C Library        2.2.4
Dynamic linker (ldd)   2.2.4
Procps                 2.0.7
Net-tools              1.57
Console-tools          0.3.3
Sh-utils               2.0
Modules Loaded         firegl23 3c59x usb-uhci usbcore

Cheers,
Birger
