Hi all,
I have a machine running kernel 2.4.1 + the ACL patch. It exports some volume via
NFS (installed with RedHat 7.0 + a custom 2.4.1 kernel). The underlying
filesystem is ext2. I tried NFS v2 and v3, and also a kernel without the ACL
patch. The results are the same.
The problem is that NFSD dies unexpectedly with an Oops (see below).
At boot I have 8 NFSD processes, but at some point they all die. I can't
see why it happens, and since this is a production machine I can't
reboot it too often. After a reboot, all is fine for a while. Then,
suddenly, the 8 NFSD processes die all together... Last time, I rebooted the
machine at 23h00 and NFSD died at about 9h36 the next day: roughly 10h of uptime!
When running lsmod, nfsd.o still has 8 locks even after NFSD has died, so it's
impossible to rmmod it (the 8 NFSD processes don't give their resources
back).
I tried to put NFSD in the kernel directly, without modules. Same thing.
Does anyone have similar problems?
Thanks for any help on that topic
Bye
-jec
PS: I'm not in the list, so CC please.
PS2: Thanks to Andrew M. for his help :-)
Oops:
ksymoops 2.4.0 on i686 2.4.1-acls0.7.5-4. Options used
-V (default)
-k /proc/ksyms (default)
-l /proc/modules (default)
-o /lib/modules/2.4.1-acls0.7.5-4/ (default)
-m /boot/System.map-2.4.1-acls0.7.5-4 (specified)
Unable to handle kernel NULL pointer dereference at virtual address 00000000
00000000
*pde = 00000000
Oops: 0000
CPU: 0
EIP: 0010:[acpi_exit+0/-1072693248]
EFLAGS: 00010286
eax: 00000000 ebx: c4f5c03c ecx: c091d040 edx: c0173710
esi: c4f63424 edi: c4f5c03c ebp: c4f5c03c esp: c4f61f38
ds: 0018 es: 0018 ss: 0018
Process nfsd (pid: 2690, stackpage=c4f61000)
Stack: c0173774 c091d040 00008000 c4f63000 c02f9220 c4f5c014 c4f63000 c091d040
       a1ffc014 c016bdbb c4f63000 c4f5c01c c4f63400 c4f63138 c02f9220 c4f63490
       c0273e38 c4f63000 c4f5c014 c4f60000 0034fdbb c7f68560 c4f60550 c4f63400
Call Trace: [nfssvc_encode_diropres+100/520] [nfsd_dispatch+275/360]
[svc_process+684/1348] [nfsd+401/760] [kernel_thread+35/48]
Code: Bad EIP value.
Using defaults from ksymoops -t elf32-i386 -a i386
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
Jean-Eric Cuendet
Linkvest SA
Av des Baumettes 19, 1020 Renens Switzerland
Tel +41 21 632 9043 Fax +41 21 632 9090
http://www.linkvest.com E-mail: [email protected]
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
On Wednesday February 14, [email protected] wrote:
>
> Hi all,
> I have a machine with kernel 2.4.1 + acls patch. It exports some volume via
> NFS (installed with RedHat 7.0 + custom 2.4.1 kernel). The underlying
> filesystem is ext2. I tried with NFS v2 and v3 and without ACLs in the
> kernel. results are the same.
....
>
> Unable to handle kernel NULL pointer dereference at virtual address 00000000
> 00000000
> *pde = 00000000
> Oops: 0000
> CPU: 0
> EIP: 0010:[acpi_exit+0/-1072693248]
> EFLAGS: 00010286
> eax: 00000000 ebx: c4f5c03c ecx: c091d040 edx: c0173710
> esi: c4f63424 edi: c4f5c03c ebp: c4f5c03c esp: c4f61f38
> ds: 0018 es: 0018 ss: 0018
> Process nfsd (pid: 2690, stackpage=c4f61000)
> Stack: c0173774 c091d040 00008000 c4f63000 c02f9220 c4f5c014 c4f63000 c091d040
>        a1ffc014 c016bdbb c4f63000 c4f5c01c c4f63400 c4f63138 c02f9220 c4f63490
>        c0273e38 c4f63000 c4f5c014 c4f60000 0034fdbb c7f68560 c4f60550 c4f63400
> Call Trace: [nfssvc_encode_diropres+100/520] [nfsd_dispatch+275/360]
> [svc_process+684/1348] [nfsd+401/760] [kernel_thread+35/48]
> Code: Bad EIP value.
> Using defaults from ksymoops -t elf32-i386 -a i386
This trace mostly makes sense, except that nfssvc_encode_diropres
doesn't appear to make any subroutine call at offset 100, as the trace
implies.
Could you run
echo disassemble nfssvc_encode_diropres | gdb -batch -x /dev/stdin vmlinux
giving it the vmlinux that was running when this oops was produced? Also,
please tell me exactly what patches you have on top of 2.4.1 and where to
find them.
NeilBrown
Here I am again! NFSD died at 11h23, ~12 hours after the last reboot, a
record :-)
I'll try to answer your questions as best I can.
> This trace seems to make sense, except that nfssvc_encode_diropres
> doesn't seem to make any subroutine calls at offset 100 as seems to be
> implied.
>
> Could you run
>
> echo disassemble nfssvc_encode_diropres | gdb -batch -x
> /dev/stdin vmlinux
It's in the attached file.
In fact, the Oops came from a newly compiled kernel with NFSD built as a
module... So the GDB stuff would not work on it.
So attached is the GDB output from a kernel recompiled with NFSD built in. Is
that sufficient for you? It's not the kernel that was running, just a fresh
rebuild. If not, I'll send you a new Oops + GDB output from a RUNNING kernel
with NFSD built in.
> giving it the vmlinux that was running when this oops was
> produced? and
> also tell me exactly what patches you have ontop of 2.4.1 and where to
> find them.
I have only the ACL patches applied. You can find them at acl.bestbits.at.
I have tried without them, with exactly the same behaviour.
Thanks
-jec
On Thursday February 15, [email protected] wrote:
>
> Here I am again! NFSD died at 11h23, ~12 hours after the last reboot, a
> record :-)
I'm guessing you don't have many symlinks on the exported
filesystem....
> I'll try to best answer your questions.
>
> > This trace seems to make sense, except that nfssvc_encode_diropres
> > doesn't seem to make any subroutine calls at offset 100 as seems to be
> > implied.
> >
> > Could you run
> >
> > echo disassemble nfssvc_encode_diropres | gdb -batch -x
> > /dev/stdin vmlinux
>
> It's in the attached file.
> In fact, the Oops was with a new compiled kernel with NFSD in modules... So
> the GDB stuff would not work...
> So attached is the output of GDB recompiled with NFSD in the kernel. Is it
> sufficient for you? It's not the one that was running but just recompiled.
> If not, I'll send you a new Oops + GDB output of a RUNNING kernel with NFSD
> in the kernel.
>
> > giving it the vmlinux that was running when this oops was
> > produced? and
> > also tell me exactly what patches you have ontop of 2.4.1 and where to
> > find them.
>
> I have only ACL patched. You can find them at acl.bestbits.at.
The information you sent was very helpful.
You are getting an Oops here:
> 0xc0173769 <nfssvc_encode_diropres+89>: push $0x8000
> 0xc017376e <nfssvc_encode_diropres+94>: push %ecx
> 0xc017376f <nfssvc_encode_diropres+95>: mov 0x48(%eax),%eax
> 0xc0173772 <nfssvc_encode_diropres+98>: call *%eax
                                          ^^^^^^^^^^
%eax is zero.
This corresponds to the code fragment:
+ if (IS_POSIX_ACL(inode) && inode->i_op) {
+ posix_acl_t *acl = inode->i_op->get_posix_acl(
+ inode, ACL_TYPE_ACCESS);
%eax has the value of inode->i_op->get_posix_acl. Clearly this field
hasn't been initialised.
A quick look at the patch suggests that it doesn't get initialised for
symlinks, but I haven't pored over it in detail.
I must admit that it appears somewhat courageous to be running 2.4.1
with patches that were made for 2.4.0-test12 on a production machine,
but I guess you know what you are doing.
> I have tried without them, with exactly the same behaviour.
>
That may be, but you have only given evidence that there are
problems with the patches installed, and that evidence points very
strongly at a problem with the patch. If you can give me evidence (an
Oops for example) which shows problems without the patches, then I
will be happy to look at it.
Also, the most recent ksymoops output is totally useless. It looks
like the kernel image that ksymoops was using to find symbol
information was different from the kernel image that was running.
It is very important that these two match, or the result is of no use.
NeilBrown