2000-10-27 22:20:43

by Tony Lill

Subject: OOPS in nfsd, affects all 2.2 and 2.4 kernels

This was first reported in 2.2.12, according to Deja. Solaris clients,
on rare occasions, will send some command to a Linux server which
causes a null resp->fh.fh_dentry to be passed to routines in
/usr/src/linux/fs/nfsd/nfsxdr.c. This causes an oops, and then the NFS
server subsystem stops functioning. A fix is to check that this is not
null before dereferencing it in the following three routines. I
looked, and this bug is present in the latest 2.2 and 2.4 kernels.

Whatever condition causes this is very rare. We had a Linux server
supporting 100 Solaris and HP-UX boxes running flawlessly for 8
months, then one day something triggered this bug, and it wouldn't go
away until I implemented this fix. There were no apparent side effects
to doing this, although you may want to print some informative message
to try and track down the real culprit.
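
As a rough illustration of the kind of informative message mentioned above, here is a minimal user-space sketch in C. It is not the kernel code: the structures are hypothetical cut-down stand-ins, fprintf stands in for printk, and the names encode_attrstat_guarded and null_dentry_hits are invented for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical, heavily simplified stand-ins for the kernel structures. */
struct inode        { unsigned long i_ino; };
struct dentry       { struct inode *d_inode; };
struct svc_fh       { struct dentry *fh_dentry; };
struct attrstat_res { struct svc_fh fh; };

/* Counts how often the guard fired, so the rare case leaves a trace. */
static unsigned long null_dentry_hits;

/*
 * Returns 1 when the attributes could be "encoded" (here: the inode
 * number is copied out), or 0 when the reply has to be dropped because
 * the file handle was never filled in.
 */
static int encode_attrstat_guarded(const struct attrstat_res *resp,
                                   unsigned long *out_ino)
{
    assert(resp != NULL && out_ino != NULL);

    if (resp->fh.fh_dentry == NULL) {
        null_dentry_hits++;
        /* In the kernel this would be a printk(); assumption for the demo. */
        fprintf(stderr, "encode_attrstat: NULL fh_dentry (hit %lu)\n",
                null_dentry_hits);
        return 0;
    }

    *out_ino = resp->fh.fh_dentry->d_inode->i_ino;
    return 1;
}
```

Calling encode_attrstat_guarded() with a response whose fh_dentry is NULL returns 0 and logs one line instead of dereferencing the pointer. The counter makes it possible to see how often the condition really occurs.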

This patch is against 2.2.16, but the code looks unchanged in
2.4.0.

*** /usr/src/linux/fs/nfsd/nfsxdr.c Wed Nov 26 16:08:38 1997
--- nfsxdr.c Wed Aug 9 19:07:40 2000
***************
*** 364,370 ****
  nfssvc_encode_attrstat(struct svc_rqst *rqstp, u32 *p,
  					struct nfsd_attrstat *resp)
  {
! 	if (!(p = encode_fattr(rqstp, p, resp->fh.fh_dentry->d_inode)))
  		return 0;
  	return xdr_ressize_check(rqstp, p);
  }
--- 364,371 ----
  nfssvc_encode_attrstat(struct svc_rqst *rqstp, u32 *p,
  					struct nfsd_attrstat *resp)
  {
! 	if ( resp->fh.fh_dentry == NULL ||
! 	    !(p = encode_fattr(rqstp, p, resp->fh.fh_dentry->d_inode)))
  		return 0;
  	return xdr_ressize_check(rqstp, p);
  }
***************
*** 373,379 ****
  nfssvc_encode_diropres(struct svc_rqst *rqstp, u32 *p,
  					struct nfsd_diropres *resp)
  {
! 	if (!(p = encode_fh(p, &resp->fh))
  	 || !(p = encode_fattr(rqstp, p, resp->fh.fh_dentry->d_inode)))
  		return 0;
  	return xdr_ressize_check(rqstp, p);
--- 374,381 ----
  nfssvc_encode_diropres(struct svc_rqst *rqstp, u32 *p,
  					struct nfsd_diropres *resp)
  {
! 	if ( resp->fh.fh_dentry == NULL ||
! 	    !(p = encode_fh(p, &resp->fh))
  	 || !(p = encode_fattr(rqstp, p, resp->fh.fh_dentry->d_inode)))
  		return 0;
  	return xdr_ressize_check(rqstp, p);
***************
*** 392,398 ****
  nfssvc_encode_readres(struct svc_rqst *rqstp, u32 *p,
  					struct nfsd_readres *resp)
  {
! 	if (!(p = encode_fattr(rqstp, p, resp->fh.fh_dentry->d_inode)))
  		return 0;
  	*p++ = htonl(resp->count);
  	p += XDR_QUADLEN(resp->count);
--- 394,401 ----
  nfssvc_encode_readres(struct svc_rqst *rqstp, u32 *p,
  					struct nfsd_readres *resp)
  {
! 	if ( resp->fh.fh_dentry == NULL ||
! 	    !(p = encode_fattr(rqstp, p, resp->fh.fh_dentry->d_inode)))
  		return 0;
  	*p++ = htonl(resp->count);
  	p += XDR_QUADLEN(resp->count);


--
Tony Lill
President, A. J. Lill Consultants fax/data (519) 650 3571
539 Grand Valley Dr., Cambridge, Ont. N3H 2S2 (519) 241 2461
--------------- http://www.ajlc.waterloo.on.ca/ ----------------
"Welcome to All Things UNIX, where if it's not UNIX, it's CRAP!"


2000-10-28 05:42:22

by NeilBrown

Subject: Re: OOPS in nfsd, affects all 2.2 and 2.4 kernels

On Friday October 27, Tony Lill wrote:
> This was first reported in 2.2.12, according to Deja. Solaris clients,
> on rare occasions, will send some command to a Linux server which
> causes a null resp->fh.fh_dentry to be passed to routines in
> /usr/src/linux/fs/nfsd/nfsxdr.c. This causes an oops, and then the NFS
> server subsystem stops functioning. A fix is to check that this is not
> null before dereferencing it in the following three routines. I
> looked, and this bug is present in the latest 2.2 and 2.4 kernels.
>
> Whatever condition causes this is very rare. We had a Linux server
> supporting 100 Solaris and HP-UX boxes running flawlessly for 8
> months, then one day something triggered this bug, and it wouldn't go
> away until I implemented this fix. There were no apparent side effects
> to doing this, although you may want to print some informative message
> to try and track down the real culprit.
>
> This patch is against 2.2.16, but the code looks unchanged in
> 2.4.0.

Thanks for sending me this.

This problem that you are addressing is caused when Solaris sends a
zero length write (I assume to implement the "access" system call, but
I haven't checked).
If you look at the code at the top of nfsd_write in nfsd/vfs.c, you
will see that if cnt (the number of bytes to write) is zero, then it
jumps straight out without setting an error or initialising fhp (by
calling nfsd_open).
As there is no error, nfsd tries to return status information (hence
the call to nfssvc_encode_attrstat) but doesn't have a valid file
handle. So it Oopses.

The correct fix, which is in 2.4 and should recently have gone into the
2.2.18 pre-patches (but I haven't actually looked at them), is to move
the test for (!cnt) to after the call to nfsd_open.
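
To make the ordering issue concrete, here is a small user-space model in C. It is only a sketch under stated assumptions, not the real nfsd_write: model_open is a hypothetical stand-in for nfsd_open that does nothing but fill in fh_dentry, and the two model_write_* functions differ only in where the (!cnt) test sits.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical cut-down stand-ins for the kernel structures. */
struct dentry { int unused; };
struct svc_fh { struct dentry *fh_dentry; };

static struct dentry the_dentry;

/* Stand-in for nfsd_open(): on success the handle gets a valid dentry. */
static int model_open(struct svc_fh *fhp)
{
    fhp->fh_dentry = &the_dentry;
    return 0;
}

/* Old ordering: a zero-length write leaves with success but no dentry. */
static int model_write_old(struct svc_fh *fhp, unsigned long cnt)
{
    int err = 0;

    if (!cnt)
        goto out;               /* fhp is never initialised on this path */
    err = model_open(fhp);
    if (err)
        goto out;
    /* ... the actual write would go here ... */
out:
    return err;
}

/* Fixed ordering: open first, then short-circuit the empty write. */
static int model_write_fixed(struct svc_fh *fhp, unsigned long cnt)
{
    int err = model_open(fhp);

    if (err)
        goto out;
    if (!cnt)
        goto out;               /* fhp is valid, so the reply can be encoded */
    /* ... the actual write would go here ... */
out:
    return err;
}
```

With cnt == 0 both variants return 0. After model_write_old the fh_dentry is still NULL, which is the state the encode routines then trip over. After model_write_fixed it points at a valid dentry.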

NeilBrown

2000-10-28 23:32:51

by Michael Eisler

Subject: Re: OOPS in nfsd, affects all 2.2 and 2.4 kernels


> This problem that you are addressing is caused when Solaris sends a
> zero length write (I assume to implement the "access" system call, but
> I haven't checked).

More likely a long-standing bug in Solaris that hasn't been stomped.

Tony, you might let Sun know that you have a way to reproduce it at
will, though there are Sun people on this alias who I'm sure will
make it a high priority to stomp this one. :-)

-mre