2004-08-24 15:14:07

by Frank Steiner

Subject: very strange nfs errors with nfsroot

Hi,

we run a diskless system where the clients boot via pxeboot, then
first mount / read-only from the server. Then we run our own
boot.nfsroot instead of /etc/init.d/boot and mount some directories
read-write per client, i.e., /var, /dev, /etc/local and /media.
Then a "client-script" is run on the client (still from within the
boot.nfsroot script) to set up some links to shared files,
copy some templates to /etc/local and "sed" some client-specific
values in the templates. Normal stuff for a diskless setup, I think.

Any failure in "client-script" causes a shutdown, assuming that sth.
essential went wrong during the client configuration in boot.nfsroot.

We mount the nfsroot and the client-directories with these options:
"nfsvers=3,tcp,hard,intr,nolock,rsize=8192,wsize=8192"
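For reference, the per-client mounts with these options could be issued
roughly as sketched below. The server name, the helper function
print_client_mounts and the /export/<ip>/ layout for all four directories
are assumptions for illustration, not taken from the actual boot.nfsroot.
The sketch only prints the commands instead of running them.

```shell
#!/bin/sh
# Mount options exactly as quoted above.
NFS_OPTS="nfsvers=3,tcp,hard,intr,nolock,rsize=8192,wsize=8192"

# Hypothetical helper: print one mount command per client-specific
# directory (/var, /dev, /etc/local, /media).
# $1 = NFS server name (placeholder), $2 = client IP (placeholder)
print_client_mounts() {
    server=$1
    ip=$2
    for d in var dev etc/local media; do
        echo "mount -t nfs -o $NFS_OPTS $server:/export/$ip/$d /$d"
    done
}
```

Calling print_client_mounts with a server name and a client IP lists the
four commands, so they can be reviewed before being put into a boot script.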

This all went fine with kernel 2.4.x. When we switched to 2.6.7, and up
to currently running 2.6.8.1, the "client-script" started to fail randomly.
When I try to trace the error down by adding -x flags to the scripts,
or sleeps and more error messages, the failure changes every time I
change some sleep or debugging output in either boot.nfsroot or the
client-script.

Here are the symptoms:
- It all started when on about every third boot, the clients
complained
"ln: /etc/local/printcap: File exists
Error: Could not link shared file /etc/local/printcap"

That was caused by a piece of code:

for name in $SHARES
do
    if ! ( rm -f $name && ln -s $COMMON/$name $name )
    then
        exitstatus=1
        echo "Error: Could not link shared file $name"
    fi
done

I think that the ln should never be able to complain
with this message, because of the "rm ... && ln ..."

- in a different state of the script (some more sleeps, and client-script
with -x) I got this:

+ cat /etc/local/fstab
sed: Couldn't flush stdout: Stale NFS file handle
+ exitstatus=1

The code was "cat $name | sed 's/...' > $name.tmp"

- again from a different state with more sleeps etc. I got from somewhere
in the client-script (no -x here, and no debug output made it to the
screen):

nfs_update_inode: inode number mismatch
expected (0:e/0x2c617b), got (0:e/0x2c6178)

- and finally the same with an additional sed error message:
nfs_update_inode: inode number mismatch
expected (0:e/0x2c617b), got (0:e/0x2c6178)
sed: Couldn't flush <unknown>: Input/output error

Note that:
- when the script once has run, the system will boot without problems and
run stable without any failure
- the problems are independent of whether the nfsroot or the client-dirs
are mounted with udp or tcp.

Just a guess: Could that be caused by mounting a /dev directory for
the client over the /dev directory of the server within the boot.nfsroot
script? The boot.nfsroot script uses /dev/console and likely /dev/stdout
from the read-only-mounted /dev from the server, because it got the initial
console from this directory (with Neil's patch with that MAY_LOCAL_ACCESS...)
Now when the script mounts the client-specific /dev, could that cause
a problem like the "Couldn't flush stdout: Stale NFS file handle"?

What could I do about this if that was the reason...? It never was a problem
with the 2.4 kernel...

I have a current setup where I can quite well reproduce the "nfs_update_inode
+ sed" error and the printcap problem so please let me know if I can do sth
to trace the error down.

Thanks for any hints!
cu,
Frank


--
Dipl.-Inform. Frank Steiner Web: http://www.bio.ifi.lmu.de/~steiner/
Lehrstuhl f. Bioinformatik Mail: http://www.bio.ifi.lmu.de/~steiner/m/
LMU, Amalienstr. 17 Phone: +49 89 2180-4049
80333 Muenchen, Germany Fax: +49 89 2180-99-4049



-------------------------------------------------------
SF.Net email is sponsored by Shop4tech.com-Lowest price on Blank Media
100pk Sonic DVD-R 4x for only $29 -100pk Sonic DVD+R for only $33
Save 50% off Retail on Ink & Toner - Free Shipping and Free Gift.
http://www.shop4tech.com/z/Inkjet_Cartridges/9_108_r285
_______________________________________________
NFS maillist - [email protected]
https://lists.sourceforge.net/lists/listinfo/nfs


2004-08-24 16:08:21

by Trond Myklebust

Subject: Re: very strange nfs errors with nfsroot

On Tue, 24/08/2004 at 11:14, Frank Steiner wrote:

> Here are the symptoms:
> - It all started when on about every third boot, the clients
> complained
> "ln: /etc/local/printcap: File exists
> Error: Could not link shared file /etc/local/printcap"
>
> That was caused by a piece of code:
>
> for name in $SHARES
> do
>     if ! ( rm -f $name && ln -s $COMMON/$name $name )
>     then
>         exitstatus=1
>         echo "Error: Could not link shared file $name"
>     fi
> done

Is the /etc/local shared by all clients? If so, your script is buggy
since it assumes that some other client cannot come in and ln before you
do (this is a prime example of what is known as a "race condition").
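
To illustrate the window described here: "rm -f" followed by "ln -s" is two
separate operations, so anything that creates the name in between makes ln
fail with "File exists". A minimal sketch of one way to close that window is
to create the link under a temporary name and rename it over the final one.
The helper name link_shared and the temporary-name scheme are hypothetical,
and "mv -T" assumes GNU coreutils; this is not the original script.

```shell
#!/bin/sh
# Hypothetical helper: replace a symlink in one step instead of
# "rm -f $name && ln -s ...".
# $1 = link target (e.g. "$COMMON/$name"), $2 = path of the link itself
link_shared() {
    target=$1
    name=$2
    tmp="$name.tmp.$$"      # temporary name next to the final one (assumption)

    # Create the link under the temporary name first.
    ln -s "$target" "$tmp" || return 1

    # Rename it over the final name. rename(2) replaces an existing entry
    # in a single operation, so there is no moment at which "$name" has
    # been removed but not yet recreated. -T (GNU mv) keeps mv from
    # descending into "$name" if it is a link to a directory.
    mv -T "$tmp" "$name" || { rm -f "$tmp"; return 1; }
}
```

In the loop quoted above, the condition would then read
"if ! link_shared $COMMON/$name $name". Whether this changes anything for
the failure reported in this thread is a separate question.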

As for the ESTALE error. Again, if some other client is screwing with
deleting and recreating the soft link on the server, then ESTALE is
entirely expected behaviour. I suggest you read up on the NFS cache
consistency promises.

Cheers,
Trond



2004-08-25 06:15:01

by Frank Steiner

Subject: Re: very strange nfs errors with nfsroot

Hi,

Trond Myklebust wrote:
>
> Is the /etc/local shared by all clients? If so, your script is buggy
> since it assumes that some other client cannot come in and ln before you
> do (this is a prime example of what is known as a "race condition").

No, for every client there is some /export/<ip>/ directory
with subdirectories etc/local, var/ etc...
So every client mounts just the subdirectories for its own
ip, so no race conditions can occur. The complete boot.nfsroot
script and the client-script work on completely separate
dirs for every client.
The errors can indeed be reproduced while I have only one
client up and running at all... (it's a test scenario
currently with a maximum of 5 clients).

> As for the ESTALE error. Again, if some other client is screwing with
> deleting and recreating the soft link on the server, then ESTALE is
> entirely expected behaviour. I suggest you read up on the NFS cache
> consistency promises.

Since this is not the case, it must be sth. happening just between
the server and the single client. And the only thing I could think
of that would cause "/dev/stdout" to be considered stale, was
the mount of /export/<ip>/dev over the /dev directory in the
read-only-mounted nfsroot...

Are there any debugging parameters I can set on the server
or the client side e.g. in /proc (I mount /proc for the
clients at the beginning of the boot.nfsroot script, so
it's there) to get some more details what's going wrong?
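
One place to look, as a hedged pointer: Linux kernels built with RPC
debugging expose bitmask files such as nfs_debug and rpc_debug under
/proc/sys/sunrpc, and writing a non-zero value makes the NFS client and the
RPC layer log to the kernel log. The sketch below wraps that in a helper;
the function name set_nfs_debug and its optional directory argument are
hypothetical, and 32767 is simply a commonly used "enable many flags" value,
not something confirmed in this thread.

```shell
#!/bin/sh
# Hypothetical helper: write a debug bitmask to the sunrpc debug files.
# $1 = bitmask (0 switches logging off again)
# $2 = directory holding the files (defaults to /proc/sys/sunrpc;
#      overridable so the helper can be tried out on a scratch directory)
set_nfs_debug() {
    value=$1
    dir=${2:-/proc/sys/sunrpc}
    for f in nfs_debug rpc_debug; do
        if [ -w "$dir/$f" ]; then
            echo "$value" > "$dir/$f"
        else
            # File missing (kernel without RPC debugging) or not writable.
            echo "skipping $dir/$f" >&2
        fi
    done
}
```

Calling "set_nfs_debug 32767" early in boot.nfsroot and "set_nfs_debug 0"
after the client-script would limit the extra logging to the failing phase;
the messages then show up in the client's kernel log.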

cu,
Frank

--
Dipl.-Inform. Frank Steiner Web: http://www.bio.ifi.lmu.de/~steiner/
Lehrstuhl f. Bioinformatik Mail: http://www.bio.ifi.lmu.de/~steiner/m/
LMU, Amalienstr. 17 Phone: +49 89 2180-4049
80333 Muenchen, Germany Fax: +49 89 2180-99-4049


