2003-05-13 14:37:44

by jlnance

Subject: NFS problems with Linux-2.4

Hello all,

I am having some problems with NFS which I suspect may be a bug in the
2.4 kernels. I can probably come up with a small testcase, but before I do
that I would like to describe the problem and see if it is something that
is supposed to work. Perhaps I simply do not understand the guarantees
that NFS makes.

The setup is like this. I have two machines which share an NFS-mounted
directory. The NFS server is a Network Appliance box. Machine A
does an fopen/fwrite/fclose to create a file on the NFS filesystem. It
then sends a message to machine B. Machine B then attempts to fopen the
file, but fopen fails (as does stat). If I add code that sleeps for a
couple of seconds and retries the fopen, then everything works.

I have seen the problem both on IA64 machines running the 2.4.18 kernel
from Red Hat's Advanced Server and on x86 machines running Red Hat's
2.4.7-10smp kernel. I have not tried other Linux kernels (I am not root),
but I have run the same program under Solaris (SPARC) and have never
observed this.

The IA64 and x86 machines were on different networks and using different
Network Appliance servers. The IA64 /proc/mounts entry is:

na1:/vol/h1/home /remote/na1h1home nfs rw,v3,rsize=4096,wsize=4096,hard,intr,udp,lock,addr=na1 0 0

and the x86 entry is:

na1-rtp:/vol/vol0/home/jlnance /home/jlnance nfs rw,v3,rsize=8192,wsize=8192,hard,intr,udp,lock,addr=na1-rtp 0 0


If you would like more information, please let me know.

Thanks,

Jim
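
The sleep-and-retry behaviour described above can be sketched roughly as follows. This is only an illustration, not the code from the original program: the helper name fopen_retry and its max_tries and delay_sec parameters are hypothetical, and it assumes the failure seen on machine B shows up as ENOENT.

```c
#include <errno.h>
#include <stdio.h>
#include <unistd.h>

/* Hypothetical helper (not from the original post): call fopen() up to
 * max_tries times, sleeping delay_sec seconds between attempts.
 * Only ENOENT ("file not visible yet") triggers a retry; any other
 * error is returned to the caller immediately. */
FILE *fopen_retry(const char *path, const char *mode,
                  int max_tries, unsigned delay_sec)
{
    int saved_errno = ENOENT;

    for (int attempt = 0; attempt < max_tries; attempt++) {
        FILE *f = fopen(path, mode);
        if (f != NULL)
            return f;

        saved_errno = errno;
        if (saved_errno != ENOENT)
            break;                      /* a different error: give up */
        if (attempt + 1 < max_tries)
            sleep(delay_sec);           /* wait before the next attempt */
    }

    errno = saved_errno;                /* report the last fopen() error */
    return NULL;
}
```

A caller on machine B might use fopen_retry(path, "r", 3, 2) after receiving the message from machine A. The retry count and delay are guesses; the post only says that a couple of seconds was enough.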


2003-05-13 15:06:46

by Trond Myklebust

Subject: NFS problems with Linux-2.4


Could you please try with a newer kernel? The close-to-open cache
consistency fixes are a relatively recent addition to the Linux NFS
client. I dunno if RedHat's 2.4.18 kernel has them.

2.4.7 certainly does not.

Cheers,
Trond

2003-05-13 18:54:29

by jjs

Subject: Re: NFS problems with Linux-2.4

jlnance wrote:

>I have seen the problem on both IA64 machines running the kernel 2.4.18
>from Red Hat's Advanced Server
>
Huh?

Advanced server is currently at 2.4.9-e.16 -

perhaps this is a self-compiled kernel?

Joe

2003-05-13 19:11:44

by Roland Dreier

Subject: Re: NFS problems with Linux-2.4

jlnance> I have seen the problem on both IA64 machines running the
jlnance> kernel 2.4.18 from Red Hat's Advanced Server

jjs> Huh?
jjs> Advanced server is currently at 2.4.9-e.16 -
jjs> perhaps this is a self-compiled kernel?

Advanced Server on IA64 is at 2.4.18, NOT 2.4.9.

- Roland

2003-05-13 21:42:41

by jjs

Subject: Re: NFS problems with Linux-2.4



Roland Dreier wrote:

> jlnance> I have seen the problem on both IA64 machines running the
> jlnance> kernel 2.4.18 from Red Hat's Advanced Server
>
> jjs> Huh?
> jjs> Advanced server is currently at 2.4.9-e.16 -
> jjs> perhaps this is a self-compiled kernel?
>
>Advanced Server on IA64 is at 2.4.18, NOT 2.4.9.
>
Ah -

Thanks for the info, I didn't know about
the version skew between platforms...

Joe

2003-05-13 23:57:16

by Alan

Subject: Re: NFS problems with Linux-2.4

On Tue, 2003-05-13 at 20:07, jjs wrote:
> jlnance wrote:
>
> >I have seen the problem on both IA64 machines running the kernel 2.4.18
> >from Red Hat's Advanced Server
> >
> Huh?
>
> Advanced server is currently at 2.4.9-e.16 -
>
> perhaps this is a self-compiled kernel?

AS for IA64 isn't the same kernel as AS for IA32.


2003-05-15 15:11:26

by Jim Nance

Subject: Re: NFS problems with Linux-2.4

On Tue, May 13, 2003 at 05:19:23PM +0200, Trond Myklebust wrote:
>
> Could you please try with a newer kernel? The close-to-open cache
> consistency fixes are a relatively recent addition to the Linux NFS
> client. I dunno if RedHat's 2.4.18 kernel has them.
>
> 2.4.7 certainly does not.

I tried again with the 2.4.20-based kernel that Red Hat released
yesterday (2.4.20-13.7bigmem). The problem that I was seeing occurs
less frequently there, but it still happens.

I have attached a program which can reproduce this. If you run it
under 2.4.7 it fails instantly. If you use 2.4.20 it may take a
minute or so but it will also fail.

Thanks,

Jim

PS: Do you know if there is any way to work around this problem from
within my program?

--
----------------------------------------------------------------------------
Jim Nance Synopsys
(919) 425-7219 Do you have sweet iced tea? jlnance at synopsys.com
No, but there's sugar on the table.


Attachments:
p1.c (4.40 kB)

2003-05-18 14:51:25

by Trond Myklebust

Subject: Re: NFS problems with Linux-2.4


Sorry. stat doesn't obey close-to-open. It relies on standard
attribute caching. close-to-open means "open()" (and only "open()")
checks data cache consistency...

Cheers,
Trond
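
To illustrate the distinction drawn above, here is a minimal sketch of an existence check that goes through open() instead of stat(). The helper name visible_via_open is hypothetical. As the later messages in this thread show, switching to open() did not by itself remove the failures on 2.4.20, so treat this as an illustration of the close-to-open idea rather than a fix.

```c
#include <fcntl.h>
#include <unistd.h>

/* Hypothetical helper: returns 1 if path can be opened read-only,
 * 0 otherwise. It deliberately uses open() rather than stat(),
 * because close-to-open consistency is only checked at open() time. */
int visible_via_open(const char *path)
{
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return 0;       /* not visible (or not readable) on this client */
    close(fd);
    return 1;
}
```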

2003-05-19 00:44:04

by jlnance

Subject: Re: NFS problems with Linux-2.4

On Sun, May 18, 2003 at 05:00:24PM +0200, Trond Myklebust wrote:
>
> Sorry. stat doesn't obey close-to-open. It relies on standard
> attribute caching. close-to-open means "open()" (and only "open()")
> checks data cache consistency...

Trond,
Thanks for the info. Here is a section of the man page for open.
Is the information it gives correct wrt using link & stat?

O_EXCL When used with O_CREAT, if the file already exists
       it is an error and the open will fail. In this context,
       a symbolic link exists, regardless of where
       it points to. O_EXCL is broken on NFS file systems,
       programs which rely on it for performing
       locking tasks will contain a race condition. The
       solution for performing atomic file locking using a
       lockfile is to create a unique file on the same fs
       (e.g., incorporating hostname and pid), use link(2)
       to make a link to the lockfile. If link() returns
       0, the lock is successful. Otherwise, use stat(2)
       on the unique file to check if its link count has
       increased to 2, in which case the lock is also
       successful.

Thanks,

Jim

2003-05-19 11:18:29

by Trond Myklebust

Subject: Re: NFS problems with Linux-2.4

>>>>> " " == jlnance writes:

> Thanks for the info. Here is a section of the man page for
> open.
> Is the information it gives correct wrt using link & stat?

Yes. Attempting to link a file will normally update the cached
attributes, so stat() will give correct results.

Cheers,
Trond
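
The link(2)/stat(2) lockfile procedure quoted from the open(2) man page can be sketched roughly like this. The function name acquire_lock and both path parameters are hypothetical. The caller is assumed to pass a unique_path on the same filesystem as the lockfile, built from something like the hostname and pid. Removing the unique file afterwards is a choice made for this sketch, not something the man page text prescribes.

```c
#include <fcntl.h>
#include <sys/stat.h>
#include <unistd.h>

/* Hypothetical helper: try to take the lock named by lockfile using a
 * caller-supplied unique file. Returns 0 if the lock was obtained and
 * -1 otherwise. The lock is released later with unlink(lockfile). */
int acquire_lock(const char *lockfile, const char *unique_path)
{
    /* Step 1: create the unique file on the same filesystem. */
    int fd = open(unique_path, O_WRONLY | O_CREAT, 0644);
    if (fd < 0)
        return -1;
    close(fd);

    /* Step 2: link() it to the lockfile; a return of 0 means success. */
    int got_lock = 0;
    if (link(unique_path, lockfile) == 0) {
        got_lock = 1;
    } else {
        /* Step 3: link() reported failure, so check the link count.
         * A count of 2 means the link was created after all. */
        struct stat st;
        if (stat(unique_path, &st) == 0 && st.st_nlink == 2)
            got_lock = 1;
    }

    /* Sketch-specific cleanup: the lockfile itself stays as the lock. */
    unlink(unique_path);
    return got_lock ? 0 : -1;
}
```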

2003-05-19 19:49:24

by Jim Nance

Subject: Re: NFS problems with Linux-2.4

On Sun, May 18, 2003 at 05:00:24PM +0200, Trond Myklebust wrote:
>
> Sorry. stat doesn't obey close-to-open. It relies on standard
> attribute caching. close-to-open means "open()" (and only "open()")
> checks data cache consistency...

Hi Trond,
I rewrote my test program so that it uses open() instead of stat().
I also changed it so that it does not rename the file after writing
it. This should leave only close, open, and unlink calls. The program
still fails for me after running for a minute or so:

cayman> ./p1 s
Failed to find #0 which client wrote
Failed on file number 10202

Again, this is with a 2.4.20 kernel. It fails much faster with 2.4.7.

Thanks,

Jim

--
----------------------------------------------------------------------------
Jim Nance Synopsys
(919) 425-7219 Do you have sweet iced tea? jlnance at synopsys.com
No, but there's sugar on the table.


Attachments:
p1.c (4.70 kB)

2003-05-27 17:16:33

by jlnance

Subject: Re: NFS problems with Linux-2.4

Hello All,
I wanted to follow up this thread now that I have a working solution.

My initial problem was that machine A would create a file and machine B
would attempt to stat() or open() it over NFS and it would not be there.
I was using the 2.4.7 kernel that came with Red Hat 7.2.

Trond suggested I try a more recent kernel since 2.4.7 had known
close-to-open cache consistency problems. I tried the 2.4.20 kernel and it
did make the problem better, but it was still there.

Someone suggested doing an opendir() to flush the NFS cache. This did
make the problem go away with the 2.4.20 kernels. With the 2.4.7
kernels, I started getting ESTALE errors after I did this. I found
that I could work around these errors by doing something like:

    f = fopen(filename, mode);

    if (!f) {
        /* Stale NFS file handle: wait briefly and retry once. */
        if (errno == ESTALE) {
            sleep(1);
            f = fopen(filename, mode);
        }
    }

which is ugly, but it allows me to run on unpatched Red Hat 7.2 systems,
which is highly desirable.

Thanks,

Jim
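
Putting the two workarounds from this message together, a combined helper might look like the sketch below. The name nfs_fopen and its dir parameter (the directory containing the file) are hypothetical. Whether opendir() actually forces the client to revalidate its cached view is taken from the experience reported above and may depend on the kernel version.

```c
#include <dirent.h>
#include <errno.h>
#include <stdio.h>
#include <unistd.h>

/* Hypothetical helper combining the two workarounds from the post:
 * 1. opendir() on the containing directory before opening the file;
 * 2. one retry after a one-second sleep if fopen() fails with ESTALE. */
FILE *nfs_fopen(const char *dir, const char *path, const char *mode)
{
    /* Open and close the directory; only the side effect matters. */
    DIR *d = opendir(dir);
    if (d != NULL)
        closedir(d);

    FILE *f = fopen(path, mode);
    if (f == NULL && errno == ESTALE) {
        sleep(1);
        f = fopen(path, mode);
    }
    return f;
}
```

A call such as nfs_fopen("/home/jlnance/work", "/home/jlnance/work/out.dat", "r") would replace a plain fopen(); the paths here are made up for illustration.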