2003-08-19 18:25:49

by Ulrich Drepper

[permalink] [raw]
Subject: NFS regression in 2.6

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

This is a problem which pops up in the glibc test suite. It's been like
this for many weeks, even months. I just hadn't time to investigate.
But the problem is actually very easy.

Go into a directory mounted via NFS. You need write access. Then
execute this little program:

#include <errno.h>
#include <error.h>
#include <stdlib.h>
#include <unistd.h>
int
main (void)
{
char tmp[] = "estale-test.XXXXXX";
int fd = mkstemp (tmp);
if (fd == -1)
error (EXIT_FAILURE, errno, "mkstemp failed");
if (unlink (tmp) != 0)
error (EXIT_FAILURE, errno, "unlink '%s' failed", tmp);
int fd2 = dup (fd);
if (fd2 == -1)
error (EXIT_FAILURE, errno, "dup failed");
if (ftruncate (fd2, 0) != 0)
error (EXIT_FAILURE, errno, "ftruncate failed");
return 0;
}

The result is always, 100% of the time, a failure in ftruncate. The
kernel reports ESTALE. This has not been a problem in 2.4 and not even
in 2.6 until <mumble> months ago. And of course it works with local disks.

- --
- --------------. ,-. 444 Castro Street
Ulrich Drepper \ ,-----------------' \ Mountain View, CA 94041 USA
Red Hat `--' drepper at redhat.com `---------------------------
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.2.1 (GNU/Linux)

iD8DBQE/QmjB2ijCOnn/RHQRAhMhAJ94N+/b7G1jC6XVUOwCv0/rZyeeIgCdH2zg
JxbK7PqAuSMmUKQX76CjNVM=
=XUXf
-----END PGP SIGNATURE-----


2003-08-19 18:53:39

by Ulrich Drepper

[permalink] [raw]
Subject: Re: NFS regression in 2.6

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Ulrich Drepper wrote:

> Go into a directory mounted via NFS. You need write access. Then
> execute this little program:

Just to be clear: the client is running 2.6. The server in my case runs
a 2.4 kernel.

- --
- --------------. ,-. 444 Castro Street
Ulrich Drepper \ ,-----------------' \ Mountain View, CA 94041 USA
Red Hat `--' drepper at redhat.com `---------------------------
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.2.1 (GNU/Linux)

iD8DBQE/QnEj2ijCOnn/RHQRAmu9AKCJk8Khn/za3DGP/XAeioes9zk+PgCdEizd
HrSpPHo7v3KmOZRjLTeIKJk=
=XNdG
-----END PGP SIGNATURE-----

2003-08-19 19:09:00

by Andries Brouwer

[permalink] [raw]
Subject: Re: NFS regression in 2.6

On Tue, Aug 19, 2003 at 11:13:21AM -0700, Ulrich Drepper wrote:

> This is a problem which pops up in the glibc test suite. It's been like
> this for many weeks, even months. I just hadn't time to investigate.
> But the problem is actually very easy.
>
> Go into a directory mounted via NFS. You need write access. Then
> execute this little program:
>
> The result is always, 100% of the time, a failure in ftruncate. The
> kernel reports ESTALE. This has not been a problem in 2.4 and not even
> in 2.6 until <mumble> months ago. And of course it works with local disks.

I just tried NFS client 2.6.0-test3, NFS server 2.0.34, try test on client.
No problems. ftruncate did not fail.

(Do you require some NFS version?)

2003-08-19 19:23:09

by Tupshin Harper

[permalink] [raw]
Subject: Re: NFS regression in 2.6

Ulrich Drepper wrote:

>-----BEGIN PGP SIGNED MESSAGE-----
>Hash: SHA1
>
>Ulrich Drepper wrote:
>
>
>
>>Go into a directory mounted via NFS. You need write access. Then
>>execute this little program:
>>
>>
>
>Just to be clear: the client is running 2.6. The server in my case runs
>a 2.4 kernel.
>
Duplicated here. Same symptoms with a 2.6.0-test3 server. Works with 2.4
client, breaks with 2.6 client.

-Tupshin

2003-08-19 19:19:41

by Ulrich Drepper

[permalink] [raw]
Subject: Re: NFS regression in 2.6

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Andries Brouwer wrote:

> I just tried NFS client 2.6.0-test3, NFS server 2.0.34, try test on client.
> No problems. ftruncate did not fail.
>
> (Do you require some NFS version?)

I have no special settings. It's an out-of-the-box RHL9 server setting,
using NFSv3. Mounted with "rw,intr". It's mounted by autofs but this
shouldn't be of any concern here.

I can reproduce it at will and if somebody needs more info, no problem.
The server should be fine. I can do the same from another RHL9 machine
or any of our more recent kernels and it works just fine. It's only the
2.6 kernel client which fails.

Does the 2.0 kernel have NFSv3? That might be why you don't see it.

- --
- --------------. ,-. 444 Castro Street
Ulrich Drepper \ ,-----------------' \ Mountain View, CA 94041 USA
Red Hat `--' drepper at redhat.com `---------------------------
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.2.1 (GNU/Linux)

iD8DBQE/Qndn2ijCOnn/RHQRAsseAJ4zYB4GQTswat4jc0yaC2rbP/hlwQCeOfae
Bx2OFb/XFNVH7epLaKaRZIY=
=SDRK
-----END PGP SIGNATURE-----

2003-08-20 02:26:17

by Andries Brouwer

[permalink] [raw]
Subject: Re: NFS regression in 2.6

On Tue, Aug 19, 2003 at 12:15:50PM -0700, Ulrich Drepper wrote:

> > I just tried NFS client 2.6.0-test3, NFS server 2.0.34, try test on client.
> > No problems. ftruncate did not fail.
> >
> > (Do you require some NFS version?)
>
> I have no special settings. It's an out-of-the-box RHL9 server setting,
> using NFSv3.
>
> Does the 2.0 kernel have NFSv3? That might be why you don't see it.

No it doesnt.

So, instead of testing let us just read the source.
I do not completely understand all details yet, but the
interesting parts seem to live in nfs/dir.c.

Since NFS v2/v3 is stateless, doing things with open fds will fail.
This deviation from Unix semantics is repaired by replacing an unlink
by a "silly rename".

Your testprogram works under 2.4 and fails under 2.6.
The conclusion is no doubt that 2.4 does the silly rename
and 2.6 does not.
Why not?
2.4 says
error = nfs_sillyrename(dir, dentry);
2.6 says
if (atomic_read(&dentry->d_count) > 1)
error = nfs_sillyrename(dir, dentry);

Your testprogram does a dup(), but that will not increment d_count,
so that is still 1 and 2.6 doesnt do the sillyrename.

[I must read further but have no time. Probably Trond will tell us
all details.]

Andries



2003-08-20 14:35:54

by Trond Myklebust

[permalink] [raw]
Subject: Re: NFS regression in 2.6

>>>>> " " == Ulrich Drepper <[email protected]> writes:

> The result is always, 100% of the time, a failure in ftruncate.
> The kernel reports ESTALE. This has not been a problem in 2.4
> and not even in 2.6 until <mumble> months ago. And of course
> it works with local disks.

There are known bugs in the way we handle readdirplus. That's why it
only hits NFSv3. Does the following patch fix it?

Cheers,
Trond

diff -u --recursive --new-file linux-2.6.0-test2/fs/nfs/dir.c linux-2.6.0-test2-fix_cto/fs/nfs/dir.c
--- linux-2.6.0-test2/fs/nfs/dir.c 2003-06-30 07:19:26.000000000 -0700
+++ linux-2.6.0-test2-fix_cto/fs/nfs/dir.c 2003-08-11 07:54:47.000000000 -0700
@@ -557,7 +557,7 @@
/* Force a full look up iff the parent directory has changed */
if (nfs_check_verifier(dir, dentry)) {
if (nfs_lookup_verify_inode(inode, isopen))
- goto out_bad;
+ goto out_zap_parent;
goto out_valid;
}

@@ -566,7 +566,7 @@
if (memcmp(NFS_FH(inode), &fhandle, sizeof(struct nfs_fh))!= 0)
goto out_bad;
if (nfs_lookup_verify_inode(inode, isopen))
- goto out_bad;
+ goto out_zap_parent;
goto out_valid_renew;
}

@@ -587,6 +587,8 @@
unlock_kernel();
dput(parent);
return 1;
+out_zap_parent:
+ nfs_zap_caches(dir);
out_bad:
NFS_CACHEINV(dir);
if (inode && S_ISDIR(inode->i_mode)) {

2003-08-20 17:24:17

by Andries Brouwer

[permalink] [raw]
Subject: Re: NFS regression in 2.6

On Tue, Aug 19, 2003 at 10:37:50PM -0700, Trond Myklebust wrote:
> >>>>> " " == Ulrich Drepper <[email protected]> writes:
>
> > The result is always, 100% of the time, a failure in ftruncate.
> > The kernel reports ESTALE. This has not been a problem in 2.4
> > and not even in 2.6 until <mumble> months ago. And of course
> > it works with local disks.
>
> There are known bugs in the way we handle readdirplus. That's why it
> only hits NFSv3. Does the following patch fix it?

> +out_zap_parent:
> + nfs_zap_caches(dir);

I don't think it will. My analysis of yesterday night was:
- no silly rename is done
- this is because d_count equals 1
- this is because we have two different dentries for the same file
- this is caused by the fragment

/* If we're doing an exclusive create, optimize away the lookup */
if (nfs_is_exclusive_create(dir, nd))
return NULL;

in nfs/dir.c.
Do you agree?

Andries


[but I do not understand all details yet]
[may look at it again this evening if you don't tell us what happens]

2003-08-20 17:44:49

by Ulrich Drepper

[permalink] [raw]
Subject: Re: NFS regression in 2.6

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Trond Myklebust wrote:

> There are known bugs in the way we handle readdirplus. That's why it
> only hits NFSv3. Does the following patch fix it?

As Andries suspected, no change. The test still fails.

- --
- --------------. ,-. 444 Castro Street
Ulrich Drepper \ ,-----------------' \ Mountain View, CA 94041 USA
Red Hat `--' drepper at redhat.com `---------------------------
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.2.1 (GNU/Linux)

iD8DBQE/Q7NN2ijCOnn/RHQRAmudAKCzj93j8Ih/4jOXP1IcllvTQyAJUQCgmRy0
sJ3FOh4gd6tWLZEV1N75jek=
=p2xm
-----END PGP SIGNATURE-----

2003-08-20 18:06:25

by Andries Brouwer

[permalink] [raw]
Subject: Re: NFS regression in 2.6

On Wed, Aug 20, 2003 at 10:43:41AM -0700, Ulrich Drepper wrote:

> > There are known bugs in the way we handle readdirplus. That's why it
> > only hits NFSv3. Does the following patch fix it?
>
> As Andries suspected, no change. The test still fails.

Try to comment out the two lines I quoted.

2003-08-20 18:43:24

by Trond Myklebust

[permalink] [raw]
Subject: Re: NFS regression in 2.6

>>>>> " " == Andries Brouwer <[email protected]> writes:

> I don't think it will. My analysis of yesterday night was:
> - no silly rename is done
> - this is because d_count equals 1
> - this is because we have two different dentries for the same
> file
> - this is caused by the fragment

> /* If we're doing an exclusive create, optimize away
> the lookup */ if (nfs_is_exclusive_create(dir, nd))
> return NULL;

> in nfs/dir.c. Do you agree?

No... The above snippet just short-circuits the process of doing an
RPC call in order to look the file up on the *server*. Doing such a
lookup would be wrong since it can race with a file creation on
another NFS client.
IOW the result of the above 2 lines should be the immediate creation
of a negative dentry (i.e. one without an inode) that open_namei() can
pass on to vfs_create().

When we get to the unlink() call, we shouldn't be hitting nfs_lookup()
at all unless something somewhere is causing this first dentry to be
permanently dropped out of the dcache.

In short the scenario should be that

- mkstemp() does an open(O_EXCL) -> nfs_lookup() creates hashed
negative dentry -> nfs_create() then does an O_EXCL call to the
server and instantiates the dentry.

- unlink() walks the pathname -> finds the existing dentry using
cached_lookup() and only calls down to nfs_lookup_revalidate().

Cheers,
Trond

2003-08-20 19:03:52

by Ulrich Drepper

[permalink] [raw]
Subject: Re: NFS regression in 2.6

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Trond Myklebust wrote:

> In short the scenario should be that
>
> - mkstemp() does an open(O_EXCL) -> nfs_lookup() creates hashed
> negative dentry -> nfs_create() then does an O_EXCL call to the
> server and instantiates the dentry.
>
> - unlink() walks the pathname -> finds the existing dentry using
> cached_lookup() and only calls down to nfs_lookup_revalidate().

Sounds reasonable especially since the dup() call in my original example
isn't necessary. So, the shortened test case is this:

#include <errno.h>
#include <error.h>
#include <stdlib.h>
#include <unistd.h>
int
main (void)
{
char tmp[] = "estale-test.XXXXXX";
int fd = mkstemp (tmp);
if (fd == -1)
error (EXIT_FAILURE, errno, "mkstemp failed");
if (unlink (tmp) != 0)
error (EXIT_FAILURE, errno, "unlink '%s' failed", tmp);
if (ftruncate (fd, 0) != 0)
error (EXIT_FAILURE, errno, "ftruncate failed");
return 0;
}

- --
- --------------. ,-. 444 Castro Street
Ulrich Drepper \ ,-----------------' \ Mountain View, CA 94041 USA
Red Hat `--' drepper at redhat.com `---------------------------
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.2.1 (GNU/Linux)

iD4DBQE/Q8XU2ijCOnn/RHQRAtMYAJ9CgabR0FdQFG8Sobkrfv9aKloDmQCWN28G
QUj8oMiKWM7v61yupENQ+Q==
=fzKm
-----END PGP SIGNATURE-----

2003-08-20 19:52:50

by Andries Brouwer

[permalink] [raw]
Subject: Re: NFS regression in 2.6

On Wed, Aug 20, 2003 at 11:43:04AM -0700, Trond Myklebust wrote:
> >>>>> " " == Andries Brouwer <[email protected]> writes:
>
> > I don't think it will. My analysis of yesterday night was:
> > - no silly rename is done
> > - this is because d_count equals 1
> > - this is because we have two different dentries for the same
> > file
> > - this is caused by the fragment
>
> > /* If we're doing an exclusive create, optimize away
> > the lookup */ if (nfs_is_exclusive_create(dir, nd))
> > return NULL;
>
> > in nfs/dir.c. Do you agree?
>
> No... The above snippet just short-circuits the process of doing an
> RPC call in order to look the file up on the *server*. Doing such a
> lookup would be wrong since it can race with a file creation on
> another NFS client.
> IOW the result of the above 2 lines should be the immediate creation
> of a negative dentry (i.e. one without an inode) that open_namei() can
> pass on to vfs_create().

It should be. But it isnt. I propose the following patch
(with whitespace damage):

diff -u --recursive --new-file -X /linux/dontdiff a/fs/nfs/dir.c b/fs/nfs/dir.c
--- a/fs/nfs/dir.c Fri Jul 11 00:35:26 2003
+++ b/fs/nfs/dir.c Wed Aug 20 22:38:42 2003
@@ -671,8 +671,10 @@
dentry->d_op = &nfs_dentry_operations;

/* If we're doing an exclusive create, optimize away the lookup */
- if (nfs_is_exclusive_create(dir, nd))
+ if (nfs_is_exclusive_create(dir, nd)) {
+ d_add(dentry, NULL);
return NULL;
+ }

lock_kernel();
error = nfs_cached_lookup(dir, dentry, &fhandle, &fattr);

Andries

2003-08-20 20:03:10

by Trond Myklebust

[permalink] [raw]
Subject: Re: NFS regression in 2.6

>>>>> " " == Andries Brouwer <[email protected]> writes:


> It should be. But it isnt. I propose the following patch (with
> whitespace damage):

Yup... That should fix it...

Cheers,
Trond

2003-08-21 00:28:08

by Tupshin Harper

[permalink] [raw]
Subject: Re: NFS regression in 2.6 -- gnome problem

Andries Brouwer wrote:

>It should be. But it isnt. I propose the following patch
>(with whitespace damage):
>
>diff -u --recursive --new-file -X /linux/dontdiff a/fs/nfs/dir.c b/fs/nfs/dir.c
>--- a/fs/nfs/dir.c Fri Jul 11 00:35:26 2003
>+++ b/fs/nfs/dir.c Wed Aug 20 22:38:42 2003
>@@ -671,8 +671,10 @@
> dentry->d_op = &nfs_dentry_operations;
>
> /* If we're doing an exclusive create, optimize away the lookup */
>- if (nfs_is_exclusive_create(dir, nd))
>+ if (nfs_is_exclusive_create(dir, nd)) {
>+ d_add(dentry, NULL);
> return NULL;
>+ }
>
> lock_kernel();
> error = nfs_cached_lookup(dir, dentry, &fhandle, &fattr);
>
>Andries
>
>
>
This patch makes the previously posted test work for me, but I'm
experiencing a differenct NFS regression between 2.4 and 2.6. Whatever
locking method that gnome2 is using when running home directories over
nfs is failing when the client is running 2.6. Tried it again, using
2.6.0-test3 + the above patch, and the results are the same. Gnome
reports that it failed to lock it's test file, and aborts. It says that
the error was "no locks available", but I'm not sure whether to believe
that or not. The only differece is booting between 2.4.x and 2.6.x, and
it doesn't matter whether the server is running 2.4 or 2.6. Any suggestions?

-Tupshin

2003-08-21 00:38:46

by Andries Brouwer

[permalink] [raw]
Subject: Re: NFS regression in 2.6 -- gnome problem

On Wed, Aug 20, 2003 at 05:28:03PM -0700, Tupshin Harper wrote:

> This patch makes the previously posted test work for me, but I'm
> experiencing a differenct NFS regression between 2.4 and 2.6. Whatever
> locking method that gnome2 is using when running home directories over
> nfs is failing when the client is running 2.6.
> Gnome reports that it failed to lock it's test file, and aborts.
> It says that the error was "no locks available".

"Gnome" is not precise enough for me.
If you have an explicit test program that works on 2.4 and fails
on 2.6 and is not more than a single page in length, I wouldnt mind
looking at it.

2003-08-21 01:07:47

by Tupshin Harper

[permalink] [raw]
Subject: Re: NFS regression in 2.6 -- gnome problem

Andries Brouwer wrote:

>On Wed, Aug 20, 2003 at 05:28:03PM -0700, Tupshin Harper wrote:
>
>
>
>>This patch makes the previously posted test work for me, but I'm
>>experiencing a differenct NFS regression between 2.4 and 2.6. Whatever
>>locking method that gnome2 is using when running home directories over
>>nfs is failing when the client is running 2.6.
>>Gnome reports that it failed to lock it's test file, and aborts.
>>It says that the error was "no locks available".
>>
>>
>
>"Gnome" is not precise enough for me.
>If you have an explicit test program that works on 2.4 and fails
>on 2.6 and is not more than a single page in length, I wouldnt mind
>looking at it.
>
>
>
Fair enough. I don't have such an explicit test at the moment, but I
will talk to the gnome guys and see if I can come up with one. I was
reporting what the gnome2 session-manager complains about, but
comparable errors come from something as simple as gnome-terminal
(simplest program I've seen that has the problem). FWIW, this is gnome
2.2.2. I'll get back to you when I know more.

-Tupshin

2003-08-21 03:06:49

by Ulrich Drepper

[permalink] [raw]
Subject: Re: NFS regression in 2.6

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Andries Brouwer wrote:

> It should be. But it isnt. I propose the following patch
> (with whitespace damage):
> [...]

This indeed fixes the problem. Thanks,

- --
- --------------. ,-. 444 Castro Street
Ulrich Drepper \ ,-----------------' \ Mountain View, CA 94041 USA
Red Hat `--' drepper at redhat.com `---------------------------
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.2.1 (GNU/Linux)

iD8DBQE/RDcI2ijCOnn/RHQRAtBPAJ4j1oPL2DkdNwXQlH4+j2VxNppEowCgpKWM
4tezLlkNQJzwvzsuwb3U0uY=
=I0bD
-----END PGP SIGNATURE-----