LinuxLists.cc - infinite getdents64 loop

[permalink] [raw]

Subject: Re: infinite getdents64 loop

On 05/31/2011 07:26 PM, Andreas Dilger wrote:
> On 2011-05-31, at 6:35 AM, Ted Ts'o wrote:
>> On Tue, May 31, 2011 at 12:18:11PM +0200, Bernd Schubert wrote:
>>>
>>> Out of interest, did anyone ever benchmark if dirindex provides any
>>> advantages to readdir? And did those benchmarks include the
>>> disadvantages of the present implementation (non-linear inode
>>> numbers from readdir, so disk seeks on stat() (e.g. from 'ls -l') or
>>> 'rm -fr $dir')?
>>
>> The problem is that seekdir/telldir is terminally broken (and so is
>> NFSv2 for using a such a tiny cookie) in that it fundamentally assumes
>> a linear data structure. If you're going to use any kind of
>> tree-based data structure, a 32-bit "offset" for seekdir/telldir just
>> doesn't cut it. We actually play games where we memoize the low
>> 32-bits of the hash and keep track of which cookies we hand out via
>> seekdir/telldir so that things mostly work --- except for NFSv2, where
>> with the 32-bit cookie, you're just hosed.
>>
>> The reason why we have to iterate over the directory in hash tree
>> order is because if we have a leaf node split, half the directories
>> entries get copied to another directory entry, given the promises made
>> by seekdir() and telldir() about directory entries appearing exactly
>> once during a readdir() stream, even if you hold the fd open for weeks
>> or days, mean that you really have to iterate over things in hash
>> order.
>>
>> I'd have to look, since it's been too many years, but as I recall the
>> problem was that there is a common path for NFSv2 and NFSv3/v4, so we
>> don't know whether we can hand back a 32-bit cookie or a 64-bit
>> cookie, so we're always handing the NFS server a 32-bit "offset", even
>> though ew could do better. Actually, if we had an interface where we
>> could give you a 128-bit "offset" into the directory, we could
>> probably eliminate the duplicate cookie problem entirely. We just
>> send 64-bits worth of hash, plus the first two bytes of the of file
>> name.
>
> If it's of interest, we've implemented a 64-bit hash mode for ext4 to
> solve just this problem for Lustre. The llseek() code will return a
> 64-bit hash value on 64-bit systems, unless it is running for some
> process that needs a 32-bit hash value (only NFSv2, AFAIK).
>
> The attached patch can at least form the basis for being able to return
> 64-bit hash values for userspace/NFSv3/v4 when usable. The patch
> is NOT usable as it stands now, since I've had to modify it from the
> version that we are currently using for Lustre (this version hasn't
> actually been compiled), but it at least shows the outline of what needs
> to be done to get this working. None of the NFS side is implemented.

Thanks Andreas! I haven't tested it yet, but the generic idea looks
good. I guess the lower part of the patch (netfilter stuff) got
accidentally in?

Cheers,
Bernd

2011-05-30 12:43:03

[permalink] [raw]

Subject: Re: infinite getdents64 loop

On Monday 30 May 2011, Jeff Layton wrote:
> On Mon, 30 May 2011 11:37:01 +0200
> Ruediger Meier <[email protected]> wrote:
> > On Sunday 29 May 2011, Trond Myklebust wrote:
> > > It's actually a problem with the underlying filesystem: it is
> > > generating readdir 'offsets' that are not unique. In other words,
> > > if
> >
> > Does this mean ext4 generally does not work with for nfs?
>
> Does it help if you turn off the dir_index feature on the filesystem?
> See the tune2fs(8) manpage for how to do that.

Unfortunately I can't umount it allthough I did exportfs -u and lsof
doesn't show used files. (reboot not possible right now)

Hopefully I can try it until tomorrow on the other machine (have to wait
for some jobs finished).

Pity that I am not able to create such a broken ext4/nfs server from
scratch on a test machine. Seems I get it broken only if it was
maltreated by our users some time in production.

cu,
Rudi

2011-05-30 14:58:42

[permalink] [raw]

Subject: Re: infinite getdents64 loop

On Mon, 2011-05-30 at 11:37 +0200, Ruediger Meier wrote:
> On Sunday 29 May 2011, Trond Myklebust wrote:
> > It's actually a problem with the underlying filesystem: it is
> > generating readdir 'offsets' that are not unique. In other words, if
>
> Does this mean ext4 generally does not work with for nfs?

ext2/3/4 are all known to have this problem when you switch on the
hashed b-tree directories. Typically, a directory with a million entries
will have several tens of cookie collisions.

Cheers
Trond
--
Trond Myklebust
Linux NFS client maintainer

NetApp
[email protected]
http://www.netapp.com

2011-05-30 11:58:21

by Jeff Layton

[permalink] [raw]

Subject: Re: infinite getdents64 loop

On Mon, 30 May 2011 11:37:01 +0200
Ruediger Meier <[email protected]> wrote:

> On Sunday 29 May 2011, Trond Myklebust wrote:
> > It's actually a problem with the underlying filesystem: it is
> > generating readdir 'offsets' that are not unique. In other words, if
>
> Does this mean ext4 generally does not work with for nfs?
>
>

Does it help if you turn off the dir_index feature on the filesystem?
See the tune2fs(8) manpage for how to do that.

--
Jeff Layton <[email protected]>

2011-05-31 19:16:35

by Andreas Dilger

[permalink] [raw]

Subject: Re: infinite getdents64 loop

On 2011-05-31, at 11:43 AM, Bernd Schubert wrote:
> On 05/31/2011 07:26 PM, Andreas Dilger wrote:
>> If it's of interest, we've implemented a 64-bit hash mode for ext4 to
>> solve just this problem for Lustre. The llseek() code will return a
>> 64-bit hash value on 64-bit systems, unless it is running for some
>> process that needs a 32-bit hash value (only NFSv2, AFAIK).
>>
>> The attached patch can at least form the basis for being able to return
>> 64-bit hash values for userspace/NFSv3/v4 when usable. The patch
>> is NOT usable as it stands now, since I've had to modify it from the
>> version that we are currently using for Lustre (this version hasn't
>> actually been compiled), but it at least shows the outline of what needs
>> to be done to get this working. None of the NFS side is implemented.
>
> Thanks Andreas! I haven't tested it yet, but the generic idea looks good. I guess the lower part of the patch (netfilter stuff) got accidentally in?

Oops, I had refreshed the patch just before sending, and forgot to remove those parts. They are definitely not relevant to this issue.

Cheers, Andreas

2011-05-31 10:33:25

[permalink] [raw]

Subject: Re: infinite getdents64 loop

On 05/31/2011 11:47 AM, Rüdiger Meier wrote:
> On Monday 30 May 2011, Trond Myklebust wrote:
>> On Mon, 2011-05-30 at 11:37 +0200, Ruediger Meier wrote:
>>>
>>> Does this mean ext4 generally does not work with for nfs?
>>
>> ext2/3/4 are all known to have this problem when you switch on the
>> hashed b-tree directories. Typically, a directory with a million
>> entries will have several tens of cookie collisions.
>
> Ok, like Jeff mentioned in the other reply disabling dir_index solves
> it.
>
> I wish I had seen this documented somewhere before switching from xfs to
> ext4 but it's not easy to find something about these ext4/nfs probs
> without knowing the details already.
> Ext4 being default file system on many distros made me feel safe.

Well, this is hardly acceptable and we really need to find a solution. I
think any parallel filesystem and fuse, etc will have problems with that.

Out of interest, did anyone ever benchmark if dirindex provides any
advantages to readdir? And did those benchmarks include the
disadvantages of the present implementation (non-linear inode numbers
from readdir, so disk seeks on stat() (e.g. from 'ls -l') or
'rm -fr $dir')?

I see those options to solve the ext3/ext4 seek problem:

1) Break 32bit applications on 64 bit kernels

2) Update the vfs to tell the underlying functions to tell them if
lseek() was called from 64bit or 32bit userspace

3) Disable dirindexing for readdirs

Thanks,
Bernd

2011-05-29 16:55:08

[permalink] [raw]

Subject: Re: infinite getdents64 loop

On Sunday 29 May 2011, Trond Myklebust wrote:

> Sorry, but that patch makes absolutely no sense whatsoever as a fix
> for the problem you describe.

It wasn't ment to be a real fix. I just tried to find out where the prob
is roughly located.

> All you are doing is changing the size
> of the readdir cache entry, which is probably causing a READDIR with
> a duplicate cookie to trigger.

Yup, my patch "repaired" the test directory and let another one fail.
Currently Ive reverted
commit d1bacf9e, NFS: add readdir cache array
(and a lot followups) to let clients work again.

> When running with the stock 2.6.39
> client, do you see the "directory contains a readdir loop." message
> in your syslog?

Yes, didn't noticed that because I've booted 2.6.39 only a few times.
There are a lot like this:
May 25 13:26:09 kubera-114 kernel: [ 1105.419604] NFS: directory
gen/radar contains a readdir loop. Please contact your server vendor.
Offending cookie: 947700512

I hope it's not my server vendor's fault :)
Or does this mean the NFS server is bad rather than the client?

cu,
Rudi

2011-05-31 12:35:24

by Theodore Ts'o

[permalink] [raw]

Subject: Re: infinite getdents64 loop

On Tue, May 31, 2011 at 12:18:11PM +0200, Bernd Schubert wrote:
>
> Out of interest, did anyone ever benchmark if dirindex provides any
> advantages to readdir? And did those benchmarks include the
> disadvantages of the present implementation (non-linear inode
> numbers from readdir, so disk seeks on stat() (e.g. from 'ls -l') or
> 'rm -fr $dir')?

The problem is that seekdir/telldir is terminally broken (and so is
NFSv2 for using a such a tiny cookie) in that it fundamentally assumes
a linear data structure. If you're going to use any kind of
tree-based data structure, a 32-bit "offset" for seekdir/telldir just
doesn't cut it. We actually play games where we memoize the low
32-bits of the hash and keep track of which cookies we hand out via
seekdir/telldir so that things mostly work --- except for NFSv2, where
with the 32-bit cookie, you're just hosed.

The reason why we have to iterate over the directory in hash tree
order is because if we have a leaf node split, half the directories
entries get copied to another directory entry, given the promises made
by seekdir() and telldir() about directory entries appearing exactly
once during a readdir() stream, even if you hold the fd open for weeks
or days, mean that you really have to iterate over things in hash
order.

I'd have to look, since it's been too many years, but as I recall the
problem was that there is a common path for NFSv2 and NFSv3/v4, so we
don't know whether we can hand back a 32-bit cookie or a 64-bit
cookie, so we're always handing the NFS server a 32-bit "offset", even
though ew could do better. Actually, if we had an interface where we
could give you a 128-bit "offset" into the directory, we could
probably eliminate the duplicate cookie problem entirely. We just
send 64-bits worth of hash, plus the first two bytes of the of file
name.

> 3) Disable dirindexing for readdirs

That won't work, since it will break POSIX compliance. Once again,
we're tied by the decisions made decades ago...

- Ted

2011-05-28 15:00:34

[permalink] [raw]

Subject: Re: infinite getdents64 loop

On Saturday 28 May 2011, R?diger Meier wrote:
> I could track down the problem to:
>
> commit 0b26a0bf6ff398185546432420bb772bcfdf8d94
> Author: Trond Myklebust <[email protected]>
> Date: Sat Nov 20 14:26:44 2010 -0500
>
> NFS: Ensure we return the dirent->d_type when it is known
>
>
> After reverting the problem is gone.

Actually it's enough to remove d_type from struct nfs_cache_array_entry
again. It's not enough to set it DT_UNKNOWN always. I had to remove it
from struct to let it work.
Tested with kernels 2.6.37.6 and 2.6.39.

commit c9799af304af2a22acffaae25e7e9c3b733a5b68
Author: Ruediger Meier <[email protected]>
Date: Sat May 28 15:26:15 2011 +0200

hotfix, opensuse bug 678123
this reverts the effect of 0b26a0bf

diff --git a/fs/nfs/dir.c b/fs/nfs/dir.c
index 7237672..48cfc27 100644
--- a/fs/nfs/dir.c
+++ b/fs/nfs/dir.c
@@ -202,7 +202,6 @@ struct nfs_cache_array_entry {
u64 cookie;
u64 ino;
struct qstr string;
- unsigned char d_type;
};

struct nfs_cache_array {
@@ -305,7 +304,6 @@ int nfs_readdir_add_to_array(struct nfs_entry
*entry, struct page *page)

cache_entry->cookie = entry->prev_cookie;
cache_entry->ino = entry->ino;
- cache_entry->d_type = entry->d_type;
ret = nfs_readdir_make_qstr(&cache_entry->string, entry->name,
entry->len);
if (ret)
goto out;
@@ -770,7 +768,7 @@ int nfs_do_filldir(nfs_readdir_descriptor_t *desc,
void *dirent,
ent = &array->array[i];
if (filldir(dirent, ent->string.name, ent->string.len,
file->f_pos, nfs_compat_user_ino64(ent->ino),
- ent->d_type) < 0) {
+ DT_UNKNOWN) < 0) {
desc->eof = 1;
break;
}

2011-05-31 14:51:06

by Anna Schumaker

[permalink] [raw]

Subject: Re: infinite getdents64 loop

On 05/30/2011 05:37 AM, Ruediger Meier wrote:
> On Sunday 29 May 2011, Trond Myklebust wrote:
>> It's actually a problem with the underlying filesystem: it is
>> generating readdir 'offsets' that are not unique. In other words, if
>
> Does this mean ext4 generally does not work with for nfs?

It'll work for smaller directories, but when you get closer to about 300,000 entries this problem starts showing up.

- Bryan
>
>
> cu,
> Rudi
> --
> To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html

2011-05-31 17:30:26

[permalink] [raw]

Subject: Re: infinite getdents64 loop

On 05/31/2011 07:13 PM, Boaz Harrosh wrote:
> On 05/31/2011 03:35 PM, Ted Ts'o wrote:
>> On Tue, May 31, 2011 at 12:18:11PM +0200, Bernd Schubert wrote:
>>>
>>> Out of interest, did anyone ever benchmark if dirindex provides any
>>> advantages to readdir? And did those benchmarks include the
>>> disadvantages of the present implementation (non-linear inode
>>> numbers from readdir, so disk seeks on stat() (e.g. from 'ls -l') or
>>> 'rm -fr $dir')?
>>
>> The problem is that seekdir/telldir is terminally broken (and so is
>> NFSv2 for using a such a tiny cookie) in that it fundamentally assumes
>> a linear data structure. If you're going to use any kind of
>> tree-based data structure, a 32-bit "offset" for seekdir/telldir just
>> doesn't cut it. We actually play games where we memoize the low
>> 32-bits of the hash and keep track of which cookies we hand out via
>> seekdir/telldir so that things mostly work --- except for NFSv2, where
>> with the 32-bit cookie, you're just hosed.
>>
>> The reason why we have to iterate over the directory in hash tree
>> order is because if we have a leaf node split, half the directories
>> entries get copied to another directory entry, given the promises made
>> by seekdir() and telldir() about directory entries appearing exactly
>> once during a readdir() stream, even if you hold the fd open for weeks
>> or days, mean that you really have to iterate over things in hash
>> order.
>
> open fd means that it does not survive a server reboot. Why don't you
> keep an array per open fd, and hand out the array index. In the array
> you can keep a pointer to any info you want to keep. (that's the meaning of
> a cookie)

An array can take lots of memory for a large directory, of course. Do we
really want to do that in kernel space? Although I wouldn't have a
problem to reserve a certain amount of memory for that. But what do we
do if that gets exhausted (for example directory too large or several
open filedescriptors)?
And how does that help with NFS and other cluster filesystems where the
client passes over the cookie? We ignore posix compliance then?

Thanks,
Bernd

2011-05-31 17:07:42

[permalink] [raw]

Subject: Re: infinite getdents64 loop

On 05/31/2011 02:35 PM, Ted Ts'o wrote:
> On Tue, May 31, 2011 at 12:18:11PM +0200, Bernd Schubert wrote:
>>
>> Out of interest, did anyone ever benchmark if dirindex provides any
>> advantages to readdir? And did those benchmarks include the
>> disadvantages of the present implementation (non-linear inode
>> numbers from readdir, so disk seeks on stat() (e.g. from 'ls -l') or
>> 'rm -fr $dir')?
>
> The problem is that seekdir/telldir is terminally broken (and so is
> NFSv2 for using a such a tiny cookie) in that it fundamentally assumes
> a linear data structure. If you're going to use any kind of
> tree-based data structure, a 32-bit "offset" for seekdir/telldir just
> doesn't cut it. We actually play games where we memoize the low
> 32-bits of the hash and keep track of which cookies we hand out via
> seekdir/telldir so that things mostly work --- except for NFSv2, where
> with the 32-bit cookie, you're just hosed.

Well, lets just ignore NFSv2, for NFS there are better working v3 and v4
alternatives. My real concern are ext3 and ext4, which have

#define pos2min_hash(pos) (0)

>
> The reason why we have to iterate over the directory in hash tree
> order is because if we have a leaf node split, half the directories
> entries get copied to another directory entry, given the promises made
> by seekdir() and telldir() about directory entries appearing exactly
> once during a readdir() stream, even if you hold the fd open for weeks
> or days, mean that you really have to iterate over things in hash
> order.

Ah, I never looked into the dirindex implementation, I always thought
the dirindex blocks get updated and not real directory entries as well.

>
> I'd have to look, since it's been too many years, but as I recall the
> problem was that there is a common path for NFSv2 and NFSv3/v4, so we
> don't know whether we can hand back a 32-bit cookie or a 64-bit
> cookie, so we're always handing the NFS server a 32-bit "offset", even
> though ew could do better. Actually, if we had an interface where we
> could give you a 128-bit "offset" into the directory, we could
> probably eliminate the duplicate cookie problem entirely. We just
> send 64-bits worth of hash, plus the first two bytes of the of file
> name.

Well, personally I'm more interested in user space, but I don't see any
difference between NFS, other kernel paths and user space. I think this
is used for everything:

/* Some one has messed with f_pos; reset the world */
if (info->last_pos != filp->f_pos) {
free_rb_tree_fname(&info->root);
info->curr_node = NULL;
info->extra_fname = NULL;
info->curr_hash = pos2maj_hash(filp->f_pos);
info->curr_minor_hash = pos2min_hash(filp->f_pos);
}

So with the above #define pos2min_hash(), info->curr_minor_hash is
always zero with no exception. Or do I miss something?

>
>> 3) Disable dirindexing for readdirs
>
> That won't work, since it will break POSIX compliance. Once again,
> we're tied by the decisions made decades ago...

I really wonder if we couldn't set a flag somewhere to ignore posix for
applications that could handle it on their own. Pity that opendir
doesn't allow to set flags. An ioctl would be another choice.

Thanks,
Bernd

2011-05-31 17:26:36

by Andreas Dilger

[permalink] [raw]

Subject: Re: infinite getdents64 loop

On 2011-05-31, at 6:35 AM, Ted Ts'o wrote:
> On Tue, May 31, 2011 at 12:18:11PM +0200, Bernd Schubert wrote:
>>
>> Out of interest, did anyone ever benchmark if dirindex provides any
>> advantages to readdir? And did those benchmarks include the
>> disadvantages of the present implementation (non-linear inode
>> numbers from readdir, so disk seeks on stat() (e.g. from 'ls -l') or
>> 'rm -fr $dir')?
>
> The problem is that seekdir/telldir is terminally broken (and so is
> NFSv2 for using a such a tiny cookie) in that it fundamentally assumes
> a linear data structure. If you're going to use any kind of
> tree-based data structure, a 32-bit "offset" for seekdir/telldir just
> doesn't cut it. We actually play games where we memoize the low
> 32-bits of the hash and keep track of which cookies we hand out via
> seekdir/telldir so that things mostly work --- except for NFSv2, where
> with the 32-bit cookie, you're just hosed.
>
> The reason why we have to iterate over the directory in hash tree
> order is because if we have a leaf node split, half the directories
> entries get copied to another directory entry, given the promises made
> by seekdir() and telldir() about directory entries appearing exactly
> once during a readdir() stream, even if you hold the fd open for weeks
> or days, mean that you really have to iterate over things in hash
> order.
>
> I'd have to look, since it's been too many years, but as I recall the
> problem was that there is a common path for NFSv2 and NFSv3/v4, so we
> don't know whether we can hand back a 32-bit cookie or a 64-bit
> cookie, so we're always handing the NFS server a 32-bit "offset", even
> though ew could do better. Actually, if we had an interface where we
> could give you a 128-bit "offset" into the directory, we could
> probably eliminate the duplicate cookie problem entirely. We just
> send 64-bits worth of hash, plus the first two bytes of the of file
> name.

If it's of interest, we've implemented a 64-bit hash mode for ext4 to
solve just this problem for Lustre. The llseek() code will return a
64-bit hash value on 64-bit systems, unless it is running for some
process that needs a 32-bit hash value (only NFSv2, AFAIK).

The attached patch can at least form the basis for being able to return
64-bit hash values for userspace/NFSv3/v4 when usable. The patch
is NOT usable as it stands now, since I've had to modify it from the
version that we are currently using for Lustre (this version hasn't
actually been compiled), but it at least shows the outline of what needs
to be done to get this working. None of the NFS side is implemented.

>> 3) Disable dirindexing for readdirs
>
> That won't work, since it will break POSIX compliance. Once again,
> we're tied by the decisions made decades ago...

Cheers, Andreas

Attachments:

ext4-export-64bit-name-hash.patch (51.76 kB)

2011-05-29 16:05:19

[permalink] [raw]

Subject: Re: infinite getdents64 loop

On Sat, 2011-05-28 at 17:00 +0200, Rüdiger Meier wrote:
> On Saturday 28 May 2011, Rüdiger Meier wrote:
> > I could track down the problem to:
> >
> > commit 0b26a0bf6ff398185546432420bb772bcfdf8d94
> > Author: Trond Myklebust <[email protected]>
> > Date: Sat Nov 20 14:26:44 2010 -0500
> >
> > NFS: Ensure we return the dirent->d_type when it is known
> >
> >
> > After reverting the problem is gone.
>
> Actually it's enough to remove d_type from struct nfs_cache_array_entry
> again. It's not enough to set it DT_UNKNOWN always. I had to remove it
> from struct to let it work.
> Tested with kernels 2.6.37.6 and 2.6.39.

Sorry, but that patch makes absolutely no sense whatsoever as a fix for
the problem you describe. All you are doing is changing the size of the
readdir cache entry, which is probably causing a READDIR with a
duplicate cookie to trigger. When running with the stock 2.6.39 client,
do you see the "directory contains a readdir loop." message in your
syslog?

Cheers
Trond
--
Trond Myklebust
Linux NFS client maintainer

NetApp
[email protected]
http://www.netapp.com

2011-05-31 09:47:27

[permalink] [raw]

Subject: Re: infinite getdents64 loop

On Monday 30 May 2011, Trond Myklebust wrote:
> On Mon, 2011-05-30 at 11:37 +0200, Ruediger Meier wrote:
> >
> > Does this mean ext4 generally does not work with for nfs?
>
> ext2/3/4 are all known to have this problem when you switch on the
> hashed b-tree directories. Typically, a directory with a million
> entries will have several tens of cookie collisions.

Ok, like Jeff mentioned in the other reply disabling dir_index solves
it.

I wish I had seen this documented somewhere before switching from xfs to
ext4 but it's not easy to find something about these ext4/nfs probs
without knowing the details already.
Ext4 being default file system on many distros made me feel safe.

thanks for helping,
Rudi

2011-05-29 17:04:04

[permalink] [raw]

Subject: Re: infinite getdents64 loop

On Sun, 2011-05-29 at 18:55 +0200, Rüdiger Meier wrote:
> On Sunday 29 May 2011, Trond Myklebust wrote:
>
> > Sorry, but that patch makes absolutely no sense whatsoever as a fix
> > for the problem you describe.
>
> It wasn't ment to be a real fix. I just tried to find out where the prob
> is roughly located.
>
> > All you are doing is changing the size
> > of the readdir cache entry, which is probably causing a READDIR with
> > a duplicate cookie to trigger.
>
> Yup, my patch "repaired" the test directory and let another one fail.
> Currently Ive reverted
> commit d1bacf9e, NFS: add readdir cache array
> (and a lot followups) to let clients work again.
>
> > When running with the stock 2.6.39
> > client, do you see the "directory contains a readdir loop." message
> > in your syslog?
>
> Yes, didn't noticed that because I've booted 2.6.39 only a few times.
> There are a lot like this:
> May 25 13:26:09 kubera-114 kernel: [ 1105.419604] NFS: directory
> gen/radar contains a readdir loop. Please contact your server vendor.
> Offending cookie: 947700512
>
> I hope it's not my server vendor's fault :)
> Or does this mean the NFS server is bad rather than the client?

It's actually a problem with the underlying filesystem: it is generating
readdir 'offsets' that are not unique. In other words, if you use
telldir() to list out the offsets for each readdir entry on the server,
you will see the same value 947700512 above appear at least two times,
which means that 'seekdir()' is also broken, for instance.

IOW: This isn't something that we can fix on the NFS client. It needs to
be fixed on the server. The only thing that has hidden the problem
previously is blind luck (which is why your patch appeared to work).

Cheers
Trond
--
Trond Myklebust
Linux NFS client maintainer

NetApp
[email protected]
http://www.netapp.com

2011-05-31 17:13:39

by Boaz Harrosh

[permalink] [raw]

Subject: Re: infinite getdents64 loop

On 05/31/2011 03:35 PM, Ted Ts'o wrote:
> On Tue, May 31, 2011 at 12:18:11PM +0200, Bernd Schubert wrote:
>>
>> Out of interest, did anyone ever benchmark if dirindex provides any
>> advantages to readdir? And did those benchmarks include the
>> disadvantages of the present implementation (non-linear inode
>> numbers from readdir, so disk seeks on stat() (e.g. from 'ls -l') or
>> 'rm -fr $dir')?
>
> The problem is that seekdir/telldir is terminally broken (and so is
> NFSv2 for using a such a tiny cookie) in that it fundamentally assumes
> a linear data structure. If you're going to use any kind of
> tree-based data structure, a 32-bit "offset" for seekdir/telldir just
> doesn't cut it. We actually play games where we memoize the low
> 32-bits of the hash and keep track of which cookies we hand out via
> seekdir/telldir so that things mostly work --- except for NFSv2, where
> with the 32-bit cookie, you're just hosed.
>
> The reason why we have to iterate over the directory in hash tree
> order is because if we have a leaf node split, half the directories
> entries get copied to another directory entry, given the promises made
> by seekdir() and telldir() about directory entries appearing exactly
> once during a readdir() stream, even if you hold the fd open for weeks
> or days, mean that you really have to iterate over things in hash
> order.

open fd means that it does not survive a server reboot. Why don't you
keep an array per open fd, and hand out the array index. In the array
you can keep a pointer to any info you want to keep. (that's the meaning of
a cookie)

>
> I'd have to look, since it's been too many years, but as I recall the
> problem was that there is a common path for NFSv2 and NFSv3/v4, so we
> don't know whether we can hand back a 32-bit cookie or a 64-bit
> cookie, so we're always handing the NFS server a 32-bit "offset", even
> though ew could do better.

Please fix that. In the 64-bit case of NFSv3/v4 you can give out a pointer
instead of array-index. In NFSv2 on 64bit arches you are stuck with an index

> Actually, if we had an interface where we
> could give you a 128-bit "offset" into the directory, we could
> probably eliminate the duplicate cookie problem entirely. We just
> send 64-bits worth of hash, plus the first two bytes of the of file
> name.
>

If you hand out a pointer or index per fd, you could keep in memory
any info you want, as big as you need it.

>> 3) Disable dirindexing for readdirs
>
> That won't work, since it will break POSIX compliance. Once again,
> we're tied by the decisions made decades ago...
>
> - Ted

Thanks
Boaz

2011-05-30 09:37:05

[permalink] [raw]

Subject: Re: infinite getdents64 loop

On Sunday 29 May 2011, Trond Myklebust wrote:
> It's actually a problem with the underlying filesystem: it is
> generating readdir 'offsets' that are not unique. In other words, if

Does this mean ext4 generally does not work with for nfs?

cu,
Rudi

2011-06-01 13:10:36

by Boaz Harrosh

[permalink] [raw]

Subject: Re: infinite getdents64 loop

On 05/31/2011 08:30 PM, Bernd Schubert wrote:
> On 05/31/2011 07:13 PM, Boaz Harrosh wrote:
>> On 05/31/2011 03:35 PM, Ted Ts'o wrote:
>>> On Tue, May 31, 2011 at 12:18:11PM +0200, Bernd Schubert wrote:
>>>>
>>>> Out of interest, did anyone ever benchmark if dirindex provides any
>>>> advantages to readdir? And did those benchmarks include the
>>>> disadvantages of the present implementation (non-linear inode
>>>> numbers from readdir, so disk seeks on stat() (e.g. from 'ls -l') or
>>>> 'rm -fr $dir')?
>>>
>>> The problem is that seekdir/telldir is terminally broken (and so is
>>> NFSv2 for using a such a tiny cookie) in that it fundamentally assumes
>>> a linear data structure. If you're going to use any kind of
>>> tree-based data structure, a 32-bit "offset" for seekdir/telldir just
>>> doesn't cut it. We actually play games where we memoize the low
>>> 32-bits of the hash and keep track of which cookies we hand out via
>>> seekdir/telldir so that things mostly work --- except for NFSv2, where
>>> with the 32-bit cookie, you're just hosed.
>>>
>>> The reason why we have to iterate over the directory in hash tree
>>> order is because if we have a leaf node split, half the directories
>>> entries get copied to another directory entry, given the promises made
>>> by seekdir() and telldir() about directory entries appearing exactly
>>> once during a readdir() stream, even if you hold the fd open for weeks
>>> or days, mean that you really have to iterate over things in hash
>>> order.
>>
>> open fd means that it does not survive a server reboot. Why don't you
>> keep an array per open fd, and hand out the array index. In the array
>> you can keep a pointer to any info you want to keep. (that's the meaning of
>> a cookie)
>
> An array can take lots of memory for a large directory, of course. Do we
> really want to do that in kernel space? Although I wouldn't have a
> problem to reserve a certain amount of memory for that. But what do we
> do if that gets exhausted (for example directory too large or several
> open filedescriptors)?

You miss understood me. Ted was complaining that the cookie was only 32
bit and he hoped it was bigger, perhaps 128 minimum. What I said is that
for each open fd, a cookie is returned that denotes a temporary space
allocated for just that caller. When a second call with the same fd, same
cookie comes, the allocated object is inspected to retrieve all the
information needed to continue the walk from the same place. So the allocated
space is only per active caller, up to the time fd is closed.
(I never meant per directory entry)

> And how does that help with NFS and other cluster filesystems where the
> client passes over the cookie? We ignore posix compliance then?
>

I was not referring to that. I understand that this is an hard problem
but it is solvable. The space per cookie is solved above.

> Thanks,
> Bernd

But this is all talk. I don't know enough, or use, ext4 to be able to solve
it myself. So I'm just babbling out here. Just that in the server we've done
it before to keep things in an internal array and return the index as a magic
cookie, when more information was needed internally.

Boaz

2011-06-01 16:16:17