2013-02-14 21:46:38

by J. Bruce Fields

Subject: Re: [Gluster-devel] regressions due to 64-bit ext4 directory cookies

On Wed, Feb 13, 2013 at 06:44:30PM -0500, Theodore Ts'o wrote:
> On Wed, Feb 13, 2013 at 06:05:11PM -0500, J. Bruce Fields wrote:
> >
> > Would it be possible to make something work like, for example, a 31-bit
> > hash plus an offset into a hash bucket?
> >
> > I have trouble thinking about this, partly because I can't remember
> > where to find the requirements for readdir on concurrently modified
> > directories....
>
> The requirements are that for a directory entry which has not been
> modified since the last opendir() or rewinddir(), readdir() must return
> that directory entry exactly once.
>
> For a directory entry which has been added or removed since the last
> opendir() or rewinddir() call, it is undefined whether the directory
> entry is returned once or not at all. And a rename is defined as an
> add/remove, so it's OK for both the old filename and the new filename
> to appear in the readdir() stream; it would also be OK if neither
> appeared in the readdir() stream.

That's what I couldn't remember, thanks!

--b.

>
> The SUSv3 definition of readdir() can be found here:
>
> http://pubs.opengroup.org/onlinepubs/009695399/functions/readdir.html
>
> Note also that if you look at the SUSv3 definition of seekdir(), it
> explicitly states that the value returned by telldir() is not
> guaranteed to be valid after a rewinddir() or across another opendir():
>
> If the value of loc was not obtained from an earlier call to
> telldir(), or if a call to rewinddir() occurred between the call to
> telldir() and the call to seekdir(), the results of subsequent
> calls to readdir() are unspecified.
>
> Hence, it would be legal, and arguably more correct, if we created an
> internal array of pointers into the directory structure, where the
> first call to telldir() returned 1, and the second call to telldir()
> returned 2, and the third call to telldir() returned 3, regardless of
> the position in the directory, and this number was used by seekdir()
> to index into the array of pointers to return the exact location in
> the b-tree. This would completely eliminate the possibility of hash
> collisions, and guarantee that readdir() would never drop or return a
> directory entry multiple times after seekdir().
>
> This implementation approach would have denial-of-service
> potential, since each call to telldir() would potentially be allocating
> kernel memory, but as long as we make sure the OOM killer kills the
> nasty process which is calling telldir() a lot, this would probably be
> OK.
>
> It would also be legal to throw away this array after a call to
> rewinddir() or closedir(), since telldir() cookies are not guaranteed
> to be valid indefinitely. See:
>
> http://pubs.opengroup.org/onlinepubs/009695399/functions/seekdir.html
>
> I suspect this would seriously screw over Gluster, though, and this
> wouldn't be a solution for NFSv3, since NFS needs long-lived directory
> cookies, and not the short-lived cookies which are all POSIX/SUSv3 guarantees.
>
> Regards,
>
> - Ted
>
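
For illustration only, here is a rough user-space sketch in C of the
per-open-directory cookie array Ted describes above: each telldir()-style
call records the exact position and hands back 1, 2, 3, ..., and a
seekdir()-style call just indexes the array. Everything named here
(struct dir_pos, struct cookie_table, cookie_tell(), cookie_seek(),
cookie_reset(), COOKIE_MAX) is hypothetical and is not taken from ext4 or
any libc. The hard cap is an assumption added to bound memory use; the
quoted mail suggests relying on the OOM killer instead.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical stand-in for an exact location in the directory b-tree. */
struct dir_pos {
	uint32_t hash;   /* hash of the entry name */
	uint32_t offset; /* position within that hash bucket */
};

/* Assumed cap on cookies per open directory, to bound memory use. */
#define COOKIE_MAX 4096

/* One table per open directory stream; zero-initialize before use. */
struct cookie_table {
	struct dir_pos *pos;
	size_t used;
	size_t cap;
};

/*
 * Record a position and return its cookie: 1 for the first call,
 * 2 for the second, and so on. Returns 0 if the cap is reached or
 * memory cannot be allocated.
 */
static long cookie_tell(struct cookie_table *t, struct dir_pos where)
{
	if (t->used == t->cap) {
		size_t ncap = t->cap ? t->cap * 2 : 16;
		struct dir_pos *np;

		if (t->used >= COOKIE_MAX)
			return 0;
		if (ncap > COOKIE_MAX)
			ncap = COOKIE_MAX;
		np = realloc(t->pos, ncap * sizeof(*np));
		if (!np)
			return 0;
		t->pos = np;
		t->cap = ncap;
	}
	t->pos[t->used++] = where;
	return (long)t->used;
}

/*
 * Look up a cookie. Returns 0 and fills *out on success, or -1 if the
 * cookie was never handed out (or was discarded by cookie_reset()).
 */
static int cookie_seek(const struct cookie_table *t, long cookie,
		       struct dir_pos *out)
{
	if (cookie < 1 || (size_t)cookie > t->used)
		return -1;
	*out = t->pos[cookie - 1];
	return 0;
}

/* Discard all cookies, as would be allowed on rewinddir()/closedir(). */
static void cookie_reset(struct cookie_table *t)
{
	free(t->pos);
	t->pos = NULL;
	t->used = 0;
	t->cap = 0;
}
```

Because a cookie is only an index, two entries whose names hash to the
same value still get distinct cookies, which is where the "no hash
collisions" property comes from. After cookie_reset() every old cookie
fails to resolve, which is the short-lived behaviour that would not be
enough for NFS.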