2013-02-13 22:57:13

by Anand Avati

[permalink] [raw]
Subject: Re: regressions due to 64-bit ext4 directory cookies

On Wed, Feb 13, 2013 at 2:47 PM, Theodore Ts'o <[email protected]> wrote:

> On Wed, Feb 13, 2013 at 05:41:41PM -0500, J. Bruce Fields wrote:
> > > What if we have an ioctl or a process personality flag where a broken
> > > application can tell the file system "I'm broken, please give me a
> > > degraded telldir/seekdir cookie"? That way we don't penalize programs
> > > that are doing the right thing, while providing some accomodation for
> > > programs who are abusing the telldir cookie.
> >
> > Yeah, if there's a simple way to do that, maybe it would be worth it.
>
> Doing this as an ioctl which gets called right after opendir, i.e
> (ignoring error checking):
>
> DIR *dir = opendir("/foo/bar/baz");
> ioctl(dirfd(dir), EXT4_IOC_DEGRADED_READDIR, 1);
> ...
>
> should be quite easy. It would be a very ext3/4 specific thing,
> though.


That would work, even though it would be ext3/4 specific. What is the
recommended programmatic way to detect if the file is on ext3/4 -- we would
not want to attempt that blindly on a non-ext3/4 FS as the numerical value
of EXT4_IOC_DEGRADED_READDIR might get interpreted in dangerous ways?

Avati


Attachments:
(No filename) (156.00 B)

2013-02-14 21:46:43

by J. Bruce Fields

[permalink] [raw]
Subject: Re: [Gluster-devel] regressions due to 64-bit ext4 directory cookies

On Wed, Feb 13, 2013 at 06:44:30PM -0500, Theodore Ts'o wrote:
> On Wed, Feb 13, 2013 at 06:05:11PM -0500, J. Bruce Fields wrote:
> >
> > Would it be possible to make something work like, for example, a 31-bit
> > hash plus an offset into a hash bucket?
> >
> > I have trouble thinking about this, partly because I can't remember
> > where to find the requirements for readdir on concurrently modified
> > directories....
>
> The requires are that for a directory entry which has not been
> modified since the last opendir() or rewindir(), readdir() must return
> that directory entry exactly once.
>
> For a directory entry which has been added or removed since the last
> opendir() or rewinddir() call, it is undefined whether the directory
> entry is returned once or not at all. And a rename is defined as a
> add/remove, so it's OK for the old filename and the new file name to
> appear in the readdir() stream; it would also be OK if neither
> appeared in the readdir() stream.

That's what I couldn't remember, thanks!

--b.

>
> The SUSv3 definition of readdir() can be found here:
>
> http://pubs.opengroup.org/onlinepubs/009695399/functions/readdir.html
>
> Note also that if you look at the SuSv3 definition of seekdir(), it
> explicitly states that the value returned by telldir() is not
> guaranteed to be valid after a rewinddir() or across another opendir():
>
> If the value of loc was not obtained from an earlier call to
> telldir(), or if a call to rewinddir() occurred between the call to
> telldir() and the call to seekdir(), the results of subsequent
> calls to readdir() are unspecified.
>
> Hence, it would be legal, and arguably more correct, if we created an
> internal array of pointers into the directory structure, where the
> first call to telldir() return 1, and the second call to telldir()
> returned 2, and the third call to telldir() returned 3, regardless of
> the position in the directory, and this number was used by seekdir()
> to index into the array of pointers to return the exact location in
> the b-tree. This would completely eliminate the possibility of hash
> collisions, and guarantee that readdir() would never drop or return a
> directory entry multiple times after seekdir().
>
> This implementation approach would have a potential denial of service
> potential since each call to telldir() would potentially be allocating
> kernel memory, but as long as we make sure the OOM killler kills the
> nasty process which is calling telldir() a lot, this would probably be
> OK.
>
> It would also be legal to throw away this array after a call to
> rewinddir() and closedir(), since telldir() cookies and not guaranteed
> to valid indefinitely. See:
>
> http://pubs.opengroup.org/onlinepubs/009695399/functions/seekdir.html
>
> I suspect this would seriously screw over Gluster, though, and this
> wouldn't be a solution for NFSv3, since NFS needs long-lived directory
> cookies, and not the short-lived cookies which is all POSIX/SuSv3 guarantees.
>
> Regards,
>
> - Ted
>

2013-02-13 23:44:38

by Theodore Ts'o

[permalink] [raw]
Subject: Re: [Gluster-devel] regressions due to 64-bit ext4 directory cookies

On Wed, Feb 13, 2013 at 06:05:11PM -0500, J. Bruce Fields wrote:
>
> Would it be possible to make something work like, for example, a 31-bit
> hash plus an offset into a hash bucket?
>
> I have trouble thinking about this, partly because I can't remember
> where to find the requirements for readdir on concurrently modified
> directories....

The requires are that for a directory entry which has not been
modified since the last opendir() or rewindir(), readdir() must return
that directory entry exactly once.

For a directory entry which has been added or removed since the last
opendir() or rewinddir() call, it is undefined whether the directory
entry is returned once or not at all. And a rename is defined as a
add/remove, so it's OK for the old filename and the new file name to
appear in the readdir() stream; it would also be OK if neither
appeared in the readdir() stream.

The SUSv3 definition of readdir() can be found here:

http://pubs.opengroup.org/onlinepubs/009695399/functions/readdir.html

Note also that if you look at the SuSv3 definition of seekdir(), it
explicitly states that the value returned by telldir() is not
guaranteed to be valid after a rewinddir() or across another opendir():

If the value of loc was not obtained from an earlier call to
telldir(), or if a call to rewinddir() occurred between the call to
telldir() and the call to seekdir(), the results of subsequent
calls to readdir() are unspecified.

Hence, it would be legal, and arguably more correct, if we created an
internal array of pointers into the directory structure, where the
first call to telldir() return 1, and the second call to telldir()
returned 2, and the third call to telldir() returned 3, regardless of
the position in the directory, and this number was used by seekdir()
to index into the array of pointers to return the exact location in
the b-tree. This would completely eliminate the possibility of hash
collisions, and guarantee that readdir() would never drop or return a
directory entry multiple times after seekdir().

This implementation approach would have a potential denial of service
potential since each call to telldir() would potentially be allocating
kernel memory, but as long as we make sure the OOM killler kills the
nasty process which is calling telldir() a lot, this would probably be
OK.

It would also be legal to throw away this array after a call to
rewinddir() and closedir(), since telldir() cookies and not guaranteed
to valid indefinitely. See:

http://pubs.opengroup.org/onlinepubs/009695399/functions/seekdir.html

I suspect this would seriously screw over Gluster, though, and this
wouldn't be a solution for NFSv3, since NFS needs long-lived directory
cookies, and not the short-lived cookies which is all POSIX/SuSv3 guarantees.

Regards,

- Ted


2013-02-13 23:05:15

by J. Bruce Fields

[permalink] [raw]
Subject: Re: [Gluster-devel] regressions due to 64-bit ext4 directory cookies

On Wed, Feb 13, 2013 at 02:57:13PM -0800, Anand Avati wrote:
> On Wed, Feb 13, 2013 at 2:47 PM, Theodore Ts'o <[email protected]> wrote:
>
> > On Wed, Feb 13, 2013 at 05:41:41PM -0500, J. Bruce Fields wrote:
> > > > What if we have an ioctl or a process personality flag where a broken
> > > > application can tell the file system "I'm broken, please give me a
> > > > degraded telldir/seekdir cookie"? That way we don't penalize programs
> > > > that are doing the right thing, while providing some accomodation for
> > > > programs who are abusing the telldir cookie.
> > >
> > > Yeah, if there's a simple way to do that, maybe it would be worth it.
> >
> > Doing this as an ioctl which gets called right after opendir, i.e
> > (ignoring error checking):
> >
> > DIR *dir = opendir("/foo/bar/baz");
> > ioctl(dirfd(dir), EXT4_IOC_DEGRADED_READDIR, 1);
> > ...
> >
> > should be quite easy. It would be a very ext3/4 specific thing,
> > though.
>
>
> That would work, even though it would be ext3/4 specific. What is the
> recommended programmatic way to detect if the file is on ext3/4 -- we would
> not want to attempt that blindly on a non-ext3/4 FS as the numerical value
> of EXT4_IOC_DEGRADED_READDIR might get interpreted in dangerous ways?

We must have been through this before, but: is the only way to generate
a collision-free readdir cookie really to use a larger hash?

Would it be possible to make something work like, for example, a 31-bit
hash plus an offset into a hash bucket?

I have trouble thinking about this, partly because I can't remember
where to find the requirements for readdir on concurrently modified
directories....

--b.