2013-02-13 16:21:04

by J. Bruce Fields

[permalink] [raw]
Subject: Re: regressions due to 64-bit ext4 directory cookies

Oops, probably should have cc'd linux-nfs.

On Wed, Feb 13, 2013 at 10:36:54AM -0500, Theodore Ts'o wrote:
> On Wed, Feb 13, 2013 at 10:19:53AM -0500, J. Bruce Fields wrote:
> > > > (In more detail: they're spreading a single directory across multiple
> > > > nodes, and encoding a node ID into the cookie they return, so they can
> > > > tell which node the cookie came from when they get it back.)
> > > >
> > > > That works if you assume the cookie is an "offset" bounded above by some
> > > > measure of the directory size, hence unlikely to ever use the high
> > > > bits....
> > >
> > > Right, but why wouldn't an nfs export option solve the problem for
> > > gluster?
> >
> > No, gluster is running on ext4 directly.
>
> OK, so let me see if I can get this straight. Each local gluster node
> is running a userspace NFS server, right?

My understanding is that only one frontend server is running the NFS server.
So in your picture below, "NFS v3" should be some internal gluster
protocol:


                                          /------ GFS Storage
                                         /        Server #1
GFS Cluster   NFS V3    GFS Cluster     /   gluster protocol
Client      <---------> Frontend Server ------------------ GFS Storage
                                        \                   Server #2
                                         \
                                          \------ GFS Storage
                                                   Server #3


That frontend server gets a readdir request for a directory which is
stored across several of the storage servers. It has to return a
cookie. It will get that cookie back from the client at some unknown
later time (possibly after the server has rebooted). So their solution
is to return a cookie from one of the storage servers, plus some kind of
node id in the top bits so they can remember which server it came from.

(I don't know much about gluster, but I think that's the basic idea.)

I've assumed that users of directory cookies should treat them as
opaque, so I don't think what gluster is doing is correct. But on the
other hand they are defined as integers and described as offsets here
and there. And I can't actually think of anything else that would work,
short of gluster generating and storing its own cookies.
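(For illustration only, a minimal sketch of that sort of scheme -- the
constants and helper names here are made up, not taken from gluster: pack a
small node id into the top bits of the storage server's cookie, and split it
back apart when the cookie comes back.)

#include <stdint.h>

#define NODE_BITS   8                           /* top bits reserved for the node id */
#define NODE_SHIFT  (64 - NODE_BITS)
#define COOKIE_MASK ((UINT64_C(1) << NODE_SHIFT) - 1)

/* Combine a storage server's cookie with the id of the node it came from.
 * Only safe if the server never actually uses those top bits. */
static uint64_t pack_cookie(uint64_t server_cookie, unsigned node_id)
{
        return ((uint64_t)node_id << NODE_SHIFT) | (server_cookie & COOKIE_MASK);
}

static unsigned cookie_node(uint64_t cookie)   { return (unsigned)(cookie >> NODE_SHIFT); }
static uint64_t cookie_offset(uint64_t cookie) { return cookie & COOKIE_MASK; }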

> Because if it were running
> a kernel-side NFS server, it would be sufficient to use an nfs export
> option.
>
> A client which mounts a "gluster file system" is also doing this via
> NFSv3, right? Or are they using their own protocol? If they are
> using their own protocol, why can't they encode the node ID somewhere
> else?
>
> So is this a correct picture of what is going on:
>
>                                           /------ GFS Storage
>                                          /        Server #1
> GFS Cluster   NFS V3    GFS Cluster     /   NFS v3
> Client      <---------> Frontend Server ------------------ GFS Storage
>                                         \                   Server #2
>                                          \
>                                           \------ GFS Storage
>                                                    Server #3
>
>
> And the reason why it needs to use the high bits is that it needs to
> coalesce the results from each GFS Storage Server for the GFS Cluster
> client?
>
> The other thing that I'd note is that the readdir cookie has been
> 64-bit since NFSv3, which was released in June ***1995***. And the
> explicit, stated purpose of making it be a 64-bit value (as stated in
> RFC 1813) was to reduce interoperability problems. If that were the
> case, are you telling me that Sun (who has traditionally been pretty
> > good at worrying about interoperability concerns, and in fact employed
> the editors of RFC 1813) didn't get this right? This seems
> quite.... surprising to me.
>
> I thought this was the whole point of the various NFS interoperability
> testing done at Connectathon, for which Sun was a major sponsor?!? No
> one noticed?!?

Beats me. But it's not necessarily easy to replace clients running
legacy applications, so we're stuck working with the clients we have....

The linux client does remap the server-provided cookies to small
integers, I believe exactly because older applications had trouble with
servers returning "large" cookies. So presumably ext4-exporting-Linux
servers aren't the first to do this.
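(A rough sketch of that remapping idea -- this is not the actual Linux client
code, just the shape of it: remember the raw 64-bit cookies in a per-directory
table and hand the application the small table index instead.)

#include <stdint.h>
#include <stdlib.h>

struct cookie_map {
        uint64_t *cookies;      /* raw server cookies, used only on the wire */
        size_t    count, cap;
};

/* Returns a small non-negative index to expose via telldir()/d_off,
 * or -1 on allocation failure. */
static long remember_cookie(struct cookie_map *m, uint64_t server_cookie)
{
        if (m->count == m->cap) {
                size_t cap = m->cap ? m->cap * 2 : 64;
                uint64_t *p = realloc(m->cookies, cap * sizeof(*p));
                if (!p)
                        return -1;
                m->cookies = p;
                m->cap = cap;
        }
        m->cookies[m->count] = server_cookie;
        return (long)m->count++;
}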

I don't know which client versions are affected--Connectathon's next
week and I'll talk to people and make sure there's an ext4 export with
this turned on to test against.

--b.


2013-02-13 22:47:27

by Theodore Ts'o

[permalink] [raw]
Subject: Re: [Gluster-devel] regressions due to 64-bit ext4 directory cookies

On Wed, Feb 13, 2013 at 05:41:41PM -0500, J. Bruce Fields wrote:
> > What if we have an ioctl or a process personality flag where a broken
> > application can tell the file system "I'm broken, please give me a
> > degraded telldir/seekdir cookie"? That way we don't penalize programs
> > that are doing the right thing, while providing some accommodation for
> > programs that are abusing the telldir cookie.
>
> Yeah, if there's a simple way to do that, maybe it would be worth it.

Doing this as an ioctl which gets called right after opendir, i.e.
(ignoring error checking):

DIR *dir = opendir("/foo/bar/baz");
ioctl(dirfd(dir), EXT4_IOC_DEGRADED_READDIR, 1);
...

should be quite easy. It would be a very ext3/4 specific thing,
though.
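(A slightly fuller usage sketch, with the error checking put back in.
EXT4_IOC_DEGRADED_READDIR is only a proposal in this thread, so the
definition below is a placeholder purely to make the sketch self-contained.)

#include <dirent.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <linux/ioctl.h>

#ifndef EXT4_IOC_DEGRADED_READDIR
#define EXT4_IOC_DEGRADED_READDIR _IOW('f', 42, int)    /* hypothetical number */
#endif

int main(void)
{
        DIR *dir = opendir("/foo/bar/baz");
        if (!dir) {
                perror("opendir");
                return 1;
        }

        /* Ask for 32-bit "degraded" cookies; filesystems without the
         * (hypothetical) ioctl would presumably just fail with ENOTTY. */
        if (ioctl(dirfd(dir), EXT4_IOC_DEGRADED_READDIR, 1) == -1)
                perror("EXT4_IOC_DEGRADED_READDIR");

        struct dirent *de;
        while ((de = readdir(dir)) != NULL)
                printf("%-20s d_off=%lld\n", de->d_name, (long long)de->d_off);

        closedir(dir);
        return 0;
}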

It would be more work to get something in as a process personality
flag, mostly due to the politics of assigning a bit out of the
bitfield.

- Ted


2013-02-14 03:59:20

by Myklebust, Trond

[permalink] [raw]
Subject: RE: regressions due to 64-bit ext4 directory cookies

> -----Original Message-----
> From: J. Bruce Fields [mailto:[email protected]]
> Sent: Wednesday, February 13, 2013 4:34 PM
> To: Myklebust, Trond
> Cc: Theodore Ts'o; [email protected]; [email protected];
> Bernd Schubert; [email protected]; [email protected]
> Subject: Re: regressions due to 64-bit ext4 directory cookies
>
> On Wed, Feb 13, 2013 at 04:43:05PM +0000, Myklebust, Trond wrote:
> > On Wed, 2013-02-13 at 11:20 -0500, J. Bruce Fields wrote:
> > > Oops, probably should have cc'd linux-nfs.
> > >
> > > On Wed, Feb 13, 2013 at 10:36:54AM -0500, Theodore Ts'o wrote:
> > > > The other thing that I'd note is that the readdir cookie has been
> > > > 64-bit since NFSv3, which was released in June ***1995***. And
> > > > the explicit, stated purpose of making it be a 64-bit value (as
> > > > stated in RFC 1813) was to reduce interoperability problems. If
> > > > that were the case, are you telling me that Sun (who has
> > > > > traditionally been pretty good at worrying about interoperability
> > > > concerns, and in fact employed the editors of RFC 1813) didn't get
> > > > this right? This seems quite.... surprising to me.
> > > >
> > > > I thought this was the whole point of the various NFS
> > > > interoperability testing done at Connectathon, for which Sun was a
> > > > major sponsor?!? No one noticed?!?
> > >
> > > Beats me. But it's not necessarily easy to replace clients running
> > > legacy applications, so we're stuck working with the clients we have....
> > >
> > > The linux client does remap the server-provided cookies to small
> > > integers, I believe exactly because older applications had trouble
> > > with servers returning "large" cookies. So presumably
> > > ext4-exporting-Linux servers aren't the first to do this.
> > >
> > > I don't know which client versions are affected--Connectathon's next
> > > week and I'll talk to people and make sure there's an ext4 export
> > > with this turned on to test against.
> >
> > Actually, one of the main reasons for the Linux client not exporting
> > raw readdir cookies is because the glibc-2 folks in their infinite
> > wisdom declared that telldir()/seekdir() use an off_t. They then went
> > yet one further and decided to declare negative offsets to be illegal
> > so that they could use the negative values internally in their syscall
> wrappers.
> >
> > The POSIX definition has none of the above rubbish
> > (http://pubs.opengroup.org/onlinepubs/009695399/functions/telldir.html
> > ) and so glibc brilliantly saddled Linux with a crippled readdir
> > implementation that is _not_ POSIX compatible.
> >
> > No, I'm not at all bitter...
>
> Oh, right, I knew I'd forgotten part of the story....
>
> But then you must have actually been testing against servers that were using
> that 32nd bit?
>
> I think ext4 actually only uses 31 bits even in the 32-bit case. And for a server
> that was literally using an offset inside a directory file, that would be a
> colossal directory.
>
> So I'm wondering how you ran across it.
>
> Partly just pure curiosity.

IIRC, XFS on IRIX used 0xFFFFF as the readdir eof marker, which caused us to generate an EIO...

Cheers
Trond

2013-02-15 02:27:41

by Dave Chinner

[permalink] [raw]
Subject: Re: [Gluster-devel] regressions due to 64-bit ext4 directory cookies

On Thu, Feb 14, 2013 at 05:01:10PM -0500, J. Bruce Fields wrote:
> On Thu, Feb 14, 2013 at 05:10:02PM +1100, Dave Chinner wrote:
> > On Wed, Feb 13, 2013 at 05:20:52PM -0500, Theodore Ts'o wrote:
> > > Telldir() and seekdir() are basically implementation horrors for any
> > > file system that is using anything other than a simple array of
> > > directory entries a la the V7 Unix file system or the BSD FFS. For any
> > > file system which is using a more advanced data structure, like
> > > b-trees, hash trees, etc., there **can't** possibly be an "offset" into a
> > > readdir stream.
> >
> > I'll just point you to this:
> >
> > http://marc.info/?l=linux-ext4&m=136081996316453&w=2
> >
> > so you can see that XFS implements what you say can't possibly be
> > done. ;)
> >
> > FWIW, that post only talked about the data segment. I didn't mention
> > that XFS has 2 other segments in the directory file (both beyond
> > EOF) for the directory data indexes. One contains the name-hash btree
> > index used for name based lookups and the other contains a freespace
> > index for tracking free space in the data segment.
>
> OK, so in some sense that reduces the problem to that of implementing
> readdir cookies for directories that are stored in a simple linear
> array.

*nod*

> Which I should know how to do but I don't: I guess all you need is a
> provision for making holes on remove (so that you aren't required to move
> existing entries, messing up offsets for concurrent readers)?

Exactly.

The data segment is a virtual mapping that is maintained by the
extent tree, so we can simply punch holes in it for directory blocks
that are empty and no longer referenced. i.e. the data segment
really is just a sparse file.

The result of doing block mapping this way is that the freespace
tracking segment actually only needs to track space in partially
used blocks. Hence we only need to allocate new blocks when the
freespace map empties, and we work out where to allocate the new
block in the virtual map by doing an extent tree lookup to find the
first hole....
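(To make that concrete with a userspace analogy: punching a hole leaves every
other offset in the file exactly where it was, which is the property the
directory cookies depend on. A minimal sketch using the generic fallocate()
interface -- an analogy for what the filesystem does internally with its
extent tree, not actual XFS code.)

#define _GNU_SOURCE
#include <fcntl.h>

/* Free an empty directory block without shifting any later entries:
 * offsets already handed out as cookies remain valid. */
static int free_dir_block(int fd, off_t block_off, off_t block_size)
{
        return fallocate(fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
                         block_off, block_size);
}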

> Purely out of curiosity: is there a more detailed writeup of XFS's
> directory format? (Or a pointer to a piece of the code a person could
> understand without losing a month to it?)

Not really. There's documentation of the on-disk structures, but
it's a massive leap from there to understanding the structure and
how it all ties together. I've been spending the past couple of
months deep in the depths of the XFS directory code so how it all
works is front-and-center in my brain right now...

That said, the thought had crossed my mind that there's a couple
of LWN articles/conference talks I could put together as a brain
dump. ;)

Cheers,

Dave.
--
Dave Chinner
[email protected]

2013-02-14 22:01:17

by J. Bruce Fields

[permalink] [raw]
Subject: Re: [Gluster-devel] regressions due to 64-bit ext4 directory cookies

On Thu, Feb 14, 2013 at 05:10:02PM +1100, Dave Chinner wrote:
> On Wed, Feb 13, 2013 at 05:20:52PM -0500, Theodore Ts'o wrote:
> > Telldir() and seekdir() are basically implementation horrors for any
> > file system that is using anything other than a simple array of
> > directory entries a la the V7 Unix file system or the BSD FFS. For any
> > file system which is using a more advanced data structure, like
> > b-trees, hash trees, etc., there **can't** possibly be an "offset" into a
> > readdir stream.
>
> I'll just point you to this:
>
> http://marc.info/?l=linux-ext4&m=136081996316453&w=2
>
> so you can see that XFS implements what you say can't possibly be
> done. ;)
>
> FWIW, that post only talked about the data segment. I didn't mention
> that XFS has 2 other segments in the directory file (both beyond
> EOF) for the directory data indexes. One contains the name-hash btree
> index used for name based lookups and the other contains a freespace
> index for tracking free space in the data segment.

OK, so in some sense that reduces the problem to that of implementing
readdir cookies for directories that are stored in a simple linear
array.

Which I should know how to do but I don't: I guess all you need is a
provision for making holes on remove (so that you aren't required to move
existing entries, messing up offsets for concurrent readers)?

Purely out of curiosity: is there a more detailed writeup of XFS's
directory format? (Or a pointer to a piece of the code a person could
understand without losing a month to it?)

--b.

>
> IOWs persistent, deterministic, low cost telldir/seekdir behaviour
> was a problem solved in the 1990s. :)


2013-02-14 06:10:06

by Dave Chinner

[permalink] [raw]
Subject: Re: [Gluster-devel] regressions due to 64-bit ext4 directory cookies

On Wed, Feb 13, 2013 at 05:20:52PM -0500, Theodore Ts'o wrote:
> Telldir() and seekdir() are basically implementation horrors for any
> file system that is using anything other than a simple array of
> directory entries a la the V7 Unix file system or the BSD FFS. For any
> file system which is using a more advanced data structure, like
> b-trees, hash trees, etc., there **can't** possibly be an "offset" into a
> readdir stream.

I'll just point you to this:

http://marc.info/?l=linux-ext4&m=136081996316453&w=2

so you can see that XFS implements what you say can't possibly be
done. ;)

FWIW, that post only talked about the data segment. I didn't mention
that XFS has 2 other segments in the directory file (both beyond
EOF) for the directory data indexes. One contains the name-hash btree
index used for name based lookups and the other contains a freespace
index for tracking free space in the data segment.

IOWs persistent, deterministic, low cost telldir/seekdir behaviour
was a problem solved in the 1990s. :)

Cheers,

Dave.
--
Dave Chinner
[email protected]

2013-02-13 16:43:07

by Myklebust, Trond

[permalink] [raw]
Subject: Re: regressions due to 64-bit ext4 directory cookies

On Wed, 2013-02-13 at 11:20 -0500, J. Bruce Fields wrote:
> Oops, probably should have cc'd linux-nfs.
>
> On Wed, Feb 13, 2013 at 10:36:54AM -0500, Theodore Ts'o wrote:
> > On Wed, Feb 13, 2013 at 10:19:53AM -0500, J. Bruce Fields wrote:
> > > > > (In more detail: they're spreading a single directory across multiple
> > > > > nodes, and encoding a node ID into the cookie they return, so they can
> > > > > tell which node the cookie came from when they get it back.)
> > > > >
> > > > > That works if you assume the cookie is an "offset" bounded above by some
> > > > > measure of the directory size, hence unlikely to ever use the high
> > > > > bits....
> > > >
> > > > Right, but why wouldn't an nfs export option solve the problem for
> > > > gluster?
> > >
> > > No, gluster is running on ext4 directly.
> >
> > OK, so let me see if I can get this straight. Each local gluster node
> > is running a userspace NFS server, right?
>
> My understanding is that only one frontend server is running the NFS server.
> So in your picture below, "NFS v3" should be some internal gluster
> protocol:
>
>
>                                           /------ GFS Storage
>                                          /        Server #1
> GFS Cluster   NFS V3    GFS Cluster     /   gluster protocol
> Client      <---------> Frontend Server ------------------ GFS Storage
>                                         \                   Server #2
>                                          \
>                                           \------ GFS Storage
>                                                    Server #3
>
>
> That frontend server gets a readdir request for a directory which is
> stored across several of the storage servers. It has to return a
> cookie. It will get that cookie back from the client at some unknown
> later time (possibly after the server has rebooted). So their solution
> is to return a cookie from one of the storage servers, plus some kind of
> node id in the top bits so they can remember which server it came from.
>
> (I don't know much about gluster, but I think that's the basic idea.)
>
> I've assumed that users of directory cookies should treat them as
> opaque, so I don't think what gluster is doing is correct. But on the
> other hand they are defined as integers and described as offsets here
> and there. And I can't actually think of anything else that would work,
> short of gluster generating and storing its own cookies.
>
> > Because if it were running
> > a kernel-side NFS server, it would be sufficient to use an nfs export
> > option.
> >
> > A client which mounts a "gluster file system" is also doing this via
> > NFSv3, right? Or are they using their own protocol? If they are
> > using their own protocol, why can't they encode the node ID somewhere
> > else?
> >
> > So is this a correct picture of what is going on:
> >
> >                                           /------ GFS Storage
> >                                          /        Server #1
> > GFS Cluster   NFS V3    GFS Cluster     /   NFS v3
> > Client      <---------> Frontend Server ------------------ GFS Storage
> >                                         \                   Server #2
> >                                          \
> >                                           \------ GFS Storage
> >                                                    Server #3
> >
> >
> > And the reason why it needs to use the high bits is that it needs to
> > coalesce the results from each GFS Storage Server for the GFS Cluster
> > client?
> >
> > The other thing that I'd note is that the readdir cookie has been
> > 64-bit since NFSv3, which was released in June ***1995***. And the
> > explicit, stated purpose of making it be a 64-bit value (as stated in
> > RFC 1813) was to reduce interoperability problems. If that were the
> > case, are you telling me that Sun (who has traditionally been pretty
> > good at worrying about interoperability concerns, and in fact employed
> > the editors of RFC 1813) didn't get this right? This seems
> > quite.... surprising to me.
> >
> > I thought this was the whole point of the various NFS interoperability
> > testing done at Connectathon, for which Sun was a major sponsor?!? No
> > one noticed?!?
>
> Beats me. But it's not necessarily easy to replace clients running
> legacy applications, so we're stuck working with the clients we have....
>
> The linux client does remap the server-provided cookies to small
> integers, I believe exactly because older applications had trouble with
> servers returning "large" cookies. So presumably ext4-exporting-Linux
> servers aren't the first to do this.
>
> I don't know which client versions are affected--Connectathon's next
> week and I'll talk to people and make sure there's an ext4 export with
> this turned on to test against.

Actually, one of the main reasons for the Linux client not exporting raw
readdir cookies is because the glibc-2 folks in their infinite wisdom
declared that telldir()/seekdir() use an off_t. They then went yet one
further and decided to declare negative offsets to be illegal so that
they could use the negative values internally in their syscall wrappers.
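(To illustrate the constraint being described: once a raw 64-bit cookie with
the top bit set is squeezed through telldir()'s off_t return type, it comes
out negative, which is exactly what those wrappers reject. A toy
demonstration, assuming the usual LP64 two's-complement behaviour.)

#include <stdint.h>
#include <stdio.h>
#include <sys/types.h>

int main(void)
{
        uint64_t cookie = UINT64_C(0x8000000000001234);  /* server uses all 64 bits */

        /* In practice this wraps to a negative value on LP64 systems. */
        off_t pos = (off_t)cookie;

        printf("cookie as off_t: %lld\n", (long long)pos);
        return 0;
}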

The POSIX definition has none of the above rubbish
(http://pubs.opengroup.org/onlinepubs/009695399/functions/telldir.html)
and so glibc brilliantly saddled Linux with a crippled readdir
implementation that is _not_ POSIX compatible.

No, I'm not at all bitter...

Trond

2013-02-13 22:41:45

by J. Bruce Fields

[permalink] [raw]
Subject: Re: [Gluster-devel] regressions due to 64-bit ext4 directory cookies

On Wed, Feb 13, 2013 at 05:20:52PM -0500, Theodore Ts'o wrote:
> On Wed, Feb 13, 2013 at 01:21:06PM -0800, Anand Avati wrote:
> >
> > NFS uses the term cookies, while man pages of readdir/seekdir/telldir call
> > them "offsets".
>
> Unfortunately, telldir and seekdir are part of the "unspeakable Unix
> design horrors" which have been with us for 25+ years. To quote from
> the rationale section from the Single Unix Specification v3 (there is
> similar language in the Posix spec).
>
> The original standard developers perceived that there were
> restrictions on the use of the seekdir() and telldir() functions
> related to implementation details, and for that reason these
> functions need not be supported on all POSIX-conforming
> systems. They are required on implementations supporting the XSI
> extension.
>
> One of the perceived problems of implementation is that returning
> to a given point in a directory is quite difficult to describe
> formally, in spite of its intuitive appeal, when systems that use
> B-trees, hashing functions, or other similar mechanisms to order
> their directories are considered. The definition of seekdir() and
> telldir() does not specify whether, when using these interfaces, a
> given directory entry will be seen at all, or more than once.
>
> On systems not supporting these functions, their capability can
> sometimes be accomplished by saving a filename found by readdir()
> and later using rewinddir() and a loop on readdir() to relocate
> the position from which the filename was saved.
>
>
> Telldir() and seekdir() are basically implementation horrors for any
> file system that is using anything other than a simple array of
> directory entries a la the V7 Unix file system or the BSD FFS. For any
> file system which is using a more advanced data structure, like
> b-trees, hash trees, etc., there **can't** possibly be an "offset" into a
> readdir stream. This is why ext3/ext4 uses a telldir cookie, and it's
> why the NFS specifications refer to it as a cookie. If you are using
> a modern file system, it can't possibly be an offset.
>
> > You can always say "this is your fault" for interpreting the man pages
> > differently and punish us by leaving things as they are (and unfortunately
> > a big chunk of users who want both ext4 and gluster jeopardized). Or you
> > can be kind, generous and be considerate to the legacy apps and users (of
> > which gluster is only a subset) and only provide a mount option to control
> > the large d_off behavior.
>
> The problem is that we made this change to fix real problems that take
> place when you have hash collisions. And if you are using a 31-bit
> cookie, the birthday paradox means that by the time you have a
> directory with 2**16 entries, the chances of hash collisions are very
> real. This could result in NFS readdir getting stuck in loops where
> it constantly gets the file "foo.c", and then when it passes the
> 31-bit cookie for "bar.c", since there is a hash collision, it gets
> "foo.c" again, and the readdir never terminates.
>
> So the problem is that you are effectively asking me to penalize
> well-behaved programs that don't try to steal bits from the top of the
> telldir cookie, just for the benefit of gluster.
>
> What if we have an ioctl or a process personality flag where a broken
> application can tell the file system "I'm broken, please give me a
> degraded telldir/seekdir cookie"? That way we don't penalize programs
> that are doing the right thing, while providing some accommodation for
> programs that are abusing the telldir cookie.

Yeah, if there's a simple way to do that, maybe it would be worth it.

--b.

2013-02-14 05:45:39

by Dave Chinner

[permalink] [raw]
Subject: Re: regressions due to 64-bit ext4 directory cookies

On Thu, Feb 14, 2013 at 03:59:17AM +0000, Myklebust, Trond wrote:
> > -----Original Message-----
> > From: J. Bruce Fields [mailto:[email protected]]
> > Sent: Wednesday, February 13, 2013 4:34 PM
> > To: Myklebust, Trond
> > Cc: Theodore Ts'o; [email protected]; [email protected];
> > Bernd Schubert; [email protected]; [email protected]
> > Subject: Re: regressions due to 64-bit ext4 directory cookies
> >
> > On Wed, Feb 13, 2013 at 04:43:05PM +0000, Myklebust, Trond wrote:
> > > On Wed, 2013-02-13 at 11:20 -0500, J. Bruce Fields wrote:
> > > > Oops, probably should have cc'd linux-nfs.
> > > >
> > > > On Wed, Feb 13, 2013 at 10:36:54AM -0500, Theodore Ts'o wrote:
> > > > > The other thing that I'd note is that the readdir cookie has been
> > > > > 64-bit since NFSv3, which was released in June ***1995***. And
> > > > > the explicit, stated purpose of making it be a 64-bit value (as
> > > > > stated in RFC 1813) was to reduce interoperability problems. If
> > > > > that were the case, are you telling me that Sun (who has
> > > > > traditionally been pretty good at worrying about interoperability
> > > > > concerns, and in fact employed the editors of RFC 1813) didn't get
> > > > > this right? This seems quite.... surprising to me.
> > > > >
> > > > > I thought this was the whole point of the various NFS
> > > > > interoperability testing done at Connectathon, for which Sun was a
> > > > > major sponsor?!? No one noticed?!?
> > > >
> > > > Beats me. But it's not necessarily easy to replace clients running
> > > > legacy applications, so we're stuck working with the clients we have....
> > > >
> > > > The linux client does remap the server-provided cookies to small
> > > > integers, I believe exactly because older applications had trouble
> > > > with servers returning "large" cookies. So presumably
> > > > ext4-exporting-Linux servers aren't the first to do this.
> > > >
> > > > I don't know which client versions are affected--Connectathon's next
> > > > week and I'll talk to people and make sure there's an ext4 export
> > > > with this turned on to test against.
> > >
> > > Actually, one of the main reasons for the Linux client not exporting
> > > raw readdir cookies is because the glibc-2 folks in their infinite
> > > wisdom declared that telldir()/seekdir() use an off_t. They then went
> > > yet one further and decided to declare negative offsets to be illegal
> > > so that they could use the negative values internally in their syscall
> > wrappers.
> > >
> > > The POSIX definition has none of the above rubbish
> > > (http://pubs.opengroup.org/onlinepubs/009695399/functions/telldir.html
> > > ) and so glibc brilliantly saddled Linux with a crippled readdir
> > > implementation that is _not_ POSIX compatible.
> > >
> > > No, I'm not at all bitter...
> >
> > Oh, right, I knew I'd forgotten part of the story....
> >
> > But then you must have actually been testing against servers that were using
> > that 32nd bit?
> >
> > I think ext4 actually only uses 31 bits even in the 32-bit case. And for a server
> > that was literally using an offset inside a directory file, that would be a
> > colossal directory.

That's exactly what XFS directory cookies are - a direct encoding of
the dirent offset into the directory file. Which means an overflow
would occur at 16GB of directory data for XFS. That is in the realm
of several hundreds of millions of files in a single directory,
which I have seen done before....

> > So I'm wondering how you ran across it.
> >
> > Partly just pure curiosity.
>
> IIRC, XFS on IRIX used 0xFFFFF as the readdir eof marker, which caused us to generate an EIO...

And this discussion explains the magic 0x7fffffff offset mask in the
linux XFS readdir code. I've been trying to find out for years
exactly why that was necessary, and now I know.

I probably should write a patch that makes it a "non-magic" number
and removes it completely for 64-bit platforms before I forget again...

Cheers,

Dave.
--
Dave Chinner
[email protected]

2013-02-13 21:33:54

by J. Bruce Fields

[permalink] [raw]
Subject: Re: regressions due to 64-bit ext4 directory cookies

On Wed, Feb 13, 2013 at 04:43:05PM +0000, Myklebust, Trond wrote:
> On Wed, 2013-02-13 at 11:20 -0500, J. Bruce Fields wrote:
> > Oops, probably should have cc'd linux-nfs.
> >
> > On Wed, Feb 13, 2013 at 10:36:54AM -0500, Theodore Ts'o wrote:
> > > The other thing that I'd note is that the readdir cookie has been
> > > 64-bit since NFSv3, which was released in June ***1995***. And the
> > > explicit, stated purpose of making it be a 64-bit value (as stated in
> > > RFC 1813) was to reduce interoperability problems. If that were the
> > > case, are you telling me that Sun (who has traditionally been pretty
> > > good at worrying about interoperability concerns, and in fact employed
> > > the editors of RFC 1813) didn't get this right? This seems
> > > quite.... surprising to me.
> > >
> > > I thought this was the whole point of the various NFS interoperability
> > > testing done at Connectathon, for which Sun was a major sponsor?!? No
> > > one noticed?!?
> >
> > Beats me. But it's not necessarily easy to replace clients running
> > legacy applications, so we're stuck working with the clients we have....
> >
> > The linux client does remap the server-provided cookies to small
> > integers, I believe exactly because older applications had trouble with
> > servers returning "large" cookies. So presumably ext4-exporting-Linux
> > servers aren't the first to do this.
> >
> > I don't know which client versions are affected--Connectathon's next
> > week and I'll talk to people and make sure there's an ext4 export with
> > this turned on to test against.
>
> Actually, one of the main reasons for the Linux client not exporting raw
> readdir cookies is because the glibc-2 folks in their infinite wisdom
> declared that telldir()/seekdir() use an off_t. They then went yet one
> further and decided to declare negative offsets to be illegal so that
> they could use the negative values internally in their syscall wrappers.
>
> The POSIX definition has none of the above rubbish
> (http://pubs.opengroup.org/onlinepubs/009695399/functions/telldir.html)
> and so glibc brilliantly saddled Linux with a crippled readdir
> implementation that is _not_ POSIX compatible.
>
> No, I'm not at all bitter...

Oh, right, I knew I'd forgotten part of the story....

But then you must have actually been testing against servers that were
using that 32nd bit?

I think ext4 actually only uses 31 bits even in the 32-bit case. And
for a server that was literally using an offset inside a directory file,
that would be a colossal directory.

So I'm wondering how you ran across it.

Partly just pure curiosity.

--b.

2013-02-13 22:21:02

by Theodore Ts'o

[permalink] [raw]
Subject: Re: [Gluster-devel] regressions due to 64-bit ext4 directory cookies

On Wed, Feb 13, 2013 at 01:21:06PM -0800, Anand Avati wrote:
>
> NFS uses the term cookies, while man pages of readdir/seekdir/telldir call
> them "offsets".

Unfortunately, telldir and seekdir are part of the "unspeakable Unix
design horrors" which have been with us for 25+ years. To quote from
the rationale section from the Single Unix Specification v3 (there is
similar language in the Posix spec).

The original standard developers perceived that there were
restrictions on the use of the seekdir() and telldir() functions
related to implementation details, and for that reason these
functions need not be supported on all POSIX-conforming
systems. They are required on implementations supporting the XSI
extension.

One of the perceived problems of implementation is that returning
to a given point in a directory is quite difficult to describe
formally, in spite of its intuitive appeal, when systems that use
B-trees, hashing functions, or other similar mechanisms to order
their directories are considered. The definition of seekdir() and
telldir() does not specify whether, when using these interfaces, a
given directory entry will be seen at all, or more than once.

On systems not supporting these functions, their capability can
sometimes be accomplished by saving a filename found by readdir()
and later using rewinddir() and a loop on readdir() to relocate
the position from which the filename was saved.


Telldir() and seekdir() are basically implementation horrors for any
file system that is using anything other than a simple array of
directory entries a la the V7 Unix file system or the BSD FFS. For any
file system which is using a more advanced data structure, like
b-trees, hash trees, etc., there **can't** possibly be an "offset" into a
readdir stream. This is why ext3/ext4 uses a telldir cookie, and it's
why the NFS specifications refer to it as a cookie. If you are using
a modern file system, it can't possibly be an offset.

> You can always say "this is your fault" for interpreting the man pages
> differently and punish us by leaving things as they are (and unfortunately
> a big chunk of users who want both ext4 and gluster jeopardized). Or you
> can be kind, generous and be considerate to the legacy apps and users (of
> which gluster is only a subset) and only provide a mount option to control
> the large d_off behavior.

The problem is that we made this change to fix real problems that take
place when you have hash collisions. And if you are using a 31-bit
cookie, the birthday paradox means that by the time you have a
directory with 2**16 entries, the chances of hash collisions are very
real. This could result in NFS readdir getting stuck in loops where
it constantly gets the file "foo.c", and then when it passes the
31-bit cookie for "bar.c", since there is a hash collision, it gets
"foo.c" again, and the readdir never terminates.

So the problem is that you are effectively asking me to penalize
well-behaved programs that don't try to steal bits from the top of the
telldir cookie, just for the benefit of gluster.
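(Roughly quantifying the collision risk described above, using the standard
birthday approximation -- just arithmetic, not ext4 code.)

#include <math.h>
#include <stdio.h>

int main(void)
{
        double entries = 65536.0;          /* 2**16 directory entries      */
        double space   = 2147483648.0;     /* 2**31 possible 31-bit hashes */

        /* Birthday approximation: P(collision) ~= 1 - exp(-n^2 / 2N) */
        double p = 1.0 - exp(-(entries * entries) / (2.0 * space));

        printf("collision probability ~= %.2f\n", p);   /* about 0.63 */
        return 0;
}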

What if we have an ioctl or a process personality flag where a broken
application can tell the file system "I'm broken, please give me a
degraded telldir/seekdir cookie"? That way we don't penalize programs
that are doing the right thing, while providing some accommodation for
programs that are abusing the telldir cookie.

- Ted

2013-02-13 21:21:06

by Anand Avati

[permalink] [raw]
Subject: Re: regressions due to 64-bit ext4 directory cookies

>
> My understanding is that only one frontend server is running the NFS server.
> So in your picture below, "NFS v3" should be some internal gluster
> protocol:
>
>
>                                           /------ GFS Storage
>                                          /        Server #1
> GFS Cluster   NFS V3    GFS Cluster     /   gluster protocol
> Client      <---------> Frontend Server ------------------ GFS Storage
>                                         \                   Server #2
>                                          \
>                                           \------ GFS Storage
>                                                    Server #3
>
>
> That frontend server gets a readdir request for a directory which is
> stored across several of the storage servers. It has to return a
> cookie. It will get that cookie back from the client at some unknown
> later time (possibly after the server has rebooted). So their solution
> is to return a cookie from one of the storage servers, plus some kind of
> node id in the top bits so they can remember which server it came from.
>
> (I don't know much about gluster, but I think that's the basic idea.)
>
> I've assumed that users of directory cookies should treat them as
> opaque, so I don't think what gluster is doing is correct.


NFS uses the term cookies, while the man pages of readdir/seekdir/telldir call
them "offsets". RFC 1813 only talks about communication between an NFS
server and an NFS client. While knfsd performs a trivial 1:1 mapping from
d_off "offsets" to these "opaque cookies", the "gluster" issue at hand is
that it made assumptions about the nature of these "offsets" (that they
represent some kind of true distance/offset and therefore fall
within some kind of bounded magnitude -- somewhat like the inode
numbering), and performs a transformation (instead of a trivial 1:1
mapping) like this:

final_d_off = (ext4_d_off * MAX_SERVERS) + server_idx

thereby utilizing a few more top bits, and also gaining the ability to perform
a reverse transformation to "continue" from a previous location. As you can
see, final_d_off now overflows for very large values of ext4_d_off. This
final_d_off is used both as a cookie in the gluster-NFS (userspace) server and
as the d_off entry parameter in the FUSE readdir reply. The gluster / ext4
d_off issue is not limited to gluster-NFS, but also exists in the FUSE
client, where NFS is completely out of the picture.
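(A minimal sketch of that transformation and where it breaks; MAX_SERVERS is a
made-up constant here, and this is not gluster's actual code.)

#include <stdint.h>

#define MAX_SERVERS 64          /* made-up value, for illustration only */

/* Forward transform: fold the originating server's index into the cookie. */
static uint64_t gfs_encode(uint64_t ext4_d_off, unsigned server_idx)
{
        /* Overflows once ext4_d_off approaches 2**64 / MAX_SERVERS, which
         * 64-bit ext4 hash cookies can easily exceed. */
        return ext4_d_off * MAX_SERVERS + server_idx;
}

/* Reverse transform, used to "continue" a readdir from a returned cookie. */
static void gfs_decode(uint64_t final_d_off, uint64_t *ext4_d_off, unsigned *server_idx)
{
        *server_idx = (unsigned)(final_d_off % MAX_SERVERS);
        *ext4_d_off = final_d_off / MAX_SERVERS;
}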

You are probably right in that gluster has made different assumptions about
the "nature" of the values filled in the d_off fields. But the language used
in all the man pages makes you believe they were supposed to be numbers representing
some kind of distance/offset (with bounded magnitude), and not a "random"
number.

This had worked (accidentally, you may call it) on all filesystems
including ext4, as expected. But after a kernel upgrade, only ext4-backed
deployments started giving problems and we have been advising our users to
either downgrade their kernel or use a different filesystem (we really do
not want to force them into making a choice of one backend filesystem vs
another).

You can always say "this is your fault" for interpreting the man pages
differently and punish us by leaving things as they are (and unfortunately
a big chunk of users who want both ext4 and gluster jeopardized). Or you
can be kind, generous and be considerate to the legacy apps and users (of
which gluster is only a subset) and only provide a mount option to control
the large d_off behavior.

Thanks!
Avati

