2013-02-12 20:28:44

by J. Bruce Fields

Subject: regressions due to 64-bit ext4 directory cookies

06effdbb49af5f6c "nfsd: vfs_llseek() with 32 or 64 bit offsets (hashes)"
and previous patches solved problems with hash collisions in large
directories by using 64- instead of 32-bit directory hashes in some
cases. But it caused problems for users who assume directory offsets
are "small". Two cases we've run across:

- older NFS clients: 64-bit cookies cause applications on many
older clients to fail.
- gluster: gluster assumed that it could take the top bits of
the offset for its own use.

In both cases we could argue we're in the right: the nfs protocol
defines cookies to be 64 bits, so clients should be prepared to handle
them (remapping to smaller integers if necessary to placate applications
using older system interfaces). And gluster was incorrect to assume
that the "offset" was really an "offset" as opposed to just an opaque
value.

But in practice things that worked fine for a long time break on a
kernel upgrade.

So at a minimum I think we owe people a workaround, and turning off
dir_index may not be practical for everyone.

A "no_64bit_cookies" export option would provide a workaround for NFS
servers with older NFS clients, but not for applications like gluster.

For that reason I'd rather have a way to turn this off on a given ext4
filesystem. Is that practical?

--b.


2013-02-12 20:56:44

by Bernd Schubert

Subject: Re: regressions due to 64-bit ext4 directory cookies

On 02/12/2013 09:28 PM, J. Bruce Fields wrote:
> 06effdbb49af5f6c "nfsd: vfs_llseek() with 32 or 64 bit offsets (hashes)"
> and previous patches solved problems with hash collisions in large
> directories by using 64- instead of 32-bit directory hashes in some
> cases. But it caused problems for users who assume directory offsets
> are "small". Two cases we've run across:
>
> - older NFS clients: 64-bit cookies cause applications on many
> older clients to fail.
> - gluster: gluster assumed that it could take the top bits of
> the offset for its own use.
>
> In both cases we could argue we're in the right: the nfs protocol
> defines cookies to be 64 bits, so clients should be prepared to handle
> them (remapping to smaller integers if necessary to placate applications
> using older system interfaces). And gluster was incorrect to assume
> that the "offset" was really an "offset" as opposed to just an opaque
> value.
>
> But in practice things that worked fine for a long time break on a
> kernel upgrade.
>
> So at a minimum I think we owe people a workaround, and turning off
> dir_index may not be practical for everyone.
>
> A "no_64bit_cookies" export option would provide a workaround for NFS
> servers with older NFS clients, but not for applications like gluster.
>
> For that reason I'd rather have a way to turn this off on a given ext4
> filesystem. Is that practical?

I think Ted needs to answer whether he would accept another mount option.
But before we go this way, what is gluster doing if there are hash
collisions?

Thanks,
Bernd

2013-02-12 21:00:57

by J. Bruce Fields

Subject: Re: regressions due to 64-bit ext4 directory cookies

On Tue, Feb 12, 2013 at 09:56:41PM +0100, Bernd Schubert wrote:
> On 02/12/2013 09:28 PM, J. Bruce Fields wrote:
> > 06effdbb49af5f6c "nfsd: vfs_llseek() with 32 or 64 bit offsets (hashes)"
> > and previous patches solved problems with hash collisions in large
> > directories by using 64- instead of 32-bit directory hashes in some
> > cases. But it caused problems for users who assume directory offsets
> > are "small". Two cases we've run across:
> >
> > - older NFS clients: 64-bit cookies cause applications on many
> > older clients to fail.
> > - gluster: gluster assumed that it could take the top bits of
> > the offset for its own use.
> >
> > In both cases we could argue we're in the right: the nfs protocol
> > defines cookies to be 64 bits, so clients should be prepared to handle
> > them (remapping to smaller integers if necessary to placate applications
> > using older system interfaces). And gluster was incorrect to assume
> > that the "offset" was really an "offset" as opposed to just an opaque
> > value.
> >
> > But in practice things that worked fine for a long time break on a
> > kernel upgrade.
> >
> > So at a minimum I think we owe people a workaround, and turning off
> > dir_index may not be practical for everyone.
> >
> > A "no_64bit_cookies" export option would provide a workaround for NFS
> > servers with older NFS clients, but not for applications like gluster.
> >
> > For that reason I'd rather have a way to turn this off on a given ext4
> > filesystem. Is that practical?
>
> I think Ted needs to answer whether he would accept another mount option.
> But before we go this way, what is gluster doing if there are hash
> collisions?

They probably just haven't tested NFS with large enough directories.
The birthday paradox says you'd need about 2^16 entries to have a 50-50
chance of hitting the problem.
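
(For reference, the approximation behind that estimate, assuming n entries
hashed uniformly into d = 2^31 values -- the 31 usable bits of a 32-bit
ext4 cookie:

        p(n) ~= 1 - exp(-n(n-1)/2d)

which crosses 1/2 near n ~= 1.18*sqrt(d), i.e. around 54,000 ~= 2^16
entries.)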

I don't know enough about ext4 directory performance. But unfortunately
I suspect there's a range of directory sizes that are too small to have
a significant chance of having directory collisions, but still large
enough to need dir_index?

--b.

2013-02-13 04:00:12

by Theodore Ts'o

Subject: Re: regressions due to 64-bit ext4 directory cookies

On Tue, Feb 12, 2013 at 03:28:41PM -0500, J. Bruce Fields wrote:
> 06effdbb49af5f6c "nfsd: vfs_llseek() with 32 or 64 bit offsets (hashes)"
> and previous patches solved problems with hash collisions in large
> > directories by using 64- instead of 32-bit directory hashes in some
> cases. But it caused problems for users who assume directory offsets
> are "small". Two cases we've run across:
>
> - older NFS clients: 64-bit cookies cause applications on many
> older clients to fail.

Is there a list of clients (and version numbers) which are having
problems?

> A "no_64bit_cookies" export option would provide a workaround for NFS
> servers with older NFS clients, but not for applications like gluster.

Why isn't it sufficient for gluster? Are they doing something
horrible such as assuming that telldir() cookies accessed from
userspace are identical to NFS cookies? Or is it some other horrible
abstraction violation?

- Ted

2013-02-13 06:56:39

by Andreas Dilger

Subject: Re: regressions due to 64-bit ext4 directory cookies

On 2013-02-12, at 12:28 PM, J. Bruce Fields wrote:
> 06effdbb49af5f6c "nfsd: vfs_llseek() with 32 or 64 bit offsets (hashes)"
> and previous patches solved problems with hash collisions in large
> directories by using 64- instead of 32-bit directory hashes in some
> cases. But it caused problems for users who assume directory offsets
> are "small". Two cases we've run across:
>
> - older NFS clients: 64-bit cookies cause applications on
> many older clients to fail.
> - gluster: gluster assumed that it could take the top bits of
> the offset for its own use.
>
> In both cases we could argue we're in the right: the nfs protocol
> defines cookies to be 64 bits, so clients should be prepared to handle them (remapping to smaller integers if necessary to placate
> applications using older system interfaces).

There appears to already be support for handling this for NFSv2
clients, so it should be possible to have an NFS server mount
option to set this for all clients:

        /* NFSv2 only supports 32 bit cookies */
        if (rqstp->rq_vers > 2)
                may_flags |= NFSD_MAY_64BIT_COOKIE;

Alternately, this might be detected on a per-client basis by
whitelist or blacklist if there is some way for the server to
identify the client?

> And gluster was incorrect to assume that the "offset" was really
> an "offset" as opposed to just an opaque value.

Hmm, userspace already can't use the top bit of the cookie,
since the offset is a signed value, so gluster could continue
to use that bit for itself. It could, in theory, also downshift
the cookie by one bit for 64-bit cookies and shift it back
before use, but I'm not sure that is kosher for all filesystems.
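
(A minimal sketch of that down-shift idea, with hypothetical helper names --
this is not existing kernel code -- which also shows why it may not be
kosher: the cookie's low bit is silently dropped.)

#include <stdint.h>

/* telldir() side: squeeze a 64-bit cookie into a positive off_t */
static inline uint64_t cookie_to_off(uint64_t cookie)
{
        return cookie >> 1;     /* top bit now clear, off_t stays positive */
}

/* seekdir() side: shift back before handing it to the filesystem */
static inline uint64_t off_to_cookie(uint64_t off)
{
        return off << 1;        /* the original low bit is gone for good */
}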

> But in practice things that worked fine for a long time break on a
> kernel upgrade.
>
> So at a minimum I think we owe people a workaround, and turning off
> dir_index may not be practical for everyone.
>
> A "no_64bit_cookies" export option would provide a workaround for NFS
> servers with older NFS clients, but not for applications like gluster.

We added a "32bitapi" mount option to Lustre to handle the case
where it is re-exporting via NFS to 32-bit clients, which is like
your proposed "no_64bit_cookies" and "nfs.enable_ino64=0" together.

> For that reason I'd rather have a way to turn this off on a given ext4 filesystem. Is that practical?

It wouldn't be impossible - pos2maj_hash() and pos2min_hash()
could get a per-superblock and/or kernel option to force 32-bit
hash values.
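
(For context, a simplified sketch of what that mapping looks like --
assuming the fs/ext4/dir.c helpers keep roughly this shape; the real code
differs in detail. A per-superblock option would just force the 32-bit
branch:)

/* hash -> f_pos: a 31-bit cookie for 32-bit consumers, 63-bit otherwise */
static inline loff_t hash2pos(struct file *filp, __u32 major, __u32 minor)
{
        if (!(filp->f_mode & FMODE_64BITHASH))
                return major >> 1;
        return ((__u64)(major >> 1) << 32) | minor;
}

/* f_pos -> major hash, inverting the above */
static inline __u32 pos2maj_hash(struct file *filp, loff_t pos)
{
        if (!(filp->f_mode & FMODE_64BITHASH))
                return (pos << 1) & 0xffffffff;
        return ((pos >> 32) << 1) & 0xffffffff;
}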

Cheers, Andreas

2013-02-13 08:17:36

by Bernd Schubert

Subject: Re: regressions due to 64-bit ext4 directory cookies

On 02/12/2013 10:00 PM, J. Bruce Fields wrote:
> On Tue, Feb 12, 2013 at 09:56:41PM +0100, Bernd Schubert wrote:
>> On 02/12/2013 09:28 PM, J. Bruce Fields wrote:
>>> 06effdbb49af5f6c "nfsd: vfs_llseek() with 32 or 64 bit offsets (hashes)"
>>> and previous patches solved problems with hash collisions in large
>>> directories by using 64- instead of 32-bit directory hashes in some
>>> cases. But it caused problems for users who assume directory offsets
>>> are "small". Two cases we've run across:
>>>
>>> - older NFS clients: 64-bit cookies cause applications on many
>>> older clients to fail.
>>> - gluster: gluster assumed that it could take the top bits of
>>> the offset for its own use.
>>>
>>> In both cases we could argue we're in the right: the nfs protocol
>>> defines cookies to be 64 bits, so clients should be prepared to handle
>>> them (remapping to smaller integers if necessary to placate applications
>>> using older system interfaces). And gluster was incorrect to assume
>>> that the "offset" was really an "offset" as opposed to just an opaque
>>> value.
>>>
>>> But in practice things that worked fine for a long time break on a
>>> kernel upgrade.
>>>
>>> So at a minimum I think we owe people a workaround, and turning off
>>> dir_index may not be practical for everyone.
>>>
>>> A "no_64bit_cookies" export option would provide a workaround for NFS
>>> servers with older NFS clients, but not for applications like gluster.
>>>
>>> For that reason I'd rather have a way to turn this off on a given ext4
>>> filesystem. Is that practical?
>>
>> I think Ted needs to answer whether he would accept another mount option.
>> But before we go this way, what is gluster doing if there are hash
>> collisions?
>
> They probably just haven't tested NFS with large enough directories.

Is it only related to NFS, or to generic readdir over gluster?

> The birthday paradox says you'd need about 2^16 entries to have a 50-50
> chance of hitting the problem.

We are frequently running into it with 50000 files per directory.

>
> I don't know enough about ext4 directory performance. But unfortunately
> I suspect there's a range of directory sizes that are too small to have
> a significant chance of having directory collisions, but still large
> enough to need dir_index?

Here is a link to the initial benchmark:
http://search.luky.org/linux-kernel.2001/msg00117.html


Cheers,
Bernd

2013-02-13 13:31:34

by J. Bruce Fields

Subject: Re: regressions due to 64-bit ext4 directory cookies

On Tue, Feb 12, 2013 at 11:00:03PM -0500, Theodore Ts'o wrote:
> On Tue, Feb 12, 2013 at 03:28:41PM -0500, J. Bruce Fields wrote:
> > 06effdbb49af5f6c "nfsd: vfs_llseek() with 32 or 64 bit offsets (hashes)"
> > and previous patches solved problems with hash collisions in large
> > directories by using 64- instead of 32-bit directory hashes in some
> > cases. But it caused problems for users who assume directory offsets
> > are "small". Two cases we've run across:
> >
> > - older NFS clients: 64-bit cookies cause applications on many
> > older clients to fail.
>
> Is there a list of clients (and version numbers) which are having
> problems?

I've seen complaints about Solaris, AIX, and HP-UX clients. I don't
have version numbers. It's possible that this is a problem with their
latest versions, so I probably shouldn't have said "older" above.

> > A "no_64bit_cookies" export option would provide a workaround for NFS
> > servers with older NFS clients, but not for applications like gluster.
>
> Why isn't it sufficient for gluster? Are they doing something
> horrible such as assuming that telldir() cookies accessed from
> userspace are identical to NFS cookies? Or is it some other horrible
> abstraction violation?

They're assuming they can take the high bits of the cookie for their own
use.

(In more detail: they're spreading a single directory across multiple
nodes, and encoding a node ID into the cookie they return, so they can
tell which node the cookie came from when they get it back.)

That works if you assume the cookie is an "offset" bounded above by some
measure of the directory size, hence unlikely to ever use the high
bits....

--b.

2013-02-13 13:31:43

by Niels de Vos

Subject: Re: [Gluster-devel] regressions due to 64-bit ext4 directory cookies

On Tue, Feb 12, 2013 at 04:00:54PM -0500, J. Bruce Fields wrote:
> On Tue, Feb 12, 2013 at 09:56:41PM +0100, Bernd Schubert wrote:
> > On 02/12/2013 09:28 PM, J. Bruce Fields wrote:
> > > 06effdbb49af5f6c "nfsd: vfs_llseek() with 32 or 64 bit offsets (hashes)"
> > > and previous patches solved problems with hash collisions in large
> > > directories by using 64- instead of 32-bit directory hashes in some
> > > cases. But it caused problems for users who assume directory offsets
> > > are "small". Two cases we've run across:
> > >
> > > - older NFS clients: 64-bit cookies cause applications on many
> > > older clients to fail.
> > > - gluster: gluster assumed that it could take the top bits of
> > > the offset for its own use.
> > >
> > > In both cases we could argue we're in the right: the nfs protocol
> > > defines cookies to be 64 bits, so clients should be prepared to handle
> > > them (remapping to smaller integers if necessary to placate applications
> > > using older system interfaces). And gluster was incorrect to assume
> > > that the "offset" was really an "offset" as opposed to just an opaque
> > > value.
> > >
> > > But in practice things that worked fine for a long time break on a
> > > kernel upgrade.
> > >
> > > So at a minimum I think we owe people a workaround, and turning off
> > > dir_index may not be practical for everyone.
> > >
> > > A "no_64bit_cookies" export option would provide a workaround for NFS
> > > servers with older NFS clients, but not for applications like gluster.
> > >
> > > For that reason I'd rather have a way to turn this off on a given ext4
> > > filesystem. Is that practical?
> >
> > I think Ted needs to answer whether he would accept another mount option.
> > But before we go this way, what is gluster doing if there are hash
> > collisions?
>
> They probably just haven't tested NFS with large enough directories.
> The birthday paradox says you'd need about 2^16 entries to have a 50-50
> chance of hitting the problem.

The Gluster NFS-server gets into an infinite loop:
- https://bugzilla.redhat.com/show_bug.cgi?id=838784

The general advice (even before this bug) is that XFS should be used,
which is not affected by this problem (yet?).

Cheers,
Niels

2013-02-13 13:40:37

by J. Bruce Fields

Subject: Re: regressions due to 64-bit ext4 directory cookies

On Tue, Feb 12, 2013 at 10:56:36PM -0800, Andreas Dilger wrote:
> On 2013-02-12, at 12:28 PM, J. Bruce Fields wrote:
> > 06effdbb49af5f6c "nfsd: vfs_llseek() with 32 or 64 bit offsets (hashes)"
> > and previous patches solved problems with hash collisions in large
> > directories by using 64- instead of 32-bit directory hashes in some
> > cases. But it caused problems for users who assume directory offsets
> > are "small". Two cases we've run across:
> >
> > - older NFS clients: 64-bit cookies cause applications on
> > many older clients to fail.
> > - gluster: gluster assumed that it could take the top bits of
> > the offset for its own use.
> >
> > In both cases we could argue we're in the right: the nfs protocol
> > defines cookies to be 64 bits, so clients should be prepared to handle them (remapping to smaller integers if necessary to placate
> > applications using older system interfaces).
>
> There appears to already be support for handling this for NFSv2
> clients, so it should be possible to have an NFS server mount
> option to set this for all clients:
>
>         /* NFSv2 only supports 32 bit cookies */
>         if (rqstp->rq_vers > 2)
>                 may_flags |= NFSD_MAY_64BIT_COOKIE;
>
> Alternately, this might be detected on a per-client basis by
> whitelist or blacklist if there is some way for the server to
> identify the client?

No, there isn't.

--b.

2013-02-13 15:15:01

by Theodore Ts'o

Subject: Re: regressions due to 64-bit ext4 directory cookies

On Wed, Feb 13, 2013 at 08:31:31AM -0500, J. Bruce Fields wrote:
> They're assuming they can take the high bits of the cookie for their own
> use.
>
> (In more detail: they're spreading a single directory across multiple
> nodes, and encoding a node ID into the cookie they return, so they can
> tell which node the cookie came from when they get it back.)
>
> That works if you assume the cookie is an "offset" bounded above by some
> measure of the directory size, hence unlikely to ever use the high
> bits....

Right, but why wouldn't an nfs export option solve the problem for
gluster?

Basically, it would be nice if we did not have to degrade locally
running userspace applications by globally turning off 64-bit telldir
cookies just because there are some broken cluster file systems and
nfsv3 clients out there. And if we are only turning off 64-bit
cookies for NFS, wouldn't it make sense to make this be a NFS export
option, as opposed to a mount option?

Regards,

- Ted

2013-02-13 15:19:55

by J. Bruce Fields

Subject: Re: regressions due to 64-bit ext4 directory cookies

On Wed, Feb 13, 2013 at 10:14:55AM -0500, Theodore Ts'o wrote:
> On Wed, Feb 13, 2013 at 08:31:31AM -0500, J. Bruce Fields wrote:
> > They're assuming they can take the high bits of the cookie for their own
> > use.
> >
> > (In more detail: they're spreading a single directory across multiple
> > nodes, and encoding a node ID into the cookie they return, so they can
> > tell which node the cookie came from when they get it back.)
> >
> > That works if you assume the cookie is an "offset" bounded above by some
> > measure of the directory size, hence unlikely to ever use the high
> > bits....
>
> Right, but why wouldn't an nfs export option solve the problem for
> gluster?

No, gluster is running on ext4 directly.

> Basically, it would be nice if we did not have to degrade locally
> running userspace applications by globally turning off 64-bit telldir
> cookies just because there are some broken cluster file systems and
> nfsv3 clients out there. And if we are only turning off 64-bit
> cookies for NFS, wouldn't it make sense to make this be a NFS export
> option, as opposed to a mount option?

Right, the problem is that from ext4's point of view gluster is just
another userspace application.

(And my worry of course is that there may be others. Samba would be
another one to check.)

--b.

2013-02-13 15:37:04

by Theodore Ts'o

Subject: Re: regressions due to 64-bit ext4 directory cookies

On Wed, Feb 13, 2013 at 10:19:53AM -0500, J. Bruce Fields wrote:
> > > (In more detail: they're spreading a single directory across multiple
> > > nodes, and encoding a node ID into the cookie they return, so they can
> > > tell which node the cookie came from when they get it back.)
> > >
> > > That works if you assume the cookie is an "offset" bounded above by some
> > > measure of the directory size, hence unlikely to ever use the high
> > > bits....
> >
> > Right, but why wouldn't an nfs export option solve the problem for
> > gluster?
>
> No, gluster is running on ext4 directly.

OK, so let me see if I can get this straight. Each local gluster node
is running a userspace NFS server, right? Because if it were running
a kernel-side NFS server, it would be sufficient to use an nfs export
option.

A client which mounts a "gluster file system" is also doing this via
NFSv3, right? Or are they using their own protocol? If they are
using their own protocol, why can't they encode the node ID somewhere
else?

So is this a correct picture of what is going on:

                                          /------ GFS Storage
                                         /        Server #1
GFS Cluster   NFS V3    GFS Cluster     /   NFS v3
Client      <-------->  Frontend Server ----------- GFS Storage
                                        \           Server #2
                                         \
                                          \------ GFS Storage
                                                   Server #3


And the reason why it needs to use the high bits is because it needs to
coalesce the results from each GFS Storage Server to the GFS Cluster
client?


The other thing that I'd note is that the readdir cookie has been
64-bit since NFSv3, which was released in June ***1995***. And the
explicit, stated purpose of making it be a 64-bit value (as stated in
RFC 1813) was to reduce interoperability problems. If that were the
case, are you telling me that Sun (who has traditionally been pretty
good worrying about interoperability concerns, and in fact employed
the editors of RFC 1813) didn't get this right? This seems
quite.... surprising to me.

I thought this was the whole point of the various NFS interoperability
testing done at Connectathon, for which Sun was a major sponsor?!? No
one noticed?!?

- Ted

2013-02-13 15:40:42

by Bernd Schubert

Subject: Re: [Gluster-devel] regressions due to 64-bit ext4 directory cookies

On 02/13/2013 02:31 PM, Niels de Vos wrote:
> On Tue, Feb 12, 2013 at 04:00:54PM -0500, J. Bruce Fields wrote:
>> On Tue, Feb 12, 2013 at 09:56:41PM +0100, Bernd Schubert wrote:
>>> On 02/12/2013 09:28 PM, J. Bruce Fields wrote:
>>>> 06effdbb49af5f6c "nfsd: vfs_llseek() with 32 or 64 bit offsets (hashes)"
>>>> and previous patches solved problems with hash collisions in large
>>>> directories by using 64- instead of 32-bit directory hashes in some
>>>> cases. But it caused problems for users who assume directory offsets
>>>> are "small". Two cases we've run across:
>>>>
>>>> - older NFS clients: 64-bit cookies cause applications on many
>>>> older clients to fail.
>>>> - gluster: gluster assumed that it could take the top bits of
>>>> the offset for its own use.
>>>>
>>>> In both cases we could argue we're in the right: the nfs protocol
>>>> defines cookies to be 64 bits, so clients should be prepared to handle
>>>> them (remapping to smaller integers if necessary to placate applications
>>>> using older system interfaces). And gluster was incorrect to assume
>>>> that the "offset" was really an "offset" as opposed to just an opaque
>>>> value.
>>>>
>>>> But in practice things that worked fine for a long time break on a
>>>> kernel upgrade.
>>>>
>>>> So at a minimum I think we owe people a workaround, and turning off
>>>> dir_index may not be practical for everyone.
>>>>
>>>> A "no_64bit_cookies" export option would provide a workaround for NFS
>>>> servers with older NFS clients, but not for applications like gluster.
>>>>
>>>> For that reason I'd rather have a way to turn this off on a given ext4
>>>> filesystem. Is that practical?
>>>
>>> I think Ted needs to answer whether he would accept another mount option.
>>> But before we go this way, what is gluster doing if there are hash
>>> collisions?
>>
>> They probably just haven't tested NFS with large enough directories.
>> The birthday paradox says you'd need about 2^16 entries to have a 50-50
>> chance of hitting the problem.
>
> The Gluster NFS-server gets into an infinite loop:
> - https://bugzilla.redhat.com/show_bug.cgi?id=838784

Hmm, this bugzilla is not entirely what I meant, as it refers to 64-bit
hashes.
My question actually was, what is gluster going to do if there is a
32-bit hash collision and ext4 seeks back to a random entry?
That might end in an endless loop, but it also might simply list entries
multiple times on readdir().
Of course, something that only happens rarely is better than something
that happens all the time, but it still would be better to properly fix
it, wouldn't it?

> The general advice (even before this bug) is that XFS should be used,
> which is not affected by this problem (yet?).

Hmm, well, always depends on the workload.


Cheers,
Bernd

2013-02-13 16:20:59

by J. Bruce Fields

Subject: Re: regressions due to 64-bit ext4 directory cookies

Oops, probably should have cc'd linux-nfs.

On Wed, Feb 13, 2013 at 10:36:54AM -0500, Theodore Ts'o wrote:
> On Wed, Feb 13, 2013 at 10:19:53AM -0500, J. Bruce Fields wrote:
> > > > (In more detail: they're spreading a single directory across multiple
> > > > nodes, and encoding a node ID into the cookie they return, so they can
> > > > tell which node the cookie came from when they get it back.)
> > > >
> > > > That works if you assume the cookie is an "offset" bounded above by some
> > > > measure of the directory size, hence unlikely to ever use the high
> > > > bits....
> > >
> > > Right, but why wouldn't an nfs export option solve the problem for
> > > gluster?
> >
> > No, gluster is running on ext4 directly.
>
> OK, so let me see if I can get this straight. Each local gluster node
> is running a userspace NFS server, right?

My understanding is that only the frontend server is running the NFS server.
So in your picture below, "NFS v3" should be some internal gluster
protocol:


                                          /------ GFS Storage
                                         /        Server #1
GFS Cluster   NFS V3    GFS Cluster     /   gluster protocol
Client      <-------->  Frontend Server ----------- GFS Storage
                                        \           Server #2
                                         \
                                          \------ GFS Storage
                                                   Server #3


That frontend server gets a readdir request for a directory which is
stored across several of the storage servers. It has to return a
cookie. It will get that cookie back from the client at some unknown
later time (possibly after the server has rebooted). So their solution
is to return a cookie from one of the storage servers, plus some kind of
node id in the top bits so they can remember which server it came from.

(I don't know much about gluster, but I think that's the basic idea.)

I've assumed that users of directory cookies should treat them as
opaque, so I don't think what gluster is doing is correct. But on the
other hand they are defined as integers and described as offsets here
and there. And I can't actually think of anything else that would work,
short of gluster generating and storing its own cookies.

> Because if it were running
> a kernel-side NFS server, it would be sufficient to use an nfs export
> option.
>
> A client which mounts a "gluster file system" is also doing this via
> NFSv3, right? Or are they using their own protocol? If they are
> using their own protocol, why can't they encode the node ID somewhere
> else?
>
> So is this a correct picture of what is going on:
>
>                                           /------ GFS Storage
>                                          /        Server #1
> GFS Cluster   NFS V3    GFS Cluster     /   NFS v3
> Client      <-------->  Frontend Server ----------- GFS Storage
>                                         \           Server #2
>                                          \
>                                           \------ GFS Storage
>                                                    Server #3
>
>
> And the reason why it needs to use the high bits is because it needs to
> coalesce the results from each GFS Storage Server to the GFS Cluster
> client?
>
> The other thing that I'd note is that the readdir cookie has been
> 64-bit since NFSv3, which was released in June ***1995***. And the
> explicit, stated purpose of making it be a 64-bit value (as stated in
> RFC 1813) was to reduce interoperability problems. If that were the
> case, are you telling me that Sun (who has traditionally been pretty
> good worrying about interoperability concerns, and in fact employed
> the editors of RFC 1813) didn't get this right? This seems
> quite.... surprising to me.
>
> I thought this was the whole point of the various NFS interoperability
> testing done at Connectathon, for which Sun was a major sponsor?!? No
> one noticed?!?

Beats me. But it's not necessarily easy to replace clients running
legacy applications, so we're stuck working with the clients we have....

The linux client does remap the server-provided cookies to small
integers, I believe exactly because older applications had trouble with
servers returning "large" cookies. So presumably ext4-exporting-Linux
servers aren't the first to do this.

I don't know which client versions are affected--Connectathon's next
week and I'll talk to people and make sure there's an ext4 export with
this turned on to test against.

--b.

2013-02-13 16:43:05

by Myklebust, Trond

Subject: Re: regressions due to 64-bit ext4 directory cookies

On Wed, 2013-02-13 at 11:20 -0500, J. Bruce Fields wrote:
> Oops, probably should have cc'd linux-nfs.
>
> On Wed, Feb 13, 2013 at 10:36:54AM -0500, Theodore Ts'o wrote:
> > On Wed, Feb 13, 2013 at 10:19:53AM -0500, J. Bruce Fields wrote:
> > > > > (In more detail: they're spreading a single directory across multiple
> > > > > nodes, and encoding a node ID into the cookie they return, so they can
> > > > > tell which node the cookie came from when they get it back.)
> > > > >
> > > > > That works if you assume the cookie is an "offset" bounded above by some
> > > > > measure of the directory size, hence unlikely to ever use the high
> > > > > bits....
> > > >
> > > > Right, but why wouldn't an nfs export option solve the problem for
> > > > gluster?
> > >
> > > No, gluster is running on ext4 directly.
> >
> > OK, so let me see if I can get this straight. Each local gluster node
> > is running a userspace NFS server, right?
>
> My understanding is that only the frontend server is running the NFS server.
> So in your picture below, "NFS v3" should be some internal gluster
> protocol:
>
>
>                                           /------ GFS Storage
>                                          /        Server #1
> GFS Cluster   NFS V3    GFS Cluster     /   gluster protocol
> Client      <-------->  Frontend Server ----------- GFS Storage
>                                         \           Server #2
>                                          \
>                                           \------ GFS Storage
>                                                    Server #3
>
>
> That frontend server gets a readdir request for a directory which is
> stored across several of the storage servers. It has to return a
> cookie. It will get that cookie back from the client at some unknown
> later time (possibly after the server has rebooted). So their solution
> is to return a cookie from one of the storage servers, plus some kind of
> node id in the top bits so they can remember which server it came from.
>
> (I don't know much about gluster, but I think that's the basic idea.)
>
> I've assumed that users of directory cookies should treat them as
> opaque, so I don't think what gluster is doing is correct. But on the
> other hand they are defined as integers and described as offsets here
> and there. And I can't actually think of anything else that would work,
> short of gluster generating and storing its own cookies.
>
> > Because if it were running
> > a kernel-side NFS server, it would be sufficient to use an nfs export
> > option.
> >
> > A client which mounts a "gluster file system" is also doing this via
> > NFSv3, right? Or are they using their own protocol? If they are
> > using their own protocol, why can't they encode the node ID somewhere
> > else?
> >
> > So is this a correct picture of what is going on:
> >
> >                                           /------ GFS Storage
> >                                          /        Server #1
> > GFS Cluster   NFS V3    GFS Cluster     /   NFS v3
> > Client      <-------->  Frontend Server ----------- GFS Storage
> >                                         \           Server #2
> >                                          \
> >                                           \------ GFS Storage
> >                                                    Server #3
> >
> >
> > And the reason why it needs to use the high bits is because it needs to
> > coalesce the results from each GFS Storage Server to the GFS Cluster
> > client?
> >
> > The other thing that I'd note is that the readdir cookie has been
> > 64-bit since NFSv3, which was released in June ***1995***. And the
> > explicit, stated purpose of making it be a 64-bit value (as stated in
> > RFC 1813) was to reduce interoperability problems. If that were the
> > case, are you telling me that Sun (who has traditionally been pretty
> > good worrying about interoperability concerns, and in fact employed
> > the editors of RFC 1813) didn't get this right? This seems
> > quite.... surprising to me.
> >
> > I thought this was the whole point of the various NFS interoperability
> > testing done at Connectathon, for which Sun was a major sponsor?!? No
> > one noticed?!?
>
> Beats me. But it's not necessarily easy to replace clients running
> legacy applications, so we're stuck working with the clients we have....
>
> The linux client does remap the server-provided cookies to small
> integers, I believe exactly because older applications had trouble with
> servers returning "large" cookies. So presumably ext4-exporting-Linux
> servers aren't the first to do this.
>
> I don't know which client versions are affected--Connectathon's next
> week and I'll talk to people and make sure there's an ext4 export with
> this turned on to test against.

Actually, one of the main reasons for the Linux client not exporting raw
readdir cookies is because the glibc-2 folks in their infinite wisdom
declared that telldir()/seekdir() use an off_t. They then went yet one
further and decided to declare negative offsets to be illegal so that
they could use the negative values internally in their syscall wrappers.

The POSIX definition has none of the above rubbish
(http://pubs.opengroup.org/onlinepubs/009695399/functions/telldir.html)
and so glibc brilliantly saddled Linux with a crippled readdir
implementation that is _not_ POSIX compatible.

No, I'm not at all bitter...

Trond

2013-02-13 21:21:06

by Anand Avati

Subject: Re: regressions due to 64-bit ext4 directory cookies

>
> My understanding is that only the frontend server is running the NFS server.
> So in your picture below, "NFS v3" should be some internal gluster
> protocol:
>
>
>                                           /------ GFS Storage
>                                          /        Server #1
> GFS Cluster   NFS V3    GFS Cluster     /   gluster protocol
> Client      <-------->  Frontend Server ----------- GFS Storage
>                                         \           Server #2
>                                          \
>                                           \------ GFS Storage
>                                                    Server #3
>
>
> That frontend server gets a readdir request for a directory which is
> stored across several of the storage servers. It has to return a
> cookie. It will get that cookie back from the client at some unknown
> later time (possibly after the server has rebooted). So their solution
> is to return a cookie from one of the storage servers, plus some kind of
> node id in the top bits so they can remember which server it came from.
>
> (I don't know much about gluster, but I think that's the basic idea.)
>
> I've assumed that users of directory cookies should treat them as
> opaque, so I don't think what gluster is doing is correct.


NFS uses the term cookies, while the man pages of readdir/seekdir/telldir
call them "offsets". RFC 1813 only talks about communication between an
NFS server and an NFS client. While knfsd performs a trivial 1:1 mapping
of d_off "offsets" onto these "opaque cookies", the gluster issue at hand
is that it made assumptions about the nature of these "offsets" (that they
represent some kind of true distance/offset and therefore fall within some
kind of bounded magnitude -- somewhat like the inode numbering), and
performs a transformation (instead of a trivial 1:1 mapping) like this:

final_d_off = (ext4_d_off * MAX_SERVERS) + server_idx

thereby utilizing a few of the top bits, and gaining the ability to
perform a reverse transformation to "continue" from a previous location.
As you can see, final_d_off now overflows for very large values of
ext4_d_off. This final_d_off is used both as a cookie in the gluster-NFS
(userspace) server, and as the d_off entry parameter in the FUSE readdir
reply. So the gluster/ext4 d_off issue is not limited to gluster-NFS; it
also exists in the FUSE client, where NFS is completely out of the picture.
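
In code, the transformation and its inverse are essentially the following
(a simplified sketch -- MAX_SERVERS and the helper names are illustrative,
not our actual source):

#include <stdint.h>

#define MAX_SERVERS 16  /* illustrative brick count */

/* encode: scale the backend d_off, stash the server index below it */
static uint64_t gf_doff_encode(uint64_t ext4_d_off, uint64_t server_idx)
{
        return ext4_d_off * MAX_SERVERS + server_idx;
}

/* decode: recover both pieces when the cookie comes back */
static void gf_doff_decode(uint64_t final_d_off,
                           uint64_t *ext4_d_off, uint64_t *server_idx)
{
        *server_idx = final_d_off % MAX_SERVERS;
        *ext4_d_off = final_d_off / MAX_SERVERS;
}

With 63-bit ext4 hash values, the multiplication in gf_doff_encode() is
exactly where the overflow happens.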

You are probably right in that gluster has made different assumptions
about the "nature" of the values filled into d_off fields. But the
language used in all the man pages leads you to believe they were supposed
to be numbers representing some kind of distance/offset (with bounded
magnitude), and not a "random" number.

This had worked (accidentally, you may call it) on all filesystems
including ext4, as expected. But on kernel upgrade, only ext4-backed
deployments started giving problems, and we have been advising our users
to either downgrade their kernel or use a different filesystem (we really
do not want to force them into a choice of one backend filesystem vs.
another).

You can always say "this is your fault" for interpreting the man pages
differently and punish us by leaving things as they are (unfortunately
leaving a big chunk of users who want both ext4 and gluster jeopardized).
Or you can be kind, generous, and considerate to the legacy apps and users
(of which gluster is only a subset) and just provide a mount option to
control the large d_off behavior.

Thanks!
Avati



2013-02-13 21:33:54

by J. Bruce Fields

Subject: Re: regressions due to 64-bit ext4 directory cookies

On Wed, Feb 13, 2013 at 04:43:05PM +0000, Myklebust, Trond wrote:
> On Wed, 2013-02-13 at 11:20 -0500, J. Bruce Fields wrote:
> > Oops, probably should have cc'd linux-nfs.
> >
> > On Wed, Feb 13, 2013 at 10:36:54AM -0500, Theodore Ts'o wrote:
> > > The other thing that I'd note is that the readdir cookie has been
> > > 64-bit since NFSv3, which was released in June ***1995***. And the
> > > explicit, stated purpose of making it be a 64-bit value (as stated in
> > > RFC 1813) was to reduce interoperability problems. If that were the
> > > case, are you telling me that Sun (who has traditionally been pretty
> > > good worrying about interoperability concerns, and in fact employed
> > > the editors of RFC 1813) didn't get this right? This seems
> > > quite.... surprising to me.
> > >
> > > I thought this was the whole point of the various NFS interoperability
> > > testing done at Connectathon, for which Sun was a major sponsor?!? No
> > > one noticed?!?
> >
> > Beats me. But it's not necessarily easy to replace clients running
> > legacy applications, so we're stuck working with the clients we have....
> >
> > The linux client does remap the server-provided cookies to small
> > integers, I believe exactly because older applications had trouble with
> > servers returning "large" cookies. So presumably ext4-exporting-Linux
> > servers aren't the first to do this.
> >
> > I don't know which client versions are affected--Connectathon's next
> > week and I'll talk to people and make sure there's an ext4 export with
> > this turned on to test against.
>
> Actually, one of the main reasons for the Linux client not exporting raw
> readdir cookies is because the glibc-2 folks in their infinite wisdom
> declared that telldir()/seekdir() use an off_t. They then went yet one
> further and decided to declare negative offsets to be illegal so that
> they could use the negative values internally in their syscall wrappers.
>
> The POSIX definition has none of the above rubbish
> (http://pubs.opengroup.org/onlinepubs/009695399/functions/telldir.html)
> and so glibc brilliantly saddled Linux with a crippled readdir
> implementation that is _not_ POSIX compatible.
>
> No, I'm not at all bitter...

Oh, right, I knew I'd forgotten part of the story....

But then you must have actually been testing against servers that were
using that 32nd bit?

I think ext4 actually only uses 31 bits even in the 32-bit case. And
for a server that was literally using an offset inside a directory file,
that would be a colossal directory.

So I'm wondering how you ran across it.

Partly just pure curiosity.

--b.

2013-02-13 22:18:05

by J. Bruce Fields

Subject: Re: regressions due to 64-bit ext4 directory cookies

On Wed, Feb 13, 2013 at 09:17:28AM +0100, Bernd Schubert wrote:
> On 02/12/2013 10:00 PM, J. Bruce Fields wrote:
> >On Tue, Feb 12, 2013 at 09:56:41PM +0100, Bernd Schubert wrote:
> >>On 02/12/2013 09:28 PM, J. Bruce Fields wrote:
> >>>06effdbb49af5f6c "nfsd: vfs_llseek() with 32 or 64 bit offsets (hashes)"
> >>>and previous patches solved problems with hash collisions in large
> >>>directories by using 64- instead of 32-bit directory hashes in some
> >>>cases. But it caused problems for users who assume directory offsets
> >>>are "small". Two cases we've run across:
> >>>
> >>> - older NFS clients: 64-bit cookies cause applications on many
> >>> older clients to fail.
> >>> - gluster: gluster assumed that it could take the top bits of
> >>> the offset for its own use.
> >>>
> >>>In both cases we could argue we're in the right: the nfs protocol
> >>>defines cookies to be 64 bits, so clients should be prepared to handle
> >>>them (remapping to smaller integers if necessary to placate applications
> >>>using older system interfaces). And gluster was incorrect to assume
> >>>that the "offset" was really an "offset" as opposed to just an opaque
> >>>value.
> >>>
> >>>But in practice things that worked fine for a long time break on a
> >>>kernel upgrade.
> >>>
> >>>So at a minimum I think we owe people a workaround, and turning off
> >>>dir_index may not be practical for everyone.
> >>>
> >>>A "no_64bit_cookies" export option would provide a workaround for NFS
> >>>servers with older NFS clients, but not for applications like gluster.
> >>>
> >>>For that reason I'd rather have a way to turn this off on a given ext4
> >>>filesystem. Is that practical?
> >>
> >>I think Ted needs to answer whether he would accept another mount option.
> >>But before we go this way, what is gluster doing if there are hash
> >>collisions?
> >
> >They probably just haven't tested NFS with large enough directories.
>
> Is it only related to NFS, or to generic readdir over gluster?
>
> >The birthday paradox says you'd need about 2^16 entries to have a 50-50
> >chance of hitting the problem.
>
> We are frequently running into it with 50000 files per directory.
>
> >
> >I don't know enough about ext4 directory performance. But unfortunately
> >I suspect there's a range of directory sizes that are too small to have
> >a significant chance of having directory collisions, but still large
> >enough to need dir_index?
>
> Here is a link to the initial benchmark:
> http://search.luky.org/linux-kernel.2001/msg00117.html

Hm, so I still don't have a good feeling for when dir_index is likely to
start winning.

For comparison, assuming the probability of seeing a failure due to hash
collisions in an n-entry directory is the probability of a collision
among n numbers chosen uniformly at random from 2^31, that's about:

        0.0002% for n =   100
        0.006 % for n =   500
        0.02  % for n =  1000
        0.6   % for n =  5000
        2     % for n = 10000
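
(A quick standalone check of those numbers, as a sketch, using the
approximation p = 1 - exp(-n(n-1)/2d) with d = 2^31:)

#include <math.h>
#include <stdio.h>

int main(void)
{
        const double d = 2147483648.0;  /* 2^31 possible 31-bit cookies */
        const int n[] = { 100, 500, 1000, 5000, 10000 };

        for (unsigned i = 0; i < sizeof(n) / sizeof(n[0]); i++) {
                double p = 1.0 - exp(-n[i] * (n[i] - 1.0) / (2.0 * d));
                printf("%7.4f%% for n = %5d\n", 100.0 * p, n[i]);
        }
        return 0;       /* build with: cc birthday.c -lm */
}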

So if we could tell anyone with directories smaller than 10,000 entries:
"hey, you don't need dir_index anyway, just turn it off"--good, the only
people still forced to deal with 64-bit cookies will be the ones that
have probably already found that ext4 isn't reliable for their purposes.

If there are people with only a few hundred entries who still need
dir_index--well, we may be making them unhappy as we're making them
suffer to fix a bug that they've never actually seen.

--b.

2013-02-13 22:57:13

by Anand Avati

Subject: Re: regressions due to 64-bit ext4 directory cookies

On Wed, Feb 13, 2013 at 2:47 PM, Theodore Ts'o <[email protected]> wrote:

> On Wed, Feb 13, 2013 at 05:41:41PM -0500, J. Bruce Fields wrote:
> > > What if we have an ioctl or a process personality flag where a broken
> > > application can tell the file system "I'm broken, please give me a
> > > degraded telldir/seekdir cookie"? That way we don't penalize programs
> > > that are doing the right thing, while providing some accommodation for
> > > programs who are abusing the telldir cookie.
> >
> > Yeah, if there's a simple way to do that, maybe it would be worth it.
>
> Doing this as an ioctl which gets called right after opendir, i.e
> (ignoring error checking):
>
> DIR *dir = opendir("/foo/bar/baz");
> ioctl(dirfd(dir), EXT4_IOC_DEGRADED_READDIR, 1);
> ...
>
> should be quite easy. It would be a very ext3/4 specific thing,
> though.


That would work, even though it would be ext3/4 specific. What is the
recommended programmatic way to detect whether the file is on ext3/4? We
would not want to attempt that blindly on a non-ext3/4 FS, as the
numerical value of EXT4_IOC_DEGRADED_READDIR might get interpreted in
dangerous ways.
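
(One plausible guard, sketched under the assumption that a statfs f_type
check is acceptable; note ext2/3/4 all share the same magic, so this only
narrows things to the ext* family:)

#include <sys/vfs.h>
#include <linux/magic.h>        /* EXT4_SUPER_MAGIC == 0xEF53 */

/* returns nonzero if fd lives on an ext2/3/4 filesystem */
static int is_extfs(int fd)
{
        struct statfs sfs;

        if (fstatfs(fd, &sfs) != 0)
                return 0;
        return sfs.f_type == EXT4_SUPER_MAGIC;
}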

Avati



2013-02-14 00:05:01

by Anand Avati

Subject: Re: regressions due to 64-bit ext4 directory cookies

On Wed, Feb 13, 2013 at 3:44 PM, Theodore Ts'o <[email protected]> wrote:
>
> I suspect this would seriously screw over Gluster, though, and this
> wouldn't be a solution for NFSv3, since NFS needs long-lived directory
> cookies, and not the short-lived cookies which is all POSIX/SuSv3
> guarantees.
>

Actually this would work just fine with Gluster. Except in the case of
gluster-NFS, the native client is only acting like a router/proxy of
syscalls to the backend system. A directory opened by an application will
have a matching directory fd opened on ext4, and readdir from an app will
be translated into readdir on the matching fd on ext4. So the
app-on-glusterfs and glusterfsd-on-ext4 are essentially "moving in tandem".
As long as the offs^H^H^H^H cookies do not overflow in the transformation,
Gluster would not have a problem.

However Gluster-NFS (and NFS in general, too) will break, as we
opendir/closedir potentially on every request.

Avati



2013-02-14 03:59:20

by Myklebust, Trond

Subject: RE: regressions due to 64-bit ext4 directory cookies

> -----Original Message-----
> From: J. Bruce Fields [mailto:[email protected]]
> Sent: Wednesday, February 13, 2013 4:34 PM
> To: Myklebust, Trond
> Cc: Theodore Ts'o; [email protected]; [email protected];
> Bernd Schubert; [email protected]; [email protected]
> Subject: Re: regressions due to 64-bit ext4 directory cookies
>
> On Wed, Feb 13, 2013 at 04:43:05PM +0000, Myklebust, Trond wrote:
> > On Wed, 2013-02-13 at 11:20 -0500, J. Bruce Fields wrote:
> > > Oops, probably should have cc'd linux-nfs.
> > >
> > > On Wed, Feb 13, 2013 at 10:36:54AM -0500, Theodore Ts'o wrote:
> > > > The other thing that I'd note is that the readdir cookie has been
> > > > 64-bit since NFSv3, which was released in June ***1995***. And
> > > > the explicit, stated purpose of making it be a 64-bit value (as
> > > > stated in RFC 1813) was to reduce interoperability problems. If
> > > > that were the case, are you telling me that Sun (who has
> > > > traditionally been pretty good worrying about interoperability
> > > > concerns, and in fact employed the editors of RFC 1813) didn't get
> > > > this right? This seems quite.... surprising to me.
> > > >
> > > > I thought this was the whole point of the various NFS
> > > > interoperability testing done at Connectathon, for which Sun was a
> > > > major sponsor?!? No one noticed?!?
> > >
> > > Beats me. But it's not necessarily easy to replace clients running
> > > legacy applications, so we're stuck working with the clients we have....
> > >
> > > The linux client does remap the server-provided cookies to small
> > > integers, I believe exactly because older applications had trouble
> > > with servers returning "large" cookies. So presumably
> > > ext4-exporting-Linux servers aren't the first to do this.
> > >
> > > I don't know which client versions are affected--Connectathon's next
> > > week and I'll talk to people and make sure there's an ext4 export
> > > with this turned on to test against.
> >
> > Actually, one of the main reasons for the Linux client not exporting
> > raw readdir cookies is because the glibc-2 folks in their infinite
> > wisdom declared that telldir()/seekdir() use an off_t. They then went
> > yet one further and decided to declare negative offsets to be illegal
> > so that they could use the negative values internally in their syscall
> wrappers.
> >
> > The POSIX definition has none of the above rubbish
> > (http://pubs.opengroup.org/onlinepubs/009695399/functions/telldir.html
> > ) and so glibc brilliantly saddled Linux with a crippled readdir
> > implementation that is _not_ POSIX compatible.
> >
> > No, I'm not at all bitter...
>
> Oh, right, I knew I'd forgotten part of the story....
>
> But then you must have actually been testing against servers that were using
> that 32nd bit?
>
> I think ext4 actually only uses 31 bits even in the 32-bit case. And for a server
> that was literally using an offset inside a directory file, that would be a
> colossal directory.
>
> So I'm wondering how you ran across it.
>
> Partly just pure curiosity.

IIRC, XFS on IRIX used 0xFFFFF as the readdir eof marker, which caused us to generate an EIO...

Cheers
Trond

2013-02-14 05:32:35

by Dave Chinner

Subject: Re: [Gluster-devel] regressions due to 64-bit ext4 directory cookies

On Wed, Feb 13, 2013 at 04:40:35PM +0100, Bernd Schubert wrote:
> >The general advice (even before this bug) is that XFS should be used,
> >which is not affected by this problem (yet?).
>
> Hmm, well, always depends on the workload.

XFS won't suffer from this collision bug, for 2 reasons. The first
is that XFS uses a virtual mapping for directory data and uses an
encoded index into that virtual mapping as the cookie data. You
can't have 2 entries at the same index, so you cannot get cookie
collisions.

The second is that the virtual mapping is for a 32GB data segment,
(2^35 bytes) and, like so much of XFS, the cookie is made up of
bitfields that encode a specific location. The high bits are the
virtual block offset into the directory data segment, the low bits
the offset into the directory block. Given that directory entries
are aligned to 8 bytes, the offset into the directory block can have
3 bits compressed out and hence we end up with only 32 bits being
needed to address the entire 32GB directory data segment.
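
(As a rough illustration of that layout -- assuming 4 KiB directory blocks
and hypothetical helper names, not the actual XFS source:)

#include <stdint.h>

#define DIR_BLK_SHIFT   12      /* assumed 4 KiB directory blocks */
#define ALIGN_SHIFT     3       /* dirents are 8-byte aligned */

/* 23 bits of block index + 9 bits of aligned offset = a 32-bit cookie
 * covering the whole 2^35-byte directory data segment */
static inline uint32_t dir_cookie(uint32_t blk, uint32_t byte_off)
{
        return (blk << (DIR_BLK_SHIFT - ALIGN_SHIFT)) |
               (byte_off >> ALIGN_SHIFT);
}

static inline void dir_cookie_split(uint32_t cookie,
                                    uint32_t *blk, uint32_t *byte_off)
{
        *blk = cookie >> (DIR_BLK_SHIFT - ALIGN_SHIFT);
        *byte_off = (cookie & ((1u << (DIR_BLK_SHIFT - ALIGN_SHIFT)) - 1))
                    << ALIGN_SHIFT;
}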

So, there are no collisions or 32/64 bit issues with XFS directory
cookies regardless of the workload.

Cheers,

Dave.
--
Dave Chinner
[email protected]

2013-02-14 05:45:36

by Dave Chinner

Subject: Re: regressions due to 64-bit ext4 directory cookies

On Thu, Feb 14, 2013 at 03:59:17AM +0000, Myklebust, Trond wrote:
> > -----Original Message-----
> > From: J. Bruce Fields [mailto:[email protected]]
> > Sent: Wednesday, February 13, 2013 4:34 PM
> > To: Myklebust, Trond
> > Cc: Theodore Ts'o; [email protected]; [email protected];
> > Bernd Schubert; [email protected]; [email protected]
> > Subject: Re: regressions due to 64-bit ext4 directory cookies
> >
> > On Wed, Feb 13, 2013 at 04:43:05PM +0000, Myklebust, Trond wrote:
> > > On Wed, 2013-02-13 at 11:20 -0500, J. Bruce Fields wrote:
> > > > Oops, probably should have cc'd linux-nfs.
> > > >
> > > > On Wed, Feb 13, 2013 at 10:36:54AM -0500, Theodore Ts'o wrote:
> > > > > The other thing that I'd note is that the readdir cookie has been
> > > > > 64-bit since NFSv3, which was released in June ***1995***. And
> > > > > the explicit, stated purpose of making it be a 64-bit value (as
> > > > > stated in RFC 1813) was to reduce interoperability problems. If
> > > > > that were the case, are you telling me that Sun (who has
> > > > > traditionally been pretty good worrying about interoperability
> > > > > concerns, and in fact employed the editors of RFC 1813) didn't get
> > > > > this right? This seems quite.... surprising to me.
> > > > >
> > > > > I thought this was the whole point of the various NFS
> > > > > interoperability testing done at Connectathon, for which Sun was a
> > > > > major sponsor?!? No one noticed?!?
> > > >
> > > > Beats me. But it's not necessarily easy to replace clients running
> > > > legacy applications, so we're stuck working with the clients we have....
> > > >
> > > > The linux client does remap the server-provided cookies to small
> > > > integers, I believe exactly because older applications had trouble
> > > > with servers returning "large" cookies. So presumably
> > > > ext4-exporting-Linux servers aren't the first to do this.
> > > >
> > > > I don't know which client versions are affected--Connectathon's next
> > > > week and I'll talk to people and make sure there's an ext4 export
> > > > with this turned on to test against.
> > >
> > > Actually, one of the main reasons for the Linux client not exporting
> > > raw readdir cookies is because the glibc-2 folks in their infinite
> > > wisdom declared that telldir()/seekdir() use an off_t. They then went
> > > yet one further and decided to declare negative offsets to be illegal
> > > so that they could use the negative values internally in their syscall
> > wrappers.
> > >
> > > The POSIX definition has none of the above rubbish
> > > (http://pubs.opengroup.org/onlinepubs/009695399/functions/telldir.html
> > > ) and so glibc brilliantly saddled Linux with a crippled readdir
> > > implementation that is _not_ POSIX compatible.
> > >
> > > No, I'm not at all bitter...
> >
> > Oh, right, I knew I'd forgotten part of the story....
> >
> > But then you must have actually been testing against servers that were using
> > that 32nd bit?
> >
> > I think ext4 actually only uses 31 bits even in the 32-bit case. And for a server
> > that was literally using an offset inside a directory file, that would be a
> > colossal directory.

That's exactly what XFS directory cookies are - a direct encoding of
the dirent offset into the directory file. Which means an overflow
would occur at 16GB of directory data for XFS. That is in the realm
of several hundreds of millions of files in a single directory,
which I have seen done before....

> > So I'm wondering how you ran across it.
> >
> > Partly just pure curiosity.
>
> IIRC, XFS on IRIX used 0xFFFFF as the readdir eof marker, which caused us to generate an EIO...

And this discussion explains the magic 0x7fffffff offset mask in the
linux XFS readdir code. I've been trying to find out for years
exactly why that was necessary, and now I know.

I probably should write a patch that makes it a "non-magic" number
and removes it completely for 64 bit platforms before I forget again...

Cheers,

Dave.
--
Dave Chinner
[email protected]

2013-03-26 15:23:11

by Bernd Schubert

Subject: Re: regressions due to 64-bit ext4 directory cookies

Sorry for my late reply; I have been rather busy.

On 02/14/2013 01:05 AM, Anand Avati wrote:
> On Wed, Feb 13, 2013 at 3:44 PM, Theodore Ts'o <[email protected]> wrote:
>>
>> I suspect this would seriously screw over Gluster, though, and this
>> wouldn't be a solution for NFSv3, since NFS needs long-lived directory
>> cookies, and not the short-lived cookies which is all POSIX/SuSv3
>> guarantees.
>>
>
> Actually this would work just fine with Gluster. Except in the case of

Would it really work perfectly? What about a server reboot in the middle
of a client's readdir?

> gluster-NFS, the native client is only acting like a router/proxy of
> syscalls to the backend system. A directory opened by an application will
> have a matching directory fd opened on ext4, and readdir from an app will
> be translated into readdir on the matching fd on ext4. So the
> app-on-glusterfs and glusterfsd-on-ext4 are essentially "moving in tandem".
> As long as the offs^H^H^H^H cookies do not overflow in the transformation,
> Gluster would not have a problem.
>
> However Gluster-NFS (and NFS in general, too) will break, as we
> opendir/closedir potentially on every request.

We haven't reached a conclusion so far, have we? What about the ioctl
approach, but done a bit differently? Would it work to specify the allowed
upper bits for ext4 (for example 16 additional bits) and leave the
remaining part for gluster? One of the mails had the calculation formula:

final_d_off = (ext4_d_off * MAX_SERVERS) + server_idx

But what is the value of MAX_SERVERS?


Cheers,
Bernd

2013-03-28 18:05:41

by Anand Avati

Subject: Re: regressions due to 64-bit ext4 directory cookies

On Thu, Mar 28, 2013 at 10:52 AM, Zach Brown <[email protected]> wrote:

> On Thu, Mar 28, 2013 at 10:07:44AM -0400, Theodore Ts'o wrote:
> > On Tue, Mar 26, 2013 at 10:48:14AM -0500, Eric Sandeen wrote:
> > > > We haven't reached a conclusion so far, have we? What about the
> > > > ioctl approach, but done a bit differently? Would it work to specify
> > > > the allowed upper bits for ext4 (for example 16 additional bits) and
> > > > leave the remaining part for gluster? One of the mails had the
> > > > calculation formula:
> > >
> > > I did throw together an ioctl patch last week, but I think Anand has a
> new
> > > approach he's trying out which won't require ext4 code changes. I'll
> let
> > > him reply when he has a moment. :)
> >
> > Any update about whether Gluster can address this without needing the
> > ioctl patch? Or should we push the ioctl patch into ext4 for the next
> > merge window?
>
> They're testing a work-around:
>
> http://review.gluster.org/#change,4711
>
> I'm not sure if they've decided that they're going to go with it, or
> not.
>

Jeff reported that the approach did not work in his testing. I haven't had
a chance to look into the failure yet. Independent of the fix, it would
certainly be good have the ioctl() support - Samba could use it too, if it
wanted.

Avati



2013-03-28 18:49:36

by Anand Avati

Subject: Re: regressions due to 64-bit ext4 directory cookies

On Thu, Mar 28, 2013 at 11:31 AM, J. Bruce Fields <[email protected]> wrote:
>
> > Jeff reported that the approach did not work in his testing. I haven't
> > had a chance to look into the failure yet. Independent of the fix, it
> > would certainly be good to have the ioctl() support
>
> The one advantage of your scheme is that it keeps more of the hash bits;
> the chance of 31-bit cookie collisions is much higher.


Yes, it should, based on the theory of how ext4 was generating the
63 bits. But Jeff's test finds that the experiment is not matching the
theory. I intend to debug this, but am currently drowned in a different
issue. It would be good if the ext4 developers could have a look at
http://review.gluster.org/4711 and see if there are obvious holes in the
approach or code.

Avati



2013-03-28 22:14:50

by Anand Avati

Subject: Re: regressions due to 64-bit ext4 directory cookies

On Thu, Mar 28, 2013 at 12:43 PM, Jeff Darcy <[email protected]> wrote:

> On 03/28/2013 02:49 PM, Anand Avati wrote:
> > Yes, it should, based on the theory of how ext4 was generating the
> > 63 bits. But Jeff's test finds that the experiment is not matching the
> > theory.
>
> FWIW, I was able to re-run my test in between stuff related to That
> Other Problem. What seems to be happening is that we read correctly
> until just after d_off 0x4000000000000000, then we suddenly wrap around
> - not to the very first d_off we saw, but to a pretty early one (e.g.
> 0x0041b6340689a32e). This is all on a single brick, BTW, so it's pretty
> easy to line up the back-end and front-end d_off values which match
> perfectly up to this point.
>
> I haven't had a chance to ponder what this all means and debug it
> further. Hopefully I'll be able to do so soon, but I figured I'd
> mention it in case something about those numbers rang a bell.
>

Of course, the unit tests (with artificial offsets) were done with brick
count >= 2. You have tested with DHT subvol count = 1, which was not
covered, and sure enough, the code isn't handling it well. I just verified
with the unit tests that the brick count = 1 condition fails to return the
same d_off.

Posting a fixed version. Thanks for the catch!

Avati



2013-03-28 22:20:59

by Anand Avati

Subject: Re: regressions due to 64-bit ext4 directory cookies

On Thu, Mar 28, 2013 at 3:14 PM, Anand Avati <[email protected]> wrote:

> On Thu, Mar 28, 2013 at 12:43 PM, Jeff Darcy <[email protected]> wrote:
>
>> On 03/28/2013 02:49 PM, Anand Avati wrote:
>> > Yes, it should, based on the theory of how ext4 was generating the
>> > 63 bits. But Jeff's test finds that the experiment is not matching the
>> > theory.
>>
>> FWIW, I was able to re-run my test in between stuff related to That
>> Other Problem. What seems to be happening is that we read correctly
>> until just after d_off 0x4000000000000000, then we suddenly wrap around
>> - not to the very first d_off we saw, but to a pretty early one (e.g.
>> 0x0041b6340689a32e). This is all on a single brick, BTW, so it's pretty
>> easy to line up the back-end and front-end d_off values which match
>> perfectly up to this point.
>>
>> I haven't had a chance to ponder what this all means and debug it
>> further. Hopefully I'll be able to do so soon, but I figured I'd
>> mention it in case something about those numbers rang a bell.
>>
>
> Of course, the unit tests (with artificial offsets) were done with brick
> count >= 2. You have tested with DHT subvol count = 1, which was not
> covered, and sure enough, the code isn't handling it well. I just verified
> with the unit tests that the brick count = 1 condition fails to return the
> same d_off.
>
> Posting a fixed version. Thanks for the catch!
>

Posted an updated version at http://review.gluster.org/4711. This passes
unit tests for all brick counts (>= 1). Can you confirm whether the
looping is now gone in your test env?

Thanks,
Avati

