2008-03-11 05:55:44

by Bharata B Rao

Subject: [RFC] Union mount readdir support in glibc

Extending glibc readdir() to support union mounted directories
==============================================================

Any filesystem namespace unification solution needs to merge the
directory contents by eliminating duplicate(1) and whiteout(2) directory
entries. We (as part of Union Mount(3) effort) have been trying to do this
inside readdir/getdents system call in the kernel(4), but have found the
approach to be not very feasible. Maintaining a cache of dirents within the
kernel turned out to be expensive in both memory and CPU.
In this proposal, we discuss the feasibility of achieving
duplicate elimination and whiteout suppression by extending the readdir
implementation of glibc.

Before actually going ahead with the implementation, this is an effort
to build consensus about the approach within the glibc community.

readdir/getdents in kernel
--------------------------
readdir/getdents system call on a union mounted directory would just return
all the entries of the union starting from the topmost directory.

readdir in glibc
----------------
glibc readdir would maintain a cache of dirents returned by readdir/getdents
system call to perform duplicate elimination and whiteout suppression.

glibc readdir would need the following information from the kernel:

- Indication that this directory is a union mounted directory.
With this indication, glibc need not build and maintain
dirent cache for normal non-union directories. Need to check if glibc
can live without such indication, which means that glibc readdir would
build dirent cache for _all_ directories. Need to check the overhead
incurred by doing this.

- Indication that kernel has finished returning the dirents from the
topmost directory of the union.
This would tell glibc readdir to start eliminating the duplicates
for the subsequent dirents returned from the bottom directories of the union.
Can the kernel pass a special dirent like a "." whiteout to indicate this?
Any other methods? Or should glibc not bother about dirents from
different directories and start duplicate elimination right from the
beginning? Of course this would result in some extra comparisons
(comparisons against d_name to determine the duplicates) for the dirents
from the topmost directory.

- Indication that this directory entry is a whiteout entry.
The DT_WHT value for d_type (which already exists in Linux) will be
used to identify whiteouts.

- Indication that one of the directories of the union got modified (removal
or addition of a directory entry) during readdir.
Ideally any change to the directories should result in
dirent cache invalidation. But should we even be bothered about this?
Should we just ignore directory modifications happening between glibc opendir
and closedir and just live with whatever entries the kernel returns?
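
The merging described above can be sketched in a few lines of C. This is only an illustration under stated assumptions, not glibc code: `struct union_ent`, `seen_before()` and `union_merge()` are hypothetical names, the input is assumed to arrive in top-to-bottom layer order, and the linear name search stands in for whatever hash table a real dirent cache would use.

```c
#define _DEFAULT_SOURCE   /* expose the DT_* constants in <dirent.h> */
#include <assert.h>
#include <dirent.h>
#include <stddef.h>
#include <string.h>

#ifndef DT_WHT
#define DT_WHT 14         /* fallback for C libraries that hide DT_WHT */
#endif

/* Hypothetical record: one (name, d_type) pair as returned by the
   kernel, in top-to-bottom layer order. */
struct union_ent {
    const char *name;
    unsigned char type;   /* DT_REG, DT_DIR, DT_WHT, ... */
};

/* Return nonzero if NAME already appears among the first N entries. */
static int seen_before(const struct union_ent *in, size_t n, const char *name)
{
    for (size_t i = 0; i < n; i++)
        if (strcmp(in[i].name, name) == 0)
            return 1;
    return 0;
}

/* Copy the visible entries of IN[0..n) into OUT and return their count.
   An entry is visible if no earlier entry has the same name (duplicate
   elimination) and it is not itself a whiteout (whiteout suppression).
   A whiteout still counts as "seen", so it hides same-named entries
   coming from lower layers. OUT must have room for N entries. */
static size_t union_merge(const struct union_ent *in, size_t n,
                          struct union_ent *out)
{
    size_t count = 0;

    assert(in != NULL && out != NULL);
    for (size_t i = 0; i < n; i++) {
        if (seen_before(in, i, in[i].name))
            continue;               /* hidden by a higher layer */
        if (in[i].type == DT_WHT)
            continue;               /* whiteout: never shown */
        out[count++] = in[i];
    }
    return count;
}
```

Note that every name, including whiteouts, has to be remembered until the stream is closed. That per-directory state is the cost this approach moves into userland.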

Downsides of this approach
--------------------------

- Applications using readdir/getdents system calls directly would have
to handle duplicate elimination and whiteout suppression themselves.
- Applications linked statically with glibc would have problems with
union mounted directories.

Directory seek support
----------------------
Can we simply not support seek on a union mounted directory?

But if directory seek support on a union mounted directory is a necessity,
my current thinking is that we can define the seek behaviour on the
dirent cache maintained by glibc readdir. With this, glibc would be
defining the seek behaviour for union mounted directories instead of
getting it from kernel. But again this would work only for applications
that make use of seekdir() and friends and not for applications that
use lseek(2) directly. I am yet to look at glibc seekdir in detail to
ascertain if the above approach is doable.
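
As a rough sketch of what "seek defined on the dirent cache" could mean, assume the merged entries are already held in an array and a position is just an index into it. The names `struct union_dir`, `ud_read()`, `ud_tell()` and `ud_seek()` are hypothetical and are not glibc interfaces.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical userland stream over an already merged dirent cache. */
struct union_dir {
    const char **names;   /* merged, de-duplicated entry names */
    size_t count;         /* number of entries in NAMES */
    size_t pos;           /* index of the next entry to return */
};

/* Return the next cached name, or NULL at the end of the directory. */
static const char *ud_read(struct union_dir *d)
{
    assert(d != NULL);
    return d->pos < d->count ? d->names[d->pos++] : NULL;
}

/* The "cookie" is simply the index into the cache. */
static long ud_tell(const struct union_dir *d)
{
    assert(d != NULL);
    return (long)d->pos;
}

/* Accept only positions that lie inside the cache; ignore others. */
static void ud_seek(struct union_dir *d, long loc)
{
    assert(d != NULL);
    if (loc >= 0 && (size_t)loc <= d->count)
        d->pos = (size_t)loc;
}
```

The positions are stable only because the whole merged listing stays in memory for the life of the stream, and they mean nothing to lseek(2) on the underlying file descriptor.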

Regards,
Bharata.
--

(1) Duplicates: With filesystem namespace unification, it is possible to
have same-named entries in multiple layers/branches/directories of the union.
In such cases only the entry from the topmost layer/branch/directory is
made available.

(2) Whiteouts: Whiteouts are place holders for the entries that don't exist
logically in the union. Typically, deletion of an entry present only in
the (read-only) lower layer of a union would result in a whiteout getting
created in the topmost layer of the union. Whiteout lookup would return
-ENOENT.

(3) Union Mount: http://lkml.org/lkml/2007/7/30/193

(4) Directory listing approaches in the kernel:
http://lkml.org/lkml/2007/12/5/147


2008-03-11 08:10:05

by Roland McGrath

Subject: Re: [RFC] Union mount readdir support in glibc

It seems very unlikely you'd come up with a version of this plan that we'd
find acceptable in glibc. readdir does buffering, sometimes entry format
conversion, and it can skip dummy entries. That's it. It's not going to
become a big hairy thing with all kinds of new state. Sorry.

This really is the kernel filesystem's problem. It just doesn't make sense
to expect userland to implement half of your directory semantics for you.
What are you going to do when you want to export a union directory to NFS?
readdir is a filesystem operation. You're implementing a filesystem.

Exposing DT_WHT entries may be useful as a user feature. (BSD had unions
with whiteouts years ago, and their ls et al have options to let you see
and operate on whiteouts explicitly so users can make sense of strange
situations with unions.) But even for that, we'd have to consider the
compatibility issues.


Thanks,
Roland

2008-03-11 12:49:57

by Bharata B Rao

Subject: Re: [RFC] Union mount readdir support in glibc

On Tue, Mar 11, 2008 at 01:09:29AM -0700, Roland McGrath wrote:
> It seems very unlikely you'd come up with a version of this plan that we'd
> find acceptable in glibc. readdir does buffering, sometimes entry format
> conversion, and it can skip dummy entries. That's it. It's not going to
> become a big hairy thing with all kinds of new state. Sorry.

In the approach we are suggesting, at the minimum, glibc readdir would
have to maintain a unified cache of dirents with the knowledge of
whiteouts (DT_WHT). Would that be too much?

>
> This really is the kernel filesystem's problem. It just doesn't make sense
> to expect userland to implement half of your directory semantics for you.
> What are you going to do when you want to export a union directory to NFS?
> readdir is a filesystem operation. You're implementing a filesystem.

Not really. In Union Mount, most of the unification support is done at
VFS layer with some support from filesystems (for things like
whiteouts). It is Unionfs which implements a new filesystem to achieve
unification. Unification is not purely a kernel filesystem's problem, it
involves both VFS and FS.

>
> Exposing DT_WHT entries may be useful as a user feature. (BSD had unions
> with whiteouts years ago, and their ls et al have options to let you see
> and operate on whiteouts explicitly so users can make sense of strange
> situations with unions.) But even for that, we'd have to consider the
> compatibility issues.

AFAIK, even BSD implements duplicate elimination and whiteout
suppression in the userland.

Thanks for your comments.

Regards,
Bharata.

2008-03-12 04:29:20

by Bharata B Rao

Subject: Re: [RFC] Union mount readdir support in glibc

On Tue, Mar 11, 2008 at 01:09:29AM -0700, Roland McGrath wrote:
>
> This really is the kernel filesystem's problem. It just doesn't make sense
> to expect userland to implement half of your directory semantics for you.

I agree that we are asking glibc to handle part of the union mount
semantics in readdir. We have tried handling directory listing of the
union entirely inside the kernel, but the results haven't been so good
(http://lkml.org/lkml/2007/12/5/147). Recently Al Viro suggested that we
do this in userland, and he felt that is the only sane way of doing this.
In fact, I had mentioned this approach to Ulrich briefly during
FOSS.IN, and he sounded positive about the idea of maintaining a dirent
cache for duplicate elimination as long as it doesn't slow down normal
users.

Regards,
Bharata.

2008-03-14 03:58:35

by Ulrich Drepper

Subject: Re: [RFC] Union mount readdir support in glibc

Bharata B Rao wrote:
> readdir in glibc
> ----------------
> glibc readdir would maintain a cache of dirents returned by readdir/getdents
> system call to perform duplicate elimination and whiteout suppression.

As an optimization I have no problem with adding the code to glibc if it
does not impact non-union directories. But requiring lockstep update of
kernel and glibc to use a feature like this which is not under control
of the application is a huge problem. I don't think we ever had to
resort to this. I consider this absolutely only the last resort.

Current readdir requires:


opendir:  open
          fstat (this can probably even be removed)
          malloc small buffer

readdir:  if buffer is empty
              getdents call (read multiple entries)
          return next entry


There is very little overhead. Since getdents copies multiple records
per call, it is more efficient than implementing readdir in the kernel.
This is how efficient normal directory operations must remain. The only
slight inefficiency is that we have to copy the entries after getdents()
because the d_type field is not in the place we expect it at userlevel.
For this a new interface could help.
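
The sequence outlined above can be approximated with the raw getdents64 system call. This is a minimal sketch, not glibc's implementation: `struct mini_dir` and the `mini_*` functions are hypothetical, the fstat step is left out, and `struct linux_dirent64` is declared by hand following the layout documented in getdents(2).

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <stdint.h>
#include <stdlib.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Record layout filled in by the getdents64 system call. */
struct linux_dirent64 {
    uint64_t       d_ino;
    int64_t        d_off;
    unsigned short d_reclen;
    unsigned char  d_type;
    char           d_name[];
};

/* Hypothetical minimal directory stream: a descriptor plus one buffer. */
struct mini_dir {
    int fd;
    size_t size;                          /* bytes valid in buf */
    size_t pos;                           /* offset of the next record */
    _Alignas(uint64_t) char buf[4096];    /* the "small buffer" */
};

/* opendir: open the directory and allocate the buffer. */
static struct mini_dir *mini_opendir(const char *path)
{
    struct mini_dir *d = malloc(sizeof *d);

    if (d == NULL)
        return NULL;
    d->fd = open(path, O_RDONLY | O_DIRECTORY | O_CLOEXEC);
    if (d->fd < 0) {
        free(d);
        return NULL;
    }
    d->size = d->pos = 0;
    return d;
}

/* readdir: refill with one getdents64 call only when the buffer is
   empty, then hand out the next record from the buffer. */
static struct linux_dirent64 *mini_readdir(struct mini_dir *d)
{
    struct linux_dirent64 *e;

    assert(d != NULL);
    if (d->pos >= d->size) {
        long n = syscall(SYS_getdents64, d->fd, d->buf, sizeof d->buf);
        if (n <= 0)
            return NULL;                  /* end of directory or error */
        d->size = (size_t)n;
        d->pos = 0;
    }
    e = (struct linux_dirent64 *)(d->buf + d->pos);
    d->pos += e->d_reclen;
    return e;
}

static void mini_closedir(struct mini_dir *d)
{
    close(d->fd);
    free(d);
}
```

The only per-stream state is the descriptor, one buffer and two offsets, which is the baseline any union-aware variant would be compared against.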


To handle union FS at userlevel somewhere in that code sequence (perhaps
in the fstat call) we'd have to recognize such mounts. Before any
agreement on userlevel sorting can be made you'll have to answer a
question Roland already asked:

- How does this work with NFS?


Regarding questions you have: if a directory is currently being read and
files are added or removed, all bets are off.

re seeking: you have to support seeking. There is no way around it.
Once again, if any file has been added/removed, all bets are off. So,
why not provide a cookie similar to what is done today? I think it is
not acceptable to require caching the entire directory content at
userlevel. It's bad enough if we have to store the file names for
duplicate elimination.

--
➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖

2008-03-14 05:40:37

by Al Viro

Subject: Re: [RFC] Union mount readdir support in glibc

On Thu, Mar 13, 2008 at 08:53:48PM -0700, Ulrich Drepper wrote:

> To handle union FS at userlevel somewhere in that code sequence (perhaps
> in the fstat call) we'd have to recognize such mounts.

*Snort*

How about "the first entry returned by getdents(2) after open() is a whiteout
for e.g. '.'"? No fstat needed, zero impact for normal directories,
zero impact for any binaries on old kernels (where you wouldn't have
unions) and zero impact for old binaries on new kernels unless they
do getdents() on directory that happens to be a union.

And no lockstep...
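
A sketch of the check this would imply in userland, assuming the kernel really did emit such a marker as the very first entry. `is_union_marker()` is a hypothetical helper, not an existing glibc or kernel interface.

```c
#define _DEFAULT_SOURCE   /* expose the DT_* constants in <dirent.h> */
#include <assert.h>
#include <dirent.h>
#include <string.h>

#ifndef DT_WHT
#define DT_WHT 14         /* fallback for C libraries that hide DT_WHT */
#endif

/* Nonzero if (NAME, TYPE), taken from the first entry returned after
   open(), is the proposed marker: a whiteout for ".". */
static int is_union_marker(const char *name, unsigned char type)
{
    assert(name != NULL);
    return type == DT_WHT && strcmp(name, ".") == 0;
}
```

A reader would evaluate this once per stream, on the first record only. For an ordinary directory the test fails immediately and no merge state is ever allocated.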

> Before any
> agreement on userlevel sorting can be made you'll have to answer a
> question Roland already asked:
>
> - How does this work with NFS?

It won't, kernel-side or done in userland.

> re seeking: you have to support seeking. There is no way around it.

Actually, do we really need it other than to 0 and to current position
(i.e. full rewind and a no-op)?

2008-03-14 07:19:19

by Ulrich Drepper

Subject: Re: [RFC] Union mount readdir support in glibc

Al Viro wrote:
> How about "the first entry returned by getdents(2) after open() is a whiteout
> for e.g. '.'"? No fstat needed, zero impact for normal directories,
> zero impact for any binaries on old kernels (where you wouldn't have
> unions) and zero impact for old binaries on new kernels unless they
> do getdents() on directory that happens to be a union.

Your definition of "zero impact" doesn't quite match mine. This would
require significant changes.


> And no lockstep...

Of course there is lockstep. It is not under the application's control
whether a directory is a union fs or not. Every implementation except a
pure kernel implementation has this problem.


>> - How does this work with NFS?
>
> It won't, kernel-side or done in userland.

Why wouldn't a kernel-side implementation work?


> Actually, do we really need it other than to 0 and to current position
> (i.e. full rewind and a no-op)?

Ever heard of the little function "telldir"?

--
➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖

2008-03-14 08:42:48

by Miklos Szeredi

Subject: Re: [RFC] Union mount readdir support in glibc

> > Actually, do we really need it other than to 0 and to current position
> > (i.e. full rewind and a no-op)?
>
> Ever heard of the little function "telldir"?

Actually, telldir/seekdir is already broken for some filesystems (NFS
comes to mind). POSIX was really crazy to require a working seekdir
implementation, and userspace should quickly start _not_ using it.

The more new filesystems it doesn't work on, the better, IMHO.

Miklos

2008-03-14 15:07:45

by Jan Blunck

Subject: Re: [RFC] Union mount readdir support in glibc

On Thu, Mar 13, Ulrich Drepper wrote:

> There is very little overhead. Since getdents copies multiple records
> per call, it is more efficient than implementing readdir in the kernel.
> This is how efficient normal directory operations must remain. The only
> slight inefficiency is that we have to copy the entries after getdents()
> because the d_type field is not in the place we expect it at userlevel.
> For this a new interface could help.

BTW, since some filesystems always give DT_UNKNOWN, an additional stat is
necessary to implement whiteout filtering. I don't want to do that in
kernel space if possible ...

> Regarding questions you have: if a directory is currently being read and
> files are added or removed, all bets are off.
>
> re seeking: you have to support seeking. There is no way around it.
> Once again, if any file has been added/removed, all bets are off. So,
> why not provide a cookie similar to what is done today? I think it is
> not acceptable to require caching the entire directory content at
> userlevel. It's bad enough if we have to store the file names for
> duplicate elimination.

Which basically means tracking the "space" between dirents and maintaining
the relative order of entries, which is a pain. I already tried to solve this
problem for tmpfs before, and it needs a huge amount of kernel memory for open
directories. In the end I only know of one situation where it is used: very old
glibc when running 32-bit applications on a 64-bit kernel.

Cheers,
Jan

2008-03-14 17:57:24

by Peter Staubach

Subject: Re: [RFC] Union mount readdir support in glibc

Miklos Szeredi wrote:
>>> Actually, do we really need it other than to 0 and to current position
>>> (i.e. full rewind and a no-op)?
>>>
>> Ever heard of the little function "telldir"?
>>
>
> Actually, telldir/seekdir is already broken for some filesystems (NFS
> comes to mind). POSIX was really crazy to require a working seekdir
> implementation, and userspace should quickly start _not_ using it.
>
>

What makes you think that telldir/seekdir don't work for NFS? The over the
wire protocols clearly take values which could be retrieved and stored via
those interfaces.

ps

> The more new filesystems it doesn't work, the better, IMHO.
>
> Miklos

2008-03-14 20:53:20

by Miklos Szeredi

Subject: Re: [RFC] Union mount readdir support in glibc

> >>> Actually, do we really need it other than to 0 and to current position
> >>> (i.e. full rewind and a no-op)?
> >>>
> >> Ever heard of the little function "telldir"?
> >>
> >
> > Actually, telldir/seekdir is already broken for some filesystems (NFS
> > comes to mind). POSIX was really crazy to require a working seekdir
> > implementation, and userspace should quickly start _not_ using it.
> >
> >
>
> What makes you think that telldir/seekdir don't work for NFS?

http://thread.gmane.org/gmane.comp.file-systems.fuse.devel/5124

It turned out to be due to incorrect NFS behavior if files are removed
between telldir and seekdir.

So it does work sometimes, but does not seem to correctly handle all
cases. I have no idea if this is an issue in the server, the client
or the protocol.

What is certain, is that seekdir/telldir is a really bad interface,
that just makes life difficult for filesystem implementors, without
any real gain. It deserves to die.

Miklos

2008-03-14 21:00:17

by Trond Myklebust

Subject: Re: [RFC] Union mount readdir support in glibc


On Fri, 2008-03-14 at 13:53 -0400, Peter Staubach wrote:
> Miklos Szeredi wrote:
> >>> Actually, do we really need it other than to 0 and to current position
> >>> (i.e. full rewind and a no-op)?
> >>>
> >> Ever heard of the little function "telldir"?
> >>
> >
> > Actually, telldir/seekdir is already broken for some filesystems (NFS
> > comes to mind). POSIX was really crazy to require a working seekdir
> > implementation, and userspace should quickly start _not_ using it.

POSIX never did require a working seekdir implementation. That
requirement came from our friends in the "Open Group":

http://www.opengroup.org/onlinepubs/009695399/functions/seekdir.html

> What makes you think that telldir/seekdir don't work for NFS? The over the
> wire protocols clearly take values which could be retrieved and stored via
> those interfaces.
>
> ps

Except for the fact that the NFS cookies are unsigned (and in the case
of NFSv3/v4 are 64-bit wide), whereas glibc gets confused when
confronted with 'negative' telldir values.

Hence the current Linux client's wrapping of the on-the-wire cookies. As
far as I can see, it is fully conformant with the spec, which has the
perfect "get out of jail free" card:

"The definition of seekdir() and telldir() does not specify
whether, when using these interfaces, a given directory entry
will be seen at all, or more than once."
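
The width and sign mismatch can be made concrete. telldir() returns a long, so an unsigned 64-bit cookie is representable as a non-negative result only if it does not exceed LONG_MAX. The helper `cookie_fits_telldir()` below is hypothetical and says nothing about how the Linux client actually wraps its cookies.

```c
#include <assert.h>
#include <limits.h>
#include <stdint.h>

/* The comparison below relies on long being no wider than 64 bits. */
static_assert(sizeof(long) <= sizeof(uint64_t),
              "long is wider than the cookie type");

/* Nonzero if COOKIE can be returned from telldir(), whose return type
   is long, without being truncated or appearing negative. */
static int cookie_fits_telldir(uint64_t cookie)
{
    return cookie <= (uint64_t)LONG_MAX;
}
```

Any cookie with its top bit set fails this test even where long is 64 bits wide, and where long is 32 bits most of the cookie space fails it.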

Trond