LinuxLists.cc - make getdents/readdir POSIX compliant wrt mount-point dirent.d

2009-09-01 13:08:41

Subject: make getdents/readdir POSIX compliant wrt mount-point dirent.d_ino

Currently, on all unix and linux-based systems the dirent.d_ino of a mount
point (as read from its parent directory) fails to match the stat-returned
st_ino value for that same entry. That is contrary to POSIX 2008.

I'm bringing this up today because I've just had to disable an
optimization in coreutils ls -i:

http://thread.gmane.org/gmane.comp.gnu.coreutils.bugs/17887

Normally, work-arounds in coreutils penalize non-linux, or old-linux
kernels, but this is the first that has penalized *all* unix/linux-based
systems. Ironically, the sole system that can still take advanatage
of the optimization is Cygwin.

I'm hoping that Linux can catch up before too long.

------------------------
The POSIX readdir spec says this:

The structure dirent defined in the <dirent.h> header describes a
directory entry. The value of the structure's d_ino member shall be set
to the file serial number of the file named by the d_name member.

The description for sys/stat.h makes the connection between
"file serial number" and the stat.st_ino member:

The <sys/stat.h> header shall define the stat structure, which shall
include at least the following members:
...
ino_t st_ino File serial number.
------------------------

The current linux/unix readdir behavior makes it so ls -i cannot perform
the optimization of printing only readdir-returned inode numbers, and
instead must incur the cost of actually stat'ing each entry in order to
be assured that it prints valid inode numbers.

If you have gnu coreutils 6.0 or newer (but not built from today's
git repository) tools on your system, you can demonstrate the mismatch
with the following shell code: [if not, use the C program in
<http://thread.gmane.org/gmane.comp.gnu.coreutils.bugs/14020>]

#!/bin/sh
mount_points=$(df --local -P 2>&1 | sed -n 's,.*[0-9]% $/.$,\1,p')

# Given e.g., /dev/shm, produce the list of GNU ls options that
# let us list just that entry using readdir data from its parent:
# ls -i -I '[^s]*' -I 's[^h]*' -I 'sh[^m]*' -I 'shm?*' -I '.?*' \
# -I '?' -I '??' /dev
ls_ignore_options()
{
name=$1
opts="-I '.?*' -I '$name?*'"
while :; do
glob=$(echo "$name"|sed 's/$.*$$.$$/\1[^\2]*/')
opts="$opts -I '$glob'"
name=$(echo "$name"|sed 's/.$//')
test -z "$name" && break
glob=$(echo "$name"|sed 's/./?/g')
opts="$opts -I '$glob'"
done
echo "$opts"
}

inode_via_readdir()
{
mount_point=$1
base=$(basename $mount_point)
case $base in
.*) skip_test_ 'mount point component starts with "."' ;;
*[*?]*) skip_test_ 'mount point component contains "?" or "*"' ;;
esac
opts=$(ls_ignore_options "$base")
parent_dir=$(dirname $mount_point)
eval "ls -i $opts $parent_dir" | sed 's/ .*//'
}

first_failure=1
for dir in $mount_points; do
readdir_inode=$(inode_via_readdir $dir)
stat_inode=$(env stat --format=%i $dir)
if test "$readdir_inode" != "$stat_inode"; then
test $first_failure = 1 \
&& printf '%8s %8s %-20s\n' st_ino d_ino mount-point
printf '%8d %8d %-20s\n' $stat_inode $readdir_inode $dir
first_failure=0
fi
done
#--------------------------------------------------------------

For example, here's the result of running it on one
of my systems:

st_ino d_ino mount-point
3508 36850 /lib/init/rw
824 376097 /dev
6237 3532 /dev/shm
2 8177 /boot
2 12265 /full
2 147197 /h
2 298428 /f
2 310689 /usr
2 73585 /var
6992 253457 /t
2 327041 /b
2 4113 /d
2 302521 /x
2 53378 /media/sdd1

The d_ino number is what ls -i $parent_dir would print,
before today's fix, while the st_ino value is the correct inode
number for that directory.

2009-09-01 20:19:45

by Theodore Ts'o

[permalink] [raw]

Subject: Re: make getdents/readdir POSIX compliant wrt mount-point dirent.d_ino

On Tue, Sep 01, 2009 at 03:07:23PM +0200, Jim Meyering wrote:
> Currently, on all unix and linux-based systems the dirent.d_ino of a mount
> point (as read from its parent directory) fails to match the stat-returned
> st_ino value for that same entry. That is contrary to POSIX 2008.

The language which you referenced has been around for a very long
time; it's not new to POSIX.1-2008. At the same time, the behaviour
of what is returned for direct.d_ino at a mount point has been around
for a very long time, and it's not new either. Furthermore, there are
plenty of Unix systems that have received POSIX certifications despite
having this behavior. (I just checked and Solaris behaves exactly the
same way, as I expect; pretty much all Unix systems work this way.)

If you're going to quote chapter and verse, the more convincing one
would probably be from the non-normative RATIONALE section for
readdir():

When returning a directory entry for the root of a mounted file
system, some historical implementations of readdir() returned the
file serial number of the underlying mount point, rather than of
the root of the mounted file system. This behavior is considered
to be a bug, since the underlying file serial number has no
significance to applications.

> I'm bringing this up today because I've just had to disable an
> optimization in coreutils ls -i:

I'm not sure how many poeple will care about this, since (a) stat(2)
is fast, so this only becomes user-visible in the cold cache case, and
(b) "ls -i" is generally not considered a common case.

Fixing it is also going to be decidedly non-trivial since it depends
on how the directory was orignally accessed. For example, suppose
/usr is a mount point; and we do a readdir on '/'. In that case, when
we return 'usr' we should return the inode number of the covering
inode. But if we have a bind mount ("mount --bind / /root") and we
are calling readdir on the exact same directory, but it was opened via
opendir("/root"), now when we return 'usr', we should return the
underlying directory's inode. This means that before returning from
readdir, we would have to scan every single directory entry against
the combination of the orignal dentry used to open the directory plus
the d_name field to see if it exists in the current process's mount
namespace.

This would require burning extra CPU time for every single entry
returned by readdir(2), all for catching a case is a technical
violation of the POSIX spec, but which all historical Unix
implementations have had the same behaviour, all to enable an
optimization for a use case ("/bin/ls -i") which isn't very common.
Hence, even a "nyah, nyah, but Cygwin gets this case right" may not be
a big motivator for people to work on making this change to Linux.

Playing devil's advocate for the moment, you could even make the case,
ignoring the non-normative POSIX rationale and writing off standards
authors as wankers who don't care about real world issues, and noting
that in POSIX world, "mounts" are hand-waved away as not being within
the scope of the standard, that the current behaviour makes *sense*.
That is, the inode number of the directory entry is what it is, but
when you mount a filesystem, what happens is when you dereference the
directory entry, you get something else, much like the difference
between what happens with stat(2) vs. lstat(2) in the presence of a
symlink. It is because it makes *sense* from a computer science point
of view that all Unix implementations do things the same way Linux
does. Given all of this, it's not surprising that even an OS as anal
about being standards-compliant as Solaris has ignored this one...

- Ted

2009-09-01 22:36:52

by Ulrich Drepper

[permalink] [raw]

Subject: Re: make getdents/readdir POSIX compliant wrt mount-point dirent.d_ino

On Tue, Sep 1, 2009 at 13:19, Theodore Tso<[email protected]> wrote:
> Furthermore, there are
> plenty of Unix systems that have received POSIX certifications despite
> having this behavior.

A common misunderstanding of certification.

Like for all certifications, being POSIX certified doesn't mean the
certification is valid for all situations. it only means that there
is (at least) one configuration which meets the requirements. In this
case it means the environment simply uses one single filesystem and no
mount points. This way the problem doesn't even arise.

> Fixing it is also going to be decidedly non-trivial since it depends
> on how the directory was orignally accessed. [...]

I guess that this is really a difficult way to solve. I wouldn't want
to pay for something which is hardly ever really used.

But there are programs out there which would like to use the inode
uniqueness. Therefore the next best thing to do is perhaps to return
a flag in the getdents information (in d_type, perhaps) to indicate
that this is a mount point and/or that there are multiple ways to
access the file in question. Then programs which can use the inode
information can be watching for this flag and enter the slow path only
if it's set.

2009-11-04 19:29:05

by Jim Meyering

[permalink] [raw]

Subject: Re: make getdents/readdir POSIX compliant wrt mount-point dirent.d_ino

Ulrich Drepper wrote:
> On Tue, Sep 1, 2009 at 13:19, Theodore Tso<[email protected]> wrote:
>> Furthermore, there are
>> plenty of Unix systems that have received POSIX certifications despite
>> having this behavior.
>
> A common misunderstanding of certification.
>
> Like for all certifications, being POSIX certified doesn't mean the
> certification is valid for all situations. it only means that there
> is (at least) one configuration which meets the requirements. In this
> case it means the environment simply uses one single filesystem and no
> mount points. This way the problem doesn't even arise.
>
>> Fixing it is also going to be decidedly non-trivial since it depends
>> on how the directory was orignally accessed. [...]
>
> I guess that this is really a difficult way to solve. I wouldn't want
> to pay for something which is hardly ever really used.
>
> But there are programs out there which would like to use the inode
> uniqueness. Therefore the next best thing to do is perhaps to return
> a flag in the getdents information (in d_type, perhaps) to indicate
> that this is a mount point and/or that there are multiple ways to
> access the file in question. Then programs which can use the inode
> information can be watching for this flag and enter the slow path only
> if it's set.

Hi Uli,

Here is another reason to do what you suggest.
This bug report started it:

on the fly varying device numbers on a NFS mount point
http://bugzilla.redhat.com/501848

More discussion here:

http://thread.gmane.org/gmane.comp.gnu.coreutils.bugs/18553/focus=18822

The problem is with hierarchy traversals again. The first time
a mount-point directory is encountered, fts opens it (with openat),
stats it and records dev,ino, and then reads entries. The first readdir
triggers the automount and thus, the assignment of a new device number
to the already-open directory. When the traversal process finishes
processing the hierarchy and traverses back "up" to that mount point,
it fails due to the old-st_dev/new-st_dev mismatch.[1] Normally such
a mismatch indicates that someone is attempting to subvert a traversal,
or perhaps has inadvertently moved a subtree while it's being traversed.
In any case, once such a mismatch has been detected, there is no way
the traversal can safely continue.

One way to accommodate the current automount semantics, is to make fts.c
incur, _for every directory traversed_, the cost of an additional
stat (fstatat, actually) call just in case this happens to be one of
those rare mount points.

I would really rather not pessimize most[*] hierarchy-traversing
command-line tools by up to 17% (though usually far less) in order
to accommodate device-number change semantics that arise
for an automountable directory.

Jim

[*] At least the following GNU tools would be affected: find, chmod,
chown, chgrp, chcon, du, rm, and possibly soon, cp and ls.

[1] Note that if the mounted hierarchy is not too deep (I think it's
4 or 5 levels), cached "active-directory" file descriptors mask the
problem, because when we traverse back to the mount point, we still
have an open file descriptor for that directory. In that case, we don't
even need to perform the dev/inode comparison.

2009-11-05 19:49:06

by Theodore Ts'o

[permalink] [raw]

Subject: Re: make getdents/readdir POSIX compliant wrt mount-point dirent.d_ino

On Wed, Nov 04, 2009 at 08:29:00PM +0100, Jim Meyering wrote:
> One way to accommodate the current automount semantics, is to make fts.c
> incur, _for every directory traversed_, the cost of an additional
> stat (fstatat, actually) call just in case this happens to be one of
> those rare mount points.
>
> I would really rather not pessimize most[*] hierarchy-traversing
> command-line tools by up to 17% (though usually far less) in order
> to accommodate device-number change semantics that arise
> for an automountable directory.

I must be missing something. How do you come up with the 17% penalty
figure? And what does this actually mean in real life?

stat() in Linux is fast. Really fast. A quick benchmark clocks
stat() on my system at 0.814 *microseconds* in the warm cache case,
and if you're stating a directory that you've traversed, odds are
extremely high that it will still be in the cache.

My entire laptop root filesystem has 53,934 directories, so an extra
stat() per directory translates to an extra 43 milliseconds, assuming
I needed to walk my entire root filesystem. It's really hard to see
why kernel developers should get worked up into a lather over that
kind of "performance penalty".

Regards,

- Ted

2009-11-05 23:29:00

by Jim Meyering

[permalink] [raw]

Subject: Re: make getdents/readdir POSIX compliant wrt mount-point dirent.d_ino

Theodore Tso wrote:
> On Wed, Nov 04, 2009 at 08:29:00PM +0100, Jim Meyering wrote:
>> One way to accommodate the current automount semantics, is to make fts.c
>> incur, _for every directory traversed_, the cost of an additional
>> stat (fstatat, actually) call just in case this happens to be one of
>> those rare mount points.
>>
>> I would really rather not pessimize most[*] hierarchy-traversing
>> command-line tools by up to 17% (though usually far less) in order
>> to accommodate device-number change semantics that arise
>> for an automountable directory.
>
> I must be missing something. How do you come up with the 17% penalty
> figure? And what does this actually mean in real life?

Actually, it can approach 25%. See below.

> stat() in Linux is fast. Really fast.

Sure. But so are chown, rm, du, etc.
And an extra stat can make more than a measurable difference.
On an absolute scale, the difference is not prohibitive,
but wouldn't it be a shame to penalize everyone
for a feature that some of us don't ever use?

> A quick benchmark clocks
> stat() on my system at 0.814 *microseconds* in the warm cache case,
> and if you're stating a directory that you've traversed, odds are
> extremely high that it will still be in the cache.
>
> My entire laptop root filesystem has 53,934 directories, so an extra
> stat() per directory translates to an extra 43 milliseconds, assuming
> I needed to walk my entire root filesystem. It's really hard to see
> why kernel developers should get worked up into a lather over that
> kind of "performance penalty".

Here's a comparison with fewer than 5000 directories:
Given a directory named z, with 70 subdirs, each containing 70 empty subdirs.
All names are in 0..69. Hot cache. On a tmpfs file system.
linux 2.6.31.1-56.fc12.x86_64

Compare chgrp -R applied to "z", with and without the stat-adding patch:

$ for i in 1 2 3; do for p in prev .; do echo $p; \
env time $p/chgrp -R group2 z; done; done;
prev
0.03user 0.31system 0:00.34elapsed 99%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+235minor)pagefaults 0swaps
.
0.02user 0.39system 0:00.42elapsed 99%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+235minor)pagefaults 0swaps
prev
0.03user 0.30system 0:00.34elapsed 99%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+235minor)pagefaults 0swaps
.
0.04user 0.38system 0:00.43elapsed 96%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+236minor)pagefaults 0swaps
prev
0.03user 0.31system 0:00.34elapsed 99%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+236minor)pagefaults 0swaps
.
0.04user 0.37system 0:00.41elapsed 99%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+235minor)pagefaults 0swaps

That's a 23.5% performance hit. (42-34)/34

Sure, it's not even 1/10th of a second, but remember this is a tiny
hierarchy, and it's not just chgrp, but also find, rm, du, etc. that
are affected. And this is not the sole reason to make a change, but
rather one more reason, in addition to the one that started this thread.