LinuxLists.cc - [PATCH] d_ino considered harmful

2010-06-16 18:59:51

Subject: [PATCH] d_ino considered harmful

Who needs d_ino anyway? I am running a kernel with this patch -
Gnome, a browser, IRC, kernel compile, etc. and everything works.

-VAL

commit 184f3919d0071f3bfa40010aa6919ea89999d79b
Author: Valerie Aurora <[email protected]>
Date: Wed Jun 16 11:05:06 2010 -0700

VFS: Always return 0 for d_ino

Use of d_ino without the corresponding st_dev is always buggy in the
presence of submounts, bind mounts, and union mounts. E.g., the d_ino
of a mountpoint will be the inode number of the directory under the
mountpoint, not the mounted directory. Correct code must call stat(),
which returns the correct device ID and inode in st_dev and st_ino.
Since no one should be using d_ino anyway, always return 0 to detect
bugs.

diff --git a/fs/readdir.c b/fs/readdir.c
index dd3eae1..38ea772 100644
--- a/fs/readdir.c
+++ b/fs/readdir.c
@@ -91,7 +91,9 @@ static int fillonedir(void * __buf, const char * name, int namlen, loff_t offset

if (buf->result)
return -EINVAL;
- d_ino = ino;
+ /* Use of d_ino without st_dev is always buggy. */
+ d_ino = 0;
+
if (sizeof(d_ino) < sizeof(ino) && d_ino != ino) {
buf->result = -EOVERFLOW;
return -EOVERFLOW;

2010-06-16 19:11:21

by Arnd Bergmann

[permalink] [raw]

Subject: Re: [PATCH] d_ino considered harmful

On Wednesday 16 June 2010 20:59:13 Valerie Aurora wrote:
> @@ -91,7 +91,9 @@ static int fillonedir(void * __buf, const char * name, int namlen, loff_t offset
>
> if (buf->result)
> return -EINVAL;
> - d_ino = ino;
> + /* Use of d_ino without st_dev is always buggy. */
> + d_ino = 0;
> +
> if (sizeof(d_ino) < sizeof(ino) && d_ino != ino) {
> buf->result = -EOVERFLOW;
> return -EOVERFLOW;

Isn't this just the path taken by sys_oldreaddir?

Glibc (at least on my box) translates all user calls to readdir into
sys_getdents or sys_getdents64, so I think you'd also need to change
filldir() and filldir64() for your testing.

Arnd

2010-06-16 19:16:21

by Jeff Layton

[permalink] [raw]

Subject: Re: [PATCH] d_ino considered harmful

On Wed, 16 Jun 2010 14:59:13 -0400
Valerie Aurora <[email protected]> wrote:

> Who needs d_ino anyway? I am running a kernel with this patch -
> Gnome, a browser, IRC, kernel compile, etc. and everything works.
>
> -VAL
>
> commit 184f3919d0071f3bfa40010aa6919ea89999d79b
> Author: Valerie Aurora <[email protected]>
> Date: Wed Jun 16 11:05:06 2010 -0700
>
> VFS: Always return 0 for d_ino
>
> Use of d_ino without the corresponding st_dev is always buggy in the
> presence of submounts, bind mounts, and union mounts. E.g., the d_ino
> of a mountpoint will be the inode number of the directory under the
> mountpoint, not the mounted directory. Correct code must call stat(),
> which returns the correct device ID and inode in st_dev and st_ino.
> Since no one should be using d_ino anyway, always return 0 to detect
> bugs.
>
> diff --git a/fs/readdir.c b/fs/readdir.c
> index dd3eae1..38ea772 100644
> --- a/fs/readdir.c
> +++ b/fs/readdir.c
> @@ -91,7 +91,9 @@ static int fillonedir(void * __buf, const char * name, int namlen, loff_t offset
>
> if (buf->result)
> return -EINVAL;
> - d_ino = ino;
> + /* Use of d_ino without st_dev is always buggy. */
> + d_ino = 0;
> +
> if (sizeof(d_ino) < sizeof(ino) && d_ino != ino) {
> buf->result = -EOVERFLOW;
> return -EOVERFLOW;
> --
> To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
>

That may run afoul of the following check (where d_ino is compared to
ino). You can probably just remove that check though since you can be
reasonably sure that 0 will never overflow the field.

--
Jeff Layton <[email protected]>

2010-06-16 19:54:32

by Valerie Aurora

[permalink] [raw]

Subject: Re: [PATCH v2] d_ino considered harmful

On Wed, Jun 16, 2010 at 02:59:13PM -0400, Valerie Aurora wrote:
> Who needs d_ino anyway? I am running a kernel with this patch -
> Gnome, a browser, IRC, kernel compile, etc. and everything works.

Gosh, maybe it would help to patch the currently used readdir instead
of just old_readdir() (thanks, Arnd). And return 1 instead of 0 so ls
doesn't think all files are deleted (thanks, Andreas).

I'm running a kernel with the below patch and everything still works.
Apparently "ls -i" is still using the bogus d_ino performance
improvement mentioned here because it returns all 1's for inode
number.

http://www.mail-archive.com/[email protected]/msg02531.html

-VAL

commit 5902fd7b7407e059c5cea1bf1ea101a1ff8a6072
Author: Valerie Aurora <[email protected]>
Date: Wed Jun 16 11:05:06 2010 -0700

VFS: Always return 1 for d_ino

Use of d_ino without the corresponding st_dev is always buggy in the
presence of submounts, bind mounts, and union mounts. E.g., the d_ino
of a mountpoint will be the inode number of the directory under the
mountpoint, not the mounted directory. Correct code must call stat(),
which returns the correct device ID and inode in st_dev and st_ino.
Since no one should be using d_ino anyway, always return 1 to detect
bugs.

diff --git a/fs/readdir.c b/fs/readdir.c
index dd3eae1..5ff8f10 100644
--- a/fs/readdir.c
+++ b/fs/readdir.c
@@ -91,11 +91,8 @@ static int fillonedir(void * __buf, const char * name, int namlen, loff_t offset

if (buf->result)
return -EINVAL;
- d_ino = ino;
- if (sizeof(d_ino) < sizeof(ino) && d_ino != ino) {
- buf->result = -EOVERFLOW;
- return -EOVERFLOW;
- }
+ /* Use of d_ino without st_dev is always buggy. */
+ d_ino = 1;
buf->result++;
dirent = buf->dirent;
if (!access_ok(VERIFY_WRITE, dirent,
@@ -172,11 +169,8 @@ static int filldir(void * __buf, const char * name, int namlen, loff_t offset,
buf->error = -EINVAL; /* only used if we fail.. */
if (reclen > buf->count)
return -EINVAL;
- d_ino = ino;
- if (sizeof(d_ino) < sizeof(ino) && d_ino != ino) {
- buf->error = -EOVERFLOW;
- return -EOVERFLOW;
- }
+ /* Use of d_ino without st_dev is always buggy. */
+ d_ino = 1;
dirent = buf->previous;
if (dirent) {
if (__put_user(offset, &dirent->d_off))

2010-06-16 19:58:24

by Valerie Aurora

[permalink] [raw]

Subject: Re: [PATCH] d_ino considered harmful

On Wed, Jun 16, 2010 at 09:10:42PM +0200, Arnd Bergmann wrote:
> On Wednesday 16 June 2010 20:59:13 Valerie Aurora wrote:
> > @@ -91,7 +91,9 @@ static int fillonedir(void * __buf, const char * name, int namlen, loff_t offset
> >
> > if (buf->result)
> > return -EINVAL;
> > - d_ino = ino;
> > + /* Use of d_ino without st_dev is always buggy. */
> > + d_ino = 0;
> > +
> > if (sizeof(d_ino) < sizeof(ino) && d_ino != ino) {
> > buf->result = -EOVERFLOW;
> > return -EOVERFLOW;
>
> Isn't this just the path taken by sys_oldreaddir?
>
> Glibc (at least on my box) translates all user calls to readdir into
> sys_getdents or sys_getdents64, so I think you'd also need to change
> filldir() and filldir64() for your testing.

You're right. I changed both code paths in the new version, and
actually tested this time.

Thanks,

-VAL

2010-06-16 19:58:52

by Valerie Aurora

[permalink] [raw]

Subject: Re: [PATCH] d_ino considered harmful

On Wed, Jun 16, 2010 at 03:15:56PM -0400, Jeff Layton wrote:
> On Wed, 16 Jun 2010 14:59:13 -0400
> Valerie Aurora <[email protected]> wrote:
>
> > Who needs d_ino anyway? I am running a kernel with this patch -
> > Gnome, a browser, IRC, kernel compile, etc. and everything works.
> >
> > -VAL
> >
> > commit 184f3919d0071f3bfa40010aa6919ea89999d79b
> > Author: Valerie Aurora <[email protected]>
> > Date: Wed Jun 16 11:05:06 2010 -0700
> >
> > VFS: Always return 0 for d_ino
> >
> > Use of d_ino without the corresponding st_dev is always buggy in the
> > presence of submounts, bind mounts, and union mounts. E.g., the d_ino
> > of a mountpoint will be the inode number of the directory under the
> > mountpoint, not the mounted directory. Correct code must call stat(),
> > which returns the correct device ID and inode in st_dev and st_ino.
> > Since no one should be using d_ino anyway, always return 0 to detect
> > bugs.
> >
> > diff --git a/fs/readdir.c b/fs/readdir.c
> > index dd3eae1..38ea772 100644
> > --- a/fs/readdir.c
> > +++ b/fs/readdir.c
> > @@ -91,7 +91,9 @@ static int fillonedir(void * __buf, const char * name, int namlen, loff_t offset
> >
> > if (buf->result)
> > return -EINVAL;
> > - d_ino = ino;
> > + /* Use of d_ino without st_dev is always buggy. */
> > + d_ino = 0;
> > +
> > if (sizeof(d_ino) < sizeof(ino) && d_ino != ino) {
> > buf->result = -EOVERFLOW;
> > return -EOVERFLOW;
> > --
> > To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
> > the body of a message to [email protected]
> > More majordomo info at http://vger.kernel.org/majordomo-info.html
> >
>
> That may run afoul of the following check (where d_ino is compared to
> ino). You can probably just remove that check though since you can be
> reasonably sure that 0 will never overflow the field.

You're right, I included that in the next version of the patch.

Thanks,

-VAL

2010-06-16 21:00:55

by David Dillow

[permalink] [raw]

Subject: Re: [PATCH v2] d_ino considered harmful

On Wed, 2010-06-16 at 15:54 -0400, Valerie Aurora wrote:
> On Wed, Jun 16, 2010 at 02:59:13PM -0400, Valerie Aurora wrote:
> > Who needs d_ino anyway? I am running a kernel with this patch -
> > Gnome, a browser, IRC, kernel compile, etc. and everything works.

For large-scale Lustre users, not having a valid d_ino in the dirent is
a problem when we need to put a name (or names) to an inode number.

>From time to time, we have a problem reported in the logs against an
inode and need to figure out the name of that file. stat() is very slow
on these filesystems due to the need to talk to multiple servers to get
file size information. When we purge the filesystem, it is faster to do
a scan of the inode table looking for old files than it is to walk the
tree. Then we have to match those inodes to names for unlink().

For example, our main Lustre scratch space has over 285 million files in
it, and using find -inum takes over 72 hours to walk the tree using
stat(). Using a scanner that takes advantage of d_ino allows us to cover
that ground in 13 hours. It's not perfectly apples-to-apples, as the
custom scanner has some parallelism, but it gives you an idea of the
problem's scale. Certainly, using d_ino reduces the number of RPCs we
have to send.

Using ne2scan -- which uses libext2fs and combines the inode scan and
the name lookup -- takes over 48 hours to generate a list of candidate
files for the purge example. With an optimized inode scan and the custom
scanner above, it should take 18 hours, which is a considerable time
saver for us.

I can see that there are pitfalls in the presence of mountpoint and
such, but we control the environment on these machines, and using d_ino
is a huge win for us.
--
Dave Dillow
National Center for Computational Science
Oak Ridge National Laboratory
(865) 241-6602 office

2010-06-17 17:54:49

by Jamie Lokier

[permalink] [raw]

Subject: Re: [PATCH v2] d_ino considered harmful

Valerie Aurora wrote:
>> Who needs d_ino anyway? I am running a kernel with this patch -
>> Gnome, a browser, IRC, kernel compile, etc. and everything works.
> I'm running a kernel with the below patch and everything still works.
> Apparently "ls -i" is still using the bogus d_ino performance
> improvement mentioned here because it returns all 1's for inode
> number.

I'm surprised at "ls -i", as a patch to change that has been submitted:

http://marc.info/?l=linux-kernel&m=125181054102075
http://thread.gmane.org/gmane.comp.gnu.coreutils.bugs/17887

>
> Use of d_ino without the corresponding st_dev is always buggy in the
> presence of submounts, bind mounts, and union mounts. E.g., the d_ino
> of a mountpoint will be the inode number of the directory under the
> mountpoint, not the mounted directory.

It's not surprising everything seems to work.

It can be useful as a performance hint, which you probably didn't test.

I have some "find"-style program ("treescan" if you want to Google;
it's ancient now), which shows a 50x speedup in some cold-cache cases
using d_ino to sort prior to stat() calls, to reduce disk seeking.
50x is unusual; usually it's more like 2x. I still run that program
prior to "git status" on a directory in cold cache.

Others have mentioned other programs which benefit from the d_ino hint.

It is possible to use it in a "correct" way, too, because it is only
mountpoints where the value is incorrect.

When the value is correct, it's useful for detecting which files have
been renamed, linked or deleted in a directory, and you detect the
directory has changed (e.g. using mtime). If you know (by agreement)
that there are no mount points and certain files aren't overwritten in
place, always replaced, that can be used to maintain application
caches too.

I strongly disagree that correct code must call stat(). Correct code
can check against the list of mountpoints in /proc/mounts, because it
is strictly only mountpoints where the number doesn't agree with
stat() -- prior to your patch :-)

Anyway, maybe your patch is not allowed by POSIX :-) as follows
(posted to linux-kernel some time ago):

http://marc.info/?l=linux-kernel&m=125181054102075
http://www.gossamer-threads.com/lists/linux/kernel/1124140

The POSIX readdir spec says this:

The structure dirent defined in the <dirent.h> header describes a
directory entry. The value of the structure's d_ino member shall be set
to the file serial number of the file named by the d_name member.

The description for sys/stat.h makes the connection between
"file serial number" and the stat.st_ino member:

The <sys/stat.h> header shall define the stat structure, which shall
include at least the following members:
...
ino_t st_ino File serial number.

Returning the covered inode's number at a mountpoint is apparently not
POSIX compliant either, but is widespread. (I.e. all unixes except
Cygwin apparently.)

> Gosh, maybe it would help to patch the currently used readdir instead
> of just old_readdir() (thanks, Arnd). And return 1 instead of 0 so ls
> doesn't think all files are deleted (thanks, Andreas).

It's not just ls. Bash 3.0 ignores entries for completion if d_ino == 0.

> I'm running a kernel with the below patch and everything still works.
> Apparently "ls -i" is still using the bogus d_ino performance
> improvement mentioned here because it returns all 1's for inode
> number.
>
> http://www.mail-archive.com/[email protected]/msg02531.html

I'm intrigued by the mentioned in that report that Linux bind mounts
return the covering inode number in d_ino, not the covered inode number.

If true, that means mounts are already being checked when returning d_ino,
and suggests that doing it for all mounts isn't expensive.

Is it feasible to return a clear "this is a mount point" indicator in
d_ino and/or d_type? For example, adding a DT_MOUNTED d_type, or
ORing that into the d_type value (both should be seen be programs as
"I don't know what type that is, I'd better call lstat()").

I agree that inode number without st_dev isn't good for correct code
(but still useful as a performance hint). But when a program knows
which entries are mountpoints by any method, then it knows the st_dev
of all the other entries is identical to the directory being read.

-- Jamie

2010-06-17 18:05:24

by J. R. Okajima

[permalink] [raw]

Subject: Re: [PATCH v2] d_ino considered harmful

David Dillow:
> For example, our main Lustre scratch space has over 285 million files in
> it, and using find -inum takes over 72 hours to walk the tree using
:::
> Using ne2scan -- which uses libext2fs and combines the inode scan and
> the name lookup -- takes over 48 hours to generate a list of candidate
> files for the purge example. With an optimized inode scan and the custom
:::

While I've never heard of ne2scan, I am interested in this simplified
problem such as "find the pathname(s) from an inum in a huge fs."
Is ne2scan essentially equivalent to "debugfs ncheck inum"?

About Valeris's patch, as long as "ls -i" is useful/helpful,
> + /* Use of d_ino without st_dev is always buggy. */
is not true.

J. R. Okajima

2010-06-17 18:17:20

by David Dillow

[permalink] [raw]

Subject: Re: [PATCH v2] d_ino considered harmful

On Thu, 2010-06-17 at 14:04 -0400, J. R. Okajima wrote:
> David Dillow:
> > For example, our main Lustre scratch space has over 285 million files in
> > it, and using find -inum takes over 72 hours to walk the tree using
> :::
> > Using ne2scan -- which uses libext2fs and combines the inode scan and
> > the name lookup -- takes over 48 hours to generate a list of candidate
> > files for the purge example. With an optimized inode scan and the custom
> :::
>
> While I've never heard of ne2scan, I am interested in this simplified
> problem such as "find the pathname(s) from an inum in a huge fs."
> Is ne2scan essentially equivalent to "debugfs ncheck inum"?

Yes, except it does that for every live inode in the system. ne2scan is
extended for use on Lustre's backing store -- it parses the LOV object
map and displays the information -- so I'm not sure how usable it will
be on an plain ext{2,3,4} file system. It is based off of e2scan in the
Oracle (nee Sun) Lustre version of e2fsprogs, so that could be a
starting point for you.

--
Dave Dillow
National Center for Computational Science
Oak Ridge National Laboratory
(865) 241-6602 office

2010-06-17 18:58:55

by Valerie Aurora

[permalink] [raw]

Subject: Re: [PATCH v2] d_ino considered harmful

On Fri, Jun 18, 2010 at 03:04:08AM +0900, J. R. Okajima wrote:
>
> David Dillow:
> > For example, our main Lustre scratch space has over 285 million files in
> > it, and using find -inum takes over 72 hours to walk the tree using
> :::
> > Using ne2scan -- which uses libext2fs and combines the inode scan and
> > the name lookup -- takes over 48 hours to generate a list of candidate
> > files for the purge example. With an optimized inode scan and the custom
> :::
>
> While I've never heard of ne2scan, I am interested in this simplified
> problem such as "find the pathname(s) from an inum in a huge fs."
> Is ne2scan essentially equivalent to "debugfs ncheck inum"?
>
> About Valeris's patch, as long as "ls -i" is useful/helpful,
> > + /* Use of d_ino without st_dev is always buggy. */
> is not true.

What I'm hearing again and again is that d_ino is useful to improve
performance. As Andreas put it to me, if d_ino is the same, the
referenced file may or may not be the same, but if it's different, the
files are definitely different. Only in well-controlled environments
known not to have submounts or bind mounts do people trust d_ino to be
from the same file system as the other entries in a directory.

I only submitted this patch half-seriously - mainly I wanted to find
out how people are using d_ino, and therefore what I need to do for
fallthru directory entries in union mounts.

In order to get the correct inode number for a directory entry
referring to a lower layer file or directory, we have to do a
->lookup() from the fs-specific readdir code (or else require that
fallthrus store an arbitrarily sized integer - which seriously
restricts the implementation). Now, doing a ->lookup() to get d_ino
makes no sense if we are using d_ino as a way to avoid the cost
stat(), which is mainly the ->lookup(). And you definitely can't use
d_ino by itself in a union mount.

I'm inclined to save the trouble and just return 1 in d_ino for
fallthru directory entries, especially now that I've tested it
system-wide and had no obvious problems.

-VAL

2010-06-17 19:11:13

by Valerie Aurora

[permalink] [raw]

Subject: Re: [PATCH v2] d_ino considered harmful

On Thu, Jun 17, 2010 at 06:54:29PM +0100, Jamie Lokier wrote:
> Valerie Aurora wrote:
> >> Who needs d_ino anyway? I am running a kernel with this patch -
> >> Gnome, a browser, IRC, kernel compile, etc. and everything works.
> > I'm running a kernel with the below patch and everything still works.
> > Apparently "ls -i" is still using the bogus d_ino performance
> > improvement mentioned here because it returns all 1's for inode
> > number.
>
> I'm surprised at "ls -i", as a patch to change that has been submitted:
>
> http://marc.info/?l=linux-kernel&m=125181054102075
> http://thread.gmane.org/gmane.comp.gnu.coreutils.bugs/17887

I was surprised too. I guess people still want to optimize ls -i,
even at the cost of wrong results.

> > Use of d_ino without the corresponding st_dev is always buggy in the
> > presence of submounts, bind mounts, and union mounts. E.g., the d_ino
> > of a mountpoint will be the inode number of the directory under the
> > mountpoint, not the mounted directory.
>
> It's not surprising everything seems to work.
>
> It can be useful as a performance hint, which you probably didn't test.

I'm afraid I wasn't entirely serious with that patch. :) But it was an
interesting exercise.

> I strongly disagree that correct code must call stat(). Correct code
> can check against the list of mountpoints in /proc/mounts, because it
> is strictly only mountpoints where the number doesn't agree with
> stat() -- prior to your patch :-)

If you are assuming that the application is parsing /proc/mounts (does
anyone actually do this?), then the application can also learn about
union mounts and not trust d_ino in any directory below the union
mount point. :)

> Anyway, maybe your patch is not allowed by POSIX :-) as follows
> (posted to linux-kernel some time ago):
>
> http://marc.info/?l=linux-kernel&m=125181054102075
> http://www.gossamer-threads.com/lists/linux/kernel/1124140
>
> The POSIX readdir spec says this:
>
> The structure dirent defined in the <dirent.h> header describes a
> directory entry. The value of the structure's d_ino member shall be set
> to the file serial number of the file named by the d_name member.
>
> The description for sys/stat.h makes the connection between
> "file serial number" and the stat.st_ino member:
>
> The <sys/stat.h> header shall define the stat structure, which shall
> include at least the following members:
> ...
> ino_t st_ino File serial number.
>
> Returning the covered inode's number at a mountpoint is apparently not
> POSIX compliant either, but is widespread. (I.e. all unixes except
> Cygwin apparently.)
>
> > Gosh, maybe it would help to patch the currently used readdir instead
> > of just old_readdir() (thanks, Arnd). And return 1 instead of 0 so ls
> > doesn't think all files are deleted (thanks, Andreas).
>
> It's not just ls. Bash 3.0 ignores entries for completion if d_ino == 0.
>
> > I'm running a kernel with the below patch and everything still works.
> > Apparently "ls -i" is still using the bogus d_ino performance
> > improvement mentioned here because it returns all 1's for inode
> > number.
> >
> > http://www.mail-archive.com/[email protected]/msg02531.html
>
> I'm intrigued by the mentioned in that report that Linux bind mounts
> return the covering inode number in d_ino, not the covered inode number.
>
> If true, that means mounts are already being checked when returning d_ino,
> and suggests that doing it for all mounts isn't expensive.

This surprises me too. I will check into it further.

-VAL

2010-06-17 23:39:22

by Andreas Dilger

[permalink] [raw]

Subject: Re: [PATCH v2] d_ino considered harmful

On 2010-06-16, at 13:54, Valerie Aurora wrote:
> On Wed, Jun 16, 2010 at 02:59:13PM -0400, Valerie Aurora wrote:
>> Who needs d_ino anyway? I am running a kernel with this patch -
>> Gnome, a browser, IRC, kernel compile, etc. and everything works.
>
> Gosh, maybe it would help to patch the currently used readdir instead
> of just old_readdir() (thanks, Arnd). And return 1 instead of 0 so ls
> doesn't think all files are deleted (thanks, Andreas).
>
> I'm running a kernel with the below patch and everything still works.
> Apparently "ls -i" is still using the bogus d_ino performance
> improvement mentioned here because it returns all 1's for inode
> number.

I don't see why the presence of d_ino is a "bogus" performance optimization. It is useful for some things, and replacing this with "1" by no means helps anything IMHO, and destroys some useful optimizations (e.g. finding which inodes may be hard links), so I'm against this patch.

> http://www.mail-archive.com/[email protected]/msg02531.html
>
> -VAL
>
> commit 5902fd7b7407e059c5cea1bf1ea101a1ff8a6072
> Author: Valerie Aurora <[email protected]>
> Date: Wed Jun 16 11:05:06 2010 -0700
>
> VFS: Always return 1 for d_ino
>
> Use of d_ino without the corresponding st_dev is always buggy in the
> presence of submounts, bind mounts, and union mounts. E.g., the d_ino
> of a mountpoint will be the inode number of the directory under the
> mountpoint, not the mounted directory. Correct code must call stat(),
> which returns the correct device ID and inode in st_dev and st_ino.
> Since no one should be using d_ino anyway, always return 1 to detect
> bugs.
>
> diff --git a/fs/readdir.c b/fs/readdir.c
> index dd3eae1..5ff8f10 100644
> --- a/fs/readdir.c
> +++ b/fs/readdir.c
> @@ -91,11 +91,8 @@ static int fillonedir(void * __buf, const char * name, int namlen, loff_t offset
>
> if (buf->result)
> return -EINVAL;
> - d_ino = ino;
> - if (sizeof(d_ino) < sizeof(ino) && d_ino != ino) {
> - buf->result = -EOVERFLOW;
> - return -EOVERFLOW;
> - }
> + /* Use of d_ino without st_dev is always buggy. */
> + d_ino = 1;
> buf->result++;
> dirent = buf->dirent;
> if (!access_ok(VERIFY_WRITE, dirent,
> @@ -172,11 +169,8 @@ static int filldir(void * __buf, const char * name, int namlen, loff_t offset,
> buf->error = -EINVAL; /* only used if we fail.. */
> if (reclen > buf->count)
> return -EINVAL;
> - d_ino = ino;
> - if (sizeof(d_ino) < sizeof(ino) && d_ino != ino) {
> - buf->error = -EOVERFLOW;
> - return -EOVERFLOW;
> - }
> + /* Use of d_ino without st_dev is always buggy. */
> + d_ino = 1;
> dirent = buf->previous;
> if (dirent) {
> if (__put_user(offset, &dirent->d_off))

Cheers, Andreas
--
Andreas Dilger
Lustre Technical Lead
Oracle Corporation Canada Inc.

2010-06-18 01:42:47

by Andreas Dilger

[permalink] [raw]

Subject: Re: [PATCH v2] d_ino considered harmful

On 2010-06-17, at 12:04, "J. R. Okajima" <[email protected]> wrote:

>
> I am interested in this simplified
> problem such as "find the pathname(s) from an inum in a huge fs."
> Is ne2scan essentially equivalent to "debugfs ncheck inum"?

The (n)e2scan program is essentially just an optimized ext3 inode
table scanner we wrote for Lustre that walks the inode table in order,
and optimistically reads directory inode blocks (in disk offset order)
and matches the inode numbers to an icrementally-build tree of parent
directories when the directory entries appear. Since the most common
case is that parent has a lower inode number than the subdirectories
there is rarely a need to keep whole subdirectories in memory. This
is fairly efficient when dumping the whole Filesystem, since it makes
a single pass over the metadata, though it is inefficient when doing a
small subset of the filesystem.

As the name implies, it is very extN specific. For Lustre 2.0 we use a
different method to get O(1) FID (inode number) to pathname(s)
lookup. Each file stores an xattr with the {parent FID, filename}
tuples for each link to the file, whenever an inode is created,
linked, unlinked, or renamed.

In the common case, storing the filename and parent FID adds no
overhead to these operations since the inode needs to be written to
update the nlink count anyway, and the xattr can be stored in the
inode and does not generate extra IO unless there are more hard links
than can fit in the inode.

This allows doing optimized pathname generation for all links to a
file, and can in theory be used for any type of filesystem that has
efficient xattr storage.

2010-06-18 02:59:50

by J. R. Okajima

[permalink] [raw]

Subject: Re: [PATCH v2] d_ino considered harmful

Andreas Dilger:
> As the name implies, it is very extN specific. For Lustre 2.0 we use a
> different method to get O(1) FID (inode number) to pathname(s)
> lookup. Each file stores an xattr with the {parent FID, filename}
> tuples for each link to the file, whenever an inode is created,
> linked, unlinked, or renamed.

Honestly speaking, this approach is the one which came to my mind when I
read David's mail. Andreas's approach and explanation is perfect as I
should admire.

Thank you
J. R. Okajima

2010-06-18 19:41:48

by Valerie Aurora

[permalink] [raw]

Subject: Re: [PATCH v2] d_ino considered harmful

On Thu, Jun 17, 2010 at 05:39:17PM -0600, Andreas Dilger wrote:
> On 2010-06-16, at 13:54, Valerie Aurora wrote:
> > On Wed, Jun 16, 2010 at 02:59:13PM -0400, Valerie Aurora wrote:
> >> Who needs d_ino anyway? I am running a kernel with this patch -
> >> Gnome, a browser, IRC, kernel compile, etc. and everything works.
> >
> > Gosh, maybe it would help to patch the currently used readdir instead
> > of just old_readdir() (thanks, Arnd). And return 1 instead of 0 so ls
> > doesn't think all files are deleted (thanks, Andreas).
> >
> > I'm running a kernel with the below patch and everything still works.
> > Apparently "ls -i" is still using the bogus d_ino performance
> > improvement mentioned here because it returns all 1's for inode
> > number.
> >
> > http://www.mail-archive.com/[email protected]/msg02531.html
>
> I don't see why the presence of d_ino is a "bogus" performance optimization. It is useful for some things, and replacing this with "1" by no means helps anything IMHO, and destroys some useful optimizations (e.g. finding which inodes may be hard links), so I'm against this patch.

Ah, this particular performance optimization is bogus because the
output result is wrong in the case of mountpoints. (It's a long long
thread but maybe worth reading it all.) In general, there's nothing
wrong with using d_ino as a performance optimization. I just posted
this patch (and tested it) to see how people actually use d_ino in
real life. I don't think there's any danger of it being accepted,
although it is useful for testing programs that use d_ino.

Thanks for your help,

-VAL