2008-08-12 06:22:52

by Al Viro

[permalink] [raw]
Subject: [RFC] readdir mess

When getdents() had been introduced in 1.3.0, ->readdir() has grown
a callback and became an iterator. The type of that iterator had been
a Bloody Bad Idea(tm); it returns a value that gets reduced to one bit
by all callers. Basically it's "do we stop or do we move to the next
entry?". 0 means "go on" for all ->readdir() instances, negative - "stop".
Positives are treated inconsistently.

Note that it can not be used to return an error. In particular,
the callback of getdents(2) returns "stop" when we run out of buffer.
If any instance of ->readdir() would ever treat that as an error to
pass back to caller, it would break spectaculary on the first ls(1) on
a directory large enough to require more than one getdents(2) call.

Anything that wants to pass errors back to caller of ->readdir()
has to store it in whatever struct we are passing to the callback.

That had lead to two recurring classes of bugs:
1) something_readdir() treats "callback returned negative" as "return
than negative". Breaks pretty fast.
2) whatever_filldir() happily returns some error value and expects it
to be seen by caller of ->readdir(). Somehow. Doesn't work. We do
have such bugs right now.

In addition, we have other kinds of crap going on around ->readdir()
(the most popular being "whaddya mean, somebody might've done lseek(2) and
set offset to hell knows what???"), but those are separate story.

An obvious way to deal with that would be to have filldir_t return
bool; true => stop, false => go on. However, that's going to cause silent
breakage of out-of-tree filesystems. Even though e.g. ext2 had always
treated any non-zero return value of filldir as "stop", we'd grown filesystems
that check for return value < 0 instead. Having it return true (i.e. 1)
will break all of those. Note that generic callbacks currently return
negative values for "stop", so they work with those filesystems right now.

FWIW, I'm seriously tempted to go for the following variant:
new __bitwise type with two values - READDIR_STOP and READDIR_CONTINUE,
-1 and 0 resp. The type of filldir_t would become filldir_res_t (*)(......),
all existing instances of ->readdir() would keep working and sparse
would catch all mismatches.

Another variant is to change filldir_t in visible incompatible way
that would have bool as return value and let readdir instances adjust or
refuse to compile; maintainers of out-of-tree code would presumably notice
that, so we would just have to document the calling conventions and say
that checking for < 0 doesn't work anymore.

Comments?


2008-08-12 17:02:56

by OGAWA Hirofumi

[permalink] [raw]
Subject: Re: [RFC] readdir mess

Al Viro <[email protected]> writes:

> An obvious way to deal with that would be to have filldir_t return
> bool; true => stop, false => go on. However, that's going to cause silent
> breakage of out-of-tree filesystems. Even though e.g. ext2 had always
> treated any non-zero return value of filldir as "stop", we'd grown filesystems
> that check for return value < 0 instead. Having it return true (i.e. 1)
> will break all of those. Note that generic callbacks currently return
> negative values for "stop", so they work with those filesystems right now.
>
> FWIW, I'm seriously tempted to go for the following variant:
> new __bitwise type with two values - READDIR_STOP and READDIR_CONTINUE,
> -1 and 0 resp. The type of filldir_t would become filldir_res_t (*)(......),
> all existing instances of ->readdir() would keep working and sparse
> would catch all mismatches.
>
> Another variant is to change filldir_t in visible incompatible way
> that would have bool as return value and let readdir instances adjust or
> refuse to compile; maintainers of out-of-tree code would presumably notice
> that, so we would just have to document the calling conventions and say
> that checking for < 0 doesn't work anymore.

Personally, I'd like latter than would it work. And I hope we don't do
this again... In the case of -EOVERFLOW, even current generic filldir()
is strange like following, and I saw simular in past.
--
OGAWA Hirofumi <[email protected]>


The return value of fillonedir/filldir() is for the caller of
fillonedir/filldir(). If we want to tell the result to the caller of
->readdir(), we need to set it to buf->result/error.

This fixes -EOVERFLOW case, and cleans up related stuff.

[BTW: 32bit compat of ia64/powerpc seems to need to update]

Signed-off-by: OGAWA Hirofumi <[email protected]>
---

fs/readdir.c | 45 ++++++++++++++++++++++++++-------------------
1 file changed, 26 insertions(+), 19 deletions(-)

diff -puN fs/readdir.c~filldir-errcode-fix fs/readdir.c
--- linux-2.6/fs/readdir.c~filldir-errcode-fix 2008-02-02 04:03:55.000000000 +0900
+++ linux-2.6-hirofumi/fs/readdir.c 2008-02-02 04:03:55.000000000 +0900
@@ -46,18 +46,15 @@ out:

EXPORT_SYMBOL(vfs_readdir);

+#ifdef __ARCH_WANT_OLD_READDIR
/*
* Traditional linux readdir() handling..
*
- * "count=1" is a special case, meaning that the buffer is one
+ * "result=1" is a special case, meaning that the buffer is one
* dirent-structure in size and that the code can't handle more
* anyway. Thus the special "fillonedir()" function for that
* case (the low-level handlers don't need to care about this).
*/
-#define NAME_OFFSET(de) ((int) ((de)->d_name - (char __user *) (de)))
-
-#ifdef __ARCH_WANT_OLD_READDIR
-
struct old_linux_dirent {
unsigned long d_ino;
unsigned long d_offset;
@@ -80,8 +77,10 @@ static int fillonedir(void * __buf, cons
if (buf->result)
return -EINVAL;
d_ino = ino;
- if (sizeof(d_ino) < sizeof(ino) && d_ino != ino)
- return -EOVERFLOW;
+ if (sizeof(d_ino) < sizeof(ino) && d_ino != ino) {
+ buf->result = -EOVERFLOW;
+ goto error;
+ }
buf->result++;
dirent = buf->dirent;
if (!access_ok(VERIFY_WRITE, dirent,
@@ -97,7 +96,8 @@ static int fillonedir(void * __buf, cons
return 0;
efault:
buf->result = -EFAULT;
- return -EFAULT;
+error:
+ return buf->result;
}

asmlinkage long old_readdir(unsigned int fd, struct old_linux_dirent __user * dirent, unsigned int count)
@@ -122,9 +122,10 @@ asmlinkage long old_readdir(unsigned int
out:
return error;
}
-
#endif /* __ARCH_WANT_OLD_READDIR */

+#define NAME_OFFSET(de) ((int) ((de)->d_name - (char __user *) (de)))
+
/*
* New, all-improved, singing, dancing, iBCS2-compliant getdents()
* interface.
@@ -151,12 +152,15 @@ static int filldir(void * __buf, const c
unsigned long d_ino;
int reclen = ALIGN(NAME_OFFSET(dirent) + namlen + 2, sizeof(long));

- buf->error = -EINVAL; /* only used if we fail.. */
- if (reclen > buf->count)
- return -EINVAL;
+ if (reclen > buf->count) {
+ buf->error = -EINVAL;
+ goto error;
+ }
d_ino = ino;
- if (sizeof(d_ino) < sizeof(ino) && d_ino != ino)
- return -EOVERFLOW;
+ if (sizeof(d_ino) < sizeof(ino) && d_ino != ino) {
+ buf->error = -EOVERFLOW;
+ goto error;
+ }
dirent = buf->previous;
if (dirent) {
if (__put_user(offset, &dirent->d_off))
@@ -180,7 +184,8 @@ static int filldir(void * __buf, const c
return 0;
efault:
buf->error = -EFAULT;
- return -EFAULT;
+error:
+ return buf->error;
}

asmlinkage long sys_getdents(unsigned int fd, struct linux_dirent __user * dirent, unsigned int count)
@@ -236,9 +241,10 @@ static int filldir64(void * __buf, const
struct getdents_callback64 * buf = (struct getdents_callback64 *) __buf;
int reclen = ALIGN(NAME_OFFSET(dirent) + namlen + 1, sizeof(u64));

- buf->error = -EINVAL; /* only used if we fail.. */
- if (reclen > buf->count)
- return -EINVAL;
+ if (reclen > buf->count) {
+ buf->error = -EINVAL;
+ goto error;
+ }
dirent = buf->previous;
if (dirent) {
if (__put_user(offset, &dirent->d_off))
@@ -264,7 +270,8 @@ static int filldir64(void * __buf, const
return 0;
efault:
buf->error = -EFAULT;
- return -EFAULT;
+error:
+ return buf->error;
}

asmlinkage long sys_getdents64(unsigned int fd, struct linux_dirent64 __user * dirent, unsigned int count)
_

2008-08-12 17:19:06

by Linus Torvalds

[permalink] [raw]
Subject: Re: [RFC] readdir mess



On Wed, 13 Aug 2008, OGAWA Hirofumi wrote:
>
> This fixes -EOVERFLOW case, and cleans up related stuff.

I think it just makes things more confused.

If we actually want to change the readdir() thing, then we should just
make the rule be:

- if the callback returns a non-zero value, the filesystem "readdir()"
function should return that value (right now they are taught to return
zero, and return errors on internal fatal things). And get rid of
"buf.error" entirely.

The reason for the whole "buf.error" etc stuff is simply because that
avoided having the low-level filesystem even care about things. But if
you really want to clean this up, we should *NOT* continue the current

- the callers should then do

error = vfs_readdir(file, filldir, &buf);
lastdirent = buf.previous;
if (lastdirent) {
error = count - buf.count;
if (put_user(file->f_pos, &lastdirent->d_off))
error = -EFAULT;
}
fput(file);
return error;

and we wouldn't need any other logic at all.

As to -EOVERFLOW, I suspect we are better off just dropping that whole
logic. Returning -EOVERFLOW and truncating the readdir listing is likely
much worse than the alternative. It made sense back when we needed to get
people to upgrade the system interfaces, now it just means that old
binaries won't work at all.

Linus

2008-08-12 18:11:19

by Al Viro

[permalink] [raw]
Subject: Re: [RFC] readdir mess

On Tue, Aug 12, 2008 at 10:18:49AM -0700, Linus Torvalds wrote:

> If we actually want to change the readdir() thing, then we should just
> make the rule be:
>
> - if the callback returns a non-zero value, the filesystem "readdir()"
> function should return that value (right now they are taught to return
> zero, and return errors on internal fatal things). And get rid of
> "buf.error" entirely.

Doesn't work well for readdir(2)...

> error = vfs_readdir(file, filldir, &buf);
> lastdirent = buf.previous;
> if (lastdirent) {
> error = count - buf.count;
> if (put_user(file->f_pos, &lastdirent->d_off))
> error = -EFAULT;
> }
> fput(file);
> return error;
>
> and we wouldn't need any other logic at all.

you've just lost e.g. -EIO for getdents(). And if you bail out on
non-zero return value from vfs_readdir(), you are back to -EINVAL
on full buffer.

Frankly, I'd rather keep ->readdir() instances simpler. There are far
more of those, for one thing. As it is, we only have "stop"/"continue"
->readdir() has to care about...

There's one more thing in that mess: a bunch of vfs_readdir() callers
end up playing very sick games to make sure they get the entire
directory. The trick is to find whether the damn thing has reached
the end; as it is, there are instances of ->readdir() that do _not_
(e.g. call filldir only once and let the caller repeat).

I'm certainly not too fond of buf->error. If you see a better interface
I'd love to hear about it, but I don't think that "just return anything
non-zero we'd got from callback" is going to be good. And if we go for
flagday changes in ->readdir(), we'd better get it right...

2008-08-12 18:22:54

by Al Viro

[permalink] [raw]
Subject: Re: [RFC] readdir mess

On Tue, Aug 12, 2008 at 07:10:57PM +0100, Al Viro wrote:
> There's one more thing in that mess: a bunch of vfs_readdir() callers
> end up playing very sick games to make sure they get the entire
> directory. The trick is to find whether the damn thing has reached
> the end; as it is, there are instances of ->readdir() that do _not_
> (e.g. call filldir only once and let the caller repeat).
>
> I'm certainly not too fond of buf->error. If you see a better interface
> I'd love to hear about it, but I don't think that "just return anything
> non-zero we'd got from callback" is going to be good. And if we go for
> flagday changes in ->readdir(), we'd better get it right...

PS: we might get away with both, if we used _positive_ values as well.
E.g. have getdents() filldir return 1 if we are out of buffer *and*
have ->previous != NULL (and -EINVAL if we are out of buffer on the
first call)... And have some other positive constant for "->readdir()
didn't feel like going all the way to the end of directory".

2008-08-12 18:37:57

by Al Viro

[permalink] [raw]
Subject: Re: [RFC] readdir mess

On Tue, Aug 12, 2008 at 07:22:40PM +0100, Al Viro wrote:

> PS: we might get away with both, if we used _positive_ values as well.
> E.g. have getdents() filldir return 1 if we are out of buffer *and*
> have ->previous != NULL (and -EINVAL if we are out of buffer on the
> first call)... And have some other positive constant for "->readdir()
> didn't feel like going all the way to the end of directory".

FWIW, how about that sequence:

Patch 1:
Turn all filldir(...) < 0 into filldir() != 0 in ->readdir() instances,
no changes other than that. Everything should keep working as-is.

Patch 2:
Make fillonedir() return 1 on the second call; make filldir() et.al.
return 1 instead of -EINVAL if we have ->previous != NULL. Again,
should be no breakage.

Patch 3: switch ->readdir() to your "return anything non-null we got from
callback". AFAICS, main callers will see no breakage, but in any case
we have few enough of those to adjust them as needed first.

Patch 4: get rid of ->error and its ilk; adjust callers in obvious ways
(e.g. sys_gtedents() would bail out on negative from vfs_readdir() as
it does now and treat 0 and 1 in the same way - put_user() ? -EFAULT : <how
much did we copy). Callers can be taken care one by one. Again, no breakage
and everything's bisectable.

Patch 5 (maybe):
#define READDIR_MORE INT_MAX
Have ->readdir() instances that decide to stop once they'd done several
filldir calls return it if there's still more left.
Have vfs_readdir() loop calling ->readdir() as long as it gets READDIR_MORE.
Get rid of weird loops in callers.

I'm not sure that the last one is needed - we might be better off just by
making the such instances loop themselves. In any case, loops in callers
(nfsd, etc.) are begging for trouble...

2008-08-12 19:24:32

by Al Viro

[permalink] [raw]
Subject: Re: [RFC] readdir mess

On Tue, Aug 12, 2008 at 07:37:45PM +0100, Al Viro wrote:

> FWIW, how about that sequence:
>
> Patch 1:
> Turn all filldir(...) < 0 into filldir() != 0 in ->readdir() instances,
> no changes other than that. Everything should keep working as-is.
>
> Patch 2:
> Make fillonedir() return 1 on the second call; make filldir() et.al.
> return 1 instead of -EINVAL if we have ->previous != NULL. Again,
> should be no breakage.

aargh...
Patch 2.5:
Fix braindead instances of ->readdir() that return odd crap on success
(e.g. coda_readdir() returning the count of filldir calls that had returned
zero).

> Patch 3: switch ->readdir() to your "return anything non-null we got from
> callback". AFAICS, main callers will see no breakage, but in any case
> we have few enough of those to adjust them as needed first.
>
> Patch 4: get rid of ->error and its ilk; adjust callers in obvious ways
> (e.g. sys_gtedents() would bail out on negative from vfs_readdir() as
> it does now and treat 0 and 1 in the same way - put_user() ? -EFAULT : <how
> much did we copy). Callers can be taken care one by one. Again, no breakage
> and everything's bisectable.
>
> Patch 5 (maybe):
> #define READDIR_MORE INT_MAX
> Have ->readdir() instances that decide to stop once they'd done several
> filldir calls return it if there's still more left.
> Have vfs_readdir() loop calling ->readdir() as long as it gets READDIR_MORE.
> Get rid of weird loops in callers.
>
> I'm not sure that the last one is needed - we might be better off just by
> making the such instances loop themselves. In any case, loops in callers
> (nfsd, etc.) are begging for trouble...
> --
> To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html

2008-08-12 19:45:21

by OGAWA Hirofumi

[permalink] [raw]
Subject: Re: [RFC] readdir mess

Linus Torvalds <[email protected]> writes:

> As to -EOVERFLOW, I suspect we are better off just dropping that whole
> logic. Returning -EOVERFLOW and truncating the readdir listing is likely
> much worse than the alternative. It made sense back when we needed to get
> people to upgrade the system interfaces, now it just means that old
> binaries won't work at all.

However, there are some similar stuff: ->st_size, ->st_nlink and
->st_ino in stat() (cp_old_stat()). Maybe EOVERFLOW is the reason for
consistency...
--
OGAWA Hirofumi <[email protected]>

2008-08-12 20:03:44

by Linus Torvalds

[permalink] [raw]
Subject: Re: [RFC] readdir mess



On Tue, 12 Aug 2008, Al Viro wrote:
>
> Doesn't work well for readdir(2)...

Sure it does.

> > error = vfs_readdir(file, filldir, &buf);
> > lastdirent = buf.previous;
> > if (lastdirent) {
> > error = count - buf.count;
> > if (put_user(file->f_pos, &lastdirent->d_off))
> > error = -EFAULT;
> > }
> > fput(file);
> > return error;
> >
> > and we wouldn't need any other logic at all.
>
> you've just lost e.g. -EIO for getdents().

No I've not.

If we returned a partial result, we _should_ return a partial result.

And if we got EIO on the first entry, we should return EIO.

The _current_ code is crap. It sometimes returns the error (if the
->readdir() function returned error), and sometimes returns the partial
result (if the "buf.error" was set).

> Frankly, I'd rather keep ->readdir() instances simpler. There are far
> more of those, for one thing. As it is, we only have "stop"/"continue"
> ->readdir() has to care about...

Keeping them simple (and not changing them - always returning zero is what
the _original_ readdir() thing did!) is why the current situation exists.

So if we keep it that way, then we really *KEEP* it that way. Don't go
around changing any of the existing rules. Just make sure that the
callbacks keep on always returning negative or zero (and never positive).

Linus

2008-08-12 20:06:24

by Linus Torvalds

[permalink] [raw]
Subject: Re: [RFC] readdir mess



On Wed, 13 Aug 2008, OGAWA Hirofumi wrote:
>
> However, there are some similar stuff: ->st_size, ->st_nlink and
> ->st_ino in stat() (cp_old_stat()). Maybe EOVERFLOW is the reason for
> consistency...

.. I actually had an old binary that triggered that case (actually, it was
O_LARGEFILE in open()), and didn't work as a result. I only needed to run
it once, so I literally hacked up a once-time-use kernel that just removed
the EOVERFLOW in open.

So no, I'm not a fan of EOVERFLOW at all. Not in readdir(), not really
anywhere else.

Linus

2008-08-12 20:22:16

by Linus Torvalds

[permalink] [raw]
Subject: Re: [RFC] readdir mess



On Tue, 12 Aug 2008, Al Viro wrote:
>
> you've just lost e.g. -EIO for getdents(). And if you bail out on
> non-zero return value from vfs_readdir(), you are back to -EINVAL
> on full buffer.

Btw, this whole sentence, and the one from your next email seems to really
show a fundamental misunderstanding of the whole readdir() error handling:

> PS: we might get away with both, if we used _positive_ values as well.
> E.g. have getdents() filldir return 1 if we are out of buffer *and*
> have ->previous != NULL (and -EINVAL if we are out of buffer on the
> first call)... And have some other positive constant for "->readdir()
> didn't feel like going all the way to the end of directory".

We *must* handle partial returns by returning "success". And we do,
except for our _confusion_ about ->readdir() returning error and that
somehow "overriding" the fact that it already returned non-errors earlier
through the callback.

All your blathering about "positive values as well" seems to ttoally
misunderstand how readdir() works. We absolutely do *not* need positive
return values, because the fact is, the only positive return value we ever
need is the "we already filled _earlier_ buffers". And that's the one
that we already do.

The fact is, NO ERROR VALUE CAN POSSIBLY MATTER if we already returned one
or more entries to getdents/readdir(). We should return a success value.

Linus

2008-08-12 20:38:22

by Al Viro

[permalink] [raw]
Subject: Re: [RFC] readdir mess

On Tue, Aug 12, 2008 at 01:21:56PM -0700, Linus Torvalds wrote:
> We *must* handle partial returns by returning "success". And we do,
> except for our _confusion_ about ->readdir() returning error and that
> somehow "overriding" the fact that it already returned non-errors earlier
> through the callback.
>
> All your blathering about "positive values as well" seems to ttoally
> misunderstand how readdir() works. We absolutely do *not* need positive
> return values, because the fact is, the only positive return value we ever
> need is the "we already filled _earlier_ buffers". And that's the one
> that we already do.
>
> The fact is, NO ERROR VALUE CAN POSSIBLY MATTER if we already returned one
> or more entries to getdents/readdir(). We should return a success value.

Would you care to grep for vfs_readdir() in the tree? It's not just
sys_getdents(); for better of worse the thing had become a general-purpose
iterator. And I'm not suggesting to pass the damn thing to caller of
sys_getdents(). At all.

2008-08-12 20:59:56

by Al Viro

[permalink] [raw]
Subject: Re: [RFC] readdir mess

On Tue, Aug 12, 2008 at 01:05:43PM -0700, Linus Torvalds wrote:
>
>
> On Wed, 13 Aug 2008, OGAWA Hirofumi wrote:
> >
> > However, there are some similar stuff: ->st_size, ->st_nlink and
> > ->st_ino in stat() (cp_old_stat()). Maybe EOVERFLOW is the reason for
> > consistency...
>
> .. I actually had an old binary that triggered that case (actually, it was
> O_LARGEFILE in open()), and didn't work as a result. I only needed to run
> it once, so I literally hacked up a once-time-use kernel that just removed
> the EOVERFLOW in open.
>
> So no, I'm not a fan of EOVERFLOW at all. Not in readdir(), not really
> anywhere else.

Um... Here it would happen only on attempt to return an entry for file
that really has an inumber not fitting into the field; what would you
do in such case? open() on a huge file is a different story, since you
would get a valid opened file and that'd be it; the logics in case of
open() is, IIRC, "I guess your binary is old, so it'll probably do other
things that would really have to trigger -EOVERFLOW; better bail out right
now". In this case, though, you either generate an error or get wrong
values silently stored in userland struct (wrapped around inumber).

2008-08-12 21:04:38

by Linus Torvalds

[permalink] [raw]
Subject: Re: [RFC] readdir mess



On Tue, 12 Aug 2008, Al Viro wrote:
>
> Would you care to grep for vfs_readdir() in the tree? It's not just
> sys_getdents(); for better of worse the thing had become a general-purpose
> iterator. And I'm not suggesting to pass the damn thing to caller of
> sys_getdents(). At all.

Umm. What does that matter wrt what I said?

What does that matter wrt your bogus argument about EIO (which we do
_wrong_ right now, exactly because we return EIO too eagerly, instead of
returning the partial data we got)?

vfs_readdir() itself would not change AT ALL. The change I talked about
was in the caller. Which you should realize, since I actually _called_
vfs_readdir() in that example code.

Here's a patch to get the _semantics_ I talked about, to make the thing
more obvious.

But the actual _change_ I talked about would be to try to avoid using
"buf.error" entirely, and just make all the (many, I know) filesystem
readdir() functions return the callback value instead of 0. Which would
mean that most of the users would be able to drop their "buf.error" use
entirely, along with the ugly

if (error >= 0)
error = buf.error;

that I have in this patch.

(Side note: the "if (error >= 0)" approach is what old_readdir() already
does, so that function actually gets the EIO case _correct_: it's only
the newer sys_getdents() that is buggy)

Linus

---
fs/readdir.c | 6 ++----
1 files changed, 2 insertions(+), 4 deletions(-)

diff --git a/fs/readdir.c b/fs/readdir.c
index 4e026e5..e4ded16 100644
--- a/fs/readdir.c
+++ b/fs/readdir.c
@@ -205,9 +205,8 @@ asmlinkage long sys_getdents(unsigned int fd, struct linux_dirent __user * diren
buf.error = 0;

error = vfs_readdir(file, filldir, &buf);
- if (error < 0)
- goto out_putf;
- error = buf.error;
+ if (error >= 0)
+ error = buf.error;
lastdirent = buf.previous;
if (lastdirent) {
if (put_user(file->f_pos, &lastdirent->d_off))
@@ -216,7 +215,6 @@ asmlinkage long sys_getdents(unsigned int fd, struct linux_dirent __user * diren
error = count - buf.count;
}

-out_putf:
fput(file);
out:
return error;

2008-08-12 21:24:50

by Linus Torvalds

[permalink] [raw]
Subject: Re: [RFC] readdir mess



On Tue, 12 Aug 2008, Al Viro wrote:
>
> Um... Here it would happen only on attempt to return an entry for file
> that really has an inumber not fitting into the field; what would you
> do in such case?

You'd truncate the inode number. What's the big deal? Inode numbers aren't
that important - they're just about the _least_ important part of the data
returned for a readdir.

Sure, truncating the inode number may confuse some programs like old-style
/bin/pwd, but the thing is, we effectively _already_ do that by generating
essentially random numbers for filesystems that don't have UNIX-style
inode numbers anyway, so what's the big deal really?

> open() on a huge file is a different story, since you would get a valid
> opened file and that'd be it; the logics in case of open() is, IIRC, "I
> guess your binary is old, so it'll probably do other things that would
> really have to trigger -EOVERFLOW; better bail out right now".

The thing is, the "probably" is likely "probably not". There's a lot of
things that work quite well. Sure, if you lseek() you get confused. If
you do "stat()" and then mmap(), you'll get confused. But a lot of
programs really do just read/write. Or use llseek(), in fact. O_LARGEFILE
is actually a later thing tan llseek() was.

So the problem with EOVERFLOW is that it says that programs shouldn't do
anything at all because they _may_ do bad things.

And don't get me wrong - I think EOVERFLOW was (and is) actually a good
thing for the _transition_ period. Exactly because it breaks programs in
obvious ways, and early. It's good to be annoying in order to find
problems early and fix them.

But I also think that we're not in a transition period any more, and as a
result the annoyance part is just annoying an doesn't help find and fix
problems any more, it just makes legacy binaries not work even if they
could otherwise work fine (and _maybe_ have problems).

So something that made sense five years ago may not make sense any more,
is what I'm saying. These days, if somebody runs legacy binaries, they do
it because of archeology reasons or similar..

Linus

2008-08-12 21:54:56

by Al Viro

[permalink] [raw]
Subject: Re: [RFC] readdir mess

On Tue, Aug 12, 2008 at 02:24:04PM -0700, Linus Torvalds wrote:

> On Tue, 12 Aug 2008, Al Viro wrote:
> >
> > Um... Here it would happen only on attempt to return an entry for file
> > that really has an inumber not fitting into the field; what would you
> > do in such case?
>
> You'd truncate the inode number. What's the big deal? Inode numbers aren't
> that important - they're just about the _least_ important part of the data
> returned for a readdir.

Tell that to tar(1) ;-)

> But I also think that we're not in a transition period any more, and as a
> result the annoyance part is just annoying an doesn't help find and fix
> problems any more, it just makes legacy binaries not work even if they
> could otherwise work fine (and _maybe_ have problems).
>
> So something that made sense five years ago may not make sense any more,
> is what I'm saying. These days, if somebody runs legacy binaries, they do
> it because of archeology reasons or similar..

I suspect that SUS specifies that crap in some cases, but I honestly do not
remember. For large offsets, that is. Large inode numbers are more recent
and hit relatively few filesystems. OTOH, I suspect that most of getdents()
call sites are in libc anyway...

Anyway, the point for getdents() is simply that we *do* return an error; it's
just that it ends up with -EINVAL instead of -EOVERFLOW, and that's simply
bogus - we should either truncate silently or return the right value. The
code definitely intends to do the latter and fucks up.

2008-08-12 22:05:00

by Linus Torvalds

[permalink] [raw]
Subject: Re: [RFC] readdir mess



On Tue, 12 Aug 2008, Al Viro wrote:

> On Tue, Aug 12, 2008 at 02:24:04PM -0700, Linus Torvalds wrote:
> >
> > You'd truncate the inode number. What's the big deal? Inode numbers aren't
> > that important - they're just about the _least_ important part of the data
> > returned for a readdir.
>
> Tell that to tar(1) ;-)

The thing is, you still have the low 32 (or 16) bits, so even tar is
better off most of the time.

Let's face it, we actually always truncate things as-is. What are inode
numbers on NFS but truncated file handles?

And the other side of the coin is that legacy binaries are almost never
_system_ binaries. You upgrade those with your system They are some odd
special-purpose thing.

> Anyway, the point for getdents() is simply that we *do* return an error; it's
> just that it ends up with -EINVAL instead of -EOVERFLOW, and that's simply
> bogus - we should either truncate silently or return the right value. The
> code definitely intends to do the latter and fucks up.

I agree. However, the reason it f*cks up is exactly the fact that we have
two different error numbers, which was why I suggested that if we really
want to fix this problem and are talking about cleaning things up, then
_that_ should be our primary place to look at.

Yes, it implies fixing and checking a lot of low-level filesystems. I
agree. It's easier to just leave it be.

Linus

2008-08-12 22:05:36

by Alan

[permalink] [raw]
Subject: Re: [RFC] readdir mess

On Tue, 12 Aug 2008 13:05:43 -0700 (PDT)
Linus Torvalds <[email protected]> wrote:

>
>
> On Wed, 13 Aug 2008, OGAWA Hirofumi wrote:
> >
> > However, there are some similar stuff: ->st_size, ->st_nlink and
> > ->st_ino in stat() (cp_old_stat()). Maybe EOVERFLOW is the reason for
> > consistency...
>
> .. I actually had an old binary that triggered that case (actually, it was
> O_LARGEFILE in open()), and didn't work as a result. I only needed to run
> it once, so I literally hacked up a once-time-use kernel that just removed
> the EOVERFLOW in open.
>
> So no, I'm not a fan of EOVERFLOW at all. Not in readdir(), not really
> anywhere else.

In a lot of places the standard mandates it to ensure you don't get nasty
suprises. There are places where during the LFS work people found apps
whose large file response was unpleasant - things like "get the file
length, reduce it to an exact number of records long with ftruncate in
case we got a partial record write somewhere" do horrible things when the
length wraps.

That said there is nothing that says we can't have a 'posix_me_harder'
sysfs control for such things. The standards say what should occur in the
normal situation not what should occur if you intentionally move out of
the standard definition.

Alan

2008-08-12 22:20:59

by Linus Torvalds

[permalink] [raw]
Subject: Re: [RFC] readdir mess



On Tue, 12 Aug 2008, Alan Cox wrote:
>
> That said there is nothing that says we can't have a 'posix_me_harder'
> sysfs control for such things. The standards say what should occur in the
> normal situation not what should occur if you intentionally move out of
> the standard definition.

Yeah. That said, I doubt it's a very common problem in practice. I'm
pretty sure nobody really cares, and almost nobody really wants to run
really old binaries. So it's almost certainly not worth it (and the one
time I had it happen, I didn't bother to do it right, I just hacked around
it and obviously never committed the hack).

I just find it a bit sad how well we actually _can_ run old binaries, but
sometimes there are these new things that were literally designed to break
them.

Of course, the much more common breakage comes from not having access to
old shared libraries etc totally user-space issues. The few kernel cases
of EOVERFLOW are totally hidden by just distro differences over time. If
the binary I had hadn't been statically linked, I wouldn't have had a
chance, I suspect.

(Of course, static linking wasn't exactly unusual for really old binaries)

Linus

2008-08-12 22:28:01

by Alan

[permalink] [raw]
Subject: Re: [RFC] readdir mess

> of EOVERFLOW are totally hidden by just distro differences over time. If
> the binary I had hadn't been statically linked, I wouldn't have had a
> chance, I suspect.

Libc 2 still worked as of early 2.6. I've not tried it more recently as I
recompiled rogue instead.

> (Of course, static linking wasn't exactly unusual for really old binaries)

Important ones anyway as you needed the exact matching libc.

2008-08-13 00:04:46

by Al Viro

[permalink] [raw]
Subject: Re: [RFC] readdir mess

On Tue, Aug 12, 2008 at 02:04:25PM -0700, Linus Torvalds wrote:

> But the actual _change_ I talked about would be to try to avoid using
> "buf.error" entirely, and just make all the (many, I know) filesystem
> readdir() functions return the callback value instead of 0. Which would
> mean that most of the users would be able to drop their "buf.error" use
> entirely, along with the ugly
>
> if (error >= 0)
> error = buf.error;
>
> that I have in this patch.

Careful; e.g. coda returns the number of entries read, _not_ 0. So it'd
take more than just "return what the callback had given us".

FWIW, I'm more concerned about the other callers of vfs_readdir();
getdents(2) will get tested on pretty much any fs, so if that breaks
we'll know it (OK, short of ->readdir() doing something spectaculary
stupid when given unusual arguments).

_IF_ we are changing things in ->readdir() (and your "pass return value of
filldir to caller of ->readdir()" certainly qualifies), we'd better make
sure that old instances break visibly and immediately; preferably - at
compile time.

As for whether we want to bother... I've looked through about half of the
->readdir instances. We _do_ want to touch them, with cattle prod if nothing
else. And that's just a cursory look...

9p: touching belief that f_pos can't be changed under us.
adfs: ditto.
affs: user-triggerable affs_warning() (just lseek to 0x10001 and you've
got it), plus fun race: default_lseek() sets ->f_pos before clearing
->f_version and we have no exclusion whatsoever (and then there is an
SMP ordering race). Besides, it returns the number of entries read.
autofs4:broken lseek (uses dcache_readdir() without the matching ->llseek)
befs: pulls out early (after one call of filldir, no matter
what), advances ->f_pos no matter what filldir returns.
bfs: fs-wide exclusion between ->readdir(), -EBADF(!) on invalid offset,
believes that ->f_pos can't change under it.
cifs: f_pos races
coda: returns fsck knows what (number of entries, mostly)
cramfs: blind assumption that on-disk data would be valid; entry crossing a
block boundary => fucked.
ecryptfs: badly broken
efs: f_pos races (BKL doesn't help if you block), ignores the return value
filldir completely, fucked on corrupted data (it's a bit too late
to check that we have bogus offset of name in block after we'd already
accessed the contents). Alignment issues, while we are at it (or
EFS_SLOTAT() is buggy).
ext3: take a look at comments around filldir call. Yes, they are _that_
ancient, and so's the logics around revalidate. ext2 is sane, but
that hadn't propagated. Refuses to go through more than one block,
BTW. Revalidation loop is buggered if we have corrupt data, while
we are at it.
ext4: ditto
freevxfs: AFAICS simply bogus (grep for nblocks there).
hfs: at least missing checks for hfs_bnode_read() failure. And I'm not
at all sure that hfs_mac2asc() use is safe there. BTW, open_dir_list
handling appears to be odd - how the hell does decrementing ->f_pos
help anything? And hfs_dir_release() removes from list without any
locks, so that's almost certainly racy as well.
hfsplus: ditto

2008-08-13 00:34:04

by Linus Torvalds

[permalink] [raw]
Subject: Re: [RFC] readdir mess



On Wed, 13 Aug 2008, Al Viro wrote:
>
> As for whether we want to bother... I've looked through about half of the
> ->readdir instances. We _do_ want to touch them, with cattle prod if nothing
> else.

The really sad part is that readdir() really is also the thing that should
make us change locking. That i_mutex thing is fine and dandy for
everything else, but for readdir() we really would be much better off with
a rwsem - and only take it for reading.

Right now, readdir() is one of the most serialized parts of the whole
kernel. Sad. And while it's a per-directory lock, there are directories
that get much more reading than others, and this has been a small
scalability issue (for samba and apache) for years.

> 9p: touching belief that f_pos can't be changed under us.
> adfs: ditto.

The thing is, generic_file_llseek() takes i_mutex, exactly because of
issues like this. Of course, you have to ask for it (the _default_ llseek
does not do it), and you're right that 9p does not.

Strangely enough, at least 9p _does_ use it for regular files. I'm not
sure how come it decided to do that, but whatever.

> ext3: take a look at comments around filldir call. Yes, they are _that_
> ancient, and so's the logics around revalidate. ext2 is sane, but
> that hadn't propagated. Refuses to go through more than one block,
> BTW. Revalidation loop is buggered if we have corrupt data, while
> we are at it.
> ext4: ditto

The reason ext2 is ok is that you long long ago fixed it to use the page
cache. That got rid of a _lot_ of the crap, and made all the IO look like
regular files (including read-ahead etc). Ext2 _used_ to be the same crap
that ext3 is.

I so wish that ext3 could do the same thing, but no. I still think it
should be possible, but the whole journalling is designed for buffer
heads.

> freevxfs: AFAICS simply bogus (grep for nblocks there).
> hfs: at least missing checks for hfs_bnode_read() failure. And I'm not
> at all sure that hfs_mac2asc() use is safe there. BTW, open_dir_list
> handling appears to be odd - how the hell does decrementing ->f_pos
> help anything? And hfs_dir_release() removes from list without any
> locks, so that's almost certainly racy as well.
> hfsplus: ditto

I don't dispute at all that the readdir() thing is one of the weakest
points of the whole VFS layer. And part of it is that there is no good
caching helper for it at the VFS level, so we always end up having to do
everything at the low-level filesystem level, and that invariably ends up
being sh*t compared to the shared VFS routines.

I'm convinced that the reason we do well on most other filesystem accesses
is exactly the fact that a filesystem basically has to be crazy to try to
do their own version, and in many cases cannot really do it at all (eg you
can't really even avoid using the dcache or the page cache and actually
get any valid semantics).

But readdir() is the _one_ operation where the low-level filesystem still
basically does it all itself. Which is why we can't fix locking, and why
even simple changes are hard because it's not just complex code, it's
complex code in 50+ filesystems with almost zero shared code!

Linus

2008-08-13 01:19:27

by Al Viro

[permalink] [raw]
Subject: Re: [RFC] readdir mess

On Tue, Aug 12, 2008 at 05:28:28PM -0700, Linus Torvalds wrote:
>
>
> On Wed, 13 Aug 2008, Al Viro wrote:
> >
> > As for whether we want to bother... I've looked through about half of the
> > ->readdir instances. We _do_ want to touch them, with cattle prod if nothing
> > else.
>
> The really sad part is that readdir() really is also the thing that should
> make us change locking. That i_mutex thing is fine and dandy for
> everything else, but for readdir() we really would be much better off with
> a rwsem - and only take it for reading.

*cringe*

I agree, but... there's another shitpile in the same area and there we are
_really_ buggered - take a look at what ncpfs/smbfs/cifs/<parts of> procfs
are pulling off with very odd attempts to dump stuff into dcache.

That means manual equivalent of lookup and you _do_ want some exclusion for
that. If nothing else, you want to prevent multiple dentries for the same
subdirectory...

And rwsem for reading wouldn't help at all - readdir/readdir contention
might be annoying, but it's readdir/lookup that really hurts.

> Right now, readdir() is one of the most serialized parts of the whole
> kernel. Sad. And while it's a per-directory lock, there are directories
> that get much more reading than others, and this has been a small
> scalability issue (for samba and apache) for years.

I know. And things like exportfs use of vfs_readdir() are also rather
painful.

> The reason ext2 is ok is that you long long ago fixed it to use the page
> cache. That got rid of a _lot_ of the crap, and made all the IO look like
> regular files (including read-ahead etc). Ext2 _used_ to be the same crap
> that ext3 is.

I remember ;-) However, f_version/i_version mess at that place simply an
ancient crap - we *are* holding i_mutex and we are using generic_file_lseek().
IOW, it simply hadn't been reviewed since then - or reviewers had been
stunned into glazed-eyes mode by the entire thing.

> I so wish that ext3 could do the same thing, but no. I still think it
> should be possible, but the whole journalling is designed for buffer
> heads.

Don't. Get. Me. Started. Hell, at least by now somebody had cleaned
htree code into nearly readable form; maybe it might be doable...

> > freevxfs: AFAICS simply bogus (grep for nblocks there).
> > hfs: at least missing checks for hfs_bnode_read() failure. And I'm not
> > at all sure that hfs_mac2asc() use is safe there. BTW, open_dir_list
> > handling appears to be odd - how the hell does decrementing ->f_pos
> > help anything? And hfs_dir_release() removes from list without any
> > locks, so that's almost certainly racy as well.
> > hfsplus: ditto
>
> I don't dispute at all that the readdir() thing is one of the weakest
> points of the whole VFS layer. And part of it is that there is no good
> caching helper for it at the VFS level, so we always end up having to do
> everything at the low-level filesystem level, and that invariably ends up
> being sh*t compared to the shared VFS routines.

What _can_ a common helper do, anyway, when we are busy parsing an arseload of
possibly corrupt data in whatever weird format fs insists upon?

As for the locking... I toyed with one idea for a while: instead of passing
a callback and one of many completely different structs, how about a common
*beginning* of a struct, with callback stored in it along with several
common fields? Namely,
* count of filldir calls already made
* pointer to file in question
* "are we done" flag
And instead of calling filldir(buf, ...) ->readdir() would call one of several
helpers:
* readdir_emit(buf, ...) - obvious
* readdir_relax(buf) - called in a spot convenient for dropping
and retaking lock; returns whether we need to do revalidation.
* readdir_eof(buf) - end of directory
* maybe readdir_dot() and readdir_dotdot() - those are certainly
common enough

That's the obvious flagday stuff, but since we need to give serious beating
to most of the instances anyway... Might as well consider something in
that direction.

2008-08-13 01:52:26

by Linus Torvalds

[permalink] [raw]
Subject: Re: [RFC] readdir mess



On Wed, 13 Aug 2008, Al Viro wrote:
>
> What _can_ a common helper do, anyway, when we are busy parsing an arseload of
> possibly corrupt data in whatever weird format fs insists upon?

Well, the parsing has to be done by the low-level filesystem code, yes.

However, the whole thing with races with "f_pos" and all the locking -
that's only because we see the filesystem "readdir" code as being the
primary source of data.

Quite frankly, if we had a "readdir page cache", the low-level filesystem
would still have to parse the insane low-level data with corruption
issues, but we could make it totally independent of f_pos (because we
would never use in the _real_ file->f_pos - we would just populate the
cache), and the locking issues would be only a cold-cache issue, with the
hot-cache hopefully needing little locking at all.

For an exmple of that: you did a good job with all the "seq_file" helpers,
which meant that the low-level "filesystem" ops didn't need to know
_anything_ about partial results etc, and it automatically did the right
thing wrt f_pos updates and lseek etc.

I'm not saying that readdir() would use the _same_ model, but I do suspect
that a common format in between the disk format and the eventual readdir()
output, that also could be cached, might mitigate a lot of the problems.

As to the issues with lookup() - yes, a lookup would need to get the lock
for writing, but only for the last entry, and only if O_CREAT is set.
There's nothing wrogn with concurrent read-only lookups, I think (apart
from having to protect the dentries from being duplicated, of course, but
that would be a per-dentry lock flag, not a directory lock, methinks).

I dunno.

That said, I think you are right that we could also just improve on the
current non-caching version with soem higher-level semantics. Including
flags like "yes, we've seen the end", so that we don't need to always call
into the low-level filesystem one extra time to see that final zero
return.

So yes, instead of separate "filldir_t" and "void *data" things, having a
"struct filldir_t" with several fields in common might be worth it.

Linus

2008-08-13 08:37:12

by Brad Boyer

[permalink] [raw]
Subject: Re: [RFC] readdir mess

On Wed, Aug 13, 2008 at 01:04:33AM +0100, Al Viro wrote:
> hfs: at least missing checks for hfs_bnode_read() failure. And I'm not
> at all sure that hfs_mac2asc() use is safe there. BTW, open_dir_list
> handling appears to be odd - how the hell does decrementing ->f_pos
> help anything? And hfs_dir_release() removes from list without any
> locks, so that's almost certainly racy as well.
> hfsplus: ditto

I don't work on this code anymore, but I wrote the original version which
makes me a bit curious about some of the critcism here. The function
hfs_bnode_read is declared void, so it doesn't return any errors. The
only thing inside it that I could even think of failing is kmap, but
I was under the impression that kmap would sleep until it could
complete. The actual disk read happens earlier and is saved in the
page cache as long as the bnode object exists.

Is there any reason that hfs_mac2asc would not be safe? I can't think
of any good way to avoid that call even if it is unsafe, since the
readdir will expect UTF8 strings rather than the mac (or UTF16 for
hfsplus) encoding found on disk.

The open_dir_list stuff is a little odd I admit, and I think you are
right about the locking issue in release. However, I feel like I should
explain the way hfs and hfsplus use f_pos on directories. The on-disk
format requires that the directory entries be sorted in order by
filename and does not allow any holes in the list. Because of this,
adding and removing entries will move all the items that are later
in the order to make room or eliminate the hole. The manipulation
of f_pos is intended to make it act more like a standard unix
filesystem where removing an item doesn't usually make a pending
readdir skip an unrelated item. The value of f_pos for a directory
is only incremented by one for each entry to make seeking work
in a more sane fashion. Because of this, an increment moves to the
next item and decrement moves to the previous one.

As a side note about the complexity of making hfs and hfsplus fit
into a unix model, there is just one file that contains the equivalent
of every directory and the entire inode table. Because of this, the
directory code is very synthetic. I tried to make it look as much as
possible like a normal unix filesystem, including making i_nlink on
the directory inodes reflect the number of child directories. It
makes for some code that is admittedly a little hard to follow.

Brad Boyer
[email protected]

2008-08-13 16:19:57

by Al Viro

[permalink] [raw]
Subject: Re: [RFC] readdir mess

On Wed, Aug 13, 2008 at 01:36:35AM -0700, Brad Boyer wrote:
> On Wed, Aug 13, 2008 at 01:04:33AM +0100, Al Viro wrote:
> > hfs: at least missing checks for hfs_bnode_read() failure. And I'm not
> > at all sure that hfs_mac2asc() use is safe there. BTW, open_dir_list
> > handling appears to be odd - how the hell does decrementing ->f_pos
> > help anything? And hfs_dir_release() removes from list without any
> > locks, so that's almost certainly racy as well.
> > hfsplus: ditto
>
> I don't work on this code anymore, but I wrote the original version which
> makes me a bit curious about some of the critcism here. The function
> hfs_bnode_read is declared void, so it doesn't return any errors. The
> only thing inside it that I could even think of failing is kmap, but
> I was under the impression that kmap would sleep until it could
> complete. The actual disk read happens earlier and is saved in the
> page cache as long as the bnode object exists.

Argh... s/failure/arguments/; sorry about the braino. Take a look at
the call in the main loop. entrylength comes from 16bit on-disk value
(set in hfs_brec_goto()). It's not checked anywhere for being too large,
AFAICS. And we proceed to do memcpy() to entry. On stack, BTW.

> Is there any reason that hfs_mac2asc would not be safe? I can't think
> of any good way to avoid that call even if it is unsafe, since the
> readdir will expect UTF8 strings rather than the mac (or UTF16 for
> hfsplus) encoding found on disk.

As for mac2asc... Are multibyte encodings possible there? If they are,
you'd need to validate the first byte of CName as well - result of conversion
will fit the strbuf, but that doesn't mean we do not overrun the source
buffer...

> The open_dir_list stuff is a little odd I admit, and I think you are
> right about the locking issue in release. However, I feel like I should
> explain the way hfs and hfsplus use f_pos on directories. The on-disk
> format requires that the directory entries be sorted in order by
> filename and does not allow any holes in the list. Because of this,
> adding and removing entries will move all the items that are later
> in the order to make room or eliminate the hole. The manipulation
> of f_pos is intended to make it act more like a standard unix
> filesystem where removing an item doesn't usually make a pending
> readdir skip an unrelated item. The value of f_pos for a directory
> is only incremented by one for each entry to make seeking work
> in a more sane fashion.

What happens if you repeatedly create and remove an entry with name below
that of the place where readdir has stopped? AFAICS, on each iteration
f_pos will decrement... I see that scanning of the list in hfs_cat_delete()
and nowhere else; we don't have the matching increment of f_pos...

> Because of this, an increment moves to the
> next item and decrement moves to the previous one.
>
> As a side note about the complexity of making hfs and hfsplus fit
> into a unix model, there is just one file that contains the equivalent
> of every directory and the entire inode table. Because of this, the
> directory code is very synthetic. I tried to make it look as much as
> possible like a normal unix filesystem, including making i_nlink on
> the directory inodes reflect the number of child directories. It
> makes for some code that is admittedly a little hard to follow.

It's actually fairly readable, but AFAICS doesn't validate the on-disk
data enough... Sure, don't go around mounting corrupt filesystem images
and all such, but getting buffer overruns on kernel stack is a bit over
the top, IMO...

[que the grsec pack popping out of latrines screaming "coverup" and demanding
CVEs to be allocated]

2008-08-13 16:20:53

by J. Bruce Fields

[permalink] [raw]
Subject: Re: [RFC] readdir mess

On Tue, Aug 12, 2008 at 03:04:05PM -0700, Linus Torvalds wrote:
>
>
> On Tue, 12 Aug 2008, Al Viro wrote:
>
> > On Tue, Aug 12, 2008 at 02:24:04PM -0700, Linus Torvalds wrote:
> > >
> > > You'd truncate the inode number. What's the big deal? Inode numbers aren't
> > > that important - they're just about the _least_ important part of the data
> > > returned for a readdir.
> >
> > Tell that to tar(1) ;-)
>
> The thing is, you still have the low 32 (or 16) bits, so even tar is
> better off most of the time.
>
> Let's face it, we actually always truncate things as-is. What are inode
> numbers on NFS but truncated file handles?

The server returns the inode number to the client as part of the file
attributes, and that's what the client reports as the inode number. So,
no, the client doesn't try to fake up an inode number by truncating or
hashing the filehandle somehow, if that's what you mean. Maybe I'm
missing your point....

--b.

>
> And the other side of the coin is that legacy binaries are almost never
> _system_ binaries. You upgrade those with your system They are some odd
> special-purpose thing.
>
> > Anyway, the point for getdents() is simply that we *do* return an error; it's
> > just that it ends up with -EINVAL instead of -EOVERFLOW, and that's simply
> > bogus - we should either truncate silently or return the right value. The
> > code definitely intends to do the latter and fucks up.
>
> I agree. However, the reason it f*cks up is exactly the fact that we have
> two different error numbers, which was why I suggested that if we really
> want to fix this problem and are talking about cleaning things up, then
> _that_ should be our primary place to look at.
>
> Yes, it implies fixing and checking a lot of low-level filesystems. I
> agree. It's easier to just leave it be.
>
> Linus
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at http://www.tux.org/lkml/

2008-08-15 05:32:37

by Jan Harkes

[permalink] [raw]
Subject: Re: [RFC] readdir mess

On Wed, Aug 13, 2008 at 01:04:33AM +0100, Al Viro wrote:
> coda: returns fsck knows what (number of entries, mostly)

Not sure either and I was the one that sent the patch that introduced
that. My closest guess would be that I looked too long at a getdents(2)
manpage, but then again it doesn't really match that either.

Signed-off-by: Jan Harkes <[email protected]>
---

If I understood your description, then the following would be the
correct fix. We return 0 as long as we managed to read some entries, and
any non-zero return value from filldir otherwise.

diff --git a/fs/coda/dir.c b/fs/coda/dir.c
index c591622..6026b91 100644
--- a/fs/coda/dir.c
+++ b/fs/coda/dir.c
@@ -513,14 +513,14 @@ static int coda_venus_readdir(struct file *coda_file, void *buf,

if (coda_file->f_pos == 0) {
ret = filldir(buf, ".", 1, 0, de->d_inode->i_ino, DT_DIR);
- if (ret < 0)
+ if (ret != 0)
goto out;
result++;
coda_file->f_pos++;
}
if (coda_file->f_pos == 1) {
ret = filldir(buf, "..", 2, 1, de->d_parent->d_inode->i_ino, DT_DIR);
- if (ret < 0)
+ if (ret != 0)
goto out;
result++;
coda_file->f_pos++;
@@ -572,7 +572,7 @@ static int coda_venus_readdir(struct file *coda_file, void *buf,
ret = filldir(buf, name.name, name.len,
coda_file->f_pos, ino, type);
/* failure means no space for filling in this round */
- if (ret < 0) break;
+ if (ret != 0) break;
result++;
}
/* we'll always have progress because d_reclen is unsigned and
@@ -581,7 +581,7 @@ static int coda_venus_readdir(struct file *coda_file, void *buf,
}
out:
kfree(vdir);
- return result ? result : ret;
+ return result ? 0 : ret;
}

/* called when a cache lookup succeeds */

2008-08-15 05:34:47

by Al Viro

[permalink] [raw]
Subject: Re: [RFC] readdir mess

On Fri, Aug 15, 2008 at 01:06:13AM -0400, Jan Harkes wrote:
> On Wed, Aug 13, 2008 at 01:04:33AM +0100, Al Viro wrote:
> > coda: returns fsck knows what (number of entries, mostly)
>
> Not sure either and I was the one that sent the patch that introduced
> that. My closest guess would be that I looked too long at a getdents(2)
> manpage, but then again it doesn't really match that either.

Simply return 0 unless you get an error in iterator itself. Returning the
first error from filldir requires changes in callers.

2008-08-15 16:59:01

by Linus Torvalds

[permalink] [raw]
Subject: Re: [RFC] readdir mess



On Fri, 15 Aug 2008, Jan Harkes wrote:
>
> If I understood your description, then the following would be the
> correct fix. We return 0 as long as we managed to read some entries, and
> any non-zero return value from filldir otherwise.

This looks fine.

On the other hand, you really _should_ be able to just drop "result"
entirely, and just do "return ret" at the end.

But no, you can't do it right now, becasue the callers are broken crap.
But that really isn't your fault.

Also, you don't really need to change the

if (ret < 0)
goto out;

to

if (ret != 0)
goto out;

because the "filldir()" functions should all do the right thing anyway.
But there's certainly nothing wrong with doing it either.

However, I think the real fix is something like this. This

- fixes all the callers

- removes more lines than it adds

- simplifies and clarifies the code

- avoids pointless goto's

- makes error handling of vfs_readdir() consistent among the callers
(some callers already did the error handling _correctly_ before this
patch - this makes everybody do it the same way)

but I didn't actually even try to test this. Al? Jan?

(Side note: if this were a commit, I'd fix the _users_ of vfs_readdir() in
one patch, and then the fs/coda/dir.c patch would be the next commit, but
I'm sending it out as one single patch for comments/testing)

Linus

---
arch/alpha/kernel/osf_sys.c | 8 ++------
arch/ia64/ia32/sys_ia32.c | 6 ++----
arch/parisc/hpux/fs.c | 7 ++-----
arch/powerpc/kernel/sys_ppc32.c | 7 ++-----
fs/coda/dir.c | 6 +-----
fs/compat.c | 21 +++++++--------------
fs/readdir.c | 22 +++++++---------------
7 files changed, 23 insertions(+), 54 deletions(-)

diff --git a/arch/alpha/kernel/osf_sys.c b/arch/alpha/kernel/osf_sys.c
index 6e94313..869484e 100644
--- a/arch/alpha/kernel/osf_sys.c
+++ b/arch/alpha/kernel/osf_sys.c
@@ -160,14 +160,10 @@ osf_getdirentries(unsigned int fd, struct osf_dirent __user *dirent,
buf.error = 0;

error = vfs_readdir(file, osf_filldir, &buf);
- if (error < 0)
- goto out_putf;
-
- error = buf.error;
+ if (error >= 0)
+ error = buf.error;
if (count != buf.count)
error = count - buf.count;
-
- out_putf:
fput(file);
out:
return error;
diff --git a/arch/ia64/ia32/sys_ia32.c b/arch/ia64/ia32/sys_ia32.c
index 465116a..e70fb30 100644
--- a/arch/ia64/ia32/sys_ia32.c
+++ b/arch/ia64/ia32/sys_ia32.c
@@ -1278,9 +1278,8 @@ sys32_getdents (unsigned int fd, struct compat_dirent __user *dirent, unsigned i
buf.error = 0;

error = vfs_readdir(file, filldir32, &buf);
- if (error < 0)
- goto out_putf;
- error = buf.error;
+ if (error >= 0)
+ error = buf.error;
lastdirent = buf.previous;
if (lastdirent) {
if (put_user(file->f_pos, &lastdirent->d_off))
@@ -1289,7 +1288,6 @@ sys32_getdents (unsigned int fd, struct compat_dirent __user *dirent, unsigned i
error = count - buf.count;
}

-out_putf:
fput(file);
out:
return error;
diff --git a/arch/parisc/hpux/fs.c b/arch/parisc/hpux/fs.c
index 1263f00..ca0cd1b 100644
--- a/arch/parisc/hpux/fs.c
+++ b/arch/parisc/hpux/fs.c
@@ -121,16 +121,13 @@ int hpux_getdents(unsigned int fd, struct hpux_dirent __user *dirent, unsigned i
buf.error = 0;

error = vfs_readdir(file, filldir, &buf);
- if (error < 0)
- goto out_putf;
- error = buf.error;
+ if (error >= 0)
+ error = buf.error;
lastdirent = buf.previous;
if (lastdirent) {
put_user(file->f_pos, &lastdirent->d_off);
error = count - buf.count;
}
-
-out_putf:
fput(file);
out:
return error;
diff --git a/arch/powerpc/kernel/sys_ppc32.c b/arch/powerpc/kernel/sys_ppc32.c
index 709f8cb..b2a1634 100644
--- a/arch/powerpc/kernel/sys_ppc32.c
+++ b/arch/powerpc/kernel/sys_ppc32.c
@@ -100,11 +100,8 @@ asmlinkage int old32_readdir(unsigned int fd, struct old_linux_dirent32 __user *
buf.dirent = dirent;

error = vfs_readdir(file, (filldir_t)fillonedir, &buf);
- if (error < 0)
- goto out_putf;
- error = buf.count;
-
-out_putf:
+ if (error >= 0)
+ error = buf.count;
fput(file);
out:
return error;
diff --git a/fs/coda/dir.c b/fs/coda/dir.c
index c591622..c4b1d58 100644
--- a/fs/coda/dir.c
+++ b/fs/coda/dir.c
@@ -488,7 +488,6 @@ static inline unsigned int CDT2DT(unsigned char cdt)
static int coda_venus_readdir(struct file *coda_file, void *buf,
filldir_t filldir)
{
- int result = 0; /* # of entries returned */
struct coda_file_info *cfi;
struct coda_inode_info *cii;
struct file *host_file;
@@ -515,14 +514,12 @@ static int coda_venus_readdir(struct file *coda_file, void *buf,
ret = filldir(buf, ".", 1, 0, de->d_inode->i_ino, DT_DIR);
if (ret < 0)
goto out;
- result++;
coda_file->f_pos++;
}
if (coda_file->f_pos == 1) {
ret = filldir(buf, "..", 2, 1, de->d_parent->d_inode->i_ino, DT_DIR);
if (ret < 0)
goto out;
- result++;
coda_file->f_pos++;
}
while (1) {
@@ -573,7 +570,6 @@ static int coda_venus_readdir(struct file *coda_file, void *buf,
coda_file->f_pos, ino, type);
/* failure means no space for filling in this round */
if (ret < 0) break;
- result++;
}
/* we'll always have progress because d_reclen is unsigned and
* we've already established it is non-zero. */
@@ -581,7 +577,7 @@ static int coda_venus_readdir(struct file *coda_file, void *buf,
}
out:
kfree(vdir);
- return result ? result : ret;
+ return ret;
}

/* called when a cache lookup succeeds */
diff --git a/fs/compat.c b/fs/compat.c
index c9d1472..f4432fc 100644
--- a/fs/compat.c
+++ b/fs/compat.c
@@ -913,18 +913,14 @@ asmlinkage long compat_sys_getdents(unsigned int fd,
buf.error = 0;

error = vfs_readdir(file, compat_filldir, &buf);
- if (error < 0)
- goto out_putf;
- error = buf.error;
+ if (error >= 0)
+ error = buf.error;
lastdirent = buf.previous;
if (lastdirent) {
+ error = count - buf.count;
if (put_user(file->f_pos, &lastdirent->d_off))
error = -EFAULT;
- else
- error = count - buf.count;
}
-
-out_putf:
fput(file);
out:
return error;
@@ -1004,19 +1000,16 @@ asmlinkage long compat_sys_getdents64(unsigned int fd,
buf.error = 0;

error = vfs_readdir(file, compat_filldir64, &buf);
- if (error < 0)
- goto out_putf;
- error = buf.error;
+ if (error >= 0)
+ error = buf.error;
lastdirent = buf.previous;
if (lastdirent) {
typeof(lastdirent->d_off) d_off = file->f_pos;
- error = -EFAULT;
- if (__put_user_unaligned(d_off, &lastdirent->d_off))
- goto out_putf;
error = count - buf.count;
+ if (__put_user_unaligned(d_off, &lastdirent->d_off))
+ error = -EFAULT;
}

-out_putf:
fput(file);
out:
return error;
diff --git a/fs/readdir.c b/fs/readdir.c
index 4e026e5..f5c02bc 100644
--- a/fs/readdir.c
+++ b/fs/readdir.c
@@ -205,18 +205,14 @@ asmlinkage long sys_getdents(unsigned int fd, struct linux_dirent __user * diren
buf.error = 0;

error = vfs_readdir(file, filldir, &buf);
- if (error < 0)
- goto out_putf;
- error = buf.error;
+ if (error >= 0)
+ error = buf.error;
lastdirent = buf.previous;
if (lastdirent) {
+ error = count - buf.count;
if (put_user(file->f_pos, &lastdirent->d_off))
error = -EFAULT;
- else
- error = count - buf.count;
}
-
-out_putf:
fput(file);
out:
return error;
@@ -289,19 +285,15 @@ asmlinkage long sys_getdents64(unsigned int fd, struct linux_dirent64 __user * d
buf.error = 0;

error = vfs_readdir(file, filldir64, &buf);
- if (error < 0)
- goto out_putf;
- error = buf.error;
+ if (error >= 0)
+ error = buf.error;
lastdirent = buf.previous;
if (lastdirent) {
typeof(lastdirent->d_off) d_off = file->f_pos;
- error = -EFAULT;
- if (__put_user(d_off, &lastdirent->d_off))
- goto out_putf;
error = count - buf.count;
+ if (__put_user(d_off, &lastdirent->d_off))
+ error = -EFAULT;
}
-
-out_putf:
fput(file);
out:
return error;

2008-08-24 23:52:01

by Linus Torvalds

[permalink] [raw]
Subject: Re: [RFC] readdir mess



On Sun, 24 Aug 2008, Al Viro wrote:
>
> The fact that coda_readdir() will _not_ be returning 0 with your change
> when called with the arguments old_readdir() gives it? You'll get ret
> from filldir, i.e. what you'll normally see will be -EINVAL in case of
> fillonedir as callback.

Ahh. A light finally goes on. No on the first filldir() callback, but on
the second.

Yeah, so this should fix it.

Linus
---
fs/readdir.c | 2 +-
1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/fs/readdir.c b/fs/readdir.c
index 4e026e5..4899ba4 100644
--- a/fs/readdir.c
+++ b/fs/readdir.c
@@ -115,7 +115,7 @@ asmlinkage long old_readdir(unsigned int fd, struct old_linux_dirent __user * di
buf.dirent = dirent;

error = vfs_readdir(file, fillonedir, &buf);
- if (error >= 0)
+ if (buf.result || error >= 0)
error = buf.result;

fput(file);

2008-08-25 01:33:33

by Al Viro

[permalink] [raw]
Subject: Re: [RFC] readdir mess

On Sun, Aug 24, 2008 at 04:51:46PM -0700, Linus Torvalds wrote:
>
>
> On Sun, 24 Aug 2008, Al Viro wrote:
> >
> > The fact that coda_readdir() will _not_ be returning 0 with your change
> > when called with the arguments old_readdir() gives it? You'll get ret
> > from filldir, i.e. what you'll normally see will be -EINVAL in case of
> > fillonedir as callback.
>
> Ahh. A light finally goes on. No on the first filldir() callback, but on
> the second.

> - if (error >= 0)
> + if (buf.result || error >= 0)
> error = buf.result;

Actually, in vfs-2.6.git/for-next patch I'd done simply
if (buf.result)
error = buf.result;

If we have !buf.result, we know that foo_readdir() hadn't called filldir at
all. I.e. it's either an error or a genuine EOF. And no ->readdir()
instances return a positive in the latter case, so there's no need to bother.

FWIW, below is the patch in question (commit cb81e118...). I have _not_
touched the mess in nfs4 code; it's badly broken and I strongly suspect
that the right thing to do is to evict that crap to userland. Another
omission is ecryptfs_readdir(), but since that sucker is badly broken
as well *and* I can't even guess WTF had it been trying to achieve...
I'd asked mhalcrow, but looks like he's on vacation ;-/

cb81e1183a8192b0fd5bf869987eb11267fcedbd
diff --git a/arch/alpha/kernel/osf_sys.c b/arch/alpha/kernel/osf_sys.c
index 8509dad..f25f6c4 100644
--- a/arch/alpha/kernel/osf_sys.c
+++ b/arch/alpha/kernel/osf_sys.c
@@ -165,14 +165,11 @@ osf_getdirentries(unsigned int fd, struct osf_dirent __user *dirent,
buf.error = 0;

error = vfs_readdir(file, osf_filldir, &buf);
- if (error < 0)
- goto out_putf;
-
- error = buf.error;
+ if (error >= 0)
+ error = buf.error;
if (count != buf.count)
error = count - buf.count;

- out_putf:
fput(file);
out:
return error;
diff --git a/arch/parisc/hpux/fs.c b/arch/parisc/hpux/fs.c
index 69ff671..b1312bb 100644
--- a/arch/parisc/hpux/fs.c
+++ b/arch/parisc/hpux/fs.c
@@ -127,9 +127,8 @@ int hpux_getdents(unsigned int fd, struct hpux_dirent __user *dirent, unsigned i
buf.error = 0;

error = vfs_readdir(file, filldir, &buf);
- if (error < 0)
- goto out_putf;
- error = buf.error;
+ if (error >= 0)
+ error = buf.error;
lastdirent = buf.previous;
if (lastdirent) {
if (put_user(file->f_pos, &lastdirent->d_off))
diff --git a/fs/compat.c b/fs/compat.c
index 075d050..d2aa6a2 100644
--- a/fs/compat.c
+++ b/fs/compat.c
@@ -830,7 +830,7 @@ asmlinkage long compat_sys_old_readdir(unsigned int fd,
buf.dirent = dirent;

error = vfs_readdir(file, compat_fillonedir, &buf);
- if (error >= 0)
+ if (buf.result)
error = buf.result;

fput(file);
@@ -917,9 +917,8 @@ asmlinkage long compat_sys_getdents(unsigned int fd,
buf.error = 0;

error = vfs_readdir(file, compat_filldir, &buf);
- if (error < 0)
- goto out_putf;
- error = buf.error;
+ if (error >= 0)
+ error = buf.error;
lastdirent = buf.previous;
if (lastdirent) {
if (put_user(file->f_pos, &lastdirent->d_off))
@@ -927,8 +926,6 @@ asmlinkage long compat_sys_getdents(unsigned int fd,
else
error = count - buf.count;
}
-
-out_putf:
fput(file);
out:
return error;
@@ -1008,19 +1005,16 @@ asmlinkage long compat_sys_getdents64(unsigned int fd,
buf.error = 0;

error = vfs_readdir(file, compat_filldir64, &buf);
- if (error < 0)
- goto out_putf;
- error = buf.error;
+ if (error >= 0)
+ error = buf.error;
lastdirent = buf.previous;
if (lastdirent) {
typeof(lastdirent->d_off) d_off = file->f_pos;
- error = -EFAULT;
if (__put_user_unaligned(d_off, &lastdirent->d_off))
- goto out_putf;
- error = count - buf.count;
+ error = -EFAULT;
+ else
+ error = count - buf.count;
}
-
-out_putf:
fput(file);
out:
return error;
diff --git a/fs/exportfs/expfs.c b/fs/exportfs/expfs.c
index 9960bbf..890e018 100644
--- a/fs/exportfs/expfs.c
+++ b/fs/exportfs/expfs.c
@@ -280,13 +280,14 @@ static int get_name(struct vfsmount *mnt, struct dentry *dentry,
int old_seq = buffer.sequence;

error = vfs_readdir(file, filldir_one, &buffer);
+ if (buffer.found) {
+ error = 0;
+ break;
+ }

if (error < 0)
break;

- error = 0;
- if (buffer.found)
- break;
error = -ENOENT;
if (old_seq == buffer.sequence)
break;
diff --git a/fs/nfsd/vfs.c b/fs/nfsd/vfs.c
index 8291591..77ad3a5 100644
--- a/fs/nfsd/vfs.c
+++ b/fs/nfsd/vfs.c
@@ -1832,6 +1832,7 @@ struct buffered_dirent {
struct readdir_data {
char *dirent;
size_t used;
+ int full;
};

static int nfsd_buffered_filldir(void *__buf, const char *name, int namlen,
@@ -1842,8 +1843,10 @@ static int nfsd_buffered_filldir(void *__buf, const char *name, int namlen,
unsigned int reclen;

reclen = ALIGN(sizeof(struct buffered_dirent) + namlen, sizeof(u64));
- if (buf->used + reclen > PAGE_SIZE)
+ if (buf->used + reclen > PAGE_SIZE) {
+ buf->full = 1;
return -EINVAL;
+ }

de->namlen = namlen;
de->offset = offset;
@@ -1875,9 +1878,13 @@ static int nfsd_buffered_readdir(struct file *file, filldir_t func,
unsigned int reclen;

buf.used = 0;
+ buf.full = 0;

host_err = vfs_readdir(file, nfsd_buffered_filldir, &buf);
- if (host_err)
+ if (buf.full)
+ host_err = 0;
+
+ if (host_err < 0)
break;

size = buf.used;
diff --git a/fs/readdir.c b/fs/readdir.c
index 93a7559..b318d9b 100644
--- a/fs/readdir.c
+++ b/fs/readdir.c
@@ -117,7 +117,7 @@ asmlinkage long old_readdir(unsigned int fd, struct old_linux_dirent __user * di
buf.dirent = dirent;

error = vfs_readdir(file, fillonedir, &buf);
- if (error >= 0)
+ if (buf.result)
error = buf.result;

fput(file);
@@ -209,9 +209,8 @@ asmlinkage long sys_getdents(unsigned int fd, struct linux_dirent __user * diren
buf.error = 0;

error = vfs_readdir(file, filldir, &buf);
- if (error < 0)
- goto out_putf;
- error = buf.error;
+ if (error >= 0)
+ error = buf.error;
lastdirent = buf.previous;
if (lastdirent) {
if (put_user(file->f_pos, &lastdirent->d_off))
@@ -219,8 +218,6 @@ asmlinkage long sys_getdents(unsigned int fd, struct linux_dirent __user * diren
else
error = count - buf.count;
}
-
-out_putf:
fput(file);
out:
return error;
@@ -293,19 +290,16 @@ asmlinkage long sys_getdents64(unsigned int fd, struct linux_dirent64 __user * d
buf.error = 0;

error = vfs_readdir(file, filldir64, &buf);
- if (error < 0)
- goto out_putf;
- error = buf.error;
+ if (error >= 0)
+ error = buf.error;
lastdirent = buf.previous;
if (lastdirent) {
typeof(lastdirent->d_off) d_off = file->f_pos;
- error = -EFAULT;
if (__put_user(d_off, &lastdirent->d_off))
- goto out_putf;
- error = count - buf.count;
+ error = -EFAULT;
+ else
+ error = count - buf.count;
}
-
-out_putf:
fput(file);
out:
return error;

2008-08-25 01:44:50

by Al Viro

[permalink] [raw]
Subject: Re: [RFC] readdir mess

PS: the part below was on top of patch from dwmw2; in the mainline the analog
of that sucker is in XFS.

> diff --git a/fs/nfsd/vfs.c b/fs/nfsd/vfs.c
> index 8291591..77ad3a5 100644
> --- a/fs/nfsd/vfs.c
> +++ b/fs/nfsd/vfs.c
> @@ -1832,6 +1832,7 @@ struct buffered_dirent {
> struct readdir_data {
> char *dirent;
> size_t used;
> + int full;
> };
>
> static int nfsd_buffered_filldir(void *__buf, const char *name, int namlen,
> @@ -1842,8 +1843,10 @@ static int nfsd_buffered_filldir(void *__buf, const char *name, int namlen,
> unsigned int reclen;
>
> reclen = ALIGN(sizeof(struct buffered_dirent) + namlen, sizeof(u64));
> - if (buf->used + reclen > PAGE_SIZE)
> + if (buf->used + reclen > PAGE_SIZE) {
> + buf->full = 1;
> return -EINVAL;
> + }
>
> de->namlen = namlen;
> de->offset = offset;
> @@ -1875,9 +1878,13 @@ static int nfsd_buffered_readdir(struct file *file, filldir_t func,
> unsigned int reclen;
>
> buf.used = 0;
> + buf.full = 0;
>
> host_err = vfs_readdir(file, nfsd_buffered_filldir, &buf);
> - if (host_err)
> + if (buf.full)
> + host_err = 0;
> +
> + if (host_err < 0)
> break;
>
> size = buf.used;

2008-08-24 17:21:20

by Linus Torvalds

[permalink] [raw]
Subject: Re: [RFC] readdir mess



On Sun, 24 Aug 2008, Al Viro wrote:
>
> One obvious note: that'll break old_readdir() on coda. There you need to
> change the existing check (you need to check buf.result, then ignore error
> unless buf.result ended up 0).

Hmm? old_readdir() was the only one that I didn't change, because it
didn't need changing. It already ignores the return value of
"vfs_readdir()" entirely if it is positive or zero, and takes it from
buf.result.

So old_readdir() literally doesn't care at all (and never has) whether a
->readdir() function returns zero or a positive number. So changing coda
readdir() it to return zero _instead_ of a positive number makes
absolutely zero difference: old_readdir() will do the same thing
regardless.

What am I missing?

Linus

2008-08-24 11:03:40

by Al Viro

[permalink] [raw]
Subject: Re: [RFC] readdir mess

On Sun, Aug 24, 2008 at 11:10:14AM +0100, Al Viro wrote:

> I agree that such transition plan makes sense, but that'll take more
> preliminary work than in your patch; there are other vfs_readdir() and
> ->readdir() callers, not just the obvious syscall ones.

BTW, nfsd4_list_rec_dir() is FUBAR. Its users, actually - they try to use it
for lovely things like kernel-side rm -rf /var/lib/nfs/v4recovery/* and
screw up in rather amusing ways.. I'm not even talking about the effects
of OOM (dentry leak); if you rename something away from that directory,
you'll get vfs_rmdir(dir, dentry) with dentry not being a child of dir,
which means deadlock if you are lucky and underlying fs corruption if you
are not...

I really wonder WTF is that doing in the kernel, anyway. Looks like an
obvious candidate for userland helper...

2008-08-24 10:10:32

by Al Viro

[permalink] [raw]
Subject: Re: [RFC] readdir mess

On Fri, Aug 15, 2008 at 09:58:31AM -0700, Linus Torvalds wrote:

> because the "filldir()" functions should all do the right thing anyway.
> But there's certainly nothing wrong with doing it either.

Not all, but those won't hit coda.

> However, I think the real fix is something like this. This
>
> - fixes all the callers
>
> - removes more lines than it adds
>
> - simplifies and clarifies the code
>
> - avoids pointless goto's
>
> - makes error handling of vfs_readdir() consistent among the callers
> (some callers already did the error handling _correctly_ before this
> patch - this makes everybody do it the same way)

One obvious note: that'll break old_readdir() on coda. There you need to
change the existing check (you need to check buf.result, then ignore error
unless buf.result ended up 0).

I agree that such transition plan makes sense, but that'll take more
preliminary work than in your patch; there are other vfs_readdir() and
->readdir() callers, not just the obvious syscall ones.

BTW, there are several places that call specific foo_readdir() or its helper
functions, passing odd stuff as filldir (afs implements ->lookup() that way,
for one; ocfs2 checks that directory is empty; gfs2 does ->get_name() -
with filldir returning 1 on match, at that; etc.). We obviously do not care
about those in the beginning of patch series - they won't be affected,
but once we start converting foofs_readdir() to returning what filldir had
returned, we'll need to watch out for complications from those (BTW, another
fun place in that respect is __fat_readdir())

2008-08-24 19:59:22

by Al Viro

[permalink] [raw]
Subject: Re: [RFC] readdir mess

On Sun, Aug 24, 2008 at 10:20:52AM -0700, Linus Torvalds wrote:
>
>
> On Sun, 24 Aug 2008, Al Viro wrote:
> >
> > One obvious note: that'll break old_readdir() on coda. There you need to
> > change the existing check (you need to check buf.result, then ignore error
> > unless buf.result ended up 0).
>
> Hmm? old_readdir() was the only one that I didn't change, because it
> didn't need changing. It already ignores the return value of
> "vfs_readdir()" entirely if it is positive or zero, and takes it from
> buf.result.
>
> So old_readdir() literally doesn't care at all (and never has) whether a
> ->readdir() function returns zero or a positive number. So changing coda
> readdir() it to return zero _instead_ of a positive number makes
> absolutely zero difference: old_readdir() will do the same thing
> regardless.
>
> What am I missing?

The fact that coda_readdir() will _not_ be returning 0 with your change
when called with the arguments old_readdir() gives it? You'll get ret
from filldir, i.e. what you'll normally see will be -EINVAL in case of
fillonedir as callback.

The normal sequence for old_readdir() is
* call fillonedir on the current entry
* have it bump ->result from 0 to 1 and return 0
* advance f_pos to the next entry
* call fillonedir for it
* have it see ->result != 0 and immediately bail out with -EINVAL
* seeing a negative from the callback, foo_readdir does *not* advance
f_pos this time and returns 0 (or at least something non-negative)
* old_readdir() sees non-negative from vfs_readdir() and returns
buf->result (i.e. 1)

Now you've got vfs_readdir() returning -EINVAL in that scenario. See why
old_readdir() needs an update too? It doesn't have the "OK, we'd already
called its filldir, so bugger whatever had happened afterwards" logics -
and it'll need it now.

2008-08-25 16:16:28

by J. Bruce Fields

[permalink] [raw]
Subject: Re: [RFC] readdir mess

On Sun, Aug 24, 2008 at 12:03:26PM +0100, Al Viro wrote:
> On Sun, Aug 24, 2008 at 11:10:14AM +0100, Al Viro wrote:
>
> > I agree that such transition plan makes sense, but that'll take more
> > preliminary work than in your patch; there are other vfs_readdir() and
> > ->readdir() callers, not just the obvious syscall ones.
>
> BTW, nfsd4_list_rec_dir() is FUBAR. Its users, actually - they try to use it
> for lovely things like kernel-side rm -rf /var/lib/nfs/v4recovery/* and
> screw up in rather amusing ways.. I'm not even talking about the effects
> of OOM (dentry leak); if you rename something away from that directory,
> you'll get vfs_rmdir(dir, dentry) with dentry not being a child of dir,
> which means deadlock if you are lucky and underlying fs corruption if you
> are not...
>
> I really wonder WTF is that doing in the kernel, anyway. Looks like an
> obvious candidate for userland helper...

Yes. Christoph complained about this before (err, 3 years ago!), though
without (as far as I can tell) catching those particular bugs:

http://marc.info/?l=linux-fsdevel&m=112703894118581&w=2

We eventually agreed that it was a problem and worked on a userspace
replacement, but it never got to the point where I was happy enough with
it to commit to the new user interface, and the effort died. I'll take
a look at what we last had.

--b.