2011-03-03 03:24:58

by Al Viro

[permalink] [raw]
Subject: [RFC] st_nlink after rmdir() and rename()

We have an interesting problem. Consider the following sequence
of syscalls:
mkdir("foo", 0777);
mkdir("bar", 0777);
fd1 = open("foo", O_DIRECTORY);
fd2 = open("bar", O_DIRECTORY);
rename("foo", "bar"); /* kill old bar */
rmdir("bar"); /* kill old foo */
fstat(fd1, &buf1);
fstat(fd2, &buf2);
What should be in buf1.st_nlink and buf2.st_nlink, if none of these
syscalls fail? Note that in both cases any lookups in victim directory
will fail and so will readdir; as far as VFS is concerned, the effect of
such rmdir() and rename() on their victims are identical. In particular,
both . and .. are gone, as explicitly required by POSIX in case of rmdir().

Surprisingly, the results are *NOT* identical wrt fstat(); for most of
the filesystems we will get 0 in both cases (as expected), but some will
leave 1 in buf2.st_nlink. What we have is

0 0: ext*, xfs, jfs, reiserfs, ocfs2, gfs2, nilfs, exofs, udf, ubifs,
minix, sysv, ufs, msdos, vfat, hfs+
0 1: ramfs, shmem, hugetlbfs, jffs2, omfs, hfs[*], apparently nfs as well
hell knows: ncpfs, fuse, ecryptfs, coda, cifs, ceph, btrfs, affs
1 1: (unless I'm misreading it) logfs
completely FUBAR wrt fstat(): hostfs[**]
-EEXIST on rename(): cgroup
-EINVAL on rename(): hpfs
server ought to fail such rename(): 9p
apparently fails rename(): smbfs
completely broken rename(): pohmelfs[***]

[*] yes, different from hfs+; the code is clearly broken, since it simply
does unlink() on target, without even verifying that it's empty. And
yes, it's trivial fs corruption...
[**] even open() + unlink() + fstat() will report original st_nlink, etc.
[***] new_dir is target's _parent_ directory and it's not required to be
empty; it's never going to be NULL, while we are at it. Code makes no
sense...

The variant with st_nlink getting to 0 in both cases is definitely the most
common and at least for local filesystems I think it should be mandatory.
I.e. ramfs and friends, jffs2, omfs and hfs should all switch to it.

Comments would be welcome; I really don't know the protocols of most of
the network filesystems well enough to tell what'll happen in these
situations.
Al, digging through i_nlink code audit...


2011-03-03 04:42:46

by Al Viro

[permalink] [raw]
Subject: Re: [RFC] st_nlink after rmdir() and rename()

On Thu, Mar 03, 2011 at 03:24:54AM +0000, Al Viro wrote:
> 0 0: ext*, xfs, jfs, reiserfs, ocfs2, gfs2, nilfs, exofs, udf, ubifs,
> minix, sysv, ufs, msdos, vfat, hfs+
> 0 1: ramfs, shmem, hugetlbfs, jffs2, omfs, hfs[*], apparently nfs as well

> [*] yes, different from hfs+; the code is clearly broken, since it simply
> does unlink() on target, without even verifying that it's empty. And
> yes, it's trivial fs corruption...

Actually, hfs turns out to be 0 0 as well; it *is* broken (lacks check
for target being empty), but when the target is empty it's doing the
right thing. jffs2 and ramfs-based ones do 0 1 - confirmed by direct
experiment.

2011-03-03 05:17:37

by Linus Torvalds

[permalink] [raw]
Subject: Re: [RFC] st_nlink after rmdir() and rename()

On Wed, Mar 2, 2011 at 7:24 PM, Al Viro <[email protected]> wrote:
>
> Surprisingly, the results are *NOT* identical wrt fstat(); for most of
> the filesystems we will get 0 in both cases (as expected), but some will
> leave 1 in buf2.st_nlink.

Why do we care? Some filesystems don't support i_nlink at all, so it's
always 1. Others won't do the real delete until later (nfs
sillyrename), so returning 0 would be wrong and insane.

So the fact is, expecting 0,0 seems to be an incorrect expectation,
and I don't understand why you would really care. Does it matter?

Linus

2011-03-03 06:03:55

by Al Viro

[permalink] [raw]
Subject: Re: [RFC] st_nlink after rmdir() and rename()

On Wed, Mar 02, 2011 at 09:17:15PM -0800, Linus Torvalds wrote:
> On Wed, Mar 2, 2011 at 7:24 PM, Al Viro <[email protected]> wrote:
> >
> > Surprisingly, the results are *NOT* identical wrt fstat(); for most of
> > the filesystems we will get 0 in both cases (as expected), but some will
> > leave 1 in buf2.st_nlink.
>
> Why do we care? Some filesystems don't support i_nlink at all, so it's
> always 1. Others won't do the real delete until later (nfs
> sillyrename), so returning 0 would be wrong and insane.
>
> So the fact is, expecting 0,0 seems to be an incorrect expectation,
> and I don't understand why you would really care. Does it matter?

Umm... For one thing, "is that sucker deleted" is a reasonable question
to ask. For another, consistency is a good thing to have - rmdir() and
overwriting rename() being different is unexpected.

Note that it also means that e.g. on jffs2 you do get fsnotify event for
inode removal on rmdir() but not for overwriting rename() over a directory.
And while I have very little sympathy for *notify-style APIs, why the
hell break it if we don't have to? Sure, anyone trying to use it on FUSE is
going to get what they deserve - there are tons of reasons why it's going
to be completely useless there. But when we are talking about a local
filesystem, changed only by normal explicit syscalls...

IOW, it's a QoI question - sure, NFS is weird and you lose a lot of usual
warranties there when it comes to object removal. But why get local
fs behaving strangely? It's not like "decrement i_nlink from 2 to 1" was
cheaper than "assign 0 to i_nlink", after all.

FWIW, a lot of local filesystems have no notion of links; they tend to
maintain zero/non-zero distinction just fine. Including those that have
all directories with i_nlink == 1 for as long as directory lives.

Anyway, I'm going to send obvious i_nlink fixes (races a-la ext2 one, plus
reiserfs blowing up if it fails in the right place in mkdir, plus hfs fs
corruptor on rename() over non-empty, plus bogosity in udf max links
calculation) in a few, just need to sanitize commit messages and drop
Josh's ext2 patch you've already merged. About 3/4th through the
audit, there's going to be more fun at least with omfs and btrfs... ;-/

2011-03-03 14:43:41

by Theodore Ts'o

[permalink] [raw]
Subject: Re: [RFC] st_nlink after rmdir() and rename()


On Mar 2, 2011, at 10:24 PM, Al Viro wrote:

> We have an interesting problem. Consider the following sequence
> of syscalls:
> mkdir("foo", 0777);
> mkdir("bar", 0777);
> fd1 = open("foo", O_DIRECTORY);
> fd2 = open("bar", O_DIRECTORY);
> rename("foo", "bar"); /* kill old bar */

I must be missing something. I didn't think you could rename on
top of a directory and have the directory disappear. Don't you get
an error in that case? What happens if bar contains files?

We don't allow: mkdir("bar", 0777); unlink("bar");

Why should this be any different?

-- Ted

2011-03-03 16:18:13

by Andreas Schwab

[permalink] [raw]
Subject: Re: [RFC] st_nlink after rmdir() and rename()

Theodore Tso <[email protected]> writes:

> On Mar 2, 2011, at 10:24 PM, Al Viro wrote:
>
>> We have an interesting problem. Consider the following sequence
>> of syscalls:
>> mkdir("foo", 0777);
>> mkdir("bar", 0777);
>> fd1 = open("foo", O_DIRECTORY);
>> fd2 = open("bar", O_DIRECTORY);
>> rename("foo", "bar"); /* kill old bar */
>
> I must be missing something. I didn't think you could rename on
> top of a directory and have the directory disappear. Don't you get
> an error in that case?

rename is required to be able to move a directory over an empty
directory, atomically.

> What happens if bar contains files?

That's an error.

Andreas.

--
Andreas Schwab, [email protected]
GPG Key fingerprint = D4E8 DBE3 3813 BB5D FA84 5EC7 45C6 250E 6F00 984E
"And now for something completely different."

2011-03-03 19:16:32

by Al Viro

[permalink] [raw]
Subject: Re: [RFC] st_nlink after rmdir() and rename()

On Thu, Mar 03, 2011 at 09:34:08AM -0500, Theodore Tso wrote:
>
> On Mar 2, 2011, at 10:24 PM, Al Viro wrote:
>
> > We have an interesting problem. Consider the following sequence
> > of syscalls:
> > mkdir("foo", 0777);
> > mkdir("bar", 0777);
> > fd1 = open("foo", O_DIRECTORY);
> > fd2 = open("bar", O_DIRECTORY);
> > rename("foo", "bar"); /* kill old bar */
>
> I must be missing something. I didn't think you could rename on
> top of a directory and have the directory disappear. Don't you get
> an error in that case? What happens if bar contains files?

ENOTEMPTY. Checked by ->rename() and yes, ext4 does that.

> We don't allow: mkdir("bar", 0777); unlink("bar");
>
> Why should this be any different?

Because it worked since 4.2BSD and got into POSIX. Replacing rename()
works; the only restrictions are
* you can't replace directory with non-directory
* you can't replace non-directory with directory
* directory being replaced shall be empty
* you can't replace a mountpoint or filesystem root
If the target is not busy, it is required to work. Whether it returns
-EBUSY for busy target is implementation-dependent and we allow that
for most of the filesystems.

2011-03-03 20:06:38

by Linus Torvalds

[permalink] [raw]
Subject: Re: [RFC] st_nlink after rmdir() and rename()

On Wed, Mar 2, 2011 at 10:03 PM, Al Viro <[email protected]> wrote:
>
> IOW, it's a QoI question - sure, NFS is weird and you lose a lot of usual
> warranties there when it comes to object removal. ?But why get local
> fs behaving strangely? ?It's not like "decrement i_nlink from 2 to 1" was
> cheaper than "assign 0 to i_nlink", after all.
>
> FWIW, a lot of local filesystems have no notion of links; they tend to
> maintain zero/non-zero distinction just fine. ?Including those that have
> all directories with i_nlink == 1 for as long as directory lives.

The thing is, I don't think it's a QoI question at all - since any
user that _depends_ on this kind of behavior is simply broken. We
aren't going to guarantee it, exactly because some filesystems simply
will not ever guarantee it.

Now, for FAT we do in fact try rather hard to fake the i_nlink count,
but I'm not at all convinced that's a good idea. It makes reading
directory inodes on FAT much more expensive (we have to basically do a
readdir for each open). So I'd hate to make that whole "you need to
emulate i_nlink even if you really don't care" be something that we
actually end up thinking is a quality issue.

There are other filesystems where i_nlink can be even _less_
meaningful, ie if the filesystem does any kind of fancy
content-indexing thing or lazy COW (think "union filesystem within the
filesystem") or whatever, I could easily see i_nlink not having any
traditional unix filesystem semantics.

Seriously - how did you even notice this?

I'm not opposed to fix actual bugs, but I _do_ think it is
questionable to make this kind of nonsense semantic issue be an issue.

Linus

2011-03-03 20:46:37

by OGAWA Hirofumi

[permalink] [raw]
Subject: Re: [RFC] st_nlink after rmdir() and rename()

Linus Torvalds <[email protected]> writes:

> On Wed, Mar 2, 2011 at 10:03 PM, Al Viro <[email protected]> wrote:
>>
>> IOW, it's a QoI question - sure, NFS is weird and you lose a lot of usual
>> warranties there when it comes to object removal. ?But why get local
>> fs behaving strangely? ?It's not like "decrement i_nlink from 2 to 1" was
>> cheaper than "assign 0 to i_nlink", after all.
>>
>> FWIW, a lot of local filesystems have no notion of links; they tend to
>> maintain zero/non-zero distinction just fine. ?Including those that have
>> all directories with i_nlink == 1 for as long as directory lives.
>
> The thing is, I don't think it's a QoI question at all - since any
> user that _depends_ on this kind of behavior is simply broken. We
> aren't going to guarantee it, exactly because some filesystems simply
> will not ever guarantee it.
>
> Now, for FAT we do in fact try rather hard to fake the i_nlink count,
> but I'm not at all convinced that's a good idea. It makes reading
> directory inodes on FAT much more expensive (we have to basically do a
> readdir for each open). So I'd hate to make that whole "you need to
> emulate i_nlink even if you really don't care" be something that we
> actually end up thinking is a quality issue.

Yes. It's one of reason linux's FAT is slow for some operations (from
original designs).

> There are other filesystems where i_nlink can be even _less_
> meaningful, ie if the filesystem does any kind of fancy
> content-indexing thing or lazy COW (think "union filesystem within the
> filesystem") or whatever, I could easily see i_nlink not having any
> traditional unix filesystem semantics.

But, some commands see i_nlink (IIRC, it's checking i_nlink == 2, to
know empty dir or not). So we have to simulate some levels. I guess you
are not saying we don't need to care it at all though.

I think if it's _really easy_ to do, I think we should try.

Thanks.
--
OGAWA Hirofumi <[email protected]>

2011-03-03 20:50:26

by OGAWA Hirofumi

[permalink] [raw]
Subject: Re: [RFC] st_nlink after rmdir() and rename()

OGAWA Hirofumi <[email protected]> writes:

> (IIRC, it's checking i_nlink == 2, to know empty dir or not).

Ah, the check would be it doesn't have sub dirs actually.
--
OGAWA Hirofumi <[email protected]>

2011-03-03 21:03:12

by Linus Torvalds

[permalink] [raw]
Subject: Re: [RFC] st_nlink after rmdir() and rename()

On Thu, Mar 3, 2011 at 12:46 PM, OGAWA Hirofumi
<[email protected]> wrote:
>
> But, some commands see i_nlink (IIRC, it's checking i_nlink == 2, to
> know empty dir or not).

Actually, that would be a serious bug in the application.

The traditional rule of thumb is that a directory with i_nlink==1 has
a "I'm not counting at all".

For example, I think that's the rule that 'find' uses to decide if a
directory can have subdirectories (and when it could try to stop
scanning early): i_nlink == 1 means that yes, it _can_ have
subdirectories, we just don't know how many.

So checking i_nlink==2 is actually a user-level bug.

> So we have to simulate some levels. I guess you
> are not saying we don't need to care it at all though.

I'm saying that it should just work to set i_nlink=1 and not do
anything at all. Ever. It's a valid model for directory counts.

Seriously - that's what isofs does, for example. It does mean that
'find' for certain cases gets bit more expensive, but on the other
hand, other operations are a lot _less_ expensive.

We might well try that as a FAT mount option, to let people decide
whether they really do want the "scan directories all the time" or
only the "let 'find' scan directories when it needs to" behavior.

Linus

2011-03-03 21:23:57

by Al Viro

[permalink] [raw]
Subject: Re: [RFC] st_nlink after rmdir() and rename()

On Thu, Mar 03, 2011 at 12:05:43PM -0800, Linus Torvalds wrote:

> The thing is, I don't think it's a QoI question at all - since any
> user that _depends_ on this kind of behavior is simply broken. We
> aren't going to guarantee it, exactly because some filesystems simply
> will not ever guarantee it.

Umm... Not really. "If you decide to use idiotify, at least retain some
measure of sanity and keep it to local filesystems" is more or less
feasible. "Just say no to CONFIG_INOTIFY" is getting harder and harder -
recent udev won't run without that and recent X depends on udev.

And inotify *does* expose that to userland. Set inotify watch on a
directory. Ask to be notified of IN_DELETE_SELF. Then do overwriting
rename() (note that it doesn't have to be busy - just sitting there
being watched). You will get event on most of the local filesystems,
same as you would from rmdir(). However, do that on jffs2 and event
will be generated only on rmdir(). Directory itself will be killed
by rename() just as thoroughly as by rmdir(); it's not something like
silllyrename on NFS where we really have different behaviour.

> Now, for FAT we do in fact try rather hard to fake the i_nlink count,
> but I'm not at all convinced that's a good idea. It makes reading
> directory inodes on FAT much more expensive (we have to basically do a
> readdir for each open). So I'd hate to make that whole "you need to
> emulate i_nlink even if you really don't care" be something that we
> actually end up thinking is a quality issue.

That is completely separate story; keeping st_nlink for live directories
equal to 2 + number of subdirectories is, IMO, fairly silly on such
filesystems. The only reason for doing that was to allow find(1) some
optimizations, IIRC.

But we are talking about a very different thing - not "I can tell if it's
a leaf directory by looking at st_nlink", but "link count 0 means that it's
only kept alive by being busy". And it's trivial to maintain.

Look, in rename() we *must* check that victim is empty anyway. IOW, any
instance on a local fs will have the place where it has decided that we
are killing a directory. Ditto for successful rmdir(). We don't need
to count subdirectories or anything like that - it's a matter of saying
victim->i_nlink = 0 instead of victim->i_nlink-- in a specific place in
foo_rename().

> There are other filesystems where i_nlink can be even _less_
> meaningful, ie if the filesystem does any kind of fancy
> content-indexing thing or lazy COW (think "union filesystem within the
> filesystem") or whatever, I could easily see i_nlink not having any
> traditional unix filesystem semantics.
>
> Seriously - how did you even notice this?

Code review. Went crawling through i_nlink users in the tree after the
ext2_rename() fun caught by Josh, found a bunch of obvious bugs and
this oddity.

> I'm not opposed to fix actual bugs, but I _do_ think it is
> questionable to make this kind of nonsense semantic issue be an issue.

See above re exposure to userland via inotify. I don't think that it's
an earth-shattering problem, obviously, but since fixing it really doesn't
cost anything on few local fs that have it...

2011-03-03 21:30:21

by Al Viro

[permalink] [raw]
Subject: Re: [RFC] st_nlink after rmdir() and rename()

On Thu, Mar 03, 2011 at 01:02:50PM -0800, Linus Torvalds wrote:

> > So we have to simulate some levels. I guess you
> > are not saying we don't need to care it at all though.
>
> I'm saying that it should just work to set i_nlink=1 and not do
> anything at all. Ever. It's a valid model for directory counts.

Sure, no problem. Just leave that cleaning of i_nlink on victim
in unlink/rmdir/rename; we *really* rely on that in e.g. deciding
when to free the damn inode in fat_evict_inode().

We need to mark them for freeing _anyway_, right? It doesn't depend
on what exact value do we keep for live directories - "everyone got 1
for as long as they live" is just fine, and IMO it's a win, but it's
an unrelated question.

2011-03-03 21:37:21

by OGAWA Hirofumi

[permalink] [raw]
Subject: Re: [RFC] st_nlink after rmdir() and rename()

Linus Torvalds <[email protected]> writes:

> On Thu, Mar 3, 2011 at 12:46 PM, OGAWA Hirofumi
> <[email protected]> wrote:
>>
>> But, some commands see i_nlink (IIRC, it's checking i_nlink == 2, to
>> know empty dir or not).
>
> Actually, that would be a serious bug in the application.
>
> The traditional rule of thumb is that a directory with i_nlink==1 has
> a "I'm not counting at all".
>
> For example, I think that's the rule that 'find' uses to decide if a
> directory can have subdirectories (and when it could try to stop
> scanning early): i_nlink == 1 means that yes, it _can_ have
> subdirectories, we just don't know how many.

Yes. I think it is.

> So checking i_nlink==2 is actually a user-level bug.

We can call it's the user-level bug, but like you know, we will receive
many complain if old behavior was broken. If you are saying i_nlink == 2
is meaning the undefine, it sounds strange. (I was thinking it's why
isofs is using i_nlink == 1 always. I.e. i_nlink >= 2 is defined
behavior. Actually the reason would be "find" was checking i_nlink == 2
though)

>> So we have to simulate some levels. I guess you
>> are not saying we don't need to care it at all though.
>
> I'm saying that it should just work to set i_nlink=1 and not do
> anything at all. Ever. It's a valid model for directory counts.
>
> Seriously - that's what isofs does, for example. It does mean that
> 'find' for certain cases gets bit more expensive, but on the other
> hand, other operations are a lot _less_ expensive.
>
> We might well try that as a FAT mount option, to let people decide
> whether they really do want the "scan directories all the time" or
> only the "let 'find' scan directories when it needs to" behavior.

Yes. i_nlink == 1 is ok, IIRC (I checked last time at least. And I used
i_nlink == 1 for exFAT). And I agree with you almost all, but isofs is
read-only, the read-only is why I was not really sure.

And I can't only see is why you refuse to make consistent behavior (if
you are saying it). It's why I said if it's _really easy_.

Thanks
--
OGAWA Hirofumi <[email protected]>

2011-03-03 21:53:11

by Linus Torvalds

[permalink] [raw]
Subject: Re: [RFC] st_nlink after rmdir() and rename()

On Thu, Mar 3, 2011 at 1:37 PM, OGAWA Hirofumi
<[email protected]> wrote:
>
> And I can't only see is why you refuse to make consistent behavior (if
> you are saying it). It's why I said if it's _really easy_.

The thing is, it really isn't really easy. As mentioned, it's actually
impossible on NFS, and it's possibly impossible on other filesystems
too.

So what I'm objecting to is "try to make something consistent that
CANNOT be consistent anyway", and calling it a bug.

I'm not saying there aren't real bugs there too (the actual races in
i_nlink handling are real bugs). But I _am_ saying that it's simply
not true that i_nlink must be zero if you do an "fstat()" after doing
an rmdir on an fd that you held open. Nobody can reasonably care, and
anybody who _does_ care is better off getting a nasty surprise early
rather than late.

Linus

2011-03-03 22:26:21

by OGAWA Hirofumi

[permalink] [raw]
Subject: Re: [RFC] st_nlink after rmdir() and rename()

Linus Torvalds <[email protected]> writes:

> On Thu, Mar 3, 2011 at 1:37 PM, OGAWA Hirofumi
> <[email protected]> wrote:
>>
>> And I can't only see is why you refuse to make consistent behavior (if
>> you are saying it). It's why I said if it's _really easy_.
>
> The thing is, it really isn't really easy. As mentioned, it's actually
> impossible on NFS, and it's possibly impossible on other filesystems
> too.

I don't know much about NFS though, I imaged the NFS just fill the
stat.st_nlink to return to userland by 0 if sillyrenamed dentry? (of
course, I'm not saying let's emulate "i_nlink >= 1" on all
filesystems. just about i_nlink == 0) I was thinking Al is working for
it...

And yeah, if it doesn't have consistency on system, I would agree it
would not be meaningful.

Thanks.
--
OGAWA Hirofumi <[email protected]>

2011-03-03 22:37:59

by Linus Torvalds

[permalink] [raw]
Subject: Re: [RFC] st_nlink after rmdir() and rename()

On Thu, Mar 3, 2011 at 2:26 PM, OGAWA Hirofumi
<[email protected]> wrote:
>
> I don't know much about NFS though, I imaged the NFS just fill the
> stat.st_nlink to return to userland by 0 if sillyrenamed dentry? (of
> course, I'm not saying let's emulate "i_nlink >= 1" on all
> filesystems. just about i_nlink == 0) I was thinking Al is working for
> it...

So even if we did that, WHAT WOULD BE THE UPSIDE?

Code that cares wouldn't run on any other Unix, or on any older
version of Linux.

And I claim that there is not a single reason to do it anyway. That
whole "code that cares" is totally theoretical. Such code simply
doesn't exist.

If you just opened a directory, and then did a "rmdir()" on that
directory, then you're just a f*cking moron if you go around saying
"ok, let me now do an fstat() on that fd to see if it really got
deleted or not". That's just _stupid_.

Really. There is absolutely no point in introducing a new rule that
nobody cares about, that we haven't followed ourselves historically,
and that would require us to play insane hacky games.

WHY DO IT? WHY CARE? WHY, WHY, WHY?

Linus

2011-03-03 22:57:06

by Al Viro

[permalink] [raw]
Subject: Re: [RFC] st_nlink after rmdir() and rename()

On Thu, Mar 03, 2011 at 01:52:18PM -0800, Linus Torvalds wrote:
> On Thu, Mar 3, 2011 at 1:37 PM, OGAWA Hirofumi
> <[email protected]> wrote:
> >
> > And I can't only see is why you refuse to make consistent behavior (if
> > you are saying it). It's why I said if it's _really easy_.
>
> The thing is, it really isn't really easy. As mentioned, it's actually
> impossible on NFS, and it's possibly impossible on other filesystems
> too.
>
> So what I'm objecting to is "try to make something consistent that
> CANNOT be consistent anyway", and calling it a bug.
>
> I'm not saying there aren't real bugs there too (the actual races in
> i_nlink handling are real bugs). But I _am_ saying that it's simply
> not true that i_nlink must be zero if you do an "fstat()" after doing
> an rmdir on an fd that you held open. Nobody can reasonably care, and
> anybody who _does_ care is better off getting a nasty surprise early
> rather than late.

Ho-hum... OK, let me put it that way:
* pile I've sent a pull request for is really bug-only; none of it
has anything to do with what's discussed in that thread, other than "it's
also about i_nlink and found during the same code review". i_nlink races,
pair of fs corruptors and a braino in UDF (256 << sizeof(inode->i_nlink)
as a way to spell "maximal allowed number of links"; never really worked,
even before we had switched to 32bit internal i_nlink - the real limit is
0xffff, not 0x3ff or 0xfff).
* it's trivial to get the same behaviour on all local filesystems;
most of them have it and rely on it to detect the inodes that need to be
freed on final iput(). It has nothing to do with counting subdirs or any
such nonsense.
* inotify is broken for filesystems that don't get you zero ->i_nlink
when the last dentry pointing to doomed inode is dropped. Regardless of what
you get in fstat(). Excusable for remote fs, but not nice for local ones.
I'd *LOVE* to get rid of inotife/dnotify/etc., but it's probably not feasible
now.
* NFS is not hard to handle, actually, especially for directories.
Regular files may be trickier, but then we have many places in that area
where NFS is not quite POSIX-compliant, to put it mildly.
* I honestly don't know what's the real situation with other
remote filesystems; thus the RFC. Hopefully, people familiar with that
are on fsdevel...

BTW, I suspect that another exception among the local filesystems (affs)
is actually leaking blocks on rmdir. Need to experiment to verify that,
but it smells like another genuine bug.

2011-03-03 23:07:27

by Al Viro

[permalink] [raw]
Subject: Re: [RFC] st_nlink after rmdir() and rename()

On Thu, Mar 03, 2011 at 10:57:02PM +0000, Al Viro wrote:
> * inotify is broken for filesystems that don't get you zero ->i_nlink
> when the last dentry pointing to doomed inode is dropped. Regardless of what
> you get in fstat(). Excusable for remote fs, but not nice for local ones.
> I'd *LOVE* to get rid of inotife/dnotify/etc., but it's probably not feasible
> now.
> * NFS is not hard to handle, actually, especially for directories.
> Regular files may be trickier, but then we have many places in that area
> where NFS is not quite POSIX-compliant, to put it mildly.

To clarify: I don't particulary _care_ if NFS breaks something like inotify,
as long as it can't be used to do nasty things to kernel itself. And I'm
not at all sure if I care about st_nlink there at all, directory or
non-directory. Again, NFS has enough weirdness wrt opened-but-unlinked
files anyway and will remain weird by design.

2011-03-03 23:12:37

by Al Viro

[permalink] [raw]
Subject: Re: [RFC] st_nlink after rmdir() and rename()

On Fri, Mar 04, 2011 at 07:26:14AM +0900, OGAWA Hirofumi wrote:
> Linus Torvalds <[email protected]> writes:
>
> > On Thu, Mar 3, 2011 at 1:37 PM, OGAWA Hirofumi
> > <[email protected]> wrote:
> >>
> >> And I can't only see is why you refuse to make consistent behavior (if
> >> you are saying it). It's why I said if it's _really easy_.
> >
> > The thing is, it really isn't really easy. As mentioned, it's actually
> > impossible on NFS, and it's possibly impossible on other filesystems
> > too.
>
> I don't know much about NFS though, I imaged the NFS just fill the
> stat.st_nlink to return to userland by 0 if sillyrenamed dentry? (of
> course, I'm not saying let's emulate "i_nlink >= 1" on all
> filesystems. just about i_nlink == 0) I was thinking Al is working for
> it...

No. I don't really care much about what NFS does with st_nlink. It might
be possible to fake for directories, but you'll have a hell of a time doing
that accurately for regular files. Not worth doing.

2011-03-03 23:15:03

by OGAWA Hirofumi

[permalink] [raw]
Subject: Re: [RFC] st_nlink after rmdir() and rename()

Linus Torvalds <[email protected]> writes:

> On Thu, Mar 3, 2011 at 2:26 PM, OGAWA Hirofumi
> <[email protected]> wrote:
>>
>> I don't know much about NFS though, I imaged the NFS just fill the
>> stat.st_nlink to return to userland by 0 if sillyrenamed dentry? (of
>> course, I'm not saying let's emulate "i_nlink >= 1" on all
>> filesystems. just about i_nlink == 0) I was thinking Al is working for
>> it...
>
> So even if we did that, WHAT WOULD BE THE UPSIDE?
>
> Code that cares wouldn't run on any other Unix, or on any older
> version of Linux.
>
> And I claim that there is not a single reason to do it anyway. That
> whole "code that cares" is totally theoretical. Such code simply
> doesn't exist.
>
> If you just opened a directory, and then did a "rmdir()" on that
> directory, then you're just a f*cking moron if you go around saying
> "ok, let me now do an fstat() on that fd to see if it really got
> deleted or not". That's just _stupid_.
>
> Really. There is absolutely no point in introducing a new rule that
> nobody cares about, that we haven't followed ourselves historically,
> and that would require us to play insane hacky games.
>
> WHY DO IT? WHY CARE? WHY, WHY, WHY?

The reason is simple, I'm not sure it is true or not though. Because the
undefined behavior makes the unfixable bug or hard to fix bug (just
because of backward compatibiliy reason). In my experience, I was
bothered many times with that, and why I want to make define state
(well, now I think we define as "we don't care it". that's fine). I hate
it and it is why from me.

Thanks.
--
OGAWA Hirofumi <[email protected]>

2011-03-04 06:55:48

by Al Viro

[permalink] [raw]
Subject: omfs fixes

On Thu, Mar 03, 2011 at 10:57:02PM +0000, Al Viro wrote:

> BTW, I suspect that another exception among the local filesystems (affs)
> is actually leaking blocks on rmdir. Need to experiment to verify that,
> but it smells like another genuine bug.

affs turned out to be OK; it uses rather convoluted scheme for link counter
(and uses i_nlink == 0 as trigger for freeing inode), and it actually
works fine. Directories: 1 as long as they are alive, 0 once they are
doomed. Everything else: 1 if there's exactly one link, 0 once it's doomed,
2 when there's more than 1 link. It works.

omfs, however, is _not_ OK in several respects. There's no i_nlink races,
since everything has one parent and rename() gets away with its playing with
link count of old_inode (and it's more complicated than in case of ext*,
BTW). However,
a) it gives all live directories i_nlink == 2. Should be 1, since it's
really "can't tell how many".
b) it uses i_nlink == 0 as indication of doomed inode. And on
rmdir() it drives i_nlink to 0, so that works. On overwriting directory
rename(), OTOH, it leaves i_nlink at 1 and leaks the sucker. Blocks
are not freed.
c) readdir() is really not well. Directories on omfs are hash
tables - array of pointers to chains, indexed by hash function of name.
If we run into an entry that is too long for what remains in the buffer,
we keep scanning the chain. And if something past it will turn out to
be short enough, we'll advance file->f_pos and keep going. Moreover,
if the _last_ one in chain happens to be short enough, we'll spill over
to the next chain, completely forgetting what had been missed in the
previous one.
d) rename playing odd games with link count of old inode is not
a bug; it's too convoluted for no good reason, but at least it's correct.
Failing to mark old_inode dirty after updating ->i_ctime in there, OTOH,
is racy.

I've pushed fixes for (b), (c) and (d) into #i_nlink and I think that (a)
needs to be dealt with as well, but that one is less nasty. Bob, do you
prefer that to go through your tree or should it go straight to Linus?
It's in git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6.git/ i_nlink,
Al Viro (4):
omfs: rename() needs to mark old_inode dirty after ctime update
omfs: stop playing silly buggers with omfs_unlink() in ->rename()
omfs: merge unlink() and rmdir(), close leak in rename()
omfs: make readdir stop when filldir says so
fs/omfs/dir.c | 66 +++++++++++++++++---------------------------------------
1 files changed, 20 insertions(+), 46 deletions(-)

2011-03-04 15:25:17

by Bob Copeland

[permalink] [raw]
Subject: Re: omfs fixes

On Fri, Mar 04, 2011 at 06:55:42AM +0000, Al Viro wrote:
> I've pushed fixes for (b), (c) and (d) into #i_nlink and I think that (a)
> needs to be dealt with as well, but that one is less nasty. Bob, do you
> prefer that to go through your tree or should it go straight to Linus?
> It's in git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6.git/ i_nlink,

Looks great, thanks for the review. I'll go ahead and take them through
my tree after a quick run through the test harness.

--
Bob Copeland %% http://www.bobcopeland.com