by C. Scott Ananian

[permalink] [raw]

Subject: Re: [RFC PATCH -v4 00/14] fsnotify, dnotify, and inotify

On Thu, Dec 25, 2008 at 8:44 PM, Al Viro <[email protected]> wrote:
>> As far as I can tell, none of the existing Linux desktop search tools
>> attempt to deal with these races. (Beagle handles the 'mkdir' race,
>> but not the other rename races.) This is acceptable only if an
>> unreliable file index is acceptable.
>
> ... and inotify is unreliable by design, or what passes for it anyway.

Really? I don't see that in the documentation anywhere (except as an
aside in a comment you added to inotify's pin-to-kill method). Nor do
I find it supported by code review. Sure, the queue is limited size,
but you are guaranteed an overflow event when messages are dropped;
that's all that's needed for reliability. Can you provide some
specific example where events are lost "by design"? There are
userland races in inotify, but they are all solvable, and I've
described above how to do so. The problem is that the workarounds
make practical programming with inotify very cumbersome, and the
result is that most implementors haven't bothered to make things
correct (ie, moving a directory breaks all search tools I've looked
at). I'm interested in determining whether there are minor tweaks
which can make correct userlands easier to write. That's why I've
been beating the "autoadd" drum in this thread: it's a very small
patch would would remove one category of race entirely (leaving only
the rename races for userland to deal with).

>> b) use inode numbers rather than path names uniformly, in both
>> inotify and the userland search index, along with an iopen() syscall,
>> as in Mach. This decouples path maintenance from indexing. This was
>> discussed in (for example)
>> http://www.coda.cs.cmu.edu/maillists/codalist/codalist-1998/0217.html
>> by Peter J. Braam and Ted Ts'o, but Al Viro has been objecting to the
>> idea here. (If all you need to do is open found files after a search,
>> you can skip path maintenance entirely.)
>
> For one thing, opened directory can be fchdir'ed to. Or passed to
> openat() et.al.

You seem to be confusing inotify watch descriptors with file
descriptors. Directories are not held open in inotify, as you should
know.

Adding an inotify_open_wd() syscall would indeed eliminate the last of
the 3 races I described above (a/b/c rename to a/d/c). It wouldn't do
anything about a/b -> a/c rename races.

> For another, go ahead, show me how to implement that
> sucker for something like NFS. Or CIFS. Or FAT, while we are at it.
> Or anything that does have stable numbers identifying fs objects, but where those numbers are huge.

Sure. Because not all filesystems support xattr (say) we should
refuse to add it to any? As you know well, The standard solution to
the inode problem (in NFS or FUSE, say) is to synthesize stable inode
numbers in the kernel (or in metadata in something like vfat), with
the understanding that these numbers will be different on every mount.
For the purposes of race-avoidance in inotify, transient inode
numbers are acceptable. And we can always fall back to "old racy
inotify" or manual filesystem scans for minority filesystems.

>> c) Pass file descriptors in the notification API from the kernel.
>> This solves the races associated with renames before indexing.
>> Userland still has to maintain its own copy of all the direntries for
>> indexed content, but at least this task is decoupled. (The proposed
>> fanotify API passes file descriptors, but provides no mechanism (yet)
>> for path maintenance.)

You didn't comment on this alternative, but as I've continued thinking
about the problem, this alternative has seemed best.

> e) start with providing a higher-level description of what you want to
> achieve. While we are at it, is it supposed to be fs-agnostic? What
> kinds of filesystems could in principle be used with that?

The high level problem should be obvious, but I'll restate it for you anyway:

* Maintain a auxiliary set of content-derived metadata for each file
in some subset (under a directory, mount-point, matching a wildcard,
etc). This metadata must be updated whenever the content changes, and
must include/maintain some stable reference to the file so that
searches on the metadata can be mapped back to their corresponding
files later.

Implementors vary on the definition of 'content' (do file permissions
count? xattrs? etc), the best "update" API (is notify-on-close and
reindexing entire files sufficient, or do we want to do incremental
updates on each dirty block/byte), and whether the notification is
synchronous or asynchronous (eg: Eric Paris wants to act on 'bad'
content, so he needs a synchronous interface to prevent races from
acting on bad content). In Linux search tools I've looked at, the
'stable reference' notion has usually been kept as a file:// URL (ie,
absolute path), but Mac OS X has file handles closer to inode numbers
(although I'm not positive that's what they are using).

Filesystem-agnostic tools are preferable, but "works best on
filesystem X" search tools are already common. The BeFS was an
example of non-agnostic search, and Beagle "works best" on filesystems
that support xattrs. The i_version field will also be useful for
search tools, for filesystems which support it (only ext4 so far?).
So give me filesystem-specific search if it is compelling, or agnostic
search where possible.

> So far you've mentioned use of blatantly inadequate tool and far too
> low-level description of changes that might possibly make it more
> tolerable for unspecified use you have in mind. I'm sorry, but that's
> exactly how a bunch of bad APIs (including inotify) got pushed into the
> tree and I would rather not repeat the experience.

So, you don't understand the problem, and you'd rather not have
inotify in tree because you think it is "blatantly inadequate" -- yet
you are using inotify in your audit subsystem? I think I am beginning
to get the impression that you're just not going to be much help.

> Interfaces must make sense. And "we need <list of things> for our project,
> nevermind what are they for, here's what we would like to see" is a recipe
> for kludges. Inotify, dnotify, F_SETLEASE, etc. Sigh...

I don't think this is a sound technical argument. We've got at least
two different APIs on the table (fanotify and inotify), a
well-understood problem description, *and* I've made at least 4
different proposals for how the current situation might be improved.
Please feel free to suggest concrete alternatives, but I don't plan to
continue this thread while you are just taking bogus potshots.
--scott

--
( http://cscott.net/ )

2008-12-29 18:19:59

by C. Scott Ananian

[permalink] [raw]

Subject: Re: [RFC PATCH -v4 00/14] fsnotify, dnotify, and inotify

On Mon, Dec 22, 2008 at 3:53 PM, Eric Paris <[email protected]> wrote:
> fanotify doesn't have directory notifications. You get notifications
> about the individual inodes. I don't have mv tracking, and I'm not sure
> how much trouble it'll be. Maybe not that bad since a mv doesn't have
> to deliver the fd to userspace. I promise to think about how to make
> this better for you. In any case, you can get pathnames, dev, and inode
> very quickly.

For consistency, directory events could appear as mutations
(write/close_was_writable) of the directory inodes, with an open file
descriptor to the directory.

It would be a bit annoying, but you wouldn't even have to pass the
'from' and 'to' names. You can then readdir the open file descriptor
to the directory find out what's changed for yourself. (If you do
pass 'from' and 'to' path information, then the open file descriptor
to the directory is unnecessary.)

Capturing a coherent directory listing on startup (or while processing
a directory mutation) is the other tricky bit; being able to
block-for-ack on a rename would make that much easier, returning
SEND_DELAY for example. The basic ideas seem coherent with what
you've got in fanotify. A write_need_access_decision on directories
would allow userspace locking during the scan, and you'd set a
fastpath (with SURVIVE_WRITE cleared) after the scan was complete.

The write_need_access_decision method would, as testiment to the
generality of the mechanism, allow userspace implementations of
copy-on-write and versioning file systems. A versioning filesystem
could check in the 'new version' of a file on close_was_writable;
write_need_access would just need to block until userland had
completed the checkin to prevent races between kernel and userland.
For a copy-on-write system based on hardlinks,
write_need_access_decision would break a hardlink if necessary, and
regardless install a fast path skipping future checks.

> In any case, event coallessing seems like it needs to work by walking
> the notification queue starting at the back and working forwards. If
> you find a duplicate just drop. If you find a mv, place this one at the
> end. Kinda sucks that we are taking and O(1) operation and making it
> O(n).
>
> At least with fanotify you don't have mv races, since you have an open
> fd which still gives you the access you need even if it mv'd.

On reflection event coalescing is not necessary so long as you're
passing open file descriptors around. The kernel mechanisms are
sufficient to ensure that the file descriptors are still valid even if
unlink/move operations occur after the mutation.

To briefly summarize my fanotify wishlist:
1) write/close_was_writable events on directory inodes triggered by
create/rename/unlink.
2) write_need_access_decision event applicable to both regular files
and directories.
--scott

--
( http://cscott.net/ )