2009-03-27 12:47:43

by Alexander Larsson

[permalink] [raw]
Subject: Issues with using fanotify for a filesystem indexer

I took a look at fanotify to see if it would be a better fit for a
filesystem indexer (like tracker or beagle), as inotify is pretty bad.
I think it is a better fit in general, but it needs some additions.

Lets first define the requirements. By "indexer" I mean a userspace
program that keeps a database that stores information about files on
some form of pathname basis. You can use this to do a query for
something and reverse-map back to the filename. As a minimally complex,
yet sufficient model we have locate/updatedb that lets you quickly
search for files by filename. However, instead of indexing everything
each night we want to continuosly index things "soon" after they change,
for some loose definition of soon (say within a few minutes in the
normal case).

Its not realistic to imagine the indexer handling each file change as
they happen, as modern machines can dirty a lot of files in a short
time which would immediately result in change event queues
overflowing. It as also not really what isis desired. Many kinds of
activities produce a lot of filesystem activity with creation of
temporary files and changing of files multiple times over some time
(for instance a compile). What we really want is to ignore all these
temporary files and the flury of changes and wait for a more quiescent
state to reindex.

One of the core properties of the indexer is that it knows what the
filesystem looked like last time it indexed, so a more adequate model
for changes would be to get told on a per-directory basis that
"something changed in this directory" with a much more coarse-grained
time scale, say e.g. at most once every 30 seconds. The indexer could
then schedule a re-indexing of that directory, comparing mtimes with
what is stored in its db to see which files changed. This is how the
MacOS FSEvents userspace framework[1] works, and it seems reasonable.

updatedb keeps track of the full filesystem tree, based on the
"current" mounts in it (at the time updatedb ran). While this was
probably valid when it was designed it is not really adequate for
current use which is much more dynamic in how things get plugged in and
out. A more modern way to look at this is to consider the full set of
mounted filesystems being a forrest of trees, with the current process
namespace being composed of a subset of these mounted in various places
in the namespace.

So, in order to handle a filesystem being unmounted, and then later
e.g. mounted in another place or another filesystem mounted in the
same location we shouldn't index based on how things are mounted, but
rather keep an index per filesystem. The kernel identifier for a
filesystem is the major:minor of the block device its mounted on. This
is not a persistent identifier, but given such an identifier a
userspace app could use a library like libvolume_id to make up a
persistent identifier for use as the key in its index. It would then
store each item in its database by a volume id + relative path pair,
which can be mapped to a pathname in the current namespace by using
e.g. /proc/self/mountinfo.

In order to write an app using the fanotify API satisfying the above
needs we would need the following events:
* the event queue overflowed, (you need to reindex everything)
* An inode was linked into the filesystem (creat, O_CREAT,
mkdir, link, symlink, etc)
* An inode was unlinked (unlink, rmdir, rename replaced existing file)
* An inode was moved in the filesystem (rename)
* A file handle that was written to was closed
* optionally: A file handle was written to (this is somewhat expensive
to track as there are a lot of these events)

For these events we need some form of identifier that references the
file that was affected. There are two types of changes above, pure
name changes (link/unlink/rename) and inode changes
(close/write). fanotify currently only gives "inode changes" kind of
events, and it uses a file descriptor as the identifier.

Using an fd as an identifier is interesting, because it avoids the
problems with absolute pathnames and namespaces. The user of the API
can use readlink on /proc/self/fd/<fd> to get at the pathname of the
file that was opened (in its namespace), we can also use fstat to get
the block device of the file and /proc/self/mountinfo to calculate the
filesystem relative path. Additionally, by using a fd like this we're
basically given a userspace reference to a dentry. This means that the
link in /proc will be updated as the filename changes. So we can rely
on the paths gotten from the events to be up to date wrt any namespace
changes during the time of the change to the time we're handling the
event. We don't have to manually update events due to e.g. later
rename events.

However, this is somewhat of a problem in the name change events. For
instance, for rename if we have an fd to the moved file we can't
really know its original position. For these types of changes we want
the fd of the parent directory and the filename that changed.

With these events we should be able to track any directory that has
changed files in it, with these exceptions:
* Sometimes we can only say "everything might have changed" (queue
overflow)
* We only track locally originating changes
* If a hardlinked file is updated in-place we only know of the change
in the filename used to open the file.
* If we chose not to pick up every write event (for performance
reasons) we won't know of writes to files that weren't closed (like
e.g. logfiles)

I think these exceptions are reasonable for most usecases.

Its unlikely that users actually want to index all files in the
system. In practice its more likely that they want to index their
homedirs, removable media and maybe a few other directories. So, in
order to lower the total system load due to changes on areas where we're
not interested in changes it would be nice to be able to set up either
blacklists like the current fastpath, or even better subscriptions,
where we ignore everything not specifically requested. I don't really
think the fastpaths that are currently in fanotify are good enought
for file indexing, as they are per file, and there are potentially
millions of files that we want to ignore.

Instead I would like a form of subscription based on block major+minor
and dentry prefix. So, you'd say "I want everything on block 8:1
affecting the subtree under the dentry specified by this fd I
opened". The fd should be optional, and probably the minor nr too. In
fact, even the major nr should probably be optional too if you really
want events for every change in the system. In an indexer this would
be used by reading the set of paths that the user specified as
wanting indexed, looking up in /proc/self/mountinfo what this
corresponds to wrt devices and registering the required subscriptions.

---

[1] http://arstechnica.com/apple/reviews/2007/10/mac-os-x-10-5.ars/7


2009-03-27 13:02:55

by Al Viro

[permalink] [raw]
Subject: Re: Issues with using fanotify for a filesystem indexer

On Fri, Mar 27, 2009 at 01:47:23PM +0100, Alexander Larsson wrote:

> In order to write an app using the fanotify API satisfying the above
> needs we would need the following events:
> * the event queue overflowed, (you need to reindex everything)
> * An inode was linked into the filesystem (creat, O_CREAT,
> mkdir, link, symlink, etc)
> * An inode was unlinked (unlink, rmdir, rename replaced existing file)
> * An inode was moved in the filesystem (rename)

Erm... Just how would you represent and *order* the events? Note that
"serialize all directory operations on given fs" is a non-starter...

2009-03-27 13:47:20

by Alexander Larsson

[permalink] [raw]
Subject: Re: Issues with using fanotify for a filesystem indexer

On Fri, 2009-03-27 at 13:02 +0000, Al Viro wrote:
> On Fri, Mar 27, 2009 at 01:47:23PM +0100, Alexander Larsson wrote:
>
> > In order to write an app using the fanotify API satisfying the above
> > needs we would need the following events:
> > * the event queue overflowed, (you need to reindex everything)
> > * An inode was linked into the filesystem (creat, O_CREAT,
> > mkdir, link, symlink, etc)
> > * An inode was unlinked (unlink, rmdir, rename replaced existing file)
> > * An inode was moved in the filesystem (rename)
>
> Erm... Just how would you represent and *order* the events? Note that
> "serialize all directory operations on given fs" is a non-starter...

This particular usecase doesn't really require a specific order for the
events. Its enough to know that something changed in a directory, or a
subtree. I can't think of an event reordering that would cause us to not
eventually reindex the files was affected by a change.

Note, I don't think its possible to track renames precisely so that we
know that a previosly indexed file is in another place now, but rather
that if we move a directory from a to b we now need to fully reindex
both the a and b subtrees.

However, the order might be important for other usecases, I can't really
tell. Is it possible to give any kind of guarantee for the event order?

As for representation, I guess there are two issues here, how the event
looks in the kernel side event queue and how it looks to userspace. I'm
no kernel developer, but i guess a rename could be something liks

struct path *old_dir;
char *old_name
struct path *new_dir;
char *new_name

and the userspace side could be:

int old_dir_fd
char *old_name
int new_dir_fd
char *new_name

2009-03-28 20:38:40

by Alexander Larsson

[permalink] [raw]
Subject: Re: Issues with using fanotify for a filesystem indexer

On Fri, 2009-03-27 at 13:02 +0000, Al Viro wrote:
> On Fri, Mar 27, 2009 at 01:47:23PM +0100, Alexander Larsson wrote:
>
> > In order to write an app using the fanotify API satisfying the above
> > needs we would need the following events:
> > * the event queue overflowed, (you need to reindex everything)
> > * An inode was linked into the filesystem (creat, O_CREAT,
> > mkdir, link, symlink, etc)
> > * An inode was unlinked (unlink, rmdir, rename replaced existing file)
> > * An inode was moved in the filesystem (rename)
>
> Erm... Just how would you represent and *order* the events? Note that
> "serialize all directory operations on given fs" is a non-starter...

So, I've been thinking a bit more about this. You're right that
serializing all directory operations is way to expensive. And I don't
actually need it for my usecase. However, the event types I listed above
are more or less taken from the "inotify style" events, and they sort of
demand an ordering (or much of the arguments are useless).

That information would not be used by an indexer like the one i
described anyway, so i think the set of events could be drastically
simplified.

Basically, we would need a single event for all the namespace changing
events (link, unlink, rename, etc). This event would say "some name in
this directory changed", you'll get a single event of these for a
link/unlink and two of them for a rename.

Furthermore, since ordering is not specified multiple events to the same
location is meaningless. So instead of "queue of events" we're more
talking about a set of changed files/dirs, containing all the things
that changed since you last read the event.

This simplification means we can drop a lot of data from the events,
cutting down on memory use. It also means we only have to store one
event for each dentry or struct file that changes, meaning less memory
use. (Although the event "queue" would have to turn into some other form
of datatype that allows quickly finding if a file is already in the
queue.

This also simplifies the userspace API, so that the current fanotify
userspace event struct with an fd + a mask doesn't have to be changed.

2009-04-02 14:55:25

by Jan Kara

[permalink] [raw]
Subject: Re: Issues with using fanotify for a filesystem indexer

Hi,

> I took a look at fanotify to see if it would be a better fit for a
> filesystem indexer (like tracker or beagle), as inotify is pretty bad.
> I think it is a better fit in general, but it needs some additions.
>
> Lets first define the requirements. By "indexer" I mean a userspace
> program that keeps a database that stores information about files on
> some form of pathname basis. You can use this to do a query for
> something and reverse-map back to the filename. As a minimally complex,
> yet sufficient model we have locate/updatedb that lets you quickly
> search for files by filename. However, instead of indexing everything
> each night we want to continuosly index things "soon" after they change,
> for some loose definition of soon (say within a few minutes in the
> normal case).
>
> Its not realistic to imagine the indexer handling each file change as
> they happen, as modern machines can dirty a lot of files in a short
> time which would immediately result in change event queues
> overflowing. It as also not really what isis desired. Many kinds of
> activities produce a lot of filesystem activity with creation of
> temporary files and changing of files multiple times over some time
> (for instance a compile). What we really want is to ignore all these
> temporary files and the flury of changes and wait for a more quiescent
> state to reindex.
>
> One of the core properties of the indexer is that it knows what the
> filesystem looked like last time it indexed, so a more adequate model
> for changes would be to get told on a per-directory basis that
> "something changed in this directory" with a much more coarse-grained
> time scale, say e.g. at most once every 30 seconds. The indexer could
> then schedule a re-indexing of that directory, comparing mtimes with
> what is stored in its db to see which files changed. This is how the
> MacOS FSEvents userspace framework[1] works, and it seems reasonable.
>
> updatedb keeps track of the full filesystem tree, based on the
> "current" mounts in it (at the time updatedb ran). While this was
> probably valid when it was designed it is not really adequate for
> current use which is much more dynamic in how things get plugged in and
> out. A more modern way to look at this is to consider the full set of
> mounted filesystems being a forrest of trees, with the current process
> namespace being composed of a subset of these mounted in various places
> in the namespace.
>
> So, in order to handle a filesystem being unmounted, and then later
> e.g. mounted in another place or another filesystem mounted in the
> same location we shouldn't index based on how things are mounted, but
> rather keep an index per filesystem. The kernel identifier for a
> filesystem is the major:minor of the block device its mounted on. This
> is not a persistent identifier, but given such an identifier a
> userspace app could use a library like libvolume_id to make up a
> persistent identifier for use as the key in its index. It would then
> store each item in its database by a volume id + relative path pair,
> which can be mapped to a pathname in the current namespace by using
> e.g. /proc/self/mountinfo.
Some time ago I was trying to solve a similar problem and I've come up
with a solution which I've called recursive mtime. The general idea is
that with each directory, kernel additionally keeps a flag and a
timestamp. When a directory is modified, we do:
dir = changed dir;
while dir has flag set do
update timestamp to current time
clear flag
dir = parent dir

When a file is modified, you just start with a parent directory of
that file. With this scheme, you are able to find reasonably quickly
(without looking at unchanged directories) what has changed since
you've looked last time (you look for timestamps newer than the time
when you started last scan and you set flags as you go). Also the scheme
is quite cheap to maintain and has no problems with overflowing event
queues etc. (observe that the scheme works perfectly fine for several
independent scanners in parallel). As a bonus, if you store the flag +
timestamp persistently on disk, you can use this scheme to speedup things
like rsync.
What gets nasty (but solvable) are hardlinks and bind mounts. I was
writing a library to handle these last summer but then had to work on
something else and didn't get back to it yet.

Honza
--
Jan Kara <[email protected]>
SuSE CR Labs

2009-04-02 16:29:38

by Alexander Larsson

[permalink] [raw]
Subject: Re: Issues with using fanotify for a filesystem indexer

On Thu, 2009-04-02 at 16:54 +0200, Jan Kara wrote:

> Some time ago I was trying to solve a similar problem and I've come up
> with a solution which I've called recursive mtime. The general idea is
> that with each directory, kernel additionally keeps a flag and a
> timestamp. When a directory is modified, we do:
> dir = changed dir;
> while dir has flag set do
> update timestamp to current time
> clear flag
> dir = parent dir
>
> When a file is modified, you just start with a parent directory of
> that file. With this scheme, you are able to find reasonably quickly
> (without looking at unchanged directories) what has changed since
> you've looked last time (you look for timestamps newer than the time
> when you started last scan and you set flags as you go). Also the scheme
> is quite cheap to maintain and has no problems with overflowing event
> queues etc. (observe that the scheme works perfectly fine for several
> independent scanners in parallel). As a bonus, if you store the flag +
> timestamp persistently on disk, you can use this scheme to speedup things
> like rsync.
> What gets nasty (but solvable) are hardlinks and bind mounts. I was
> writing a library to handle these last summer but then had to work on
> something else and didn't get back to it yet.

Another potential issue with this is that every change bubbles up to the
top, modifying the recursive mtime of that. This will become very
contented, and may imply a partial serialization of fs activity, which
is kinda costly.

2009-04-02 16:51:26

by Jan Kara

[permalink] [raw]
Subject: Re: Issues with using fanotify for a filesystem indexer

On Thu 02-04-09 18:29:04, Alexander Larsson wrote:
> On Thu, 2009-04-02 at 16:54 +0200, Jan Kara wrote:
>
> > Some time ago I was trying to solve a similar problem and I've come up
> > with a solution which I've called recursive mtime. The general idea is
> > that with each directory, kernel additionally keeps a flag and a
> > timestamp. When a directory is modified, we do:
> > dir = changed dir;
> > while dir has flag set do
> > update timestamp to current time
> > clear flag
> > dir = parent dir
> >
> > When a file is modified, you just start with a parent directory of
> > that file. With this scheme, you are able to find reasonably quickly
> > (without looking at unchanged directories) what has changed since
> > you've looked last time (you look for timestamps newer than the time
> > when you started last scan and you set flags as you go). Also the scheme
> > is quite cheap to maintain and has no problems with overflowing event
> > queues etc. (observe that the scheme works perfectly fine for several
> > independent scanners in parallel). As a bonus, if you store the flag +
> > timestamp persistently on disk, you can use this scheme to speedup things
> > like rsync.
> > What gets nasty (but solvable) are hardlinks and bind mounts. I was
> > writing a library to handle these last summer but then had to work on
> > something else and didn't get back to it yet.
>
> Another potential issue with this is that every change bubbles up to the
> top, modifying the recursive mtime of that. This will become very
> contented, and may imply a partial serialization of fs activity, which
> is kinda costly.
Not every change - only the first change bubbles to the top, clearing the
flag on its way. Then next change stops bubbling up as soon as it reaches
a directory with the flag cleared. So no contention happen - we update flag
+ timestamp only at most once per scan of the directory by indexer (or
someone else interested in recursive mtime) => once per a few minutes on
average system.

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR

2009-04-02 17:16:15

by Alexander Larsson

[permalink] [raw]
Subject: Re: Issues with using fanotify for a filesystem indexer

On Thu, 2009-04-02 at 18:50 +0200, Jan Kara wrote:
> On Thu 02-04-09 18:29:04, Alexander Larsson wrote:
> > Another potential issue with this is that every change bubbles up to the
> > top, modifying the recursive mtime of that. This will become very
> > contented, and may imply a partial serialization of fs activity, which
> > is kinda costly.
> Not every change - only the first change bubbles to the top, clearing the
> flag on its way. Then next change stops bubbling up as soon as it reaches
> a directory with the flag cleared. So no contention happen - we update flag
> + timestamp only at most once per scan of the directory by indexer (or
> someone else interested in recursive mtime) => once per a few minutes on
> average system.

Ah, I see. The indexer sets the flag.
I see some issues. First of all, writing the flag/mtime to disk seems
like a bad idea. It'll cause a lot of writing when the indexer recurses
throught the filesystem, similar to atimes. But, if you're not
persisting the flag/mtime then you need to keep all the dentries with
the flag set in memory, which has resource use risks similar to
unbounded event queues.

2009-04-02 19:53:30

by Jan Kara

[permalink] [raw]
Subject: Re: Issues with using fanotify for a filesystem indexer

On Thu 02-04-09 19:15:37, Alexander Larsson wrote:
> On Thu, 2009-04-02 at 18:50 +0200, Jan Kara wrote:
> > On Thu 02-04-09 18:29:04, Alexander Larsson wrote:
> > > Another potential issue with this is that every change bubbles up to the
> > > top, modifying the recursive mtime of that. This will become very
> > > contented, and may imply a partial serialization of fs activity, which
> > > is kinda costly.
> > Not every change - only the first change bubbles to the top, clearing the
> > flag on its way. Then next change stops bubbling up as soon as it reaches
> > a directory with the flag cleared. So no contention happen - we update flag
> > + timestamp only at most once per scan of the directory by indexer (or
> > someone else interested in recursive mtime) => once per a few minutes on
> > average system.
>
> Ah, I see. The indexer sets the flag.
> I see some issues. First of all, writing the flag/mtime to disk seems
> like a bad idea. It'll cause a lot of writing when the indexer recurses
> throught the filesystem, similar to atimes. But, if you're not
There's some cost but it's not nearly as bad as with atimes. Indexer only
has to change flags on directories in whose subtree something has changed -
so we modify only directories (atime is modified for all files +
directories) and we even modify only a small fraction of them because only
a small amount of them has changed and thus has the flag reset.
Similarly a write to a file can cause some additional modifications to
above directories as we traverse up the tree but again, there aren't going
to be many of those since most of the time quickly hit a directory with a
flag reset. Now I understand this is a handwaving and one does not believe
it until he tries it on his machine if it is really a noticable slowdown or
not ;)
Also I'm not saying it's a perfect fit for things like desktop indexing
because as you said, it's desirable to rescan the system every few minutes
while I was rather aiming at usecases where either the rescan is done only
once per few hours (rsync, updatedb) or where files change only
exceptionally (e.g. KDE/GNOME watching desktop config files). But still I
don't think it's completely insane to try something like that...

> persisting the flag/mtime then you need to keep all the dentries with
> the flag set in memory, which has resource use risks similar to
> unbounded event queues.
Ah, true - I have implemented just the persistent case and have not
thought too much about the non-persistent one. You're right that it won't
work because we'd pin memory.

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR

2009-04-03 06:44:20

by Alexander Larsson

[permalink] [raw]
Subject: Re: Issues with using fanotify for a filesystem indexer

On Thu, 2009-04-02 at 21:52 +0200, Jan Kara wrote:
> On Thu 02-04-09 19:15:37, Alexander Larsson wrote:
> > Ah, I see. The indexer sets the flag.
> > I see some issues. First of all, writing the flag/mtime to disk seems
> > like a bad idea. It'll cause a lot of writing when the indexer recurses
> > throught the filesystem, similar to atimes. But, if you're not
>
> There's some cost but it's not nearly as bad as with atimes.

True, its not as bad as atimes. But it still does some writes, and
writes seem to affect i/o performance more than low prio reads from the
indexer. I'm very wary about the background indexer process disturbing
the foreground processes. This is one of the main problems with current
indexers.

> > persisting the flag/mtime then you need to keep all the dentries with
> > the flag set in memory, which has resource use risks similar to
> > unbounded event queues.
> Ah, true - I have implemented just the persistent case and have not
> thought too much about the non-persistent one. You're right that it won't
> work because we'd pin memory.

So, where do you persist the flag/time? Is there some availible space in
the inode for it on ext3/4?

2009-04-03 09:33:29

by Jan Kara

[permalink] [raw]
Subject: Re: Issues with using fanotify for a filesystem indexer

On Fri 03-04-09 08:44:03, Alexander Larsson wrote:
> On Thu, 2009-04-02 at 21:52 +0200, Jan Kara wrote:
> > On Thu 02-04-09 19:15:37, Alexander Larsson wrote:
> > > Ah, I see. The indexer sets the flag.
> > > I see some issues. First of all, writing the flag/mtime to disk seems
> > > like a bad idea. It'll cause a lot of writing when the indexer recurses
> > > throught the filesystem, similar to atimes. But, if you're not
> >
> > There's some cost but it's not nearly as bad as with atimes.
> True, its not as bad as atimes. But it still does some writes, and
> writes seem to affect i/o performance more than low prio reads from the
> indexer. I'm very wary about the background indexer process disturbing
> the foreground processes. This is one of the main problems with current
> indexers.
I agree - that's why I have beagle turned off on my system ;)

> > > persisting the flag/mtime then you need to keep all the dentries with
> > > the flag set in memory, which has resource use risks similar to
> > > unbounded event queues.
> > Ah, true - I have implemented just the persistent case and have not
> > thought too much about the non-persistent one. You're right that it won't
> > work because we'd pin memory.
>
> So, where do you persist the flag/time? Is there some availible space in
> the inode for it on ext3/4?
Yes, there's enough space in ext3/ext4 inode. I've already talked about
it with Ted and other fs developers at Plumbers Conf. and they weren't
opposed to such change.

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR