2012-11-01 12:52:46

by Martin Steigerwald

Subject: Better support for (desktop) file search / indexing applications

Hi!

Some time ago I stumbled over a blog entry saying that the kernel's per-user
inotify watch limit is often too low for the Nepomuk File Watcher to be
notified reliably of file renames, new files and file deletions[1].

This has been discussed in various places[2,3,4], and likely in
others.


I am writing to help the Nepomuk team get in contact with kernel
developers who could advise or help on how to solve the issues they
have with the current filesystem notification APIs in the kernel.

I have therefore CC'd the dnotify, inotify and fanotify maintainers as well
as Jan Kara, who analyzed the advantages and disadvantages of each approach
and also developed some patches for recursive mtimes. I can dig out the
links to those as well, just ask if you want them. I have also CC'd LKML,
linux-fsdevel and the Nepomuk mailing list. Feel free to drop CCs that you
deem inappropriate or to add some for other Linux desktop or server
file indexing projects. Please tell me if I missed other kernel developers
who have worked on file notification.


The following two main issues led to the discussion about adding a
notification when the per-user inotify watch limit is reached, or even
having the limit raised automatically via some PolicyKit mechanism:

1) Watches do not work recursively. Thus one has to add a watch to
each subdirectory.

2) There are inotify file move events, but one has to watch both the source
and the destination directory to be notified of a file move between them.
Thus, again, one has to watch each directory. A file moved out of the
watched home directory cannot be followed to its destination unless every
other accessible directory is watched as well.
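
A minimal sketch of what issue 1 means for an indexer today, assuming Linux
and Python's ctypes to reach the libc inotify calls. The helper names
directories_under and watch_tree are made up for illustration; only
inotify_init, inotify_add_watch and the mask constants come from the
kernel API. It is not how Nepomuk implements this.

```python
import ctypes
import os

# Event mask bits as defined in <sys/inotify.h>.
IN_CLOSE_WRITE = 0x00000008
IN_MOVED_FROM = 0x00000040
IN_MOVED_TO = 0x00000080
IN_CREATE = 0x00000100
IN_DELETE = 0x00000200
WATCH_MASK = (IN_CLOSE_WRITE | IN_MOVED_FROM | IN_MOVED_TO
              | IN_CREATE | IN_DELETE)


def directories_under(root):
    """Return root and every directory below it (symlinks are not followed)."""
    return [dirpath for dirpath, _dirs, _files in os.walk(root)]


def watch_tree(root, mask=WATCH_MASK):
    """Add one inotify watch per directory; return (fd, {wd: directory})."""
    libc = ctypes.CDLL(None, use_errno=True)
    libc.inotify_add_watch.argtypes = [
        ctypes.c_int, ctypes.c_char_p, ctypes.c_uint32]
    fd = libc.inotify_init()
    if fd < 0:
        err = ctypes.get_errno()
        raise OSError(err, os.strerror(err))
    watches = {}
    for directory in directories_under(root):
        wd = libc.inotify_add_watch(fd, os.fsencode(directory), mask)
        if wd < 0:
            # ENOSPC here means the per-user watch limit has been reached.
            err = ctypes.get_errno()
            os.close(fd)
            raise OSError(err, os.strerror(err), directory)
        watches[wd] = directory
    return fd, watches
```

Every directory costs one watch, so a home directory with more
subdirectories than the per-user limit makes watch_tree fail partway
through.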


What would be nice to have for file indexers would be:

1) Recursive notifications, i.e. one watch for /home/martin notifies
about everything that happens in subdirectories of that directory.

2) File move events that work from the source directory, i.e. when
watching a directory like /home/martin recursively it would be nice to
be notified when:

a) A file is moved from one subdirectory inside /home/martin to another
one inside it.

b) A file is moved outside /home/martin.

While these enhancements would likely fix the issues desktop file search
applications have with the kernel notification APIs, there might be other
approaches I have not yet thought of... so feel free to comment with your
thoughts on it.
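
For comparison, a sketch of how cases a) and b) can be told apart with the
existing API once every directory is watched: inotify tags the IN_MOVED_FROM
and IN_MOVED_TO halves of one rename with the same cookie. The function
names parse_events and classify_moves are hypothetical, and the sketch
assumes both halves arrive in the same read buffer, which is not guaranteed.

```python
import struct

IN_MOVED_FROM = 0x00000040
IN_MOVED_TO = 0x00000080

# struct inotify_event header: int wd; uint32 mask, cookie, len;
# followed by `len` bytes of NUL-padded file name.
_HEADER = struct.Struct("iIII")


def parse_events(buffer):
    """Split a raw inotify read buffer into (wd, mask, cookie, name) tuples."""
    events = []
    offset = 0
    while offset + _HEADER.size <= len(buffer):
        wd, mask, cookie, name_len = _HEADER.unpack_from(buffer, offset)
        offset += _HEADER.size
        name = buffer[offset:offset + name_len].split(b"\0", 1)[0]
        offset += name_len
        events.append((wd, mask, cookie, name))
    return events


def classify_moves(events):
    """Pair move events by cookie.

    Returns a list of (source, destination) pairs, each side being
    (wd, name) or None. A None destination means the file left the
    watched set; a None source means it arrived from outside.
    """
    sources = {}
    moves = []
    for wd, mask, cookie, name in events:
        if mask & IN_MOVED_FROM:
            sources[cookie] = (wd, name)
        elif mask & IN_MOVED_TO:
            moves.append((sources.pop(cookie, None), (wd, name)))
    # Sources left over had no matching destination in this buffer.
    moves.extend((source, None) for source in sources.values())
    return moves
```

With a recursive watch that reports moves from the source side, this
pairing would not have to be done in userspace.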


Furthermore, there is an issue with updating the file index on login or
service start. In order to catch file renames that happened while it was
not running, an indexer would have to rescan every directory whose
modification timestamp has changed, to see whether a (checksummed) file
has moved.

An approach like recursive mtime as proposed by Jan Kara can help to
improve initial scan times a lot.
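
A minimal sketch of the startup scan as it has to be done in userspace
today, assuming the indexer stored the time of its last scan. The function
name directories_changed_since is hypothetical. Note that it still has to
stat every directory: a directory's mtime changes only when its own entries
change, not when something changes further down, so no subtree can be
skipped.

```python
import os


def directories_changed_since(root, last_scan):
    """Return the directories under root whose mtime is newer than last_scan.

    last_scan is a POSIX timestamp in seconds.
    """
    changed = []
    for dirpath, _dirs, _files in os.walk(root):
        try:
            if os.stat(dirpath).st_mtime > last_scan:
                changed.append(dirpath)
        except OSError:
            # The directory vanished or became unreadable during the walk.
            continue
    return changed
```

With a recursive mtime, an unchanged timestamp on a directory would allow
the walk to skip everything below it.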

As far as I know this scan has been enabled in Nepomuk recently, in the
hope that files are moved mainly while the user session is active. I
think that's an assumption that may be accurate in many cases.

Still, something like recursive mtime or BTRFS generation numbers with
btrfs subvolume find-new PATH LASTGENERATION would help that case a lot.
The issue with the BTRFS approach is that it only works as root. A
solution to this would be to integrate it into some daemon that runs as
root and have applications communicate with it via a socket or D-Bus.
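
A rough sketch of the piece such a root daemon might wrap, assuming the
btrfs command is installed and that its output ends with a line of the form
"transid marker was N", which is what I remember btrfs-progs printing. The
function names are hypothetical, and the socket or D-Bus side is left out
entirely.

```python
import re
import subprocess

# Assumed format of the final output line of `btrfs subvolume find-new`.
_MARKER = re.compile(r"^transid marker was (\d+)$", re.MULTILINE)


def find_new_command(subvolume, last_generation):
    """Build the argv for `btrfs subvolume find-new PATH LASTGENERATION`."""
    return ["btrfs", "subvolume", "find-new", subvolume, str(last_generation)]


def next_generation(output):
    """Return the generation to pass on the next call, or None if absent."""
    match = _MARKER.search(output)
    return int(match.group(1)) if match else None


def changes_since(subvolume, last_generation):
    """Run find-new (needs root) and return (raw output, next generation)."""
    result = subprocess.run(
        find_new_command(subvolume, last_generation),
        capture_output=True, text=True, check=True)
    return result.stdout, next_generation(result.stdout)
```

The daemon would store the returned generation per subvolume and hand the
list of changed files only to the user who owns them.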


Some of these issues may apply to server-side services like Constellio or
Apache Solr (Lucene) as well. For example, after a downtime the restarted
service wants to pick up the latest changes. The same goes for near
real-time indexing.


I hope to help to unstick the current state. I think it's important for
kernel and userspace developers to talk to each other about good ways
to move forward.

So maybe some time in the future:

martin@merkaba:~> cat /etc/sysctl.d/nepomuk.conf
# For the Nepomuk File Indexer
# martin@merkaba:~> find -type d | wc -l
# 34515
#
# merkaba:/proc/sys/fs/inotify> cat max_user_watches
# 8192

fs.inotify.max_user_watches = 200000

won't be necessary anymore.
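
The check behind that configuration can be sketched as follows, assuming one
watch per directory. The function names are made up for illustration; the
procfs path is the one shown above.

```python
import os

PROC_LIMIT = "/proc/sys/fs/inotify/max_user_watches"


def read_watch_limit(path=PROC_LIMIT):
    """Read the per-user inotify watch limit from procfs."""
    with open(path) as handle:
        return int(handle.read().strip())


def count_directories(root):
    """Count root and all directories below it, like `find -type d | wc -l`."""
    return sum(1 for _ in os.walk(root))


def missing_watches(root, limit):
    """Return how many directories could not be watched under this limit."""
    return max(0, count_directories(root) - limit)
```

With the numbers above, 34515 directories against a limit of 8192 leave
26323 directories unwatched.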

I found that SLES 11 SP2, and maybe earlier versions as well, raises the
user watch limit to 65536 by default. So this seems to have been an
issue in a server-oriented enterprise distribution as well.



[1] Alvaro Soliverez: Nepomuk not indexing a large home:
http://soliverez.com.ar/home/2012/10/nepomuk-not-indexing-a-large-home/

[2] [Nepomuk] User limit reached. Please raise the inotify user watch limit:
http://lists.kde.org/?l=nepomuk&m=134954456529570&w=2

[3] Vishesh Handa: Nepomuk Without Files:
http://vhanda.in/blog/2012/08/nepomuk-without-files/

[4] Martin Sandsmark: KFileMon:
http://martinsandsmark.wordpress.com/2012/08/07/kfilemon/

Thanks,
--
Martin 'Helios' Steigerwald - http://www.Lichtvoll.de
GPG: 03B0 0D6C 0040 0710 4AFA B82F 991B EAAC A599 84C7


2012-11-01 12:55:57

by Martin Steigerwald

Subject: Re: [Nepomuk] Better support for (desktop) file search / indexing applications

On Thursday, 1 November 2012, Martin Steigerwald wrote:
> Furthermore there is an issue with updating the file index on login or
> service start. In order to catch all other file renames a indexer would
> have to run over every directory whose modification time stamp has
> changed again in order to see whether a (checksummed) file has moved.
>
> An approach like recursive mtime as proposed by Jan Kara can help to
> improve initial scan times a lot.
>
> As to what I know this scan has been enabled in Nepomuk recently, with
> the hope that files are moved mainly during the user session is
> active. I think thats an assumption that may be accurate for many
> cases.

That should read "disabled", not "enabled".

I proofread this before sending but did not spot the mistake.

Sorry,
--
Martin 'Helios' Steigerwald - http://www.Lichtvoll.de
GPG: 03B0 0D6C 0040 0710 4AFA B82F 991B EAAC A599 84C7

2012-11-01 14:34:09

by Tvrtko Ursulin

Subject: Re: Better support for (desktop) file search / indexing applications

On Thursday 01 November 2012 13:52:42 Martin Steigerwald wrote:
...
> The following two main issues led to the discussion about adding
> notification about user inotify watch limit or even having it raised
> automatically via some policy kit mechanism:
>
> 1) Watches are not working recursively. Thus one has to add a watch to
> each sub directory.
>
> 2) There are inotify file move events. But one has to watch source and
> destination directory to get notified of a file move between these. Thus
> one has to watch each directory again. File moves outside the watched
> home directory will go unnotified unless every other accessible directory
> is watched as well.
>
>
> What would be nice to have for file indexers would be:
>
> 1) Recursive notifications. I.e. one watch for /home/martin can notify
> about everything what happens in sub directories of that directory.
>
> 2) File move events that work from the source directory. I.e. if
> watching a directory like /home/martin recursively it would be nice to
> be notified about:
>
> a) A file is moved from one sub directory inside /home/martin to another
> one inside it.
>
> b) A file is moved outside /home/martin
>
> While these enhancement would likely fix the issues desktop file search
> applications have with the kernel notification APIs, there might be other
> approaches I did not yet thought off... so feel free to comment with your
> thoughts on it.

I will only comment on the real time indexing part since I had some part in
the inception of fanotify and still remember a thing or two about it.

Perhaps you should look into how hard it would be to add directory or rename
events, and unlink events, to fanotify. It may not be too hard.

In that case, even though it does not support recursive directory watches (I
tried to implement this some time around 2009, but found it impossible to
wedge into the fanotify locking model), it does support mount point watches,
which might be sufficient for desktop use, assuming /home is typically a
separate filesystem.

The downside of this approach is that you have to filter out the events you
do not care about, like those for /home/some-other-user, or even more if
/home is not a separate filesystem. With the current state of fanotify this
can be done using paths, but that involves resolving a link in procfs, which
may be too expensive a thing to do.

Or perhaps it is acceptable, if for example you only care about CLOSE_WRITE
events (closure of files which were open for writing).
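
A small sketch of the path filtering mentioned above, assuming a Linux
procfs. fanotify delivers an open file descriptor with each event, and the
path behind it can be read from the /proc/self/fd/N symlink. The helper
names path_for_fd and is_inside are hypothetical, and reading the fanotify
events themselves is not shown.

```python
import os


def path_for_fd(fd):
    """Resolve the path behind an open descriptor via its procfs symlink."""
    return os.readlink(f"/proc/self/fd/{fd}")


def is_inside(path, root):
    """Return True if the absolute path is root itself or lies below it."""
    root = os.path.normpath(root)
    path = os.path.normpath(path)
    return os.path.commonpath([path, root]) == root
```

An indexer on a mount watch would call path_for_fd for every event and drop
those for which is_inside(path, "/home/martin") is false. That is one
readlink per event, which is the cost in question.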

So I think for this part you have two options: have a go at extending
directory watches to be recursive, or live with the mount watches giving you
too much traffic.

Regards,

Tvrtko

2012-11-10 16:53:52

by Martin Steigerwald

Subject: Re: Better support for (desktop) file search / indexing applications

On Thursday, 1 November 2012, Tvrtko Ursulin wrote:
> On Thursday 01 November 2012 13:52:42 Martin Steigerwald wrote:
> ...
>
> > The following two main issues led to the discussion about adding
> > notification about user inotify watch limit or even having it raised
> > automatically via some policy kit mechanism:
> >
> > 1) Watches are not working recursively. Thus one has to add a watch
> > to each sub directory.
> >
> > 2) There are inotify file move events. But one has to watch source
> > and destination directory to get notified of a file move between
> > these. Thus one has to watch each directory again. File moves
> > outside the watched home directory will go unnotified unless every
> > other accessible directory is watched as well.
> >
> >
> > What would be nice to have for file indexers would be:
> >
> > 1) Recursive notifications. I.e. one watch for /home/martin can
> > notify about everything what happens in sub directories of that
> > directory.
> >
> > 2) File move events that work from the source directory. I.e. if
> > watching a directory like /home/martin recursively it would be nice
> > to be notified about:
> >
> > a) A file is moved from one sub directory inside /home/martin to
> > another one inside it.
> >
> > b) A file is moved outside /home/martin
> >
> > While these enhancement would likely fix the issues desktop file
> > search applications have with the kernel notification APIs, there
> > might be other approaches I did not yet thought off... so feel free
> > to comment with your thoughts on it.
>
> I will only comment on the real time indexing part since I had some
> part in the inception of fanotify and still remember a thing or two
> about it.
>
> Perhaps you should look into how hard would it be to add directory or
> rename, and unlink events to fanotify. It may not be too hard.
>
> In that case, even though it does not support recursive directory
> watches (I tried to implement this some time around 2009. but found it
> impossible to wedge into the fanotify locking model), it does support
> mount point watches. Which for the desktop use might be sufficient,
> assuming /home is typically a separate filesystem.
>
> Downside with this approach is that you have to filter out the events
> you do not care about like /home/some-other-user, or even more if
> /home is not a separate filesystem. Which with the current fanotify
> state can be done using paths, but that includes resolving a link in
> procfs which may be a too expensive thing to do.
>
> Or perhaps it is acceptable, if you for example only cared about
> CLOSE_WRITE events (closure of file which were open for writing).
>
> So I think for this part you have two options, have a go of extending
> directory watches to be recursive, or live with the mount watches
> giving you too much traffic.

Thanks for your suggestions.

Still, fanotify needs root access. This would thus need a daemon running
as root and some PolicyKit stuff to access it and, in the case of mount
point watches, robust and secure code so that each user only sees his or
her own results.

Any other ideas from anyone?

Thanks.
--
Martin 'Helios' Steigerwald - http://www.Lichtvoll.de
GPG: 03B0 0D6C 0040 0710 4AFA B82F 991B EAAC A599 84C7

2012-11-12 09:10:32

by Tvrtko Ursulin

Subject: Re: Better support for (desktop) file search / indexing applications

On Saturday 10 November 2012 17:53:45 Martin Steigerwald wrote:
> Still fanotify needs root access and thus this would need a daemon running
> as root and some policy kit stuff to access it and in case of mount point
> watches robust and secure code so that each user may only see his/her own
> results.

Perhaps then also extend fanotify to support user watches. Off the top of my
head I can't think of a reason it would be very difficult to implement, but it
has been a few years since I actively worked with that code.

Since you are not the only group having issues with the fanotify feature set,
I can see this mini-project (together with the extensions from my previous
reply) being useful. It is also better to evolve fanotify than to neglect it
due to a few shortcomings; otherwise in a few years someone will come up with
something completely new and we will have yet another notification system.

Tvrtko