Subject: Things I wish I'd known about Inotify

(To: == [the set of people I believe know a lot about inotify])

Hello all,

Lately, I've been studying the inotify API fairly thoroughly and
realized that there's a very big gap between knowing what the system
calls do and using them to reliably and efficiently monitor the
state of a set of filesystem objects.

With that in mind, I've drafted some substantial additions to the
inotify(7) man page. I would be very happy if folk on the "To:" list
could comment on the text below, since I believe you all have a lot of
practical experience with Inotify. (Of course, I also welcome comments
from anyone else.) In particular, I would like comments on the
accuracy of the various technical points (especially those relating to
matching up related IN_MOVED_FROM and IN_MOVED_TO events), as well as
pointers to any other pitfalls that programmers should be wary of
and that should be added to the page.

Thanks,

Michael

Limitations and caveats
The inotify API provides no information about the user or process
that triggered the inotify event. In particular, there is no
easy way for a process that is monitoring events via inotify to
distinguish events that it triggers itself from those that are
triggered by other processes.

The inotify API identifies affected files by filename. However,
by the time an application processes an inotify event, the file‐
name may already have been deleted or renamed.

The inotify API identifies events via watch descriptors. It is
the application's responsibility to cache a mapping (if one is
needed) between watch descriptors and pathnames. Be aware that
directory renamings may affect multiple cached pathnames.
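
A minimal sketch of such a cache, to make the point about directory renames concrete. Everything here is illustrative rather than part of the inotify API: the names cache_put(), cache_get() and cache_rename_dir(), the fixed table size, and the 4096-byte path bound are all assumptions, and a real application would probably use a hash table.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define MAX_WATCHES 256    /* hypothetical table size */
#define PATH_BUF    4096   /* assumed upper bound on pathname length */

struct wd_entry {
    int  wd;               /* watch descriptor returned by inotify_add_watch() */
    char path[PATH_BUF];   /* pathname the application associated with it */
};

static struct wd_entry cache[MAX_WATCHES];
static int cache_used;     /* number of slots in use */

/* Record (or update) the pathname for a watch descriptor.
 * Returns 0 on success, -1 if the table is full or the path is too long. */
static int cache_put(int wd, const char *path)
{
    assert(wd >= 0 && path != NULL);
    if (strlen(path) >= PATH_BUF)
        return -1;
    for (int i = 0; i < cache_used; i++) {
        if (cache[i].wd == wd) {
            strcpy(cache[i].path, path);
            return 0;
        }
    }
    if (cache_used == MAX_WATCHES)
        return -1;
    cache[cache_used].wd = wd;
    strcpy(cache[cache_used].path, path);
    cache_used++;
    return 0;
}

/* Look up the cached pathname for a watch descriptor, or NULL. */
static const char *cache_get(int wd)
{
    for (int i = 0; i < cache_used; i++)
        if (cache[i].wd == wd)
            return cache[i].path;
    return NULL;
}

/* After a directory is renamed from old_dir to new_dir, rewrite every
 * cached path that is old_dir itself or lies beneath it.
 * Returns the number of entries rewritten. */
static int cache_rename_dir(const char *old_dir, const char *new_dir)
{
    size_t old_len = strlen(old_dir);
    int changed = 0;

    for (int i = 0; i < cache_used; i++) {
        char *p = cache[i].path;
        if (strncmp(p, old_dir, old_len) != 0)
            continue;
        if (p[old_len] != '\0' && p[old_len] != '/')
            continue;               /* "/a/bc" is not under "/a/b" */
        char tmp[PATH_BUF];
        int n = snprintf(tmp, sizeof(tmp), "%s%s", new_dir, p + old_len);
        if (n < 0 || (size_t)n >= sizeof(tmp))
            continue;               /* new path too long: leave untouched */
        strcpy(p, tmp);
        changed++;
    }
    return changed;
}
```

With watches on /a/b, /a/b/c and /a/bc, a call to cache_rename_dir("/a/b", "/x") rewrites the first two entries and leaves the third alone.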

Inotify monitoring of directories is not recursive: to monitor
subdirectories under a directory, additional watches must be
created. This can take a significant amount of time for large
directory trees.

If monitoring an entire directory subtree, and a new subdirectory
is created in that tree or an existing directory is renamed into
that tree, be aware that by the time you create a watch for the
new subdirectory, new files (and subdirectories) may already
exist inside the subdirectory. Therefore, you may want to scan
the contents of the subdirectory immediately after adding the
watch (and, if desired, recursively add watches for any subdirec‐
tories that it contains).
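
One way to sketch the "watch first, then scan" order described above. The function name watch_tree(), the chosen event mask, and the 4096-byte path bound are assumptions made for illustration. The watch is added before the directory is read, so an object created during the scan may be seen twice (once by the scan, once as an event), and the caller has to tolerate that.

```c
#include <assert.h>
#include <dirent.h>
#include <stdio.h>
#include <string.h>
#include <sys/inotify.h>
#include <sys/stat.h>

#define PATH_BUF 4096   /* assumed upper bound on pathname length */

/*
 * Watch `dir`, then scan it and recurse into its subdirectories.
 * Returns the number of watches added, or -1 if the watch on `dir`
 * itself could not be created. Recursion depth equals tree depth, so
 * very deep trees may call for an iterative walk instead.
 */
static int watch_tree(int ifd, const char *dir)
{
    assert(ifd >= 0 && dir != NULL);

    /* IN_ONLYDIR makes the call fail if `dir` is not a directory. */
    int wd = inotify_add_watch(ifd, dir,
            IN_CREATE | IN_DELETE | IN_MOVED_FROM | IN_MOVED_TO |
            IN_ONLYDIR);
    if (wd == -1)
        return -1;
    /* A real application would record wd -> dir in its cache here. */

    int count = 1;
    DIR *d = opendir(dir);
    if (d == NULL)
        return count;           /* directory vanished or is unreadable */

    struct dirent *e;
    while ((e = readdir(d)) != NULL) {
        if (strcmp(e->d_name, ".") == 0 || strcmp(e->d_name, "..") == 0)
            continue;

        char child[PATH_BUF];
        int n = snprintf(child, sizeof(child), "%s/%s", dir, e->d_name);
        if (n < 0 || (size_t)n >= sizeof(child))
            continue;           /* path too long: skip this entry */

        struct stat sb;
        if (lstat(child, &sb) == -1 || !S_ISDIR(sb.st_mode))
            continue;           /* not a directory; symlinks are skipped */

        int sub = watch_tree(ifd, child);
        if (sub > 0)
            count += sub;
    }
    closedir(d);
    return count;
}
```

The same function can be called on a directory named in an IN_CREATE or IN_MOVED_TO event whose mask has IN_ISDIR set.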

Note that the event queue can overflow. In this case, events are
lost. Robust applications should handle the possibility of lost
events gracefully. For example, it may be necessary to rebuild
part or all of the application cache. (One simple, but possibly
expensive, approach is to close the inotify file descriptor,
empty the cache, create a new inotify file descriptor, and then
re-create watches and cache entries for the objects to be moni‐
tored.)
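
Overflow is reported in-band as an event with IN_Q_OVERFLOW set in its mask (and wd set to -1). A small sketch of detecting it in a buffer already filled by read(2); the helper name buffer_has_overflow() is made up for this example, and what to rebuild afterwards is left to the application.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <sys/inotify.h>

/*
 * Walk a buffer filled by read(2) on an inotify file descriptor and
 * report whether it contains an IN_Q_OVERFLOW event.
 * Returns 1 if events were lost, 0 otherwise.
 */
static int buffer_has_overflow(const char *buf, size_t len)
{
    assert(buf != NULL);

    size_t off = 0;
    while (off + sizeof(struct inotify_event) <= len) {
        struct inotify_event ev;
        /* memcpy avoids unaligned access if buf is a plain char array */
        memcpy(&ev, buf + off, sizeof(ev));
        if (ev.mask & IN_Q_OVERFLOW)
            return 1;
        /* ev.len is the size of the (padded) name that follows */
        off += sizeof(ev) + ev.len;
    }
    return 0;
}
```

When this returns 1, the simple recovery described above applies: close the inotify file descriptor, empty the cache, and set everything up again.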

Dealing with rename() events
The IN_MOVED_FROM and IN_MOVED_TO events that are generated by
rename(2) are usually available as consecutive events when read‐
ing from the inotify file descriptor. However, this is not guar‐
anteed. If multiple processes are triggering events for moni‐
tored objects, then (on rare occasions) an arbitrary number of
other events may appear between the IN_MOVED_FROM and IN_MOVED_TO
events.

Matching up the IN_MOVED_FROM and IN_MOVED_TO event pair gener‐
ated by rename(2) is thus inherently racy. (Don't forget that if
an object is renamed outside of a monitored directory, there may
not even be an IN_MOVED_TO event.) Heuristic approaches (e.g.,
assume the events are always consecutive) can be used to ensure a
match in most cases, but will inevitably miss some cases, causing
the application to perceive the IN_MOVED_FROM and IN_MOVED_TO
events as being unrelated. If watch descriptors are destroyed
and re-created as a result, then those watch descriptors will be
inconsistent with the watch descriptors in any pending events.
(Re-creating the inotify file descriptor and rebuilding the cache
may be useful to deal with this scenario.)

Applications should also allow for the possibility that the
IN_MOVED_FROM event was the last event that could fit in the buf‐
fer returned by the current call to read(2), and the accompanying
IN_MOVED_TO event might be fetched only on the next read(2).


--
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/


2014-04-03 15:39:19

by Eric W. Biederman

Subject: Re: Things I wish I'd known about Inotify

"Michael Kerrisk (man-pages)" <[email protected]> writes:

> (To: == [the set of people I believe know a lot about inotify])
>
> Hello all,
>
> Lately, I've been studying the inotify API fairly thoroughly and
> realized that there's a very big gap between knowing what the system
> calls do versus using them to reliably and efficiently monitor the
> state of a set of filesystem objects.
>
> With that in mind, I've drafted some substantial additions to the
> inotify(7) man page. I would be very happy if folk on the "To:" list
> could comment on the text below, since I believe you all have a lot of
> practical experience with Inotify. (Of course, I also welcome comments
> from anyone else.) In particular, I would like comments on the
> accuracy of the various technical points (especially those relating to
> matching up related IN_MOVED_FROM and IN_MOVED_TO events), as well as
> pointers on any other pitfalls that the programmers should be wary of
> that should be added to the page.


Other pitfalls.

Inotify only reports events that a user-space program triggers through
the filesystem API. This means that inotify is of limited use for remote
filesystems, and that filesystems like proc and sys have no monitorable
events.

Eric


> Thanks,
>
> Michael
>
> Limitations and caveats
> The inotify API provides no information about the user or process
> that triggered the inotify event. In particular, there is no
> easy way for a process that is monitoring events via inotify to
> distinguish events that it triggers itself from those that are
> triggered by other processes.
>
> The inotify API identifies affected files by filename. However,
> by the time an application processes an inotify event, the file‐
> name may already have been deleted or renamed.
>
> The inotify API identifies events via watch descriptors. It is
> the application's responsibility to cache a mapping (if one is
> needed) between watch descriptors and pathnames. Be aware that
> directory renamings may affect multiple cached pathnames.
>
> Inotify monitoring of directories is not recursive: to monitor
> subdirectories under a directory, additional watches must be cre‐
> ated. This can take a significant amount time for large direc‐
> tory trees.
>
> If monitoring an entire directory subtree, and a new subdirectory
> is created in that tree or an existing directory is renamed into
> that tree, be aware that by the time you create a watch for the
> new subdirectory, new files (and subdirectories) may already
> exist inside the subdirectory. Therefore, you may want to scan
> the contents of the subdirectory immediately after adding the
> watch (and, if desired, recursively add watches for any subdirec‐
> tories that it contains).
>
> Note that the event queue can overflow. In this case, events are
> lost. Robust applications should handle the possibility of lost
> events gracefully. For example, it may be necessary to rebuild
> part or all of the application cache. (One simple, but possibly
> expensive, approach is to close the inotify file descriptor,
> empty the cache, create a new inotify file descriptor, and then
> re-create watches and cache entries for the objects to be moni‐
> tored.)
>
> Dealing with rename() events
> The IN_MOVED_FROM and IN_MOVED_TO events that are generated by
> rename(2) are usually available as consecutive events when read‐
> ing from the inotify file descriptor. However, this is not guar‐
> anteed. If multiple processes are triggering events for moni‐
> tored objects, then (on rare occasions) an arbitrary number of
> other events may appear between the IN_MOVED_FROM and IN_MOVED_TO
> events.
>
> Matching up the IN_MOVED_FROM and IN_MOVED_TO event pair gener‐
> ated by rename(2) is thus inherently racy. (Don't forget that if
> an object is renamed outside of a monitored directory, there may
> not even be an IN_MOVED_TO event.) Heuristic approaches (e.g.,
> assume the events are always consecutive) can be used to ensure a
> match in most cases, but will inevitably miss some cases, causing
> the application to perceive the IN_MOVED_FROM and IN_MOVED_TO
> events as being unrelated. If watch descriptors are destroyed
> and re-created as a result, then those watch descriptors will be
> inconsistent with the watch descriptors in any pending events.
> (Re-creating the inotify file descriptor and rebuilding the cache
> may be useful to deal with this scenario.)
>
> Applications should also allow for the possibility that the
> IN_MOVED_FROM event was the last event that could fit in the buf‐
> fer returned by the current call to read(2), and the accompanying
> IN_MOVED_TO event might be fetched only on the next read(2).

2014-04-03 20:52:42

by Jan Kara

Subject: Re: Things I wish I'd known about Inotify

On Thu 03-04-14 08:34:44, Michael Kerrisk (man-pages) wrote:
> Limitations and caveats
> The inotify API provides no information about the user or process
> that triggered the inotify event. In particular, there is no
> easy way for a process that is monitoring events via inotify to
> distinguish events that it triggers itself from those that are
> triggered by other processes.
>
> The inotify API identifies affected files by filename. However,
> by the time an application processes an inotify event, the file‐
> name may already have been deleted or renamed.
>
> The inotify API identifies events via watch descriptors. It is
> the application's responsibility to cache a mapping (if one is
> needed) between watch descriptors and pathnames. Be aware that
> directory renamings may affect multiple cached pathnames.
>
> Inotify monitoring of directories is not recursive: to monitor
> subdirectories under a directory, additional watches must be cre‐
> ated. This can take a significant amount time for large direc‐
> tory trees.
And also there's a problem with the limit on the number of watches a user
can have.

> If monitoring an entire directory subtree, and a new subdirectory
> is created in that tree or an existing directory is renamed into
> that tree, be aware that by the time you create a watch for the
> new subdirectory, new files (and subdirectories) may already
> exist inside the subdirectory. Therefore, you may want to scan
> the contents of the subdirectory immediately after adding the
> watch (and, if desired, recursively add watches for any subdirec‐
> tories that it contains).
>
> Note that the event queue can overflow. In this case, events are
> lost. Robust applications should handle the possibility of lost
> events gracefully. For example, it may be necessary to rebuild
> part or all of the application cache. (One simple, but possibly
> expensive, approach is to close the inotify file descriptor,
> empty the cache, create a new inotify file descriptor, and then
> re-create watches and cache entries for the objects to be moni‐
> tored.)
>
> Dealing with rename() events
> The IN_MOVED_FROM and IN_MOVED_TO events that are generated by
> rename(2) are usually available as consecutive events when read‐
> ing from the inotify file descriptor. However, this is not guar‐
> anteed. If multiple processes are triggering events for moni‐
> tored objects, then (on rare occasions) an arbitrary number of
> other events may appear between the IN_MOVED_FROM and IN_MOVED_TO
> events.
>
> Matching up the IN_MOVED_FROM and IN_MOVED_TO event pair gener‐
> ated by rename(2) is thus inherently racy. (Don't forget that if
> an object is renamed outside of a monitored directory, there may
> not even be an IN_MOVED_TO event.) Heuristic approaches (e.g.,
> assume the events are always consecutive) can be used to ensure a
> match in most cases, but will inevitably miss some cases, causing
> the application to perceive the IN_MOVED_FROM and IN_MOVED_TO
> events as being unrelated. If watch descriptors are destroyed
> and re-created as a result, then those watch descriptors will be
> inconsistent with the watch descriptors in any pending events.
> (Re-creating the inotify file descriptor and rebuilding the cache
> may be useful to deal with this scenario.)
Well, but there's 'cookie' value meant exactly for matching up
IN_MOVED_FROM and IN_MOVED_TO events. And 'cookie' is guaranteed to be
unique at least within the inotify instance (in fact currently it is unique
within the whole system but I don't think we want to give that promise).

> Applications should also allow for the possibility that the
> IN_MOVED_FROM event was the last event that could fit in the buf‐
> fer returned by the current call to read(2), and the accompanying
> IN_MOVED_TO event might be fetched only on the next read(2).

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR

Subject: Re: Things I wish I'd known about Inotify

On 04/03/2014 10:52 PM, Jan Kara wrote:
> On Thu 03-04-14 08:34:44, Michael Kerrisk (man-pages) wrote:
>> Limitations and caveats
>> The inotify API provides no information about the user or process
>> that triggered the inotify event. In particular, there is no
>> easy way for a process that is monitoring events via inotify to
>> distinguish events that it triggers itself from those that are
>> triggered by other processes.
>>
>> The inotify API identifies affected files by filename. However,
>> by the time an application processes an inotify event, the file‐
>> name may already have been deleted or renamed.
>>
>> The inotify API identifies events via watch descriptors. It is
>> the application's responsibility to cache a mapping (if one is
>> needed) between watch descriptors and pathnames. Be aware that
>> directory renamings may affect multiple cached pathnames.
>>
>> Inotify monitoring of directories is not recursive: to monitor
>> subdirectories under a directory, additional watches must be cre‐
>> ated. This can take a significant amount time for large direc‐
>> tory trees.
> And also there's a problem with the limit on the number of watches a user
> can have.

What is the problem exactly (given that the limit is configurable)?

>> If monitoring an entire directory subtree, and a new subdirectory
>> is created in that tree or an existing directory is renamed into
>> that tree, be aware that by the time you create a watch for the
>> new subdirectory, new files (and subdirectories) may already
>> exist inside the subdirectory. Therefore, you may want to scan
>> the contents of the subdirectory immediately after adding the
>> watch (and, if desired, recursively add watches for any subdirec‐
>> tories that it contains).
>>
>> Note that the event queue can overflow. In this case, events are
>> lost. Robust applications should handle the possibility of lost
>> events gracefully. For example, it may be necessary to rebuild
>> part or all of the application cache. (One simple, but possibly
>> expensive, approach is to close the inotify file descriptor,
>> empty the cache, create a new inotify file descriptor, and then
>> re-create watches and cache entries for the objects to be moni‐
>> tored.)
>>
>> Dealing with rename() events
>> The IN_MOVED_FROM and IN_MOVED_TO events that are generated by
>> rename(2) are usually available as consecutive events when read‐
>> ing from the inotify file descriptor. However, this is not guar‐
>> anteed. If multiple processes are triggering events for moni‐
>> tored objects, then (on rare occasions) an arbitrary number of
>> other events may appear between the IN_MOVED_FROM and IN_MOVED_TO
>> events.
>>
>> Matching up the IN_MOVED_FROM and IN_MOVED_TO event pair gener‐
>> ated by rename(2) is thus inherently racy. (Don't forget that if
>> an object is renamed outside of a monitored directory, there may
>> not even be an IN_MOVED_TO event.) Heuristic approaches (e.g.,
>> assume the events are always consecutive) can be used to ensure a
>> match in most cases, but will inevitably miss some cases, causing
>> the application to perceive the IN_MOVED_FROM and IN_MOVED_TO
>> events as being unrelated. If watch descriptors are destroyed
>> and re-created as a result, then those watch descriptors will be
>> inconsistent with the watch descriptors in any pending events.
>> (Re-creating the inotify file descriptor and rebuilding the cache
>> may be useful to deal with this scenario.)
> Well, but there's 'cookie' value meant exactly for matching up
> IN_MOVED_FROM and IN_MOVED_TO events. And 'cookie' is guaranteed to be
> unique at least within the inotify instance (in fact currently it is unique
> within the whole system but I don't think we want to give that promise).

Yes, that's already assumed by my discussion above (it's described elsewhere
in the page). But your comment makes me think I should add a few words to
remind the reader of that fact. I'll do that.

But, the point is that even with the cookie, matching the events is
nontrivial, since:

* There may not even be an IN_MOVED_FROM event
* There may be an arbitrary number of other events in between the
IN_MOVED_FROM and the IN_MOVED_TO.

Therefore, one has to use heuristic approaches such as "allow at least
N milliseconds" or "check the next N events" to see if there is an
IN_MOVED_FROM that matches the IN_MOVED_TO. I can't see any way around
that being inherently racy. (It's unfortunate that the kernel can't
provide a guarantee that the two events are always consecutive, since
that would simplify user space's life considerably.)
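
To make the "allow at least N milliseconds" heuristic concrete, here is a rough sketch of a table of unmatched IN_MOVED_FROM events keyed by cookie. All names (move_from(), move_to(), expire_moves()) and the table sizes are hypothetical. Timestamps are supplied by the caller, for example from clock_gettime(CLOCK_MONOTONIC). Because the table lives across read(2) calls, it also handles a pair that is split over two buffers. It remains a heuristic: a late IN_MOVED_TO whose partner has already expired is seen as an unrelated event.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define MAX_PENDING 64     /* assumed bound on unmatched IN_MOVED_FROM events */
#define NAME_BUF    256    /* NAME_MAX + 1 on Linux */

struct pending_move {
    uint32_t cookie;       /* cookie from the IN_MOVED_FROM event */
    int      wd;           /* watch descriptor of the source directory */
    long     seen_ms;      /* caller-supplied timestamp, in milliseconds */
    char     name[NAME_BUF];
};

static struct pending_move pending[MAX_PENDING];
static int npending;

/* Remember an IN_MOVED_FROM until its partner shows up.
 * Returns 0, or -1 if the table is full or the name is too long. */
static int move_from(uint32_t cookie, int wd, const char *name, long now_ms)
{
    assert(name != NULL);
    if (npending == MAX_PENDING || strlen(name) >= NAME_BUF)
        return -1;
    struct pending_move *p = &pending[npending++];
    p->cookie = cookie;
    p->wd = wd;
    p->seen_ms = now_ms;
    strcpy(p->name, name);
    return 0;
}

/* On IN_MOVED_TO: find and remove the IN_MOVED_FROM with the same cookie.
 * Returns 1 and fills *out on a match; 0 if there is none, in which case
 * the object arrived from an unwatched location (or the partner expired). */
static int move_to(uint32_t cookie, struct pending_move *out)
{
    for (int i = 0; i < npending; i++) {
        if (pending[i].cookie == cookie) {
            *out = pending[i];
            pending[i] = pending[--npending];   /* swap-remove */
            return 1;
        }
    }
    return 0;
}

/* Drop entries that have waited at least timeout_ms. The caller treats
 * each one as a move out of the watched area. Returns how many expired. */
static int expire_moves(long now_ms, long timeout_ms)
{
    int expired = 0;
    for (int i = 0; i < npending; ) {
        if (now_ms - pending[i].seen_ms >= timeout_ms) {
            pending[i] = pending[--npending];
            expired++;
        } else {
            i++;
        }
    }
    return expired;
}
```

A typical loop would call move_from() or move_to() per event and expire_moves() after each batch, or when poll(2) times out with nothing to read.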

Cheers,

Michael


>> Applications should also allow for the possibility that the
>> IN_MOVED_FROM event was the last event that could fit in the buf‐
>> fer returned by the current call to read(2), and the accompanying
>> IN_MOVED_TO event might be fetched only on the next read(2).
>
> Honza
>


--
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/

Subject: Re: Things I wish I'd known about Inotify

[CC += Al Viro & Linus, since they also discussed the point about
remote filesystems and /proc and /sys here:
http://thread.gmane.org/gmane.linux.file-systems/83641/focus=83713 .]

On 04/03/2014 05:38 PM, Eric W. Biederman wrote:
> "Michael Kerrisk (man-pages)" <[email protected]> writes:
>
>> (To: == [the set of people I believe know a lot about inotify])
>>
>> Hello all,
>>
>> Lately, I've been studying the inotify API fairly thoroughly and
>> realized that there's a very big gap between knowing what the system
>> calls do versus using them to reliably and efficiently monitor the
>> state of a set of filesystem objects.
>>
>> With that in mind, I've drafted some substantial additions to the
>> inotify(7) man page. I would be very happy if folk on the "To:" list
>> could comment on the text below, since I believe you all have a lot of
>> practical experience with Inotify. (Of course, I also welcome comments
>> from anyone else.) In particular, I would like comments on the
>> accuracy of the various technical points (especially those relating to
>> matching up related IN_MOVED_FROM and IN_MOVED_TO events), as well as
>> pointers on any other pitfalls that the programmers should be wary of
>> that should be added to the page.
>
>
> Other pitfalls.
>
> Inotify only report events that a user space program triggers through
> the filesystem API. Which means inotify is limited for remote
> filesystems, and filesystems like proc and sys have no monitorable

Good point. I recently got CCed on that very point, but hadn't
added it to the page. I've added it now.

Revised text below, after incorporating changes from your comments and those
of Jan Kara.

Cheers,

Michael


Limitations and caveats
The inotify API provides no information about the user or process
that triggered the inotify event. In particular, there is no
easy way for a process that is monitoring events via inotify to
distinguish events that it triggers itself from those that are
triggered by other processes.

Inotify reports only events that a user-space program triggers
through the filesystem API. As a result, it does not catch
remote events that occur on network filesystems. (Applications
must fall back to polling the filesystem to catch such events.)
Furthermore, various virtual filesystems such as /proc, /sys, and
/dev/pts are not monitorable with inotify.

The inotify API identifies affected files by filename. However,
by the time an application processes an inotify event, the file‐
name may already have been deleted or renamed.

The inotify API identifies events via watch descriptors. It is
the application's responsibility to cache a mapping (if one is
needed) between watch descriptors and pathnames. Be aware that
directory renamings may affect multiple cached pathnames.

Inotify monitoring of directories is not recursive: to monitor
subdirectories under a directory, additional watches must be
created. This can take a significant amount of time for large
directory trees.

If monitoring an entire directory subtree, and a new subdirectory
is created in that tree or an existing directory is renamed into
that tree, be aware that by the time you create a watch for the
new subdirectory, new files (and subdirectories) may already
exist inside the subdirectory. Therefore, you may want to scan
the contents of the subdirectory immediately after adding the
watch (and, if desired, recursively add watches for any subdirec‐
tories that it contains).

Note that the event queue can overflow. In this case, events are
lost. Robust applications should handle the possibility of lost
events gracefully. For example, it may be necessary to rebuild
part or all of the application cache. (One simple, but possibly
expensive, approach is to close the inotify file descriptor,
empty the cache, create a new inotify file descriptor, and then
re-create watches and cache entries for the objects to be moni‐
tored.)

Dealing with rename() events
As noted above, the IN_MOVED_FROM and IN_MOVED_TO event pair that
is generated by rename(2) can be matched up via their shared
cookie value. However, the task of matching has some challenges.

These two events are usually consecutive in the event stream
available when reading from the inotify file descriptor. How‐
ever, this is not guaranteed. If multiple processes are trigger‐
ing events for monitored objects, then (on rare occasions) an
arbitrary number of other events may appear between the
IN_MOVED_FROM and IN_MOVED_TO events.

Matching up the IN_MOVED_FROM and IN_MOVED_TO event pair gener‐
ated by rename(2) is thus inherently racy. (Don't forget that if
an object is renamed outside of a monitored directory, there may
not even be an IN_MOVED_TO event.) Heuristic approaches (e.g.,
assume the events are always consecutive) can be used to ensure a
match in most cases, but will inevitably miss some cases, causing
the application to perceive the IN_MOVED_FROM and IN_MOVED_TO
events as being unrelated. If watch descriptors are destroyed
and re-created as a result, then those watch descriptors will be
inconsistent with the watch descriptors in any pending events.
(Re-creating the inotify file descriptor and rebuilding the cache
may be useful to deal with this scenario.)

Applications should also allow for the possibility that the
IN_MOVED_FROM event was the last event that could fit in the buf‐
fer returned by the current call to read(2), and the accompanying
IN_MOVED_TO event might be fetched only on the next read(2).


--
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/

2014-04-04 12:43:48

by Jan Kara

Subject: Re: Things I wish I'd known about Inotify

On Fri 04-04-14 09:35:50, Michael Kerrisk (man-pages) wrote:
> On 04/03/2014 10:52 PM, Jan Kara wrote:
> > On Thu 03-04-14 08:34:44, Michael Kerrisk (man-pages) wrote:
> >> Limitations and caveats
> >> The inotify API provides no information about the user or process
> >> that triggered the inotify event. In particular, there is no
> >> easy way for a process that is monitoring events via inotify to
> >> distinguish events that it triggers itself from those that are
> >> triggered by other processes.
> >>
> >> The inotify API identifies affected files by filename. However,
> >> by the time an application processes an inotify event, the file‐
> >> name may already have been deleted or renamed.
> >>
> >> The inotify API identifies events via watch descriptors. It is
> >> the application's responsibility to cache a mapping (if one is
> >> needed) between watch descriptors and pathnames. Be aware that
> >> directory renamings may affect multiple cached pathnames.
> >>
> >> Inotify monitoring of directories is not recursive: to monitor
> >> subdirectories under a directory, additional watches must be cre‐
> >> ated. This can take a significant amount time for large direc‐
> >> tory trees.
> > And also there's a problem with the limit on the number of watches a user
> > can have.
>
> What is the problem exactly (given that the limit is configurable)?
Well, if you want to watch the whole home directory and you have a large
one, you may run into that limit. Sure, you can ask the sysadmin to raise
the limit, but it's a bit annoying.
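
As an illustration of where an application meets that limit: the current value can be read from /proc/sys/fs/inotify/max_user_watches, and inotify_add_watch(2) fails with ENOSPC once it is reached. The two helper names and the -2 return convention below are assumptions made for this sketch only.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <sys/inotify.h>

/* Read the per-user watch limit; returns -1 if it cannot be read. */
static long max_user_watches(void)
{
    long limit = -1;
    FILE *f = fopen("/proc/sys/fs/inotify/max_user_watches", "r");
    if (f == NULL)
        return -1;
    if (fscanf(f, "%ld", &limit) != 1)
        limit = -1;
    fclose(f);
    return limit;
}

/*
 * Add a watch and classify the failure. Returns the watch descriptor,
 * -1 on an ordinary error (errno is left set), or -2 if the call failed
 * with ENOSPC, which is how a reached watch limit is reported.
 */
static int add_watch_checked(int ifd, const char *path, unsigned int mask)
{
    assert(path != NULL);

    int wd = inotify_add_watch(ifd, path, mask);
    if (wd == -1 && errno == ENOSPC)
        return -2;
    return wd;
}
```

On a -2 result the application can stop adding watches and tell the user which limit to raise, rather than silently monitoring only part of the tree.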

> >> If monitoring an entire directory subtree, and a new subdirectory
> >> is created in that tree or an existing directory is renamed into
> >> that tree, be aware that by the time you create a watch for the
> >> new subdirectory, new files (and subdirectories) may already
> >> exist inside the subdirectory. Therefore, you may want to scan
> >> the contents of the subdirectory immediately after adding the
> >> watch (and, if desired, recursively add watches for any subdirec‐
> >> tories that it contains).
> >>
> >> Note that the event queue can overflow. In this case, events are
> >> lost. Robust applications should handle the possibility of lost
> >> events gracefully. For example, it may be necessary to rebuild
> >> part or all of the application cache. (One simple, but possibly
> >> expensive, approach is to close the inotify file descriptor,
> >> empty the cache, create a new inotify file descriptor, and then
> >> re-create watches and cache entries for the objects to be moni‐
> >> tored.)
> >>
> >> Dealing with rename() events
> >> The IN_MOVED_FROM and IN_MOVED_TO events that are generated by
> >> rename(2) are usually available as consecutive events when read‐
> >> ing from the inotify file descriptor. However, this is not guar‐
> >> anteed. If multiple processes are triggering events for moni‐
> >> tored objects, then (on rare occasions) an arbitrary number of
> >> other events may appear between the IN_MOVED_FROM and IN_MOVED_TO
> >> events.
> >>
> >> Matching up the IN_MOVED_FROM and IN_MOVED_TO event pair gener‐
> >> ated by rename(2) is thus inherently racy. (Don't forget that if
> >> an object is renamed outside of a monitored directory, there may
> >> not even be an IN_MOVED_TO event.) Heuristic approaches (e.g.,
> >> assume the events are always consecutive) can be used to ensure a
> >> match in most cases, but will inevitably miss some cases, causing
> >> the application to perceive the IN_MOVED_FROM and IN_MOVED_TO
> >> events as being unrelated. If watch descriptors are destroyed
> >> and re-created as a result, then those watch descriptors will be
> >> inconsistent with the watch descriptors in any pending events.
> >> (Re-creating the inotify file descriptor and rebuilding the cache
> >> may be useful to deal with this scenario.)
> > Well, but there's 'cookie' value meant exactly for matching up
> > IN_MOVED_FROM and IN_MOVED_TO events. And 'cookie' is guaranteed to be
> > unique at least within the inotify instance (in fact currently it is unique
> > within the whole system but I don't think we want to give that promise).
>
> Yes, that's already assumed by my discussion above (its described elsewhere
> in the page). But your comment makes me think I should add a few words to
> remind the reader of that fact. I'll do that.
Yes, that would be good.

> But, the point is that even with the cookie, matching the events is
> nontrivial, since:
>
> * There may not even be an IN_MOVED_FROM event
> * There may be an arbitrary number of other events in between the
> IN_MOVED_FROM and the IN_MOVED_TO.
>
> Therefore, one has to use heuristic approaches such as "allow at least
> N millisconds" or "check the next N events" to see if there is an
> IN_MOVED_FROM that matches the IN_MOVED_TO. I can't see any way around
> that being inherently racy. (It's unfortunate that the kernel can't
> provide a guarantee that the two events are always consecutive, since
> that would simply user space's life considerably.)
Yeah, it's unpleasant, but doing that would be quite costly/complex on the
kernel side. And in the worst case the race would lead to the application
thinking that one file was moved outside of the watched area and another
file was moved somewhere else inside the watched area. So the application
will possibly have to inspect that file. That doesn't seem too bad.

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR

2014-04-04 13:00:36

by David Herrmann

Subject: Re: Things I wish I'd known about Inotify

Hi

On Thu, Apr 3, 2014 at 8:34 AM, Michael Kerrisk (man-pages)
<[email protected]> wrote:
> With that in mind, I've drafted some substantial additions to the
> inotify(7) man page. I would be very happy if folk on the "To:" list
> could comment on the text below, since I believe you all have a lot of
> practical experience with Inotify. (Of course, I also welcome comments
> from anyone else.) In particular, I would like comments on the
> accuracy of the various technical points (especially those relating to
> matching up related IN_MOVED_FROM and IN_MOVED_TO events), as well as
> pointers on any other pitfalls that the programmers should be wary of
> that should be added to the page.

1)
IN_IGNORED is async and _immediate_ in case a file got deleted. So if
you use watch-descriptors as keys for your objects, an _already_ used
key might be returned by inotify_add_watch() if an IN_IGNORED is
queued for the old watch (which implicitly destroys the watch). Once
you read the IN_IGNORED from the queue, there is usually no way to
know whether it's generated by the old watch or by the new. The
man-page mentions this in:
"IN_IGNORED: Watch was removed explicitly (inotify_rm_watch(2)) or
automatically (file was deleted, or filesystem was unmounted)."
I think we should add a note to BUGS that mentions this race (which is
really not obvious from the description).

This race could be fixed by requiring an explicit inotify_rm_watch()
if an IN_IGNORED was generated asynchronously.

2)
inotify_add_watch() is based on inodes. So if you call it on
hardlinks, you will modify the existing watch instead of creating a
new one. This is often really annoying and I think an IN_FORCE_NEW
flag that disables this would be really nice. Imagine the following
code:

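/* mask is the event mask passed to every call */
wd1 = inotify_add_watch(fd, A, mask);
wd2 = inotify_add_watch(fd, B, mask);   /* same inode as A => wd2 == wd1 */
...
inotify_rm_watch(fd, wd2);              /* if wd2 == wd1, A is unwatched too */
wd3 = inotify_add_watch(fd, C, mask);   /* might get the freed number */
...
inotify_rm_watch(fd, wd1);              /* if wd3 == wd1, this removes wd3 */
...
inotify_rm_watch(fd, wd3);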

If A and B are hardlinks to the same file, then wd1==wd2. Therefore,
after wd2 was removed, we _might_ end up with wd3==wd1 and thus remove
wd3 early (which obviously is not intended). So simple code like this
doesn't work. You have to verify whether returned handles are
duplicates or new. An IN_FORCE_NEW flag would really help here.
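
One possible way to do that verification in user space, absent such a flag, is to reference-count watch descriptors. This is only a sketch: wd_acquire(), wd_release() and the table size are hypothetical names and values. Note also that a second inotify_add_watch() on the same inode replaces the existing mask unless IN_MASK_ADD is given, which this sketch does not address.

```c
#include <assert.h>

#define MAX_WD 1024   /* assumed bound on distinct watch descriptors */

struct wd_ref {
    int wd;           /* watch descriptor */
    int count;        /* how many logical users share it */
};

static struct wd_ref table[MAX_WD];
static int nused;

/* Call after every successful inotify_add_watch().
 * Returns the new reference count (1 means the wd is genuinely new),
 * or -1 if the table is full. */
static int wd_acquire(int wd)
{
    assert(wd >= 0);
    for (int i = 0; i < nused; i++)
        if (table[i].wd == wd)
            return ++table[i].count;
    if (nused == MAX_WD)
        return -1;
    table[nused].wd = wd;
    table[nused].count = 1;
    nused++;
    return 1;
}

/* Call in place of inotify_rm_watch(). Returns the remaining count;
 * the caller invokes inotify_rm_watch() only when this returns 0.
 * Returns -1 if the wd is not in the table. */
static int wd_release(int wd)
{
    for (int i = 0; i < nused; i++) {
        if (table[i].wd != wd)
            continue;
        int left = --table[i].count;
        if (left == 0)
            table[i] = table[--nused];   /* swap-remove the empty slot */
        return left;
    }
    return -1;
}
```

In the sequence above, the watch on B would bump the count for wd1 to 2, the first removal would only drop it back to 1, and the kernel watch would stay in place until the watch on A is released as well.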

Thanks
David

2014-04-04 13:08:40

by David Herrmann

Subject: Re: Things I wish I'd known about Inotify

Hi

On Fri, Apr 4, 2014 at 3:00 PM, David Herrmann wrote:
> 1)
> IN_IGNORED is async and _immediate_ in case a file got deleted. So if
> you use watch-descriptors as keys for your objects, an _already_ used
> key might be returned by inotify_add_watch() if an IN_IGNORED is
> queued for the old watch (which implicitly destroys the watch). Once
> you read the IN_IGNORED from the queue, there is usually no way to
> know whether it's generated by the old watch or by the new. The
> man-page mentions this in:
> "IN_IGNORED: Watch was removed explicitly (inotify_rm_watch(2)) or
> automatically (file was deleted, or filesystem was unmounted)."
> I think we should add a note to BUGS that mentions this race (which is
> really not obvious from the description).
>
> This race could be fixed by requiring an explicit inotify_rm_watch()
> if an IN_IGNORED was generated asynchronously.
>
> 2)
> inotify_add_watch() is based on inodes. So if you call it on
> hardlinks, you will modify the existing watch instead of creating a
> new one. This is often really annoying and I think an IN_FORCE_NEW
> flag that disables this would be really nice. Imagine the following
> code:
>
> wd1 = inotify_add_watch(fd, A);
> wd2 = inotify_add_watch(fd, B);
> ...
> inotify_rm_watch(fd, wd2);
> wd3 = inotify_add_watch(fd, C);
> ...
> inotify_rm_watch(fd, wd1);
> ...
> inotify_rm_watch(fd, wd3);
>
> If A and B are hardlinks to the same file, then wd1==wd2. Therefore,
> after wd2 was removed, we _might_ end up with wd3==wd1 and thus remove
> wd3 early (which obviously is not intended). So simple code like this
> doesn't work. You have to verify whether returned handles are
> duplicates or new. An IN_FORCE_NEW flag would really help here.

Note that both of these races rely on watch descriptors being reused
after they are freed. It turns out that this was "fixed" almost exactly
one year ago in:

commit a66c04b4534f9b25e1241dff9a9d94dff9fd66f8
Author: Jeff Layton
Date: Mon Apr 29 16:21:21 2013 -0700

inotify: convert inotify_add_to_idr() to use idr_alloc_cyclic()

So, assuming that commit was never backported, only older kernels are
affected. In newer kernels, wd reuse is quite unlikely. The races are
still there, though.

Thanks
David

2014-04-04 15:48:40

by Eric Paris

Subject: Re: Things I wish I'd known about Inotify

On Fri, 2014-04-04 at 15:00 +0200, David Herrmann wrote:

> 1)
> IN_IGNORED is async and _immediate_ in case a file got deleted. So if
> you use watch-descriptors as keys for your objects, an _already_ used
> key might be returned by inotify_add_watch() if an IN_IGNORED is
> queued for the old watch (which implicitly destroys the watch). Once
> you read the IN_IGNORED from the queue, there is usually no way to
> know whether it's generated by the old watch or by the new. The
> man-page mentions this in:
> "IN_IGNORED: Watch was removed explicitly (inotify_rm_watch(2)) or
> automatically (file was deleted, or filesystem was unmounted)."
> I think we should add a note to BUGS that mentions this race (which is
> really not obvious from the description).
>
> This race could be fixed by requiring an explicit inotify_rm_watch()
> if an IN_IGNORED was generated asynchronously.

For a brief while after the introduction of fsnotify this was a problem,
but not before then, or on anything remotely recent (like 4-5 years?).
We didn't re-use watch descriptors at all, so if you get a notification
after the IGNORED, it's still the old one. Today it's possible to wrap
around at INT_MAX and reuse, but that is a teeny tiny issue...

----

Note that both of these races rely on watch-descriptors being reused
after they were freed. Turns out, that was "fixed" about exactly 1
year ago in:

commit a66c04b4534f9b25e1241dff9a9d94dff9fd66f8
Author: Jeff Layton
Date: Mon Apr 29 16:21:21 2013 -0700

inotify: convert inotify_add_to_idr() to use idr_alloc_cyclic()

So in case that was never backported, only older kernels are affected.
In newer kernels, wd reuse is quite unlikely. The races are still
there, though.

----

Actually that has nothing to do with it. If anything, it reintroduces
the reuse since now it wraps instead of failing...

2014-04-04 20:24:06

by Stef Bon

Subject: Re: Things I wish I'd known about Inotify

2014-04-03 17:38 GMT+02:00 Eric W. Biederman:

>
> Other pitfalls.
>
> Inotify only reports events that a user space program triggers through
> the filesystem API. Which means inotify is limited for remote
> filesystems

I'm working on enabling fsnotify for fuse. Sending a watch/mask
to userspace works, and the fuse filesystem can forward the watch to
its backend.
Reporting delete events back to the kernel works with the help of one
existing call, fuse_lowlevel_notify_delete. There is no existing call
to report a create, so I'm working on that.

(At the moment I'm trying to enable the simplest events: delete
and create of an entry in a watched directory.)

Stef Bon

Subject: Re: Things I wish I'd known about Inotify

On 04/04/2014 02:43 PM, Jan Kara wrote:
> On Fri 04-04-14 09:35:50, Michael Kerrisk (man-pages) wrote:
>> On 04/03/2014 10:52 PM, Jan Kara wrote:
>>> On Thu 03-04-14 08:34:44, Michael Kerrisk (man-pages) wrote:

[...]

>>>> Dealing with rename() events
>>>> The IN_MOVED_FROM and IN_MOVED_TO events that are generated by
>>>> rename(2) are usually available as consecutive events when read‐
>>>> ing from the inotify file descriptor. However, this is not guar‐
>>>> anteed. If multiple processes are triggering events for moni‐
>>>> tored objects, then (on rare occasions) an arbitrary number of
>>>> other events may appear between the IN_MOVED_FROM and IN_MOVED_TO
>>>> events.
>>>>
>>>> Matching up the IN_MOVED_FROM and IN_MOVED_TO event pair gener‐
>>>> ated by rename(2) is thus inherently racy. (Don't forget that if
>>>> an object is renamed outside of a monitored directory, there may
>>>> not even be an IN_MOVED_TO event.) Heuristic approaches (e.g.,
>>>> assume the events are always consecutive) can be used to ensure a
>>>> match in most cases, but will inevitably miss some cases, causing
>>>> the application to perceive the IN_MOVED_FROM and IN_MOVED_TO
>>>> events as being unrelated. If watch descriptors are destroyed
>>>> and re-created as a result, then those watch descriptors will be
>>>> inconsistent with the watch descriptors in any pending events.
>>>> (Re-creating the inotify file descriptor and rebuilding the cache
>>>> may be useful to deal with this scenario.)
>>> Well, but there's 'cookie' value meant exactly for matching up
>>> IN_MOVED_FROM and IN_MOVED_TO events. And 'cookie' is guaranteed to be
>>> unique at least within the inotify instance (in fact currently it is unique
>>> within the whole system but I don't think we want to give that promise).
>>
>> Yes, that's already assumed by my discussion above (it's described elsewhere
>> in the page). But your comment makes me think I should add a few words to
>> remind the reader of that fact. I'll do that.
> Yes, that would be good.
>
>> But, the point is that even with the cookie, matching the events is
>> nontrivial, since:
>>
>> * There may not even be an IN_MOVED_FROM event
>> * There may be an arbitrary number of other events in between the
>> IN_MOVED_FROM and the IN_MOVED_TO.
>>
>> Therefore, one has to use heuristic approaches such as "allow at least
>> N milliseconds" or "check the next N events" to see if there is an
>> IN_MOVED_FROM that matches the IN_MOVED_TO. I can't see any way around
>> that being inherently racy. (It's unfortunate that the kernel can't
>> provide a guarantee that the two events are always consecutive, since
>> that would simplify user space's life considerably.)

> Yeah, it's unpleasant but doing that would be quite costly/complex at the
> kernel side.

Yep, I imagined that was probably the reason.

> And the race would in the worst case lead to application
> thinking there's been file moved outside of watched area & a file moved
> somewhere else inside the watched area. So the application will have to
> possibly inspect that file. That doesn't seem too bad.

It's actually very bad. See the text above. The point is that one likely
treatment on an IN_MOVED_FROM event that has no IN_MOVED_TO is to remove
the watches for the moved out subtree. If it turns out that this really
was just a rename(), then on the IN_MOVED_TO, the watches will be recreated
*with different watch descriptors*, thus invalidating the watch descriptors
in any queued but as yet unprocessed inotify events. See what I mean?
That's quite painful for user space.
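
The "check the next N events" heuristic discussed above could be sketched roughly as follows. This is only an illustration of the bookkeeping, not a fix for the race: an IN_MOVED_TO that arrives later than MAX_AGE events after its IN_MOVED_FROM is still misclassified as a move out followed by an unrelated move in. The names mv_state, mv_init, mv_feed, MAX_PENDING and MAX_AGE are hypothetical.

```c
#include <stdint.h>
#include <string.h>
#include <sys/inotify.h>

#define MAX_PENDING 16   /* hypothetical capacity of the pending table */
#define MAX_AGE      4   /* the "N" in "check the next N events" */

struct pending {
    uint32_t cookie;     /* cookie of an unmatched IN_MOVED_FROM */
    int age;             /* events seen since that IN_MOVED_FROM */
    int used;
};

struct mv_state {
    struct pending slot[MAX_PENDING];
    int renames;         /* FROM/TO pairs matched by cookie */
    int moved_out;       /* IN_MOVED_FROM with no match within MAX_AGE */
    int moved_in;        /* IN_MOVED_TO with no pending IN_MOVED_FROM */
};

void mv_init(struct mv_state *st)
{
    memset(st, 0, sizeof(*st));
}

/* Feed the mask and cookie of each event, in queue order. */
void mv_feed(struct mv_state *st, uint32_t mask, uint32_t cookie)
{
    int i;

    /* Age every pending IN_MOVED_FROM; give up on those that are too old. */
    for (i = 0; i < MAX_PENDING; i++) {
        if (st->slot[i].used && ++st->slot[i].age > MAX_AGE) {
            st->slot[i].used = 0;
            st->moved_out++;
        }
    }

    if (mask & IN_MOVED_FROM) {
        for (i = 0; i < MAX_PENDING; i++) {
            if (!st->slot[i].used) {
                st->slot[i].cookie = cookie;
                st->slot[i].age = 0;
                st->slot[i].used = 1;
                return;
            }
        }
        st->moved_out++;             /* table full: treat as a move out */
    } else if (mask & IN_MOVED_TO) {
        for (i = 0; i < MAX_PENDING; i++) {
            if (st->slot[i].used && st->slot[i].cookie == cookie) {
                st->slot[i].used = 0;
                st->renames++;       /* same cookie: one rename() */
                return;
            }
        }
        st->moved_in++;              /* no partner seen (or it expired) */
    }
}
```

A real application would remove watches only when an entry expires (moved_out), not as soon as the IN_MOVED_FROM is read, and would also need a timeout to flush pending entries when the queue goes idle.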

Cheers,

Michael



--
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/

2014-04-07 09:31:59

by Jan Kara

Subject: Re: Things I wish I'd known about Inotify

On Sun 06-04-14 11:00:29, Michael Kerrisk (man-pages) wrote:
> On 04/04/2014 02:43 PM, Jan Kara wrote:
> > On Fri 04-04-14 09:35:50, Michael Kerrisk (man-pages) wrote:
> >> On 04/03/2014 10:52 PM, Jan Kara wrote:
> >>> On Thu 03-04-14 08:34:44, Michael Kerrisk (man-pages) wrote:
>
> [...]
>
> >>>> Dealing with rename() events
> >>>> The IN_MOVED_FROM and IN_MOVED_TO events that are generated by
> >>>> rename(2) are usually available as consecutive events when read‐
> >>>> ing from the inotify file descriptor. However, this is not guar‐
> >>>> anteed. If multiple processes are triggering events for moni‐
> >>>> tored objects, then (on rare occasions) an arbitrary number of
> >>>> other events may appear between the IN_MOVED_FROM and IN_MOVED_TO
> >>>> events.
> >>>>
> >>>> Matching up the IN_MOVED_FROM and IN_MOVED_TO event pair gener‐
> >>>> ated by rename(2) is thus inherently racy. (Don't forget that if
> >>>> an object is renamed outside of a monitored directory, there may
> >>>> not even be an IN_MOVED_TO event.) Heuristic approaches (e.g.,
> >>>> assume the events are always consecutive) can be used to ensure a
> >>>> match in most cases, but will inevitably miss some cases, causing
> >>>> the application to perceive the IN_MOVED_FROM and IN_MOVED_TO
> >>>> events as being unrelated. If watch descriptors are destroyed
> >>>> and re-created as a result, then those watch descriptors will be
> >>>> inconsistent with the watch descriptors in any pending events.
> >>>> (Re-creating the inotify file descriptor and rebuilding the cache
> >>>> may be useful to deal with this scenario.)
> >>> Well, but there's 'cookie' value meant exactly for matching up
> >>> IN_MOVED_FROM and IN_MOVED_TO events. And 'cookie' is guaranteed to be
> >>> unique at least within the inotify instance (in fact currently it is unique
> >>> within the whole system but I don't think we want to give that promise).
> >>
> >> Yes, that's already assumed by my discussion above (it's described elsewhere
> >> in the page). But your comment makes me think I should add a few words to
> >> remind the reader of that fact. I'll do that.
> > Yes, that would be good.
> >
> >> But, the point is that even with the cookie, matching the events is
> >> nontrivial, since:
> >>
> >> * There may not even be an IN_MOVED_FROM event
> >> * There may be an arbitrary number of other events in between the
> >> IN_MOVED_FROM and the IN_MOVED_TO.
> >>
> >> Therefore, one has to use heuristic approaches such as "allow at least
> >> N milliseconds" or "check the next N events" to see if there is an
> >> IN_MOVED_FROM that matches the IN_MOVED_TO. I can't see any way around
> >> that being inherently racy. (It's unfortunate that the kernel can't
> >> provide a guarantee that the two events are always consecutive, since
> >> that would simplify user space's life considerably.)
>
> > Yeah, it's unpleasant but doing that would be quite costly/complex at the
> > kernel side.
>
> Yep, I imagined that was probably the reason.
I had a look into that code again and it's all designed around the fact
that there's a single inode to notify. If you wanted to have atomic rename
notifications, you'd have to rewrite that to work with two inodes, finding
out whether these two inodes are actually watched by the same group or
not... Doable but complex. Alternatively you could just lock down the whole
notification subsystem while generating rename events. But that's rather
costly. I'm noting this just so that the complications are written down
somewhere in case someone wants to look into this in the future.

> > And the race would in the worst case lead to application
> > thinking there's been file moved outside of watched area & a file moved
> > somewhere else inside the watched area. So the application will have to
> > possibly inspect that file. That doesn't seem too bad.
>
> It's actually very bad. See the text above. The point is that one likely
> treatment on an IN_MOVED_FROM event that has no IN_MOVED_TO is to remove
> the watches for the moved out subtree. If it turns out that this really
> was just a rename(), then on the IN_MOVED_TO, the watches will be recreated
> *with different watch descriptors*, thus invalidating the watch descriptors
> in any queued but as yet unprocessed inotify events. See what I mean?
> That's quite painful for user space.
But if I understand it right, you lose only the information for recreated
watches. So you effectively lose all the information about what has
happened inside the subtree of the moved directory (or what has happened with
the moved file). But since you think it's a file / dir moved from outside
of the watched area, you have to fully rescan that file / dir anyway. Sure,
that's costly, but if your heuristic for detecting renames works 99.9% of
the time it should be OK, shouldn't it? And you have to have the code that
handles caching a file / dir written anyway for handling real moves from
outside of the watched hierarchy.

Don't get me wrong, I understand it would be easier for userspace to get
atomic rename notifications, I'm just trying to understand what exactly is
painful so that I can compare the cost at the kernel side with the cost at
the userspace side...

Honza
--
Jan Kara
SUSE Labs, CR

Subject: Re: Things I wish I'd known about Inotify

On 04/07/2014 11:31 AM, Jan Kara wrote:
> On Sun 06-04-14 11:00:29, Michael Kerrisk (man-pages) wrote:
>> On 04/04/2014 02:43 PM, Jan Kara wrote:
>>> On Fri 04-04-14 09:35:50, Michael Kerrisk (man-pages) wrote:
>>>> On 04/03/2014 10:52 PM, Jan Kara wrote:
>>>>> On Thu 03-04-14 08:34:44, Michael Kerrisk (man-pages) wrote:
>>
>> [...]
>>
>>>>>> Dealing with rename() events
>>>>>> The IN_MOVED_FROM and IN_MOVED_TO events that are generated by
>>>>>> rename(2) are usually available as consecutive events when read‐
>>>>>> ing from the inotify file descriptor. However, this is not guar‐
>>>>>> anteed. If multiple processes are triggering events for moni‐
>>>>>> tored objects, then (on rare occasions) an arbitrary number of
>>>>>> other events may appear between the IN_MOVED_FROM and IN_MOVED_TO
>>>>>> events.
>>>>>>
>>>>>> Matching up the IN_MOVED_FROM and IN_MOVED_TO event pair gener‐
>>>>>> ated by rename(2) is thus inherently racy. (Don't forget that if
>>>>>> an object is renamed outside of a monitored directory, there may
>>>>>> not even be an IN_MOVED_TO event.) Heuristic approaches (e.g.,
>>>>>> assume the events are always consecutive) can be used to ensure a
>>>>>> match in most cases, but will inevitably miss some cases, causing
>>>>>> the application to perceive the IN_MOVED_FROM and IN_MOVED_TO
>>>>>> events as being unrelated. If watch descriptors are destroyed
>>>>>> and re-created as a result, then those watch descriptors will be
>>>>>> inconsistent with the watch descriptors in any pending events.
>>>>>> (Re-creating the inotify file descriptor and rebuilding the cache
>>>>>> may be useful to deal with this scenario.)
>>>>> Well, but there's 'cookie' value meant exactly for matching up
>>>>> IN_MOVED_FROM and IN_MOVED_TO events. And 'cookie' is guaranteed to be
>>>>> unique at least within the inotify instance (in fact currently it is unique
>>>>> within the whole system but I don't think we want to give that promise).
>>>>
>>>> Yes, that's already assumed by my discussion above (it's described elsewhere
>>>> in the page). But your comment makes me think I should add a few words to
>>>> remind the reader of that fact. I'll do that.
>>> Yes, that would be good.
>>>
>>>> But, the point is that even with the cookie, matching the events is
>>>> nontrivial, since:
>>>>
>>>> * There may not even be an IN_MOVED_FROM event
>>>> * There may be an arbitrary number of other events in between the
>>>> IN_MOVED_FROM and the IN_MOVED_TO.
>>>>
>>>> Therefore, one has to use heuristic approaches such as "allow at least
>>>> N milliseconds" or "check the next N events" to see if there is an
>>>> IN_MOVED_FROM that matches the IN_MOVED_TO. I can't see any way around
>>>> that being inherently racy. (It's unfortunate that the kernel can't
>>>> provide a guarantee that the two events are always consecutive, since
>>>> that would simplify user space's life considerably.)
>>
>>> Yeah, it's unpleasant but doing that would be quite costly/complex at the
>>> kernel side.
>>
>> Yep, I imagined that was probably the reason.
> I had a look into that code again and it's all designed around the fact
> that there's a single inode to notify. If you wanted to have atomic rename
> notifications, you'd have to rewrite that to work with two inodes, finding
> out whether these two inodes are actually watched by the same group or
> not... Doable but complex. Alternatively you could just lock down the whole
> notification subsystem while generating rename events. But that's rather
> costly. I'm noting this just so that the complications are written down
> somewhere in case someone wants to look into this in the future.
>
>>> And the race would in the worst case lead to application
>>> thinking there's been file moved outside of watched area & a file moved
>>> somewhere else inside the watched area. So the application will have to
>>> possibly inspect that file. That doesn't seem too bad.
>>
>> It's actually very bad. See the text above. The point is that one likely
>> treatment on an IN_MOVED_FROM event that has no IN_MOVED_TO is to remove
>> the watches for the moved out subtree. If it turns out that this really
>> was just a rename(), then on the IN_MOVED_TO, the watches will be recreated
>> *with different watch descriptors*, thus invalidating the watch descriptors
>> in any queued but as yet unprocessed inotify events. See what I mean?
>> That's quite painful for user space.

Sorry for the late follow-up....

> But if I understand it right, you lose only the information for recreated
> watches. So you effectively lose all the information about what has
> happened inside the subtree of the moved directory (or what has happened with
> the moved file). But since you think it's a file / dir moved from outside
> of the watched area, you have to fully rescan that file / dir anyway.

Ack on your summary there.

> Sure,
> that's costly, but if your heuristic for detecting renames works 99.9% of
> the time it should be OK, shouldn't it? And you have to have the code that
> handles caching a file / dir written anyway for handling real moves from
> outside of the watched hierarchy.

And ack on that.

> Don't get me wrong, I understand it would be easier for userspace to get
> atomic rename notifications, I'm just trying to understand what exactly is
> painful so that I can compare the cost at the kernel side with the cost at
> the userspace side...

Yes, I was probably a little too strong in my statement. My perspective
is that I'd tried to write an (experimental) application that would track
*all* events for a file tree (modulo queue overflow), and then I
encountered the wall of "rename() events are not consecutive", which
basically rendered that task impossible because of the races involved.

All that you say above also fits with my understanding. I was just
(perhaps overly) disappointed to find that I couldn't (perfectly)
achieve the tracking task that I'd attempted. (And furthermore, of
course, the code became a bit more complicated to handle the
possibility that some queued events may be for watch descriptors
that are no longer valid.)

Thanks for your response.

Cheers,

Michael




--
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/

Subject: Re: Things I wish I'd known about Inotify

Late follow up on this thread..., since another question occurred in
discussions with Jake.

On Fri, Apr 4, 2014 at 2:43 PM, Jan Kara wrote:
> On Fri 04-04-14 09:35:50, Michael Kerrisk (man-pages) wrote:
>> On 04/03/2014 10:52 PM, Jan Kara wrote:
>> > On Thu 03-04-14 08:34:44, Michael Kerrisk (man-pages) wrote:
[...]
>> >> Dealing with rename() events
>> >> The IN_MOVED_FROM and IN_MOVED_TO events that are generated by
>> >> rename(2) are usually available as consecutive events when read‐
>> >> ing from the inotify file descriptor. However, this is not guar‐
>> >> anteed. If multiple processes are triggering events for moni‐
>> >> tored objects, then (on rare occasions) an arbitrary number of
>> >> other events may appear between the IN_MOVED_FROM and IN_MOVED_TO
>> >> events.
>> >>
>> >> Matching up the IN_MOVED_FROM and IN_MOVED_TO event pair gener‐
>> >> ated by rename(2) is thus inherently racy. (Don't forget that if
>> >> an object is renamed outside of a monitored directory, there may
>> >> not even be an IN_MOVED_TO event.) Heuristic approaches (e.g.,
>> >> assume the events are always consecutive) can be used to ensure a
>> >> match in most cases, but will inevitably miss some cases, causing
>> >> the application to perceive the IN_MOVED_FROM and IN_MOVED_TO
>> >> events as being unrelated. If watch descriptors are destroyed
>> >> and re-created as a result, then those watch descriptors will be
>> >> inconsistent with the watch descriptors in any pending events.
>> >> (Re-creating the inotify file descriptor and rebuilding the cache
>> >> may be useful to deal with this scenario.)
>> > Well, but there's 'cookie' value meant exactly for matching up
>> > IN_MOVED_FROM and IN_MOVED_TO events. And 'cookie' is guaranteed to be
>> > unique at least within the inotify instance (in fact currently it is unique
>> > within the whole system but I don't think we want to give that promise).
>>
>> Yes, that's already assumed by my discussion above (it's described elsewhere
>> in the page). But your comment makes me think I should add a few words to
>> remind the reader of that fact. I'll do that.
> Yes, that would be good.
>
>> But, the point is that even with the cookie, matching the events is
>> nontrivial, since:
>>
>> * There may not even be an IN_MOVED_FROM event
>> * There may be an arbitrary number of other events in between the
>> IN_MOVED_FROM and the IN_MOVED_TO.
>>
>> Therefore, one has to use heuristic approaches such as "allow at least
>> N milliseconds" or "check the next N events" to see if there is an
>> IN_MOVED_FROM that matches the IN_MOVED_TO. I can't see any way around
>> that being inherently racy. (It's unfortunate that the kernel can't
>> provide a guarantee that the two events are always consecutive, since
>> that would simplify user space's life considerably.)
> Yeah, it's unpleasant but doing that would be quite costly/complex at the
> kernel side. And the race would in the worst case lead to application
> thinking there's been file moved outside of watched area & a file moved
> somewhere else inside the watched area. So the application will have to
> possibly inspect that file. That doesn't seem too bad.

One further question. The IN_MOVED_FROM+IN_MOVED_TO pair may not be
guaranteed to be contiguous in the read buffer, but is their insertion
in the event queue guaranteed to be atomic from a user-space point of
view? That is to say: having read an IN_MOVED_FROM event, does user
space have the guarantee that if there is an IN_MOVED_TO event, then
it will already be in the queue? The reason I ask is that this would
affect how user space might try to read the IN_MOVED_TO event. If
there is no such guarantee, then a read() (or select()/poll()) with
(small) timeout is needed. If such a guarantee is provided, then a
nonblocking read() would suffice.

Cheers,

Michael

PS I just now found this code by John McCutchan
https://git.gnome.org/browse/gnome-vfs/tree/modules/inotify-kernel.c#n570
which suggests that the insertion of the event pair is not atomic
w.r.t. user space. Still, I wonder if there is any definitive
statement about this.

--
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/

2014-07-14 11:28:50

by Jan Kara

Subject: Re: Things I wish I'd known about Inotify

On Sat 12-07-14 21:06:45, Michael Kerrisk (man-pages) wrote:
> Late follow up on this thread..., since another question occurred in
> discussions with Jake.
>
> On Fri, Apr 4, 2014 at 2:43 PM, Jan Kara wrote:
> > On Fri 04-04-14 09:35:50, Michael Kerrisk (man-pages) wrote:
> >> On 04/03/2014 10:52 PM, Jan Kara wrote:
> >> > On Thu 03-04-14 08:34:44, Michael Kerrisk (man-pages) wrote:
> [...]
> >> >> Dealing with rename() events
> >> >> The IN_MOVED_FROM and IN_MOVED_TO events that are generated by
> >> >> rename(2) are usually available as consecutive events when read‐
> >> >> ing from the inotify file descriptor. However, this is not guar‐
> >> >> anteed. If multiple processes are triggering events for moni‐
> >> >> tored objects, then (on rare occasions) an arbitrary number of
> >> >> other events may appear between the IN_MOVED_FROM and IN_MOVED_TO
> >> >> events.
> >> >>
> >> >> Matching up the IN_MOVED_FROM and IN_MOVED_TO event pair gener‐
> >> >> ated by rename(2) is thus inherently racy. (Don't forget that if
> >> >> an object is renamed outside of a monitored directory, there may
> >> >> not even be an IN_MOVED_TO event.) Heuristic approaches (e.g.,
> >> >> assume the events are always consecutive) can be used to ensure a
> >> >> match in most cases, but will inevitably miss some cases, causing
> >> >> the application to perceive the IN_MOVED_FROM and IN_MOVED_TO
> >> >> events as being unrelated. If watch descriptors are destroyed
> >> >> and re-created as a result, then those watch descriptors will be
> >> >> inconsistent with the watch descriptors in any pending events.
> >> >> (Re-creating the inotify file descriptor and rebuilding the cache
> >> >> may be useful to deal with this scenario.)
> >> > Well, but there's 'cookie' value meant exactly for matching up
> >> > IN_MOVED_FROM and IN_MOVED_TO events. And 'cookie' is guaranteed to be
> >> > unique at least within the inotify instance (in fact currently it is unique
> >> > within the whole system but I don't think we want to give that promise).
> >>
> >> Yes, that's already assumed by my discussion above (it's described elsewhere
> >> in the page). But your comment makes me think I should add a few words to
> >> remind the reader of that fact. I'll do that.
> > Yes, that would be good.
> >
> >> But, the point is that even with the cookie, matching the events is
> >> nontrivial, since:
> >>
> >> * There may not even be an IN_MOVED_FROM event
> >> * There may be an arbitrary number of other events in between the
> >> IN_MOVED_FROM and the IN_MOVED_TO.
> >>
> >> Therefore, one has to use heuristic approaches such as "allow at least
> >> N milliseconds" or "check the next N events" to see if there is an
> >> IN_MOVED_FROM that matches the IN_MOVED_TO. I can't see any way around
> >> that being inherently racy. (It's unfortunate that the kernel can't
> >> provide a guarantee that the two events are always consecutive, since
> >> that would simplify user space's life considerably.)
> > Yeah, it's unpleasant but doing that would be quite costly/complex at the
> > kernel side. And the race would in the worst case lead to application
> > thinking there's been file moved outside of watched area & a file moved
> > somewhere else inside the watched area. So the application will have to
> > possibly inspect that file. That doesn't seem too bad.
>
> One further question. The IN_MOVED_FROM+IN_MOVED_TO pair may not be
> guaranteed to be contiguous in the read buffer, but is their insertion
> in the event queue guaranteed to be atomic from a user-space point of
> view? That is to say: having read an IN_MOVED_FROM event, does user
> space have the guarantee that if there is an IN_MOVED_TO event, then
> it will already be in the queue? The reason I ask is that this would
> affect how user space might try to read the IN_MOVED_TO event. If
> there is no such guarantee, then a read() (or select()/poll()) with
> (small) timeout is needed. If such a guarantee is provided, then a
> nonblocking read() would suffice.
That's a good question... So the events are not generated atomically even
from userspace POV - i.e., a userspace process may see a state where
IN_MOVED_FROM event is already in the buffer but IN_MOVED_TO event isn't
generated yet.
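
Given that the IN_MOVED_TO may not be queued yet when the IN_MOVED_FROM is read, a reader that wants to wait briefly for the partner event can use poll(2) with a small timeout instead of a nonblocking read(). A minimal sketch follows; wait_readable and read_with_timeout are hypothetical helper names, and the right timeout value is an application-specific guess.

```c
#include <poll.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/* Wait up to timeout_ms for fd to have data to read.
 * Returns 1 if data is ready, 0 if not (timeout), -1 on error
 * (including EINTR, which a caller may want to retry). */
int wait_readable(int fd, int timeout_ms)
{
    struct pollfd pfd = { .fd = fd, .events = POLLIN };
    int n = poll(&pfd, 1, timeout_ms);

    if (n < 0)
        return -1;
    return (n > 0 && (pfd.revents & POLLIN)) ? 1 : 0;
}

/* After an unmatched IN_MOVED_FROM: wait briefly, then read whatever
 * has arrived.  Returns bytes read, 0 on timeout, -1 on error.  For an
 * inotify fd, buf must be large enough to hold at least one event. */
ssize_t read_with_timeout(int fd, void *buf, size_t len, int timeout_ms)
{
    int r = wait_readable(fd, timeout_ms);

    if (r <= 0)
        return r;
    return read(fd, buf, len);
}
```

A return of 0 means only that no further event arrived within the timeout; it does not prove that no IN_MOVED_TO will follow.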

> PS I just now found this code by John McCutchan
> https://git.gnome.org/browse/gnome-vfs/tree/modules/inotify-kernel.c#n570
> which suggests that the insertion of the event pair is not atomic
> w.r.t. user space. Still, I wonder if there is any definitive
> statement about this.

Honza
--
Jan Kara
SUSE Labs, CR

Subject: Re: Things I wish I'd known about Inotify

On 07/14/2014 01:28 PM, Jan Kara wrote:
> On Sat 12-07-14 21:06:45, Michael Kerrisk (man-pages) wrote:
>> Late follow up on this thread..., since another question occurred in
>> discussions with Jake.
>>
>> On Fri, Apr 4, 2014 at 2:43 PM, Jan Kara wrote:
>>> On Fri 04-04-14 09:35:50, Michael Kerrisk (man-pages) wrote:
>>>> On 04/03/2014 10:52 PM, Jan Kara wrote:
>>>>> On Thu 03-04-14 08:34:44, Michael Kerrisk (man-pages) wrote:
>> [...]
>>>>>> Dealing with rename() events
>>>>>> The IN_MOVED_FROM and IN_MOVED_TO events that are generated by
>>>>>> rename(2) are usually available as consecutive events when read‐
>>>>>> ing from the inotify file descriptor. However, this is not guar‐
>>>>>> anteed. If multiple processes are triggering events for moni‐
>>>>>> tored objects, then (on rare occasions) an arbitrary number of
>>>>>> other events may appear between the IN_MOVED_FROM and IN_MOVED_TO
>>>>>> events.
>>>>>>
>>>>>> Matching up the IN_MOVED_FROM and IN_MOVED_TO event pair gener‐
>>>>>> ated by rename(2) is thus inherently racy. (Don't forget that if
>>>>>> an object is renamed outside of a monitored directory, there may
>>>>>> not even be an IN_MOVED_TO event.) Heuristic approaches (e.g.,
>>>>>> assume the events are always consecutive) can be used to ensure a
>>>>>> match in most cases, but will inevitably miss some cases, causing
>>>>>> the application to perceive the IN_MOVED_FROM and IN_MOVED_TO
>>>>>> events as being unrelated. If watch descriptors are destroyed
>>>>>> and re-created as a result, then those watch descriptors will be
>>>>>> inconsistent with the watch descriptors in any pending events.
>>>>>> (Re-creating the inotify file descriptor and rebuilding the cache
>>>>>> may be useful to deal with this scenario.)
>>>>> Well, but there's 'cookie' value meant exactly for matching up
>>>>> IN_MOVED_FROM and IN_MOVED_TO events. And 'cookie' is guaranteed to be
>>>>> unique at least within the inotify instance (in fact currently it is unique
>>>>> within the whole system but I don't think we want to give that promise).
>>>>
>>>> Yes, that's already assumed by my discussion above (it's described elsewhere
>>>> in the page). But your comment makes me think I should add a few words to
>>>> remind the reader of that fact. I'll do that.
>>> Yes, that would be good.
>>>
>>>> But, the point is that even with the cookie, matching the events is
>>>> nontrivial, since:
>>>>
>>>> * There may not even be an IN_MOVED_FROM event
>>>> * There may be an arbitrary number of other events in between the
>>>> IN_MOVED_FROM and the IN_MOVED_TO.
>>>>
>>>> Therefore, one has to use heuristic approaches such as "allow at least
>>>> N milliseconds" or "check the next N events" to see if there is an
>>>> IN_MOVED_FROM that matches the IN_MOVED_TO. I can't see any way around
>>>> that being inherently racy. (It's unfortunate that the kernel can't
>>>> provide a guarantee that the two events are always consecutive, since
>>>> that would simplify user space's life considerably.)
>>> Yeah, it's unpleasant, but doing that would be quite costly/complex on the
>>> kernel side. And the race would in the worst case lead to the application
>>> thinking that a file has been moved outside of the watched area and that
>>> another file has been moved somewhere else inside the watched area. So the
>>> application may have to inspect that file. That doesn't seem too bad.
>>
>> One further question. The IN_MOVED_FROM+IN_MOVED_TO pair may not be
>> guaranteed to be contiguous in the read buffer, but is their insertion
>> in the event queue guaranteed to be atomic from a user-space point of
>> view? That is to say: having read an IN_MOVED_FROM event, does user
>> space have the guarantee that if there is an IN_MOVED_TO event, then
>> it will already be in the queue? The reason I ask is that this would
>> affect how user space might try to read the IN_MOVED_TO event. If
>> there is no such guarantee, then a read() (or select()/poll()) with
>> (small) timeout is needed. If such a guarantee is provided, then a
>> nonblocking read() would suffice.
> That's a good question... So the events are not generated atomically even
> from the userspace POV - i.e., a userspace process may see a state where
> the IN_MOVED_FROM event is already in the buffer but the IN_MOVED_TO event
> hasn't been generated yet.

Thanks for the confirmation, Jan. I also did some user-space
experimentation that pretty much showed the insertion must be nonatomic.
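
To make the consequence concrete, here is a minimal sketch (not a
definitive implementation) of the two pieces a reader would need after
seeing an IN_MOVED_FROM: a scan of the read buffer for the IN_MOVED_TO
with the same cookie, skipping unrelated events in between, and a read
with a small timeout for the case where the IN_MOVED_TO has not been
queued yet. The function names find_moved_to() and read_with_timeout()
are hypothetical, and the timeout value is left to the caller; whatever
value is chosen, this remains a heuristic.

```c
#include <assert.h>
#include <poll.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <sys/inotify.h>
#include <unistd.h>

/* Scan a buffer of inotify events (as filled in by read(2)) for the
   IN_MOVED_TO event that carries 'cookie'. Unrelated events that sit
   in between are skipped. Returns the byte offset of the matching
   event within 'buf', or -1 if there is no match in this buffer. */
static long find_moved_to(const char *buf, size_t len, uint32_t cookie)
{
    size_t off = 0;

    assert(buf != NULL);

    while (off + sizeof(struct inotify_event) <= len) {
        struct inotify_event ev;

        /* Copy the fixed-size header out to avoid unaligned access */
        memcpy(&ev, buf + off, sizeof(ev));

        if ((ev.mask & IN_MOVED_TO) && ev.cookie == cookie)
            return (long) off;

        /* 'len' is the size of the (padded) name that follows */
        off += sizeof(struct inotify_event) + ev.len;
    }
    return -1;
}

/* Wait up to 'timeout_ms' milliseconds for 'fd' to become readable,
   then read from it. Returns the number of bytes read, 0 if the
   timeout expired with nothing to read, or -1 on error. */
static ssize_t read_with_timeout(int fd, char *buf, size_t size,
                                 int timeout_ms)
{
    struct pollfd pfd = { .fd = fd, .events = POLLIN };
    int ready = poll(&pfd, 1, timeout_ms);

    if (ready <= 0)
        return ready;       /* 0: timed out; -1: poll() failed */

    return read(fd, buf, size);
}
```

A caller would first try find_moved_to() on the events it already
holds, and only if that returns -1 call read_with_timeout() on the
inotify file descriptor and scan again. If the timeout expires, the
application has to treat the IN_MOVED_FROM as a move out of the
monitored area, accepting that it will sometimes be wrong. The events
that were skipped over still need to be processed in order.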

Cheers,

Michael



--
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/