2010-08-06 15:58:51

by Eric Paris

[permalink] [raw]
Subject: [GIT PULL] notification tree - try 37!

Here it is again! Another notification pull request! There is still
future work to be done on notification, but nothing that I believe
others would call blocking or functional. The work I plan to do
includes:

1) Al has discussed the addition of a file_clone() call in the VFS which
would eliminate my use of __dentry_open() and would allow the removal of
the rather hideous FMODE_/O_NONOTIFY.
2) Al ask that the multiplexer from hell interfaces between fsnotify.h
and fsnotify.c be broken into multiple smaller cleaner functions.
3) Andrew has ask that I reduce the amount of code in the static inlines
in fsnotify.h by moving more to fsnotify.c.

- Eric


The following changes since commit fc1caf6eafb30ea185720e29f7f5eccca61ecd60:
Linus Torvalds (1):
Merge branch 'drm-core-next' of git://git.kernel.org/.../airlied/drm-2.6

are available in the git repository at:

git://git.infradead.org/users/eparis/notify.git for-linus

Alexey Dobriyan (1):
dnotify: move dir_notify_enable declaration

Andreas Gruenbacher (16):
fsnotify: kill FSNOTIFY_EVENT_FILE
fsnotify: take inode->i_lock inside fsnotify_find_mark_entry()
fanotify: create_fd cleanup
fanotify: Add pids to events
fsnotify/vfsmount: add fsnotify fields to struct vfsmount
fsnotify: Infrastructure for per-mount watches
fanotify: remove fanotify_update_mark
fanotify: do not call fanotify_update_object_mask in fanotify_remove_mark
fanotify: do not call fanotify_update_object_mask in fanotify_add_mark
fanotify: do not return pointer from fanotify_add_*_mark
fanotify: remove fanotify_add_mark
fanotify: rename FAN_MARK_ON_VFSMOUNT to FAN_MARK_MOUNT
fanotify: split fanotify_remove_mark
fanotify: remove fanotify.h declarations
fanotify: remove outgoing function checks in fanotify.h
fsnotify: Exchange list heads instead of moving elements

Dave Young (1):
sysctl extern cleanup: inotify

Eric Paris (107):
inotify: simplify the inotify idr handling
Audit: clean up the audit_watch split
audit: convert audit watches to use fsnotify instead of inotify
audit: redo audit watch locking and refcnt in light of fsnotify
audit: do not get and put just to free a watch
fsnotify: duplicate fsnotify_mark_entry data between 2 marks
fsnotify: allow addition of duplicate fsnotify marks
audit: reimplement audit_trees using fsnotify rather than inotify
Audit: audit watches depend on fsnotify
Audit: split audit watch Kconfig
Audit: audit watch init should not be before fsnotify init
fsnotify: use fsnotify_create_event to allocate the q_overflow event
inotify: use container_of instead of casting
fsnotify: kzalloc fsnotify groups
fsnotify: use kmem_cache_zalloc to simplify event initialization
inotify: do not reuse watch descriptors
inotify: remove inotify in kernel interface
inotify: do not spam console without limit
fsnotify: provide the data type to should_send_event
fsnotify: include data in should_send calls
fsnotify: pass a file instead of an inode to open, read, and write
fsnotify: send struct file when sending events to parents when possible
fsnotify: per group notification queue merge types
fsnotify: clone existing events
fsnotify: replace an event on a list
fsnotify: lock annotation for event replacement
fsnotify: remove group_num altogether
fsnotify: fsnotify_obtain_group kzalloc cleanup
fsnotify: fsnotify_obtain_group should be fsnotify_alloc_group
Audit: only set group mask when something is being watched
fsnotify: drop mask argument from fsnotify_alloc_group
fsnotify: rename fsnotify_groups to fsnotify_inode_groups
fsnotify: initialize the group->num_marks in a better place
fsnotify: add groups to fsnotify_inode_groups when registering inode watch
fsnotify: mount point listeners list and global mask
fsnotify: include vfsmount in should_send_event when appropriate
fsnotify: put inode specific fields in an fsnotify_mark in a union
fsnotify: add vfsmount specific fields to the fsnotify_mark_entry union
fsnotify: add flags to fsnotify_mark_entries
fsnotify: rename fsnotify_mark_entry to just fsnotify_mark
fsnotify: rename fsnotify_find_mark_entry to fsnotify_find_mark
fsnotify: rename mark_entry to just mark
inotify: rename mark_entry to just mark
dnotify: rename mark_entry to mark
vfs: introduce FMODE_NONOTIFY
fanotify: fscking all notification system
fanotify:drop notification if they exist in the outgoing queue
fanotify: merge notification events with different masks
fanotify: do not clone on merge unless needed
fanotify: fanotify_init syscall declaration
fanotify: fanotify_init syscall implementation
fanotify: sys_fanotify_mark declartion
fanotify: fanotify_mark syscall implementation
fanotify: send events using read
fsnotify: split generic and inode specific mark code
fsnotify: clear marks to 0 in fsnotify_init_mark
fsnotify: vfsmount marks generic functions
fanotify: should_send_event needs to handle vfsmounts
fanotify: infrastructure to add an remove marks on vfsmounts
fanotify: hooks the fanotify_mark syscall to the vfsmount code
fsnotify: allow marks to not pin inodes in core
fsnotify: ignored_mask - excluding notification
fanotify: ignored_mask to ignore events
fanotify: allow users to set an ignored_mask
fsnotify: clear ignored mask on modify
fsnotify: allow ignored_mask to survive modification
fanotify: allow ignored_masks to survive modify
fanotify: clear all fanotify marks
fsnotify: add group priorities
fsnotify: intoduce a notification merge argument
fanotify: use merge argument to determine actual event added to queue
fsnotify: use unsigned char * for dentry->d_name.name
fsnotify: new fsnotify hooks and events types for access decisions
fanotify: permissions and blocking
fanotify: userspace interface for permission responses
fanotify: do not return 0 in a void function
fsnotify: call iput on inodes when no longer marked
fsnotify: initialize mask in fsnotify_perm
fanotify: default Kconfig to n
fanotify: drop the useless priority argument
inotify: fix inotify oneshot support
inotify: send IN_UNMOUNT events
inotify: allow users to request not to recieve events on unlinked children
inotify: force inotify and fsnotify use same bits
fsnotify: check to make sure all fsnotify bits are unique
fanotify: groups can specify their f_flags for new fd
fsnotify: add pr_debug throughout
fsnotify: fsnotify_add_notify_event should return an event
fsnotify: store struct file not struct path
vfs/fsnotify: fsnotify_close can delay the final work in fput
fsnotify: place marks on object in order of group memory address
fsnotify: use _rcu functions for mark list traversal
fsnotify: use an explicit flag to indicate fsnotify_destroy_mark has been called
fsnotify: srcu to protect read side of inode and vfsmount locks
fsnotify: send fsnotify_mark to groups in event handling functions
inotify: use the mark in handler functions
dnotify: use the mark in handler functions
audit: use the mark in handler functions
fanotify: use the mark in handler functions
fsnotify: cleanup should_send_event
fsnotify: remove the global masks
fsnotify: remove group->mask
fsnotify: remove global fsnotify groups lists
fsnotify: rework ignored mark flushing
fsnotify: walk the inode and vfsmount lists simultaneously
fsnotify: pass both the vfsmount mark and inode mark
fanotify: use both marks when possible

H Hartley Sweeten (1):
inotify_user.c: make local symbol static

Heiko Carstens (1):
fanotify: CONFIG_HAVE_SYSCALL_WRAPPERS for sys_fanotify_mark

Jean-Christophe Dubois (1):
fanotify: do not always return 0 in fsnotify

Jerome Marchand (1):
inotify: Fix mask checks

Paul Mundt (1):
fanotify: select ANON_INODES.

Signed-off-by: Wu Fengguang (1):
fanotify: FMODE_NONOTIFY and __O_SYNC in sparc conflict

Tejun Heo (1):
fsnotify: update gfp/slab.h includes

Documentation/feature-removal-schedule.txt | 8 -
arch/x86/ia32/ia32entry.S | 2 +
arch/x86/ia32/sys_ia32.c | 9 +
arch/x86/include/asm/sys_ia32.h | 3 +
arch/x86/include/asm/unistd_32.h | 4 +-
arch/x86/include/asm/unistd_64.h | 4 +
arch/x86/kernel/syscall_table_32.S | 2 +
fs/compat.c | 5 +-
fs/exec.c | 4 +-
fs/file_table.c | 9 +
fs/inode.c | 8 +-
fs/namei.c | 2 +-
fs/namespace.c | 5 +
fs/nfsd/vfs.c | 4 +-
fs/notify/Kconfig | 1 +
fs/notify/Makefile | 4 +-
fs/notify/dnotify/dnotify.c | 213 +++----
fs/notify/fanotify/Kconfig | 26 +
fs/notify/fanotify/Makefile | 1 +
fs/notify/fanotify/fanotify.c | 212 +++++++
fs/notify/fanotify/fanotify_user.c | 760 ++++++++++++++++++++++++
fs/notify/fsnotify.c | 201 ++++++--
fs/notify/fsnotify.h | 27 +-
fs/notify/group.c | 182 +------
fs/notify/inode_mark.c | 331 ++++--------
fs/notify/inotify/Kconfig | 15 -
fs/notify/inotify/Makefile | 1 -
fs/notify/inotify/inotify.c | 873 ----------------------------
fs/notify/inotify/inotify.h | 7 +-
fs/notify/inotify/inotify_fsnotify.c | 151 ++++--
fs/notify/inotify/inotify_user.c | 369 ++++++++-----
fs/notify/mark.c | 371 ++++++++++++
fs/notify/notification.c | 236 +++++---
fs/notify/vfsmount_mark.c | 187 ++++++
fs/open.c | 3 +-
fs/read_write.c | 8 +-
include/asm-generic/fcntl.h | 8 +
include/linux/Kbuild | 1 +
include/linux/dnotify.h | 1 +
include/linux/fanotify.h | 105 ++++
include/linux/fs.h | 16 +-
include/linux/fsnotify.h | 161 +++---
include/linux/fsnotify_backend.h | 211 +++++---
include/linux/inotify.h | 185 +------
include/linux/mount.h | 6 +-
include/linux/security.h | 1 +
include/linux/syscalls.h | 4 +
init/Kconfig | 10 +-
kernel/Makefile | 5 +-
kernel/audit.c | 1 -
kernel/audit.h | 26 +-
kernel/audit_tree.c | 237 +++++----
kernel/audit_watch.c | 274 +++++----
kernel/auditfilter.c | 39 +-
kernel/auditsc.c | 10 +-
kernel/sys_ni.c | 4 +
kernel/sysctl.c | 7 +-
security/security.c | 16 +-
58 files changed, 3197 insertions(+), 2379 deletions(-)
create mode 100644 fs/notify/fanotify/Kconfig
create mode 100644 fs/notify/fanotify/Makefile
create mode 100644 fs/notify/fanotify/fanotify.c
create mode 100644 fs/notify/fanotify/fanotify_user.c
delete mode 100644 fs/notify/inotify/inotify.c
create mode 100644 fs/notify/mark.c
create mode 100644 fs/notify/vfsmount_mark.c
create mode 100644 include/linux/fanotify.h


2010-08-06 23:34:37

by Matt Helsley

[permalink] [raw]
Subject: Re: [GIT PULL] notification tree - try 37!

On Fri, Aug 06, 2010 at 11:58:39AM -0400, Eric Paris wrote:
> Here it is again! Another notification pull request! There is still
> future work to be done on notification, but nothing that I believe
> others would call blocking or functional. The work I plan to do
> includes:
>
> 1) Al has discussed the addition of a file_clone() call in the VFS which
> would eliminate my use of __dentry_open() and would allow the removal of
> the rather hideous FMODE_/O_NONOTIFY.

I did a quick search and can't find a mailing list post on this. Was
it a private discussion or is there something I can read about what
file_clone() will do?

Cheers,
-Matt Helsley

2010-08-07 00:06:32

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [GIT PULL] notification tree - try 37!

On Fri, Aug 06, 2010 at 04:34:31PM -0700, Matt Helsley wrote:
> On Fri, Aug 06, 2010 at 11:58:39AM -0400, Eric Paris wrote:
> > Here it is again! Another notification pull request! There is still
> > future work to be done on notification, but nothing that I believe
> > others would call blocking or functional. The work I plan to do
> > includes:
> >
> > 1) Al has discussed the addition of a file_clone() call in the VFS which
> > would eliminate my use of __dentry_open() and would allow the removal of
> > the rather hideous FMODE_/O_NONOTIFY.
>
> I did a quick search and can't find a mailing list post on this. Was
> it a private discussion or is there something I can read about what
> file_clone() will do?

I'm also totally missing on any re-post of these patches or discussion
of the changes during the last development window.

2010-08-07 19:15:58

by Eric Paris

[permalink] [raw]
Subject: Re: [GIT PULL] notification tree - try 37!

On Fri, 2010-08-06 at 20:06 -0400, Christoph Hellwig wrote:
> On Fri, Aug 06, 2010 at 04:34:31PM -0700, Matt Helsley wrote:
> > On Fri, Aug 06, 2010 at 11:58:39AM -0400, Eric Paris wrote:
> > > Here it is again! Another notification pull request! There is still
> > > future work to be done on notification, but nothing that I believe
> > > others would call blocking or functional. The work I plan to do
> > > includes:
> > >
> > > 1) Al has discussed the addition of a file_clone() call in the VFS which
> > > would eliminate my use of __dentry_open() and would allow the removal of
> > > the rather hideous FMODE_/O_NONOTIFY.
> >
> > I did a quick search and can't find a mailing list post on this. Was
> > it a private discussion or is there something I can read about what
> > file_clone() will do?

No, it was from a face to face meeting and a couple of irc conversations
talk about all of this stuff. My understanding was that it was going to
be a lot like dentry_open() only it was going to require a valid struct
file and would return a new struct file. One of the purposes of the new
interface being the ability to set f_mode at a better time to eliminate
the FMODE/O_ overlapping horror that fanotify requires to prevent
recursion and deadlock.

> I'm also totally missing on any re-post of these patches or discussion
> of the changes during the last development window.

I just searched lkml an fsdevel where I usually send everything don't
see then. I totally failed. I'm used to not hearing a peep about my
patches so I never noticed the lack of feedback. I would have sworn the
last set of patches I sent to both Al and the list, but apparently I
only ever sent stuff to Al. Looks like all changes between
f874e1ac21d770846 and 1968f5eed54ce47bde4 are the ones of interest.
They are all notification internal except for one:

http://git.infradead.org/users/eparis/notify.git/commitdiff/c1e5c954020e123d30b4abf4038ce501861bcf9f

It screws with struct file refcnting.

Diffstat of when I don't see posted (well some of it is, but most isn't)
is below....

fs/file_table.c | 9 +
fs/notify/dnotify/dnotify.c | 43 +-----
fs/notify/fanotify/fanotify.c | 221 +++++++++++++---------------------
fs/notify/fanotify/fanotify_user.c | 35 +----
fs/notify/fsnotify.c | 226 ++++++++++++++++++++---------------
fs/notify/fsnotify.h | 18 --
fs/notify/group.c | 168 --------------------------
fs/notify/inode_mark.c | 48 +++++--
fs/notify/inotify/inotify_fsnotify.c | 91 +++++---------
fs/notify/inotify/inotify_user.c | 34 +++--
fs/notify/mark.c | 78 +++++++++---
fs/notify/notification.c | 88 +++++++++----
fs/notify/vfsmount_mark.c | 48 ++++---
include/linux/fsnotify.h | 37 ++---
include/linux/fsnotify_backend.h | 84 +++++--------
kernel/audit_tree.c | 12 +
kernel/audit_watch.c | 47 +------
17 files changed, 578 insertions(+), 709 deletions(-)

2010-08-07 20:55:17

by Matt Helsley

[permalink] [raw]
Subject: Re: [GIT PULL] notification tree - try 37!

On Sat, Aug 07, 2010 at 03:15:14PM -0400, Eric Paris wrote:
> On Fri, 2010-08-06 at 20:06 -0400, Christoph Hellwig wrote:
> > On Fri, Aug 06, 2010 at 04:34:31PM -0700, Matt Helsley wrote:
> > > On Fri, Aug 06, 2010 at 11:58:39AM -0400, Eric Paris wrote:
> > > > Here it is again! Another notification pull request! There is still
> > > > future work to be done on notification, but nothing that I believe
> > > > others would call blocking or functional. The work I plan to do
> > > > includes:
> > > >
> > > > 1) Al has discussed the addition of a file_clone() call in the VFS which
> > > > would eliminate my use of __dentry_open() and would allow the removal of
> > > > the rather hideous FMODE_/O_NONOTIFY.
> > >
> > > I did a quick search and can't find a mailing list post on this. Was
> > > it a private discussion or is there something I can read about what
> > > file_clone() will do?
>
> No, it was from a face to face meeting and a couple of irc conversations
> talk about all of this stuff. My understanding was that it was going to
> be a lot like dentry_open() only it was going to require a valid struct
> file and would return a new struct file. One of the purposes of the new
> interface being the ability to set f_mode at a better time to eliminate
> the FMODE/O_ overlapping horror that fanotify requires to prevent
> recursion and deadlock.

Thanks Eric, that's the information I was looking for. I was curious
because there's a chance file_clone() as you described it may also be useful
for checkpoint/restart.

Cheers,
-Matt Helsley

2010-08-16 20:32:53

by Andreas Gruenbacher

[permalink] [raw]
Subject: Re: [GIT PULL] notification tree - try 37!

On Saturday 07 August 2010 21:15:14 Eric Paris wrote:
> On Fri, 2010-08-06 at 20:06 -0400, Christoph Hellwig wrote:
> > I'm also totally missing on any re-post of these patches or discussion
> > of the changes during the last development window.
>
> I just searched lkml an fsdevel where I usually send everything don't
> see then. I totally failed.

Oh yes.

This introduces two new syscalls which will be impossible to fix up after the
fact, and those system calls are poorly documented: commits 2a3edf86 and
52c923dd document the initial versions (in the commit message!), but
subsequent commits then extend that interface. The interface for replying to
events is not documented at all beyond the example code [1]. There is no
documentation in Documentation/filesystems/, either.

[1] http://people.redhat.com/~eparis/fanotify/


Q: What happens when a process watching for FAN_OPEN_PERM or FAN_ACCESS_PERM
events exits or dies while events are in flight? I can't see anything in the
code that would wake sleeping processes up when the fsnotify_group of the
listener is torn down.


Q: What prevents the system from going out of memory when a listener decides
to stop reading events or simply can't keep up? There doesn't seem to be a
limit on the queue depth. Listeners currently need CAP_SYS_ADMIN, but somehow
limiting the queue depth and throttling when things start to go bad still
sounds like a reasonable thing to do, right?)


Don't get me wrong, I really appreciate your work on this (this shows through
the patches I've committed), and what we have now is a lot better than what we
had before.


Thanks,
Andreas

2010-08-17 03:40:11

by Eric Paris

[permalink] [raw]
Subject: Re: [GIT PULL] notification tree - try 37!

On Mon, 2010-08-16 at 22:32 +0200, Andreas Gruenbacher wrote:
> On Saturday 07 August 2010 21:15:14 Eric Paris wrote:
> > On Fri, 2010-08-06 at 20:06 -0400, Christoph Hellwig wrote:
> > > I'm also totally missing on any re-post of these patches or discussion
> > > of the changes during the last development window.
> >
> > I just searched lkml an fsdevel where I usually send everything don't
> > see then. I totally failed.
>
> Oh yes.
>
> This introduces two new syscalls which will be impossible to fix up after the
> fact, and those system calls are poorly documented: commits 2a3edf86 and
> 52c923dd document the initial versions (in the commit message!), but
> subsequent commits then extend that interface. The interface for replying to
> events is not documented at all beyond the example code [1]. There is no
> documentation in Documentation/filesystems/, either.
>
> [1] http://people.redhat.com/~eparis/fanotify/

I'll work on documentation. Although it should be pointed out that the
interface was sent to list many times with lots of discussion and
feedback. The only patches that didn't make the list were the last
couple which changed internal notification semantics (and fscked with
fput() but that patch, which caused problems, was specifically pointed
out in this thread and reverted).

> Q: What happens when a process watching for FAN_OPEN_PERM or FAN_ACCESS_PERM
> events exits or dies while events are in flight? I can't see anything in the
> code that would wake sleeping processes up when the fsnotify_group of the
> listener is torn down.

We can get stuck. There was code which cleaned that up, but it got
accidentally removed long ago when, upon review on list, I was told to
remove all timeout code. It's easy enough to fix up. I'll post a patch
this week.

> Q: What prevents the system from going out of memory when a listener decides
> to stop reading events or simply can't keep up? There doesn't seem to be a
> limit on the queue depth. Listeners currently need CAP_SYS_ADMIN, but somehow
> limiting the queue depth and throttling when things start to go bad still
> sounds like a reasonable thing to do, right?)

It's an interesting question and obviously one that I've thought about.
You remember when we talked previously I said the hardest part left was
allowing non-root users to use the interface. It gets especially
difficult when thinking about perm-events. I was specifically told not
to timeout or drop those. But when dealing with non-root users using
perm events? As for pure notification we can do something like inotify
does quite easily.

I'm not certain exactly what the best semantics are for non trusted
users, so I didn't push any patches that way. Suggestions welcome :)

-Eric

2010-08-17 04:04:07

by Matt Helsley

[permalink] [raw]
Subject: Re: [GIT PULL] notification tree - try 37!

On Mon, Aug 16, 2010 at 11:39:47PM -0400, Eric Paris wrote:
> On Mon, 2010-08-16 at 22:32 +0200, Andreas Gruenbacher wrote:
> > On Saturday 07 August 2010 21:15:14 Eric Paris wrote:
> > > On Fri, 2010-08-06 at 20:06 -0400, Christoph Hellwig wrote:
> > > > I'm also totally missing on any re-post of these patches or discussion
> > > > of the changes during the last development window.
> > >
> > > I just searched lkml an fsdevel where I usually send everything don't
> > > see then. I totally failed.
> >
> > Oh yes.
> >
> > This introduces two new syscalls which will be impossible to fix up after the
> > fact, and those system calls are poorly documented: commits 2a3edf86 and
> > 52c923dd document the initial versions (in the commit message!), but
> > subsequent commits then extend that interface. The interface for replying to
> > events is not documented at all beyond the example code [1]. There is no
> > documentation in Documentation/filesystems/, either.
> >
> > [1] http://people.redhat.com/~eparis/fanotify/
>
> I'll work on documentation. Although it should be pointed out that the
> interface was sent to list many times with lots of discussion and
> feedback. The only patches that didn't make the list were the last
> couple which changed internal notification semantics (and fscked with
> fput() but that patch, which caused problems, was specifically pointed
> out in this thread and reverted).
>
> > Q: What happens when a process watching for FAN_OPEN_PERM or FAN_ACCESS_PERM
> > events exits or dies while events are in flight? I can't see anything in the
> > code that would wake sleeping processes up when the fsnotify_group of the
> > listener is torn down.
>
> We can get stuck. There was code which cleaned that up, but it got
> accidentally removed long ago when, upon review on list, I was told to
> remove all timeout code. It's easy enough to fix up. I'll post a patch
> this week.
>
> > Q: What prevents the system from going out of memory when a listener decides
> > to stop reading events or simply can't keep up? There doesn't seem to be a
> > limit on the queue depth. Listeners currently need CAP_SYS_ADMIN, but somehow
> > limiting the queue depth and throttling when things start to go bad still
> > sounds like a reasonable thing to do, right?)
>
> It's an interesting question and obviously one that I've thought about.
> You remember when we talked previously I said the hardest part left was
> allowing non-root users to use the interface. It gets especially
> difficult when thinking about perm-events. I was specifically told not
> to timeout or drop those. But when dealing with non-root users using
> perm events? As for pure notification we can do something like inotify
> does quite easily.
>
> I'm not certain exactly what the best semantics are for non trusted
> users, so I didn't push any patches that way. Suggestions welcome :)

Hi Eric,

Sorry I haven't had a chance to look at the perm events. I think
user namespace and containers folks might be interested in them for
non-root users though.

[Adding userns/containers folks to Cc]

I'm guessing non-trusted users can be restricted to only get perm events
on stuff they already own. Plus make perm events by non-trusted users
unreliable yet failsafe -- return EACESS/EPERM when we have to timeout or
drop the events if I understand the issues correctly. Those rules plus user
namespaces would still be quite useful I think. Do they seem reasonable to
implement? Am I forgetting anything important?

Cheers,
-Matt Helsley

2010-08-17 08:10:10

by Andreas Gruenbacher

[permalink] [raw]
Subject: Re: [GIT PULL] notification tree - try 37!

On Tuesday 17 August 2010 05:39:47 Eric Paris wrote:
> On Mon, 2010-08-16 at 22:32 +0200, Andreas Gruenbacher wrote:
> > Q: What happens when a process watching for FAN_OPEN_PERM or
> > FAN_ACCESS_PERM events exits or dies while events are in flight? I
> > can't see anything in the code that would wake sleeping processes up
> > when the fsnotify_group of the listener is torn down.
>
> We can get stuck. There was code which cleaned that up, but it got
> accidentally removed long ago when, upon review on list, I was told to
> remove all timeout code. It's easy enough to fix up. I'll post a patch
> this week.

This needs to be fixed then. Not such a big deal, but it shows that the tree
wasn't ready for being merged yet and needs further review.

> > Q: What prevents the system from going out of memory when a listener
> > decides to stop reading events or simply can't keep up? There doesn't
> > seem to be a limit on the queue depth. Listeners currently need
> > CAP_SYS_ADMIN, but somehow limiting the queue depth and throttling when
> > things start to go bad still sounds like a reasonable thing to do,
> > right?
>
> It's an interesting question and obviously one that I've thought about.
> You remember when we talked previously I said the hardest part left was
> allowing non-root users to use the interface. It gets especially
> difficult when thinking about perm-events. I was specifically told not
> to timeout or drop those. But when dealing with non-root users using
> perm events? As for pure notification we can do something like inotify
> does quite easily.
>
> I'm not certain exactly what the best semantics are for non trusted
> users, so I didn't push any patches that way. Suggestions welcome :)

The system will happily go OOM for trusted users and non-perm events if the
listener doesn't keep up, so some throttling, dropping, or both needs to
happen for non-perm events. This is the critical case. Doing what inotify
does (queue an overflow event and drop further events) seems to make sense
here.

The situation with perm-events is less severe because the number of
outstanding perm events is bounded by the number of running processes. This
may be enough of a limit.

I don't think we need to worry about perm-events for untrusted users. We can
start supporting some kinds of non-perm-events for untrusted users later; this
won't change the existing interface.

Thanks,
Andreas

2010-08-17 08:39:08

by Andreas Gruenbacher

[permalink] [raw]
Subject: Re: [GIT PULL] notification tree - try 37!

On Tuesday 17 August 2010 05:39:47 Eric Paris wrote:
> On Mon, 2010-08-16 at 22:32 +0200, Andreas Gruenbacher wrote:
> > On Saturday 07 August 2010 21:15:14 Eric Paris wrote:
> > > On Fri, 2010-08-06 at 20:06 -0400, Christoph Hellwig wrote:
> > > > I'm also totally missing on any re-post of these patches or
> > > > discussion of the changes during the last development window.
> > >
> > > I just searched lkml an fsdevel where I usually send everything don't
> > > see then. I totally failed.
> >
> > Oh yes.
> >
> > This introduces two new syscalls which will be impossible to fix up after
> > the fact, and those system calls are poorly documented: commits 2a3edf86
> > and 52c923dd document the initial versions (in the commit message!), but
> > subsequent commits then extend that interface. The interface for
> > replying to events is not documented at all beyond the example code [1].
> > There is no documentation in Documentation/filesystems/, either.
> >
> > [1] http://people.redhat.com/~eparis/fanotify/

Oh ... this example doesn't actually build; both syscall prototypes are wrong.
What have you been testing this with?

> I'll work on documentation. Although it should be pointed out that the
> interface was sent to list many times with lots of discussion and
> feedback.

One of the wonky remaining bits is the way how files are reopened with
dentry_open() with the f_flags passed to fanotify_init(). The open can fail,
in which case the user is left with an error condition but with no indication
as to which object the error happened for. What the heck?

Thanks,
Andreas

2010-08-17 09:45:30

by Tvrtko Ursulin

[permalink] [raw]
Subject: Re: [GIT PULL] notification tree - try 37!

On Tuesday 17 Aug 2010 04:39:47 Eric Paris wrote:
> > Q: What prevents the system from going out of memory when a listener
> > decides to stop reading events or simply can't keep up? There doesn't
> > seem to be a limit on the queue depth. Listeners currently need
> > CAP_SYS_ADMIN, but somehow limiting the queue depth and throttling when
> > things start to go bad still sounds like a reasonable thing to do,
> > right?)
>
> It's an interesting question and obviously one that I've thought about.
> You remember when we talked previously I said the hardest part left was
> allowing non-root users to use the interface. It gets especially
> difficult when thinking about perm-events. I was specifically told not
> to timeout or drop those. But when dealing with non-root users using
> perm events? As for pure notification we can do something like inotify
> does quite easily.

Why no timeouts? It sounds like a feasible way to work around listeners which
have stopped working. (Timeout and -ETIME for example to be clear, not
allowing access).

Alternative might be to expose queue size per group (and some additional group
info) so a daemon could keep an eye on listeners which are not making progress
and act accordingly. Sometimes appropriate action would be to restart, or to
kill, or even spawn a new one. Last bit is especially useful with some FUSE
filesystems to avoid deadlocks. Otherwise listener can get a perm event for
the top level, and then another perm event is generated when FUSE opens the
underlying object and there is noone to handle it.

But this can also work together with timeouts.

Tvrtko

Sophos Plc, The Pentagon, Abingdon Science Park, Abingdon, OX14 3YP, United Kingdom.
Company Reg No 2096520. VAT Reg No GB 348 3873 20.

2010-08-17 10:01:28

by Andreas Gruenbacher

[permalink] [raw]
Subject: Re: [GIT PULL] notification tree - try 37!

On Tuesday 17 August 2010 11:45:25 Tvrtko Ursulin wrote:
> Why no timeouts? It sounds like a feasible way to work around listeners
> which have stopped working. (Timeout and -ETIME for example to be clear,
> not allowing access).

>From the kernel's point of view, there is no way to guess how long those
timeouts should be. Watching for progress can be implemented in user space
though.

Setting errno to ETIME as a result of trying to access a file is likely to
break some applications which are not prepared to receive this error
condition; we cannot do that.

I'm quite sure that both of these issues have been discussed already.

Andreas

2010-08-17 10:12:51

by Tvrtko Ursulin

[permalink] [raw]
Subject: Re: [GIT PULL] notification tree - try 37!

On Tuesday 17 Aug 2010 11:01:09 Andreas Gruenbacher wrote:
> On Tuesday 17 August 2010 11:45:25 Tvrtko Ursulin wrote:
> > Why no timeouts? It sounds like a feasible way to work around listeners
> > which have stopped working. (Timeout and -ETIME for example to be clear,
> > not allowing access).
>
> From the kernel's point of view, there is no way to guess how long those
> timeouts should be. Watching for progress can be implemented in user space
> though.

That is why I said the timeout can be configurable. But I agree (and raise you
double :) that not only the kernel can not guess how long the timeout should
be, but the application can not know as well. One possible way to let fanotify
know that listeners is making progress is via some sort of a heartbeat
message. Then application can set the timeout as a fail safe and perm events
can take as long as needed and we still prevent clogging the system.

How would you create a robust solution purely from userspace? With the current
interface?

> Setting errno to ETIME as a result of trying to access a file is likely to
> break some applications which are not prepared to receive this error
> condition; we cannot do that.

Why you think EPERM will be handled better?

> I'm quite sure that both of these issues have been discussed already.

Ok, I obviosuly missed it. Do you have a pointer perhaps?

Tvrtko

Sophos Plc, The Pentagon, Abingdon Science Park, Abingdon, OX14 3YP, United Kingdom.
Company Reg No 2096520. VAT Reg No GB 348 3873 20.

2010-08-17 10:55:25

by Tvrtko Ursulin

[permalink] [raw]
Subject: Re: [GIT PULL] notification tree - try 37!

On Tuesday 17 Aug 2010 11:12:41 Tvrtko Ursulin wrote:
> On Tuesday 17 Aug 2010 11:01:09 Andreas Gruenbacher wrote:
> > I'm quite sure that both of these issues have been discussed already.
>
> Ok, I obviosuly missed it. Do you have a pointer perhaps?

I have found the thread. It has been discussed but I do not find it had a
clear outcome. At the end Eric has proposed the heartbeat/in-progress option
(which IMHO can only work together with a timeout) to which no-one objected. I
did not find other arguments against such functionality solid.

Tvrtko


Sophos Plc, The Pentagon, Abingdon Science Park, Abingdon, OX14 3YP, United Kingdom.
Company Reg No 2096520. VAT Reg No GB 348 3873 20.

2010-08-17 15:09:16

by Eric Paris

[permalink] [raw]
Subject: Re: [GIT PULL] notification tree - try 37!

On Tue, 2010-08-17 at 10:09 +0200, Andreas Gruenbacher wrote:
> On Tuesday 17 August 2010 05:39:47 Eric Paris wrote:
> > On Mon, 2010-08-16 at 22:32 +0200, Andreas Gruenbacher wrote:
> > > Q: What happens when a process watching for FAN_OPEN_PERM or
> > > FAN_ACCESS_PERM events exits or dies while events are in flight? I
> > > can't see anything in the code that would wake sleeping processes up
> > > when the fsnotify_group of the listener is torn down.
> >
> > We can get stuck. There was code which cleaned that up, but it got
> > accidentally removed long ago when, upon review on list, I was told to
> > remove all timeout code. It's easy enough to fix up. I'll post a patch
> > this week.
>
> This needs to be fixed then. Not such a big deal, but it shows that the tree
> wasn't ready for being merged yet and needs further review.

Code with bugs, shocking! Two other bugs have been found and patches
for those will be coming shortly. I've begged for review how many
times? I don't care when review it comes, I'll address any issues as
they come up.

-Eric

2010-08-17 15:25:15

by Eric Paris

[permalink] [raw]
Subject: Re: [GIT PULL] notification tree - try 37!

On Tue, 2010-08-17 at 10:38 +0200, Andreas Gruenbacher wrote:
> On Tuesday 17 August 2010 05:39:47 Eric Paris wrote:
> > On Mon, 2010-08-16 at 22:32 +0200, Andreas Gruenbacher wrote:
> > > On Saturday 07 August 2010 21:15:14 Eric Paris wrote:
> > > > On Fri, 2010-08-06 at 20:06 -0400, Christoph Hellwig wrote:
> > > > > I'm also totally missing on any re-post of these patches or
> > > > > discussion of the changes during the last development window.
> > > >
> > > > I just searched lkml an fsdevel where I usually send everything don't
> > > > see then. I totally failed.
> > >
> > > Oh yes.
> > >
> > > This introduces two new syscalls which will be impossible to fix up after
> > > the fact, and those system calls are poorly documented: commits 2a3edf86
> > > and 52c923dd document the initial versions (in the commit message!), but
> > > subsequent commits then extend that interface. The interface for
> > > replying to events is not documented at all beyond the example code [1].
> > > There is no documentation in Documentation/filesystems/, either.
> > >
> > > [1] http://people.redhat.com/~eparis/fanotify/
>
> Oh ... this example doesn't actually build; both syscall prototypes are wrong.
> What have you been testing this with?

I updated that code, I didn't realize just how out of date it got.

> > I'll work on documentation. Although it should be pointed out that the
> > interface was sent to list many times with lots of discussion and
> > feedback.
>
> One of the wonky remaining bits is the way how files are reopened with
> dentry_open() with the f_flags passed to fanotify_init(). The open can fail,
> in which case the user is left with an error condition but with no indication
> as to which object the error happened for. What the heck?

What else can be done? When notification is based on an open fd and you
can't give them an open fd, there's nothing left....

2010-08-17 15:28:31

by Eric Paris

[permalink] [raw]
Subject: Re: [GIT PULL] notification tree - try 37!

On Tue, 2010-08-17 at 11:55 +0100, Tvrtko Ursulin wrote:
> On Tuesday 17 Aug 2010 11:12:41 Tvrtko Ursulin wrote:
> > On Tuesday 17 Aug 2010 11:01:09 Andreas Gruenbacher wrote:
> > > I'm quite sure that both of these issues have been discussed already.
> >
> > Ok, I obviosuly missed it. Do you have a pointer perhaps?
>
> I have found the thread. It has been discussed but I do not find it had a
> clear outcome. At the end Eric has proposed the heartbeat/in-progress option
> (which IMHO can only work together with a timeout) to which no-one objected. I
> did not find other arguments against such functionality solid.

You'll notice that anything anyone I respected objected to, even if I
didn't agree, I dropped or replaced. I had such code. We can bring it
back. But the objection was: 'what's the point?" They believed that
everyone would just do it in a library and end up in the 'block forever'
situation we have today.

-Eric

2010-08-17 15:49:14

by Andreas Gruenbacher

[permalink] [raw]
Subject: Re: [GIT PULL] notification tree - try 37!

On Tuesday 17 August 2010 17:24:28 Eric Paris wrote:
> On Tue, 2010-08-17 at 10:38 +0200, Andreas Gruenbacher wrote:
> > One of the wonky remaining bits is the way how files are reopened with
> > dentry_open() with the f_flags passed to fanotify_init(). The open can
> > fail, in which case the user is left with an error condition but with no
> > indication as to which object the error happened for. What the heck?
>
> What else can be done? When notification is based on an open fd and you
> can't give them an open fd, there's nothing left....

The main point would be to allow the listener to object the event is about.
This could be done by returning st_dev and st_ino in the event. (i_generation
might be useful too but we don't even return that in struct stat today.)

Another way would be to return a file pointer to a bad inode that can be
stat() normally, and to return the error code separately.

Al might have an opinion on that.

Andreas

2010-08-18 14:18:32

by Andreas Gruenbacher

[permalink] [raw]
Subject: Re: [GIT PULL] notification tree - try 37!

On Tuesday 17 August 2010 17:24:28 Eric Paris wrote:
> On Tue, 2010-08-17 at 10:38 +0200, Andreas Gruenbacher wrote:
> > On Tuesday 17 August 2010 05:39:47 Eric Paris wrote:
> > > On Mon, 2010-08-16 at 22:32 +0200, Andreas Gruenbacher wrote:
> > > > [1] http://people.redhat.com/~eparis/fanotify/
> >
> > Oh ... this example doesn't actually build; both syscall prototypes are
> > wrong. What have you been testing this with?
>
> I updated that code, I didn't realize just how out of date it got.

Thanks. Please grab some cleanups from here:

http://git.kernel.org/?p=linux/kernel/git/agruen/fanotify-example.git

Andreas

2010-08-18 15:47:34

by Andreas Gruenbacher

[permalink] [raw]
Subject: Re: [GIT PULL] notification tree: directory events

Eric,

one more thing that doesn't make sense is events on directories: when watching
a directory, some actions like reading a directory or creat()ing a file in the
directory will generate events (even open_perm and access_perm), but other
actions like hard linking files into the directory or removing files will not
create any events. This means that fanotify currently cannot be used for
watching for directory changes.

In my opinion, events on directories should either be made to actually work,
or else no directory events at all should be generated.

Also, I can think of users of fanotify which are not concerned with directory
events at all. It would make sense to allow such users to subscribe only to
file events and not receive any directory events which they will end up
ignoring anyway.

Again it shows that this code just wasn't ready to be merged yet ...

Andreas

2010-08-18 15:59:27

by Eric Paris

[permalink] [raw]
Subject: Re: [GIT PULL] notification tree: directory events

On Wed, 2010-08-18 at 17:47 +0200, Andreas Gruenbacher wrote:
> Eric,
>
> one more thing that doesn't make sense is events on directories: when watching
> a directory, some actions like reading a directory or creat()ing a file in the
> directory will generate events (even open_perm and access_perm), but other
> actions like hard linking files into the directory or removing files will not
> create any events. This means that fanotify currently cannot be used for
> watching for directory changes.
>
> In my opinion, events on directories should either be made to actually work,
> or else no directory events at all should be generated.
>
> Also, I can think of users of fanotify which are not concerned with directory
> events at all. It would make sense to allow such users to subscribe only to
> file events and not receive any directory events which they will end up
> ignoring anyway.
>
> Again it shows that this code just wasn't ready to be merged yet ...

I'm going to file your e-mail into my todo list and hopefully I get the
time to look at the ability to ignore directory events. Nothing hard
about it. It's as easy as defining a flag and adding a conditional in
the code but it's not high on my list.

Thus far your e-mails have pointed out one bug in the permissions
implementation I am currently working fixing and a bunch of complaining
about features you can imagine someone might someday want but which
noone has actually stood up and said 'I will use this' or 'this sucks
for my use case'. I can find all sorts of things around the kernel
where I can imagine some mythical users might want to do something
different but it isn't a reason to prevent merger. I'd love to have
more review, I'm certainly going to look at your wish list, but don't
expect response to future trolling messages.

-Eric

2010-08-18 16:42:47

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [GIT PULL] notification tree: directory events

On Wed, Aug 18, 2010 at 11:59:06AM -0400, Eric Paris wrote:
> Thus far your e-mails have pointed out one bug in the permissions
> implementation I am currently working fixing and a bunch of complaining
> about features you can imagine someone might someday want but which
> noone has actually stood up and said 'I will use this' or 'this sucks
> for my use case'. I can find all sorts of things around the kernel
> where I can imagine some mythical users might want to do something
> different but it isn't a reason to prevent merger. I'd love to have
> more review, I'm certainly going to look at your wish list, but don't
> expect response to future trolling messages.

Eric, please stop that crap. You've sent a pull request for stuff
that's not only been contentious but also not reviewed at all in this
form to Linus behind everyones back. Andreas actually takes his time
to review the clusterfuck you created, so better be really quite and
listen to him.

2010-08-18 17:07:27

by Eric Paris

[permalink] [raw]
Subject: Re: [GIT PULL] notification tree: directory events

On Wed, 2010-08-18 at 12:42 -0400, Christoph Hellwig wrote:
> On Wed, Aug 18, 2010 at 11:59:06AM -0400, Eric Paris wrote:
> > Thus far your e-mails have pointed out one bug in the permissions
> > implementation I am currently working fixing and a bunch of complaining
> > about features you can imagine someone might someday want but which
> > noone has actually stood up and said 'I will use this' or 'this sucks
> > for my use case'. I can find all sorts of things around the kernel
> > where I can imagine some mythical users might want to do something
> > different but it isn't a reason to prevent merger. I'd love to have
> > more review, I'm certainly going to look at your wish list, but don't
> > expect response to future trolling messages.
>
> Eric, please stop that crap. You've sent a pull request for stuff
> that's not only been contentious but also not reviewed at all in this
> form to Linus behind everyones back. Andreas actually takes his time
> to review the clusterfuck you created, so better be really quite and
> listen to him.

I admit that there were a number of patches created since the last merge
request was held up based on Al's review that weren't sent to list. I
said I screwed up and pointed out what was missed before it was merged.
Clearly those changes didn't live in linux-next long enough to catch all
of their problems (namely the f_count thing everyone agreed was dirty
and broke sound) I wasn't the only person to look at most of those
changes, but they absolutely should have been on list. I've screwed up
on that twice.

But an implication that the idea, the interface, the event types sent
and received, how things worked, or anything like that wasn't sent to
list or that I didn't beg for review just isn't true (all of which has
been implied).

I fucked up not posting some of my internal notification reworks to
improve system performance, maintainability, and reliability. But any
belief that 'contentious' portions of code just magically showed up
behind anyone's back or at the last minute isn't true.

Like I said, I'd love more review. I'll gladly add more things to my
todo list if people have useful ideas. But multiple messages suggesting
code should be reverted because it doesn't implement some imagined
feature or because the code has a bug for I'm betting well over a year
obviously bothers me.

-Eric

2010-08-19 12:44:32

by Andreas Gruenbacher

[permalink] [raw]
Subject: Re: [GIT PULL] notification tree: directory events

On Wednesday 18 August 2010 17:59:06 Eric Paris wrote:
> I'm going to file your e-mail into my todo list and hopefully I get the
> time to look at the ability to ignore directory events.

As far as I can remember, several people involved in the previous discussion
agreed that a reasonable goal would be to make fanotify a superset of
inotify. My understanding was that this would be the general direction of
development. The code apparently is not there yet; it only reports a subset
of the relevant directory events. (In other words, the directory event part
of the code is currently useless.) Given that, I was surprised to see the
code getting merged.

Fanotify has a subset of functionality for watching and vetting regular file
accesses which seems to be useful in its own right; the anti-malware folks
want this part. Implementing only this part was not what was originally
discussed, but I can see some arguments for putting this functionality in now
(or rather leaving it in) and adding the rest later.

The half-thought-out directory events are not part of this subset though.
They are not useful in their own right and only generate overheads, and much
worse, they could even get in the way when proper directory event support is
eventually added. So that part should really be removed ASAP.

I expect more from you than just ignoring my concerns as you imply.

> Nothing hard about it. It's as easy as defining a flag and adding a
> conditional in the code but it's not high on my list.

We are not talking about Eric's own private syscalls here. Things we screw up
now may be very hard or impossible to fix later; syscalls don't just change
from release to release.

This also applies to the error reporting mess I have commented on. At least
the interface should be changed so that it can report a valid file descriptor
and an error condition at the same time; otherwise, we will end up with a
weakness in the interface that we won't be able to fix.

Andreas

2010-08-19 15:00:35

by Eric Paris

[permalink] [raw]
Subject: Re: [GIT PULL] notification tree: directory events

On Thu, 2010-08-19 at 14:44 +0200, Andreas Gruenbacher wrote:
> On Wednesday 18 August 2010 17:59:06 Eric Paris wrote:

> The half-thought-out directory events are not part of this subset though.
> They are not useful in their own right and only generate overheads, and much
> worse, they could even get in the way when proper directory event support is
> eventually added. So that part should really be removed ASAP.

They aren't something I specifically added so you can call them
zero-thought-out. fanotify is just a userspace interface on top of the
notification infrastructure in the kernel. The events the
infrastructure delivers are the events it sends.

> > Nothing hard about it. It's as easy as defining a flag and adding a
> > conditional in the code but it's not high on my list.
>
> We are not talking about Eric's own private syscalls here. Things we screw up
> now may be very hard or impossible to fix later; syscalls don't just change
> from release to release.

Which is why there was so much discussion and reworking of the
interface. It went through many iterations to end up here. What all
did we have in those discussions? 2 magic proc files, ioctls on a char
dev, netlink, a socket with get/setsockopt and eventually we landed on 2
syscalls that noone found fault with. For this particular feature
request the syscall in question looks like so:

SYSCALL_DEFINE2(fanotify_init, unsigned int, flags, unsigned int, event_f_flags)

Your changes could be as simple as defining a new flag and then add an
if (flag && S_IFDIR(i_mode)) to the fanotify_should_send_event handler.
I'd love to hear objections, failings, short comings, or suggestions if
you think the interface is inadequate to address these or other needs.

> This also applies to the error reporting mess I have commented on. At least
> the interface should be changed so that it can report a valid file descriptor
> and an error condition at the same time; otherwise, we will end up with a
> weakness in the interface that we won't be able to fix.

Can you describe your problem for me again. I'm not sure I understand
your request. I don't understand how we have an error and a valid fd at
the same time. Are you really intending to restating the complaint that
when the system is unable to deliver notification there is not a lot of
information about what notification it failed to deliver? You suggested
possibly adding st_ino and st_dev to the notification in that situation.
The notification messages are easy extensible assuming userspace
properly used the provided macros FAN_EVENT_NEXT and FAN_EVENT_OK to do
its processing. Nothing prevents us from expanding what information is
provided in the message. (idea stolen from netlink for just that
purpose)

I understand you want new features but I'm not seeing failures of the
interface to be forward looking or failures in the feature set that is
provided.

-Eric

2010-08-19 20:24:36

by Andreas Gruenbacher

[permalink] [raw]
Subject: Re: [GIT PULL] notification tree - try 37!

On Tuesday 17 August 2010 17:08:26 Eric Paris wrote:
> On Tue, 2010-08-17 at 10:09 +0200, Andreas Gruenbacher wrote:
> > On Tuesday 17 August 2010 05:39:47 Eric Paris wrote:
> > > On Mon, 2010-08-16 at 22:32 +0200, Andreas Gruenbacher wrote:
> > > > Q: What happens when a process watching for FAN_OPEN_PERM or
> > > > FAN_ACCESS_PERM events exits or dies while events are in flight? I
> > > > can't see anything in the code that would wake sleeping processes up
> > > > when the fsnotify_group of the listener is torn down.
> > >
> > > We can get stuck. There was code which cleaned that up, but it got
> > > accidentally removed long ago when, upon review on list, I was told to
> > > remove all timeout code. It's easy enough to fix up. I'll post a
> > > patch this week.
> >
> > This needs to be fixed then. Not such a big deal, but it shows that the
> > tree wasn't ready for being merged yet and needs further review.
>
> Code with bugs, shocking! Two other bugs have been found and patches
> for those will be coming shortly. I've begged for review how many
> times? I don't care when review it comes, I'll address any issues as
> they come up.

Here is one more bug: when watching a directory with inotify, doing an ls
gives me:

Watching d
d was opened
d not opened for writing was closed

Watching the same directory with fanotify results in:

.../d: pid=... open_perm
.../d: pid=... open
.../d: pid=... access_perm
.../d: pid=... access_perm
.../d: pid=... close

Five events seem a bit excessive; I can't explain why so many are generated.
The real issue is when watching the same directory both with inotify and
fanotify, though: the fanotify result stays the same, but

Watching d
d has not changed
d was opened
d has not changed
d has not changed
d not opened for writing was closed

In other words, watching a directory with fanotify causes extra inotify events
with mask == 0.

Andreas

2010-08-19 20:32:56

by Andreas Gruenbacher

[permalink] [raw]
Subject: Re: [GIT PULL] notification tree - try 37!

On Thursday 19 August 2010 22:24:11 Andreas Gruenbacher wrote:
> Watching the same directory with fanotify results in:
>
> .../d: pid=... open_perm
> .../d: pid=... open
> .../d: pid=... access_perm
> .../d: pid=... access_perm
> .../d: pid=... close

One more thing: the fanotify example code automatically sets ignore marks by
default so the second access_perm check above will normally be suppressed.
Use the -n option of the copy here to turn ignore marks off:

http://git.kernel.org/?p=linux/kernel/git/agruen/fanotify-example.git

2010-08-19 20:43:32

by Eric Paris

[permalink] [raw]
Subject: Re: [GIT PULL] notification tree - try 37!

On Thu, 2010-08-19 at 22:24 +0200, Andreas Gruenbacher wrote:
> On Tuesday 17 August 2010 17:08:26 Eric Paris wrote:
> > On Tue, 2010-08-17 at 10:09 +0200, Andreas Gruenbacher wrote:
> > > On Tuesday 17 August 2010 05:39:47 Eric Paris wrote:
> > > > On Mon, 2010-08-16 at 22:32 +0200, Andreas Gruenbacher wrote:
> > > > > Q: What happens when a process watching for FAN_OPEN_PERM or
> > > > > FAN_ACCESS_PERM events exits or dies while events are in flight? I
> > > > > can't see anything in the code that would wake sleeping processes up
> > > > > when the fsnotify_group of the listener is torn down.
> > > >
> > > > We can get stuck. There was code which cleaned that up, but it got
> > > > accidentally removed long ago when, upon review on list, I was told to
> > > > remove all timeout code. It's easy enough to fix up. I'll post a
> > > > patch this week.
> > >
> > > This needs to be fixed then. Not such a big deal, but it shows that the
> > > tree wasn't ready for being merged yet and needs further review.
> >
> > Code with bugs, shocking! Two other bugs have been found and patches
> > for those will be coming shortly. I've begged for review how many
> > times? I don't care when review it comes, I'll address any issues as
> > they come up.
>
> Here is one more bug: when watching a directory with inotify, doing an ls
> gives me:
>
> Watching d
> d was opened
> d not opened for writing was closed
>
> Watching the same directory with fanotify results in:
>
> .../d: pid=... open_perm
> .../d: pid=... open
> .../d: pid=... access_perm
> .../d: pid=... access_perm
> .../d: pid=... close
>
> Five events seem a bit excessive; I can't explain why so many are generated.
> The real issue is when watching the same directory both with inotify and
> fanotify, though: the fanotify result stays the same, but

The extra events are plainly the new events that inotify doesn't
support: namely permissions events. You ask for and received extra
events....

> Watching d
> d has not changed
> d was opened
> d has not changed
> d has not changed
> d not opened for writing was closed
>
> In other words, watching a directory with fanotify causes extra inotify events
> with mask == 0.

I can't reproduce it.

inotifywait -m /mnt/tmp
fanotify -p /mnt/tmp
ls /mnt/tmp

All I see is:

# /tmp/inotifywait.strace -- inotifywait -m /mnt/tmp/
Setting up watches.
Watches established.
/mnt/tmp/ OPEN,ISDIR
/mnt/tmp/ CLOSE_NOWRITE,CLOSE,ISDIR

# /storage/tmp/fanotify/fanotify -p /mnt/tmp
/mnt/tmp: pid=508 open_perm
/mnt/tmp: pid=508 open
/mnt/tmp: pid=508 access_perm
/mnt/tmp: pid=508 close

# ls
file lost+found

You must have some other testing methodology.....

2010-08-19 21:07:54

by Andreas Gruenbacher

[permalink] [raw]
Subject: Re: [GIT PULL] notification tree - try 37!

On Thursday 19 August 2010 22:42:45 Eric Paris wrote:
> The extra events are plainly the new events that inotify doesn't
> support: namely permissions events. You ask for and received extra
> events....

How can it make sense that inotify suddenly starts seeing events it does not
support?

> I can't reproduce it. You must have some other testing methodology.....

Apparently. Here is a trace if what I'm getting through inotify (the inotify
events were dumped with gdb, the rest is from strace):

inotify_init() = 3
inotify_add_watch(3, "d", IN_ACCESS|IN_MODIFY|IN_ATTRIB|IN_CLOSE_WRITE|IN_CLOSE_NOWRITE|IN_OPEN|IN_MOVED_FROM|IN_MOVED_TO|IN_CREATE|
IN_DELETE|IN_DELETE_SELF|IN_MOVE_SELF) = 1
read(3, "\1\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\1\0\0\0 \0\0@\0\0\0\0\0\0\0\0", 4210688) = 32

=> {wd = 1, mask = 0, cookie = 0, len = 0, name = ...}
=> {wd = 1, mask = 1073741856, cookie = 0, len = 0, name = ...}

read(3, "\1\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\1\0\0\0\20\0\0@\0\0\0\0\0\0\0\0", 4210688) = 32

=> {wd = 1, mask = 0, cookie = 0, len = 0, name = ...}
=> {wd = 1, mask = 1073741840, cookie = 0, len = 0, name = ...}

read(3, 0x7fff56db15e0, 4210688) = -1 EINTR (Interrupted system call)


Andreas

2010-08-19 21:22:37

by Andreas Gruenbacher

[permalink] [raw]
Subject: Re: [GIT PULL] notification tree - try 37!

On Thursday 19 August 2010 23:07:31 Andreas Gruenbacher wrote:
> inotify_init() = 3
> inotify_add_watch(3, "d",
> IN_ACCESS|IN_MODIFY|IN_ATTRIB|IN_CLOSE_WRITE|IN_CLOSE_NOWRITE|IN_OPEN|IN_M
> OVED_FROM|IN_MOVED_TO|IN_CREATE| IN_DELETE|IN_DELETE_SELF|IN_MOVE_SELF) = 1
> read(3, "\1\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\1\0\0\0 \0\0@\0\0\0\0\0\0\0\0",
> 4210688) = 32
>
> => {wd = 1, mask = 0, cookie = 0, len = 0, name = ...}
> => {wd = 1, mask = 1073741856, cookie = 0, len = 0, name = ...}
>
> read(3, "\1\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\1\0\0\0\20\0\0@\0\0\0\0\0\0\0\0",
> 4210688) = 32
>
> => {wd = 1, mask = 0, cookie = 0, len = 0, name = ...}

When disabling ignore masks (-n), I get another of these here:

=> {wd = 1, mask = 0, cookie = 0, len = 0, name = ...}

> => {wd = 1, mask = 1073741840, cookie = 0, len = 0, name = ...}
>
> read(3, 0x7fff56db15e0, 4210688) = -1 EINTR (Interrupted system
> call)

Andreas

2010-08-19 23:41:31

by Andreas Gruenbacher

[permalink] [raw]
Subject: Re: [GIT PULL] notification tree: directory events

[Adding linux-fsdevel to the list as this really is a filesystem discussion.]

On Thursday 19 August 2010 17:00:12 Eric Paris wrote:
> On Thu, 2010-08-19 at 14:44 +0200, Andreas Gruenbacher wrote:
> > On Wednesday 18 August 2010 17:59:06 Eric Paris wrote:
> >
> > The half-thought-out directory events are not part of this subset though.
> > They are not useful in their own right and only generate overheads, and
> > much worse, they could even get in the way when proper directory event
> > support is eventually added. So that part should really be removed
> > ASAP.
>
> They aren't something I specifically added so you can call them
> zero-thought-out. fanotify is just a userspace interface on top of the
> notification infrastructure in the kernel. The events the
> infrastructure delivers are the events it sends.

Apparently the infrastructure delivers a number of directory events like file
creation, deletion, and renames through inotify that are not delivered through
fanotify, and fanotify only sees a subset of all directory events. The result
is that directory events for inotify are useful because they allow to watch a
directory for changes, and the directory events in fanotify are not useful for
that right now.

>[...]
>> Your changes could be as simple as defining a new flag and then add an
> if (flag && S_IFDIR(i_mode)) to the fanotify_should_send_event handler.
> I'd love to hear objections, failings, short comings, or suggestions if
> you think the interface is inadequate to address these or other needs.

I don't think the events that fanotify delivers for directories (open_perm,
open, access_perm, close) are useful at all, or that we should allow perm
events for directories.

So let's please get rid of those directory events unconditionally now, and add
them back in their proper, final form later, after 2.6.36.

> > We are not talking about Eric's own private syscalls here. Things we
> > screw up now may be very hard or impossible to fix later; syscalls don't
> > just change from release to release.
>
> Which is why there was so much discussion and reworking of the
> interface. It went through many iterations to end up here. What all
> did we have in those discussions? 2 magic proc files, ioctls on a char
> dev, netlink, a socket with get/setsockopt and eventually we landed on 2
> syscalls that noone found fault with.

Unfortunately there wasn't a lot of discussion about which events should be
generated when. Fanotify was merged before turning into a superset of
inotify, and there was even less discussion about how fanotify should look if
it isn't a superset of inotify.

> > This also applies to the error reporting mess I have commented on. At
> > least the interface should be changed so that it can report a valid file
> > descriptor and an error condition at the same time; otherwise, we will
> > end up with a weakness in the interface that we won't be able to fix.
>
> Can you describe your problem for me again. I'm not sure I understand
> your request. I don't understand how we have an error and a valid fd at
> the same time.

Yes. Right now, struct fanotify_event_metadata contains a file descriptor
which is the only information we have about the object the event refers to, or
a negative error code if a file descriptor could not be opened with
dentry_open() or some other error occurred.

Being able to identify the object that an event refers to is important. There
are two ways to do that:

(1) Include fields like st_dev and st_ino in struct
fanotify_event_metadata. This is not ideal because the listener
won't be able to find out any additional information about the file
(like an idea about its name, for example).

For example, if inode->i_generation is ever exposed to user-space
through stat(), that information would still be missing in struct
fanotify_event_metadata.

(2) Construct a file descriptor that refers to the file that could not be
dentry_open()ed, but which cannot be used for any I/O, and pass the
error condition in a separate field. The kernel has all the
information needed for doing that, and it shouldn't be hard to
implement.

That way, the listener always has a file descriptor to poke around
with.

Failing to do (2) right now, I think it still makes sense to separate the file
descriptor from the error code in struct fanotify_event_metadata; this would
enable us to do (2) later if we decide to.

> I understand you want new features but I'm not seeing failures of the
> interface to be forward looking or failures in the feature set that is
> provided.

I do see failures of being forward looking, inconsistencies in the feature set
which might lead to trouble as we extend fanotify in the future, and bugs that
would have been shaken out in a proper review (OOMing the system when a
listener stops listening or blocking access to files when a listener dies; I
have reported that).

Sorry to say, but this code really should not have been merged yet. (And mind
I'm not saying this because I want to block fanotify from making progress --
quite the contrary.)

Andreas

2010-08-20 00:00:53

by Andreas Gruenbacher

[permalink] [raw]
Subject: Re: [GIT PULL] notification tree - try 37!

[Adding linux-fsdevel here as well.]

On Tuesday 17 August 2010 10:09:50 Andreas Gruenbacher wrote:
> > > Q: What prevents the system from going out of memory when a listener
> > > decides to stop reading events or simply can't keep up? There doesn't
> > > seem to be a limit on the queue depth. Listeners currently need
> > > CAP_SYS_ADMIN, but somehow limiting the queue depth and throttling when
> > > things start to go bad still sounds like a reasonable thing to do,
> > > right?
> >
> > It's an interesting question and obviously one that I've thought about.
> > You remember when we talked previously I said the hardest part left was
> > allowing non-root users to use the interface. It gets especially
> > difficult when thinking about perm-events. I was specifically told not
> > to timeout or drop those. But when dealing with non-root users using
> > perm events? As for pure notification we can do something like inotify
> > does quite easily.
> >
> > I'm not certain exactly what the best semantics are for non trusted
> > users, so I didn't push any patches that way. Suggestions welcome :)
>
> The system will happily go OOM for trusted users and non-perm events if the
> listener doesn't keep up, so some throttling, dropping, or both needs to
> happen for non-perm events. This is the critical case. Doing what inotify
> does (queue an overflow event and drop further events) seems to make sense
> here.
>
> The situation with perm-events is less severe because the number of
> outstanding perm events is bounded by the number of running processes.
> This may be enough of a limit.
>
> I don't think we need to worry about perm-events for untrusted users. We
> can start supporting some kinds of non-perm-events for untrusted users
> later; this won't change the existing interface.

Another case where fanotify fails to generate useful events is when a listener
runs out of file descriptors; events will simply end up with fd == -EMFILE in
that case. I don't think this behavior is useful; instead, reading from the
fanotify file descriptor (he one returned by fanotify_init()) should fail to
give the listener a chance to react.

Andreas

2010-08-20 03:38:39

by Eric Paris

[permalink] [raw]
Subject: Re: [GIT PULL] notification tree: directory events

On Fri, 2010-08-20 at 01:41 +0200, Andreas Gruenbacher wrote:
> [Adding linux-fsdevel to the list as this really is a filesystem discussion.]

> Yes. Right now, struct fanotify_event_metadata contains a file descriptor
> which is the only information we have about the object the event refers to, or
> a negative error code if a file descriptor could not be opened with
> dentry_open() or some other error occurred.
>
> Being able to identify the object that an event refers to is important. There
> are two ways to do that:
>
> (1) Include fields like st_dev and st_ino in struct
> fanotify_event_metadata. This is not ideal because the listener
> won't be able to find out any additional information about the file
> (like an idea about its name, for example).
>
> For example, if inode->i_generation is ever exposed to user-space
> through stat(), that information would still be missing in struct
> fanotify_event_metadata.
>
> (2) Construct a file descriptor that refers to the file that could not be
> dentry_open()ed, but which cannot be used for any I/O, and pass the
> error condition in a separate field. The kernel has all the
> information needed for doing that, and it shouldn't be hard to
> implement.
>
> That way, the listener always has a file descriptor to poke around
> with.
>
> Failing to do (2) right now, I think it still makes sense to separate the file
> descriptor from the error code in struct fanotify_event_metadata; this would
> enable us to do (2) later if we decide to.

In reference to (2), I don't even understand what an fd is that can't be
used for anything. I'll let Al or Christoph respond if they feel like
it, but it sounds crazy to me. You want to just magic up some struct
file and attach a struct path to it even though dentry_open() failed?
So you can do what with it?

On a more realistic note, I'm not opposed to (1), however, your
arguments would lead one to reject inotify as the IN_OVERFLOW or oom
conditions will result in silently lost events or events which provide
no useful information since the notification system has broken down.
When the appropriate use of notification is impossible I'm certainly not
opposed to patches which add best effort information, but you are
already outside the bounds of a reasonably functional system and there
is no good solution.

> bugs that
> would have been shaken out in a proper review (OOMing the system when a
> listener stops listening or blocking access to files when a listener dies; I
> have reported that).

So the actual bugs you have reported, I see two.

The (nearly) unbounded number of potential outstanding notifications
events is a known situation, pointed out in previous discussions well
before this commit and is one of the (numerous) reasons why fanotify is
at this time CAP_SYS_ADMIN only. It is something that is difficult to
address while still making fanotify useful for permissions gating. But
the issue is clearly noted.

As to when a listener dies: You have to define 'dies'. If the process
just stops respond it is supposed to lock forever. I was explicitly ask
to remove timeouts (even though I've already been ask off list to put
them back) In Linus' tree there is a race in which both processes (the
listener and the process doing an action waiting on the listener) can
deadlock and hang forever. A (much less racy but not right) patch to
fix this has already been posted. Do you have comments on that patch?
I have a better version which has worked well for me today which I will
likely post tomorrow morning after I look over it again. I'd love your
feedback.

-Eric

2010-08-20 03:51:00

by Eric Paris

[permalink] [raw]
Subject: Re: [GIT PULL] notification tree - try 37!

On Thu, 2010-08-19 at 23:07 +0200, Andreas Gruenbacher wrote:
> On Thursday 19 August 2010 22:42:45 Eric Paris wrote:

[necessary context to the conversation reinserted]

>>> Watching the same directory with fanotify results in:
>>>
>>> .../d: pid=... open_perm
>>> .../d: pid=... open
>>> .../d: pid=... access_perm
>>> .../d: pid=... access_perm
>>> .../d: pid=... close
>>>
>>> Five events seem a bit excessive; I can't explain why so many are generated.
>>> The real issue is when watching the same directory both with inotify and
>>> fanotify, though: the fanotify result stays the same, but

> > The extra events are plainly the new events that inotify doesn't
> > support: namely permissions events. You ask for and received extra
> > events....
>
> How can it make sense that inotify suddenly starts seeing events it does not
> support?

I inline commented in the wrong place to make it clear what I was
responding to. The question I was was answering was: "Five events seem
a bit excessive; I can't explain why so many are generated." I answered
it.

Now onto the the question of extra inotify events:

> > I can't reproduce it. You must have some other testing methodology.....
>
> Apparently. Here is a trace if what I'm getting through inotify (the inotify
> events were dumped with gdb, the rest is from strace):
>
> inotify_init() = 3
> inotify_add_watch(3, "d", IN_ACCESS|IN_MODIFY|IN_ATTRIB|IN_CLOSE_WRITE|IN_CLOSE_NOWRITE|IN_OPEN|IN_MOVED_FROM|IN_MOVED_TO|IN_CREATE|
> IN_DELETE|IN_DELETE_SELF|IN_MOVE_SELF) = 1
> read(3, "\1\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\1\0\0\0 \0\0@\0\0\0\0\0\0\0\0", 4210688) = 32
>
> => {wd = 1, mask = 0, cookie = 0, len = 0, name = ...}
> => {wd = 1, mask = 1073741856, cookie = 0, len = 0, name = ...}
>
> read(3, "\1\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\1\0\0\0\20\0\0@\0\0\0\0\0\0\0\0", 4210688) = 32
>
> => {wd = 1, mask = 0, cookie = 0, len = 0, name = ...}
> => {wd = 1, mask = 1073741840, cookie = 0, len = 0, name = ...}
>
> read(3, 0x7fff56db15e0, 4210688) = -1 EINTR (Interrupted system call)

inotify_add_watch(3, "/mnt/tmp/", IN_ACCESS|IN_MODIFY|IN_ATTRIB|IN_CLOSE_WRITE|IN_CLOSE_NOWRITE|IN_OPEN|IN_MOVED_FROM|IN_MOVED_TO|IN_CREATE|IN_DELETE|IN_DELETE_SELF|IN_MOVE_SELF) = 1
lstat("/mnt/tmp/", {st_mode=S_IFDIR|0755, st_size=1024, ...}) = 0
write(2, "Watches established.\n", 21) = 21
select(4, [3], NULL, NULL, NULL) = 1 (in [3])
ioctl(3, FIONREAD, [16]) = 0
read(3, "\1\0\0\0 \0\0@\0\0\0\0\0\0\0\0", 65536) = 16
fstat(1, {st_mode=S_IFCHR|0620, st_rdev=makedev(136, 0), ...}) = 0
mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f8fef0d3000
write(1, "/mnt/tmp/ OPEN,ISDIR \n", 22) = 22
select(4, [3], NULL, NULL, NULL) = 1 (in [3])
ioctl(3, FIONREAD, [16]) = 0
read(3, "\1\0\0\0\20\0\0@\0\0\0\0\0\0\0\0", 65536) = 16
write(1, "/mnt/tmp/ CLOSE_NOWRITE,CLOSE,IS"..., 37) = 37
select(4, [3], NULL, NULL, NULL) = ? ERESTARTNOHAND (To be restarted)
--- SIGINT (Interrupt) @ 0 (0) ---
+++ killed by SIGINT +++

We must be doing something different... What kernel? what kconfig?
What exact FS setup? What exact steps are you taking? What programs
are you using to test east side? If you want to spam me with something
I added a whole lot of pr_debug statements throughout notification code
just in case it wasn't perfect (haha) so if you built with dynamic debug
you could run my debug script:

#!/bin/bash

echo "file fs/notify/inode_mark.c +p" > /sys/kernel/debug/dynamic_debug/control
echo "file fs/notify/mark.c +p" > /sys/kernel/debug/dynamic_debug/control
echo "file fs/notify/notification.c +p" > /sys/kernel/debug/dynamic_debug/control
echo "file fs/notify/fanotify/fanotify_user.c +p" > /sys/kernel/debug/dynamic_debug/control
echo "file fs/notify/fanotify/fanotify.c +p" > /sys/kernel/debug/dynamic_debug/control
echo "file fs/notify/vfsmount_mark.c +p" > /sys/kernel/debug/dynamic_debug/control
echo "file fs/notify/group.c +p" > /sys/kernel/debug/dynamic_debug/control
echo "file fs/notify/inotify/inotify_fsnotify.c +p" > /sys/kernel/debug/dynamic_debug/control
echo "file fs/notify/inotify/inotify_user.c +p" > /sys/kernel/debug/dynamic_debug/control
echo "file fs/notify/dnotify/dnotify.c +p" > /sys/kernel/debug/dynamic_debug/control
echo "file fs/notify/fsnotify.c +p" > /sys/kernel/debug/dynamic_debug/control

reproduce and send me the dmesg results. Maybe I can glean something
out of it....

-Eric

2010-08-20 05:20:11

by Andreas Dilger

[permalink] [raw]
Subject: Re: [GIT PULL] notification tree: directory events

On 2010-08-19, at 21:38, Eric Paris wrote:
> On Fri, 2010-08-20 at 01:41 +0200, Andreas Gruenbacher wrote:
>> Being able to identify the object that an event refers to is important. There
>> are two ways to do that:
>>
>> (1) Include fields like st_dev and st_ino in struct
>> fanotify_event_metadata.
>
> On a more realistic note, I'm not opposed to (1), however, your
> arguments would lead one to reject inotify as the IN_OVERFLOW or oom
> conditions will result in silently lost events or events which provide
> no useful information since the notification system has broken down.
> When the appropriate use of notification is impossible I'm certainly not
> opposed to patches which add best effort information, but you are
> already outside the bounds of a reasonably functional system and there
> is no good solution.

What about unifying the file identification here with the file handle used for open_by_handle()? That keeps a consistent identifier for the inode between multiple system calls (always a good thing), and allows the inode to be opened again via open_by_handle() if needed.

Cheers, Andreas




2010-08-20 09:09:51

by Tvrtko Ursulin

[permalink] [raw]
Subject: Re: [GIT PULL] notification tree: directory events

On Friday 20 Aug 2010 04:38:17 Eric Paris wrote:
> On Fri, 2010-08-20 at 01:41 +0200, Andreas Gruenbacher wrote:
> > (2) Construct a file descriptor that refers to the file that could not
> > be
> >
> > dentry_open()ed, but which cannot be used for any I/O, and pass
the
> > error condition in a separate field. The kernel has all the
> > information needed for doing that, and it shouldn't be hard to
> > implement.
> >
> > That way, the listener always has a file descriptor to poke around
> > with.
> >
> > Failing to do (2) right now, I think it still makes sense to separate the
> > file descriptor from the error code in struct fanotify_event_metadata;
> > this would enable us to do (2) later if we decide to.
>
> In reference to (2), I don't even understand what an fd is that can't be
> used for anything. I'll let Al or Christoph respond if they feel like
> it, but it sounds crazy to me. You want to just magic up some struct
> file and attach a struct path to it even though dentry_open() failed?
> So you can do what with it?

I think Andreas would like to get a path even if open failed so it could be
for that use? If creating such file object will be impossible or too hacky,
but the general idea accepted, it may be better just to attach the path to the
event.

Tvrtko

Sophos Plc, The Pentagon, Abingdon Science Park, Abingdon, OX14 3YP, United Kingdom.
Company Reg No 2096520. VAT Reg No GB 348 3873 20.

2010-08-20 09:21:42

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [GIT PULL] notification tree: directory events

On Thu, Aug 19, 2010 at 11:19:07PM -0600, Andreas Dilger wrote:
> What about unifying the file identification here with the file handle used for open_by_handle()? That keeps a consistent identifier for the inode between multiple system calls (always a good thing), and allows the inode to be opened again via open_by_handle() if needed.

That's what the dmapi callouts that are intendeded to do just about the
same as fanotify always did. I'm pretty sure I brought this up earlier
in the discussion.

2010-08-20 11:08:00

by Andreas Gruenbacher

[permalink] [raw]
Subject: Re: [GIT PULL] notification tree: directory events

On Friday 20 August 2010 05:38:17 Eric Paris wrote:
> On Fri, 2010-08-20 at 01:41 +0200, Andreas Gruenbacher wrote:
> > (2) Construct a file descriptor that refers to the file that could not be
> > dentry_open()ed, but which cannot be used for any I/O, and pass the
> > error condition in a separate field. The kernel has all the
> > information needed for doing that, and it shouldn't be hard to
> > implement.
> >
> > That way, the listener always has a file descriptor to poke around
> > with.
> >
> > Failing to do (2) right now, I think it still makes sense to separate the
> > file descriptor from the error code in struct fanotify_event_metadata;
> > this would enable us to do (2) later if we decide to.
>
> In reference to (2), I don't even understand what an fd is that can't be
> used for anything. I'll let Al or Christoph respond if they feel like
> it, but it sounds crazy to me. You want to just magic up some struct
> file and attach a struct path to it even though dentry_open() failed?
> So you can do what with it?

I've never said the fd can't be used for anything. The idea would be to
allow the file to be stat()ed, to poke around in /proc/fd/ in the hope to
get a somewhat useful pathname out, or to use something like xstat() on the
fd one day (David Howells is going to be looking into that).

It would be a good thing to be able to use the same mechanisms for
identifying the file that are available for a "normal" file descriptor.

> On a more realistic note, I'm not opposed to (1), however, your
> arguments would lead one to reject inotify as the IN_OVERFLOW or oom
> conditions will result in silently lost events or events which provide
> no useful information since the notification system has broken down.

Nonsense. An overflow event obviously is not associated with any one file,
so that would be the (only) case where you should get an event that doesn't
refer to a file.

Andreas

2010-08-20 11:25:24

by Andreas Gruenbacher

[permalink] [raw]
Subject: Re: [GIT PULL] notification tree: directory events

On Friday 20 August 2010 05:38:17 Eric Paris wrote:
> So the actual bugs you have reported, I see two.
>
> The (nearly) unbounded number of potential outstanding notifications
> events is a known situation, pointed out in previous discussions well
> before this commit and is one of the (numerous) reasons why fanotify is
> at this time CAP_SYS_ADMIN only. It is something that is difficult to
> address while still making fanotify useful for permissions gating. But
> the issue is clearly noted.

Clearly noting and blissfully ignoring the problem is not enough
unfortunately; this needs to be addressed now. I have already pointed out
(quoted below) that permission gating is not the worst problem here; the worst
problem are listeners whose fanotify event queue just grows and grows. There
is no throttling, and no guarantee that even a listener which simply reads and
completely ignores all events will manage to keep up. The system will run out
of memory eventually.

Here is a quote from a previous message about this problem that only went to
linux-kernel:

On Tuesday 17 August 2010 05:39:47 Eric Paris wrote:
> On Mon, 2010-08-16 at 22:32 +0200, Andreas Gruenbacher wrote:
> > Q: What prevents the system from going out of memory when a listener
> > decides to stop reading events or simply can't keep up? There doesn't
> > seem to be a limit on the queue depth. Listeners currently need
> > CAP_SYS_ADMIN, but somehow limiting the queue depth and throttling when
> > things start to go bad still sounds like a reasonable thing to do,
> > right?
>
> It's an interesting question and obviously one that I've thought about.
> You remember when we talked previously I said the hardest part left was
> allowing non-root users to use the interface. It gets especially
> difficult when thinking about perm-events. I was specifically told not
> to timeout or drop those. But when dealing with non-root users using
> perm events? As for pure notification we can do something like inotify
> does quite easily.
>
> I'm not certain exactly what the best semantics are for non trusted
> users, so I didn't push any patches that way. Suggestions welcome :)

The system will happily go OOM for trusted users and non-perm events if the
listener doesn't keep up, so some throttling, dropping, or both needs to
happen for non-perm events. This is the critical case. Doing what inotify
does (queue an overflow event and drop further events) seems to make sense
here.

The situation with perm-events is less severe because the number of
outstanding perm events is bounded by the number of running processes. This
may be enough of a limit.

I don't think we need to worry about perm-events for untrusted users. We can
start supporting some kinds of non-perm-events for untrusted users later; this
won't change the existing interface.

Andreas

2010-08-20 12:16:34

by Andreas Gruenbacher

[permalink] [raw]
Subject: Re: [GIT PULL] notification tree: directory events

On Friday 20 August 2010 05:38:17 Eric Paris wrote:
> As to when a listener dies: You have to define 'dies'.

I meant a process that gets killed or simply exits while there are outstanding
perm events. Nothing in the code would wake up the processes blocked on the
perm event; they will simply be stuck forever. (They cannot even be killed
with SIGKILL.)

This is easy to reproduce with a listener that is killed while processing a
perm event: just run the fanotify example [1] with the new -s option and hit
^C while it is sleeping.

Bonus points would be awarded to make a process blocked on a listener killable
with SIGKILL or other lethal signals.

[1] http://git.kernel.org/?p=linux/kernel/git/agruen/fanotify-example.git

The listener will also hang forever in that case; not sure why that is the
case.

I already told you about this before (http://lkml.org/lkml/2010/8/16/349); not
sure why everything needs to repeated multiple times until it sinks in.

> If the process just stops respond it is supposed to lock forever.

I agree.

> I was explicitly ask to remove timeouts (even though I've already been ask
> off list to put them back)

I disagree with putting timeouts back in.

> In Linus' tree there is a race in which both processes (the
> listener and the process doing an action waiting on the listener) can
> deadlock and hang forever. A (much less racy but not right) patch to
> fix this has already been posted.
>
> I have a better version which has worked well for me today which I will
> likely post tomorrow morning after I look over it again. I'd love your
> feedback.

Do you have a pointer to that?

Thanks,
Andreas

2010-08-20 12:38:39

by Andreas Gruenbacher

[permalink] [raw]
Subject: Re: [GIT PULL] notification tree - try 37!

On Friday 20 August 2010 05:50:36 Eric Paris wrote:
> We must be doing something different... What kernel? what kconfig?
> What exact FS setup? What exact steps are you taking? What programs
> are you using to test east side?

I'm runnning 2.6.36-rc1, with CONFIG_FANOTIFY and
CONFIG_FANOTIFY_ACCESS_PERMISSIONS on apparently. I am watching the same
directory with inotify and fanotify at the same time, that is, with both an
inotify and an fanotify listener running in two separate processes. The
inotify listener is code I cannot send so easily, but I've shown the resulting
strace. The fanotify listener is the one from [1].

[1] http://git.kernel.org/?p=linux/kernel/git/agruen/fanotify-example.git

Together with the traces I've provided this should give you way enough clues
to be able to look up in the code why listening for fanotify events apparently
causes a concurrent inotify listener to return an inotify event with struct
inotify_event->mask == 0 for each fanotify perm event.

Andreas

2010-08-20 15:29:32

by Andreas Gruenbacher

[permalink] [raw]
Subject: Re: [GIT PULL] notification tree: directory events

On Friday 20 August 2010 11:21:27 Christoph Hellwig wrote:
> On Thu, Aug 19, 2010 at 11:19:07PM -0600, Andreas Dilger wrote:
> > What about unifying the file identification here with the file handle
> > used for open_by_handle()? That keeps a consistent identifier for the
> > inode between multiple system calls (always a good thing), and allows
> > the inode to be opened again via open_by_handle() if needed.
>
> That's what the dmapi callouts that are intendeded to do just about the
> same as fanotify always did. I'm pretty sure I brought this up earlier
> in the discussion.

I remember you bringing this up.

The persistent handles require CAP_DAC_READ_SEARCH for using open_by_handle()
to prevent an unprivileged process from forging handles and bypassing
directory permission checks. This would be okay for users of fanotify which
can listen to all events in their namespace (and which requires CAP_SYS_ADMIN
anyway).

I don't see how open_by_handle() could be made safe to use by arbitrary
processes; I don't think we can make the handles cryptographically strong, for
example. But I may be overlooking something here.

[Side note: checking for CAP_DAC_READ_SEARCH in fanotify would probably be
enough when no perm events are involved because dentry_open() performs tail
permission checks anyway.]

In the future it will make sense to extend fanotify to allow unprivileged
processes to listen to "their own" events, for example, like inotify does
today. (You know that providing a better inotify replacement was one of the
goals of fanotify before it got merged anyway.) Unprivileged processes
wouldn't be allowed to use open_by_handle() anymore though, and so file
handles still look like a better choice for fanotify to me.

Thanks,
Andreas

2010-08-20 20:39:44

by Andreas Dilger

[permalink] [raw]
Subject: Re: [GIT PULL] notification tree: directory events

On 2010-08-20, at 09:29, Andreas Gruenbacher wrote:
> On Friday 20 August 2010 11:21:27 Christoph Hellwig wrote:
>> On Thu, Aug 19, 2010 at 11:19:07PM -0600, Andreas Dilger wrote:
>>> What about unifying the file identification here with the file handle
>>> used for open_by_handle()? That keeps a consistent identifier for the
>>> inode between multiple system calls (always a good thing), and allows
>>> the inode to be opened again via open_by_handle() if needed.
>>
>> That's what the dmapi callouts that are intended to do just about the
>> same as fanotify always did. I'm pretty sure I brought this up earlier
>> in the discussion.
>
> I remember you bringing this up.
>
> The persistent handles require CAP_DAC_READ_SEARCH for using open_by_handle()
> to prevent an unprivileged process from forging handles and bypassing
> directory permission checks. This would be okay for users of fanotify which
> can listen to all events in their namespace (and which requires CAP_SYS_ADMIN
> anyway).
>
> I don't see how open_by_handle() could be made safe to use by arbitrary
> processes; I don't think we can make the handles cryptographically strong, for
> example. But I may be overlooking something here.

If the file handles only need to have a limited lifetime for unprivileged processes, then they can contain a random salt that is kept on the in-core inode. For me and my intended HPC use case this would be a useful addition for open_by_handle() to allow unprivileged process access. At worst, if the inode is evicted from memory the process would redo the name_to_handle(), or do the slower open-by-path().

I suspect it would also be possible to have an array of per-superblock (or global) crypto keys that are hashed with the file handle data. That avoids the per-inode memory, and allows a well-defined lifetime for the handle (minutes, hours, days) only as a function of how quickly the crypto key needs to rotate (based on key length and attack speed) and the size of the array that is kept.

> In the future it will make sense to extend fanotify to allow unprivileged
> processes to listen to "their own" events, for example, like inotify does
> today. (You know that providing a better inotify replacement was one of the
> goals of fanotify before it got merged anyway.) Unprivileged processes
> wouldn't be allowed to use open_by_handle() anymore though, and so file
> handles still look like a better choice for fanotify to me.


Cheers, Andreas




2010-08-23 16:46:34

by Eric Paris

[permalink] [raw]
Subject: Re: [GIT PULL] notification tree - try 37!

On Fri, 2010-08-20 at 14:38 +0200, Andreas Gruenbacher wrote:
> On Friday 20 August 2010 05:50:36 Eric Paris wrote:
> > We must be doing something different... What kernel? what kconfig?
> > What exact FS setup? What exact steps are you taking? What programs
> > are you using to test east side?
>
> I'm runnning 2.6.36-rc1, with CONFIG_FANOTIFY and
> CONFIG_FANOTIFY_ACCESS_PERMISSIONS on apparently. I am watching the same
> directory with inotify and fanotify at the same time, that is, with both an
> inotify and an fanotify listener running in two separate processes. The
> inotify listener is code I cannot send so easily, but I've shown the resulting
> strace. The fanotify listener is the one from [1].
>
> [1] http://git.kernel.org/?p=linux/kernel/git/agruen/fanotify-example.git
>
> Together with the traces I've provided this should give you way enough clues
> to be able to look up in the code why listening for fanotify events apparently
> causes a concurrent inotify listener to return an inotify event with struct
> inotify_event->mask == 0 for each fanotify perm event.

Spent a bit of the weekend trying to figure out what you were doing and
couldn't reproduce it or find it in the code because I had already fixed
it (albeit for slightly different reasons). The patch in question was:

http://marc.info/?l=linux-kernel&m=128214903125780&w=2

the vfsmount_test_mask was always initialized but since no vfsmount
marks were found it was never cleared. This left the code thinking that
the given (inode) mark was interested in the event. I think you could
reproduce it differently

inotifywait -m -e open /mnt/tmp
inotifywait -m -e close /mnt/tmp

and you would get both event types for both watches.

-Eric

2010-08-23 22:38:43

by Andreas Gruenbacher

[permalink] [raw]
Subject: Re: [GIT PULL] notification tree - try 37!

On Monday 23 August 2010 18:46:13 Eric Paris wrote:
> Spent a bit of the weekend trying to figure out what you were doing and
> couldn't reproduce it or find it in the code because I had already fixed
> it (albeit for slightly different reasons). The patch in question was:
>
> http://marc.info/?l=linux-kernel&m=128214903125780&w=2
>
> the vfsmount_test_mask was always initialized but since no vfsmount
> marks were found it was never cleared. This left the code thinking that
> the given (inode) mark was interested in the event.

That seems to explain it. I can no longer reproduce the bug with your current
for-linus tree (3ba67c8a) using my previous method.

Sorry this bug was difficult for you to track down; I was not even aware of
this fix until today.

> I think you could reproduce it differently
>
> inotifywait -m -e open /mnt/tmp
> inotifywait -m -e close /mnt/tmp
>
> and you would get both event types for both watches.

No, the open listener gets:
d/ OPEN,ISDIR

and the close listener gets:
d/ CLOSE_NOWRITE,CLOSE,ISDIR

I tried to reproduce the previously failing case with inotifywait. I can see
that inotifywait receives five inotify events at the syscall level, but it
will not report events with mask == 0.

Thanks,
Andreas