2009-10-31 18:47:28

by Eric Paris

Subject: [PATCH 01/10] vfs: introduce FMODE_NONOTIFY

This is a new f_mode which can only be set by the kernel. It indicates
that the fd was opened by fanotify and should not cause future fanotify
events. This is needed to prevent fanotify livelock. An example of
obvious livelock is from fanotify close events.

Process A closes file1
This creates a close event for file1.
fanotify opens file1 for Listener X
Listener X deals with the event and closes its fd for file1.
This creates a close event for file1.
fanotify opens file1 for Listener X
Listener X deals with the event and closes its fd for file1.
This creates a close event for file1.
fanotify opens file1 for Listener X
Listener X deals with the event and closes its fd for file1.
Notice a pattern?

The fix is to add the FMODE_NONOTIFY bit to the open filp done by the kernel
for fanotify. Thus when that file is used it will not generate future
events.

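This patch defines the bit, masks it out of f_flags in __dentry_open(), and
makes the fsnotify access, modify, open and close hooks skip fsnotify_parent()
and fsnotify() for files that carry it.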
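
The livelock and its fix can be modelled in userspace. The sketch below is only an illustration, not kernel code: struct demo_file, the DEMO_ prefixed constant and both helper functions are hypothetical names, and only the bit test mirrors the check this patch adds to the fsnotify hooks.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the kernel bit; the value is the one in this patch. */
#define DEMO_FMODE_NONOTIFY 0x800000u

/* Hypothetical, stripped-down struct file. */
struct demo_file {
	unsigned int f_mode;
};

/* True when closing this file should queue a close event. */
static bool demo_close_generates_event(const struct demo_file *file)
{
	return !(file->f_mode & DEMO_FMODE_NONOTIFY);
}

/*
 * Count the events produced when a process closes a file and the listener
 * then closes the fd it was handed for every event it receives.
 * max_rounds bounds the loop so the livelocking case terminates here.
 */
static int demo_events_for_one_close(bool listener_fd_nonotify, int max_rounds)
{
	struct demo_file file1 = { .f_mode = 0 };
	struct demo_file listener_fd = {
		.f_mode = listener_fd_nonotify ? DEMO_FMODE_NONOTIFY : 0,
	};
	int events = 0;

	assert(max_rounds > 0);

	/* Process A closes file1. */
	if (demo_close_generates_event(&file1))
		events++;

	/* Each delivered event makes the listener close its own fd. */
	for (int round = 0; round < max_rounds && events > round; round++) {
		if (demo_close_generates_event(&listener_fd))
			events++;
	}
	return events;
}
```

Without the bit, every listener close feeds another event into the queue, so the count grows with the number of rounds allowed. With the bit set on the listener's fd, only the original close is reported.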

Signed-off-by: Eric Paris <[email protected]>
---

fs/open.c | 7 ++++---
include/asm-generic/fcntl.h | 8 ++++++++
include/linux/fs.h | 3 +++
include/linux/fsnotify.h | 24 ++++++++++++++++--------
4 files changed, 31 insertions(+), 11 deletions(-)

diff --git a/fs/open.c b/fs/open.c
index ce737b3..7347eef 100644
--- a/fs/open.c
+++ b/fs/open.c
@@ -831,9 +831,10 @@ static struct file *__dentry_open(struct dentry *dentry, struct vfsmount *mnt,
struct inode *inode;
int error;

- f->f_flags = flags;
- f->f_mode = (__force fmode_t)((flags+1) & O_ACCMODE) | FMODE_LSEEK |
- FMODE_PREAD | FMODE_PWRITE;
+ f->f_flags = (flags & ~(FMODE_EXEC | FMODE_NONOTIFY));
+ f->f_mode = (__force fmode_t)((flags+1) & O_ACCMODE) | (flags & FMODE_NONOTIFY) |
+ FMODE_LSEEK | FMODE_PREAD | FMODE_PWRITE;
+
inode = dentry->d_inode;
if (f->f_mode & FMODE_WRITE) {
error = __get_file_write_access(inode, mnt);
diff --git a/include/asm-generic/fcntl.h b/include/asm-generic/fcntl.h
index 104fce8..30bece2 100644
--- a/include/asm-generic/fcntl.h
+++ b/include/asm-generic/fcntl.h
@@ -3,6 +3,14 @@

#include <linux/types.h>

+/*
+ * FMODE_EXEC is 0x20
+ * FMODE_NONOTIFY is 0x800000
+ * These cannot be used by userspace O_* until internal and external open
+ * flags are split.
+ * -Eric Paris
+ */
+
#define O_ACCMODE 00000003
#define O_RDONLY 00000000
#define O_WRONLY 00000001
diff --git a/include/linux/fs.h b/include/linux/fs.h
index ab92f13..752056f 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -87,6 +87,9 @@ struct inodes_stat_t {
*/
#define FMODE_NOCMTIME ((__force fmode_t)2048)

+/* File was opened by fanotify and shouldn't generate fanotify events */
+#define FMODE_NONOTIFY ((__force fmode_t)8388608)
+
/*
* The below are the various read and write types that we support. Some of
* them include behavioral modifiers that send information down to the
diff --git a/include/linux/fsnotify.h b/include/linux/fsnotify.h
index 294f488..cc4dead 100644
--- a/include/linux/fsnotify.h
+++ b/include/linux/fsnotify.h
@@ -197,8 +197,10 @@ static inline void fsnotify_access(struct file *file)

inotify_inode_queue_event(inode, mask, 0, NULL, NULL);

- fsnotify_parent(path, NULL, mask);
- fsnotify(inode, mask, path, FSNOTIFY_EVENT_PATH, NULL, 0);
+ if (!(file->f_mode & FMODE_NONOTIFY)) {
+ fsnotify_parent(path, NULL, mask);
+ fsnotify(inode, mask, path, FSNOTIFY_EVENT_PATH, NULL, 0);
+ }
}

/*
@@ -215,8 +217,10 @@ static inline void fsnotify_modify(struct file *file)

inotify_inode_queue_event(inode, mask, 0, NULL, NULL);

- fsnotify_parent(path, NULL, mask);
- fsnotify(inode, mask, path, FSNOTIFY_EVENT_PATH, NULL, 0);
+ if (!(file->f_mode & FMODE_NONOTIFY)) {
+ fsnotify_parent(path, NULL, mask);
+ fsnotify(inode, mask, path, FSNOTIFY_EVENT_PATH, NULL, 0);
+ }
}

/*
@@ -233,8 +237,10 @@ static inline void fsnotify_open(struct file *file)

inotify_inode_queue_event(inode, mask, 0, NULL, NULL);

- fsnotify_parent(path, NULL, mask);
- fsnotify(inode, mask, path, FSNOTIFY_EVENT_PATH, NULL, 0);
+ if (!(file->f_mode & FMODE_NONOTIFY)) {
+ fsnotify_parent(path, NULL, mask);
+ fsnotify(inode, mask, path, FSNOTIFY_EVENT_PATH, NULL, 0);
+ }
}

/*
@@ -252,8 +258,10 @@ static inline void fsnotify_close(struct file *file)

inotify_inode_queue_event(inode, mask, 0, NULL, NULL);

- fsnotify_parent(path, NULL, mask);
- fsnotify(inode, mask, path, FSNOTIFY_EVENT_PATH, NULL, 0);
+ if (!(file->f_mode & FMODE_NONOTIFY)) {
+ fsnotify_parent(path, NULL, mask);
+ fsnotify(inode, mask, path, FSNOTIFY_EVENT_PATH, NULL, 0);
+ }
}

/*


2009-10-31 18:47:36

by Eric Paris

Subject: [PATCH 02/10] fanotify: fscking all notification system

fanotify is a novel file notification system which bases notification on
giving userspace both an event type (open, close, read, write) and an open
file descriptor to the object in question. This should address a number of
races and problems with other notification systems like inotify and dnotify
and should allow the future implementation of blocking or access controlled
notification. These are useful for on-access scanners or hierarchical storage
management schemes.

This patch just implements the basics of the fsnotify functions.

Signed-off-by: Eric Paris <[email protected]>
---

fs/notify/Kconfig | 1 +
fs/notify/Makefile | 1 +
fs/notify/fanotify/Kconfig | 11 ++++++
fs/notify/fanotify/Makefile | 1 +
fs/notify/fanotify/fanotify.c | 78 +++++++++++++++++++++++++++++++++++++++++
fs/notify/fanotify/fanotify.h | 12 ++++++
include/linux/Kbuild | 1 +
include/linux/fanotify.h | 40 +++++++++++++++++++++
8 files changed, 145 insertions(+), 0 deletions(-)
create mode 100644 fs/notify/fanotify/Kconfig
create mode 100644 fs/notify/fanotify/Makefile
create mode 100644 fs/notify/fanotify/fanotify.c
create mode 100644 fs/notify/fanotify/fanotify.h
create mode 100644 include/linux/fanotify.h

diff --git a/fs/notify/Kconfig b/fs/notify/Kconfig
index dffbb09..22c629e 100644
--- a/fs/notify/Kconfig
+++ b/fs/notify/Kconfig
@@ -3,3 +3,4 @@ config FSNOTIFY

source "fs/notify/dnotify/Kconfig"
source "fs/notify/inotify/Kconfig"
+source "fs/notify/fanotify/Kconfig"
diff --git a/fs/notify/Makefile b/fs/notify/Makefile
index 0922cc8..396a387 100644
--- a/fs/notify/Makefile
+++ b/fs/notify/Makefile
@@ -2,3 +2,4 @@ obj-$(CONFIG_FSNOTIFY) += fsnotify.o notification.o group.o inode_mark.o

obj-y += dnotify/
obj-y += inotify/
+obj-y += fanotify/
diff --git a/fs/notify/fanotify/Kconfig b/fs/notify/fanotify/Kconfig
new file mode 100644
index 0000000..70631ed
--- /dev/null
+++ b/fs/notify/fanotify/Kconfig
@@ -0,0 +1,11 @@
+config FANOTIFY
+ bool "Filesystem wide access notification"
+ select FSNOTIFY
+ default y
+ ---help---
+ Say Y here to enable fanotify support. fanotify is a system wide
+ file access notification interface. Events are read from a
+ socket and in doing so an fd is created in the reading process
+ which points to the same data as the one on which the event occurred.
+
+ If unsure, say Y.
diff --git a/fs/notify/fanotify/Makefile b/fs/notify/fanotify/Makefile
new file mode 100644
index 0000000..e7d39c0
--- /dev/null
+++ b/fs/notify/fanotify/Makefile
@@ -0,0 +1 @@
+obj-$(CONFIG_FANOTIFY) += fanotify.o
diff --git a/fs/notify/fanotify/fanotify.c b/fs/notify/fanotify/fanotify.c
new file mode 100644
index 0000000..3ffb9db
--- /dev/null
+++ b/fs/notify/fanotify/fanotify.c
@@ -0,0 +1,78 @@
+#include <linux/fdtable.h>
+#include <linux/fsnotify_backend.h>
+#include <linux/init.h>
+#include <linux/kernel.h> /* UINT_MAX */
+#include <linux/types.h>
+
+#include "fanotify.h"
+
+static int fanotify_handle_event(struct fsnotify_group *group, struct fsnotify_event *event)
+{
+ int ret;
+
+
+ BUILD_BUG_ON(FAN_ACCESS != FS_ACCESS);
+ BUILD_BUG_ON(FAN_MODIFY != FS_MODIFY);
+ BUILD_BUG_ON(FAN_CLOSE_NOWRITE != FS_CLOSE_NOWRITE);
+ BUILD_BUG_ON(FAN_CLOSE_WRITE != FS_CLOSE_WRITE);
+ BUILD_BUG_ON(FAN_OPEN != FS_OPEN);
+ BUILD_BUG_ON(FAN_EVENT_ON_CHILD != FS_EVENT_ON_CHILD);
+ BUILD_BUG_ON(FAN_Q_OVERFLOW != FS_Q_OVERFLOW);
+
+ pr_debug("%s: group=%p event=%p\n", __func__, group, event);
+
+ ret = fsnotify_add_notify_event(group, event, NULL, NULL);
+
+ return ret;
+}
+
+static bool fanotify_should_send_event(struct fsnotify_group *group, struct inode *inode,
+ struct vfsmount *mnt, __u32 mask, void *data,
+ int data_type)
+{
+ struct fsnotify_mark *fsn_mark;
+ bool send;
+
+ pr_debug("%s: group=%p inode=%p mask=%x data=%p data_type=%d\n",
+ __func__, group, inode, mask, data, data_type);
+
+ /* sorry, fanotify only gives a damn about files and dirs */
+ if (!S_ISREG(inode->i_mode) &&
+ !S_ISDIR(inode->i_mode))
+ return false;
+
+ /* if we don't have enough info to send an event to userspace say no */
+ if (data_type != FSNOTIFY_EVENT_PATH)
+ return false;
+
+ fsn_mark = fsnotify_find_mark(group, inode);
+ if (!fsn_mark)
+ return false;
+
+ /* if the event is for a child and this inode doesn't care about
+ * events on the child, don't send it! */
+ if ((mask & FS_EVENT_ON_CHILD) &&
+ !(fsn_mark->mask & FS_EVENT_ON_CHILD)) {
+ send = false;
+ } else {
+ /*
+ * We care about children, but do we care about this particular
+ * type of event?
+ */
+ mask = (mask & ~FS_EVENT_ON_CHILD);
+ send = (fsn_mark->mask & mask);
+ }
+
+ /* find took a reference */
+ fsnotify_put_mark(fsn_mark);
+
+ return send;
+}
+
+const struct fsnotify_ops fanotify_fsnotify_ops = {
+ .handle_event = fanotify_handle_event,
+ .should_send_event = fanotify_should_send_event,
+ .free_group_priv = NULL,
+ .free_event_priv = NULL,
+ .freeing_mark = NULL,
+};
diff --git a/fs/notify/fanotify/fanotify.h b/fs/notify/fanotify/fanotify.h
new file mode 100644
index 0000000..50765eb
--- /dev/null
+++ b/fs/notify/fanotify/fanotify.h
@@ -0,0 +1,12 @@
+#include <linux/fanotify.h>
+#include <linux/fsnotify_backend.h>
+#include <linux/net.h>
+#include <linux/kernel.h>
+#include <linux/types.h>
+
+static inline bool fanotify_mask_valid(__u32 mask)
+{
+ if (mask & ~((__u32)FAN_ALL_INCOMING_EVENTS))
+ return false;
+ return true;
+}
diff --git a/include/linux/Kbuild b/include/linux/Kbuild
index 5d9957f..83381d4 100644
--- a/include/linux/Kbuild
+++ b/include/linux/Kbuild
@@ -206,6 +206,7 @@ unifdef-y += ethtool.h
unifdef-y += eventpoll.h
unifdef-y += signalfd.h
unifdef-y += ext2_fs.h
+unifdef-y += fanotify.h
unifdef-y += fb.h
unifdef-y += fcntl.h
unifdef-y += filter.h
diff --git a/include/linux/fanotify.h b/include/linux/fanotify.h
new file mode 100644
index 0000000..b560f86
--- /dev/null
+++ b/include/linux/fanotify.h
@@ -0,0 +1,40 @@
+#ifndef _LINUX_FANOTIFY_H
+#define _LINUX_FANOTIFY_H
+
+#include <linux/types.h>
+
+/* the following events that user-space can register for */
+#define FAN_ACCESS 0x00000001 /* File was accessed */
+#define FAN_MODIFY 0x00000002 /* File was modified */
+#define FAN_CLOSE_WRITE 0x00000008 /* Writable file closed */
+#define FAN_CLOSE_NOWRITE 0x00000010 /* Unwritable file closed */
+#define FAN_OPEN 0x00000020 /* File was opened */
+
+#define FAN_EVENT_ON_CHILD 0x08000000 /* interested in child events */
+
+/* FIXME currently Q's have no limit.... */
+#define FAN_Q_OVERFLOW 0x00004000 /* Event queue overflowed */
+
+/* helper events */
+#define FAN_CLOSE (FAN_CLOSE_WRITE | FAN_CLOSE_NOWRITE) /* close */
+
+/*
+ * All of the events - we build the list by hand so that we can add flags in
+ * the future and not break backward compatibility. Apps will get only the
+ * events that they originally wanted. Be sure to add new events here!
+ */
+#define FAN_ALL_EVENTS (FAN_ACCESS |\
+ FAN_MODIFY |\
+ FAN_CLOSE |\
+ FAN_OPEN)
+
+/*
+ * All legal FAN bits userspace can request (although possibly not all
+ * at the same time).
+ */
+#define FAN_ALL_INCOMING_EVENTS (FAN_ALL_EVENTS |\
+ FAN_EVENT_ON_CHILD)
+#ifdef __KERNEL__
+
+#endif /* __KERNEL__ */
+#endif /* _LINUX_FANOTIFY_H */

2009-10-31 18:47:40

by Eric Paris

Subject: [PATCH 03/10] fanotify: drop notifications if they exist in the outgoing queue

fanotify listeners get an open file descriptor to the object in question so
the ordering of operations is not as important as in other notification
systems. inotify will drop events if the last event in the event FIFO is
the same as the current event. This patch will drop fanotify events if
they are the same as another event anywhere in the event FIFO.
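
As a rough userspace illustration of the difference from inotify, the sketch below scans the whole pending queue, newest entry first, before queueing an event. The types, the fixed-size array and the DEMO_ and demo_ names are assumptions made for the example; the kernel walks a list of fsnotify_event_holder entries instead.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified event: the real struct fsnotify_event carries more. */
struct demo_event {
	unsigned int mask;   /* event type bits */
	const void *object;  /* stands in for to_tell plus the path */
};

#define DEMO_QUEUE_MAX 16

struct demo_queue {
	struct demo_event events[DEMO_QUEUE_MAX];
	size_t len;
};

static bool demo_same_event(const struct demo_event *a,
			    const struct demo_event *b)
{
	return a->mask == b->mask && a->object == b->object;
}

/*
 * Queue ev unless an identical event is already pending anywhere in the FIFO.
 * Returns true if the event was queued, false if it was dropped
 * (duplicate found, or the demo queue is full).
 */
static bool demo_queue_event(struct demo_queue *q, struct demo_event ev)
{
	/* Walk from the newest entry backwards. */
	for (size_t i = q->len; i > 0; i--) {
		if (demo_same_event(&q->events[i - 1], &ev))
			return false;
	}
	if (q->len == DEMO_QUEUE_MAX)
		return false;
	q->events[q->len++] = ev;
	return true;
}
```

An inotify-style check would compare only against the last queued entry; here an open event for object A is dropped even when an event for object B was queued in between.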

Signed-off-by: Eric Paris <[email protected]>
---

fs/notify/fanotify/fanotify.c | 45 +++++++++++++++++++++++++++++++++++++++--
1 files changed, 43 insertions(+), 2 deletions(-)

diff --git a/fs/notify/fanotify/fanotify.c b/fs/notify/fanotify/fanotify.c
index 3ffb9db..c35c117 100644
--- a/fs/notify/fanotify/fanotify.c
+++ b/fs/notify/fanotify/fanotify.c
@@ -6,6 +6,45 @@

#include "fanotify.h"

+static bool should_merge(struct fsnotify_event *old, struct fsnotify_event *new)
+{
+ pr_debug("%s: old=%p new=%p\n", __func__, old, new);
+
+ if ((old->mask == new->mask) &&
+ (old->to_tell == new->to_tell) &&
+ (old->data_type == new->data_type)) {
+ switch (old->data_type) {
+ case (FSNOTIFY_EVENT_PATH):
+ /* paths must match; never fall through to the NONE case */
+ return ((old->path.mnt == new->path.mnt) &&
+ (old->path.dentry == new->path.dentry));
+ case (FSNOTIFY_EVENT_NONE):
+ return true;
+ default:
+ BUG();
+ };
+ }
+ return false;
+}
+
+static int fanotify_merge(struct list_head *list, struct fsnotify_event *event)
+{
+ struct fsnotify_event_holder *holder;
+ struct fsnotify_event *test_event;
+
+ pr_debug("%s: list=%p event=%p\n", __func__, list, event);
+
+ /* and the list better be locked by something too! */
+
+ list_for_each_entry_reverse(holder, list, event_list) {
+ test_event = holder->event;
+ if (should_merge(test_event, event))
+ return -EEXIST;
+ }
+
+ return 0;
+}
+
static int fanotify_handle_event(struct fsnotify_group *group, struct fsnotify_event *event)
{
int ret;
@@ -21,8 +60,10 @@ static int fanotify_handle_event(struct fsnotify_group *group, struct fsnotify_e

pr_debug("%s: group=%p event=%p\n", __func__, group, event);

- ret = fsnotify_add_notify_event(group, event, NULL, NULL);
-
+ ret = fsnotify_add_notify_event(group, event, NULL, fanotify_merge);
+ /* -EEXIST means this event was merged with another, not that it was an error */
+ if (ret == -EEXIST)
+ ret = 0;
return ret;
}

2009-10-31 18:47:49

by Eric Paris

Subject: [PATCH 04/10] fanotify: merge notification events with different masks

Instead of just merging fanotify events if they are exactly the same, merge
notification events with different masks. To do this we have to clone the
old event, update the mask in the new event with the new merged mask, and
put the new event in place of the old event.
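
The clone-and-replace step can be sketched in userspace as follows. This is a simplified model under stated assumptions: struct demo_event, its plain int refcount and the demo_ helpers are hypothetical and only loosely correspond to fsnotify_clone_event(), fsnotify_replace_event() and fsnotify_put_event().

```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical refcounted event; one event may sit on several queues. */
struct demo_event {
	int refcnt;
	unsigned int mask;
	const void *object;
};

static struct demo_event *demo_event_new(unsigned int mask, const void *object)
{
	struct demo_event *ev = malloc(sizeof(*ev));

	if (!ev)
		return NULL;
	ev->refcnt = 1;
	ev->mask = mask;
	ev->object = object;
	return ev;
}

static void demo_event_put(struct demo_event *ev)
{
	if (--ev->refcnt == 0)
		free(ev);
}

/*
 * Fold incoming into the queue slot *slot, which refers to the same object.
 * The old event is left untouched because other queues may still use it.
 * Returns 0 on success, -1 if the clone could not be allocated.
 */
static int demo_merge_by_clone(struct demo_event **slot,
			       const struct demo_event *incoming)
{
	struct demo_event *old = *slot;
	struct demo_event *merged;

	merged = demo_event_new(old->mask | incoming->mask, old->object);
	if (!merged)
		return -1;

	*slot = merged;       /* this queue now holds the clone's reference */
	demo_event_put(old);  /* drop this queue's reference on the old event */
	return 0;
}
```

Cloning keeps the merged mask private to one queue: another listener holding the old event still sees the mask it was originally queued with.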

Signed-off-by: Eric Paris <[email protected]>
---

fs/notify/fanotify/fanotify.c | 39 ++++++++++++++++++++++++++++++---------
1 files changed, 30 insertions(+), 9 deletions(-)

diff --git a/fs/notify/fanotify/fanotify.c b/fs/notify/fanotify/fanotify.c
index c35c117..8e574d6 100644
--- a/fs/notify/fanotify/fanotify.c
+++ b/fs/notify/fanotify/fanotify.c
@@ -10,8 +10,7 @@ static bool should_merge(struct fsnotify_event *old, struct fsnotify_event *new)
{
pr_debug("%s: old=%p new=%p\n", __func__, old, new);

- if ((old->mask == new->mask) &&
- (old->to_tell == new->to_tell) &&
+ if ((old->to_tell == new->to_tell) &&
(old->data_type == new->data_type)) {
switch (old->data_type) {
case (FSNOTIFY_EVENT_PATH):
@@ -29,20 +28,42 @@ static bool should_merge(struct fsnotify_event *old, struct fsnotify_event *new)

static int fanotify_merge(struct list_head *list, struct fsnotify_event *event)
{
- struct fsnotify_event_holder *holder;
+ struct fsnotify_event_holder *test_holder;
struct fsnotify_event *test_event;
+ struct fsnotify_event *new_event;
+ int ret = 0;

pr_debug("%s: list=%p event=%p\n", __func__, list, event);

/* and the list better be locked by something too! */

- list_for_each_entry_reverse(holder, list, event_list) {
- test_event = holder->event;
- if (should_merge(test_event, event))
- return -EEXIST;
+ list_for_each_entry_reverse(test_holder, list, event_list) {
+ test_event = test_holder->event;
+ if (should_merge(test_event, event)) {
+ ret = -EEXIST;
+
+ /* if they are exactly the same we are done */
+ if (test_event->mask == event->mask)
+ goto out;
+
+ /* can't allocate memory, merge was not possible */
+ new_event = fsnotify_clone_event(test_event);
+ if (unlikely(!new_event)) {
+ ret = 0;
+ goto out;
+ }
+
+ /* build new event and replace it on the list */
+ new_event->mask = (test_event->mask | event->mask);
+ fsnotify_replace_event(test_holder, new_event);
+ /* match ref from fsnotify_clone_event() */
+ fsnotify_put_event(new_event);
+
+ break;
+ }
}
-
- return 0;
+out:
+ return ret;
}

static int fanotify_handle_event(struct fsnotify_group *group, struct fsnotify_event *event)

2009-10-31 18:47:55

by Eric Paris

Subject: [PATCH 05/10] fanotify: do not clone on merge unless needed

Currently if 2 events are going to be merged on the notification queue with
different masks the second event will be cloned and will replace the first
event. However if this notification queue is the only place referencing
the event in question there is no reason not to just update the event in
place. We can tell this if the event->refcnt == 1. Since we hold a
reference for each queue this event is on we know that when refcnt == 1
this is the only queue. The other concern is that it might be about to be
added to a new queue, but this can't be the case since fsnotify holds a
reference on the event until it is finished adding it to queues.
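
The refcount test reduces to a small decision, sketched here in userspace. The struct and function names are hypothetical, and a plain int stands in for the kernel's atomic_t refcnt.

```c
#include <stdbool.h>

/* Hypothetical event with a plain reference count. */
struct demo_event {
	int refcnt;
	unsigned int mask;
};

/*
 * Returns true if new_mask was merged into pending in place.
 * A false return means the event is shared with at least one other
 * holder, so the caller has to fall back to clone-and-replace.
 */
static bool demo_try_merge_in_place(struct demo_event *pending,
				    unsigned int new_mask)
{
	if (pending->refcnt != 1)
		return false;
	pending->mask |= new_mask;
	return true;
}
```

The in-place path avoids an allocation, so it also cannot fail for lack of memory the way the clone path can.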

Signed-off-by: Eric Paris <[email protected]>
---

fs/notify/fanotify/fanotify.c | 10 ++++++++++
1 files changed, 10 insertions(+), 0 deletions(-)

diff --git a/fs/notify/fanotify/fanotify.c b/fs/notify/fanotify/fanotify.c
index 8e574d6..5b0b6b4 100644
--- a/fs/notify/fanotify/fanotify.c
+++ b/fs/notify/fanotify/fanotify.c
@@ -46,6 +46,16 @@ static int fanotify_merge(struct list_head *list, struct fsnotify_event *event)
if (test_event->mask == event->mask)
goto out;

+ /*
+ * if the refcnt == 1 this is the only queue
+ * for this event and so we can update the mask
+ * in place.
+ */
+ if (atomic_read(&test_event->refcnt) == 1) {
+ test_event->mask |= event->mask;
+ goto out;
+ }
+
/* can't allocate memory, merge was not possible */
new_event = fsnotify_clone_event(test_event);
if (unlikely(!new_event)) {

2009-10-31 18:48:00

by Eric Paris

Subject: [PATCH 06/10] fanotify: fanotify_init syscall declaration

This patch defines a new syscall fanotify_init() of the form:

int sys_fanotify_init(unsigned int flags, unsigned int event_f_flags, int priority)

This syscall is used to create an fanotify group. This is very similar to
the inotify_init() syscall.

Signed-off-by: Eric Paris <[email protected]>
---

arch/x86/ia32/ia32entry.S | 1 +
arch/x86/include/asm/unistd_32.h | 3 ++-
arch/x86/include/asm/unistd_64.h | 2 ++
arch/x86/kernel/syscall_table_32.S | 1 +
fs/notify/fanotify/Makefile | 2 +-
fs/notify/fanotify/fanotify_user.c | 13 +++++++++++++
include/linux/syscalls.h | 1 +
7 files changed, 21 insertions(+), 2 deletions(-)
create mode 100644 fs/notify/fanotify/fanotify_user.c

diff --git a/arch/x86/ia32/ia32entry.S b/arch/x86/ia32/ia32entry.S
index d23b987..df13544 100644
--- a/arch/x86/ia32/ia32entry.S
+++ b/arch/x86/ia32/ia32entry.S
@@ -842,4 +842,5 @@ ia32_sys_call_table:
.quad compat_sys_rt_tgsigqueueinfo /* 335 */
.quad sys_perf_event_open
.quad compat_sys_recvmmsg
+ .quad sys_fanotify_init
ia32_syscall_end:
diff --git a/arch/x86/include/asm/unistd_32.h b/arch/x86/include/asm/unistd_32.h
index 3baf379..1dc812d 100644
--- a/arch/x86/include/asm/unistd_32.h
+++ b/arch/x86/include/asm/unistd_32.h
@@ -343,10 +343,11 @@
#define __NR_rt_tgsigqueueinfo 335
#define __NR_perf_event_open 336
#define __NR_recvmmsg 337
+#define __NR_fanotify_init 338

#ifdef __KERNEL__

-#define NR_syscalls 338
+#define NR_syscalls 339

#define __ARCH_WANT_IPC_PARSE_VERSION
#define __ARCH_WANT_OLD_READDIR
diff --git a/arch/x86/include/asm/unistd_64.h b/arch/x86/include/asm/unistd_64.h
index 4843f7b..83f0f69 100644
--- a/arch/x86/include/asm/unistd_64.h
+++ b/arch/x86/include/asm/unistd_64.h
@@ -663,6 +663,8 @@ __SYSCALL(__NR_rt_tgsigqueueinfo, sys_rt_tgsigqueueinfo)
__SYSCALL(__NR_perf_event_open, sys_perf_event_open)
#define __NR_recvmmsg 299
__SYSCALL(__NR_recvmmsg, sys_recvmmsg)
+#define __NR_fanotify_init 300
+__SYSCALL(__NR_fanotify_init, sys_fanotify_init)

#ifndef __NO_STUBS
#define __ARCH_WANT_OLD_READDIR
diff --git a/arch/x86/kernel/syscall_table_32.S b/arch/x86/kernel/syscall_table_32.S
index 70c2125..34bf346 100644
--- a/arch/x86/kernel/syscall_table_32.S
+++ b/arch/x86/kernel/syscall_table_32.S
@@ -337,3 +337,4 @@ ENTRY(sys_call_table)
.long sys_rt_tgsigqueueinfo /* 335 */
.long sys_perf_event_open
.long sys_recvmmsg
+ .long sys_fanotify_init
diff --git a/fs/notify/fanotify/Makefile b/fs/notify/fanotify/Makefile
index e7d39c0..0999213 100644
--- a/fs/notify/fanotify/Makefile
+++ b/fs/notify/fanotify/Makefile
@@ -1 +1 @@
-obj-$(CONFIG_FANOTIFY) += fanotify.o
+obj-$(CONFIG_FANOTIFY) += fanotify.o fanotify_user.o
diff --git a/fs/notify/fanotify/fanotify_user.c b/fs/notify/fanotify/fanotify_user.c
new file mode 100644
index 0000000..ca35276
--- /dev/null
+++ b/fs/notify/fanotify/fanotify_user.c
@@ -0,0 +1,13 @@
+#include <linux/fcntl.h>
+#include <linux/fs.h>
+#include <linux/fsnotify_backend.h>
+#include <linux/security.h>
+#include <linux/syscalls.h>
+
+#include "fanotify.h"
+
+SYSCALL_DEFINE3(fanotify_init, unsigned int, flags, unsigned int, event_f_flags,
+ int, priority)
+{
+ return -ENOSYS;
+}
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index f03a2d9..534f9ff 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -876,6 +876,7 @@ asmlinkage long sys_pselect6(int, fd_set __user *, fd_set __user *,
asmlinkage long sys_ppoll(struct pollfd __user *, unsigned int,
struct timespec __user *, const sigset_t __user *,
size_t);
+asmlinkage long sys_fanotify_init(unsigned int flags, unsigned int event_f_flags, int priority);

int kernel_execve(const char *filename, char *const argv[], char *const envp[]);

2009-10-31 18:48:12

by Eric Paris

Subject: [PATCH 07/10] fanotify: fanotify_init syscall implementation

NAME
fanotify_init - initialize an fanotify group

SYNOPSIS
int fanotify_init(unsigned int flags, unsigned int event_f_flags, int priority);

DESCRIPTION
fanotify_init() initializes a new fanotify instance and returns a file
descriptor associated with the new fanotify event queue.

The following values can be OR'd into the flags field:

FAN_NONBLOCK Set the O_NONBLOCK file status flag on the new open file description.
Using this flag saves extra calls to fcntl(2) to achieve the same
result.

FAN_CLOEXEC Set the close-on-exec (FD_CLOEXEC) flag on the new file descriptor.
See the description of the O_CLOEXEC flag in open(2) for reasons why
this may be useful.

The event_f_flags argument is unused and must be set to 0.

The priority argument is unused and must be set to 0.

RETURN VALUE
On success, this system call returns a new file descriptor. On error, -1 is
returned, and errno is set to indicate the error.

ERRORS
EINVAL An invalid value was specified in flags.

EINVAL A non-zero value was passed in event_f_flags or in priority.

ENFILE The system limit on the total number of file descriptors has been reached.

ENOMEM Insufficient kernel memory is available.

CONFORMING TO
These system calls are Linux-specific.
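
The argument checks and flag translation described above can be sketched in userspace as follows. This is only a model: the DEMO_ constants copy the values this patch gives FAN_CLOEXEC and FAN_NONBLOCK, demo_init_flags() is a hypothetical helper, and the capability check and the FMODE_NONOTIFY bit used by the kernel implementation are deliberately left out.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>

/* Values copied from this patch; renamed to avoid clashing with system headers. */
#define DEMO_FAN_CLOEXEC  0x00000001u
#define DEMO_FAN_NONBLOCK 0x00000002u
#define DEMO_FAN_ALL_INIT_FLAGS (DEMO_FAN_CLOEXEC | DEMO_FAN_NONBLOCK)

/*
 * Validate fanotify_init() style arguments and compute the open flags for
 * the new descriptor. Returns 0 and stores the flags in *f_flags, or a
 * negative errno value if an argument is rejected.
 */
static int demo_init_flags(unsigned int flags, unsigned int event_f_flags,
			   int priority, int *f_flags)
{
	/* Both arguments are unused for now and must be 0. */
	if (event_f_flags || priority)
		return -EINVAL;
	/* Reject any flag bit this interface does not know about. */
	if (flags & ~DEMO_FAN_ALL_INIT_FLAGS)
		return -EINVAL;

	*f_flags = O_RDONLY;
	if (flags & DEMO_FAN_CLOEXEC)
		*f_flags |= O_CLOEXEC;
	if (flags & DEMO_FAN_NONBLOCK)
		*f_flags |= O_NONBLOCK;
	return 0;
}
```

Rejecting unknown bits up front is what lets new flags be added later without silently changing the behaviour of existing callers.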

Signed-off-by: Eric Paris <[email protected]>
---

fs/notify/fanotify/fanotify.h | 2 +
fs/notify/fanotify/fanotify_user.c | 61 +++++++++++++++++++++++++++++++++++-
include/linux/fanotify.h | 4 ++
3 files changed, 66 insertions(+), 1 deletions(-)

diff --git a/fs/notify/fanotify/fanotify.h b/fs/notify/fanotify/fanotify.h
index 50765eb..dd656cf 100644
--- a/fs/notify/fanotify/fanotify.h
+++ b/fs/notify/fanotify/fanotify.h
@@ -4,6 +4,8 @@
#include <linux/kernel.h>
#include <linux/types.h>

+extern const struct fsnotify_ops fanotify_fsnotify_ops;
+
static inline bool fanotify_mask_valid(__u32 mask)
{
if (mask & ~((__u32)FAN_ALL_INCOMING_EVENTS))
diff --git a/fs/notify/fanotify/fanotify_user.c b/fs/notify/fanotify/fanotify_user.c
index ca35276..9ca7413 100644
--- a/fs/notify/fanotify/fanotify_user.c
+++ b/fs/notify/fanotify/fanotify_user.c
@@ -1,13 +1,72 @@
#include <linux/fcntl.h>
#include <linux/fs.h>
+#include <linux/anon_inodes.h>
#include <linux/fsnotify_backend.h>
#include <linux/security.h>
#include <linux/syscalls.h>

#include "fanotify.h"

+static int fanotify_release(struct inode *ignored, struct file *file)
+{
+ struct fsnotify_group *group = file->private_data;
+
+ pr_debug("%s: file=%p group=%p\n", __func__, file, group);
+
+ /* matches the fanotify_init->fsnotify_alloc_group */
+ fsnotify_put_group(group);
+
+ return 0;
+}
+
+static const struct file_operations fanotify_fops = {
+ .poll = NULL,
+ .read = NULL,
+ .fasync = NULL,
+ .release = fanotify_release,
+ .unlocked_ioctl = NULL,
+ .compat_ioctl = NULL,
+};
+
+/* fanotify syscalls */
SYSCALL_DEFINE3(fanotify_init, unsigned int, flags, unsigned int, event_f_flags,
int, priority)
{
- return -ENOSYS;
+ struct fsnotify_group *group;
+ int f_flags, fd;
+
+ pr_debug("%s: flags=%d event_f_flags=%d priority=%d\n",
+ __func__, flags, event_f_flags, priority);
+
+ if (event_f_flags)
+ return -EINVAL;
+ if (priority)
+ return -EINVAL;
+
+ if (!capable(CAP_SYS_ADMIN))
+ return -EACCES;
+
+ if (flags & ~FAN_ALL_INIT_FLAGS)
+ return -EINVAL;
+
+ f_flags = (O_RDONLY | FMODE_NONOTIFY);
+ if (flags & FAN_CLOEXEC)
+ f_flags |= O_CLOEXEC;
+ if (flags & FAN_NONBLOCK)
+ f_flags |= O_NONBLOCK;
+
+ /* fsnotify_alloc_group takes a ref. Dropped in fanotify_release */
+ group = fsnotify_alloc_group(&fanotify_fsnotify_ops);
+ if (IS_ERR(group))
+ return PTR_ERR(group);
+
+ fd = anon_inode_getfd("[fanotify]", &fanotify_fops, group, f_flags);
+ if (fd < 0)
+ goto out_put_group;
+
+ return fd;
+
+out_put_group:
+ fsnotify_put_group(group);
+ return fd;
}
diff --git a/include/linux/fanotify.h b/include/linux/fanotify.h
index b560f86..00bc6d4 100644
--- a/include/linux/fanotify.h
+++ b/include/linux/fanotify.h
@@ -18,6 +18,10 @@
/* helper events */
#define FAN_CLOSE (FAN_CLOSE_WRITE | FAN_CLOSE_NOWRITE) /* close */

+#define FAN_CLOEXEC 0x00000001
+#define FAN_NONBLOCK 0x00000002
+
+#define FAN_ALL_INIT_FLAGS (FAN_CLOEXEC | FAN_NONBLOCK)
/*
* All of the events - we build the list by hand so that we can add flags in
* the future and not break backward compatibility. Apps will get only the

2009-10-31 18:48:20

by Eric Paris

Subject: [PATCH 08/10] fanotify: sys_fanotify_mark declaration

This patch simply declares the new sys_fanotify_mark syscall:

int fanotify_mark(int fanotify_fd, unsigned int flags, int dfd,
const char *pathname, u64 mask, u64 ignored_mask)

Signed-off-by: Eric Paris <[email protected]>
---

arch/x86/ia32/ia32entry.S | 1 +
arch/x86/include/asm/unistd_32.h | 3 ++-
arch/x86/include/asm/unistd_64.h | 2 ++
arch/x86/kernel/syscall_table_32.S | 1 +
fs/notify/fanotify/fanotify_user.c | 6 ++++++
include/linux/syscalls.h | 3 +++
6 files changed, 15 insertions(+), 1 deletions(-)

diff --git a/arch/x86/ia32/ia32entry.S b/arch/x86/ia32/ia32entry.S
index df13544..48960ac 100644
--- a/arch/x86/ia32/ia32entry.S
+++ b/arch/x86/ia32/ia32entry.S
@@ -843,4 +843,5 @@ ia32_sys_call_table:
.quad sys_perf_event_open
.quad compat_sys_recvmmsg
.quad sys_fanotify_init
+ .quad sys_fanotify_mark
ia32_syscall_end:
diff --git a/arch/x86/include/asm/unistd_32.h b/arch/x86/include/asm/unistd_32.h
index 1dc812d..cf70e9e 100644
--- a/arch/x86/include/asm/unistd_32.h
+++ b/arch/x86/include/asm/unistd_32.h
@@ -344,10 +344,11 @@
#define __NR_perf_event_open 336
#define __NR_recvmmsg 337
#define __NR_fanotify_init 338
+#define __NR_fanotify_mark 339

#ifdef __KERNEL__

-#define NR_syscalls 339
+#define NR_syscalls 340

#define __ARCH_WANT_IPC_PARSE_VERSION
#define __ARCH_WANT_OLD_READDIR
diff --git a/arch/x86/include/asm/unistd_64.h b/arch/x86/include/asm/unistd_64.h
index 83f0f69..7fbd20f 100644
--- a/arch/x86/include/asm/unistd_64.h
+++ b/arch/x86/include/asm/unistd_64.h
@@ -665,6 +665,8 @@ __SYSCALL(__NR_perf_event_open, sys_perf_event_open)
__SYSCALL(__NR_recvmmsg, sys_recvmmsg)
#define __NR_fanotify_init 300
__SYSCALL(__NR_fanotify_init, sys_fanotify_init)
+#define __NR_fanotify_mark 301
+__SYSCALL(__NR_fanotify_mark, sys_fanotify_mark)

#ifndef __NO_STUBS
#define __ARCH_WANT_OLD_READDIR
diff --git a/arch/x86/kernel/syscall_table_32.S b/arch/x86/kernel/syscall_table_32.S
index 34bf346..ca486c7 100644
--- a/arch/x86/kernel/syscall_table_32.S
+++ b/arch/x86/kernel/syscall_table_32.S
@@ -338,3 +338,4 @@ ENTRY(sys_call_table)
.long sys_perf_event_open
.long sys_recvmmsg
.long sys_fanotify_init
+ .long sys_fanotify_mark
diff --git a/fs/notify/fanotify/fanotify_user.c b/fs/notify/fanotify/fanotify_user.c
index 9ca7413..c93f56b 100644
--- a/fs/notify/fanotify/fanotify_user.c
+++ b/fs/notify/fanotify/fanotify_user.c
@@ -70,3 +70,9 @@ out_put_group:
fsnotify_put_group(group);
return fd;
}
+
+SYSCALL_DEFINE5(fanotify_mark, int, fanotify_fd, unsigned int, flags,
+ int, dfd, const char __user *, pathname, __u64, mask)
+{
+ return -ENOSYS;
+}
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index 534f9ff..f3492ac 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -877,6 +877,9 @@ asmlinkage long sys_ppoll(struct pollfd __user *, unsigned int,
struct timespec __user *, const sigset_t __user *,
size_t);
asmlinkage long sys_fanotify_init(unsigned int flags, unsigned int event_f_flags, int priority);
+asmlinkage long sys_fanotify_mark(int fanotify_fd, unsigned int flags,
+ int fd, const char __user *pathname,
+ u64 mask);

int kernel_execve(const char *filename, char *const argv[], char *const envp[]);

2009-10-31 18:48:26

by Eric Paris

Subject: [PATCH 09/10] fanotify: fanotify_mark syscall implementation

NAME
fanotify_mark - add, remove, or modify an fanotify mark on a
filesystem object

SYNOPSIS
int fanotify_mark(int fanotify_fd, unsigned int flags,
int dfd, const char *pathname, u64 mask,
u64 ignored_mask)

DESCRIPTION
fanotify_mark() is used to add, remove, or modify a mark on a filesystem
object. Marks are used to indicate that the fanotify group is
interested in events which occur on that object. At this point in
time marks may only be added to files and directories.

fanotify_fd must be a file descriptor returned by fanotify_init()

The flags field must contain exactly one of the following:

FAN_MARK_ADD - OR the bits in mask and ignored_mask into the mark
FAN_MARK_REMOVE - bitwise remove the bits in mask and ignored_mask
from the mark

The following values can be OR'd into the flags field:

FAN_MARK_DONT_FOLLOW - same meaning as O_NOFOLLOW as described in open(2)
FAN_MARK_ONLYDIR - same meaning as O_DIRECTORY as described in open(2)

dfd may be any of the following:
AT_FDCWD: the object will be looked up based on pathname similar
to open(2)

file descriptor of a directory: if pathname is not NULL the
object to modify will be looked up similar to openat(2)

file descriptor of the final object: if pathname is NULL the
object to modify will be the object referenced by dfd

The mask is the bitwise OR of the set of events of interest such as:
FAN_ACCESS - object was accessed (read)
FAN_MODIFY - object was modified (write)
FAN_CLOSE_WRITE - object was writable and was closed
FAN_CLOSE_NOWRITE - object was read only and was closed
FAN_OPEN - object was opened
FAN_EVENT_ON_CHILD - interested in events that happen to
children. Only relevant when the object
is a directory
FAN_Q_OVERFLOW - event queue overflowed (not implemented)

The ignored mask is the opposite of the mask as if applied after the
mask. If FAN_OPEN is specified in both the mask and the ignored_mask
no event will be sent to userspace. This is not presently used but
will be used when more objects may be marked. Assume you marked a
mount point as something of interest. You could then add an
ignored_mask entry on individual inodes to get notification on
everything in the mount point except for a select few inodes.
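
The interaction between mask and ignored_mask described above amounts to
a bitwise AND-NOT. The sketch below is a model of that description only,
since the text says the ignored mask is not yet used: effective_events()
is a hypothetical helper, and the event bit values are assumed from the
fanotify header rather than shown in this section.

```c
#include <stdint.h>

/* Event bits; values assumed from the fanotify header, not shown here. */
#define FAN_ACCESS 0x00000001
#define FAN_MODIFY 0x00000002
#define FAN_OPEN   0x00000020

/*
 * Hypothetical helper: which bits of an incoming event would reach
 * userspace, given a mark's mask and ignored_mask. The ignored_mask is
 * applied after the mask, so it can only take bits away.
 */
uint32_t effective_events(uint32_t event, uint32_t mask,
			  uint32_t ignored_mask)
{
	return event & mask & ~ignored_mask;
}
```

With mask = FAN_OPEN | FAN_MODIFY and ignored_mask = FAN_OPEN, an open
yields no event while a write is still reported.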


RETURN VALUE
On success, this system call returns 0. On error, -1 is
returned, and errno is set to indicate the error.

ERRORS
EINVAL An invalid value was specified in flags.

EINVAL An invalid value was specified in mask.

EINVAL An invalid value was specified in ignored_mask.

EINVAL fanotify_fd is not a file descriptor as returned by
fanotify_init()

EBADF fanotify_fd is not a valid file descriptor

EBADF dfd is not a valid file descriptor and pathname is NULL.

ENOTDIR dfd is not a directory and pathname is not NULL

EACCES you do not have search permission on dfd

EACCES no search permissions on some part of the path

ENOENT file not found

ENOMEM Insufficient kernel memory is available.

CONFORMING TO
These system calls are Linux-specific.

Signed-off-by: Eric Paris <[email protected]>
---

fs/notify/fanotify/fanotify.h | 18 +++
fs/notify/fanotify/fanotify_user.c | 239 ++++++++++++++++++++++++++++++++++++
include/linux/fanotify.h | 13 ++
3 files changed, 269 insertions(+), 1 deletions(-)

diff --git a/fs/notify/fanotify/fanotify.h b/fs/notify/fanotify/fanotify.h
index dd656cf..59c3331 100644
--- a/fs/notify/fanotify/fanotify.h
+++ b/fs/notify/fanotify/fanotify.h
@@ -6,6 +6,24 @@

extern const struct fsnotify_ops fanotify_fsnotify_ops;

+static inline bool fanotify_mark_flags_valid(unsigned int flags)
+{
+ /* must be either an add or a remove */
+ if (!(flags & (FAN_MARK_ADD | FAN_MARK_REMOVE)))
+ return false;
+
+ /* cannot be both add and remove */
+ if ((flags & FAN_MARK_ADD) &&
+ (flags & FAN_MARK_REMOVE))
+ return false;
+
+ /* cannot have more flags than we know about */
+ if (flags & ~FAN_ALL_MARK_FLAGS)
+ return false;
+
+ return true;
+}
+
static inline bool fanotify_mask_valid(__u32 mask)
{
if (mask & ~((__u32)FAN_ALL_INCOMING_EVENTS))
diff --git a/fs/notify/fanotify/fanotify_user.c b/fs/notify/fanotify/fanotify_user.c
index c93f56b..d415174 100644
--- a/fs/notify/fanotify/fanotify_user.c
+++ b/fs/notify/fanotify/fanotify_user.c
@@ -1,12 +1,18 @@
#include <linux/fcntl.h>
+#include <linux/file.h>
#include <linux/fs.h>
#include <linux/anon_inodes.h>
#include <linux/fsnotify_backend.h>
+#include <linux/init.h>
+#include <linux/namei.h>
#include <linux/security.h>
#include <linux/syscalls.h>
+#include <linux/types.h>

#include "fanotify.h"

+static struct kmem_cache *fanotify_mark_cache __read_mostly;
+
static int fanotify_release(struct inode *ignored, struct file *file)
{
struct fsnotify_group *group = file->private_data;
@@ -28,6 +34,185 @@ static const struct file_operations fanotify_fops = {
.compat_ioctl = NULL,
};

+static void fanotify_free_mark(struct fsnotify_mark *fsn_mark)
+{
+ kmem_cache_free(fanotify_mark_cache, fsn_mark);
+}
+
+static int fanotify_find_path(int dfd, const char __user *filename,
+ struct path *path, unsigned int flags)
+{
+ int ret;
+
+ pr_debug("%s: dfd=%d filename=%p flags=%x\n", __func__,
+ dfd, filename, flags);
+
+ if (filename == NULL) {
+ struct file *file;
+ int fput_needed;
+
+ ret = -EBADF;
+ file = fget_light(dfd, &fput_needed);
+ if (!file)
+ goto out;
+
+ ret = -ENOTDIR;
+ if ((flags & FAN_MARK_ONLYDIR) &&
+ !(S_ISDIR(file->f_path.dentry->d_inode->i_mode))) {
+ fput_light(file, fput_needed);
+ goto out;
+ }
+
+ *path = file->f_path;
+ path_get(path);
+ fput_light(file, fput_needed);
+ } else {
+ unsigned int lookup_flags = 0;
+
+ if (!(flags & FAN_MARK_DONT_FOLLOW))
+ lookup_flags |= LOOKUP_FOLLOW;
+ if (flags & FAN_MARK_ONLYDIR)
+ lookup_flags |= LOOKUP_DIRECTORY;
+
+ ret = user_path_at(dfd, filename, lookup_flags, path);
+ if (ret)
+ goto out;
+ }
+
+ /* you can only watch an inode if you have read permissions on it */
+ ret = inode_permission(path->dentry->d_inode, MAY_READ);
+ if (ret)
+ path_put(path);
+out:
+ return ret;
+}
+
+static int fanotify_remove_mark(struct fsnotify_group *group,
+ struct inode *inode,
+ __u32 mask)
+{
+ struct fsnotify_mark *fsn_mark;
+ __u32 new_mask;
+
+ pr_debug("%s: group=%p inode=%p mask=%x\n", __func__,
+ group, inode, mask);
+
+ fsn_mark = fsnotify_find_mark(group, inode);
+ if (!fsn_mark)
+ return -ENOENT;
+
+ spin_lock(&fsn_mark->lock);
+ fsn_mark->mask &= ~mask;
+ new_mask = fsn_mark->mask;
+ spin_unlock(&fsn_mark->lock);
+
+ if (!new_mask)
+ fsnotify_destroy_mark(fsn_mark);
+ else
+ fsnotify_recalc_inode_mask(inode);
+
+ fsnotify_recalc_group_mask(group);
+
+ /* matches the fsnotify_find_mark() */
+ fsnotify_put_mark(fsn_mark);
+
+ return 0;
+}
+
+static int fanotify_add_mark(struct fsnotify_group *group,
+ struct inode *inode,
+ __u32 mask)
+{
+ struct fsnotify_mark *fsn_mark;
+ __u32 old_mask, new_mask;
+ int ret;
+
+ pr_debug("%s: group=%p inode=%p mask=%x\n", __func__,
+ group, inode, mask);
+
+ fsn_mark = fsnotify_find_mark(group, inode);
+ if (!fsn_mark) {
+ struct fsnotify_mark *new_fsn_mark;
+
+ ret = -ENOMEM;
+ new_fsn_mark = kmem_cache_alloc(fanotify_mark_cache, GFP_KERNEL);
+ if (!new_fsn_mark)
+ goto out;
+
+ fsnotify_init_mark(new_fsn_mark, fanotify_free_mark);
+ ret = fsnotify_add_mark(new_fsn_mark, group, inode, 0);
+ if (ret) {
+ fanotify_free_mark(new_fsn_mark);
+ goto out;
+ }
+
+ fsn_mark = new_fsn_mark;
+ }
+
+ ret = 0;
+
+ spin_lock(&fsn_mark->lock);
+ old_mask = fsn_mark->mask;
+ fsn_mark->mask |= mask;
+ new_mask = fsn_mark->mask;
+ spin_unlock(&fsn_mark->lock);
+
+ /* we made changes to a mask, update the group mask and the inode mask
+ * so things happen quickly. */
+ if (old_mask != new_mask) {
+ /* more bits in old than in new? */
+ int dropped = (old_mask & ~new_mask);
+ /* more bits in this mark than the inode's mask? */
+ int do_inode = (new_mask & ~inode->i_fsnotify_mask);
+ /* more bits in this mark than the group? */
+ int do_group = (new_mask & ~group->mask);
+
+ /* update the inode with this new mark */
+ if (dropped || do_inode)
+ fsnotify_recalc_inode_mask(inode);
+
+ /* update the group mask with the new mask */
+ if (dropped || do_group)
+ fsnotify_recalc_group_mask(group);
+ }
+
+ /* match the init or the find.... */
+ fsnotify_put_mark(fsn_mark);
+out:
+ return ret;
+}
+
+static int fanotify_update_mark(struct fsnotify_group *group,
+ struct inode *inode, int flags,
+ __u32 mask)
+{
+ pr_debug("%s: group=%p inode=%p flags=%x mask=%x\n", __func__,
+ group, inode, flags, mask);
+
+ if (flags & FAN_MARK_ADD)
+ fanotify_add_mark(group, inode, mask);
+ else if (flags & FAN_MARK_REMOVE)
+ fanotify_remove_mark(group, inode, mask);
+ else
+ BUG();
+
+ return 0;
+}
+
+static bool fanotify_mark_validate_input(int flags,
+ __u32 mask)
+{
+ pr_debug("%s: flags=%x mask=%x\n", __func__, flags, mask);
+
+ /* are flags valid for this operation? */
+ if (!fanotify_mark_flags_valid(flags))
+ return false;
+ /* is the mask valid? */
+ if (!fanotify_mask_valid(mask))
+ return false;
+ return true;
+}
+
/* fanotify syscalls */
SYSCALL_DEFINE3(fanotify_init, unsigned int, flags, unsigned int, event_f_flags,
int, priority)
@@ -74,5 +259,57 @@ out_put_group:
SYSCALL_DEFINE5(fanotify_mark, int, fanotify_fd, unsigned int, flags,
int, dfd, const char __user *, pathname, __u64, mask)
{
- return -ENOSYS;
+ struct inode *inode;
+ struct fsnotify_group *group;
+ struct file *filp;
+ struct path path;
+ int ret, fput_needed;
+
+ pr_debug("%s: fanotify_fd=%d flags=%x dfd=%d pathname=%p mask=%llx\n",
+ __func__, fanotify_fd, flags, dfd, pathname, mask);
+
+ /* we only use the lower 32 bits as of right now. */
+ if (mask & ((__u64)0xffffffff << 32))
+ return -EINVAL;
+
+ if (!fanotify_mark_validate_input(flags, mask))
+ return -EINVAL;
+
+ filp = fget_light(fanotify_fd, &fput_needed);
+ if (unlikely(!filp))
+ return -EBADF;
+
+ /* verify that this is indeed an fanotify instance */
+ ret = -EINVAL;
+ if (unlikely(filp->f_op != &fanotify_fops))
+ goto fput_and_out;
+
+ ret = fanotify_find_path(dfd, pathname, &path, flags);
+ if (ret)
+ goto fput_and_out;
+
+ /* inode held in place by reference to path; group by fget on fd */
+ inode = path.dentry->d_inode;
+ group = filp->private_data;
+
+ /* create/update an inode mark */
+ ret = fanotify_update_mark(group, inode, flags, mask);
+
+ path_put(&path);
+fput_and_out:
+ fput_light(filp, fput_needed);
+ return ret;
+}
+
+/*
+ * fanotify_user_setup - Our initialization function. Note that we cannot return
+ * error because we have compiled-in VFS hooks. So an (unlikely) failure here
+ * must result in panic().
+ */
+static int __init fanotify_user_setup(void)
+{
+ fanotify_mark_cache = KMEM_CACHE(fsnotify_mark, SLAB_PANIC);
+
+ return 0;
}
+device_initcall(fanotify_user_setup);
diff --git a/include/linux/fanotify.h b/include/linux/fanotify.h
index 00bc6d4..95aeea2 100644
--- a/include/linux/fanotify.h
+++ b/include/linux/fanotify.h
@@ -18,10 +18,23 @@
/* helper events */
#define FAN_CLOSE (FAN_CLOSE_WRITE | FAN_CLOSE_NOWRITE) /* close */

+/* flags used for fanotify_init() */
#define FAN_CLOEXEC 0x00000001
#define FAN_NONBLOCK 0x00000002

#define FAN_ALL_INIT_FLAGS (FAN_CLOEXEC | FAN_NONBLOCK)
+
+/* flags used for fanotify_mark() */
+#define FAN_MARK_ADD 0x00000001
+#define FAN_MARK_REMOVE 0x00000002
+#define FAN_MARK_DONT_FOLLOW 0x00000004
+#define FAN_MARK_ONLYDIR 0x00000008
+
+#define FAN_ALL_MARK_FLAGS (FAN_MARK_ADD |\
+ FAN_MARK_REMOVE |\
+ FAN_MARK_DONT_FOLLOW |\
+ FAN_MARK_ONLYDIR)
+
/*
* All of the events - we build the list by hand so that we can add flags in
* the future and not break backward compatibility. Apps will get only the

2009-10-31 18:48:39

by Eric Paris

Subject: [PATCH 10/10] send events using read


---

fs/notify/fanotify/fanotify.h | 5 +
fs/notify/fanotify/fanotify_user.c | 225 +++++++++++++++++++++++++++++++++++-
include/linux/fanotify.h | 24 ++++
3 files changed, 250 insertions(+), 4 deletions(-)

diff --git a/fs/notify/fanotify/fanotify.h b/fs/notify/fanotify/fanotify.h
index 59c3331..5608783 100644
--- a/fs/notify/fanotify/fanotify.h
+++ b/fs/notify/fanotify/fanotify.h
@@ -30,3 +30,8 @@ static inline bool fanotify_mask_valid(__u32 mask)
return false;
return true;
}
+
+static inline __u32 fanotify_outgoing_mask(__u32 mask)
+{
+ return mask & FAN_ALL_OUTGOING_EVENTS;
+}
diff --git a/fs/notify/fanotify/fanotify_user.c b/fs/notify/fanotify/fanotify_user.c
index d415174..1fdd720 100644
--- a/fs/notify/fanotify/fanotify_user.c
+++ b/fs/notify/fanotify/fanotify_user.c
@@ -3,16 +3,208 @@
#include <linux/fs.h>
#include <linux/anon_inodes.h>
#include <linux/fsnotify_backend.h>
+#include <linux/ima.h>
#include <linux/init.h>
+#include <linux/mount.h>
#include <linux/namei.h>
+#include <linux/poll.h>
#include <linux/security.h>
#include <linux/syscalls.h>
#include <linux/types.h>
+#include <linux/uaccess.h>
+
+#include <asm/ioctls.h>

#include "fanotify.h"

static struct kmem_cache *fanotify_mark_cache __read_mostly;

+/*
+ * Get an fsnotify notification event if one exists and is small
+ * enough to fit in "count". Return an error pointer if the count
+ * is not large enough.
+ *
+ * Called with the group->notification_mutex held.
+ */
+static struct fsnotify_event *get_one_event(struct fsnotify_group *group,
+ size_t count)
+{
+ BUG_ON(!mutex_is_locked(&group->notification_mutex));
+
+ pr_debug("%s: group=%p count=%zd\n", __func__, group, count);
+
+ if (fsnotify_notify_queue_is_empty(group))
+ return NULL;
+
+ if (FAN_EVENT_METADATA_LEN > count)
+ return ERR_PTR(-EINVAL);
+
+ /* held the notification_mutex the whole time, so this is the
+ * same event we peeked above */
+ return fsnotify_remove_notify_event(group);
+}
+
+static int create_and_fill_fd(struct fsnotify_group *group,
+ struct fanotify_event_metadata *metadata,
+ struct fsnotify_event *event)
+{
+ int client_fd, err;
+ struct dentry *dentry;
+ struct vfsmount *mnt;
+ struct file *new_file;
+
+ pr_debug("%s: group=%p metadata=%p event=%p\n", __func__, group,
+ metadata, event);
+
+ client_fd = get_unused_fd();
+ if (client_fd < 0)
+ return client_fd;
+
+ if (event->data_type != FSNOTIFY_EVENT_PATH) {
+ WARN_ON(1);
+ put_unused_fd(client_fd);
+ return -EINVAL;
+ }
+
+ /*
+ * we need a new file handle for the userspace program so it can read even if it was
+ * originally opened O_WRONLY.
+ */
+ dentry = dget(event->path.dentry);
+ mnt = mntget(event->path.mnt);
+ /* it's possible this event was an overflow event. in that case dentry and mnt
+ * are NULL; That's fine, just don't call dentry open */
+ if (dentry && mnt) {
+ err = ima_path_check(&event->path, MAY_READ, IMA_COUNT_UPDATE);
+ if (err)
+ new_file = ERR_PTR(err);
+ else
+ new_file = dentry_open(dentry, mnt,
+ O_RDONLY | O_LARGEFILE | FMODE_NONOTIFY,
+ current_cred());
+ } else
+ new_file = ERR_PTR(-EOVERFLOW);
+ if (IS_ERR(new_file)) {
+ /*
+ * we still send an event even if we can't open the file. this
+ * can happen when say tasks are gone and we try to open their
+ * /proc files or we try to open a WRONLY file like in sysfs
+ * we just send the errno to userspace since there isn't much
+ * else we can do.
+ */
+ put_unused_fd(client_fd);
+ client_fd = PTR_ERR(new_file);
+ } else {
+ fd_install(client_fd, new_file);
+ }
+
+ metadata->fd = client_fd;
+
+ return 0;
+}
+
+static ssize_t fill_event_metadata(struct fsnotify_group *group,
+ struct fanotify_event_metadata *metadata,
+ struct fsnotify_event *event)
+{
+ pr_debug("%s: group=%p metadata=%p event=%p\n", __func__,
+ group, metadata, event);
+
+ metadata->event_len = FAN_EVENT_METADATA_LEN;
+ metadata->vers = FANOTIFY_METADATA_VERSION;
+ metadata->mask = fanotify_outgoing_mask(event->mask);
+
+ return create_and_fill_fd(group, metadata, event);
+
+}
+
+static ssize_t copy_event_to_user(struct fsnotify_group *group,
+ struct fsnotify_event *event,
+ char __user *buf)
+{
+ struct fanotify_event_metadata fanotify_event_metadata;
+ int ret;
+
+ pr_debug("%s: group=%p event=%p\n", __func__, group, event);
+
+ ret = fill_event_metadata(group, &fanotify_event_metadata, event);
+ if (ret)
+ return ret;
+
+ if (copy_to_user(buf, &fanotify_event_metadata, FAN_EVENT_METADATA_LEN))
+ return -EFAULT;
+
+ return FAN_EVENT_METADATA_LEN;
+}
+
+/* fanotify userspace file descriptor functions */
+static unsigned int fanotify_poll(struct file *file, poll_table *wait)
+{
+ struct fsnotify_group *group = file->private_data;
+ int ret = 0;
+
+ poll_wait(file, &group->notification_waitq, wait);
+ mutex_lock(&group->notification_mutex);
+ if (!fsnotify_notify_queue_is_empty(group))
+ ret = POLLIN | POLLRDNORM;
+ mutex_unlock(&group->notification_mutex);
+
+ return ret;
+}
+
+static ssize_t fanotify_read(struct file *file, char __user *buf,
+ size_t count, loff_t *pos)
+{
+ struct fsnotify_group *group;
+ struct fsnotify_event *kevent;
+ char __user *start;
+ int ret;
+ DEFINE_WAIT(wait);
+
+ start = buf;
+ group = file->private_data;
+
+ pr_debug("%s: group=%p\n", __func__, group);
+
+ while (1) {
+ prepare_to_wait(&group->notification_waitq, &wait, TASK_INTERRUPTIBLE);
+
+ mutex_lock(&group->notification_mutex);
+ kevent = get_one_event(group, count);
+ mutex_unlock(&group->notification_mutex);
+
+ if (kevent) {
+ ret = PTR_ERR(kevent);
+ if (IS_ERR(kevent))
+ break;
+ ret = copy_event_to_user(group, kevent, buf);
+ fsnotify_put_event(kevent);
+ if (ret < 0)
+ break;
+ buf += ret;
+ count -= ret;
+ continue;
+ }
+
+ ret = -EAGAIN;
+ if (file->f_flags & O_NONBLOCK)
+ break;
+ ret = -EINTR;
+ if (signal_pending(current))
+ break;
+
+ if (start != buf)
+ break;
+
+ schedule();
+ }
+
+ finish_wait(&group->notification_waitq, &wait);
+ if (start != buf && ret != -EFAULT)
+ ret = buf - start;
+ return ret;
+}
+
static int fanotify_release(struct inode *ignored, struct file *file)
{
struct fsnotify_group *group = file->private_data;
@@ -25,13 +217,38 @@ static int fanotify_release(struct inode *ignored, struct file *file)
return 0;
}

+static long fanotify_ioctl(struct file *file, unsigned int cmd, unsigned long arg)
+{
+ struct fsnotify_group *group;
+ struct fsnotify_event_holder *holder;
+ void __user *p;
+ int ret = -ENOTTY;
+ size_t send_len = 0;
+
+ group = file->private_data;
+
+ p = (void __user *) arg;
+
+ switch (cmd) {
+ case FIONREAD:
+ mutex_lock(&group->notification_mutex);
+ list_for_each_entry(holder, &group->notification_list, event_list)
+ send_len += FAN_EVENT_METADATA_LEN;
+ mutex_unlock(&group->notification_mutex);
+ ret = put_user(send_len, (int __user *) p);
+ break;
+ }
+
+ return ret;
+}
+
static const struct file_operations fanotify_fops = {
- .poll = NULL,
- .read = NULL,
+ .poll = fanotify_poll,
+ .read = fanotify_read,
.fasync = NULL,
.release = fanotify_release,
- .unlocked_ioctl = NULL,
- .compat_ioctl = NULL,
+ .unlocked_ioctl = fanotify_ioctl,
+ .compat_ioctl = fanotify_ioctl,
};

static void fanotify_free_mark(struct fsnotify_mark *fsn_mark)
diff --git a/include/linux/fanotify.h b/include/linux/fanotify.h
index 95aeea2..c1c6616 100644
--- a/include/linux/fanotify.h
+++ b/include/linux/fanotify.h
@@ -51,6 +51,30 @@
*/
#define FAN_ALL_INCOMING_EVENTS (FAN_ALL_EVENTS |\
FAN_EVENT_ON_CHILD)
+
+#define FAN_ALL_OUTGOING_EVENTS (FAN_ALL_EVENTS |\
+ FAN_Q_OVERFLOW)
+
+#define FANOTIFY_METADATA_VERSION 1
+
+struct fanotify_event_metadata {
+ __u32 event_len;
+ __u32 vers;
+ __s32 fd;
+ __u64 mask;
+} __attribute__ ((packed));
+
+/* Helper functions to deal with fanotify_event_metadata buffers */
+#define FAN_EVENT_METADATA_LEN (sizeof(struct fanotify_event_metadata))
+
+#define FAN_EVENT_NEXT(meta, len) ((len) -= (meta)->event_len, \
+ (struct fanotify_event_metadata*)(((char *)(meta)) + \
+ (meta)->event_len))
+
+#define FAN_EVENT_OK(meta, len) ((long)(len) >= (long)FAN_EVENT_METADATA_LEN && \
+ (long)(meta)->event_len >= (long)FAN_EVENT_METADATA_LEN && \
+ (long)(meta)->event_len <= (long)(len))
+
#ifdef __KERNEL__

#endif /* __KERNEL__ */
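
A listener that read()s from the fanotify fd gets back a run of
fixed-size fanotify_event_metadata records, and the FAN_EVENT_OK and
FAN_EVENT_NEXT helpers above are meant for walking them. The following
is a minimal userspace sketch under some assumptions: the struct and
macros are copied from this patch with the __u32/__s32/__u64 types
spelled as <stdint.h> types, and count_events() is a hypothetical helper
written for illustration. It also counts records whose fd is negative,
since create_and_fill_fd() stores the errno there when the file could
not be opened.

```c
#include <stdint.h>

/* Layout copied from this patch; kernel types replaced by stdint ones. */
struct fanotify_event_metadata {
	uint32_t event_len;
	uint32_t vers;
	int32_t fd;
	uint64_t mask;
} __attribute__ ((packed));

#define FAN_EVENT_METADATA_LEN (sizeof(struct fanotify_event_metadata))

#define FAN_EVENT_NEXT(meta, len) ((len) -= (meta)->event_len, \
	(struct fanotify_event_metadata *)(((char *)(meta)) + \
	(meta)->event_len))

#define FAN_EVENT_OK(meta, len) \
	((long)(len) >= (long)FAN_EVENT_METADATA_LEN && \
	 (long)(meta)->event_len >= (long)FAN_EVENT_METADATA_LEN && \
	 (long)(meta)->event_len <= (long)(len))

/*
 * Hypothetical helper: walk a buffer of len bytes as returned by read()
 * and return the number of complete events in it. *failed_opens is set
 * to the number of events whose fd field holds a negative errno.
 */
int count_events(void *buf, long len, int *failed_opens)
{
	struct fanotify_event_metadata *meta = buf;
	int n = 0;

	*failed_opens = 0;
	while (FAN_EVENT_OK(meta, len)) {
		if (meta->fd < 0)
			(*failed_opens)++;
		n++;
		/* advances meta and subtracts event_len from len */
		meta = FAN_EVENT_NEXT(meta, len);
	}
	return n;
}
```

A real listener would also close() each non-negative fd once it has
dealt with the event; that step is left out here to keep the sketch
independent of a running fanotify instance.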

2009-11-03 23:59:45

by Jonathan Corbet

Subject: Re: [PATCH 10/10] send events using read

Hi, Eric,

This is not a full review, but I did notice a problem as I was trying to
figure out the new API...

> +static ssize_t fanotify_read(struct file *file, char __user *buf,
> + size_t count, loff_t *pos)
> +{
> + struct fsnotify_group *group;
> + struct fsnotify_event *kevent;
> + char __user *start;
> + int ret;
> + DEFINE_WAIT(wait);
> +
> + start = buf;
> + group = file->private_data;
> +
> + pr_debug("%s: group=%p\n", __func__, group);
> +
> + while (1) {
> + prepare_to_wait(&group->notification_waitq, &wait, TASK_INTERRUPTIBLE);
> +
> + mutex_lock(&group->notification_mutex);
> + kevent = get_one_event(group, count);
> + mutex_unlock(&group->notification_mutex);

prepare_to_wait(), among other things, sets the task state. But then you
go into various sleeping calls (mutex_lock(), for starters); that will undo
what prepare_to_wait has done. You'll be back in TASK_RUNNING at this
point, so you'll never sleep.

I've not looked at the code well enough to know how to fix it. My guess is
that the sleeping on event availability should be done in get_one_event(),
or in a small wrapper which goes immediately around it.

> +
> + if (kevent) {
> + ret = PTR_ERR(kevent);
> + if (IS_ERR(kevent))
> + break;
> + ret = copy_event_to_user(group, kevent, buf);
> + fsnotify_put_event(kevent);
> + if (ret < 0)
> + break;
> + buf += ret;
> + count -= ret;
> + continue;
> + }
> +
> + ret = -EAGAIN;
> + if (file->f_flags & O_NONBLOCK)
> + break;
> + ret = -EINTR;
> + if (signal_pending(current))
> + break;
> +
> + if (start != buf)
> + break;

Alternatively, maybe you could do this here?

prepare_to_wait(...);
if (fsnotify_notify_queue_is_empty(group))
schedule();
> +
> + schedule();

You also need finish_wait() here, not outside the loop.

> + }
> +
> + finish_wait(&group->notification_waitq, &wait);
> + if (start != buf && ret != -EFAULT)
> + ret = buf - start;
> + return ret;
> +}

jon
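
The ordering problem raised in this message can be illustrated with a toy
userspace model. Nothing below is kernel code: the toy_* functions and
the two *_order() helpers are hypothetical stand-ins that only track the
task state, assuming (as the message describes) that a mutex_lock() which
has to sleep leaves the task in TASK_RUNNING, and that schedule() returns
at once for a task in that state. The model does not cover lost wakeups,
so the queue still has to be rechecked after prepare_to_wait() in real
code.

```c
#include <stdbool.h>

enum task_state { TASK_RUNNING, TASK_INTERRUPTIBLE };

struct toy_task {
	enum task_state state;
	int times_slept;
};

/* prepare_to_wait() marks the task as about to sleep. */
void toy_prepare_to_wait(struct toy_task *t)
{
	t->state = TASK_INTERRUPTIBLE;
}

/* A contended lock sleeps and, once woken, the task is running again. */
void toy_mutex_lock(struct toy_task *t, bool contended)
{
	if (contended)
		t->state = TASK_RUNNING;
}

/* schedule() only puts the task to sleep if it is not TASK_RUNNING. */
void toy_schedule(struct toy_task *t)
{
	if (t->state != TASK_RUNNING) {
		t->times_slept++;
		t->state = TASK_RUNNING; /* woken later by an event */
	}
}

/* Order used in the patch: prepare, lock and check, then schedule. */
int patch_order(bool contended)
{
	struct toy_task t = { TASK_RUNNING, 0 };

	toy_prepare_to_wait(&t);
	toy_mutex_lock(&t, contended);
	toy_schedule(&t);
	return t.times_slept;
}

/* Suggested order: lock and check first, then prepare and schedule. */
int suggested_order(bool contended)
{
	struct toy_task t = { TASK_RUNNING, 0 };

	toy_mutex_lock(&t, contended);
	toy_prepare_to_wait(&t);
	toy_schedule(&t);
	return t.times_slept;
}
```

In this model patch_order() sleeps only when the mutex is uncontended;
with contention it falls straight through schedule() and goes around the
loop again, while suggested_order() sleeps in both cases.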

2009-11-04 00:55:56

by Eric Paris

Subject: Re: [PATCH 10/10] send events using read

On Tue, 2009-11-03 at 16:59 -0700, Jonathan Corbet wrote:
> Hi, Eric,
>
> This is not a full review, but I did notice a problem as I was trying to
> figure out the new API...

This is a rip off of inotify which was introduced in commit 3632dee2f8b
from Vegard Nossum back in January. I can't seem to find any discussion
of this before it went into Linus' tree, so if someone knows how this
patch got to Linus and what was said about it, I'd like to know. Thanks
for unwittingly finding an inotify bug!

I looked back over it and it looks to me like it will work, although
there may be a race-like situation if there are multiple things trying
to read events. I can see how, with perfect timing and precision, Task A
might try to get the mutex while it is being held by Task B. Task B drops
the mutex and Task A gets it; this causes Task A to be TASK_RUNNING.
Let's assume Task B runs back around the loop and tries to get it while
Task A still holds it. Task A will drop the mutex and Task B gets it;
now B is runnable. Repeat until infinity with perfect timing! It's not
really a live-lock: if either of them ever gets the mutex uncontested or
if an event ever arrives they are going to sleep and/or break the cycle.

Certainly I'll take a look at it.

> > +static ssize_t fanotify_read(struct file *file, char __user *buf,
> > + size_t count, loff_t *pos)
> > +{
> > + struct fsnotify_group *group;
> > + struct fsnotify_event *kevent;
> > + char __user *start;
> > + int ret;
> > + DEFINE_WAIT(wait);
> > +
> > + start = buf;
> > + group = file->private_data;
> > +
> > + pr_debug("%s: group=%p\n", __func__, group);
> > +
> > + while (1) {
> > + prepare_to_wait(&group->notification_waitq, &wait, TASK_INTERRUPTIBLE);
> > +
> > + mutex_lock(&group->notification_mutex);
> > + kevent = get_one_event(group, count);
> > + mutex_unlock(&group->notification_mutex);
>
> prepare_to_wait(), among other things, sets the task state. But then you
> go into various sleeping calls (mutex_lock(), for starters); that will undo
> what prepare_to_wait has done. You'll be back in TASK_RUNNING at this
> point, so you'll never sleep.

2009-11-04 08:07:40

by Vegard Nossum

Subject: Re: [PATCH 10/10] send events using read

2009/11/4 Eric Paris <[email protected]>:
> On Tue, 2009-11-03 at 16:59 -0700, Jonathan Corbet wrote:
>> Hi, Eric,
>>
>> This is not a full review, but I did notice a problem as I was trying to
>> figure out the new API...
>
> This is a rip off of inotify which was introduced in commit 3632dee2f8b
> from Vegard Nossum back in January. I can't seem to find any discussion
> of this before it went into Linus' tree, so if someone knows how this
> patch got to Linus and what was said about it, I'd like to know. Thanks
> for unwittingly finding an inotify bug!

I don't have a git tree handy, but I think this is the locking
imbalance fix. It went to the security list, so that's probably why
you didn't find it. As far as I can tell, the code in question was
wrong even before that patch, though.


Vegard

2009-11-06 15:31:53

by Eric Paris

Subject: Re: [PATCH 02/10] fanotify: fscking all notification system

On Sat, 2009-10-31 at 14:47 -0400, Eric Paris wrote:
> fanotify is a novel file notification system which bases notification on
> giving userspace both an event type (open, close, read, write) and an open
> file descriptor to the object in question. This should address a number of
> races and problems with other notification systems like inotify and dnotify
> and should allow the future implementation of blocking or access controlled
> notification. These are useful for on-access scanners or hierarchical storage
> management schemes.

I'm not hearing the negative feedback of the past and most everyone's
questions and concerns have been addressed. Everyone ok if I drop these
into -next?

-Eric

2009-11-06 15:56:20

by Andreas Gruenbacher

Subject: Re: [PATCH 02/10] fanotify: fscking all notification system

On Friday 06 November 2009 04:31:45 pm Eric Paris wrote:
> I'm not hearing the negative feedback of the past and most everyone's
> questions and concerns have been addressed. Everyone ok if I drop these
> into -next?

Some of the comments are outdated and don't match the code anymore.

Could the man pages please be removed from the comments, updated, and
submitted separately (i.e., not as part of the kernel patches)?

Thanks,
Andreas