2008-12-12 21:51:45

by Eric Paris

Subject: [RFC PATCH -v4 00/14] fsnotify, dnotify, and inotify

The following series implements a new generic in kernel filesystem
notification system, fsnotify. On top of fsnotify I reimplement dnotify and
inotify. I have not finished the conversion away from the old inotify backend,
although I think inotify_user should be complete. In-kernel inotify users (aka
audit) still rely on the old inotify backend (until I get positive feedback).
This can be 'easily' fixed.

All of this is in preparation for fanotify and for using fanotify as an
on-access file scanner. So you better know it's coming.

Why is this useful? Because I actually shrank the struct inode. That's
right, my code is smaller and faster. Eat that.

struct inode went from:
unsigned long i_dnotify_mask; /* Directory notify events */
struct dnotify_struct *i_dnotify; /* for directory notifications */
struct list_head inotify_watches; /* watches on this inode */
struct mutex inotify_mutex; /* protects the watches list */

to:
__u64 i_fsnotify_mask; /* all events this inode cares about */
struct list_head i_fsnotify_mark_entries; /* fsnotify mark entries */
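
As a rough illustration of the saving, the two sets of fields can be compared
in userspace. This is only a sketch: list_head_stub and mutex_stub are
hypothetical stand-ins, and the real size of struct mutex depends on the
architecture and kernel config, so the exact byte count here is not the
kernel's.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical userspace stand-ins for kernel types; sizes are illustrative. */
struct list_head_stub {
	struct list_head_stub *next, *prev;
};

/* Assumption: a mutex holds at least a counter, a lock word and a wait list. */
struct mutex_stub {
	int count;
	int wait_lock;
	struct list_head_stub wait_list;
};

/* Fields the series removes from struct inode. */
struct old_notify_fields {
	unsigned long i_dnotify_mask;
	void *i_dnotify;			/* struct dnotify_struct * in the kernel */
	struct list_head_stub inotify_watches;
	struct mutex_stub inotify_mutex;
};

/* Fields the series adds in their place. */
struct new_notify_fields {
	uint64_t i_fsnotify_mask;
	struct list_head_stub i_fsnotify_mark_entries;
};

/* Compile-time check that the stand-in layout really got smaller. */
static_assert(sizeof(struct new_notify_fields) < sizeof(struct old_notify_fields),
	      "new notification fields should be smaller than the old ones");

/* Bytes saved per inode under the stand-in layout above. */
static size_t notify_bytes_saved(void)
{
	return sizeof(struct old_notify_fields) - sizeof(struct new_notify_fields);
}
```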

This is just an RFC: I haven't tested it heavily, and I might still be able to
split the patches further. I think there are some problems at the moment when
trying to build with fsnotify configured out. I need to send patches 3 and 4
separately. Actually, I think it's perfectly reasonable to take patches 1-4 as
is right now.

If you ever want to see more up-to-date patches, check out:
http://people.redhat.com/~eparis/fsnotify

---

Eric Paris (14):
shit on top for debugging
inotify: reimplement inotify using fsnotify
fsnotify: add correlations between events
fsnotify: include pathnames with entries when possible
fsnotify: generic notification queue and waitq
dnotify: reimplement dnotify using fsnotify
fsnotify: parent event notification
fsnotify: add in inode fsnotify markings
fsnotify: add group priorities
fsnotify: unified filesystem notification backend
fsnotify: use the new open-exec hook for inotify and dnotify
fsnotify: sys_execve and sys_uselib do not call into fsnotify
fsnotify: pass a file instead of an inode to open, read, and write
filesystem notification: create fs/notify to contain all fs notification


fs/Kconfig | 39 -
fs/Makefile | 5
fs/compat.c | 5
fs/dnotify.c | 194 -------
fs/exec.c | 5
fs/fcntl.c | 4
fs/inode.c | 7
fs/inotify.c | 913 ----------------------------------
fs/inotify_user.c | 778 -----------------------------
fs/nfsd/vfs.c | 4
fs/notify/Kconfig | 14 +
fs/notify/Makefile | 4
fs/notify/dnotify/Kconfig | 11
fs/notify/dnotify/Makefile | 1
fs/notify/dnotify/dnotify.c | 350 +++++++++++++
fs/notify/fsnotify.c | 92 +++
fs/notify/fsnotify.h | 115 ++++
fs/notify/group.c | 222 ++++++++
fs/notify/inode_mark.c | 279 ++++++++++
fs/notify/inotify/Kconfig | 31 +
fs/notify/inotify/Makefile | 2
fs/notify/inotify/inotify.c | 913 ++++++++++++++++++++++++++++++++++
fs/notify/inotify/inotify.h | 117 ++++
fs/notify/inotify/inotify_fsnotify.c | 190 +++++++
fs/notify/inotify/inotify_kernel.c | 299 +++++++++++
fs/notify/inotify/inotify_user.c | 502 +++++++++++++++++++
fs/notify/notification.c | 279 ++++++++++
fs/open.c | 2
fs/read_write.c | 8
include/linux/dcache.h | 3
include/linux/dnotify.h | 29 -
include/linux/fs.h | 6
include/linux/fsnotify.h | 271 ++++++++--
include/linux/fsnotify_backend.h | 152 ++++++
include/linux/inotify.h | 1
35 files changed, 3818 insertions(+), 2029 deletions(-)
delete mode 100644 fs/dnotify.c
delete mode 100644 fs/inotify.c
delete mode 100644 fs/inotify_user.c
create mode 100644 fs/notify/Kconfig
create mode 100644 fs/notify/Makefile
create mode 100644 fs/notify/dnotify/Kconfig
create mode 100644 fs/notify/dnotify/Makefile
create mode 100644 fs/notify/dnotify/dnotify.c
create mode 100644 fs/notify/fsnotify.c
create mode 100644 fs/notify/fsnotify.h
create mode 100644 fs/notify/group.c
create mode 100644 fs/notify/inode_mark.c
create mode 100644 fs/notify/inotify/Kconfig
create mode 100644 fs/notify/inotify/Makefile
create mode 100644 fs/notify/inotify/inotify.c
create mode 100644 fs/notify/inotify/inotify.h
create mode 100644 fs/notify/inotify/inotify_fsnotify.c
create mode 100644 fs/notify/inotify/inotify_kernel.c
create mode 100644 fs/notify/inotify/inotify_user.c
create mode 100644 fs/notify/notification.c
create mode 100644 include/linux/fsnotify_backend.h


2008-12-12 21:52:04

by Eric Paris

Subject: [RFC PATCH -v4 02/14] fsnotify: pass a file instead of an inode to open, read, and write

fanotify, the upcoming notification system, actually needs an f_path so it can
do opens in the context of listeners, and it needs a file so it can get f_flags
from the original process. Close was already passing a file.
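
The difference between the two hook signatures can be sketched in userspace as
follows. All of the *_stub types and both hook_* functions are hypothetical,
heavily reduced stand-ins rather than the kernel definitions. The point is only
that a hook handed a file can still reach the dentry and inode, and can
additionally see f_flags and the mount.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, heavily reduced stand-ins for the kernel structures. */
struct inode_stub  { unsigned int i_mode; };
struct dentry_stub { struct inode_stub *d_inode; };
struct path_stub   { void *mnt; struct dentry_stub *dentry; };
struct file_stub   { struct path_stub f_path; unsigned int f_flags; };

/* Old style: the hook only ever sees the dentry. */
static struct inode_stub *hook_takes_dentry(struct dentry_stub *dentry)
{
	return dentry->d_inode;
}

/*
 * New style: the caller passes the file and the hook derives the dentry
 * itself, so the open flags of the original process are available too.
 */
static struct inode_stub *hook_takes_file(struct file_stub *file,
					  unsigned int *flags_out)
{
	struct dentry_stub *dentry = file->f_path.dentry;

	assert(dentry != NULL);
	*flags_out = file->f_flags;
	return dentry->d_inode;
}
```

Both hooks resolve to the same inode; only the second one can report the flags.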

Signed-off-by: Eric Paris <[email protected]>
---

fs/compat.c | 5 ++---
fs/nfsd/vfs.c | 4 ++--
fs/open.c | 2 +-
fs/read_write.c | 8 ++++----
include/linux/fsnotify.h | 9 ++++++---
5 files changed, 15 insertions(+), 13 deletions(-)

diff --git a/fs/compat.c b/fs/compat.c
index e5f49f5..4a7788f 100644
--- a/fs/compat.c
+++ b/fs/compat.c
@@ -1158,11 +1158,10 @@ out:
if (iov != iovstack)
kfree(iov);
if ((ret + (type == READ)) > 0) {
- struct dentry *dentry = file->f_path.dentry;
if (type == READ)
- fsnotify_access(dentry);
+ fsnotify_access(file);
else
- fsnotify_modify(dentry);
+ fsnotify_modify(file);
}
return ret;
}
diff --git a/fs/nfsd/vfs.c b/fs/nfsd/vfs.c
index 4433c8f..f4bc1e6 100644
--- a/fs/nfsd/vfs.c
+++ b/fs/nfsd/vfs.c
@@ -940,7 +940,7 @@ nfsd_vfs_read(struct svc_rqst *rqstp, struct svc_fh *fhp, struct file *file,
nfsdstats.io_read += host_err;
*count = host_err;
err = 0;
- fsnotify_access(file->f_path.dentry);
+ fsnotify_access(file);
} else
err = nfserrno(host_err);
out:
@@ -1007,7 +1007,7 @@ nfsd_vfs_write(struct svc_rqst *rqstp, struct svc_fh *fhp, struct file *file,
set_fs(oldfs);
if (host_err >= 0) {
nfsdstats.io_write += cnt;
- fsnotify_modify(file->f_path.dentry);
+ fsnotify_modify(file);
}

/* clear setuid/setgid flag after write */
diff --git a/fs/open.c b/fs/open.c
index 83cdb9d..9d69dd9 100644
--- a/fs/open.c
+++ b/fs/open.c
@@ -1020,7 +1020,7 @@ long do_sys_open(int dfd, const char __user *filename, int flags, int mode)
put_unused_fd(fd);
fd = PTR_ERR(f);
} else {
- fsnotify_open(f->f_path.dentry);
+ fsnotify_open(f);
fd_install(fd, f);
}
}
diff --git a/fs/read_write.c b/fs/read_write.c
index 969a6d9..7eb2949 100644
--- a/fs/read_write.c
+++ b/fs/read_write.c
@@ -280,7 +280,7 @@ ssize_t vfs_read(struct file *file, char __user *buf, size_t count, loff_t *pos)
else
ret = do_sync_read(file, buf, count, pos);
if (ret > 0) {
- fsnotify_access(file->f_path.dentry);
+ fsnotify_access(file);
add_rchar(current, ret);
}
inc_syscr(current);
@@ -335,7 +335,7 @@ ssize_t vfs_write(struct file *file, const char __user *buf, size_t count, loff_
else
ret = do_sync_write(file, buf, count, pos);
if (ret > 0) {
- fsnotify_modify(file->f_path.dentry);
+ fsnotify_modify(file);
add_wchar(current, ret);
}
inc_syscw(current);
@@ -626,9 +626,9 @@ out:
kfree(iov);
if ((ret + (type == READ)) > 0) {
if (type == READ)
- fsnotify_access(file->f_path.dentry);
+ fsnotify_access(file);
else
- fsnotify_modify(file->f_path.dentry);
+ fsnotify_modify(file);
}
return ret;
}
diff --git a/include/linux/fsnotify.h b/include/linux/fsnotify.h
index 00fbd5b..dec1afb 100644
--- a/include/linux/fsnotify.h
+++ b/include/linux/fsnotify.h
@@ -136,8 +136,9 @@ static inline void fsnotify_mkdir(struct inode *inode, struct dentry *dentry)
/*
* fsnotify_access - file was read
*/
-static inline void fsnotify_access(struct dentry *dentry)
+static inline void fsnotify_access(struct file *file)
{
+ struct dentry *dentry = file->f_path.dentry;
struct inode *inode = dentry->d_inode;
u32 mask = IN_ACCESS;

@@ -152,8 +153,9 @@ static inline void fsnotify_access(struct dentry *dentry)
/*
* fsnotify_modify - file was modified
*/
-static inline void fsnotify_modify(struct dentry *dentry)
+static inline void fsnotify_modify(struct file *file)
{
+ struct dentry *dentry = file->f_path.dentry;
struct inode *inode = dentry->d_inode;
u32 mask = IN_MODIFY;

@@ -168,8 +170,9 @@ static inline void fsnotify_modify(struct dentry *dentry)
/*
* fsnotify_open - file was opened
*/
-static inline void fsnotify_open(struct dentry *dentry)
+static inline void fsnotify_open(struct file *file)
{
+ struct dentry *dentry = file->f_path.dentry;
struct inode *inode = dentry->d_inode;
u32 mask = IN_OPEN;

2008-12-12 21:52:27

by Eric Paris

Subject: [RFC PATCH -v4 01/14] filesystem notification: create fs/notify to contain all fs notification

Since we are adding yet another filesystem notification system, it seemed
like a good idea to clean up fs/ by creating fs/notify and putting
everything there.

Signed-off-by: Eric Paris <[email protected]>
---

fs/Kconfig | 39 --
fs/Makefile | 5
fs/dnotify.c | 194 --------
fs/inotify.c | 913 --------------------------------------
fs/inotify_user.c | 778 --------------------------------
fs/notify/Kconfig | 2
fs/notify/Makefile | 2
fs/notify/dnotify/Kconfig | 10
fs/notify/dnotify/Makefile | 1
fs/notify/dnotify/dnotify.c | 194 ++++++++
fs/notify/inotify/Kconfig | 27 +
fs/notify/inotify/Makefile | 2
fs/notify/inotify/inotify.c | 913 ++++++++++++++++++++++++++++++++++++++
fs/notify/inotify/inotify_user.c | 778 ++++++++++++++++++++++++++++++++
14 files changed, 1931 insertions(+), 1927 deletions(-)
delete mode 100644 fs/dnotify.c
delete mode 100644 fs/inotify.c
delete mode 100644 fs/inotify_user.c
create mode 100644 fs/notify/Kconfig
create mode 100644 fs/notify/Makefile
create mode 100644 fs/notify/dnotify/Kconfig
create mode 100644 fs/notify/dnotify/Makefile
create mode 100644 fs/notify/dnotify/dnotify.c
create mode 100644 fs/notify/inotify/Kconfig
create mode 100644 fs/notify/inotify/Makefile
create mode 100644 fs/notify/inotify/inotify.c
create mode 100644 fs/notify/inotify/inotify_user.c

diff --git a/fs/Kconfig b/fs/Kconfig
index 522469a..ff0e819 100644
--- a/fs/Kconfig
+++ b/fs/Kconfig
@@ -270,44 +270,7 @@ config OCFS2_COMPAT_JBD

endif # BLOCK

-config DNOTIFY
- bool "Dnotify support"
- default y
- help
- Dnotify is a directory-based per-fd file change notification system
- that uses signals to communicate events to user-space. There exist
- superior alternatives, but some applications may still rely on
- dnotify.
-
- If unsure, say Y.
-
-config INOTIFY
- bool "Inotify file change notification support"
- default y
- ---help---
- Say Y here to enable inotify support. Inotify is a file change
- notification system and a replacement for dnotify. Inotify fixes
- numerous shortcomings in dnotify and introduces several new features
- including multiple file events, one-shot support, and unmount
- notification.
-
- For more information, see <file:Documentation/filesystems/inotify.txt>
-
- If unsure, say Y.
-
-config INOTIFY_USER
- bool "Inotify support for userspace"
- depends on INOTIFY
- default y
- ---help---
- Say Y here to enable inotify support for userspace, including the
- associated system calls. Inotify allows monitoring of both files and
- directories via a single open fd. Events are read from the file
- descriptor, which is also select()- and poll()-able.
-
- For more information, see <file:Documentation/filesystems/inotify.txt>
-
- If unsure, say Y.
+source "fs/notify/Kconfig"

config QUOTA
bool "Quota support"
diff --git a/fs/Makefile b/fs/Makefile
index d9f8afe..e6f423d 100644
--- a/fs/Makefile
+++ b/fs/Makefile
@@ -20,8 +20,7 @@ obj-y += no-block.o
endif

obj-$(CONFIG_BLK_DEV_INTEGRITY) += bio-integrity.o
-obj-$(CONFIG_INOTIFY) += inotify.o
-obj-$(CONFIG_INOTIFY_USER) += inotify_user.o
+obj-y += notify/
obj-$(CONFIG_EPOLL) += eventpoll.o
obj-$(CONFIG_ANON_INODES) += anon_inodes.o
obj-$(CONFIG_SIGNALFD) += signalfd.o
@@ -57,8 +56,6 @@ obj-$(CONFIG_QFMT_V1) += quota_v1.o
obj-$(CONFIG_QFMT_V2) += quota_v2.o
obj-$(CONFIG_QUOTACTL) += quota.o

-obj-$(CONFIG_DNOTIFY) += dnotify.o
-
obj-$(CONFIG_PROC_FS) += proc/
obj-y += partitions/
obj-$(CONFIG_SYSFS) += sysfs/
diff --git a/fs/notify/Kconfig b/fs/notify/Kconfig
new file mode 100644
index 0000000..50914d7
--- /dev/null
+++ b/fs/notify/Kconfig
@@ -0,0 +1,2 @@
+source "fs/notify/dnotify/Kconfig"
+source "fs/notify/inotify/Kconfig"
diff --git a/fs/notify/Makefile b/fs/notify/Makefile
new file mode 100644
index 0000000..5a95b60
--- /dev/null
+++ b/fs/notify/Makefile
@@ -0,0 +1,2 @@
+obj-y += dnotify/
+obj-y += inotify/
diff --git a/fs/notify/dnotify/Kconfig b/fs/notify/dnotify/Kconfig
new file mode 100644
index 0000000..26adf5d
--- /dev/null
+++ b/fs/notify/dnotify/Kconfig
@@ -0,0 +1,10 @@
+config DNOTIFY
+ bool "Dnotify support"
+ default y
+ help
+ Dnotify is a directory-based per-fd file change notification system
+ that uses signals to communicate events to user-space. There exist
+ superior alternatives, but some applications may still rely on
+ dnotify.
+
+ If unsure, say Y.
diff --git a/fs/notify/dnotify/Makefile b/fs/notify/dnotify/Makefile
new file mode 100644
index 0000000..f145251
--- /dev/null
+++ b/fs/notify/dnotify/Makefile
@@ -0,0 +1 @@
+obj-$(CONFIG_DNOTIFY) += dnotify.o
diff --git a/fs/dnotify.c b/fs/notify/dnotify/dnotify.c
similarity index 100%
rename from fs/dnotify.c
rename to fs/notify/dnotify/dnotify.c
diff --git a/fs/notify/inotify/Kconfig b/fs/notify/inotify/Kconfig
new file mode 100644
index 0000000..4467928
--- /dev/null
+++ b/fs/notify/inotify/Kconfig
@@ -0,0 +1,27 @@
+config INOTIFY
+ bool "Inotify file change notification support"
+ default y
+ ---help---
+ Say Y here to enable inotify support. Inotify is a file change
+ notification system and a replacement for dnotify. Inotify fixes
+ numerous shortcomings in dnotify and introduces several new features
+ including multiple file events, one-shot support, and unmount
+ notification.
+
+ For more information, see <file:Documentation/filesystems/inotify.txt>
+
+ If unsure, say Y.
+
+config INOTIFY_USER
+ bool "Inotify support for userspace"
+ depends on INOTIFY
+ default y
+ ---help---
+ Say Y here to enable inotify support for userspace, including the
+ associated system calls. Inotify allows monitoring of both files and
+ directories via a single open fd. Events are read from the file
+ descriptor, which is also select()- and poll()-able.
+
+ For more information, see <file:Documentation/filesystems/inotify.txt>
+
+ If unsure, say Y.
diff --git a/fs/notify/inotify/Makefile b/fs/notify/inotify/Makefile
new file mode 100644
index 0000000..e290f3b
--- /dev/null
+++ b/fs/notify/inotify/Makefile
@@ -0,0 +1,2 @@
+obj-$(CONFIG_INOTIFY) += inotify.o
+obj-$(CONFIG_INOTIFY_USER) += inotify_user.o
diff --git a/fs/inotify.c b/fs/notify/inotify/inotify.c
similarity index 100%
rename from fs/inotify.c
rename to fs/notify/inotify/inotify.c
diff --git a/fs/inotify_user.c b/fs/notify/inotify/inotify_user.c
similarity index 100%
rename from fs/inotify_user.c
rename to fs/notify/inotify/inotify_user.c

2008-12-12 21:52:45

by Eric Paris

Subject: [RFC PATCH -v4 03/14] fsnotify: sys_execve and sys_uselib do not call into fsnotify

sys_execve and sys_uselib do not call into fsnotify, so inotify, dnotify,
and (importantly to me) fanotify do not see opens on things which are going
to be executed. Create a generic fsnotify hook for these paths.

Signed-off-by: Eric Paris <[email protected]>
---

fs/exec.c | 5 +++++
include/linux/fsnotify.h | 7 +++++++
2 files changed, 12 insertions(+), 0 deletions(-)

diff --git a/fs/exec.c b/fs/exec.c
index ec5df9a..8a659a8 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -51,6 +51,7 @@
#include <linux/audit.h>
#include <linux/tracehook.h>
#include <linux/kmod.h>
+#include <linux/fsnotify.h>

#include <asm/uaccess.h>
#include <asm/mmu_context.h>
@@ -135,6 +136,8 @@ asmlinkage long sys_uselib(const char __user * library)
if (IS_ERR(file))
goto out;

+ fsnotify_open_exec(file);
+
error = -ENOEXEC;
if(file->f_op) {
struct linux_binfmt * fmt;
@@ -687,6 +690,8 @@ struct file *open_exec(const char *name)
if (IS_ERR(file))
return file;

+ fsnotify_open_exec(file);
+
err = deny_write_access(file);
if (err) {
fput(file);
diff --git a/include/linux/fsnotify.h b/include/linux/fsnotify.h
index dec1afb..ffe787f 100644
--- a/include/linux/fsnotify.h
+++ b/include/linux/fsnotify.h
@@ -168,6 +168,13 @@ static inline void fsnotify_modify(struct file *file)
}

/*
+ * fsnotify_open_exec - file was opened by execve or uselib
+ */
+static inline void fsnotify_open_exec(struct file *file)
+{
+}
+
+/*
* fsnotify_open - file was opened
*/
static inline void fsnotify_open(struct file *file)

2008-12-12 21:53:03

by Eric Paris

Subject: [RFC PATCH -v4 05/14] fsnotify: unified filesystem notification backend

fsnotify is a backend for filesystem notification. fsnotify does
not provide any userspace interface but does provide the basis
needed for other notification schemes such as dnotify. fsnotify
can be extended to be the backend for inotify or the upcoming
fanotify.
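
The core dispatch idea (each group carries an event mask, a global mask is the
OR of all group masks, and an event is only built once some group actually
wants it) can be sketched in userspace like this. The names group_stub,
add_group and notify are hypothetical, and the sketch deliberately leaves out
the SRCU protection, reference counting and per-group callbacks that the patch
itself uses.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_GROUPS 8

/* Hypothetical stand-in for a notification group. */
struct group_stub {
	uint64_t mask;		/* events this group wants to hear about */
	uint64_t last_mask;	/* mask of the last event delivered */
	int delivered;		/* number of events delivered so far */
};

static struct group_stub *groups[MAX_GROUPS];
static size_t ngroups;
static uint64_t global_mask;	/* OR of every registered group's mask */
static int events_created;	/* how many events were actually built */

/* Register a group and fold its mask into the global mask. */
static int add_group(struct group_stub *g)
{
	if (ngroups == MAX_GROUPS)
		return -1;
	groups[ngroups++] = g;
	global_mask |= g->mask;
	return 0;
}

/* Deliver an event to every interested group; returns how many got it. */
static int notify(uint64_t mask)
{
	uint64_t event_mask = 0;
	int have_event = 0;
	int notified = 0;
	size_t i;

	/* Fast path: no group cares about any bit in this mask. */
	if (!(mask & global_mask))
		return 0;

	for (i = 0; i < ngroups; i++) {
		if (!(mask & groups[i]->mask))
			continue;
		/* Build the event lazily, and only once per call. */
		if (!have_event) {
			event_mask = mask;
			have_event = 1;
			events_created++;
		}
		groups[i]->last_mask = event_mask;
		groups[i]->delivered++;
		notified++;
	}
	return notified;
}
```

With two groups registered with masks 0x1 and 0x3, an event with mask 0x4 is
dropped on the fast path without building anything, while an event with mask
0x1 is built once and delivered to both groups.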

Signed-off-by: Eric Paris <[email protected]>
---

fs/notify/Kconfig | 12 +++
fs/notify/Makefile | 2
fs/notify/fsnotify.c | 78 +++++++++++++++++++
fs/notify/fsnotify.h | 51 +++++++++++++
fs/notify/group.c | 153 ++++++++++++++++++++++++++++++++++++++
fs/notify/notification.c | 123 +++++++++++++++++++++++++++++++
include/linux/fsnotify.h | 57 ++++++++++++--
include/linux/fsnotify_backend.h | 110 +++++++++++++++++++++++++++
8 files changed, 578 insertions(+), 8 deletions(-)
create mode 100644 fs/notify/fsnotify.c
create mode 100644 fs/notify/fsnotify.h
create mode 100644 fs/notify/group.c
create mode 100644 fs/notify/notification.c
create mode 100644 include/linux/fsnotify_backend.h

diff --git a/fs/notify/Kconfig b/fs/notify/Kconfig
index 50914d7..269b59a 100644
--- a/fs/notify/Kconfig
+++ b/fs/notify/Kconfig
@@ -1,2 +1,14 @@
+config FSNOTIFY
+ bool "Filesystem notification backend"
+ default y
+ ---help---
+ fsnotify is a backend for filesystem notification. fsnotify does
+ not provide any userspace interface but does provide the basis
+ needed for other notification schemes such as dnotify and fsnotify.
+
+ Say Y here to enable fsnotify support.
+
+ If unsure, say Y.
+
source "fs/notify/dnotify/Kconfig"
source "fs/notify/inotify/Kconfig"
diff --git a/fs/notify/Makefile b/fs/notify/Makefile
index 5a95b60..7cb285a 100644
--- a/fs/notify/Makefile
+++ b/fs/notify/Makefile
@@ -1,2 +1,4 @@
obj-y += dnotify/
obj-y += inotify/
+
+obj-$(CONFIG_FSNOTIFY) += fsnotify.o notification.o group.o
diff --git a/fs/notify/fsnotify.c b/fs/notify/fsnotify.c
new file mode 100644
index 0000000..93a0e8f
--- /dev/null
+++ b/fs/notify/fsnotify.c
@@ -0,0 +1,78 @@
+/*
+ * Copyright (C) 2008 Red Hat, Inc., Eric Paris <[email protected]>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2, or (at your option)
+ * any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; see the file COPYING. If not, write to
+ * the Free Software Foundation, 675 Mass Ave, Cambridge, MA 02139, USA.
+ */
+
+#include <linux/dcache.h>
+#include <linux/fs.h>
+#include <linux/init.h>
+#include <linux/module.h>
+#include <linux/srcu.h>
+
+#include <linux/fsnotify_backend.h>
+#include "fsnotify.h"
+
+void fsnotify(struct inode *to_tell, __u64 mask, void *data, int data_is)
+{
+ struct fsnotify_group *group;
+ struct fsnotify_event *event = NULL;
+ int idx;
+
+ if (list_empty(&fsnotify_groups))
+ return;
+
+ if (!(mask & fsnotify_mask))
+ return;
+
+ /*
+ * SRCU!! the groups list is very very much read only and the path is
+ * very hot (assuming something is using fsnotify) Not blocking while
+ * walking this list is ugly. We could preallocate an event and an
+ * event holder for every group that event might need to be put on, but
+ * all that possibly wasted allocation is nuts. For all we know there
+ * are already mark entries, groups don't need this event, or all
+ * sorts of reasons to believe not every kernel action is going to get
+ * sent to userspace. Hopefully this won't get shit on too much,
+ * because going to a mutex here is really going to needlessly serialize
+ * read/write/open/close across the whole system....
+ */
+ idx = srcu_read_lock(&fsnotify_grp_srcu_struct);
+ list_for_each_entry_rcu(group, &fsnotify_groups, group_list) {
+ if (mask & group->mask) {
+ if (!event) {
+ event = fsnotify_create_event(to_tell, mask, data, data_is);
+ /* shit, we OOM'd and now we can't tell, lets hope something else blows up */
+ if (!event)
+ break;
+ }
+ group->ops->event_to_notif(group, event);
+ }
+ }
+ srcu_read_unlock(&fsnotify_grp_srcu_struct, idx);
+ /*
+ * fsnotify_create_event() took a reference so the event can't be cleaned
+ * up while we are still trying to add it to lists, drop that one.
+ */
+ if (event)
+ fsnotify_put_event(event);
+}
+EXPORT_SYMBOL_GPL(fsnotify);
+
+static __init int fsnotify_init(void)
+{
+ return init_srcu_struct(&fsnotify_grp_srcu_struct);
+}
+subsys_initcall(fsnotify_init);
diff --git a/fs/notify/fsnotify.h b/fs/notify/fsnotify.h
new file mode 100644
index 0000000..15bc151
--- /dev/null
+++ b/fs/notify/fsnotify.h
@@ -0,0 +1,51 @@
+#ifndef _LINUX_FSNOTIFY_PRIVATE_H
+#define _LINUX_FSNOTIFY_PRIVATE_H
+
+#include <linux/dcache.h>
+#include <linux/list.h>
+#include <linux/fs.h>
+#include <linux/path.h>
+#include <linux/spinlock.h>
+
+#include <linux/fsnotify.h>
+
+#include <asm/atomic.h>
+
+struct fsnotify_event_private_data {
+ struct fsnotify_group *group;
+ struct list_head event_list;
+ char data[0];
+};
+
+/*
+ * all of the information about the original object we want to now send to
+ * a scanner. If you want to carry more info from the accessing task to the
+ * listener this structure is where you need to be adding fields.
+ */
+struct fsnotify_event {
+ spinlock_t lock; /* protection for the associated event_holder and private_list */
+ struct inode *to_tell;
+ /*
+ * depending on the event type we should have either a path, dentry, or inode
+ * we should never have more than one....
+ */
+ union {
+ struct path path;
+ struct inode *inode;
+ };
+ int flag; /* which of the above we have */
+ __u64 mask; /* the type of access */
+ atomic_t refcnt; /* how many groups still are using/need to send this event */
+
+ struct list_head private_data_list;
+};
+
+extern struct srcu_struct fsnotify_grp_srcu_struct;
+extern struct list_head fsnotify_groups;
+extern __u64 fsnotify_mask;
+
+extern void fsnotify_get_event(struct fsnotify_event *event);
+extern void fsnotify_put_event(struct fsnotify_event *event);
+extern struct fsnotify_event_private_data *fsnotify_get_priv_from_event(struct fsnotify_group *group, struct fsnotify_event *event);
+extern struct fsnotify_event *fsnotify_create_event(struct inode *to_tell, __u64 mask, void *data, int data_is);
+#endif /* _LINUX_FSNOTIFY_PRIVATE_H */
diff --git a/fs/notify/group.c b/fs/notify/group.c
new file mode 100644
index 0000000..40935c3
--- /dev/null
+++ b/fs/notify/group.c
@@ -0,0 +1,153 @@
+/*
+ * Copyright (C) 2008 Red Hat, Inc., Eric Paris <[email protected]>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2, or (at your option)
+ * any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; see the file COPYING. If not, write to
+ * the Free Software Foundation, 675 Mass Ave, Cambridge, MA 02139, USA.
+ */
+
+#include <linux/list.h>
+#include <linux/mutex.h>
+#include <linux/slab.h>
+#include <linux/srcu.h>
+#include <linux/rculist.h>
+#include <linux/wait.h>
+
+#include <linux/fsnotify_backend.h>
+#include "fsnotify.h"
+
+#include <asm/atomic.h>
+
+DEFINE_MUTEX(fsnotify_grp_mutex);
+struct srcu_struct fsnotify_grp_srcu_struct;
+LIST_HEAD(fsnotify_groups);
+__u64 fsnotify_mask;
+
+void fsnotify_recalc_global_mask(void)
+{
+ struct fsnotify_group *group;
+ __u64 mask = 0;
+ int idx;
+
+ idx = srcu_read_lock(&fsnotify_grp_srcu_struct);
+ list_for_each_entry_rcu(group, &fsnotify_groups, group_list) {
+ mask |= group->mask;
+ }
+ srcu_read_unlock(&fsnotify_grp_srcu_struct, idx);
+ fsnotify_mask = mask;
+}
+
+static void fsnotify_add_group(struct fsnotify_group *group)
+{
+ list_add_rcu(&group->group_list, &fsnotify_groups);
+}
+
+void fsnotify_get_group(struct fsnotify_group *group)
+{
+ atomic_inc(&group->refcnt);
+}
+
+static void fsnotify_destroy_group(struct fsnotify_group *group)
+{
+ if (group->ops->free_group_priv)
+ group->ops->free_group_priv(group);
+
+ kfree(group);
+}
+
+void fsnotify_put_group(struct fsnotify_group *group)
+{
+ if (atomic_dec_and_test(&group->refcnt)) {
+ mutex_lock(&fsnotify_grp_mutex);
+ list_del_rcu(&group->group_list);
+ mutex_unlock(&fsnotify_grp_mutex);
+
+ synchronize_srcu(&fsnotify_grp_srcu_struct);
+
+ /*
+ * shit. something found us before we got off the list.
+ * so lets put ourselves back...
+ */
+ if (atomic_read(&group->refcnt)) {
+ mutex_lock(&fsnotify_grp_mutex);
+ if (atomic_read(&group->refcnt))
+ fsnotify_add_group(group);
+ mutex_unlock(&fsnotify_grp_mutex);
+ return;
+ }
+
+ fsnotify_recalc_global_mask();
+ fsnotify_destroy_group(group);
+ }
+}
+
+static struct fsnotify_group *fsnotify_find_group(unsigned int group_num, __u64 mask, const struct fsnotify_ops *ops)
+{
+ struct fsnotify_group *group_iter;
+ struct fsnotify_group *group = NULL;
+
+ list_for_each_entry_rcu(group_iter, &fsnotify_groups, group_list) {
+ if (group_iter->group_num == group_num) {
+ if ((group_iter->mask == mask) &&
+ (group_iter->ops == ops)) {
+ fsnotify_get_group(group_iter);
+ group = group_iter;
+ } else
+ group = ERR_PTR(-EEXIST);
+ }
+ }
+ return group;
+}
+
+struct fsnotify_group *fsnotify_obtain_group(unsigned int group_num, __u64 mask, const struct fsnotify_ops *ops)
+{
+ struct fsnotify_group *group, *tgroup;
+ int idx;
+
+ idx = srcu_read_lock(&fsnotify_grp_srcu_struct);
+ group = fsnotify_find_group(group_num, mask, ops);
+ srcu_read_unlock(&fsnotify_grp_srcu_struct, idx);
+ if (group)
+ return group;
+
+ group = kmalloc(sizeof(struct fsnotify_group), GFP_KERNEL);
+ if (!group)
+ return ERR_PTR(-ENOMEM);
+
+ atomic_set(&group->refcnt, 1);
+
+ group->group_num = group_num;
+ group->mask = mask;
+
+ group->ops = ops;
+ group->private = NULL;
+
+ mutex_lock(&fsnotify_grp_mutex);
+ tgroup = fsnotify_find_group(group_num, mask, ops);
+ /* we raced and something else inserted the same group */
+ if (tgroup) {
+ mutex_unlock(&fsnotify_grp_mutex);
+ /* destroy the new one we made */
+ fsnotify_put_group(group);
+ return tgroup;
+ }
+
+ /* ok, no races here, add it */
+ fsnotify_add_group(group);
+ mutex_unlock(&fsnotify_grp_mutex);
+
+ if (mask)
+ fsnotify_recalc_global_mask();
+
+ return group;
+}
diff --git a/fs/notify/notification.c b/fs/notify/notification.c
new file mode 100644
index 0000000..f008a15
--- /dev/null
+++ b/fs/notify/notification.c
@@ -0,0 +1,123 @@
+/*
+ * Copyright (C) 2008 Red Hat, Inc., Eric Paris <[email protected]>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2, or (at your option)
+ * any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; see the file COPYING. If not, write to
+ * the Free Software Foundation, 675 Mass Ave, Cambridge, MA 02139, USA.
+ */
+
+#include <linux/fs.h>
+#include <linux/init.h>
+#include <linux/kernel.h>
+#include <linux/list.h>
+#include <linux/mount.h>
+#include <linux/mutex.h>
+#include <linux/namei.h>
+#include <linux/path.h>
+#include <linux/slab.h>
+#include <linux/spinlock.h>
+
+#include <asm/atomic.h>
+
+#include <linux/fsnotify_backend.h>
+#include "fsnotify.h"
+
+static struct kmem_cache *event_kmem_cache;
+
+void fsnotify_get_event(struct fsnotify_event *event)
+{
+ atomic_inc(&event->refcnt);
+}
+
+void fsnotify_put_event(struct fsnotify_event *event)
+{
+ if (!event)
+ return;
+
+ if (atomic_dec_and_test(&event->refcnt)) {
+ if (event->flag == FSNOTIFY_EVENT_FILE) {
+ path_put(&event->path);
+ event->path.dentry = NULL;
+ event->path.mnt = NULL;
+ }
+
+ event->mask = 0;
+
+ BUG_ON(!list_empty(&event->private_data_list));
+ kmem_cache_free(event_kmem_cache, event);
+ }
+}
+
+struct fsnotify_event_private_data *fsnotify_get_priv_from_event(struct fsnotify_group *group, struct fsnotify_event *event)
+{
+ struct fsnotify_event_private_data *lpriv;
+ struct fsnotify_event_private_data *priv = NULL;
+
+ list_for_each_entry(lpriv, &event->private_data_list, event_list) {
+ if (lpriv->group == group) {
+ priv = lpriv;
+ break;
+ }
+ }
+ return priv;
+}
+
+struct fsnotify_event *fsnotify_create_event(struct inode *to_tell, __u64 mask, void *data, int data_is)
+{
+ struct fsnotify_event *event;
+
+ event = kmem_cache_alloc(event_kmem_cache, GFP_KERNEL);
+ if (!event)
+ return NULL;
+
+ atomic_set(&event->refcnt, 1);
+
+ spin_lock_init(&event->lock);
+
+ event->path.dentry = NULL;
+ event->path.mnt = NULL;
+ event->inode = NULL;
+
+ INIT_LIST_HEAD(&event->private_data_list);
+
+ event->to_tell = to_tell;
+ event->flag = data_is;
+
+ switch (data_is) {
+ case FSNOTIFY_EVENT_FILE: {
+ struct file *file = data;
+ event->path.dentry = file->f_path.dentry;
+ event->path.mnt = file->f_path.mnt;
+ path_get(&event->path);
+ break;
+ }
+ case FSNOTIFY_EVENT_INODE:
+ event->inode = data;
+ break;
+ default:
+ BUG();
+ };
+
+ event->mask = mask;
+
+ return event;
+}
+
+__init int fsnotify_notification_init(void)
+{
+ event_kmem_cache = kmem_cache_create("fsnotify_event", sizeof(struct fsnotify_event), 0, SLAB_PANIC, NULL);
+
+ return 0;
+}
+subsys_initcall(fsnotify_notification_init);
+
diff --git a/include/linux/fsnotify.h b/include/linux/fsnotify.h
index 6fbf455..b084b98 100644
--- a/include/linux/fsnotify.h
+++ b/include/linux/fsnotify.h
@@ -13,6 +13,7 @@

#include <linux/dnotify.h>
#include <linux/inotify.h>
+#include <linux/fsnotify_backend.h>
#include <linux/audit.h>

/*
@@ -43,28 +44,45 @@ static inline void fsnotify_move(struct inode *old_dir, struct inode *new_dir,
{
struct inode *source = moved->d_inode;
u32 cookie = inotify_get_cookie();
+ __u64 old_dir_mask = 0;
+ __u64 new_dir_mask = 0;

- if (old_dir == new_dir)
+ if (old_dir == new_dir) {
inode_dir_notify(old_dir, DN_RENAME);
- else {
+ old_dir_mask = FS_DN_RENAME;
+ } else {
inode_dir_notify(old_dir, DN_DELETE);
+ old_dir_mask = FS_DELETE;
inode_dir_notify(new_dir, DN_CREATE);
+ new_dir_mask = FS_CREATE;
}

- if (isdir)
+ if (isdir) {
isdir = IN_ISDIR;
+ old_dir_mask |= FS_IN_ISDIR;
+ new_dir_mask |= FS_IN_ISDIR;
+ }
+
+ old_dir_mask |= FS_MOVED_FROM;
+ new_dir_mask |= FS_MOVED_TO;
+
inotify_inode_queue_event(old_dir, IN_MOVED_FROM|isdir,cookie,old_name,
source);
inotify_inode_queue_event(new_dir, IN_MOVED_TO|isdir, cookie, new_name,
source);

+ fsnotify(old_dir, old_dir_mask, old_dir, FSNOTIFY_EVENT_INODE);
+ fsnotify(new_dir, new_dir_mask, new_dir, FSNOTIFY_EVENT_INODE);
+
if (target) {
inotify_inode_queue_event(target, IN_DELETE_SELF, 0, NULL, NULL);
inotify_inode_is_dead(target);
+ fsnotify(target, FS_DELETE_SELF, target, FSNOTIFY_EVENT_INODE);
}

if (source) {
inotify_inode_queue_event(source, IN_MOVE_SELF, 0, NULL, NULL);
+ fsnotify(source, FS_MOVE_SELF, moved->d_inode, FSNOTIFY_EVENT_INODE);
}
audit_inode_child(new_name, moved, new_dir);
}
@@ -87,6 +105,8 @@ static inline void fsnotify_inoderemove(struct inode *inode)
{
inotify_inode_queue_event(inode, IN_DELETE_SELF, 0, NULL, NULL);
inotify_inode_is_dead(inode);
+
+ fsnotify(inode, FS_DELETE_SELF, inode, FSNOTIFY_EVENT_INODE);
}

/*
@@ -95,6 +115,8 @@ static inline void fsnotify_inoderemove(struct inode *inode)
static inline void fsnotify_link_count(struct inode *inode)
{
inotify_inode_queue_event(inode, IN_ATTRIB, 0, NULL, NULL);
+
+ fsnotify(inode, FS_ATTRIB, inode, FSNOTIFY_EVENT_INODE);
}

/*
@@ -106,6 +128,8 @@ static inline void fsnotify_create(struct inode *inode, struct dentry *dentry)
inotify_inode_queue_event(inode, IN_CREATE, 0, dentry->d_name.name,
dentry->d_inode);
audit_inode_child(dentry->d_name.name, dentry, inode);
+
+ fsnotify(inode, FS_CREATE, dentry->d_inode, FSNOTIFY_EVENT_INODE);
}

/*
@@ -120,6 +144,8 @@ static inline void fsnotify_link(struct inode *dir, struct inode *inode, struct
inode);
fsnotify_link_count(inode);
audit_inode_child(new_dentry->d_name.name, new_dentry, dir);
+
+ fsnotify(dir, FS_CREATE, inode, FSNOTIFY_EVENT_INODE);
}

/*
@@ -131,6 +157,8 @@ static inline void fsnotify_mkdir(struct inode *inode, struct dentry *dentry)
inotify_inode_queue_event(inode, IN_CREATE | IN_ISDIR, 0,
dentry->d_name.name, dentry->d_inode);
audit_inode_child(dentry->d_name.name, dentry, inode);
+
+ fsnotify(inode, FS_CREATE | FS_IN_ISDIR, dentry->d_inode, FSNOTIFY_EVENT_INODE);
}

/*
@@ -140,7 +168,7 @@ static inline void fsnotify_access(struct file *file)
{
struct dentry *dentry = file->f_path.dentry;
struct inode *inode = dentry->d_inode;
- u32 mask = IN_ACCESS;
+ __u64 mask = IN_ACCESS;

if (S_ISDIR(inode->i_mode))
mask |= IN_ISDIR;
@@ -148,6 +176,8 @@ static inline void fsnotify_access(struct file *file)
dnotify_parent(dentry, DN_ACCESS);
inotify_dentry_parent_queue_event(dentry, mask, 0, dentry->d_name.name);
inotify_inode_queue_event(inode, mask, 0, NULL, NULL);
+
+ fsnotify(inode, mask, file, FSNOTIFY_EVENT_FILE);
}

/*
@@ -157,7 +187,7 @@ static inline void fsnotify_modify(struct file *file)
{
struct dentry *dentry = file->f_path.dentry;
struct inode *inode = dentry->d_inode;
- u32 mask = IN_MODIFY;
+ __u64 mask = IN_MODIFY;

if (S_ISDIR(inode->i_mode))
mask |= IN_ISDIR;
@@ -165,6 +195,8 @@ static inline void fsnotify_modify(struct file *file)
dnotify_parent(dentry, DN_MODIFY);
inotify_dentry_parent_queue_event(dentry, mask, 0, dentry->d_name.name);
inotify_inode_queue_event(inode, mask, 0, NULL, NULL);
+
+ fsnotify(inode, mask, file, FSNOTIFY_EVENT_FILE);
}

/*
@@ -178,6 +210,8 @@ static inline void fsnotify_open_exec(struct file *file)
dnotify_parent(dentry, DN_ACCESS);
inotify_dentry_parent_queue_event(dentry, IN_ACCESS, 0, dentry->d_name.name);
inotify_inode_queue_event(inode, IN_ACCESS, 0, NULL, NULL);
+
+ fsnotify(inode, FS_ACCESS, file, FSNOTIFY_EVENT_FILE);
}

/*
@@ -187,13 +221,15 @@ static inline void fsnotify_open(struct file *file)
{
struct dentry *dentry = file->f_path.dentry;
struct inode *inode = dentry->d_inode;
- u32 mask = IN_OPEN;
+ __u64 mask = IN_OPEN;

if (S_ISDIR(inode->i_mode))
mask |= IN_ISDIR;

inotify_dentry_parent_queue_event(dentry, mask, 0, dentry->d_name.name);
inotify_inode_queue_event(inode, mask, 0, NULL, NULL);
+
+ fsnotify(inode, mask, file, FSNOTIFY_EVENT_FILE);
}

/*
@@ -205,13 +241,15 @@ static inline void fsnotify_close(struct file *file)
struct inode *inode = dentry->d_inode;
const char *name = dentry->d_name.name;
fmode_t mode = file->f_mode;
- u32 mask = (mode & FMODE_WRITE) ? IN_CLOSE_WRITE : IN_CLOSE_NOWRITE;
+ __u64 mask = (mode & FMODE_WRITE) ? IN_CLOSE_WRITE : IN_CLOSE_NOWRITE;

if (S_ISDIR(inode->i_mode))
mask |= IN_ISDIR;

inotify_dentry_parent_queue_event(dentry, mask, 0, name);
inotify_inode_queue_event(inode, mask, 0, NULL, NULL);
+
+ fsnotify(inode, mask, file, FSNOTIFY_EVENT_FILE);
}

/*
@@ -220,13 +258,15 @@ static inline void fsnotify_close(struct file *file)
static inline void fsnotify_xattr(struct dentry *dentry)
{
struct inode *inode = dentry->d_inode;
- u32 mask = IN_ATTRIB;
+ __u64 mask = IN_ATTRIB;

if (S_ISDIR(inode->i_mode))
mask |= IN_ISDIR;

inotify_dentry_parent_queue_event(dentry, mask, 0, dentry->d_name.name);
inotify_inode_queue_event(inode, mask, 0, NULL, NULL);
+
+ fsnotify(inode, mask, inode, FSNOTIFY_EVENT_INODE);
}

/*
@@ -276,6 +316,7 @@ static inline void fsnotify_change(struct dentry *dentry, unsigned int ia_valid)
inotify_inode_queue_event(inode, in_mask, 0, NULL, NULL);
inotify_dentry_parent_queue_event(dentry, in_mask, 0,
dentry->d_name.name);
+ fsnotify(inode, in_mask, inode, FSNOTIFY_EVENT_INODE);
}
}

diff --git a/include/linux/fsnotify_backend.h b/include/linux/fsnotify_backend.h
new file mode 100644
index 0000000..5264db1
--- /dev/null
+++ b/include/linux/fsnotify_backend.h
@@ -0,0 +1,110 @@
+/*
+ * Filesystem access notification for Linux
+ *
+ * Copyright (C) 2008 Red Hat, Inc., Eric Paris <[email protected]>
+ */
+
+#ifndef _LINUX_FSNOTIFY_BACKEND_H
+#define _LINUX_FSNOTIFY_BACKEND_H
+
+#ifdef __KERNEL__
+
+#include <linux/dcache.h>
+#include <linux/fs.h>
+#include <linux/list.h>
+#include <linux/mutex.h>
+#include <linux/wait.h>
+
+#include <asm/atomic.h>
+
+/*
+ * IN_* from inotify.h lines up EXACTLY with FS_*, this is so we can easily
+ * convert between them. dnotify only needs conversion at watch creation
+ * so no perf loss there. fanotify isn't defined yet, so it can use the
+ * holes if it needs more events.
+ */
+#define FS_ACCESS 0x0000000000000001ull /* File was accessed */
+#define FS_MODIFY 0x0000000000000002ull /* File was modified */
+#define FS_ATTRIB 0x0000000000000004ull /* Metadata changed */
+#define FS_CLOSE_WRITE 0x0000000000000008ull /* Writable file was closed */
+#define FS_CLOSE_NOWRITE 0x0000000000000010ull /* Unwritable file closed */
+#define FS_OPEN 0x0000000000000020ull /* File was opened */
+#define FS_MOVED_FROM 0x0000000000000040ull /* File was moved from X */
+#define FS_MOVED_TO 0x0000000000000080ull /* File was moved to Y */
+#define FS_CREATE 0x0000000000000100ull /* Subfile was created */
+#define FS_DELETE 0x0000000000000200ull /* Subfile was deleted */
+#define FS_DELETE_SELF 0x0000000000000400ull /* Self was deleted */
+#define FS_MOVE_SELF 0x0000000000000800ull /* Self was moved */
+
+#define FS_IN_UNMOUNT 0x0000000000002000ull /* inode on umount fs */
+#define FS_Q_OVERFLOW 0x0000000000004000ull /* Event queue overflowed */
+#define FS_IN_IGNORED 0x0000000000008000ull /* last inotify event here */
+
+#define FS_IN_ISDIR 0x0000000040000000ull /* event occurred against dir */
+#define FS_IN_ONESHOT 0x0000000080000000ull /* only send event once */
+
+/*
+ * FSNOTIFY has decided to separate out events for self vs events delivered to
+ * a parent based on the actions of the child. dnotify does this for 4 events:
+ * ACCESS, MODIFY, ATTRIB, and DELETE. Inotify adds to that list CLOSE_WRITE,
+ * CLOSE_NOWRITE, CREATE, MOVED_FROM, MOVED_TO, OPEN. So all of these _CHILD
+ * events are defined the same as the regular events, only << 32, for easy
+ * conversion.
+ */
+#define FS_ACCESS_CHILD 0x0000000100000000ull /* child was accessed */
+#define FS_MODIFY_CHILD 0x0000000200000000ull /* child was modified */
+#define FS_ATTRIB_CHILD 0x0000000400000000ull /* child attributes changed */
+#define FS_CLOSE_WRITE_CHILD 0x0000000800000000ull /* Writable file was closed */
+#define FS_CLOSE_NOWRITE_CHILD 0x0000001000000000ull /* Unwritable file closed */
+#define FS_OPEN_CHILD 0x0000002000000000ull /* File was opened */
+#define FS_MOVED_FROM_CHILD 0x0000004000000000ull /* File was moved from X */
+#define FS_MOVED_TO_CHILD 0x0000008000000000ull /* File was moved to Y */
+#define FS_CREATE_CHILD 0x0000010000000000ull /* Subfile was created */
+#define FS_DELETE_CHILD 0x0000020000000000ull /* child was deleted */
+
+#define FS_DN_RENAME 0x1000000000000000ull /* file renamed */
+#define FS_DN_MULTISHOT 0x2000000000000000ull /* dnotify multishot */
+
+/* when calling fsnotify tell it if the data is a file, dentry, or inode */
+#define FSNOTIFY_EVENT_FILE 1
+#define FSNOTIFY_EVENT_INODE 2
+
+struct fsnotify_group;
+struct fsnotify_event;
+
+struct fsnotify_ops {
+ int (*event_to_notif)(struct fsnotify_group *group, struct fsnotify_event *event);
+ void (*free_group_priv)(struct fsnotify_group *group);
+ void (*free_event_priv)(struct fsnotify_group *group, struct fsnotify_event *event);
+};
+
+struct fsnotify_group {
+ struct list_head group_list; /* list of all groups on the system */
+ unsigned int group_num; /* the 'name' of the event */
+ __u64 mask; /* mask of events this group cares about */
+ atomic_t refcnt; /* num of processes with a special file open */
+
+ const struct fsnotify_ops *ops; /* how this group handles things */
+
+ void *private; /* private data for implementers (dnotify, inotify, fanotify) */
+};
+
+#ifdef CONFIG_FSNOTIFY
+
+/* called from the vfs to signal fs events */
+extern void fsnotify(struct inode *to_tell, __u64 mask, void *data, int data_is);
+
+/* called from fsnotify interfaces, such as fanotify or dnotify */
+extern void fsnotify_recalc_global_mask(void);
+extern struct fsnotify_group *fsnotify_obtain_group(unsigned int group_num, __u64 mask, const struct fsnotify_ops *ops);
+extern void fsnotify_put_group(struct fsnotify_group *group);
+extern void fsnotify_get_group(struct fsnotify_group *group);
+
+#else
+
+static inline void fsnotify(struct inode *to_tell, __u64 mask, void *data, int data_is)
+{}
+#endif /* CONFIG_FSNOTIFY */
+
+#endif /* __KERNEL__ */
+
+#endif /* _LINUX_FSNOTIFY_BACKEND_H */

2008-12-12 21:53:35

by Eric Paris

Subject: [RFC PATCH -v4 06/14] fsnotify: add group priorities

In preparation for blocking fsnotify calls, group priorities must be added.
When multiple groups request the same event type, the group with the lowest
priority value will receive the notification first.
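
The ordering rule can be sketched in user space as follows. This is only an illustration, not the kernel code: `struct group`, `group_insert()` and `group_insert_demo()` are hypothetical names, and a plain singly linked list stands in for the RCU-protected `fsnotify_groups` list. As in the patch, a new group goes in front of the first entry with a strictly larger priority, so groups with equal priority keep their arrival order.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical user-space stand-in for struct fsnotify_group. */
struct group {
	unsigned int priority;	/* lower value is notified first */
	struct group *next;
};

/*
 * Insert g in front of the first entry whose priority is strictly larger.
 * Entries with equal priority therefore keep their arrival order.
 */
static void group_insert(struct group **head, struct group *g)
{
	struct group **pos = head;

	while (*pos && (*pos)->priority <= g->priority)
		pos = &(*pos)->next;

	g->next = *pos;
	*pos = g;
}

/* Small self-check: inserting 10, 1, 5 leaves the list ordered 1, 5, 10. */
static void group_insert_demo(void)
{
	struct group a = { 10, NULL }, b = { 1, NULL }, c = { 5, NULL };
	struct group *head = NULL;

	group_insert(&head, &a);
	group_insert(&head, &b);
	group_insert(&head, &c);

	assert(head == &b && b.next == &c && c.next == &a && a.next == NULL);
}
```

Walking the list from the head then visits groups in the order they should be notified.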

Signed-off-by: Eric Paris <[email protected]>
---

fs/notify/group.c | 30 ++++++++++++++++++++++++------
include/linux/fsnotify_backend.h | 4 +++-
2 files changed, 27 insertions(+), 7 deletions(-)

diff --git a/fs/notify/group.c b/fs/notify/group.c
index 40935c3..0dd6e82 100644
--- a/fs/notify/group.c
+++ b/fs/notify/group.c
@@ -49,7 +49,21 @@ void fsnotify_recalc_global_mask(void)

static void fsnotify_add_group(struct fsnotify_group *group)
{
- list_add_rcu(&group->group_list, &fsnotify_groups);
+ int priority = group->priority;
+ struct fsnotify_group *group_iter;
+
+ list_for_each_entry(group_iter, &fsnotify_groups, group_list) {
+ BUG_ON(list_empty(&fsnotify_groups));
+ /* insert in front of this one? */
+ if (priority < group_iter->priority) {
+ /* I used list_add_tail() to insert in front of group_iter... */
+ list_add_tail_rcu(&group->group_list, &group_iter->group_list);
+ return;
+ }
+ }
+
+ /* apparently we need to be the last entry */
+ list_add_tail_rcu(&group->group_list, &fsnotify_groups);
}

void fsnotify_get_group(struct fsnotify_group *group)
@@ -91,14 +105,16 @@ void fsnotify_put_group(struct fsnotify_group *group)
}
}

-static struct fsnotify_group *fsnotify_find_group(unsigned int group_num, __u64 mask, const struct fsnotify_ops *ops)
+static struct fsnotify_group *fsnotify_find_group(unsigned int priority, unsigned int group_num,
+ __u64 mask, const struct fsnotify_ops *ops)
{
struct fsnotify_group *group_iter;
struct fsnotify_group *group = NULL;

list_for_each_entry_rcu(group_iter, &fsnotify_groups, group_list) {
- if (group_iter->group_num == group_num) {
+ if (group_iter->priority == priority) {
if ((group_iter->mask == mask) &&
+ (group_iter->group_num == group_num) &&
(group_iter->ops == ops)) {
fsnotify_get_group(group_iter);
group = group_iter;
@@ -109,13 +125,14 @@ static struct fsnotify_group *fsnotify_find_group(unsigned int group_num, __u64
return group;
}

-struct fsnotify_group *fsnotify_obtain_group(unsigned int group_num, __u64 mask, const struct fsnotify_ops *ops)
+struct fsnotify_group *fsnotify_obtain_group(unsigned int priority, unsigned int group_num,
+ __u64 mask, const struct fsnotify_ops *ops)
{
struct fsnotify_group *group, *tgroup;
int idx;

idx = srcu_read_lock(&fsnotify_grp_srcu_struct);
- group = fsnotify_find_group(group_num, mask, ops);
+ group = fsnotify_find_group(priority, group_num, mask, ops);
srcu_read_unlock(&fsnotify_grp_srcu_struct, idx);
if (group)
return group;
@@ -126,6 +143,7 @@ struct fsnotify_group *fsnotify_obtain_group(unsigned int group_num, __u64 mask,

atomic_set(&group->refcnt, 1);

+ group->priority = priority;
group->group_num = group_num;
group->mask = mask;

@@ -133,7 +151,7 @@ struct fsnotify_group *fsnotify_obtain_group(unsigned int group_num, __u64 mask,
group->private = NULL;

mutex_lock(&fsnotify_grp_mutex);
- tgroup = fsnotify_find_group(group_num, mask, ops);
+ tgroup = fsnotify_find_group(priority, group_num, mask, ops);
/* we raced and something else inserted the same group */
if (tgroup) {
mutex_unlock(&fsnotify_grp_mutex);
diff --git a/include/linux/fsnotify_backend.h b/include/linux/fsnotify_backend.h
index 5264db1..924902e 100644
--- a/include/linux/fsnotify_backend.h
+++ b/include/linux/fsnotify_backend.h
@@ -85,6 +85,8 @@ struct fsnotify_group {

const struct fsnotify_ops *ops; /* how this group handles things */

+ unsigned int priority; /* order this group should receive msgs. low first */
+
void *private; /* private data for implementers (dnotify, inotify, fanotify) */
};

@@ -95,7 +97,7 @@ extern void fsnotify(struct inode *to_tell, __u64 mask, void *data, int data_is)

/* called from fsnotify interfaces, such as fanotify or dnotify */
extern void fsnotify_recalc_global_mask(void);
-extern struct fsnotify_group *fsnotify_obtain_group(unsigned int group_num, __u64 mask, const struct fsnotify_ops *ops);
+extern struct fsnotify_group *fsnotify_obtain_group(unsigned int priority, unsigned int group_num, __u64 mask, const struct fsnotify_ops *ops);
extern void fsnotify_put_group(struct fsnotify_group *group);
extern void fsnotify_get_group(struct fsnotify_group *group);

2008-12-12 21:53:51

by Eric Paris

Subject: [RFC PATCH -v4 04/14] fsnotify: use the new open-exec hook for inotify and dnotify

inotify and dnotify did not get access events when their children were
accessed for shlib or exec purposes. Trigger on those events as well.

Signed-off-by: Eric Paris <[email protected]>
---

include/linux/fsnotify.h | 6 ++++++
1 files changed, 6 insertions(+), 0 deletions(-)

diff --git a/include/linux/fsnotify.h b/include/linux/fsnotify.h
index ffe787f..6fbf455 100644
--- a/include/linux/fsnotify.h
+++ b/include/linux/fsnotify.h
@@ -172,6 +172,12 @@ static inline void fsnotify_modify(struct file *file)
*/
static inline void fsnotify_open_exec(struct file *file)
{
+ struct dentry *dentry = file->f_path.dentry;
+ struct inode *inode = dentry->d_inode;
+
+ dnotify_parent(dentry, DN_ACCESS);
+ inotify_dentry_parent_queue_event(dentry, IN_ACCESS, 0, dentry->d_name.name);
+ inotify_inode_queue_event(inode, IN_ACCESS, 0, NULL, NULL);
}

/*

2008-12-12 21:54:15

by Eric Paris

Subject: [RFC PATCH -v4 08/14] fsnotify: parent event notification

inotify and dnotify both use a similar parent notification mechanism. We
add a generic parent notification mechanism to fsnotify for both of these
to use. This new mechanism also gives dnotify the dentry flag optimization
that already exists for inotify.
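
A user-space sketch of the two pieces involved, under stated assumptions: the constants are copied from the header hunks in this series, while `parent_is_watched()`, `to_parent_mask()` and `parent_mask_demo()` are hypothetical helper names used only for illustration. The first helper is the cheap per-dentry flag test, the second is the shift that turns a child's event into the matching _CHILD event for the parent.

```c
#include <assert.h>
#include <stdint.h>

/* Values copied from the fsnotify_backend.h and dcache.h hunks in this series. */
#define FS_MODIFY			0x0000000000000002ull
#define FS_DELETE_SELF			0x0000000000000400ull
#define FS_IN_ISDIR			0x0000000040000000ull
#define FS_MODIFY_CHILD			0x0000000200000000ull
#define FS_EVENTS_WITH_CHILD		0x00000000000003ffull	/* FS_ACCESS through FS_DELETE */
#define DCACHE_FSNOTIFY_PARENT_WATCHED	0x0040

/* Cheap per-dentry test: no parent work is needed unless this bit is set. */
static int parent_is_watched(unsigned int d_flags)
{
	return (d_flags & DCACHE_FSNOTIFY_PARENT_WATCHED) != 0;
}

/* Translate a child's event mask into the mask delivered to its parent. */
static uint64_t to_parent_mask(uint64_t orig_mask)
{
	uint64_t mask = (orig_mask & FS_EVENTS_WITH_CHILD) << 32;

	/* remember whether the child was a directory */
	mask |= orig_mask & FS_IN_ISDIR;
	return mask;
}

/* Small self-check of the conversion and of the flag test. */
static void parent_mask_demo(void)
{
	assert(to_parent_mask(FS_MODIFY) == FS_MODIFY_CHILD);
	assert(to_parent_mask(FS_DELETE_SELF) == 0);	/* self-only event */
	assert(!parent_is_watched(0x0008));		/* only DCACHE_REFERENCED set */
}
```

Events outside FS_EVENTS_WITH_CHILD, such as FS_DELETE_SELF, translate to an empty mask, so the parent is never told about them.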

Signed-off-by: Eric Paris <[email protected]>
---

fs/notify/inode_mark.c | 2 +
include/linux/dcache.h | 3 +
include/linux/fsnotify.h | 103 +++++++++++++++++++++++++++++++++++++-
include/linux/fsnotify_backend.h | 5 ++
4 files changed, 109 insertions(+), 4 deletions(-)

diff --git a/fs/notify/inode_mark.c b/fs/notify/inode_mark.c
index c68adff..339a368 100644
--- a/fs/notify/inode_mark.c
+++ b/fs/notify/inode_mark.c
@@ -255,6 +255,8 @@ void fsnotify_recalc_inode_mask(struct inode *inode)
}
inode->i_fsnotify_mask = new_mask;
spin_unlock(&inode->i_lock);
+
+ fsnotify_update_dentry_child_flags(inode);
}


diff --git a/include/linux/dcache.h b/include/linux/dcache.h
index a37359d..2f7f646 100644
--- a/include/linux/dcache.h
+++ b/include/linux/dcache.h
@@ -175,7 +175,8 @@ d_iput: no no no yes
#define DCACHE_REFERENCED 0x0008 /* Recently used, don't discard. */
#define DCACHE_UNHASHED 0x0010

-#define DCACHE_INOTIFY_PARENT_WATCHED 0x0020 /* Parent inode is watched */
+#define DCACHE_INOTIFY_PARENT_WATCHED 0x0020 /* Parent inode is watched by inotify */
+#define DCACHE_FSNOTIFY_PARENT_WATCHED 0x0040 /* Parent inode is watched by some fsnotify listener */

extern spinlock_t dcache_lock;
extern seqlock_t rename_lock;
diff --git a/include/linux/fsnotify.h b/include/linux/fsnotify.h
index c2ed916..971ada9 100644
--- a/include/linux/fsnotify.h
+++ b/include/linux/fsnotify.h
@@ -16,14 +16,93 @@
#include <linux/fsnotify_backend.h>
#include <linux/audit.h>

+static inline int fsnotify_inode_watches_children(struct inode *inode)
+{
+ if (inode->i_fsnotify_mask & (FS_EVENTS_WITH_CHILD << 32))
+ return 1;
+ return 0;
+}
+
+/*
+ * Get child dentry flag into sync with parent inode.
+ * Flag should always be clear for negative dentries.
+ */
+static inline void fsnotify_update_dentry_child_flags(struct inode *inode)
+{
+ struct dentry *alias;
+ int watched = fsnotify_inode_watches_children(inode);
+
+ spin_lock(&dcache_lock);
+ list_for_each_entry(alias, &inode->i_dentry, d_alias) {
+ struct dentry *child;
+
+ list_for_each_entry(child, &alias->d_subdirs, d_u.d_child) {
+ if (!child->d_inode)
+ continue;
+
+ spin_lock(&child->d_lock);
+ if (watched)
+ child->d_flags |= DCACHE_FSNOTIFY_PARENT_WATCHED;
+ else
+ child->d_flags &= ~DCACHE_FSNOTIFY_PARENT_WATCHED;
+ spin_unlock(&child->d_lock);
+ }
+ }
+ spin_unlock(&dcache_lock);
+}
+
/*
* fsnotify_d_instantiate - instantiate a dentry for inode
* Called with dcache_lock held.
*/
-static inline void fsnotify_d_instantiate(struct dentry *entry,
- struct inode *inode)
+static inline void fsnotify_d_instantiate(struct dentry *dentry, struct inode *inode)
+{
+ struct dentry *parent;
+ struct inode *p_inode;
+
+ if (!inode)
+ return;
+
+ spin_lock(&dentry->d_lock);
+ parent = dentry->d_parent;
+ p_inode = parent->d_inode;
+
+ if (p_inode && fsnotify_inode_watches_children(p_inode))
+ dentry->d_flags |= DCACHE_FSNOTIFY_PARENT_WATCHED;
+ spin_unlock(&dentry->d_lock);
+
+ /* call the legacy inotify shit */
+ inotify_d_instantiate(dentry, inode);
+}
+
+/* Notify this dentry's parent about a child's events. */
+static inline void fsnotify_parent(struct dentry *dentry, __u64 orig_mask)
{
- inotify_d_instantiate(entry, inode);
+ struct dentry *parent;
+ struct inode *p_inode;
+ __u64 mask;
+
+ if (!(dentry->d_flags & DCACHE_FSNOTIFY_PARENT_WATCHED))
+ return;
+
+ /* we are notifying a parent so come up with the new mask which
+ * specifies these are events which came from a child. */
+ mask = (orig_mask & FS_EVENTS_WITH_CHILD) << 32;
+ /* need to remember if the child was a dir */
+ mask |= (orig_mask & FS_IN_ISDIR);
+
+ spin_lock(&dentry->d_lock);
+ parent = dentry->d_parent;
+ p_inode = parent->d_inode;
+
+ if (p_inode && (p_inode->i_fsnotify_mask & mask)) {
+ dget(parent);
+ spin_unlock(&dentry->d_lock);
+ fsnotify(p_inode, mask, dentry->d_inode, FSNOTIFY_EVENT_INODE);
+ dput(parent);
+ } else {
+ spin_unlock(&dentry->d_lock);
+ }
}

/*
@@ -32,6 +111,14 @@ static inline void fsnotify_d_instantiate(struct dentry *entry,
*/
static inline void fsnotify_d_move(struct dentry *entry)
{
+ struct dentry *parent;
+
+ parent = entry->d_parent;
+ if (fsnotify_inode_watches_children(parent->d_inode))
+ entry->d_flags |= DCACHE_FSNOTIFY_PARENT_WATCHED;
+ else
+ entry->d_flags &= ~DCACHE_FSNOTIFY_PARENT_WATCHED;
+
inotify_d_move(entry);
}

@@ -96,6 +183,8 @@ static inline void fsnotify_nameremove(struct dentry *dentry, int isdir)
isdir = IN_ISDIR;
dnotify_parent(dentry, DN_DELETE);
inotify_dentry_parent_queue_event(dentry, IN_DELETE|isdir, 0, dentry->d_name.name);
+
+ fsnotify_parent(dentry, FS_DELETE|isdir);
}

/*
@@ -186,6 +275,7 @@ static inline void fsnotify_access(struct file *file)
inotify_dentry_parent_queue_event(dentry, mask, 0, dentry->d_name.name);
inotify_inode_queue_event(inode, mask, 0, NULL, NULL);

+ fsnotify_parent(dentry, mask);
fsnotify(inode, mask, file, FSNOTIFY_EVENT_FILE);
}

@@ -205,6 +295,7 @@ static inline void fsnotify_modify(struct file *file)
inotify_dentry_parent_queue_event(dentry, mask, 0, dentry->d_name.name);
inotify_inode_queue_event(inode, mask, 0, NULL, NULL);

+ fsnotify_parent(dentry, mask);
fsnotify(inode, mask, file, FSNOTIFY_EVENT_FILE);
}

@@ -220,6 +311,7 @@ static inline void fsnotify_open_exec(struct file *file)
inotify_dentry_parent_queue_event(dentry, IN_ACCESS, 0, dentry->d_name.name);
inotify_inode_queue_event(inode, IN_ACCESS, 0, NULL, NULL);

+ fsnotify_parent(dentry, FS_ACCESS);
fsnotify(inode, FS_ACCESS, file, FSNOTIFY_EVENT_FILE);
}

@@ -238,6 +330,7 @@ static inline void fsnotify_open(struct file *file)
inotify_dentry_parent_queue_event(dentry, mask, 0, dentry->d_name.name);
inotify_inode_queue_event(inode, mask, 0, NULL, NULL);

+ fsnotify_parent(dentry, mask);
fsnotify(inode, mask, file, FSNOTIFY_EVENT_FILE);
}

@@ -258,6 +351,7 @@ static inline void fsnotify_close(struct file *file)
inotify_dentry_parent_queue_event(dentry, mask, 0, name);
inotify_inode_queue_event(inode, mask, 0, NULL, NULL);

+ fsnotify_parent(dentry, mask);
fsnotify(inode, mask, file, FSNOTIFY_EVENT_FILE);
}

@@ -275,6 +369,7 @@ static inline void fsnotify_xattr(struct dentry *dentry)
inotify_dentry_parent_queue_event(dentry, mask, 0, dentry->d_name.name);
inotify_inode_queue_event(inode, mask, 0, NULL, NULL);

+ fsnotify_parent(dentry, mask);
fsnotify(inode, mask, inode, FSNOTIFY_EVENT_INODE);
}

@@ -325,6 +420,8 @@ static inline void fsnotify_change(struct dentry *dentry, unsigned int ia_valid)
inotify_inode_queue_event(inode, in_mask, 0, NULL, NULL);
inotify_dentry_parent_queue_event(dentry, in_mask, 0,
dentry->d_name.name);
+
+ fsnotify_parent(dentry, in_mask);
fsnotify(inode, in_mask, inode, FSNOTIFY_EVENT_INODE);
}
}
diff --git a/include/linux/fsnotify_backend.h b/include/linux/fsnotify_backend.h
index 0482a14..3e4ac24 100644
--- a/include/linux/fsnotify_backend.h
+++ b/include/linux/fsnotify_backend.h
@@ -65,6 +65,11 @@
#define FS_DN_RENAME 0x1000000000000000ull /* file renamed */
#define FS_DN_MULTISHOT 0x2000000000000000ull /* dnotify multishot */

+#define FS_EVENTS_WITH_CHILD (FS_ACCESS | FS_MODIFY | FS_ATTRIB |\
+ FS_CLOSE_WRITE | FS_CLOSE_NOWRITE | FS_OPEN |\
+ FS_MOVED_FROM | FS_MOVED_TO | FS_CREATE |\
+ FS_DELETE)
+
/* when calling fsnotify tell it if the data is a file, dentry, or inode */
#define FSNOTIFY_EVENT_FILE 1
#define FSNOTIFY_EVENT_INODE 2

2008-12-12 21:54:37

by Eric Paris

Subject: [RFC PATCH -v4 07/14] fsnotify: add in inode fsnotify markings

This patch creates in-inode fsnotify markings. dnotify will make use of
in-inode markings to mark which inodes it wishes to send events for. fanotify
will use them to mark which inodes it does not wish to send events for.
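
A rough user-space sketch of the "send events for" case, under stated assumptions: `struct mark`, `struct inode_stub` and the helper names are hypothetical, locking is left out, and only three FS_* values from this series are reproduced. The inode's mask is the OR of the masks of all marks attached to it, and an event is dropped early when it does not intersect that mask. The fanotify "do not send" case is not modelled here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define FS_ACCESS	0x0000000000000001ull
#define FS_MODIFY	0x0000000000000002ull
#define FS_OPEN		0x0000000000000020ull

/* Hypothetical user-space stand-ins for a mark entry and an inode. */
struct mark {
	uint64_t mask;		/* events this mark is interested in */
	struct mark *next;
};

struct inode_stub {
	uint64_t fsnotify_mask;	/* OR of all attached marks */
	struct mark *marks;
};

/* Recompute the inode mask from the marks currently attached to it. */
static void recalc_inode_mask(struct inode_stub *inode)
{
	uint64_t new_mask = 0;
	const struct mark *m;

	for (m = inode->marks; m; m = m->next)
		new_mask |= m->mask;
	inode->fsnotify_mask = new_mask;
}

/* Attach a mark to the inode and refresh the cached mask. */
static void add_mark(struct inode_stub *inode, struct mark *m)
{
	m->next = inode->marks;
	inode->marks = m;
	recalc_inode_mask(inode);
}

/* Early-out test: does any mark on this inode care about the event? */
static int inode_wants_event(const struct inode_stub *inode, uint64_t mask)
{
	return (mask & inode->fsnotify_mask) != 0;
}

/* Small self-check: one FS_MODIFY mark lets FS_MODIFY through, not FS_OPEN. */
static void inode_mark_demo(void)
{
	struct inode_stub inode = { 0, NULL };
	struct mark m = { FS_MODIFY, NULL };

	assert(!inode_wants_event(&inode, FS_MODIFY));
	add_mark(&inode, &m);
	assert(inode_wants_event(&inode, FS_MODIFY));
	assert(!inode_wants_event(&inode, FS_OPEN));
}
```

Keeping the combined mask cached on the inode means the common case, an event nobody on that inode asked for, costs a single AND.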

Signed-off-by: Eric Paris <[email protected]>
---

fs/inode.c | 6 +
fs/notify/Makefile | 2
fs/notify/fsnotify.c | 13 ++
fs/notify/fsnotify.h | 31 ++++
fs/notify/group.c | 24 +++
fs/notify/inode_mark.c | 267 ++++++++++++++++++++++++++++++++++++++
include/linux/fs.h | 5 +
include/linux/fsnotify.h | 9 +
include/linux/fsnotify_backend.h | 22 +++
9 files changed, 378 insertions(+), 1 deletions(-)
create mode 100644 fs/notify/inode_mark.c

diff --git a/fs/inode.c b/fs/inode.c
index 0487ddb..a7f6397 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -21,6 +21,7 @@
#include <linux/cdev.h>
#include <linux/bootmem.h>
#include <linux/inotify.h>
+#include <linux/fsnotify.h>
#include <linux/mount.h>

/*
@@ -183,6 +184,10 @@ static struct inode *alloc_inode(struct super_block *sb)
}
inode->i_private = NULL;
inode->i_mapping = mapping;
+#ifdef CONFIG_FSNOTIFY
+ inode->i_fsnotify_mask = 0;
+ INIT_LIST_HEAD(&inode->i_fsnotify_mark_entries);
+#endif
}
return inode;
}
@@ -191,6 +196,7 @@ void destroy_inode(struct inode *inode)
{
BUG_ON(inode_has_buffers(inode));
security_inode_free(inode);
+ fsnotify_inode_delete(inode);
if (inode->i_sb->s_op->destroy_inode)
inode->i_sb->s_op->destroy_inode(inode);
else
diff --git a/fs/notify/Makefile b/fs/notify/Makefile
index 7cb285a..47b60f3 100644
--- a/fs/notify/Makefile
+++ b/fs/notify/Makefile
@@ -1,4 +1,4 @@
obj-y += dnotify/
obj-y += inotify/

-obj-$(CONFIG_FSNOTIFY) += fsnotify.o notification.o group.o
+obj-$(CONFIG_FSNOTIFY) += fsnotify.o notification.o group.o inode_mark.o
diff --git a/fs/notify/fsnotify.c b/fs/notify/fsnotify.c
index 93a0e8f..61157f2 100644
--- a/fs/notify/fsnotify.c
+++ b/fs/notify/fsnotify.c
@@ -25,6 +25,15 @@
#include <linux/fsnotify_backend.h>
#include "fsnotify.h"

+void __fsnotify_inode_delete(struct inode *inode, int flag)
+{
+ if (likely(list_empty(&fsnotify_groups)))
+ return;
+
+ fsnotify_clear_marks_by_inode(inode, flag);
+}
+EXPORT_SYMBOL_GPL(__fsnotify_inode_delete);
+
void fsnotify(struct inode *to_tell, __u64 mask, void *data, int data_is)
{
struct fsnotify_group *group;
@@ -37,6 +46,8 @@ void fsnotify(struct inode *to_tell, __u64 mask, void *data, int data_is)
if (!(mask & fsnotify_mask))
return;

+ if (!(mask & to_tell->i_fsnotify_mask))
+ return;
/*
* SRCU!! the groups list is very very much read only and the path is
* very hot (assuming something is using fsnotify) Not blocking while
@@ -52,6 +63,8 @@ void fsnotify(struct inode *to_tell, __u64 mask, void *data, int data_is)
idx = srcu_read_lock(&fsnotify_grp_srcu_struct);
list_for_each_entry_rcu(group, &fsnotify_groups, group_list) {
if (mask & group->mask) {
+ if (!group->ops->should_send_event(group, to_tell, mask))
+ continue;
if (!event) {
event = fsnotify_create_event(to_tell, mask, data, data_is);
/* shit, we OOM'd and now we can't tell, lets hope something else blows up */
diff --git a/fs/notify/fsnotify.h b/fs/notify/fsnotify.h
index 15bc151..e6f4f0d 100644
--- a/fs/notify/fsnotify.h
+++ b/fs/notify/fsnotify.h
@@ -40,6 +40,31 @@ struct fsnotify_event {
struct list_head private_data_list;
};

+/*
+ * a mark is simply an entry attached to an in core inode which allows an
+ * fsnotify listener to indicate they are either no longer interested in events
+ * of a type matching mask or only interested in those events.
+ *
+ * these are flushed when an inode is evicted from core and may be flushed
+ * when the inode is modified (as seen by fsnotify_access). Some fsnotify users
+ * (such as dnotify) will flush these when the open fd is closed and not at
+ * inode eviction or modification.
+ */
+struct fsnotify_mark_entry {
+ struct fsnotify_group *group; /* group this mark entry is for */
+ __u64 mask; /* mask this mark entry is for */
+ struct inode *inode; /* inode this entry is associated with */
+ void *private; /* private data for the listener */
+ spinlock_t lock; /* protect group, inode, and freeme */
+ atomic_t refcnt; /* active things looking at this mark */
+ int freeme; /* free when this is set and refcnt hits 0 */
+ struct list_head i_list; /* list of mark_entries by inode->i_fsnotify_mark_entries */
+ struct list_head g_list; /* list of mark_entries by group->mark_entries */
+ struct list_head free_i_list; /* tmp list used when freeing this mark */
+ struct list_head free_g_list; /* tmp list used when freeing this mark */
+ void (*free_private)(struct fsnotify_mark_entry *entry); /* called on final put+free */
+};
+
extern struct srcu_struct fsnotify_grp_srcu_struct;
extern struct list_head fsnotify_groups;
extern __u64 fsnotify_mask;
@@ -48,4 +73,10 @@ extern void fsnotify_get_event(struct fsnotify_event *event);
extern void fsnotify_put_event(struct fsnotify_event *event);
extern struct fsnotify_event_private_data *fsnotify_get_priv_from_event(struct fsnotify_group *group, struct fsnotify_event *event);
extern struct fsnotify_event *fsnotify_create_event(struct inode *to_tell, __u64 mask, void *data, int data_is);
+
+extern void fsnotify_clear_marks_by_group(struct fsnotify_group *group);
+extern void fsnotify_clear_marks_by_inode(struct inode *inode, unsigned int flags);
+extern void fsnotify_destroy_mark_by_entry(struct fsnotify_mark_entry *entry);
+extern void fsnotify_get_mark(struct fsnotify_mark_entry *entry);
+extern void fsnotify_put_mark(struct fsnotify_mark_entry *entry);
#endif /* _LINUX_FSNOTIFY_PRIVATE_H */
diff --git a/fs/notify/group.c b/fs/notify/group.c
index 0dd6e82..1ed97fe 100644
--- a/fs/notify/group.c
+++ b/fs/notify/group.c
@@ -47,6 +47,24 @@ void fsnotify_recalc_global_mask(void)
fsnotify_mask = mask;
}

+void fsnotify_recalc_group_mask(struct fsnotify_group *group)
+{
+ __u64 mask = 0;
+ __u64 old_mask = group->mask;
+ struct fsnotify_mark_entry *entry;
+
+ spin_lock(&group->mark_lock);
+ list_for_each_entry(entry, &group->mark_entries, g_list) {
+ mask |= entry->mask;
+ }
+ spin_unlock(&group->mark_lock);
+
+ group->mask = mask;
+
+ if (old_mask != mask)
+ fsnotify_recalc_global_mask();
+}
+
static void fsnotify_add_group(struct fsnotify_group *group)
{
int priority = group->priority;
@@ -73,6 +91,9 @@ void fsnotify_get_group(struct fsnotify_group *group)

static void fsnotify_destroy_group(struct fsnotify_group *group)
{
+ /* clear all inode mark entries for this group */
+ fsnotify_clear_marks_by_group(group);
+
if (group->ops->free_group_priv)
group->ops->free_group_priv(group);

@@ -147,6 +168,9 @@ struct fsnotify_group *fsnotify_obtain_group(unsigned int priority, unsigned int
group->group_num = group_num;
group->mask = mask;

+ spin_lock_init(&group->mark_lock);
+ INIT_LIST_HEAD(&group->mark_entries);
+
group->ops = ops;
group->private = NULL;

diff --git a/fs/notify/inode_mark.c b/fs/notify/inode_mark.c
new file mode 100644
index 0000000..c68adff
--- /dev/null
+++ b/fs/notify/inode_mark.c
@@ -0,0 +1,267 @@
+/*
+ * Copyright (C) 2008 Red Hat, Inc., Eric Paris <[email protected]>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2, or (at your option)
+ * any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; see the file COPYING. If not, write to
+ * the Free Software Foundation, 675 Mass Ave, Cambridge, MA 02139, USA.
+ */
+
+#include <linux/fs.h>
+#include <linux/init.h>
+#include <linux/kernel.h>
+#include <linux/module.h>
+#include <linux/mutex.h>
+#include <linux/slab.h>
+#include <linux/spinlock.h>
+
+#include <asm/atomic.h>
+
+#include <linux/fsnotify_backend.h>
+#include "fsnotify.h"
+
+static struct kmem_cache *fsnotify_mark_kmem_cache;
+
+static void fsnotify_destroy_mark(struct fsnotify_mark_entry *entry)
+{
+ entry->group = NULL;
+ entry->inode = NULL;
+ entry->mask = 0;
+ if (entry->free_private) {
+ entry->free_private(entry);
+ }
+ entry->private = NULL;
+ INIT_LIST_HEAD(&entry->i_list);
+ INIT_LIST_HEAD(&entry->g_list);
+ INIT_LIST_HEAD(&entry->free_i_list);
+ INIT_LIST_HEAD(&entry->free_g_list);
+ kmem_cache_free(fsnotify_mark_kmem_cache, entry);
+}
+
+static struct fsnotify_mark_entry *fsnotify_alloc_mark(void)
+{
+ struct fsnotify_mark_entry *entry;
+
+ entry = kmem_cache_alloc(fsnotify_mark_kmem_cache, GFP_KERNEL);
+
+ return entry;
+}
+
+void fsnotify_get_mark(struct fsnotify_mark_entry *entry)
+{
+ atomic_inc(&entry->refcnt);
+}
+
+void fsnotify_put_mark(struct fsnotify_mark_entry *entry)
+{
+ if (atomic_dec_and_test(&entry->refcnt)) {
+ spin_lock(&entry->lock);
+ /* entries can only be found by the kernel by searching the
+ * inode->i_fsnotify_mark_entries or the group->mark_entries lists.
+ * if freeme is set that means this entry is off both lists.
+ * if refcnt is 0 that means we are the last thing still
+ * looking at this entry, so it's time to free.
+ */
+ if (!atomic_read(&entry->refcnt) && entry->freeme) {
+ spin_unlock(&entry->lock);
+ fsnotify_destroy_mark(entry);
+ return;
+ }
+ spin_unlock(&entry->lock);
+ }
+}
+
+void fsnotify_clear_marks_by_group(struct fsnotify_group *group)
+{
+ struct fsnotify_mark_entry *lentry, *entry;
+ struct inode *inode;
+ LIST_HEAD(free_list);
+
+ spin_lock(&group->mark_lock);
+ list_for_each_entry_safe(entry, lentry, &group->mark_entries, g_list) {
+ list_del_init(&entry->g_list);
+ list_add(&entry->free_g_list, &free_list);
+ }
+ spin_unlock(&group->mark_lock);
+
+ list_for_each_entry_safe(entry, lentry, &free_list, free_g_list) {
+ fsnotify_get_mark(entry);
+ spin_lock(&entry->lock);
+ inode = entry->inode;
+ if (!inode) {
+ entry->group = NULL;
+ spin_unlock(&entry->lock);
+ fsnotify_put_mark(entry);
+ continue;
+ }
+ spin_lock(&inode->i_lock);
+
+ list_del_init(&entry->i_list);
+ entry->inode = NULL;
+ list_del_init(&entry->g_list);
+ entry->group = NULL;
+ entry->freeme = 1;
+
+ spin_unlock(&inode->i_lock);
+ spin_unlock(&entry->lock);
+
+ fsnotify_put_mark(entry);
+ }
+}
+
+void fsnotify_destroy_mark_by_entry(struct fsnotify_mark_entry *entry)
+{
+ struct fsnotify_group *group;
+ struct inode *inode;
+
+ fsnotify_get_mark(entry);
+
+ spin_lock(&entry->lock);
+
+ group = entry->group;
+ if (group)
+ spin_lock(&group->mark_lock);
+
+ inode = entry->inode;
+ if (inode)
+ spin_lock(&inode->i_lock);
+
+ list_del_init(&entry->i_list);
+ entry->inode = NULL;
+ list_del_init(&entry->g_list);
+ entry->group = NULL;
+ entry->freeme = 1;
+
+ if (inode)
+ spin_unlock(&inode->i_lock);
+ if (group)
+ spin_unlock(&group->mark_lock);
+
+ spin_unlock(&entry->lock);
+
+ fsnotify_put_mark(entry);
+}
+
+void fsnotify_clear_marks_by_inode(struct inode *inode, unsigned int flags)
+{
+ struct fsnotify_mark_entry *lentry, *entry;
+ LIST_HEAD(free_list);
+
+ spin_lock(&inode->i_lock);
+ list_for_each_entry_safe(entry, lentry, &inode->i_fsnotify_mark_entries, i_list) {
+ list_del_init(&entry->i_list);
+ list_add(&entry->free_i_list, &free_list);
+ }
+ spin_unlock(&inode->i_lock);
+
+ /*
+ * at this point destroy_by_* might race.
+ *
+ * we used list_del_init() so it can be list_del_init'd again, no harm.
+ * we were called from an inode function so we know that other users can
+ * try to grab entry->inode->i_lock without a problem.
+ */
+ list_for_each_entry_safe(entry, lentry, &free_list, free_i_list) {
+ fsnotify_get_mark(entry);
+ entry->group->ops->mark_clear_inode(entry, inode, flags);
+ fsnotify_put_mark(entry);
+ }
+}
+
+/* caller must hold inode->i_lock */
+struct fsnotify_mark_entry *fsnotify_find_mark_entry(struct fsnotify_group *group, struct inode *inode)
+{
+ struct fsnotify_mark_entry *entry;
+
+ list_for_each_entry(entry, &inode->i_fsnotify_mark_entries, i_list) {
+ if (entry->group == group) {
+ fsnotify_get_mark(entry);
+ return entry;
+ }
+ }
+ return NULL;
+}
+/*
+ * This is a low use function called when userspace is changing what is being
+ * watched. I don't mind doing the allocation since I'm assuming we will have
+ * more new marks than additions to existing marks...
+ *
+ * The mask is added (we use |=) to the in core inode mark. If you need to
+ * replace the mask rather than | in some new bits, you need to call
+ * fsnotify_destroy_mark_by_entry() first and then call this with all the
+ * right bits in the mask.
+ */
+struct fsnotify_mark_entry *fsnotify_mark_add(struct fsnotify_group *group, struct inode *inode, __u64 mask)
+{
+ /* we initialize entry to shut up the compiler in case we jump to out... */
+ struct fsnotify_mark_entry *entry = NULL, *lentry;
+
+ /* pre allocate an entry so we can hold the lock */
+ entry = fsnotify_alloc_mark();
+ if (!entry)
+ return NULL;
+
+ /*
+ * LOCKING ORDER!!!!
+ * entry->lock
+ * group->mark_lock
+ * inode->i_lock
+ */
+ spin_lock(&group->mark_lock);
+ spin_lock(&inode->i_lock);
+ lentry = fsnotify_find_mark_entry(group, inode);
+ if (lentry) {
+ /* the new entry was never initialized, so skip free_private and just free it */
+ kmem_cache_free(fsnotify_mark_kmem_cache, entry);
+ entry = lentry;
+ entry->mask |= mask;
+ goto out_unlock;
+ }
+
+ spin_lock_init(&entry->lock);
+ atomic_set(&entry->refcnt, 1);
+ entry->group = group;
+ entry->mask = mask;
+ entry->inode = inode;
+ entry->freeme = 0;
+ entry->private = NULL;
+ entry->free_private = group->ops->free_mark_priv;
+
+ list_add(&entry->i_list, &inode->i_fsnotify_mark_entries);
+ list_add(&entry->g_list, &group->mark_entries);
+
+out_unlock:
+ spin_unlock(&inode->i_lock);
+ spin_unlock(&group->mark_lock);
+ return entry;
+}
+
+void fsnotify_recalc_inode_mask(struct inode *inode)
+{
+ __u64 new_mask = 0;
+ struct fsnotify_mark_entry *entry;
+
+ spin_lock(&inode->i_lock);
+ list_for_each_entry(entry, &inode->i_fsnotify_mark_entries, i_list) {
+ new_mask |= entry->mask;
+ }
+ inode->i_fsnotify_mask = new_mask;
+ spin_unlock(&inode->i_lock);
+}
+
+
+__init int fsnotify_mark_init(void)
+{
+ fsnotify_mark_kmem_cache = kmem_cache_create("fsnotify_mark_entry", sizeof(struct fsnotify_mark_entry), 0, SLAB_PANIC, NULL);
+
+ return 0;
+}
+subsys_initcall(fsnotify_mark_init);
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 4a853ef..b5a7bce 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -665,6 +665,11 @@ struct inode {

__u32 i_generation;

+#ifdef CONFIG_FSNOTIFY
+ __u64 i_fsnotify_mask; /* all events this inode cares about */
+ struct list_head i_fsnotify_mark_entries; /* fsnotify mark entries */
+#endif
+
#ifdef CONFIG_DNOTIFY
unsigned long i_dnotify_mask; /* Directory notify events */
struct dnotify_struct *i_dnotify; /* for directory notifications */
diff --git a/include/linux/fsnotify.h b/include/linux/fsnotify.h
index b084b98..c2ed916 100644
--- a/include/linux/fsnotify.h
+++ b/include/linux/fsnotify.h
@@ -99,6 +99,14 @@ static inline void fsnotify_nameremove(struct dentry *dentry, int isdir)
}

/*
+ * fsnotify_inode_delete - an inode is being evicted from cache, clean up is needed
+ */
+static inline void fsnotify_inode_delete(struct inode *inode)
+{
+ __fsnotify_inode_delete(inode, FSNOTIFY_INODE_DESTROY);
+}
+
+/*
* fsnotify_inoderemove - an inode is going away
*/
static inline void fsnotify_inoderemove(struct inode *inode)
@@ -107,6 +115,7 @@ static inline void fsnotify_inoderemove(struct inode *inode)
inotify_inode_is_dead(inode);

fsnotify(inode, FS_DELETE_SELF, inode, FSNOTIFY_EVENT_INODE);
+ __fsnotify_inode_delete(inode, FSNOTIFY_LAST_DENTRY);
}

/*
diff --git a/include/linux/fsnotify_backend.h b/include/linux/fsnotify_backend.h
index 924902e..0482a14 100644
--- a/include/linux/fsnotify_backend.h
+++ b/include/linux/fsnotify_backend.h
@@ -13,6 +13,7 @@
#include <linux/fs.h>
#include <linux/list.h>
#include <linux/mutex.h>
+#include <linux/spinlock.h>
#include <linux/wait.h>

#include <asm/atomic.h>
@@ -68,13 +69,21 @@
#define FSNOTIFY_EVENT_FILE 1
#define FSNOTIFY_EVENT_INODE 2

+/* these tell __fsnotify_inode_delete what kind of event this is */
+#define FSNOTIFY_LAST_DENTRY 1
+#define FSNOTIFY_INODE_DESTROY 2
+
struct fsnotify_group;
struct fsnotify_event;
+struct fsnotify_mark_entry;

struct fsnotify_ops {
int (*event_to_notif)(struct fsnotify_group *group, struct fsnotify_event *event);
+ void (*mark_clear_inode)(struct fsnotify_mark_entry *entry, struct inode *inode, unsigned int flags);
+ int (*should_send_event)(struct fsnotify_group *group, struct inode *inode, __u64 mask);
void (*free_group_priv)(struct fsnotify_group *group);
void (*free_event_priv)(struct fsnotify_group *group, struct fsnotify_event *event);
+ void (*free_mark_priv)(struct fsnotify_mark_entry *entry);
};

struct fsnotify_group {
@@ -85,6 +94,10 @@ struct fsnotify_group {

const struct fsnotify_ops *ops; /* how this group handles things */

+ /* stores all fastpath entries associated with this group so they can be cleaned on unregister */
+ spinlock_t mark_lock; /* protect mark_entries list */
+ struct list_head mark_entries; /* all inode mark entries for this group */
+
unsigned int priority; /* order this group should receive msgs. low first */

void *private; /* private data for implementers (dnotify, inotify, fanotify) */
@@ -94,17 +107,26 @@ struct fsnotify_group {

/* called from the vfs to signal fs events */
extern void fsnotify(struct inode *to_tell, __u64 mask, void *data, int data_is);
+extern void __fsnotify_inode_delete(struct inode *inode, int flag);

/* called from fsnotify interfaces, such as fanotify or dnotify */
extern void fsnotify_recalc_global_mask(void);
+extern void fsnotify_recalc_group_mask(struct fsnotify_group *group);
extern struct fsnotify_group *fsnotify_obtain_group(unsigned int priority, unsigned int group_num, __u64 mask, const struct fsnotify_ops *ops);
extern void fsnotify_put_group(struct fsnotify_group *group);
extern void fsnotify_get_group(struct fsnotify_group *group);

+extern void fsnotify_recalc_inode_mask(struct inode *inode);
+extern struct fsnotify_mark_entry *fsnotify_find_mark_entry(struct fsnotify_group *group, struct inode *inode);
+extern struct fsnotify_mark_entry *fsnotify_mark_add(struct fsnotify_group *group, struct inode *inode, __u64 mask);
#else

static inline void fsnotify(struct inode *to_tell, __u64 mask, void *data, int data_is);
{}
+
+static inline void __fsnotify_inode_delete(struct inode *inode, int flag)
+{}
+
#endif /* CONFIG_FSNOTIFY */

#endif /* __KERNEL __ */
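
The lifetime rule that fsnotify_put_mark() implements in the patch above (a
mark is freed only once it is off both lists, i.e. freeme is set, and the last
reference is dropped) can be modelled in userspace. This is only a sketch
under stated assumptions: struct mark, mark_alloc(), mark_get(), mark_put()
and mark_destroy() are hypothetical names, a plain int stands in for atomic_t,
there is no locking, and freed_flag exists purely so the behaviour can be
observed.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical userspace stand-in for struct fsnotify_mark_entry. */
struct mark {
	int refcnt;        /* plain int here; the patch uses atomic_t */
	bool freeme;       /* set once the mark is off both lists */
	bool *freed_flag;  /* observation hook, not part of the patch */
};

/* Like fsnotify_mark_add() for a new mark: the caller gets the one reference. */
static struct mark *mark_alloc(bool *freed_flag)
{
	struct mark *m = malloc(sizeof(*m));

	if (!m)
		return NULL;
	m->refcnt = 1;
	m->freeme = false;
	m->freed_flag = freed_flag;
	*freed_flag = false;
	return m;
}

static void mark_get(struct mark *m)
{
	m->refcnt++;
}

/* Drop a reference; free only if this was the last one AND freeme is set. */
static void mark_put(struct mark *m)
{
	assert(m->refcnt > 0);
	if (--m->refcnt == 0 && m->freeme) {
		*m->freed_flag = true;
		free(m);
	}
}

/* Models fsnotify_destroy_mark_by_entry(): pin, unlink (freeme), unpin. */
static void mark_destroy(struct mark *m)
{
	mark_get(m);
	m->freeme = true;
	mark_put(m);
}
```

Note that a mark sitting on the lists with refcnt == 0 is the normal resting
state in this model, as it appears to be in the patch: the lists do not hold a
reference of their own, and freeme is what distinguishes "idle" from "dead".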

2008-12-12 21:54:55

by Eric Paris

Subject: [RFC PATCH -v4 09/14] dnotify: reimplement dnotify using fsnotify

Reimplement dnotify using fsnotify.

Signed-off-by: Eric Paris <[email protected]>
---

fs/fcntl.c | 4 +
fs/notify/dnotify/Kconfig | 1
fs/notify/dnotify/dnotify.c | 322 ++++++++++++++++++++++++++++++++-----------
include/linux/dnotify.h | 29 +---
include/linux/fs.h | 5 -
include/linux/fsnotify.h | 71 +++------
6 files changed, 271 insertions(+), 161 deletions(-)

diff --git a/fs/fcntl.c b/fs/fcntl.c
index 549daf8..dcc733e 100644
--- a/fs/fcntl.c
+++ b/fs/fcntl.c
@@ -327,6 +327,10 @@ static long do_fcntl(int fd, unsigned int cmd, unsigned long arg,
break;
case F_NOTIFY:
err = fcntl_dirnotify(fd, filp, arg);
+ if (err)
+ break;
+ if (filp->f_op && filp->f_op->dir_notify)
+ err = filp->f_op->dir_notify(filp, arg);
break;
default:
break;
diff --git a/fs/notify/dnotify/Kconfig b/fs/notify/dnotify/Kconfig
index 26adf5d..904ff8d 100644
--- a/fs/notify/dnotify/Kconfig
+++ b/fs/notify/dnotify/Kconfig
@@ -1,5 +1,6 @@
config DNOTIFY
bool "Dnotify support"
+ depends on FSNOTIFY
default y
help
Dnotify is a directory-based per-fd file change notification system
diff --git a/fs/notify/dnotify/dnotify.c b/fs/notify/dnotify/dnotify.c
index 676073b..dae1bd6 100644
--- a/fs/notify/dnotify/dnotify.c
+++ b/fs/notify/dnotify/dnotify.c
@@ -21,24 +21,147 @@
#include <linux/spinlock.h>
#include <linux/slab.h>
#include <linux/fdtable.h>
+#include <linux/fsnotify_backend.h>
+
+#include "../fsnotify.h"

int dir_notify_enable __read_mostly = 1;

static struct kmem_cache *dn_cache __read_mostly;

-static void redo_inode_mask(struct inode *inode)
+static int inode_dir_notify(struct fsnotify_group *group, struct fsnotify_event *event);
+static void clear_mark_dir_notify(struct fsnotify_mark_entry *entry, struct inode *inode, unsigned int flags);
+static int should_send_event_dir_notify(struct fsnotify_group *group, struct inode *inode, __u64 mask);
+
+static struct fsnotify_ops dnotify_fsnotify_ops = {
+ .event_to_notif = inode_dir_notify,
+ .mark_clear_inode = clear_mark_dir_notify,
+ .should_send_event = should_send_event_dir_notify,
+ .free_group_priv = NULL,
+ .free_event_priv = NULL,
+ .free_mark_priv = NULL,
+};
+
+/* this horribleness only works because dnotify only ever has 1 group */
+static inline struct fsnotify_mark_entry *dnotify_get_mark(struct inode *inode)
+{
+ struct fsnotify_mark_entry *lentry;
+ struct fsnotify_mark_entry *entry = NULL;
+
+ spin_lock(&inode->i_lock);
+ list_for_each_entry(lentry, &inode->i_fsnotify_mark_entries, i_list) {
+ if (lentry->group->ops == &dnotify_fsnotify_ops) {
+ fsnotify_get_mark(lentry);
+ entry = lentry;
+ break;
+ }
+ }
+ spin_unlock(&inode->i_lock);
+ return entry;
+}
+
+/* holding the entry->lock to protect private data. */
+static void dnotify_recalc_inode_mask(struct fsnotify_mark_entry *entry)
{
- unsigned long new_mask;
+ __u64 new_mask;
struct dnotify_struct *dn;

new_mask = 0;
- for (dn = inode->i_dnotify; dn != NULL; dn = dn->dn_next)
- new_mask |= dn->dn_mask & ~DN_MULTISHOT;
- inode->i_dnotify_mask = new_mask;
+ dn = (struct dnotify_struct *)entry->private;
+ for (; dn != NULL; dn = dn->dn_next)
+ new_mask |= (dn->dn_mask & ~FS_DN_MULTISHOT);
+ entry->mask = new_mask;
+
+ if (entry->inode)
+ fsnotify_recalc_inode_mask(entry->inode);
+}
+
+static int inode_dir_notify(struct fsnotify_group *group, struct fsnotify_event *event)
+{
+ struct fsnotify_mark_entry *entry = NULL;
+ struct inode *to_tell;
+ struct dnotify_struct *dn;
+ struct dnotify_struct **prev;
+ struct fown_struct *fown;
+ int changed = 0;
+
+ to_tell = event->to_tell;
+
+ spin_lock(&to_tell->i_lock);
+ entry = fsnotify_find_mark_entry(group, to_tell);
+ spin_unlock(&to_tell->i_lock);
+
+ /* unlikely since we already passed should_send_event_dir_notify() */
+ if (unlikely(!entry))
+ return 0;
+
+ spin_lock(&entry->lock);
+ prev = (struct dnotify_struct **)&entry->private;
+ while ((dn = *prev) != NULL) {
+ if ((dn->dn_mask & event->mask) == 0) {
+ prev = &dn->dn_next;
+ continue;
+ }
+ fown = &dn->dn_filp->f_owner;
+ send_sigio(fown, dn->dn_fd, POLL_MSG);
+ if (dn->dn_mask & FS_DN_MULTISHOT)
+ prev = &dn->dn_next;
+ else {
+ *prev = dn->dn_next;
+ changed = 1;
+ kmem_cache_free(dn_cache, dn);
+ }
+ }
+ if (changed)
+ dnotify_recalc_inode_mask(entry);
+
+ spin_unlock(&entry->lock);
+ fsnotify_put_mark(entry);
+
+ return 0;
+}
+
+static void clear_mark_dir_notify(struct fsnotify_mark_entry *entry, struct inode *inode, unsigned int flags)
+{
+ /* if we got here when this inode just closed its last dentry or when
+ * the inode is being kicked out of core we screwed up since it should
+ * have already been flushed in dnotify_flush() */
+ BUG();
+}
+
+static int should_send_event_dir_notify(struct fsnotify_group *group, struct inode *inode, __u64 mask)
+{
+ struct fsnotify_mark_entry *entry;
+ int send;
+
+ /* !dir_notify_enable should never get here, don't waste time checking
+ if (!dir_notify_enable)
+ return 0; */
+
+ /* not a dir, dnotify doesn't care */
+ if (!S_ISDIR(inode->i_mode))
+ return 0;
+
+ spin_lock(&inode->i_lock);
+ entry = fsnotify_find_mark_entry(group, inode);
+ spin_unlock(&inode->i_lock);
+
+ /* no mark means no dnotify watch */
+ if (!entry)
+ return 0;
+
+ spin_lock(&entry->lock);
+ send = !!(mask & entry->mask);
+ spin_unlock(&entry->lock);
+ fsnotify_put_mark(entry);
+
+ return send;
}

void dnotify_flush(struct file *filp, fl_owner_t id)
{
+ struct fsnotify_group *dnotify_group = NULL;
+ struct fsnotify_mark_entry *entry;
struct dnotify_struct *dn;
struct dnotify_struct **prev;
struct inode *inode;
@@ -46,22 +169,66 @@ void dnotify_flush(struct file *filp, fl_owner_t id)
inode = filp->f_path.dentry->d_inode;
if (!S_ISDIR(inode->i_mode))
return;
- spin_lock(&inode->i_lock);
- prev = &inode->i_dnotify;
+
+ entry = dnotify_get_mark(inode);
+ if (!entry)
+ return;
+
+ spin_lock(&entry->lock);
+ prev = (struct dnotify_struct **)&entry->private;
while ((dn = *prev) != NULL) {
if ((dn->dn_owner == id) && (dn->dn_filp == filp)) {
*prev = dn->dn_next;
- redo_inode_mask(inode);
+ dnotify_recalc_inode_mask(entry);
kmem_cache_free(dn_cache, dn);
break;
}
prev = &dn->dn_next;
}
- spin_unlock(&inode->i_lock);
+
+ /* last dnotify watch on this inode is gone */
+ if (entry->private == NULL)
+ dnotify_group = entry->group;
+
+ spin_unlock(&entry->lock);
+
+ if (dnotify_group) {
+ fsnotify_destroy_mark_by_entry(entry);
+ fsnotify_put_group(dnotify_group);
+ }
+
+ fsnotify_put_mark(entry);
+}
+
+/* this conversion is done only at watch creation */
+static inline __u64 convert_arg(unsigned long arg)
+{
+ __u64 new_mask = 0;
+
+ if (arg & DN_MULTISHOT)
+ new_mask |= FS_DN_MULTISHOT;
+ if (arg & DN_DELETE)
+ new_mask |= (FS_DELETE | FS_MOVED_FROM);
+ if (arg & DN_MODIFY)
+ new_mask |= FS_MODIFY;
+ if (arg & DN_ACCESS)
+ new_mask |= FS_ACCESS;
+ if (arg & DN_ATTRIB)
+ new_mask |= FS_ATTRIB;
+ if (arg & DN_RENAME)
+ new_mask |= FS_DN_RENAME;
+ if (arg & DN_CREATE)
+ new_mask |= (FS_CREATE | FS_MOVED_TO);
+
+ new_mask |= (new_mask & FS_EVENTS_WITH_CHILD) << 32;
+
+ return new_mask;
}

int fcntl_dirnotify(int fd, struct file *filp, unsigned long arg)
{
+ struct fsnotify_group *dnotify_group;
+ struct fsnotify_mark_entry *entry;
struct dnotify_struct *dn;
struct dnotify_struct *odn;
struct dnotify_struct **prev;
@@ -69,27 +236,64 @@ int fcntl_dirnotify(int fd, struct file *filp, unsigned long arg)
fl_owner_t id = current->files;
struct file *f;
int error = 0;
+ __u64 mask;
+
+ if (!dir_notify_enable)
+ return -EINVAL;

if ((arg & ~DN_MULTISHOT) == 0) {
dnotify_flush(filp, id);
return 0;
}
- if (!dir_notify_enable)
- return -EINVAL;
inode = filp->f_path.dentry->d_inode;
if (!S_ISDIR(inode->i_mode))
return -ENOTDIR;
+
+ /* expect most fcntl to add new rather than augment old */
dn = kmem_cache_alloc(dn_cache, GFP_KERNEL);
if (dn == NULL)
return -ENOMEM;
- spin_lock(&inode->i_lock);
- prev = &inode->i_dnotify;
+
+ /* convert the userspace DN_* "arg" to the internal FS_* defines in fsnotify */
+ mask = convert_arg(arg);
+
+ /*
+ * I really don't like using ALL_DNOTIFY_EVENTS. We could probably do
+ * better by setting group->mask to only the events that dnotify watches
+ * actually care about, but removing events would mean walking the entire
+ * group->mark_entries list to recalculate the mask. It also makes it
+ * harder to find the right group, but this is not a fast path, so harder
+ * doesn't mean bad. It may be a future performance win since it could
+ * result in faster fsnotify() processing.
+ */
+ dnotify_group = fsnotify_obtain_group(UINT_MAX, UINT_MAX, ALL_DNOTIFY_EVENTS, &dnotify_fsnotify_ops);
+ /* screw it, i don't care */
+ if (IS_ERR(dnotify_group)) {
+ error = PTR_ERR(dnotify_group);
+ goto out_free;
+ }
+
+ entry = fsnotify_mark_add(dnotify_group, inode, mask);
+ if (!entry) {
+ error = -ENOMEM;
+ goto out_put_group;
+ }
+
+ spin_lock(&entry->lock);
+ /* entry->private == NULL means that this is a new inode mark.
+ * take a group reference for it. */
+ if (entry->private == NULL)
+ fsnotify_get_group(dnotify_group);
+
+ prev = (struct dnotify_struct **)&entry->private;
while ((odn = *prev) != NULL) {
+ /* do we already have a dnotify struct and we are just adding more events? */
if ((odn->dn_owner == id) && (odn->dn_filp == filp)) {
odn->dn_fd = fd;
- odn->dn_mask |= arg;
- inode->i_dnotify_mask |= arg & ~DN_MULTISHOT;
- goto out_free;
+ odn->dn_mask |= mask;
+ /* recalculate the entry->mask and entry->inode->i_fsnotify_mask */
+ dnotify_recalc_inode_mask(entry);
+ goto out_unlock;
}
prev = &odn->dn_next;
}
@@ -98,92 +302,38 @@ int fcntl_dirnotify(int fd, struct file *filp, unsigned long arg)
f = fcheck(fd);
rcu_read_unlock();
/* we'd lost the race with close(), sod off silently */
- /* note that inode->i_lock prevents reordering problems
- * between accesses to descriptor table and ->i_dnotify */
+ /* note that entry->lock prevents reordering problems
+ * between accesses to descriptor table and the private data in the
+ * inode mark, aka the dnotify_flush when the fd was closed */
if (f != filp)
- goto out_free;
+ goto out_unlock;

error = __f_setown(filp, task_pid(current), PIDTYPE_PID, 0);
if (error)
- goto out_free;
+ goto out_unlock;

- dn->dn_mask = arg;
+ dn->dn_mask = mask;
dn->dn_fd = fd;
dn->dn_filp = filp;
dn->dn_owner = id;
- inode->i_dnotify_mask |= arg & ~DN_MULTISHOT;
- dn->dn_next = inode->i_dnotify;
- inode->i_dnotify = dn;
- spin_unlock(&inode->i_lock);
-
- if (filp->f_op && filp->f_op->dir_notify)
- return filp->f_op->dir_notify(filp, arg);
+ dn->dn_next = entry->private;
+ entry->private = dn;
+ dnotify_recalc_inode_mask(entry);
+ spin_unlock(&entry->lock);
+ fsnotify_put_mark(entry);
+ fsnotify_put_group(dnotify_group);
return 0;

+out_unlock:
+ spin_unlock(&entry->lock);
+ fsnotify_put_mark(entry);
+out_put_group:
+ fsnotify_put_group(dnotify_group);
out_free:
- spin_unlock(&inode->i_lock);
kmem_cache_free(dn_cache, dn);
return error;
}

-void __inode_dir_notify(struct inode *inode, unsigned long event)
-{
- struct dnotify_struct * dn;
- struct dnotify_struct **prev;
- struct fown_struct * fown;
- int changed = 0;
-
- spin_lock(&inode->i_lock);
- prev = &inode->i_dnotify;
- while ((dn = *prev) != NULL) {
- if ((dn->dn_mask & event) == 0) {
- prev = &dn->dn_next;
- continue;
- }
- fown = &dn->dn_filp->f_owner;
- send_sigio(fown, dn->dn_fd, POLL_MSG);
- if (dn->dn_mask & DN_MULTISHOT)
- prev = &dn->dn_next;
- else {
- *prev = dn->dn_next;
- changed = 1;
- kmem_cache_free(dn_cache, dn);
- }
- }
- if (changed)
- redo_inode_mask(inode);
- spin_unlock(&inode->i_lock);
-}
-
-EXPORT_SYMBOL(__inode_dir_notify);
-
-/*
- * This is hopelessly wrong, but unfixable without API changes. At
- * least it doesn't oops the kernel...
- *
- * To safely access ->d_parent we need to keep d_move away from it. Use the
- * dentry's d_lock for this.
- */
-void dnotify_parent(struct dentry *dentry, unsigned long event)
-{
- struct dentry *parent;
-
- if (!dir_notify_enable)
- return;
-
- spin_lock(&dentry->d_lock);
- parent = dentry->d_parent;
- if (parent->d_inode->i_dnotify_mask & event) {
- dget(parent);
- spin_unlock(&dentry->d_lock);
- __inode_dir_notify(parent->d_inode, event);
- dput(parent);
- } else {
- spin_unlock(&dentry->d_lock);
- }
-}
-EXPORT_SYMBOL_GPL(dnotify_parent);
-
static int __init dnotify_init(void)
{
dn_cache = kmem_cache_create("dnotify_cache",
diff --git a/include/linux/dnotify.h b/include/linux/dnotify.h
index 102a902..e8c4256 100644
--- a/include/linux/dnotify.h
+++ b/include/linux/dnotify.h
@@ -10,7 +10,7 @@

struct dnotify_struct {
struct dnotify_struct * dn_next;
- unsigned long dn_mask;
+ __u64 dn_mask;
int dn_fd;
struct file * dn_filp;
fl_owner_t dn_owner;
@@ -21,23 +21,18 @@ struct dnotify_struct {

#ifdef CONFIG_DNOTIFY

-extern void __inode_dir_notify(struct inode *, unsigned long);
+#define ALL_DNOTIFY_EVENTS (FS_DELETE | FS_DELETE_CHILD |\
+ FS_MODIFY | FS_MODIFY_CHILD |\
+ FS_ACCESS | FS_ACCESS_CHILD |\
+ FS_ATTRIB | FS_ATTRIB_CHILD |\
+ FS_CREATE | FS_DN_RENAME |\
+ FS_MOVED_FROM | FS_MOVED_TO)
+
extern void dnotify_flush(struct file *, fl_owner_t);
extern int fcntl_dirnotify(int, struct file *, unsigned long);
-extern void dnotify_parent(struct dentry *, unsigned long);
-
-static inline void inode_dir_notify(struct inode *inode, unsigned long event)
-{
- if (inode->i_dnotify_mask & (event))
- __inode_dir_notify(inode, event);
-}

#else

-static inline void __inode_dir_notify(struct inode *inode, unsigned long event)
-{
-}
-
static inline void dnotify_flush(struct file *filp, fl_owner_t id)
{
}
@@ -47,14 +42,6 @@ static inline int fcntl_dirnotify(int fd, struct file *filp, unsigned long arg)
return -EINVAL;
}

-static inline void dnotify_parent(struct dentry *dentry, unsigned long event)
-{
-}
-
-static inline void inode_dir_notify(struct inode *inode, unsigned long event)
-{
-}
-
#endif /* CONFIG_DNOTIFY */

#endif /* __KERNEL __ */
diff --git a/include/linux/fs.h b/include/linux/fs.h
index b5a7bce..f945763 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -670,11 +670,6 @@ struct inode {
struct list_head i_fsnotify_mark_entries; /* fsnotify mark entries */
#endif

-#ifdef CONFIG_DNOTIFY
- unsigned long i_dnotify_mask; /* Directory notify events */
- struct dnotify_struct *i_dnotify; /* for directory notifications */
-#endif
-
#ifdef CONFIG_INOTIFY
struct list_head inotify_watches; /* watches on this inode */
struct mutex inotify_mutex; /* protects the watches list */
diff --git a/include/linux/fsnotify.h b/include/linux/fsnotify.h
index 971ada9..791d288 100644
--- a/include/linux/fsnotify.h
+++ b/include/linux/fsnotify.h
@@ -135,13 +135,7 @@ static inline void fsnotify_move(struct inode *old_dir, struct inode *new_dir,
__u64 new_dir_mask = 0;

if (old_dir == new_dir) {
- inode_dir_notify(old_dir, DN_RENAME);
old_dir_mask = FS_DN_RENAME;
- } else {
- inode_dir_notify(old_dir, DN_DELETE);
- old_dir_mask = FS_DELETE;
- inode_dir_notify(new_dir, DN_CREATE);
- new_dir_mask = FS_CREATE;
}

if (isdir) {
@@ -181,7 +175,6 @@ static inline void fsnotify_nameremove(struct dentry *dentry, int isdir)
{
if (isdir)
isdir = IN_ISDIR;
- dnotify_parent(dentry, DN_DELETE);
inotify_dentry_parent_queue_event(dentry, IN_DELETE|isdir, 0, dentry->d_name.name);

fsnotify_parent(dentry, FS_DELETE|isdir);
@@ -222,7 +215,6 @@ static inline void fsnotify_link_count(struct inode *inode)
*/
static inline void fsnotify_create(struct inode *inode, struct dentry *dentry)
{
- inode_dir_notify(inode, DN_CREATE);
inotify_inode_queue_event(inode, IN_CREATE, 0, dentry->d_name.name,
dentry->d_inode);
audit_inode_child(dentry->d_name.name, dentry, inode);
@@ -237,7 +229,6 @@ static inline void fsnotify_create(struct inode *inode, struct dentry *dentry)
*/
static inline void fsnotify_link(struct inode *dir, struct inode *inode, struct dentry *new_dentry)
{
- inode_dir_notify(dir, DN_CREATE);
inotify_inode_queue_event(dir, IN_CREATE, 0, new_dentry->d_name.name,
inode);
fsnotify_link_count(inode);
@@ -251,7 +242,6 @@ static inline void fsnotify_link(struct inode *dir, struct inode *inode, struct
*/
static inline void fsnotify_mkdir(struct inode *inode, struct dentry *dentry)
{
- inode_dir_notify(inode, DN_CREATE);
inotify_inode_queue_event(inode, IN_CREATE | IN_ISDIR, 0,
dentry->d_name.name, dentry->d_inode);
audit_inode_child(dentry->d_name.name, dentry, inode);
@@ -271,7 +261,6 @@ static inline void fsnotify_access(struct file *file)
if (S_ISDIR(inode->i_mode))
mask |= IN_ISDIR;

- dnotify_parent(dentry, DN_ACCESS);
inotify_dentry_parent_queue_event(dentry, mask, 0, dentry->d_name.name);
inotify_inode_queue_event(inode, mask, 0, NULL, NULL);

@@ -291,7 +280,6 @@ static inline void fsnotify_modify(struct file *file)
if (S_ISDIR(inode->i_mode))
mask |= IN_ISDIR;

- dnotify_parent(dentry, DN_MODIFY);
inotify_dentry_parent_queue_event(dentry, mask, 0, dentry->d_name.name);
inotify_inode_queue_event(inode, mask, 0, NULL, NULL);

@@ -307,7 +295,6 @@ static inline void fsnotify_open_exec(struct file *file)
struct dentry *dentry = file->f_path.dentry;
struct inode *inode = dentry->d_inode;

- dnotify_parent(dentry, DN_ACCESS);
inotify_dentry_parent_queue_event(dentry, IN_ACCESS, 0, dentry->d_name.name);
inotify_inode_queue_event(inode, IN_ACCESS, 0, NULL, NULL);

@@ -380,49 +367,35 @@ static inline void fsnotify_xattr(struct dentry *dentry)
static inline void fsnotify_change(struct dentry *dentry, unsigned int ia_valid)
{
struct inode *inode = dentry->d_inode;
- int dn_mask = 0;
- u32 in_mask = 0;
+ __u64 mask = 0;
+
+ if (ia_valid & ATTR_UID)
+ mask |= IN_ATTRIB;
+ if (ia_valid & ATTR_GID)
+ mask |= IN_ATTRIB;
+ if (ia_valid & ATTR_SIZE)
+ mask |= IN_MODIFY;

- if (ia_valid & ATTR_UID) {
- in_mask |= IN_ATTRIB;
- dn_mask |= DN_ATTRIB;
- }
- if (ia_valid & ATTR_GID) {
- in_mask |= IN_ATTRIB;
- dn_mask |= DN_ATTRIB;
- }
- if (ia_valid & ATTR_SIZE) {
- in_mask |= IN_MODIFY;
- dn_mask |= DN_MODIFY;
- }
/* both times implies a utime(s) call */
if ((ia_valid & (ATTR_ATIME | ATTR_MTIME)) == (ATTR_ATIME | ATTR_MTIME))
- {
- in_mask |= IN_ATTRIB;
- dn_mask |= DN_ATTRIB;
- } else if (ia_valid & ATTR_ATIME) {
- in_mask |= IN_ACCESS;
- dn_mask |= DN_ACCESS;
- } else if (ia_valid & ATTR_MTIME) {
- in_mask |= IN_MODIFY;
- dn_mask |= DN_MODIFY;
- }
- if (ia_valid & ATTR_MODE) {
- in_mask |= IN_ATTRIB;
- dn_mask |= DN_ATTRIB;
- }
+ mask |= IN_ATTRIB;
+ else if (ia_valid & ATTR_ATIME)
+ mask |= IN_ACCESS;
+ else if (ia_valid & ATTR_MTIME)
+ mask |= IN_MODIFY;
+
+ if (ia_valid & ATTR_MODE)
+ mask |= IN_ATTRIB;

- if (dn_mask)
- dnotify_parent(dentry, dn_mask);
- if (in_mask) {
+ if (mask) {
if (S_ISDIR(inode->i_mode))
- in_mask |= IN_ISDIR;
- inotify_inode_queue_event(inode, in_mask, 0, NULL, NULL);
- inotify_dentry_parent_queue_event(dentry, in_mask, 0,
+ mask |= IN_ISDIR;
+ inotify_inode_queue_event(inode, mask, 0, NULL, NULL);
+ inotify_dentry_parent_queue_event(dentry, mask, 0,
dentry->d_name.name);

- fsnotify_parent(dentry, in_mask);
- fsnotify(inode, in_mask, inode, FSNOTIFY_EVENT_INODE);
+ fsnotify_parent(dentry, mask);
+ fsnotify(inode, mask, inode, FSNOTIFY_EVENT_INODE);
}
}
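
For reference, the userspace side of dnotify that this patch has to keep
working looks roughly like the sketch below. fcntl(F_NOTIFY), F_SETSIG and the
DN_* flags are the documented Linux interface; watch_dir() and on_dnotify()
are hypothetical helper names, and routing the notification to SIGRTMIN
instead of the default SIGIO is just one possible choice. The DN_* bits passed
here are what convert_arg() in the patch translates into FS_* bits.

```c
#define _GNU_SOURCE            /* for F_NOTIFY, F_SETSIG and the DN_* flags */
#include <assert.h>
#include <fcntl.h>
#include <signal.h>
#include <string.h>
#include <unistd.h>

/* Counts notifications; only touched from the signal handler. */
static volatile sig_atomic_t dnotify_events;

static void on_dnotify(int sig, siginfo_t *si, void *ctx)
{
	(void)sig;
	(void)si;   /* si->si_fd would identify the watched directory fd */
	(void)ctx;
	dnotify_events++;
}

/*
 * Install a dnotify watch on a directory.
 * Returns the directory fd (keep it open while watching) or -1 on error.
 */
static int watch_dir(const char *path, unsigned long dn_mask)
{
	struct sigaction sa;
	int fd;

	assert(path != NULL);

	memset(&sa, 0, sizeof(sa));
	sa.sa_sigaction = on_dnotify;
	sa.sa_flags = SA_SIGINFO;
	sigemptyset(&sa.sa_mask);
	if (sigaction(SIGRTMIN, &sa, NULL) < 0)
		return -1;

	/* O_DIRECTORY makes open() fail on anything that is not a directory */
	fd = open(path, O_RDONLY | O_DIRECTORY);
	if (fd < 0)
		return -1;

	if (fcntl(fd, F_SETSIG, SIGRTMIN) < 0 ||
	    fcntl(fd, F_NOTIFY, dn_mask) < 0) {
		close(fd);
		return -1;
	}
	return fd;
}
```

A caller would typically pass something like DN_CREATE | DN_MODIFY |
DN_MULTISHOT; without DN_MULTISHOT the watch is one-shot, which is the case
inode_dir_notify() handles by unlinking the dnotify_struct after signalling.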

2008-12-12 21:55:24

by Eric Paris

Subject: [RFC PATCH -v4 10/14] fsnotify: generic notification queue and waitq

inotify needs to do async notification, in which event information is stored
on a queue until the listener is ready to receive it. This patch
implements a generic notification queue for inotify (and later fanotify) to
store events to be sent at a later time.
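
The idea of one refcounted event being queued on several groups through small
per-group holders can be sketched in userspace as follows. This is an
illustrative model only: struct event, struct holder, struct group_queue and
the queue_*/event_* helpers are hypothetical names, there is no locking or
waitqueue, a holder is always allocated (the patch embeds one in the event for
the common single-group case), and refusing to queue when q_len reaches
max_events is an assumption about overflow handling.

```c
#include <assert.h>
#include <stdlib.h>

/* One event, shared by every group that queued it. */
struct event {
	int refcnt;
	unsigned long long mask;
};

/* Small per-group link pointing at the shared event. */
struct holder {
	struct event *event;
	struct holder *next;
};

/* FIFO of holders, standing in for group->notification_list. */
struct group_queue {
	struct holder *head, *tail;
	unsigned int q_len;
	unsigned int max_events;
};

static void queue_init(struct group_queue *q, unsigned int max_events)
{
	q->head = q->tail = NULL;
	q->q_len = 0;
	q->max_events = max_events;
}

static struct event *event_create(unsigned long long mask)
{
	struct event *e = malloc(sizeof(*e));

	if (!e)
		return NULL;
	e->refcnt = 1;          /* the creator's reference */
	e->mask = mask;
	return e;
}

static void event_put(struct event *e)
{
	assert(e->refcnt > 0);
	if (--e->refcnt == 0)
		free(e);
}

/* Queue the event on one group; the queue takes its own reference. */
static int queue_add(struct group_queue *q, struct event *e)
{
	struct holder *h;

	if (q->q_len >= q->max_events)
		return -1;
	h = malloc(sizeof(*h));
	if (!h)
		return -1;
	e->refcnt++;
	h->event = e;
	h->next = NULL;
	if (q->tail)
		q->tail->next = h;
	else
		q->head = h;
	q->tail = h;
	q->q_len++;
	return 0;
}

/* Dequeue the oldest event; the caller inherits the queue's reference. */
static struct event *queue_remove(struct group_queue *q)
{
	struct holder *h = q->head;
	struct event *e;

	if (!h)
		return NULL;
	q->head = h->next;
	if (!q->head)
		q->tail = NULL;
	q->q_len--;
	e = h->event;
	free(h);
	return e;
}
```

The design point is the same one made in the fsnotify.h comment in this patch:
N groups cost one event plus N small holders instead of N full event copies.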

Signed-off-by: Eric Paris <[email protected]>
---

fs/notify/fsnotify.h | 29 +++++++++
fs/notify/group.c | 9 +++
fs/notify/notification.c | 127 ++++++++++++++++++++++++++++++++++++++
include/linux/fsnotify_backend.h | 7 ++
4 files changed, 172 insertions(+), 0 deletions(-)

diff --git a/fs/notify/fsnotify.h b/fs/notify/fsnotify.h
index e6f4f0d..6b2dc15 100644
--- a/fs/notify/fsnotify.h
+++ b/fs/notify/fsnotify.h
@@ -10,6 +10,20 @@
#include <linux/fsnotify.h>

#include <asm/atomic.h>
+/*
+ * A single event can be queued in multiple group->notification_lists.
+ *
+ * each group->notification_list will point to an event_holder which in turn points
+ * to the actual event that needs to be sent to userspace.
+ *
+ * Seemed cheaper to create a refcnt'd event and a small holder for every group
+ * than to create a different event for every group
+ *
+ */
+struct fsnotify_event_holder {
+ struct fsnotify_event *event;
+ struct list_head event_list;
+};

struct fsnotify_event_private_data {
struct fsnotify_group *group;
@@ -23,6 +37,12 @@ struct fsnotify_event_private_data {
* listener this structure is where you need to be adding fields.
*/
struct fsnotify_event {
+ /*
+ * If we create an event we are also going to need to create a holder
+ * to link to a group. So embed one holder in the event. Means only
+ * one allocation for the common case where we only have one group
+ */
+ struct fsnotify_event_holder holder;
spinlock_t lock; /* protection for the associated event_holder and private_list */
struct inode *to_tell;
/*
@@ -72,6 +92,13 @@ extern __u64 fsnotify_mask;
extern void fsnotify_get_event(struct fsnotify_event *event);
extern void fsnotify_put_event(struct fsnotify_event *event);
extern struct fsnotify_event_private_data *fsnotify_get_priv_from_event(struct fsnotify_group *group, struct fsnotify_event *event);
+
+extern void fsnotify_flush_notif(struct fsnotify_group *group);
+extern int fsnotify_add_notif_event(struct fsnotify_group *group, struct fsnotify_event *event, struct fsnotify_event_private_data *priv);
+extern int fsnotify_check_notif_queue(struct fsnotify_group *group);
+extern struct fsnotify_event *fsnotify_peek_notif_event(struct fsnotify_group *group);
+extern struct fsnotify_event *fsnotify_remove_notif_event(struct fsnotify_group *group);
+
extern struct fsnotify_event *fsnotify_create_event(struct inode *to_tell, __u64 mask, void *data, int data_is);

extern void fsnotify_clear_marks_by_group(struct fsnotify_group *group);
@@ -79,4 +106,6 @@ extern void fsnotify_clear_marks_by_inode(struct inode *inode, unsigned int flag
extern void fsnotify_destroy_mark_by_entry(struct fsnotify_mark_entry *entry);
extern void fsnotify_get_mark(struct fsnotify_mark_entry *entry);
extern void fsnotify_put_mark(struct fsnotify_mark_entry *entry);
+ extern struct fsnotify_event_holder *fsnotify_alloc_event_holder(void);
+ extern void fsnotify_destroy_event_holder(struct fsnotify_event_holder *holder);
#endif /* _LINUX_FSNOTIFY_PRIVATE_H */
diff --git a/fs/notify/group.c b/fs/notify/group.c
index 1ed97fe..b041e6d 100644
--- a/fs/notify/group.c
+++ b/fs/notify/group.c
@@ -91,6 +91,9 @@ void fsnotify_get_group(struct fsnotify_group *group)

static void fsnotify_destroy_group(struct fsnotify_group *group)
{
+ /* clear the notification queue of all events */
+ fsnotify_flush_notif(group);
+
/* clear all inode mark entries for this group */
fsnotify_clear_marks_by_group(group);

@@ -168,6 +171,12 @@ struct fsnotify_group *fsnotify_obtain_group(unsigned int priority, unsigned int
group->group_num = group_num;
group->mask = mask;

+ mutex_init(&group->notification_mutex);
+ INIT_LIST_HEAD(&group->notification_list);
+ init_waitqueue_head(&group->notification_waitq);
+ group->q_len = 0;
+ group->max_events = UINT_MAX;
+
spin_lock_init(&group->mark_lock);
INIT_LIST_HEAD(&group->mark_entries);

diff --git a/fs/notify/notification.c b/fs/notify/notification.c
index f008a15..e039a47 100644
--- a/fs/notify/notification.c
+++ b/fs/notify/notification.c
@@ -33,6 +33,14 @@
#include "fsnotify.h"

static struct kmem_cache *event_kmem_cache;
+static struct kmem_cache *event_holder_kmem_cache;
+
+int fsnotify_check_notif_queue(struct fsnotify_group *group)
+{
+ if (!list_empty(&group->notification_list))
+ return 1;
+ return 0;
+}

void fsnotify_get_event(struct fsnotify_event *event)
{
@@ -58,6 +66,16 @@ void fsnotify_put_event(struct fsnotify_event *event)
}
}

+struct fsnotify_event_holder *alloc_event_holder(void)
+{
+ return kmem_cache_alloc(event_holder_kmem_cache, GFP_KERNEL);
+}
+
+void fsnotify_destroy_event_holder(struct fsnotify_event_holder *holder)
+{
+ kmem_cache_free(event_holder_kmem_cache, holder);
+}
+
struct fsnotify_event_private_data *fsnotify_get_priv_from_event(struct fsnotify_group *group, struct fsnotify_event *event)
{
struct fsnotify_event_private_data *lpriv;
@@ -72,6 +90,112 @@ struct fsnotify_event_private_data *fsnotify_get_priv_from_event(struct fsnotify
return priv;
}

+int fsnotify_add_notif_event(struct fsnotify_group *group, struct fsnotify_event *event, struct fsnotify_event_private_data *priv)
+{
+ struct fsnotify_event_holder *holder;
+
+ /*
+ * holder locking
+ *
+ * only this task is going to be adding this event to lists, thus only
+ * this task can add the in event holder to a list.
+ *
+ * other tasks may be removing this event from some other group's
+ * notification_list.
+ *
+ * those other tasks will blank the in event holder list under
+ * the holder spinlock. If we see it blank we know that once we
+ * get that lock the in event holder will be ok for us to (re)use.
+ */
+ if (list_empty(&event->holder.event_list))
+ holder = (struct fsnotify_event_holder *)event;
+ else
+ holder = alloc_event_holder();
+ if (!holder)
+ return -ENOMEM;
+
+ mutex_lock(&group->notification_mutex);
+
+ if (group->q_len + 1 >= group->max_events) {
+ mutex_unlock(&group->notification_mutex);
+ if (holder != &event->holder)
+ fsnotify_destroy_event_holder(holder);
+ return -ENOSPC;
+ }
+ group->q_len++;
+ fsnotify_get_event(event);
+ spin_lock(&event->lock);
+ holder->event = event;
+ list_add_tail(&holder->event_list, &group->notification_list);
+ if (priv)
+ list_add_tail(&priv->event_list, &event->private_data_list);
+ spin_unlock(&event->lock);
+ mutex_unlock(&group->notification_mutex);
+
+ wake_up(&group->notification_waitq);
+
+ return 0;
+}
+
+/*
+ * must be called with group->notification_mutex held and must know event is present.
+ * it is the responsibility of the caller to call put_event() on the returned
+ * structure
+ */
+struct fsnotify_event *fsnotify_remove_notif_event(struct fsnotify_group *group)
+{
+ struct fsnotify_event *event;
+ struct fsnotify_event_holder *holder;
+
+ holder = list_first_entry(&group->notification_list, struct fsnotify_event_holder, event_list);
+
+ event = holder->event;
+
+ spin_lock(&event->lock);
+ holder->event = NULL;
+ list_del_init(&holder->event_list);
+ spin_unlock(&event->lock);
+
+ /* event == holder means we are referenced through the in event holder */
+ if (event != (struct fsnotify_event *)holder)
+ fsnotify_destroy_event_holder(holder);
+
+ group->q_len--;
+
+ return event;
+}
+
+/*
+ * caller must hold group->notification_mutex and must know event is present.
+ * this will not remove the event, that must be done with fsnotify_remove_notif_event()
+ */
+struct fsnotify_event *fsnotify_peek_notif_event(struct fsnotify_group *group)
+{
+ struct fsnotify_event *event;
+ struct fsnotify_event_holder *holder;
+
+ holder = list_first_entry(&group->notification_list, struct fsnotify_event_holder, event_list);
+ event = holder->event;
+
+ return event;
+}
+
+void fsnotify_flush_notif(struct fsnotify_group *group)
+{
+ struct fsnotify_event *event;
+
+ /* do I really need the mutex here? I think the group is now safe to
+ * play with lockless... */
+ mutex_lock(&group->notification_mutex);
+ while (fsnotify_check_notif_queue(group)) {
+ event = fsnotify_remove_notif_event(group);
+ if (group->ops->free_event_priv)
+ group->ops->free_event_priv(group, event);
+ fsnotify_put_event(event);
+ }
+ mutex_unlock(&group->notification_mutex);
+}
+
struct fsnotify_event *fsnotify_create_event(struct inode *to_tell, __u64 mask, void *data, int data_is)
{
struct fsnotify_event *event;
@@ -80,6 +204,8 @@ struct fsnotify_event *fsnotify_create_event(struct inode *to_tell, __u64 mask,
if (!event)
return NULL;

+ event->holder.event = NULL;
+ INIT_LIST_HEAD(&event->holder.event_list);
atomic_set(&event->refcnt, 1);

spin_lock_init(&event->lock);
@@ -116,6 +242,7 @@ struct fsnotify_event *fsnotify_create_event(struct inode *to_tell, __u64 mask,
__init int fsnotify_notification_init(void)
{
event_kmem_cache = kmem_cache_create("fsnotify_event", sizeof(struct fsnotify_event), 0, SLAB_PANIC, NULL);
+ event_holder_kmem_cache = kmem_cache_create("fsnotify_event_holder", sizeof(struct fsnotify_event_holder), 0, SLAB_PANIC, NULL);

return 0;
}
diff --git a/include/linux/fsnotify_backend.h b/include/linux/fsnotify_backend.h
index 3e4ac24..b6d1895 100644
--- a/include/linux/fsnotify_backend.h
+++ b/include/linux/fsnotify_backend.h
@@ -99,6 +99,13 @@ struct fsnotify_group {

const struct fsnotify_ops *ops; /* how this group handles things */

+ /* needed to send notification to userspace */
+ struct mutex notification_mutex;/* protect the notification_list */
+ struct list_head notification_list; /* list of event_holder this group needs to send to userspace */
+ wait_queue_head_t notification_waitq; /* read() on the notification file blocks on this waitq */
+ unsigned int q_len; /* events on the queue */
+ unsigned int max_events; /* maximum events allowed on the list */
+
/* stores all fastapth entries assoc with this group so they can be cleaned on unregister */
spinlock_t mark_lock; /* protect mark_entries list */
struct list_head mark_entries; /* all inode mark entries for this group */
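
The holder scheme described in the comments of the patch above (one holder embedded in every event, a separate holder allocated only when the event is already queued on another group) can be sketched in userspace roughly as follows. This is only an illustration under simplifying assumptions: there is no locking, malloc()/free() stand in for the kmem caches, and every name here (struct holder, queue_add(), queue_remove() and so on) is hypothetical and not part of the patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Minimal circular doubly linked list, modelled on the kernel's list_head. */
struct list_node { struct list_node *prev, *next; };

static void list_init(struct list_node *n) { n->prev = n->next = n; }
static int list_is_empty(const struct list_node *n) { return n->next == n; }

static void list_push_tail(struct list_node *n, struct list_node *head)
{
	n->prev = head->prev;
	n->next = head;
	head->prev->next = n;
	head->prev = n;
}

static void list_unlink(struct list_node *n)
{
	n->prev->next = n->next;
	n->next->prev = n->prev;
	list_init(n);			/* leave the node reusable */
}

struct event;

struct holder {
	struct event *event;
	struct list_node link;		/* entry in exactly one queue */
};

struct event {
	struct holder holder;		/* embedded: covers the one-group case */
	int refcnt;
};

struct queue {
	struct list_node head;
	unsigned int len;
};

#define holder_of(node) \
	((struct holder *)((char *)(node) - offsetof(struct holder, link)))

void queue_init(struct queue *q) { list_init(&q->head); q->len = 0; }

struct event *event_create(void)
{
	struct event *ev = malloc(sizeof(*ev));

	if (!ev)
		return NULL;
	ev->holder.event = NULL;
	list_init(&ev->holder.link);
	ev->refcnt = 1;			/* creator's reference */
	return ev;
}

void event_put(struct event *ev)
{
	if (--ev->refcnt == 0)
		free(ev);
}

/* Returns 0 on success, -1 if a separate holder could not be allocated. */
int queue_add(struct queue *q, struct event *ev)
{
	struct holder *h;

	if (list_is_empty(&ev->holder.link)) {
		h = &ev->holder;	/* common case: no extra allocation */
	} else {
		h = malloc(sizeof(*h));	/* event already sits on another queue */
		if (!h)
			return -1;
	}
	h->event = ev;
	ev->refcnt++;			/* the queue owns one reference */
	list_push_tail(&h->link, &q->head);
	q->len++;
	return 0;
}

/* Queue must be non-empty; the caller must event_put() the result. */
struct event *queue_remove(struct queue *q)
{
	struct holder *h = holder_of(q->head.next);
	struct event *ev = h->event;

	list_unlink(&h->link);
	h->event = NULL;
	if (h != &ev->holder)
		free(h);		/* only separately allocated holders are freed */
	q->len--;
	return ev;
}
```

Queueing one event on two queues costs one extra allocation in this sketch, and queueing it on a single queue costs none, which is the saving the patch comment argues for.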

2008-12-12 21:55:50

by Eric Paris

Subject: [RFC PATCH -v4 11/14] fsnotify: include pathnames with entries when possible

When inotify sends a directory an event about one of its children, it
includes the name of the child file. This patch collects that file name and
makes it available to the notification backends.
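
As a rough userspace illustration of the ownership rule this implies: the event takes its own copy of the name when it is created and releases that copy when the event is destroyed, so the caller's string may change or go away afterwards. The sketch below uses malloc()/free() in place of kstrdup()/kfree(), and struct name_event, name_event_create() and name_event_destroy() are hypothetical names, not functions from the patch.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

struct name_event {
	char *file_name;	/* private copy of the name, or NULL */
	size_t name_len;	/* strlen(file_name), or 0 when there is no name */
};

/* Returns NULL if either the event or the name copy cannot be allocated. */
struct name_event *name_event_create(const char *name)
{
	struct name_event *ev = malloc(sizeof(*ev));

	if (!ev)
		return NULL;
	ev->file_name = NULL;
	ev->name_len = 0;

	if (name) {
		size_t len = strlen(name);

		ev->file_name = malloc(len + 1);
		if (!ev->file_name) {
			free(ev);	/* do not hand back a half-built event */
			return NULL;
		}
		memcpy(ev->file_name, name, len + 1);
		ev->name_len = len;
	}
	return ev;
}

void name_event_destroy(struct name_event *ev)
{
	if (!ev)
		return;
	free(ev->file_name);	/* free(NULL) is a no-op */
	free(ev);
}
```

Callers that have no name to report pass NULL, in the same way that most fsnotify() call sites in the diff below do.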

Signed-off-by: Eric Paris <[email protected]>
---

fs/notify/fsnotify.c | 4 ++--
fs/notify/fsnotify.h | 9 ++++++---
fs/notify/notification.c | 19 ++++++++++++++++++-
include/linux/fsnotify.h | 34 +++++++++++++++++-----------------
include/linux/fsnotify_backend.h | 4 ++--
5 files changed, 45 insertions(+), 25 deletions(-)

diff --git a/fs/notify/fsnotify.c b/fs/notify/fsnotify.c
index 61157f2..c48dcf6 100644
--- a/fs/notify/fsnotify.c
+++ b/fs/notify/fsnotify.c
@@ -34,7 +34,7 @@ void __fsnotify_inode_delete(struct inode *inode, int flag)
}
EXPORT_SYMBOL_GPL(__fsnotify_inode_delete);

-void fsnotify(struct inode *to_tell, __u64 mask, void *data, int data_is)
+void fsnotify(struct inode *to_tell, __u64 mask, void *data, int data_is, const char *file_name)
{
struct fsnotify_group *group;
struct fsnotify_event *event = NULL;
@@ -66,7 +66,7 @@ void fsnotify(struct inode *to_tell, __u64 mask, void *data, int data_is)
if (!group->ops->should_send_event(group, to_tell, mask))
continue;
if (!event) {
- event = fsnotify_create_event(to_tell, mask, data, data_is);
+ event = fsnotify_create_event(to_tell, mask, data, data_is, file_name);
/* shit, we OOM'd and now we can't tell, lets hope something else blows up */
if (!event)
break;
diff --git a/fs/notify/fsnotify.h b/fs/notify/fsnotify.h
index 6b2dc15..89677ee 100644
--- a/fs/notify/fsnotify.h
+++ b/fs/notify/fsnotify.h
@@ -57,6 +57,9 @@ struct fsnotify_event {
__u64 mask; /* the type of access */
atomic_t refcnt; /* how many groups still are using/need to send this event */

+ char *file_name;
+ size_t name_len;
+
struct list_head private_data_list;
};

@@ -99,13 +102,13 @@ extern int fsnotify_check_notif_queue(struct fsnotify_group *group);
extern struct fsnotify_event *fsnotify_peek_notif_event(struct fsnotify_group *group);
extern struct fsnotify_event *fsnotify_remove_notif_event(struct fsnotify_group *group);

-extern struct fsnotify_event *fsnotify_create_event(struct inode *to_tell, __u64 mask, void *data, int data_is);
+extern struct fsnotify_event *fsnotify_create_event(struct inode *to_tell, __u64 mask, void *data, int data_is, const char *name);

extern void fsnotify_clear_marks_by_group(struct fsnotify_group *group);
extern void fsnotify_clear_marks_by_inode(struct inode *inode, unsigned int flags);
extern void fsnotify_destroy_mark_by_entry(struct fsnotify_mark_entry *entry);
extern void fsnotify_get_mark(struct fsnotify_mark_entry *entry);
extern void fsnotify_put_mark(struct fsnotify_mark_entry *entry);
- extern struct fsnotify_event_holder *fsnotify_alloc_event_holder(void);
- extern void fsnotify_destroy_event_holder(struct fsnotify_event_holder *holder);
+extern struct fsnotify_event_holder *fsnotify_alloc_event_holder(void);
+extern void fsnotify_destroy_event_holder(struct fsnotify_event_holder *holder);
#endif /* _LINUX_FSNOTIFY_PRIVATE_H */
diff --git a/fs/notify/notification.c b/fs/notify/notification.c
index e039a47..8ed9d32 100644
--- a/fs/notify/notification.c
+++ b/fs/notify/notification.c
@@ -62,6 +62,7 @@ void fsnotify_put_event(struct fsnotify_event *event)
event->mask = 0;

BUG_ON(!list_empty(&event->private_data_list));
+ kfree(event->file_name);
kmem_cache_free(event_kmem_cache, event);
}
}
@@ -196,7 +197,7 @@ void fsnotify_flush_notif(struct fsnotify_group *group)
mutex_unlock(&group->notification_mutex);
}

-struct fsnotify_event *fsnotify_create_event(struct inode *to_tell, __u64 mask, void *data, int data_is)
+struct fsnotify_event *fsnotify_create_event(struct inode *to_tell, __u64 mask, void *data, int data_is, const char *name)
{
struct fsnotify_event *event;

@@ -204,6 +205,18 @@ struct fsnotify_event *fsnotify_create_event(struct inode *to_tell, __u64 mask,
if (!event)
return NULL;

+ if (name) {
+ event->file_name = kstrdup(name, GFP_KERNEL);
+ if (!event->file_name) {
+ kmem_cache_free(event_kmem_cache, event);
+ return NULL;
+ }
+ event->name_len = strlen(event->file_name);
+ } else {
+ event->file_name = NULL;
+ event->name_len = 0;
+ }
+
event->holder.event = NULL;
INIT_LIST_HEAD(&event->holder.event_list);
atomic_set(&event->refcnt, 1);
@@ -222,6 +235,7 @@ struct fsnotify_event *fsnotify_create_event(struct inode *to_tell, __u64 mask,
switch (data_is) {
case FSNOTIFY_EVENT_FILE: {
struct file *file = data;
+
event->path.dentry = file->f_path.dentry;
event->path.mnt = file->f_path.mnt;
path_get(&event->path);
@@ -231,6 +245,9 @@ struct fsnotify_event *fsnotify_create_event(struct inode *to_tell, __u64 mask,
event->inode = data;
break;
default:
+ event->path.dentry = NULL;
+ event->path.mnt = NULL;
+ event->inode = NULL;
BUG();
};

diff --git a/include/linux/fsnotify.h b/include/linux/fsnotify.h
index 791d288..8f4caf8 100644
--- a/include/linux/fsnotify.h
+++ b/include/linux/fsnotify.h
@@ -98,7 +98,7 @@ static inline void fsnotify_parent(struct dentry *dentry, __u64 orig_mask)
if (p_inode && (p_inode->i_fsnotify_mask & mask)) {
dget(parent);
spin_unlock(&dentry->d_lock);
- fsnotify(p_inode, mask, dentry->d_inode, FSNOTIFY_EVENT_INODE);
+ fsnotify(p_inode, mask, dentry->d_inode, FSNOTIFY_EVENT_INODE, dentry->d_name.name);
dput(parent);
} else {
spin_unlock(&dentry->d_lock);
@@ -152,18 +152,18 @@ static inline void fsnotify_move(struct inode *old_dir, struct inode *new_dir,
inotify_inode_queue_event(new_dir, IN_MOVED_TO|isdir, cookie, new_name,
source);

- fsnotify(old_dir, old_dir_mask, old_dir, FSNOTIFY_EVENT_INODE);
- fsnotify(new_dir, new_dir_mask, new_dir, FSNOTIFY_EVENT_INODE);
+ fsnotify(old_dir, old_dir_mask, old_dir, FSNOTIFY_EVENT_INODE, old_name);
+ fsnotify(new_dir, new_dir_mask, new_dir, FSNOTIFY_EVENT_INODE, new_name);

if (target) {
inotify_inode_queue_event(target, IN_DELETE_SELF, 0, NULL, NULL);
inotify_inode_is_dead(target);
- fsnotify(target, FS_DELETE, target, FSNOTIFY_EVENT_INODE);
+ fsnotify(target, FS_DELETE, target, FSNOTIFY_EVENT_INODE, NULL);
}

if (source) {
inotify_inode_queue_event(source, IN_MOVE_SELF, 0, NULL, NULL);
- fsnotify(source, FS_MOVE_SELF, moved->d_inode, FSNOTIFY_EVENT_INODE);
+ fsnotify(source, FS_MOVE_SELF, moved->d_inode, FSNOTIFY_EVENT_INODE, NULL);
}
audit_inode_child(new_name, moved, new_dir);
}
@@ -196,7 +196,7 @@ static inline void fsnotify_inoderemove(struct inode *inode)
inotify_inode_queue_event(inode, IN_DELETE_SELF, 0, NULL, NULL);
inotify_inode_is_dead(inode);

- fsnotify(inode, FS_DELETE_SELF, inode, FSNOTIFY_EVENT_INODE);
+ fsnotify(inode, FS_DELETE_SELF, inode, FSNOTIFY_EVENT_INODE, NULL);
__fsnotify_inode_delete(inode, FSNOTIFY_LAST_DENTRY);
}

@@ -207,7 +207,7 @@ static inline void fsnotify_link_count(struct inode *inode)
{
inotify_inode_queue_event(inode, IN_ATTRIB, 0, NULL, NULL);

- fsnotify(inode, FS_ATTRIB, inode, FSNOTIFY_EVENT_INODE);
+ fsnotify(inode, FS_ATTRIB, inode, FSNOTIFY_EVENT_INODE, NULL);
}

/*
@@ -219,7 +219,7 @@ static inline void fsnotify_create(struct inode *inode, struct dentry *dentry)
dentry->d_inode);
audit_inode_child(dentry->d_name.name, dentry, inode);

- fsnotify(inode, FS_CREATE, dentry->d_inode, FSNOTIFY_EVENT_INODE);
+ fsnotify(inode, FS_CREATE, dentry->d_inode, FSNOTIFY_EVENT_INODE, dentry->d_name.name);
}

/*
@@ -234,7 +234,7 @@ static inline void fsnotify_link(struct inode *dir, struct inode *inode, struct
fsnotify_link_count(inode);
audit_inode_child(new_dentry->d_name.name, new_dentry, dir);

- fsnotify(dir, FS_CREATE, inode, FSNOTIFY_EVENT_INODE);
+ fsnotify(dir, FS_CREATE, inode, FSNOTIFY_EVENT_INODE, new_dentry->d_name.name);
}

/*
@@ -246,7 +246,7 @@ static inline void fsnotify_mkdir(struct inode *inode, struct dentry *dentry)
dentry->d_name.name, dentry->d_inode);
audit_inode_child(dentry->d_name.name, dentry, inode);

- fsnotify(inode, FS_CREATE | FS_IN_ISDIR, dentry->d_inode, FSNOTIFY_EVENT_INODE);
+ fsnotify(inode, FS_CREATE | FS_IN_ISDIR, dentry->d_inode, FSNOTIFY_EVENT_INODE, dentry->d_name.name);
}

/*
@@ -265,7 +265,7 @@ static inline void fsnotify_access(struct file *file)
inotify_inode_queue_event(inode, mask, 0, NULL, NULL);

fsnotify_parent(dentry, mask);
- fsnotify(inode, mask, file, FSNOTIFY_EVENT_FILE);
+ fsnotify(inode, mask, file, FSNOTIFY_EVENT_FILE, NULL);
}

/*
@@ -284,7 +284,7 @@ static inline void fsnotify_modify(struct file *file)
inotify_inode_queue_event(inode, mask, 0, NULL, NULL);

fsnotify_parent(dentry, mask);
- fsnotify(inode, mask, file, FSNOTIFY_EVENT_FILE);
+ fsnotify(inode, mask, file, FSNOTIFY_EVENT_FILE, NULL);
}

/*
@@ -299,7 +299,7 @@ static inline void fsnotify_open_exec(struct file *file)
inotify_inode_queue_event(inode, IN_ACCESS, 0, NULL, NULL);

fsnotify_parent(dentry, FS_ACCESS);
- fsnotify(inode, FS_ACCESS, file, FSNOTIFY_EVENT_FILE);
+ fsnotify(inode, FS_ACCESS, file, FSNOTIFY_EVENT_FILE, NULL);
}

/*
@@ -318,7 +318,7 @@ static inline void fsnotify_open(struct file *file)
inotify_inode_queue_event(inode, mask, 0, NULL, NULL);

fsnotify_parent(dentry, mask);
- fsnotify(inode, mask, file, FSNOTIFY_EVENT_FILE);
+ fsnotify(inode, mask, file, FSNOTIFY_EVENT_FILE, NULL);
}

/*
@@ -339,7 +339,7 @@ static inline void fsnotify_close(struct file *file)
inotify_inode_queue_event(inode, mask, 0, NULL, NULL);

fsnotify_parent(dentry, mask);
- fsnotify(inode, mask, file, FSNOTIFY_EVENT_FILE);
+ fsnotify(inode, mask, file, FSNOTIFY_EVENT_FILE, NULL);
}

/*
@@ -357,7 +357,7 @@ static inline void fsnotify_xattr(struct dentry *dentry)
inotify_inode_queue_event(inode, mask, 0, NULL, NULL);

fsnotify_parent(dentry, mask);
- fsnotify(inode, mask, inode, FSNOTIFY_EVENT_INODE);
+ fsnotify(inode, mask, inode, FSNOTIFY_EVENT_INODE, NULL);
}

/*
@@ -395,7 +395,7 @@ static inline void fsnotify_change(struct dentry *dentry, unsigned int ia_valid)
dentry->d_name.name);

fsnotify_parent(dentry, mask);
- fsnotify(inode, mask, inode, FSNOTIFY_EVENT_INODE);
+ fsnotify(inode, mask, inode, FSNOTIFY_EVENT_INODE, NULL);
}
}

diff --git a/include/linux/fsnotify_backend.h b/include/linux/fsnotify_backend.h
index b6d1895..cf24ff1 100644
--- a/include/linux/fsnotify_backend.h
+++ b/include/linux/fsnotify_backend.h
@@ -118,7 +118,7 @@ struct fsnotify_group {
#ifdef CONFIG_FSNOTIFY

/* called from the vfs to signal fs events */
-extern void fsnotify(struct inode *to_tell, __u64 mask, void *data, int data_is);
+extern void fsnotify(struct inode *to_tell, __u64 mask, void *data, int data_is, const char *name);
extern void __fsnotify_inode_delete(struct inode *inode, int flag);

/* called from fsnotify interfaces, such as fanotify or dnotify */
@@ -133,7 +133,7 @@ extern struct fsnotify_mark_entry *fsnotify_find_mark_entry(struct fsnotify_grou
extern struct fsnotify_mark_entry *fsnotify_mark_add(struct fsnotify_group *group, struct inode *inode, __u64 mask);
#else

-static inline void fsnotify(struct inode *to_tell, __u64 mask, void *data, int data_is);
+static inline void fsnotify(struct inode *to_tell, __u64 mask, void *data, int data_is, const char *name);
{}

static inline void __fsnotify_inode_delete(struct inode *inode, int flag)

2008-12-12 21:56:14

by Eric Paris

Subject: [RFC PATCH -v4 12/14] fsnotify: add correlations between events

inotify sends userspace a correlation between events when they are related
(i.e. when dentries are moved). This patch adds the same support for all
fsnotify events.
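
A minimal userspace sketch of the idea, assuming a single global counter and treating a cookie of 0 as "not correlated" (which is what the non-move call sites in the diff below pass). C11 atomics stand in for the kernel's atomic_t, and the event flags, struct move_event and the helper functions are hypothetical names used only for illustration.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

#define EV_MOVED_FROM 0x1u	/* hypothetical flag values for this sketch */
#define EV_MOVED_TO   0x2u

struct move_event {
	uint32_t mask;
	uint32_t sync_cookie;	/* 0 means the event is not part of a pair */
	const char *name;
};

/* Static storage, so the counter starts at zero. */
static atomic_uint sync_cookie_counter;

/* Returns the incremented value, like atomic_inc_return() in the patch. */
uint32_t get_cookie(void)
{
	return (uint32_t)(atomic_fetch_add(&sync_cookie_counter, 1) + 1);
}

/* Both halves of one rename carry the same freshly drawn cookie. */
void make_move_pair(const char *old_name, const char *new_name,
		    struct move_event *from, struct move_event *to)
{
	uint32_t cookie = get_cookie();

	from->mask = EV_MOVED_FROM;
	from->sync_cookie = cookie;
	from->name = old_name;

	to->mask = EV_MOVED_TO;
	to->sync_cookie = cookie;
	to->name = new_name;
}

/* A listener can match the two halves by comparing non-zero cookies. */
int events_are_paired(const struct move_event *a, const struct move_event *b)
{
	return a->sync_cookie != 0 && a->sync_cookie == b->sync_cookie;
}
```

Drawing the cookie once per rename, before either event is generated, is what lets a listener that watches both directories stitch the "moved from" and "moved to" halves back together.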

Signed-off-by: Eric Paris <[email protected]>
---

fs/notify/fsnotify.c | 4 ++--
fs/notify/fsnotify.h | 3 ++-
fs/notify/notification.c | 12 ++++++++++-
include/linux/fsnotify.h | 41 +++++++++++++++++++-------------------
include/linux/fsnotify_backend.h | 10 +++++++--
5 files changed, 44 insertions(+), 26 deletions(-)

diff --git a/fs/notify/fsnotify.c b/fs/notify/fsnotify.c
index c48dcf6..f6973a8 100644
--- a/fs/notify/fsnotify.c
+++ b/fs/notify/fsnotify.c
@@ -34,7 +34,7 @@ void __fsnotify_inode_delete(struct inode *inode, int flag)
}
EXPORT_SYMBOL_GPL(__fsnotify_inode_delete);

-void fsnotify(struct inode *to_tell, __u64 mask, void *data, int data_is, const char *file_name)
+void fsnotify(struct inode *to_tell, __u64 mask, void *data, int data_is, const char *file_name, u32 cookie)
{
struct fsnotify_group *group;
struct fsnotify_event *event = NULL;
@@ -66,7 +66,7 @@ void fsnotify(struct inode *to_tell, __u64 mask, void *data, int data_is, const
if (!group->ops->should_send_event(group, to_tell, mask))
continue;
if (!event) {
- event = fsnotify_create_event(to_tell, mask, data, data_is, file_name);
+ event = fsnotify_create_event(to_tell, mask, data, data_is, file_name, cookie);
/* shit, we OOM'd and now we can't tell, lets hope something else blows up */
if (!event)
break;
diff --git a/fs/notify/fsnotify.h b/fs/notify/fsnotify.h
index 89677ee..cb143bf 100644
--- a/fs/notify/fsnotify.h
+++ b/fs/notify/fsnotify.h
@@ -57,6 +57,7 @@ struct fsnotify_event {
__u64 mask; /* the type of access */
atomic_t refcnt; /* how many groups still are using/need to send this event */

+ u32 sync_cookie;
char *file_name;
size_t name_len;

@@ -102,7 +103,7 @@ extern int fsnotify_check_notif_queue(struct fsnotify_group *group);
extern struct fsnotify_event *fsnotify_peek_notif_event(struct fsnotify_group *group);
extern struct fsnotify_event *fsnotify_remove_notif_event(struct fsnotify_group *group);

-extern struct fsnotify_event *fsnotify_create_event(struct inode *to_tell, __u64 mask, void *data, int data_is, const char *name);
+extern struct fsnotify_event *fsnotify_create_event(struct inode *to_tell, __u64 mask, void *data, int data_is, const char *name, u32 cookie);

extern void fsnotify_clear_marks_by_group(struct fsnotify_group *group);
extern void fsnotify_clear_marks_by_inode(struct inode *inode, unsigned int flags);
diff --git a/fs/notify/notification.c b/fs/notify/notification.c
index 8ed9d32..7243b20 100644
--- a/fs/notify/notification.c
+++ b/fs/notify/notification.c
@@ -20,6 +20,7 @@
#include <linux/init.h>
#include <linux/kernel.h>
#include <linux/list.h>
+#include <linux/module.h>
#include <linux/mount.h>
#include <linux/mutex.h>
#include <linux/namei.h>
@@ -34,6 +35,13 @@

static struct kmem_cache *event_kmem_cache;
static struct kmem_cache *event_holder_kmem_cache;
+static atomic_t fsnotify_sync_cookie = ATOMIC_INIT(0);
+
+u32 fsnotify_get_cookie(void)
+{
+ return atomic_inc_return(&fsnotify_sync_cookie);
+}
+EXPORT_SYMBOL_GPL(fsnotify_get_cookie);

int fsnotify_check_notif_queue(struct fsnotify_group *group)
{
@@ -197,7 +205,7 @@ void fsnotify_flush_notif(struct fsnotify_group *group)
mutex_unlock(&group->notification_mutex);
}

-struct fsnotify_event *fsnotify_create_event(struct inode *to_tell, __u64 mask, void *data, int data_is, const char *name)
+struct fsnotify_event *fsnotify_create_event(struct inode *to_tell, __u64 mask, void *data, int data_is, const char *name, u32 cookie)
{
struct fsnotify_event *event;

@@ -217,6 +225,8 @@ struct fsnotify_event *fsnotify_create_event(struct inode *to_tell, __u64 mask,
event->name_len = 0;
}

+ event->sync_cookie = cookie;
+
event->holder.event = NULL;
INIT_LIST_HEAD(&event->holder.event_list);
atomic_set(&event->refcnt, 1);
diff --git a/include/linux/fsnotify.h b/include/linux/fsnotify.h
index 8f4caf8..c1a7b61 100644
--- a/include/linux/fsnotify.h
+++ b/include/linux/fsnotify.h
@@ -98,7 +98,7 @@ static inline void fsnotify_parent(struct dentry *dentry, __u64 orig_mask)
if (p_inode && (p_inode->i_fsnotify_mask & mask)) {
dget(parent);
spin_unlock(&dentry->d_lock);
- fsnotify(p_inode, mask, dentry->d_inode, FSNOTIFY_EVENT_INODE, dentry->d_name.name);
+ fsnotify(p_inode, mask, dentry->d_inode, FSNOTIFY_EVENT_INODE, dentry->d_name.name, 0);
dput(parent);
} else {
spin_unlock(&dentry->d_lock);
@@ -130,7 +130,8 @@ static inline void fsnotify_move(struct inode *old_dir, struct inode *new_dir,
int isdir, struct inode *target, struct dentry *moved)
{
struct inode *source = moved->d_inode;
- u32 cookie = inotify_get_cookie();
+ u32 in_cookie = inotify_get_cookie();
+ u32 fs_cookie = fsnotify_get_cookie();
__u64 old_dir_mask = 0;
__u64 new_dir_mask = 0;

@@ -147,23 +148,23 @@ static inline void fsnotify_move(struct inode *old_dir, struct inode *new_dir,
old_dir_mask |= FS_MOVED_FROM;
new_dir_mask |= FS_MOVED_TO;

- inotify_inode_queue_event(old_dir, IN_MOVED_FROM|isdir,cookie,old_name,
+ inotify_inode_queue_event(old_dir, IN_MOVED_FROM|isdir,in_cookie,old_name,
source);
- inotify_inode_queue_event(new_dir, IN_MOVED_TO|isdir, cookie, new_name,
+ inotify_inode_queue_event(new_dir, IN_MOVED_TO|isdir, in_cookie, new_name,
source);

- fsnotify(old_dir, old_dir_mask, old_dir, FSNOTIFY_EVENT_INODE, old_name);
- fsnotify(new_dir, new_dir_mask, new_dir, FSNOTIFY_EVENT_INODE, new_name);
+ fsnotify(old_dir, old_dir_mask, old_dir, FSNOTIFY_EVENT_INODE, old_name, fs_cookie);
+ fsnotify(new_dir, new_dir_mask, new_dir, FSNOTIFY_EVENT_INODE, new_name, fs_cookie);

if (target) {
inotify_inode_queue_event(target, IN_DELETE_SELF, 0, NULL, NULL);
inotify_inode_is_dead(target);
- fsnotify(target, FS_DELETE, target, FSNOTIFY_EVENT_INODE, NULL);
+ fsnotify(target, FS_DELETE, target, FSNOTIFY_EVENT_INODE, NULL, 0);
}

if (source) {
inotify_inode_queue_event(source, IN_MOVE_SELF, 0, NULL, NULL);
- fsnotify(source, FS_MOVE_SELF, moved->d_inode, FSNOTIFY_EVENT_INODE, NULL);
+ fsnotify(source, FS_MOVE_SELF, moved->d_inode, FSNOTIFY_EVENT_INODE, NULL, 0);
}
audit_inode_child(new_name, moved, new_dir);
}
@@ -196,7 +197,7 @@ static inline void fsnotify_inoderemove(struct inode *inode)
inotify_inode_queue_event(inode, IN_DELETE_SELF, 0, NULL, NULL);
inotify_inode_is_dead(inode);

- fsnotify(inode, FS_DELETE_SELF, inode, FSNOTIFY_EVENT_INODE, NULL);
+ fsnotify(inode, FS_DELETE_SELF, inode, FSNOTIFY_EVENT_INODE, NULL, 0);
__fsnotify_inode_delete(inode, FSNOTIFY_LAST_DENTRY);
}

@@ -207,7 +208,7 @@ static inline void fsnotify_link_count(struct inode *inode)
{
inotify_inode_queue_event(inode, IN_ATTRIB, 0, NULL, NULL);

- fsnotify(inode, FS_ATTRIB, inode, FSNOTIFY_EVENT_INODE, NULL);
+ fsnotify(inode, FS_ATTRIB, inode, FSNOTIFY_EVENT_INODE, NULL, 0);
}

/*
@@ -219,7 +220,7 @@ static inline void fsnotify_create(struct inode *inode, struct dentry *dentry)
dentry->d_inode);
audit_inode_child(dentry->d_name.name, dentry, inode);

- fsnotify(inode, FS_CREATE, dentry->d_inode, FSNOTIFY_EVENT_INODE, dentry->d_name.name);
+ fsnotify(inode, FS_CREATE, dentry->d_inode, FSNOTIFY_EVENT_INODE, dentry->d_name.name, 0);
}

/*
@@ -234,7 +235,7 @@ static inline void fsnotify_link(struct inode *dir, struct inode *inode, struct
fsnotify_link_count(inode);
audit_inode_child(new_dentry->d_name.name, new_dentry, dir);

- fsnotify(dir, FS_CREATE, inode, FSNOTIFY_EVENT_INODE, new_dentry->d_name.name);
+ fsnotify(dir, FS_CREATE, inode, FSNOTIFY_EVENT_INODE, new_dentry->d_name.name, 0);
}

/*
@@ -246,7 +247,7 @@ static inline void fsnotify_mkdir(struct inode *inode, struct dentry *dentry)
dentry->d_name.name, dentry->d_inode);
audit_inode_child(dentry->d_name.name, dentry, inode);

- fsnotify(inode, FS_CREATE | FS_IN_ISDIR, dentry->d_inode, FSNOTIFY_EVENT_INODE, dentry->d_name.name);
+ fsnotify(inode, FS_CREATE | FS_IN_ISDIR, dentry->d_inode, FSNOTIFY_EVENT_INODE, dentry->d_name.name, 0);
}

/*
@@ -265,7 +266,7 @@ static inline void fsnotify_access(struct file *file)
inotify_inode_queue_event(inode, mask, 0, NULL, NULL);

fsnotify_parent(dentry, mask);
- fsnotify(inode, mask, file, FSNOTIFY_EVENT_FILE, NULL);
+ fsnotify(inode, mask, file, FSNOTIFY_EVENT_FILE, NULL, 0);
}

/*
@@ -284,7 +285,7 @@ static inline void fsnotify_modify(struct file *file)
inotify_inode_queue_event(inode, mask, 0, NULL, NULL);

fsnotify_parent(dentry, mask);
- fsnotify(inode, mask, file, FSNOTIFY_EVENT_FILE, NULL);
+ fsnotify(inode, mask, file, FSNOTIFY_EVENT_FILE, NULL, 0);
}

/*
@@ -299,7 +300,7 @@ static inline void fsnotify_open_exec(struct file *file)
inotify_inode_queue_event(inode, IN_ACCESS, 0, NULL, NULL);

fsnotify_parent(dentry, FS_ACCESS);
- fsnotify(inode, FS_ACCESS, file, FSNOTIFY_EVENT_FILE, NULL);
+ fsnotify(inode, FS_ACCESS, file, FSNOTIFY_EVENT_FILE, NULL, 0);
}

/*
@@ -318,7 +319,7 @@ static inline void fsnotify_open(struct file *file)
inotify_inode_queue_event(inode, mask, 0, NULL, NULL);

fsnotify_parent(dentry, mask);
- fsnotify(inode, mask, file, FSNOTIFY_EVENT_FILE, NULL);
+ fsnotify(inode, mask, file, FSNOTIFY_EVENT_FILE, NULL, 0);
}

/*
@@ -339,7 +340,7 @@ static inline void fsnotify_close(struct file *file)
inotify_inode_queue_event(inode, mask, 0, NULL, NULL);

fsnotify_parent(dentry, mask);
- fsnotify(inode, mask, file, FSNOTIFY_EVENT_FILE, NULL);
+ fsnotify(inode, mask, file, FSNOTIFY_EVENT_FILE, NULL, 0);
}

/*
@@ -357,7 +358,7 @@ static inline void fsnotify_xattr(struct dentry *dentry)
inotify_inode_queue_event(inode, mask, 0, NULL, NULL);

fsnotify_parent(dentry, mask);
- fsnotify(inode, mask, inode, FSNOTIFY_EVENT_INODE, NULL);
+ fsnotify(inode, mask, inode, FSNOTIFY_EVENT_INODE, NULL, 0);
}

/*
@@ -395,7 +396,7 @@ static inline void fsnotify_change(struct dentry *dentry, unsigned int ia_valid)
dentry->d_name.name);

fsnotify_parent(dentry, mask);
- fsnotify(inode, mask, inode, FSNOTIFY_EVENT_INODE, NULL);
+ fsnotify(inode, mask, inode, FSNOTIFY_EVENT_INODE, NULL, 0);
}
}

diff --git a/include/linux/fsnotify_backend.h b/include/linux/fsnotify_backend.h
index cf24ff1..a068b80 100644
--- a/include/linux/fsnotify_backend.h
+++ b/include/linux/fsnotify_backend.h
@@ -118,8 +118,9 @@ struct fsnotify_group {
#ifdef CONFIG_FSNOTIFY

/* called from the vfs to signal fs events */
-extern void fsnotify(struct inode *to_tell, __u64 mask, void *data, int data_is, const char *name);
+extern void fsnotify(struct inode *to_tell, __u64 mask, void *data, int data_is, const char *name, u32 cookie);
extern void __fsnotify_inode_delete(struct inode *inode, int flag);
+extern u32 fsnotify_get_cookie(void);

/* called from fsnotify interfaces, such as fanotify or dnotify */
extern void fsnotify_recalc_global_mask(void);
@@ -133,12 +134,17 @@ extern struct fsnotify_mark_entry *fsnotify_find_mark_entry(struct fsnotify_grou
extern struct fsnotify_mark_entry *fsnotify_mark_add(struct fsnotify_group *group, struct inode *inode, __u64 mask);
#else

-static inline void fsnotify(struct inode *to_tell, __u64 mask, void *data, int data_is, const char *name);
+static inline void fsnotify(struct inode *to_tell, __u64 mask, void *data, int data_is, const char *name, u32 cookie)
{}

static inline void __fsnotify_inode_delete(struct inode *inode, int flag)
{}

+static inline u32 fsnotify_get_cookie(void)
+{
+ return 0;
+}
+
#endif /* CONFIG_FSNOTIFY */

#endif /* __KERNEL __ */

2008-12-12 21:56:33

by Eric Paris

Subject: [RFC PATCH -v4 14/14] shit on top for debugging


---

fs/notify/dnotify/dnotify.c | 6 ++++++
fs/notify/fsnotify.c | 1 +
fs/notify/group.c | 18 ++++++++++++++++++
fs/notify/inode_mark.c | 10 ++++++++++
fs/notify/inotify/inotify_fsnotify.c | 7 +++++++
fs/notify/inotify/inotify_kernel.c | 6 ++++++
fs/notify/inotify/inotify_user.c | 3 +++
fs/notify/notification.c | 2 ++
include/linux/fsnotify.h | 1 +
9 files changed, 54 insertions(+), 0 deletions(-)

diff --git a/fs/notify/dnotify/dnotify.c b/fs/notify/dnotify/dnotify.c
index dae1bd6..d7f201e 100644
--- a/fs/notify/dnotify/dnotify.c
+++ b/fs/notify/dnotify/dnotify.c
@@ -66,6 +66,8 @@ static void dnotify_recalc_inode_mask(struct fsnotify_mark_entry *entry)
unsigned long new_mask;
struct dnotify_struct *dn;

+ printk(KERN_CRIT "%s: inode=%p\n", __func__, entry->inode);
+
new_mask = 0;
dn = (struct dnotify_struct *)entry->private;
for (; dn != NULL; dn = dn->dn_next)
@@ -174,6 +176,8 @@ void dnotify_flush(struct file *filp, fl_owner_t id)
if (!entry)
return;

+ printk(KERN_CRIT "%s: found entry=%p entry->group=%p\n", __func__, entry, entry->group);
+
spin_lock(&entry->lock);
prev = (struct dnotify_struct **)&entry->private;
while ((dn = *prev) != NULL) {
@@ -241,6 +245,8 @@ int fcntl_dirnotify(int fd, struct file *filp, unsigned long arg)
if (!dir_notify_enable)
return -EINVAL;

+ printk(KERN_CRIT "%s: fd=%d filp=%p arg=%lx\n", __func__, fd, filp, arg);
+
if ((arg & ~DN_MULTISHOT) == 0) {
dnotify_flush(filp, id);
return 0;
diff --git a/fs/notify/fsnotify.c b/fs/notify/fsnotify.c
index f6973a8..5b0d632 100644
--- a/fs/notify/fsnotify.c
+++ b/fs/notify/fsnotify.c
@@ -65,6 +65,7 @@ void fsnotify(struct inode *to_tell, __u64 mask, void *data, int data_is, const
if (mask & group->mask) {
if (!group->ops->should_send_event(group, to_tell, mask))
continue;
+ printk(KERN_CRIT "%s: to_tell=%p data=%p data_is=%d mask=%llx fsnotify_mask=%llx\n", __func__, to_tell, data, data_is, mask, fsnotify_mask);
if (!event) {
event = fsnotify_create_event(to_tell, mask, data, data_is, file_name, cookie);
/* shit, we OOM'd and now we can't tell, lets hope something else blows up */
diff --git a/fs/notify/group.c b/fs/notify/group.c
index b041e6d..0509443 100644
--- a/fs/notify/group.c
+++ b/fs/notify/group.c
@@ -39,12 +39,17 @@ void fsnotify_recalc_global_mask(void)
__u64 mask = 0;
int idx;

+ printk(KERN_CRIT "%s: starting fsnotify_mask=%llx\n", __func__, fsnotify_mask);
+
idx = srcu_read_lock(&fsnotify_grp_srcu_struct);
list_for_each_entry_rcu(group, &fsnotify_groups, group_list) {
+ printk(KERN_CRIT "%s: group=%p group->mask=%llx\n", __func__, group, group->mask);
mask |= group->mask;
}
srcu_read_unlock(&fsnotify_grp_srcu_struct, idx);
fsnotify_mask = mask;
+
+ printk(KERN_CRIT "%s: ending fsnotify_mask=%llx\n", __func__, fsnotify_mask);
}

void fsnotify_recalc_group_mask(struct fsnotify_group *group)
@@ -54,13 +59,19 @@ void fsnotify_recalc_group_mask(struct fsnotify_group *group)
struct fsnotify_mark_entry *entry;

spin_lock(&group->mark_lock);
+
+ printk(KERN_CRIT "%s: group=%p starting group->mask=%llx\n", __func__, group, group->mask);
+
list_for_each_entry(entry, &group->mark_entries, g_list) {
+ printk(KERN_CRIT "%s: entry=%p entry->mask=%llx\n", __func__, entry, entry->mask);
mask |= entry->mask;
}
spin_unlock(&group->mark_lock);

group->mask = mask;

+ printk(KERN_CRIT "%s: group=%p finishing group->mask=%llx\n", __func__, group, group->mask);
+
if (old_mask != mask)
fsnotify_recalc_global_mask();
}
@@ -87,10 +98,14 @@ static void fsnotify_add_group(struct fsnotify_group *group)
void fsnotify_get_group(struct fsnotify_group *group)
{
atomic_inc(&group->refcnt);
+
+ printk(KERN_CRIT "%s: group=%p refcnt=%d\n", __func__, group, atomic_read(&group->refcnt));
}

static void fsnotify_destroy_group(struct fsnotify_group *group)
{
+ printk(KERN_CRIT "%s: group=%p refcnt=%d\n", __func__, group, atomic_read(&group->refcnt));
+
/* clear the notification queue of all events */
fsnotify_flush_notif(group);

@@ -105,6 +120,7 @@ static void fsnotify_destroy_group(struct fsnotify_group *group)

void fsnotify_put_group(struct fsnotify_group *group)
{
+ printk(KERN_CRIT "%s: group=%p refcnt=%d\n", __func__, group, atomic_read(&group->refcnt));
if (atomic_dec_and_test(&group->refcnt)) {
mutex_lock(&fsnotify_grp_mutex);
list_del_rcu(&group->group_list);
@@ -200,5 +216,7 @@ struct fsnotify_group *fsnotify_obtain_group(unsigned int priority, unsigned int
if (mask)
fsnotify_recalc_global_mask();

+ printk(KERN_CRIT "%s: group=%p refcnt=%d\n", __func__, group, atomic_read(&group->refcnt));
+
return group;
}
diff --git a/fs/notify/inode_mark.c b/fs/notify/inode_mark.c
index 339a368..7c1edf2 100644
--- a/fs/notify/inode_mark.c
+++ b/fs/notify/inode_mark.c
@@ -58,11 +58,13 @@ static struct fsnotify_mark_entry *fsnotify_alloc_mark(void)

void fsnotify_get_mark(struct fsnotify_mark_entry *entry)
{
+ printk(KERN_CRIT "%s: entry=%p refcnt BEFORE the get=%x\n", __func__, entry, atomic_read(&entry->refcnt));
atomic_inc(&entry->refcnt);
}

void fsnotify_put_mark(struct fsnotify_mark_entry *entry)
{
+ printk(KERN_CRIT "%s: entry=%p refcnt BEFORE the put=%x\n", __func__, entry, atomic_read(&entry->refcnt));
if (atomic_dec_and_test(&entry->refcnt)) {
spin_lock(&entry->lock);
/* entries can only be found by the kernel by searching the
@@ -95,6 +97,7 @@ void fsnotify_clear_marks_by_group(struct fsnotify_group *group)

list_for_each_entry_safe(entry, lentry, &free_list, free_g_list) {
fsnotify_get_mark(entry);
+ printk(KERN_CRIT "%s: entry=%p entry->inode=%p entry->group=%p\n", __func__, entry, entry->inode, entry->group);
spin_lock(&entry->lock);
inode = entry->inode;
if (!inode) {
@@ -127,6 +130,8 @@ void fsnotify_destroy_mark_by_entry(struct fsnotify_mark_entry *entry)

spin_lock(&entry->lock);

+ printk(KERN_CRIT "%s: entry=%p entry->inode=%p entry->group=%p\n", __func__, entry, entry->inode, entry->group);
+
group = entry->group;
if (group)
spin_lock(&group->mark_lock);
@@ -156,6 +161,9 @@ void fsnotify_clear_marks_by_inode(struct inode *inode, unsigned int flags)
struct fsnotify_mark_entry *lentry, *entry;
LIST_HEAD(free_list);

+ if (!list_empty(&inode->i_fsnotify_mark_entries))
+ printk(KERN_CRIT "%s: inode=%p flags=%d\n", __func__, inode, flags);
+
spin_lock(&inode->i_lock);
list_for_each_entry_safe(entry, lentry, &inode->i_fsnotify_mark_entries, i_list) {
list_del_init(&entry->i_list);
@@ -241,6 +249,8 @@ struct fsnotify_mark_entry *fsnotify_mark_add(struct fsnotify_group *group, stru
out_unlock:
spin_unlock(&inode->i_lock);
spin_unlock(&group->mark_lock);
+
+ printk(KERN_CRIT "%s: group=%p inode=%p entry=%p mask=%llx\n", __func__, group, inode, entry, mask);
return entry;
}

diff --git a/fs/notify/inotify/inotify_fsnotify.c b/fs/notify/inotify/inotify_fsnotify.c
index 30c0a91..a078472 100644
--- a/fs/notify/inotify/inotify_fsnotify.c
+++ b/fs/notify/inotify/inotify_fsnotify.c
@@ -51,6 +51,8 @@ static int inotify_event_to_notif(struct fsnotify_group *group, struct fsnotify_
struct inotify_mark_private_data *mark_priv;
int wd, ret = 0;

+ printk(KERN_CRIT "%s: group=%p event=%p\n", __func__, group, event);
+
to_tell = event->to_tell;

spin_lock(&to_tell->i_lock);
@@ -100,7 +102,10 @@ static void inotify_mark_clear_inode(struct fsnotify_mark_entry *entry, struct i
list_add(&entry->i_list, &inode->i_fsnotify_mark_entries);
spin_unlock(&inode->i_lock);

+ printk(KERN_CRIT "%s: entry=%p inode=%p flags=%d\n", __func__, entry, inode, flags);
+
fsnotify(inode, FS_IN_IGNORED, inode, FSNOTIFY_EVENT_INODE, NULL, 0);
+
inotify_destroy_mark_entry(entry);
}

@@ -162,6 +167,8 @@ static void inotify_free_mark_priv(struct fsnotify_mark_entry *entry)
struct inotify_mark_private_data *mark_priv = entry->private;
struct inode *inode = mark_priv->inode;

+ printk(KERN_CRIT "%s: entry=%p\n", __func__, entry);
+
BUG_ON(!entry->private);

mark_priv = entry->private;
diff --git a/fs/notify/inotify/inotify_kernel.c b/fs/notify/inotify/inotify_kernel.c
index 269fd87..ace0b51 100644
--- a/fs/notify/inotify/inotify_kernel.c
+++ b/fs/notify/inotify/inotify_kernel.c
@@ -56,6 +56,10 @@ int find_inode(const char __user *dirname, struct path *path, unsigned flags)
{
int error;

+ char *tmp = getname(dirname);
+ printk(KERN_CRIT "%s: pathname=%s\n", __func__, tmp);
+ putname(tmp);
+
error = user_path_at(AT_FDCWD, dirname, flags, path);
if (error)
return error;
@@ -74,6 +78,8 @@ void inotify_destroy_mark_entry(struct fsnotify_mark_entry *entry)
struct idr *idr;
int wd;

+ printk(KERN_CRIT "%s: entry=%p refcnt=%d\n", __func__, entry, atomic_read(&entry->refcnt));
+
spin_lock(&entry->lock);

mark_priv = entry->private;
diff --git a/fs/notify/inotify/inotify_user.c b/fs/notify/inotify/inotify_user.c
index 2df65ff..37cf619 100644
--- a/fs/notify/inotify/inotify_user.c
+++ b/fs/notify/inotify/inotify_user.c
@@ -202,6 +202,9 @@ static ssize_t inotify_read(struct file *file, char __user *buf,
__inotify_free_event_priv(priv);
spin_unlock(&event->lock);

+ printk(KERN_CRIT "%s: event->wd=%d event->mask=%x event->cookie=%d event->len=%d event->file_name=%s\n",
+ __func__, inotify_event.wd, inotify_event.mask, inotify_event.cookie, inotify_event.len, event->file_name);
+
if (copy_to_user(buf, &inotify_event, event_size)) {
ret = -EFAULT;
break;
diff --git a/fs/notify/notification.c b/fs/notify/notification.c
index 7243b20..631b82f 100644
--- a/fs/notify/notification.c
+++ b/fs/notify/notification.c
@@ -263,6 +263,8 @@ struct fsnotify_event *fsnotify_create_event(struct inode *to_tell, __u64 mask,

event->mask = mask;

+ printk(KERN_CRIT "%s: event=%p event->to_tell=%p event->inode=%p event->flag=%d event->file_name=%s\n", __func__, event, event->to_tell, event->inode, event->flag, event->file_name);
+
return event;
}

diff --git a/include/linux/fsnotify.h b/include/linux/fsnotify.h
index 3d10004..763a2b1 100644
--- a/include/linux/fsnotify.h
+++ b/include/linux/fsnotify.h
@@ -96,6 +96,7 @@ static inline void fsnotify_parent(struct dentry *dentry, __u64 orig_mask)
p_inode = parent->d_inode;

if (p_inode && (p_inode->i_fsnotify_mask & mask)) {
+ printk(KERN_CRIT "%s: dentry=%p orig_mask=%llx mask=%llx parent=%p parent->inode=%p p_inode->i_fsnotify_mask=%llx\n", __func__, dentry, orig_mask, mask, parent, p_inode, p_inode->i_fsnotify_mask);
dget(parent);
spin_unlock(&dentry->d_lock);
fsnotify(p_inode, mask, dentry->d_inode, FSNOTIFY_EVENT_INODE, dentry->d_name.name, 0);

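Reader's note (not part of the series): one way to exercise the debugging printks in the patch above is to run a small userspace inotify program on a kernel built with it and read the kernel log afterwards. The following is a minimal sketch under stated assumptions: it uses only the standard userspace calls (inotify_init(), inotify_add_watch(), read()), the helper names drain_events(), watch_and_touch() and watch_scratch_dir() are made up for illustration, and which printks actually fire depends on the kernel under test.

```c
#define _GNU_SOURCE /* for mkdtemp() under strict -std= modes */
#include <assert.h>
#include <limits.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/inotify.h>
#include <unistd.h>

/* Read once from an inotify fd and print each event; returns the event count or -1. */
static int drain_events(int fd)
{
	char buf[4096] __attribute__((aligned(__alignof__(struct inotify_event))));
	ssize_t len = read(fd, buf, sizeof(buf));
	int n = 0;

	if (len < 0)
		return -1;
	for (char *p = buf; p < buf + len; n++) {
		const struct inotify_event *ev = (const struct inotify_event *)p;

		printf("wd=%d mask=%#x cookie=%u name=%s\n",
		       ev->wd, ev->mask, ev->cookie, ev->len ? ev->name : "");
		/* each record is a fixed header followed by ev->len name bytes */
		p += sizeof(*ev) + ev->len;
	}
	return n;
}

/* Watch dir, create and delete one file in it, then report the queued events. */
static int watch_and_touch(const char *dir)
{
	char path[PATH_MAX];
	FILE *f;
	int fd, wd, n;

	assert(dir != NULL);
	fd = inotify_init();
	if (fd < 0)
		return -1;
	wd = inotify_add_watch(fd, dir, IN_CREATE | IN_DELETE);
	if (wd < 0) {
		close(fd);
		return -1;
	}
	snprintf(path, sizeof(path), "%s/probe", dir);
	f = fopen(path, "w");		/* should queue IN_CREATE */
	if (!f) {
		close(fd);
		return -1;
	}
	fclose(f);
	unlink(path);			/* should queue IN_DELETE */
	n = drain_events(fd);
	inotify_rm_watch(fd, wd);
	close(fd);
	return n;
}

/* Run the probe in a fresh scratch directory so unrelated events stay out. */
static int watch_scratch_dir(void)
{
	char tmpl[] = "/tmp/inotify-sketch-XXXXXX";
	int n;

	if (!mkdtemp(tmpl))
		return -1;
	n = watch_and_touch(tmpl);
	rmdir(tmpl);
	return n;
}
```

Calling watch_scratch_dir() from a main() and then checking dmesg should show the path an event takes through the instrumented functions. The scratch directory is used so that the only events queued are the two the program causes itself.
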
2008-12-12 21:56:51

by Eric Paris

Subject: [RFC PATCH -v4 13/14] inotify: reimplement inotify using fsnotify

Yes, holy shit, I'm trying to reimplement inotify on top of fsnotify...

Signed-off-by: Eric Paris <[email protected]>
---
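Reader's note (not part of the commit): the mask conversion that this patch adds in inotify.h can be tried out in userspace. The following is a minimal sketch, not the kernel code. It assumes the IN_* constants from <sys/inotify.h> stand in for the FS_* bits (the BUILD_BUG_ON()s in the patch require them to be equal), and SKETCH_EVENTS_WITH_CHILD is a hypothetical stand-in for FS_EVENTS_WITH_CHILD, whose real definition lives in an earlier patch of the series and may differ. All sketch_* names are made up for illustration.

```c
#include <assert.h>
#include <stdint.h>
#include <sys/inotify.h>

/*
 * ASSUMPTION: stand-in for FS_EVENTS_WITH_CHILD, which is defined in an
 * earlier patch of the series and not shown here. The real set may differ.
 */
#define SKETCH_EVENTS_WITH_CHILD \
	(IN_ACCESS | IN_MODIFY | IN_ATTRIB | IN_CLOSE_WRITE | \
	 IN_CLOSE_NOWRITE | IN_OPEN | IN_MOVED_FROM | IN_MOVED_TO | \
	 IN_CREATE | IN_DELETE)

/* Mirrors inotify_arg_to_mask(); IN_IGNORED stands in for FS_IN_IGNORED. */
static uint64_t sketch_arg_to_mask(uint32_t arg)
{
	uint64_t mask = IN_IGNORED;

	mask |= (arg & (IN_ALL_EVENTS | IN_ONESHOT));
	/* events that can also be reported for children get a copy in the high 32 bits */
	mask |= ((mask & SKETCH_EVENTS_WITH_CHILD) << 32);
	return mask;
}

/* Mirrors inotify_mask_to_arg(): fold the high-half child bits back down. */
static uint32_t sketch_mask_to_arg(uint64_t mask)
{
	uint32_t arg;

	arg = (uint32_t)(mask & (IN_ALL_EVENTS | IN_ISDIR | IN_UNMOUNT | IN_IGNORED));
	arg |= (uint32_t)((mask >> 32) & SKETCH_EVENTS_WITH_CHILD);
	return arg;
}

/* A few properties of the round trip, checked with plain asserts. */
static void sketch_self_check(void)
{
	uint64_t mask = sketch_arg_to_mask(IN_MODIFY | IN_DELETE_SELF);

	/* IN_MODIFY is in the child set, so it is mirrored into the high half */
	assert(mask & ((uint64_t)IN_MODIFY << 32));
	/* IN_DELETE_SELF is not, so it stays in the low half only */
	assert(!(mask & ((uint64_t)IN_DELETE_SELF << 32)));
	/* converting back yields the request plus the always-set IN_IGNORED */
	assert(sketch_mask_to_arg(mask) == (IN_MODIFY | IN_DELETE_SELF | IN_IGNORED));
	/* an empty or flags-only arg still converts to a non-zero mask */
	assert(sketch_arg_to_mask(0) == IN_IGNORED);
	assert(sketch_arg_to_mask(IN_MASK_ADD) == IN_IGNORED);
}
```

If the sketch matches the patch, the last two asserts suggest that the `!mask` check in inotify_update_watch() can never fire, because FS_IN_IGNORED is always set before the test. Rejecting an empty or flags-only request would probably need a check on arg itself.
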

fs/inode.c | 1
fs/notify/inotify/Kconfig | 20 +
fs/notify/inotify/Makefile | 2
fs/notify/inotify/inotify.h | 117 +++++++
fs/notify/inotify/inotify_fsnotify.c | 183 +++++++++++
fs/notify/inotify/inotify_kernel.c | 293 +++++++++++++++++
fs/notify/inotify/inotify_user.c | 591 +++++++++-------------------------
include/linux/fsnotify.h | 39 +-
include/linux/inotify.h | 1
9 files changed, 783 insertions(+), 464 deletions(-)
create mode 100644 fs/notify/inotify/inotify.h
create mode 100644 fs/notify/inotify/inotify_fsnotify.c
create mode 100644 fs/notify/inotify/inotify_kernel.c

diff --git a/fs/inode.c b/fs/inode.c
index a7f6397..05a12b5 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -372,6 +372,7 @@ int invalidate_inodes(struct super_block * sb)
mutex_lock(&iprune_mutex);
spin_lock(&inode_lock);
inotify_unmount_inodes(&sb->s_inodes);
+ fsn_inotify_unmount_inodes(&sb->s_inodes);
busy = invalidate_list(&sb->s_inodes, &throw_away);
spin_unlock(&inode_lock);

diff --git a/fs/notify/inotify/Kconfig b/fs/notify/inotify/Kconfig
index 4467928..b89bfab 100644
--- a/fs/notify/inotify/Kconfig
+++ b/fs/notify/inotify/Kconfig
@@ -1,26 +1,30 @@
config INOTIFY
bool "Inotify file change notification support"
- default y
+ default n
---help---
- Say Y here to enable inotify support. Inotify is a file change
- notification system and a replacement for dnotify. Inotify fixes
- numerous shortcomings in dnotify and introduces several new features
- including multiple file events, one-shot support, and unmount
- notification.
+ Say Y here to enable legacy in-kernel inotify support. Inotify is a
+ file change notification system. It is a replacement for dnotify.
+ This option only provides the legacy in-kernel inotify API. There
+ are no in-tree kernel users of this interface since it is deprecated.
+ You only need this if you are loading an out-of-tree kernel module
+ that uses inotify.

For more information, see <file:Documentation/filesystems/inotify.txt>

- If unsure, say Y.
+ If unsure, say N.

config INOTIFY_USER
bool "Inotify support for userspace"
- depends on INOTIFY
+ depends on FSNOTIFY
default y
---help---
Say Y here to enable inotify support for userspace, including the
associated system calls. Inotify allows monitoring of both files and
directories via a single open fd. Events are read from the file
descriptor, which is also select()- and poll()-able.
+ Inotify fixes numerous shortcomings in dnotify and introduces several
+ new features including multiple file events, one-shot support, and
+ unmount notification.

For more information, see <file:Documentation/filesystems/inotify.txt>

diff --git a/fs/notify/inotify/Makefile b/fs/notify/inotify/Makefile
index e290f3b..aff7f68 100644
--- a/fs/notify/inotify/Makefile
+++ b/fs/notify/inotify/Makefile
@@ -1,2 +1,2 @@
obj-$(CONFIG_INOTIFY) += inotify.o
-obj-$(CONFIG_INOTIFY_USER) += inotify_user.o
+obj-$(CONFIG_INOTIFY_USER) += inotify_fsnotify.o inotify_kernel.o inotify_user.o
diff --git a/fs/notify/inotify/inotify.h b/fs/notify/inotify/inotify.h
new file mode 100644
index 0000000..37a437c
--- /dev/null
+++ b/fs/notify/inotify/inotify.h
@@ -0,0 +1,117 @@
+/*
+ * fs/notify/inotify/inotify.h - inotify support for userspace
+ *
+ * Authors:
+ * John McCutchan <[email protected]>
+ * Robert Love <[email protected]>
+ *
+ * Copyright (C) 2005 John McCutchan
+ * Copyright 2006 Hewlett-Packard Development Company, L.P.
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms of the GNU General Public License as published by the
+ * Free Software Foundation; either version 2, or (at your option) any
+ * later version.
+ *
+ * This program is distributed in the hope that it will be useful, but
+ * WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+ * General Public License for more details.
+ */
+
+#include <linux/kernel.h>
+#include <linux/sched.h>
+#include <linux/slab.h>
+#include <linux/fs.h>
+#include <linux/file.h>
+#include <linux/limits.h>
+#include <linux/module.h>
+#include <linux/mount.h>
+#include <linux/namei.h>
+#include <linux/poll.h>
+#include <linux/idr.h>
+#include <linux/init.h>
+#include <linux/list.h>
+#include <linux/inotify.h>
+#include <linux/syscalls.h>
+#include <linux/string.h>
+#include <linux/magic.h>
+#include <linux/writeback.h>
+#include <linux/fsnotify.h>
+
+#include "../fsnotify.h"
+
+#include <asm/ioctls.h>
+
+extern struct kmem_cache *grp_priv_cachep;
+extern struct kmem_cache *mark_priv_cachep;
+extern struct kmem_cache *event_priv_cachep;
+
+struct inotify_group_private_data {
+ struct idr idr;
+ u32 last_wd;
+ struct fasync_struct *fa; /* async notification */
+ struct user_struct *user;
+};
+
+struct inotify_mark_private_data {
+ int wd;
+ struct inode *inode;
+};
+
+struct inotify_event_private_data {
+ struct fsnotify_event_private_data fsnotify_event_priv_data;
+ int wd;
+};
+
+static inline __u64 inotify_arg_to_mask(u32 arg)
+{
+ /* everything should accept their own ignored */
+ __u64 mask = FS_IN_IGNORED;
+
+ BUILD_BUG_ON(IN_ACCESS != FS_ACCESS);
+ BUILD_BUG_ON(IN_MODIFY != FS_MODIFY);
+ BUILD_BUG_ON(IN_ATTRIB != FS_ATTRIB);
+ BUILD_BUG_ON(IN_CLOSE_WRITE != FS_CLOSE_WRITE);
+ BUILD_BUG_ON(IN_CLOSE_NOWRITE != FS_CLOSE_NOWRITE);
+ BUILD_BUG_ON(IN_OPEN != FS_OPEN);
+ BUILD_BUG_ON(IN_MOVED_FROM != FS_MOVED_FROM);
+ BUILD_BUG_ON(IN_MOVED_TO != FS_MOVED_TO);
+ BUILD_BUG_ON(IN_CREATE != FS_CREATE);
+ BUILD_BUG_ON(IN_DELETE != FS_DELETE);
+ BUILD_BUG_ON(IN_DELETE_SELF != FS_DELETE_SELF);
+ BUILD_BUG_ON(IN_MOVE_SELF != FS_MOVE_SELF);
+ BUILD_BUG_ON(IN_Q_OVERFLOW != FS_Q_OVERFLOW);
+
+ BUILD_BUG_ON(IN_UNMOUNT != FS_IN_UNMOUNT);
+ BUILD_BUG_ON(IN_ISDIR != FS_IN_ISDIR);
+ BUILD_BUG_ON(IN_IGNORED != FS_IN_IGNORED);
+ BUILD_BUG_ON(IN_ONESHOT != FS_IN_ONESHOT);
+
+ mask |= (arg & (IN_ALL_EVENTS | IN_ONESHOT));
+
+ mask |= ((mask & FS_EVENTS_WITH_CHILD) << 32);
+
+ return mask;
+}
+
+static inline u32 inotify_mask_to_arg(__u64 mask)
+{
+ u32 arg;
+
+ arg = (mask & (IN_ALL_EVENTS | IN_ISDIR | IN_UNMOUNT | IN_IGNORED));
+
+ arg |= ((mask >> 32) & FS_EVENTS_WITH_CHILD);
+
+ return arg;
+}
+
+
+int find_inode(const char __user *dirname, struct path *path, unsigned flags);
+void inotify_destroy_mark_entry(struct fsnotify_mark_entry *entry);
+void fsn_inotify_unmount_inodes(struct list_head *list);
+int inotify_update_watch(struct fsnotify_group *group, struct inode *inode, u32 arg);
+struct fsnotify_group *inotify_new_group(struct user_struct *user, unsigned int max_events);
+void __inotify_free_event_priv(struct inotify_event_private_data *event_priv);
+
+extern const struct fsnotify_ops inotify_fsnotify_ops;
diff --git a/fs/notify/inotify/inotify_fsnotify.c b/fs/notify/inotify/inotify_fsnotify.c
new file mode 100644
index 0000000..30c0a91
--- /dev/null
+++ b/fs/notify/inotify/inotify_fsnotify.c
@@ -0,0 +1,183 @@
+/*
+ * fs/notify/inotify/inotify_fsnotify.c - inotify support for userspace
+ *
+ * Authors:
+ * John McCutchan <[email protected]>
+ * Robert Love <[email protected]>
+ *
+ * Copyright (C) 2005 John McCutchan
+ * Copyright 2006 Hewlett-Packard Development Company, L.P.
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms of the GNU General Public License as published by the
+ * Free Software Foundation; either version 2, or (at your option) any
+ * later version.
+ *
+ * This program is distributed in the hope that it will be useful, but
+ * WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+ * General Public License for more details.
+ */
+
+#include <linux/kernel.h>
+#include <linux/sched.h>
+#include <linux/slab.h>
+#include <linux/fs.h>
+#include <linux/file.h>
+#include <linux/limits.h>
+#include <linux/module.h>
+#include <linux/mount.h>
+#include <linux/namei.h>
+#include <linux/poll.h>
+#include <linux/idr.h>
+#include <linux/init.h>
+#include <linux/inotify.h>
+#include <linux/list.h>
+#include <linux/syscalls.h>
+#include <linux/string.h>
+#include <linux/magic.h>
+#include <linux/writeback.h>
+
+#include "inotify.h"
+#include "../fsnotify.h"
+
+#include <asm/ioctls.h>
+
+static int inotify_event_to_notif(struct fsnotify_group *group, struct fsnotify_event *event)
+{
+ struct fsnotify_mark_entry *entry;
+ struct inode *to_tell;
+ struct inotify_event_private_data *event_priv;
+ struct inotify_mark_private_data *mark_priv;
+ int wd, ret = 0;
+
+ to_tell = event->to_tell;
+
+ spin_lock(&to_tell->i_lock);
+ entry = fsnotify_find_mark_entry(group, to_tell);
+ spin_unlock(&to_tell->i_lock);
+
+ /* race with watch removal? */
+ if (!entry)
+ return ret;
+
+ mark_priv = entry->private;
+ wd = mark_priv->wd;
+
+ fsnotify_put_mark(entry);
+
+ event_priv = kmem_cache_alloc(event_priv_cachep, GFP_KERNEL);
+ if (unlikely(!event_priv))
+ return -ENOMEM;
+
+ event_priv->fsnotify_event_priv_data.group = group;
+ event_priv->wd = wd;
+
+ ret = fsnotify_add_notif_event(group, event, (struct fsnotify_event_private_data *)event_priv);
+
+ return ret;
+}
+
+static void inotify_mark_clear_inode(struct fsnotify_mark_entry *entry, struct inode *inode, unsigned int flags)
+{
+ if (unlikely((flags != FSNOTIFY_LAST_DENTRY) && (flags != FSNOTIFY_INODE_DESTROY))) {
+ BUG();
+ return;
+ }
+
+ /*
+ * so no matter what we need to put this entry back on the inode's list.
+ * we need it there so fsnotify can find it to send the ignore message.
+ *
+ * I didn't realize how brilliant this was until I did it. Our caller
+ * blanked the inode->i_fsnotify_mark_entries list so we will be the
+ * only mark on the list when fsnotify runs so only our group will get
+ * this FS_IN_IGNORED.
+ *
+ * Bloody brilliant.
+ */
+ spin_lock(&inode->i_lock);
+ list_add(&entry->i_list, &inode->i_fsnotify_mark_entries);
+ spin_unlock(&inode->i_lock);
+
+ fsnotify(inode, FS_IN_IGNORED, inode, FSNOTIFY_EVENT_INODE, NULL, 0);
+ inotify_destroy_mark_entry(entry);
+}
+
+static int inotify_should_send_event(struct fsnotify_group *group, struct inode *inode, __u64 mask)
+{
+ struct fsnotify_mark_entry *entry;
+ int send;
+
+ entry = fsnotify_find_mark_entry(group, inode);
+ if (!entry)
+ return 0;
+
+ spin_lock(&entry->lock);
+ send = !!(entry->mask & mask);
+ spin_unlock(&entry->lock);
+
+ /* find took a reference */
+ fsnotify_put_mark(entry);
+
+ return send;
+}
+
+static void inotify_free_group_priv(struct fsnotify_group *group)
+{
+ struct inotify_group_private_data *grp_priv;
+
+ BUG_ON(!group->private);
+
+ grp_priv = group->private;
+ idr_destroy(&grp_priv->idr);
+
+ kmem_cache_free(grp_priv_cachep, group->private);
+ group->private = NULL;
+}
+
+void __inotify_free_event_priv(struct inotify_event_private_data *event_priv)
+{
+ list_del_init(&event_priv->fsnotify_event_priv_data.event_list);
+ kmem_cache_free(event_priv_cachep, event_priv);
+}
+
+static void inotify_free_event_priv(struct fsnotify_group *group, struct fsnotify_event *event)
+{
+ struct inotify_event_private_data *event_priv;
+
+ spin_lock(&event->lock);
+
+ event_priv = (struct inotify_event_private_data *)fsnotify_get_priv_from_event(group, event);
+ BUG_ON(!event_priv);
+
+ __inotify_free_event_priv(event_priv);
+
+ spin_unlock(&event->lock);
+}
+
+/* ding dong the mark is dead */
+static void inotify_free_mark_priv(struct fsnotify_mark_entry *entry)
+{
+ struct inotify_mark_private_data *mark_priv = entry->private;
+ struct inode *inode = mark_priv->inode;
+
+ BUG_ON(!entry->private);
+
+ mark_priv = entry->private;
+ inode = mark_priv->inode;
+
+ iput(inode);
+
+ kmem_cache_free(mark_priv_cachep, entry->private);
+ entry->private = NULL;
+}
+
+const struct fsnotify_ops inotify_fsnotify_ops = {
+ .event_to_notif = inotify_event_to_notif,
+ .mark_clear_inode = inotify_mark_clear_inode,
+ .should_send_event = inotify_should_send_event,
+ .free_group_priv = inotify_free_group_priv,
+ .free_event_priv = inotify_free_event_priv,
+ .free_mark_priv = inotify_free_mark_priv,
+};
diff --git a/fs/notify/inotify/inotify_kernel.c b/fs/notify/inotify/inotify_kernel.c
new file mode 100644
index 0000000..269fd87
--- /dev/null
+++ b/fs/notify/inotify/inotify_kernel.c
@@ -0,0 +1,293 @@
+/*
+ * fs/notify/inotify/inotify_kernel.c - inotify support for userspace
+ *
+ * Authors:
+ * John McCutchan <[email protected]>
+ * Robert Love <[email protected]>
+ *
+ * Copyright (C) 2005 John McCutchan
+ * Copyright 2006 Hewlett-Packard Development Company, L.P.
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms of the GNU General Public License as published by the
+ * Free Software Foundation; either version 2, or (at your option) any
+ * later version.
+ *
+ * This program is distributed in the hope that it will be useful, but
+ * WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+ * General Public License for more details.
+ */
+
+#include <linux/kernel.h>
+#include <linux/sched.h>
+#include <linux/slab.h>
+#include <linux/fs.h>
+#include <linux/file.h>
+#include <linux/limits.h>
+#include <linux/module.h>
+#include <linux/mount.h>
+#include <linux/namei.h>
+#include <linux/poll.h>
+#include <linux/idr.h>
+#include <linux/init.h>
+#include <linux/inotify.h>
+#include <linux/list.h>
+#include <linux/syscalls.h>
+#include <linux/string.h>
+#include <linux/magic.h>
+#include <linux/writeback.h>
+
+#include "inotify.h"
+#include "../fsnotify.h"
+
+#include <asm/ioctls.h>
+
+struct kmem_cache *grp_priv_cachep __read_mostly;
+struct kmem_cache *mark_priv_cachep __read_mostly;
+struct kmem_cache *event_priv_cachep __read_mostly;
+
+atomic_t inotify_grp_num;
+
+/*
+ * find_inode - resolve a user-given path to a specific inode
+ */
+int find_inode(const char __user *dirname, struct path *path, unsigned flags)
+{
+ int error;
+
+ error = user_path_at(AT_FDCWD, dirname, flags, path);
+ if (error)
+ return error;
+ /* you can only watch an inode if you have read permissions on it */
+ error = inode_permission(path->dentry->d_inode, MAY_READ);
+ if (error)
+ path_put(path);
+ return error;
+}
+
+void inotify_destroy_mark_entry(struct fsnotify_mark_entry *entry)
+{
+ struct fsnotify_group *group;
+ struct inotify_group_private_data *grp_priv;
+ struct inotify_mark_private_data *mark_priv;
+ struct idr *idr;
+ int wd;
+
+ spin_lock(&entry->lock);
+
+ mark_priv = entry->private;
+ wd = mark_priv->wd;
+
+ group = entry->group;
+ if (!group) {
+ /* racing with group tear down, let it do it */
+ spin_unlock(&entry->lock);
+ return;
+ }
+ grp_priv = group->private;
+ idr = &grp_priv->idr;
+ spin_lock(&group->mark_lock);
+ idr_remove(idr, wd);
+ spin_unlock(&group->mark_lock);
+
+ spin_unlock(&entry->lock);
+
+ /* mark the entry to die */
+ fsnotify_destroy_mark_by_entry(entry);
+
+ /* removed from idr, need to shoot it */
+ fsnotify_put_mark(entry);
+}
+
+/**
+ * fsn_inotify_unmount_inodes - an sb is unmounting. handle any watched inodes.
+ * @list: list of inodes being unmounted (sb->s_inodes)
+ *
+ * Called with inode_lock held, protecting the unmounting super block's list
+ * of inodes, and with iprune_mutex held, keeping shrink_icache_memory() at bay.
+ * We temporarily drop inode_lock, however, and CAN block.
+ */
+void fsn_inotify_unmount_inodes(struct list_head *list)
+{
+ struct inode *inode, *next_i, *need_iput = NULL;
+
+ list_for_each_entry_safe(inode, next_i, list, i_sb_list) {
+ struct inode *need_iput_tmp;
+
+ /*
+ * If i_count is zero, the inode cannot have any watches and
+ * doing an __iget/iput with MS_ACTIVE clear would actually
+ * evict all inodes with zero i_count from icache which is
+ * unnecessarily violent and may in fact be illegal to do.
+ */
+ if (!atomic_read(&inode->i_count))
+ continue;
+
+ /*
+ * We cannot __iget() an inode in state I_CLEAR, I_FREEING, or
+ * I_WILL_FREE which is fine because by that point the inode
+ * cannot have any associated watches.
+ */
+ if (inode->i_state & (I_CLEAR | I_FREEING | I_WILL_FREE))
+ continue;
+
+ need_iput_tmp = need_iput;
+ need_iput = NULL;
+ /* In case inotify_remove_watch_locked() drops a reference. */
+ if (inode != need_iput_tmp)
+ __iget(inode);
+ else
+ need_iput_tmp = NULL;
+ /* In case the dropping of a reference would nuke next_i. */
+ if ((&next_i->i_sb_list != list) &&
+ atomic_read(&next_i->i_count) &&
+ !(next_i->i_state & (I_CLEAR | I_FREEING |
+ I_WILL_FREE))) {
+ __iget(next_i);
+ need_iput = next_i;
+ }
+
+ /*
+ * We can safely drop inode_lock here because we hold
+ * references on both inode and next_i. Also no new inodes
+ * will be added since the umount has begun. Finally,
+ * iprune_mutex keeps shrink_icache_memory() away.
+ */
+ spin_unlock(&inode_lock);
+
+ if (need_iput_tmp)
+ iput(need_iput_tmp);
+
+ /* for each watch, send IN_UNMOUNT and then remove it */
+ fsnotify(inode, FS_IN_UNMOUNT, inode, FSNOTIFY_EVENT_INODE, NULL, 0);
+
+ fsnotify_inode_delete(inode);
+
+ iput(inode);
+
+ spin_lock(&inode_lock);
+ }
+}
+EXPORT_SYMBOL_GPL(fsn_inotify_unmount_inodes);
+
+int inotify_update_watch(struct fsnotify_group *group, struct inode *inode, u32 arg)
+{
+ struct fsnotify_mark_entry *entry;
+ struct inotify_group_private_data *grp_priv = group->private;
+ struct inotify_mark_private_data *mark_priv;
+ int ret = 0;
+ int add = (arg & IN_MASK_ADD);
+ __u64 mask;
+
+ /* don't allow invalid bits: we don't want flags set */
+ mask = inotify_arg_to_mask(arg);
+ if (unlikely(!mask))
+ return -EINVAL;
+
+ mark_priv = kmem_cache_alloc(mark_priv_cachep, GFP_KERNEL);
+ if (unlikely(!mark_priv))
+ return -ENOMEM;
+
+ /* this is slick, using 0 for mask gives me the entry */
+ entry = fsnotify_mark_add(group, inode, 0);
+ if (unlikely(!entry)) {
+ kmem_cache_free(mark_priv_cachep, mark_priv);
+ return -ENOMEM;
+ }
+
+retry:
+ if (entry->mask == 0) {
+ if (unlikely(!idr_pre_get(&grp_priv->idr, GFP_KERNEL)))
+ goto out_and_shoot;
+ }
+
+ spin_lock(&entry->lock);
+ if (entry->mask == 0) {
+ spin_lock(&group->mark_lock);
+ /* if entry is added to the idr we keep the reference obtained
+ * through fsnotify_mark_add. remember to drop this reference
+ * when entry is removed from idr */
+ ret = idr_get_new_above(&grp_priv->idr, entry, grp_priv->last_wd+1, &mark_priv->wd);
+ if (ret) {
+ spin_unlock(&group->mark_lock);
+ spin_unlock(&entry->lock);
+ if (ret == -EAGAIN)
+ goto retry;
+ goto out_and_shoot;
+ }
+ spin_unlock(&group->mark_lock);
+ /* this is a new entry, pin the inode */
+ __iget(inode);
+ mark_priv->inode = inode;
+ entry->private = mark_priv;
+ } else {
+ kmem_cache_free(mark_priv_cachep, mark_priv);
+ mark_priv = entry->private; /* existing watch: return its wd, not the freed one */
+ }
+ if (add)
+ entry->mask |= mask;
+ else
+ entry->mask = mask;
+
+ spin_unlock(&entry->lock);
+
+ /* update the inode with this new entry */
+ fsnotify_recalc_inode_mask(inode);
+
+ /* update the group mask with the new mask */
+ fsnotify_recalc_group_mask(group);
+
+ return mark_priv->wd;
+
+out_and_shoot:
+ /* see this isn't supposed to happen, just kill the watch */
+ fsnotify_destroy_mark_by_entry(entry);
+ kmem_cache_free(mark_priv_cachep, mark_priv);
+ fsnotify_put_mark(entry);
+ return ret;
+}
+
+struct fsnotify_group *inotify_new_group(struct user_struct *user, unsigned int max_events)
+{
+ struct fsnotify_group *group;
+ struct inotify_group_private_data *grp_priv;
+ unsigned int grp_num;
+
+ /* fsnotify_obtain_group took a reference to group, we put this when we kill the file in the end */
+ grp_num = (UINT_MAX - atomic_inc_return(&inotify_grp_num));
+ group = fsnotify_obtain_group(grp_num, grp_num, 0, &inotify_fsnotify_ops);
+ if (IS_ERR(group))
+ return group;
+
+ group->max_events = max_events;
+
+ grp_priv = kmem_cache_alloc(grp_priv_cachep, GFP_KERNEL);
+ if (unlikely(!grp_priv)) {
+ fsnotify_put_group(group);
+ return ERR_PTR(-ENOMEM);
+ }
+
+ idr_init(&grp_priv->idr);
+ grp_priv->last_wd = 0;
+ grp_priv->user = user;
+ grp_priv->fa = NULL;
+ group->private = (void *)grp_priv;
+
+ return group;
+}
+
+static int __init inotify_kernel_setup(void)
+{
+ grp_priv_cachep = kmem_cache_create("inotify_group_priv_cache",
+ sizeof(struct inotify_group_private_data),
+ 0, SLAB_PANIC, NULL);
+ mark_priv_cachep = kmem_cache_create("inotify_mark_priv_cache",
+ sizeof(struct inotify_mark_private_data),
+ 0, SLAB_PANIC, NULL);
+ event_priv_cachep = kmem_cache_create("inotify_event_priv_cache",
+ sizeof(struct inotify_event_private_data),
+ 0, SLAB_PANIC, NULL);
+ return 0;
+}
+subsys_initcall(inotify_kernel_setup);
diff --git a/fs/notify/inotify/inotify_user.c b/fs/notify/inotify/inotify_user.c
index d367e9b..2df65ff 100644
--- a/fs/notify/inotify/inotify_user.c
+++ b/fs/notify/inotify/inotify_user.c
@@ -24,90 +24,35 @@
#include <linux/slab.h>
#include <linux/fs.h>
#include <linux/file.h>
+#include <linux/limits.h>
+#include <linux/module.h>
#include <linux/mount.h>
#include <linux/namei.h>
#include <linux/poll.h>
+#include <linux/idr.h>
#include <linux/init.h>
-#include <linux/list.h>
#include <linux/inotify.h>
+#include <linux/list.h>
#include <linux/syscalls.h>
+#include <linux/string.h>
#include <linux/magic.h>
+#include <linux/writeback.h>

-#include <asm/ioctls.h>
+#include "inotify.h"
+#include "../fsnotify.h"

-static struct kmem_cache *watch_cachep __read_mostly;
-static struct kmem_cache *event_cachep __read_mostly;
+#include <asm/ioctls.h>

static struct vfsmount *inotify_mnt __read_mostly;

+/* this just sits here and wastes global memory. used to just pad userspace messages with zeros */
+static struct inotify_event nul_inotify_event;
+
/* these are configurable via /proc/sys/fs/inotify/ */
static int inotify_max_user_instances __read_mostly;
static int inotify_max_user_watches __read_mostly;
static int inotify_max_queued_events __read_mostly;

-/*
- * Lock ordering:
- *
- * inotify_dev->up_mutex (ensures we don't re-add the same watch)
- * inode->inotify_mutex (protects inode's watch list)
- * inotify_handle->mutex (protects inotify_handle's watch list)
- * inotify_dev->ev_mutex (protects device's event queue)
- */
-
-/*
- * Lifetimes of the main data structures:
- *
- * inotify_device: Lifetime is managed by reference count, from
- * sys_inotify_init() until release. Additional references can bump the count
- * via get_inotify_dev() and drop the count via put_inotify_dev().
- *
- * inotify_user_watch: Lifetime is from create_watch() to the receipt of an
- * IN_IGNORED event from inotify, or when using IN_ONESHOT, to receipt of the
- * first event, or to inotify_destroy().
- */
-
-/*
- * struct inotify_device - represents an inotify instance
- *
- * This structure is protected by the mutex 'mutex'.
- */
-struct inotify_device {
- wait_queue_head_t wq; /* wait queue for i/o */
- struct mutex ev_mutex; /* protects event queue */
- struct mutex up_mutex; /* synchronizes watch updates */
- struct list_head events; /* list of queued events */
- atomic_t count; /* reference count */
- struct user_struct *user; /* user who opened this dev */
- struct inotify_handle *ih; /* inotify handle */
- struct fasync_struct *fa; /* async notification */
- unsigned int queue_size; /* size of the queue (bytes) */
- unsigned int event_count; /* number of pending events */
- unsigned int max_events; /* maximum number of events */
-};
-
-/*
- * struct inotify_kernel_event - An inotify event, originating from a watch and
- * queued for user-space. A list of these is attached to each instance of the
- * device. In read(), this list is walked and all events that can fit in the
- * buffer are returned.
- *
- * Protected by dev->ev_mutex of the device in which we are queued.
- */
-struct inotify_kernel_event {
- struct inotify_event event; /* the user-space event */
- struct list_head list; /* entry in inotify_device's list */
- char *name; /* filename, if any */
-};
-
-/*
- * struct inotify_user_watch - our version of an inotify_watch, we add
- * a reference to the associated inotify_device.
- */
-struct inotify_user_watch {
- struct inotify_device *dev; /* associated device */
- struct inotify_watch wdata; /* inotify watch data */
-};
-
#ifdef CONFIG_SYSCTL

#include <linux/sysctl.h>
@@ -149,280 +94,17 @@ ctl_table inotify_table[] = {
};
#endif /* CONFIG_SYSCTL */

-static inline void get_inotify_dev(struct inotify_device *dev)
-{
- atomic_inc(&dev->count);
-}
-
-static inline void put_inotify_dev(struct inotify_device *dev)
-{
- if (atomic_dec_and_test(&dev->count)) {
- atomic_dec(&dev->user->inotify_devs);
- free_uid(dev->user);
- kfree(dev);
- }
-}
-
-/*
- * free_inotify_user_watch - cleans up the watch and its references
- */
-static void free_inotify_user_watch(struct inotify_watch *w)
-{
- struct inotify_user_watch *watch;
- struct inotify_device *dev;
-
- watch = container_of(w, struct inotify_user_watch, wdata);
- dev = watch->dev;
-
- atomic_dec(&dev->user->inotify_watches);
- put_inotify_dev(dev);
- kmem_cache_free(watch_cachep, watch);
-}
-
-/*
- * kernel_event - create a new kernel event with the given parameters
- *
- * This function can sleep.
- */
-static struct inotify_kernel_event * kernel_event(s32 wd, u32 mask, u32 cookie,
- const char *name)
-{
- struct inotify_kernel_event *kevent;
-
- kevent = kmem_cache_alloc(event_cachep, GFP_NOFS);
- if (unlikely(!kevent))
- return NULL;
-
- /* we hand this out to user-space, so zero it just in case */
- memset(&kevent->event, 0, sizeof(struct inotify_event));
-
- kevent->event.wd = wd;
- kevent->event.mask = mask;
- kevent->event.cookie = cookie;
-
- INIT_LIST_HEAD(&kevent->list);
-
- if (name) {
- size_t len, rem, event_size = sizeof(struct inotify_event);
-
- /*
- * We need to pad the filename so as to properly align an
- * array of inotify_event structures. Because the structure is
- * small and the common case is a small filename, we just round
- * up to the next multiple of the structure's sizeof. This is
- * simple and safe for all architectures.
- */
- len = strlen(name) + 1;
- rem = event_size - len;
- if (len > event_size) {
- rem = event_size - (len % event_size);
- if (len % event_size == 0)
- rem = 0;
- }
-
- kevent->name = kmalloc(len + rem, GFP_KERNEL);
- if (unlikely(!kevent->name)) {
- kmem_cache_free(event_cachep, kevent);
- return NULL;
- }
- memcpy(kevent->name, name, len);
- if (rem)
- memset(kevent->name + len, 0, rem);
- kevent->event.len = len + rem;
- } else {
- kevent->event.len = 0;
- kevent->name = NULL;
- }
-
- return kevent;
-}
-
-/*
- * inotify_dev_get_event - return the next event in the given dev's queue
- *
- * Caller must hold dev->ev_mutex.
- */
-static inline struct inotify_kernel_event *
-inotify_dev_get_event(struct inotify_device *dev)
-{
- return list_entry(dev->events.next, struct inotify_kernel_event, list);
-}
-
-/*
- * inotify_dev_get_last_event - return the last event in the given dev's queue
- *
- * Caller must hold dev->ev_mutex.
- */
-static inline struct inotify_kernel_event *
-inotify_dev_get_last_event(struct inotify_device *dev)
-{
- if (list_empty(&dev->events))
- return NULL;
- return list_entry(dev->events.prev, struct inotify_kernel_event, list);
-}
-
-/*
- * inotify_dev_queue_event - event handler registered with core inotify, adds
- * a new event to the given device
- *
- * Can sleep (calls kernel_event()).
- */
-static void inotify_dev_queue_event(struct inotify_watch *w, u32 wd, u32 mask,
- u32 cookie, const char *name,
- struct inode *ignored)
-{
- struct inotify_user_watch *watch;
- struct inotify_device *dev;
- struct inotify_kernel_event *kevent, *last;
-
- watch = container_of(w, struct inotify_user_watch, wdata);
- dev = watch->dev;
-
- mutex_lock(&dev->ev_mutex);
-
- /* we can safely put the watch as we don't reference it while
- * generating the event
- */
- if (mask & IN_IGNORED || w->mask & IN_ONESHOT)
- put_inotify_watch(w); /* final put */
-
- /* coalescing: drop this event if it is a dupe of the previous */
- last = inotify_dev_get_last_event(dev);
- if (last && last->event.mask == mask && last->event.wd == wd &&
- last->event.cookie == cookie) {
- const char *lastname = last->name;
-
- if (!name && !lastname)
- goto out;
- if (name && lastname && !strcmp(lastname, name))
- goto out;
- }
-
- /* the queue overflowed and we already sent the Q_OVERFLOW event */
- if (unlikely(dev->event_count > dev->max_events))
- goto out;
-
- /* if the queue overflows, we need to notify user space */
- if (unlikely(dev->event_count == dev->max_events))
- kevent = kernel_event(-1, IN_Q_OVERFLOW, cookie, NULL);
- else
- kevent = kernel_event(wd, mask, cookie, name);
-
- if (unlikely(!kevent))
- goto out;
-
- /* queue the event and wake up anyone waiting */
- dev->event_count++;
- dev->queue_size += sizeof(struct inotify_event) + kevent->event.len;
- list_add_tail(&kevent->list, &dev->events);
- wake_up_interruptible(&dev->wq);
- kill_fasync(&dev->fa, SIGIO, POLL_IN);
-
-out:
- mutex_unlock(&dev->ev_mutex);
-}
-
-/*
- * remove_kevent - cleans up the given kevent
- *
- * Caller must hold dev->ev_mutex.
- */
-static void remove_kevent(struct inotify_device *dev,
- struct inotify_kernel_event *kevent)
-{
- list_del(&kevent->list);
-
- dev->event_count--;
- dev->queue_size -= sizeof(struct inotify_event) + kevent->event.len;
-}
-
-/*
- * free_kevent - frees the given kevent.
- */
-static void free_kevent(struct inotify_kernel_event *kevent)
-{
- kfree(kevent->name);
- kmem_cache_free(event_cachep, kevent);
-}
-
-/*
- * inotify_dev_event_dequeue - destroy an event on the given device
- *
- * Caller must hold dev->ev_mutex.
- */
-static void inotify_dev_event_dequeue(struct inotify_device *dev)
-{
- if (!list_empty(&dev->events)) {
- struct inotify_kernel_event *kevent;
- kevent = inotify_dev_get_event(dev);
- remove_kevent(dev, kevent);
- free_kevent(kevent);
- }
-}
-
-/*
- * find_inode - resolve a user-given path to a specific inode
- */
-static int find_inode(const char __user *dirname, struct path *path,
- unsigned flags)
-{
- int error;
-
- error = user_path_at(AT_FDCWD, dirname, flags, path);
- if (error)
- return error;
- /* you can only watch an inode if you have read permissions on it */
- error = inode_permission(path->dentry->d_inode, MAY_READ);
- if (error)
- path_put(path);
- return error;
-}
-
-/*
- * create_watch - creates a watch on the given device.
- *
- * Callers must hold dev->up_mutex.
- */
-static int create_watch(struct inotify_device *dev, struct inode *inode,
- u32 mask)
-{
- struct inotify_user_watch *watch;
- int ret;
-
- if (atomic_read(&dev->user->inotify_watches) >=
- inotify_max_user_watches)
- return -ENOSPC;
-
- watch = kmem_cache_alloc(watch_cachep, GFP_KERNEL);
- if (unlikely(!watch))
- return -ENOMEM;
-
- /* save a reference to device and bump the count to make it official */
- get_inotify_dev(dev);
- watch->dev = dev;
-
- atomic_inc(&dev->user->inotify_watches);
-
- inotify_init_watch(&watch->wdata);
- ret = inotify_add_watch(dev->ih, &watch->wdata, inode, mask);
- if (ret < 0)
- free_inotify_user_watch(&watch->wdata);
-
- return ret;
-}
-
-/* Device Interface */
-
+/* inotify userspace file descriptor functions */
static unsigned int inotify_poll(struct file *file, poll_table *wait)
{
- struct inotify_device *dev = file->private_data;
+ struct fsnotify_group *group = file->private_data;
int ret = 0;

- poll_wait(file, &dev->wq, wait);
- mutex_lock(&dev->ev_mutex);
- if (!list_empty(&dev->events))
+ poll_wait(file, &group->notification_waitq, wait);
+ mutex_lock(&group->notification_mutex);
+ if (fsnotify_check_notif_queue(group))
ret = POLLIN | POLLRDNORM;
- mutex_unlock(&dev->ev_mutex);
+ mutex_unlock(&group->notification_mutex);

return ret;
}
@@ -430,25 +112,26 @@ static unsigned int inotify_poll(struct file *file, poll_table *wait)
static ssize_t inotify_read(struct file *file, char __user *buf,
size_t count, loff_t *pos)
{
- size_t event_size = sizeof (struct inotify_event);
- struct inotify_device *dev;
+ struct fsnotify_group *group;
+ struct inotify_event inotify_event;
+ const size_t event_size = sizeof (struct inotify_event);
char __user *start;
int ret;
DEFINE_WAIT(wait);

start = buf;
- dev = file->private_data;
+ group = file->private_data;

while (1) {

- prepare_to_wait(&dev->wq, &wait, TASK_INTERRUPTIBLE);
+ prepare_to_wait(&group->notification_waitq, &wait, TASK_INTERRUPTIBLE);

- mutex_lock(&dev->ev_mutex);
- if (!list_empty(&dev->events)) {
+ mutex_lock(&group->notification_mutex);
+ if (fsnotify_check_notif_queue(group)) {
ret = 0;
break;
}
- mutex_unlock(&dev->ev_mutex);
+ mutex_unlock(&group->notification_mutex);

if (file->f_flags & O_NONBLOCK) {
ret = -EAGAIN;
@@ -456,26 +139,38 @@ static ssize_t inotify_read(struct file *file, char __user *buf,
}

if (signal_pending(current)) {
- ret = -EINTR;
+ ret = -ERESTARTSYS;
break;
}

schedule();
}

- finish_wait(&dev->wq, &wait);
+ finish_wait(&group->notification_waitq, &wait);
if (ret)
return ret;

while (1) {
- struct inotify_kernel_event *kevent;
+ struct fsnotify_event *event;
+ struct inotify_event_private_data *priv;
+ size_t name_to_send_len;

ret = buf - start;
- if (list_empty(&dev->events))
+
+ if (!fsnotify_check_notif_queue(group))
break;

- kevent = inotify_dev_get_event(dev);
- if (event_size + kevent->event.len > count) {
+ event = fsnotify_peek_notif_event(group);
+
+ spin_lock(&event->lock);
+ priv = (struct inotify_event_private_data *)fsnotify_get_priv_from_event(group, event);
+ spin_unlock(&event->lock);
+ BUG_ON(!priv);
+
+ name_to_send_len = roundup(event->name_len, event_size);
+
+ /* the above is closer, since it sends filenames */
+ if (event_size + name_to_send_len > count) {
if (ret == 0 && count > 0) {
/*
* could not get a single event because we
@@ -485,60 +180,94 @@ static ssize_t inotify_read(struct file *file, char __user *buf,
}
break;
}
- remove_kevent(dev, kevent);
+
+ /* held the notification_mutex the whole time, so this is the
+ * same event we peeked above */
+ fsnotify_remove_notif_event(group);

/*
* Must perform the copy_to_user outside the mutex in order
* to avoid a lock order reversal with mmap_sem.
*/
- mutex_unlock(&dev->ev_mutex);
+ mutex_unlock(&group->notification_mutex);
+
+ memset(&inotify_event, 0, sizeof(struct inotify_event));
+
+ inotify_event.wd = priv->wd;
+ inotify_event.mask = inotify_mask_to_arg(event->mask);
+ inotify_event.cookie = event->sync_cookie;
+ inotify_event.len = name_to_send_len;
+
+ spin_lock(&event->lock);
+ __inotify_free_event_priv(priv);
+ spin_unlock(&event->lock);

- if (copy_to_user(buf, &kevent->event, event_size)) {
+ if (copy_to_user(buf, &inotify_event, event_size)) {
ret = -EFAULT;
break;
}
buf += event_size;
count -= event_size;

- if (kevent->name) {
- if (copy_to_user(buf, kevent->name, kevent->event.len)){
+ if (name_to_send_len) {
+ unsigned int len_to_zero = name_to_send_len - event->name_len;
+ /* copy the path name */
+ if (copy_to_user(buf, event->file_name, event->name_len)) {
ret = -EFAULT;
break;
}
- buf += kevent->event.len;
- count -= kevent->event.len;
+ buf += event->name_len;
+ count -= event->name_len;
+ /* fill userspace with 0's from nul_inotify_event */
+ if (copy_to_user(buf, &nul_inotify_event, len_to_zero)) {
+ ret = -EFAULT;
+ break;
+ }
+ buf += len_to_zero;
+ count -= len_to_zero;
}

- free_kevent(kevent);
+ fsnotify_put_event(event);

- mutex_lock(&dev->ev_mutex);
+ mutex_lock(&group->notification_mutex);
}
- mutex_unlock(&dev->ev_mutex);
+ mutex_unlock(&group->notification_mutex);

return ret;
}

static int inotify_fasync(int fd, struct file *file, int on)
{
- struct inotify_device *dev = file->private_data;
+ struct fsnotify_group *group = file->private_data;
+ struct inotify_group_private_data *priv = group->private;

- return fasync_helper(fd, file, on, &dev->fa) >= 0 ? 0 : -EIO;
+ return fasync_helper(fd, file, on, &priv->fa) >= 0 ? 0 : -EIO;
}

static int inotify_release(struct inode *ignored, struct file *file)
{
- struct inotify_device *dev = file->private_data;
+ struct fsnotify_group *group = file->private_data;
+ struct fsnotify_mark_entry *entry;

- inotify_destroy(dev->ih);
+ /* run all the entries remove them from the idr and drop that ref */
+ spin_lock(&group->mark_lock);
+ while(!list_empty(&group->mark_entries)) {
+ entry = list_first_entry(&group->mark_entries, struct fsnotify_mark_entry, g_list);

- /* destroy all of the events on this device */
- mutex_lock(&dev->ev_mutex);
- while (!list_empty(&dev->events))
- inotify_dev_event_dequeue(dev);
- mutex_unlock(&dev->ev_mutex);
+ /* make sure entry can't get freed */
+ fsnotify_get_mark(entry);
+ spin_unlock(&group->mark_lock);

- /* free this device: the put matching the get in inotify_init() */
- put_inotify_dev(dev);
+ inotify_destroy_mark_entry(entry);
+
+ /* ok, free it */
+ fsnotify_put_mark(entry);
+ spin_lock(&group->mark_lock);
+ }
+ spin_unlock(&group->mark_lock);
+
+ /* free this group, matching get was inotify_init->fsnotify_obtain_group */
+ fsnotify_put_group(group);

return 0;
}
@@ -546,16 +275,25 @@ static int inotify_release(struct inode *ignored, struct file *file)
static long inotify_ioctl(struct file *file, unsigned int cmd,
unsigned long arg)
{
- struct inotify_device *dev;
+ struct fsnotify_group *group;
+ struct fsnotify_event_holder *holder;
+ struct fsnotify_event *event;
void __user *p;
int ret = -ENOTTY;
+ size_t send_len = 0;

- dev = file->private_data;
+ group = file->private_data;
p = (void __user *) arg;

switch (cmd) {
case FIONREAD:
- ret = put_user(dev->queue_size, (int __user *) p);
+ mutex_lock(&group->notification_mutex);
+ list_for_each_entry(holder, &group->notification_list, event_list) {
+ event = holder->event;
+ send_len += sizeof(struct inotify_event) + event->name_len;
+ }
+ mutex_unlock(&group->notification_mutex);
+ ret = put_user(send_len, (int __user *) p);
break;
}

@@ -563,23 +301,18 @@ static long inotify_ioctl(struct file *file, unsigned int cmd,
}

static const struct file_operations inotify_fops = {
- .poll = inotify_poll,
- .read = inotify_read,
- .fasync = inotify_fasync,
- .release = inotify_release,
- .unlocked_ioctl = inotify_ioctl,
+ .poll = inotify_poll,
+ .read = inotify_read,
+ .fasync = inotify_fasync,
+ .release = inotify_release,
+ .unlocked_ioctl = inotify_ioctl,
.compat_ioctl = inotify_ioctl,
};

-static const struct inotify_operations inotify_user_ops = {
- .handle_event = inotify_dev_queue_event,
- .destroy_watch = free_inotify_user_watch,
-};
-
+/* inotify syscalls */
asmlinkage long sys_inotify_init1(int flags)
{
- struct inotify_device *dev;
- struct inotify_handle *ih;
+ struct fsnotify_group *group;
struct user_struct *user;
struct file *filp;
int fd, ret;
@@ -608,45 +341,27 @@ asmlinkage long sys_inotify_init1(int flags)
goto out_free_uid;
}

- dev = kmalloc(sizeof(struct inotify_device), GFP_KERNEL);
- if (unlikely(!dev)) {
- ret = -ENOMEM;
+ /* fsnotify_obtain_group took a reference to group, we put this when we kill the file in the end */
+ group = inotify_new_group(user, inotify_max_queued_events);
+ if (IS_ERR(group)) {
+ ret = PTR_ERR(group);
goto out_free_uid;
}

- ih = inotify_init(&inotify_user_ops);
- if (IS_ERR(ih)) {
- ret = PTR_ERR(ih);
- goto out_free_dev;
- }
- dev->ih = ih;
- dev->fa = NULL;
-
filp->f_op = &inotify_fops;
filp->f_path.mnt = mntget(inotify_mnt);
filp->f_path.dentry = dget(inotify_mnt->mnt_root);
filp->f_mapping = filp->f_path.dentry->d_inode->i_mapping;
filp->f_mode = FMODE_READ;
filp->f_flags = O_RDONLY | (flags & O_NONBLOCK);
- filp->private_data = dev;
-
- INIT_LIST_HEAD(&dev->events);
- init_waitqueue_head(&dev->wq);
- mutex_init(&dev->ev_mutex);
- mutex_init(&dev->up_mutex);
- dev->event_count = 0;
- dev->queue_size = 0;
- dev->max_events = inotify_max_queued_events;
- dev->user = user;
- atomic_set(&dev->count, 0);
-
- get_inotify_dev(dev);
+ filp->private_data = group;
+
atomic_inc(&user->inotify_devs);
+
fd_install(fd, filp);

return fd;
-out_free_dev:
- kfree(dev);
+
out_free_uid:
free_uid(user);
put_filp(filp);
@@ -662,8 +377,8 @@ asmlinkage long sys_inotify_init(void)

asmlinkage long sys_inotify_add_watch(int fd, const char __user *pathname, u32 mask)
{
+ struct fsnotify_group *group;
struct inode *inode;
- struct inotify_device *dev;
struct path path;
struct file *filp;
int ret, fput_needed;
@@ -685,19 +400,19 @@ asmlinkage long sys_inotify_add_watch(int fd, const char __user *pathname, u32 m
flags |= LOOKUP_DIRECTORY;

ret = find_inode(pathname, &path, flags);
- if (unlikely(ret))
+ if (ret)
goto fput_and_out;

- /* inode held in place by reference to path; dev by fget on fd */
+ /* inode held in place by reference to path; group by fget on fd */
inode = path.dentry->d_inode;
- dev = filp->private_data;
+ group = filp->private_data;

- mutex_lock(&dev->up_mutex);
- ret = inotify_find_update_watch(dev->ih, inode, mask);
- if (ret == -ENOENT)
- ret = create_watch(dev, inode, mask);
- mutex_unlock(&dev->up_mutex);
+ /* create/update an inode mark */
+ ret = inotify_update_watch(group, inode, mask);
+ if (unlikely(ret))
+ goto path_put_and_out;

+path_put_and_out:
path_put(&path);
fput_and_out:
fput_light(filp, fput_needed);
@@ -706,9 +421,11 @@ fput_and_out:

asmlinkage long sys_inotify_rm_watch(int fd, u32 wd)
{
+ struct fsnotify_group *group;
+ struct fsnotify_mark_entry *entry;
+ struct inotify_group_private_data *priv;
struct file *filp;
- struct inotify_device *dev;
- int ret, fput_needed;
+ int ret = 0, fput_needed;

filp = fget_light(fd, &fput_needed);
if (unlikely(!filp))
@@ -720,10 +437,22 @@ asmlinkage long sys_inotify_rm_watch(int fd, u32 wd)
goto out;
}

- dev = filp->private_data;
+ group = filp->private_data;
+ priv = group->private;

+ spin_lock(&group->mark_lock);
/* we free our watch data when we get IN_IGNORED */
- ret = inotify_rm_wd(dev->ih, wd);
+ entry = idr_find(&priv->idr, wd);
+ if (unlikely(!entry)) {
+ spin_unlock(&group->mark_lock);
+ ret = -EINVAL;
+ goto out;
+ }
+ fsnotify_get_mark(entry);
+ spin_unlock(&group->mark_lock);
+
+ inotify_destroy_mark_entry(entry);
+ fsnotify_put_mark(entry);

out:
fput_light(filp, fput_needed);
@@ -739,9 +468,9 @@ inotify_get_sb(struct file_system_type *fs_type, int flags,
}

static struct file_system_type inotify_fs_type = {
- .name = "inotifyfs",
- .get_sb = inotify_get_sb,
- .kill_sb = kill_anon_super,
+ .name = "inotifyfs",
+ .get_sb = inotify_get_sb,
+ .kill_sb = kill_anon_super,
};

/*
@@ -765,14 +494,6 @@ static int __init inotify_user_setup(void)
inotify_max_user_instances = 128;
inotify_max_user_watches = 8192;

- watch_cachep = kmem_cache_create("inotify_watch_cache",
- sizeof(struct inotify_user_watch),
- 0, SLAB_PANIC, NULL);
- event_cachep = kmem_cache_create("inotify_event_cache",
- sizeof(struct inotify_kernel_event),
- 0, SLAB_PANIC, NULL);
-
return 0;
}
-
module_init(inotify_user_setup);
diff --git a/include/linux/fsnotify.h b/include/linux/fsnotify.h
index c1a7b61..3d10004 100644
--- a/include/linux/fsnotify.h
+++ b/include/linux/fsnotify.h
@@ -257,10 +257,10 @@ static inline void fsnotify_access(struct file *file)
{
struct dentry *dentry = file->f_path.dentry;
struct inode *inode = dentry->d_inode;
- __u64 mask = IN_ACCESS;
+ __u64 mask = FS_ACCESS;

if (S_ISDIR(inode->i_mode))
- mask |= IN_ISDIR;
+ mask |= FS_IN_ISDIR;

inotify_dentry_parent_queue_event(dentry, mask, 0, dentry->d_name.name);
inotify_inode_queue_event(inode, mask, 0, NULL, NULL);
@@ -276,10 +276,10 @@ static inline void fsnotify_modify(struct file *file)
{
struct dentry *dentry = file->f_path.dentry;
struct inode *inode = dentry->d_inode;
- __u64 mask = IN_MODIFY;
+ __u64 mask = FS_MODIFY;

if (S_ISDIR(inode->i_mode))
- mask |= IN_ISDIR;
+ mask |= FS_IN_ISDIR;

inotify_dentry_parent_queue_event(dentry, mask, 0, dentry->d_name.name);
inotify_inode_queue_event(inode, mask, 0, NULL, NULL);
@@ -310,10 +310,10 @@ static inline void fsnotify_open(struct file *file)
{
struct dentry *dentry = file->f_path.dentry;
struct inode *inode = dentry->d_inode;
- __u64 mask = IN_OPEN;
+ __u64 mask = FS_OPEN;

if (S_ISDIR(inode->i_mode))
- mask |= IN_ISDIR;
+ mask |= FS_IN_ISDIR;

inotify_dentry_parent_queue_event(dentry, mask, 0, dentry->d_name.name);
inotify_inode_queue_event(inode, mask, 0, NULL, NULL);
@@ -329,14 +329,13 @@ static inline void fsnotify_close(struct file *file)
{
struct dentry *dentry = file->f_path.dentry;
struct inode *inode = dentry->d_inode;
- const char *name = dentry->d_name.name;
fmode_t mode = file->f_mode;
- __u64 mask = (mode & FMODE_WRITE) ? IN_CLOSE_WRITE : IN_CLOSE_NOWRITE;
+ __u64 mask = (mode & FMODE_WRITE) ? FS_CLOSE_WRITE : FS_CLOSE_NOWRITE;

if (S_ISDIR(inode->i_mode))
- mask |= IN_ISDIR;
+ mask |= FS_IN_ISDIR;

- inotify_dentry_parent_queue_event(dentry, mask, 0, name);
+ inotify_dentry_parent_queue_event(dentry, mask, 0, dentry->d_name.name);
inotify_inode_queue_event(inode, mask, 0, NULL, NULL);

fsnotify_parent(dentry, mask);
@@ -349,10 +348,10 @@ static inline void fsnotify_close(struct file *file)
static inline void fsnotify_xattr(struct dentry *dentry)
{
struct inode *inode = dentry->d_inode;
- __u64 mask = IN_ATTRIB;
+ __u64 mask = FS_ATTRIB;

if (S_ISDIR(inode->i_mode))
- mask |= IN_ISDIR;
+ mask |= FS_IN_ISDIR;

inotify_dentry_parent_queue_event(dentry, mask, 0, dentry->d_name.name);
inotify_inode_queue_event(inode, mask, 0, NULL, NULL);
@@ -371,26 +370,26 @@ static inline void fsnotify_change(struct dentry *dentry, unsigned int ia_valid)
__u64 mask = 0;

if (ia_valid & ATTR_UID)
- mask |= IN_ATTRIB;
+ mask |= FS_ATTRIB;
if (ia_valid & ATTR_GID)
- mask |= IN_ATTRIB;
+ mask |= FS_ATTRIB;
if (ia_valid & ATTR_SIZE)
- mask |= IN_MODIFY;
+ mask |= FS_MODIFY;

/* both times implies a utime(s) call */
if ((ia_valid & (ATTR_ATIME | ATTR_MTIME)) == (ATTR_ATIME | ATTR_MTIME))
- mask |= IN_ATTRIB;
+ mask |= FS_ATTRIB;
else if (ia_valid & ATTR_ATIME)
- mask |= IN_ACCESS;
+ mask |= FS_ACCESS;
else if (ia_valid & ATTR_MTIME)
- mask |= IN_MODIFY;
+ mask |= FS_MODIFY;

if (ia_valid & ATTR_MODE)
- mask |= IN_ATTRIB;
+ mask |= FS_ATTRIB;

if (mask) {
if (S_ISDIR(inode->i_mode))
- mask |= IN_ISDIR;
+ mask |= FS_IN_ISDIR;
inotify_inode_queue_event(inode, mask, 0, NULL, NULL);
inotify_dentry_parent_queue_event(dentry, mask, 0,
dentry->d_name.name);
diff --git a/include/linux/inotify.h b/include/linux/inotify.h
index 37ea289..084d1c1 100644
--- a/include/linux/inotify.h
+++ b/include/linux/inotify.h
@@ -112,6 +112,7 @@ extern void inotify_inode_queue_event(struct inode *, __u32, __u32,
extern void inotify_dentry_parent_queue_event(struct dentry *, __u32, __u32,
const char *);
extern void inotify_unmount_inodes(struct list_head *);
+extern void fsn_inotify_unmount_inodes(struct list_head *);
extern void inotify_inode_is_dead(struct inode *);
extern u32 inotify_get_cookie(void);

2008-12-13 02:54:28

by Evgeniy Polyakov

Subject: Re: [RFC PATCH -v4 05/14] fsnotify: unified filesystem notification backend

Hi.

On Fri, Dec 12, 2008 at 04:51:40PM -0500, Eric Paris wrote:
> +DEFINE_MUTEX(fsnotify_grp_mutex);
> +struct srcu_struct fsnotify_grp_srcu_struct;
> +LIST_HEAD(fsnotify_groups);
> +__u64 fsnotify_mask;

Those can be static.

> +struct fsnotify_event *fsnotify_create_event(struct inode *to_tell, __u64 mask, void *data, int data_is)
> +{
> + struct fsnotify_event *event;
> +
> + event = kmem_cache_alloc(event_kmem_cache, GFP_KERNEL);
> + if (!event)
> + return NULL;
> +
> + atomic_set(&event->refcnt, 1);
> +
> + spin_lock_init(&event->lock);
> +
> + event->path.dentry = NULL;
> + event->path.mnt = NULL;
> + event->inode = NULL;
> +
> + INIT_LIST_HEAD(&event->private_data_list);
> +
> + event->to_tell = to_tell;

What prevents this inode from being released?
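
For illustration only, here is a minimal userspace sketch of the usual answer to that kind of question: the object that stores the pointer takes its own reference on the target and drops it on teardown. The names `struct obj`, `struct event`, `obj_get()`, `obj_put()`, `event_create()` and `event_destroy()` are hypothetical stand-ins, not the fsnotify or VFS API, and the patch under review may solve this differently.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdlib.h>

/* Hypothetical stand-in for an inode: only a reference count. */
struct obj {
	atomic_int refcnt;
	int freed;	/* set instead of freeing, so a caller can observe it */
};

/* Hypothetical stand-in for an event that remembers its target. */
struct event {
	struct obj *to_tell;
};

static void obj_get(struct obj *o)
{
	atomic_fetch_add(&o->refcnt, 1);
}

static void obj_put(struct obj *o)
{
	/* atomic_fetch_sub() returns the old value; 1 means last reference */
	if (atomic_fetch_sub(&o->refcnt, 1) == 1)
		o->freed = 1;
}

/* The event pins its target for as long as the event itself exists. */
static struct event *event_create(struct obj *to_tell)
{
	struct event *ev = malloc(sizeof(*ev));

	if (!ev)
		return NULL;
	obj_get(to_tell);
	ev->to_tell = to_tell;
	return ev;
}

static void event_destroy(struct event *ev)
{
	obj_put(ev->to_tell);
	free(ev);
}
```

With this shape the original holder can drop its reference while the event is still queued, and the target stays alive until `event_destroy()` runs.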

--
Evgeniy Polyakov

2008-12-13 03:07:20

by Evgeniy Polyakov

Subject: Re: [RFC PATCH -v4 07/14] fsnotify: add in inode fsnotify markings

On Fri, Dec 12, 2008 at 04:51:51PM -0500, Eric Paris wrote:
> +void fsnotify_put_mark(struct fsnotify_mark_entry *entry)
> +{
> + if (atomic_dec_and_test(&entry->refcnt)) {
> + spin_lock(&entry->lock);
> + /* entries can only be found by the kernel by searching the
> + * inode->i_fsnotify_entries or the group->mark_entries lists.
> + * if freeme is set that means this entry is off both lists.
> + * if refcnt is 0 that means we are the last thing still
> + * looking at this entry, so its time to free.
> + */
> + if (!atomic_read(&entry->refcnt) && entry->freeme) {

Why do you check refcnt again here? Does that mean the entry's life cycle
does not end when the count hits zero, and that it may only be freed under
the lock? That in turn would mean the entry can still be accessed under
that lock while its refcnt is zero. Please describe this locking in more
detail.

> + spin_unlock(&entry->lock);
> + fsnotify_destroy_mark(entry);
> + return;
> + }
> + spin_unlock(&entry->lock);
> + }
> +}
> +
> +void fsnotify_clear_marks_by_group(struct fsnotify_group *group)
> +{
> + struct fsnotify_mark_entry *lentry, *entry;
> + struct inode *inode;
> + LIST_HEAD(free_list);
> +
> + spin_lock(&group->mark_lock);
> + list_for_each_entry_safe(entry, lentry, &group->mark_entries, g_list) {
> + list_del_init(&entry->g_list);
> + list_add(&entry->free_g_list, &free_list);
> + }
> + spin_unlock(&group->mark_lock);
> +
> + list_for_each_entry_safe(entry, lentry, &free_list, free_g_list) {
> + fsnotify_get_mark(entry);
> + spin_lock(&entry->lock);
> + inode = entry->inode;
> + if (!inode) {

This inode does not seem to have been grabbed earlier, or did I miss that
part? What prevents it from being freed?

> + entry->group = NULL;
> + spin_unlock(&entry->lock);
> + fsnotify_put_mark(entry);
> + continue;
> + }
> + spin_lock(&inode->i_lock);
> +
> + list_del_init(&entry->i_list);
> + entry->inode = NULL;
> + list_del_init(&entry->g_list);
> + entry->group = NULL;
> + entry->freeme = 1;
> +
> + spin_unlock(&inode->i_lock);
> + spin_unlock(&entry->lock);
> +
> + fsnotify_put_mark(entry);
> + }
> +}

> +void fsnotify_clear_marks_by_inode(struct inode *inode, unsigned int flags)
> +{
> + struct fsnotify_mark_entry *lentry, *entry;
> + LIST_HEAD(free_list);
> +
> + spin_lock(&inode->i_lock);
> + list_for_each_entry_safe(entry, lentry, &inode->i_fsnotify_mark_entries, i_list) {
> + list_del_init(&entry->i_list);
> + list_add(&entry->free_i_list, &free_list);
> + }
> + spin_unlock(&inode->i_lock);
> +
> + /*
> + * at this point destroy_by_* might race.
> + *
> + * we used list_del_init() so it can be list_del_init'd again, no harm.
> + * we were called from an inode function so we know that other user can
> + * try to grab entry->inode->i_lock without a problem.
> + */
> + list_for_each_entry_safe(entry, lentry, &free_list, free_i_list) {
> + fsnotify_get_mark(entry);
> + entry->group->ops->mark_clear_inode(entry, inode, flags);
> + fsnotify_put_mark(entry);
> + }

Since the entry was not grabbed while the list lock was held above, what
prevents it from being freed before fsnotify_get_mark() is executed? Could
the get be moved under the lock?

> +}
> +
> +/* caller must hold inode->i_lock */
> +struct fsnotify_mark_entry *fsnotify_find_mark_entry(struct fsnotify_group *group, struct inode *inode)
> +{
> + struct fsnotify_mark_entry *entry;
> +
> + list_for_each_entry(entry, &inode->i_fsnotify_mark_entries, i_list) {
> + if (entry->group == group) {
> + fsnotify_get_mark(entry);
> + return entry;
> + }
> + }
> + return NULL;
> +}
> +/*

Missing newline before the start of the comment :)

> + * This is a low use function called when userspace is changing what is being
> + * watched. I don't mind doing the allocation since I'm assuming we will have
> + * more new events than we have adding to old events...
> + *
> + * add (we use |=) the mark to the in core inode mark, if you need to change
> + * rather than | some new bits you needs to fsnotify_destroy_mark_by_inode()
> + * then call this with all the right bits in the mask.
> + */
> +struct fsnotify_mark_entry *fsnotify_mark_add(struct fsnotify_group *group, struct inode *inode, __u64 mask)
> +{
> + /* we initialize entry to shut up the compiler in case we jump to out... */
> + struct fsnotify_mark_entry *entry = NULL, *lentry;
> +
> + /* pre allocate an entry so we can hold the lock */
> + entry = fsnotify_alloc_mark();
> + if (!entry)
> + return NULL;
> +
> + /*
> + * LOCKING ORDER!!!!
> + * entry->lock
> + * group->mark_lock
> + * inode->i_lock
> + */
> + spin_lock(&group->mark_lock);
> + spin_lock(&inode->i_lock);
> + lentry = fsnotify_find_mark_entry(group, inode);
> + if (lentry) {
> + /* we didn't use the new entry, kill it */
> + fsnotify_destroy_mark(entry);
> + entry = lentry;
> + entry->mask |= mask;
> + goto out_unlock;
> + }
> +
> + spin_lock_init(&entry->lock);
> + atomic_set(&entry->refcnt, 1);
> + entry->group = group;
> + entry->mask = mask;
> + entry->inode = inode;

That's what I asked about previously: what if this inode is released
after the inode lock is dropped?

> + entry->freeme = 0;
> + entry->private = NULL;
> + entry->free_private = group->ops->free_mark_priv;
> +
> + list_add(&entry->i_list, &inode->i_fsnotify_mark_entries);
> + list_add(&entry->g_list, &group->mark_entries);
> +
> +out_unlock:
> + spin_unlock(&inode->i_lock);
> + spin_unlock(&group->mark_lock);
> + return entry;
> +}
> +
> +void fsnotify_recalc_inode_mask(struct inode *inode)
> +{
> + unsigned long new_mask = 0;
> + struct fsnotify_mark_entry *entry;
> +
> + spin_lock(&inode->i_lock);
> + list_for_each_entry(entry, &inode->i_fsnotify_mark_entries, i_list) {
> + new_mask |= entry->mask;
> + }
> + inode->i_fsnotify_mask = new_mask;
> + spin_unlock(&inode->i_lock);
> +}
> +
> +
> +__init int fsnotify_mark_init(void)
> +{
> + fsnotify_mark_kmem_cache = kmem_cache_create("fsnotify_mark_entry", sizeof(struct fsnotify_mark_entry), 0, SLAB_PANIC, NULL);
> +
> + return 0;
> +}
> +subsys_initcall(fsnotify_mark_init);
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index 4a853ef..b5a7bce 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -665,6 +665,11 @@ struct inode {
>
> __u32 i_generation;
>
> +#ifdef CONFIG_FSNOTIFY
> + __u64 i_fsnotify_mask; /* all events this inode cares about */
> + struct list_head i_fsnotify_mark_entries; /* fsnotify mark entries */
> +#endif
> +
> #ifdef CONFIG_DNOTIFY
> unsigned long i_dnotify_mask; /* Directory notify events */
> struct dnotify_struct *i_dnotify; /* for directory notifications */
> diff --git a/include/linux/fsnotify.h b/include/linux/fsnotify.h
> index b084b98..c2ed916 100644
> --- a/include/linux/fsnotify.h
> +++ b/include/linux/fsnotify.h
> @@ -99,6 +99,14 @@ static inline void fsnotify_nameremove(struct dentry *dentry, int isdir)
> }
>
> /*
> + * fsnotify_inode_delete - an inode is being evicted from cache, clean up is needed
> + */
> +static inline void fsnotify_inode_delete(struct inode *inode)
> +{
> + __fsnotify_inode_delete(inode, FSNOTIFY_INODE_DESTROY);
> +}
> +
> +/*
> * fsnotify_inoderemove - an inode is going away
> */
> static inline void fsnotify_inoderemove(struct inode *inode)
> @@ -107,6 +115,7 @@ static inline void fsnotify_inoderemove(struct inode *inode)
> inotify_inode_is_dead(inode);
>
> fsnotify(inode, FS_DELETE_SELF, inode, FSNOTIFY_EVENT_INODE);
> + __fsnotify_inode_delete(inode, FSNOTIFY_LAST_DENTRY);
> }
>
> /*
> diff --git a/include/linux/fsnotify_backend.h b/include/linux/fsnotify_backend.h
> index 924902e..0482a14 100644
> --- a/include/linux/fsnotify_backend.h
> +++ b/include/linux/fsnotify_backend.h
> @@ -13,6 +13,7 @@
> #include <linux/fs.h>
> #include <linux/list.h>
> #include <linux/mutex.h>
> +#include <linux/spinlock.h>
> #include <linux/wait.h>
>
> #include <asm/atomic.h>
> @@ -68,13 +69,21 @@
> #define FSNOTIFY_EVENT_FILE 1
> #define FSNOTIFY_EVENT_INODE 2
>
> +/* these tell __fsnotify_inode_delete what kind of event this is */
> +#define FSNOTIFY_LAST_DENTRY 1
> +#define FSNOTIFY_INODE_DESTROY 2
> +
> struct fsnotify_group;
> struct fsnotify_event;
> +struct fsnotify_mark_entry;
>
> struct fsnotify_ops {
> int (*event_to_notif)(struct fsnotify_group *group, struct fsnotify_event *event);
> + void (*mark_clear_inode)(struct fsnotify_mark_entry *entry, struct inode *inode, unsigned int flags);
> + int (*should_send_event)(struct fsnotify_group *group, struct inode *inode, __u64 mask);
> void (*free_group_priv)(struct fsnotify_group *group);
> void (*free_event_priv)(struct fsnotify_group *group, struct fsnotify_event *event);
> + void (*free_mark_priv)(struct fsnotify_mark_entry *entry);
> };
>
> struct fsnotify_group {
> @@ -85,6 +94,10 @@ struct fsnotify_group {
>
> const struct fsnotify_ops *ops; /* how this group handles things */
>
> + /* stores all fastapth entries assoc with this group so they can be cleaned on unregister */
> + spinlock_t mark_lock; /* protect mark_entries list */
> + struct list_head mark_entries; /* all inode mark entries for this group */
> +
> unsigned int priority; /* order this group should receive msgs. low first */
>
> void *private; /* private data for implementers (dnotify, inotify, fanotify) */
> @@ -94,17 +107,26 @@ struct fsnotify_group {
>
> /* called from the vfs to signal fs events */
> extern void fsnotify(struct inode *to_tell, __u64 mask, void *data, int data_is);
> +extern void __fsnotify_inode_delete(struct inode *inode, int flag);
>
> /* called from fsnotify interfaces, such as fanotify or dnotify */
> extern void fsnotify_recalc_global_mask(void);
> +extern void fsnotify_recalc_group_mask(struct fsnotify_group *group);
> extern struct fsnotify_group *fsnotify_obtain_group(unsigned int priority, unsigned int group_num, __u64 mask, const struct fsnotify_ops *ops);
> extern void fsnotify_put_group(struct fsnotify_group *group);
> extern void fsnotify_get_group(struct fsnotify_group *group);
>
> +extern void fsnotify_recalc_inode_mask(struct inode *inode);
> +extern struct fsnotify_mark_entry *fsnotify_find_mark_entry(struct fsnotify_group *group, struct inode *inode);
> +extern struct fsnotify_mark_entry *fsnotify_mark_add(struct fsnotify_group *group, struct inode *inode, __u64 mask);
> #else
>
> static inline void fsnotify(struct inode *to_tell, __u64 mask, void *data, int data_is);
> {}
> +
> +static inline void __fsnotify_inode_delete(struct inode *inode, int flag)
> +{}
> +
> #endif /* CONFIG_FSNOTIFY */
>
> #endif /* __KERNEL __ */

--
Evgeniy Polyakov

2008-12-13 03:19:33

by Evgeniy Polyakov

Subject: Re: [RFC PATCH -v4 11/14] fsnotify: include pathnames with entries when possible

On Fri, Dec 12, 2008 at 04:52:12PM -0500, Eric Paris ([email protected]) wrote:
> When inotify wants to send events to a directory about a child it includes
> the name of the original file. This patch collects that filename and makes
> it available for notification.

What about extending fsnotify with attributes so that there would be no
problems extending it with new events and data sent to userspace?


--
Evgeniy Polyakov

2008-12-13 03:23:11

by Evgeniy Polyakov

Subject: Re: [RFC PATCH -v4 13/14] inotify: reimplement inotify using fsnotify

On Fri, Dec 12, 2008 at 04:52:23PM -0500, Eric Paris ([email protected]) wrote:
> Yes, holy shit, I'm trying to reimplement inotify as fsnotify...

While you are at it, please update inode_setattr() so that it dropped
inotify watches if new permissions do not allow to read data.

--
Evgeniy Polyakov

2008-12-13 15:01:48

by Eric Paris

Subject: Re: [RFC PATCH -v4 05/14] fsnotify: unified filesystem notification backend

On Sat, 2008-12-13 at 05:54 +0300, Evgeniy Polyakov wrote:
> > +struct fsnotify_event *fsnotify_create_event(struct inode *to_tell, __u64 mask, void *data, int data_is)
> > +{
> > + struct fsnotify_event *event;
> > +
> > + event = kmem_cache_alloc(event_kmem_cache, GFP_KERNEL);
> > + if (!event)
> > + return NULL;
> > +
> > + atomic_set(&event->refcnt, 1);
> > +
> > + spin_lock_init(&event->lock);
> > +
> > + event->path.dentry = NULL;
> > + event->path.mnt = NULL;
> > + event->inode = NULL;
> > +
> > + INIT_LIST_HEAD(&event->private_data_list);
> > +
> > + event->to_tell = to_tell;
>
> What prevents this inode to be released?

Absolutely nothing and I need to document that. Two things,
event->to_tell and event->data if event->data_is == INODE, are ONLY valid
during the call to group->event_to_notif(). As soon as all groups
return from that call those are not valid fields. I could set them NULL
when they aren't allowed to be used any more, but it's a wasted
operation on a VERY hot path (fsnotify()).

As a side note event->data if event->data_is == FILE is perfectly
allowed to be used until the event is freed. That's totally pointless
for this patch set and I might drop it on the next submission, but it is
needed for my fanotify notify patches that I keep in mind as I'm doing
this.

I'll remember to better document this 'quirk' in case anyone tries to
write a new notifier.

-Eric

2008-12-13 15:30:06

by Christoph Hellwig

Subject: Re: [RFC PATCH -v4 04/14] fsnotify: use the new open-exec hook for inotify and dnotify

On Fri, Dec 12, 2008 at 04:51:35PM -0500, Eric Paris wrote:
> inotify and dnotify did not get access events when their children were
> accessed for shlib or exec purposes. Trigger on those events as well.

And requesting it about the third time: Can you please submit this (and
the required previous patch) as a bug fix? We'd like to have this in
ASAP even if the rest will need some more work.

2008-12-13 16:35:44

by Eric Paris

Subject: Re: [RFC PATCH -v4 07/14] fsnotify: add in inode fsnotify markings

On Sat, 2008-12-13 at 06:07 +0300, Evgeniy Polyakov wrote:
> On Fri, Dec 12, 2008 at 04:51:51PM -0500, Eric Paris ([email protected]) wrote:
> > +void fsnotify_put_mark(struct fsnotify_mark_entry *entry)
> > +{
> > + if (atomic_dec_and_test(&entry->refcnt)) {
> > + spin_lock(&entry->lock);
> > + /* entries can only be found by the kernel by searching the
> > + * inode->i_fsnotify_entries or the group->mark_entries lists.
> > + * if freeme is set that means this entry is off both lists.
> > + * if refcnt is 0 that means we are the last thing still
> > + * looking at this entry, so its time to free.
> > + */
> > + if (!atomic_read(&entry->refcnt) && entry->freeme) {
>
> Why do you check refcnt again? Does it mean its life cycle does not end
> when it hits zero and should be freed only under lock, which in turn
> means it may be accessed under that lock and zero refcnt? Please
> describe this locking in more details.

Al Viro, please pay attention to this patch. This is the one that
provides lifetimes which I hope solves all of inotify's "fun"

Marks have very interesting semantics which took me a long time to
figure out. Marks can only be found in kernel by walking one of two
lists: group->mark_entries or inode->i_fsnotify_mark_entries. These
lists are protected by group->mark_lock or inode->i_lock respectively.
Everything else inside of a mark is protected by entry->lock. The
locking order is

entry->lock
group->mark_lock
inode->i_lock

Entries must not be freed until they are gone from both the group->list
and the inode->list and no other process is referencing this entry. The
refcnt keeps track of how many processes are looking at this entry. The
freeme is set when the entry is removed from both lists. Thus if the
refcnt==0 and freeme=1 that means nothing is looking at this entry and
it is off both lists so we free it.
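
The refcnt-plus-freeme rule above can be modelled in a few lines of
userspace C. This is only a sketch under stated assumptions: struct mark
and the mark_*() helpers are hypothetical stand-ins for
struct fsnotify_mark_entry and its get/put functions, the model is
single-threaded, and the entry->lock taken around the recheck in the
real patch is left out.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical userspace stand-in for struct fsnotify_mark_entry. */
struct mark {
	atomic_int refcnt;	/* how many users are looking at the mark */
	bool on_group_list;	/* still linked on the group's list */
	bool on_inode_list;	/* still linked on the inode's list */
	bool freeme;		/* set once the mark is off both lists */
};

/* A new mark starts on both lists with one reference held. */
static void mark_init(struct mark *m)
{
	atomic_init(&m->refcnt, 1);
	m->on_group_list = true;
	m->on_inode_list = true;
	m->freeme = false;
}

static void mark_get(struct mark *m)
{
	atomic_fetch_add(&m->refcnt, 1);
}

/* Drop the mark from both lists; only then may it be freed. */
static void mark_unlink(struct mark *m)
{
	m->on_group_list = false;
	m->on_inode_list = false;
	m->freeme = true;
}

/* Returns true when this put was the one allowed to free the mark. */
static bool mark_put(struct mark *m)
{
	int prev = atomic_fetch_sub(&m->refcnt, 1);	/* returns old value */

	assert(prev > 0);
	return prev == 1 && m->freeme;
}
```

In this model a put that drops the count to zero while freeme is still
clear does not free anything, which is the case the second check in
fsnotify_put_mark() is guarding.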

The difficulty lies in the fact that we might need to free marks for 3
reasons from 3 different directions.

1) inode disappears (inode deleted or evicted from core, backing FS
unmounted)
2) group disappears (close the inotify fd)
3) entry needs to go (inotify unregisters it)

Number 3 is by far the easiest. We grab entry->lock. If entry->group
is set we grab its lock. If entry->inode is set we grab that lock. We
can then drop this entry from both lists, set group and inode to NULL and
set freeme. Once everything is finished looking at the entry it will
get freed.

Numbers 1 and 2 are much harder; let's look at #1 since #2 is similar.
There are 3 reasons I can think of that we would need to kill all
entries given an inode. The inode is evicted from core (this isn't
actually an issue with dnotify or inotify since dnotify holds a
reference by the sheer fact an associated file for the inode must be
held open, and the existence of a "watch" or "entry" for inotify keeps it
from being evicted due to memory pressure). It could be evicted because
the last dentry pointing to the inode was deleted. The fs backing that
inode is being unmounted. In any case, we are given an inode and we are
told "all entries associated with this inode need to be freed."

So to find an entry we need to first grab the inode->i_lock and start to
walk the inode->i_fsnotify_mark_entries list. Since we hold the i_lock
we are not allowed to grab any other locks nor are we allowed to change
anything other than entry->i_list. The secret sauce is that we actually
move the entry from the inode list to a private list which we can walk
and modify lockless. Inside the entry we actually have to use a
different list, free_i_list, for this operation so nothing else that
races with us can mess stuff up. We run the entire list of the inode we
are trying to free all entries for and put the entries on the private
list. We do NOT modify entry->inode.

After this point we can drop inode->i_lock since we are finished with
inode->i_fsnotify_mark_entries and that list is empty. Now we can walk
the private list we just created lockless. We can thus try to grab, in
order, all three locks (we don't really need i_lock since we are already
off the inode's list) and can safely remove this entry from both lists,
set both entry->inode and entry->group to NULL (we hold the entry->lock
so this is safe) and since we are off both lists set freeme.

As soon as we dropped i_lock remember we could have raced. The cool
thing here is that this is fine. The "thing" we are racing with is
going to be holding entry->lock. So we are going to block until that
gets released. That other task is going to grab entry->inode->i_lock
which is also fine. As long as we exist the inode still exists and so
the lock is ok. Since we always use list_del_init(entry->i_list) when
that other task tries to remove this entry from the inode list it won't
really be doing anything, nor hurting anything. This is the reason that
we needed the free_i_list. So a racing task can't mess with the lists
that we need. Only a process shooting an entry by inode will use this
list. That other task is also going to take care of setting
entry->inode to NULL.

Eventually we are going to get the entry->lock, will do our best to lock
what else we can (we actually shouldn't find entry->inode or
entry->group if we lost the race), and we will remove the entry from
what lists we can (again it's already off both lists). At the end we
know that this entry is off both lists and can be marked to be freed.

It's actually quite slick, we can try to free an entry from all 3
directions at once but the entry won't actually be freed until it is off
both lists and nothing is left actively referencing it.
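
The "move everything to a private list under the lock, then walk it
without the lock" step described above can be sketched in userspace C as
follows. All names here (struct watched, detach_all(), clear_all()) are
hypothetical, a singly linked list stands in for the kernel list_head,
and a C11 atomic_flag spin loop stands in for inode->i_lock; none of the
per-entry locking or refcounting from the patch is modelled.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical stand-in for a mark entry linked on an inode's list. */
struct mark {
	struct mark *next;
	int cleared;		/* set once the mark has been torn down */
};

/* Hypothetical stand-in for the inode: a lock plus its list of marks. */
struct watched {
	atomic_flag lock;	/* plays the role of inode->i_lock */
	struct mark *marks;
};

static void watched_init(struct watched *w, struct mark *head)
{
	assert(w != NULL);
	atomic_flag_clear(&w->lock);
	w->marks = head;
}

/* Under the lock, only unhook the whole list and hand it to the caller. */
static struct mark *detach_all(struct watched *w)
{
	struct mark *priv;

	while (atomic_flag_test_and_set(&w->lock))
		;		/* spin until the lock is ours */
	priv = w->marks;
	w->marks = NULL;
	atomic_flag_clear(&w->lock);
	return priv;
}

/* Walk the private list with the lock dropped; returns marks cleared. */
static int clear_all(struct watched *w)
{
	struct mark *m = detach_all(w);
	int n = 0;

	while (m) {
		struct mark *next = m->next;

		m->next = NULL;
		m->cleared = 1;
		n++;
		m = next;
	}
	return n;
}
```

Because the shared list is already empty when the lock is dropped, a
second caller racing through clear_all() simply finds nothing to do.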


>
> > + spin_unlock(&entry->lock);
> > + fsnotify_destroy_mark(entry);
> > + return;
> > + }
> > + spin_unlock(&entry->lock);
> > + }
> > +}
> > +
> > +void fsnotify_clear_marks_by_group(struct fsnotify_group *group)
> > +{
> > + struct fsnotify_mark_entry *lentry, *entry;
> > + struct inode *inode;
> > + LIST_HEAD(free_list);
> > +
> > + spin_lock(&group->mark_lock);
> > + list_for_each_entry_safe(entry, lentry, &group->mark_entries, g_list) {
> > + list_del_init(&entry->g_list);
> > + list_add(&entry->free_g_list, &free_list);
> > + }
> > + spin_unlock(&group->mark_lock);
> > +
> > + list_for_each_entry_safe(entry, lentry, &free_list, free_g_list) {
> > + fsnotify_get_mark(entry);
> > + spin_lock(&entry->lock);
> > + inode = entry->inode;
> > + if (!inode) {
>
> This inode does not seem to be grabbed previously, or I missed that
> part? What prevents it from being freed?

Here's the cool part. If the inode was freed, we would have removed this
entry from the inode->i_fsnotify_mark_entries list and set entry->inode
to NULL back when that happened.

>
> > + entry->group = NULL;
> > + spin_unlock(&entry->lock);
> > + fsnotify_put_mark(entry);
> > + continue;
> > + }
> > + spin_lock(&inode->i_lock);
> > +
> > + list_del_init(&entry->i_list);
> > + entry->inode = NULL;
> > + list_del_init(&entry->g_list);
> > + entry->group = NULL;
> > + entry->freeme = 1;
> > +
> > + spin_unlock(&inode->i_lock);
> > + spin_unlock(&entry->lock);
> > +
> > + fsnotify_put_mark(entry);
> > + }
> > +}
>
> > +void fsnotify_clear_marks_by_inode(struct inode *inode, unsigned int flags)
> > +{
> > + struct fsnotify_mark_entry *lentry, *entry;
> > + LIST_HEAD(free_list);
> > +
> > + spin_lock(&inode->i_lock);
> > + list_for_each_entry_safe(entry, lentry, &inode->i_fsnotify_mark_entries, i_list) {
> > + list_del_init(&entry->i_list);
> > + list_add(&entry->free_i_list, &free_list);
> > + }
> > + spin_unlock(&inode->i_lock);
> > +
> > + /*
> > + * at this point destroy_by_* might race.
> > + *
> > + * we used list_del_init() so it can be list_del_init'd again, no harm.
> > + * we were called from an inode function so we know that other user can
> > + * try to grab entry->inode->i_lock without a problem.
> > + */
> > + list_for_each_entry_safe(entry, lentry, &free_list, free_i_list) {
> > + fsnotify_get_mark(entry);
> > + entry->group->ops->mark_clear_inode(entry, inode, flags);
> > + fsnotify_put_mark(entry);
> > + }
>
> Since entry was not grabbed in the above locked list, what prevents it
> to be freed before fsnotify_get_mark() executed? Could that get moved
> under lock?

Absolutely nothing. This is a bug. Thanks.

> > +}
> > +
> > +/* caller must hold inode->i_lock */
> > +struct fsnotify_mark_entry *fsnotify_find_mark_entry(struct fsnotify_group *group, struct inode *inode)
> > +{
> > + struct fsnotify_mark_entry *entry;
> > +
> > + list_for_each_entry(entry, &inode->i_fsnotify_mark_entries, i_list) {
> > + if (entry->group == group) {
> > + fsnotify_get_mark(entry);
> > + return entry;
> > + }
> > + }
> > + return NULL;
> > +}
> > +/*
>
> Missed newline before comment's start :)

don't I wish this was my only problem :)

>
> > + * This is a low use function called when userspace is changing what is being
> > + * watched. I don't mind doing the allocation since I'm assuming we will have
> > + * more new events than we have adding to old events...
> > + *
> > + * add (we use |=) the mark to the in core inode mark, if you need to change
> > + * rather than | some new bits you needs to fsnotify_destroy_mark_by_inode()
> > + * then call this with all the right bits in the mask.
> > + */
> > +struct fsnotify_mark_entry *fsnotify_mark_add(struct fsnotify_group *group, struct inode *inode, __u64 mask)
> > +{
> > + /* we initialize entry to shut up the compiler in case we just to out... */
> > + struct fsnotify_mark_entry *entry = NULL, *lentry;
> > +
> > + /* pre allocate an entry so we can hold the lock */
> > + entry = fsnotify_alloc_mark();
> > + if (!entry)
> > + return NULL;
> > +
> > + /*
> > + * LOCKING ORDER!!!!
> > + * entry->lock
> > + * group->mark_lock
> > + * inode->i_lock
> > + */
> > + spin_lock(&group->mark_lock);
> > + spin_lock(&inode->i_lock);
> > + lentry = fsnotify_find_mark_entry(group, inode);
> > + if (lentry) {
> > + /* we didn't use the new entry, kill it */
> > + fsnotify_destroy_mark(entry);
> > + entry = lentry;
> > + entry->mask |= mask;
> > + goto out_unlock;
> > + }
> > +
> > + spin_lock_init(&entry->lock);
> > + atomic_set(&entry->refcnt, 1);
> > + entry->group = group;
> > + entry->mask = mask;
> > + entry->inode = inode;
>
> That's what I talked about previously, what if this inode will be
> released after inode unlocked?

__fsnotify_inode_delete() will clean this up when the inode is being
booted, since this entry is on the inode->i_fsnotify_mark_entries list.


2008-12-13 16:43:15

by Eric Paris

Subject: Re: [RFC PATCH -v4 11/14] fsnotify: include pathnames with entries when possible

On Sat, 2008-12-13 at 06:19 +0300, Evgeniy Polyakov wrote:
> On Fri, Dec 12, 2008 at 04:52:12PM -0500, Eric Paris ([email protected]) wrote:
> > When inotify wants to send events to a directory about a child it includes
> > the name of the original file. This patch collects that filename and makes
> > it available for notification.
>
> What about extending fsnotify with attributes so that there would be no
> problems extending it with new events and data sent to userspace?

Since struct fsnotify_event is not exported to userspace information may
be collected in there at event creation time. I actually have a number
of later patches that do this for information that I want to send for
fanotify.

Information can also be stored in the event->private area if only one
group really cares.... (kinda how I store wd for inotify)

Information which is useful to more than 1 group should be added to
struct fsnotify_event and it's up to the individual listeners to export
that to userspace any way they want to.

So absolutely this is completely extensible.

2008-12-13 16:44:35

by Eric Paris

Subject: Re: [RFC PATCH -v4 13/14] inotify: reimplement inotify using fsnotify

On Sat, 2008-12-13 at 06:22 +0300, Evgeniy Polyakov wrote:
> On Fri, Dec 12, 2008 at 04:52:23PM -0500, Eric Paris ([email protected]) wrote:
> > Yes, holy shit, I'm trying to reimplement inotify as fsnotify...
>
> While you are at it, please update inode_setattr() so that it dropped
> inotify watches if new permissions do not allow to read data.

Now this isn't so easy.... Do you have suggestions how to do this with
inotify as it stands today? the vfs needs to know what's hanging on the
open fd in userspace... Not a bad idea. I'll look, but clearly not a
core patch....

-Eric

2008-12-14 22:42:22

by James Morris

Subject: Re: [RFC PATCH -v4 14/14] shit on top for debugging

On Fri, 12 Dec 2008, Eric Paris wrote:

>

Given that this code will probably need some general debugging, perhaps
use pr_debug.

--
James Morris
<[email protected]>

2008-12-14 22:47:51

by Eric Paris

Subject: Re: [RFC PATCH -v4 14/14] shit on top for debugging

On Mon, 2008-12-15 at 09:40 +1100, James Morris wrote:
> On Fri, 12 Dec 2008, Eric Paris wrote:
>
> >
>
> Given that this code will probably need some general debugging, perhaps
> use pr_debug.

an accidental patch left on top of my stack which I didn't mean to send.
But sure, that's a great idea. I'll fix the comments and do it that
way....

-Eric

2008-12-15 15:48:18

by Evgeniy Polyakov

Subject: Re: [RFC PATCH -v4 13/14] inotify: reimplement inotify using fsnotify

On Sat, Dec 13, 2008 at 11:44:01AM -0500, Eric Paris ([email protected]) wrote:
> > While you are at it, please update inode_setattr() so that it dropped
> > inotify watches if new permissions do not allow to read data.
>
> Now this isn't so easy.... Do you have suggestions how to do this with
> inotify as it stands today? the vfs needs to know what's hanging on the
> open fd in userspace... Not a bad idea. I'll look, but clearly not a
> core patch....

I could put an inotify check into notify_change(), which would check
whether the new permissions still allow reading for the users stored in
inode->inotify_watches. Each watch would be dereferenced to its inotify
device, which holds the user context that added the watch object, so it
can be checked the same way generic_permission() works.

--
Evgeniy Polyakov

2008-12-18 22:29:13

by C. Scott Ananian

Subject: Re: [RFC PATCH -v4 12/14] fsnotify: add correlations between events

On Fri, Dec 12, 2008 at 4:52 PM, Eric Paris <[email protected]> wrote:
> inotify sends userspace a correlation between events when they are related
> (aka when dentries are moved). This adds that same support for all
> fsnotify events.
> diff --git a/fs/notify/notification.c b/fs/notify/notification.c
> index 8ed9d32..7243b20 100644
> --- a/fs/notify/notification.c
> +++ b/fs/notify/notification.c
> @@ -34,6 +35,13 @@
>
> static struct kmem_cache *event_kmem_cache;
> static struct kmem_cache *event_holder_kmem_cache;
> +static atomic_t fsnotify_sync_cookie = ATOMIC_INIT(0);
> +
> +u32 fsnotify_get_cookie(void)
> +{
> + return atomic_inc_return(&fsnotify_sync_cookie);
> +}
> +EXPORT_SYMBOL_GPL(fsnotify_get_cookie);
>
> int fsnotify_check_notif_queue(struct fsnotify_group *group)
> {

atomic_inc_return seems rather expensive to put on a hot path in
almost every fs operation. On a multiprocessor system, the cache line
for fsnotify_sync_cookie would be ping-ponging constantly between
processors. The canonical solution is to form the cookie by
concatenating the processor number with a per-processor cookie, so
that generating a new cookie would not require synchronization between
processors. Surely this code already exists to be used somewhere in
Linux.
--scott

--
( http://cscott.net/ )

2008-12-18 23:36:47

by C. Scott Ananian

Subject: Re: [RFC PATCH -v4 00/14] fsnotify, dnotify, and inotify

On Fri, Dec 12, 2008 at 4:51 PM, Eric Paris <[email protected]> wrote:
> The following series implements a new generic in kernel filesystem
> notification system, fsnotify. On top of fsnotify I reimplement dnotify and
> inotify. I have not finished with the change from inotify although I think
> inotify_user should be completed. In kernel inotify users (aka audit) still
> (until I get positive feedback) relay on the old inotify backend. This can be
> 'easily' fixed.
> All of this is in preperation for fanotify and using fanotify as an on access
> file scanner. So you better know it's coming.
> Why is this useful? Because I actually shrank the struct inode. That's
> right, my code is smaller and faster. Eat that.

As a desktop-search-and-indexing developer, it doesn't seem like
fanotify is going to give me anything I want. The inotify/dnotify
restructuring to fsnotify seems reasonable, but not exciting (to me).

From a desktop search perspective, my wishlist reads like this:
1) An 'autoadd' option added to inotify directory watches, so that
newly-created subdirectories get atomically added to the watch. That
would prevent missed IN_MOVED_FROM and IN_MOVED_TO in a newly created
directory.
2) A reasonable interface to map inode #s to "the current path to
this inode" -- or failing that, an iopen or ilink syscall. This would
allow the search index to maintain inode #s instead of path names for
files, which saves a lot of IN_MOVE processing and races (especially
for directory moves, which require recursive path renaming).
3) Dream case: in-kernel dirty bits. I don't *really* want to know
all the details; I just want to know which inotify watches are dirty,
so I can rescan them. To avoid races, the query should return a dirty
watch and atomically clear its dirty flag, so that if it is updated
again during my indexing, I'd be told to scan it again.
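
The "return a dirty watch and atomically clear its dirty flag" behaviour
in item 3 amounts to an atomic exchange. A minimal userspace sketch,
assuming a hypothetical struct watch with one flag per watch (nothing
like this exists in inotify today):

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-watch state: one dirty bit, nothing else. */
struct watch {
	atomic_bool dirty;
};

static void watch_init(struct watch *w)
{
	assert(w != NULL);
	atomic_init(&w->dirty, false);
}

/* Called on every change under the watch; repeated calls collapse. */
static void watch_mark_dirty(struct watch *w)
{
	atomic_store(&w->dirty, true);
}

/*
 * Report whether the watch was dirty and clear the flag in one atomic
 * step, so a change that lands during a rescan sets the flag again
 * rather than being lost.
 */
static bool watch_take_dirty(struct watch *w)
{
	return atomic_exchange(&w->dirty, false);
}
```

An indexer would call watch_take_dirty() before rescanning; if it returns
true again afterwards, something changed while the scan was running.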

From the indexing perspective, dealing with a sequence of operations like:
1) mkdir -p abc/def
2) echo "foo" > abc/def/ghi
3) mv abc xyz
4) mv xyz/def/ghi jkl
is entirely too much "fun". Depending on the races between kernel and
user space, I might get only:
CREATE abc
IN_MOVED_TO jkl
and if I do get my watches put in place after step (1) I've still got
to deal with a recursive rename in step (3), the possibility of
getting notification of (2) after the rename in (3) (so abc/def/ghi no
longer exists), and other delights.

As far as I can tell, fanotify only helps with the 'recursive watch'
problem (which could be solved with 'autoadd' or just using the
algorithm in http://mail.gnome.org/archives/dashboard-hackers/2004-October/msg00022.html
), and doesn't give me any tools to deal with the actual hard races or
path-maintenance problems.
--scott

--
( http://cscott.net/ )

2008-12-22 02:40:35

by Eric Paris

Subject: Re: [RFC PATCH -v4 12/14] fsnotify: add correlations between events

On Thu, 2008-12-18 at 17:28 -0500, C. Scott Ananian wrote:
> On Fri, Dec 12, 2008 at 4:52 PM, Eric Paris <[email protected]> wrote:
> > inotify sends userspace a correlation between events when they are related
> > (aka when dentries are moved). This adds that same support for all
> > fsnotify events.
> > diff --git a/fs/notify/notification.c b/fs/notify/notification.c
> > index 8ed9d32..7243b20 100644
> > --- a/fs/notify/notification.c
> > +++ b/fs/notify/notification.c
> > @@ -34,6 +35,13 @@
> >
> > static struct kmem_cache *event_kmem_cache;
> > static struct kmem_cache *event_holder_kmem_cache;
> > +static atomic_t fsnotify_sync_cookie = ATOMIC_INIT(0);
> > +
> > +u32 fsnotify_get_cookie(void)
> > +{
> > + return atomic_inc_return(&fsnotify_sync_cookie);
> > +}
> > +EXPORT_SYMBOL_GPL(fsnotify_get_cookie);
> >
> > int fsnotify_check_notif_queue(struct fsnotify_group *group)
> > {
>
> atomic_inc_return seems rather expensive to put on a hot path in
> almost every fs operation. On a multiprocessor system, the cache line
> for fsnotify_sync_cookie would be ping-ponging constantly between
> processors. The canonical solution is to form the cookie by
> concatenating the processor number with a per-processor cookie, so
> that generating a new cookie would not require synchronization between
> processors. Surely this code already exists to be used somewhere in
> Linux.

A) this isn't a hot path, it's only when a file is renamed.

B) It's still a great idea (even if not here at least where I plan to
use an atomic_inc_return in my later fanotify patches). Does anyone
know of an example of something like this in kernel?

Here I'm slightly concerned about wasting enough bits of the 32 (we can't
grow it, because inotify has a fixed ABI) on processors, but I'd be
surprised to find we'd ever have enough renames on a single processor
(my 2 second math would be we'd need about 13-14 bits for processors
which still leaves 262k renames per processor before anything wrapped)
to be a problem.

Does anyone who uses inotify think there would be a problem with cookie
reuse coming this soon?

Since I'm using a 64bit cookie and the lifetime of use in my later code
is measured in seconds I think this would be a great improvement.
Thanks

-Eric

2008-12-22 03:22:22

by Eric Paris

Subject: Re: [RFC PATCH -v4 00/14] fsnotify, dnotify, and inotify

On Thu, 2008-12-18 at 18:36 -0500, C. Scott Ananian wrote:
> On Fri, Dec 12, 2008 at 4:51 PM, Eric Paris <[email protected]> wrote:
> > The following series implements a new generic in kernel filesystem
> > notification system, fsnotify. On top of fsnotify I reimplement dnotify and
> > inotify. I have not finished with the change from inotify although I think
> > inotify_user should be completed. In kernel inotify users (aka audit) still
> > (until I get positive feedback) relay on the old inotify backend. This can be
> > 'easily' fixed.
> > All of this is in preperation for fanotify and using fanotify as an on access
> > file scanner. So you better know it's coming.
> > Why is this useful? Because I actually shrank the struct inode. That's
> > right, my code is smaller and faster. Eat that.
>
> As a desktop-search-and-indexing developer, it doesn't seem like
> fanotify is going to give me anything I want. The inotify/dnotify
> restructuring to fsnotify seems reasonable, but not exciting (to me).
>
> From a desktop search perspective, my wishlist reads like this:
> 1) An 'autoadd' option added to inotify directory watches, so that
> newly-created subdirectories get atomically added to the watch. That
> would prevent missed IN_MOVED_FROM and IN_MOVED_TO in a newly created
> directory.
> 2) A reasonable interface to map inode #s to "the current path to
> this inode" -- or failing that, an iopen or ilink syscall. This would
> allow the search index to maintain inode #s instead of path names for
> files, which saves a lot of IN_MOVE processing and races (especially
> for directory moves, which require recursive path renaming).
> 3) Dream case: in-kernel dirty bits. I don't *really* want to know
> all the details; I just want to know which inotify watches are dirty,
> so I can rescan them. To avoid races, the query should return a dirty
> watch and atomically clear its dirty flag, so that if it is updated
> again during my indexing, I'd be told to scan it again.
>
> From the indexing perspective, dealing with a sequence of operations like:
> 1) mkdir -p abc/def
> 2) echo "foo" > abc/def/ghi
> 3) mv abc xyz
> 4) mv xyz/def/ghi jkl
> is entirely too much "fun". Depending on the races between kernel and
> user space, I might get only:
> CREATE abc
> IN_MOVED_TO jkl
> and if I do get my watches put in place after step (1) I've still got
> to deal with a recursive rename in step (3), the possibility of
> getting notification of (2) after the rename in (3) (so abc/def/ghi no
> longer exists), and other delights.
>
> As far as I can tell, fanotify only helps with the 'recursive watch'
> problem (which could be solved with 'autoadd' or just using the
> algorithm in http://mail.gnome.org/archives/dashboard-hackers/2004-October/msg00022.html
> ), and doesn't give me any tools to deal with the actual hard races or
> path-maintenance problems.

You are absolutely correct that fanotify doesn't help with object
movement or path maintenance. Neither had been requested, but
notification (that an inode moved) shouldn't be impossible (although the
hooks are going to be a lot more involved and will probably take some
fighting with the VFS people; my current fanotify hooks use what is
already being handed to fsnotify_* today). To directly answer your
requests:

1) autoadd isn't really what I'm looking at, but maybe someday I could
take a peek. At first glance it doesn't seem an unreasonable idea, but I
don't see how the userspace interface could work. Without the call to
inotify_add_watch to get the watch descriptor how can userspace know what
these new events are? The only possibility I see for this is if inotify
got an extensible userspace interface. In any case I'd be hard pressed to
call it a high priority since it's already possible to get this and the
intention of the addition is to make userspace code easier.

2) major vfs and every FS redesign me thinks.

3) What you want is IN_MODIFY from every inode but you want them to all
coalesce until you grab one instead of only merging the same event type
if they are the last 2 in the notification queue. Not sure of a
particularly clean/fast way to implement that right offhand; we'd have
to run the entire notification queue every time before we add an event
to the end, but at least this is doable with the constraints of the
inotify user interface.

Can't this already be done in userspace just by draining the queue,
matching events, throwing duplicates away, and then processing whatever
is left? You know there is atomicity since removal of an event and the
addition of an event both hold the inotify_dev->ev_mutex.
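
A rough sketch of that userspace approach: once the queue has been
drained into an array, drop every event that repeats an earlier one and
process what is left. struct ev and drop_duplicates() are hypothetical
names, and the struct only mirrors the wd/mask/name fields of an inotify
event rather than using the real struct inotify_event.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical, already-drained event: watch, event bits, child name. */
struct ev {
	int wd;
	uint32_t mask;
	const char *name;	/* must not be NULL; use "" for no name */
};

static int same_event(const struct ev *a, const struct ev *b)
{
	return a->wd == b->wd && a->mask == b->mask &&
	       strcmp(a->name, b->name) == 0;
}

/*
 * Compact the array in place, keeping the first occurrence of each
 * event and preserving order. Returns the number of events kept.
 * Quadratic in the queue length, which is fine for a sketch.
 */
static size_t drop_duplicates(struct ev *evs, size_t n)
{
	size_t kept = 0;

	assert(evs != NULL || n == 0);
	for (size_t i = 0; i < n; i++) {
		size_t j;

		for (j = 0; j < kept; j++)
			if (same_event(&evs[i], &evs[j]))
				break;
		if (j == kept)
			evs[kept++] = evs[i];
	}
	return kept;
}
```

Unlike the in-kernel merge, which only looks at the tail of the queue,
this compares against every event already kept, so repeats separated by
other events are dropped too.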

In any case, I'm going to let your thoughts rattle around in my brain
while I'm still trying to rewrite inotify and dnotify to a better base.
My first inclination is to stop using inotify and start using fanotify.
Abandon filenames and start using device/inode pairs and you get
everything you need. But I'm certain that isn't the case :)

-Eric

2008-12-22 09:02:05

by Evgeniy Polyakov

Subject: Re: [RFC PATCH -v4 12/14] fsnotify: add correlations between events

On Sun, Dec 21, 2008 at 09:40:01PM -0500, Eric Paris ([email protected]) wrote:
> anyone who uses inotify think there would be a problem with cookie reuse
> coming this soon?

In my experience we only care that the two cookies match, not that they
grow monotonically. So the value may just jump up and down, and since
events are potentially spread out in time (although they follow one after
another), the range before wrapping only needs to roughly cover the number
of processes running (and potentially renaming objects). So 20 bits for
the counter and 12 for the CPU id should be enough for now.
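
A minimal sketch of that split, assuming the CPU id goes in the top 12
bits and a per-CPU counter in the low 20. The names make_cookie() and
next_cookie() and the exact layout are illustrative only, not something
the patch defines:

```c
#include <assert.h>
#include <stdint.h>

/* Assumed layout: high 12 bits hold the CPU id, low 20 bits a per-CPU counter. */
#define COOKIE_CPU_BITS     12u
#define COOKIE_COUNTER_BITS 20u
#define COOKIE_CPU_MASK     ((1u << COOKIE_CPU_BITS) - 1u)
#define COOKIE_COUNTER_MASK ((1u << COOKIE_COUNTER_BITS) - 1u)

/* Build a cookie from a CPU id and that CPU's private counter value. */
static uint32_t make_cookie(uint32_t cpu, uint32_t counter)
{
    assert(cpu <= COOKIE_CPU_MASK);
    return (cpu << COOKIE_COUNTER_BITS) | (counter & COOKIE_COUNTER_MASK);
}

/* Advance the per-CPU counter (wrapping after 2^20 values) and return a cookie. */
static uint32_t next_cookie(uint32_t cpu, uint32_t *percpu_counter)
{
    *percpu_counter = (*percpu_counter + 1u) & COOKIE_COUNTER_MASK;
    return make_cookie(cpu, *percpu_counter);
}
```

Two CPUs can never hand out the same value, and each CPU only repeats
itself after 2^20 renames. A real version would probably also want to
skip 0, which inotify reports as the cookie on events that are not part
of a rename.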

--
Evgeniy Polyakov

2008-12-22 10:58:57

by Niraj kumar

Subject: Re: [RFC PATCH -v4 00/14] fsnotify, dnotify, and inotify

On Sun, Dec 21, 2008 at 10:22:06PM -0500, Eric Paris wrote:
> On Thu, 2008-12-18 at 18:36 -0500, C. Scott Ananian wrote:
> > On Fri, Dec 12, 2008 at 4:51 PM, Eric Paris wrote:
> > > The following series implements a new generic in kernel filesystem
> > > notification system, fsnotify. On top of fsnotify I reimplement dnotify and
> > > inotify. I have not finished with the change from inotify although I think
> > > inotify_user should be completed. In kernel inotify users (aka audit) still
> > > (until I get positive feedback) rely on the old inotify backend. This can be
> > > 'easily' fixed.
> > > All of this is in preparation for fanotify and using fanotify as an on access
> > > file scanner. So you better know it's coming.

Eric,

I was also getting my hands dirty with inotify (and friends)
but I see that you are already ahead. So, I thought I would
rather list my immediate requirements and see how far we can go with that.

My application is interested only in filesystem events (read/write/delete), but:

1) It wants notification of only those events which are generated by the
calling process and its children.
2) Subject to the constraint mentioned in (1), it would be nice if I
could just say that I need to be notified of events on all filesystem
objects (perhaps within a namespace) rather than specifying each and
every file/dir that I am interested in. This is somewhat similar to the
"recursive watch" problem that others have mentioned.

The first part is fairly easy to achieve and I had already created a
patch for that on top of inotify (see patch below). I was still
thinking about the second part when I saw your patches. Can you tell me
whether your work is going to achieve this? Maybe I can modify my
patch to work on top of what you are proposing ...
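
The ancestry check at the heart of the patch below can be modelled in
userspace. This is only a sketch: struct task is a toy stand-in for
task_struct, is_self_or_descendant() is a made-up name, and locking and
the lifetime of owner_task are ignored. Unlike the loop in the patch, it
tests for the owner before testing for the root:

```c
#include <assert.h>
#include <stddef.h>

/* Toy stand-in for the kernel's task_struct: only the parent link matters. */
struct task {
    struct task *parent;
};

/*
 * Return 1 if 'owner' is 'cur' itself or one of its ancestors, else 0.
 * 'root' plays the role of init_task and terminates the walk.
 */
static int is_self_or_descendant(const struct task *cur,
                                 const struct task *owner,
                                 const struct task *root)
{
    for (; cur != NULL; cur = cur->parent) {
        if (cur == owner)
            return 1;
        if (cur == root)
            break;      /* reached the top without meeting the owner */
    }
    return 0;
}
```

The cost is one pointer chase per level of process nesting for every
event that matches a watch, which is worth keeping in mind for deep
process trees.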

-Niraj

--------------------------------

Inotify: filter events only by the current process and its children.

Adds a new flag (to be used with sys_inotify_init1) which
restricts notifications to only those events which have been
generated by the calling process and its children.

This patch is on top of 2.6.28-rc5.

Signed-off-by: Niraj Kumar <[email protected]>


diff --git a/fs/inotify.c b/fs/inotify.c
index 7bbed1b..f7ed34e 100644
--- a/fs/inotify.c
+++ b/fs/inotify.c
@@ -80,6 +80,8 @@ struct inotify_handle {
struct list_head watches; /* list of watches */
atomic_t count; /* reference count */
u32 last_wd; /* the last wd allocated */
+ struct task_struct *owner_task; /* owner task_struct */
+ u32 flags; /* operations flags */
const struct inotify_operations *in_ops; /* inotify caller operations */
};

@@ -310,6 +312,18 @@ void inotify_inode_queue_event(struct inode *inode, u32 mask, u32 cookie,
u32 watch_mask = watch->mask;
if (watch_mask & mask) {
struct inotify_handle *ih= watch->ih;
+ struct task_struct *curr = current;
+ int found = 0;
+ if (ih->flags & IN_ONLY_CHILD) {
+ for (curr = current; curr != &init_task;
+ curr = curr->parent)
+ if (ih->owner_task == curr) {
+ found = 1;
+ break;
+ }
+ if (!found)
+ continue;
+ }
mutex_lock(&ih->mutex);
if (watch_mask & IN_ONESHOT)
remove_watch_no_event(watch, ih);
@@ -467,7 +481,8 @@ EXPORT_SYMBOL_GPL(inotify_inode_is_dead);
* inotify_init - allocate and initialize an inotify instance
* @ops: caller's inotify operations
*/
-struct inotify_handle *inotify_init(const struct inotify_operations *ops)
+struct inotify_handle *inotify_init(const struct inotify_operations *ops,
+ int flags)
{
struct inotify_handle *ih;

@@ -479,6 +494,8 @@ struct inotify_handle *inotify_init(const struct inotify_operations *ops)
INIT_LIST_HEAD(&ih->watches);
mutex_init(&ih->mutex);
ih->last_wd = 0;
+ ih->owner_task = current;
+ ih->flags = flags;
ih->in_ops = ops;
atomic_set(&ih->count, 0);
get_inotify_handle(ih);
diff --git a/fs/inotify_user.c b/fs/inotify_user.c
index d367e9b..15359e3 100644
--- a/fs/inotify_user.c
+++ b/fs/inotify_user.c
@@ -588,7 +588,7 @@ asmlinkage long sys_inotify_init1(int flags)
BUILD_BUG_ON(IN_CLOEXEC != O_CLOEXEC);
BUILD_BUG_ON(IN_NONBLOCK != O_NONBLOCK);

- if (flags & ~(IN_CLOEXEC | IN_NONBLOCK))
+ if (flags & ~(IN_CLOEXEC | IN_NONBLOCK | IN_ONLY_CHILD))
return -EINVAL;

fd = get_unused_fd_flags(flags & O_CLOEXEC);
@@ -614,7 +614,7 @@ asmlinkage long sys_inotify_init1(int flags)
goto out_free_uid;
}

- ih = inotify_init(&inotify_user_ops);
+ ih = inotify_init(&inotify_user_ops, flags);
if (IS_ERR(ih)) {
ret = PTR_ERR(ih);
goto out_free_dev;
diff --git a/include/linux/inotify.h b/include/linux/inotify.h
index 37ea289..24adb0e 100644
--- a/include/linux/inotify.h
+++ b/include/linux/inotify.h
@@ -66,8 +66,10 @@ struct inotify_event {
IN_MOVE_SELF)

/* Flags for sys_inotify_init1. */
-#define IN_CLOEXEC O_CLOEXEC
-#define IN_NONBLOCK O_NONBLOCK
+#define IN_CLOEXEC O_CLOEXEC /* 02000000 */
+#define IN_NONBLOCK O_NONBLOCK /* 00004000 */
+#define IN_ONLY_CHILD 0x00000002 /* only notify event generated by this
+ process and children. */

#ifdef __KERNEL__

@@ -117,7 +119,8 @@ extern u32 inotify_get_cookie(void);

/* Kernel Consumer API */

-extern struct inotify_handle *inotify_init(const struct inotify_operations *);
+extern struct inotify_handle *inotify_init(const struct inotify_operations *,
+ int);
extern void inotify_init_watch(struct inotify_watch *);
extern void inotify_destroy(struct inotify_handle *);
extern __s32 inotify_find_watch(struct inotify_handle *, struct inode *,
diff --git a/kernel/audit.c b/kernel/audit.c
index 4414e93..c3c9e7d 100644
--- a/kernel/audit.c
+++ b/kernel/audit.c
@@ -983,7 +983,7 @@ static int __init audit_init(void)
audit_log(NULL, GFP_KERNEL, AUDIT_KERNEL, "initialized");

#ifdef CONFIG_AUDITSYSCALL
- audit_ih = inotify_init(&audit_inotify_ops);
+ audit_ih = inotify_init(&audit_inotify_ops, 0);
if (IS_ERR(audit_ih))
audit_panic("cannot initialize inotify handle");
#endif
diff --git a/kernel/audit_tree.c b/kernel/audit_tree.c
index 8b50944..e631b60 100644
--- a/kernel/audit_tree.c
+++ b/kernel/audit_tree.c
@@ -907,7 +907,7 @@ static int __init audit_tree_init(void)
{
int i;

- rtree_ih = inotify_init(&rtree_inotify_ops);
+ rtree_ih = inotify_init(&rtree_inotify_ops, 0);
if (IS_ERR(rtree_ih))
audit_panic("cannot initialize inotify handle for rectree watches");

2008-12-22 13:43:19

by Al Viro

Subject: Re: [RFC PATCH -v4 07/14] fsnotify: add in inode fsnotify markings

On Sat, Dec 13, 2008 at 11:35:09AM -0500, Eric Paris wrote:

> So to find an entry we need to first grab the inode->i_lock and start to
> walk the inode->i_fsnotify_mark_entries list. Since we hold the i_lock
> we are not allowed to grab any other locks nor are we allowed to change
> anything other than entry->i_list. The secret sauce is that we actually
> move the entry from the inode list to a private list which we can walk
> and modify lockless. Inside the event we actually have to use a
> different list, free_i_list, for this operation so nothing else that
> races with us can mess stuff up. We run the entire inode we are trying
> to free all entries for and put the entries on the private list. We do
> NOT modify event->inode.

And just what happens if #3 ("remove entry") hits us in the meanwhile? Freed
object sitting on free_i_list?

2008-12-22 14:45:58

by Eric Paris

Subject: Re: [RFC PATCH -v4 07/14] fsnotify: add in inode fsnotify markings

On Mon, 2008-12-22 at 13:43 +0000, Al Viro wrote:
> On Sat, Dec 13, 2008 at 11:35:09AM -0500, Eric Paris wrote:
>
> > So to find an entry we need to first grab the inode->i_lock and start to
> > walk the inode->i_fsnotify_mark_entries list. Since we hold the i_lock
> > we are not allowed to grab any other locks nor are we allowed to change
> > anything other than entry->i_list. The secret sauce is that we actually
> > move the entry from the inode list to a private list which we can walk
> > and modify lockless. Inside the event we actually have to use a
> > different list, free_i_list, for this operation so nothing else that
> > races with us can mess stuff up. We run the entire inode we are trying
> > to free all entries for and put the entries on the private list. We do
> > NOT modify event->inode.
>
> And just what happens if #3 ("remove entry") hits us in the meanwhile? Freed
> object sitting on free_i_list?

In the code as posted to the list, nothing prevents that. Evgeniy Polyakov
pointed out that I grabbed my marks at the wrong time.

http://lkml.org/lkml/2008/12/13/94

The corrected idea is that while under the i_lock I call
fsnotify_get_mark() on every inode mark that I am moving to the
free_i_list. When #3 races it actually does all of the cleanup,
including setting the "freeme" flag. But it won't be able to get the
refcnt down to 0. Once we finish 'cleaning up,' we call put on the
object, the refcnt actually goes to 0, and the object is freed.

The memory can't be freed as long as it is on a free_{i,g}_list.
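
Roughly, as a single-threaded toy model (mark_get(), mark_put() and
remove_entry() are names made up for this sketch, and a plain int stands
in for what would be an atomic refcount in the kernel):

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model of a mark; field and function names are illustrative only. */
struct mark {
    int refcnt;     /* number of holders */
    bool freeme;    /* set once the mark has been torn down */
    bool freed;     /* stands in for the memory actually being released */
};

/* Take an extra reference, e.g. while moving the mark to free_i_list. */
static void mark_get(struct mark *m)
{
    m->refcnt++;
}

/* Drop a reference; the last put releases the object. */
static void mark_put(struct mark *m)
{
    assert(m->refcnt > 0);
    if (--m->refcnt == 0)
        m->freed = true;    /* real code would free the memory here */
}

/* The racing "remove entry" path: tear down and drop the owner's reference. */
static void remove_entry(struct mark *m)
{
    m->freeme = true;
    mark_put(m);
}
```

With the extra get taken first, remove_entry() leaves refcnt at 1 and
the object alive; only the final mark_put() from the list walker frees it.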

-Eric

2008-12-22 19:59:51

by C. Scott Ananian

Subject: Re: [RFC PATCH -v4 00/14] fsnotify, dnotify, and inotify

On Sun, Dec 21, 2008 at 10:22 PM, Eric Paris wrote:
> On Thu, 2008-12-18 at 18:36 -0500, C. Scott Ananian wrote:
>> As a desktop-search-and-indexing developer, it doesn't seem like
>> fanotify is going to give me anything I want. [...]
> You are absolutely correct that fanotify doesn't help with object
> movement or path maintenance. Neither had been requested, but
> notification (that an inode moved) shouldn't be impossible (although the
> hooks are going to be a lot more involved and will probably take some
> fighting with the VFS people, my current fanotify hooks use what is
> already being handed to fsnotify_* today). To directly answer your
> requests:
>
>> 1) An 'autoadd' option added to inotify directory watches, so that
>> newly-created subdirectories get atomically added to the watch. That
>> would prevent missed IN_MOVED_FROM and IN_MOVED_TO in a newly created
>> directory.
> 1) autoadd isn't really what I'm looking at, but maybe someday I could
> take a peek. At first glance it doesn't seem an unreasonable idea, but I
> don't see how the userspace interface could work. Without the call to
> inotify_add_watch() to get the watch descriptor, how can userspace know
> what these new events are? The only possibility I see is for inotify to get
> an extensible userspace interface. In any case I'd be hard pressed to
> call it a high priority, since it's already possible to get this and the
> intention of the addition is to make userspace code easier.

Multiple calls to inotify_add_watch are allowed to return the same
watch descriptor, since the descriptor is unique to a pathname. I
think you would pass IN_DIR_AUTO_ADD as part of the 'mask' when you
set up the watch, and when a subdirectory is added you generate the
IN_CREATE as before but also atomically create a new watch on that
directory, adding it to the same inotify instance. Since inotify
maintains an ordered queue, the userland will eventually get the
IN_CREATE, call inotify_add_watch on the subdirectory as before, and
get the automatically created watch descriptor. Later events in this
directory which are already on this queue use this same descriptor, so
it "just works".

If you don't properly process the IN_CREATE or don't process queue
events in order, you could get events referring to a watch descriptor
you don't know about yet, but you can just defer them until you
process the IN_CREATE and discover the descriptor id. These problems
would be of your own making, of course, but are solvable. If you do
the obvious thing, you don't have to worry.

This is a narrow fix which removes a race condition and enables
straightforward code to "just work".

>> 2) A reasonable interface to map inode #s to "the current path to
>> this inode" -- or failing that, an iopen or ilink syscall. This would
>> allow the search index to maintain inode #s instead of path names for
>> files, which saves a lot of IN_MOVE processing and races (especially
>> for directory moves, which require recursive path renaming).
> 2) major vfs and every FS redesign me thinks.

I'm not convinced of that. I'm pretty certain one could export
symlinks /proc/<pid>/mountinfo/<dev>/<inode> -> <absolute path in
processes' namespace> with very little trouble, and no violence done
to the VFS, which already has an iget() function which does the heavy
lifting.

Would it be efficient? Well, more efficient (and reliable!) than trying
to maintain this same information in userland. We only need to use
this when some action is actually taken by the desktop search (like
launching an application with some document resulting from a search),
which is less frequent than indexing operations.

>> 3) Dream case: in-kernel dirty bits. I don't *really* want to know
>> all the details; I just want to know which inotify watches are dirty,
>> so I can rescan them. To avoid races, the query should return a dirty
>> watch and atomically clear its dirty flag, so that if it is updated
>> again during my indexing, I'd be told to scan it again.
> 3) What you want is IN_MODIFY from every inode, but you want them all to
> coalesce until you grab one, instead of only merging events of the same
> type if they are the last 2 in the notification queue. I'm not sure of a
> particularly clean/fast way to implement that offhand: we'd have
> to walk the entire notification queue every time before we add an event
> to the end. But at least this is doable within the constraints of the
> inotify user interface.

Yes, this sounds about right. There are details to be hashed out: if
a/foo is modified and then moved to a/bar, do I get a combined event,
reordered events (IN_MOVE a/foo -> a/bar, then IN_MODIFY a/bar), or
dirty bits (IN_MOVE a/foo -> a/bar with an IN_DIRTY flag set on the
event)? But there are still atomicity concerns (next paragraph):

> Can't this already be done in userspace just by draining the queue,
> matching events, throwing duplicates away, and then processing whatever
> is left? You know there is atomicity since removal of an event and the
> addition of an event both hold the inotify_dev->ev_mutex.

No, you break atomicity between draining the event and doing the
processing. I can drain the queue but then have a/foo moved to a/bar
before I try to index a/foo. Then indexing fails, and I have to
maintain some complicated user-space data structure to say, "oh, when
you find out where a/bar went to, you should index it".

Proper dirty bits would have an atomic "fetch and clear" operation.
So a/foo would be dirty, it would be returned on the queue and the
dirty bit would be atomically cleared. If it was then moved to a/bar
before I got around to indexing, the index operation would fail, but
I'd know that a/bar would have its dirty bit set implicitly by the
move, and so would show up on the queue again.
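
As a sketch of the semantics I mean, using C11 atomics in userspace
(struct watch, watch_mark_dirty() and watch_fetch_and_clear() are
hypothetical names; nothing like this exists in inotify today):

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical per-watch dirty flag. */
struct watch {
    atomic_bool dirty;
};

/* Called on modification (and, per the proposal, implicitly on a move). */
static void watch_mark_dirty(struct watch *w)
{
    atomic_store(&w->dirty, true);
}

/* Atomically read and clear the flag; returns true if a rescan is needed. */
static bool watch_fetch_and_clear(struct watch *w)
{
    return atomic_exchange(&w->dirty, false);
}
```

Any number of modifications between two fetches collapse into a single
"dirty" answer, and a modification that lands after the fetch simply sets
the flag again, so it cannot be lost.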

More tricky details: The dirty bit would probably actually be set on
the directory 'a', and then when I scanned it I'd discover the
'foo->bar' rename. But I'd still have to "remember" (in userspace)
that I didn't successfully scan a/foo and so scan a/bar. This dirty
list could be a short list of "still dirty" inode numbers, and it
would only be used in this particular
move-after-dirty-read-and-before-index race. If a/foo was modified
and then moved to a/bar, I'd simply see that both directory 'a' was
dirty and file 'bar' was dirty, and wouldn't need to use the "failed
index" list.

> In any case, I'm going to let your thoughts rattle around in my brain
> while I'm still trying to rewrite inotify and dnotify to a better base.
> My first inclination is to stop using inotify and start using fanotify.
> Abandon filenames and start using device/inode pairs and you get
> everything you need. But I'm certain that isn't the case :)

Well, except for being able to recreate the path from the inode,
without which ability inode numbers without directory notifications
are pretty useless.

BTW, I had some difficulty discovering the exact userland API you were
proposing for fanotify. I eventually found it in the 'v1' and 'v2'
sets of fanotify patches, before the split to fsnotify, but it would be
nice to see it restated in an easier-to-find place. 'google fanotify'
turns up:
http://lwn.net/Articles/303277/
as the second hit, which is reasonable, but
http://people.redhat.com/~eparis/fanotify/21-fanotify-documentation.patch
seems better? I note that fanotify doesn't actually seem to return
the relevant inode number from (say) a CLOSE_WAS_WRITABLE event; I've
got to stat /proc/self/fd/<fd> to get that?
--scott

--
( http://cscott.net/ )

2008-12-22 20:06:19

by C. Scott Ananian

Subject: Re: [RFC PATCH -v4 12/14] fsnotify: add correlations between events

On Sun, Dec 21, 2008 at 9:40 PM, Eric Paris wrote:
> A) this isn't a hot path, it's only when a file is renamed.

Yes, sorry, I misread the description of that patch to indicate that
you planned to include a newly-generated cookie in all fsevents.
Re-reading the patch I see that you are still only generating cookies
for renames. But yeah, better cookie generation will help scalability
if you plan to use them more broadly in fanotify.
--scott

--
( http://cscott.net/ )

2008-12-22 20:53:44

by Eric Paris

Subject: Re: [RFC PATCH -v4 00/14] fsnotify, dnotify, and inotify

On Mon, 2008-12-22 at 14:59 -0500, C. Scott Ananian wrote:
> On Sun, Dec 21, 2008 at 10:22 PM, Eric Paris wrote:

> > In any case, I'm going to let your thoughts rattle around in my brain
> > while I'm still trying to rewrite inotify and dnotify to a better base.
> > My first inclination is to stop using inotify and start using fanotify.
> > Abandon filenames and start using device/inode pairs and you get
> > everything you need. But I'm certain that isn't the case :)
>
> Well, except for being able to recreate the path from the inode,
> without which ability inode numbers without directory notifications
> are pretty useless.

fanotify doesn't have directory notifications. You get notifications
about the individual inodes. I don't have mv tracking, and I'm not sure
how much trouble it'll be. Maybe not that bad since a mv doesn't have
to deliver the fd to userspace. I promise to think about how to make
this better for you. In any case, you can get pathnames, dev, and inode
very quickly.

> BTW, I had some difficulty discovering the exact userland API you were
> proposing for fanotify. I eventually found it in the 'v1' and 'v2'
> sets of fanotify patches, before the split to fsnotify, but it would be
> nice to see it restated in an easier-to-find place. 'google fanotify'
> turns up:
> http://lwn.net/Articles/303277/
> as the second hit, which is reasonable, but
> http://people.redhat.com/~eparis/fanotify/21-fanotify-documentation.patch

This is probably the best there is ATM. It includes a program which
uses all of the fanotify functionality I wrote.

> seems better? I note that fanotify doesn't actually seem to return
> the relevant inode number from (say) a CLOSE_WAS_WRITABLE event; I've
> got to stat /proc/self/fd/<fd> to get that?

For pathname you readlink on /proc/self/fd/[event->fd]

For dev and ino you stat [event->fd]

where event was filled from your getsockopt call.
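
In userspace that comes out to something like the following sketch
(describe_fd() is just a name picked for illustration, and it assumes
/proc is mounted):

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Fill 'path' with the name the kernel currently reports for 'fd' and
 * 'st' with its metadata. Returns 0 on success, -1 on error.
 */
static int describe_fd(int fd, char *path, size_t pathlen, struct stat *st)
{
    char link[64];
    ssize_t n;

    assert(pathlen > 0);
    snprintf(link, sizeof(link), "/proc/self/fd/%d", fd);
    n = readlink(link, path, pathlen - 1);
    if (n < 0)
        return -1;
    path[n] = '\0';         /* readlink() does not NUL-terminate */

    return fstat(fd, st);   /* st->st_dev and st->st_ino identify the file */
}
```

The pathname is only a snapshot and can be stale by the time you look at
it; the dev/ino pair from fstat() is what stays tied to the open file.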

In any case, event coalescing seems like it needs to work by walking
the notification queue starting at the back and working forwards. If
you find a duplicate, just drop the new event. If you find a mv, place
this one at the end. Kinda sucks that we are taking an O(1) operation and
making it O(n).
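
A rough userspace model of that walk, assuming a fixed-size array queue
and treating any move as a barrier that nothing merges across (struct
event, struct queue, EV_MOVE and queue_add() are all invented for this
sketch):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define EV_MODIFY 0x1u
#define EV_MOVE   0x2u   /* hypothetical flag; acts as a merge barrier */
#define QUEUE_MAX 64

struct event {
    uint64_t ino;    /* object the event refers to */
    uint32_t mask;   /* event type */
};

struct queue {
    struct event ev[QUEUE_MAX];
    size_t len;
};

/*
 * Append 'e' unless an identical event is already queued after the most
 * recent move. Returns 1 if queued, 0 if merged away, -1 if the queue is full.
 */
static int queue_add(struct queue *q, struct event e)
{
    /* Walk from the tail towards the head. */
    for (size_t i = q->len; i-- > 0; ) {
        if (q->ev[i].mask & EV_MOVE)
            break;              /* do not merge across a move */
        if (q->ev[i].ino == e.ino && q->ev[i].mask == e.mask)
            return 0;           /* duplicate: drop the new event */
    }
    if (q->len == QUEUE_MAX)
        return -1;
    q->ev[q->len++] = e;
    return 1;
}
```

The loop is where the O(n) comes from: in the worst case every add looks
at every queued event before appending.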

At least with fanotify you don't have mv races, since you have an open
fd which still gives you the access you need even if it mv'd.

Don't worry, I won't forget your thoughts.

-Eric

2008-12-22 21:04:21

by Al Viro

Subject: Re: [RFC PATCH -v4 00/14] fsnotify, dnotify, and inotify

On Mon, Dec 22, 2008 at 02:59:37PM -0500, C. Scott Ananian wrote:
> > 2) major vfs and every FS redesign me thinks.
>
> I'm not convinced of that. I'm pretty certain one could export
> symlinks /proc/<pid>/mountinfo/<dev>/<inode> -> <absolute path in
> processes' namespace> with very little trouble, and no violence done
> to the VFS, which already has an iget() function which does the heavy
> lifting.

There is no such thing as absolute path of inode *anywhere*, process
namespace or not. Not on any UNIX. Period. End of story.

Al, once again astonished by the ability of desktop developers to
post from the alternative realities they apparently inhabit...

2008-12-22 23:14:14

by C. Scott Ananian

Subject: Re: [RFC PATCH -v4 00/14] fsnotify, dnotify, and inotify

On Mon, Dec 22, 2008 at 4:04 PM, Al Viro wrote:
> On Mon, Dec 22, 2008 at 02:59:37PM -0500, C. Scott Ananian wrote:
>> > 2) major vfs and every FS redesign me thinks.
>>
>> I'm not convinced of that. I'm pretty certain one could export
>> symlinks /proc/<pid>/mountinfo/<dev>/<inode> -> <absolute path in
>> processes' namespace> with very little trouble, and no violence done
>> to the VFS, which already has an iget() function which does the heavy
>> lifting.
>
> There is no such thing as absolute path of inode *anywhere*, process
> namespace or not. Not on any UNIX. Period. End of story.

That's not correct, as /proc/self/fd/<num> and the getcwd syscall make
clear. struct inode has a i_dentry member, and via its d_parent links
you can reconstruct the path, as __d_path in fs/dcache.c does.

Of course, this only works as long as the inode is in the kernel's
cache (which implies all the appropriate dentry items are there, too).
For an open file descriptor, this works well. If you index the open
file now and then return sometime later to try to open it, there's no
guarantee.

If the inode represents a directory, we *could* use the magic '..' and
'.' entries to reconstruct its path, even without its being in the
cache. I haven't looked hard, but that might involve VFS violence.
But having to manually track what directory inode a given file was in
is still an improvement over having to duplicate the entire directory
tree.

We can also debate whether finding *a* name for the inode is
sufficient, or if you really need to know *all* names for the inode.
One is plenty if you're just looking for a path to give on the command
line to (say) emacs to open the file after a successful search. But
then emacs may not be happy when it tries to write a backup file to an
arbitrarily chosen directory.

> Al, once again astonished by the ability of desktop developers to post from the alternative realities they apparently inhabit...

I think you're prejudging. I'm certainly posting from a weird twisted
alternate reality, but by and large it's not the "desktop developer"
one. ;-)

There is a paucity of good solutions. We can follow the example of
old-school UNIX dump and open the filesystem's block device directly
to map inodes to paths. Or we can use 'find' over the entire
filesystem. Or we can duplicate the filesystem's directory structure
in userspace so that we can correctly associate index information with
the current path to the given file.

inotify *almost* lets us do that last thing (though the code
duplication pains me) but is too racy for reliable use. Give me a
kernel interface without races and I'll call it a good start. If you
can save me the trouble of duplicating all of the filesystem's
directory information in my userspace database in order to handle
directory moves, I'll actually grin a little.
--scott

--
( http://cscott.net/ )

2008-12-22 23:20:52

by Al Viro

Subject: Re: [RFC PATCH -v4 00/14] fsnotify, dnotify, and inotify

On Mon, Dec 22, 2008 at 06:08:08PM -0500, C. Scott Ananian wrote:

> That's not correct, as /proc/self/fd/<num> and the getcwd syscall make
> clear. struct inode has a i_dentry member, and via its d_parent links
> you can reconstruct the path, as __d_path in fs/dcache.c does.

Think for a minute

a) you might have any number of links to given inode, *including* *zero*
b) any subset of these links may be in dcache (including empty)
c) any number of bindings might exist to file in question *or* to its
ancestors (again, including zero).
d) none of the relative paths from fs root to file has to be visible
as a path leading to file in question from anywhere (again, think of
bindings)
e) all of that can change before information reaches userland.

> inotify *almost* lets us do that last thing (though the code
> duplication pains me) but is too racy for reliable use. Give me a
> kernel interface without races and I'll call it a good start. If you
> can save me the trouble of duplicating all of the filesystem's
> directory information in my userspace database in order to handle
> directory moves, I'll actually grin a little.

_WHAT_ interface without races? Anything along the lines of "somebody had
done something to <pathname>" is racy by definition. Simply because the
next operation might have changed the mapping from pathnames to files.

What are you actually trying to do?

2008-12-22 23:21:37

by Christoph Hellwig

Subject: Re: [RFC PATCH -v4 00/14] fsnotify, dnotify, and inotify

On Mon, Dec 22, 2008 at 06:08:08PM -0500, C. Scott Ananian wrote:
> That's not correct, as /proc/self/fd/<num> and the getcwd syscall make
> clear.

No. These take a file descriptor into account, which _does_ have a
unique path.

> struct inode has a i_dentry member, and via its d_parent links
> you can reconstruct the path, as __d_path in fs/dcache.c does.

reconstruct _a_ path inside the same filesystem, ignoring which link is
wanted, and inside which mount.

2008-12-25 18:18:39

by C. Scott Ananian

Subject: Re: [RFC PATCH -v4 00/14] fsnotify, dnotify, and inotify

On Mon, Dec 22, 2008 at 6:21 PM, Christoph Hellwig wrote:
> On Mon, Dec 22, 2008 at 06:08:08PM -0500, C. Scott Ananian wrote:
>> That's not correct, as /proc/self/fd/<num> and the getcwd syscall make
>> clear.
>
> No. These take a file descriptor into account, which _does_ have a
> unique path.

getcwd doesn't actually hold a file descriptor to the working
directory. If you reread my message, you'll find that I was explicit
about where the information was stored.

>> struct inode has a i_dentry member, and via its d_parent links
>> you can reconstruct the path, as __d_path in fs/dcache.c does.
>
> reconstruct _a_ path inside the same filesystem, ignoring which link is
> wanted, and inside which mount.

Right. If you go back and reread my message, I made clear that that
was sufficient.

But Al's question is more relevant, and I'll try to restate the
problem for him, since it's clear that none of the existing *notify
interfaces was written with an eye towards making possible races
easily manageable.
--scott

--
( http://cscott.net/ )

2008-12-25 20:33:19

by Al Viro

Subject: Re: [RFC PATCH -v4 00/14] fsnotify, dnotify, and inotify

On Thu, Dec 25, 2008 at 01:17:28PM -0500, C. Scott Ananian wrote:
> getcwd doesn't actually hold a file descriptor to the working
> directory. If you reread my message, you'll find that I was explicit
> about where the information was stored.

Indeed - explicit, persistent and wrong. For current directory of a process
we store vfsmount and dentry. And use those in getcwd() rather than playing
hopeless games with inodes.

2008-12-26 00:58:35

by C. Scott Ananian

Subject: Re: [RFC PATCH -v4 00/14] fsnotify, dnotify, and inotify

On Thu, Dec 25, 2008 at 3:33 PM, Al Viro wrote:
> On Thu, Dec 25, 2008 at 01:17:28PM -0500, C. Scott Ananian wrote:
>> getcwd doesn't actually hold a file descriptor to the working
>> directory. If you reread my message, you'll find that I was explicit
>> about where the information was stored.
>
> Indeed - explicit, persistent and wrong. For current directory of a process
> we store vfsmount and dentry. And use those in getcwd() rather than playing
> hopeless games with inodes.

Geez. Please don't treat me as if I can't read source code.

I suggested a Mach-like iopen mechanism to address some inotify races.
In order to show that extreme VFS violence might not be necessary, I
pointed out that *in some cases* you can derive *some paths* to the
file from the inode number, using the iget()->i_dentry list. But
you've driven me far off-topic.

Let's get back to the problem at hand. The most obvious problem with
inotify is the race between mkdir/IN_CREATE and the userspace process
adding the watch on the new directory. I proposed an 'autoadd'
mechanism earlier in this thread to address this (stolen from the racy
userspace version of this in python-inotify); the "Love-Trowbridge
algorithm" from:
http://mail.gnome.org/archives/dashboard-hackers/2004-October/msg00022.html
is also targeted at this race.

But this isn't the only problem. The inotify interface on directories
returns (in effect) a <directory inode>,<filename> pair (the directory
watch is per inode; the event includes a filename). This means that:
echo foo >a/b; echo bar >a/c; mv a/c a/b
has an inherent race. Our index service drains the inotify queue and
attempts to open and index a/b. After the indexing, we check the
queue and discover IN_MOVED_FROM c and IN_MOVED_TO b. There is no way
for the userspace process to know whether it managed to index the file
before or after the move. (We're forced to track renames to detect
this situation and then attempt to reindex a/b, and of course we can
have another race; we must repeat until we finally succeed.) If
inotify provided an inode number or file descriptor instead of a path
name, we'd be able to tell if we were indexing the thing we expected.

But this isn't the end. How about:
mkdir -p a/b a/c ; touch a/b/foo a/c/foo
<read inotify queue here>
mv a/b a/bb ; mv a/c a/b
When we index a/b/foo, we won't know whether this is the original
a/b/foo or the original a/c/foo. In this case we can open 'a/b' and
check that the inode number is what we expect before using openat to
open 'foo' (but remember that the previous race means that we're still
not sure 'foo' is what we expect it to be, so we still need to use
that detection algorithm as well).
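
The directory check can be sketched like this (open_in_expected_dir() is
a hypothetical helper; it also compares the device number, since inode
numbers are only unique within one filesystem):

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Open 'name' inside directory 'dirpath', but only if the directory still
 * has the device and inode number recorded earlier. Returns an fd, or -1
 * if the directory was replaced or an open failed.
 */
static int open_in_expected_dir(const char *dirpath, dev_t expected_dev,
                                ino_t expected_ino, const char *name)
{
    struct stat st;
    int dfd, fd;

    dfd = open(dirpath, O_RDONLY | O_DIRECTORY);
    if (dfd < 0)
        return -1;
    if (fstat(dfd, &st) < 0 ||
        st.st_dev != expected_dev || st.st_ino != expected_ino) {
        close(dfd);
        return -1;  /* some other directory now lives at this path */
    }
    /* Resolved relative to dfd, so a later rename of dirpath cannot redirect us. */
    fd = openat(dfd, name, O_RDONLY);
    close(dfd);
    return fd;
}
```

This only pins down the directory. Whether 'foo' inside it is still the
file we meant is the separate rename race described above.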

And remember that we're still expected to keep and update a map in
userspace mapping from directory watch ids to path names, and
presumably keep path name information updated in our search index as
well. When a directory is moved, we need to recursively update path
information for all files in the index -- unless we keep path
information as <directory inode>;<filename> pairs, which avoids the
recursive update at the expense of having to maintain a redundant copy
of the filesystem's directory structure in userspace.

(These are the races I've found; it's possible there are others.)

As far as I can tell, none of the existing Linux desktop search tools
attempt to deal with these races. (Beagle handles the 'mkdir' race,
but not the other rename races.) This is acceptable only if an
unreliable file index is acceptable.

Some possible improvements to the situation (all bad, in various ways
-- better suggestions wanted!):

a) do nothing. Most developers will ignore the races in inotify out
of ignorance or complexity, and most applications which use inotify
will be unreliable as a result.

b) use inode numbers rather than path names uniformly, in both
inotify and the userland search index, along with an iopen() syscall,
as in Mach. This decouples path maintenance from indexing. This was
discussed in (for example)
http://www.coda.cs.cmu.edu/maillists/codalist/codalist-1998/0217.html
by Peter J. Braam and Ted Ts'o, but Al Viro has been objecting to the
idea here. (If all you need to do is open found files after a search,
you can skip path maintenance entirely.)

c) Pass file descriptors in the notification API from the kernel.
This solves the races associated with renames before indexing.
Userland still has to maintain its own copy of all the direntries for
indexed content, but at least this task is decoupled. (The proposed
fanotify API passes file descriptors, but provides no mechanism (yet)
for path maintenance.)

d) Do all indexing in the filesystem. BeOS used this option; in
Linux-land, this would probably be a thin FUSE shim which layered over
an existing filesystem. The shim could grab the appropriate locks to
manage the races and ensure that the index's path information was
consistent with the filesystem.

Returning to fanotify, I'll recant some of my earlier judgement:
fanotify already solves the 'mkdir' race in inotify (by virtue of not
requiring separate watches on each directory) and the 'mv before
index' race (by passing an open file descriptor to userland). If it
provided some basic directory-change support so that path information
can be maintained, it would be a clear win for desktop search, since
by simply processing events in order we can produce a coherent index
state. The only remaining races would be during the initial scan.
If one wanted the simplest possible correct userspace, perhaps move
and create can be deferred by userland using the fanotify 'approval'
mechanism until the scan is complete.
--scott

--
( http://cscott.net/ )

2008-12-26 01:44:18

by Al Viro

Subject: Re: [RFC PATCH -v4 00/14] fsnotify, dnotify, and inotify

On Thu, Dec 25, 2008 at 07:58:23PM -0500, C. Scott Ananian wrote:
> > Indeed - explicit, persistent and wrong. For current directory of a process
> > we store vfsmount and dentry. And use those in getcwd() rather than playing
> > hopeless games with inodes.
>
> Geez. Please don't treat me as if I can't read source code.
>
> I suggested a Mach-like iopen mechanism to address some inotify races.
> In order to show that extreme VFS violence might not be necessary, I
> pointed out that *in some cases* you can derive *some paths* to the
> file from the inode number, using the iget()->i_dentry list. But
> you've driven me far off-topic.

BTW, XFS open-by-inode implementation is severely broken - try using
it for directories and you'll get all kinds of hell breaking loose.
And that's the only attempt at iopen in Linux.

While we are at it, iget() is not a generic interface - it's strictly
up to specific filesystem whether that sucker will work at all. Or
what kind of data will have to be required for it to work on given fs,
assuming it will work to start with.

> Let's get back to the problem at hand. The most obvious problem with
> inotify is the race between mkdir/IN_CREATE and the userspace process

[snip]

> As far as I can tell, none of the existing Linux desktop search tools
> attempt to deal with these races. (Beagle handles the 'mkdir' race,
> but not the other rename races.) This is acceptable only if an
> unreliable file index is acceptable.

... and inotify is unreliable by design, or what passes for it anyway.

> Some possible improvements to the situation (all bad, in various ways
> -- better suggestions wanted!):
>
> a) do nothing. Most developers will ignore the races in inotify out
> of ignorance or complexity, and most applications which use inotify
> will be unreliable as a result.
>
> b) use inode numbers rather than path names uniformly, in both
> inotify and the userland search index, along with an iopen() syscall,
> as in Mach. This decouples path maintenance from indexing. This was
> discussed in (for example)
> http://www.coda.cs.cmu.edu/maillists/codalist/codalist-1998/0217.html
> by Peter J. Braam and Ted Ts'o, but Al Viro has been objecting to the
> idea here. (If all you need to do is open found files after a search,
> you can skip path maintenance entirely.)

For one thing, opened directory can be fchdir'ed to. Or passed to
openat() et al. For another, go ahead, show me how to implement that
sucker for something like NFS. Or CIFS. Or FAT, while we are at it.
Or anything that does have stable numbers identifying fs objects, but where
those numbers are huge.

> c) Pass file descriptors in the notification API from the kernel.
> This solves the races associated with renames before indexing.
> Userland still has to maintain its own copy of all the direntries for
> indexed content, but at least this task is decoupled. (The proposed
> fanotify API passes file descriptors, but provides no mechanism (yet)
> for path maintenance.)
>
> d) Do all indexing in the filesystem. BeOS used this option; in
> Linux-land, this would probably be a thin FUSE shim which layered over
> an existing filesystem. The shim could grab the appropriate locks to
> manage the races and ensure that the index's path information was
> consistent with the filesystem.

e) start with providing a higher-level description of what you want to
achieve. While we are at it, is it supposed to be fs-agnostic? What
kinds of filesystems could in principle be used with that?

So far you've mentioned use of blatantly inadequate tool and far too
low-level description of changes that might possibly make it more
tolerable for unspecified use you have in mind. I'm sorry, but that's
exactly how a bunch of bad APIs (including inotify) got pushed into the
tree and I would rather not repeat the experience.

Interfaces must make sense. And "we need <list of things> for our project,
nevermind what are they for, here's what we would like to see" is a recipe
for kludges. Inotify, dnotify, F_SETLEASE, etc. Sigh...

2008-12-27 21:23:46

by C. Scott Ananian

Subject: Re: [RFC PATCH -v4 00/14] fsnotify, dnotify, and inotify

On Thu, Dec 25, 2008 at 8:44 PM, Al Viro wrote:
>> As far as I can tell, none of the existing Linux desktop search tools
>> attempt to deal with these races. (Beagle handles the 'mkdir' race,
>> but not the other rename races.) This is acceptable only if an
>> unreliable file index is acceptable.
>
> ... and inotify is unreliable by design, or what passes for it anyway.

Really? I don't see that in the documentation anywhere (except as an
aside in a comment you added to inotify's pin-to-kill method). Nor do
I find it supported by code review. Sure, the queue is limited size,
but you are guaranteed an overflow event when messages are dropped;
that's all that's needed for reliability. Can you provide some
specific example where events are lost "by design"? There are
userland races in inotify, but they are all solvable, and I've
described above how to do so. The problem is that the workarounds
make practical programming with inotify very cumbersome, and the
result is that most implementors haven't bothered to make things
correct (ie, moving a directory breaks all search tools I've looked
at). I'm interested in determining whether there are minor tweaks
which can make correct userlands easier to write. That's why I've
been beating the "autoadd" drum in this thread: it's a very small
patch that would remove one category of race entirely (leaving only
the rename races for userland to deal with).
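
As an illustration of the overflow guarantee, the following sketch
parses a buffer as returned by read() on an inotify descriptor and
reports whether a full rescan is needed. The record layout (wd, mask,
cookie, len, then a NUL-padded name) and the IN_Q_OVERFLOW value are
those of <sys/inotify.h>; the function names are made up for this
example.

```python
import struct

IN_Q_OVERFLOW = 0x00004000        # value from <sys/inotify.h>
_HEADER = struct.Struct("iIII")   # wd, mask, cookie, len in native byte order

def parse_events(buf):
    """Yield (wd, mask, cookie, name) for each record in an inotify read() buffer."""
    offset = 0
    while offset + _HEADER.size <= len(buf):
        wd, mask, cookie, length = _HEADER.unpack_from(buf, offset)
        offset += _HEADER.size
        # The name field is `length` bytes long and padded with NULs.
        name = buf[offset:offset + length].split(b"\0", 1)[0]
        offset += length
        yield wd, mask, cookie, name

def needs_full_rescan(buf):
    """True if the kernel signalled that events were dropped."""
    return any(mask & IN_Q_OVERFLOW for _, mask, _, _ in parse_events(buf))
```

When needs_full_rescan() returns true, the only safe reaction is to
walk the watched tree again, since there is no record of what was
dropped.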

>> b) use inode numbers rather than path names uniformly, in both
>> inotify and the userland search index, along with an iopen() syscall,
>> as in Mach. This decouples path maintenance from indexing. This was
>> discussed in (for example)
>> http://www.coda.cs.cmu.edu/maillists/codalist/codalist-1998/0217.html
>> by Peter J. Braam and Ted Ts'o, but Al Viro has been objecting to the
>> idea here. (If all you need to do is open found files after a search,
>> you can skip path maintenance entirely.)
>
> For one thing, opened directory can be fchdir'ed to. Or passed to
> openat() et al.

You seem to be confusing inotify watch descriptors with file
descriptors. Directories are not held open in inotify, as you should
know.

Adding an inotify_open_wd() syscall would indeed eliminate the last of
the 3 races I described above (a/b/c rename to a/d/c). It wouldn't do
anything about a/b -> a/c rename races.

> For another, go ahead, show me how to implement that
> sucker for something like NFS. Or CIFS. Or FAT, while we are at it.
> Or anything that does have stable numbers identifying fs objects, but where
> those numbers are huge.

Sure. Because not all filesystems support xattr (say) we should
refuse to add it to any? As you know well, the standard solution to
the inode problem (in NFS or FUSE, say) is to synthesize stable inode
numbers in the kernel (or in metadata in something like vfat), with
the understanding that these numbers will be different on every mount.
For the purposes of race-avoidance in inotify, transient inode
numbers are acceptable. And we can always fall back to "old racy
inotify" or manual filesystem scans for minority filesystems.

>> c) Pass file descriptors in the notification API from the kernel.
>> This solves the races associated with renames before indexing.
>> Userland still has to maintain its own copy of all the direntries for
>> indexed content, but at least this task is decoupled. (The proposed
>> fanotify API passes file descriptors, but provides no mechanism (yet)
>> for path maintenance.)

You didn't comment on this alternative, but as I've continued thinking
about the problem, this alternative has seemed best.

> e) start with providing a higher-level description of what you want to
> achieve. While we are at it, is it supposed to be fs-agnostic? What
> kinds of filesystems could in principle be used with that?

The high level problem should be obvious, but I'll restate it for you anyway:

* Maintain an auxiliary set of content-derived metadata for each file
in some subset (under a directory, mount-point, matching a wildcard,
etc). This metadata must be updated whenever the content changes, and
must include/maintain some stable reference to the file so that
searches on the metadata can be mapped back to their corresponding
files later.

Implementors vary on the definition of 'content' (do file permissions
count? xattrs? etc), the best "update" API (is notify-on-close and
reindexing entire files sufficient, or do we want to do incremental
updates on each dirty block/byte), and whether the notification is
synchronous or asynchronous (eg: Eric Paris wants to act on 'bad'
content, so he needs a synchronous interface to prevent races from
acting on bad content). In Linux search tools I've looked at, the
'stable reference' notion has usually been kept as a file:// URL (ie,
absolute path), but Mac OS X has file handles closer to inode numbers
(although I'm not positive that's what they are using).

Filesystem-agnostic tools are preferable, but "works best on
filesystem X" search tools are already common. The BeFS was an
example of non-agnostic search, and Beagle "works best" on filesystems
that support xattrs. The i_version field will also be useful for
search tools, for filesystems which support it (only ext4 so far?).
So give me filesystem-specific search if it is compelling, or agnostic
search where possible.

> So far you've mentioned use of blatantly inadequate tool and far too
> low-level description of changes that might possibly make it more
> tolerable for unspecified use you have in mind. I'm sorry, but that's
> exactly how a bunch of bad APIs (including inotify) got pushed into the
> tree and I would rather not repeat the experience.

So, you don't understand the problem, and you'd rather not have
inotify in tree because you think it is "blatantly inadequate" -- yet
you are using inotify in your audit subsystem? I think I am beginning
to get the impression that you're just not going to be much help.

> Interfaces must make sense. And "we need <list of things> for our project,
> nevermind what are they for, here's what we would like to see" is a recipe
> for kludges. Inotify, dnotify, F_SETLEASE, etc. Sigh...

I don't think this is a sound technical argument. We've got at least
two different APIs on the table (fanotify and inotify), a
well-understood problem description, *and* I've made at least 4
different proposals for how the current situation might be improved.
Please feel free to suggest concrete alternatives, but I don't plan to
continue this thread while you are just taking bogus potshots.
--scott

--
( http://cscott.net/ )

2008-12-29 18:19:59

by C. Scott Ananian

Subject: Re: [RFC PATCH -v4 00/14] fsnotify, dnotify, and inotify

On Mon, Dec 22, 2008 at 3:53 PM, Eric Paris wrote:
> fanotify doesn't have directory notifications. You get notifications
> about the individual inodes. I don't have mv tracking, and I'm not sure
> how much trouble it'll be. Maybe not that bad since a mv doesn't have
> to deliver the fd to userspace. I promise to think about how to make
> this better for you. In any case, you can get pathnames, dev, and inode
> very quickly.

For consistency, directory events could appear as mutations
(write/close_was_writable) of the directory inodes, with an open file
descriptor to the directory.

It would be a bit annoying, but you wouldn't even have to pass the
'from' and 'to' names. You can then readdir the open file descriptor
to the directory to find out what's changed for yourself. (If you do
pass 'from' and 'to' path information, then the open file descriptor
to the directory is unnecessary.)
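
A sketch of what "readdir and work it out yourself" could look like in
userspace, assuming the indexer keeps the previous {name: inode}
listing for each directory. The helper names are hypothetical, and the
code uses only ordinary directory reads, not any fanotify call.

```python
import os

def snapshot(dir_fd):
    """Return {name: inode} for the directory behind an open descriptor."""
    # os.scandir() accepts a directory file descriptor on Unix (Python 3.7+).
    with os.scandir(dir_fd) as entries:
        return {entry.name: entry.inode() for entry in entries}

def diff_snapshots(before, after):
    """Compare two snapshots; return (renamed, removed, created) as sets."""
    # Entries whose name no longer maps to the same inode, keyed by inode.
    gone = {ino: name for name, ino in before.items() if after.get(name) != ino}
    new = {ino: name for name, ino in after.items() if before.get(name) != ino}
    # Same inode under a different name is treated as a rename.
    renamed = {(gone[ino], new[ino]) for ino in gone.keys() & new.keys()}
    removed = {name for ino, name in gone.items() if ino not in new}
    created = {name for ino, name in new.items() if ino not in gone}
    return renamed, removed, created
```

This is deliberately naive: two hard links to one inode in the same
directory collide in the inode-keyed maps, and an inode number reused
between snapshots would be misreported as a rename.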

Capturing a coherent directory listing on startup (or while processing
a directory mutation) is the other tricky bit; being able to
block-for-ack on a rename would make that much easier, returning
SEND_DELAY for example. The basic ideas seem coherent with what
you've got in fanotify. A write_need_access_decision on directories
would allow userspace locking during the scan, and you'd set a
fastpath (with SURVIVE_WRITE cleared) after the scan was complete.

The write_need_access_decision method would, as testament to the
generality of the mechanism, allow userspace implementations of
copy-on-write and versioning file systems. A versioning filesystem
could check in the 'new version' of a file on close_was_writable;
write_need_access would just need to block until userland had
completed the checkin to prevent races between kernel and userland.
For a copy-on-write system based on hardlinks,
write_need_access_decision would break a hardlink if necessary, and
regardless install a fast path skipping future checks.

> In any case, event coalescing seems like it needs to work by walking
> the notification queue starting at the back and working forwards. If
> you find a duplicate just drop. If you find a mv, place this one at the
> end. Kinda sucks that we are taking an O(1) operation and making it
> O(n).
>
> At least with fanotify you don't have mv races, since you have an open
> fd which still gives you the access you need even if it mv'd.

On reflection event coalescing is not necessary so long as you're
passing open file descriptors around. The kernel mechanisms are
sufficient to ensure that the file descriptors are still valid even if
unlink/move operations occur after the mutation.
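
This is easy to check from userspace with ordinary POSIX calls. The
sketch below assumes a local filesystem with the usual unlink
semantics; the function and file names are made up for the example.

```python
import os

def read_after_move_and_unlink(directory):
    """Open a file, rename and unlink it, then read it through the old descriptor."""
    path = os.path.join(directory, "doc.txt")
    with open(path, "w") as f:
        f.write("indexed content")

    fd = os.open(path, os.O_RDONLY)
    try:
        moved = os.path.join(directory, "renamed.txt")
        os.rename(path, moved)   # the original path no longer exists
        os.unlink(moved)         # no name refers to the file any more
        # The descriptor still refers to the same inode, so the read succeeds.
        return os.read(fd, 4096)
    finally:
        os.close(fd)
```

An indexer handed such a descriptor can therefore always read the
content that triggered the event, though it still has to learn by other
means which path, if any, the file currently has.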

To briefly summarize my fanotify wishlist:
1) write/close_was_writable events on directory inodes triggered by
create/rename/unlink.
2) write_need_access_decision event applicable to both regular files
and directories.
--scott

--
( http://cscott.net/ )