2017-03-13 23:09:14

by Filip Štědronský

[permalink] [raw]
Subject: [RFC 1/2] fanotify: new event FAN_MODIFY_DIR

Fanotify currently does not report directory modification events
(rename, unlink, etc.). But these events are essential for about half of
concievable fanotify use cases, especially:

- file system indexers / desktop search tools
- file synchronization tools (like Dropbox, Nextcloud, etc.),
online backup tools

and pretty much any app that needs to maintain and up-to-date internal
representation of current contents of the file system.

All applications of the above classes that I'm aware of currently use
recursive inotify watches, which do not scale (my home dir has ~200k
directories, creating all the watches takes ~2min and eats several tens
of MB of kernel memory).

There have been many calls for such a feature, pretty much since the
creation of fanotify in 2009:
* By GNOME developers:
https://wiki.gnome.org/BastienNocera/KernelWishlist#VFS.2C_filesystems
* By KDE developers:
http://lkml.kernel.org/r/201211011352.42476.Martin%40lichtvoll.de
'Better support for (desktop) file search / indexing applications'
* And others:
http://lkml.kernel.org/r/AANLkTi=owK=WZW4oNtpm5WpAZhqCQUdTR2K5gzJ_MqZ+%40mail.gmail.com
'Fanotify mv/rename?'
http://lkml.kernel.org/r/1238158043.23703.20.camel%40fatty
'Issues with using fanotify for a filesystem indexer'

Add a new fanotify event, FAN_MODIFY_DIR, that is emitted whenever the
contents of a directory change (a directory entry is added, removed or
renamed). This covers all the currently missing events: rename, unlink,
mknod, mkdir, rmdir, etc.

Included with the event is a file descriptor to the modified directory
but no further info. The userspace application is expected to rescan the
whole directory and update its model accordingly. This needs to be done
carefully to prevent race conditions. A cross-directory rename generates
two FAN_MODIFY_DIR events.

This minimalistic approach has several advantages:
* Does not require changes to the fanotify userspace API or the
fsnotify in-kernel framework, apart from adding a new event.
Especially doesn't complicate it by adding string fields.
* Has simple and clear semantics, even with multiple renames occurring
in parallel etc. In case of any inconsistencies, one can simply wait
for a while and rescan again.
* FAN_MODIFY_DIR events are easily merged in case of multiple
operations on a directory (e.g. deleting all files).

Signed-off-by: Filip Štědronský <[email protected]>

---

An alternative might be to create wrapper functions like
vfs_path_(rename|unlink|...). They could also take care of calling
security_path_(rename|unlink|...), which is currently also up to
the indvidual callers (possibly with a flag because it might not
be always desired).

An alternative was proposed by Amir Goldstein in several long series of
patches that add superblock-scoped (as opposed to vfsmount-scoped)
fanotify watches and specific dentry manipulation events with filenames:

http://lkml.kernel.org/r/1481984434-13293-1-git-send-email-amir73il%40gmail.com
http://lkml.kernel.org/r/1482247207-4424-1-git-send-email-amir73il%40gmail.com
http://lkml.kernel.org/r/1476126784-12520-1-git-send-email-amir73il%40gmail.com
http://lkml.kernel.org/r/1489411223-12081-1-git-send-email-amir73il%40gmail.com

There is large but not complete overlap between that proposal and
mine (which is was originally created over a year ago, before Amir's
work, but never posted).

I think the superblock watch idea is very interesting because it might
in the future allow reporing fsnotify events from remote and virtual
filesystems. So I'm posting this more as a potential source of more
ideas for discussion, or a fallback proposal in case Amir's patches
don't make it.
---
fs/notify/fanotify/fanotify.c | 1 +
include/linux/fsnotify.h | 17 +++++++++++++++++
include/linux/fsnotify_backend.h | 1 +
include/uapi/linux/fanotify.h | 5 ++++-
4 files changed, 23 insertions(+), 1 deletion(-)

diff --git a/fs/notify/fanotify/fanotify.c b/fs/notify/fanotify/fanotify.c
index bbc175d4213d..5178b06c338c 100644
--- a/fs/notify/fanotify/fanotify.c
+++ b/fs/notify/fanotify/fanotify.c
@@ -186,6 +186,7 @@ static int fanotify_handle_event(struct fsnotify_group *group,

BUILD_BUG_ON(FAN_ACCESS != FS_ACCESS);
BUILD_BUG_ON(FAN_MODIFY != FS_MODIFY);
+ BUILD_BUG_ON(FAN_MODIFY_DIR != FS_MODIFY_DIR);
BUILD_BUG_ON(FAN_CLOSE_NOWRITE != FS_CLOSE_NOWRITE);
BUILD_BUG_ON(FAN_CLOSE_WRITE != FS_CLOSE_WRITE);
BUILD_BUG_ON(FAN_OPEN != FS_OPEN);
diff --git a/include/linux/fsnotify.h b/include/linux/fsnotify.h
index b43d3f5bd9ea..00fb87c975d6 100644
--- a/include/linux/fsnotify.h
+++ b/include/linux/fsnotify.h
@@ -208,6 +208,23 @@ static inline void fsnotify_modify(struct file *file)
}

/*
+ * fsnotify_modifydir - directory contents were changed
+ * (as a result of rename, creat, unlink, etc.)
+ */
+static inline void fsnotify_modify_dir(struct path *path)
+{
+ struct inode *inode = path->dentry->d_inode;
+ __u32 mask = FS_MODIFY_DIR;
+
+ if (S_ISDIR(inode->i_mode))
+ mask |= FS_ISDIR;
+ else
+ return;
+
+ fsnotify(inode, mask, path, FSNOTIFY_EVENT_PATH, NULL, 0);
+}
+
+/*
* fsnotify_open - file was opened
*/
static inline void fsnotify_open(struct file *file)
diff --git a/include/linux/fsnotify_backend.h b/include/linux/fsnotify_backend.h
index 487246546ebe..7751b337ec31 100644
--- a/include/linux/fsnotify_backend.h
+++ b/include/linux/fsnotify_backend.h
@@ -42,6 +42,7 @@

#define FS_OPEN_PERM 0x00010000 /* open event in an permission hook */
#define FS_ACCESS_PERM 0x00020000 /* access event in a permissions hook */
+#define FS_MODIFY_DIR 0x00040000 /* directory changed (rename/unlink/...) */

#define FS_EXCL_UNLINK 0x04000000 /* do not send events if object is unlinked */
#define FS_ISDIR 0x40000000 /* event occurred against dir */
diff --git a/include/uapi/linux/fanotify.h b/include/uapi/linux/fanotify.h
index 030508d195d3..f14e048d492a 100644
--- a/include/uapi/linux/fanotify.h
+++ b/include/uapi/linux/fanotify.h
@@ -15,6 +15,8 @@
#define FAN_OPEN_PERM 0x00010000 /* File open in perm check */
#define FAN_ACCESS_PERM 0x00020000 /* File accessed in perm check */

+#define FAN_MODIFY_DIR 0x00040000 /* directory changed (rename/unlink/...) */
+
#define FAN_ONDIR 0x40000000 /* event occurred against dir */

#define FAN_EVENT_ON_CHILD 0x08000000 /* interested in child events */
@@ -67,7 +69,8 @@
#define FAN_ALL_EVENTS (FAN_ACCESS |\
FAN_MODIFY |\
FAN_CLOSE |\
- FAN_OPEN)
+ FAN_OPEN |\
+ FAN_MODIFY_DIR)

/*
* All events which require a permission response from userspace
--
2.11.1


2017-03-13 23:14:12

by Filip Štědronský

[permalink] [raw]
Subject: [RFC 2/2] fanotify: emit FAN_MODIFY_DIR on filesystem changes

Besause fanotify requires `struct path`, the event cannot be generated
directly in `fsnotify_move` and friends because they only get the inode
(and their callers, `vfs_rename`&co. cannot supply any better info).
So instead it needs to be generated higher in the call chain, i.e. in
the callers of functions like `vfs_rename`.

This leads to some code duplication. Currently, there are several places
whence functions like `vfs_rename` or `vfs_unlink` are called:

* syscall handlers (done)
* NFS server (done)
* stacked filesystems
- ecryptfs (done)
- overlayfs
(Currently doesn't report even ordinary fanotify events, because
it internally clones the upper mount; not sure about the
rationale. One can always watch the overlay mount instead.)
* few rather minor things
- devtmpfs
(its internal changes are not tied to any vfsmount so it cannot
emit mount-scoped events)
- cachefiles (done)
- ipc/mqueue.c (done)
- fs/nfsd/nfs4recover.c (done)
- kernel/bpf/inode.c (done)
net/unix/af_unix.c (done)

(grep -rE '\bvfs_(rename|unlink|mknod|whiteout|create|mkdir|rmdir|symlink|link)\(')

Signed-off-by: Filip Štědronský <[email protected]>

---

An alternative might be to create wrapper functions like
vfs_path_(rename|unlink|...). They could also take care of calling
security_path_(rename|unlink|...), which is currently also up to
the indvidual callers (possibly with a flag because it might not
be always desired).
---
fs/cachefiles/namei.c | 9 +++++++
fs/ecryptfs/inode.c | 67 +++++++++++++++++++++++++++++++++++++++++++++++++++
fs/namei.c | 23 +++++++++++++++++-
fs/nfsd/nfs4recover.c | 7 ++++++
fs/nfsd/vfs.c | 24 ++++++++++++++++--
ipc/mqueue.c | 9 +++++++
kernel/bpf/inode.c | 3 +++
net/unix/af_unix.c | 2 ++
8 files changed, 141 insertions(+), 3 deletions(-)

diff --git a/fs/cachefiles/namei.c b/fs/cachefiles/namei.c
index 41df8a27d7eb..8c86699424d1 100644
--- a/fs/cachefiles/namei.c
+++ b/fs/cachefiles/namei.c
@@ -313,6 +313,8 @@ static int cachefiles_bury_object(struct cachefiles_cache *cache,
cachefiles_io_error(cache, "Unlink security error");
} else {
ret = vfs_unlink(d_inode(dir), rep, NULL);
+ if (ret == 0)
+ fsnotify_modify_dir(&path);

if (preemptive)
cachefiles_mark_object_buried(cache, rep, why);
@@ -418,6 +420,10 @@ static int cachefiles_bury_object(struct cachefiles_cache *cache,
if (ret != 0 && ret != -ENOMEM)
cachefiles_io_error(cache,
"Rename failed with error %d", ret);
+ if (ret == 0) {
+ fsnotify_modify_dir(&path);
+ fsnotify_modify_dir(&path_to_graveyard);
+ }

if (preemptive)
cachefiles_mark_object_buried(cache, rep, why);
@@ -560,6 +566,7 @@ int cachefiles_walk_to_object(struct cachefiles_object *parent,
cachefiles_hist(cachefiles_mkdir_histogram, start);
if (ret < 0)
goto create_error;
+ fsnotify_modify_dir(&path);

ASSERT(d_backing_inode(next));

@@ -589,6 +596,7 @@ int cachefiles_walk_to_object(struct cachefiles_object *parent,
cachefiles_hist(cachefiles_create_histogram, start);
if (ret < 0)
goto create_error;
+ fsnotify_modify_dir(&path);

ASSERT(d_backing_inode(next));

@@ -779,6 +787,7 @@ struct dentry *cachefiles_get_directory(struct cachefiles_cache *cache,
ret = vfs_mkdir(d_inode(dir), subdir, 0700);
if (ret < 0)
goto mkdir_error;
+ fsnotify_modify_dir(&path);

ASSERT(d_backing_inode(subdir));

diff --git a/fs/ecryptfs/inode.c b/fs/ecryptfs/inode.c
index e7413f82d27b..88a41b270bcc 100644
--- a/fs/ecryptfs/inode.c
+++ b/fs/ecryptfs/inode.c
@@ -29,6 +29,8 @@
#include <linux/dcache.h>
#include <linux/namei.h>
#include <linux/mount.h>
+#include <linux/path.h>
+#include <linux/fsnotify.h>
#include <linux/fs_stack.h>
#include <linux/slab.h>
#include <linux/xattr.h>
@@ -144,16 +146,22 @@ static int ecryptfs_do_unlink(struct inode *dir, struct dentry *dentry,
{
struct dentry *lower_dentry = ecryptfs_dentry_to_lower(dentry);
struct inode *lower_dir_inode = ecryptfs_inode_to_lower(dir);
+ struct vfsmount *lower_mnt = ecryptfs_dentry_to_lower_mnt(dentry);
+ struct path lower_dir_path = {lower_mnt, NULL};
struct dentry *lower_dir_dentry;
int rc;

dget(lower_dentry);
lower_dir_dentry = lock_parent(lower_dentry);
+ lower_dir_path.dentry = lower_dir_dentry;
rc = vfs_unlink(lower_dir_inode, lower_dentry, NULL);
if (rc) {
printk(KERN_ERR "Error in vfs_unlink; rc = [%d]\n", rc);
goto out_unlock;
}
+
+ fsnotify_modify_dir(&lower_dir_path);
+
fsstack_copy_attr_times(dir, lower_dir_inode);
set_nlink(inode, ecryptfs_inode_to_lower(inode)->i_nlink);
inode->i_ctime = dir->i_ctime;
@@ -184,9 +192,13 @@ ecryptfs_do_create(struct inode *directory_inode,
struct dentry *lower_dentry;
struct dentry *lower_dir_dentry;
struct inode *inode;
+ struct path lower_dir_path;

lower_dentry = ecryptfs_dentry_to_lower(ecryptfs_dentry);
lower_dir_dentry = lock_parent(lower_dentry);
+ lower_dir_path.dentry = lower_dir_dentry;
+ lower_dir_path.mnt = ecryptfs_dentry_to_lower_mnt(ecryptfs_dentry);
+
rc = vfs_create(d_inode(lower_dir_dentry), lower_dentry, mode, true);
if (rc) {
printk(KERN_ERR "%s: Failure to create dentry in lower fs; "
@@ -194,10 +206,14 @@ ecryptfs_do_create(struct inode *directory_inode,
inode = ERR_PTR(rc);
goto out_lock;
}
+
+ fsnotify_modify_dir(&lower_dir_path);
+
inode = __ecryptfs_get_inode(d_inode(lower_dentry),
directory_inode->i_sb);
if (IS_ERR(inode)) {
vfs_unlink(d_inode(lower_dir_dentry), lower_dentry, NULL);
+ fsnotify_modify_dir(&lower_dir_path);
goto out_lock;
}
fsstack_copy_attr_times(directory_inode, d_inode(lower_dir_dentry));
@@ -432,6 +448,7 @@ static int ecryptfs_link(struct dentry *old_dentry, struct inode *dir,
struct dentry *lower_old_dentry;
struct dentry *lower_new_dentry;
struct dentry *lower_dir_dentry;
+ struct path lower_dir_path;
u64 file_size_save;
int rc;

@@ -441,10 +458,16 @@ static int ecryptfs_link(struct dentry *old_dentry, struct inode *dir,
dget(lower_old_dentry);
dget(lower_new_dentry);
lower_dir_dentry = lock_parent(lower_new_dentry);
+ lower_dir_path.dentry = lower_dir_dentry;
+ lower_dir_path.mnt = ecryptfs_dentry_to_lower_mnt(new_dentry);
+
rc = vfs_link(lower_old_dentry, d_inode(lower_dir_dentry),
lower_new_dentry, NULL);
if (rc || d_really_is_negative(lower_new_dentry))
goto out_lock;
+
+ fsnotify_modify_dir(&lower_dir_path);
+
rc = ecryptfs_interpose(lower_new_dentry, new_dentry, dir->i_sb);
if (rc)
goto out_lock;
@@ -471,6 +494,7 @@ static int ecryptfs_symlink(struct inode *dir, struct dentry *dentry,
int rc;
struct dentry *lower_dentry;
struct dentry *lower_dir_dentry;
+ struct path lower_dir_path;
char *encoded_symname;
size_t encoded_symlen;
struct ecryptfs_mount_crypt_stat *mount_crypt_stat = NULL;
@@ -478,6 +502,9 @@ static int ecryptfs_symlink(struct inode *dir, struct dentry *dentry,
lower_dentry = ecryptfs_dentry_to_lower(dentry);
dget(lower_dentry);
lower_dir_dentry = lock_parent(lower_dentry);
+ lower_dir_path.dentry = lower_dir_dentry;
+ lower_dir_path.mnt = ecryptfs_dentry_to_lower_mnt(dentry);
+
mount_crypt_stat = &ecryptfs_superblock_to_private(
dir->i_sb)->mount_crypt_stat;
rc = ecryptfs_encrypt_and_encode_filename(&encoded_symname,
@@ -491,6 +518,9 @@ static int ecryptfs_symlink(struct inode *dir, struct dentry *dentry,
kfree(encoded_symname);
if (rc || d_really_is_negative(lower_dentry))
goto out_lock;
+
+ fsnotify_modify_dir(&lower_dir_path);
+
rc = ecryptfs_interpose(lower_dentry, dentry, dir->i_sb);
if (rc)
goto out_lock;
@@ -509,12 +539,18 @@ static int ecryptfs_mkdir(struct inode *dir, struct dentry *dentry, umode_t mode
int rc;
struct dentry *lower_dentry;
struct dentry *lower_dir_dentry;
+ struct path lower_dir_path;

lower_dentry = ecryptfs_dentry_to_lower(dentry);
lower_dir_dentry = lock_parent(lower_dentry);
+ lower_dir_path.dentry = lower_dir_dentry;
+ lower_dir_path.mnt = ecryptfs_dentry_to_lower_mnt(dentry);
rc = vfs_mkdir(d_inode(lower_dir_dentry), lower_dentry, mode);
if (rc || d_really_is_negative(lower_dentry))
goto out;
+
+ fsnotify_modify_dir(&lower_dir_path);
+
rc = ecryptfs_interpose(lower_dentry, dentry, dir->i_sb);
if (rc)
goto out;
@@ -532,16 +568,24 @@ static int ecryptfs_rmdir(struct inode *dir, struct dentry *dentry)
{
struct dentry *lower_dentry;
struct dentry *lower_dir_dentry;
+ struct path lower_dir_path;
int rc;

lower_dentry = ecryptfs_dentry_to_lower(dentry);
dget(dentry);
lower_dir_dentry = lock_parent(lower_dentry);
+ lower_dir_path.dentry = lower_dir_dentry;
+ lower_dir_path.mnt = ecryptfs_dentry_to_lower_mnt(dentry);
dget(lower_dentry);
+
rc = vfs_rmdir(d_inode(lower_dir_dentry), lower_dentry);
dput(lower_dentry);
if (!rc && d_really_is_positive(dentry))
clear_nlink(d_inode(dentry));
+
+ if (rc)
+ fsnotify_modify_dir(&lower_dir_path);
+
fsstack_copy_attr_times(dir, d_inode(lower_dir_dentry));
set_nlink(dir, d_inode(lower_dir_dentry)->i_nlink);
unlock_dir(lower_dir_dentry);
@@ -557,12 +601,19 @@ ecryptfs_mknod(struct inode *dir, struct dentry *dentry, umode_t mode, dev_t dev
int rc;
struct dentry *lower_dentry;
struct dentry *lower_dir_dentry;
+ struct path lower_dir_path;

lower_dentry = ecryptfs_dentry_to_lower(dentry);
lower_dir_dentry = lock_parent(lower_dentry);
+ lower_dir_path.dentry = lower_dir_dentry;
+ lower_dir_path.mnt = ecryptfs_dentry_to_lower_mnt(dentry);
+
rc = vfs_mknod(d_inode(lower_dir_dentry), lower_dentry, mode, dev);
if (rc || d_really_is_negative(lower_dentry))
goto out;
+
+ fsnotify_modify_dir(&lower_dir_path);
+
rc = ecryptfs_interpose(lower_dentry, dentry, dir->i_sb);
if (rc)
goto out;
@@ -585,6 +636,9 @@ ecryptfs_rename(struct inode *old_dir, struct dentry *old_dentry,
struct dentry *lower_new_dentry;
struct dentry *lower_old_dir_dentry;
struct dentry *lower_new_dir_dentry;
+ struct vfsmount *lower_mnt;
+ struct path lower_old_dir_path;
+ struct path lower_new_dir_path;
struct dentry *trap = NULL;
struct inode *target_inode;

@@ -593,10 +647,15 @@ ecryptfs_rename(struct inode *old_dir, struct dentry *old_dentry,

lower_old_dentry = ecryptfs_dentry_to_lower(old_dentry);
lower_new_dentry = ecryptfs_dentry_to_lower(new_dentry);
+ lower_mnt = ecryptfs_dentry_to_lower_mnt(old_dentry);
dget(lower_old_dentry);
dget(lower_new_dentry);
lower_old_dir_dentry = dget_parent(lower_old_dentry);
lower_new_dir_dentry = dget_parent(lower_new_dentry);
+ lower_old_dir_path.dentry = lower_old_dir_dentry;
+ lower_old_dir_path.mnt = lower_mnt;
+ lower_new_dir_path.dentry = lower_new_dir_dentry;
+ lower_new_dir_path.mnt = lower_mnt;
target_inode = d_inode(new_dentry);
trap = lock_rename(lower_old_dir_dentry, lower_new_dir_dentry);
/* source should not be ancestor of target */
@@ -614,6 +673,14 @@ ecryptfs_rename(struct inode *old_dir, struct dentry *old_dentry,
NULL, 0);
if (rc)
goto out_lock;
+
+ /* ecryptfs does not support crossing mount boundaries, we can take
+ * vfsmount from an arbitrary dentry.
+ */
+ fsnotify_modify_dir(&lower_old_dir_path);
+ if (!path_equal(&lower_old_dir_path, &lower_new_dir_path))
+ fsnotify_modify_dir(&lower_new_dir_path);
+
if (target_inode)
fsstack_copy_attr_all(target_inode,
ecryptfs_inode_to_lower(target_inode));
diff --git a/fs/namei.c b/fs/namei.c
index ad74877e1442..17667f0c89e5 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -3009,8 +3009,12 @@ static int atomic_open(struct nameidata *nd, struct dentry *dentry,
dput(dentry);
dentry = file->f_path.dentry;
}
- if (*opened & FILE_CREATED)
+ if (*opened & FILE_CREATED) {
+ struct path parent_path = {file->f_path.mnt,
+ dentry->d_parent};
fsnotify_create(dir, dentry);
+ fsnotify_modify_dir(&parent_path);
+ }
if (unlikely(d_is_negative(dentry))) {
error = -ENOENT;
} else {
@@ -3157,6 +3161,7 @@ static int lookup_open(struct nameidata *nd, struct path *path,
if (error)
goto out_dput;
fsnotify_create(dir_inode, dentry);
+ fsnotify_modify_dir(&nd->path);
}
if (unlikely(create_error) && !dentry->d_inode) {
error = create_error;
@@ -3702,6 +3707,7 @@ SYSCALL_DEFINE4(mknodat, int, dfd, const char __user *, filename, umode_t, mode,
error = vfs_mknod(path.dentry->d_inode,dentry,mode,0);
break;
}
+ fsnotify_modify_dir(&path);
out:
done_path_create(&path, dentry);
if (retry_estale(error, lookup_flags)) {
@@ -3759,6 +3765,8 @@ SYSCALL_DEFINE3(mkdirat, int, dfd, const char __user *, pathname, umode_t, mode)
error = security_path_mkdir(&path, dentry, mode);
if (!error)
error = vfs_mkdir(path.dentry->d_inode, dentry, mode);
+ if (!error)
+ fsnotify_modify_dir(&path);
done_path_create(&path, dentry);
if (retry_estale(error, lookup_flags)) {
lookup_flags |= LOOKUP_REVAL;
@@ -3855,6 +3863,8 @@ static long do_rmdir(int dfd, const char __user *pathname)
if (error)
goto exit3;
error = vfs_rmdir(path.dentry->d_inode, dentry);
+ if (!error)
+ fsnotify_modify_dir(&path);
exit3:
dput(dentry);
exit2:
@@ -3979,6 +3989,8 @@ static long do_unlinkat(int dfd, const char __user *pathname)
if (error)
goto exit2;
error = vfs_unlink(path.dentry->d_inode, dentry, &delegated_inode);
+ if (!error)
+ fsnotify_modify_dir(&path);
exit2:
dput(dentry);
}
@@ -4070,6 +4082,8 @@ SYSCALL_DEFINE3(symlinkat, const char __user *, oldname,
error = security_path_symlink(&path, dentry, from->name);
if (!error)
error = vfs_symlink(path.dentry->d_inode, dentry, from->name);
+ if (!error)
+ fsnotify_modify_dir(&path);
done_path_create(&path, dentry);
if (retry_estale(error, lookup_flags)) {
lookup_flags |= LOOKUP_REVAL;
@@ -4219,6 +4233,8 @@ SYSCALL_DEFINE5(linkat, int, olddfd, const char __user *, oldname,
if (error)
goto out_dput;
error = vfs_link(old_path.dentry, new_path.dentry->d_inode, new_dentry, &delegated_inode);
+ if (!error)
+ fsnotify_modify_dir(&new_path);
out_dput:
done_path_create(&new_path, new_dentry);
if (delegated_inode) {
@@ -4532,6 +4548,11 @@ SYSCALL_DEFINE5(renameat2, int, olddfd, const char __user *, oldname,
error = vfs_rename(old_path.dentry->d_inode, old_dentry,
new_path.dentry->d_inode, new_dentry,
&delegated_inode, flags);
+ if (error == 0) {
+ fsnotify_modify_dir(&old_path);
+ if (!path_equal(&old_path, &new_path))
+ fsnotify_modify_dir(&new_path);
+ }
exit5:
dput(new_dentry);
exit4:
diff --git a/fs/nfsd/nfs4recover.c b/fs/nfsd/nfs4recover.c
index 66eaeb1e8c2c..58f70bbaac38 100644
--- a/fs/nfsd/nfs4recover.c
+++ b/fs/nfsd/nfs4recover.c
@@ -36,6 +36,7 @@
#include <linux/file.h>
#include <linux/slab.h>
#include <linux/namei.h>
+#include <linux/fsnotify.h>
#include <linux/sched.h>
#include <linux/fs.h>
#include <linux/module.h>
@@ -216,6 +217,8 @@ nfsd4_create_clid_dir(struct nfs4_client *clp)
*/
goto out_put;
status = vfs_mkdir(d_inode(dir), dentry, S_IRWXU);
+ if (status == 0)
+ fsnotify_modify_dir(&nn->rec_file->f_path);
out_put:
dput(dentry);
out_unlock:
@@ -338,6 +341,8 @@ nfsd4_unlink_clid_dir(char *name, int namlen, struct nfsd_net *nn)
if (d_really_is_negative(dentry))
goto out;
status = vfs_rmdir(d_inode(dir), dentry);
+ if (status == 0)
+ fsnotify_modify_dir(&nn->rec_file->f_path);
out:
dput(dentry);
out_unlock:
@@ -401,6 +406,8 @@ purge_old(struct dentry *parent, struct dentry *child, struct nfsd_net *nn)
if (status)
printk("failed to remove client recovery directory %pd\n",
child);
+ else
+ fsnotify_modify_dir(&nn->rec_file->f_path);
/* Keep trying, success or failure: */
return 0;
}
diff --git a/fs/nfsd/vfs.c b/fs/nfsd/vfs.c
index 26c6fdb4bf67..7632ab3fd99e 100644
--- a/fs/nfsd/vfs.c
+++ b/fs/nfsd/vfs.c
@@ -364,6 +364,18 @@ nfsd_get_write_access(struct svc_rqst *rqstp, struct svc_fh *fhp,
}

/*
+ * Helper to emit fsnotify modify_dir event. Call with fph locked.
+ */
+static void nfsd_fsnotify_modify_dir(struct svc_fh *fhp)
+{
+ struct path path;
+
+ path.mnt = fhp->fh_export->ex_path.mnt;
+ path.dentry = fhp->fh_dentry;
+ fsnotify_modify_dir(&path);
+}
+
+/*
* Set various file attributes. After this call fhp needs an fh_put.
*/
__be32
@@ -1207,6 +1219,7 @@ nfsd_create_locked(struct svc_rqst *rqstp, struct svc_fh *fhp,
goto out_nfserr;

err = nfsd_create_setattr(rqstp, resfhp, iap);
+ nfsd_fsnotify_modify_dir(fhp);

/*
* nfsd_create_setattr already committed the child. Transactional
@@ -1525,8 +1538,10 @@ nfsd_symlink(struct svc_rqst *rqstp, struct svc_fh *fhp,

host_err = vfs_symlink(d_inode(dentry), dnew, path);
err = nfserrno(host_err);
- if (!err)
+ if (!err) {
+ nfsd_fsnotify_modify_dir(fhp);
err = nfserrno(commit_metadata(fhp));
+ }
fh_unlock(fhp);

fh_drop_write(fhp);
@@ -1593,6 +1608,7 @@ nfsd_link(struct svc_rqst *rqstp, struct svc_fh *ffhp,
goto out_dput;
host_err = vfs_link(dold, dirp, dnew, NULL);
if (!host_err) {
+ nfsd_fsnotify_modify_dir(tfhp);
err = nfserrno(commit_metadata(ffhp));
if (!err)
err = nfserrno(commit_metadata(tfhp));
@@ -1686,6 +1702,8 @@ nfsd_rename(struct svc_rqst *rqstp, struct svc_fh *ffhp, char *fname, int flen,

host_err = vfs_rename(fdir, odentry, tdir, ndentry, NULL, 0);
if (!host_err) {
+ nfsd_fsnotify_modify_dir(tfhp);
+ nfsd_fsnotify_modify_dir(ffhp);
host_err = commit_metadata(tfhp);
if (!host_err)
host_err = commit_metadata(ffhp);
@@ -1757,8 +1775,10 @@ nfsd_unlink(struct svc_rqst *rqstp, struct svc_fh *fhp, int type,
host_err = vfs_unlink(dirp, rdentry, NULL);
else
host_err = vfs_rmdir(dirp, rdentry);
- if (!host_err)
+ if (!host_err) {
+ nfsd_fsnotify_modify_dir(fhp);
host_err = commit_metadata(fhp);
+ }
dput(rdentry);

out_nfserr:
diff --git a/ipc/mqueue.c b/ipc/mqueue.c
index 7a2d8f0c8ae5..10e413c2216f 100644
--- a/ipc/mqueue.c
+++ b/ipc/mqueue.c
@@ -19,6 +19,7 @@
#include <linux/file.h>
#include <linux/mount.h>
#include <linux/namei.h>
+#include <linux/fsnotify.h>
#include <linux/sysctl.h>
#include <linux/poll.h>
#include <linux/mqueue.h>
@@ -818,6 +819,10 @@ SYSCALL_DEFINE4(mq_open, const char __user *, u_name, int, oflag, umode_t, mode,
filp = do_create(ipc_ns, d_inode(root),
&path, oflag, mode,
u_attr ? &attr : NULL);
+ if (!IS_ERR(filp)) {
+ struct path root_path = {mnt, mnt->mnt_root};
+ fsnotify_modify_dir(&root_path);
+ }
}
} else {
if (d_really_is_negative(path.dentry)) {
@@ -878,6 +883,10 @@ SYSCALL_DEFINE1(mq_unlink, const char __user *, u_name)
} else {
ihold(inode);
err = vfs_unlink(d_inode(dentry->d_parent), dentry, NULL);
+ if (!err) {
+ struct path path = {mnt, dentry->d_parent};
+ fsnotify_modify_dir(&path);
+ }
}
dput(dentry);

diff --git a/kernel/bpf/inode.c b/kernel/bpf/inode.c
index 0b030c9126d3..93137292b051 100644
--- a/kernel/bpf/inode.c
+++ b/kernel/bpf/inode.c
@@ -16,6 +16,7 @@
#include <linux/major.h>
#include <linux/mount.h>
#include <linux/namei.h>
+#include <linux/fsnotify.h>
#include <linux/fs.h>
#include <linux/kdev_t.h>
#include <linux/parser.h>
@@ -255,6 +256,8 @@ static int bpf_obj_do_pin(const struct filename *pathname, void *raw,

dentry->d_fsdata = raw;
ret = vfs_mknod(dir, dentry, mode, devt);
+ if (ret == 0)
+ fsnotify_modify_dir(&path);
dentry->d_fsdata = NULL;
out:
done_path_create(&path, dentry);
diff --git a/net/unix/af_unix.c b/net/unix/af_unix.c
index cef79873b09d..5049bd4bd1d8 100644
--- a/net/unix/af_unix.c
+++ b/net/unix/af_unix.c
@@ -91,6 +91,7 @@
#include <linux/stat.h>
#include <linux/dcache.h>
#include <linux/namei.h>
+#include <linux/fsnotify.h>
#include <linux/socket.h>
#include <linux/un.h>
#include <linux/fcntl.h>
@@ -976,6 +977,7 @@ static int unix_mknod(const char *sun_path, umode_t mode, struct path *res)
if (!err) {
err = vfs_mknod(d_inode(path.dentry), dentry, mode, 0);
if (!err) {
+ fsnotify_modify_dir(&path);
res->mnt = mntget(path.mnt);
res->dentry = dget(dentry);
}
--
2.11.1

2017-03-13 23:16:40

by Filip Štědronský

[permalink] [raw]
Subject: Re: [RFC 1/2] fanotify: new event FAN_MODIFY_DIR

An example userspace program that uses FAN_MODIFY_DIR to reliably keep
an up-to-date internal representation of the file system. It uses some
filehandle trickery to identify inodes, other heuristics could be also
used.

---

//#define _GNU_SOURCE

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <sys/fanotify.h>
#include <stdint.h>
#include <dirent.h>
#include <assert.h>
#include <string.h>

#include <map>
#include <set>
#include <list>
using namespace std;

#ifndef FAN_MODIFY_DIR
#define FAN_MODIFY_DIR 0x00040000
#endif

// die-on-error helpers
#define CHK(x) ({ __typeof__(x) r = x; if (r == -1) { perror(#x); abort(); } r; })
#define CHKN(x) ({ __typeof__(x) r = x; if (r == NULL) { perror(#x); abort(); } r; })
struct inode_info;
struct dentry_info;

struct inode_info {
ino_t ino;
mode_t mode;
char handle[MAX_HANDLE_SZ];
set<struct dentry_info *> links;
map<string, struct dentry_info *> children; // for directory inodes
};

struct dentry_info {
struct inode_info *parent, *inode;
string name;
};


map<ino_t, inode_info*> inodes;

int root_fd;
int fan_fd;

bool compare_handles(const void *h1, const void *h2) {
const struct file_handle *fh1 = (const struct file_handle*) h1;
const struct file_handle *fh2 = (const struct file_handle*) h2;
return (fh1->handle_bytes == fh2->handle_bytes
&& memcmp(h1, h2, fh1->handle_bytes) == 0);
}

bool handle_valid(void *handle) {
int check_fd = open_by_handle_at(root_fd, (struct file_handle*)handle, O_PATH);
if (check_fd >= 0) {
CHK(close(check_fd));
return true;
} else if (errno == ESTALE) {
return false;
} else {
perror("open_by_handle_at");
exit(1);
}
}

// Get the path corresponding to an inode (one of its paths, in the presence of
// hardlinks).
void inode_path(const struct inode_info *inode, char *buf, size_t bufsiz) {
list<string> components;
while (true) {
if (inode->links.empty()) break;
struct dentry_info *dentry = *inode->links.begin();
components.push_front(dentry->name);
inode = dentry->parent;
}
buf[0] = '\0';
for (auto name: components) {
int len = snprintf(buf, bufsiz, "/%s", name.c_str());
buf += len;
bufsiz -= len;
}
}


void delete_dentry(struct dentry_info *dentry) {
assert(dentry->parent->children[dentry->name] == dentry);

char path_buf[4096];
inode_path(dentry->parent, path_buf, sizeof(path_buf));
printf("unlinked %s/%s (ino %lu, parent %lu)\n", path_buf, dentry->name.c_str(),
dentry->inode->ino, dentry->parent->ino);

dentry->parent->children.erase(dentry->name.c_str());
dentry->inode->links.erase(dentry);
// TODO: If this was the last dentry pointing to an inode, schedule removing
// the inode after a timeout (we cannot remove it immediately because
// the zero-link situation might occur during a rename when the source
// directory has been processed but the target directory hasn't).
delete dentry;
}

struct dentry_info *add_dentry(struct inode_info *parent, const char *name,
struct inode_info *child) {
struct dentry_info *dentry = new dentry_info();
dentry->parent = parent;
dentry->name = name;
dentry->inode = child;
parent->children[name] = dentry;
child->links.insert(dentry);

char path_buf[4096] = "\0";
inode_path(parent, path_buf, sizeof(path_buf));
printf("linked %s/%s (ino %lu, parent %lu)\n", path_buf, name, child->ino, parent->ino);

return dentry;
}

void delete_inode(struct inode_info *inode) {
for (auto dentry: inode->links) {
delete_dentry(dentry);
}
delete inode;
}

// Given a file descriptor, find the corresponding inode object in our database,
// or create a new one if it does not exist. An O_PATH fd suffices.
struct inode_info *find_inode(int fd) {
struct stat st;
CHK(fstat(fd, &st));
char handle[sizeof(struct file_handle) + MAX_HANDLE_SZ];
struct file_handle *fh = (struct file_handle*)handle;
fh->handle_bytes = sizeof(handle);
int mntid;
CHK(name_to_handle_at(fd, "", (struct file_handle*)handle, &mntid,
AT_EMPTY_PATH));

struct inode_info *info = inodes[st.st_ino];
if (info) {
// Handles can refer to the same file despite not being equal.
// If the old handle can still be opened, we can be assured
// that the inode number has not been recycled.
if (compare_handles(handle, info->handle) || handle_valid(info->handle)) {
return info;
} else {
delete_inode(info);
info = NULL;
}
}

inodes[st.st_ino] = info = new inode_info();
info->ino = st.st_ino;
info->mode = st.st_mode;
memcpy(info->handle, handle, fh->handle_bytes);
return info;
}

// Scan directory and update internal filesystem representation accordingly.
// Closes `dirfd`.
void scan(int dirfd, bool recursive) {
struct inode_info *dir = find_inode(dirfd);

char path_buf[4096] = "\0";
inode_path(dir, path_buf, sizeof(path_buf));
printf("scan %s (%lu)\n", path_buf, dir->ino);

DIR *dp = CHKN(fdopendir(dirfd));
set<string> seen;
while (struct dirent *ent = readdir(dp)) {
if (strcmp(ent->d_name, ".") == 0 || strcmp(ent->d_name, "..") == 0) continue;
seen.insert(ent->d_name);
if (dir->children.find(ent->d_name) != dir->children.end()
&& dir->children[ent->d_name]->inode->ino == ent->d_ino) {
// Heuristic: It is massively unlikely that an inode number
// would be recylced at the same path as before. So if we
// see the same inode for the same child, we skip the more
// expensive checks altogether. This saves us a buttload of
// syscalls, especially given that most directory entries
// will be unchanged after a FAN_MODIFY_DIR.
//
// This can be skipped if strict correctness is preferred
// over speed.
continue;
}
int fd = openat(dirfd, ent->d_name, O_PATH|O_NOFOLLOW);
if (fd < 0) continue;
struct inode_info *child = find_inode(fd);
if (dir->children.find(ent->d_name) != dir->children.end()) {
struct dentry_info *old_dentry = dir->children[ent->d_name];
if (child != old_dentry->inode) {
delete_dentry(old_dentry);
add_dentry(dir, ent->d_name, child);
}
} else {
add_dentry(dir, ent->d_name, child);
}
if (recursive && S_ISDIR(child->mode)) {
// `fd' is just an O_PATH fd. For scanning we need O_RDONLY.
int scan_fd = CHK(openat(fd, ".", O_RDONLY|O_DIRECTORY));
scan(scan_fd, true); // closes scan_fd
}
close(fd);
}
for (auto it: dir->children) {
if (seen.find(it.second->name) == seen.end()) delete_dentry(it.second);
}
closedir(dp);
}

void event_loop() {
while (true) {
char buf[4096];
ssize_t len = CHK(read(fan_fd, buf, sizeof(buf)));
const struct fanotify_event_metadata *event;
event = (const struct fanotify_event_metadata*) buf;
while (FAN_EVENT_OK(event, len)) {
if (event->vers != FANOTIFY_METADATA_VERSION) abort();
if (event->mask & FAN_MODIFY_DIR) {
scan(event->fd, false);
} else if (event->mask & FAN_Q_OVERFLOW) {
abort(); // TODO: full rescan needed
} else {
close(event->fd);
}
event = FAN_EVENT_NEXT(event, len);
}
}
}

int main(int argc, char **argv) {
if (argc != 2) { fprintf(stderr, "Usage: %s MOUNTPOINT\n", argv[0]); return 1; }

root_fd = CHK(open(argv[1], O_RDONLY|O_DIRECTORY));
// In a real application, FAN_UNLIMITED_QUEUE would be replaced with a secondary
// userspace queue filled during scanning.
fan_fd = CHK(fanotify_init(FAN_UNLIMITED_QUEUE, O_RDONLY));
CHK(fanotify_mark(fan_fd, FAN_MARK_ADD|FAN_MARK_MOUNT, FAN_MODIFY_DIR|FAN_ONDIR,
root_fd, NULL));

scan(dup(root_fd), true);

event_loop();

return 0;
}

2017-03-14 10:11:45

by Amir Goldstein

[permalink] [raw]
Subject: Re: [RFC 1/2] fanotify: new event FAN_MODIFY_DIR

On Tue, Mar 14, 2017 at 1:02 AM, Filip Štědronský <[email protected]> wrote:
> Fanotify currently does not report directory modification events
> (rename, unlink, etc.). But these events are essential for about half of
> concievable fanotify use cases, especially:
>
> - file system indexers / desktop search tools
> - file synchronization tools (like Dropbox, Nextcloud, etc.),
> online backup tools

This last one is the use case of my employer, Ctera Networks.
Out of curiosity, what is the use case that you are focusing on?

>
> and pretty much any app that needs to maintain and up-to-date internal
> representation of current contents of the file system.
>
> All applications of the above classes that I'm aware of currently use
> recursive inotify watches, which do not scale (my home dir has ~200k
> directories, creating all the watches takes ~2min and eats several tens
> of MB of kernel memory).
>
> There have been many calls for such a feature, pretty much since the
> creation of fanotify in 2009:
> * By GNOME developers:
> https://wiki.gnome.org/BastienNocera/KernelWishlist#VFS.2C_filesystems
> * By KDE developers:
> http://lkml.kernel.org/r/201211011352.42476.Martin%40lichtvoll.de
> 'Better support for (desktop) file search / indexing applications'
> * And others:
> http://lkml.kernel.org/r/AANLkTi=owK=WZW4oNtpm5WpAZhqCQUdTR2K5gzJ_MqZ+%40mail.gmail.com
> 'Fanotify mv/rename?'
> http://lkml.kernel.org/r/1238158043.23703.20.camel%40fatty
> 'Issues with using fanotify for a filesystem indexer'
>

Thanks for sharing this summary!
I had the feeling that all recursive inotify users are hiding in the shadows,
but was missing more concrete evidence.

> Add a new fanotify event, FAN_MODIFY_DIR, that is emitted whenever the
> contents of a directory change (a directory entry is added, removed or
> renamed). This covers all the currently missing events: rename, unlink,
> mknod, mkdir, rmdir, etc.
>
> Included with the event is a file descriptor to the modified directory
> but no further info. The userspace application is expected to rescan the
> whole directory and update its model accordingly. This needs to be done
> carefully to prevent race conditions. A cross-directory rename generates
> two FAN_MODIFY_DIR events.
>

Your approach is interesting and I am glad you shared it with us.
I do like it and it gives me an idea, I am going to prepare a branch
with a subset
of my patches, so you can try them with your userspace sample program.

In comments to your patches I am going to argue that, as a matter of fact,
you can take a small sub set of my patches and get the same functionality
of FAN_MODIFY_DIR, with code that is *less* complex then what you propose.
So please treat my comments as technical comments how FAN_MODIFY_DIR
should be implemented IMO and not as an attempt to reject an opposing proposal.

> This minimalistic approach has several advantages:
> * Does not require changes to the fanotify userspace API or the
> fsnotify in-kernel framework, apart from adding a new event.

About the argument of not having to change in-kernel framework,
I don't think it should be a consideration at all.
I claim that my 1st fsnotify cleanup series is an improvement of the framework
(https://github.com/amir73il/linux/commits/fsnotify_dentry)
regardless of additional functionality.
So changing the in-kernel framework by adding complexity may be bad,
but changing it by simplifying and making code more readable should not
be an argument against the "non-minimalistic" approach.

About the change to usespace API, if you strip off the *added* functionality
of super block watch from my patches, your proposal for user API looks
like this:

fanotify_mark(fan_fd, FAN_MARK_ADD|FAN_MARK_MOUNT, \
FAN_MODIFY_DIR|FAN_ONDIR,

And my proposal for user API looks like this:

fanotify_mark(fan_fd, FAN_MARK_ADD|FAN_MARK_MOUNT, \
FAN_MOVE|FAN_CREATE|FAN_DELETE,

You can see why my proposal for user API is not any less minimalistic
and it is fully compatible with existing intotify API for the same events.

I understand why it is hard to see this behind my added functionality,
so I will post a minimalistic branch and test program as an example.

> Especially doesn't complicate it by adding string fields.

So the string fields in my proposal are optional.
userspace opts-in to get them by specifying the FAN_EVENT_INFO_NAME flag:

fanotify_init(FAN_CLOEXEC | FAN_CLASS_NOTIF |
FAN_EVENT_INFO_PARENT | FAN_EVENT_INFO_NAME,

If you don't specify FAN_EVENT_INFO_NAME, you can get filename events
FAN_MOVE|FAN_CREATE|FAN_DELETE without the name.

What you do get is the file descriptor of the parent. sounds familiar? ;-)

The flag FAN_EVENT_INFO_PARENT in an explicit opt-in to get parent fd
instead of victim id on specific events.
If FAN_EVENT_INFO_PARENT is not specified in fanotify_init()
the filename events FAN_MOVE|FAN_CREATE|FAN_DELETE are masked
out of fanotify_mark().

This is important because your patchs adds FAN_MODIFY_DIR to
FAN_ALL_EVENTS (as does mine).
So old fanotify userspace code, compiled with new kernel headers, with:

fanotify_mark(fan_fd, FAN_MARK_ADD|FAN_MARK_MOUNT,
FAN_ALL_EVENTS,

Will start getting your FAN_MODIFY_DIR events and those programs
might have wrong assumptions about the provided fd.

So I chose to have stronger backward compatibility, and added the opt-in
FAN_EVENT_INFO_PARENT flag.

> * Has simple and clear semantics, even with multiple renames occurring
> in parallel etc. In case of any inconsistencies, one can simply wait
> for a while and rescan again.
> * FAN_MODIFY_DIR events are easily merged in case of multiple
> operations on a directory (e.g. deleting all files).
>

I agree those 2 are important advantages.
You could get them with my proposed API when dropping the
FAN_EVENT_INFO_NAME flag.

Thanks for pointing that out.

> Signed-off-by: Filip Štědronský <[email protected]>
>
> ---
>
> An alternative might be to create wrapper functions like
> vfs_path_(rename|unlink|...). They could also take care of calling
> security_path_(rename|unlink|...), which is currently also up to
> the indvidual callers (possibly with a flag because it might not
> be always desired).
>
> An alternative was proposed by Amir Goldstein in several long series of
> patches that add superblock-scoped (as opposed to vfsmount-scoped)
> fanotify watches and specific dentry manipulation events with filenames:
>
> http://lkml.kernel.org/r/1481984434-13293-1-git-send-email-amir73il%40gmail.com
> http://lkml.kernel.org/r/1482247207-4424-1-git-send-email-amir73il%40gmail.com
> http://lkml.kernel.org/r/1476126784-12520-1-git-send-email-amir73il%40gmail.com
> http://lkml.kernel.org/r/1489411223-12081-1-git-send-email-amir73il%40gmail.com
>
> There is large but not complete overlap between that proposal and
> mine (which is was originally created over a year ago, before Amir's
> work, but never posted).
>

I am proud of the role I played to help your proposal see the day of light :-)

> I think the superblock watch idea is very interesting because it might
> in the future allow reporing fsnotify events from remote and virtual
> filesystems. So I'm posting this more as a potential source of more
> ideas for discussion, or a fallback proposal in case Amir's patches
> don't make it.
> ---
> fs/notify/fanotify/fanotify.c | 1 +
> include/linux/fsnotify.h | 17 +++++++++++++++++
> include/linux/fsnotify_backend.h | 1 +
> include/uapi/linux/fanotify.h | 5 ++++-
> 4 files changed, 23 insertions(+), 1 deletion(-)
>
> diff --git a/fs/notify/fanotify/fanotify.c b/fs/notify/fanotify/fanotify.c
> index bbc175d4213d..5178b06c338c 100644
> --- a/fs/notify/fanotify/fanotify.c
> +++ b/fs/notify/fanotify/fanotify.c
> @@ -186,6 +186,7 @@ static int fanotify_handle_event(struct fsnotify_group *group,
>
> BUILD_BUG_ON(FAN_ACCESS != FS_ACCESS);
> BUILD_BUG_ON(FAN_MODIFY != FS_MODIFY);
> + BUILD_BUG_ON(FAN_MODIFY_DIR != FS_MODIFY_DIR);
> BUILD_BUG_ON(FAN_CLOSE_NOWRITE != FS_CLOSE_NOWRITE);
> BUILD_BUG_ON(FAN_CLOSE_WRITE != FS_CLOSE_WRITE);
> BUILD_BUG_ON(FAN_OPEN != FS_OPEN);
> diff --git a/include/linux/fsnotify.h b/include/linux/fsnotify.h
> index b43d3f5bd9ea..00fb87c975d6 100644
> --- a/include/linux/fsnotify.h
> +++ b/include/linux/fsnotify.h
> @@ -208,6 +208,23 @@ static inline void fsnotify_modify(struct file *file)
> }
>
> /*
> + * fsnotify_modifydir - directory contents were changed
> + * (as a result of rename, creat, unlink, etc.)
> + */
> +static inline void fsnotify_modify_dir(struct path *path)
> +{
> + struct inode *inode = path->dentry->d_inode;
> + __u32 mask = FS_MODIFY_DIR;
> +
> + if (S_ISDIR(inode->i_mode))
> + mask |= FS_ISDIR;

It is going to be somewhat confusing for users to understand
the difference between FS_MODIFY_DIR and FS_MODIFY_DIR|FS_ISDIR.
IMO, it is much more intuitive (and compatible with intoify semantics)
when users
get specific event information, e.g. FS_DELETE and FS_DELETE|FS_ISDIR.


> + else
> + return;
> +
> + fsnotify(inode, mask, path, FSNOTIFY_EVENT_PATH, NULL, 0);
> +}
> +
> +/*
> * fsnotify_open - file was opened
> */
> static inline void fsnotify_open(struct file *file)
> diff --git a/include/linux/fsnotify_backend.h b/include/linux/fsnotify_backend.h
> index 487246546ebe..7751b337ec31 100644
> --- a/include/linux/fsnotify_backend.h
> +++ b/include/linux/fsnotify_backend.h
> @@ -42,6 +42,7 @@
>
> #define FS_OPEN_PERM 0x00010000 /* open event in an permission hook */
> #define FS_ACCESS_PERM 0x00020000 /* access event in a permissions hook */
> +#define FS_MODIFY_DIR 0x00040000 /* directory changed (rename/unlink/...) */
>
> #define FS_EXCL_UNLINK 0x04000000 /* do not send events if object is unlinked */
> #define FS_ISDIR 0x40000000 /* event occurred against dir */
> diff --git a/include/uapi/linux/fanotify.h b/include/uapi/linux/fanotify.h
> index 030508d195d3..f14e048d492a 100644
> --- a/include/uapi/linux/fanotify.h
> +++ b/include/uapi/linux/fanotify.h
> @@ -15,6 +15,8 @@
> #define FAN_OPEN_PERM 0x00010000 /* File open in perm check */
> #define FAN_ACCESS_PERM 0x00020000 /* File accessed in perm check */
>
> +#define FAN_MODIFY_DIR 0x00040000 /* directory changed (rename/unlink/...) */
> +
> #define FAN_ONDIR 0x40000000 /* event occurred against dir */
>
> #define FAN_EVENT_ON_CHILD 0x08000000 /* interested in child events */
> @@ -67,7 +69,8 @@
> #define FAN_ALL_EVENTS (FAN_ACCESS |\
> FAN_MODIFY |\
> FAN_CLOSE |\
> - FAN_OPEN)
> + FAN_OPEN |\
> + FAN_MODIFY_DIR)
>
> /*
> * All events which require a permission response from userspace
> --
> 2.11.1
>

2017-03-14 10:41:00

by Amir Goldstein

[permalink] [raw]
Subject: Re: [RFC 1/2] fanotify: new event FAN_MODIFY_DIR

On Tue, Mar 14, 2017 at 1:16 AM, Filip Štědronský <[email protected]> wrote:
> An example userspace program that uses FAN_MODIFY_DIR to reliably keep
> an up-to-date internal representation of the file system. It uses some
> filehandle trickery to identify inodes, other heuristics could be also
> used.
>


An I am very happy that you used filehandles to keep track of objects
in your example, because it fits my proposal really well.

See, if you used my proposed API, you would have had

fan_fd = CHK(fanotify_init(FAN_UNLIMITED_QUEUE| \
FAN_EVENT_INFO_PARENT | FAN_EVENT_INFO_FH,
O_RDONLY));

And then you would have gotten the filhandle of parent with the event
and you would need find_inode().

Furthermore, my proposal records the filehandle at the time of the event
and therefore, does not have to keep an elevated refcount of the object
in the events queue.

Again, this needs some work, but I will try to post a testing branch
which demonstrates how my patches should be used with this test
program.

Thanks again for sharing it.

Amir.

> ---
>
> //#define _GNU_SOURCE
>
> #include <stdio.h>
> #include <stdlib.h>
> #include <unistd.h>
> #include <sys/types.h>
> #include <sys/stat.h>
> #include <fcntl.h>
> #include <sys/fanotify.h>
> #include <stdint.h>
> #include <dirent.h>
> #include <assert.h>
> #include <string.h>
>
> #include <map>
> #include <set>
> #include <list>
> using namespace std;
>
> #ifndef FAN_MODIFY_DIR
> #define FAN_MODIFY_DIR 0x00040000
> #endif
>
> // die-on-error helpers
> #define CHK(x) ({ __typeof__(x) r = x; if (r == -1) { perror(#x); abort(); } r; })
> #define CHKN(x) ({ __typeof__(x) r = x; if (r == NULL) { perror(#x); abort(); } r; })
> struct inode_info;
> struct dentry_info;
>
> struct inode_info {
> ino_t ino;
> mode_t mode;
> char handle[MAX_HANDLE_SZ];
> set<struct dentry_info *> links;
> map<string, struct dentry_info *> children; // for directory inodes
> };
>
> struct dentry_info {
> struct inode_info *parent, *inode;
> string name;
> };
>
>
> map<ino_t, inode_info*> inodes;
>
> int root_fd;
> int fan_fd;
>
> bool compare_handles(const void *h1, const void *h2) {
> const struct file_handle *fh1 = (const struct file_handle*) h1;
> const struct file_handle *fh2 = (const struct file_handle*) h2;
> return (fh1->handle_bytes == fh2->handle_bytes
> && memcmp(h1, h2, fh1->handle_bytes) == 0);
> }
>
> bool handle_valid(void *handle) {
> int check_fd = open_by_handle_at(root_fd, (struct file_handle*)handle, O_PATH);
> if (check_fd >= 0) {
> CHK(close(check_fd));
> return true;
> } else if (errno == ESTALE) {
> return false;
> } else {
> perror("open_by_handle_at");
> exit(1);
> }
> }
>
> // Get the path corresponding to an inode (one of its paths, in the presence of
> // hardlinks).
> void inode_path(const struct inode_info *inode, char *buf, size_t bufsiz) {
> list<string> components;
> while (true) {
> if (inode->links.empty()) break;
> struct dentry_info *dentry = *inode->links.begin();
> components.push_front(dentry->name);
> inode = dentry->parent;
> }
> buf[0] = '\0';
> for (auto name: components) {
> int len = snprintf(buf, bufsiz, "/%s", name.c_str());
> buf += len;
> bufsiz -= len;
> }
> }
>
>
> void delete_dentry(struct dentry_info *dentry) {
> assert(dentry->parent->children[dentry->name] == dentry);
>
> char path_buf[4096];
> inode_path(dentry->parent, path_buf, sizeof(path_buf));
> printf("unlinked %s/%s (ino %lu, parent %lu)\n", path_buf, dentry->name.c_str(),
> dentry->inode->ino, dentry->parent->ino);
>
> dentry->parent->children.erase(dentry->name.c_str());
> dentry->inode->links.erase(dentry);
> // TODO: If this was the last dentry pointing to an inode, schedule removing
> // the inode after a timeout (we cannot remove it immediately because
> // the zero-link situation might occur during a rename when the source
> // directory has been processed but the target directory hasn't).
> delete dentry;
> }
>
> struct dentry_info *add_dentry(struct inode_info *parent, const char *name,
> struct inode_info *child) {
> struct dentry_info *dentry = new dentry_info();
> dentry->parent = parent;
> dentry->name = name;
> dentry->inode = child;
> parent->children[name] = dentry;
> child->links.insert(dentry);
>
> char path_buf[4096] = "\0";
> inode_path(parent, path_buf, sizeof(path_buf));
> printf("linked %s/%s (ino %lu, parent %lu)\n", path_buf, name, child->ino, parent->ino);
>
> return dentry;
> }
>
> void delete_inode(struct inode_info *inode) {
> for (auto dentry: inode->links) {
> delete_dentry(dentry);
> }
> delete inode;
> }
>
> // Given a file descriptor, find the corresponding inode object in our database,
> // or create a new one if it does not exist. An O_PATH fd suffices.
> struct inode_info *find_inode(int fd) {
> struct stat st;
> CHK(fstat(fd, &st));
> char handle[sizeof(struct file_handle) + MAX_HANDLE_SZ];
> struct file_handle *fh = (struct file_handle*)handle;
> fh->handle_bytes = sizeof(handle);
> int mntid;
> CHK(name_to_handle_at(fd, "", (struct file_handle*)handle, &mntid,
> AT_EMPTY_PATH));
>
> struct inode_info *info = inodes[st.st_ino];
> if (info) {
> // Handles can refer to the same file despite not being equal.
> // If the old handle can still be opened, we can be assured
> // that the inode number has not been recycled.
> if (compare_handles(handle, info->handle) || handle_valid(info->handle)) {
> return info;
> } else {
> delete_inode(info);
> info = NULL;
> }
> }
>
> inodes[st.st_ino] = info = new inode_info();
> info->ino = st.st_ino;
> info->mode = st.st_mode;
> memcpy(info->handle, handle, fh->handle_bytes);
> return info;
> }
>
> // Scan directory and update internal filesystem representation accordingly.
> // Closes `dirfd`.
> void scan(int dirfd, bool recursive) {
> struct inode_info *dir = find_inode(dirfd);
>
> char path_buf[4096] = "\0";
> inode_path(dir, path_buf, sizeof(path_buf));
> printf("scan %s (%lu)\n", path_buf, dir->ino);
>
> DIR *dp = CHKN(fdopendir(dirfd));
> set<string> seen;
> while (struct dirent *ent = readdir(dp)) {
> if (strcmp(ent->d_name, ".") == 0 || strcmp(ent->d_name, "..") == 0) continue;
> seen.insert(ent->d_name);
> if (dir->children.find(ent->d_name) != dir->children.end()
> && dir->children[ent->d_name]->inode->ino == ent->d_ino) {
> // Heuristic: It is massively unlikely that an inode number
> // would be recylced at the same path as before. So if we
> // see the same inode for the same child, we skip the more
> // expensive checks altogether. This saves us a buttload of
> // syscalls, especially given that most directory entries
> // will be unchanged after a FAN_MODIFY_DIR.
> //
> // This can be skipped if strict correctness is preferred
> // over speed.
> continue;
> }
> int fd = openat(dirfd, ent->d_name, O_PATH|O_NOFOLLOW);
> if (fd < 0) continue;
> struct inode_info *child = find_inode(fd);
> if (dir->children.find(ent->d_name) != dir->children.end()) {
> struct dentry_info *old_dentry = dir->children[ent->d_name];
> if (child != old_dentry->inode) {
> delete_dentry(old_dentry);
> add_dentry(dir, ent->d_name, child);
> }
> } else {
> add_dentry(dir, ent->d_name, child);
> }
> if (recursive && S_ISDIR(child->mode)) {
> // `fd' is just an O_PATH fd. For scanning we need O_RDONLY.
> int scan_fd = CHK(openat(fd, ".", O_RDONLY|O_DIRECTORY));
> scan(scan_fd, true); // closes scan_fd
> }
> close(fd);
> }
> for (auto it: dir->children) {
> if (seen.find(it.second->name) == seen.end()) delete_dentry(it.second);
> }
> closedir(dp);
> }
>
> void event_loop() {
> while (true) {
> char buf[4096];
> ssize_t len = CHK(read(fan_fd, buf, sizeof(buf)));
> const struct fanotify_event_metadata *event;
> event = (const struct fanotify_event_metadata*) buf;
> while (FAN_EVENT_OK(event, len)) {
> if (event->vers != FANOTIFY_METADATA_VERSION) abort();
> if (event->mask & FAN_MODIFY_DIR) {
> scan(event->fd, false);
> } else if (event->mask & FAN_Q_OVERFLOW) {
> abort(); // TODO: full rescan needed
> } else {
> close(event->fd);
> }
> event = FAN_EVENT_NEXT(event, len);
> }
> }
> }
>
> int main(int argc, char **argv) {
> if (argc != 2) { fprintf(stderr, "Usage: %s MOUNTPOINT\n", argv[0]); return 1; }
>
> root_fd = CHK(open(argv[1], O_RDONLY|O_DIRECTORY));
> // In a real application, FAN_UNLIMITED_QUEUE would be replaced with a secondary
> // userspace queue filled during scanning.
> fan_fd = CHK(fanotify_init(FAN_UNLIMITED_QUEUE, O_RDONLY));
> CHK(fanotify_mark(fan_fd, FAN_MARK_ADD|FAN_MARK_MOUNT, FAN_MODIFY_DIR|FAN_ONDIR,
> root_fd, NULL));
>
> scan(dup(root_fd), true);
>
> event_loop();
>
> return 0;
> }

2017-03-14 11:18:06

by Amir Goldstein

[permalink] [raw]
Subject: Re: [RFC 2/2] fanotify: emit FAN_MODIFY_DIR on filesystem changes

On Tue, Mar 14, 2017 at 1:03 AM, Filip Štědronský <[email protected]> wrote:
> Besause fanotify requires `struct path`, the event cannot be generated
> directly in `fsnotify_move` and friends because they only get the inode
> (and their callers, `vfs_rename`&co. cannot supply any better info).
> So instead it needs to be generated higher in the call chain, i.e. in
> the callers of functions like `vfs_rename`.
>
> This leads to some code duplication. Currently, there are several places
> whence functions like `vfs_rename` or `vfs_unlink` are called:
>
> * syscall handlers (done)
> * NFS server (done)
> * stacked filesystems
> - ecryptfs (done)
> - overlayfs
> (Currently doesn't report even ordinary fanotify events, because
> it internally clones the upper mount; not sure about the
> rationale. One can always watch the overlay mount instead.)
> * few rather minor things
> - devtmpfs
> (its internal changes are not tied to any vfsmount so it cannot
> emit mount-scoped events)
> - cachefiles (done)
> - ipc/mqueue.c (done)
> - fs/nfsd/nfs4recover.c (done)
> - kernel/bpf/inode.c (done)
> net/unix/af_unix.c (done)
>
> (grep -rE '\bvfs_(rename|unlink|mknod|whiteout|create|mkdir|rmdir|symlink|link)\(')
>
> Signed-off-by: Filip Štědronský <[email protected]>
>
> ---
>
> An alternative might be to create wrapper functions like
> vfs_path_(rename|unlink|...). They could also take care of calling
> security_path_(rename|unlink|...), which is currently also up to
> the indvidual callers (possibly with a flag because it might not
> be always desired).

That's an interesting idea. There is some duplicity between security/audit
hook and fsnotify hooks. It should be interesting to try and deduplicate
some of this code.

> ---
> fs/cachefiles/namei.c | 9 +++++++
> fs/ecryptfs/inode.c | 67 +++++++++++++++++++++++++++++++++++++++++++++++++++
> fs/namei.c | 23 +++++++++++++++++-
> fs/nfsd/nfs4recover.c | 7 ++++++
> fs/nfsd/vfs.c | 24 ++++++++++++++++--
> ipc/mqueue.c | 9 +++++++
> kernel/bpf/inode.c | 3 +++
> net/unix/af_unix.c | 2 ++
> 8 files changed, 141 insertions(+), 3 deletions(-)
>

OK, just for comparison, I am going to put here the diff of the sub set of
my patches that are needed to support fanotify filename events.


$ git diff --stat fsnotify_sb..fanotify_dentry
fs/notify/fanotify/fanotify.c | 94
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++--------------
fs/notify/fanotify/fanotify.h | 25 ++++++++++++++++++++++++-
fs/notify/fanotify/fanotify_user.c | 92
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++--------
fs/notify/fdinfo.c | 25 +++----------------------
fs/notify/inode_mark.c | 1 +
fs/notify/mark.c | 15 ++++++++++++---
include/linux/fsnotify_backend.h | 21 ++++++++++++++++-----
include/uapi/linux/fanotify.h | 41
+++++++++++++++++++++++++++++++++++------
8 files changed, 255 insertions(+), 59 deletions(-)

Yes, it is a bit more code, much mostly because it adds more functionality
(optionally reporting the filename).
But most of the code is contained within the fsnotify/fanotify subsystem.

The altenative to sprinkle fsnotify_modify_dir() hooks is much less
maintainable IMO.

Of course I am presenting biased information :-)
so for full disclosure, these patches also depend on a previous
cleanup series, with the following diffstat.
But as I claimed before and am going to claim again,
the cleanup series improves the code IMO regardless of
the additional functionality that it enables:

$ git diff --stat base..fsnotify_dentry
arch/powerpc/platforms/cell/spufs/inode.c | 2 +-
fs/btrfs/ioctl.c | 2 +-
fs/debugfs/inode.c | 8 ++++----
fs/devpts/inode.c | 2 +-
fs/namei.c | 23 +++++++++++++----------
fs/notify/fsnotify.c | 2 +-
fs/ocfs2/refcounttree.c | 2 +-
fs/overlayfs/inode.c | 15 ++++++++-------
fs/tracefs/inode.c | 4 ++--
include/linux/fsnotify.h | 78
+++++++++++++++++++++++++++++++++++++++++++++++++++++-------------------------
include/linux/fsnotify_backend.h | 3 ++-
net/sunrpc/rpc_pipe.c | 6 +++---
12 files changed, 90 insertions(+), 57 deletions(-)

But even when taking the cleanup series into account,
the changes outside of the fsnotify subsystem and include files
are still a lot smaller then in your counter proposal.

This does not come without a price though.
I managed to stay a way from cross subsystems changes,
by allowing to loose some information about the event.

When a filename event is generated (rename|delete|create)
the path of the parent fd that will be reported to user is NOT
the actual path that the process executing the operation used,
but the path from which the watching process has added the mark.

So for example, if you have a bind mount:
mount -o bind /a/b/c/d/e/f/g /tmp/g

And you add a watch on mount /a
you *will* get events for create,delete,rename in directory /tmp/g
but that path in the associated fd with point to /a/b/c/d/e/f/g

Some would claim that it is wrong to report the events
if they were not originated from the mount were the
watch was added.

I claim that fanotify filters event by mount not because it
was a requirement, but because it was an implementation challenge
to do otherwise.
And I claim that what mount watchers are really interested in is
"all the changes that happen in the file system in the area
that is visible to me through this mount point".

In other words, an indexer needs to know if files were modified\
create/deleted if that indexer sits in container host namespace
regardless if those files were modified from within a container
namespace.

It's not a matter of security/isolation. It's a matter of functionality.
I agree that for some event (e.g. permission events) it is possible
to argue both ways (i.e. that the namespace context should be used
as a filter for events).
But for the new proposed events (FS_MODIFY_DIR), I really don't
see the point in isolation by mount/namespace.

> diff --git a/fs/cachefiles/namei.c b/fs/cachefiles/namei.c
> index 41df8a27d7eb..8c86699424d1 100644
> --- a/fs/cachefiles/namei.c
> +++ b/fs/cachefiles/namei.c
> @@ -313,6 +313,8 @@ static int cachefiles_bury_object(struct cachefiles_cache *cache,
> cachefiles_io_error(cache, "Unlink security error");
> } else {
> ret = vfs_unlink(d_inode(dir), rep, NULL);
> + if (ret == 0)
> + fsnotify_modify_dir(&path);
>
> if (preemptive)
> cachefiles_mark_object_buried(cache, rep, why);
> @@ -418,6 +420,10 @@ static int cachefiles_bury_object(struct cachefiles_cache *cache,
> if (ret != 0 && ret != -ENOMEM)
> cachefiles_io_error(cache,
> "Rename failed with error %d", ret);
> + if (ret == 0) {
> + fsnotify_modify_dir(&path);
> + fsnotify_modify_dir(&path_to_graveyard);
> + }
>
> if (preemptive)
> cachefiles_mark_object_buried(cache, rep, why);
> @@ -560,6 +566,7 @@ int cachefiles_walk_to_object(struct cachefiles_object *parent,
> cachefiles_hist(cachefiles_mkdir_histogram, start);
> if (ret < 0)
> goto create_error;
> + fsnotify_modify_dir(&path);
>
> ASSERT(d_backing_inode(next));
>
> @@ -589,6 +596,7 @@ int cachefiles_walk_to_object(struct cachefiles_object *parent,
> cachefiles_hist(cachefiles_create_histogram, start);
> if (ret < 0)
> goto create_error;
> + fsnotify_modify_dir(&path);
>
> ASSERT(d_backing_inode(next));
>
> @@ -779,6 +787,7 @@ struct dentry *cachefiles_get_directory(struct cachefiles_cache *cache,
> ret = vfs_mkdir(d_inode(dir), subdir, 0700);
> if (ret < 0)
> goto mkdir_error;
> + fsnotify_modify_dir(&path);
>
> ASSERT(d_backing_inode(subdir));
>
> diff --git a/fs/ecryptfs/inode.c b/fs/ecryptfs/inode.c
> index e7413f82d27b..88a41b270bcc 100644
> --- a/fs/ecryptfs/inode.c
> +++ b/fs/ecryptfs/inode.c
> @@ -29,6 +29,8 @@
> #include <linux/dcache.h>
> #include <linux/namei.h>
> #include <linux/mount.h>
> +#include <linux/path.h>
> +#include <linux/fsnotify.h>
> #include <linux/fs_stack.h>
> #include <linux/slab.h>
> #include <linux/xattr.h>
> @@ -144,16 +146,22 @@ static int ecryptfs_do_unlink(struct inode *dir, struct dentry *dentry,
> {
> struct dentry *lower_dentry = ecryptfs_dentry_to_lower(dentry);
> struct inode *lower_dir_inode = ecryptfs_inode_to_lower(dir);
> + struct vfsmount *lower_mnt = ecryptfs_dentry_to_lower_mnt(dentry);
> + struct path lower_dir_path = {lower_mnt, NULL};
> struct dentry *lower_dir_dentry;
> int rc;
>
> dget(lower_dentry);
> lower_dir_dentry = lock_parent(lower_dentry);
> + lower_dir_path.dentry = lower_dir_dentry;
> rc = vfs_unlink(lower_dir_inode, lower_dentry, NULL);
> if (rc) {
> printk(KERN_ERR "Error in vfs_unlink; rc = [%d]\n", rc);
> goto out_unlock;
> }
> +
> + fsnotify_modify_dir(&lower_dir_path);
> +
> fsstack_copy_attr_times(dir, lower_dir_inode);
> set_nlink(inode, ecryptfs_inode_to_lower(inode)->i_nlink);
> inode->i_ctime = dir->i_ctime;
> @@ -184,9 +192,13 @@ ecryptfs_do_create(struct inode *directory_inode,
> struct dentry *lower_dentry;
> struct dentry *lower_dir_dentry;
> struct inode *inode;
> + struct path lower_dir_path;
>
> lower_dentry = ecryptfs_dentry_to_lower(ecryptfs_dentry);
> lower_dir_dentry = lock_parent(lower_dentry);
> + lower_dir_path.dentry = lower_dir_dentry;
> + lower_dir_path.mnt = ecryptfs_dentry_to_lower_mnt(ecryptfs_dentry);
> +
> rc = vfs_create(d_inode(lower_dir_dentry), lower_dentry, mode, true);
> if (rc) {
> printk(KERN_ERR "%s: Failure to create dentry in lower fs; "
> @@ -194,10 +206,14 @@ ecryptfs_do_create(struct inode *directory_inode,
> inode = ERR_PTR(rc);
> goto out_lock;
> }
> +
> + fsnotify_modify_dir(&lower_dir_path);
> +
> inode = __ecryptfs_get_inode(d_inode(lower_dentry),
> directory_inode->i_sb);
> if (IS_ERR(inode)) {
> vfs_unlink(d_inode(lower_dir_dentry), lower_dentry, NULL);
> + fsnotify_modify_dir(&lower_dir_path);
> goto out_lock;
> }
> fsstack_copy_attr_times(directory_inode, d_inode(lower_dir_dentry));
> @@ -432,6 +448,7 @@ static int ecryptfs_link(struct dentry *old_dentry, struct inode *dir,
> struct dentry *lower_old_dentry;
> struct dentry *lower_new_dentry;
> struct dentry *lower_dir_dentry;
> + struct path lower_dir_path;
> u64 file_size_save;
> int rc;
>
> @@ -441,10 +458,16 @@ static int ecryptfs_link(struct dentry *old_dentry, struct inode *dir,
> dget(lower_old_dentry);
> dget(lower_new_dentry);
> lower_dir_dentry = lock_parent(lower_new_dentry);
> + lower_dir_path.dentry = lower_dir_dentry;
> + lower_dir_path.mnt = ecryptfs_dentry_to_lower_mnt(new_dentry);
> +
> rc = vfs_link(lower_old_dentry, d_inode(lower_dir_dentry),
> lower_new_dentry, NULL);
> if (rc || d_really_is_negative(lower_new_dentry))
> goto out_lock;
> +
> + fsnotify_modify_dir(&lower_dir_path);
> +
> rc = ecryptfs_interpose(lower_new_dentry, new_dentry, dir->i_sb);
> if (rc)
> goto out_lock;
> @@ -471,6 +494,7 @@ static int ecryptfs_symlink(struct inode *dir, struct dentry *dentry,
> int rc;
> struct dentry *lower_dentry;
> struct dentry *lower_dir_dentry;
> + struct path lower_dir_path;
> char *encoded_symname;
> size_t encoded_symlen;
> struct ecryptfs_mount_crypt_stat *mount_crypt_stat = NULL;
> @@ -478,6 +502,9 @@ static int ecryptfs_symlink(struct inode *dir, struct dentry *dentry,
> lower_dentry = ecryptfs_dentry_to_lower(dentry);
> dget(lower_dentry);
> lower_dir_dentry = lock_parent(lower_dentry);
> + lower_dir_path.dentry = lower_dir_dentry;
> + lower_dir_path.mnt = ecryptfs_dentry_to_lower_mnt(dentry);
> +
> mount_crypt_stat = &ecryptfs_superblock_to_private(
> dir->i_sb)->mount_crypt_stat;
> rc = ecryptfs_encrypt_and_encode_filename(&encoded_symname,
> @@ -491,6 +518,9 @@ static int ecryptfs_symlink(struct inode *dir, struct dentry *dentry,
> kfree(encoded_symname);
> if (rc || d_really_is_negative(lower_dentry))
> goto out_lock;
> +
> + fsnotify_modify_dir(&lower_dir_path);
> +
> rc = ecryptfs_interpose(lower_dentry, dentry, dir->i_sb);
> if (rc)
> goto out_lock;
> @@ -509,12 +539,18 @@ static int ecryptfs_mkdir(struct inode *dir, struct dentry *dentry, umode_t mode
> int rc;
> struct dentry *lower_dentry;
> struct dentry *lower_dir_dentry;
> + struct path lower_dir_path;
>
> lower_dentry = ecryptfs_dentry_to_lower(dentry);
> lower_dir_dentry = lock_parent(lower_dentry);
> + lower_dir_path.dentry = lower_dir_dentry;
> + lower_dir_path.mnt = ecryptfs_dentry_to_lower_mnt(dentry);
> rc = vfs_mkdir(d_inode(lower_dir_dentry), lower_dentry, mode);
> if (rc || d_really_is_negative(lower_dentry))
> goto out;
> +
> + fsnotify_modify_dir(&lower_dir_path);
> +
> rc = ecryptfs_interpose(lower_dentry, dentry, dir->i_sb);
> if (rc)
> goto out;
> @@ -532,16 +568,24 @@ static int ecryptfs_rmdir(struct inode *dir, struct dentry *dentry)
> {
> struct dentry *lower_dentry;
> struct dentry *lower_dir_dentry;
> + struct path lower_dir_path;
> int rc;
>
> lower_dentry = ecryptfs_dentry_to_lower(dentry);
> dget(dentry);
> lower_dir_dentry = lock_parent(lower_dentry);
> + lower_dir_path.dentry = lower_dir_dentry;
> + lower_dir_path.mnt = ecryptfs_dentry_to_lower_mnt(dentry);
> dget(lower_dentry);
> +
> rc = vfs_rmdir(d_inode(lower_dir_dentry), lower_dentry);
> dput(lower_dentry);
> if (!rc && d_really_is_positive(dentry))
> clear_nlink(d_inode(dentry));
> +
> + if (rc)
> + fsnotify_modify_dir(&lower_dir_path);
> +
> fsstack_copy_attr_times(dir, d_inode(lower_dir_dentry));
> set_nlink(dir, d_inode(lower_dir_dentry)->i_nlink);
> unlock_dir(lower_dir_dentry);
> @@ -557,12 +601,19 @@ ecryptfs_mknod(struct inode *dir, struct dentry *dentry, umode_t mode, dev_t dev
> int rc;
> struct dentry *lower_dentry;
> struct dentry *lower_dir_dentry;
> + struct path lower_dir_path;
>
> lower_dentry = ecryptfs_dentry_to_lower(dentry);
> lower_dir_dentry = lock_parent(lower_dentry);
> + lower_dir_path.dentry = lower_dir_dentry;
> + lower_dir_path.mnt = ecryptfs_dentry_to_lower_mnt(dentry);
> +
> rc = vfs_mknod(d_inode(lower_dir_dentry), lower_dentry, mode, dev);
> if (rc || d_really_is_negative(lower_dentry))
> goto out;
> +
> + fsnotify_modify_dir(&lower_dir_path);
> +
> rc = ecryptfs_interpose(lower_dentry, dentry, dir->i_sb);
> if (rc)
> goto out;
> @@ -585,6 +636,9 @@ ecryptfs_rename(struct inode *old_dir, struct dentry *old_dentry,
> struct dentry *lower_new_dentry;
> struct dentry *lower_old_dir_dentry;
> struct dentry *lower_new_dir_dentry;
> + struct vfsmount *lower_mnt;
> + struct path lower_old_dir_path;
> + struct path lower_new_dir_path;
> struct dentry *trap = NULL;
> struct inode *target_inode;
>
> @@ -593,10 +647,15 @@ ecryptfs_rename(struct inode *old_dir, struct dentry *old_dentry,
>
> lower_old_dentry = ecryptfs_dentry_to_lower(old_dentry);
> lower_new_dentry = ecryptfs_dentry_to_lower(new_dentry);
> + lower_mnt = ecryptfs_dentry_to_lower_mnt(old_dentry);
> dget(lower_old_dentry);
> dget(lower_new_dentry);
> lower_old_dir_dentry = dget_parent(lower_old_dentry);
> lower_new_dir_dentry = dget_parent(lower_new_dentry);
> + lower_old_dir_path.dentry = lower_old_dir_dentry;
> + lower_old_dir_path.mnt = lower_mnt;
> + lower_new_dir_path.dentry = lower_new_dir_dentry;
> + lower_new_dir_path.mnt = lower_mnt;
> target_inode = d_inode(new_dentry);
> trap = lock_rename(lower_old_dir_dentry, lower_new_dir_dentry);
> /* source should not be ancestor of target */
> @@ -614,6 +673,14 @@ ecryptfs_rename(struct inode *old_dir, struct dentry *old_dentry,
> NULL, 0);
> if (rc)
> goto out_lock;
> +
> + /* ecryptfs does not support crossing mount boundaries, we can take
> + * vfsmount from an arbitrary dentry.
> + */
> + fsnotify_modify_dir(&lower_old_dir_path);
> + if (!path_equal(&lower_old_dir_path, &lower_new_dir_path))
> + fsnotify_modify_dir(&lower_new_dir_path);
> +
> if (target_inode)
> fsstack_copy_attr_all(target_inode,
> ecryptfs_inode_to_lower(target_inode));
> diff --git a/fs/namei.c b/fs/namei.c
> index ad74877e1442..17667f0c89e5 100644
> --- a/fs/namei.c
> +++ b/fs/namei.c
> @@ -3009,8 +3009,12 @@ static int atomic_open(struct nameidata *nd, struct dentry *dentry,
> dput(dentry);
> dentry = file->f_path.dentry;
> }
> - if (*opened & FILE_CREATED)
> + if (*opened & FILE_CREATED) {
> + struct path parent_path = {file->f_path.mnt,
> + dentry->d_parent};
> fsnotify_create(dir, dentry);
> + fsnotify_modify_dir(&parent_path);
> + }
> if (unlikely(d_is_negative(dentry))) {
> error = -ENOENT;
> } else {
> @@ -3157,6 +3161,7 @@ static int lookup_open(struct nameidata *nd, struct path *path,
> if (error)
> goto out_dput;
> fsnotify_create(dir_inode, dentry);
> + fsnotify_modify_dir(&nd->path);
> }
> if (unlikely(create_error) && !dentry->d_inode) {
> error = create_error;
> @@ -3702,6 +3707,7 @@ SYSCALL_DEFINE4(mknodat, int, dfd, const char __user *, filename, umode_t, mode,
> error = vfs_mknod(path.dentry->d_inode,dentry,mode,0);
> break;
> }
> + fsnotify_modify_dir(&path);
> out:
> done_path_create(&path, dentry);
> if (retry_estale(error, lookup_flags)) {
> @@ -3759,6 +3765,8 @@ SYSCALL_DEFINE3(mkdirat, int, dfd, const char __user *, pathname, umode_t, mode)
> error = security_path_mkdir(&path, dentry, mode);
> if (!error)
> error = vfs_mkdir(path.dentry->d_inode, dentry, mode);
> + if (!error)
> + fsnotify_modify_dir(&path);
> done_path_create(&path, dentry);
> if (retry_estale(error, lookup_flags)) {
> lookup_flags |= LOOKUP_REVAL;
> @@ -3855,6 +3863,8 @@ static long do_rmdir(int dfd, const char __user *pathname)
> if (error)
> goto exit3;
> error = vfs_rmdir(path.dentry->d_inode, dentry);
> + if (!error)
> + fsnotify_modify_dir(&path);
> exit3:
> dput(dentry);
> exit2:
> @@ -3979,6 +3989,8 @@ static long do_unlinkat(int dfd, const char __user *pathname)
> if (error)
> goto exit2;
> error = vfs_unlink(path.dentry->d_inode, dentry, &delegated_inode);
> + if (!error)
> + fsnotify_modify_dir(&path);
> exit2:
> dput(dentry);
> }
> @@ -4070,6 +4082,8 @@ SYSCALL_DEFINE3(symlinkat, const char __user *, oldname,
> error = security_path_symlink(&path, dentry, from->name);
> if (!error)
> error = vfs_symlink(path.dentry->d_inode, dentry, from->name);
> + if (!error)
> + fsnotify_modify_dir(&path);
> done_path_create(&path, dentry);
> if (retry_estale(error, lookup_flags)) {
> lookup_flags |= LOOKUP_REVAL;
> @@ -4219,6 +4233,8 @@ SYSCALL_DEFINE5(linkat, int, olddfd, const char __user *, oldname,
> if (error)
> goto out_dput;
> error = vfs_link(old_path.dentry, new_path.dentry->d_inode, new_dentry, &delegated_inode);
> + if (!error)
> + fsnotify_modify_dir(&new_path);
> out_dput:
> done_path_create(&new_path, new_dentry);
> if (delegated_inode) {
> @@ -4532,6 +4548,11 @@ SYSCALL_DEFINE5(renameat2, int, olddfd, const char __user *, oldname,
> error = vfs_rename(old_path.dentry->d_inode, old_dentry,
> new_path.dentry->d_inode, new_dentry,
> &delegated_inode, flags);
> + if (error == 0) {
> + fsnotify_modify_dir(&old_path);
> + if (!path_equal(&old_path, &new_path))
> + fsnotify_modify_dir(&new_path);
> + }
> exit5:
> dput(new_dentry);
> exit4:
> diff --git a/fs/nfsd/nfs4recover.c b/fs/nfsd/nfs4recover.c
> index 66eaeb1e8c2c..58f70bbaac38 100644
> --- a/fs/nfsd/nfs4recover.c
> +++ b/fs/nfsd/nfs4recover.c
> @@ -36,6 +36,7 @@
> #include <linux/file.h>
> #include <linux/slab.h>
> #include <linux/namei.h>
> +#include <linux/fsnotify.h>
> #include <linux/sched.h>
> #include <linux/fs.h>
> #include <linux/module.h>
> @@ -216,6 +217,8 @@ nfsd4_create_clid_dir(struct nfs4_client *clp)
> */
> goto out_put;
> status = vfs_mkdir(d_inode(dir), dentry, S_IRWXU);
> + if (status == 0)
> + fsnotify_modify_dir(&nn->rec_file->f_path);
> out_put:
> dput(dentry);
> out_unlock:
> @@ -338,6 +341,8 @@ nfsd4_unlink_clid_dir(char *name, int namlen, struct nfsd_net *nn)
> if (d_really_is_negative(dentry))
> goto out;
> status = vfs_rmdir(d_inode(dir), dentry);
> + if (status == 0)
> + fsnotify_modify_dir(&nn->rec_file->f_path);
> out:
> dput(dentry);
> out_unlock:
> @@ -401,6 +406,8 @@ purge_old(struct dentry *parent, struct dentry *child, struct nfsd_net *nn)
> if (status)
> printk("failed to remove client recovery directory %pd\n",
> child);
> + else
> + fsnotify_modify_dir(&nn->rec_file->f_path);
> /* Keep trying, success or failure: */
> return 0;
> }
> diff --git a/fs/nfsd/vfs.c b/fs/nfsd/vfs.c
> index 26c6fdb4bf67..7632ab3fd99e 100644
> --- a/fs/nfsd/vfs.c
> +++ b/fs/nfsd/vfs.c
> @@ -364,6 +364,18 @@ nfsd_get_write_access(struct svc_rqst *rqstp, struct svc_fh *fhp,
> }
>
> /*
> + * Helper to emit fsnotify modify_dir event. Call with fph locked.
> + */
> +static void nfsd_fsnotify_modify_dir(struct svc_fh *fhp)
> +{
> + struct path path;
> +
> + path.mnt = fhp->fh_export->ex_path.mnt;
> + path.dentry = fhp->fh_dentry;
> + fsnotify_modify_dir(&path);
> +}
> +
> +/*
> * Set various file attributes. After this call fhp needs an fh_put.
> */
> __be32
> @@ -1207,6 +1219,7 @@ nfsd_create_locked(struct svc_rqst *rqstp, struct svc_fh *fhp,
> goto out_nfserr;
>
> err = nfsd_create_setattr(rqstp, resfhp, iap);
> + nfsd_fsnotify_modify_dir(fhp);
>
> /*
> * nfsd_create_setattr already committed the child. Transactional
> @@ -1525,8 +1538,10 @@ nfsd_symlink(struct svc_rqst *rqstp, struct svc_fh *fhp,
>
> host_err = vfs_symlink(d_inode(dentry), dnew, path);
> err = nfserrno(host_err);
> - if (!err)
> + if (!err) {
> + nfsd_fsnotify_modify_dir(fhp);
> err = nfserrno(commit_metadata(fhp));
> + }
> fh_unlock(fhp);
>
> fh_drop_write(fhp);
> @@ -1593,6 +1608,7 @@ nfsd_link(struct svc_rqst *rqstp, struct svc_fh *ffhp,
> goto out_dput;
> host_err = vfs_link(dold, dirp, dnew, NULL);
> if (!host_err) {
> + nfsd_fsnotify_modify_dir(tfhp);
> err = nfserrno(commit_metadata(ffhp));
> if (!err)
> err = nfserrno(commit_metadata(tfhp));
> @@ -1686,6 +1702,8 @@ nfsd_rename(struct svc_rqst *rqstp, struct svc_fh *ffhp, char *fname, int flen,
>
> host_err = vfs_rename(fdir, odentry, tdir, ndentry, NULL, 0);
> if (!host_err) {
> + nfsd_fsnotify_modify_dir(tfhp);
> + nfsd_fsnotify_modify_dir(ffhp);
> host_err = commit_metadata(tfhp);
> if (!host_err)
> host_err = commit_metadata(ffhp);
> @@ -1757,8 +1775,10 @@ nfsd_unlink(struct svc_rqst *rqstp, struct svc_fh *fhp, int type,
> host_err = vfs_unlink(dirp, rdentry, NULL);
> else
> host_err = vfs_rmdir(dirp, rdentry);
> - if (!host_err)
> + if (!host_err) {
> + nfsd_fsnotify_modify_dir(fhp);
> host_err = commit_metadata(fhp);
> + }
> dput(rdentry);
>
> out_nfserr:
> diff --git a/ipc/mqueue.c b/ipc/mqueue.c
> index 7a2d8f0c8ae5..10e413c2216f 100644
> --- a/ipc/mqueue.c
> +++ b/ipc/mqueue.c
> @@ -19,6 +19,7 @@
> #include <linux/file.h>
> #include <linux/mount.h>
> #include <linux/namei.h>
> +#include <linux/fsnotify.h>
> #include <linux/sysctl.h>
> #include <linux/poll.h>
> #include <linux/mqueue.h>
> @@ -818,6 +819,10 @@ SYSCALL_DEFINE4(mq_open, const char __user *, u_name, int, oflag, umode_t, mode,
> filp = do_create(ipc_ns, d_inode(root),
> &path, oflag, mode,
> u_attr ? &attr : NULL);
> + if (!IS_ERR(filp)) {
> + struct path root_path = {mnt, mnt->mnt_root};
> + fsnotify_modify_dir(&root_path);
> + }
> }
> } else {
> if (d_really_is_negative(path.dentry)) {
> @@ -878,6 +883,10 @@ SYSCALL_DEFINE1(mq_unlink, const char __user *, u_name)
> } else {
> ihold(inode);
> err = vfs_unlink(d_inode(dentry->d_parent), dentry, NULL);
> + if (!err) {
> + struct path path = {mnt, dentry->d_parent};
> + fsnotify_modify_dir(&path);
> + }
> }
> dput(dentry);
>
> diff --git a/kernel/bpf/inode.c b/kernel/bpf/inode.c
> index 0b030c9126d3..93137292b051 100644
> --- a/kernel/bpf/inode.c
> +++ b/kernel/bpf/inode.c
> @@ -16,6 +16,7 @@
> #include <linux/major.h>
> #include <linux/mount.h>
> #include <linux/namei.h>
> +#include <linux/fsnotify.h>
> #include <linux/fs.h>
> #include <linux/kdev_t.h>
> #include <linux/parser.h>
> @@ -255,6 +256,8 @@ static int bpf_obj_do_pin(const struct filename *pathname, void *raw,
>
> dentry->d_fsdata = raw;
> ret = vfs_mknod(dir, dentry, mode, devt);
> + if (ret == 0)
> + fsnotify_modify_dir(&path);
> dentry->d_fsdata = NULL;
> out:
> done_path_create(&path, dentry);
> diff --git a/net/unix/af_unix.c b/net/unix/af_unix.c
> index cef79873b09d..5049bd4bd1d8 100644
> --- a/net/unix/af_unix.c
> +++ b/net/unix/af_unix.c
> @@ -91,6 +91,7 @@
> #include <linux/stat.h>
> #include <linux/dcache.h>
> #include <linux/namei.h>
> +#include <linux/fsnotify.h>
> #include <linux/socket.h>
> #include <linux/un.h>
> #include <linux/fcntl.h>
> @@ -976,6 +977,7 @@ static int unix_mknod(const char *sun_path, umode_t mode, struct path *res)
> if (!err) {
> err = vfs_mknod(d_inode(path.dentry), dentry, mode, 0);
> if (!err) {
> + fsnotify_modify_dir(&path);
> res->mnt = mntget(path.mnt);
> res->dentry = dget(dentry);
> }
> --
> 2.11.1
>

2017-03-14 12:41:59

by Filip Štědronský

[permalink] [raw]
Subject: Re: [RFC 1/2] fanotify: new event FAN_MODIFY_DIR

Hi,

On Tue, Mar 14, 2017 at 12:11:40PM +0200, Amir Goldstein wrote:
> > - file system indexers / desktop search tools
> > - file synchronization tools (like Dropbox, Nextcloud, etc.),
> > online backup tools
>
> This last one is the use case of my employer, Ctera Networks.
> Out of curiosity, what is the use case that you are focusing on?

I'm working on a file synchronization tool as part of my bachelor
thesis at Charles University.

When I started (now over a year ago ... long story), there were AFIK no
attempts at solving the recursive inotify issue, only a lot of
complaints, so I cobbled up together something simple (I'm not a kernel
developer by trade, this was my first patch) that would allow me to
work on the userspace parts, which are the main bulk.

I try to focus on algorithmic and implementation efficiency as opposed
to fancy GUIs and similar. I want it to be fast with 100k directories
and 10M files in home dir. But it's very WIP so we'll see how that turns
out.

> I had the feeling that all recursive inotify users are hiding in the shadows,
> but was missing more concrete evidence.

Yes, even Dropbox uses inotify. They have articles in their help on how
to increase inotify.max_user_watches: https://www.dropbox.com/help/145.
That's not good PR. ;-) (I'm not affiliated with DB, just pointing that
out.)

> About the argument of not having to change in-kernel framework,
> I don't think it should be a consideration at all.

Understood. I tried to stay conservative and non-controversial because I
imagined that radical framework changes would take months of discussions
(look at the history of statx) and this issue seems to be pressing for
quite a lot of people. But rushing is of course not the best strategy
either, there are always compromises.

> If you don't specify FAN_EVENT_INFO_NAME, you can get filename events
> FAN_MOVE|FAN_CREATE|FAN_DELETE without the name.
>
> What you do get is the file descriptor of the parent. sounds familiar? ;-)

I didn't notice this bit. That sounds like a win-win.

Filip

2017-03-14 13:46:53

by Filip Štědronský

[permalink] [raw]
Subject: Re: [RFC 1/2] fanotify: new event FAN_MODIFY_DIR

Hi,

On Tue, Mar 14, 2017 at 12:40:56PM +0200, Amir Goldstein wrote:
> An I am very happy that you used filehandles to keep track of objects
> in your example, because it fits my proposal really well.

you can read more about the rationales in the WIP draft of my thesis
(especially the chapter Identifying Inodes and its surroundings):

http://regnarg.cz/tmp/thesis.pdf
https://github.com/regnarg/bc

Either way, one can never use pathnames as identifiers because in
a world with parallel renames on multiple levels of the hierarchy
(between which no ordering is guaranteed unless they are cross-dir),
pathnames are not even a well-defined concept.

NB: I used a simple script for testing that does precisely this
(a lot of parallel within-directory renames at many levels):
https://github.com/regnarg/bc/blob/master/experiments/fanotify/parallel_renamist.sh
It seems like a good stress test for any application that tries to
do reliable filesystem monitoring.

> See, if you used my proposed API, you would have had
>
> fan_fd = CHK(fanotify_init(FAN_UNLIMITED_QUEUE| \
> FAN_EVENT_INFO_PARENT | FAN_EVENT_INFO_FH,
> O_RDONLY));

Sure, I'll definitely try to implement your interface as an
optional backend and mention it in my thesis and definitely
give it some heavy testing.

> Furthermore, my proposal records the filehandle at the time of the event
> and therefore, does not have to keep an elevated refcount of the object
> in the events queue.

That sounds good but from what I've understood, not all
filesystems support handles. Is this a concern? (Maybe the
right answer is to extend handle support...)

Filip

2017-03-14 13:55:30

by Amir Goldstein

[permalink] [raw]
Subject: Re: [RFC 1/2] fanotify: new event FAN_MODIFY_DIR

On Tue, Mar 14, 2017 at 2:41 PM, Filip Štědronský <[email protected]> wrote:
> Hi,
>
> On Tue, Mar 14, 2017 at 12:11:40PM +0200, Amir Goldstein wrote:
>> > - file system indexers / desktop search tools
>> > - file synchronization tools (like Dropbox, Nextcloud, etc.),
>> > online backup tools
>>
>> This last one is the use case of my employer, Ctera Networks.
>> Out of curiosity, what is the use case that you are focusing on?
>
> I'm working on a file synchronization tool as part of my bachelor
> thesis at Charles University.
>
> When I started (now over a year ago ... long story), there were AFIK no
> attempts at solving the recursive inotify issue, only a lot of
> complaints, so I cobbled up together something simple (I'm not a kernel
> developer by trade, this was my first patch) that would allow me to
> work on the userspace parts, which are the main bulk.
>

That sounds very useful. I was actually looking for an open userspace
implementation were I could demonstrate my patches, but all the projects
I looked at (beagle etc..) seems to have died from desperation waiting
for intotify scalabiltiy to be fixed...

> I try to focus on algorithmic and implementation efficiency as opposed
> to fancy GUIs and similar. I want it to be fast with 100k directories
> and 10M files in home dir. But it's very WIP so we'll see how that turns
> out.
>

I did some basic tests of my super block watch with 1M directories,
and currently the code is under integration testing in our company.

I prepared a drop-in replacement for our fs change listener class
that was previously implemented with inotify.
Currently, the userspace code still "thinks" it is adding watches on directories
recursively, but what it really does is just gets all their file handles.
Then, the code that maps an event to "watch descriptor" actually matches
the fhandle in the event to the directory fhandle (much like in your example).

If you are not married to using your kernel patch, you can just go a head and
try my patch set.
You can fetch it directly from:
https://github.com/amir73il/linux/tree/fanotify_sb

It has been recently rebased over v4.11-rc1.

There are also ready ports for stable kernels v4.1 v4.4 and v4.9
in my github if that serves you better.

Marko Rauhamaa [CCed] has also tested my patches to his satisfaction.


>> I had the feeling that all recursive inotify users are hiding in the shadows,
>> but was missing more concrete evidence.
>
> Yes, even Dropbox uses inotify. They have articles in their help on how
> to increase inotify.max_user_watches: https://www.dropbox.com/help/145.
> That's not good PR. ;-) (I'm not affiliated with DB, just pointing that
> out.)
>
>> About the argument of not having to change in-kernel framework,
>> I don't think it should be a consideration at all.
>
> Understood. I tried to stay conservative and non-controversial because I
> imagined that radical framework changes would take months of discussions
> (look at the history of statx) and this issue seems to be pressing for
> quite a lot of people. But rushing is of course not the best strategy
> either, there are always compromises.
>

Don't get me wrong, I certainly do *understand* why you would want to stay
away from making radical framework changes.
But I believe it is the best interest of the community to choose the
best solution
going forward while eliminating the fear-of-change factor.
This is why I say that fear of change SHOULD not be a consideration.

>> If you don't specify FAN_EVENT_INFO_NAME, you can get filename events
>> FAN_MOVE|FAN_CREATE|FAN_DELETE without the name.
>>
>> What you do get is the file descriptor of the parent. sounds familiar? ;-)
>
> I didn't notice this bit. That sounds like a win-win.
>

Well, I may me giving you half the truth here.
With current patches not specifying FAN_EVENT_INFO_NAME
will mask out the filename events from the mark, but
1. The implementation could be easily changed
2. You can set FAN_EVENT_INFO_NAME and ignore the
filename info in userspace. You still get the parent fd
as your program expects and you get a bonus fhandle
of the parent for free with the event, see my test program:

github.com/amir73il/fsnotify-utils/blob/master/src/test/fanotify_demo.c#L101

Please let me know if that is sufficient for your needs
or if you need me to prepare a version that delivers filename events
without filename info, therefore allowing to merge filename events.

Sounds to me like if you get an event FAN_DELETE, "aaa",
your implementation can update the db directly without having
to scan the directory, so it should be useful.
For your consideration.

Amir.

2017-03-14 14:49:28

by Filip Štědronský

[permalink] [raw]
Subject: Re: [RFC 1/2] fanotify: new event FAN_MODIFY_DIR

Hi,

On Tue, Mar 14, 2017 at 03:55:19PM +0200, Amir Goldstein wrote:
> Please let me know if that is sufficient for your needs
> or if you need me to prepare a version that delivers filename events
> without filename info, therefore allowing to merge filename events.
>
> Sounds to me like if you get an event FAN_DELETE, "aaa",
> your implementation can update the db directly without having
> to scan the directory, so it should be useful.
> For your consideration.

both approaches might be useful and there are tradeoffs. Direct updates
save us from rescanning but give more events and more context switches.

On the other hand, with an unlimited queue and well-mergable events I
could e.g. sleep for five seconds each time after emptying the queue,
thus giving the events more potential to merge and reducing context
switches.

But one risks blowing up the queue. However, I have some ideas.
Basically we could implement some kind of "bulk read" mode for fanotify
that would pass events to userspace only when
(a) a given event count thresold is hit (e.g. half the queue limit
if queue is limited), or
(b) the oldest event is older than a specified maximum latency. That
might vary from 1 second to 5 minutes depending on specific
application needs (e.g. background backup does not need to be
low-latency).

This would greatly reduce extra context switches when watching
a large portion (or whole) of the file system. All of this presumably
could be implemented on top of your suggested patches.

Either way, I suggest that implementing the nameless filename events
is a good idea. I do not know whether they will be the best choice
for my application but they probably will be useful for some
applications.

Filip

2017-03-14 14:58:08

by Filip Štědronský

[permalink] [raw]
Subject: Re: [RFC 2/2] fanotify: emit FAN_MODIFY_DIR on filesystem changes

Hi,

On Tue, Mar 14, 2017 at 01:18:01PM +0200, Amir Goldstein wrote:
> I claim that fanotify filters event by mount not because it
> was a requirement, but because it was an implementation challenge
> to do otherwise.
>
> And I claim that what mount watchers are really interested in is
> "all the changes that happen in the file system in the area
> that is visible to me through this mount point".
>
> In other words, an indexer needs to know if files were modified\
> create/deleted if that indexer sits in container host namespace
> regardless if those files were modified from within a container
> namespace.
>
> It's not a matter of security/isolation. It's a matter of functionality.
> I agree that for some event (e.g. permission events) it is possible
> to argue both ways (i.e. that the namespace context should be used
> as a filter for events).
> But for the new proposed events (FS_MODIFY_DIR), I really don't
> see the point in isolation by mount/namespace.

there are basically two classes of uses for a fantotify-like
interface:

(1) Keeping an up-to-date representation of the file system.
For this, superblock watches are clearly what you want.

* You are interested to know the current state of the
filesystem so you need to know about every change,
regardless of where it came from.
* As I mentioned earlier, in case of remote, ditributed
and virtual filesystems, the change might come from
within the filesystem itself (if the protocol supports
reporting such changes). This can probably be
implemented only with superblock-scoped watches because
the change is fundamentally not related to any mount.
* Some filesystems might also support change journalling
and it might be concievable to extend the API in the
future to report "past" events (for example by passing
sequence number of last seen event or similar).
* The argument about containers escaping change notification
you mentioned earlier.

All those factors speak greatly in favour of superblock
watches.

(2) Tracking filesystem *activity*. Now you are not building
an image of current filesystem state but rather a log of
what happened. Perhaps you are also interested in who
(user/process/...) did what. Permission events also fit
mostly in this category.

For those it *might* make sense to have mount-scoped
watches, for example if you want to monitor only one
container or a subset of processes.

We both concentrate on the first but we shouldn't forget about
the second, which was one of the original motivations for
fanotify.

Thus I conclude that it might be desirable to implement
mount-scoped filename events in the long run. Even though
I agree that the sb-scoped events are more important because
they cover more use cases and you can do additional filtering
(e.g. by pid) if deemed necessary.

This would require:

(a) Sprinkling the callers of vfs_* with fanotify calls
as I did, or
(b) Creating wrapper functions like vfs_path_unlink & co.
that would make the necessary fanotify call (and probably
tell the lower function not to generate another
notification), as I suggested earlier.
(c) Give the vfs_* functions an *optional* vfsmount argument.

In the end I probably find (c) the most elegant but this
can be discussed later, even after your changes are merged.

Filip

2017-03-14 15:07:21

by Amir Goldstein

[permalink] [raw]
Subject: Re: [RFC 1/2] fanotify: new event FAN_MODIFY_DIR

On Tue, Mar 14, 2017 at 3:46 PM, Filip Štědronský <[email protected]> wrote:
> Hi,
>
> On Tue, Mar 14, 2017 at 12:40:56PM +0200, Amir Goldstein wrote:
>> An I am very happy that you used filehandles to keep track of objects
>> in your example, because it fits my proposal really well.
>
> you can read more about the rationales in the WIP draft of my thesis
> (especially the chapter Identifying Inodes and its surroundings):
>
> http://regnarg.cz/tmp/thesis.pdf
> https://github.com/regnarg/bc
>
> Either way, one can never use pathnames as identifiers because in
> a world with parallel renames on multiple levels of the hierarchy
> (between which no ordering is guaranteed unless they are cross-dir),
> pathnames are not even a well-defined concept.
>
> NB: I used a simple script for testing that does precisely this
> (a lot of parallel within-directory renames at many levels):
> https://github.com/regnarg/bc/blob/master/experiments/fanotify/parallel_renamist.sh
> It seems like a good stress test for any application that tries to
> do reliable filesystem monitoring.

I agree. not only to filesystem monitoring.
filesystems could use a test like this in general.
It would be good to stress test overlayfs dir rename like this.

>
>> See, if you used my proposed API, you would have had
>>
>> fan_fd = CHK(fanotify_init(FAN_UNLIMITED_QUEUE| \
>> FAN_EVENT_INFO_PARENT | FAN_EVENT_INFO_FH,
>> O_RDONLY));
>
> Sure, I'll definitely try to implement your interface as an
> optional backend and mention it in my thesis and definitely
> give it some heavy testing.
>
>> Furthermore, my proposal records the filehandle at the time of the event
>> and therefore, does not have to keep an elevated refcount of the object
>> in the events queue.
>
> That sounds good but from what I've understood, not all
> filesystems support handles. Is this a concern? (Maybe the
> right answer is to extend handle support...)
>

Very interesting question.
Extending fhandle support is not always easy.

The majority of local file system do support fhandles,
because they are required for exporting a file system over NFS.
see: git grep -lw s_export_op fs/

The reason that some file systems "don't support" fhandles
is usually because they either do not have an efficient way to implement
open_by_handle() or because they do not have a way to encode a unique
and persistent (over reboot) identifer for name_to_handle().

An example of a commonly used file system that currently does not support
file handles is overlayfs.

The interesting thing to note wrt fanotify and overlayfs is that while it is
technically hard to implement open_by_handle() for overlayfs, it is quite
trivial to implement name_to_handle() for the special common case of all
layers on the same underlying fs (e.g. docker).
The trivial implementation (at least for directories) is to return fhandle
of lower most dir in case of lower or merge dir and fhandle of upper dir
otherwise.

I don't think there is any filesystem that implements name_to_handle()
without implementing open_by_handle(), but technically it is possible.
And for fanotify FAN_EVENT_INFO_FH only the former is required.

But in any case, FAN_EVENT_INFO_FH is optional.
If your userspace implementation can settle with parent fd information
then filesystems that do not support fhandles can use this mode.

Amir.

2017-03-14 15:35:55

by Amir Goldstein

[permalink] [raw]
Subject: Re: [RFC 2/2] fanotify: emit FAN_MODIFY_DIR on filesystem changes

On Tue, Mar 14, 2017 at 4:58 PM, Filip Štědronský <[email protected]> wrote:
> Hi,
>
> On Tue, Mar 14, 2017 at 01:18:01PM +0200, Amir Goldstein wrote:
>> I claim that fanotify filters event by mount not because it
>> was a requirement, but because it was an implementation challenge
>> to do otherwise.
>>
>> And I claim that what mount watchers are really interested in is
>> "all the changes that happen in the file system in the area
>> that is visible to me through this mount point".
>>
>> In other words, an indexer needs to know if files were modified\
>> create/deleted if that indexer sits in container host namespace
>> regardless if those files were modified from within a container
>> namespace.
>>
>> It's not a matter of security/isolation. It's a matter of functionality.
>> I agree that for some event (e.g. permission events) it is possible
>> to argue both ways (i.e. that the namespace context should be used
>> as a filter for events).
>> But for the new proposed events (FS_MODIFY_DIR), I really don't
>> see the point in isolation by mount/namespace.
>
> there are basically two classes of uses for a fantotify-like
> interface:
>
> (1) Keeping an up-to-date representation of the file system.
> For this, superblock watches are clearly what you want.
>
> * You are interested to know the current state of the
> filesystem so you need to know about every change,
> regardless of where it came from.
> * As I mentioned earlier, in case of remote, ditributed
> and virtual filesystems, the change might come from
> within the filesystem itself (if the protocol supports
> reporting such changes). This can probably be
> implemented only with superblock-scoped watches because
> the change is fundamentally not related to any mount.
> * Some filesystems might also support change journalling
> and it might be concievable to extend the API in the
> future to report "past" events (for example by passing
> sequence number of last seen event or similar).
> * The argument about containers escaping change notification
> you mentioned earlier.
>
> All those factors speak greatly in favour of superblock
> watches.
>
> (2) Tracking filesystem *activity*. Now you are not building
> an image of current filesystem state but rather a log of
> what happened. Perhaps you are also interested in who
> (user/process/...) did what. Permission events also fit
> mostly in this category.
>
> For those it *might* make sense to have mount-scoped
> watches, for example if you want to monitor only one
> container or a subset of processes.
>
> We both concentrate on the first but we shouldn't forget about
> the second, which was one of the original motivations for
> fanotify.
>
> Thus I conclude that it might be desirable to implement
> mount-scoped filename events in the long run. Even though
> I agree that the sb-scoped events are more important because
> they cover more use cases and you can do additional filtering
> (e.g. by pid) if deemed necessary.
>
> This would require:
>
> (a) Sprinkling the callers of vfs_* with fanotify calls
> as I did, or
> (b) Creating wrapper functions like vfs_path_unlink & co.
> that would make the necessary fanotify call (and probably
> tell the lower function not to generate another
> notification), as I suggested earlier.
> (c) Give the vfs_* functions an *optional* vfsmount argument.
>
> In the end I probably find (c) the most elegant but this
> can be discussed later, even after your changes are merged.
>

Agreed. That is an independent question.
Thanks for the thorough summary.

Amir.

2017-03-14 22:30:04

by Amir Goldstein

[permalink] [raw]
Subject: Re: [RFC 1/2] fanotify: new event FAN_MODIFY_DIR

On Tue, Mar 14, 2017 at 4:48 PM, Filip Štědronský <[email protected]> wrote:
> Hi,
>
> On Tue, Mar 14, 2017 at 03:55:19PM +0200, Amir Goldstein wrote:
>> Please let me know if that is sufficient for your needs
>> or if you need me to prepare a version that delivers filename events
>> without filename info, therefore allowing to merge filename events.
>>
>> Sounds to me like if you get an event FAN_DELETE, "aaa",
>> your implementation can update the db directly without having
>> to scan the directory, so it should be useful.
>> For your consideration.
>
> both approaches might be useful and there are tradeoffs. Direct updates
> save us from rescanning but give more events and more context switches.
>
> On the other hand, with an unlimited queue and well-mergable events I
> could e.g. sleep for five seconds each time after emptying the queue,
> thus giving the events more potential to merge and reducing context
> switches.
>
> But one risks blowing up the queue. However, I have some ideas.
> Basically we could implement some kind of "bulk read" mode for fanotify
> that would pass events to userspace only when
> (a) a given event count thresold is hit (e.g. half the queue limit
> if queue is limited), or
> (b) the oldest event is older than a specified maximum latency. That
> might vary from 1 second to 5 minutes depending on specific
> application needs (e.g. background backup does not need to be
> low-latency).
>
> This would greatly reduce extra context switches when watching
> a large portion (or whole) of the file system. All of this presumably
> could be implemented on top of your suggested patches.
>
> Either way, I suggest that implementing the nameless filename events
> is a good idea. I do not know whether they will be the best choice
> for my application but they probably will be useful for some
> applications.
>

Done and pushed all branches.
Added 2 new patches to the fanotify_dentry set:
ce0c223 fanotify: allow merge of filename events
cd872d1 fanotify: make filename info optional for filename events

You should test with branch fanotify_sb, using super block watch,
because filename events are not supported in mount scope.
You can follow user API used by my fanotify_demo.c program.

Tested with my fanotify_demo test
both without FAN_EVENT_INFO_NAME and without FAN_EVENT_INFO_FH.

Without FAN_EVENT_INFO_NAME more filename events are merged,
but in my test that runs fs ops by a script, most events do not get merged
because of different pid value.

I guess that for the filesync/indexer use case, we would need to add a new
FAN_EVENT_INFO_NOPID flag to loose the irrelevant pid information
in order to be able to merge a lot more filename event on a directory.

I also plan to add FAN_EVENT_INFO_NOFD, to get rid of redundant
fd (when INFO_FH is available) and then be able to drop the refcount
on the victim object.

Cheers,
Amir.

Subject: Re: [RFC 1/2] fanotify: new event FAN_MODIFY_DIR

[CC += [email protected]]

Filip,

Since this is a kernel-user-space API change, please CC linux-api@
(and on future iterations of this patch). The kernel source file
Documentation/SubmitChecklist notes that all Linux kernel patches that
change userspace interfaces should be CCed to
[email protected], so that the various parties who are
interested in API changes are informed. For further information, see
https://www.kernel.org/doc/man-pages/linux-api-ml.html

Thanks,

Michael



On Tue, Mar 14, 2017 at 12:02 AM, Filip Štědronský <[email protected]> wrote:
> Fanotify currently does not report directory modification events
> (rename, unlink, etc.). But these events are essential for about half of
> concievable fanotify use cases, especially:
>
> - file system indexers / desktop search tools
> - file synchronization tools (like Dropbox, Nextcloud, etc.),
> online backup tools
>
> and pretty much any app that needs to maintain and up-to-date internal
> representation of current contents of the file system.
>
> All applications of the above classes that I'm aware of currently use
> recursive inotify watches, which do not scale (my home dir has ~200k
> directories, creating all the watches takes ~2min and eats several tens
> of MB of kernel memory).
>
> There have been many calls for such a feature, pretty much since the
> creation of fanotify in 2009:
> * By GNOME developers:
> https://wiki.gnome.org/BastienNocera/KernelWishlist#VFS.2C_filesystems
> * By KDE developers:
> http://lkml.kernel.org/r/201211011352.42476.Martin%40lichtvoll.de
> 'Better support for (desktop) file search / indexing applications'
> * And others:
> http://lkml.kernel.org/r/AANLkTi=owK=WZW4oNtpm5WpAZhqCQUdTR2K5gzJ_MqZ+%40mail.gmail.com
> 'Fanotify mv/rename?'
> http://lkml.kernel.org/r/1238158043.23703.20.camel%40fatty
> 'Issues with using fanotify for a filesystem indexer'
>
> Add a new fanotify event, FAN_MODIFY_DIR, that is emitted whenever the
> contents of a directory change (a directory entry is added, removed or
> renamed). This covers all the currently missing events: rename, unlink,
> mknod, mkdir, rmdir, etc.
>
> Included with the event is a file descriptor to the modified directory
> but no further info. The userspace application is expected to rescan the
> whole directory and update its model accordingly. This needs to be done
> carefully to prevent race conditions. A cross-directory rename generates
> two FAN_MODIFY_DIR events.
>
> This minimalistic approach has several advantages:
> * Does not require changes to the fanotify userspace API or the
> fsnotify in-kernel framework, apart from adding a new event.
> Especially doesn't complicate it by adding string fields.
> * Has simple and clear semantics, even with multiple renames occurring
> in parallel etc. In case of any inconsistencies, one can simply wait
> for a while and rescan again.
> * FAN_MODIFY_DIR events are easily merged in case of multiple
> operations on a directory (e.g. deleting all files).
>
> Signed-off-by: Filip Štědronský <[email protected]>
>
> ---
>
> An alternative might be to create wrapper functions like
> vfs_path_(rename|unlink|...). They could also take care of calling
> security_path_(rename|unlink|...), which is currently also up to
> the indvidual callers (possibly with a flag because it might not
> be always desired).
>
> An alternative was proposed by Amir Goldstein in several long series of
> patches that add superblock-scoped (as opposed to vfsmount-scoped)
> fanotify watches and specific dentry manipulation events with filenames:
>
> http://lkml.kernel.org/r/1481984434-13293-1-git-send-email-amir73il%40gmail.com
> http://lkml.kernel.org/r/1482247207-4424-1-git-send-email-amir73il%40gmail.com
> http://lkml.kernel.org/r/1476126784-12520-1-git-send-email-amir73il%40gmail.com
> http://lkml.kernel.org/r/1489411223-12081-1-git-send-email-amir73il%40gmail.com
>
> There is large but not complete overlap between that proposal and
> mine (which is was originally created over a year ago, before Amir's
> work, but never posted).
>
> I think the superblock watch idea is very interesting because it might
> in the future allow reporing fsnotify events from remote and virtual
> filesystems. So I'm posting this more as a potential source of more
> ideas for discussion, or a fallback proposal in case Amir's patches
> don't make it.
> ---
> fs/notify/fanotify/fanotify.c | 1 +
> include/linux/fsnotify.h | 17 +++++++++++++++++
> include/linux/fsnotify_backend.h | 1 +
> include/uapi/linux/fanotify.h | 5 ++++-
> 4 files changed, 23 insertions(+), 1 deletion(-)
>
> diff --git a/fs/notify/fanotify/fanotify.c b/fs/notify/fanotify/fanotify.c
> index bbc175d4213d..5178b06c338c 100644
> --- a/fs/notify/fanotify/fanotify.c
> +++ b/fs/notify/fanotify/fanotify.c
> @@ -186,6 +186,7 @@ static int fanotify_handle_event(struct fsnotify_group *group,
>
> BUILD_BUG_ON(FAN_ACCESS != FS_ACCESS);
> BUILD_BUG_ON(FAN_MODIFY != FS_MODIFY);
> + BUILD_BUG_ON(FAN_MODIFY_DIR != FS_MODIFY_DIR);
> BUILD_BUG_ON(FAN_CLOSE_NOWRITE != FS_CLOSE_NOWRITE);
> BUILD_BUG_ON(FAN_CLOSE_WRITE != FS_CLOSE_WRITE);
> BUILD_BUG_ON(FAN_OPEN != FS_OPEN);
> diff --git a/include/linux/fsnotify.h b/include/linux/fsnotify.h
> index b43d3f5bd9ea..00fb87c975d6 100644
> --- a/include/linux/fsnotify.h
> +++ b/include/linux/fsnotify.h
> @@ -208,6 +208,23 @@ static inline void fsnotify_modify(struct file *file)
> }
>
> /*
> + * fsnotify_modifydir - directory contents were changed
> + * (as a result of rename, creat, unlink, etc.)
> + */
> +static inline void fsnotify_modify_dir(struct path *path)
> +{
> + struct inode *inode = path->dentry->d_inode;
> + __u32 mask = FS_MODIFY_DIR;
> +
> + if (S_ISDIR(inode->i_mode))
> + mask |= FS_ISDIR;
> + else
> + return;
> +
> + fsnotify(inode, mask, path, FSNOTIFY_EVENT_PATH, NULL, 0);
> +}
> +
> +/*
> * fsnotify_open - file was opened
> */
> static inline void fsnotify_open(struct file *file)
> diff --git a/include/linux/fsnotify_backend.h b/include/linux/fsnotify_backend.h
> index 487246546ebe..7751b337ec31 100644
> --- a/include/linux/fsnotify_backend.h
> +++ b/include/linux/fsnotify_backend.h
> @@ -42,6 +42,7 @@
>
> #define FS_OPEN_PERM 0x00010000 /* open event in an permission hook */
> #define FS_ACCESS_PERM 0x00020000 /* access event in a permissions hook */
> +#define FS_MODIFY_DIR 0x00040000 /* directory changed (rename/unlink/...) */
>
> #define FS_EXCL_UNLINK 0x04000000 /* do not send events if object is unlinked */
> #define FS_ISDIR 0x40000000 /* event occurred against dir */
> diff --git a/include/uapi/linux/fanotify.h b/include/uapi/linux/fanotify.h
> index 030508d195d3..f14e048d492a 100644
> --- a/include/uapi/linux/fanotify.h
> +++ b/include/uapi/linux/fanotify.h
> @@ -15,6 +15,8 @@
> #define FAN_OPEN_PERM 0x00010000 /* File open in perm check */
> #define FAN_ACCESS_PERM 0x00020000 /* File accessed in perm check */
>
> +#define FAN_MODIFY_DIR 0x00040000 /* directory changed (rename/unlink/...) */
> +
> #define FAN_ONDIR 0x40000000 /* event occurred against dir */
>
> #define FAN_EVENT_ON_CHILD 0x08000000 /* interested in child events */
> @@ -67,7 +69,8 @@
> #define FAN_ALL_EVENTS (FAN_ACCESS |\
> FAN_MODIFY |\
> FAN_CLOSE |\
> - FAN_OPEN)
> + FAN_OPEN |\
> + FAN_MODIFY_DIR)
>
> /*
> * All events which require a permission response from userspace
> --
> 2.11.1
>



--
Michael Kerrisk Linux man-pages maintainer;
http://www.kernel.org/doc/man-pages/
Author of "The Linux Programming Interface", http://blog.man7.org/

2017-03-15 08:46:38

by Marko Rauhamaa

[permalink] [raw]
Subject: Re: [RFC 2/2] fanotify: emit FAN_MODIFY_DIR on filesystem changes

Filip Štědronský <[email protected]>:

> there are basically two classes of uses for a fantotify-like
> interface:
>
> (1) Keeping an up-to-date representation of the file system. For this,
> superblock watches are clearly what you want.
>
> [...]
>
> All those factors speak greatly in favour of superblock
> watches.
>
> (2) Tracking filesystem *activity*. Now you are not building
> an image of current filesystem state but rather a log of what
> happened. Perhaps you are also interested in who
> (user/process/...) did what. Permission events also fit mostly in
> this category.
>
> For those it *might* make sense to have mount-scoped watches, for
> example if you want to monitor only one container or a subset of
> processes.
>
> We both concentrate on the first but we shouldn't forget about the
> second, which was one of the original motivations for fanotify.

My (employer's) needs are centered around (2). We definitely crave
permission events with a filesystem scope. At the moment, you can avoid
permission checks with a simple unshare command (<URL:
https://lkml.org/lkml/2016/12/21/144>).

So I must be able to see everything that is happening in my universe. It
might also be useful to monitor a subuniverse of mine, but the former
need is critical at the moment.

As for "who (user/process/...) did what", the fanotify API is flawed in
that we don't have a CLOSE_WRITE_PERM event. The hit-and-run process is
long gone by the time we receive the event. That's more of a rule than
an exception.


Marko

2017-03-15 13:40:09

by Jan Kara

[permalink] [raw]
Subject: Re: [RFC 2/2] fanotify: emit FAN_MODIFY_DIR on filesystem changes

On Wed 15-03-17 10:19:52, Marko Rauhamaa wrote:
> Filip Štědronský <[email protected]>:
>
> > there are basically two classes of uses for a fantotify-like
> > interface:
> >
> > (1) Keeping an up-to-date representation of the file system. For this,
> > superblock watches are clearly what you want.
> >
> > [...]
> >
> > All those factors speak greatly in favour of superblock
> > watches.
> >
> > (2) Tracking filesystem *activity*. Now you are not building
> > an image of current filesystem state but rather a log of what
> > happened. Perhaps you are also interested in who
> > (user/process/...) did what. Permission events also fit mostly in
> > this category.
> >
> > For those it *might* make sense to have mount-scoped watches, for
> > example if you want to monitor only one container or a subset of
> > processes.
> >
> > We both concentrate on the first but we shouldn't forget about the
> > second, which was one of the original motivations for fanotify.
>
> My (employer's) needs are centered around (2). We definitely crave
> permission events with a filesystem scope. At the moment, you can avoid
> permission checks with a simple unshare command (<URL:
> https://lkml.org/lkml/2016/12/21/144>).

Yes, that is bad.

> So I must be able to see everything that is happening in my universe. It
> might also be useful to monitor a subuniverse of mine, but the former
> need is critical at the moment.

So I understand your need. However with superblock watches I'm still
concerned that the process would be able to see too much. E.g. if it is
restricted to see only some subtree of a filesystem (by bind mounts &
namespaces), it should not be able to see events on the same filesystem
outside of that subtree. I have not found a good solution for that yet.

> As for "who (user/process/...) did what", the fanotify API is flawed in
> that we don't have a CLOSE_WRITE_PERM event. The hit-and-run process is
> long gone by the time we receive the event. That's more of a rule than
> an exception.

Adding CLOSE_WRITE_PERM would not be that difficult I assume. What do you
need it for?

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR

2017-03-15 14:06:05

by Jan Kara

[permalink] [raw]
Subject: Re: [RFC 1/2] fanotify: new event FAN_MODIFY_DIR

Hello,

On Tue 14-03-17 12:11:40, Amir Goldstein wrote:
> > Add a new fanotify event, FAN_MODIFY_DIR, that is emitted whenever the
> > contents of a directory change (a directory entry is added, removed or
> > renamed). This covers all the currently missing events: rename, unlink,
> > mknod, mkdir, rmdir, etc.
> >
> > Included with the event is a file descriptor to the modified directory
> > but no further info. The userspace application is expected to rescan the
> > whole directory and update its model accordingly. This needs to be done
> > carefully to prevent race conditions. A cross-directory rename generates
> > two FAN_MODIFY_DIR events.
> >
>
> Your approach is interesting and I am glad you shared it with us.
> I do like it and it gives me an idea, I am going to prepare a branch
> with a subset
> of my patches, so you can try them with your userspace sample program.
>
> In comments to your patches I am going to argue that, as a matter of fact,
> you can take a small sub set of my patches and get the same functionality
> of FAN_MODIFY_DIR, with code that is *less* complex then what you propose.
> So please treat my comments as technical comments how FAN_MODIFY_DIR
> should be implemented IMO and not as an attempt to reject an opposing proposal.

Yeah, so I have yet to read both your patch sets in detail (good news is
that this is getting towards top of my TODO stack and I think I'll do this
when flying to LSF/MM this weekend). With Filip's design I like the
minimalistic approach of adding just DIR_CHANGED event with (albeit
optional) messing with names. That just seems to fit well with the fanotify
design.

> > This minimalistic approach has several advantages:
> > * Does not require changes to the fanotify userspace API or the
> > fsnotify in-kernel framework, apart from adding a new event.
>
> About the argument of not having to change in-kernel framework,
> I don't think it should be a consideration at all.
> I claim that my 1st fsnotify cleanup series is an improvement of the framework
> (https://github.com/amir73il/linux/commits/fsnotify_dentry)
> regardless of additional functionality.
> So changing the in-kernel framework by adding complexity may be bad,
> but changing it by simplifying and making code more readable should not
> be an argument against the "non-minimalistic" approach.
>
> About the change to usespace API, if you strip off the *added* functionality
> of super block watch from my patches, your proposal for user API looks
> like this:
>
> fanotify_mark(fan_fd, FAN_MARK_ADD|FAN_MARK_MOUNT, \
> FAN_MODIFY_DIR|FAN_ONDIR,
>
> And my proposal for user API looks like this:
>
> fanotify_mark(fan_fd, FAN_MARK_ADD|FAN_MARK_MOUNT, \
> FAN_MOVE|FAN_CREATE|FAN_DELETE,
>
> You can see why my proposal for user API is not any less minimalistic
> and it is fully compatible with existing intotify API for the same events.
>
> I understand why it is hard to see this behind my added functionality,
> so I will post a minimalistic branch and test program as an example.
>
> > Especially doesn't complicate it by adding string fields.
>
> So the string fields in my proposal are optional.
> userspace opts-in to get them by specifying the FAN_EVENT_INFO_NAME flag:
>
> fanotify_init(FAN_CLOEXEC | FAN_CLASS_NOTIF |
> FAN_EVENT_INFO_PARENT | FAN_EVENT_INFO_NAME,
>
> If you don't specify FAN_EVENT_INFO_NAME, you can get filename events
> FAN_MOVE|FAN_CREATE|FAN_DELETE without the name.

So when thinking about this again, I'm again concerned that the names need
not tell userspace what it thinks. I know we already discussed this but
in our last discussion I think I forgot to point out that inotify directory
events (and fanotify would be the same AFAICT) may come out of order compared
to real filesystem changes. E.g. sequence:

Task 1 Task 2

mv a b
mv b c

may come out of inotify as:

IN_MOVED_FROM "b" COOKIE 1
IN_MOVED_TO "c" COOKIE 1
IN_MOVED_FROM "a" COOKIE 2
IN_MOVED_TO "b" COOKIE 2

and if user program just blindly belives this sequence its internal
representation of the filesystem will be different from the real state of
the filesystem.

So for now I'm more inclined towards the trivial approach (possibly using
your patches and stripping additional functionality from them). But I'll
leave final decision to when I'll be able to read everything in detail.

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR

2017-03-15 14:19:16

by Marko Rauhamaa

[permalink] [raw]
Subject: Re: [RFC 2/2] fanotify: emit FAN_MODIFY_DIR on filesystem changes

Jan Kara <[email protected]>:

> On Wed 15-03-17 10:19:52, Marko Rauhamaa wrote:
>> As for "who (user/process/...) did what", the fanotify API is flawed
>> in that we don't have a CLOSE_WRITE_PERM event. The hit-and-run
>> process is long gone by the time we receive the event. That's more of
>> a rule than an exception.
>
> Adding CLOSE_WRITE_PERM would not be that difficult I assume. What do you
> need it for?

Mainly to hold the process hostage until I have verified the content
change. If I disqualify the content change, I will need to report on the
process. CLOSE_WRITE only gives me a pid that is often stale as it
doesn't block the process.

(Another possibility would be to keep the process around as a zombie as
long as the CLOSE_WRITE event's file descriptor is open. That sounds
more complicated and questionable, though.)


Marko

--
+358 44 990 4795
Skype: marko.rauhamaa_f-secure

2017-03-15 14:34:15

by Amir Goldstein

[permalink] [raw]
Subject: Re: [RFC 1/2] fanotify: new event FAN_MODIFY_DIR

On Wed, Mar 15, 2017 at 4:05 PM, Jan Kara <[email protected]> wrote:
> Hello,
>
> On Tue 14-03-17 12:11:40, Amir Goldstein wrote:
>> > Add a new fanotify event, FAN_MODIFY_DIR, that is emitted whenever the
>> > contents of a directory change (a directory entry is added, removed or
>> > renamed). This covers all the currently missing events: rename, unlink,
>> > mknod, mkdir, rmdir, etc.
>> >
>> > Included with the event is a file descriptor to the modified directory
>> > but no further info. The userspace application is expected to rescan the
>> > whole directory and update its model accordingly. This needs to be done
>> > carefully to prevent race conditions. A cross-directory rename generates
>> > two FAN_MODIFY_DIR events.
>> >
>>
>> Your approach is interesting and I am glad you shared it with us.
>> I do like it and it gives me an idea, I am going to prepare a branch
>> with a subset
>> of my patches, so you can try them with your userspace sample program.
>>
>> In comments to your patches I am going to argue that, as a matter of fact,
>> you can take a small sub set of my patches and get the same functionality
>> of FAN_MODIFY_DIR, with code that is *less* complex then what you propose.
>> So please treat my comments as technical comments how FAN_MODIFY_DIR
>> should be implemented IMO and not as an attempt to reject an opposing proposal.
>
> Yeah, so I have yet to read both your patch sets in detail (good news is
> that this is getting towards top of my TODO stack and I think I'll do this
> when flying to LSF/MM this weekend). With Filip's design I like the
> minimalistic approach of adding just DIR_CHANGED event with (albeit
> optional) messing with names. That just seems to fit well with the fanotify
> design.
>
>> > This minimalistic approach has several advantages:
>> > * Does not require changes to the fanotify userspace API or the
>> > fsnotify in-kernel framework, apart from adding a new event.
>>
>> About the argument of not having to change in-kernel framework,
>> I don't think it should be a consideration at all.
>> I claim that my 1st fsnotify cleanup series is an improvement of the framework
>> (https://github.com/amir73il/linux/commits/fsnotify_dentry)
>> regardless of additional functionality.
>> So changing the in-kernel framework by adding complexity may be bad,
>> but changing it by simplifying and making code more readable should not
>> be an argument against the "non-minimalistic" approach.
>>
>> About the change to usespace API, if you strip off the *added* functionality
>> of super block watch from my patches, your proposal for user API looks
>> like this:
>>
>> fanotify_mark(fan_fd, FAN_MARK_ADD|FAN_MARK_MOUNT, \
>> FAN_MODIFY_DIR|FAN_ONDIR,
>>
>> And my proposal for user API looks like this:
>>
>> fanotify_mark(fan_fd, FAN_MARK_ADD|FAN_MARK_MOUNT, \
>> FAN_MOVE|FAN_CREATE|FAN_DELETE,
>>
>> You can see why my proposal for user API is not any less minimalistic
>> and it is fully compatible with existing intotify API for the same events.
>>
>> I understand why it is hard to see this behind my added functionality,
>> so I will post a minimalistic branch and test program as an example.
>>
>> > Especially doesn't complicate it by adding string fields.
>>
>> So the string fields in my proposal are optional.
>> userspace opts-in to get them by specifying the FAN_EVENT_INFO_NAME flag:
>>
>> fanotify_init(FAN_CLOEXEC | FAN_CLASS_NOTIF |
>> FAN_EVENT_INFO_PARENT | FAN_EVENT_INFO_NAME,
>>
>> If you don't specify FAN_EVENT_INFO_NAME, you can get filename events
>> FAN_MOVE|FAN_CREATE|FAN_DELETE without the name.
>
> So when thinking about this again, I'm again concerned that the names need
> not tell userspace what it thinks. I know we already discussed this but
> in our last discussion I think I forgot to point out that inotify directory
> events (and fanotify would be the same AFAICT) may come out of order compared
> to real filesystem changes. E.g. sequence:
>
> Task 1 Task 2
>
> mv a b
> mv b c
>
> may come out of inotify as:
>
> IN_MOVED_FROM "b" COOKIE 1
> IN_MOVED_TO "c" COOKIE 1
> IN_MOVED_FROM "a" COOKIE 2
> IN_MOVED_TO "b" COOKIE 2
>

Really? How come?

fsnotify_move() inside vfs_rename() is serialized with lock_rename()

I should use this opportunity to note that fanotify event does not have
a cookie field and that I did not add one to my implementation, so that
is a problem.

> and if user program just blindly belives this sequence its internal
> representation of the filesystem will be different from the real state of
> the filesystem.
>
> So for now I'm more inclined towards the trivial approach (possibly using
> your patches and stripping additional functionality from them). But I'll
> leave final decision to when I'll be able to read everything in detail.
>

The addition of filename info is completely optional and the relevant patch
("fanotify: pass filename info for filename events")
Can be yanked out of the series and pushed back to the 'maybe' pile.

Amir.

2017-03-15 14:45:03

by Amir Goldstein

[permalink] [raw]
Subject: Re: [RFC 2/2] fanotify: emit FAN_MODIFY_DIR on filesystem changes

On Wed, Mar 15, 2017 at 3:39 PM, Jan Kara <[email protected]> wrote:
> On Wed 15-03-17 10:19:52, Marko Rauhamaa wrote:
>> Filip Štědronský <[email protected]>:
>>
>> > there are basically two classes of uses for a fantotify-like
>> > interface:
>> >
>> > (1) Keeping an up-to-date representation of the file system. For this,
>> > superblock watches are clearly what you want.
>> >
>> > [...]
>> >
>> > All those factors speak greatly in favour of superblock
>> > watches.
>> >
>> > (2) Tracking filesystem *activity*. Now you are not building
>> > an image of current filesystem state but rather a log of what
>> > happened. Perhaps you are also interested in who
>> > (user/process/...) did what. Permission events also fit mostly in
>> > this category.
>> >
>> > For those it *might* make sense to have mount-scoped watches, for
>> > example if you want to monitor only one container or a subset of
>> > processes.
>> >
>> > We both concentrate on the first but we shouldn't forget about the
>> > second, which was one of the original motivations for fanotify.
>>
>> My (employer's) needs are centered around (2). We definitely crave
>> permission events with a filesystem scope. At the moment, you can avoid
>> permission checks with a simple unshare command (<URL:
>> https://lkml.org/lkml/2016/12/21/144>).
>
> Yes, that is bad.
>
>> So I must be able to see everything that is happening in my universe. It
>> might also be useful to monitor a subuniverse of mine, but the former
>> need is critical at the moment.
>
> So I understand your need. However with superblock watches I'm still
> concerned that the process would be able to see too much. E.g. if it is
> restricted to see only some subtree of a filesystem (by bind mounts &
> namespaces), it should not be able to see events on the same filesystem
> outside of that subtree. I have not found a good solution for that yet.
>

See the last patch in my series. The cherry on the top ;-)

commit 5e3b5bd943991cdf5b72745c1e24833bc998b7ed
Author: Amir Goldstein <[email protected]>
Date: Sun Dec 18 11:25:55 2016 +0200

fanotify: filter events by root mark mount point

When adding a super block root watch from a mount point that is not mounted
on the root of the file system, filter out events on file system objects
that happen outside this mount point directory (on non decendant objects).

This is not like FAN_MARK_MOUNT which filters only events that happened
on the mount of the mark. All events on file system objects are reported
as long as these objects are accessible from the mark mount point.

Signed-off-by: Amir Goldstein <[email protected]>

fs/notify/fanotify/fanotify.c | 13 +++++++++++--
1 file changed, 11 insertions(+), 2 deletions(-)

Our use case is monitoring a large directory tree, not the entire file system.

I used a simple check in should_send_event()

+ /*
+ * Only interesetd in dentry events visible from the mount
+ * from which the root watch was added
+ */
+ if (mark_mnt && mark_mnt->mnt_root != dentry &&
+ d_ancestor(mark_mnt->mnt_root, dentry) == NULL)
+ return false;
+

This does not cover the case of events on objects that are hidden
under another mount in my mount namespace, but it covers the
simple case of bind mount.

Note that 'mark_mnt' does NOT stand for the vfs_mark mount,
because root watch is an inode_mark (on the root inode).
It stands for the mount from which the root watch was added.
Same mount that is used to construct the event->fd for all
the dentry events.

This scheme does NOT allow multiple root watches with
different mount point filter on the same group.
Every group can have just one root watch per sb with a single
mount filter.

Amir.

2017-03-16 10:39:05

by Jan Kara

[permalink] [raw]
Subject: Re: [RFC 1/2] fanotify: new event FAN_MODIFY_DIR

On Wed 15-03-17 16:34:11, Amir Goldstein wrote:
> On Wed, Mar 15, 2017 at 4:05 PM, Jan Kara <[email protected]> wrote:
> >> So the string fields in my proposal are optional.
> >> userspace opts-in to get them by specifying the FAN_EVENT_INFO_NAME flag:
> >>
> >> fanotify_init(FAN_CLOEXEC | FAN_CLASS_NOTIF |
> >> FAN_EVENT_INFO_PARENT | FAN_EVENT_INFO_NAME,
> >>
> >> If you don't specify FAN_EVENT_INFO_NAME, you can get filename events
> >> FAN_MOVE|FAN_CREATE|FAN_DELETE without the name.
> >
> > So when thinking about this again, I'm again concerned that the names need
> > not tell userspace what it thinks. I know we already discussed this but
> > in our last discussion I think I forgot to point out that inotify directory
> > events (and fanotify would be the same AFAICT) may come out of order compared
> > to real filesystem changes. E.g. sequence:
> >
> > Task 1 Task 2
> >
> > mv a b
> > mv b c
> >
> > may come out of inotify as:
> >
> > IN_MOVED_FROM "b" COOKIE 1
> > IN_MOVED_TO "c" COOKIE 1
> > IN_MOVED_FROM "a" COOKIE 2
> > IN_MOVED_TO "b" COOKIE 2
> >
>
> Really? How come?
>
> fsnotify_move() inside vfs_rename() is serialized with lock_rename()

Sorry, I got confused by the unlocking in vfs_rename() however that unlocks
only the moved file / directory but not the directory in which the rename
is happening. Taking my objection back.

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR

2017-03-19 10:19:51

by Jan Kara

[permalink] [raw]
Subject: Re: [RFC 2/2] fanotify: emit FAN_MODIFY_DIR on filesystem changes

On Tue 14-03-17 13:18:01, Amir Goldstein wrote:
> On Tue, Mar 14, 2017 at 1:03 AM, Filip Štědronský <[email protected]> wrote:
> > Besause fanotify requires `struct path`, the event cannot be generated
> > directly in `fsnotify_move` and friends because they only get the inode
> > (and their callers, `vfs_rename`&co. cannot supply any better info).
> > So instead it needs to be generated higher in the call chain, i.e. in
> > the callers of functions like `vfs_rename`.
> >
> > This leads to some code duplication. Currently, there are several places
> > whence functions like `vfs_rename` or `vfs_unlink` are called:
> >
> > * syscall handlers (done)
> > * NFS server (done)
> > * stacked filesystems
> > - ecryptfs (done)
> > - overlayfs
> > (Currently doesn't report even ordinary fanotify events, because
> > it internally clones the upper mount; not sure about the
> > rationale. One can always watch the overlay mount instead.)
> > * few rather minor things
> > - devtmpfs
> > (its internal changes are not tied to any vfsmount so it cannot
> > emit mount-scoped events)
> > - cachefiles (done)
> > - ipc/mqueue.c (done)
> > - fs/nfsd/nfs4recover.c (done)
> > - kernel/bpf/inode.c (done)
> > net/unix/af_unix.c (done)
> >
> > (grep -rE '\bvfs_(rename|unlink|mknod|whiteout|create|mkdir|rmdir|symlink|link)\(')
> >
> > Signed-off-by: Filip Štědronský <[email protected]>
> >
> > ---
> >
> > An alternative might be to create wrapper functions like
> > vfs_path_(rename|unlink|...). They could also take care of calling
> > security_path_(rename|unlink|...), which is currently also up to
> > the indvidual callers (possibly with a flag because it might not
> > be always desired).
>
> That's an interesting idea. There is some duplicity between security/audit
> hook and fsnotify hooks. It should be interesting to try and deduplicate
> some of this code.

Yeah, but ecryptfs or nfsd don't actually call these security hooks AFAICT.
And at least for NFSD it seems correct they don't call them since you are
running in a context of an NFSD server process and don't have security
context of the process actually issuing the IO. So I'm not sure
"deduplication" is really possible.

However if you can really call fsnotify hooks with 'path' available in all
the places, it should be equally hard to just pass 'path' to
vfs_(create|mkdir|...) and that way we don't have to sprinkle fsnotify
calls into several call sites but keep them local to vfs_(create|mkdir|...)
helpers. Hmm?

>
> > ---
> > fs/cachefiles/namei.c | 9 +++++++
> > fs/ecryptfs/inode.c | 67 +++++++++++++++++++++++++++++++++++++++++++++++++++
> > fs/namei.c | 23 +++++++++++++++++-
> > fs/nfsd/nfs4recover.c | 7 ++++++
> > fs/nfsd/vfs.c | 24 ++++++++++++++++--
> > ipc/mqueue.c | 9 +++++++
> > kernel/bpf/inode.c | 3 +++
> > net/unix/af_unix.c | 2 ++
> > 8 files changed, 141 insertions(+), 3 deletions(-)
> >
>
> OK, just for comparison, I am going to put here the diff of the sub set of
> my patches that are needed to support fanotify filename events.
>
>
> $ git diff --stat fsnotify_sb..fanotify_dentry
> fs/notify/fanotify/fanotify.c | 94
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++--------------
> fs/notify/fanotify/fanotify.h | 25 ++++++++++++++++++++++++-
> fs/notify/fanotify/fanotify_user.c | 92
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++--------
> fs/notify/fdinfo.c | 25 +++----------------------
> fs/notify/inode_mark.c | 1 +
> fs/notify/mark.c | 15 ++++++++++++---
> include/linux/fsnotify_backend.h | 21 ++++++++++++++++-----
> include/uapi/linux/fanotify.h | 41
> +++++++++++++++++++++++++++++++++++------
> 8 files changed, 255 insertions(+), 59 deletions(-)
>
> Yes, it is a bit more code, much mostly because it adds more functionality
> (optionally reporting the filename).
> But most of the code is contained within the fsnotify/fanotify subsystem.

So I'm not that concerned about actual line numbers. They play some role
but they don't tell much about how maintainable the result is in the end.

> The altenative to sprinkle fsnotify_modify_dir() hooks is much less
> maintainable IMO.

Agreed on this.

> I managed to stay a way from cross subsystems changes,
> by allowing to loose some information about the event.
>
> When a filename event is generated (rename|delete|create)
> the path of the parent fd that will be reported to user is NOT
> the actual path that the process executing the operation used,
> but the path from which the watching process has added the mark.

Yeah, and frankly I'm not yet convinced this is a sane thing to do. I'd
much rather propagate path to vfs_(create|mkdir|...) helpers.

> So for example, if you have a bind mount:
> mount -o bind /a/b/c/d/e/f/g /tmp/g
>
> And you add a watch on mount /a
> you *will* get events for create,delete,rename in directory /tmp/g
> but that path in the associated fd with point to /a/b/c/d/e/f/g
>
> Some would claim that it is wrong to report the events
> if they were not originated from the mount were the
> watch was added.
>
> I claim that fanotify filters event by mount not because it
> was a requirement, but because it was an implementation challenge
> to do otherwise.
> And I claim that what mount watchers are really interested in is
> "all the changes that happen in the file system in the area
> that is visible to me through this mount point".
>
> In other words, an indexer needs to know if files were modified\
> create/deleted if that indexer sits in container host namespace
> regardless if those files were modified from within a container
> namespace.
>
> It's not a matter of security/isolation. It's a matter of functionality.
> I agree that for some event (e.g. permission events) it is possible
> to argue both ways (i.e. that the namespace context should be used
> as a filter for events).
> But for the new proposed events (FS_MODIFY_DIR), I really don't
> see the point in isolation by mount/namespace.

I would be *very* careful with such assumptions. People are very inventive
in ways how they can abuse stuff. What I'm most nervous about is that the
(mnt, dentry) pair you create may not be reachable path. When you combine
that with xyz_at(2) syscalls allowing you to traverse directory hierarchy
from fd which you got from fanotify event, things can get really
interesting. So unless there are very clear rules proving this cannot be
misused, I'm against hand-crafting of path structures inside fsnotify.

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR

2017-03-19 10:46:26

by Filip Štědronský

[permalink] [raw]
Subject: Re: [RFC 2/2] fanotify: emit FAN_MODIFY_DIR on filesystem changes

On Sun, Mar 19, 2017 at 11:19:43AM +0100, Jan Kara wrote:
> However if you can really call fsnotify hooks with 'path' available in all
> the places, it should be equally hard to just pass 'path' to
> vfs_(create|mkdir|...) and that way we don't have to sprinkle fsnotify
> calls into several call sites but keep them local to vfs_(create|mkdir|...)
> helpers. Hmm?

the problem is: not absolutely all. One illuminating example is the use
of vfs_mknod in devtmpfs. There a struct path is not only unavailable
but makes not semantic sense: the changes do not go thru any mountpoint.
And in general I think there will be situations where you would need
to call VFS functions without paths.

Thus I suggested either
(a) wrapping the VFS functions with path variants, or
(b) giving them an optional vfsmount argument that can be set to NULL
when it does not make sense

Filip

2017-03-20 11:14:02

by Jan Kara

[permalink] [raw]
Subject: Re: [RFC 2/2] fanotify: emit FAN_MODIFY_DIR on filesystem changes

On Sun 19-03-17 11:37:39, Filip Štědronský wrote:
> On Sun, Mar 19, 2017 at 11:19:43AM +0100, Jan Kara wrote:
> > However if you can really call fsnotify hooks with 'path' available in all
> > the places, it should be equally hard to just pass 'path' to
> > vfs_(create|mkdir|...) and that way we don't have to sprinkle fsnotify
> > calls into several call sites but keep them local to vfs_(create|mkdir|...)
> > helpers. Hmm?
>
> the problem is: not absolutely all. One illuminating example is the use
> of vfs_mknod in devtmpfs. There a struct path is not only unavailable
> but makes not semantic sense: the changes do not go thru any mountpoint.

How come? In current kernel the call looks like:

vfs_mknod(d_inode(path.dentry), dentry, mode, dev->devt);

so the path is available there... I've actually quickly checked all
vfs_mknod() callers and they all seem to have path available.

> And in general I think there will be situations where you would need
> to call VFS functions without paths.
>
> Thus I suggested either
> (a) wrapping the VFS functions with path variants, or
> (b) giving them an optional vfsmount argument that can be set to NULL
> when it does not make sense

So my first take is that fsnotify calls happen still relatively high in the
call stack where we should mostly have mount point available from the path
lookup. That being said there may be places where we've lost that
information and it will not be easy to propagate it there and that would
have to be dealt with on case-by-case basis. But mountpoint is needed for
other stuff like security checks these days as well so we should have it
available in principle.

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR

2017-03-20 11:40:55

by Amir Goldstein

[permalink] [raw]
Subject: Re: [RFC 2/2] fanotify: emit FAN_MODIFY_DIR on filesystem changes

On Sun, Mar 19, 2017 at 2:04 PM, Jan Kara <[email protected]> wrote:
> On Sun 19-03-17 11:37:39, Filip Štědronský wrote:
>> On Sun, Mar 19, 2017 at 11:19:43AM +0100, Jan Kara wrote:
>> > However if you can really call fsnotify hooks with 'path' available in all
>> > the places, it should be equally hard to just pass 'path' to
>> > vfs_(create|mkdir|...) and that way we don't have to sprinkle fsnotify
>> > calls into several call sites but keep them local to vfs_(create|mkdir|...)
>> > helpers. Hmm?
>>
>> the problem is: not absolutely all. One illuminating example is the use
>> of vfs_mknod in devtmpfs. There a struct path is not only unavailable
>> but makes not semantic sense: the changes do not go thru any mountpoint.
>
> How come? In current kernel the call looks like:
>
> vfs_mknod(d_inode(path.dentry), dentry, mode, dev->devt);
>
> so the path is available there... I've actually quickly checked all
> vfs_mknod() callers and they all seem to have path available.
>
>> And in general I think there will be situations where you would need
>> to call VFS functions without paths.
>>
>> Thus I suggested either
>> (a) wrapping the VFS functions with path variants, or
>> (b) giving them an optional vfsmount argument that can be set to NULL
>> when it does not make sense
>
> So my first take is that fsnotify calls happen still relatively high in the
> call stack where we should mostly have mount point available from the path
> lookup. That being said there may be places where we've lost that
> information and it will not be easy to propagate it there and that would
> have to be dealt with on case-by-case basis. But mountpoint is needed for
> other stuff like security checks these days as well so we should have it
> available in principle.
>

I agree that propagating the path to fsnotify seem like the right thing to do.
fsnotify_inoderemove() is an example (the only one I know of) where path
is not available (when called down from from dput()), but frankly, I can't
think of any use cases that really needs to make use of the
FS_DELETE_SELF event in that case.

d_delete() which also calls fsnotify_inoderemove() already calls
fsnotify_nameremove() hook with the exact same dentry, so
the FS_DELETE_SELF event can be generated by that hook as well
as the FS_DELETE event.

2017-03-20 11:53:06

by Filip Štědronský

[permalink] [raw]
Subject: Re: [RFC 2/2] fanotify: emit FAN_MODIFY_DIR on filesystem changes

Hi,

On Sun, Mar 19, 2017 at 07:04:13PM +0100, Jan Kara wrote:
> How come? In current kernel the call looks like:
>
> vfs_mknod(d_inode(path.dentry), dentry, mode, dev->devt);
>
> so the path is available there... I've actually quickly checked all
> vfs_mknod() callers and they all seem to have path available.

terribly sorry, must have misremembered something. Been staring at the
code long into the night. You are quite right.

But it is an internal mount so userspace never gets the notifications.
The same goes for the cloned upper mount in overlayfs. This might be
considered ok, because the change is semantically "internal" and does
not originate through any userspace-visible mountpoint. Superblock
watches would solve this case.

Otherwise it seems feasible to pass a path to all VFS functions.

Filip

2017-03-20 12:13:00

by Amir Goldstein

[permalink] [raw]
Subject: Re: [RFC 1/2] fanotify: new event FAN_MODIFY_DIR

On Tue, Mar 14, 2017 at 9:46 AM, Filip Štědronský <[email protected]> wrote:
> Hi,
>
> On Tue, Mar 14, 2017 at 12:40:56PM +0200, Amir Goldstein wrote:
>> An I am very happy that you used filehandles to keep track of objects
>> in your example, because it fits my proposal really well.
>
> you can read more about the rationales in the WIP draft of my thesis
> (especially the chapter Identifying Inodes and its surroundings):
>
> http://regnarg.cz/tmp/thesis.pdf
> https://github.com/regnarg/bc
>

Filip,

I read your draft and it's very interesting.
Looking forward for the unfinished chapters ;-)

open_by_handle() is important for your implementation in order
to be able to get inode number (unique object id) from the file handle.
In practice, although file handles are opaque descriptors, they usually
contain the inode number information is a well known format.

If it will be possible to take those formats and make them standard ABI,
or alternatively, to call a new get_inum_by_handle() syscall, then
the indexer/file sync won't need to have the extra privileges needed
for open_by_handle() and file systems that can provide "view only"
file handles (e.g. overlayfs) could be used with the indexer.

Apropos chapter 1.3 "Filesystem based change detection"
It's on my todo list to try and implement something with XFS
using the LSN information embedded in every inode.
It would be great if we can collaborate on this in the future.

Amir.

2017-03-21 15:40:01

by J. Bruce Fields

[permalink] [raw]
Subject: Re: [RFC 2/2] fanotify: emit FAN_MODIFY_DIR on filesystem changes

On Sun, Mar 19, 2017 at 11:19:43AM +0100, Jan Kara wrote:
> On Tue 14-03-17 13:18:01, Amir Goldstein wrote:
> > On Tue, Mar 14, 2017 at 1:03 AM, Filip Štědronský <[email protected]> wrote:
> > > Besause fanotify requires `struct path`, the event cannot be generated
> > > directly in `fsnotify_move` and friends because they only get the inode
> > > (and their callers, `vfs_rename`&co. cannot supply any better info).
> > > So instead it needs to be generated higher in the call chain, i.e. in
> > > the callers of functions like `vfs_rename`.
> > >
> > > This leads to some code duplication. Currently, there are several places
> > > whence functions like `vfs_rename` or `vfs_unlink` are called:
> > >
> > > * syscall handlers (done)
> > > * NFS server (done)
> > > * stacked filesystems
> > > - ecryptfs (done)
> > > - overlayfs
> > > (Currently doesn't report even ordinary fanotify events, because
> > > it internally clones the upper mount; not sure about the
> > > rationale. One can always watch the overlay mount instead.)
> > > * few rather minor things
> > > - devtmpfs
> > > (its internal changes are not tied to any vfsmount so it cannot
> > > emit mount-scoped events)
> > > - cachefiles (done)
> > > - ipc/mqueue.c (done)
> > > - fs/nfsd/nfs4recover.c (done)
> > > - kernel/bpf/inode.c (done)
> > > net/unix/af_unix.c (done)
> > >
> > > (grep -rE '\bvfs_(rename|unlink|mknod|whiteout|create|mkdir|rmdir|symlink|link)\(')
> > >
> > > Signed-off-by: Filip Štědronský <[email protected]>
> > >
> > > ---
> > >
> > > An alternative might be to create wrapper functions like
> > > vfs_path_(rename|unlink|...). They could also take care of calling
> > > security_path_(rename|unlink|...), which is currently also up to
> > > the indvidual callers (possibly with a flag because it might not
> > > be always desired).
> >
> > That's an interesting idea. There is some duplicity between security/audit
> > hook and fsnotify hooks. It should be interesting to try and deduplicate
> > some of this code.
>
> Yeah, but ecryptfs or nfsd don't actually call these security hooks AFAICT.

We don't? E.g. nfsd_unlink calls vfs_unlink which calls
security_inode_unlink().

> And at least for NFSD it seems correct they don't call them since you are
> running in a context of an NFSD server process and don't have security
> context of the process actually issuing the IO.

# ps -eo comm,label|grep nfsd
nfsd system_u:system_r:kernel_t:s0
...

So I guess that's the process context they see.

There is actually a way for the protocol to pass the security context,
in the case it's running over kerberos. So some day this might
optionally start using the context of the client process, for what it's
worth.

--b.

> So I'm not sure
> "deduplication" is really possible.
>
> However if you can really call fsnotify hooks with 'path' available in all
> the places, it should be equally hard to just pass 'path' to
> vfs_(create|mkdir|...) and that way we don't have to sprinkle fsnotify
> calls into several call sites but keep them local to vfs_(create|mkdir|...)
> helpers. Hmm?
>
> >
> > > ---
> > > fs/cachefiles/namei.c | 9 +++++++
> > > fs/ecryptfs/inode.c | 67 +++++++++++++++++++++++++++++++++++++++++++++++++++
> > > fs/namei.c | 23 +++++++++++++++++-
> > > fs/nfsd/nfs4recover.c | 7 ++++++
> > > fs/nfsd/vfs.c | 24 ++++++++++++++++--
> > > ipc/mqueue.c | 9 +++++++
> > > kernel/bpf/inode.c | 3 +++
> > > net/unix/af_unix.c | 2 ++
> > > 8 files changed, 141 insertions(+), 3 deletions(-)
> > >
> >
> > OK, just for comparison, I am going to put here the diff of the sub set of
> > my patches that are needed to support fanotify filename events.
> >
> >
> > $ git diff --stat fsnotify_sb..fanotify_dentry
> > fs/notify/fanotify/fanotify.c | 94
> > ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++--------------
> > fs/notify/fanotify/fanotify.h | 25 ++++++++++++++++++++++++-
> > fs/notify/fanotify/fanotify_user.c | 92
> > ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++--------
> > fs/notify/fdinfo.c | 25 +++----------------------
> > fs/notify/inode_mark.c | 1 +
> > fs/notify/mark.c | 15 ++++++++++++---
> > include/linux/fsnotify_backend.h | 21 ++++++++++++++++-----
> > include/uapi/linux/fanotify.h | 41
> > +++++++++++++++++++++++++++++++++++------
> > 8 files changed, 255 insertions(+), 59 deletions(-)
> >
> > Yes, it is a bit more code, much mostly because it adds more functionality
> > (optionally reporting the filename).
> > But most of the code is contained within the fsnotify/fanotify subsystem.
>
> So I'm not that concerned about actual line numbers. They play some role
> but they don't tell much about how maintainable the result is in the end.
>
> > The altenative to sprinkle fsnotify_modify_dir() hooks is much less
> > maintainable IMO.
>
> Agreed on this.
>
> > I managed to stay a way from cross subsystems changes,
> > by allowing to loose some information about the event.
> >
> > When a filename event is generated (rename|delete|create)
> > the path of the parent fd that will be reported to user is NOT
> > the actual path that the process executing the operation used,
> > but the path from which the watching process has added the mark.
>
> Yeah, and frankly I'm not yet convinced this is a sane thing to do. I'd
> much rather propagate path to vfs_(create|mkdir|...) helpers.
>
> > So for example, if you have a bind mount:
> > mount -o bind /a/b/c/d/e/f/g /tmp/g
> >
> > And you add a watch on mount /a
> > you *will* get events for create,delete,rename in directory /tmp/g
> > but that path in the associated fd with point to /a/b/c/d/e/f/g
> >
> > Some would claim that it is wrong to report the events
> > if they were not originated from the mount were the
> > watch was added.
> >
> > I claim that fanotify filters event by mount not because it
> > was a requirement, but because it was an implementation challenge
> > to do otherwise.
> > And I claim that what mount watchers are really interested in is
> > "all the changes that happen in the file system in the area
> > that is visible to me through this mount point".
> >
> > In other words, an indexer needs to know if files were modified\
> > create/deleted if that indexer sits in container host namespace
> > regardless if those files were modified from within a container
> > namespace.
> >
> > It's not a matter of security/isolation. It's a matter of functionality.
> > I agree that for some event (e.g. permission events) it is possible
> > to argue both ways (i.e. that the namespace context should be used
> > as a filter for events).
> > But for the new proposed events (FS_MODIFY_DIR), I really don't
> > see the point in isolation by mount/namespace.
>
> I would be *very* careful with such assumptions. People are very inventive
> in ways how they can abuse stuff. What I'm most nervous about is that the
> (mnt, dentry) pair you create may not be reachable path. When you combine
> that with xyz_at(2) syscalls allowing you to traverse directory hierarchy
> from fd which you got from fanotify event, things can get really
> interesting. So unless there are very clear rules proving this cannot be
> misused, I'm against hand-crafting of path structures inside fsnotify.
>
> Honza
> --
> Jan Kara <[email protected]>
> SUSE Labs, CR

2017-03-21 16:41:31

by Jan Kara

[permalink] [raw]
Subject: Re: [RFC 2/2] fanotify: emit FAN_MODIFY_DIR on filesystem changes

On Tue 21-03-17 11:38:49, J. Bruce Fields wrote:
> On Sun, Mar 19, 2017 at 11:19:43AM +0100, Jan Kara wrote:
> > On Tue 14-03-17 13:18:01, Amir Goldstein wrote:
> > > On Tue, Mar 14, 2017 at 1:03 AM, Filip Štědronský <[email protected]> wrote:
> > > > Besause fanotify requires `struct path`, the event cannot be generated
> > > > directly in `fsnotify_move` and friends because they only get the inode
> > > > (and their callers, `vfs_rename`&co. cannot supply any better info).
> > > > So instead it needs to be generated higher in the call chain, i.e. in
> > > > the callers of functions like `vfs_rename`.
> > > >
> > > > This leads to some code duplication. Currently, there are several places
> > > > whence functions like `vfs_rename` or `vfs_unlink` are called:
> > > >
> > > > * syscall handlers (done)
> > > > * NFS server (done)
> > > > * stacked filesystems
> > > > - ecryptfs (done)
> > > > - overlayfs
> > > > (Currently doesn't report even ordinary fanotify events, because
> > > > it internally clones the upper mount; not sure about the
> > > > rationale. One can always watch the overlay mount instead.)
> > > > * few rather minor things
> > > > - devtmpfs
> > > > (its internal changes are not tied to any vfsmount so it cannot
> > > > emit mount-scoped events)
> > > > - cachefiles (done)
> > > > - ipc/mqueue.c (done)
> > > > - fs/nfsd/nfs4recover.c (done)
> > > > - kernel/bpf/inode.c (done)
> > > > net/unix/af_unix.c (done)
> > > >
> > > > (grep -rE '\bvfs_(rename|unlink|mknod|whiteout|create|mkdir|rmdir|symlink|link)\(')
> > > >
> > > > Signed-off-by: Filip Štědronský <[email protected]>
> > > >
> > > > ---
> > > >
> > > > An alternative might be to create wrapper functions like
> > > > vfs_path_(rename|unlink|...). They could also take care of calling
> > > > security_path_(rename|unlink|...), which is currently also up to
> > > > the indvidual callers (possibly with a flag because it might not
> > > > be always desired).
> > >
> > > That's an interesting idea. There is some duplicity between security/audit
> > > hook and fsnotify hooks. It should be interesting to try and deduplicate
> > > some of this code.
> >
> > Yeah, but ecryptfs or nfsd don't actually call these security hooks AFAICT.
>
> We don't? E.g. nfsd_unlink calls vfs_unlink which calls
> security_inode_unlink().

OK, I have not been specific enough :). ecryptfs or nfsd don't call *path*
security hooks AFAICT - e.g. security_path_unlink() from nfsd_unlink().

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR

2017-03-21 17:45:15

by J. Bruce Fields

[permalink] [raw]
Subject: Re: [RFC 2/2] fanotify: emit FAN_MODIFY_DIR on filesystem changes

On Tue, Mar 21, 2017 at 05:41:22PM +0100, Jan Kara wrote:
> On Tue 21-03-17 11:38:49, J. Bruce Fields wrote:
> > On Sun, Mar 19, 2017 at 11:19:43AM +0100, Jan Kara wrote:
> > > On Tue 14-03-17 13:18:01, Amir Goldstein wrote:
> > > > On Tue, Mar 14, 2017 at 1:03 AM, Filip Štědronský <[email protected]> wrote:
> > > > > An alternative might be to create wrapper functions like
> > > > > vfs_path_(rename|unlink|...). They could also take care of calling
> > > > > security_path_(rename|unlink|...), which is currently also up to
> > > > > the indvidual callers (possibly with a flag because it might not
> > > > > be always desired).
> > > >
> > > > That's an interesting idea. There is some duplicity between security/audit
> > > > hook and fsnotify hooks. It should be interesting to try and deduplicate
> > > > some of this code.
> > >
> > > Yeah, but ecryptfs or nfsd don't actually call these security hooks AFAICT.
> >
> > We don't? E.g. nfsd_unlink calls vfs_unlink which calls
> > security_inode_unlink().
>
> OK, I have not been specific enough :). ecryptfs or nfsd don't call *path*
> security hooks AFAICT - e.g. security_path_unlink() from nfsd_unlink().

Oh, got it, thanks.

But, no, nfsd is definitely is not meant to be invisible to security
modules, so that's just a bug.

--b.