2016-12-20 15:20:31

by Amir Goldstein

[permalink] [raw]
Subject: [RFC][PATCH 0/2] fsnotify: super block watch

Jan,

This 2 patch series is the 2nd part of fanotify super block watch work.

The full work is available on my github:
1. fsnotify: misc cleanups - 4 patches posted on Dec 17
https://github.com/amir73il/linux/commits/fsnotify_dentry
2. fsnotify: super block watch - 2 patches in this posting
https://github.com/amir73il/linux/commits/fsnotify_sb
3. fanotify: add support for more events - 6 patches posted on Oct 10
https://github.com/amir73il/linux/commits/fanotify_dentry
4. fanotify: add super block watch - 6 patches not posted yet
https://github.com/amir73il/linux/commits/fanotify_sb

On the Oct 10 posting you asked me about the use case and it was hard
to explain the use case with only part of the work done.

The issue, which this work sets to solve, is the poor scalability of
recursive inotify watches.

Other operating systems have a scalable way of watching changes on
a large file system. Windows has USN Journal, macOS has FSEvents
and BSD has kevents.

The only way in Linux to monitor file system namei events
(e.g. create/delete/move) is the recursive inotify watch way and
this method scales very poorly for large enough directory trees.
Beyond the exploding probability of a need for full scan, pinning
all directory inodes wastes a lot of memory.

The efforts to merge fanotify took several steps in the direction
of solving the scalability issue, but they did not go all the way
to provide the functionality required to replace inotify.

Unfortunately, the fsnotify subsystem has been suffering from
neglection for quite some time, so it hard for me to get proper
feedback on this work.

The fanotify super block watch feature is currently being tested
by my employer and by other interested parties as well.

I am planning to write some more testing tools and gather more
benchmark results before posting the patches for the last part
of this work (fanotify_sb).

Any assistance you can provide with review and with helping
to push this work forward would be very much appreciated.

Thanks!
Amir.

Amir Goldstein (2):
fsnotify: add event mask FS_EVENT_ON_SB
fsnotify: implement event reporting to super block's root inode

fs/notify/fsnotify.c | 60 ++++++++++++++++++++++++++++++++++++----
include/linux/fsnotify_backend.h | 29 +++++++++++++++++--
2 files changed, 81 insertions(+), 8 deletions(-)

--
2.7.4


2016-12-20 15:20:35

by Amir Goldstein

[permalink] [raw]
Subject: [RFC][PATCH 2/2] fsnotify: implement event reporting to super block's root inode

When the flag FS_EVENT_ON_SB is set on a super block's root inode
fsnotify mask, all events on inodes that are on the same super
block are reported to the root inode.

Events reported to root inode carry the flag FS_EVENT_ON_SB,
to distinguish from events that happen to the root inode itself
or to its direct children (FS_EVENT_ON_CHILD).

The extra cost for vfs operations without any root inode watch
is the test for fsnotify flag on root inode, i.e.:
(dentry->d_sb->s_root->d_inode->i_fsnotify_mask & FS_EVENT_ON_SB)
This could be further optimized by setting a flag on the super_block
struct.

Signed-off-by: Amir Goldstein <[email protected]>
---
fs/notify/fsnotify.c | 58 +++++++++++++++++++++++++++++++++++++---
include/linux/fsnotify_backend.h | 10 +++++++
2 files changed, 64 insertions(+), 4 deletions(-)

diff --git a/fs/notify/fsnotify.c b/fs/notify/fsnotify.c
index 12d4479..c0a596c 100644
--- a/fs/notify/fsnotify.c
+++ b/fs/notify/fsnotify.c
@@ -183,12 +183,14 @@ static int send_to_group(struct inode *to_tell,

/*
* This is the main call to fsnotify. The VFS calls into hook specific functions
- * in linux/fsnotify.h. Those functions then in turn call here. Here will call
+ * in linux/fsnotify.h. Those functions call the helpers fsnotify(),
+ * fsnotify_root(), fsnotify_parent() and they in turn call here. Here will call
* out to all of the registered fsnotify_group. Those groups can then use the
* notification event in whatever means they feel necessary.
*/
-int fsnotify(struct inode *to_tell, __u32 mask, void *data, int data_is,
- const unsigned char *file_name, u32 cookie)
+static int __fsnotify(struct inode *to_tell, __u32 mask,
+ void *data, int data_is,
+ const unsigned char *file_name, u32 cookie)
{
struct hlist_node *inode_node = NULL, *vfsmount_node = NULL;
struct fsnotify_mark *inode_mark = NULL, *vfsmount_mark = NULL;
@@ -196,7 +198,7 @@ int fsnotify(struct inode *to_tell, __u32 mask, void *data, int data_is,
struct mount *mnt;
int idx, ret = 0;
/* global tests shouldn't care about events on child only the specific event */
- __u32 test_mask = (mask & ~FS_EVENT_ON_CHILD);
+ __u32 test_mask = (mask & ~FS_EVENT_ON_DESCENDANT);

if (data_is == FSNOTIFY_EVENT_PATH)
mnt = real_mount(((struct path *)data)->mnt);
@@ -291,6 +293,54 @@ int fsnotify(struct inode *to_tell, __u32 mask, void *data, int data_is,

return ret;
}
+
+/* Notify this dentry's sb root about descendant inode events. */
+static int fsnotify_root(struct dentry *dentry, __u32 mask,
+ void *data, int data_is,
+ const unsigned char *file_name, u32 cookie)
+{
+ struct inode *r_inode = d_inode(dentry->d_sb->s_root);
+
+ if (likely(!fsnotify_inode_watches_sb(r_inode)) ||
+ !(r_inode->i_fsnotify_mask & mask))
+ return 0;
+
+ /* we are notifying root so come up with the new mask which
+ * specifies these are events which came from sb. */
+ mask |= FS_EVENT_ON_SB;
+
+ return __fsnotify(r_inode, mask, data, data_is, file_name, cookie);
+}
+
+/* Notify this inode and maybe the sb root inode. */
+int fsnotify(struct inode *to_tell, __u32 mask, void *data, int data_is,
+ const unsigned char *file_name, u32 cookie)
+{
+ struct dentry *dentry = NULL;
+ int ret = 0;
+
+ BUG_ON(mask & FS_EVENT_ON_SB);
+
+ if (data_is == FSNOTIFY_EVENT_PATH)
+ dentry = ((struct path *)data)->dentry;
+ else if (data_is == FSNOTIFY_EVENT_DENTRY)
+ dentry = data;
+
+ if (dentry) {
+ /* First, notify root inode if it cares */
+ ret = fsnotify_root(dentry, mask, data, data_is,
+ file_name, cookie);
+ if (ret)
+ return ret;
+
+ /* Do not report to root sb watch an event twice */
+ if (unlikely(fsnotify_inode_watches_sb(to_tell)))
+ return 0;
+ }
+
+ /* Then, notify this inode */
+ return __fsnotify(to_tell, mask, data, data_is, file_name, cookie);
+}
EXPORT_SYMBOL_GPL(fsnotify);

static __init int fsnotify_init(void)
diff --git a/include/linux/fsnotify_backend.h b/include/linux/fsnotify_backend.h
index b7992da..224c4aa 100644
--- a/include/linux/fsnotify_backend.h
+++ b/include/linux/fsnotify_backend.h
@@ -266,6 +266,16 @@ extern void __fsnotify_inode_delete(struct inode *inode);
extern void __fsnotify_vfsmount_delete(struct vfsmount *mnt);
extern u32 fsnotify_get_cookie(void);

+static inline int fsnotify_inode_watches_sb(struct inode *inode)
+{
+ /* FS_EVENT_ON_SB is set if the sb root inode may care */
+ if (!(inode->i_fsnotify_mask & FS_EVENT_ON_SB))
+ return 0;
+ /* this root inode might care about sb events, does it care about the
+ * specific set of events that can happen on a distant child? */
+ return inode->i_fsnotify_mask & FS_EVENTS_POSS_ON_SB;
+}
+
static inline int fsnotify_inode_watches_children(struct inode *inode)
{
/* FS_EVENT_ON_CHILD is set if the inode may care */
--
2.7.4

2016-12-20 15:20:52

by Amir Goldstein

[permalink] [raw]
Subject: [RFC][PATCH 1/2] fsnotify: add event mask FS_EVENT_ON_SB

When a watch is added on a super block's root inode with
the FS_EVENT_ON_SB flag, the watched inode is intended
to report events on all inodes on the same super block.

Signed-off-by: Amir Goldstein <[email protected]>
---
fs/notify/fsnotify.c | 2 +-
include/linux/fsnotify_backend.h | 19 ++++++++++++++++---
2 files changed, 17 insertions(+), 4 deletions(-)

diff --git a/fs/notify/fsnotify.c b/fs/notify/fsnotify.c
index 3ba0e4a..12d4479 100644
--- a/fs/notify/fsnotify.c
+++ b/fs/notify/fsnotify.c
@@ -297,7 +297,7 @@ static __init int fsnotify_init(void)
{
int ret;

- BUG_ON(hweight32(ALL_FSNOTIFY_EVENTS) != 23);
+ BUG_ON(hweight32(ALL_FSNOTIFY_EVENTS) != 24);

ret = init_srcu_struct(&fsnotify_mark_srcu);
if (ret)
diff --git a/include/linux/fsnotify_backend.h b/include/linux/fsnotify_backend.h
index b55c64d..b7992da 100644
--- a/include/linux/fsnotify_backend.h
+++ b/include/linux/fsnotify_backend.h
@@ -54,6 +54,13 @@
* dnotify and inotify. */
#define FS_EVENT_ON_CHILD 0x08000000

+/* This root inode cares about things that happen to inodes on same super block.
+ * Can only be set for fanotify.
+ * Overloads IN_ONLYDIR inotify open only flag */
+#define FS_EVENT_ON_SB 0x01000000
+
+#define FS_EVENT_ON_DESCENDANT (FS_EVENT_ON_CHILD | FS_EVENT_ON_SB)
+
/* This is a list of all events that may get sent to a parernt based on fs event
* happening to inodes inside that directory */
#define FS_EVENTS_POSS_ON_CHILD (FS_ACCESS | FS_MODIFY | FS_ATTRIB |\
@@ -61,6 +68,11 @@
FS_MOVED_FROM | FS_MOVED_TO | FS_CREATE |\
FS_DELETE | FS_OPEN_PERM | FS_ACCESS_PERM)

+/* This is a list of all events that may get sent to the root inode based on fs
+ * event happening to inodes on the same super block */
+#define FS_EVENTS_POSS_ON_SB (FS_EVENTS_POSS_ON_CHILD |\
+ FS_DELETE_SELF | FS_MOVE_SELF)
+
#define FS_MOVE (FS_MOVED_FROM | FS_MOVED_TO)

#define ALL_FSNOTIFY_PERM_EVENTS (FS_OPEN_PERM | FS_ACCESS_PERM)
@@ -70,9 +82,10 @@
FS_MOVED_FROM | FS_MOVED_TO | FS_CREATE | \
FS_DELETE | FS_DELETE_SELF | FS_MOVE_SELF | \
FS_UNMOUNT | FS_Q_OVERFLOW | FS_IN_IGNORED | \
- FS_OPEN_PERM | FS_ACCESS_PERM | FS_EXCL_UNLINK | \
- FS_ISDIR | FS_IN_ONESHOT | FS_DN_RENAME | \
- FS_DN_MULTISHOT | FS_EVENT_ON_CHILD)
+ FS_OPEN_PERM | FS_ACCESS_PERM | \
+ FS_EXCL_UNLINK | FS_ISDIR | FS_IN_ONESHOT | \
+ FS_DN_RENAME | FS_DN_MULTISHOT | \
+ FS_EVENT_ON_CHILD | FS_EVENT_ON_SB)

struct fsnotify_group;
struct fsnotify_event;
--
2.7.4

2016-12-21 10:30:49

by Marko Rauhamaa

[permalink] [raw]
Subject: Re: [RFC][PATCH 0/2] fsnotify: super block watch

Amir Goldstein <[email protected]>:

> On the Oct 10 posting you asked me about the use case and it was hard
> to explain the use case with only part of the work done.
>
> The issue, which this work sets to solve, is the poor scalability of
> recursive inotify watches.

On my [employer's] part, the fanotify API suffers from leakage through
namespaces.

If you need to monitor files in a filesystem, the current fanotify
FAN_MARK_MOUNT flag does not give you a notification if you execute a
command like

unshare -m touch xyz

inside the file system.

Amir's patch addresses this increasingly critical issue.


Marko