2021-10-25 20:49:57

by Ioannis Angelakopoulos

Subject: [RFC PATCH 0/7] Inotify support in FUSE and virtiofs

Hello,

I am a PhD student currently interning at Red Hat and working on the
virtiofs file system. I am trying to add support for the Inotify
notification subsystem in virtiofs. I seek your feedback and
suggestions on what is the right direction to take.

Currently, virtiofs does not support the Inotify API and there are
applications which look for the Inotify support in virtiofs (e.g., Kata
containers).

However, all the event notification subsystems (Dnotify/Inotify/Fanotify)
are supported only by local kernel file systems.

--Proposed solution

With this RFC patch we add the inotify support to the FUSE kernel module
so that the remote virtiofs file system (based on FUSE) used by a QEMU
guest VM can make use of this feature.

Specifically, we enhance FUSE to add/modify/delete watches on the FUSE
server and also receive remote inotify events. To achieve this we modify
the fsnotify subsystem so that it calls specific hooks in FUSE when a
remote watch is added/modified/deleted and FUSE calls into fsnotify when
a remote event is received to send the event to user space.

In our case the FUSE server is virtiofsd.
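
To make this concrete, the watch-update side of the plumbing can be
sketched roughly as follows. The "fsnotify_update" inode operation and
its signature are the ones added by patch 3/7 (visible as context in
the fs.h hunk of patch 6/7 below); the surrounding helper is a
simplified assumption, not the literal patch code:

    /*
     * Sketch: when fsnotify adds/modifies/deletes a mark on an inode,
     * give the filesystem a chance to mirror the watch on its server.
     * Local-only filesystems simply do not implement the operation.
     */
    static int fsnotify_update_remote_mark(struct inode *inode,
                                           uint32_t action, uint64_t group,
                                           uint32_t mask)
    {
        if (inode->i_op->fsnotify_update)
            return inode->i_op->fsnotify_update(inode, action, group, mask);
        return 0;
    }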

We also considered an out of band approach for implementing the remote
notifications (e.g., FAM, Gamin), however this approach would break
applications that are already compatible with inotify, and thus would
require an update.

These kernel patches depend on the patch series posted by Vivek Goyal:
https://lore.kernel.org/linux-fsdevel/[email protected]/

My PoC Linux kernel patches are here:
https://github.com/iangelak/linux/commits/inotify_v1

My PoC virtiofsd corresponding patches are here:
https://github.com/iangelak/qemu/commits/inotify_v1

--Advantages

1) Our approach is compatible with existing applications that rely on
Inotify, thus improving portability.

2) Everything is implemented in one place (virtiofs and virtiofsd) and
there is no need to run additional processes (daemons) specifically to
handle the remote notifications.

--Weaknesses

1) Both a local (QEMU guest) and a remote (Host/Virtiofsd) watch on the
target inode have to be active at the same time. The local watch
guarantees that events are going to be sent to the guest user space while
the remote watch captures events occurring on the host (and will be sent
to the guest).

2) As a result, when an event occurs on an inode within the directory
exported by virtiofs, two events will be generated at the same time; a
local event (generated by the guest kernel) and a remote event (generated
by the host), thus the guest will receive duplicate events.

To account for this issue we implemented two modes; one where local events
function as expected (when virtiofsd does not support the remote
inotify) and one where the local events are suppressed and only the
remote events originating from the host side are let through (when
virtiofsd supports the remote inotify). A sketch of how this mode
selection could work follows the list below.

3) The lifetime of the local watch in the guest kernel is very
important. Specifically, there is a possibility that the guest does not
receive remote events on time, if it removes its local watch on the
target or deletes the inode (and thus the guest kernel removes the watch).
In these cases the guest kernel removes the local watch before the
remote events arrive from the host (virtiofsd) and as such the guest
kernel drops all the remote events for the target inode (since the
corresponding local watch does not exist anymore). On top of that,
virtiofsd keeps an open proc file descriptor for each inode that is not
immediately closed on an inode deletion request by the guest. As a result
no IN_DELETE_SELF is generated by virtiofsd and sent to the guest kernel
in this case.

4) Because virtiofsd implements additional operations during the
servicing of a request from the guest, additional inotify events might
be generated and sent to the guest other than the ones the guest
expects. However, this is not technically a limitation and it is dependent
on the implementation of the remote file system server (in this case
virtiofsd).

5) The current implementation supports only Inotify (due to its
simplicity) and not Fanotify. Fanotify's complexity requires support from
virtiofsd that is not currently available. One such example is
Fsnotify's access permission decision capabilities, which could
conflict with virtiofsd's current access permission implementation.
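
Regarding the two modes described in (2), the selection could plausibly
happen while the connection is set up. The "no_fsnotify" connection
flag is the one tested by patch 6/7 below; the FUSE_FSNOTIFY init flag
name and the negotiation fragment are assumptions, not the literal
patch code:

    /* hypothetical fragment of FUSE_INIT reply processing */
    if (flags & FUSE_FSNOTIFY)
        fc->no_fsnotify = 0;  /* remote inotify available: suppress local events */
    else
        fc->no_fsnotify = 1;  /* no remote support: keep local events */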

Ioannis Angelakopoulos (7):
FUSE: Add the fsnotify opcode and in/out structs to FUSE
FUSE: Add the remote inotify support capability to FUSE
FUSE,Inotify,Fsnotify,VFS: Add the fuse_fsnotify_update_mark inode
operation
FUSE: Add the fuse_fsnotify_send_request to FUSE
Fsnotify: Add a wrapper around the fsnotify function
FUSE,Fsnotify: Add the fuse_fsnotify_event inode operation
virtiofs: Add support for handling the remote fsnotify notifications

fs/fuse/dev.c | 37 ++++++++++++++++++
fs/fuse/dir.c | 56 ++++++++++++++++++++++++++++
fs/fuse/fuse_i.h | 10 +++++
fs/fuse/inode.c | 6 +++
fs/fuse/virtio_fs.c | 64 +++++++++++++++++++++++++++++++-
fs/notify/fsnotify.c | 35 ++++++++++++++++-
fs/notify/inotify/inotify_user.c | 11 ++++++
fs/notify/mark.c | 10 +++++
include/linux/fs.h | 5 +++
include/linux/fsnotify_backend.h | 14 ++++++-
include/uapi/linux/fuse.h | 23 +++++++++++-
11 files changed, 265 insertions(+), 6 deletions(-)

--
2.33.0


2021-10-25 20:51:38

by Ioannis Angelakopoulos

Subject: [RFC PATCH 5/7] Fsnotify: Add a wrapper around the fsnotify function

Generally, inotify events are generated locally by calling the "fsnotify"
function in fs/notify/fsnotify.c and various helper functions. However, now
we expect events to arrive from the FUSE server. Thus, without any
intervention a user space application will receive two events. One event is
generated locally and one arrives from the server.

Hence, to avoid duplicate events we need to "suppress" the local events
generated by the guest kernel for FUSE inodes. To achieve this we add a
wrapper around the "fsnotify" function in fs/notify/fsnotify.c that
checks if the remote inotify is enabled and, based on the check, either
"suppresses" or lets through a local event.

The wrapper will be called in the place of the original "fsnotify" call
that is responsible for the event notification (now renamed as
"__fsnotify").

When the remote inotify is not enabled, all local events will be let
through as expected. This process is completely transparent to user space.

Signed-off-by: Ioannis Angelakopoulos <[email protected]>
---
fs/notify/fsnotify.c | 35 ++++++++++++++++++++++++++++++--
include/linux/fsnotify_backend.h | 14 ++++++++++++-
2 files changed, 46 insertions(+), 3 deletions(-)

diff --git a/fs/notify/fsnotify.c b/fs/notify/fsnotify.c
index 963e6ce75b96..848a824c29c4 100644
--- a/fs/notify/fsnotify.c
+++ b/fs/notify/fsnotify.c
@@ -440,7 +440,7 @@ static void fsnotify_iter_next(struct fsnotify_iter_info *iter_info)
 }
 
 /*
- * fsnotify - This is the main call to fsnotify.
+ * __fsnotify - This is the main call to fsnotify.
  *
  * The VFS calls into hook specific functions in linux/fsnotify.h.
  * Those functions then in turn call here. Here will call out to all of the
@@ -459,7 +459,7 @@ static void fsnotify_iter_next(struct fsnotify_iter_info *iter_info)
  *		if both are non-NULL event may be reported to both.
  * @cookie:	inotify rename cookie
  */
-int fsnotify(__u32 mask, const void *data, int data_type, struct inode *dir,
+int __fsnotify(__u32 mask, const void *data, int data_type, struct inode *dir,
 	     const struct qstr *file_name, struct inode *inode, u32 cookie)
 {
 	const struct path *path = fsnotify_data_path(data, data_type);
@@ -552,6 +552,37 @@ int fsnotify(__u32 mask, const void *data, int data_type, struct inode *dir,
 
 	return ret;
 }
+
+/*
+ * Wrapper around fsnotify. The main functionality is to filter local events in
+ * case the inode belongs to a filesystem that supports remote events
+ */
+int fsnotify(__u32 mask, const void *data, int data_type, struct inode *dir,
+	     const struct qstr *file_name, struct inode *inode, u32 cookie)
+{
+
+	if (inode != NULL || dir != NULL) {
+		/*
+		 * Check if the fsnotify_event operation is available which
+		 * will let the remote inotify events go through and suppress
+		 * the local events
+		 */
+		if (inode && inode->i_op->fsnotify_event) {
+			return inode->i_op->fsnotify_event(mask, data,
+							   data_type, dir,
+							   file_name, inode,
+							   cookie);
+		}
+		if (dir && dir->i_op->fsnotify_event) {
+			return dir->i_op->fsnotify_event(mask, data,
+							 data_type, dir,
+							 file_name, inode,
+							 cookie);
+		}
+	}
+
+	return __fsnotify(mask, data, data_type, dir, file_name, inode, cookie);
+}
 EXPORT_SYMBOL_GPL(fsnotify);

static __init int fsnotify_init(void)
diff --git a/include/linux/fsnotify_backend.h b/include/linux/fsnotify_backend.h
index 1ce66748a2d2..8cfbd685f1c0 100644
--- a/include/linux/fsnotify_backend.h
+++ b/include/linux/fsnotify_backend.h
@@ -408,10 +408,15 @@ struct fsnotify_mark {
 
 /* called from the vfs helpers */
 
-/* main fsnotify call to send events */
+/* fsnotify call wrapper */
 extern int fsnotify(__u32 mask, const void *data, int data_type,
 		    struct inode *dir, const struct qstr *name,
 		    struct inode *inode, u32 cookie);
+
+/* main fsnotify call to send events */
+extern int __fsnotify(__u32 mask, const void *data, int data_type,
+		      struct inode *dir, const struct qstr *name,
+		      struct inode *inode, u32 cookie);
 extern int __fsnotify_parent(struct dentry *dentry, __u32 mask, const void *data,
 			     int data_type);
 extern void __fsnotify_inode_delete(struct inode *inode);
@@ -595,6 +600,13 @@ static inline int fsnotify(__u32 mask, const void *data, int data_type,
 	return 0;
 }
 
+static inline int __fsnotify(__u32 mask, const void *data, int data_type,
+			     struct inode *dir, const struct qstr *name,
+			     struct inode *inode, u32 cookie)
+{
+	return 0;
+}
+
 static inline int __fsnotify_parent(struct dentry *dentry, __u32 mask,
 				    const void *data, int data_type)
 {
--
2.33.0

2021-10-25 20:51:46

by Ioannis Angelakopoulos

Subject: [RFC PATCH 6/7] FUSE,Fsnotify: Add the fuse_fsnotify_event inode operation

To avoid duplicate events we need to "suppress" the local events generated
by the guest kernel for FUSE inodes. To achieve this we introduce a new
inode operation "fuse_fsnotify_event". Any VFS operation on a FUSE inode
that calls the fsnotify subsystem will call this function.

Specifically, the new inode operation "fuse_fsnotify_event" is called by
the "fsnotify" wrapper function in fsnotify.c, if the inode is a FUSE
inode. In turn "fuse_fsnotify_event" will check if the remote inotify is
enabled and, if so, the event will be dropped, since the local inotify
events should be "suppressed". If the remote inotify is not enabled, the
event will go through as expected.

In the case where the remote inotify is enabled, FUSE will directly call
"__fsnotify" to send the remote events to user space and not the "fsnotify"
wrapper.

Signed-off-by: Ioannis Angelakopoulos <[email protected]>
---
fs/fuse/dir.c | 27 +++++++++++++++++++++++++++
include/linux/fs.h | 3 +++
2 files changed, 30 insertions(+)

diff --git a/fs/fuse/dir.c b/fs/fuse/dir.c
index f666aafc8d3f..d36f85bd4dda 100644
--- a/fs/fuse/dir.c
+++ b/fs/fuse/dir.c
@@ -1831,6 +1831,30 @@ static int fuse_fsnotify_update_mark(struct inode *inode, uint32_t action,
 	return fuse_fsnotify_send_request(inode, mask, action, group);
 }
 
+static int fuse_fsnotify_event(__u32 mask, const void *data, int data_type,
+			       struct inode *dir, const struct qstr *file_name,
+			       struct inode *inode, u32 cookie)
+{
+	struct fuse_mount *fm = NULL;
+
+	if (inode != NULL)
+		fm = get_fuse_mount(inode);
+	else
+		fm = get_fuse_mount(dir);
+
+	/* Remote inotify supported. Do nothing */
+	if (!(fm->fc->no_fsnotify)) {
+		return 0;
+	} else {
+		/*
+		 * Remote inotify not supported. Call the __fsnotify function
+		 * directly
+		 */
+		return __fsnotify(mask, data, data_type, dir, file_name,
+				  inode, cookie);
+	}
+}
+
 static const struct inode_operations fuse_dir_inode_operations = {
 	.lookup		= fuse_lookup,
 	.mkdir		= fuse_mkdir,
@@ -1851,6 +1875,7 @@ static const struct inode_operations fuse_dir_inode_operations = {
 	.fileattr_get	= fuse_fileattr_get,
 	.fileattr_set	= fuse_fileattr_set,
 	.fsnotify_update = fuse_fsnotify_update_mark,
+	.fsnotify_event	= fuse_fsnotify_event,
 };
 
 static const struct file_operations fuse_dir_operations = {
@@ -1874,6 +1899,7 @@ static const struct inode_operations fuse_common_inode_operations = {
 	.fileattr_get	= fuse_fileattr_get,
 	.fileattr_set	= fuse_fileattr_set,
 	.fsnotify_update = fuse_fsnotify_update_mark,
+	.fsnotify_event	= fuse_fsnotify_event,
 };
 
 static const struct inode_operations fuse_symlink_inode_operations = {
@@ -1882,6 +1908,7 @@ static const struct inode_operations fuse_symlink_inode_operations = {
 	.getattr	= fuse_getattr,
 	.listxattr	= fuse_listxattr,
 	.fsnotify_update = fuse_fsnotify_update_mark,
+	.fsnotify_event	= fuse_fsnotify_event,
 };
 
 void fuse_init_common(struct inode *inode)
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 86bcc44e3ab8..ed6b62e2131a 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -2151,6 +2151,9 @@ struct inode_operations {
 	int (*fileattr_get)(struct dentry *dentry, struct fileattr *fa);
 	int (*fsnotify_update)(struct inode *inode, uint32_t action,
 			       uint64_t group, uint32_t mask);
+	int (*fsnotify_event)(__u32 mask, const void *data, int data_type,
+			      struct inode *dir, const struct qstr *file_name,
+			      struct inode *inode, u32 cookie);
 } ____cacheline_aligned;

static inline ssize_t call_read_iter(struct file *file, struct kiocb *kio,
--
2.33.0

2021-10-25 20:51:58

by Ioannis Angelakopoulos

Subject: [RFC PATCH 7/7] virtiofs: Add support for handling the remote fsnotify notifications

FUSE and specifically virtiofs should be able to handle the asynchronous
event notifications originating from the FUSE server. To this end we add
the FUSE_NOTIFY_FSNOTIFY switch case to the "virtio_fs_handle_notify" in
fs/fuse/virtio_fs.c to handle these specific notifications.

The event notification contains the information that a user space
application would receive when monitoring an inode for events. The
information is the mask of the inode watch, a file name corresponding to
the inode the remote event was generated for and finally, the inotify
cookie.

Then a new event should be generated corresponding to the event
notification received from the FUSE server. Specifically, FUSE in the guest
kernel will call the "__fsnotify" function in fs/notify/fsnotify.c to send
the event to user space.
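
The payload layout implied by the handling code below is roughly the
following; the real definition comes from patch 1/7, so the exact field
order here is a guess based on how the fields are consumed:

    struct fuse_notify_fsnotify_out {
        uint64_t inode;    /* nodeid the event refers to */
        uint32_t mask;     /* inotify event mask */
        uint32_t namelen;  /* length of the optional name, 0 if none */
        uint32_t cookie;   /* inotify rename cookie */
        /* "namelen" bytes of file name immediately follow the struct */
    };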

Signed-off-by: Ioannis Angelakopoulos <[email protected]>
---
fs/fuse/virtio_fs.c | 64 +++++++++++++++++++++++++++++++++++++++++++--
1 file changed, 62 insertions(+), 2 deletions(-)

diff --git a/fs/fuse/virtio_fs.c b/fs/fuse/virtio_fs.c
index d3dba9e3a07e..4c48c2812caa 100644
--- a/fs/fuse/virtio_fs.c
+++ b/fs/fuse/virtio_fs.c
@@ -16,6 +16,7 @@
 #include <linux/fs_parser.h>
 #include <linux/highmem.h>
 #include <linux/uio.h>
+#include <linux/fsnotify_backend.h>
 #include "fuse_i.h"
 
 /* Used to help calculate the FUSE connection's max_pages limit for a request's
@@ -655,14 +656,69 @@ static void notify_node_reuse(struct virtio_fs_vq *notify_fsvq,
 	spin_unlock(&notify_fsvq->lock);
 }
 
+static int fsnotify_remote_event(struct inode *inode, uint32_t mask,
+				 struct qstr *filename, uint32_t cookie)
+{
+	return __fsnotify(mask, NULL, 0, NULL,
+			  (const struct qstr *)filename, inode, cookie);
+}
+
+/*
+ * Function to generate a new event when a fsnotify notification comes from the
+ * fuse server
+ */
+static int generate_fsnotify_event(struct fuse_conn *fc,
+				   struct fuse_notify_fsnotify_out *fsnotify_out)
+{
+	struct inode *inode;
+	uint32_t mask, cookie;
+	struct fuse_mount *fm;
+	int ret = -1;
+	struct qstr name;
+
+	down_read(&fc->killsb);
+	inode = fuse_ilookup(fc, fsnotify_out->inode, &fm);
+	/*
+	 * The inode that corresponds to the event does not exist in this case
+	 * so do not generate any new event and just return an error
+	 */
+	if (!inode)
+		goto out;
+
+	mask = fsnotify_out->mask;
+	cookie = fsnotify_out->cookie;
+
+	/*
+	 * If the notification contained the name of the file/dir the event
+	 * occurred for, it will be placed after the fsnotify_out struct in the
+	 * notification message
+	 */
+	if (fsnotify_out->namelen > 0) {
+		name.len = fsnotify_out->namelen;
+		name.name = (char *)fsnotify_out + sizeof(struct fuse_notify_fsnotify_out);
+		ret = fsnotify_remote_event(inode, mask, &name, cookie);
+	} else {
+		ret = fsnotify_remote_event(inode, mask, NULL, cookie);
+	}
+
+out:
+	up_read(&fc->killsb);
+	if (ret < 0)
+		return -EINVAL;
+
+	return ret;
+}
+
 static int virtio_fs_handle_notify(struct virtio_fs *vfs,
-				   struct virtio_fs_notify_node *notifyn)
+				   struct virtio_fs_notify_node *notifyn,
+				   struct fuse_conn *fc)
 {
 	int ret = 0, no_reuse = 0;
 	struct virtio_fs_notify *notify = &notifyn->notify;
 	struct virtio_fs_vq *notify_fsvq = &vfs->vqs[VQ_NOTIFY_IDX];
 	struct fuse_out_header *oh = &notify->out_hdr;
 	struct fuse_notify_lock_out *lo;
+	struct fuse_notify_fsnotify_out *fsnotify_out;
 
 	/*
 	 * For notifications, oh.unique is 0 and oh->error contains code
@@ -673,6 +729,10 @@ static int virtio_fs_handle_notify(struct virtio_fs *vfs,
 		lo = (struct fuse_notify_lock_out *) &notify->outarg;
 		no_reuse = notify_complete_waiting_req(vfs, lo);
 		break;
+	case FUSE_NOTIFY_FSNOTIFY:
+		fsnotify_out = (struct fuse_notify_fsnotify_out *) &notify->outarg;
+		generate_fsnotify_event(fc, fsnotify_out);
+		break;
 	default:
 		pr_err("virtio-fs: Unexpected notification %d\n", oh->error);
 	}
@@ -711,7 +771,7 @@ static void virtio_fs_notify_done_work(struct work_struct *work)
 	WARN_ON(oh->unique);
 	list_del_init(&notifyn->list);
 	/* Handle notification */
-	virtio_fs_handle_notify(vfs, notifyn);
+	virtio_fs_handle_notify(vfs, notifyn, fsvq->fud->fc);
 	spin_lock(&fsvq->lock);
 	dec_in_flight_req(fsvq);
 	spin_unlock(&fsvq->lock);
--
2.33.0

2021-10-26 19:00:34

by Amir Goldstein

Subject: Re: [RFC PATCH 5/7] Fsnotify: Add a wrapper around the fsnotify function

On Mon, Oct 25, 2021 at 11:47 PM Ioannis Angelakopoulos
<[email protected]> wrote:
>
> Generally, inotify events are generated locally by calling the "fsnotify"
> function in fs/notify/fsnotify.c and various helper functions. However, now
> we expect events to arrive from the FUSE server. Thus, without any
> intervention a user space application will receive two events. One event is
> generated locally and one arrives from the server.
>
> Hence, to avoid duplicate events we need to "suppress" the local events
> generated by the guest kernel for FUSE inodes. To achieve this we add a
> wrapper around the "fsnotify" function in fs/notify/fsnotify.c that
> checks if the remote inotify is enabled and, based on the check, either
> "suppresses" or lets through a local event.
>
> The wrapper will be called in the place of the original "fsnotify" call
> that is responsible for the event notification (now renamed as
> "__fsnotify").
>
> When the remote inotify is not enabled, all local events will be let
> through as expected. This process is completely transparent to user space.
>
> Signed-off-by: Ioannis Angelakopoulos <[email protected]>
> ---
> fs/notify/fsnotify.c | 35 ++++++++++++++++++++++++++++++--
> include/linux/fsnotify_backend.h | 14 ++++++++++++-
> 2 files changed, 46 insertions(+), 3 deletions(-)
>
> diff --git a/fs/notify/fsnotify.c b/fs/notify/fsnotify.c
> index 963e6ce75b96..848a824c29c4 100644
> --- a/fs/notify/fsnotify.c
> +++ b/fs/notify/fsnotify.c
> @@ -440,7 +440,7 @@ static void fsnotify_iter_next(struct fsnotify_iter_info *iter_info)
> }
>
> /*
> - * fsnotify - This is the main call to fsnotify.
> + * __fsnotify - This is the main call to fsnotify.
> *
> * The VFS calls into hook specific functions in linux/fsnotify.h.
> * Those functions then in turn call here. Here will call out to all of the
> @@ -459,7 +459,7 @@ static void fsnotify_iter_next(struct fsnotify_iter_info *iter_info)
> * if both are non-NULL event may be reported to both.
> * @cookie: inotify rename cookie
> */
> -int fsnotify(__u32 mask, const void *data, int data_type, struct inode *dir,
> +int __fsnotify(__u32 mask, const void *data, int data_type, struct inode *dir,
> const struct qstr *file_name, struct inode *inode, u32 cookie)
> {
> const struct path *path = fsnotify_data_path(data, data_type);
> @@ -552,6 +552,37 @@ int fsnotify(__u32 mask, const void *data, int data_type, struct inode *dir,
>
> return ret;
> }
> +
> +/*
> + * Wrapper around fsnotify. The main functionality is to filter local events in
> + * case the inode belongs to a filesystem that supports remote events
> + */
> +int fsnotify(__u32 mask, const void *data, int data_type, struct inode *dir,
> + const struct qstr *file_name, struct inode *inode, u32 cookie)
> +{
> +
> + if (inode != NULL || dir != NULL) {
> + /*
> + * Check if the fsnotify_event operation is available which
> + * will let the remote inotify events go through and suppress
> + * the local events
> + */
> + if (inode && inode->i_op->fsnotify_event) {
> + return inode->i_op->fsnotify_event(mask, data,
> + data_type, dir,
> + file_name, inode,
> + cookie);
> + }
> + if (dir && dir->i_op->fsnotify_event) {
> + return dir->i_op->fsnotify_event(mask, data,
> + data_type, dir,
> + file_name, inode,
> + cookie);
> + }
> + }
> +

That's not the way to accomplish what you want to do.

Assuming that we agree to let filesystem silence VFS fsnotify hooks
it should be done using an inode flag, similar to file flag FMODE_NONOTIFY.

But the reason you want to do that seems a bit odd.
Duplicate events are going to be merged with fanotify and I think that
you ruled out fanotify for the wrong reasons
(you should have only ruled out permission events)

Thanks,
Amir.

2021-10-26 20:44:01

by Amir Goldstein

Subject: Re: [RFC PATCH 0/7] Inotify support in FUSE and virtiofs

On Mon, Oct 25, 2021 at 11:47 PM Ioannis Angelakopoulos
<[email protected]> wrote:
>
> Hello,
>
> I am a PhD student currently interning at Red Hat and working on the
> virtiofs file system. I am trying to add support for the Inotify
> notification subsystem in virtiofs. I seek your feedback and
> suggestions on what is the right direction to take.
>

Hi Ioannis!

I am very happy that you have taken on this task.
People have been requesting this functionality in the past [1]
Not specifically for FUSE, but FUSE is a very good place to start.

[1] https://lore.kernel.org/linux-fsdevel/CAH2r5mt1Fy6hR+Rdig0sHsOS8fVQDsKf9HqZjvjORS3R-7=RFw@mail.gmail.com/

> Currently, virtiofs does not support the Inotify API and there are
> applications which look for the Inotify support in virtiofs (e.g., Kata
> containers).
>
> However, all the event notification subsystems (Dnotify/Inotify/Fanotify)
> are supported only by local kernel file systems.
>
> --Proposed solution
>
> With this RFC patch we add the inotify support to the FUSE kernel module
> so that the remote virtiofs file system (based on FUSE) used by a QEMU
> guest VM can make use of this feature.
>
> Specifically, we enhance FUSE to add/modify/delete watches on the FUSE
> server and also receive remote inotify events. To achieve this we modify
> the fsnotify subsystem so that it calls specific hooks in FUSE when a
> remote watch is added/modified/deleted and FUSE calls into fsnotify when
> a remote event is received to send the event to user space.
>
> In our case the FUSE server is virtiofsd.
>
> We also considered an out of band approach for implementing the remote
> notifications (e.g., FAM, Gamin), however this approach would break
> applications that are already compatible with inotify, and thus would
> require an update.
>
> These kernel patches depend on the patch series posted by Vivek Goyal:
> https://lore.kernel.org/linux-fsdevel/[email protected]/

It would be a shame if remote fsnotify was added only for virtiofs and
not as a generic capability of FUSE filesystems.
Is there a way to get rid of this dependency?

>
> My PoC Linux kernel patches are here:
> https://github.com/iangelak/linux/commits/inotify_v1
>
> My PoC virtiofsd corresponding patches are here:
> https://github.com/iangelak/qemu/commits/inotify_v1
>
> --Advantages
>
> 1) Our approach is compatible with existing applications that rely on
> Inotify, thus improving portability.
>
> 2) Everything is implemented in one place (virtiofs and virtiofsd) and
> there is no need to run additional processes (daemons) specifically to
> handle the remote notifications.
>
> --Weaknesses
>
> 1) Both a local (QEMU guest) and a remote (Host/Virtiofsd) watch on the
> target inode have to be active at the same time. The local watch
> guarantees that events are going to be sent to the guest user space while
> the remote watch captures events occurring on the host (and will be sent
> to the guest).
>
> 2) As a result, when an event occurs on an inode within the directory
> exported by virtiofs, two events will be generated at the same time; a
> local event (generated by the guest kernel) and a remote event (generated
> by the host), thus the guest will receive duplicate events.
>
> To account for this issue we implemented two modes; one where local events
> function as expected (when virtiofsd does not support the remote
> inotify) and one where the local events are suppressed and only the
> remote events originating from the host side are let through (when
> virtiofsd supports the remote inotify).

Dropping events from the local side would be weird.
Avoiding duplicate events is not a good enough reason IMO
compared to the problems this could cause.
I am not convinced this is worth it.

>
> 3) The lifetime of the local watch in the guest kernel is very
> important. Specifically, there is a possibility that the guest does not
> receive remote events on time, if it removes its local watch on the
> target or deletes the inode (and thus the guest kernel removes the watch).
> In these cases the guest kernel removes the local watch before the
> remote events arrive from the host (virtiofsd) and as such the guest
> kernel drops all the remote events for the target inode (since the
> corresponding local watch does not exist anymore). On top of that,
> virtiofsd keeps an open proc file descriptor for each inode that is not
> immediately closed on an inode deletion request by the guest. As a result
> no IN_DELETE_SELF is generated by virtiofsd and sent to the guest kernel
> in this case.
>
> 4) Because virtiofsd implements additional operations during the
> servicing of a request from the guest, additional inotify events might
> be generated and sent to the guest other than the ones the guest
> expects. However, this is not technically a limitation and it is dependent
> on the implementation of the remote file system server (in this case
> virtiofsd).
>
> 5) The current implementation supports only Inotify (due to its
> simplicity) and not Fanotify. Fanotify's complexity requires support from
> virtiofsd that is not currently available. One such example is
> Fsnotify's access permission decision capabilities, which could
> conflict with virtiofsd's current access permission implementation.

Good example, bad decision.
It is perfectly fine for a remote server to provide a "supported event mask"
and leave permission events out of the game.

Imagine a remote SMB server, it also does not support all of the events
that the local application would like to set.

That should not be a reason to rule out fanotify, only specific
fanotify events.

Same goes for FAN_MARK_MOUNT and FAN_MARK_FILESYSTEM:
remote server may or may not support anything other than watching
inode objects, but it should not be a limit of the "remote fsnotify" API.
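
For concreteness, the client could simply intersect what the caller asks
for with whatever the server advertised; the field name below is
invented for illustration:

    /* hypothetical check when a watch is added on a remote inode */
    if (mask & ~fc->server_supported_mask)
        return -EINVAL;  /* or: silently drop the unsupported bits */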

Thanks,
Amir.

2021-10-26 21:04:27

by Vivek Goyal

Subject: Re: [RFC PATCH 5/7] Fsnotify: Add a wrapper around the fsnotify function

On Tue, Oct 26, 2021 at 05:37:55PM +0300, Amir Goldstein wrote:
> On Mon, Oct 25, 2021 at 11:47 PM Ioannis Angelakopoulos
> <[email protected]> wrote:
> >
> > Generally, inotify events are generated locally by calling the "fsnotify"
> > function in fs/notify/fsnotify.c and various helper functions. However, now
> > we expect events to arrive from the FUSE server. Thus, without any
> > intervention a user space application will receive two events. One event is
> > generated locally and one arrives from the server.
> >
> > Hence, to avoid duplicate events we need to "suppress" the local events
> > generated by the guest kernel for FUSE inodes. To achieve this we add a
> > wrapper around the "fsnotify" function in fs/notify/fsnotify.c that
> > checks if the remote inotify is enabled and, based on the check, either
> > "suppresses" or lets through a local event.
> >
> > The wrapper will be called in the place of the original "fsnotify" call
> > that is responsible for the event notification (now renamed as
> > "__fsnotify").
> >
> > When the remote inotify is not enabled, all local events will be let
> > through as expected. This process is completely transparent to user space.
> >
> > Signed-off-by: Ioannis Angelakopoulos <[email protected]>
> > ---
> > fs/notify/fsnotify.c | 35 ++++++++++++++++++++++++++++++--
> > include/linux/fsnotify_backend.h | 14 ++++++++++++-
> > 2 files changed, 46 insertions(+), 3 deletions(-)
> >
> > diff --git a/fs/notify/fsnotify.c b/fs/notify/fsnotify.c
> > index 963e6ce75b96..848a824c29c4 100644
> > --- a/fs/notify/fsnotify.c
> > +++ b/fs/notify/fsnotify.c
> > @@ -440,7 +440,7 @@ static void fsnotify_iter_next(struct fsnotify_iter_info *iter_info)
> > }
> >
> > /*
> > - * fsnotify - This is the main call to fsnotify.
> > + * __fsnotify - This is the main call to fsnotify.
> > *
> > * The VFS calls into hook specific functions in linux/fsnotify.h.
> > * Those functions then in turn call here. Here will call out to all of the
> > @@ -459,7 +459,7 @@ static void fsnotify_iter_next(struct fsnotify_iter_info *iter_info)
> > * if both are non-NULL event may be reported to both.
> > * @cookie: inotify rename cookie
> > */
> > -int fsnotify(__u32 mask, const void *data, int data_type, struct inode *dir,
> > +int __fsnotify(__u32 mask, const void *data, int data_type, struct inode *dir,
> > const struct qstr *file_name, struct inode *inode, u32 cookie)
> > {
> > const struct path *path = fsnotify_data_path(data, data_type);
> > @@ -552,6 +552,37 @@ int fsnotify(__u32 mask, const void *data, int data_type, struct inode *dir,
> >
> > return ret;
> > }
> > +
> > +/*
> > + * Wrapper around fsnotify. The main functionality is to filter local events in
> > + * case the inode belongs to a filesystem that supports remote events
> > + */
> > +int fsnotify(__u32 mask, const void *data, int data_type, struct inode *dir,
> > + const struct qstr *file_name, struct inode *inode, u32 cookie)
> > +{
> > +
> > + if (inode != NULL || dir != NULL) {
> > + /*
> > + * Check if the fsnotify_event operation is available which
> > + * will let the remote inotify events go through and suppress
> > + * the local events
> > + */
> > + if (inode && inode->i_op->fsnotify_event) {
> > + return inode->i_op->fsnotify_event(mask, data,
> > + data_type, dir,
> > + file_name, inode,
> > + cookie);
> > + }
> > + if (dir && dir->i_op->fsnotify_event) {
> > + return dir->i_op->fsnotify_event(mask, data,
> > + data_type, dir,
> > + file_name, inode,
> > + cookie);
> > + }
> > + }
> > +
>
> That's not the way to accomplish what you want to do.
>
> Assuming that we agree to let filesystem silence VFS fsnotify hooks
> it should be done using an inode flag, similar to file flag FMODE_NONOTIFY.

Ok, so basically define a new flag, say FMODE_FOO, and check that in
fsnotify() and ignore the event if the flag is set. And the filesystem
can set that flag if remote events are enabled. And then the VFS will
ignore local events. Sounds reasonable.
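
Something along these lines, say (S_NONOTIFY and its use here are
illustrative, not existing kernel code):

    int fsnotify(__u32 mask, const void *data, int data_type,
                 struct inode *dir, const struct qstr *file_name,
                 struct inode *inode, u32 cookie)
    {
        /* the filesystem marked these inodes as "remote events only" */
        if ((inode && (inode->i_flags & S_NONOTIFY)) ||
            (dir && (dir->i_flags & S_NONOTIFY)))
            return 0;

        return __fsnotify(mask, data, data_type, dir, file_name,
                          inode, cookie);
    }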

>
> But the reason you want to do that seems a bit odd.

I gave this idea to Ioannis. But defining an inode flag probably is
much cheaper compared to an inode operation.

> Duplicate events are going to be merged with fanotify and I think that
> you ruled out fanotify for the wrong reasons
> (you should have only ruled out permission events)

Ioannis was looking at fanotify and wondering if it could be supported
as well. fanotify seemed to be much more powerful compared to inotify,
and some of the things looked infeasible to support for remote
filesystems.

But if it is acceptable to support only limited functionality/events, then
it probably is a good idea to keep fanotify in mind so that somebody can
extend this work for fanotify as well.

Ideally it would be nice to support fanotify as well (as much as
possible). It just seemed more complicated, so we thought we would start
with a simpler API (inotify) and try to implement that first.

I don't understand "Duplicate events are going to be merged with
fanotify". So fanotiy has something to figure out there are duplicate
events and merge these and user space never sees duplicate events.

Vivek

2021-10-26 21:14:27

by Vivek Goyal

Subject: Re: [RFC PATCH 0/7] Inotify support in FUSE and virtiofs

On Tue, Oct 26, 2021 at 06:23:50PM +0300, Amir Goldstein wrote:
> On Mon, Oct 25, 2021 at 11:47 PM Ioannis Angelakopoulos
> <[email protected]> wrote:
> >
> > Hello,
> >
> > I am a PhD student currently interning at Red Hat and working on the
> > virtiofs file system. I am trying to add support for the Inotify
> > notification subsystem in virtiofs. I seek your feedback and
> > suggestions on what is the right direction to take.
> >
>
> Hi Ioannis!
>
> I am very happy that you have taken on this task.
> People have been requesting this functionality in the past [1]
> Not specifically for FUSE, but FUSE is a very good place to start.
>
> [1] https://lore.kernel.org/linux-fsdevel/CAH2r5mt1Fy6hR+Rdig0sHsOS8fVQDsKf9HqZjvjORS3R-7=RFw@mail.gmail.com/

Hi Amir,

Aha, good to know that other remote filesystems are looking for similar
functionality. So there can be a common design so that other remote
filesystems can support remote inotify/fanotify events too.

>
> > Currently, virtiofs does not support the Inotify API and there are
> > applications which look for the Inotify support in virtiofs (e.g., Kata
> > containers).
> >
> > However, all the event notification subsystems (Dnotify/Inotify/Fanotify)
> > are supported only by local kernel file systems.
> >
> > --Proposed solution
> >
> > With this RFC patch we add the inotify support to the FUSE kernel module
> > so that the remote virtiofs file system (based on FUSE) used by a QEMU
> > guest VM can make use of this feature.
> >
> > Specifically, we enhance FUSE to add/modify/delete watches on the FUSE
> > server and also receive remote inotify events. To achieve this we modify
> > the fsnotify subsystem so that it calls specific hooks in FUSE when a
> > remote watch is added/modified/deleted and FUSE calls into fsnotify when
> > a remote event is received to send the event to user space.
> >
> > In our case the FUSE server is virtiofsd.
> >
> > We also considered an out of band approach for implementing the remote
> > notifications (e.g., FAM, Gamin), however this approach would break
> > applications that are already compatible with inotify, and thus would
> > require an update.
> >
> > These kernel patches depend on the patch series posted by Vivek Goyal:
> > https://lore.kernel.org/linux-fsdevel/[email protected]/
>
> It would be a shame if remote fsnotify was added only for virtiofs and
> not as a generic capability of FUSE filesystems.
> Is there a way to get rid of this dependency?

Agreed. I think currently he has just added support for virtiofs. But
it should be possible to extend it to other fuse filesystems as well.
All they need to do is send notifications, and regular FUSE already
has support for allowing the server to send notifications to the FUSE
client kernel.
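
For instance, for a regular (non-virtio) FUSE filesystem the same
notification would arrive through /dev/fuse and be dispatched from
fuse_notify() in fs/fuse/dev.c, the same way existing codes such as
FUSE_NOTIFY_INVAL_INODE are handled; only the extra case would be
needed (the helper name here is an assumption):

    case FUSE_NOTIFY_FSNOTIFY:
        return fuse_notify_fsnotify(fc, size, cs);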

>
> >
> > My PoC Linux kernel patches are here:
> > https://github.com/iangelak/linux/commits/inotify_v1
> >
> > My PoC virtiofsd corresponding patches are here:
> > https://github.com/iangelak/qemu/commits/inotify_v1
> >
> > --Advantages
> >
> > 1) Our approach is compatible with existing applications that rely on
> > Inotify, thus improving portability.
> >
> > 2) Everything is implemented in one place (virtiofs and virtiofsd) and
> > there is no need to run additional processes (daemons) specifically to
> > handle the remote notifications.
> >
> > --Weaknesses
> >
> > 1) Both a local (QEMU guest) and a remote (Host/Virtiofsd) watch on the
> > target inode have to be active at the same time. The local watch
> > guarantees that events are going to be sent to the guest user space while
> > the remote watch captures events occurring on the host (and will be sent
> > to the guest).
> >
> > 2) As a result, when an event occurs on an inode within the directory
> > exported by virtiofs, two events will be generated at the same time; a
> > local event (generated by the guest kernel) and a remote event (generated
> > by the host), thus the guest will receive duplicate events.
> >
> > To account for this issue we implemented two modes; one where local events
> > function as expected (when virtiofsd does not support the remote
> > inotify) and one where the local events are suppressed and only the
> > remote events originating from the host side are let through (when
> > virtiofsd supports the remote inotify).
>
> Dropping events from the local side would be weird.
> Avoiding duplicate events is not a good enough reason IMO
> compared to the problems this could cause.
> I am not convinced this is worth it.

So what should be done? If we don't drop local events, then application
will see both local and remote events. And then we will have to build
this knowledge in API that an event can be either local or remote.
Application should be able to distinguish between these two and act
accordingly. That sounds like a lot. And I am not even sure if application
cares about local events in case of a remote shared filesystem.

I have no experience with inotify/fanotify/fsnotify and what people
expect from inotify/fanotify API. So we are open to all the ideas
w.r.t what will be a good design to support this thing on remote
filesystems. Currently whole infrastructure seems to be written with
local filesystems in mind.

>
> >
> > 3) The lifetime of the local watch in the guest kernel is very
> > important. Specifically, there is a possibility that the guest does not
> > receive remote events on time, if it removes its local watch on the
> > target or deletes the inode (and thus the guest kernel removes the watch).
> > In these cases the guest kernel removes the local watch before the
> > remote events arrive from the host (virtiofsd) and as such the guest
> > kernel drops all the remote events for the target inode (since the
> > corresponding local watch does not exist anymore). On top of that,
> > virtiofsd keeps an open proc file descriptor for each inode that is not
> > immediately closed on an inode deletion request by the guest. As a result
> > no IN_DELETE_SELF is generated by virtiofsd and sent to the guest kernel
> > in this case.
> >
> > 4) Because virtiofsd implements additional operations during the
> > servicing of a request from the guest, additional inotify events might
> > be generated and sent to the guest other than the ones the guest
> > expects. However, this is not technically a limitation and it is dependent
> > on the implementation of the remote file system server (in this case
> > virtiofsd).
> >
> > 5) The current implementation supports only Inotify (due to its
> > simplicity) and not Fanotify. Fanotify's complexity requires support from
> > virtiofsd that is not currently available. One such example is
> > Fsnotify's access permission decision capabilities, which could
> > conflict with virtiofsd's current access permission implementation.
>
> Good example, bad decision.
> It is perfectly fine for a remote server to provide a "supported event mask"
> and leave permission events out of the game.
>
> Imagine a remote SMB server, it also does not support all of the events
> that the local application would like to set.
>
> That should not be a reason to rule out fanotify, only specific
> fanotify events.
>
> Same goes for FAN_MARK_MOUNT and FAN_MARK_FILESYSTEM:
> remote server may or may not support anything other than watching
> inode objects, but it should not be a limit of the "remote fsnotify" API.

Agreed. If limited fanotify functionality is acceptable, then it should
be written in such a way that one can easily extend it to add that
limited fanotify support.

Thanks
Vivek

2021-10-26 23:04:24

by Vivek Goyal

Subject: Re: [RFC PATCH 0/7] Inotify support in FUSE and virtiofs

On Tue, Oct 26, 2021 at 06:23:50PM +0300, Amir Goldstein wrote:

[..]
> > 3) The lifetime of the local watch in the guest kernel is very
> > important. Specifically, there is a possibility that the guest does not
> > receive remote events on time, if it removes its local watch on the
> > target or deletes the inode (and thus the guest kernel removes the watch).
> > In these cases the guest kernel removes the local watch before the
> > remote events arrive from the host (virtiofsd) and as such the guest
> > kernel drops all the remote events for the target inode (since the
> > corresponding local watch does not exist anymore).

So this is one of the issues which has been haunting us in virtiofs. If
a file is removed, for local events, event is generated first and
then watch is removed. But in case of remote filesystems, it is racy.
It is possible that by the time event arrives, watch is already gone
and application never sees the delete event.

Not sure how to address this issue.

Vivek

2021-10-27 04:38:38

by Amir Goldstein

Subject: Re: [RFC PATCH 0/7] Inotify support in FUSE and virtiofs

On Tue, Oct 26, 2021 at 7:18 PM Vivek Goyal <[email protected]> wrote:
>
> On Tue, Oct 26, 2021 at 06:23:50PM +0300, Amir Goldstein wrote:
>
> [..]
> > > 3) The lifetime of the local watch in the guest kernel is very
> > > important. Specifically, there is a possibility that the guest does not
> > > receive remote events on time, if it removes its local watch on the
> > > target or deletes the inode (and thus the guest kernel removes the watch).
> > > In these cases the guest kernel removes the local watch before the
> > > remote events arrive from the host (virtiofsd) and as such the guest
> > > kernel drops all the remote events for the target inode (since the
> > > corresponding local watch does not exist anymore).
>
> So this is one of the issues which has been haunting us in virtiofs. If
> a file is removed, for local events, event is generated first and
> then watch is removed. But in case of remote filesystems, it is racy.
> It is possible that by the time event arrives, watch is already gone
> and application never sees the delete event.
>
> Not sure how to address this issue.

Can you take me through the scenario step by step.
I am not sure I understand the exact sequence of the race.
If it is local file removal that causes watch to be removed,
then don't drop local events and you are good to go.
Is it something else?

Thanks,
Amir.

2021-10-27 05:43:01

by Amir Goldstein

Subject: Re: [RFC PATCH 0/7] Inotify support in FUSE and virtiofs

> > > 2) As a result, when an event occurs on an inode within the directory
> > > exported by virtiofs, two events will be generated at the same time; a
> > > local event (generated by the guest kernel) and a remote event (generated
> > > by the host), thus the guest will receive duplicate events.
> > >
> > > To account for this issue we implemented two modes; one where local events
> > > function as expected (when virtiofsd does not support the remote
> > > inotify) and one where the local events are suppressed and only the
> > > remote events originating from the host side are let through (when
> > > virtiofsd supports the remote inotify).
> >
> > Dropping events from the local side would be weird.
> > Avoiding duplicate events is not a good enough reason IMO
> > compared to the problems this could cause.
> > I am not convinced this is worth it.
>
> So what should be done? If we don't drop local events, then application
> will see both local and remote events. And then we will have to build
> this knowledge in API that an event can be either local or remote.
> Application should be able to distinguish between these two and act
> accordingly. That sounds like a lot. And I am not even sure if application
> cares about local events in case of a remote shared filesystem.
>

I am not completely ruling out the S_NONOTIFY inode flag, but I am not
yet convinced that it is needed.
So what if applications get duplicate events? What's the worst thing that
could happen?
And once again, merging most of the duplicate events is pretty easy
and fanotify does that already.
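
To illustrate the idea (a stripped-down sketch, not the actual
fs/notify/fanotify code):

    struct ev { unsigned long obj; unsigned int type; __u32 mask; };

    /*
     * If the new event matches one already queued for the same object,
     * fold its mask into the queued event instead of queueing a
     * duplicate.
     */
    static bool try_merge(struct ev *queued, struct ev *new)
    {
        if (queued->obj != new->obj || queued->type != new->type)
            return false;
        queued->mask |= new->mask;
        return true;
    }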

> I have no experience with inotify/fanotify/fsnotify and what people
> expect from inotify/fanotify API. So we are open to all the ideas
> w.r.t what will be a good design to support this thing on remote
> filesystems. Currently whole infrastructure seems to be written with
> local filesystems in mind.
>

I have no control over what users expect from inotify/fanotify
but I do know that some users' expectations are misaligned with
how inotify/fanotify actually works.

For example, some users may expect events not to be reordered,
but there is no such guarantee.

Some inotify users would expect to find a file with the filename
in the event without realizing that this is a snapshot of a past
filename, that may now not exist or be a completely different object.

Those are some misconceptions that we tried to address with
the fanotify FAN_REPORT_DFID_NAME APIs and hopefully we also
documented the expectations better in the man page.

The use of NFS file handles to identify objects in the FAN_REPORT_FID
modes enables getting events also for already dead objects
(without keeping the inode alive like inotify).

I didn't want to complicate things for Ioannis at this early stage, but I think
that FUSE fsnotify should be tied to LOOKUP_HANDLE.
All FUSE remote objects should be described by file handles,
which is the same manner in which network protocols work,
and file handles should be used to describe objects in
fsnotify FUSE events.

But I am getting ahead of myself with a lot of hand waving....

Thanks,
Amir.

2021-10-27 07:07:17

by Vivek Goyal

Subject: Re: [RFC PATCH 0/7] Inotify support in FUSE and virtiofs

On Tue, Oct 26, 2021 at 08:59:44PM +0300, Amir Goldstein wrote:
> On Tue, Oct 26, 2021 at 7:18 PM Vivek Goyal <[email protected]> wrote:
> >
> > On Tue, Oct 26, 2021 at 06:23:50PM +0300, Amir Goldstein wrote:
> >
> > [..]
> > > > 3) The lifetime of the local watch in the guest kernel is very
> > > > important. Specifically, there is a possibility that the guest does not
> > > > receive remote events on time, if it removes its local watch on the
> > > > target or deletes the inode (and thus the guest kernel removes the watch).
> > > > In these cases the guest kernel removes the local watch before the
> > > > remote events arrive from the host (virtiofsd) and as such the guest
> > > > kernel drops all the remote events for the target inode (since the
> > > > corresponding local watch does not exist anymore).
> >
> > So this is one of the issues which has been haunting us in virtiofs. If
> > a file is removed, for local events, event is generated first and
> > then watch is removed. But in case of remote filesystems, it is racy.
> > It is possible that by the time event arrives, watch is already gone
> > and application never sees the delete event.
> >
> > Not sure how to address this issue.
>

> Can you take me through the scenario step by step.
> I am not sure I understand the exact sequence of the race.

Ioannis, please correct me If I get something wrong. You know exact
details much more than me.

A. Say a guest process unlinks a file.
B. Fuse sends an unlink request to server (virtiofsd)
C. File is unlinked on host. Assume there are no other users so inode
will be freed as well. And event will be generated on host and watch
removed.
D. Now Fuse server will send an unlink request reply. The unlink notification
might still be in kernel buffers or still be in virtiofsd or could
be in virtiofs virtqueue.
E. Fuse client will receive unlink reply and remove local watch.

Fuse reply and notification event are now traveling in parallel on
different virtqueues and there is no connection between these two. And
it could very well happen that fuse reply comes first, gets processed
first and local watch is removed. And notification is processed right
after but by then local watch is gone and filesystem will be forced to
drop event.

As of now the situation is more complicated in virtiofsd. We don't keep
a file handle for the file; instead we keep an O_PATH fd open for each
file. That means in step D above, the inode on the host is not freed yet
and the unlink event is not generated yet. When the unlink reply reaches
the fuse client, it sends FORGET messages to the server, and then the
server closes the O_PATH fd and then the host generates the unlink
events. By that time it's too late: the guest has already removed its
local watches (and triggered removal of the remote watches too).

This second problem probably can be solved by using file handles, but
basic race will still continue to be there.

> If it is local file removal that causes watch to be removed,
> then don't drop local events and you are good to go.
> Is it something else?

- If remote events are enabled, then the idea will be that user space
gets an event when the file is actually removed from the server, right?
Now it is possible that another VM has this file open and the file has
not yet been removed. So the local event only tells you that the file
has been removed in the guest VM (or locally) but does not tell anything
about the state of the file on the server. (It has been unlinked on the
server but the inode continues to be alive internally.)

- If user receives both local and remote delete events, it will be
confusing. I guess if we want to see both the events, then there
has to be some sort of info in event which classifies whether event
is local or remote. And let application act accordingly.
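
Purely as illustration, that info could be a new bit in the event mask;
the name and value below are invented, not part of the inotify API:

    #define IN_REMOTE 0x08000000  /* hypothetical "event came from server" bit */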

Thanks
Vivek

2021-10-27 07:20:46

by Amir Goldstein

Subject: Re: [RFC PATCH 0/7] Inotify support in FUSE and virtiofs

On Tue, Oct 26, 2021 at 9:27 PM Vivek Goyal <[email protected]> wrote:
>
> On Tue, Oct 26, 2021 at 08:59:44PM +0300, Amir Goldstein wrote:
> > On Tue, Oct 26, 2021 at 7:18 PM Vivek Goyal <[email protected]> wrote:
> > >
> > > On Tue, Oct 26, 2021 at 06:23:50PM +0300, Amir Goldstein wrote:
> > >
> > > [..]
> > > > > 3) The lifetime of the local watch in the guest kernel is very
> > > > > important. Specifically, there is a possibility that the guest does not
> > > > > receive remote events on time, if it removes its local watch on the
> > > > > target or deletes the inode (and thus the guest kernel removes the watch).
> > > > > In these cases the guest kernel removes the local watch before the
> > > > > remote events arrive from the host (virtiofsd) and as such the guest
> > > > > kernel drops all the remote events for the target inode (since the
> > > > > corresponding local watch does not exist anymore).
> > >
> > > So this is one of the issues which has been haunting us in virtiofs. If
> > > a file is removed, for local events, event is generated first and
> > > then watch is removed. But in case of remote filesystems, it is racy.
> > > It is possible that by the time event arrives, watch is already gone
> > > and application never sees the delete event.
> > >
> > > Not sure how to address this issue.
> >
>
> > Can you take me through the scenario step by step.
> > I am not sure I understand the exact sequence of the race.
>
> Ioannis, please correct me If I get something wrong. You know exact
> details much more than me.
>
> A. Say a guest process unlinks a file.
> B. Fuse sends an unlink request to server (virtiofsd)
> C. File is unlinked on host. Assume there are no other users so inode
> will be freed as well. And event will be generated on host and watch
> removed.
> D. Now Fuse server will send an unlink request reply. The unlink notification
> might still be in kernel buffers or still be in virtiofsd or could
> be in virtiofs virtqueue.
> E. Fuse client will receive unlink reply and remove local watch.
>
> Fuse reply and notification event are now traveling in parallel on
> different virtqueues and there is no connection between these two. And
> it could very well happen that fuse reply comes first, gets processed
> first and local watch is removed. And notification is processed right
> after but by then local watch is gone and filesystem will be forced to
> drop event.
>
> As of now the situation is more complicated in virtiofsd. We don't keep
> a file handle for the file; instead we keep an O_PATH fd open for each
> file. That means in step D above, the inode on the host is not freed yet
> and the unlink event is not generated yet. When the unlink reply reaches
> the fuse client, it sends FORGET messages to the server, and then the
> server closes the O_PATH fd and then the host generates the unlink
> events. By that time it's too late: the guest has already removed its
> local watches (and triggered removal of the remote watches too).
>
> This second problem probably can be solved by using file handles, but
> basic race will still continue to be there.
>
> > If it is local file removal that causes watch to be removed,
> > then don't drop local events and you are good to go.
> > Is it something else?
>
> - If remote events are enabled, then the idea will be that user space
> gets an event when the file is actually removed from the server, right?
> Now it is possible that another VM has this file open and the file has
> not yet been removed. So the local event only tells you that the file
> has been removed in the guest VM (or locally) but does not tell anything
> about the state of the file on the server. (It has been unlinked on the
> server but the inode continues to be alive internally.)
>
> - If user receives both local and remote delete events, it will be
> confusing. I guess if we want to see both the events, then there
> has to be some sort of info in event which classifies whether event
> is local or remote. And let application act accordingly.
>

Maybe. Not sure this is the way to go.

There are several options to deal with this situation.
The thing is that applications cannot usually rely on getting
DELETE_SELF events for many different reasons that might
keep the inode refcount elevated also on local filesystems.

Perhaps the only way for a FUSE (or any network) client to
know if an object on the server was really deleted is to issue a
lookup request by its file handle and check for ESTALE.
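
In userspace terms the probe could look like this sketch (assuming the
client already holds a file handle for the object; note that
open_by_handle_at() needs CAP_DAC_READ_SEARCH):

    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <errno.h>
    #include <unistd.h>

    /* 1 = object still exists, 0 = really gone (ESTALE), -1 = other error */
    static int object_exists(int mount_fd, struct file_handle *fh)
    {
        int fd = open_by_handle_at(mount_fd, fh, O_PATH);
        if (fd < 0)
            return errno == ESTALE ? 0 : -1;
        close(fd);
        return 1;
    }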

It really sounds to me like DELETE_SELF should not be reported
at all for the first implementation.
It is very easy to deal with the DELETE_SELF events scenario with
a filesystem watch, so bear that in mind as a possible application-level
solution.

Thanks,
Amir.

2021-10-27 18:00:47

by Amir Goldstein

Subject: Re: [RFC PATCH 0/7] Inotify support in FUSE and virtiofs

On Tue, Oct 26, 2021 at 10:14 PM Ioannis Angelakopoulos
<[email protected]> wrote:
>
>
>
> On Tue, Oct 26, 2021 at 2:27 PM Vivek Goyal <[email protected]> wrote:
>>
>> On Tue, Oct 26, 2021 at 08:59:44PM +0300, Amir Goldstein wrote:
>> > On Tue, Oct 26, 2021 at 7:18 PM Vivek Goyal <[email protected]> wrote:
>> > >
>> > > On Tue, Oct 26, 2021 at 06:23:50PM +0300, Amir Goldstein wrote:
>> > >
>> > > [..]
>> > > > > 3) The lifetime of the local watch in the guest kernel is very
>> > > > > important. Specifically, there is a possibility that the guest does not
>> > > > > receive remote events on time, if it removes its local watch on the
>> > > > > target or deletes the inode (and thus the guest kernel removes the watch).
>> > > > > In these cases the guest kernel removes the local watch before the
>> > > > > remote events arrive from the host (virtiofsd) and as such the guest
>> > > > > kernel drops all the remote events for the target inode (since the
>> > > > > corresponding local watch does not exist anymore).
>> > >
>> > > So this is one of the issues which has been haunting us in virtiofs. If
>> > > a file is removed, for local events, event is generated first and
>> > > then watch is removed. But in case of remote filesystems, it is racy.
>> > > It is possible that by the time event arrives, watch is already gone
>> > > and application never sees the delete event.
>> > >
>> > > Not sure how to address this issue.
>> >
>>
>> > Can you take me through the scenario step by step.
>> > I am not sure I understand the exact sequence of the race.
>>
>> Ioannis, please correct me if I get something wrong. You know the exact
>> details much better than me.
>>
>> A. Say a guest process unlinks a file.
>> B. Fuse sends an unlink request to server (virtiofsd)
>> C. File is unlinked on host. Assume there are no other users so inode
>> will be freed as well. And event will be generated on host and watch
>> removed.
>> D. Now the Fuse server will send an unlink request reply. The unlink notification
>> might still be in kernel buffers or still be in virtiofsd or could
>> be in virtiofs virtqueue.
>> E. Fuse client will receive unlink reply and remove local watch.
>>
>> Fuse reply and notification event are now traveling in parallel on
>> different virtqueues and there is no connection between these two. And
>> it could very well happen that fuse reply comes first, gets processed
>> first and local watch is removed. And notification is processed right
>> after but by then local watch is gone and filesystem will be forced to
>> drop event.
>>
>> As of now the situation is more complicated in virtiofsd. We don't keep a
>> file handle open for the file; we keep an O_PATH fd open for each file.
>> That means in step D above, the inode on the host is not freed yet and the
>> unlink event is not generated yet. When the unlink reply reaches the fuse
>> client, it sends FORGET messages to the server, then the server closes the
>> O_PATH fd and then the host generates the unlink events. By that time it's
>> too late; the guest has already removed local watches (and triggered
>> removal of remote watches too).
>>
>> This second problem can probably be solved by using file handles, but the
>> basic race will still be there.
>>
>> > If it is local file removal that causes watch to be removed,
>> > then don't drop local events and you are good to go.
>> > Is it something else?
>>
>> - If remote events are enabled, then the idea will be that user space gets
>> an event when the file is actually removed from the server, right? Now it
>> is possible that another VM has this file open and the file has not been
>> removed yet. So a local event only tells you that the file has been removed
>> in the guest VM (or locally) but does not tell anything about the state
>> of the file on the server. (It has been unlinked on the server but the inode
>> continues to be alive internally.)
>>
>> - If the user receives both a local and a remote delete event, it will be
>> confusing. I guess if we want to see both events, then there
>> has to be some sort of info in the event which classifies whether it
>> is local or remote, and let the application act accordingly.
>>
>> Thanks
>> Vivek
>>
>
> Hello Amir!
>
> Sorry for taking part in the conversation a bit late. Vivek was on point with the
> example he gave but the race is a bit more generic than only the DELETE event.
>
> Let's say that a guest process monitors an inode for OPEN events:
>
> 1) The same guest process or another guest process opens the file (related to the
> monitored inode), and then closes and immediately deletes the file/inode.
> 2) The FUSE server (virtiofsd) will mimic the operations of the guest process:
> a) Will open the file on the host side and thus a remote OPEN event is going to
> be generated on the host and sent to the guest.
> b) Will unlink the remote inode and if no other host process uses the inode then the
> inode will be freed and a DELETE event is going to be generated on the host and sent
> to the guest (However, due to how virtiofsd works, as Vivek mentioned, this step won't
> happen immediately)
>

You are confusing DELETE with DELETE_SELF.
DELETE corresponds to unlink(), so you get a DELETE event even if the
inode is a hardlink with nlink > 0 after unlink().

The DELETE event is reported (along with the filename) against the parent
directory inode, so the test case above won't drop the event.
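
To make the distinction concrete, a minimal (illustrative) inotify
sketch; the paths are placeholders:

#include <sys/inotify.h>
#include <stdio.h>

int main(void)
{
        int fd = inotify_init1(IN_CLOEXEC);

        /* IN_DELETE: delivered on the parent directory watch whenever
         * a name inside it is unlinked, regardless of nlink. */
        inotify_add_watch(fd, "/mnt/virtiofs/dir", IN_DELETE);

        /* IN_DELETE_SELF: delivered on the inode watch, and only once
         * the inode itself is going away. */
        inotify_add_watch(fd, "/mnt/virtiofs/dir/file", IN_DELETE_SELF);

        /* ... read(fd, ...) event loop elided ... */
        return 0;
}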

> The problem here is that the OPEN event might still be travelling towards the guest in the
> virtqueues and arrives after the guest has already deleted its local inode.
> While the remote event (OPEN) received by the guest is valid, its fsnotify
> subsystem will drop it since the local inode is not there.
>

I have a feeling that we are mixing issues related to shared server
and remote fsnotify.
Does virtiofsd support multiple guests already? There are many other
issues related to cache coherency that should be dealt with in this case,
some of them overlapping the problem that you describe, so solving the
narrow problem of dropped remote events seems like the wrong way to
approach the problem.

I think that in a shared server situation, the simple LOOKUP/FORGET protocol
will not suffice. I will not even try to solve the generic problem, but will
just mention that the SMB/NFS protocols use delegations/oplocks in the
protocol to advertise object usage by other clients to all clients.

I think this issue is far outside the scope of your project and you should
just leave the dropped events as a known limitation at this point.
inotify has the IN_IGNORED event that applications can use as a hint that
some events could have been dropped.
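
Something like this in the application's event loop would do; a rough
sketch, error handling elided:

#include <sys/inotify.h>
#include <unistd.h>
#include <stdio.h>

/* Treat IN_IGNORED as a hint that the watch went away and in-flight
 * remote events for that inode may have been dropped. */
static void drain_events(int fd)
{
        char buf[4096]
                __attribute__((aligned(__alignof__(struct inotify_event))));
        ssize_t len = read(fd, buf, sizeof(buf));

        for (char *p = buf; len > 0 && p < buf + len; ) {
                struct inotify_event *ev = (struct inotify_event *)p;

                if (ev->mask & IN_IGNORED)
                        fprintf(stderr, "wd %d gone; events may be lost\n",
                                ev->wd);
                p += sizeof(*ev) + ev->len;
        }
}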

Thanks,
Amir.

2021-10-27 21:30:08

by Jan Kara

[permalink] [raw]
Subject: Re: [RFC PATCH 0/7] Inotify support in FUSE and virtiofs

On Wed 27-10-21 08:59:15, Amir Goldstein wrote:
> On Tue, Oct 26, 2021 at 10:14 PM Ioannis Angelakopoulos
> <[email protected]> wrote:
> > On Tue, Oct 26, 2021 at 2:27 PM Vivek Goyal <[email protected]> wrote:
> > The problem here is that the OPEN event might still be travelling towards the guest in the
> > virtqueues and arrives after the guest has already deleted its local inode.
> > While the remote event (OPEN) received by the guest is valid, its fsnotify
> > subsystem will drop it since the local inode is not there.
> >
>
> I have a feeling that we are mixing issues related to shared server
> and remote fsnotify.

I don't think Ioannis was speaking about shared server case here. I think
he says that in a simple FUSE remote notification setup we can lose OPEN
events (or basically any other) if the inode for which the event happens
gets deleted sufficiently early after the event being generated. That seems
indeed somewhat unexpected and could be confusing if it happens e.g. for
some directory operations.

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR

2021-10-27 21:35:14

by Vivek Goyal

[permalink] [raw]
Subject: Re: [RFC PATCH 0/7] Inotify support in FUSE and virtiofs

On Wed, Oct 27, 2021 at 04:29:41PM -0400, Vivek Goyal wrote:
> On Wed, Oct 27, 2021 at 03:23:19PM +0200, Jan Kara wrote:
> > On Wed 27-10-21 08:59:15, Amir Goldstein wrote:
> > > On Tue, Oct 26, 2021 at 10:14 PM Ioannis Angelakopoulos
> > > <[email protected]> wrote:
> > > > On Tue, Oct 26, 2021 at 2:27 PM Vivek Goyal <[email protected]> wrote:
> > > > The problem here is that the OPEN event might still be travelling towards the guest in the
> > > > virtqueues and arrives after the guest has already deleted its local inode.
> > > > While the remote event (OPEN) received by the guest is valid, its fsnotify
> > > > subsystem will drop it since the local inode is not there.
> > > >
> > >
> > > I have a feeling that we are mixing issues related to shared server
> > > and remote fsnotify.
> >
> > I don't think Ioannis was speaking about shared server case here. I think
> > he says that in a simple FUSE remote notification setup we can lose OPEN
> > events (or basically any other) if the inode for which the event happens
> > gets deleted sufficiently early after the event being generated. That seems
> > indeed somewhat unexpected and could be confusing if it happens e.g. for
> > some directory operations.
>
> Hi Jan,
>
> Agreed. That's what Ioannis is trying to say. That some of the remote events
> can be lost if fuse/guest local inode is unlinked. I think problem exists
> both for shared and non-shared directory case.

Thinking more about it, one will need remote fsnotify support only if
this is a case of a shared directory. If there are no clients doing parallel
operations in the shared dir (like dropping in some new files), then local
inotify should be good enough. And I think Ioannis mentioned that local
inotify already works with fuse.

For example, kata container developers wanted to use inotify on virtiofs,
because a process on the host (kubernetes?) will update some sort of config
file in a shared directory. They wanted to monitor that config file in the
guest, get an inotify event and then act on the updated config file. That's
the case of a shared dir.

Given we do not support inotify yet, I think they went ahead with some
sort of polling mechanism to take care of their use case.

Thanks
Vivek

2021-10-27 21:36:09

by Vivek Goyal

[permalink] [raw]
Subject: Re: [RFC PATCH 0/7] Inotify support in FUSE and virtiofs

On Wed, Oct 27, 2021 at 08:59:15AM +0300, Amir Goldstein wrote:
> On Tue, Oct 26, 2021 at 10:14 PM Ioannis Angelakopoulos
> <[email protected]> wrote:
> >
> >
> >
> > On Tue, Oct 26, 2021 at 2:27 PM Vivek Goyal <[email protected]> wrote:
> >>
> >> On Tue, Oct 26, 2021 at 08:59:44PM +0300, Amir Goldstein wrote:
> >> > On Tue, Oct 26, 2021 at 7:18 PM Vivek Goyal <[email protected]> wrote:
> >> > >
> >> > > On Tue, Oct 26, 2021 at 06:23:50PM +0300, Amir Goldstein wrote:
> >> > >
> >> > > [..]
> >> > > > > 3) The lifetime of the local watch in the guest kernel is very
> >> > > > > important. Specifically, there is a possibility that the guest does not
> >> > > > > receive remote events on time, if it removes its local watch on the
> >> > > > > target or deletes the inode (and thus the guest kernel removes the watch).
> >> > > > > In these cases the guest kernel removes the local watch before the
> >> > > > > remote events arrive from the host (virtiofsd) and as such the guest
> >> > > > > kernel drops all the remote events for the target inode (since the
> >> > > > > corresponding local watch does not exist anymore).
> >> > >
> >> > > So this is one of the issues which has been haunting us in virtiofs. If
> >> > > a file is removed, for local events, event is generated first and
> >> > > then watch is removed. But in case of remote filesystems, it is racy.
> >> > > It is possible that by the time event arrives, watch is already gone
> >> > > and application never sees the delete event.
> >> > >
> >> > > Not sure how to address this issue.
> >> >
> >>
> >> > Can you take me through the scenario step by step.
> >> > I am not sure I understand the exact sequence of the race.
> >>
> >> Ioannis, please correct me if I get something wrong. You know the exact
> >> details much better than me.
> >>
> >> A. Say a guest process unlinks a file.
> >> B. Fuse sends an unlink request to server (virtiofsd)
> >> C. File is unlinked on host. Assume there are no other users so inode
> >> will be freed as well. And event will be generated on host and watch
> >> removed.
> >> D. Now the Fuse server will send an unlink request reply. The unlink notification
> >> might still be in kernel buffers or still be in virtiofsd or could
> >> be in virtiofs virtqueue.
> >> E. Fuse client will receive unlink reply and remove local watch.
> >>
> >> Fuse reply and notification event are now traveling in parallel on
> >> different virtqueues and there is no connection between these two. And
> >> it could very well happen that fuse reply comes first, gets processed
> >> first and local watch is removed. And notification is processed right
> >> after but by then local watch is gone and filesystem will be forced to
> >> drop event.
> >>
> >> As of now the situation is more complicated in virtiofsd. We don't keep a
> >> file handle open for the file; we keep an O_PATH fd open for each file.
> >> That means in step D above, the inode on the host is not freed yet and the
> >> unlink event is not generated yet. When the unlink reply reaches the fuse
> >> client, it sends FORGET messages to the server, then the server closes the
> >> O_PATH fd and then the host generates the unlink events. By that time it's
> >> too late; the guest has already removed local watches (and triggered
> >> removal of remote watches too).
> >>
> >> This second problem can probably be solved by using file handles, but the
> >> basic race will still be there.
> >>
> >> > If it is local file removal that causes watch to be removed,
> >> > then don't drop local events and you are good to go.
> >> > Is it something else?
> >>
> >> - If remote events are enabled, then the idea will be that user space gets
> >> an event when the file is actually removed from the server, right? Now it
> >> is possible that another VM has this file open and the file has not been
> >> removed yet. So a local event only tells you that the file has been removed
> >> in the guest VM (or locally) but does not tell anything about the state
> >> of the file on the server. (It has been unlinked on the server but the inode
> >> continues to be alive internally.)
> >>
> >> - If the user receives both a local and a remote delete event, it will be
> >> confusing. I guess if we want to see both events, then there
> >> has to be some sort of info in the event which classifies whether it
> >> is local or remote, and let the application act accordingly.
> >>
> >> Thanks
> >> Vivek
> >>
> >
> > Hello Amir!
> >
> > Sorry for taking part in the conversation a bit late. Vivek was on point with the
> > example he gave but the race is a bit more generic than only the DELETE event.
> >
> > Let's say that a guest process monitors an inode for OPEN events:
> >
> > 1) The same guest process or another guest process opens the file (related to the
> > monitored inode), and then closes and immediately deletes the file/inode.
> > 2) The FUSE server (virtiofsd) will mimic the operations of the guest process:
> > a) Will open the file on the host side and thus a remote OPEN event is going to
> > be generated on the host and sent to the guest.
> > b) Will unlink the remote inode and if no other host process uses the inode then the
> > inode will be freed and a DELETE event is going to be generated on the host and sent
> > to the guest (However, due to how virtiofsd works, as Vivek mentioned, this step won't
> > happen immediately)
> >
>
> You are confusing DELETE with DELETE_SELF.
> DELETE corresponds to unlink(), so you get a DELETE event even if the
> inode is a hardlink with nlink > 0 after unlink().
>
> The DELETE event is reported (along with the filename) against the parent
> directory inode, so the test case above won't drop the event.

Hi Amir,

Agreed that there is confusion between DELETE and DELETE_SELF events. I
think Ioannis is referring to the DELETE_SELF event. With this example he
is trying to emphasize that, due to races, the problem is not limited to
DELETE_SELF events only; other events could arrive a little later,
after the local watch in the guest has been removed, and then all those
events will be dropped as well. So he gave the example of an OPEN event. And
I think a remote IN_IGNORED might face the same fate.

In the case of IN_IGNORED, I am wondering whether it is OK to generate that
event locally instead.

>
> > The problem here is that the OPEN event might still be travelling towards the guest in the
> > virtqueues and arrives after the guest has already deleted its local inode.
> > While the remote event (OPEN) received by the guest is valid, its fsnotify
> > subsystem will drop it since the local inode is not there.
> >
>
> I have a feeling that we are mixing issues related to shared server
> and remote fsnotify.
> Does virtiofsd support multiple guests already?

I would like to think that there are not many basic issues with shared
directory configuration. So we don't stop users from using it.

> There are many other issues related to cache coherency that should be
> dealt with in this case, some of them overlapping the problem that you
> describe, so solving the narrow problem of dropped remote events seems
> like the wrong way to approach the problem.

The dropped remote event problem/race will exist even if it was not a shared
server. Remote events travel through a different virtqueue as compared
to fuse request replies. So there is no guarantee in what order events
or replies will be processed.

>
> I think that in a shared server situation, the simple LOOKUP/FORGET protocol
> will not suffice.

Hmm.., So what's the problem with LOOKUP/FORGET in shared dir case?

> I will not even try to solve the generic problem, but will just
> mention that the SMB/NFS protocols use delegations/oplocks in the protocol
> to advertise object usage by other clients to all clients.

I think Miklos was looking into the idea of some sort of file leases
on fuse for creating an equivalent of delegations. Not sure if that work
made any progress.

>
> I think this issue is far outside the scope of your project and you should
> just leave the dropped events as a known limitation at this point.

Maybe that's what we should do to begin with and just say these events
can be lost or never arrive.

> inotify has the event IN_IGNORED that applications can use as a hint that some
> events could have been dropped.

That probably will require generating IN_IGNORED locally when the local watch
goes away (and not relying on the remote IN_IGNORED), IIUC.

Thanks
Vivek

2021-10-27 21:37:17

by Vivek Goyal

[permalink] [raw]
Subject: Re: [RFC PATCH 0/7] Inotify support in FUSE and virtiofs

On Wed, Oct 27, 2021 at 03:23:19PM +0200, Jan Kara wrote:
> On Wed 27-10-21 08:59:15, Amir Goldstein wrote:
> > On Tue, Oct 26, 2021 at 10:14 PM Ioannis Angelakopoulos
> > <[email protected]> wrote:
> > > On Tue, Oct 26, 2021 at 2:27 PM Vivek Goyal <[email protected]> wrote:
> > > The problem here is that the OPEN event might still be travelling towards the guest in the
> > > virtqueues and arrives after the guest has already deleted its local inode.
> > > While the remote event (OPEN) received by the guest is valid, its fsnotify
> > > subsystem will drop it since the local inode is not there.
> > >
> >
> > I have a feeling that we are mixing issues related to shared server
> > and remote fsnotify.
>
> I don't think Ioannis was speaking about shared server case here. I think
> he says that in a simple FUSE remote notification setup we can lose OPEN
> events (or basically any other) if the inode for which the event happens
> gets deleted sufficiently early after the event being generated. That seems
> indeed somewhat unexpected and could be confusing if it happens e.g. for
> some directory operations.

Hi Jan,

Agreed. That's what Ioannis is trying to say. That some of the remote events
can be lost if the fuse/guest local inode is unlinked. I think the problem
exists both for the shared and non-shared directory case.

With local filesystems we have control: we can first queue up
the event in a buffer before we remove local watches. With events travelling
from a remote server, there is no such control/synchronization. It can
very well happen that events get delayed in the communication path
somewhere and local watches go away, and then there is no way to
deliver those events to the application.
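
For reference, that local ordering is visible in the fsnotify helper itself;
roughly (paraphrasing include/linux/fsnotify.h, exact form varies by kernel
version):

/* The FS_DELETE_SELF event is queued *before* the inode's marks are
 * torn down, so a local watcher cannot miss it. */
static inline void fsnotify_inoderemove(struct inode *inode)
{
        fsnotify_inode(inode, FS_DELETE_SELF);  /* 1. queue the event */
        __fsnotify_inode_delete(inode);         /* 2. remove the marks */
}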

Thanks
Vivek

2021-10-28 05:16:56

by Amir Goldstein

[permalink] [raw]
Subject: Re: [RFC PATCH 0/7] Inotify support in FUSE and virtiofs

On Wed, Oct 27, 2021 at 11:24 PM Vivek Goyal <[email protected]> wrote:
>
> On Wed, Oct 27, 2021 at 08:59:15AM +0300, Amir Goldstein wrote:
> > On Tue, Oct 26, 2021 at 10:14 PM Ioannis Angelakopoulos
> > <[email protected]> wrote:
> > >
> > >
> > >
> > > On Tue, Oct 26, 2021 at 2:27 PM Vivek Goyal <[email protected]> wrote:
> > >>
> > >> On Tue, Oct 26, 2021 at 08:59:44PM +0300, Amir Goldstein wrote:
> > >> > On Tue, Oct 26, 2021 at 7:18 PM Vivek Goyal <[email protected]> wrote:
> > >> > >
> > >> > > On Tue, Oct 26, 2021 at 06:23:50PM +0300, Amir Goldstein wrote:
> > >> > >
> > >> > > [..]
> > >> > > > > 3) The lifetime of the local watch in the guest kernel is very
> > >> > > > > important. Specifically, there is a possibility that the guest does not
> > >> > > > > receive remote events on time, if it removes its local watch on the
> > >> > > > > target or deletes the inode (and thus the guest kernel removes the watch).
> > >> > > > > In these cases the guest kernel removes the local watch before the
> > >> > > > > remote events arrive from the host (virtiofsd) and as such the guest
> > >> > > > > kernel drops all the remote events for the target inode (since the
> > >> > > > > corresponding local watch does not exist anymore).
> > >> > >
> > >> > > So this is one of the issues which has been haunting us in virtiofs. If
> > >> > > a file is removed, for local events, event is generated first and
> > >> > > then watch is removed. But in case of remote filesystems, it is racy.
> > >> > > It is possible that by the time event arrives, watch is already gone
> > >> > > and application never sees the delete event.
> > >> > >
> > >> > > Not sure how to address this issue.
> > >> >
> > >>
> > >> > Can you take me through the scenario step by step.
> > >> > I am not sure I understand the exact sequence of the race.
> > >>
> > >> Ioannis, please correct me if I get something wrong. You know the exact
> > >> details much better than me.
> > >>
> > >> A. Say a guest process unlinks a file.
> > >> B. Fuse sends an unlink request to server (virtiofsd)
> > >> C. File is unlinked on host. Assume there are no other users so inode
> > >> will be freed as well. And event will be generated on host and watch
> > >> removed.
> > >> D. Now the Fuse server will send an unlink request reply. The unlink notification
> > >> might still be in kernel buffers or still be in virtiofsd or could
> > >> be in virtiofs virtqueue.
> > >> E. Fuse client will receive unlink reply and remove local watch.
> > >>
> > >> Fuse reply and notification event are now traveling in parallel on
> > >> different virtqueues and there is no connection between these two. And
> > >> it could very well happen that fuse reply comes first, gets processed
> > >> first and local watch is removed. And notification is processed right
> > >> after but by then local watch is gone and filesystem will be forced to
> > >> drop event.
> > >>
> > >> As of now the situation is more complicated in virtiofsd. We don't keep a
> > >> file handle open for the file; we keep an O_PATH fd open for each file.
> > >> That means in step D above, the inode on the host is not freed yet and the
> > >> unlink event is not generated yet. When the unlink reply reaches the fuse
> > >> client, it sends FORGET messages to the server, then the server closes the
> > >> O_PATH fd and then the host generates the unlink events. By that time it's
> > >> too late; the guest has already removed local watches (and triggered
> > >> removal of remote watches too).
> > >>
> > >> This second problem can probably be solved by using file handles, but the
> > >> basic race will still be there.
> > >>
> > >> > If it is local file removal that causes watch to be removed,
> > >> > then don't drop local events and you are good to go.
> > >> > Is it something else?
> > >>
> > >> - If remote events are enabled, then the idea will be that user space gets
> > >> an event when the file is actually removed from the server, right? Now it
> > >> is possible that another VM has this file open and the file has not been
> > >> removed yet. So a local event only tells you that the file has been removed
> > >> in the guest VM (or locally) but does not tell anything about the state
> > >> of the file on the server. (It has been unlinked on the server but the inode
> > >> continues to be alive internally.)
> > >>
> > >> - If the user receives both a local and a remote delete event, it will be
> > >> confusing. I guess if we want to see both events, then there
> > >> has to be some sort of info in the event which classifies whether it
> > >> is local or remote, and let the application act accordingly.
> > >>
> > >> Thanks
> > >> Vivek
> > >>
> > >
> > > Hello Amir!
> > >
> > > Sorry for taking part in the conversation a bit late. Vivek was on point with the
> > > example he gave but the race is a bit more generic than only the DELETE event.
> > >
> > > Let's say that a guest process monitors an inode for OPEN events:
> > >
> > > 1) The same guest process or another guest process opens the file (related to the
> > > monitored inode), and then closes and immediately deletes the file/inode.
> > > 2) The FUSE server (virtiofsd) will mimic the operations of the guest process:
> > > a) Will open the file on the host side and thus a remote OPEN event is going to
> > > be generated on the host and sent to the guest.
> > > b) Will unlink the remote inode and if no other host process uses the inode then the
> > > inode will be freed and a DELETE event is going to be generated on the host and sent
> > > to the guest (However, due to how virtiofsd works, as Vivek mentioned, this step won't
> > > happen immediately)
> > >
> >
> > You are confusing DELETE with DELETE_SELF.
> > DELETE corresponds to unlink(), so you get a DELETE event even if the
> > inode is a hardlink with nlink > 0 after unlink().
> >
> > The DELETE event is reported (along with the filename) against the parent
> > directory inode, so the test case above won't drop the event.
>
> Hi Amir,
>
> Agreed that there is confusion between DELETE and DELETE_SELF events. I
> think Ioannis is referring to the DELETE_SELF event. With this example he
> is trying to emphasize that, due to races, the problem is not limited to
> DELETE_SELF events only; other events could arrive a little later,
> after the local watch in the guest has been removed, and then all those
> events will be dropped as well. So he gave the example of an OPEN event. And
> I think a remote IN_IGNORED might face the same fate.
>
> In the case of IN_IGNORED, I am wondering whether it is OK to generate that
> event locally instead.
>

Yes, that is what I meant.
Local watch removal can be used as a hint to applications that
events might have been lost.

> >
> > > The problem here is that the OPEN event might still be travelling towards the guest in the
> > > virtqueues and arrives after the guest has already deleted its local inode.
> > > While the remote event (OPEN) received by the guest is valid, its fsnotify
> > > subsystem will drop it since the local inode is not there.
> > >
> >
> > I have a feeling that we are mixing issues related to shared server
> > and remote fsnotify.
> > Does virtiofsd support multiple guests already?
>
> I would like to think that there are not many basic issues with shared
> directory configuration. So we don't stop users from using it.
>
> > There are many other issues related to cache coherency that should be
> > dealt with in this case, some of them overlapping the problem that you
> > describe, so solving the narrow problem of dropped remote events seems
> > like the wrong way to approach the problem.
>
> The dropped remote event problem/race will exist even if it was not a shared
> server. Remote events travel through a different virtqueue as compared
> to fuse request replies. So there is no guarantee in what order events
> or replies will be processed.
>
> >
> > I think that in a shared server situation, the simple LOOKUP/FORGET protocol
> > will not suffice.
>
> Hmm.., So what's the problem with LOOKUP/FORGET in shared dir case?
>

I guess when we say the "shared dir" case, we mean different things.
I was thinking about "shared between different clients" and about the
fact that the information about the server refcount of objects is obscured
from the client.

For example, LOOKUP_INC/DEC requests that report back the server's
shared object refcount could prevent a guest inode from being evicted
in case the inode has a remote watch.

But you were mentioning "shared dir with guest and host", so LOOKUP
refcount is not relevant in that case.

> > I will not even try to solve the generic problem, but will just
> > mention that the SMB/NFS protocols use delegations/oplocks in the protocol
> > to advertise object usage by other clients to all clients.
>
> I think Miklos was looking into the idea of some sort of file leases
> on fuse for creating equivalent of delegations. Not sure if that work
> made any progress.
>
> >
> > I think this issue is far outside the scope of your project and you should
> > just leave the dropped events as a known limitation at this point.
>
> Maybe that's what we should do to begin with and just say these events
> can be lost or never arrive.
>
> > inotify has the event IN_IGNORED that applications can use as a hint that some
> > events could have been dropped.
>
> That probably will require generating IN_IGNORED locally when the local watch
> goes away (and not relying on the remote IN_IGNORED), IIUC.
>

Right.

Thanks,
Amir.

2021-11-02 11:10:42

by Jan Kara

[permalink] [raw]
Subject: Re: [RFC PATCH 0/7] Inotify support in FUSE and virtiofs

On Wed 27-10-21 16:29:40, Vivek Goyal wrote:
> On Wed, Oct 27, 2021 at 03:23:19PM +0200, Jan Kara wrote:
> > On Wed 27-10-21 08:59:15, Amir Goldstein wrote:
> > > On Tue, Oct 26, 2021 at 10:14 PM Ioannis Angelakopoulos
> > > <[email protected]> wrote:
> > > > On Tue, Oct 26, 2021 at 2:27 PM Vivek Goyal <[email protected]> wrote:
> > > > The problem here is that the OPEN event might still be travelling towards the guest in the
> > > > virtqueues and arrives after the guest has already deleted its local inode.
> > > > While the remote event (OPEN) received by the guest is valid, its fsnotify
> > > > subsystem will drop it since the local inode is not there.
> > > >
> > >
> > > I have a feeling that we are mixing issues related to shared server
> > > and remote fsnotify.
> >
> > I don't think Ioannis was speaking about shared server case here. I think
> > he says that in a simple FUSE remote notification setup we can lose OPEN
> > events (or basically any other) if the inode for which the event happens
> > gets deleted sufficiently early after the event being generated. That seems
> > indeed somewhat unexpected and could be confusing if it happens e.g. for
> > some directory operations.
>
> Hi Jan,
>
> Agreed. That's what Ioannis is trying to say. That some of the remote events
> can be lost if fuse/guest local inode is unlinked. I think problem exists
> both for shared and non-shared directory case.
>
> With local filesystems we have control: we can first queue up
> the event in a buffer before we remove local watches. With events travelling
> from a remote server, there is no such control/synchronization. It can
> very well happen that events got delayed in the communication path
> somewhere and local watches went away and now there is no way to
> deliver those events to the application.

So after thinking for some time about this I have the following question
about the architecture of this solution: Why do you actually have local
fsnotify watches at all? They seem to cause quite some trouble... I mean,
can we not have fsnotify marks only on the FUSE server and generate all
events there? When e.g. a file is created from the client, the client tells
the server about the creation, the server performs the creation which
generates the fsnotify event, which is received by the server and forwarded
back to the client, which just queues it into the notification group's queue
for userspace to read.

Now with this architecture there's no problem with duplicate events for
local & server notification marks; similarly there's no problem with lost
events after inode deletion, because events received by the client are
directly queued into the notification queue without any check of whether the
inode is still alive etc. Would this work or am I missing something?
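
For concreteness, the forwarded event might look something like this on the
wire; the struct name and fields are purely illustrative, not an existing
part of the FUSE protocol:

#include <stdint.h>

struct fuse_notify_fsnotify_out {
        uint64_t group_id;      /* client notification group to queue into */
        uint32_t mask;          /* FS_* event bits */
        uint32_t cookie;        /* rename cookie, 0 otherwise */
        uint32_t namelen;       /* length of optional name that follows */
        uint32_t padding;
        /* char name[namelen] follows for directory entry events */
};

The client side would then map group_id to the notification group and queue
the event directly, without any inode lookup.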

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR

2021-11-02 12:58:34

by Amir Goldstein

[permalink] [raw]
Subject: Re: [RFC PATCH 0/7] Inotify support in FUSE and virtiofs

On Tue, Nov 2, 2021 at 1:09 PM Jan Kara <[email protected]> wrote:
>
> On Wed 27-10-21 16:29:40, Vivek Goyal wrote:
> > On Wed, Oct 27, 2021 at 03:23:19PM +0200, Jan Kara wrote:
> > > On Wed 27-10-21 08:59:15, Amir Goldstein wrote:
> > > > On Tue, Oct 26, 2021 at 10:14 PM Ioannis Angelakopoulos
> > > > <[email protected]> wrote:
> > > > > On Tue, Oct 26, 2021 at 2:27 PM Vivek Goyal <[email protected]> wrote:
> > > > > The problem here is that the OPEN event might still be travelling towards the guest in the
> > > > > virtqueues and arrives after the guest has already deleted its local inode.
> > > > > While the remote event (OPEN) received by the guest is valid, its fsnotify
> > > > > subsystem will drop it since the local inode is not there.
> > > > >
> > > >
> > > > I have a feeling that we are mixing issues related to shared server
> > > > and remote fsnotify.
> > >
> > > I don't think Ioannis was speaking about shared server case here. I think
> > > he says that in a simple FUSE remote notification setup we can lose OPEN
> > > events (or basically any other) if the inode for which the event happens
> > > gets deleted sufficiently early after the event being generated. That seems
> > > indeed somewhat unexpected and could be confusing if it happens e.g. for
> > > some directory operations.
> >
> > Hi Jan,
> >
> > Agreed. That's what Ioannis is trying to say. That some of the remote events
> > can be lost if fuse/guest local inode is unlinked. I think problem exists
> > both for shared and non-shared directory case.
> >
> > With local filesystems we have control: we can first queue up
> > the event in a buffer before we remove local watches. With events travelling
> > from a remote server, there is no such control/synchronization. It can
> > very well happen that events got delayed in the communication path
> > somewhere and local watches went away and now there is no way to
> > deliver those events to the application.
>
> So after thinking for some time about this I have the following question
> about the architecture of this solution: Why do you actually have local
> fsnotify watches at all? They seem to cause quite some trouble... I mean
> cannot we have fsnotify marks only on FUSE server and generate all events
> there? When e.g. file is created from the client, client tells the server
> about creation, the server performs the creation which generates the
> fsnotify event, that is received by the server and forwarded back to the
> client which just queues it into notification group's queue for userspace
> to read it.
>
> Now with this architecture there's no problem with duplicate events for
> local & server notification marks, similarly there's no problem with lost
> events after inode deletion because events received by the client are
> directly queued into notification queue without any checking whether inode
> is still alive etc. Would this work or am I missing something?
>

What about group #1 that wants mask A and group #2 that wants mask B
events?

Do you propose to maintain separate event queues over the protocol?
Attach a "recipient list" to each event?

I just don't see how this can scale other than:
- Local marks and connectors manage the subscriptions on local machine
- Protocol updates the server with the combined masks for watched objects

I think that the "post-mortem events" issue could be solved by keeping an
S_DEAD fuse inode object in limbo just for the mark.
When a remote server sends FS_IN_IGNORED or FS_DELETE_SELF for
an inode, the fuse client inode can be finally evicted.
I haven't tried to see how hard that would be to implement.
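
A very rough sketch of that idea in pseudo-kernel C; fuse_maybe_evict(),
remote_watch_active and fsnotify_inode_has_marks() are invented names for
illustration, only get_fuse_inode() is real:

/* Sketch only: keep an unlinked, S_DEAD fuse inode pinned while a
 * remote watch remains, so late remote events still find a connector. */
static void fuse_maybe_evict(struct inode *inode)
{
        struct fuse_inode *fi = get_fuse_inode(inode);

        if (fsnotify_inode_has_marks(inode) && fi->remote_watch_active)
                return; /* stay in limbo until FS_IN_IGNORED or
                           FS_DELETE_SELF arrives from the server */

        /* ... normal eviction path ... */
}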

Thanks,
Amir.

2021-11-02 20:37:10

by Vivek Goyal

[permalink] [raw]
Subject: Re: [RFC PATCH 0/7] Inotify support in FUSE and virtiofs

On Tue, Nov 02, 2021 at 02:54:01PM +0200, Amir Goldstein wrote:
> On Tue, Nov 2, 2021 at 1:09 PM Jan Kara <[email protected]> wrote:
> >
> > On Wed 27-10-21 16:29:40, Vivek Goyal wrote:
> > > On Wed, Oct 27, 2021 at 03:23:19PM +0200, Jan Kara wrote:
> > > > On Wed 27-10-21 08:59:15, Amir Goldstein wrote:
> > > > > On Tue, Oct 26, 2021 at 10:14 PM Ioannis Angelakopoulos
> > > > > <[email protected]> wrote:
> > > > > > On Tue, Oct 26, 2021 at 2:27 PM Vivek Goyal <[email protected]> wrote:
> > > > > > The problem here is that the OPEN event might still be travelling towards the guest in the
> > > > > > virtqueues and arrives after the guest has already deleted its local inode.
> > > > > > While the remote event (OPEN) received by the guest is valid, its fsnotify
> > > > > > subsystem will drop it since the local inode is not there.
> > > > > >
> > > > >
> > > > > I have a feeling that we are mixing issues related to shared server
> > > > > and remote fsnotify.
> > > >
> > > > I don't think Ioannis was speaking about shared server case here. I think
> > > > he says that in a simple FUSE remote notification setup we can lose OPEN
> > > > events (or basically any other) if the inode for which the event happens
> > > > gets deleted sufficiently early after the event being generated. That seems
> > > > indeed somewhat unexpected and could be confusing if it happens e.g. for
> > > > some directory operations.
> > >
> > > Hi Jan,
> > >
> > > Agreed. That's what Ioannis is trying to say. That some of the remote events
> > > can be lost if fuse/guest local inode is unlinked. I think problem exists
> > > both for shared and non-shared directory case.
> > >
> > > With local filesystems we have control: we can first queue up
> > > the event in a buffer before we remove local watches. With events travelling
> > > from a remote server, there is no such control/synchronization. It can
> > > very well happen that events got delayed in the communication path
> > > somewhere and local watches went away and now there is no way to
> > > deliver those events to the application.
> >
> > So after thinking for some time about this I have the following question
> > about the architecture of this solution: Why do you actually have local
> > fsnotify watches at all? They seem to cause quite some trouble... I mean
> > cannot we have fsnotify marks only on FUSE server and generate all events
> > there?

I think currently we are already implementing this part of the proposal.
We are sending a "group" pointer to the server while updating a watch. And
the server is managing watches per inode per group. IOW, if the client has
group1 wanting mask A and group2 wanting mask B, then the server is going to
add two watches with two masks on the same inotify fd instance.

Admittedly we did this because we did not know that an aggregate mask
exists. Using an aggregate mask in the guest kernel, with the server
putting a single watch for that inode based on the aggregate mask, simplifies
the server implementation.

One downside of this approach is more complexity on the server. Also, more
events will be travelling from host to guest. If two groups
are watching the same events on the same inode, then I think two copies of
the events will travel to the guest: one for group1 and one for group2 (as
we have separate watches on the host). If we use an aggregate watch on
the host, then only one event needs to travel between host and guest, and I
am assuming the same event can be replicated among multiple groups, depending
on their interest.
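
In userspace terms the aggregation is just a bitwise OR of the per-group
masks when (re)arming the host-side watch; a minimal sketch (names are
illustrative):

#include <stdint.h>
#include <sys/inotify.h>

/* Collapse the masks of all guest groups watching one inode into a
 * single host watch. inotify_add_watch() on an already-watched path
 * replaces the old mask, so re-arming on every update is enough. */
static int update_host_watch(int ifd, const char *path,
                             uint32_t group1_mask, uint32_t group2_mask)
{
        return inotify_add_watch(ifd, path, group1_mask | group2_mask);
}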

> When e.g. file is created from the client, client tells the server
> > about creation, the server performs the creation which generates the
> > fsnotify event, that is received by the server and forwared back to the
> > client which just queues it into notification group's queue for userspace
> > to read it.

This is the part we have not implemented. When we generate the event,
we just generate the event for the inode. There is no notion
that this event has been generated for a specific group within
this inode. As of now that's left to the local fsnotify code to manage
and figure out.

So the idea is that when an event arrives from remote, we queue it up directly
into the group (without having to worry about the inode). Hmm, how do we do
that? That means we need to return that group identifier in the notification
event at least, so that the client can find the group (without having to
worry about the inode?).

So the group will basically become part of the remote protocol if we were
to go this way. And the implementation becomes more complicated.

> >
> > Now with this architecture there's no problem with duplicate events for
> > local & server notification marks,

I guess suppressing local events is really trivial. Either we have that
inode flag Amir suggested or we have an inode operation to let the filesystem
decide.

> similarly there's no problem with lost
> > events after inode deletion because events received by the client are
> > directly queued into notification queue without any checking whether inode
> > is still alive etc. Would this work or am I missing something?


So when will the watches on the remote side go away? When a file is unlinked
and the inode is going away, we call fsnotify_inoderemove(). This generates
FS_DELETE_SELF and then seems to remove all local marks on the inode.

Now if we don't have local marks and the guest inode is going away, and the
client sends a FUSE_FORGET message, I am assuming that will be the time to
clean up all the remote watches and groups etc. And if the file is open by
some other guest, then the DELETE_SELF event will not have been generated by
then and we will clean the remote watches.

Even if the other guest did not have the file open, cleanup of remote watches
and DELETE_SELF will be parallel operations and can be racy. So if the
thread reading the inotify descriptor gets a little late in reading
DELETE_SELF, it is possible another thread in virtiofsd has cleaned up all
remote watches and associated groups. And now there is no way to send the
event back to the guest and we lose the event?

My understanding of this notification magic is very primitive. So it
is very much possible I am misunderstanding how remote watches and
groups will be managed and reported back. But my current assumption
is that their lifetime will have to be tied to the remote inode we
are managing. Otherwise, when will the remote server clean up its own
internal state (watch descriptors) when the inode goes away?

> >
>
> What about group #1 that wants mask A and group #2 that wants mask B
> events?
>
> Do you propose to maintain separate event queues over the protocol?
> Attach a "recipient list" to each event?
>
> I just don't see how this can scale other than:
> - Local marks and connectors manage the subscriptions on local machine
> - Protocol updates the server with the combined masks for watched objects
>
> I think that the "post-mortem events" issue could be solved by keeping an
> S_DEAD fuse inode object in limbo just for the mark.
> When a remote server sends FS_IN_IGNORED or FS_DELETE_SELF for
> an inode, the fuse client inode can be finally evicted.

There is no guarantee that FS_IN_IGNORED or FS_DELETE_SELF will come,
or when it will come. If another guest has a reference on the inode, it
might not come for a long time. And this will kind of become a mechanism
for one guest to keep another guest's inode cache full of such objects.

If the event queue becomes too full, we might drop these events. But I guess
in that case we will have to generate IN_Q_OVERFLOW, and that can somehow
be used to clean up such S_DEAD inodes?

nodeid is managed by the server. So I am assuming that FORGET messages will
not be sent to the server for this inode till we have seen the FS_IN_IGNORED
and FS_DELETE_SELF events?

Thanks
Vivek

> I haven't tried to see how hard that would be to implement.
>
> Thanks,
> Amir.
>

2021-11-03 07:33:30

by Amir Goldstein

[permalink] [raw]
Subject: Re: [RFC PATCH 0/7] Inotify support in FUSE and virtiofs

> > >
> >
> > What about group #1 that wants mask A and group #2 that wants mask B
> > events?
> >
> > Do you propose to maintain separate event queues over the protocol?
> > Attach a "recipient list" to each event?
> >
> > I just don't see how this can scale other than:
> > - Local marks and connectors manage the subscriptions on local machine
> > - Protocol updates the server with the combined masks for watched objects
> >
> > I think that the "post-mortem events" issue could be solved by keeping an
> > S_DEAD fuse inode object in limbo just for the mark.
> > When a remote server sends FS_IN_IGNORED or FS_DELETE_SELF for
> > an inode, the fuse client inode can be finally evicted.
>
> There is no guarantee that FS_IN_IGNORED or FS_DELETE_SELF will come
> or when will it come. If another guest has reference on inode it might
> not come for a long time. And this will kind of become a mechanism
> for one guest to keep other's inode cache full of such objects.
>
> If event queue becomes too full, we might drop these events. But I guess
> in that case we will have to generate IN_Q_OVERFLOW and that can somehow
> be used to cleanup such S_DEAD inodes?

That depends on the server implementation.
If the server is watching the host fs using a fanotify filesystem mark, then
an overflow event does NOT mean that new events on an inode may be missed,
only that old events could have been missed.
The server should know about all the watched inodes, so it can check on
overflow if any of the watched inodes were deleted and notify the client
using a reliable channel.

Given the current server implementation with inotify, IN_Q_OVERFLOW
means the server may have lost an IN_IGNORED event and may not get any
more events on an inode, so the server should check all the watched inodes
after an overflow, notify the client of all deleted inodes, and try to
re-create the watches for all inodes with a known path, or use the magic
/proc/pid/fd path if that works (??).
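
The server-side handling could look roughly like this;
rescan_watched_inodes() stands in for bookkeeping that is assumed, not shown:

#include <sys/inotify.h>

void rescan_watched_inodes(void);  /* placeholder: revalidate + re-arm */

static void handle_event(const struct inotify_event *ev)
{
        if (ev->mask & IN_Q_OVERFLOW) {
                /* Overflow events carry wd == -1; an IN_IGNORED may
                 * have been lost, so revalidate every watched inode. */
                rescan_watched_inodes();
                return;
        }
        /* ... forward ev to the guest as usual ... */
}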

>
> nodeid is managed by server. So I am assuming that FORGET messages will
> not be sent to server for this inode till we have seen FS_IN_IGNORED
> and FS_DELETE_SELF events?
>

Or until the application that requested the watch calls inotify_rm_watch()
or closes the inotify fd.

IOW, when the fs implements remote fsnotify, the local watch keeps the local
deleted inode object in limbo until the local watch is removed.
When the remote fsnotify server informs us that the remote watch (or remote
inode) is gone, the local watch is removed as well, and then the inotify
application also gets an FS_IN_IGNORED event.

The lifetime of a local inode is complicated, and the lifetime of this
"shared inode" is much more complicated, so I am not pretending to claim that
I have this all figured out or that it could be reliably done at all.

Thanks,
Amir.

2021-11-03 10:11:17

by Jan Kara

[permalink] [raw]
Subject: Re: [RFC PATCH 0/7] Inotify support in FUSE and virtiofs

On Tue 02-11-21 14:54:01, Amir Goldstein wrote:
> On Tue, Nov 2, 2021 at 1:09 PM Jan Kara <[email protected]> wrote:
> >
> > On Wed 27-10-21 16:29:40, Vivek Goyal wrote:
> > > On Wed, Oct 27, 2021 at 03:23:19PM +0200, Jan Kara wrote:
> > > > On Wed 27-10-21 08:59:15, Amir Goldstein wrote:
> > > > > On Tue, Oct 26, 2021 at 10:14 PM Ioannis Angelakopoulos
> > > > > <[email protected]> wrote:
> > > > > > On Tue, Oct 26, 2021 at 2:27 PM Vivek Goyal <[email protected]> wrote:
> > > > > > The problem here is that the OPEN event might still be travelling towards the guest in the
> > > > > > virtqueues and arrives after the guest has already deleted its local inode.
> > > > > > While the remote event (OPEN) received by the guest is valid, its fsnotify
> > > > > > subsystem will drop it since the local inode is not there.
> > > > > >
> > > > >
> > > > > I have a feeling that we are mixing issues related to shared server
> > > > > and remote fsnotify.
> > > >
> > > > I don't think Ioannis was speaking about shared server case here. I think
> > > > he says that in a simple FUSE remote notification setup we can lose OPEN
> > > > events (or basically any other) if the inode for which the event happens
> > > > gets deleted sufficiently early after the event being generated. That seems
> > > > indeed somewhat unexpected and could be confusing if it happens e.g. for
> > > > some directory operations.
> > >
> > > Hi Jan,
> > >
> > > Agreed. That's what Ioannis is trying to say. That some of the remote events
> > > can be lost if fuse/guest local inode is unlinked. I think problem exists
> > > both for shared and non-shared directory case.
> > >
> > > With local filesystems we have control: we can first queue up
> > > the event in a buffer before we remove local watches. With events travelling
> > > from a remote server, there is no such control/synchronization. It can
> > > very well happen that events got delayed in the communication path
> > > somewhere and local watches went away and now there is no way to
> > > deliver those events to the application.
> >
> > So after thinking for some time about this I have the following question
> > about the architecture of this solution: Why do you actually have local
> > fsnotify watches at all? They seem to cause quite some trouble... I mean
> > cannot we have fsnotify marks only on FUSE server and generate all events
> > there? When e.g. file is created from the client, client tells the server
> > about creation, the server performs the creation which generates the
> > fsnotify event, that is received by the server and forwarded back to the
> > client which just queues it into notification group's queue for userspace
> > to read it.
> >
> > Now with this architecture there's no problem with duplicate events for
> > local & server notification marks, similarly there's no problem with lost
> > events after inode deletion because events received by the client are
> > directly queued into notification queue without any checking whether inode
> > is still alive etc. Would this work or am I missing something?
> >
>
> What about group #1 that wants mask A and group #2 that wants mask B
> events?
>
> Do you propose to maintain separate event queues over the protocol?
> Attach a "recipient list" to each event?

Yes, that was my idea. Essentially when we see group A create a mark on FUSE
for path P, we notify the server, it will create notification group A on the
server (if not already existing - there we need some notification group
identifier unique among all clients), and place a mark for it on path P. Then
the full stream of notification events generated for group A on the server
will just be forwarded to the client and inserted into A's notification
queue. IMO this is a very simple solution to implement - you just need to
forward mark addition / removal events from the client to the server and you
forward the event stream from the server to the client. Everything else is
handled by the fsnotify infrastructure on the server.

> I just don't see how this can scale other than:
> - Local marks and connectors manage the subscriptions on local machine
> - Protocol updates the server with the combined masks for watched objects

I agree that depending on the usecase and particular FUSE filesystem
performance of this solution may be a concern. OTOH the only additional
cost of this solution I can see (compared to all those processes just
watching files locally) is the passing of the events from the server to the
client. For local FUSE filesystems such as virtiofs this should be rather
cheap since you have to do very little processing for each generated event.
For filesystems such as sshfs, I can imagine this would be a bigger deal.

Also one problem I can see with my proposal is that it will have problems
with stuff such as leases - i.e., if the client does not notify the server
of the changes quickly but rather batches local operations and tells the
server about them only on special occasions. I don't know enough about FUSE
filesystems to tell whether this is a frequent problem or not.

> I think that the "post-mortem events" issue could be solved by keeping an
> S_DEAD fuse inode object in limbo just for the mark.
> When a remote server sends FS_IN_IGNORED or FS_DELETE_SELF for
> an inode, the fuse client inode can be finally evicted.
> I haven't tried to see how hard that would be to implement.

Sure, there can be other solutions to this particular problem. I just
want to discuss the other architecture to see why we cannot do it in a
simple way :).

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR

2021-11-03 11:23:33

by Amir Goldstein

[permalink] [raw]
Subject: Re: [RFC PATCH 0/7] Inotify support in FUSE and virtiofs

> > > > Hi Jan,
> > > >
> > > > Agreed. That's what Ioannis is trying to say. That some of the remote events
> > > > can be lost if fuse/guest local inode is unlinked. I think problem exists
> > > > both for shared and non-shared directory case.
> > > >
> > > > With local filesystems we have control: we can first queue up
> > > > the event in a buffer before we remove local watches. With events travelling
> > > > from a remote server, there is no such control/synchronization. It can
> > > > very well happen that events got delayed in the communication path
> > > > somewhere and local watches went away and now there is no way to
> > > > deliver those events to the application.
> > >
> > > So after thinking for some time about this I have the following question
> > > about the architecture of this solution: Why do you actually have local
> > > fsnotify watches at all? They seem to cause quite some trouble... I mean
> > > cannot we have fsnotify marks only on FUSE server and generate all events
> > > there? When e.g. file is created from the client, client tells the server
> > > about creation, the server performs the creation which generates the
> > > fsnotify event, that is received by the server and forwarded back to the
> > > client which just queues it into notification group's queue for userspace
> > > to read it.
> > >
> > > Now with this architecture there's no problem with duplicate events for
> > > local & server notification marks, similarly there's no problem with lost
> > > events after inode deletion because events received by the client are
> > > directly queued into notification queue without any checking whether inode
> > > is still alive etc. Would this work or am I missing something?
> > >
> >
> > What about group #1 that wants mask A and group #2 that wants mask B
> > events?
> >
> > Do you propose to maintain separate event queues over the protocol?
> > Attach a "recipient list" to each event?
>
> Yes, that was my idea. Essentially when we see group A creates mark on FUSE
> for path P, we notify server, it will create notification group A on the
> server (if not already existing - there we need some notification group
> identifier unique among all clients), and place mark for it on path P. Then
> the full stream of notification events generated for group A on the server
> will just be forwarded to the client and inserted into the A's notification
> queue. IMO this is very simple solution to implement - you just need to
> forward mark addition / removal events from the client to the server and you
> forward event stream from the server to the client. Everything else is
> handled by the fsnotify infrastructure on the server.
>
> > I just don't see how this can scale other than:
> > - Local marks and connectors manage the subscriptions on local machine
> > - Protocol updates the server with the combined masks for watched objects
>
> I agree that depending on the usecase and particular FUSE filesystem
> performance of this solution may be a concern. OTOH the only additional
> cost of this solution I can see (compared to all those processes just
> watching files locally) is the passing of the events from the server to the
> client. For local FUSE filesystems such as virtiofs this should be rather
> cheap since you have to do very little processing for each generated event.
> For filesystems such as sshfs, I can imagine this would be a bigger deal.
>
> Also one problem I can see with my proposal is that it will have problems
> with stuff such as leases - i.e., if the client does not notify the server
> of the changes quickly but rather batches local operations and tells the
> server about them only on special occasions. I don't know enough about FUSE
> filesystems to tell whether this is a frequent problem or not.
>
> > I think that the "post-mortem events" issue could be solved by keeping an
> > S_DEAD fuse inode object in limbo just for the mark.
> > When a remote server sends FS_IN_IGNORED or FS_DELETE_SELF for
> > an inode, the fuse client inode can be finally evicted.
> > I haven't tried to see how hard that would be to implement.
>
> Sure, there can be other solutions to this particular problem. I just
> > want to discuss the other architecture to see why we cannot do it in a
> simple way :).
>

Fair enough.

Beyond the scalability aspects, I think that a design that exposes the group
to the remote server and allows to "inject" events to the group queue
will prevent
users from useful features going forward.

For example, fanotify ignored_mask could be added to a group, even on
a mount mark, even if the remote server only supports inode marks and it
would just work.

Another point of view for the post-mortem events:
As Miklos once noted and as you wrote above, for cache coherency and leases,
an async notification queue is not adequate and synchronous notifications are
too costly, so there needs to be some shared memory solution involving guest
cache invalidation by host.

Suppose said cache invalidation solution would be able to set a variety of
"dirty" flags, not just one type of dirty - in other words, an "event mask".
If that is available, then when a fuse inode gets evicted, the events from the
"event mask" can be queued before destroying the inode and mark -
post-mortem event issue averted...
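
To make the idea concrete, a hypothetical sketch of such a per-inode record
follows - every name in it is invented:

#include <stdatomic.h>
#include <stdint.h>

/* hypothetical per-inode record in a host<->guest shared memory region */
struct fuse_shm_inval {
        uint64_t nodeid;              /* FUSE node id of the inode */
        _Atomic uint32_t event_mask;  /* FS_*-style "dirty" bits set by host */
};

/* guest side, on inode eviction: drain the pending bits so they can be
 * queued as events before the inode and mark are destroyed */
static uint32_t drain_event_mask(struct fuse_shm_inval *rec)
{
        return atomic_exchange(&rec->event_mask, 0);
}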

Thanks,
Amir.

2021-11-03 22:31:45

by Vivek Goyal

[permalink] [raw]
Subject: Re: [RFC PATCH 0/7] Inotify support in FUSE and virtiofs

On Wed, Nov 03, 2021 at 09:31:02AM +0200, Amir Goldstein wrote:
> > > >
> > >
> > > What about group #1 that wants mask A and group #2 that wants mask B
> > > events?
> > >
> > > Do you propose to maintain separate event queues over the protocol?
> > > Attach a "recipient list" to each event?
> > >
> > > I just don't see how this can scale other than:
> > > - Local marks and connectors manage the subscriptions on local machine
> > > - Protocol updates the server with the combined masks for watched objects
> > >
> > > I think that the "post-mortem events" issue could be solved by keeping an
> > > S_DEAD fuse inode object in limbo just for the mark.
> > > When a remote server sends FS_IN_IGNORED or FS_DELETE_SELF for
> > > an inode, the fuse client inode can be finally evicted.
> >
> > There is no guarantee that FS_IN_IGNORED or FS_DELETE_SELF will come
> > or when will it come. If another guest has reference on inode it might
> > not come for a long time. And this will kind of become a mechanism
> > for one guest to keep other's inode cache full of such objects.
> >
> > If event queue becomes too full, we might drop these events. But I guess
> > in that case we will have to generate IN_Q_OVERFLOW and that can somehow
> > be used to cleanup such S_DEAD inodes?
>
> That depends on the server implementation.
> If the server is watching the host fs using a fanotify filesystem mark, then
> an overflow event does NOT mean that new events on an inode may be missed,
> only that old events could have been missed.
> The server should know about all the watched inodes, so it can check on
> overflow if any of the watched inodes were deleted and notify the client
> using a reliable channel.

Ok. We have only one channel for notifications. I guess we can program
the channel in such a way that it does not drop overflow events but
can drop other kinds of events if things get crazy. If there are too many
overflow events and we allocate too much memory, I guess at some point
the oom killer will kick in and kill the server.

>
> Given the current server implementation with inotify, IN_Q_OVERFLOW
> means the server may have lost an IN_IGNORED event and may not get any
> more events on an inode, so the server should check all the watched inodes
> after an overflow, notify the client of all deleted inodes and try to
> re-create the watches for all inodes with a known path, or use the magic
> /proc/pid/fd path if that works (??).

Re-doing the watches sounds very painful. That means we will need to
keep track of the aggregated mask in the server-side inode as well. As of
now we just pass the mask to the kernel using inotify_add_watch() and forget
about it.

/proc/pid/fd should work because I think that's how Ioannis is putting
current watches on inodes. We don't send path info to the server.
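
For reference, a minimal sketch of that /proc trick, assuming the server
already holds an open fd for the inode:

#include <stdint.h>
#include <stdio.h>
#include <sys/inotify.h>

/* add a watch through the magic /proc link of an fd the server already
 * holds for the inode - no pathname has to travel over the protocol */
static int watch_by_fd(int inotify_fd, int inode_fd, uint32_t mask)
{
        char path[64];

        snprintf(path, sizeof(path), "/proc/self/fd/%d", inode_fd);
        return inotify_add_watch(inotify_fd, path, mask);
}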

>
> >
> > nodeid is managed by server. So I am assuming that FORGET messages will
> > not be sent to server for this inode till we have seen FS_IN_IGNORED
> > and FS_DELETE_SELF events?
> >
>
> Or until the application that requested the watch calls
> inotify_rm_watch() or closes
> the inotify fd.
>
> IOW, when fs implements remote fsnotify, the local watch keeps the local deleted
> inode object in limbo until the local watch is removed.
> When the remote fsnotify server informs that the remote watch (or remote inode)
> is gone, the local watch is removed as well and then the inotify
> application also gets
> an FS_IN_IGNORED event.

Hmm.., I guess the remote server will simply send an IN_DELETE_SELF event
when it gets it and forward it to the client. And the client will then have
to clean up this S_DEAD inode, which is in limbo waiting for the
IN_DELETE_SELF event. And that should trigger cleanup of the marks/local
watches on the inode, IIUC.

>
> Lifetime of local inode is complicated and lifetime of this "shared inode"
> is much more complicated, so I am not pretending to claim that I have this all
> figured out or that it could be reliably done at all.

Yes, this handling of IN_DELETE_SELF is turning out to be the most
complicated piece of this proposal. I wish the initial implementation
could just be designed so that it does not send IN_DELETE_SELF and
IN_IGNORED is generated locally, and later enhance it to support
reliable delivery of IN_DELETE_SELF.

Thanks
Vivek

2021-11-03 22:39:31

by Vivek Goyal

[permalink] [raw]
Subject: Re: [RFC PATCH 0/7] Inotify support in FUSE and virtiofs

On Wed, Nov 03, 2021 at 01:17:36PM +0200, Amir Goldstein wrote:
> > > > > Hi Jan,
> > > > >
> > > > > Agreed. That's what Ioannis is trying to say. That some of the remote events
> > > > > can be lost if fuse/guest local inode is unlinked. I think problem exists
> > > > > both for shared and non-shared directory case.
> > > > >
> > > > > With local filesystems we have a control that we can first queue up
> > > > > the event in buffer before we remove local watches. With events travelling
> > > > > from a remote server, there is no such control/synchronization. It can
> > > > > very well happen that events got delayed in the communication path
> > > > > somewhere and local watches went away and now there is no way to
> > > > > deliver those events to the application.
> > > >
> > > > So after thinking for some time about this I have the following question
> > > > about the architecture of this solution: Why do you actually have local
> > > > fsnotify watches at all? They seem to cause quite some trouble... I mean
> > > > cannot we have fsnotify marks only on FUSE server and generate all events
> > > > there? When e.g. file is created from the client, client tells the server
> > > > about creation, the server performs the creation which generates the
> > > > fsnotify event, that is received by the server and forwarded back to the
> > > > client which just queues it into notification group's queue for userspace
> > > > to read it.
> > > >
> > > > Now with this architecture there's no problem with duplicate events for
> > > > local & server notification marks, similarly there's no problem with lost
> > > > events after inode deletion because events received by the client are
> > > > directly queued into notification queue without any checking whether inode
> > > > is still alive etc. Would this work or am I missing something?
> > > >
> > >
> > > What about group #1 that wants mask A and group #2 that wants mask B
> > > events?
> > >
> > > Do you propose to maintain separate event queues over the protocol?
> > > Attach a "recipient list" to each event?
> >
> > Yes, that was my idea. Essentially when we see group A creates mark on FUSE
> > for path P, we notify server, it will create notification group A on the
> > server (if not already existing - there we need some notification group
> > identifier unique among all clients), and place mark for it on path P. Then
> > the full stream of notification events generated for group A on the server
> > will just be forwarded to the client and inserted into the A's notification
> > queue. IMO this is very simple solution to implement - you just need to
> > forward mark addition / removal events from the client to the server and you
> > forward event stream from the server to the client. Everything else is
> > handled by the fsnotify infrastructure on the server.
> >
> > > I just don't see how this can scale other than:
> > > - Local marks and connectors manage the subscriptions on local machine
> > > - Protocol updates the server with the combined masks for watched objects
> >
> > I agree that depending on the usecase and particular FUSE filesystem
> > performance of this solution may be a concern. OTOH the only additional
> > cost of this solution I can see (compared to all those processes just
> > watching files locally) is the passing of the events from the server to the
> > client. For local FUSE filesystems such as virtiofs this should be rather
> > cheap since you have to do very little processing for each generated event.
> > For filesystems such as sshfs, I can imagine this would be a bigger deal.
> >
> > Also one problem I can see with my proposal is that it will have problems
> > with stuff such as leases - i.e., if the client does not notify the server
> > of the changes quickly but rather batches local operations and tells the
> > server about them only on special occasions. I don't know enough about FUSE
> > filesystems to tell whether this is a frequent problem or not.
> >
> > > I think that the "post-mortem events" issue could be solved by keeping an
> > > S_DEAD fuse inode object in limbo just for the mark.
> > > When a remote server sends FS_IN_IGNORED or FS_DELETE_SELF for
> > > an inode, the fuse client inode can be finally evicted.
> > > I haven't tried to see how hard that would be to implement.
> >
> > Sure, there can be other solutions to this particular problem. I just
> > want to discuss the other architecture to see why we cannot do it in a
> > simple way :).
> >
>
> Fair enough.
>
> > Beyond the scalability aspects, I think that a design that exposes the group
> > to the remote server and allows injecting events into the group queue will
> > prevent users from getting useful features going forward.
>
> For example, fanotify ignored_mask could be added to a group, even on
> a mount mark, even if the remote server only supports inode marks and it
> would just work.
>
> Another point of view for the post-mortem events:
> As Miklos once noted and as you wrote above, for cache coherency and leases,
> an async notification queue is not adequate and synchronous notifications are
> too costly, so there needs to be some shared memory solution involving guest
> cache invalidation by host.

Any shared memory solution works only in a limited setup. If the server is
remote on another machine, there is no sharing. I am hoping that this can be
generic enough to support other remote filesystems down the line.

>
> Suppose said cache invalidation solution would be able to set a variety of
> "dirty" flags, not just one type of dirty - in other words, an "event mask".
> If that is available, then when a fuse inode gets evicted, the events from the
> "event mask" can be queued before destroying the inode and mark -
> post-mortem event issue averted...

This is assuming that the server itself got the "IN_DELETE_SELF" event
when fuse is destroying its inode. But the inode might be alive due to
another client having an fd open.

Even if the other client does not have an fd open, this still sounds racy. By
the time we set the inode event_mask (using shared memory, instead of
sending an event notification), fuse might have cleaned up its inode.

There is a good chance I completely misunderstood what you are suggesting
here. :-)

Thanks
Vivek

2021-11-04 05:24:06

by Amir Goldstein

[permalink] [raw]
Subject: Re: [RFC PATCH 0/7] Inotify support in FUSE and virtiofs

> > > If event queue becomes too full, we might drop these events. But I guess
> > > in that case we will have to generate IN_Q_OVERFLOW and that can somehow
> > > be used to cleanup such S_DEAD inodes?
> >
> > That depends on the server implementation.
> > If the server is watching the host fs using a fanotify filesystem mark, then
> > an overflow event does NOT mean that new events on an inode may be missed,
> > only that old events could have been missed.
> > The server should know about all the watched inodes, so it can check on
> > overflow if any of the watched inodes were deleted and notify the client
> > using a reliable channel.
>
> Ok. We have only one channel for notifications. I guess we can program
> the channel in such a way that it does not drop overflow events but
> can drop other kinds of events if things get crazy. If there are too many
> overflow events and we allocate too much memory, I guess at some point
> the oom killer will kick in and kill the server.
>

The kernel implementation of the fsnotify event queue pre-allocates
a single overflow event and never queues more than a single overflow
event. IN_Q_OVERFLOW must be delivered reliably, but delivering one
overflow event is enough (until it is consumed).
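
A simplified model of that rule - this is just the shape of it, not the
kernel code itself:

#include <stdbool.h>
#include <stddef.h>

struct event { struct event *next; };

struct group {
        struct event *head, *tail;
        size_t count, max;
        struct event overflow;   /* pre-allocated when the group is created */
        bool overflow_queued;
};

static void enqueue(struct group *g, struct event *ev)
{
        ev->next = NULL;
        if (g->tail)
                g->tail->next = ev;
        else
                g->head = ev;
        g->tail = ev;
        g->count++;
}

/* when the queue is full, drop the event but deliver exactly one
 * (pre-allocated) overflow event until userspace consumes it */
static void queue_event(struct group *g, struct event *ev)
{
        if (g->count >= g->max) {
                if (!g->overflow_queued) {
                        enqueue(g, &g->overflow);  /* no allocation, cannot fail */
                        g->overflow_queued = true;
                }
                return;  /* ev is dropped */
        }
        enqueue(g, ev);
}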

> >
> > Given the current server implementation with inotify, IN_Q_OVERFLOW
> > means the server may have lost an IN_IGNORED event and may not get any
> > more events on an inode, so the server should check all the watched inodes
> > after an overflow, notify the client of all deleted inodes and try to
> > re-create the watches for all inodes with a known path, or use the magic
> > /proc/pid/fd path if that works (??).
>
> Re-doing the watches sounds very painful.

Event overflow is a painful incident and systems usually pay a large
penalty when it happens (e.g. a full recrawl of the watched tree).
If virtiofsd is going to use inotify, it is no different from any other
inotify application that needs to bear the consequences of event overflow.

> That means we will need to
> keep track of the aggregated mask in the server-side inode as well. As of
> now we just pass the mask to the kernel using inotify_add_watch() and forget
> about it.
>

It costs nothing to keep the aggregated mask in the server-side inode
and it makes sense to do that anyway.
This allows an implementation to notify about changes that the server
itself handles even if there is no backing filesystem behind it or the
host OS has no fs notification support.
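
A sketch of what that bookkeeping could look like - every name here is
hypothetical. It leans on the fact that inotify_add_watch() on an already
watched inode replaces the mask, so re-adding with the new aggregate is
enough:

#include <stdint.h>
#include <sys/inotify.h>

/* hypothetical server-side inode state: one kernel watch per inode,
 * carrying the OR of all client watch masks */
struct srv_inode_watch {
        uint64_t nodeid;
        uint32_t agg_mask;  /* OR of every client's requested mask */
        int wd;             /* inotify watch descriptor, -1 if none */
};

/* call after a client adds or removes a watch and agg_mask is recomputed */
static int sync_watch(int ifd, struct srv_inode_watch *w, const char *path)
{
        if (w->agg_mask == 0) {
                int ret = (w->wd >= 0) ? inotify_rm_watch(ifd, w->wd) : 0;

                w->wd = -1;
                return ret;
        }
        w->wd = inotify_add_watch(ifd, path, w->agg_mask);
        return w->wd >= 0 ? 0 : -1;
}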

> /proc/pid/fd should work because I think that's how Ioannis is putting
> current watches on inodes. We don't send path info to the server.
>
> >
> > >
> > > nodeid is managed by server. So I am assuming that FORGET messages will
> > > not be sent to server for this inode till we have seen FS_IN_IGNORED
> > > and FS_DELETE_SELF events?
> > >
> >
> > Or until the application that requested the watch calls
> > inotify_rm_watch() or closes
> > the inotify fd.
> >
> > IOW, when fs implements remote fsnotify, the local watch keeps the local deleted
> > inode object in limbo until the local watch is removed.
> > When the remote fsnotify server informs that the remote watch (or remote inode)
> > is gone, the local watch is removed as well and then the inotify
> > application also gets
> > an FS_IN_IGNORED event.
>
> Hmm.., I guess the remote server will simply send an IN_DELETE_SELF event
> when it gets it and forward it to the client. And the client will then have
> to clean up this S_DEAD inode, which is in limbo waiting for the
> IN_DELETE_SELF event. And that should trigger cleanup of the marks/local
> watches on the inode, IIUC.
>

In very broad strokes, yes - but the server notification must be delivered
reliably.

> >
> > Lifetime of local inode is complicated and lifetime of this "shared inode"
> > is much more complicated, so I am not pretending to claim that I have this all
> > figured out or that it could be reliably done at all.
>
> Yes, this handling of IN_DELETE_SELF is turning out to be the most
> complicated piece of this proposal. I wish the initial implementation
> could just be designed so that it does not send IN_DELETE_SELF and
> IN_IGNORED is generated locally, and later enhance it to support
> reliable delivery of IN_DELETE_SELF.
>

Not allowing DELETE_SELF in the mask sounds reasonable, but
as Ioannis explained, other events can be missed on a local file delete.
If you want to preserve inotify semantics, you could queue an overflow
event if a fuse inode that gets evicted still has inotify marks.
That's a bit harsh though.
Alternatively, you could document in the inotify man page that IN_IGNORED
could mean that some events were dropped and hope for the best...

Thanks,
Amir.

2021-11-04 05:32:32

by Amir Goldstein

[permalink] [raw]
Subject: Re: [RFC PATCH 0/7] Inotify support in FUSE and virtiofs

On Thu, Nov 4, 2021 at 12:36 AM Vivek Goyal <[email protected]> wrote:
>
> On Wed, Nov 03, 2021 at 01:17:36PM +0200, Amir Goldstein wrote:
> > > > > > Hi Jan,
> > > > > >
> > > > > > Agreed. That's what Ioannis is trying to say. That some of the remote events
> > > > > > can be lost if fuse/guest local inode is unlinked. I think problem exists
> > > > > > both for shared and non-shared directory case.
> > > > > >
> > > > > > With local filesystems we have a control that we can first queue up
> > > > > > the event in buffer before we remove local watches. With events travelling
> > > > > > from a remote server, there is no such control/synchronization. It can
> > > > > > very well happen that events got delayed in the communication path
> > > > > > somewhere and local watches went away and now there is no way to
> > > > > > deliver those events to the application.
> > > > >
> > > > > So after thinking for some time about this I have the following question
> > > > > about the architecture of this solution: Why do you actually have local
> > > > > fsnotify watches at all? They seem to cause quite some trouble... I mean
> > > > > cannot we have fsnotify marks only on FUSE server and generate all events
> > > > > there? When e.g. file is created from the client, client tells the server
> > > > > about creation, the server performs the creation which generates the
> > > > > fsnotify event, that is received by the server and forwarded back to the
> > > > > client which just queues it into notification group's queue for userspace
> > > > > to read it.
> > > > >
> > > > > Now with this architecture there's no problem with duplicate events for
> > > > > local & server notification marks, similarly there's no problem with lost
> > > > > events after inode deletion because events received by the client are
> > > > > directly queued into notification queue without any checking whether inode
> > > > > is still alive etc. Would this work or am I missing something?
> > > > >
> > > >
> > > > What about group #1 that wants mask A and group #2 that wants mask B
> > > > events?
> > > >
> > > > Do you propose to maintain separate event queues over the protocol?
> > > > Attach a "recipient list" to each event?
> > >
> > > Yes, that was my idea. Essentially when we see group A creates mark on FUSE
> > > for path P, we notify server, it will create notification group A on the
> > > server (if not already existing - there we need some notification group
> > > identifier unique among all clients), and place mark for it on path P. Then
> > > the full stream of notification events generated for group A on the server
> > > will just be forwarded to the client and inserted into the A's notification
> > > queue. IMO this is very simple solution to implement - you just need to
> > > forward mark addition / removal events from the client to the server and you
> > > forward event stream from the server to the client. Everything else is
> > > handled by the fsnotify infrastructure on the server.
> > >
> > > > I just don't see how this can scale other than:
> > > > - Local marks and connectors manage the subscriptions on local machine
> > > > - Protocol updates the server with the combined masks for watched objects
> > >
> > > I agree that depending on the usecase and particular FUSE filesystem
> > > performance of this solution may be a concern. OTOH the only additional
> > > cost of this solution I can see (compared to all those processes just
> > > watching files locally) is the passing of the events from the server to the
> > > client. For local FUSE filesystems such as virtiofs this should be rather
> > > cheap since you have to do very little processing for each generated event.
> > > For filesystems such as sshfs, I can imagine this would be a bigger deal.
> > >
> > > Also one problem I can see with my proposal is that it will have problems
> > > with stuff such as leases - i.e., if the client does not notify the server
> > > of the changes quickly but rather batches local operations and tells the
> > > server about them only on special occasions. I don't know enough about FUSE
> > > filesystems to tell whether this is a frequent problem or not.
> > >
> > > > I think that the "post-mortem events" issue could be solved by keeping an
> > > > S_DEAD fuse inode object in limbo just for the mark.
> > > > When a remote server sends FS_IN_IGNORED or FS_DELETE_SELF for
> > > > an inode, the fuse client inode can be finally evicted.
> > > > I haven't tried to see how hard that would be to implement.
> > >
> > > Sure, there can be other solutions to this particular problem. I just
> > > want to discuss the other architecture to see why we cannot do it in a
> > > simple way :).
> > >
> >
> > Fair enough.
> >
> > Beyond the scalability aspects, I think that a design that exposes the group
> > to the remote server and allows injecting events into the group queue will
> > prevent users from getting useful features going forward.
> >
> > For example, fanotify ignored_mask could be added to a group, even on
> > a mount mark, even if the remote server only supports inode marks and it
> > would just work.
> >
> > Another point of view for the post-mortem events:
> > As Miklos once noted and as you wrote above, for cache coherency and leases,
> > an async notification queue is not adequate and synchronous notifications are
> > too costly, so there needs to be some shared memory solution involving guest
> > cache invalidation by host.
>
> Any shared memory solution works only in a limited setup. If the server is
> remote on another machine, there is no sharing. I am hoping that this can be
> generic enough to support other remote filesystems down the line.
>

I do too :)

> >
> > Suppose said cache invalidation solution would be able to set a variety of
> > "dirty" flags, not just one type of dirty - in other words, an "event mask".
> > If that is available, then when a fuse inode gets evicted, the events from the
> > "event mask" can be queued before destroying the inode and mark -
> > post-mortem event issue averted...
>
> This is assuming that the server itself got the "IN_DELETE_SELF" event
> when fuse is destroying its inode. But the inode might be alive due to
> another client having an fd open.
>
> Even if the other client does not have an fd open, this still sounds racy. By
> the time we set the inode event_mask (using shared memory, instead of
> sending an event notification), fuse might have cleaned up its inode.
>

There is no escape from some sort of lease design for a reliable
and efficient shared remote fs.
Unless the client has an exclusive lease on the inode, it must provide
a long enough grace period before cleaning up the inode to wait for an
update from the server, if the client cares about getting all events on
the inode.

> There is a good chance I completely misunderstood what you are suggesting
> here. :-)
>

There is a good chance that I am talking nonsense :-)

Thanks,
Amir.

2021-11-04 10:04:34

by Jan Kara

[permalink] [raw]
Subject: Re: [RFC PATCH 0/7] Inotify support in FUSE and virtiofs

On Wed 03-11-21 18:36:06, Vivek Goyal wrote:
> On Wed, Nov 03, 2021 at 01:17:36PM +0200, Amir Goldstein wrote:
> > > > > > Hi Jan,
> > > > > >
> > > > > > Agreed. That's what Ioannis is trying to say. That some of the remote events
> > > > > > can be lost if fuse/guest local inode is unlinked. I think problem exists
> > > > > > both for shared and non-shared directory case.
> > > > > >
> > > > > > With local filesystems we have a control that we can first queue up
> > > > > > the event in buffer before we remove local watches. With events travelling
> > > > > > from a remote server, there is no such control/synchronization. It can
> > > > > > very well happen that events got delayed in the communication path
> > > > > > somewhere and local watches went away and now there is no way to
> > > > > > deliver those events to the application.
> > > > >
> > > > > So after thinking for some time about this I have the following question
> > > > > about the architecture of this solution: Why do you actually have local
> > > > > fsnotify watches at all? They seem to cause quite some trouble... I mean
> > > > > cannot we have fsnotify marks only on FUSE server and generate all events
> > > > > there? When e.g. file is created from the client, client tells the server
> > > > > about creation, the server performs the creation which generates the
> > > > > fsnotify event, that is received by the server and forwarded back to the
> > > > > client which just queues it into notification group's queue for userspace
> > > > > to read it.
> > > > >
> > > > > Now with this architecture there's no problem with duplicate events for
> > > > > local & server notification marks, similarly there's no problem with lost
> > > > > events after inode deletion because events received by the client are
> > > > > directly queued into notification queue without any checking whether inode
> > > > > is still alive etc. Would this work or am I missing something?
> > > > >
> > > >
> > > > What about group #1 that wants mask A and group #2 that wants mask B
> > > > events?
> > > >
> > > > Do you propose to maintain separate event queues over the protocol?
> > > > Attach a "recipient list" to each event?
> > >
> > > Yes, that was my idea. Essentially when we see group A creates mark on FUSE
> > > for path P, we notify server, it will create notification group A on the
> > > server (if not already existing - there we need some notification group
> > > identifier unique among all clients), and place mark for it on path P. Then
> > > the full stream of notification events generated for group A on the server
> > > will just be forwarded to the client and inserted into the A's notification
> > > queue. IMO this is very simple solution to implement - you just need to
> > > forward mark addition / removal events from the client to the server and you
> > > forward event stream from the server to the client. Everything else is
> > > handled by the fsnotify infrastructure on the server.
> > >
> > > > I just don't see how this can scale other than:
> > > > - Local marks and connectors manage the subscriptions on local machine
> > > > - Protocol updates the server with the combined masks for watched objects
> > >
> > > I agree that depending on the usecase and particular FUSE filesystem
> > > performance of this solution may be a concern. OTOH the only additional
> > > cost of this solution I can see (compared to all those processes just
> > > watching files locally) is the passing of the events from the server to the
> > > client. For local FUSE filesystems such as virtiofs this should be rather
> > > cheap since you have to do very little processing for each generated event.
> > > For filesystems such as sshfs, I can imagine this would be a bigger deal.
> > >
> > > Also one problem I can see with my proposal is that it will have problems
> > > with stuff such as leases - i.e., if the client does not notify the server
> > > of the changes quickly but rather batches local operations and tells the
> > > server about them only on special occasions. I don't know enough about FUSE
> > > filesystems to tell whether this is a frequent problem or not.
> > >
> > > > I think that the "post-mortem events" issue could be solved by keeping an
> > > > S_DEAD fuse inode object in limbo just for the mark.
> > > > When a remote server sends FS_IN_IGNORED or FS_DELETE_SELF for
> > > > an inode, the fuse client inode can be finally evicted.
> > > > I haven't tried to see how hard that would be to implement.
> > >
> > > Sure, there can be other solutions to this particular problem. I just
> > > want to discuss the other architecture to see why we cannot do it in a
> > > simple way :).
> > >
> >
> > Fair enough.
> >
> > Beyond the scalability aspects, I think that a design that exposes the group
> > to the remote server and allows injecting events into the group queue will
> > prevent users from getting useful features going forward.
> >
> > For example, fanotify ignored_mask could be added to a group, even on
> > a mount mark, even if the remote server only supports inode marks and it
> > would just work.
> >
> > Another point of view for the post-mortem events:
> > As Miklos once noted and as you wrote above, for cache coherency and leases,
> > an async notification queue is not adequate and synchronous notifications are
> > too costly, so there needs to be some shared memory solution involving guest
> > cache invalidation by host.
>
> Any shared memory solution works only in a limited setup. If the server is
> remote on another machine, there is no sharing. I am hoping that this can be
> generic enough to support other remote filesystems down the line.

OK, so do I understand both you and Amir correctly that you think that
always relying on the FUSE server for generating the events and just piping
them to the client is not a long-term viable design for FUSE? Mostly because
caching of modifications on the client is essentially inevitable and hence
generating events from the server would be unreliable (delayed too much)?

Honza

--
Jan Kara <[email protected]>
SUSE Labs, CR

2021-11-05 16:33:02

by Vivek Goyal

[permalink] [raw]
Subject: Re: [RFC PATCH 0/7] Inotify support in FUSE and virtiofs

On Thu, Nov 04, 2021 at 11:03:16AM +0100, Jan Kara wrote:
> On Wed 03-11-21 18:36:06, Vivek Goyal wrote:
> > On Wed, Nov 03, 2021 at 01:17:36PM +0200, Amir Goldstein wrote:
> > > > > > > Hi Jan,
> > > > > > >
> > > > > > > Agreed. That's what Ioannis is trying to say. That some of the remote events
> > > > > > > can be lost if fuse/guest local inode is unlinked. I think problem exists
> > > > > > > both for shared and non-shared directory case.
> > > > > > >
> > > > > > > With local filesystems we have a control that we can first queue up
> > > > > > > the event in buffer before we remove local watches. With events travelling
> > > > > > > from a remote server, there is no such control/synchronization. It can
> > > > > > > very well happen that events got delayed in the communication path
> > > > > > > somewhere and local watches went away and now there is no way to
> > > > > > > deliver those events to the application.
> > > > > >
> > > > > > So after thinking for some time about this I have the following question
> > > > > > about the architecture of this solution: Why do you actually have local
> > > > > > fsnotify watches at all? They seem to cause quite some trouble... I mean
> > > > > > cannot we have fsnotify marks only on FUSE server and generate all events
> > > > > > there? When e.g. file is created from the client, client tells the server
> > > > > > about creation, the server performs the creation which generates the
> > > > > > fsnotify event, that is received by the server and forwarded back to the
> > > > > > client which just queues it into notification group's queue for userspace
> > > > > > to read it.
> > > > > >
> > > > > > Now with this architecture there's no problem with duplicate events for
> > > > > > local & server notification marks, similarly there's no problem with lost
> > > > > > events after inode deletion because events received by the client are
> > > > > > directly queued into notification queue without any checking whether inode
> > > > > > is still alive etc. Would this work or am I missing something?
> > > > > >
> > > > >
> > > > > What about group #1 that wants mask A and group #2 that wants mask B
> > > > > events?
> > > > >
> > > > > Do you propose to maintain separate event queues over the protocol?
> > > > > Attach a "recipient list" to each event?
> > > >
> > > > Yes, that was my idea. Essentially when we see group A creates mark on FUSE
> > > > for path P, we notify server, it will create notification group A on the
> > > > server (if not already existing - there we need some notification group
> > > > identifier unique among all clients), and place mark for it on path P. Then
> > > > the full stream of notification events generated for group A on the server
> > > > will just be forwarded to the client and inserted into the A's notification
> > > > queue. IMO this is very simple solution to implement - you just need to
> > > > forward mark addition / removal events from the client to the server and you
> > > > forward event stream from the server to the client. Everything else is
> > > > handled by the fsnotify infrastructure on the server.
> > > >
> > > > > I just don't see how this can scale other than:
> > > > > - Local marks and connectors manage the subscriptions on local machine
> > > > > - Protocol updates the server with the combined masks for watched objects
> > > >
> > > > I agree that depending on the usecase and particular FUSE filesystem
> > > > performance of this solution may be a concern. OTOH the only additional
> > > > cost of this solution I can see (compared to all those processes just
> > > > watching files locally) is the passing of the events from the server to the
> > > > client. For local FUSE filesystems such as virtiofs this should be rather
> > > > cheap since you have to do very little processing for each generated event.
> > > > For filesystems such as sshfs, I can imagine this would be a bigger deal.
> > > >
> > > > Also one problem I can see with my proposal is that it will have problems
> > > > with stuff such as leases - i.e., if the client does not notify the server
> > > > of the changes quickly but rather batches local operations and tells the
> > > > server about them only on special occasions. I don't know enough about FUSE
> > > > filesystems to tell whether this is a frequent problem or not.
> > > >
> > > > > I think that the "post-mortem events" issue could be solved by keeping an
> > > > > S_DEAD fuse inode object in limbo just for the mark.
> > > > > When a remote server sends FS_IN_IGNORED or FS_DELETE_SELF for
> > > > > an inode, the fuse client inode can be finally evicted.
> > > > > I haven't tried to see how hard that would be to implement.
> > > >
> > > > Sure, there can be other solutions to this particular problem. I just
> > > > want to discuss the other architecture to see why we cannot do it in a
> > > > simple way :).
> > > >
> > >
> > > Fair enough.
> > >
> > > Beyond the scalability aspects, I think that a design that exposes the group
> > > to the remote server and allows injecting events into the group queue will
> > > prevent users from getting useful features going forward.
> > >
> > > For example, fanotify ignored_mask could be added to a group, even on
> > > a mount mark, even if the remote server only supports inode marks and it
> > > would just work.
> > >
> > > Another point of view for the post-mortem events:
> > > As Miklos once noted and as you wrote above, for cache coherency and leases,
> > > an async notification queue is not adequate and synchronous notifications are
> > > too costly, so there needs to be some shared memory solution involving guest
> > > cache invalidation by host.
> >
> > Any shared memory solution works only in a limited setup. If the server is
> > remote on another machine, there is no sharing. I am hoping that this can be
> > generic enough to support other remote filesystems down the line.
>
> OK, so do I understand both you and Amir correctly that you think that
> always relying on the FUSE server for generating the events and just piping
> them to the client is not a long-term viable design for FUSE? Mostly because
> caching of modifications on the client is essentially inevitable and hence
> generating events from the server would be unreliable (delayed too much)?

Hi Jan,

Actually I had not even thought about operation caching in clients. IIUC,
as of now we only have modes to support caching of buffered writes in fuse
(which can be flushed later, -o writeback). Other file operations should go
to the server.

To me, it sounds reasonable for the FUSE server to generate events, and
that's what we are doing in this RFC proposal. So the idea is that an
application is effectively watching and receiving events for changes
happening at the remote server.

As of now local events will be suppressed, so if some operations are local
to the client only, then events will not be generated or will be generated
late, when the server sees those changes.

I am not sure if suppressing all local events will serve all use cases
in the long term though. For example, Amir was mentioning fanotify
events on mount objects, and it might make sense to generate local
events there.

So the initial implementation could be that an application gets either local
events or remote events (based on the filesystem). Down the line more
complicated modes can emerge where some combination of local and remote
events could be generated and applications could specify it. That
probably will be an extension of the fanotify/inotify API.

Thanks
Vivek

2021-11-10 06:31:06

by Amir Goldstein

[permalink] [raw]
Subject: Re: [RFC PATCH 0/7] Inotify support in FUSE and virtiofs

> > OK, so do I understand both you and Amir correctly that you think that
> > always relying on the FUSE server for generating the events and just piping
> > them to the client is not a long-term viable design for FUSE? Mostly because
> > caching of modifications on the client is essentially inevitable and hence
> > generating events from the server would be unreliable (delayed too much)?

This is one aspect, but we do not have to tie remote notifications to
the cache coherency problem, which is more complicated.

OTOH, if virtiofs takes the route of "remote groups", why not take it one step
further and implement a "remote event queue".
Then, it does not need to push event notifications from the server
into fsnotify at all.
Instead, FUSE client can communicate with the server using ioctl or a new
command to implement new_group() that returns a special FUSE file and
use FUSE POLL/READ to check/read the server's event queue.
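
A hypothetical userspace sketch of that model - the ioctl name and opcode
below are invented for illustration only:

#include <poll.h>
#include <sys/ioctl.h>
#include <unistd.h>

/* invented opcode: ask the FUSE fs for a "remote group" fd */
#define FUSE_IOC_NEW_NOTIFY_GROUP  _IO('F', 0x42)

static ssize_t read_remote_events(int mnt_fd, char *buf, size_t len)
{
        int qfd = ioctl(mnt_fd, FUSE_IOC_NEW_NOTIFY_GROUP);
        struct pollfd pfd = { .fd = qfd, .events = POLLIN };

        if (qfd < 0 || poll(&pfd, 1, -1) < 0)
                return -1;
        return read(qfd, buf, len);  /* server-defined event records */
}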

There is already a precedent of this model with CIFS_IOC_NOTIFY
and SMB2_CHANGE_NOTIFY - SMB protocol, samba server and Windows server
support watching a directory or a subtree. I think it is a "oneshot" watch, so
there is no polling nor queue involved.

>
> So the initial implementation could be that an application gets either local
> events or remote events (based on the filesystem). Down the line more
> complicated modes can emerge where some combination of local and remote
> events could be generated and applications could specify it. That
> probably will be an extension of the fanotify/inotify API.
>

There is one more problem with this approach.
We cannot silently change the behavior of existing FUSE filesystems.
What if a system has antivirus configured to scan access to virtiofs mounts?
Do you consider it reasonable that on system upgrade, the capability of
adding local watches would go away?

I understand the desire to have existing inotify applications work out of
the box to get remote notifications, but I have doubts if this goal is even
worth pursuing. Considering that the existing known use case described in this
thread is already using polling to identify changes to config files on the host,
it could just as well be using a new API to get the job done.

If we had to plan an interface without considering existing applications,
I think it would look something like:

#define FAN_MARK_INODE 0x00000000
#define FAN_MARK_MOUNT 0x00000010
#define FAN_MARK_FILESYSTEM 0x00000100
#define FAN_MARK_REMOTE_INODE 0x00001000
#define FAN_MARK_REMOTE_FILESYSTEM 0x00001100

Then, the application can choose to add a remote mark with a certain
event mask and a local mark with a different event mask without any ambiguity.
The remote events could be tagged with a flag (turn reserved member of
fanotify_event_metadata into flags).
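
For illustration, how an application might use these proposed flags - the
REMOTE flag is defined locally here because it does not exist in the uapi:

#include <fcntl.h>
#include <sys/fanotify.h>

#define FAN_MARK_REMOTE_INODE 0x00001000  /* proposed above, not in uapi */

static int watch_local_and_remote(int g)
{
        /* remote changes with one mask, local changes with another -
         * no ambiguity about where an event came from */
        if (fanotify_mark(g, FAN_MARK_ADD | FAN_MARK_REMOTE_INODE,
                          FAN_MODIFY | FAN_CLOSE_WRITE, AT_FDCWD,
                          "/mnt/virtiofs/cfg") < 0)
                return -1;
        return fanotify_mark(g, FAN_MARK_ADD, FAN_OPEN, AT_FDCWD,
                             "/mnt/virtiofs/cfg");
}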

We have made a lot of effort to make fanotify a superset of inotify
functionality, removing old UAPI mistakes and baggage along the way,
so I'd really hate to see us going down the path of ambiguous UAPI
again.

IMO, the work that Ioannis has already done to support remote
notifications with virtiofsd is perfectly valid as a private functionality
of virtiofs, just like cifs CIFS_IOC_NOTIFY.

If we want to go further and implement a "remote notification subsystem"
then I think we should take other fs (e.g. cifs) into consideration and think
about functionality beyond the plain remote watch and create a UAPI
that can be used to extend the functionality further.

Thanks,
Amir.

2021-11-11 17:30:51

by Jan Kara

[permalink] [raw]
Subject: Re: [RFC PATCH 0/7] Inotify support in FUSE and virtiofs

On Wed 10-11-21 08:28:13, Amir Goldstein wrote:
> > > OK, so do I understand both you and Amir correctly that you think that
> > > always relying on the FUSE server for generating the events and just piping
> > > them to the client is not a long-term viable design for FUSE? Mostly because
> > > caching of modifications on the client is essentially inevitable and hence
> > > generating events from the server would be unreliable (delayed too much)?
>
> This is one aspect, but we do not have to tie remote notifications to
> the cache coherency problem, which is more complicated.
>
> OTOH, if virtiofs takes the route of "remote groups", why not take it one step
> further and implement a "remote event queue".
> Then, it does not need to push event notifications from the server
> into fsnotify at all.
> Instead, FUSE client can communicate with the server using ioctl or a new
> command to implement new_group() that returns a special FUSE file and
> use FUSE POLL/READ to check/read the server's event queue.

Yes, that could work. But I don't see how you want to avoid pushing events
to fsnotify on the client side. Suppose an app creates fanotify group G, it
adds a mark for some filesystem F1 to it, then it adds a mark for another
filesystem F2 - this time a FUSE based fs. The app wants to receive events
for both F1 and F2 but events for F2 are actually coming from the server
and somehow they have to get into G's queue for an app to read them.
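
A minimal sketch of that situation, with hypothetical paths:

#include <fcntl.h>
#include <sys/fanotify.h>

/* one group, marks on two filesystems; a single read() on g must return
 * events from both, so FUSE events have to land in the same queue */
static int watch_both(void)
{
        int g = fanotify_init(FAN_CLASS_NOTIF, O_RDONLY);

        if (g < 0)
                return -1;
        fanotify_mark(g, FAN_MARK_ADD, FAN_MODIFY, AT_FDCWD, "/data");      /* F1 */
        fanotify_mark(g, FAN_MARK_ADD, FAN_MODIFY, AT_FDCWD, "/virtiofs");  /* F2 */
        return g;
}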

> There is already a precedent of this model with CIFS_IOC_NOTIFY
> and SMB2_CHANGE_NOTIFY - SMB protocol, samba server and Windows server
> support watching a directory or a subtree. I think it is a "oneshot" watch, so
> there is no polling nor queue involved.

OK, are you aiming at a completely separate notification API for userspace
for FUSE-based filesystems? Then I see why you don't need to push events
to fsnotify but I don't quite like that for a few reasons:

1) Application has to be aware whether the filesystem it operates on is
FUSE based or not. That is IMO an unnecessary burden for the applications.

2) If application wants to watch for several paths potentially on multiple
filesystems, it would now need to somehow merge the event streams from FUSE
API and {fa/i}notify API.

3) It would potentially duplicate quite some parts of fsnotify API
handling.

> > So the initial implementation could be that an application gets either local
> > events or remote events (based on the filesystem). Down the line more
> > complicated modes can emerge where some combination of local and remote
> > events could be generated and applications could specify it. That
> > probably will be an extension of the fanotify/inotify API.
>
> There is one more problem with this approach.
> We cannot silently change the behavior of existing FUSE filesystems.
> What if a system has antivirus configured to scan access to virtiofs mounts?
> Do you consider it reasonable that on system upgrade, the capability of
> adding local watches would go away?

I agree we have to be careful there. If fanotify / inotify is currently
usable with virtiofs, just missing events coming from host changes (which
are presumably rare for lots of setups), it is likely used by someone and
we should not regress it.

> I understand the desire to have existing inotify applications work out of
> the box to get remote notifications, but I have doubts if this goal is even
> worth pursuing. Considering that the existing known use case described in this
> thread is already using polling to identify changes to config files on the host,
> it could just as well be using a new API to get the job done.
>
> If we had to plan an interface without considering existing applications,
> I think it would look something like:
>
> #define FAN_MARK_INODE 0x00000000
> #define FAN_MARK_MOUNT 0x00000010
> #define FAN_MARK_FILESYSTEM 0x00000100
> #define FAN_MARK_REMOTE_INODE 0x00001000
> #define FAN_MARK_REMOTE_FILESYSTEM 0x00001100
>
> Then, the application can choose to add a remote mark with a certain
> event mask and a local mark with a different event mask without any ambiguity.
> The remote events could be tagged with a flag (turn reserved member of
> fanotify_event_metadata into flags).
>
> We have made a lot of effort to make fanotify a superset of inotify
> functionality, removing old UAPI mistakes and baggage along the way,
> so I'd really hate to see us going down the path of ambiguous UAPI
> again.

So there's a question: Does the application care whether the event comes from
the local or the remote side? I'd say generally no - a file change is simply a
file change and it does not matter who did it. Also, again, this API implies
the application has to be aware it runs on a filesystem that may generate
remote events in order to ask for them. But presumably a knowledgeable app can
always ask for local & remote events and, if that fails (fs does not support
remote events), ask only for local. That's not a big burden.
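
A sketch of that fallback, reusing the proposed flag from above (again
defined locally, since it does not exist in the uapi):

#include <errno.h>
#include <fcntl.h>
#include <stdint.h>
#include <sys/fanotify.h>

#define FAN_MARK_REMOTE_INODE 0x00001000  /* proposed flag, not in uapi */

/* try local + remote first; fall back to local-only when the filesystem
 * cannot do remote events */
static int mark_with_fallback(int g, const char *path, uint64_t mask)
{
        if (fanotify_mark(g, FAN_MARK_ADD | FAN_MARK_REMOTE_INODE,
                          mask, AT_FDCWD, path) == 0)
                return 0;
        if (errno != EINVAL)
                return -1;
        return fanotify_mark(g, FAN_MARK_ADD, mask, AT_FDCWD, path);
}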

The question is why would we make remote events explicit (and thus
complicate the API and usage) when generally apps cannot be bothered who
did the change. I suppose it makes implementation and some of the
consequences on the stream of events more obvious. Also it allows
functionality such as permission or mount events which are only local when
the server does not support them which could be useful for "mostly local"
filesystems. Hum...

> IMO, the work that Ioannis has already done to support remote
> notifications with virtiofsd is perfectly valid as a private functionality
> of virtiofs, just like cifs CIFS_IOC_NOTIFY.
>
> If we want to go further and implement a "remote notification subsystem"
> then I think we should take other fs (e.g. cifs) into consideration and think
> about functionality beyond the plain remote watch and create a UAPI
> that can be used to extend the functionality further.

So I agree that if we go for a "remote notification subsystem", then we
better consider cases like other FUSE filesystems, cifs, or NFS. I haven't
yet made up my mind whether we cannot somehow seamlessly incorporate remote
events into fsnotify, because it seems to me they are identical to local
events from the app's point of view. But I agree that if we cannot find a
way to integrate remote events without many rough edges, then it is better
to make remote events explicit.

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR

2021-11-11 20:52:40

by Amir Goldstein

[permalink] [raw]
Subject: Re: [RFC PATCH 0/7] Inotify support in FUSE and virtiofs

On Thu, Nov 11, 2021 at 7:30 PM Jan Kara <[email protected]> wrote:
>
> On Wed 10-11-21 08:28:13, Amir Goldstein wrote:
> > > > OK, so do I understand both you and Amir correctly that you think that
> > > > always relying on the FUSE server for generating the events and just piping
> > > > them to the client is not a long-term viable design for FUSE? Mostly because
> > > > caching of modifications on the client is essentially inevitable and hence
> > > > generating events from the server would be unreliable (delayed too much)?
> >
> > This is one aspect, but we do not have to tie remote notifications to
> > the cache coherency problem, which is more complicated.
> >
> > OTOH, if virtiofs takes the route of "remote groups", why not take it one step
> > further and implement a "remote event queue".
> > Then, it does not need to push event notifications from the server
> > into fsnotify at all.
> > Instead, FUSE client can communicate with the server using ioctl or a new
> > command to implement new_group() that returns a special FUSE file and
> > use FUSE POLL/READ to check/read the server's event queue.
>
> Yes, that could work. But I don't see how you want to avoid pushing events
> to fsnotify on the client side. Suppose an app creates fanotify group G, it
> adds a mark for some filesystem F1 to it, then it adds a mark for another
> filesystem F2 - this time a FUSE based fs. The app wants to receive events
> for both F1 and F2 but events for F2 are actually coming from the server
> and somehow they have to get into G's queue for an app to read them.
>
> > There is already a precedent of this model with CIFS_IOC_NOTIFY
> > and SMB2_CHANGE_NOTIFY - SMB protocol, samba server and Windows server
> > support watching a directory or a subtree. I think it is a "oneshot" watch, so
> > there is no polling nor queue involved.
>
> OK, are you aiming at a completely separate notification API for userspace
> for FUSE-based filesystems?

Yes, if the group is "remote" then all the marks are "remote" and the event
queue is also remote.

> Then I see why you don't need to push events
> to fsnotify but I don't quite like that for a few reasons:
>
> 1) Application has to be aware whether the filesystem it operates on is
> FUSE based or not. That is IMO an unnecessary burden for the applications.
>
> 2) If application wants to watch for several paths potentially on multiple
> filesystems, it would now need to somehow merge the event streams from FUSE
> API and {fa/i}notify API.
>
> 3) It would potentially duplicate quite some parts of fsnotify API
> handling.
>

I don't like it so much either, but virtiofs developers are free to pursue
this direction without involving the fsnotify subsystem (like cifs).
One can even implement userspace wrappers for the inotify library functions
that know how to utilize remote cifs/fuse watches...

> > > So the initial implementation could be that an application gets either local
> > > events or remote events (based on the filesystem). Down the line more
> > > complicated modes can emerge where some combination of local and remote
> > > events could be generated and applications could specify it. That
> > > probably will be an extension of the fanotify/inotify API.
> >
> > There is one more problem with this approach.
> > We cannot silently change the behavior of existing FUSE filesystems.
> > What if a system has antivirus configured to scan access to virtiofs mounts?
> > Do you consider it reasonable that on system upgrade, the capability of
> > adding local watches would go away?
>
> I agree we have to be careful there. If fanotify / inotify is currently
> usable with virtiofs, just missing events coming from host changes (which
> are presumably rare for lots of setups), it is likely used by someone and
> we should not regress it.
>

I don't see why it wouldn't be usable.
I think we have to assume that someone is using inotify/fanotify
on virtiofs.

> > I understand the desire to have existing inotify applications work out of
> > the box to get remote notifications, but I have doubts if this goal is even
> > worth pursuing. Considering that the existing known use case described in this
> > thread is already using polling to identify changes to config files on the host,
> > it could just as well be using a new API to get the job done.
> >
> > If we had to plan an interface without considering existing applications,
> > I think it would look something like:
> >
> > #define FAN_MARK_INODE 0x00000000
> > #define FAN_MARK_MOUNT 0x00000010
> > #define FAN_MARK_FILESYSTEM 0x00000100
> > #define FAN_MARK_REMOTE_INODE 0x00001000
> > #define FAN_MARK_REMOTE_FILESYSTEM 0x00001100
> >
> > Then, the application can choose to add a remote mark with a certain
> > event mask and a local mark with a different event mask without any ambiguity.
> > The remote events could be tagged with a flag (turn reserved member of
> > fanotify_event_metadata into flags).
> >
> > We have made a lot of effort to make fanotify a super set of inotify
> > functionality removing old UAPI mistakes and baggage along the way,
> > so I'd really hate to see us going down the path of ambiguous UAPI
> > again.
>
> So there's a question: Does the application care whether the event comes from
> the local or the remote side? I'd say generally no - a file change is simply a
> file change and it does not matter who did it. Also, again, this API implies
> the application has to be aware it runs on a filesystem that may generate
> remote events in order to ask for them. But presumably a knowledgeable app can
> always ask for local & remote events and, if that fails (fs does not support
> remote events), ask only for local. That's not a big burden.
>
> The question is why would we make remote events explicit (and thus
> complicate the API and usage) when generally apps cannot be bothered who
> did the change. I suppose it makes implementation and some of the
> consequences on the stream of events more obvious. Also it allows
> functionality such as permission or mount events which are only local when
> the server does not support them which could be useful for "mostly local"
> filesystems. Hum...
>

Yeah, it is not clear at all what the best approach is.
Maybe the application would just need to specify FAN_REMOTE_EVENTS
in fanotify_init() to opt in to any new behavior, to avoid surprises, and
not be any more explicit than that for the common use case.
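
A minimal sketch of that opt-in, with an invented flag value:

#include <fcntl.h>
#include <sys/fanotify.h>

#define FAN_REMOTE_EVENTS 0x00008000  /* hypothetical flag, value invented */

int main(void)
{
        /* opt in once at init time; marks and reads stay exactly as today */
        int g = fanotify_init(FAN_CLASS_NOTIF | FAN_REMOTE_EVENTS, O_RDONLY);

        return g < 0 ? 1 : 0;
}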

Thanks,
Amir.