2023-02-20 19:38:23

by Aleksandr Mikhalitsyn

[permalink] [raw]
Subject: [RFC PATCH 0/9] fuse: API for Checkpoint/Restore

Hello everyone,

It would be great to hear your comments regarding this proof-of-concept Checkpoint/Restore API for FUSE.

Support of FUSE C/R is a challenging task for CRIU [1]. Last year I've given a brief talk on LPC 2022
about how we handle files C/R in CRIU and which blockers we have for FUSE filesystems. [2]

The main problem for CRIU is that we have to restore mount namespaces and memory mappings before the process tree.
It means that when CRIU is performing mount of fuse filesystem it can't use the original FUSE daemon from the
restorable process tree, but instead use a "fake daemon".

This leads to many other technical problems:
* "fake" daemon has to reply to FUSE_INIT request from the kernel and initialize fuse connection somehow.
This setup can be not consistent with the original daemon (protocol version, daemon capabilities/settings
like no_open, no_flush, readahead, and so on).
* each fuse request has a unique ID. It could confuse userspace if this unique ID sequence was reset.

We can workaround some issues and implement fragile and limited support of FUSE in CRIU but it doesn't make any sense, IMHO.
Btw, I've enumerated only CRIU restore-stage problems there. The dump stage is another story...

My proposal is not only about CRIU. The same interface can be useful for FUSE mounts recovery after daemon crashes.
LXC project uses LXCFS [3] as a procfs/cgroupfs/sysfs emulation layer for containers. We are using a scheme when
one LXCFS daemon handles all the work for all the containers and we use bindmounts to overmount particular
files/directories in procfs/cgroupfs/sysfs. If this single daemon crashes for some reason we are in trouble,
because we have to restart all the containers (fuse bindmounts become invalid after the crash).
The solution is fairly easy:
allow somehow to reinitialize the existing fuse connection and replace the daemon on the fly
This case is a little bit simpler than CRIU cause we don't need to care about the previously opened files
and other stuff, we are only interested in mounts.

Current PoC implementation was developed and tested with this "recovery case".
Right now I only have LXCFS patched and have nothing for CRIU. But I wanted to discuss this idea before going forward with CRIU.

Brief API/design description.

I've added two ioctl's:
* ioctl(FUSE_DEV_IOC_REINIT)
Performs fuse connection abort and then reinitializes all internal fuse structures as "brand new".
Then sends a FUSE_INIT request, so a new userspace daemon can reply to it and start processing fuse reqs.

* ioctl(FUSE_DEV_IOC_BM_REVAL)
A little bit hacky thing. Traverses all the bindmounts of existing fuse mount and performs relookup
of (struct vfsmount)->mnt_root dentries with the new daemon and reset them to new dentries.
Pretty useful for the recovery case (for instance, LXCFS).

Now about the dentry/inode invalidation mechanism.
* added the "fuse connection generation" concept.
When reinit is performed then connection generation is increased by 1.
Each fuse inode stores the generation of the connection it was allocated with.

* perform dentry revalidation if it has an old generation [fuse_dentry_revalidate]
The current implementation of fuse_dentry_revalidate follows a simple and elegant idea. When we
want to revalidate the dentry we just send a FUSE_LOOKUP request to the userspace
for the parent dentry with the name of the current dentry and check which attributes/inode id
it gets. If inode ids are the same and attributes (provided by the userspace) are valid
then we mark dentry valid and it continues to live (with inode).
I've only added a connection generation check to the condition when we have to perform revalidation
and added an inode connection generation reset (to actual connection gen) if the new userspace
daemon has looked up the same inode id (important for the CRIU case!).

Thank you for your attention and I'm waiting for your feedback :)

References:
[1] Support FUSE mountpoints https://github.com/checkpoint-restore/criu/issues/53
[2] Bringing up FUSE mounts C/R support https://lpc.events/event/16/contributions/1243/
[3] LXCFS https://github.com/lxc/lxcfs

Kind regards,
Alex

Cc: Miklos Szeredi <[email protected]>
Cc: Al Viro <[email protected]>
Cc: Amir Goldstein <[email protected]>
Cc: Stéphane Graber <[email protected]>
Cc: Seth Forshee <[email protected]>
Cc: Christian Brauner <[email protected]>
Cc: Andrei Vagin <[email protected]>
Cc: Pavel Tikhomirov <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]

Alexander Mikhalitsyn (9):
fuse: move FUSE_DEFAULT_* defines to fuse common header
fuse: add const qualifiers to common fuse helpers
fuse: add fuse connection generation
fuse: handle stale inode connection in fuse_queue_forget
fuse: move fuse connection flags to the separate structure
fuse: take fuse connection generation into account
fuse: add fuse device ioctl(FUSE_DEV_IOC_REINIT)
namespace: add sb_revalidate_bindmounts helper
fuse: add fuse device ioctl(FUSE_DEV_IOC_BM_REVAL)

fs/fuse/acl.c | 6 +-
fs/fuse/dev.c | 167 +++++++++++++++++++-
fs/fuse/dir.c | 38 ++---
fs/fuse/file.c | 85 +++++-----
fs/fuse/fuse_i.h | 281 ++++++++++++++++++++--------------
fs/fuse/inode.c | 62 ++++----
fs/fuse/readdir.c | 8 +-
fs/fuse/xattr.c | 18 +--
fs/namespace.c | 90 +++++++++++
include/linux/mnt_namespace.h | 3 +
include/uapi/linux/fuse.h | 2 +
11 files changed, 531 insertions(+), 229 deletions(-)

--
2.34.1



2023-02-20 19:38:26

by Aleksandr Mikhalitsyn

[permalink] [raw]
Subject: [RFC PATCH 1/9] fuse: move FUSE_DEFAULT_* defines to fuse common header

Cc: Miklos Szeredi <[email protected]>
Cc: Al Viro <[email protected]>
Cc: Amir Goldstein <[email protected]>
Cc: Stéphane Graber <[email protected]>
Cc: Seth Forshee <[email protected]>
Cc: Christian Brauner <[email protected]>
Cc: Andrei Vagin <[email protected]>
Cc: Pavel Tikhomirov <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Signed-off-by: Alexander Mikhalitsyn <[email protected]>
---
fs/fuse/fuse_i.h | 6 ++++++
fs/fuse/inode.c | 6 ------
2 files changed, 6 insertions(+), 6 deletions(-)

diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
index e13a8eff2e3d..4be7c404da4b 100644
--- a/fs/fuse/fuse_i.h
+++ b/fs/fuse/fuse_i.h
@@ -47,6 +47,12 @@
/** Number of dentries for each connection in the control filesystem */
#define FUSE_CTL_NUM_DENTRIES 5

+/** Maximum number of outstanding background requests */
+#define FUSE_DEFAULT_MAX_BACKGROUND 12
+
+/** Congestion starts at 75% of maximum */
+#define FUSE_DEFAULT_CONGESTION_THRESHOLD (FUSE_DEFAULT_MAX_BACKGROUND * 3 / 4)
+
/** List of active connections */
extern struct list_head fuse_conn_list;

diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
index 114bdb3f7ccb..097b7049057e 100644
--- a/fs/fuse/inode.c
+++ b/fs/fuse/inode.c
@@ -53,12 +53,6 @@ MODULE_PARM_DESC(max_user_congthresh,

#define FUSE_DEFAULT_BLKSIZE 512

-/** Maximum number of outstanding background requests */
-#define FUSE_DEFAULT_MAX_BACKGROUND 12
-
-/** Congestion starts at 75% of maximum */
-#define FUSE_DEFAULT_CONGESTION_THRESHOLD (FUSE_DEFAULT_MAX_BACKGROUND * 3 / 4)
-
#ifdef CONFIG_BLOCK
static struct file_system_type fuseblk_fs_type;
#endif
--
2.34.1


2023-02-20 19:38:29

by Aleksandr Mikhalitsyn

[permalink] [raw]
Subject: [RFC PATCH 2/9] fuse: add const qualifiers to common fuse helpers

Cc: Miklos Szeredi <[email protected]>
Cc: Al Viro <[email protected]>
Cc: Amir Goldstein <[email protected]>
Cc: Stéphane Graber <[email protected]>
Cc: Seth Forshee <[email protected]>
Cc: Christian Brauner <[email protected]>
Cc: Andrei Vagin <[email protected]>
Cc: Pavel Tikhomirov <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Signed-off-by: Alexander Mikhalitsyn <[email protected]>
---
fs/fuse/fuse_i.h | 16 ++++++++--------
1 file changed, 8 insertions(+), 8 deletions(-)

diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
index 4be7c404da4b..d5fc2d89ff1c 100644
--- a/fs/fuse/fuse_i.h
+++ b/fs/fuse/fuse_i.h
@@ -870,32 +870,32 @@ struct fuse_mount {
struct list_head fc_entry;
};

-static inline struct fuse_mount *get_fuse_mount_super(struct super_block *sb)
+static inline struct fuse_mount *get_fuse_mount_super(const struct super_block *sb)
{
return sb->s_fs_info;
}

-static inline struct fuse_conn *get_fuse_conn_super(struct super_block *sb)
+static inline struct fuse_conn *get_fuse_conn_super(const struct super_block *sb)
{
return get_fuse_mount_super(sb)->fc;
}

-static inline struct fuse_mount *get_fuse_mount(struct inode *inode)
+static inline struct fuse_mount *get_fuse_mount(const struct inode *inode)
{
return get_fuse_mount_super(inode->i_sb);
}

-static inline struct fuse_conn *get_fuse_conn(struct inode *inode)
+static inline struct fuse_conn *get_fuse_conn(const struct inode *inode)
{
return get_fuse_mount_super(inode->i_sb)->fc;
}

-static inline struct fuse_inode *get_fuse_inode(struct inode *inode)
+static inline struct fuse_inode *get_fuse_inode(const struct inode *inode)
{
return container_of(inode, struct fuse_inode, inode);
}

-static inline u64 get_node_id(struct inode *inode)
+static inline u64 get_node_id(const struct inode *inode)
{
return get_fuse_inode(inode)->nodeid;
}
@@ -905,7 +905,7 @@ static inline int invalid_nodeid(u64 nodeid)
return !nodeid || nodeid == FUSE_ROOT_ID;
}

-static inline u64 fuse_get_attr_version(struct fuse_conn *fc)
+static inline u64 fuse_get_attr_version(const struct fuse_conn *fc)
{
return atomic64_read(&fc->attr_version);
}
@@ -923,7 +923,7 @@ static inline void fuse_make_bad(struct inode *inode)
set_bit(FUSE_I_BAD, &get_fuse_inode(inode)->state);
}

-static inline bool fuse_is_bad(struct inode *inode)
+static inline bool fuse_is_bad(const struct inode *inode)
{
return unlikely(test_bit(FUSE_I_BAD, &get_fuse_inode(inode)->state));
}
--
2.34.1


2023-02-20 19:38:37

by Aleksandr Mikhalitsyn

[permalink] [raw]
Subject: [RFC PATCH 4/9] fuse: handle stale inode connection in fuse_queue_forget

We don't want to send FUSE_FORGET request to the new
fuse daemon if inode was lookuped by the old fuse daemon
because it can confuse and break userspace (libfuse).

For now, just add a new argument to fuse_queue_forget and
handle it. Adjust all callers to match the old behaviour.

Cc: Miklos Szeredi <[email protected]>
Cc: Al Viro <[email protected]>
Cc: Amir Goldstein <[email protected]>
Cc: Stéphane Graber <[email protected]>
Cc: Seth Forshee <[email protected]>
Cc: Christian Brauner <[email protected]>
Cc: Andrei Vagin <[email protected]>
Cc: Pavel Tikhomirov <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Signed-off-by: Alexander Mikhalitsyn <[email protected]>
---
fs/fuse/dev.c | 4 ++--
fs/fuse/dir.c | 8 ++++----
fs/fuse/fuse_i.h | 2 +-
fs/fuse/inode.c | 2 +-
4 files changed, 8 insertions(+), 8 deletions(-)

diff --git a/fs/fuse/dev.c b/fs/fuse/dev.c
index eb4f88e3dc97..85f69629f34d 100644
--- a/fs/fuse/dev.c
+++ b/fs/fuse/dev.c
@@ -234,7 +234,7 @@ __releases(fiq->lock)
}

void fuse_queue_forget(struct fuse_conn *fc, struct fuse_forget_link *forget,
- u64 nodeid, u64 nlookup)
+ u64 nodeid, u64 nlookup, bool stale_inode_conn)
{
struct fuse_iqueue *fiq = &fc->iq;

@@ -242,7 +242,7 @@ void fuse_queue_forget(struct fuse_conn *fc, struct fuse_forget_link *forget,
forget->forget_one.nlookup = nlookup;

spin_lock(&fiq->lock);
- if (fiq->connected) {
+ if (fiq->connected && !stale_inode_conn) {
fiq->forget_list_tail->next = forget;
fiq->forget_list_tail = forget;
fiq->ops->wake_forget_and_unlock(fiq);
diff --git a/fs/fuse/dir.c b/fs/fuse/dir.c
index 7f4b29aa7c79..49d91add08bc 100644
--- a/fs/fuse/dir.c
+++ b/fs/fuse/dir.c
@@ -250,7 +250,7 @@ static int fuse_dentry_revalidate(struct dentry *entry, unsigned int flags)
if (outarg.nodeid != get_node_id(inode) ||
(bool) IS_AUTOMOUNT(inode) != (bool) (outarg.attr.flags & FUSE_ATTR_SUBMOUNT)) {
fuse_queue_forget(fm->fc, forget,
- outarg.nodeid, 1);
+ outarg.nodeid, 1, false);
goto invalid;
}
spin_lock(&fi->lock);
@@ -405,7 +405,7 @@ int fuse_lookup_name(struct super_block *sb, u64 nodeid, const struct qstr *name
attr_version);
err = -ENOMEM;
if (!*inode) {
- fuse_queue_forget(fm->fc, forget, outarg->nodeid, 1);
+ fuse_queue_forget(fm->fc, forget, outarg->nodeid, 1, false);
goto out;
}
err = 0;
@@ -692,7 +692,7 @@ static int fuse_create_open(struct inode *dir, struct dentry *entry,
if (!inode) {
flags &= ~(O_CREAT | O_EXCL | O_TRUNC);
fuse_sync_release(NULL, ff, flags);
- fuse_queue_forget(fm->fc, forget, outentry.nodeid, 1);
+ fuse_queue_forget(fm->fc, forget, outentry.nodeid, 1, false);
err = -ENOMEM;
goto out_err;
}
@@ -817,7 +817,7 @@ static int create_new_entry(struct fuse_mount *fm, struct fuse_args *args,
inode = fuse_iget(dir->i_sb, outarg.nodeid, outarg.generation,
&outarg.attr, entry_attr_timeout(&outarg), 0);
if (!inode) {
- fuse_queue_forget(fm->fc, forget, outarg.nodeid, 1);
+ fuse_queue_forget(fm->fc, forget, outarg.nodeid, 1, false);
return -ENOMEM;
}
kfree(forget);
diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
index ccd7c145de94..ce154e7ab715 100644
--- a/fs/fuse/fuse_i.h
+++ b/fs/fuse/fuse_i.h
@@ -1008,7 +1008,7 @@ int fuse_lookup_name(struct super_block *sb, u64 nodeid, const struct qstr *name
* Send FORGET command
*/
void fuse_queue_forget(struct fuse_conn *fc, struct fuse_forget_link *forget,
- u64 nodeid, u64 nlookup);
+ u64 nodeid, u64 nlookup, bool stale_inode_conn);

struct fuse_forget_link *fuse_alloc_forget(void);

diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
index c604434611d9..33a108cfcefe 100644
--- a/fs/fuse/inode.c
+++ b/fs/fuse/inode.c
@@ -124,7 +124,7 @@ static void fuse_evict_inode(struct inode *inode)
fuse_dax_inode_cleanup(inode);
if (fi->nlookup) {
fuse_queue_forget(fc, fi->forget, fi->nodeid,
- fi->nlookup);
+ fi->nlookup, false);
fi->forget = NULL;
}
}
--
2.34.1


2023-02-20 19:38:43

by Aleksandr Mikhalitsyn

[permalink] [raw]
Subject: [RFC PATCH 6/9] fuse: take fuse connection generation into account

- modify dentry revalidation algorithm to check inode/connection
generations. If them are not equal then perform revalidation.

remark: during forced dentry revalidation we are sending FUSE_LOOKUP
request to the userspace daemon and if it return the same inode after
lookup then we can "upgrade" inode connection generation without
invalidating it.

- don't send FUSE_FSYNC, FUSE_RELEASE, etc requests to the userspace
daemon about stale inodes (this can confuse libfuse)

Cc: Miklos Szeredi <[email protected]>
Cc: Al Viro <[email protected]>
Cc: Amir Goldstein <[email protected]>
Cc: Stéphane Graber <[email protected]>
Cc: Seth Forshee <[email protected]>
Cc: Christian Brauner <[email protected]>
Cc: Andrei Vagin <[email protected]>
Cc: Pavel Tikhomirov <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Signed-off-by: Alexander Mikhalitsyn <[email protected]>
---
fs/fuse/dir.c | 4 +++-
fs/fuse/file.c | 6 ++++--
fs/fuse/fuse_i.h | 3 ++-
fs/fuse/inode.c | 2 +-
4 files changed, 10 insertions(+), 5 deletions(-)

diff --git a/fs/fuse/dir.c b/fs/fuse/dir.c
index 63625c29d6ef..5ac7d88e5dfc 100644
--- a/fs/fuse/dir.c
+++ b/fs/fuse/dir.c
@@ -213,7 +213,8 @@ static int fuse_dentry_revalidate(struct dentry *entry, unsigned int flags)
inode = d_inode_rcu(entry);
if (inode && fuse_is_bad(inode))
goto invalid;
- else if (time_before64(fuse_dentry_time(entry), get_jiffies_64()) ||
+ else if ((inode && fuse_stale_inode_conn(inode)) ||
+ time_before64(fuse_dentry_time(entry), get_jiffies_64()) ||
(flags & (LOOKUP_EXCL | LOOKUP_REVAL | LOOKUP_RENAME_TARGET))) {
struct fuse_entry_out outarg;
FUSE_ARGS(args);
@@ -255,6 +256,7 @@ static int fuse_dentry_revalidate(struct dentry *entry, unsigned int flags)
}
spin_lock(&fi->lock);
fi->nlookup++;
+ fi->conn_gen = READ_ONCE(get_fuse_conn(inode)->conn_gen);
spin_unlock(&fi->lock);
}
kfree(forget);
diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index d5b30faff0b9..be9086a1868d 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -110,7 +110,8 @@ static void fuse_file_put(struct fuse_file *ff, bool sync, bool isdir)
if (refcount_dec_and_test(&ff->count)) {
struct fuse_args *args = &ff->release_args->args;

- if (isdir ? ff->fm->fc->flags.no_opendir : ff->fm->fc->flags.no_open) {
+ if (fuse_stale_ff(ff) ||
+ (isdir ? ff->fm->fc->flags.no_opendir : ff->fm->fc->flags.no_open)) {
/* Do nothing when client does not implement 'open' */
fuse_release_end(ff->fm, args, 0);
} else if (sync) {
@@ -597,9 +598,10 @@ static int fuse_fsync(struct file *file, loff_t start, loff_t end,
{
struct inode *inode = file->f_mapping->host;
struct fuse_conn *fc = get_fuse_conn(inode);
+ struct fuse_file *ff = file->private_data;
int err;

- if (fuse_is_bad(inode))
+ if (fuse_stale_ff(ff) || fuse_is_bad(inode))
return -EIO;

inode_lock(inode);
diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
index 4f4a6f912c7c..0643de31674d 100644
--- a/fs/fuse/fuse_i.h
+++ b/fs/fuse/fuse_i.h
@@ -954,7 +954,8 @@ static inline bool fuse_stale_inode(const struct inode *inode, int generation,
struct fuse_attr *attr)
{
return inode->i_generation != generation ||
- inode_wrong_type(inode, attr->mode);
+ inode_wrong_type(inode, attr->mode) ||
+ fuse_stale_inode_conn(inode);
}

static inline void fuse_make_bad(struct inode *inode)
diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
index c3109e016494..f9dc8971274d 100644
--- a/fs/fuse/inode.c
+++ b/fs/fuse/inode.c
@@ -124,7 +124,7 @@ static void fuse_evict_inode(struct inode *inode)
fuse_dax_inode_cleanup(inode);
if (fi->nlookup) {
fuse_queue_forget(fc, fi->forget, fi->nodeid,
- fi->nlookup, false);
+ fi->nlookup, fuse_stale_inode_conn(inode));
fi->forget = NULL;
}
}
--
2.34.1


2023-02-20 19:38:48

by Aleksandr Mikhalitsyn

[permalink] [raw]
Subject: [RFC PATCH 3/9] fuse: add fuse connection generation

We will use connection generation to detect stale inodes
from the "old" fuse daemon and invalidate/revalidate them.

There is no functional changes.

Cc: Miklos Szeredi <[email protected]>
Cc: Al Viro <[email protected]>
Cc: Amir Goldstein <[email protected]>
Cc: Stéphane Graber <[email protected]>
Cc: Seth Forshee <[email protected]>
Cc: Christian Brauner <[email protected]>
Cc: Andrei Vagin <[email protected]>
Cc: Pavel Tikhomirov <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Signed-off-by: Alexander Mikhalitsyn <[email protected]>
---
fs/fuse/file.c | 1 +
fs/fuse/fuse_i.h | 29 +++++++++++++++++++++++++++++
fs/fuse/inode.c | 2 ++
3 files changed, 32 insertions(+)

diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index 3eb28aad5674..6d89d4dd4c55 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -78,6 +78,7 @@ struct fuse_file *fuse_file_alloc(struct fuse_mount *fm)
init_waitqueue_head(&ff->poll_wait);

ff->kh = atomic64_inc_return(&fm->fc->khctr);
+ ff->conn_gen = READ_ONCE(fm->fc->conn_gen);

return ff;
}
diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
index d5fc2d89ff1c..ccd7c145de94 100644
--- a/fs/fuse/fuse_i.h
+++ b/fs/fuse/fuse_i.h
@@ -161,6 +161,9 @@ struct fuse_inode {
*/
struct fuse_inode_dax *dax;
#endif
+
+ /** Fuse connection (fuse_conn) generation when inode was allocated */
+ u32 conn_gen;
};

/** FUSE inode state bits */
@@ -232,6 +235,9 @@ struct fuse_file {

/** Has flock been performed on this file? */
bool flock:1;
+
+ /** Fuse connection (fuse_conn) generation when file was allocated */
+ u32 conn_gen;
};

/** One input argument of a request */
@@ -847,6 +853,18 @@ struct fuse_conn {

/* New writepages go into this bucket */
struct fuse_sync_bucket __rcu *curr_bucket;
+
+ /**
+ * Connection generation.
+ * Used to determine if inodes/files were created with an "old"
+ * fuse connection and have to be invalidated. So, all requests
+ * related to these inodes should fail with -EIO.
+ *
+ * CHECKME: do we really need conn_gen for struct fuse_file?
+ * Right now it's only needed for fuse_file_put(), where we have
+ * no access to the inode in some cases.
+ */
+ u32 conn_gen;
};

/*
@@ -910,6 +928,17 @@ static inline u64 fuse_get_attr_version(const struct fuse_conn *fc)
return atomic64_read(&fc->attr_version);
}

+static inline bool fuse_stale_ff(const struct fuse_file *ff)
+{
+ return unlikely(READ_ONCE(ff->fm->fc->conn_gen) != ff->conn_gen);
+}
+
+static inline bool fuse_stale_inode_conn(const struct inode *inode)
+{
+ return unlikely(READ_ONCE(get_fuse_conn(inode)->conn_gen) !=
+ get_fuse_inode(inode)->conn_gen);
+}
+
static inline bool fuse_stale_inode(const struct inode *inode, int generation,
struct fuse_attr *attr)
{
diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
index 097b7049057e..c604434611d9 100644
--- a/fs/fuse/inode.c
+++ b/fs/fuse/inode.c
@@ -77,6 +77,7 @@ static struct inode *fuse_alloc_inode(struct super_block *sb)
fi->attr_version = 0;
fi->orig_ino = 0;
fi->state = 0;
+ fi->conn_gen = READ_ONCE(get_fuse_conn_super(sb)->conn_gen);
mutex_init(&fi->mutex);
spin_lock_init(&fi->lock);
fi->forget = fuse_alloc_forget();
@@ -841,6 +842,7 @@ void fuse_conn_init(struct fuse_conn *fc, struct fuse_mount *fm,
fc->user_ns = get_user_ns(user_ns);
fc->max_pages = FUSE_DEFAULT_MAX_PAGES_PER_REQ;
fc->max_pages_limit = FUSE_MAX_MAX_PAGES;
+ fc->conn_gen = 1;

INIT_LIST_HEAD(&fc->mounts);
list_add(&fm->fc_entry, &fc->mounts);
--
2.34.1


2023-02-20 19:38:51

by Aleksandr Mikhalitsyn

[permalink] [raw]
Subject: [RFC PATCH 5/9] fuse: move fuse connection flags to the separate structure

Let's move all the fuse connection flags that can be safely zeroed
after connection reinitialization to the separate structure fuse_conn_flags.

All of these flags values are calculated dynamically basing on
the userspace daemon capabilities (like no_open, no_flush) or on the
response for FUSE_INIT request.

Cc: Miklos Szeredi <[email protected]>
Cc: Al Viro <[email protected]>
Cc: Amir Goldstein <[email protected]>
Cc: Stéphane Graber <[email protected]>
Cc: Seth Forshee <[email protected]>
Cc: Christian Brauner <[email protected]>
Cc: Andrei Vagin <[email protected]>
Cc: Pavel Tikhomirov <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Signed-off-by: Alexander Mikhalitsyn <[email protected]>
---
fs/fuse/acl.c | 6 +-
fs/fuse/dev.c | 4 +-
fs/fuse/dir.c | 26 +++---
fs/fuse/file.c | 80 ++++++++---------
fs/fuse/fuse_i.h | 225 ++++++++++++++++++++++++----------------------
fs/fuse/inode.c | 52 +++++------
fs/fuse/readdir.c | 8 +-
fs/fuse/xattr.c | 18 ++--
8 files changed, 215 insertions(+), 204 deletions(-)

diff --git a/fs/fuse/acl.c b/fs/fuse/acl.c
index a4850aee2639..b27953739de4 100644
--- a/fs/fuse/acl.c
+++ b/fs/fuse/acl.c
@@ -25,7 +25,7 @@ struct posix_acl *fuse_get_acl(struct inode *inode, int type, bool rcu)
if (fuse_is_bad(inode))
return ERR_PTR(-EIO);

- if (!fc->posix_acl || fc->no_getxattr)
+ if (!fc->posix_acl || fc->flags.no_getxattr)
return NULL;

if (type == ACL_TYPE_ACCESS)
@@ -42,7 +42,7 @@ struct posix_acl *fuse_get_acl(struct inode *inode, int type, bool rcu)
if (size > 0)
acl = posix_acl_from_xattr(fc->user_ns, value, size);
else if ((size == 0) || (size == -ENODATA) ||
- (size == -EOPNOTSUPP && fc->no_getxattr))
+ (size == -EOPNOTSUPP && fc->flags.no_getxattr))
acl = NULL;
else if (size == -ERANGE)
acl = ERR_PTR(-E2BIG);
@@ -64,7 +64,7 @@ int fuse_set_acl(struct user_namespace *mnt_userns, struct dentry *dentry,
if (fuse_is_bad(inode))
return -EIO;

- if (!fc->posix_acl || fc->no_setxattr)
+ if (!fc->posix_acl || fc->flags.no_setxattr)
return -EOPNOTSUPP;

if (type == ACL_TYPE_ACCESS)
diff --git a/fs/fuse/dev.c b/fs/fuse/dev.c
index 85f69629f34d..737764c2295e 100644
--- a/fs/fuse/dev.c
+++ b/fs/fuse/dev.c
@@ -367,7 +367,7 @@ static void request_wait_answer(struct fuse_req *req)
struct fuse_iqueue *fiq = &fc->iq;
int err;

- if (!fc->no_interrupt) {
+ if (!fc->flags.no_interrupt) {
/* Any signal may interrupt this */
err = wait_event_interruptible(req->waitq,
test_bit(FR_FINISHED, &req->flags));
@@ -1901,7 +1901,7 @@ static ssize_t fuse_dev_do_write(struct fuse_dev *fud,
if (nbytes != sizeof(struct fuse_out_header))
err = -EINVAL;
else if (oh.error == -ENOSYS)
- fc->no_interrupt = 1;
+ fc->flags.no_interrupt = 1;
else if (oh.error == -EAGAIN)
err = queue_interrupt(req);

diff --git a/fs/fuse/dir.c b/fs/fuse/dir.c
index 49d91add08bc..63625c29d6ef 100644
--- a/fs/fuse/dir.c
+++ b/fs/fuse/dir.c
@@ -707,7 +707,7 @@ static int fuse_create_open(struct inode *dir, struct dentry *entry,
} else {
file->private_data = ff;
fuse_finish_open(inode, file);
- if (fm->fc->atomic_o_trunc && trunc)
+ if (fm->fc->flags.atomic_o_trunc && trunc)
truncate_pagecache(inode, 0);
else if (!(ff->open_flags & FOPEN_KEEP_CACHE))
invalidate_inode_pages2(inode->i_mapping);
@@ -750,12 +750,12 @@ static int fuse_atomic_open(struct inode *dir, struct dentry *entry,
/* Only creates */
file->f_mode |= FMODE_CREATED;

- if (fc->no_create)
+ if (fc->flags.no_create)
goto mknod;

err = fuse_create_open(dir, entry, file, flags, mode, FUSE_CREATE);
if (err == -ENOSYS) {
- fc->no_create = 1;
+ fc->flags.no_create = 1;
goto mknod;
}
out_dput:
@@ -1080,14 +1080,14 @@ static int fuse_rename2(struct user_namespace *mnt_userns, struct inode *olddir,
return -EINVAL;

if (flags) {
- if (fc->no_rename2 || fc->minor < 23)
+ if (fc->flags.no_rename2 || fc->minor < 23)
return -EINVAL;

err = fuse_rename_common(olddir, oldent, newdir, newent, flags,
FUSE_RENAME2,
sizeof(struct fuse_rename2_in));
if (err == -ENOSYS) {
- fc->no_rename2 = 1;
+ fc->flags.no_rename2 = 1;
err = -EINVAL;
}
} else {
@@ -1354,7 +1354,7 @@ static int fuse_access(struct inode *inode, int mask)

BUG_ON(mask & MAY_NOT_BLOCK);

- if (fm->fc->no_access)
+ if (fm->fc->flags.no_access)
return 0;

memset(&inarg, 0, sizeof(inarg));
@@ -1366,7 +1366,7 @@ static int fuse_access(struct inode *inode, int mask)
args.in_args[0].value = &inarg;
err = fuse_simple_request(fm, &args);
if (err == -ENOSYS) {
- fm->fc->no_access = 1;
+ fm->fc->flags.no_access = 1;
err = 0;
}
return err;
@@ -1503,7 +1503,7 @@ static const char *fuse_get_link(struct dentry *dentry, struct inode *inode,
if (fuse_is_bad(inode))
goto out_err;

- if (fc->cache_symlinks)
+ if (fc->flags.cache_symlinks)
return page_get_link(dentry, inode, callback);

err = -ECHILD;
@@ -1551,13 +1551,13 @@ static int fuse_dir_fsync(struct file *file, loff_t start, loff_t end,
if (fuse_is_bad(inode))
return -EIO;

- if (fc->no_fsyncdir)
+ if (fc->flags.no_fsyncdir)
return 0;

inode_lock(inode);
err = fuse_fsync_common(file, start, end, datasync, FUSE_FSYNCDIR);
if (err == -ENOSYS) {
- fc->no_fsyncdir = 1;
+ fc->flags.no_fsyncdir = 1;
err = 0;
}
inode_unlock(inode);
@@ -1749,7 +1749,7 @@ int fuse_do_setattr(struct dentry *dentry, struct iattr *attr,
struct fuse_setattr_in inarg;
struct fuse_attr_out outarg;
bool is_truncate = false;
- bool is_wb = fc->writeback_cache && S_ISREG(inode->i_mode);
+ bool is_wb = fc->flags.writeback_cache && S_ISREG(inode->i_mode);
loff_t oldsize;
int err;
bool trust_local_cmtime = is_wb;
@@ -1782,7 +1782,7 @@ int fuse_do_setattr(struct dentry *dentry, struct iattr *attr,
/* This is coming from open(..., ... | O_TRUNC); */
WARN_ON(!(attr->ia_valid & ATTR_SIZE));
WARN_ON(attr->ia_size != 0);
- if (fc->atomic_o_trunc) {
+ if (fc->flags.atomic_o_trunc) {
/*
* No need to send request to userspace, since actual
* truncation has already been done by OPEN. But still
@@ -1929,7 +1929,7 @@ static int fuse_setattr(struct user_namespace *mnt_userns, struct dentry *entry,
*
* This should be done on write(), truncate() and chown().
*/
- if (!fc->handle_killpriv && !fc->handle_killpriv_v2) {
+ if (!fc->flags.handle_killpriv && !fc->handle_killpriv_v2) {
/*
* ia_mode calculation may have used stale i_mode.
* Refresh and recalculate.
diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index 6d89d4dd4c55..d5b30faff0b9 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -29,7 +29,7 @@ static int fuse_send_open(struct fuse_mount *fm, u64 nodeid,

memset(&inarg, 0, sizeof(inarg));
inarg.flags = open_flags & ~(O_CREAT | O_EXCL | O_NOCTTY);
- if (!fm->fc->atomic_o_trunc)
+ if (!fm->fc->flags.atomic_o_trunc)
inarg.flags &= ~O_TRUNC;

if (fm->fc->handle_killpriv_v2 &&
@@ -110,7 +110,7 @@ static void fuse_file_put(struct fuse_file *ff, bool sync, bool isdir)
if (refcount_dec_and_test(&ff->count)) {
struct fuse_args *args = &ff->release_args->args;

- if (isdir ? ff->fm->fc->no_opendir : ff->fm->fc->no_open) {
+ if (isdir ? ff->fm->fc->flags.no_opendir : ff->fm->fc->flags.no_open) {
/* Do nothing when client does not implement 'open' */
fuse_release_end(ff->fm, args, 0);
} else if (sync) {
@@ -140,7 +140,7 @@ struct fuse_file *fuse_file_open(struct fuse_mount *fm, u64 nodeid,
ff->fh = 0;
/* Default for no-open */
ff->open_flags = FOPEN_KEEP_CACHE | (isdir ? FOPEN_CACHE_DIR : 0);
- if (isdir ? !fc->no_opendir : !fc->no_open) {
+ if (isdir ? !fc->flags.no_opendir : !fc->flags.no_open) {
struct fuse_open_out outarg;
int err;

@@ -154,9 +154,9 @@ struct fuse_file *fuse_file_open(struct fuse_mount *fm, u64 nodeid,
return ERR_PTR(err);
} else {
if (isdir)
- fc->no_opendir = 1;
+ fc->flags.no_opendir = 1;
else
- fc->no_open = 1;
+ fc->flags.no_open = 1;
}
}

@@ -205,7 +205,7 @@ void fuse_finish_open(struct inode *inode, struct file *file)
else if (ff->open_flags & FOPEN_NONSEEKABLE)
nonseekable_open(inode, file);

- if (fc->atomic_o_trunc && (file->f_flags & O_TRUNC)) {
+ if (fc->flags.atomic_o_trunc && (file->f_flags & O_TRUNC)) {
struct fuse_inode *fi = get_fuse_inode(inode);

spin_lock(&fi->lock);
@@ -215,7 +215,7 @@ void fuse_finish_open(struct inode *inode, struct file *file)
file_update_time(file);
fuse_invalidate_attr_mask(inode, FUSE_STATX_MODSIZE);
}
- if ((file->f_mode & FMODE_WRITE) && fc->writeback_cache)
+ if ((file->f_mode & FMODE_WRITE) && fc->flags.writeback_cache)
fuse_link_write_file(file);
}

@@ -225,10 +225,10 @@ int fuse_open_common(struct inode *inode, struct file *file, bool isdir)
struct fuse_conn *fc = fm->fc;
int err;
bool is_wb_truncate = (file->f_flags & O_TRUNC) &&
- fc->atomic_o_trunc &&
- fc->writeback_cache;
+ fc->flags.atomic_o_trunc &&
+ fc->flags.writeback_cache;
bool dax_truncate = (file->f_flags & O_TRUNC) &&
- fc->atomic_o_trunc && FUSE_IS_DAX(inode);
+ fc->flags.atomic_o_trunc && FUSE_IS_DAX(inode);

if (fuse_is_bad(inode))
return -EIO;
@@ -259,7 +259,7 @@ int fuse_open_common(struct inode *inode, struct file *file, bool isdir)
if (!err) {
struct fuse_file *ff = file->private_data;

- if (fc->atomic_o_trunc && (file->f_flags & O_TRUNC))
+ if (fc->flags.atomic_o_trunc && (file->f_flags & O_TRUNC))
truncate_pagecache(inode, 0);
else if (!(ff->open_flags & FOPEN_KEEP_CACHE))
invalidate_inode_pages2(inode->i_mapping);
@@ -350,7 +350,7 @@ static int fuse_release(struct inode *inode, struct file *file)
* Dirty pages might remain despite write_inode_now() call from
* fuse_flush() due to writes racing with the close.
*/
- if (fc->writeback_cache)
+ if (fc->flags.writeback_cache)
write_inode_now(inode, 1);

fuse_release_common(file, false);
@@ -505,12 +505,12 @@ static int fuse_do_flush(struct fuse_flush_args *fa)
goto out;

err = 0;
- if (fm->fc->no_flush)
+ if (fm->fc->flags.no_flush)
goto inval_attr_out;

err = fuse_simple_request(fm, &fa->args);
if (err == -ENOSYS) {
- fm->fc->no_flush = 1;
+ fm->fc->flags.no_flush = 1;
err = 0;
}

@@ -519,7 +519,7 @@ static int fuse_do_flush(struct fuse_flush_args *fa)
* In memory i_blocks is not maintained by fuse, if writeback cache is
* enabled, i_blocks from cached attr may not be accurate.
*/
- if (!err && fm->fc->writeback_cache)
+ if (!err && fm->fc->flags.writeback_cache)
fuse_invalidate_attr_mask(inode, STATX_BLOCKS);

out:
@@ -545,7 +545,7 @@ static int fuse_flush(struct file *file, fl_owner_t id)
if (fuse_is_bad(inode))
return -EIO;

- if (ff->open_flags & FOPEN_NOFLUSH && !fm->fc->writeback_cache)
+ if (ff->open_flags & FOPEN_NOFLUSH && !fm->fc->flags.writeback_cache)
return 0;

fa = kzalloc(sizeof(*fa), GFP_KERNEL);
@@ -628,12 +628,12 @@ static int fuse_fsync(struct file *file, loff_t start, loff_t end,
if (err)
goto out;

- if (fc->no_fsync)
+ if (fc->flags.no_fsync)
goto out;

err = fuse_fsync_common(file, start, end, datasync, FUSE_FSYNC);
if (err == -ENOSYS) {
- fc->no_fsync = 1;
+ fc->flags.no_fsync = 1;
err = 0;
}
out:
@@ -858,7 +858,7 @@ static void fuse_short_read(struct inode *inode, u64 attr_ver, size_t num_read,
* the file. Some data after the hole is in page cache, but has not
* reached the client fs yet. So the hole is not present there.
*/
- if (!fc->writeback_cache) {
+ if (!fc->flags.writeback_cache) {
loff_t pos = page_offset(ap->pages[0]) + num_read;
fuse_read_update_size(inode, pos, attr_ver);
}
@@ -989,7 +989,7 @@ static void fuse_send_readpages(struct fuse_io_args *ia, struct file *file)

fuse_read_args_fill(ia, file, pos, count, FUSE_READ);
ia->read.attr_ver = fuse_get_attr_version(fm->fc);
- if (fm->fc->async_read) {
+ if (fm->fc->flags.async_read) {
ia->ff = fuse_file_get(ff);
ap->args.end = fuse_readpages_end;
err = fuse_simple_background(fm, &ap->args, GFP_KERNEL);
@@ -1056,7 +1056,7 @@ static ssize_t fuse_cache_read_iter(struct kiocb *iocb, struct iov_iter *to)
* Otherwise, only update if we attempt to read past EOF (to ensure
* i_size is up to date).
*/
- if (fc->auto_inval_data ||
+ if (fc->flags.auto_inval_data ||
(iocb->ki_pos + iov_iter_count(to) > i_size_read(inode))) {
int err;
err = fuse_update_attributes(inode, iocb->ki_filp, STATX_SIZE);
@@ -1263,7 +1263,7 @@ static ssize_t fuse_fill_write_pages(struct fuse_io_args *ia,
ia->write.page_locked = true;
break;
}
- if (!fc->big_writes)
+ if (!fc->flags.big_writes)
break;
} while (iov_iter_count(ii) && count < fc->max_write &&
ap->num_pages < max_pages && offset == 0);
@@ -1343,7 +1343,7 @@ static ssize_t fuse_cache_write_iter(struct kiocb *iocb, struct iov_iter *from)
struct fuse_conn *fc = get_fuse_conn(inode);
loff_t endbyte = 0;

- if (fc->writeback_cache) {
+ if (fc->flags.writeback_cache) {
/* Update size (EOF optimization) and mode (SUID clearing) */
err = fuse_update_attributes(mapping->host, file,
STATX_SIZE | STATX_MODE);
@@ -1864,7 +1864,7 @@ static void fuse_writepage_end(struct fuse_mount *fm, struct fuse_args *args,
* Do this only if writeback_cache is not enabled. If writeback_cache
* is enabled, we trust local ctime/mtime.
*/
- if (!fc->writeback_cache)
+ if (!fc->flags.writeback_cache)
fuse_invalidate_attr_mask(inode, FUSE_STATX_MODIFY);
spin_lock(&fi->lock);
rb_erase(&wpa->writepages_entry, &fi->writepages);
@@ -2368,7 +2368,7 @@ static int fuse_write_begin(struct file *file, struct address_space *mapping,
loff_t fsize;
int err = -ENOMEM;

- WARN_ON(!fc->writeback_cache);
+ WARN_ON(!fc->flags.writeback_cache);

page = grab_cache_page_write_begin(mapping, index);
if (!page)
@@ -2641,13 +2641,13 @@ static int fuse_file_lock(struct file *file, int cmd, struct file_lock *fl)
if (cmd == F_CANCELLK) {
err = 0;
} else if (cmd == F_GETLK) {
- if (fc->no_lock) {
+ if (fc->flags.no_lock) {
posix_test_lock(file, fl);
err = 0;
} else
err = fuse_getlk(file, fl);
} else {
- if (fc->no_lock)
+ if (fc->flags.no_lock)
err = posix_lock_file(file, fl, NULL);
else
err = fuse_setlk(file, fl, 0);
@@ -2661,7 +2661,7 @@ static int fuse_file_flock(struct file *file, int cmd, struct file_lock *fl)
struct fuse_conn *fc = get_fuse_conn(inode);
int err;

- if (fc->no_flock) {
+ if (fc->flags.no_flock) {
err = locks_lock_file_wait(file, fl);
} else {
struct fuse_file *ff = file->private_data;
@@ -2683,7 +2683,7 @@ static sector_t fuse_bmap(struct address_space *mapping, sector_t block)
struct fuse_bmap_out outarg;
int err;

- if (!inode->i_sb->s_bdev || fm->fc->no_bmap)
+ if (!inode->i_sb->s_bdev || fm->fc->flags.no_bmap)
return 0;

memset(&inarg, 0, sizeof(inarg));
@@ -2699,7 +2699,7 @@ static sector_t fuse_bmap(struct address_space *mapping, sector_t block)
args.out_args[0].value = &outarg;
err = fuse_simple_request(fm, &args);
if (err == -ENOSYS)
- fm->fc->no_bmap = 1;
+ fm->fc->flags.no_bmap = 1;

return err ? 0 : outarg.block;
}
@@ -2718,7 +2718,7 @@ static loff_t fuse_lseek(struct file *file, loff_t offset, int whence)
struct fuse_lseek_out outarg;
int err;

- if (fm->fc->no_lseek)
+ if (fm->fc->flags.no_lseek)
goto fallback;

args.opcode = FUSE_LSEEK;
@@ -2732,7 +2732,7 @@ static loff_t fuse_lseek(struct file *file, loff_t offset, int whence)
err = fuse_simple_request(fm, &args);
if (err) {
if (err == -ENOSYS) {
- fm->fc->no_lseek = 1;
+ fm->fc->flags.no_lseek = 1;
goto fallback;
}
return err;
@@ -2839,7 +2839,7 @@ __poll_t fuse_file_poll(struct file *file, poll_table *wait)
FUSE_ARGS(args);
int err;

- if (fm->fc->no_poll)
+ if (fm->fc->flags.no_poll)
return DEFAULT_POLLMASK;

poll_wait(file, &ff->poll_wait, wait);
@@ -2867,7 +2867,7 @@ __poll_t fuse_file_poll(struct file *file, poll_table *wait)
if (!err)
return demangle_poll(outarg.revents);
if (err == -ENOSYS) {
- fm->fc->no_poll = 1;
+ fm->fc->flags.no_poll = 1;
return DEFAULT_POLLMASK;
}
return EPOLLERR;
@@ -2953,7 +2953,7 @@ fuse_direct_IO(struct kiocb *iocb, struct iov_iter *iter)
* By default, we want to optimize all I/Os with async request
* submission to the client filesystem if supported.
*/
- io->async = ff->fm->fc->async_dio;
+ io->async = ff->fm->fc->flags.async_dio;
io->iocb = iocb;
io->blocking = is_sync_kiocb(iocb);

@@ -3046,7 +3046,7 @@ static long fuse_file_fallocate(struct file *file, int mode, loff_t offset,
FALLOC_FL_ZERO_RANGE))
return -EOPNOTSUPP;

- if (fm->fc->no_fallocate)
+ if (fm->fc->flags.no_fallocate)
return -EOPNOTSUPP;

inode_lock(inode);
@@ -3086,7 +3086,7 @@ static long fuse_file_fallocate(struct file *file, int mode, loff_t offset,
args.in_args[0].value = &inarg;
err = fuse_simple_request(fm, &args);
if (err == -ENOSYS) {
- fm->fc->no_fallocate = 1;
+ fm->fc->flags.no_fallocate = 1;
err = -EOPNOTSUPP;
}
if (err)
@@ -3142,10 +3142,10 @@ static ssize_t __fuse_copy_file_range(struct file *file_in, loff_t pos_in,
ssize_t err;
/* mark unstable when write-back is not used, and file_out gets
* extended */
- bool is_unstable = (!fc->writeback_cache) &&
+ bool is_unstable = (!fc->flags.writeback_cache) &&
((pos_out + len) > inode_out->i_size);

- if (fc->no_copy_file_range)
+ if (fc->flags.no_copy_file_range)
return -EOPNOTSUPP;

if (file_inode(file_in)->i_sb != file_inode(file_out)->i_sb)
@@ -3198,7 +3198,7 @@ static ssize_t __fuse_copy_file_range(struct file *file_in, loff_t pos_in,
args.out_args[0].value = &outarg;
err = fuse_simple_request(fm, &args);
if (err == -ENOSYS) {
- fc->no_copy_file_range = 1;
+ fc->flags.no_copy_file_range = 1;
err = -EOPNOTSUPP;
}
if (err)
diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
index ce154e7ab715..4f4a6f912c7c 100644
--- a/fs/fuse/fuse_i.h
+++ b/fs/fuse/fuse_i.h
@@ -543,6 +543,122 @@ struct fuse_sync_bucket {
struct rcu_head rcu;
};

+/**
+ * A Fuse connection.
+ *
+ * This structure is created, when the root filesystem is mounted, and
+ * is destroyed, when the client device is closed and the last
+ * fuse_mount is destroyed.
+ */
+struct fuse_conn_flags {
+ /** Do readahead asynchronously? Only set in INIT */
+ unsigned async_read:1;
+
+ /** Do not send separate SETATTR request before open(O_TRUNC) */
+ unsigned atomic_o_trunc:1;
+
+ /** Filesystem supports NFS exporting. Only set in INIT */
+ unsigned export_support:1;
+
+ /** write-back cache policy (default is write-through) */
+ unsigned writeback_cache:1;
+
+ /** allow parallel lookups and readdir (default is serialized) */
+ unsigned parallel_dirops:1;
+
+ /** handle fs handles killing suid/sgid/cap on write/chown/trunc */
+ unsigned handle_killpriv:1;
+
+ /** cache READLINK responses in page cache */
+ unsigned cache_symlinks:1;
+
+ /*
+ * The following bitfields are only for optimization purposes
+ * and hence races in setting them will not cause malfunction
+ */
+
+ /** Is open/release not implemented by fs? */
+ unsigned no_open:1;
+
+ /** Is opendir/releasedir not implemented by fs? */
+ unsigned no_opendir:1;
+
+ /** Is fsync not implemented by fs? */
+ unsigned no_fsync:1;
+
+ /** Is fsyncdir not implemented by fs? */
+ unsigned no_fsyncdir:1;
+
+ /** Is flush not implemented by fs? */
+ unsigned no_flush:1;
+
+ /** Is setxattr not implemented by fs? */
+ unsigned no_setxattr:1;
+
+ /** Does file server support extended setxattr */
+ unsigned setxattr_ext:1;
+
+ /** Is getxattr not implemented by fs? */
+ unsigned no_getxattr:1;
+
+ /** Is listxattr not implemented by fs? */
+ unsigned no_listxattr:1;
+
+ /** Is removexattr not implemented by fs? */
+ unsigned no_removexattr:1;
+
+ /** Are posix file locking primitives not implemented by fs? */
+ unsigned no_lock:1;
+
+ /** Is access not implemented by fs? */
+ unsigned no_access:1;
+
+ /** Is create not implemented by fs? */
+ unsigned no_create:1;
+
+ /** Is interrupt not implemented by fs? */
+ unsigned no_interrupt:1;
+
+ /** Is bmap not implemented by fs? */
+ unsigned no_bmap:1;
+
+ /** Is poll not implemented by fs? */
+ unsigned no_poll:1;
+
+ /** Do multi-page cached writes */
+ unsigned big_writes:1;
+
+ /** Are BSD file locking primitives not implemented by fs? */
+ unsigned no_flock:1;
+
+ /** Is fallocate not implemented by fs? */
+ unsigned no_fallocate:1;
+
+ /** Is rename with flags implemented by fs? */
+ unsigned no_rename2:1;
+
+ /** Use enhanced/automatic page cache invalidation. */
+ unsigned auto_inval_data:1;
+
+ /** Filesystem is fully responsible for page cache invalidation. */
+ unsigned explicit_inval_data:1;
+
+ /** Does the filesystem support readdirplus? */
+ unsigned do_readdirplus:1;
+
+ /** Does the filesystem want adaptive readdirplus? */
+ unsigned readdirplus_auto:1;
+
+ /** Does the filesystem support asynchronous direct-IO submission? */
+ unsigned async_dio:1;
+
+ /** Is lseek not implemented by fs? */
+ unsigned no_lseek:1;
+
+ /** Does the filesystem support copy_file_range? */
+ unsigned no_copy_file_range:1;
+};
+
/**
* A Fuse connection.
*
@@ -641,30 +757,9 @@ struct fuse_conn {
/** Connection successful. Only set in INIT */
unsigned conn_init:1;

- /** Do readahead asynchronously? Only set in INIT */
- unsigned async_read:1;
-
/** Return an unique read error after abort. Only set in INIT */
unsigned abort_err:1;

- /** Do not send separate SETATTR request before open(O_TRUNC) */
- unsigned atomic_o_trunc:1;
-
- /** Filesystem supports NFS exporting. Only set in INIT */
- unsigned export_support:1;
-
- /** write-back cache policy (default is write-through) */
- unsigned writeback_cache:1;
-
- /** allow parallel lookups and readdir (default is serialized) */
- unsigned parallel_dirops:1;
-
- /** handle fs handles killing suid/sgid/cap on write/chown/trunc */
- unsigned handle_killpriv:1;
-
- /** cache READLINK responses in page cache */
- unsigned cache_symlinks:1;
-
/* show legacy mount options */
unsigned int legacy_opts_show:1;

@@ -676,92 +771,9 @@ struct fuse_conn {
*/
unsigned handle_killpriv_v2:1;

- /*
- * The following bitfields are only for optimization purposes
- * and hence races in setting them will not cause malfunction
- */
-
- /** Is open/release not implemented by fs? */
- unsigned no_open:1;
-
- /** Is opendir/releasedir not implemented by fs? */
- unsigned no_opendir:1;
-
- /** Is fsync not implemented by fs? */
- unsigned no_fsync:1;
-
- /** Is fsyncdir not implemented by fs? */
- unsigned no_fsyncdir:1;
-
- /** Is flush not implemented by fs? */
- unsigned no_flush:1;
-
- /** Is setxattr not implemented by fs? */
- unsigned no_setxattr:1;
-
- /** Does file server support extended setxattr */
- unsigned setxattr_ext:1;
-
- /** Is getxattr not implemented by fs? */
- unsigned no_getxattr:1;
-
- /** Is listxattr not implemented by fs? */
- unsigned no_listxattr:1;
-
- /** Is removexattr not implemented by fs? */
- unsigned no_removexattr:1;
-
- /** Are posix file locking primitives not implemented by fs? */
- unsigned no_lock:1;
-
- /** Is access not implemented by fs? */
- unsigned no_access:1;
-
- /** Is create not implemented by fs? */
- unsigned no_create:1;
-
- /** Is interrupt not implemented by fs? */
- unsigned no_interrupt:1;
-
- /** Is bmap not implemented by fs? */
- unsigned no_bmap:1;
-
- /** Is poll not implemented by fs? */
- unsigned no_poll:1;
-
- /** Do multi-page cached writes */
- unsigned big_writes:1;
-
/** Don't apply umask to creation modes */
unsigned dont_mask:1;

- /** Are BSD file locking primitives not implemented by fs? */
- unsigned no_flock:1;
-
- /** Is fallocate not implemented by fs? */
- unsigned no_fallocate:1;
-
- /** Is rename with flags implemented by fs? */
- unsigned no_rename2:1;
-
- /** Use enhanced/automatic page cache invalidation. */
- unsigned auto_inval_data:1;
-
- /** Filesystem is fully responsible for page cache invalidation. */
- unsigned explicit_inval_data:1;
-
- /** Does the filesystem support readdirplus? */
- unsigned do_readdirplus:1;
-
- /** Does the filesystem want adaptive readdirplus? */
- unsigned readdirplus_auto:1;
-
- /** Does the filesystem support asynchronous direct-IO submission? */
- unsigned async_dio:1;
-
- /** Is lseek not implemented by fs? */
- unsigned no_lseek:1;
-
/** Does the filesystem support posix acls? */
unsigned posix_acl:1;

@@ -771,9 +783,6 @@ struct fuse_conn {
/** Allow other than the mounter user to access the filesystem ? */
unsigned allow_other:1;

- /** Does the filesystem support copy_file_range? */
- unsigned no_copy_file_range:1;
-
/* Send DESTROY request */
unsigned int destroy:1;

@@ -804,6 +813,8 @@ struct fuse_conn {
/* Is tmpfile not implemented by fs? */
unsigned int no_tmpfile:1;

+ struct fuse_conn_flags flags;
+
/** The number of requests waiting for completion */
atomic_t num_waiting;

diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
index 33a108cfcefe..c3109e016494 100644
--- a/fs/fuse/inode.c
+++ b/fs/fuse/inode.c
@@ -224,7 +224,7 @@ u32 fuse_get_cache_mask(struct inode *inode)
{
struct fuse_conn *fc = get_fuse_conn(inode);

- if (!fc->writeback_cache || !S_ISREG(inode->i_mode))
+ if (!fc->flags.writeback_cache || !S_ISREG(inode->i_mode))
return 0;

return STATX_MTIME | STATX_CTIME | STATX_SIZE;
@@ -282,9 +282,9 @@ void fuse_change_attributes(struct inode *inode, struct fuse_attr *attr,

if (oldsize != attr->size) {
truncate_pagecache(inode, attr->size);
- if (!fc->explicit_inval_data)
+ if (!fc->flags.explicit_inval_data)
inval = true;
- } else if (fc->auto_inval_data) {
+ } else if (fc->flags.auto_inval_data) {
struct timespec64 new_mtime = {
.tv_sec = attr->mtime,
.tv_nsec = attr->mtimensec,
@@ -380,7 +380,7 @@ struct inode *fuse_iget(struct super_block *sb, u64 nodeid,

if ((inode->i_state & I_NEW)) {
inode->i_flags |= S_NOATIME;
- if (!fc->writeback_cache || !S_ISREG(attr->mode))
+ if (!fc->flags.writeback_cache || !S_ISREG(attr->mode))
inode->i_flags |= S_NOCMTIME;
inode->i_generation = generation;
fuse_init_inode(inode, attr);
@@ -459,7 +459,7 @@ bool fuse_lock_inode(struct inode *inode)
{
bool locked = false;

- if (!get_fuse_conn(inode)->parallel_dirops) {
+ if (!get_fuse_conn(inode)->flags.parallel_dirops) {
mutex_lock(&get_fuse_inode(inode)->mutex);
locked = true;
}
@@ -911,7 +911,7 @@ static struct dentry *fuse_get_dentry(struct super_block *sb,
struct fuse_entry_out outarg;
const struct qstr name = QSTR_INIT(".", 1);

- if (!fc->export_support)
+ if (!fc->flags.export_support)
goto out_err;

err = fuse_lookup_name(sb, handle->nodeid, &name, &outarg,
@@ -1011,7 +1011,7 @@ static struct dentry *fuse_get_parent(struct dentry *child)
struct fuse_entry_out outarg;
int err;

- if (!fc->export_support)
+ if (!fc->flags.export_support)
return ERR_PTR(-ESTALE);

err = fuse_lookup_name(child_inode->i_sb, get_node_id(child_inode),
@@ -1127,44 +1127,44 @@ static void process_init_reply(struct fuse_mount *fm, struct fuse_args *args,

ra_pages = arg->max_readahead / PAGE_SIZE;
if (flags & FUSE_ASYNC_READ)
- fc->async_read = 1;
+ fc->flags.async_read = 1;
if (!(flags & FUSE_POSIX_LOCKS))
- fc->no_lock = 1;
+ fc->flags.no_lock = 1;
if (arg->minor >= 17) {
if (!(flags & FUSE_FLOCK_LOCKS))
- fc->no_flock = 1;
+ fc->flags.no_flock = 1;
} else {
if (!(flags & FUSE_POSIX_LOCKS))
- fc->no_flock = 1;
+ fc->flags.no_flock = 1;
}
if (flags & FUSE_ATOMIC_O_TRUNC)
- fc->atomic_o_trunc = 1;
+ fc->flags.atomic_o_trunc = 1;
if (arg->minor >= 9) {
/* LOOKUP has dependency on proto version */
if (flags & FUSE_EXPORT_SUPPORT)
- fc->export_support = 1;
+ fc->flags.export_support = 1;
}
if (flags & FUSE_BIG_WRITES)
- fc->big_writes = 1;
+ fc->flags.big_writes = 1;
if (flags & FUSE_DONT_MASK)
fc->dont_mask = 1;
if (flags & FUSE_AUTO_INVAL_DATA)
- fc->auto_inval_data = 1;
+ fc->flags.auto_inval_data = 1;
else if (flags & FUSE_EXPLICIT_INVAL_DATA)
- fc->explicit_inval_data = 1;
+ fc->flags.explicit_inval_data = 1;
if (flags & FUSE_DO_READDIRPLUS) {
- fc->do_readdirplus = 1;
+ fc->flags.do_readdirplus = 1;
if (flags & FUSE_READDIRPLUS_AUTO)
- fc->readdirplus_auto = 1;
+ fc->flags.readdirplus_auto = 1;
}
if (flags & FUSE_ASYNC_DIO)
- fc->async_dio = 1;
+ fc->flags.async_dio = 1;
if (flags & FUSE_WRITEBACK_CACHE)
- fc->writeback_cache = 1;
+ fc->flags.writeback_cache = 1;
if (flags & FUSE_PARALLEL_DIROPS)
- fc->parallel_dirops = 1;
+ fc->flags.parallel_dirops = 1;
if (flags & FUSE_HANDLE_KILLPRIV)
- fc->handle_killpriv = 1;
+ fc->flags.handle_killpriv = 1;
if (arg->time_gran && arg->time_gran <= 1000000000)
fm->sb->s_time_gran = arg->time_gran;
if ((flags & FUSE_POSIX_ACL)) {
@@ -1173,7 +1173,7 @@ static void process_init_reply(struct fuse_mount *fm, struct fuse_args *args,
fm->sb->s_xattr = fuse_acl_xattr_handlers;
}
if (flags & FUSE_CACHE_SYMLINKS)
- fc->cache_symlinks = 1;
+ fc->flags.cache_symlinks = 1;
if (flags & FUSE_ABORT_ERROR)
fc->abort_err = 1;
if (flags & FUSE_MAX_PAGES) {
@@ -1194,15 +1194,15 @@ static void process_init_reply(struct fuse_mount *fm, struct fuse_args *args,
fm->sb->s_flags |= SB_NOSEC;
}
if (flags & FUSE_SETXATTR_EXT)
- fc->setxattr_ext = 1;
+ fc->flags.setxattr_ext = 1;
if (flags & FUSE_SECURITY_CTX)
fc->init_security = 1;
if (flags & FUSE_CREATE_SUPP_GROUP)
fc->create_supp_group = 1;
} else {
ra_pages = fc->max_read / PAGE_SIZE;
- fc->no_lock = 1;
- fc->no_flock = 1;
+ fc->flags.no_lock = 1;
+ fc->flags.no_flock = 1;
}

fm->sb->s_bdi->ra_pages =
diff --git a/fs/fuse/readdir.c b/fs/fuse/readdir.c
index dc603479b30e..2a5bfb52ebf3 100644
--- a/fs/fuse/readdir.c
+++ b/fs/fuse/readdir.c
@@ -18,9 +18,9 @@ static bool fuse_use_readdirplus(struct inode *dir, struct dir_context *ctx)
struct fuse_conn *fc = get_fuse_conn(dir);
struct fuse_inode *fi = get_fuse_inode(dir);

- if (!fc->do_readdirplus)
+ if (!fc->flags.do_readdirplus)
return false;
- if (!fc->readdirplus_auto)
+ if (!fc->flags.readdirplus_auto)
return true;
if (test_and_clear_bit(FUSE_I_ADVISE_RDPLUS, &fi->state))
return true;
@@ -246,7 +246,7 @@ static int fuse_direntplus_link(struct file *file,
if (IS_ERR(dentry))
return PTR_ERR(dentry);
}
- if (fc->readdirplus_auto)
+ if (fc->flags.readdirplus_auto)
set_bit(FUSE_I_INIT_RDPLUS, &get_fuse_inode(inode)->state);
fuse_change_entry_timeout(dentry, o);

@@ -455,7 +455,7 @@ static int fuse_readdir_cached(struct file *file, struct dir_context *ctx)
* We're just about to start reading into the cache or reading the
* cache; both cases require an up-to-date mtime value.
*/
- if (!ctx->pos && fc->auto_inval_data) {
+ if (!ctx->pos && fc->flags.auto_inval_data) {
int err = fuse_update_attributes(inode, file, STATX_MTIME);

if (err)
diff --git a/fs/fuse/xattr.c b/fs/fuse/xattr.c
index 0d3e7177fce0..13245c11ce25 100644
--- a/fs/fuse/xattr.c
+++ b/fs/fuse/xattr.c
@@ -19,7 +19,7 @@ int fuse_setxattr(struct inode *inode, const char *name, const void *value,
struct fuse_setxattr_in inarg;
int err;

- if (fm->fc->no_setxattr)
+ if (fm->fc->flags.no_setxattr)
return -EOPNOTSUPP;

memset(&inarg, 0, sizeof(inarg));
@@ -30,7 +30,7 @@ int fuse_setxattr(struct inode *inode, const char *name, const void *value,
args.opcode = FUSE_SETXATTR;
args.nodeid = get_node_id(inode);
args.in_numargs = 3;
- args.in_args[0].size = fm->fc->setxattr_ext ?
+ args.in_args[0].size = fm->fc->flags.setxattr_ext ?
sizeof(inarg) : FUSE_COMPAT_SETXATTR_IN_SIZE;
args.in_args[0].value = &inarg;
args.in_args[1].size = strlen(name) + 1;
@@ -39,7 +39,7 @@ int fuse_setxattr(struct inode *inode, const char *name, const void *value,
args.in_args[2].value = value;
err = fuse_simple_request(fm, &args);
if (err == -ENOSYS) {
- fm->fc->no_setxattr = 1;
+ fm->fc->flags.no_setxattr = 1;
err = -EOPNOTSUPP;
}
if (!err)
@@ -57,7 +57,7 @@ ssize_t fuse_getxattr(struct inode *inode, const char *name, void *value,
struct fuse_getxattr_out outarg;
ssize_t ret;

- if (fm->fc->no_getxattr)
+ if (fm->fc->flags.no_getxattr)
return -EOPNOTSUPP;

memset(&inarg, 0, sizeof(inarg));
@@ -83,7 +83,7 @@ ssize_t fuse_getxattr(struct inode *inode, const char *name, void *value,
if (!ret && !size)
ret = min_t(ssize_t, outarg.size, XATTR_SIZE_MAX);
if (ret == -ENOSYS) {
- fm->fc->no_getxattr = 1;
+ fm->fc->flags.no_getxattr = 1;
ret = -EOPNOTSUPP;
}
return ret;
@@ -121,7 +121,7 @@ ssize_t fuse_listxattr(struct dentry *entry, char *list, size_t size)
if (!fuse_allow_current_process(fm->fc))
return -EACCES;

- if (fm->fc->no_listxattr)
+ if (fm->fc->flags.no_listxattr)
return -EOPNOTSUPP;

memset(&inarg, 0, sizeof(inarg));
@@ -147,7 +147,7 @@ ssize_t fuse_listxattr(struct dentry *entry, char *list, size_t size)
if (ret > 0 && size)
ret = fuse_verify_xattr_list(list, ret);
if (ret == -ENOSYS) {
- fm->fc->no_listxattr = 1;
+ fm->fc->flags.no_listxattr = 1;
ret = -EOPNOTSUPP;
}
return ret;
@@ -159,7 +159,7 @@ int fuse_removexattr(struct inode *inode, const char *name)
FUSE_ARGS(args);
int err;

- if (fm->fc->no_removexattr)
+ if (fm->fc->flags.no_removexattr)
return -EOPNOTSUPP;

args.opcode = FUSE_REMOVEXATTR;
@@ -169,7 +169,7 @@ int fuse_removexattr(struct inode *inode, const char *name)
args.in_args[0].value = name;
err = fuse_simple_request(fm, &args);
if (err == -ENOSYS) {
- fm->fc->no_removexattr = 1;
+ fm->fc->flags.no_removexattr = 1;
err = -EOPNOTSUPP;
}
if (!err)
--
2.34.1


2023-02-20 19:39:18

by Aleksandr Mikhalitsyn

[permalink] [raw]
Subject: [RFC PATCH 8/9] namespace: add sb_revalidate_bindmounts helper

Useful if for some reason bindmounts root dentries get invalidated
but it's needed to revalidate existing bindmounts without remounting.

Cc: Miklos Szeredi <[email protected]>
Cc: Al Viro <[email protected]>
Cc: Amir Goldstein <[email protected]>
Cc: Stéphane Graber <[email protected]>
Cc: Seth Forshee <[email protected]>
Cc: Christian Brauner <[email protected]>
Cc: Andrei Vagin <[email protected]>
Cc: Pavel Tikhomirov <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Signed-off-by: Alexander Mikhalitsyn <[email protected]>
---
fs/namespace.c | 90 +++++++++++++++++++++++++++++++++++
include/linux/mnt_namespace.h | 3 ++
2 files changed, 93 insertions(+)

diff --git a/fs/namespace.c b/fs/namespace.c
index ab467ee58341..88491f9c8bbd 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -682,6 +682,96 @@ static int mnt_make_readonly(struct mount *mnt)
return ret;
}

+struct bind_mount_list_item {
+ struct list_head list;
+ struct vfsmount *mnt;
+};
+
+/*
+ * sb_revalidate_bindmounts - Relookup/reset bindmounts root dentries
+ *
+ * Useful if for some reason bindmount root dentries get invalidated
+ * but it's needed to revalidate existing bindmounts without remounting.
+ */
+int sb_revalidate_bindmounts(struct super_block *sb)
+{
+ struct mount *mnt;
+ struct bind_mount_list_item *bmnt, *next;
+ int err = 0;
+ struct vfsmount *root_mnt = NULL;
+ LIST_HEAD(mnt_to_update);
+ char *buf;
+
+ buf = (char *) __get_free_page(GFP_KERNEL);
+ if (!buf)
+ return -ENOMEM;
+
+ lock_mount_hash();
+ list_for_each_entry(mnt, &sb->s_mounts, mnt_instance) {
+ /* we only want to touch bindmounts */
+ if (mnt->mnt.mnt_root == sb->s_root) {
+ if (!root_mnt)
+ root_mnt = mntget(&mnt->mnt);
+
+ continue;
+ }
+
+ bmnt = kzalloc(sizeof(struct bind_mount_list_item), GFP_NOWAIT | __GFP_NOWARN);
+ if (!bmnt) {
+ err = -ENOMEM;
+ goto exit;
+ }
+
+ bmnt->mnt = mntget(&mnt->mnt);
+ list_add_tail(&bmnt->list, &mnt_to_update);
+ }
+ unlock_mount_hash();
+
+ /* TODO: get rid of this limitation */
+ if (!root_mnt) {
+ err = -ENOENT;
+ goto exit;
+ }
+
+ list_for_each_entry_safe(bmnt, next, &mnt_to_update, list) {
+ struct vfsmount *cur_mnt = bmnt->mnt;
+ struct path path;
+ struct dentry *old_root;
+ char *p;
+
+ p = dentry_path(cur_mnt->mnt_root, buf, PAGE_SIZE);
+ if (IS_ERR(p))
+ goto exit;
+
+ /* TODO: are these lookup flags fully safe and correct? */
+ err = vfs_path_lookup(root_mnt->mnt_root, root_mnt,
+ p, LOOKUP_FOLLOW|LOOKUP_AUTOMOUNT|LOOKUP_REVAL, &path);
+ if (err)
+ goto exit;
+
+ /* replace bindmount root dentry */
+ lock_mount_hash();
+ old_root = cur_mnt->mnt_root;
+ cur_mnt->mnt_root = dget(path.dentry);
+ dput(old_root);
+ unlock_mount_hash();
+
+ path_put(&path);
+ }
+
+exit:
+ free_page((unsigned long) buf);
+ mntput(root_mnt);
+ list_for_each_entry_safe(bmnt, next, &mnt_to_update, list) {
+ list_del(&bmnt->list);
+ mntput(bmnt->mnt);
+ kfree(bmnt);
+ }
+
+ return err;
+}
+EXPORT_SYMBOL_GPL(sb_revalidate_bindmounts);
+
int sb_prepare_remount_readonly(struct super_block *sb)
{
struct mount *mnt;
diff --git a/include/linux/mnt_namespace.h b/include/linux/mnt_namespace.h
index 8f882f5881e8..20ac29e702f5 100644
--- a/include/linux/mnt_namespace.h
+++ b/include/linux/mnt_namespace.h
@@ -3,6 +3,7 @@
#define _NAMESPACE_H_
#ifdef __KERNEL__

+struct super_block;
struct mnt_namespace;
struct fs_struct;
struct user_namespace;
@@ -13,6 +14,8 @@ extern struct mnt_namespace *copy_mnt_ns(unsigned long, struct mnt_namespace *,
extern void put_mnt_ns(struct mnt_namespace *ns);
extern struct ns_common *from_mnt_ns(struct mnt_namespace *);

+extern int sb_revalidate_bindmounts(struct super_block *sb);
+
extern const struct file_operations proc_mounts_operations;
extern const struct file_operations proc_mountinfo_operations;
extern const struct file_operations proc_mountstats_operations;
--
2.34.1


2023-02-20 19:39:21

by Aleksandr Mikhalitsyn

[permalink] [raw]
Subject: [RFC PATCH 7/9] fuse: add fuse device ioctl(FUSE_DEV_IOC_REINIT)

This ioctl aborts fuse connection and then reinitializes it,
sends FUSE_INIT request to allow a new userspace daemon
to pick up the fuse connection.

Cc: Miklos Szeredi <[email protected]>
Cc: Al Viro <[email protected]>
Cc: Amir Goldstein <[email protected]>
Cc: Stéphane Graber <[email protected]>
Cc: Seth Forshee <[email protected]>
Cc: Christian Brauner <[email protected]>
Cc: Andrei Vagin <[email protected]>
Cc: Pavel Tikhomirov <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Signed-off-by: Alexander Mikhalitsyn <[email protected]>
---
fs/fuse/dev.c | 132 ++++++++++++++++++++++++++++++++++++++
include/uapi/linux/fuse.h | 1 +
2 files changed, 133 insertions(+)

diff --git a/fs/fuse/dev.c b/fs/fuse/dev.c
index 737764c2295e..0f53ffd63957 100644
--- a/fs/fuse/dev.c
+++ b/fs/fuse/dev.c
@@ -2187,6 +2187,112 @@ void fuse_abort_conn(struct fuse_conn *fc)
}
EXPORT_SYMBOL_GPL(fuse_abort_conn);

+static int fuse_reinit_conn(struct fuse_conn *fc)
+{
+ struct fuse_iqueue *fiq = &fc->iq;
+ struct fuse_dev *fud;
+ unsigned int i;
+
+ if (fc->conn_gen + 1 < fc->conn_gen)
+ return -EOVERFLOW;
+
+ fuse_abort_conn(fc);
+ fuse_wait_aborted(fc);
+
+ spin_lock(&fc->lock);
+ if (fc->connected) {
+ spin_unlock(&fc->lock);
+ return -EINVAL;
+ }
+
+ if (fc->conn_gen + 1 < fc->conn_gen) {
+ spin_unlock(&fc->lock);
+ return -EOVERFLOW;
+ }
+
+ fc->conn_gen++;
+
+ spin_lock(&fiq->lock);
+ if (request_pending(fiq) || fiq->forget_list_tail != &fiq->forget_list_head) {
+ spin_unlock(&fiq->lock);
+ spin_unlock(&fc->lock);
+ return -EINVAL;
+ }
+
+ if (&fuse_dev_fiq_ops != fiq->ops) {
+ spin_unlock(&fiq->lock);
+ spin_unlock(&fc->lock);
+ return -EOPNOTSUPP;
+ }
+
+ fiq->connected = 1;
+ spin_unlock(&fiq->lock);
+
+ spin_lock(&fc->bg_lock);
+ if (!list_empty(&fc->bg_queue)) {
+ spin_unlock(&fc->bg_lock);
+ spin_unlock(&fc->lock);
+ return -EINVAL;
+ }
+
+ fc->blocked = 0;
+ fc->max_background = FUSE_DEFAULT_MAX_BACKGROUND;
+ spin_unlock(&fc->bg_lock);
+
+ list_for_each_entry(fud, &fc->devices, entry) {
+ struct fuse_pqueue *fpq = &fud->pq;
+
+ spin_lock(&fpq->lock);
+ if (!list_empty(&fpq->io)) {
+ spin_unlock(&fpq->lock);
+ spin_unlock(&fc->lock);
+ return -EINVAL;
+ }
+
+ for (i = 0; i < FUSE_PQ_HASH_SIZE; i++) {
+ if (!list_empty(&fpq->processing[i])) {
+ spin_unlock(&fpq->lock);
+ spin_unlock(&fc->lock);
+ return -EINVAL;
+ }
+ }
+
+ fpq->connected = 1;
+ spin_unlock(&fpq->lock);
+ }
+
+ fuse_set_initialized(fc);
+
+ /* Background queuing checks fc->connected under bg_lock */
+ spin_lock(&fc->bg_lock);
+ fc->connected = 1;
+ spin_unlock(&fc->bg_lock);
+
+ fc->aborted = false;
+ fc->abort_err = 0;
+
+ /* nullify all the flags */
+ memset(&fc->flags, 0, sizeof(struct fuse_conn_flags));
+
+ spin_unlock(&fc->lock);
+
+ down_read(&fc->killsb);
+ if (!list_empty(&fc->mounts)) {
+ struct fuse_mount *fm;
+
+ fm = list_first_entry(&fc->mounts, struct fuse_mount, fc_entry);
+ if (!fm->sb) {
+ up_read(&fc->killsb);
+ return -EINVAL;
+ }
+
+ fuse_send_init(fm);
+ }
+ up_read(&fc->killsb);
+
+ return 0;
+}
+
void fuse_wait_aborted(struct fuse_conn *fc)
{
/* matches implicit memory barrier in fuse_drop_waiting() */
@@ -2282,6 +2388,32 @@ static long fuse_dev_ioctl(struct file *file, unsigned int cmd,
}
}
break;
+ case FUSE_DEV_IOC_REINIT:
+ struct fuse_conn *fc;
+
+ if (!checkpoint_restore_ns_capable(file->f_cred->user_ns))
+ return -EPERM;
+
+ res = -EINVAL;
+ fud = fuse_get_dev(file);
+
+ /*
+ * Only fuse mounts with an already initialized fuse
+ * connection are supported
+ */
+ if (file->f_op == &fuse_dev_operations && fud) {
+ mutex_lock(&fuse_mutex);
+ fc = fud->fc;
+ if (fc)
+ fc = fuse_conn_get(fc);
+ mutex_unlock(&fuse_mutex);
+
+ if (fc) {
+ res = fuse_reinit_conn(fc);
+ fuse_conn_put(fc);
+ }
+ }
+ break;
default:
res = -ENOTTY;
break;
diff --git a/include/uapi/linux/fuse.h b/include/uapi/linux/fuse.h
index 1b9d0dfae72d..3dac67b25eae 100644
--- a/include/uapi/linux/fuse.h
+++ b/include/uapi/linux/fuse.h
@@ -989,6 +989,7 @@ struct fuse_notify_retrieve_in {
/* Device ioctls: */
#define FUSE_DEV_IOC_MAGIC 229
#define FUSE_DEV_IOC_CLONE _IOR(FUSE_DEV_IOC_MAGIC, 0, uint32_t)
+#define FUSE_DEV_IOC_REINIT _IO(FUSE_DEV_IOC_MAGIC, 0)

struct fuse_lseek_in {
uint64_t fh;
--
2.34.1


2023-02-20 19:39:24

by Aleksandr Mikhalitsyn

[permalink] [raw]
Subject: [RFC PATCH 9/9] fuse: add fuse device ioctl(FUSE_DEV_IOC_BM_REVAL)

This ioctl allows to revalidate all the existing fuse bindmounts
by performing relookup of all root dentries and resetting them.

Useful if it's needed to make fuse bindmounts work without
remounting them after fuse connection reinitialization.

Cc: Miklos Szeredi <[email protected]>
Cc: Al Viro <[email protected]>
Cc: Amir Goldstein <[email protected]>
Cc: Stéphane Graber <[email protected]>
Cc: Seth Forshee <[email protected]>
Cc: Christian Brauner <[email protected]>
Cc: Andrei Vagin <[email protected]>
Cc: Pavel Tikhomirov <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Signed-off-by: Alexander Mikhalitsyn <[email protected]>
---
fs/fuse/dev.c | 29 ++++++++++++++++++++++++++++-
include/uapi/linux/fuse.h | 1 +
2 files changed, 29 insertions(+), 1 deletion(-)

diff --git a/fs/fuse/dev.c b/fs/fuse/dev.c
index 0f53ffd63957..ae301cb8486b 100644
--- a/fs/fuse/dev.c
+++ b/fs/fuse/dev.c
@@ -13,6 +13,7 @@
#include <linux/poll.h>
#include <linux/sched/signal.h>
#include <linux/uio.h>
+#include <linux/mnt_namespace.h>
#include <linux/miscdevice.h>
#include <linux/pagemap.h>
#include <linux/file.h>
@@ -2293,6 +2294,27 @@ static int fuse_reinit_conn(struct fuse_conn *fc)
return 0;
}

+static ssize_t fuse_revalidate_bindmounts(struct fuse_conn *fc)
+{
+ int ret = 0;
+
+ down_read(&fc->killsb);
+ if (!list_empty(&fc->mounts)) {
+ struct fuse_mount *fm;
+
+ fm = list_first_entry(&fc->mounts, struct fuse_mount, fc_entry);
+ if (!fm->sb) {
+ up_read(&fc->killsb);
+ return -EINVAL;
+ }
+
+ ret = sb_revalidate_bindmounts(fm->sb);
+ }
+ up_read(&fc->killsb);
+
+ return ret;
+}
+
void fuse_wait_aborted(struct fuse_conn *fc)
{
/* matches implicit memory barrier in fuse_drop_waiting() */
@@ -2389,6 +2411,7 @@ static long fuse_dev_ioctl(struct file *file, unsigned int cmd,
}
break;
case FUSE_DEV_IOC_REINIT:
+ case FUSE_DEV_IOC_BM_REVAL:
struct fuse_conn *fc;

if (!checkpoint_restore_ns_capable(file->f_cred->user_ns))
@@ -2409,7 +2432,11 @@ static long fuse_dev_ioctl(struct file *file, unsigned int cmd,
mutex_unlock(&fuse_mutex);

if (fc) {
- res = fuse_reinit_conn(fc);
+ if (cmd == FUSE_DEV_IOC_REINIT)
+ res = fuse_reinit_conn(fc);
+ else if (cmd == FUSE_DEV_IOC_BM_REVAL)
+ res = fuse_revalidate_bindmounts(fc);
+
fuse_conn_put(fc);
}
}
diff --git a/include/uapi/linux/fuse.h b/include/uapi/linux/fuse.h
index 3dac67b25eae..14f7fe4a99cf 100644
--- a/include/uapi/linux/fuse.h
+++ b/include/uapi/linux/fuse.h
@@ -990,6 +990,7 @@ struct fuse_notify_retrieve_in {
#define FUSE_DEV_IOC_MAGIC 229
#define FUSE_DEV_IOC_CLONE _IOR(FUSE_DEV_IOC_MAGIC, 0, uint32_t)
#define FUSE_DEV_IOC_REINIT _IO(FUSE_DEV_IOC_MAGIC, 0)
+#define FUSE_DEV_IOC_BM_REVAL _IO(FUSE_DEV_IOC_MAGIC, 1)

struct fuse_lseek_in {
uint64_t fh;
--
2.34.1


2023-03-03 18:27:12

by Bernd Schubert

[permalink] [raw]
Subject: Re: [RFC PATCH 5/9] fuse: move fuse connection flags to the separate structure



On 2/20/23 20:37, Alexander Mikhalitsyn wrote:
> Let's move all the fuse connection flags that can be safely zeroed
> after connection reinitialization to the separate structure fuse_conn_flags.
>
> All of these flags values are calculated dynamically basing on
> the userspace daemon capabilities (like no_open, no_flush) or on the
> response for FUSE_INIT request.
>

From my point of view this makes the code a bit better readable, in
general.

[...]

> };
>
> +/**
> + * A Fuse connection.
> + *
> + * This structure is created, when the root filesystem is mounted, and
> + * is destroyed, when the client device is closed and the last
> + * fuse_mount is destroyed.
> + */
> +struct fuse_conn_flags {
> + /** Do readahead asynchronously? Only set in INIT */
> + unsigned async_read:1;


The comment does not match the struct?


2023-03-03 18:45:37

by Bernd Schubert

[permalink] [raw]
Subject: Re: [RFC PATCH 6/9] fuse: take fuse connection generation into account


[...]
> diff --git a/fs/fuse/file.c b/fs/fuse/file.c
> index d5b30faff0b9..be9086a1868d 100644
> --- a/fs/fuse/file.c
> +++ b/fs/fuse/file.c
> @@ -110,7 +110,8 @@ static void fuse_file_put(struct fuse_file *ff, bool sync, bool isdir)
> if (refcount_dec_and_test(&ff->count)) {
> struct fuse_args *args = &ff->release_args->args;
>
> - if (isdir ? ff->fm->fc->flags.no_opendir : ff->fm->fc->flags.no_open) {
> + if (fuse_stale_ff(ff) ||
> + (isdir ? ff->fm->fc->flags.no_opendir : ff->fm->fc->flags.no_open)) {
> /* Do nothing when client does not implement 'open' */

The comment does not match anymore.

> fuse_release_end(ff->fm, args, 0);
> } else if (sync) {
> @@ -597,9 +598,10 @@ static int fuse_fsync(struct file *file, loff_t start, loff_t end,
> {
> struct inode *inode = file->f_mapping->host;
> struct fuse_conn *fc = get_fuse_conn(inode);
> + struct fuse_file *ff = file->private_data;
> int err;
>
> - if (fuse_is_bad(inode))
> + if (fuse_stale_ff(ff) || fuse_is_bad(inode))
> return -EIO;
>
> inode_lock(inode);
> diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
> index 4f4a6f912c7c..0643de31674d 100644
> --- a/fs/fuse/fuse_i.h
> +++ b/fs/fuse/fuse_i.h
> @@ -954,7 +954,8 @@ static inline bool fuse_stale_inode(const struct inode *inode, int generation,
> struct fuse_attr *attr)
> {
> return inode->i_generation != generation ||
> - inode_wrong_type(inode, attr->mode);
> + inode_wrong_type(inode, attr->mode) ||
> + fuse_stale_inode_conn(inode);
> }
>
> static inline void fuse_make_bad(struct inode *inode)
> diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
> index c3109e016494..f9dc8971274d 100644
> --- a/fs/fuse/inode.c
> +++ b/fs/fuse/inode.c
> @@ -124,7 +124,7 @@ static void fuse_evict_inode(struct inode *inode)
> fuse_dax_inode_cleanup(inode);
> if (fi->nlookup) {
> fuse_queue_forget(fc, fi->forget, fi->nodeid,
> - fi->nlookup, false);
> + fi->nlookup, fuse_stale_inode_conn(inode));
> fi->forget = NULL;
> }
> }

2023-03-03 18:48:10

by Bernd Schubert

[permalink] [raw]
Subject: Re: [RFC PATCH 4/9] fuse: handle stale inode connection in fuse_queue_forget



On 2/20/23 20:37, Alexander Mikhalitsyn wrote:
> We don't want to send FUSE_FORGET request to the new
> fuse daemon if inode was lookuped by the old fuse daemon
> because it can confuse and break userspace (libfuse).
>
> For now, just add a new argument to fuse_queue_forget and
> handle it. Adjust all callers to match the old behaviour.
>
> Cc: Miklos Szeredi <[email protected]>
> Cc: Al Viro <[email protected]>
> Cc: Amir Goldstein <[email protected]>
> Cc: Stéphane Graber <[email protected]>
> Cc: Seth Forshee <[email protected]>
> Cc: Christian Brauner <[email protected]>
> Cc: Andrei Vagin <[email protected]>
> Cc: Pavel Tikhomirov <[email protected]>
> Cc: [email protected]
> Cc: [email protected]
> Cc: [email protected]
> Signed-off-by: Alexander Mikhalitsyn <[email protected]>
> ---
> fs/fuse/dev.c | 4 ++--
> fs/fuse/dir.c | 8 ++++----
> fs/fuse/fuse_i.h | 2 +-
> fs/fuse/inode.c | 2 +-
> 4 files changed, 8 insertions(+), 8 deletions(-)
>
> diff --git a/fs/fuse/dev.c b/fs/fuse/dev.c
> index eb4f88e3dc97..85f69629f34d 100644
> --- a/fs/fuse/dev.c
> +++ b/fs/fuse/dev.c
> @@ -234,7 +234,7 @@ __releases(fiq->lock)
> }
>
> void fuse_queue_forget(struct fuse_conn *fc, struct fuse_forget_link *forget,
> - u64 nodeid, u64 nlookup)
> + u64 nodeid, u64 nlookup, bool stale_inode_conn)
> {
> struct fuse_iqueue *fiq = &fc->iq;
>
> @@ -242,7 +242,7 @@ void fuse_queue_forget(struct fuse_conn *fc, struct fuse_forget_link *forget,
> forget->forget_one.nlookup = nlookup;
>
> spin_lock(&fiq->lock);
> - if (fiq->connected) {
> + if (fiq->connected && !stale_inode_conn) {
> fiq->forget_list_tail->next = forget;
> fiq->forget_list_tail = forget;
> fiq->ops->wake_forget_and_unlock(fiq);

I'm not sure about kernel coding rules here - for me that is unlikely
rare event - I would have added unlikely() here.

2023-03-03 19:19:32

by Bernd Schubert

[permalink] [raw]
Subject: Re: [RFC PATCH 7/9] fuse: add fuse device ioctl(FUSE_DEV_IOC_REINIT)



On 2/20/23 20:37, Alexander Mikhalitsyn wrote:
> This ioctl aborts fuse connection and then reinitializes it,
> sends FUSE_INIT request to allow a new userspace daemon
> to pick up the fuse connection.
>
> Cc: Miklos Szeredi <[email protected]>
> Cc: Al Viro <[email protected]>
> Cc: Amir Goldstein <[email protected]>
> Cc: Stéphane Graber <[email protected]>
> Cc: Seth Forshee <[email protected]>
> Cc: Christian Brauner <[email protected]>
> Cc: Andrei Vagin <[email protected]>
> Cc: Pavel Tikhomirov <[email protected]>
> Cc: [email protected]
> Cc: [email protected]
> Cc: [email protected]
> Signed-off-by: Alexander Mikhalitsyn <[email protected]>
> ---
> fs/fuse/dev.c | 132 ++++++++++++++++++++++++++++++++++++++
> include/uapi/linux/fuse.h | 1 +
> 2 files changed, 133 insertions(+)
>
> diff --git a/fs/fuse/dev.c b/fs/fuse/dev.c
> index 737764c2295e..0f53ffd63957 100644
> --- a/fs/fuse/dev.c
> +++ b/fs/fuse/dev.c
> @@ -2187,6 +2187,112 @@ void fuse_abort_conn(struct fuse_conn *fc)
> }
> EXPORT_SYMBOL_GPL(fuse_abort_conn);
>
> +static int fuse_reinit_conn(struct fuse_conn *fc)
> +{
> + struct fuse_iqueue *fiq = &fc->iq;
> + struct fuse_dev *fud;
> + unsigned int i;

Assuming you have a malicious daemon that tries to cause bad behavior,
only allow one ioctl at at time? I.e. add a value that reinit is in
progress? And unset at the end of the function?

> +
> + if (fc->conn_gen + 1 < fc->conn_gen)
> + return -EOVERFLOW;
> +

Add a comment, like

/* Unsets fc->connected and fiq->connected and ensures that no new
requests can be queued */

?

> + fuse_abort_conn(fc);
> + fuse_wait_aborted(fc);
> +
> + spin_lock(&fc->lock);
> + if (fc->connected) {
> + spin_unlock(&fc->lock);
> + return -EINVAL;
> + }
> +
> + if (fc->conn_gen + 1 < fc->conn_gen) {
> + spin_unlock(&fc->lock);
> + return -EOVERFLOW;
> + }
> +
> + fc->conn_gen++;
> +
> + spin_lock(&fiq->lock);
> + if (request_pending(fiq) || fiq->forget_list_tail != &fiq->forget_list_head) {
> + spin_unlock(&fiq->lock);
> + spin_unlock(&fc->lock);
> + return -EINVAL;
> + }
> +
> + if (&fuse_dev_fiq_ops != fiq->ops) {
> + spin_unlock(&fiq->lock);
> + spin_unlock(&fc->lock);
> + return -EOPNOTSUPP;
> + }
> +
> + fiq->connected = 1;
> + spin_unlock(&fiq->lock);
> +
> + spin_lock(&fc->bg_lock);
> + if (!list_empty(&fc->bg_queue)) {
> + spin_unlock(&fc->bg_lock);
> + spin_unlock(&fc->lock);
> + return -EINVAL;
> + }
> +
> + fc->blocked = 0;
> + fc->max_background = FUSE_DEFAULT_MAX_BACKGROUND;
> + spin_unlock(&fc->bg_lock);
> +
> + list_for_each_entry(fud, &fc->devices, entry) {
> + struct fuse_pqueue *fpq = &fud->pq;
> +
> + spin_lock(&fpq->lock);
> + if (!list_empty(&fpq->io)) {
> + spin_unlock(&fpq->lock);
> + spin_unlock(&fc->lock);
> + return -EINVAL;
> + }
> +
> + for (i = 0; i < FUSE_PQ_HASH_SIZE; i++) {
> + if (!list_empty(&fpq->processing[i])) {
> + spin_unlock(&fpq->lock);
> + spin_unlock(&fc->lock);
> + return -EINVAL;
> + }
> + }
> +
> + fpq->connected = 1;
> + spin_unlock(&fpq->lock);
> + }
> +
> + fuse_set_initialized(fc);

I'm not sure about this, why not the common way via FUSE_INIT reply?

> +
> + /* Background queuing checks fc->connected under bg_lock */
> + spin_lock(&fc->bg_lock);
> + fc->connected = 1;
> + spin_unlock(&fc->bg_lock);
> +
> + fc->aborted = false;
> + fc->abort_err = 0;
> +
> + /* nullify all the flags */
> + memset(&fc->flags, 0, sizeof(struct fuse_conn_flags));
> +
> + spin_unlock(&fc->lock);
> +
> + down_read(&fc->killsb);
> + if (!list_empty(&fc->mounts)) {
> + struct fuse_mount *fm;
> +
> + fm = list_first_entry(&fc->mounts, struct fuse_mount, fc_entry);
> + if (!fm->sb) {
> + up_read(&fc->killsb);
> + return -EINVAL;
> + }
> +
> + fuse_send_init(fm);
> + }
> + up_read(&fc->killsb);
> +
> + return 0;
> +}
> +
> void fuse_wait_aborted(struct fuse_conn *fc)
> {
> /* matches implicit memory barrier in fuse_drop_waiting() */
> @@ -2282,6 +2388,32 @@ static long fuse_dev_ioctl(struct file *file, unsigned int cmd,
> }
> }
> break;
> + case FUSE_DEV_IOC_REINIT:
> + struct fuse_conn *fc;
> +
> + if (!checkpoint_restore_ns_capable(file->f_cred->user_ns))
> + return -EPERM;
> +
> + res = -EINVAL;
> + fud = fuse_get_dev(file);
> +
> + /*
> + * Only fuse mounts with an already initialized fuse
> + * connection are supported
> + */
> + if (file->f_op == &fuse_dev_operations && fud) {
> + mutex_lock(&fuse_mutex);
> + fc = fud->fc;
> + if (fc)
> + fc = fuse_conn_get(fc);
> + mutex_unlock(&fuse_mutex);
> +
> + if (fc) {
> + res = fuse_reinit_conn(fc);
> + fuse_conn_put(fc);
> + }
> + }
> + break;
> default:
> res = -ENOTTY;
> break;
> diff --git a/include/uapi/linux/fuse.h b/include/uapi/linux/fuse.h
> index 1b9d0dfae72d..3dac67b25eae 100644
> --- a/include/uapi/linux/fuse.h
> +++ b/include/uapi/linux/fuse.h
> @@ -989,6 +989,7 @@ struct fuse_notify_retrieve_in {
> /* Device ioctls: */
> #define FUSE_DEV_IOC_MAGIC 229
> #define FUSE_DEV_IOC_CLONE _IOR(FUSE_DEV_IOC_MAGIC, 0, uint32_t)
> +#define FUSE_DEV_IOC_REINIT _IO(FUSE_DEV_IOC_MAGIC, 0)
>
> struct fuse_lseek_in {
> uint64_t fh;

2023-03-03 19:27:18

by Bernd Schubert

[permalink] [raw]
Subject: Re: [RFC PATCH 7/9] fuse: add fuse device ioctl(FUSE_DEV_IOC_REINIT)



On 2/20/23 20:37, Alexander Mikhalitsyn wrote:
> This ioctl aborts fuse connection and then reinitializes it,
> sends FUSE_INIT request to allow a new userspace daemon
> to pick up the fuse connection.
>
> Cc: Miklos Szeredi <[email protected]>
> Cc: Al Viro <[email protected]>
> Cc: Amir Goldstein <[email protected]>
> Cc: Stéphane Graber <[email protected]>
> Cc: Seth Forshee <[email protected]>
> Cc: Christian Brauner <[email protected]>
> Cc: Andrei Vagin <[email protected]>
> Cc: Pavel Tikhomirov <[email protected]>
> Cc: [email protected]
> Cc: [email protected]
> Cc: [email protected]
> Signed-off-by: Alexander Mikhalitsyn <[email protected]>
> ---
> fs/fuse/dev.c | 132 ++++++++++++++++++++++++++++++++++++++
> include/uapi/linux/fuse.h | 1 +
> 2 files changed, 133 insertions(+)
>
> diff --git a/fs/fuse/dev.c b/fs/fuse/dev.c
> index 737764c2295e..0f53ffd63957 100644
> --- a/fs/fuse/dev.c
> +++ b/fs/fuse/dev.c
> @@ -2187,6 +2187,112 @@ void fuse_abort_conn(struct fuse_conn *fc)
> }
> EXPORT_SYMBOL_GPL(fuse_abort_conn);
>
> +static int fuse_reinit_conn(struct fuse_conn *fc)
> +{
> + struct fuse_iqueue *fiq = &fc->iq;
> + struct fuse_dev *fud;
> + unsigned int i;
> +
> + if (fc->conn_gen + 1 < fc->conn_gen)
> + return -EOVERFLOW;
> +
> + fuse_abort_conn(fc);
> + fuse_wait_aborted(fc);

Shouldn't this also try to flush all data first?


2023-03-03 19:43:10

by Bernd Schubert

[permalink] [raw]
Subject: Re: [RFC PATCH 0/9] fuse: API for Checkpoint/Restore



On 2/20/23 20:37, Alexander Mikhalitsyn wrote:
> Hello everyone,
>
> It would be great to hear your comments regarding this proof-of-concept Checkpoint/Restore API for FUSE.
>
> Support of FUSE C/R is a challenging task for CRIU [1]. Last year I've given a brief talk on LPC 2022
> about how we handle files C/R in CRIU and which blockers we have for FUSE filesystems. [2]
>
> The main problem for CRIU is that we have to restore mount namespaces and memory mappings before the process tree.
> It means that when CRIU is performing mount of fuse filesystem it can't use the original FUSE daemon from the
> restorable process tree, but instead use a "fake daemon".
>
> This leads to many other technical problems:
> * "fake" daemon has to reply to FUSE_INIT request from the kernel and initialize fuse connection somehow.
> This setup can be not consistent with the original daemon (protocol version, daemon capabilities/settings
> like no_open, no_flush, readahead, and so on).
> * each fuse request has a unique ID. It could confuse userspace if this unique ID sequence was reset.
>
> We can workaround some issues and implement fragile and limited support of FUSE in CRIU but it doesn't make any sense, IMHO.
> Btw, I've enumerated only CRIU restore-stage problems there. The dump stage is another story...
>
> My proposal is not only about CRIU. The same interface can be useful for FUSE mounts recovery after daemon crashes.
> LXC project uses LXCFS [3] as a procfs/cgroupfs/sysfs emulation layer for containers. We are using a scheme when
> one LXCFS daemon handles all the work for all the containers and we use bindmounts to overmount particular
> files/directories in procfs/cgroupfs/sysfs. If this single daemon crashes for some reason we are in trouble,
> because we have to restart all the containers (fuse bindmounts become invalid after the crash).
> The solution is fairly easy:
> allow somehow to reinitialize the existing fuse connection and replace the daemon on the fly
> This case is a little bit simpler than CRIU cause we don't need to care about the previously opened files
> and other stuff, we are only interested in mounts.


I like your patches, small and easy to read :)
So this basically fails all existing open files - our (future) needs go
beyond that. I wonder if we can extend it later and re-init the new
daemon with something like "fuse_queue_recall" - basically the opposite
of fuse_queue_forget. Not sure if fuse can access the vfs dentry cache
to know for which files that would need to be done - if not, it would
need to do its own book-keeping.


Thanks,
Bernd

2023-03-06 15:09:54

by Aleksandr Mikhalitsyn

[permalink] [raw]
Subject: Re: [RFC PATCH 7/9] fuse: add fuse device ioctl(FUSE_DEV_IOC_REINIT)

On Fri, Mar 3, 2023 at 8:26 PM Bernd Schubert
<[email protected]> wrote:
>
>
>
> On 2/20/23 20:37, Alexander Mikhalitsyn wrote:
> > This ioctl aborts fuse connection and then reinitializes it,
> > sends FUSE_INIT request to allow a new userspace daemon
> > to pick up the fuse connection.
> >
> > Cc: Miklos Szeredi <[email protected]>
> > Cc: Al Viro <[email protected]>
> > Cc: Amir Goldstein <[email protected]>
> > Cc: Stéphane Graber <[email protected]>
> > Cc: Seth Forshee <[email protected]>
> > Cc: Christian Brauner <[email protected]>
> > Cc: Andrei Vagin <[email protected]>
> > Cc: Pavel Tikhomirov <[email protected]>
> > Cc: [email protected]
> > Cc: [email protected]
> > Cc: [email protected]
> > Signed-off-by: Alexander Mikhalitsyn <[email protected]>
> > ---
> > fs/fuse/dev.c | 132 ++++++++++++++++++++++++++++++++++++++
> > include/uapi/linux/fuse.h | 1 +
> > 2 files changed, 133 insertions(+)
> >
> > diff --git a/fs/fuse/dev.c b/fs/fuse/dev.c
> > index 737764c2295e..0f53ffd63957 100644
> > --- a/fs/fuse/dev.c
> > +++ b/fs/fuse/dev.c
> > @@ -2187,6 +2187,112 @@ void fuse_abort_conn(struct fuse_conn *fc)
> > }
> > EXPORT_SYMBOL_GPL(fuse_abort_conn);
> >
> > +static int fuse_reinit_conn(struct fuse_conn *fc)
> > +{
> > + struct fuse_iqueue *fiq = &fc->iq;
> > + struct fuse_dev *fud;
> > + unsigned int i;
> > +
> > + if (fc->conn_gen + 1 < fc->conn_gen)
> > + return -EOVERFLOW;
> > +
> > + fuse_abort_conn(fc);
> > + fuse_wait_aborted(fc);
>
> Shouldn't this also try to flush all data first?

I think we should. Thanks for pointing to that!

I've read all your comments and I'll prepare -v2 series soon.

Thanks a lot, Bernd!

>

2023-03-06 15:23:34

by Aleksandr Mikhalitsyn

[permalink] [raw]
Subject: Re: [RFC PATCH 0/9] fuse: API for Checkpoint/Restore

On Fri, Mar 3, 2023 at 8:43 PM Bernd Schubert <[email protected]> wrote:
>
>
>
> On 2/20/23 20:37, Alexander Mikhalitsyn wrote:
> > Hello everyone,
> >
> > It would be great to hear your comments regarding this proof-of-concept Checkpoint/Restore API for FUSE.
> >
> > Support of FUSE C/R is a challenging task for CRIU [1]. Last year I've given a brief talk on LPC 2022
> > about how we handle files C/R in CRIU and which blockers we have for FUSE filesystems. [2]
> >
> > The main problem for CRIU is that we have to restore mount namespaces and memory mappings before the process tree.
> > It means that when CRIU is performing mount of fuse filesystem it can't use the original FUSE daemon from the
> > restorable process tree, but instead use a "fake daemon".
> >
> > This leads to many other technical problems:
> > * "fake" daemon has to reply to FUSE_INIT request from the kernel and initialize fuse connection somehow.
> > This setup can be not consistent with the original daemon (protocol version, daemon capabilities/settings
> > like no_open, no_flush, readahead, and so on).
> > * each fuse request has a unique ID. It could confuse userspace if this unique ID sequence was reset.
> >
> > We can workaround some issues and implement fragile and limited support of FUSE in CRIU but it doesn't make any sense, IMHO.
> > Btw, I've enumerated only CRIU restore-stage problems there. The dump stage is another story...
> >
> > My proposal is not only about CRIU. The same interface can be useful for FUSE mounts recovery after daemon crashes.
> > LXC project uses LXCFS [3] as a procfs/cgroupfs/sysfs emulation layer for containers. We are using a scheme when
> > one LXCFS daemon handles all the work for all the containers and we use bindmounts to overmount particular
> > files/directories in procfs/cgroupfs/sysfs. If this single daemon crashes for some reason we are in trouble,
> > because we have to restart all the containers (fuse bindmounts become invalid after the crash).
> > The solution is fairly easy:
> > allow somehow to reinitialize the existing fuse connection and replace the daemon on the fly
> > This case is a little bit simpler than CRIU cause we don't need to care about the previously opened files
> > and other stuff, we are only interested in mounts.
>

Hello, Bernd!

Thanks a lot for your attention/review to this patch series!

>
> I like your patches, small and easy to read :)

Glad to hear, thanks! ;-)

> So this basically fails all existing open files - our (future) needs go
> beyond that. I wonder if we can extend it later and re-init the new
> daemon with something like "fuse_queue_recall" - basically the opposite
> of fuse_queue_forget. Not sure if fuse can access the vfs dentry cache
> to know for which files that would need to be done - if not, it would
> need to do its own book-keeping.
>

I thought about this (problem with existing opened FDs) too, it's just
a first approach to the problem and as far as I mentioned I have no
CRIU-side implementation (so it's not tested with full
Checkpoint/Restore),
but it works well for "fuse reinitialization" (which is needed for
LXCFS and can be useful for other fuse filesystems).

I think we can easily extend this later if we come up with some
agreement about generic UAPI.

I hope that more people will react to this and post their opinions.

Kind regards,
Alex

>
> Thanks,
> Bernd

2023-03-06 16:32:47

by Miklos Szeredi

[permalink] [raw]
Subject: Re: [RFC PATCH 0/9] fuse: API for Checkpoint/Restore

On Mon, 20 Feb 2023 at 20:38, Alexander Mikhalitsyn
<[email protected]> wrote:
>
> Hello everyone,
>
> It would be great to hear your comments regarding this proof-of-concept Checkpoint/Restore API for FUSE.
>
> Support of FUSE C/R is a challenging task for CRIU [1]. Last year I've given a brief talk on LPC 2022
> about how we handle files C/R in CRIU and which blockers we have for FUSE filesystems. [2]
>
> The main problem for CRIU is that we have to restore mount namespaces and memory mappings before the process tree.
> It means that when CRIU is performing mount of fuse filesystem it can't use the original FUSE daemon from the
> restorable process tree, but instead use a "fake daemon".
>
> This leads to many other technical problems:
> * "fake" daemon has to reply to FUSE_INIT request from the kernel and initialize fuse connection somehow.
> This setup can be not consistent with the original daemon (protocol version, daemon capabilities/settings
> like no_open, no_flush, readahead, and so on).
> * each fuse request has a unique ID. It could confuse userspace if this unique ID sequence was reset.
>
> We can workaround some issues and implement fragile and limited support of FUSE in CRIU but it doesn't make any sense, IMHO.
> Btw, I've enumerated only CRIU restore-stage problems there. The dump stage is another story...
>
> My proposal is not only about CRIU. The same interface can be useful for FUSE mounts recovery after daemon crashes.
> LXC project uses LXCFS [3] as a procfs/cgroupfs/sysfs emulation layer for containers. We are using a scheme when
> one LXCFS daemon handles all the work for all the containers and we use bindmounts to overmount particular
> files/directories in procfs/cgroupfs/sysfs. If this single daemon crashes for some reason we are in trouble,
> because we have to restart all the containers (fuse bindmounts become invalid after the crash).
> The solution is fairly easy:
> allow somehow to reinitialize the existing fuse connection and replace the daemon on the fly
> This case is a little bit simpler than CRIU cause we don't need to care about the previously opened files
> and other stuff, we are only interested in mounts.
>
> Current PoC implementation was developed and tested with this "recovery case".
> Right now I only have LXCFS patched and have nothing for CRIU. But I wanted to discuss this idea before going forward with CRIU.

Apparently all of the added mechanisms (REINIT, BM_REVAL, conn_gen)
are crash recovery related, and not useful for C/R. Why is this being
advertised as a precursor for CRIU support?

BTW here's some earlier attempt at partial recovery, which might be interesting:

https://lore.kernel.org/all/CAPm50a+j8UL9g3UwpRsye5e+a=M0Hy7Tf1FdfwOrUUBWMyosNg@mail.gmail.com/

Thanks,
Miklos

2023-03-06 17:24:38

by Aleksandr Mikhalitsyn

[permalink] [raw]
Subject: Re: [RFC PATCH 0/9] fuse: API for Checkpoint/Restore

On Mon, Mar 6, 2023 at 5:15 PM Miklos Szeredi <[email protected]> wrote:
>
> On Mon, 20 Feb 2023 at 20:38, Alexander Mikhalitsyn
> <[email protected]> wrote:
> >
> > Hello everyone,
> >
> > It would be great to hear your comments regarding this proof-of-concept Checkpoint/Restore API for FUSE.
> >
> > Support of FUSE C/R is a challenging task for CRIU [1]. Last year I've given a brief talk on LPC 2022
> > about how we handle files C/R in CRIU and which blockers we have for FUSE filesystems. [2]
> >
> > The main problem for CRIU is that we have to restore mount namespaces and memory mappings before the process tree.
> > It means that when CRIU is performing mount of fuse filesystem it can't use the original FUSE daemon from the
> > restorable process tree, but instead use a "fake daemon".
> >
> > This leads to many other technical problems:
> > * "fake" daemon has to reply to FUSE_INIT request from the kernel and initialize fuse connection somehow.
> > This setup can be not consistent with the original daemon (protocol version, daemon capabilities/settings
> > like no_open, no_flush, readahead, and so on).
> > * each fuse request has a unique ID. It could confuse userspace if this unique ID sequence was reset.
> >
> > We can workaround some issues and implement fragile and limited support of FUSE in CRIU but it doesn't make any sense, IMHO.
> > Btw, I've enumerated only CRIU restore-stage problems there. The dump stage is another story...
> >
> > My proposal is not only about CRIU. The same interface can be useful for FUSE mounts recovery after daemon crashes.
> > LXC project uses LXCFS [3] as a procfs/cgroupfs/sysfs emulation layer for containers. We are using a scheme when
> > one LXCFS daemon handles all the work for all the containers and we use bindmounts to overmount particular
> > files/directories in procfs/cgroupfs/sysfs. If this single daemon crashes for some reason we are in trouble,
> > because we have to restart all the containers (fuse bindmounts become invalid after the crash).
> > The solution is fairly easy:
> > allow somehow to reinitialize the existing fuse connection and replace the daemon on the fly
> > This case is a little bit simpler than CRIU cause we don't need to care about the previously opened files
> > and other stuff, we are only interested in mounts.
> >
> > Current PoC implementation was developed and tested with this "recovery case".
> > Right now I only have LXCFS patched and have nothing for CRIU. But I wanted to discuss this idea before going forward with CRIU.
>

Hi Miklos,

> Apparently all of the added mechanisms (REINIT, BM_REVAL, conn_gen)
> are crash recovery related, and not useful for C/R. Why is this being
> advertised as a precursor for CRIU support?

It's because I'm doing this with CRIU in mind too, I think it's a good
way to make a universal interface
which can address not only the recovery case but also the C/R, cause
in some sense it's a close problem.
But of course, Checkpoint/Restore is a way more trickier. But before
doing all the work with CRIU PoC,
I wanted to consult with you and folks if there are any serious
objections to this interface/feature or, conversely,
if there is someone else who is interested in it.

Now about interfaces REINIT, BM_REVAL.

I think it will be useful for CRIU case, but probably I need to extend
it a little bit, as I mentioned earlier in the cover letter:
> >* "fake" daemon has to reply to FUSE_INIT request from the kernel and initialize fuse connection somehow.
> > This setup can be not consistent with the original daemon (protocol version, daemon capabilities/settings
> > like no_open, no_flush, readahead, and so on).

So, after the "fake" demon has done its job during CRIU restore, we
need to replace it with the actual demon from
the dumpee tree and performing REINIT looks like a sanner way.

The next point is that if we use REINIT during CRIU restore, then we
automatically need to have BM_REINIT too,
otherwise all restored bind mounts become invalid.

Conn generation is not a problem for CRIU if we are not exposing it to
the userspace. It's just a technical thing to distinguish
old and new inodes/struct file's.

>
> BTW here's some earlier attempt at partial recovery, which might be interesting:
>
> https://lore.kernel.org/all/CAPm50a+j8UL9g3UwpRsye5e+a=M0Hy7Tf1FdfwOrUUBWMyosNg@mail.gmail.com/

Oh, that's interesting. Thanks for mentioning this! And the most
interesting thing is that Peng Hao mentioned LXCFS as a use case :-)

Added Peng Hao to CC

Kind regards,
Alex

>
> Thanks,
> Miklos

2023-03-06 19:18:48

by Miklos Szeredi

[permalink] [raw]
Subject: Re: [RFC PATCH 0/9] fuse: API for Checkpoint/Restore

On Mon, 6 Mar 2023 at 17:44, Aleksandr Mikhalitsyn
<[email protected]> wrote:
>
> On Mon, Mar 6, 2023 at 5:15 PM Miklos Szeredi <[email protected]> wrote:

> > Apparently all of the added mechanisms (REINIT, BM_REVAL, conn_gen)
> > are crash recovery related, and not useful for C/R. Why is this being
> > advertised as a precursor for CRIU support?
>
> It's because I'm doing this with CRIU in mind too, I think it's a good
> way to make a universal interface
> which can address not only the recovery case but also the C/R, cause
> in some sense it's a close problem.

That's what I'm wondering about...

Crash recovery is about restoring (or at least regenerating) state in
the userspace server.

In CRIU restoring the state of the userspace server is a solved
problem, the issue is restoring state in the kernel part of fuse. In
a sense it's the exact opposite problem that crash recovery is doing.

> But of course, Checkpoint/Restore is a way more trickier. But before
> doing all the work with CRIU PoC,
> I wanted to consult with you and folks if there are any serious
> objections to this interface/feature or, conversely,
> if there is someone else who is interested in it.
>
> Now about interfaces REINIT, BM_REVAL.
>
> I think it will be useful for CRIU case, but probably I need to extend
> it a little bit, as I mentioned earlier in the cover letter:
> > >* "fake" daemon has to reply to FUSE_INIT request from the kernel and initialize fuse connection somehow.
> > > This setup can be not consistent with the original daemon (protocol version, daemon capabilities/settings
> > > like no_open, no_flush, readahead, and so on).
>
> So, after the "fake" demon has done its job during CRIU restore, we
> need to replace it with the actual demon from
> the dumpee tree and performing REINIT looks like a sanner way.

I don't get it. How does REINIT help with switching to the real daemon?

Thanks,
Miklos

2023-03-06 21:05:57

by Bernd Schubert

[permalink] [raw]
Subject: Re: [RFC PATCH 0/9] fuse: API for Checkpoint/Restore



On 3/6/23 20:18, Miklos Szeredi wrote:
> On Mon, 6 Mar 2023 at 17:44, Aleksandr Mikhalitsyn
> <[email protected]> wrote:
>>
>> On Mon, Mar 6, 2023 at 5:15 PM Miklos Szeredi <[email protected]> wrote:
>
>>> Apparently all of the added mechanisms (REINIT, BM_REVAL, conn_gen)
>>> are crash recovery related, and not useful for C/R. Why is this being
>>> advertised as a precursor for CRIU support?
>>
>> It's because I'm doing this with CRIU in mind too, I think it's a good
>> way to make a universal interface
>> which can address not only the recovery case but also the C/R, cause
>> in some sense it's a close problem.
>
> That's what I'm wondering about...
>
> Crash recovery is about restoring (or at least regenerating) state in
> the userspace server.
>
> In CRIU restoring the state of the userspace server is a solved
> problem, the issue is restoring state in the kernel part of fuse. In
> a sense it's the exact opposite problem that crash recovery is doing.
>
>> But of course, Checkpoint/Restore is a way more trickier. But before
>> doing all the work with CRIU PoC,
>> I wanted to consult with you and folks if there are any serious
>> objections to this interface/feature or, conversely,
>> if there is someone else who is interested in it.
>>
>> Now about interfaces REINIT, BM_REVAL.
>>
>> I think it will be useful for CRIU case, but probably I need to extend
>> it a little bit, as I mentioned earlier in the cover letter:
>>>> * "fake" daemon has to reply to FUSE_INIT request from the kernel and initialize fuse connection somehow.
>>>> This setup can be not consistent with the original daemon (protocol version, daemon capabilities/settings
>>>> like no_open, no_flush, readahead, and so on).
>>
>> So, after the "fake" demon has done its job during CRIU restore, we
>> need to replace it with the actual demon from
>> the dumpee tree and performing REINIT looks like a sanner way.
>
> I don't get it. How does REINIT help with switching to the real daemon?

The way I read the patches, the new daemon sends FUSE_INIT to advertise
all of its features.

2023-03-06 22:16:51

by Aleksandr Mikhalitsyn

[permalink] [raw]
Subject: Re: [RFC PATCH 0/9] fuse: API for Checkpoint/Restore

On Mon, Mar 6, 2023 at 10:05 PM Bernd Schubert <[email protected]> wrote:
>
>
>
> On 3/6/23 20:18, Miklos Szeredi wrote:
> > On Mon, 6 Mar 2023 at 17:44, Aleksandr Mikhalitsyn
> > <[email protected]> wrote:
> >>
> >> On Mon, Mar 6, 2023 at 5:15 PM Miklos Szeredi <[email protected]> wrote:
> >
> >>> Apparently all of the added mechanisms (REINIT, BM_REVAL, conn_gen)
> >>> are crash recovery related, and not useful for C/R. Why is this being
> >>> advertised as a precursor for CRIU support?
> >>
> >> It's because I'm doing this with CRIU in mind too, I think it's a good
> >> way to make a universal interface
> >> which can address not only the recovery case but also the C/R, cause
> >> in some sense it's a close problem.
> >
> > That's what I'm wondering about...
> >
> > Crash recovery is about restoring (or at least regenerating) state in
> > the userspace server.
> >
> > In CRIU restoring the state of the userspace server is a solved
> > problem, the issue is restoring state in the kernel part of fuse. In
> > a sense it's the exact opposite problem that crash recovery is doing.

I can't argue, you're right. In the "recover" case we don't care about userspace
state, we just want to forget everything in the kernel but only keep
mounts (someone may want to keep opened FDs too).
In the C/R case we want to recreate full userspace and kernel states.

These are different problems, but in some parts they require the same UAPIs.
I think I need to write a detailed motivation for the CRIU part in the
-v2 cover letter, so we can discuss it. What do you think?

> >
> >> But of course, Checkpoint/Restore is a way more trickier. But before
> >> doing all the work with CRIU PoC,
> >> I wanted to consult with you and folks if there are any serious
> >> objections to this interface/feature or, conversely,
> >> if there is someone else who is interested in it.
> >>
> >> Now about interfaces REINIT, BM_REVAL.
> >>
> >> I think it will be useful for CRIU case, but probably I need to extend
> >> it a little bit, as I mentioned earlier in the cover letter:
> >>>> * "fake" daemon has to reply to FUSE_INIT request from the kernel and initialize fuse connection somehow.
> >>>> This setup can be not consistent with the original daemon (protocol version, daemon capabilities/settings
> >>>> like no_open, no_flush, readahead, and so on).
> >>
> >> So, after the "fake" demon has done its job during CRIU restore, we
> >> need to replace it with the actual demon from
> >> the dumpee tree and performing REINIT looks like a sanner way.
> >
> > I don't get it. How does REINIT help with switching to the real daemon?
>
> The way I read the patches, the new daemon sends FUSE_INIT to advertise
> all of its features.

Yes, thanks, Bernd!

Theoretically, we can implement some basic C/R without using reinit.
It was my first idea and I've described it in my LPC 2022 talk,
but this approach is not fully safe and universal because CRIU fake
daemon will implement a particular fuse protocol version (and define a
particular set on fuse ops/features),
but the dumpee fuse daemon can use a different set of fuse ops and
fuse protocol version. So, changing fuse daemon fully transparently to
the kernel is not fully safe.

Thank you guys for your attention to this!

Kind regards,
Alex

2023-04-03 15:12:29

by Aleksandr Mikhalitsyn

[permalink] [raw]
Subject: Re: [RFC PATCH 7/9] fuse: add fuse device ioctl(FUSE_DEV_IOC_REINIT)

On Fri, Mar 3, 2023 at 8:19 PM Bernd Schubert
<[email protected]> wrote:
>
>
>
> On 2/20/23 20:37, Alexander Mikhalitsyn wrote:
> > This ioctl aborts fuse connection and then reinitializes it,
> > sends FUSE_INIT request to allow a new userspace daemon
> > to pick up the fuse connection.
> >
> > Cc: Miklos Szeredi <[email protected]>
> > Cc: Al Viro <[email protected]>
> > Cc: Amir Goldstein <[email protected]>
> > Cc: Stéphane Graber <[email protected]>
> > Cc: Seth Forshee <[email protected]>
> > Cc: Christian Brauner <[email protected]>
> > Cc: Andrei Vagin <[email protected]>
> > Cc: Pavel Tikhomirov <[email protected]>
> > Cc: [email protected]
> > Cc: [email protected]
> > Cc: [email protected]
> > Signed-off-by: Alexander Mikhalitsyn <[email protected]>
> > ---
> > fs/fuse/dev.c | 132 ++++++++++++++++++++++++++++++++++++++
> > include/uapi/linux/fuse.h | 1 +
> > 2 files changed, 133 insertions(+)
> >
> > diff --git a/fs/fuse/dev.c b/fs/fuse/dev.c
> > index 737764c2295e..0f53ffd63957 100644
> > --- a/fs/fuse/dev.c
> > +++ b/fs/fuse/dev.c
> > @@ -2187,6 +2187,112 @@ void fuse_abort_conn(struct fuse_conn *fc)
> > }
> > EXPORT_SYMBOL_GPL(fuse_abort_conn);
> >
> > +static int fuse_reinit_conn(struct fuse_conn *fc)
> > +{
> > + struct fuse_iqueue *fiq = &fc->iq;
> > + struct fuse_dev *fud;
> > + unsigned int i;
>
> Assuming you have a malicious daemon that tries to cause bad behavior,
> only allow one ioctl at at time? I.e. add a value that reinit is in
> progress? And unset at the end of the function?

Have done. Thanks!
>
> > +
> > + if (fc->conn_gen + 1 < fc->conn_gen)
> > + return -EOVERFLOW;
> > +
>
> Add a comment, like
>
> /* Unsets fc->connected and fiq->connected and ensures that no new
> requests can be queued */
>
> ?

Have done.

>
> > + fuse_abort_conn(fc);
> > + fuse_wait_aborted(fc);
> > +
> > + spin_lock(&fc->lock);
> > + if (fc->connected) {
> > + spin_unlock(&fc->lock);
> > + return -EINVAL;
> > + }
> > +
> > + if (fc->conn_gen + 1 < fc->conn_gen) {
> > + spin_unlock(&fc->lock);
> > + return -EOVERFLOW;
> > + }
> > +
> > + fc->conn_gen++;
> > +
> > + spin_lock(&fiq->lock);
> > + if (request_pending(fiq) || fiq->forget_list_tail != &fiq->forget_list_head) {
> > + spin_unlock(&fiq->lock);
> > + spin_unlock(&fc->lock);
> > + return -EINVAL;
> > + }
> > +
> > + if (&fuse_dev_fiq_ops != fiq->ops) {
> > + spin_unlock(&fiq->lock);
> > + spin_unlock(&fc->lock);
> > + return -EOPNOTSUPP;
> > + }
> > +
> > + fiq->connected = 1;
> > + spin_unlock(&fiq->lock);
> > +
> > + spin_lock(&fc->bg_lock);
> > + if (!list_empty(&fc->bg_queue)) {
> > + spin_unlock(&fc->bg_lock);
> > + spin_unlock(&fc->lock);
> > + return -EINVAL;
> > + }
> > +
> > + fc->blocked = 0;
> > + fc->max_background = FUSE_DEFAULT_MAX_BACKGROUND;
> > + spin_unlock(&fc->bg_lock);
> > +
> > + list_for_each_entry(fud, &fc->devices, entry) {
> > + struct fuse_pqueue *fpq = &fud->pq;
> > +
> > + spin_lock(&fpq->lock);
> > + if (!list_empty(&fpq->io)) {
> > + spin_unlock(&fpq->lock);
> > + spin_unlock(&fc->lock);
> > + return -EINVAL;
> > + }
> > +
> > + for (i = 0; i < FUSE_PQ_HASH_SIZE; i++) {
> > + if (!list_empty(&fpq->processing[i])) {
> > + spin_unlock(&fpq->lock);
> > + spin_unlock(&fc->lock);
> > + return -EINVAL;
> > + }
> > + }
> > +
> > + fpq->connected = 1;
> > + spin_unlock(&fpq->lock);
> > + }
> > +
> > + fuse_set_initialized(fc);
>
> I'm not sure about this, why not the common way via FUSE_INIT reply?

fuse_send_init will fail, if !fc->initialized (see fuse_block_alloc <-
fuse_get_req <- fuse_simple_background).

>
> > +
> > + /* Background queuing checks fc->connected under bg_lock */
> > + spin_lock(&fc->bg_lock);
> > + fc->connected = 1;
> > + spin_unlock(&fc->bg_lock);
> > +
> > + fc->aborted = false;
> > + fc->abort_err = 0;
> > +
> > + /* nullify all the flags */
> > + memset(&fc->flags, 0, sizeof(struct fuse_conn_flags));
> > +
> > + spin_unlock(&fc->lock);
> > +
> > + down_read(&fc->killsb);
> > + if (!list_empty(&fc->mounts)) {
> > + struct fuse_mount *fm;
> > +
> > + fm = list_first_entry(&fc->mounts, struct fuse_mount, fc_entry);
> > + if (!fm->sb) {
> > + up_read(&fc->killsb);
> > + return -EINVAL;
> > + }
> > +
> > + fuse_send_init(fm);
> > + }
> > + up_read(&fc->killsb);
> > +
> > + return 0;
> > +}
> > +
> > void fuse_wait_aborted(struct fuse_conn *fc)
> > {
> > /* matches implicit memory barrier in fuse_drop_waiting() */
> > @@ -2282,6 +2388,32 @@ static long fuse_dev_ioctl(struct file *file, unsigned int cmd,
> > }
> > }
> > break;
> > + case FUSE_DEV_IOC_REINIT:
> > + struct fuse_conn *fc;
> > +
> > + if (!checkpoint_restore_ns_capable(file->f_cred->user_ns))
> > + return -EPERM;
> > +
> > + res = -EINVAL;
> > + fud = fuse_get_dev(file);
> > +
> > + /*
> > + * Only fuse mounts with an already initialized fuse
> > + * connection are supported
> > + */
> > + if (file->f_op == &fuse_dev_operations && fud) {
> > + mutex_lock(&fuse_mutex);
> > + fc = fud->fc;
> > + if (fc)
> > + fc = fuse_conn_get(fc);
> > + mutex_unlock(&fuse_mutex);
> > +
> > + if (fc) {
> > + res = fuse_reinit_conn(fc);
> > + fuse_conn_put(fc);
> > + }
> > + }
> > + break;
> > default:
> > res = -ENOTTY;
> > break;
> > diff --git a/include/uapi/linux/fuse.h b/include/uapi/linux/fuse.h
> > index 1b9d0dfae72d..3dac67b25eae 100644
> > --- a/include/uapi/linux/fuse.h
> > +++ b/include/uapi/linux/fuse.h
> > @@ -989,6 +989,7 @@ struct fuse_notify_retrieve_in {
> > /* Device ioctls: */
> > #define FUSE_DEV_IOC_MAGIC 229
> > #define FUSE_DEV_IOC_CLONE _IOR(FUSE_DEV_IOC_MAGIC, 0, uint32_t)
> > +#define FUSE_DEV_IOC_REINIT _IO(FUSE_DEV_IOC_MAGIC, 0)
> >
> > struct fuse_lseek_in {
> > uint64_t fh;

2023-04-03 15:15:32

by Aleksandr Mikhalitsyn

[permalink] [raw]
Subject: Re: [RFC PATCH 7/9] fuse: add fuse device ioctl(FUSE_DEV_IOC_REINIT)

On Mon, Mar 6, 2023 at 3:09 PM Aleksandr Mikhalitsyn
<[email protected]> wrote:
>
> On Fri, Mar 3, 2023 at 8:26 PM Bernd Schubert
> <[email protected]> wrote:
> >
> >
> >
> > On 2/20/23 20:37, Alexander Mikhalitsyn wrote:
> > > This ioctl aborts fuse connection and then reinitializes it,
> > > sends FUSE_INIT request to allow a new userspace daemon
> > > to pick up the fuse connection.
> > >
> > > Cc: Miklos Szeredi <[email protected]>
> > > Cc: Al Viro <[email protected]>
> > > Cc: Amir Goldstein <[email protected]>
> > > Cc: Stéphane Graber <[email protected]>
> > > Cc: Seth Forshee <[email protected]>
> > > Cc: Christian Brauner <[email protected]>
> > > Cc: Andrei Vagin <[email protected]>
> > > Cc: Pavel Tikhomirov <[email protected]>
> > > Cc: [email protected]
> > > Cc: [email protected]
> > > Cc: [email protected]
> > > Signed-off-by: Alexander Mikhalitsyn <[email protected]>
> > > ---
> > > fs/fuse/dev.c | 132 ++++++++++++++++++++++++++++++++++++++
> > > include/uapi/linux/fuse.h | 1 +
> > > 2 files changed, 133 insertions(+)
> > >
> > > diff --git a/fs/fuse/dev.c b/fs/fuse/dev.c
> > > index 737764c2295e..0f53ffd63957 100644
> > > --- a/fs/fuse/dev.c
> > > +++ b/fs/fuse/dev.c
> > > @@ -2187,6 +2187,112 @@ void fuse_abort_conn(struct fuse_conn *fc)
> > > }
> > > EXPORT_SYMBOL_GPL(fuse_abort_conn);
> > >
> > > +static int fuse_reinit_conn(struct fuse_conn *fc)
> > > +{
> > > + struct fuse_iqueue *fiq = &fc->iq;
> > > + struct fuse_dev *fud;
> > > + unsigned int i;
> > > +
> > > + if (fc->conn_gen + 1 < fc->conn_gen)
> > > + return -EOVERFLOW;
> > > +
> > > + fuse_abort_conn(fc);
> > > + fuse_wait_aborted(fc);
> >
> > Shouldn't this also try to flush all data first?

Dear Bernd,

I've reviewed this place 2nd time and I'm not sure that we have to
perform any flushing there, because userspace daemon can be dead or
stuck.
Technically, if userspace knows that daemon is alive then it can call
fsync/sync before doing reinit.

What do you think about it?

Kind regards,
Alex

>
> I think we should. Thanks for pointing to that!
>
> I've read all your comments and I'll prepare -v2 series soon.
>
> Thanks a lot, Bernd!
>
> >

2023-04-03 17:09:40

by Askar Safin

[permalink] [raw]
Subject: Re: [RFC PATCH 0/9] fuse: API for Checkpoint/Restore

These patches seem to be related to the well-known suspend+fuse problem
(see my comment here:
https://bugzilla.kernel.org/show_bug.cgi?id=34932#c12 ). Still
reproducible on modern kernels. Please, somehow address this
long-standing problem.

CC me when answering

--
Askar Safin

2023-04-11 22:20:45

by Bernd Schubert

[permalink] [raw]
Subject: Re: [RFC PATCH 7/9] fuse: add fuse device ioctl(FUSE_DEV_IOC_REINIT)



On 4/3/23 16:51, Aleksandr Mikhalitsyn wrote:
> On Mon, Mar 6, 2023 at 3:09 PM Aleksandr Mikhalitsyn
> <[email protected]> wrote:
>>
>> On Fri, Mar 3, 2023 at 8:26 PM Bernd Schubert
>> <[email protected]> wrote:
>>>
>>>
>>>
>>> On 2/20/23 20:37, Alexander Mikhalitsyn wrote:
>>>> This ioctl aborts fuse connection and then reinitializes it,
>>>> sends FUSE_INIT request to allow a new userspace daemon
>>>> to pick up the fuse connection.
>>>>
>>>> Cc: Miklos Szeredi <[email protected]>
>>>> Cc: Al Viro <[email protected]>
>>>> Cc: Amir Goldstein <[email protected]>
>>>> Cc: Stéphane Graber <[email protected]>
>>>> Cc: Seth Forshee <[email protected]>
>>>> Cc: Christian Brauner <[email protected]>
>>>> Cc: Andrei Vagin <[email protected]>
>>>> Cc: Pavel Tikhomirov <[email protected]>
>>>> Cc: [email protected]
>>>> Cc: [email protected]
>>>> Cc: [email protected]
>>>> Signed-off-by: Alexander Mikhalitsyn <[email protected]>
>>>> ---
>>>> fs/fuse/dev.c | 132 ++++++++++++++++++++++++++++++++++++++
>>>> include/uapi/linux/fuse.h | 1 +
>>>> 2 files changed, 133 insertions(+)
>>>>
>>>> diff --git a/fs/fuse/dev.c b/fs/fuse/dev.c
>>>> index 737764c2295e..0f53ffd63957 100644
>>>> --- a/fs/fuse/dev.c
>>>> +++ b/fs/fuse/dev.c
>>>> @@ -2187,6 +2187,112 @@ void fuse_abort_conn(struct fuse_conn *fc)
>>>> }
>>>> EXPORT_SYMBOL_GPL(fuse_abort_conn);
>>>>
>>>> +static int fuse_reinit_conn(struct fuse_conn *fc)
>>>> +{
>>>> + struct fuse_iqueue *fiq = &fc->iq;
>>>> + struct fuse_dev *fud;
>>>> + unsigned int i;
>>>> +
>>>> + if (fc->conn_gen + 1 < fc->conn_gen)
>>>> + return -EOVERFLOW;
>>>> +
>>>> + fuse_abort_conn(fc);
>>>> + fuse_wait_aborted(fc);
>>>
>>> Shouldn't this also try to flush all data first?
>
> Dear Bernd,
>
> I've reviewed this place 2nd time and I'm not sure that we have to
> perform any flushing there, because userspace daemon can be dead or
> stuck.
> Technically, if userspace knows that daemon is alive then it can call
> fsync/sync before doing reinit.
>
> What do you think about it?

Hello Alex,

sorry for my late reply.

Hmm, I just fear that fsync/sync is a bit racy, what is if a user would
write data after the sync and that would get silently removed by
fuse_abort_conn()? Isn't what we want:

ioctl
refuse new requests -> unset fc->initialized
flush all fc queues (fc->iq.pending, fc->bg_queue, I guess with your
current patches we do not need to handle forget)
fuse_abort_conn


So what is missing is the information if the daemon is still running -
take a daemon reference and then check for PF_EXITING, as in my uring
patches? Miklos has some objections for that, though.


The alternative would be to mount read-only, then sync, then do the
ioctl and remount back. I don't know what needs to be done to get
remount working, though. Just handle it in libfuse mount.fuse and send
the mount syscall?


As I wrote before, at DDN we want to have run time daemon restart - I'm
also not opposed to entirely give up on the flush and to just work on a
restart protocol to make the new daemon to the old state (opened files
and lookup/forget count). In principle we could even transfer that in
userspace from one daemon to the other?


Thanks,
Bernd

PS: Will look at the new patches later this week.


Thanks,
Bernd