Hello everyone,
It would be great to hear your comments regarding this proof-of-concept Checkpoint/Restore API for FUSE.
Support of FUSE C/R is a challenging task for CRIU [1]. Last year I've given a brief talk on LPC 2022
about how we handle files C/R in CRIU and which blockers we have for FUSE filesystems. [2]
The main problem for CRIU is that we have to restore mount namespaces and memory mappings before the process tree.
It means that when CRIU is performing mount of fuse filesystem it can't use the original FUSE daemon from the
restorable process tree, but instead use a "fake daemon".
This leads to many other technical problems:
* "fake" daemon has to reply to FUSE_INIT request from the kernel and initialize fuse connection somehow.
This setup can be not consistent with the original daemon (protocol version, daemon capabilities/settings
like no_open, no_flush, readahead, and so on).
* each fuse request has a unique ID. It could confuse userspace if this unique ID sequence was reset.
We can workaround some issues and implement fragile and limited support of FUSE in CRIU but it doesn't make any sense, IMHO.
Btw, I've enumerated only CRIU restore-stage problems there. The dump stage is another story...
My proposal is not only about CRIU. The same interface can be useful for FUSE mounts recovery after daemon crashes.
LXC project uses LXCFS [3] as a procfs/cgroupfs/sysfs emulation layer for containers. We are using a scheme when
one LXCFS daemon handles all the work for all the containers and we use bindmounts to overmount particular
files/directories in procfs/cgroupfs/sysfs. If this single daemon crashes for some reason we are in trouble,
because we have to restart all the containers (fuse bindmounts become invalid after the crash).
The solution is fairly easy:
allow somehow to reinitialize the existing fuse connection and replace the daemon on the fly
This case is a little bit simpler than CRIU cause we don't need to care about the previously opened files
and other stuff, we are only interested in mounts.
Current PoC implementation was developed and tested with this "recovery case".
Right now I only have LXCFS patched and have nothing for CRIU. But I wanted to discuss this idea before going forward with CRIU.
Brief API/design description.
I've added two ioctl's:
* ioctl(FUSE_DEV_IOC_REINIT)
Performs fuse connection abort and then reinitializes all internal fuse structures as "brand new".
Then sends a FUSE_INIT request, so a new userspace daemon can reply to it and start processing fuse reqs.
* ioctl(FUSE_DEV_IOC_BM_REVAL)
A little bit hacky thing. Traverses all the bindmounts of existing fuse mount and performs relookup
of (struct vfsmount)->mnt_root dentries with the new daemon and reset them to new dentries.
Pretty useful for the recovery case (for instance, LXCFS).
Now about the dentry/inode invalidation mechanism.
* added the "fuse connection generation" concept.
When reinit is performed then connection generation is increased by 1.
Each fuse inode stores the generation of the connection it was allocated with.
* perform dentry revalidation if it has an old generation [fuse_dentry_revalidate]
The current implementation of fuse_dentry_revalidate follows a simple and elegant idea. When we
want to revalidate the dentry we just send a FUSE_LOOKUP request to the userspace
for the parent dentry with the name of the current dentry and check which attributes/inode id
it gets. If inode ids are the same and attributes (provided by the userspace) are valid
then we mark dentry valid and it continues to live (with inode).
I've only added a connection generation check to the condition when we have to perform revalidation
and added an inode connection generation reset (to actual connection gen) if the new userspace
daemon has looked up the same inode id (important for the CRIU case!).
Thank you for your attention and I'm waiting for your feedback :)
Branch link: https://github.com/mihalicyn/linux/commits/fuse_cr_rev0
References:
[1] Support FUSE mountpoints https://github.com/checkpoint-restore/criu/issues/53
[2] Bringing up FUSE mounts C/R support https://lpc.events/event/16/contributions/1243/
[3] LXCFS https://github.com/lxc/lxcfs
Kind regards,
Alex
--
v2:
- rebased
- fixes according to review comments from Bernd Schubert
--
Cc: Miklos Szeredi <[email protected]>
Cc: Al Viro <[email protected]>
Cc: Amir Goldstein <[email protected]>
Cc: Stéphane Graber <[email protected]>
Cc: Seth Forshee <[email protected]>
Cc: Christian Brauner <[email protected]>
Cc: Andrei Vagin <[email protected]>
Cc: Pavel Tikhomirov <[email protected]>
Cc: Bernd Schubert <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Alexander Mikhalitsyn (9):
fuse: move FUSE_DEFAULT_* defines to fuse common header
fuse: add const qualifiers to common fuse helpers
fuse: add fuse connection generation
fuse: handle stale inode connection in fuse_queue_forget
fuse: move fuse connection flags to the separate structure
fuse: take fuse connection generation into account
fuse: add fuse device ioctl(FUSE_DEV_IOC_REINIT)
namespace: add sb_revalidate_bindmounts helper
fuse: add fuse device ioctl(FUSE_DEV_IOC_BM_REVAL)
fs/fuse/acl.c | 6 +-
fs/fuse/dev.c | 187 +++++++++++++++++++++-
fs/fuse/dir.c | 38 ++---
fs/fuse/file.c | 92 ++++++-----
fs/fuse/fuse_i.h | 287 ++++++++++++++++++++--------------
fs/fuse/inode.c | 62 ++++----
fs/fuse/readdir.c | 8 +-
fs/fuse/xattr.c | 18 +--
fs/namespace.c | 90 +++++++++++
include/linux/mnt_namespace.h | 3 +
include/uapi/linux/fuse.h | 2 +
11 files changed, 563 insertions(+), 230 deletions(-)
base-commit: c68ea140050e631d24d61b6e399ca8224a4a2219
--
2.34.1
Cc: Miklos Szeredi <[email protected]>
Cc: Al Viro <[email protected]>
Cc: Amir Goldstein <[email protected]>
Cc: Stéphane Graber <[email protected]>
Cc: Seth Forshee <[email protected]>
Cc: Christian Brauner <[email protected]>
Cc: Andrei Vagin <[email protected]>
Cc: Pavel Tikhomirov <[email protected]>
Cc: Bernd Schubert <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Signed-off-by: Alexander Mikhalitsyn <[email protected]>
---
fs/fuse/fuse_i.h | 16 ++++++++--------
1 file changed, 8 insertions(+), 8 deletions(-)
diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
index 69af0acecb69..6d3d3ca4f136 100644
--- a/fs/fuse/fuse_i.h
+++ b/fs/fuse/fuse_i.h
@@ -870,32 +870,32 @@ struct fuse_mount {
struct list_head fc_entry;
};
-static inline struct fuse_mount *get_fuse_mount_super(struct super_block *sb)
+static inline struct fuse_mount *get_fuse_mount_super(const struct super_block *sb)
{
return sb->s_fs_info;
}
-static inline struct fuse_conn *get_fuse_conn_super(struct super_block *sb)
+static inline struct fuse_conn *get_fuse_conn_super(const struct super_block *sb)
{
return get_fuse_mount_super(sb)->fc;
}
-static inline struct fuse_mount *get_fuse_mount(struct inode *inode)
+static inline struct fuse_mount *get_fuse_mount(const struct inode *inode)
{
return get_fuse_mount_super(inode->i_sb);
}
-static inline struct fuse_conn *get_fuse_conn(struct inode *inode)
+static inline struct fuse_conn *get_fuse_conn(const struct inode *inode)
{
return get_fuse_mount_super(inode->i_sb)->fc;
}
-static inline struct fuse_inode *get_fuse_inode(struct inode *inode)
+static inline struct fuse_inode *get_fuse_inode(const struct inode *inode)
{
return container_of(inode, struct fuse_inode, inode);
}
-static inline u64 get_node_id(struct inode *inode)
+static inline u64 get_node_id(const struct inode *inode)
{
return get_fuse_inode(inode)->nodeid;
}
@@ -905,7 +905,7 @@ static inline int invalid_nodeid(u64 nodeid)
return !nodeid || nodeid == FUSE_ROOT_ID;
}
-static inline u64 fuse_get_attr_version(struct fuse_conn *fc)
+static inline u64 fuse_get_attr_version(const struct fuse_conn *fc)
{
return atomic64_read(&fc->attr_version);
}
@@ -923,7 +923,7 @@ static inline void fuse_make_bad(struct inode *inode)
set_bit(FUSE_I_BAD, &get_fuse_inode(inode)->state);
}
-static inline bool fuse_is_bad(struct inode *inode)
+static inline bool fuse_is_bad(const struct inode *inode)
{
return unlikely(test_bit(FUSE_I_BAD, &get_fuse_inode(inode)->state));
}
--
2.34.1
Let's move all the fuse connection flags that can be safely zeroed
after connection reinitialization to the separate structure fuse_conn_flags.
All of these flags values are calculated dynamically basing on
the userspace daemon capabilities (like no_open, no_flush) or on the
response for FUSE_INIT request.
Cc: Miklos Szeredi <[email protected]>
Cc: Al Viro <[email protected]>
Cc: Amir Goldstein <[email protected]>
Cc: Stéphane Graber <[email protected]>
Cc: Seth Forshee <[email protected]>
Cc: Christian Brauner <[email protected]>
Cc: Andrei Vagin <[email protected]>
Cc: Pavel Tikhomirov <[email protected]>
Cc: Bernd Schubert <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Signed-off-by: Alexander Mikhalitsyn <[email protected]>
---
fs/fuse/acl.c | 6 +-
fs/fuse/dev.c | 4 +-
fs/fuse/dir.c | 26 +++---
fs/fuse/file.c | 80 ++++++++--------
fs/fuse/fuse_i.h | 228 ++++++++++++++++++++++++----------------------
fs/fuse/inode.c | 52 +++++------
fs/fuse/readdir.c | 8 +-
fs/fuse/xattr.c | 18 ++--
8 files changed, 218 insertions(+), 204 deletions(-)
diff --git a/fs/fuse/acl.c b/fs/fuse/acl.c
index 3d192b80a561..549b5a1da7ff 100644
--- a/fs/fuse/acl.c
+++ b/fs/fuse/acl.c
@@ -26,7 +26,7 @@ static struct posix_acl *__fuse_get_acl(struct fuse_conn *fc,
if (fuse_is_bad(inode))
return ERR_PTR(-EIO);
- if (fc->no_getxattr)
+ if (fc->flags.no_getxattr)
return NULL;
if (type == ACL_TYPE_ACCESS)
@@ -43,7 +43,7 @@ static struct posix_acl *__fuse_get_acl(struct fuse_conn *fc,
if (size > 0)
acl = posix_acl_from_xattr(fc->user_ns, value, size);
else if ((size == 0) || (size == -ENODATA) ||
- (size == -EOPNOTSUPP && fc->no_getxattr))
+ (size == -EOPNOTSUPP && fc->flags.no_getxattr))
acl = NULL;
else if (size == -ERANGE)
acl = ERR_PTR(-E2BIG);
@@ -105,7 +105,7 @@ int fuse_set_acl(struct mnt_idmap *idmap, struct dentry *dentry,
if (fuse_is_bad(inode))
return -EIO;
- if (fc->no_setxattr || fuse_no_acl(fc, inode))
+ if (fc->flags.no_setxattr || fuse_no_acl(fc, inode))
return -EOPNOTSUPP;
if (type == ACL_TYPE_ACCESS)
diff --git a/fs/fuse/dev.c b/fs/fuse/dev.c
index 2e7cd60b685e..b4501a10c379 100644
--- a/fs/fuse/dev.c
+++ b/fs/fuse/dev.c
@@ -367,7 +367,7 @@ static void request_wait_answer(struct fuse_req *req)
struct fuse_iqueue *fiq = &fc->iq;
int err;
- if (!fc->no_interrupt) {
+ if (!fc->flags.no_interrupt) {
/* Any signal may interrupt this */
err = wait_event_interruptible(req->waitq,
test_bit(FR_FINISHED, &req->flags));
@@ -1901,7 +1901,7 @@ static ssize_t fuse_dev_do_write(struct fuse_dev *fud,
if (nbytes != sizeof(struct fuse_out_header))
err = -EINVAL;
else if (oh.error == -ENOSYS)
- fc->no_interrupt = 1;
+ fc->flags.no_interrupt = 1;
else if (oh.error == -EAGAIN)
err = queue_interrupt(req);
diff --git a/fs/fuse/dir.c b/fs/fuse/dir.c
index 7e308a655191..bfbe59e8fce2 100644
--- a/fs/fuse/dir.c
+++ b/fs/fuse/dir.c
@@ -705,7 +705,7 @@ static int fuse_create_open(struct inode *dir, struct dentry *entry,
} else {
file->private_data = ff;
fuse_finish_open(inode, file);
- if (fm->fc->atomic_o_trunc && trunc)
+ if (fm->fc->flags.atomic_o_trunc && trunc)
truncate_pagecache(inode, 0);
else if (!(ff->open_flags & FOPEN_KEEP_CACHE))
invalidate_inode_pages2(inode->i_mapping);
@@ -748,12 +748,12 @@ static int fuse_atomic_open(struct inode *dir, struct dentry *entry,
/* Only creates */
file->f_mode |= FMODE_CREATED;
- if (fc->no_create)
+ if (fc->flags.no_create)
goto mknod;
err = fuse_create_open(dir, entry, file, flags, mode, FUSE_CREATE);
if (err == -ENOSYS) {
- fc->no_create = 1;
+ fc->flags.no_create = 1;
goto mknod;
}
out_dput:
@@ -1078,14 +1078,14 @@ static int fuse_rename2(struct mnt_idmap *idmap, struct inode *olddir,
return -EINVAL;
if (flags) {
- if (fc->no_rename2 || fc->minor < 23)
+ if (fc->flags.no_rename2 || fc->minor < 23)
return -EINVAL;
err = fuse_rename_common(olddir, oldent, newdir, newent, flags,
FUSE_RENAME2,
sizeof(struct fuse_rename2_in));
if (err == -ENOSYS) {
- fc->no_rename2 = 1;
+ fc->flags.no_rename2 = 1;
err = -EINVAL;
}
} else {
@@ -1352,7 +1352,7 @@ static int fuse_access(struct inode *inode, int mask)
BUG_ON(mask & MAY_NOT_BLOCK);
- if (fm->fc->no_access)
+ if (fm->fc->flags.no_access)
return 0;
memset(&inarg, 0, sizeof(inarg));
@@ -1364,7 +1364,7 @@ static int fuse_access(struct inode *inode, int mask)
args.in_args[0].value = &inarg;
err = fuse_simple_request(fm, &args);
if (err == -ENOSYS) {
- fm->fc->no_access = 1;
+ fm->fc->flags.no_access = 1;
err = 0;
}
return err;
@@ -1501,7 +1501,7 @@ static const char *fuse_get_link(struct dentry *dentry, struct inode *inode,
if (fuse_is_bad(inode))
goto out_err;
- if (fc->cache_symlinks)
+ if (fc->flags.cache_symlinks)
return page_get_link(dentry, inode, callback);
err = -ECHILD;
@@ -1549,13 +1549,13 @@ static int fuse_dir_fsync(struct file *file, loff_t start, loff_t end,
if (fuse_is_bad(inode))
return -EIO;
- if (fc->no_fsyncdir)
+ if (fc->flags.no_fsyncdir)
return 0;
inode_lock(inode);
err = fuse_fsync_common(file, start, end, datasync, FUSE_FSYNCDIR);
if (err == -ENOSYS) {
- fc->no_fsyncdir = 1;
+ fc->flags.no_fsyncdir = 1;
err = 0;
}
inode_unlock(inode);
@@ -1747,7 +1747,7 @@ int fuse_do_setattr(struct dentry *dentry, struct iattr *attr,
struct fuse_setattr_in inarg;
struct fuse_attr_out outarg;
bool is_truncate = false;
- bool is_wb = fc->writeback_cache && S_ISREG(inode->i_mode);
+ bool is_wb = fc->flags.writeback_cache && S_ISREG(inode->i_mode);
loff_t oldsize;
int err;
bool trust_local_cmtime = is_wb;
@@ -1780,7 +1780,7 @@ int fuse_do_setattr(struct dentry *dentry, struct iattr *attr,
/* This is coming from open(..., ... | O_TRUNC); */
WARN_ON(!(attr->ia_valid & ATTR_SIZE));
WARN_ON(attr->ia_size != 0);
- if (fc->atomic_o_trunc) {
+ if (fc->flags.atomic_o_trunc) {
/*
* No need to send request to userspace, since actual
* truncation has already been done by OPEN. But still
@@ -1927,7 +1927,7 @@ static int fuse_setattr(struct mnt_idmap *idmap, struct dentry *entry,
*
* This should be done on write(), truncate() and chown().
*/
- if (!fc->handle_killpriv && !fc->handle_killpriv_v2) {
+ if (!fc->flags.handle_killpriv && !fc->handle_killpriv_v2) {
/*
* ia_mode calculation may have used stale i_mode.
* Refresh and recalculate.
diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index 1e36cd9490c6..742f90b4e638 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -30,7 +30,7 @@ static int fuse_send_open(struct fuse_mount *fm, u64 nodeid,
memset(&inarg, 0, sizeof(inarg));
inarg.flags = open_flags & ~(O_CREAT | O_EXCL | O_NOCTTY);
- if (!fm->fc->atomic_o_trunc)
+ if (!fm->fc->flags.atomic_o_trunc)
inarg.flags &= ~O_TRUNC;
if (fm->fc->handle_killpriv_v2 &&
@@ -111,7 +111,7 @@ static void fuse_file_put(struct fuse_file *ff, bool sync, bool isdir)
if (refcount_dec_and_test(&ff->count)) {
struct fuse_args *args = &ff->release_args->args;
- if (isdir ? ff->fm->fc->no_opendir : ff->fm->fc->no_open) {
+ if (isdir ? ff->fm->fc->flags.no_opendir : ff->fm->fc->flags.no_open) {
/* Do nothing when client does not implement 'open' */
fuse_release_end(ff->fm, args, 0);
} else if (sync) {
@@ -141,7 +141,7 @@ struct fuse_file *fuse_file_open(struct fuse_mount *fm, u64 nodeid,
ff->fh = 0;
/* Default for no-open */
ff->open_flags = FOPEN_KEEP_CACHE | (isdir ? FOPEN_CACHE_DIR : 0);
- if (isdir ? !fc->no_opendir : !fc->no_open) {
+ if (isdir ? !fc->flags.no_opendir : !fc->flags.no_open) {
struct fuse_open_out outarg;
int err;
@@ -155,9 +155,9 @@ struct fuse_file *fuse_file_open(struct fuse_mount *fm, u64 nodeid,
return ERR_PTR(err);
} else {
if (isdir)
- fc->no_opendir = 1;
+ fc->flags.no_opendir = 1;
else
- fc->no_open = 1;
+ fc->flags.no_open = 1;
}
}
@@ -206,7 +206,7 @@ void fuse_finish_open(struct inode *inode, struct file *file)
else if (ff->open_flags & FOPEN_NONSEEKABLE)
nonseekable_open(inode, file);
- if (fc->atomic_o_trunc && (file->f_flags & O_TRUNC)) {
+ if (fc->flags.atomic_o_trunc && (file->f_flags & O_TRUNC)) {
struct fuse_inode *fi = get_fuse_inode(inode);
spin_lock(&fi->lock);
@@ -216,7 +216,7 @@ void fuse_finish_open(struct inode *inode, struct file *file)
file_update_time(file);
fuse_invalidate_attr_mask(inode, FUSE_STATX_MODSIZE);
}
- if ((file->f_mode & FMODE_WRITE) && fc->writeback_cache)
+ if ((file->f_mode & FMODE_WRITE) && fc->flags.writeback_cache)
fuse_link_write_file(file);
}
@@ -226,10 +226,10 @@ int fuse_open_common(struct inode *inode, struct file *file, bool isdir)
struct fuse_conn *fc = fm->fc;
int err;
bool is_wb_truncate = (file->f_flags & O_TRUNC) &&
- fc->atomic_o_trunc &&
- fc->writeback_cache;
+ fc->flags.atomic_o_trunc &&
+ fc->flags.writeback_cache;
bool dax_truncate = (file->f_flags & O_TRUNC) &&
- fc->atomic_o_trunc && FUSE_IS_DAX(inode);
+ fc->flags.atomic_o_trunc && FUSE_IS_DAX(inode);
if (fuse_is_bad(inode))
return -EIO;
@@ -260,7 +260,7 @@ int fuse_open_common(struct inode *inode, struct file *file, bool isdir)
if (!err) {
struct fuse_file *ff = file->private_data;
- if (fc->atomic_o_trunc && (file->f_flags & O_TRUNC))
+ if (fc->flags.atomic_o_trunc && (file->f_flags & O_TRUNC))
truncate_pagecache(inode, 0);
else if (!(ff->open_flags & FOPEN_KEEP_CACHE))
invalidate_inode_pages2(inode->i_mapping);
@@ -351,7 +351,7 @@ static int fuse_release(struct inode *inode, struct file *file)
* Dirty pages might remain despite write_inode_now() call from
* fuse_flush() due to writes racing with the close.
*/
- if (fc->writeback_cache)
+ if (fc->flags.writeback_cache)
write_inode_now(inode, 1);
fuse_release_common(file, false);
@@ -506,12 +506,12 @@ static int fuse_do_flush(struct fuse_flush_args *fa)
goto out;
err = 0;
- if (fm->fc->no_flush)
+ if (fm->fc->flags.no_flush)
goto inval_attr_out;
err = fuse_simple_request(fm, &fa->args);
if (err == -ENOSYS) {
- fm->fc->no_flush = 1;
+ fm->fc->flags.no_flush = 1;
err = 0;
}
@@ -520,7 +520,7 @@ static int fuse_do_flush(struct fuse_flush_args *fa)
* In memory i_blocks is not maintained by fuse, if writeback cache is
* enabled, i_blocks from cached attr may not be accurate.
*/
- if (!err && fm->fc->writeback_cache)
+ if (!err && fm->fc->flags.writeback_cache)
fuse_invalidate_attr_mask(inode, STATX_BLOCKS);
out:
@@ -546,7 +546,7 @@ static int fuse_flush(struct file *file, fl_owner_t id)
if (fuse_is_bad(inode))
return -EIO;
- if (ff->open_flags & FOPEN_NOFLUSH && !fm->fc->writeback_cache)
+ if (ff->open_flags & FOPEN_NOFLUSH && !fm->fc->flags.writeback_cache)
return 0;
fa = kzalloc(sizeof(*fa), GFP_KERNEL);
@@ -629,12 +629,12 @@ static int fuse_fsync(struct file *file, loff_t start, loff_t end,
if (err)
goto out;
- if (fc->no_fsync)
+ if (fc->flags.no_fsync)
goto out;
err = fuse_fsync_common(file, start, end, datasync, FUSE_FSYNC);
if (err == -ENOSYS) {
- fc->no_fsync = 1;
+ fc->flags.no_fsync = 1;
err = 0;
}
out:
@@ -859,7 +859,7 @@ static void fuse_short_read(struct inode *inode, u64 attr_ver, size_t num_read,
* the file. Some data after the hole is in page cache, but has not
* reached the client fs yet. So the hole is not present there.
*/
- if (!fc->writeback_cache) {
+ if (!fc->flags.writeback_cache) {
loff_t pos = page_offset(ap->pages[0]) + num_read;
fuse_read_update_size(inode, pos, attr_ver);
}
@@ -990,7 +990,7 @@ static void fuse_send_readpages(struct fuse_io_args *ia, struct file *file)
fuse_read_args_fill(ia, file, pos, count, FUSE_READ);
ia->read.attr_ver = fuse_get_attr_version(fm->fc);
- if (fm->fc->async_read) {
+ if (fm->fc->flags.async_read) {
ia->ff = fuse_file_get(ff);
ap->args.end = fuse_readpages_end;
err = fuse_simple_background(fm, &ap->args, GFP_KERNEL);
@@ -1057,7 +1057,7 @@ static ssize_t fuse_cache_read_iter(struct kiocb *iocb, struct iov_iter *to)
* Otherwise, only update if we attempt to read past EOF (to ensure
* i_size is up to date).
*/
- if (fc->auto_inval_data ||
+ if (fc->flags.auto_inval_data ||
(iocb->ki_pos + iov_iter_count(to) > i_size_read(inode))) {
int err;
err = fuse_update_attributes(inode, iocb->ki_filp, STATX_SIZE);
@@ -1264,7 +1264,7 @@ static ssize_t fuse_fill_write_pages(struct fuse_io_args *ia,
ia->write.page_locked = true;
break;
}
- if (!fc->big_writes)
+ if (!fc->flags.big_writes)
break;
} while (iov_iter_count(ii) && count < fc->max_write &&
ap->num_pages < max_pages && offset == 0);
@@ -1344,7 +1344,7 @@ static ssize_t fuse_cache_write_iter(struct kiocb *iocb, struct iov_iter *from)
struct fuse_conn *fc = get_fuse_conn(inode);
loff_t endbyte = 0;
- if (fc->writeback_cache) {
+ if (fc->flags.writeback_cache) {
/* Update size (EOF optimization) and mode (SUID clearing) */
err = fuse_update_attributes(mapping->host, file,
STATX_SIZE | STATX_MODE);
@@ -1866,7 +1866,7 @@ static void fuse_writepage_end(struct fuse_mount *fm, struct fuse_args *args,
* Do this only if writeback_cache is not enabled. If writeback_cache
* is enabled, we trust local ctime/mtime.
*/
- if (!fc->writeback_cache)
+ if (!fc->flags.writeback_cache)
fuse_invalidate_attr_mask(inode, FUSE_STATX_MODIFY);
spin_lock(&fi->lock);
rb_erase(&wpa->writepages_entry, &fi->writepages);
@@ -2370,7 +2370,7 @@ static int fuse_write_begin(struct file *file, struct address_space *mapping,
loff_t fsize;
int err = -ENOMEM;
- WARN_ON(!fc->writeback_cache);
+ WARN_ON(!fc->flags.writeback_cache);
page = grab_cache_page_write_begin(mapping, index);
if (!page)
@@ -2643,13 +2643,13 @@ static int fuse_file_lock(struct file *file, int cmd, struct file_lock *fl)
if (cmd == F_CANCELLK) {
err = 0;
} else if (cmd == F_GETLK) {
- if (fc->no_lock) {
+ if (fc->flags.no_lock) {
posix_test_lock(file, fl);
err = 0;
} else
err = fuse_getlk(file, fl);
} else {
- if (fc->no_lock)
+ if (fc->flags.no_lock)
err = posix_lock_file(file, fl, NULL);
else
err = fuse_setlk(file, fl, 0);
@@ -2663,7 +2663,7 @@ static int fuse_file_flock(struct file *file, int cmd, struct file_lock *fl)
struct fuse_conn *fc = get_fuse_conn(inode);
int err;
- if (fc->no_flock) {
+ if (fc->flags.no_flock) {
err = locks_lock_file_wait(file, fl);
} else {
struct fuse_file *ff = file->private_data;
@@ -2685,7 +2685,7 @@ static sector_t fuse_bmap(struct address_space *mapping, sector_t block)
struct fuse_bmap_out outarg;
int err;
- if (!inode->i_sb->s_bdev || fm->fc->no_bmap)
+ if (!inode->i_sb->s_bdev || fm->fc->flags.no_bmap)
return 0;
memset(&inarg, 0, sizeof(inarg));
@@ -2701,7 +2701,7 @@ static sector_t fuse_bmap(struct address_space *mapping, sector_t block)
args.out_args[0].value = &outarg;
err = fuse_simple_request(fm, &args);
if (err == -ENOSYS)
- fm->fc->no_bmap = 1;
+ fm->fc->flags.no_bmap = 1;
return err ? 0 : outarg.block;
}
@@ -2720,7 +2720,7 @@ static loff_t fuse_lseek(struct file *file, loff_t offset, int whence)
struct fuse_lseek_out outarg;
int err;
- if (fm->fc->no_lseek)
+ if (fm->fc->flags.no_lseek)
goto fallback;
args.opcode = FUSE_LSEEK;
@@ -2734,7 +2734,7 @@ static loff_t fuse_lseek(struct file *file, loff_t offset, int whence)
err = fuse_simple_request(fm, &args);
if (err) {
if (err == -ENOSYS) {
- fm->fc->no_lseek = 1;
+ fm->fc->flags.no_lseek = 1;
goto fallback;
}
return err;
@@ -2841,7 +2841,7 @@ __poll_t fuse_file_poll(struct file *file, poll_table *wait)
FUSE_ARGS(args);
int err;
- if (fm->fc->no_poll)
+ if (fm->fc->flags.no_poll)
return DEFAULT_POLLMASK;
poll_wait(file, &ff->poll_wait, wait);
@@ -2869,7 +2869,7 @@ __poll_t fuse_file_poll(struct file *file, poll_table *wait)
if (!err)
return demangle_poll(outarg.revents);
if (err == -ENOSYS) {
- fm->fc->no_poll = 1;
+ fm->fc->flags.no_poll = 1;
return DEFAULT_POLLMASK;
}
return EPOLLERR;
@@ -2955,7 +2955,7 @@ fuse_direct_IO(struct kiocb *iocb, struct iov_iter *iter)
* By default, we want to optimize all I/Os with async request
* submission to the client filesystem if supported.
*/
- io->async = ff->fm->fc->async_dio;
+ io->async = ff->fm->fc->flags.async_dio;
io->iocb = iocb;
io->blocking = is_sync_kiocb(iocb);
@@ -3048,7 +3048,7 @@ static long fuse_file_fallocate(struct file *file, int mode, loff_t offset,
FALLOC_FL_ZERO_RANGE))
return -EOPNOTSUPP;
- if (fm->fc->no_fallocate)
+ if (fm->fc->flags.no_fallocate)
return -EOPNOTSUPP;
inode_lock(inode);
@@ -3088,7 +3088,7 @@ static long fuse_file_fallocate(struct file *file, int mode, loff_t offset,
args.in_args[0].value = &inarg;
err = fuse_simple_request(fm, &args);
if (err == -ENOSYS) {
- fm->fc->no_fallocate = 1;
+ fm->fc->flags.no_fallocate = 1;
err = -EOPNOTSUPP;
}
if (err)
@@ -3144,10 +3144,10 @@ static ssize_t __fuse_copy_file_range(struct file *file_in, loff_t pos_in,
ssize_t err;
/* mark unstable when write-back is not used, and file_out gets
* extended */
- bool is_unstable = (!fc->writeback_cache) &&
+ bool is_unstable = (!fc->flags.writeback_cache) &&
((pos_out + len) > inode_out->i_size);
- if (fc->no_copy_file_range)
+ if (fc->flags.no_copy_file_range)
return -EOPNOTSUPP;
if (file_inode(file_in)->i_sb != file_inode(file_out)->i_sb)
@@ -3200,7 +3200,7 @@ static ssize_t __fuse_copy_file_range(struct file *file_in, loff_t pos_in,
args.out_args[0].value = &outarg;
err = fuse_simple_request(fm, &args);
if (err == -ENOSYS) {
- fc->no_copy_file_range = 1;
+ fc->flags.no_copy_file_range = 1;
err = -EOPNOTSUPP;
}
if (err)
diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
index be5d5d3fe6f5..943d5011dfa0 100644
--- a/fs/fuse/fuse_i.h
+++ b/fs/fuse/fuse_i.h
@@ -543,6 +543,125 @@ struct fuse_sync_bucket {
struct rcu_head rcu;
};
+/**
+ * A Fuse connection flags.
+ *
+ * This structure describes fuse connection capabilities, depending on the
+ * userspace daemon implementation.
+ * Most of this flags are calculated during the processing of reply to FUSE_INIT request,
+ * but some flags values are determined after connection initialization, for example,
+ * no_flush, no_setxattr, etc. These flags are safe to clear, because they always can be
+ * restored with a proper values in runtime.
+ */
+struct fuse_conn_flags {
+ /** Do readahead asynchronously? Only set in INIT */
+ unsigned async_read:1;
+
+ /** Do not send separate SETATTR request before open(O_TRUNC) */
+ unsigned atomic_o_trunc:1;
+
+ /** Filesystem supports NFS exporting. Only set in INIT */
+ unsigned export_support:1;
+
+ /** write-back cache policy (default is write-through) */
+ unsigned writeback_cache:1;
+
+ /** allow parallel lookups and readdir (default is serialized) */
+ unsigned parallel_dirops:1;
+
+ /** handle fs handles killing suid/sgid/cap on write/chown/trunc */
+ unsigned handle_killpriv:1;
+
+ /** cache READLINK responses in page cache */
+ unsigned cache_symlinks:1;
+
+ /*
+ * The following bitfields are only for optimization purposes
+ * and hence races in setting them will not cause malfunction
+ */
+
+ /** Is open/release not implemented by fs? */
+ unsigned no_open:1;
+
+ /** Is opendir/releasedir not implemented by fs? */
+ unsigned no_opendir:1;
+
+ /** Is fsync not implemented by fs? */
+ unsigned no_fsync:1;
+
+ /** Is fsyncdir not implemented by fs? */
+ unsigned no_fsyncdir:1;
+
+ /** Is flush not implemented by fs? */
+ unsigned no_flush:1;
+
+ /** Is setxattr not implemented by fs? */
+ unsigned no_setxattr:1;
+
+ /** Does file server support extended setxattr */
+ unsigned setxattr_ext:1;
+
+ /** Is getxattr not implemented by fs? */
+ unsigned no_getxattr:1;
+
+ /** Is listxattr not implemented by fs? */
+ unsigned no_listxattr:1;
+
+ /** Is removexattr not implemented by fs? */
+ unsigned no_removexattr:1;
+
+ /** Are posix file locking primitives not implemented by fs? */
+ unsigned no_lock:1;
+
+ /** Is access not implemented by fs? */
+ unsigned no_access:1;
+
+ /** Is create not implemented by fs? */
+ unsigned no_create:1;
+
+ /** Is interrupt not implemented by fs? */
+ unsigned no_interrupt:1;
+
+ /** Is bmap not implemented by fs? */
+ unsigned no_bmap:1;
+
+ /** Is poll not implemented by fs? */
+ unsigned no_poll:1;
+
+ /** Do multi-page cached writes */
+ unsigned big_writes:1;
+
+ /** Are BSD file locking primitives not implemented by fs? */
+ unsigned no_flock:1;
+
+ /** Is fallocate not implemented by fs? */
+ unsigned no_fallocate:1;
+
+ /** Is rename with flags implemented by fs? */
+ unsigned no_rename2:1;
+
+ /** Use enhanced/automatic page cache invalidation. */
+ unsigned auto_inval_data:1;
+
+ /** Filesystem is fully responsible for page cache invalidation. */
+ unsigned explicit_inval_data:1;
+
+ /** Does the filesystem support readdirplus? */
+ unsigned do_readdirplus:1;
+
+ /** Does the filesystem want adaptive readdirplus? */
+ unsigned readdirplus_auto:1;
+
+ /** Does the filesystem support asynchronous direct-IO submission? */
+ unsigned async_dio:1;
+
+ /** Is lseek not implemented by fs? */
+ unsigned no_lseek:1;
+
+ /** Does the filesystem support copy_file_range? */
+ unsigned no_copy_file_range:1;
+};
+
/**
* A Fuse connection.
*
@@ -641,30 +760,9 @@ struct fuse_conn {
/** Connection successful. Only set in INIT */
unsigned conn_init:1;
- /** Do readahead asynchronously? Only set in INIT */
- unsigned async_read:1;
-
/** Return an unique read error after abort. Only set in INIT */
unsigned abort_err:1;
- /** Do not send separate SETATTR request before open(O_TRUNC) */
- unsigned atomic_o_trunc:1;
-
- /** Filesystem supports NFS exporting. Only set in INIT */
- unsigned export_support:1;
-
- /** write-back cache policy (default is write-through) */
- unsigned writeback_cache:1;
-
- /** allow parallel lookups and readdir (default is serialized) */
- unsigned parallel_dirops:1;
-
- /** handle fs handles killing suid/sgid/cap on write/chown/trunc */
- unsigned handle_killpriv:1;
-
- /** cache READLINK responses in page cache */
- unsigned cache_symlinks:1;
-
/* show legacy mount options */
unsigned int legacy_opts_show:1;
@@ -676,92 +774,9 @@ struct fuse_conn {
*/
unsigned handle_killpriv_v2:1;
- /*
- * The following bitfields are only for optimization purposes
- * and hence races in setting them will not cause malfunction
- */
-
- /** Is open/release not implemented by fs? */
- unsigned no_open:1;
-
- /** Is opendir/releasedir not implemented by fs? */
- unsigned no_opendir:1;
-
- /** Is fsync not implemented by fs? */
- unsigned no_fsync:1;
-
- /** Is fsyncdir not implemented by fs? */
- unsigned no_fsyncdir:1;
-
- /** Is flush not implemented by fs? */
- unsigned no_flush:1;
-
- /** Is setxattr not implemented by fs? */
- unsigned no_setxattr:1;
-
- /** Does file server support extended setxattr */
- unsigned setxattr_ext:1;
-
- /** Is getxattr not implemented by fs? */
- unsigned no_getxattr:1;
-
- /** Is listxattr not implemented by fs? */
- unsigned no_listxattr:1;
-
- /** Is removexattr not implemented by fs? */
- unsigned no_removexattr:1;
-
- /** Are posix file locking primitives not implemented by fs? */
- unsigned no_lock:1;
-
- /** Is access not implemented by fs? */
- unsigned no_access:1;
-
- /** Is create not implemented by fs? */
- unsigned no_create:1;
-
- /** Is interrupt not implemented by fs? */
- unsigned no_interrupt:1;
-
- /** Is bmap not implemented by fs? */
- unsigned no_bmap:1;
-
- /** Is poll not implemented by fs? */
- unsigned no_poll:1;
-
- /** Do multi-page cached writes */
- unsigned big_writes:1;
-
/** Don't apply umask to creation modes */
unsigned dont_mask:1;
- /** Are BSD file locking primitives not implemented by fs? */
- unsigned no_flock:1;
-
- /** Is fallocate not implemented by fs? */
- unsigned no_fallocate:1;
-
- /** Is rename with flags implemented by fs? */
- unsigned no_rename2:1;
-
- /** Use enhanced/automatic page cache invalidation. */
- unsigned auto_inval_data:1;
-
- /** Filesystem is fully responsible for page cache invalidation. */
- unsigned explicit_inval_data:1;
-
- /** Does the filesystem support readdirplus? */
- unsigned do_readdirplus:1;
-
- /** Does the filesystem want adaptive readdirplus? */
- unsigned readdirplus_auto:1;
-
- /** Does the filesystem support asynchronous direct-IO submission? */
- unsigned async_dio:1;
-
- /** Is lseek not implemented by fs? */
- unsigned no_lseek:1;
-
/** Does the filesystem support posix acls? */
unsigned posix_acl:1;
@@ -771,9 +786,6 @@ struct fuse_conn {
/** Allow other than the mounter user to access the filesystem ? */
unsigned allow_other:1;
- /** Does the filesystem support copy_file_range? */
- unsigned no_copy_file_range:1;
-
/* Send DESTROY request */
unsigned int destroy:1;
@@ -804,6 +816,8 @@ struct fuse_conn {
/* Is tmpfile not implemented by fs? */
unsigned int no_tmpfile:1;
+ struct fuse_conn_flags flags;
+
/** The number of requests waiting for completion */
atomic_t num_waiting;
diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
index e5ad5d4c215a..389bea6e4a69 100644
--- a/fs/fuse/inode.c
+++ b/fs/fuse/inode.c
@@ -224,7 +224,7 @@ u32 fuse_get_cache_mask(struct inode *inode)
{
struct fuse_conn *fc = get_fuse_conn(inode);
- if (!fc->writeback_cache || !S_ISREG(inode->i_mode))
+ if (!fc->flags.writeback_cache || !S_ISREG(inode->i_mode))
return 0;
return STATX_MTIME | STATX_CTIME | STATX_SIZE;
@@ -282,9 +282,9 @@ void fuse_change_attributes(struct inode *inode, struct fuse_attr *attr,
if (oldsize != attr->size) {
truncate_pagecache(inode, attr->size);
- if (!fc->explicit_inval_data)
+ if (!fc->flags.explicit_inval_data)
inval = true;
- } else if (fc->auto_inval_data) {
+ } else if (fc->flags.auto_inval_data) {
struct timespec64 new_mtime = {
.tv_sec = attr->mtime,
.tv_nsec = attr->mtimensec,
@@ -387,7 +387,7 @@ struct inode *fuse_iget(struct super_block *sb, u64 nodeid,
if ((inode->i_state & I_NEW)) {
inode->i_flags |= S_NOATIME;
- if (!fc->writeback_cache || !S_ISREG(attr->mode))
+ if (!fc->flags.writeback_cache || !S_ISREG(attr->mode))
inode->i_flags |= S_NOCMTIME;
inode->i_generation = generation;
fuse_init_inode(inode, attr, fc);
@@ -466,7 +466,7 @@ bool fuse_lock_inode(struct inode *inode)
{
bool locked = false;
- if (!get_fuse_conn(inode)->parallel_dirops) {
+ if (!get_fuse_conn(inode)->flags.parallel_dirops) {
mutex_lock(&get_fuse_inode(inode)->mutex);
locked = true;
}
@@ -918,7 +918,7 @@ static struct dentry *fuse_get_dentry(struct super_block *sb,
struct fuse_entry_out outarg;
const struct qstr name = QSTR_INIT(".", 1);
- if (!fc->export_support)
+ if (!fc->flags.export_support)
goto out_err;
err = fuse_lookup_name(sb, handle->nodeid, &name, &outarg,
@@ -1018,7 +1018,7 @@ static struct dentry *fuse_get_parent(struct dentry *child)
struct fuse_entry_out outarg;
int err;
- if (!fc->export_support)
+ if (!fc->flags.export_support)
return ERR_PTR(-ESTALE);
err = fuse_lookup_name(child_inode->i_sb, get_node_id(child_inode),
@@ -1134,44 +1134,44 @@ static void process_init_reply(struct fuse_mount *fm, struct fuse_args *args,
ra_pages = arg->max_readahead / PAGE_SIZE;
if (flags & FUSE_ASYNC_READ)
- fc->async_read = 1;
+ fc->flags.async_read = 1;
if (!(flags & FUSE_POSIX_LOCKS))
- fc->no_lock = 1;
+ fc->flags.no_lock = 1;
if (arg->minor >= 17) {
if (!(flags & FUSE_FLOCK_LOCKS))
- fc->no_flock = 1;
+ fc->flags.no_flock = 1;
} else {
if (!(flags & FUSE_POSIX_LOCKS))
- fc->no_flock = 1;
+ fc->flags.no_flock = 1;
}
if (flags & FUSE_ATOMIC_O_TRUNC)
- fc->atomic_o_trunc = 1;
+ fc->flags.atomic_o_trunc = 1;
if (arg->minor >= 9) {
/* LOOKUP has dependency on proto version */
if (flags & FUSE_EXPORT_SUPPORT)
- fc->export_support = 1;
+ fc->flags.export_support = 1;
}
if (flags & FUSE_BIG_WRITES)
- fc->big_writes = 1;
+ fc->flags.big_writes = 1;
if (flags & FUSE_DONT_MASK)
fc->dont_mask = 1;
if (flags & FUSE_AUTO_INVAL_DATA)
- fc->auto_inval_data = 1;
+ fc->flags.auto_inval_data = 1;
else if (flags & FUSE_EXPLICIT_INVAL_DATA)
- fc->explicit_inval_data = 1;
+ fc->flags.explicit_inval_data = 1;
if (flags & FUSE_DO_READDIRPLUS) {
- fc->do_readdirplus = 1;
+ fc->flags.do_readdirplus = 1;
if (flags & FUSE_READDIRPLUS_AUTO)
- fc->readdirplus_auto = 1;
+ fc->flags.readdirplus_auto = 1;
}
if (flags & FUSE_ASYNC_DIO)
- fc->async_dio = 1;
+ fc->flags.async_dio = 1;
if (flags & FUSE_WRITEBACK_CACHE)
- fc->writeback_cache = 1;
+ fc->flags.writeback_cache = 1;
if (flags & FUSE_PARALLEL_DIROPS)
- fc->parallel_dirops = 1;
+ fc->flags.parallel_dirops = 1;
if (flags & FUSE_HANDLE_KILLPRIV)
- fc->handle_killpriv = 1;
+ fc->flags.handle_killpriv = 1;
if (arg->time_gran && arg->time_gran <= 1000000000)
fm->sb->s_time_gran = arg->time_gran;
if ((flags & FUSE_POSIX_ACL)) {
@@ -1179,7 +1179,7 @@ static void process_init_reply(struct fuse_mount *fm, struct fuse_args *args,
fc->posix_acl = 1;
}
if (flags & FUSE_CACHE_SYMLINKS)
- fc->cache_symlinks = 1;
+ fc->flags.cache_symlinks = 1;
if (flags & FUSE_ABORT_ERROR)
fc->abort_err = 1;
if (flags & FUSE_MAX_PAGES) {
@@ -1200,15 +1200,15 @@ static void process_init_reply(struct fuse_mount *fm, struct fuse_args *args,
fm->sb->s_flags |= SB_NOSEC;
}
if (flags & FUSE_SETXATTR_EXT)
- fc->setxattr_ext = 1;
+ fc->flags.setxattr_ext = 1;
if (flags & FUSE_SECURITY_CTX)
fc->init_security = 1;
if (flags & FUSE_CREATE_SUPP_GROUP)
fc->create_supp_group = 1;
} else {
ra_pages = fc->max_read / PAGE_SIZE;
- fc->no_lock = 1;
- fc->no_flock = 1;
+ fc->flags.no_lock = 1;
+ fc->flags.no_flock = 1;
}
fm->sb->s_bdi->ra_pages =
diff --git a/fs/fuse/readdir.c b/fs/fuse/readdir.c
index dc603479b30e..2a5bfb52ebf3 100644
--- a/fs/fuse/readdir.c
+++ b/fs/fuse/readdir.c
@@ -18,9 +18,9 @@ static bool fuse_use_readdirplus(struct inode *dir, struct dir_context *ctx)
struct fuse_conn *fc = get_fuse_conn(dir);
struct fuse_inode *fi = get_fuse_inode(dir);
- if (!fc->do_readdirplus)
+ if (!fc->flags.do_readdirplus)
return false;
- if (!fc->readdirplus_auto)
+ if (!fc->flags.readdirplus_auto)
return true;
if (test_and_clear_bit(FUSE_I_ADVISE_RDPLUS, &fi->state))
return true;
@@ -246,7 +246,7 @@ static int fuse_direntplus_link(struct file *file,
if (IS_ERR(dentry))
return PTR_ERR(dentry);
}
- if (fc->readdirplus_auto)
+ if (fc->flags.readdirplus_auto)
set_bit(FUSE_I_INIT_RDPLUS, &get_fuse_inode(inode)->state);
fuse_change_entry_timeout(dentry, o);
@@ -455,7 +455,7 @@ static int fuse_readdir_cached(struct file *file, struct dir_context *ctx)
* We're just about to start reading into the cache or reading the
* cache; both cases require an up-to-date mtime value.
*/
- if (!ctx->pos && fc->auto_inval_data) {
+ if (!ctx->pos && fc->flags.auto_inval_data) {
int err = fuse_update_attributes(inode, file, STATX_MTIME);
if (err)
diff --git a/fs/fuse/xattr.c b/fs/fuse/xattr.c
index 49c01559580f..5d8603f9c355 100644
--- a/fs/fuse/xattr.c
+++ b/fs/fuse/xattr.c
@@ -19,7 +19,7 @@ int fuse_setxattr(struct inode *inode, const char *name, const void *value,
struct fuse_setxattr_in inarg;
int err;
- if (fm->fc->no_setxattr)
+ if (fm->fc->flags.no_setxattr)
return -EOPNOTSUPP;
memset(&inarg, 0, sizeof(inarg));
@@ -30,7 +30,7 @@ int fuse_setxattr(struct inode *inode, const char *name, const void *value,
args.opcode = FUSE_SETXATTR;
args.nodeid = get_node_id(inode);
args.in_numargs = 3;
- args.in_args[0].size = fm->fc->setxattr_ext ?
+ args.in_args[0].size = fm->fc->flags.setxattr_ext ?
sizeof(inarg) : FUSE_COMPAT_SETXATTR_IN_SIZE;
args.in_args[0].value = &inarg;
args.in_args[1].size = strlen(name) + 1;
@@ -39,7 +39,7 @@ int fuse_setxattr(struct inode *inode, const char *name, const void *value,
args.in_args[2].value = value;
err = fuse_simple_request(fm, &args);
if (err == -ENOSYS) {
- fm->fc->no_setxattr = 1;
+ fm->fc->flags.no_setxattr = 1;
err = -EOPNOTSUPP;
}
if (!err)
@@ -57,7 +57,7 @@ ssize_t fuse_getxattr(struct inode *inode, const char *name, void *value,
struct fuse_getxattr_out outarg;
ssize_t ret;
- if (fm->fc->no_getxattr)
+ if (fm->fc->flags.no_getxattr)
return -EOPNOTSUPP;
memset(&inarg, 0, sizeof(inarg));
@@ -83,7 +83,7 @@ ssize_t fuse_getxattr(struct inode *inode, const char *name, void *value,
if (!ret && !size)
ret = min_t(ssize_t, outarg.size, XATTR_SIZE_MAX);
if (ret == -ENOSYS) {
- fm->fc->no_getxattr = 1;
+ fm->fc->flags.no_getxattr = 1;
ret = -EOPNOTSUPP;
}
return ret;
@@ -121,7 +121,7 @@ ssize_t fuse_listxattr(struct dentry *entry, char *list, size_t size)
if (!fuse_allow_current_process(fm->fc))
return -EACCES;
- if (fm->fc->no_listxattr)
+ if (fm->fc->flags.no_listxattr)
return -EOPNOTSUPP;
memset(&inarg, 0, sizeof(inarg));
@@ -147,7 +147,7 @@ ssize_t fuse_listxattr(struct dentry *entry, char *list, size_t size)
if (ret > 0 && size)
ret = fuse_verify_xattr_list(list, ret);
if (ret == -ENOSYS) {
- fm->fc->no_listxattr = 1;
+ fm->fc->flags.no_listxattr = 1;
ret = -EOPNOTSUPP;
}
return ret;
@@ -159,7 +159,7 @@ int fuse_removexattr(struct inode *inode, const char *name)
FUSE_ARGS(args);
int err;
- if (fm->fc->no_removexattr)
+ if (fm->fc->flags.no_removexattr)
return -EOPNOTSUPP;
args.opcode = FUSE_REMOVEXATTR;
@@ -169,7 +169,7 @@ int fuse_removexattr(struct inode *inode, const char *name)
args.in_args[0].value = name;
err = fuse_simple_request(fm, &args);
if (err == -ENOSYS) {
- fm->fc->no_removexattr = 1;
+ fm->fc->flags.no_removexattr = 1;
err = -EOPNOTSUPP;
}
if (!err)
--
2.34.1
- modify dentry revalidation algorithm to check inode/connection
generations. If them are not equal then perform revalidation.
remark: during forced dentry revalidation we are sending FUSE_LOOKUP
request to the userspace daemon and if it return the same inode after
lookup then we can "upgrade" inode connection generation without
invalidating it.
- don't send FUSE_FSYNC, FUSE_RELEASE, etc requests to the userspace
daemon about stale inodes (this can confuse libfuse)
Cc: Miklos Szeredi <[email protected]>
Cc: Al Viro <[email protected]>
Cc: Amir Goldstein <[email protected]>
Cc: Stéphane Graber <[email protected]>
Cc: Seth Forshee <[email protected]>
Cc: Christian Brauner <[email protected]>
Cc: Andrei Vagin <[email protected]>
Cc: Pavel Tikhomirov <[email protected]>
Cc: Bernd Schubert <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Signed-off-by: Alexander Mikhalitsyn <[email protected]>
---
fs/fuse/dir.c | 4 +++-
fs/fuse/file.c | 13 ++++++++++---
fs/fuse/fuse_i.h | 3 ++-
fs/fuse/inode.c | 2 +-
4 files changed, 16 insertions(+), 6 deletions(-)
diff --git a/fs/fuse/dir.c b/fs/fuse/dir.c
index bfbe59e8fce2..d0fdf7289d56 100644
--- a/fs/fuse/dir.c
+++ b/fs/fuse/dir.c
@@ -213,7 +213,8 @@ static int fuse_dentry_revalidate(struct dentry *entry, unsigned int flags)
inode = d_inode_rcu(entry);
if (inode && fuse_is_bad(inode))
goto invalid;
- else if (time_before64(fuse_dentry_time(entry), get_jiffies_64()) ||
+ else if ((inode && fuse_stale_inode_conn(inode)) ||
+ time_before64(fuse_dentry_time(entry), get_jiffies_64()) ||
(flags & (LOOKUP_EXCL | LOOKUP_REVAL | LOOKUP_RENAME_TARGET))) {
struct fuse_entry_out outarg;
FUSE_ARGS(args);
@@ -255,6 +256,7 @@ static int fuse_dentry_revalidate(struct dentry *entry, unsigned int flags)
}
spin_lock(&fi->lock);
fi->nlookup++;
+ fi->conn_gen = READ_ONCE(get_fuse_conn(inode)->conn_gen);
spin_unlock(&fi->lock);
}
kfree(forget);
diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index 742f90b4e638..b977d087b925 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -111,8 +111,14 @@ static void fuse_file_put(struct fuse_file *ff, bool sync, bool isdir)
if (refcount_dec_and_test(&ff->count)) {
struct fuse_args *args = &ff->release_args->args;
- if (isdir ? ff->fm->fc->flags.no_opendir : ff->fm->fc->flags.no_open) {
- /* Do nothing when client does not implement 'open' */
+ if (fuse_stale_ff(ff) ||
+ (isdir ? ff->fm->fc->flags.no_opendir : ff->fm->fc->flags.no_open)) {
+ /*
+ * Do nothing when client does not implement 'open' OR
+ * file descriptor was opened in the previous connection generation,
+ * so, current daemon likely not aware of this FD, let's just skip
+ * FUSE_RELEASE(DIR) request.
+ */
fuse_release_end(ff->fm, args, 0);
} else if (sync) {
fuse_simple_request(ff->fm, args);
@@ -598,9 +604,10 @@ static int fuse_fsync(struct file *file, loff_t start, loff_t end,
{
struct inode *inode = file->f_mapping->host;
struct fuse_conn *fc = get_fuse_conn(inode);
+ struct fuse_file *ff = file->private_data;
int err;
- if (fuse_is_bad(inode))
+ if (fuse_stale_ff(ff) || fuse_is_bad(inode))
return -EIO;
inode_lock(inode);
diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
index 943d5011dfa0..90c5b3459864 100644
--- a/fs/fuse/fuse_i.h
+++ b/fs/fuse/fuse_i.h
@@ -957,7 +957,8 @@ static inline bool fuse_stale_inode(const struct inode *inode, int generation,
struct fuse_attr *attr)
{
return inode->i_generation != generation ||
- inode_wrong_type(inode, attr->mode);
+ inode_wrong_type(inode, attr->mode) ||
+ fuse_stale_inode_conn(inode);
}
static inline void fuse_make_bad(struct inode *inode)
diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
index 389bea6e4a69..26a4149f6db7 100644
--- a/fs/fuse/inode.c
+++ b/fs/fuse/inode.c
@@ -124,7 +124,7 @@ static void fuse_evict_inode(struct inode *inode)
fuse_dax_inode_cleanup(inode);
if (fi->nlookup) {
fuse_queue_forget(fc, fi->forget, fi->nodeid,
- fi->nlookup, false);
+ fi->nlookup, fuse_stale_inode_conn(inode));
fi->forget = NULL;
}
}
--
2.34.1
This ioctl allows to revalidate all the existing fuse bindmounts
by performing relookup of all root dentries and resetting them.
Useful if it's needed to make fuse bindmounts work without
remounting them after fuse connection reinitialization.
Cc: Miklos Szeredi <[email protected]>
Cc: Al Viro <[email protected]>
Cc: Amir Goldstein <[email protected]>
Cc: Stéphane Graber <[email protected]>
Cc: Seth Forshee <[email protected]>
Cc: Christian Brauner <[email protected]>
Cc: Andrei Vagin <[email protected]>
Cc: Pavel Tikhomirov <[email protected]>
Cc: Bernd Schubert <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Signed-off-by: Alexander Mikhalitsyn <[email protected]>
---
fs/fuse/dev.c | 29 ++++++++++++++++++++++++++++-
include/uapi/linux/fuse.h | 1 +
2 files changed, 29 insertions(+), 1 deletion(-)
diff --git a/fs/fuse/dev.c b/fs/fuse/dev.c
index 93a457c90b49..dae4d718b03a 100644
--- a/fs/fuse/dev.c
+++ b/fs/fuse/dev.c
@@ -13,6 +13,7 @@
#include <linux/poll.h>
#include <linux/sched/signal.h>
#include <linux/uio.h>
+#include <linux/mnt_namespace.h>
#include <linux/miscdevice.h>
#include <linux/pagemap.h>
#include <linux/file.h>
@@ -2313,6 +2314,27 @@ static int fuse_reinit_conn(struct fuse_conn *fc)
return 0;
}
+static ssize_t fuse_revalidate_bindmounts(struct fuse_conn *fc)
+{
+ int ret = 0;
+
+ down_read(&fc->killsb);
+ if (!list_empty(&fc->mounts)) {
+ struct fuse_mount *fm;
+
+ fm = list_first_entry(&fc->mounts, struct fuse_mount, fc_entry);
+ if (!fm->sb) {
+ up_read(&fc->killsb);
+ return -EINVAL;
+ }
+
+ ret = sb_revalidate_bindmounts(fm->sb);
+ }
+ up_read(&fc->killsb);
+
+ return ret;
+}
+
void fuse_wait_aborted(struct fuse_conn *fc)
{
/* matches implicit memory barrier in fuse_drop_waiting() */
@@ -2409,6 +2431,7 @@ static long fuse_dev_ioctl(struct file *file, unsigned int cmd,
}
break;
case FUSE_DEV_IOC_REINIT:
+ case FUSE_DEV_IOC_BM_REVAL:
struct fuse_conn *fc;
if (!checkpoint_restore_ns_capable(file->f_cred->user_ns))
@@ -2429,7 +2452,11 @@ static long fuse_dev_ioctl(struct file *file, unsigned int cmd,
mutex_unlock(&fuse_mutex);
if (fc) {
- res = fuse_reinit_conn(fc);
+ if (cmd == FUSE_DEV_IOC_REINIT)
+ res = fuse_reinit_conn(fc);
+ else if (cmd == FUSE_DEV_IOC_BM_REVAL)
+ res = fuse_revalidate_bindmounts(fc);
+
fuse_conn_put(fc);
}
}
diff --git a/include/uapi/linux/fuse.h b/include/uapi/linux/fuse.h
index 325da23431ef..d8d1997876d8 100644
--- a/include/uapi/linux/fuse.h
+++ b/include/uapi/linux/fuse.h
@@ -993,6 +993,7 @@ struct fuse_notify_retrieve_in {
#define FUSE_DEV_IOC_MAGIC 229
#define FUSE_DEV_IOC_CLONE _IOR(FUSE_DEV_IOC_MAGIC, 0, uint32_t)
#define FUSE_DEV_IOC_REINIT _IO(FUSE_DEV_IOC_MAGIC, 0)
+#define FUSE_DEV_IOC_BM_REVAL _IO(FUSE_DEV_IOC_MAGIC, 1)
struct fuse_lseek_in {
uint64_t fh;
--
2.34.1
This ioctl aborts fuse connection and then reinitializes it,
sends FUSE_INIT request to allow a new userspace daemon
to pick up the fuse connection.
Cc: Miklos Szeredi <[email protected]>
Cc: Al Viro <[email protected]>
Cc: Amir Goldstein <[email protected]>
Cc: Stéphane Graber <[email protected]>
Cc: Seth Forshee <[email protected]>
Cc: Christian Brauner <[email protected]>
Cc: Andrei Vagin <[email protected]>
Cc: Pavel Tikhomirov <[email protected]>
Cc: Bernd Schubert <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Signed-off-by: Alexander Mikhalitsyn <[email protected]>
---
fs/fuse/dev.c | 152 ++++++++++++++++++++++++++++++++++++++
fs/fuse/fuse_i.h | 3 +
include/uapi/linux/fuse.h | 1 +
3 files changed, 156 insertions(+)
diff --git a/fs/fuse/dev.c b/fs/fuse/dev.c
index b4501a10c379..93a457c90b49 100644
--- a/fs/fuse/dev.c
+++ b/fs/fuse/dev.c
@@ -2187,6 +2187,132 @@ void fuse_abort_conn(struct fuse_conn *fc)
}
EXPORT_SYMBOL_GPL(fuse_abort_conn);
+static int fuse_reinit_conn(struct fuse_conn *fc)
+{
+ struct fuse_iqueue *fiq = &fc->iq;
+ struct fuse_dev *fud;
+ unsigned int i;
+
+ spin_lock(&fc->lock);
+ if (fc->reinit_in_progress) {
+ spin_unlock(&fc->lock);
+ return -EBUSY;
+ }
+
+ if (fc->conn_gen + 1 < fc->conn_gen) {
+ spin_unlock(&fc->lock);
+ return -EOVERFLOW;
+ }
+
+ fc->reinit_in_progress = true;
+ spin_unlock(&fc->lock);
+
+ /*
+ * Unsets fc->connected and fiq->connected and
+ * ensures that no new requests can be queued
+ */
+ fuse_abort_conn(fc);
+ fuse_wait_aborted(fc);
+
+ spin_lock(&fc->lock);
+ if (fc->connected) {
+ fc->reinit_in_progress = false;
+ spin_unlock(&fc->lock);
+ return -EINVAL;
+ }
+
+ fc->conn_gen++;
+
+ spin_lock(&fiq->lock);
+ if (request_pending(fiq) || fiq->forget_list_tail != &fiq->forget_list_head) {
+ spin_unlock(&fiq->lock);
+ fc->reinit_in_progress = false;
+ spin_unlock(&fc->lock);
+ return -EINVAL;
+ }
+
+ if (&fuse_dev_fiq_ops != fiq->ops) {
+ spin_unlock(&fiq->lock);
+ fc->reinit_in_progress = false;
+ spin_unlock(&fc->lock);
+ return -EOPNOTSUPP;
+ }
+
+ fiq->connected = 1;
+ spin_unlock(&fiq->lock);
+
+ spin_lock(&fc->bg_lock);
+ if (!list_empty(&fc->bg_queue)) {
+ spin_unlock(&fc->bg_lock);
+ fc->reinit_in_progress = false;
+ spin_unlock(&fc->lock);
+ return -EINVAL;
+ }
+
+ fc->blocked = 0;
+ fc->max_background = FUSE_DEFAULT_MAX_BACKGROUND;
+ spin_unlock(&fc->bg_lock);
+
+ list_for_each_entry(fud, &fc->devices, entry) {
+ struct fuse_pqueue *fpq = &fud->pq;
+
+ spin_lock(&fpq->lock);
+ if (!list_empty(&fpq->io)) {
+ spin_unlock(&fpq->lock);
+ fc->reinit_in_progress = false;
+ spin_unlock(&fc->lock);
+ return -EINVAL;
+ }
+
+ for (i = 0; i < FUSE_PQ_HASH_SIZE; i++) {
+ if (!list_empty(&fpq->processing[i])) {
+ spin_unlock(&fpq->lock);
+ fc->reinit_in_progress = false;
+ spin_unlock(&fc->lock);
+ return -EINVAL;
+ }
+ }
+
+ fpq->connected = 1;
+ spin_unlock(&fpq->lock);
+ }
+
+ fuse_set_initialized(fc);
+
+ /* Background queuing checks fc->connected under bg_lock */
+ spin_lock(&fc->bg_lock);
+ fc->connected = 1;
+ spin_unlock(&fc->bg_lock);
+
+ fc->aborted = false;
+ fc->abort_err = 0;
+
+ /* nullify all the flags */
+ memset(&fc->flags, 0, sizeof(struct fuse_conn_flags));
+
+ spin_unlock(&fc->lock);
+
+ down_read(&fc->killsb);
+ if (!list_empty(&fc->mounts)) {
+ struct fuse_mount *fm;
+
+ fm = list_first_entry(&fc->mounts, struct fuse_mount, fc_entry);
+ if (!fm->sb) {
+ up_read(&fc->killsb);
+ return -EINVAL;
+ }
+
+ fuse_send_init(fm);
+ }
+ up_read(&fc->killsb);
+
+ spin_lock(&fc->lock);
+ fc->reinit_in_progress = false;
+ spin_unlock(&fc->lock);
+
+ return 0;
+}
+
void fuse_wait_aborted(struct fuse_conn *fc)
{
/* matches implicit memory barrier in fuse_drop_waiting() */
@@ -2282,6 +2408,32 @@ static long fuse_dev_ioctl(struct file *file, unsigned int cmd,
}
}
break;
+ case FUSE_DEV_IOC_REINIT:
+ struct fuse_conn *fc;
+
+ if (!checkpoint_restore_ns_capable(file->f_cred->user_ns))
+ return -EPERM;
+
+ res = -EINVAL;
+ fud = fuse_get_dev(file);
+
+ /*
+ * Only fuse mounts with an already initialized fuse
+ * connection are supported
+ */
+ if (file->f_op == &fuse_dev_operations && fud) {
+ mutex_lock(&fuse_mutex);
+ fc = fud->fc;
+ if (fc)
+ fc = fuse_conn_get(fc);
+ mutex_unlock(&fuse_mutex);
+
+ if (fc) {
+ res = fuse_reinit_conn(fc);
+ fuse_conn_put(fc);
+ }
+ }
+ break;
default:
res = -ENOTTY;
break;
diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
index 90c5b3459864..8f2c0f969f6f 100644
--- a/fs/fuse/fuse_i.h
+++ b/fs/fuse/fuse_i.h
@@ -752,6 +752,9 @@ struct fuse_conn {
/** Connection aborted via sysfs */
bool aborted;
+ /** Connection reinit in progress */
+ bool reinit_in_progress;
+
/** Connection failed (version mismatch). Cannot race with
setting other bitfields since it is only set once in INIT
reply, before any other request, and never cleared */
diff --git a/include/uapi/linux/fuse.h b/include/uapi/linux/fuse.h
index b3fcab13fcd3..325da23431ef 100644
--- a/include/uapi/linux/fuse.h
+++ b/include/uapi/linux/fuse.h
@@ -992,6 +992,7 @@ struct fuse_notify_retrieve_in {
/* Device ioctls: */
#define FUSE_DEV_IOC_MAGIC 229
#define FUSE_DEV_IOC_CLONE _IOR(FUSE_DEV_IOC_MAGIC, 0, uint32_t)
+#define FUSE_DEV_IOC_REINIT _IO(FUSE_DEV_IOC_MAGIC, 0)
struct fuse_lseek_in {
uint64_t fh;
--
2.34.1
We don't want to send FUSE_FORGET request to the new
fuse daemon if inode was lookuped by the old fuse daemon
because it can confuse and break userspace (libfuse).
For now, just add a new argument to fuse_queue_forget and
handle it. Adjust all callers to match the old behaviour.
Cc: Miklos Szeredi <[email protected]>
Cc: Al Viro <[email protected]>
Cc: Amir Goldstein <[email protected]>
Cc: Stéphane Graber <[email protected]>
Cc: Seth Forshee <[email protected]>
Cc: Christian Brauner <[email protected]>
Cc: Andrei Vagin <[email protected]>
Cc: Pavel Tikhomirov <[email protected]>
Cc: Bernd Schubert <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Signed-off-by: Alexander Mikhalitsyn <[email protected]>
---
fs/fuse/dev.c | 4 ++--
fs/fuse/dir.c | 8 ++++----
fs/fuse/fuse_i.h | 2 +-
fs/fuse/inode.c | 2 +-
4 files changed, 8 insertions(+), 8 deletions(-)
diff --git a/fs/fuse/dev.c b/fs/fuse/dev.c
index eb4f88e3dc97..2e7cd60b685e 100644
--- a/fs/fuse/dev.c
+++ b/fs/fuse/dev.c
@@ -234,7 +234,7 @@ __releases(fiq->lock)
}
void fuse_queue_forget(struct fuse_conn *fc, struct fuse_forget_link *forget,
- u64 nodeid, u64 nlookup)
+ u64 nodeid, u64 nlookup, bool stale_inode_conn)
{
struct fuse_iqueue *fiq = &fc->iq;
@@ -242,7 +242,7 @@ void fuse_queue_forget(struct fuse_conn *fc, struct fuse_forget_link *forget,
forget->forget_one.nlookup = nlookup;
spin_lock(&fiq->lock);
- if (fiq->connected) {
+ if (fiq->connected && likely(!stale_inode_conn)) {
fiq->forget_list_tail->next = forget;
fiq->forget_list_tail = forget;
fiq->ops->wake_forget_and_unlock(fiq);
diff --git a/fs/fuse/dir.c b/fs/fuse/dir.c
index 5a4a7155cf1c..7e308a655191 100644
--- a/fs/fuse/dir.c
+++ b/fs/fuse/dir.c
@@ -250,7 +250,7 @@ static int fuse_dentry_revalidate(struct dentry *entry, unsigned int flags)
if (outarg.nodeid != get_node_id(inode) ||
(bool) IS_AUTOMOUNT(inode) != (bool) (outarg.attr.flags & FUSE_ATTR_SUBMOUNT)) {
fuse_queue_forget(fm->fc, forget,
- outarg.nodeid, 1);
+ outarg.nodeid, 1, false);
goto invalid;
}
spin_lock(&fi->lock);
@@ -403,7 +403,7 @@ int fuse_lookup_name(struct super_block *sb, u64 nodeid, const struct qstr *name
attr_version);
err = -ENOMEM;
if (!*inode) {
- fuse_queue_forget(fm->fc, forget, outarg->nodeid, 1);
+ fuse_queue_forget(fm->fc, forget, outarg->nodeid, 1, false);
goto out;
}
err = 0;
@@ -690,7 +690,7 @@ static int fuse_create_open(struct inode *dir, struct dentry *entry,
if (!inode) {
flags &= ~(O_CREAT | O_EXCL | O_TRUNC);
fuse_sync_release(NULL, ff, flags);
- fuse_queue_forget(fm->fc, forget, outentry.nodeid, 1);
+ fuse_queue_forget(fm->fc, forget, outentry.nodeid, 1, false);
err = -ENOMEM;
goto out_err;
}
@@ -815,7 +815,7 @@ static int create_new_entry(struct fuse_mount *fm, struct fuse_args *args,
inode = fuse_iget(dir->i_sb, outarg.nodeid, outarg.generation,
&outarg.attr, entry_attr_timeout(&outarg), 0);
if (!inode) {
- fuse_queue_forget(fm->fc, forget, outarg.nodeid, 1);
+ fuse_queue_forget(fm->fc, forget, outarg.nodeid, 1, false);
return -ENOMEM;
}
kfree(forget);
diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
index 8d4276d7ab1e..be5d5d3fe6f5 100644
--- a/fs/fuse/fuse_i.h
+++ b/fs/fuse/fuse_i.h
@@ -1008,7 +1008,7 @@ int fuse_lookup_name(struct super_block *sb, u64 nodeid, const struct qstr *name
* Send FORGET command
*/
void fuse_queue_forget(struct fuse_conn *fc, struct fuse_forget_link *forget,
- u64 nodeid, u64 nlookup);
+ u64 nodeid, u64 nlookup, bool stale_inode_conn);
struct fuse_forget_link *fuse_alloc_forget(void);
diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
index 009fe5bbb855..e5ad5d4c215a 100644
--- a/fs/fuse/inode.c
+++ b/fs/fuse/inode.c
@@ -124,7 +124,7 @@ static void fuse_evict_inode(struct inode *inode)
fuse_dax_inode_cleanup(inode);
if (fi->nlookup) {
fuse_queue_forget(fc, fi->forget, fi->nodeid,
- fi->nlookup);
+ fi->nlookup, false);
fi->forget = NULL;
}
}
--
2.34.1
We will use connection generation to detect stale inodes
from the "old" fuse daemon and invalidate/revalidate them.
There is no functional changes.
Cc: Miklos Szeredi <[email protected]>
Cc: Al Viro <[email protected]>
Cc: Amir Goldstein <[email protected]>
Cc: Stéphane Graber <[email protected]>
Cc: Seth Forshee <[email protected]>
Cc: Christian Brauner <[email protected]>
Cc: Andrei Vagin <[email protected]>
Cc: Pavel Tikhomirov <[email protected]>
Cc: Bernd Schubert <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Signed-off-by: Alexander Mikhalitsyn <[email protected]>
---
fs/fuse/file.c | 1 +
fs/fuse/fuse_i.h | 29 +++++++++++++++++++++++++++++
fs/fuse/inode.c | 2 ++
3 files changed, 32 insertions(+)
diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index de37a3a06a71..1e36cd9490c6 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -79,6 +79,7 @@ struct fuse_file *fuse_file_alloc(struct fuse_mount *fm)
init_waitqueue_head(&ff->poll_wait);
ff->kh = atomic64_inc_return(&fm->fc->khctr);
+ ff->conn_gen = READ_ONCE(fm->fc->conn_gen);
return ff;
}
diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
index 6d3d3ca4f136..8d4276d7ab1e 100644
--- a/fs/fuse/fuse_i.h
+++ b/fs/fuse/fuse_i.h
@@ -161,6 +161,9 @@ struct fuse_inode {
*/
struct fuse_inode_dax *dax;
#endif
+
+ /** Fuse connection (fuse_conn) generation when inode was allocated */
+ u32 conn_gen;
};
/** FUSE inode state bits */
@@ -232,6 +235,9 @@ struct fuse_file {
/** Has flock been performed on this file? */
bool flock:1;
+
+ /** Fuse connection (fuse_conn) generation when file was allocated */
+ u32 conn_gen;
};
/** One input argument of a request */
@@ -847,6 +853,18 @@ struct fuse_conn {
/* New writepages go into this bucket */
struct fuse_sync_bucket __rcu *curr_bucket;
+
+ /**
+ * Connection generation.
+ * Used to determine if inodes/files were created with an "old"
+ * fuse connection and have to be invalidated. So, all requests
+ * related to these inodes should fail with -EIO.
+ *
+ * CHECKME: do we really need conn_gen for struct fuse_file?
+ * Right now it's only needed for fuse_file_put(), where we have
+ * no access to the inode in some cases.
+ */
+ u32 conn_gen;
};
/*
@@ -910,6 +928,17 @@ static inline u64 fuse_get_attr_version(const struct fuse_conn *fc)
return atomic64_read(&fc->attr_version);
}
+static inline bool fuse_stale_ff(const struct fuse_file *ff)
+{
+ return unlikely(READ_ONCE(ff->fm->fc->conn_gen) != ff->conn_gen);
+}
+
+static inline bool fuse_stale_inode_conn(const struct inode *inode)
+{
+ return unlikely(READ_ONCE(get_fuse_conn(inode)->conn_gen) !=
+ get_fuse_inode(inode)->conn_gen);
+}
+
static inline bool fuse_stale_inode(const struct inode *inode, int generation,
struct fuse_attr *attr)
{
diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
index 3de950104f15..009fe5bbb855 100644
--- a/fs/fuse/inode.c
+++ b/fs/fuse/inode.c
@@ -77,6 +77,7 @@ static struct inode *fuse_alloc_inode(struct super_block *sb)
fi->attr_version = 0;
fi->orig_ino = 0;
fi->state = 0;
+ fi->conn_gen = READ_ONCE(get_fuse_conn_super(sb)->conn_gen);
mutex_init(&fi->mutex);
spin_lock_init(&fi->lock);
fi->forget = fuse_alloc_forget();
@@ -848,6 +849,7 @@ void fuse_conn_init(struct fuse_conn *fc, struct fuse_mount *fm,
fc->user_ns = get_user_ns(user_ns);
fc->max_pages = FUSE_DEFAULT_MAX_PAGES_PER_REQ;
fc->max_pages_limit = FUSE_MAX_MAX_PAGES;
+ fc->conn_gen = 1;
INIT_LIST_HEAD(&fc->mounts);
list_add(&fm->fc_entry, &fc->mounts);
--
2.34.1
Useful if for some reason bindmounts root dentries get invalidated
but it's needed to revalidate existing bindmounts without remounting.
Cc: Miklos Szeredi <[email protected]>
Cc: Al Viro <[email protected]>
Cc: Amir Goldstein <[email protected]>
Cc: Stéphane Graber <[email protected]>
Cc: Seth Forshee <[email protected]>
Cc: Christian Brauner <[email protected]>
Cc: Andrei Vagin <[email protected]>
Cc: Pavel Tikhomirov <[email protected]>
Cc: Bernd Schubert <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Signed-off-by: Alexander Mikhalitsyn <[email protected]>
---
fs/namespace.c | 90 +++++++++++++++++++++++++++++++++++
include/linux/mnt_namespace.h | 3 ++
2 files changed, 93 insertions(+)
diff --git a/fs/namespace.c b/fs/namespace.c
index bc0f15257b49..b74d00d6abb0 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -568,6 +568,96 @@ static int mnt_make_readonly(struct mount *mnt)
return ret;
}
+struct bind_mount_list_item {
+ struct list_head list;
+ struct vfsmount *mnt;
+};
+
+/*
+ * sb_revalidate_bindmounts - Relookup/reset bindmounts root dentries
+ *
+ * Useful if for some reason bindmount root dentries get invalidated
+ * but it's needed to revalidate existing bindmounts without remounting.
+ */
+int sb_revalidate_bindmounts(struct super_block *sb)
+{
+ struct mount *mnt;
+ struct bind_mount_list_item *bmnt, *next;
+ int err = 0;
+ struct vfsmount *root_mnt = NULL;
+ LIST_HEAD(mnt_to_update);
+ char *buf;
+
+ buf = (char *) __get_free_page(GFP_KERNEL);
+ if (!buf)
+ return -ENOMEM;
+
+ lock_mount_hash();
+ list_for_each_entry(mnt, &sb->s_mounts, mnt_instance) {
+ /* we only want to touch bindmounts */
+ if (mnt->mnt.mnt_root == sb->s_root) {
+ if (!root_mnt)
+ root_mnt = mntget(&mnt->mnt);
+
+ continue;
+ }
+
+ bmnt = kzalloc(sizeof(struct bind_mount_list_item), GFP_NOWAIT | __GFP_NOWARN);
+ if (!bmnt) {
+ err = -ENOMEM;
+ goto exit;
+ }
+
+ bmnt->mnt = mntget(&mnt->mnt);
+ list_add_tail(&bmnt->list, &mnt_to_update);
+ }
+ unlock_mount_hash();
+
+ /* TODO: get rid of this limitation */
+ if (!root_mnt) {
+ err = -ENOENT;
+ goto exit;
+ }
+
+ list_for_each_entry_safe(bmnt, next, &mnt_to_update, list) {
+ struct vfsmount *cur_mnt = bmnt->mnt;
+ struct path path;
+ struct dentry *old_root;
+ char *p;
+
+ p = dentry_path(cur_mnt->mnt_root, buf, PAGE_SIZE);
+ if (IS_ERR(p))
+ goto exit;
+
+ /* TODO: are these lookup flags fully safe and correct? */
+ err = vfs_path_lookup(root_mnt->mnt_root, root_mnt,
+ p, LOOKUP_FOLLOW|LOOKUP_AUTOMOUNT|LOOKUP_REVAL, &path);
+ if (err)
+ goto exit;
+
+ /* replace bindmount root dentry */
+ lock_mount_hash();
+ old_root = cur_mnt->mnt_root;
+ cur_mnt->mnt_root = dget(path.dentry);
+ dput(old_root);
+ unlock_mount_hash();
+
+ path_put(&path);
+ }
+
+exit:
+ free_page((unsigned long) buf);
+ mntput(root_mnt);
+ list_for_each_entry_safe(bmnt, next, &mnt_to_update, list) {
+ list_del(&bmnt->list);
+ mntput(bmnt->mnt);
+ kfree(bmnt);
+ }
+
+ return err;
+}
+EXPORT_SYMBOL_GPL(sb_revalidate_bindmounts);
+
int sb_prepare_remount_readonly(struct super_block *sb)
{
struct mount *mnt;
diff --git a/include/linux/mnt_namespace.h b/include/linux/mnt_namespace.h
index 8f882f5881e8..20ac29e702f5 100644
--- a/include/linux/mnt_namespace.h
+++ b/include/linux/mnt_namespace.h
@@ -3,6 +3,7 @@
#define _NAMESPACE_H_
#ifdef __KERNEL__
+struct super_block;
struct mnt_namespace;
struct fs_struct;
struct user_namespace;
@@ -13,6 +14,8 @@ extern struct mnt_namespace *copy_mnt_ns(unsigned long, struct mnt_namespace *,
extern void put_mnt_ns(struct mnt_namespace *ns);
extern struct ns_common *from_mnt_ns(struct mnt_namespace *);
+extern int sb_revalidate_bindmounts(struct super_block *sb);
+
extern const struct file_operations proc_mounts_operations;
extern const struct file_operations proc_mountinfo_operations;
extern const struct file_operations proc_mountstats_operations;
--
2.34.1
On Mon, Apr 03, 2023 at 04:45:16PM +0200, Alexander Mikhalitsyn wrote:
> Useful if for some reason bindmounts root dentries get invalidated
> but it's needed to revalidate existing bindmounts without remounting.
>
> Cc: Miklos Szeredi <[email protected]>
> Cc: Al Viro <[email protected]>
> Cc: Amir Goldstein <[email protected]>
> Cc: Stéphane Graber <[email protected]>
> Cc: Seth Forshee <[email protected]>
> Cc: Christian Brauner <[email protected]>
> Cc: Andrei Vagin <[email protected]>
> Cc: Pavel Tikhomirov <[email protected]>
> Cc: Bernd Schubert <[email protected]>
> Cc: [email protected]
> Cc: [email protected]
> Cc: [email protected]
> Signed-off-by: Alexander Mikhalitsyn <[email protected]>
> ---
> fs/namespace.c | 90 +++++++++++++++++++++++++++++++++++
> include/linux/mnt_namespace.h | 3 ++
> 2 files changed, 93 insertions(+)
>
> diff --git a/fs/namespace.c b/fs/namespace.c
> index bc0f15257b49..b74d00d6abb0 100644
> --- a/fs/namespace.c
> +++ b/fs/namespace.c
> @@ -568,6 +568,96 @@ static int mnt_make_readonly(struct mount *mnt)
> return ret;
> }
>
> +struct bind_mount_list_item {
> + struct list_head list;
> + struct vfsmount *mnt;
> +};
> +
> +/*
> + * sb_revalidate_bindmounts - Relookup/reset bindmounts root dentries
> + *
> + * Useful if for some reason bindmount root dentries get invalidated
> + * but it's needed to revalidate existing bindmounts without remounting.
> + */
> +int sb_revalidate_bindmounts(struct super_block *sb)
It's difficult to not strongly dislike this patchset on the basis of the
need for a function like this alone. And I just have to say it even if I
normally try not to do this: This whole function is bonkers in my opinion.
But leaving that aside for a second. This really needs detailed
explanations on locking, assumptions, and an explanation what you're
doing here. This looks crazy to me and definitely isn't fit to be a
generic helper in this form.
> +{
> + struct mount *mnt;
> + struct bind_mount_list_item *bmnt, *next;
> + int err = 0;
> + struct vfsmount *root_mnt = NULL;
> + LIST_HEAD(mnt_to_update);
> + char *buf;
> +
> + buf = (char *) __get_free_page(GFP_KERNEL);
> + if (!buf)
> + return -ENOMEM;
> +
> + lock_mount_hash();
> + list_for_each_entry(mnt, &sb->s_mounts, mnt_instance) {
> + /* we only want to touch bindmounts */
> + if (mnt->mnt.mnt_root == sb->s_root) {
> + if (!root_mnt)
> + root_mnt = mntget(&mnt->mnt);
> +
> + continue;
> + }
> +
> + bmnt = kzalloc(sizeof(struct bind_mount_list_item), GFP_NOWAIT | __GFP_NOWARN);
Allocating under lock_mount_hash() even if doable with this flag
combination should be avoided at all costs imho. That just seems hacky.
> + if (!bmnt) {
> + err = -ENOMEM;
> + goto exit;
You're exiting the function with lock_mount_hash() held...
> + }
> +
> + bmnt->mnt = mntget(&mnt->mnt);
> + list_add_tail(&bmnt->list, &mnt_to_update);
> + }
> + unlock_mount_hash();
You've dropped unlock_mount_hash() and the function doesn't hold
namespace_lock() and isn't documented as requiring the caller to hold
it. And the later patch that uses this doesn't afaict (although I
haven't looked at any of the fuse specific stuff). So this is open to
all sorts of races with mount and unmount now. This needs an explanation
why that doesn't matter.
> +
> + /* TODO: get rid of this limitation */
Confused about this comment.
> + if (!root_mnt) {
> + err = -ENOENT;
> + goto exit;
> + }
> +
> + list_for_each_entry_safe(bmnt, next, &mnt_to_update, list) {
No one else can access that list so list_for_each_entry_safe() seems
pointless.
> + struct vfsmount *cur_mnt = bmnt->mnt;
> + struct path path;
> + struct dentry *old_root;
> + char *p;
> +
> + p = dentry_path(cur_mnt->mnt_root, buf, PAGE_SIZE);
> + if (IS_ERR(p))
> + goto exit;
> +
> + /* TODO: are these lookup flags fully safe and correct? */
> + err = vfs_path_lookup(root_mnt->mnt_root, root_mnt,
> + p, LOOKUP_FOLLOW|LOOKUP_AUTOMOUNT|LOOKUP_REVAL, &path);
> + if (err)
> + goto exit;
> +
> + /* replace bindmount root dentry */
> + lock_mount_hash();
> + old_root = cur_mnt->mnt_root;
> + cur_mnt->mnt_root = dget(path.dentry);
mnt->mnt_root isn't protected by lock_mount_hash(). It's invariant after
it has been set and a lot of code just assumes that it's stable.
Top of my hat, since you're not holding namespace_lock() mount
propagation can be going on concurrently so propagate_one() can check if
(!is_subdir(mp->m_dentry, m->mnt.mnt_root)) while you're happily
updating it. A lot of code could actually be touching mnt->mnt_root
while you're updating it.
There's probably a lot more issues with this I'm just not seeing without
spending more time on this. But NAK on this as it stands. Sorry.
> + dput(old_root);
> + unlock_mount_hash();
> +
> + path_put(&path);
> + }
> +
> +exit:
> + free_page((unsigned long) buf);
> + mntput(root_mnt);
> + list_for_each_entry_safe(bmnt, next, &mnt_to_update, list) {
> + list_del(&bmnt->list);
> + mntput(bmnt->mnt);
> + kfree(bmnt);
> + }
> +
> + return err;
> +}
> +EXPORT_SYMBOL_GPL(sb_revalidate_bindmounts);
On Mon, Apr 3, 2023 at 11:14 PM Christian Brauner <[email protected]> wrote:
>
> On Mon, Apr 03, 2023 at 04:45:16PM +0200, Alexander Mikhalitsyn wrote:
> > Useful if for some reason bindmounts root dentries get invalidated
> > but it's needed to revalidate existing bindmounts without remounting.
> >
> > Cc: Miklos Szeredi <[email protected]>
> > Cc: Al Viro <[email protected]>
> > Cc: Amir Goldstein <[email protected]>
> > Cc: Stéphane Graber <[email protected]>
> > Cc: Seth Forshee <[email protected]>
> > Cc: Christian Brauner <[email protected]>
> > Cc: Andrei Vagin <[email protected]>
> > Cc: Pavel Tikhomirov <[email protected]>
> > Cc: Bernd Schubert <[email protected]>
> > Cc: [email protected]
> > Cc: [email protected]
> > Cc: [email protected]
> > Signed-off-by: Alexander Mikhalitsyn <[email protected]>
> > ---
> > fs/namespace.c | 90 +++++++++++++++++++++++++++++++++++
> > include/linux/mnt_namespace.h | 3 ++
> > 2 files changed, 93 insertions(+)
> >
> > diff --git a/fs/namespace.c b/fs/namespace.c
> > index bc0f15257b49..b74d00d6abb0 100644
> > --- a/fs/namespace.c
> > +++ b/fs/namespace.c
> > @@ -568,6 +568,96 @@ static int mnt_make_readonly(struct mount *mnt)
> > return ret;
> > }
> >
> > +struct bind_mount_list_item {
> > + struct list_head list;
> > + struct vfsmount *mnt;
> > +};
> > +
> > +/*
> > + * sb_revalidate_bindmounts - Relookup/reset bindmounts root dentries
> > + *
> > + * Useful if for some reason bindmount root dentries get invalidated
> > + * but it's needed to revalidate existing bindmounts without remounting.
> > + */
> > +int sb_revalidate_bindmounts(struct super_block *sb)
Hi Christian!
>
> It's difficult to not strongly dislike this patchset on the basis of the
> need for a function like this alone. And I just have to say it even if I
> normally try not to do this: This whole function is bonkers in my opinion.
This function is the main pain point in this patchset, I believe :)
I thought a lot about other ways (without this helper) and all of them
have their pros and cons.
And that's the reason why this patchset is RFC. I wanted to hear from
you and other folks
about your point of view on the problem. As far as I can see there is
some interest to that problem,
and implementation question is important, but if there is no
objections against the concept of
FUSE mounts healing (and Checkpoint/Restore too as a more complex
case) then I think we will
be able to agree on some approach to the problem. I'm ready to rework
that in the way it is supposed to be done.
>
> But leaving that aside for a second. This really needs detailed
> explanations on locking, assumptions, and an explanation what you're
> doing here. This looks crazy to me and definitely isn't fit to be a
> generic helper in this form.
Of course.
>
> > +{
> > + struct mount *mnt;
> > + struct bind_mount_list_item *bmnt, *next;
> > + int err = 0;
> > + struct vfsmount *root_mnt = NULL;
> > + LIST_HEAD(mnt_to_update);
> > + char *buf;
> > +
> > + buf = (char *) __get_free_page(GFP_KERNEL);
> > + if (!buf)
> > + return -ENOMEM;
> > +
> > + lock_mount_hash();
> > + list_for_each_entry(mnt, &sb->s_mounts, mnt_instance) {
> > + /* we only want to touch bindmounts */
> > + if (mnt->mnt.mnt_root == sb->s_root) {
> > + if (!root_mnt)
> > + root_mnt = mntget(&mnt->mnt);
> > +
> > + continue;
> > + }
> > +
> > + bmnt = kzalloc(sizeof(struct bind_mount_list_item), GFP_NOWAIT | __GFP_NOWARN);
>
> Allocating under lock_mount_hash() even if doable with this flag
> combination should be avoided at all costs imho. That just seems hacky.
got it
>
> > + if (!bmnt) {
> > + err = -ENOMEM;
> > + goto exit;
>
> You're exiting the function with lock_mount_hash() held...
+
>
> > + }
> > +
> > + bmnt->mnt = mntget(&mnt->mnt);
> > + list_add_tail(&bmnt->list, &mnt_to_update);
> > + }
> > + unlock_mount_hash();
>
> You've dropped unlock_mount_hash() and the function doesn't hold
> namespace_lock() and isn't documented as requiring the caller to hold
> it. And the later patch that uses this doesn't afaict (although I
> haven't looked at any of the fuse specific stuff). So this is open to
> all sorts of races with mount and unmount now. This needs an explanation
> why that doesn't matter.
yes, it doesn't matter, my idea was that this helper should
"actualize" roots for the bind mounts
at the point of call, but it seems no problem if we skip some new
mounts, because
This helper can be called again. I didn't want to take a namespace_lock() there
because we don't want to have such guarantees for this interface.
Of course, this has to be documented. Thanks!
>
> > +
> > + /* TODO: get rid of this limitation */
>
> Confused about this comment.
Sorry.
This comment means that current implementation relies on the fact that
each fuse superblock has some
root mount in the list (mnt->mnt.mnt_root == sb->s_root). And that's
not always the case, because user
can do something like this:
fusermount ... /mnt/my_fuse_mount
mount --bind /mnt/my_fuse_mount/subtree /mnt/my_fuse_mount_subtree
umount /mnt/my_fuse_mount
In this case, root_mnt == NULL. But we need to have root_mnt, to pass
it in vfs_path_lookup(..) later.
>
> > + if (!root_mnt) {
> > + err = -ENOENT;
> > + goto exit;
> > + }
> > +
> > + list_for_each_entry_safe(bmnt, next, &mnt_to_update, list) {
>
> No one else can access that list so list_for_each_entry_safe() seems
> pointless.
+
>
> > + struct vfsmount *cur_mnt = bmnt->mnt;
> > + struct path path;
> > + struct dentry *old_root;
> > + char *p;
> > +
> > + p = dentry_path(cur_mnt->mnt_root, buf, PAGE_SIZE);
> > + if (IS_ERR(p))
> > + goto exit;
> > +
> > + /* TODO: are these lookup flags fully safe and correct? */
> > + err = vfs_path_lookup(root_mnt->mnt_root, root_mnt,
> > + p, LOOKUP_FOLLOW|LOOKUP_AUTOMOUNT|LOOKUP_REVAL, &path);
> > + if (err)
> > + goto exit;
> > +
> > + /* replace bindmount root dentry */
> > + lock_mount_hash();
> > + old_root = cur_mnt->mnt_root;
> > + cur_mnt->mnt_root = dget(path.dentry);
>
> mnt->mnt_root isn't protected by lock_mount_hash(). It's invariant after
> it has been set and a lot of code just assumes that it's stable.
>
> Top of my hat, since you're not holding namespace_lock() mount
> propagation can be going on concurrently so propagate_one() can check if
> (!is_subdir(mp->m_dentry, m->mnt.mnt_root)) while you're happily
> updating it. A lot of code could actually be touching mnt->mnt_root
> while you're updating it.
>
> There's probably a lot more issues with this I'm just not seeing without
> spending more time on this. But NAK on this as it stands. Sorry.
Thanks for looking into it, Christian!
Kind regards,
Alex
>
> > + dput(old_root);
> > + unlock_mount_hash();
> > +
> > + path_put(&path);
> > + }
> > +
> > +exit:
> > + free_page((unsigned long) buf);
> > + mntput(root_mnt);
> > + list_for_each_entry_safe(bmnt, next, &mnt_to_update, list) {
> > + list_del(&bmnt->list);
> > + mntput(bmnt->mnt);
> > + kfree(bmnt);
> > + }
> > +
> > + return err;
> > +}
> > +EXPORT_SYMBOL_GPL(sb_revalidate_bindmounts);