2024-05-24 06:40:50

by Jingbo Xu

Subject: [RFC 0/2] fuse: introduce fuse server recovery mechanism

Background
==========
The fd of '/dev/fuse' serves as the message transmission channel between
the FUSE filesystem (kernel space) and the fuse server (user space). Once
the fd gets closed (intentionally or unintentionally), the FUSE filesystem
gets aborted, and any attempt at filesystem access gets an -ECONNABORTED
error until the FUSE filesystem is finally unmounted.

Providing uninterruptible filesystem service is one of the requisites in
production environments. The most straightforward way, and maybe the most
widely used one, is to make another dedicated user daemon (similar to the
systemd fdstore) keep the device fd open. When the fuse daemon recovers
from a crash, it can retrieve the device fd from the fdstore daemon
through the socket takeover (Unix domain socket) method [1] or the
pidfd_getfd() syscall [2]. In this way, as long as the fdstore daemon
doesn't exit, the FUSE filesystem won't get aborted when the fuse daemon
crashes, though the filesystem service may hang for a while until the
restarted fuse daemon has completely recovered.
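
As a rough illustration (not part of this series), retrieving the device
fd from the fdstore daemon via pidfd_getfd() could look like the sketch
below; the fdstore daemon's pid and the target fd number are assumed to
be known through some out-of-band registry, and the caller needs ptrace
permission over the target process:

#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

/* Sketch: duplicate target_fd from the process identified by pid. */
static int steal_fd(pid_t pid, int target_fd)
{
	int pidfd, fd;

	pidfd = syscall(SYS_pidfd_open, pid, 0);
	if (pidfd < 0)
		return -1;
	fd = syscall(SYS_pidfd_getfd, pidfd, target_fd, 0);
	close(pidfd);
	return fd;
}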

This picture indeed works and has been deployed in our internal
production environment, until the following issues were encountered:

1. The fdstore daemon may be killed by mistake, in which case the FUSE
filesystem gets aborted and becomes irrecoverable.

2. In scenarios of containerized deployment, the fuse daemon is deployed
in a container pod, and a dedicated fdstore daemon needs to be deployed
for each fuse daemon. Each fdstore daemon consumes a certain amount of
resources (e.g. memory footprint), which is not conducive to dense
container deployment.

3. Each fuse daemon implementation needs to implement its own fdstore
daemon. If we implement the fuse recovery mechanism on the kernel side,
all fuse daemon implementations could reuse it.


What we do
==========

Basic Recovery Mechanism
------------------------
We introduce a recovery mechanism for the fuse server on the kernel side.

To do this:
1. Introduce a new "tag=" mount option, with which users can identify
a fuse connection with a unique name.
2. Introduce a new FUSE_DEV_IOC_ATTACH ioctl, with which the fuse server
can reconnect to the fuse connection corresponding to the given tag.
3. Introduce a new FUSE_HAS_RECOVERY init flag. The fuse server should
advertise this feature if it supports server recovery.


With the above recovery mechanism, the whole time sequence looks like
this:
- At the initial mount, the fuse filesystem is mounted with the "tag="
option
- The fuse server advertises the FUSE_HAS_RECOVERY flag when replying to
FUSE_INIT
- When the fuse server crashes and the (/dev/fuse) device fd is closed,
the fuse connection won't be aborted
- The requests submitted after the server crash stay in the iqueue; the
processes submitting them hang there
- The fuse server gets restarted and restores its state from before the
crash (including the negotiation results of the last FUSE_INIT)
- The fuse server opens /dev/fuse and gets a new device fd, and then
runs the FUSE_DEV_IOC_ATTACH ioctl on the new device fd to retrieve the
fuse connection with the tag previously used to mount the fuse
filesystem
- The fuse server issues a FUSE_NOTIFY_RESEND notification to ask the
kernel to resend the inflight requests that had been sent to the fuse
server before the crash but not yet been replied to
- The fuse server starts to process requests normally (those queued in
the iqueue and those resent by FUSE_NOTIFY_RESEND)

In summary, the requests submitted after the server crash stay in the
iqueue and get serviced once the fuse server recovers from the crash and
re-attaches to the previous fuse connection. As for the inflight
requests that had been sent to the fuse server before the crash but not
yet been replied to, the fuse server can ask the kernel to resend them
through the FUSE_NOTIFY_RESEND notification type.
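
For illustration, the server-side recovery path could look roughly like
the sketch below. struct fuse_ioctl_attach and FUSE_DEV_IOC_ATTACH are
the UAPI additions proposed in this series (not in <linux/fuse.h> yet);
FUSE_NOTIFY_RESEND and the notification write convention (a
fuse_out_header with unique == 0 and the notify code in the error field)
are from the existing UAPI:

#include <fcntl.h>
#include <string.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/fuse.h>

/* Proposed UAPI additions (from patch 1 of this series): */
#define FUSE_TAG_NAME_MAX 128
struct fuse_ioctl_attach {
	char tag[FUSE_TAG_NAME_MAX];
};
#define FUSE_DEV_IOC_ATTACH _IOW(FUSE_DEV_IOC_MAGIC, 3, struct fuse_ioctl_attach)

/* Re-attach to the still-alive connection identified by the mount tag. */
static int fuse_server_reattach(const char *tag)
{
	struct fuse_ioctl_attach attach = {};
	int fd = open("/dev/fuse", O_RDWR | O_CLOEXEC);

	if (fd < 0)
		return -1;
	strncpy(attach.tag, tag, FUSE_TAG_NAME_MAX - 1);
	if (ioctl(fd, FUSE_DEV_IOC_ATTACH, &attach) < 0) {
		close(fd);
		return -1;
	}
	return fd;	/* new device fd bound to the old connection */
}

/* Ask the kernel to resend inflight requests: a notification is a
 * write of a fuse_out_header with unique == 0 and the notify code
 * in the error field. */
static int fuse_server_request_resend(int fd)
{
	struct fuse_out_header out = {
		.len	= sizeof(out),
		.error	= FUSE_NOTIFY_RESEND,
		.unique	= 0,
	};

	return write(fd, &out, sizeof(out)) == sizeof(out) ? 0 : -1;
}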


Security Enhancement
--------------------
Besides, we offer a uid-based security enhancement for the fuse server
recovery mechanism. Without it, any malicious attacker could kill the
fuse server and take over the filesystem service through the recovery
mechanism.

To implement this, we introduce a new "rescue_uid=" mount option
specifying the expected uid of the legitimate process running the fuse
server. Only a process with the matching uid is then permitted to
retrieve the fuse connection through the server recovery mechanism.
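
With both options combined, the initial mount could look like the
following (illustrative values only; fd=, rootmode=, user_id= and
group_id= are the usual options passed by the fuse server):

mount -t fuse -o fd=3,rootmode=40000,user_id=0,group_id=0,tag=myfs,rescue_uid=1000 myfs /mnt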


Limitation
==========
1. The current mechanism won't resend a new FUSE_INIT request to the
fuse server and start a new negotiation when the fuse server attempts to
re-attach to the fuse connection through the FUSE_DEV_IOC_ATTACH ioctl.
Thus the fuse server needs to restore its pre-crash state (including the
negotiation results of the last FUSE_INIT) by itself.

PS: I therefore had to apply some hacks to the libfuse passthrough_ll
daemon when testing the recovery feature.

2. With the current recovery mechanism, the fuse filesystem won't get
aborted when the fuse server crashes, so a subsequent umount will hang.
The call stack shows the hanging task is waiting for FUSE_GETATTR on the
mountpoint:

[<0>] request_wait_answer+0xe1/0x200
[<0>] fuse_simple_request+0x18e/0x2a0
[<0>] fuse_do_getattr+0xc9/0x180
[<0>] vfs_statx+0x92/0x170
[<0>] vfs_fstatat+0x7c/0xb0
[<0>] __do_sys_newstat+0x1d/0x40
[<0>] do_syscall_64+0x60/0x170
[<0>] entry_SYSCALL_64_after_hwframe+0x76/0x7e

It's not fixed yet in this RFC version.

3. I don't know if a kernel based recovery mechanism is welcome on the
community side. Any comment is welcome. Thanks!


[1] https://copyconstruct.medium.com/file-descriptor-transfer-over-unix-domain-sockets-dcbbf5b3b6ec
[2] https://copyconstruct.medium.com/seamless-file-descriptor-transfer-between-processes-with-pidfd-and-pidfd-getfd-816afcd19ed4


Jingbo Xu (2):
fuse: introduce recovery mechanism for fuse server
fuse: uid-based security enhancement for the recovery mechanism

fs/fuse/dev.c | 55 ++++++++++++++++++++++++++++++++++++++-
fs/fuse/fuse_i.h | 15 +++++++++++
fs/fuse/inode.c | 46 +++++++++++++++++++++++++++++++-
include/uapi/linux/fuse.h | 7 +++++
4 files changed, 121 insertions(+), 2 deletions(-)

--
2.19.1.6.gb485710b



2024-05-24 06:41:05

by Jingbo Xu

Subject: [RFC 1/2] fuse: introduce recovery mechanism for fuse server

Introduce a failover mechanism for the fuse server, with which the fuse
connection can be kept alive across a fuse server crash. The fuse
server can re-attach to the fuse connection after the crash and recover
the filesystem service.

The requests submitted after the server crash stay in the iqueue and
get serviced once the fuse server recovers from the crash and
re-attaches to the previous fuse connection. As for the inflight
requests that had been sent to the fuse server before the crash but not
yet been replied to, the fuse server can ask the kernel to resend them
through the FUSE_NOTIFY_RESEND notification type.

To implement the above mechanism:

1. Introduce a new "tag=" mount option, with which users can identify
a fuse connection with a unique name.
2. Introduce a new FUSE_DEV_IOC_ATTACH ioctl, with which the fuse server
can reconnect to the fuse connection corresponding to the given tag.
3. Introduce a new FUSE_HAS_RECOVERY init flag. The fuse server should
advertise this feature if it supports server recovery.

Signed-off-by: Jingbo Xu <[email protected]>
---
fs/fuse/dev.c | 43 ++++++++++++++++++++++++++++++++++++++-
fs/fuse/fuse_i.h | 7 +++++++
fs/fuse/inode.c | 35 ++++++++++++++++++++++++++++++-
include/uapi/linux/fuse.h | 7 +++++++
4 files changed, 90 insertions(+), 2 deletions(-)

diff --git a/fs/fuse/dev.c b/fs/fuse/dev.c
index 3ec8bb5e68ff..7599138baac0 100644
--- a/fs/fuse/dev.c
+++ b/fs/fuse/dev.c
@@ -2271,7 +2271,7 @@ int fuse_dev_release(struct inode *inode, struct file *file)
end_requests(&to_end);

/* Are we the last open device? */
- if (atomic_dec_and_test(&fc->dev_count)) {
+ if (atomic_dec_and_test(&fc->dev_count) && !fc->recovery) {
WARN_ON(fc->iq.fasync != NULL);
fuse_abort_conn(fc);
}
@@ -2376,6 +2376,44 @@ static long fuse_dev_ioctl_backing_close(struct file *file, __u32 __user *argp)
return fuse_backing_close(fud->fc, backing_id);
}

+static inline bool fuse_device_attach_match(struct fuse_conn *fc,
+ const char *tag)
+{
+ if (!fc->recovery || !fc->tag)
+ return false;
+
+ return !strncmp(fc->tag, tag, FUSE_TAG_NAME_MAX);
+}
+
+static int fuse_device_attach(struct file *file, const char *tag)
+{
+ struct fuse_conn *fc;
+
+ list_for_each_entry(fc, &fuse_conn_list, entry) {
+ if (!fuse_device_attach_match(fc, tag))
+ continue;
+ return fuse_device_clone(fc, file);
+ }
+ return -ENOTTY;
+}
+
+static long fuse_dev_ioctl_attach(struct file *file, __u32 __user *argp)
+{
+ struct fuse_ioctl_attach attach;
+ int res;
+
+ if (copy_from_user(&attach, argp, sizeof(attach)))
+ return -EFAULT;
+
+ if (attach.tag[0] == '\0')
+ return -EINVAL;
+
+ mutex_lock(&fuse_mutex);
+ res = fuse_device_attach(file, attach.tag);
+ mutex_unlock(&fuse_mutex);
+ return res;
+}
+
static long fuse_dev_ioctl(struct file *file, unsigned int cmd,
unsigned long arg)
{
@@ -2391,6 +2429,9 @@ static long fuse_dev_ioctl(struct file *file, unsigned int cmd,
case FUSE_DEV_IOC_BACKING_CLOSE:
return fuse_dev_ioctl_backing_close(file, argp);

+ case FUSE_DEV_IOC_ATTACH:
+ return fuse_dev_ioctl_attach(file, argp);
+
default:
return -ENOTTY;
}
diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
index f23919610313..e9832186f84f 100644
--- a/fs/fuse/fuse_i.h
+++ b/fs/fuse/fuse_i.h
@@ -575,6 +575,7 @@ struct fuse_fs_context {
unsigned int max_read;
unsigned int blksize;
const char *subtype;
+ const char *tag;

/* DAX device, may be NULL */
struct dax_device *dax_dev;
@@ -860,6 +861,9 @@ struct fuse_conn {
/** Passthrough support for read/write IO */
unsigned int passthrough:1;

+ /** Support for fuse server recovery */
+ unsigned int recovery:1;
+
/** Maximum stack depth for passthrough backing files */
int max_stack_depth;

@@ -917,6 +921,9 @@ struct fuse_conn {
/** IDR for backing files ids */
struct idr backing_files_map;
#endif
+
+ /* Tag of the connection used by fuse server recovery */
+ const char *tag;
};

/*
diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
index 99e44ea7d875..1ab245d6ade3 100644
--- a/fs/fuse/inode.c
+++ b/fs/fuse/inode.c
@@ -733,6 +733,7 @@ enum {
OPT_ALLOW_OTHER,
OPT_MAX_READ,
OPT_BLKSIZE,
+ OPT_TAG,
OPT_ERR
};

@@ -747,6 +748,7 @@ static const struct fs_parameter_spec fuse_fs_parameters[] = {
fsparam_u32 ("max_read", OPT_MAX_READ),
fsparam_u32 ("blksize", OPT_BLKSIZE),
fsparam_string ("subtype", OPT_SUBTYPE),
+ fsparam_string ("tag", OPT_TAG),
{}
};

@@ -830,6 +832,15 @@ static int fuse_parse_param(struct fs_context *fsc, struct fs_parameter *param)
ctx->blksize = result.uint_32;
break;

+ case OPT_TAG:
+ if (ctx->tag)
+ return invalfc(fsc, "Multiple tags specified");
+ if (strlen(param->string) > FUSE_TAG_NAME_MAX)
+ return invalfc(fsc, "Tag name too long");
+ ctx->tag = param->string;
+ param->string = NULL;
+ return 0;
+
default:
return -EINVAL;
}
@@ -843,6 +854,7 @@ static void fuse_free_fsc(struct fs_context *fsc)

if (ctx) {
kfree(ctx->subtype);
+ kfree(ctx->tag);
kfree(ctx);
}
}
@@ -969,6 +981,7 @@ void fuse_conn_put(struct fuse_conn *fc)
}
if (IS_ENABLED(CONFIG_FUSE_PASSTHROUGH))
fuse_backing_files_free(fc);
+ kfree(fc->tag);
call_rcu(&fc->rcu, delayed_release);
}
}
@@ -1331,6 +1344,8 @@ static void process_init_reply(struct fuse_mount *fm, struct fuse_args *args,
}
if (flags & FUSE_NO_EXPORT_SUPPORT)
fm->sb->s_export_op = &fuse_export_fid_operations;
+ if (flags & FUSE_HAS_RECOVERY)
+ fc->recovery = 1;
} else {
ra_pages = fc->max_read / PAGE_SIZE;
fc->no_lock = 1;
@@ -1378,7 +1393,7 @@ void fuse_send_init(struct fuse_mount *fm)
FUSE_HANDLE_KILLPRIV_V2 | FUSE_SETXATTR_EXT | FUSE_INIT_EXT |
FUSE_SECURITY_CTX | FUSE_CREATE_SUPP_GROUP |
FUSE_HAS_EXPIRE_ONLY | FUSE_DIRECT_IO_ALLOW_MMAP |
- FUSE_NO_EXPORT_SUPPORT | FUSE_HAS_RESEND;
+ FUSE_NO_EXPORT_SUPPORT | FUSE_HAS_RESEND | FUSE_HAS_RECOVERY;
#ifdef CONFIG_FUSE_DAX
if (fm->fc->dax)
flags |= FUSE_MAP_ALIGNMENT;
@@ -1520,6 +1535,17 @@ void fuse_dev_free(struct fuse_dev *fud)
}
EXPORT_SYMBOL_GPL(fuse_dev_free);

+static bool fuse_find_conn_tag(const char *tag)
+{
+ struct fuse_conn *fc;
+
+ list_for_each_entry(fc, &fuse_conn_list, entry) {
+ if (fc->tag && !strcmp(fc->tag, tag))
+ return true;
+ }
+ return false;
+}
+
static void fuse_fill_attr_from_inode(struct fuse_attr *attr,
const struct fuse_inode *fi)
{
@@ -1727,6 +1753,8 @@ int fuse_fill_super_common(struct super_block *sb, struct fuse_fs_context *ctx)
fc->destroy = ctx->destroy;
fc->no_control = ctx->no_control;
fc->no_force_umount = ctx->no_force_umount;
+ fc->tag = ctx->tag;
+ ctx->tag = NULL;

err = -ENOMEM;
root = fuse_get_root_inode(sb, ctx->rootmode);
@@ -1742,6 +1770,11 @@ int fuse_fill_super_common(struct super_block *sb, struct fuse_fs_context *ctx)
if (ctx->fudptr && *ctx->fudptr)
goto err_unlock;

+ if (fc->tag && fuse_find_conn_tag(fc->tag)) {
+ pr_err("tag %s already exist\n", fc->tag);
+ goto err_unlock;
+ }
+
err = fuse_ctl_add_conn(fc);
if (err)
goto err_unlock;
diff --git a/include/uapi/linux/fuse.h b/include/uapi/linux/fuse.h
index d08b99d60f6f..054d6789b2fc 100644
--- a/include/uapi/linux/fuse.h
+++ b/include/uapi/linux/fuse.h
@@ -463,6 +463,7 @@ struct fuse_file_lock {
#define FUSE_PASSTHROUGH (1ULL << 37)
#define FUSE_NO_EXPORT_SUPPORT (1ULL << 38)
#define FUSE_HAS_RESEND (1ULL << 39)
+#define FUSE_HAS_RECOVERY (1ULL << 40)

/* Obsolete alias for FUSE_DIRECT_IO_ALLOW_MMAP */
#define FUSE_DIRECT_IO_RELAX FUSE_DIRECT_IO_ALLOW_MMAP
@@ -1079,12 +1080,18 @@ struct fuse_backing_map {
uint64_t padding;
};

+struct fuse_ioctl_attach {
+#define FUSE_TAG_NAME_MAX 128
+ char tag[FUSE_TAG_NAME_MAX];
+};
+
/* Device ioctls: */
#define FUSE_DEV_IOC_MAGIC 229
#define FUSE_DEV_IOC_CLONE _IOR(FUSE_DEV_IOC_MAGIC, 0, uint32_t)
#define FUSE_DEV_IOC_BACKING_OPEN _IOW(FUSE_DEV_IOC_MAGIC, 1, \
struct fuse_backing_map)
#define FUSE_DEV_IOC_BACKING_CLOSE _IOW(FUSE_DEV_IOC_MAGIC, 2, uint32_t)
+#define FUSE_DEV_IOC_ATTACH _IOW(FUSE_DEV_IOC_MAGIC, 3, struct fuse_ioctl_attach)

struct fuse_lseek_in {
uint64_t fh;
--
2.19.1.6.gb485710b


2024-05-24 06:41:51

by Jingbo Xu

Subject: [RFC 2/2] fuse: uid-based security enhancement for the recovery mechanism

Offer a uid-based security enhancement for the fuse server recovery
mechanism. Without it, any malicious attacker could kill the fuse
server and take over the filesystem service through the recovery
mechanism.

Introduce a new "rescue_uid=" mount option specifying the expected uid
of the legitimate process running the fuse server. Only a process with
the matching uid is then permitted to retrieve the fuse connection
through the server recovery mechanism.

Signed-off-by: Jingbo Xu <[email protected]>
---
fs/fuse/dev.c | 12 ++++++++++++
fs/fuse/fuse_i.h | 8 ++++++++
fs/fuse/inode.c | 13 ++++++++++++-
3 files changed, 32 insertions(+), 1 deletion(-)

diff --git a/fs/fuse/dev.c b/fs/fuse/dev.c
index 7599138baac0..9db35a2bbd85 100644
--- a/fs/fuse/dev.c
+++ b/fs/fuse/dev.c
@@ -2376,12 +2376,24 @@ static long fuse_dev_ioctl_backing_close(struct file *file, __u32 __user *argp)
return fuse_backing_close(fud->fc, backing_id);
}

+static inline bool fuse_device_attach_permissible(struct fuse_conn *fc)
+{
+ const struct cred *cred = current_cred();
+
+ return (uid_eq(cred->euid, fc->rescue_uid) &&
+ uid_eq(cred->suid, fc->rescue_uid) &&
+ uid_eq(cred->uid, fc->rescue_uid));
+}
+
static inline bool fuse_device_attach_match(struct fuse_conn *fc,
const char *tag)
{
if (!fc->recovery)
return false;

+ if (!fuse_device_attach_permissible(fc))
+ return false;
+
return !strncmp(fc->tag, tag, FUSE_TAG_NAME_MAX);
}

diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
index e9832186f84f..c43026d7229c 100644
--- a/fs/fuse/fuse_i.h
+++ b/fs/fuse/fuse_i.h
@@ -560,6 +560,7 @@ struct fuse_fs_context {
unsigned int rootmode;
kuid_t user_id;
kgid_t group_id;
+ kuid_t rescue_uid;
bool is_bdev:1;
bool fd_present:1;
bool rootmode_present:1;
@@ -571,6 +572,7 @@ struct fuse_fs_context {
bool no_control:1;
bool no_force_umount:1;
bool legacy_opts_show:1;
+ bool rescue_uid_present:1;
enum fuse_dax_mode dax_mode;
unsigned int max_read;
unsigned int blksize;
@@ -616,6 +618,9 @@ struct fuse_conn {
/** The group id for this mount */
kgid_t group_id;

+ /* The expected user id of the fuse server */
+ kuid_t rescue_uid;
+
/** The pid namespace for this mount */
struct pid_namespace *pid_ns;

@@ -864,6 +869,9 @@ struct fuse_conn {
/** Support for fuse server recovery */
unsigned int recovery:1;

+ /** Is rescue_uid specified? */
+ unsigned int rescue_uid_present:1;
+
/** Maximum stack depth for passthrough backing files */
int max_stack_depth;

diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
index 1ab245d6ade3..3b00482293b6 100644
--- a/fs/fuse/inode.c
+++ b/fs/fuse/inode.c
@@ -734,6 +734,7 @@ enum {
OPT_MAX_READ,
OPT_BLKSIZE,
OPT_TAG,
+ OPT_RESCUE_UID,
OPT_ERR
};

@@ -749,6 +750,7 @@ static const struct fs_parameter_spec fuse_fs_parameters[] = {
fsparam_u32 ("blksize", OPT_BLKSIZE),
fsparam_string ("subtype", OPT_SUBTYPE),
fsparam_string ("tag", OPT_TAG),
+ fsparam_u32 ("rescue_uid", OPT_RESCUE_UID),
{}
};

@@ -841,6 +843,13 @@ static int fuse_parse_param(struct fs_context *fsc, struct fs_parameter *param)
param->string = NULL;
return 0;

+ case OPT_RESCUE_UID:
+ ctx->rescue_uid = make_kuid(fsc->user_ns, result.uint_32);
+ if (!uid_valid(ctx->rescue_uid))
+ return invalfc(fsc, "Invalid rescue_uid");
+ ctx->rescue_uid_present = true;
+ break;
+
default:
return -EINVAL;
}
@@ -1344,7 +1353,7 @@ static void process_init_reply(struct fuse_mount *fm, struct fuse_args *args,
}
if (flags & FUSE_NO_EXPORT_SUPPORT)
fm->sb->s_export_op = &fuse_export_fid_operations;
- if (flags & FUSE_HAS_RECOVERY)
+ if (flags & FUSE_HAS_RECOVERY && fc->rescue_uid_present)
fc->recovery = 1;
} else {
ra_pages = fc->max_read / PAGE_SIZE;
@@ -1753,6 +1762,8 @@ int fuse_fill_super_common(struct super_block *sb, struct fuse_fs_context *ctx)
fc->destroy = ctx->destroy;
fc->no_control = ctx->no_control;
fc->no_force_umount = ctx->no_force_umount;
+ fc->rescue_uid = ctx->rescue_uid;
+ fc->rescue_uid_present = ctx->rescue_uid_present;
fc->tag = ctx->tag;
ctx->tag = NULL;

--
2.19.1.6.gb485710b


2024-05-27 15:23:26

by Miklos Szeredi

Subject: Re: [RFC 0/2] fuse: introduce fuse server recovery mechanism

On Fri, 24 May 2024 at 08:40, Jingbo Xu <[email protected]> wrote:

> 3. I don't know if a kernel based recovery mechanism is welcome on the
> community side. Any comment is welcome. Thanks!

I'd prefer something external to fuse.

Maybe a kernel based fdstore (lifetime connected to that of the
container) would be a useful service more generally?

Thanks,
Miklos

2024-05-28 02:45:41

by Jingbo Xu

Subject: Re: [RFC 0/2] fuse: introduce fuse server recovery mechanism



On 5/27/24 11:16 PM, Miklos Szeredi wrote:
> On Fri, 24 May 2024 at 08:40, Jingbo Xu <[email protected]> wrote:
>
>> 3. I don't know if a kernel based recovery mechanism is welcome on the
>> community side. Any comment is welcome. Thanks!
>
> I'd prefer something external to fuse.

Okay, understood.

>
> Maybe a kernel based fdstore (lifetime connected to that of the
> container) would be a useful service more generally?

Yeah I indeed had considered this, but I'm afraid VFS guys would be
concerned about why we do this on kernel side rather than in user space.

I'm not sure what the VFS guys think about this and if the kernel side
shall care about this.

Many thanks!


--
Thanks,
Jingbo

2024-05-28 03:08:39

by Jingbo Xu

Subject: Re: [RFC 0/2] fuse: introduce fuse server recovery mechanism



On 5/28/24 10:45 AM, Jingbo Xu wrote:
>
>
> On 5/27/24 11:16 PM, Miklos Szeredi wrote:
>> On Fri, 24 May 2024 at 08:40, Jingbo Xu <[email protected]> wrote:
>>
>>> 3. I don't know if a kernel based recovery mechanism is welcome on the
>>> community side. Any comment is welcome. Thanks!
>>
>> I'd prefer something external to fuse.
>
> Okay, understood.
>
>>
>> Maybe a kernel based fdstore (lifetime connected to that of the
>> container) would be a useful service more generally?
>
> Yeah I indeed had considered this, but I'm afraid VFS guys would be
> concerned about why we do this on kernel side rather than in user space.
>
> I'm not sure what the VFS guys think about this and if the kernel side
> shall care about this.
>

There was an RFC for kernel-side fdstore [1], though it's also
implemented upon FUSE.

[1]
https://lore.kernel.org/all/CA+a=Yy5rnqLqH2iR-ZY6AUkNJy48mroVV3Exmhmt-pfTi82kXA@mail.gmail.com/T/



--
Thanks,
Jingbo

2024-05-28 04:03:33

by Gao Xiang

Subject: Re: [RFC 0/2] fuse: introduce fuse server recovery mechanism



On 2024/5/28 11:08, Jingbo Xu wrote:
>
>
> On 5/28/24 10:45 AM, Jingbo Xu wrote:
>>
>>
>> On 5/27/24 11:16 PM, Miklos Szeredi wrote:
>>> On Fri, 24 May 2024 at 08:40, Jingbo Xu <[email protected]> wrote:
>>>
>>>> 3. I don't know if a kernel based recovery mechanism is welcome on the
>>>> community side. Any comment is welcome. Thanks!
>>>
>>> I'd prefer something external to fuse.
>>
>> Okay, understood.
>>
>>>
>>> Maybe a kernel based fdstore (lifetime connected to that of the
>>> container) would be a useful service more generally?
>>
>> Yeah I indeed had considered this, but I'm afraid VFS guys would be
>> concerned about why we do this on kernel side rather than in user space.

Just from my own perspective, even if it's in FUSE, the concern is
almost the same.

I wonder if on-demand cachefiles can keep fds too in the future
(thus e.g. daemonless feature could even be implemented entirely
with kernel fdstore) but it still has the same concern or it's
a source of duplication.

Thanks,
Gao Xiang

>>
>> I'm not sure what the VFS guys think about this and if the kernel side
>> shall care about this.
>>
>
> There was an RFC for kernel-side fdstore [1], though it's also
> implemented upon FUSE.
>
> [1]
> https://lore.kernel.org/all/CA+a=Yy5rnqLqH2iR-ZY6AUkNJy48mroVV3Exmhmt-pfTi82kXA@mail.gmail.com/T/
>
>
>

2024-05-28 07:45:36

by Miklos Szeredi

Subject: Re: [RFC 0/2] fuse: introduce fuse server recovery mechanism

On Tue, 28 May 2024 at 04:45, Jingbo Xu <[email protected]> wrote:

> Yeah I indeed had considered this, but I'm afraid VFS guys would be
> concerned about why we do this on kernel side rather than in user space.
>
> I'm not sure what the VFS guys think about this and if the kernel side
> shall care about this.

Yes, that is indeed something that needs to be discussed.

I often find that when discussing something like this, a lot of good
ideas come from different directions, which can help move things
forward.

Try something really simple first, and post a patch. Don't overthink
the first version.

Thanks,
Miklos

2024-05-28 07:47:29

by Miklos Szeredi

Subject: Re: [RFC 0/2] fuse: introduce fuse server recovery mechanism

On Tue, 28 May 2024 at 05:08, Jingbo Xu <[email protected]> wrote:
> There was an RFC for kernel-side fdstore [1], though it's also
> implemented upon FUSE.

I strongly believe that this needs to be disassociated from fuse.

It could be a pseudo filesystem, though.

Thanks,
Miklos

2024-05-28 08:38:45

by Christian Brauner

Subject: Re: [RFC 0/2] fuse: introduce fuse server recovery mechanism

On Fri, May 24, 2024 at 02:40:28PM +0800, Jingbo Xu wrote:
> Background
> ==========
> The fd of '/dev/fuse' serves as the message transmission channel between
> the FUSE filesystem (kernel space) and the fuse server (user space). Once
> the fd gets closed (intentionally or unintentionally), the FUSE filesystem
> gets aborted, and any attempt at filesystem access gets an -ECONNABORTED
> error until the FUSE filesystem is finally unmounted.
>
> Providing uninterruptible filesystem service is one of the requisites in
> production environments. The most straightforward way, and maybe the most
> widely used one, is to make another dedicated user daemon (similar to the
> systemd fdstore) keep the device fd open. When the fuse daemon recovers
> from a crash, it can retrieve the device fd from the fdstore daemon
> through the socket takeover (Unix domain socket) method [1] or the
> pidfd_getfd() syscall [2]. In this way, as long as the fdstore daemon
> doesn't exit, the FUSE filesystem won't get aborted when the fuse daemon
> crashes, though the filesystem service may hang for a while until the
> restarted fuse daemon has completely recovered.
>
> This picture indeed works and has been deployed in our internal
> production environment, until the following issues were encountered:
>
> 1. The fdstore daemon may be killed by mistake, in which case the FUSE
> filesystem gets aborted and becomes irrecoverable.

That's only a problem if you use the fdstore of the per-user instance.
The main fdstore is part of PID 1 and you can't kill that. So really,
systemd needs to hand the fds from the per-user instance to the main
fdstore.

> 2. In scenarios of containerized deployment, the fuse daemon is deployed
> in a container pod, and a dedicated fdstore daemon needs to be deployed
> for each fuse daemon. Each fdstore daemon consumes a certain amount of
> resources (e.g. memory footprint), which is not conducive to dense
> container deployment.
>
> 3. Each fuse daemon implementation needs to implement its own fdstore
> daemon. If we implement the fuse recovery mechanism on the kernel side,
> all fuse daemon implementations could reuse it.

You can just use the global fdstore. That is a design limitation, not
an inherent limitation.
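
For reference, a minimal sketch of stashing the device fd in the
systemd fdstore; it assumes the fuse daemon runs as a systemd service
with FileDescriptorStoreMax= set, and the fd name "fuse-dev" is an
arbitrary choice:

#include <string.h>
#include <systemd/sd-daemon.h>

/* Hand the /dev/fuse fd to systemd so it survives a daemon restart. */
static int store_fuse_fd(int devfd)
{
	return sd_pid_notify_with_fds(0, 0, "FDSTORE=1\nFDNAME=fuse-dev",
				      &devfd, 1);
}

/* After a restart, systemd passes stored fds back; find ours by name. */
static int retrieve_fuse_fd(void)
{
	char **names = NULL;
	int i, n = sd_listen_fds_with_names(0, &names);

	for (i = 0; i < n; i++)
		if (names[i] && !strcmp(names[i], "fuse-dev"))
			return SD_LISTEN_FDS_START + i;
	return -1;
}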

2024-05-28 08:43:47

by Christian Brauner

Subject: Re: [RFC 0/2] fuse: introduce fuse server recovery mechanism

On Tue, May 28, 2024 at 12:02:46PM +0800, Gao Xiang wrote:
>
>
> On 2024/5/28 11:08, Jingbo Xu wrote:
> >
> >
> > On 5/28/24 10:45 AM, Jingbo Xu wrote:
> > >
> > >
> > > On 5/27/24 11:16 PM, Miklos Szeredi wrote:
> > > > On Fri, 24 May 2024 at 08:40, Jingbo Xu <[email protected]> wrote:
> > > >
> > > > > 3. I don't know if a kernel based recovery mechanism is welcome on the
> > > > > community side. Any comment is welcome. Thanks!
> > > >
> > > > I'd prefer something external to fuse.
> > >
> > > Okay, understood.
> > >
> > > >
> > > > Maybe a kernel based fdstore (lifetime connected to that of the
> > > > container) would be a useful service more generally?
> > >
> > > Yeah I indeed had considered this, but I'm afraid VFS guys would be
> > > concerned about why we do this on kernel side rather than in user space.
>
> Just from my own perspective, even if it's in FUSE, the concern is
> almost the same.
>
> I wonder if on-demand cachefiles can keep fds too in the future
> (thus e.g. daemonless feature could even be implemented entirely
> with kernel fdstore) but it still has the same concern or it's
> a source of duplication.
>
> Thanks,
> Gao Xiang
>
> > >
> > > I'm not sure what the VFS guys think about this and if the kernel side
> > > shall care about this.

Fwiw, I'm not convinced and I think that's a big can of worms security
wise and semantics wise. I have discussed whether a kernel-side fdstore
would be something that systemd would use if available multiple times
and they wouldn't use it because it provides them with no benefits over
having it in userspace.

Especially since it implements a lot of special semantics and policy
that we really don't want in the kernel. I think that's just not
something we should do. We should give userspace all the means to
implement fdstores in userspace but not hold fds ourselves.

2024-05-28 09:13:23

by Gao Xiang

Subject: Re: [RFC 0/2] fuse: introduce fuse server recovery mechanism

Hi Christian,

On 2024/5/28 16:43, Christian Brauner wrote:
> On Tue, May 28, 2024 at 12:02:46PM +0800, Gao Xiang wrote:
>>
>>
>> On 2024/5/28 11:08, Jingbo Xu wrote:
>>>
>>>
>>> On 5/28/24 10:45 AM, Jingbo Xu wrote:
>>>>
>>>>
>>>> On 5/27/24 11:16 PM, Miklos Szeredi wrote:
>>>>> On Fri, 24 May 2024 at 08:40, Jingbo Xu <[email protected]> wrote:
>>>>>
>>>>>> 3. I don't know if a kernel based recovery mechanism is welcome on the
>>>>>> community side. Any comment is welcome. Thanks!
>>>>>
>>>>> I'd prefer something external to fuse.
>>>>
>>>> Okay, understood.
>>>>
>>>>>
>>>>> Maybe a kernel based fdstore (lifetime connected to that of the
>>>>> container) would be a useful service more generally?
>>>>
>>>> Yeah I indeed had considered this, but I'm afraid VFS guys would be
>>>> concerned about why we do this on kernel side rather than in user space.
>>
>> Just from my own perspective, even if it's in FUSE, the concern is
>> almost the same.
>>
>> I wonder if on-demand cachefiles can keep fds too in the future
>> (thus e.g. daemonless feature could even be implemented entirely
>> with kernel fdstore) but it still has the same concern or it's
>> a source of duplication.
>>
>> Thanks,
>> Gao Xiang
>>
>>>>
>>>> I'm not sure what the VFS guys think about this and if the kernel side
>>>> shall care about this.
>
> Fwiw, I'm not convinced and I think that's a big can of worms security
> wise and semantics wise. I have discussed whether a kernel-side fdstore
> would be something that systemd would use if available multiple times
> and they wouldn't use it because it provides them with no benefits over
> having it in userspace.

As far as I know, currently there are approximately two ways to do
failover mechanisms in the kernel.

The first model is much like the FUSE model: in this mode, we have to
keep and pass an fd to maintain the active state. And currently,
userspace is responsible for the permission/security issues when doing
something like passing fds.

The second model is a one-device-one-instance model, for example ublk
(if I understand correctly): each active instance (/dev/ublkbX) has its
own unique control device (/dev/ublkcX). Users can assign/change
DAC/MAC for each control device. Failover recovery then just needs to
reopen the control device with the proper permission and do the
recovery.

So, just my own thought: a kernel-side fdstore pseudo filesystem could
provide a DAC/MAC mechanism for the first model. That would be a much
cleaner way than doing some similar thing independently in each
subsystem that needs a DAC/MAC-like mechanism. But that is just my own
thought.

Thanks,
Gao Xiang

>
> Especially since it implements a lot of special semantics and policy
> that we really don't want in the kernel. I think that's just not
> something we should do. We should give userspace all the means to
> implement fdstores in userspace but not hold fds ourselves.

2024-05-28 09:32:51

by Christian Brauner

Subject: Re: [RFC 0/2] fuse: introduce fuse server recovery mechanism

On Tue, May 28, 2024 at 05:13:04PM +0800, Gao Xiang wrote:
> Hi Christian,
>
> On 2024/5/28 16:43, Christian Brauner wrote:
> > On Tue, May 28, 2024 at 12:02:46PM +0800, Gao Xiang wrote:
> > >
> > >
> > > On 2024/5/28 11:08, Jingbo Xu wrote:
> > > >
> > > >
> > > > On 5/28/24 10:45 AM, Jingbo Xu wrote:
> > > > >
> > > > >
> > > > > On 5/27/24 11:16 PM, Miklos Szeredi wrote:
> > > > > > On Fri, 24 May 2024 at 08:40, Jingbo Xu <[email protected]> wrote:
> > > > > >
> > > > > > > 3. I don't know if a kernel based recovery mechanism is welcome on the
> > > > > > > community side. Any comment is welcome. Thanks!
> > > > > >
> > > > > > I'd prefer something external to fuse.
> > > > >
> > > > > Okay, understood.
> > > > >
> > > > > >
> > > > > > Maybe a kernel based fdstore (lifetime connected to that of the
> > > > > > container) would be a useful service more generally?
> > > > >
> > > > > Yeah I indeed had considered this, but I'm afraid VFS guys would be
> > > > > concerned about why we do this on kernel side rather than in user space.
> > >
> > > Just from my own perspective, even if it's in FUSE, the concern is
> > > almost the same.
> > >
> > > I wonder if on-demand cachefiles can keep fds too in the future
> > > (thus e.g. daemonless feature could even be implemented entirely
> > > with kernel fdstore) but it still has the same concern or it's
> > > a source of duplication.
> > >
> > > Thanks,
> > > Gao Xiang
> > >
> > > > >
> > > > > I'm not sure what the VFS guys think about this and if the kernel side
> > > > > shall care about this.
> >
> > Fwiw, I'm not convinced and I think that's a big can of worms security
> > wise and semantics wise. I have discussed whether a kernel-side fdstore
> > would be something that systemd would use if available multiple times
> > and they wouldn't use it because it provides them with no benefits over
> > having it in userspace.
>
> As far as I know, currently there are approximately two ways to do
> failover mechanisms in the kernel.
>
> The first model is much like the FUSE model: in this mode, we have to
> keep and pass an fd to maintain the active state. And currently,
> userspace is responsible for the permission/security issues when doing
> something like passing fds.
>
> The second model is a one-device-one-instance model, for example ublk
> (if I understand correctly): each active instance (/dev/ublkbX) has its
> own unique control device (/dev/ublkcX). Users can assign/change
> DAC/MAC for each control device. Failover recovery then just needs to
> reopen the control device with the proper permission and do the
> recovery.
>
> So, just my own thought: a kernel-side fdstore pseudo filesystem could
> provide a DAC/MAC mechanism for the first model. That would be a much
> cleaner way than doing some similar thing independently in each
> subsystem that needs a DAC/MAC-like mechanism. But that is just my own
> thought.

The failover mechanism for /dev/ublkcX could easily be implemented using
the fdstore. The fact that they rolled their own thing is orthogonal to
this imho. Implementing retrieval policies like this in the kernel is
slowly advancing into /proc/$pid/fd/ levels of complexity. That's all
better handled with appropriate policies in userspace. And cachefilesd
can similarly just stash their fds in the fdstore.

2024-05-28 09:45:49

by Jingbo Xu

Subject: Re: [RFC 0/2] fuse: introduce fuse server recovery mechanism

Hi, Christian,

Thanks for the review.


On 5/28/24 4:38 PM, Christian Brauner wrote:
> On Fri, May 24, 2024 at 02:40:28PM +0800, Jingbo Xu wrote:
>> Background
>> ==========
>> The fd of '/dev/fuse' serves as the message transmission channel between
>> the FUSE filesystem (kernel space) and the fuse server (user space). Once
>> the fd gets closed (intentionally or unintentionally), the FUSE filesystem
>> gets aborted, and any attempt at filesystem access gets an -ECONNABORTED
>> error until the FUSE filesystem is finally unmounted.
>>
>> Providing uninterruptible filesystem service is one of the requisites in
>> production environments. The most straightforward way, and maybe the most
>> widely used one, is to make another dedicated user daemon (similar to the
>> systemd fdstore) keep the device fd open. When the fuse daemon recovers
>> from a crash, it can retrieve the device fd from the fdstore daemon
>> through the socket takeover (Unix domain socket) method [1] or the
>> pidfd_getfd() syscall [2]. In this way, as long as the fdstore daemon
>> doesn't exit, the FUSE filesystem won't get aborted when the fuse daemon
>> crashes, though the filesystem service may hang for a while until the
>> restarted fuse daemon has completely recovered.
>>
>> This picture indeed works and has been deployed in our internal
>> production environment, until the following issues were encountered:
>>
>> 1. The fdstore daemon may be killed by mistake, in which case the FUSE
>> filesystem gets aborted and becomes irrecoverable.
>
> That's only a problem if you use the fdstore of the per-user instance.
> The main fdstore is part of PID 1 and you can't kill that. So really,
> systemd needs to hand the fds from the per-user instance to the main
> fdstore.

Systemd indeed has implemented its own fdstore mechanism in user space.

Nowadays more and more fuse daemons are running inside containers, but a
container generally has no systemd inside it.
>
>> 2. In scenarios of containerized deployment, the fuse daemon is deployed
>> in a container pod, and a dedicated fdstore daemon needs to be deployed
>> for each fuse daemon. Each fdstore daemon consumes a certain amount of
>> resources (e.g. memory footprint), which is not conducive to dense
>> container deployment.
>>
>> 3. Each fuse daemon implementation needs to implement its own fdstore
>> daemon. If we implement the fuse recovery mechanism on the kernel side,
>> all fuse daemon implementations could reuse it.
>
> You can just use the global fdstore. That is a design limitation,
> not an inherent limitation.

What I initially meant is that each fuse daemon implementation (e.g.
s3fs, ossfs, and other vendors) needs to build its own, yet similar,
mechanism for daemon failover. There has been no common fdstore
component in container scenarios comparable to the systemd fdstore.

I'd admit that it's controversial to implement a kernel-side fdstore.
Thus I only implemented a failover mechanism for the fuse server in this
RFC patch. But I also understand Miklos's concern, as what we really
need to support daemon failover is just something like an fdstore to
keep the device fd alive.


--
Thanks,
Jingbo

2024-05-28 10:01:41

by Gao Xiang

Subject: Re: [RFC 0/2] fuse: introduce fuse server recovery mechanism



On 2024/5/28 17:32, Christian Brauner wrote:
> On Tue, May 28, 2024 at 05:13:04PM +0800, Gao Xiang wrote:
>> Hi Christian,
>>
>> On 2024/5/28 16:43, Christian Brauner wrote:
>>> On Tue, May 28, 2024 at 12:02:46PM +0800, Gao Xiang wrote:
>>>>
>>>>
>>>> On 2024/5/28 11:08, Jingbo Xu wrote:
>>>>>
>>>>>
>>>>> On 5/28/24 10:45 AM, Jingbo Xu wrote:
>>>>>>
>>>>>>
>>>>>> On 5/27/24 11:16 PM, Miklos Szeredi wrote:
>>>>>>> On Fri, 24 May 2024 at 08:40, Jingbo Xu <[email protected]> wrote:
>>>>>>>
>>>>>>>> 3. I don't know if a kernel based recovery mechanism is welcome on the
>>>>>>>> community side. Any comment is welcome. Thanks!
>>>>>>>
>>>>>>> I'd prefer something external to fuse.
>>>>>>
>>>>>> Okay, understood.
>>>>>>
>>>>>>>
>>>>>>> Maybe a kernel based fdstore (lifetime connected to that of the
>>>>>>> container) would be a useful service more generally?
>>>>>>
>>>>>> Yeah I indeed had considered this, but I'm afraid VFS guys would be
>>>>>> concerned about why we do this on kernel side rather than in user space.
>>>>
>>>> Just from my own perspective, even if it's in FUSE, the concern is
>>>> almost the same.
>>>>
>>>> I wonder if on-demand cachefiles can keep fds too in the future
>>>> (thus e.g. daemonless feature could even be implemented entirely
>>>> with kernel fdstore) but it still has the same concern or it's
>>>> a source of duplication.
>>>>
>>>> Thanks,
>>>> Gao Xiang
>>>>
>>>>>>
>>>>>> I'm not sure what the VFS guys think about this and if the kernel side
>>>>>> shall care about this.
>>>
>>> Fwiw, I'm not convinced and I think that's a big can of worms security
>>> wise and semantics wise. I have discussed whether a kernel-side fdstore
>>> would be something that systemd would use if available multiple times
>>> and they wouldn't use it because it provides them with no benefits over
>>> having it in userspace.
>>
>> As far as I know, currently there are approximately two ways to do
>> failover mechanisms in the kernel.
>>
>> The first model is much like the FUSE model: in this mode, we have to
>> keep and pass an fd to maintain the active state. And currently,
>> userspace is responsible for the permission/security issues when doing
>> something like passing fds.
>>
>> The second model is a one-device-one-instance model, for example ublk
>> (if I understand correctly): each active instance (/dev/ublkbX) has its
>> own unique control device (/dev/ublkcX). Users can assign/change
>> DAC/MAC for each control device. Failover recovery then just needs to
>> reopen the control device with the proper permission and do the
>> recovery.
>>
>> So, just my own thought: a kernel-side fdstore pseudo filesystem could
>> provide a DAC/MAC mechanism for the first model. That would be a much
>> cleaner way than doing some similar thing independently in each
>> subsystem that needs a DAC/MAC-like mechanism. But that is just my own
>> thought.
>
> The failover mechanism for /dev/ublkcX could easily be implemented using
> the fdstore. The fact that they rolled their own thing is orthogonal to
> this imho. Implementing retrieval policies like this in the kernel is
> slowly advancing into /proc/$pid/fd/ levels of complexity. That's all
> better handled with appropriate policies in userspace. And cachefilesd
> can similarly just stash their fds in the fdstore.

Ok, got it. I just wanted to know what a kernel fdstore would
currently look like (since Miklos mentioned it, I wondered whether it's
feasible, as it could also benefit non-fuse cases). I think a userspace
fdstore works for me (unless some other interesting use cases show up
for evaluation later).

Jingbo has an internal requirement for fuse; that is purely fuse stuff,
and that is out of my scope though.

Thanks,
Gao Xiang