Add support for file system passthrough read/write of files when enabled in
userspace through the option FUSE_PASSTHROUGH.
There are file systems based on FUSE that are intended to enforce special
policies or trigger complicated decision makings at the file operations
level. Android, for example, uses FUSE to enforce fine-grained access
policies that also depend on the file contents.
Sometimes it happens that at open or create time a file is identified as
not requiring additional checks for consequent reads/writes, thus FUSE
would simply act as a passive bridge between the process accessing the FUSE
file system and the lower file system. Splicing and caching help reduce the
FUSE overhead, but there are still read/write operations forwarded to the
userspace FUSE daemon that could be avoided.
This series has been inspired by the original patches from Nikhilesh Reddy,
the idea and code of which has been elaborated and improved thanks to the
community support.
When the FUSE_PASSTHROUGH capability is enabled, the FUSE daemon may decide
while handling the open/create operations, if the given file can be
accessed in passthrough mode. This means that all the further read and
write operations would be forwarded by the kernel directly to the lower
file system rather than to the FUSE daemon. All requests that are not reads
or writes are still handled by the userspace code.
This allows for improved performance on reads and writes, especially in the
case of reads at random offsets, for which no (readahead) caching mechanism
would help.
Benchmarks show improved performance that is close to native file system
access when doing massive manipulations on a single opened file, especially
in the case of random reads, for which the bandwidth increased by almost 2X
or sequential writes for which the improvement is close to 3X.
The creation of this direct connection (passthrough) between FUSE file
objects and file objects in the lower file system happens in a way that
reminds of passing file descriptors via sockets:
- a process requests the opening of a file handled by FUSE, so the kernel
forwards the request to the FUSE daemon;
- the FUSE daemon opens the target file in the lower file system, getting
its file descriptor;
- the FUSE daemon also decides according to its internal policies if
passthrough can be enabled for that file, and, if so, can perform a
FUSE_DEV_IOC_PASSTHROUGH_OPEN ioctl() on /dev/fuse, passing the file
descriptor obtained at the previous step and the fuse_req unique
identifier;
- the kernel translates the file descriptor to the file pointer navigating
through the opened files of the "current" process and temporarily stores
it in the associated open/create fuse_req's passthrough_filp;
- when the FUSE daemon has done with the request and it's time for the
kernel to close it, it checks if the passthrough_filp is available and in
case updates the additional field in the fuse_file owned by the process
accessing the FUSE file system.
From now on, all the read/write operations performed by that process will
be redirected to the file pointer pointing at the lower file system's file.
Asynchronous IO is supported as well, handled by creating separate AIO
requests for the lower file system that will be internally tracked by FUSE,
that intercepts and propagates their completion through an internal
ki_completed callback similar to the current implementation of overlayfs.
Being the userspace FUSE daemon unaware of the read/write operations
happening in passthrough mode, the stats of the FUSE file must be
continuously updated with the lower file system's file to avoid
inconsistencies when stats caching is enabled.
The ioctl() has been designed taking as a reference and trying to converge
to the fuse2 implementation. For example, the fuse_passthrough_out data
structure has extra fields that will allow for further extensions of the
feature.
Performance
What follows has been performed with this change [V6] rebased on top of
vanilla v5.8 Linux kernel, using a custom passthrough_hp FUSE daemon that
enables pass-through for each file that is opened during both "open" and
"create". Tests were run on an Intel Xeon E5-2678V3, 32GiB of RAM, with an
ext4-formatted SSD as the lower file system, with no special tuning, e.g.,
all the involved processes are SCHED_OTHER, ondemand is the frequency
governor with no frequency restrictions, and turbo-boost, as well as
p-state, are active. This is because I noticed that, for such high-level
benchmarks, results consistency was minimally affected by these features.
The source code of the updated libfuse library and passthrough_hp is shared
at the following repository:
https://github.com/balsini/libfuse/tree/fuse-passthrough-stable-v.3.9.4
Two different kinds of benchmarks were done for this change, the first set
of tests evaluates the bandwidth improvements when manipulating a huge
single file, the second set of tests verify that no performance regressions
were introduced when handling many small files.
The first benchmarks were done by running FIO (fio-3.21) with:
- bs=4Ki;
- file size: 50Gi;
- ioengine: sync;
- fsync_on_close: true.
The target file has been chosen large enough to avoid it to be entirely
loaded into the page cache.
Results are presented in the following table:
+-----------+--------+-------------+--------+
| Bandwidth | FUSE | FUSE | Bind |
| (KiB/s) | | passthrough | mount |
+-----------+--------+-------------+--------+
| read | 468897 | 502085 | 516830 |
+-----------+--------+-------------+--------+
| randread | 15773 | 26632 | 21386 |
+-----------+--------+-------------+--------+
| write | 58185 | 141272 | 141671 |
+-----------+--------+-------------+--------+
| randwrite | 59892 | 75236 | 76486 |
+-----------+--------+-------------+--------+
As long as this patch has the primary objective of improving bandwidth,
another set of tests has been performed to see how this behaves on a
totally different scenario that involves accessing many small files. For
this purpose, measuring the build time of the Linux kernel has been chosen
as a well-known workload. The kernel has been built with as many processes
as the number of logical CPUs (-j $(nproc)), that besides being a
reasonable number, is also enough to saturate the processor’s utilization
thanks to the additional FUSE daemon’s threads, making it even harder to
get closer to the native file system performance.
The following table shows the total build times in the different
configurations:
+------------------+--------------+-----------+
| | AVG duration | Standard |
| | (sec) | deviation |
+------------------+--------------+-----------+
| FUSE | 144.566 | 0.697 |
+------------------+--------------+-----------+
| FUSE passthrough | 133.820 | 0.341 |
+------------------+--------------+-----------+
| Raw | 109.423 | 0.724 |
+------------------+--------------+-----------+
Similar performance measurements were performed in the current version of
the patch, and results compatible with what is shown above.
Further testing and performance evaluations are welcome.
Description of the series
The first patch introduces the data structures and definitions required
both for the communication with userspace and for the internal kernel use.
It also adds the basic functionalities to establish the bridge between the
FUSE file and the lower file system file through a ioctl().
The second patch enables the synchronous read and write operations for
those FUSE files for which the passthrough functionality is enabled. Those
operations are directly sent to the lower file system.
The third and last patch extends the read and write operations to also
support asynchronous IO.
Changes in v8:
* aio requests now use kmalloc/kfree, instead of kmem_cache
* Switched to call_{read,write}_iter in AIO
* Revisited attributes copy
* Passthrough can only be enabled via ioctl(), fixing the security issue
spotted by Jann
[Proposed by Jann Horn]
* Use an extensible fuse_passthrough_out data structure
Changes in v7:
* Full handling of aio requests as done in overlayfs (update commit
message).
* s/fget_raw/fget.
* Open fails in case of passthrough errors, emitting warning messages.
[Proposed by Jann Horn]
* Create new local kiocb, getting rid of the previously proposed ki_filp
swapping.
[Proposed by Jann Horn and Jens Axboe]
* Code polishing.
Changes in v6:
* Port to kernel v5.8:
* fuse_file_{read,write}_iter() changed since the v5 of this patch was
proposed.
* Simplify fuse_simple_request().
* Merge fuse_passthrough.h into fuse_i.h
* Refactor of passthrough.c:
* Remove BUG_ON()s.
* Simplified error checking and request arguments indexing.
* Use call_{read,write}_iter() utility functions.
* Remove get_file() and fputs() during read/write: handle the extra FUSE
references to the lower file object when the fuse_file is
created/deleted.
[Proposed by Jann Horn]
Changes in v5:
* Fix the check when setting the passthrough file.
[Found when testing by Mike Shal]
Changes in v3 and v4:
* Use the fs_stack_depth to prevent further stacking and a minor fix.
[Proposed by Jann Horn]
Changes in v2:
* Changed the feature name to passthrough from stacked_io.
[Proposed by Linus Torvalds]
Alessio Balsini (3):
fuse: Definitions and ioctl() for passthrough
fuse: Introduce synchronous read and write for passthrough
fuse: Handle AIO read and write in passthrough
fs/fuse/Makefile | 1 +
fs/fuse/dev.c | 57 ++++++++++-
fs/fuse/dir.c | 2 +
fs/fuse/file.c | 25 +++--
fs/fuse/fuse_i.h | 16 ++++
fs/fuse/inode.c | 9 +-
fs/fuse/passthrough.c | 196 ++++++++++++++++++++++++++++++++++++++
include/uapi/linux/fuse.h | 12 ++-
8 files changed, 305 insertions(+), 13 deletions(-)
create mode 100644 fs/fuse/passthrough.c
--
2.28.0.618.gf4bc123cb7-goog
Extend the passthrough feature by handling asynchronous IO both for read
and write operations.
When an AIO request is received, targeting a FUSE file with passthrough
functionality enabled, a new identical AIO request is created, the file
pointer of which is updated with the file pointer of the lower file system,
and the completion handler is set with a special AIO passthrough handler.
The lower file system AIO request is allocated in dynamic kernel memory
and, when it completes, the allocated memory is freed and the completion
signal is propagated to the FUSE AIO request by triggering its completion
callback as well.
Signed-off-by: Alessio Balsini <[email protected]>
---
fs/fuse/passthrough.c | 66 +++++++++++++++++++++++++++++++++++++++++--
1 file changed, 63 insertions(+), 3 deletions(-)
diff --git a/fs/fuse/passthrough.c b/fs/fuse/passthrough.c
index 44a78e02f45d..87b57b26fd8a 100644
--- a/fs/fuse/passthrough.c
+++ b/fs/fuse/passthrough.c
@@ -2,10 +2,16 @@
#include "fuse_i.h"
+#include <linux/aio.h>
#include <linux/fs_stack.h>
#include <linux/fsnotify.h>
#include <linux/uio.h>
+struct fuse_aio_req {
+ struct kiocb iocb;
+ struct kiocb *iocb_fuse;
+};
+
static void fuse_copyattr(struct file *dst_file, struct file *src_file,
bool write)
{
@@ -20,6 +26,32 @@ static void fuse_copyattr(struct file *dst_file, struct file *src_file,
}
}
+static void fuse_aio_cleanup_handler(struct fuse_aio_req *aio_req)
+{
+ struct kiocb *iocb = &aio_req->iocb;
+ struct kiocb *iocb_fuse = aio_req->iocb_fuse;
+ bool write = !!(iocb->ki_flags & IOCB_WRITE);
+
+ if (write) {
+ __sb_writers_acquired(file_inode(iocb->ki_filp)->i_sb,
+ SB_FREEZE_WRITE);
+ file_end_write(iocb->ki_filp);
+ }
+
+ fuse_copyattr(iocb_fuse->ki_filp, iocb->ki_filp, write);
+ iocb_fuse->ki_pos = iocb->ki_pos;
+ kfree(aio_req);
+}
+
+static void fuse_aio_rw_complete(struct kiocb *iocb, long res, long res2)
+{
+ struct fuse_aio_req *aio_req =
+ container_of(iocb, struct fuse_aio_req, iocb);
+ struct kiocb *iocb_fuse = aio_req->iocb_fuse;
+
+ fuse_aio_cleanup_handler(aio_req);
+ iocb_fuse->ki_complete(iocb_fuse, res, res2);
+}
ssize_t fuse_passthrough_read_iter(struct kiocb *iocb_fuse,
struct iov_iter *iter)
@@ -42,7 +74,18 @@ ssize_t fuse_passthrough_read_iter(struct kiocb *iocb_fuse,
fuse_copyattr(fuse_filp, passthrough_filp, false);
} else {
- ret = -EIO;
+ struct fuse_aio_req *aio_req;
+
+ aio_req = kmalloc(sizeof(struct fuse_aio_req), GFP_KERNEL);
+ if (!aio_req)
+ return -ENOMEM;
+
+ aio_req->iocb_fuse = iocb_fuse;
+ kiocb_clone(&aio_req->iocb, iocb_fuse, passthrough_filp);
+ aio_req->iocb.ki_complete = fuse_aio_rw_complete;
+ ret = call_read_iter(passthrough_filp, &aio_req->iocb, iter);
+ if (ret != -EIOCBQUEUED)
+ fuse_aio_cleanup_handler(aio_req);
}
return ret;
@@ -56,6 +99,7 @@ ssize_t fuse_passthrough_write_iter(struct kiocb *iocb_fuse,
struct fuse_file *ff = fuse_filp->private_data;
struct inode *fuse_inode = file_inode(fuse_filp);
struct file *passthrough_filp = ff->passthrough_filp;
+ struct inode *passthrough_inode = file_inode(passthrough_filp);
if (!iov_iter_count(iter))
return 0;
@@ -75,9 +119,25 @@ ssize_t fuse_passthrough_write_iter(struct kiocb *iocb_fuse,
if (ret > 0)
fuse_copyattr(fuse_filp, passthrough_filp, true);
} else {
- ret = -EIO;
- }
+ struct fuse_aio_req *aio_req;
+ aio_req = kmalloc(sizeof(struct fuse_aio_req), GFP_KERNEL);
+ if (!aio_req) {
+ ret = -ENOMEM;
+ goto out;
+ }
+
+ file_start_write(passthrough_filp);
+ __sb_writers_release(passthrough_inode->i_sb, SB_FREEZE_WRITE);
+
+ aio_req->iocb_fuse = iocb_fuse;
+ kiocb_clone(&aio_req->iocb, iocb_fuse, passthrough_filp);
+ aio_req->iocb.ki_complete = fuse_aio_rw_complete;
+ ret = call_write_iter(passthrough_filp, &aio_req->iocb, iter);
+ if (ret != -EIOCBQUEUED)
+ fuse_aio_cleanup_handler(aio_req);
+ }
+out:
inode_unlock(fuse_inode);
return ret;
--
2.28.0.618.gf4bc123cb7-goog
Introduce the new FUSE passthrough ioctl(), which allows userspace to
specify a direct connection between a FUSE file and a lower file system
file.
Such ioctl() requires userspace to specify:
- the file descriptor of one of its opened files,
- the unique identifier of the FUSE request associated with a pending
open/create operation,
both encapsulated into a fuse_passthrough_out data structure.
The ioctl() will search for the pending FUSE request matching the unique
identifier, and update the passthrough file pointer of the request with the
file pointer referenced by the passed file descriptor.
When that pending FUSE request is handled, the passthrough file pointer
is copied to the fuse_file data structure, so that the link between FUSE
and lower file system is consolidated.
In order for the passthrough mode to be successfully activated, the lower
file system file must implement both read_ and write_iter file operations.
This extra check avoids special pseudofiles to be targets for this feature.
An additional enforced limitation is that when FUSE passthrough is enabled,
no further file system stacking is allowed.
Signed-off-by: Alessio Balsini <[email protected]>
---
fs/fuse/Makefile | 1 +
fs/fuse/dev.c | 57 +++++++++++++++++++++++++++++++++++----
fs/fuse/dir.c | 2 ++
fs/fuse/file.c | 17 +++++++++---
fs/fuse/fuse_i.h | 14 ++++++++++
fs/fuse/inode.c | 9 ++++++-
fs/fuse/passthrough.c | 55 +++++++++++++++++++++++++++++++++++++
include/uapi/linux/fuse.h | 12 ++++++++-
8 files changed, 156 insertions(+), 11 deletions(-)
create mode 100644 fs/fuse/passthrough.c
diff --git a/fs/fuse/Makefile b/fs/fuse/Makefile
index 3e8cebfb59b7..6971454a2bdf 100644
--- a/fs/fuse/Makefile
+++ b/fs/fuse/Makefile
@@ -8,4 +8,5 @@ obj-$(CONFIG_CUSE) += cuse.o
obj-$(CONFIG_VIRTIO_FS) += virtiofs.o
fuse-objs := dev.o dir.o file.o inode.o control.o xattr.o acl.o readdir.o
+fuse-objs += passthrough.o
virtiofs-y += virtio_fs.o
diff --git a/fs/fuse/dev.c b/fs/fuse/dev.c
index 02b3c36b3676..b0fbdbfd4fbd 100644
--- a/fs/fuse/dev.c
+++ b/fs/fuse/dev.c
@@ -2219,21 +2219,53 @@ static int fuse_device_clone(struct fuse_conn *fc, struct file *new)
return 0;
}
+int fuse_passthrough_open(struct fuse_dev *fud,
+ struct fuse_passthrough_out *pto)
+{
+ int ret;
+ struct fuse_req *req;
+ struct fuse_pqueue *fpq = &fud->pq;
+ struct fuse_conn *fc = fud->fc;
+
+ if (!fc->passthrough)
+ return -EPERM;
+
+ /* This field is reserved for future use */
+ if (pto->len != 0)
+ return -EINVAL;
+
+ spin_lock(&fpq->lock);
+ req = request_find(fpq, pto->unique & ~FUSE_INT_REQ_BIT);
+ if (!req) {
+ spin_unlock(&fpq->lock);
+ return -ENOENT;
+ }
+ __fuse_get_request(req);
+ spin_unlock(&fpq->lock);
+
+ ret = fuse_passthrough_setup(req, pto->fd);
+
+ __fuse_put_request(req);
+ return ret;
+}
+
static long fuse_dev_ioctl(struct file *file, unsigned int cmd,
unsigned long arg)
{
- int err = -ENOTTY;
-
- if (cmd == FUSE_DEV_IOC_CLONE) {
- int oldfd;
+ int err;
+ int oldfd;
+ struct fuse_dev *fud;
+ struct fuse_passthrough_out pto;
+ switch (cmd) {
+ case FUSE_DEV_IOC_CLONE:
err = -EFAULT;
if (!get_user(oldfd, (__u32 __user *) arg)) {
struct file *old = fget(oldfd);
err = -EINVAL;
if (old) {
- struct fuse_dev *fud = NULL;
+ fud = NULL;
/*
* Check against file->f_op because CUSE
@@ -2251,6 +2283,21 @@ static long fuse_dev_ioctl(struct file *file, unsigned int cmd,
fput(old);
}
}
+ break;
+ case FUSE_DEV_IOC_PASSTHROUGH_OPEN:
+ err = -EFAULT;
+ if (!copy_from_user(&pto,
+ (struct fuse_passthrough_out __user *)arg,
+ sizeof(pto))) {
+ err = -EINVAL;
+ fud = fuse_get_dev(file);
+ if (fud)
+ err = fuse_passthrough_open(fud, &pto);
+ }
+ break;
+ default:
+ err = -ENOTTY;
+ break;
}
return err;
}
diff --git a/fs/fuse/dir.c b/fs/fuse/dir.c
index 26f028bc760b..531de0c5c9e8 100644
--- a/fs/fuse/dir.c
+++ b/fs/fuse/dir.c
@@ -477,6 +477,7 @@ static int fuse_create_open(struct inode *dir, struct dentry *entry,
args.out_args[0].value = &outentry;
args.out_args[1].size = sizeof(outopen);
args.out_args[1].value = &outopen;
+ args.passthrough_filp = NULL;
err = fuse_simple_request(fc, &args);
if (err)
goto out_free_ff;
@@ -489,6 +490,7 @@ static int fuse_create_open(struct inode *dir, struct dentry *entry,
ff->fh = outopen.fh;
ff->nodeid = outentry.nodeid;
ff->open_flags = outopen.open_flags;
+ ff->passthrough_filp = args.passthrough_filp;
inode = fuse_iget(dir->i_sb, outentry.nodeid, outentry.generation,
&outentry.attr, entry_attr_timeout(&outentry), 0);
if (!inode) {
diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index 83d917f7e542..6c0ec742ce74 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -33,10 +33,12 @@ static struct page **fuse_pages_alloc(unsigned int npages, gfp_t flags,
}
static int fuse_send_open(struct fuse_conn *fc, u64 nodeid, struct file *file,
- int opcode, struct fuse_open_out *outargp)
+ int opcode, struct fuse_open_out *outargp,
+ struct file **passthrough_filp)
{
struct fuse_open_in inarg;
FUSE_ARGS(args);
+ int ret;
memset(&inarg, 0, sizeof(inarg));
inarg.flags = file->f_flags & ~(O_CREAT | O_EXCL | O_NOCTTY);
@@ -51,7 +53,10 @@ static int fuse_send_open(struct fuse_conn *fc, u64 nodeid, struct file *file,
args.out_args[0].size = sizeof(*outargp);
args.out_args[0].value = outargp;
- return fuse_simple_request(fc, &args);
+ ret = fuse_simple_request(fc, &args);
+ *passthrough_filp = args.passthrough_filp;
+
+ return ret;
}
struct fuse_release_args {
@@ -144,14 +149,16 @@ int fuse_do_open(struct fuse_conn *fc, u64 nodeid, struct file *file,
/* Default for no-open */
ff->open_flags = FOPEN_KEEP_CACHE | (isdir ? FOPEN_CACHE_DIR : 0);
if (isdir ? !fc->no_opendir : !fc->no_open) {
+ struct file *passthrough_filp;
struct fuse_open_out outarg;
int err;
- err = fuse_send_open(fc, nodeid, file, opcode, &outarg);
+ err = fuse_send_open(fc, nodeid, file, opcode, &outarg,
+ &passthrough_filp);
if (!err) {
ff->fh = outarg.fh;
ff->open_flags = outarg.open_flags;
-
+ ff->passthrough_filp = passthrough_filp;
} else if (err != -ENOSYS) {
fuse_file_free(ff);
return err;
@@ -281,6 +288,8 @@ void fuse_release_common(struct file *file, bool isdir)
struct fuse_release_args *ra = ff->release_args;
int opcode = isdir ? FUSE_RELEASEDIR : FUSE_RELEASE;
+ fuse_passthrough_release(ff);
+
fuse_prepare_release(fi, ff, file->f_flags, opcode);
if (ff->flock) {
diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
index 740a8a7d7ae6..6c5166447905 100644
--- a/fs/fuse/fuse_i.h
+++ b/fs/fuse/fuse_i.h
@@ -208,6 +208,12 @@ struct fuse_file {
} readdir;
+ /**
+ * Reference to lower filesystem file for read/write operations
+ * handled in pass-through mode
+ */
+ struct file *passthrough_filp;
+
/** RB node to be linked on fuse_conn->polled_files */
struct rb_node polled_node;
@@ -250,6 +256,8 @@ struct fuse_args {
bool page_zeroing:1;
bool page_replace:1;
bool may_block:1;
+ /** Lower filesystem file pointer used in pass-through mode */
+ struct file *passthrough_filp;
struct fuse_in_arg in_args[3];
struct fuse_arg out_args[2];
void (*end)(struct fuse_conn *fc, struct fuse_args *args, int error);
@@ -720,6 +728,9 @@ struct fuse_conn {
/* Do not show mount options */
unsigned int no_mount_options:1;
+ /** Pass-through mode for read/write IO */
+ unsigned int passthrough:1;
+
/** The number of requests waiting for completion */
atomic_t num_waiting;
@@ -1093,4 +1104,7 @@ unsigned int fuse_len_args(unsigned int numargs, struct fuse_arg *args);
u64 fuse_get_unique(struct fuse_iqueue *fiq);
void fuse_free_conn(struct fuse_conn *fc);
+int fuse_passthrough_setup(struct fuse_req *req, unsigned int fd);
+void fuse_passthrough_release(struct fuse_file *ff);
+
#endif /* _FS_FUSE_I_H */
diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
index bba747520e9b..eb223130a917 100644
--- a/fs/fuse/inode.c
+++ b/fs/fuse/inode.c
@@ -965,6 +965,12 @@ static void process_init_reply(struct fuse_conn *fc, struct fuse_args *args,
min_t(unsigned int, FUSE_MAX_MAX_PAGES,
max_t(unsigned int, arg->max_pages, 1));
}
+ if (arg->flags & FUSE_PASSTHROUGH) {
+ fc->passthrough = 1;
+ /* Prevent further stacking */
+ fc->sb->s_stack_depth =
+ FILESYSTEM_MAX_STACK_DEPTH;
+ }
} else {
ra_pages = fc->max_read / PAGE_SIZE;
fc->no_lock = 1;
@@ -1002,7 +1008,8 @@ void fuse_send_init(struct fuse_conn *fc)
FUSE_WRITEBACK_CACHE | FUSE_NO_OPEN_SUPPORT |
FUSE_PARALLEL_DIROPS | FUSE_HANDLE_KILLPRIV | FUSE_POSIX_ACL |
FUSE_ABORT_ERROR | FUSE_MAX_PAGES | FUSE_CACHE_SYMLINKS |
- FUSE_NO_OPENDIR_SUPPORT | FUSE_EXPLICIT_INVAL_DATA;
+ FUSE_NO_OPENDIR_SUPPORT | FUSE_EXPLICIT_INVAL_DATA |
+ FUSE_PASSTHROUGH;
ia->args.opcode = FUSE_INIT;
ia->args.in_numargs = 1;
ia->args.in_args[0].size = sizeof(ia->in);
diff --git a/fs/fuse/passthrough.c b/fs/fuse/passthrough.c
new file mode 100644
index 000000000000..86ab4eafa7bf
--- /dev/null
+++ b/fs/fuse/passthrough.c
@@ -0,0 +1,55 @@
+// SPDX-License-Identifier: GPL-2.0
+
+#include "fuse_i.h"
+
+int fuse_passthrough_setup(struct fuse_req *req, unsigned int fd)
+{
+ int ret;
+ int fs_stack_depth;
+ struct file *passthrough_filp;
+ struct inode *passthrough_inode;
+ struct super_block *passthrough_sb;
+
+ /* Passthrough mode can only be enabled at file open/create time */
+ if (req->in.h.opcode != FUSE_OPEN && req->in.h.opcode != FUSE_CREATE) {
+ pr_err("FUSE: invalid OPCODE for request.\n");
+ return -EINVAL;
+ }
+
+ passthrough_filp = fget(fd);
+ if (!passthrough_filp) {
+ pr_err("FUSE: invalid file descriptor for passthrough.\n");
+ return -EINVAL;
+ }
+
+ ret = -EINVAL;
+ if (!passthrough_filp->f_op->read_iter ||
+ !passthrough_filp->f_op->write_iter) {
+ pr_err("FUSE: passthrough file misses file operations.\n");
+ goto out;
+ }
+
+ passthrough_inode = file_inode(passthrough_filp);
+ passthrough_sb = passthrough_inode->i_sb;
+ fs_stack_depth = passthrough_sb->s_stack_depth + 1;
+ ret = -EEXIST;
+ if (fs_stack_depth > FILESYSTEM_MAX_STACK_DEPTH) {
+ pr_err("FUSE: maximum fs stacking depth exceeded for passthrough\n");
+ goto out;
+ }
+
+ req->args->passthrough_filp = passthrough_filp;
+ return 0;
+out:
+ fput(passthrough_filp);
+ return ret;
+}
+
+void fuse_passthrough_release(struct fuse_file *ff)
+{
+ if (!ff->passthrough_filp)
+ return;
+
+ fput(ff->passthrough_filp);
+ ff->passthrough_filp = NULL;
+}
diff --git a/include/uapi/linux/fuse.h b/include/uapi/linux/fuse.h
index 373cada89815..0cd9fd83374a 100644
--- a/include/uapi/linux/fuse.h
+++ b/include/uapi/linux/fuse.h
@@ -342,6 +342,7 @@ struct fuse_file_lock {
#define FUSE_NO_OPENDIR_SUPPORT (1 << 24)
#define FUSE_EXPLICIT_INVAL_DATA (1 << 25)
#define FUSE_MAP_ALIGNMENT (1 << 26)
+#define FUSE_PASSTHROUGH (1 << 27)
/**
* CUSE INIT request/reply flags
@@ -794,6 +795,14 @@ struct fuse_in_header {
uint32_t padding;
};
+struct fuse_passthrough_out {
+ uint64_t unique;
+ uint32_t fd;
+ /* For future implementation */
+ uint32_t len;
+ void *vec;
+};
+
struct fuse_out_header {
uint32_t len;
int32_t error;
@@ -869,7 +878,8 @@ struct fuse_notify_retrieve_in {
};
/* Device ioctls: */
-#define FUSE_DEV_IOC_CLONE _IOR(229, 0, uint32_t)
+#define FUSE_DEV_IOC_CLONE _IOR(229, 0, uint32_t)
+#define FUSE_DEV_IOC_PASSTHROUGH_OPEN _IOW(229, 1, struct fuse_passthrough_out)
struct fuse_lseek_in {
uint64_t fh;
--
2.28.0.618.gf4bc123cb7-goog
On 9/11/20 10:34 AM, Alessio Balsini wrote:
> Extend the passthrough feature by handling asynchronous IO both for read
> and write operations.
> When an AIO request is received, targeting a FUSE file with passthrough
> functionality enabled, a new identical AIO request is created, the file
> pointer of which is updated with the file pointer of the lower file system,
> and the completion handler is set with a special AIO passthrough handler.
> The lower file system AIO request is allocated in dynamic kernel memory
> and, when it completes, the allocated memory is freed and the completion
> signal is propagated to the FUSE AIO request by triggering its completion
> callback as well.
>
> Signed-off-by: Alessio Balsini <[email protected]>
> ---
> fs/fuse/passthrough.c | 66 +++++++++++++++++++++++++++++++++++++++++--
> 1 file changed, 63 insertions(+), 3 deletions(-)
>
> diff --git a/fs/fuse/passthrough.c b/fs/fuse/passthrough.c
> index 44a78e02f45d..87b57b26fd8a 100644
> --- a/fs/fuse/passthrough.c
> +++ b/fs/fuse/passthrough.c
> @@ -2,10 +2,16 @@
>
> #include "fuse_i.h"
>
> +#include <linux/aio.h>
What is this include for? It's not using any aio parts at all.
--
Jens Axboe
On 9/11/2020 12:34 PM, Alessio Balsini via fuse-devel wrote:
> Add support for file system passthrough read/write of files when enabled in
> userspace through the option FUSE_PASSTHROUGH.
Might be more effort than it is worth but any thoughts on userland error
handling for passthrough? My use case, optionally, responds to read or
write errors in particular ways. It's not an unreasonable tradeoff to
disable passthrough if the user wants those features but was wondering
if there was any consideration of extending the protocol to pass
read/write errors back to the fuse server.
On Fri, Sep 11, 2020 at 7:34 PM Alessio Balsini <[email protected]> wrote:
>
> Introduce the new FUSE passthrough ioctl(), which allows userspace to
> specify a direct connection between a FUSE file and a lower file system
> file.
> Such ioctl() requires userspace to specify:
> - the file descriptor of one of its opened files,
> - the unique identifier of the FUSE request associated with a pending
> open/create operation,
> both encapsulated into a fuse_passthrough_out data structure.
> The ioctl() will search for the pending FUSE request matching the unique
> identifier, and update the passthrough file pointer of the request with the
> file pointer referenced by the passed file descriptor.
> When that pending FUSE request is handled, the passthrough file pointer
> is copied to the fuse_file data structure, so that the link between FUSE
> and lower file system is consolidated.
>
> In order for the passthrough mode to be successfully activated, the lower
> file system file must implement both read_ and write_iter file operations.
> This extra check avoids special pseudofiles to be targets for this feature.
> An additional enforced limitation is that when FUSE passthrough is enabled,
> no further file system stacking is allowed.
>
> Signed-off-by: Alessio Balsini <[email protected]>
> ---
[...]
> diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
> index bba747520e9b..eb223130a917 100644
> --- a/fs/fuse/inode.c
> +++ b/fs/fuse/inode.c
> @@ -965,6 +965,12 @@ static void process_init_reply(struct fuse_conn *fc, struct fuse_args *args,
> min_t(unsigned int, FUSE_MAX_MAX_PAGES,
> max_t(unsigned int, arg->max_pages, 1));
> }
> + if (arg->flags & FUSE_PASSTHROUGH) {
> + fc->passthrough = 1;
> + /* Prevent further stacking */
> + fc->sb->s_stack_depth =
> + FILESYSTEM_MAX_STACK_DEPTH;
> + }
That seems a bit limiting.
I suppose what you really want to avoid is loops into FUSE fd.
There may be a way to do this with forbidding overlay over FUSE passthrough
or the other way around.
You can set fc->sb->s_stack_depth = FILESYSTEM_MAX_STACK_DEPTH - 1
here and in passthrough ioctl you can check for looping into a fuse fs with
passthrough enabled on the passed fd (see below) ...
> } else {
> ra_pages = fc->max_read / PAGE_SIZE;
> fc->no_lock = 1;
> @@ -1002,7 +1008,8 @@ void fuse_send_init(struct fuse_conn *fc)
> FUSE_WRITEBACK_CACHE | FUSE_NO_OPEN_SUPPORT |
> FUSE_PARALLEL_DIROPS | FUSE_HANDLE_KILLPRIV | FUSE_POSIX_ACL |
> FUSE_ABORT_ERROR | FUSE_MAX_PAGES | FUSE_CACHE_SYMLINKS |
> - FUSE_NO_OPENDIR_SUPPORT | FUSE_EXPLICIT_INVAL_DATA;
> + FUSE_NO_OPENDIR_SUPPORT | FUSE_EXPLICIT_INVAL_DATA |
> + FUSE_PASSTHROUGH;
> ia->args.opcode = FUSE_INIT;
> ia->args.in_numargs = 1;
> ia->args.in_args[0].size = sizeof(ia->in);
> diff --git a/fs/fuse/passthrough.c b/fs/fuse/passthrough.c
> new file mode 100644
> index 000000000000..86ab4eafa7bf
> --- /dev/null
> +++ b/fs/fuse/passthrough.c
> @@ -0,0 +1,55 @@
> +// SPDX-License-Identifier: GPL-2.0
> +
> +#include "fuse_i.h"
> +
> +int fuse_passthrough_setup(struct fuse_req *req, unsigned int fd)
> +{
> + int ret;
> + int fs_stack_depth;
> + struct file *passthrough_filp;
> + struct inode *passthrough_inode;
> + struct super_block *passthrough_sb;
> +
> + /* Passthrough mode can only be enabled at file open/create time */
> + if (req->in.h.opcode != FUSE_OPEN && req->in.h.opcode != FUSE_CREATE) {
> + pr_err("FUSE: invalid OPCODE for request.\n");
> + return -EINVAL;
> + }
> +
> + passthrough_filp = fget(fd);
> + if (!passthrough_filp) {
> + pr_err("FUSE: invalid file descriptor for passthrough.\n");
> + return -EINVAL;
> + }
> +
> + ret = -EINVAL;
> + if (!passthrough_filp->f_op->read_iter ||
> + !passthrough_filp->f_op->write_iter) {
> + pr_err("FUSE: passthrough file misses file operations.\n");
> + goto out;
> + }
> +
> + passthrough_inode = file_inode(passthrough_filp);
> + passthrough_sb = passthrough_inode->i_sb;
> + fs_stack_depth = passthrough_sb->s_stack_depth + 1;
... for example:
if (fs_stack_depth && passthrough_sb->s_type == fuse_fs_type) {
pr_err("FUSE: stacked passthrough file\n");
goto out;
}
But maybe we want to ban passthrough to any lower FUSE at least for start.
> + ret = -EEXIST;
Why EEXIST? Why not EINVAL?
> + if (fs_stack_depth > FILESYSTEM_MAX_STACK_DEPTH) {
> + pr_err("FUSE: maximum fs stacking depth exceeded for passthrough\n");
> + goto out;
> + }
> +
> + req->args->passthrough_filp = passthrough_filp;
> + return 0;
> +out:
> + fput(passthrough_filp);
> + return ret;
> +}
> +
And speaking of overlayfs, I believe you may be able to test your code with
fuse-overlayfs (passthrough to upper files).
This is a project with real users running real workloads who may be
able to provide you with valuable feedback from testing.
Thanks,
Amir.
[1] https://github.com/containers/fuse-overlayfs
Hi Antonio,
It's indeed a great idea to notify the FUSE daemon in case of lower file
system exceptions, otherwise transparent when passthrough is enabled.
I was already planning to work on this feature as a future extension, glad
to see that this is of interest to the community.
Thanks for your feedback,
Alessio
On Fri, Sep 11, 2020 at 02:46:04PM -0400, Antonio SJ Musumeci wrote:
> On 9/11/2020 12:34 PM, Alessio Balsini via fuse-devel wrote:
> > Add support for file system passthrough read/write of files when enabled in
> > userspace through the option FUSE_PASSTHROUGH.
> Might be more effort than it is worth but any thoughts on userland error
> handling for passthrough? My use case, optionally, responds to read or write
> errors in particular ways. It's not an unreasonable tradeoff to disable
> passthrough if the user wants those features but was wondering if there was
> any consideration of extending the protocol to pass read/write errors back
> to the fuse server.
Hi Amir,
Thanks again for your feedback.
On Sat, Sep 12, 2020 at 02:06:02PM +0300, Amir Goldstein wrote:
> On Fri, Sep 11, 2020 at 7:34 PM Alessio Balsini <[email protected]> wrote:
> [...]
> > diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
> > index bba747520e9b..eb223130a917 100644
> > --- a/fs/fuse/inode.c
> > +++ b/fs/fuse/inode.c
> > @@ -965,6 +965,12 @@ static void process_init_reply(struct fuse_conn *fc, struct fuse_args *args,
> > min_t(unsigned int, FUSE_MAX_MAX_PAGES,
> > max_t(unsigned int, arg->max_pages, 1));
> > }
> > + if (arg->flags & FUSE_PASSTHROUGH) {
> > + fc->passthrough = 1;
> > + /* Prevent further stacking */
> > + fc->sb->s_stack_depth =
> > + FILESYSTEM_MAX_STACK_DEPTH;
> > + }
>
> That seems a bit limiting.
> I suppose what you really want to avoid is loops into FUSE fd.
> There may be a way to do this with forbidding overlay over FUSE passthrough
> or the other way around.
>
> You can set fc->sb->s_stack_depth = FILESYSTEM_MAX_STACK_DEPTH - 1
> here and in passthrough ioctl you can check for looping into a fuse fs with
> passthrough enabled on the passed fd (see below) ...
>
> [...]
> > diff --git a/fs/fuse/passthrough.c b/fs/fuse/passthrough.c
> > new file mode 100644
> > index 000000000000..86ab4eafa7bf
> > --- /dev/null
> > +++ b/fs/fuse/passthrough.c
> > @@ -0,0 +1,55 @@
> > +// SPDX-License-Identifier: GPL-2.0
> > +
> > +#include "fuse_i.h"
> > +
> > +int fuse_passthrough_setup(struct fuse_req *req, unsigned int fd)
> > +{
> > + int ret;
> > + int fs_stack_depth;
> > + struct file *passthrough_filp;
> > + struct inode *passthrough_inode;
> > + struct super_block *passthrough_sb;
> > +
> > + /* Passthrough mode can only be enabled at file open/create time */
> > + if (req->in.h.opcode != FUSE_OPEN && req->in.h.opcode != FUSE_CREATE) {
> > + pr_err("FUSE: invalid OPCODE for request.\n");
> > + return -EINVAL;
> > + }
> > +
> > + passthrough_filp = fget(fd);
> > + if (!passthrough_filp) {
> > + pr_err("FUSE: invalid file descriptor for passthrough.\n");
> > + return -EINVAL;
> > + }
> > +
> > + ret = -EINVAL;
> > + if (!passthrough_filp->f_op->read_iter ||
> > + !passthrough_filp->f_op->write_iter) {
> > + pr_err("FUSE: passthrough file misses file operations.\n");
> > + goto out;
> > + }
> > +
> > + passthrough_inode = file_inode(passthrough_filp);
> > + passthrough_sb = passthrough_inode->i_sb;
> > + fs_stack_depth = passthrough_sb->s_stack_depth + 1;
>
> ... for example:
>
> if (fs_stack_depth && passthrough_sb->s_type == fuse_fs_type) {
> pr_err("FUSE: stacked passthrough file\n");
> goto out;
> }
>
> But maybe we want to ban passthrough to any lower FUSE at least for start.
Yes, what I proposed here is very conservative, and your solution sounds
good to me. Unfortunately I don't have a clear idea of what could go wrong
if we relax this constraint. I need some guidance from you experts here.
What do you think if we keep this overly strict rule for now to avoid
unintended behaviors and come back as we find affected use case?
>
> > + ret = -EEXIST;
>
> Why EEXIST? Why not EINVAL?
>
Reaching the stacking limit sounded like an error caused by the undesired
existence of something, thus EEXIST sounded like a good fit.
No problem in changing that to EINVAL.
> > + if (fs_stack_depth > FILESYSTEM_MAX_STACK_DEPTH) {
> > + pr_err("FUSE: maximum fs stacking depth exceeded for passthrough\n");
> > + goto out;
> > + }
> > +
> > + req->args->passthrough_filp = passthrough_filp;
> > + return 0;
> > +out:
> > + fput(passthrough_filp);
> > + return ret;
> > +}
> > +
>
> And speaking of overlayfs, I believe you may be able to test your code with
> fuse-overlayfs (passthrough to upper files).
>
> This is a project with real users running real workloads who may be
> able to provide you with valuable feedback from testing.
>
> Thanks,
> Amir.
>
> [1] https://github.com/containers/fuse-overlayfs
This is indeed a project with several common elements to what we are doing
in Android, and that would probably benefit from this change. Will surely
involve them in the discussion.
Thanks for sharing!
Alessio
On Fri, Sep 18, 2020 at 7:33 PM Alessio Balsini <[email protected]> wrote:
>
> Hi Amir,
>
> Thanks again for your feedback.
>
> On Sat, Sep 12, 2020 at 02:06:02PM +0300, Amir Goldstein wrote:
> > On Fri, Sep 11, 2020 at 7:34 PM Alessio Balsini <[email protected]> wrote:
> > [...]
> > > diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
> > > index bba747520e9b..eb223130a917 100644
> > > --- a/fs/fuse/inode.c
> > > +++ b/fs/fuse/inode.c
> > > @@ -965,6 +965,12 @@ static void process_init_reply(struct fuse_conn *fc, struct fuse_args *args,
> > > min_t(unsigned int, FUSE_MAX_MAX_PAGES,
> > > max_t(unsigned int, arg->max_pages, 1));
> > > }
> > > + if (arg->flags & FUSE_PASSTHROUGH) {
> > > + fc->passthrough = 1;
> > > + /* Prevent further stacking */
> > > + fc->sb->s_stack_depth =
> > > + FILESYSTEM_MAX_STACK_DEPTH;
> > > + }
> >
> > That seems a bit limiting.
> > I suppose what you really want to avoid is loops into FUSE fd.
> > There may be a way to do this with forbidding overlay over FUSE passthrough
> > or the other way around.
> >
> > You can set fc->sb->s_stack_depth = FILESYSTEM_MAX_STACK_DEPTH - 1
> > here and in passthrough ioctl you can check for looping into a fuse fs with
> > passthrough enabled on the passed fd (see below) ...
> >
> > [...]
> > > diff --git a/fs/fuse/passthrough.c b/fs/fuse/passthrough.c
> > > new file mode 100644
> > > index 000000000000..86ab4eafa7bf
> > > --- /dev/null
> > > +++ b/fs/fuse/passthrough.c
> > > @@ -0,0 +1,55 @@
> > > +// SPDX-License-Identifier: GPL-2.0
> > > +
> > > +#include "fuse_i.h"
> > > +
> > > +int fuse_passthrough_setup(struct fuse_req *req, unsigned int fd)
> > > +{
> > > + int ret;
> > > + int fs_stack_depth;
> > > + struct file *passthrough_filp;
> > > + struct inode *passthrough_inode;
> > > + struct super_block *passthrough_sb;
> > > +
> > > + /* Passthrough mode can only be enabled at file open/create time */
> > > + if (req->in.h.opcode != FUSE_OPEN && req->in.h.opcode != FUSE_CREATE) {
> > > + pr_err("FUSE: invalid OPCODE for request.\n");
> > > + return -EINVAL;
> > > + }
> > > +
> > > + passthrough_filp = fget(fd);
> > > + if (!passthrough_filp) {
> > > + pr_err("FUSE: invalid file descriptor for passthrough.\n");
> > > + return -EINVAL;
> > > + }
> > > +
> > > + ret = -EINVAL;
> > > + if (!passthrough_filp->f_op->read_iter ||
> > > + !passthrough_filp->f_op->write_iter) {
> > > + pr_err("FUSE: passthrough file misses file operations.\n");
> > > + goto out;
> > > + }
> > > +
> > > + passthrough_inode = file_inode(passthrough_filp);
> > > + passthrough_sb = passthrough_inode->i_sb;
> > > + fs_stack_depth = passthrough_sb->s_stack_depth + 1;
> >
> > ... for example:
> >
> > if (fs_stack_depth && passthrough_sb->s_type == fuse_fs_type) {
> > pr_err("FUSE: stacked passthrough file\n");
> > goto out;
> > }
> >
> > But maybe we want to ban passthrough to any lower FUSE at least for start.
>
>
> Yes, what I proposed here is very conservative, and your solution sounds
> good to me. Unfortunately I don't have a clear idea of what could go wrong
> if we relax this constraint. I need some guidance from you experts here.
>
I guess the main concern would be locking order and deadlocks.
With my suggestion I think deadlocks are avoided and I am less sure
but think that lockdep should not have false positives either.
If people do need the 1-level stacking, I can try to think harder
if it is safe and maybe add some more compromise limitations.
> What do you think if we keep this overly strict rule for now to avoid
> unintended behaviors and come back as we find affected use case?
>
I can live with that if other designated users don't mind the limitation.
I happen to be developing a passthrough FUSE fs [1] myself and
I also happen to be using it to pass through to overlayfs.
OTOH, the workloads for my use case are mostly large sequential IO,
and the hardware can handle the few extra syscalls, so the passthrough
fd feature is not urgent for my use case at this point in time.
>
> >
> > > + ret = -EEXIST;
> >
> > Why EEXIST? Why not EINVAL?
> >
>
>
> Reaching the stacking limit sounded like an error caused by the undesired
> existence of something, thus EEXIST sounded like a good fit.
> No problem in changing that to EINVAL.
>
>
>
> > > + if (fs_stack_depth > FILESYSTEM_MAX_STACK_DEPTH) {
> > > + pr_err("FUSE: maximum fs stacking depth exceeded for passthrough\n");
> > > + goto out;
> > > + }
> > > +
> > > + req->args->passthrough_filp = passthrough_filp;
> > > + return 0;
> > > +out:
> > > + fput(passthrough_filp);
> > > + return ret;
> > > +}
> > > +
> >
> > And speaking of overlayfs, I believe you may be able to test your code with
> > fuse-overlayfs (passthrough to upper files).
> >
...
>
> This is indeed a project with several common elements to what we are doing
> in Android,
Are you in liberty to share more information about the Android project?
Is it related to Incremental FS [2]?
Thanks,
Amir.
[1] https://github.com/amir73il/libfuse/commits/cachegwfs
[2] https://lore.kernel.org/linux-fsdevel/[email protected]/
On Fri, Sep 11, 2020 at 11:23:22AM -0600, Jens Axboe wrote:
> On 9/11/20 10:34 AM, Alessio Balsini wrote:
> > Extend the passthrough feature by handling asynchronous IO both for read
> > and write operations.
> > When an AIO request is received, targeting a FUSE file with passthrough
> > functionality enabled, a new identical AIO request is created, the file
> > pointer of which is updated with the file pointer of the lower file system,
> > and the completion handler is set with a special AIO passthrough handler.
> > The lower file system AIO request is allocated in dynamic kernel memory
> > and, when it completes, the allocated memory is freed and the completion
> > signal is propagated to the FUSE AIO request by triggering its completion
> > callback as well.
> >
> > Signed-off-by: Alessio Balsini <[email protected]>
> > ---
> > fs/fuse/passthrough.c | 66 +++++++++++++++++++++++++++++++++++++++++--
> > 1 file changed, 63 insertions(+), 3 deletions(-)
> >
> > diff --git a/fs/fuse/passthrough.c b/fs/fuse/passthrough.c
> > index 44a78e02f45d..87b57b26fd8a 100644
> > --- a/fs/fuse/passthrough.c
> > +++ b/fs/fuse/passthrough.c
> > @@ -2,10 +2,16 @@
> >
> > #include "fuse_i.h"
> >
> > +#include <linux/aio.h>
>
> What is this include for? It's not using any aio parts at all.
>
> --
> Jens Axboe
>
Slipped from a cleanup. Fixed.
Thanks!
Alessio
On Fri, Sep 18, 2020 at 10:59:16PM +0300, Amir Goldstein wrote:
> On Fri, Sep 18, 2020 at 7:33 PM Alessio Balsini <[email protected]> wrote:
> [...]
> > > ... for example:
> > >
> > > if (fs_stack_depth && passthrough_sb->s_type == fuse_fs_type) {
> > > pr_err("FUSE: stacked passthrough file\n");
> > > goto out;
> > > }
> > >
> > > But maybe we want to ban passthrough to any lower FUSE at least for start.
> >
> >
> > Yes, what I proposed here is very conservative, and your solution sounds
> > good to me. Unfortunately I don't have a clear idea of what could go wrong
> > if we relax this constraint. I need some guidance from you experts here.
> >
>
> I guess the main concern would be locking order and deadlocks.
> With my suggestion I think deadlocks are avoided and I am less sure
> but think that lockdep should not have false positives either.
>
> If people do need the 1-level stacking, I can try to think harder
> if it is safe and maybe add some more compromise limitations.
>
> > What do you think if we keep this overly strict rule for now to avoid
> > unintended behaviors and come back as we find affected use case?
> >
>
> I can live with that if other designated users don't mind the limitation.
>
> I happen to be developing a passthrough FUSE fs [1] myself and
> I also happen to be using it to pass through to overlayfs.
> OTOH, the workloads for my use case are mostly large sequential IO,
> and the hardware can handle the few extra syscalls, so the passthrough
> fd feature is not urgent for my use case at this point in time.
This is something that only happens if the FUSE daemon opens a connection
wanting FUSE_PASSTHROUGH, so shouldn't affect existing use cases. Or am I
wrong?
If some users find this limitation to be an issue, we can rethink/relax
this policy in the future... Switching to something like the solution you
proposed does not break the current behavior, so we would be able to change
this with minimal effort.
>
>
> >
> > >
> > > > + ret = -EEXIST;
> > >
> > > Why EEXIST? Why not EINVAL?
> > >
> >
> >
> > Reaching the stacking limit sounded like an error caused by the undesired
> > existence of something, thus EEXIST sounded like a good fit.
> > No problem in changing that to EINVAL.
> >
> >
> >
> > > > + if (fs_stack_depth > FILESYSTEM_MAX_STACK_DEPTH) {
> > > > + pr_err("FUSE: maximum fs stacking depth exceeded for passthrough\n");
> > > > + goto out;
> > > > + }
> > > > +
> > > > + req->args->passthrough_filp = passthrough_filp;
> > > > + return 0;
> > > > +out:
> > > > + fput(passthrough_filp);
> > > > + return ret;
> > > > +}
> > > > +
> > >
> > > And speaking of overlayfs, I believe you may be able to test your code with
> > > fuse-overlayfs (passthrough to upper files).
> > >
> ...
> >
> > This is indeed a project with several common elements to what we are doing
> > in Android,
>
> Are you in liberty to share more information about the Android project?
> Is it related to Incremental FS [2]?
>
> Thanks,
> Amir.
>
> [1] https://github.com/amir73il/libfuse/commits/cachegwfs
> [2] https://lore.kernel.org/linux-fsdevel/[email protected]/
Thanks for the pointer to cachegwfs, I'm glad I'm seeing more and more
passthrough file systems in addition to our use case in Android.
I'm not directly involved in the Incremental FS project, but, as far as I
remember, only for the first PoC was actually developed as a FUSE file
system. Because of the overhead introduced by the user space round-trips,
that design was left behind and the whole Incremental FS infrastructure
switched to becoming a kernel module.
In general, the FUSE passthrough patch set proposed in this series wouldn't
be helpful for that use case because, for example, Incremental FS requires
live (de)compression of data, that can only be performed by the FUSE
daemon.
The specific use case I'm trying to improve with this FUSE passthrough
series is instead related to the scoped storage feature that we introduced
in Android 11, that is based on FUSE, and affects those folders that are
shared among all the Apps (e.g., DCIM, Downloads, etc). More details here:
https://developer.android.com/about/versions/11/privacy/storage
With FUSE we now have a flexible way to specify more fine-grained
permissions (e.g., specify if an App is allowed to access files depenind on
their type), create private App folders, maintain legacy paths for old
Apps, manipulate pieces of files at run-time, etc. Forgive me if I forgot
something. :)
These extra operations may slower the file system access comprared to a
native, direct access, but if:
- the file being accessed requires special treatment before being passed to
the requesting App, then further tests will be performed at every
read/write operation (with some optimizations). This overhead is of
course annoying, but is something we are happy to pay because is
beneficial to the user (i.e., improves privacy and security).
- Instead, if at open time a file is recognized as safe to access and does
not require any further enforcement from the FUSE daemon, there's no need
to pay for future read/write operations overheads, that wouldn't do
anything more than just copying data (possibly with the help of
splicing). In this case the FUSE passthrough feature proposed in this
series can be enabled to reduce this overhead.
Moreover, some Apps use big files that contain all their resources, then
access these files at random offsets, not taking advantage of read-ahead
cache. The same happens for files containing databases.
In addition, our specific use case involves a FUSE daemon is probably
heavier than the average passthrough file system (for example those that
are in libfuse/examples), so reducing user space round trips thanks to the
patchset proposed here gives a strong improvement.
Anyway, only showing the improvements in our extreme use case would have
brought a limited case for upstreaming, and that is why the benchmarks I
ran (presented in the cover letter), are based on the vanilla
passthrough_hp daemon managing kind of standard storage workloads, and
still show evident performance improvements.
Running and sharing benchmarks on Android is also tricky because at the
time of first submission the most recent public real device was running a
4.14 kernel, so that would have been maybe nice to see, but not that
interesting... :)
I hope I answered all your questions, please let me know if I missed
something and feel free to ask for more details!
Thanks again,
Alessio
On Tue, Sep 22, 2020 at 3:15 PM Alessio Balsini <[email protected]> wrote:
>
> On Fri, Sep 18, 2020 at 10:59:16PM +0300, Amir Goldstein wrote:
> > On Fri, Sep 18, 2020 at 7:33 PM Alessio Balsini <[email protected]> wrote:
> > [...]
> > > > ... for example:
> > > >
> > > > if (fs_stack_depth && passthrough_sb->s_type == fuse_fs_type) {
> > > > pr_err("FUSE: stacked passthrough file\n");
> > > > goto out;
> > > > }
> > > >
> > > > But maybe we want to ban passthrough to any lower FUSE at least for start.
> > >
> > >
> > > Yes, what I proposed here is very conservative, and your solution sounds
> > > good to me. Unfortunately I don't have a clear idea of what could go wrong
> > > if we relax this constraint. I need some guidance from you experts here.
> > >
> >
> > I guess the main concern would be locking order and deadlocks.
> > With my suggestion I think deadlocks are avoided and I am less sure
> > but think that lockdep should not have false positives either.
> >
> > If people do need the 1-level stacking, I can try to think harder
> > if it is safe and maybe add some more compromise limitations.
> >
> > > What do you think if we keep this overly strict rule for now to avoid
> > > unintended behaviors and come back as we find affected use case?
> > >
> >
> > I can live with that if other designated users don't mind the limitation.
> >
> > I happen to be developing a passthrough FUSE fs [1] myself and
> > I also happen to be using it to pass through to overlayfs.
> > OTOH, the workloads for my use case are mostly large sequential IO,
> > and the hardware can handle the few extra syscalls, so the passthrough
> > fd feature is not urgent for my use case at this point in time.
>
>
> This is something that only happens if the FUSE daemon opens a connection
> wanting FUSE_PASSTHROUGH, so shouldn't affect existing use cases. Or am I
> wrong?
I meant, if I would have expected a significant performance improvement
from FUSE_PASSTHROUGH in my use case, I would have wanted to use it
and then passthrough to overlayfs would have mattered to me more.
> If some users find this limitation to be an issue, we can rethink/relax
> this policy in the future... Switching to something like the solution you
> proposed does not break the current behavior, so we would be able to change
> this with minimal effort.
>
True.
>
> >
> >
> > >
> > > >
> > > > > + ret = -EEXIST;
> > > >
> > > > Why EEXIST? Why not EINVAL?
> > > >
> > >
> > >
> > > Reaching the stacking limit sounded like an error caused by the undesired
> > > existence of something, thus EEXIST sounded like a good fit.
> > > No problem in changing that to EINVAL.
> > >
> > >
> > >
> > > > > + if (fs_stack_depth > FILESYSTEM_MAX_STACK_DEPTH) {
> > > > > + pr_err("FUSE: maximum fs stacking depth exceeded for passthrough\n");
> > > > > + goto out;
> > > > > + }
> > > > > +
> > > > > + req->args->passthrough_filp = passthrough_filp;
> > > > > + return 0;
> > > > > +out:
> > > > > + fput(passthrough_filp);
> > > > > + return ret;
> > > > > +}
> > > > > +
> > > >
> > > > And speaking of overlayfs, I believe you may be able to test your code with
> > > > fuse-overlayfs (passthrough to upper files).
> > > >
> > ...
> > >
> > > This is indeed a project with several common elements to what we are doing
> > > in Android,
> >
> > Are you in liberty to share more information about the Android project?
> > Is it related to Incremental FS [2]?
> >
> > Thanks,
> > Amir.
> >
> > [1] https://github.com/amir73il/libfuse/commits/cachegwfs
> > [2] https://lore.kernel.org/linux-fsdevel/[email protected]/
>
> Thanks for the pointer to cachegwfs, I'm glad I'm seeing more and more
> passthrough file systems in addition to our use case in Android.
>
I am hearing about a lot of these projects.
I think that FUSE_PASSTHROUGH is a very useful feature.
I have an intention to explore passthrough to directory fd for
directory modifications. I sure hope you will beat me to it ;-)
> I'm not directly involved in the Incremental FS project, but, as far as I
> remember, only for the first PoC was actually developed as a FUSE file
> system. Because of the overhead introduced by the user space round-trips,
> that design was left behind and the whole Incremental FS infrastructure
> switched to becoming a kernel module.
> In general, the FUSE passthrough patch set proposed in this series wouldn't
> be helpful for that use case because, for example, Incremental FS requires
> live (de)compression of data, that can only be performed by the FUSE
> daemon.
>
Ext4 supports inline encryption. Btrfs supports encrypted/compressed extents.
No reason for FUSE not to support the same.
Is it trivial? No.
Is it an excuse for not using FUSE and writing a new userspace fs. Not
in my option.
> The specific use case I'm trying to improve with this FUSE passthrough
> series is instead related to the scoped storage feature that we introduced
> in Android 11, that is based on FUSE, and affects those folders that are
> shared among all the Apps (e.g., DCIM, Downloads, etc). More details here:
>
sdcard fs has had a lot of reincarnations :-)
I for one am happy with the return to FUSE.
Instead of saying "FUSE is too slow" and implementing a kernel sdcardfs,
improve FUSE to be faster for everyone - that's the way to go ;-)
> https://developer.android.com/about/versions/11/privacy/storage
>
> With FUSE we now have a flexible way to specify more fine-grained
> permissions (e.g., specify if an App is allowed to access files depenind on
> their type), create private App folders, maintain legacy paths for old
> Apps, manipulate pieces of files at run-time, etc. Forgive me if I forgot
> something. :)
> These extra operations may slower the file system access comprared to a
> native, direct access, but if:
> - the file being accessed requires special treatment before being passed to
> the requesting App, then further tests will be performed at every
> read/write operation (with some optimizations). This overhead is of
> course annoying, but is something we are happy to pay because is
> beneficial to the user (i.e., improves privacy and security).
> - Instead, if at open time a file is recognized as safe to access and does
> not require any further enforcement from the FUSE daemon, there's no need
> to pay for future read/write operations overheads, that wouldn't do
> anything more than just copying data (possibly with the help of
> splicing). In this case the FUSE passthrough feature proposed in this
> series can be enabled to reduce this overhead.
>
> Moreover, some Apps use big files that contain all their resources, then
> access these files at random offsets, not taking advantage of read-ahead
> cache. The same happens for files containing databases.
> In addition, our specific use case involves a FUSE daemon is probably
> heavier than the average passthrough file system (for example those that
> are in libfuse/examples), so reducing user space round trips thanks to the
> patchset proposed here gives a strong improvement.
>
> Anyway, only showing the improvements in our extreme use case would have
> brought a limited case for upstreaming, and that is why the benchmarks I
> ran (presented in the cover letter), are based on the vanilla
> passthrough_hp daemon managing kind of standard storage workloads, and
> still show evident performance improvements.
> Running and sharing benchmarks on Android is also tricky because at the
> time of first submission the most recent public real device was running a
> 4.14 kernel, so that would have been maybe nice to see, but not that
> interesting... :)
>
> I hope I answered all your questions, please let me know if I missed
> something and feel free to ask for more details!
>
Thanks for sharing the details,
Amir.
On Tue, Sep 22, 2020 at 07:08:48PM +0300, Amir Goldstein wrote:
>
> I am hearing about a lot of these projects.
> I think that FUSE_PASSTHROUGH is a very useful feature.
> I have an intention to explore passthrough to directory fd for
> directory modifications. I sure hope you will beat me to it ;-)
It's awesome that you mentioned this, something similar is already in my
TODO list, suggested by Paul (in CC). I'll start working on this and will
be glad to discuss as soon as this FUSE_PASSTHROUGH extension will
hopefully get accepted.
>
> > I'm not directly involved in the Incremental FS project, but, as far as I
> > remember, only for the first PoC was actually developed as a FUSE file
> > system. Because of the overhead introduced by the user space round-trips,
> > that design was left behind and the whole Incremental FS infrastructure
> > switched to becoming a kernel module.
> > In general, the FUSE passthrough patch set proposed in this series wouldn't
> > be helpful for that use case because, for example, Incremental FS requires
> > live (de)compression of data, that can only be performed by the FUSE
> > daemon.
> >
>
> Ext4 supports inline encryption. Btrfs supports encrypted/compressed extents.
> No reason for FUSE not to support the same.
> Is it trivial? No.
> Is it an excuse for not using FUSE and writing a new userspace fs. Not
> in my option.
I started exploring the FUSE internals only in the last months and had the
feeling this compression feature was a bit out of context or at least very
use-case specific. But I'll start thinking about this.
>
> > The specific use case I'm trying to improve with this FUSE passthrough
> > series is instead related to the scoped storage feature that we introduced
> > in Android 11, that is based on FUSE, and affects those folders that are
> > shared among all the Apps (e.g., DCIM, Downloads, etc). More details here:
> >
>
> sdcard fs has had a lot of reincarnations :-)
>
> I for one am happy with the return to FUSE.
> Instead of saying "FUSE is too slow" and implementing a kernel sdcardfs,
> improve FUSE to be faster for everyone - that's the way to go ;-)
Yes! This is exactly what we are trying to achieve and this patch-set is
the first piece of a bigger picture which, among others, includes the
direct directory access that you mentioned before.
I hope the community appreciates these improvement attempts :)
As you may have noticed, I recently shared the v9 of the patch-set.
Thanks to the feedback I received, what we have now has a completely
different than the initial proposal.
Thanks again everyone for the suggestions!
Alessio