This is the 9th version of the series. Please find the changelog at the
bottom of this cover letter.
Add support for file system passthrough read/write of files when enabled in
userspace through the option FUSE_PASSTHROUGH.
There are file systems based on FUSE that are intended to enforce special
policies or trigger complex decision-making at the file operations
level. Android, for example, uses FUSE to enforce fine-grained access
policies that also depend on the file contents.
Sometimes a file is identified at open or create time as not requiring
additional checks for subsequent reads/writes, so FUSE simply acts as a
passive bridge between the process accessing the FUSE file system and the
lower file system. Splicing and caching help reduce the FUSE overhead, but
there are still read/write operations forwarded to the userspace FUSE
daemon that could be avoided.
This series has been inspired by the original patches from Nikhilesh Reddy,
whose idea and code have been elaborated and improved thanks to the
community support.
When the FUSE_PASSTHROUGH capability is enabled, the FUSE daemon may decide,
while handling the open/create operations, whether the given file can be
accessed in passthrough mode. This means that all further read and write
operations are forwarded by the kernel directly to the lower file system
using the VFS layer rather than to the FUSE daemon. All the requests other
than reads or writes are still handled by the userspace FUSE daemon.
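The capability handshake can be pictured with the following minimal sketch.
It only assumes the FUSE_PASSTHROUGH bit value defined later in this series;
the two helper functions are hypothetical and are not part of libfuse or of
the kernel.

```c
#include <assert.h>
#include <stdint.h>

/* Bit value taken from the include/uapi/linux/fuse.h hunk of this series. */
#define FUSE_PASSTHROUGH (1u << 27)

/*
 * Hypothetical helper: compute the flags a daemon would place in its
 * FUSE_INIT reply. A capability is kept only if the kernel advertised it
 * and the daemon asked for it.
 */
static uint32_t negotiate_init_flags(uint32_t kernel_flags, uint32_t wanted)
{
	return kernel_flags & wanted;
}

/* Hypothetical helper: non-zero when passthrough ended up negotiated. */
static int passthrough_negotiated(uint32_t reply_flags)
{
	return (reply_flags & FUSE_PASSTHROUGH) != 0;
}
```

A daemon running on a kernel without this series would never see the bit in
the INIT request, so the negotiated flags would simply leave passthrough off.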
This allows for improved performance on reads and writes, especially in the
case of reads at random offsets, for which no (readahead) caching mechanism
would help.
Benchmarks show improved performance that is close to native file system
access when doing massive manipulations on a single opened file, especially
in the case of random reads, for which the bandwidth increased by almost 2X,
and sequential writes, for which the improvement is close to 3X.
The creation of this direct connection (passthrough) between FUSE file
objects and file objects in the lower file system happens in a way that
is reminiscent of passing file descriptors via sockets:
- a process requests the opening of a file handled by FUSE, so the kernel
  forwards the request to the FUSE daemon;
- the FUSE daemon opens the target file in the lower file system, getting
  its file descriptor;
- the FUSE daemon also decides according to its internal policies if
  passthrough can be enabled for that file, and, if so, can perform a
  FUSE_DEV_IOC_PASSTHROUGH_OPEN ioctl() on /dev/fuse, passing the file
  descriptor obtained at the previous step and the fuse_req unique
  identifier;
- the kernel translates the file descriptor to the file pointer navigating
  through the opened files of the "current" process and temporarily stores
  it in the associated open/create fuse_req's passthrough_filp;
- when the FUSE daemon is done with the request and it's time for the
  kernel to close it, the kernel checks if the passthrough_filp is
  available and, if so, updates the additional field in the fuse_file
  owned by the process accessing the FUSE file system.
From now on, all the read/write operations performed by that process will
be redirected to the corresponding lower file system file by creating new
VFS requests.
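From the daemon's point of view, the steps above could look like the sketch
below. The structure layout and the ioctl number are copied from the uapi
hunk of this series; enable_passthrough() is a hypothetical helper name, and
a real daemon would include the patched <linux/fuse.h> instead of redefining
these locally.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/ioctl.h>

/* Local copy of the layout proposed by this series (assumption: unchanged). */
struct fuse_passthrough_out {
	uint64_t unique;
	uint32_t fd;
	uint32_t len;	/* reserved, must be 0 */
	void *vec;	/* reserved for future use */
};
#define FUSE_DEV_IOC_PASSTHROUGH_OPEN _IOW(229, 1, struct fuse_passthrough_out)

/*
 * Hypothetical daemon-side helper: link the pending open/create request
 * identified by 'unique' to the already opened lower file 'lower_fd'.
 * Returns 0 on success, or -1 with errno set by ioctl().
 */
static int enable_passthrough(int fuse_dev_fd, uint64_t unique, int lower_fd)
{
	struct fuse_passthrough_out pto = {
		.unique = unique,
		.fd = (uint32_t)lower_fd,
		.len = 0,
		.vec = NULL,
	};

	return ioctl(fuse_dev_fd, FUSE_DEV_IOC_PASSTHROUGH_OPEN, &pto);
}
```

Following the flow described above, the call is expected to happen while the
open/create request is still pending, that is, before the daemon writes its
reply to /dev/fuse.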
Since the read/write operation to the lower file system is executed with
the current process's credentials, it might happen that it does not have
enough privileges to succeed. For this reason, the process temporarily
receives the same credentials as the FUSE daemon, which are reverted as
soon as the read/write operation completes, emulating the behavior of the
request being performed by the FUSE daemon itself. This solution has been
inspired by the way overlayfs handles read/write operations.
Asynchronous IO is supported as well. It is handled by creating separate
AIO requests for the lower file system that are internally tracked by FUSE,
which intercepts and propagates their completion through an internal
ki_complete callback, similar to the current implementation of overlayfs.
The ioctl() has been designed taking the fuse2 implementation as a
reference and trying to converge towards it. For example, the
fuse_passthrough_out data structure has extra fields that will allow for
further extensions of the feature.
Performance
The following measurements were taken with this change [V6] rebased on top
of a vanilla v5.8 Linux kernel, using a custom passthrough_hp FUSE daemon
that enables passthrough for each file that is opened during both “open”
and “create”. Tests were run on an Intel Xeon E5-2678V3, 32GiB of RAM, with
an ext4-formatted SSD as the lower file system, with no special tuning,
e.g., all the involved processes are SCHED_OTHER, ondemand is the frequency
governor with no frequency restrictions, and turbo-boost, as well as
p-state, are active. This is because I noticed that, for such high-level
benchmarks, results consistency was minimally affected by these features.
The source code of the updated libfuse library and passthrough_hp is shared
at the following repository:
https://github.com/balsini/libfuse/tree/fuse-passthrough-stable-v.3.9.4
Two different kinds of benchmarks were done for this change: the first set
of tests evaluates the bandwidth improvements when manipulating a huge
single file, while the second set of tests verifies that no performance
regressions were introduced when handling many small files.
The first benchmarks were done by running FIO (fio-3.21) with:
- bs=4Ki;
- file size: 50Gi;
- ioengine: sync;
- fsync_on_close: true.
The target file has been chosen large enough to prevent it from being
entirely loaded into the page cache.
Results are presented in the following table:
+-----------+--------+-------------+--------+
| Bandwidth |  FUSE  |    FUSE     |  Bind  |
|  (KiB/s)  |        | passthrough | mount  |
+-----------+--------+-------------+--------+
| read      | 468897 |      502085 | 516830 |
+-----------+--------+-------------+--------+
| randread  |  15773 |       26632 |  21386 |
+-----------+--------+-------------+--------+
| write     |  58185 |      141272 | 141671 |
+-----------+--------+-------------+--------+
| randwrite |  59892 |       75236 |  76486 |
+-----------+--------+-------------+--------+
Since this patch has the primary objective of improving bandwidth,
another set of tests has been performed to see how it behaves in a
totally different scenario that involves accessing many small files. For
this purpose, the build time of the Linux kernel has been chosen as a
well-known workload. The kernel has been built with as many processes
as the number of logical CPUs (-j $(nproc)), which, besides being a
reasonable number, is also enough to saturate the processor’s utilization
thanks to the additional FUSE daemon’s threads, making it even harder to
get closer to the native file system performance.
The following table shows the total build times in the different
configurations:
+------------------+--------------+-----------+
|                  | AVG duration | Standard  |
|                  |    (sec)     | deviation |
+------------------+--------------+-----------+
| FUSE             |      144.566 |     0.697 |
+------------------+--------------+-----------+
| FUSE passthrough |      133.820 |     0.341 |
+------------------+--------------+-----------+
| Raw              |      109.423 |     0.724 |
+------------------+--------------+-----------+
Similar performance measurements were performed with the current version of
the patch, the results of which are comparable with what is shown above.
Further testing and performance evaluations are welcome.
Description of the series
Patch 1 introduces the data structures and definitions required both for
the communication with userspace and for the internal kernel use.
It also adds the basic functionalities to establish the bridge between the
FUSE file and the lower file system file through an ioctl().
Patch 2 creates a reference to the FUSE daemon credentials in the FUSE
connection.
Patch 3 enables the synchronous read and write operations for those FUSE
files for which the passthrough functionality is enabled.
Patch 4 extends the read and write operations to also support asynchronous
IO.
Changes in v9:
* Switched to using VFS instead of direct lower FS file ops
[Attempt to address a request from Jens Axboe, Jann Horn, Amir Goldstein]
* Removal of useless included aio.h header
[Proposed by Jens Axboe]
Changes in v8:
* aio requests now use kmalloc/kfree, instead of kmem_cache
* Switched to call_{read,write}_iter in AIO
* Revisited attributes copy
* Passthrough can only be enabled via ioctl(), fixing the security issue
  spotted by Jann
* Use an extensible fuse_passthrough_out data structure
[Attempt to address a request from Nikolaus Rath, Amir Goldstein and
Miklos Szeredi]
Changes in v7:
* Full handling of aio requests as done in overlayfs (update commit
  message).
* s/fget_raw/fget.
* Open fails in case of passthrough errors, emitting warning messages.
[Proposed by Jann Horn]
* Create new local kiocb, getting rid of the previously proposed ki_filp
  swapping.
[Proposed by Jann Horn and Jens Axboe]
* Code polishing.
Changes in v6:
* Port to kernel v5.8:
  * fuse_file_{read,write}_iter() changed since the v5 of this patch was
    proposed.
* Simplify fuse_simple_request().
* Merge fuse_passthrough.h into fuse_i.h
* Refactor of passthrough.c:
  * Remove BUG_ON()s.
  * Simplified error checking and request arguments indexing.
  * Use call_{read,write}_iter() utility functions.
  * Remove get_file() and fput() during read/write: handle the extra FUSE
    references to the lower file object when the fuse_file is
    created/deleted.
[Proposed by Jann Horn]
Changes in v5:
* Fix the check when setting the passthrough file.
[Found when testing by Mike Shal]
Changes in v3 and v4:
* Use the fs_stack_depth to prevent further stacking and a minor fix.
[Proposed by Jann Horn]
Changes in v2:
* Changed the feature name to passthrough from stacked_io.
[Proposed by Linus Torvalds]
Alessio Balsini (4):
fuse: Definitions and ioctl() for passthrough
fuse: Trace daemon creds
fuse: Introduce synchronous read and write for passthrough
fuse: Handle asynchronous read and write in passthrough
fs/fuse/Makefile | 1 +
fs/fuse/dev.c | 57 ++++++++++-
fs/fuse/dir.c | 2 +
fs/fuse/file.c | 25 +++--
fs/fuse/fuse_i.h | 19 ++++
fs/fuse/inode.c | 17 +++-
fs/fuse/passthrough.c | 208 ++++++++++++++++++++++++++++++++++++++
include/uapi/linux/fuse.h | 12 ++-
8 files changed, 328 insertions(+), 13 deletions(-)
create mode 100644 fs/fuse/passthrough.c
--
2.28.0.681.g6f77f65b4e-goog
Add a reference to the FUSE daemon credentials, so that they can be used to
temporarily raise the user credentials when accessing lower file system
files in passthrough.
When using FUSE passthrough, read/write operations are directly forwarded
to the lower file system file, but there is no guarantee that the process
that is triggering the request has the right permissions to access the
lower file system.
By default, in the non-passthrough use case, it is the daemon that handles
the read/write operations, which can be performed on the lower file system
with the daemon's privileges.
When passthrough is active, instead, the read/write operation is directly
applied to the lower file system, so to keep the same behavior as before,
the calling process temporarily receives the same credentials as the
daemon, which are reverted as soon as the operation completes.
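The save/override/restore pattern described here can be illustrated with a
small userspace model. Nothing below is kernel code: the model_* names are
hypothetical stand-ins for struct cred, override_creds() and revert_creds(),
and only a uid is modelled.

```c
#include <assert.h>

/* Toy stand-in for the kernel's struct cred; only a uid is modelled. */
struct model_cred {
	unsigned int uid;
};

/* Toy stand-in for the credentials of the running task ("current"). */
static const struct model_cred *model_current_cred;

/* Install new credentials and return the previous ones. */
static const struct model_cred *
model_override_creds(const struct model_cred *new_cred)
{
	const struct model_cred *old = model_current_cred;

	model_current_cred = new_cred;
	return old;
}

/* Put back the credentials saved by model_override_creds(). */
static void model_revert_creds(const struct model_cred *old)
{
	model_current_cred = old;
}

/*
 * Model of a passthrough read/write: the "lower file system" observes the
 * daemon's uid, and the caller's credentials are restored afterwards.
 * Returns the uid that the lower file system would have seen.
 */
static unsigned int model_passthrough_io(const struct model_cred *daemon_cred)
{
	const struct model_cred *old = model_override_creds(daemon_cred);
	unsigned int uid_seen = model_current_cred->uid;

	model_revert_creds(old);
	return uid_seen;
}
```

The point of the model is that the elevated credentials never outlive the
single IO operation they were installed for.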
Signed-off-by: Alessio Balsini <[email protected]>
---
fs/fuse/fuse_i.h | 3 +++
fs/fuse/inode.c | 8 ++++++++
2 files changed, 11 insertions(+)
diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
index 6c5166447905..67bf5919f8d6 100644
--- a/fs/fuse/fuse_i.h
+++ b/fs/fuse/fuse_i.h
@@ -524,6 +524,9 @@ struct fuse_conn {
/** The group id for this mount */
kgid_t group_id;
+ /** Creds of process which created this mount point */
+ const struct cred *creator_cred;
+
/** The pid namespace for this mount */
struct pid_namespace *pid_ns;
diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
index eb223130a917..d22407bfa959 100644
--- a/fs/fuse/inode.c
+++ b/fs/fuse/inode.c
@@ -654,6 +654,8 @@ void fuse_conn_put(struct fuse_conn *fc)
fiq->ops->release(fiq);
put_pid_ns(fc->pid_ns);
put_user_ns(fc->user_ns);
+ if (fc->creator_cred)
+ put_cred(fc->creator_cred);
fc->release(fc);
}
}
@@ -1203,6 +1205,12 @@ int fuse_fill_super_common(struct super_block *sb, struct fuse_fs_context *ctx)
fc->allow_other = ctx->allow_other;
fc->user_id = ctx->user_id;
fc->group_id = ctx->group_id;
+ fc->creator_cred = prepare_creds();
+ if (!fc->creator_cred) {
+ err = -ENOMEM;
+ goto err_dev_free;
+ }
+
fc->max_read = max_t(unsigned, 4096, ctx->max_read);
fc->destroy = ctx->destroy;
fc->no_control = ctx->no_control;
--
2.28.0.681.g6f77f65b4e-goog
All the read and write operations performed on fuse_files which have the
passthrough feature enabled are forwarded to the associated lower file
system file via VFS.
Sending the request directly to the lower file system avoids the userspace
round-trip that, because of possible context switches and additional
operations, might reduce the overall performance, especially in those cases
where caching doesn't help, for example in reads at random offsets.
Verifying if a fuse_file has a lower file system file associated for
passthrough can be done by checking the validity of its passthrough_filp
pointer. This pointer is not NULL only if passthrough has been successfully
enabled via the appropriate ioctl().
When a read/write operation is requested for a FUSE file with passthrough
enabled, a new equivalent VFS request is generated, which instead targets
the lower file system file.
The VFS layer performs additional checks that allow for safer operations,
but may cause the operation to fail if the process accessing the FUSE file
system does not have access to the lower file system. This often happens in
passthrough file systems, where the FUSE daemon is responsible for the
enforcement of the lower file system access policies. In order to preserve
this behavior, the current process accessing the FUSE file with passthrough
enabled receives the privileges of the FUSE daemon while performing the
read/write operation, emulating a behavior used in overlayfs. These
privileges are reverted as soon as the IO operation completes. This
feature does not provide any higher security privileges to those processes
accessing the FUSE file system with passthrough enabled. This is because
the FUSE daemon is still responsible for deciding whether to enable the
passthrough feature at file open time, and should enable the feature only
after appropriate access policy checks.
This change only implements synchronous requests in passthrough, returning
an error in the case of asynchronous operations, yet covering the majority
of the use cases.
Signed-off-by: Alessio Balsini <[email protected]>
---
fs/fuse/file.c | 8 +++-
fs/fuse/fuse_i.h | 2 +
fs/fuse/passthrough.c | 93 +++++++++++++++++++++++++++++++++++++++++++
3 files changed, 101 insertions(+), 2 deletions(-)
diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index 6c0ec742ce74..c3289ff0cd33 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -1552,7 +1552,9 @@ static ssize_t fuse_file_read_iter(struct kiocb *iocb, struct iov_iter *to)
if (is_bad_inode(file_inode(file)))
return -EIO;
- if (!(ff->open_flags & FOPEN_DIRECT_IO))
+ if (ff->passthrough_filp)
+ return fuse_passthrough_read_iter(iocb, to);
+ else if (!(ff->open_flags & FOPEN_DIRECT_IO))
return fuse_cache_read_iter(iocb, to);
else
return fuse_direct_read_iter(iocb, to);
@@ -1566,7 +1568,9 @@ static ssize_t fuse_file_write_iter(struct kiocb *iocb, struct iov_iter *from)
if (is_bad_inode(file_inode(file)))
return -EIO;
- if (!(ff->open_flags & FOPEN_DIRECT_IO))
+ if (ff->passthrough_filp)
+ return fuse_passthrough_write_iter(iocb, from);
+ else if (!(ff->open_flags & FOPEN_DIRECT_IO))
return fuse_cache_write_iter(iocb, from);
else
return fuse_direct_write_iter(iocb, from);
diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
index 67bf5919f8d6..b0764ca4c4fd 100644
--- a/fs/fuse/fuse_i.h
+++ b/fs/fuse/fuse_i.h
@@ -1109,5 +1109,7 @@ void fuse_free_conn(struct fuse_conn *fc);
int fuse_passthrough_setup(struct fuse_req *req, unsigned int fd);
void fuse_passthrough_release(struct fuse_file *ff);
+ssize_t fuse_passthrough_read_iter(struct kiocb *iocb, struct iov_iter *to);
+ssize_t fuse_passthrough_write_iter(struct kiocb *iocb, struct iov_iter *from);
#endif /* _FS_FUSE_I_H */
diff --git a/fs/fuse/passthrough.c b/fs/fuse/passthrough.c
index 86ab4eafa7bf..f70c0ef6945b 100644
--- a/fs/fuse/passthrough.c
+++ b/fs/fuse/passthrough.c
@@ -2,6 +2,99 @@
#include "fuse_i.h"
+#include <linux/uio.h>
+
+static void fuse_copyattr(struct file *dst_file, struct file *src_file)
+{
+ struct inode *dst = file_inode(dst_file);
+ struct inode *src = file_inode(src_file);
+
+ i_size_write(dst, i_size_read(src));
+}
+
+static rwf_t iocbflags_to_rwf(int ifl)
+{
+ rwf_t flags = 0;
+
+ if (ifl & IOCB_APPEND)
+ flags |= RWF_APPEND;
+ if (ifl & IOCB_DSYNC)
+ flags |= RWF_DSYNC;
+ if (ifl & IOCB_HIPRI)
+ flags |= RWF_HIPRI;
+ if (ifl & IOCB_NOWAIT)
+ flags |= RWF_NOWAIT;
+ if (ifl & IOCB_SYNC)
+ flags |= RWF_SYNC;
+
+ return flags;
+}
+
+static const struct cred *
+fuse_passthrough_override_creds(const struct file *fuse_filp)
+{
+ struct inode *fuse_inode = file_inode(fuse_filp);
+ struct fuse_conn *fc = fuse_inode->i_sb->s_fs_info;
+
+ return override_creds(fc->creator_cred);
+}
+
+ssize_t fuse_passthrough_read_iter(struct kiocb *iocb_fuse,
+ struct iov_iter *iter)
+{
+ ssize_t ret;
+ const struct cred *old_cred;
+ struct file *fuse_filp = iocb_fuse->ki_filp;
+ struct fuse_file *ff = fuse_filp->private_data;
+ struct file *passthrough_filp = ff->passthrough_filp;
+
+ if (!iov_iter_count(iter))
+ return 0;
+
+ old_cred = fuse_passthrough_override_creds(fuse_filp);
+ if (is_sync_kiocb(iocb_fuse)) {
+ ret = vfs_iter_read(passthrough_filp, iter, &iocb_fuse->ki_pos,
+ iocbflags_to_rwf(iocb_fuse->ki_flags));
+ } else {
+ ret = -EIO;
+ }
+ revert_creds(old_cred);
+
+ return ret;
+}
+
+ssize_t fuse_passthrough_write_iter(struct kiocb *iocb_fuse,
+ struct iov_iter *iter)
+{
+ ssize_t ret;
+ const struct cred *old_cred;
+ struct file *fuse_filp = iocb_fuse->ki_filp;
+ struct fuse_file *ff = fuse_filp->private_data;
+ struct inode *fuse_inode = file_inode(fuse_filp);
+ struct file *passthrough_filp = ff->passthrough_filp;
+
+ if (!iov_iter_count(iter))
+ return 0;
+
+ inode_lock(fuse_inode);
+
+ old_cred = fuse_passthrough_override_creds(fuse_filp);
+ if (is_sync_kiocb(iocb_fuse)) {
+ file_start_write(passthrough_filp);
+ ret = vfs_iter_write(passthrough_filp, iter, &iocb_fuse->ki_pos,
+ iocbflags_to_rwf(iocb_fuse->ki_flags));
+ file_end_write(passthrough_filp);
+ if (ret > 0)
+ fuse_copyattr(fuse_filp, passthrough_filp);
+ } else {
+ ret = -EIO;
+ }
+ revert_creds(old_cred);
+ inode_unlock(fuse_inode);
+
+ return ret;
+}
+
int fuse_passthrough_setup(struct fuse_req *req, unsigned int fd)
{
int ret;
--
2.28.0.681.g6f77f65b4e-goog
Extend the passthrough feature by handling asynchronous IO both for read
and write operations.
When an AIO request is received, if the request targets a FUSE file with
the passthrough functionality enabled, a new identical AIO request is
created. The new request targets the lower file system file, and gets
assigned a special FUSE passthrough AIO completion callback.
When the lower file system AIO request is completed, the FUSE passthrough
AIO completion callback is executed and propagates the completion signal to
the FUSE AIO request by triggering its completion callback as well.
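The completion chaining described above can be modelled in userspace as
follows. This is only a sketch: the model_* types are hypothetical and
simplified stand-ins for struct kiocb and struct fuse_aio_req, and the
completion callback takes a single result argument instead of the kernel's
two.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Toy stand-in for struct kiocb: a position and a completion callback. */
struct model_kiocb {
	long ki_pos;
	void (*ki_complete)(struct model_kiocb *iocb, long res);
};

/* The lower request embeds its own iocb and remembers the FUSE one. */
struct model_aio_req {
	struct model_kiocb iocb;
	struct model_kiocb *iocb_fuse;
};

#define model_container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

/* Result recorded by the sample FUSE-side completion callback. */
static long model_last_result;

/* Sample completion callback of the original FUSE request. */
static void model_fuse_complete(struct model_kiocb *iocb, long res)
{
	(void)iocb;
	model_last_result = res;
}

/* Completion of the lower request: clean up, then notify the FUSE iocb. */
static void model_aio_rw_complete(struct model_kiocb *iocb, long res)
{
	struct model_aio_req *aio_req =
		model_container_of(iocb, struct model_aio_req, iocb);
	struct model_kiocb *iocb_fuse = aio_req->iocb_fuse;

	iocb_fuse->ki_pos = iocb->ki_pos;
	free(aio_req);
	iocb_fuse->ki_complete(iocb_fuse, res);
}

/* Clone the FUSE iocb into a new lower request; NULL if allocation fails. */
static struct model_aio_req *model_aio_submit(struct model_kiocb *iocb_fuse)
{
	struct model_aio_req *aio_req = malloc(sizeof(*aio_req));

	if (!aio_req)
		return NULL;
	aio_req->iocb_fuse = iocb_fuse;
	aio_req->iocb.ki_pos = iocb_fuse->ki_pos;
	aio_req->iocb.ki_complete = model_aio_rw_complete;
	return aio_req;
}
```

Embedding the lower iocb inside the tracking structure is what lets the
completion callback recover the original FUSE request from nothing but the
pointer it is handed.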
Signed-off-by: Alessio Balsini <[email protected]>
---
fs/fuse/passthrough.c | 64 +++++++++++++++++++++++++++++++++++++++++--
1 file changed, 62 insertions(+), 2 deletions(-)
diff --git a/fs/fuse/passthrough.c b/fs/fuse/passthrough.c
index f70c0ef6945b..b7d1a5517ffd 100644
--- a/fs/fuse/passthrough.c
+++ b/fs/fuse/passthrough.c
@@ -4,6 +4,11 @@
#include <linux/uio.h>
+struct fuse_aio_req {
+ struct kiocb iocb;
+ struct kiocb *iocb_fuse;
+};
+
static void fuse_copyattr(struct file *dst_file, struct file *src_file)
{
struct inode *dst = file_inode(dst_file);
@@ -39,6 +44,32 @@ fuse_passthrough_override_creds(const struct file *fuse_filp)
return override_creds(fc->creator_cred);
}
+static void fuse_aio_cleanup_handler(struct fuse_aio_req *aio_req)
+{
+ struct kiocb *iocb = &aio_req->iocb;
+ struct kiocb *iocb_fuse = aio_req->iocb_fuse;
+
+ if (iocb->ki_flags & IOCB_WRITE) {
+ __sb_writers_acquired(file_inode(iocb->ki_filp)->i_sb,
+ SB_FREEZE_WRITE);
+ file_end_write(iocb->ki_filp);
+ fuse_copyattr(iocb_fuse->ki_filp, iocb->ki_filp);
+ }
+
+ iocb_fuse->ki_pos = iocb->ki_pos;
+ kfree(aio_req);
+}
+
+static void fuse_aio_rw_complete(struct kiocb *iocb, long res, long res2)
+{
+ struct fuse_aio_req *aio_req =
+ container_of(iocb, struct fuse_aio_req, iocb);
+ struct kiocb *iocb_fuse = aio_req->iocb_fuse;
+
+ fuse_aio_cleanup_handler(aio_req);
+ iocb_fuse->ki_complete(iocb_fuse, res, res2);
+}
+
ssize_t fuse_passthrough_read_iter(struct kiocb *iocb_fuse,
struct iov_iter *iter)
{
@@ -56,7 +87,18 @@ ssize_t fuse_passthrough_read_iter(struct kiocb *iocb_fuse,
ret = vfs_iter_read(passthrough_filp, iter, &iocb_fuse->ki_pos,
iocbflags_to_rwf(iocb_fuse->ki_flags));
} else {
- ret = -EIO;
+ struct fuse_aio_req *aio_req;
+
+ aio_req = kmalloc(sizeof(struct fuse_aio_req), GFP_KERNEL);
+ if (!aio_req)
+ return -ENOMEM;
+
+ aio_req->iocb_fuse = iocb_fuse;
+ kiocb_clone(&aio_req->iocb, iocb_fuse, passthrough_filp);
+ aio_req->iocb.ki_complete = fuse_aio_rw_complete;
+ ret = call_read_iter(passthrough_filp, &aio_req->iocb, iter);
+ if (ret != -EIOCBQUEUED)
+ fuse_aio_cleanup_handler(aio_req);
}
revert_creds(old_cred);
@@ -72,6 +114,7 @@ ssize_t fuse_passthrough_write_iter(struct kiocb *iocb_fuse,
struct fuse_file *ff = fuse_filp->private_data;
struct inode *fuse_inode = file_inode(fuse_filp);
struct file *passthrough_filp = ff->passthrough_filp;
+ struct inode *passthrough_inode = file_inode(passthrough_filp);
if (!iov_iter_count(iter))
return 0;
@@ -87,8 +130,25 @@ ssize_t fuse_passthrough_write_iter(struct kiocb *iocb_fuse,
if (ret > 0)
fuse_copyattr(fuse_filp, passthrough_filp);
} else {
- ret = -EIO;
+ struct fuse_aio_req *aio_req;
+
+ aio_req = kmalloc(sizeof(struct fuse_aio_req), GFP_KERNEL);
+ if (!aio_req) {
+ ret = -ENOMEM;
+ goto out;
+ }
+
+ file_start_write(passthrough_filp);
+ __sb_writers_release(passthrough_inode->i_sb, SB_FREEZE_WRITE);
+
+ aio_req->iocb_fuse = iocb_fuse;
+ kiocb_clone(&aio_req->iocb, iocb_fuse, passthrough_filp);
+ aio_req->iocb.ki_complete = fuse_aio_rw_complete;
+ ret = call_write_iter(passthrough_filp, &aio_req->iocb, iter);
+ if (ret != -EIOCBQUEUED)
+ fuse_aio_cleanup_handler(aio_req);
}
+out:
revert_creds(old_cred);
inode_unlock(fuse_inode);
--
2.28.0.681.g6f77f65b4e-goog
Introduce the new FUSE passthrough ioctl(), which allows userspace to
specify a direct connection between a FUSE file and a lower file system
file.
Such ioctl() requires userspace to specify:
- the file descriptor of one of its opened files,
- the unique identifier of the FUSE request associated with a pending
open/create operation,
both encapsulated into a fuse_passthrough_out data structure.
The ioctl() will search for the pending FUSE request matching the unique
identifier, and update the passthrough file pointer of the request with the
file pointer referenced by the passed file descriptor.
When that pending FUSE request is handled, the passthrough file pointer
is copied to the fuse_file data structure, so that the link between FUSE
and lower file system is consolidated.
In order for the passthrough mode to be successfully activated, the lower
file system file must implement both read_iter and write_iter file
operations. This extra check prevents special pseudofiles from being
targets of this feature.
An additional enforced limitation is that when FUSE passthrough is enabled,
no further file system stacking is allowed.
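The two checks can be summarised with the userspace model below. It is a
sketch under stated assumptions: the model_* names are hypothetical, the
lower file is reduced to three fields, and MODEL_MAX_STACK_DEPTH is assumed
to match the kernel's FILESYSTEM_MAX_STACK_DEPTH value of 2.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Assumed to mirror the kernel's FILESYSTEM_MAX_STACK_DEPTH. */
#define MODEL_MAX_STACK_DEPTH 2

/* Toy description of a candidate lower file. */
struct model_lower_file {
	bool has_read_iter;	/* f_op->read_iter is implemented */
	bool has_write_iter;	/* f_op->write_iter is implemented */
	int sb_stack_depth;	/* s_stack_depth of its super block */
};

/*
 * Model of the validation done when setting up passthrough:
 * returns 0 if the file is acceptable, -EINVAL if a file operation is
 * missing, -EEXIST if stacking on top of it would exceed the limit.
 */
static int model_passthrough_check(const struct model_lower_file *lower)
{
	if (!lower->has_read_iter || !lower->has_write_iter)
		return -EINVAL;
	if (lower->sb_stack_depth + 1 > MODEL_MAX_STACK_DEPTH)
		return -EEXIST;
	return 0;
}
```

Because a FUSE connection with passthrough enabled sets its own stack depth
to the maximum, a file living on such a file system would fall into the
-EEXIST case when offered as a lower file.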
Signed-off-by: Alessio Balsini <[email protected]>
---
fs/fuse/Makefile | 1 +
fs/fuse/dev.c | 57 +++++++++++++++++++++++++++++++++++----
fs/fuse/dir.c | 2 ++
fs/fuse/file.c | 17 +++++++++---
fs/fuse/fuse_i.h | 14 ++++++++++
fs/fuse/inode.c | 9 ++++++-
fs/fuse/passthrough.c | 55 +++++++++++++++++++++++++++++++++++++
include/uapi/linux/fuse.h | 12 ++++++++-
8 files changed, 156 insertions(+), 11 deletions(-)
create mode 100644 fs/fuse/passthrough.c
diff --git a/fs/fuse/Makefile b/fs/fuse/Makefile
index 3e8cebfb59b7..6971454a2bdf 100644
--- a/fs/fuse/Makefile
+++ b/fs/fuse/Makefile
@@ -8,4 +8,5 @@ obj-$(CONFIG_CUSE) += cuse.o
obj-$(CONFIG_VIRTIO_FS) += virtiofs.o
fuse-objs := dev.o dir.o file.o inode.o control.o xattr.o acl.o readdir.o
+fuse-objs += passthrough.o
virtiofs-y += virtio_fs.o
diff --git a/fs/fuse/dev.c b/fs/fuse/dev.c
index 02b3c36b3676..c31e6c30fabf 100644
--- a/fs/fuse/dev.c
+++ b/fs/fuse/dev.c
@@ -2219,21 +2219,53 @@ static int fuse_device_clone(struct fuse_conn *fc, struct file *new)
return 0;
}
+static int fuse_passthrough_open(struct fuse_dev *fud,
+ struct fuse_passthrough_out *pto)
+{
+ int ret;
+ struct fuse_req *req;
+ struct fuse_pqueue *fpq = &fud->pq;
+ struct fuse_conn *fc = fud->fc;
+
+ if (!fc->passthrough)
+ return -EPERM;
+
+ /* This field is reserved for future use */
+ if (pto->len != 0)
+ return -EINVAL;
+
+ spin_lock(&fpq->lock);
+ req = request_find(fpq, pto->unique & ~FUSE_INT_REQ_BIT);
+ if (!req) {
+ spin_unlock(&fpq->lock);
+ return -ENOENT;
+ }
+ __fuse_get_request(req);
+ spin_unlock(&fpq->lock);
+
+ ret = fuse_passthrough_setup(req, pto->fd);
+
+ fuse_put_request(fc, req);
+ return ret;
+}
+
static long fuse_dev_ioctl(struct file *file, unsigned int cmd,
unsigned long arg)
{
- int err = -ENOTTY;
-
- if (cmd == FUSE_DEV_IOC_CLONE) {
- int oldfd;
+ int err;
+ int oldfd;
+ struct fuse_dev *fud;
+ struct fuse_passthrough_out pto;
+ switch (cmd) {
+ case FUSE_DEV_IOC_CLONE:
err = -EFAULT;
if (!get_user(oldfd, (__u32 __user *) arg)) {
struct file *old = fget(oldfd);
err = -EINVAL;
if (old) {
- struct fuse_dev *fud = NULL;
+ fud = NULL;
/*
* Check against file->f_op because CUSE
@@ -2251,6 +2283,21 @@ static long fuse_dev_ioctl(struct file *file, unsigned int cmd,
fput(old);
}
}
+ break;
+ case FUSE_DEV_IOC_PASSTHROUGH_OPEN:
+ err = -EFAULT;
+ if (!copy_from_user(&pto,
+ (struct fuse_passthrough_out __user *)arg,
+ sizeof(pto))) {
+ err = -EINVAL;
+ fud = fuse_get_dev(file);
+ if (fud)
+ err = fuse_passthrough_open(fud, &pto);
+ }
+ break;
+ default:
+ err = -ENOTTY;
+ break;
}
return err;
}
diff --git a/fs/fuse/dir.c b/fs/fuse/dir.c
index 26f028bc760b..531de0c5c9e8 100644
--- a/fs/fuse/dir.c
+++ b/fs/fuse/dir.c
@@ -477,6 +477,7 @@ static int fuse_create_open(struct inode *dir, struct dentry *entry,
args.out_args[0].value = &outentry;
args.out_args[1].size = sizeof(outopen);
args.out_args[1].value = &outopen;
+ args.passthrough_filp = NULL;
err = fuse_simple_request(fc, &args);
if (err)
goto out_free_ff;
@@ -489,6 +490,7 @@ static int fuse_create_open(struct inode *dir, struct dentry *entry,
ff->fh = outopen.fh;
ff->nodeid = outentry.nodeid;
ff->open_flags = outopen.open_flags;
+ ff->passthrough_filp = args.passthrough_filp;
inode = fuse_iget(dir->i_sb, outentry.nodeid, outentry.generation,
&outentry.attr, entry_attr_timeout(&outentry), 0);
if (!inode) {
diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index 83d917f7e542..6c0ec742ce74 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -33,10 +33,12 @@ static struct page **fuse_pages_alloc(unsigned int npages, gfp_t flags,
}
static int fuse_send_open(struct fuse_conn *fc, u64 nodeid, struct file *file,
- int opcode, struct fuse_open_out *outargp)
+ int opcode, struct fuse_open_out *outargp,
+ struct file **passthrough_filp)
{
struct fuse_open_in inarg;
FUSE_ARGS(args);
+ int ret;
memset(&inarg, 0, sizeof(inarg));
inarg.flags = file->f_flags & ~(O_CREAT | O_EXCL | O_NOCTTY);
@@ -51,7 +53,10 @@ static int fuse_send_open(struct fuse_conn *fc, u64 nodeid, struct file *file,
args.out_args[0].size = sizeof(*outargp);
args.out_args[0].value = outargp;
- return fuse_simple_request(fc, &args);
+ ret = fuse_simple_request(fc, &args);
+ *passthrough_filp = args.passthrough_filp;
+
+ return ret;
}
struct fuse_release_args {
@@ -144,14 +149,16 @@ int fuse_do_open(struct fuse_conn *fc, u64 nodeid, struct file *file,
/* Default for no-open */
ff->open_flags = FOPEN_KEEP_CACHE | (isdir ? FOPEN_CACHE_DIR : 0);
if (isdir ? !fc->no_opendir : !fc->no_open) {
+ struct file *passthrough_filp;
struct fuse_open_out outarg;
int err;
- err = fuse_send_open(fc, nodeid, file, opcode, &outarg);
+ err = fuse_send_open(fc, nodeid, file, opcode, &outarg,
+ &passthrough_filp);
if (!err) {
ff->fh = outarg.fh;
ff->open_flags = outarg.open_flags;
-
+ ff->passthrough_filp = passthrough_filp;
} else if (err != -ENOSYS) {
fuse_file_free(ff);
return err;
@@ -281,6 +288,8 @@ void fuse_release_common(struct file *file, bool isdir)
struct fuse_release_args *ra = ff->release_args;
int opcode = isdir ? FUSE_RELEASEDIR : FUSE_RELEASE;
+ fuse_passthrough_release(ff);
+
fuse_prepare_release(fi, ff, file->f_flags, opcode);
if (ff->flock) {
diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
index 740a8a7d7ae6..6c5166447905 100644
--- a/fs/fuse/fuse_i.h
+++ b/fs/fuse/fuse_i.h
@@ -208,6 +208,12 @@ struct fuse_file {
} readdir;
+ /**
+ * Reference to lower filesystem file for read/write operations
+ * handled in pass-through mode
+ */
+ struct file *passthrough_filp;
+
/** RB node to be linked on fuse_conn->polled_files */
struct rb_node polled_node;
@@ -250,6 +256,8 @@ struct fuse_args {
bool page_zeroing:1;
bool page_replace:1;
bool may_block:1;
+ /** Lower filesystem file pointer used in pass-through mode */
+ struct file *passthrough_filp;
struct fuse_in_arg in_args[3];
struct fuse_arg out_args[2];
void (*end)(struct fuse_conn *fc, struct fuse_args *args, int error);
@@ -720,6 +728,9 @@ struct fuse_conn {
/* Do not show mount options */
unsigned int no_mount_options:1;
+ /** Pass-through mode for read/write IO */
+ unsigned int passthrough:1;
+
/** The number of requests waiting for completion */
atomic_t num_waiting;
@@ -1093,4 +1104,7 @@ unsigned int fuse_len_args(unsigned int numargs, struct fuse_arg *args);
u64 fuse_get_unique(struct fuse_iqueue *fiq);
void fuse_free_conn(struct fuse_conn *fc);
+int fuse_passthrough_setup(struct fuse_req *req, unsigned int fd);
+void fuse_passthrough_release(struct fuse_file *ff);
+
#endif /* _FS_FUSE_I_H */
diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
index bba747520e9b..eb223130a917 100644
--- a/fs/fuse/inode.c
+++ b/fs/fuse/inode.c
@@ -965,6 +965,12 @@ static void process_init_reply(struct fuse_conn *fc, struct fuse_args *args,
min_t(unsigned int, FUSE_MAX_MAX_PAGES,
max_t(unsigned int, arg->max_pages, 1));
}
+ if (arg->flags & FUSE_PASSTHROUGH) {
+ fc->passthrough = 1;
+ /* Prevent further stacking */
+ fc->sb->s_stack_depth =
+ FILESYSTEM_MAX_STACK_DEPTH;
+ }
} else {
ra_pages = fc->max_read / PAGE_SIZE;
fc->no_lock = 1;
@@ -1002,7 +1008,8 @@ void fuse_send_init(struct fuse_conn *fc)
FUSE_WRITEBACK_CACHE | FUSE_NO_OPEN_SUPPORT |
FUSE_PARALLEL_DIROPS | FUSE_HANDLE_KILLPRIV | FUSE_POSIX_ACL |
FUSE_ABORT_ERROR | FUSE_MAX_PAGES | FUSE_CACHE_SYMLINKS |
- FUSE_NO_OPENDIR_SUPPORT | FUSE_EXPLICIT_INVAL_DATA;
+ FUSE_NO_OPENDIR_SUPPORT | FUSE_EXPLICIT_INVAL_DATA |
+ FUSE_PASSTHROUGH;
ia->args.opcode = FUSE_INIT;
ia->args.in_numargs = 1;
ia->args.in_args[0].size = sizeof(ia->in);
diff --git a/fs/fuse/passthrough.c b/fs/fuse/passthrough.c
new file mode 100644
index 000000000000..86ab4eafa7bf
--- /dev/null
+++ b/fs/fuse/passthrough.c
@@ -0,0 +1,55 @@
+// SPDX-License-Identifier: GPL-2.0
+
+#include "fuse_i.h"
+
+int fuse_passthrough_setup(struct fuse_req *req, unsigned int fd)
+{
+ int ret;
+ int fs_stack_depth;
+ struct file *passthrough_filp;
+ struct inode *passthrough_inode;
+ struct super_block *passthrough_sb;
+
+ /* Passthrough mode can only be enabled at file open/create time */
+ if (req->in.h.opcode != FUSE_OPEN && req->in.h.opcode != FUSE_CREATE) {
+ pr_err("FUSE: invalid OPCODE for request.\n");
+ return -EINVAL;
+ }
+
+ passthrough_filp = fget(fd);
+ if (!passthrough_filp) {
+ pr_err("FUSE: invalid file descriptor for passthrough.\n");
+ return -EINVAL;
+ }
+
+ ret = -EINVAL;
+ if (!passthrough_filp->f_op->read_iter ||
+ !passthrough_filp->f_op->write_iter) {
+ pr_err("FUSE: passthrough file misses file operations.\n");
+ goto out;
+ }
+
+ passthrough_inode = file_inode(passthrough_filp);
+ passthrough_sb = passthrough_inode->i_sb;
+ fs_stack_depth = passthrough_sb->s_stack_depth + 1;
+ ret = -EEXIST;
+ if (fs_stack_depth > FILESYSTEM_MAX_STACK_DEPTH) {
+ pr_err("FUSE: maximum fs stacking depth exceeded for passthrough\n");
+ goto out;
+ }
+
+ req->args->passthrough_filp = passthrough_filp;
+ return 0;
+out:
+ fput(passthrough_filp);
+ return ret;
+}
+
+void fuse_passthrough_release(struct fuse_file *ff)
+{
+ if (!ff->passthrough_filp)
+ return;
+
+ fput(ff->passthrough_filp);
+ ff->passthrough_filp = NULL;
+}
diff --git a/include/uapi/linux/fuse.h b/include/uapi/linux/fuse.h
index 373cada89815..0cd9fd83374a 100644
--- a/include/uapi/linux/fuse.h
+++ b/include/uapi/linux/fuse.h
@@ -342,6 +342,7 @@ struct fuse_file_lock {
#define FUSE_NO_OPENDIR_SUPPORT (1 << 24)
#define FUSE_EXPLICIT_INVAL_DATA (1 << 25)
#define FUSE_MAP_ALIGNMENT (1 << 26)
+#define FUSE_PASSTHROUGH (1 << 27)
/**
* CUSE INIT request/reply flags
@@ -794,6 +795,14 @@ struct fuse_in_header {
uint32_t padding;
};
+struct fuse_passthrough_out {
+ uint64_t unique;
+ uint32_t fd;
+ /* For future implementation */
+ uint32_t len;
+ void *vec;
+};
+
struct fuse_out_header {
uint32_t len;
int32_t error;
@@ -869,7 +878,8 @@ struct fuse_notify_retrieve_in {
};
/* Device ioctls: */
-#define FUSE_DEV_IOC_CLONE _IOR(229, 0, uint32_t)
+#define FUSE_DEV_IOC_CLONE _IOR(229, 0, uint32_t)
+#define FUSE_DEV_IOC_PASSTHROUGH_OPEN _IOW(229, 1, struct fuse_passthrough_out)
struct fuse_lseek_in {
uint64_t fh;
--
2.28.0.681.g6f77f65b4e-goog
Hi,
I noticed the following fixup suggested by Amir slipped from this
submission.
Thanks,
Alessio
---8<---
diff --git a/fs/fuse/passthrough.c b/fs/fuse/passthrough.c
index b7d1a5517ffd..eba26196be92 100644
--- a/fs/fuse/passthrough.c
+++ b/fs/fuse/passthrough.c
@@ -185,7 +185,7 @@ int fuse_passthrough_setup(struct fuse_req *req, unsigned int fd)
passthrough_inode = file_inode(passthrough_filp);
passthrough_sb = passthrough_inode->i_sb;
fs_stack_depth = passthrough_sb->s_stack_depth + 1;
- ret = -EEXIST;
+ ret = -EINVAL;
if (fs_stack_depth > FILESYSTEM_MAX_STACK_DEPTH) {
pr_err("FUSE: maximum fs stacking depth exceeded for passthrough\n");
goto out;
On Thu, Sep 24, 2020 at 3:13 PM Alessio Balsini <[email protected]> wrote:
> The first benchmarks were done by running FIO (fio-3.21) with:
> - bs=4Ki;
> - file size: 50Gi;
> - ioengine: sync;
> - fsync_on_close: true.
> The target file has been chosen large enough to prevent it from being
> entirely loaded into the page cache.
> Results are presented in the following table:
>
> +-----------+--------+-------------+--------+
> | Bandwidth |  FUSE  |    FUSE     |  Bind  |
> |  (KiB/s)  |        | passthrough | mount  |
> +-----------+--------+-------------+--------+
> | read      | 468897 |   502085    | 516830 |
> +-----------+--------+-------------+--------+
> | randread  |  15773 |    26632    |  21386 |
Have you looked into why passthrough is faster than native?
Thanks,
Miklos
On Thu, Sep 24, 2020 at 3:13 PM Alessio Balsini <[email protected]> wrote:
>
> Introduce the new FUSE passthrough ioctl(), which allows userspace to
> specify a direct connection between a FUSE file and a lower file system
> file.
> Such ioctl() requires userspace to specify:
> - the file descriptor of one of its opened files,
> - the unique identifier of the FUSE request associated with a pending
> open/create operation,
> both encapsulated into a fuse_passthrough_out data structure.
> The ioctl() will search for the pending FUSE request matching the unique
> identifier, and update the passthrough file pointer of the request with the
> file pointer referenced by the passed file descriptor.
> When that pending FUSE request is handled, the passthrough file pointer
> is copied to the fuse_file data structure, so that the link between FUSE
> and lower file system is consolidated.
How about returning an ID from the ioctl (like the fuse2 prototype)
and returning that in fuse_open_out.passthrough_fh?
Seems a more straightforward interface to me.
Thanks,
Miklos
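For reference, the interface as posted in this version could be driven from a
daemon roughly as sketched below. The struct and the ioctl number are local
copies of the definitions proposed in this series, not an installed kernel
header, and passthrough_enable() is a hypothetical helper name. The return
convention of the ioctl() is assumed here to be the usual "negative on
failure".

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <string.h>
#include <sys/ioctl.h>

/*
 * Local copies of the definitions proposed by this series; they are not
 * taken from an installed kernel header.
 */
struct fuse_passthrough_out {
	uint64_t unique;
	uint32_t fd;
	/* For future implementation */
	uint32_t len;
	void *vec;
};
#define FUSE_DEV_IOC_PASSTHROUGH_OPEN _IOW(229, 1, struct fuse_passthrough_out)

/*
 * Hypothetical helper: ask the kernel to link the pending open/create
 * request identified by `unique` to the already opened `lower_fd`.
 * Returns 0 on success, -1 on failure with errno set.
 */
static int passthrough_enable(int fuse_dev_fd, uint64_t unique, int lower_fd)
{
	struct fuse_passthrough_out out;

	if (lower_fd < 0) {
		errno = EBADF;
		return -1;
	}

	memset(&out, 0, sizeof(out));
	out.unique = unique;
	out.fd = (uint32_t)lower_fd;

	/* Assumption: any negative return value means the request failed. */
	if (ioctl(fuse_dev_fd, FUSE_DEV_IOC_PASSTHROUGH_OPEN, &out) < 0)
		return -1;
	return 0;
}
```

The daemon would call this while handling FUSE_OPEN or FUSE_CREATE, before
writing the reply for the request identified by `unique`.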
On Thu, Sep 24, 2020 at 3:13 PM Alessio Balsini <[email protected]> wrote:
>
> Add a reference to the FUSE daemon credentials, so that they can be used to
> temporarily raise the user credentials when accessing lower file system
> files in passthrough.
Hmm, I think it would be better to store the creds of the ioctl()
caller together with the open file. The mounter may deliberately
have different privileges from the process doing the actual I/O.
Thanks,
Miklos
On Thu, Sep 24, 2020 at 3:13 PM Alessio Balsini <[email protected]> wrote:
>
> All the read and write operations performed on fuse_files which have the
> passthrough feature enabled are forwarded to the associated lower file
> system file via VFS.
>
> Sending the request directly to the lower file system avoids the userspace
> round-trip that, because of possible context switches and additional
> operations, might reduce the overall performance, especially in those cases
> where caching doesn't help, for example in reads at random offsets.
>
> Verifying if a fuse_file has a lower file system file associated for
> passthrough can be done by checking the validity of its passthrough_filp
> pointer. This pointer is not NULL only if passthrough has been successfully
> enabled via the appropriate ioctl().
> When a read/write operation is requested for a FUSE file with passthrough
> enabled, a new equivalent VFS request is generated, which instead targets
> the lower file system file.
> The VFS layer performs additional checks that allow for safer operations,
> but may cause the operation to fail if the process accessing the FUSE file
> system does not have access to the lower file system. This often happens in
> passthrough file systems, where the FUSE daemon is responsible for the
> enforcement of the lower file system access policies. In order to preserve
> this behavior, the current process accessing the FUSE file with passthrough
> enabled receives the privileges of the FUSE daemon while performing the
> read/write operation, emulating a behavior used in overlayfs. These
> privileges will be reverted as soon as the IO operation completes. This
> feature does not provide any higher security privileges to those processes
> accessing the FUSE file system with passthrough enabled. This is because the
> FUSE daemon is still responsible for deciding whether to enable passthrough
> at file open time, and should enable the feature only after appropriate
> access policy checks.
>
> This change only implements synchronous requests in passthrough, returning
> an error in the case of asynchronous operations, yet covering the majority
> of the use cases.
>
> Signed-off-by: Alessio Balsini <[email protected]>
> ---
> fs/fuse/file.c | 8 +++-
> fs/fuse/fuse_i.h | 2 +
> fs/fuse/passthrough.c | 93 +++++++++++++++++++++++++++++++++++++++++++
> 3 files changed, 101 insertions(+), 2 deletions(-)
>
> diff --git a/fs/fuse/file.c b/fs/fuse/file.c
> index 6c0ec742ce74..c3289ff0cd33 100644
> --- a/fs/fuse/file.c
> +++ b/fs/fuse/file.c
> @@ -1552,7 +1552,9 @@ static ssize_t fuse_file_read_iter(struct kiocb *iocb, struct iov_iter *to)
> if (is_bad_inode(file_inode(file)))
> return -EIO;
>
> - if (!(ff->open_flags & FOPEN_DIRECT_IO))
> + if (ff->passthrough_filp)
> + return fuse_passthrough_read_iter(iocb, to);
> + else if (!(ff->open_flags & FOPEN_DIRECT_IO))
> return fuse_cache_read_iter(iocb, to);
> else
> return fuse_direct_read_iter(iocb, to);
> @@ -1566,7 +1568,9 @@ static ssize_t fuse_file_write_iter(struct kiocb *iocb, struct iov_iter *from)
> if (is_bad_inode(file_inode(file)))
> return -EIO;
>
> - if (!(ff->open_flags & FOPEN_DIRECT_IO))
> + if (ff->passthrough_filp)
> + return fuse_passthrough_write_iter(iocb, from);
> + else if (!(ff->open_flags & FOPEN_DIRECT_IO))
> return fuse_cache_write_iter(iocb, from);
> else
> return fuse_direct_write_iter(iocb, from);
> diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
> index 67bf5919f8d6..b0764ca4c4fd 100644
> --- a/fs/fuse/fuse_i.h
> +++ b/fs/fuse/fuse_i.h
> @@ -1109,5 +1109,7 @@ void fuse_free_conn(struct fuse_conn *fc);
>
> int fuse_passthrough_setup(struct fuse_req *req, unsigned int fd);
> void fuse_passthrough_release(struct fuse_file *ff);
> +ssize_t fuse_passthrough_read_iter(struct kiocb *iocb, struct iov_iter *to);
> +ssize_t fuse_passthrough_write_iter(struct kiocb *iocb, struct iov_iter *from);
>
> #endif /* _FS_FUSE_I_H */
> diff --git a/fs/fuse/passthrough.c b/fs/fuse/passthrough.c
> index 86ab4eafa7bf..f70c0ef6945b 100644
> --- a/fs/fuse/passthrough.c
> +++ b/fs/fuse/passthrough.c
> @@ -2,6 +2,99 @@
>
> #include "fuse_i.h"
>
> +#include <linux/uio.h>
> +
> +static void fuse_copyattr(struct file *dst_file, struct file *src_file)
> +{
> + struct inode *dst = file_inode(dst_file);
> + struct inode *src = file_inode(src_file);
> +
> + i_size_write(dst, i_size_read(src));
> +}
> +
> +static rwf_t iocbflags_to_rwf(int ifl)
> +{
> + rwf_t flags = 0;
> +
> + if (ifl & IOCB_APPEND)
> + flags |= RWF_APPEND;
> + if (ifl & IOCB_DSYNC)
> + flags |= RWF_DSYNC;
> + if (ifl & IOCB_HIPRI)
> + flags |= RWF_HIPRI;
> + if (ifl & IOCB_NOWAIT)
> + flags |= RWF_NOWAIT;
> + if (ifl & IOCB_SYNC)
> + flags |= RWF_SYNC;
> +
> + return flags;
> +}
> +
> +static const struct cred *
> +fuse_passthrough_override_creds(const struct file *fuse_filp)
> +{
> + struct inode *fuse_inode = file_inode(fuse_filp);
> + struct fuse_conn *fc = fuse_inode->i_sb->s_fs_info;
> +
> + return override_creds(fc->creator_cred);
> +}
> +
> +ssize_t fuse_passthrough_read_iter(struct kiocb *iocb_fuse,
> + struct iov_iter *iter)
> +{
> + ssize_t ret;
> + const struct cred *old_cred;
> + struct file *fuse_filp = iocb_fuse->ki_filp;
> + struct fuse_file *ff = fuse_filp->private_data;
> + struct file *passthrough_filp = ff->passthrough_filp;
> +
> + if (!iov_iter_count(iter))
> + return 0;
> +
> + old_cred = fuse_passthrough_override_creds(fuse_filp);
> + if (is_sync_kiocb(iocb_fuse)) {
> + ret = vfs_iter_read(passthrough_filp, iter, &iocb_fuse->ki_pos,
> + iocbflags_to_rwf(iocb_fuse->ki_flags));
> + } else {
> + ret = -EIO;
> + }
Just do vfs_iter_read() unconditionally, instead of returning EIO.
It will work fine, except it won't be async.
Yeah, I know the next patch is going to fix this, but still, let's not make
this patch return silly errors.
> + revert_creds(old_cred);
> +
> + return ret;
> +}
> +
> +ssize_t fuse_passthrough_write_iter(struct kiocb *iocb_fuse,
> + struct iov_iter *iter)
> +{
> + ssize_t ret;
> + const struct cred *old_cred;
> + struct file *fuse_filp = iocb_fuse->ki_filp;
> + struct fuse_file *ff = fuse_filp->private_data;
> + struct inode *fuse_inode = file_inode(fuse_filp);
> + struct file *passthrough_filp = ff->passthrough_filp;
> +
> + if (!iov_iter_count(iter))
> + return 0;
> +
> + inode_lock(fuse_inode);
> +
> + old_cred = fuse_passthrough_override_creds(fuse_filp);
> + if (is_sync_kiocb(iocb_fuse)) {
> + file_start_write(passthrough_filp);
> + ret = vfs_iter_write(passthrough_filp, iter, &iocb_fuse->ki_pos,
> + iocbflags_to_rwf(iocb_fuse->ki_flags));
> + file_end_write(passthrough_filp);
> + if (ret > 0)
> + fuse_copyattr(fuse_filp, passthrough_filp);
> + } else {
> + ret = -EIO;
> + }
And the same here.
> + revert_creds(old_cred);
> + inode_unlock(fuse_inode);
> +
> + return ret;
> +}
> +
> int fuse_passthrough_setup(struct fuse_req *req, unsigned int fd)
> {
> int ret;
> --
> 2.28.0.681.g6f77f65b4e-goog
>
On Thu, Sep 24, 2020 at 3:13 PM Alessio Balsini <[email protected]> wrote:
>
> Extend the passthrough feature by handling asynchronous IO both for read
> and write operations.
>
> When an AIO request is received, if the request targets a FUSE file with
> the passthrough functionality enabled, a new identical AIO request is
> created. The new request targets the lower file system file, and gets
> assigned a special FUSE passthrough AIO completion callback.
> When the lower file system AIO request is completed, the FUSE passthrough
> AIO completion callback is executed and propagates the completion signal to
> the FUSE AIO request by triggering its completion callback as well.
This ends up with almost identical code in fuse and overlayfs, right?
Maybe it's worth looking into moving these into common helpers.
Thanks,
Miklos
>
> Signed-off-by: Alessio Balsini <[email protected]>
> ---
> fs/fuse/passthrough.c | 64 +++++++++++++++++++++++++++++++++++++++++--
> 1 file changed, 62 insertions(+), 2 deletions(-)
>
> diff --git a/fs/fuse/passthrough.c b/fs/fuse/passthrough.c
> index f70c0ef6945b..b7d1a5517ffd 100644
> --- a/fs/fuse/passthrough.c
> +++ b/fs/fuse/passthrough.c
> @@ -4,6 +4,11 @@
>
> #include <linux/uio.h>
>
> +struct fuse_aio_req {
> + struct kiocb iocb;
> + struct kiocb *iocb_fuse;
> +};
> +
> static void fuse_copyattr(struct file *dst_file, struct file *src_file)
> {
> struct inode *dst = file_inode(dst_file);
> @@ -39,6 +44,32 @@ fuse_passthrough_override_creds(const struct file *fuse_filp)
> return override_creds(fc->creator_cred);
> }
>
> +static void fuse_aio_cleanup_handler(struct fuse_aio_req *aio_req)
> +{
> + struct kiocb *iocb = &aio_req->iocb;
> + struct kiocb *iocb_fuse = aio_req->iocb_fuse;
> +
> + if (iocb->ki_flags & IOCB_WRITE) {
> + __sb_writers_acquired(file_inode(iocb->ki_filp)->i_sb,
> + SB_FREEZE_WRITE);
> + file_end_write(iocb->ki_filp);
> + fuse_copyattr(iocb_fuse->ki_filp, iocb->ki_filp);
> + }
> +
> + iocb_fuse->ki_pos = iocb->ki_pos;
> + kfree(aio_req);
> +}
> +
> +static void fuse_aio_rw_complete(struct kiocb *iocb, long res, long res2)
> +{
> + struct fuse_aio_req *aio_req =
> + container_of(iocb, struct fuse_aio_req, iocb);
> + struct kiocb *iocb_fuse = aio_req->iocb_fuse;
> +
> + fuse_aio_cleanup_handler(aio_req);
> + iocb_fuse->ki_complete(iocb_fuse, res, res2);
> +}
> +
> ssize_t fuse_passthrough_read_iter(struct kiocb *iocb_fuse,
> struct iov_iter *iter)
> {
> @@ -56,7 +87,18 @@ ssize_t fuse_passthrough_read_iter(struct kiocb *iocb_fuse,
> ret = vfs_iter_read(passthrough_filp, iter, &iocb_fuse->ki_pos,
> iocbflags_to_rwf(iocb_fuse->ki_flags));
> } else {
> - ret = -EIO;
> + struct fuse_aio_req *aio_req;
> +
> + aio_req = kmalloc(sizeof(struct fuse_aio_req), GFP_KERNEL);
> + if (!aio_req)
> + return -ENOMEM;
> +
> + aio_req->iocb_fuse = iocb_fuse;
> + kiocb_clone(&aio_req->iocb, iocb_fuse, passthrough_filp);
> + aio_req->iocb.ki_complete = fuse_aio_rw_complete;
> + ret = call_read_iter(passthrough_filp, &aio_req->iocb, iter);
> + if (ret != -EIOCBQUEUED)
> + fuse_aio_cleanup_handler(aio_req);
> }
> revert_creds(old_cred);
>
> @@ -72,6 +114,7 @@ ssize_t fuse_passthrough_write_iter(struct kiocb *iocb_fuse,
> struct fuse_file *ff = fuse_filp->private_data;
> struct inode *fuse_inode = file_inode(fuse_filp);
> struct file *passthrough_filp = ff->passthrough_filp;
> + struct inode *passthrough_inode = file_inode(passthrough_filp);
>
> if (!iov_iter_count(iter))
> return 0;
> @@ -87,8 +130,25 @@ ssize_t fuse_passthrough_write_iter(struct kiocb *iocb_fuse,
> if (ret > 0)
> fuse_copyattr(fuse_filp, passthrough_filp);
> } else {
> - ret = -EIO;
> + struct fuse_aio_req *aio_req;
> +
> + aio_req = kmalloc(sizeof(struct fuse_aio_req), GFP_KERNEL);
> + if (!aio_req) {
> + ret = -ENOMEM;
> + goto out;
> + }
> +
> + file_start_write(passthrough_filp);
> + __sb_writers_release(passthrough_inode->i_sb, SB_FREEZE_WRITE);
> +
> + aio_req->iocb_fuse = iocb_fuse;
> + kiocb_clone(&aio_req->iocb, iocb_fuse, passthrough_filp);
> + aio_req->iocb.ki_complete = fuse_aio_rw_complete;
> + ret = call_write_iter(passthrough_filp, &aio_req->iocb, iter);
> + if (ret != -EIOCBQUEUED)
> + fuse_aio_cleanup_handler(aio_req);
> }
> +out:
> revert_creds(old_cred);
> inode_unlock(fuse_inode);
>
> --
> 2.28.0.681.g6f77f65b4e-goog
>
On 9/30/2020 2:45 PM, Miklos Szeredi wrote:
> On Thu, Sep 24, 2020 at 3:13 PM Alessio Balsini <[email protected]> wrote:
>> Add a reference to the FUSE daemon credentials, so that they can be used to
>> temporarily raise the user credentials when accessing lower file system
>> files in passthrough.
> Hmm, I think it would be better to store the creds of the ioctl()
> caller together with the open file. The mounter may deliberately
> have different privileges from the process doing the actual I/O.
>
> Thanks,
> Miklos
In my use case I'm changing the euid/egid of the thread to whichever
uid/gid was passed to the server, which is otherwise running as root.
On Wed, Sep 30, 2020 at 05:33:30PM +0200, Miklos Szeredi wrote:
> On Thu, Sep 24, 2020 at 3:13 PM Alessio Balsini <[email protected]> wrote:
>
> > The first benchmarks were done by running FIO (fio-3.21) with:
> > - bs=4Ki;
> > - file size: 50Gi;
> > - ioengine: sync;
> > - fsync_on_close: true.
> > The target file has been chosen large enough to prevent it from being
> > entirely loaded into the page cache.
> > Results are presented in the following table:
> >
> > +-----------+--------+-------------+--------+
> > | Bandwidth |  FUSE  |    FUSE     |  Bind  |
> > |  (KiB/s)  |        | passthrough | mount  |
> > +-----------+--------+-------------+--------+
> > | read      | 468897 |   502085    | 516830 |
> > +-----------+--------+-------------+--------+
> > | randread  |  15773 |    26632    |  21386 |
>
>
> Have you looked into why passthrough is faster than native?
>
> Thanks,
> Miklos
Hi Miklos,
Thank you for bringing this to my attention. I probably missed it because I
was focusing on the comparison between FUSE and FUSE passthrough.
I jumped back to benchmarking right after you sent this email.
At first glance I thought I had made a stupid copy-paste mistake, but looking
at a bunch of partial results I'm collecting, I realized that the Vi550 S3
SSD I'm using sometimes has unstable performance, especially when dealing
with random offsets. I also realized that SSD performance might change
depending on previous operations.
To solve these issues, each test is now being run 10 times, and at
post-processing time I'm thinking of getting the median to remove possible
outliers.
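A minimal sketch of that post-processing step, assuming the per-run bandwidth
values have already been extracted from the fio output into an array. The
median_bw() name and its calling convention are made up for illustration and
are not part of fio or of this series.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* qsort() comparator for doubles, ascending order. */
static int cmp_double(const void *a, const void *b)
{
	double x = *(const double *)a;
	double y = *(const double *)b;

	return (x > y) - (x < y);
}

/*
 * Store the median of the n samples in *out, leaving the input array
 * untouched. Returns 0 on success, -1 if n is 0 or allocation fails.
 */
static int median_bw(const double *samples, size_t n, double *out)
{
	double *tmp;

	if (n == 0)
		return -1;

	tmp = malloc(n * sizeof(*tmp));
	if (!tmp)
		return -1;

	memcpy(tmp, samples, n * sizeof(*tmp));
	qsort(tmp, n, sizeof(*tmp), cmp_double);

	/* Even count: average the two central samples. */
	if (n % 2)
		*out = tmp[n / 2];
	else
		*out = (tmp[n / 2 - 1] + tmp[n / 2]) / 2.0;

	free(tmp);
	return 0;
}
```

With 10 runs the median is the average of the 5th and 6th sorted values, so
a single very slow or very fast run does not move the reported number.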
I also noticed that the performance noise increases after the SSD has been
busy for a few minutes. This made me think of some kind of SSD thermal
throttling that I had totally overlooked.
This might be the reason why passthrough is performing better than native in
the numbers you highlighted.
Unfortunately the SMART registers of my SSD always report 33 degrees Celsius
regardless of the workload, so to work around this I'm now applying a
5-minute cooldown between runs.
This time I'm also removing fsync_on_close and reducing the file size to 25
GiB to improve caching and limit the interaction with the SSD during
writes. Still for caching reasons I am also separating the creation of the
fio target file from the actual execution of the benchmark by first running
fio with create_only=1. Before triggering fio in the above benchmark I was
just sync-ing and dropping the page cache; I now also drop slab objects,
including inodes and dentries:
echo 3 > /proc/sys/vm/drop_caches
which I suspect won't make any difference, but won't do any harm either.
Please let me know if you have any suggestion on how to improve my
benchmarks, or if you recommend tools other than fio (that I actually
really like) to make comparisons.
Thanks,
Alessio
Hi Miklos, all,
After being stuck with some strange and hard to reproduce results from my SSD,
I finally decided to overcome the biggest chunk of inconsistencies by
forgetting about the SSD and switching to a RAM block device to host my lower
file system.
Getting rid of the discrete storage device removes a huge source of
slowness, highlighting the performance difference of the software parts (and
probably the effectiveness of the CPU cache and its coherence/invalidation
mechanisms).
More specifically, out of my system's 32 GiB of RAM, I reserved 24 for
/dev/ram0, which has been formatted as ext4.
That file system has been completely filled and then cleaned up before running
the benchmarks to make sure all the memory addresses were marked as used and
removed from the page cache.
As last time, I've been using a slightly modified libfuse
passthrough_hp.cc example, which simply enables the passthrough mode at every
open/create operation:
[email protected]:balsini/libfuse fuse-passthrough-stable-v.3.9.4
The following tests were run using fio-3.23 with the following configuration:
- bs=4Ki
- size=20Gi
- ioengine=sync
- fsync_on_close=1
- randseed=0
- create_only=0 (set to 1 during a first dry run to create the test file)
With this tool configuration, each of the following benchmarks performs a
single open operation, so the results reflect just the read/write performance.
The file size of 20 GiB was chosen so that it does not completely fit in the
page cache.
As mentioned in my previous email, all the caches were dropped before running
every benchmark with
echo 3 > /proc/sys/vm/drop_caches
All the benchmarks were run 10 times, with a 1-minute cooldown between runs.
Here are the updated results for this patch set:
+-----------+-------------+-------------+-------------+
|           |             |    FUSE     |             |
|   MiB/s   |    FUSE     | passthrough |   native    |
+-----------+-------------+-------------+-------------+
| read      | 1341(±4.2%) | 1485(±1.1%) | 1634(±0.5%) |
+-----------+-------------+-------------+-------------+
| write     |   49(±2.1%) | 1304(±2.6%) | 1363(±3.0%) |
+-----------+-------------+-------------+-------------+
| randread  |   43(±1.3%) | 643(±11.1%) |  715(±1.1%) |
+-----------+-------------+-------------+-------------+
| randwrite |  27(±39.9%) |  763(±1.1%) |  790(±1.0%) |
+-----------+-------------+-------------+-------------+
This table shows that FUSE, except for sequential reads, is left far behind
FUSE passthrough and native performance, by more than an order of magnitude
for writes and random reads. The extremely good FUSE performance for
sequential reads is the result of the read-ahead mechanism, which was easy to
confirm: performance dropped after setting read_ahead_kb to 0.
Except for FUSE randwrite and passthrough randread, with standard deviations
of ~40% and ~11% respectively, all the other results are relatively stable.
These two exceptions are not sufficient to invalidate the results, which
still show clear performance benefits.
I'm also kind of happy to see that passthrough, which traverses the VFS layer
twice for each read/write operation, is now consistently slightly slower
than native.
I wanted to make sure the results were consistent before jumping back to your
feedback on the series.
Thanks,
Alessio
On Wed, Sep 30, 2020 at 05:44:54PM +0200, Miklos Szeredi wrote:
> On Thu, Sep 24, 2020 at 3:13 PM Alessio Balsini <[email protected]> wrote:
> >
> > Introduce the new FUSE passthrough ioctl(), which allows userspace to
> > specify a direct connection between a FUSE file and a lower file system
> > file.
> > Such ioctl() requires userspace to specify:
> > - the file descriptor of one of its opened files,
> > - the unique identifier of the FUSE request associated with a pending
> > open/create operation,
> > both encapsulated into a fuse_passthrough_out data structure.
> > The ioctl() will search for the pending FUSE request matching the unique
> > identifier, and update the passthrough file pointer of the request with the
> > file pointer referenced by the passed file descriptor.
> > When that pending FUSE request is handled, the passthrough file pointer
> > is copied to the fuse_file data structure, so that the link between FUSE
> > and lower file system is consolidated.
>
> How about returning an ID from the ioctl (like the fuse2 prototype)
> and returning that in fuse_open_out.passthrough_fh?
>
> Seems a more straightforward interface to me.
>
> Thanks,
> Miklos
With this patch I tried to avoid any API modifications (for example,
changing fuse_open_out) by doing everything at ioctl() time, and to limit
the allocation of dynamic memory.
In my next patch set (which I'll share as soon as I have enough hours of
testing with syzkaller) I implemented something similar to fuse2 in terms
of communication between kernel and userspace, with id passing as you
recommended. So, userspace gets an id from the ioctl() and sends it back to
the kernel through the open/create reply, setting the passthrough_fh field
in fuse_open_out, which was originally the uint32_t padding. This wouldn't
change the fuse_open_out struct size, but is kind of an API breakage.
As in fuse2 I'm using IDR to generate the id and track the passthrough
entry information. On the other hand, compared to fuse2, I have one dynamic
IDR for each fuse_conn to prevent the owner of a connection from messing
with the ids of others.
In the upcoming patch set, the elements of each IDR serve the only purpose
of keeping track of the id in the timespan between the ioctl() and the
open/create reply. That IDR entry is removed right after the open/create
request is completed.
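The id lifetime described above can be modelled in plain userspace C as
follows. This is only a sketch of the idea and not the kernel IDR API: the
fixed-size table, the PT_MAX_IDS limit and all the pt_* names are
hypothetical. The points it illustrates are one table per connection, an id
handed out at ioctl() time, and the entry being dropped as soon as the
open/create reply consumes it.

```c
#include <assert.h>

#define PT_MAX_IDS 16	/* hypothetical per-connection limit */

/* One table per connection, so ids never leak across connections. */
struct pt_conn {
	int lower_fd[PT_MAX_IDS];	/* -1 marks a free slot */
};

static void pt_conn_init(struct pt_conn *c)
{
	int i;

	for (i = 0; i < PT_MAX_IDS; i++)
		c->lower_fd[i] = -1;
}

/* ioctl() step: store lower_fd and hand back an id > 0, or -1 on error. */
static int pt_register(struct pt_conn *c, int lower_fd)
{
	int i;

	if (lower_fd < 0)
		return -1;

	for (i = 0; i < PT_MAX_IDS; i++) {
		if (c->lower_fd[i] < 0) {
			c->lower_fd[i] = lower_fd;
			return i + 1;
		}
	}
	return -1;	/* table full */
}

/* open/create reply step: look the id up once and drop the entry. */
static int pt_consume(struct pt_conn *c, int id)
{
	int fd;

	if (id < 1 || id > PT_MAX_IDS)
		return -1;

	fd = c->lower_fd[id - 1];
	c->lower_fd[id - 1] = -1;
	return fd;
}
```

An id registered on one connection is meaningless on another, and a second
pt_consume() with the same id fails because the entry is already gone.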
Whereas in fuse2 the global IDR is queried for every read/write operation, my
new patch set still creates a "passthrough entry" for each fuse_file
to simplify the access to the lower file system at every read/write,
getting rid of IDR searches for each read/write operation.
Does this solution make sense?
I'm not sure that the upcoming solution is simpler than this one, but for
sure it keeps the interface more flexible and I like that it is moving in
the direction of fuse2.
Thanks,
Alessio
On Wed, Sep 30, 2020 at 03:16:20PM -0400, Antonio SJ Musumeci wrote:
> On 9/30/2020 2:45 PM, Miklos Szeredi wrote:
> > On Thu, Sep 24, 2020 at 3:13 PM Alessio Balsini <[email protected]> wrote:
> > > Add a reference to the FUSE daemon credentials, so that they can be used to
> > > temporarily raise the user credentials when accessing lower file system
> > > files in passthrough.
> > Hmm, I think it would be better to store the creds of the ioctl()
> > caller together with the open file. The mounter may deliberately
> > have different privileges from the process doing the actual I/O.
> >
> > Thanks,
> > Miklos
>
>
> In my use case I'm changing the euid/egid of the thread to whichever uid/gid
> was passed to the server, which is otherwise running as root.
>
Ack, in the next patch set I will store the creds of the ioctl() caller.
Thanks,
Alessio
On Wed, Sep 30, 2020 at 08:50:46PM +0200, Miklos Szeredi wrote:
> > [...]
> > +ssize_t fuse_passthrough_read_iter(struct kiocb *iocb_fuse,
> > + struct iov_iter *iter)
> > +{
> > + ssize_t ret;
> > + const struct cred *old_cred;
> > + struct file *fuse_filp = iocb_fuse->ki_filp;
> > + struct fuse_file *ff = fuse_filp->private_data;
> > + struct file *passthrough_filp = ff->passthrough_filp;
> > +
> > + if (!iov_iter_count(iter))
> > + return 0;
> > +
> > + old_cred = fuse_passthrough_override_creds(fuse_filp);
> > + if (is_sync_kiocb(iocb_fuse)) {
> > + ret = vfs_iter_read(passthrough_filp, iter, &iocb_fuse->ki_pos,
> > + iocbflags_to_rwf(iocb_fuse->ki_flags));
> > + } else {
> > + ret = -EIO;
> > + }
>
> Just do vfs_iter_read() unconditionally, instead of returning EIO.
> It will work fine, except it won't be async.
>
> Yeah, I know the next patch is going to fix this, but still, let's not make
> this patch return silly errors.
>
> > [...]
> > +ssize_t fuse_passthrough_write_iter(struct kiocb *iocb_fuse,
> > + struct iov_iter *iter)
> > +{
> > + ssize_t ret;
> > + const struct cred *old_cred;
> > + struct file *fuse_filp = iocb_fuse->ki_filp;
> > + struct fuse_file *ff = fuse_filp->private_data;
> > + struct inode *fuse_inode = file_inode(fuse_filp);
> > + struct file *passthrough_filp = ff->passthrough_filp;
> > +
> > + if (!iov_iter_count(iter))
> > + return 0;
> > +
> > + inode_lock(fuse_inode);
> > +
> > + old_cred = fuse_passthrough_override_creds(fuse_filp);
> > + if (is_sync_kiocb(iocb_fuse)) {
> > + file_start_write(passthrough_filp);
> > + ret = vfs_iter_write(passthrough_filp, iter, &iocb_fuse->ki_pos,
> > + iocbflags_to_rwf(iocb_fuse->ki_flags));
> > + file_end_write(passthrough_filp);
> > + if (ret > 0)
> > + fuse_copyattr(fuse_filp, passthrough_filp);
> > + } else {
> > + ret = -EIO;
> > + }
>
> And the same here.
>
Ack, adding both to the upcoming patch set.
Thanks,
Alessio
On Wed, Sep 30, 2020 at 08:54:03PM +0200, Miklos Szeredi wrote:
> On Thu, Sep 24, 2020 at 3:13 PM Alessio Balsini <[email protected]> wrote:
> >
> > Extend the passthrough feature by handling asynchronous IO both for read
> > and write operations.
> >
> > When an AIO request is received, if the request targets a FUSE file with
> > the passthrough functionality enabled, a new identical AIO request is
> > created. The new request targets the lower file system file, and gets
> > assigned a special FUSE passthrough AIO completion callback.
> > When the lower file system AIO request is completed, the FUSE passthrough
> > AIO completion callback is executed and propagates the completion signal to
> > the FUSE AIO request by triggering its completion callback as well.
>
> This ends up with almost identical code in fuse and overlayfs, right?
> Maybe it's worth looking into moving these into common helpers.
>
> Thanks,
> Miklos
>
There are still a few differences between overlayfs and passthrough
read/write_iter(), so that merge wouldn't be straightforward. And I
would love to see this series merged before increasing its complexity,
hopefully in the next version if everything is all right.
I will anyway work on the cleanup patch you suggested right after FUSE
passthrough gets in, so that I have a solid code base to work on.
Would this work for you?
That cleanup patch would be handy also for the upcoming plan of having
something similar to FUSE passthrough extended to directories, as it was
mentioned in a previous discussion on this series with Amir.
Thanks!
Alessio
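For readers following the AIO discussion, the completion chaining used by
the AIO patch (a cloned request embedded in a wrapper, whose completion
callback recovers the wrapper with container_of() and then completes the
original request) can be sketched in userspace C. The struct req type and
all function names below are simplified stand-ins invented for this example.
They are not the kernel's struct kiocb, nor a candidate for the common
helper.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

/* Simplified stand-in for an I/O control block. */
struct req {
	long pos;
	long result;
	void (*complete)(struct req *r, long res);
};

/* Wrapper that ties the lower request to the original one. */
struct wrapped_req {
	struct req inner;	/* request sent to the lower layer */
	struct req *outer;	/* original request to be completed */
};

/* Completion of the lower request: propagate it to the original. */
static void wrapped_complete(struct req *inner, long res)
{
	struct wrapped_req *w = container_of(inner, struct wrapped_req, inner);
	struct req *outer = w->outer;

	outer->pos = inner->pos;	/* copy the position back first */
	free(w);			/* the wrapper is no longer needed */
	outer->complete(outer, res);
}

/* Clone `outer` into a new wrapper; returns NULL on allocation failure. */
static struct req *wrap_request(struct req *outer)
{
	struct wrapped_req *w = malloc(sizeof(*w));

	if (!w)
		return NULL;

	w->outer = outer;
	w->inner = *outer;
	w->inner.complete = wrapped_complete;
	return &w->inner;
}

/* Example completion callback for the original request. */
static void record_result(struct req *r, long res)
{
	r->result = res;
}
```

The lower layer only ever sees the inner request. When it completes, the
wrapper is freed before the original callback runs, mirroring the order of
fuse_aio_cleanup_handler() followed by ki_complete() in the patch.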