This is the 12th version of the series, rebased on top of v5.11-rc5.
Please find the changelog at the bottom of this cover letter.
Add support for file system passthrough read/write of files when enabled
in userspace through the option FUSE_PASSTHROUGH.
There are file systems based on FUSE that are intended to enforce
special policies or trigger complicated decision makings at the file
operations level. Android, for example, uses FUSE to enforce
fine-grained access policies that also depend on the file contents.
Sometimes it happens that at open or create time a file is identified as
not requiring additional checks for consequent reads/writes, thus FUSE
would simply act as a passive bridge between the process accessing the
FUSE file system and the lower file system. Splicing and caching help
reduce the FUSE overhead, but there are still read/write operations
forwarded to the userspace FUSE daemon that could be avoided.
This series has been inspired by the original patches from Nikhilesh
Reddy, the idea and code of which has been elaborated and improved
thanks to the community support.
When the FUSE_PASSTHROUGH capability is enabled, the FUSE daemon may
decide while handling the open/create operations, if the given file can
be accessed in passthrough mode. This means that all the further read
and write operations would be forwarded by the kernel directly to the
lower file system using the VFS layer rather than to the FUSE daemon.
All the requests other than reads or writes are still handled by the
userspace FUSE daemon.
This allows for improved performance on reads and writes, especially in
the case of reads at random offsets, for which no (readahead) caching
mechanism would help.
Benchmarks show improved performance that is close to native file system
access when doing massive manipulations on a single opened file,
especially in the case of random reads, random writes and sequential
writes. Detailed benchmarking results are presented below.
The creation of this direct connection (passthrough) between FUSE file
objects and file objects in the lower file system happens in a way that
reminds of passing file descriptors via sockets:
- a process requests the opening of a file handled by FUSE, so the
kernel forwards the request to the FUSE daemon;
- the FUSE daemon opens the target file in the lower file system,
getting its file descriptor;
- the FUSE daemon also decides according to its internal policies if
passthrough can be enabled for that file, and, if so, can perform a
FUSE_DEV_IOC_PASSTHROUGH_OPEN ioctl on /dev/fuse, passing the file
descriptor obtained at the previous step and the fuse_req unique
identifier;
- the kernel translates the file descriptor to the file pointer
navigating through the opened files of the "current" process and
temporarily stores it in the associated open/create fuse_req's
passthrough_filp;
- when the FUSE daemon has done with the request and it's time for the
kernel to close it, it checks if the passthrough_filp is available and
in case updates the additional field in the fuse_file owned by the
process accessing the FUSE file system.
From now on, all the read/write operations performed by that process
will be redirected to the corresponding lower file system file by
creating new VFS requests.
Since the read/write operation to the lower file system is executed with
the current process's credentials, it might happen that it does not have
enough privileges to succeed. For this reason, the process temporarily
receives the same credentials as the FUSE daemon, that are reverted as
soon as the read/write operation completes, emulating the behavior of
the request to be performed by the FUSE daemon itself. This solution has
been inspired by the way overlayfs handles read/write operations.
Asynchronous IO is supported as well, handled by creating separate AIO
requests for the lower file system that will be internally tracked by
FUSE, that intercepts and propagates their completion through an
internal ki_completed callback similar to the current implementation of
overlayfs.
Finally, also memory-mapped FUSE files are supported in this FUSE
passthrough series as it has been noticed that when a same file with
FUSE passthrough enabled is accessed both with standard
read/write(-iter) operations and memory-mapped read/write operations,
the file content might result corrupted due to an inconsistency between
the FUSE and lower file system caches.
The ioctl has been designed taking as a reference and trying to converge
to the fuse2 implementation. For example, the fuse_passthrough_out data
structure has extra fields that will allow for further extensions of the
feature.
Performance on RAM block device
What follows has been performed using a custom passthrough_hp FUSE
daemon that enables pass-through for each file that is opened during
both "open" and "create". Benchmarks were run on an Intel Xeon W-2135,
64 GiB of RAM workstation, with a RAM block device used as storage
target. More specifically, out of the system's 64 GiB of RAM, 40 GiB
were reserved for /dev/ram0, formatted as ext4. For the FUSE and FUSE
passthrough benchmarks, the FUSE file system was mounted on top of the
mounted /dev/ram0 device.
That file system has been completely filled and then cleaned up before
running the benchmarks: this to ensure that all the /dev/ram0 space was
reserved and not usable as page cache.
The rationale for using a RAM block device is that SSDs may experience
performance fluctuations, especially when dealing with accessing data
random offsets.
Getting rid of the discrete storage device also removes a huge component
of slowness, highlighting the performance difference of the software
parts (and probably the goodness of CPU caching and its coherence
mechanisms).
No special tuning has been performed, e.g., all the involved processes
are SCHED_OTHER, ondemand is the frequency governor with no frequency
restrictions, and turbo-boost, as well as p-state, are active. This is
because I noticed that, for such high-level benchmarks, results
consistency was minimally affected by these features.
The source code of the updated libfuse library and passthrough_hp is
shared at the following repository:
https://github.com/balsini/libfuse/tree/fuse-passthrough-v12-v5.11-rc5
Two different kinds of benchmarks were done for this change, the first
set of tests evaluates the bandwidth improvements when manipulating huge
single files, the second set of tests verify that no performance
regressions were introduced when handling many small files.
All the caches were dropped before running every benchmark with:
echo 3 > /proc/sys/vm/drop_caches
All the benchmarks were run 10 times, with 1 minute cool down between
each run.
The first benchmarks were done by running FIO (fio-3.24) with:
- bs=4Ki;
- file size: 35Gi;
- ioengine: sync;
- fsync_on_close=1;
- randseed=0.
The target file has been chosen large enough to avoid it to be entirely
loaded into the page cache.
Results are presented in the following table:
+-----------+------------+-------------+-------------+
| MiB/s | fuse | passthrough | native |
+-----------+------------+-------------+-------------+
| read | 471(±1.3%) | 1791(±1.0%) | 1839(±1.8%) |
| write | 95(±.6%) | 1068(±.9%) | 1322(±.8%) |
| randread | 25(±1.7%) | 860(±.8%) | 1135(±.5%) |
| randwrite | 76(±3.0%) | 813(±1.0%) | 1005(±.7%) |
+-----------+------------+-------------+-------------+
This table shows that FUSE, except for the sequential reads, is far
behind FUSE passthrough and native in terms of performance. The
extremely good FUSE performance for sequential reads is the result of a
great read-ahead mechanism. I was able to verify that setting
read_ahead_kb to 0 causes a terrible performance drop.
All the results are stable, as shown by the standard deviations.
Moreover, these numbers show the reasonable gap between passthrough and
native, introduced by the extra traversal through the VFS layer.
As long as this patch has the primary objective of improving bandwidth,
another set of tests has been performed to see how this behaves on a
totally different scenario that involves accessing many small files. For
this purpose, measuring the build time of the Linux kernel has been
chosen as an appropriate, well-known, workload. The kernel has been
built with as many processes as the number of logical CPUs (-j
$(nproc)), that besides being a reasonable parallelization value, is
also enough to saturate the processor's utilization thanks to the
additional FUSE daemon's threads, making it even harder to get closer to
the native file system performance.
The following table shows the total build times in the different
configurations:
+------------------+--------------+-----------+
| | AVG duration | Standard |
| | (sec) | deviation |
+------------------+--------------+-----------+
| FUSE | 144.566 | 0.697 |
+------------------+--------------+-----------+
| FUSE passthrough | 133.820 | 0.341 |
+------------------+--------------+-----------+
| Native | 109.423 | 0.724 |
+------------------+--------------+-----------+
Further testing and performance evaluations are welcome.
Description of the series
Patch 1 generalizes the function which converts iocb flags to rw flags
from overlayfs, so that can be used in this patch set.
Patch 2 enables the 32-bit compatibility for the /dev/fuse ioctl.
Patch 3 introduces the data structures, function signatures and ioctl
required both for the communication with userspace and for the internal
kernel use.
Patch 4 introduces initialization and release functions for FUSE
passthrough.
Patch 5 enables the synchronous read and write operations for those FUSE
files for which the passthrough functionality is enabled.
Patch 6 extends the read and write operations to also support
asynchronous IO.
Patch 7 allows FUSE passthrough to target files for which the requesting
process would not have direct access to, by temporarily performing a
credentials switch to the credentials of the FUSE daemon that issued the
FUSE passthrough ioctl.
Patch 8 extends FUSE passthrough operations to memory-mapped FUSE files.
Changelog
Changes in v12:
* Revert FILESYSTEM_MAX_STACK_DEPTH checks as they were in v10
[Requested by Amir Goldstein]
* Introduce passthrough support for memory-mapped FUSE files
[Requested by yanwu]
Changes in v11:
* Fix the FILESYSTEM_MAX_STACK_DEPTH check to allow other file systems
to be stacked
* Moved file system stacking depth check at ioctl time
* Update cover letter with correct libfuse repository to test the change
[Requested by Peng Tao]
* Fix the file reference counter leak introduced in v10
[Requested by yanwu]
Changes in v10:
* UAPI updated: ioctl now returns an ID that will be used at open/create
response time to reference the passthrough file
* Synchronous read/write_iter functions does not return silly errors
(fixed in aio patch)
* FUSE daemon credentials updated at ioctl time instead of mount time
* Updated benchmark results
[Requested by Miklos Szeredi]
Changes in v9:
* Switched to using VFS instead of direct lower FS file ops
[Attempt to address a request from Jens Axboe, Jann Horn,
Amir Goldstein]
* Removal of useless included aio.h header
[Proposed by Jens Axboe]
Changes in v8:
* aio requests now use kmalloc/kfree, instead of kmem_cache
* Switched to call_{read,write}_iter in AIO
* Revisited attributes copy
* Passthrough can only be enabled via ioctl, fixing the security issue
spotted by Jann
* Use an extensible fuse_passthrough_out data structure
[Attempt to address a request from Nikolaus Rath, Amir Goldstein and
Miklos Szeredi]
Changes in v7:
* Full handling of aio requests as done in overlayfs (update commit
* message).
* s/fget_raw/fget.
* Open fails in case of passthrough errors, emitting warning messages.
[Proposed by Jann Horn]
* Create new local kiocb, getting rid of the previously proposed ki_filp
swapping.
[Proposed by Jann Horn and Jens Axboe]
* Code polishing.
Changes in v6:
* Port to kernel v5.8:
* fuse_file_{read,write}_iter changed since the v5 of this patch was
proposed.
* Simplify fuse_simple_request.
* Merge fuse_passthrough.h into fuse_i.h
* Refactor of passthrough.c:
* Remove BUG_ONs.
* Simplified error checking and request arguments indexing.
* Use call_{read,write}_iter utility functions.
* Remove get_file and fputs during read/write: handle the extra FUSE
references to the lower file object when the fuse_file is
created/deleted.
[Proposed by Jann Horn]
Changes in v5:
* Fix the check when setting the passthrough file.
[Found when testing by Mike Shal]
Changes in v3 and v4:
* Use the fs_stack_depth to prevent further stacking and a minor fix.
[Proposed by Jann Horn]
Changes in v2:
* Changed the feature name to passthrough from stacked_io.
[Proposed by Linus Torvalds]
Alessio Balsini (8):
fs: Generic function to convert iocb to rw flags
fuse: 32-bit user space ioctl compat for fuse device
fuse: Definitions and ioctl for passthrough
fuse: Passthrough initialization and release
fuse: Introduce synchronous read and write for passthrough
fuse: Handle asynchronous read and write in passthrough
fuse: Use daemon creds in passthrough mode
fuse: Introduce passthrough for mmap
fs/fuse/Makefile | 1 +
fs/fuse/dev.c | 41 ++++--
fs/fuse/dir.c | 2 +
fs/fuse/file.c | 15 +-
fs/fuse/fuse_i.h | 33 +++++
fs/fuse/inode.c | 22 ++-
fs/fuse/passthrough.c | 280 ++++++++++++++++++++++++++++++++++++++
fs/overlayfs/file.c | 23 +---
include/linux/fs.h | 5 +
include/uapi/linux/fuse.h | 14 +-
10 files changed, 401 insertions(+), 35 deletions(-)
create mode 100644 fs/fuse/passthrough.c
--
2.30.0.280.ga3ce27912f-goog
OverlayFS implements its own function to translate iocb flags into rw
flags, so that they can be passed into another vfs call.
With commit ce71bfea207b4 ("fs: align IOCB_* flags with RWF_* flags")
Jens created a 1:1 matching between the iocb flags and rw flags,
simplifying the conversion.
Reduce the OverlayFS code by making the flag conversion function generic
and reusable.
Signed-off-by: Alessio Balsini <[email protected]>
---
fs/overlayfs/file.c | 23 +++++------------------
include/linux/fs.h | 5 +++++
2 files changed, 10 insertions(+), 18 deletions(-)
diff --git a/fs/overlayfs/file.c b/fs/overlayfs/file.c
index bd9dd38347ae..56be2ffc5a14 100644
--- a/fs/overlayfs/file.c
+++ b/fs/overlayfs/file.c
@@ -15,6 +15,8 @@
#include <linux/fs.h>
#include "overlayfs.h"
+#define OVL_IOCB_MASK (IOCB_DSYNC | IOCB_HIPRI | IOCB_NOWAIT | IOCB_SYNC)
+
struct ovl_aio_req {
struct kiocb iocb;
struct kiocb *orig_iocb;
@@ -236,22 +238,6 @@ static void ovl_file_accessed(struct file *file)
touch_atime(&file->f_path);
}
-static rwf_t ovl_iocb_to_rwf(int ifl)
-{
- rwf_t flags = 0;
-
- if (ifl & IOCB_NOWAIT)
- flags |= RWF_NOWAIT;
- if (ifl & IOCB_HIPRI)
- flags |= RWF_HIPRI;
- if (ifl & IOCB_DSYNC)
- flags |= RWF_DSYNC;
- if (ifl & IOCB_SYNC)
- flags |= RWF_SYNC;
-
- return flags;
-}
-
static void ovl_aio_cleanup_handler(struct ovl_aio_req *aio_req)
{
struct kiocb *iocb = &aio_req->iocb;
@@ -299,7 +285,8 @@ static ssize_t ovl_read_iter(struct kiocb *iocb, struct iov_iter *iter)
old_cred = ovl_override_creds(file_inode(file)->i_sb);
if (is_sync_kiocb(iocb)) {
ret = vfs_iter_read(real.file, iter, &iocb->ki_pos,
- ovl_iocb_to_rwf(iocb->ki_flags));
+ iocb_to_rw_flags(iocb->ki_flags,
+ OVL_IOCB_MASK));
} else {
struct ovl_aio_req *aio_req;
@@ -356,7 +343,7 @@ static ssize_t ovl_write_iter(struct kiocb *iocb, struct iov_iter *iter)
if (is_sync_kiocb(iocb)) {
file_start_write(real.file);
ret = vfs_iter_write(real.file, iter, &iocb->ki_pos,
- ovl_iocb_to_rwf(ifl));
+ iocb_to_rw_flags(ifl, OVL_IOCB_MASK));
file_end_write(real.file);
/* Update size */
ovl_copyattr(ovl_inode_real(inode), inode);
diff --git a/include/linux/fs.h b/include/linux/fs.h
index fd47deea7c17..647c35423545 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -3275,6 +3275,11 @@ static inline int kiocb_set_rw_flags(struct kiocb *ki, rwf_t flags)
return 0;
}
+static inline rwf_t iocb_to_rw_flags(int ifl, int iocb_mask)
+{
+ return ifl & iocb_mask;
+}
+
static inline ino_t parent_ino(struct dentry *dentry)
{
ino_t res;
--
2.30.0.280.ga3ce27912f-goog
With a 64-bit kernel build the FUSE device cannot handle ioctl requests
coming from 32-bit user space.
This is due to the ioctl command translation that generates different
command identifiers that thus cannot be used for direct comparisons
without proper manipulation.
Explicitly extract type and number from the ioctl command to enable
32-bit user space compatibility on 64-bit kernel builds.
Signed-off-by: Alessio Balsini <[email protected]>
---
fs/fuse/dev.c | 29 ++++++++++++++++++-----------
include/uapi/linux/fuse.h | 3 ++-
2 files changed, 20 insertions(+), 12 deletions(-)
diff --git a/fs/fuse/dev.c b/fs/fuse/dev.c
index 588f8d1240aa..ff9f3b83f879 100644
--- a/fs/fuse/dev.c
+++ b/fs/fuse/dev.c
@@ -2233,37 +2233,44 @@ static int fuse_device_clone(struct fuse_conn *fc, struct file *new)
static long fuse_dev_ioctl(struct file *file, unsigned int cmd,
unsigned long arg)
{
- int err = -ENOTTY;
+ int res;
+ int oldfd;
+ struct fuse_dev *fud = NULL;
- if (cmd == FUSE_DEV_IOC_CLONE) {
- int oldfd;
+ if (_IOC_TYPE(cmd) != FUSE_DEV_IOC_MAGIC)
+ return -EINVAL;
- err = -EFAULT;
- if (!get_user(oldfd, (__u32 __user *) arg)) {
+ switch (_IOC_NR(cmd)) {
+ case _IOC_NR(FUSE_DEV_IOC_CLONE):
+ res = -EFAULT;
+ if (!get_user(oldfd, (__u32 __user *)arg)) {
struct file *old = fget(oldfd);
- err = -EINVAL;
+ res = -EINVAL;
if (old) {
- struct fuse_dev *fud = NULL;
-
/*
* Check against file->f_op because CUSE
* uses the same ioctl handler.
*/
if (old->f_op == file->f_op &&
- old->f_cred->user_ns == file->f_cred->user_ns)
+ old->f_cred->user_ns ==
+ file->f_cred->user_ns)
fud = fuse_get_dev(old);
if (fud) {
mutex_lock(&fuse_mutex);
- err = fuse_device_clone(fud->fc, file);
+ res = fuse_device_clone(fud->fc, file);
mutex_unlock(&fuse_mutex);
}
fput(old);
}
}
+ break;
+ default:
+ res = -ENOTTY;
+ break;
}
- return err;
+ return res;
}
const struct file_operations fuse_dev_operations = {
diff --git a/include/uapi/linux/fuse.h b/include/uapi/linux/fuse.h
index 98ca64d1beb6..54442612c48b 100644
--- a/include/uapi/linux/fuse.h
+++ b/include/uapi/linux/fuse.h
@@ -903,7 +903,8 @@ struct fuse_notify_retrieve_in {
};
/* Device ioctls: */
-#define FUSE_DEV_IOC_CLONE _IOR(229, 0, uint32_t)
+#define FUSE_DEV_IOC_MAGIC 229
+#define FUSE_DEV_IOC_CLONE _IOR(FUSE_DEV_IOC_MAGIC, 0, uint32_t)
struct fuse_lseek_in {
uint64_t fh;
--
2.30.0.280.ga3ce27912f-goog
On Mon, Jan 25, 2021 at 03:30:50PM +0000, Alessio Balsini wrote:
> OverlayFS implements its own function to translate iocb flags into rw
> flags, so that they can be passed into another vfs call.
> With commit ce71bfea207b4 ("fs: align IOCB_* flags with RWF_* flags")
> Jens created a 1:1 matching between the iocb flags and rw flags,
> simplifying the conversion.
>
> Reduce the OverlayFS code by making the flag conversion function generic
> and reusable.
>
> Signed-off-by: Alessio Balsini <[email protected]>
> ---
> fs/overlayfs/file.c | 23 +++++------------------
> include/linux/fs.h | 5 +++++
> 2 files changed, 10 insertions(+), 18 deletions(-)
>
> diff --git a/fs/overlayfs/file.c b/fs/overlayfs/file.c
> index bd9dd38347ae..56be2ffc5a14 100644
> --- a/fs/overlayfs/file.c
> +++ b/fs/overlayfs/file.c
> @@ -15,6 +15,8 @@
> #include <linux/fs.h>
> #include "overlayfs.h"
>
> +#define OVL_IOCB_MASK (IOCB_DSYNC | IOCB_HIPRI | IOCB_NOWAIT | IOCB_SYNC)
> +
> struct ovl_aio_req {
> struct kiocb iocb;
> struct kiocb *orig_iocb;
> @@ -236,22 +238,6 @@ static void ovl_file_accessed(struct file *file)
> touch_atime(&file->f_path);
> }
>
> -static rwf_t ovl_iocb_to_rwf(int ifl)
> -{
> - rwf_t flags = 0;
> -
> - if (ifl & IOCB_NOWAIT)
> - flags |= RWF_NOWAIT;
> - if (ifl & IOCB_HIPRI)
> - flags |= RWF_HIPRI;
> - if (ifl & IOCB_DSYNC)
> - flags |= RWF_DSYNC;
> - if (ifl & IOCB_SYNC)
> - flags |= RWF_SYNC;
> -
> - return flags;
> -}
> -
> static void ovl_aio_cleanup_handler(struct ovl_aio_req *aio_req)
> {
> struct kiocb *iocb = &aio_req->iocb;
> @@ -299,7 +285,8 @@ static ssize_t ovl_read_iter(struct kiocb *iocb, struct iov_iter *iter)
> old_cred = ovl_override_creds(file_inode(file)->i_sb);
> if (is_sync_kiocb(iocb)) {
> ret = vfs_iter_read(real.file, iter, &iocb->ki_pos,
> - ovl_iocb_to_rwf(iocb->ki_flags));
> + iocb_to_rw_flags(iocb->ki_flags,
> + OVL_IOCB_MASK));
> } else {
> struct ovl_aio_req *aio_req;
>
> @@ -356,7 +343,7 @@ static ssize_t ovl_write_iter(struct kiocb *iocb, struct iov_iter *iter)
> if (is_sync_kiocb(iocb)) {
> file_start_write(real.file);
> ret = vfs_iter_write(real.file, iter, &iocb->ki_pos,
> - ovl_iocb_to_rwf(ifl));
> + iocb_to_rw_flags(ifl, OVL_IOCB_MASK));
> file_end_write(real.file);
> /* Update size */
> ovl_copyattr(ovl_inode_real(inode), inode);
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index fd47deea7c17..647c35423545 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -3275,6 +3275,11 @@ static inline int kiocb_set_rw_flags(struct kiocb *ki, rwf_t flags)
> return 0;
> }
>
> +static inline rwf_t iocb_to_rw_flags(int ifl, int iocb_mask)
> +{
> + return ifl & iocb_mask;
> +}
> +
> static inline ino_t parent_ino(struct dentry *dentry)
> {
> ino_t res;
> --
> 2.30.0.280.ga3ce27912f-goog
>
For some reason lkml.org and lore.kernel.org are not showing this change
as part of the thread.
Let's see if replying to the email fixes the indexing.
Regards,
Alessio
Implement the FUSE passthrough ioctl that associates the lower
(passthrough) file system file with the fuse_file.
The file descriptor passed to the ioctl by the FUSE daemon is used to
access the relative file pointer, that will be copied to the fuse_file
data structure to consolidate the link between the FUSE and lower file
system.
To enable the passthrough mode, user space triggers the
FUSE_DEV_IOC_PASSTHROUGH_OPEN ioctl and, if the call succeeds, receives
back an identifier that will be used at open/create response time in the
fuse_open_out field to associate the FUSE file to the lower file system
file.
The value returned by the ioctl to user space can be:
- > 0: success, the identifier can be used as part of an open/create
reply.
- <= 0: an error occurred.
The value 0 represents an error to preserve backward compatibility: the
fuse_open_out field that is used to pass the passthrough_fh back to the
kernel uses the same bits that were previously as struct padding, and is
commonly zero-initialized (e.g., in the libfuse implementation).
Removing 0 from the correct values fixes the ambiguity between the case
in which 0 corresponds to a real passthrough_fh, a missing
implementation of FUSE passthrough or a request for a normal FUSE file,
simplifying the user space implementation.
For the passthrough mode to be successfully activated, the lower file
system file must implement both read_iter and write_iter file
operations. This extra check avoids special pseudo files to be targeted
for this feature.
Passthrough comes with another limitation: no further file system
stacking is allowed for those FUSE file systems using passthrough.
Signed-off-by: Alessio Balsini <[email protected]>
---
fs/fuse/inode.c | 5 +++
fs/fuse/passthrough.c | 87 ++++++++++++++++++++++++++++++++++++++++++-
2 files changed, 90 insertions(+), 2 deletions(-)
diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
index a1104d5abb70..7ebc398fbacb 100644
--- a/fs/fuse/inode.c
+++ b/fs/fuse/inode.c
@@ -1133,6 +1133,11 @@ EXPORT_SYMBOL_GPL(fuse_send_init);
static int free_fuse_passthrough(int id, void *p, void *data)
{
+ struct fuse_passthrough *passthrough = (struct fuse_passthrough *)p;
+
+ fuse_passthrough_release(passthrough);
+ kfree(p);
+
return 0;
}
diff --git a/fs/fuse/passthrough.c b/fs/fuse/passthrough.c
index 594060c654f8..cf993e83803e 100644
--- a/fs/fuse/passthrough.c
+++ b/fs/fuse/passthrough.c
@@ -3,19 +3,102 @@
#include "fuse_i.h"
#include <linux/fuse.h>
+#include <linux/idr.h>
int fuse_passthrough_open(struct fuse_dev *fud,
struct fuse_passthrough_out *pto)
{
- return -EINVAL;
+ int res;
+ struct file *passthrough_filp;
+ struct fuse_conn *fc = fud->fc;
+ struct inode *passthrough_inode;
+ struct super_block *passthrough_sb;
+ struct fuse_passthrough *passthrough;
+
+ if (!fc->passthrough)
+ return -EPERM;
+
+ /* This field is reserved for future implementation */
+ if (pto->len != 0)
+ return -EINVAL;
+
+ passthrough_filp = fget(pto->fd);
+ if (!passthrough_filp) {
+ pr_err("FUSE: invalid file descriptor for passthrough.\n");
+ return -EBADF;
+ }
+
+ if (!passthrough_filp->f_op->read_iter ||
+ !passthrough_filp->f_op->write_iter) {
+ pr_err("FUSE: passthrough file misses file operations.\n");
+ res = -EBADF;
+ goto err_free_file;
+ }
+
+ passthrough_inode = file_inode(passthrough_filp);
+ passthrough_sb = passthrough_inode->i_sb;
+ if (passthrough_sb->s_stack_depth >= FILESYSTEM_MAX_STACK_DEPTH) {
+ pr_err("FUSE: fs stacking depth exceeded for passthrough\n");
+ res = -EINVAL;
+ goto err_free_file;
+ }
+
+ passthrough = kmalloc(sizeof(struct fuse_passthrough), GFP_KERNEL);
+ if (!passthrough) {
+ res = -ENOMEM;
+ goto err_free_file;
+ }
+
+ passthrough->filp = passthrough_filp;
+
+ idr_preload(GFP_KERNEL);
+ spin_lock(&fc->passthrough_req_lock);
+ res = idr_alloc(&fc->passthrough_req, passthrough, 1, 0, GFP_ATOMIC);
+ spin_unlock(&fc->passthrough_req_lock);
+ idr_preload_end();
+
+ if (res > 0)
+ return res;
+
+ fuse_passthrough_release(passthrough);
+ kfree(passthrough);
+
+err_free_file:
+ fput(passthrough_filp);
+
+ return res;
}
int fuse_passthrough_setup(struct fuse_conn *fc, struct fuse_file *ff,
struct fuse_open_out *openarg)
{
- return -EINVAL;
+ struct fuse_passthrough *passthrough;
+ int passthrough_fh = openarg->passthrough_fh;
+
+ if (!fc->passthrough)
+ return -EPERM;
+
+ /* Default case, passthrough is not requested */
+ if (passthrough_fh <= 0)
+ return -EINVAL;
+
+ spin_lock(&fc->passthrough_req_lock);
+ passthrough = idr_remove(&fc->passthrough_req, passthrough_fh);
+ spin_unlock(&fc->passthrough_req_lock);
+
+ if (!passthrough)
+ return -EINVAL;
+
+ ff->passthrough = *passthrough;
+ kfree(passthrough);
+
+ return 0;
}
void fuse_passthrough_release(struct fuse_passthrough *passthrough)
{
+ if (passthrough->filp) {
+ fput(passthrough->filp);
+ passthrough->filp = NULL;
+ }
}
--
2.30.0.280.ga3ce27912f-goog
Enabling FUSE passthrough for mmap-ed operations not only affects
performance, but has also been shown as mandatory for the correct
functioning of FUSE passthrough.
yanwu noticed [1] that a FUSE file with passthrough enabled may suffer
data inconsistencies if the same file is also accessed with mmap. What
happens is that read/write operations are directly applied to the lower
file system (and its cache), while mmap-ed operations are affecting the
FUSE cache.
Extend the FUSE passthrough implementation to also handle memory-mapped
FUSE file, to both fix the cache inconsistencies and extend the
passthrough performance benefits to mmap-ed operations.
[1] https://lore.kernel.org/lkml/[email protected]/
Signed-off-by: Alessio Balsini <[email protected]>
---
fs/fuse/file.c | 3 +++
fs/fuse/fuse_i.h | 1 +
fs/fuse/passthrough.c | 41 +++++++++++++++++++++++++++++++++++++++++
3 files changed, 45 insertions(+)
diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index cddada1e8bd9..e3741a94c1f9 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -2370,6 +2370,9 @@ static int fuse_file_mmap(struct file *file, struct vm_area_struct *vma)
if (FUSE_IS_DAX(file_inode(file)))
return fuse_dax_mmap(file, vma);
+ if (ff->passthrough.filp)
+ return fuse_passthrough_mmap(file, vma);
+
if (ff->open_flags & FOPEN_DIRECT_IO) {
/* Can't provide the coherency needed for MAP_SHARED */
if (vma->vm_flags & VM_MAYSHARE)
diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
index 815af1845b16..7b0d65984608 100644
--- a/fs/fuse/fuse_i.h
+++ b/fs/fuse/fuse_i.h
@@ -1244,5 +1244,6 @@ int fuse_passthrough_setup(struct fuse_conn *fc, struct fuse_file *ff,
void fuse_passthrough_release(struct fuse_passthrough *passthrough);
ssize_t fuse_passthrough_read_iter(struct kiocb *iocb, struct iov_iter *to);
ssize_t fuse_passthrough_write_iter(struct kiocb *iocb, struct iov_iter *from);
+ssize_t fuse_passthrough_mmap(struct file *file, struct vm_area_struct *vma);
#endif /* _FS_FUSE_I_H */
diff --git a/fs/fuse/passthrough.c b/fs/fuse/passthrough.c
index 24866c5fe7e2..284979f87747 100644
--- a/fs/fuse/passthrough.c
+++ b/fs/fuse/passthrough.c
@@ -135,6 +135,47 @@ ssize_t fuse_passthrough_write_iter(struct kiocb *iocb_fuse,
return ret;
}
+ssize_t fuse_passthrough_mmap(struct file *file, struct vm_area_struct *vma)
+{
+ int ret;
+ const struct cred *old_cred;
+ struct fuse_file *ff = file->private_data;
+ struct inode *fuse_inode = file_inode(file);
+ struct file *passthrough_filp = ff->passthrough.filp;
+ struct inode *passthrough_inode = file_inode(passthrough_filp);
+
+ if (!passthrough_filp->f_op->mmap)
+ return -ENODEV;
+
+ if (WARN_ON(file != vma->vm_file))
+ return -EIO;
+
+ vma->vm_file = get_file(passthrough_filp);
+
+ old_cred = override_creds(ff->passthrough.cred);
+ ret = call_mmap(vma->vm_file, vma);
+ revert_creds(old_cred);
+
+ if (ret)
+ fput(passthrough_filp);
+ else
+ fput(file);
+
+ if (file->f_flags & O_NOATIME)
+ return ret;
+
+ if ((!timespec64_equal(&fuse_inode->i_mtime,
+ &passthrough_inode->i_mtime) ||
+ !timespec64_equal(&fuse_inode->i_ctime,
+ &passthrough_inode->i_ctime))) {
+ fuse_inode->i_mtime = passthrough_inode->i_mtime;
+ fuse_inode->i_ctime = passthrough_inode->i_ctime;
+ }
+ touch_atime(&file->f_path);
+
+ return ret;
+}
+
int fuse_passthrough_open(struct fuse_dev *fud,
struct fuse_passthrough_out *pto)
{
--
2.30.0.280.ga3ce27912f-goog
Expose the FUSE_PASSTHROUGH interface to user space and declare all the
basic data structures and functions as the skeleton on top of which the
FUSE passthrough functionality will be built.
As part of this, introduce the new FUSE passthrough ioctl, which allows
the FUSE daemon to specify a direct connection between a FUSE file and a
lower file system file. Such ioctl requires user space to pass the file
descriptor of one of its opened files through the fuse_passthrough_out
data structure introduced in this patch. This structure includes extra
fields for possible future extensions.
Also, add the passthrough functions for the set-up and tear-down of the
data structures and locks that will be used both when fuse_conns and
fuse_files are created/deleted.
Signed-off-by: Alessio Balsini <[email protected]>
---
fs/fuse/Makefile | 1 +
fs/fuse/dev.c | 12 ++++++++++++
fs/fuse/dir.c | 2 ++
fs/fuse/file.c | 4 +++-
fs/fuse/fuse_i.h | 27 +++++++++++++++++++++++++++
fs/fuse/inode.c | 17 ++++++++++++++++-
fs/fuse/passthrough.c | 21 +++++++++++++++++++++
include/uapi/linux/fuse.h | 11 ++++++++++-
8 files changed, 92 insertions(+), 3 deletions(-)
create mode 100644 fs/fuse/passthrough.c
diff --git a/fs/fuse/Makefile b/fs/fuse/Makefile
index 8c7021fb2cd4..20ed23aa16fa 100644
--- a/fs/fuse/Makefile
+++ b/fs/fuse/Makefile
@@ -8,6 +8,7 @@ obj-$(CONFIG_CUSE) += cuse.o
obj-$(CONFIG_VIRTIO_FS) += virtiofs.o
fuse-y := dev.o dir.o file.o inode.o control.o xattr.o acl.o readdir.o
+fuse-y += passthrough.o
fuse-$(CONFIG_FUSE_DAX) += dax.o
virtiofs-y := virtio_fs.o
diff --git a/fs/fuse/dev.c b/fs/fuse/dev.c
index ff9f3b83f879..5446f13db5a0 100644
--- a/fs/fuse/dev.c
+++ b/fs/fuse/dev.c
@@ -2236,6 +2236,7 @@ static long fuse_dev_ioctl(struct file *file, unsigned int cmd,
int res;
int oldfd;
struct fuse_dev *fud = NULL;
+ struct fuse_passthrough_out pto;
if (_IOC_TYPE(cmd) != FUSE_DEV_IOC_MAGIC)
return -EINVAL;
@@ -2266,6 +2267,17 @@ static long fuse_dev_ioctl(struct file *file, unsigned int cmd,
}
}
break;
+ case _IOC_NR(FUSE_DEV_IOC_PASSTHROUGH_OPEN):
+ res = -EFAULT;
+ if (!copy_from_user(&pto,
+ (struct fuse_passthrough_out __user *)arg,
+ sizeof(pto))) {
+ res = -EINVAL;
+ fud = fuse_get_dev(file);
+ if (fud)
+ res = fuse_passthrough_open(fud, &pto);
+ }
+ break;
default:
res = -ENOTTY;
break;
diff --git a/fs/fuse/dir.c b/fs/fuse/dir.c
index 78f9f209078c..c9a1b33c5481 100644
--- a/fs/fuse/dir.c
+++ b/fs/fuse/dir.c
@@ -513,6 +513,7 @@ static int fuse_create_open(struct inode *dir, struct dentry *entry,
{
int err;
struct inode *inode;
+ struct fuse_conn *fc = get_fuse_conn(dir);
struct fuse_mount *fm = get_fuse_mount(dir);
FUSE_ARGS(args);
struct fuse_forget_link *forget;
@@ -574,6 +575,7 @@ static int fuse_create_open(struct inode *dir, struct dentry *entry,
ff->fh = outopen.fh;
ff->nodeid = outentry.nodeid;
ff->open_flags = outopen.open_flags;
+ fuse_passthrough_setup(fc, ff, &outopen);
inode = fuse_iget(dir->i_sb, outentry.nodeid, outentry.generation,
&outentry.attr, entry_attr_timeout(&outentry), 0);
if (!inode) {
diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index 8cccecb55fb8..953f3034c375 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -158,7 +158,7 @@ int fuse_do_open(struct fuse_mount *fm, u64 nodeid, struct file *file,
if (!err) {
ff->fh = outarg.fh;
ff->open_flags = outarg.open_flags;
-
+ fuse_passthrough_setup(fc, ff, &outarg);
} else if (err != -ENOSYS) {
fuse_file_free(ff);
return err;
@@ -304,6 +304,8 @@ void fuse_release_common(struct file *file, bool isdir)
struct fuse_release_args *ra = ff->release_args;
int opcode = isdir ? FUSE_RELEASEDIR : FUSE_RELEASE;
+ fuse_passthrough_release(&ff->passthrough);
+
fuse_prepare_release(fi, ff, file->f_flags, opcode);
if (ff->flock) {
diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
index 7c4b8cb93f9f..8d39f5304a11 100644
--- a/fs/fuse/fuse_i.h
+++ b/fs/fuse/fuse_i.h
@@ -180,6 +180,14 @@ struct fuse_conn;
struct fuse_mount;
struct fuse_release_args;
+/**
+ * Reference to lower filesystem file for read/write operations handled in
+ * passthrough mode
+ */
+struct fuse_passthrough {
+ struct file *filp;
+};
+
/** FUSE specific file data */
struct fuse_file {
/** Fuse connection for this file */
@@ -225,6 +233,9 @@ struct fuse_file {
} readdir;
+ /** Container for data related to the passthrough functionality */
+ struct fuse_passthrough passthrough;
+
/** RB node to be linked on fuse_conn->polled_files */
struct rb_node polled_node;
@@ -755,6 +766,9 @@ struct fuse_conn {
/* Auto-mount submounts announced by the server */
unsigned int auto_submounts:1;
+ /** Passthrough mode for read/write IO */
+ unsigned int passthrough:1;
+
/** The number of requests waiting for completion */
atomic_t num_waiting;
@@ -798,6 +812,12 @@ struct fuse_conn {
/** List of filesystems using this connection */
struct list_head mounts;
+
+ /** IDR for passthrough requests */
+ struct idr passthrough_req;
+
+ /** Protects passthrough_req */
+ spinlock_t passthrough_req_lock;
};
/*
@@ -1213,4 +1233,11 @@ void fuse_dax_inode_cleanup(struct inode *inode);
bool fuse_dax_check_alignment(struct fuse_conn *fc, unsigned int map_alignment);
void fuse_dax_cancel_work(struct fuse_conn *fc);
+/* passthrough.c */
+int fuse_passthrough_open(struct fuse_dev *fud,
+ struct fuse_passthrough_out *pto);
+int fuse_passthrough_setup(struct fuse_conn *fc, struct fuse_file *ff,
+ struct fuse_open_out *openarg);
+void fuse_passthrough_release(struct fuse_passthrough *passthrough);
+
#endif /* _FS_FUSE_I_H */
diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
index b0e18b470e91..a1104d5abb70 100644
--- a/fs/fuse/inode.c
+++ b/fs/fuse/inode.c
@@ -691,6 +691,7 @@ void fuse_conn_init(struct fuse_conn *fc, struct fuse_mount *fm,
memset(fc, 0, sizeof(*fc));
spin_lock_init(&fc->lock);
spin_lock_init(&fc->bg_lock);
+ spin_lock_init(&fc->passthrough_req_lock);
init_rwsem(&fc->killsb);
refcount_set(&fc->count, 1);
atomic_set(&fc->dev_count, 1);
@@ -699,6 +700,7 @@ void fuse_conn_init(struct fuse_conn *fc, struct fuse_mount *fm,
INIT_LIST_HEAD(&fc->bg_queue);
INIT_LIST_HEAD(&fc->entry);
INIT_LIST_HEAD(&fc->devices);
+ idr_init(&fc->passthrough_req);
atomic_set(&fc->num_waiting, 0);
fc->max_background = FUSE_DEFAULT_MAX_BACKGROUND;
fc->congestion_threshold = FUSE_DEFAULT_CONGESTION_THRESHOLD;
@@ -1052,6 +1054,12 @@ static void process_init_reply(struct fuse_mount *fm, struct fuse_args *args,
fc->handle_killpriv_v2 = 1;
fm->sb->s_flags |= SB_NOSEC;
}
+ if (arg->flags & FUSE_PASSTHROUGH) {
+ fc->passthrough = 1;
+ /* Prevent further stacking */
+ fm->sb->s_stack_depth =
+ FILESYSTEM_MAX_STACK_DEPTH;
+ }
} else {
ra_pages = fc->max_read / PAGE_SIZE;
fc->no_lock = 1;
@@ -1095,7 +1103,7 @@ void fuse_send_init(struct fuse_mount *fm)
FUSE_PARALLEL_DIROPS | FUSE_HANDLE_KILLPRIV | FUSE_POSIX_ACL |
FUSE_ABORT_ERROR | FUSE_MAX_PAGES | FUSE_CACHE_SYMLINKS |
FUSE_NO_OPENDIR_SUPPORT | FUSE_EXPLICIT_INVAL_DATA |
- FUSE_HANDLE_KILLPRIV_V2;
+ FUSE_HANDLE_KILLPRIV_V2 | FUSE_PASSTHROUGH;
#ifdef CONFIG_FUSE_DAX
if (fm->fc->dax)
ia->in.flags |= FUSE_MAP_ALIGNMENT;
@@ -1123,9 +1131,16 @@ void fuse_send_init(struct fuse_mount *fm)
}
EXPORT_SYMBOL_GPL(fuse_send_init);
+static int free_fuse_passthrough(int id, void *p, void *data)
+{
+ return 0;
+}
+
void fuse_free_conn(struct fuse_conn *fc)
{
WARN_ON(!list_empty(&fc->devices));
+ idr_for_each(&fc->passthrough_req, free_fuse_passthrough, NULL);
+ idr_destroy(&fc->passthrough_req);
kfree_rcu(fc, rcu);
}
EXPORT_SYMBOL_GPL(fuse_free_conn);
diff --git a/fs/fuse/passthrough.c b/fs/fuse/passthrough.c
new file mode 100644
index 000000000000..594060c654f8
--- /dev/null
+++ b/fs/fuse/passthrough.c
@@ -0,0 +1,21 @@
+// SPDX-License-Identifier: GPL-2.0
+
+#include "fuse_i.h"
+
+#include <linux/fuse.h>
+
+int fuse_passthrough_open(struct fuse_dev *fud,
+ struct fuse_passthrough_out *pto)
+{
+ return -EINVAL;
+}
+
+int fuse_passthrough_setup(struct fuse_conn *fc, struct fuse_file *ff,
+ struct fuse_open_out *openarg)
+{
+ return -EINVAL;
+}
+
+void fuse_passthrough_release(struct fuse_passthrough *passthrough)
+{
+}
diff --git a/include/uapi/linux/fuse.h b/include/uapi/linux/fuse.h
index 54442612c48b..9d7685ce0acd 100644
--- a/include/uapi/linux/fuse.h
+++ b/include/uapi/linux/fuse.h
@@ -360,6 +360,7 @@ struct fuse_file_lock {
#define FUSE_MAP_ALIGNMENT (1 << 26)
#define FUSE_SUBMOUNTS (1 << 27)
#define FUSE_HANDLE_KILLPRIV_V2 (1 << 28)
+#define FUSE_PASSTHROUGH (1 << 29)
/**
* CUSE INIT request/reply flags
@@ -625,7 +626,7 @@ struct fuse_create_in {
struct fuse_open_out {
uint64_t fh;
uint32_t open_flags;
- uint32_t padding;
+ uint32_t passthrough_fh;
};
struct fuse_release_in {
@@ -828,6 +829,13 @@ struct fuse_in_header {
uint32_t padding;
};
+struct fuse_passthrough_out {
+ uint32_t fd;
+ /* For future implementation */
+ uint32_t len;
+ void *vec;
+};
+
struct fuse_out_header {
uint32_t len;
int32_t error;
@@ -905,6 +913,7 @@ struct fuse_notify_retrieve_in {
/* Device ioctls: */
#define FUSE_DEV_IOC_MAGIC 229
#define FUSE_DEV_IOC_CLONE _IOR(FUSE_DEV_IOC_MAGIC, 0, uint32_t)
+#define FUSE_DEV_IOC_PASSTHROUGH_OPEN _IOW(FUSE_DEV_IOC_MAGIC, 1, struct fuse_passthrough_out)
struct fuse_lseek_in {
uint64_t fh;
--
2.30.0.280.ga3ce27912f-goog
When using FUSE passthrough, read/write operations are directly
forwarded to the lower file system file through VFS, but there is no
guarantee that the process that is triggering the request has the right
permissions to access the lower file system. This would cause the
read/write access to fail.
In passthrough file systems, where the FUSE daemon is responsible for
the enforcement of the lower file system access policies, often happens
that the process dealing with the FUSE file system doesn't have access
to the lower file system.
Being the FUSE daemon in charge of implementing the FUSE file
operations, that in the case of read/write operations usually simply
results in the copy of memory buffers from/to the lower file system
respectively, these operations are executed with the FUSE daemon
privileges.
This patch adds a reference to the FUSE daemon credentials, referenced
at FUSE_DEV_IOC_PASSTHROUGH_OPEN ioctl() time so that they can be used
to temporarily raise the user credentials when accessing lower file
system files in passthrough.
The process accessing the FUSE file with passthrough enabled temporarily
receives the privileges of the FUSE daemon while performing read/write
operations. Similar behavior is implemented in overlayfs.
These privileges will be reverted as soon as the IO operation completes.
This feature does not provide any higher security privileges to those
processes accessing the FUSE file system with passthrough enabled. This
is because it is still the FUSE daemon responsible for enabling or not
the passthrough feature at file open time, and should enable the feature
only after appropriate access policy checks.
Signed-off-by: Alessio Balsini <[email protected]>
---
fs/fuse/fuse_i.h | 5 ++++-
fs/fuse/passthrough.c | 11 +++++++++++
2 files changed, 15 insertions(+), 1 deletion(-)
diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
index c4730d893324..815af1845b16 100644
--- a/fs/fuse/fuse_i.h
+++ b/fs/fuse/fuse_i.h
@@ -182,10 +182,13 @@ struct fuse_release_args;
/**
* Reference to lower filesystem file for read/write operations handled in
- * passthrough mode
+ * passthrough mode.
+ * This struct also tracks the credentials to be used for handling read/write
+ * operations.
*/
struct fuse_passthrough {
struct file *filp;
+ struct cred *cred;
};
/** FUSE specific file data */
diff --git a/fs/fuse/passthrough.c b/fs/fuse/passthrough.c
index c7fa1eeb7639..24866c5fe7e2 100644
--- a/fs/fuse/passthrough.c
+++ b/fs/fuse/passthrough.c
@@ -52,6 +52,7 @@ ssize_t fuse_passthrough_read_iter(struct kiocb *iocb_fuse,
struct iov_iter *iter)
{
ssize_t ret;
+ const struct cred *old_cred;
struct file *fuse_filp = iocb_fuse->ki_filp;
struct fuse_file *ff = fuse_filp->private_data;
struct file *passthrough_filp = ff->passthrough.filp;
@@ -59,6 +60,7 @@ ssize_t fuse_passthrough_read_iter(struct kiocb *iocb_fuse,
if (!iov_iter_count(iter))
return 0;
+ old_cred = override_creds(ff->passthrough.cred);
if (is_sync_kiocb(iocb_fuse)) {
ret = vfs_iter_read(passthrough_filp, iter, &iocb_fuse->ki_pos,
iocb_to_rw_flags(iocb_fuse->ki_flags,
@@ -77,6 +79,7 @@ ssize_t fuse_passthrough_read_iter(struct kiocb *iocb_fuse,
if (ret != -EIOCBQUEUED)
fuse_aio_cleanup_handler(aio_req);
}
+ revert_creds(old_cred);
return ret;
}
@@ -85,6 +88,7 @@ ssize_t fuse_passthrough_write_iter(struct kiocb *iocb_fuse,
struct iov_iter *iter)
{
ssize_t ret;
+ const struct cred *old_cred;
struct file *fuse_filp = iocb_fuse->ki_filp;
struct fuse_file *ff = fuse_filp->private_data;
struct inode *fuse_inode = file_inode(fuse_filp);
@@ -96,6 +100,7 @@ ssize_t fuse_passthrough_write_iter(struct kiocb *iocb_fuse,
inode_lock(fuse_inode);
+ old_cred = override_creds(ff->passthrough.cred);
if (is_sync_kiocb(iocb_fuse)) {
file_start_write(passthrough_filp);
ret = vfs_iter_write(passthrough_filp, iter, &iocb_fuse->ki_pos,
@@ -124,6 +129,7 @@ ssize_t fuse_passthrough_write_iter(struct kiocb *iocb_fuse,
fuse_aio_cleanup_handler(aio_req);
}
out:
+ revert_creds(old_cred);
inode_unlock(fuse_inode);
return ret;
@@ -174,6 +180,7 @@ int fuse_passthrough_open(struct fuse_dev *fud,
}
passthrough->filp = passthrough_filp;
+ passthrough->cred = prepare_creds();
idr_preload(GFP_KERNEL);
spin_lock(&fc->passthrough_req_lock);
@@ -225,4 +232,8 @@ void fuse_passthrough_release(struct fuse_passthrough *passthrough)
fput(passthrough->filp);
passthrough->filp = NULL;
}
+ if (passthrough->cred) {
+ put_cred(passthrough->cred);
+ passthrough->cred = NULL;
+ }
}
--
2.30.0.280.ga3ce27912f-goog
On Wed, Jan 27, 2021 at 08:15:19PM +0800, qxy wrote:
> Hi Allesio,
>
> Thank you for your supply for 32-bit user space.
> Actually we have already met this issue on our product and we resolved it
> like this:
>
> Project *platform/external/libfuse*
> diff --git a/include/fuse_kernel.h b/include/fuse_kernel.h
> index e9d4f1a..fe0cb6d 100644
> --- a/include/fuse_kernel.h
> +++ b/include/fuse_kernel.h
> @@ -562,7 +562,11 @@
> uint32_t fd;
> /* For future implementation */
> uint32_t len;
> - void * vec;
> + union {
> + void * vec;
> + /* compatible for 32-bit libfuse and 64-bit kernel */
> + uint64_t padding;
> + };
> };
>
> However, if we need to use *vec in the future, we still need to do more in
> 32-bit libfuse to work with 64-bit kernel :(
>
Good point.
What I had in mind as a possible use was the possibility of enabling
passthrough for only a few regions of the file, similar to fuse2.
But it doesn't really make sense to define the data structure fields for
uncertain future extensions, so what we could do is:
struct fuse_passthrough_out {
+ uint32_t size; // Size of this data structure
uint32_t fd;
- /* For future implementation */
- uint32_t len;
- void *vec;
};
Similar to what "struct sched_attr" does.
This is probably the most flexible solution, that would allow to extend
this data structure in the future with no headaches both in kernel and
user space.
What do you think?
Thanks!
Alessio
Hi all,
I'm more than happy to change the interface into something that is
objectively better and accepted by everyone.
I would really love to reach the point at which we have a "stable-ish"
UAPI as soon as possible.
I've been thinking about a few possible approaches to fix the issue, yet
to preserve its flexibility. These are mentioned below.
Solution 1: Size
As mentioned in my previous email, one solution could be to introduce
the "size" field to allow the structure to grow in the future.
struct fuse_passthrough_out {
uint32_t size; // Size of this data structure
uint32_t fd;
};
The problem here is that we are making the promise that all the upcoming
fields are going to be maintained forever and at the offsets they were
originally defined.
Solution 2: Version
Another solution could be to s/size/version, where for every version of
FUSE passthrough we reserve the right to modifying the fields over time,
casting them to the right data structure according to the version.
Solution 3: Type
Using an enumerator to define the data structure content and purpose is
the most flexible solution I can think of. This would for example allow
us to substitute FUSE_DEV_IOC_PASSTHROUGH_OPEN with the generic
FUSE_DEV_IOC_PASSTHROUGH and having a single ioctl for any eventually
upcoming passthrough requests.
enum fuse_passthrough_type {
FUSE_PASSTHROUGH_OPEN
};
struct fuse_passthrough_out {
uint32_t type; /* as defined by enum fuse_passthrough_type */
union {
uint32_t fd;
};
};
This last is my favorite, as regardless the minimal logic required to
detect the size and content of the struct (not required now as we only
have a single option), it would also allow to do some kind of interface
versioning (e.g., in case we want to implement
FUSE_PASSTHROUGH_OPEN_V2).
What do you think?
Thanks,
Alessio
P.S.
Sorry if you received a duplicate email. I first sent this in reply to an email
without realizing it was a private message.
On Thu, Jan 28, 2021 at 11:01:59AM +0800, qxy wrote:
> Hi Alessio,
>
> I have received a failure from the Mail Delivery System for times and feel
> really sorry if you have already received the duplicate message...
>
> Thank you for your reply.
> I think it's wonderful to remove *vec from the data structure fields since
> we consider that it is not a good idea to use pointer when there is a need
> for cross-platform.
> Do you have a plan to modify the kernel fuse_passthrough_out data structure
> the same way as you mentioned?
>
> Thanks!
> qixiaoyu
On Mon, Jan 25, 2021 at 11:31 PM Alessio Balsini <[email protected]> wrote:
>
> When using FUSE passthrough, read/write operations are directly
> forwarded to the lower file system file through VFS, but there is no
> guarantee that the process that is triggering the request has the right
> permissions to access the lower file system. This would cause the
> read/write access to fail.
>
> In passthrough file systems, where the FUSE daemon is responsible for
> the enforcement of the lower file system access policies, often happens
> that the process dealing with the FUSE file system doesn't have access
> to the lower file system.
> Being the FUSE daemon in charge of implementing the FUSE file
> operations, that in the case of read/write operations usually simply
> results in the copy of memory buffers from/to the lower file system
> respectively, these operations are executed with the FUSE daemon
> privileges.
>
> This patch adds a reference to the FUSE daemon credentials, referenced
> at FUSE_DEV_IOC_PASSTHROUGH_OPEN ioctl() time so that they can be used
> to temporarily raise the user credentials when accessing lower file
> system files in passthrough.
> The process accessing the FUSE file with passthrough enabled temporarily
> receives the privileges of the FUSE daemon while performing read/write
> operations. Similar behavior is implemented in overlayfs.
> These privileges will be reverted as soon as the IO operation completes.
> This feature does not provide any higher security privileges to those
> processes accessing the FUSE file system with passthrough enabled. This
> is because it is still the FUSE daemon responsible for enabling or not
> the passthrough feature at file open time, and should enable the feature
> only after appropriate access policy checks.
>
> Signed-off-by: Alessio Balsini <[email protected]>
> ---
> fs/fuse/fuse_i.h | 5 ++++-
> fs/fuse/passthrough.c | 11 +++++++++++
> 2 files changed, 15 insertions(+), 1 deletion(-)
>
> diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
> index c4730d893324..815af1845b16 100644
> --- a/fs/fuse/fuse_i.h
> +++ b/fs/fuse/fuse_i.h
> @@ -182,10 +182,13 @@ struct fuse_release_args;
>
> /**
> * Reference to lower filesystem file for read/write operations handled in
> - * passthrough mode
> + * passthrough mode.
> + * This struct also tracks the credentials to be used for handling read/write
> + * operations.
> */
> struct fuse_passthrough {
> struct file *filp;
> + struct cred *cred;
> };
>
> /** FUSE specific file data */
> diff --git a/fs/fuse/passthrough.c b/fs/fuse/passthrough.c
> index c7fa1eeb7639..24866c5fe7e2 100644
> --- a/fs/fuse/passthrough.c
> +++ b/fs/fuse/passthrough.c
> @@ -52,6 +52,7 @@ ssize_t fuse_passthrough_read_iter(struct kiocb *iocb_fuse,
> struct iov_iter *iter)
> {
> ssize_t ret;
> + const struct cred *old_cred;
> struct file *fuse_filp = iocb_fuse->ki_filp;
> struct fuse_file *ff = fuse_filp->private_data;
> struct file *passthrough_filp = ff->passthrough.filp;
> @@ -59,6 +60,7 @@ ssize_t fuse_passthrough_read_iter(struct kiocb *iocb_fuse,
> if (!iov_iter_count(iter))
> return 0;
>
> + old_cred = override_creds(ff->passthrough.cred);
> if (is_sync_kiocb(iocb_fuse)) {
> ret = vfs_iter_read(passthrough_filp, iter, &iocb_fuse->ki_pos,
> iocb_to_rw_flags(iocb_fuse->ki_flags,
> @@ -77,6 +79,7 @@ ssize_t fuse_passthrough_read_iter(struct kiocb *iocb_fuse,
> if (ret != -EIOCBQUEUED)
> fuse_aio_cleanup_handler(aio_req);
> }
> + revert_creds(old_cred);
cred should be reverted when kmalloc() fails above.
Cheers,
Tao
--
Into Sth. Rich & Strange
On Fri, Feb 05, 2021 at 05:23:56PM +0800, Peng Tao wrote:
> On Mon, Jan 25, 2021 at 11:31 PM Alessio Balsini <[email protected]> wrote:
> >
> > When using FUSE passthrough, read/write operations are directly
> > forwarded to the lower file system file through VFS, but there is no
> > guarantee that the process that is triggering the request has the right
> > permissions to access the lower file system. This would cause the
> > read/write access to fail.
> >
> > In passthrough file systems, where the FUSE daemon is responsible for
> > the enforcement of the lower file system access policies, often happens
> > that the process dealing with the FUSE file system doesn't have access
> > to the lower file system.
> > Being the FUSE daemon in charge of implementing the FUSE file
> > operations, that in the case of read/write operations usually simply
> > results in the copy of memory buffers from/to the lower file system
> > respectively, these operations are executed with the FUSE daemon
> > privileges.
> >
> > This patch adds a reference to the FUSE daemon credentials, referenced
> > at FUSE_DEV_IOC_PASSTHROUGH_OPEN ioctl() time so that they can be used
> > to temporarily raise the user credentials when accessing lower file
> > system files in passthrough.
> > The process accessing the FUSE file with passthrough enabled temporarily
> > receives the privileges of the FUSE daemon while performing read/write
> > operations. Similar behavior is implemented in overlayfs.
> > These privileges will be reverted as soon as the IO operation completes.
> > This feature does not provide any higher security privileges to those
> > processes accessing the FUSE file system with passthrough enabled. This
> > is because it is still the FUSE daemon responsible for enabling or not
> > the passthrough feature at file open time, and should enable the feature
> > only after appropriate access policy checks.
> >
> > Signed-off-by: Alessio Balsini <[email protected]>
> > ---
> > fs/fuse/fuse_i.h | 5 ++++-
> > fs/fuse/passthrough.c | 11 +++++++++++
> > 2 files changed, 15 insertions(+), 1 deletion(-)
> >
> > diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
> > index c4730d893324..815af1845b16 100644
> > --- a/fs/fuse/fuse_i.h
> > +++ b/fs/fuse/fuse_i.h
> > @@ -182,10 +182,13 @@ struct fuse_release_args;
> >
> > /**
> > * Reference to lower filesystem file for read/write operations handled in
> > - * passthrough mode
> > + * passthrough mode.
> > + * This struct also tracks the credentials to be used for handling read/write
> > + * operations.
> > */
> > struct fuse_passthrough {
> > struct file *filp;
> > + struct cred *cred;
> > };
> >
> > /** FUSE specific file data */
> > diff --git a/fs/fuse/passthrough.c b/fs/fuse/passthrough.c
> > index c7fa1eeb7639..24866c5fe7e2 100644
> > --- a/fs/fuse/passthrough.c
> > +++ b/fs/fuse/passthrough.c
> > @@ -52,6 +52,7 @@ ssize_t fuse_passthrough_read_iter(struct kiocb *iocb_fuse,
> > struct iov_iter *iter)
> > {
> > ssize_t ret;
> > + const struct cred *old_cred;
> > struct file *fuse_filp = iocb_fuse->ki_filp;
> > struct fuse_file *ff = fuse_filp->private_data;
> > struct file *passthrough_filp = ff->passthrough.filp;
> > @@ -59,6 +60,7 @@ ssize_t fuse_passthrough_read_iter(struct kiocb *iocb_fuse,
> > if (!iov_iter_count(iter))
> > return 0;
> >
> > + old_cred = override_creds(ff->passthrough.cred);
> > if (is_sync_kiocb(iocb_fuse)) {
> > ret = vfs_iter_read(passthrough_filp, iter, &iocb_fuse->ki_pos,
> > iocb_to_rw_flags(iocb_fuse->ki_flags,
> > @@ -77,6 +79,7 @@ ssize_t fuse_passthrough_read_iter(struct kiocb *iocb_fuse,
> > if (ret != -EIOCBQUEUED)
> > fuse_aio_cleanup_handler(aio_req);
> > }
> > + revert_creds(old_cred);
> cred should be reverted when kmalloc() fails above.
>
> Cheers,
> Tao
> --
> Into Sth. Rich & Strange
Thanks Tao, definitely!
Please find the fixup at the bottom of this email.
I keep the WIP V13 here:
https://github.com/balsini/linux/tree/fuse-passthrough-v13-v5.11-rc5
Thanks,
Alessio
---8<---
From 63797a2cc6b3946bce59989adcb8f39f70f27643 Mon Sep 17 00:00:00 2001
From: Alessio Balsini <[email protected]>
Date: Fri, 5 Feb 2021 10:58:49 +0000
Subject: [PATCH] fuse: Fix crediantials leak in passthrough read_iter
If the system doesn't have enough memory when fuse_passthrough_read_iter
is requested in asynchronous IO, an error is directly returned without
restoring the caller's credentials.
Fix by always ensuring credentials are restored.
Fixes: [email protected] ("fuse: Use daemon creds in passthrough mode")
Signed-off-by: Alessio Balsini <[email protected]>
---
fs/fuse/passthrough.c | 7 +++++--
1 file changed, 5 insertions(+), 2 deletions(-)
diff --git a/fs/fuse/passthrough.c b/fs/fuse/passthrough.c
index 284979f87747..1df94c1d8a00 100644
--- a/fs/fuse/passthrough.c
+++ b/fs/fuse/passthrough.c
@@ -69,8 +69,10 @@ ssize_t fuse_passthrough_read_iter(struct kiocb *iocb_fuse,
struct fuse_aio_req *aio_req;
aio_req = kmalloc(sizeof(struct fuse_aio_req), GFP_KERNEL);
- if (!aio_req)
- return -ENOMEM;
+ if (!aio_req) {
+ ret = -ENOMEM;
+ goto out;
+ }
aio_req->iocb_fuse = iocb_fuse;
kiocb_clone(&aio_req->iocb, iocb_fuse, passthrough_filp);
@@ -79,6 +81,7 @@ ssize_t fuse_passthrough_read_iter(struct kiocb *iocb_fuse,
if (ret != -EIOCBQUEUED)
fuse_aio_cleanup_handler(aio_req);
}
+out:
revert_creds(old_cred);
return ret;
--
2.30.0.365.g02bc693789-goog
On Thu, Jan 28, 2021 at 10:15 PM Alessio Balsini <[email protected]> wrote:
>
> Hi all,
>
> I'm more than happy to change the interface into something that is
> objectively better and accepted by everyone.
> I would really love to reach the point at which we have a "stable-ish"
> UAPI as soon as possible.
>
> I've been thinking about a few possible approaches to fix the issue, yet
> to preserve its flexibility. These are mentioned below.
>
>
> Solution 1: Size
>
> As mentioned in my previous email, one solution could be to introduce
> the "size" field to allow the structure to grow in the future.
>
> struct fuse_passthrough_out {
> uint32_t size; // Size of this data structure
> uint32_t fd;
> };
>
> The problem here is that we are making the promise that all the upcoming
> fields are going to be maintained forever and at the offsets they were
> originally defined.
>
>
> Solution 2: Version
>
> Another solution could be to s/size/version, where for every version of
> FUSE passthrough we reserve the right to modifying the fields over time,
> casting them to the right data structure according to the version.
>
>
> Solution 3: Type
>
> Using an enumerator to define the data structure content and purpose is
> the most flexible solution I can think of. This would for example allow
> us to substitute FUSE_DEV_IOC_PASSTHROUGH_OPEN with the generic
> FUSE_DEV_IOC_PASSTHROUGH and having a single ioctl for any eventually
> upcoming passthrough requests.
>
> enum fuse_passthrough_type {
> FUSE_PASSTHROUGH_OPEN
> };
>
> struct fuse_passthrough_out {
> uint32_t type; /* as defined by enum fuse_passthrough_type */
> union {
> uint32_t fd;
> };
> };
>
> This last is my favorite, as regardless the minimal logic required to
> detect the size and content of the struct (not required now as we only
> have a single option), it would also allow to do some kind of interface
> versioning (e.g., in case we want to implement
> FUSE_PASSTHROUGH_OPEN_V2).
>
Usually a new type of ioctl will be added in such cases. If we want to
stick to the same ioctl number, it might be easier to add a `flags`
field to differentiate compatible passthrough ioctls. So in future, if
a new interface is compatible with the existing one, we can use flags
to tell it. If it is not compatible, we still need to add a new ioctl.
wdyt?
struct fuse_passthrough_out {
uint32_t flags;
union {
uint32_t fd;
};
};
This somehow follows the "Flags as a system call API design pattern"
(https://lwn.net/Articles/585415/).
Just my two cents.
Cheers,
Tao
--
Into Sth. Rich & Strange
On Mon, Jan 25, 2021 at 4:31 PM Alessio Balsini <[email protected]> wrote:
>
> With a 64-bit kernel build the FUSE device cannot handle ioctl requests
> coming from 32-bit user space.
> This is due to the ioctl command translation that generates different
> command identifiers that thus cannot be used for direct comparisons
> without proper manipulation.
>
> Explicitly extract type and number from the ioctl command to enable
> 32-bit user space compatibility on 64-bit kernel builds.
>
> Signed-off-by: Alessio Balsini <[email protected]>
> ---
> fs/fuse/dev.c | 29 ++++++++++++++++++-----------
> include/uapi/linux/fuse.h | 3 ++-
> 2 files changed, 20 insertions(+), 12 deletions(-)
>
> diff --git a/fs/fuse/dev.c b/fs/fuse/dev.c
> index 588f8d1240aa..ff9f3b83f879 100644
> --- a/fs/fuse/dev.c
> +++ b/fs/fuse/dev.c
> @@ -2233,37 +2233,44 @@ static int fuse_device_clone(struct fuse_conn *fc, struct file *new)
> static long fuse_dev_ioctl(struct file *file, unsigned int cmd,
> unsigned long arg)
> {
> - int err = -ENOTTY;
> + int res;
> + int oldfd;
> + struct fuse_dev *fud = NULL;
>
> - if (cmd == FUSE_DEV_IOC_CLONE) {
> - int oldfd;
> + if (_IOC_TYPE(cmd) != FUSE_DEV_IOC_MAGIC)
> + return -EINVAL;
Why change ENOTTY to EINVAL?
Thanks,
MIklos
On Mon, Jan 25, 2021 at 4:31 PM Alessio Balsini <[email protected]> wrote:
>
> Implement the FUSE passthrough ioctl that associates the lower
> (passthrough) file system file with the fuse_file.
>
> The file descriptor passed to the ioctl by the FUSE daemon is used to
> access the relative file pointer, that will be copied to the fuse_file
> data structure to consolidate the link between the FUSE and lower file
> system.
>
> To enable the passthrough mode, user space triggers the
> FUSE_DEV_IOC_PASSTHROUGH_OPEN ioctl and, if the call succeeds, receives
> back an identifier that will be used at open/create response time in the
> fuse_open_out field to associate the FUSE file to the lower file system
> file.
> The value returned by the ioctl to user space can be:
> - > 0: success, the identifier can be used as part of an open/create
> reply.
> - <= 0: an error occurred.
> The value 0 represents an error to preserve backward compatibility: the
> fuse_open_out field that is used to pass the passthrough_fh back to the
> kernel uses the same bits that were previously as struct padding, and is
> commonly zero-initialized (e.g., in the libfuse implementation).
> Removing 0 from the correct values fixes the ambiguity between the case
> in which 0 corresponds to a real passthrough_fh, a missing
> implementation of FUSE passthrough or a request for a normal FUSE file,
> simplifying the user space implementation.
>
> For the passthrough mode to be successfully activated, the lower file
> system file must implement both read_iter and write_iter file
> operations. This extra check avoids special pseudo files to be targeted
> for this feature.
> Passthrough comes with another limitation: no further file system
> stacking is allowed for those FUSE file systems using passthrough.
>
> Signed-off-by: Alessio Balsini <[email protected]>
> ---
> fs/fuse/inode.c | 5 +++
> fs/fuse/passthrough.c | 87 ++++++++++++++++++++++++++++++++++++++++++-
> 2 files changed, 90 insertions(+), 2 deletions(-)
>
> diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
> index a1104d5abb70..7ebc398fbacb 100644
> --- a/fs/fuse/inode.c
> +++ b/fs/fuse/inode.c
> @@ -1133,6 +1133,11 @@ EXPORT_SYMBOL_GPL(fuse_send_init);
>
> static int free_fuse_passthrough(int id, void *p, void *data)
> {
> + struct fuse_passthrough *passthrough = (struct fuse_passthrough *)p;
> +
> + fuse_passthrough_release(passthrough);
> + kfree(p);
> +
> return 0;
> }
>
> diff --git a/fs/fuse/passthrough.c b/fs/fuse/passthrough.c
> index 594060c654f8..cf993e83803e 100644
> --- a/fs/fuse/passthrough.c
> +++ b/fs/fuse/passthrough.c
> @@ -3,19 +3,102 @@
> #include "fuse_i.h"
>
> #include <linux/fuse.h>
> +#include <linux/idr.h>
>
> int fuse_passthrough_open(struct fuse_dev *fud,
> struct fuse_passthrough_out *pto)
> {
> - return -EINVAL;
> + int res;
> + struct file *passthrough_filp;
> + struct fuse_conn *fc = fud->fc;
> + struct inode *passthrough_inode;
> + struct super_block *passthrough_sb;
> + struct fuse_passthrough *passthrough;
> +
> + if (!fc->passthrough)
> + return -EPERM;
> +
> + /* This field is reserved for future implementation */
> + if (pto->len != 0)
> + return -EINVAL;
> +
> + passthrough_filp = fget(pto->fd);
> + if (!passthrough_filp) {
> + pr_err("FUSE: invalid file descriptor for passthrough.\n");
> + return -EBADF;
> + }
> +
> + if (!passthrough_filp->f_op->read_iter ||
> + !passthrough_filp->f_op->write_iter) {
> + pr_err("FUSE: passthrough file misses file operations.\n");
> + res = -EBADF;
> + goto err_free_file;
> + }
> +
> + passthrough_inode = file_inode(passthrough_filp);
> + passthrough_sb = passthrough_inode->i_sb;
> + if (passthrough_sb->s_stack_depth >= FILESYSTEM_MAX_STACK_DEPTH) {
> + pr_err("FUSE: fs stacking depth exceeded for passthrough\n");
No need to print an error to the logs, this can be a perfectly normal
occurrence. However I'd try to find a more unique error value than
EINVAL so that the fuse server can interpret this as "not your fault,
but can't support passthrough on this file". E.g. EOPNOTSUPP.
> + res = -EINVAL;
> + goto err_free_file;
> + }
> +
> + passthrough = kmalloc(sizeof(struct fuse_passthrough), GFP_KERNEL);
> + if (!passthrough) {
> + res = -ENOMEM;
> + goto err_free_file;
> + }
> +
> + passthrough->filp = passthrough_filp;
> +
> + idr_preload(GFP_KERNEL);
> + spin_lock(&fc->passthrough_req_lock);
Should be okay to use fc->lock, since neither adding nor removing the
passthrough ID should be a heavily used operation, and querying the
mapping is lockless.
Thanks,
Miklos
On Mon, Jan 25, 2021 at 4:31 PM Alessio Balsini <[email protected]> wrote:
>
> Expose the FUSE_PASSTHROUGH interface to user space and declare all the
> basic data structures and functions as the skeleton on top of which the
> FUSE passthrough functionality will be built.
>
> As part of this, introduce the new FUSE passthrough ioctl, which allows
> the FUSE daemon to specify a direct connection between a FUSE file and a
> lower file system file. Such ioctl requires user space to pass the file
> descriptor of one of its opened files through the fuse_passthrough_out
> data structure introduced in this patch. This structure includes extra
> fields for possible future extensions.
> Also, add the passthrough functions for the set-up and tear-down of the
> data structures and locks that will be used both when fuse_conns and
> fuse_files are created/deleted.
>
> Signed-off-by: Alessio Balsini <[email protected]>
[...]
> diff --git a/include/uapi/linux/fuse.h b/include/uapi/linux/fuse.h
> index 54442612c48b..9d7685ce0acd 100644
> --- a/include/uapi/linux/fuse.h
> +++ b/include/uapi/linux/fuse.h
> @@ -360,6 +360,7 @@ struct fuse_file_lock {
> #define FUSE_MAP_ALIGNMENT (1 << 26)
> #define FUSE_SUBMOUNTS (1 << 27)
> #define FUSE_HANDLE_KILLPRIV_V2 (1 << 28)
> +#define FUSE_PASSTHROUGH (1 << 29)
This header has a version and a changelog. Please update those as well.
>
> /**
> * CUSE INIT request/reply flags
> @@ -625,7 +626,7 @@ struct fuse_create_in {
> struct fuse_open_out {
> uint64_t fh;
> uint32_t open_flags;
> - uint32_t padding;
> + uint32_t passthrough_fh;
I think it would be cleaner to add a FOPEN_PASSTHROUGH flag to
explicitly request passthrough instead of just passing a non-null
value to passthrough_fh.
> };
>
> struct fuse_release_in {
> @@ -828,6 +829,13 @@ struct fuse_in_header {
> uint32_t padding;
> };
>
> +struct fuse_passthrough_out {
> + uint32_t fd;
> + /* For future implementation */
> + uint32_t len;
> + void *vec;
> +};
I don't see why we'd need these extensions. The ioctl just needs to
establish an ID to open file mapping that can be referenced on the
regular protocol, i.e. it just needs to be passed an open file
descriptor and return an unique ID.
Mapping the fuse file's data to the underlying file's data is a
different matter. That can be an identity mapping established at open
time (this is what this series does) or it can be an arbitrary extent
mapping to one or more underlying open files, established at open time
or on demand. All of these can be done in band using the fuse
protocol, no need to involve the ioctl mechanism.
So I think we can just get rid of "struct fuse_passthrough_out"
completely and use "uint32_t *" as the ioctl argument.
What I think would be useful is to have an explicit
FUSE_DEV_IOC_PASSTHROUGH_CLOSE ioctl, that would need to be called
once the fuse server no longer needs this ID. If this turns out to
be a performance problem, we could still add the auto-close behavior
with an explicit FOPEN_PASSTHROUGH_AUTOCLOSE flag later.
Thanks,
Miklos
On Mon, Jan 25, 2021 at 4:31 PM Alessio Balsini <[email protected]> wrote:
>
> Enabling FUSE passthrough for mmap-ed operations not only affects
> performance, but has also been shown as mandatory for the correct
> functioning of FUSE passthrough.
> yanwu noticed [1] that a FUSE file with passthrough enabled may suffer
> data inconsistencies if the same file is also accessed with mmap. What
> happens is that read/write operations are directly applied to the lower
> file system (and its cache), while mmap-ed operations are affecting the
> FUSE cache.
>
> Extend the FUSE passthrough implementation to also handle memory-mapped
> FUSE file, to both fix the cache inconsistencies and extend the
> passthrough performance benefits to mmap-ed operations.
>
> [1] https://lore.kernel.org/lkml/[email protected]/
>
> Signed-off-by: Alessio Balsini <[email protected]>
> ---
> fs/fuse/file.c | 3 +++
> fs/fuse/fuse_i.h | 1 +
> fs/fuse/passthrough.c | 41 +++++++++++++++++++++++++++++++++++++++++
> 3 files changed, 45 insertions(+)
>
> diff --git a/fs/fuse/file.c b/fs/fuse/file.c
> index cddada1e8bd9..e3741a94c1f9 100644
> --- a/fs/fuse/file.c
> +++ b/fs/fuse/file.c
> @@ -2370,6 +2370,9 @@ static int fuse_file_mmap(struct file *file, struct vm_area_struct *vma)
> if (FUSE_IS_DAX(file_inode(file)))
> return fuse_dax_mmap(file, vma);
>
> + if (ff->passthrough.filp)
> + return fuse_passthrough_mmap(file, vma);
> +
> if (ff->open_flags & FOPEN_DIRECT_IO) {
> /* Can't provide the coherency needed for MAP_SHARED */
> if (vma->vm_flags & VM_MAYSHARE)
> diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
> index 815af1845b16..7b0d65984608 100644
> --- a/fs/fuse/fuse_i.h
> +++ b/fs/fuse/fuse_i.h
> @@ -1244,5 +1244,6 @@ int fuse_passthrough_setup(struct fuse_conn *fc, struct fuse_file *ff,
> void fuse_passthrough_release(struct fuse_passthrough *passthrough);
> ssize_t fuse_passthrough_read_iter(struct kiocb *iocb, struct iov_iter *to);
> ssize_t fuse_passthrough_write_iter(struct kiocb *iocb, struct iov_iter *from);
> +ssize_t fuse_passthrough_mmap(struct file *file, struct vm_area_struct *vma);
>
> #endif /* _FS_FUSE_I_H */
> diff --git a/fs/fuse/passthrough.c b/fs/fuse/passthrough.c
> index 24866c5fe7e2..284979f87747 100644
> --- a/fs/fuse/passthrough.c
> +++ b/fs/fuse/passthrough.c
> @@ -135,6 +135,47 @@ ssize_t fuse_passthrough_write_iter(struct kiocb *iocb_fuse,
> return ret;
> }
>
> +ssize_t fuse_passthrough_mmap(struct file *file, struct vm_area_struct *vma)
> +{
> + int ret;
> + const struct cred *old_cred;
> + struct fuse_file *ff = file->private_data;
> + struct inode *fuse_inode = file_inode(file);
> + struct file *passthrough_filp = ff->passthrough.filp;
> + struct inode *passthrough_inode = file_inode(passthrough_filp);
> +
> + if (!passthrough_filp->f_op->mmap)
> + return -ENODEV;
> +
> + if (WARN_ON(file != vma->vm_file))
> + return -EIO;
> +
> + vma->vm_file = get_file(passthrough_filp);
> +
> + old_cred = override_creds(ff->passthrough.cred);
> + ret = call_mmap(vma->vm_file, vma);
> + revert_creds(old_cred);
> +
> + if (ret)
> + fput(passthrough_filp);
> + else
> + fput(file);
> +
> + if (file->f_flags & O_NOATIME)
> + return ret;
> +
> + if ((!timespec64_equal(&fuse_inode->i_mtime,
> + &passthrough_inode->i_mtime) ||
> + !timespec64_equal(&fuse_inode->i_ctime,
> + &passthrough_inode->i_ctime))) {
> + fuse_inode->i_mtime = passthrough_inode->i_mtime;
> + fuse_inode->i_ctime = passthrough_inode->i_ctime;
Again, violation of rules. Not sure why this is needed, mmap(2)
isn't supposed to change mtime or ctime, AFAIK.
Thanks,
Miklos
On Wed, Feb 17, 2021 at 9:41 PM Miklos Szeredi <[email protected]> wrote:
>
> On Mon, Jan 25, 2021 at 4:31 PM Alessio Balsini <[email protected]> wrote:
> >
> > Expose the FUSE_PASSTHROUGH interface to user space and declare all the
> > basic data structures and functions as the skeleton on top of which the
> > FUSE passthrough functionality will be built.
> >
> > As part of this, introduce the new FUSE passthrough ioctl, which allows
> > the FUSE daemon to specify a direct connection between a FUSE file and a
> > lower file system file. Such ioctl requires user space to pass the file
> > descriptor of one of its opened files through the fuse_passthrough_out
> > data structure introduced in this patch. This structure includes extra
> > fields for possible future extensions.
> > Also, add the passthrough functions for the set-up and tear-down of the
> > data structures and locks that will be used both when fuse_conns and
> > fuse_files are created/deleted.
> >
> > Signed-off-by: Alessio Balsini <[email protected]>
>
> [...]
>
> > diff --git a/include/uapi/linux/fuse.h b/include/uapi/linux/fuse.h
> > index 54442612c48b..9d7685ce0acd 100644
> > --- a/include/uapi/linux/fuse.h
> > +++ b/include/uapi/linux/fuse.h
> > @@ -360,6 +360,7 @@ struct fuse_file_lock {
> > #define FUSE_MAP_ALIGNMENT (1 << 26)
> > #define FUSE_SUBMOUNTS (1 << 27)
> > #define FUSE_HANDLE_KILLPRIV_V2 (1 << 28)
> > +#define FUSE_PASSTHROUGH (1 << 29)
>
> This header has a version and a changelog. Please update those as well.
>
> >
> > /**
> > * CUSE INIT request/reply flags
> > @@ -625,7 +626,7 @@ struct fuse_create_in {
> > struct fuse_open_out {
> > uint64_t fh;
> > uint32_t open_flags;
> > - uint32_t padding;
> > + uint32_t passthrough_fh;
>
> I think it would be cleaner to add a FOPEN_PASSTHROUGH flag to
> explicitly request passthrough instead of just passing a non-null
> value to passthrough_fh.
>
> > };
> >
> > struct fuse_release_in {
> > @@ -828,6 +829,13 @@ struct fuse_in_header {
> > uint32_t padding;
> > };
> >
> > +struct fuse_passthrough_out {
> > + uint32_t fd;
> > + /* For future implementation */
> > + uint32_t len;
> > + void *vec;
> > +};
>
> I don't see why we'd need these extensions. The ioctl just needs to
> establish an ID to open file mapping that can be referenced on the
> regular protocol, i.e. it just needs to be passed an open file
> descriptor and return an unique ID.
>
> Mapping the fuse file's data to the underlying file's data is a
> different matter. That can be an identity mapping established at open
> time (this is what this series does) or it can be an arbitrary extent
> mapping to one or more underlying open files, established at open time
> or on demand. All of these can be done in band using the fuse
> protocol, no need to involve the ioctl mechanism.
>
> So I think we can just get rid of "struct fuse_passthrough_out"
> completely and use "uint32_t *" as the ioctl argument.
>
> What I think would be useful is to have an explicit
> FUSE_DEV_IOC_PASSTHROUGH_CLOSE ioctl, that would need to be called
> once the fuse server no longer needs this ID. If this turns out to
> be a performance problem, we could still add the auto-close behavior
> with an explicit FOPEN_PASSTHROUGH_AUTOCLOSE flag later.
Hi Miklos,
W/o auto closing, what happens if user space daemon forgets to call
FUSE_DEV_IOC_PASSTHROUGH_CLOSE? Do we keep the ID alive somewhere?
Thanks,
Tao
--
Into Sth. Rich & Strange
On Fri, Feb 19, 2021 at 8:05 AM Peng Tao <[email protected]> wrote:
>
> On Wed, Feb 17, 2021 at 9:41 PM Miklos Szeredi <[email protected]> wrote:
> > What I think would be useful is to have an explicit
> > FUSE_DEV_IOC_PASSTHROUGH_CLOSE ioctl, that would need to be called
> > once the fuse server no longer needs this ID. If this turns out to
> > be a performance problem, we could still add the auto-close behavior
> > with an explicit FOPEN_PASSTHROUGH_AUTOCLOSE flag later.
> Hi Miklos,
>
> W/o auto closing, what happens if user space daemon forgets to call
> FUSE_DEV_IOC_PASSTHROUGH_CLOSE? Do we keep the ID alive somewhere?
Kernel would keep the ID open until explicit close or fuse connection
is released.
There should be some limit on the max open files referenced through
ID's, though. E.g. inherit RLIMIT_NOFILE from mounting task.
Thanks,
Miklos
On Wed, Feb 17, 2021 at 11:21:04AM +0100, Miklos Szeredi wrote:
> On Mon, Jan 25, 2021 at 4:31 PM Alessio Balsini <[email protected]> wrote:
> >
> > With a 64-bit kernel build the FUSE device cannot handle ioctl requests
> > coming from 32-bit user space.
> > This is due to the ioctl command translation that generates different
> > command identifiers that thus cannot be used for direct comparisons
> > without proper manipulation.
> >
> > Explicitly extract type and number from the ioctl command to enable
> > 32-bit user space compatibility on 64-bit kernel builds.
> >
> > Signed-off-by: Alessio Balsini <[email protected]>
> > ---
> > fs/fuse/dev.c | 29 ++++++++++++++++++-----------
> > include/uapi/linux/fuse.h | 3 ++-
> > 2 files changed, 20 insertions(+), 12 deletions(-)
> >
> > diff --git a/fs/fuse/dev.c b/fs/fuse/dev.c
> > index 588f8d1240aa..ff9f3b83f879 100644
> > --- a/fs/fuse/dev.c
> > +++ b/fs/fuse/dev.c
> > @@ -2233,37 +2233,44 @@ static int fuse_device_clone(struct fuse_conn *fc, struct file *new)
> > static long fuse_dev_ioctl(struct file *file, unsigned int cmd,
> > unsigned long arg)
> > {
> > - int err = -ENOTTY;
> > + int res;
> > + int oldfd;
> > + struct fuse_dev *fud = NULL;
> >
> > - if (cmd == FUSE_DEV_IOC_CLONE) {
> > - int oldfd;
> > + if (_IOC_TYPE(cmd) != FUSE_DEV_IOC_MAGIC)
> > + return -EINVAL;
>
> Why change ENOTTY to EINVAL?
>
> Thanks,
> MIklos
For the magic number mismatch I was thinking that EINVAL would have been
more appropriate, meaning: "this ioctl is definitely something this file
doesn't support". ENOTTY, could be more specific to the subset of
ioctls supported by the file, so I kept this in the default case of the
switch.
After counting the use of ENOTTY vs EINVAL for these _IOC_TYPE checks,
EINVALs were less than half ENOTTYs, so I'm totally fine with switching
back to ENOTTY in V13.
Thanks!
Alessio
On Fri, Feb 19, 2021 at 09:40:21AM +0100, Miklos Szeredi wrote:
> On Fri, Feb 19, 2021 at 8:05 AM Peng Tao <[email protected]> wrote:
> >
> > On Wed, Feb 17, 2021 at 9:41 PM Miklos Szeredi <[email protected]> wrote:
>
> > > What I think would be useful is to have an explicit
> > > FUSE_DEV_IOC_PASSTHROUGH_CLOSE ioctl, that would need to be called
> > > once the fuse server no longer needs this ID. If this turns out to
> > > be a performance problem, we could still add the auto-close behavior
> > > with an explicit FOPEN_PASSTHROUGH_AUTOCLOSE flag later.
> > Hi Miklos,
> >
> > W/o auto closing, what happens if user space daemon forgets to call
> > FUSE_DEV_IOC_PASSTHROUGH_CLOSE? Do we keep the ID alive somewhere?
>
> Kernel would keep the ID open until explicit close or fuse connection
> is released.
>
> There should be some limit on the max open files referenced through
> ID's, though. E.g. inherit RLIMIT_NOFILE from mounting task.
>
> Thanks,
> Miklos
I like the idea of FUSE_DEV_IOC_PASSTHROUGH_CLOSE to revoke the
passthrough access, that is something I was already working on. What I
had in mind was simply to break that 1:1 connection between fuse_file
and lower filp setting a specific fuse_file::passthrough::filp to NULL,
but this is slightly different from what you mentioned.
AFAIU you are suggesting to allocate one ID for each lower fs file
opened with passthrough within a connection, and maybe using idr_find at
every read/write/mmap operation to check if passthrough is enabled on
that file. Something similar to fuse2_map_get().
This way the fuse server can pass the same ID to one or more
fuse_file(s).
FUSE_DEV_IOC_PASSTHROUGH_CLOSE would idr_remove the ID, so idr_find
would fail, preventing the use of passthrough on that ID. CMIIW.
After FUSE_DEV_IOC_PASSTHROUGH_CLOSE(ID) it may happen that if some
fuse_file(s) storing that ID are still open and the same ID is reclaimed
in a new idr_alloc, this would lead to mismatching lower fs filp being
used by our fuse_file(s). So also the ID stored in the fuse_file(s)
must be invalidated to prevent future uses of deallocated IDs.
Would it make sense to have a list of fuse_files using the same ID, that
must be traversed at FUSE_DEV_IOC_PASSTHROUGH_CLOSE time?
Negative values (maybe -ENOENT) might be used to mark IDs as invalid,
and tested before idr_find at read/write/mmap to avoid the idr_find
complexity in case passthrough is disabled for that file.
What do you think?
I agree with all the above comments to this patch, i.e., add
FOPEN_PASSTHROUGH, drop fuse_passthrough_out, header version+changelog,
that will be fixed in V13.
Thanks,
Alessio
On Mon, Jan 25, 2021 at 4:48 PM Alessio Balsini <[email protected]> wrote:
>
> With a 64-bit kernel build the FUSE device cannot handle ioctl requests
> coming from 32-bit user space.
> This is due to the ioctl command translation that generates different
> command identifiers that thus cannot be used for direct comparisons
> without proper manipulation.
>
> Explicitly extract type and number from the ioctl command to enable
> 32-bit user space compatibility on 64-bit kernel builds.
>
> Signed-off-by: Alessio Balsini <[email protected]>
I saw this commit go into the mainline kernel, and I'm worried that this
doesn't do what the description says. Since the argument is a 'uint32_t',
it is the same on both 32-bit and 64-bit user space, and the patch won't
make any difference for compat mode, as long as that is using the normal
uapi headers.
If there is any user space that has a different definition of
FUSE_DEV_IOC_CLONE, that may now successfully call
this ioctl command, but the kernel will now also accept any other
command code that has the same type and number, but an
arbitrary direction or size argument.
I think this should be changed back to specifically allow the
command code(s) that are actually used and nothing else.
Arnd
On Thu, Jan 28, 2021 at 3:17 PM Alessio Balsini <[email protected]> wrote:
>
> Hi all,
>
> I'm more than happy to change the interface into something that is
> objectively better and accepted by everyone.
> I would really love to reach the point at which we have a "stable-ish"
> UAPI as soon as possible.
It's in the mainline kernel, so you already have a stable uapi and
cannot change that in any incompatible way!
> I've been thinking about a few possible approaches to fix the issue, yet
> to preserve its flexibility. These are mentioned below.
>
>
> Solution 1: Size
>
> As mentioned in my previous email, one solution could be to introduce
> the "size" field to allow the structure to grow in the future.
>
> struct fuse_passthrough_out {
> uint32_t size; // Size of this data structure
> uint32_t fd;
> };
>
> The problem here is that we are making the promise that all the upcoming
> fields are going to be maintained forever and at the offsets they were
> originally defined.
>
>
> Solution 2: Version
>
> Another solution could be to s/size/version, where for every version of
> FUSE passthrough we reserve the right to modifying the fields over time,
> casting them to the right data structure according to the version.
Please read Documentation/driver-api/ioctl.rst for how to design
ioctls. Neither 'size' nor 'version' fields are appropriate here. If you
have a new behavior, you need a new command code.
Arnd
On Tue, Mar 16, 2021 at 07:53:06PM +0100, Arnd Bergmann wrote:
> On Mon, Jan 25, 2021 at 4:48 PM Alessio Balsini <[email protected]> wrote:
> >
> > With a 64-bit kernel build the FUSE device cannot handle ioctl requests
> > coming from 32-bit user space.
> > This is due to the ioctl command translation that generates different
> > command identifiers that thus cannot be used for direct comparisons
> > without proper manipulation.
> >
> > Explicitly extract type and number from the ioctl command to enable
> > 32-bit user space compatibility on 64-bit kernel builds.
> >
> > Signed-off-by: Alessio Balsini <[email protected]>
>
> I saw this commit go into the mainline kernel, and I'm worried that this
> doesn't do what the description says. Since the argument is a 'uint32_t',
> it is the same on both 32-bit and 64-bit user space, and the patch won't
> make any difference for compat mode, as long as that is using the normal
> uapi headers.
>
> If there is any user space that has a different definition of
> FUSE_DEV_IOC_CLONE, that may now successfully call
> this ioctl command, but the kernel will now also accept any other
> command code that has the same type and number, but an
> arbitrary direction or size argument.
>
> I think this should be changed back to specifically allow the
> command code(s) that are actually used and nothing else.
>
> Arnd
Hi Arnd,
Thanks for spotting this possible criticality.
I noticed that 32-bit users pace was unable to use the
FUSE_DEV_IOC_CLONE ioctl on 64-bit kernels, so this change avoid this
issue by forcing the kernel to interpret 32 and 64 bit
FUSE_DEV_IOC_CLONE command as if they were the same.
This is the simplest solution I could find as the UAPI is not changed
as, as you mentioned, the argument doesn't require any conversion.
I understand that this might limit possible future extensions of the
FUSE_DEV_IOC_XXX ioctls if their in/out argument changed depending on
the architecture, but only at that point we can switch to using the
compat layer, right?
What I'm worried about is the direction, do you think this would be an
issue?
I can start working on a compat layer fix meanwhile.
Thanks,
Alessio
On Thu, Mar 18, 2021 at 5:13 PM Alessio Balsini <[email protected]> wrote:
> On Tue, Mar 16, 2021 at 07:53:06PM +0100, Arnd Bergmann wrote:
> > On Mon, Jan 25, 2021 at 4:48 PM Alessio Balsini <[email protected]> wrote:
> > >
>
> Thanks for spotting this possible criticality.
>
> I noticed that 32-bit users pace was unable to use the
> FUSE_DEV_IOC_CLONE ioctl on 64-bit kernels, so this change avoid this
> issue by forcing the kernel to interpret 32 and 64 bit
> FUSE_DEV_IOC_CLONE command as if they were the same.
As far as I can tell from the kernel headers, the command code should
be the same for both 32-bit and 64-bit tasks: 0x8004e500.
Can you find out what exact value you see in the user space that was
causing problems, and how it ended up with a different value than
the 64-bit version?
If there are two possible command codes, I'd suggest you just change
the driver to handle both variants explicitly, but not any other one.
> This is the simplest solution I could find as the UAPI is not changed
> as, as you mentioned, the argument doesn't require any conversion.
>
> I understand that this might limit possible future extensions of the
> FUSE_DEV_IOC_XXX ioctls if their in/out argument changed depending on
> the architecture, but only at that point we can switch to using the
> compat layer, right?
>
> What I'm worried about is the direction, do you think this would be an
> issue?
>
> I can start working on a compat layer fix meanwhile.
For a proper well-designed ioctl interface, compat support should not
need anything beyond the '.compat_ioctl = compat_ptr_ioctl'
assignment.
Arnd
On Thu, Mar 18, 2021 at 10:15:43PM +0100, Arnd Bergmann wrote:
> On Thu, Mar 18, 2021 at 5:13 PM Alessio Balsini <[email protected]> wrote:
> > On Tue, Mar 16, 2021 at 07:53:06PM +0100, Arnd Bergmann wrote:
> > > On Mon, Jan 25, 2021 at 4:48 PM Alessio Balsini <[email protected]> wrote:
> > > >
> >
> > Thanks for spotting this possible criticality.
> >
> > I noticed that 32-bit users pace was unable to use the
> > FUSE_DEV_IOC_CLONE ioctl on 64-bit kernels, so this change avoid this
> > issue by forcing the kernel to interpret 32 and 64 bit
> > FUSE_DEV_IOC_CLONE command as if they were the same.
>
> As far as I can tell from the kernel headers, the command code should
> be the same for both 32-bit and 64-bit tasks: 0x8004e500.
> Can you find out what exact value you see in the user space that was
> causing problems, and how it ended up with a different value than
> the 64-bit version?
>
> If there are two possible command codes, I'd suggest you just change
> the driver to handle both variants explicitly, but not any other one.
>
> > This is the simplest solution I could find as the UAPI is not changed
> > as, as you mentioned, the argument doesn't require any conversion.
> >
> > I understand that this might limit possible future extensions of the
> > FUSE_DEV_IOC_XXX ioctls if their in/out argument changed depending on
> > the architecture, but only at that point we can switch to using the
> > compat layer, right?
> >
> > What I'm worried about is the direction, do you think this would be an
> > issue?
> >
> > I can start working on a compat layer fix meanwhile.
>
> For a proper well-designed ioctl interface, compat support should not
> need anything beyond the '.compat_ioctl = compat_ptr_ioctl'
> assignment.
>
> Arnd
You are right and now I can see what happened here.
When I introduce the PASSTHROUGH ioctl, because of the 'void *', the
command mismatches on _IOC_SIZE(nr). I solved this by only testing
_IOC_NUMBER and _IOC_TYPE, implicitely (mistakenly) removing the
_IOC_SIZE check. So, the fuse_dev_ioctl was correctly rejecting the
ioctl request from 32-bit userspace because of the wrong size and I was
just forcing it to digest the wrong data regardless.
Ouch!
The commit message for this patch would still be incorrect, but I posted
a fix here to bring the ioctl checking back to the original behavior:
https://lore.kernel.org/lkml/[email protected]/
Thanks for spotting this!
Alessio
On 1/26/21 12:46 AM, Alessio Balsini wrote:
> On Mon, Jan 25, 2021 at 03:30:50PM +0000, Alessio Balsini wrote:
>> OverlayFS implements its own function to translate iocb flags into rw
>> flags, so that they can be passed into another vfs call.
>> With commit ce71bfea207b4 ("fs: align IOCB_* flags with RWF_* flags")
>> Jens created a 1:1 matching between the iocb flags and rw flags,
>> simplifying the conversion.
>>
>> Reduce the OverlayFS code by making the flag conversion function generic
>> and reusable.
>>
>> Signed-off-by: Alessio Balsini <[email protected]>
>> ---
>> fs/overlayfs/file.c | 23 +++++------------------
>> include/linux/fs.h | 5 +++++
>> 2 files changed, 10 insertions(+), 18 deletions(-)
>>
>> diff --git a/fs/overlayfs/file.c b/fs/overlayfs/file.c
>> index bd9dd38347ae..56be2ffc5a14 100644
>> --- a/fs/overlayfs/file.c
>> +++ b/fs/overlayfs/file.c
>> @@ -15,6 +15,8 @@
>> #include <linux/fs.h>
>> #include "overlayfs.h"
>>
>> +#define OVL_IOCB_MASK (IOCB_DSYNC | IOCB_HIPRI | IOCB_NOWAIT | IOCB_SYNC)
>> +
>> struct ovl_aio_req {
>> struct kiocb iocb;
>> struct kiocb *orig_iocb;
>> @@ -236,22 +238,6 @@ static void ovl_file_accessed(struct file *file)
>> touch_atime(&file->f_path);
>> }
>>
>> -static rwf_t ovl_iocb_to_rwf(int ifl)
>> -{
>> - rwf_t flags = 0;
>> -
>> - if (ifl & IOCB_NOWAIT)
>> - flags |= RWF_NOWAIT;
>> - if (ifl & IOCB_HIPRI)
>> - flags |= RWF_HIPRI;
>> - if (ifl & IOCB_DSYNC)
>> - flags |= RWF_DSYNC;
>> - if (ifl & IOCB_SYNC)
>> - flags |= RWF_SYNC;
>> -
>> - return flags;
>> -}
>> -
>> static void ovl_aio_cleanup_handler(struct ovl_aio_req *aio_req)
>> {
>> struct kiocb *iocb = &aio_req->iocb;
>> @@ -299,7 +285,8 @@ static ssize_t ovl_read_iter(struct kiocb *iocb, struct iov_iter *iter)
>> old_cred = ovl_override_creds(file_inode(file)->i_sb);
>> if (is_sync_kiocb(iocb)) {
>> ret = vfs_iter_read(real.file, iter, &iocb->ki_pos,
>> - ovl_iocb_to_rwf(iocb->ki_flags));
>> + iocb_to_rw_flags(iocb->ki_flags,
>> + OVL_IOCB_MASK));
>> } else {
>> struct ovl_aio_req *aio_req;
>>
>> @@ -356,7 +343,7 @@ static ssize_t ovl_write_iter(struct kiocb *iocb, struct iov_iter *iter)
>> if (is_sync_kiocb(iocb)) {
>> file_start_write(real.file);
>> ret = vfs_iter_write(real.file, iter, &iocb->ki_pos,
>> - ovl_iocb_to_rwf(ifl));
>> + iocb_to_rw_flags(ifl, OVL_IOCB_MASK));
>> file_end_write(real.file);
>> /* Update size */
>> ovl_copyattr(ovl_inode_real(inode), inode);
>> diff --git a/include/linux/fs.h b/include/linux/fs.h
>> index fd47deea7c17..647c35423545 100644
>> --- a/include/linux/fs.h
>> +++ b/include/linux/fs.h
>> @@ -3275,6 +3275,11 @@ static inline int kiocb_set_rw_flags(struct kiocb *ki, rwf_t flags)
>> return 0;
>> }
>>
>> +static inline rwf_t iocb_to_rw_flags(int ifl, int iocb_mask)
>> +{
>> + return ifl & iocb_mask;
>> +}
>> +
>> static inline ino_t parent_ino(struct dentry *dentry)
>> {
>> ino_t res;
>> --
>> 2.30.0.280.ga3ce27912f-goog
>>
>
> For some reason lkml.org and lore.kernel.org are not showing this change
> as part of the thread.
> Let's see if replying to the email fixes the indexing.
>
> Regards,
> Alessio
>
Hi, Alessio
This change imply IOCB_* and RWF_* flags are properly aligned, which is
not true for kernel version 5.4/4.19/4.14. As the patch ("fs: align
IOCB_* flags with RWF_* flags") is not back-ported to these stable
kernel branches. The issue was found when applying these patches
to kernel-5.4(files open with passthrough enabled can't do append
write). I think the issue exists in AOSP common kernel too.
Could you please fix this?
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=ce71bfea207b4d7c21d36f24ec37618ffcea1da8
https://android-review.googlesource.com/c/kernel/common/+/1556243
Thanks
yanwu
On Wed, Mar 24, 2021 at 03:43:12PM +0800, Rokudo Yan wrote:
> On 1/26/21 12:46 AM, Alessio Balsini wrote:
> > On Mon, Jan 25, 2021 at 03:30:50PM +0000, Alessio Balsini wrote:
> > > OverlayFS implements its own function to translate iocb flags into rw
> > > flags, so that they can be passed into another vfs call.
> > > With commit ce71bfea207b4 ("fs: align IOCB_* flags with RWF_* flags")
> > > Jens created a 1:1 matching between the iocb flags and rw flags,
> > > simplifying the conversion.
> > >
> > > Reduce the OverlayFS code by making the flag conversion function generic
> > > and reusable.
> > >
> > > Signed-off-by: Alessio Balsini <[email protected]>
> > > ---
> > > fs/overlayfs/file.c | 23 +++++------------------
> > > include/linux/fs.h | 5 +++++
> > > 2 files changed, 10 insertions(+), 18 deletions(-)
> > >
> > > diff --git a/fs/overlayfs/file.c b/fs/overlayfs/file.c
> > > index bd9dd38347ae..56be2ffc5a14 100644
> > > --- a/fs/overlayfs/file.c
> > > +++ b/fs/overlayfs/file.c
> > > @@ -15,6 +15,8 @@
> > > #include <linux/fs.h>
> > > #include "overlayfs.h"
> > > +#define OVL_IOCB_MASK (IOCB_DSYNC | IOCB_HIPRI | IOCB_NOWAIT | IOCB_SYNC)
> > > +
> > > struct ovl_aio_req {
> > > struct kiocb iocb;
> > > struct kiocb *orig_iocb;
> > > @@ -236,22 +238,6 @@ static void ovl_file_accessed(struct file *file)
> > > touch_atime(&file->f_path);
> > > }
> > > -static rwf_t ovl_iocb_to_rwf(int ifl)
> > > -{
> > > - rwf_t flags = 0;
> > > -
> > > - if (ifl & IOCB_NOWAIT)
> > > - flags |= RWF_NOWAIT;
> > > - if (ifl & IOCB_HIPRI)
> > > - flags |= RWF_HIPRI;
> > > - if (ifl & IOCB_DSYNC)
> > > - flags |= RWF_DSYNC;
> > > - if (ifl & IOCB_SYNC)
> > > - flags |= RWF_SYNC;
> > > -
> > > - return flags;
> > > -}
> > > -
> > > static void ovl_aio_cleanup_handler(struct ovl_aio_req *aio_req)
> > > {
> > > struct kiocb *iocb = &aio_req->iocb;
> > > @@ -299,7 +285,8 @@ static ssize_t ovl_read_iter(struct kiocb *iocb, struct iov_iter *iter)
> > > old_cred = ovl_override_creds(file_inode(file)->i_sb);
> > > if (is_sync_kiocb(iocb)) {
> > > ret = vfs_iter_read(real.file, iter, &iocb->ki_pos,
> > > - ovl_iocb_to_rwf(iocb->ki_flags));
> > > + iocb_to_rw_flags(iocb->ki_flags,
> > > + OVL_IOCB_MASK));
> > > } else {
> > > struct ovl_aio_req *aio_req;
> > > @@ -356,7 +343,7 @@ static ssize_t ovl_write_iter(struct kiocb *iocb, struct iov_iter *iter)
> > > if (is_sync_kiocb(iocb)) {
> > > file_start_write(real.file);
> > > ret = vfs_iter_write(real.file, iter, &iocb->ki_pos,
> > > - ovl_iocb_to_rwf(ifl));
> > > + iocb_to_rw_flags(ifl, OVL_IOCB_MASK));
> > > file_end_write(real.file);
> > > /* Update size */
> > > ovl_copyattr(ovl_inode_real(inode), inode);
> > > diff --git a/include/linux/fs.h b/include/linux/fs.h
> > > index fd47deea7c17..647c35423545 100644
> > > --- a/include/linux/fs.h
> > > +++ b/include/linux/fs.h
> > > @@ -3275,6 +3275,11 @@ static inline int kiocb_set_rw_flags(struct kiocb *ki, rwf_t flags)
> > > return 0;
> > > }
> > > +static inline rwf_t iocb_to_rw_flags(int ifl, int iocb_mask)
> > > +{
> > > + return ifl & iocb_mask;
> > > +}
> > > +
> > > static inline ino_t parent_ino(struct dentry *dentry)
> > > {
> > > ino_t res;
> > > --
> > > 2.30.0.280.ga3ce27912f-goog
> > >
> >
> > For some reason lkml.org and lore.kernel.org are not showing this change
> > as part of the thread.
> > Let's see if replying to the email fixes the indexing.
> >
> > Regards,
> > Alessio
> >
>
> Hi, Alessio
>
> This change imply IOCB_* and RWF_* flags are properly aligned, which is not
> true for kernel version 5.4/4.19/4.14. As the patch ("fs: align IOCB_* flags
> with RWF_* flags") is not back-ported to these stable kernel branches. The
> issue was found when applying these patches
> to kernel-5.4(files open with passthrough enabled can't do append write). I
> think the issue exists in AOSP common kernel too.
> Could you please fix this?
>
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=ce71bfea207b4d7c21d36f24ec37618ffcea1da8
>
> https://android-review.googlesource.com/c/kernel/common/+/1556243
>
> Thanks
> yanwu
Hi yanwu,
Correct, this change depends on commit ce71bfea207b ("fs: align IOCB_*
flags with RWF_* flags"), and this dependency is satisfied upstream.
Being FUSE passthrough a new feature and not a bugfix, I'm not planning
to do any backporting to LTS kernels (and GregKH won't probably accept
it).
Android is a different story (and slightly out of topic here).
We are looking forward to have FUSE passthrough enabled on Android as
most of the user data is handled by FUSE. We liked the performance
improvements and non-intrusiveness of the change both for the kernel and
for userspace, so we started supporting this in android12-5.4+ kernel
branches. We are not planning to maintain the feature to older kernels
though (we can't add features to already released), and this is why FUSE
passthrough is not merged there.
To answer your question, in AOSP the officially supported kernels
already have the flags alignment change merged, and a not supported
backporting to older kernels (i.e., 4.14 and 4.19) is already available:
https://android-review.googlesource.com/q/%2522BACKPORT:+fs:+align+IOCB_*+flags+with+RWF_*+flags%2522+-status:abandoned
Thanks,
Alessio
On Wed, Feb 17, 2021 at 03:05:07PM +0100, Miklos Szeredi wrote:
> On Mon, Jan 25, 2021 at 4:31 PM Alessio Balsini <[email protected]> wrote:
> >
> > Enabling FUSE passthrough for mmap-ed operations not only affects
> > performance, but has also been shown as mandatory for the correct
> > functioning of FUSE passthrough.
> > yanwu noticed [1] that a FUSE file with passthrough enabled may suffer
> > data inconsistencies if the same file is also accessed with mmap. What
> > happens is that read/write operations are directly applied to the lower
> > file system (and its cache), while mmap-ed operations are affecting the
> > FUSE cache.
> >
> > Extend the FUSE passthrough implementation to also handle memory-mapped
> > FUSE file, to both fix the cache inconsistencies and extend the
> > passthrough performance benefits to mmap-ed operations.
> >
> > [1] https://lore.kernel.org/lkml/[email protected]/
> >
> > Signed-off-by: Alessio Balsini <[email protected]>
> > ---
> > fs/fuse/file.c | 3 +++
> > fs/fuse/fuse_i.h | 1 +
> > fs/fuse/passthrough.c | 41 +++++++++++++++++++++++++++++++++++++++++
> > 3 files changed, 45 insertions(+)
> >
> > diff --git a/fs/fuse/file.c b/fs/fuse/file.c
> > index cddada1e8bd9..e3741a94c1f9 100644
> > --- a/fs/fuse/file.c
> > +++ b/fs/fuse/file.c
> > @@ -2370,6 +2370,9 @@ static int fuse_file_mmap(struct file *file, struct vm_area_struct *vma)
> > if (FUSE_IS_DAX(file_inode(file)))
> > return fuse_dax_mmap(file, vma);
> >
> > + if (ff->passthrough.filp)
> > + return fuse_passthrough_mmap(file, vma);
> > +
> > if (ff->open_flags & FOPEN_DIRECT_IO) {
> > /* Can't provide the coherency needed for MAP_SHARED */
> > if (vma->vm_flags & VM_MAYSHARE)
> > diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
> > index 815af1845b16..7b0d65984608 100644
> > --- a/fs/fuse/fuse_i.h
> > +++ b/fs/fuse/fuse_i.h
> > @@ -1244,5 +1244,6 @@ int fuse_passthrough_setup(struct fuse_conn *fc, struct fuse_file *ff,
> > void fuse_passthrough_release(struct fuse_passthrough *passthrough);
> > ssize_t fuse_passthrough_read_iter(struct kiocb *iocb, struct iov_iter *to);
> > ssize_t fuse_passthrough_write_iter(struct kiocb *iocb, struct iov_iter *from);
> > +ssize_t fuse_passthrough_mmap(struct file *file, struct vm_area_struct *vma);
> >
> > #endif /* _FS_FUSE_I_H */
> > diff --git a/fs/fuse/passthrough.c b/fs/fuse/passthrough.c
> > index 24866c5fe7e2..284979f87747 100644
> > --- a/fs/fuse/passthrough.c
> > +++ b/fs/fuse/passthrough.c
> > @@ -135,6 +135,47 @@ ssize_t fuse_passthrough_write_iter(struct kiocb *iocb_fuse,
> > return ret;
> > }
> >
> > +ssize_t fuse_passthrough_mmap(struct file *file, struct vm_area_struct *vma)
> > +{
> > + int ret;
> > + const struct cred *old_cred;
> > + struct fuse_file *ff = file->private_data;
> > + struct inode *fuse_inode = file_inode(file);
> > + struct file *passthrough_filp = ff->passthrough.filp;
> > + struct inode *passthrough_inode = file_inode(passthrough_filp);
> > +
> > + if (!passthrough_filp->f_op->mmap)
> > + return -ENODEV;
> > +
> > + if (WARN_ON(file != vma->vm_file))
> > + return -EIO;
> > +
> > + vma->vm_file = get_file(passthrough_filp);
> > +
> > + old_cred = override_creds(ff->passthrough.cred);
> > + ret = call_mmap(vma->vm_file, vma);
> > + revert_creds(old_cred);
> > +
> > + if (ret)
> > + fput(passthrough_filp);
> > + else
> > + fput(file);
> > +
> > + if (file->f_flags & O_NOATIME)
> > + return ret;
> > +
> > + if ((!timespec64_equal(&fuse_inode->i_mtime,
> > + &passthrough_inode->i_mtime) ||
> > + !timespec64_equal(&fuse_inode->i_ctime,
> > + &passthrough_inode->i_ctime))) {
> > + fuse_inode->i_mtime = passthrough_inode->i_mtime;
> > + fuse_inode->i_ctime = passthrough_inode->i_ctime;
>
> Again, violation of rules. Not sure why this is needed, mmap(2)
> isn't supposed to change mtime or ctime, AFAIK.
>
> Thanks,
> Miklos
Hi Miklos,
I don't have a strong preference for this and will drop the ctime/atime
updates in v13.
For the records, here follows my reasoning for which I decided to update
atime/ctime here.
From the stats(2) man it just says that it's not guaranteed that atime
would be updated, as `Other routines, like mmap(2), may or may not
update st_atime.`
Something similar according to the inotify(7) man that warns not to trigger events
after mmap(2), msync(2), and munmap(2) operations.
The mmap(2) man mentions that st_ctime and st_mtime would be updated for
file mappings with PROT_WRITE and MAP_SHARED, before a msync(2) with
MS_SYNC or MS_ASYNC.
This passthrough scenario is slightly different from the standard mmap,
but it seems to me that we are kind of falling into a similar use case
for the atime/ctime update.
I would imagine this is why OverlayFS updates atime/ctime too in
ovl_mmap(), through ovl_copyattr().
Thanks,
Alessio
On Wed, Feb 17, 2021 at 3:52 PM Miklos Szeredi <[email protected]> wrote:
>
> On Mon, Jan 25, 2021 at 4:31 PM Alessio Balsini <[email protected]> wrote:
> >
> > Implement the FUSE passthrough ioctl that associates the lower
> > (passthrough) file system file with the fuse_file.
> >
> > The file descriptor passed to the ioctl by the FUSE daemon is used to
> > access the relative file pointer, that will be copied to the fuse_file
> > data structure to consolidate the link between the FUSE and lower file
> > system.
> >
> > To enable the passthrough mode, user space triggers the
> > FUSE_DEV_IOC_PASSTHROUGH_OPEN ioctl and, if the call succeeds, receives
> > back an identifier that will be used at open/create response time in the
> > fuse_open_out field to associate the FUSE file to the lower file system
> > file.
> > The value returned by the ioctl to user space can be:
> > - > 0: success, the identifier can be used as part of an open/create
> > reply.
> > - <= 0: an error occurred.
> > The value 0 represents an error to preserve backward compatibility: the
> > fuse_open_out field that is used to pass the passthrough_fh back to the
> > kernel uses the same bits that were previously as struct padding, and is
> > commonly zero-initialized (e.g., in the libfuse implementation).
> > Removing 0 from the correct values fixes the ambiguity between the case
> > in which 0 corresponds to a real passthrough_fh, a missing
> > implementation of FUSE passthrough or a request for a normal FUSE file,
> > simplifying the user space implementation.
> >
> > For the passthrough mode to be successfully activated, the lower file
> > system file must implement both read_iter and write_iter file
> > operations. This extra check avoids special pseudo files to be targeted
> > for this feature.
> > Passthrough comes with another limitation: no further file system
> > stacking is allowed for those FUSE file systems using passthrough.
> >
> > Signed-off-by: Alessio Balsini <[email protected]>
> > ---
> > fs/fuse/inode.c | 5 +++
> > fs/fuse/passthrough.c | 87 ++++++++++++++++++++++++++++++++++++++++++-
> > 2 files changed, 90 insertions(+), 2 deletions(-)
> >
> > diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
> > index a1104d5abb70..7ebc398fbacb 100644
> > --- a/fs/fuse/inode.c
> > +++ b/fs/fuse/inode.c
> > @@ -1133,6 +1133,11 @@ EXPORT_SYMBOL_GPL(fuse_send_init);
> >
> > static int free_fuse_passthrough(int id, void *p, void *data)
> > {
> > + struct fuse_passthrough *passthrough = (struct fuse_passthrough *)p;
> > +
> > + fuse_passthrough_release(passthrough);
> > + kfree(p);
> > +
> > return 0;
> > }
> >
> > diff --git a/fs/fuse/passthrough.c b/fs/fuse/passthrough.c
> > index 594060c654f8..cf993e83803e 100644
> > --- a/fs/fuse/passthrough.c
> > +++ b/fs/fuse/passthrough.c
> > @@ -3,19 +3,102 @@
> > #include "fuse_i.h"
> >
> > #include <linux/fuse.h>
> > +#include <linux/idr.h>
> >
> > int fuse_passthrough_open(struct fuse_dev *fud,
> > struct fuse_passthrough_out *pto)
> > {
> > - return -EINVAL;
> > + int res;
> > + struct file *passthrough_filp;
> > + struct fuse_conn *fc = fud->fc;
> > + struct inode *passthrough_inode;
> > + struct super_block *passthrough_sb;
> > + struct fuse_passthrough *passthrough;
> > +
> > + if (!fc->passthrough)
> > + return -EPERM;
> > +
> > + /* This field is reserved for future implementation */
> > + if (pto->len != 0)
> > + return -EINVAL;
> > +
> > + passthrough_filp = fget(pto->fd);
> > + if (!passthrough_filp) {
> > + pr_err("FUSE: invalid file descriptor for passthrough.\n");
> > + return -EBADF;
> > + }
> > +
> > + if (!passthrough_filp->f_op->read_iter ||
> > + !passthrough_filp->f_op->write_iter) {
> > + pr_err("FUSE: passthrough file misses file operations.\n");
> > + res = -EBADF;
> > + goto err_free_file;
> > + }
> > +
> > + passthrough_inode = file_inode(passthrough_filp);
> > + passthrough_sb = passthrough_inode->i_sb;
> > + if (passthrough_sb->s_stack_depth >= FILESYSTEM_MAX_STACK_DEPTH) {
> > + pr_err("FUSE: fs stacking depth exceeded for passthrough\n");
>
> No need to print an error to the logs, this can be a perfectly normal
> occurrence. However I'd try to find a more unique error value than
> EINVAL so that the fuse server can interpret this as "not your fault,
> but can't support passthrough on this file". E.g. EOPNOTSUPP.
>
>
Sorry for the fashionably late response...
Same comment for !{read,write}_iter case above.
EBAFD is really not appropriate there.
May I suggest ELOOP for s_stack_depth and EOPNOTSUPP
for no rw iter ops.
Are you planning to post another version of the patches soon?
Thanks,
Amir.
On Wed, May 05, 2021 at 03:21:26PM +0300, Amir Goldstein wrote:
> On Wed, Feb 17, 2021 at 3:52 PM Miklos Szeredi <[email protected]> wrote:
> >
> > On Mon, Jan 25, 2021 at 4:31 PM Alessio Balsini <[email protected]> wrote:
> > >
> > > Implement the FUSE passthrough ioctl that associates the lower
> > > (passthrough) file system file with the fuse_file.
> > >
> > > The file descriptor passed to the ioctl by the FUSE daemon is used to
> > > access the relative file pointer, that will be copied to the fuse_file
> > > data structure to consolidate the link between the FUSE and lower file
> > > system.
> > >
> > > To enable the passthrough mode, user space triggers the
> > > FUSE_DEV_IOC_PASSTHROUGH_OPEN ioctl and, if the call succeeds, receives
> > > back an identifier that will be used at open/create response time in the
> > > fuse_open_out field to associate the FUSE file to the lower file system
> > > file.
> > > The value returned by the ioctl to user space can be:
> > > - > 0: success, the identifier can be used as part of an open/create
> > > reply.
> > > - <= 0: an error occurred.
> > > The value 0 represents an error to preserve backward compatibility: the
> > > fuse_open_out field that is used to pass the passthrough_fh back to the
> > > kernel uses the same bits that were previously as struct padding, and is
> > > commonly zero-initialized (e.g., in the libfuse implementation).
> > > Removing 0 from the correct values fixes the ambiguity between the case
> > > in which 0 corresponds to a real passthrough_fh, a missing
> > > implementation of FUSE passthrough or a request for a normal FUSE file,
> > > simplifying the user space implementation.
> > >
> > > For the passthrough mode to be successfully activated, the lower file
> > > system file must implement both read_iter and write_iter file
> > > operations. This extra check avoids special pseudo files to be targeted
> > > for this feature.
> > > Passthrough comes with another limitation: no further file system
> > > stacking is allowed for those FUSE file systems using passthrough.
> > >
> > > Signed-off-by: Alessio Balsini <[email protected]>
> > > ---
> > > fs/fuse/inode.c | 5 +++
> > > fs/fuse/passthrough.c | 87 ++++++++++++++++++++++++++++++++++++++++++-
> > > 2 files changed, 90 insertions(+), 2 deletions(-)
> > >
> > > diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
> > > index a1104d5abb70..7ebc398fbacb 100644
> > > --- a/fs/fuse/inode.c
> > > +++ b/fs/fuse/inode.c
> > > @@ -1133,6 +1133,11 @@ EXPORT_SYMBOL_GPL(fuse_send_init);
> > >
> > > static int free_fuse_passthrough(int id, void *p, void *data)
> > > {
> > > + struct fuse_passthrough *passthrough = (struct fuse_passthrough *)p;
> > > +
> > > + fuse_passthrough_release(passthrough);
> > > + kfree(p);
> > > +
> > > return 0;
> > > }
> > >
> > > diff --git a/fs/fuse/passthrough.c b/fs/fuse/passthrough.c
> > > index 594060c654f8..cf993e83803e 100644
> > > --- a/fs/fuse/passthrough.c
> > > +++ b/fs/fuse/passthrough.c
> > > @@ -3,19 +3,102 @@
> > > #include "fuse_i.h"
> > >
> > > #include <linux/fuse.h>
> > > +#include <linux/idr.h>
> > >
> > > int fuse_passthrough_open(struct fuse_dev *fud,
> > > struct fuse_passthrough_out *pto)
> > > {
> > > - return -EINVAL;
> > > + int res;
> > > + struct file *passthrough_filp;
> > > + struct fuse_conn *fc = fud->fc;
> > > + struct inode *passthrough_inode;
> > > + struct super_block *passthrough_sb;
> > > + struct fuse_passthrough *passthrough;
> > > +
> > > + if (!fc->passthrough)
> > > + return -EPERM;
> > > +
> > > + /* This field is reserved for future implementation */
> > > + if (pto->len != 0)
> > > + return -EINVAL;
> > > +
> > > + passthrough_filp = fget(pto->fd);
> > > + if (!passthrough_filp) {
> > > + pr_err("FUSE: invalid file descriptor for passthrough.\n");
> > > + return -EBADF;
> > > + }
> > > +
> > > + if (!passthrough_filp->f_op->read_iter ||
> > > + !passthrough_filp->f_op->write_iter) {
> > > + pr_err("FUSE: passthrough file misses file operations.\n");
> > > + res = -EBADF;
> > > + goto err_free_file;
> > > + }
> > > +
> > > + passthrough_inode = file_inode(passthrough_filp);
> > > + passthrough_sb = passthrough_inode->i_sb;
> > > + if (passthrough_sb->s_stack_depth >= FILESYSTEM_MAX_STACK_DEPTH) {
> > > + pr_err("FUSE: fs stacking depth exceeded for passthrough\n");
> >
> > No need to print an error to the logs, this can be a perfectly normal
> > occurrence. However I'd try to find a more unique error value than
> > EINVAL so that the fuse server can interpret this as "not your fault,
> > but can't support passthrough on this file". E.g. EOPNOTSUPP.
> >
> >
>
> Sorry for the fashionably late response...
> Same comment for !{read,write}_iter case above.
> EBAFD is really not appropriate there.
> May I suggest ELOOP for s_stack_depth and EOPNOTSUPP
> for no rw iter ops.
>
> Are you planning to post another version of the patches soon?
>
> Thanks,
> Amir.
Hi Amir,
Thanks for your feedback. I like the ELOOP for the stack_depth, and the
EOPNOTSUPP is already in my local branch.
I've been busy with the integration of this last patchset in Android,
that unfortunately will be shipped as out-of-tree. I'm keeping all my
fingers crossed for my workarounds to prevent future passthrough UAPI
breakages to work. :)
I have an ugly patch which uses IDR as Miklos asked, but unfortunately
I'm facing some performance issues due to the locking mechanisms to keep
guarantee the RCU consistency. I can post the new patch set as an RFC
soon for the community to take a look.
At a glance what happens is:
- the IDR, one for each fuse_conn, contains pointers to "struct
fuse_passthrough" containing:
- fuse_file *: which is using passthrough,
- file *: native file system file target,
- cred of the FUSE server,
- ioctl(PASSTHROUGH_OPEN): updates IDR, requires spinlock:
- kmalloc(fuse_passthrough), update file and cred,
- ID = idr_alloc(),
- return ID,
- fuse_open reply from FUSE server with passthrough ID: updates IDR,
requires spinlock:
- pt = idr_find(ID),
- update fuse_file with the current fuse_file,
- update fuse_file->passthrough_id = ID,
- idr_replace(),
- read/write/mmap: lock-free IDR read:
- idr_find(fuse_file::passthrough_id),
- forward request to lower file system as for the current FUSE
passthrough patches.
- ioctl(PASSTHROUGH_CLOSE): updates IDR, requires spinlock:
- idr_remove();
- call_rcu(): tell possible open fuse_file user that the ID is no more
valid and kfree() the allocated struct;
- close(fuse_file): updates IDR, requires spinlock:
- ID = fuse_file::passthrough_id
- idr_find(ID),
- fuse_passthrough::fuse_file = NULL,
- idr_replace().
This would be fine if passthrough is activated for a few, big files,
where the passthrough overhead is dominated by the direct access
benefits, but if passthrough is enabled for many small files which just
do one or two read/writes (as what I did for my benchmark of total build
time for the kernel, where I was passing-through every opened file), the
overhead becomes a real issue.
If you have any thoughts on how to make this simpler, I'm absolutely
open to fix this.
We are also in the preliminary design step of having a hierarchical
passthrough feature, for which all the dir/file operations performed in
any of the contents of the passthrough folder, are automatically
forwarded to the lower file system.
That may fix this overhead issue, but it's still early to forecast what
is going to happen with this idea. Right now I'm not even sure it's
feasible.
Looking forward to receiving some feedback, thanks in advance!
Alessio
> I have an ugly patch which uses IDR as Miklos asked, but unfortunately
> I'm facing some performance issues due to the locking mechanisms to keep
> guarantee the RCU consistency. I can post the new patch set as an RFC
> soon for the community to take a look.
> At a glance what happens is:
> - the IDR, one for each fuse_conn, contains pointers to "struct
> fuse_passthrough" containing:
> - fuse_file *: which is using passthrough,
> - file *: native file system file target,
> - cred of the FUSE server,
> - ioctl(PASSTHROUGH_OPEN): updates IDR, requires spinlock:
> - kmalloc(fuse_passthrough), update file and cred,
> - ID = idr_alloc(),
> - return ID,
> - fuse_open reply from FUSE server with passthrough ID: updates IDR,
> requires spinlock:
> - pt = idr_find(ID),
> - update fuse_file with the current fuse_file,
> - update fuse_file->passthrough_id = ID,
> - idr_replace(),
> - read/write/mmap: lock-free IDR read:
> - idr_find(fuse_file::passthrough_id),
> - forward request to lower file system as for the current FUSE
> passthrough patches.
> - ioctl(PASSTHROUGH_CLOSE): updates IDR, requires spinlock:
> - idr_remove();
> - call_rcu(): tell possible open fuse_file user that the ID is no more
> valid and kfree() the allocated struct;
> - close(fuse_file): updates IDR, requires spinlock:
> - ID = fuse_file::passthrough_id
> - idr_find(ID),
> - fuse_passthrough::fuse_file = NULL,
> - idr_replace().
>
> This would be fine if passthrough is activated for a few, big files,
> where the passthrough overhead is dominated by the direct access
> benefits, but if passthrough is enabled for many small files which just
> do one or two read/writes (as what I did for my benchmark of total build
> time for the kernel, where I was passing-through every opened file), the
> overhead becomes a real issue.
>
> If you have any thoughts on how to make this simpler, I'm absolutely
> open to fix this.
>
This IDR namespace usually serves a single process. Right?
It sounds a bit more like a file table, more specifically, like io_file_table.
I may be way off, but this sounds a bit like IORING_REGISTER_FILES.
Is there anything that can be learned from this UAPI?
Maybe even reuse of some of the io_uring file register code?
Thanks,
Amir.
On Mon, Jan 25, 2021 at 5:31 PM Alessio Balsini <[email protected]> wrote:
>
> This is the 12th version of the series, rebased on top of v5.11-rc5.
> Please find the changelog at the bottom of this cover letter.
>
> Add support for file system passthrough read/write of files when enabled
> in userspace through the option FUSE_PASSTHROUGH.
>
> There are file systems based on FUSE that are intended to enforce
> special policies or trigger complicated decision makings at the file
> operations level. Android, for example, uses FUSE to enforce
> fine-grained access policies that also depend on the file contents.
> Sometimes it happens that at open or create time a file is identified as
> not requiring additional checks for consequent reads/writes, thus FUSE
> would simply act as a passive bridge between the process accessing the
> FUSE file system and the lower file system. Splicing and caching help
> reduce the FUSE overhead, but there are still read/write operations
> forwarded to the userspace FUSE daemon that could be avoided.
>
> This series has been inspired by the original patches from Nikhilesh
> Reddy, the idea and code of which has been elaborated and improved
> thanks to the community support.
>
> When the FUSE_PASSTHROUGH capability is enabled, the FUSE daemon may
> decide while handling the open/create operations, if the given file can
> be accessed in passthrough mode. This means that all the further read
> and write operations would be forwarded by the kernel directly to the
> lower file system using the VFS layer rather than to the FUSE daemon.
> All the requests other than reads or writes are still handled by the
> userspace FUSE daemon.
> This allows for improved performance on reads and writes, especially in
> the case of reads at random offsets, for which no (readahead) caching
> mechanism would help.
> Benchmarks show improved performance that is close to native file system
> access when doing massive manipulations on a single opened file,
> especially in the case of random reads, random writes and sequential
> writes. Detailed benchmarking results are presented below.
>
> The creation of this direct connection (passthrough) between FUSE file
> objects and file objects in the lower file system happens in a way that
> reminds of passing file descriptors via sockets:
> - a process requests the opening of a file handled by FUSE, so the
> kernel forwards the request to the FUSE daemon;
> - the FUSE daemon opens the target file in the lower file system,
> getting its file descriptor;
> - the FUSE daemon also decides according to its internal policies if
> passthrough can be enabled for that file, and, if so, can perform a
> FUSE_DEV_IOC_PASSTHROUGH_OPEN ioctl on /dev/fuse, passing the file
> descriptor obtained at the previous step and the fuse_req unique
> identifier;
> - the kernel translates the file descriptor to the file pointer
> navigating through the opened files of the "current" process and
> temporarily stores it in the associated open/create fuse_req's
> passthrough_filp;
> - when the FUSE daemon has done with the request and it's time for the
> kernel to close it, it checks if the passthrough_filp is available and
> in case updates the additional field in the fuse_file owned by the
> process accessing the FUSE file system.
> From now on, all the read/write operations performed by that process
> will be redirected to the corresponding lower file system file by
> creating new VFS requests.
> Since the read/write operation to the lower file system is executed with
> the current process's credentials, it might happen that it does not have
> enough privileges to succeed. For this reason, the process temporarily
> receives the same credentials as the FUSE daemon, that are reverted as
> soon as the read/write operation completes, emulating the behavior of
> the request to be performed by the FUSE daemon itself. This solution has
> been inspired by the way overlayfs handles read/write operations.
> Asynchronous IO is supported as well, handled by creating separate AIO
> requests for the lower file system that will be internally tracked by
> FUSE, that intercepts and propagates their completion through an
> internal ki_completed callback similar to the current implementation of
> overlayfs.
> Finally, also memory-mapped FUSE files are supported in this FUSE
> passthrough series as it has been noticed that when a same file with
> FUSE passthrough enabled is accessed both with standard
> read/write(-iter) operations and memory-mapped read/write operations,
> the file content might result corrupted due to an inconsistency between
> the FUSE and lower file system caches.
>
> The ioctl has been designed taking as a reference and trying to converge
> to the fuse2 implementation. For example, the fuse_passthrough_out data
> structure has extra fields that will allow for further extensions of the
> feature.
>
>
> Performance on RAM block device
>
> What follows has been performed using a custom passthrough_hp FUSE
> daemon that enables pass-through for each file that is opened during
> both "open" and "create". Benchmarks were run on an Intel Xeon W-2135,
> 64 GiB of RAM workstation, with a RAM block device used as storage
> target. More specifically, out of the system's 64 GiB of RAM, 40 GiB
> were reserved for /dev/ram0, formatted as ext4. For the FUSE and FUSE
> passthrough benchmarks, the FUSE file system was mounted on top of the
> mounted /dev/ram0 device.
> That file system has been completely filled and then cleaned up before
> running the benchmarks: this to ensure that all the /dev/ram0 space was
> reserved and not usable as page cache.
>
> The rationale for using a RAM block device is that SSDs may experience
> performance fluctuations, especially when dealing with accessing data
> random offsets.
> Getting rid of the discrete storage device also removes a huge component
> of slowness, highlighting the performance difference of the software
> parts (and probably the goodness of CPU caching and its coherence
> mechanisms).
>
> No special tuning has been performed, e.g., all the involved processes
> are SCHED_OTHER, ondemand is the frequency governor with no frequency
> restrictions, and turbo-boost, as well as p-state, are active. This is
> because I noticed that, for such high-level benchmarks, results
> consistency was minimally affected by these features.
>
> The source code of the updated libfuse library and passthrough_hp is
> shared at the following repository:
>
> https://github.com/balsini/libfuse/tree/fuse-passthrough-v12-v5.11-rc5
>
> Two different kinds of benchmarks were done for this change, the first
> set of tests evaluates the bandwidth improvements when manipulating huge
> single files, the second set of tests verify that no performance
> regressions were introduced when handling many small files.
>
> All the caches were dropped before running every benchmark with:
>
> echo 3 > /proc/sys/vm/drop_caches
>
> All the benchmarks were run 10 times, with 1 minute cool down between
> each run.
>
> The first benchmarks were done by running FIO (fio-3.24) with:
> - bs=4Ki;
> - file size: 35Gi;
> - ioengine: sync;
> - fsync_on_close=1;
> - randseed=0.
> The target file has been chosen large enough to avoid it to be entirely
> loaded into the page cache.
>
> Results are presented in the following table:
>
> +-----------+------------+-------------+-------------+
> | MiB/s | fuse | passthrough | native |
> +-----------+------------+-------------+-------------+
> | read | 471(±1.3%) | 1791(±1.0%) | 1839(±1.8%) |
> | write | 95(±.6%) | 1068(±.9%) | 1322(±.8%) |
> | randread | 25(±1.7%) | 860(±.8%) | 1135(±.5%) |
> | randwrite | 76(±3.0%) | 813(±1.0%) | 1005(±.7%) |
> +-----------+------------+-------------+-------------+
>
> This table shows that FUSE, except for the sequential reads, is far
> behind FUSE passthrough and native in terms of performance. The
> extremely good FUSE performance for sequential reads is the result of a
> great read-ahead mechanism. I was able to verify that setting
> read_ahead_kb to 0 causes a terrible performance drop.
> All the results are stable, as shown by the standard deviations.
> Moreover, these numbers show the reasonable gap between passthrough and
> native, introduced by the extra traversal through the VFS layer.
>
> As long as this patch has the primary objective of improving bandwidth,
> another set of tests has been performed to see how this behaves on a
> totally different scenario that involves accessing many small files. For
> this purpose, measuring the build time of the Linux kernel has been
> chosen as an appropriate, well-known, workload. The kernel has been
> built with as many processes as the number of logical CPUs (-j
> $(nproc)), that besides being a reasonable parallelization value, is
> also enough to saturate the processor's utilization thanks to the
> additional FUSE daemon's threads, making it even harder to get closer to
> the native file system performance.
> The following table shows the total build times in the different
> configurations:
>
> +------------------+--------------+-----------+
> | | AVG duration | Standard |
> | | (sec) | deviation |
> +------------------+--------------+-----------+
> | FUSE | 144.566 | 0.697 |
> +------------------+--------------+-----------+
> | FUSE passthrough | 133.820 | 0.341 |
> +------------------+--------------+-----------+
> | Native | 109.423 | 0.724 |
> +------------------+--------------+-----------+
>
> Further testing and performance evaluations are welcome.
>
>
> Description of the series
>
> Patch 1 generalizes the function which converts iocb flags to rw flags
> from overlayfs, so that can be used in this patch set.
>
> Patch 2 enables the 32-bit compatibility for the /dev/fuse ioctl.
>
> Patch 3 introduces the data structures, function signatures and ioctl
> required both for the communication with userspace and for the internal
> kernel use.
>
> Patch 4 introduces initialization and release functions for FUSE
> passthrough.
>
> Patch 5 enables the synchronous read and write operations for those FUSE
> files for which the passthrough functionality is enabled.
>
> Patch 6 extends the read and write operations to also support
> asynchronous IO.
>
> Patch 7 allows FUSE passthrough to target files for which the requesting
> process would not have direct access to, by temporarily performing a
> credentials switch to the credentials of the FUSE daemon that issued the
> FUSE passthrough ioctl.
>
> Patch 8 extends FUSE passthrough operations to memory-mapped FUSE files.
>
>
> Changelog
>
> Changes in v12:
> * Revert FILESYSTEM_MAX_STACK_DEPTH checks as they were in v10
> [Requested by Amir Goldstein]
> * Introduce passthrough support for memory-mapped FUSE files
> [Requested by yanwu]
>
> Changes in v11:
> * Fix the FILESYSTEM_MAX_STACK_DEPTH check to allow other file systems
> to be stacked
> * Moved file system stacking depth check at ioctl time
> * Update cover letter with correct libfuse repository to test the change
> [Requested by Peng Tao]
> * Fix the file reference counter leak introduced in v10
> [Requested by yanwu]
>
> Changes in v10:
> * UAPI updated: ioctl now returns an ID that will be used at open/create
> response time to reference the passthrough file
> * Synchronous read/write_iter functions does not return silly errors
> (fixed in aio patch)
> * FUSE daemon credentials updated at ioctl time instead of mount time
> * Updated benchmark results
> [Requested by Miklos Szeredi]
>
> Changes in v9:
> * Switched to using VFS instead of direct lower FS file ops
> [Attempt to address a request from Jens Axboe, Jann Horn,
> Amir Goldstein]
> * Removal of useless included aio.h header
> [Proposed by Jens Axboe]
>
> Changes in v8:
> * aio requests now use kmalloc/kfree, instead of kmem_cache
> * Switched to call_{read,write}_iter in AIO
> * Revisited attributes copy
> * Passthrough can only be enabled via ioctl, fixing the security issue
> spotted by Jann
> * Use an extensible fuse_passthrough_out data structure
> [Attempt to address a request from Nikolaus Rath, Amir Goldstein and
> Miklos Szeredi]
>
> Changes in v7:
> * Full handling of aio requests as done in overlayfs (update commit
> * message).
> * s/fget_raw/fget.
> * Open fails in case of passthrough errors, emitting warning messages.
> [Proposed by Jann Horn]
> * Create new local kiocb, getting rid of the previously proposed ki_filp
> swapping.
> [Proposed by Jann Horn and Jens Axboe]
> * Code polishing.
>
> Changes in v6:
> * Port to kernel v5.8:
> * fuse_file_{read,write}_iter changed since the v5 of this patch was
> proposed.
> * Simplify fuse_simple_request.
> * Merge fuse_passthrough.h into fuse_i.h
> * Refactor of passthrough.c:
> * Remove BUG_ONs.
> * Simplified error checking and request arguments indexing.
> * Use call_{read,write}_iter utility functions.
> * Remove get_file and fputs during read/write: handle the extra FUSE
> references to the lower file object when the fuse_file is
> created/deleted.
> [Proposed by Jann Horn]
>
> Changes in v5:
> * Fix the check when setting the passthrough file.
> [Found when testing by Mike Shal]
>
> Changes in v3 and v4:
> * Use the fs_stack_depth to prevent further stacking and a minor fix.
> [Proposed by Jann Horn]
>
> Changes in v2:
> * Changed the feature name to passthrough from stacked_io.
> [Proposed by Linus Torvalds]
>
>
> Alessio Balsini (8):
> fs: Generic function to convert iocb to rw flags
> fuse: 32-bit user space ioctl compat for fuse device
> fuse: Definitions and ioctl for passthrough
> fuse: Passthrough initialization and release
> fuse: Introduce synchronous read and write for passthrough
> fuse: Handle asynchronous read and write in passthrough
> fuse: Use daemon creds in passthrough mode
> fuse: Introduce passthrough for mmap
>
> fs/fuse/Makefile | 1 +
> fs/fuse/dev.c | 41 ++++--
> fs/fuse/dir.c | 2 +
> fs/fuse/file.c | 15 +-
> fs/fuse/fuse_i.h | 33 +++++
> fs/fuse/inode.c | 22 ++-
> fs/fuse/passthrough.c | 280 ++++++++++++++++++++++++++++++++++++++
> fs/overlayfs/file.c | 23 +---
> include/linux/fs.h | 5 +
> include/uapi/linux/fuse.h | 14 +-
> 10 files changed, 401 insertions(+), 35 deletions(-)
> create mode 100644 fs/fuse/passthrough.c
>
> --
> 2.30.0.280.ga3ce27912f-goog
>
Hi Alessio,
I have been testing this patch set for a while and recently
nfstest_posix found one issue:
mtime/ctime are not invalidated on passthrough write.
I have tested a fix on this 5.10.y backport branch:
https://github.com/amir73il/linux/commits/linux-5.10.y-fuse-passthrough
Please feel free to review and/or take the fix for your next posting
if you have plans of posting another version...
Thanks,
Amir.