2022-06-24 06:23:57

by Jiachen Zhang

[permalink] [raw]
Subject: [PATCH] fuse: writeback_cache consistency enhancement (writeback_cache_v2)

Some users may want both the high performance of the writeback_cahe mode and
a little bit more consistency among FUSE mounts. In the current writeback
mode implementation, users of one FUSE mount can never see the file
expansion done by other FUSE mounts.

Based on the suggested writeback V2 patch in the upstream mailing-list [1],
this commit allows the cmtime and size to be updated from server in
writeback mode. Compared with the writeback V2 patch in [1], this commit has
several differences:

1. Ensure c/mtime are not updated from kernel to server. IOW, the cmtime
generated by kernel are just temporary values that are never flushed to
server, and they can also be updated by the official server cmtime when
the writeback cache is clean.

2. Skip mtime-based revalidation when fc->auto_inval_data is set with
fc->writeback_cache_v2. Because the kernel-generated temporary cmtime
are likely not equal to the offical server cmtime.

3. If any page is ever flushed to the server during FUSE_GETATTR
handling on fuse server, even if the cache is clean when
fuse_change_attributes() checks, we should not update the i_size. This
is because the FUSE_GETATTR may get a staled size before the FUSE_WRITE
request changes server inode size. This commit ensures this by
increasing attr_version after writeback for writeback_cache_v2. In that
case, we should also ensure the ordering of the attr_version updating
and the fi->writepages RB-tree updating. So that if a fuse page
writeback ever happens during fuse_change_attributes(), either the
fi->writepages is not empty, or the attr_version is increased. So we
never mistakenly update a stale file size from server to kernel.

With this patch, writeback mode can consider the server c/mtime as the
official one. When inode attr is timeout or invalidated, kernel has chance
to see size and c/mtime modified by others.

Together with another patch [2], a FUSE daemon is able to implement
close-to-open (CTO) consistency like what is done in NFS clients.

[1] https://lore.kernel.org/linux-fsdevel/[email protected]
[2] https://lore.kernel.org/linux-fsdevel/[email protected]/

Suggested-by: Miklos Szeredi <[email protected]>
Signed-off-by: Jiachen Zhang <[email protected]>
---
fs/fuse/file.c | 17 +++++++++++++++
fs/fuse/fuse_i.h | 3 +++
fs/fuse/inode.c | 44 +++++++++++++++++++++++++++++++++++++--
include/uapi/linux/fuse.h | 5 +++++
4 files changed, 67 insertions(+), 2 deletions(-)

diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index 9b64e2ff1c96..35bdc7af8468 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -1829,6 +1829,15 @@ static void fuse_writepage_end(struct fuse_mount *fm, struct fuse_args *args,
*/
fuse_send_writepage(fm, next, inarg->offset + inarg->size);
}
+
+ if (fc->writeback_cache_v2)
+ fi->attr_version = atomic64_inc_return(&fc->attr_version);
+ /*
+ * Ensure attr_version increases before the page is move out of the
+ * writepages rb-tree.
+ */
+ smp_mb();
+
fi->writectr--;
fuse_writepage_finish(fm, wpa);
spin_unlock(&fi->lock);
@@ -1858,10 +1867,18 @@ static struct fuse_file *fuse_write_file_get(struct fuse_inode *fi)

int fuse_write_inode(struct inode *inode, struct writeback_control *wbc)
{
+ struct fuse_conn *fc = get_fuse_conn(inode);
struct fuse_inode *fi = get_fuse_inode(inode);
struct fuse_file *ff;
int err;

+ /*
+ * Kernel c/mtime should not be updated to the server in the
+ * writeback_cache_v2 mode as server c/mtime are official.
+ */
+ if (fc->writeback_cache_v2)
+ return 0;
+
/*
* Inode is always written before the last reference is dropped and
* hence this should not be reached from reclaim.
diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
index 488b460e046f..47de36146fb8 100644
--- a/fs/fuse/fuse_i.h
+++ b/fs/fuse/fuse_i.h
@@ -654,6 +654,9 @@ struct fuse_conn {
/* show legacy mount options */
unsigned int legacy_opts_show:1;

+ /* Improved writeback cache policy */
+ unsigned writeback_cache_v2:1;
+
/*
* fs kills suid/sgid/cap on write/chown/trunc. suid is killed on
* write/trunc only if caller did not have CAP_FSETID. sgid is killed
diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
index 8c0665c5dff8..2d5fa82b08b6 100644
--- a/fs/fuse/inode.c
+++ b/fs/fuse/inode.c
@@ -237,14 +237,41 @@ void fuse_change_attributes(struct inode *inode, struct fuse_attr *attr,
u32 cache_mask;
loff_t oldsize;
struct timespec64 old_mtime;
+ bool try_wb_update = false;
+
+ if (fc->writeback_cache_v2 && S_ISREG(inode->i_mode)) {
+ inode_lock(inode);
+ try_wb_update = true;
+ }

spin_lock(&fi->lock);
/*
* In case of writeback_cache enabled, writes update mtime, ctime and
* may update i_size. In these cases trust the cached value in the
* inode.
+ *
+ * In writeback_cache_v2 mode, if all the following conditions are met,
+ * then we allow the attributes to be refreshed:
+ *
+ * - inode is not in the process of being written (I_SYNC)
+ * - inode has no dirty pages (I_DIRTY_PAGES)
+ * - inode data-related attributes are clean (I_DIRTY_DATASYNC)
+ * - inode does not have any page writeback in progress
+ *
+ * Note: checking PAGECACHE_TAG_WRITEBACK is not sufficient in fuse,
+ * since inode can appear to have no PageWriteback pages, yet still have
+ * outstanding write request.
*/
cache_mask = fuse_get_cache_mask(inode);
+ if (try_wb_update && !(inode->i_state & (I_DIRTY_PAGES | I_SYNC |
+ I_DIRTY_DATASYNC)) && RB_EMPTY_ROOT(&fi->writepages))
+ cache_mask &= ~(STATX_MTIME | STATX_CTIME | STATX_SIZE);
+ /*
+ * Ensure the ordering of cleanness checking and following attr_version
+ * comparison.
+ */
+ smp_mb();
+
if (cache_mask & STATX_SIZE)
attr->size = i_size_read(inode);

@@ -283,7 +310,13 @@ void fuse_change_attributes(struct inode *inode, struct fuse_attr *attr,
truncate_pagecache(inode, attr->size);
if (!fc->explicit_inval_data)
inval = true;
- } else if (fc->auto_inval_data) {
+ } else if (!fc->writeback_cache_v2 && fc->auto_inval_data) {
+ /*
+ * When fc->writeback_cache_v2 is set, the old_mtime
+ * can be generated by kernel and must not equal to
+ * new_mtime generated by server. So skip in such
+ * case.
+ */
struct timespec64 new_mtime = {
.tv_sec = attr->mtime,
.tv_nsec = attr->mtimensec,
@@ -303,6 +336,9 @@ void fuse_change_attributes(struct inode *inode, struct fuse_attr *attr,

if (IS_ENABLED(CONFIG_FUSE_DAX))
fuse_dax_dontcache(inode, attr->flags);
+
+ if (try_wb_update)
+ inode_unlock(inode);
}

static void fuse_init_inode(struct inode *inode, struct fuse_attr *attr)
@@ -1153,6 +1189,10 @@ static void process_init_reply(struct fuse_mount *fm, struct fuse_args *args,
fc->async_dio = 1;
if (flags & FUSE_WRITEBACK_CACHE)
fc->writeback_cache = 1;
+ if (flags & FUSE_WRITEBACK_CACHE_V2) {
+ fc->writeback_cache = 1;
+ fc->writeback_cache_v2 = 1;
+ }
if (flags & FUSE_PARALLEL_DIROPS)
fc->parallel_dirops = 1;
if (flags & FUSE_HANDLE_KILLPRIV)
@@ -1234,7 +1274,7 @@ void fuse_send_init(struct fuse_mount *fm)
FUSE_ABORT_ERROR | FUSE_MAX_PAGES | FUSE_CACHE_SYMLINKS |
FUSE_NO_OPENDIR_SUPPORT | FUSE_EXPLICIT_INVAL_DATA |
FUSE_HANDLE_KILLPRIV_V2 | FUSE_SETXATTR_EXT | FUSE_INIT_EXT |
- FUSE_SECURITY_CTX;
+ FUSE_SECURITY_CTX | FUSE_WRITEBACK_CACHE_V2;
#ifdef CONFIG_FUSE_DAX
if (fm->fc->dax)
flags |= FUSE_MAP_ALIGNMENT;
diff --git a/include/uapi/linux/fuse.h b/include/uapi/linux/fuse.h
index d6ccee961891..b474763bcf59 100644
--- a/include/uapi/linux/fuse.h
+++ b/include/uapi/linux/fuse.h
@@ -194,6 +194,7 @@
* - add FUSE_SECURITY_CTX init flag
* - add security context to create, mkdir, symlink, and mknod requests
* - add FUSE_HAS_INODE_DAX, FUSE_ATTR_DAX
+ * - add FUSE_WRITEBACK_CACHE_V2 init flag
*/

#ifndef _LINUX_FUSE_H
@@ -353,6 +354,9 @@ struct fuse_file_lock {
* FUSE_SECURITY_CTX: add security context to create, mkdir, symlink, and
* mknod
* FUSE_HAS_INODE_DAX: use per inode DAX
+ * FUSE_WRITEBACK_CACHE_V2:
+ * allow time/size to be refreshed if no pending write
+ * c/mtime not updated from kernel to server
*/
#define FUSE_ASYNC_READ (1 << 0)
#define FUSE_POSIX_LOCKS (1 << 1)
@@ -389,6 +393,7 @@ struct fuse_file_lock {
/* bits 32..63 get shifted down 32 bits into the flags2 field */
#define FUSE_SECURITY_CTX (1ULL << 32)
#define FUSE_HAS_INODE_DAX (1ULL << 33)
+#define FUSE_WRITEBACK_CACHE_V2 (1ULL << 34)

/**
* CUSE INIT request/reply flags
--
2.20.1


2022-06-24 18:50:37

by Vivek Goyal

[permalink] [raw]
Subject: Re: [PATCH] fuse: writeback_cache consistency enhancement (writeback_cache_v2)

On Fri, Jun 24, 2022 at 01:58:25PM +0800, Jiachen Zhang wrote:
> Some users may want both the high performance of the writeback_cahe mode and
> a little bit more consistency among FUSE mounts. In the current writeback
> mode implementation, users of one FUSE mount can never see the file
> expansion done by other FUSE mounts.
>
> Based on the suggested writeback V2 patch in the upstream mailing-list [1],
> this commit allows the cmtime and size to be updated from server in
> writeback mode. Compared with the writeback V2 patch in [1], this commit has
> several differences:
>
> 1. Ensure c/mtime are not updated from kernel to server. IOW, the cmtime
> generated by kernel are just temporary values that are never flushed to
> server, and they can also be updated by the official server cmtime when
> the writeback cache is clean.
>
> 2. Skip mtime-based revalidation when fc->auto_inval_data is set with
> fc->writeback_cache_v2. Because the kernel-generated temporary cmtime
> are likely not equal to the offical server cmtime.
>
> 3. If any page is ever flushed to the server during FUSE_GETATTR
> handling on fuse server, even if the cache is clean when
> fuse_change_attributes() checks, we should not update the i_size. This
> is because the FUSE_GETATTR may get a staled size before the FUSE_WRITE
> request changes server inode size. This commit ensures this by
> increasing attr_version after writeback for writeback_cache_v2. In that
> case, we should also ensure the ordering of the attr_version updating
> and the fi->writepages RB-tree updating. So that if a fuse page
> writeback ever happens during fuse_change_attributes(), either the
> fi->writepages is not empty, or the attr_version is increased. So we
> never mistakenly update a stale file size from server to kernel.
>
> With this patch, writeback mode can consider the server c/mtime as the
> official one. When inode attr is timeout or invalidated, kernel has chance
> to see size and c/mtime modified by others.
>
> Together with another patch [2], a FUSE daemon is able to implement
> close-to-open (CTO) consistency like what is done in NFS clients.
>
> [1] https://lore.kernel.org/linux-fsdevel/[email protected]
> [2] https://lore.kernel.org/linux-fsdevel/[email protected]/
>
> Suggested-by: Miklos Szeredi <[email protected]>
> Signed-off-by: Jiachen Zhang <[email protected]>
> ---
> fs/fuse/file.c | 17 +++++++++++++++
> fs/fuse/fuse_i.h | 3 +++
> fs/fuse/inode.c | 44 +++++++++++++++++++++++++++++++++++++--
> include/uapi/linux/fuse.h | 5 +++++
> 4 files changed, 67 insertions(+), 2 deletions(-)

A quick comment without reading this patch, please do add some
documentation which explains what's the existing behavior and what
the new behavior.

Probably Documentation/filesystem/fuse.rst is right place. We have
so many fine knobs w.r.t attr caching and data caching and what kind
of cache consistency to expect but there is no documentation. So
very hard to figure out what to expect in different scenarios for
a user.

I think this patch series probably a good time to start some cache
coherency related documentation in fuse and that will help users
as well as will provide context for future changes.

Thanks
Vivek

>
> diff --git a/fs/fuse/file.c b/fs/fuse/file.c
> index 9b64e2ff1c96..35bdc7af8468 100644
> --- a/fs/fuse/file.c
> +++ b/fs/fuse/file.c
> @@ -1829,6 +1829,15 @@ static void fuse_writepage_end(struct fuse_mount *fm, struct fuse_args *args,
> */
> fuse_send_writepage(fm, next, inarg->offset + inarg->size);
> }
> +
> + if (fc->writeback_cache_v2)
> + fi->attr_version = atomic64_inc_return(&fc->attr_version);
> + /*
> + * Ensure attr_version increases before the page is move out of the
> + * writepages rb-tree.
> + */
> + smp_mb();
> +
> fi->writectr--;
> fuse_writepage_finish(fm, wpa);
> spin_unlock(&fi->lock);
> @@ -1858,10 +1867,18 @@ static struct fuse_file *fuse_write_file_get(struct fuse_inode *fi)
>
> int fuse_write_inode(struct inode *inode, struct writeback_control *wbc)
> {
> + struct fuse_conn *fc = get_fuse_conn(inode);
> struct fuse_inode *fi = get_fuse_inode(inode);
> struct fuse_file *ff;
> int err;
>
> + /*
> + * Kernel c/mtime should not be updated to the server in the
> + * writeback_cache_v2 mode as server c/mtime are official.
> + */
> + if (fc->writeback_cache_v2)
> + return 0;
> +
> /*
> * Inode is always written before the last reference is dropped and
> * hence this should not be reached from reclaim.
> diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
> index 488b460e046f..47de36146fb8 100644
> --- a/fs/fuse/fuse_i.h
> +++ b/fs/fuse/fuse_i.h
> @@ -654,6 +654,9 @@ struct fuse_conn {
> /* show legacy mount options */
> unsigned int legacy_opts_show:1;
>
> + /* Improved writeback cache policy */
> + unsigned writeback_cache_v2:1;
> +
> /*
> * fs kills suid/sgid/cap on write/chown/trunc. suid is killed on
> * write/trunc only if caller did not have CAP_FSETID. sgid is killed
> diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
> index 8c0665c5dff8..2d5fa82b08b6 100644
> --- a/fs/fuse/inode.c
> +++ b/fs/fuse/inode.c
> @@ -237,14 +237,41 @@ void fuse_change_attributes(struct inode *inode, struct fuse_attr *attr,
> u32 cache_mask;
> loff_t oldsize;
> struct timespec64 old_mtime;
> + bool try_wb_update = false;
> +
> + if (fc->writeback_cache_v2 && S_ISREG(inode->i_mode)) {
> + inode_lock(inode);
> + try_wb_update = true;
> + }
>
> spin_lock(&fi->lock);
> /*
> * In case of writeback_cache enabled, writes update mtime, ctime and
> * may update i_size. In these cases trust the cached value in the
> * inode.
> + *
> + * In writeback_cache_v2 mode, if all the following conditions are met,
> + * then we allow the attributes to be refreshed:
> + *
> + * - inode is not in the process of being written (I_SYNC)
> + * - inode has no dirty pages (I_DIRTY_PAGES)
> + * - inode data-related attributes are clean (I_DIRTY_DATASYNC)
> + * - inode does not have any page writeback in progress
> + *
> + * Note: checking PAGECACHE_TAG_WRITEBACK is not sufficient in fuse,
> + * since inode can appear to have no PageWriteback pages, yet still have
> + * outstanding write request.
> */
> cache_mask = fuse_get_cache_mask(inode);
> + if (try_wb_update && !(inode->i_state & (I_DIRTY_PAGES | I_SYNC |
> + I_DIRTY_DATASYNC)) && RB_EMPTY_ROOT(&fi->writepages))
> + cache_mask &= ~(STATX_MTIME | STATX_CTIME | STATX_SIZE);
> + /*
> + * Ensure the ordering of cleanness checking and following attr_version
> + * comparison.
> + */
> + smp_mb();
> +
> if (cache_mask & STATX_SIZE)
> attr->size = i_size_read(inode);
>
> @@ -283,7 +310,13 @@ void fuse_change_attributes(struct inode *inode, struct fuse_attr *attr,
> truncate_pagecache(inode, attr->size);
> if (!fc->explicit_inval_data)
> inval = true;
> - } else if (fc->auto_inval_data) {
> + } else if (!fc->writeback_cache_v2 && fc->auto_inval_data) {
> + /*
> + * When fc->writeback_cache_v2 is set, the old_mtime
> + * can be generated by kernel and must not equal to
> + * new_mtime generated by server. So skip in such
> + * case.
> + */
> struct timespec64 new_mtime = {
> .tv_sec = attr->mtime,
> .tv_nsec = attr->mtimensec,
> @@ -303,6 +336,9 @@ void fuse_change_attributes(struct inode *inode, struct fuse_attr *attr,
>
> if (IS_ENABLED(CONFIG_FUSE_DAX))
> fuse_dax_dontcache(inode, attr->flags);
> +
> + if (try_wb_update)
> + inode_unlock(inode);
> }
>
> static void fuse_init_inode(struct inode *inode, struct fuse_attr *attr)
> @@ -1153,6 +1189,10 @@ static void process_init_reply(struct fuse_mount *fm, struct fuse_args *args,
> fc->async_dio = 1;
> if (flags & FUSE_WRITEBACK_CACHE)
> fc->writeback_cache = 1;
> + if (flags & FUSE_WRITEBACK_CACHE_V2) {
> + fc->writeback_cache = 1;
> + fc->writeback_cache_v2 = 1;
> + }
> if (flags & FUSE_PARALLEL_DIROPS)
> fc->parallel_dirops = 1;
> if (flags & FUSE_HANDLE_KILLPRIV)
> @@ -1234,7 +1274,7 @@ void fuse_send_init(struct fuse_mount *fm)
> FUSE_ABORT_ERROR | FUSE_MAX_PAGES | FUSE_CACHE_SYMLINKS |
> FUSE_NO_OPENDIR_SUPPORT | FUSE_EXPLICIT_INVAL_DATA |
> FUSE_HANDLE_KILLPRIV_V2 | FUSE_SETXATTR_EXT | FUSE_INIT_EXT |
> - FUSE_SECURITY_CTX;
> + FUSE_SECURITY_CTX | FUSE_WRITEBACK_CACHE_V2;
> #ifdef CONFIG_FUSE_DAX
> if (fm->fc->dax)
> flags |= FUSE_MAP_ALIGNMENT;
> diff --git a/include/uapi/linux/fuse.h b/include/uapi/linux/fuse.h
> index d6ccee961891..b474763bcf59 100644
> --- a/include/uapi/linux/fuse.h
> +++ b/include/uapi/linux/fuse.h
> @@ -194,6 +194,7 @@
> * - add FUSE_SECURITY_CTX init flag
> * - add security context to create, mkdir, symlink, and mknod requests
> * - add FUSE_HAS_INODE_DAX, FUSE_ATTR_DAX
> + * - add FUSE_WRITEBACK_CACHE_V2 init flag
> */
>
> #ifndef _LINUX_FUSE_H
> @@ -353,6 +354,9 @@ struct fuse_file_lock {
> * FUSE_SECURITY_CTX: add security context to create, mkdir, symlink, and
> * mknod
> * FUSE_HAS_INODE_DAX: use per inode DAX
> + * FUSE_WRITEBACK_CACHE_V2:
> + * allow time/size to be refreshed if no pending write
> + * c/mtime not updated from kernel to server
> */
> #define FUSE_ASYNC_READ (1 << 0)
> #define FUSE_POSIX_LOCKS (1 << 1)
> @@ -389,6 +393,7 @@ struct fuse_file_lock {
> /* bits 32..63 get shifted down 32 bits into the flags2 field */
> #define FUSE_SECURITY_CTX (1ULL << 32)
> #define FUSE_HAS_INODE_DAX (1ULL << 33)
> +#define FUSE_WRITEBACK_CACHE_V2 (1ULL << 34)
>
> /**
> * CUSE INIT request/reply flags
> --
> 2.20.1
>

2022-06-28 05:36:36

by Jiachen Zhang

[permalink] [raw]
Subject: Re: [External] Re: [PATCH] fuse: writeback_cache consistency enhancement (writeback_cache_v2)

On Sat, Jun 25, 2022 at 2:29 AM Vivek Goyal <[email protected]> wrote:
>
> On Fri, Jun 24, 2022 at 01:58:25PM +0800, Jiachen Zhang wrote:
> > Some users may want both the high performance of the writeback_cahe mode and
> > a little bit more consistency among FUSE mounts. In the current writeback
> > mode implementation, users of one FUSE mount can never see the file
> > expansion done by other FUSE mounts.
> >
> > Based on the suggested writeback V2 patch in the upstream mailing-list [1],
> > this commit allows the cmtime and size to be updated from server in
> > writeback mode. Compared with the writeback V2 patch in [1], this commit has
> > several differences:
> >
> > 1. Ensure c/mtime are not updated from kernel to server. IOW, the cmtime
> > generated by kernel are just temporary values that are never flushed to
> > server, and they can also be updated by the official server cmtime when
> > the writeback cache is clean.
> >
> > 2. Skip mtime-based revalidation when fc->auto_inval_data is set with
> > fc->writeback_cache_v2. Because the kernel-generated temporary cmtime
> > are likely not equal to the offical server cmtime.
> >
> > 3. If any page is ever flushed to the server during FUSE_GETATTR
> > handling on fuse server, even if the cache is clean when
> > fuse_change_attributes() checks, we should not update the i_size. This
> > is because the FUSE_GETATTR may get a staled size before the FUSE_WRITE
> > request changes server inode size. This commit ensures this by
> > increasing attr_version after writeback for writeback_cache_v2. In that
> > case, we should also ensure the ordering of the attr_version updating
> > and the fi->writepages RB-tree updating. So that if a fuse page
> > writeback ever happens during fuse_change_attributes(), either the
> > fi->writepages is not empty, or the attr_version is increased. So we
> > never mistakenly update a stale file size from server to kernel.
> >
> > With this patch, writeback mode can consider the server c/mtime as the
> > official one. When inode attr is timeout or invalidated, kernel has chance
> > to see size and c/mtime modified by others.
> >
> > Together with another patch [2], a FUSE daemon is able to implement
> > close-to-open (CTO) consistency like what is done in NFS clients.
> >
> > [1] https://lore.kernel.org/linux-fsdevel/[email protected]
> > [2] https://lore.kernel.org/linux-fsdevel/[email protected]/
> >
> > Suggested-by: Miklos Szeredi <[email protected]>
> > Signed-off-by: Jiachen Zhang <[email protected]>
> > ---
> > fs/fuse/file.c | 17 +++++++++++++++
> > fs/fuse/fuse_i.h | 3 +++
> > fs/fuse/inode.c | 44 +++++++++++++++++++++++++++++++++++++--
> > include/uapi/linux/fuse.h | 5 +++++
> > 4 files changed, 67 insertions(+), 2 deletions(-)
>
> A quick comment without reading this patch, please do add some
> documentation which explains what's the existing behavior and what
> the new behavior.
>
> Probably Documentation/filesystem/fuse.rst is right place. We have
> so many fine knobs w.r.t attr caching and data caching and what kind
> of cache consistency to expect but there is no documentation. So
> very hard to figure out what to expect in different scenarios for
> a user.
>
> I think this patch series probably a good time to start some cache
> coherency related documentation in fuse and that will help users
> as well as will provide context for future changes.
>
> Thanks
> Vivek

Hi Vivek.

I found some good materials about fuse I/O and cache modes in
Documentation/filesystem/fuse-io.rst [1]. Maybe we could add more
information about cache coherence to that file?

[1] https://www.kernel.org/doc/html/v5.17/filesystems/fuse-io.html

Thanks,
Jiachen

>
> >
> > diff --git a/fs/fuse/file.c b/fs/fuse/file.c
> > index 9b64e2ff1c96..35bdc7af8468 100644
> > --- a/fs/fuse/file.c
> > +++ b/fs/fuse/file.c
> > @@ -1829,6 +1829,15 @@ static void fuse_writepage_end(struct fuse_mount *fm, struct fuse_args *args,
> > */
> > fuse_send_writepage(fm, next, inarg->offset + inarg->size);
> > }
> > +
> > + if (fc->writeback_cache_v2)
> > + fi->attr_version = atomic64_inc_return(&fc->attr_version);
> > + /*
> > + * Ensure attr_version increases before the page is move out of the
> > + * writepages rb-tree.
> > + */
> > + smp_mb();
> > +
> > fi->writectr--;
> > fuse_writepage_finish(fm, wpa);
> > spin_unlock(&fi->lock);
> > @@ -1858,10 +1867,18 @@ static struct fuse_file *fuse_write_file_get(struct fuse_inode *fi)
> >
> > int fuse_write_inode(struct inode *inode, struct writeback_control *wbc)
> > {
> > + struct fuse_conn *fc = get_fuse_conn(inode);
> > struct fuse_inode *fi = get_fuse_inode(inode);
> > struct fuse_file *ff;
> > int err;
> >
> > + /*
> > + * Kernel c/mtime should not be updated to the server in the
> > + * writeback_cache_v2 mode as server c/mtime are official.
> > + */
> > + if (fc->writeback_cache_v2)
> > + return 0;
> > +
> > /*
> > * Inode is always written before the last reference is dropped and
> > * hence this should not be reached from reclaim.
> > diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
> > index 488b460e046f..47de36146fb8 100644
> > --- a/fs/fuse/fuse_i.h
> > +++ b/fs/fuse/fuse_i.h
> > @@ -654,6 +654,9 @@ struct fuse_conn {
> > /* show legacy mount options */
> > unsigned int legacy_opts_show:1;
> >
> > + /* Improved writeback cache policy */
> > + unsigned writeback_cache_v2:1;
> > +
> > /*
> > * fs kills suid/sgid/cap on write/chown/trunc. suid is killed on
> > * write/trunc only if caller did not have CAP_FSETID. sgid is killed
> > diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
> > index 8c0665c5dff8..2d5fa82b08b6 100644
> > --- a/fs/fuse/inode.c
> > +++ b/fs/fuse/inode.c
> > @@ -237,14 +237,41 @@ void fuse_change_attributes(struct inode *inode, struct fuse_attr *attr,
> > u32 cache_mask;
> > loff_t oldsize;
> > struct timespec64 old_mtime;
> > + bool try_wb_update = false;
> > +
> > + if (fc->writeback_cache_v2 && S_ISREG(inode->i_mode)) {
> > + inode_lock(inode);
> > + try_wb_update = true;
> > + }
> >
> > spin_lock(&fi->lock);
> > /*
> > * In case of writeback_cache enabled, writes update mtime, ctime and
> > * may update i_size. In these cases trust the cached value in the
> > * inode.
> > + *
> > + * In writeback_cache_v2 mode, if all the following conditions are met,
> > + * then we allow the attributes to be refreshed:
> > + *
> > + * - inode is not in the process of being written (I_SYNC)
> > + * - inode has no dirty pages (I_DIRTY_PAGES)
> > + * - inode data-related attributes are clean (I_DIRTY_DATASYNC)
> > + * - inode does not have any page writeback in progress
> > + *
> > + * Note: checking PAGECACHE_TAG_WRITEBACK is not sufficient in fuse,
> > + * since inode can appear to have no PageWriteback pages, yet still have
> > + * outstanding write request.
> > */
> > cache_mask = fuse_get_cache_mask(inode);
> > + if (try_wb_update && !(inode->i_state & (I_DIRTY_PAGES | I_SYNC |
> > + I_DIRTY_DATASYNC)) && RB_EMPTY_ROOT(&fi->writepages))
> > + cache_mask &= ~(STATX_MTIME | STATX_CTIME | STATX_SIZE);
> > + /*
> > + * Ensure the ordering of cleanness checking and following attr_version
> > + * comparison.
> > + */
> > + smp_mb();
> > +
> > if (cache_mask & STATX_SIZE)
> > attr->size = i_size_read(inode);
> >
> > @@ -283,7 +310,13 @@ void fuse_change_attributes(struct inode *inode, struct fuse_attr *attr,
> > truncate_pagecache(inode, attr->size);
> > if (!fc->explicit_inval_data)
> > inval = true;
> > - } else if (fc->auto_inval_data) {
> > + } else if (!fc->writeback_cache_v2 && fc->auto_inval_data) {
> > + /*
> > + * When fc->writeback_cache_v2 is set, the old_mtime
> > + * can be generated by kernel and must not equal to
> > + * new_mtime generated by server. So skip in such
> > + * case.
> > + */
> > struct timespec64 new_mtime = {
> > .tv_sec = attr->mtime,
> > .tv_nsec = attr->mtimensec,
> > @@ -303,6 +336,9 @@ void fuse_change_attributes(struct inode *inode, struct fuse_attr *attr,
> >
> > if (IS_ENABLED(CONFIG_FUSE_DAX))
> > fuse_dax_dontcache(inode, attr->flags);
> > +
> > + if (try_wb_update)
> > + inode_unlock(inode);
> > }
> >
> > static void fuse_init_inode(struct inode *inode, struct fuse_attr *attr)
> > @@ -1153,6 +1189,10 @@ static void process_init_reply(struct fuse_mount *fm, struct fuse_args *args,
> > fc->async_dio = 1;
> > if (flags & FUSE_WRITEBACK_CACHE)
> > fc->writeback_cache = 1;
> > + if (flags & FUSE_WRITEBACK_CACHE_V2) {
> > + fc->writeback_cache = 1;
> > + fc->writeback_cache_v2 = 1;
> > + }
> > if (flags & FUSE_PARALLEL_DIROPS)
> > fc->parallel_dirops = 1;
> > if (flags & FUSE_HANDLE_KILLPRIV)
> > @@ -1234,7 +1274,7 @@ void fuse_send_init(struct fuse_mount *fm)
> > FUSE_ABORT_ERROR | FUSE_MAX_PAGES | FUSE_CACHE_SYMLINKS |
> > FUSE_NO_OPENDIR_SUPPORT | FUSE_EXPLICIT_INVAL_DATA |
> > FUSE_HANDLE_KILLPRIV_V2 | FUSE_SETXATTR_EXT | FUSE_INIT_EXT |
> > - FUSE_SECURITY_CTX;
> > + FUSE_SECURITY_CTX | FUSE_WRITEBACK_CACHE_V2;
> > #ifdef CONFIG_FUSE_DAX
> > if (fm->fc->dax)
> > flags |= FUSE_MAP_ALIGNMENT;
> > diff --git a/include/uapi/linux/fuse.h b/include/uapi/linux/fuse.h
> > index d6ccee961891..b474763bcf59 100644
> > --- a/include/uapi/linux/fuse.h
> > +++ b/include/uapi/linux/fuse.h
> > @@ -194,6 +194,7 @@
> > * - add FUSE_SECURITY_CTX init flag
> > * - add security context to create, mkdir, symlink, and mknod requests
> > * - add FUSE_HAS_INODE_DAX, FUSE_ATTR_DAX
> > + * - add FUSE_WRITEBACK_CACHE_V2 init flag
> > */
> >
> > #ifndef _LINUX_FUSE_H
> > @@ -353,6 +354,9 @@ struct fuse_file_lock {
> > * FUSE_SECURITY_CTX: add security context to create, mkdir, symlink, and
> > * mknod
> > * FUSE_HAS_INODE_DAX: use per inode DAX
> > + * FUSE_WRITEBACK_CACHE_V2:
> > + * allow time/size to be refreshed if no pending write
> > + * c/mtime not updated from kernel to server
> > */
> > #define FUSE_ASYNC_READ (1 << 0)
> > #define FUSE_POSIX_LOCKS (1 << 1)
> > @@ -389,6 +393,7 @@ struct fuse_file_lock {
> > /* bits 32..63 get shifted down 32 bits into the flags2 field */
> > #define FUSE_SECURITY_CTX (1ULL << 32)
> > #define FUSE_HAS_INODE_DAX (1ULL << 33)
> > +#define FUSE_WRITEBACK_CACHE_V2 (1ULL << 34)
> >
> > /**
> > * CUSE INIT request/reply flags
> > --
> > 2.20.1
> >
>

2022-06-28 10:46:15

by Amir Goldstein

[permalink] [raw]
Subject: Re: [PATCH] fuse: writeback_cache consistency enhancement (writeback_cache_v2)

On Fri, Jun 24, 2022 at 9:03 AM Jiachen Zhang
<[email protected]> wrote:
>
> Some users may want both the high performance of the writeback_cahe mode and
> a little bit more consistency among FUSE mounts. In the current writeback
> mode implementation, users of one FUSE mount can never see the file
> expansion done by other FUSE mounts.
>
> Based on the suggested writeback V2 patch in the upstream mailing-list [1],
> this commit allows the cmtime and size to be updated from server in
> writeback mode. Compared with the writeback V2 patch in [1], this commit has
> several differences:
>
> 1. Ensure c/mtime are not updated from kernel to server. IOW, the cmtime
> generated by kernel are just temporary values that are never flushed to
> server, and they can also be updated by the official server cmtime when
> the writeback cache is clean.
>
> 2. Skip mtime-based revalidation when fc->auto_inval_data is set with
> fc->writeback_cache_v2. Because the kernel-generated temporary cmtime
> are likely not equal to the offical server cmtime.
>
> 3. If any page is ever flushed to the server during FUSE_GETATTR
> handling on fuse server, even if the cache is clean when
> fuse_change_attributes() checks, we should not update the i_size. This
> is because the FUSE_GETATTR may get a staled size before the FUSE_WRITE
> request changes server inode size. This commit ensures this by
> increasing attr_version after writeback for writeback_cache_v2. In that
> case, we should also ensure the ordering of the attr_version updating
> and the fi->writepages RB-tree updating. So that if a fuse page
> writeback ever happens during fuse_change_attributes(), either the
> fi->writepages is not empty, or the attr_version is increased. So we
> never mistakenly update a stale file size from server to kernel.
>
> With this patch, writeback mode can consider the server c/mtime as the
> official one. When inode attr is timeout or invalidated, kernel has chance
> to see size and c/mtime modified by others.
>
> Together with another patch [2], a FUSE daemon is able to implement
> close-to-open (CTO) consistency like what is done in NFS clients.
>
> [1] https://lore.kernel.org/linux-fsdevel/[email protected]
> [2] https://lore.kernel.org/linux-fsdevel/[email protected]/
>
> Suggested-by: Miklos Szeredi <[email protected]>
> Signed-off-by: Jiachen Zhang <[email protected]>
> ---
> fs/fuse/file.c | 17 +++++++++++++++
> fs/fuse/fuse_i.h | 3 +++
> fs/fuse/inode.c | 44 +++++++++++++++++++++++++++++++++++++--
> include/uapi/linux/fuse.h | 5 +++++
> 4 files changed, 67 insertions(+), 2 deletions(-)
>
> diff --git a/fs/fuse/file.c b/fs/fuse/file.c
> index 9b64e2ff1c96..35bdc7af8468 100644
> --- a/fs/fuse/file.c
> +++ b/fs/fuse/file.c
> @@ -1829,6 +1829,15 @@ static void fuse_writepage_end(struct fuse_mount *fm, struct fuse_args *args,
> */
> fuse_send_writepage(fm, next, inarg->offset + inarg->size);
> }
> +
> + if (fc->writeback_cache_v2)
> + fi->attr_version = atomic64_inc_return(&fc->attr_version);
> + /*
> + * Ensure attr_version increases before the page is move out of the
> + * writepages rb-tree.
> + */
> + smp_mb();
> +
> fi->writectr--;
> fuse_writepage_finish(fm, wpa);
> spin_unlock(&fi->lock);
> @@ -1858,10 +1867,18 @@ static struct fuse_file *fuse_write_file_get(struct fuse_inode *fi)
>
> int fuse_write_inode(struct inode *inode, struct writeback_control *wbc)
> {
> + struct fuse_conn *fc = get_fuse_conn(inode);
> struct fuse_inode *fi = get_fuse_inode(inode);
> struct fuse_file *ff;
> int err;
>
> + /*
> + * Kernel c/mtime should not be updated to the server in the
> + * writeback_cache_v2 mode as server c/mtime are official.
> + */
> + if (fc->writeback_cache_v2)
> + return 0;
> +
> /*
> * Inode is always written before the last reference is dropped and
> * hence this should not be reached from reclaim.
> diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
> index 488b460e046f..47de36146fb8 100644
> --- a/fs/fuse/fuse_i.h
> +++ b/fs/fuse/fuse_i.h
> @@ -654,6 +654,9 @@ struct fuse_conn {
> /* show legacy mount options */
> unsigned int legacy_opts_show:1;
>
> + /* Improved writeback cache policy */
> + unsigned writeback_cache_v2:1;
> +

Seeing that writeback_cache_v2 depends on writeback_cache
I wonder whether that will not be better represented as:

/** write-back cache policy (default is write-through) */
- unsigned writeback_cache:1;
+ unsigned writeback_cache:2;


Looking at the recently added handle_killpriv_v2, I also wonder
if that would not have been better.

Vivek,
is handle_killpriv_v2 really independent of handle_killpriv?
Seeing test like these worry me as they are inviting bugs:

if (!fc->handle_killpriv && !fc->handle_killpriv_v2) {

Thanks,
Amir.

2022-06-29 18:02:17

by Vivek Goyal

[permalink] [raw]
Subject: Re: [External] Re: [PATCH] fuse: writeback_cache consistency enhancement (writeback_cache_v2)

On Tue, Jun 28, 2022 at 01:18:35PM +0800, Jiachen Zhang wrote:
> On Sat, Jun 25, 2022 at 2:29 AM Vivek Goyal <[email protected]> wrote:
> >
> > On Fri, Jun 24, 2022 at 01:58:25PM +0800, Jiachen Zhang wrote:
> > > Some users may want both the high performance of the writeback_cahe mode and
> > > a little bit more consistency among FUSE mounts. In the current writeback
> > > mode implementation, users of one FUSE mount can never see the file
> > > expansion done by other FUSE mounts.
> > >
> > > Based on the suggested writeback V2 patch in the upstream mailing-list [1],
> > > this commit allows the cmtime and size to be updated from server in
> > > writeback mode. Compared with the writeback V2 patch in [1], this commit has
> > > several differences:
> > >
> > > 1. Ensure c/mtime are not updated from kernel to server. IOW, the cmtime
> > > generated by kernel are just temporary values that are never flushed to
> > > server, and they can also be updated by the official server cmtime when
> > > the writeback cache is clean.
> > >
> > > 2. Skip mtime-based revalidation when fc->auto_inval_data is set with
> > > fc->writeback_cache_v2. Because the kernel-generated temporary cmtime
> > > are likely not equal to the offical server cmtime.
> > >
> > > 3. If any page is ever flushed to the server during FUSE_GETATTR
> > > handling on fuse server, even if the cache is clean when
> > > fuse_change_attributes() checks, we should not update the i_size. This
> > > is because the FUSE_GETATTR may get a staled size before the FUSE_WRITE
> > > request changes server inode size. This commit ensures this by
> > > increasing attr_version after writeback for writeback_cache_v2. In that
> > > case, we should also ensure the ordering of the attr_version updating
> > > and the fi->writepages RB-tree updating. So that if a fuse page
> > > writeback ever happens during fuse_change_attributes(), either the
> > > fi->writepages is not empty, or the attr_version is increased. So we
> > > never mistakenly update a stale file size from server to kernel.
> > >
> > > With this patch, writeback mode can consider the server c/mtime as the
> > > official one. When inode attr is timeout or invalidated, kernel has chance
> > > to see size and c/mtime modified by others.
> > >
> > > Together with another patch [2], a FUSE daemon is able to implement
> > > close-to-open (CTO) consistency like what is done in NFS clients.
> > >
> > > [1] https://lore.kernel.org/linux-fsdevel/[email protected]
> > > [2] https://lore.kernel.org/linux-fsdevel/[email protected]/
> > >
> > > Suggested-by: Miklos Szeredi <[email protected]>
> > > Signed-off-by: Jiachen Zhang <[email protected]>
> > > ---
> > > fs/fuse/file.c | 17 +++++++++++++++
> > > fs/fuse/fuse_i.h | 3 +++
> > > fs/fuse/inode.c | 44 +++++++++++++++++++++++++++++++++++++--
> > > include/uapi/linux/fuse.h | 5 +++++
> > > 4 files changed, 67 insertions(+), 2 deletions(-)
> >
> > A quick comment without reading this patch, please do add some
> > documentation which explains what's the existing behavior and what
> > the new behavior.
> >
> > Probably Documentation/filesystem/fuse.rst is right place. We have
> > so many fine knobs w.r.t attr caching and data caching and what kind
> > of cache consistency to expect but there is no documentation. So
> > very hard to figure out what to expect in different scenarios for
> > a user.
> >
> > I think this patch series probably a good time to start some cache
> > coherency related documentation in fuse and that will help users
> > as well as will provide context for future changes.
> >
> > Thanks
> > Vivek
>
> Hi Vivek.
>
> I found some good materials about fuse I/O and cache modes in
> Documentation/filesystem/fuse-io.rst [1]. Maybe we could add more
> information about cache coherence to that file?

Adding cache coherency information in fuse-io.rst sounds reasonable.

Vivek

>
> [1] https://www.kernel.org/doc/html/v5.17/filesystems/fuse-io.html
>
> Thanks,
> Jiachen
>
> >
> > >
> > > diff --git a/fs/fuse/file.c b/fs/fuse/file.c
> > > index 9b64e2ff1c96..35bdc7af8468 100644
> > > --- a/fs/fuse/file.c
> > > +++ b/fs/fuse/file.c
> > > @@ -1829,6 +1829,15 @@ static void fuse_writepage_end(struct fuse_mount *fm, struct fuse_args *args,
> > > */
> > > fuse_send_writepage(fm, next, inarg->offset + inarg->size);
> > > }
> > > +
> > > + if (fc->writeback_cache_v2)
> > > + fi->attr_version = atomic64_inc_return(&fc->attr_version);
> > > + /*
> > > + * Ensure attr_version increases before the page is move out of the
> > > + * writepages rb-tree.
> > > + */
> > > + smp_mb();
> > > +
> > > fi->writectr--;
> > > fuse_writepage_finish(fm, wpa);
> > > spin_unlock(&fi->lock);
> > > @@ -1858,10 +1867,18 @@ static struct fuse_file *fuse_write_file_get(struct fuse_inode *fi)
> > >
> > > int fuse_write_inode(struct inode *inode, struct writeback_control *wbc)
> > > {
> > > + struct fuse_conn *fc = get_fuse_conn(inode);
> > > struct fuse_inode *fi = get_fuse_inode(inode);
> > > struct fuse_file *ff;
> > > int err;
> > >
> > > + /*
> > > + * Kernel c/mtime should not be updated to the server in the
> > > + * writeback_cache_v2 mode as server c/mtime are official.
> > > + */
> > > + if (fc->writeback_cache_v2)
> > > + return 0;
> > > +
> > > /*
> > > * Inode is always written before the last reference is dropped and
> > > * hence this should not be reached from reclaim.
> > > diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
> > > index 488b460e046f..47de36146fb8 100644
> > > --- a/fs/fuse/fuse_i.h
> > > +++ b/fs/fuse/fuse_i.h
> > > @@ -654,6 +654,9 @@ struct fuse_conn {
> > > /* show legacy mount options */
> > > unsigned int legacy_opts_show:1;
> > >
> > > + /* Improved writeback cache policy */
> > > + unsigned writeback_cache_v2:1;
> > > +
> > > /*
> > > * fs kills suid/sgid/cap on write/chown/trunc. suid is killed on
> > > * write/trunc only if caller did not have CAP_FSETID. sgid is killed
> > > diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
> > > index 8c0665c5dff8..2d5fa82b08b6 100644
> > > --- a/fs/fuse/inode.c
> > > +++ b/fs/fuse/inode.c
> > > @@ -237,14 +237,41 @@ void fuse_change_attributes(struct inode *inode, struct fuse_attr *attr,
> > > u32 cache_mask;
> > > loff_t oldsize;
> > > struct timespec64 old_mtime;
> > > + bool try_wb_update = false;
> > > +
> > > + if (fc->writeback_cache_v2 && S_ISREG(inode->i_mode)) {
> > > + inode_lock(inode);
> > > + try_wb_update = true;
> > > + }
> > >
> > > spin_lock(&fi->lock);
> > > /*
> > > * In case of writeback_cache enabled, writes update mtime, ctime and
> > > * may update i_size. In these cases trust the cached value in the
> > > * inode.
> > > + *
> > > + * In writeback_cache_v2 mode, if all the following conditions are met,
> > > + * then we allow the attributes to be refreshed:
> > > + *
> > > + * - inode is not in the process of being written (I_SYNC)
> > > + * - inode has no dirty pages (I_DIRTY_PAGES)
> > > + * - inode data-related attributes are clean (I_DIRTY_DATASYNC)
> > > + * - inode does not have any page writeback in progress
> > > + *
> > > + * Note: checking PAGECACHE_TAG_WRITEBACK is not sufficient in fuse,
> > > + * since inode can appear to have no PageWriteback pages, yet still have
> > > + * outstanding write request.
> > > */
> > > cache_mask = fuse_get_cache_mask(inode);
> > > + if (try_wb_update && !(inode->i_state & (I_DIRTY_PAGES | I_SYNC |
> > > + I_DIRTY_DATASYNC)) && RB_EMPTY_ROOT(&fi->writepages))
> > > + cache_mask &= ~(STATX_MTIME | STATX_CTIME | STATX_SIZE);
> > > + /*
> > > + * Ensure the ordering of cleanness checking and following attr_version
> > > + * comparison.
> > > + */
> > > + smp_mb();
> > > +
> > > if (cache_mask & STATX_SIZE)
> > > attr->size = i_size_read(inode);
> > >
> > > @@ -283,7 +310,13 @@ void fuse_change_attributes(struct inode *inode, struct fuse_attr *attr,
> > > truncate_pagecache(inode, attr->size);
> > > if (!fc->explicit_inval_data)
> > > inval = true;
> > > - } else if (fc->auto_inval_data) {
> > > + } else if (!fc->writeback_cache_v2 && fc->auto_inval_data) {
> > > + /*
> > > + * When fc->writeback_cache_v2 is set, the old_mtime
> > > + * can be generated by kernel and must not equal to
> > > + * new_mtime generated by server. So skip in such
> > > + * case.
> > > + */
> > > struct timespec64 new_mtime = {
> > > .tv_sec = attr->mtime,
> > > .tv_nsec = attr->mtimensec,
> > > @@ -303,6 +336,9 @@ void fuse_change_attributes(struct inode *inode, struct fuse_attr *attr,
> > >
> > > if (IS_ENABLED(CONFIG_FUSE_DAX))
> > > fuse_dax_dontcache(inode, attr->flags);
> > > +
> > > + if (try_wb_update)
> > > + inode_unlock(inode);
> > > }
> > >
> > > static void fuse_init_inode(struct inode *inode, struct fuse_attr *attr)
> > > @@ -1153,6 +1189,10 @@ static void process_init_reply(struct fuse_mount *fm, struct fuse_args *args,
> > > fc->async_dio = 1;
> > > if (flags & FUSE_WRITEBACK_CACHE)
> > > fc->writeback_cache = 1;
> > > + if (flags & FUSE_WRITEBACK_CACHE_V2) {
> > > + fc->writeback_cache = 1;
> > > + fc->writeback_cache_v2 = 1;
> > > + }
> > > if (flags & FUSE_PARALLEL_DIROPS)
> > > fc->parallel_dirops = 1;
> > > if (flags & FUSE_HANDLE_KILLPRIV)
> > > @@ -1234,7 +1274,7 @@ void fuse_send_init(struct fuse_mount *fm)
> > > FUSE_ABORT_ERROR | FUSE_MAX_PAGES | FUSE_CACHE_SYMLINKS |
> > > FUSE_NO_OPENDIR_SUPPORT | FUSE_EXPLICIT_INVAL_DATA |
> > > FUSE_HANDLE_KILLPRIV_V2 | FUSE_SETXATTR_EXT | FUSE_INIT_EXT |
> > > - FUSE_SECURITY_CTX;
> > > + FUSE_SECURITY_CTX | FUSE_WRITEBACK_CACHE_V2;
> > > #ifdef CONFIG_FUSE_DAX
> > > if (fm->fc->dax)
> > > flags |= FUSE_MAP_ALIGNMENT;
> > > diff --git a/include/uapi/linux/fuse.h b/include/uapi/linux/fuse.h
> > > index d6ccee961891..b474763bcf59 100644
> > > --- a/include/uapi/linux/fuse.h
> > > +++ b/include/uapi/linux/fuse.h
> > > @@ -194,6 +194,7 @@
> > > * - add FUSE_SECURITY_CTX init flag
> > > * - add security context to create, mkdir, symlink, and mknod requests
> > > * - add FUSE_HAS_INODE_DAX, FUSE_ATTR_DAX
> > > + * - add FUSE_WRITEBACK_CACHE_V2 init flag
> > > */
> > >
> > > #ifndef _LINUX_FUSE_H
> > > @@ -353,6 +354,9 @@ struct fuse_file_lock {
> > > * FUSE_SECURITY_CTX: add security context to create, mkdir, symlink, and
> > > * mknod
> > > * FUSE_HAS_INODE_DAX: use per inode DAX
> > > + * FUSE_WRITEBACK_CACHE_V2:
> > > + * allow time/size to be refreshed if no pending write
> > > + * c/mtime not updated from kernel to server
> > > */
> > > #define FUSE_ASYNC_READ (1 << 0)
> > > #define FUSE_POSIX_LOCKS (1 << 1)
> > > @@ -389,6 +393,7 @@ struct fuse_file_lock {
> > > /* bits 32..63 get shifted down 32 bits into the flags2 field */
> > > #define FUSE_SECURITY_CTX (1ULL << 32)
> > > #define FUSE_HAS_INODE_DAX (1ULL << 33)
> > > +#define FUSE_WRITEBACK_CACHE_V2 (1ULL << 34)
> > >
> > > /**
> > > * CUSE INIT request/reply flags
> > > --
> > > 2.20.1
> > >
> >
>

2022-06-29 18:15:23

by Vivek Goyal

[permalink] [raw]
Subject: Re: [PATCH] fuse: writeback_cache consistency enhancement (writeback_cache_v2)

On Fri, Jun 24, 2022 at 01:58:25PM +0800, Jiachen Zhang wrote:
> Some users may want both the high performance of the writeback_cahe mode and
> a little bit more consistency among FUSE mounts. In the current writeback
> mode implementation, users of one FUSE mount can never see the file
> expansion done by other FUSE mounts.
>
> Based on the suggested writeback V2 patch in the upstream mailing-list [1],
> this commit allows the cmtime and size to be updated from server in
> writeback mode. Compared with the writeback V2 patch in [1], this commit has
> several differences:
>
> 1. Ensure c/mtime are not updated from kernel to server. IOW, the cmtime
> generated by kernel are just temporary values that are never flushed to
> server, and they can also be updated by the official server cmtime when
> the writeback cache is clean.
>
> 2. Skip mtime-based revalidation when fc->auto_inval_data is set with
> fc->writeback_cache_v2. Because the kernel-generated temporary cmtime
> are likely not equal to the offical server cmtime.

We have a init flag to negotiate FUSE_AUTO_INVAL_DATA to enable/disable
fc->auto_inval_data. So should it be filesystem's (user space) responsibility
to not use FUSE_AUTO_INVAL_DATA when using FUSE_WRITEBACK_CACHE. I mean,
otherwise it is confusing. On one hand filesystem is using FUSE_AUTO_INVAL_DATA
but it silently is ingored by kernel because FUSE_WRITEBACK_CACHE is
being used as well.

will it make more sense to just document it and let filesystem authors
enable options carefully.

BTW, this probably should be a separate patch in the patch series anyway
with proper explanation.

Thanks
Vivek

>
> 3. If any page is ever flushed to the server during FUSE_GETATTR
> handling on fuse server, even if the cache is clean when
> fuse_change_attributes() checks, we should not update the i_size. This
> is because the FUSE_GETATTR may get a staled size before the FUSE_WRITE
> request changes server inode size. This commit ensures this by
> increasing attr_version after writeback for writeback_cache_v2. In that
> case, we should also ensure the ordering of the attr_version updating
> and the fi->writepages RB-tree updating. So that if a fuse page
> writeback ever happens during fuse_change_attributes(), either the
> fi->writepages is not empty, or the attr_version is increased. So we
> never mistakenly update a stale file size from server to kernel.
>
> With this patch, writeback mode can consider the server c/mtime as the
> official one. When inode attr is timeout or invalidated, kernel has chance
> to see size and c/mtime modified by others.
>
> Together with another patch [2], a FUSE daemon is able to implement
> close-to-open (CTO) consistency like what is done in NFS clients.
>
> [1] https://lore.kernel.org/linux-fsdevel/[email protected]
> [2] https://lore.kernel.org/linux-fsdevel/[email protected]/
>
> Suggested-by: Miklos Szeredi <[email protected]>
> Signed-off-by: Jiachen Zhang <[email protected]>
> ---
> fs/fuse/file.c | 17 +++++++++++++++
> fs/fuse/fuse_i.h | 3 +++
> fs/fuse/inode.c | 44 +++++++++++++++++++++++++++++++++++++--
> include/uapi/linux/fuse.h | 5 +++++
> 4 files changed, 67 insertions(+), 2 deletions(-)
>
> diff --git a/fs/fuse/file.c b/fs/fuse/file.c
> index 9b64e2ff1c96..35bdc7af8468 100644
> --- a/fs/fuse/file.c
> +++ b/fs/fuse/file.c
> @@ -1829,6 +1829,15 @@ static void fuse_writepage_end(struct fuse_mount *fm, struct fuse_args *args,
> */
> fuse_send_writepage(fm, next, inarg->offset + inarg->size);
> }
> +
> + if (fc->writeback_cache_v2)
> + fi->attr_version = atomic64_inc_return(&fc->attr_version);
> + /*
> + * Ensure attr_version increases before the page is move out of the
> + * writepages rb-tree.
> + */
> + smp_mb();
> +
> fi->writectr--;
> fuse_writepage_finish(fm, wpa);
> spin_unlock(&fi->lock);
> @@ -1858,10 +1867,18 @@ static struct fuse_file *fuse_write_file_get(struct fuse_inode *fi)
>
> int fuse_write_inode(struct inode *inode, struct writeback_control *wbc)
> {
> + struct fuse_conn *fc = get_fuse_conn(inode);
> struct fuse_inode *fi = get_fuse_inode(inode);
> struct fuse_file *ff;
> int err;
>
> + /*
> + * Kernel c/mtime should not be updated to the server in the
> + * writeback_cache_v2 mode as server c/mtime are official.
> + */
> + if (fc->writeback_cache_v2)
> + return 0;
> +
> /*
> * Inode is always written before the last reference is dropped and
> * hence this should not be reached from reclaim.
> diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
> index 488b460e046f..47de36146fb8 100644
> --- a/fs/fuse/fuse_i.h
> +++ b/fs/fuse/fuse_i.h
> @@ -654,6 +654,9 @@ struct fuse_conn {
> /* show legacy mount options */
> unsigned int legacy_opts_show:1;
>
> + /* Improved writeback cache policy */
> + unsigned writeback_cache_v2:1;
> +
> /*
> * fs kills suid/sgid/cap on write/chown/trunc. suid is killed on
> * write/trunc only if caller did not have CAP_FSETID. sgid is killed
> diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
> index 8c0665c5dff8..2d5fa82b08b6 100644
> --- a/fs/fuse/inode.c
> +++ b/fs/fuse/inode.c
> @@ -237,14 +237,41 @@ void fuse_change_attributes(struct inode *inode, struct fuse_attr *attr,
> u32 cache_mask;
> loff_t oldsize;
> struct timespec64 old_mtime;
> + bool try_wb_update = false;
> +
> + if (fc->writeback_cache_v2 && S_ISREG(inode->i_mode)) {
> + inode_lock(inode);
> + try_wb_update = true;
> + }
>
> spin_lock(&fi->lock);
> /*
> * In case of writeback_cache enabled, writes update mtime, ctime and
> * may update i_size. In these cases trust the cached value in the
> * inode.
> + *
> + * In writeback_cache_v2 mode, if all the following conditions are met,
> + * then we allow the attributes to be refreshed:
> + *
> + * - inode is not in the process of being written (I_SYNC)
> + * - inode has no dirty pages (I_DIRTY_PAGES)
> + * - inode data-related attributes are clean (I_DIRTY_DATASYNC)
> + * - inode does not have any page writeback in progress
> + *
> + * Note: checking PAGECACHE_TAG_WRITEBACK is not sufficient in fuse,
> + * since inode can appear to have no PageWriteback pages, yet still have
> + * outstanding write request.
> */
> cache_mask = fuse_get_cache_mask(inode);
> + if (try_wb_update && !(inode->i_state & (I_DIRTY_PAGES | I_SYNC |
> + I_DIRTY_DATASYNC)) && RB_EMPTY_ROOT(&fi->writepages))
> + cache_mask &= ~(STATX_MTIME | STATX_CTIME | STATX_SIZE);
> + /*
> + * Ensure the ordering of cleanness checking and following attr_version
> + * comparison.
> + */
> + smp_mb();
> +
> if (cache_mask & STATX_SIZE)
> attr->size = i_size_read(inode);
>
> @@ -283,7 +310,13 @@ void fuse_change_attributes(struct inode *inode, struct fuse_attr *attr,
> truncate_pagecache(inode, attr->size);
> if (!fc->explicit_inval_data)
> inval = true;
> - } else if (fc->auto_inval_data) {
> + } else if (!fc->writeback_cache_v2 && fc->auto_inval_data) {
> + /*
> + * When fc->writeback_cache_v2 is set, the old_mtime
> + * can be generated by kernel and must not equal to
> + * new_mtime generated by server. So skip in such
> + * case.
> + */
> struct timespec64 new_mtime = {
> .tv_sec = attr->mtime,
> .tv_nsec = attr->mtimensec,
> @@ -303,6 +336,9 @@ void fuse_change_attributes(struct inode *inode, struct fuse_attr *attr,
>
> if (IS_ENABLED(CONFIG_FUSE_DAX))
> fuse_dax_dontcache(inode, attr->flags);
> +
> + if (try_wb_update)
> + inode_unlock(inode);
> }
>
> static void fuse_init_inode(struct inode *inode, struct fuse_attr *attr)
> @@ -1153,6 +1189,10 @@ static void process_init_reply(struct fuse_mount *fm, struct fuse_args *args,
> fc->async_dio = 1;
> if (flags & FUSE_WRITEBACK_CACHE)
> fc->writeback_cache = 1;
> + if (flags & FUSE_WRITEBACK_CACHE_V2) {
> + fc->writeback_cache = 1;
> + fc->writeback_cache_v2 = 1;
> + }
> if (flags & FUSE_PARALLEL_DIROPS)
> fc->parallel_dirops = 1;
> if (flags & FUSE_HANDLE_KILLPRIV)
> @@ -1234,7 +1274,7 @@ void fuse_send_init(struct fuse_mount *fm)
> FUSE_ABORT_ERROR | FUSE_MAX_PAGES | FUSE_CACHE_SYMLINKS |
> FUSE_NO_OPENDIR_SUPPORT | FUSE_EXPLICIT_INVAL_DATA |
> FUSE_HANDLE_KILLPRIV_V2 | FUSE_SETXATTR_EXT | FUSE_INIT_EXT |
> - FUSE_SECURITY_CTX;
> + FUSE_SECURITY_CTX | FUSE_WRITEBACK_CACHE_V2;
> #ifdef CONFIG_FUSE_DAX
> if (fm->fc->dax)
> flags |= FUSE_MAP_ALIGNMENT;
> diff --git a/include/uapi/linux/fuse.h b/include/uapi/linux/fuse.h
> index d6ccee961891..b474763bcf59 100644
> --- a/include/uapi/linux/fuse.h
> +++ b/include/uapi/linux/fuse.h
> @@ -194,6 +194,7 @@
> * - add FUSE_SECURITY_CTX init flag
> * - add security context to create, mkdir, symlink, and mknod requests
> * - add FUSE_HAS_INODE_DAX, FUSE_ATTR_DAX
> + * - add FUSE_WRITEBACK_CACHE_V2 init flag
> */
>
> #ifndef _LINUX_FUSE_H
> @@ -353,6 +354,9 @@ struct fuse_file_lock {
> * FUSE_SECURITY_CTX: add security context to create, mkdir, symlink, and
> * mknod
> * FUSE_HAS_INODE_DAX: use per inode DAX
> + * FUSE_WRITEBACK_CACHE_V2:
> + * allow time/size to be refreshed if no pending write
> + * c/mtime not updated from kernel to server
> */
> #define FUSE_ASYNC_READ (1 << 0)
> #define FUSE_POSIX_LOCKS (1 << 1)
> @@ -389,6 +393,7 @@ struct fuse_file_lock {
> /* bits 32..63 get shifted down 32 bits into the flags2 field */
> #define FUSE_SECURITY_CTX (1ULL << 32)
> #define FUSE_HAS_INODE_DAX (1ULL << 33)
> +#define FUSE_WRITEBACK_CACHE_V2 (1ULL << 34)
>
> /**
> * CUSE INIT request/reply flags
> --
> 2.20.1
>

2022-06-29 18:20:50

by Vivek Goyal

[permalink] [raw]
Subject: Re: [PATCH] fuse: writeback_cache consistency enhancement (writeback_cache_v2)

On Fri, Jun 24, 2022 at 01:58:25PM +0800, Jiachen Zhang wrote:
> Some users may want both the high performance of the writeback_cahe mode and
> a little bit more consistency among FUSE mounts. In the current writeback
> mode implementation, users of one FUSE mount can never see the file
> expansion done by other FUSE mounts.
>
> Based on the suggested writeback V2 patch in the upstream mailing-list [1],
> this commit allows the cmtime and size to be updated from server in
> writeback mode. Compared with the writeback V2 patch in [1], this commit has
> several differences:
>
> 1. Ensure c/mtime are not updated from kernel to server. IOW, the cmtime
> generated by kernel are just temporary values that are never flushed to
> server, and they can also be updated by the official server cmtime when
> the writeback cache is clean.

Can this give the appearance of time going backwards. If file server
is getting time stamps from another remote filesystem (say NFS), it
is possible that client and server clocks are not in sync. So for
a stat() might return temporary value of c/mtime from local kernel and
once it gets udpated, it returns a time which might be behind previous
time. Sounds like trouble to me.

This can happen more often on virtiofs (even without a remote filesystem
being involved).

>
> 2. Skip mtime-based revalidation when fc->auto_inval_data is set with
> fc->writeback_cache_v2. Because the kernel-generated temporary cmtime
> are likely not equal to the offical server cmtime.
>
> 3. If any page is ever flushed to the server during FUSE_GETATTR
> handling on fuse server, even if the cache is clean when
> fuse_change_attributes() checks, we should not update the i_size. This
> is because the FUSE_GETATTR may get a staled size before the FUSE_WRITE
> request changes server inode size. This commit ensures this by
> increasing attr_version after writeback for writeback_cache_v2. In that
> case, we should also ensure the ordering of the attr_version updating
> and the fi->writepages RB-tree updating. So that if a fuse page
> writeback ever happens during fuse_change_attributes(), either the
> fi->writepages is not empty, or the attr_version is increased. So we
> never mistakenly update a stale file size from server to kernel.
>
> With this patch, writeback mode can consider the server c/mtime as the
> official one. When inode attr is timeout or invalidated, kernel has chance
> to see size and c/mtime modified by others.
>
> Together with another patch [2], a FUSE daemon is able to implement
> close-to-open (CTO) consistency like what is done in NFS clients.

If your goal is to implement NFS style CTO with writeback cache mode, then
is it not enough to query and udpate attributes at file open time. Instead
of worrying about also updating attrs while file is open (and cache is
clean). Updating attrs when cache is clean, is non-deterministic anyway,
so I don't see how it can be very useful for applications.

Thanks
Vivek


>
> [1] https://lore.kernel.org/linux-fsdevel/[email protected]
> [2] https://lore.kernel.org/linux-fsdevel/[email protected]/
>
> Suggested-by: Miklos Szeredi <[email protected]>
> Signed-off-by: Jiachen Zhang <[email protected]>
> ---
> fs/fuse/file.c | 17 +++++++++++++++
> fs/fuse/fuse_i.h | 3 +++
> fs/fuse/inode.c | 44 +++++++++++++++++++++++++++++++++++++--
> include/uapi/linux/fuse.h | 5 +++++
> 4 files changed, 67 insertions(+), 2 deletions(-)
>
> diff --git a/fs/fuse/file.c b/fs/fuse/file.c
> index 9b64e2ff1c96..35bdc7af8468 100644
> --- a/fs/fuse/file.c
> +++ b/fs/fuse/file.c
> @@ -1829,6 +1829,15 @@ static void fuse_writepage_end(struct fuse_mount *fm, struct fuse_args *args,
> */
> fuse_send_writepage(fm, next, inarg->offset + inarg->size);
> }
> +
> + if (fc->writeback_cache_v2)
> + fi->attr_version = atomic64_inc_return(&fc->attr_version);
> + /*
> + * Ensure attr_version increases before the page is move out of the
> + * writepages rb-tree.
> + */
> + smp_mb();
> +
> fi->writectr--;
> fuse_writepage_finish(fm, wpa);
> spin_unlock(&fi->lock);
> @@ -1858,10 +1867,18 @@ static struct fuse_file *fuse_write_file_get(struct fuse_inode *fi)
>
> int fuse_write_inode(struct inode *inode, struct writeback_control *wbc)
> {
> + struct fuse_conn *fc = get_fuse_conn(inode);
> struct fuse_inode *fi = get_fuse_inode(inode);
> struct fuse_file *ff;
> int err;
>
> + /*
> + * Kernel c/mtime should not be updated to the server in the
> + * writeback_cache_v2 mode as server c/mtime are official.
> + */
> + if (fc->writeback_cache_v2)
> + return 0;
> +
> /*
> * Inode is always written before the last reference is dropped and
> * hence this should not be reached from reclaim.
> diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
> index 488b460e046f..47de36146fb8 100644
> --- a/fs/fuse/fuse_i.h
> +++ b/fs/fuse/fuse_i.h
> @@ -654,6 +654,9 @@ struct fuse_conn {
> /* show legacy mount options */
> unsigned int legacy_opts_show:1;
>
> + /* Improved writeback cache policy */
> + unsigned writeback_cache_v2:1;
> +
> /*
> * fs kills suid/sgid/cap on write/chown/trunc. suid is killed on
> * write/trunc only if caller did not have CAP_FSETID. sgid is killed
> diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
> index 8c0665c5dff8..2d5fa82b08b6 100644
> --- a/fs/fuse/inode.c
> +++ b/fs/fuse/inode.c
> @@ -237,14 +237,41 @@ void fuse_change_attributes(struct inode *inode, struct fuse_attr *attr,
> u32 cache_mask;
> loff_t oldsize;
> struct timespec64 old_mtime;
> + bool try_wb_update = false;
> +
> + if (fc->writeback_cache_v2 && S_ISREG(inode->i_mode)) {
> + inode_lock(inode);
> + try_wb_update = true;
> + }
>
> spin_lock(&fi->lock);
> /*
> * In case of writeback_cache enabled, writes update mtime, ctime and
> * may update i_size. In these cases trust the cached value in the
> * inode.
> + *
> + * In writeback_cache_v2 mode, if all the following conditions are met,
> + * then we allow the attributes to be refreshed:
> + *
> + * - inode is not in the process of being written (I_SYNC)
> + * - inode has no dirty pages (I_DIRTY_PAGES)
> + * - inode data-related attributes are clean (I_DIRTY_DATASYNC)
> + * - inode does not have any page writeback in progress
> + *
> + * Note: checking PAGECACHE_TAG_WRITEBACK is not sufficient in fuse,
> + * since inode can appear to have no PageWriteback pages, yet still have
> + * outstanding write request.
> */
> cache_mask = fuse_get_cache_mask(inode);
> + if (try_wb_update && !(inode->i_state & (I_DIRTY_PAGES | I_SYNC |
> + I_DIRTY_DATASYNC)) && RB_EMPTY_ROOT(&fi->writepages))
> + cache_mask &= ~(STATX_MTIME | STATX_CTIME | STATX_SIZE);
> + /*
> + * Ensure the ordering of cleanness checking and following attr_version
> + * comparison.
> + */
> + smp_mb();
> +
> if (cache_mask & STATX_SIZE)
> attr->size = i_size_read(inode);
>
> @@ -283,7 +310,13 @@ void fuse_change_attributes(struct inode *inode, struct fuse_attr *attr,
> truncate_pagecache(inode, attr->size);
> if (!fc->explicit_inval_data)
> inval = true;
> - } else if (fc->auto_inval_data) {
> + } else if (!fc->writeback_cache_v2 && fc->auto_inval_data) {
> + /*
> + * When fc->writeback_cache_v2 is set, the old_mtime
> + * can be generated by kernel and must not equal to
> + * new_mtime generated by server. So skip in such
> + * case.
> + */
> struct timespec64 new_mtime = {
> .tv_sec = attr->mtime,
> .tv_nsec = attr->mtimensec,
> @@ -303,6 +336,9 @@ void fuse_change_attributes(struct inode *inode, struct fuse_attr *attr,
>
> if (IS_ENABLED(CONFIG_FUSE_DAX))
> fuse_dax_dontcache(inode, attr->flags);
> +
> + if (try_wb_update)
> + inode_unlock(inode);
> }
>
> static void fuse_init_inode(struct inode *inode, struct fuse_attr *attr)
> @@ -1153,6 +1189,10 @@ static void process_init_reply(struct fuse_mount *fm, struct fuse_args *args,
> fc->async_dio = 1;
> if (flags & FUSE_WRITEBACK_CACHE)
> fc->writeback_cache = 1;
> + if (flags & FUSE_WRITEBACK_CACHE_V2) {
> + fc->writeback_cache = 1;
> + fc->writeback_cache_v2 = 1;
> + }
> if (flags & FUSE_PARALLEL_DIROPS)
> fc->parallel_dirops = 1;
> if (flags & FUSE_HANDLE_KILLPRIV)
> @@ -1234,7 +1274,7 @@ void fuse_send_init(struct fuse_mount *fm)
> FUSE_ABORT_ERROR | FUSE_MAX_PAGES | FUSE_CACHE_SYMLINKS |
> FUSE_NO_OPENDIR_SUPPORT | FUSE_EXPLICIT_INVAL_DATA |
> FUSE_HANDLE_KILLPRIV_V2 | FUSE_SETXATTR_EXT | FUSE_INIT_EXT |
> - FUSE_SECURITY_CTX;
> + FUSE_SECURITY_CTX | FUSE_WRITEBACK_CACHE_V2;
> #ifdef CONFIG_FUSE_DAX
> if (fm->fc->dax)
> flags |= FUSE_MAP_ALIGNMENT;
> diff --git a/include/uapi/linux/fuse.h b/include/uapi/linux/fuse.h
> index d6ccee961891..b474763bcf59 100644
> --- a/include/uapi/linux/fuse.h
> +++ b/include/uapi/linux/fuse.h
> @@ -194,6 +194,7 @@
> * - add FUSE_SECURITY_CTX init flag
> * - add security context to create, mkdir, symlink, and mknod requests
> * - add FUSE_HAS_INODE_DAX, FUSE_ATTR_DAX
> + * - add FUSE_WRITEBACK_CACHE_V2 init flag
> */
>
> #ifndef _LINUX_FUSE_H
> @@ -353,6 +354,9 @@ struct fuse_file_lock {
> * FUSE_SECURITY_CTX: add security context to create, mkdir, symlink, and
> * mknod
> * FUSE_HAS_INODE_DAX: use per inode DAX
> + * FUSE_WRITEBACK_CACHE_V2:
> + * allow time/size to be refreshed if no pending write
> + * c/mtime not updated from kernel to server
> */
> #define FUSE_ASYNC_READ (1 << 0)
> #define FUSE_POSIX_LOCKS (1 << 1)
> @@ -389,6 +393,7 @@ struct fuse_file_lock {
> /* bits 32..63 get shifted down 32 bits into the flags2 field */
> #define FUSE_SECURITY_CTX (1ULL << 32)
> #define FUSE_HAS_INODE_DAX (1ULL << 33)
> +#define FUSE_WRITEBACK_CACHE_V2 (1ULL << 34)
>
> /**
> * CUSE INIT request/reply flags
> --
> 2.20.1
>

2022-06-29 18:45:43

by Vivek Goyal

[permalink] [raw]
Subject: Re: [PATCH] fuse: writeback_cache consistency enhancement (writeback_cache_v2)

On Tue, Jun 28, 2022 at 01:09:56PM +0300, Amir Goldstein wrote:
> On Fri, Jun 24, 2022 at 9:03 AM Jiachen Zhang
> <[email protected]> wrote:
> >
> > Some users may want both the high performance of the writeback_cahe mode and
> > a little bit more consistency among FUSE mounts. In the current writeback
> > mode implementation, users of one FUSE mount can never see the file
> > expansion done by other FUSE mounts.
> >
> > Based on the suggested writeback V2 patch in the upstream mailing-list [1],
> > this commit allows the cmtime and size to be updated from server in
> > writeback mode. Compared with the writeback V2 patch in [1], this commit has
> > several differences:
> >
> > 1. Ensure c/mtime are not updated from kernel to server. IOW, the cmtime
> > generated by kernel are just temporary values that are never flushed to
> > server, and they can also be updated by the official server cmtime when
> > the writeback cache is clean.
> >
> > 2. Skip mtime-based revalidation when fc->auto_inval_data is set with
> > fc->writeback_cache_v2. Because the kernel-generated temporary cmtime
> > are likely not equal to the offical server cmtime.
> >
> > 3. If any page is ever flushed to the server during FUSE_GETATTR
> > handling on fuse server, even if the cache is clean when
> > fuse_change_attributes() checks, we should not update the i_size. This
> > is because the FUSE_GETATTR may get a staled size before the FUSE_WRITE
> > request changes server inode size. This commit ensures this by
> > increasing attr_version after writeback for writeback_cache_v2. In that
> > case, we should also ensure the ordering of the attr_version updating
> > and the fi->writepages RB-tree updating. So that if a fuse page
> > writeback ever happens during fuse_change_attributes(), either the
> > fi->writepages is not empty, or the attr_version is increased. So we
> > never mistakenly update a stale file size from server to kernel.
> >
> > With this patch, writeback mode can consider the server c/mtime as the
> > official one. When inode attr is timeout or invalidated, kernel has chance
> > to see size and c/mtime modified by others.
> >
> > Together with another patch [2], a FUSE daemon is able to implement
> > close-to-open (CTO) consistency like what is done in NFS clients.
> >
> > [1] https://lore.kernel.org/linux-fsdevel/[email protected]
> > [2] https://lore.kernel.org/linux-fsdevel/[email protected]/
> >
> > Suggested-by: Miklos Szeredi <[email protected]>
> > Signed-off-by: Jiachen Zhang <[email protected]>
> > ---
> > fs/fuse/file.c | 17 +++++++++++++++
> > fs/fuse/fuse_i.h | 3 +++
> > fs/fuse/inode.c | 44 +++++++++++++++++++++++++++++++++++++--
> > include/uapi/linux/fuse.h | 5 +++++
> > 4 files changed, 67 insertions(+), 2 deletions(-)
> >
> > diff --git a/fs/fuse/file.c b/fs/fuse/file.c
> > index 9b64e2ff1c96..35bdc7af8468 100644
> > --- a/fs/fuse/file.c
> > +++ b/fs/fuse/file.c
> > @@ -1829,6 +1829,15 @@ static void fuse_writepage_end(struct fuse_mount *fm, struct fuse_args *args,
> > */
> > fuse_send_writepage(fm, next, inarg->offset + inarg->size);
> > }
> > +
> > + if (fc->writeback_cache_v2)
> > + fi->attr_version = atomic64_inc_return(&fc->attr_version);
> > + /*
> > + * Ensure attr_version increases before the page is move out of the
> > + * writepages rb-tree.
> > + */
> > + smp_mb();
> > +
> > fi->writectr--;
> > fuse_writepage_finish(fm, wpa);
> > spin_unlock(&fi->lock);
> > @@ -1858,10 +1867,18 @@ static struct fuse_file *fuse_write_file_get(struct fuse_inode *fi)
> >
> > int fuse_write_inode(struct inode *inode, struct writeback_control *wbc)
> > {
> > + struct fuse_conn *fc = get_fuse_conn(inode);
> > struct fuse_inode *fi = get_fuse_inode(inode);
> > struct fuse_file *ff;
> > int err;
> >
> > + /*
> > + * Kernel c/mtime should not be updated to the server in the
> > + * writeback_cache_v2 mode as server c/mtime are official.
> > + */
> > + if (fc->writeback_cache_v2)
> > + return 0;
> > +
> > /*
> > * Inode is always written before the last reference is dropped and
> > * hence this should not be reached from reclaim.
> > diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
> > index 488b460e046f..47de36146fb8 100644
> > --- a/fs/fuse/fuse_i.h
> > +++ b/fs/fuse/fuse_i.h
> > @@ -654,6 +654,9 @@ struct fuse_conn {
> > /* show legacy mount options */
> > unsigned int legacy_opts_show:1;
> >
> > + /* Improved writeback cache policy */
> > + unsigned writeback_cache_v2:1;
> > +
>
> Seeing that writeback_cache_v2 depends on writeback_cache
> I wonder whether that will not be better represented as:
>
> /** write-back cache policy (default is write-through) */
> - unsigned writeback_cache:1;
> + unsigned writeback_cache:2;
>
>
> Looking at the recently added handle_killpriv_v2, I also wonder
> if that would not have been better.
>
> Vivek,
> is handle_killpriv_v2 really independent of handle_killpriv?
> Seeing test like these worry me as they are inviting bugs:
>
> if (!fc->handle_killpriv && !fc->handle_killpriv_v2) {

Hi Amir,

IIRC, yes, handle_killpriv_v2 and handle_killpriv are independent.

Above check is doing a GETATTR request to server to get latest mode,
only if none of handle_killpriv and and handle_killpriv_v2 are enabled.
If any one of these is enabled, we skip fuse_do_getattr() and reset
ATTR_MODE and expect server to clear suid/sgid bits.

handle_killpriv was not enough, so I added handle_killpriv_v2 and
even enabled it by default in virtiofsd. Later there came reports that
some of the xfstests failed where suid/sgid was not cleared on server.
I had tried to debug it and it was some issue in underlying ext4/xfs
filesystem. I could not pursue it further at the time. I should probably
get back to it. My temporary solution was to disable killpriv_v2 in
virtiofsd.

Havind said that, I feel there are too many corner cases w.r.t suid/sgid
clearing. And it probably is a good idea that client sends a setattr
request with ATTR_MODE set. While it is racy but it can be more correct.
Otherwise it is easy to miss some code path somewhere where assumptions
are not correct and we did not clear suid/sgid as expected.

Thanks
Vivek

2022-07-15 10:59:02

by Miklos Szeredi

[permalink] [raw]
Subject: Re: [PATCH] fuse: writeback_cache consistency enhancement (writeback_cache_v2)

On Fri, 24 Jun 2022 at 07:58, Jiachen Zhang
<[email protected]> wrote:
>
> Some users may want both the high performance of the writeback_cahe mode and
> a little bit more consistency among FUSE mounts. In the current writeback
> mode implementation, users of one FUSE mount can never see the file
> expansion done by other FUSE mounts.
>
> Based on the suggested writeback V2 patch in the upstream mailing-list [1],
> this commit allows the cmtime and size to be updated from server in
> writeback mode. Compared with the writeback V2 patch in [1], this commit has
> several differences:
>
> 1. Ensure c/mtime are not updated from kernel to server. IOW, the cmtime
> generated by kernel are just temporary values that are never flushed to
> server, and they can also be updated by the official server cmtime when
> the writeback cache is clean.

Interesting. The issue there is that ctime/mtime would change
spontaneously when data is flushed to server. Maybe that's a lesser
of the evils, I don't know...

>
> 2. Skip mtime-based revalidation when fc->auto_inval_data is set with
> fc->writeback_cache_v2. Because the kernel-generated temporary cmtime
> are likely not equal to the offical server cmtime.

Right, I think auto_inval_data has always been broken with writeback_cache.
>
> 3. If any page is ever flushed to the server during FUSE_GETATTR
> handling on fuse server, even if the cache is clean when
> fuse_change_attributes() checks, we should not update the i_size. This
> is because the FUSE_GETATTR may get a staled size before the FUSE_WRITE
> request changes server inode size. This commit ensures this by
> increasing attr_version after writeback for writeback_cache_v2. In that
> case, we should also ensure the ordering of the attr_version updating
> and the fi->writepages RB-tree updating. So that if a fuse page
> writeback ever happens during fuse_change_attributes(), either the
> fi->writepages is not empty, or the attr_version is increased. So we
> never mistakenly update a stale file size from server to kernel.
>
> With this patch, writeback mode can consider the server c/mtime as the
> official one. When inode attr is timeout or invalidated, kernel has chance
> to see size and c/mtime modified by others.
>
> Together with another patch [2], a FUSE daemon is able to implement
> close-to-open (CTO) consistency like what is done in NFS clients.
>
> [1] https://lore.kernel.org/linux-fsdevel/[email protected]
> [2] https://lore.kernel.org/linux-fsdevel/[email protected]/
>
> Suggested-by: Miklos Szeredi <[email protected]>
> Signed-off-by: Jiachen Zhang <[email protected]>
> ---
> fs/fuse/file.c | 17 +++++++++++++++
> fs/fuse/fuse_i.h | 3 +++
> fs/fuse/inode.c | 44 +++++++++++++++++++++++++++++++++++++--
> include/uapi/linux/fuse.h | 5 +++++
> 4 files changed, 67 insertions(+), 2 deletions(-)
>
> diff --git a/fs/fuse/file.c b/fs/fuse/file.c
> index 9b64e2ff1c96..35bdc7af8468 100644
> --- a/fs/fuse/file.c
> +++ b/fs/fuse/file.c
> @@ -1829,6 +1829,15 @@ static void fuse_writepage_end(struct fuse_mount *fm, struct fuse_args *args,
> */
> fuse_send_writepage(fm, next, inarg->offset + inarg->size);
> }
> +
> + if (fc->writeback_cache_v2)
> + fi->attr_version = atomic64_inc_return(&fc->attr_version);
> + /*
> + * Ensure attr_version increases before the page is move out of the
> + * writepages rb-tree.
> + */
> + smp_mb();
> +
> fi->writectr--;
> fuse_writepage_finish(fm, wpa);
> spin_unlock(&fi->lock);
> @@ -1858,10 +1867,18 @@ static struct fuse_file *fuse_write_file_get(struct fuse_inode *fi)
>
> int fuse_write_inode(struct inode *inode, struct writeback_control *wbc)
> {
> + struct fuse_conn *fc = get_fuse_conn(inode);
> struct fuse_inode *fi = get_fuse_inode(inode);
> struct fuse_file *ff;
> int err;
>
> + /*
> + * Kernel c/mtime should not be updated to the server in the
> + * writeback_cache_v2 mode as server c/mtime are official.
> + */
> + if (fc->writeback_cache_v2)
> + return 0;
> +
> /*
> * Inode is always written before the last reference is dropped and
> * hence this should not be reached from reclaim.
> diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
> index 488b460e046f..47de36146fb8 100644
> --- a/fs/fuse/fuse_i.h
> +++ b/fs/fuse/fuse_i.h
> @@ -654,6 +654,9 @@ struct fuse_conn {
> /* show legacy mount options */
> unsigned int legacy_opts_show:1;
>
> + /* Improved writeback cache policy */
> + unsigned writeback_cache_v2:1;
> +
> /*
> * fs kills suid/sgid/cap on write/chown/trunc. suid is killed on
> * write/trunc only if caller did not have CAP_FSETID. sgid is killed
> diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
> index 8c0665c5dff8..2d5fa82b08b6 100644
> --- a/fs/fuse/inode.c
> +++ b/fs/fuse/inode.c
> @@ -237,14 +237,41 @@ void fuse_change_attributes(struct inode *inode, struct fuse_attr *attr,
> u32 cache_mask;
> loff_t oldsize;
> struct timespec64 old_mtime;
> + bool try_wb_update = false;
> +
> + if (fc->writeback_cache_v2 && S_ISREG(inode->i_mode)) {
> + inode_lock(inode);

I don't think this can work. fuse_change_attributes() might be
called from within inlode locked context. E.g.

lookup_slow -> __lookup_slow -> d_revalidate -> fuse_dentry_revalidate
-> fuse_change_attributes

> + try_wb_update = true;
> + }
>
> spin_lock(&fi->lock);
> /*
> * In case of writeback_cache enabled, writes update mtime, ctime and
> * may update i_size. In these cases trust the cached value in the
> * inode.
> + *
> + * In writeback_cache_v2 mode, if all the following conditions are met,
> + * then we allow the attributes to be refreshed:
> + *
> + * - inode is not in the process of being written (I_SYNC)
> + * - inode has no dirty pages (I_DIRTY_PAGES)
> + * - inode data-related attributes are clean (I_DIRTY_DATASYNC)
> + * - inode does not have any page writeback in progress
> + *
> + * Note: checking PAGECACHE_TAG_WRITEBACK is not sufficient in fuse,
> + * since inode can appear to have no PageWriteback pages, yet still have
> + * outstanding write request.
> */
> cache_mask = fuse_get_cache_mask(inode);
> + if (try_wb_update && !(inode->i_state & (I_DIRTY_PAGES | I_SYNC |
> + I_DIRTY_DATASYNC)) && RB_EMPTY_ROOT(&fi->writepages))
> + cache_mask &= ~(STATX_MTIME | STATX_CTIME | STATX_SIZE);
> + /*
> + * Ensure the ordering of cleanness checking and following attr_version
> + * comparison.
> + */
> + smp_mb();
> +
> if (cache_mask & STATX_SIZE)
> attr->size = i_size_read(inode);
>
> @@ -283,7 +310,13 @@ void fuse_change_attributes(struct inode *inode, struct fuse_attr *attr,
> truncate_pagecache(inode, attr->size);
> if (!fc->explicit_inval_data)
> inval = true;
> - } else if (fc->auto_inval_data) {
> + } else if (!fc->writeback_cache_v2 && fc->auto_inval_data) {
> + /*
> + * When fc->writeback_cache_v2 is set, the old_mtime
> + * can be generated by kernel and must not equal to
> + * new_mtime generated by server. So skip in such
> + * case.
> + */
> struct timespec64 new_mtime = {
> .tv_sec = attr->mtime,
> .tv_nsec = attr->mtimensec,
> @@ -303,6 +336,9 @@ void fuse_change_attributes(struct inode *inode, struct fuse_attr *attr,
>
> if (IS_ENABLED(CONFIG_FUSE_DAX))
> fuse_dax_dontcache(inode, attr->flags);
> +
> + if (try_wb_update)
> + inode_unlock(inode);
> }
>
> static void fuse_init_inode(struct inode *inode, struct fuse_attr *attr)
> @@ -1153,6 +1189,10 @@ static void process_init_reply(struct fuse_mount *fm, struct fuse_args *args,
> fc->async_dio = 1;
> if (flags & FUSE_WRITEBACK_CACHE)
> fc->writeback_cache = 1;
> + if (flags & FUSE_WRITEBACK_CACHE_V2) {
> + fc->writeback_cache = 1;
> + fc->writeback_cache_v2 = 1;
> + }
> if (flags & FUSE_PARALLEL_DIROPS)
> fc->parallel_dirops = 1;
> if (flags & FUSE_HANDLE_KILLPRIV)
> @@ -1234,7 +1274,7 @@ void fuse_send_init(struct fuse_mount *fm)
> FUSE_ABORT_ERROR | FUSE_MAX_PAGES | FUSE_CACHE_SYMLINKS |
> FUSE_NO_OPENDIR_SUPPORT | FUSE_EXPLICIT_INVAL_DATA |
> FUSE_HANDLE_KILLPRIV_V2 | FUSE_SETXATTR_EXT | FUSE_INIT_EXT |
> - FUSE_SECURITY_CTX;
> + FUSE_SECURITY_CTX | FUSE_WRITEBACK_CACHE_V2;
> #ifdef CONFIG_FUSE_DAX
> if (fm->fc->dax)
> flags |= FUSE_MAP_ALIGNMENT;
> diff --git a/include/uapi/linux/fuse.h b/include/uapi/linux/fuse.h
> index d6ccee961891..b474763bcf59 100644
> --- a/include/uapi/linux/fuse.h
> +++ b/include/uapi/linux/fuse.h
> @@ -194,6 +194,7 @@
> * - add FUSE_SECURITY_CTX init flag
> * - add security context to create, mkdir, symlink, and mknod requests
> * - add FUSE_HAS_INODE_DAX, FUSE_ATTR_DAX
> + * - add FUSE_WRITEBACK_CACHE_V2 init flag
> */
>
> #ifndef _LINUX_FUSE_H
> @@ -353,6 +354,9 @@ struct fuse_file_lock {
> * FUSE_SECURITY_CTX: add security context to create, mkdir, symlink, and
> * mknod
> * FUSE_HAS_INODE_DAX: use per inode DAX
> + * FUSE_WRITEBACK_CACHE_V2:
> + * allow time/size to be refreshed if no pending write
> + * c/mtime not updated from kernel to server
> */
> #define FUSE_ASYNC_READ (1 << 0)
> #define FUSE_POSIX_LOCKS (1 << 1)
> @@ -389,6 +393,7 @@ struct fuse_file_lock {
> /* bits 32..63 get shifted down 32 bits into the flags2 field */
> #define FUSE_SECURITY_CTX (1ULL << 32)
> #define FUSE_HAS_INODE_DAX (1ULL << 33)
> +#define FUSE_WRITEBACK_CACHE_V2 (1ULL << 34)
>
> /**
> * CUSE INIT request/reply flags
> --
> 2.20.1
>

2022-07-18 05:30:29

by Jiachen Zhang

[permalink] [raw]
Subject: [PATCH] fuse: writeback_cache consistency enhancement (writeback_cache_v2)

On Thu, Jun 30, 2022 at 1:57 AM Vivek Goyal <[email protected]> wrote:
>
> On Fri, Jun 24, 2022 at 01:58:25PM +0800, Jiachen Zhang wrote:
> > Some users may want both the high performance of the writeback_cahe mode and
> > a little bit more consistency among FUSE mounts. In the current writeback
> > mode implementation, users of one FUSE mount can never see the file
> > expansion done by other FUSE mounts.
> >
> > Based on the suggested writeback V2 patch in the upstream mailing-list [1],
> > this commit allows the cmtime and size to be updated from server in
> > writeback mode. Compared with the writeback V2 patch in [1], this commit has
> > several differences:
> >
> > 1. Ensure c/mtime are not updated from kernel to server. IOW, the cmtime
> > generated by kernel are just temporary values that are never flushed to
> > server, and they can also be updated by the official server cmtime when
> > the writeback cache is clean.
> >
> > 2. Skip mtime-based revalidation when fc->auto_inval_data is set with
> > fc->writeback_cache_v2. Because the kernel-generated temporary cmtime
> > are likely not equal to the offical server cmtime.
>
> We have a init flag to negotiate FUSE_AUTO_INVAL_DATA to enable/disable
> fc->auto_inval_data. So should it be filesystem's (user space) responsibility
> to not use FUSE_AUTO_INVAL_DATA when using FUSE_WRITEBACK_CACHE. I mean,
> otherwise it is confusing. On one hand filesystem is using FUSE_AUTO_INVAL_DATA
> but it silently is ingored by kernel because FUSE_WRITEBACK_CACHE is
> being used as well.
>

Hi Vivek,

The fc->auto_inval_data branch code is also skipped in writeback_cache
v1 mode, so I think we can also skip it in writeback_cache_v2. I
listed this point here because it is not skipped in Miklos'
writeback_cache_v2 initial patch.

Thanks,
Jiachen

> will it make more sense to just document it and let filesystem authors
> enable options carefully.
>
> BTW, this probably should be a separate patch in the patch series anyway
> with proper explanation.
>
> Thanks
> Vivek
>
> >
> > 3. If any page is ever flushed to the server during FUSE_GETATTR
> > handling on fuse server, even if the cache is clean when
> > fuse_change_attributes() checks, we should not update the i_size. This
> > is because the FUSE_GETATTR may get a staled size before the FUSE_WRITE
> > request changes server inode size. This commit ensures this by
> > increasing attr_version after writeback for writeback_cache_v2. In that
> > case, we should also ensure the ordering of the attr_version updating
> > and the fi->writepages RB-tree updating. So that if a fuse page
> > writeback ever happens during fuse_change_attributes(), either the
> > fi->writepages is not empty, or the attr_version is increased. So we
> > never mistakenly update a stale file size from server to kernel.
> >
> > With this patch, writeback mode can consider the server c/mtime as the
> > official one. When inode attr is timeout or invalidated, kernel has chance
> > to see size and c/mtime modified by others.
> >
> > Together with another patch [2], a FUSE daemon is able to implement
> > close-to-open (CTO) consistency like what is done in NFS clients.
> >
> > [1] https://lore.kernel.org/linux-fsdevel/[email protected]
> > [2] https://lore.kernel.org/linux-fsdevel/[email protected]/
> >
> > Suggested-by: Miklos Szeredi <[email protected]>
> > Signed-off-by: Jiachen Zhang <[email protected]>
> > ---
> > fs/fuse/file.c | 17 +++++++++++++++
> > fs/fuse/fuse_i.h | 3 +++
> > fs/fuse/inode.c | 44 +++++++++++++++++++++++++++++++++++++--
> > include/uapi/linux/fuse.h | 5 +++++
> > 4 files changed, 67 insertions(+), 2 deletions(-)
> >
> > diff --git a/fs/fuse/file.c b/fs/fuse/file.c
> > index 9b64e2ff1c96..35bdc7af8468 100644
> > --- a/fs/fuse/file.c
> > +++ b/fs/fuse/file.c
> > @@ -1829,6 +1829,15 @@ static void fuse_writepage_end(struct fuse_mount *fm, struct fuse_args *args,
> > */
> > fuse_send_writepage(fm, next, inarg->offset + inarg->size);
> > }
> > +
> > + if (fc->writeback_cache_v2)
> > + fi->attr_version = atomic64_inc_return(&fc->attr_version);
> > + /*
> > + * Ensure attr_version increases before the page is move out of the
> > + * writepages rb-tree.
> > + */
> > + smp_mb();
> > +
> > fi->writectr--;
> > fuse_writepage_finish(fm, wpa);
> > spin_unlock(&fi->lock);
> > @@ -1858,10 +1867,18 @@ static struct fuse_file *fuse_write_file_get(struct fuse_inode *fi)
> >
> > int fuse_write_inode(struct inode *inode, struct writeback_control *wbc)
> > {
> > + struct fuse_conn *fc = get_fuse_conn(inode);
> > struct fuse_inode *fi = get_fuse_inode(inode);
> > struct fuse_file *ff;
> > int err;
> >
> > + /*
> > + * Kernel c/mtime should not be updated to the server in the
> > + * writeback_cache_v2 mode as server c/mtime are official.
> > + */
> > + if (fc->writeback_cache_v2)
> > + return 0;
> > +
> > /*
> > * Inode is always written before the last reference is dropped and
> > * hence this should not be reached from reclaim.
> > diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
> > index 488b460e046f..47de36146fb8 100644
> > --- a/fs/fuse/fuse_i.h
> > +++ b/fs/fuse/fuse_i.h
> > @@ -654,6 +654,9 @@ struct fuse_conn {
> > /* show legacy mount options */
> > unsigned int legacy_opts_show:1;
> >
> > + /* Improved writeback cache policy */
> > + unsigned writeback_cache_v2:1;
> > +
> > /*
> > * fs kills suid/sgid/cap on write/chown/trunc. suid is killed on
> > * write/trunc only if caller did not have CAP_FSETID. sgid is killed
> > diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
> > index 8c0665c5dff8..2d5fa82b08b6 100644
> > --- a/fs/fuse/inode.c
> > +++ b/fs/fuse/inode.c
> > @@ -237,14 +237,41 @@ void fuse_change_attributes(struct inode *inode, struct fuse_attr *attr,
> > u32 cache_mask;
> > loff_t oldsize;
> > struct timespec64 old_mtime;
> > + bool try_wb_update = false;
> > +
> > + if (fc->writeback_cache_v2 && S_ISREG(inode->i_mode)) {
> > + inode_lock(inode);
> > + try_wb_update = true;
> > + }
> >
> > spin_lock(&fi->lock);
> > /*
> > * In case of writeback_cache enabled, writes update mtime, ctime and
> > * may update i_size. In these cases trust the cached value in the
> > * inode.
> > + *
> > + * In writeback_cache_v2 mode, if all the following conditions are met,
> > + * then we allow the attributes to be refreshed:
> > + *
> > + * - inode is not in the process of being written (I_SYNC)
> > + * - inode has no dirty pages (I_DIRTY_PAGES)
> > + * - inode data-related attributes are clean (I_DIRTY_DATASYNC)
> > + * - inode does not have any page writeback in progress
> > + *
> > + * Note: checking PAGECACHE_TAG_WRITEBACK is not sufficient in fuse,
> > + * since inode can appear to have no PageWriteback pages, yet still have
> > + * outstanding write request.
> > */
> > cache_mask = fuse_get_cache_mask(inode);
> > + if (try_wb_update && !(inode->i_state & (I_DIRTY_PAGES | I_SYNC |
> > + I_DIRTY_DATASYNC)) && RB_EMPTY_ROOT(&fi->writepages))
> > + cache_mask &= ~(STATX_MTIME | STATX_CTIME | STATX_SIZE);
> > + /*
> > + * Ensure the ordering of cleanness checking and following attr_version
> > + * comparison.
> > + */
> > + smp_mb();
> > +
> > if (cache_mask & STATX_SIZE)
> > attr->size = i_size_read(inode);
> >
> > @@ -283,7 +310,13 @@ void fuse_change_attributes(struct inode *inode, struct fuse_attr *attr,
> > truncate_pagecache(inode, attr->size);
> > if (!fc->explicit_inval_data)
> > inval = true;
> > - } else if (fc->auto_inval_data) {
> > + } else if (!fc->writeback_cache_v2 && fc->auto_inval_data) {
> > + /*
> > + * When fc->writeback_cache_v2 is set, the old_mtime
> > + * can be generated by kernel and must not equal to
> > + * new_mtime generated by server. So skip in such
> > + * case.
> > + */
> > struct timespec64 new_mtime = {
> > .tv_sec = attr->mtime,
> > .tv_nsec = attr->mtimensec,
> > @@ -303,6 +336,9 @@ void fuse_change_attributes(struct inode *inode, struct fuse_attr *attr,
> >
> > if (IS_ENABLED(CONFIG_FUSE_DAX))
> > fuse_dax_dontcache(inode, attr->flags);
> > +
> > + if (try_wb_update)
> > + inode_unlock(inode);
> > }
> >
> > static void fuse_init_inode(struct inode *inode, struct fuse_attr *attr)
> > @@ -1153,6 +1189,10 @@ static void process_init_reply(struct fuse_mount *fm, struct fuse_args *args,
> > fc->async_dio = 1;
> > if (flags & FUSE_WRITEBACK_CACHE)
> > fc->writeback_cache = 1;
> > + if (flags & FUSE_WRITEBACK_CACHE_V2) {
> > + fc->writeback_cache = 1;
> > + fc->writeback_cache_v2 = 1;
> > + }
> > if (flags & FUSE_PARALLEL_DIROPS)
> > fc->parallel_dirops = 1;
> > if (flags & FUSE_HANDLE_KILLPRIV)
> > @@ -1234,7 +1274,7 @@ void fuse_send_init(struct fuse_mount *fm)
> > FUSE_ABORT_ERROR | FUSE_MAX_PAGES | FUSE_CACHE_SYMLINKS |
> > FUSE_NO_OPENDIR_SUPPORT | FUSE_EXPLICIT_INVAL_DATA |
> > FUSE_HANDLE_KILLPRIV_V2 | FUSE_SETXATTR_EXT | FUSE_INIT_EXT |
> > - FUSE_SECURITY_CTX;
> > + FUSE_SECURITY_CTX | FUSE_WRITEBACK_CACHE_V2;
> > #ifdef CONFIG_FUSE_DAX
> > if (fm->fc->dax)
> > flags |= FUSE_MAP_ALIGNMENT;
> > diff --git a/include/uapi/linux/fuse.h b/include/uapi/linux/fuse.h
> > index d6ccee961891..b474763bcf59 100644
> > --- a/include/uapi/linux/fuse.h
> > +++ b/include/uapi/linux/fuse.h
> > @@ -194,6 +194,7 @@
> > * - add FUSE_SECURITY_CTX init flag
> > * - add security context to create, mkdir, symlink, and mknod requests
> > * - add FUSE_HAS_INODE_DAX, FUSE_ATTR_DAX
> > + * - add FUSE_WRITEBACK_CACHE_V2 init flag
> > */
> >
> > #ifndef _LINUX_FUSE_H
> > @@ -353,6 +354,9 @@ struct fuse_file_lock {
> > * FUSE_SECURITY_CTX: add security context to create, mkdir, symlink, and
> > * mknod
> > * FUSE_HAS_INODE_DAX: use per inode DAX
> > + * FUSE_WRITEBACK_CACHE_V2:
> > + * allow time/size to be refreshed if no pending write
> > + * c/mtime not updated from kernel to server
> > */
> > #define FUSE_ASYNC_READ (1 << 0)
> > #define FUSE_POSIX_LOCKS (1 << 1)
> > @@ -389,6 +393,7 @@ struct fuse_file_lock {
> > /* bits 32..63 get shifted down 32 bits into the flags2 field */
> > #define FUSE_SECURITY_CTX (1ULL << 32)
> > #define FUSE_HAS_INODE_DAX (1ULL << 33)
> > +#define FUSE_WRITEBACK_CACHE_V2 (1ULL << 34)
> >
> > /**
> > * CUSE INIT request/reply flags
> > --
> > 2.20.1
> >
>

2022-07-18 06:07:31

by Jiachen Zhang

[permalink] [raw]
Subject: Re: [External] Re: [PATCH] fuse: writeback_cache consistency enhancement (writeback_cache_v2)

On Thu, Jun 30, 2022 at 2:14 AM Vivek Goyal <[email protected]> wrote:
>
> On Fri, Jun 24, 2022 at 01:58:25PM +0800, Jiachen Zhang wrote:
> > Some users may want both the high performance of the writeback_cahe mode and
> > a little bit more consistency among FUSE mounts. In the current writeback
> > mode implementation, users of one FUSE mount can never see the file
> > expansion done by other FUSE mounts.
> >
> > Based on the suggested writeback V2 patch in the upstream mailing-list [1],
> > this commit allows the cmtime and size to be updated from server in
> > writeback mode. Compared with the writeback V2 patch in [1], this commit has
> > several differences:
> >
> > 1. Ensure c/mtime are not updated from kernel to server. IOW, the cmtime
> > generated by kernel are just temporary values that are never flushed to
> > server, and they can also be updated by the official server cmtime when
> > the writeback cache is clean.
>
> Can this give the appearance of time going backwards. If file server
> is getting time stamps from another remote filesystem (say NFS), it
> is possible that client and server clocks are not in sync. So for
> a stat() might return temporary value of c/mtime from local kernel and
> once it gets udpated, it returns a time which might be behind previous
> time. Sounds like trouble to me.
>

Hi Vivek,

Yes, the time is likely to be changed silently on a client after the
data is flush to the server. But in terms of consistency, some FS
implementations may rely on the server mtime to revalidate cache. In
such a case, if the server time is updated from multiple client time
sources, it may lead to false cache revalidation results.

> This can happen more often on virtiofs (even without a remote filesystem
> being involved).
>

Maybe if there is no need of multiple VM-guest consistency, we can
just disable the writeback_cache_v2 mode.


> >
> > 2. Skip mtime-based revalidation when fc->auto_inval_data is set with
> > fc->writeback_cache_v2. Because the kernel-generated temporary cmtime
> > are likely not equal to the offical server cmtime.
> >
> > 3. If any page is ever flushed to the server during FUSE_GETATTR
> > handling on fuse server, even if the cache is clean when
> > fuse_change_attributes() checks, we should not update the i_size. This
> > is because the FUSE_GETATTR may get a staled size before the FUSE_WRITE
> > request changes server inode size. This commit ensures this by
> > increasing attr_version after writeback for writeback_cache_v2. In that
> > case, we should also ensure the ordering of the attr_version updating
> > and the fi->writepages RB-tree updating. So that if a fuse page
> > writeback ever happens during fuse_change_attributes(), either the
> > fi->writepages is not empty, or the attr_version is increased. So we
> > never mistakenly update a stale file size from server to kernel.
> >
> > With this patch, writeback mode can consider the server c/mtime as the
> > official one. When inode attr is timeout or invalidated, kernel has chance
> > to see size and c/mtime modified by others.
> >
> > Together with another patch [2], a FUSE daemon is able to implement
> > close-to-open (CTO) consistency like what is done in NFS clients.
>
> If your goal is to implement NFS style CTO with writeback cache mode, then
> is it not enough to query and udpate attributes at file open time. Instead
> of worrying about also updating attrs while file is open (and cache is
> clean). Updating attrs when cache is clean, is non-deterministic anyway,
> so I don't see how it can be very useful for applications.
>

IIUC, the NFS-style CTO consistency only requires updating invalidated
data & attr caches on file open, and the cache updating will be
delayed if there is any writing opens, even if the cache is clean. So
writeback_cache_v2 updates even more radical than NFS-style CTO.

Thanks,
Jiachen

> Thanks
> Vivek
>
>
> >
> > [1] https://lore.kernel.org/linux-fsdevel/[email protected]
> > [2] https://lore.kernel.org/linux-fsdevel/[email protected]/
> >
> > Suggested-by: Miklos Szeredi <[email protected]>
> > Signed-off-by: Jiachen Zhang <[email protected]>
> > ---
> > fs/fuse/file.c | 17 +++++++++++++++
> > fs/fuse/fuse_i.h | 3 +++
> > fs/fuse/inode.c | 44 +++++++++++++++++++++++++++++++++++++--
> > include/uapi/linux/fuse.h | 5 +++++
> > 4 files changed, 67 insertions(+), 2 deletions(-)
> >
> > diff --git a/fs/fuse/file.c b/fs/fuse/file.c
> > index 9b64e2ff1c96..35bdc7af8468 100644
> > --- a/fs/fuse/file.c
> > +++ b/fs/fuse/file.c
> > @@ -1829,6 +1829,15 @@ static void fuse_writepage_end(struct fuse_mount *fm, struct fuse_args *args,
> > */
> > fuse_send_writepage(fm, next, inarg->offset + inarg->size);
> > }
> > +
> > + if (fc->writeback_cache_v2)
> > + fi->attr_version = atomic64_inc_return(&fc->attr_version);
> > + /*
> > + * Ensure attr_version increases before the page is move out of the
> > + * writepages rb-tree.
> > + */
> > + smp_mb();
> > +
> > fi->writectr--;
> > fuse_writepage_finish(fm, wpa);
> > spin_unlock(&fi->lock);
> > @@ -1858,10 +1867,18 @@ static struct fuse_file *fuse_write_file_get(struct fuse_inode *fi)
> >
> > int fuse_write_inode(struct inode *inode, struct writeback_control *wbc)
> > {
> > + struct fuse_conn *fc = get_fuse_conn(inode);
> > struct fuse_inode *fi = get_fuse_inode(inode);
> > struct fuse_file *ff;
> > int err;
> >
> > + /*
> > + * Kernel c/mtime should not be updated to the server in the
> > + * writeback_cache_v2 mode as server c/mtime are official.
> > + */
> > + if (fc->writeback_cache_v2)
> > + return 0;
> > +
> > /*
> > * Inode is always written before the last reference is dropped and
> > * hence this should not be reached from reclaim.
> > diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
> > index 488b460e046f..47de36146fb8 100644
> > --- a/fs/fuse/fuse_i.h
> > +++ b/fs/fuse/fuse_i.h
> > @@ -654,6 +654,9 @@ struct fuse_conn {
> > /* show legacy mount options */
> > unsigned int legacy_opts_show:1;
> >
> > + /* Improved writeback cache policy */
> > + unsigned writeback_cache_v2:1;
> > +
> > /*
> > * fs kills suid/sgid/cap on write/chown/trunc. suid is killed on
> > * write/trunc only if caller did not have CAP_FSETID. sgid is killed
> > diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
> > index 8c0665c5dff8..2d5fa82b08b6 100644
> > --- a/fs/fuse/inode.c
> > +++ b/fs/fuse/inode.c
> > @@ -237,14 +237,41 @@ void fuse_change_attributes(struct inode *inode, struct fuse_attr *attr,
> > u32 cache_mask;
> > loff_t oldsize;
> > struct timespec64 old_mtime;
> > + bool try_wb_update = false;
> > +
> > + if (fc->writeback_cache_v2 && S_ISREG(inode->i_mode)) {
> > + inode_lock(inode);
> > + try_wb_update = true;
> > + }
> >
> > spin_lock(&fi->lock);
> > /*
> > * In case of writeback_cache enabled, writes update mtime, ctime and
> > * may update i_size. In these cases trust the cached value in the
> > * inode.
> > + *
> > + * In writeback_cache_v2 mode, if all the following conditions are met,
> > + * then we allow the attributes to be refreshed:
> > + *
> > + * - inode is not in the process of being written (I_SYNC)
> > + * - inode has no dirty pages (I_DIRTY_PAGES)
> > + * - inode data-related attributes are clean (I_DIRTY_DATASYNC)
> > + * - inode does not have any page writeback in progress
> > + *
> > + * Note: checking PAGECACHE_TAG_WRITEBACK is not sufficient in fuse,
> > + * since inode can appear to have no PageWriteback pages, yet still have
> > + * outstanding write request.
> > */
> > cache_mask = fuse_get_cache_mask(inode);
> > + if (try_wb_update && !(inode->i_state & (I_DIRTY_PAGES | I_SYNC |
> > + I_DIRTY_DATASYNC)) && RB_EMPTY_ROOT(&fi->writepages))
> > + cache_mask &= ~(STATX_MTIME | STATX_CTIME | STATX_SIZE);
> > + /*
> > + * Ensure the ordering of cleanness checking and following attr_version
> > + * comparison.
> > + */
> > + smp_mb();
> > +
> > if (cache_mask & STATX_SIZE)
> > attr->size = i_size_read(inode);
> >
> > @@ -283,7 +310,13 @@ void fuse_change_attributes(struct inode *inode, struct fuse_attr *attr,
> > truncate_pagecache(inode, attr->size);
> > if (!fc->explicit_inval_data)
> > inval = true;
> > - } else if (fc->auto_inval_data) {
> > + } else if (!fc->writeback_cache_v2 && fc->auto_inval_data) {
> > + /*
> > + * When fc->writeback_cache_v2 is set, the old_mtime
> > + * can be generated by kernel and must not equal to
> > + * new_mtime generated by server. So skip in such
> > + * case.
> > + */
> > struct timespec64 new_mtime = {
> > .tv_sec = attr->mtime,
> > .tv_nsec = attr->mtimensec,
> > @@ -303,6 +336,9 @@ void fuse_change_attributes(struct inode *inode, struct fuse_attr *attr,
> >
> > if (IS_ENABLED(CONFIG_FUSE_DAX))
> > fuse_dax_dontcache(inode, attr->flags);
> > +
> > + if (try_wb_update)
> > + inode_unlock(inode);
> > }
> >
> > static void fuse_init_inode(struct inode *inode, struct fuse_attr *attr)
> > @@ -1153,6 +1189,10 @@ static void process_init_reply(struct fuse_mount *fm, struct fuse_args *args,
> > fc->async_dio = 1;
> > if (flags & FUSE_WRITEBACK_CACHE)
> > fc->writeback_cache = 1;
> > + if (flags & FUSE_WRITEBACK_CACHE_V2) {
> > + fc->writeback_cache = 1;
> > + fc->writeback_cache_v2 = 1;
> > + }
> > if (flags & FUSE_PARALLEL_DIROPS)
> > fc->parallel_dirops = 1;
> > if (flags & FUSE_HANDLE_KILLPRIV)
> > @@ -1234,7 +1274,7 @@ void fuse_send_init(struct fuse_mount *fm)
> > FUSE_ABORT_ERROR | FUSE_MAX_PAGES | FUSE_CACHE_SYMLINKS |
> > FUSE_NO_OPENDIR_SUPPORT | FUSE_EXPLICIT_INVAL_DATA |
> > FUSE_HANDLE_KILLPRIV_V2 | FUSE_SETXATTR_EXT | FUSE_INIT_EXT |
> > - FUSE_SECURITY_CTX;
> > + FUSE_SECURITY_CTX | FUSE_WRITEBACK_CACHE_V2;
> > #ifdef CONFIG_FUSE_DAX
> > if (fm->fc->dax)
> > flags |= FUSE_MAP_ALIGNMENT;
> > diff --git a/include/uapi/linux/fuse.h b/include/uapi/linux/fuse.h
> > index d6ccee961891..b474763bcf59 100644
> > --- a/include/uapi/linux/fuse.h
> > +++ b/include/uapi/linux/fuse.h
> > @@ -194,6 +194,7 @@
> > * - add FUSE_SECURITY_CTX init flag
> > * - add security context to create, mkdir, symlink, and mknod requests
> > * - add FUSE_HAS_INODE_DAX, FUSE_ATTR_DAX
> > + * - add FUSE_WRITEBACK_CACHE_V2 init flag
> > */
> >
> > #ifndef _LINUX_FUSE_H
> > @@ -353,6 +354,9 @@ struct fuse_file_lock {
> > * FUSE_SECURITY_CTX: add security context to create, mkdir, symlink, and
> > * mknod
> > * FUSE_HAS_INODE_DAX: use per inode DAX
> > + * FUSE_WRITEBACK_CACHE_V2:
> > + * allow time/size to be refreshed if no pending write
> > + * c/mtime not updated from kernel to server
> > */
> > #define FUSE_ASYNC_READ (1 << 0)
> > #define FUSE_POSIX_LOCKS (1 << 1)
> > @@ -389,6 +393,7 @@ struct fuse_file_lock {
> > /* bits 32..63 get shifted down 32 bits into the flags2 field */
> > #define FUSE_SECURITY_CTX (1ULL << 32)
> > #define FUSE_HAS_INODE_DAX (1ULL << 33)
> > +#define FUSE_WRITEBACK_CACHE_V2 (1ULL << 34)
> >
> > /**
> > * CUSE INIT request/reply flags
> > --
> > 2.20.1
> >
>

2022-07-18 06:09:02

by Jiachen Zhang

[permalink] [raw]
Subject: [PATCH] fuse: writeback_cache consistency enhancement (writeback_cache_v2)

On Fri, Jul 15, 2022 at 6:07 PM Miklos Szeredi <[email protected]> wrote:
>
> On Fri, 24 Jun 2022 at 07:58, Jiachen Zhang
> <[email protected]> wrote:
> >
> > Some users may want both the high performance of the writeback_cahe mode and
> > a little bit more consistency among FUSE mounts. In the current writeback
> > mode implementation, users of one FUSE mount can never see the file
> > expansion done by other FUSE mounts.
> >
> > Based on the suggested writeback V2 patch in the upstream mailing-list [1],
> > this commit allows the cmtime and size to be updated from server in
> > writeback mode. Compared with the writeback V2 patch in [1], this commit has
> > several differences:
> >
> > 1. Ensure c/mtime are not updated from kernel to server. IOW, the cmtime
> > generated by kernel are just temporary values that are never flushed to
> > server, and they can also be updated by the official server cmtime when
> > the writeback cache is clean.
>
> Interesting. The issue there is that ctime/mtime would change
> spontaneously when data is flushed to server. Maybe that's a lesser
> of the evils, I don't know...
>

Hi Miklos,

Yes, just like what Vivek said in this mail thread before. But the
server may reply on the cmtime to do cache revalidation, so I think
we'd better use a single time source.

> >
> > 2. Skip mtime-based revalidation when fc->auto_inval_data is set with
> > fc->writeback_cache_v2. Because the kernel-generated temporary cmtime
> > are likely not equal to the offical server cmtime.
>
> Right, I think auto_inval_data has always been broken with writeback_cache.
> >
> > 3. If any page is ever flushed to the server during FUSE_GETATTR
> > handling on fuse server, even if the cache is clean when
> > fuse_change_attributes() checks, we should not update the i_size. This
> > is because the FUSE_GETATTR may get a staled size before the FUSE_WRITE
> > request changes server inode size. This commit ensures this by
> > increasing attr_version after writeback for writeback_cache_v2. In that
> > case, we should also ensure the ordering of the attr_version updating
> > and the fi->writepages RB-tree updating. So that if a fuse page
> > writeback ever happens during fuse_change_attributes(), either the
> > fi->writepages is not empty, or the attr_version is increased. So we
> > never mistakenly update a stale file size from server to kernel.
> >
> > With this patch, writeback mode can consider the server c/mtime as the
> > official one. When inode attr is timeout or invalidated, kernel has chance
> > to see size and c/mtime modified by others.
> >
> > Together with another patch [2], a FUSE daemon is able to implement
> > close-to-open (CTO) consistency like what is done in NFS clients.
> >
> > [1] https://lore.kernel.org/linux-fsdevel/[email protected]
> > [2] https://lore.kernel.org/linux-fsdevel/[email protected]/
> >
> > Suggested-by: Miklos Szeredi <[email protected]>
> > Signed-off-by: Jiachen Zhang <[email protected]>
> > ---
> > fs/fuse/file.c | 17 +++++++++++++++
> > fs/fuse/fuse_i.h | 3 +++
> > fs/fuse/inode.c | 44 +++++++++++++++++++++++++++++++++++++--
> > include/uapi/linux/fuse.h | 5 +++++
> > 4 files changed, 67 insertions(+), 2 deletions(-)
> >
> > diff --git a/fs/fuse/file.c b/fs/fuse/file.c
> > index 9b64e2ff1c96..35bdc7af8468 100644
> > --- a/fs/fuse/file.c
> > +++ b/fs/fuse/file.c
> > @@ -1829,6 +1829,15 @@ static void fuse_writepage_end(struct fuse_mount *fm, struct fuse_args *args,
> > */
> > fuse_send_writepage(fm, next, inarg->offset + inarg->size);
> > }
> > +
> > + if (fc->writeback_cache_v2)
> > + fi->attr_version = atomic64_inc_return(&fc->attr_version);
> > + /*
> > + * Ensure attr_version increases before the page is move out of the
> > + * writepages rb-tree.
> > + */
> > + smp_mb();
> > +
> > fi->writectr--;
> > fuse_writepage_finish(fm, wpa);
> > spin_unlock(&fi->lock);
> > @@ -1858,10 +1867,18 @@ static struct fuse_file *fuse_write_file_get(struct fuse_inode *fi)
> >
> > int fuse_write_inode(struct inode *inode, struct writeback_control *wbc)
> > {
> > + struct fuse_conn *fc = get_fuse_conn(inode);
> > struct fuse_inode *fi = get_fuse_inode(inode);
> > struct fuse_file *ff;
> > int err;
> >
> > + /*
> > + * Kernel c/mtime should not be updated to the server in the
> > + * writeback_cache_v2 mode as server c/mtime are official.
> > + */
> > + if (fc->writeback_cache_v2)
> > + return 0;
> > +
> > /*
> > * Inode is always written before the last reference is dropped and
> > * hence this should not be reached from reclaim.
> > diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
> > index 488b460e046f..47de36146fb8 100644
> > --- a/fs/fuse/fuse_i.h
> > +++ b/fs/fuse/fuse_i.h
> > @@ -654,6 +654,9 @@ struct fuse_conn {
> > /* show legacy mount options */
> > unsigned int legacy_opts_show:1;
> >
> > + /* Improved writeback cache policy */
> > + unsigned writeback_cache_v2:1;
> > +
> > /*
> > * fs kills suid/sgid/cap on write/chown/trunc. suid is killed on
> > * write/trunc only if caller did not have CAP_FSETID. sgid is killed
> > diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
> > index 8c0665c5dff8..2d5fa82b08b6 100644
> > --- a/fs/fuse/inode.c
> > +++ b/fs/fuse/inode.c
> > @@ -237,14 +237,41 @@ void fuse_change_attributes(struct inode *inode, struct fuse_attr *attr,
> > u32 cache_mask;
> > loff_t oldsize;
> > struct timespec64 old_mtime;
> > + bool try_wb_update = false;
> > +
> > + if (fc->writeback_cache_v2 && S_ISREG(inode->i_mode)) {
> > + inode_lock(inode);
>
> I don't think this can work. fuse_change_attributes() might be
> called from within inlode locked context. E.g.
>
> lookup_slow -> __lookup_slow -> d_revalidate -> fuse_dentry_revalidate
> -> fuse_change_attributes
>

Yes, this is a problem that should be fixed. As we can not check the
inode lock state from the inode->i_rwsem structure, I think we can
pass the inode lock state along the FUSE function call-path to
fuse_change_attributes(), and only when we can certainly know whether
the inode is locked or unlocked then we continue the
writeback_cache_v2 logics. What do you think?

Thanks,
Jiachen

> > + try_wb_update = true;
> > + }
> >
> > spin_lock(&fi->lock);
> > /*
> > * In case of writeback_cache enabled, writes update mtime, ctime and
> > * may update i_size. In these cases trust the cached value in the
> > * inode.
> > + *
> > + * In writeback_cache_v2 mode, if all the following conditions are met,
> > + * then we allow the attributes to be refreshed:
> > + *
> > + * - inode is not in the process of being written (I_SYNC)
> > + * - inode has no dirty pages (I_DIRTY_PAGES)
> > + * - inode data-related attributes are clean (I_DIRTY_DATASYNC)
> > + * - inode does not have any page writeback in progress
> > + *
> > + * Note: checking PAGECACHE_TAG_WRITEBACK is not sufficient in fuse,
> > + * since inode can appear to have no PageWriteback pages, yet still have
> > + * outstanding write request.
> > */
> > cache_mask = fuse_get_cache_mask(inode);
> > + if (try_wb_update && !(inode->i_state & (I_DIRTY_PAGES | I_SYNC |
> > + I_DIRTY_DATASYNC)) && RB_EMPTY_ROOT(&fi->writepages))
> > + cache_mask &= ~(STATX_MTIME | STATX_CTIME | STATX_SIZE);
> > + /*
> > + * Ensure the ordering of cleanness checking and following attr_version
> > + * comparison.
> > + */
> > + smp_mb();
> > +
> > if (cache_mask & STATX_SIZE)
> > attr->size = i_size_read(inode);
> >
> > @@ -283,7 +310,13 @@ void fuse_change_attributes(struct inode *inode, struct fuse_attr *attr,
> > truncate_pagecache(inode, attr->size);
> > if (!fc->explicit_inval_data)
> > inval = true;
> > - } else if (fc->auto_inval_data) {
> > + } else if (!fc->writeback_cache_v2 && fc->auto_inval_data) {
> > + /*
> > + * When fc->writeback_cache_v2 is set, the old_mtime
> > + * can be generated by kernel and must not equal to
> > + * new_mtime generated by server. So skip in such
> > + * case.
> > + */
> > struct timespec64 new_mtime = {
> > .tv_sec = attr->mtime,
> > .tv_nsec = attr->mtimensec,
> > @@ -303,6 +336,9 @@ void fuse_change_attributes(struct inode *inode, struct fuse_attr *attr,
> >
> > if (IS_ENABLED(CONFIG_FUSE_DAX))
> > fuse_dax_dontcache(inode, attr->flags);
> > +
> > + if (try_wb_update)
> > + inode_unlock(inode);
> > }
> >
> > static void fuse_init_inode(struct inode *inode, struct fuse_attr *attr)
> > @@ -1153,6 +1189,10 @@ static void process_init_reply(struct fuse_mount *fm, struct fuse_args *args,
> > fc->async_dio = 1;
> > if (flags & FUSE_WRITEBACK_CACHE)
> > fc->writeback_cache = 1;
> > + if (flags & FUSE_WRITEBACK_CACHE_V2) {
> > + fc->writeback_cache = 1;
> > + fc->writeback_cache_v2 = 1;
> > + }
> > if (flags & FUSE_PARALLEL_DIROPS)
> > fc->parallel_dirops = 1;
> > if (flags & FUSE_HANDLE_KILLPRIV)
> > @@ -1234,7 +1274,7 @@ void fuse_send_init(struct fuse_mount *fm)
> > FUSE_ABORT_ERROR | FUSE_MAX_PAGES | FUSE_CACHE_SYMLINKS |
> > FUSE_NO_OPENDIR_SUPPORT | FUSE_EXPLICIT_INVAL_DATA |
> > FUSE_HANDLE_KILLPRIV_V2 | FUSE_SETXATTR_EXT | FUSE_INIT_EXT |
> > - FUSE_SECURITY_CTX;
> > + FUSE_SECURITY_CTX | FUSE_WRITEBACK_CACHE_V2;
> > #ifdef CONFIG_FUSE_DAX
> > if (fm->fc->dax)
> > flags |= FUSE_MAP_ALIGNMENT;
> > diff --git a/include/uapi/linux/fuse.h b/include/uapi/linux/fuse.h
> > index d6ccee961891..b474763bcf59 100644
> > --- a/include/uapi/linux/fuse.h
> > +++ b/include/uapi/linux/fuse.h
> > @@ -194,6 +194,7 @@
> > * - add FUSE_SECURITY_CTX init flag
> > * - add security context to create, mkdir, symlink, and mknod requests
> > * - add FUSE_HAS_INODE_DAX, FUSE_ATTR_DAX
> > + * - add FUSE_WRITEBACK_CACHE_V2 init flag
> > */
> >
> > #ifndef _LINUX_FUSE_H
> > @@ -353,6 +354,9 @@ struct fuse_file_lock {
> > * FUSE_SECURITY_CTX: add security context to create, mkdir, symlink, and
> > * mknod
> > * FUSE_HAS_INODE_DAX: use per inode DAX
> > + * FUSE_WRITEBACK_CACHE_V2:
> > + * allow time/size to be refreshed if no pending write
> > + * c/mtime not updated from kernel to server
> > */
> > #define FUSE_ASYNC_READ (1 << 0)
> > #define FUSE_POSIX_LOCKS (1 << 1)
> > @@ -389,6 +393,7 @@ struct fuse_file_lock {
> > /* bits 32..63 get shifted down 32 bits into the flags2 field */
> > #define FUSE_SECURITY_CTX (1ULL << 32)
> > #define FUSE_HAS_INODE_DAX (1ULL << 33)
> > +#define FUSE_WRITEBACK_CACHE_V2 (1ULL << 34)
> >
> > /**
> > * CUSE INIT request/reply flags
> > --
> > 2.20.1
> >

2022-07-18 08:49:58

by Miklos Szeredi

[permalink] [raw]
Subject: Re: [PATCH] fuse: writeback_cache consistency enhancement (writeback_cache_v2)

On Mon, 18 Jul 2022 at 08:01, Jiachen Zhang
<[email protected]> wrote:
>
> On Fri, Jul 15, 2022 at 6:07 PM Miklos Szeredi <[email protected]> wrote:
> >
> > On Fri, 24 Jun 2022 at 07:58, Jiachen Zhang
> > <[email protected]> wrote:

> > > + if (fc->writeback_cache_v2 && S_ISREG(inode->i_mode)) {
> > > + inode_lock(inode);
> >
> > I don't think this can work. fuse_change_attributes() might be
> > called from within inlode locked context. E.g.
> >
> > lookup_slow -> __lookup_slow -> d_revalidate -> fuse_dentry_revalidate
> > -> fuse_change_attributes
> >
>
> Yes, this is a problem that should be fixed. As we can not check the
> inode lock state from the inode->i_rwsem structure, I think we can
> pass the inode lock state along the FUSE function call-path to
> fuse_change_attributes(), and only when we can certainly know whether
> the inode is locked or unlocked then we continue the
> writeback_cache_v2 logics. What do you think?

Not liking it very much.

Better create a new lock for this purpose that we do always know the state of.

Thanks,
Miklos

2022-07-25 13:34:29

by Vivek Goyal

[permalink] [raw]
Subject: Re: [External] Re: [PATCH] fuse: writeback_cache consistency enhancement (writeback_cache_v2)

On Mon, Jul 18, 2022 at 01:48:30PM +0800, Jiachen Zhang wrote:
> On Thu, Jun 30, 2022 at 2:14 AM Vivek Goyal <[email protected]> wrote:
> >
> > On Fri, Jun 24, 2022 at 01:58:25PM +0800, Jiachen Zhang wrote:
> > > Some users may want both the high performance of the writeback_cahe mode and
> > > a little bit more consistency among FUSE mounts. In the current writeback
> > > mode implementation, users of one FUSE mount can never see the file
> > > expansion done by other FUSE mounts.
> > >
> > > Based on the suggested writeback V2 patch in the upstream mailing-list [1],
> > > this commit allows the cmtime and size to be updated from server in
> > > writeback mode. Compared with the writeback V2 patch in [1], this commit has
> > > several differences:
> > >
> > > 1. Ensure c/mtime are not updated from kernel to server. IOW, the cmtime
> > > generated by kernel are just temporary values that are never flushed to
> > > server, and they can also be updated by the official server cmtime when
> > > the writeback cache is clean.
> >
> > Can this give the appearance of time going backwards. If file server
> > is getting time stamps from another remote filesystem (say NFS), it
> > is possible that client and server clocks are not in sync. So for
> > a stat() might return temporary value of c/mtime from local kernel and
> > once it gets udpated, it returns a time which might be behind previous
> > time. Sounds like trouble to me.
> >
>
> Hi Vivek,
>
> Yes, the time is likely to be changed silently on a client after the
> data is flush to the server. But in terms of consistency, some FS
> implementations may rely on the server mtime to revalidate cache. In
> such a case, if the server time is updated from multiple client time
> sources, it may lead to false cache revalidation results.
>
> > This can happen more often on virtiofs (even without a remote filesystem
> > being involved).
> >
>
> Maybe if there is no need of multiple VM-guest consistency, we can
> just disable the writeback_cache_v2 mode.

To me writeback_cache_v2 mode in current form is not making much sense.
What are the semantics. If we are caching writes (and size/mtime/ctime)
locally, then either we should not be expecting sharing to happen
from other clients or we need to implement lease kind of mechanism
so that we flush out data before other clients do writes.

What is being proposed seems very odd. Local caching is happening at
the same time sharing among multiple clients is happening. If one
client writes and updates mtime, other client can sometime update
its mtime (if cache is clean) and ignore it other times. Even if
it updates mtime, it might not flush its local cache (auto_inval_data).
Time can go backward. If two clients are writing to same file and
talking to each other over network, then their writes can appear to
be completely out of order.

I feel that if we are using writeback_cache, we should not be using
sharing among multiple clients. If we do want to locally cache dirty
pages and at the same time do sharing, we need to implement notion of
leases as other remote filesystems have done.

Current proposed semantics of writeback_cache_v2 seem to be just
a band aid for a specific use case.

Thanks
Vivek




>
>
> > >
> > > 2. Skip mtime-based revalidation when fc->auto_inval_data is set with
> > > fc->writeback_cache_v2. Because the kernel-generated temporary cmtime
> > > are likely not equal to the offical server cmtime.
> > >
> > > 3. If any page is ever flushed to the server during FUSE_GETATTR
> > > handling on fuse server, even if the cache is clean when
> > > fuse_change_attributes() checks, we should not update the i_size. This
> > > is because the FUSE_GETATTR may get a staled size before the FUSE_WRITE
> > > request changes server inode size. This commit ensures this by
> > > increasing attr_version after writeback for writeback_cache_v2. In that
> > > case, we should also ensure the ordering of the attr_version updating
> > > and the fi->writepages RB-tree updating. So that if a fuse page
> > > writeback ever happens during fuse_change_attributes(), either the
> > > fi->writepages is not empty, or the attr_version is increased. So we
> > > never mistakenly update a stale file size from server to kernel.
> > >
> > > With this patch, writeback mode can consider the server c/mtime as the
> > > official one. When inode attr is timeout or invalidated, kernel has chance
> > > to see size and c/mtime modified by others.
> > >
> > > Together with another patch [2], a FUSE daemon is able to implement
> > > close-to-open (CTO) consistency like what is done in NFS clients.
> >
> > If your goal is to implement NFS style CTO with writeback cache mode, then
> > is it not enough to query and udpate attributes at file open time. Instead
> > of worrying about also updating attrs while file is open (and cache is
> > clean). Updating attrs when cache is clean, is non-deterministic anyway,
> > so I don't see how it can be very useful for applications.
> >
>
> IIUC, the NFS-style CTO consistency only requires updating invalidated
> data & attr caches on file open, and the cache updating will be
> delayed if there is any writing opens, even if the cache is clean. So
> writeback_cache_v2 updates even more radical than NFS-style CTO.
>
> Thanks,
> Jiachen
>
> > Thanks
> > Vivek
> >
> >
> > >
> > > [1] https://lore.kernel.org/linux-fsdevel/[email protected]
> > > [2] https://lore.kernel.org/linux-fsdevel/[email protected]/
> > >
> > > Suggested-by: Miklos Szeredi <[email protected]>
> > > Signed-off-by: Jiachen Zhang <[email protected]>
> > > ---
> > > fs/fuse/file.c | 17 +++++++++++++++
> > > fs/fuse/fuse_i.h | 3 +++
> > > fs/fuse/inode.c | 44 +++++++++++++++++++++++++++++++++++++--
> > > include/uapi/linux/fuse.h | 5 +++++
> > > 4 files changed, 67 insertions(+), 2 deletions(-)
> > >
> > > diff --git a/fs/fuse/file.c b/fs/fuse/file.c
> > > index 9b64e2ff1c96..35bdc7af8468 100644
> > > --- a/fs/fuse/file.c
> > > +++ b/fs/fuse/file.c
> > > @@ -1829,6 +1829,15 @@ static void fuse_writepage_end(struct fuse_mount *fm, struct fuse_args *args,
> > > */
> > > fuse_send_writepage(fm, next, inarg->offset + inarg->size);
> > > }
> > > +
> > > + if (fc->writeback_cache_v2)
> > > + fi->attr_version = atomic64_inc_return(&fc->attr_version);
> > > + /*
> > > + * Ensure attr_version increases before the page is move out of the
> > > + * writepages rb-tree.
> > > + */
> > > + smp_mb();
> > > +
> > > fi->writectr--;
> > > fuse_writepage_finish(fm, wpa);
> > > spin_unlock(&fi->lock);
> > > @@ -1858,10 +1867,18 @@ static struct fuse_file *fuse_write_file_get(struct fuse_inode *fi)
> > >
> > > int fuse_write_inode(struct inode *inode, struct writeback_control *wbc)
> > > {
> > > + struct fuse_conn *fc = get_fuse_conn(inode);
> > > struct fuse_inode *fi = get_fuse_inode(inode);
> > > struct fuse_file *ff;
> > > int err;
> > >
> > > + /*
> > > + * Kernel c/mtime should not be updated to the server in the
> > > + * writeback_cache_v2 mode as server c/mtime are official.
> > > + */
> > > + if (fc->writeback_cache_v2)
> > > + return 0;
> > > +
> > > /*
> > > * Inode is always written before the last reference is dropped and
> > > * hence this should not be reached from reclaim.
> > > diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
> > > index 488b460e046f..47de36146fb8 100644
> > > --- a/fs/fuse/fuse_i.h
> > > +++ b/fs/fuse/fuse_i.h
> > > @@ -654,6 +654,9 @@ struct fuse_conn {
> > > /* show legacy mount options */
> > > unsigned int legacy_opts_show:1;
> > >
> > > + /* Improved writeback cache policy */
> > > + unsigned writeback_cache_v2:1;
> > > +
> > > /*
> > > * fs kills suid/sgid/cap on write/chown/trunc. suid is killed on
> > > * write/trunc only if caller did not have CAP_FSETID. sgid is killed
> > > diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
> > > index 8c0665c5dff8..2d5fa82b08b6 100644
> > > --- a/fs/fuse/inode.c
> > > +++ b/fs/fuse/inode.c
> > > @@ -237,14 +237,41 @@ void fuse_change_attributes(struct inode *inode, struct fuse_attr *attr,
> > > u32 cache_mask;
> > > loff_t oldsize;
> > > struct timespec64 old_mtime;
> > > + bool try_wb_update = false;
> > > +
> > > + if (fc->writeback_cache_v2 && S_ISREG(inode->i_mode)) {
> > > + inode_lock(inode);
> > > + try_wb_update = true;
> > > + }
> > >
> > > spin_lock(&fi->lock);
> > > /*
> > > * In case of writeback_cache enabled, writes update mtime, ctime and
> > > * may update i_size. In these cases trust the cached value in the
> > > * inode.
> > > + *
> > > + * In writeback_cache_v2 mode, if all the following conditions are met,
> > > + * then we allow the attributes to be refreshed:
> > > + *
> > > + * - inode is not in the process of being written (I_SYNC)
> > > + * - inode has no dirty pages (I_DIRTY_PAGES)
> > > + * - inode data-related attributes are clean (I_DIRTY_DATASYNC)
> > > + * - inode does not have any page writeback in progress
> > > + *
> > > + * Note: checking PAGECACHE_TAG_WRITEBACK is not sufficient in fuse,
> > > + * since inode can appear to have no PageWriteback pages, yet still have
> > > + * outstanding write request.
> > > */
> > > cache_mask = fuse_get_cache_mask(inode);
> > > + if (try_wb_update && !(inode->i_state & (I_DIRTY_PAGES | I_SYNC |
> > > + I_DIRTY_DATASYNC)) && RB_EMPTY_ROOT(&fi->writepages))
> > > + cache_mask &= ~(STATX_MTIME | STATX_CTIME | STATX_SIZE);
> > > + /*
> > > + * Ensure the ordering of cleanness checking and following attr_version
> > > + * comparison.
> > > + */
> > > + smp_mb();
> > > +
> > > if (cache_mask & STATX_SIZE)
> > > attr->size = i_size_read(inode);
> > >
> > > @@ -283,7 +310,13 @@ void fuse_change_attributes(struct inode *inode, struct fuse_attr *attr,
> > > truncate_pagecache(inode, attr->size);
> > > if (!fc->explicit_inval_data)
> > > inval = true;
> > > - } else if (fc->auto_inval_data) {
> > > + } else if (!fc->writeback_cache_v2 && fc->auto_inval_data) {
> > > + /*
> > > + * When fc->writeback_cache_v2 is set, the old_mtime
> > > + * can be generated by kernel and must not equal to
> > > + * new_mtime generated by server. So skip in such
> > > + * case.
> > > + */
> > > struct timespec64 new_mtime = {
> > > .tv_sec = attr->mtime,
> > > .tv_nsec = attr->mtimensec,
> > > @@ -303,6 +336,9 @@ void fuse_change_attributes(struct inode *inode, struct fuse_attr *attr,
> > >
> > > if (IS_ENABLED(CONFIG_FUSE_DAX))
> > > fuse_dax_dontcache(inode, attr->flags);
> > > +
> > > + if (try_wb_update)
> > > + inode_unlock(inode);
> > > }
> > >
> > > static void fuse_init_inode(struct inode *inode, struct fuse_attr *attr)
> > > @@ -1153,6 +1189,10 @@ static void process_init_reply(struct fuse_mount *fm, struct fuse_args *args,
> > > fc->async_dio = 1;
> > > if (flags & FUSE_WRITEBACK_CACHE)
> > > fc->writeback_cache = 1;
> > > + if (flags & FUSE_WRITEBACK_CACHE_V2) {
> > > + fc->writeback_cache = 1;
> > > + fc->writeback_cache_v2 = 1;
> > > + }
> > > if (flags & FUSE_PARALLEL_DIROPS)
> > > fc->parallel_dirops = 1;
> > > if (flags & FUSE_HANDLE_KILLPRIV)
> > > @@ -1234,7 +1274,7 @@ void fuse_send_init(struct fuse_mount *fm)
> > > FUSE_ABORT_ERROR | FUSE_MAX_PAGES | FUSE_CACHE_SYMLINKS |
> > > FUSE_NO_OPENDIR_SUPPORT | FUSE_EXPLICIT_INVAL_DATA |
> > > FUSE_HANDLE_KILLPRIV_V2 | FUSE_SETXATTR_EXT | FUSE_INIT_EXT |
> > > - FUSE_SECURITY_CTX;
> > > + FUSE_SECURITY_CTX | FUSE_WRITEBACK_CACHE_V2;
> > > #ifdef CONFIG_FUSE_DAX
> > > if (fm->fc->dax)
> > > flags |= FUSE_MAP_ALIGNMENT;
> > > diff --git a/include/uapi/linux/fuse.h b/include/uapi/linux/fuse.h
> > > index d6ccee961891..b474763bcf59 100644
> > > --- a/include/uapi/linux/fuse.h
> > > +++ b/include/uapi/linux/fuse.h
> > > @@ -194,6 +194,7 @@
> > > * - add FUSE_SECURITY_CTX init flag
> > > * - add security context to create, mkdir, symlink, and mknod requests
> > > * - add FUSE_HAS_INODE_DAX, FUSE_ATTR_DAX
> > > + * - add FUSE_WRITEBACK_CACHE_V2 init flag
> > > */
> > >
> > > #ifndef _LINUX_FUSE_H
> > > @@ -353,6 +354,9 @@ struct fuse_file_lock {
> > > * FUSE_SECURITY_CTX: add security context to create, mkdir, symlink, and
> > > * mknod
> > > * FUSE_HAS_INODE_DAX: use per inode DAX
> > > + * FUSE_WRITEBACK_CACHE_V2:
> > > + * allow time/size to be refreshed if no pending write
> > > + * c/mtime not updated from kernel to server
> > > */
> > > #define FUSE_ASYNC_READ (1 << 0)
> > > #define FUSE_POSIX_LOCKS (1 << 1)
> > > @@ -389,6 +393,7 @@ struct fuse_file_lock {
> > > /* bits 32..63 get shifted down 32 bits into the flags2 field */
> > > #define FUSE_SECURITY_CTX (1ULL << 32)
> > > #define FUSE_HAS_INODE_DAX (1ULL << 33)
> > > +#define FUSE_WRITEBACK_CACHE_V2 (1ULL << 34)
> > >
> > > /**
> > > * CUSE INIT request/reply flags
> > > --
> > > 2.20.1
> > >
> >
>

2022-07-25 13:47:08

by Vivek Goyal

[permalink] [raw]
Subject: Re: [PATCH] fuse: writeback_cache consistency enhancement (writeback_cache_v2)

On Mon, Jul 18, 2022 at 01:17:10PM +0800, Jiachen Zhang wrote:
> On Thu, Jun 30, 2022 at 1:57 AM Vivek Goyal <[email protected]> wrote:
> >
> > On Fri, Jun 24, 2022 at 01:58:25PM +0800, Jiachen Zhang wrote:
> > > Some users may want both the high performance of the writeback_cahe mode and
> > > a little bit more consistency among FUSE mounts. In the current writeback
> > > mode implementation, users of one FUSE mount can never see the file
> > > expansion done by other FUSE mounts.
> > >
> > > Based on the suggested writeback V2 patch in the upstream mailing-list [1],
> > > this commit allows the cmtime and size to be updated from server in
> > > writeback mode. Compared with the writeback V2 patch in [1], this commit has
> > > several differences:
> > >
> > > 1. Ensure c/mtime are not updated from kernel to server. IOW, the cmtime
> > > generated by kernel are just temporary values that are never flushed to
> > > server, and they can also be updated by the official server cmtime when
> > > the writeback cache is clean.
> > >
> > > 2. Skip mtime-based revalidation when fc->auto_inval_data is set with
> > > fc->writeback_cache_v2. Because the kernel-generated temporary cmtime
> > > are likely not equal to the offical server cmtime.
> >
> > We have a init flag to negotiate FUSE_AUTO_INVAL_DATA to enable/disable
> > fc->auto_inval_data. So should it be filesystem's (user space) responsibility
> > to not use FUSE_AUTO_INVAL_DATA when using FUSE_WRITEBACK_CACHE. I mean,
> > otherwise it is confusing. On one hand filesystem is using FUSE_AUTO_INVAL_DATA
> > but it silently is ingored by kernel because FUSE_WRITEBACK_CACHE is
> > being used as well.
> >
>
> Hi Vivek,
>
> The fc->auto_inval_data branch code is also skipped in writeback_cache
> v1 mode, so I think we can also skip it in writeback_cache_v2. I
> listed this point here because it is not skipped in Miklos'
> writeback_cache_v2 initial patch.

So, fc->auto_inval_data seems to invalidate inode pages if either file
size or mtime in user space (file server) has changed since last cached
value.

With writeback cache v1, we don't even trust/use size or mtime value
as retrieved from server. And if we are not using size/mtime value as
retrieved from server, there is no question of invalidating local cached
data. Idea seems to be that we are writing and caching the contents
locally (and not expecting the changes to take place on server from other
clients). So for v1 mode, it kind of makes sense to not pay attention
to auto_inval_data.

For write cache v2, on one hand you are updating the size as received
from the server. So you are epxecting another client to update the
file. But at the same time you are not willing to invalidate local
page cache. That is not making much sense to me.

For example, say with writeback cache v2, you have cached few pages (not
dirty) from file foo.txt in the kernel (client). Now another client
overwrites the file foo.txt and hence modified mtime. When you get new
mtime in first client (kenrel), you will simply ignore it and continue
to use old data?

I think you need to write a more coherent changelog/cover-letter. First
define what's the current behavior and then define what will be the
new behavior and how does it impact overall caching etc. We seem to be
talking about small individual pieces but I am not sure I am clear
on overall architecture of this new mode.

Thanks
Vivek


>
> Thanks,
> Jiachen
>
> > will it make more sense to just document it and let filesystem authors
> > enable options carefully.
> >
> > BTW, this probably should be a separate patch in the patch series anyway
> > with proper explanation.
> >
> > Thanks
> > Vivek
> >
> > >
> > > 3. If any page is ever flushed to the server during FUSE_GETATTR
> > > handling on fuse server, even if the cache is clean when
> > > fuse_change_attributes() checks, we should not update the i_size. This
> > > is because the FUSE_GETATTR may get a staled size before the FUSE_WRITE
> > > request changes server inode size. This commit ensures this by
> > > increasing attr_version after writeback for writeback_cache_v2. In that
> > > case, we should also ensure the ordering of the attr_version updating
> > > and the fi->writepages RB-tree updating. So that if a fuse page
> > > writeback ever happens during fuse_change_attributes(), either the
> > > fi->writepages is not empty, or the attr_version is increased. So we
> > > never mistakenly update a stale file size from server to kernel.
> > >
> > > With this patch, writeback mode can consider the server c/mtime as the
> > > official one. When inode attr is timeout or invalidated, kernel has chance
> > > to see size and c/mtime modified by others.
> > >
> > > Together with another patch [2], a FUSE daemon is able to implement
> > > close-to-open (CTO) consistency like what is done in NFS clients.
> > >
> > > [1] https://lore.kernel.org/linux-fsdevel/[email protected]
> > > [2] https://lore.kernel.org/linux-fsdevel/[email protected]/
> > >
> > > Suggested-by: Miklos Szeredi <[email protected]>
> > > Signed-off-by: Jiachen Zhang <[email protected]>
> > > ---
> > > fs/fuse/file.c | 17 +++++++++++++++
> > > fs/fuse/fuse_i.h | 3 +++
> > > fs/fuse/inode.c | 44 +++++++++++++++++++++++++++++++++++++--
> > > include/uapi/linux/fuse.h | 5 +++++
> > > 4 files changed, 67 insertions(+), 2 deletions(-)
> > >
> > > diff --git a/fs/fuse/file.c b/fs/fuse/file.c
> > > index 9b64e2ff1c96..35bdc7af8468 100644
> > > --- a/fs/fuse/file.c
> > > +++ b/fs/fuse/file.c
> > > @@ -1829,6 +1829,15 @@ static void fuse_writepage_end(struct fuse_mount *fm, struct fuse_args *args,
> > > */
> > > fuse_send_writepage(fm, next, inarg->offset + inarg->size);
> > > }
> > > +
> > > + if (fc->writeback_cache_v2)
> > > + fi->attr_version = atomic64_inc_return(&fc->attr_version);
> > > + /*
> > > + * Ensure attr_version increases before the page is move out of the
> > > + * writepages rb-tree.
> > > + */
> > > + smp_mb();
> > > +
> > > fi->writectr--;
> > > fuse_writepage_finish(fm, wpa);
> > > spin_unlock(&fi->lock);
> > > @@ -1858,10 +1867,18 @@ static struct fuse_file *fuse_write_file_get(struct fuse_inode *fi)
> > >
> > > int fuse_write_inode(struct inode *inode, struct writeback_control *wbc)
> > > {
> > > + struct fuse_conn *fc = get_fuse_conn(inode);
> > > struct fuse_inode *fi = get_fuse_inode(inode);
> > > struct fuse_file *ff;
> > > int err;
> > >
> > > + /*
> > > + * Kernel c/mtime should not be updated to the server in the
> > > + * writeback_cache_v2 mode as server c/mtime are official.
> > > + */
> > > + if (fc->writeback_cache_v2)
> > > + return 0;
> > > +
> > > /*
> > > * Inode is always written before the last reference is dropped and
> > > * hence this should not be reached from reclaim.
> > > diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
> > > index 488b460e046f..47de36146fb8 100644
> > > --- a/fs/fuse/fuse_i.h
> > > +++ b/fs/fuse/fuse_i.h
> > > @@ -654,6 +654,9 @@ struct fuse_conn {
> > > /* show legacy mount options */
> > > unsigned int legacy_opts_show:1;
> > >
> > > + /* Improved writeback cache policy */
> > > + unsigned writeback_cache_v2:1;
> > > +
> > > /*
> > > * fs kills suid/sgid/cap on write/chown/trunc. suid is killed on
> > > * write/trunc only if caller did not have CAP_FSETID. sgid is killed
> > > diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
> > > index 8c0665c5dff8..2d5fa82b08b6 100644
> > > --- a/fs/fuse/inode.c
> > > +++ b/fs/fuse/inode.c
> > > @@ -237,14 +237,41 @@ void fuse_change_attributes(struct inode *inode, struct fuse_attr *attr,
> > > u32 cache_mask;
> > > loff_t oldsize;
> > > struct timespec64 old_mtime;
> > > + bool try_wb_update = false;
> > > +
> > > + if (fc->writeback_cache_v2 && S_ISREG(inode->i_mode)) {
> > > + inode_lock(inode);
> > > + try_wb_update = true;
> > > + }
> > >
> > > spin_lock(&fi->lock);
> > > /*
> > > * In case of writeback_cache enabled, writes update mtime, ctime and
> > > * may update i_size. In these cases trust the cached value in the
> > > * inode.
> > > + *
> > > + * In writeback_cache_v2 mode, if all the following conditions are met,
> > > + * then we allow the attributes to be refreshed:
> > > + *
> > > + * - inode is not in the process of being written (I_SYNC)
> > > + * - inode has no dirty pages (I_DIRTY_PAGES)
> > > + * - inode data-related attributes are clean (I_DIRTY_DATASYNC)
> > > + * - inode does not have any page writeback in progress
> > > + *
> > > + * Note: checking PAGECACHE_TAG_WRITEBACK is not sufficient in fuse,
> > > + * since inode can appear to have no PageWriteback pages, yet still have
> > > + * outstanding write request.
> > > */
> > > cache_mask = fuse_get_cache_mask(inode);
> > > + if (try_wb_update && !(inode->i_state & (I_DIRTY_PAGES | I_SYNC |
> > > + I_DIRTY_DATASYNC)) && RB_EMPTY_ROOT(&fi->writepages))
> > > + cache_mask &= ~(STATX_MTIME | STATX_CTIME | STATX_SIZE);
> > > + /*
> > > + * Ensure the ordering of cleanness checking and following attr_version
> > > + * comparison.
> > > + */
> > > + smp_mb();
> > > +
> > > if (cache_mask & STATX_SIZE)
> > > attr->size = i_size_read(inode);
> > >
> > > @@ -283,7 +310,13 @@ void fuse_change_attributes(struct inode *inode, struct fuse_attr *attr,
> > > truncate_pagecache(inode, attr->size);
> > > if (!fc->explicit_inval_data)
> > > inval = true;
> > > - } else if (fc->auto_inval_data) {
> > > + } else if (!fc->writeback_cache_v2 && fc->auto_inval_data) {
> > > + /*
> > > + * When fc->writeback_cache_v2 is set, the old_mtime
> > > + * can be generated by kernel and must not equal to
> > > + * new_mtime generated by server. So skip in such
> > > + * case.
> > > + */
> > > struct timespec64 new_mtime = {
> > > .tv_sec = attr->mtime,
> > > .tv_nsec = attr->mtimensec,
> > > @@ -303,6 +336,9 @@ void fuse_change_attributes(struct inode *inode, struct fuse_attr *attr,
> > >
> > > if (IS_ENABLED(CONFIG_FUSE_DAX))
> > > fuse_dax_dontcache(inode, attr->flags);
> > > +
> > > + if (try_wb_update)
> > > + inode_unlock(inode);
> > > }
> > >
> > > static void fuse_init_inode(struct inode *inode, struct fuse_attr *attr)
> > > @@ -1153,6 +1189,10 @@ static void process_init_reply(struct fuse_mount *fm, struct fuse_args *args,
> > > fc->async_dio = 1;
> > > if (flags & FUSE_WRITEBACK_CACHE)
> > > fc->writeback_cache = 1;
> > > + if (flags & FUSE_WRITEBACK_CACHE_V2) {
> > > + fc->writeback_cache = 1;
> > > + fc->writeback_cache_v2 = 1;
> > > + }
> > > if (flags & FUSE_PARALLEL_DIROPS)
> > > fc->parallel_dirops = 1;
> > > if (flags & FUSE_HANDLE_KILLPRIV)
> > > @@ -1234,7 +1274,7 @@ void fuse_send_init(struct fuse_mount *fm)
> > > FUSE_ABORT_ERROR | FUSE_MAX_PAGES | FUSE_CACHE_SYMLINKS |
> > > FUSE_NO_OPENDIR_SUPPORT | FUSE_EXPLICIT_INVAL_DATA |
> > > FUSE_HANDLE_KILLPRIV_V2 | FUSE_SETXATTR_EXT | FUSE_INIT_EXT |
> > > - FUSE_SECURITY_CTX;
> > > + FUSE_SECURITY_CTX | FUSE_WRITEBACK_CACHE_V2;
> > > #ifdef CONFIG_FUSE_DAX
> > > if (fm->fc->dax)
> > > flags |= FUSE_MAP_ALIGNMENT;
> > > diff --git a/include/uapi/linux/fuse.h b/include/uapi/linux/fuse.h
> > > index d6ccee961891..b474763bcf59 100644
> > > --- a/include/uapi/linux/fuse.h
> > > +++ b/include/uapi/linux/fuse.h
> > > @@ -194,6 +194,7 @@
> > > * - add FUSE_SECURITY_CTX init flag
> > > * - add security context to create, mkdir, symlink, and mknod requests
> > > * - add FUSE_HAS_INODE_DAX, FUSE_ATTR_DAX
> > > + * - add FUSE_WRITEBACK_CACHE_V2 init flag
> > > */
> > >
> > > #ifndef _LINUX_FUSE_H
> > > @@ -353,6 +354,9 @@ struct fuse_file_lock {
> > > * FUSE_SECURITY_CTX: add security context to create, mkdir, symlink, and
> > > * mknod
> > > * FUSE_HAS_INODE_DAX: use per inode DAX
> > > + * FUSE_WRITEBACK_CACHE_V2:
> > > + * allow time/size to be refreshed if no pending write
> > > + * c/mtime not updated from kernel to server
> > > */
> > > #define FUSE_ASYNC_READ (1 << 0)
> > > #define FUSE_POSIX_LOCKS (1 << 1)
> > > @@ -389,6 +393,7 @@ struct fuse_file_lock {
> > > /* bits 32..63 get shifted down 32 bits into the flags2 field */
> > > #define FUSE_SECURITY_CTX (1ULL << 32)
> > > #define FUSE_HAS_INODE_DAX (1ULL << 33)
> > > +#define FUSE_WRITEBACK_CACHE_V2 (1ULL << 34)
> > >
> > > /**
> > > * CUSE INIT request/reply flags
> > > --
> > > 2.20.1
> > >
> >
>