2022-05-03 01:04:01

by Dharmendra Singh

[permalink] [raw]
Subject: [PATCH v4 0/3] FUSE: Implement atomic lookup + open/create

In FUSE, as of now, uncached lookups are expensive over the wire.
E.g additional latencies and stressing (meta data) servers from
thousands of clients. These lookup calls possibly can be avoided
in some cases. Incoming three patches address this issue.


Fist patch handles the case where we are creating a file with O_CREAT.
Before we go for file creation, we do a lookup on the file which is most
likely non-existent. After this lookup is done, we again go into libfuse
to create file. Such lookups where file is most likely non-existent, can
be avoided.

Second patch handles the case where we open first time a file/dir
but do a lookup first on it. After lookup is performed we make another
call into libfuse to open the file. Now these two separate calls into
libfuse can be combined and performed as a single call into libfuse.

Third patch handles the case when we are opening an already existing file
(positive dentry). Before this open call, we re-validate the inode and
this re-validation does a lookup on the file and verify the inode.
This separate lookup also can be avoided (for non-dir) and combined
with open call into libfuse. After open returns we can revalidate the inode.
This optimisation is performed only when we do not have default permissions
enabled.

Here is the link to performance numbers
https://lore.kernel.org/linux-fsdevel/[email protected]/


Dharmendra Singh (3):
FUSE: Implement atomic lookup + create
Implement atomic lookup + open
Avoid lookup in d_revalidate()

fs/fuse/dir.c | 211 +++++++++++++++++++++++++++++++++++---
fs/fuse/file.c | 30 +++++-
fs/fuse/fuse_i.h | 16 ++-
fs/fuse/inode.c | 4 +-
fs/fuse/ioctl.c | 2 +-
include/uapi/linux/fuse.h | 5 +
6 files changed, 246 insertions(+), 22 deletions(-)

---
v4: Addressed all comments and refactored the code into 3 separate patches
respectively for Atomic create, Atomic open, optimizing lookup in
d_revalidate().
---


2022-05-03 01:05:29

by Dharmendra Singh

[permalink] [raw]
Subject: [PATCH v4 2/3] FUSE: Implement atomic lookup + open

From: Dharmendra Singh <[email protected]>

We can optimize aggressive lookups which are triggered when
there is normal open for file/dir (dentry is new/negative).

Here in this case since we are anyway going to open the file/dir
with USER SPACE, avoid this separate lookup call into libfuse
and combine it with open call into libfuse.

This lookup + open in single call to libfuse has been named
as atomic open. It is expected that USER SPACE opens the file
and fills in the attributes, which are then used to make inode
stand/revalidate in the kernel cache.

Signed-off-by: Dharmendra Singh <[email protected]>
---
fs/fuse/dir.c | 78 ++++++++++++++++++++++++++++++---------
fs/fuse/fuse_i.h | 3 ++
fs/fuse/inode.c | 4 +-
include/uapi/linux/fuse.h | 2 +
4 files changed, 69 insertions(+), 18 deletions(-)

diff --git a/fs/fuse/dir.c b/fs/fuse/dir.c
index cad3322a007f..6879d3a86796 100644
--- a/fs/fuse/dir.c
+++ b/fs/fuse/dir.c
@@ -516,18 +516,18 @@ static int get_security_context(struct dentry *entry, umode_t mode,
}

/*
- * Atomic create+open operation
- *
- * If the filesystem doesn't support this, then fall back to separate
- * 'mknod' + 'open' requests.
+ * This is common function for initiating atomic operations into libfuse.
+ * Currently being used by Atomic create(atomic lookup + create) and
+ * Atomic open(atomic lookup + open).
*/
-static int fuse_create_open(struct inode *dir, struct dentry *entry,
- struct file *file, unsigned int flags,
- umode_t mode, uint32_t opcode)
+static int fuse_atomic_do_common(struct inode *dir, struct dentry *entry,
+ struct dentry **alias, struct file *file,
+ unsigned int flags, umode_t mode, uint32_t opcode)
{
int err;
struct inode *inode;
struct fuse_mount *fm = get_fuse_mount(dir);
+ struct fuse_conn *fc = get_fuse_conn(dir);
FUSE_ARGS(args);
struct fuse_forget_link *forget;
struct fuse_create_in inarg;
@@ -539,9 +539,13 @@ static int fuse_create_open(struct inode *dir, struct dentry *entry,
void *security_ctx = NULL;
u32 security_ctxlen;
bool atomic_create = (opcode == FUSE_ATOMIC_CREATE ? true : false);
+ bool create_op = (opcode == FUSE_CREATE ||
+ opcode == FUSE_ATOMIC_CREATE) ? true : false;
+ if (alias)
+ *alias = NULL;

/* Userspace expects S_IFREG in create mode */
- BUG_ON((mode & S_IFMT) != S_IFREG);
+ BUG_ON(create_op && (mode & S_IFMT) != S_IFREG);

forget = fuse_alloc_forget();
err = -ENOMEM;
@@ -553,7 +557,7 @@ static int fuse_create_open(struct inode *dir, struct dentry *entry,
if (!ff)
goto out_put_forget_req;

- if (!fm->fc->dont_mask)
+ if (!fc->dont_mask)
mode &= ~current_umask();

flags &= ~O_NOCTTY;
@@ -642,8 +646,9 @@ static int fuse_create_open(struct inode *dir, struct dentry *entry,
err = PTR_ERR(res);
goto out_err;
}
- /* res is expected to be NULL since its REG file */
- WARN_ON(res);
+ entry = res;
+ if (alias)
+ *alias = res;
}
}
fuse_change_entry_timeout(entry, &outentry);
@@ -681,8 +686,8 @@ static int fuse_atomic_create(struct inode *dir, struct dentry *entry,
if (fc->no_atomic_create)
return -ENOSYS;

- err = fuse_create_open(dir, entry, file, flags, mode,
- FUSE_ATOMIC_CREATE);
+ err = fuse_atomic_do_common(dir, entry, NULL, file, flags, mode,
+ FUSE_ATOMIC_CREATE);
/* If atomic create is not implemented then indicate in fc so that next
* request falls back to normal create instead of going into libufse and
* returning with -ENOSYS.
@@ -694,6 +699,29 @@ static int fuse_atomic_create(struct inode *dir, struct dentry *entry,
return err;
}

+static int fuse_do_atomic_open(struct inode *dir, struct dentry *entry,
+ struct dentry **alias, struct file *file,
+ unsigned int flags, umode_t mode)
+{
+ int err;
+ struct fuse_conn *fc = get_fuse_conn(dir);
+
+ if (!fc->do_atomic_open)
+ return -ENOSYS;
+
+ err = fuse_atomic_do_common(dir, entry, alias, file, flags, mode,
+ FUSE_ATOMIC_OPEN);
+ /* Atomic open imply atomic trunc as well i.e truncate should be performed
+ * as part of atomic open call itself.
+ */
+ if (!fc->atomic_o_trunc) {
+ if (fc->do_atomic_open)
+ fc->atomic_o_trunc = 1;
+ }
+
+ return err;
+}
+
static int fuse_mknod(struct user_namespace *, struct inode *, struct dentry *,
umode_t, dev_t);
static int fuse_atomic_open(struct inode *dir, struct dentry *entry,
@@ -702,12 +730,23 @@ static int fuse_atomic_open(struct inode *dir, struct dentry *entry,
{
int err;
struct fuse_conn *fc = get_fuse_conn(dir);
- struct dentry *res = NULL;
+ struct dentry *res = NULL, *alias = NULL;
bool create = flags & O_CREAT ? true : false;

if (fuse_is_bad(dir))
return -EIO;

+ if (!create) {
+ err = fuse_do_atomic_open(dir, entry, &alias,
+ file, flags, mode);
+ res = alias;
+ if (!err || err != -ENOSYS)
+ goto out_dput;
+ }
+ /*
+ * ENOSYS fall back for open- user space does not have full
+ * atomic open.
+ */
if ((!create || fc->no_atomic_create) && d_in_lookup(entry)) {
res = fuse_lookup(dir, entry, 0);
if (IS_ERR(res))
@@ -730,9 +769,14 @@ static int fuse_atomic_open(struct inode *dir, struct dentry *entry,
/* Libfuse/user space has not implemented atomic create, therefore
* fall back to normal create.
*/
- if (err == -ENOSYS)
- err = fuse_create_open(dir, entry, file, flags, mode,
- FUSE_CREATE);
+ /* Atomic create+open operation
+ * If the filesystem doesn't support this, then fall back to separate
+ * 'mknod' + 'open' requests.
+ */
+ if (err == -ENOSYS) {
+ err = fuse_atomic_do_common(dir, entry, NULL, file, flags,
+ mode, FUSE_CREATE);
+ }
if (err == -ENOSYS) {
fc->no_create = 1;
goto mknod;
diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
index d577a591ab16..24793b82303f 100644
--- a/fs/fuse/fuse_i.h
+++ b/fs/fuse/fuse_i.h
@@ -669,6 +669,9 @@ struct fuse_conn {
/** Is open/release not implemented by fs? */
unsigned no_open:1;

+ /** Is atomic open implemented by fs ? */
+ unsigned do_atomic_open : 1;
+
/** Is atomic create not implemented by fs? */
unsigned no_atomic_create:1;

diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
index ee846ce371d8..5f667de69115 100644
--- a/fs/fuse/inode.c
+++ b/fs/fuse/inode.c
@@ -1190,6 +1190,8 @@ static void process_init_reply(struct fuse_mount *fm, struct fuse_args *args,
fc->setxattr_ext = 1;
if (flags & FUSE_SECURITY_CTX)
fc->init_security = 1;
+ if (flags & FUSE_DO_ATOMIC_OPEN)
+ fc->do_atomic_open = 1;
} else {
ra_pages = fc->max_read / PAGE_SIZE;
fc->no_lock = 1;
@@ -1235,7 +1237,7 @@ void fuse_send_init(struct fuse_mount *fm)
FUSE_ABORT_ERROR | FUSE_MAX_PAGES | FUSE_CACHE_SYMLINKS |
FUSE_NO_OPENDIR_SUPPORT | FUSE_EXPLICIT_INVAL_DATA |
FUSE_HANDLE_KILLPRIV_V2 | FUSE_SETXATTR_EXT | FUSE_INIT_EXT |
- FUSE_SECURITY_CTX;
+ FUSE_SECURITY_CTX | FUSE_DO_ATOMIC_OPEN;
#ifdef CONFIG_FUSE_DAX
if (fm->fc->dax)
flags |= FUSE_MAP_ALIGNMENT;
diff --git a/include/uapi/linux/fuse.h b/include/uapi/linux/fuse.h
index e4b56004b148..ab91e391241a 100644
--- a/include/uapi/linux/fuse.h
+++ b/include/uapi/linux/fuse.h
@@ -391,6 +391,7 @@ struct fuse_file_lock {
/* bits 32..63 get shifted down 32 bits into the flags2 field */
#define FUSE_SECURITY_CTX (1ULL << 32)
#define FUSE_HAS_INODE_DAX (1ULL << 33)
+#define FUSE_DO_ATOMIC_OPEN (1ULL << 34)

/**
* CUSE INIT request/reply flags
@@ -540,6 +541,7 @@ enum fuse_opcode {
FUSE_REMOVEMAPPING = 49,
FUSE_SYNCFS = 50,
FUSE_ATOMIC_CREATE = 51,
+ FUSE_ATOMIC_OPEN = 52,

/* CUSE specific operations */
CUSE_INIT = 4096,
--
2.17.1

2022-05-03 01:17:07

by Dharmendra Singh

[permalink] [raw]
Subject: [PATCH v4 1/3] FUSE: Implement atomic lookup + create

From: Dharmendra Singh <[email protected]>

When we go for creating a file (O_CREAT), we trigger
a lookup to FUSE USER SPACE. It is very much likely
that file does not exist yet as O_CREAT is passed to
open(). This lookup can be avoided and can be performed
as part of create call into libfuse.

This lookup + create in single call to libfuse and finally
to USER SPACE has been named as atomic create. It is expected
that USER SPACE create the file, open it and fills in the
attributes which are then used to make inode stand/revalidate
in the kernel cache. Also if file was newly created(does not
exist yet by this time) in USER SPACE then it should be indicated
in `struct fuse_file_info` by setting a bit which is again used by
libfuse to send some flags back to fuse kernel to indicate that
that file was newly created. These flags are used by kernel to
indicate changes in parent directory.

Fuse kernel automatically detects if atomic create is implemented
by libfuse/USER SPACE or not. And depending upon the outcome of
this check all further creates are decided to be atomic or non-atomic
creates.

If libfuse/USER SPACE has not implemented the atomic create operation
then by default behaviour remains same i.e we do not optimize lookup
calls which are triggered before create calls into libfuse.

Signed-off-by: Dharmendra Singh <[email protected]>
---
fs/fuse/dir.c | 82 +++++++++++++++++++++++++++++++++++----
fs/fuse/fuse_i.h | 3 ++
include/uapi/linux/fuse.h | 3 ++
3 files changed, 81 insertions(+), 7 deletions(-)

diff --git a/fs/fuse/dir.c b/fs/fuse/dir.c
index 656e921f3506..cad3322a007f 100644
--- a/fs/fuse/dir.c
+++ b/fs/fuse/dir.c
@@ -523,7 +523,7 @@ static int get_security_context(struct dentry *entry, umode_t mode,
*/
static int fuse_create_open(struct inode *dir, struct dentry *entry,
struct file *file, unsigned int flags,
- umode_t mode)
+ umode_t mode, uint32_t opcode)
{
int err;
struct inode *inode;
@@ -535,8 +535,10 @@ static int fuse_create_open(struct inode *dir, struct dentry *entry,
struct fuse_entry_out outentry;
struct fuse_inode *fi;
struct fuse_file *ff;
+ struct dentry *res = NULL;
void *security_ctx = NULL;
u32 security_ctxlen;
+ bool atomic_create = (opcode == FUSE_ATOMIC_CREATE ? true : false);

/* Userspace expects S_IFREG in create mode */
BUG_ON((mode & S_IFMT) != S_IFREG);
@@ -566,7 +568,7 @@ static int fuse_create_open(struct inode *dir, struct dentry *entry,
inarg.open_flags |= FUSE_OPEN_KILL_SUIDGID;
}

- args.opcode = FUSE_CREATE;
+ args.opcode = opcode;
args.nodeid = get_node_id(dir);
args.in_numargs = 2;
args.in_args[0].size = sizeof(inarg);
@@ -613,9 +615,44 @@ static int fuse_create_open(struct inode *dir, struct dentry *entry,
goto out_err;
}
kfree(forget);
- d_instantiate(entry, inode);
+ /*
+ * In atomic create, we skipped lookup and it is very much likely that
+ * dentry has DCACHE_PAR_LOOKUP flag set on it so call d_splice_alias().
+ * Note: Only REG file is allowed under create/atomic create.
+ */
+ /* There is special case when at very first call where we check if
+ * atomic create is implemented by USER SPACE/libfuse or not, we
+ * skipped lookup. Now, in case where atomic create is not implemented
+ * underlying, we fall back to FUSE_CREATE. here we are required to handle
+ * DCACHE_PAR_LOOKUP flag.
+ */
+ if (!atomic_create && !d_in_lookup(entry) && fm->fc->no_atomic_create)
+ d_instantiate(entry, inode);
+ else {
+ res = d_splice_alias(inode, entry);
+ if (res) {
+ /* Close the file in user space, but do not unlink it,
+ * if it was created - with network file systems other
+ * clients might have already accessed it.
+ */
+ if (IS_ERR(res)) {
+ fi = get_fuse_inode(inode);
+ fuse_sync_release(fi, ff, flags);
+ fuse_queue_forget(fm->fc, forget, outentry.nodeid, 1);
+ err = PTR_ERR(res);
+ goto out_err;
+ }
+ /* res is expected to be NULL since its REG file */
+ WARN_ON(res);
+ }
+ }
fuse_change_entry_timeout(entry, &outentry);
- fuse_dir_changed(dir);
+ /*
+ * In case of atomic create, we want to indicate directory change
+ * only if USER SPACE actually created the file.
+ */
+ if (!atomic_create || (outopen.open_flags & FOPEN_FILE_CREATED))
+ fuse_dir_changed(dir);
err = finish_open(file, entry, generic_file_open);
if (err) {
fi = get_fuse_inode(inode);
@@ -634,6 +671,29 @@ static int fuse_create_open(struct inode *dir, struct dentry *entry,
return err;
}

+static int fuse_atomic_create(struct inode *dir, struct dentry *entry,
+ struct file *file, unsigned int flags,
+ umode_t mode)
+{
+ int err;
+ struct fuse_conn *fc = get_fuse_conn(dir);
+
+ if (fc->no_atomic_create)
+ return -ENOSYS;
+
+ err = fuse_create_open(dir, entry, file, flags, mode,
+ FUSE_ATOMIC_CREATE);
+ /* If atomic create is not implemented then indicate in fc so that next
+ * request falls back to normal create instead of going into libufse and
+ * returning with -ENOSYS.
+ */
+ if (err == -ENOSYS) {
+ if (!fc->no_atomic_create)
+ fc->no_atomic_create = 1;
+ }
+ return err;
+}
+
static int fuse_mknod(struct user_namespace *, struct inode *, struct dentry *,
umode_t, dev_t);
static int fuse_atomic_open(struct inode *dir, struct dentry *entry,
@@ -643,11 +703,12 @@ static int fuse_atomic_open(struct inode *dir, struct dentry *entry,
int err;
struct fuse_conn *fc = get_fuse_conn(dir);
struct dentry *res = NULL;
+ bool create = flags & O_CREAT ? true : false;

if (fuse_is_bad(dir))
return -EIO;

- if (d_in_lookup(entry)) {
+ if ((!create || fc->no_atomic_create) && d_in_lookup(entry)) {
res = fuse_lookup(dir, entry, 0);
if (IS_ERR(res))
return PTR_ERR(res);
@@ -656,7 +717,7 @@ static int fuse_atomic_open(struct inode *dir, struct dentry *entry,
entry = res;
}

- if (!(flags & O_CREAT) || d_really_is_positive(entry))
+ if (!create || d_really_is_positive(entry))
goto no_open;

/* Only creates */
@@ -665,7 +726,13 @@ static int fuse_atomic_open(struct inode *dir, struct dentry *entry,
if (fc->no_create)
goto mknod;

- err = fuse_create_open(dir, entry, file, flags, mode);
+ err = fuse_atomic_create(dir, entry, file, flags, mode);
+ /* Libfuse/user space has not implemented atomic create, therefore
+ * fall back to normal create.
+ */
+ if (err == -ENOSYS)
+ err = fuse_create_open(dir, entry, file, flags, mode,
+ FUSE_CREATE);
if (err == -ENOSYS) {
fc->no_create = 1;
goto mknod;
@@ -683,6 +750,7 @@ static int fuse_atomic_open(struct inode *dir, struct dentry *entry,
}

/*
+
* Code shared between mknod, mkdir, symlink and link
*/
static int create_new_entry(struct fuse_mount *fm, struct fuse_args *args,
diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
index e8e59fbdefeb..d577a591ab16 100644
--- a/fs/fuse/fuse_i.h
+++ b/fs/fuse/fuse_i.h
@@ -669,6 +669,9 @@ struct fuse_conn {
/** Is open/release not implemented by fs? */
unsigned no_open:1;

+ /** Is atomic create not implemented by fs? */
+ unsigned no_atomic_create:1;
+
/** Is opendir/releasedir not implemented by fs? */
unsigned no_opendir:1;

diff --git a/include/uapi/linux/fuse.h b/include/uapi/linux/fuse.h
index d6ccee961891..e4b56004b148 100644
--- a/include/uapi/linux/fuse.h
+++ b/include/uapi/linux/fuse.h
@@ -301,6 +301,7 @@ struct fuse_file_lock {
* FOPEN_CACHE_DIR: allow caching this directory
* FOPEN_STREAM: the file is stream-like (no file position at all)
* FOPEN_NOFLUSH: don't flush data cache on close (unless FUSE_WRITEBACK_CACHE)
+ * FOPEN_FILE_CREATED: the file was actually created
*/
#define FOPEN_DIRECT_IO (1 << 0)
#define FOPEN_KEEP_CACHE (1 << 1)
@@ -308,6 +309,7 @@ struct fuse_file_lock {
#define FOPEN_CACHE_DIR (1 << 3)
#define FOPEN_STREAM (1 << 4)
#define FOPEN_NOFLUSH (1 << 5)
+#define FOPEN_FILE_CREATED (1 << 6)

/**
* INIT request/reply flags
@@ -537,6 +539,7 @@ enum fuse_opcode {
FUSE_SETUPMAPPING = 48,
FUSE_REMOVEMAPPING = 49,
FUSE_SYNCFS = 50,
+ FUSE_ATOMIC_CREATE = 51,

/* CUSE specific operations */
CUSE_INIT = 4096,
--
2.17.1

2022-05-03 22:03:56

by Vivek Goyal

[permalink] [raw]
Subject: Re: [PATCH v4 1/3] FUSE: Implement atomic lookup + create

On Mon, May 02, 2022 at 03:55:19PM +0530, Dharmendra Singh wrote:
> From: Dharmendra Singh <[email protected]>
>
> When we go for creating a file (O_CREAT), we trigger
> a lookup to FUSE USER SPACE. It is very much likely
> that file does not exist yet as O_CREAT is passed to
> open(). This lookup can be avoided and can be performed
> as part of create call into libfuse.
>
> This lookup + create in single call to libfuse and finally
> to USER SPACE has been named as atomic create. It is expected
> that USER SPACE create the file, open it and fills in the
> attributes which are then used to make inode stand/revalidate
> in the kernel cache. Also if file was newly created(does not
> exist yet by this time) in USER SPACE then it should be indicated
> in `struct fuse_file_info` by setting a bit which is again used by
> libfuse to send some flags back to fuse kernel to indicate that
> that file was newly created. These flags are used by kernel to
> indicate changes in parent directory.
>
> Fuse kernel automatically detects if atomic create is implemented
> by libfuse/USER SPACE or not. And depending upon the outcome of
> this check all further creates are decided to be atomic or non-atomic
> creates.
>
> If libfuse/USER SPACE has not implemented the atomic create operation
> then by default behaviour remains same i.e we do not optimize lookup
> calls which are triggered before create calls into libfuse.
>
> Signed-off-by: Dharmendra Singh <[email protected]>
> ---
> fs/fuse/dir.c | 82 +++++++++++++++++++++++++++++++++++----
> fs/fuse/fuse_i.h | 3 ++
> include/uapi/linux/fuse.h | 3 ++
> 3 files changed, 81 insertions(+), 7 deletions(-)
>
> diff --git a/fs/fuse/dir.c b/fs/fuse/dir.c
> index 656e921f3506..cad3322a007f 100644
> --- a/fs/fuse/dir.c
> +++ b/fs/fuse/dir.c
> @@ -523,7 +523,7 @@ static int get_security_context(struct dentry *entry, umode_t mode,
> */
> static int fuse_create_open(struct inode *dir, struct dentry *entry,
> struct file *file, unsigned int flags,
> - umode_t mode)
> + umode_t mode, uint32_t opcode)
> {
> int err;
> struct inode *inode;
> @@ -535,8 +535,10 @@ static int fuse_create_open(struct inode *dir, struct dentry *entry,
> struct fuse_entry_out outentry;
> struct fuse_inode *fi;
> struct fuse_file *ff;
> + struct dentry *res = NULL;
> void *security_ctx = NULL;
> u32 security_ctxlen;
> + bool atomic_create = (opcode == FUSE_ATOMIC_CREATE ? true : false);
>
> /* Userspace expects S_IFREG in create mode */
> BUG_ON((mode & S_IFMT) != S_IFREG);
> @@ -566,7 +568,7 @@ static int fuse_create_open(struct inode *dir, struct dentry *entry,
> inarg.open_flags |= FUSE_OPEN_KILL_SUIDGID;
> }
>
> - args.opcode = FUSE_CREATE;
> + args.opcode = opcode;
> args.nodeid = get_node_id(dir);
> args.in_numargs = 2;
> args.in_args[0].size = sizeof(inarg);
> @@ -613,9 +615,44 @@ static int fuse_create_open(struct inode *dir, struct dentry *entry,
> goto out_err;
> }
> kfree(forget);
> - d_instantiate(entry, inode);
> + /*
> + * In atomic create, we skipped lookup and it is very much likely that
> + * dentry has DCACHE_PAR_LOOKUP flag set on it so call d_splice_alias().

Can you please help me understand what does DCACHE_PAR_LOOKUP flag mean.
Also how d_splice_alias() helps in that case.


> + * Note: Only REG file is allowed under create/atomic create.
> + */
> + /* There is special case when at very first call where we check if
> + * atomic create is implemented by USER SPACE/libfuse or not, we
> + * skipped lookup. Now, in case where atomic create is not implemented
> + * underlying, we fall back to FUSE_CREATE. here we are required to handle
> + * DCACHE_PAR_LOOKUP flag.
> + */
> + if (!atomic_create && !d_in_lookup(entry) && fm->fc->no_atomic_create)
> + d_instantiate(entry, inode);

So if we are trying to do atomic_create first time, then we skipped lookup
in fuse_atomic_open() (assuming DCACHE_PAR_LOOKUP flag is set). If server
does not support, atomic_create, then we will set
fm->fc->no_atomic_create=1 and non-atomic create will be tried and we will
come here. And given we skipped lookup (even for non-atomic-create) case,
we are going to call d_splice_alias(). Right?

IOW, even without non-atomic-create, you are skipping lookup() and be
able to handle file creation and use d_splice_alias(). If this is
correct, I am wondering why do I need atomic create to begin with.

If this is not correct, then fuse_atomic_open() probably should handle
error from fuse_create_open(FUSE_ATOMIC_CREATE) first, retry lookup
and then retry fuse_create_open(FUSE_CREATE).

What am I missing?

Thanks
Vivek



> + else {
> + res = d_splice_alias(inode, entry);
> + if (res) {
> + /* Close the file in user space, but do not unlink it,
> + * if it was created - with network file systems other
> + * clients might have already accessed it.
> + */
> + if (IS_ERR(res)) {
> + fi = get_fuse_inode(inode);
> + fuse_sync_release(fi, ff, flags);
> + fuse_queue_forget(fm->fc, forget, outentry.nodeid, 1);
> + err = PTR_ERR(res);
> + goto out_err;
> + }
> + /* res is expected to be NULL since its REG file */
> + WARN_ON(res);
> + }
> + }


> fuse_change_entry_timeout(entry, &outentry);
> - fuse_dir_changed(dir);
> + /*
> + * In case of atomic create, we want to indicate directory change
> + * only if USER SPACE actually created the file.
> + */
> + if (!atomic_create || (outopen.open_flags & FOPEN_FILE_CREATED))
> + fuse_dir_changed(dir);
> err = finish_open(file, entry, generic_file_open);
> if (err) {
> fi = get_fuse_inode(inode);
> @@ -634,6 +671,29 @@ static int fuse_create_open(struct inode *dir, struct dentry *entry,
> return err;
> }
>
> +static int fuse_atomic_create(struct inode *dir, struct dentry *entry,
> + struct file *file, unsigned int flags,
> + umode_t mode)
> +{
> + int err;
> + struct fuse_conn *fc = get_fuse_conn(dir);
> +
> + if (fc->no_atomic_create)
> + return -ENOSYS;
> +
> + err = fuse_create_open(dir, entry, file, flags, mode,
> + FUSE_ATOMIC_CREATE);
> + /* If atomic create is not implemented then indicate in fc so that next
> + * request falls back to normal create instead of going into libufse and
> + * returning with -ENOSYS.
> + */
> + if (err == -ENOSYS) {
> + if (!fc->no_atomic_create)
> + fc->no_atomic_create = 1;
> + }
> + return err;
> +}
> +
> static int fuse_mknod(struct user_namespace *, struct inode *, struct dentry *,
> umode_t, dev_t);
> static int fuse_atomic_open(struct inode *dir, struct dentry *entry,
> @@ -643,11 +703,12 @@ static int fuse_atomic_open(struct inode *dir, struct dentry *entry,
> int err;
> struct fuse_conn *fc = get_fuse_conn(dir);
> struct dentry *res = NULL;
> + bool create = flags & O_CREAT ? true : false;
>
> if (fuse_is_bad(dir))
> return -EIO;
>
> - if (d_in_lookup(entry)) {
> + if ((!create || fc->no_atomic_create) && d_in_lookup(entry)) {
> res = fuse_lookup(dir, entry, 0);
> if (IS_ERR(res))
> return PTR_ERR(res);
> @@ -656,7 +717,7 @@ static int fuse_atomic_open(struct inode *dir, struct dentry *entry,
> entry = res;
> }
>
> - if (!(flags & O_CREAT) || d_really_is_positive(entry))
> + if (!create || d_really_is_positive(entry))
> goto no_open;
>
> /* Only creates */
> @@ -665,7 +726,13 @@ static int fuse_atomic_open(struct inode *dir, struct dentry *entry,
> if (fc->no_create)
> goto mknod;
>
> - err = fuse_create_open(dir, entry, file, flags, mode);
> + err = fuse_atomic_create(dir, entry, file, flags, mode);
> + /* Libfuse/user space has not implemented atomic create, therefore
> + * fall back to normal create.
> + */
> + if (err == -ENOSYS)
> + err = fuse_create_open(dir, entry, file, flags, mode,
> + FUSE_CREATE);
> if (err == -ENOSYS) {
> fc->no_create = 1;
> goto mknod;
> @@ -683,6 +750,7 @@ static int fuse_atomic_open(struct inode *dir, struct dentry *entry,
> }
>
> /*
> +
> * Code shared between mknod, mkdir, symlink and link
> */
> static int create_new_entry(struct fuse_mount *fm, struct fuse_args *args,
> diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
> index e8e59fbdefeb..d577a591ab16 100644
> --- a/fs/fuse/fuse_i.h
> +++ b/fs/fuse/fuse_i.h
> @@ -669,6 +669,9 @@ struct fuse_conn {
> /** Is open/release not implemented by fs? */
> unsigned no_open:1;
>
> + /** Is atomic create not implemented by fs? */
> + unsigned no_atomic_create:1;
> +
> /** Is opendir/releasedir not implemented by fs? */
> unsigned no_opendir:1;
>
> diff --git a/include/uapi/linux/fuse.h b/include/uapi/linux/fuse.h
> index d6ccee961891..e4b56004b148 100644
> --- a/include/uapi/linux/fuse.h
> +++ b/include/uapi/linux/fuse.h
> @@ -301,6 +301,7 @@ struct fuse_file_lock {
> * FOPEN_CACHE_DIR: allow caching this directory
> * FOPEN_STREAM: the file is stream-like (no file position at all)
> * FOPEN_NOFLUSH: don't flush data cache on close (unless FUSE_WRITEBACK_CACHE)
> + * FOPEN_FILE_CREATED: the file was actually created
> */
> #define FOPEN_DIRECT_IO (1 << 0)
> #define FOPEN_KEEP_CACHE (1 << 1)
> @@ -308,6 +309,7 @@ struct fuse_file_lock {
> #define FOPEN_CACHE_DIR (1 << 3)
> #define FOPEN_STREAM (1 << 4)
> #define FOPEN_NOFLUSH (1 << 5)
> +#define FOPEN_FILE_CREATED (1 << 6)
>
> /**
> * INIT request/reply flags
> @@ -537,6 +539,7 @@ enum fuse_opcode {
> FUSE_SETUPMAPPING = 48,
> FUSE_REMOVEMAPPING = 49,
> FUSE_SYNCFS = 50,
> + FUSE_ATOMIC_CREATE = 51,
>
> /* CUSE specific operations */
> CUSE_INIT = 4096,
> --
> 2.17.1
>

2022-05-04 01:49:38

by Vivek Goyal

[permalink] [raw]
Subject: Re: [PATCH v4 1/3] FUSE: Implement atomic lookup + create

On Mon, May 02, 2022 at 03:55:19PM +0530, Dharmendra Singh wrote:
> From: Dharmendra Singh <[email protected]>
>
> When we go for creating a file (O_CREAT), we trigger
> a lookup to FUSE USER SPACE. It is very much likely
> that file does not exist yet as O_CREAT is passed to
> open(). This lookup can be avoided and can be performed
> as part of create call into libfuse.
>
> This lookup + create in single call to libfuse and finally
> to USER SPACE has been named as atomic create. It is expected
> that USER SPACE create the file, open it and fills in the
> attributes which are then used to make inode stand/revalidate
> in the kernel cache. Also if file was newly created(does not
> exist yet by this time) in USER SPACE then it should be indicated
> in `struct fuse_file_info` by setting a bit which is again used by
> libfuse to send some flags back to fuse kernel to indicate that
> that file was newly created. These flags are used by kernel to
> indicate changes in parent directory.

Reading the existing code a little bit more and trying to understand
existing semantics. And that will help me unerstand what new is being
done.

So current fuse_atomic_open() does following.

A. Looks up dentry (if d_in_lookup() is set).
B. If dentry is positive or O_CREAT is not set, return.
C. If server supports atomic create + open, use that to create file and
open it as well.
D. If server does not support atomic create + open, just create file
using "mknod" and return. VFS will take care of opening the file.

Now with this patch, new flow is.

A. Look up dentry if d_in_lookup() is set as well as either file is not
being created or fc->no_atomic_create is set. This basiclally means
skip lookup if atomic_create is supported and file is being created.

B. Remains same. if dentry is positive or O_CREATE is not set, return.

C. If server supports new atomic_create(), use that.

D. If not, if server supports atomic create + open, use that

E. If not, fall back to mknod and do not open file.

So to me this new functionality is basically atomic "lookup + create +
open"?

Or may be not. I see we check "fc->no_create" and fallback to mknod.

if (fc->no_create)
goto mknod;

So fc->no_create is representing both old atomic "create + open" as well
as new "lookup + create + open" ?

It might be obvious to you, but it is not to me. So will be great if
you shed some light on this.

Thanks
Vivek


>
> Fuse kernel automatically detects if atomic create is implemented
> by libfuse/USER SPACE or not. And depending upon the outcome of
> this check all further creates are decided to be atomic or non-atomic
> creates.
>
> If libfuse/USER SPACE has not implemented the atomic create operation
> then by default behaviour remains same i.e we do not optimize lookup
> calls which are triggered before create calls into libfuse.
>
> Signed-off-by: Dharmendra Singh <[email protected]>
> ---
> fs/fuse/dir.c | 82 +++++++++++++++++++++++++++++++++++----
> fs/fuse/fuse_i.h | 3 ++
> include/uapi/linux/fuse.h | 3 ++
> 3 files changed, 81 insertions(+), 7 deletions(-)
>
> diff --git a/fs/fuse/dir.c b/fs/fuse/dir.c
> index 656e921f3506..cad3322a007f 100644
> --- a/fs/fuse/dir.c
> +++ b/fs/fuse/dir.c
> @@ -523,7 +523,7 @@ static int get_security_context(struct dentry *entry, umode_t mode,
> */
> static int fuse_create_open(struct inode *dir, struct dentry *entry,
> struct file *file, unsigned int flags,
> - umode_t mode)
> + umode_t mode, uint32_t opcode)
> {
> int err;
> struct inode *inode;
> @@ -535,8 +535,10 @@ static int fuse_create_open(struct inode *dir, struct dentry *entry,
> struct fuse_entry_out outentry;
> struct fuse_inode *fi;
> struct fuse_file *ff;
> + struct dentry *res = NULL;
> void *security_ctx = NULL;
> u32 security_ctxlen;
> + bool atomic_create = (opcode == FUSE_ATOMIC_CREATE ? true : false);
>
> /* Userspace expects S_IFREG in create mode */
> BUG_ON((mode & S_IFMT) != S_IFREG);
> @@ -566,7 +568,7 @@ static int fuse_create_open(struct inode *dir, struct dentry *entry,
> inarg.open_flags |= FUSE_OPEN_KILL_SUIDGID;
> }
>
> - args.opcode = FUSE_CREATE;
> + args.opcode = opcode;
> args.nodeid = get_node_id(dir);
> args.in_numargs = 2;
> args.in_args[0].size = sizeof(inarg);
> @@ -613,9 +615,44 @@ static int fuse_create_open(struct inode *dir, struct dentry *entry,
> goto out_err;
> }
> kfree(forget);
> - d_instantiate(entry, inode);
> + /*
> + * In atomic create, we skipped lookup and it is very much likely that
> + * dentry has DCACHE_PAR_LOOKUP flag set on it so call d_splice_alias().
> + * Note: Only REG file is allowed under create/atomic create.
> + */
> + /* There is special case when at very first call where we check if
> + * atomic create is implemented by USER SPACE/libfuse or not, we
> + * skipped lookup. Now, in case where atomic create is not implemented
> + * underlying, we fall back to FUSE_CREATE. here we are required to handle
> + * DCACHE_PAR_LOOKUP flag.
> + */
> + if (!atomic_create && !d_in_lookup(entry) && fm->fc->no_atomic_create)
> + d_instantiate(entry, inode);
> + else {
> + res = d_splice_alias(inode, entry);
> + if (res) {
> + /* Close the file in user space, but do not unlink it,
> + * if it was created - with network file systems other
> + * clients might have already accessed it.
> + */
> + if (IS_ERR(res)) {
> + fi = get_fuse_inode(inode);
> + fuse_sync_release(fi, ff, flags);
> + fuse_queue_forget(fm->fc, forget, outentry.nodeid, 1);
> + err = PTR_ERR(res);
> + goto out_err;
> + }
> + /* res is expected to be NULL since its REG file */
> + WARN_ON(res);
> + }
> + }
> fuse_change_entry_timeout(entry, &outentry);
> - fuse_dir_changed(dir);
> + /*
> + * In case of atomic create, we want to indicate directory change
> + * only if USER SPACE actually created the file.
> + */
> + if (!atomic_create || (outopen.open_flags & FOPEN_FILE_CREATED))
> + fuse_dir_changed(dir);
> err = finish_open(file, entry, generic_file_open);
> if (err) {
> fi = get_fuse_inode(inode);
> @@ -634,6 +671,29 @@ static int fuse_create_open(struct inode *dir, struct dentry *entry,
> return err;
> }
>
> +static int fuse_atomic_create(struct inode *dir, struct dentry *entry,
> + struct file *file, unsigned int flags,
> + umode_t mode)
> +{
> + int err;
> + struct fuse_conn *fc = get_fuse_conn(dir);
> +
> + if (fc->no_atomic_create)
> + return -ENOSYS;
> +
> + err = fuse_create_open(dir, entry, file, flags, mode,
> + FUSE_ATOMIC_CREATE);
> + /* If atomic create is not implemented then indicate in fc so that next
> + * request falls back to normal create instead of going into libufse and
> + * returning with -ENOSYS.
> + */
> + if (err == -ENOSYS) {
> + if (!fc->no_atomic_create)
> + fc->no_atomic_create = 1;
> + }
> + return err;
> +}
> +
> static int fuse_mknod(struct user_namespace *, struct inode *, struct dentry *,
> umode_t, dev_t);
> static int fuse_atomic_open(struct inode *dir, struct dentry *entry,
> @@ -643,11 +703,12 @@ static int fuse_atomic_open(struct inode *dir, struct dentry *entry,
> int err;
> struct fuse_conn *fc = get_fuse_conn(dir);
> struct dentry *res = NULL;
> + bool create = flags & O_CREAT ? true : false;
>
> if (fuse_is_bad(dir))
> return -EIO;
>
> - if (d_in_lookup(entry)) {
> + if ((!create || fc->no_atomic_create) && d_in_lookup(entry)) {
> res = fuse_lookup(dir, entry, 0);
> if (IS_ERR(res))
> return PTR_ERR(res);
> @@ -656,7 +717,7 @@ static int fuse_atomic_open(struct inode *dir, struct dentry *entry,
> entry = res;
> }
>
> - if (!(flags & O_CREAT) || d_really_is_positive(entry))
> + if (!create || d_really_is_positive(entry))
> goto no_open;
>
> /* Only creates */
> @@ -665,7 +726,13 @@ static int fuse_atomic_open(struct inode *dir, struct dentry *entry,
> if (fc->no_create)
> goto mknod;
>
> - err = fuse_create_open(dir, entry, file, flags, mode);
> + err = fuse_atomic_create(dir, entry, file, flags, mode);
> + /* Libfuse/user space has not implemented atomic create, therefore
> + * fall back to normal create.
> + */
> + if (err == -ENOSYS)
> + err = fuse_create_open(dir, entry, file, flags, mode,
> + FUSE_CREATE);
> if (err == -ENOSYS) {
> fc->no_create = 1;
> goto mknod;
> @@ -683,6 +750,7 @@ static int fuse_atomic_open(struct inode *dir, struct dentry *entry,
> }
>
> /*
> +
> * Code shared between mknod, mkdir, symlink and link
> */
> static int create_new_entry(struct fuse_mount *fm, struct fuse_args *args,
> diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
> index e8e59fbdefeb..d577a591ab16 100644
> --- a/fs/fuse/fuse_i.h
> +++ b/fs/fuse/fuse_i.h
> @@ -669,6 +669,9 @@ struct fuse_conn {
> /** Is open/release not implemented by fs? */
> unsigned no_open:1;
>
> + /** Is atomic create not implemented by fs? */
> + unsigned no_atomic_create:1;
> +
> /** Is opendir/releasedir not implemented by fs? */
> unsigned no_opendir:1;
>
> diff --git a/include/uapi/linux/fuse.h b/include/uapi/linux/fuse.h
> index d6ccee961891..e4b56004b148 100644
> --- a/include/uapi/linux/fuse.h
> +++ b/include/uapi/linux/fuse.h
> @@ -301,6 +301,7 @@ struct fuse_file_lock {
> * FOPEN_CACHE_DIR: allow caching this directory
> * FOPEN_STREAM: the file is stream-like (no file position at all)
> * FOPEN_NOFLUSH: don't flush data cache on close (unless FUSE_WRITEBACK_CACHE)
> + * FOPEN_FILE_CREATED: the file was actually created
> */
> #define FOPEN_DIRECT_IO (1 << 0)
> #define FOPEN_KEEP_CACHE (1 << 1)
> @@ -308,6 +309,7 @@ struct fuse_file_lock {
> #define FOPEN_CACHE_DIR (1 << 3)
> #define FOPEN_STREAM (1 << 4)
> #define FOPEN_NOFLUSH (1 << 5)
> +#define FOPEN_FILE_CREATED (1 << 6)
>
> /**
> * INIT request/reply flags
> @@ -537,6 +539,7 @@ enum fuse_opcode {
> FUSE_SETUPMAPPING = 48,
> FUSE_REMOVEMAPPING = 49,
> FUSE_SYNCFS = 50,
> + FUSE_ATOMIC_CREATE = 51,
>
> /* CUSE specific operations */
> CUSE_INIT = 4096,
> --
> 2.17.1
>

2022-05-04 07:33:20

by Dharmendra Singh

[permalink] [raw]
Subject: Re: [PATCH v4 1/3] FUSE: Implement atomic lookup + create

On Wed, May 4, 2022 at 1:24 AM Vivek Goyal <[email protected]> wrote:
>
> On Mon, May 02, 2022 at 03:55:19PM +0530, Dharmendra Singh wrote:
> > From: Dharmendra Singh <[email protected]>
> >
> > When we go for creating a file (O_CREAT), we trigger
> > a lookup to FUSE USER SPACE. It is very much likely
> > that file does not exist yet as O_CREAT is passed to
> > open(). This lookup can be avoided and can be performed
> > as part of create call into libfuse.
> >
> > This lookup + create in single call to libfuse and finally
> > to USER SPACE has been named as atomic create. It is expected
> > that USER SPACE create the file, open it and fills in the
> > attributes which are then used to make inode stand/revalidate
> > in the kernel cache. Also if file was newly created(does not
> > exist yet by this time) in USER SPACE then it should be indicated
> > in `struct fuse_file_info` by setting a bit which is again used by
> > libfuse to send some flags back to fuse kernel to indicate that
> > that file was newly created. These flags are used by kernel to
> > indicate changes in parent directory.
>
> Reading the existing code a little bit more and trying to understand
> existing semantics. And that will help me unerstand what new is being
> done.
>
> So current fuse_atomic_open() does following.
>
> A. Looks up dentry (if d_in_lookup() is set).
> B. If dentry is positive or O_CREAT is not set, return.
> C. If server supports atomic create + open, use that to create file and
> open it as well.
> D. If server does not support atomic create + open, just create file
> using "mknod" and return. VFS will take care of opening the file.
>
> Now with this patch, new flow is.
>
> A. Look up dentry if d_in_lookup() is set as well as either file is not
> being created or fc->no_atomic_create is set. This basiclally means
> skip lookup if atomic_create is supported and file is being created.
>
> B. Remains same. if dentry is positive or O_CREATE is not set, return.
>
> C. If server supports new atomic_create(), use that.
>
> D. If not, if server supports atomic create + open, use that
>
> E. If not, fall back to mknod and do not open file.
>
> So to me this new functionality is basically atomic "lookup + create +
> open"?
>
> Or may be not. I see we check "fc->no_create" and fallback to mknod.
>
> if (fc->no_create)
> goto mknod;
>
> So fc->no_create is representing both old atomic "create + open" as well
> as new "lookup + create + open" ?
>
> It might be obvious to you, but it is not to me. So will be great if
> you shed some light on this.
>

I think you got it right now. New atomic create does what you
mentioned as new flow. It does lookup + create + open in single call
(being called as atomic create) to USER SPACE. mknod is a special case
where the file system does not have a create call implemented. I think
its legacy probably goes back to Linux 2.4 if I am not wrong. We did
not make any changes into that.

Second patch avoids lookup for open calls. 3rd patch avoids lookup in
de_revalidate() but for non-dir. And only in case when default
permissions are not enabled.

2022-05-04 08:39:23

by Vivek Goyal

[permalink] [raw]
Subject: Re: [PATCH v4 1/3] FUSE: Implement atomic lookup + create

On Mon, May 02, 2022 at 03:55:19PM +0530, Dharmendra Singh wrote:
> From: Dharmendra Singh <[email protected]>
>
> When we go for creating a file (O_CREAT), we trigger
> a lookup to FUSE USER SPACE. It is very much likely
> that file does not exist yet as O_CREAT is passed to
> open(). This lookup can be avoided and can be performed
> as part of create call into libfuse.
>
> This lookup + create in single call to libfuse and finally
> to USER SPACE has been named as atomic create.

Looking at it and trying to understand if lookup can be avoided. But
before that this naming of calling it atomic create seems confusing.

We already have fuse_create_open() which we are calling it as
"Atomic create+open operation". Now looks like this is extension
to this operation where we are doing "Atomic lookup + create + open".
Do I understand it correctly?

Thanks
Vivek

> It is expected
> that USER SPACE create the file, open it and fills in the
> attributes which are then used to make inode stand/revalidate
> in the kernel cache. Also if file was newly created(does not
> exist yet by this time) in USER SPACE then it should be indicated
> in `struct fuse_file_info` by setting a bit which is again used by
> libfuse to send some flags back to fuse kernel to indicate that
> that file was newly created. These flags are used by kernel to
> indicate changes in parent directory.
>
> Fuse kernel automatically detects if atomic create is implemented
> by libfuse/USER SPACE or not. And depending upon the outcome of
> this check all further creates are decided to be atomic or non-atomic
> creates.
>
> If libfuse/USER SPACE has not implemented the atomic create operation
> then by default behaviour remains same i.e we do not optimize lookup
> calls which are triggered before create calls into libfuse.
>
> Signed-off-by: Dharmendra Singh <[email protected]>
> ---
> fs/fuse/dir.c | 82 +++++++++++++++++++++++++++++++++++----
> fs/fuse/fuse_i.h | 3 ++
> include/uapi/linux/fuse.h | 3 ++
> 3 files changed, 81 insertions(+), 7 deletions(-)
>
> diff --git a/fs/fuse/dir.c b/fs/fuse/dir.c
> index 656e921f3506..cad3322a007f 100644
> --- a/fs/fuse/dir.c
> +++ b/fs/fuse/dir.c
> @@ -523,7 +523,7 @@ static int get_security_context(struct dentry *entry, umode_t mode,
> */
> static int fuse_create_open(struct inode *dir, struct dentry *entry,
> struct file *file, unsigned int flags,
> - umode_t mode)
> + umode_t mode, uint32_t opcode)
> {
> int err;
> struct inode *inode;
> @@ -535,8 +535,10 @@ static int fuse_create_open(struct inode *dir, struct dentry *entry,
> struct fuse_entry_out outentry;
> struct fuse_inode *fi;
> struct fuse_file *ff;
> + struct dentry *res = NULL;
> void *security_ctx = NULL;
> u32 security_ctxlen;
> + bool atomic_create = (opcode == FUSE_ATOMIC_CREATE ? true : false);
>
> /* Userspace expects S_IFREG in create mode */
> BUG_ON((mode & S_IFMT) != S_IFREG);
> @@ -566,7 +568,7 @@ static int fuse_create_open(struct inode *dir, struct dentry *entry,
> inarg.open_flags |= FUSE_OPEN_KILL_SUIDGID;
> }
>
> - args.opcode = FUSE_CREATE;
> + args.opcode = opcode;
> args.nodeid = get_node_id(dir);
> args.in_numargs = 2;
> args.in_args[0].size = sizeof(inarg);
> @@ -613,9 +615,44 @@ static int fuse_create_open(struct inode *dir, struct dentry *entry,
> goto out_err;
> }
> kfree(forget);
> - d_instantiate(entry, inode);
> + /*
> + * In atomic create, we skipped lookup and it is very much likely that
> + * dentry has DCACHE_PAR_LOOKUP flag set on it so call d_splice_alias().
> + * Note: Only REG file is allowed under create/atomic create.
> + */
> + /* There is special case when at very first call where we check if
> + * atomic create is implemented by USER SPACE/libfuse or not, we
> + * skipped lookup. Now, in case where atomic create is not implemented
> + * underlying, we fall back to FUSE_CREATE. here we are required to handle
> + * DCACHE_PAR_LOOKUP flag.
> + */
> + if (!atomic_create && !d_in_lookup(entry) && fm->fc->no_atomic_create)
> + d_instantiate(entry, inode);
> + else {
> + res = d_splice_alias(inode, entry);
> + if (res) {
> + /* Close the file in user space, but do not unlink it,
> + * if it was created - with network file systems other
> + * clients might have already accessed it.
> + */
> + if (IS_ERR(res)) {
> + fi = get_fuse_inode(inode);
> + fuse_sync_release(fi, ff, flags);
> + fuse_queue_forget(fm->fc, forget, outentry.nodeid, 1);
> + err = PTR_ERR(res);
> + goto out_err;
> + }
> + /* res is expected to be NULL since its REG file */
> + WARN_ON(res);
> + }
> + }
> fuse_change_entry_timeout(entry, &outentry);
> - fuse_dir_changed(dir);
> + /*
> + * In case of atomic create, we want to indicate directory change
> + * only if USER SPACE actually created the file.
> + */
> + if (!atomic_create || (outopen.open_flags & FOPEN_FILE_CREATED))
> + fuse_dir_changed(dir);
> err = finish_open(file, entry, generic_file_open);
> if (err) {
> fi = get_fuse_inode(inode);
> @@ -634,6 +671,29 @@ static int fuse_create_open(struct inode *dir, struct dentry *entry,
> return err;
> }
>
> +static int fuse_atomic_create(struct inode *dir, struct dentry *entry,
> + struct file *file, unsigned int flags,
> + umode_t mode)
> +{
> + int err;
> + struct fuse_conn *fc = get_fuse_conn(dir);
> +
> + if (fc->no_atomic_create)
> + return -ENOSYS;
> +
> + err = fuse_create_open(dir, entry, file, flags, mode,
> + FUSE_ATOMIC_CREATE);
> + /* If atomic create is not implemented then indicate in fc so that next
> + * request falls back to normal create instead of going into libufse and
> + * returning with -ENOSYS.
> + */
> + if (err == -ENOSYS) {
> + if (!fc->no_atomic_create)
> + fc->no_atomic_create = 1;
> + }
> + return err;
> +}
> +
> static int fuse_mknod(struct user_namespace *, struct inode *, struct dentry *,
> umode_t, dev_t);
> static int fuse_atomic_open(struct inode *dir, struct dentry *entry,
> @@ -643,11 +703,12 @@ static int fuse_atomic_open(struct inode *dir, struct dentry *entry,
> int err;
> struct fuse_conn *fc = get_fuse_conn(dir);
> struct dentry *res = NULL;
> + bool create = flags & O_CREAT ? true : false;
>
> if (fuse_is_bad(dir))
> return -EIO;
>
> - if (d_in_lookup(entry)) {
> + if ((!create || fc->no_atomic_create) && d_in_lookup(entry)) {
> res = fuse_lookup(dir, entry, 0);
> if (IS_ERR(res))
> return PTR_ERR(res);
> @@ -656,7 +717,7 @@ static int fuse_atomic_open(struct inode *dir, struct dentry *entry,
> entry = res;
> }
>
> - if (!(flags & O_CREAT) || d_really_is_positive(entry))
> + if (!create || d_really_is_positive(entry))
> goto no_open;
>
> /* Only creates */
> @@ -665,7 +726,13 @@ static int fuse_atomic_open(struct inode *dir, struct dentry *entry,
> if (fc->no_create)
> goto mknod;
>
> - err = fuse_create_open(dir, entry, file, flags, mode);
> + err = fuse_atomic_create(dir, entry, file, flags, mode);
> + /* Libfuse/user space has not implemented atomic create, therefore
> + * fall back to normal create.
> + */
> + if (err == -ENOSYS)
> + err = fuse_create_open(dir, entry, file, flags, mode,
> + FUSE_CREATE);
> if (err == -ENOSYS) {
> fc->no_create = 1;
> goto mknod;
> @@ -683,6 +750,7 @@ static int fuse_atomic_open(struct inode *dir, struct dentry *entry,
> }
>
> /*
> +
> * Code shared between mknod, mkdir, symlink and link
> */
> static int create_new_entry(struct fuse_mount *fm, struct fuse_args *args,
> diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
> index e8e59fbdefeb..d577a591ab16 100644
> --- a/fs/fuse/fuse_i.h
> +++ b/fs/fuse/fuse_i.h
> @@ -669,6 +669,9 @@ struct fuse_conn {
> /** Is open/release not implemented by fs? */
> unsigned no_open:1;
>
> + /** Is atomic create not implemented by fs? */
> + unsigned no_atomic_create:1;
> +
> /** Is opendir/releasedir not implemented by fs? */
> unsigned no_opendir:1;
>
> diff --git a/include/uapi/linux/fuse.h b/include/uapi/linux/fuse.h
> index d6ccee961891..e4b56004b148 100644
> --- a/include/uapi/linux/fuse.h
> +++ b/include/uapi/linux/fuse.h
> @@ -301,6 +301,7 @@ struct fuse_file_lock {
> * FOPEN_CACHE_DIR: allow caching this directory
> * FOPEN_STREAM: the file is stream-like (no file position at all)
> * FOPEN_NOFLUSH: don't flush data cache on close (unless FUSE_WRITEBACK_CACHE)
> + * FOPEN_FILE_CREATED: the file was actually created
> */
> #define FOPEN_DIRECT_IO (1 << 0)
> #define FOPEN_KEEP_CACHE (1 << 1)
> @@ -308,6 +309,7 @@ struct fuse_file_lock {
> #define FOPEN_CACHE_DIR (1 << 3)
> #define FOPEN_STREAM (1 << 4)
> #define FOPEN_NOFLUSH (1 << 5)
> +#define FOPEN_FILE_CREATED (1 << 6)
>
> /**
> * INIT request/reply flags
> @@ -537,6 +539,7 @@ enum fuse_opcode {
> FUSE_SETUPMAPPING = 48,
> FUSE_REMOVEMAPPING = 49,
> FUSE_SYNCFS = 50,
> + FUSE_ATOMIC_CREATE = 51,
>
> /* CUSE specific operations */
> CUSE_INIT = 4096,
> --
> 2.17.1
>


2022-05-04 23:08:04

by Bernd Schubert

[permalink] [raw]
Subject: Re: [PATCH v4 1/3] FUSE: Implement atomic lookup + create

Hi Vivek,

On 5/3/22 21:53, Vivek Goyal wrote:
> Reading the existing code a little bit more and trying to understand
> existing semantics. And that will help me unerstand what new is being
> done.
>
> So current fuse_atomic_open() does following.
>
> A. Looks up dentry (if d_in_lookup() is set).
> B. If dentry is positive or O_CREAT is not set, return.
> C. If server supports atomic create + open, use that to create file and
> open it as well.
> D. If server does not support atomic create + open, just create file
> using "mknod" and return. VFS will take care of opening the file.
>
> Now with this patch, new flow is.
>
> A. Look up dentry if d_in_lookup() is set as well as either file is not
> being created or fc->no_atomic_create is set. This basiclally means
> skip lookup if atomic_create is supported and file is being created.
>
> B. Remains same. if dentry is positive or O_CREATE is not set, return.
>
> C. If server supports new atomic_create(), use that.
>
> D. If not, if server supports atomic create + open, use that
>
> E. If not, fall back to mknod and do not open file.
>
> So to me this new functionality is basically atomic "lookup + create +
> open"?
>
> Or may be not. I see we check "fc->no_create" and fallback to mknod.
>
> if (fc->no_create)
> goto mknod;
>
> So fc->no_create is representing both old atomic "create + open" as well
> as new "lookup + create + open" ?
>
> It might be obvious to you, but it is not to me. So will be great if
> you shed some light on this.

we are going to reply more in detail tomorrow, it gets rather late for
me as well. Anyway yes, goal is basically to avoid lookups as much
possible.
AFAIK, lookup-intents had been first introduced years ago by Lustre
developers - I guess to reduce network and server load - same reason for
us. Later Miklos had introduced atomic_open, which makes code
using/avoiding lookup much easier to read.
I guess unoticed that time, fuse was not fully using all possibilities
of atomic-open - we can see quite some lookup/revalidate traffic for our
file system.


I guess the commit message and introduction letter should be updated
with your A/B/C/D/E scheme. A) changes a bit in patch 2/3, which extents
it to normal file open, and patch 3/3 adds in revalidate.


Hope it helps,
Bernd

2022-05-04 23:47:54

by Vivek Goyal

[permalink] [raw]
Subject: Re: [PATCH v4 0/3] FUSE: Implement atomic lookup + open/create

On Mon, May 02, 2022 at 03:55:18PM +0530, Dharmendra Singh wrote:
> In FUSE, as of now, uncached lookups are expensive over the wire.
> E.g additional latencies and stressing (meta data) servers from
> thousands of clients. These lookup calls possibly can be avoided
> in some cases. Incoming three patches address this issue.

BTW, these patches are designed to improve performance by cutting down
on number of fuse commands sent. Are there any performance numbers
which demonstrate what kind of improvement you are seeing.

Say, If I do kernel build, is the performance improvement observable?

Thanks
Vivek

>
>
> Fist patch handles the case where we are creating a file with O_CREAT.
> Before we go for file creation, we do a lookup on the file which is most
> likely non-existent. After this lookup is done, we again go into libfuse
> to create file. Such lookups where file is most likely non-existent, can
> be avoided.
>
> Second patch handles the case where we open first time a file/dir
> but do a lookup first on it. After lookup is performed we make another
> call into libfuse to open the file. Now these two separate calls into
> libfuse can be combined and performed as a single call into libfuse.
>
> Third patch handles the case when we are opening an already existing file
> (positive dentry). Before this open call, we re-validate the inode and
> this re-validation does a lookup on the file and verify the inode.
> This separate lookup also can be avoided (for non-dir) and combined
> with open call into libfuse. After open returns we can revalidate the inode.
> This optimisation is performed only when we do not have default permissions
> enabled.
>
> Here is the link to performance numbers
> https://lore.kernel.org/linux-fsdevel/[email protected]/
>
>
> Dharmendra Singh (3):
> FUSE: Implement atomic lookup + create
> Implement atomic lookup + open
> Avoid lookup in d_revalidate()
>
> fs/fuse/dir.c | 211 +++++++++++++++++++++++++++++++++++---
> fs/fuse/file.c | 30 +++++-
> fs/fuse/fuse_i.h | 16 ++-
> fs/fuse/inode.c | 4 +-
> fs/fuse/ioctl.c | 2 +-
> include/uapi/linux/fuse.h | 5 +
> 6 files changed, 246 insertions(+), 22 deletions(-)
>
> ---
> v4: Addressed all comments and refactored the code into 3 separate patches
> respectively for Atomic create, Atomic open, optimizing lookup in
> d_revalidate().
> ---
>


2022-05-04 23:50:31

by Vivek Goyal

[permalink] [raw]
Subject: Re: [PATCH v4 1/3] FUSE: Implement atomic lookup + create

On Wed, May 04, 2022 at 05:46:27PM +0200, Bernd Schubert wrote:
>
>
> On 5/4/22 16:47, Vivek Goyal wrote:
>
> > Ok, naming is little confusing. I think we will have to put it in
> > commit message and where you define FUSE_ATOMIC_CREATE that what's
> > the difference between FUSE_CREATE and FUSE_ATOMIC_CREATE. This is
> > ATOMIC w.r.t what?
> >
> > May be atomic here means that "lookup + create + open" is a single operation.
> > But then even FUSE_CREATE is atomic because "creat + open" is a single
> > operation.
> >
> > In fact FUSE_CREATE does lookup anyway and returns all the information
> > in fuse_entry_out.
> >
> > IIUC, only difference between FUSE_CREATE and FUSE_ATOMIC_CREATE is that
> > later also carries information in reply whether file was actually created
> > or not (FOPEN_FILE_CREATED). This will be set if file did not exist
> > already and it was created indeed. Is that right?
> >
> > I see FOPEN_FILE_CREATED is being used to avoid calling
> > fuse_dir_changed(). That sounds like a separate optimization and probably
> > should be in a separate patch.
> >
> > IOW, I think this patch should be broken in to multiple pieces. First
> > piece seems to be avoiding lookup() and given the way it is implemented,
> > looks like we can avoid lookup() even by using existing FUSE_CREATE
> > command. We don't necessarily need FUSE_ATOMIC_CREATE. Is that right?
>
> The initial non-published patches had that, but I had actually asked not to
> go that route, because I'm scared that some user space file system
> implementations might get broken.

> Right now there is always a lookup before
> fuse_create_open() and when the resulting dentry is positive
> fuse_create_open/FUSE_CREATE is bypassed. I.e. user space implementations
> didn't need to handle existing files.

Hmm..., So if dentry is positive, we will call FUSE_OPEN instead in
current code.

Now with this change, we will call FUSE_CREATE and file could still
be present. If it is a shared filesystem, file could be created by
another client anyway after lookup() completed and returned a non-existent
file. So server can still get FUSE_CREATE and file could be there.

But I understand that risk of regression is not zero.

Given we are going to implement FUSE_CREATE_EXT in the same patch
series, I guess we could fix it easily by switching to FUSE_CREATE_EXT.

So that's my take. I will be willing to take this chance. Until and
unless ofcourse Miklos disagrees. :-)

Thanks
Vivek

> Out of the sudden user space
> implementations might need to handle it and some of them might get broken
> with that kernel update. I guess even a single broken user space
> implementation would count as regression.
> So I had asked to change the patch to require a user space flag.
>
> -- Bernd
>


2022-05-05 16:09:33

by Dharmendra Singh

[permalink] [raw]
Subject: Re: [PATCH v4 1/3] FUSE: Implement atomic lookup + create

On Wed, May 4, 2022 at 8:17 PM Vivek Goyal <[email protected]> wrote:
>
> On Wed, May 04, 2022 at 09:56:49AM +0530, Dharmendra Hans wrote:
> > On Wed, May 4, 2022 at 1:24 AM Vivek Goyal <[email protected]> wrote:
> > >
> > > On Mon, May 02, 2022 at 03:55:19PM +0530, Dharmendra Singh wrote:
> > > > From: Dharmendra Singh <[email protected]>
> > > >
> > > > When we go for creating a file (O_CREAT), we trigger
> > > > a lookup to FUSE USER SPACE. It is very much likely
> > > > that file does not exist yet as O_CREAT is passed to
> > > > open(). This lookup can be avoided and can be performed
> > > > as part of create call into libfuse.
> > > >
> > > > This lookup + create in single call to libfuse and finally
> > > > to USER SPACE has been named as atomic create. It is expected
> > > > that USER SPACE create the file, open it and fills in the
> > > > attributes which are then used to make inode stand/revalidate
> > > > in the kernel cache. Also if file was newly created(does not
> > > > exist yet by this time) in USER SPACE then it should be indicated
> > > > in `struct fuse_file_info` by setting a bit which is again used by
> > > > libfuse to send some flags back to fuse kernel to indicate that
> > > > that file was newly created. These flags are used by kernel to
> > > > indicate changes in parent directory.
> > >
> > > Reading the existing code a little bit more and trying to understand
> > > existing semantics. And that will help me unerstand what new is being
> > > done.
> > >
> > > So current fuse_atomic_open() does following.
> > >
> > > A. Looks up dentry (if d_in_lookup() is set).
> > > B. If dentry is positive or O_CREAT is not set, return.
> > > C. If server supports atomic create + open, use that to create file and
> > > open it as well.
> > > D. If server does not support atomic create + open, just create file
> > > using "mknod" and return. VFS will take care of opening the file.
> > >
> > > Now with this patch, new flow is.
> > >
> > > A. Look up dentry if d_in_lookup() is set as well as either file is not
> > > being created or fc->no_atomic_create is set. This basiclally means
> > > skip lookup if atomic_create is supported and file is being created.
> > >
> > > B. Remains same. if dentry is positive or O_CREATE is not set, return.
> > >
> > > C. If server supports new atomic_create(), use that.
> > >
> > > D. If not, if server supports atomic create + open, use that
> > >
> > > E. If not, fall back to mknod and do not open file.
> > >
> > > So to me this new functionality is basically atomic "lookup + create +
> > > open"?
> > >
> > > Or may be not. I see we check "fc->no_create" and fallback to mknod.
> > >
> > > if (fc->no_create)
> > > goto mknod;
> > >
> > > So fc->no_create is representing both old atomic "create + open" as well
> > > as new "lookup + create + open" ?
> > >
> > > It might be obvious to you, but it is not to me. So will be great if
> > > you shed some light on this.
> > >
> >
> > I think you got it right now. New atomic create does what you
> > mentioned as new flow. It does lookup + create + open in single call
> > (being called as atomic create) to USER SPACE.mknod is a special case
>
> Ok, naming is little confusing. I think we will have to put it in
> commit message and where you define FUSE_ATOMIC_CREATE that what's
> the difference between FUSE_CREATE and FUSE_ATOMIC_CREATE. This is
> ATOMIC w.r.t what?

Sure, I would update the commit message to make the distinction clear
between the two. This operation is atomic w.r.t to USER SPACE FUSE
implementations. i.e USER SPACE would be performing all these
operations in a single call to it.


> May be atomic here means that "lookup + create + open" is a single operation.
> But then even FUSE_CREATE is atomic because "creat + open" is a single
> operation.
>
> In fact FUSE_CREATE does lookup anyway and returns all the information
> in fuse_entry_out.
>
> IIUC, only difference between FUSE_CREATE and FUSE_ATOMIC_CREATE is that
> later also carries information in reply whether file was actually created
> or not (FOPEN_FILE_CREATED). This will be set if file did not exist
> already and it was created indeed. Is that right?

FUSE_CREATE is atomic but upto level of libfuse. Libfuse separates it
into two calls, create and lookup separately into USER SPACE FUSE
implementation.
This FUSE_ATOMIC_CREATE does all these ops in a single call to FUSE
implementations. We do not want to break any existing FUSE
implementations, therefore it is being introduced as a new feature. I
forgot to include links to libfuse patches as well. That would have
made it much clearer. Here is the link to libfuse patch for this call
https://github.com/libfuse/libfuse/pull/673.

>
> I see FOPEN_FILE_CREATED is being used to avoid calling
> fuse_dir_changed(). That sounds like a separate optimization and probably
> should be in a separate patch.

FUSE_ATOMIC_CREATE needs to send back info about if file was actually
created or not (This is suggestion from Miklos) to correctly convey if
the parent dir is really changing or not. I included this as part of
this patch itself instead of having it as a separate patch.

> IOW, I think this patch should be broken in to multiple pieces. First
> piece seems to be avoiding lookup() and given the way it is implemented,
> looks like we can avoid lookup() even by using existing FUSE_CREATE
> command. We don't necessarily need FUSE_ATOMIC_CREATE. Is that right?

Its not only about changing fuse kernel code but USER SPACE
implementations also. If we change the you are suggesting we would be
required to twist many things at libfuse and FUSE low level API. So to
keep things simple and not to break any existing implementations we
have kept it as new call (we now pass `struct stat` to USER SPACE FUSE
to filled in).

> And once that is done, a separate patch should probably should explain
> the problem and say fuse_dir_changed() call can be avoided if we knew
> if file was actually created or it was already existing there. And that's
> when one need to introduce a new command. Given this is just an extension
> of existing FUSE_CREATE command and returns additiona info about
> FOPEN_FILE_CREATED, we probably should simply call it FUSE_CREATE_EXT
> and explain how this operation is different from FUSE_CREATE.

As explained above, we are not doing this way as we have kept in mind
all existing libfuse APIs as well.

2022-05-06 01:41:45

by Vivek Goyal

[permalink] [raw]
Subject: Re: [PATCH v4 1/3] FUSE: Implement atomic lookup + create

On Thu, May 05, 2022 at 10:21:21AM +0530, Dharmendra Hans wrote:
> On Wed, May 4, 2022 at 8:17 PM Vivek Goyal <[email protected]> wrote:
> >
> > On Wed, May 04, 2022 at 09:56:49AM +0530, Dharmendra Hans wrote:
> > > On Wed, May 4, 2022 at 1:24 AM Vivek Goyal <[email protected]> wrote:
> > > >
> > > > On Mon, May 02, 2022 at 03:55:19PM +0530, Dharmendra Singh wrote:
> > > > > From: Dharmendra Singh <[email protected]>
> > > > >
> > > > > When we go for creating a file (O_CREAT), we trigger
> > > > > a lookup to FUSE USER SPACE. It is very much likely
> > > > > that file does not exist yet as O_CREAT is passed to
> > > > > open(). This lookup can be avoided and can be performed
> > > > > as part of create call into libfuse.
> > > > >
> > > > > This lookup + create in single call to libfuse and finally
> > > > > to USER SPACE has been named as atomic create. It is expected
> > > > > that USER SPACE create the file, open it and fills in the
> > > > > attributes which are then used to make inode stand/revalidate
> > > > > in the kernel cache. Also if file was newly created(does not
> > > > > exist yet by this time) in USER SPACE then it should be indicated
> > > > > in `struct fuse_file_info` by setting a bit which is again used by
> > > > > libfuse to send some flags back to fuse kernel to indicate that
> > > > > that file was newly created. These flags are used by kernel to
> > > > > indicate changes in parent directory.
> > > >
> > > > Reading the existing code a little bit more and trying to understand
> > > > existing semantics. And that will help me unerstand what new is being
> > > > done.
> > > >
> > > > So current fuse_atomic_open() does following.
> > > >
> > > > A. Looks up dentry (if d_in_lookup() is set).
> > > > B. If dentry is positive or O_CREAT is not set, return.
> > > > C. If server supports atomic create + open, use that to create file and
> > > > open it as well.
> > > > D. If server does not support atomic create + open, just create file
> > > > using "mknod" and return. VFS will take care of opening the file.
> > > >
> > > > Now with this patch, new flow is.
> > > >
> > > > A. Look up dentry if d_in_lookup() is set as well as either file is not
> > > > being created or fc->no_atomic_create is set. This basiclally means
> > > > skip lookup if atomic_create is supported and file is being created.
> > > >
> > > > B. Remains same. if dentry is positive or O_CREATE is not set, return.
> > > >
> > > > C. If server supports new atomic_create(), use that.
> > > >
> > > > D. If not, if server supports atomic create + open, use that
> > > >
> > > > E. If not, fall back to mknod and do not open file.
> > > >
> > > > So to me this new functionality is basically atomic "lookup + create +
> > > > open"?
> > > >
> > > > Or may be not. I see we check "fc->no_create" and fallback to mknod.
> > > >
> > > > if (fc->no_create)
> > > > goto mknod;
> > > >
> > > > So fc->no_create is representing both old atomic "create + open" as well
> > > > as new "lookup + create + open" ?
> > > >
> > > > It might be obvious to you, but it is not to me. So will be great if
> > > > you shed some light on this.
> > > >
> > >
> > > I think you got it right now. New atomic create does what you
> > > mentioned as new flow. It does lookup + create + open in single call
> > > (being called as atomic create) to USER SPACE.mknod is a special case
> >
> > Ok, naming is little confusing. I think we will have to put it in
> > commit message and where you define FUSE_ATOMIC_CREATE that what's
> > the difference between FUSE_CREATE and FUSE_ATOMIC_CREATE. This is
> > ATOMIC w.r.t what?
>
> Sure, I would update the commit message to make the distinction clear
> between the two. This operation is atomic w.r.t to USER SPACE FUSE
> implementations. i.e USER SPACE would be performing all these
> operations in a single call to it.

I think even FUSE_CREAT is doing same thing. Creating file, opening and
doing lookup and sending all the data. So that's not the difference
between the two, IMHO. And that's why I am getting confused with the
naming.

From user space file server perspective, only extra operation seems
to be that it sends a flag in response telling the client whether
file was actually created or it already existed. So to me it just
sounds little extension of existing FUSE_CREATE command and that's
why I thought calling it FUSE_CREATE_EXT is probably better naming.


>
>
> > May be atomic here means that "lookup + create + open" is a single operation.
> > But then even FUSE_CREATE is atomic because "creat + open" is a single
> > operation.
> >
> > In fact FUSE_CREATE does lookup anyway and returns all the information
> > in fuse_entry_out.
> >
> > IIUC, only difference between FUSE_CREATE and FUSE_ATOMIC_CREATE is that
> > later also carries information in reply whether file was actually created
> > or not (FOPEN_FILE_CREATED). This will be set if file did not exist
> > already and it was created indeed. Is that right?
>
> FUSE_CREATE is atomic but upto level of libfuse. Libfuse separates it
> into two calls, create and lookup separately into USER SPACE FUSE
> implementation.

I am not sure what do you mean by "libfuse separates it into two calls,
create and lookup separately". I guess you are referring to lo_create()
in example/passthrough_ll.c which first creates and opens file and
then looks it up and replies.

fd = openat(lo_fd(req, parent), name,
(fi->flags | O_CREAT) & ~O_NOFOLLOW, mode);
err = lo_do_lookup(req, parent, name, &e);
fuse_reply_create(req, &e, fi);

I am looking at your proposal for atomic_create implementation here.

https://github.com/libfuse/libfuse/pull/673/commits/88cd25b2857f2bb213d01afbcfd666787d1e6893#diff-a36385ec8fb753d6f4492a5f0d3c6a5750bd370b50df6ef0610efdcd3f8880ffR787

It is doing exactly same thing as lo_create(), except one difference that
it is checking first if file exists. It essentially is doing this.

A. newfd = openat(lo_fd(req, parent), name, O_PATH | O_NOFOLLOW);
B. fd = openat(lo_fd(req, parent), name,
(fi->flags | O_CREAT) & ~O_NOFOLLOW, mode);
C. err = lo_do_lookup(req, parent, name, &e);
D. fuse_reply_create(req, &e, fi);

So what do you mean by libfuse makes it two calls.

And I think above implementation is racy. What if filesystem is
shared and another client creates the file between calls to
A and B. You will think you created the file but some other
client created it.

So if intent is to know whether we created the file or not, then
you should probably do openat() with O_EXCL flag. If that succeeds
you know you created the file. If it fails with -EEXIST, then you
know file is already there. That's what virtiofs does.

Anyway, coming back to the point. IMHO, from server perspective,
there is no atomicity difference between FUSE_CREATE and
FUSE_ATOMIC_CREATE. Only difference seems to be to send addditional
information back to the client to tell it whether file was created
or not.

In fact for shared filesystem this is probably a problem. What if
guest's cache is stale and it does not know about the file. A client
B creates the file and we think we did not create the file. And we
will return with FOPEN_FILE_CREATED = 0. And in that case client
will not call fuse_dir_changed(). But that seems wrong in case
of shared filesystems. I am concerned about virtiofs which can
be shared between different guests.

Miklos, WDYT?

May be it is not a huge concern. If one guest drops a file, other guest
will not invalidate its dir attrs till timeout happens. Case of shared
filesystem is very tricky with fuse. And sometimes it is not clear
to me what kind of coherency matters.

So in this case say I am booted with cache=auto, if guest B drops a file
bar/foo.txt and guest A does open(bar/foo.txt, O_CREAT), then should
guest A invalidate the attrs of bar/ right away or it will be invalidated
anyway after a second.

Anyway.., my core point is that difference between FUSE_CREATE and
FUSE_ATOMIC_CREATE is just one flag FOPEN_FILE_CREATED which tells
client whether file was actually created or not. And that is used
to determine whether to invalidate parent dir attributes or not. It
does not have anything extra in terms of ATOMICITY as far as I can
see and that's what confuses me.



> This FUSE_ATOMIC_CREATE does all these ops in a single call to FUSE
> implementations. We do not want to break any existing FUSE
> implementations, therefore it is being introduced as a new feature. I
> forgot to include links to libfuse patches as well. That would have
> made it much clearer. Here is the link to libfuse patch for this call
> https://github.com/libfuse/libfuse/pull/673.
>
> >
> > I see FOPEN_FILE_CREATED is being used to avoid calling
> > fuse_dir_changed(). That sounds like a separate optimization and probably
> > should be in a separate patch.
>
> FUSE_ATOMIC_CREATE needs to send back info about if file was actually
> created or not (This is suggestion from Miklos) to correctly convey if
> the parent dir is really changing or not. I included this as part of
> this patch itself instead of having it as a separate patch.

This needs little more thought w.r.t shared filesystems.

>
> > IOW, I think this patch should be broken in to multiple pieces. First
> > piece seems to be avoiding lookup() and given the way it is implemented,
> > looks like we can avoid lookup() even by using existing FUSE_CREATE
> > command. We don't necessarily need FUSE_ATOMIC_CREATE. Is that right?
>
> Its not only about changing fuse kernel code but USER SPACE
> implementations also. If we change the you are suggesting we would be
> required to twist many things at libfuse and FUSE low level API. So to
> keep things simple and not to break any existing implementations we
> have kept it as new call (we now pass `struct stat` to USER SPACE FUSE
> to filled in).
>
> > And once that is done, a separate patch should probably should explain
> > the problem and say fuse_dir_changed() call can be avoided if we knew
> > if file was actually created or it was already existing there. And that's
> > when one need to introduce a new command. Given this is just an extension
> > of existing FUSE_CREATE command and returns additiona info about
> > FOPEN_FILE_CREATED, we probably should simply call it FUSE_CREATE_EXT
> > and explain how this operation is different from FUSE_CREATE.
>
> As explained above, we are not doing this way as we have kept in mind
> all existing libfuse APIs as well.
>

I am not seeing what will break in existing passthrough_ll.c if I
simply used FUSE_CREATE for a file which already exists.

fd = openat(lo_fd(req, parent), name,
(fi->flags | O_CREAT) & ~O_NOFOLLOW, mode);

This call should still succeed even if file already exists. (until and
unless called it with O_EXCL and in that case failing is the correct
behavior).

Thanks
Vivek


2022-05-06 02:13:44

by Vivek Goyal

[permalink] [raw]
Subject: Re: [PATCH v4 0/3] FUSE: Implement atomic lookup + open/create

On Thu, May 05, 2022 at 11:42:51AM +0530, Dharmendra Hans wrote:
> On Thu, May 5, 2022 at 12:48 AM Vivek Goyal <[email protected]> wrote:
> >
> > On Mon, May 02, 2022 at 03:55:18PM +0530, Dharmendra Singh wrote:
> > > In FUSE, as of now, uncached lookups are expensive over the wire.
> > > E.g additional latencies and stressing (meta data) servers from
> > > thousands of clients. These lookup calls possibly can be avoided
> > > in some cases. Incoming three patches address this issue.
> >
> > BTW, these patches are designed to improve performance by cutting down
> > on number of fuse commands sent. Are there any performance numbers
> > which demonstrate what kind of improvement you are seeing.
> >
> > Say, If I do kernel build, is the performance improvement observable?
>
> Here are the numbers I took last time. These were taken on tmpfs to
> actually see the effect of reduced calls. On local file systems it
> might not be that much visible. But we have observed that on systems
> where we have thousands of clients hammering the metadata servers, it
> helps a lot (We did not take numbers yet as we are required to change
> a lot of our client code but would be doing it later on).
>
> Note that for a change in performance number due to the new version of
> these patches, we have just refactored the code and functionality has
> remained the same since then.
>
> here is the link to the performance numbers
> https://lore.kernel.org/linux-fsdevel/[email protected]/

There is a lot going in that table. Trying to understand it.

- Why care about No-Flush. I mean that's independent of these changes,
right? I am assuming this means that upon file close do not send
a flush to fuse server. Not sure how bringing No-Flush into the
mix is helpful here.

- What is "Patched Libfuse"? I am assuming that these are changes
needed in libfuse to support atomic create + atomic open. Similarly
assuming "Patched FuseK" means patched kernel with your changes.

If this is correct, I would probably only be interested in
looking at "Patched Libfuse + Patched FuseK" numbers to figure out
what's the effect of your changes w.r.t vanilla kernel + libfuse.
Am I understanding it right?

- I am wondering why do we measure "Sequential" and "Random" patterns.
These optimizations are primarily for file creation + file opening
and I/O pattern should not matter.

- Also wondering why performance of Read/s improves. Assuming once
file has been opened, I think your optimizations get out of the
way (no create, no open) and we are just going through data path of
reading file data and no lookups happening. If that's the case, why
do Read/s numbers show an improvement.

- Why do we measure "Patched Libfuse". It shows performance regression
of 4-5% in table 0B, Sequential workoad. That sounds bad. So without
any optimization kicking in, it has a performance cost.

Thanks
Vivek


2022-05-06 08:58:14

by Vivek Goyal

[permalink] [raw]
Subject: Re: [PATCH v4 0/3] FUSE: Implement atomic lookup + open/create

On Thu, May 05, 2022 at 05:13:00PM +0200, Bernd Schubert wrote:
>
>
> On 5/5/22 14:54, Vivek Goyal wrote:
> > On Thu, May 05, 2022 at 11:42:51AM +0530, Dharmendra Hans wrote:
> > > Here are the numbers I took last time. These were taken on tmpfs to
> > > actually see the effect of reduced calls. On local file systems it
> > > might not be that much visible. But we have observed that on systems
> > > where we have thousands of clients hammering the metadata servers, it
> > > helps a lot (We did not take numbers yet as we are required to change
> > > a lot of our client code but would be doing it later on).
> > >
> > > Note that for a change in performance number due to the new version of
> > > these patches, we have just refactored the code and functionality has
> > > remained the same since then.
> > >
> > > here is the link to the performance numbers
> > > https://lore.kernel.org/linux-fsdevel/[email protected]/
> >
> > There is a lot going in that table. Trying to understand it.
> >
> > - Why care about No-Flush. I mean that's independent of these changes,
> > right? I am assuming this means that upon file close do not send
> > a flush to fuse server. Not sure how bringing No-Flush into the
> > mix is helpful here.
>
>
> It is a basically removing another call from kernel to user space. The calls
> there are, the lower is the resulting percentage for atomic-open.

Ok. You want to get rid of FUSE_FLUSH call so that % of FUSE_LOOKUP calls
go up and that will effectively show higher % of improvement due to
this.

>
>
> >
> > - What is "Patched Libfuse"? I am assuming that these are changes
> > needed in libfuse to support atomic create + atomic open. Similarly
> > assuming "Patched FuseK" means patched kernel with your changes.
>
> Yes, I did that to ensure there is no regression with the patches, when the
> other side is not patched.
>
> >
> > If this is correct, I would probably only be interested in
> > looking at "Patched Libfuse + Patched FuseK" numbers to figure out
> > what's the effect of your changes w.r.t vanilla kernel + libfuse.
> > Am I understanding it right?
>
> Yes.
>
> >
> > - I am wondering why do we measure "Sequential" and "Random" patterns.
> > These optimizations are primarily for file creation + file opening
> > and I/O pattern should not matter.
>
> bonnie++ does this automatically and it just convenient to take the bonnie++
> csv value and to paste them into a table.
>
> In our HPC world mdtest is more common, but it has MPI as requirement - make
> it harder to run. Reproducing the values with bonnie++ should be rather easy
> for you.
>
> Only issue with bonnie++ is that bonnie++ by default does not run
> multi-threaded and the old 3rd party perl scripts I had to let it run with
> multiple processes and to sum up the values don't work anymore with recent
> perl versions. I need to find some time to fix that.
>
>
> >
> > - Also wondering why performance of Read/s improves. Assuming once
> > file has been opened, I think your optimizations get out of the
> > way (no create, no open) and we are just going through data path of
> > reading file data and no lookups happening. If that's the case, why
> > do Read/s numbers show an improvement.
>
> That is now bonnie++ works. It creates the files, closes them (which causes
> the flush) and then re-opens for stat and read - atomic open comes into the
> picture here. Also read() is totally skipped when the files are empty -
> which is why one should use something like 1B files.
>
> If you have another metadata benchmark - please let us know.

Once I was pointed at smallfile.

https://github.com/distributed-system-analysis/smallfile

I ran this when I was posting the patches for virtiofs.

https://patchwork.kernel.org/project/linux-fsdevel/cover/[email protected]/

See if this is something interesting to you.


>
> >
> > - Why do we measure "Patched Libfuse". It shows performance regression
> > of 4-5% in table 0B, Sequential workoad. That sounds bad. So without
> > any optimization kicking in, it has a performance cost.
>
> Yes, I'm not sure yet. There is not so much code that has changed on the
> libfuse side.
> However the table needs to be redone with fixed libfuse - limiting the
> number of threads caused a permanent libfuse thread creation and destruction
>
> https://github.com/libfuse/libfuse/pull/652
>
> The numbers in table are also with paththrough_ll, which has its own issue
> due to linear inode search. paththrough_hp uses a C++ map and avoids that. I
> noticed too late when I started to investigate why there are regressions....
>
> Also the table made me to investigate/profile all the fuse operations, which
> resulted in my waitq question. Please see that thread for more details https://lore.kernel.org/lkml/[email protected]/T/
>
> Regarding atomic-open/create with avoiding lookup/revalidate, our primary
> goal is to reduce network calls. A file system that handles it locally only
> reduces the number of fuse kernel/user space crossing. A network file system
> that fully supports it needs to do the atomic open (or in old terms
> lookup-intent-open) on the server side of the network and needs to transfer
> attributes together with the open result.

Oh, I have no issues with the intent. I will like to see cut in network
traffic too (if we can do this without introducing problems). My primary
interest is that this kind of change should benefit virtiofs as well.

I am just trying to understand how much performance improvement is
actually there. And also trying to improve the quality of implementation.
Frankly speaking, it all seems very twisted and hard to read (and
hence maintain) code at this point fo time.

That's why I am going into the details to understand and suggest some
improvements.

Thanks
Vivek

>
> Lustre does this, although I cannot easily point you to the right code. It
> all started almost two decades ago:
> https://groups.google.com/g/lucky.linux.fsdevel/c/iYNFIIrkJ1s
>
>
> BeeGFS does this as well
> https://git.beegfs.io/pub/v7/-/blob/master/client_module/source/filesystem/FhgfsOpsInode.c
> See for examaple FhgfsOps_atomicOpen() and FhgfsOps_createIntent()
>
> (FhGFS is the old name when I was still involved in the project.)
>
> From my head I'm not sure if NFS does it over the wire, maybe v4.
>
>
> Thanks,
> Bernd
>
>
>
>
>
>


2022-05-06 11:13:00

by Bernd Schubert

[permalink] [raw]
Subject: Re: [PATCH v4 1/3] FUSE: Implement atomic lookup + create



On 5/4/22 16:47, Vivek Goyal wrote:

> Ok, naming is little confusing. I think we will have to put it in
> commit message and where you define FUSE_ATOMIC_CREATE that what's
> the difference between FUSE_CREATE and FUSE_ATOMIC_CREATE. This is
> ATOMIC w.r.t what?
>
> May be atomic here means that "lookup + create + open" is a single operation.
> But then even FUSE_CREATE is atomic because "creat + open" is a single
> operation.
>
> In fact FUSE_CREATE does lookup anyway and returns all the information
> in fuse_entry_out.
>
> IIUC, only difference between FUSE_CREATE and FUSE_ATOMIC_CREATE is that
> later also carries information in reply whether file was actually created
> or not (FOPEN_FILE_CREATED). This will be set if file did not exist
> already and it was created indeed. Is that right?
>
> I see FOPEN_FILE_CREATED is being used to avoid calling
> fuse_dir_changed(). That sounds like a separate optimization and probably
> should be in a separate patch.
>
> IOW, I think this patch should be broken in to multiple pieces. First
> piece seems to be avoiding lookup() and given the way it is implemented,
> looks like we can avoid lookup() even by using existing FUSE_CREATE
> command. We don't necessarily need FUSE_ATOMIC_CREATE. Is that right?

The initial non-published patches had that, but I had actually asked not
to go that route, because I'm scared that some user space file system
implementations might get broken. Right now there is always a lookup
before fuse_create_open() and when the resulting dentry is positive
fuse_create_open/FUSE_CREATE is bypassed. I.e. user space
implementations didn't need to handle existing files. Out of the sudden
user space implementations might need to handle it and some of them
might get broken with that kernel update. I guess even a single broken
user space implementation would count as regression.
So I had asked to change the patch to require a user space flag.

-- Bernd

2022-05-07 05:45:13

by Dharmendra Singh

[permalink] [raw]
Subject: Re: [PATCH v4 1/3] FUSE: Implement atomic lookup + create

On Thu, May 5, 2022 at 7:56 PM Vivek Goyal <[email protected]> wrote:
>
> On Thu, May 05, 2022 at 10:21:21AM +0530, Dharmendra Hans wrote:
> > On Wed, May 4, 2022 at 8:17 PM Vivek Goyal <[email protected]> wrote:
> > >
> > > On Wed, May 04, 2022 at 09:56:49AM +0530, Dharmendra Hans wrote:
> > > > On Wed, May 4, 2022 at 1:24 AM Vivek Goyal <[email protected]> wrote:
> > > > >
> > > > > On Mon, May 02, 2022 at 03:55:19PM +0530, Dharmendra Singh wrote:
> > > > > > From: Dharmendra Singh <[email protected]>
> > > > > >
> > > > > > When we go for creating a file (O_CREAT), we trigger
> > > > > > a lookup to FUSE USER SPACE. It is very much likely
> > > > > > that file does not exist yet as O_CREAT is passed to
> > > > > > open(). This lookup can be avoided and can be performed
> > > > > > as part of create call into libfuse.
> > > > > >
> > > > > > This lookup + create in single call to libfuse and finally
> > > > > > to USER SPACE has been named as atomic create. It is expected
> > > > > > that USER SPACE create the file, open it and fills in the
> > > > > > attributes which are then used to make inode stand/revalidate
> > > > > > in the kernel cache. Also if file was newly created(does not
> > > > > > exist yet by this time) in USER SPACE then it should be indicated
> > > > > > in `struct fuse_file_info` by setting a bit which is again used by
> > > > > > libfuse to send some flags back to fuse kernel to indicate that
> > > > > > that file was newly created. These flags are used by kernel to
> > > > > > indicate changes in parent directory.
> > > > >
> > > > > Reading the existing code a little bit more and trying to understand
> > > > > existing semantics. And that will help me unerstand what new is being
> > > > > done.
> > > > >
> > > > > So current fuse_atomic_open() does following.
> > > > >
> > > > > A. Looks up dentry (if d_in_lookup() is set).
> > > > > B. If dentry is positive or O_CREAT is not set, return.
> > > > > C. If server supports atomic create + open, use that to create file and
> > > > > open it as well.
> > > > > D. If server does not support atomic create + open, just create file
> > > > > using "mknod" and return. VFS will take care of opening the file.
> > > > >
> > > > > Now with this patch, new flow is.
> > > > >
> > > > > A. Look up dentry if d_in_lookup() is set as well as either file is not
> > > > > being created or fc->no_atomic_create is set. This basiclally means
> > > > > skip lookup if atomic_create is supported and file is being created.
> > > > >
> > > > > B. Remains same. if dentry is positive or O_CREATE is not set, return.
> > > > >
> > > > > C. If server supports new atomic_create(), use that.
> > > > >
> > > > > D. If not, if server supports atomic create + open, use that
> > > > >
> > > > > E. If not, fall back to mknod and do not open file.
> > > > >
> > > > > So to me this new functionality is basically atomic "lookup + create +
> > > > > open"?
> > > > >
> > > > > Or may be not. I see we check "fc->no_create" and fallback to mknod.
> > > > >
> > > > > if (fc->no_create)
> > > > > goto mknod;
> > > > >
> > > > > So fc->no_create is representing both old atomic "create + open" as well
> > > > > as new "lookup + create + open" ?
> > > > >
> > > > > It might be obvious to you, but it is not to me. So will be great if
> > > > > you shed some light on this.
> > > > >
> > > >
> > > > I think you got it right now. New atomic create does what you
> > > > mentioned as new flow. It does lookup + create + open in single call
> > > > (being called as atomic create) to USER SPACE.mknod is a special case
> > >
> > > Ok, naming is little confusing. I think we will have to put it in
> > > commit message and where you define FUSE_ATOMIC_CREATE that what's
> > > the difference between FUSE_CREATE and FUSE_ATOMIC_CREATE. This is
> > > ATOMIC w.r.t what?
> >
> > Sure, I would update the commit message to make the distinction clear
> > between the two. This operation is atomic w.r.t to USER SPACE FUSE
> > implementations. i.e USER SPACE would be performing all these
> > operations in a single call to it.
>
> I think even FUSE_CREAT is doing same thing. Creating file, opening and
> doing lookup and sending all the data. So that's not the difference
> between the two, IMHO. And that's why I am getting confused with the
> naming.
>
> From user space file server perspective, only extra operation seems
> to be that it sends a flag in response telling the client whether
> file was actually created or it already existed. So to me it just
> sounds little extension of existing FUSE_CREATE command and that's
> why I thought calling it FUSE_CREATE_EXT is probably better naming.

The difference between the two is of atomicity from top to bottom i.e
single call from fuse kernel to user space file server. Generally
FUSE_CREATE goes upto libfuse low level API as atomic (check
fuse_lib_create()) and it gets separated there into two calls (create
+ getattr). This is not what we want as it results in two network
trips each for create and lookup to the file server.

>
> >
> >
> > > May be atomic here means that "lookup + create + open" is a single operation.
> > > But then even FUSE_CREATE is atomic because "creat + open" is a single
> > > operation.
> > >
> > > In fact FUSE_CREATE does lookup anyway and returns all the information
> > > in fuse_entry_out.
> > >
> > > IIUC, only difference between FUSE_CREATE and FUSE_ATOMIC_CREATE is that
> > > later also carries information in reply whether file was actually created
> > > or not (FOPEN_FILE_CREATED). This will be set if file did not exist
> > > already and it was created indeed. Is that right?
> >
> > FUSE_CREATE is atomic but upto level of libfuse. Libfuse separates it
> > into two calls, create and lookup separately into USER SPACE FUSE
> > implementation.
>
> I am not sure what do you mean by "libfuse separates it into two calls,
> create and lookup separately". I guess you are referring to lo_create()
> in example/passthrough_ll.c which first creates and opens file and
> then looks it up and replies.
>
> fd = openat(lo_fd(req, parent), name,
> (fi->flags | O_CREAT) & ~O_NOFOLLOW, mode);
> err = lo_do_lookup(req, parent, name, &e);
> fuse_reply_create(req, &e, fi);
>
> I am looking at your proposal for atomic_create implementation here.
>
> https://github.com/libfuse/libfuse/pull/673/commits/88cd25b2857f2bb213d01afbcfd666787d1e6893#diff-a36385ec8fb753d6f4492a5f0d3c6a5750bd370b50df6ef0610efdcd3f8880ffR787
>
> It is doing exactly same thing as lo_create(), except one difference that
> it is checking first if file exists. It essentially is doing this.
>
> A. newfd = openat(lo_fd(req, parent), name, O_PATH | O_NOFOLLOW);
> B. fd = openat(lo_fd(req, parent), name,
> (fi->flags | O_CREAT) & ~O_NOFOLLOW, mode);
> C. err = lo_do_lookup(req, parent, name, &e);
> D. fuse_reply_create(req, &e, fi);
>
> So what do you mean by libfuse makes it two calls.
>
> And I think above implementation is racy. What if filesystem is
> shared and another client creates the file between calls to
> A and B. You will think you created the file but some other
> client created it.
>
> So if intent is to know whether we created the file or not, then
> you should probably do openat() with O_EXCL flag. If that succeeds
> you know you created the file. If it fails with -EEXIST, then you
> know file is already there. That's what virtiofs does.
>
> Anyway, coming back to the point. IMHO, from server perspective,
> there is no atomicity difference between FUSE_CREATE and
> FUSE_ATOMIC_CREATE. Only difference seems to be to send addditional
> information back to the client to tell it whether file was created
> or not.
>
> In fact for shared filesystem this is probably a problem. What if
> guest's cache is stale and it does not know about the file. A client
> B creates the file and we think we did not create the file. And we
> will return with FOPEN_FILE_CREATED = 0. And in that case client
> will not call fuse_dir_changed(). But that seems wrong in case
> of shared filesystems. I am concerned about virtiofs which can
> be shared between different guests.
>
> Miklos, WDYT?
>
> May be it is not a huge concern. If one guest drops a file, other guest
> will not invalidate its dir attrs till timeout happens. Case of shared
> filesystem is very tricky with fuse. And sometimes it is not clear
> to me what kind of coherency matters.
>
> So in this case say I am booted with cache=auto, if guest B drops a file
> bar/foo.txt and guest A does open(bar/foo.txt, O_CREAT), then should
> guest A invalidate the attrs of bar/ right away or it will be invalidated
> anyway after a second.
>
> Anyway.., my core point is that difference between FUSE_CREATE and
> FUSE_ATOMIC_CREATE is just one flag FOPEN_FILE_CREATED which tells
> client whether file was actually created or not. And that is used
> to determine whether to invalidate parent dir attributes or not. It
> does not have anything extra in terms of ATOMICITY as far as I can
> see and that's what confuses me.

You checked the wrong code. pasthrough_ll is not production code, we
do not use it at all, just using it to test these patches here. There
is a main patch which implements atomic create in libfuse for low
level api(). It is in same pull request, check it here
https://github.com/libfuse/libfuse/pull/673/commits/f86fe92bef7bb529ef1617e077d69a39eb26bc9f

What this FUSE_ATOMIC_CREATE(fuse_lib_atomic_create()) does in libfuse
is it make a single call into fuse_operations where file server would
be doing all 'lookup + create + open' in one call itself whereas
FUSE_CREATE gets separated into two calls 'create + lookup' which
eventually are two rpcs to the user space file server for a single
operation. Now you can see FUSE_CREATE is not atomic from file
systems point of view(i.e from user space file server perspective)
but from fuse kernel point of view only. We can make existing libfuse
code i.e fuse_lib_create() to call new atomic create but then the
interface becomes messy and it might start a trend to twist libfuse.
We want to keep libfuse APIs neat and clean. And do not break existing
systems. No flag from fuse kernel to libfuse to decide which code to
traverse. Therefore a new call for all this work.

>
>
>
> > This FUSE_ATOMIC_CREATE does all these ops in a single call to FUSE
> > implementations. We do not want to break any existing FUSE
> > implementations, therefore it is being introduced as a new feature. I
> > forgot to include links to libfuse patches as well. That would have
> > made it much clearer. Here is the link to libfuse patch for this call
> > https://github.com/libfuse/libfuse/pull/673.
> >
> > >
> > > I see FOPEN_FILE_CREATED is being used to avoid calling
> > > fuse_dir_changed(). That sounds like a separate optimization and probably
> > > should be in a separate patch.
> >
> > FUSE_ATOMIC_CREATE needs to send back info about if file was actually
> > created or not (This is suggestion from Miklos) to correctly convey if
> > the parent dir is really changing or not. I included this as part of
> > this patch itself instead of having it as a separate patch.
>
> This needs little more thought w.r.t shared filesystems.
>
> >
> > > IOW, I think this patch should be broken in to multiple pieces. First
> > > piece seems to be avoiding lookup() and given the way it is implemented,
> > > looks like we can avoid lookup() even by using existing FUSE_CREATE
> > > command. We don't necessarily need FUSE_ATOMIC_CREATE. Is that right?
> >
> > Its not only about changing fuse kernel code but USER SPACE
> > implementations also. If we change the you are suggesting we would be
> > required to twist many things at libfuse and FUSE low level API. So to
> > keep things simple and not to break any existing implementations we
> > have kept it as new call (we now pass `struct stat` to USER SPACE FUSE
> > to filled in).
> >
> > > And once that is done, a separate patch should probably should explain
> > > the problem and say fuse_dir_changed() call can be avoided if we knew
> > > if file was actually created or it was already existing there. And that's
> > > when one need to introduce a new command. Given this is just an extension
> > > of existing FUSE_CREATE command and returns additiona info about
> > > FOPEN_FILE_CREATED, we probably should simply call it FUSE_CREATE_EXT
> > > and explain how this operation is different from FUSE_CREATE.
> >
> > As explained above, we are not doing this way as we have kept in mind
> > all existing libfuse APIs as well.
> >
>
> I am not seeing what will break in existing passthrough_ll.c if I
> simply used FUSE_CREATE for a file which already exists.
>
> fd = openat(lo_fd(req, parent), name,
> (fi->flags | O_CREAT) & ~O_NOFOLLOW, mode);
>
> This call should still succeed even if file already exists. (until and
> unless called it with O_EXCL and in that case failing is the correct
> behavior).

It's not passthrogh_ll but low level libfuse API. Passthrough_ll is
just another user of low level API. It has been used just to test
these patches. We do not use passthrough_ll at all for any other
purpose.

2022-05-08 12:03:31

by Bernd Schubert

[permalink] [raw]
Subject: Re: [PATCH v4 1/3] FUSE: Implement atomic lookup + create



On 5/6/22 19:07, Vivek Goyal wrote:
> On Fri, May 06, 2022 at 06:41:17PM +0200, Bernd Schubert wrote:
>>
>>
>> On 5/6/22 16:12, Vivek Goyal wrote:
>>
>> [...]
>>
>>> On Fri, May 06, 2022 at 11:04:05AM +0530, Dharmendra Hans wrote:
>>
>>>
>>> Ok, looks like your fuse file server is talking to a another file
>>> server on network and that's why you are mentioning two network trips.
>>>
>>> Let us differentiate between two things first.
>>>
>>> A. FUSE protocol semantics
>>> B. Implementation of FUSE protocl by libfuse.
>>>
>>> I think I am stressing on A and you are stressing on B. I just want
>>> to see what's the difference between FUSE_CREATE and FUSE_ATOMIC_CREATE
>>> from fuse protocol point of view. Again look at from kernel's point of
>>> view and don't worry about libfuse is going to implement it.
>>> Implementations can vary.
>>
>> Agreed, I don't think we need to bring in network for the kernel to libfuse
>> API.
>>
>>>
>>> From kernel's perspective FUSE_CREATE is supposed to create + open a
>>> file. It is possible file already exists. Look at include/fuse_lowlevel.h
>>> description for create().
>>>
>>> /**
>>> * Create and open a file
>>> *
>>> * If the file does not exist, first create it with the specified
>>> * mode, and then open it.
>>> */
>>>
>>> I notice that fuse is offering a high level API as well as low level
>>> API. I primarily know about low level API. To me these are just two
>>> different implementation but things don't change how kernel sends
>>> fuse messages and what it expects from server in return.
>>>
>>> Now with FUSE_ATOMIC_CREATE, from kernel's perspective, only difference
>>> is that in reply message file server will also indicate if file was
>>> actually created or not. Is that right?
>>>
>>> And I am focussing on this FUSE API apsect. I am least concerned at
>>> this point of time who libfuse decides to actually implement FUSE_CREATE
>>> or FUSE_ATOMIC_CREATE etc. You might make a single call in libfuse
>>> server (instead of two) and that's performance optimization in libfuse.
>>> Kernel does not care how many calls did you make in file server to
>>> implement FUSE_CREATE or FUSE_ATOMIC_CREATE. All it cares is that
>>> create and open the file.
>>>
>>> So while you might do things in more atomic manner in file server and
>>> cut down on network traffic, kernel fuse API does not care. All it cares
>>> about is create + open a file.
>>>
>>> Anyway, from kernel's perspective, I think you should be able to
>>> just use FUSE_CREATE and still be do "lookup + create + open".
>>> FUSE_ATOMIC_CREATE is just allows one additional optimization so
>>> that you know whether to invalidate parent dir's attrs or not.
>>>
>>> In fact kernel is not putting any atomicity requirements as well on
>>> file server. And that's why I think this new command should probably
>>> be called FUSE_CREATE_EXT because it just sends back additional
>>> info.
>>>
>>> All the atomicity stuff you have been describing is that you are
>>> trying to do some optimizations in libfuse implementation to implement
>>> FUSE_ATOMIC_CREATE so that you send less number of commands over
>>> network. That's a good idea but fuse kernel API does not require you
>>> do these atomically, AFAICS.
>>>
>>> Given I know little bit of fuse low level API, If I were to implement
>>> this in virtiofs/passthrough_ll.c, I probably will do following.
>>>
>>> A. Check if caller provided O_EXCL flag.
>>> B. openat(O_CREAT | O_EXCL)
>>> C. If success, we created the file. Set file_created = 1.
>>>
>>> D. If error and error != -EEXIST, send error back to client.
>>> E. If error and error == -EEXIST, if caller did provide O_EXCL flag,
>>> return error.
>>> F. openat() returned -EEXIST and caller did not provide O_EXCL flag,
>>> that means file already exists. Set file_created = 0.
>>> G. Do lookup() etc to create internal lo_inode and stat() of file.
>>> H. Send response back to client using fuse_reply_create().
>>> This is one sample implementation for fuse lowlevel API. There could
>>> be other ways to implement. But all that is libfuse + filesystem
>>> specific and kernel does not care how many operations you use to
>>> complete and what's the atomicity etc. Of course less number of
>>> operations you do better it is.
>>>
>>> Anyway, I think I have said enough on this topic. IMHO, FUSE_CREATE
>>> descritpion (fuse_lowlevel.h) already mentions that "If the file does not
>>> exist, first create it with the specified mode and then open it". That
>>> means intent of protocol is that file could already be there as well.
>>> So I think we probably should implement this optimization (in kernel)
>>> using FUSE_CREATE command and then add FUSE_CREATE_EXT to add optimization
>>> about knowing whether file was actually created or not.
>>>
>>> W.r.t libfuse optimizations, I am not sure why can't you do optimizations
>>> with FUSE_CREATE and why do you need FUSE_CREATE_EXT necessarily. If
>>> are you worried that some existing filesystems will break, I think
>>> you can create an internal helper say fuse_create_atomic() and then
>>> use that if filesystem offers it. IOW, libfuse will have two
>>> ways to implement FUSE_CREATE. And if filesystem offers a new way which
>>> cuts down on network traffic, libfuse uses more efficient method. We
>>> should not have to change kernel FUSE API just because libfuse can
>>> do create + open operation more efficiently.
>>
>> Ah right, I like this. As I had written before, the first patch version was
>> using FUSE_CREATE and I was worried to break something. Yes, it should be
>> possible split into lookup+create on the libfuse side. That being said,
>> libfuse will need to know which version it is - there might be an old kernel
>> sending the non-optimized version - libfuse should not do another lookup
>> then.
>
> I am confused about one thing. For FUSE_CREATE command, how does it
> matter whether kernel has done lookup() before sending FUSE_CREATE. All
> FUSE_CREATE seems to say that crate a file (if it does not exist already)
> and then open it and return file handle as well as inode attributes. It
> does not say anything about whether a LOOKUP has already been done
> by kernel or not.
>
> It looks like you are assuming that if FUSE_CREATE is coming, that
> means client has already done FUSE_LOOKUP. So there is something we
> are not on same page about.
>
> I looked at fuse_lowlevel API and passthrough_ll.c and there is no
> assumption whether FUSE_LOOKUP has already been called by client
> before calling FUSE_CREATE. Similarly, I looked at virtiofs code
> and I can't see any such assumption there as well.

The current linux kernel code does this right now, skipping the lookup
just changes behavior. Personally I would see it as bug if the
userspace implementation does not handle EEXIST for FUSE_CREATE.
Implementation developer and especially users might see it differently
if a kernel update breaks/changes things out of the sudden.
passthrough_ll.c is not the issue here, it handles it correctly, but
what about the XYZ other file systems out there - do you want to check
them one by one, including closed source ones? And wouldn't even a
single broken application count as regression?

>
> https://github.com/qemu/qemu/blob/master/tools/virtiofsd/passthrough_ll.c
>
> So I am sort of lost. May be you can help me understsand this.

I guess it would be more interesting to look at different file systems
that are not overlay based. Like ntfs-3g - I have not looked at the code
yet, but especially disk based file system didn't have a reason so far
to handle EEXIST. And

>
>> Now there is 'fi.flags = arg->flags', but these are already taken by
>> open/fcntl flags - I would not feel comfortable to overload these. At best,
>> struct fuse_create_in currently had a padding field, we could convert these
>> to something like 'ext_fuse_open_flags' and then use it for fuse internal
>> things. Difficulty here is that I don't know if all kernel implementations
>> zero the struct (BSD, MacOS), so I guess we would need to negotiate at
>> startup/init time and would need another main feature flag? And with that
>> I'm not be sure anymore if the result would be actually more simple than
>> what we have right now for the first patch.
>
> If FUSE_CREATE indeed has a dependency on FUSE_LOOKUP have been called
> before that, then I agree that we can't implement new semantics with
> FUSE_CREATE and we will have to introduce a new op say
> FUSE_ATOMIC_CREATE/FUSE_LOOKUP_CREATE/FUSE_CREATE_EXT.
>
> But looking at fuse API, I don't see FUSE_CREATE ever guaranteeing that
> a FUSE_LOOKUP has been done before this.

It is not in written document, but in the existing (linux) kernel behavior.


You can look up "regressions due to 64-bit ext4 directory cookies" -
patches from me had been once already the reason for accidental breakage
and back that time I did not even have the slightest chance to predict
it, as glusterfs was relying on max 32bit telldir results and using the
other 32bit for its own purposes. Kernel behavior changed and
application broke, even though the application was relying on non-posix
behavior. In case of fuse there is no document like posix, but just API
headers and current behavior...

Thanks,
Bernd





2022-05-09 01:49:49

by Vivek Goyal

[permalink] [raw]
Subject: Re: [PATCH v4 1/3] FUSE: Implement atomic lookup + create

On Fri, May 06, 2022 at 11:04:05AM +0530, Dharmendra Hans wrote:
> On Thu, May 5, 2022 at 7:56 PM Vivek Goyal <[email protected]> wrote:
> >
> > On Thu, May 05, 2022 at 10:21:21AM +0530, Dharmendra Hans wrote:
> > > On Wed, May 4, 2022 at 8:17 PM Vivek Goyal <[email protected]> wrote:
> > > >
> > > > On Wed, May 04, 2022 at 09:56:49AM +0530, Dharmendra Hans wrote:
> > > > > On Wed, May 4, 2022 at 1:24 AM Vivek Goyal <[email protected]> wrote:
> > > > > >
> > > > > > On Mon, May 02, 2022 at 03:55:19PM +0530, Dharmendra Singh wrote:
> > > > > > > From: Dharmendra Singh <[email protected]>
> > > > > > >
> > > > > > > When we go for creating a file (O_CREAT), we trigger
> > > > > > > a lookup to FUSE USER SPACE. It is very much likely
> > > > > > > that file does not exist yet as O_CREAT is passed to
> > > > > > > open(). This lookup can be avoided and can be performed
> > > > > > > as part of create call into libfuse.
> > > > > > >
> > > > > > > This lookup + create in single call to libfuse and finally
> > > > > > > to USER SPACE has been named as atomic create. It is expected
> > > > > > > that USER SPACE create the file, open it and fills in the
> > > > > > > attributes which are then used to make inode stand/revalidate
> > > > > > > in the kernel cache. Also if file was newly created(does not
> > > > > > > exist yet by this time) in USER SPACE then it should be indicated
> > > > > > > in `struct fuse_file_info` by setting a bit which is again used by
> > > > > > > libfuse to send some flags back to fuse kernel to indicate that
> > > > > > > that file was newly created. These flags are used by kernel to
> > > > > > > indicate changes in parent directory.
> > > > > >
> > > > > > Reading the existing code a little bit more and trying to understand
> > > > > > existing semantics. And that will help me unerstand what new is being
> > > > > > done.
> > > > > >
> > > > > > So current fuse_atomic_open() does following.
> > > > > >
> > > > > > A. Looks up dentry (if d_in_lookup() is set).
> > > > > > B. If dentry is positive or O_CREAT is not set, return.
> > > > > > C. If server supports atomic create + open, use that to create file and
> > > > > > open it as well.
> > > > > > D. If server does not support atomic create + open, just create file
> > > > > > using "mknod" and return. VFS will take care of opening the file.
> > > > > >
> > > > > > Now with this patch, new flow is.
> > > > > >
> > > > > > A. Look up dentry if d_in_lookup() is set as well as either file is not
> > > > > > being created or fc->no_atomic_create is set. This basiclally means
> > > > > > skip lookup if atomic_create is supported and file is being created.
> > > > > >
> > > > > > B. Remains same. if dentry is positive or O_CREATE is not set, return.
> > > > > >
> > > > > > C. If server supports new atomic_create(), use that.
> > > > > >
> > > > > > D. If not, if server supports atomic create + open, use that
> > > > > >
> > > > > > E. If not, fall back to mknod and do not open file.
> > > > > >
> > > > > > So to me this new functionality is basically atomic "lookup + create +
> > > > > > open"?
> > > > > >
> > > > > > Or may be not. I see we check "fc->no_create" and fallback to mknod.
> > > > > >
> > > > > > if (fc->no_create)
> > > > > > goto mknod;
> > > > > >
> > > > > > So fc->no_create is representing both old atomic "create + open" as well
> > > > > > as new "lookup + create + open" ?
> > > > > >
> > > > > > It might be obvious to you, but it is not to me. So will be great if
> > > > > > you shed some light on this.
> > > > > >
> > > > >
> > > > > I think you got it right now. New atomic create does what you
> > > > > mentioned as new flow. It does lookup + create + open in single call
> > > > > (being called as atomic create) to USER SPACE.mknod is a special case
> > > >
> > > > Ok, naming is little confusing. I think we will have to put it in
> > > > commit message and where you define FUSE_ATOMIC_CREATE that what's
> > > > the difference between FUSE_CREATE and FUSE_ATOMIC_CREATE. This is
> > > > ATOMIC w.r.t what?
> > >
> > > Sure, I would update the commit message to make the distinction clear
> > > between the two. This operation is atomic w.r.t to USER SPACE FUSE
> > > implementations. i.e USER SPACE would be performing all these
> > > operations in a single call to it.
> >
> > I think even FUSE_CREAT is doing same thing. Creating file, opening and
> > doing lookup and sending all the data. So that's not the difference
> > between the two, IMHO. And that's why I am getting confused with the
> > naming.
> >
> > From user space file server perspective, only extra operation seems
> > to be that it sends a flag in response telling the client whether
> > file was actually created or it already existed. So to me it just
> > sounds little extension of existing FUSE_CREATE command and that's
> > why I thought calling it FUSE_CREATE_EXT is probably better naming.
>
> The difference between the two is of atomicity from top to bottom i.e
> single call from fuse kernel to user space file server. Generally
> FUSE_CREATE goes upto libfuse low level API as atomic (check
> fuse_lib_create()) and it gets separated there into two calls (create
> + getattr). This is not what we want as it results in two network
> trips each for create and lookup to the file server.

Ok, looks like your fuse file server is talking to a another file
server on network and that's why you are mentioning two network trips.

Let us differentiate between two things first.

A. FUSE protocol semantics
B. Implementation of FUSE protocl by libfuse.

I think I am stressing on A and you are stressing on B. I just want
to see what's the difference between FUSE_CREATE and FUSE_ATOMIC_CREATE
from fuse protocol point of view. Again look at from kernel's point of
view and don't worry about libfuse is going to implement it.
Implementations can vary.

From kernel's perspective FUSE_CREATE is supposed to create + open a
file. It is possible file already exists. Look at include/fuse_lowlevel.h
description for create().

/**
* Create and open a file
*
* If the file does not exist, first create it with the specified
* mode, and then open it.
*/

I notice that fuse is offering a high level API as well as low level
API. I primarily know about low level API. To me these are just two
different implementation but things don't change how kernel sends
fuse messages and what it expects from server in return.

Now with FUSE_ATOMIC_CREATE, from kernel's perspective, only difference
is that in reply message file server will also indicate if file was
actually created or not. Is that right?

And I am focussing on this FUSE API apsect. I am least concerned at
this point of time who libfuse decides to actually implement FUSE_CREATE
or FUSE_ATOMIC_CREATE etc. You might make a single call in libfuse
server (instead of two) and that's performance optimization in libfuse.
Kernel does not care how many calls did you make in file server to
implement FUSE_CREATE or FUSE_ATOMIC_CREATE. All it cares is that
create and open the file.

So while you might do things in more atomic manner in file server and
cut down on network traffic, kernel fuse API does not care. All it cares
about is create + open a file.

Anyway, from kernel's perspective, I think you should be able to
just use FUSE_CREATE and still be do "lookup + create + open".
FUSE_ATOMIC_CREATE is just allows one additional optimization so
that you know whether to invalidate parent dir's attrs or not.

In fact kernel is not putting any atomicity requirements as well on
file server. And that's why I think this new command should probably
be called FUSE_CREATE_EXT because it just sends back additional
info.

All the atomicity stuff you have been describing is that you are
trying to do some optimizations in libfuse implementation to implement
FUSE_ATOMIC_CREATE so that you send less number of commands over
network. That's a good idea but fuse kernel API does not require you
do these atomically, AFAICS.

Given I know little bit of fuse low level API, If I were to implement
this in virtiofs/passthrough_ll.c, I probably will do following.

A. Check if caller provided O_EXCL flag.
B. openat(O_CREAT | O_EXCL)
C. If success, we created the file. Set file_created = 1.

D. If error and error != -EEXIST, send error back to client.
E. If error and error == -EEXIST, if caller did provide O_EXCL flag,
return error.
F. openat() returned -EEXIST and caller did not provide O_EXCL flag,
that means file already exists. Set file_created = 0.
G. Do lookup() etc to create internal lo_inode and stat() of file.
H. Send response back to client using fuse_reply_create().

This is one sample implementation for fuse lowlevel API. There could
be other ways to implement. But all that is libfuse + filesystem
specific and kernel does not care how many operations you use to
complete and what's the atomicity etc. Of course less number of
operations you do better it is.

Anyway, I think I have said enough on this topic. IMHO, FUSE_CREATE
descritpion (fuse_lowlevel.h) already mentions that "If the file does not
exist, first create it with the specified mode and then open it". That
means intent of protocol is that file could already be there as well.
So I think we probably should implement this optimization (in kernel)
using FUSE_CREATE command and then add FUSE_CREATE_EXT to add optimization
about knowing whether file was actually created or not.

W.r.t libfuse optimizations, I am not sure why can't you do optimizations
with FUSE_CREATE and why do you need FUSE_CREATE_EXT necessarily. If
are you worried that some existing filesystems will break, I think
you can create an internal helper say fuse_create_atomic() and then
use that if filesystem offers it. IOW, libfuse will have two
ways to implement FUSE_CREATE. And if filesystem offers a new way which
cuts down on network traffic, libfuse uses more efficient method. We
should not have to change kernel FUSE API just because libfuse can
do create + open operation more efficiently.

By introducing FUSE_ATOMIC_CREATE and hiding new implementation behind
it, I think we are mixing things and that's why it is getting confusing.

Thanks
Vivek


>
> >
> > >
> > >
> > > > May be atomic here means that "lookup + create + open" is a single operation.
> > > > But then even FUSE_CREATE is atomic because "creat + open" is a single
> > > > operation.
> > > >
> > > > In fact FUSE_CREATE does lookup anyway and returns all the information
> > > > in fuse_entry_out.
> > > >
> > > > IIUC, only difference between FUSE_CREATE and FUSE_ATOMIC_CREATE is that
> > > > later also carries information in reply whether file was actually created
> > > > or not (FOPEN_FILE_CREATED). This will be set if file did not exist
> > > > already and it was created indeed. Is that right?
> > >
> > > FUSE_CREATE is atomic but upto level of libfuse. Libfuse separates it
> > > into two calls, create and lookup separately into USER SPACE FUSE
> > > implementation.
> >
> > I am not sure what do you mean by "libfuse separates it into two calls,
> > create and lookup separately". I guess you are referring to lo_create()
> > in example/passthrough_ll.c which first creates and opens file and
> > then looks it up and replies.
> >
> > fd = openat(lo_fd(req, parent), name,
> > (fi->flags | O_CREAT) & ~O_NOFOLLOW, mode);
> > err = lo_do_lookup(req, parent, name, &e);
> > fuse_reply_create(req, &e, fi);
> >
> > I am looking at your proposal for atomic_create implementation here.
> >
> > https://github.com/libfuse/libfuse/pull/673/commits/88cd25b2857f2bb213d01afbcfd666787d1e6893#diff-a36385ec8fb753d6f4492a5f0d3c6a5750bd370b50df6ef0610efdcd3f8880ffR787
> >
> > It is doing exactly same thing as lo_create(), except one difference that
> > it is checking first if file exists. It essentially is doing this.
> >
> > A. newfd = openat(lo_fd(req, parent), name, O_PATH | O_NOFOLLOW);
> > B. fd = openat(lo_fd(req, parent), name,
> > (fi->flags | O_CREAT) & ~O_NOFOLLOW, mode);
> > C. err = lo_do_lookup(req, parent, name, &e);
> > D. fuse_reply_create(req, &e, fi);
> >
> > So what do you mean by libfuse makes it two calls.
> >
> > And I think above implementation is racy. What if filesystem is
> > shared and another client creates the file between calls to
> > A and B. You will think you created the file but some other
> > client created it.
> >
> > So if intent is to know whether we created the file or not, then
> > you should probably do openat() with O_EXCL flag. If that succeeds
> > you know you created the file. If it fails with -EEXIST, then you
> > know file is already there. That's what virtiofs does.
> >
> > Anyway, coming back to the point. IMHO, from server perspective,
> > there is no atomicity difference between FUSE_CREATE and
> > FUSE_ATOMIC_CREATE. Only difference seems to be to send addditional
> > information back to the client to tell it whether file was created
> > or not.
> >
> > In fact for shared filesystem this is probably a problem. What if
> > guest's cache is stale and it does not know about the file. A client
> > B creates the file and we think we did not create the file. And we
> > will return with FOPEN_FILE_CREATED = 0. And in that case client
> > will not call fuse_dir_changed(). But that seems wrong in case
> > of shared filesystems. I am concerned about virtiofs which can
> > be shared between different guests.
> >
> > Miklos, WDYT?
> >
> > May be it is not a huge concern. If one guest drops a file, other guest
> > will not invalidate its dir attrs till timeout happens. Case of shared
> > filesystem is very tricky with fuse. And sometimes it is not clear
> > to me what kind of coherency matters.
> >
> > So in this case say I am booted with cache=auto, if guest B drops a file
> > bar/foo.txt and guest A does open(bar/foo.txt, O_CREAT), then should
> > guest A invalidate the attrs of bar/ right away or it will be invalidated
> > anyway after a second.
> >
> > Anyway.., my core point is that difference between FUSE_CREATE and
> > FUSE_ATOMIC_CREATE is just one flag FOPEN_FILE_CREATED which tells
> > client whether file was actually created or not. And that is used
> > to determine whether to invalidate parent dir attributes or not. It
> > does not have anything extra in terms of ATOMICITY as far as I can
> > see and that's what confuses me.
>
> You checked the wrong code. pasthrough_ll is not production code, we
> do not use it at all, just using it to test these patches here. There
> is a main patch which implements atomic create in libfuse for low
> level api(). It is in same pull request, check it here
> https://github.com/libfuse/libfuse/pull/673/commits/f86fe92bef7bb529ef1617e077d69a39eb26bc9f
>
> What this FUSE_ATOMIC_CREATE(fuse_lib_atomic_create()) does in libfuse
> is it make a single call into fuse_operations where file server would
> be doing all 'lookup + create + open' in one call itself whereas
> FUSE_CREATE gets separated into two calls 'create + lookup' which
> eventually are two rpcs to the user space file server for a single
> operation. Now you can see FUSE_CREATE is not atomic from file
> systems point of view(i.e from user space file server perspective)
> but from fuse kernel point of view only. We can make existing libfuse
> code i.e fuse_lib_create() to call new atomic create but then the
> interface becomes messy and it might start a trend to twist libfuse.
> We want to keep libfuse APIs neat and clean. And do not break existing
> systems. No flag from fuse kernel to libfuse to decide which code to
> traverse. Therefore a new call for all this work.
>
> >
> >
> >
> > > This FUSE_ATOMIC_CREATE does all these ops in a single call to FUSE
> > > implementations. We do not want to break any existing FUSE
> > > implementations, therefore it is being introduced as a new feature. I
> > > forgot to include links to libfuse patches as well. That would have
> > > made it much clearer. Here is the link to libfuse patch for this call
> > > https://github.com/libfuse/libfuse/pull/673.
> > >
> > > >
> > > > I see FOPEN_FILE_CREATED is being used to avoid calling
> > > > fuse_dir_changed(). That sounds like a separate optimization and probably
> > > > should be in a separate patch.
> > >
> > > FUSE_ATOMIC_CREATE needs to send back info about if file was actually
> > > created or not (This is suggestion from Miklos) to correctly convey if
> > > the parent dir is really changing or not. I included this as part of
> > > this patch itself instead of having it as a separate patch.
> >
> > This needs little more thought w.r.t shared filesystems.
> >
> > >
> > > > IOW, I think this patch should be broken in to multiple pieces. First
> > > > piece seems to be avoiding lookup() and given the way it is implemented,
> > > > looks like we can avoid lookup() even by using existing FUSE_CREATE
> > > > command. We don't necessarily need FUSE_ATOMIC_CREATE. Is that right?
> > >
> > > Its not only about changing fuse kernel code but USER SPACE
> > > implementations also. If we change the you are suggesting we would be
> > > required to twist many things at libfuse and FUSE low level API. So to
> > > keep things simple and not to break any existing implementations we
> > > have kept it as new call (we now pass `struct stat` to USER SPACE FUSE
> > > to filled in).
> > >
> > > > And once that is done, a separate patch should probably should explain
> > > > the problem and say fuse_dir_changed() call can be avoided if we knew
> > > > if file was actually created or it was already existing there. And that's
> > > > when one need to introduce a new command. Given this is just an extension
> > > > of existing FUSE_CREATE command and returns additiona info about
> > > > FOPEN_FILE_CREATED, we probably should simply call it FUSE_CREATE_EXT
> > > > and explain how this operation is different from FUSE_CREATE.
> > >
> > > As explained above, we are not doing this way as we have kept in mind
> > > all existing libfuse APIs as well.
> > >
> >
> > I am not seeing what will break in existing passthrough_ll.c if I
> > simply used FUSE_CREATE for a file which already exists.
> >
> > fd = openat(lo_fd(req, parent), name,
> > (fi->flags | O_CREAT) & ~O_NOFOLLOW, mode);
> >
> > This call should still succeed even if file already exists. (until and
> > unless called it with O_EXCL and in that case failing is the correct
> > behavior).
>
> It's not passthrogh_ll but low level libfuse API. Passthrough_ll is
> just another user of low level API. It has been used just to test
> these patches. We do not use passthrough_ll at all for any other
> purpose.
>


2022-05-09 04:08:29

by Vivek Goyal

[permalink] [raw]
Subject: Re: [PATCH v4 2/3] FUSE: Implement atomic lookup + open

On Mon, May 02, 2022 at 03:55:20PM +0530, Dharmendra Singh wrote:
> From: Dharmendra Singh <[email protected]>
>
> We can optimize aggressive lookups which are triggered when
> there is normal open for file/dir (dentry is new/negative).
>
> Here in this case since we are anyway going to open the file/dir
> with USER SPACE, avoid this separate lookup call into libfuse
> and combine it with open call into libfuse.
>
> This lookup + open in single call to libfuse has been named
> as atomic open. It is expected that USER SPACE opens the file
> and fills in the attributes, which are then used to make inode
> stand/revalidate in the kernel cache.
>
> Signed-off-by: Dharmendra Singh <[email protected]>
> ---
> fs/fuse/dir.c | 78 ++++++++++++++++++++++++++++++---------
> fs/fuse/fuse_i.h | 3 ++
> fs/fuse/inode.c | 4 +-
> include/uapi/linux/fuse.h | 2 +
> 4 files changed, 69 insertions(+), 18 deletions(-)
>
> diff --git a/fs/fuse/dir.c b/fs/fuse/dir.c
> index cad3322a007f..6879d3a86796 100644
> --- a/fs/fuse/dir.c
> +++ b/fs/fuse/dir.c
> @@ -516,18 +516,18 @@ static int get_security_context(struct dentry *entry, umode_t mode,
> }
>
> /*
> - * Atomic create+open operation
> - *
> - * If the filesystem doesn't support this, then fall back to separate
> - * 'mknod' + 'open' requests.
> + * This is common function for initiating atomic operations into libfuse.
> + * Currently being used by Atomic create(atomic lookup + create) and
> + * Atomic open(atomic lookup + open).
> */
> -static int fuse_create_open(struct inode *dir, struct dentry *entry,
> - struct file *file, unsigned int flags,
> - umode_t mode, uint32_t opcode)
> +static int fuse_atomic_do_common(struct inode *dir, struct dentry *entry,
> + struct dentry **alias, struct file *file,
> + unsigned int flags, umode_t mode, uint32_t opcode)

If this common function is really useful for atomic create + atomic open,
I will first add a patch which adds a common function and makes FUSE_CREATE_EXT use that function. And in later patch add functionality to do atomic open.
Just makes code more readable.

> {
> int err;
> struct inode *inode;
> struct fuse_mount *fm = get_fuse_mount(dir);
> + struct fuse_conn *fc = get_fuse_conn(dir);
> FUSE_ARGS(args);
> struct fuse_forget_link *forget;
> struct fuse_create_in inarg;
> @@ -539,9 +539,13 @@ static int fuse_create_open(struct inode *dir, struct dentry *entry,
> void *security_ctx = NULL;
> u32 security_ctxlen;
> bool atomic_create = (opcode == FUSE_ATOMIC_CREATE ? true : false);
> + bool create_op = (opcode == FUSE_CREATE ||
> + opcode == FUSE_ATOMIC_CREATE) ? true : false;
> + if (alias)
> + *alias = NULL;
>
> /* Userspace expects S_IFREG in create mode */
> - BUG_ON((mode & S_IFMT) != S_IFREG);
> + BUG_ON(create_op && (mode & S_IFMT) != S_IFREG);
>
> forget = fuse_alloc_forget();
> err = -ENOMEM;
> @@ -553,7 +557,7 @@ static int fuse_create_open(struct inode *dir, struct dentry *entry,
> if (!ff)
> goto out_put_forget_req;
>
> - if (!fm->fc->dont_mask)
> + if (!fc->dont_mask)
> mode &= ~current_umask();
>
> flags &= ~O_NOCTTY;
> @@ -642,8 +646,9 @@ static int fuse_create_open(struct inode *dir, struct dentry *entry,
> err = PTR_ERR(res);
> goto out_err;
> }
> - /* res is expected to be NULL since its REG file */
> - WARN_ON(res);
> + entry = res;
> + if (alias)
> + *alias = res;
> }
> }
> fuse_change_entry_timeout(entry, &outentry);
> @@ -681,8 +686,8 @@ static int fuse_atomic_create(struct inode *dir, struct dentry *entry,
> if (fc->no_atomic_create)
> return -ENOSYS;
>
> - err = fuse_create_open(dir, entry, file, flags, mode,
> - FUSE_ATOMIC_CREATE);
> + err = fuse_atomic_do_common(dir, entry, NULL, file, flags, mode,
> + FUSE_ATOMIC_CREATE);
> /* If atomic create is not implemented then indicate in fc so that next
> * request falls back to normal create instead of going into libufse and
> * returning with -ENOSYS.
> @@ -694,6 +699,29 @@ static int fuse_atomic_create(struct inode *dir, struct dentry *entry,
> return err;
> }
>
> +static int fuse_do_atomic_open(struct inode *dir, struct dentry *entry,
> + struct dentry **alias, struct file *file,
> + unsigned int flags, umode_t mode)
> +{
> + int err;
> + struct fuse_conn *fc = get_fuse_conn(dir);
> +
> + if (!fc->do_atomic_open)
> + return -ENOSYS;
> +
> + err = fuse_atomic_do_common(dir, entry, alias, file, flags, mode,
> + FUSE_ATOMIC_OPEN);
> + /* Atomic open imply atomic trunc as well i.e truncate should be performed
> + * as part of atomic open call itself.
> + */
> + if (!fc->atomic_o_trunc) {
> + if (fc->do_atomic_open)
> + fc->atomic_o_trunc = 1;
> + }
> +
> + return err;
> +}
> +
> static int fuse_mknod(struct user_namespace *, struct inode *, struct dentry *,
> umode_t, dev_t);
> static int fuse_atomic_open(struct inode *dir, struct dentry *entry,
> @@ -702,12 +730,23 @@ static int fuse_atomic_open(struct inode *dir, struct dentry *entry,
> {
> int err;
> struct fuse_conn *fc = get_fuse_conn(dir);
> - struct dentry *res = NULL;
> + struct dentry *res = NULL, *alias = NULL;
> bool create = flags & O_CREAT ? true : false;
>
> if (fuse_is_bad(dir))
> return -EIO;
>
> + if (!create) {
> + err = fuse_do_atomic_open(dir, entry, &alias,
> + file, flags, mode);

So this is the core change of behavior. As of now, if we are not doing
create operation, we just lookup and return. But now you are doing
open + lookup in a single operation. Interesting. I am not sure if
it breaks anything in VFS.

> + res = alias;
> + if (!err || err != -ENOSYS)
> + goto out_dput;
> + }
> + /*
> + * ENOSYS fall back for open- user space does not have full
> + * atomic open.
> + */

This ENOSYS stuff all the place is making these patches look very unclean
to me. A part of me says that should we negotiate these as feature bits.
Having said that feature bits are precious and should not be used for
trivial purposes. Hmm..., may be we can make handle ENOSYS little better
in general.

> if ((!create || fc->no_atomic_create) && d_in_lookup(entry)) {
> res = fuse_lookup(dir, entry, 0);
> if (IS_ERR(res))
> @@ -730,9 +769,14 @@ static int fuse_atomic_open(struct inode *dir, struct dentry *entry,
> /* Libfuse/user space has not implemented atomic create, therefore
> * fall back to normal create.
> */
> - if (err == -ENOSYS)
> - err = fuse_create_open(dir, entry, file, flags, mode,
> - FUSE_CREATE);
> + /* Atomic create+open operation
> + * If the filesystem doesn't support this, then fall back to separate
> + * 'mknod' + 'open' requests.
> + */
> + if (err == -ENOSYS) {
> + err = fuse_atomic_do_common(dir, entry, NULL, file, flags,
> + mode, FUSE_CREATE);
> + }
> if (err == -ENOSYS) {
> fc->no_create = 1;
> goto mknod;
> diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
> index d577a591ab16..24793b82303f 100644
> --- a/fs/fuse/fuse_i.h
> +++ b/fs/fuse/fuse_i.h
> @@ -669,6 +669,9 @@ struct fuse_conn {
> /** Is open/release not implemented by fs? */
> unsigned no_open:1;
>
> + /** Is atomic open implemented by fs ? */
> + unsigned do_atomic_open : 1;
> +
> /** Is atomic create not implemented by fs? */
> unsigned no_atomic_create:1;
>
> diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
> index ee846ce371d8..5f667de69115 100644
> --- a/fs/fuse/inode.c
> +++ b/fs/fuse/inode.c
> @@ -1190,6 +1190,8 @@ static void process_init_reply(struct fuse_mount *fm, struct fuse_args *args,
> fc->setxattr_ext = 1;
> if (flags & FUSE_SECURITY_CTX)
> fc->init_security = 1;
> + if (flags & FUSE_DO_ATOMIC_OPEN)
> + fc->do_atomic_open = 1;

Hey you are negotiating FUSE_DO_ATOMIC_OPEN flag. If that's the case why
do you have to deal with -ENOSYS in fuse_do_atomic_open(). You should
be able to just check.

if (!create && fc->do_atomic_open) {
fuse_do_atomic_open().
}

I have yet to check what happens if file does not exist.

> } else {
> ra_pages = fc->max_read / PAGE_SIZE;
> fc->no_lock = 1;
> @@ -1235,7 +1237,7 @@ void fuse_send_init(struct fuse_mount *fm)
> FUSE_ABORT_ERROR | FUSE_MAX_PAGES | FUSE_CACHE_SYMLINKS |
> FUSE_NO_OPENDIR_SUPPORT | FUSE_EXPLICIT_INVAL_DATA |
> FUSE_HANDLE_KILLPRIV_V2 | FUSE_SETXATTR_EXT | FUSE_INIT_EXT |
> - FUSE_SECURITY_CTX;
> + FUSE_SECURITY_CTX | FUSE_DO_ATOMIC_OPEN;
> #ifdef CONFIG_FUSE_DAX
> if (fm->fc->dax)
> flags |= FUSE_MAP_ALIGNMENT;
> diff --git a/include/uapi/linux/fuse.h b/include/uapi/linux/fuse.h
> index e4b56004b148..ab91e391241a 100644
> --- a/include/uapi/linux/fuse.h
> +++ b/include/uapi/linux/fuse.h
> @@ -391,6 +391,7 @@ struct fuse_file_lock {
> /* bits 32..63 get shifted down 32 bits into the flags2 field */
> #define FUSE_SECURITY_CTX (1ULL << 32)
> #define FUSE_HAS_INODE_DAX (1ULL << 33)
> +#define FUSE_DO_ATOMIC_OPEN (1ULL << 34)
>
> /**
> * CUSE INIT request/reply flags
> @@ -540,6 +541,7 @@ enum fuse_opcode {
> FUSE_REMOVEMAPPING = 49,
> FUSE_SYNCFS = 50,
> FUSE_ATOMIC_CREATE = 51,
> + FUSE_ATOMIC_OPEN = 52,
>
> /* CUSE specific operations */
> CUSE_INIT = 4096,
> --
> 2.17.1
>


2022-05-09 05:09:51

by Vivek Goyal

[permalink] [raw]
Subject: Re: [PATCH v4 1/3] FUSE: Implement atomic lookup + create

On Wed, May 04, 2022 at 09:56:49AM +0530, Dharmendra Hans wrote:
> On Wed, May 4, 2022 at 1:24 AM Vivek Goyal <[email protected]> wrote:
> >
> > On Mon, May 02, 2022 at 03:55:19PM +0530, Dharmendra Singh wrote:
> > > From: Dharmendra Singh <[email protected]>
> > >
> > > When we go for creating a file (O_CREAT), we trigger
> > > a lookup to FUSE USER SPACE. It is very much likely
> > > that file does not exist yet as O_CREAT is passed to
> > > open(). This lookup can be avoided and can be performed
> > > as part of create call into libfuse.
> > >
> > > This lookup + create in single call to libfuse and finally
> > > to USER SPACE has been named as atomic create. It is expected
> > > that USER SPACE create the file, open it and fills in the
> > > attributes which are then used to make inode stand/revalidate
> > > in the kernel cache. Also if file was newly created(does not
> > > exist yet by this time) in USER SPACE then it should be indicated
> > > in `struct fuse_file_info` by setting a bit which is again used by
> > > libfuse to send some flags back to fuse kernel to indicate that
> > > that file was newly created. These flags are used by kernel to
> > > indicate changes in parent directory.
> >
> > Reading the existing code a little bit more and trying to understand
> > existing semantics. And that will help me unerstand what new is being
> > done.
> >
> > So current fuse_atomic_open() does following.
> >
> > A. Looks up dentry (if d_in_lookup() is set).
> > B. If dentry is positive or O_CREAT is not set, return.
> > C. If server supports atomic create + open, use that to create file and
> > open it as well.
> > D. If server does not support atomic create + open, just create file
> > using "mknod" and return. VFS will take care of opening the file.
> >
> > Now with this patch, new flow is.
> >
> > A. Look up dentry if d_in_lookup() is set as well as either file is not
> > being created or fc->no_atomic_create is set. This basiclally means
> > skip lookup if atomic_create is supported and file is being created.
> >
> > B. Remains same. if dentry is positive or O_CREATE is not set, return.
> >
> > C. If server supports new atomic_create(), use that.
> >
> > D. If not, if server supports atomic create + open, use that
> >
> > E. If not, fall back to mknod and do not open file.
> >
> > So to me this new functionality is basically atomic "lookup + create +
> > open"?
> >
> > Or may be not. I see we check "fc->no_create" and fallback to mknod.
> >
> > if (fc->no_create)
> > goto mknod;
> >
> > So fc->no_create is representing both old atomic "create + open" as well
> > as new "lookup + create + open" ?
> >
> > It might be obvious to you, but it is not to me. So will be great if
> > you shed some light on this.
> >
>
> I think you got it right now. New atomic create does what you
> mentioned as new flow. It does lookup + create + open in single call
> (being called as atomic create) to USER SPACE.mknod is a special case

Ok, naming is little confusing. I think we will have to put it in
commit message and where you define FUSE_ATOMIC_CREATE that what's
the difference between FUSE_CREATE and FUSE_ATOMIC_CREATE. This is
ATOMIC w.r.t what?

May be atomic here means that "lookup + create + open" is a single operation.
But then even FUSE_CREATE is atomic because "creat + open" is a single
operation.

In fact FUSE_CREATE does lookup anyway and returns all the information
in fuse_entry_out.

IIUC, only difference between FUSE_CREATE and FUSE_ATOMIC_CREATE is that
later also carries information in reply whether file was actually created
or not (FOPEN_FILE_CREATED). This will be set if file did not exist
already and it was created indeed. Is that right?

I see FOPEN_FILE_CREATED is being used to avoid calling
fuse_dir_changed(). That sounds like a separate optimization and probably
should be in a separate patch.

IOW, I think this patch should be broken in to multiple pieces. First
piece seems to be avoiding lookup() and given the way it is implemented,
looks like we can avoid lookup() even by using existing FUSE_CREATE
command. We don't necessarily need FUSE_ATOMIC_CREATE. Is that right?

And once that is done, a separate patch should probably should explain
the problem and say fuse_dir_changed() call can be avoided if we knew
if file was actually created or it was already existing there. And that's
when one need to introduce a new command. Given this is just an extension
of existing FUSE_CREATE command and returns additiona info about
FOPEN_FILE_CREATED, we probably should simply call it FUSE_CREATE_EXT
and explain how this operation is different from FUSE_CREATE.

Thanks
Vivek

> where the file system does not have a create call implemented. I think
> its legacy probably goes back to Linux 2.4 if I am not wrong. We did
> not make any changes into that.

>
> Second patch avoids lookup for open calls. 3rd patch avoids lookup in
> de_revalidate() but for non-dir. And only in case when default
> permissions are not enabled.
>


2022-05-09 05:50:26

by Bernd Schubert

[permalink] [raw]
Subject: Re: [PATCH v4 1/3] FUSE: Implement atomic lookup + create



On 5/6/22 16:12, Vivek Goyal wrote:

[...]

> On Fri, May 06, 2022 at 11:04:05AM +0530, Dharmendra Hans wrote:

>
> Ok, looks like your fuse file server is talking to a another file
> server on network and that's why you are mentioning two network trips.
>
> Let us differentiate between two things first.
>
> A. FUSE protocol semantics
> B. Implementation of FUSE protocl by libfuse.
>
> I think I am stressing on A and you are stressing on B. I just want
> to see what's the difference between FUSE_CREATE and FUSE_ATOMIC_CREATE
> from fuse protocol point of view. Again look at from kernel's point of
> view and don't worry about libfuse is going to implement it.
> Implementations can vary.

Agreed, I don't think we need to bring in network for the kernel to
libfuse API.

>
> From kernel's perspective FUSE_CREATE is supposed to create + open a
> file. It is possible file already exists. Look at include/fuse_lowlevel.h
> description for create().
>
> /**
> * Create and open a file
> *
> * If the file does not exist, first create it with the specified
> * mode, and then open it.
> */
>
> I notice that fuse is offering a high level API as well as low level
> API. I primarily know about low level API. To me these are just two
> different implementation but things don't change how kernel sends
> fuse messages and what it expects from server in return.
>
> Now with FUSE_ATOMIC_CREATE, from kernel's perspective, only difference
> is that in reply message file server will also indicate if file was
> actually created or not. Is that right?
>
> And I am focussing on this FUSE API apsect. I am least concerned at
> this point of time who libfuse decides to actually implement FUSE_CREATE
> or FUSE_ATOMIC_CREATE etc. You might make a single call in libfuse
> server (instead of two) and that's performance optimization in libfuse.
> Kernel does not care how many calls did you make in file server to
> implement FUSE_CREATE or FUSE_ATOMIC_CREATE. All it cares is that
> create and open the file.
>
> So while you might do things in more atomic manner in file server and
> cut down on network traffic, kernel fuse API does not care. All it cares
> about is create + open a file.
>
> Anyway, from kernel's perspective, I think you should be able to
> just use FUSE_CREATE and still be do "lookup + create + open".
> FUSE_ATOMIC_CREATE is just allows one additional optimization so
> that you know whether to invalidate parent dir's attrs or not.
>
> In fact kernel is not putting any atomicity requirements as well on
> file server. And that's why I think this new command should probably
> be called FUSE_CREATE_EXT because it just sends back additional
> info.
>
> All the atomicity stuff you have been describing is that you are
> trying to do some optimizations in libfuse implementation to implement
> FUSE_ATOMIC_CREATE so that you send less number of commands over
> network. That's a good idea but fuse kernel API does not require you
> do these atomically, AFAICS.
>
> Given I know little bit of fuse low level API, If I were to implement
> this in virtiofs/passthrough_ll.c, I probably will do following.
>
> A. Check if caller provided O_EXCL flag.
> B. openat(O_CREAT | O_EXCL)
> C. If success, we created the file. Set file_created = 1.
>
> D. If error and error != -EEXIST, send error back to client.
> E. If error and error == -EEXIST, if caller did provide O_EXCL flag,
> return error.
> F. openat() returned -EEXIST and caller did not provide O_EXCL flag,
> that means file already exists. Set file_created = 0.
> G. Do lookup() etc to create internal lo_inode and stat() of file.
> H. Send response back to client using fuse_reply_create().
>
> This is one sample implementation for fuse lowlevel API. There could
> be other ways to implement. But all that is libfuse + filesystem
> specific and kernel does not care how many operations you use to
> complete and what's the atomicity etc. Of course less number of
> operations you do better it is.
>
> Anyway, I think I have said enough on this topic. IMHO, FUSE_CREATE
> descritpion (fuse_lowlevel.h) already mentions that "If the file does not
> exist, first create it with the specified mode and then open it". That
> means intent of protocol is that file could already be there as well.
> So I think we probably should implement this optimization (in kernel)
> using FUSE_CREATE command and then add FUSE_CREATE_EXT to add optimization
> about knowing whether file was actually created or not.
>
> W.r.t libfuse optimizations, I am not sure why can't you do optimizations
> with FUSE_CREATE and why do you need FUSE_CREATE_EXT necessarily. If
> are you worried that some existing filesystems will break, I think
> you can create an internal helper say fuse_create_atomic() and then
> use that if filesystem offers it. IOW, libfuse will have two
> ways to implement FUSE_CREATE. And if filesystem offers a new way which
> cuts down on network traffic, libfuse uses more efficient method. We
> should not have to change kernel FUSE API just because libfuse can
> do create + open operation more efficiently.

Ah right, I like this. As I had written before, the first patch version
was using FUSE_CREATE and I was worried to break something. Yes, it
should be possible split into lookup+create on the libfuse side. That
being said, libfuse will need to know which version it is - there might
be an old kernel sending the non-optimized version - libfuse should not
do another lookup then. Now there is 'fi.flags = arg->flags', but these
are already taken by open/fcntl flags - I would not feel comfortable to
overload these. At best, struct fuse_create_in currently had a padding
field, we could convert these to something like 'ext_fuse_open_flags'
and then use it for fuse internal things. Difficulty here is that I
don't know if all kernel implementations zero the struct (BSD, MacOS),
so I guess we would need to negotiate at startup/init time and would
need another main feature flag? And with that I'm not be sure anymore if
the result would be actually more simple than what we have right now for
the first patch.


Thanks,
Bernd


2022-05-09 05:59:32

by Bernd Schubert

[permalink] [raw]
Subject: Re: [PATCH v4 0/3] FUSE: Implement atomic lookup + open/create



On 5/5/22 14:54, Vivek Goyal wrote:
> On Thu, May 05, 2022 at 11:42:51AM +0530, Dharmendra Hans wrote:
>> Here are the numbers I took last time. These were taken on tmpfs to
>> actually see the effect of reduced calls. On local file systems it
>> might not be that much visible. But we have observed that on systems
>> where we have thousands of clients hammering the metadata servers, it
>> helps a lot (We did not take numbers yet as we are required to change
>> a lot of our client code but would be doing it later on).
>>
>> Note that for a change in performance number due to the new version of
>> these patches, we have just refactored the code and functionality has
>> remained the same since then.
>>
>> here is the link to the performance numbers
>> https://lore.kernel.org/linux-fsdevel/[email protected]/
>
> There is a lot going in that table. Trying to understand it.
>
> - Why care about No-Flush. I mean that's independent of these changes,
> right? I am assuming this means that upon file close do not send
> a flush to fuse server. Not sure how bringing No-Flush into the
> mix is helpful here.


It is a basically removing another call from kernel to user space. The
calls there are, the lower is the resulting percentage for atomic-open.


>
> - What is "Patched Libfuse"? I am assuming that these are changes
> needed in libfuse to support atomic create + atomic open. Similarly
> assuming "Patched FuseK" means patched kernel with your changes.

Yes, I did that to ensure there is no regression with the patches, when
the other side is not patched.

>
> If this is correct, I would probably only be interested in
> looking at "Patched Libfuse + Patched FuseK" numbers to figure out
> what's the effect of your changes w.r.t vanilla kernel + libfuse.
> Am I understanding it right?

Yes.

>
> - I am wondering why do we measure "Sequential" and "Random" patterns.
> These optimizations are primarily for file creation + file opening
> and I/O pattern should not matter.

bonnie++ does this automatically and it just convenient to take the
bonnie++ csv value and to paste them into a table.

In our HPC world mdtest is more common, but it has MPI as requirement -
make it harder to run. Reproducing the values with bonnie++ should be
rather easy for you.

Only issue with bonnie++ is that bonnie++ by default does not run
multi-threaded and the old 3rd party perl scripts I had to let it run
with multiple processes and to sum up the values don't work anymore with
recent perl versions. I need to find some time to fix that.


>
> - Also wondering why performance of Read/s improves. Assuming once
> file has been opened, I think your optimizations get out of the
> way (no create, no open) and we are just going through data path of
> reading file data and no lookups happening. If that's the case, why
> do Read/s numbers show an improvement.

That is now bonnie++ works. It creates the files, closes them (which
causes the flush) and then re-opens for stat and read - atomic open
comes into the picture here. Also read() is totally skipped when the
files are empty - which is why one should use something like 1B files.

If you have another metadata benchmark - please let us know.

>
> - Why do we measure "Patched Libfuse". It shows performance regression
> of 4-5% in table 0B, Sequential workoad. That sounds bad. So without
> any optimization kicking in, it has a performance cost.

Yes, I'm not sure yet. There is not so much code that has changed on the
libfuse side.
However the table needs to be redone with fixed libfuse - limiting the
number of threads caused a permanent libfuse thread creation and destruction

https://github.com/libfuse/libfuse/pull/652

The numbers in table are also with paththrough_ll, which has its own
issue due to linear inode search. paththrough_hp uses a C++ map and
avoids that. I noticed too late when I started to investigate why there
are regressions....

Also the table made me to investigate/profile all the fuse operations,
which resulted in my waitq question. Please see that thread for more
details
https://lore.kernel.org/lkml/[email protected]/T/

Regarding atomic-open/create with avoiding lookup/revalidate, our
primary goal is to reduce network calls. A file system that handles it
locally only reduces the number of fuse kernel/user space crossing. A
network file system that fully supports it needs to do the atomic open
(or in old terms lookup-intent-open) on the server side of the network
and needs to transfer attributes together with the open result.

Lustre does this, although I cannot easily point you to the right code.
It all started almost two decades ago:
https://groups.google.com/g/lucky.linux.fsdevel/c/iYNFIIrkJ1s


BeeGFS does this as well
https://git.beegfs.io/pub/v7/-/blob/master/client_module/source/filesystem/FhgfsOpsInode.c
See for examaple FhgfsOps_atomicOpen() and FhgfsOps_createIntent()

(FhGFS is the old name when I was still involved in the project.)

From my head I'm not sure if NFS does it over the wire, maybe v4.


Thanks,
Bernd







2022-05-09 06:50:44

by Jean-Pierre André

[permalink] [raw]
Subject: Re: [PATCH v4 1/3] FUSE: Implement atomic lookup + create

Bernd Schubert wrote on 5/6/22 8:45 PM:
>
>
> On 5/6/22 19:07, Vivek Goyal wrote:
>> On Fri, May 06, 2022 at 06:41:17PM +0200, Bernd Schubert wrote:
>>>
>>>
>>> On 5/6/22 16:12, Vivek Goyal wrote:
>>>
>>> [...]
>>>
>>>> On Fri, May 06, 2022 at 11:04:05AM +0530, Dharmendra Hans wrote:
>>>
>>>>
>>>> Ok, looks like your fuse file server is talking to a another file
>>>> server on network and that's why you are mentioning two network trips.
>>>>
>>>> Let us differentiate between two things first.
>>>>
>>>> A. FUSE protocol semantics
>>>> B. Implementation of FUSE protocl by libfuse.
>>>>
>>>> I think I am stressing on A and you are stressing on B. I just want
>>>> to see what's the difference between FUSE_CREATE and FUSE_ATOMIC_CREATE
>>>> from fuse protocol point of view. Again look at from kernel's point of
>>>> view and don't worry about libfuse is going to implement it.
>>>> Implementations can vary.
>>>
>>> Agreed, I don't think we need to bring in network for the kernel to
>>> libfuse
>>> API.
>>>
>>>>
>>>>   From kernel's perspective FUSE_CREATE is supposed to create + open a
>>>> file. It is possible file already exists. Look at
>>>> include/fuse_lowlevel.h
>>>> description for create().
>>>>
>>>>           /**
>>>>            * Create and open a file
>>>>            *
>>>>            * If the file does not exist, first create it with the
>>>> specified
>>>>            * mode, and then open it.
>>>>            */
>>>>
>>>> I notice that fuse is offering a high level API as well as low level
>>>> API. I primarily know about low level API. To me these are just two
>>>> different implementation but things don't change how kernel sends
>>>> fuse messages and what it expects from server in return.
>>>>
>>>> Now with FUSE_ATOMIC_CREATE, from kernel's perspective, only difference
>>>> is that in reply message file server will also indicate if file was
>>>> actually created or not. Is that right?
>>>>
>>>> And I am focussing on this FUSE API apsect. I am least concerned at
>>>> this point of time who libfuse decides to actually implement
>>>> FUSE_CREATE
>>>> or FUSE_ATOMIC_CREATE etc. You might make a single call in libfuse
>>>> server (instead of two) and that's performance optimization in libfuse.
>>>> Kernel does not care how many calls did you make in file server to
>>>> implement FUSE_CREATE or FUSE_ATOMIC_CREATE. All it cares is that
>>>> create and open the file.
>>>>
>>>> So while you might do things in more atomic manner in file server and
>>>> cut down on network traffic, kernel fuse API does not care. All it
>>>> cares
>>>> about is create + open a file.
>>>>
>>>> Anyway, from kernel's perspective, I think you should be able to
>>>> just use FUSE_CREATE and still be do "lookup + create + open".
>>>> FUSE_ATOMIC_CREATE is just allows one additional optimization so
>>>> that you know whether to invalidate parent dir's attrs or not.
>>>>
>>>> In fact kernel is not putting any atomicity requirements as well on
>>>> file server. And that's why I think this new command should probably
>>>> be called FUSE_CREATE_EXT because it just sends back additional
>>>> info.
>>>>
>>>> All the atomicity stuff you have been describing is that you are
>>>> trying to do some optimizations in libfuse implementation to implement
>>>> FUSE_ATOMIC_CREATE so that you send less number of commands over
>>>> network. That's a good idea but fuse kernel API does not require you
>>>> do these atomically, AFAICS.
>>>>
>>>> Given I know little bit of fuse low level API, If I were to implement
>>>> this in virtiofs/passthrough_ll.c, I probably will do following.
>>>>
>>>> A. Check if caller provided O_EXCL flag.
>>>> B. openat(O_CREAT | O_EXCL)
>>>> C. If success, we created the file. Set file_created = 1.
>>>>
>>>> D. If error and error != -EEXIST, send error back to client.
>>>> E. If error and error == -EEXIST, if caller did provide O_EXCL flag,
>>>>      return error.
>>>> F. openat() returned -EEXIST and caller did not provide O_EXCL flag,
>>>>      that means file already exists.  Set file_created = 0.
>>>> G. Do lookup() etc to create internal lo_inode and stat() of file.
>>>> H. Send response back to client using fuse_reply_create().
>>>> This is one sample implementation for fuse lowlevel API. There could
>>>> be other ways to implement. But all that is libfuse + filesystem
>>>> specific and kernel does not care how many operations you use to
>>>> complete and what's the atomicity etc. Of course less number of
>>>> operations you do better it is.
>>>>
>>>> Anyway, I think I have said enough on this topic. IMHO, FUSE_CREATE
>>>> descritpion (fuse_lowlevel.h) already mentions that "If the file
>>>> does not
>>>> exist, first create it with the specified mode and then open it". That
>>>> means intent of protocol is that file could already be there as well.
>>>> So I think we probably should implement this optimization (in kernel)
>>>> using FUSE_CREATE command and then add FUSE_CREATE_EXT to add
>>>> optimization
>>>> about knowing whether file was actually created or not.
>>>>
>>>> W.r.t libfuse optimizations, I am not sure why can't you do
>>>> optimizations
>>>> with FUSE_CREATE and why do you need FUSE_CREATE_EXT necessarily. If
>>>> are you worried that some existing filesystems will break, I think
>>>> you can create an internal helper say fuse_create_atomic() and then
>>>> use that if filesystem offers it. IOW, libfuse will have two
>>>> ways to implement FUSE_CREATE. And if filesystem offers a new way which
>>>> cuts down on network traffic, libfuse uses more efficient method. We
>>>> should not have to change kernel FUSE API just because libfuse can
>>>> do create + open operation more efficiently.
>>>
>>> Ah right, I like this. As I had written before, the first patch
>>> version was
>>> using FUSE_CREATE and I was worried to break something. Yes, it
>>> should be
>>> possible split into lookup+create on the libfuse side. That being said,
>>> libfuse will need to know which version it is - there might be an old
>>> kernel
>>> sending the non-optimized version - libfuse should not do another lookup
>>> then.
>>
>> I am confused about one thing. For FUSE_CREATE command, how does it
>> matter whether kernel has done lookup() before sending FUSE_CREATE. All
>> FUSE_CREATE seems to say that crate a file (if it does not exist already)
>> and then open it and return file handle as well as inode attributes. It
>> does not say anything about whether a LOOKUP has already been done
>> by kernel or not.
>>
>> It looks like you are assuming that if FUSE_CREATE is coming, that
>> means client has already done FUSE_LOOKUP. So there is something we
>> are not on same page about.
>>
>> I looked at fuse_lowlevel API and passthrough_ll.c and there is no
>> assumption whether FUSE_LOOKUP has already been called by client
>> before calling FUSE_CREATE. Similarly, I looked at virtiofs code
>> and I can't see any such assumption there as well.
>
> The current linux kernel code does this right now, skipping the lookup
> just changes behavior.  Personally I would see it as bug if the
> userspace implementation does not handle EEXIST for FUSE_CREATE.
> Implementation developer and especially users might see it differently
> if a kernel update breaks/changes things out of the sudden.
> passthrough_ll.c is not the issue here, it handles it correctly, but
> what about the XYZ other file systems out there - do you want to check
> them one by one, including closed source ones? And wouldn't even a
> single broken application count as regression?
>
>>
>> https://github.com/qemu/qemu/blob/master/tools/virtiofsd/passthrough_ll.c
>>
>> So I am sort of lost. May be you can help me understsand this.
>
> I guess it would be more interesting to look at different file systems
> that are not overlay based. Like ntfs-3g - I have not looked at the code
> yet, but especially disk based file system didn't have a reason so far
> to handle EEXIST. And

AFAIK ntfs-3g proper does not keep a context across calls and does
not know what LOOKUP was preparing a CREATE. However this may have
consequences in the high level of libfuse for managing the file tree.

The kernel apparently issues a LOOKUP to decide whether issuing a
CREATE (create+open) or an OPEN. If it sent blindly a CREATE,
ntfs-3g would return EEXIST if the name was already present in
the directory.

For a test, can you suggest a way to force ignore of such lookup
within libfuse, without applying your kernel patches ? Is there a
way to detect the purpose of a lookup ? (A possible way is to
hardcode a directory inode within which the lookups return ENOENT).

>
>>
>>> Now there is 'fi.flags = arg->flags', but these are already taken by
>>> open/fcntl flags - I would not feel comfortable to overload these. At
>>> best,
>>> struct fuse_create_in currently had a padding field, we could convert
>>> these
>>> to something like 'ext_fuse_open_flags' and then use it for fuse
>>> internal
>>> things. Difficulty here is that I don't know if all kernel
>>> implementations
>>> zero the struct (BSD, MacOS), so I guess we would need to negotiate at
>>> startup/init time and would need another main feature flag? And with
>>> that
>>> I'm not be sure anymore if the result would be actually more simple than
>>> what we have right now for the first patch.
>>
>> If FUSE_CREATE indeed has a dependency on FUSE_LOOKUP have been called
>> before that, then I agree that we can't implement new semantics with
>> FUSE_CREATE and we will have to introduce a new op say
>> FUSE_ATOMIC_CREATE/FUSE_LOOKUP_CREATE/FUSE_CREATE_EXT.
>>
>> But looking at fuse API, I don't see FUSE_CREATE ever guaranteeing that
>> a FUSE_LOOKUP has been done before this.
>
> It is not in written document, but in the existing (linux) kernel behavior.
>
>
> You can look up "regressions due to 64-bit ext4 directory cookies" -
> patches from me had been once already the reason for accidental breakage
> and back that time I did not even have the slightest chance to predict
> it, as glusterfs was relying on max 32bit telldir results and using the
> other 32bit for its own purposes. Kernel behavior changed and
> application broke, even though the application was relying on non-posix
> behavior. In case of fuse there is no document like posix, but just API
> headers and current behavior...
>
> Thanks,
> Bernd
>
>
>
>
>



2022-05-09 07:01:19

by Dharmendra Singh

[permalink] [raw]
Subject: Re: [PATCH v4 2/3] FUSE: Implement atomic lookup + open

On Wed, May 4, 2022 at 11:50 PM Vivek Goyal <[email protected]> wrote:
>
> On Mon, May 02, 2022 at 03:55:20PM +0530, Dharmendra Singh wrote:
> > From: Dharmendra Singh <[email protected]>
> >
> > We can optimize aggressive lookups which are triggered when
> > there is normal open for file/dir (dentry is new/negative).
> >
> > Here in this case since we are anyway going to open the file/dir
> > with USER SPACE, avoid this separate lookup call into libfuse
> > and combine it with open call into libfuse.
> >
> > This lookup + open in single call to libfuse has been named
> > as atomic open. It is expected that USER SPACE opens the file
> > and fills in the attributes, which are then used to make inode
> > stand/revalidate in the kernel cache.
> >
> > Signed-off-by: Dharmendra Singh <[email protected]>
> > ---
> > fs/fuse/dir.c | 78 ++++++++++++++++++++++++++++++---------
> > fs/fuse/fuse_i.h | 3 ++
> > fs/fuse/inode.c | 4 +-
> > include/uapi/linux/fuse.h | 2 +
> > 4 files changed, 69 insertions(+), 18 deletions(-)
> >
> > diff --git a/fs/fuse/dir.c b/fs/fuse/dir.c
> > index cad3322a007f..6879d3a86796 100644
> > --- a/fs/fuse/dir.c
> > +++ b/fs/fuse/dir.c
> > @@ -516,18 +516,18 @@ static int get_security_context(struct dentry *entry, umode_t mode,
> > }
> >
> > /*
> > - * Atomic create+open operation
> > - *
> > - * If the filesystem doesn't support this, then fall back to separate
> > - * 'mknod' + 'open' requests.
> > + * This is common function for initiating atomic operations into libfuse.
> > + * Currently being used by Atomic create(atomic lookup + create) and
> > + * Atomic open(atomic lookup + open).
> > */
> > -static int fuse_create_open(struct inode *dir, struct dentry *entry,
> > - struct file *file, unsigned int flags,
> > - umode_t mode, uint32_t opcode)
> > +static int fuse_atomic_do_common(struct inode *dir, struct dentry *entry,
> > + struct dentry **alias, struct file *file,
> > + unsigned int flags, umode_t mode, uint32_t opcode)
>
> If this common function is really useful for atomic create + atomic open,
> I will first add a patch which adds a common function and makes FUSE_CREATE_EXT use that function. And in later patch add functionality to do atomic open.
> Just makes code more readable.

This func is commonly used by atomic create and atomic open, both.
atomic open has fuse_do_atomic_open wrapper over it. I just renamed
this func except doing it in a separate patch (just to avoid another
patch though I understand that it might be a little hard to read the
code).


>
> > {
> > int err;
> > struct inode *inode;
> > struct fuse_mount *fm = get_fuse_mount(dir);
> > + struct fuse_conn *fc = get_fuse_conn(dir);
> > FUSE_ARGS(args);
> > struct fuse_forget_link *forget;
> > struct fuse_create_in inarg;
> > @@ -539,9 +539,13 @@ static int fuse_create_open(struct inode *dir, struct dentry *entry,
> > void *security_ctx = NULL;
> > u32 security_ctxlen;
> > bool atomic_create = (opcode == FUSE_ATOMIC_CREATE ? true : false);
> > + bool create_op = (opcode == FUSE_CREATE ||
> > + opcode == FUSE_ATOMIC_CREATE) ? true : false;
> > + if (alias)
> > + *alias = NULL;
> >
> > /* Userspace expects S_IFREG in create mode */
> > - BUG_ON((mode & S_IFMT) != S_IFREG);
> > + BUG_ON(create_op && (mode & S_IFMT) != S_IFREG);
> >
> > forget = fuse_alloc_forget();
> > err = -ENOMEM;
> > @@ -553,7 +557,7 @@ static int fuse_create_open(struct inode *dir, struct dentry *entry,
> > if (!ff)
> > goto out_put_forget_req;
> >
> > - if (!fm->fc->dont_mask)
> > + if (!fc->dont_mask)
> > mode &= ~current_umask();
> >
> > flags &= ~O_NOCTTY;
> > @@ -642,8 +646,9 @@ static int fuse_create_open(struct inode *dir, struct dentry *entry,
> > err = PTR_ERR(res);
> > goto out_err;
> > }
> > - /* res is expected to be NULL since its REG file */
> > - WARN_ON(res);
> > + entry = res;
> > + if (alias)
> > + *alias = res;
> > }
> > }
> > fuse_change_entry_timeout(entry, &outentry);
> > @@ -681,8 +686,8 @@ static int fuse_atomic_create(struct inode *dir, struct dentry *entry,
> > if (fc->no_atomic_create)
> > return -ENOSYS;
> >
> > - err = fuse_create_open(dir, entry, file, flags, mode,
> > - FUSE_ATOMIC_CREATE);
> > + err = fuse_atomic_do_common(dir, entry, NULL, file, flags, mode,
> > + FUSE_ATOMIC_CREATE);
> > /* If atomic create is not implemented then indicate in fc so that next
> > * request falls back to normal create instead of going into libufse and
> > * returning with -ENOSYS.
> > @@ -694,6 +699,29 @@ static int fuse_atomic_create(struct inode *dir, struct dentry *entry,
> > return err;
> > }
> >
> > +static int fuse_do_atomic_open(struct inode *dir, struct dentry *entry,
> > + struct dentry **alias, struct file *file,
> > + unsigned int flags, umode_t mode)
> > +{
> > + int err;
> > + struct fuse_conn *fc = get_fuse_conn(dir);
> > +
> > + if (!fc->do_atomic_open)
> > + return -ENOSYS;
> > +
> > + err = fuse_atomic_do_common(dir, entry, alias, file, flags, mode,
> > + FUSE_ATOMIC_OPEN);
> > + /* Atomic open imply atomic trunc as well i.e truncate should be performed
> > + * as part of atomic open call itself.
> > + */
> > + if (!fc->atomic_o_trunc) {
> > + if (fc->do_atomic_open)
> > + fc->atomic_o_trunc = 1;
> > + }
> > +
> > + return err;
> > +}
> > +
> > static int fuse_mknod(struct user_namespace *, struct inode *, struct dentry *,
> > umode_t, dev_t);
> > static int fuse_atomic_open(struct inode *dir, struct dentry *entry,
> > @@ -702,12 +730,23 @@ static int fuse_atomic_open(struct inode *dir, struct dentry *entry,
> > {
> > int err;
> > struct fuse_conn *fc = get_fuse_conn(dir);
> > - struct dentry *res = NULL;
> > + struct dentry *res = NULL, *alias = NULL;
> > bool create = flags & O_CREAT ? true : false;
> >
> > if (fuse_is_bad(dir))
> > return -EIO;
> >
> > + if (!create) {
> > + err = fuse_do_atomic_open(dir, entry, &alias,
> > + file, flags, mode);
>
> So this is the core change of behavior. As of now, if we are not doing
> create operation, we just lookup and return. But now you are doing
> open + lookup in a single operation. Interesting. I am not sure if
> it breaks anything in VFS.

Yes, here we are handling optimizing lookups for open calls. If atomic
open is implemented underlying we use it otherwise falls back to
normal open.

>
> > + res = alias;
> > + if (!err || err != -ENOSYS)
> > + goto out_dput;
> > + }
> > + /*
> > + * ENOSYS fall back for open- user space does not have full
> > + * atomic open.
> > + */
>
> This ENOSYS stuff all the place is making these patches look very unclean
> to me. A part of me says that should we negotiate these as feature bits.
> Having said that feature bits are precious and should not be used for
> trivial purposes. Hmm..., may be we can make handle ENOSYS little better
> in general.

Actually atomic open is based upon init bits considering the 3rd patch
which optimizes lookups in d_revalidate(). I would see if I can clean
ENOSYS a bit.
>
> > if ((!create || fc->no_atomic_create) && d_in_lookup(entry)) {
> > res = fuse_lookup(dir, entry, 0);
> > if (IS_ERR(res))
> > @@ -730,9 +769,14 @@ static int fuse_atomic_open(struct inode *dir, struct dentry *entry,
> > /* Libfuse/user space has not implemented atomic create, therefore
> > * fall back to normal create.
> > */
> > - if (err == -ENOSYS)
> > - err = fuse_create_open(dir, entry, file, flags, mode,
> > - FUSE_CREATE);
> > + /* Atomic create+open operation
> > + * If the filesystem doesn't support this, then fall back to separate
> > + * 'mknod' + 'open' requests.
> > + */
> > + if (err == -ENOSYS) {
> > + err = fuse_atomic_do_common(dir, entry, NULL, file, flags,
> > + mode, FUSE_CREATE);
> > + }
> > if (err == -ENOSYS) {
> > fc->no_create = 1;
> > goto mknod;
> > diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
> > index d577a591ab16..24793b82303f 100644
> > --- a/fs/fuse/fuse_i.h
> > +++ b/fs/fuse/fuse_i.h
> > @@ -669,6 +669,9 @@ struct fuse_conn {
> > /** Is open/release not implemented by fs? */
> > unsigned no_open:1;
> >
> > + /** Is atomic open implemented by fs ? */
> > + unsigned do_atomic_open : 1;
> > +
> > /** Is atomic create not implemented by fs? */
> > unsigned no_atomic_create:1;
> >
> > diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
> > index ee846ce371d8..5f667de69115 100644
> > --- a/fs/fuse/inode.c
> > +++ b/fs/fuse/inode.c
> > @@ -1190,6 +1190,8 @@ static void process_init_reply(struct fuse_mount *fm, struct fuse_args *args,
> > fc->setxattr_ext = 1;
> > if (flags & FUSE_SECURITY_CTX)
> > fc->init_security = 1;
> > + if (flags & FUSE_DO_ATOMIC_OPEN)
> > + fc->do_atomic_open = 1;
>
> Hey you are negotiating FUSE_DO_ATOMIC_OPEN flag. If that's the case why
> do you have to deal with -ENOSYS in fuse_do_atomic_open(). You should
> be able to just check.
>
> if (!create && fc->do_atomic_open) {
> fuse_do_atomic_open().
> }

You are right. This happens due to the reason that we have auto atomic
create check in atomic create but yes, we do not need ENOSYS here. I
will clean it.


> I have yet to check what happens if file does not exist.
>
> > } else {
> > ra_pages = fc->max_read / PAGE_SIZE;
> > fc->no_lock = 1;
> > @@ -1235,7 +1237,7 @@ void fuse_send_init(struct fuse_mount *fm)
> > FUSE_ABORT_ERROR | FUSE_MAX_PAGES | FUSE_CACHE_SYMLINKS |
> > FUSE_NO_OPENDIR_SUPPORT | FUSE_EXPLICIT_INVAL_DATA |
> > FUSE_HANDLE_KILLPRIV_V2 | FUSE_SETXATTR_EXT | FUSE_INIT_EXT |
> > - FUSE_SECURITY_CTX;
> > + FUSE_SECURITY_CTX | FUSE_DO_ATOMIC_OPEN;
> > #ifdef CONFIG_FUSE_DAX
> > if (fm->fc->dax)
> > flags |= FUSE_MAP_ALIGNMENT;
> > diff --git a/include/uapi/linux/fuse.h b/include/uapi/linux/fuse.h
> > index e4b56004b148..ab91e391241a 100644
> > --- a/include/uapi/linux/fuse.h
> > +++ b/include/uapi/linux/fuse.h
> > @@ -391,6 +391,7 @@ struct fuse_file_lock {
> > /* bits 32..63 get shifted down 32 bits into the flags2 field */
> > #define FUSE_SECURITY_CTX (1ULL << 32)
> > #define FUSE_HAS_INODE_DAX (1ULL << 33)
> > +#define FUSE_DO_ATOMIC_OPEN (1ULL << 34)
> >
> > /**
> > * CUSE INIT request/reply flags
> > @@ -540,6 +541,7 @@ enum fuse_opcode {
> > FUSE_REMOVEMAPPING = 49,
> > FUSE_SYNCFS = 50,
> > FUSE_ATOMIC_CREATE = 51,
> > + FUSE_ATOMIC_OPEN = 52,
> >
> > /* CUSE specific operations */
> > CUSE_INIT = 4096,
> > --
> > 2.17.1
> >
>

2022-05-09 09:16:53

by Vivek Goyal

[permalink] [raw]
Subject: Re: [PATCH v4 1/3] FUSE: Implement atomic lookup + create

On Fri, May 06, 2022 at 06:41:17PM +0200, Bernd Schubert wrote:
>
>
> On 5/6/22 16:12, Vivek Goyal wrote:
>
> [...]
>
> > On Fri, May 06, 2022 at 11:04:05AM +0530, Dharmendra Hans wrote:
>
> >
> > Ok, looks like your fuse file server is talking to a another file
> > server on network and that's why you are mentioning two network trips.
> >
> > Let us differentiate between two things first.
> >
> > A. FUSE protocol semantics
> > B. Implementation of FUSE protocl by libfuse.
> >
> > I think I am stressing on A and you are stressing on B. I just want
> > to see what's the difference between FUSE_CREATE and FUSE_ATOMIC_CREATE
> > from fuse protocol point of view. Again look at from kernel's point of
> > view and don't worry about libfuse is going to implement it.
> > Implementations can vary.
>
> Agreed, I don't think we need to bring in network for the kernel to libfuse
> API.
>
> >
> > From kernel's perspective FUSE_CREATE is supposed to create + open a
> > file. It is possible file already exists. Look at include/fuse_lowlevel.h
> > description for create().
> >
> > /**
> > * Create and open a file
> > *
> > * If the file does not exist, first create it with the specified
> > * mode, and then open it.
> > */
> >
> > I notice that fuse is offering a high level API as well as low level
> > API. I primarily know about low level API. To me these are just two
> > different implementation but things don't change how kernel sends
> > fuse messages and what it expects from server in return.
> >
> > Now with FUSE_ATOMIC_CREATE, from kernel's perspective, only difference
> > is that in reply message file server will also indicate if file was
> > actually created or not. Is that right?
> >
> > And I am focussing on this FUSE API apsect. I am least concerned at
> > this point of time who libfuse decides to actually implement FUSE_CREATE
> > or FUSE_ATOMIC_CREATE etc. You might make a single call in libfuse
> > server (instead of two) and that's performance optimization in libfuse.
> > Kernel does not care how many calls did you make in file server to
> > implement FUSE_CREATE or FUSE_ATOMIC_CREATE. All it cares is that
> > create and open the file.
> >
> > So while you might do things in more atomic manner in file server and
> > cut down on network traffic, kernel fuse API does not care. All it cares
> > about is create + open a file.
> >
> > Anyway, from kernel's perspective, I think you should be able to
> > just use FUSE_CREATE and still be do "lookup + create + open".
> > FUSE_ATOMIC_CREATE is just allows one additional optimization so
> > that you know whether to invalidate parent dir's attrs or not.
> >
> > In fact kernel is not putting any atomicity requirements as well on
> > file server. And that's why I think this new command should probably
> > be called FUSE_CREATE_EXT because it just sends back additional
> > info.
> >
> > All the atomicity stuff you have been describing is that you are
> > trying to do some optimizations in libfuse implementation to implement
> > FUSE_ATOMIC_CREATE so that you send less number of commands over
> > network. That's a good idea but fuse kernel API does not require you
> > do these atomically, AFAICS.
> >
> > Given I know little bit of fuse low level API, If I were to implement
> > this in virtiofs/passthrough_ll.c, I probably will do following.
> >
> > A. Check if caller provided O_EXCL flag.
> > B. openat(O_CREAT | O_EXCL)
> > C. If success, we created the file. Set file_created = 1.
> >
> > D. If error and error != -EEXIST, send error back to client.
> > E. If error and error == -EEXIST, if caller did provide O_EXCL flag,
> > return error.
> > F. openat() returned -EEXIST and caller did not provide O_EXCL flag,
> > that means file already exists. Set file_created = 0.
> > G. Do lookup() etc to create internal lo_inode and stat() of file.
> > H. Send response back to client using fuse_reply_create().
> > This is one sample implementation for fuse lowlevel API. There could
> > be other ways to implement. But all that is libfuse + filesystem
> > specific and kernel does not care how many operations you use to
> > complete and what's the atomicity etc. Of course less number of
> > operations you do better it is.
> >
> > Anyway, I think I have said enough on this topic. IMHO, FUSE_CREATE
> > descritpion (fuse_lowlevel.h) already mentions that "If the file does not
> > exist, first create it with the specified mode and then open it". That
> > means intent of protocol is that file could already be there as well.
> > So I think we probably should implement this optimization (in kernel)
> > using FUSE_CREATE command and then add FUSE_CREATE_EXT to add optimization
> > about knowing whether file was actually created or not.
> >
> > W.r.t libfuse optimizations, I am not sure why can't you do optimizations
> > with FUSE_CREATE and why do you need FUSE_CREATE_EXT necessarily. If
> > are you worried that some existing filesystems will break, I think
> > you can create an internal helper say fuse_create_atomic() and then
> > use that if filesystem offers it. IOW, libfuse will have two
> > ways to implement FUSE_CREATE. And if filesystem offers a new way which
> > cuts down on network traffic, libfuse uses more efficient method. We
> > should not have to change kernel FUSE API just because libfuse can
> > do create + open operation more efficiently.
>
> Ah right, I like this. As I had written before, the first patch version was
> using FUSE_CREATE and I was worried to break something. Yes, it should be
> possible split into lookup+create on the libfuse side. That being said,
> libfuse will need to know which version it is - there might be an old kernel
> sending the non-optimized version - libfuse should not do another lookup
> then.

I am confused about one thing. For FUSE_CREATE command, how does it
matter whether kernel has done lookup() before sending FUSE_CREATE. All
FUSE_CREATE seems to say that crate a file (if it does not exist already)
and then open it and return file handle as well as inode attributes. It
does not say anything about whether a LOOKUP has already been done
by kernel or not.

It looks like you are assuming that if FUSE_CREATE is coming, that
means client has already done FUSE_LOOKUP. So there is something we
are not on same page about.

I looked at fuse_lowlevel API and passthrough_ll.c and there is no
assumption whether FUSE_LOOKUP has already been called by client
before calling FUSE_CREATE. Similarly, I looked at virtiofs code
and I can't see any such assumption there as well.

https://github.com/qemu/qemu/blob/master/tools/virtiofsd/passthrough_ll.c

So I am sort of lost. May be you can help me understsand this.

> Now there is 'fi.flags = arg->flags', but these are already taken by
> open/fcntl flags - I would not feel comfortable to overload these. At best,
> struct fuse_create_in currently had a padding field, we could convert these
> to something like 'ext_fuse_open_flags' and then use it for fuse internal
> things. Difficulty here is that I don't know if all kernel implementations
> zero the struct (BSD, MacOS), so I guess we would need to negotiate at
> startup/init time and would need another main feature flag? And with that
> I'm not be sure anymore if the result would be actually more simple than
> what we have right now for the first patch.

If FUSE_CREATE indeed has a dependency on FUSE_LOOKUP have been called
before that, then I agree that we can't implement new semantics with
FUSE_CREATE and we will have to introduce a new op say
FUSE_ATOMIC_CREATE/FUSE_LOOKUP_CREATE/FUSE_CREATE_EXT.

But looking at fuse API, I don't see FUSE_CREATE ever guaranteeing that
a FUSE_LOOKUP has been done before this.

Thanks
Vivek


2022-05-09 09:47:50

by Dharmendra Singh

[permalink] [raw]
Subject: Re: [PATCH v4 0/3] FUSE: Implement atomic lookup + open/create

On Thu, May 5, 2022 at 12:48 AM Vivek Goyal <[email protected]> wrote:
>
> On Mon, May 02, 2022 at 03:55:18PM +0530, Dharmendra Singh wrote:
> > In FUSE, as of now, uncached lookups are expensive over the wire.
> > E.g additional latencies and stressing (meta data) servers from
> > thousands of clients. These lookup calls possibly can be avoided
> > in some cases. Incoming three patches address this issue.
>
> BTW, these patches are designed to improve performance by cutting down
> on number of fuse commands sent. Are there any performance numbers
> which demonstrate what kind of improvement you are seeing.
>
> Say, If I do kernel build, is the performance improvement observable?

Here are the numbers I took last time. These were taken on tmpfs to
actually see the effect of reduced calls. On local file systems it
might not be that much visible. But we have observed that on systems
where we have thousands of clients hammering the metadata servers, it
helps a lot (We did not take numbers yet as we are required to change
a lot of our client code but would be doing it later on).

Note that for a change in performance number due to the new version of
these patches, we have just refactored the code and functionality has
remained the same since then.

here is the link to the performance numbers
https://lore.kernel.org/linux-fsdevel/[email protected]/

2022-05-11 15:39:03

by Bernd Schubert

[permalink] [raw]
Subject: Re: [PATCH v4 0/3] FUSE: Implement atomic lookup + open/create



On 5/11/22 11:40, Miklos Szeredi wrote:
> On Thu, 5 May 2022 at 21:59, Vivek Goyal <[email protected]> wrote:
>
>> Oh, I have no issues with the intent. I will like to see cut in network
>> traffic too (if we can do this without introducing problems). My primary
>> interest is that this kind of change should benefit virtiofs as well.
>
> One issue with that appears to be checking permissions. AFAIU this
> patchset only enables the optimization if default_permissions is
> turned off (i.e. all permission checking is done by the server). But
> virtiofs uses the default_permissions model.
>
> I'm not quite sure about this limitation, guessing that it's related
> to the fact that the permissions may be stale at the time of checking?

Yes exactly, I had actually pointed this out in one of the mails.

<quote myself from 2022-04-05 >
Adding in a vfs ->open_revalidate might have the advantage that we could
also support 'default_permissions' - ->open_revalidate needs to
additionally check the retrieved file permissions and and needs to call
into generic_permissions for that. Right now that is not easily
feasible, without adding some code dup to convert flags in MAY_* flags -
a vfs change would be needed here to pass the right flags.
</quote>


Avoiding lookup for create should work without default_permissions, though.


With the current patches this comment should be added in
fuse_dentry_revalidate() to clarify things (somehow it got lost in the
(internal) review process).

/* Default permissions actually also would not need revalidate, if
atomic_open() could verify attributes after opening the file. But the
MAY_ flags are missing and the vfs build_open_flags() to recreate these
flags not exported. Thus, default_permissions() cannot be called from
here - vfs updates would be required */


Thanks,
Bernd

2022-05-11 19:09:05

by Bernd Schubert

[permalink] [raw]
Subject: Re: [PATCH v4 1/3] FUSE: Implement atomic lookup + create



On 5/7/22 12:42, Jean-Pierre André wrote:
> Bernd Schubert wrote on 5/6/22 8:45 PM:
>>
>>
>> On 5/6/22 19:07, Vivek Goyal wrote:
>>> I looked at fuse_lowlevel API and passthrough_ll.c and there is no
>>> assumption whether FUSE_LOOKUP has already been called by client
>>> before calling FUSE_CREATE. Similarly, I looked at virtiofs code
>>> and I can't see any such assumption there as well.
>>
>> The current linux kernel code does this right now, skipping the lookup
>> just changes behavior.  Personally I would see it as bug if the
>> userspace implementation does not handle EEXIST for FUSE_CREATE.
>> Implementation developer and especially users might see it differently
>> if a kernel update breaks/changes things out of the sudden.
>> passthrough_ll.c is not the issue here, it handles it correctly, but
>> what about the XYZ other file systems out there - do you want to check
>> them one by one, including closed source ones? And wouldn't even a
>> single broken application count as regression?
>>
>>>
>>> https://github.com/qemu/qemu/blob/master/tools/virtiofsd/passthrough_ll.c
>>>
>>>
>>> So I am sort of lost. May be you can help me understsand this.
>>
>> I guess it would be more interesting to look at different file systems
>> that are not overlay based. Like ntfs-3g - I have not looked at the
>> code yet, but especially disk based file system didn't have a reason
>> so far to handle EEXIST. And
>
> AFAIK ntfs-3g proper does not keep a context across calls and does
> not know what LOOKUP was preparing a CREATE. However this may have
> consequences in the high level of libfuse for managing the file tree.

I don't think high level is effected. I'm really just scared to break

>
> The kernel apparently issues a LOOKUP to decide whether issuing a
> CREATE (create+open) or an OPEN. If it sent blindly a CREATE,
> ntfs-3g would return EEXIST if the name was already present in
> the directory.

Ok, so ntfs-3g is ok - which leaves N-1 file systems to check...

>
> For a test, can you suggest a way to force ignore of such lookup
> within libfuse, without applying your kernel patches ? Is there a
> way to detect the purpose of a lookup ? (A possible way is to
> hardcode a directory inode within which the lookups return ENOENT).


What I mean is that we need to check the code or test N file systems -
if ntfs-3g handles it, N-1 are left...

I we all agree on that there is no issue in skipping the lookup - fine
with me - it slightly simplifies the patches.


Thanks,
Bernd

2022-05-11 22:53:15

by Miklos Szeredi

[permalink] [raw]
Subject: Re: [PATCH v4 0/3] FUSE: Implement atomic lookup + open/create

On Thu, 5 May 2022 at 21:59, Vivek Goyal <[email protected]> wrote:

> Oh, I have no issues with the intent. I will like to see cut in network
> traffic too (if we can do this without introducing problems). My primary
> interest is that this kind of change should benefit virtiofs as well.

One issue with that appears to be checking permissions. AFAIU this
patchset only enables the optimization if default_permissions is
turned off (i.e. all permission checking is done by the server). But
virtiofs uses the default_permissions model.

I'm not quite sure about this limitation, guessing that it's related
to the fact that the permissions may be stale at the time of checking?

Thanks,
Miklos

2022-05-11 23:39:30

by Vivek Goyal

[permalink] [raw]
Subject: Re: [PATCH v4 0/3] FUSE: Implement atomic lookup + open/create

On Wed, May 11, 2022 at 11:40:59AM +0200, Miklos Szeredi wrote:
> On Thu, 5 May 2022 at 21:59, Vivek Goyal <[email protected]> wrote:
>
> > Oh, I have no issues with the intent. I will like to see cut in network
> > traffic too (if we can do this without introducing problems). My primary
> > interest is that this kind of change should benefit virtiofs as well.
>
> One issue with that appears to be checking permissions. AFAIU this
> patchset only enables the optimization if default_permissions is
> turned off (i.e. all permission checking is done by the server). But
> virtiofs uses the default_permissions model.

IIUC, only 3rd patch mentions that default_permission should be turned
off. IOW, first patch where lookup + create + open is a single operation
and second patch which does "lookup + open" in a single operation does
not seem to require that default_permissions are not in effect.

So if first two patches work fine, I think virtiofs should benefit too.
(IMHO, 3rd patch is too hacky anyway)

W.r.t permission checks, looks like may_open() will finally be called
after ->atomic_open(). So even if we open the file, we should still be
able to check whether we have permissions to open the file or not
after the fact.

fs/namei.c

path_openat()
{
open_last_lookups() <--- This calls ->atomic_open()
do_open() <--- This calls may_open()
}

Thanks
Vivek

>
> I'm not quite sure about this limitation, guessing that it's related
> to the fact that the permissions may be stale at the time of checking?
>
> Thanks,
> Miklos
>


2022-05-12 10:19:48

by Vivek Goyal

[permalink] [raw]
Subject: Re: [PATCH v4 0/3] FUSE: Implement atomic lookup + open/create

On Wed, May 11, 2022 at 01:21:13PM -0400, Vivek Goyal wrote:
> On Wed, May 11, 2022 at 11:40:59AM +0200, Miklos Szeredi wrote:
> > On Thu, 5 May 2022 at 21:59, Vivek Goyal <[email protected]> wrote:
> >
> > > Oh, I have no issues with the intent. I will like to see cut in network
> > > traffic too (if we can do this without introducing problems). My primary
> > > interest is that this kind of change should benefit virtiofs as well.
> >
> > One issue with that appears to be checking permissions. AFAIU this
> > patchset only enables the optimization if default_permissions is
> > turned off (i.e. all permission checking is done by the server). But
> > virtiofs uses the default_permissions model.
>
> IIUC, only 3rd patch mentions that default_permission should be turned
> off. IOW, first patch where lookup + create + open is a single operation
> and second patch which does "lookup + open" in a single operation does
> not seem to require that default_permissions are not in effect.
>
> So if first two patches work fine, I think virtiofs should benefit too.
> (IMHO, 3rd patch is too hacky anyway)
>
> W.r.t permission checks, looks like may_open() will finally be called
> after ->atomic_open(). So even if we open the file, we should still be
> able to check whether we have permissions to open the file or not
> after the fact.
>
> fs/namei.c
>
> path_openat()
> {
> open_last_lookups() <--- This calls ->atomic_open()
> do_open() <--- This calls may_open()
> }

Actually I am not sure about it. I was playing with

open(foo.txt, O_CREAT | O_RDWR, O_IRUSR)

This succeeds if file is newly created but if file already existed, this
fails with -EACCESS

So man 2 open says following. Thanks to Andy Price for pointing me to it.

Note that mode applies only to future accesses of the newly cre‐
ated file; the open() call that creates a read-only file may
well return a read/write file descriptor.


Now I am wondering how will it look like with first patch. Assume file
already exists on the server (But there is no negative dentry present)
and I do following. And assume file only has read permission for user
and I am trying to open it read-write.

open(foo.txt, O_CREAT | O_RDWR, O_IRUSR)

In normal circumstances, user will expect -EACCESS as file is read-only
and user is trying to open it read-write.

I am wondering how will it look like with this first patch.

Current fuse ->atomic_open() looks up the dentry and does not open
the file if dentry is positive.

New implementation will skip lookup and open the file anyway and
set file->f_mode |= FMODE_CREATED; (First patch in series)

So first of all this seems wrong. I thought FMODE_CREATED should be
set only if file was newly created. Is that a correct understanding.

And I am looking at do_open() code. It does bunch of things based
on FMODE_CREATED flag. One of the things it does is reset acc_mode =0

if (file->f_mode & FMODE_CREATED) {
/* Don't check for write permission, don't truncate */
open_flag &= ~O_TRUNC;
acc_mode = 0;
}
error = may_open(mnt_userns, &nd->path, acc_mode, open_flag);

I suspect this is the code which allows opening a newly created read-only
file as O_RDWR. (Though I am not 100% sure).

I suspect with first patch this will be broken. We will set FMODE_CREATED
even if file already existed and VFS will assume a new file has been
created and do bunch of things which is wrong.

So looks like fuse ->atomic_open() should set FMODE_CREATED only if
it really created the file.

Thanks
Vivek


2022-05-13 20:10:21

by Dharmendra Singh

[permalink] [raw]
Subject: Re: [PATCH v4 0/3] FUSE: Implement atomic lookup + open/create

On Thu, May 12, 2022 at 1:00 AM Vivek Goyal <[email protected]> wrote:
>
> On Wed, May 11, 2022 at 01:21:13PM -0400, Vivek Goyal wrote:
> > On Wed, May 11, 2022 at 11:40:59AM +0200, Miklos Szeredi wrote:
> > > On Thu, 5 May 2022 at 21:59, Vivek Goyal <[email protected]> wrote:
> > >
> > > > Oh, I have no issues with the intent. I will like to see cut in network
> > > > traffic too (if we can do this without introducing problems). My primary
> > > > interest is that this kind of change should benefit virtiofs as well.
> > >
> > > One issue with that appears to be checking permissions. AFAIU this
> > > patchset only enables the optimization if default_permissions is
> > > turned off (i.e. all permission checking is done by the server). But
> > > virtiofs uses the default_permissions model.
> >
> > IIUC, only 3rd patch mentions that default_permission should be turned
> > off. IOW, first patch where lookup + create + open is a single operation
> > and second patch which does "lookup + open" in a single operation does
> > not seem to require that default_permissions are not in effect.
> >
> > So if first two patches work fine, I think virtiofs should benefit too.
> > (IMHO, 3rd patch is too hacky anyway)
> >
> > W.r.t permission checks, looks like may_open() will finally be called
> > after ->atomic_open(). So even if we open the file, we should still be
> > able to check whether we have permissions to open the file or not
> > after the fact.
> >
> > fs/namei.c
> >
> > path_openat()
> > {
> > open_last_lookups() <--- This calls ->atomic_open()
> > do_open() <--- This calls may_open()
> > }
>
> Actually I am not sure about it. I was playing with
>
> open(foo.txt, O_CREAT | O_RDWR, O_IRUSR)
>
> This succeeds if file is newly created but if file already existed, this
> fails with -EACCESS
>
> So man 2 open says following. Thanks to Andy Price for pointing me to it.
>
> Note that mode applies only to future accesses of the newly cre‐
> ated file; the open() call that creates a read-only file may
> well return a read/write file descriptor.
>
>
> Now I am wondering how will it look like with first patch. Assume file
> already exists on the server (But there is no negative dentry present)
> and I do following. And assume file only has read permission for user
> and I am trying to open it read-write.
>
> open(foo.txt, O_CREAT | O_RDWR, O_IRUSR)
>
> In normal circumstances, user will expect -EACCESS as file is read-only
> and user is trying to open it read-write.
>
> I am wondering how will it look like with this first patch.
>
> Current fuse ->atomic_open() looks up the dentry and does not open
> the file if dentry is positive.
>
> New implementation will skip lookup and open the file anyway and
> set file->f_mode |= FMODE_CREATED; (First patch in series)
>
> So first of all this seems wrong. I thought FMODE_CREATED should be
> set only if file was newly created. Is that a correct understanding.

You are right. we should mark in f_mode only if the file was actually created.
Thanks for catching this. Appreciate your efforts:)

>
> And I am looking at do_open() code. It does bunch of things based
> on FMODE_CREATED flag. One of the things it does is reset acc_mode =0
>
> if (file->f_mode & FMODE_CREATED) {
> /* Don't check for write permission, don't truncate */
> open_flag &= ~O_TRUNC;
> acc_mode = 0;
> }
> error = may_open(mnt_userns, &nd->path, acc_mode, open_flag);
>
> I suspect this is the code which allows opening a newly created read-only
> file as O_RDWR. (Though I am not 100% sure).

Correct. I could see it.

>
> I suspect with first patch this will be broken. We will set FMODE_CREATED
> even if file already existed and VFS will assume a new file has been
> created and do bunch of things which is wrong.
>
> So looks like fuse ->atomic_open() should set FMODE_CREATED only if
> it really created the file.

This looks like an obvious bug with new patches. But If I do not miss
anything then its a bug on distributed file systems with current code
as well.
It could happen this way(Taking your example which is perfect to trace
this on distributed systems):
Lets start with File is non-existent yet on the file system. There
comes the first client which does a lookup on the file, it does not
find the file. Meanwhile another client created the file on the File
system. Now when this first client goes to create the file, before
going down it sets FMODE_CREATED on the file handle and then goes down
the lower layers. It comes back with inode but f->mode as
FMODE_CREATED which is incorrect. This mode then results in acc_mode
being set to zero which then allows access to the file as O_RDWR.

I think Miklos proposed to return the flag from user space if the file
was actually created, this would solve two problems 1) this access
problem and code execution going the wrong path 2) correct update if
parent dir changed or not.

2022-05-14 03:05:09

by Vivek Goyal

[permalink] [raw]
Subject: Re: [PATCH v4 0/3] FUSE: Implement atomic lookup + open/create

On Thu, May 12, 2022 at 01:46:12PM +0530, Dharmendra Hans wrote:
> On Thu, May 12, 2022 at 1:00 AM Vivek Goyal <[email protected]> wrote:
> >
> > On Wed, May 11, 2022 at 01:21:13PM -0400, Vivek Goyal wrote:
> > > On Wed, May 11, 2022 at 11:40:59AM +0200, Miklos Szeredi wrote:
> > > > On Thu, 5 May 2022 at 21:59, Vivek Goyal <[email protected]> wrote:
> > > >
> > > > > Oh, I have no issues with the intent. I will like to see cut in network
> > > > > traffic too (if we can do this without introducing problems). My primary
> > > > > interest is that this kind of change should benefit virtiofs as well.
> > > >
> > > > One issue with that appears to be checking permissions. AFAIU this
> > > > patchset only enables the optimization if default_permissions is
> > > > turned off (i.e. all permission checking is done by the server). But
> > > > virtiofs uses the default_permissions model.
> > >
> > > IIUC, only 3rd patch mentions that default_permission should be turned
> > > off. IOW, first patch where lookup + create + open is a single operation
> > > and second patch which does "lookup + open" in a single operation does
> > > not seem to require that default_permissions are not in effect.
> > >
> > > So if first two patches work fine, I think virtiofs should benefit too.
> > > (IMHO, 3rd patch is too hacky anyway)
> > >
> > > W.r.t permission checks, looks like may_open() will finally be called
> > > after ->atomic_open(). So even if we open the file, we should still be
> > > able to check whether we have permissions to open the file or not
> > > after the fact.
> > >
> > > fs/namei.c
> > >
> > > path_openat()
> > > {
> > > open_last_lookups() <--- This calls ->atomic_open()
> > > do_open() <--- This calls may_open()
> > > }
> >
> > Actually I am not sure about it. I was playing with
> >
> > open(foo.txt, O_CREAT | O_RDWR, O_IRUSR)
> >
> > This succeeds if file is newly created but if file already existed, this
> > fails with -EACCESS
> >
> > So man 2 open says following. Thanks to Andy Price for pointing me to it.
> >
> > Note that mode applies only to future accesses of the newly cre‐
> > ated file; the open() call that creates a read-only file may
> > well return a read/write file descriptor.
> >
> >
> > Now I am wondering how will it look like with first patch. Assume file
> > already exists on the server (But there is no negative dentry present)
> > and I do following. And assume file only has read permission for user
> > and I am trying to open it read-write.
> >
> > open(foo.txt, O_CREAT | O_RDWR, O_IRUSR)
> >
> > In normal circumstances, user will expect -EACCESS as file is read-only
> > and user is trying to open it read-write.
> >
> > I am wondering how will it look like with this first patch.
> >
> > Current fuse ->atomic_open() looks up the dentry and does not open
> > the file if dentry is positive.
> >
> > New implementation will skip lookup and open the file anyway and
> > set file->f_mode |= FMODE_CREATED; (First patch in series)
> >
> > So first of all this seems wrong. I thought FMODE_CREATED should be
> > set only if file was newly created. Is that a correct understanding.
>
> You are right. we should mark in f_mode only if the file was actually created.
> Thanks for catching this. Appreciate your efforts:)
>
> >
> > And I am looking at do_open() code. It does bunch of things based
> > on FMODE_CREATED flag. One of the things it does is reset acc_mode =0
> >
> > if (file->f_mode & FMODE_CREATED) {
> > /* Don't check for write permission, don't truncate */
> > open_flag &= ~O_TRUNC;
> > acc_mode = 0;
> > }
> > error = may_open(mnt_userns, &nd->path, acc_mode, open_flag);
> >
> > I suspect this is the code which allows opening a newly created read-only
> > file as O_RDWR. (Though I am not 100% sure).
>
> Correct. I could see it.
>
> >
> > I suspect with first patch this will be broken. We will set FMODE_CREATED
> > even if file already existed and VFS will assume a new file has been
> > created and do bunch of things which is wrong.
> >
> > So looks like fuse ->atomic_open() should set FMODE_CREATED only if
> > it really created the file.
>
> This looks like an obvious bug with new patches. But If I do not miss
> anything then its a bug on distributed file systems with current code
> as well.
> It could happen this way(Taking your example which is perfect to trace
> this on distributed systems):
> Lets start with File is non-existent yet on the file system. There
> comes the first client which does a lookup on the file, it does not
> find the file. Meanwhile another client created the file on the File
> system. Now when this first client goes to create the file, before
> going down it sets FMODE_CREATED on the file handle and then goes down
> the lower layers. It comes back with inode but f->mode as
> FMODE_CREATED which is incorrect. This mode then results in acc_mode
> being set to zero which then allows access to the file as O_RDWR.

I think with current code (FUSE_CREATE), it is a race with shared
filesystems where multiple clients might be sharing this directory.

We are looking up dentry first and issue FUSE_CREATE only if dentry
is negative. Now it is possible that between lookup() + FUSE_CREATE,
another client dropped in same file in the directory. But one could
argue, that's something not detectable by this client. So from user's
perspective, this client created the file.

If we have a negative dentry, then we will not do the lookup and
assumption is that we created file (even if another client created
it between previous lookup and this creation).

So agreed, this is a race but might not be very harmful one because
looks like we are providing weaker coherency with FUSE as opposed
to local filesystems. In fact this case fuse server running on a
directory which is shared by multiple clients bothers me a lot. There
seem to be many corner cases where it is not clear what will happen
if another client is doing operation.

>
> I think Miklos proposed to return the flag from user space if the file
> was actually created, this would solve two problems 1) this access
> problem and code execution going the wrong path 2) correct update if
> parent dir changed or not.

Agreed that we could use new command FUSE_ATOMIC_CREATE/FUSE_CREATE_EXT
to figure out if file was newly created or not and then set FMODE_CREATED
accordingly.

W.r.t dir change, I am assuming we are invalidating dir attributes just
because if we created file, it could change dir's mtime/ctime. So yes,
FUSE_ATOMIC_CREATE/FUSE_CREATE_EXT could return this info whether file
was actually crated or not and invalidate dir's attribute accordingly.

Thanks
Vivek