2019-08-21 18:01:12

by Vivek Goyal

Subject: [PATCH v3 00/13] virtio-fs: shared file system for virtual machines

Hi,

Here are the V3 patches for the virtio-fs filesystem. This time I have
broken the patch series into two parts. This is the first part, which does
not contain DAX support. A second patch series will contain the patches
for DAX support.

I have also dropped the RFC tag from this first series, as we believe it
is in good enough shape that it should get consideration for inclusion
upstream.

These patches apply on top of the 5.3-rc5 kernel and are also available
here.

https://github.com/rhvgoyal/linux/commits/vivek-5.3-aug-21-2019

Patches for V1 and V2 were posted here.

https://lwn.net/ml/linux-fsdevel/[email protected]/
http://lkml.iu.edu/hypermail/linux/kernel/1905.1/07232.html

More information about the project can be found here.

https://virtio-fs.gitlab.io

Changes from V2
===============
- Various bug fixes and performance improvements.

HOWTO
======
We have put instructions on how to use it here.

https://virtio-fs.gitlab.io/

Some Performance Numbers
========================
I have run a bunch of fio jobs to get a sense of the speed of various
operations. I wrote a simple wrapper script that runs each fio job 3
times and reports the average. These scripts are available here.

https://github.com/rhvgoyal/virtiofs-tests

I set up a directory on ramfs on the host, exported that directory to the
guest using both virtio-9p and virtio-fs, and ran the tests inside the
guest. Tests were run with cache=none for both virtio-9p and virtio-fs so
that no caching happens in the guest. For virtio-fs, I ran an additional
set of tests with dax enabled. Dax is not part of this first patch series,
but I have included the results here because dax seems to give the maximum
performance advantage and shows the real potential of virtio-fs.

Test Setup
-----------
- A Fedora 28 host with 32G RAM, 2 sockets (6 cores per socket, 2
threads per core)

- Using ramfs on the host as the backing store. 4 fio files of 2G each.

- Created a VM with 16 VCPUs and 8GB memory, plus an 8GB cache window
(for dax mmap).

Test Results
------------
- Results in three configurations have been reported: 9p (cache=none),
virtio-fs (cache=none) and virtio-fs (cache=none + dax).

There are other caching modes as well, but to me cache=none seemed most
interesting for now because it does not cache anything in the guest
and provides strong coherence. Other modes, which provide weaker
coherence and are hence faster, are yet to be benchmarked.

- Three fio ioengines have been used: psync, libaio and mmap.

- I/O workloads of randread, randwrite, seqread and seqwrite have been run.

- Each file is 2G in size. Block size is 4K. iodepth=16. (A sample fio job
sketch follows this list.)

- "multi" means the same operation was done with 4 jobs, each job
operating on its own file of size 2G.

- Some results are "0 (KiB/s)". That means that particular operation is
not supported in that configuration.
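
For reference, a single job matching the above parameters might look like
this (a sketch only; the real job files live in the virtiofs-tests repo
linked above, and the directory path is hypothetical):

[global]
ioengine=psync          ; or libaio / mmap for the other runs
rw=read                 ; read/randread/write/randwrite per workload
bs=4K
size=2G
iodepth=16              ; only meaningful for the libaio runs
directory=/mnt/virtiofs ; hypothetical guest mount point

[seqread-psync]
numjobs=1               ; the "-multi" variants use numjobs=4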

NAME                     I/O Operation            BW(Read/Write)

9p-cache-none            seqread-psync            27(MiB/s)
virtiofs-cache-none      seqread-psync            35(MiB/s)
virtiofs-dax-cache-none  seqread-psync            245(MiB/s)

9p-cache-none            seqread-psync-multi      117(MiB/s)
virtiofs-cache-none      seqread-psync-multi      162(MiB/s)
virtiofs-dax-cache-none  seqread-psync-multi      894(MiB/s)

9p-cache-none            seqread-mmap             24(MiB/s)
virtiofs-cache-none      seqread-mmap             0(KiB/s)
virtiofs-dax-cache-none  seqread-mmap             168(MiB/s)

9p-cache-none            seqread-mmap-multi       115(MiB/s)
virtiofs-cache-none      seqread-mmap-multi       0(KiB/s)
virtiofs-dax-cache-none  seqread-mmap-multi       614(MiB/s)

9p-cache-none            seqread-libaio           26(MiB/s)
virtiofs-cache-none      seqread-libaio           139(MiB/s)
virtiofs-dax-cache-none  seqread-libaio           160(MiB/s)

9p-cache-none            seqread-libaio-multi     129(MiB/s)
virtiofs-cache-none      seqread-libaio-multi     142(MiB/s)
virtiofs-dax-cache-none  seqread-libaio-multi     577(MiB/s)

9p-cache-none            randread-psync           29(MiB/s)
virtiofs-cache-none      randread-psync           34(MiB/s)
virtiofs-dax-cache-none  randread-psync           256(MiB/s)

9p-cache-none            randread-psync-multi     139(MiB/s)
virtiofs-cache-none      randread-psync-multi     153(MiB/s)
virtiofs-dax-cache-none  randread-psync-multi     245(MiB/s)

9p-cache-none            randread-mmap            22(MiB/s)
virtiofs-cache-none      randread-mmap            0(KiB/s)
virtiofs-dax-cache-none  randread-mmap            162(MiB/s)

9p-cache-none            randread-mmap-multi      111(MiB/s)
virtiofs-cache-none      randread-mmap-multi      0(KiB/s)
virtiofs-dax-cache-none  randread-mmap-multi      215(MiB/s)

9p-cache-none            randread-libaio          26(MiB/s)
virtiofs-cache-none      randread-libaio          135(MiB/s)
virtiofs-dax-cache-none  randread-libaio          157(MiB/s)

9p-cache-none            randread-libaio-multi    133(MiB/s)
virtiofs-cache-none      randread-libaio-multi    245(MiB/s)
virtiofs-dax-cache-none  randread-libaio-multi    163(MiB/s)

9p-cache-none            seqwrite-psync           28(MiB/s)
virtiofs-cache-none      seqwrite-psync           34(MiB/s)
virtiofs-dax-cache-none  seqwrite-psync           203(MiB/s)

9p-cache-none            seqwrite-psync-multi     128(MiB/s)
virtiofs-cache-none      seqwrite-psync-multi     155(MiB/s)
virtiofs-dax-cache-none  seqwrite-psync-multi     717(MiB/s)

9p-cache-none            seqwrite-mmap            0(KiB/s)
virtiofs-cache-none      seqwrite-mmap            0(KiB/s)
virtiofs-dax-cache-none  seqwrite-mmap            165(MiB/s)

9p-cache-none            seqwrite-mmap-multi      0(KiB/s)
virtiofs-cache-none      seqwrite-mmap-multi      0(KiB/s)
virtiofs-dax-cache-none  seqwrite-mmap-multi      511(MiB/s)

9p-cache-none            seqwrite-libaio          27(MiB/s)
virtiofs-cache-none      seqwrite-libaio          128(MiB/s)
virtiofs-dax-cache-none  seqwrite-libaio          141(MiB/s)

9p-cache-none            seqwrite-libaio-multi    119(MiB/s)
virtiofs-cache-none      seqwrite-libaio-multi    242(MiB/s)
virtiofs-dax-cache-none  seqwrite-libaio-multi    505(MiB/s)

9p-cache-none            randwrite-psync          27(MiB/s)
virtiofs-cache-none      randwrite-psync          34(MiB/s)
virtiofs-dax-cache-none  randwrite-psync          189(MiB/s)

9p-cache-none            randwrite-psync-multi    137(MiB/s)
virtiofs-cache-none      randwrite-psync-multi    150(MiB/s)
virtiofs-dax-cache-none  randwrite-psync-multi    233(MiB/s)

9p-cache-none            randwrite-mmap           0(KiB/s)
virtiofs-cache-none      randwrite-mmap           0(KiB/s)
virtiofs-dax-cache-none  randwrite-mmap           120(MiB/s)

9p-cache-none            randwrite-mmap-multi     0(KiB/s)
virtiofs-cache-none      randwrite-mmap-multi     0(KiB/s)
virtiofs-dax-cache-none  randwrite-mmap-multi     200(MiB/s)

9p-cache-none            randwrite-libaio         25(MiB/s)
virtiofs-cache-none      randwrite-libaio         124(MiB/s)
virtiofs-dax-cache-none  randwrite-libaio         131(MiB/s)

9p-cache-none            randwrite-libaio-multi   125(MiB/s)
virtiofs-cache-none      randwrite-libaio-multi   241(MiB/s)
virtiofs-dax-cache-none  randwrite-libaio-multi   163(MiB/s)

Conclusions
===========
- In general virtio-fs seems faster than virtio-9p. Using dax makes the
numbers really interesting.

Note:
Right now the dax window is 8G and the total fio file size is 8G as well
(4 files of 2G each). That means everything fits into the dax window and
no reclaim is needed. The dax window reclaim logic is slower, and if the
file set is bigger than the dax window, performance slows down.

Description from previous postings
==================================

Design Overview
===============
With the goal of designing something with better performance and local file
system semantics, the following ideas were proposed.

- Use the fuse protocol (instead of 9p) for communication between guest
and host. The guest kernel acts as the fuse client and a fuse server
runs on the host to serve the requests.

- For data access inside the guest, mmap portions of files into the QEMU
address space and let the guest access this memory using dax. That way the
guest page cache is bypassed and there is only one copy of the data (on
the host). This will also enable mmap(MAP_SHARED) between guests.

- For metadata coherency, there is a shared memory region which contains a
version number associated with metadata; any guest changing metadata
updates the version number, and other guests refresh their metadata on the
next access. This is yet to be implemented (a hypothetical sketch
follows).
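
For illustration only, an entry in such a shared version table might look
something like this (purely hypothetical, since this mechanism is not
implemented yet):

/* Hypothetical shared-memory version table entry */
struct fuse_shm_version {
	__u64 nodeid;	/* inode this entry describes */
	__u64 version;	/* bumped on every metadata change */
};

/*
 * Guest-side sketch: before trusting cached attributes, compare the
 * cached version against the shared one and refresh on mismatch.
 */
static bool fuse_metadata_stale(const struct fuse_shm_version *entry,
				u64 cached_version)
{
	return READ_ONCE(entry->version) != cached_version;
}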

How virtio-fs differs from existing approaches
==============================================
The unique idea behind virtio-fs is to take advantage of the co-location
of the virtual machine and hypervisor to avoid communication (vmexits).

DAX allows file contents to be accessed without communication with the
hypervisor. The shared memory region for metadata avoids communication in
the common case where metadata is unchanged.

By replacing expensive communication with cheaper shared memory accesses,
we expect to achieve better performance than approaches based on network
file system protocols. In addition, this also makes it easier to achieve
local file system semantics (coherency).

These techniques are not applicable to network file system protocols since
the communications channel is bypassed by taking advantage of shared memory
on a local machine. This is why we decided to build virtio-fs rather than
focus on 9P or NFS.

Caching Modes
=============
Like virtio-9p, different caching modes are supported which determine the
coherency level as well. The "cache=FOO" and "writeback" options control the
level of coherence between the guest and host filesystems.

- cache=none
metadata, data and pathname lookup are not cached in the guest. They are
always fetched from the host and any changes are immediately pushed to the
host.

- cache=always
metadata, data and pathname lookup are cached in the guest and never expire.

- cache=auto
metadata and pathname lookup cache expires after a configured amount of time
(default is 1 second). Data is cached while the file is open (close-to-open
consistency).

- writeback/no_writeback
These options control the writeback strategy. If writeback is disabled,
then normal writes will immediately be synchronized with the host fs. If
writeback is enabled, then writes may be cached in the guest until the file
is closed or an fsync(2) is performed. This option has no effect on mmap-ed
writes or writes going through the DAX mechanism.
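
For example, with the strictest mode above, a guest mount might look like
this (illustrative only; "myfs" is a hypothetical device tag, and this uses
the "virtiofs" name adopted later in this thread):

  mount -t virtiofs myfs /mnt -o cache=none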

Thanks
Vivek

Miklos Szeredi (2):
fuse: delete dentry if timeout is zero
fuse: Use default_file_splice_read for direct IO

Stefan Hajnoczi (6):
fuse: export fuse_end_request()
fuse: export fuse_len_args()
fuse: export fuse_get_unique()
fuse: extract fuse_fill_super_common()
fuse: add fuse_iqueue_ops callbacks
virtio_fs: add skeleton virtio_fs.ko module

Vivek Goyal (5):
fuse: Export fuse_send_init_request()
Export fuse_dequeue_forget() function
fuse: Separate fuse device allocation and installation in fuse_conn
virtio-fs: Do not provide abort interface in fusectl
init/do_mounts.c: add virtio_fs root fs support

fs/fuse/Kconfig                 |   11 +
fs/fuse/Makefile                |    1 +
fs/fuse/control.c               |    4 +-
fs/fuse/cuse.c                  |    4 +-
fs/fuse/dev.c                   |   89 ++-
fs/fuse/dir.c                   |   26 +-
fs/fuse/file.c                  |   15 +-
fs/fuse/fuse_i.h                |  120 +++-
fs/fuse/inode.c                 |  203 +++---
fs/fuse/virtio_fs.c             | 1061 +++++++++++++++++++++++++++++++
fs/splice.c                     |    3 +-
include/linux/fs.h              |    2 +
include/uapi/linux/virtio_fs.h  |   41 ++
include/uapi/linux/virtio_ids.h |    1 +
init/do_mounts.c                |   10 +
15 files changed, 1462 insertions(+), 129 deletions(-)
create mode 100644 fs/fuse/virtio_fs.c
create mode 100644 include/uapi/linux/virtio_fs.h
create mode 100644 fs/fuse/virtio_fs.c
create mode 100644 include/uapi/linux/virtio_fs.h

--
2.20.1


2019-08-21 18:34:07

by Vivek Goyal

Subject: [PATCH 09/13] fuse: add fuse_iqueue_ops callbacks

From: Stefan Hajnoczi <[email protected]>

The /dev/fuse device uses fiq->waitq and fasync to signal that requests
are available. These mechanisms do not apply to virtio-fs. This patch
introduces callbacks so alternative behavior can be used.

Note that queue_interrupt() changes along these lines:

spin_lock(&fiq->waitq.lock);
wake_up_locked(&fiq->waitq);
+ kill_fasync(&fiq->fasync, SIGIO, POLL_IN);
spin_unlock(&fiq->waitq.lock);
- kill_fasync(&fiq->fasync, SIGIO, POLL_IN);

Since queue_request() and queue_forget() also call kill_fasync() inside
the spinlock, this should be safe.
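
For illustration, a virtio transport's wake_pending_and_unlock callback
could pop the request and kick a virtqueue instead of waking /dev/fuse
readers, roughly like this (a simplified sketch; struct virtio_fs and
virtio_fs_enqueue_req belong to the later virtio_fs.ko patch):

static void virtio_fs_wake_pending_and_unlock(struct fuse_iqueue *fiq)
__releases(fiq->waitq.lock)
{
	struct virtio_fs *fs = fiq->priv; /* set via fuse_iqueue_init() */
	struct fuse_req *req;

	/* Dequeue the request while the iqueue lock is still held... */
	req = list_first_entry(&fiq->pending, struct fuse_req, list);
	list_del_init(&req->list);
	spin_unlock(&fiq->waitq.lock);

	/* ...then hand it to a virtqueue instead of waking fiq->waitq */
	virtio_fs_enqueue_req(fs, req);
}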

Signed-off-by: Stefan Hajnoczi <[email protected]>
Signed-off-by: Miklos Szeredi <[email protected]>
---
fs/fuse/cuse.c   |  2 +-
fs/fuse/dev.c    | 50 ++++++++++++++++++++++++++++++++----------------
fs/fuse/fuse_i.h | 48 +++++++++++++++++++++++++++++++++++++++++++++-
fs/fuse/inode.c  | 16 ++++++++++++----
4 files changed, 94 insertions(+), 22 deletions(-)

diff --git a/fs/fuse/cuse.c b/fs/fuse/cuse.c
index bab7a0db81dd..7bdab6521469 100644
--- a/fs/fuse/cuse.c
+++ b/fs/fuse/cuse.c
@@ -504,7 +504,7 @@ static int cuse_channel_open(struct inode *inode, struct file *file)
* Limit the cuse channel to requests that can
* be represented in file->f_cred->user_ns.
*/
- fuse_conn_init(&cc->fc, file->f_cred->user_ns);
+ fuse_conn_init(&cc->fc, file->f_cred->user_ns, &fuse_dev_fiq_ops, NULL);

fud = fuse_dev_alloc(&cc->fc);
if (!fud) {
diff --git a/fs/fuse/dev.c b/fs/fuse/dev.c
index 2bbfc1f77875..fa6794e96811 100644
--- a/fs/fuse/dev.c
+++ b/fs/fuse/dev.c
@@ -375,13 +375,33 @@ static unsigned int fuse_req_hash(u64 unique)
return hash_long(unique & ~FUSE_INT_REQ_BIT, FUSE_PQ_HASH_BITS);
}

-static void queue_request(struct fuse_iqueue *fiq, struct fuse_req *req)
+/**
+ * A new request is available, wake fiq->waitq
+ */
+static void fuse_dev_wake_and_unlock(struct fuse_iqueue *fiq)
+__releases(fiq->waitq.lock)
{
- req->in.h.len = sizeof(struct fuse_in_header) +
- fuse_len_args(req->in.numargs, (struct fuse_arg *) req->in.args);
- list_add_tail(&req->list, &fiq->pending);
wake_up_locked(&fiq->waitq);
kill_fasync(&fiq->fasync, SIGIO, POLL_IN);
+ spin_unlock(&fiq->waitq.lock);
+}
+
+const struct fuse_iqueue_ops fuse_dev_fiq_ops = {
+ .wake_forget_and_unlock = fuse_dev_wake_and_unlock,
+ .wake_interrupt_and_unlock = fuse_dev_wake_and_unlock,
+ .wake_pending_and_unlock = fuse_dev_wake_and_unlock,
+};
+EXPORT_SYMBOL_GPL(fuse_dev_fiq_ops);
+
+static void queue_request_and_unlock(struct fuse_iqueue *fiq,
+ struct fuse_req *req)
+__releases(fiq->waitq.lock)
+{
+ req->in.h.len = sizeof(struct fuse_in_header) +
+ fuse_len_args(req->in.numargs,
+ (struct fuse_arg *) req->in.args);
+ list_add_tail(&req->list, &fiq->pending);
+ fiq->ops->wake_pending_and_unlock(fiq);
}

void fuse_queue_forget(struct fuse_conn *fc, struct fuse_forget_link *forget,
@@ -396,12 +416,11 @@ void fuse_queue_forget(struct fuse_conn *fc, struct fuse_forget_link *forget,
if (fiq->connected) {
fiq->forget_list_tail->next = forget;
fiq->forget_list_tail = forget;
- wake_up_locked(&fiq->waitq);
- kill_fasync(&fiq->fasync, SIGIO, POLL_IN);
+ fiq->ops->wake_forget_and_unlock(fiq);
} else {
kfree(forget);
+ spin_unlock(&fiq->waitq.lock);
}
- spin_unlock(&fiq->waitq.lock);
}

static void flush_bg_queue(struct fuse_conn *fc)
@@ -417,8 +436,7 @@ static void flush_bg_queue(struct fuse_conn *fc)
fc->active_background++;
spin_lock(&fiq->waitq.lock);
req->in.h.unique = fuse_get_unique(fiq);
- queue_request(fiq, req);
- spin_unlock(&fiq->waitq.lock);
+ queue_request_and_unlock(fiq, req);
}
}

@@ -506,10 +524,10 @@ static int queue_interrupt(struct fuse_iqueue *fiq, struct fuse_req *req)
spin_unlock(&fiq->waitq.lock);
return 0;
}
- wake_up_locked(&fiq->waitq);
- kill_fasync(&fiq->fasync, SIGIO, POLL_IN);
+ fiq->ops->wake_interrupt_and_unlock(fiq);
+ } else {
+ spin_unlock(&fiq->waitq.lock);
}
- spin_unlock(&fiq->waitq.lock);
return 0;
}

@@ -569,11 +587,10 @@ static void __fuse_request_send(struct fuse_conn *fc, struct fuse_req *req)
req->out.h.error = -ENOTCONN;
} else {
req->in.h.unique = fuse_get_unique(fiq);
- queue_request(fiq, req);
/* acquire extra reference, since request is still needed
after fuse_request_end() */
__fuse_get_request(req);
- spin_unlock(&fiq->waitq.lock);
+ queue_request_and_unlock(fiq, req);

request_wait_answer(fc, req);
/* Pairs with smp_wmb() in fuse_request_end() */
@@ -706,10 +723,11 @@ static int fuse_request_send_notify_reply(struct fuse_conn *fc,
req->in.h.unique = unique;
spin_lock(&fiq->waitq.lock);
if (fiq->connected) {
- queue_request(fiq, req);
+ queue_request_and_unlock(fiq, req);
err = 0;
+ } else {
+ spin_unlock(&fiq->waitq.lock);
}
- spin_unlock(&fiq->waitq.lock);

return err;
}
diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
index 062f929348cc..70c1bde06293 100644
--- a/fs/fuse/fuse_i.h
+++ b/fs/fuse/fuse_i.h
@@ -75,6 +75,12 @@ struct fuse_mount_data {
unsigned max_read;
unsigned blksize;

+ /* fuse input queue operations */
+ const struct fuse_iqueue_ops *fiq_ops;
+
+ /* device-specific state for fuse_iqueue */
+ void *fiq_priv;
+
/* fuse_dev pointer to fill in, should contain NULL on entry */
void **fudptr;
};
@@ -465,6 +471,39 @@ struct fuse_req {
struct file *stolen_file;
};

+struct fuse_iqueue;
+
+/**
+ * Input queue callbacks
+ *
+ * Input queue signalling is device-specific. For example, the /dev/fuse file
+ * uses fiq->waitq and fasync to wake processes that are waiting on queue
+ * readiness. These callbacks allow other device types to respond to input
+ * queue activity.
+ */
+struct fuse_iqueue_ops {
+ /**
+ * Signal that a forget has been queued
+ */
+ void (*wake_forget_and_unlock)(struct fuse_iqueue *fiq)
+ __releases(fiq->waitq.lock);
+
+ /**
+ * Signal that an INTERRUPT request has been queued
+ */
+ void (*wake_interrupt_and_unlock)(struct fuse_iqueue *fiq)
+ __releases(fiq->waitq.lock);
+
+ /**
+ * Signal that a request has been queued
+ */
+ void (*wake_pending_and_unlock)(struct fuse_iqueue *fiq)
+ __releases(fiq->waitq.lock);
+};
+
+/** /dev/fuse input queue operations */
+extern const struct fuse_iqueue_ops fuse_dev_fiq_ops;
+
struct fuse_iqueue {
/** Connection established */
unsigned connected;
@@ -490,6 +529,12 @@ struct fuse_iqueue {

/** O_ASYNC requests */
struct fasync_struct *fasync;
+
+ /** Device-specific callbacks */
+ const struct fuse_iqueue_ops *ops;
+
+ /** Device-specific state */
+ void *priv;
};

#define FUSE_PQ_HASH_BITS 8
@@ -1007,7 +1052,8 @@ struct fuse_conn *fuse_conn_get(struct fuse_conn *fc);
/**
* Initialize fuse_conn
*/
-void fuse_conn_init(struct fuse_conn *fc, struct user_namespace *user_ns);
+void fuse_conn_init(struct fuse_conn *fc, struct user_namespace *user_ns,
+ const struct fuse_iqueue_ops *fiq_ops, void *fiq_priv);

/**
* Release reference to fuse_conn
diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
index 9f21a0cb58b9..226af184a402 100644
--- a/fs/fuse/inode.c
+++ b/fs/fuse/inode.c
@@ -565,7 +565,9 @@ static int fuse_show_options(struct seq_file *m, struct dentry *root)
return 0;
}

-static void fuse_iqueue_init(struct fuse_iqueue *fiq)
+static void fuse_iqueue_init(struct fuse_iqueue *fiq,
+ const struct fuse_iqueue_ops *ops,
+ void *priv)
{
memset(fiq, 0, sizeof(struct fuse_iqueue));
init_waitqueue_head(&fiq->waitq);
@@ -573,6 +575,8 @@ static void fuse_iqueue_init(struct fuse_iqueue *fiq)
INIT_LIST_HEAD(&fiq->interrupts);
fiq->forget_list_tail = &fiq->forget_list_head;
fiq->connected = 1;
+ fiq->ops = ops;
+ fiq->priv = priv;
}

static void fuse_pqueue_init(struct fuse_pqueue *fpq)
@@ -586,7 +590,8 @@ static void fuse_pqueue_init(struct fuse_pqueue *fpq)
fpq->connected = 1;
}

-void fuse_conn_init(struct fuse_conn *fc, struct user_namespace *user_ns)
+void fuse_conn_init(struct fuse_conn *fc, struct user_namespace *user_ns,
+ const struct fuse_iqueue_ops *fiq_ops, void *fiq_priv)
{
memset(fc, 0, sizeof(*fc));
spin_lock_init(&fc->lock);
@@ -596,7 +601,7 @@ void fuse_conn_init(struct fuse_conn *fc, struct user_namespace *user_ns)
atomic_set(&fc->dev_count, 1);
init_waitqueue_head(&fc->blocked_waitq);
init_waitqueue_head(&fc->reserved_req_waitq);
- fuse_iqueue_init(&fc->iq);
+ fuse_iqueue_init(&fc->iq, fiq_ops, fiq_priv);
INIT_LIST_HEAD(&fc->bg_queue);
INIT_LIST_HEAD(&fc->entry);
INIT_LIST_HEAD(&fc->devices);
@@ -1110,7 +1115,8 @@ int fuse_fill_super_common(struct super_block *sb,
if (!fc)
goto err;

- fuse_conn_init(fc, sb->s_user_ns);
+ fuse_conn_init(fc, sb->s_user_ns, mount_data->fiq_ops,
+ mount_data->fiq_priv);
fc->release = fuse_free_conn;

fud = fuse_dev_alloc(fc);
@@ -1212,6 +1218,8 @@ static int fuse_fill_super(struct super_block *sb, void *data, int silent)
goto err_fput;
__set_bit(FR_BACKGROUND, &init_req->flags);

+ d.fiq_ops = &fuse_dev_fiq_ops;
+ d.fiq_priv = NULL;
d.fudptr = &file->private_data;
err = fuse_fill_super_common(sb, &d);
if (err < 0)
--
2.20.1

2019-08-21 21:00:47

by Vivek Goyal

Subject: [PATCH 01/13] fuse: delete dentry if timeout is zero

From: Miklos Szeredi <[email protected]>

Don't hold onto a dentry in the lru list if it needs to be re-looked up
anyway at the next access.

A more advanced version of this patch would periodically flush out
dentries from the lru which have gone stale.

Signed-off-by: Miklos Szeredi <[email protected]>
---
fs/fuse/dir.c | 26 +++++++++++++++++++++++---
1 file changed, 23 insertions(+), 3 deletions(-)

diff --git a/fs/fuse/dir.c b/fs/fuse/dir.c
index dd0f64f7bc06..fd8636e67ae9 100644
--- a/fs/fuse/dir.c
+++ b/fs/fuse/dir.c
@@ -29,12 +29,26 @@ union fuse_dentry {
struct rcu_head rcu;
};

-static inline void fuse_dentry_settime(struct dentry *entry, u64 time)
+static void fuse_dentry_settime(struct dentry *dentry, u64 time)
{
- ((union fuse_dentry *) entry->d_fsdata)->time = time;
+ /*
+ * Mess with DCACHE_OP_DELETE because dput() will be faster without it.
+ * Don't care about races, either way it's just an optimization
+ */
+ if ((time && (dentry->d_flags & DCACHE_OP_DELETE)) ||
+ (!time && !(dentry->d_flags & DCACHE_OP_DELETE))) {
+ spin_lock(&dentry->d_lock);
+ if (time)
+ dentry->d_flags &= ~DCACHE_OP_DELETE;
+ else
+ dentry->d_flags |= DCACHE_OP_DELETE;
+ spin_unlock(&dentry->d_lock);
+ }
+
+ ((union fuse_dentry *) dentry->d_fsdata)->time = time;
}

-static inline u64 fuse_dentry_time(struct dentry *entry)
+static inline u64 fuse_dentry_time(const struct dentry *entry)
{
return ((union fuse_dentry *) entry->d_fsdata)->time;
}
@@ -255,8 +269,14 @@ static void fuse_dentry_release(struct dentry *dentry)
kfree_rcu(fd, rcu);
}

+static int fuse_dentry_delete(const struct dentry *dentry)
+{
+ return time_before64(fuse_dentry_time(dentry), get_jiffies_64());
+}
+
const struct dentry_operations fuse_dentry_operations = {
.d_revalidate = fuse_dentry_revalidate,
+ .d_delete = fuse_dentry_delete,
.d_init = fuse_dentry_init,
.d_release = fuse_dentry_release,
};
--
2.20.1

2019-08-29 09:29:46

by Miklos Szeredi

Subject: Re: [PATCH v3 00/13] virtio-fs: shared file system for virtual machines

On Wed, Aug 21, 2019 at 7:38 PM Vivek Goyal <[email protected]> wrote:
> [..]

Pushed out to

git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse.git#for-next

Major changes compared to patchset:

- renamed to "virtiofs". Filesystem names don't usually have
underscore before "fs" postfix.

- removed option parsing completely. Virtiofs config is fixed to "-o
rootmode=040000,user_id=0,group_id=0,allow_other,default_permissions".
Does this sound reasonable?

There are miscellaneous changes, so needs to be thoroughly tested.

I think we also need something in
"Documentation/filesystems/virtiofs.rst" which describes the design
(how request gets to userspace and back) and how to set up the
server, etc... Stefan, Vivek can you do something like that?

Thanks,
Miklos

2019-08-29 11:59:34

by Vivek Goyal

Subject: Re: [PATCH v3 00/13] virtio-fs: shared file system for virtual machines

On Thu, Aug 29, 2019 at 11:28:27AM +0200, Miklos Szeredi wrote:
> [..]
>
> Pushed out to
>
> git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse.git#for-next
>

Awesome.

> Major changes compared to patchset:
>
> - renamed to "virtiofs". Filesystem names don't usually have
> underscore before "fs" postfix.
>

Sounds good to me.

> - removed option parsing completely. Virtiofs config is fixed to "-o
> rootmode=040000,user_id=0,group_id=0,allow_other,default_permissions".
> Does this sound reasonable?

These are the options we are using now, and it looks like they make a lot
of sense for virtiofs. I guess if somebody needs a different configuration
later, we can introduce option parsing and override these defaults.

>
> There are miscellaneous changes, so needs to be thoroughly tested.

I will test these.

>
> I think we also need something in
> "Documentation/filesystems/virtiofs.rst" which describes the design
> (how request gets to userspace and back) and how to set up the
> server, etc... Stefan, Vivek can you do something like that?

Sure, I will write up something and take Stefan's input as well.

Thanks
Vivek

2019-08-29 12:36:44

by Stefan Hajnoczi

Subject: Re: [PATCH v3 00/13] virtio-fs: shared file system for virtual machines

On Thu, Aug 29, 2019 at 11:28:27AM +0200, Miklos Szeredi wrote:
> On Wed, Aug 21, 2019 at 7:38 PM Vivek Goyal <[email protected]> wrote:

Thanks!

> - removed option parsing completely. Virtiofs config is fixed to "-o
> rootmode=040000,user_id=0,group_id=0,allow_other,default_permissions".
> Does this sound reasonable?

That simplifies things for users and I've never seen a need to change
these options in virtio-fs. If we need the control for some reason in
the future we can add options back in again.

> I think we also need something in
> "Documentation/filesystems/virtiofs.rst" which describes the design
> (how request gets to userspace and back) and how to set up the
> server, etc... Stefan, Vivek can you do something like that?

Sure. I'll send a patch.

Stefan



2019-08-29 13:32:34

by Vivek Goyal

Subject: Re: [PATCH v3 00/13] virtio-fs: shared file system for virtual machines

On Thu, Aug 29, 2019 at 11:28:27AM +0200, Miklos Szeredi wrote:
> [..]
>
> Pushed out to
>
> git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse.git#for-next
>

Hi Miklos,

Compilation of virtio-fs as a module fails, while it works when built in.

fs/fuse/virtio_fs.c: In function ‘copy_args_to_argbuf’:
fs/fuse/virtio_fs.c:255:5: error: ‘struct fuse_req’ has no member named ‘argbuf’
req->argbuf = kmalloc(len, GFP_ATOMIC);

It can't find req->argbuf.

I noticed that you have put an ifdef around argbuf.

#ifdef CONFIG_VIRTIO_FS
/** virtio-fs's physically contiguous buffer for in and out args */
void *argbuf;
#endif

It should have worked. Not sure why it is not working.

Vivek

2019-08-29 13:44:50

by Miklos Szeredi

Subject: Re: [PATCH v3 00/13] virtio-fs: shared file system for virtual machines

On Thu, Aug 29, 2019 at 3:29 PM Vivek Goyal <[email protected]> wrote:

> #ifdef CONFIG_VIRTIO_FS
> /** virtio-fs's physically contiguous buffer for in and out args */
> void *argbuf;
> #endif
>
> It should have worked. Not sure why it is not working.

Needs to be changed to

#if IS_ENABLED(CONFIG_VIRTIO_FS)

Pushed out fixed version.
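
For context: with CONFIG_VIRTIO_FS=m, kconfig defines
CONFIG_VIRTIO_FS_MODULE rather than CONFIG_VIRTIO_FS, so the plain #ifdef
is false in a modular build. IS_ENABLED() is true for both =y and =m,
roughly equivalent to:

#if defined(CONFIG_VIRTIO_FS) || defined(CONFIG_VIRTIO_FS_MODULE)
	/* present for both built-in and modular builds */
	void *argbuf;
#endif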

Thanks,
Miklos

2019-08-29 14:33:19

by Vivek Goyal

Subject: Re: [PATCH v3 00/13] virtio-fs: shared file system for virtual machines

On Thu, Aug 29, 2019 at 03:41:26PM +0200, Miklos Szeredi wrote:
> On Thu, Aug 29, 2019 at 3:29 PM Vivek Goyal <[email protected]> wrote:
>
> > #ifdef CONFIG_VIRTIO_FS
> > /** virtio-fs's physically contiguous buffer for in and out args */
> > void *argbuf;
> > #endif
> >
> > It should have worked. Not sure why it is not working.
>
> Needs to be changed to
>
> #if IS_ENABLED(CONFIG_VIRTIO_FS)
>
> Pushed out fixed version.

Cool. That works. I hit another compilation error with "make allmodconfig".

HDRINST usr/include/linux/virtio_fs.h
error: include/uapi/linux/virtio_fs.h: missing "WITH Linux-syscall-note" for SPDX-License-Identifier
make[1]: *** [scripts/Makefile.headersinst:66: usr/include/linux/virtio_fs.h] Error 1

Looks like include/uapi/linux/virtio_fs.h needs the following.

-/* SPDX-License-Identifier: GPL-2.0 OR BSD-3-Clause */
+/* SPDX-License-Identifier: ((GPL-2.0 WITH Linux-syscall-note) OR BSD-3-Clause) */

Vivek

2019-08-29 14:48:49

by Miklos Szeredi

Subject: Re: [PATCH v3 00/13] virtio-fs: shared file system for virtual machines

On Thu, Aug 29, 2019 at 4:31 PM Vivek Goyal <[email protected]> wrote:

> Cool. That works. I hit another compilation error with "make allmodconfig".
>
> HDRINST usr/include/linux/virtio_fs.h
> error: include/uapi/linux/virtio_fs.h: missing "WITH Linux-syscall-note" for SPDX-License-Identifier
> make[1]: *** [scripts/Makefile.headersinst:66: usr/include/linux/virtio_fs.h] Error 1
>
> Looks like include/uapi/linux/virtio_fs.h needs the following.
>
> -/* SPDX-License-Identifier: GPL-2.0 OR BSD-3-Clause */
> +/* SPDX-License-Identifier: ((GPL-2.0 WITH Linux-syscall-note) OR BSD-3-Clause) */

Fixed and pushed.

Thanks,
Miklos

2019-08-29 16:02:53

by Vivek Goyal

Subject: Re: [PATCH v3 00/13] virtio-fs: shared file system for virtual machines

On Thu, Aug 29, 2019 at 11:28:27AM +0200, Miklos Szeredi wrote:

[..]
> There are miscellaneous changes, so needs to be thoroughly tested.

Hi Miklos,

First round of tests passed. Ran pjdfstests, blogbench and a bunch of fio
jobs, and everything looks good.

Thanks
Vivek

2019-08-31 05:50:09

by Miklos Szeredi

Subject: Re: [PATCH v3 00/13] virtio-fs: shared file system for virtual machines

On Thu, Aug 29, 2019 at 6:01 PM Vivek Goyal <[email protected]> wrote:
>
> On Thu, Aug 29, 2019 at 11:28:27AM +0200, Miklos Szeredi wrote:
>
> [..]
> > There are miscellaneous changes, so needs to be thoroughly tested.
>
> Hi Miklos,
>
> First round of tests passed. Ran pjdfstests, blogbench and a bunch of fio
> jobs, and everything looks good.

fsx-linux with cache=always revealed a couple of bugs in the argpages
case. Pushed fixes for those too.

Thanks,
Miklos

2019-09-03 08:07:45

by Miklos Szeredi

Subject: Re: [PATCH v3 00/13] virtio-fs: shared file system for virtual machines

[Cc: [email protected], "Michael S. Tsirkin"
<[email protected]>, Jason Wang <[email protected]>]

It'd be nice to have an ACK for this from the virtio maintainers.

Thanks,
Miklos

On Wed, Aug 21, 2019 at 7:38 PM Vivek Goyal <[email protected]> wrote:
> [..]

2019-09-03 08:35:16

by Michael S. Tsirkin

Subject: Re: [PATCH v3 00/13] virtio-fs: shared file system for virtual machines

On Tue, Sep 03, 2019 at 10:05:02AM +0200, Miklos Szeredi wrote:
> [Cc: [email protected], "Michael S. Tsirkin"
> <[email protected]>, Jason Wang <[email protected]>]
>
> It'd be nice to have an ACK for this from the virtio maintainers.
>
> Thanks,
> Miklos

Can the patches themselves be posted to the relevant list(s) please?
If possible, please also include "v3" in all patches so they are
easier to find.

I poked at
https://lwn.net/ml/linux-kernel/[email protected]/
specifically
https://lwn.net/ml/linux-kernel/[email protected]/
and things like:
+ /* TODO lock */
give me pause.

Cleanup generally seems broken to me - what pauses the FS

What about the rest of TODOs in that file?

use of usleep is hacky - can't we do better e.g. with a
completion?

Some typos - e.g. "reuests".


>
> On Wed, Aug 21, 2019 at 7:38 PM Vivek Goyal <[email protected]> wrote:
> > [..]
> >
> > fs/fuse/Kconfig | 11 +
> > fs/fuse/Makefile | 1 +
> > fs/fuse/control.c | 4 +-
> > fs/fuse/cuse.c | 4 +-
> > fs/fuse/dev.c | 89 ++-
> > fs/fuse/dir.c | 26 +-
> > fs/fuse/file.c | 15 +-
> > fs/fuse/fuse_i.h | 120 +++-
> > fs/fuse/inode.c | 203 +++---
> > fs/fuse/virtio_fs.c | 1061 +++++++++++++++++++++++++++++++
> > fs/splice.c | 3 +-
> > include/linux/fs.h | 2 +
> > include/uapi/linux/virtio_fs.h | 41 ++
> > include/uapi/linux/virtio_ids.h | 1 +
> > init/do_mounts.c | 10 +
> > 15 files changed, 1462 insertions(+), 129 deletions(-)
> > create mode 100644 fs/fuse/virtio_fs.c
> > create mode 100644 include/uapi/linux/virtio_fs.h

Don't the new files need a MAINTAINERS entry?
I think we want [email protected] to be
copied.



2019-09-03 09:19:06

by Miklos Szeredi

Subject: Re: [PATCH v3 00/13] virtio-fs: shared file system for virtual machines

On Tue, Sep 3, 2019 at 10:31 AM Michael S. Tsirkin <[email protected]> wrote:

> Can the patches themselves be posted to the relevant list(s) please?
> If possible, please also include "v3" in all patches so they are
> easier to find.

I'll post a v4.

> I poked at
> https://lwn.net/ml/linux-kernel/[email protected]/
> specifically
> https://lwn.net/ml/linux-kernel/[email protected]/
> and things like:
> + /* TODO lock */
> give me pause.
>
> Cleanup generally seems broken to me - what pauses the FS
>
> What about the rest of TODOs in that file?

AFAICS, some of those are QOI issues, and some are long term
performance items. I see no blockers or things that would pose a
security concern.

That said, it would be nice to get rid of them at some point.

> use of usleep is hacky - can't we do better e.g. with a
> completion?

Yes. I have a different implementation, but was planning to leave
this one in until the dust settles.

> Some typos - e.g. "reuests".

Fixed.

> > > [..]
>
> Don't the new files need a MAINTAINERS entry?
> I think we want [email protected] to be
> copied.

Yep.

Stefan, do you want to formally maintain this file?

Thanks,
Miklos

2019-09-03 14:09:40

by Vivek Goyal

Subject: Re: [PATCH v3 00/13] virtio-fs: shared file system for virtual machines

On Tue, Sep 03, 2019 at 04:31:38AM -0400, Michael S. Tsirkin wrote:

[..]
> + /* TODO lock */
> give me pause.
>
> Cleanup generally seems broken to me - what pauses the FS

I am looking into the device removal aspect of it now. I am thinking of
adding a reference count to the virtiofs device and possibly also a bit
flag to indicate whether the device is still alive. That way, we should be
able to clean up the device more gracefully.

>
> What about the rest of TODOs in that file?

I will also take a closer look at the TODOs now. A better device cleanup
path might get rid of some of them, and some of them might not be valid
anymore.

>
> use of usleep is hacky - can't we do better e.g. with a
> completion?

Agreed.

Thanks
Vivek

2019-09-03 14:13:45

by Michael S. Tsirkin

Subject: Re: [PATCH v3 00/13] virtio-fs: shared file system for virtual machines

On Tue, Sep 03, 2019 at 10:07:52AM -0400, Vivek Goyal wrote:
> [..]

Generally, the way to clean things up is to first disconnect the device
from Linux so Linux won't send new requests, then wait for old ones to
finish.

2019-09-03 14:20:19

by Vivek Goyal

Subject: Re: [PATCH v3 00/13] virtio-fs: shared file system for virtual machines

On Tue, Sep 03, 2019 at 10:12:16AM -0400, Michael S. Tsirkin wrote:
> [..]
>
> Generally, the way to clean things up is to first disconnect the device
> from Linux so Linux won't send new requests, then wait for old ones to
> finish.

I was thinking of the following (a rough sketch follows the list).

- Set a flag on the device to indicate it is dead and that new requests
should not be queued. The device removal call sets this flag.

- Return errors when the fs code tries to queue a new request.

- Drop the device creation reference in the device removal path. If the
device is mounted at the time of removal, the mount's reference will still
be active and the in-kernel device state will not be cleaned up yet.

- When the user unmounts the fs, the last reference to the device is
dropped, which leads to cleanup of the device's in-kernel state.
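
A rough sketch of that lifecycle (all names are hypothetical, not from
the posted patches):

struct virtio_fs {
	struct kref refcount;	/* creation ref + one per mount */
	bool dead;		/* set at removal; fail new requests */
	spinlock_t lock;	/* protects 'dead' */
};

static void virtio_fs_free(struct kref *kref)
{
	kfree(container_of(kref, struct virtio_fs, refcount));
}

/* fs code checks this before queueing a request */
static int virtio_fs_alive(struct virtio_fs *fs)
{
	int ret;

	spin_lock(&fs->lock);
	ret = fs->dead ? -ENOTCONN : 0;
	spin_unlock(&fs->lock);
	return ret;
}

/* device removal path: mark dead and drop the creation reference */
static void virtio_fs_removed(struct virtio_fs *fs)
{
	spin_lock(&fs->lock);
	fs->dead = true;
	spin_unlock(&fs->lock);
	kref_put(&fs->refcount, virtio_fs_free);
}

/* umount drops its reference; the last put frees the device state */
static void virtio_fs_put(struct virtio_fs *fs)
{
	kref_put(&fs->refcount, virtio_fs_free);
}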

Does that sound reasonable?

Vivek

2019-09-03 17:17:21

by Michael S. Tsirkin

Subject: Re: [PATCH v3 00/13] virtio-fs: shared file system for virtual machines

On Tue, Sep 03, 2019 at 10:18:51AM -0400, Vivek Goyal wrote:
> [..]

Just be aware of the fact that the virtio device, all vqs etc. will be
gone by the time remove returns.


--
MST

2019-09-04 15:57:31

by Stefan Hajnoczi

Subject: Re: [PATCH v3 00/13] virtio-fs: shared file system for virtual machines

On Tue, Sep 03, 2019 at 11:17:35AM +0200, Miklos Szeredi wrote:
> On Tue, Sep 3, 2019 at 10:31 AM Michael S. Tsirkin <[email protected]> wrote:
> > > > [..]
> >
> > Don't the new files need a MAINTAINERS entry?
> > I think we want [email protected] to be
> > copied.
>
> Yep.
>
> Stefan, do you want to formally maintain this file?

Vivek has been doing most of the kernel work lately and I would suggest
that he acts as maintainer.

But I'm happy to be added if you want two people or if Vivek is
unwilling.

Stefan

