Hi,
Here are RFC patches for virtio-fs. Looking for feedback on this approach.
These patches should apply on top of 4.20-rc5. We have also put the code for
the various components here:
https://gitlab.com/virtio-fs
Problem Description
===================
We want to be able to take a directory tree on the host and share it with
one or more guests. Our goal is to do this in a fast, consistent and secure
manner. Our primary use case is Kata Containers, but it should be usable in
other scenarios as well.
Containers may rely on local file system semantics for shared volumes,
i.e. read-write mounts that multiple containers access simultaneously. File
system changes must be visible to other containers with the same consistency
expected of a local file system, including mmap MAP_SHARED.
Existing Solutions
==================
We looked at existing solutions. virtio-9p already provides basic shared
file system functionality, but it does not offer local file system semantics,
causing some workloads and test suites to fail. In addition, virtio-9p
performance has been an issue for Kata Containers, and we believe this cannot
be alleviated without major changes that do not fit into the 9P protocol.
Design Overview
===============
With the goal of designing something with better performance and local file
system semantics, several ideas were proposed.
- Use the FUSE protocol (instead of 9p) for communication between the guest
  and the host. The guest kernel acts as the FUSE client and a FUSE server
  runs on the host to serve the requests. Benchmark results (see below) are
  encouraging and show this approach performs well (2x to 8x improvement
  depending on the test being run).
- For data access inside the guest, mmap a portion of the file into the QEMU
  address space and let the guest access this memory using DAX. That way the
  guest page cache is bypassed and there is only one copy of the data (on
  the host). This will also enable mmap(MAP_SHARED) between guests.
- For metadata coherency, there is a shared memory region which contains a
  version number associated with the metadata. Any guest changing metadata
  updates the version number, and other guests refresh their metadata on the
  next access. This is still experimental and the implementation is not
  complete.
How virtio-fs differs from existing approaches
==============================================
The unique idea behind virtio-fs is to take advantage of the co-location
of the virtual machine and hypervisor to avoid communication (vmexits).
DAX allows file contents to be accessed without communication with the
hypervisor. The shared memory region for metadata avoids communication in
the common case where metadata is unchanged.
By replacing expensive communication with cheaper shared memory accesses,
we expect to achieve better performance than approaches based on network
file system protocols. In addition, this also makes it easier to achieve
local file system semantics (coherency).
These techniques are not applicable to network file system protocols, since
they bypass the communications channel by taking advantage of shared memory
on a local machine. This is why we decided to build virtio-fs rather than
focus on 9P or NFS.
HOWTO
=====
We have put instructions on how to use it here:
https://virtio-fs.gitlab.io/
Caching Modes
=============
Like virtio-9p, different caching modes are supported which determine the
coherency level as well. The “cache=FOO” and “writeback” options control the
level of coherence between the guest and host filesystems. The “shared” option
only has an effect on coherence between virtio-fs filesystem instances
running inside different guests.
- cache=none
metadata, data and pathname lookup are not cached in guest. They are always
fetched from host and any changes are immediately pushed to host.
- cache=always
metadata, data and pathname lookup are cached in guest and never expire.
- cache=auto
metadata and pathname lookup cache expires after a configured amount of time
(default is 1 second). Data is cached while the file is open (close to open
consistency).
- writeback/no_writeback
These options control the writeback strategy. If writeback is disabled,
then normal writes will immediately be synchronized with the host fs. If
writeback is enabled, then writes may be cached in the guest until the file
is closed or an fsync(2) is performed. This option has no effect on mmapped
writes or writes going through the DAX mechanism.
- shared/no_shared
These options control the use of the shared version table. If shared mode
is enabled, then metadata and pathname lookup results are cached in the
guest, but are refreshed when another virtio-fs instance makes changes.
DAX
===
- DAX can be turned on or off when mounting virtio-fs inside the guest.
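For reference, mounting inside the guest looks roughly like this (a sketch;
the tag "myfs", the mount point and the exact option spelling are assumptions,
see the HOWTO page above for the authoritative instructions):

```shell
# Hypothetical mount invocations; "myfs" is whatever tag the host-side
# device was configured with, and option names may differ in this RFC.

# Uncached mode with DAX enabled:
mount -t virtio_fs myfs /mnt/virtio-fs -o cache=none,dax

# Cached mode with writeback, no DAX:
mount -t virtio_fs myfs /mnt/virtio-fs -o cache=always,writeback
```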
WHAT WORKS
==========
- As of now, the cache options none, auto and always are working. The
  shared option is still being worked on.
- DAX on/off works. It does not seem to be as fast as we were expecting,
  and we still need to look into optimization opportunities.
TODO
====
- Complete the "cache=shared" implementation.
- Look into improving DAX performance. It seems slow.
- Lots of bug fixing, cleanup and performance improvements.
RESULTS
=======
- pjdfstests are passing. We have tried cache=none/auto/always and dax on/off.
  https://github.com/pjd/pjdfstest
  (One symlink test fails, and that seems to be due to xfs on the host. We
  have yet to look into it.)
- We have run some basic tests and compared with virtio-9p, and virtio-fs
  seems to be faster. We ran the "smallfile" utility and a simple fio job to
  test mmap performance.
Test Setup
----------
- A Fedora 28 host with 32G RAM, 2 sockets (6 cores per socket, 2
  threads per core).
- A PCIe SSD on the host as the backing store.
- A VM with 16 vCPUs, 6GB of memory, and a 2GB cache window (for DAX
  mmap).
fio mmap
--------
Wrote a simple fio job to run mmap READs. Ran the test on 1 file and on 4
files with different caching modes. File size is 4G. Caches were dropped in
the guest before each run; the cache on the host was untouched, so the data
on the host was most likely cached. These results are the average of 3 runs.
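The job file was roughly along the following lines (a reconstruction, not the
exact job used for these numbers; the job file name, block size and directory
are assumptions):

```shell
# Hypothetical reconstruction of the fio job: mmap ioengine, sequential
# read of a 4G file, one thread (bump numjobs for the 4-file case).
cat > mmap-read.fio <<'EOF'
[global]
ioengine=mmap
rw=read
bs=4k
size=4g
directory=/mnt/virtio-fs

[mmap-read]
numjobs=1
EOF

# Drop guest caches, then run:
# echo 3 > /proc/sys/vm/drop_caches
# fio mmap-read.fio
```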
cache mode              1-file (1 thread)   4-files (4 threads)
virtio-9p mmap           28 MB/s            140 MB/s
virtio-fs none + dax    126 MB/s            501 MB/s
virtio-9p loose          31 MB/s            135 MB/s
virtio-fs always        235 MB/s            858 MB/s
virtio-fs always + dax  121 MB/s            487 MB/s
smallfile
---------
https://github.com/distributed-system-analysis/smallfile
We ran a bunch of operations (create, ls-l, read, append, rename and
delete-renamed), measured performance over 3 runs and took the average.
Caches were dropped before each operation started running. Effectively, the
following command was used for each operation:
# python smallfile_cli.py --operation create --threads 8 --file-size 1024 --files 2048 --top <test-dir>
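The whole sequence can be scripted along these lines (a sketch, not the exact
script we used; /mnt/virtio-fs is a hypothetical placeholder for the test
directory, and the drop_caches step needs root):

```shell
# Sketch: run each smallfile operation in order, dropping caches first.
# /mnt/virtio-fs is a hypothetical mount point; substitute your test dir.
for op in create ls-l read append rename delete-renamed; do
    sync
    echo 3 > /proc/sys/vm/drop_caches    # needs root
    python smallfile_cli.py --operation "$op" --threads 8 \
        --file-size 1024 --files 2048 --top /mnt/virtio-fs
done
```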
cache mode              operation        files/sec
virtio-9p none          create             194
virtio-fs none          create             714
virtio-9p mmap          create             201
virtio-fs none + dax    create             759
virtio-9p loose         create              16
virtio-fs always        create             685
virtio-fs always + dax  create             735

virtio-9p none          ls-l              2038
virtio-fs none          ls-l              4615
virtio-9p mmap          ls-l              2087
virtio-fs none + dax    ls-l              4616
virtio-9p loose         ls-l              1619
virtio-fs always        ls-l             13571
virtio-fs always + dax  ls-l             12626

virtio-9p none          read               199
virtio-fs none          read              1405
virtio-9p mmap          read               203
virtio-fs none + dax    read              1345
virtio-9p loose         read               207
virtio-fs always        read              1436
virtio-fs always + dax  read              1368

virtio-9p none          append             197
virtio-fs none          append             717
virtio-9p mmap          append             200
virtio-fs none + dax    append             645
virtio-9p loose         append              16
virtio-fs always        append             651
virtio-fs always + dax  append             704

virtio-9p none          rename            2442
virtio-fs none          rename            5797
virtio-9p mmap          rename            2518
virtio-fs none + dax    rename            6386
virtio-9p loose         rename            4178
virtio-fs always        rename           15834
virtio-fs always + dax  rename           15529
Thanks
Vivek
Dr. David Alan Gilbert (5):
virtio-fs: Add VIRTIO_PCI_CAP_SHARED_MEMORY_CFG and utility to find
them
virito-fs: Make dax optional
virtio: Free fuse devices on umount
virtio-fs: Retrieve shm capabilities for version table
virtio-fs: Map using the values from the capabilities
Miklos Szeredi (8):
fuse: simplify fuse_fill_super_common() calling
fuse: delete dentry if timeout is zero
fuse: multiplex cached/direct_io/dax file operations
virtio-fs: pass version table pointer to fuse
fuse: don't crash if version table is NULL
fuse: add shared version support (virtio-fs only)
fuse: shared version cleanups
fuse: fix fuse_permission() for the default_permissions case
Stefan Hajnoczi (17):
fuse: add skeleton virtio_fs.ko module
fuse: add probe/remove virtio driver
fuse: rely on mutex_unlock() barrier instead of fput()
fuse: extract fuse_fill_super_common()
virtio_fs: get mount working
fuse: export fuse_end_request()
fuse: export fuse_len_args()
fuse: add fuse_iqueue_ops callbacks
fuse: process requests queues
fuse: export fuse_get_unique()
fuse: implement FUSE_FORGET for virtio-fs
virtio_fs: Set up dax_device
dax: remove block device dependencies
fuse: add fuse_conn->dax_dev field
fuse: map virtio_fs DAX window BAR
fuse: Implement basic DAX read/write support commands
fuse: add DAX mmap support
Vivek Goyal (22):
virtio-fs: Retrieve shm capabilities for cache
virtio-fs: Map cache using the values from the capabilities
Limit number of pages returned by direct_access()
fuse: Introduce fuse_dax_mapping
Create a list of free memory ranges
fuse: Introduce setupmapping/removemapping commands
Introduce interval tree basic data structures
fuse: Maintain a list of busy elements
Do fallocate() to grow file before mapping for file growing writes
dax: Pass dax_dev to dax_writeback_mapping_range()
fuse: Define dax address space operations
fuse, dax: Take ->i_mmap_sem lock during dax page fault
fuse: Add logic to free up a memory range
fuse: Add logic to do direct reclaim of memory
fuse: Kick worker when free memory drops below 20% of total ranges
Dispatch FORGET requests later instead of dropping them
Release file in process context
fuse: Do not block on inode lock while freeing memory range
fuse: Reschedule dax free work if too many EAGAIN attempts
fuse: Wait for memory ranges to become free
fuse: Take inode lock for dax inode truncation
fuse: Clear setuid bit even in direct I/O path
drivers/dax/super.c | 3 +-
fs/dax.c | 23 +-
fs/ext4/inode.c | 2 +-
fs/fuse/Kconfig | 11 +
fs/fuse/Makefile | 1 +
fs/fuse/cuse.c | 3 +-
fs/fuse/dev.c | 80 ++-
fs/fuse/dir.c | 282 +++++++--
fs/fuse/file.c | 1012 +++++++++++++++++++++++++++--
fs/fuse/fuse_i.h | 234 ++++++-
fs/fuse/inode.c | 278 ++++++--
fs/fuse/readdir.c | 12 +-
fs/fuse/virtio_fs.c | 1336 +++++++++++++++++++++++++++++++++++++++
fs/splice.c | 3 +-
fs/xfs/xfs_aops.c | 2 +-
include/linux/dax.h | 6 +-
include/linux/fs.h | 2 +
include/uapi/linux/fuse.h | 39 ++
include/uapi/linux/virtio_fs.h | 46 ++
include/uapi/linux/virtio_ids.h | 1 +
include/uapi/linux/virtio_pci.h | 10 +
21 files changed, 3151 insertions(+), 235 deletions(-)
create mode 100644 fs/fuse/virtio_fs.c
create mode 100644 include/uapi/linux/virtio_fs.h
--
2.13.6
This is done along the lines of ext4 and xfs. I primarily wanted the
->writepages hook at this time so that I could call into
dax_writeback_mapping_range(). This in turn will decide which pfns need to
be written back and call dax_flush() on those.
Signed-off-by: Vivek Goyal <[email protected]>
---
fs/fuse/file.c | 19 +++++++++++++++++++
1 file changed, 19 insertions(+)
diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index 5230f2d84a14..eb12776f5ff6 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -2306,6 +2306,17 @@ static int fuse_writepages_fill(struct page *page,
return err;
}
+static int fuse_dax_writepages(struct address_space *mapping,
+ struct writeback_control *wbc)
+{
+
+ struct inode *inode = mapping->host;
+ struct fuse_conn *fc = get_fuse_conn(inode);
+
+ return dax_writeback_mapping_range(mapping,
+ NULL, fc->dax_dev, wbc);
+}
+
static int fuse_writepages(struct address_space *mapping,
struct writeback_control *wbc)
{
@@ -3646,6 +3657,13 @@ static const struct address_space_operations fuse_file_aops = {
.write_end = fuse_write_end,
};
+static const struct address_space_operations fuse_dax_file_aops = {
+ .writepages = fuse_dax_writepages,
+ .direct_IO = noop_direct_IO,
+ .set_page_dirty = noop_set_page_dirty,
+ .invalidatepage = noop_invalidatepage,
+};
+
void fuse_init_file_inode(struct inode *inode)
{
struct fuse_inode *fi = get_fuse_inode(inode);
@@ -3664,5 +3682,6 @@ void fuse_init_file_inode(struct inode *inode)
if (fc->dax_dev) {
inode->i_flags |= S_DAX;
inode->i_fop = &fuse_dax_file_operations;
+ inode->i_data.a_ops = &fuse_dax_file_aops;
}
}
--
2.13.6
From: Miklos Szeredi <[email protected]>
Don't hold onto a dentry in the lru list if we need to look it up again
anyway at the next access.
A more advanced version of this patch would periodically flush out dentries
from the lru which have gone stale.
Signed-off-by: Miklos Szeredi <[email protected]>
---
fs/fuse/dir.c | 26 +++++++++++++++++++++++---
1 file changed, 23 insertions(+), 3 deletions(-)
diff --git a/fs/fuse/dir.c b/fs/fuse/dir.c
index 47395b0c3b35..b7e6e421f6bb 100644
--- a/fs/fuse/dir.c
+++ b/fs/fuse/dir.c
@@ -29,12 +29,26 @@ union fuse_dentry {
struct rcu_head rcu;
};
-static inline void fuse_dentry_settime(struct dentry *entry, u64 time)
+static void fuse_dentry_settime(struct dentry *dentry, u64 time)
{
- ((union fuse_dentry *) entry->d_fsdata)->time = time;
+ /*
+ * Mess with DCACHE_OP_DELETE because dput() will be faster without it.
+ * Don't care about races, either way it's just an optimization
+ */
+ if ((time && (dentry->d_flags & DCACHE_OP_DELETE)) ||
+ (!time && !(dentry->d_flags & DCACHE_OP_DELETE))) {
+ spin_lock(&dentry->d_lock);
+ if (time)
+ dentry->d_flags &= ~DCACHE_OP_DELETE;
+ else
+ dentry->d_flags |= DCACHE_OP_DELETE;
+ spin_unlock(&dentry->d_lock);
+ }
+
+ ((union fuse_dentry *) dentry->d_fsdata)->time = time;
}
-static inline u64 fuse_dentry_time(struct dentry *entry)
+static inline u64 fuse_dentry_time(const struct dentry *entry)
{
return ((union fuse_dentry *) entry->d_fsdata)->time;
}
@@ -270,8 +284,14 @@ static void fuse_dentry_release(struct dentry *dentry)
kfree_rcu(fd, rcu);
}
+static int fuse_dentry_delete(const struct dentry *dentry)
+{
+ return time_before64(fuse_dentry_time(dentry), get_jiffies_64());
+}
+
const struct dentry_operations fuse_dentry_operations = {
.d_revalidate = fuse_dentry_revalidate,
+ .d_delete = fuse_dentry_delete,
.d_init = fuse_dentry_init,
.d_release = fuse_dentry_release,
};
--
2.13.6
Retrieve the capabilities needed to find the cache.
Signed-off-by: Dr. David Alan Gilbert <[email protected]>
---
fs/fuse/virtio_fs.c | 15 +++++++++++++++
include/uapi/linux/virtio_fs.h | 3 +++
2 files changed, 18 insertions(+)
diff --git a/fs/fuse/virtio_fs.c b/fs/fuse/virtio_fs.c
index cd916943205e..60d496c16841 100644
--- a/fs/fuse/virtio_fs.c
+++ b/fs/fuse/virtio_fs.c
@@ -520,6 +520,8 @@ static int virtio_fs_setup_dax(struct virtio_device *vdev, struct virtio_fs *fs)
phys_addr_t phys_addr;
size_t len;
int ret;
+ u8 have_cache, cache_bar;
+ u64 cache_offset, cache_len;
if (!IS_ENABLED(CONFIG_DAX_DRIVER))
return 0;
@@ -535,6 +537,19 @@ static int virtio_fs_setup_dax(struct virtio_device *vdev, struct virtio_fs *fs)
if (ret < 0)
return ret;
+ have_cache = virtio_pci_find_shm_cap(pci_dev,
+ VIRTIO_FS_PCI_SHMCAP_ID_CACHE, &cache_bar,
+ &cache_offset, &cache_len);
+
+ if (!have_cache) {
+ dev_err(&vdev->dev, "%s: No cache capability\n",
+ __func__);
+ return -ENXIO;
+ } else {
+ dev_notice(&vdev->dev, "Cache bar: %d len: 0x%llx @ 0x%llx\n",
+ cache_bar, cache_len, cache_offset);
+ }
+
/* TODO handle case where device doesn't expose BAR? */
ret = pci_request_region(pci_dev, VIRTIO_FS_WINDOW_BAR,
"virtio-fs-window");
diff --git a/include/uapi/linux/virtio_fs.h b/include/uapi/linux/virtio_fs.h
index 48f3590dcfbe..65a9d4a0dac0 100644
--- a/include/uapi/linux/virtio_fs.h
+++ b/include/uapi/linux/virtio_fs.h
@@ -38,4 +38,7 @@ struct virtio_fs_config {
__u32 num_queues;
} __attribute__((packed));
+/* For the id field in virtio_pci_shm_cap */
+#define VIRTIO_FS_PCI_SHMCAP_ID_CACHE 0
+
#endif /* _UAPI_LINUX_VIRTIO_FS_H */
--
2.13.6
From: Miklos Szeredi <[email protected]>
Add more fields to "struct fuse_mount_data" so that fewer parameters
have to be passed to fuse_fill_super_common().
Signed-off-by: Miklos Szeredi <[email protected]>
---
fs/fuse/fuse_i.h | 22 +++++++++++++---------
fs/fuse/inode.c | 27 ++++++++++++++-------------
fs/fuse/virtio_fs.c | 10 +++++++---
3 files changed, 34 insertions(+), 25 deletions(-)
diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
index f0775d76e31f..fb49ca9d05ac 100644
--- a/fs/fuse/fuse_i.h
+++ b/fs/fuse/fuse_i.h
@@ -77,6 +77,18 @@ struct fuse_mount_data {
unsigned dax:1;
unsigned max_read;
unsigned blksize;
+
+ /* DAX device, may be NULL */
+ struct dax_device *dax_dev;
+
+ /* fuse input queue operations */
+ const struct fuse_iqueue_ops *fiq_ops;
+
+ /* device-specific state for fuse_iqueue */
+ void *fiq_priv;
+
+ /* fuse_dev pointer to fill in, should contain NULL on entry */
+ void **fudptr;
};
/* One forget request */
@@ -1073,17 +1085,9 @@ int parse_fuse_opt(char *opt, struct fuse_mount_data *d, int is_bdev,
* Fill in superblock and initialize fuse connection
* @sb: partially-initialized superblock to fill in
* @mount_data: mount parameters
- * @dax_dev: DAX device, may be NULL
- * @fiq_ops: fuse input queue operations
- * @fiq_priv: device-specific state for fuse_iqueue
- * @fudptr: fuse_dev pointer to fill in, should contain NULL on entry
*/
int fuse_fill_super_common(struct super_block *sb,
- struct fuse_mount_data *mount_data,
- struct dax_device *dax_dev,
- const struct fuse_iqueue_ops *fiq_ops,
- void *fiq_priv,
- void **fudptr);
+ struct fuse_mount_data *mount_data);
/**
* Disassociate fuse connection from superblock and kill the superblock
diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
index 403360e352d8..075997977cfd 100644
--- a/fs/fuse/inode.c
+++ b/fs/fuse/inode.c
@@ -1149,11 +1149,7 @@ void fuse_dev_free(struct fuse_dev *fud)
EXPORT_SYMBOL_GPL(fuse_dev_free);
int fuse_fill_super_common(struct super_block *sb,
- struct fuse_mount_data *mount_data,
- struct dax_device *dax_dev,
- const struct fuse_iqueue_ops *fiq_ops,
- void *fiq_priv,
- void **fudptr)
+ struct fuse_mount_data *mount_data)
{
struct fuse_dev *fud;
struct fuse_conn *fc;
@@ -1201,11 +1197,12 @@ int fuse_fill_super_common(struct super_block *sb,
if (!fc)
goto err;
- fuse_conn_init(fc, sb->s_user_ns, dax_dev, fiq_ops, fiq_priv);
+ fuse_conn_init(fc, sb->s_user_ns, mount_data->dax_dev,
+ mount_data->fiq_ops, mount_data->fiq_priv);
fc->release = fuse_free_conn;
- if (dax_dev) {
- err = fuse_dax_mem_range_init(fc, dax_dev);
+ if (mount_data->dax_dev) {
+ err = fuse_dax_mem_range_init(fc, mount_data->dax_dev);
if (err) {
pr_debug("fuse_dax_mem_range_init() returned %d\n", err);
goto err_put_conn;
@@ -1259,7 +1256,7 @@ int fuse_fill_super_common(struct super_block *sb,
mutex_lock(&fuse_mutex);
err = -EINVAL;
- if (*fudptr)
+ if (*mount_data->fudptr)
goto err_unlock;
err = fuse_ctl_add_conn(fc);
@@ -1268,7 +1265,7 @@ int fuse_fill_super_common(struct super_block *sb,
list_add_tail(&fc->entry, &fuse_conn_list);
sb->s_root = root_dentry;
- *fudptr = fud;
+ *mount_data->fudptr = fud;
/*
* mutex_unlock() provides the necessary memory barrier for
* *fudptr to be visible on all CPUs after this
@@ -1288,7 +1285,7 @@ int fuse_fill_super_common(struct super_block *sb,
err_dev_free:
fuse_dev_free(fud);
err_free_ranges:
- if (dax_dev)
+ if (mount_data->dax_dev)
fuse_free_dax_mem_ranges(&fc->free_ranges);
err_put_conn:
fuse_conn_put(fc);
@@ -1323,8 +1320,12 @@ static int fuse_fill_super(struct super_block *sb, void *data, int silent)
(file->f_cred->user_ns != sb->s_user_ns))
goto err_fput;
- err = fuse_fill_super_common(sb, &d, NULL, &fuse_dev_fiq_ops, NULL,
- &file->private_data);
+ d.dax_dev = NULL;
+ d.fiq_ops = &fuse_dev_fiq_ops;
+ d.fiq_priv = NULL;
+ d.fudptr = &file->private_data;
+ err = fuse_fill_super_common(sb, &d);
+
err_fput:
fput(file);
err:
diff --git a/fs/fuse/virtio_fs.c b/fs/fuse/virtio_fs.c
index c79c9a885253..98dba3cf9d40 100644
--- a/fs/fuse/virtio_fs.c
+++ b/fs/fuse/virtio_fs.c
@@ -1053,9 +1053,13 @@ static int virtio_fs_fill_super(struct super_block *sb, void *data,
/* TODO this sends FUSE_INIT and could cause hiprio or notifications
* virtqueue races since they haven't been set up yet!
*/
- err = fuse_fill_super_common(sb, &d, d.dax ? fs->dax_dev : NULL,
- &virtio_fs_fiq_ops, fs,
- (void **)&fs->vqs[2].fud);
+
+ d.dax_dev = d.dax ? fs->dax_dev : NULL;
+ d.fiq_ops = &virtio_fs_fiq_ops;
+ d.fiq_priv = fs;
+ d.fudptr = (void **)&fs->vqs[2].fud;
+ err = fuse_fill_super_common(sb, &d);
+
if (err < 0)
goto err_fud;
--
2.13.6
From: Stefan Hajnoczi <[email protected]>
The /dev/fuse device uses fiq->waitq and fasync to signal that requests
are available. These mechanisms do not apply to virtio-fs. This patch
introduces callbacks so alternative behavior can be used.
Note that queue_interrupt() changes along these lines:
spin_lock(&fiq->waitq.lock);
wake_up_locked(&fiq->waitq);
+ kill_fasync(&fiq->fasync, SIGIO, POLL_IN);
spin_unlock(&fiq->waitq.lock);
- kill_fasync(&fiq->fasync, SIGIO, POLL_IN);
Since queue_request() and queue_forget() also call kill_fasync() inside
the spinlock this should be safe.
Signed-off-by: Stefan Hajnoczi <[email protected]>
---
fs/fuse/cuse.c | 2 +-
fs/fuse/dev.c | 50 ++++++++++++++++++++++++++++++++++----------------
fs/fuse/fuse_i.h | 46 +++++++++++++++++++++++++++++++++++++++++++++-
fs/fuse/inode.c | 18 +++++++++++++-----
4 files changed, 93 insertions(+), 23 deletions(-)
diff --git a/fs/fuse/cuse.c b/fs/fuse/cuse.c
index 8f68181256c0..98dc780cbafa 100644
--- a/fs/fuse/cuse.c
+++ b/fs/fuse/cuse.c
@@ -503,7 +503,7 @@ static int cuse_channel_open(struct inode *inode, struct file *file)
* Limit the cuse channel to requests that can
* be represented in file->f_cred->user_ns.
*/
- fuse_conn_init(&cc->fc, file->f_cred->user_ns);
+ fuse_conn_init(&cc->fc, file->f_cred->user_ns, &fuse_dev_fiq_ops, NULL);
fud = fuse_dev_alloc(&cc->fc);
if (!fud) {
diff --git a/fs/fuse/dev.c b/fs/fuse/dev.c
index 7fd627d5cf58..b26ee5ed8974 100644
--- a/fs/fuse/dev.c
+++ b/fs/fuse/dev.c
@@ -371,13 +371,33 @@ static unsigned int fuse_req_hash(u64 unique)
return hash_long(unique & ~FUSE_INT_REQ_BIT, FUSE_PQ_HASH_BITS);
}
-static void queue_request(struct fuse_iqueue *fiq, struct fuse_req *req)
+/**
+ * A new request is available, wake fiq->waitq
+ */
+static void fuse_dev_wake_and_unlock(struct fuse_iqueue *fiq)
+__releases(fiq->waitq.lock)
{
- req->in.h.len = sizeof(struct fuse_in_header) +
- fuse_len_args(req->in.numargs, (struct fuse_arg *) req->in.args);
- list_add_tail(&req->list, &fiq->pending);
wake_up_locked(&fiq->waitq);
kill_fasync(&fiq->fasync, SIGIO, POLL_IN);
+ spin_unlock(&fiq->waitq.lock);
+}
+
+const struct fuse_iqueue_ops fuse_dev_fiq_ops = {
+ .wake_forget_and_unlock = fuse_dev_wake_and_unlock,
+ .wake_interrupt_and_unlock = fuse_dev_wake_and_unlock,
+ .wake_pending_and_unlock = fuse_dev_wake_and_unlock,
+};
+EXPORT_SYMBOL_GPL(fuse_dev_fiq_ops);
+
+static void queue_request_and_unlock(struct fuse_iqueue *fiq,
+ struct fuse_req *req)
+__releases(fiq->waitq.lock)
+{
+ req->in.h.len = sizeof(struct fuse_in_header) +
+ fuse_len_args(req->in.numargs,
+ (struct fuse_arg *) req->in.args);
+ list_add_tail(&req->list, &fiq->pending);
+ fiq->ops->wake_pending_and_unlock(fiq);
}
void fuse_queue_forget(struct fuse_conn *fc, struct fuse_forget_link *forget,
@@ -392,12 +412,11 @@ void fuse_queue_forget(struct fuse_conn *fc, struct fuse_forget_link *forget,
if (fiq->connected) {
fiq->forget_list_tail->next = forget;
fiq->forget_list_tail = forget;
- wake_up_locked(&fiq->waitq);
- kill_fasync(&fiq->fasync, SIGIO, POLL_IN);
+ fiq->ops->wake_forget_and_unlock(fiq);
} else {
kfree(forget);
+ spin_unlock(&fiq->waitq.lock);
}
- spin_unlock(&fiq->waitq.lock);
}
static void flush_bg_queue(struct fuse_conn *fc)
@@ -413,8 +432,7 @@ static void flush_bg_queue(struct fuse_conn *fc)
fc->active_background++;
spin_lock(&fiq->waitq.lock);
req->in.h.unique = fuse_get_unique(fiq);
- queue_request(fiq, req);
- spin_unlock(&fiq->waitq.lock);
+ queue_request_and_unlock(fiq, req);
}
}
@@ -481,10 +499,10 @@ static void queue_interrupt(struct fuse_iqueue *fiq, struct fuse_req *req)
}
if (list_empty(&req->intr_entry)) {
list_add_tail(&req->intr_entry, &fiq->interrupts);
- wake_up_locked(&fiq->waitq);
+ fiq->ops->wake_interrupt_and_unlock(fiq);
+ } else {
+ spin_unlock(&fiq->waitq.lock);
}
- spin_unlock(&fiq->waitq.lock);
- kill_fasync(&fiq->fasync, SIGIO, POLL_IN);
}
static void request_wait_answer(struct fuse_conn *fc, struct fuse_req *req)
@@ -543,11 +561,10 @@ static void __fuse_request_send(struct fuse_conn *fc, struct fuse_req *req)
req->out.h.error = -ENOTCONN;
} else {
req->in.h.unique = fuse_get_unique(fiq);
- queue_request(fiq, req);
/* acquire extra reference, since request is still needed
after fuse_request_end() */
__fuse_get_request(req);
- spin_unlock(&fiq->waitq.lock);
+ queue_request_and_unlock(fiq, req);
request_wait_answer(fc, req);
/* Pairs with smp_wmb() in fuse_request_end() */
@@ -680,10 +697,11 @@ static int fuse_request_send_notify_reply(struct fuse_conn *fc,
req->in.h.unique = unique;
spin_lock(&fiq->waitq.lock);
if (fiq->connected) {
- queue_request(fiq, req);
+ queue_request_and_unlock(fiq, req);
err = 0;
+ } else {
+ spin_unlock(&fiq->waitq.lock);
}
- spin_unlock(&fiq->waitq.lock);
return err;
}
diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
index f41ebc723e01..60ebe3c2e2c3 100644
--- a/fs/fuse/fuse_i.h
+++ b/fs/fuse/fuse_i.h
@@ -454,6 +454,39 @@ struct fuse_req {
struct file *stolen_file;
};
+struct fuse_iqueue;
+
+/**
+ * Input queue callbacks
+ *
+ * Input queue signalling is device-specific. For example, the /dev/fuse file
+ * uses fiq->waitq and fasync to wake processes that are waiting on queue
+ * readiness. These callbacks allow other device types to respond to input
+ * queue activity.
+ */
+struct fuse_iqueue_ops {
+ /**
+ * Signal that a forget has been queued
+ */
+ void (*wake_forget_and_unlock)(struct fuse_iqueue *fiq)
+ __releases(fiq->waitq.lock);
+
+ /**
+ * Signal that an INTERRUPT request has been queued
+ */
+ void (*wake_interrupt_and_unlock)(struct fuse_iqueue *fiq)
+ __releases(fiq->waitq.lock);
+
+ /**
+ * Signal that a request has been queued
+ */
+ void (*wake_pending_and_unlock)(struct fuse_iqueue *fiq)
+ __releases(fiq->waitq.lock);
+};
+
+/** /dev/fuse input queue operations */
+extern const struct fuse_iqueue_ops fuse_dev_fiq_ops;
+
struct fuse_iqueue {
/** Connection established */
unsigned connected;
@@ -479,6 +512,12 @@ struct fuse_iqueue {
/** O_ASYNC requests */
struct fasync_struct *fasync;
+
+ /** Device-specific callbacks */
+ const struct fuse_iqueue_ops *ops;
+
+ /** Device-specific state */
+ void *priv;
};
#define FUSE_PQ_HASH_BITS 8
@@ -982,7 +1021,8 @@ struct fuse_conn *fuse_conn_get(struct fuse_conn *fc);
/**
* Initialize fuse_conn
*/
-void fuse_conn_init(struct fuse_conn *fc, struct user_namespace *user_ns);
+void fuse_conn_init(struct fuse_conn *fc, struct user_namespace *user_ns,
+ const struct fuse_iqueue_ops *fiq_ops, void *fiq_priv);
/**
* Release reference to fuse_conn
@@ -1002,10 +1042,14 @@ int parse_fuse_opt(char *opt, struct fuse_mount_data *d, int is_bdev,
* Fill in superblock and initialize fuse connection
* @sb: partially-initialized superblock to fill in
* @mount_data: mount parameters
+ * @fiq_ops: fuse input queue operations
+ * @fiq_priv: device-specific state for fuse_iqueue
* @fudptr: fuse_dev pointer to fill in, should contain NULL on entry
*/
int fuse_fill_super_common(struct super_block *sb,
struct fuse_mount_data *mount_data,
+ const struct fuse_iqueue_ops *fiq_ops,
+ void *fiq_priv,
void **fudptr);
/**
diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
index 65fd59fc1e81..31bb817575c4 100644
--- a/fs/fuse/inode.c
+++ b/fs/fuse/inode.c
@@ -574,7 +574,9 @@ static int fuse_show_options(struct seq_file *m, struct dentry *root)
return 0;
}
-static void fuse_iqueue_init(struct fuse_iqueue *fiq)
+static void fuse_iqueue_init(struct fuse_iqueue *fiq,
+ const struct fuse_iqueue_ops *ops,
+ void *priv)
{
memset(fiq, 0, sizeof(struct fuse_iqueue));
init_waitqueue_head(&fiq->waitq);
@@ -582,6 +584,8 @@ static void fuse_iqueue_init(struct fuse_iqueue *fiq)
INIT_LIST_HEAD(&fiq->interrupts);
fiq->forget_list_tail = &fiq->forget_list_head;
fiq->connected = 1;
+ fiq->ops = ops;
+ fiq->priv = priv;
}
static void fuse_pqueue_init(struct fuse_pqueue *fpq)
@@ -595,7 +599,8 @@ static void fuse_pqueue_init(struct fuse_pqueue *fpq)
fpq->connected = 1;
}
-void fuse_conn_init(struct fuse_conn *fc, struct user_namespace *user_ns)
+void fuse_conn_init(struct fuse_conn *fc, struct user_namespace *user_ns,
+ const struct fuse_iqueue_ops *fiq_ops, void *fiq_priv)
{
memset(fc, 0, sizeof(*fc));
spin_lock_init(&fc->lock);
@@ -605,7 +610,7 @@ void fuse_conn_init(struct fuse_conn *fc, struct user_namespace *user_ns)
atomic_set(&fc->dev_count, 1);
init_waitqueue_head(&fc->blocked_waitq);
init_waitqueue_head(&fc->reserved_req_waitq);
- fuse_iqueue_init(&fc->iq);
+ fuse_iqueue_init(&fc->iq, fiq_ops, fiq_priv);
INIT_LIST_HEAD(&fc->bg_queue);
INIT_LIST_HEAD(&fc->entry);
INIT_LIST_HEAD(&fc->devices);
@@ -1067,6 +1072,8 @@ EXPORT_SYMBOL_GPL(fuse_dev_free);
int fuse_fill_super_common(struct super_block *sb,
struct fuse_mount_data *mount_data,
+ const struct fuse_iqueue_ops *fiq_ops,
+ void *fiq_priv,
void **fudptr)
{
struct fuse_dev *fud;
@@ -1115,7 +1122,7 @@ int fuse_fill_super_common(struct super_block *sb,
if (!fc)
goto err;
- fuse_conn_init(fc, sb->s_user_ns);
+ fuse_conn_init(fc, sb->s_user_ns, fiq_ops, fiq_priv);
fc->release = fuse_free_conn;
fud = fuse_dev_alloc(fc);
@@ -1226,7 +1233,8 @@ static int fuse_fill_super(struct super_block *sb, void *data, int silent)
(file->f_cred->user_ns != sb->s_user_ns))
goto err_fput;
- err = fuse_fill_super_common(sb, &d, &file->private_data);
+ err = fuse_fill_super_common(sb, &d, &fuse_dev_fiq_ops, NULL,
+ &file->private_data);
err_fput:
fput(file);
err:
--
2.13.6
From: Stefan Hajnoczi <[email protected]>
Add basic probe/remove functionality for the new virtio-fs device.
Signed-off-by: Stefan Hajnoczi <[email protected]>
---
fs/fuse/Kconfig | 1 +
fs/fuse/virtio_fs.c | 160 ++++++++++++++++++++++++++++++++++++++--
include/uapi/linux/virtio_fs.h | 41 ++++++++++
include/uapi/linux/virtio_ids.h | 1 +
4 files changed, 195 insertions(+), 8 deletions(-)
create mode 100644 include/uapi/linux/virtio_fs.h
diff --git a/fs/fuse/Kconfig b/fs/fuse/Kconfig
index 0b1375126420..46e9a8ff9f7a 100644
--- a/fs/fuse/Kconfig
+++ b/fs/fuse/Kconfig
@@ -30,6 +30,7 @@ config CUSE
config VIRTIO_FS
tristate "Virtio Filesystem"
depends on FUSE_FS
+ select VIRTIO
help
The Virtio Filesystem allows guests to mount file systems from the
host.
diff --git a/fs/fuse/virtio_fs.c b/fs/fuse/virtio_fs.c
index 6b7d3973bd85..aac9c3c42827 100644
--- a/fs/fuse/virtio_fs.c
+++ b/fs/fuse/virtio_fs.c
@@ -4,13 +4,139 @@
* Copyright (C) 2018 Red Hat, Inc.
*/
-#include <linux/module.h>
#include <linux/fs.h>
+#include <linux/module.h>
+#include <linux/virtio.h>
+#include <linux/virtio_fs.h>
-MODULE_AUTHOR("Stefan Hajnoczi <[email protected]>");
-MODULE_DESCRIPTION("Virtio Filesystem");
-MODULE_LICENSE("GPL");
-MODULE_ALIAS_FS(KBUILD_MODNAME);
+/* List of virtio-fs device instances and a lock for the list */
+static DEFINE_MUTEX(virtio_fs_mutex);
+static LIST_HEAD(virtio_fs_instances);
+
+/* A virtio-fs device instance */
+struct virtio_fs {
+ struct list_head list; /* on virtio_fs_instances */
+ char *tag;
+};
+
+/* Add a new instance to the list or return -EEXIST if tag name exists*/
+static int virtio_fs_add_instance(struct virtio_fs *fs)
+{
+ struct virtio_fs *fs2;
+ bool duplicate = false;
+
+ mutex_lock(&virtio_fs_mutex);
+
+ list_for_each_entry(fs2, &virtio_fs_instances, list) {
+ if (strcmp(fs->tag, fs2->tag) == 0)
+ duplicate = true;
+ }
+
+ if (!duplicate)
+ list_add_tail(&fs->list, &virtio_fs_instances);
+
+ mutex_unlock(&virtio_fs_mutex);
+
+ if (duplicate)
+ return -EEXIST;
+ return 0;
+}
+
+/* Read filesystem name from virtio config into fs->tag (devm-allocated, freed with the device). */
+static int virtio_fs_read_tag(struct virtio_device *vdev, struct virtio_fs *fs)
+{
+ char tag_buf[sizeof_field(struct virtio_fs_config, tag)];
+ char *end;
+ size_t len;
+
+ virtio_cread_bytes(vdev, offsetof(struct virtio_fs_config, tag),
+ &tag_buf, sizeof(tag_buf));
+ end = memchr(tag_buf, '\0', sizeof(tag_buf));
+ if (end == tag_buf)
+ return -EINVAL; /* empty tag */
+ if (!end)
+ end = &tag_buf[sizeof(tag_buf)];
+
+ len = end - tag_buf;
+ fs->tag = devm_kmalloc(&vdev->dev, len + 1, GFP_KERNEL);
+ if (!fs->tag)
+ return -ENOMEM;
+ memcpy(fs->tag, tag_buf, len);
+ fs->tag[len] = '\0';
+ return 0;
+}
+
+static int virtio_fs_probe(struct virtio_device *vdev)
+{
+ struct virtio_fs *fs;
+ int ret;
+
+ fs = devm_kzalloc(&vdev->dev, sizeof(*fs), GFP_KERNEL);
+ if (!fs)
+ return -ENOMEM;
+ vdev->priv = fs;
+
+ ret = virtio_fs_read_tag(vdev, fs);
+ if (ret < 0)
+ goto out;
+
+ ret = virtio_fs_add_instance(fs);
+ if (ret < 0)
+ goto out;
+
+ return 0;
+
+out:
+ vdev->priv = NULL;
+ return ret;
+}
+
+static void virtio_fs_remove(struct virtio_device *vdev)
+{
+ struct virtio_fs *fs = vdev->priv;
+
+ vdev->config->reset(vdev);
+
+ mutex_lock(&virtio_fs_mutex);
+ list_del(&fs->list);
+ mutex_unlock(&virtio_fs_mutex);
+
+ vdev->priv = NULL;
+}
+
+#ifdef CONFIG_PM_SLEEP
+static int virtio_fs_freeze(struct virtio_device *vdev)
+{
+ return 0; /* TODO */
+}
+
+static int virtio_fs_restore(struct virtio_device *vdev)
+{
+ return 0; /* TODO */
+}
+#endif /* CONFIG_PM_SLEEP */
+
+static const struct virtio_device_id id_table[] = {
+ { VIRTIO_ID_FS, VIRTIO_DEV_ANY_ID },
+ {},
+};
+
+static const unsigned int feature_table[] = {};
+
+static struct virtio_driver virtio_fs_driver = {
+ .driver.name = KBUILD_MODNAME,
+ .driver.owner = THIS_MODULE,
+ .id_table = id_table,
+ .feature_table = feature_table,
+ .feature_table_size = ARRAY_SIZE(feature_table),
+ /* TODO validate config_get != NULL */
+ .probe = virtio_fs_probe,
+ .remove = virtio_fs_remove,
+#ifdef CONFIG_PM_SLEEP
+ .freeze = virtio_fs_freeze,
+ .restore = virtio_fs_restore,
+#endif
+};
static struct file_system_type virtio_fs_type = {
.owner = THIS_MODULE,
@@ -21,13 +147,31 @@ static struct file_system_type virtio_fs_type = {
static int __init virtio_fs_init(void)
{
- return register_filesystem(&virtio_fs_type);
+ int ret;
+
+ ret = register_virtio_driver(&virtio_fs_driver);
+ if (ret < 0)
+ return ret;
+
+ ret = register_filesystem(&virtio_fs_type);
+ if (ret < 0) {
+ unregister_virtio_driver(&virtio_fs_driver);
+ return ret;
+ }
+
+ return 0;
}
+module_init(virtio_fs_init);
static void __exit virtio_fs_exit(void)
{
unregister_filesystem(&virtio_fs_type);
+ unregister_virtio_driver(&virtio_fs_driver);
}
-
-module_init(virtio_fs_init);
module_exit(virtio_fs_exit);
+
+MODULE_AUTHOR("Stefan Hajnoczi <[email protected]>");
+MODULE_DESCRIPTION("Virtio Filesystem");
+MODULE_LICENSE("GPL");
+MODULE_ALIAS_FS(KBUILD_MODNAME);
+MODULE_DEVICE_TABLE(virtio, id_table);
diff --git a/include/uapi/linux/virtio_fs.h b/include/uapi/linux/virtio_fs.h
new file mode 100644
index 000000000000..48f3590dcfbe
--- /dev/null
+++ b/include/uapi/linux/virtio_fs.h
@@ -0,0 +1,41 @@
+#ifndef _UAPI_LINUX_VIRTIO_FS_H
+#define _UAPI_LINUX_VIRTIO_FS_H
+/* This header is BSD licensed so anyone can use the definitions to implement
+ * compatible drivers/servers.
+ *
+ * Redistribution and use in source and binary forms, with or without
+ * modification, are permitted provided that the following conditions
+ * are met:
+ * 1. Redistributions of source code must retain the above copyright
+ * notice, this list of conditions and the following disclaimer.
+ * 2. Redistributions in binary form must reproduce the above copyright
+ * notice, this list of conditions and the following disclaimer in the
+ * documentation and/or other materials provided with the distribution.
+ * 3. Neither the name of IBM nor the names of its contributors
+ * may be used to endorse or promote products derived from this software
+ * without specific prior written permission.
+ * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS ``AS IS'' AND
+ * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
+ * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
+ * ARE DISCLAIMED. IN NO EVENT SHALL IBM OR CONTRIBUTORS BE LIABLE
+ * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
+ * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
+ * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
+ * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
+ * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
+ * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
+ * SUCH DAMAGE. */
+#include <linux/types.h>
+#include <linux/virtio_ids.h>
+#include <linux/virtio_config.h>
+#include <linux/virtio_types.h>
+
+struct virtio_fs_config {
+ /* Filesystem name (UTF-8, not NUL-terminated, padded with NULs) */
+ __u8 tag[36];
+
+ /* Number of request queues */
+ __u32 num_queues;
+} __attribute__((packed));
+
+#endif /* _UAPI_LINUX_VIRTIO_FS_H */
diff --git a/include/uapi/linux/virtio_ids.h b/include/uapi/linux/virtio_ids.h
index 6d5c3b2d4f4d..884b0e2734bb 100644
--- a/include/uapi/linux/virtio_ids.h
+++ b/include/uapi/linux/virtio_ids.h
@@ -43,5 +43,6 @@
#define VIRTIO_ID_INPUT 18 /* virtio input */
#define VIRTIO_ID_VSOCK 19 /* virtio vsock transport */
#define VIRTIO_ID_CRYPTO 20 /* virtio crypto */
+#define VIRTIO_ID_FS 26 /* virtio filesystem */
#endif /* _LINUX_VIRTIO_IDS_H */
--
2.13.6
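The tag-parsing logic in virtio_fs_read_tag() above (reject an empty tag, handle a tag that fills the whole NUL-padded field) can be exercised in plain userspace C. This is only a simplified sketch: devm_kmalloc() is replaced with malloc() and the config field is a local buffer, not the real virtio config space.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define VIRTIO_FS_TAG_LEN 36   /* size of virtio_fs_config.tag */

/* Copy a NUL-padded (not necessarily NUL-terminated) tag field into a
 * freshly malloc()ed C string.  Returns NULL for an empty tag. */
char *read_tag(const char tag_buf[VIRTIO_FS_TAG_LEN])
{
	const char *end = memchr(tag_buf, '\0', VIRTIO_FS_TAG_LEN);
	size_t len;
	char *tag;

	if (end == tag_buf)
		return NULL;                        /* empty tag */
	if (!end)
		end = tag_buf + VIRTIO_FS_TAG_LEN;  /* tag fills the whole field */

	len = (size_t)(end - tag_buf);
	tag = malloc(len + 1);
	if (!tag)
		return NULL;
	memcpy(tag, tag_buf, len);
	tag[len] = '\0';
	return tag;
}
```

The interesting edge case is the last one: a 36-byte tag has no terminating NUL in the config space, so memchr() returns NULL and the full field length is used.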
Sometimes we run out of memory ranges. In that case, wait for memory
ranges to become free instead of returning -EBUSY.
The dax fault path holds fuse_inode->i_mmap_sem, and while that is held,
memory reclaim can't be done. It's not safe to wait while holding
fuse_inode->i_mmap_sem, for two reasons.
- The worker thread that frees memory might block on fuse_inode->i_mmap_sem as well.
- This inode may be holding all the memory, so no more memory can be freed.
In both cases, deadlock will ensue. So return -ENOSPC from iomap_begin()
in the fault path if memory can't be allocated. Drop fuse_inode->i_mmap_sem,
wait for a free range to become available, and retry.
The read/write path is a different story. We hold the inode lock, and lock
ordering allows us to grab fuse_inode->i_mmap_sem if needed. That means we
can do direct reclaim in that path. But if there is no memory allocated to
this inode, direct reclaim will not work and we need to wait for a memory
range to become free. So try the following order.
A. Try to get a free range.
B. If not, try direct reclaim.
C. If not, wait for a memory range to become free.
Sleeping with locks held should be fine here because in step B we made
sure this inode is not holding any ranges. That means other inodes are
holding the ranges and somebody should be able to free memory. Also, the
worker thread does a trylock() on the inode lock, so it will not wait on
this inode and will move on to the next memory range. Hence the above
sequence should be deadlock free.
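The A/B/C order can be condensed into a tiny userspace model (this is an
illustration only; the real logic lives in alloc_dax_mapping_reclaim() in
the diff below, where step C sleeps on fc->dax_range_waitq):

```c
#include <assert.h>

/* Toy model of the allocation order: 'free_ranges' stands in for
 * fc->free_ranges, 'owned' for fi->nr_dmaps. */
enum alloc_result { GOT_FREE_RANGE, DID_RECLAIM, MUST_WAIT };

enum alloc_result alloc_dax_range(int *free_ranges, int owned)
{
	if (*free_ranges > 0) {      /* A: take a free range */
		(*free_ranges)--;
		return GOT_FREE_RANGE;
	}
	if (owned > 0)               /* B: reclaim one of this inode's ranges */
		return DID_RECLAIM;
	return MUST_WAIT;            /* C: sleep on fc->dax_range_waitq */
}
```

Step B is only reachable when this inode owns at least one range, which is
exactly why the subsequent wait in step C cannot deadlock on ourselves.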
Signed-off-by: Vivek Goyal <[email protected]>
---
fs/fuse/file.c | 60 +++++++++++++++++++++++++++++++++++++++++++-------------
fs/fuse/fuse_i.h | 3 +++
fs/fuse/inode.c | 1 +
3 files changed, 50 insertions(+), 14 deletions(-)
diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index 709747458335..d0942ce0a6c3 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -220,6 +220,8 @@ static void __free_dax_mapping(struct fuse_conn *fc,
{
list_add_tail(&dmap->list, &fc->free_ranges);
fc->nr_free_ranges++;
+ /* TODO: Wake up only when needed */
+ wake_up(&fc->dax_range_waitq);
}
static void free_dax_mapping(struct fuse_conn *fc,
@@ -1770,12 +1772,18 @@ static int fuse_iomap_begin(struct inode *inode, loff_t pos, loff_t length,
goto iomap_hole;
/* Can't do reclaim in fault path yet due to lock ordering */
- if (flags & IOMAP_FAULT)
+ if (flags & IOMAP_FAULT) {
alloc_dmap = alloc_dax_mapping(fc);
- else
+ if (!alloc_dmap)
+ return -ENOSPC;
+ } else {
alloc_dmap = alloc_dax_mapping_reclaim(fc, inode);
+ if (IS_ERR(alloc_dmap))
+ return PTR_ERR(alloc_dmap);
+ }
- if (!alloc_dmap)
+ /* If we are here, we should have memory allocated */
+ if (WARN_ON(!alloc_dmap))
return -EBUSY;
/*
@@ -2596,14 +2604,24 @@ static ssize_t fuse_file_splice_read(struct file *in, loff_t *ppos,
static int __fuse_dax_fault(struct vm_fault *vmf, enum page_entry_size pe_size,
bool write)
{
- int ret;
+ int ret, error = 0;
struct inode *inode = file_inode(vmf->vma->vm_file);
struct super_block *sb = inode->i_sb;
pfn_t pfn;
+ struct fuse_conn *fc = get_fuse_conn(inode);
+ bool retry = false;
if (write)
sb_start_pagefault(sb);
+retry:
+ if (retry && !(fc->nr_free_ranges > 0)) {
+ ret = -EINTR;
+ if (wait_event_killable_exclusive(fc->dax_range_waitq,
+ (fc->nr_free_ranges > 0)))
+ goto out;
+ }
+
/*
* We need to serialize against not only truncate but also against
* fuse dax memory range reclaim. While a range is being reclaimed,
@@ -2611,13 +2629,20 @@ static int __fuse_dax_fault(struct vm_fault *vmf, enum page_entry_size pe_size,
* to populate page cache or access memory we are trying to free.
*/
down_read(&get_fuse_inode(inode)->i_mmap_sem);
- ret = dax_iomap_fault(vmf, pe_size, &pfn, NULL, &fuse_iomap_ops);
+ ret = dax_iomap_fault(vmf, pe_size, &pfn, &error, &fuse_iomap_ops);
+ if ((ret & VM_FAULT_ERROR) && error == -ENOSPC) {
+ error = 0;
+ retry = true;
+ up_read(&get_fuse_inode(inode)->i_mmap_sem);
+ goto retry;
+ }
if (ret & VM_FAULT_NEEDDSYNC)
ret = dax_finish_sync_fault(vmf, pe_size, pfn);
up_read(&get_fuse_inode(inode)->i_mmap_sem);
+out:
if (write)
sb_end_pagefault(sb);
@@ -3828,16 +3853,23 @@ static struct fuse_dax_mapping *alloc_dax_mapping_reclaim(struct fuse_conn *fc,
struct fuse_dax_mapping *dmap;
struct fuse_inode *fi = get_fuse_inode(inode);
- dmap = alloc_dax_mapping(fc);
- if (dmap)
- return dmap;
-
- /* There are no mappings which can be reclaimed */
- if (!fi->nr_dmaps)
- return NULL;
+ while(1) {
+ dmap = alloc_dax_mapping(fc);
+ if (dmap)
+ return dmap;
- /* Try reclaim a fuse dax memory range */
- return fuse_dax_reclaim_first_mapping(fc, inode);
+ if (fi->nr_dmaps)
+ return fuse_dax_reclaim_first_mapping(fc, inode);
+ /*
+ * There are no mappings which can be reclaimed.
+ * Wait for one.
+ */
+ if (!(fc->nr_free_ranges > 0)) {
+ if (wait_event_killable_exclusive(fc->dax_range_waitq,
+ (fc->nr_free_ranges > 0)))
+ return ERR_PTR(-EINTR);
+ }
+ }
}
int fuse_dax_free_one_mapping_locked(struct fuse_conn *fc, struct inode *inode,
diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
index bbefa7c11078..7b2db87c6ead 100644
--- a/fs/fuse/fuse_i.h
+++ b/fs/fuse/fuse_i.h
@@ -886,6 +886,9 @@ struct fuse_conn {
/* Worker to free up memory ranges */
struct delayed_work dax_free_work;
+ /* Wait queue for a dax range to become free */
+ wait_queue_head_t dax_range_waitq;
+
/*
* DAX Window Free Ranges. TODO: This might not be best place to store
* this free list
diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
index d31acb97eede..178ac3171564 100644
--- a/fs/fuse/inode.c
+++ b/fs/fuse/inode.c
@@ -695,6 +695,7 @@ void fuse_conn_init(struct fuse_conn *fc, struct user_namespace *user_ns,
atomic_set(&fc->dev_count, 1);
init_waitqueue_head(&fc->blocked_waitq);
init_waitqueue_head(&fc->reserved_req_waitq);
+ init_waitqueue_head(&fc->dax_range_waitq);
fuse_iqueue_init(&fc->iq, fiq_ops, fiq_priv);
INIT_LIST_HEAD(&fc->bg_queue);
INIT_LIST_HEAD(&fc->entry);
--
2.13.6
From: Stefan Hajnoczi <[email protected]>
virtio-fs will need to query the length of fuse_arg lists. Make the
symbol visible.
Signed-off-by: Stefan Hajnoczi <[email protected]>
---
fs/fuse/dev.c | 7 ++++---
fs/fuse/fuse_i.h | 5 +++++
2 files changed, 9 insertions(+), 3 deletions(-)
diff --git a/fs/fuse/dev.c b/fs/fuse/dev.c
index 5b90c839a7c3..7fd627d5cf58 100644
--- a/fs/fuse/dev.c
+++ b/fs/fuse/dev.c
@@ -348,7 +348,7 @@ void fuse_put_request(struct fuse_conn *fc, struct fuse_req *req)
}
EXPORT_SYMBOL_GPL(fuse_put_request);
-static unsigned len_args(unsigned numargs, struct fuse_arg *args)
+unsigned fuse_len_args(unsigned numargs, struct fuse_arg *args)
{
unsigned nbytes = 0;
unsigned i;
@@ -358,6 +358,7 @@ static unsigned len_args(unsigned numargs, struct fuse_arg *args)
return nbytes;
}
+EXPORT_SYMBOL_GPL(fuse_len_args);
static u64 fuse_get_unique(struct fuse_iqueue *fiq)
{
@@ -373,7 +374,7 @@ static unsigned int fuse_req_hash(u64 unique)
static void queue_request(struct fuse_iqueue *fiq, struct fuse_req *req)
{
req->in.h.len = sizeof(struct fuse_in_header) +
- len_args(req->in.numargs, (struct fuse_arg *) req->in.args);
+ fuse_len_args(req->in.numargs, (struct fuse_arg *) req->in.args);
list_add_tail(&req->list, &fiq->pending);
wake_up_locked(&fiq->waitq);
kill_fasync(&fiq->fasync, SIGIO, POLL_IN);
@@ -1870,7 +1871,7 @@ static int copy_out_args(struct fuse_copy_state *cs, struct fuse_out *out,
if (out->h.error)
return nbytes != reqsize ? -EINVAL : 0;
- reqsize += len_args(out->numargs, out->args);
+ reqsize += fuse_len_args(out->numargs, out->args);
if (reqsize < nbytes || (reqsize > nbytes && !out->argvar))
return -EINVAL;
diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
index 32c4466a8f89..f41ebc723e01 100644
--- a/fs/fuse/fuse_i.h
+++ b/fs/fuse/fuse_i.h
@@ -1120,4 +1120,9 @@ int fuse_set_acl(struct inode *inode, struct posix_acl *acl, int type);
/* readdir.c */
int fuse_readdir(struct file *file, struct dir_context *ctx);
+/**
+ * Return the number of bytes in an arguments list
+ */
+unsigned fuse_len_args(unsigned numargs, struct fuse_arg *args);
+
#endif /* _FS_FUSE_I_H */
--
2.13.6
From: Stefan Hajnoczi <[email protected]>
A struct dax_device instance is a prerequisite for the DAX filesystem
APIs. Let virtio_fs associate a dax_device with a fuse_conn. Classic
FUSE and CUSE set the pointer to NULL, disabling DAX.
Signed-off-by: Stefan Hajnoczi <[email protected]>
---
fs/fuse/cuse.c | 3 ++-
fs/fuse/fuse_i.h | 8 +++++++-
fs/fuse/inode.c | 9 ++++++---
fs/fuse/virtio_fs.c | 3 ++-
4 files changed, 17 insertions(+), 6 deletions(-)
diff --git a/fs/fuse/cuse.c b/fs/fuse/cuse.c
index 98dc780cbafa..bf8c1c470e8c 100644
--- a/fs/fuse/cuse.c
+++ b/fs/fuse/cuse.c
@@ -503,7 +503,8 @@ static int cuse_channel_open(struct inode *inode, struct file *file)
* Limit the cuse channel to requests that can
* be represented in file->f_cred->user_ns.
*/
- fuse_conn_init(&cc->fc, file->f_cred->user_ns, &fuse_dev_fiq_ops, NULL);
+ fuse_conn_init(&cc->fc, file->f_cred->user_ns, NULL, &fuse_dev_fiq_ops,
+ NULL);
fud = fuse_dev_alloc(&cc->fc);
if (!fud) {
diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
index f463586f2c9e..b5a6a12e67d6 100644
--- a/fs/fuse/fuse_i.h
+++ b/fs/fuse/fuse_i.h
@@ -803,6 +803,9 @@ struct fuse_conn {
/** List of device instances belonging to this connection */
struct list_head devices;
+
+ /** DAX device, non-NULL if DAX is supported */
+ struct dax_device *dax_dev;
};
static inline struct fuse_conn *get_fuse_conn_super(struct super_block *sb)
@@ -1025,7 +1028,8 @@ struct fuse_conn *fuse_conn_get(struct fuse_conn *fc);
* Initialize fuse_conn
*/
void fuse_conn_init(struct fuse_conn *fc, struct user_namespace *user_ns,
- const struct fuse_iqueue_ops *fiq_ops, void *fiq_priv);
+ struct dax_device *dax_dev,
+ const struct fuse_iqueue_ops *fiq_ops, void *fiq_priv);
/**
* Release reference to fuse_conn
@@ -1045,12 +1049,14 @@ int parse_fuse_opt(char *opt, struct fuse_mount_data *d, int is_bdev,
* Fill in superblock and initialize fuse connection
* @sb: partially-initialized superblock to fill in
* @mount_data: mount parameters
+ * @dax_dev: DAX device, may be NULL
* @fiq_ops: fuse input queue operations
* @fiq_priv: device-specific state for fuse_iqueue
* @fudptr: fuse_dev pointer to fill in, should contain NULL on entry
*/
int fuse_fill_super_common(struct super_block *sb,
struct fuse_mount_data *mount_data,
+ struct dax_device *dax_dev,
const struct fuse_iqueue_ops *fiq_ops,
void *fiq_priv,
void **fudptr);
diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
index 31bb817575c4..10e4a39318c4 100644
--- a/fs/fuse/inode.c
+++ b/fs/fuse/inode.c
@@ -600,7 +600,8 @@ static void fuse_pqueue_init(struct fuse_pqueue *fpq)
}
void fuse_conn_init(struct fuse_conn *fc, struct user_namespace *user_ns,
- const struct fuse_iqueue_ops *fiq_ops, void *fiq_priv)
+ struct dax_device *dax_dev,
+ const struct fuse_iqueue_ops *fiq_ops, void *fiq_priv)
{
memset(fc, 0, sizeof(*fc));
spin_lock_init(&fc->lock);
@@ -625,6 +626,7 @@ void fuse_conn_init(struct fuse_conn *fc, struct user_namespace *user_ns,
fc->attr_version = 1;
get_random_bytes(&fc->scramble_key, sizeof(fc->scramble_key));
fc->pid_ns = get_pid_ns(task_active_pid_ns(current));
+ fc->dax_dev = dax_dev;
fc->user_ns = get_user_ns(user_ns);
}
EXPORT_SYMBOL_GPL(fuse_conn_init);
@@ -1072,6 +1074,7 @@ EXPORT_SYMBOL_GPL(fuse_dev_free);
int fuse_fill_super_common(struct super_block *sb,
struct fuse_mount_data *mount_data,
+ struct dax_device *dax_dev,
const struct fuse_iqueue_ops *fiq_ops,
void *fiq_priv,
void **fudptr)
@@ -1122,7 +1125,7 @@ int fuse_fill_super_common(struct super_block *sb,
if (!fc)
goto err;
- fuse_conn_init(fc, sb->s_user_ns, fiq_ops, fiq_priv);
+ fuse_conn_init(fc, sb->s_user_ns, dax_dev, fiq_ops, fiq_priv);
fc->release = fuse_free_conn;
fud = fuse_dev_alloc(fc);
@@ -1233,7 +1236,7 @@ static int fuse_fill_super(struct super_block *sb, void *data, int silent)
(file->f_cred->user_ns != sb->s_user_ns))
goto err_fput;
- err = fuse_fill_super_common(sb, &d, &fuse_dev_fiq_ops, NULL,
+ err = fuse_fill_super_common(sb, &d, NULL, &fuse_dev_fiq_ops, NULL,
&file->private_data);
err_fput:
fput(file);
diff --git a/fs/fuse/virtio_fs.c b/fs/fuse/virtio_fs.c
index fd914f2c6209..ba615ec2603e 100644
--- a/fs/fuse/virtio_fs.c
+++ b/fs/fuse/virtio_fs.c
@@ -844,7 +844,8 @@ static int virtio_fs_fill_super(struct super_block *sb, void *data,
/* TODO this sends FUSE_INIT and could cause hiprio or notifications
* virtqueue races since they haven't been set up yet!
*/
- err = fuse_fill_super_common(sb, &d, &virtio_fs_fiq_ops, fs,
+ err = fuse_fill_super_common(sb, &d, fs->dax_dev,
+ &virtio_fs_fiq_ops, fs,
(void **)&fs->vqs[2].fud);
if (err < 0)
goto err_fud;
--
2.13.6
From: "Dr. David Alan Gilbert" <[email protected]>
When unmounting the fs, close all the fuse devices.
This includes making sure the daemon gets a FUSE_DESTROY request
telling it about the unmount.
Signed-off-by: Dr. David Alan Gilbert <[email protected]>
---
fs/fuse/fuse_i.h | 1 +
fs/fuse/inode.c | 3 ++-
fs/fuse/virtio_fs.c | 13 ++++++++++++-
3 files changed, 15 insertions(+), 2 deletions(-)
diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
index 7b2db87c6ead..30c7b4b56200 100644
--- a/fs/fuse/fuse_i.h
+++ b/fs/fuse/fuse_i.h
@@ -85,6 +85,7 @@ struct fuse_mount_data {
unsigned default_permissions:1;
unsigned allow_other:1;
unsigned dax:1;
+ unsigned destroy:1;
unsigned max_read;
unsigned blksize;
diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
index 178ac3171564..4d2d623e607f 100644
--- a/fs/fuse/inode.c
+++ b/fs/fuse/inode.c
@@ -1263,7 +1263,7 @@ int fuse_fill_super_common(struct super_block *sb,
goto err_put_root;
__set_bit(FR_BACKGROUND, &init_req->flags);
- if (is_bdev) {
+ if (mount_data->destroy) {
fc->destroy_req = fuse_request_alloc(0);
if (!fc->destroy_req)
goto err_free_init_req;
@@ -1339,6 +1339,7 @@ static int fuse_fill_super(struct super_block *sb, void *data, int silent)
d.fiq_ops = &fuse_dev_fiq_ops;
d.fiq_priv = NULL;
d.fudptr = &file->private_data;
+ d.destroy = is_bdev;
err = fuse_fill_super_common(sb, &d);
err_fput:
diff --git a/fs/fuse/virtio_fs.c b/fs/fuse/virtio_fs.c
index f436f5b3f85c..c71bc47395b4 100644
--- a/fs/fuse/virtio_fs.c
+++ b/fs/fuse/virtio_fs.c
@@ -1128,6 +1128,7 @@ static int virtio_fs_fill_super(struct super_block *sb, void *data,
d.fiq_ops = &virtio_fs_fiq_ops;
d.fiq_priv = fs;
d.fudptr = (void **)&fs->vqs[2].fud;
+ d.destroy = true; /* Send destroy request on unmount */
err = fuse_fill_super_common(sb, &d);
if (err < 0)
@@ -1160,6 +1161,16 @@ static int virtio_fs_fill_super(struct super_block *sb, void *data,
return err;
}
+static void virtio_kill_sb(struct super_block *sb)
+{
+ struct fuse_conn *fc = get_fuse_conn_super(sb);
+ fuse_kill_sb_anon(sb);
+ if (fc) {
+ struct virtio_fs *vfs = fc->iq.priv;
+ virtio_fs_free_devs(vfs);
+ }
+}
+
static struct dentry *virtio_fs_mount(struct file_system_type *fs_type,
int flags, const char *dev_name,
void *raw_data)
@@ -1171,7 +1182,7 @@ static struct file_system_type virtio_fs_type = {
.owner = THIS_MODULE,
.name = KBUILD_MODNAME,
.mount = virtio_fs_mount,
- .kill_sb = fuse_kill_sb_anon,
+ .kill_sb = virtio_kill_sb,
};
static int __init virtio_fs_init(void)
--
2.13.6
fuse_file_put(sync) can be called with sync=true/false. If sync=true,
it waits for the release request response and then calls iput() in the
caller's context. If sync=false, it does not wait for the release request
response; it frees the fuse_file struct immediately, and the req->end
function does the iput().
iput() can be a problem with DAX if called in req->end context. If this
is the last reference to the inode (VFS has let go of its reference
already), then iput() will clean up DAX mappings as well, sending
REMOVEMAPPING requests and waiting for their completion. (All of this
happens in the worker thread context which is processing fuse replies
from the daemon on the host.)
That means the worker thread blocks and stops processing further
replies, and the system deadlocks.
So for now, force sync release of the file in the case of DAX inodes.
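The deadlock argument can be modeled in a few lines: there is one reply
worker, and anything running inside a req->end callback that waits for a
further reply is waiting for itself. This is a toy model, not the kernel
code; the struct and function names are invented for illustration.

```c
#include <assert.h>
#include <stdbool.h>

/* One worker thread processes all fuse replies.  A deadlock occurs if
 * code running inside that worker (a req->end callback) waits for
 * another reply -- only the worker itself could deliver it. */
struct reply_worker { bool inside_end_callback; };

bool wait_for_reply_deadlocks(const struct reply_worker *w)
{
	return w->inside_end_callback;
}

/* Async put: iput() (and its REMOVEMAPPING wait) runs from req->end,
 * i.e. inside the worker, so the wait deadlocks. */
bool async_put_dax(struct reply_worker *w)
{
	bool deadlock;

	w->inside_end_callback = true;
	deadlock = wait_for_reply_deadlocks(w);   /* wait for REMOVEMAPPING */
	w->inside_end_callback = false;
	return deadlock;
}

/* Sync put: iput() runs in the caller's context, outside the worker. */
bool sync_put_dax(const struct reply_worker *w)
{
	return wait_for_reply_deadlocks(w);
}
```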
Signed-off-by: Vivek Goyal <[email protected]>
---
fs/fuse/file.c | 15 ++++++++++++++-
1 file changed, 14 insertions(+), 1 deletion(-)
diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index 6421c94cef46..d86f6e5c4daf 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -451,6 +451,7 @@ void fuse_release_common(struct file *file, int opcode)
{
struct fuse_file *ff = file->private_data;
struct fuse_req *req = ff->reserved_req;
+ bool sync = false;
fuse_prepare_release(ff, file->f_flags, opcode);
@@ -471,8 +472,20 @@ void fuse_release_common(struct file *file, int opcode)
* Make the release synchronous if this is a fuseblk mount,
* synchronous RELEASE is allowed (and desirable) in this case
* because the server can be trusted not to screw up.
+ *
+ * For DAX, fuse server is trusted. So it should be fine to
+ * do a sync file put. Doing async file put is creating
+ * problems right now because when request finish, iput()
+ * can lead to freeing of inode. That means it tears down
+ * mappings backing DAX memory and sends REMOVEMAPPING message
+ * to server and blocks for completion. Currently, waiting
+ * in req->end context deadlocks the system as same worker thread
+ * can't process REMOVEMAPPING reply it is waiting for.
*/
- fuse_file_put(ff, ff->fc->destroy_req != NULL);
+ if (IS_DAX(req->misc.release.inode) || ff->fc->destroy_req != NULL)
+ sync = true;
+
+ fuse_file_put(ff, sync);
}
static int fuse_open(struct inode *inode, struct file *file)
--
2.13.6
We need some kind of locking mechanism here. Normal file systems like
ext4 and xfs take their own semaphore to protect against truncate while
a fault is in progress.
We have an additional requirement: protecting against fuse dax memory range
reclaim. When a range has been selected for reclaim, we need to make sure
no read/write/fault can access that memory range while reclaim is in
progress. Once reclaim is complete, the lock will be released and
read/write/fault will trigger allocation of a fresh dax range.
Taking inode_lock() is not an option in the fault path, as lockdep complains
about circular dependencies. So define a new fuse_inode->i_mmap_sem.
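The discipline is the classic reader/writer one: faults take i_mmap_sem
shared, truncate/punch-hole/reclaim take it exclusive. A toy rwsem model
(invented helper names, counters instead of real blocking) makes the
exclusion explicit:

```c
#include <assert.h>

/* Toy rwsem model of the i_mmap_sem discipline: any number of readers
 * (faults) OR a single writer (truncate / dax range reclaim),
 * never both at once. */
struct rwsem { int readers; int writer; };

int down_read_trylock(struct rwsem *s)
{
	if (s->writer)
		return 0;            /* reclaim in progress: fault must wait */
	s->readers++;
	return 1;
}

void up_read(struct rwsem *s) { s->readers--; }

int down_write_trylock(struct rwsem *s)
{
	if (s->writer || s->readers)
		return 0;            /* faults in flight: reclaim must wait */
	s->writer = 1;
	return 1;
}

void up_write(struct rwsem *s) { s->writer = 0; }
```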
Signed-off-by: Vivek Goyal <[email protected]>
---
fs/fuse/dir.c | 2 ++
fs/fuse/file.c | 17 +++++++++++++----
fs/fuse/fuse_i.h | 7 +++++++
fs/fuse/inode.c | 1 +
4 files changed, 23 insertions(+), 4 deletions(-)
diff --git a/fs/fuse/dir.c b/fs/fuse/dir.c
index b7e6e421f6bb..8aa4ff82ea7a 100644
--- a/fs/fuse/dir.c
+++ b/fs/fuse/dir.c
@@ -1553,8 +1553,10 @@ int fuse_do_setattr(struct dentry *dentry, struct iattr *attr,
*/
if ((is_truncate || !is_wb) &&
S_ISREG(inode->i_mode) && oldsize != outarg.attr.size) {
+ down_write(&fi->i_mmap_sem);
truncate_pagecache(inode, outarg.attr.size);
invalidate_inode_pages2(inode->i_mapping);
+ up_write(&fi->i_mmap_sem);
}
clear_bit(FUSE_I_SIZE_UNSTABLE, &fi->state);
diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index eb12776f5ff6..73068289f62e 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -2523,13 +2523,20 @@ static int __fuse_dax_fault(struct vm_fault *vmf, enum page_entry_size pe_size,
if (write)
sb_start_pagefault(sb);
- /* TODO inode semaphore to protect faults vs truncate */
-
+ /*
+ * We need to serialize against not only truncate but also against
+ * fuse dax memory range reclaim. While a range is being reclaimed,
+ * we do not want any read/write/mmap to make progress and try
+ * to populate page cache or access memory we are trying to free.
+ */
+ down_read(&get_fuse_inode(inode)->i_mmap_sem);
ret = dax_iomap_fault(vmf, pe_size, &pfn, NULL, &fuse_iomap_ops);
if (ret & VM_FAULT_NEEDDSYNC)
ret = dax_finish_sync_fault(vmf, pe_size, pfn);
+ up_read(&get_fuse_inode(inode)->i_mmap_sem);
+
if (write)
sb_end_pagefault(sb);
@@ -3476,9 +3483,11 @@ static long __fuse_file_fallocate(struct file *file, int mode,
file_update_time(file);
}
- if (mode & FALLOC_FL_PUNCH_HOLE)
+ if (mode & FALLOC_FL_PUNCH_HOLE) {
+ down_write(&fi->i_mmap_sem);
truncate_pagecache_range(inode, offset, offset + length - 1);
-
+ up_write(&fi->i_mmap_sem);
+ }
fuse_invalidate_attr(inode);
out:
diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
index e32b0059493b..280f717deb57 100644
--- a/fs/fuse/fuse_i.h
+++ b/fs/fuse/fuse_i.h
@@ -211,6 +211,13 @@ struct fuse_inode {
*/
struct rw_semaphore i_dmap_sem;
+ /**
+ * Can't take inode lock in fault path (leads to circular dependency).
+ * So take this in fuse dax fault path to make sure truncate and
+ * punch hole etc. can't make progress in parallel.
+ */
+ struct rw_semaphore i_mmap_sem;
+
/** Sorted rb tree of struct fuse_dax_mapping elements */
struct rb_root_cached dmap_tree;
unsigned long nr_dmaps;
diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
index 234b9c0c80ab..59fc5a7a18fc 100644
--- a/fs/fuse/inode.c
+++ b/fs/fuse/inode.c
@@ -85,6 +85,7 @@ static struct inode *fuse_alloc_inode(struct super_block *sb)
fi->state = 0;
fi->nr_dmaps = 0;
mutex_init(&fi->mutex);
+ init_rwsem(&fi->i_mmap_sem);
init_rwsem(&fi->i_dmap_sem);
fi->forget = fuse_alloc_forget();
if (!fi->forget) {
--
2.13.6
If the virtio queue is full, don't drop FORGET requests. Instead, wait
a bit and try to dispatch them a little later using a worker thread.
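The requeue logic can be modeled with a bounded queue plus a pending list
standing in for fsvq->queued_reqs. This is a simplified sketch with
invented names; in the patch, the "retry later" step is
schedule_delayed_work() on dispatch_work rather than a direct call.

```c
#include <assert.h>

#define VQ_CAPACITY 2

/* Toy model: a bounded virtqueue plus a pending list that stands in
 * for fsvq->queued_reqs.  A full queue parks the FORGET instead of
 * dropping it; dispatch_work drains the list as slots free up. */
struct model {
	int inflight;   /* descriptors in the virtqueue */
	int pending;    /* FORGETs parked on queued_reqs */
};

void send_forget(struct model *m)
{
	if (m->inflight < VQ_CAPACITY)
		m->inflight++;
	else
		m->pending++;        /* previously: dropped with pr_err() */
}

void complete_one(struct model *m) { m->inflight--; }

/* Stand-in for virtio_fs_hiprio_dispatch_work() */
void dispatch_work(struct model *m)
{
	while (m->pending > 0 && m->inflight < VQ_CAPACITY) {
		m->inflight++;
		m->pending--;
	}
}
```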
Signed-off-by: Vivek Goyal <[email protected]>
---
fs/fuse/virtio_fs.c | 86 ++++++++++++++++++++++++++++++++++++++++++++++++-----
1 file changed, 78 insertions(+), 8 deletions(-)
diff --git a/fs/fuse/virtio_fs.c b/fs/fuse/virtio_fs.c
index 98dba3cf9d40..f436f5b3f85c 100644
--- a/fs/fuse/virtio_fs.c
+++ b/fs/fuse/virtio_fs.c
@@ -22,6 +22,8 @@ static LIST_HEAD(virtio_fs_instances);
struct virtio_fs_vq {
struct virtqueue *vq; /* protected by fpq->lock */
struct work_struct done_work;
+ struct list_head queued_reqs;
+ struct delayed_work dispatch_work;
struct fuse_dev *fud;
char name[24];
} ____cacheline_aligned_in_smp;
@@ -53,6 +55,13 @@ struct virtio_fs {
size_t window_len;
};
+struct virtio_fs_forget {
+ struct fuse_in_header ih;
+ struct fuse_forget_in arg;
+ /* This request can be temporarily queued on virt queue */
+ struct list_head list;
+};
+
/* TODO: This should be in a PCI file somewhere */
static int virtio_pci_find_shm_cap(struct pci_dev *dev,
u8 required_id,
@@ -189,6 +198,7 @@ static void virtio_fs_free_devs(struct virtio_fs *fs)
continue;
flush_work(&fsvq->done_work);
+ flush_delayed_work(&fsvq->dispatch_work);
fuse_dev_free(fsvq->fud); /* TODO need to quiesce/end_requests/decrement dev_count */
fsvq->fud = NULL;
@@ -252,6 +262,58 @@ static void virtio_fs_hiprio_done_work(struct work_struct *work)
spin_unlock(&fpq->lock);
}
+static void virtio_fs_dummy_dispatch_work(struct work_struct *work)
+{
+ return;
+}
+
+static void virtio_fs_hiprio_dispatch_work(struct work_struct *work)
+{
+ struct virtio_fs_forget *forget;
+ struct virtio_fs_vq *fsvq = container_of(work, struct virtio_fs_vq,
+ dispatch_work.work);
+ struct fuse_pqueue *fpq = &fsvq->fud->pq;
+ struct virtqueue *vq = fsvq->vq;
+ struct scatterlist sg;
+ struct scatterlist *sgs[] = {&sg};
+ bool notify;
+ int ret;
+
+ pr_debug("worker virtio_fs_hiprio_dispatch_work() called.\n");
+ while(1) {
+ spin_lock(&fpq->lock);
+ forget = list_first_entry_or_null(&fsvq->queued_reqs,
+ struct virtio_fs_forget, list);
+ if (!forget) {
+ spin_unlock(&fpq->lock);
+ return;
+ }
+
+ list_del(&forget->list);
+ sg_init_one(&sg, forget, sizeof(*forget));
+
+ /* Enqueue the request */
+ dev_dbg(&vq->vdev->dev, "%s\n", __func__);
+ ret = virtqueue_add_sgs(vq, sgs, 1, 0, forget, GFP_ATOMIC);
+ if (ret < 0) {
+ pr_debug("virtio-fs: Could not queue FORGET: queue full. Will try later\n");
+ list_add_tail(&forget->list, &fsvq->queued_reqs);
+ schedule_delayed_work(&fsvq->dispatch_work,
+ msecs_to_jiffies(1));
+ /* TODO handle full virtqueue */
+ spin_unlock(&fpq->lock);
+ return;
+ }
+
+ notify = virtqueue_kick_prepare(vq);
+ spin_unlock(&fpq->lock);
+
+ if (notify)
+ virtqueue_notify(vq);
+ pr_debug("worker virtio_fs_hiprio_dispatch_work() dispatched one forget request.\n");
+ }
+}
+
/* Allocate and copy args into req->argbuf */
static int copy_args_to_argbuf(struct fuse_req *req)
{
@@ -404,15 +466,24 @@ static int virtio_fs_setup_vqs(struct virtio_device *vdev,
snprintf(fs->vqs[0].name, sizeof(fs->vqs[0].name), "notifications");
INIT_WORK(&fs->vqs[0].done_work, virtio_fs_notifications_done_work);
names[0] = fs->vqs[0].name;
+ INIT_LIST_HEAD(&fs->vqs[0].queued_reqs);
+ INIT_DELAYED_WORK(&fs->vqs[0].dispatch_work,
+ virtio_fs_dummy_dispatch_work);
callbacks[1] = virtio_fs_vq_done;
snprintf(fs->vqs[1].name, sizeof(fs->vqs[1].name), "hiprio");
names[1] = fs->vqs[1].name;
INIT_WORK(&fs->vqs[1].done_work, virtio_fs_hiprio_done_work);
+ INIT_LIST_HEAD(&fs->vqs[1].queued_reqs);
+ INIT_DELAYED_WORK(&fs->vqs[1].dispatch_work,
+ virtio_fs_hiprio_dispatch_work);
/* Initialize the requests virtqueues */
for (i = 2; i < fs->nvqs; i++) {
INIT_WORK(&fs->vqs[i].done_work, virtio_fs_requests_done_work);
+ INIT_DELAYED_WORK(&fs->vqs[i].dispatch_work,
+ virtio_fs_dummy_dispatch_work);
+ INIT_LIST_HEAD(&fs->vqs[i].queued_reqs);
snprintf(fs->vqs[i].name, sizeof(fs->vqs[i].name),
"requests.%u", i - 2);
callbacks[i] = virtio_fs_vq_done;
@@ -718,11 +789,6 @@ static struct virtio_driver virtio_fs_driver = {
#endif
};
-struct virtio_fs_forget {
- struct fuse_in_header ih;
- struct fuse_forget_in arg;
-};
-
static void virtio_fs_wake_forget_and_unlock(struct fuse_iqueue *fiq)
__releases(fiq->waitq.lock)
{
@@ -733,6 +799,7 @@ __releases(fiq->waitq.lock)
struct scatterlist *sgs[] = {&sg};
struct virtio_fs *fs;
struct virtqueue *vq;
+ struct virtio_fs_vq *fsvq;
bool notify;
u64 unique;
int ret;
@@ -746,7 +813,7 @@ __releases(fiq->waitq.lock)
unique = fuse_get_unique(fiq);
fs = fiq->priv;
-
+ fsvq = &fs->vqs[1];
spin_unlock(&fiq->waitq.lock);
/* Allocate a buffer for the request */
@@ -769,14 +836,17 @@ __releases(fiq->waitq.lock)
sg_init_one(&sg, forget, sizeof(*forget));
/* Enqueue the request */
- vq = fs->vqs[1].vq;
+ vq = fsvq->vq;
dev_dbg(&vq->vdev->dev, "%s\n", __func__);
fpq = vq_to_fpq(vq);
spin_lock(&fpq->lock);
ret = virtqueue_add_sgs(vq, sgs, 1, 0, forget, GFP_ATOMIC);
if (ret < 0) {
- pr_err("virtio-fs: dropped FORGET: queue full\n");
+ pr_debug("virtio-fs: Could not queue FORGET: queue full. Will try later\n");
+ list_add_tail(&forget->list, &fsvq->queued_reqs);
+ schedule_delayed_work(&fsvq->dispatch_work,
+ msecs_to_jiffies(1));
/* TODO handle full virtqueue */
spin_unlock(&fpq->lock);
goto out;
--
2.13.6
From: Miklos Szeredi <[email protected]>
Metadata and dcache versioning support.
READDIRPLUS doesn't supply version information yet, so don't use it.
Signed-off-by: Miklos Szeredi <[email protected]>
---
fs/fuse/dev.c | 3 +-
fs/fuse/dir.c | 244 +++++++++++++++++++++++++++++++++++++++-------
fs/fuse/file.c | 53 ++++++----
fs/fuse/fuse_i.h | 25 +++--
fs/fuse/inode.c | 23 +++--
fs/fuse/readdir.c | 12 ++-
include/uapi/linux/fuse.h | 5 +
7 files changed, 284 insertions(+), 81 deletions(-)
diff --git a/fs/fuse/dev.c b/fs/fuse/dev.c
index f35c4ab2dcbb..9ed326d716ee 100644
--- a/fs/fuse/dev.c
+++ b/fs/fuse/dev.c
@@ -640,8 +640,7 @@ ssize_t fuse_simple_request(struct fuse_conn *fc, struct fuse_args *args)
args->out.numargs * sizeof(struct fuse_arg));
fuse_request_send(fc, req);
ret = req->out.h.error;
- if (!ret && args->out.argvar) {
- BUG_ON(args->out.numargs != 1);
+ if (!ret && args->out.argvar && args->out.numargs == 1) {
ret = req->out.args[0].size;
}
fuse_put_request(fc, req);
diff --git a/fs/fuse/dir.c b/fs/fuse/dir.c
index 8aa4ff82ea7a..3aa214f9a28e 100644
--- a/fs/fuse/dir.c
+++ b/fs/fuse/dir.c
@@ -25,7 +25,11 @@ static void fuse_advise_use_readdirplus(struct inode *dir)
}
union fuse_dentry {
- u64 time;
+ struct {
+ u64 time;
+ s64 version;
+ s64 parent_version;
+ };
struct rcu_head rcu;
};
@@ -48,6 +52,18 @@ static void fuse_dentry_settime(struct dentry *dentry, u64 time)
((union fuse_dentry *) dentry->d_fsdata)->time = time;
}
+static inline void fuse_dentry_setver(struct dentry *entry,
+ struct fuse_entryver_out *outver,
+ s64 pver)
+{
+ union fuse_dentry *fude = entry->d_fsdata;
+
+ smp_wmb();
+ /* FIXME: verify versions aren't going backwards */
+ WRITE_ONCE(fude->version, outver->initial_version);
+ WRITE_ONCE(fude->parent_version, pver);
+}
+
static inline u64 fuse_dentry_time(const struct dentry *entry)
{
return ((union fuse_dentry *) entry->d_fsdata)->time;
@@ -150,34 +166,118 @@ static void fuse_invalidate_entry(struct dentry *entry)
static void fuse_lookup_init(struct fuse_conn *fc, struct fuse_args *args,
u64 nodeid, const struct qstr *name,
- struct fuse_entry_out *outarg)
+ struct fuse_entry_out *outarg,
+ struct fuse_entryver_out *outver)
{
memset(outarg, 0, sizeof(struct fuse_entry_out));
+ memset(outver, 0, sizeof(struct fuse_entryver_out));
args->in.h.opcode = FUSE_LOOKUP;
args->in.h.nodeid = nodeid;
args->in.numargs = 1;
args->in.args[0].size = name->len + 1;
args->in.args[0].value = name->name;
- args->out.numargs = 1;
+ args->out.argvar = 1;
+ args->out.numargs = 2;
args->out.args[0].size = sizeof(struct fuse_entry_out);
args->out.args[0].value = outarg;
+ args->out.args[1].size = sizeof(struct fuse_entryver_out);
+ args->out.args[1].value = outver;
}
-u64 fuse_get_attr_version(struct fuse_conn *fc)
+s64 fuse_get_attr_version(struct inode *inode)
{
- u64 curr_version;
+ struct fuse_inode *fi = get_fuse_inode(inode);
+ s64 curr_version;
- /*
- * The spin lock isn't actually needed on 64bit archs, but we
- * don't yet care too much about such optimizations.
- */
- spin_lock(&fc->lock);
- curr_version = fc->attr_version;
- spin_unlock(&fc->lock);
+ if (fi->version_ptr) {
+ curr_version = READ_ONCE(*fi->version_ptr);
+ } else {
+ struct fuse_conn *fc = get_fuse_conn(inode);
+
+ /*
+ * The spin lock isn't actually needed on 64bit archs, but we
+ * don't yet care too much about such optimizations.
+ */
+ spin_lock(&fc->lock);
+ curr_version = fc->attr_ctr;
+ spin_unlock(&fc->lock);
+ }
+
+ return curr_version;
+}
+
+static s64 fuse_get_attr_version_shared(struct inode *inode)
+{
+ struct fuse_inode *fi = get_fuse_inode(inode);
+ s64 curr_version = 0;
+
+ if (fi->version_ptr)
+ curr_version = READ_ONCE(*fi->version_ptr);
return curr_version;
}
+static bool fuse_version_mismatch(struct inode *inode, s64 version)
+{
+ struct fuse_inode *fi = get_fuse_inode(inode);
+ bool mismatch = false;
+
+ if (fi->version_ptr) {
+ s64 curr_version = READ_ONCE(*fi->version_ptr);
+
+ mismatch = curr_version != version;
+ smp_rmb();
+
+ if (mismatch) {
+ pr_info("mismatch: nodeid=%llu curr=%lli cache=%lli\n",
+ get_node_id(inode), curr_version, version);
+ }
+ }
+
+ return mismatch;
+}
+
+static bool fuse_dentry_version_mismatch(struct dentry *dentry)
+{
+ union fuse_dentry *fude = dentry->d_fsdata;
+ struct inode *dir = d_inode_rcu(dentry->d_parent);
+ struct inode *inode = d_inode_rcu(dentry);
+
+ if (!fuse_version_mismatch(dir, READ_ONCE(fude->parent_version)))
+ return false;
+
+ /* Can only validate negatives based on parent version */
+ if (!inode)
+ return true;
+
+ return fuse_version_mismatch(inode, READ_ONCE(fude->version));
+}
+
+static void fuse_set_version_ptr(struct inode *inode,
+ struct fuse_entryver_out *outver)
+{
+ struct fuse_conn *fc = get_fuse_conn(inode);
+ struct fuse_inode *fi = get_fuse_inode(inode);
+
+ if (!fc->version_table || !outver->version_index) {
+ fi->version_ptr = NULL;
+ return;
+ }
+ if (outver->version_index >= fc->version_table_size) {
+ pr_warn_ratelimited("version index too large (%llu >= %llu)\n",
+ outver->version_index,
+ fc->version_table_size);
+ fi->version_ptr = NULL;
+ return;
+ }
+
+ fi->version_ptr = fc->version_table + outver->version_index;
+
+ pr_info("fuse: version_ptr = %p\n", fi->version_ptr);
+ pr_info("fuse: version = %lli\n", fi->attr_version);
+ pr_info("fuse: current_version: %lli\n", *fi->version_ptr);
+}
+
/*
* Check whether the dentry is still valid
*
@@ -198,12 +298,15 @@ static int fuse_dentry_revalidate(struct dentry *entry, unsigned int flags)
inode = d_inode_rcu(entry);
if (inode && is_bad_inode(inode))
goto invalid;
- else if (time_before64(fuse_dentry_time(entry), get_jiffies_64()) ||
+ else if (fuse_dentry_version_mismatch(entry) ||
+ time_before64(fuse_dentry_time(entry), get_jiffies_64()) ||
(flags & LOOKUP_REVAL)) {
struct fuse_entry_out outarg;
+ struct fuse_entryver_out outver;
FUSE_ARGS(args);
struct fuse_forget_link *forget;
- u64 attr_version;
+ s64 attr_version;
+ s64 parent_version;
/* For negative dentries, always do a fresh lookup */
if (!inode)
@@ -220,11 +323,12 @@ static int fuse_dentry_revalidate(struct dentry *entry, unsigned int flags)
if (!forget)
goto out;
- attr_version = fuse_get_attr_version(fc);
+ attr_version = fuse_get_attr_version(inode);
parent = dget_parent(entry);
+ parent_version = fuse_get_attr_version_shared(d_inode(parent));
fuse_lookup_init(fc, &args, get_node_id(d_inode(parent)),
- &entry->d_name, &outarg);
+ &entry->d_name, &outarg, &outver);
ret = fuse_simple_request(fc, &args);
dput(parent);
/* Zero nodeid is same as -ENOENT */
@@ -236,6 +340,9 @@ static int fuse_dentry_revalidate(struct dentry *entry, unsigned int flags)
fuse_queue_forget(fc, forget, outarg.nodeid, 1);
goto invalid;
}
+ if (fi->version_ptr != fc->version_table + outver.version_index)
+ pr_warn("fuse_dentry_revalidate: version_ptr changed (%p -> %p)\n", fi->version_ptr, fc->version_table + outver.version_index);
+
spin_lock(&fc->lock);
fi->nlookup++;
spin_unlock(&fc->lock);
@@ -246,14 +353,26 @@ static int fuse_dentry_revalidate(struct dentry *entry, unsigned int flags)
if (ret || (outarg.attr.mode ^ inode->i_mode) & S_IFMT)
goto invalid;
+ if (fi->version_ptr) {
+ if (outver.initial_version > attr_version)
+ attr_version = outver.initial_version;
+ else if (outver.initial_version < attr_version)
+ pr_warn("fuse_dentry_revalidate: backward going version (%lli -> %lli)\n", attr_version, outver.initial_version);
+ }
+
forget_all_cached_acls(inode);
fuse_change_attributes(inode, &outarg.attr,
entry_attr_timeout(&outarg),
attr_version);
fuse_change_entry_timeout(entry, &outarg);
+ fuse_dentry_setver(entry, &outver, parent_version);
} else if (inode) {
fi = get_fuse_inode(inode);
if (flags & LOOKUP_RCU) {
+ /*
+ * FIXME: Don't leave rcu if FUSE_I_ADVISE_RDPLUS is
+ * already set?
+ */
if (test_bit(FUSE_I_INIT_RDPLUS, &fi->state))
return -ECHILD;
} else if (test_and_clear_bit(FUSE_I_INIT_RDPLUS, &fi->state)) {
@@ -307,13 +426,16 @@ int fuse_valid_type(int m)
S_ISBLK(m) || S_ISFIFO(m) || S_ISSOCK(m);
}
-int fuse_lookup_name(struct super_block *sb, u64 nodeid, const struct qstr *name,
- struct fuse_entry_out *outarg, struct inode **inode)
+static int fuse_lookup_name_with_ver(struct super_block *sb, u64 nodeid,
+ const struct qstr *name,
+ struct fuse_entry_out *outarg,
+ struct fuse_entryver_out *outver,
+ struct inode **inode)
{
struct fuse_conn *fc = get_fuse_conn_super(sb);
FUSE_ARGS(args);
struct fuse_forget_link *forget;
- u64 attr_version;
+ s64 attr_version;
int err;
*inode = NULL;
@@ -327,9 +449,11 @@ int fuse_lookup_name(struct super_block *sb, u64 nodeid, const struct qstr *name
if (!forget)
goto out;
- attr_version = fuse_get_attr_version(fc);
+ spin_lock(&fc->lock);
+ attr_version = fc->attr_ctr;
+ spin_unlock(&fc->lock);
- fuse_lookup_init(fc, &args, nodeid, name, outarg);
+ fuse_lookup_init(fc, &args, nodeid, name, outarg, outver);
err = fuse_simple_request(fc, &args);
/* Zero nodeid is same as -ENOENT, but with valid timeout */
if (err || !outarg->nodeid)
@@ -357,19 +481,32 @@ int fuse_lookup_name(struct super_block *sb, u64 nodeid, const struct qstr *name
return err;
}
+int fuse_lookup_name(struct super_block *sb, u64 nodeid,
+ const struct qstr *name,
+ struct fuse_entry_out *outarg, struct inode **inode)
+{
+ struct fuse_entryver_out outver;
+
+ return fuse_lookup_name_with_ver(sb, nodeid, name, outarg, &outver,
+ inode);
+}
+
static struct dentry *fuse_lookup(struct inode *dir, struct dentry *entry,
unsigned int flags)
{
int err;
struct fuse_entry_out outarg;
+ struct fuse_entryver_out outver;
struct inode *inode;
struct dentry *newent;
bool outarg_valid = true;
+ s64 parent_version = fuse_get_attr_version_shared(dir);
bool locked;
locked = fuse_lock_inode(dir);
- err = fuse_lookup_name(dir->i_sb, get_node_id(dir), &entry->d_name,
- &outarg, &inode);
+ err = fuse_lookup_name_with_ver(dir->i_sb, get_node_id(dir),
+ &entry->d_name, &outarg, &outver,
+ &inode);
fuse_unlock_inode(dir, locked);
if (err == -ENOENT) {
outarg_valid = false;
@@ -382,16 +519,21 @@ static struct dentry *fuse_lookup(struct inode *dir, struct dentry *entry,
if (inode && get_node_id(inode) == FUSE_ROOT_ID)
goto out_iput;
+ if (inode)
+ fuse_set_version_ptr(inode, &outver);
+
newent = d_splice_alias(inode, entry);
err = PTR_ERR(newent);
if (IS_ERR(newent))
goto out_err;
entry = newent ? newent : entry;
- if (outarg_valid)
+ if (outarg_valid) {
fuse_change_entry_timeout(entry, &outarg);
- else
+ fuse_dentry_setver(entry, &outver, parent_version);
+ } else {
fuse_invalidate_entry_cache(entry);
+ }
fuse_advise_use_readdirplus(dir);
return newent;
@@ -420,7 +562,9 @@ static int fuse_create_open(struct inode *dir, struct dentry *entry,
struct fuse_create_in inarg;
struct fuse_open_out outopen;
struct fuse_entry_out outentry;
+ struct fuse_entryver_out outver;
struct fuse_file *ff;
+ s64 parent_version = fuse_get_attr_version_shared(dir);
/* Userspace expects S_IFREG in create mode */
BUG_ON((mode & S_IFMT) != S_IFREG);
@@ -451,11 +595,14 @@ static int fuse_create_open(struct inode *dir, struct dentry *entry,
args.in.args[0].value = &inarg;
args.in.args[1].size = entry->d_name.len + 1;
args.in.args[1].value = entry->d_name.name;
- args.out.numargs = 2;
+ args.out.argvar = 1;
+ args.out.numargs = 3;
args.out.args[0].size = sizeof(outentry);
args.out.args[0].value = &outentry;
args.out.args[1].size = sizeof(outopen);
args.out.args[1].value = &outopen;
+ args.out.args[2].size = sizeof(outver);
+ args.out.args[2].value = &outver;
err = fuse_simple_request(fc, &args);
if (err)
goto out_free_ff;
@@ -478,7 +625,9 @@ static int fuse_create_open(struct inode *dir, struct dentry *entry,
}
kfree(forget);
d_instantiate(entry, inode);
+ fuse_set_version_ptr(inode, &outver);
fuse_change_entry_timeout(entry, &outentry);
+ fuse_dentry_setver(entry, &outver, parent_version);
fuse_dir_changed(dir);
err = finish_open(file, entry, generic_file_open);
if (err) {
@@ -549,10 +698,12 @@ static int create_new_entry(struct fuse_conn *fc, struct fuse_args *args,
umode_t mode)
{
struct fuse_entry_out outarg;
+ struct fuse_entryver_out outver;
struct inode *inode;
struct dentry *d;
int err;
struct fuse_forget_link *forget;
+ s64 parent_version = fuse_get_attr_version_shared(dir);
forget = fuse_alloc_forget();
if (!forget)
@@ -560,9 +711,12 @@ static int create_new_entry(struct fuse_conn *fc, struct fuse_args *args,
memset(&outarg, 0, sizeof(outarg));
args->in.h.nodeid = get_node_id(dir);
- args->out.numargs = 1;
+ args->out.argvar = 1;
+ args->out.numargs = 2;
args->out.args[0].size = sizeof(outarg);
args->out.args[0].value = &outarg;
+ args->out.args[1].size = sizeof(outver);
+ args->out.args[1].value = &outver;
err = fuse_simple_request(fc, args);
if (err)
goto out_put_forget_req;
@@ -582,6 +736,8 @@ static int create_new_entry(struct fuse_conn *fc, struct fuse_args *args,
}
kfree(forget);
+ fuse_set_version_ptr(inode, &outver);
+
d_drop(entry);
d = d_splice_alias(inode, entry);
if (IS_ERR(d))
@@ -589,9 +745,11 @@ static int create_new_entry(struct fuse_conn *fc, struct fuse_args *args,
if (d) {
fuse_change_entry_timeout(d, &outarg);
+ fuse_dentry_setver(d, &outver, parent_version);
dput(d);
} else {
fuse_change_entry_timeout(entry, &outarg);
+ fuse_dentry_setver(entry, &outver, parent_version);
}
fuse_dir_changed(dir);
return 0;
@@ -689,10 +847,9 @@ static int fuse_unlink(struct inode *dir, struct dentry *entry)
err = fuse_simple_request(fc, &args);
if (!err) {
struct inode *inode = d_inode(entry);
- struct fuse_inode *fi = get_fuse_inode(inode);
spin_lock(&fc->lock);
- fi->attr_version = ++fc->attr_version;
+ fuse_update_attr_version_locked(inode);
/*
* If i_nlink == 0 then unlink doesn't make sense, yet this can
* happen if userspace filesystem is careless. It would be
@@ -843,10 +1000,8 @@ static int fuse_link(struct dentry *entry, struct inode *newdir,
etc.)
*/
if (!err) {
- struct fuse_inode *fi = get_fuse_inode(inode);
-
spin_lock(&fc->lock);
- fi->attr_version = ++fc->attr_version;
+ fuse_update_attr_version_locked(inode);
inc_nlink(inode);
spin_unlock(&fc->lock);
fuse_invalidate_attr(inode);
@@ -904,9 +1059,9 @@ static int fuse_do_getattr(struct inode *inode, struct kstat *stat,
struct fuse_attr_out outarg;
struct fuse_conn *fc = get_fuse_conn(inode);
FUSE_ARGS(args);
- u64 attr_version;
+ s64 attr_version;
- attr_version = fuse_get_attr_version(fc);
+ attr_version = fuse_get_attr_version(inode);
memset(&inarg, 0, sizeof(inarg));
memset(&outarg, 0, sizeof(outarg));
@@ -941,6 +1096,13 @@ static int fuse_do_getattr(struct inode *inode, struct kstat *stat,
return err;
}
+static bool fuse_shared_version_mismatch(struct inode *inode)
+{
+ struct fuse_inode *fi = get_fuse_inode(inode);
+
+ return fuse_version_mismatch(inode, READ_ONCE(fi->attr_version));
+}
+
static int fuse_update_get_attr(struct inode *inode, struct file *file,
struct kstat *stat, u32 request_mask,
unsigned int flags)
@@ -956,7 +1118,8 @@ static int fuse_update_get_attr(struct inode *inode, struct file *file,
else if (request_mask & READ_ONCE(fi->inval_mask))
sync = true;
else
- sync = time_before64(fi->i_time, get_jiffies_64());
+ sync = (fuse_shared_version_mismatch(inode) ||
+ time_before64(fi->i_time, get_jiffies_64()));
if (sync) {
forget_all_cached_acls(inode);
@@ -1150,7 +1313,9 @@ static int fuse_permission(struct inode *inode, int mask)
}
if (fc->default_permissions) {
- err = generic_permission(inode, mask);
+ err = -EACCES;
+ if (!refreshed && !fuse_shared_version_mismatch(inode))
+ err = generic_permission(inode, mask);
/* If permission is denied, try to refresh file
attributes. This is also needed, because the root
@@ -1459,6 +1624,7 @@ int fuse_do_setattr(struct dentry *dentry, struct iattr *attr,
loff_t oldsize;
int err;
bool trust_local_cmtime = is_wb && S_ISREG(inode->i_mode);
+ s64 attr_version = fuse_get_attr_version(inode);
if (!fc->default_permissions)
attr->ia_valid |= ATTR_FORCE;
@@ -1534,8 +1700,12 @@ int fuse_do_setattr(struct dentry *dentry, struct iattr *attr,
/* FIXME: clear I_DIRTY_SYNC? */
}
+ if (fi->version_ptr)
+ attr_version++;
+ else
+ attr_version = fuse_update_attr_version_locked(inode);
fuse_change_attributes_common(inode, &outarg.attr,
- attr_timeout(&outarg));
+ attr_timeout(&outarg), attr_version);
oldsize = inode->i_size;
/* see the comment in fuse_change_attributes() */
if (!is_wb || is_truncate || !S_ISREG(inode->i_mode))
diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index 0be5a7380b3c..4cb8c8a8011c 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -376,6 +376,28 @@ void fuse_removemapping(struct inode *inode)
pr_debug("%s request succeeded\n", __func__);
}
+s64 fuse_update_attr_version_locked(struct inode *inode)
+{
+ struct fuse_inode *fi = get_fuse_inode(inode);
+ s64 curr_version = 0;
+
+ if (!fi->version_ptr) {
+ struct fuse_conn *fc = get_fuse_conn(inode);
+
+ curr_version = fi->attr_version = fc->attr_ctr++;
+ }
+ return curr_version;
+}
+
+static void fuse_update_attr_version(struct inode *inode)
+{
+ struct fuse_conn *fc = get_fuse_conn(inode);
+
+ spin_lock(&fc->lock);
+ fuse_update_attr_version_locked(inode);
+ spin_unlock(&fc->lock);
+}
+
void fuse_finish_open(struct inode *inode, struct file *file)
{
struct fuse_file *ff = file->private_data;
@@ -386,12 +408,11 @@ void fuse_finish_open(struct inode *inode, struct file *file)
if (ff->open_flags & FOPEN_NONSEEKABLE)
nonseekable_open(inode, file);
if (fc->atomic_o_trunc && (file->f_flags & O_TRUNC)) {
- struct fuse_inode *fi = get_fuse_inode(inode);
-
spin_lock(&fc->lock);
- fi->attr_version = ++fc->attr_version;
+ fuse_update_attr_version_locked(inode);
i_size_write(inode, 0);
spin_unlock(&fc->lock);
+
fuse_invalidate_attr(inode);
if (fc->writeback_cache)
file_update_time(file);
@@ -806,15 +827,8 @@ static void fuse_aio_complete(struct fuse_io_priv *io, int err, ssize_t pos)
if (!left && !io->blocking) {
ssize_t res = fuse_get_res_by_io(io);
- if (res >= 0) {
- struct inode *inode = file_inode(io->iocb->ki_filp);
- struct fuse_conn *fc = get_fuse_conn(inode);
- struct fuse_inode *fi = get_fuse_inode(inode);
-
- spin_lock(&fc->lock);
- fi->attr_version = ++fc->attr_version;
- spin_unlock(&fc->lock);
- }
+ if (res >= 0)
+ fuse_update_attr_version(file_inode(io->iocb->ki_filp));
io->iocb->ki_complete(io->iocb, res, 0);
}
@@ -883,7 +897,7 @@ static size_t fuse_send_read(struct fuse_req *req, struct fuse_io_priv *io,
}
static void fuse_read_update_size(struct inode *inode, loff_t size,
- u64 attr_ver)
+ s64 attr_ver)
{
struct fuse_conn *fc = get_fuse_conn(inode);
struct fuse_inode *fi = get_fuse_inode(inode);
@@ -891,14 +905,14 @@ static void fuse_read_update_size(struct inode *inode, loff_t size,
spin_lock(&fc->lock);
if (attr_ver == fi->attr_version && size < inode->i_size &&
!test_bit(FUSE_I_SIZE_UNSTABLE, &fi->state)) {
- fi->attr_version = ++fc->attr_version;
+ fuse_update_attr_version_locked(inode);
i_size_write(inode, size);
}
spin_unlock(&fc->lock);
}
static void fuse_short_read(struct fuse_req *req, struct inode *inode,
- u64 attr_ver)
+ s64 attr_ver)
{
size_t num_read = req->out.args[0].size;
struct fuse_conn *fc = get_fuse_conn(inode);
@@ -933,7 +947,7 @@ static int fuse_do_readpage(struct file *file, struct page *page)
size_t num_read;
loff_t pos = page_offset(page);
size_t count = PAGE_SIZE;
- u64 attr_ver;
+ s64 attr_ver;
int err;
/*
@@ -947,7 +961,7 @@ static int fuse_do_readpage(struct file *file, struct page *page)
if (IS_ERR(req))
return PTR_ERR(req);
- attr_ver = fuse_get_attr_version(fc);
+ attr_ver = fuse_get_attr_version(inode);
req->out.page_zeroing = 1;
req->out.argpages = 1;
@@ -1036,7 +1050,7 @@ static void fuse_send_readpages(struct fuse_req *req, struct file *file)
req->out.page_zeroing = 1;
req->out.page_replace = 1;
fuse_read_fill(req, file, pos, count, FUSE_READ);
- req->misc.read.attr_ver = fuse_get_attr_version(fc);
+ req->misc.read.attr_ver = fuse_get_attr_version(file_inode(file));
if (fc->async_read) {
req->ff = fuse_file_get(ff);
req->end = fuse_readpages_end;
@@ -1218,11 +1232,10 @@ static size_t fuse_send_write(struct fuse_req *req, struct fuse_io_priv *io,
bool fuse_write_update_size(struct inode *inode, loff_t pos)
{
struct fuse_conn *fc = get_fuse_conn(inode);
- struct fuse_inode *fi = get_fuse_inode(inode);
bool ret = false;
spin_lock(&fc->lock);
- fi->attr_version = ++fc->attr_version;
+ fuse_update_attr_version_locked(inode);
if (pos > inode->i_size) {
i_size_write(inode, pos);
ret = true;
diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
index 8a2604606d51..9ea5d0f760f4 100644
--- a/fs/fuse/fuse_i.h
+++ b/fs/fuse/fuse_i.h
@@ -172,7 +172,7 @@ struct fuse_inode {
u64 orig_ino;
/** Version of last attribute change */
- u64 attr_version;
+ s64 attr_version;
union {
/* Write related fields (regular file only) */
@@ -223,7 +223,7 @@ struct fuse_inode {
/** Miscellaneous bits describing inode state */
unsigned long state;
- /** Lock for serializing lookup and readdir for back compatibility*/
+ /** Lock for serializing lookup and readdir for back compatibility */
struct mutex mutex;
/*
@@ -241,6 +241,9 @@ struct fuse_inode {
/** Sorted rb tree of struct fuse_dax_mapping elements */
struct rb_root_cached dmap_tree;
unsigned long nr_dmaps;
+
+ /** Pointer to shared version */
+ s64 *version_ptr;
};
/** FUSE inode state bits */
@@ -364,7 +367,7 @@ struct fuse_out {
unsigned numargs;
/** Array of arguments */
- struct fuse_arg args[2];
+ struct fuse_arg args[3];
};
/** FUSE page descriptor */
@@ -386,7 +389,7 @@ struct fuse_args {
struct {
unsigned argvar:1;
unsigned numargs;
- struct fuse_arg args[2];
+ struct fuse_arg args[3];
} out;
};
@@ -486,7 +489,7 @@ struct fuse_req {
struct cuse_init_in cuse_init_in;
struct {
struct fuse_read_in in;
- u64 attr_ver;
+ s64 attr_ver;
} read;
struct {
struct fuse_write_in in;
@@ -869,7 +872,7 @@ struct fuse_conn {
struct fuse_req *destroy_req;
/** Version counter for attribute changes */
- u64 attr_version;
+ s64 attr_ctr;
/** Called on final put */
void (*release)(struct fuse_conn *);
@@ -953,7 +956,7 @@ int fuse_inode_eq(struct inode *inode, void *_nodeidp);
*/
struct inode *fuse_iget(struct super_block *sb, u64 nodeid,
int generation, struct fuse_attr *attr,
- u64 attr_valid, u64 attr_version);
+ u64 attr_valid, s64 attr_version);
int fuse_lookup_name(struct super_block *sb, u64 nodeid, const struct qstr *name,
struct fuse_entry_out *outarg, struct inode **inode);
@@ -1027,10 +1030,10 @@ void fuse_init_symlink(struct inode *inode);
* Change attributes of an inode
*/
void fuse_change_attributes(struct inode *inode, struct fuse_attr *attr,
- u64 attr_valid, u64 attr_version);
+ u64 attr_valid, s64 attr_version);
void fuse_change_attributes_common(struct inode *inode, struct fuse_attr *attr,
- u64 attr_valid);
+ u64 attr_valid, s64 attr_version);
/**
* Initialize the client device
@@ -1195,7 +1198,7 @@ void fuse_flush_writepages(struct inode *inode);
void fuse_set_nowrite(struct inode *inode);
void fuse_release_nowrite(struct inode *inode);
-u64 fuse_get_attr_version(struct fuse_conn *fc);
+s64 fuse_get_attr_version(struct inode *inode);
/**
* File-system tells the kernel to invalidate cache for the given node id.
@@ -1281,4 +1284,6 @@ u64 fuse_get_unique(struct fuse_iqueue *fiq);
void fuse_dax_free_mem_worker(struct work_struct *work);
void fuse_removemapping(struct inode *inode);
+s64 fuse_update_attr_version_locked(struct inode *inode);
+
#endif /* _FS_FUSE_I_H */
diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
index d44827bbfa3d..ea2be153a322 100644
--- a/fs/fuse/inode.c
+++ b/fs/fuse/inode.c
@@ -82,6 +82,8 @@ static struct inode *fuse_alloc_inode(struct super_block *sb)
fi->nodeid = 0;
fi->nlookup = 0;
fi->attr_version = 0;
+ fi->version_ptr = NULL;
fi->orig_ino = 0;
fi->state = 0;
fi->nr_dmaps = 0;
@@ -153,12 +155,11 @@ static ino_t fuse_squash_ino(u64 ino64)
}
void fuse_change_attributes_common(struct inode *inode, struct fuse_attr *attr,
- u64 attr_valid)
+ u64 attr_valid, s64 attr_version)
{
struct fuse_conn *fc = get_fuse_conn(inode);
struct fuse_inode *fi = get_fuse_inode(inode);
- fi->attr_version = ++fc->attr_version;
fi->i_time = attr_valid;
WRITE_ONCE(fi->inval_mask, 0);
@@ -193,10 +194,13 @@ void fuse_change_attributes_common(struct inode *inode, struct fuse_attr *attr,
inode->i_mode &= ~S_ISVTX;
fi->orig_ino = attr->ino;
+ smp_wmb();
+ WRITE_ONCE(fi->attr_version, attr_version);
+
}
void fuse_change_attributes(struct inode *inode, struct fuse_attr *attr,
- u64 attr_valid, u64 attr_version)
+ u64 attr_valid, s64 attr_version)
{
struct fuse_conn *fc = get_fuse_conn(inode);
struct fuse_inode *fi = get_fuse_inode(inode);
@@ -205,14 +209,17 @@ void fuse_change_attributes(struct inode *inode, struct fuse_attr *attr,
struct timespec64 old_mtime;
spin_lock(&fc->lock);
- if ((attr_version != 0 && fi->attr_version > attr_version) ||
- test_bit(FUSE_I_SIZE_UNSTABLE, &fi->state)) {
+ if (test_bit(FUSE_I_SIZE_UNSTABLE, &fi->state)) {
+ spin_unlock(&fc->lock);
+ return;
+ }
+ if (attr_version != 0 && fi->attr_version > attr_version) {
spin_unlock(&fc->lock);
return;
}
old_mtime = inode->i_mtime;
- fuse_change_attributes_common(inode, attr, attr_valid);
+ fuse_change_attributes_common(inode, attr, attr_valid, attr_version);
oldsize = inode->i_size;
/*
@@ -291,7 +298,7 @@ static int fuse_inode_set(struct inode *inode, void *_nodeidp)
struct inode *fuse_iget(struct super_block *sb, u64 nodeid,
int generation, struct fuse_attr *attr,
- u64 attr_valid, u64 attr_version)
+ u64 attr_valid, s64 attr_version)
{
struct inode *inode;
struct fuse_inode *fi;
@@ -709,7 +716,7 @@ void fuse_conn_init(struct fuse_conn *fc, struct user_namespace *user_ns,
fc->blocked = 0;
fc->initialized = 0;
fc->connected = 1;
- fc->attr_version = 1;
+ fc->attr_ctr = 1;
get_random_bytes(&fc->scramble_key, sizeof(fc->scramble_key));
fc->pid_ns = get_pid_ns(task_active_pid_ns(current));
fc->dax_dev = dax_dev;
diff --git a/fs/fuse/readdir.c b/fs/fuse/readdir.c
index ab18b78f4755..e3ecc56013b8 100644
--- a/fs/fuse/readdir.c
+++ b/fs/fuse/readdir.c
@@ -147,7 +147,7 @@ static int parse_dirfile(char *buf, size_t nbytes, struct file *file,
static int fuse_direntplus_link(struct file *file,
struct fuse_direntplus *direntplus,
- u64 attr_version)
+ s64 attr_version)
{
struct fuse_entry_out *o = &direntplus->entry_out;
struct fuse_dirent *dirent = &direntplus->dirent;
@@ -212,6 +212,9 @@ static int fuse_direntplus_link(struct file *file,
return -EIO;
}
+ /* FIXME: translate version_ptr on reading from device... */
+ /* fuse_set_version_ptr(inode, o); */
+
fi = get_fuse_inode(inode);
spin_lock(&fc->lock);
fi->nlookup++;
@@ -231,6 +234,7 @@ static int fuse_direntplus_link(struct file *file,
attr_version);
if (!inode)
inode = ERR_PTR(-ENOMEM);
+ /* else fuse_set_version_ptr(inode, o); */
alias = d_splice_alias(inode, dentry);
d_lookup_done(dentry);
@@ -250,7 +254,7 @@ static int fuse_direntplus_link(struct file *file,
}
static int parse_dirplusfile(char *buf, size_t nbytes, struct file *file,
- struct dir_context *ctx, u64 attr_version)
+ struct dir_context *ctx, s64 attr_version)
{
struct fuse_direntplus *direntplus;
struct fuse_dirent *dirent;
@@ -301,7 +305,7 @@ static int fuse_readdir_uncached(struct file *file, struct dir_context *ctx)
struct inode *inode = file_inode(file);
struct fuse_conn *fc = get_fuse_conn(inode);
struct fuse_req *req;
- u64 attr_version = 0;
+ s64 attr_version = 0;
bool locked;
req = fuse_get_req(fc, 1);
@@ -320,7 +324,7 @@ static int fuse_readdir_uncached(struct file *file, struct dir_context *ctx)
req->pages[0] = page;
req->page_descs[0].length = PAGE_SIZE;
if (plus) {
- attr_version = fuse_get_attr_version(fc);
+ attr_version = fuse_get_attr_version(inode);
fuse_read_fill(req, file, ctx->pos, PAGE_SIZE,
FUSE_READDIRPLUS);
} else {
diff --git a/include/uapi/linux/fuse.h b/include/uapi/linux/fuse.h
index 1657253cb7d6..301c3c23228f 100644
--- a/include/uapi/linux/fuse.h
+++ b/include/uapi/linux/fuse.h
@@ -427,6 +427,11 @@ struct fuse_entry_out {
struct fuse_attr attr;
};
+struct fuse_entryver_out {
+ uint64_t version_index;
+ int64_t initial_version;
+};
+
struct fuse_forget_in {
uint64_t nlookup;
};
--
2.13.6
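The patch above keys cache validity off a shared memory table: each versioned inode holds a `version_ptr` into the table, and `fuse_version_mismatch()` compares the live counter against the version cached when the attributes were last fetched. A minimal user-space sketch of that check, with all names hypothetical and the host-side update simulated by a plain increment:

```c
#include <stdint.h>
#include <stddef.h>

#define TABLE_SIZE 8 /* stand-in for fc->version_table_size */

/* Hypothetical model of the shared version table: the host bumps
 * version_table[i] whenever it changes metadata of the object mapped
 * to slot i; the guest caches the value it last saw. */
static int64_t version_table[TABLE_SIZE];

struct inode_cache {
    int64_t *version_ptr;     /* points into version_table, or NULL */
    int64_t attr_version;     /* counter value when attrs were cached */
};

/* Counterpart of fuse_version_mismatch(): cached attributes are stale
 * once the shared counter has moved past what we cached. Inodes with
 * no shared slot cannot tell, so they fall back to timeout-based
 * revalidation, as in the patch. */
static int version_mismatch(const struct inode_cache *ic)
{
    if (!ic->version_ptr)
        return 0;
    return *ic->version_ptr != ic->attr_version;
}

/* Host-side metadata change: bump the shared counter for slot idx. */
static void host_touch(size_t idx)
{
    version_table[idx]++;
}
```

On a mismatch the guest would issue a fresh FUSE_LOOKUP/GETATTR and store the returned version, which is what `fuse_dentry_revalidate()` and `fuse_update_get_attr()` do with the real table.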
From: Miklos Szeredi <[email protected]>
---
fs/fuse/file.c | 91 ++++++++++++++++++++++++++++--------------------------
fs/splice.c | 3 +-
include/linux/fs.h | 2 ++
3 files changed, 52 insertions(+), 44 deletions(-)
diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index 1f172d372eeb..6421c94cef46 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -22,8 +22,6 @@
#include <linux/iomap.h>
#include <linux/interval_tree_generic.h>
-static const struct file_operations fuse_direct_io_file_operations;
-
INTERVAL_TREE_DEFINE(struct fuse_dax_mapping,
rb, __u64, __subtree_last,
START, LAST, static inline, fuse_dax_interval_tree);
@@ -381,8 +379,6 @@ void fuse_finish_open(struct inode *inode, struct file *file)
struct fuse_file *ff = file->private_data;
struct fuse_conn *fc = get_fuse_conn(inode);
- if (ff->open_flags & FOPEN_DIRECT_IO)
- file->f_op = &fuse_direct_io_file_operations;
if (!(ff->open_flags & FOPEN_KEEP_CACHE))
invalidate_inode_pages2(inode->i_mapping);
if (ff->open_flags & FOPEN_NONSEEKABLE)
@@ -1121,11 +1117,23 @@ static int fuse_readpages(struct file *file, struct address_space *mapping,
return err;
}
+
+static ssize_t fuse_direct_read_iter(struct kiocb *iocb, struct iov_iter *to);
+static ssize_t fuse_dax_read_iter(struct kiocb *iocb, struct iov_iter *to);
+
static ssize_t fuse_file_read_iter(struct kiocb *iocb, struct iov_iter *to)
{
- struct inode *inode = iocb->ki_filp->f_mapping->host;
+ struct file *file = iocb->ki_filp;
+ struct fuse_file *ff = file->private_data;
+ struct inode *inode = file->f_mapping->host;
struct fuse_conn *fc = get_fuse_conn(inode);
+ if (ff->open_flags & FOPEN_DIRECT_IO)
+ return fuse_direct_read_iter(iocb, to);
+
+ if (IS_DAX(inode))
+ return fuse_dax_read_iter(iocb, to);
+
/*
* In auto invalidate mode, always update attributes on read.
* Otherwise, only update if we attempt to read past EOF (to ensure
@@ -1375,9 +1383,14 @@ static ssize_t fuse_perform_write(struct kiocb *iocb,
return res > 0 ? res : err;
}
+static ssize_t fuse_direct_write_iter(struct kiocb *iocb,
+ struct iov_iter *from);
+static ssize_t fuse_dax_write_iter(struct kiocb *iocb, struct iov_iter *from);
+
static ssize_t fuse_file_write_iter(struct kiocb *iocb, struct iov_iter *from)
{
struct file *file = iocb->ki_filp;
+ struct fuse_file *ff = file->private_data;
struct address_space *mapping = file->f_mapping;
ssize_t written = 0;
ssize_t written_buffered = 0;
@@ -1385,6 +1398,11 @@ static ssize_t fuse_file_write_iter(struct kiocb *iocb, struct iov_iter *from)
ssize_t err;
loff_t endbyte = 0;
+ if (ff->open_flags & FOPEN_DIRECT_IO)
+ return fuse_direct_write_iter(iocb, from);
+ if (IS_DAX(inode))
+ return fuse_dax_write_iter(iocb, from);
+
if (get_fuse_conn(inode)->writeback_cache) {
/* Update size (EOF optimization) and mode (SUID clearing) */
err = fuse_update_attributes(mapping->host, file);
@@ -2517,8 +2535,20 @@ static const struct vm_operations_struct fuse_file_vm_ops = {
.page_mkwrite = fuse_page_mkwrite,
};
+static int fuse_direct_mmap(struct file *file, struct vm_area_struct *vma);
+static int fuse_dax_mmap(struct file *file, struct vm_area_struct *vma);
+
static int fuse_file_mmap(struct file *file, struct vm_area_struct *vma)
{
+ struct fuse_file *ff = file->private_data;
+
+ /* DAX mmap is superior to direct_io mmap */
+ if (IS_DAX(file_inode(file)))
+ return fuse_dax_mmap(file, vma);
+
+ if (ff->open_flags & FOPEN_DIRECT_IO)
+ return fuse_direct_mmap(file, vma);
+
if ((vma->vm_flags & VM_SHARED) && (vma->vm_flags & VM_MAYWRITE))
fuse_link_write_file(file);
@@ -2538,6 +2568,18 @@ static int fuse_direct_mmap(struct file *file, struct vm_area_struct *vma)
return generic_file_mmap(file, vma);
}
+static ssize_t fuse_file_splice_read(struct file *in, loff_t *ppos,
+ struct pipe_inode_info *pipe, size_t len,
+ unsigned int flags)
+{
+ struct fuse_file *ff = in->private_data;
+
+ if (ff->open_flags & FOPEN_DIRECT_IO)
+ return default_file_splice_read(in, ppos, pipe, len, flags);
+ else
+ return generic_file_splice_read(in, ppos, pipe, len, flags);
+
+}
static int __fuse_dax_fault(struct vm_fault *vmf, enum page_entry_size pe_size,
bool write)
{
@@ -3629,13 +3671,13 @@ static const struct file_operations fuse_file_operations = {
.read_iter = fuse_file_read_iter,
.write_iter = fuse_file_write_iter,
.mmap = fuse_file_mmap,
+ .splice_read = fuse_file_splice_read,
.open = fuse_open,
.flush = fuse_flush,
.release = fuse_release,
.fsync = fuse_fsync,
.lock = fuse_file_lock,
.flock = fuse_file_flock,
- .splice_read = generic_file_splice_read,
.unlocked_ioctl = fuse_file_ioctl,
.compat_ioctl = fuse_file_compat_ioctl,
.poll = fuse_file_poll,
@@ -3643,42 +3685,6 @@ static const struct file_operations fuse_file_operations = {
.copy_file_range = fuse_copy_file_range,
};
-static const struct file_operations fuse_direct_io_file_operations = {
- .llseek = fuse_file_llseek,
- .read_iter = fuse_direct_read_iter,
- .write_iter = fuse_direct_write_iter,
- .mmap = fuse_direct_mmap,
- .open = fuse_open,
- .flush = fuse_flush,
- .release = fuse_release,
- .fsync = fuse_fsync,
- .lock = fuse_file_lock,
- .flock = fuse_file_flock,
- .unlocked_ioctl = fuse_file_ioctl,
- .compat_ioctl = fuse_file_compat_ioctl,
- .poll = fuse_file_poll,
- .fallocate = fuse_file_fallocate,
- /* no splice_read */
-};
-
-static const struct file_operations fuse_dax_file_operations = {
- .llseek = fuse_file_llseek,
- .read_iter = fuse_dax_read_iter,
- .write_iter = fuse_dax_write_iter,
- .mmap = fuse_dax_mmap,
- .open = fuse_open,
- .flush = fuse_flush,
- .release = fuse_release,
- .fsync = fuse_fsync,
- .lock = fuse_file_lock,
- .flock = fuse_file_flock,
- .unlocked_ioctl = fuse_file_ioctl,
- .compat_ioctl = fuse_file_compat_ioctl,
- .poll = fuse_file_poll,
- .fallocate = fuse_file_fallocate,
- /* no splice_read */
-};
-
static const struct address_space_operations fuse_file_aops = {
.readpage = fuse_readpage,
.writepage = fuse_writepage,
@@ -3716,7 +3722,6 @@ void fuse_init_file_inode(struct inode *inode)
if (fc->dax_dev) {
inode->i_flags |= S_DAX;
- inode->i_fop = &fuse_dax_file_operations;
inode->i_data.a_ops = &fuse_dax_file_aops;
}
}
diff --git a/fs/splice.c b/fs/splice.c
index 3553f1956508..93cbb03a70b1 100644
--- a/fs/splice.c
+++ b/fs/splice.c
@@ -365,7 +365,7 @@ static ssize_t kernel_readv(struct file *file, const struct kvec *vec,
return res;
}
-static ssize_t default_file_splice_read(struct file *in, loff_t *ppos,
+ssize_t default_file_splice_read(struct file *in, loff_t *ppos,
struct pipe_inode_info *pipe, size_t len,
unsigned int flags)
{
@@ -429,6 +429,7 @@ static ssize_t default_file_splice_read(struct file *in, loff_t *ppos,
iov_iter_advance(&to, copied); /* truncates and discards */
return res;
}
+EXPORT_SYMBOL(default_file_splice_read);
/*
* Send 'sd->len' bytes to socket from 'sd->file' at position 'sd->pos'
diff --git a/include/linux/fs.h b/include/linux/fs.h
index c95c0807471f..574e63b58a6f 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -3040,6 +3040,8 @@ extern void block_sync_page(struct page *page);
/* fs/splice.c */
extern ssize_t generic_file_splice_read(struct file *, loff_t *,
struct pipe_inode_info *, size_t, unsigned int);
+extern ssize_t default_file_splice_read(struct file *, loff_t *,
+ struct pipe_inode_info *, size_t, unsigned int);
extern ssize_t iter_file_splice_write(struct pipe_inode_info *,
struct file *, loff_t *, size_t, unsigned int);
extern ssize_t generic_splice_sendpage(struct pipe_inode_info *pipe,
--
2.13.6
From: Stefan Hajnoczi <[email protected]>
Send normal requests to the device and handle completions.
This is enough to get mount and basic I/O working. The hiprio and
notifications queues still need to be implemented for full FUSE
functionality.
Signed-off-by: Stefan Hajnoczi <[email protected]>
---
fs/fuse/fuse_i.h | 3 +
fs/fuse/virtio_fs.c | 529 +++++++++++++++++++++++++++++++++++++++++++++++++---
2 files changed, 501 insertions(+), 31 deletions(-)
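The request path below packs each FUSE request into a scatter-gather list: one device-readable element for the in header, one for the bounce-buffered args, then the data pages, and (for requests expecting a reply) the mirror set of device-writable elements. The element-count arithmetic can be sketched in plain userspace C; the simplified request struct here is a hypothetical stand-in, not the kernel's fuse_req:

```c
#include <assert.h>
#include <stdbool.h>

/* Simplified stand-in for the fields sg_count_fuse_req() consults */
struct fake_req {
	unsigned in_numargs, in_argpages;   /* argpages: args backed by pages */
	unsigned out_numargs, out_argpages;
	unsigned num_pages;
	bool isreply;
};

/* Mirror of the counting logic: header + optional argbuf + data pages,
 * then the same again for the reply direction if one is expected. */
static unsigned sg_count(const struct fake_req *req)
{
	unsigned total = 1; /* fuse_in_header */

	if (req->in_numargs - req->in_argpages)
		total += 1;      /* one contiguous argbuf element */
	if (req->in_argpages)
		total += req->num_pages;

	if (!req->isreply)
		return total;

	total += 1;          /* fuse_out_header */
	if (req->out_numargs - req->out_argpages)
		total += 1;
	if (req->out_argpages)
		total += req->num_pages;

	return total;
}
```

This is why the stack array in virtio_fs_enqueue_req() holds six elements: simple request/reply pairs need four, and the heap path covers page-backed reads and writes.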
diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
index 60ebe3c2e2c3..3a91aa970566 100644
--- a/fs/fuse/fuse_i.h
+++ b/fs/fuse/fuse_i.h
@@ -452,6 +452,9 @@ struct fuse_req {
/** Request is stolen from fuse_file->reserved_req */
struct file *stolen_file;
+
+ /** virtio-fs's physically contiguous buffer for in and out args */
+ void *argbuf;
};
struct fuse_iqueue;
diff --git a/fs/fuse/virtio_fs.c b/fs/fuse/virtio_fs.c
index 8cdeb02f3778..fa99a31ee930 100644
--- a/fs/fuse/virtio_fs.c
+++ b/fs/fuse/virtio_fs.c
@@ -14,14 +14,35 @@
static DEFINE_MUTEX(virtio_fs_mutex);
static LIST_HEAD(virtio_fs_instances);
+/* Per-virtqueue state */
+struct virtio_fs_vq {
+ struct virtqueue *vq; /* protected by fpq->lock */
+ struct work_struct done_work;
+ struct fuse_dev *fud;
+ char name[24];
+} ____cacheline_aligned_in_smp;
+
/* A virtio-fs device instance */
struct virtio_fs {
- struct list_head list; /* on virtio_fs_instances */
+ struct list_head list; /* on virtio_fs_instances */
char *tag;
- struct fuse_dev **fud; /* 1:1 mapping with request queues */
- unsigned int num_queues;
+ struct virtio_fs_vq *vqs;
+ unsigned nvqs; /* number of virtqueues */
+ unsigned num_queues; /* number of request queues */
};
+static inline struct virtio_fs_vq *vq_to_fsvq(struct virtqueue *vq)
+{
+ struct virtio_fs *fs = vq->vdev->priv;
+
+ return &fs->vqs[vq->index];
+}
+
+static inline struct fuse_pqueue *vq_to_fpq(struct virtqueue *vq)
+{
+ return &vq_to_fsvq(vq)->fud->pq;
+}
+
/* Add a new instance to the list or return -EEXIST if tag name exists*/
static int virtio_fs_add_instance(struct virtio_fs *fs)
{
@@ -71,18 +92,17 @@ static void virtio_fs_free_devs(struct virtio_fs *fs)
/* TODO lock */
- if (!fs->fud)
- return;
+ for (i = 0; i < fs->nvqs; i++) {
+ struct virtio_fs_vq *fsvq = &fs->vqs[i];
- for (i = 0; i < fs->num_queues; i++) {
- struct fuse_dev *fud = fs->fud[i];
+ if (!fsvq->fud)
+ continue;
- if (fud)
- fuse_dev_free(fud); /* TODO need to quiesce/end_requests/decrement dev_count */
- }
+ flush_work(&fsvq->done_work);
- kfree(fs->fud);
- fs->fud = NULL;
+ fuse_dev_free(fsvq->fud); /* TODO need to quiesce/end_requests/decrement dev_count */
+ fsvq->fud = NULL;
+ }
}
/* Read filesystem name from virtio config into fs->tag (must kfree()). */
@@ -109,6 +129,210 @@ static int virtio_fs_read_tag(struct virtio_device *vdev, struct virtio_fs *fs)
return 0;
}
+static void virtio_fs_notifications_done(struct virtqueue *vq)
+{
+ /* TODO */
+ dev_dbg(&vq->vdev->dev, "%s\n", __func__);
+}
+
+static void virtio_fs_notifications_done_work(struct work_struct *work)
+{
+ return;
+}
+
+static void virtio_fs_hiprio_done(struct virtqueue *vq)
+{
+ /* TODO */
+ dev_dbg(&vq->vdev->dev, "%s\n", __func__);
+}
+
+/* Allocate and copy args into req->argbuf */
+static int copy_args_to_argbuf(struct fuse_req *req)
+{
+ unsigned offset = 0;
+ unsigned num_in;
+ unsigned num_out;
+ unsigned len;
+ unsigned i;
+
+ num_in = req->in.numargs - req->in.argpages;
+ num_out = req->out.numargs - req->out.argpages;
+ len = fuse_len_args(num_in, (struct fuse_arg *)req->in.args) +
+ fuse_len_args(num_out, req->out.args);
+
+ req->argbuf = kmalloc(len, GFP_ATOMIC);
+ if (!req->argbuf)
+ return -ENOMEM;
+
+ for (i = 0; i < num_in; i++) {
+ memcpy(req->argbuf + offset,
+ req->in.args[i].value,
+ req->in.args[i].size);
+ offset += req->in.args[i].size;
+ }
+
+ return 0;
+}
+
+/* Copy args out of and free req->argbuf */
+static void copy_args_from_argbuf(struct fuse_req *req)
+{
+ unsigned remaining;
+ unsigned offset;
+ unsigned num_in;
+ unsigned num_out;
+ unsigned i;
+
+ remaining = req->out.h.len - sizeof(req->out.h);
+ num_in = req->in.numargs - req->in.argpages;
+ num_out = req->out.numargs - req->out.argpages;
+ offset = fuse_len_args(num_in, (struct fuse_arg *)req->in.args);
+
+ for (i = 0; i < num_out; i++) {
+ unsigned argsize = req->out.args[i].size;
+
+ if (req->out.argvar &&
+ i == req->out.numargs - 1 &&
+ argsize > remaining) {
+ argsize = remaining;
+ }
+
+ memcpy(req->out.args[i].value, req->argbuf + offset, argsize);
+ offset += argsize;
+
+ if (i != req->out.numargs - 1)
+ remaining -= argsize;
+ }
+
+ /* Store the actual size of the variable-length arg */
+ if (req->out.argvar)
+ req->out.args[req->out.numargs - 1].size = remaining;
+
+ kfree(req->argbuf);
+ req->argbuf = NULL;
+}
+
+/* Work function for request completion */
+static void virtio_fs_requests_done_work(struct work_struct *work)
+{
+ struct virtio_fs_vq *fsvq = container_of(work, struct virtio_fs_vq,
+ done_work);
+ struct fuse_pqueue *fpq = &fsvq->fud->pq;
+ struct fuse_conn *fc = fsvq->fud->fc;
+ struct virtqueue *vq = fsvq->vq;
+ struct fuse_req *req;
+ struct fuse_req *next;
+ LIST_HEAD(reqs);
+
+ /* Collect completed requests off the virtqueue */
+ spin_lock(&fpq->lock);
+ do {
+ unsigned len;
+
+ virtqueue_disable_cb(vq);
+
+ while ((req = virtqueue_get_buf(vq, &len)) != NULL)
+ list_move_tail(&req->list, &reqs);
+ } while (!virtqueue_enable_cb(vq) && likely(!virtqueue_is_broken(vq)));
+ spin_unlock(&fpq->lock);
+
+ /* End requests */
+ list_for_each_entry_safe(req, next, &reqs, list) {
+ /* TODO check unique */
+ /* TODO fuse_len_args(out) against oh.len */
+
+ copy_args_from_argbuf(req);
+
+ /* TODO zeroing? */
+
+ spin_lock(&fpq->lock);
+ clear_bit(FR_SENT, &req->flags);
+ list_del_init(&req->list);
+ spin_unlock(&fpq->lock);
+
+ fuse_request_end(fc, req);
+ }
+}
+
+/* Virtqueue interrupt handler */
+static void virtio_fs_vq_done(struct virtqueue *vq)
+{
+ struct virtio_fs_vq *fsvq = vq_to_fsvq(vq);
+
+ dev_dbg(&vq->vdev->dev, "%s %s\n", __func__, fsvq->name);
+
+ schedule_work(&fsvq->done_work);
+}
+
+/* Initialize virtqueues */
+static int virtio_fs_setup_vqs(struct virtio_device *vdev,
+ struct virtio_fs *fs)
+{
+ struct virtqueue **vqs;
+ vq_callback_t **callbacks;
+ const char **names;
+ unsigned i;
+ int ret;
+
+ virtio_cread(vdev, struct virtio_fs_config, num_queues,
+ &fs->num_queues);
+ if (fs->num_queues == 0)
+ return -EINVAL;
+
+ fs->nvqs = 2 + fs->num_queues;
+
+ fs->vqs = devm_kcalloc(&vdev->dev, fs->nvqs, sizeof(fs->vqs[0]),
+ GFP_KERNEL);
+ if (!fs->vqs)
+ return -ENOMEM;
+
+ vqs = kmalloc_array(fs->nvqs, sizeof(vqs[0]), GFP_KERNEL);
+ callbacks = kmalloc_array(fs->nvqs, sizeof(callbacks[0]), GFP_KERNEL);
+ names = kmalloc_array(fs->nvqs, sizeof(names[0]), GFP_KERNEL);
+ if (!vqs || !callbacks || !names) {
+ ret = -ENOMEM;
+ goto out;
+ }
+
+ callbacks[0] = virtio_fs_notifications_done;
+ snprintf(fs->vqs[0].name, sizeof(fs->vqs[0].name), "notifications");
+ INIT_WORK(&fs->vqs[0].done_work, virtio_fs_notifications_done_work);
+ names[0] = fs->vqs[0].name;
+
+ callbacks[1] = virtio_fs_vq_done;
+ snprintf(fs->vqs[1].name, sizeof(fs->vqs[1].name), "hiprio");
+ names[1] = fs->vqs[1].name;
+
+ /* Initialize the requests virtqueues */
+ for (i = 2; i < fs->nvqs; i++) {
+ INIT_WORK(&fs->vqs[i].done_work, virtio_fs_requests_done_work);
+ snprintf(fs->vqs[i].name, sizeof(fs->vqs[i].name),
+ "requests.%u", i - 2);
+ callbacks[i] = virtio_fs_vq_done;
+ names[i] = fs->vqs[i].name;
+ }
+
+ ret = virtio_find_vqs(vdev, fs->nvqs, vqs, callbacks, names, NULL);
+ if (ret < 0)
+ goto out;
+
+ for (i = 0; i < fs->nvqs; i++)
+ fs->vqs[i].vq = vqs[i];
+
+out:
+ kfree(names);
+ kfree(callbacks);
+ kfree(vqs);
+ return ret;
+}
+
+/* Free virtqueues (device must already be reset) */
+static void virtio_fs_cleanup_vqs(struct virtio_device *vdev,
+ struct virtio_fs *fs)
+{
+ vdev->config->del_vqs(vdev);
+}
+
static int virtio_fs_probe(struct virtio_device *vdev)
{
struct virtio_fs *fs;
@@ -119,23 +343,32 @@ static int virtio_fs_probe(struct virtio_device *vdev)
return -ENOMEM;
vdev->priv = fs;
- virtio_cread(vdev, struct virtio_fs_config, num_queues,
- &fs->num_queues);
- if (fs->num_queues == 0) {
- ret = -EINVAL;
+ ret = virtio_fs_read_tag(vdev, fs);
+ if (ret < 0)
goto out;
- }
- ret = virtio_fs_read_tag(vdev, fs);
+ ret = virtio_fs_setup_vqs(vdev, fs);
if (ret < 0)
goto out;
+ /* TODO vq affinity */
+ /* TODO populate notifications vq */
+
+ /* Bring the device online in case the filesystem is mounted and
+ * requests need to be sent before we return.
+ */
+ virtio_device_ready(vdev);
+
ret = virtio_fs_add_instance(fs);
if (ret < 0)
- goto out;
+ goto out_vqs;
return 0;
+out_vqs:
+ vdev->config->reset(vdev);
+ virtio_fs_cleanup_vqs(vdev, fs);
+
out:
vdev->priv = NULL;
return ret;
@@ -148,6 +381,7 @@ static void virtio_fs_remove(struct virtio_device *vdev)
virtio_fs_free_devs(fs);
vdev->config->reset(vdev);
+ virtio_fs_cleanup_vqs(vdev, fs);
mutex_lock(&virtio_fs_mutex);
list_del(&fs->list);
@@ -190,6 +424,234 @@ static struct virtio_driver virtio_fs_driver = {
#endif
};
+static void virtio_fs_wake_forget_and_unlock(struct fuse_iqueue *fiq)
+__releases(fiq->waitq.lock)
+{
+ /* TODO */
+ spin_unlock(&fiq->waitq.lock);
+}
+
+static void virtio_fs_wake_interrupt_and_unlock(struct fuse_iqueue *fiq)
+__releases(fiq->waitq.lock)
+{
+ /* TODO */
+ spin_unlock(&fiq->waitq.lock);
+}
+
+/* Return the number of scatter-gather list elements required */
+static unsigned sg_count_fuse_req(struct fuse_req *req)
+{
+ unsigned total_sgs = 1 /* fuse_in_header */;
+
+ if (req->in.numargs - req->in.argpages)
+ total_sgs += 1;
+
+ if (req->in.argpages)
+ total_sgs += req->num_pages;
+
+ if (!test_bit(FR_ISREPLY, &req->flags))
+ return total_sgs;
+
+ total_sgs += 1 /* fuse_out_header */;
+
+ if (req->out.numargs - req->out.argpages)
+ total_sgs += 1;
+
+ if (req->out.argpages)
+ total_sgs += req->num_pages;
+
+ return total_sgs;
+}
+
+/* Add pages to scatter-gather list and return number of elements used */
+static unsigned sg_init_fuse_pages(struct scatterlist *sg,
+ struct page **pages,
+ struct fuse_page_desc *page_descs,
+ unsigned num_pages)
+{
+ unsigned i;
+
+ for (i = 0; i < num_pages; i++) {
+ sg_init_table(&sg[i], 1);
+ sg_set_page(&sg[i], pages[i],
+ page_descs[i].length,
+ page_descs[i].offset);
+ }
+
+ return i;
+}
+
+/* Add args to scatter-gather list and return number of elements used */
+static unsigned sg_init_fuse_args(struct scatterlist *sg,
+ struct fuse_req *req,
+ struct fuse_arg *args,
+ unsigned numargs,
+ bool argpages,
+ void *argbuf,
+ unsigned *len_used)
+{
+ unsigned total_sgs = 0;
+ unsigned len;
+
+ len = fuse_len_args(numargs - argpages, args);
+ if (len)
+ sg_init_one(&sg[total_sgs++], argbuf, len);
+
+ if (argpages)
+ total_sgs += sg_init_fuse_pages(&sg[total_sgs],
+ req->pages,
+ req->page_descs,
+ req->num_pages);
+
+ if (len_used)
+ *len_used = len;
+
+ return total_sgs;
+}
+
+/* Add a request to a virtqueue and kick the device */
+static int virtio_fs_enqueue_req(struct virtqueue *vq, struct fuse_req *req)
+{
+ struct scatterlist *stack_sgs[6 /* requests need at least 4 elements */];
+ struct scatterlist stack_sg[ARRAY_SIZE(stack_sgs)];
+ struct scatterlist **sgs = stack_sgs;
+ struct scatterlist *sg = stack_sg;
+ struct fuse_pqueue *fpq;
+ unsigned argbuf_used = 0;
+ unsigned out_sgs = 0;
+ unsigned in_sgs = 0;
+ unsigned total_sgs;
+ unsigned i;
+ int ret;
+ bool notify;
+
+ /* Does the sglist fit on the stack? */
+ total_sgs = sg_count_fuse_req(req);
+ if (total_sgs > ARRAY_SIZE(stack_sgs)) {
+ sgs = kmalloc_array(total_sgs, sizeof(sgs[0]), GFP_ATOMIC);
+ sg = kmalloc_array(total_sgs, sizeof(sg[0]), GFP_ATOMIC);
+ if (!sgs || !sg) {
+ ret = -ENOMEM;
+ goto out;
+ }
+ }
+
+ /* Use a bounce buffer since stack args cannot be mapped */
+ ret = copy_args_to_argbuf(req);
+ if (ret < 0)
+ goto out;
+
+ /* Request elements */
+ sg_init_one(&sg[out_sgs++], &req->in.h, sizeof(req->in.h));
+ out_sgs += sg_init_fuse_args(&sg[out_sgs], req,
+ (struct fuse_arg *)req->in.args,
+ req->in.numargs, req->in.argpages,
+ req->argbuf, &argbuf_used);
+
+ /* Reply elements */
+ if (test_bit(FR_ISREPLY, &req->flags)) {
+ sg_init_one(&sg[out_sgs + in_sgs++],
+ &req->out.h, sizeof(req->out.h));
+ in_sgs += sg_init_fuse_args(&sg[out_sgs + in_sgs], req,
+ req->out.args, req->out.numargs,
+ req->out.argpages,
+ req->argbuf + argbuf_used, NULL);
+ }
+
+ BUG_ON(out_sgs + in_sgs != total_sgs);
+
+ for (i = 0; i < total_sgs; i++)
+ sgs[i] = &sg[i];
+
+ fpq = vq_to_fpq(vq);
+ spin_lock(&fpq->lock);
+
+ ret = virtqueue_add_sgs(vq, sgs, out_sgs, in_sgs, req, GFP_ATOMIC);
+ if (ret < 0) {
+ /* TODO handle full virtqueue */
+ spin_unlock(&fpq->lock);
+ goto out;
+ }
+
+ notify = virtqueue_kick_prepare(vq);
+
+ spin_unlock(&fpq->lock);
+
+ if (notify)
+ virtqueue_notify(vq);
+
+out:
+ if (ret < 0 && req->argbuf) {
+ kfree(req->argbuf);
+ req->argbuf = NULL;
+ }
+ if (sgs != stack_sgs) {
+ kfree(sgs);
+ kfree(sg);
+ }
+
+ return ret;
+}
+
+static void virtio_fs_wake_pending_and_unlock(struct fuse_iqueue *fiq)
+__releases(fiq->waitq.lock)
+{
+ unsigned queue_id = 2; /* TODO multiqueue */
+ struct virtio_fs *fs;
+ struct fuse_conn *fc;
+ struct fuse_req *req;
+ struct fuse_pqueue *fpq;
+ int ret;
+
+ BUG_ON(list_empty(&fiq->pending));
+ req = list_last_entry(&fiq->pending, struct fuse_req, list);
+ clear_bit(FR_PENDING, &req->flags);
+ list_del_init(&req->list);
+ BUG_ON(!list_empty(&fiq->pending));
+ spin_unlock(&fiq->waitq.lock);
+
+ fs = fiq->priv;
+ fc = fs->vqs[queue_id].fud->fc;
+
+ dev_dbg(&fs->vqs[queue_id].vq->vdev->dev,
+ "%s: opcode %u unique %#llx nodeid %#llx in.len %u out.len %u\n",
+ __func__, req->in.h.opcode, req->in.h.unique, req->in.h.nodeid,
+ req->in.h.len, fuse_len_args(req->out.numargs, req->out.args));
+
+ /* TODO put request onto fpq->io list? */
+
+ fpq = &fs->vqs[queue_id].fud->pq;
+ spin_lock(&fpq->lock);
+ if (!fpq->connected) {
+ spin_unlock(&fpq->lock);
+ req->out.h.error = -ENODEV;
+ printk(KERN_ERR "%s: disconnected\n", __func__);
+/* fuse_request_end(fc, req); unsafe due to fc->lock */
+ return;
+ }
+ list_add_tail(&req->list, fpq->processing);
+ spin_unlock(&fpq->lock);
+ set_bit(FR_SENT, &req->flags);
+ /* matches barrier in request_wait_answer() */
+ smp_mb__after_atomic();
+ /* TODO check for FR_INTERRUPTED? */
+
+ ret = virtio_fs_enqueue_req(fs->vqs[queue_id].vq, req);
+ if (ret < 0) {
+ req->out.h.error = ret;
+ printk(KERN_ERR "%s: virtio_fs_enqueue_req failed %d\n",
+ __func__, ret);
+/* fuse_request_end(fc, req); unsafe due to fc->lock */
+ return;
+ }
+}
+
+const static struct fuse_iqueue_ops virtio_fs_fiq_ops = {
+ .wake_forget_and_unlock = virtio_fs_wake_forget_and_unlock,
+ .wake_interrupt_and_unlock = virtio_fs_wake_interrupt_and_unlock,
+ .wake_pending_and_unlock = virtio_fs_wake_pending_and_unlock,
+};
+
static int virtio_fs_fill_super(struct super_block *sb, void *data,
int silent)
{
@@ -220,30 +682,35 @@ static int virtio_fs_fill_super(struct super_block *sb, void *data,
}
/* TODO lock */
- if (fs->fud) {
+ if (fs->vqs[2].fud) {
printk(KERN_ERR "virtio-fs: device already in use\n");
err = -EBUSY;
goto err;
}
- fs->fud = kcalloc(fs->num_queues, sizeof(fs->fud[0]), GFP_KERNEL);
- if (!fs->fud) {
- err = -ENOMEM;
- goto err_fud;
- }
- err = fuse_fill_super_common(sb, &d, (void **)&fs->fud[0]);
+ /* TODO this sends FUSE_INIT and could cause hiprio or notifications
+ * virtqueue races since they haven't been set up yet!
+ */
+ err = fuse_fill_super_common(sb, &d, &virtio_fs_fiq_ops, fs,
+ (void **)&fs->vqs[2].fud);
if (err < 0)
goto err_fud;
- fc = fs->fud[0]->fc;
+ fc = fs->vqs[2].fud->fc;
- /* Allocate remaining fuse_devs */
err = -ENOMEM;
/* TODO take fuse_mutex around this loop? */
- for (i = 1; i < fs->num_queues; i++) {
- fs->fud[i] = fuse_dev_alloc(fc);
- if (!fs->fud[i]) {
+ for (i = 0; i < fs->nvqs; i++) {
+ struct virtio_fs_vq *fsvq = &fs->vqs[i];
+
+ if (i == 2)
+ continue; /* already initialized */
+
+ fsvq->fud = fuse_dev_alloc(fc);
+ if (!fsvq->fud) {
/* TODO */
+ printk(KERN_ERR "%s: fuse_dev_alloc failed\n",
+ __func__);
}
atomic_inc(&fc->dev_count);
}
--
2.13.6
Divide the DAX memory range into fixed-size ranges (2MB for now) and put
them on a list that tracks free ranges. When an inode requires a free
range, we take one from this list and insert it into the interval tree of
ranges assigned to that inode.
Signed-off-by: Vivek Goyal <[email protected]>
---
fs/fuse/fuse_i.h | 14 +++++++++
fs/fuse/inode.c | 81 ++++++++++++++++++++++++++++++++++++++++++++++++++++-
fs/fuse/virtio_fs.c | 2 ++
3 files changed, 96 insertions(+), 1 deletion(-)
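The carving of the DAX window into FUSE_DAX_MEM_RANGE_SZ chunks on a free list is simple offset arithmetic; a standalone sketch with a singly linked list (hypothetical names, not the kernel's list_head machinery):

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define RANGE_SZ (2UL * 1024 * 1024)   /* FUSE_DAX_MEM_RANGE_SZ */
#define PAGE_SZ  4096UL

struct range {
	uint64_t window_offset;
	uint64_t length;
	struct range *next;            /* singly linked free list */
};

/* Split a window of nr_pages 4K pages into 2MB ranges; return the list
 * head and store the count. Mirrors fuse_dax_mem_range_init()'s layout,
 * i.e. range i starts at i * RANGE_SZ from the window base. */
static struct range *init_free_ranges(long nr_pages, unsigned long *nr_free)
{
	long nr_ranges = nr_pages / (RANGE_SZ / PAGE_SZ);
	struct range *head = NULL;
	long i;

	for (i = nr_ranges - 1; i >= 0; i--) {
		struct range *r = calloc(1, sizeof(*r));
		if (!r)
			return NULL;   /* error-path cleanup omitted */
		r->window_offset = (uint64_t)i * RANGE_SZ;
		r->length = RANGE_SZ;
		r->next = head;
		head = r;
	}
	*nr_free = nr_ranges;
	return head;
}
```

Any tail of the window smaller than one range is silently dropped, matching the integer division in the patch.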
diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
index b9880be690bd..f0775d76e31f 100644
--- a/fs/fuse/fuse_i.h
+++ b/fs/fuse/fuse_i.h
@@ -46,6 +46,10 @@
/** Number of page pointers embedded in fuse_req */
#define FUSE_REQ_INLINE_PAGES 1
+/* Default memory range size, 2MB */
+#define FUSE_DAX_MEM_RANGE_SZ (2*1024*1024)
+#define FUSE_DAX_MEM_RANGE_PAGES (FUSE_DAX_MEM_RANGE_SZ/PAGE_SIZE)
+
/** List of active connections */
extern struct list_head fuse_conn_list;
@@ -83,6 +87,9 @@ struct fuse_forget_link {
/** Translation information for file offsets to DAX window offsets */
struct fuse_dax_mapping {
+ /* Will connect in fc->free_ranges to keep track of free memory */
+ struct list_head list;
+
/** Position in DAX window */
u64 window_offset;
@@ -816,6 +823,13 @@ struct fuse_conn {
/** DAX device, non-NULL if DAX is supported */
struct dax_device *dax_dev;
+
+ /*
+ * DAX Window Free Ranges. TODO: This might not be best place to store
+ * this free list
+ */
+ unsigned long nr_free_ranges;
+ struct list_head free_ranges;
};
static inline struct fuse_conn *get_fuse_conn_super(struct super_block *sb)
diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
index d2afce377fd4..403360e352d8 100644
--- a/fs/fuse/inode.c
+++ b/fs/fuse/inode.c
@@ -22,6 +22,8 @@
#include <linux/exportfs.h>
#include <linux/posix_acl.h>
#include <linux/pid_namespace.h>
+#include <linux/dax.h>
+#include <linux/pfn_t.h>
MODULE_AUTHOR("Miklos Szeredi <[email protected]>");
MODULE_DESCRIPTION("Filesystem in Userspace");
@@ -607,6 +609,69 @@ static void fuse_pqueue_init(struct fuse_pqueue *fpq)
fpq->connected = 1;
}
+static void fuse_free_dax_mem_ranges(struct list_head *mem_list)
+{
+ struct fuse_dax_mapping *range, *temp;
+
+ /* Free All allocated elements */
+ list_for_each_entry_safe(range, temp, mem_list, list) {
+ list_del(&range->list);
+ kfree(range);
+ }
+}
+
+static int fuse_dax_mem_range_init(struct fuse_conn *fc,
+ struct dax_device *dax_dev)
+{
+ long nr_pages, nr_ranges;
+ void *kaddr;
+ pfn_t pfn;
+ struct fuse_dax_mapping *range;
+ LIST_HEAD(mem_ranges);
+ phys_addr_t phys_addr;
+ int ret = 0, id;
+ size_t dax_size = -1;
+ unsigned long allocated_ranges = 0, i;
+
+ id = dax_read_lock();
+ nr_pages = dax_direct_access(dax_dev, 0, PHYS_PFN(dax_size), &kaddr,
+ &pfn);
+ dax_read_unlock(id);
+ if (nr_pages < 0) {
+ pr_debug("dax_direct_access() returned %ld\n", nr_pages);
+ return nr_pages;
+ }
+
+ phys_addr = pfn_t_to_phys(pfn);
+ nr_ranges = nr_pages/FUSE_DAX_MEM_RANGE_PAGES;
+ printk("fuse_dax_mem_range_init(): dax mapped %ld pages. nr_ranges=%ld\n", nr_pages, nr_ranges);
+
+ for (i = 0; i < nr_ranges; i++) {
+ range = kzalloc(sizeof(struct fuse_dax_mapping), GFP_KERNEL);
+ if (!range) {
+ pr_debug("memory allocation for mem_range failed.\n");
+ ret = -ENOMEM;
+ goto out_err;
+ }
+ /* TODO: This offset only works if virtio-fs driver is not
+ * having some memory hidden at the beginning. This needs
+ * better handling
+ */
+ range->window_offset = i * FUSE_DAX_MEM_RANGE_SZ;
+ range->length = FUSE_DAX_MEM_RANGE_SZ;
+ list_add_tail(&range->list, &mem_ranges);
+ allocated_ranges++;
+ }
+
+ list_replace_init(&mem_ranges, &fc->free_ranges);
+ fc->nr_free_ranges = allocated_ranges;
+ return 0;
+out_err:
+ /* Free All allocated elements */
+ fuse_free_dax_mem_ranges(&mem_ranges);
+ return ret;
+}
+
void fuse_conn_init(struct fuse_conn *fc, struct user_namespace *user_ns,
struct dax_device *dax_dev,
const struct fuse_iqueue_ops *fiq_ops, void *fiq_priv)
@@ -636,6 +701,7 @@ void fuse_conn_init(struct fuse_conn *fc, struct user_namespace *user_ns,
fc->pid_ns = get_pid_ns(task_active_pid_ns(current));
fc->dax_dev = dax_dev;
fc->user_ns = get_user_ns(user_ns);
+ INIT_LIST_HEAD(&fc->free_ranges);
}
EXPORT_SYMBOL_GPL(fuse_conn_init);
@@ -644,6 +710,8 @@ void fuse_conn_put(struct fuse_conn *fc)
if (refcount_dec_and_test(&fc->count)) {
if (fc->destroy_req)
fuse_request_free(fc->destroy_req);
+ if (fc->dax_dev)
+ fuse_free_dax_mem_ranges(&fc->free_ranges);
put_pid_ns(fc->pid_ns);
put_user_ns(fc->user_ns);
fc->release(fc);
@@ -1136,9 +1204,17 @@ int fuse_fill_super_common(struct super_block *sb,
fuse_conn_init(fc, sb->s_user_ns, dax_dev, fiq_ops, fiq_priv);
fc->release = fuse_free_conn;
+ if (dax_dev) {
+ err = fuse_dax_mem_range_init(fc, dax_dev);
+ if (err) {
+ pr_debug("fuse_dax_mem_range_init() returned %d\n", err);
+ goto err_put_conn;
+ }
+ }
+
fud = fuse_dev_alloc(fc);
if (!fud)
- goto err_put_conn;
+ goto err_free_ranges;
fc->dev = sb->s_dev;
fc->sb = sb;
@@ -1211,6 +1287,9 @@ int fuse_fill_super_common(struct super_block *sb,
dput(root_dentry);
err_dev_free:
fuse_dev_free(fud);
+ err_free_ranges:
+ if (dax_dev)
+ fuse_free_dax_mem_ranges(&fc->free_ranges);
err_put_conn:
fuse_conn_put(fc);
sb->s_fs_info = NULL;
diff --git a/fs/fuse/virtio_fs.c b/fs/fuse/virtio_fs.c
index ef1469b38a6d..c79c9a885253 100644
--- a/fs/fuse/virtio_fs.c
+++ b/fs/fuse/virtio_fs.c
@@ -451,6 +451,8 @@ static long virtio_fs_direct_access(struct dax_device *dax_dev, pgoff_t pgoff,
phys_addr_t offset = PFN_PHYS(pgoff);
size_t max_nr_pages = fs->window_len/PAGE_SIZE - pgoff;
+ pr_debug("virtio_fs_direct_access(): called. nr_pages=%ld max_nr_pages=%ld\n", nr_pages, max_nr_pages);
+
if (kaddr)
*kaddr = fs->window_kaddr + offset;
if (pfn)
--
2.13.6
Right now dax_writeback_mapping_range() is passed a bdev, and the dax_dev
is looked up from that bdev's name.
virtio-fs does not have a bdev, so pass the dax_dev to
dax_writeback_mapping_range() as well. If a dax_dev is passed in, the
bdev is not used; otherwise the dax_dev is looked up from the bdev as
before.
Signed-off-by: Vivek Goyal <[email protected]>
---
fs/dax.c | 16 ++++++++++------
fs/ext4/inode.c | 2 +-
fs/xfs/xfs_aops.c | 2 +-
include/linux/dax.h | 6 ++++--
4 files changed, 16 insertions(+), 10 deletions(-)
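The ownership rule the new signature encodes — a caller-supplied handle is borrowed, a looked-up one must be released — can be sketched generically; the handle type and get/put helpers below are hypothetical stand-ins for dax_get_by_host()/put_dax():

```c
#include <assert.h>
#include <stddef.h>

struct handle { int refs; };

static struct handle global = { .refs = 1 };

/* Lookup takes a reference the caller must drop */
static struct handle *get_by_name(const char *name)
{
	(void)name;
	global.refs++;
	return &global;
}

static void put_handle(struct handle *h) { h->refs--; }

/* Mirrors the patched dax_writeback_mapping_range(): use the caller's
 * handle if given; otherwise look one up by name and drop it on exit. */
static int do_writeback(const char *bdev_name, struct handle *h)
{
	int looked_up = 0;

	if (bdev_name) {
		h = get_by_name(bdev_name);
		if (!h)
			return -1;
		looked_up = 1;
	}

	/* ... flush dirty mapping entries through h ... */

	if (looked_up)
		put_handle(h);
	return 0;
}
```

Either way the refcount is balanced on return, which is what the `if (bdev) put_dax(dax_dev);` in the patch preserves.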
diff --git a/fs/dax.c b/fs/dax.c
index 6431c3aba182..1ae3a60c17d4 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -911,12 +911,12 @@ static int dax_writeback_one(struct xa_state *xas, struct dax_device *dax_dev,
* on persistent storage prior to completion of the operation.
*/
int dax_writeback_mapping_range(struct address_space *mapping,
- struct block_device *bdev, struct writeback_control *wbc)
+ struct block_device *bdev, struct dax_device *dax_dev,
+ struct writeback_control *wbc)
{
XA_STATE(xas, &mapping->i_pages, wbc->range_start >> PAGE_SHIFT);
struct inode *inode = mapping->host;
pgoff_t end_index = wbc->range_end >> PAGE_SHIFT;
- struct dax_device *dax_dev;
void *entry;
int ret = 0;
unsigned int scanned = 0;
@@ -927,9 +927,12 @@ int dax_writeback_mapping_range(struct address_space *mapping,
if (!mapping->nrexceptional || wbc->sync_mode != WB_SYNC_ALL)
return 0;
- dax_dev = dax_get_by_host(bdev->bd_disk->disk_name);
- if (!dax_dev)
- return -EIO;
+ if (bdev) {
+ WARN_ON(dax_dev);
+ dax_dev = dax_get_by_host(bdev->bd_disk->disk_name);
+ if (!dax_dev)
+ return -EIO;
+ }
trace_dax_writeback_range(inode, xas.xa_index, end_index);
@@ -951,7 +954,8 @@ int dax_writeback_mapping_range(struct address_space *mapping,
xas_lock_irq(&xas);
}
xas_unlock_irq(&xas);
- put_dax(dax_dev);
+ if (bdev)
+ put_dax(dax_dev);
trace_dax_writeback_range_done(inode, xas.xa_index, end_index);
return ret;
}
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 22a9d8159720..3569c260a3bd 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -2978,7 +2978,7 @@ static int ext4_dax_writepages(struct address_space *mapping,
percpu_down_read(&sbi->s_journal_flag_rwsem);
trace_ext4_writepages(inode, wbc);
- ret = dax_writeback_mapping_range(mapping, inode->i_sb->s_bdev, wbc);
+ ret = dax_writeback_mapping_range(mapping, inode->i_sb->s_bdev, NULL, wbc);
trace_ext4_writepages_result(inode, wbc, ret,
nr_to_write - wbc->nr_to_write);
percpu_up_read(&sbi->s_journal_flag_rwsem);
diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
index 338b9d9984e0..b1947beec50a 100644
--- a/fs/xfs/xfs_aops.c
+++ b/fs/xfs/xfs_aops.c
@@ -951,7 +951,7 @@ xfs_dax_writepages(
{
xfs_iflags_clear(XFS_I(mapping->host), XFS_ITRUNCATED);
return dax_writeback_mapping_range(mapping,
- xfs_find_bdev_for_inode(mapping->host), wbc);
+ xfs_find_bdev_for_inode(mapping->host), NULL, wbc);
}
STATIC int
diff --git a/include/linux/dax.h b/include/linux/dax.h
index 450b28db9533..a8461841f148 100644
--- a/include/linux/dax.h
+++ b/include/linux/dax.h
@@ -85,7 +85,8 @@ static inline void fs_put_dax(struct dax_device *dax_dev)
struct dax_device *fs_dax_get_by_bdev(struct block_device *bdev);
int dax_writeback_mapping_range(struct address_space *mapping,
- struct block_device *bdev, struct writeback_control *wbc);
+ struct block_device *bdev, struct dax_device *dax_dev,
+ struct writeback_control *wbc);
struct page *dax_layout_busy_page(struct address_space *mapping);
bool dax_lock_mapping_entry(struct page *page);
@@ -117,7 +118,8 @@ static inline struct page *dax_layout_busy_page(struct address_space *mapping)
}
static inline int dax_writeback_mapping_range(struct address_space *mapping,
- struct block_device *bdev, struct writeback_control *wbc)
+ struct block_device *bdev, struct dax_device *dax_dev,
+ struct writeback_control *wbc)
{
return -EOPNOTSUPP;
}
--
2.13.6
This patch addresses file-growing writes. For now it calls fallocate() to
grow the file and then maps it using DAX. We still need to figure out the
best way to handle this.
The fallocate() and mapping setup are done in fuse_dax_write_iter()
instead of iomap_begin(), because iomap_begin() does not have access to
the file pointer needed to send a message to the fuse daemon.
Dave Chinner has expressed concerns with this approach because it is not
atomic. If the guest crashes after the fallocate() but before the data is
written, the user will think the filesystem lost their data. So this is
still an outstanding issue.
Signed-off-by: Vivek Goyal <[email protected]>
---
fs/fuse/file.c | 71 +++++++++++++++++++++++++++++++++++++++++++++-------------
1 file changed, 55 insertions(+), 16 deletions(-)
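The grow-before-write check added to fuse_dax_write_iter() reduces to offset arithmetic: preallocate only when the write's end passes the current inode size, requesting the write's own offset and length. A standalone sketch (function name hypothetical):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Decide whether a write needs a preallocating fallocate() first, and
 * what to request: from the write position, for the write's length,
 * matching the __fuse_file_fallocate(file, 0, ki_pos, count) call. */
static bool write_needs_grow(uint64_t i_size, uint64_t pos, uint64_t count,
			     uint64_t *falloc_off, uint64_t *falloc_len)
{
	if (pos + count <= i_size)
		return false;
	*falloc_off = pos;
	*falloc_len = count;
	return true;
}
```

Note the non-atomicity concern above lives exactly here: between a successful grow and dax_iomap_rw() completing, the file size already reflects data that was never written.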
diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index 94ad76382a6f..41d773ba2c72 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -28,6 +28,9 @@ INTERVAL_TREE_DEFINE(struct fuse_dax_mapping,
rb, __u64, __subtree_last,
START, LAST, static inline, fuse_dax_interval_tree);
+static long __fuse_file_fallocate(struct file *file, int mode,
+ loff_t offset, loff_t length);
+
static int fuse_send_open(struct fuse_conn *fc, u64 nodeid, struct file *file,
int opcode, struct fuse_open_out *outargp)
{
@@ -1819,6 +1822,22 @@ static ssize_t fuse_dax_write_iter(struct kiocb *iocb, struct iov_iter *from)
/* TODO file_update_time() but we don't want metadata I/O */
/* TODO handle growing the file */
+ /* Grow file here if need be. iomap_begin() does not have access
+ * to file pointer
+ */
+ if (iov_iter_rw(from) == WRITE &&
+ ((iocb->ki_pos + iov_iter_count(from)) > i_size_read(inode))) {
+ ret = __fuse_file_fallocate(iocb->ki_filp, 0, iocb->ki_pos,
+ iov_iter_count(from));
+ if (ret < 0) {
+ printk("fallocate(offset=0x%llx length=0x%lx)"
+ " failed. err=%ld\n", iocb->ki_pos,
+ iov_iter_count(from), ret);
+ goto out;
+ }
+ pr_debug("fallocate(offset=0x%llx length=0x%lx)"
+ " succeed. ret=%ld\n", iocb->ki_pos, iov_iter_count(from), ret);
+ }
ret = dax_iomap_rw(iocb, from, &fuse_iomap_ops);
@@ -3331,8 +3350,12 @@ fuse_direct_IO(struct kiocb *iocb, struct iov_iter *iter)
return ret;
}
-static long fuse_file_fallocate(struct file *file, int mode, loff_t offset,
- loff_t length)
+/*
+ * This variant does not take any inode lock and if locking is required,
+ * caller is supposed to hold lock
+ */
+static long __fuse_file_fallocate(struct file *file, int mode,
+ loff_t offset, loff_t length)
{
struct fuse_file *ff = file->private_data;
struct inode *inode = file_inode(file);
@@ -3346,8 +3369,6 @@ static long fuse_file_fallocate(struct file *file, int mode, loff_t offset,
.mode = mode
};
int err;
- bool lock_inode = !(mode & FALLOC_FL_KEEP_SIZE) ||
- (mode & FALLOC_FL_PUNCH_HOLE);
if (mode & ~(FALLOC_FL_KEEP_SIZE | FALLOC_FL_PUNCH_HOLE))
return -EOPNOTSUPP;
@@ -3355,17 +3376,13 @@ static long fuse_file_fallocate(struct file *file, int mode, loff_t offset,
if (fc->no_fallocate)
return -EOPNOTSUPP;
- if (lock_inode) {
- inode_lock(inode);
- if (mode & FALLOC_FL_PUNCH_HOLE) {
- loff_t endbyte = offset + length - 1;
- err = filemap_write_and_wait_range(inode->i_mapping,
- offset, endbyte);
- if (err)
- goto out;
-
- fuse_sync_writes(inode);
- }
+ if (mode & FALLOC_FL_PUNCH_HOLE) {
+ loff_t endbyte = offset + length - 1;
+ err = filemap_write_and_wait_range(inode->i_mapping, offset,
+ endbyte);
+ if (err)
+ goto out;
+ fuse_sync_writes(inode);
}
if (!(mode & FALLOC_FL_KEEP_SIZE))
@@ -3401,9 +3418,31 @@ static long fuse_file_fallocate(struct file *file, int mode, loff_t offset,
if (!(mode & FALLOC_FL_KEEP_SIZE))
clear_bit(FUSE_I_SIZE_UNSTABLE, &fi->state);
+ return err;
+}
+
+static long fuse_file_fallocate(struct file *file, int mode, loff_t offset,
+ loff_t length)
+{
+ struct fuse_file *ff = file->private_data;
+ struct inode *inode = file_inode(file);
+ struct fuse_conn *fc = ff->fc;
+ int err;
+ bool lock_inode = !(mode & FALLOC_FL_KEEP_SIZE) ||
+ (mode & FALLOC_FL_PUNCH_HOLE);
+
+ if (mode & ~(FALLOC_FL_KEEP_SIZE | FALLOC_FL_PUNCH_HOLE))
+ return -EOPNOTSUPP;
+
+ if (fc->no_fallocate)
+ return -EOPNOTSUPP;
+
if (lock_inode)
- inode_unlock(inode);
+ inode_lock(inode);
+ err = __fuse_file_fallocate(file, mode, offset, length);
+ if (lock_inode)
+ inode_unlock(inode);
return err;
}
--
2.13.6
From: Stefan Hajnoczi <[email protected]>
This patch implements basic DAX support. mmap() is not implemented
yet and will come in later patches. This patch implements the
read/write paths.
Signed-off-by: Stefan Hajnoczi <[email protected]>
Signed-off-by: Dr. David Alan Gilbert <[email protected]>
Signed-off-by: Vivek Goyal <[email protected]>
---
fs/fuse/file.c | 400 ++++++++++++++++++++++++++++++++++++++++++++++
fs/fuse/fuse_i.h | 6 +
fs/fuse/inode.c | 6 +
include/uapi/linux/fuse.h | 1 +
4 files changed, 413 insertions(+)
diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index b52f9baaa3e7..449a6b315327 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -18,9 +18,16 @@
#include <linux/swap.h>
#include <linux/falloc.h>
#include <linux/uio.h>
+#include <linux/dax.h>
+#include <linux/iomap.h>
+#include <linux/interval_tree_generic.h>
static const struct file_operations fuse_direct_io_file_operations;
+INTERVAL_TREE_DEFINE(struct fuse_dax_mapping,
+ rb, __u64, __subtree_last,
+ START, LAST, static inline, fuse_dax_interval_tree);
+
static int fuse_send_open(struct fuse_conn *fc, u64 nodeid, struct file *file,
int opcode, struct fuse_open_out *outargp)
{
@@ -172,6 +179,171 @@ static void fuse_link_write_file(struct file *file)
spin_unlock(&fc->lock);
}
+static struct fuse_dax_mapping *alloc_dax_mapping(struct fuse_conn *fc)
+{
+ struct fuse_dax_mapping *dmap = NULL;
+
+ spin_lock(&fc->lock);
+
+ /* TODO: Add logic to try to free up memory if wait is allowed */
+ if (fc->nr_free_ranges <= 0) {
+ spin_unlock(&fc->lock);
+ return NULL;
+ }
+
+ WARN_ON(list_empty(&fc->free_ranges));
+
+ /* Take a free range */
+ dmap = list_first_entry(&fc->free_ranges, struct fuse_dax_mapping,
+ list);
+ list_del_init(&dmap->list);
+ fc->nr_free_ranges--;
+ spin_unlock(&fc->lock);
+ return dmap;
+}
+
+/* This assumes fc->lock is held */
+static void __free_dax_mapping(struct fuse_conn *fc,
+ struct fuse_dax_mapping *dmap)
+{
+ list_add_tail(&dmap->list, &fc->free_ranges);
+ fc->nr_free_ranges++;
+}
+
+static void free_dax_mapping(struct fuse_conn *fc,
+ struct fuse_dax_mapping *dmap)
+{
+ /* Return fuse_dax_mapping to free list */
+ spin_lock(&fc->lock);
+ __free_dax_mapping(fc, dmap);
+ spin_unlock(&fc->lock);
+}
+
+/* offset passed in should be aligned to FUSE_DAX_MEM_RANGE_SZ */
+static int fuse_setup_one_mapping(struct inode *inode,
+ struct file *file, loff_t offset,
+ struct fuse_dax_mapping *dmap)
+{
+ struct fuse_conn *fc = get_fuse_conn(inode);
+ struct fuse_inode *fi = get_fuse_inode(inode);
+ struct fuse_file *ff = NULL;
+ struct fuse_setupmapping_in inarg;
+ FUSE_ARGS(args);
+ ssize_t err;
+
+ if (file)
+ ff = file->private_data;
+
+ WARN_ON(offset % FUSE_DAX_MEM_RANGE_SZ);
+ WARN_ON(fc->nr_free_ranges < 0);
+
+ /* Ask fuse daemon to setup mapping */
+ memset(&inarg, 0, sizeof(inarg));
+ inarg.foffset = offset;
+ if (ff)
+ inarg.fh = ff->fh;
+ else
+ inarg.fh = -1;
+ inarg.moffset = dmap->window_offset;
+ inarg.len = FUSE_DAX_MEM_RANGE_SZ;
+ if (file) {
+ inarg.flags |= (file->f_mode & FMODE_WRITE) ?
+ FUSE_SETUPMAPPING_FLAG_WRITE : 0;
+ inarg.flags |= (file->f_mode & FMODE_READ) ?
+ FUSE_SETUPMAPPING_FLAG_READ : 0;
+ } else {
+ inarg.flags |= FUSE_SETUPMAPPING_FLAG_READ;
+ inarg.flags |= FUSE_SETUPMAPPING_FLAG_WRITE;
+ }
+ args.in.h.opcode = FUSE_SETUPMAPPING;
+ args.in.h.nodeid = fi->nodeid;
+ args.in.numargs = 1;
+ args.in.args[0].size = sizeof(inarg);
+ args.in.args[0].value = &inarg;
+ err = fuse_simple_request(fc, &args);
+ if (err < 0) {
+ printk(KERN_ERR "%s request failed at mem_offset=0x%llx %zd\n",
+ __func__, dmap->window_offset, err);
+ return err;
+ }
+
+ pr_debug("fuse_setup_one_mapping() succeeded. offset=0x%llx err=%zd\n", offset, err);
+
+ /* TODO: What locking is required here. For now, using fc->lock */
+ dmap->start = offset;
+ dmap->end = offset + FUSE_DAX_MEM_RANGE_SZ - 1;
+ /* Protected by fi->i_dmap_sem */
+ fuse_dax_interval_tree_insert(dmap, &fi->dmap_tree);
+ fi->nr_dmaps++;
+ return 0;
+}
+
+static int fuse_removemapping_one(struct inode *inode,
+ struct fuse_dax_mapping *dmap)
+{
+ struct fuse_inode *fi = get_fuse_inode(inode);
+ struct fuse_conn *fc = get_fuse_conn(inode);
+ struct fuse_removemapping_in inarg;
+ FUSE_ARGS(args);
+ ssize_t err = 0;
+
+ memset(&inarg, 0, sizeof(inarg));
+ inarg.moffset = dmap->window_offset;
+ inarg.len = dmap->length;
+ args.in.h.opcode = FUSE_REMOVEMAPPING;
+ args.in.h.nodeid = fi->nodeid;
+ args.in.numargs = 1;
+ args.in.args[0].size = sizeof(inarg);
+ args.in.args[0].value = &inarg;
+ err = fuse_simple_request(fc, &args);
+ if (err < 0) {
+ printk(KERN_ERR "%s request failed %zd\n", __func__, err);
+ return err;
+ }
+ pr_debug("%s request succeeded\n", __func__);
+ return 0;
+}
+
+void fuse_removemapping(struct inode *inode)
+{
+ struct fuse_conn *fc = get_fuse_conn(inode);
+ struct fuse_inode *fi = get_fuse_inode(inode);
+ ssize_t err;
+ struct fuse_dax_mapping *dmap;
+
+ down_write(&fi->i_dmap_sem);
+
+ /* Clear the mappings list */
+ while (true) {
+ WARN_ON(fi->nr_dmaps < 0);
+
+ dmap = fuse_dax_interval_tree_iter_first(&fi->dmap_tree, 0,
+ -1);
+ if (dmap) {
+ fuse_dax_interval_tree_remove(dmap, &fi->dmap_tree);
+ fi->nr_dmaps--;
+ }
+
+ if (!dmap)
+ break;
+
+ err = fuse_removemapping_one(inode, dmap);
+ if (err) {
+ /* TODO: Add it back to tree. */
+			printk(KERN_ERR "Failed to remove mapping. offset=0x%llx"
+			       " len=0x%llx\n", dmap->window_offset,
+			       dmap->length);
+ continue;
+ }
+
+ /* Add it back to free ranges list */
+ free_dax_mapping(fc, dmap);
+ }
+
+ up_write(&fi->i_dmap_sem);
+ pr_debug("%s request succeeded\n", __func__);
+}
+
void fuse_finish_open(struct inode *inode, struct file *file)
{
struct fuse_file *ff = file->private_data;
@@ -1452,6 +1624,204 @@ static ssize_t fuse_direct_write_iter(struct kiocb *iocb, struct iov_iter *from)
return res;
}
+static void fuse_fill_iomap_hole(struct iomap *iomap, loff_t length)
+{
+ iomap->addr = IOMAP_NULL_ADDR;
+ iomap->length = length;
+ iomap->type = IOMAP_HOLE;
+}
+
+static void fuse_fill_iomap(struct inode *inode, loff_t pos, loff_t length,
+ struct iomap *iomap, struct fuse_dax_mapping *dmap,
+ unsigned flags)
+{
+ loff_t offset, len;
+ loff_t i_size = i_size_read(inode);
+
+ offset = pos - dmap->start;
+ len = min(length, dmap->length - offset);
+
+ /* If length is beyond end of file, truncate further */
+ if (pos + len > i_size)
+ len = i_size - pos;
+
+ if (len > 0) {
+ iomap->addr = dmap->window_offset + offset;
+ iomap->length = len;
+ if (flags & IOMAP_FAULT)
+ iomap->length = ALIGN(len, PAGE_SIZE);
+ iomap->type = IOMAP_MAPPED;
+ pr_debug("%s: returns iomap: addr 0x%llx offset 0x%llx"
+ " length 0x%llx\n", __func__, iomap->addr,
+ iomap->offset, iomap->length);
+ } else {
+ /* Mapping beyond end of file is hole */
+ fuse_fill_iomap_hole(iomap, length);
+ pr_debug("%s: returns iomap: addr 0x%llx offset 0x%llx"
+			" length 0x%llx\n", __func__, iomap->addr,
+ iomap->offset, iomap->length);
+ }
+}
+
+/* This is just for DAX and the mapping is ephemeral, do not use it for other
+ * purposes since there is no block device with a permanent mapping.
+ */
+static int fuse_iomap_begin(struct inode *inode, loff_t pos, loff_t length,
+ unsigned flags, struct iomap *iomap)
+{
+ struct fuse_inode *fi = get_fuse_inode(inode);
+ struct fuse_conn *fc = get_fuse_conn(inode);
+ struct fuse_dax_mapping *dmap, *alloc_dmap = NULL;
+ int ret;
+
+ /* We don't support FIEMAP */
+ BUG_ON(flags & IOMAP_REPORT);
+
+ pr_debug("fuse_iomap_begin() called. pos=0x%llx length=0x%llx\n",
+ pos, length);
+
+ iomap->offset = pos;
+ iomap->flags = 0;
+ iomap->bdev = NULL;
+ iomap->dax_dev = fc->dax_dev;
+
+ /*
+	 * Both the read/write and mmap paths can race here. We need to make
+	 * sure that if one path is setting up a mapping, the other waits.
+ *
+ * For now, use a semaphore for this. It probably needs to be
+ * optimized later.
+ */
+ down_read(&fi->i_dmap_sem);
+ dmap = fuse_dax_interval_tree_iter_first(&fi->dmap_tree, pos, pos);
+
+ if (dmap) {
+ fuse_fill_iomap(inode, pos, length, iomap, dmap, flags);
+ up_read(&fi->i_dmap_sem);
+ return 0;
+ } else {
+ up_read(&fi->i_dmap_sem);
+ pr_debug("%s: no mapping at offset 0x%llx length 0x%llx\n",
+ __func__, pos, length);
+ if (pos >= i_size_read(inode))
+ goto iomap_hole;
+
+ alloc_dmap = alloc_dax_mapping(fc);
+ if (!alloc_dmap)
+ return -EBUSY;
+
+ /*
+ * Drop read lock and take write lock so that only one
+ * caller can try to setup mapping and other waits
+ */
+ down_write(&fi->i_dmap_sem);
+ /*
+ * We dropped lock. Check again if somebody else setup
+ * mapping already.
+ */
+ dmap = fuse_dax_interval_tree_iter_first(&fi->dmap_tree, pos,
+ pos);
+ if (dmap) {
+ fuse_fill_iomap(inode, pos, length, iomap, dmap, flags);
+ free_dax_mapping(fc, alloc_dmap);
+ up_write(&fi->i_dmap_sem);
+ return 0;
+ }
+
+ /* Setup one mapping */
+ ret = fuse_setup_one_mapping(inode, NULL,
+ ALIGN_DOWN(pos, FUSE_DAX_MEM_RANGE_SZ),
+ alloc_dmap);
+ if (ret < 0) {
+			printk(KERN_ERR "fuse_setup_one_mapping() failed. err=%d"
+ " pos=0x%llx\n", ret, pos);
+ free_dax_mapping(fc, alloc_dmap);
+ up_write(&fi->i_dmap_sem);
+ return ret;
+ }
+ fuse_fill_iomap(inode, pos, length, iomap, alloc_dmap, flags);
+ up_write(&fi->i_dmap_sem);
+ return 0;
+ }
+
+ /*
+	 * If a read beyond end of file happens, filesystem code is
+	 * expected to return it as a hole.
+ */
+iomap_hole:
+ fuse_fill_iomap_hole(iomap, length);
+ pr_debug("fuse_iomap_begin() returning hole mapping. pos=0x%llx length_asked=0x%llx length_returned=0x%llx\n", pos, length, iomap->length);
+ return 0;
+}
+
+static int fuse_iomap_end(struct inode *inode, loff_t pos, loff_t length,
+ ssize_t written, unsigned flags,
+ struct iomap *iomap)
+{
+ /* DAX writes beyond end-of-file aren't handled using iomap, so the
+ * file size is unchanged and there is nothing to do here.
+ */
+ return 0;
+}
+
+static const struct iomap_ops fuse_iomap_ops = {
+ .iomap_begin = fuse_iomap_begin,
+ .iomap_end = fuse_iomap_end,
+};
+
+static ssize_t fuse_dax_read_iter(struct kiocb *iocb, struct iov_iter *to)
+{
+ struct inode *inode = file_inode(iocb->ki_filp);
+ ssize_t ret;
+
+ if (iocb->ki_flags & IOCB_NOWAIT) {
+ if (!inode_trylock_shared(inode))
+ return -EAGAIN;
+ } else {
+ inode_lock_shared(inode);
+ }
+
+ ret = dax_iomap_rw(iocb, to, &fuse_iomap_ops);
+ inode_unlock_shared(inode);
+
+ /* TODO file_accessed(iocb->f_filp) */
+
+ return ret;
+}
+
+static ssize_t fuse_dax_write_iter(struct kiocb *iocb, struct iov_iter *from)
+{
+ struct inode *inode = file_inode(iocb->ki_filp);
+ ssize_t ret;
+
+ if (iocb->ki_flags & IOCB_NOWAIT) {
+ if (!inode_trylock(inode))
+ return -EAGAIN;
+ } else {
+ inode_lock(inode);
+ }
+
+ ret = generic_write_checks(iocb, from);
+ if (ret <= 0)
+ goto out;
+
+ ret = file_remove_privs(iocb->ki_filp);
+ if (ret)
+ goto out;
+ /* TODO file_update_time() but we don't want metadata I/O */
+
+ /* TODO handle growing the file */
+
+ ret = dax_iomap_rw(iocb, from, &fuse_iomap_ops);
+
+out:
+ inode_unlock(inode);
+
+ if (ret > 0)
+ ret = generic_write_sync(iocb, ret);
+ return ret;
+}
+
static void fuse_writepage_free(struct fuse_conn *fc, struct fuse_req *req)
{
int i;
@@ -2104,6 +2474,11 @@ static int fuse_direct_mmap(struct file *file, struct vm_area_struct *vma)
return generic_file_mmap(file, vma);
}
+static int fuse_dax_mmap(struct file *file, struct vm_area_struct *vma)
+{
+ return -EINVAL; /* TODO */
+}
+
static int convert_fuse_file_lock(struct fuse_conn *fc,
const struct fuse_file_lock *ffl,
struct file_lock *fl)
@@ -3137,6 +3512,24 @@ static const struct file_operations fuse_direct_io_file_operations = {
/* no splice_read */
};
+static const struct file_operations fuse_dax_file_operations = {
+ .llseek = fuse_file_llseek,
+ .read_iter = fuse_dax_read_iter,
+ .write_iter = fuse_dax_write_iter,
+ .mmap = fuse_dax_mmap,
+ .open = fuse_open,
+ .flush = fuse_flush,
+ .release = fuse_release,
+ .fsync = fuse_fsync,
+ .lock = fuse_file_lock,
+ .flock = fuse_file_flock,
+ .unlocked_ioctl = fuse_file_ioctl,
+ .compat_ioctl = fuse_file_compat_ioctl,
+ .poll = fuse_file_poll,
+ .fallocate = fuse_file_fallocate,
+ /* no splice_read */
+};
+
static const struct address_space_operations fuse_file_aops = {
.readpage = fuse_readpage,
.writepage = fuse_writepage,
@@ -3153,6 +3546,7 @@ static const struct address_space_operations fuse_file_aops = {
void fuse_init_file_inode(struct inode *inode)
{
struct fuse_inode *fi = get_fuse_inode(inode);
+ struct fuse_conn *fc = get_fuse_conn(inode);
inode->i_fop = &fuse_file_operations;
inode->i_data.a_ops = &fuse_file_aops;
@@ -3162,4 +3556,10 @@ void fuse_init_file_inode(struct inode *inode)
fi->writectr = 0;
init_waitqueue_head(&fi->page_waitq);
INIT_LIST_HEAD(&fi->writepages);
+ fi->dmap_tree = RB_ROOT_CACHED;
+
+ if (fc->dax_dev) {
+ inode->i_flags |= S_DAX;
+ inode->i_fop = &fuse_dax_file_operations;
+ }
}
diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
index a24f31156b47..3b17fb336256 100644
--- a/fs/fuse/fuse_i.h
+++ b/fs/fuse/fuse_i.h
@@ -203,6 +203,11 @@ struct fuse_inode {
/** Lock for serializing lookup and readdir for back compatibility*/
struct mutex mutex;
+ /*
+ * Semaphore to protect modifications to dmap_tree
+ */
+ struct rw_semaphore i_dmap_sem;
+
/** Sorted rb tree of struct fuse_dax_mapping elements */
struct rb_root_cached dmap_tree;
unsigned long nr_dmaps;
@@ -1225,5 +1230,6 @@ unsigned fuse_len_args(unsigned numargs, struct fuse_arg *args);
* Get the next unique ID for a request
*/
u64 fuse_get_unique(struct fuse_iqueue *fiq);
+void fuse_removemapping(struct inode *inode);
#endif /* _FS_FUSE_I_H */
diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
index 075997977cfd..56310d10cd4c 100644
--- a/fs/fuse/inode.c
+++ b/fs/fuse/inode.c
@@ -83,7 +83,9 @@ static struct inode *fuse_alloc_inode(struct super_block *sb)
fi->attr_version = 0;
fi->orig_ino = 0;
fi->state = 0;
+ fi->nr_dmaps = 0;
mutex_init(&fi->mutex);
+ init_rwsem(&fi->i_dmap_sem);
fi->forget = fuse_alloc_forget();
if (!fi->forget) {
kmem_cache_free(fuse_inode_cachep, inode);
@@ -118,6 +120,10 @@ static void fuse_evict_inode(struct inode *inode)
if (inode->i_sb->s_flags & SB_ACTIVE) {
struct fuse_conn *fc = get_fuse_conn(inode);
struct fuse_inode *fi = get_fuse_inode(inode);
+ if (IS_DAX(inode)) {
+ fuse_removemapping(inode);
+ WARN_ON(fi->nr_dmaps);
+ }
fuse_queue_forget(fc, fi->forget, fi->nodeid, fi->nlookup);
fi->forget = NULL;
}
diff --git a/include/uapi/linux/fuse.h b/include/uapi/linux/fuse.h
index 867fdafc4a5e..1657253cb7d6 100644
--- a/include/uapi/linux/fuse.h
+++ b/include/uapi/linux/fuse.h
@@ -821,6 +821,7 @@ struct fuse_copy_file_range_in {
#define FUSE_SETUPMAPPING_ENTRIES 8
#define FUSE_SETUPMAPPING_FLAG_WRITE (1ull << 0)
+#define FUSE_SETUPMAPPING_FLAG_READ (1ull << 1)
struct fuse_setupmapping_in {
/* An already open handle */
uint64_t fh;
--
2.13.6
Introduce fuse_dax_mapping. This type will be used to keep track of
per-inode DAX mappings.
Signed-off-by: Vivek Goyal <[email protected]>
---
fs/fuse/fuse_i.h | 9 +++++++++
1 file changed, 9 insertions(+)
diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
index 345abe9b022f..b9880be690bd 100644
--- a/fs/fuse/fuse_i.h
+++ b/fs/fuse/fuse_i.h
@@ -81,6 +81,15 @@ struct fuse_forget_link {
struct fuse_forget_link *next;
};
+/** Translation information for file offsets to DAX window offsets */
+struct fuse_dax_mapping {
+ /** Position in DAX window */
+ u64 window_offset;
+
+ /** Length of mapping, in bytes */
+ loff_t length;
+};
+
/** FUSE inode */
struct fuse_inode {
/** Inode data */
--
2.13.6
Instead of assuming a fixed BAR for the cache, use the BAR number
from the capabilities.
Signed-off-by: Dr. David Alan Gilbert <[email protected]>
---
fs/fuse/virtio_fs.c | 32 +++++++++++++++++---------------
1 file changed, 17 insertions(+), 15 deletions(-)
diff --git a/fs/fuse/virtio_fs.c b/fs/fuse/virtio_fs.c
index 60d496c16841..55bac1465536 100644
--- a/fs/fuse/virtio_fs.c
+++ b/fs/fuse/virtio_fs.c
@@ -14,11 +14,6 @@
#include <uapi/linux/virtio_pci.h>
#include "fuse_i.h"
-enum {
- /* PCI BAR number of the virtio-fs DAX window */
- VIRTIO_FS_WINDOW_BAR = 2,
-};
-
/* List of virtio-fs device instances and a lock for the list */
static DEFINE_MUTEX(virtio_fs_mutex);
static LIST_HEAD(virtio_fs_instances);
@@ -518,7 +513,7 @@ static int virtio_fs_setup_dax(struct virtio_device *vdev, struct virtio_fs *fs)
struct dev_pagemap *pgmap;
struct pci_dev *pci_dev;
phys_addr_t phys_addr;
- size_t len;
+ size_t bar_len;
int ret;
u8 have_cache, cache_bar;
u64 cache_offset, cache_len;
@@ -551,17 +546,13 @@ static int virtio_fs_setup_dax(struct virtio_device *vdev, struct virtio_fs *fs)
}
/* TODO handle case where device doesn't expose BAR? */
- ret = pci_request_region(pci_dev, VIRTIO_FS_WINDOW_BAR,
- "virtio-fs-window");
+ ret = pci_request_region(pci_dev, cache_bar, "virtio-fs-window");
if (ret < 0) {
dev_err(&vdev->dev, "%s: failed to request window BAR\n",
__func__);
return ret;
}
- phys_addr = pci_resource_start(pci_dev, VIRTIO_FS_WINDOW_BAR);
- len = pci_resource_len(pci_dev, VIRTIO_FS_WINDOW_BAR);
-
mi = devm_kzalloc(&pci_dev->dev, sizeof(*mi), GFP_KERNEL);
if (!mi)
return -ENOMEM;
@@ -586,6 +577,17 @@ static int virtio_fs_setup_dax(struct virtio_device *vdev, struct virtio_fs *fs)
pgmap->ref = &mi->ref;
pgmap->type = MEMORY_DEVICE_FS_DAX;
+ phys_addr = pci_resource_start(pci_dev, cache_bar);
+ bar_len = pci_resource_len(pci_dev, cache_bar);
+
+ if (cache_offset + cache_len > bar_len) {
+ dev_err(&vdev->dev,
+ "%s: cache bar shorter than cap offset+len\n",
+ __func__);
+ return -EINVAL;
+ }
+ phys_addr += cache_offset;
+
/* Ideally we would directly use the PCI BAR resource but
* devm_memremap_pages() wants its own copy in pgmap. So
* initialize a struct resource from scratch (only the start
@@ -594,7 +596,7 @@ static int virtio_fs_setup_dax(struct virtio_device *vdev, struct virtio_fs *fs)
pgmap->res = (struct resource){
.name = "virtio-fs dax window",
.start = phys_addr,
- .end = phys_addr + len,
+ .end = phys_addr + cache_len,
};
fs->window_kaddr = devm_memremap_pages(&pci_dev->dev, pgmap);
@@ -607,10 +609,10 @@ static int virtio_fs_setup_dax(struct virtio_device *vdev, struct virtio_fs *fs)
return ret;
fs->window_phys_addr = phys_addr;
- fs->window_len = len;
+ fs->window_len = cache_len;
- dev_dbg(&vdev->dev, "%s: window kaddr 0x%px phys_addr 0x%llx len %zu\n",
- __func__, fs->window_kaddr, phys_addr, len);
+ dev_dbg(&vdev->dev, "%s: cache kaddr 0x%px phys_addr 0x%llx len %llx\n",
+ __func__, fs->window_kaddr, phys_addr, cache_len);
fs->dax_dev = alloc_dax(fs, NULL, &virtio_fs_dax_ops);
if (!fs->dax_dev)
--
2.13.6
We want to use an interval tree to keep track of per-inode DAX mappings.
Introduce basic data structures.
Signed-off-by: Vivek Goyal <[email protected]>
---
fs/fuse/fuse_i.h | 15 +++++++++++++++
1 file changed, 15 insertions(+)
diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
index fb49ca9d05ac..a24f31156b47 100644
--- a/fs/fuse/fuse_i.h
+++ b/fs/fuse/fuse_i.h
@@ -97,11 +97,22 @@ struct fuse_forget_link {
struct fuse_forget_link *next;
};
+#define START(node) ((node)->start)
+#define LAST(node) ((node)->end)
+
/** Translation information for file offsets to DAX window offsets */
struct fuse_dax_mapping {
/* Will connect in fc->free_ranges to keep track of free memory */
struct list_head list;
+ /* For interval tree in file/inode */
+ struct rb_node rb;
+ /** Start Position in file */
+ __u64 start;
+ /** End Position in file */
+ __u64 end;
+ __u64 __subtree_last;
+
/** Position in DAX window */
u64 window_offset;
@@ -191,6 +202,10 @@ struct fuse_inode {
/** Lock for serializing lookup and readdir for back compatibility*/
struct mutex mutex;
+
+ /** Sorted rb tree of struct fuse_dax_mapping elements */
+ struct rb_root_cached dmap_tree;
+ unsigned long nr_dmaps;
};
/** FUSE inode state bits */
--
2.13.6
When a file is opened with O_TRUNC, we need to make sure that no other
DAX operation is in progress. DAX expects i_size to be stable.
In fuse_iomap_begin() we check i_size at multiple places and expect it
not to change.
Another problem is that if we set up a mapping in fuse_iomap_begin(),
the file then gets truncated and a dax read/write happens, KVM
currently hangs. It tries to fault in a page which does not exist on
the host (the file got truncated). This probably requires fixing in KVM.
So for now, take the inode lock. Once KVM is fixed, we might have to
look at this again.
Signed-off-by: Vivek Goyal <[email protected]>
---
fs/fuse/file.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index d0942ce0a6c3..cb28cf26a6e7 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -406,7 +406,7 @@ int fuse_open_common(struct inode *inode, struct file *file, bool isdir)
int err;
bool lock_inode = (file->f_flags & O_TRUNC) &&
fc->atomic_o_trunc &&
- fc->writeback_cache;
+ (fc->writeback_cache || IS_DAX(inode));
err = generic_file_open(inode, file);
if (err)
--
2.13.6
From: Stefan Hajnoczi <[email protected]>
Send single FUSE_FORGET requests on the hiprio queue. In the future it
may be possible to do FUSE_BATCH_FORGET but that is tricky since
virtio-fs gets called synchronously when forgets are queued.
Signed-off-by: Stefan Hajnoczi <[email protected]>
---
fs/fuse/virtio_fs.c | 93 ++++++++++++++++++++++++++++++++++++++++++++++++++---
1 file changed, 89 insertions(+), 4 deletions(-)
diff --git a/fs/fuse/virtio_fs.c b/fs/fuse/virtio_fs.c
index fa99a31ee930..225eb729656f 100644
--- a/fs/fuse/virtio_fs.c
+++ b/fs/fuse/virtio_fs.c
@@ -140,10 +140,26 @@ static void virtio_fs_notifications_done_work(struct work_struct *work)
return;
}
-static void virtio_fs_hiprio_done(struct virtqueue *vq)
+/* Work function for hiprio completion */
+static void virtio_fs_hiprio_done_work(struct work_struct *work)
{
- /* TODO */
- dev_dbg(&vq->vdev->dev, "%s\n", __func__);
+ struct virtio_fs_vq *fsvq = container_of(work, struct virtio_fs_vq,
+ done_work);
+ struct fuse_pqueue *fpq = &fsvq->fud->pq;
+ struct virtqueue *vq = fsvq->vq;
+
+ /* Free completed FUSE_FORGET requests */
+ spin_lock(&fpq->lock);
+ do {
+ unsigned len;
+ void *req;
+
+ virtqueue_disable_cb(vq);
+
+ while ((req = virtqueue_get_buf(vq, &len)) != NULL)
+ kfree(req);
+ } while (!virtqueue_enable_cb(vq) && likely(!virtqueue_is_broken(vq)));
+ spin_unlock(&fpq->lock);
}
/* Allocate and copy args into req->argbuf */
@@ -302,6 +318,7 @@ static int virtio_fs_setup_vqs(struct virtio_device *vdev,
callbacks[1] = virtio_fs_vq_done;
snprintf(fs->vqs[1].name, sizeof(fs->vqs[1].name), "hiprio");
names[1] = fs->vqs[1].name;
+ INIT_WORK(&fs->vqs[1].done_work, virtio_fs_hiprio_done_work);
/* Initialize the requests virtqueues */
for (i = 2; i < fs->nvqs; i++) {
@@ -424,11 +441,79 @@ static struct virtio_driver virtio_fs_driver = {
#endif
};
+struct virtio_fs_forget {
+ struct fuse_in_header ih;
+ struct fuse_forget_in arg;
+};
+
static void virtio_fs_wake_forget_and_unlock(struct fuse_iqueue *fiq)
__releases(fiq->waitq.lock)
{
- /* TODO */
+ struct fuse_forget_link *link;
+ struct virtio_fs_forget *forget;
+ struct fuse_pqueue *fpq;
+ struct scatterlist sg;
+ struct scatterlist *sgs[] = {&sg};
+ struct virtio_fs *fs;
+ struct virtqueue *vq;
+ bool notify;
+ u64 unique;
+ int ret;
+
+ BUG_ON(!fiq->forget_list_head.next);
+ link = fiq->forget_list_head.next;
+ BUG_ON(link->next);
+ fiq->forget_list_head.next = NULL;
+ fiq->forget_list_tail = &fiq->forget_list_head;
+
+ unique = fuse_get_unique(fiq);
+
+ fs = fiq->priv;
+
spin_unlock(&fiq->waitq.lock);
+
+ /* Allocate a buffer for the request */
+ forget = kmalloc(sizeof(*forget), GFP_ATOMIC);
+ if (!forget) {
+ pr_err("virtio-fs: dropped FORGET: kmalloc failed\n");
+ goto out; /* TODO avoid dropping it? */
+ }
+
+ forget->ih = (struct fuse_in_header){
+ .opcode = FUSE_FORGET,
+ .nodeid = link->forget_one.nodeid,
+ .unique = unique,
+ .len = sizeof(*forget),
+ };
+ forget->arg = (struct fuse_forget_in){
+ .nlookup = link->forget_one.nlookup,
+ };
+
+ sg_init_one(&sg, forget, sizeof(*forget));
+
+ /* Enqueue the request */
+ vq = fs->vqs[1].vq;
+ dev_dbg(&vq->vdev->dev, "%s\n", __func__);
+ fpq = vq_to_fpq(vq);
+ spin_lock(&fpq->lock);
+
+ ret = virtqueue_add_sgs(vq, sgs, 1, 0, forget, GFP_ATOMIC);
+ if (ret < 0) {
+ pr_err("virtio-fs: dropped FORGET: queue full\n");
+ /* TODO handle full virtqueue */
+ spin_unlock(&fpq->lock);
+ goto out;
+ }
+
+ notify = virtqueue_kick_prepare(vq);
+
+ spin_unlock(&fpq->lock);
+
+ if (notify)
+ virtqueue_notify(vq);
+
+out:
+ kfree(link);
}
static void virtio_fs_wake_interrupt_and_unlock(struct fuse_iqueue *fiq)
--
2.13.6
From: Stefan Hajnoczi <[email protected]>
Add DAX mmap() support.
Signed-off-by: Stefan Hajnoczi <[email protected]>
---
fs/fuse/file.c | 58 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++-
1 file changed, 57 insertions(+), 1 deletion(-)
diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index 41d773ba2c72..5230f2d84a14 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -2501,9 +2501,65 @@ static int fuse_direct_mmap(struct file *file, struct vm_area_struct *vma)
return generic_file_mmap(file, vma);
}
+static int __fuse_dax_fault(struct vm_fault *vmf, enum page_entry_size pe_size,
+ bool write)
+{
+ int ret;
+ struct inode *inode = file_inode(vmf->vma->vm_file);
+ struct super_block *sb = inode->i_sb;
+ pfn_t pfn;
+
+ if (write)
+ sb_start_pagefault(sb);
+
+ /* TODO inode semaphore to protect faults vs truncate */
+
+ ret = dax_iomap_fault(vmf, pe_size, &pfn, NULL, &fuse_iomap_ops);
+
+ if (ret & VM_FAULT_NEEDDSYNC)
+ ret = dax_finish_sync_fault(vmf, pe_size, pfn);
+
+ if (write)
+ sb_end_pagefault(sb);
+
+ return ret;
+}
+
+static int fuse_dax_fault(struct vm_fault *vmf)
+{
+ return __fuse_dax_fault(vmf, PE_SIZE_PTE,
+ vmf->flags & FAULT_FLAG_WRITE);
+}
+
+static int fuse_dax_huge_fault(struct vm_fault *vmf,
+ enum page_entry_size pe_size)
+{
+ return __fuse_dax_fault(vmf, pe_size, vmf->flags & FAULT_FLAG_WRITE);
+}
+
+static int fuse_dax_page_mkwrite(struct vm_fault *vmf)
+{
+ return __fuse_dax_fault(vmf, PE_SIZE_PTE, true);
+}
+
+static int fuse_dax_pfn_mkwrite(struct vm_fault *vmf)
+{
+ return __fuse_dax_fault(vmf, PE_SIZE_PTE, true);
+}
+
+static const struct vm_operations_struct fuse_dax_vm_ops = {
+ .fault = fuse_dax_fault,
+ .huge_fault = fuse_dax_huge_fault,
+ .page_mkwrite = fuse_dax_page_mkwrite,
+ .pfn_mkwrite = fuse_dax_pfn_mkwrite,
+};
+
static int fuse_dax_mmap(struct file *file, struct vm_area_struct *vma)
{
- return -EINVAL; /* TODO */
+ file_accessed(file);
+ vma->vm_ops = &fuse_dax_vm_ops;
+ vma->vm_flags |= VM_MIXEDMAP | VM_HUGEPAGE;
+ return 0;
}
static int convert_fuse_file_lock(struct fuse_conn *fc,
--
2.13.6
From: Stefan Hajnoczi <[email protected]>
Add a basic file system module for virtio-fs.
Signed-off-by: Stefan Hajnoczi <[email protected]>
---
fs/fuse/Kconfig | 10 ++++++++++
fs/fuse/Makefile | 1 +
fs/fuse/virtio_fs.c | 33 +++++++++++++++++++++++++++++++++
3 files changed, 44 insertions(+)
create mode 100644 fs/fuse/virtio_fs.c
diff --git a/fs/fuse/Kconfig b/fs/fuse/Kconfig
index 76f09ce7e5b2..0b1375126420 100644
--- a/fs/fuse/Kconfig
+++ b/fs/fuse/Kconfig
@@ -26,3 +26,13 @@ config CUSE
If you want to develop or use a userspace character device
based on CUSE, answer Y or M.
+
+config VIRTIO_FS
+ tristate "Virtio Filesystem"
+ depends on FUSE_FS
+ help
+ The Virtio Filesystem allows guests to mount file systems from the
+ host.
+
+ If you want to share files between guests or with the host, answer Y
+ or M.
diff --git a/fs/fuse/Makefile b/fs/fuse/Makefile
index f7b807bc1027..47b78fac5809 100644
--- a/fs/fuse/Makefile
+++ b/fs/fuse/Makefile
@@ -4,5 +4,6 @@
obj-$(CONFIG_FUSE_FS) += fuse.o
obj-$(CONFIG_CUSE) += cuse.o
+obj-$(CONFIG_VIRTIO_FS) += virtio_fs.o
fuse-objs := dev.o dir.o file.o inode.o control.o xattr.o acl.o readdir.o
diff --git a/fs/fuse/virtio_fs.c b/fs/fuse/virtio_fs.c
new file mode 100644
index 000000000000..6b7d3973bd85
--- /dev/null
+++ b/fs/fuse/virtio_fs.c
@@ -0,0 +1,33 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * virtio-fs: Virtio Filesystem
+ * Copyright (C) 2018 Red Hat, Inc.
+ */
+
+#include <linux/module.h>
+#include <linux/fs.h>
+
+MODULE_AUTHOR("Stefan Hajnoczi <[email protected]>");
+MODULE_DESCRIPTION("Virtio Filesystem");
+MODULE_LICENSE("GPL");
+MODULE_ALIAS_FS(KBUILD_MODNAME);
+
+static struct file_system_type virtio_fs_type = {
+ .owner = THIS_MODULE,
+ .name = KBUILD_MODNAME,
+ .mount = NULL,
+ .kill_sb = NULL,
+};
+
+static int __init virtio_fs_init(void)
+{
+ return register_filesystem(&virtio_fs_type);
+}
+
+static void __exit virtio_fs_exit(void)
+{
+ unregister_filesystem(&virtio_fs_type);
+}
+
+module_init(virtio_fs_init);
+module_exit(virtio_fs_exit);
--
2.13.6
From: Stefan Hajnoczi <[email protected]>
fput() will be moved out of this function in a later patch, so we cannot
rely on it as the memory barrier for ensuring file->private_data = fud
is visible.
Luckily there is a mutex_unlock() right before fput() which provides the
same effect.
Signed-off-by: Stefan Hajnoczi <[email protected]>
---
fs/fuse/inode.c | 7 +++----
1 file changed, 3 insertions(+), 4 deletions(-)
diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
index 0b94b23b02d4..d08cd8bf7705 100644
--- a/fs/fuse/inode.c
+++ b/fs/fuse/inode.c
@@ -1198,12 +1198,11 @@ static int fuse_fill_super(struct super_block *sb, void *data, int silent)
list_add_tail(&fc->entry, &fuse_conn_list);
sb->s_root = root_dentry;
file->private_data = fud;
- mutex_unlock(&fuse_mutex);
/*
- * atomic_dec_and_test() in fput() provides the necessary
- * memory barrier for file->private_data to be visible on all
- * CPUs after this
+ * mutex_unlock() provides the necessary memory barrier for
+ * file->private_data to be visible on all CPUs after this
*/
+ mutex_unlock(&fuse_mutex);
fput(file);
fuse_send_init(fc, init_req);
--
2.13.6
From: Stefan Hajnoczi <[email protected]>
Set up a DAX device.
Signed-off-by: Stefan Hajnoczi <[email protected]>
---
fs/fuse/virtio_fs.c | 68 +++++++++++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 68 insertions(+)
diff --git a/fs/fuse/virtio_fs.c b/fs/fuse/virtio_fs.c
index 225eb729656f..fd914f2c6209 100644
--- a/fs/fuse/virtio_fs.c
+++ b/fs/fuse/virtio_fs.c
@@ -5,6 +5,8 @@
*/
#include <linux/fs.h>
+#include <linux/dax.h>
+#include <linux/pfn_t.h>
#include <linux/module.h>
#include <linux/virtio.h>
#include <linux/virtio_fs.h>
@@ -29,6 +31,11 @@ struct virtio_fs {
struct virtio_fs_vq *vqs;
unsigned nvqs; /* number of virtqueues */
unsigned num_queues; /* number of request queues */
+ struct dax_device *dax_dev;
+
+ /* DAX memory window where file contents are mapped */
+ void *window_kaddr;
+ phys_addr_t window_phys_addr;
};
static inline struct virtio_fs_vq *vq_to_fsvq(struct virtqueue *vq)
@@ -350,6 +357,44 @@ static void virtio_fs_cleanup_vqs(struct virtio_device *vdev,
vdev->config->del_vqs(vdev);
}
+/* Map a window offset to a page frame number. The window offset will have
+ * been produced by .iomap_begin(), which maps a file offset to a window
+ * offset.
+ */
+static long virtio_fs_direct_access(struct dax_device *dax_dev, pgoff_t pgoff,
+ long nr_pages, void **kaddr, pfn_t *pfn)
+{
+ struct virtio_fs *fs = dax_get_private(dax_dev);
+ phys_addr_t offset = PFN_PHYS(pgoff);
+
+ if (kaddr)
+ *kaddr = fs->window_kaddr + offset;
+ if (pfn)
+ *pfn = phys_to_pfn_t(fs->window_phys_addr + offset,
+ PFN_DEV | PFN_MAP);
+ return nr_pages;
+}
+
+static size_t virtio_fs_copy_from_iter(struct dax_device *dax_dev,
+ pgoff_t pgoff, void *addr,
+ size_t bytes, struct iov_iter *i)
+{
+ return copy_from_iter(addr, bytes, i);
+}
+
+static size_t virtio_fs_copy_to_iter(struct dax_device *dax_dev,
+ pgoff_t pgoff, void *addr,
+ size_t bytes, struct iov_iter *i)
+{
+ return copy_to_iter(addr, bytes, i);
+}
+
+static const struct dax_operations virtio_fs_dax_ops = {
+ .direct_access = virtio_fs_direct_access,
+ .copy_from_iter = virtio_fs_copy_from_iter,
+ .copy_to_iter = virtio_fs_copy_to_iter,
+};
+
static int virtio_fs_probe(struct virtio_device *vdev)
{
struct virtio_fs *fs;
@@ -371,6 +416,17 @@ static int virtio_fs_probe(struct virtio_device *vdev)
/* TODO vq affinity */
/* TODO populate notifications vq */
+ if (IS_ENABLED(CONFIG_DAX_DRIVER)) {
+ /* TODO map window */
+ fs->window_kaddr = NULL;
+ fs->window_phys_addr = 0;
+
+ fs->dax_dev = alloc_dax(fs, NULL, &virtio_fs_dax_ops);
+ if (!fs->dax_dev)
+ goto out_vqs; /* TODO handle case where device doesn't expose
+ BAR */
+ }
+
/* Bring the device online in case the filesystem is mounted and
* requests need to be sent before we return.
*/
@@ -386,6 +442,12 @@ static int virtio_fs_probe(struct virtio_device *vdev)
vdev->config->reset(vdev);
virtio_fs_cleanup_vqs(vdev, fs);
+ if (fs->dax_dev) {
+ kill_dax(fs->dax_dev);
+ put_dax(fs->dax_dev);
+ fs->dax_dev = NULL;
+ }
+
out:
vdev->priv = NULL;
return ret;
@@ -404,6 +466,12 @@ static void virtio_fs_remove(struct virtio_device *vdev)
list_del(&fs->list);
mutex_unlock(&virtio_fs_mutex);
+ if (fs->dax_dev) {
+ kill_dax(fs->dax_dev);
+ put_dax(fs->dax_dev);
+ fs->dax_dev = NULL;
+ }
+
vdev->priv = NULL;
}
--
2.13.6
From: Stefan Hajnoczi <[email protected]>
virtio-fs will need to complete requests from outside fs/fuse/dev.c.
Make the symbol visible.
Signed-off-by: Stefan Hajnoczi <[email protected]>
---
fs/fuse/dev.c | 19 ++++++++++---------
fs/fuse/fuse_i.h | 5 +++++
2 files changed, 15 insertions(+), 9 deletions(-)
diff --git a/fs/fuse/dev.c b/fs/fuse/dev.c
index a5e516a40e7a..5b90c839a7c3 100644
--- a/fs/fuse/dev.c
+++ b/fs/fuse/dev.c
@@ -425,7 +425,7 @@ static void flush_bg_queue(struct fuse_conn *fc)
* the 'end' callback is called if given, else the reference to the
* request is released
*/
-static void request_end(struct fuse_conn *fc, struct fuse_req *req)
+void fuse_request_end(struct fuse_conn *fc, struct fuse_req *req)
{
struct fuse_iqueue *fiq = &fc->iq;
@@ -469,6 +469,7 @@ static void request_end(struct fuse_conn *fc, struct fuse_req *req)
put_request:
fuse_put_request(fc, req);
}
+EXPORT_SYMBOL_GPL(fuse_request_end);
static void queue_interrupt(struct fuse_iqueue *fiq, struct fuse_req *req)
{
@@ -543,12 +544,12 @@ static void __fuse_request_send(struct fuse_conn *fc, struct fuse_req *req)
req->in.h.unique = fuse_get_unique(fiq);
queue_request(fiq, req);
/* acquire extra reference, since request is still needed
- after request_end() */
+ after fuse_request_end() */
__fuse_get_request(req);
spin_unlock(&fiq->waitq.lock);
request_wait_answer(fc, req);
- /* Pairs with smp_wmb() in request_end() */
+ /* Pairs with smp_wmb() in fuse_request_end() */
smp_rmb();
}
}
@@ -1278,7 +1279,7 @@ __releases(fiq->waitq.lock)
* the pending list and copies request data to userspace buffer. If
* no reply is needed (FORGET) or request has been aborted or there
* was an error during the copying then it's finished by calling
- * request_end(). Otherwise add it to the processing list, and set
+ * fuse_request_end(). Otherwise add it to the processing list, and set
* the 'sent' flag.
*/
static ssize_t fuse_dev_do_read(struct fuse_dev *fud, struct file *file,
@@ -1338,7 +1339,7 @@ static ssize_t fuse_dev_do_read(struct fuse_dev *fud, struct file *file,
/* SETXATTR is special, since it may contain too large data */
if (in->h.opcode == FUSE_SETXATTR)
req->out.h.error = -E2BIG;
- request_end(fc, req);
+ fuse_request_end(fc, req);
goto restart;
}
spin_lock(&fpq->lock);
@@ -1381,7 +1382,7 @@ static ssize_t fuse_dev_do_read(struct fuse_dev *fud, struct file *file,
if (!test_bit(FR_PRIVATE, &req->flags))
list_del_init(&req->list);
spin_unlock(&fpq->lock);
- request_end(fc, req);
+ fuse_request_end(fc, req);
return err;
err_unlock:
@@ -1889,7 +1890,7 @@ static int copy_out_args(struct fuse_copy_state *cs, struct fuse_out *out,
* the write buffer. The request is then searched on the processing
* list by the unique ID found in the header. If found, then remove
* it from the list and copy the rest of the buffer to the request.
- * The request is finished by calling request_end()
+ * The request is finished by calling fuse_request_end().
*/
static ssize_t fuse_dev_do_write(struct fuse_dev *fud,
struct fuse_copy_state *cs, size_t nbytes)
@@ -1976,7 +1977,7 @@ static ssize_t fuse_dev_do_write(struct fuse_dev *fud,
list_del_init(&req->list);
spin_unlock(&fpq->lock);
- request_end(fc, req);
+ fuse_request_end(fc, req);
return err ? err : nbytes;
@@ -2120,7 +2121,7 @@ static void end_requests(struct fuse_conn *fc, struct list_head *head)
req->out.h.error = -ECONNABORTED;
clear_bit(FR_SENT, &req->flags);
list_del_init(&req->list);
- request_end(fc, req);
+ fuse_request_end(fc, req);
}
}
diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
index 4fea75c92a7c..32c4466a8f89 100644
--- a/fs/fuse/fuse_i.h
+++ b/fs/fuse/fuse_i.h
@@ -953,6 +953,11 @@ ssize_t fuse_simple_request(struct fuse_conn *fc, struct fuse_args *args);
void fuse_request_send_background(struct fuse_conn *fc, struct fuse_req *req);
bool fuse_request_queue_background(struct fuse_conn *fc, struct fuse_req *req);
+/**
+ * End a finished request
+ */
+void fuse_request_end(struct fuse_conn *fc, struct fuse_req *req);
+
/* Abort all requests */
void fuse_abort_conn(struct fuse_conn *fc, bool is_abort);
void fuse_wait_aborted(struct fuse_conn *fc);
--
2.13.6
From: Miklos Szeredi <[email protected]>
Signed-off-by: Miklos Szeredi <[email protected]>
---
fs/fuse/fuse_i.h | 12 ++++++++++++
fs/fuse/inode.c | 10 ++++++++++
fs/fuse/virtio_fs.c | 2 ++
3 files changed, 24 insertions(+)
diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
index 30c7b4b56200..8a2604606d51 100644
--- a/fs/fuse/fuse_i.h
+++ b/fs/fuse/fuse_i.h
@@ -100,6 +100,12 @@ struct fuse_mount_data {
/* fuse_dev pointer to fill in, should contain NULL on entry */
void **fudptr;
+
+ /* version table length in bytes */
+ size_t vertab_len;
+
+ /* version table kernel address */
+ void *vertab_kaddr;
};
/* One forget request */
@@ -898,6 +904,12 @@ struct fuse_conn {
struct list_head free_ranges;
unsigned long nr_ranges;
+
+ /** Size of version table */
+ uint64_t version_table_size;
+
+ /** Shared version entry for each active inode */
+ s64 *version_table;
};
static inline struct fuse_conn *get_fuse_conn_super(struct super_block *sb)
diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
index 4d2d623e607f..1ab4df442390 100644
--- a/fs/fuse/inode.c
+++ b/fs/fuse/inode.c
@@ -52,6 +52,7 @@ MODULE_PARM_DESC(max_user_congthresh,
"unprivileged user can set");
#define FUSE_SUPER_MAGIC 0x65735546
+#define VERSION_TABLE_MAGIC 0x7265566465726853
#define FUSE_DEFAULT_BLKSIZE 512
@@ -1215,6 +1216,15 @@ int fuse_fill_super_common(struct super_block *sb,
fuse_conn_init(fc, sb->s_user_ns, mount_data->dax_dev,
mount_data->fiq_ops, mount_data->fiq_priv);
fc->release = fuse_free_conn;
+ fc->version_table_size = mount_data->vertab_len / sizeof(s64);
+ fc->version_table = mount_data->vertab_kaddr;
+
+ if (fc->version_table[0] != VERSION_TABLE_MAGIC) {
+ pr_warn("bad version table magic: 0x%16llx\n",
+ fc->version_table[0]);
+ fc->version_table_size = 0;
+ fc->version_table = NULL;
+ }
if (mount_data->dax_dev) {
err = fuse_dax_mem_range_init(fc, mount_data->dax_dev);
diff --git a/fs/fuse/virtio_fs.c b/fs/fuse/virtio_fs.c
index 7d5b23455639..88b00055589b 100644
--- a/fs/fuse/virtio_fs.c
+++ b/fs/fuse/virtio_fs.c
@@ -1246,6 +1246,8 @@ static int virtio_fs_fill_super(struct super_block *sb, void *data,
d.fiq_priv = fs;
d.fudptr = (void **)&fs->vqs[2].fud;
d.destroy = true; /* Send destroy request on unmount */
+ d.vertab_len = fs->vertab_len;
+ d.vertab_kaddr = fs->vertab_kaddr;
err = fuse_fill_super_common(sb, &d);
if (err < 0)
--
2.13.6
From: "Dr. David Alan Gilbert" <[email protected]>
Retrieve the capabilities needed to find the journal and version table.
Signed-off-by: Dr. David Alan Gilbert <[email protected]>
---
fs/fuse/virtio_fs.c | 26 ++++++++++++++++++++++++--
include/uapi/linux/virtio_fs.h | 2 ++
2 files changed, 26 insertions(+), 2 deletions(-)
diff --git a/fs/fuse/virtio_fs.c b/fs/fuse/virtio_fs.c
index c71bc47395b4..c18f406b61cd 100644
--- a/fs/fuse/virtio_fs.c
+++ b/fs/fuse/virtio_fs.c
@@ -589,8 +589,11 @@ static int virtio_fs_setup_dax(struct virtio_device *vdev, struct virtio_fs *fs)
phys_addr_t phys_addr;
size_t bar_len;
int ret;
- u8 have_cache, cache_bar;
- u64 cache_offset, cache_len;
+ u8 have_cache, have_journal, have_vertab;
+ u8 cache_bar, journal_bar, vertab_bar;
+ u64 cache_offset, cache_len;
+ u64 journal_offset, journal_len;
+ u64 vertab_offset, vertab_len;
if (!IS_ENABLED(CONFIG_DAX_DRIVER))
return 0;
@@ -619,6 +622,25 @@ static int virtio_fs_setup_dax(struct virtio_device *vdev, struct virtio_fs *fs)
cache_bar, cache_len, cache_offset);
}
+ have_journal = virtio_pci_find_shm_cap(pci_dev,
+ VIRTIO_FS_PCI_SHMCAP_ID_JOURNAL,
+ &journal_bar, &journal_offset,
+ &journal_len);
+ if (have_journal) {
+ dev_notice(&vdev->dev, "Journal bar: %d len: 0x%llx @ 0x%llx\n",
+ journal_bar, journal_len, journal_offset);
+ }
+
+ have_vertab = virtio_pci_find_shm_cap(pci_dev,
+ VIRTIO_FS_PCI_SHMCAP_ID_VERTAB,
+ &vertab_bar, &vertab_offset,
+ &vertab_len);
+ if (have_vertab) {
+ dev_notice(&vdev->dev, "Version table bar: %d len: 0x%llx @ 0x%llx\n",
+ vertab_bar, vertab_len, vertab_offset);
+ }
+
+
/* TODO handle case where device doesn't expose BAR? */
ret = pci_request_region(pci_dev, cache_bar, "virtio-fs-window");
if (ret < 0) {
diff --git a/include/uapi/linux/virtio_fs.h b/include/uapi/linux/virtio_fs.h
index 65a9d4a0dac0..e70741ab14a8 100644
--- a/include/uapi/linux/virtio_fs.h
+++ b/include/uapi/linux/virtio_fs.h
@@ -40,5 +40,7 @@ struct virtio_fs_config {
/* For the id field in virtio_pci_shm_cap */
#define VIRTIO_FS_PCI_SHMCAP_ID_CACHE 0
+#define VIRTIO_FS_PCI_SHMCAP_ID_VERTAB 1
+#define VIRTIO_FS_PCI_SHMCAP_ID_JOURNAL 2
#endif /* _UAPI_LINUX_VIRTIO_FS_H */
--
2.13.6
From: Miklos Szeredi <[email protected]>
Version table can be NULL. Do not crash.
Signed-off-by: Miklos Szeredi <[email protected]>
---
fs/fuse/inode.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
index 1ab4df442390..d44827bbfa3d 100644
--- a/fs/fuse/inode.c
+++ b/fs/fuse/inode.c
@@ -1219,7 +1219,8 @@ int fuse_fill_super_common(struct super_block *sb,
fc->version_table_size = mount_data->vertab_len / sizeof(s64);
fc->version_table = mount_data->vertab_kaddr;
- if (fc->version_table[0] != VERSION_TABLE_MAGIC) {
+ if (fc->version_table && fc->version_table_size > 0 &&
+ fc->version_table[0] != VERSION_TABLE_MAGIC) {
pr_warn("bad version table magic: 0x%16llx\n",
fc->version_table[0]);
fc->version_table_size = 0;
--
2.13.6
From: "Dr. David Alan Gilbert" <[email protected]>
The shm cap defines a capability describing a chunk with a 64-bit size at
a 64-bit offset into a BAR. There can be multiple such chunks on any one
device.
Signed-off-by: Dr. David Alan Gilbert <[email protected]>
---
fs/fuse/virtio_fs.c | 69 +++++++++++++++++++++++++++++++++++++++++
include/uapi/linux/virtio_pci.h | 10 ++++++
2 files changed, 79 insertions(+)
diff --git a/fs/fuse/virtio_fs.c b/fs/fuse/virtio_fs.c
index 87b7e42a6763..cd916943205e 100644
--- a/fs/fuse/virtio_fs.c
+++ b/fs/fuse/virtio_fs.c
@@ -11,6 +11,7 @@
#include <linux/module.h>
#include <linux/virtio.h>
#include <linux/virtio_fs.h>
+#include <uapi/linux/virtio_pci.h>
#include "fuse_i.h"
enum {
@@ -57,6 +58,74 @@ struct virtio_fs {
size_t window_len;
};
+/* TODO: This should be in a PCI file somewhere */
+static int virtio_pci_find_shm_cap(struct pci_dev *dev,
+ u8 required_id,
+ u8 *bar, u64 *offset, u64 *len)
+{
+ int pos;
+
+ for (pos = pci_find_capability(dev, PCI_CAP_ID_VNDR);
+ pos > 0;
+ pos = pci_find_next_capability(dev, pos, PCI_CAP_ID_VNDR)) {
+ u8 type, cap_len, id;
+ u32 tmp32;
+ u64 res_offset, res_length;
+
+ pci_read_config_byte(dev, pos + offsetof(struct virtio_pci_cap,
+ cfg_type),
+ &type);
+ if (type != VIRTIO_PCI_CAP_SHARED_MEMORY_CFG)
+ continue;
+
+ pci_read_config_byte(dev, pos + offsetof(struct virtio_pci_cap,
+ cap_len),
+ &cap_len);
+ if (cap_len != sizeof(struct virtio_pci_shm_cap)) {
+ printk(KERN_ERR "%s: shm cap with bad size offset: %d size: %d\n",
+ __func__, pos, cap_len);
+ continue;
+ }
+
+ pci_read_config_byte(dev, pos + offsetof(struct virtio_pci_shm_cap,
+ id),
+ &id);
+ if (id != required_id)
+ continue;
+
+ /* Type and ID match, looks good */
+ pci_read_config_byte(dev, pos + offsetof(struct virtio_pci_cap,
+ bar),
+ bar);
+
+ /* Read the lower 32bit of length and offset */
+ pci_read_config_dword(dev, pos + offsetof(struct virtio_pci_cap, offset),
+ &tmp32);
+ res_offset = tmp32;
+ pci_read_config_dword(dev, pos + offsetof(struct virtio_pci_cap, length),
+ &tmp32);
+ res_length = tmp32;
+
+ /* and now the top half */
+ pci_read_config_dword(dev,
+ pos + offsetof(struct virtio_pci_shm_cap,
+ offset_hi),
+ &tmp32);
+ res_offset |= ((u64)tmp32) << 32;
+ pci_read_config_dword(dev,
+ pos + offsetof(struct virtio_pci_shm_cap,
+ length_hi),
+ &tmp32);
+ res_length |= ((u64)tmp32) << 32;
+
+ *offset = res_offset;
+ *len = res_length;
+
+ return pos;
+ }
+ return 0;
+}
+
static inline struct virtio_fs_vq *vq_to_fsvq(struct virtqueue *vq)
{
struct virtio_fs *fs = vq->vdev->priv;
diff --git a/include/uapi/linux/virtio_pci.h b/include/uapi/linux/virtio_pci.h
index 90007a1abcab..2e6072b5a7c9 100644
--- a/include/uapi/linux/virtio_pci.h
+++ b/include/uapi/linux/virtio_pci.h
@@ -113,6 +113,8 @@
#define VIRTIO_PCI_CAP_DEVICE_CFG 4
/* PCI configuration access */
#define VIRTIO_PCI_CAP_PCI_CFG 5
+/* Additional shared memory capability */
+#define VIRTIO_PCI_CAP_SHARED_MEMORY_CFG 8
/* This is the PCI capability header: */
struct virtio_pci_cap {
@@ -163,6 +165,14 @@ struct virtio_pci_cfg_cap {
__u8 pci_cfg_data[4]; /* Data for BAR access. */
};
+/* Fields in VIRTIO_PCI_CAP_SHARED_MEMORY_CFG */
+struct virtio_pci_shm_cap {
+ struct virtio_pci_cap cap;
+ __le32 offset_hi; /* Most sig 32 bits of offset */
+ __le32 length_hi; /* Most sig 32 bits of length */
+ __u8 id; /* To distinguish shm chunks */
+};
+
/* Macro versions of offsets for the Old Timers! */
#define VIRTIO_PCI_CAP_VNDR 0
#define VIRTIO_PCI_CAP_NEXT 1
--
2.13.6
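The lo/hi assembly that `virtio_pci_find_shm_cap()` above performs for the 64-bit offset and length (low 32 bits from `virtio_pci_cap`, high 32 bits from `virtio_pci_shm_cap`) can be shown in isolation; `assemble_u64` is a hypothetical helper name for this sketch:

```c
#include <assert.h>
#include <stdint.h>

/* Combine two 32-bit config-space reads into one 64-bit value, low
 * half first, as done for the shm cap's offset and length fields. */
static uint64_t assemble_u64(uint32_t lo, uint32_t hi)
{
	uint64_t res = lo;

	res |= ((uint64_t)hi) << 32;
	return res;
}
```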
From: Stefan Hajnoczi <[email protected]>
Provide definitions of ->mount and ->kill_sb. This is still WIP.
Signed-off-by: Stefan Hajnoczi <[email protected]>
---
fs/fuse/fuse_i.h | 9 ++++
fs/fuse/inode.c | 12 ++++-
fs/fuse/virtio_fs.c | 129 +++++++++++++++++++++++++++++++++++++++++++++++++++-
3 files changed, 146 insertions(+), 4 deletions(-)
diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
index 9b5b8b194f77..4fea75c92a7c 100644
--- a/fs/fuse/fuse_i.h
+++ b/fs/fuse/fuse_i.h
@@ -59,10 +59,12 @@ extern unsigned max_user_congthresh;
/** Mount options */
struct fuse_mount_data {
int fd;
+ const char *tag; /* lifetime: .fill_super() data argument */
unsigned rootmode;
kuid_t user_id;
kgid_t group_id;
unsigned fd_present:1;
+ unsigned tag_present:1;
unsigned rootmode_present:1;
unsigned user_id_present:1;
unsigned group_id_present:1;
@@ -1002,6 +1004,13 @@ int fuse_fill_super_common(struct super_block *sb,
void **fudptr);
/**
+ * Disassociate fuse connection from superblock and kill the superblock
+ *
+ * Calls kill_anon_super(); do not use with bdev mounts.
+ */
+void fuse_kill_sb_anon(struct super_block *sb);
+
+/**
* Add connection to control filesystem
*/
int fuse_ctl_add_conn(struct fuse_conn *fc);
diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
index f13133f0ebd1..65fd59fc1e81 100644
--- a/fs/fuse/inode.c
+++ b/fs/fuse/inode.c
@@ -431,6 +431,7 @@ static int fuse_statfs(struct dentry *dentry, struct kstatfs *buf)
enum {
OPT_FD,
+ OPT_TAG,
OPT_ROOTMODE,
OPT_USER_ID,
OPT_GROUP_ID,
@@ -443,6 +444,7 @@ enum {
static const match_table_t tokens = {
{OPT_FD, "fd=%u"},
+ {OPT_TAG, "tag=%s"},
{OPT_ROOTMODE, "rootmode=%o"},
{OPT_USER_ID, "user_id=%u"},
{OPT_GROUP_ID, "group_id=%u"},
@@ -489,6 +491,11 @@ int parse_fuse_opt(char *opt, struct fuse_mount_data *d, int is_bdev,
d->fd_present = 1;
break;
+ case OPT_TAG:
+ d->tag = args[0].from;
+ d->tag_present = 1;
+ break;
+
case OPT_ROOTMODE:
if (match_octal(&args[0], &value))
return 0;
@@ -1204,7 +1211,7 @@ static int fuse_fill_super(struct super_block *sb, void *data, int silent)
err = -EINVAL;
if (!parse_fuse_opt(data, &d, is_bdev, sb->s_user_ns))
goto err;
- if (!d.fd_present)
+ if (!d.fd_present || d.tag_present)
goto err;
file = fget(d.fd);
@@ -1249,11 +1256,12 @@ static void fuse_sb_destroy(struct super_block *sb)
}
}
-static void fuse_kill_sb_anon(struct super_block *sb)
+void fuse_kill_sb_anon(struct super_block *sb)
{
fuse_sb_destroy(sb);
kill_anon_super(sb);
}
+EXPORT_SYMBOL_GPL(fuse_kill_sb_anon);
static struct file_system_type fuse_fs_type = {
.owner = THIS_MODULE,
diff --git a/fs/fuse/virtio_fs.c b/fs/fuse/virtio_fs.c
index aac9c3c42827..8cdeb02f3778 100644
--- a/fs/fuse/virtio_fs.c
+++ b/fs/fuse/virtio_fs.c
@@ -8,6 +8,7 @@
#include <linux/module.h>
#include <linux/virtio.h>
#include <linux/virtio_fs.h>
+#include "fuse_i.h"
/* List of virtio-fs device instances and a lock for the list */
static DEFINE_MUTEX(virtio_fs_mutex);
@@ -17,6 +18,8 @@ static LIST_HEAD(virtio_fs_instances);
struct virtio_fs {
struct list_head list; /* on virtio_fs_instances */
char *tag;
+ struct fuse_dev **fud; /* 1:1 mapping with request queues */
+ unsigned int num_queues;
};
/* Add a new instance to the list or return -EEXIST if tag name exists */
@@ -42,6 +45,46 @@ static int virtio_fs_add_instance(struct virtio_fs *fs)
return 0;
}
+/* Return the virtio_fs with a given tag, or NULL */
+static struct virtio_fs *virtio_fs_find_instance(const char *tag)
+{
+ struct virtio_fs *fs;
+
+ mutex_lock(&virtio_fs_mutex);
+
+ list_for_each_entry(fs, &virtio_fs_instances, list) {
+ if (strcmp(fs->tag, tag) == 0)
+ goto found;
+ }
+
+ fs = NULL; /* not found */
+
+found:
+ mutex_unlock(&virtio_fs_mutex);
+
+ return fs;
+}
+
+static void virtio_fs_free_devs(struct virtio_fs *fs)
+{
+ unsigned int i;
+
+ /* TODO lock */
+
+ if (!fs->fud)
+ return;
+
+ for (i = 0; i < fs->num_queues; i++) {
+ struct fuse_dev *fud = fs->fud[i];
+
+ if (fud)
+ fuse_dev_free(fud); /* TODO need to quiesce/end_requests/decrement dev_count */
+ }
+
+ kfree(fs->fud);
+ fs->fud = NULL;
+}
+
/* Read filesystem name from virtio config into fs->tag (must kfree()). */
static int virtio_fs_read_tag(struct virtio_device *vdev, struct virtio_fs *fs)
{
@@ -76,6 +119,13 @@ static int virtio_fs_probe(struct virtio_device *vdev)
return -ENOMEM;
vdev->priv = fs;
+ virtio_cread(vdev, struct virtio_fs_config, num_queues,
+ &fs->num_queues);
+ if (fs->num_queues == 0) {
+ ret = -EINVAL;
+ goto out;
+ }
+
ret = virtio_fs_read_tag(vdev, fs);
if (ret < 0)
goto out;
@@ -95,6 +145,8 @@ static void virtio_fs_remove(struct virtio_device *vdev)
{
struct virtio_fs *fs = vdev->priv;
+ virtio_fs_free_devs(fs);
+
vdev->config->reset(vdev);
mutex_lock(&virtio_fs_mutex);
@@ -138,11 +190,84 @@ static struct virtio_driver virtio_fs_driver = {
#endif
};
+static int virtio_fs_fill_super(struct super_block *sb, void *data,
+ int silent)
+{
+ struct fuse_mount_data d;
+ struct fuse_conn *fc;
+ struct virtio_fs *fs;
+ int is_bdev = sb->s_bdev != NULL;
+ unsigned int i;
+ int err;
+
+ err = -EINVAL;
+ if (!parse_fuse_opt(data, &d, is_bdev, sb->s_user_ns))
+ goto err;
+ if (d.fd_present) {
+ printk(KERN_ERR "virtio-fs: fd option cannot be used\n");
+ goto err;
+ }
+ if (!d.tag_present) {
+ printk(KERN_ERR "virtio-fs: missing tag option\n");
+ goto err;
+ }
+
+ fs = virtio_fs_find_instance(d.tag);
+ if (!fs) {
+ printk(KERN_ERR "virtio-fs: tag not found\n");
+ err = -ENOENT;
+ goto err;
+ }
+
+ /* TODO lock */
+ if (fs->fud) {
+ printk(KERN_ERR "virtio-fs: device already in use\n");
+ err = -EBUSY;
+ goto err;
+ }
+ fs->fud = kcalloc(fs->num_queues, sizeof(fs->fud[0]), GFP_KERNEL);
+ if (!fs->fud) {
+ err = -ENOMEM;
+ goto err_fud;
+ }
+
+ err = fuse_fill_super_common(sb, &d, (void **)&fs->fud[0]);
+ if (err < 0)
+ goto err_fud;
+
+ fc = fs->fud[0]->fc;
+
+ /* Allocate remaining fuse_devs */
+ err = -ENOMEM;
+ /* TODO take fuse_mutex around this loop? */
+ for (i = 1; i < fs->num_queues; i++) {
+ fs->fud[i] = fuse_dev_alloc(fc);
+ if (!fs->fud[i]) {
+ /* TODO */
+ }
+ atomic_inc(&fc->dev_count);
+ }
+
+ return 0;
+
+err_fud:
+ virtio_fs_free_devs(fs);
+err:
+ return err;
+}
+
+static struct dentry *virtio_fs_mount(struct file_system_type *fs_type,
+ int flags, const char *dev_name,
+ void *raw_data)
+{
+ return mount_nodev(fs_type, flags, raw_data, virtio_fs_fill_super);
+}
+
static struct file_system_type virtio_fs_type = {
.owner = THIS_MODULE,
.name = KBUILD_MODNAME,
- .mount = NULL,
- .kill_sb = NULL,
+ .mount = virtio_fs_mount,
+ .kill_sb = fuse_kill_sb_anon,
};
static int __init virtio_fs_init(void)
--
2.13.6
From: Miklos Szeredi <[email protected]>
Signed-off-by: Miklos Szeredi <[email protected]>
---
fs/fuse/dir.c | 40 +++++++++++++++++++++++++---------------
1 file changed, 25 insertions(+), 15 deletions(-)
diff --git a/fs/fuse/dir.c b/fs/fuse/dir.c
index 3aa214f9a28e..f9a91e782cf0 100644
--- a/fs/fuse/dir.c
+++ b/fs/fuse/dir.c
@@ -253,29 +253,36 @@ static bool fuse_dentry_version_mismatch(struct dentry *dentry)
return fuse_version_mismatch(inode, READ_ONCE(fude->version));
}
-static void fuse_set_version_ptr(struct inode *inode,
- struct fuse_entryver_out *outver)
+static s64 *fuse_version_ptr(struct inode *inode,
+ struct fuse_entryver_out *outver)
{
struct fuse_conn *fc = get_fuse_conn(inode);
- struct fuse_inode *fi = get_fuse_inode(inode);
- if (!fc->version_table || !outver->version_index) {
- fi->version_ptr = NULL;
- return;
- }
+ if (!fc->version_table || !outver->version_index)
+ return NULL;
+
if (outver->version_index >= fc->version_table_size) {
pr_warn_ratelimited("version index too large (%llu >= %llu)\n",
outver->version_index,
fc->version_table_size);
- fi->version_ptr = NULL;
- return;
+ return NULL;
}
- fi->version_ptr = fc->version_table + outver->version_index;
+ return fc->version_table + outver->version_index;
+}
- pr_info("fuse: version_ptr = %p\n", fi->version_ptr);
- pr_info("fuse: version = %lli\n", fi->attr_version);
- pr_info("fuse: current_version: %lli\n", *fi->version_ptr);
+static void fuse_set_version_ptr(struct inode *inode,
+ struct fuse_entryver_out *outver)
+{
+ struct fuse_inode *fi = get_fuse_inode(inode);
+
+ fi->version_ptr = fuse_version_ptr(inode, outver);
+
+ if (fi->version_ptr) {
+ pr_info("fuse: version_ptr = %p\n", fi->version_ptr);
+ pr_info("fuse: version = %lli\n", fi->attr_version);
+ pr_info("fuse: current_version: %lli\n", *fi->version_ptr);
+ }
}
/*
@@ -335,13 +342,16 @@ static int fuse_dentry_revalidate(struct dentry *entry, unsigned int flags)
if (!ret && !outarg.nodeid)
ret = -ENOENT;
if (!ret) {
+ s64 *new_version_ptr = fuse_version_ptr(inode, &outver);
+
fi = get_fuse_inode(inode);
if (outarg.nodeid != get_node_id(inode)) {
fuse_queue_forget(fc, forget, outarg.nodeid, 1);
goto invalid;
}
- if (fi->version_ptr != fc->version_table + outver.version_index)
- pr_warn("fuse_dentry_revalidate: version_ptr changed (%p -> %p)\n", fi->version_ptr, fc->version_table + outver.version_index);
+ if (fi->version_ptr != new_version_ptr) {
+ pr_warn("fuse_dentry_revalidate: version_ptr changed (%p -> %p)\n", fi->version_ptr, new_version_ptr);
+ }
spin_lock(&fc->lock);
fi->nlookup++;
--
2.13.6
fuse_dax_free_memory() can be very CPU intensive in corner cases. For
example, if one inode has consumed all the memory and a setupmapping
request is pending, the inode lock is held by the request and the worker
thread will not get the lock for a while. And given that only one inode is
consuming all the dax ranges, every attempt to acquire the lock will fail.
So if there are too many inode lock failures (-EAGAIN), reschedule the
worker with a 10ms delay.
Signed-off-by: Vivek Goyal <[email protected]>
---
fs/fuse/file.c | 12 ++++++++++--
1 file changed, 10 insertions(+), 2 deletions(-)
diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index dbe3410a94d7..709747458335 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -3909,7 +3909,7 @@ int fuse_dax_free_one_mapping(struct fuse_conn *fc, struct inode *inode,
int fuse_dax_free_memory(struct fuse_conn *fc, unsigned long nr_to_free)
{
struct fuse_dax_mapping *dmap, *pos, *temp;
- int ret, nr_freed = 0;
+ int ret, nr_freed = 0, nr_eagain = 0;
u64 dmap_start = 0, window_offset = 0;
struct inode *inode = NULL;
@@ -3918,6 +3918,12 @@ int fuse_dax_free_memory(struct fuse_conn *fc, unsigned long nr_to_free)
if (nr_freed >= nr_to_free)
break;
+ if (nr_eagain > 20) {
+ queue_delayed_work(system_long_wq, &fc->dax_free_work,
+ msecs_to_jiffies(10));
+ return 0;
+ }
+
dmap = NULL;
spin_lock(&fc->lock);
@@ -3955,8 +3961,10 @@ int fuse_dax_free_memory(struct fuse_conn *fc, unsigned long nr_to_free)
}
/* Could not get inode lock. Try next element */
- if (ret == -EAGAIN)
+ if (ret == -EAGAIN) {
+ nr_eagain++;
continue;
+ }
nr_freed++;
}
return 0;
--
2.13.6
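The retry budget in the patch above can be sketched as a standalone loop: after too many lock failures, give up and reschedule instead of spinning. This is a minimal model of the control flow only; `try_free_one` and `rescheduled` are hypothetical test hooks, not the fuse functions:

```c
#include <assert.h>

#define EAGAIN_LIMIT 20

int rescheduled; /* stands in for queue_delayed_work() being called */

static int free_ranges(unsigned long nr_to_free, int (*try_free_one)(void))
{
	unsigned long nr_freed = 0;
	int nr_eagain = 0;

	while (nr_freed < nr_to_free) {
		if (nr_eagain > EAGAIN_LIMIT) {
			/* too much contention: defer instead of spinning */
			rescheduled = 1;
			return 0;
		}
		if (try_free_one() < 0) { /* would-be -EAGAIN */
			nr_eagain++;
			continue;
		}
		nr_freed++;
	}
	return 0;
}

/* Test hooks: a range that always frees, and one that is always busy. */
int always_ok(void) { return 0; }
int always_busy(void) { return -1; }
```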
Once we select a memory range to free, we currently block on the inode
lock. Do not block; use trylock instead, and move on to the next memory
range if trylock fails.
The reason is that in the next few patches I want to enable waiting for
memory ranges to become free in fuse_iomap_begin(). So instead of
returning -EBUSY, a process will wait for a memory range to become free.
We don't want to end up in a situation where a process is sleeping in
iomap_begin() with the inode lock held while the worker is trying to free
memory from the same inode, resulting in deadlock. Using trylock avoids
this.
Signed-off-by: Vivek Goyal <[email protected]>
---
fs/fuse/file.c | 36 ++++++++++++++++++++++++++++--------
1 file changed, 28 insertions(+), 8 deletions(-)
diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index d86f6e5c4daf..dbe3410a94d7 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -3891,7 +3891,12 @@ int fuse_dax_free_one_mapping(struct fuse_conn *fc, struct inode *inode,
int ret;
struct fuse_inode *fi = get_fuse_inode(inode);
- inode_lock(inode);
+ /*
+ * If process is blocked waiting for memory while holding inode
+ * lock, we will deadlock. So continue to free next range.
+ */
+ if (!inode_trylock(inode))
+ return -EAGAIN;
down_write(&fi->i_mmap_sem);
down_write(&fi->i_dmap_sem);
ret = fuse_dax_free_one_mapping_locked(fc, inode, dmap_start);
@@ -3903,19 +3908,22 @@ int fuse_dax_free_one_mapping(struct fuse_conn *fc, struct inode *inode,
int fuse_dax_free_memory(struct fuse_conn *fc, unsigned long nr_to_free)
{
- struct fuse_dax_mapping *dmap, *pos;
- int ret, i;
+ struct fuse_dax_mapping *dmap, *pos, *temp;
+ int ret, nr_freed = 0;
u64 dmap_start = 0, window_offset = 0;
struct inode *inode = NULL;
/* Pick first busy range and free it for now*/
- for (i = 0; i < nr_to_free; i++) {
+ while (1) {
+ if (nr_freed >= nr_to_free)
+ break;
+
dmap = NULL;
spin_lock(&fc->lock);
- list_for_each_entry(pos, &fc->busy_ranges, busy_list) {
- dmap = pos;
- inode = igrab(dmap->inode);
+ list_for_each_entry_safe(pos, temp, &fc->busy_ranges,
+ busy_list) {
+ inode = igrab(pos->inode);
/*
* This inode is going away. That will free
* up all the ranges anyway, continue to
@@ -3923,6 +3931,13 @@ int fuse_dax_free_memory(struct fuse_conn *fc, unsigned long nr_to_free)
*/
if (!inode)
continue;
+ /*
+ * Take this element off the list and add it to the tail. If
+ * the inode lock can't be obtained, this helps select a
+ * different element next time.
+ */
+ dmap = pos;
+ list_move_tail(&dmap->busy_list, &fc->busy_ranges);
dmap_start = dmap->start;
window_offset = dmap->window_offset;
break;
@@ -3933,11 +3948,16 @@ int fuse_dax_free_memory(struct fuse_conn *fc, unsigned long nr_to_free)
ret = fuse_dax_free_one_mapping(fc, inode, dmap_start);
iput(inode);
- if (ret) {
+ if (ret && ret != -EAGAIN) {
printk("%s(window_offset=0x%llx) failed. err=%d\n",
__func__, window_offset, ret);
return ret;
}
+
+ /* Could not get inode lock. Try next element */
+ if (ret == -EAGAIN)
+ continue;
+ nr_freed++;
}
return 0;
}
--
2.13.6
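The trylock-and-skip pattern from the patch above can be sketched with a pthread mutex: instead of blocking on a contended lock (and risking deadlock with a holder who is waiting for memory), fail immediately and let the caller move on to the next range. The `range` struct and `free_one_range` are illustrative names, not the fuse ones:

```c
#include <assert.h>
#include <pthread.h>

struct range {
	pthread_mutex_t lock;
	int freed;
};

/* Try to free one range; if the lock is contended, skip it rather than
 * block, mirroring the inode_trylock() change in the patch. */
static int free_one_range(struct range *r)
{
	if (pthread_mutex_trylock(&r->lock) != 0)
		return -1; /* would-be -EAGAIN: caller tries next range */
	r->freed = 1;
	pthread_mutex_unlock(&r->lock);
	return 0;
}
```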
This can be done only for the same inode. Also, it can be done only in the
read/write case and not in the fault case. The reason is that, as of now,
reclaim requires holding inode_lock, fuse_inode->i_mmap_sem and
fuse_inode->dmap_tree locks in that order, and only the read/write path
allows that (not the fault path).
Signed-off-by: Vivek Goyal <[email protected]>
---
fs/fuse/file.c | 121 +++++++++++++++++++++++++++++++++++++++++++++++++--------
1 file changed, 105 insertions(+), 16 deletions(-)
diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index 17becdff3014..13db83d105ff 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -30,6 +30,8 @@ INTERVAL_TREE_DEFINE(struct fuse_dax_mapping,
static long __fuse_file_fallocate(struct file *file, int mode,
loff_t offset, loff_t length);
+static struct fuse_dax_mapping *alloc_dax_mapping_reclaim(struct fuse_conn *fc,
+ struct inode *inode);
static int fuse_send_open(struct fuse_conn *fc, u64 nodeid, struct file *file,
int opcode, struct fuse_open_out *outargp)
@@ -1727,7 +1729,12 @@ static int fuse_iomap_begin(struct inode *inode, loff_t pos, loff_t length,
if (pos >= i_size_read(inode))
goto iomap_hole;
- alloc_dmap = alloc_dax_mapping(fc);
+ /* Can't do reclaim in fault path yet due to lock ordering */
+ if (flags & IOMAP_FAULT)
+ alloc_dmap = alloc_dax_mapping(fc);
+ else
+ alloc_dmap = alloc_dax_mapping_reclaim(fc, inode);
+
if (!alloc_dmap)
return -EBUSY;
@@ -3705,24 +3712,14 @@ void fuse_init_file_inode(struct inode *inode)
}
}
-int fuse_dax_free_one_mapping_locked(struct fuse_conn *fc, struct inode *inode,
- u64 dmap_start)
+int fuse_dax_reclaim_dmap_locked(struct fuse_conn *fc, struct inode *inode,
+ struct fuse_dax_mapping *dmap)
{
int ret;
struct fuse_inode *fi = get_fuse_inode(inode);
- struct fuse_dax_mapping *dmap;
-
- WARN_ON(!inode_is_locked(inode));
-
- /* Find fuse dax mapping at file offset inode. */
- dmap = fuse_dax_interval_tree_iter_first(&fi->dmap_tree, dmap_start,
- dmap_start);
-
- /* Range already got cleaned up by somebody else */
- if (!dmap)
- return 0;
- ret = filemap_fdatawrite_range(inode->i_mapping, dmap->start, dmap->end);
+ ret = filemap_fdatawrite_range(inode->i_mapping, dmap->start,
+ dmap->end);
if (ret) {
printk("filemap_fdatawrite_range() failed. err=%d start=0x%llx,"
" end=0x%llx\n", ret, dmap->start, dmap->end);
@@ -3743,6 +3740,99 @@ int fuse_dax_free_one_mapping_locked(struct fuse_conn *fc, struct inode *inode,
/* Remove dax mapping from inode interval tree now */
fuse_dax_interval_tree_remove(dmap, &fi->dmap_tree);
fi->nr_dmaps--;
+ return 0;
+}
+
+/* Find the first mapping in the tree and free it. */
+struct fuse_dax_mapping *fuse_dax_reclaim_first_mapping_locked(
+ struct fuse_conn *fc, struct inode *inode)
+{
+ struct fuse_inode *fi = get_fuse_inode(inode);
+ struct fuse_dax_mapping *dmap;
+ int ret;
+
+ /* Find the first dax mapping in the tree. */
+ dmap = fuse_dax_interval_tree_iter_first(&fi->dmap_tree, 0, -1);
+ if (!dmap)
+ return NULL;
+
+ ret = fuse_dax_reclaim_dmap_locked(fc, inode, dmap);
+ if (ret < 0)
+ return ERR_PTR(ret);
+
+ /* Clean up dmap. Do not add back to free list */
+ spin_lock(&fc->lock);
+ list_del_init(&dmap->busy_list);
+ WARN_ON(fc->nr_busy_ranges == 0);
+ fc->nr_busy_ranges--;
+ dmap->inode = NULL;
+ dmap->start = dmap->end = 0;
+ spin_unlock(&fc->lock);
+
+ pr_debug("fuse: reclaimed memory range window_offset=0x%llx,"
+ " length=0x%llx\n", dmap->window_offset,
+ dmap->length);
+ return dmap;
+}
+
+/*
+ * Find the first mapping in the tree, free it and return it. Do not add
+ * it back to the free pool.
+ *
+ * This is called with inode lock held.
+ */
+struct fuse_dax_mapping *fuse_dax_reclaim_first_mapping(struct fuse_conn *fc,
+ struct inode *inode)
+{
+ struct fuse_inode *fi = get_fuse_inode(inode);
+ struct fuse_dax_mapping *dmap;
+
+ down_write(&fi->i_mmap_sem);
+ down_write(&fi->i_dmap_sem);
+ dmap = fuse_dax_reclaim_first_mapping_locked(fc, inode);
+ up_write(&fi->i_dmap_sem);
+ up_write(&fi->i_mmap_sem);
+ return dmap;
+}
+
+static struct fuse_dax_mapping *alloc_dax_mapping_reclaim(struct fuse_conn *fc,
+ struct inode *inode)
+{
+ struct fuse_dax_mapping *dmap;
+ struct fuse_inode *fi = get_fuse_inode(inode);
+
+ dmap = alloc_dax_mapping(fc);
+ if (dmap)
+ return dmap;
+
+ /* There are no mappings which can be reclaimed */
+ if (!fi->nr_dmaps)
+ return NULL;
+
+ /* Try to reclaim a fuse dax memory range */
+ return fuse_dax_reclaim_first_mapping(fc, inode);
+}
+
+int fuse_dax_free_one_mapping_locked(struct fuse_conn *fc, struct inode *inode,
+ u64 dmap_start)
+{
+ int ret;
+ struct fuse_inode *fi = get_fuse_inode(inode);
+ struct fuse_dax_mapping *dmap;
+
+ WARN_ON(!inode_is_locked(inode));
+
+ /* Find the fuse dax mapping at file offset dmap_start. */
+ dmap = fuse_dax_interval_tree_iter_first(&fi->dmap_tree, dmap_start,
+ dmap_start);
+
+ /* Range already got cleaned up by somebody else */
+ if (!dmap)
+ return 0;
+
+ ret = fuse_dax_reclaim_dmap_locked(fc, inode, dmap);
+ if (ret < 0)
+ return ret;
/* Cleanup dmap entry and add back to free list */
spin_lock(&fc->lock);
@@ -3757,7 +3847,6 @@ int fuse_dax_free_one_mapping_locked(struct fuse_conn *fc, struct inode *inode,
pr_debug("fuse: freed memory range window_offset=0x%llx,"
" length=0x%llx\n", dmap->window_offset,
dmap->length);
-
return ret;
}
--
2.13.6
Kick a worker to free up memory when the number of free ranges drops below
20% of the total number of ranges computed at initialization time.
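The kick condition in alloc_dax_mapping() reduces to a simple percentage check. A minimal stand-alone sketch of that check (the helper name is hypothetical; the constant mirrors the patch):

```c
#include <assert.h>

/* Hypothetical user-space model of the reclaim-kick check in
 * alloc_dax_mapping(); FUSE_DAX_RECLAIM_THRESHOLD mirrors the patch. */
#define FUSE_DAX_RECLAIM_THRESHOLD 20

static int should_kick_reclaim(unsigned long nr_free_ranges,
			       unsigned long nr_ranges)
{
	unsigned long free_threshold =
		(nr_ranges * FUSE_DAX_RECLAIM_THRESHOLD) / 100;

	/* Kick the worker only when a meaningful threshold exists */
	return free_threshold > 0 && nr_free_ranges < free_threshold;
}
```

Note the integer division: with fewer than 5 total ranges the threshold rounds down to 0 and reclaim is never kicked from this path.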
Signed-off-by: Vivek Goyal <[email protected]>
---
fs/fuse/file.c | 11 ++++++++++-
fs/fuse/fuse_i.h | 9 +++++++++
fs/fuse/inode.c | 1 +
3 files changed, 20 insertions(+), 1 deletion(-)
diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index 13db83d105ff..1f172d372eeb 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -186,6 +186,7 @@ static void fuse_link_write_file(struct file *file)
static struct fuse_dax_mapping *alloc_dax_mapping(struct fuse_conn *fc)
{
+ unsigned long free_threshold;
struct fuse_dax_mapping *dmap = NULL;
spin_lock(&fc->lock);
@@ -193,7 +194,7 @@ static struct fuse_dax_mapping *alloc_dax_mapping(struct fuse_conn *fc)
/* TODO: Add logic to try to free up memory if wait is allowed */
if (fc->nr_free_ranges <= 0) {
spin_unlock(&fc->lock);
- return NULL;
+ goto out_kick;
}
WARN_ON(list_empty(&fc->free_ranges));
@@ -204,6 +205,14 @@ static struct fuse_dax_mapping *alloc_dax_mapping(struct fuse_conn *fc)
list_del_init(&dmap->list);
fc->nr_free_ranges--;
spin_unlock(&fc->lock);
+
+out_kick:
+ /* If the number of free ranges is below the threshold, start reclaim */
+ free_threshold = (fc->nr_ranges * FUSE_DAX_RECLAIM_THRESHOLD)/100;
+ if (free_threshold > 0 && fc->nr_free_ranges < free_threshold) {
+ pr_debug("fuse: Kicking dax memory reclaim worker. nr_free_ranges=%lu nr_total_ranges=%lu\n", fc->nr_free_ranges, fc->nr_ranges);
+ queue_delayed_work(system_long_wq, &fc->dax_free_work, 0);
+ }
return dmap;
}
diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
index 383deaf0ecf1..bbefa7c11078 100644
--- a/fs/fuse/fuse_i.h
+++ b/fs/fuse/fuse_i.h
@@ -53,6 +53,13 @@
/* Number of ranges reclaimer will try to free in one invocation */
#define FUSE_DAX_RECLAIM_CHUNK (10)
+/*
+ * DAX memory reclaim threshold as a percentage of total ranges. When the
+ * number of free ranges drops below this threshold, reclaim can trigger.
+ * Default is 20%.
+ */
+#define FUSE_DAX_RECLAIM_THRESHOLD (20)
+
/** List of active connections */
extern struct list_head fuse_conn_list;
@@ -885,6 +892,8 @@ struct fuse_conn {
*/
unsigned long nr_free_ranges;
struct list_head free_ranges;
+
+ unsigned long nr_ranges;
};
static inline struct fuse_conn *get_fuse_conn_super(struct super_block *sb)
diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
index 44f7bc44e319..d31acb97eede 100644
--- a/fs/fuse/inode.c
+++ b/fs/fuse/inode.c
@@ -675,6 +675,7 @@ static int fuse_dax_mem_range_init(struct fuse_conn *fc,
list_replace_init(&mem_ranges, &fc->free_ranges);
fc->nr_free_ranges = allocated_ranges;
+ fc->nr_ranges = allocated_ranges;
return 0;
out_err:
/* Free All allocated elements */
--
2.13.6
With cache=never, we fall back to direct I/O. pjdfstest chmod test 12.t was
failing because a file's setuid bit should be cleared when an unprivileged
user opens it for write and writes to it.
Call file_remove_privs() even on the direct I/O path.
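The semantics being fixed can be modeled outside the kernel. This sketch is hypothetical and much simpler than the real file_remove_privs() (which, for example, also handles the setgid-without-group-exec case); it only shows the intended effect of an unprivileged write:

```c
#include <assert.h>

/* Simplified model of privilege stripping on write; the real
 * file_remove_privs() covers more cases than this sketch. */
#define MODE_SUID 04000
#define MODE_SGID 02000

static unsigned int mode_after_write(unsigned int mode, int privileged)
{
	/* An unprivileged write must drop the setuid/setgid bits */
	if (!privileged)
		mode &= ~(MODE_SUID | MODE_SGID);
	return mode;
}
```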
Signed-off-by: Vivek Goyal <[email protected]>
---
fs/fuse/file.c | 20 ++++++++++++++++----
1 file changed, 16 insertions(+), 4 deletions(-)
diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index cb28cf26a6e7..0be5a7380b3c 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -1679,13 +1679,25 @@ static ssize_t fuse_direct_write_iter(struct kiocb *iocb, struct iov_iter *from)
/* Don't allow parallel writes to the same file */
inode_lock(inode);
res = generic_write_checks(iocb, from);
- if (res > 0)
- res = fuse_direct_io(&io, from, &iocb->ki_pos, FUSE_DIO_WRITE);
+ if (res < 0)
+ goto out_invalidate;
+
+ res = file_remove_privs(iocb->ki_filp);
+ if (res)
+ goto out_invalidate;
+
+ res = fuse_direct_io(&io, from, &iocb->ki_pos, FUSE_DIO_WRITE);
+ if (res < 0)
+ goto out_invalidate;
+
fuse_invalidate_attr(inode);
- if (res > 0)
- fuse_write_update_size(inode, iocb->ki_pos);
+ fuse_write_update_size(inode, iocb->ki_pos);
inode_unlock(inode);
+ return res;
+out_invalidate:
+ fuse_invalidate_attr(inode);
+ inode_unlock(inode);
return res;
}
--
2.13.6
From: Stefan Hajnoczi <[email protected]>
Experimental QEMU code introduces an MMIO BAR for mapping portions of
files in the virtio-fs device. Map this BAR so that FUSE DAX can access
file contents from the host page cache.
The DAX window is accessed by the fs/dax.c infrastructure and must have
struct pages (at least on x86). Use devm_memremap_pages() to map the
DAX window PCI BAR and allocate struct page.
Signed-off-by: Stefan Hajnoczi <[email protected]>
---
fs/fuse/virtio_fs.c | 166 ++++++++++++++++++++++++++++++++++++++++++++--------
1 file changed, 143 insertions(+), 23 deletions(-)
diff --git a/fs/fuse/virtio_fs.c b/fs/fuse/virtio_fs.c
index ba615ec2603e..87b7e42a6763 100644
--- a/fs/fuse/virtio_fs.c
+++ b/fs/fuse/virtio_fs.c
@@ -6,12 +6,18 @@
#include <linux/fs.h>
#include <linux/dax.h>
+#include <linux/pci.h>
#include <linux/pfn_t.h>
#include <linux/module.h>
#include <linux/virtio.h>
#include <linux/virtio_fs.h>
#include "fuse_i.h"
+enum {
+ /* PCI BAR number of the virtio-fs DAX window */
+ VIRTIO_FS_WINDOW_BAR = 2,
+};
+
/* List of virtio-fs device instances and a lock for the list */
static DEFINE_MUTEX(virtio_fs_mutex);
static LIST_HEAD(virtio_fs_instances);
@@ -24,6 +30,18 @@ struct virtio_fs_vq {
char name[24];
} ____cacheline_aligned_in_smp;
+/* State needed for devm_memremap_pages(). This API is called on the
+ * underlying pci_dev instead of struct virtio_fs (layering violation). Since
+ * the memremap release function only gets called when the pci_dev is released,
+ * keep the associated state separate from struct virtio_fs (it has a different
+ * lifecycle from pci_dev).
+ */
+struct virtio_fs_memremap_info {
+ struct dev_pagemap pgmap;
+ struct percpu_ref ref;
+ struct completion completion;
+};
+
/* A virtio-fs device instance */
struct virtio_fs {
struct list_head list; /* on virtio_fs_instances */
@@ -36,6 +54,7 @@ struct virtio_fs {
/* DAX memory window where file contents are mapped */
void *window_kaddr;
phys_addr_t window_phys_addr;
+ size_t window_len;
};
static inline struct virtio_fs_vq *vq_to_fsvq(struct virtqueue *vq)
@@ -395,6 +414,127 @@ static const struct dax_operations virtio_fs_dax_ops = {
.copy_to_iter = virtio_fs_copy_to_iter,
};
+static void virtio_fs_percpu_release(struct percpu_ref *ref)
+{
+ struct virtio_fs_memremap_info *mi =
+ container_of(ref, struct virtio_fs_memremap_info, ref);
+
+ complete(&mi->completion);
+}
+
+static void virtio_fs_percpu_exit(void *data)
+{
+ struct virtio_fs_memremap_info *mi = data;
+
+ wait_for_completion(&mi->completion);
+ percpu_ref_exit(&mi->ref);
+}
+
+static void virtio_fs_percpu_kill(void *data)
+{
+ percpu_ref_kill(data);
+}
+
+static void virtio_fs_cleanup_dax(void *data)
+{
+ struct virtio_fs *fs = data;
+
+ kill_dax(fs->dax_dev);
+ put_dax(fs->dax_dev);
+}
+
+static int virtio_fs_setup_dax(struct virtio_device *vdev, struct virtio_fs *fs)
+{
+ struct virtio_fs_memremap_info *mi;
+ struct dev_pagemap *pgmap;
+ struct pci_dev *pci_dev;
+ phys_addr_t phys_addr;
+ size_t len;
+ int ret;
+
+ if (!IS_ENABLED(CONFIG_DAX_DRIVER))
+ return 0;
+
+ /* HACK implement VIRTIO shared memory regions instead of
+ * directly accessing the PCI BAR from a virtio device driver.
+ */
+ pci_dev = container_of(vdev->dev.parent, struct pci_dev, dev);
+
+ /* TODO Is this safe - the virtio_pci_* driver doesn't use managed
+ * device APIs? */
+ ret = pcim_enable_device(pci_dev);
+ if (ret < 0)
+ return ret;
+
+ /* TODO handle case where device doesn't expose BAR? */
+ ret = pci_request_region(pci_dev, VIRTIO_FS_WINDOW_BAR,
+ "virtio-fs-window");
+ if (ret < 0) {
+ dev_err(&vdev->dev, "%s: failed to request window BAR\n",
+ __func__);
+ return ret;
+ }
+
+ phys_addr = pci_resource_start(pci_dev, VIRTIO_FS_WINDOW_BAR);
+ len = pci_resource_len(pci_dev, VIRTIO_FS_WINDOW_BAR);
+
+ mi = devm_kzalloc(&pci_dev->dev, sizeof(*mi), GFP_KERNEL);
+ if (!mi)
+ return -ENOMEM;
+
+ init_completion(&mi->completion);
+ ret = percpu_ref_init(&mi->ref, virtio_fs_percpu_release, 0,
+ GFP_KERNEL);
+ if (ret < 0) {
+ dev_err(&vdev->dev, "%s: percpu_ref_init failed (%d)\n",
+ __func__, ret);
+ return ret;
+ }
+
+ ret = devm_add_action(&pci_dev->dev, virtio_fs_percpu_exit, mi);
+ if (ret < 0) {
+ percpu_ref_exit(&mi->ref);
+ return ret;
+ }
+
+ pgmap = &mi->pgmap;
+ pgmap->altmap_valid = false;
+ pgmap->ref = &mi->ref;
+ pgmap->type = MEMORY_DEVICE_FS_DAX;
+
+ /* Ideally we would directly use the PCI BAR resource but
+ * devm_memremap_pages() wants its own copy in pgmap. So
+ * initialize a struct resource from scratch (only the start
+ * and end fields will be used).
+ */
+ pgmap->res = (struct resource){
+ .name = "virtio-fs dax window",
+ .start = phys_addr,
+ .end = phys_addr + len - 1,
+ };
+
+ fs->window_kaddr = devm_memremap_pages(&pci_dev->dev, pgmap);
+ if (IS_ERR(fs->window_kaddr))
+ return PTR_ERR(fs->window_kaddr);
+
+ ret = devm_add_action_or_reset(&pci_dev->dev, virtio_fs_percpu_kill,
+ &mi->ref);
+ if (ret < 0)
+ return ret;
+
+ fs->window_phys_addr = phys_addr;
+ fs->window_len = len;
+
+ dev_dbg(&vdev->dev, "%s: window kaddr 0x%px phys_addr 0x%llx len %zu\n",
+ __func__, fs->window_kaddr, phys_addr, len);
+
+ fs->dax_dev = alloc_dax(fs, NULL, &virtio_fs_dax_ops);
+ if (!fs->dax_dev)
+ return -ENOMEM;
+
+ return devm_add_action_or_reset(&vdev->dev, virtio_fs_cleanup_dax, fs);
+}
+
static int virtio_fs_probe(struct virtio_device *vdev)
{
struct virtio_fs *fs;
@@ -416,16 +556,9 @@ static int virtio_fs_probe(struct virtio_device *vdev)
/* TODO vq affinity */
/* TODO populate notifications vq */
- if (IS_ENABLED(CONFIG_DAX_DRIVER)) {
- /* TODO map window */
- fs->window_kaddr = NULL;
- fs->window_phys_addr = 0;
-
- fs->dax_dev = alloc_dax(fs, NULL, &virtio_fs_dax_ops);
- if (!fs->dax_dev)
- goto out_vqs; /* TODO handle case where device doesn't expose
- BAR */
- }
+ ret = virtio_fs_setup_dax(vdev, fs);
+ if (ret < 0)
+ goto out_vqs;
/* Bring the device online in case the filesystem is mounted and
* requests need to be sent before we return.
@@ -441,13 +574,6 @@ static int virtio_fs_probe(struct virtio_device *vdev)
out_vqs:
vdev->config->reset(vdev);
virtio_fs_cleanup_vqs(vdev, fs);
-
- if (fs->dax_dev) {
- kill_dax(fs->dax_dev);
- put_dax(fs->dax_dev);
- fs->dax_dev = NULL;
- }
-
out:
vdev->priv = NULL;
return ret;
@@ -466,12 +592,6 @@ static void virtio_fs_remove(struct virtio_device *vdev)
list_del(&fs->list);
mutex_unlock(&virtio_fs_mutex);
- if (fs->dax_dev) {
- kill_dax(fs->dax_dev);
- put_dax(fs->dax_dev);
- fs->dax_dev = NULL;
- }
-
vdev->priv = NULL;
}
--
2.13.6
From: Stefan Hajnoczi <[email protected]>
Although struct dax_device itself is not tied to a block device, some
DAX code assumes there is a block device. Make block devices optional
by allowing bdev to be NULL in commonly used DAX APIs.
When there is no block device:
* Skip the partition offset calculation in bdev_dax_pgoff()
* Skip the blkdev_issue_zeroout() optimization
Note that more block device assumptions remain, but I haven't reached those
code paths yet.
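The bdev_dax_pgoff() change amounts to treating a missing block device as a partition starting at sector 0. A stand-alone sketch of the calculation (assumes a 4 KiB page size; the helper name is hypothetical):

```c
#include <assert.h>

typedef unsigned long long sector_t;

/* Model of bdev_dax_pgoff() after the change: with no block device the
 * partition start sector is taken as 0. Assumes PAGE_SHIFT == 12. */
static unsigned long dax_pgoff(sector_t start_sect, sector_t sector)
{
	/* 512-byte sectors to a byte offset, then to a page frame number */
	unsigned long long phys_off = (start_sect + sector) * 512;

	return (unsigned long)(phys_off >> 12); /* PHYS_PFN */
}
```

A caller with bdev == NULL simply passes start_sect == 0 and gets the identity sector-to-page mapping.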
Signed-off-by: Stefan Hajnoczi <[email protected]>
---
drivers/dax/super.c | 3 ++-
fs/dax.c | 7 ++++++-
2 files changed, 8 insertions(+), 2 deletions(-)
diff --git a/drivers/dax/super.c b/drivers/dax/super.c
index 6e928f37d084..74f3bf7ae822 100644
--- a/drivers/dax/super.c
+++ b/drivers/dax/super.c
@@ -52,7 +52,8 @@ EXPORT_SYMBOL_GPL(dax_read_unlock);
int bdev_dax_pgoff(struct block_device *bdev, sector_t sector, size_t size,
pgoff_t *pgoff)
{
- phys_addr_t phys_off = (get_start_sect(bdev) + sector) * 512;
+ sector_t start_sect = bdev ? get_start_sect(bdev) : 0;
+ phys_addr_t phys_off = (start_sect + sector) * 512;
if (pgoff)
*pgoff = PHYS_PFN(phys_off);
diff --git a/fs/dax.c b/fs/dax.c
index 9bcce89ea18e..6431c3aba182 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -1021,7 +1021,12 @@ static vm_fault_t dax_load_hole(struct xa_state *xas,
static bool dax_range_is_aligned(struct block_device *bdev,
unsigned int offset, unsigned int length)
{
- unsigned short sector_size = bdev_logical_block_size(bdev);
+ unsigned short sector_size;
+
+ if (!bdev)
+ return false;
+
+ sector_size = bdev_logical_block_size(bdev);
if (!IS_ALIGNED(offset, sector_size))
return false;
--
2.13.6
From: Stefan Hajnoczi <[email protected]>
virtio-fs will need unique IDs for FORGET requests from outside
fs/fuse/dev.c. Make the symbol visible.
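The allocator being exported is tiny: it bumps a per-queue counter by FUSE_REQ_ID_STEP (2 in this tree, so the low bit of every request ID stays clear). A stand-alone sketch:

```c
#include <assert.h>

/* Stand-alone model of fuse_get_unique(): the counter advances by
 * FUSE_REQ_ID_STEP so request IDs keep their low bit free. */
#define FUSE_REQ_ID_STEP 2ULL

static unsigned long long get_unique(unsigned long long *reqctr)
{
	*reqctr += FUSE_REQ_ID_STEP;
	return *reqctr;
}
```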
Signed-off-by: Stefan Hajnoczi <[email protected]>
---
fs/fuse/dev.c | 3 ++-
fs/fuse/fuse_i.h | 5 +++++
2 files changed, 7 insertions(+), 1 deletion(-)
diff --git a/fs/fuse/dev.c b/fs/fuse/dev.c
index b26ee5ed8974..f35c4ab2dcbb 100644
--- a/fs/fuse/dev.c
+++ b/fs/fuse/dev.c
@@ -360,11 +360,12 @@ unsigned fuse_len_args(unsigned numargs, struct fuse_arg *args)
}
EXPORT_SYMBOL_GPL(fuse_len_args);
-static u64 fuse_get_unique(struct fuse_iqueue *fiq)
+u64 fuse_get_unique(struct fuse_iqueue *fiq)
{
fiq->reqctr += FUSE_REQ_ID_STEP;
return fiq->reqctr;
}
+EXPORT_SYMBOL_GPL(fuse_get_unique);
static unsigned int fuse_req_hash(u64 unique)
{
diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
index 3a91aa970566..f463586f2c9e 100644
--- a/fs/fuse/fuse_i.h
+++ b/fs/fuse/fuse_i.h
@@ -1172,4 +1172,9 @@ int fuse_readdir(struct file *file, struct dir_context *ctx);
*/
unsigned fuse_len_args(unsigned numargs, struct fuse_arg *args);
+/**
+ * Get the next unique ID for a request
+ */
+u64 fuse_get_unique(struct fuse_iqueue *fiq);
+
#endif /* _FS_FUSE_I_H */
--
2.13.6
Introduce two new fuse commands to setup/remove memory mappings.
Signed-off-by: Vivek Goyal <[email protected]>
---
include/uapi/linux/fuse.h | 33 +++++++++++++++++++++++++++++++++
1 file changed, 33 insertions(+)
diff --git a/include/uapi/linux/fuse.h b/include/uapi/linux/fuse.h
index b4967d48bfda..867fdafc4a5e 100644
--- a/include/uapi/linux/fuse.h
+++ b/include/uapi/linux/fuse.h
@@ -394,6 +394,8 @@ enum fuse_opcode {
FUSE_RENAME2 = 45,
FUSE_LSEEK = 46,
FUSE_COPY_FILE_RANGE = 47,
+ FUSE_SETUPMAPPING = 48,
+ FUSE_REMOVEMAPPING = 49,
/* CUSE specific operations */
CUSE_INIT = 4096,
@@ -817,4 +819,35 @@ struct fuse_copy_file_range_in {
uint64_t flags;
};
+#define FUSE_SETUPMAPPING_ENTRIES 8
+#define FUSE_SETUPMAPPING_FLAG_WRITE (1ull << 0)
+struct fuse_setupmapping_in {
+ /* An already open handle */
+ uint64_t fh;
+ /* Offset into the file to start the mapping */
+ uint64_t foffset;
+ /* Length of mapping required */
+ uint64_t len;
+ /* Flags, FUSE_SETUPMAPPING_FLAG_* */
+ uint64_t flags;
+ /* Offset in Memory Window */
+ uint64_t moffset;
+};
+
+struct fuse_setupmapping_out {
+ /* Offsets into the cache of mappings */
+ uint64_t coffset[FUSE_SETUPMAPPING_ENTRIES];
+ /* Lengths of each mapping */
+ uint64_t len[FUSE_SETUPMAPPING_ENTRIES];
+};
+
+struct fuse_removemapping_in {
+ /* An already open handle */
+ uint64_t fh;
+ /* Offset into the DAX window at which to start the unmapping */
+ uint64_t moffset;
+ /* Length of the mapping to remove */
+ uint64_t len;
+};
+
#endif /* _LINUX_FUSE_H */
--
2.13.6
Add logic to free up a busy memory range. Freed memory ranges are returned
to the free pool. Add a worker which can be started to select and free some
busy memory ranges.
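The selection policy in fuse_dax_free_memory() is simply "take the first busy range whose inode can still be pinned". A toy model of that scan (names and the inode_alive flag are hypothetical; the flag stands in for igrab() succeeding):

```c
#include <assert.h>
#include <stddef.h>

/* Toy model of the busy-range scan in fuse_dax_free_memory(): walk the
 * busy list in order and pick the first entry whose inode can still be
 * pinned; entries whose inode is going away are skipped, since teardown
 * frees their ranges anyway. */
struct busy_range {
	int inode_alive;		/* models igrab() succeeding */
	unsigned long long start;
};

static struct busy_range *pick_reclaim_victim(struct busy_range *r, int n)
{
	for (int i = 0; i < n; i++)
		if (r[i].inode_alive)
			return &r[i];
	return NULL; /* nothing reclaimable */
}
```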
Signed-off-by: Vivek Goyal <[email protected]>
---
fs/fuse/file.c | 148 ++++++++++++++++++++++++++++++++++++++++++++++++++++++-
fs/fuse/fuse_i.h | 10 ++++
fs/fuse/inode.c | 2 +
3 files changed, 159 insertions(+), 1 deletion(-)
diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index 73068289f62e..17becdff3014 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -272,7 +272,15 @@ static int fuse_setup_one_mapping(struct inode *inode,
pr_debug("fuse_setup_one_mapping() succeeded. offset=0x%llx err=%zd\n", offset, err);
- /* TODO: What locking is required here. For now, using fc->lock */
+ /*
+ * We don't take a reference on the inode. The inode is valid right now
+ * and, when the inode goes away, the cleanup logic should first clean
+ * up the dmap entries.
+ *
+ * TODO: Do we need to ensure that we are holding the inode lock
+ * as well?
+ */
+ dmap->inode = inode;
dmap->start = offset;
dmap->end = offset + FUSE_DAX_MEM_RANGE_SZ - 1;
/* Protected by fi->i_dmap_sem */
@@ -347,6 +355,8 @@ void fuse_removemapping(struct inode *inode)
continue;
}
+ dmap->inode = NULL;
+
/* Add it back to free ranges list */
free_dax_mapping(fc, dmap);
}
@@ -3694,3 +3704,139 @@ void fuse_init_file_inode(struct inode *inode)
inode->i_data.a_ops = &fuse_dax_file_aops;
}
}
+
+int fuse_dax_free_one_mapping_locked(struct fuse_conn *fc, struct inode *inode,
+ u64 dmap_start)
+{
+ int ret;
+ struct fuse_inode *fi = get_fuse_inode(inode);
+ struct fuse_dax_mapping *dmap;
+
+ WARN_ON(!inode_is_locked(inode));
+
+ /* Find the fuse dax mapping at file offset dmap_start. */
+ dmap = fuse_dax_interval_tree_iter_first(&fi->dmap_tree, dmap_start,
+ dmap_start);
+
+ /* Range already got cleaned up by somebody else */
+ if (!dmap)
+ return 0;
+
+ ret = filemap_fdatawrite_range(inode->i_mapping, dmap->start, dmap->end);
+ if (ret) {
+ printk("filemap_fdatawrite_range() failed. err=%d start=0x%llx,"
+ " end=0x%llx\n", ret, dmap->start, dmap->end);
+ return ret;
+ }
+
+ ret = invalidate_inode_pages2_range(inode->i_mapping,
+ dmap->start >> PAGE_SHIFT,
+ dmap->end >> PAGE_SHIFT);
+ /* TODO: What to do if above fails? For now,
+ * leave the range in place.
+ */
+ if (ret) {
+ printk("invalidate_inode_pages2_range() failed err=%d\n", ret);
+ return ret;
+ }
+
+ /* Remove dax mapping from inode interval tree now */
+ fuse_dax_interval_tree_remove(dmap, &fi->dmap_tree);
+ fi->nr_dmaps--;
+
+ /* Cleanup dmap entry and add back to free list */
+ spin_lock(&fc->lock);
+ list_del_init(&dmap->busy_list);
+ WARN_ON(fc->nr_busy_ranges == 0);
+ fc->nr_busy_ranges--;
+ dmap->inode = NULL;
+ dmap->start = dmap->end = 0;
+ __free_dax_mapping(fc, dmap);
+ spin_unlock(&fc->lock);
+
+ pr_debug("fuse: freed memory range window_offset=0x%llx,"
+ " length=0x%llx\n", dmap->window_offset,
+ dmap->length);
+
+ return ret;
+}
+
+/*
+ * Free a range of memory.
+ * Locking:
+ * 1. Take inode->i_rwsem to prevent further read/write.
+ * 2. Take fuse_inode->i_mmap_sem to block dax faults.
+ * 3. Take fuse_inode->i_dmap_sem to protect the interval tree. It might not
+ *    be strictly necessary as locks 1 and 2 seem sufficient.
+ */
+int fuse_dax_free_one_mapping(struct fuse_conn *fc, struct inode *inode,
+ u64 dmap_start)
+{
+ int ret;
+ struct fuse_inode *fi = get_fuse_inode(inode);
+
+ inode_lock(inode);
+ down_write(&fi->i_mmap_sem);
+ down_write(&fi->i_dmap_sem);
+ ret = fuse_dax_free_one_mapping_locked(fc, inode, dmap_start);
+ up_write(&fi->i_dmap_sem);
+ up_write(&fi->i_mmap_sem);
+ inode_unlock(inode);
+ return ret;
+}
+
+int fuse_dax_free_memory(struct fuse_conn *fc, unsigned long nr_to_free)
+{
+ struct fuse_dax_mapping *dmap, *pos;
+ int ret, i;
+ u64 dmap_start = 0, window_offset = 0;
+ struct inode *inode = NULL;
+
+ /* Pick the first busy range and free it for now */
+ for (i = 0; i < nr_to_free; i++) {
+ dmap = NULL;
+ spin_lock(&fc->lock);
+
+ list_for_each_entry(pos, &fc->busy_ranges, busy_list) {
+ dmap = pos;
+ inode = igrab(dmap->inode);
+ /*
+ * This inode is going away. That will free
+ * up all the ranges anyway, continue to
+ * next range.
+ */
+ if (!inode)
+ continue;
+ dmap_start = dmap->start;
+ window_offset = dmap->window_offset;
+ break;
+ }
+ spin_unlock(&fc->lock);
+ if (!dmap)
+ return 0;
+
+ ret = fuse_dax_free_one_mapping(fc, inode, dmap_start);
+ iput(inode);
+ if (ret) {
+ printk("%s(window_offset=0x%llx) failed. err=%d\n",
+ __func__, window_offset, ret);
+ return ret;
+ }
+ }
+ return 0;
+}
+
+/* TODO: This probably should go in inode.c */
+void fuse_dax_free_mem_worker(struct work_struct *work)
+{
+ int ret;
+ struct fuse_conn *fc = container_of(work, struct fuse_conn,
+ dax_free_work.work);
+ pr_debug("fuse: Worker to free memory called. nr_free_ranges=%lu"
+ " nr_busy_ranges=%lu\n", fc->nr_free_ranges,
+ fc->nr_busy_ranges);
+ ret = fuse_dax_free_memory(fc, FUSE_DAX_RECLAIM_CHUNK);
+ if (ret)
+ pr_debug("fuse: fuse_dax_free_memory() failed with err=%d\n", ret);
+}
diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
index 280f717deb57..383deaf0ecf1 100644
--- a/fs/fuse/fuse_i.h
+++ b/fs/fuse/fuse_i.h
@@ -50,6 +50,9 @@
#define FUSE_DAX_MEM_RANGE_SZ (2*1024*1024)
#define FUSE_DAX_MEM_RANGE_PAGES (FUSE_DAX_MEM_RANGE_SZ/PAGE_SIZE)
+/* Number of ranges reclaimer will try to free in one invocation */
+#define FUSE_DAX_RECLAIM_CHUNK (10)
+
/** List of active connections */
extern struct list_head fuse_conn_list;
@@ -102,6 +105,9 @@ struct fuse_forget_link {
/** Translation information for file offsets to DAX window offsets */
struct fuse_dax_mapping {
+ /* Pointer to inode where this memory range is mapped */
+ struct inode *inode;
+
/* Will connect in fc->free_ranges to keep track of free memory */
struct list_head list;
@@ -870,6 +876,9 @@ struct fuse_conn {
unsigned long nr_busy_ranges;
struct list_head busy_ranges;
+ /* Worker to free up memory ranges */
+ struct delayed_work dax_free_work;
+
/*
* DAX Window Free Ranges. TODO: This might not be best place to store
* this free list
@@ -1244,6 +1253,7 @@ unsigned fuse_len_args(unsigned numargs, struct fuse_arg *args);
* Get the next unique ID for a request
*/
u64 fuse_get_unique(struct fuse_iqueue *fiq);
+void fuse_dax_free_mem_worker(struct work_struct *work);
void fuse_removemapping(struct inode *inode);
#endif /* _FS_FUSE_I_H */
diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
index 59fc5a7a18fc..44f7bc44e319 100644
--- a/fs/fuse/inode.c
+++ b/fs/fuse/inode.c
@@ -713,6 +713,7 @@ void fuse_conn_init(struct fuse_conn *fc, struct user_namespace *user_ns,
fc->user_ns = get_user_ns(user_ns);
INIT_LIST_HEAD(&fc->free_ranges);
INIT_LIST_HEAD(&fc->busy_ranges);
+ INIT_DELAYED_WORK(&fc->dax_free_work, fuse_dax_free_mem_worker);
}
EXPORT_SYMBOL_GPL(fuse_conn_init);
@@ -721,6 +722,7 @@ void fuse_conn_put(struct fuse_conn *fc)
if (refcount_dec_and_test(&fc->count)) {
if (fc->destroy_req)
fuse_request_free(fc->destroy_req);
+ flush_delayed_work(&fc->dax_free_work);
if (fc->dax_dev)
fuse_free_dax_mem_ranges(&fc->free_ranges);
put_pid_ns(fc->pid_ns);
--
2.13.6
From: "Dr. David Alan Gilbert" <[email protected]>
Instead of assuming a fixed BAR for the cache, use the value from the
capabilities.
Use the other capabilities to map their memory as well.
Signed-off-by: Dr. David Alan Gilbert <[email protected]>
---
fs/fuse/virtio_fs.c | 95 +++++++++++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 95 insertions(+)
diff --git a/fs/fuse/virtio_fs.c b/fs/fuse/virtio_fs.c
index c18f406b61cd..7d5b23455639 100644
--- a/fs/fuse/virtio_fs.c
+++ b/fs/fuse/virtio_fs.c
@@ -53,6 +53,16 @@ struct virtio_fs {
void *window_kaddr;
phys_addr_t window_phys_addr;
size_t window_len;
+
+ /* Version table where version numbers can be read */
+ void *vertab_kaddr;
+ phys_addr_t vertab_phys_addr;
+ size_t vertab_len;
+
+ /* Journal */
+ void *journal_kaddr;
+ phys_addr_t journal_phys_addr;
+ size_t journal_len;
};
struct virtio_fs_forget {
@@ -684,6 +694,17 @@ static int virtio_fs_setup_dax(struct virtio_device *vdev, struct virtio_fs *fs)
}
phys_addr += cache_offset;
+
/* Ideally we would directly use the PCI BAR resource but
* devm_memremap_pages() wants its own copy in pgmap. So
* initialize a struct resource from scratch (only the start
@@ -710,6 +731,80 @@ static int virtio_fs_setup_dax(struct virtio_device *vdev, struct virtio_fs *fs)
dev_dbg(&vdev->dev, "%s: cache kaddr 0x%px phys_addr 0x%llx len %llx\n",
__func__, fs->window_kaddr, phys_addr, cache_len);
+ /*
+ * The journal and version table should be easier since DAX doesn't
+ * need them
+ */
+ if (have_journal) {
+ if (journal_bar != cache_bar) {
+ ret = pci_request_region(pci_dev, journal_bar,
+ "virtio-fs-journal");
+ if (ret < 0) {
+ dev_err(&vdev->dev,
+ "%s: failed to request journal BAR\n",
+ __func__);
+ return ret;
+ }
+ }
+
+ phys_addr = pci_resource_start(pci_dev, journal_bar);
+ bar_len = pci_resource_len(pci_dev, journal_bar);
+
+ if (journal_offset + journal_len > bar_len) {
+ dev_err(&vdev->dev,
+ "%s: journal bar shorter than cap offset+len\n",
+ __func__);
+ return -EINVAL;
+ }
+ fs->journal_phys_addr = phys_addr + journal_offset;
+ fs->journal_len = journal_len;
+
+ fs->journal_kaddr = devm_memremap(&pci_dev->dev,
+ fs->journal_phys_addr,
+ journal_len, MEMREMAP_WB);
+ if (!fs->journal_kaddr) {
+ dev_err(&vdev->dev, "%s: failed to remap journal\n",
+ __func__);
+ return -ENOMEM;
+ }
+ dev_notice(&vdev->dev, "%s: journal at %px\n", __func__,
+ fs->journal_kaddr);
+ }
+
+ if (have_vertab) {
+ if (vertab_bar != cache_bar &&
+ vertab_bar != journal_bar) {
+ ret = pci_request_region(pci_dev, vertab_bar,
+ "virtio-fs-vertab");
+ if (ret < 0) {
+ dev_err(&vdev->dev, "%s: failed to request"
+ " vertab BAR\n", __func__);
+ return ret;
+ }
+ }
+
+ phys_addr = pci_resource_start(pci_dev, vertab_bar);
+ bar_len = pci_resource_len(pci_dev, vertab_bar);
+
+ if (vertab_offset + vertab_len > bar_len) {
+ dev_err(&vdev->dev, "%s: version tab bar shorter than"
+ " cap offset+len\n", __func__);
+ return -EINVAL;
+ }
+ fs->vertab_phys_addr = phys_addr + vertab_offset;
+ fs->vertab_len = vertab_len;
+ fs->vertab_kaddr = devm_memremap(&pci_dev->dev,
+ fs->vertab_phys_addr,
+ vertab_len, MEMREMAP_WB);
+ if (!fs->vertab_kaddr) {
+ dev_err(&vdev->dev, "%s: failed to remap version"
+ " table\n", __func__);
+ return -ENOMEM;
+ }
+ dev_notice(&vdev->dev, "%s: version table at %px\n",
+ __func__, fs->vertab_kaddr);
+ }
+
fs->dax_dev = alloc_dax(fs, NULL, &virtio_fs_dax_ops);
if (!fs->dax_dev)
return -ENOMEM;
--
2.13.6
From: Miklos Szeredi <[email protected]>
Fixes: f064cab7f6ee ("fuse: add shared version support (virtio-fs only)")
Signed-off-by: Miklos Szeredi <[email protected]>
---
fs/fuse/dir.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/fs/fuse/dir.c b/fs/fuse/dir.c
index f9a91e782cf0..f1da787796e8 100644
--- a/fs/fuse/dir.c
+++ b/fs/fuse/dir.c
@@ -1324,7 +1324,7 @@ static int fuse_permission(struct inode *inode, int mask)
if (fc->default_permissions) {
err = -EACCES;
- if (!refreshed && !fuse_shared_version_mismatch(inode))
+ if (refreshed || !fuse_shared_version_mismatch(inode))
err = generic_permission(inode, mask);
/* If permission is denied, try to refresh file
--
2.13.6
Truncate the number of pages mapped by direct_access() so it stays within the
window size. Users might request mapping pages beyond the window size.
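The clamp is a one-liner. A sketch assuming a 4 KiB page size (the helper name is hypothetical, and like the patch it does not handle a pgoff past the end of the window):

```c
#include <assert.h>

/* Model of the direct_access() clamp: never report more pages than
 * remain between pgoff and the end of the DAX window. Assumes 4 KiB
 * pages; pgoff beyond the window is not handled. */
static long clamp_nr_pages(unsigned long long window_len,
			   unsigned long pgoff, long nr_pages)
{
	long max_nr_pages = (long)(window_len / 4096 - pgoff);

	return nr_pages > max_nr_pages ? max_nr_pages : nr_pages;
}
```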
Signed-off-by: Vivek Goyal <[email protected]>
---
fs/fuse/virtio_fs.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/fs/fuse/virtio_fs.c b/fs/fuse/virtio_fs.c
index e4d5e0cd41ba..ef1469b38a6d 100644
--- a/fs/fuse/virtio_fs.c
+++ b/fs/fuse/virtio_fs.c
@@ -449,13 +449,14 @@ static long virtio_fs_direct_access(struct dax_device *dax_dev, pgoff_t pgoff,
{
struct virtio_fs *fs = dax_get_private(dax_dev);
phys_addr_t offset = PFN_PHYS(pgoff);
+ size_t max_nr_pages = fs->window_len/PAGE_SIZE - pgoff;
if (kaddr)
*kaddr = fs->window_kaddr + offset;
if (pfn)
*pfn = phys_to_pfn_t(fs->window_phys_addr + offset,
PFN_DEV | PFN_MAP);
- return nr_pages;
+ return nr_pages > max_nr_pages ? max_nr_pages : nr_pages;
}
static size_t virtio_fs_copy_from_iter(struct dax_device *dax_dev,
--
2.13.6
From: "Dr. David Alan Gilbert" <[email protected]>
Add a 'dax' option and only enable dax when it's on.
Also show "dax" in the mount options if the filesystem was mounted with dax
enabled.
Signed-off-by: Dr. David Alan Gilbert <[email protected]>
Signed-off-by: Vivek Goyal <[email protected]>
---
fs/fuse/fuse_i.h | 1 +
fs/fuse/inode.c | 8 ++++++++
fs/fuse/virtio_fs.c | 2 +-
3 files changed, 10 insertions(+), 1 deletion(-)
diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
index b5a6a12e67d6..345abe9b022f 100644
--- a/fs/fuse/fuse_i.h
+++ b/fs/fuse/fuse_i.h
@@ -70,6 +70,7 @@ struct fuse_mount_data {
unsigned group_id_present:1;
unsigned default_permissions:1;
unsigned allow_other:1;
+ unsigned dax:1;
unsigned max_read;
unsigned blksize;
};
diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
index 10e4a39318c4..d2afce377fd4 100644
--- a/fs/fuse/inode.c
+++ b/fs/fuse/inode.c
@@ -439,6 +439,7 @@ enum {
OPT_ALLOW_OTHER,
OPT_MAX_READ,
OPT_BLKSIZE,
+ OPT_DAX,
OPT_ERR
};
@@ -452,6 +453,7 @@ static const match_table_t tokens = {
{OPT_ALLOW_OTHER, "allow_other"},
{OPT_MAX_READ, "max_read=%u"},
{OPT_BLKSIZE, "blksize=%u"},
+ {OPT_DAX, "dax"},
{OPT_ERR, NULL}
};
@@ -543,6 +545,10 @@ int parse_fuse_opt(char *opt, struct fuse_mount_data *d, int is_bdev,
d->blksize = value;
break;
+ case OPT_DAX:
+ d->dax = 1;
+ break;
+
default:
return 0;
}
@@ -571,6 +577,8 @@ static int fuse_show_options(struct seq_file *m, struct dentry *root)
seq_printf(m, ",max_read=%u", fc->max_read);
if (sb->s_bdev && sb->s_blocksize != FUSE_DEFAULT_BLKSIZE)
seq_printf(m, ",blksize=%lu", sb->s_blocksize);
+ if (fc->dax_dev)
+ seq_printf(m, ",dax");
return 0;
}
diff --git a/fs/fuse/virtio_fs.c b/fs/fuse/virtio_fs.c
index 55bac1465536..e4d5e0cd41ba 100644
--- a/fs/fuse/virtio_fs.c
+++ b/fs/fuse/virtio_fs.c
@@ -1050,7 +1050,7 @@ static int virtio_fs_fill_super(struct super_block *sb, void *data,
/* TODO this sends FUSE_INIT and could cause hiprio or notifications
* virtqueue races since they haven't been set up yet!
*/
- err = fuse_fill_super_common(sb, &d, fs->dax_dev,
+ err = fuse_fill_super_common(sb, &d, d.dax ? fs->dax_dev : NULL,
&virtio_fs_fiq_ops, fs,
(void **)&fs->vqs[2].fud);
if (err < 0)
--
2.13.6
This list will be used to select a fuse_dax_mapping to free when the
number of free mappings drops below a threshold.
Signed-off-by: Vivek Goyal <[email protected]>
---
fs/fuse/file.c | 8 ++++++++
fs/fuse/fuse_i.h | 7 +++++++
fs/fuse/inode.c | 4 ++++
3 files changed, 19 insertions(+)
diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index 449a6b315327..94ad76382a6f 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -275,6 +275,10 @@ static int fuse_setup_one_mapping(struct inode *inode,
/* Protected by fi->i_dmap_sem */
fuse_dax_interval_tree_insert(dmap, &fi->dmap_tree);
fi->nr_dmaps++;
+ spin_lock(&fc->lock);
+ list_add_tail(&dmap->busy_list, &fc->busy_ranges);
+ fc->nr_busy_ranges++;
+ spin_unlock(&fc->lock);
return 0;
}
@@ -322,6 +326,10 @@ void fuse_removemapping(struct inode *inode)
if (dmap) {
fuse_dax_interval_tree_remove(dmap, &fi->dmap_tree);
fi->nr_dmaps--;
+ spin_lock(&fc->lock);
+ list_del_init(&dmap->busy_list);
+ fc->nr_busy_ranges--;
+ spin_unlock(&fc->lock);
}
if (!dmap)
diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
index 3b17fb336256..e32b0059493b 100644
--- a/fs/fuse/fuse_i.h
+++ b/fs/fuse/fuse_i.h
@@ -113,6 +113,9 @@ struct fuse_dax_mapping {
__u64 end;
__u64 __subtree_last;
+ /* Will be linked into fc->busy_ranges to keep track of busy memory */
+ struct list_head busy_list;
+
/** Position in DAX window */
u64 window_offset;
@@ -856,6 +859,10 @@ struct fuse_conn {
/** DAX device, non-NULL if DAX is supported */
struct dax_device *dax_dev;
+ /* List of memory ranges which are busy */
+ unsigned long nr_busy_ranges;
+ struct list_head busy_ranges;
+
/*
* DAX Window Free Ranges. TODO: This might not be best place to store
* this free list
diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
index 56310d10cd4c..234b9c0c80ab 100644
--- a/fs/fuse/inode.c
+++ b/fs/fuse/inode.c
@@ -622,6 +622,8 @@ static void fuse_free_dax_mem_ranges(struct list_head *mem_list)
/* Free All allocated elements */
list_for_each_entry_safe(range, temp, mem_list, list) {
list_del(&range->list);
+ if (!list_empty(&range->busy_list))
+ list_del(&range->busy_list);
kfree(range);
}
}
@@ -666,6 +668,7 @@ static int fuse_dax_mem_range_init(struct fuse_conn *fc,
range->window_offset = i * FUSE_DAX_MEM_RANGE_SZ;
range->length = FUSE_DAX_MEM_RANGE_SZ;
list_add_tail(&range->list, &mem_ranges);
+ INIT_LIST_HEAD(&range->busy_list);
allocated_ranges++;
}
@@ -708,6 +711,7 @@ void fuse_conn_init(struct fuse_conn *fc, struct user_namespace *user_ns,
fc->dax_dev = dax_dev;
fc->user_ns = get_user_ns(user_ns);
INIT_LIST_HEAD(&fc->free_ranges);
+ INIT_LIST_HEAD(&fc->busy_ranges);
}
EXPORT_SYMBOL_GPL(fuse_conn_init);
--
2.13.6
From: Stefan Hajnoczi <[email protected]>
fuse_fill_super() includes code to process the fd= option and link the
struct fuse_dev to the fd's struct file. In virtio-fs there is no file
descriptor because /dev/fuse is not used.
This patch extracts fuse_fill_super_common() so that both classic fuse
and virtio-fs can share the code to initialize a mount.
parse_fuse_opt() is also extracted so that the fuse_fill_super_common()
caller has access to the mount options. This allows classic fuse to
handle the fd= option outside fuse_fill_super_common().
Signed-off-by: Stefan Hajnoczi <[email protected]>
---
fs/fuse/fuse_i.h | 32 +++++++++++++++++
fs/fuse/inode.c | 102 +++++++++++++++++++++++++++----------------------------
2 files changed, 83 insertions(+), 51 deletions(-)
diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
index e9f712e81c7d..9b5b8b194f77 100644
--- a/fs/fuse/fuse_i.h
+++ b/fs/fuse/fuse_i.h
@@ -56,6 +56,22 @@ extern struct mutex fuse_mutex;
extern unsigned max_user_bgreq;
extern unsigned max_user_congthresh;
+/** Mount options */
+struct fuse_mount_data {
+ int fd;
+ unsigned rootmode;
+ kuid_t user_id;
+ kgid_t group_id;
+ unsigned fd_present:1;
+ unsigned rootmode_present:1;
+ unsigned user_id_present:1;
+ unsigned group_id_present:1;
+ unsigned default_permissions:1;
+ unsigned allow_other:1;
+ unsigned max_read;
+ unsigned blksize;
+};
+
/* One forget request */
struct fuse_forget_link {
struct fuse_forget_one forget_one;
@@ -970,6 +986,22 @@ struct fuse_dev *fuse_dev_alloc(struct fuse_conn *fc);
void fuse_dev_free(struct fuse_dev *fud);
/**
+ * Parse a mount options string
+ */
+int parse_fuse_opt(char *opt, struct fuse_mount_data *d, int is_bdev,
+ struct user_namespace *user_ns);
+
+/**
+ * Fill in superblock and initialize fuse connection
+ * @sb: partially-initialized superblock to fill in
+ * @mount_data: mount parameters
+ * @fudptr: fuse_dev pointer to fill in, should contain NULL on entry
+ */
+int fuse_fill_super_common(struct super_block *sb,
+ struct fuse_mount_data *mount_data,
+ void **fudptr);
+
+/**
* Add connection to control filesystem
*/
int fuse_ctl_add_conn(struct fuse_conn *fc);
diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
index d08cd8bf7705..f13133f0ebd1 100644
--- a/fs/fuse/inode.c
+++ b/fs/fuse/inode.c
@@ -59,21 +59,6 @@ MODULE_PARM_DESC(max_user_congthresh,
/** Congestion starts at 75% of maximum */
#define FUSE_DEFAULT_CONGESTION_THRESHOLD (FUSE_DEFAULT_MAX_BACKGROUND * 3 / 4)
-struct fuse_mount_data {
- int fd;
- unsigned rootmode;
- kuid_t user_id;
- kgid_t group_id;
- unsigned fd_present:1;
- unsigned rootmode_present:1;
- unsigned user_id_present:1;
- unsigned group_id_present:1;
- unsigned default_permissions:1;
- unsigned allow_other:1;
- unsigned max_read;
- unsigned blksize;
-};
-
struct fuse_forget_link *fuse_alloc_forget(void)
{
return kzalloc(sizeof(struct fuse_forget_link), GFP_KERNEL);
@@ -479,7 +464,7 @@ static int fuse_match_uint(substring_t *s, unsigned int *res)
return err;
}
-static int parse_fuse_opt(char *opt, struct fuse_mount_data *d, int is_bdev,
+int parse_fuse_opt(char *opt, struct fuse_mount_data *d, int is_bdev,
struct user_namespace *user_ns)
{
char *p;
@@ -556,12 +541,13 @@ static int parse_fuse_opt(char *opt, struct fuse_mount_data *d, int is_bdev,
}
}
- if (!d->fd_present || !d->rootmode_present ||
- !d->user_id_present || !d->group_id_present)
+ if (!d->rootmode_present || !d->user_id_present ||
+ !d->group_id_present)
return 0;
return 1;
}
+EXPORT_SYMBOL_GPL(parse_fuse_opt);
static int fuse_show_options(struct seq_file *m, struct dentry *root)
{
@@ -1072,13 +1058,13 @@ void fuse_dev_free(struct fuse_dev *fud)
}
EXPORT_SYMBOL_GPL(fuse_dev_free);
-static int fuse_fill_super(struct super_block *sb, void *data, int silent)
+int fuse_fill_super_common(struct super_block *sb,
+ struct fuse_mount_data *mount_data,
+ void **fudptr)
{
struct fuse_dev *fud;
struct fuse_conn *fc;
struct inode *root;
- struct fuse_mount_data d;
- struct file *file;
struct dentry *root_dentry;
struct fuse_req *init_req;
int err;
@@ -1090,13 +1076,10 @@ static int fuse_fill_super(struct super_block *sb, void *data, int silent)
sb->s_flags &= ~(SB_NOSEC | SB_I_VERSION);
- if (!parse_fuse_opt(data, &d, is_bdev, sb->s_user_ns))
- goto err;
-
if (is_bdev) {
#ifdef CONFIG_BLOCK
err = -EINVAL;
- if (!sb_set_blocksize(sb, d.blksize))
+ if (!sb_set_blocksize(sb, mount_data->blksize))
goto err;
#endif
} else {
@@ -1113,19 +1096,6 @@ static int fuse_fill_super(struct super_block *sb, void *data, int silent)
if (sb->s_user_ns != &init_user_ns)
sb->s_iflags |= SB_I_UNTRUSTED_MOUNTER;
- file = fget(d.fd);
- err = -EINVAL;
- if (!file)
- goto err;
-
- /*
- * Require mount to happen from the same user namespace which
- * opened /dev/fuse to prevent potential attacks.
- */
- if (file->f_op != &fuse_dev_operations ||
- file->f_cred->user_ns != sb->s_user_ns)
- goto err_fput;
-
/*
* If we are not in the initial user namespace posix
* acls must be translated.
@@ -1136,7 +1106,7 @@ static int fuse_fill_super(struct super_block *sb, void *data, int silent)
fc = kmalloc(sizeof(*fc), GFP_KERNEL);
err = -ENOMEM;
if (!fc)
- goto err_fput;
+ goto err;
fuse_conn_init(fc, sb->s_user_ns);
fc->release = fuse_free_conn;
@@ -1156,18 +1126,18 @@ static int fuse_fill_super(struct super_block *sb, void *data, int silent)
fc->dont_mask = 1;
sb->s_flags |= SB_POSIXACL;
- fc->default_permissions = d.default_permissions;
- fc->allow_other = d.allow_other;
- fc->user_id = d.user_id;
- fc->group_id = d.group_id;
- fc->max_read = max_t(unsigned, 4096, d.max_read);
+ fc->default_permissions = mount_data->default_permissions;
+ fc->allow_other = mount_data->allow_other;
+ fc->user_id = mount_data->user_id;
+ fc->group_id = mount_data->group_id;
+ fc->max_read = max_t(unsigned, 4096, mount_data->max_read);
fc->max_pages = FUSE_DEFAULT_MAX_PAGES_PER_REQ;
/* Used by get_root_inode() */
sb->s_fs_info = fc;
err = -ENOMEM;
- root = fuse_get_root_inode(sb, d.rootmode);
+ root = fuse_get_root_inode(sb, mount_data->rootmode);
sb->s_d_op = &fuse_root_dentry_operations;
root_dentry = d_make_root(root);
if (!root_dentry)
@@ -1188,7 +1158,7 @@ static int fuse_fill_super(struct super_block *sb, void *data, int silent)
mutex_lock(&fuse_mutex);
err = -EINVAL;
- if (file->private_data)
+ if (*fudptr)
goto err_unlock;
err = fuse_ctl_add_conn(fc);
@@ -1197,13 +1167,12 @@ static int fuse_fill_super(struct super_block *sb, void *data, int silent)
list_add_tail(&fc->entry, &fuse_conn_list);
sb->s_root = root_dentry;
- file->private_data = fud;
+ *fudptr = fud;
/*
* mutex_unlock() provides the necessary memory barrier for
- * file->private_data to be visible on all CPUs after this
+ * *fudptr to be visible on all CPUs after this
*/
mutex_unlock(&fuse_mutex);
- fput(file);
fuse_send_init(fc, init_req);
@@ -1220,11 +1189,42 @@ static int fuse_fill_super(struct super_block *sb, void *data, int silent)
err_put_conn:
fuse_conn_put(fc);
sb->s_fs_info = NULL;
- err_fput:
- fput(file);
err:
return err;
}
+EXPORT_SYMBOL_GPL(fuse_fill_super_common);
+
+static int fuse_fill_super(struct super_block *sb, void *data, int silent)
+{
+ struct fuse_mount_data d;
+ struct file *file;
+ int is_bdev = sb->s_bdev != NULL;
+ int err;
+
+ err = -EINVAL;
+ if (!parse_fuse_opt(data, &d, is_bdev, sb->s_user_ns))
+ goto err;
+ if (!d.fd_present)
+ goto err;
+
+ file = fget(d.fd);
+ if (!file)
+ goto err;
+
+ /*
+ * Require mount to happen from the same user namespace which
+ * opened /dev/fuse to prevent potential attacks.
+ */
+ if ((file->f_op != &fuse_dev_operations) ||
+ (file->f_cred->user_ns != sb->s_user_ns))
+ goto err_fput;
+
+ err = fuse_fill_super_common(sb, &d, &file->private_data);
+err_fput:
+ fput(file);
+err:
+ return err;
+}
static struct dentry *fuse_mount(struct file_system_type *fs_type,
int flags, const char *dev_name,
--
2.13.6
Hi Vivek,
I love your patch! Yet something to improve:
[auto build test ERROR on fuse/for-next]
[also build test ERROR on v4.20-rc6]
[cannot apply to next-20181210]
[if your patch is applied to the wrong git tree, please drop us a note to help improve the system]
url: https://github.com/0day-ci/linux/commits/Vivek-Goyal/virtio-fs-shared-file-system-for-virtual-machines/20181211-103034
base: https://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse.git for-next
config: i386-randconfig-x006-201849 (attached as .config)
compiler: gcc-7 (Debian 7.3.0-1) 7.3.0
reproduce:
# save the attached .config to linux build tree
make ARCH=i386
All errors (new ones prefixed by >>):
fs//ext2/inode.c: In function 'ext2_dax_writepages':
>> fs//ext2/inode.c:959:33: error: passing argument 3 of 'dax_writeback_mapping_range' from incompatible pointer type [-Werror=incompatible-pointer-types]
mapping->host->i_sb->s_bdev, wbc);
^~~
In file included from fs//ext2/inode.c:29:0:
include/linux/dax.h:120:19: note: expected 'struct dax_device *' but argument is of type 'struct writeback_control *'
static inline int dax_writeback_mapping_range(struct address_space *mapping,
^~~~~~~~~~~~~~~~~~~~~~~~~~~
>> fs//ext2/inode.c:958:9: error: too few arguments to function 'dax_writeback_mapping_range'
return dax_writeback_mapping_range(mapping,
^~~~~~~~~~~~~~~~~~~~~~~~~~~
In file included from fs//ext2/inode.c:29:0:
include/linux/dax.h:120:19: note: declared here
static inline int dax_writeback_mapping_range(struct address_space *mapping,
^~~~~~~~~~~~~~~~~~~~~~~~~~~
cc1: some warnings being treated as errors
vim +/dax_writeback_mapping_range +959 fs//ext2/inode.c
7f6d5b52 Ross Zwisler 2016-02-26 954
fb094c90 Dan Williams 2017-12-21 955 static int
fb094c90 Dan Williams 2017-12-21 956 ext2_dax_writepages(struct address_space *mapping, struct writeback_control *wbc)
fb094c90 Dan Williams 2017-12-21 957 {
fb094c90 Dan Williams 2017-12-21 @958 return dax_writeback_mapping_range(mapping,
fb094c90 Dan Williams 2017-12-21 @959 mapping->host->i_sb->s_bdev, wbc);
^1da177e Linus Torvalds 2005-04-16 960 }
^1da177e Linus Torvalds 2005-04-16 961
:::::: The code at line 959 was first introduced by commit
:::::: fb094c90748fbeba1063927eeb751add147b35b9 ext2, dax: introduce ext2_dax_aops
:::::: TO: Dan Williams <[email protected]>
:::::: CC: Dan Williams <[email protected]>
---
0-DAY kernel test infrastructure Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all Intel Corporation
Hi Vivek,
I love your patch! Perhaps something to improve:
[auto build test WARNING on fuse/for-next]
[also build test WARNING on v4.20-rc6]
[cannot apply to next-20181210]
[if your patch is applied to the wrong git tree, please drop us a note to help improve the system]
url: https://github.com/0day-ci/linux/commits/Vivek-Goyal/virtio-fs-shared-file-system-for-virtual-machines/20181211-103034
base: https://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse.git for-next
config: i386-randconfig-x005-201849 (attached as .config)
compiler: gcc-7 (Debian 7.3.0-1) 7.3.0
reproduce:
# save the attached .config to linux build tree
make ARCH=i386
All warnings (new ones prefixed by >>):
fs/fuse/file.c: In function 'fuse_dax_write_iter':
>> fs/fuse/file.c:1834:47: warning: format '%lx' expects argument of type 'long unsigned int', but argument 3 has type 'size_t {aka unsigned int}' [-Wformat=]
printk("fallocate(offset=0x%llx length=0x%lx)"
~~^
%x
fs/fuse/file.c:1836:4:
iov_iter_count(from), ret);
~~~~~~~~~~~~~~~~~~~~
>> fs/fuse/file.c:1834:11: warning: format '%ld' expects argument of type 'long int', but argument 4 has type 'ssize_t {aka int}' [-Wformat=]
printk("fallocate(offset=0x%llx length=0x%lx)"
^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
fs/fuse/file.c:1835:20: note: format string is defined here
" failed. err=%ld\n", iocb->ki_pos,
~~^
%d
In file included from include/linux/kernel.h:14:0,
from include/linux/list.h:9,
from include/linux/wait.h:7,
from include/linux/wait_bit.h:8,
from include/linux/fs.h:6,
from fs/fuse/fuse_i.h:13,
from fs/fuse/file.c:9:
fs/fuse/file.c:1839:12: warning: format '%lx' expects argument of type 'long unsigned int', but argument 4 has type 'size_t {aka unsigned int}' [-Wformat=]
pr_debug("fallocate(offset=0x%llx length=0x%lx)"
^
include/linux/printk.h:292:21: note: in definition of macro 'pr_fmt'
#define pr_fmt(fmt) fmt
^~~
include/linux/printk.h:340:2: note: in expansion of macro 'dynamic_pr_debug'
dynamic_pr_debug(fmt, ##__VA_ARGS__)
^~~~~~~~~~~~~~~~
>> fs/fuse/file.c:1839:3: note: in expansion of macro 'pr_debug'
pr_debug("fallocate(offset=0x%llx length=0x%lx)"
^~~~~~~~
fs/fuse/file.c:1839:48: note: format string is defined here
pr_debug("fallocate(offset=0x%llx length=0x%lx)"
~~^
%x
In file included from include/linux/kernel.h:14:0,
from include/linux/list.h:9,
from include/linux/wait.h:7,
from include/linux/wait_bit.h:8,
from include/linux/fs.h:6,
from fs/fuse/fuse_i.h:13,
from fs/fuse/file.c:9:
fs/fuse/file.c:1839:12: warning: format '%ld' expects argument of type 'long int', but argument 5 has type 'ssize_t {aka int}' [-Wformat=]
pr_debug("fallocate(offset=0x%llx length=0x%lx)"
^
include/linux/printk.h:292:21: note: in definition of macro 'pr_fmt'
#define pr_fmt(fmt) fmt
^~~
include/linux/printk.h:340:2: note: in expansion of macro 'dynamic_pr_debug'
dynamic_pr_debug(fmt, ##__VA_ARGS__)
^~~~~~~~~~~~~~~~
>> fs/fuse/file.c:1839:3: note: in expansion of macro 'pr_debug'
pr_debug("fallocate(offset=0x%llx length=0x%lx)"
^~~~~~~~
fs/fuse/file.c:1840:20: note: format string is defined here
" succeed. ret=%ld\n", iocb->ki_pos, iov_iter_count(from), ret);
~~^
%d
vim +1834 fs/fuse/file.c
1803
1804 static ssize_t fuse_dax_write_iter(struct kiocb *iocb, struct iov_iter *from)
1805 {
1806 struct inode *inode = file_inode(iocb->ki_filp);
1807 ssize_t ret;
1808
1809 if (iocb->ki_flags & IOCB_NOWAIT) {
1810 if (!inode_trylock(inode))
1811 return -EAGAIN;
1812 } else {
1813 inode_lock(inode);
1814 }
1815
1816 ret = generic_write_checks(iocb, from);
1817 if (ret <= 0)
1818 goto out;
1819
1820 ret = file_remove_privs(iocb->ki_filp);
1821 if (ret)
1822 goto out;
1823 /* TODO file_update_time() but we don't want metadata I/O */
1824
1825 /* TODO handle growing the file */
1826 /* Grow file here if need be. iomap_begin() does not have access
1827 * to file pointer
1828 */
1829 if (iov_iter_rw(from) == WRITE &&
1830 ((iocb->ki_pos + iov_iter_count(from)) > i_size_read(inode))) {
1831 ret = __fuse_file_fallocate(iocb->ki_filp, 0, iocb->ki_pos,
1832 iov_iter_count(from));
1833 if (ret < 0) {
> 1834 printk("fallocate(offset=0x%llx length=0x%lx)"
1835 " failed. err=%ld\n", iocb->ki_pos,
1836 iov_iter_count(from), ret);
1837 goto out;
1838 }
> 1839 pr_debug("fallocate(offset=0x%llx length=0x%lx)"
1840 " succeed. ret=%ld\n", iocb->ki_pos, iov_iter_count(from), ret);
1841 }
1842
1843 ret = dax_iomap_rw(iocb, from, &fuse_iomap_ops);
1844
1845 out:
1846 inode_unlock(inode);
1847
1848 if (ret > 0)
1849 ret = generic_write_sync(iocb, ret);
1850 return ret;
1851 }
1852
---
Hi Vivek,
I love your patch! Perhaps something to improve:
[auto build test WARNING on fuse/for-next]
[also build test WARNING on v4.20-rc6]
[cannot apply to next-20181210]
[if your patch is applied to the wrong git tree, please drop us a note to help improve the system]
url: https://github.com/0day-ci/linux/commits/Vivek-Goyal/virtio-fs-shared-file-system-for-virtual-machines/20181211-103034
base: https://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse.git for-next
config: i386-randconfig-x000-201849 (attached as .config)
compiler: gcc-7 (Debian 7.3.0-1) 7.3.0
reproduce:
# save the attached .config to linux build tree
make ARCH=i386
All warnings (new ones prefixed by >>):
fs/fuse/file.c: In function 'fuse_dax_write_iter':
fs/fuse/file.c:1834:47: warning: format '%lx' expects argument of type 'long unsigned int', but argument 3 has type 'size_t {aka unsigned int}' [-Wformat=]
printk("fallocate(offset=0x%llx length=0x%lx)"
~~^
%x
fs/fuse/file.c:1836:4:
iov_iter_count(from), ret);
~~~~~~~~~~~~~~~~~~~~
fs/fuse/file.c:1834:11: warning: format '%ld' expects argument of type 'long int', but argument 4 has type 'ssize_t {aka int}' [-Wformat=]
printk("fallocate(offset=0x%llx length=0x%lx)"
^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
fs/fuse/file.c:1835:20: note: format string is defined here
" failed. err=%ld\n", iocb->ki_pos,
~~^
%d
In file included from include/linux/kernel.h:14:0,
from include/linux/list.h:9,
from include/linux/wait.h:7,
from include/linux/wait_bit.h:8,
from include/linux/fs.h:6,
from fs/fuse/fuse_i.h:13,
from fs/fuse/file.c:9:
include/linux/kern_levels.h:5:18: warning: format '%lx' expects argument of type 'long unsigned int', but argument 3 has type 'size_t {aka unsigned int}' [-Wformat=]
#define KERN_SOH "\001" /* ASCII Start Of Header */
^
include/linux/printk.h:136:10: note: in definition of macro 'no_printk'
printk(fmt, ##__VA_ARGS__); \
^~~
include/linux/kern_levels.h:15:20: note: in expansion of macro 'KERN_SOH'
#define KERN_DEBUG KERN_SOH "7" /* debug-level messages */
^~~~~~~~
include/linux/printk.h:346:12: note: in expansion of macro 'KERN_DEBUG'
no_printk(KERN_DEBUG pr_fmt(fmt), ##__VA_ARGS__)
^~~~~~~~~~
fs/fuse/file.c:1839:3: note: in expansion of macro 'pr_debug'
pr_debug("fallocate(offset=0x%llx length=0x%lx)"
^~~~~~~~
fs/fuse/file.c:1839:48: note: format string is defined here
pr_debug("fallocate(offset=0x%llx length=0x%lx)"
~~^
%x
In file included from include/linux/kernel.h:14:0,
from include/linux/list.h:9,
from include/linux/wait.h:7,
from include/linux/wait_bit.h:8,
from include/linux/fs.h:6,
from fs/fuse/fuse_i.h:13,
from fs/fuse/file.c:9:
>> include/linux/kern_levels.h:5:18: warning: format '%ld' expects argument of type 'long int', but argument 4 has type 'ssize_t {aka int}' [-Wformat=]
#define KERN_SOH "\001" /* ASCII Start Of Header */
^
include/linux/printk.h:136:10: note: in definition of macro 'no_printk'
printk(fmt, ##__VA_ARGS__); \
^~~
include/linux/kern_levels.h:15:20: note: in expansion of macro 'KERN_SOH'
#define KERN_DEBUG KERN_SOH "7" /* debug-level messages */
^~~~~~~~
include/linux/printk.h:346:12: note: in expansion of macro 'KERN_DEBUG'
no_printk(KERN_DEBUG pr_fmt(fmt), ##__VA_ARGS__)
^~~~~~~~~~
fs/fuse/file.c:1839:3: note: in expansion of macro 'pr_debug'
pr_debug("fallocate(offset=0x%llx length=0x%lx)"
^~~~~~~~
fs/fuse/file.c:1840:20: note: format string is defined here
" succeed. ret=%ld\n", iocb->ki_pos, iov_iter_count(from), ret);
~~^
%d
--
fs//fuse/file.c: In function 'fuse_dax_write_iter':
fs//fuse/file.c:1834:47: warning: format '%lx' expects argument of type 'long unsigned int', but argument 3 has type 'size_t {aka unsigned int}' [-Wformat=]
printk("fallocate(offset=0x%llx length=0x%lx)"
~~^
%x
fs//fuse/file.c:1836:4:
iov_iter_count(from), ret);
~~~~~~~~~~~~~~~~~~~~
fs//fuse/file.c:1834:11: warning: format '%ld' expects argument of type 'long int', but argument 4 has type 'ssize_t {aka int}' [-Wformat=]
printk("fallocate(offset=0x%llx length=0x%lx)"
^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
fs//fuse/file.c:1835:20: note: format string is defined here
" failed. err=%ld\n", iocb->ki_pos,
~~^
%d
In file included from include/linux/kernel.h:14:0,
from include/linux/list.h:9,
from include/linux/wait.h:7,
from include/linux/wait_bit.h:8,
from include/linux/fs.h:6,
from fs//fuse/fuse_i.h:13,
from fs//fuse/file.c:9:
include/linux/kern_levels.h:5:18: warning: format '%lx' expects argument of type 'long unsigned int', but argument 3 has type 'size_t {aka unsigned int}' [-Wformat=]
#define KERN_SOH "\001" /* ASCII Start Of Header */
^
include/linux/printk.h:136:10: note: in definition of macro 'no_printk'
printk(fmt, ##__VA_ARGS__); \
^~~
include/linux/kern_levels.h:15:20: note: in expansion of macro 'KERN_SOH'
#define KERN_DEBUG KERN_SOH "7" /* debug-level messages */
^~~~~~~~
include/linux/printk.h:346:12: note: in expansion of macro 'KERN_DEBUG'
no_printk(KERN_DEBUG pr_fmt(fmt), ##__VA_ARGS__)
^~~~~~~~~~
fs//fuse/file.c:1839:3: note: in expansion of macro 'pr_debug'
pr_debug("fallocate(offset=0x%llx length=0x%lx)"
^~~~~~~~
fs//fuse/file.c:1839:48: note: format string is defined here
pr_debug("fallocate(offset=0x%llx length=0x%lx)"
~~^
%x
In file included from include/linux/kernel.h:14:0,
from include/linux/list.h:9,
from include/linux/wait.h:7,
from include/linux/wait_bit.h:8,
from include/linux/fs.h:6,
from fs//fuse/fuse_i.h:13,
from fs//fuse/file.c:9:
>> include/linux/kern_levels.h:5:18: warning: format '%ld' expects argument of type 'long int', but argument 4 has type 'ssize_t {aka int}' [-Wformat=]
#define KERN_SOH "\001" /* ASCII Start Of Header */
^
include/linux/printk.h:136:10: note: in definition of macro 'no_printk'
printk(fmt, ##__VA_ARGS__); \
^~~
include/linux/kern_levels.h:15:20: note: in expansion of macro 'KERN_SOH'
#define KERN_DEBUG KERN_SOH "7" /* debug-level messages */
^~~~~~~~
include/linux/printk.h:346:12: note: in expansion of macro 'KERN_DEBUG'
no_printk(KERN_DEBUG pr_fmt(fmt), ##__VA_ARGS__)
^~~~~~~~~~
fs//fuse/file.c:1839:3: note: in expansion of macro 'pr_debug'
pr_debug("fallocate(offset=0x%llx length=0x%lx)"
^~~~~~~~
fs//fuse/file.c:1840:20: note: format string is defined here
" succeed. ret=%ld\n", iocb->ki_pos, iov_iter_count(from), ret);
~~^
%d
vim +5 include/linux/kern_levels.h
314ba352 Joe Perches 2012-07-30 4
04d2c8c8 Joe Perches 2012-07-30 @5 #define KERN_SOH "\001" /* ASCII Start Of Header */
04d2c8c8 Joe Perches 2012-07-30 6 #define KERN_SOH_ASCII '\001'
04d2c8c8 Joe Perches 2012-07-30 7
:::::: The code at line 5 was first introduced by commit
:::::: 04d2c8c83d0e3ac5f78aeede51babb3236200112 printk: convert the format for KERN_<LEVEL> to a 2 byte pattern
:::::: TO: Joe Perches <[email protected]>
:::::: CC: Linus Torvalds <[email protected]>
---
On Mon, Dec 10, 2018 at 12:12:26PM -0500, Vivek Goyal wrote:
> Hi,
>
> Here are RFC patches for virtio-fs. Looking for feedback on this approach.
>
> These patches should apply on top of 4.20-rc5. We have also put code for
> various components here.
>
> https://gitlab.com/virtio-fs
A draft specification for the virtio-fs device is available here:
https://stefanha.github.io/virtio/virtio-fs.html#x1-38800010 (HTML)
https://github.com/stefanha/virtio/commit/e1cac3777ef03bc9c5c8ee91bcc6ba478272e6b6
Stefan
Hi Vivek,
I love your patch! Perhaps something to improve:
[auto build test WARNING on fuse/for-next]
[also build test WARNING on v4.20-rc6]
[cannot apply to next-20181211]
[if your patch is applied to the wrong git tree, please drop us a note to help improve the system]
url: https://github.com/0day-ci/linux/commits/Vivek-Goyal/virtio-fs-shared-file-system-for-virtual-machines/20181211-103034
base: https://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse.git for-next
config: arm64-allmodconfig (attached as .config)
compiler: aarch64-linux-gnu-gcc (Debian 7.2.0-11) 7.2.0
reproduce:
wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
chmod +x ~/bin/make.cross
# save the attached .config to linux build tree
GCC_VERSION=7.2.0 make.cross ARCH=arm64
All warnings (new ones prefixed by >>):
fs/ext2/inode.c: In function 'ext2_dax_writepages':
fs/ext2/inode.c:959:33: error: passing argument 3 of 'dax_writeback_mapping_range' from incompatible pointer type [-Werror=incompatible-pointer-types]
mapping->host->i_sb->s_bdev, wbc);
^~~
In file included from fs/ext2/inode.c:29:0:
include/linux/dax.h:87:5: note: expected 'struct dax_device *' but argument is of type 'struct writeback_control *'
int dax_writeback_mapping_range(struct address_space *mapping,
^~~~~~~~~~~~~~~~~~~~~~~~~~~
fs/ext2/inode.c:958:9: error: too few arguments to function 'dax_writeback_mapping_range'
return dax_writeback_mapping_range(mapping,
^~~~~~~~~~~~~~~~~~~~~~~~~~~
In file included from fs/ext2/inode.c:29:0:
include/linux/dax.h:87:5: note: declared here
int dax_writeback_mapping_range(struct address_space *mapping,
^~~~~~~~~~~~~~~~~~~~~~~~~~~
>> fs/ext2/inode.c:960:1: warning: control reaches end of non-void function [-Wreturn-type]
}
^
cc1: some warnings being treated as errors
vim +960 fs/ext2/inode.c
7f6d5b52 Ross Zwisler 2016-02-26 954
fb094c90 Dan Williams 2017-12-21 955 static int
fb094c90 Dan Williams 2017-12-21 956 ext2_dax_writepages(struct address_space *mapping, struct writeback_control *wbc)
fb094c90 Dan Williams 2017-12-21 957 {
fb094c90 Dan Williams 2017-12-21 @958 return dax_writeback_mapping_range(mapping,
fb094c90 Dan Williams 2017-12-21 959 mapping->host->i_sb->s_bdev, wbc);
^1da177e Linus Torvalds 2005-04-16 @960 }
^1da177e Linus Torvalds 2005-04-16 961
:::::: The code at line 960 was first introduced by commit
:::::: 1da177e4c3f41524e886b7f1b8a0c1fc7321cac2 Linux-2.6.12-rc2
:::::: TO: Linus Torvalds <[email protected]>
:::::: CC: Linus Torvalds <[email protected]>
---
Hi Vivek,
I love your patch! Perhaps something to improve:
[auto build test WARNING on fuse/for-next]
[also build test WARNING on v4.20-rc6]
[cannot apply to next-20181210]
[if your patch is applied to the wrong git tree, please drop us a note to help improve the system]
url: https://github.com/0day-ci/linux/commits/Vivek-Goyal/virtio-fs-shared-file-system-for-virtual-machines/20181211-103034
base: https://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse.git for-next
config: arm-allmodconfig (attached as .config)
compiler: arm-linux-gnueabi-gcc (Debian 7.2.0-11) 7.2.0
reproduce:
wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
chmod +x ~/bin/make.cross
# save the attached .config to linux build tree
GCC_VERSION=7.2.0 make.cross ARCH=arm
All warnings (new ones prefixed by >>):
In file included from include/linux/kernel.h:14:0,
from include/linux/list.h:9,
from include/linux/wait.h:7,
from include/linux/wait_bit.h:8,
from include/linux/fs.h:6,
from fs/fuse/virtio_fs.c:7:
fs/fuse/virtio_fs.c: In function 'virtio_fs_direct_access':
>> fs/fuse/virtio_fs.c:454:11: warning: format '%ld' expects argument of type 'long int', but argument 4 has type 'size_t {aka unsigned int}' [-Wformat=]
pr_debug("virtio_fs_direct_access(): called. nr_pages=%ld max_nr_pages=%ld\n", nr_pages, max_nr_pages);
^
include/linux/printk.h:292:21: note: in definition of macro 'pr_fmt'
#define pr_fmt(fmt) fmt
^~~
include/linux/printk.h:340:2: note: in expansion of macro 'dynamic_pr_debug'
dynamic_pr_debug(fmt, ##__VA_ARGS__)
^~~~~~~~~~~~~~~~
>> fs/fuse/virtio_fs.c:454:2: note: in expansion of macro 'pr_debug'
pr_debug("virtio_fs_direct_access(): called. nr_pages=%ld max_nr_pages=%ld\n", nr_pages, max_nr_pages);
^~~~~~~~
In file included from include/linux/printk.h:336:0,
from include/linux/kernel.h:14,
from include/linux/list.h:9,
from include/linux/wait.h:7,
from include/linux/wait_bit.h:8,
from include/linux/fs.h:6,
from fs/fuse/virtio_fs.c:7:
fs/fuse/virtio_fs.c: In function 'virtio_fs_setup_dax':
fs/fuse/virtio_fs.c:617:22: warning: format '%llx' expects argument of type 'long long unsigned int', but argument 6 has type 'phys_addr_t {aka unsigned int}' [-Wformat=]
dev_dbg(&vdev->dev, "%s: cache kaddr 0x%px phys_addr 0x%llx len %llx\n",
^
include/linux/dynamic_debug.h:135:39: note: in definition of macro 'dynamic_dev_dbg'
__dynamic_dev_dbg(&descriptor, dev, fmt, \
^~~
include/linux/device.h:1463:23: note: in expansion of macro 'dev_fmt'
dynamic_dev_dbg(dev, dev_fmt(fmt), ##__VA_ARGS__)
^~~~~~~
fs/fuse/virtio_fs.c:617:2: note: in expansion of macro 'dev_dbg'
dev_dbg(&vdev->dev, "%s: cache kaddr 0x%px phys_addr 0x%llx len %llx\n",
^~~~~~~
vim +454 fs/fuse/virtio_fs.c
442
443 /* Map a window offset to a page frame number. The window offset will have
444 * been produced by .iomap_begin(), which maps a file offset to a window
445 * offset.
446 */
447 static long virtio_fs_direct_access(struct dax_device *dax_dev, pgoff_t pgoff,
448 long nr_pages, void **kaddr, pfn_t *pfn)
449 {
450 struct virtio_fs *fs = dax_get_private(dax_dev);
451 phys_addr_t offset = PFN_PHYS(pgoff);
452 size_t max_nr_pages = fs->window_len/PAGE_SIZE - pgoff;
453
> 454 pr_debug("virtio_fs_direct_access(): called. nr_pages=%ld max_nr_pages=%ld\n", nr_pages, max_nr_pages);
455
456 if (kaddr)
457 *kaddr = fs->window_kaddr + offset;
458 if (pfn)
459 *pfn = phys_to_pfn_t(fs->window_phys_addr + offset,
460 PFN_DEV | PFN_MAP);
461 return nr_pages > max_nr_pages ? max_nr_pages : nr_pages;
462 }
463
---
From: kbuild test robot <[email protected]>
fs/fuse/virtio_fs.c:88:17-18: Unneeded semicolon
Remove unneeded semicolon.
Generated by: scripts/coccinelle/misc/semicolon.cocci
Fixes: 065b4fe69a2b ("virtio-fs: Add VIRTIO_PCI_CAP_SHARED_MEMORY_CFG and utility to find them")
CC: Dr. David Alan Gilbert <[email protected]>
Signed-off-by: kbuild test robot <[email protected]>
---
url: https://github.com/0day-ci/linux/commits/Vivek-Goyal/virtio-fs-shared-file-system-for-virtual-machines/20181211-103034
base: https://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse.git for-next
virtio_fs.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
--- a/fs/fuse/virtio_fs.c
+++ b/fs/fuse/virtio_fs.c
@@ -85,7 +85,7 @@ static int virtio_pci_find_shm_cap(struc
printk(KERN_ERR "%s: shm cap with bad size offset: %d size: %d\n",
__func__, pos, cap_len);
continue;
- };
+ }
pci_read_config_byte(dev, pos + offsetof(struct virtio_pci_shm_cap,
id),
Hi David,
Thank you for the patch! Perhaps something to improve:
[auto build test WARNING on fuse/for-next]
[also build test WARNING on v4.20-rc6]
[cannot apply to next-20181212]
[if your patch is applied to the wrong git tree, please drop us a note to help improve the system]
url: https://github.com/0day-ci/linux/commits/Vivek-Goyal/virtio-fs-shared-file-system-for-virtual-machines/20181211-103034
base: https://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse.git for-next
coccinelle warnings: (new ones prefixed by >>)
>> fs/fuse/virtio_fs.c:88:17-18: Unneeded semicolon
Please review and possibly fold the followup patch.
---
On 10.12.2018 18:12, Vivek Goyal wrote:
> From: Stefan Hajnoczi <[email protected]>
> +static int virtio_fs_setup_dax(struct virtio_device *vdev, struct virtio_fs *fs)
> +{
> + struct virtio_fs_memremap_info *mi;
> + struct dev_pagemap *pgmap;
> + struct pci_dev *pci_dev;
> + phys_addr_t phys_addr;
> + size_t len;
> + int ret;
> +
> + if (!IS_ENABLED(CONFIG_DAX_DRIVER))
> + return 0;
> +
> + /* HACK implement VIRTIO shared memory regions instead of
> + * directly accessing the PCI BAR from a virtio device driver.
> + */
> + pci_dev = container_of(vdev->dev.parent, struct pci_dev, dev);
> +
> + /* TODO Is this safe - the virtio_pci_* driver doesn't use managed
> + * device APIs? */
> + ret = pcim_enable_device(pci_dev);
> + if (ret < 0)
> + return ret;
> +
> + /* TODO handle case where device doesn't expose BAR? */
> + ret = pci_request_region(pci_dev, VIRTIO_FS_WINDOW_BAR,
> + "virtio-fs-window");
> + if (ret < 0) {
> + dev_err(&vdev->dev, "%s: failed to request window BAR\n",
> + __func__);
> + return ret;
> + }
Can we please have a generic virtio interface to map the address (the default can then
fall back to PCI) instead of mapping a PCI BAR? This would make it easier to implement
virtio-ccw or virtio-mmio.
Currently, virtio-9p cannot be used with overlayfs in order to obtain
a Docker-like experience (but with a separate kernel) because of file
attributes problems. I wrote an email about that to qemu-devel almost a
year ago, but it received no attention (I attach its contents below).
Will virtio-fs avoid these problems? I assume it will be transparent
from the point of view of file attributes, and not enforce any kind of
security filtering?
Piotr Jurkiewicz
----
1. Upper filesystem must support the creation of trusted.* extended
attributes.
9pfs has support for getting/setting xattrs, but calls operating on
attributes other than user.* and system.posix_acl_* are dropped.
2. Upper filesystem must provide valid d_type in readdir responses.
This works, but only in the case of the 'passthrough' and 'none' security
models. In the case of the 'mapped-xattr' and 'mapped-file' models, d_type
is zeroed to DT_UNKNOWN during the readdir() call.
All these limitations can be resolved pretty easily, but they require some
design decisions. I can prepare appropriate patches.
Ad. 1.
Why are operations on attributes other than user.* and
system.posix_acl_* forbidden? Is this due to security reasons?
If so, can we map all of them to the user.virtfs namespace, similarly to
how system.posix_acl_* is already mapped to user.virtfs.system.posix_acl_*
in 'mapping' mode? This way any trusted/security/system attributes
will be effective only when mounted via virtfs inside the VM.
Ad. 2.
local_readdir() can fill entry->d_type with the right DT_* value by
obtaining the file type from the mapping and translating it with the
IFTODT() macro. This would, however, require reading 'user.virtfs.mode'
for each direntry during the readdir() call, which can affect performance.
If so, this behavior would probably need to be controlled with some
runtime option.
The 'mapped-xattr' and 'mapped-file' models are essential for running qemu
with overlayfs as non-root, because overlayfs creates device nodes, which
is possible for an unprivileged user only with these models.
On Mon, Dec 10, 2018 at 12:12:26PM -0500, Vivek Goyal wrote:
> Hi,
>
> Here are RFC patches for virtio-fs. Looking for feedback on this approach.
>
> These patches should apply on top of 4.20-rc5. We have also put code for
> various components here.
>
> https://gitlab.com/virtio-fs
>
> Problem Description
> ===================
> We want to be able to take a directory tree on the host and share it with
> guest[s]. Our goal is to be able to do it in a fast, consistent and secure
> manner. Our primary use case is kata containers, but it should be usable in
> other scenarios as well.
>
> Containers may rely on local file system semantics for shared volumes,
> read-write mounts that multiple containers access simultaneously. File
> system changes must be visible to other containers with the same consistency
> expected of a local file system, including mmap MAP_SHARED.
>
> Existing Solutions
> ==================
> We looked at existing solutions and virtio-9p already provides basic shared
> file system functionality although does not offer local file system semantics,
> causing some workloads and test suites to fail. In addition, virtio-9p
> performance has been an issue for Kata Containers and we believe this cannot
> be alleviated without major changes that do not fit into the 9P protocol.
>
> Design Overview
> ===============
> With the goal of designing something with better performance and local file
> system semantics, a bunch of ideas were proposed.
>
> - Use fuse protocol (instead of 9p) for communication between guest
> and host. Guest kernel will be fuse client and a fuse server will
> run on host to serve the requests. Benchmark results (see below) are
> encouraging and show this approach performs well (2x to 8x improvement
> depending on test being run).
>
> - For data access inside guest, mmap portion of file in QEMU address
> space and guest accesses this memory using dax. That way guest page
> cache is bypassed and there is only one copy of data (on host). This
> will also enable mmap(MAP_SHARED) between guests.
>
> - For metadata coherency, there is a shared memory region which contains
> version number associated with metadata and any guest changing metadata
> updates version number and other guests refresh metadata on next
> access. This is still experimental and implementation is not complete.
What about Windows guests or BSD ones? Is there a plan to make that work with them as well?
What about the Virtio spec? Plans to make changes there as well?
On Wed, Dec 12, 2018 at 03:30:49PM -0500, Konrad Rzeszutek Wilk wrote:
> On Mon, Dec 10, 2018 at 12:12:26PM -0500, Vivek Goyal wrote:
> > Hi,
> >
> > Here are RFC patches for virtio-fs. Looking for feedback on this approach.
> >
> > These patches should apply on top of 4.20-rc5. We have also put code for
> > various components here.
> >
> > https://gitlab.com/virtio-fs
> >
> > Problem Description
> > ===================
> > We want to be able to take a directory tree on the host and share it with
> > guest[s]. Our goal is to be able to do it in a fast, consistent and secure
> > manner. Our primary use case is kata containers, but it should be usable in
> > other scenarios as well.
> >
> > Containers may rely on local file system semantics for shared volumes,
> > read-write mounts that multiple containers access simultaneously. File
> > system changes must be visible to other containers with the same consistency
> > expected of a local file system, including mmap MAP_SHARED.
> >
> > Existing Solutions
> > ==================
> > We looked at existing solutions and virtio-9p already provides basic shared
> > file system functionality although does not offer local file system semantics,
> > causing some workloads and test suites to fail. In addition, virtio-9p
> > performance has been an issue for Kata Containers and we believe this cannot
> > be alleviated without major changes that do not fit into the 9P protocol.
> >
> > Design Overview
> > ===============
> > With the goal of designing something with better performance and local file
> > system semantics, a bunch of ideas were proposed.
> >
> > - Use fuse protocol (instead of 9p) for communication between guest
> > and host. Guest kernel will be fuse client and a fuse server will
> > run on host to serve the requests. Benchmark results (see below) are
> > encouraging and show this approach performs well (2x to 8x improvement
> > depending on test being run).
> >
> > - For data access inside guest, mmap portion of file in QEMU address
> > space and guest accesses this memory using dax. That way guest page
> > cache is bypassed and there is only one copy of data (on host). This
> > will also enable mmap(MAP_SHARED) between guests.
> >
> > - For metadata coherency, there is a shared memory region which contains
> > version number associated with metadata and any guest changing metadata
> > updates version number and other guests refresh metadata on next
> > access. This is still experimental and implementation is not complete.
>
> What about Windows guests or BSD ones? Is there a plan to make that work with them as well?
Hi Konrad,
I have not thought much about making it work on Windows or BSD yet.
Does Fuse work with windows. I am assuming it does with BSD. As long as FUSE
works, I am assuming that atleast basic mode can be made to work.
>
> What about the Virtio spec? Plans to make changes there as well?
There are plans to change that. Stefan posted a proposal here.
https://lists.oasis-open.org/archives/virtio-dev/201812/msg00073.html
Thanks
Vivek
On Wed, Dec 12, 2018 at 06:07:40PM +0100, Piotr Jurkiewicz wrote:
> Currently, virtio-9p cannot be used with overlayfs in order to obtain a
> Docker-like experience (but with a separate kernel) because of file
> attributes problems. I wrote an email about that to qemu-devel almost a
> year ago, but it received no attention (I attach its contents below).
>
> Will virtio-fs avoid these problems? I assume it will be transparent from
> the point of view of file attributes, and not enforce any kind of security
> filtering?
Hi Piotr,
So you want to use virtio-fs as the upper layer of an overlay filesystem
inside the guest? Interesting. I have not tried that.

As of now I think we are not doing any filtering of file attributes,
so it might just work. Give it a try. Having said that, I suspect
that the security model of virtio-fs will most likely evolve.
Thanks
Vivek
>
> Piotr Jurkiewicz
>
> ----
>
> 1. Upper filesystem must support the creation of trusted.* extended
> attributes.
>
> 9pfs has support for getting/setting xattrs, but calls operating on
> attributes other than user.* and system.posix_acl_* are dropped.
>
> 2. Upper filesystem must provide valid d_type in readdir responses.
>
> This works, but only in the case of the 'passthrough' and 'none' security
> models. In the case of the 'mapped-xattr' and 'mapped-file' models, d_type
> is zeroed to DT_UNKNOWN during the readdir() call.
>
> All these limitations can be resolved pretty easily, but they require some
> design decisions. I can prepare appropriate patches.
>
> Ad. 1.
>
> Why are operations on attributes other than user.* and
> system.posix_acl_* forbidden? Is this due to security reasons?
>
> If so, can we map all of them to the user.virtfs namespace, similarly to
> how system.posix_acl_* is already mapped to user.virtfs.system.posix_acl_*
> in 'mapping' mode? This way any trusted/security/system attributes will
> be effective only when mounted via virtfs inside the VM.
>
> Ad. 2.
>
> local_readdir() can fill entry->d_type with the right DT_* value by
> obtaining the file type from the mapping and translating it with the
> IFTODT() macro. This would, however, require reading 'user.virtfs.mode'
> for each direntry during the readdir() call, which can affect performance.
> If so, this behavior would probably need to be controlled with some
> runtime option.
>
> The 'mapped-xattr' and 'mapped-file' models are essential for running qemu
> with overlayfs as non-root, because overlayfs creates device nodes, which is
> possible for an unprivileged user only with these models.
On 10.12.18 18:12, Vivek Goyal wrote:
> Instead of assuming we had the fixed bar for the cache, use the
> value from the capabilities.
>
> Signed-off-by: Dr. David Alan Gilbert <[email protected]>
> ---
> fs/fuse/virtio_fs.c | 32 +++++++++++++++++---------------
> 1 file changed, 17 insertions(+), 15 deletions(-)
>
> diff --git a/fs/fuse/virtio_fs.c b/fs/fuse/virtio_fs.c
> index 60d496c16841..55bac1465536 100644
> --- a/fs/fuse/virtio_fs.c
> +++ b/fs/fuse/virtio_fs.c
> @@ -14,11 +14,6 @@
> #include <uapi/linux/virtio_pci.h>
> #include "fuse_i.h"
>
> -enum {
> - /* PCI BAR number of the virtio-fs DAX window */
> - VIRTIO_FS_WINDOW_BAR = 2,
> -};
> -
> /* List of virtio-fs device instances and a lock for the list */
> static DEFINE_MUTEX(virtio_fs_mutex);
> static LIST_HEAD(virtio_fs_instances);
> @@ -518,7 +513,7 @@ static int virtio_fs_setup_dax(struct virtio_device *vdev, struct virtio_fs *fs)
> struct dev_pagemap *pgmap;
> struct pci_dev *pci_dev;
> phys_addr_t phys_addr;
> - size_t len;
> + size_t bar_len;
> int ret;
> u8 have_cache, cache_bar;
> u64 cache_offset, cache_len;
> @@ -551,17 +546,13 @@ static int virtio_fs_setup_dax(struct virtio_device *vdev, struct virtio_fs *fs)
> }
>
> /* TODO handle case where device doesn't expose BAR? */
For virtio-pmem we decided not to go via BARs, as this would effectively
make it usable only for virtio-pci implementers. Instead, we are going
to export the applicable physical device region directly (e.g.
phys_start, phys_size in the virtio config), so it is decoupled from PCI
details. Doing the same for virtio-fs would allow e.g. virtio-ccw to
eventually make use of this as well.
> - ret = pci_request_region(pci_dev, VIRTIO_FS_WINDOW_BAR,
> - "virtio-fs-window");
> + ret = pci_request_region(pci_dev, cache_bar, "virtio-fs-window");
> if (ret < 0) {
> dev_err(&vdev->dev, "%s: failed to request window BAR\n",
> __func__);
> return ret;
> }
>
> - phys_addr = pci_resource_start(pci_dev, VIRTIO_FS_WINDOW_BAR);
> - len = pci_resource_len(pci_dev, VIRTIO_FS_WINDOW_BAR);
> -
> mi = devm_kzalloc(&pci_dev->dev, sizeof(*mi), GFP_KERNEL);
> if (!mi)
> return -ENOMEM;
> @@ -586,6 +577,17 @@ static int virtio_fs_setup_dax(struct virtio_device *vdev, struct virtio_fs *fs)
> pgmap->ref = &mi->ref;
> pgmap->type = MEMORY_DEVICE_FS_DAX;
>
> + phys_addr = pci_resource_start(pci_dev, cache_bar);
> + bar_len = pci_resource_len(pci_dev, cache_bar);
> +
> + if (cache_offset + cache_len > bar_len) {
> + dev_err(&vdev->dev,
> + "%s: cache bar shorter than cap offset+len\n",
> + __func__);
> + return -EINVAL;
> + }
> + phys_addr += cache_offset;
> +
> /* Ideally we would directly use the PCI BAR resource but
> * devm_memremap_pages() wants its own copy in pgmap. So
> * initialize a struct resource from scratch (only the start
> @@ -594,7 +596,7 @@ static int virtio_fs_setup_dax(struct virtio_device *vdev, struct virtio_fs *fs)
> pgmap->res = (struct resource){
> .name = "virtio-fs dax window",
> .start = phys_addr,
> - .end = phys_addr + len,
> + .end = phys_addr + cache_len,
> };
>
> fs->window_kaddr = devm_memremap_pages(&pci_dev->dev, pgmap);
> @@ -607,10 +609,10 @@ static int virtio_fs_setup_dax(struct virtio_device *vdev, struct virtio_fs *fs)
> return ret;
>
> fs->window_phys_addr = phys_addr;
> - fs->window_len = len;
> + fs->window_len = cache_len;
>
> - dev_dbg(&vdev->dev, "%s: window kaddr 0x%px phys_addr 0x%llx len %zu\n",
> - __func__, fs->window_kaddr, phys_addr, len);
> + dev_dbg(&vdev->dev, "%s: cache kaddr 0x%px phys_addr 0x%llx len %llx\n",
> + __func__, fs->window_kaddr, phys_addr, cache_len);
>
> fs->dax_dev = alloc_dax(fs, NULL, &virtio_fs_dax_ops);
> if (!fs->dax_dev)
>
--
Thanks,
David / dhildenb
* David Hildenbrand ([email protected]) wrote:
> On 10.12.18 18:12, Vivek Goyal wrote:
> > Instead of assuming we had the fixed bar for the cache, use the
> > value from the capabilities.
> >
> > Signed-off-by: Dr. David Alan Gilbert <[email protected]>
> > ---
> > fs/fuse/virtio_fs.c | 32 +++++++++++++++++---------------
> > 1 file changed, 17 insertions(+), 15 deletions(-)
> >
> > diff --git a/fs/fuse/virtio_fs.c b/fs/fuse/virtio_fs.c
> > index 60d496c16841..55bac1465536 100644
> > --- a/fs/fuse/virtio_fs.c
> > +++ b/fs/fuse/virtio_fs.c
> > @@ -14,11 +14,6 @@
> > #include <uapi/linux/virtio_pci.h>
> > #include "fuse_i.h"
> >
> > -enum {
> > - /* PCI BAR number of the virtio-fs DAX window */
> > - VIRTIO_FS_WINDOW_BAR = 2,
> > -};
> > -
> > /* List of virtio-fs device instances and a lock for the list */
> > static DEFINE_MUTEX(virtio_fs_mutex);
> > static LIST_HEAD(virtio_fs_instances);
> > @@ -518,7 +513,7 @@ static int virtio_fs_setup_dax(struct virtio_device *vdev, struct virtio_fs *fs)
> > struct dev_pagemap *pgmap;
> > struct pci_dev *pci_dev;
> > phys_addr_t phys_addr;
> > - size_t len;
> > + size_t bar_len;
> > int ret;
> > u8 have_cache, cache_bar;
> > u64 cache_offset, cache_len;
> > @@ -551,17 +546,13 @@ static int virtio_fs_setup_dax(struct virtio_device *vdev, struct virtio_fs *fs)
> > }
> >
> > /* TODO handle case where device doesn't expose BAR? */
>
> For virtio-pmem we decided not to go via BARs, as this would effectively
> make it usable only for virtio-pci implementers. Instead, we are going
> to export the applicable physical device region directly (e.g.
> phys_start, phys_size in the virtio config), so it is decoupled from PCI
> details. Doing the same for virtio-fs would allow e.g. virtio-ccw to
> eventually make use of this as well.
That makes it a very odd-looking PCI device; I can see that with
virtio-pmem it makes some sense, given that its job is to expose
arbitrary chunks of memory.

Dave
> > - ret = pci_request_region(pci_dev, VIRTIO_FS_WINDOW_BAR,
> > - "virtio-fs-window");
> > + ret = pci_request_region(pci_dev, cache_bar, "virtio-fs-window");
> > if (ret < 0) {
> > dev_err(&vdev->dev, "%s: failed to request window BAR\n",
> > __func__);
> > return ret;
> > }
> >
> > - phys_addr = pci_resource_start(pci_dev, VIRTIO_FS_WINDOW_BAR);
> > - len = pci_resource_len(pci_dev, VIRTIO_FS_WINDOW_BAR);
> > -
> > mi = devm_kzalloc(&pci_dev->dev, sizeof(*mi), GFP_KERNEL);
> > if (!mi)
> > return -ENOMEM;
> > @@ -586,6 +577,17 @@ static int virtio_fs_setup_dax(struct virtio_device *vdev, struct virtio_fs *fs)
> > pgmap->ref = &mi->ref;
> > pgmap->type = MEMORY_DEVICE_FS_DAX;
> >
> > + phys_addr = pci_resource_start(pci_dev, cache_bar);
> > + bar_len = pci_resource_len(pci_dev, cache_bar);
> > +
> > + if (cache_offset + cache_len > bar_len) {
> > + dev_err(&vdev->dev,
> > + "%s: cache bar shorter than cap offset+len\n",
> > + __func__);
> > + return -EINVAL;
> > + }
> > + phys_addr += cache_offset;
> > +
> > /* Ideally we would directly use the PCI BAR resource but
> > * devm_memremap_pages() wants its own copy in pgmap. So
> > * initialize a struct resource from scratch (only the start
> > @@ -594,7 +596,7 @@ static int virtio_fs_setup_dax(struct virtio_device *vdev, struct virtio_fs *fs)
> > pgmap->res = (struct resource){
> > .name = "virtio-fs dax window",
> > .start = phys_addr,
> > - .end = phys_addr + len,
> > + .end = phys_addr + cache_len,
> > };
> >
> > fs->window_kaddr = devm_memremap_pages(&pci_dev->dev, pgmap);
> > @@ -607,10 +609,10 @@ static int virtio_fs_setup_dax(struct virtio_device *vdev, struct virtio_fs *fs)
> > return ret;
> >
> > fs->window_phys_addr = phys_addr;
> > - fs->window_len = len;
> > + fs->window_len = cache_len;
> >
> > - dev_dbg(&vdev->dev, "%s: window kaddr 0x%px phys_addr 0x%llx len %zu\n",
> > - __func__, fs->window_kaddr, phys_addr, len);
> > + dev_dbg(&vdev->dev, "%s: cache kaddr 0x%px phys_addr 0x%llx len %llx\n",
> > + __func__, fs->window_kaddr, phys_addr, cache_len);
> >
> > fs->dax_dev = alloc_dax(fs, NULL, &virtio_fs_dax_ops);
> > if (!fs->dax_dev)
> >
>
>
> --
>
> Thanks,
>
> David / dhildenb
--
Dr. David Alan Gilbert / [email protected] / Manchester, UK
On 13.12.18 10:13, Dr. David Alan Gilbert wrote:
> * David Hildenbrand ([email protected]) wrote:
>> On 10.12.18 18:12, Vivek Goyal wrote:
>>> Instead of assuming we had the fixed bar for the cache, use the
>>> value from the capabilities.
>>>
>>> Signed-off-by: Dr. David Alan Gilbert <[email protected]>
>>> ---
>>> fs/fuse/virtio_fs.c | 32 +++++++++++++++++---------------
>>> 1 file changed, 17 insertions(+), 15 deletions(-)
>>>
>>> diff --git a/fs/fuse/virtio_fs.c b/fs/fuse/virtio_fs.c
>>> index 60d496c16841..55bac1465536 100644
>>> --- a/fs/fuse/virtio_fs.c
>>> +++ b/fs/fuse/virtio_fs.c
>>> @@ -14,11 +14,6 @@
>>> #include <uapi/linux/virtio_pci.h>
>>> #include "fuse_i.h"
>>>
>>> -enum {
>>> - /* PCI BAR number of the virtio-fs DAX window */
>>> - VIRTIO_FS_WINDOW_BAR = 2,
>>> -};
>>> -
>>> /* List of virtio-fs device instances and a lock for the list */
>>> static DEFINE_MUTEX(virtio_fs_mutex);
>>> static LIST_HEAD(virtio_fs_instances);
>>> @@ -518,7 +513,7 @@ static int virtio_fs_setup_dax(struct virtio_device *vdev, struct virtio_fs *fs)
>>> struct dev_pagemap *pgmap;
>>> struct pci_dev *pci_dev;
>>> phys_addr_t phys_addr;
>>> - size_t len;
>>> + size_t bar_len;
>>> int ret;
>>> u8 have_cache, cache_bar;
>>> u64 cache_offset, cache_len;
>>> @@ -551,17 +546,13 @@ static int virtio_fs_setup_dax(struct virtio_device *vdev, struct virtio_fs *fs)
>>> }
>>>
>>> /* TODO handle case where device doesn't expose BAR? */
>>
>> For virtio-pmem we decided not to go via BARs, as this would effectively
>> make it usable only for virtio-pci implementers. Instead, we are going
>> to export the applicable physical device region directly (e.g.
>> phys_start, phys_size in the virtio config), so it is decoupled from PCI
>> details. Doing the same for virtio-fs would allow e.g. virtio-ccw to
>> eventually make use of this as well.
>
> That makes it a very odd-looking PCI device; I can see that with
> virtio-pmem it makes some sense, given that its job is to expose
> arbitrary chunks of memory.
>
> Dave
Well, the fact that you are
- including <uapi/linux/virtio_pci.h>
- adding pci related code
in/to fs/fuse/virtio_fs.c
tells me that these properties might be better communicated on the
virtio layer, not on the PCI layer.
Or do you really want to glue virtio-fs to virtio-pci for all eternity?
--
Thanks,
David / dhildenb
* David Hildenbrand ([email protected]) wrote:
> On 13.12.18 10:13, Dr. David Alan Gilbert wrote:
> > * David Hildenbrand ([email protected]) wrote:
> >> On 10.12.18 18:12, Vivek Goyal wrote:
> >>> Instead of assuming we had the fixed bar for the cache, use the
> >>> value from the capabilities.
> >>>
> >>> Signed-off-by: Dr. David Alan Gilbert <[email protected]>
> >>> ---
> >>> fs/fuse/virtio_fs.c | 32 +++++++++++++++++---------------
> >>> 1 file changed, 17 insertions(+), 15 deletions(-)
> >>>
> >>> diff --git a/fs/fuse/virtio_fs.c b/fs/fuse/virtio_fs.c
> >>> index 60d496c16841..55bac1465536 100644
> >>> --- a/fs/fuse/virtio_fs.c
> >>> +++ b/fs/fuse/virtio_fs.c
> >>> @@ -14,11 +14,6 @@
> >>> #include <uapi/linux/virtio_pci.h>
> >>> #include "fuse_i.h"
> >>>
> >>> -enum {
> >>> - /* PCI BAR number of the virtio-fs DAX window */
> >>> - VIRTIO_FS_WINDOW_BAR = 2,
> >>> -};
> >>> -
> >>> /* List of virtio-fs device instances and a lock for the list */
> >>> static DEFINE_MUTEX(virtio_fs_mutex);
> >>> static LIST_HEAD(virtio_fs_instances);
> >>> @@ -518,7 +513,7 @@ static int virtio_fs_setup_dax(struct virtio_device *vdev, struct virtio_fs *fs)
> >>> struct dev_pagemap *pgmap;
> >>> struct pci_dev *pci_dev;
> >>> phys_addr_t phys_addr;
> >>> - size_t len;
> >>> + size_t bar_len;
> >>> int ret;
> >>> u8 have_cache, cache_bar;
> >>> u64 cache_offset, cache_len;
> >>> @@ -551,17 +546,13 @@ static int virtio_fs_setup_dax(struct virtio_device *vdev, struct virtio_fs *fs)
> >>> }
> >>>
> >>> /* TODO handle case where device doesn't expose BAR? */
> >>
> >> For virtio-pmem we decided not to go via BARs, as this would effectively
> >> make it usable only for virtio-pci implementers. Instead, we are going
> >> to export the applicable physical device region directly (e.g.
> >> phys_start, phys_size in the virtio config), so it is decoupled from PCI
> >> details. Doing the same for virtio-fs would allow e.g. virtio-ccw to
> >> eventually make use of this as well.
> >
> > That makes it a very odd-looking PCI device; I can see that with
> > virtio-pmem it makes some sense, given that its job is to expose
> > arbitrary chunks of memory.
> >
> > Dave
>
> Well, the fact that you are
>
> - including <uapi/linux/virtio_pci.h>
> - adding pci related code
>
> in/to fs/fuse/virtio_fs.c
>
> tells me that these properties might be better communicated on the
> virtio layer, not on the PCI layer.
>
> Or do you really want to glue virtio-fs to virtio-pci for all eternity?
No, these need cleaning up; and the split within the BAR
is probably going to change to be communicated via the virtio layer
rather than PCI capabilities. However, I don't want to make our PCI
device look odd just for the sake of portability to non-PCI devices - so
it's right to make the split appropriately, but still use PCI BARs
for what they were designed for.
Dave
>
> --
>
> Thanks,
>
> David / dhildenb
--
Dr. David Alan Gilbert / [email protected] / Manchester, UK
On 13.12.18 11:00, Dr. David Alan Gilbert wrote:
> * David Hildenbrand ([email protected]) wrote:
>> [...]
>> Or do you really want to glue virtio-fs to virtio-pci for all eternity?
>
> No, these need cleaning up; and the split within the bar
> is probably going to change to be communicated via the virtio layer
> rather than pci capabilities. However, I don't want to make our PCI
> device look odd just for the sake of portability to non-PCI devices -
> so it's right to make the split appropriately, but still to use PCI
> BARs for what they were designed for.
>
> Dave
Let's discuss after the cleanup. In general I am not convinced this is
the right thing to do. Using virtio-pci for anything other than pure
transport smells like bad design to me (well, I am no virtio expert
after all ;) ), no matter what PCI BARs were designed for. If we can't
get the same running with e.g. virtio-ccw or virtio-whatever, it is
broken by design (or an addon that is tightly glued to virtio-pci, if
that is the general idea).
--
Thanks,
David / dhildenb
On Wed, Dec 12, 2018 at 05:37:35PM +0100, Christian Borntraeger wrote:
>
>
> On 10.12.2018 18:12, Vivek Goyal wrote:
> > From: Stefan Hajnoczi <[email protected]>
>
> > +static int virtio_fs_setup_dax(struct virtio_device *vdev, struct virtio_fs *fs)
> > +{
> > + struct virtio_fs_memremap_info *mi;
> > + struct dev_pagemap *pgmap;
> > + struct pci_dev *pci_dev;
> > + phys_addr_t phys_addr;
> > + size_t len;
> > + int ret;
> > +
> > + if (!IS_ENABLED(CONFIG_DAX_DRIVER))
> > + return 0;
> > +
> > + /* HACK implement VIRTIO shared memory regions instead of
> > + * directly accessing the PCI BAR from a virtio device driver.
> > + */
> > + pci_dev = container_of(vdev->dev.parent, struct pci_dev, dev);
> > +
> > + /* TODO Is this safe - the virtio_pci_* driver doesn't use managed
> > + * device APIs? */
> > + ret = pcim_enable_device(pci_dev);
> > + if (ret < 0)
> > + return ret;
> > +
> > + /* TODO handle case where device doesn't expose BAR? */
> > + ret = pci_request_region(pci_dev, VIRTIO_FS_WINDOW_BAR,
> > + "virtio-fs-window");
> > + if (ret < 0) {
> > + dev_err(&vdev->dev, "%s: failed to request window BAR\n",
> > + __func__);
> > + return ret;
> > + }
>
> Can we please have a generic virtio interface to map the address (the default can then
> fall back to PCI) instead of mapping a PCI bar? This would make it easier to implement
> virtio-ccw or virtio-mmio.
Yes, we'll define shared memory as a device resource in the VIRTIO
specification. It will become part of the device model, alongside
virtqueues, configuration space, etc. That means devices can use shared
memory without it being tied to PCI BARs explicitly.
But only the PCI transport will have a realization of shared memory
resources in the beginning. We need to work together to add this
feature to the ccw and mmio transports.
Stefan
* David Hildenbrand ([email protected]) wrote:
> [...]
> Let's discuss after the cleanup. In general I am not convinced this is
> the right thing to do. Using virtio-pci for anything other than pure
> transport smells like bad design to me (well, I am no virtio expert
> after all ;) ). No matter what PCI bars were designed for. If we can't
> get the same running with e.g. virtio-ccw or virtio-whatever, it is
> broken by design (or an addon that is tightly glued to virtio-pci, if
> that is the general idea).
I'm sure we can find alternatives for virtio-*, so I wouldn't expect
it to be glued to virtio-pci.
Dave
--
Dr. David Alan Gilbert / [email protected] / Manchester, UK
On 13.12.18 13:15, Dr. David Alan Gilbert wrote:
> * David Hildenbrand ([email protected]) wrote:
>> [...]
>
> I'm sure we can find alternatives for virtio-*, so I wouldn't expect
> it to be glued to virtio-pci.
>
> Dave
As s390x does not have the concept of memory-mapped I/O (RAM is RAM,
nothing else), this is not architected. virtio-ccw can therefore not
define anything similar. However, in virtual environments we
can do whatever we want on top of the pure transport (e.g. on the virtio
layer).
Conny can correct me if I am wrong.
--
Thanks,
David / dhildenb
On Thu, 13 Dec 2018 13:24:31 +0100
David Hildenbrand <[email protected]> wrote:
> On 13.12.18 13:15, Dr. David Alan Gilbert wrote:
> > [...]
>
> As s390x does not have the concept of memory-mapped I/O (RAM is RAM,
> nothing else), this is not architected. virtio-ccw can therefore not
> define anything similar. However, in virtual environments we
> can do whatever we want on top of the pure transport (e.g. on the virtio
> layer).
>
> Conny can correct me if I am wrong.
I don't think you're wrong, but I haven't read the code yet and I'm
therefore not aware of the purpose of this BAR.
Generally, if there is a memory location shared between host and guest,
we need a way to communicate its location, which will likely differ
between transports. For ccw, I could imagine a new channel command
dedicated to exchanging configuration information (similar to what
exists today to communicate the locations of virtqueues), but I'd
rather not go down this path.
Without reading the code/design further, can we use one of the
following instead of a BAR:
- a virtqueue;
- something in config space?
That would be implementable by any virtio transport.
Hi Stefan,
I love your patch! Yet something to improve:
[auto build test ERROR on fuse/for-next]
[also build test ERROR on v4.20-rc6]
[cannot apply to next-20181213]
[if your patch is applied to the wrong git tree, please drop us a note to help improve the system]
url: https://github.com/0day-ci/linux/commits/Vivek-Goyal/virtio-fs-shared-file-system-for-virtual-machines/20181211-103034
base: https://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse.git for-next
config: sh-allmodconfig (attached as .config)
compiler: sh4-linux-gnu-gcc (Debian 7.2.0-11) 7.2.0
reproduce:
wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
chmod +x ~/bin/make.cross
# save the attached .config to linux build tree
GCC_VERSION=7.2.0 make.cross ARCH=sh
All errors (new ones prefixed by >>):
fs/fuse/virtio_fs.c: In function 'virtio_fs_setup_dax':
>> fs/fuse/virtio_fs.c:465:8: error: implicit declaration of function 'pcim_enable_device'; did you mean 'pci_enable_device'? [-Werror=implicit-function-declaration]
ret = pcim_enable_device(pci_dev);
^~~~~~~~~~~~~~~~~~
pci_enable_device
>> fs/fuse/virtio_fs.c:470:8: error: implicit declaration of function 'pci_request_region'; did you mean 'pci_request_regions'? [-Werror=implicit-function-declaration]
ret = pci_request_region(pci_dev, VIRTIO_FS_WINDOW_BAR,
^~~~~~~~~~~~~~~~~~
pci_request_regions
In file included from include/linux/printk.h:336:0,
from include/linux/kernel.h:14,
from include/linux/list.h:9,
from include/linux/wait.h:7,
from include/linux/wait_bit.h:8,
from include/linux/fs.h:6,
from fs/fuse/virtio_fs.c:7:
fs/fuse/virtio_fs.c:528:22: warning: format '%llx' expects argument of type 'long long unsigned int', but argument 6 has type 'phys_addr_t {aka unsigned int}' [-Wformat=]
dev_dbg(&vdev->dev, "%s: window kaddr 0x%px phys_addr 0x%llx len %zu\n",
^
include/linux/dynamic_debug.h:135:39: note: in definition of macro 'dynamic_dev_dbg'
__dynamic_dev_dbg(&descriptor, dev, fmt, \
^~~
include/linux/device.h:1463:23: note: in expansion of macro 'dev_fmt'
dynamic_dev_dbg(dev, dev_fmt(fmt), ##__VA_ARGS__)
^~~~~~~
fs/fuse/virtio_fs.c:528:2: note: in expansion of macro 'dev_dbg'
dev_dbg(&vdev->dev, "%s: window kaddr 0x%px phys_addr 0x%llx len %zu\n",
^~~~~~~
At top level:
fs/fuse/virtio_fs.c:604:12: warning: 'virtio_fs_restore' defined but not used [-Wunused-function]
static int virtio_fs_restore(struct virtio_device *vdev)
^~~~~~~~~~~~~~~~~
fs/fuse/virtio_fs.c:599:12: warning: 'virtio_fs_freeze' defined but not used [-Wunused-function]
static int virtio_fs_freeze(struct virtio_device *vdev)
^~~~~~~~~~~~~~~~
cc1: some warnings being treated as errors
vim +465 fs/fuse/virtio_fs.c
445
446 static int virtio_fs_setup_dax(struct virtio_device *vdev, struct virtio_fs *fs)
447 {
448 struct virtio_fs_memremap_info *mi;
449 struct dev_pagemap *pgmap;
450 struct pci_dev *pci_dev;
451 phys_addr_t phys_addr;
452 size_t len;
453 int ret;
454
455 if (!IS_ENABLED(CONFIG_DAX_DRIVER))
456 return 0;
457
458 /* HACK implement VIRTIO shared memory regions instead of
459 * directly accessing the PCI BAR from a virtio device driver.
460 */
461 pci_dev = container_of(vdev->dev.parent, struct pci_dev, dev);
462
463 /* TODO Is this safe - the virtio_pci_* driver doesn't use managed
464 * device APIs? */
> 465 ret = pcim_enable_device(pci_dev);
466 if (ret < 0)
467 return ret;
468
469 /* TODO handle case where device doesn't expose BAR? */
> 470 ret = pci_request_region(pci_dev, VIRTIO_FS_WINDOW_BAR,
471 "virtio-fs-window");
472 if (ret < 0) {
473 dev_err(&vdev->dev, "%s: failed to request window BAR\n",
474 __func__);
475 return ret;
476 }
477
478 phys_addr = pci_resource_start(pci_dev, VIRTIO_FS_WINDOW_BAR);
479 len = pci_resource_len(pci_dev, VIRTIO_FS_WINDOW_BAR);
480
481 mi = devm_kzalloc(&pci_dev->dev, sizeof(*mi), GFP_KERNEL);
482 if (!mi)
483 return -ENOMEM;
484
485 init_completion(&mi->completion);
486 ret = percpu_ref_init(&mi->ref, virtio_fs_percpu_release, 0,
487 GFP_KERNEL);
488 if (ret < 0) {
489 dev_err(&vdev->dev, "%s: percpu_ref_init failed (%d)\n",
490 __func__, ret);
491 return ret;
492 }
493
494 ret = devm_add_action(&pci_dev->dev, virtio_fs_percpu_exit, mi);
495 if (ret < 0) {
496 percpu_ref_exit(&mi->ref);
497 return ret;
498 }
499
500 pgmap = &mi->pgmap;
501 pgmap->altmap_valid = false;
502 pgmap->ref = &mi->ref;
503 pgmap->type = MEMORY_DEVICE_FS_DAX;
504
505 /* Ideally we would directly use the PCI BAR resource but
506 * devm_memremap_pages() wants its own copy in pgmap. So
507 * initialize a struct resource from scratch (only the start
508 * and end fields will be used).
509 */
510 pgmap->res = (struct resource){
511 .name = "virtio-fs dax window",
512 .start = phys_addr,
513 .end = phys_addr + len,
514 };
515
516 fs->window_kaddr = devm_memremap_pages(&pci_dev->dev, pgmap);
517 if (IS_ERR(fs->window_kaddr))
518 return PTR_ERR(fs->window_kaddr);
519
520 ret = devm_add_action_or_reset(&pci_dev->dev, virtio_fs_percpu_kill,
521 &mi->ref);
522 if (ret < 0)
523 return ret;
524
525 fs->window_phys_addr = phys_addr;
526 fs->window_len = len;
527
528 dev_dbg(&vdev->dev, "%s: window kaddr 0x%px phys_addr 0x%llx len %zu\n",
529 __func__, fs->window_kaddr, phys_addr, len);
530
531 fs->dax_dev = alloc_dax(fs, NULL, &virtio_fs_dax_ops);
532 if (!fs->dax_dev)
533 return -ENOMEM;
534
535 return devm_add_action_or_reset(&vdev->dev, virtio_fs_cleanup_dax, fs);
536 }
537
---
0-DAY kernel test infrastructure Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all Intel Corporation
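The two errors above come from calling PCI-only helpers (`pcim_enable_device`, `pci_request_region`) in a build without CONFIG_PCI (sh-allmodconfig). One possible stop-gap, until the DAX window is communicated at the virtio layer rather than through a BAR, is to state the PCI dependency explicitly in Kconfig. The exact symbol names below are assumptions for illustration:

```kconfig
config VIRTIO_FS
	tristate "Virtio Filesystem"
	depends on FUSE_FS
	depends on PCI     # hypothetical stop-gap: the DAX window is
	                   # currently reached through a PCI BAR
	select VIRTIO
```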
On Mon, Dec 10, 2018 at 9:22 AM Vivek Goyal <[email protected]> wrote:
>
> From: Stefan Hajnoczi <[email protected]>
>
> Experimental QEMU code introduces an MMIO BAR for mapping portions of
> files in the virtio-fs device. Map this BAR so that FUSE DAX can access
> file contents from the host page cache.
FUSE DAX sounds terrifying, can you explain a bit more about what this is?
> The DAX window is accessed by the fs/dax.c infrastructure and must have
> struct pages (at least on x86). Use devm_memremap_pages() to map the
> DAX window PCI BAR and allocate struct page.
PCI BAR space is not cache coherent, what prevents these pages from
being used in paths that would do:
object = page_address(pfn_to_page(virtio_fs_pfn));
...?
* Dan Williams ([email protected]) wrote:
> On Mon, Dec 10, 2018 at 9:22 AM Vivek Goyal <[email protected]> wrote:
> >
> > From: Stefan Hajnoczi <[email protected]>
> >
> > Experimental QEMU code introduces an MMIO BAR for mapping portions of
> > files in the virtio-fs device. Map this BAR so that FUSE DAX can access
> > file contents from the host page cache.
>
> FUSE DAX sounds terrifying, can you explain a bit more about what this is?
We've got a guest running in QEMU, it sees an emulated PCI device;
that runs a FUSE protocol over virtio on that PCI device, but also has
a trick where via commands sent over the virtio queue associated with that device,
(fragments of) host files get mmap'd into the qemu virtual memory that corresponds
to the kvm slot exposed to the guest for that bar.
The guest sees those chunks in that BAR, and thus you can read/write
to the host file by directly writing into that BAR.
> > The DAX window is accessed by the fs/dax.c infrastructure and must have
> > struct pages (at least on x86). Use devm_memremap_pages() to map the
> > DAX window PCI BAR and allocate struct page.
>
> PCI BAR space is not cache coherent,
Note that no real PCI infrastructure is involved - this is all emulated
devices, backed by mmap'd files on the host qemu process.
Dave
> what prevents these pages from
> being used in paths that would do:
>
> object = page_address(pfn_to_page(virtio_fs_pfn));
>
> ...?
--
Dr. David Alan Gilbert / [email protected] / Manchester, UK
On Thu, Dec 13, 2018 at 12:09 PM Dr. David Alan Gilbert
<[email protected]> wrote:
>
> * Dan Williams ([email protected]) wrote:
> > [...]
> > FUSE DAX sounds terrifying, can you explain a bit more about what this is?
>
> We've got a guest running in QEMU, it sees an emulated PCI device;
> that runs a FUSE protocol over virtio on that PCI device, but also has
> a trick where via commands sent over the virtio queue associated with that device,
> (fragments of) host files get mmap'd into the qemu virtual memory that corresponds
> to the kvm slot exposed to the guest for that bar.
>
> The guest sees those chunks in that BAR, and thus you can read/write
> to the host file by directly writing into that BAR.
Ok so it's all software emulated and there won't be hardware DMA
initiated by the guest to that address? I.e. if the host file gets
truncated / hole-punched the guest would just cause a refault and the
filesystem could fill in the block, or the guest is expected to die if
the fault to the truncated file range results in SIGBUS.
> > > The DAX window is accessed by the fs/dax.c infrastructure and must have
> > > struct pages (at least on x86). Use devm_memremap_pages() to map the
> > > DAX window PCI BAR and allocate struct page.
> >
> > PCI BAR space is not cache coherent,
>
> Note that no real PCI infrastructure is involved - this is all emulated
> devices, backed by mmap'd files on the host qemu process.
Ok, terror level decreased.
On Thu, Dec 13, 2018 at 12:15:51PM -0800, Dan Williams wrote:
> On Thu, Dec 13, 2018 at 12:09 PM Dr. David Alan Gilbert
> <[email protected]> wrote:
> > [...]
>
> Ok so it's all software emulated and there won't be hardware DMA
> initiated by the guest to that address?
That's my understanding.
> I.e. if the host file gets
> truncated / hole-punched the guest would just cause a refault and the
> filesystem could fill in the block,
Right
> or the guest is expected to die if
> the fault to the truncated file range results in SIGBUS.
Are you referring to the case where a file page is mapped in qemu and
another guest/process truncates that page, so that when qemu tries to
access it, it will get SIGBUS? Have not tried it, will give it a try.
Not sure what happens when QEMU receives SIGBUS.
Having said that, this is not different from the case of one process
mapping a file and another process truncating the file and first process
getting SIGBUS, right?
Thanks
Vivek
On Thu, Dec 13, 2018 at 03:40:52PM -0500, Vivek Goyal wrote:
> On Thu, Dec 13, 2018 at 12:15:51PM -0800, Dan Williams wrote:
> > [...]
> >
> > Ok so it's all software emulated and there won't be hardware DMA
> > initiated by the guest to that address?
>
> That's my understanding.
>
> > I.e. if the host file gets
> > truncated / hole-punched the guest would just cause a refault and the
> > filesystem could fill in the block,
>
> Right
>
> > or the guest is expected to die if
> > the fault to the truncated file range results in SIGBUS.
>
> Are you referring to the case where a file page is mapped in qemu and
> another guest/process truncates that page, so that when qemu tries to
> access it, it will get SIGBUS? Have not tried it, will give it a try.
> Not sure what happens when QEMU receives SIGBUS.
>
> Having said that, this is not different from the case of one process
> mapping a file and another process truncating the file and first process
> getting SIGBUS, right?
Ok, tried this and guest process hangs.
Stefan, dgilbert, this reminds me that we have faced this issue during
our testing and we decided that this will need some fixing in KVM. I
even put this in the changelog of the patch with subject "fuse: Take
inode lock for dax inode truncation"
"Another problem is, if we setup a mapping in fuse_iomap_begin(), and
file gets truncated and dax read/write happens, KVM currently hangs.
It tries to fault in a page which does not exist on host (file got
truncated). It probably requires fixing in KVM."
Not sure what should happen though when qemu receives SIGBUS in this
case.
Thanks
Vivek
* Vivek Goyal ([email protected]) wrote:
> [...]
>
> Ok, tried this and guest process hangs.
>
> Stefan, dgilbert, this reminds me that we have faced this issue during
> our testing and we decided that this will need some fixing in KVM. I
> even put this in the changelog of the patch with subject "fuse: Take
> inode lock for dax inode truncation"
>
> "Another problem is, if we setup a mapping in fuse_iomap_begin(), and
> file gets truncated and dax read/write happens, KVM currently hangs.
> It tries to fault in a page which does not exist on host (file got
> truncated). It probably requires fixing in KVM."
>
> Not sure what should happen though when qemu receives SIGBUS in this
> case.
Yes, and I noted it in the TODO in my qemu patch posting.
We need to figure out what we want the guest to see in this case and
figure out how to make QEMU/kvm fix it up so that the guest doesn't
see anything odd.
Dave
> Thanks
> Vivek
--
Dr. David Alan Gilbert / [email protected] / Manchester, UK
On Thu, Dec 13, 2018 at 01:38:23PM +0100, Cornelia Huck wrote:
> On Thu, 13 Dec 2018 13:24:31 +0100
> David Hildenbrand <[email protected]> wrote:
>
> > On 13.12.18 13:15, Dr. David Alan Gilbert wrote:
> > > * David Hildenbrand ([email protected]) wrote:
> > >> On 13.12.18 11:00, Dr. David Alan Gilbert wrote:
> > >>> * David Hildenbrand ([email protected]) wrote:
> > >>>> On 13.12.18 10:13, Dr. David Alan Gilbert wrote:
> > >>>>> * David Hildenbrand ([email protected]) wrote:
> > >>>>>> On 10.12.18 18:12, Vivek Goyal wrote:
> > >>>>>>> Instead of assuming we had the fixed bar for the cache, use the
> > >>>>>>> value from the capabilities.
> > >>>>>>>
> > >>>>>>> Signed-off-by: Dr. David Alan Gilbert <[email protected]>
> > >>>>>>> ---
> > >>>>>>> fs/fuse/virtio_fs.c | 32 +++++++++++++++++---------------
> > >>>>>>> 1 file changed, 17 insertions(+), 15 deletions(-)
> > >>>>>>>
> > >>>>>>> diff --git a/fs/fuse/virtio_fs.c b/fs/fuse/virtio_fs.c
> > >>>>>>> index 60d496c16841..55bac1465536 100644
> > >>>>>>> --- a/fs/fuse/virtio_fs.c
> > >>>>>>> +++ b/fs/fuse/virtio_fs.c
> > >>>>>>> @@ -14,11 +14,6 @@
> > >>>>>>> #include <uapi/linux/virtio_pci.h>
> > >>>>>>> #include "fuse_i.h"
> > >>>>>>>
> > >>>>>>> -enum {
> > >>>>>>> - /* PCI BAR number of the virtio-fs DAX window */
> > >>>>>>> - VIRTIO_FS_WINDOW_BAR = 2,
> > >>>>>>> -};
> > >>>>>>> -
> > >>>>>>> /* List of virtio-fs device instances and a lock for the list */
> > >>>>>>> static DEFINE_MUTEX(virtio_fs_mutex);
> > >>>>>>> static LIST_HEAD(virtio_fs_instances);
> > >>>>>>> @@ -518,7 +513,7 @@ static int virtio_fs_setup_dax(struct virtio_device *vdev, struct virtio_fs *fs)
> > >>>>>>> struct dev_pagemap *pgmap;
> > >>>>>>> struct pci_dev *pci_dev;
> > >>>>>>> phys_addr_t phys_addr;
> > >>>>>>> - size_t len;
> > >>>>>>> + size_t bar_len;
> > >>>>>>> int ret;
> > >>>>>>> u8 have_cache, cache_bar;
> > >>>>>>> u64 cache_offset, cache_len;
> > >>>>>>> @@ -551,17 +546,13 @@ static int virtio_fs_setup_dax(struct virtio_device *vdev, struct virtio_fs *fs)
> > >>>>>>> }
> > >>>>>>>
> > >>>>>>> /* TODO handle case where device doesn't expose BAR? */
> > >>>>>>
> > >>>>>> For virtio-pmem we decided to not go via BARs as this would effectively
> > >>>>>> make it only usable for virtio-pci implementers. Instead, we are going
> > >>>>>> to export the applicable physical device region directly (e.g.
> > >>>>>> phys_start, phys_size in virtio config), so it is decoupled from PCI
> > >>>>>> details. Doing the same for virtio-fs would allow e.g. also virtio-ccw
> > >>>>>> to make eventually use of this.
> > >>>>>
> > >>>>> That makes it a very odd looking PCI device; I can see that with
> > >>>>> virtio-pmem it makes some sense, given that its job is to expose
> > >>>>> arbitrary chunks of memory.
> > >>>>>
> > >>>>> Dave
> > >>>>
> > >>>> Well, the fact that you are
> > >>>>
> > >>>> - including <uapi/linux/virtio_pci.h>
> > >>>> - adding pci related code
> > >>>>
> > >>>> in/to fs/fuse/virtio_fs.c
> > >>>>
> > >>>> tells me that these properties might be better communicated on the
> > >>>> virtio layer, not on the PCI layer.
> > >>>>
> > >>>> Or do you really want to glue virtio-fs to virtio-pci for all eternity?
> > >>>
> > >>> No, these need cleaning up; and the split within the bar
> > >>> is probably going to change to be communicated via virtio layer
> > >>> rather than pci capabilities. However, I don't want to make our PCI
> > >>> device look odd, just to make portability to non-PCI devices - so it's
> > >>> right to make the split appropriately, but still to use PCI bars
> > >>> for what they were designed for.
> > >>>
> > >>> Dave
> > >>
> > >> Let's discuss after the cleanup. In general I am not convinced this is
> > >> the right thing to do. Using virtio-pci for anything else than pure
> > >> transport smells like bad design to me (well, I am no virtio expert
> > >> after all ;) ). No matter what PCI bars were designed for. If we can't
> > >> get the same running with e.g. virtio-ccw or virtio-whatever, it is
> > >> broken by design (or an addon that is tightly glued to virtio-pci, if
> > >> that is the general idea).
> > >
> > > I'm sure we can find alternatives for virtio-*, so I wouldn't expect
> > > it to be glued to virtio-pci.
> > >
> > > Dave
> >
> > As s390x does not have the concept of memory mapped io (RAM is RAM,
> > nothing else), this is not architected. virtio-ccw can therefore not
> > define anything similar like that. However, in virtual environments we
> > can do whatever we want on top of the pure transport (e.g. on the virtio
> > layer).
> >
> > Conny can correct me if I am wrong.
>
> I don't think you're wrong, but I haven't read the code yet and I'm
> therefore not aware of the purpose of this BAR.
>
> Generally, if there is a memory location shared between host and guest,
> we need a way to communicate its location, which will likely differ
> between transports. For ccw, I could imagine a new channel command
> dedicated to exchanging configuration information (similar to what
> exists today to communicate the locations of virtqueues), but I'd
> rather not go down this path.
>
> Without reading the code/design further, can we use one of the
> following instead of a BAR:
> - a virtqueue;
> - something in config space?
> That would be implementable by any virtio transport.
The way I think about this is that we wish to extend the VIRTIO device
model with the concept of shared memory. virtio-fs, virtio-gpu, and
virtio-vhost-user all have requirements for shared memory.
This seems like a transport-level issue to me. PCI supports
memory-mapped I/O and that's the right place to do it. If you try to
put it into config space or the virtqueue, you'll end up with something
that cannot be realized as a PCI device because it bypasses PCI bus
address translation.
If CCW needs a side-channel, that's fine. But that side-channel is a
CCW-specific mechanism and probably doesn't apply to all other
transports.
Stefan
On Fri, 14 Dec 2018 13:44:34 +0000
Stefan Hajnoczi <[email protected]> wrote:
> On Thu, Dec 13, 2018 at 01:38:23PM +0100, Cornelia Huck wrote:
> > On Thu, 13 Dec 2018 13:24:31 +0100
> > David Hildenbrand <[email protected]> wrote:
> >
> > > On 13.12.18 13:15, Dr. David Alan Gilbert wrote:
> > > > * David Hildenbrand ([email protected]) wrote:
> > > >> On 13.12.18 11:00, Dr. David Alan Gilbert wrote:
> > > >>> * David Hildenbrand ([email protected]) wrote:
> > > >>>> On 13.12.18 10:13, Dr. David Alan Gilbert wrote:
> > > >>>>> * David Hildenbrand ([email protected]) wrote:
> > > >>>>>> On 10.12.18 18:12, Vivek Goyal wrote:
> > > >>>>>>> Instead of assuming we had the fixed bar for the cache, use the
> > > >>>>>>> value from the capabilities.
> > > >>>>>>>
> > > >>>>>>> Signed-off-by: Dr. David Alan Gilbert <[email protected]>
> > > >>>>>>> [...]
> > > >>>>>>> /* TODO handle case where device doesn't expose BAR? */
> > > >>>>>>
> > > >>>>>> For virtio-pmem we decided to not go via BARs as this would effectively
> > > >>>>>> make it only usable for virtio-pci implementers. Instead, we are going
> > > >>>>>> to export the applicable physical device region directly (e.g.
> > > >>>>>> phys_start, phys_size in virtio config), so it is decoupled from PCI
> > > >>>>>> details. Doing the same for virtio-fs would allow e.g. also virtio-ccw
> > > >>>>>> to make eventually use of this.
> > > >>>>>
> > > >>>>> That makes it a very odd looking PCI device; I can see that with
> > > >>>>> virtio-pmem it makes some sense, given that its job is to expose
> > > >>>>> arbitrary chunks of memory.
> > > >>>>>
> > > >>>>> Dave
> > > >>>>
> > > >>>> Well, the fact that you are
> > > >>>>
> > > >>>> - including <uapi/linux/virtio_pci.h>
> > > >>>> - adding pci related code
> > > >>>>
> > > >>>> in/to fs/fuse/virtio_fs.c
> > > >>>>
> > > >>>> tells me that these properties might be better communicated on the
> > > >>>> virtio layer, not on the PCI layer.
> > > >>>>
> > > >>>> Or do you really want to glue virtio-fs to virtio-pci for all eternity?
> > > >>>
> > > >>> No, these need cleaning up; and the split within the bar
> > > >>> is probably going to change to be communicated via virtio layer
> > > >>> rather than pci capabilities. However, I don't want to make our PCI
> > > >>> device look odd, just to make portability to non-PCI devices - so it's
> > > >>> right to make the split appropriately, but still to use PCI bars
> > > >>> for what they were designed for.
> > > >>>
> > > >>> Dave
> > > >>
> > > >> Let's discuss after the cleanup. In general I am not convinced this is
> > > >> the right thing to do. Using virtio-pci for anything else than pure
> > > >> transport smells like bad design to me (well, I am no virtio expert
> > > >> after all ;) ). No matter what PCI bars were designed for. If we can't
> > > >> get the same running with e.g. virtio-ccw or virtio-whatever, it is
> > > >> broken by design (or an addon that is tightly glued to virtio-pci, if
> > > >> that is the general idea).
> > > >
> > > > I'm sure we can find alternatives for virtio-*, so I wouldn't expect
> > > > it to be glued to virtio-pci.
> > > >
> > > > Dave
> > >
> > > As s390x does not have the concept of memory mapped io (RAM is RAM,
> > > > nothing else), this is not architected. virtio-ccw can therefore not
> > > define anything similar like that. However, in virtual environments we
> > > can do whatever we want on top of the pure transport (e.g. on the virtio
> > > layer).
> > >
> > > Conny can correct me if I am wrong.
> >
> > I don't think you're wrong, but I haven't read the code yet and I'm
> > therefore not aware of the purpose of this BAR.
> >
> > Generally, if there is a memory location shared between host and guest,
> > we need a way to communicate its location, which will likely differ
> > between transports. For ccw, I could imagine a new channel command
> > dedicated to exchanging configuration information (similar to what
> > exists today to communicate the locations of virtqueues), but I'd
> > rather not go down this path.
> >
> > Without reading the code/design further, can we use one of the
> > following instead of a BAR:
> > - a virtqueue;
> > - something in config space?
> > That would be implementable by any virtio transport.
>
> The way I think about this is that we wish to extend the VIRTIO device
> model with the concept of shared memory. virtio-fs, virtio-gpu, and
> virtio-vhost-user all have requirements for shared memory.
>
> This seems like a transport-level issue to me. PCI supports
> memory-mapped I/O and that's the right place to do it. If you try to
> put it into config space or the virtqueue, you'll end up with something
> that cannot be realized as a PCI device because it bypasses PCI bus
> address translation.
>
> If CCW needs a side-channel, that's fine. But that side-channel is a
> CCW-specific mechanism and probably doesn't apply to all other
> transports.
But virtio-gpu works with ccw right now (I haven't checked what it
uses); can virtio-fs use an equivalent method?
If there's a more generic case to be made for extending virtio devices
with a way to handle shared memory, a ccw for that would be fine. I
just want to avoid adding new ccws for everything as the namespace is
not infinite.
* Cornelia Huck ([email protected]) wrote:
> On Fri, 14 Dec 2018 13:44:34 +0000
> Stefan Hajnoczi <[email protected]> wrote:
>
> > On Thu, Dec 13, 2018 at 01:38:23PM +0100, Cornelia Huck wrote:
> > > On Thu, 13 Dec 2018 13:24:31 +0100
> > > David Hildenbrand <[email protected]> wrote:
> > >
> > > > On 13.12.18 13:15, Dr. David Alan Gilbert wrote:
> > > > > * David Hildenbrand ([email protected]) wrote:
> > > > >> On 13.12.18 11:00, Dr. David Alan Gilbert wrote:
> > > > >>> * David Hildenbrand ([email protected]) wrote:
> > > > >>>> On 13.12.18 10:13, Dr. David Alan Gilbert wrote:
> > > > >>>>> * David Hildenbrand ([email protected]) wrote:
> > > > >>>>>> On 10.12.18 18:12, Vivek Goyal wrote:
> > > > >>>>>>> Instead of assuming we had the fixed bar for the cache, use the
> > > > >>>>>>> value from the capabilities.
> > > > >>>>>>>
> > > > >>>>>>> Signed-off-by: Dr. David Alan Gilbert <[email protected]>
> > > > >>>>>>> [...]
> > > > >>>>>>> /* TODO handle case where device doesn't expose BAR? */
> > > > >>>>>>
> > > > >>>>>> For virtio-pmem we decided to not go via BARs as this would effectively
> > > > >>>>>> make it only usable for virtio-pci implementers. Instead, we are going
> > > > >>>>>> to export the applicable physical device region directly (e.g.
> > > > >>>>>> phys_start, phys_size in virtio config), so it is decoupled from PCI
> > > > >>>>>> details. Doing the same for virtio-fs would allow e.g. also virtio-ccw
> > > > >>>>>> to make eventually use of this.
> > > > >>>>>
> > > > >>>>> That makes it a very odd looking PCI device; I can see that with
> > > > >>>>> virtio-pmem it makes some sense, given that its job is to expose
> > > > >>>>> arbitrary chunks of memory.
> > > > >>>>>
> > > > >>>>> Dave
> > > > >>>>
> > > > >>>> Well, the fact that you are
> > > > >>>>
> > > > >>>> - including <uapi/linux/virtio_pci.h>
> > > > >>>> - adding pci related code
> > > > >>>>
> > > > >>>> in/to fs/fuse/virtio_fs.c
> > > > >>>>
> > > > >>>> tells me that these properties might be better communicated on the
> > > > >>>> virtio layer, not on the PCI layer.
> > > > >>>>
> > > > >>>> Or do you really want to glue virtio-fs to virtio-pci for all eternity?
> > > > >>>
> > > > >>> No, these need cleaning up; and the split within the bar
> > > > >>> is probably going to change to be communicated via virtio layer
> > > > >>> rather than pci capabilities. However, I don't want to make our PCI
> > > > >>> device look odd, just to make portability to non-PCI devices - so it's
> > > > >>> right to make the split appropriately, but still to use PCI bars
> > > > >>> for what they were designed for.
> > > > >>>
> > > > >>> Dave
> > > > >>
> > > > >> Let's discuss after the cleanup. In general I am not convinced this is
> > > > >> the right thing to do. Using virtio-pci for anything else than pure
> > > > >> transport smells like bad design to me (well, I am no virtio expert
> > > > >> after all ;) ). No matter what PCI bars were designed for. If we can't
> > > > >> get the same running with e.g. virtio-ccw or virtio-whatever, it is
> > > > >> broken by design (or an addon that is tightly glued to virtio-pci, if
> > > > >> that is the general idea).
> > > > >
> > > > > I'm sure we can find alternatives for virtio-*, so I wouldn't expect
> > > > > it to be glued to virtio-pci.
> > > > >
> > > > > Dave
> > > >
> > > > As s390x does not have the concept of memory mapped io (RAM is RAM,
> > > > nothing else), this is not architected. virtio-ccw can therefore not
> > > > define anything similar like that. However, in virtual environments we
> > > > can do whatever we want on top of the pure transport (e.g. on the virtio
> > > > layer).
> > > >
> > > > Conny can correct me if I am wrong.
> > >
> > > I don't think you're wrong, but I haven't read the code yet and I'm
> > > therefore not aware of the purpose of this BAR.
> > >
> > > Generally, if there is a memory location shared between host and guest,
> > > we need a way to communicate its location, which will likely differ
> > > between transports. For ccw, I could imagine a new channel command
> > > dedicated to exchanging configuration information (similar to what
> > > exists today to communicate the locations of virtqueues), but I'd
> > > rather not go down this path.
> > >
> > > Without reading the code/design further, can we use one of the
> > > following instead of a BAR:
> > > - a virtqueue;
> > > - something in config space?
> > > That would be implementable by any virtio transport.
> >
> > The way I think about this is that we wish to extend the VIRTIO device
> > model with the concept of shared memory. virtio-fs, virtio-gpu, and
> > virtio-vhost-user all have requirements for shared memory.
> >
> > This seems like a transport-level issue to me. PCI supports
> > memory-mapped I/O and that's the right place to do it. If you try to
> > put it into config space or the virtqueue, you'll end up with something
> > that cannot be realized as a PCI device because it bypasses PCI bus
> > address translation.
> >
> > If CCW needs a side-channel, that's fine. But that side-channel is a
> > CCW-specific mechanism and probably doesn't apply to all other
> > transports.
>
> But virtio-gpu works with ccw right now (I haven't checked what it
> uses); can virtio-fs use an equivalent method?
>
> If there's a more generic case to be made for extending virtio devices
> with a way to handle shared memory, a ccw for that would be fine. I
> just want to avoid adding new ccws for everything as the namespace is
> not infinite.
In our case we've got somewhere between 0..3 ranges of memory, and I was
specifying them as PCI capabilities; however Gerd's suggestion was that
it would be better to just use 1 bar and then have something as part of
virtio or the like to split them up.
If we do that, then we could have something of the form
(index, base, length)
for each of the regions, where in the PCI case 'index' means BAR and
in CCW it means something else. (For mmio it's probably irrelevant and
the base is probably a physical address).
Dave
--
Dr. David Alan Gilbert / [email protected] / Manchester, UK
Hi Vivek,
I love your patch! Yet something to improve:
[auto build test ERROR on fuse/for-next]
[also build test ERROR on v4.20-rc6]
[cannot apply to next-20181214]
[if your patch is applied to the wrong git tree, please drop us a note to help improve the system]
url: https://github.com/0day-ci/linux/commits/Vivek-Goyal/virtio-fs-shared-file-system-for-virtual-machines/20181211-103034
base: https://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse.git for-next
config: nds32-defconfig (attached as .config)
compiler: nds32le-linux-gcc (GCC) 6.4.0
reproduce:
wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
chmod +x ~/bin/make.cross
# save the attached .config to linux build tree
GCC_VERSION=6.4.0 make.cross ARCH=nds32
All errors (new ones prefixed by >>):
fs/fuse/inode.o: In function `fuse_fill_super_common':
>> inode.c:(.text+0x1c00): undefined reference to `dax_read_lock'
inode.c:(.text+0x1c04): undefined reference to `dax_read_lock'
>> inode.c:(.text+0x1c22): undefined reference to `dax_direct_access'
inode.c:(.text+0x1c26): undefined reference to `dax_direct_access'
>> inode.c:(.text+0x1c32): undefined reference to `dax_read_unlock'
inode.c:(.text+0x1c36): undefined reference to `dax_read_unlock'
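Undefined references to dax_read_lock()/dax_direct_access()/dax_read_unlock() are the usual symptom of building the new FUSE DAX code on a config where the DAX core is not built in. A plausible fix is a Kconfig dependency, sketched below; the option names and placement are assumptions to be checked against the actual series, not a confirmed fix.

```kconfig
# Hypothetical fragment for fs/fuse/Kconfig
config VIRTIO_FS
	tristate "Virtio Filesystem"
	depends on FUSE_FS
	select DAX
	help
	  FUSE-over-virtio file system. Selecting DAX pulls in the
	  dax_read_lock()/dax_direct_access() helpers used by the
	  DAX window setup code.
```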
---
0-DAY kernel test infrastructure Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all Intel Corporation
On 14.12.18 14:44, Stefan Hajnoczi wrote:
> On Thu, Dec 13, 2018 at 01:38:23PM +0100, Cornelia Huck wrote:
>> On Thu, 13 Dec 2018 13:24:31 +0100
>> David Hildenbrand <[email protected]> wrote:
>>
>>> On 13.12.18 13:15, Dr. David Alan Gilbert wrote:
>>>> * David Hildenbrand ([email protected]) wrote:
>>>>> On 13.12.18 11:00, Dr. David Alan Gilbert wrote:
>>>>>> * David Hildenbrand ([email protected]) wrote:
>>>>>>> On 13.12.18 10:13, Dr. David Alan Gilbert wrote:
>>>>>>>> * David Hildenbrand ([email protected]) wrote:
>>>>>>>>> On 10.12.18 18:12, Vivek Goyal wrote:
>>>>>>>>>> Instead of assuming we had the fixed bar for the cache, use the
>>>>>>>>>> value from the capabilities.
>>>>>>>>>>
>>>>>>>>>> Signed-off-by: Dr. David Alan Gilbert <[email protected]>
>>>>>>>>>> [...]
>>>>>>>>>> /* TODO handle case where device doesn't expose BAR? */
>>>>>>>>>
>>>>>>>>> For virtio-pmem we decided to not go via BARs as this would effectively
>>>>>>>>> make it only usable for virtio-pci implementers. Instead, we are going
>>>>>>>>> to export the applicable physical device region directly (e.g.
>>>>>>>>> phys_start, phys_size in virtio config), so it is decoupled from PCI
>>>>>>>>> details. Doing the same for virtio-fs would allow e.g. also virtio-ccw
>>>>>>>>> to make eventually use of this.
>>>>>>>>
>>>>>>>> That makes it a very odd looking PCI device; I can see that with
>>>>>>>> virtio-pmem it makes some sense, given that its job is to expose
>>>>>>>> arbitrary chunks of memory.
>>>>>>>>
>>>>>>>> Dave
>>>>>>>
>>>>>>> Well, the fact that you are
>>>>>>>
>>>>>>> - including <uapi/linux/virtio_pci.h>
>>>>>>> - adding pci related code
>>>>>>>
>>>>>>> in/to fs/fuse/virtio_fs.c
>>>>>>>
>>>>>>> tells me that these properties might be better communicated on the
>>>>>>> virtio layer, not on the PCI layer.
>>>>>>>
>>>>>>> Or do you really want to glue virtio-fs to virtio-pci for all eternity?
>>>>>>
>>>>>> No, these need cleaning up; and the split within the bar
>>>>>> is probably going to change to be communicated via virtio layer
>>>>>> rather than pci capabilities. However, I don't want to make our PCI
>>>>>> device look odd, just to make portability to non-PCI devices - so it's
>>>>>> right to make the split appropriately, but still to use PCI bars
>>>>>> for what they were designed for.
>>>>>>
>>>>>> Dave
>>>>>
>>>>> Let's discuss after the cleanup. In general I am not convinced this is
>>>>> the right thing to do. Using virtio-pci for anything else than pure
>>>>> transport smells like bad design to me (well, I am no virtio expert
>>>>> after all ;) ). No matter what PCI bars were designed for. If we can't
>>>>> get the same running with e.g. virtio-ccw or virtio-whatever, it is
>>>>> broken by design (or an addon that is tightly glued to virtio-pci, if
>>>>> that is the general idea).
>>>>
>>>> I'm sure we can find alternatives for virtio-*, so I wouldn't expect
>>>> it to be glued to virtio-pci.
>>>>
>>>> Dave
>>>
>>> As s390x does not have the concept of memory mapped io (RAM is RAM,
>>> nothing else), this is not architected. virtio-ccw can therefore not
>>> define anything similar like that. However, in virtual environments we
>>> can do whatever we want on top of the pure transport (e.g. on the virtio
>>> layer).
>>>
>>> Conny can correct me if I am wrong.
>>
>> I don't think you're wrong, but I haven't read the code yet and I'm
>> therefore not aware of the purpose of this BAR.
>>
>> Generally, if there is a memory location shared between host and guest,
>> we need a way to communicate its location, which will likely differ
>> between transports. For ccw, I could imagine a new channel command
>> dedicated to exchanging configuration information (similar to what
>> exists today to communicate the locations of virtqueues), but I'd
>> rather not go down this path.
>>
>> Without reading the code/design further, can we use one of the
>> following instead of a BAR:
>> - a virtqueue;
>> - something in config space?
>> That would be implementable by any virtio transport.
>
> The way I think about this is that we wish to extend the VIRTIO device
> model with the concept of shared memory. virtio-fs, virtio-gpu, and
> virtio-vhost-user all have requirements for shared memory.
>
> This seems like a transport-level issue to me. PCI supports
> memory-mapped I/O and that's the right place to do it. If you try to
> put it into config space or the virtqueue, you'll end up with something
> that cannot be realized as a PCI device because it bypasses PCI bus
> address translation.
>
> If CCW needs a side-channel, that's fine. But that side-channel is a
> CCW-specific mechanism and probably doesn't apply to all other
> transports.
>
> Stefan
>
I think the problem is more fundamental. There is no iommu. Whatever
shared region you want to indicate, you want it to be assigned a memory
region in guest physical memory. Like a DIMM/NVDIMM. And this should be
different to the concept of a BAR. Or am I missing something?
I am ok with using whatever other channel to transport such information.
But I believe this is different to a typical BAR. (I wish I knew more
about PCI internals ;) ).
I would also like to know how shared memory works as of now for e.g.
virtio-gpu.
--
Thanks,
David / dhildenb
On Fri, Dec 14, 2018 at 02:50:58PM +0100, Cornelia Huck wrote:
> On Fri, 14 Dec 2018 13:44:34 +0000
> Stefan Hajnoczi <[email protected]> wrote:
>
> > On Thu, Dec 13, 2018 at 01:38:23PM +0100, Cornelia Huck wrote:
> > > On Thu, 13 Dec 2018 13:24:31 +0100
> > > David Hildenbrand <[email protected]> wrote:
> > >
> > > > On 13.12.18 13:15, Dr. David Alan Gilbert wrote:
> > > > > * David Hildenbrand ([email protected]) wrote:
> > > > >> On 13.12.18 11:00, Dr. David Alan Gilbert wrote:
> > > > >>> * David Hildenbrand ([email protected]) wrote:
> > > > >>>> On 13.12.18 10:13, Dr. David Alan Gilbert wrote:
> > > > >>>>> * David Hildenbrand ([email protected]) wrote:
> > > > >>>>>> On 10.12.18 18:12, Vivek Goyal wrote:
> > > > >>>>>>> Instead of assuming we had the fixed bar for the cache, use the
> > > > >>>>>>> value from the capabilities.
> > > > >>>>>>>
> > > > >>>>>>> Signed-off-by: Dr. David Alan Gilbert <[email protected]>
> > > > >>>>>>> ---
> > > > >>>>>>> fs/fuse/virtio_fs.c | 32 +++++++++++++++++---------------
> > > > >>>>>>> 1 file changed, 17 insertions(+), 15 deletions(-)
> > > > >>>>>>>
> > > > >>>>>>> diff --git a/fs/fuse/virtio_fs.c b/fs/fuse/virtio_fs.c
> > > > >>>>>>> index 60d496c16841..55bac1465536 100644
> > > > >>>>>>> --- a/fs/fuse/virtio_fs.c
> > > > >>>>>>> +++ b/fs/fuse/virtio_fs.c
> > > > >>>>>>> @@ -14,11 +14,6 @@
> > > > >>>>>>> #include <uapi/linux/virtio_pci.h>
> > > > >>>>>>> #include "fuse_i.h"
> > > > >>>>>>>
> > > > >>>>>>> -enum {
> > > > >>>>>>> - /* PCI BAR number of the virtio-fs DAX window */
> > > > >>>>>>> - VIRTIO_FS_WINDOW_BAR = 2,
> > > > >>>>>>> -};
> > > > >>>>>>> -
> > > > >>>>>>> /* List of virtio-fs device instances and a lock for the list */
> > > > >>>>>>> static DEFINE_MUTEX(virtio_fs_mutex);
> > > > >>>>>>> static LIST_HEAD(virtio_fs_instances);
> > > > >>>>>>> @@ -518,7 +513,7 @@ static int virtio_fs_setup_dax(struct virtio_device *vdev, struct virtio_fs *fs)
> > > > >>>>>>> struct dev_pagemap *pgmap;
> > > > >>>>>>> struct pci_dev *pci_dev;
> > > > >>>>>>> phys_addr_t phys_addr;
> > > > >>>>>>> - size_t len;
> > > > >>>>>>> + size_t bar_len;
> > > > >>>>>>> int ret;
> > > > >>>>>>> u8 have_cache, cache_bar;
> > > > >>>>>>> u64 cache_offset, cache_len;
> > > > >>>>>>> @@ -551,17 +546,13 @@ static int virtio_fs_setup_dax(struct virtio_device *vdev, struct virtio_fs *fs)
> > > > >>>>>>> }
> > > > >>>>>>>
> > > > >>>>>>> /* TODO handle case where device doesn't expose BAR? */
> > > > >>>>>>
> > > > >>>>>> For virtio-pmem we decided to not go via BARs as this would effectively
> > > > >>>>>> make it only usable for virtio-pci implementers. Instead, we are going
> > > > >>>>>> to export the applicable physical device region directly (e.g.
> > > > >>>>>> phys_start, phys_size in virtio config), so it is decoupled from PCI
> > > > >>>>>> details. Doing the same for virtio-fs would allow e.g. also virtio-ccw
> > > > >>>>>> to make eventually use of this.
> > > > >>>>>
> > > > >>>>> That makes it a very odd looking PCI device; I can see that with
> > > > >>>>> virtio-pmem it makes some sense, given that its job is to expose
> > > > >>>>> arbitrary chunks of memory.
> > > > >>>>>
> > > > >>>>> Dave
> > > > >>>>
> > > > >>>> Well, the fact that you are
> > > > >>>>
> > > > >>>> - including <uapi/linux/virtio_pci.h>
> > > > >>>> - adding pci related code
> > > > >>>>
> > > > >>>> in/to fs/fuse/virtio_fs.c
> > > > >>>>
> > > > >>>> tells me that these properties might be better communicated on the
> > > > >>>> virtio layer, not on the PCI layer.
> > > > >>>>
> > > > >>>> Or do you really want to glue virtio-fs to virtio-pci for all eternity?
> > > > >>>
> > > > >>> No, these need cleaning up; and the split within the bar
> > > > >>> is probably going to change to be communicated via virtio layer
> > > > >>> rather than pci capabilities. However, I don't want to make our PCI
> > > > >>> device look odd just for the sake of portability to non-PCI devices - so it's
> > > > >>> right to make the split appropriately, but still to use PCI bars
> > > > >>> for what they were designed for.
> > > > >>>
> > > > >>> Dave
> > > > >>
> > > > >> Let's discuss after the cleanup. In general I am not convinced this is
> > > > >> the right thing to do. Using virtio-pci for anything other than pure
> > > > >> transport smells like bad design to me (well, I am no virtio expert
> > > > >> after all ;) ). No matter what PCI bars were designed for. If we can't
> > > > >> get the same running with e.g. virtio-ccw or virtio-whatever, it is
> > > > >> broken by design (or an addon that is tightly glued to virtio-pci, if
> > > > >> that is the general idea).
> > > > >
> > > > > I'm sure we can find alternatives for virtio-*, so I wouldn't expect
> > > > > it to be glued to virtio-pci.
> > > > >
> > > > > Dave
> > > >
> > > > As s390x does not have the concept of memory mapped io (RAM is RAM,
> > > > nothing else), this is not architected. virtio-ccw can therefore not
> > > > define anything similar. However, in virtual environments we
> > > > can do whatever we want on top of the pure transport (e.g. on the virtio
> > > > layer).
> > > >
> > > > Conny can correct me if I am wrong.
> > >
> > > I don't think you're wrong, but I haven't read the code yet and I'm
> > > therefore not aware of the purpose of this BAR.
> > >
> > > Generally, if there is a memory location shared between host and guest,
> > > we need a way to communicate its location, which will likely differ
> > > between transports. For ccw, I could imagine a new channel command
> > > dedicated to exchanging configuration information (similar to what
> > > exists today to communicate the locations of virtqueues), but I'd
> > > rather not go down this path.
> > >
> > > Without reading the code/design further, can we use one of the
> > > following instead of a BAR:
> > > - a virtqueue;
> > > - something in config space?
> > > That would be implementable by any virtio transport.
> >
> > The way I think about this is that we wish to extend the VIRTIO device
> > model with the concept of shared memory. virtio-fs, virtio-gpu, and
> > virtio-vhost-user all have requirements for shared memory.
> >
> > This seems like a transport-level issue to me. PCI supports
> > memory-mapped I/O and that's the right place to do it. If you try to
> > put it into config space or the virtqueue, you'll end up with something
> > that cannot be realized as a PCI device because it bypasses PCI bus
> > address translation.
> >
> > If CCW needs a side-channel, that's fine. But that side-channel is a
> > CCW-specific mechanism and probably doesn't apply to all other
> > transports.
>
> But virtio-gpu works with ccw right now (I haven't checked what it
> uses); can virtio-fs use an equivalent method?
virtio-gpu does not use shared memory yet but it needs to in the future.
> If there's a more generic case to be made for extending virtio devices
> with a way to handle shared memory, a ccw for that would be fine. I
> just want to avoid adding new ccws for everything as the namespace is
> not infinite.
Yes, virtio-vhost-user needs it too. I think it makes sense for shared
memory resources to be part of the VIRTIO device model.
Stefan
On Mon, Dec 17, 2018 at 11:53:46AM +0100, David Hildenbrand wrote:
> On 14.12.18 14:44, Stefan Hajnoczi wrote:
> > On Thu, Dec 13, 2018 at 01:38:23PM +0100, Cornelia Huck wrote:
> >> On Thu, 13 Dec 2018 13:24:31 +0100
> >> David Hildenbrand <[email protected]> wrote:
> >>
> >>> On 13.12.18 13:15, Dr. David Alan Gilbert wrote:
> >>>> * David Hildenbrand ([email protected]) wrote:
> >>>>> On 13.12.18 11:00, Dr. David Alan Gilbert wrote:
> >>>>>> * David Hildenbrand ([email protected]) wrote:
> >>>>>>> On 13.12.18 10:13, Dr. David Alan Gilbert wrote:
> >>>>>>>> * David Hildenbrand ([email protected]) wrote:
> >>>>>>>>> On 10.12.18 18:12, Vivek Goyal wrote:
> >>>>>>>>>> Instead of assuming we had the fixed bar for the cache, use the
> >>>>>>>>>> value from the capabilities.
> >>>>>>>>>>
> >>>>>>>>>> Signed-off-by: Dr. David Alan Gilbert <[email protected]>
> >>>>>>>>>> ---
> >>>>>>>>>> fs/fuse/virtio_fs.c | 32 +++++++++++++++++---------------
> >>>>>>>>>> 1 file changed, 17 insertions(+), 15 deletions(-)
> >>>>>>>>>>
> >>>>>>>>>> diff --git a/fs/fuse/virtio_fs.c b/fs/fuse/virtio_fs.c
> >>>>>>>>>> index 60d496c16841..55bac1465536 100644
> >>>>>>>>>> --- a/fs/fuse/virtio_fs.c
> >>>>>>>>>> +++ b/fs/fuse/virtio_fs.c
> >>>>>>>>>> @@ -14,11 +14,6 @@
> >>>>>>>>>> #include <uapi/linux/virtio_pci.h>
> >>>>>>>>>> #include "fuse_i.h"
> >>>>>>>>>>
> >>>>>>>>>> -enum {
> >>>>>>>>>> - /* PCI BAR number of the virtio-fs DAX window */
> >>>>>>>>>> - VIRTIO_FS_WINDOW_BAR = 2,
> >>>>>>>>>> -};
> >>>>>>>>>> -
> >>>>>>>>>> /* List of virtio-fs device instances and a lock for the list */
> >>>>>>>>>> static DEFINE_MUTEX(virtio_fs_mutex);
> >>>>>>>>>> static LIST_HEAD(virtio_fs_instances);
> >>>>>>>>>> @@ -518,7 +513,7 @@ static int virtio_fs_setup_dax(struct virtio_device *vdev, struct virtio_fs *fs)
> >>>>>>>>>> struct dev_pagemap *pgmap;
> >>>>>>>>>> struct pci_dev *pci_dev;
> >>>>>>>>>> phys_addr_t phys_addr;
> >>>>>>>>>> - size_t len;
> >>>>>>>>>> + size_t bar_len;
> >>>>>>>>>> int ret;
> >>>>>>>>>> u8 have_cache, cache_bar;
> >>>>>>>>>> u64 cache_offset, cache_len;
> >>>>>>>>>> @@ -551,17 +546,13 @@ static int virtio_fs_setup_dax(struct virtio_device *vdev, struct virtio_fs *fs)
> >>>>>>>>>> }
> >>>>>>>>>>
> >>>>>>>>>> /* TODO handle case where device doesn't expose BAR? */
> >>>>>>>>>
> >>>>>>>>> For virtio-pmem we decided to not go via BARs as this would effectively
> >>>>>>>>> make it only usable for virtio-pci implementers. Instead, we are going
> >>>>>>>>> to export the applicable physical device region directly (e.g.
> >>>>>>>>> phys_start, phys_size in virtio config), so it is decoupled from PCI
> >>>>>>>>> details. Doing the same for virtio-fs would allow e.g. also virtio-ccw
> >>>>>>>>> to make eventually use of this.
> >>>>>>>>
> >>>>>>>> That makes it a very odd looking PCI device; I can see that with
> >>>>>>>> virtio-pmem it makes some sense, given that its job is to expose
> >>>>>>>> arbitrary chunks of memory.
> >>>>>>>>
> >>>>>>>> Dave
> >>>>>>>
> >>>>>>> Well, the fact that you are
> >>>>>>>
> >>>>>>> - including <uapi/linux/virtio_pci.h>
> >>>>>>> - adding pci related code
> >>>>>>>
> >>>>>>> in/to fs/fuse/virtio_fs.c
> >>>>>>>
> >>>>>>> tells me that these properties might be better communicated on the
> >>>>>>> virtio layer, not on the PCI layer.
> >>>>>>>
> >>>>>>> Or do you really want to glue virtio-fs to virtio-pci for all eternity?
> >>>>>>
> >>>>>> No, these need cleaning up; and the split within the bar
> >>>>>> is probably going to change to be communicated via virtio layer
> >>>>>> rather than pci capabilities. However, I don't want to make our PCI
> >>>>>> device look odd just for the sake of portability to non-PCI devices - so it's
> >>>>>> right to make the split appropriately, but still to use PCI bars
> >>>>>> for what they were designed for.
> >>>>>>
> >>>>>> Dave
> >>>>>
> >>>>> Let's discuss after the cleanup. In general I am not convinced this is
> >>>>> the right thing to do. Using virtio-pci for anything other than pure
> >>>>> transport smells like bad design to me (well, I am no virtio expert
> >>>>> after all ;) ). No matter what PCI bars were designed for. If we can't
> >>>>> get the same running with e.g. virtio-ccw or virtio-whatever, it is
> >>>>> broken by design (or an addon that is tightly glued to virtio-pci, if
> >>>>> that is the general idea).
> >>>>
> >>>> I'm sure we can find alternatives for virtio-*, so I wouldn't expect
> >>>> it to be glued to virtio-pci.
> >>>>
> >>>> Dave
> >>>
> >>> As s390x does not have the concept of memory mapped io (RAM is RAM,
> >>> nothing else), this is not architected. virtio-ccw can therefore not
> >>> define anything similar. However, in virtual environments we
> >>> can do whatever we want on top of the pure transport (e.g. on the virtio
> >>> layer).
> >>>
> >>> Conny can correct me if I am wrong.
> >>
> >> I don't think you're wrong, but I haven't read the code yet and I'm
> >> therefore not aware of the purpose of this BAR.
> >>
> >> Generally, if there is a memory location shared between host and guest,
> >> we need a way to communicate its location, which will likely differ
> >> between transports. For ccw, I could imagine a new channel command
> >> dedicated to exchanging configuration information (similar to what
> >> exists today to communicate the locations of virtqueues), but I'd
> >> rather not go down this path.
> >>
> >> Without reading the code/design further, can we use one of the
> >> following instead of a BAR:
> >> - a virtqueue;
> >> - something in config space?
> >> That would be implementable by any virtio transport.
> >
> > The way I think about this is that we wish to extend the VIRTIO device
> > model with the concept of shared memory. virtio-fs, virtio-gpu, and
> > virtio-vhost-user all have requirements for shared memory.
> >
> > This seems like a transport-level issue to me. PCI supports
> > memory-mapped I/O and that's the right place to do it. If you try to
> > put it into config space or the virtqueue, you'll end up with something
> > that cannot be realized as a PCI device because it bypasses PCI bus
> > address translation.
> >
> > If CCW needs a side-channel, that's fine. But that side-channel is a
> > CCW-specific mechanism and probably doesn't apply to all other
> > transports.
> >
> > Stefan
> >
>
> I think the problem is more fundamental. There is no iommu. Whatever
> shared region you want to indicate, you want it to be assigned a memory
> region in guest physical memory. Like a DIMM/NVDIMM. And this should be
> different to the concept of a BAR. Or am I missing something?
If you implement a physical virtio PCI adapter, then there is bus
addressing and an IOMMU, and VIRTIO has support for that. I'm not sure I
understand what you mean by "there is no iommu"?
> I am ok with using whatever other channel to transport such information.
> But I believe this is different to a typical BAR. (I wish I knew more
> about PCI internals ;) ).
>
> I would also like to know how shared memory works as of now for e.g.
> virtio-gpu.
virtio-gpu currently does not use shared memory, it needs it for future
features.
Stefan
On Mon, 17 Dec 2018 14:56:38 +0000
Stefan Hajnoczi <[email protected]> wrote:
> On Mon, Dec 17, 2018 at 11:53:46AM +0100, David Hildenbrand wrote:
> > On 14.12.18 14:44, Stefan Hajnoczi wrote:
> > > On Thu, Dec 13, 2018 at 01:38:23PM +0100, Cornelia Huck wrote:
> > >> On Thu, 13 Dec 2018 13:24:31 +0100
> > >> David Hildenbrand <[email protected]> wrote:
> > >>> As s390x does not have the concept of memory mapped io (RAM is RAM,
> > >>> nothing else), this is not architected. virtio-ccw can therefore not
> > >>> define anything similar. However, in virtual environments we
> > >>> can do whatever we want on top of the pure transport (e.g. on the virtio
> > >>> layer).
> > >>>
> > >>> Conny can correct me if I am wrong.
> > >>
> > >> I don't think you're wrong, but I haven't read the code yet and I'm
> > >> therefore not aware of the purpose of this BAR.
> > >>
> > >> Generally, if there is a memory location shared between host and guest,
> > >> we need a way to communicate its location, which will likely differ
> > >> between transports. For ccw, I could imagine a new channel command
> > >> dedicated to exchanging configuration information (similar to what
> > >> exists today to communicate the locations of virtqueues), but I'd
> > >> rather not go down this path.
> > >>
> > >> Without reading the code/design further, can we use one of the
> > >> following instead of a BAR:
> > >> - a virtqueue;
> > >> - something in config space?
> > >> That would be implementable by any virtio transport.
> > >
> > > The way I think about this is that we wish to extend the VIRTIO device
> > > model with the concept of shared memory. virtio-fs, virtio-gpu, and
> > > virtio-vhost-user all have requirements for shared memory.
> > >
> > > This seems like a transport-level issue to me. PCI supports
> > > memory-mapped I/O and that's the right place to do it. If you try to
> > > put it into config space or the virtqueue, you'll end up with something
> > > that cannot be realized as a PCI device because it bypasses PCI bus
> > > address translation.
> > >
> > > If CCW needs a side-channel, that's fine. But that side-channel is a
> > > CCW-specific mechanism and probably doesn't apply to all other
> > > transports.
> > >
> > > Stefan
> > >
> >
> > I think the problem is more fundamental. There is no iommu. Whatever
> > shared region you want to indicate, you want it to be assigned a memory
> > region in guest physical memory. Like a DIMM/NVDIMM. And this should be
> > different to the concept of a BAR. Or am I missing something?
>
> If you implement a physical virtio PCI adapter then there is bus
> addressing and an IOMMU and VIRTIO has support for that. I'm not sure I
> understand what you mean by "there is no iommu"?
For ccw, there is no iommu; channel-program translation is doing
similar things. (I hope that is what David meant :)
>
> > I am ok with using whatever other channel to transport such information.
> > But I believe this is different to a typical BAR. (I wish I knew more
> > about PCI internals ;) ).
> >
> > I would also like to know how shared memory works as of now for e.g.
> > virtio-gpu.
>
> virtio-gpu currently does not use shared memory, it needs it for future
> features.
OK, that all sounds like we need to define a generic, per-transport,
device-agnostic way to specify shared memory.
Where is that memory situated? Is it something in guest memory (like
virtqueues)? If it is something provided by the device, things will get
tricky for ccw (remember that there's no mmio on s390; pci on s390 uses
special instructions for that.)
On 18.12.18 18:13, Cornelia Huck wrote:
> On Mon, 17 Dec 2018 14:56:38 +0000
> Stefan Hajnoczi <[email protected]> wrote:
>
>> On Mon, Dec 17, 2018 at 11:53:46AM +0100, David Hildenbrand wrote:
>>> On 14.12.18 14:44, Stefan Hajnoczi wrote:
>>>> On Thu, Dec 13, 2018 at 01:38:23PM +0100, Cornelia Huck wrote:
>>>>> On Thu, 13 Dec 2018 13:24:31 +0100
>>>>> David Hildenbrand <[email protected]> wrote:
>
>>>>>> As s390x does not have the concept of memory mapped io (RAM is RAM,
>>>>>> nothing else), this is not architected. virtio-ccw can therefore not
>>>>>> define anything similar. However, in virtual environments we
>>>>>> can do whatever we want on top of the pure transport (e.g. on the virtio
>>>>>> layer).
>>>>>>
>>>>>> Conny can correct me if I am wrong.
>>>>>
>>>>> I don't think you're wrong, but I haven't read the code yet and I'm
>>>>> therefore not aware of the purpose of this BAR.
>>>>>
>>>>> Generally, if there is a memory location shared between host and guest,
>>>>> we need a way to communicate its location, which will likely differ
>>>>> between transports. For ccw, I could imagine a new channel command
>>>>> dedicated to exchanging configuration information (similar to what
>>>>> exists today to communicate the locations of virtqueues), but I'd
>>>>> rather not go down this path.
>>>>>
>>>>> Without reading the code/design further, can we use one of the
>>>>> following instead of a BAR:
>>>>> - a virtqueue;
>>>>> - something in config space?
>>>>> That would be implementable by any virtio transport.
>>>>
>>>> The way I think about this is that we wish to extend the VIRTIO device
>>>> model with the concept of shared memory. virtio-fs, virtio-gpu, and
>>>> virtio-vhost-user all have requirements for shared memory.
>>>>
>>>> This seems like a transport-level issue to me. PCI supports
>>>> memory-mapped I/O and that's the right place to do it. If you try to
>>>> put it into config space or the virtqueue, you'll end up with something
>>>> that cannot be realized as a PCI device because it bypasses PCI bus
>>>> address translation.
>>>>
>>>> If CCW needs a side-channel, that's fine. But that side-channel is a
>>>> CCW-specific mechanism and probably doesn't apply to all other
>>>> transports.
>>>>
>>>> Stefan
>>>>
>>>
>>> I think the problem is more fundamental. There is no iommu. Whatever
>>> shared region you want to indicate, you want it to be assigned a memory
>>> region in guest physical memory. Like a DIMM/NVDIMM. And this should be
>>> different to the concept of a BAR. Or am I missing something?
>>
>> If you implement a physical virtio PCI adapter then there is bus
>> addressing and an IOMMU and VIRTIO has support for that. I'm not sure I
>> understand what you mean by "there is no iommu"?
>
> For ccw, there is no iommu; channel-program translation is doing
> similar things. (I hope that is what David meant :)
>
>>
>>> I am ok with using whatever other channel to transport such information.
>>> But I believe this is different to a typical BAR. (I wish I knew more
>>> about PCI internals ;) ).
>>>
>>> I would also like to know how shared memory works as of now for e.g.
>>> virtio-gpu.
>>
>> virtio-gpu currently does not use shared memory, it needs it for future
>> features.
>
> OK, that all sounds like we need to define a generic, per transport,
> device agnostic way to specify shared memory.
>
> Where is that memory situated? Is it something in guest memory (like
> virtqueues)? If it is something provided by the device, things will get
> tricky for ccw (remember that there's no mmio on s390; pci on s390 uses
> special instructions for that.)
>
I am just very confused right now. What I am struggling with (Stefan, I
hope you can clarify it for me):
We need some place where this shared memory is located in the guest
physical memory. On x86 - if I am not wrong - this BAR is placed into
the reserved memory area between 3 and 4 GB. There is no such thing on
s390x, because we don't have I/O via memory (yet). All we have is one or
two KVM memory slots filled with all memory.
So what we will need on s390x is on the QEMU side such a reserved memory
region where devices like virtio-fs can reserve a region for shared memory.
So it is something like a dimm/nvdimm except that it is smaller and not
visible to the user directly (via memory backends).
--
Thanks,
David / dhildenb
Hi Miklos,
I love your patch! Yet something to improve:
[auto build test ERROR on fuse/for-next]
[cannot apply to v4.20-rc7 next-20181219]
[if your patch is applied to the wrong git tree, please drop us a note to help improve the system]
url: https://github.com/0day-ci/linux/commits/Vivek-Goyal/virtio-fs-shared-file-system-for-virtual-machines/20181211-103034
base: https://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse.git for-next
config: x86_64-allmodconfig (attached as .config)
compiler: gcc-7 (Debian 7.3.0-1) 7.3.0
reproduce:
# save the attached .config to linux build tree
make ARCH=x86_64
All error/warnings (new ones prefixed by >>):
>> fs/ext2/inode.c:959:54: warning: incorrect type in argument 3 (different base types)
fs/ext2/inode.c:959:54: expected struct dax_device *dax_dev
fs/ext2/inode.c:959:54: got struct writeback_control *wbc
>> fs/ext2/inode.c:958:43: error: not enough arguments for function dax_writeback_mapping_range
fs/ext2/inode.c: In function 'ext2_dax_writepages':
fs/ext2/inode.c:959:33: error: passing argument 3 of 'dax_writeback_mapping_range' from incompatible pointer type [-Werror=incompatible-pointer-types]
mapping->host->i_sb->s_bdev, wbc);
^~~
In file included from fs/ext2/inode.c:29:0:
include/linux/dax.h:87:5: note: expected 'struct dax_device *' but argument is of type 'struct writeback_control *'
int dax_writeback_mapping_range(struct address_space *mapping,
^~~~~~~~~~~~~~~~~~~~~~~~~~~
fs/ext2/inode.c:958:9: error: too few arguments to function 'dax_writeback_mapping_range'
return dax_writeback_mapping_range(mapping,
^~~~~~~~~~~~~~~~~~~~~~~~~~~
In file included from fs/ext2/inode.c:29:0:
include/linux/dax.h:87:5: note: declared here
int dax_writeback_mapping_range(struct address_space *mapping,
^~~~~~~~~~~~~~~~~~~~~~~~~~~
fs/ext2/inode.c:960:1: warning: control reaches end of non-void function [-Wreturn-type]
}
^
cc1: some warnings being treated as errors
--
include/linux/slab.h:332:43: warning: dubious: x & !y
include/linux/slab.h:332:43: warning: dubious: x & !y
include/linux/slab.h:332:43: warning: dubious: x & !y
include/linux/slab.h:332:43: warning: dubious: x & !y
>> fs/fuse/file.c:3793:5: warning: symbol 'fuse_dax_reclaim_dmap_locked' was not declared. Should it be static?
>> fs/fuse/file.c:3825:25: warning: symbol 'fuse_dax_reclaim_first_mapping_locked' was not declared. Should it be static?
>> fs/fuse/file.c:3862:25: warning: symbol 'fuse_dax_reclaim_first_mapping' was not declared. Should it be static?
>> fs/fuse/file.c:3901:5: warning: symbol 'fuse_dax_free_one_mapping_locked' was not declared. Should it be static?
>> fs/fuse/file.c:3946:5: warning: symbol 'fuse_dax_free_one_mapping' was not declared. Should it be static?
>> fs/fuse/file.c:3967:5: warning: symbol 'fuse_dax_free_memory' was not declared. Should it be static?
vim +958 fs/ext2/inode.c
7f6d5b52 Ross Zwisler 2016-02-26 954
fb094c90 Dan Williams 2017-12-21 955 static int
fb094c90 Dan Williams 2017-12-21 956 ext2_dax_writepages(struct address_space *mapping, struct writeback_control *wbc)
fb094c90 Dan Williams 2017-12-21 957 {
fb094c90 Dan Williams 2017-12-21 @958 return dax_writeback_mapping_range(mapping,
fb094c90 Dan Williams 2017-12-21 @959 mapping->host->i_sb->s_bdev, wbc);
^1da177e Linus Torvalds 2005-04-16 960 }
^1da177e Linus Torvalds 2005-04-16 961
:::::: The code at line 958 was first introduced by commit
:::::: fb094c90748fbeba1063927eeb751add147b35b9 ext2, dax: introduce ext2_dax_aops
:::::: TO: Dan Williams <[email protected]>
:::::: CC: Dan Williams <[email protected]>
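Judging from the compiler output above, the series extends dax_writeback_mapping_range() to take a struct dax_device * as its third argument, but the ext2 caller was not converted in the same patch, so the tree does not build at this bisection point. The fix would be along these lines (illustrative sketch; where ext2 keeps its dax_device, e.g. one obtained via fs_dax_get_by_bdev() at mount time, is an assumption):

```diff
--- a/fs/ext2/inode.c
+++ b/fs/ext2/inode.c
@@ static int
 ext2_dax_writepages(struct address_space *mapping, struct writeback_control *wbc)
 {
-	return dax_writeback_mapping_range(mapping,
-			mapping->host->i_sb->s_bdev, wbc);
+	/* dax_dev: the filesystem's dax_device, e.g. cached at mount time
+	 * via fs_dax_get_by_bdev() (illustrative assumption) */
+	return dax_writeback_mapping_range(mapping,
+			mapping->host->i_sb->s_bdev, dax_dev, wbc);
 }
```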
---
0-DAY kernel test infrastructure Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all Intel Corporation
On Tue, Dec 18, 2018 at 06:25:27PM +0100, David Hildenbrand wrote:
> On 18.12.18 18:13, Cornelia Huck wrote:
> > On Mon, 17 Dec 2018 14:56:38 +0000
> > Stefan Hajnoczi <[email protected]> wrote:
> >
> >> On Mon, Dec 17, 2018 at 11:53:46AM +0100, David Hildenbrand wrote:
> >>> On 14.12.18 14:44, Stefan Hajnoczi wrote:
> >>>> On Thu, Dec 13, 2018 at 01:38:23PM +0100, Cornelia Huck wrote:
> >>>>> On Thu, 13 Dec 2018 13:24:31 +0100
> >>>>> David Hildenbrand <[email protected]> wrote:
> >
> >>>>>> As s390x does not have the concept of memory mapped io (RAM is RAM,
> >>>>>> nothing else), this is not architected. virtio-ccw can therefore not
> >>>>>> define anything similar. However, in virtual environments we
> >>>>>> can do whatever we want on top of the pure transport (e.g. on the virtio
> >>>>>> layer).
> >>>>>>
> >>>>>> Conny can correct me if I am wrong.
> >>>>>
> >>>>> I don't think you're wrong, but I haven't read the code yet and I'm
> >>>>> therefore not aware of the purpose of this BAR.
> >>>>>
> >>>>> Generally, if there is a memory location shared between host and guest,
> >>>>> we need a way to communicate its location, which will likely differ
> >>>>> between transports. For ccw, I could imagine a new channel command
> >>>>> dedicated to exchanging configuration information (similar to what
> >>>>> exists today to communicate the locations of virtqueues), but I'd
> >>>>> rather not go down this path.
> >>>>>
> >>>>> Without reading the code/design further, can we use one of the
> >>>>> following instead of a BAR:
> >>>>> - a virtqueue;
> >>>>> - something in config space?
> >>>>> That would be implementable by any virtio transport.
> >>>>
> >>>> The way I think about this is that we wish to extend the VIRTIO device
> >>>> model with the concept of shared memory. virtio-fs, virtio-gpu, and
> >>>> virtio-vhost-user all have requirements for shared memory.
> >>>>
> >>>> This seems like a transport-level issue to me. PCI supports
> >>>> memory-mapped I/O and that's the right place to do it. If you try to
> >>>> put it into config space or the virtqueue, you'll end up with something
> >>>> that cannot be realized as a PCI device because it bypasses PCI bus
> >>>> address translation.
> >>>>
> >>>> If CCW needs a side-channel, that's fine. But that side-channel is a
> >>>> CCW-specific mechanism and probably doesn't apply to all other
> >>>> transports.
> >>>>
> >>>> Stefan
> >>>>
> >>>
> >>> I think the problem is more fundamental. There is no iommu. Whatever
> >>> shared region you want to indicate, you want it to be assigned a memory
> >>> region in guest physical memory. Like a DIMM/NVDIMM. And this should be
> >>> different to the concept of a BAR. Or am I missing something?
> >>
> >> If you implement a physical virtio PCI adapter then there is bus
> >> addressing and an IOMMU and VIRTIO has support for that. I'm not sure I
> >> understand what you mean by "there is no iommu"?
> >
> > For ccw, there is no iommu; channel-program translation is doing
> > similar things. (I hope that is what David meant :)
> >
> >>
> >>> I am ok with using whatever other channel to transport such information.
> >>> But I believe this is different to a typical BAR. (I wish I knew more
> >>> about PCI internals ;) ).
> >>>
> >>> I would also like to know how shared memory works as of now for e.g.
> >>> virtio-gpu.
> >>
> >> virtio-gpu currently does not use shared memory, it needs it for future
> >> features.
> >
> > OK, that all sounds like we need to define a generic, per transport,
> > device agnostic way to specify shared memory.
> >
> > Where is that memory situated? Is it something in guest memory (like
> > virtqueues)? If it is something provided by the device, things will get
> > tricky for ccw (remember that there's no mmio on s390; pci on s390 uses
> > special instructions for that.)
> >
>
> I am just very very confused right now. What I am struggling with right
> now (Stefan, hope you can clarify it for me):
>
> We need some place where this shared memory is located in the guest
> physical memory. On x86 - if I am not wrong - this BAR is placed into
> the reserved memory area between 3 and 4 GB.
Right, the shared memory is provided by the device and does not live in
guest RAM.
> There is no such thing on
> s390x. Because we don't have IO via memory (yet). All we have is one or
> two KVM memory slots filled with all memory.
>
> So what we will need on s390x is on the QEMU side such a reserved memory
> region where devices like virtio-fs can reserve a region for shared memory.
>
> So it is something like a dimm/nvdimm except that it is smaller and not
> visible to the user directly (via memory backends).
I see. That makes sense.
Stefan
Vivek Goyal <[email protected]> writes:
> Hi,
>
> Here are RFC patches for virtio-fs. Looking for feedback on this approach.
>
> These patches should apply on top of 4.20-rc5. We have also put code for
> various components here.
>
> https://gitlab.com/virtio-fs
>
> Problem Description
> ===================
> We want to be able to take a directory tree on the host and share it with
> guest[s]. Our goal is to be able to do it in a fast, consistent and secure
> manner. Our primary use case is kata containers, but it should be usable in
> other scenarios as well.
>
> Containers may rely on local file system semantics for shared volumes,
> read-write mounts that multiple containers access simultaneously. File
> system changes must be visible to other containers with the same consistency
> expected of a local file system, including mmap MAP_SHARED.
>
> Existing Solutions
> ==================
> We looked at existing solutions and virtio-9p already provides basic shared
> file system functionality although does not offer local file system semantics,
> causing some workloads and test suites to fail.
Can you elaborate on this? Is this with 9p2000.L? We did quite a lot of
work to make sure the POSIX test suite passes on the 9p file system. Also,
was the mount done with cache=loose?
-aneesh
On Tue, Feb 12, 2019 at 09:26:48PM +0530, Aneesh Kumar K.V wrote:
> Vivek Goyal <[email protected]> writes:
>
> > Hi,
> >
> > Here are RFC patches for virtio-fs. Looking for feedback on this approach.
> >
> > These patches should apply on top of 4.20-rc5. We have also put code for
> > various components here.
> >
> > https://gitlab.com/virtio-fs
> >
> > Problem Description
> > ===================
> > We want to be able to take a directory tree on the host and share it with
> > guest[s]. Our goal is to be able to do it in a fast, consistent and secure
> > manner. Our primary use case is kata containers, but it should be usable in
> > other scenarios as well.
> >
> > Containers may rely on local file system semantics for shared volumes,
> > read-write mounts that multiple containers access simultaneously. File
> > system changes must be visible to other containers with the same consistency
> > expected of a local file system, including mmap MAP_SHARED.
> >
> > Existing Solutions
> > ==================
> > We looked at existing solutions and virtio-9p already provides basic shared
> > file system functionality although does not offer local file system semantics,
> > causing some workloads and test suites to fail.
>
> Can you elaborate on this? Is this with 9p2000.L? We did quite a lot of
> work to make sure the POSIX test suite passes on the 9p file system. Also,
> was the mount option cache=loose?
Hi Aneesh,
Yes, this is with 9p2000.L and cache=loose. I used the following mount option:
mount -t 9p -o trans=virtio hostShared /mnt/virtio-9p/ -oversion=9p2000.L,posixacl,cache=loose
We noticed primarily two issues.
- We ran pjdfstests and a lot of them failed. I think the Kata Containers
  folks also experienced pjdfstest failures; I have not looked into the
  details of why they fail.
- mmap(MAP_SHARED) will not work with virtio-9p when two clients running
  in two different VMs map the same file with MAP_SHARED.
Having said that, the biggest concern with virtio-9p seems to be performance.
We are looking for ways to improve performance with virtio-fs. We are hoping
DAX can provide faster data access, and the fuse protocol itself seems to
be faster (in preliminary testing).
Thanks
Vivek
On Mon, Dec 10, 2018 at 9:57 AM Vivek Goyal <[email protected]> wrote:
>
> Instead of assuming we had the fixed bar for the cache, use the
> value from the capabilities.
>
> Signed-off-by: Dr. David Alan Gilbert <[email protected]>
> ---
> fs/fuse/virtio_fs.c | 32 +++++++++++++++++---------------
> 1 file changed, 17 insertions(+), 15 deletions(-)
>
> diff --git a/fs/fuse/virtio_fs.c b/fs/fuse/virtio_fs.c
> index 60d496c16841..55bac1465536 100644
> --- a/fs/fuse/virtio_fs.c
> +++ b/fs/fuse/virtio_fs.c
> @@ -14,11 +14,6 @@
> #include <uapi/linux/virtio_pci.h>
> #include "fuse_i.h"
>
> -enum {
> - /* PCI BAR number of the virtio-fs DAX window */
> - VIRTIO_FS_WINDOW_BAR = 2,
> -};
> -
> /* List of virtio-fs device instances and a lock for the list */
> static DEFINE_MUTEX(virtio_fs_mutex);
> static LIST_HEAD(virtio_fs_instances);
> @@ -518,7 +513,7 @@ static int virtio_fs_setup_dax(struct virtio_device *vdev, struct virtio_fs *fs)
> struct dev_pagemap *pgmap;
> struct pci_dev *pci_dev;
> phys_addr_t phys_addr;
> - size_t len;
> + size_t bar_len;
> int ret;
> u8 have_cache, cache_bar;
> u64 cache_offset, cache_len;
> @@ -551,17 +546,13 @@ static int virtio_fs_setup_dax(struct virtio_device *vdev, struct virtio_fs *fs)
> }
>
> /* TODO handle case where device doesn't expose BAR? */
> - ret = pci_request_region(pci_dev, VIRTIO_FS_WINDOW_BAR,
> - "virtio-fs-window");
> + ret = pci_request_region(pci_dev, cache_bar, "virtio-fs-window");
> if (ret < 0) {
> dev_err(&vdev->dev, "%s: failed to request window BAR\n",
> __func__);
> return ret;
> }
>
> - phys_addr = pci_resource_start(pci_dev, VIRTIO_FS_WINDOW_BAR);
> - len = pci_resource_len(pci_dev, VIRTIO_FS_WINDOW_BAR);
> -
> mi = devm_kzalloc(&pci_dev->dev, sizeof(*mi), GFP_KERNEL);
> if (!mi)
> return -ENOMEM;
> @@ -586,6 +577,17 @@ static int virtio_fs_setup_dax(struct virtio_device *vdev, struct virtio_fs *fs)
> pgmap->ref = &mi->ref;
> pgmap->type = MEMORY_DEVICE_FS_DAX;
>
> + phys_addr = pci_resource_start(pci_dev, cache_bar);
> + bar_len = pci_resource_len(pci_dev, cache_bar);
> +
> + if (cache_offset + cache_len > bar_len) {
> + dev_err(&vdev->dev,
> + "%s: cache bar shorter than cap offset+len\n",
> + __func__);
> + return -EINVAL;
> + }
> + phys_addr += cache_offset;
> +
> /* Ideally we would directly use the PCI BAR resource but
> * devm_memremap_pages() wants its own copy in pgmap. So
> * initialize a struct resource from scratch (only the start
> @@ -594,7 +596,7 @@ static int virtio_fs_setup_dax(struct virtio_device *vdev, struct virtio_fs *fs)
> pgmap->res = (struct resource){
> .name = "virtio-fs dax window",
> .start = phys_addr,
> - .end = phys_addr + len,
> + .end = phys_addr + cache_len,
Just in case you haven't noticed/fixed this problem, it should be
+ .end = phys_addr + cache_len - 1,
because resource_size() counts %size as "end - start + 1".
The end result of the above is a "conflicting page map" warning when
specifying a second virtio-fs pci device.
I'll send a patch for this, and feel free to take it along with the
patchset if needed.
thanks,
liubo
> };
>
> fs->window_kaddr = devm_memremap_pages(&pci_dev->dev, pgmap);
> @@ -607,10 +609,10 @@ static int virtio_fs_setup_dax(struct virtio_device *vdev, struct virtio_fs *fs)
> return ret;
>
> fs->window_phys_addr = phys_addr;
> - fs->window_len = len;
> + fs->window_len = cache_len;
>
> - dev_dbg(&vdev->dev, "%s: window kaddr 0x%px phys_addr 0x%llx len %zu\n",
> - __func__, fs->window_kaddr, phys_addr, len);
> + dev_dbg(&vdev->dev, "%s: cache kaddr 0x%px phys_addr 0x%llx len %llx\n",
> + __func__, fs->window_kaddr, phys_addr, cache_len);
>
> fs->dax_dev = alloc_dax(fs, NULL, &virtio_fs_dax_ops);
> if (!fs->dax_dev)
> --
> 2.13.6
>
While doing memremap from the pci_dev's system bus address to a kernel virtual
address, we assign a wrong value to the %end of pgmap.res, which ends up
with a wrong resource size in the memremap process, and that further
prevents the second virtio-fs pci device from being probed successfully.
Signed-off-by: Liu Bo <[email protected]>
---
fs/fuse/virtio_fs.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/fs/fuse/virtio_fs.c b/fs/fuse/virtio_fs.c
index 88b00055589b..7abf2187d85f 100644
--- a/fs/fuse/virtio_fs.c
+++ b/fs/fuse/virtio_fs.c
@@ -713,7 +713,7 @@ static int virtio_fs_setup_dax(struct virtio_device *vdev, struct virtio_fs *fs)
pgmap->res = (struct resource){
.name = "virtio-fs dax window",
.start = phys_addr,
- .end = phys_addr + cache_len,
+ .end = phys_addr + cache_len - 1,
};
fs->window_kaddr = devm_memremap_pages(&pci_dev->dev, pgmap);
--
2.20.1.2.gb21ebb6
On Sun, Mar 17, 2019 at 08:35:21AM +0800, Liu Bo wrote:
> While doing memremap from the pci_dev's system bus address to a kernel virtual
> address, we assign a wrong value to the %end of pgmap.res, which ends up
> with a wrong resource size in the memremap process, and that further
> prevents the second virtio-fs pci device from being probed successfully.
>
> Signed-off-by: Liu Bo <[email protected]>
Hi Liu Bo,
Thanks for the fix. This seems right. I will fix it in my internal
branch. These patches are not upstream yet, so I will fold this into the
existing patch for the next posting.
Thanks
Vivek
> ---
> fs/fuse/virtio_fs.c | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/fs/fuse/virtio_fs.c b/fs/fuse/virtio_fs.c
> index 88b00055589b..7abf2187d85f 100644
> --- a/fs/fuse/virtio_fs.c
> +++ b/fs/fuse/virtio_fs.c
> @@ -713,7 +713,7 @@ static int virtio_fs_setup_dax(struct virtio_device *vdev, struct virtio_fs *fs)
> pgmap->res = (struct resource){
> .name = "virtio-fs dax window",
> .start = phys_addr,
> - .end = phys_addr + cache_len,
> + .end = phys_addr + cache_len - 1,
> };
>
> fs->window_kaddr = devm_memremap_pages(&pci_dev->dev, pgmap);
> --
> 2.20.1.2.gb21ebb6
>
On Tue, Mar 19, 2019 at 04:26:54PM -0400, Vivek Goyal wrote:
> On Sun, Mar 17, 2019 at 08:35:21AM +0800, Liu Bo wrote:
> > While doing memremap from the pci_dev's system bus address to a kernel virtual
> > address, we assign a wrong value to the %end of pgmap.res, which ends up
> > with a wrong resource size in the memremap process, and that further
> > prevents the second virtio-fs pci device from being probed successfully.
> >
> > Signed-off-by: Liu Bo <[email protected]>
>
> Hi Liu Bo,
>
> Thanks for the fix. This seems right. I will fix it in my internal
> branch. These patches are not upstream yet. Will fold this into
> existing patch for the next posting.
>
Feel free to fold it, and I'm looking forward to the 1.0 release.
thanks,
-liubo
> Thanks
> Vivek
>
> > ---
> > fs/fuse/virtio_fs.c | 2 +-
> > 1 file changed, 1 insertion(+), 1 deletion(-)
> >
> > diff --git a/fs/fuse/virtio_fs.c b/fs/fuse/virtio_fs.c
> > index 88b00055589b..7abf2187d85f 100644
> > --- a/fs/fuse/virtio_fs.c
> > +++ b/fs/fuse/virtio_fs.c
> > @@ -713,7 +713,7 @@ static int virtio_fs_setup_dax(struct virtio_device *vdev, struct virtio_fs *fs)
> > pgmap->res = (struct resource){
> > .name = "virtio-fs dax window",
> > .start = phys_addr,
> > - .end = phys_addr + cache_len,
> > + .end = phys_addr + cache_len - 1,
> > };
> >
> > fs->window_kaddr = devm_memremap_pages(&pci_dev->dev, pgmap);
> > --
> > 2.20.1.2.gb21ebb6
> >
* Liu Bo ([email protected]) wrote:
> On Mon, Dec 10, 2018 at 9:57 AM Vivek Goyal <[email protected]> wrote:
> >
> > Instead of assuming we had the fixed bar for the cache, use the
> > value from the capabilities.
> >
> > Signed-off-by: Dr. David Alan Gilbert <[email protected]>
> > ---
> > fs/fuse/virtio_fs.c | 32 +++++++++++++++++---------------
> > 1 file changed, 17 insertions(+), 15 deletions(-)
> >
> > diff --git a/fs/fuse/virtio_fs.c b/fs/fuse/virtio_fs.c
> > index 60d496c16841..55bac1465536 100644
> > --- a/fs/fuse/virtio_fs.c
> > +++ b/fs/fuse/virtio_fs.c
> > @@ -14,11 +14,6 @@
> > #include <uapi/linux/virtio_pci.h>
> > #include "fuse_i.h"
> >
> > -enum {
> > - /* PCI BAR number of the virtio-fs DAX window */
> > - VIRTIO_FS_WINDOW_BAR = 2,
> > -};
> > -
> > /* List of virtio-fs device instances and a lock for the list */
> > static DEFINE_MUTEX(virtio_fs_mutex);
> > static LIST_HEAD(virtio_fs_instances);
> > @@ -518,7 +513,7 @@ static int virtio_fs_setup_dax(struct virtio_device *vdev, struct virtio_fs *fs)
> > struct dev_pagemap *pgmap;
> > struct pci_dev *pci_dev;
> > phys_addr_t phys_addr;
> > - size_t len;
> > + size_t bar_len;
> > int ret;
> > u8 have_cache, cache_bar;
> > u64 cache_offset, cache_len;
> > @@ -551,17 +546,13 @@ static int virtio_fs_setup_dax(struct virtio_device *vdev, struct virtio_fs *fs)
> > }
> >
> > /* TODO handle case where device doesn't expose BAR? */
> > - ret = pci_request_region(pci_dev, VIRTIO_FS_WINDOW_BAR,
> > - "virtio-fs-window");
> > + ret = pci_request_region(pci_dev, cache_bar, "virtio-fs-window");
> > if (ret < 0) {
> > dev_err(&vdev->dev, "%s: failed to request window BAR\n",
> > __func__);
> > return ret;
> > }
> >
> > - phys_addr = pci_resource_start(pci_dev, VIRTIO_FS_WINDOW_BAR);
> > - len = pci_resource_len(pci_dev, VIRTIO_FS_WINDOW_BAR);
> > -
> > mi = devm_kzalloc(&pci_dev->dev, sizeof(*mi), GFP_KERNEL);
> > if (!mi)
> > return -ENOMEM;
> > @@ -586,6 +577,17 @@ static int virtio_fs_setup_dax(struct virtio_device *vdev, struct virtio_fs *fs)
> > pgmap->ref = &mi->ref;
> > pgmap->type = MEMORY_DEVICE_FS_DAX;
> >
> > + phys_addr = pci_resource_start(pci_dev, cache_bar);
> > + bar_len = pci_resource_len(pci_dev, cache_bar);
> > +
> > + if (cache_offset + cache_len > bar_len) {
> > + dev_err(&vdev->dev,
> > + "%s: cache bar shorter than cap offset+len\n",
> > + __func__);
> > + return -EINVAL;
> > + }
> > + phys_addr += cache_offset;
> > +
> > /* Ideally we would directly use the PCI BAR resource but
> > * devm_memremap_pages() wants its own copy in pgmap. So
> > * initialize a struct resource from scratch (only the start
> > @@ -594,7 +596,7 @@ static int virtio_fs_setup_dax(struct virtio_device *vdev, struct virtio_fs *fs)
> > pgmap->res = (struct resource){
> > .name = "virtio-fs dax window",
> > .start = phys_addr,
> > - .end = phys_addr + len,
> > + .end = phys_addr + cache_len,
>
> Just in case you haven't noticed/fixed this problem, it should be
>
> + .end = phys_addr + cache_len - 1,
>
> because resource_size() counts %size as "end - start + 1".
> The end result of the above is a "conflicting page map" warning when
> specifying a second virtio-fs pci device.
Thanks for spotting this! I think we'd seen that message once but not
noticed where from.
> I'll send a patch for this, and feel free to take it along with the
> patchset if needed.
>
Dave
> thanks,
> liubo
>
> > };
> >
> > fs->window_kaddr = devm_memremap_pages(&pci_dev->dev, pgmap);
> > @@ -607,10 +609,10 @@ static int virtio_fs_setup_dax(struct virtio_device *vdev, struct virtio_fs *fs)
> > return ret;
> >
> > fs->window_phys_addr = phys_addr;
> > - fs->window_len = len;
> > + fs->window_len = cache_len;
> >
> > - dev_dbg(&vdev->dev, "%s: window kaddr 0x%px phys_addr 0x%llx len %zu\n",
> > - __func__, fs->window_kaddr, phys_addr, len);
> > + dev_dbg(&vdev->dev, "%s: cache kaddr 0x%px phys_addr 0x%llx len %llx\n",
> > + __func__, fs->window_kaddr, phys_addr, cache_len);
> >
> > fs->dax_dev = alloc_dax(fs, NULL, &virtio_fs_dax_ops);
> > if (!fs->dax_dev)
> > --
> > 2.13.6
> >
--
Dr. David Alan Gilbert / [email protected] / Manchester, UK