Hi,
This patch series adds DAX support to the virtiofs filesystem. This allows
bypassing the guest page cache and mapping the host page cache directly
into the guest address space.
When a page of a file is needed, the guest sends a request to map that page
(from the host page cache) into the qemu address space. Inside the guest this
is a physical memory range controlled by the virtiofs device, and the guest
maps this physical address range directly using DAX, thereby gaining
access to the file data on the host.
This can speed things up considerably in many situations. It can also
result in substantial memory savings, as file data does not have
to be copied into the guest and is accessed directly from the host page
cache.
Most of the changes are limited to fuse/virtiofs. There are a couple
of changes needed in the generic dax infrastructure and a couple of changes
in virtio to be able to access the shared memory region.
These patches apply on top of 5.6-rc4 and are also available here:
https://github.com/rhvgoyal/linux/commits/vivek-04-march-2020
Any review or feedback is welcome.
Performance
===========
I have run a bunch of fio jobs to get a sense of the speed of
various operations. I wrote a simple wrapper script that runs each fio job
3 times and reports the average. These scripts and fio
jobs are available here:
https://github.com/rhvgoyal/virtiofs-tests
I set up a directory on ramfs on the host, exported that directory into the
guest using virtio-fs, and ran the tests inside the guest. Tests were run
with cache=none, both with dax enabled and disabled. The cache=none option
ensures that no caching happens in the guest, for either data or metadata.
Test Setup
-----------
- A Fedora 29 host with 376GiB RAM, 2 sockets (20 cores per socket, 2
threads per core)
- Using ramfs on the host as the backing store. 4 fio files of 8G each.
- Created a VM with 64 VCPUs and 64GB memory. A 64GB cache window (for dax
mmap).
Test Results
------------
- Results are reported for two configurations:
virtio-fs (cache=none) and virtio-fs (cache=none + dax).
There are other caching modes as well, but cache=none seemed the most
interesting for now because it does not cache anything in the guest
and provides strong coherence. Other modes, which provide weaker
coherence and are hence faster, are yet to be benchmarked.
- Three fio ioengines psync, libaio and mmap have been used.
- I/O workloads of randread, randwrite, seqread and seqwrite have been run.
- Each file size is 8G. Block size 4K. iodepth=16
- "multi" means same operation was done with 4 jobs and each job is
operating on a file of size 8G.
- Some results are "0 (KiB/s)". That means that particular operation is
not supported in that configuration.
NAME                      I/O Operation           BW(Read/Write)
virtiofs-cache-none       seqread-psync           35(MiB/s)
virtiofs-cache-none-dax   seqread-psync           643(MiB/s)
virtiofs-cache-none       seqread-psync-multi     219(MiB/s)
virtiofs-cache-none-dax   seqread-psync-multi     2132(MiB/s)
virtiofs-cache-none       seqread-mmap            0(KiB/s)
virtiofs-cache-none-dax   seqread-mmap            741(MiB/s)
virtiofs-cache-none       seqread-mmap-multi      0(KiB/s)
virtiofs-cache-none-dax   seqread-mmap-multi      2530(MiB/s)
virtiofs-cache-none       seqread-libaio          293(MiB/s)
virtiofs-cache-none-dax   seqread-libaio          425(MiB/s)
virtiofs-cache-none       seqread-libaio-multi    207(MiB/s)
virtiofs-cache-none-dax   seqread-libaio-multi    1543(MiB/s)
virtiofs-cache-none       randread-psync          36(MiB/s)
virtiofs-cache-none-dax   randread-psync          572(MiB/s)
virtiofs-cache-none       randread-psync-multi    211(MiB/s)
virtiofs-cache-none-dax   randread-psync-multi    1764(MiB/s)
virtiofs-cache-none       randread-mmap           0(KiB/s)
virtiofs-cache-none-dax   randread-mmap           719(MiB/s)
virtiofs-cache-none       randread-mmap-multi     0(KiB/s)
virtiofs-cache-none-dax   randread-mmap-multi     2005(MiB/s)
virtiofs-cache-none       randread-libaio         300(MiB/s)
virtiofs-cache-none-dax   randread-libaio         413(MiB/s)
virtiofs-cache-none       randread-libaio-multi   327(MiB/s)
virtiofs-cache-none-dax   randread-libaio-multi   1326(MiB/s)
virtiofs-cache-none       seqwrite-psync          34(MiB/s)
virtiofs-cache-none-dax   seqwrite-psync          494(MiB/s)
virtiofs-cache-none       seqwrite-psync-multi    223(MiB/s)
virtiofs-cache-none-dax   seqwrite-psync-multi    1680(MiB/s)
virtiofs-cache-none       seqwrite-mmap           0(KiB/s)
virtiofs-cache-none-dax   seqwrite-mmap           1217(MiB/s)
virtiofs-cache-none       seqwrite-mmap-multi     0(KiB/s)
virtiofs-cache-none-dax   seqwrite-mmap-multi     2359(MiB/s)
virtiofs-cache-none       seqwrite-libaio         282(MiB/s)
virtiofs-cache-none-dax   seqwrite-libaio         348(MiB/s)
virtiofs-cache-none       seqwrite-libaio-multi   320(MiB/s)
virtiofs-cache-none-dax   seqwrite-libaio-multi   1255(MiB/s)
virtiofs-cache-none       randwrite-psync         32(MiB/s)
virtiofs-cache-none-dax   randwrite-psync         458(MiB/s)
virtiofs-cache-none       randwrite-psync-multi   213(MiB/s)
virtiofs-cache-none-dax   randwrite-psync-multi   1343(MiB/s)
virtiofs-cache-none       randwrite-mmap          0(KiB/s)
virtiofs-cache-none-dax   randwrite-mmap          663(MiB/s)
virtiofs-cache-none       randwrite-mmap-multi    0(KiB/s)
virtiofs-cache-none-dax   randwrite-mmap-multi    1820(MiB/s)
virtiofs-cache-none       randwrite-libaio        292(MiB/s)
virtiofs-cache-none-dax   randwrite-libaio        341(MiB/s)
virtiofs-cache-none       randwrite-libaio-multi  322(MiB/s)
virtiofs-cache-none-dax   randwrite-libaio-multi  1094(MiB/s)
Conclusion
===========
- virtio-fs with dax enabled is significantly faster and more memory
efficient compared to non-dax operation.
Note:
Right now the dax window is 64G and the total fio file size is 32G (4
files of 8G each). That means everything fits into the dax window and no
reclaim is needed. The dax window reclaim logic is slower, and if the file
size is bigger than the dax window size, performance drops.
Thanks
Vivek
Sebastien Boeuf (3):
virtio: Add get_shm_region method
virtio: Implement get_shm_region for PCI transport
virtio: Implement get_shm_region for MMIO transport
Stefan Hajnoczi (2):
virtio_fs, dax: Set up virtio_fs dax_device
fuse,dax: add DAX mmap support
Vivek Goyal (15):
dax: Modify bdev_dax_pgoff() to handle NULL bdev
dax: Create a range version of dax_layout_busy_page()
virtiofs: Provide a helper function for virtqueue initialization
fuse: Get rid of no_mount_options
fuse,virtiofs: Add a mount option to enable dax
fuse,virtiofs: Keep a list of free dax memory ranges
fuse: implement FUSE_INIT map_alignment field
fuse: Introduce setupmapping/removemapping commands
fuse, dax: Implement dax read/write operations
fuse, dax: Take ->i_mmap_sem lock during dax page fault
fuse,virtiofs: Define dax address space operations
fuse,virtiofs: Maintain a list of busy elements
fuse: Release file in process context
fuse: Take inode lock for dax inode truncation
fuse,virtiofs: Add logic to free up a memory range
drivers/dax/super.c | 3 +-
drivers/virtio/virtio_mmio.c | 32 +
drivers/virtio/virtio_pci_modern.c | 107 +++
fs/dax.c | 66 +-
fs/fuse/dir.c | 2 +
fs/fuse/file.c | 1162 +++++++++++++++++++++++++++-
fs/fuse/fuse_i.h | 109 ++-
fs/fuse/inode.c | 148 +++-
fs/fuse/virtio_fs.c | 250 +++++-
include/linux/dax.h | 6 +
include/linux/virtio_config.h | 17 +
include/uapi/linux/fuse.h | 42 +-
include/uapi/linux/virtio_fs.h | 3 +
include/uapi/linux/virtio_mmio.h | 11 +
include/uapi/linux/virtio_pci.h | 11 +-
15 files changed, 1888 insertions(+), 81 deletions(-)
--
2.20.1
fuse_file_put(sync) can be called with sync=true/false. If sync=true,
it waits for the release request response and then calls iput() in the
caller's context. If sync=false, it does not wait for the release request
response, frees the fuse_file struct immediately, and the req->end function
does the iput().
iput() can be a problem with DAX if called in req->end context. If this
is the last reference to the inode (VFS has already let go of its
reference), then iput() will clean up DAX mappings as well, send
REMOVEMAPPING requests, and wait for completion. (All of this happens in
the worker thread context that is processing fuse replies from the daemon
on the host.)
That means it blocks the worker thread, which then stops processing further
replies, and the system deadlocks.
So for now, force sync release of the file in case of DAX inodes.
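The resulting decision can be summarized with a tiny userspace sketch
(names are illustrative, not the kernel's): the release becomes synchronous
whenever the inode uses DAX, in addition to the existing fuseblk destroy case.

```c
#include <assert.h>
#include <stdbool.h>

/* Sketch of the release-mode decision described above: a DAX inode (or
 * a connection with ->destroy set) takes the synchronous path, so the
 * final iput() runs in the caller's context rather than in the reply
 * worker that would otherwise deadlock on REMOVEMAPPING. */
static bool release_must_be_sync(bool inode_is_dax, bool conn_destroy)
{
	return inode_is_dax || conn_destroy;
}
```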
Signed-off-by: Vivek Goyal <[email protected]>
---
fs/fuse/file.c | 14 +++++++++++++-
1 file changed, 13 insertions(+), 1 deletion(-)
diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index afabeb1acd50..561428b66101 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -543,6 +543,7 @@ void fuse_release_common(struct file *file, bool isdir)
struct fuse_file *ff = file->private_data;
struct fuse_release_args *ra = ff->release_args;
int opcode = isdir ? FUSE_RELEASEDIR : FUSE_RELEASE;
+ bool sync = false;
fuse_prepare_release(fi, ff, file->f_flags, opcode);
@@ -562,8 +563,19 @@ void fuse_release_common(struct file *file, bool isdir)
* Make the release synchronous if this is a fuseblk mount,
* synchronous RELEASE is allowed (and desirable) in this case
* because the server can be trusted not to screw up.
+	 *
+	 * For DAX, the fuse server is trusted, so it should be fine
+	 * to do a sync file put. An async file put is problematic
+	 * right now because, when the request finishes, iput()
+	 * can lead to freeing of the inode. That tears down the
+	 * mappings backing DAX memory and sends a REMOVEMAPPING
+	 * message to the server, blocking for completion. Waiting
+	 * in req->end context deadlocks the system, as the same worker
+	 * thread can't process the REMOVEMAPPING reply it is waiting for.
*/
- fuse_file_put(ff, ff->fc->destroy, isdir);
+ if (IS_DAX(file_inode(file)) || ff->fc->destroy)
+ sync = true;
+ fuse_file_put(ff, sync, isdir);
}
static int fuse_open(struct inode *inode, struct file *file)
--
2.20.1
Add a mount option to allow using dax with virtio_fs.
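For reference, a hypothetical usage sketch (not part of the patch): the new
option is passed like any other virtiofs mount option, e.g.
`mount -t virtiofs -o dax myfs /mnt`. The equivalent mount(2) call is shown
below; the tag name "myfs" and the mount point are illustrative, and running
this requires a guest with a virtiofs device and root privileges, so treat it
as a configuration example rather than a test.

```c
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <sys/mount.h>

int main(void)
{
	/* "myfs" is a hypothetical virtiofs tag exported by the host. */
	if (mount("myfs", "/mnt", "virtiofs", 0, "dax") < 0) {
		fprintf(stderr, "mount: %s\n", strerror(errno));
		return 1;
	}
	return 0;
}
```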
Signed-off-by: Vivek Goyal <[email protected]>
---
fs/fuse/fuse_i.h | 7 ++++
fs/fuse/inode.c | 3 ++
fs/fuse/virtio_fs.c | 82 +++++++++++++++++++++++++++++++++++++--------
3 files changed, 78 insertions(+), 14 deletions(-)
diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
index 2cebdf6dcfd8..1fe5065a2902 100644
--- a/fs/fuse/fuse_i.h
+++ b/fs/fuse/fuse_i.h
@@ -483,10 +483,14 @@ struct fuse_fs_context {
bool destroy:1;
bool no_control:1;
bool no_force_umount:1;
+ bool dax:1;
unsigned int max_read;
unsigned int blksize;
const char *subtype;
+ /* DAX device, may be NULL */
+ struct dax_device *dax_dev;
+
/* fuse_dev pointer to fill in, should contain NULL on entry */
void **fudptr;
};
@@ -758,6 +762,9 @@ struct fuse_conn {
/** List of device instances belonging to this connection */
struct list_head devices;
+
+ /** DAX device, non-NULL if DAX is supported */
+ struct dax_device *dax_dev;
};
static inline struct fuse_conn *get_fuse_conn_super(struct super_block *sb)
diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
index f160a3d47b63..84295fac4ff3 100644
--- a/fs/fuse/inode.c
+++ b/fs/fuse/inode.c
@@ -569,6 +569,8 @@ static int fuse_show_options(struct seq_file *m, struct dentry *root)
seq_printf(m, ",max_read=%u", fc->max_read);
if (sb->s_bdev && sb->s_blocksize != FUSE_DEFAULT_BLKSIZE)
seq_printf(m, ",blksize=%lu", sb->s_blocksize);
+ if (fc->dax_dev)
+ seq_printf(m, ",dax");
return 0;
}
@@ -1185,6 +1187,7 @@ int fuse_fill_super_common(struct super_block *sb, struct fuse_fs_context *ctx)
fc->destroy = ctx->destroy;
fc->no_control = ctx->no_control;
fc->no_force_umount = ctx->no_force_umount;
+ fc->dax_dev = ctx->dax_dev;
err = -ENOMEM;
root = fuse_get_root_inode(sb, ctx->rootmode);
diff --git a/fs/fuse/virtio_fs.c b/fs/fuse/virtio_fs.c
index 3f786a15b0d9..62cdd6817b5b 100644
--- a/fs/fuse/virtio_fs.c
+++ b/fs/fuse/virtio_fs.c
@@ -10,6 +10,7 @@
#include <linux/virtio_fs.h>
#include <linux/delay.h>
#include <linux/fs_context.h>
+#include <linux/fs_parser.h>
#include <linux/highmem.h>
#include "fuse_i.h"
@@ -65,6 +66,45 @@ struct virtio_fs_forget {
static int virtio_fs_enqueue_req(struct virtio_fs_vq *fsvq,
struct fuse_req *req, bool in_flight);
+enum {
+ OPT_DAX,
+};
+
+static const struct fs_parameter_spec virtio_fs_parameters[] = {
+ fsparam_flag ("dax", OPT_DAX),
+ {}
+};
+
+static int virtio_fs_parse_param(struct fs_context *fc,
+ struct fs_parameter *param)
+{
+ struct fs_parse_result result;
+ struct fuse_fs_context *ctx = fc->fs_private;
+ int opt;
+
+ opt = fs_parse(fc, virtio_fs_parameters, param, &result);
+ if (opt < 0)
+ return opt;
+
+ switch(opt) {
+ case OPT_DAX:
+ ctx->dax = 1;
+ break;
+ default:
+ return -EINVAL;
+ }
+
+ return 0;
+}
+
+static void virtio_fs_free_fc(struct fs_context *fc)
+{
+ struct fuse_fs_context *ctx = fc->fs_private;
+
+ if (ctx)
+ kfree(ctx);
+}
+
static inline struct virtio_fs_vq *vq_to_fsvq(struct virtqueue *vq)
{
struct virtio_fs *fs = vq->vdev->priv;
@@ -1045,23 +1085,27 @@ static const struct fuse_iqueue_ops virtio_fs_fiq_ops = {
.release = virtio_fs_fiq_release,
};
-static int virtio_fs_fill_super(struct super_block *sb)
+static inline void virtio_fs_ctx_set_defaults(struct fuse_fs_context *ctx)
+{
+ ctx->rootmode = S_IFDIR;
+ ctx->default_permissions = 1;
+ ctx->allow_other = 1;
+ ctx->max_read = UINT_MAX;
+ ctx->blksize = 512;
+ ctx->destroy = true;
+ ctx->no_control = true;
+ ctx->no_force_umount = true;
+}
+
+static int virtio_fs_fill_super(struct super_block *sb, struct fs_context *fsc)
{
struct fuse_conn *fc = get_fuse_conn_super(sb);
struct virtio_fs *fs = fc->iq.priv;
+ struct fuse_fs_context *ctx = fsc->fs_private;
unsigned int i;
int err;
- struct fuse_fs_context ctx = {
- .rootmode = S_IFDIR,
- .default_permissions = 1,
- .allow_other = 1,
- .max_read = UINT_MAX,
- .blksize = 512,
- .destroy = true,
- .no_control = true,
- .no_force_umount = true,
- };
+ virtio_fs_ctx_set_defaults(ctx);
mutex_lock(&virtio_fs_mutex);
/* After holding mutex, make sure virtiofs device is still there.
@@ -1084,8 +1128,10 @@ static int virtio_fs_fill_super(struct super_block *sb)
goto err_free_fuse_devs;
}
- ctx.fudptr = (void **)&fs->vqs[VQ_REQUEST].fud;
- err = fuse_fill_super_common(sb, &ctx);
+ ctx->fudptr = (void **)&fs->vqs[VQ_REQUEST].fud;
+ if (ctx->dax)
+ ctx->dax_dev = fs->dax_dev;
+ err = fuse_fill_super_common(sb, ctx);
if (err < 0)
goto err_free_fuse_devs;
@@ -1200,7 +1246,7 @@ static int virtio_fs_get_tree(struct fs_context *fsc)
return PTR_ERR(sb);
if (!sb->s_root) {
- err = virtio_fs_fill_super(sb);
+ err = virtio_fs_fill_super(sb, fsc);
if (err) {
deactivate_locked_super(sb);
return err;
@@ -1215,11 +1261,19 @@ static int virtio_fs_get_tree(struct fs_context *fsc)
}
static const struct fs_context_operations virtio_fs_context_ops = {
+ .free = virtio_fs_free_fc,
+ .parse_param = virtio_fs_parse_param,
.get_tree = virtio_fs_get_tree,
};
static int virtio_fs_init_fs_context(struct fs_context *fsc)
{
+ struct fuse_fs_context *ctx;
+
+ ctx = kzalloc(sizeof(struct fuse_fs_context), GFP_KERNEL);
+ if (!ctx)
+ return -ENOMEM;
+ fsc->fs_private = ctx;
fsc->ops = &virtio_fs_context_ops;
return 0;
}
--
2.20.1
This reduces code duplication and makes the code a little easier to read.
Signed-off-by: Vivek Goyal <[email protected]>
---
fs/fuse/virtio_fs.c | 50 +++++++++++++++++++++++++++------------------
1 file changed, 30 insertions(+), 20 deletions(-)
diff --git a/fs/fuse/virtio_fs.c b/fs/fuse/virtio_fs.c
index bade74768903..a16cc9195087 100644
--- a/fs/fuse/virtio_fs.c
+++ b/fs/fuse/virtio_fs.c
@@ -24,6 +24,8 @@ enum {
VQ_REQUEST
};
+#define VQ_NAME_LEN 24
+
/* Per-virtqueue state */
struct virtio_fs_vq {
spinlock_t lock;
@@ -36,7 +38,7 @@ struct virtio_fs_vq {
bool connected;
long in_flight;
struct completion in_flight_zero; /* No inflight requests */
- char name[24];
+ char name[VQ_NAME_LEN];
} ____cacheline_aligned_in_smp;
/* A virtio-fs device instance */
@@ -560,6 +562,26 @@ static void virtio_fs_vq_done(struct virtqueue *vq)
schedule_work(&fsvq->done_work);
}
+static void virtio_fs_init_vq(struct virtio_fs_vq *fsvq, char *name,
+ int vq_type)
+{
+ strncpy(fsvq->name, name, VQ_NAME_LEN);
+ spin_lock_init(&fsvq->lock);
+ INIT_LIST_HEAD(&fsvq->queued_reqs);
+ INIT_LIST_HEAD(&fsvq->end_reqs);
+ init_completion(&fsvq->in_flight_zero);
+
+ if (vq_type == VQ_REQUEST) {
+ INIT_WORK(&fsvq->done_work, virtio_fs_requests_done_work);
+ INIT_DELAYED_WORK(&fsvq->dispatch_work,
+ virtio_fs_request_dispatch_work);
+ } else {
+ INIT_WORK(&fsvq->done_work, virtio_fs_hiprio_done_work);
+ INIT_DELAYED_WORK(&fsvq->dispatch_work,
+ virtio_fs_hiprio_dispatch_work);
+ }
+}
+
/* Initialize virtqueues */
static int virtio_fs_setup_vqs(struct virtio_device *vdev,
struct virtio_fs *fs)
@@ -575,7 +597,7 @@ static int virtio_fs_setup_vqs(struct virtio_device *vdev,
if (fs->num_request_queues == 0)
return -EINVAL;
- fs->nvqs = 1 + fs->num_request_queues;
+ fs->nvqs = VQ_REQUEST + fs->num_request_queues;
fs->vqs = kcalloc(fs->nvqs, sizeof(fs->vqs[VQ_HIPRIO]), GFP_KERNEL);
if (!fs->vqs)
return -ENOMEM;
@@ -589,29 +611,17 @@ static int virtio_fs_setup_vqs(struct virtio_device *vdev,
goto out;
}
+ /* Initialize the hiprio/forget request virtqueue */
callbacks[VQ_HIPRIO] = virtio_fs_vq_done;
- snprintf(fs->vqs[VQ_HIPRIO].name, sizeof(fs->vqs[VQ_HIPRIO].name),
- "hiprio");
+ virtio_fs_init_vq(&fs->vqs[VQ_HIPRIO], "hiprio", VQ_HIPRIO);
names[VQ_HIPRIO] = fs->vqs[VQ_HIPRIO].name;
- INIT_WORK(&fs->vqs[VQ_HIPRIO].done_work, virtio_fs_hiprio_done_work);
- INIT_LIST_HEAD(&fs->vqs[VQ_HIPRIO].queued_reqs);
- INIT_LIST_HEAD(&fs->vqs[VQ_HIPRIO].end_reqs);
- INIT_DELAYED_WORK(&fs->vqs[VQ_HIPRIO].dispatch_work,
- virtio_fs_hiprio_dispatch_work);
- init_completion(&fs->vqs[VQ_HIPRIO].in_flight_zero);
- spin_lock_init(&fs->vqs[VQ_HIPRIO].lock);
/* Initialize the requests virtqueues */
for (i = VQ_REQUEST; i < fs->nvqs; i++) {
- spin_lock_init(&fs->vqs[i].lock);
- INIT_WORK(&fs->vqs[i].done_work, virtio_fs_requests_done_work);
- INIT_DELAYED_WORK(&fs->vqs[i].dispatch_work,
- virtio_fs_request_dispatch_work);
- INIT_LIST_HEAD(&fs->vqs[i].queued_reqs);
- INIT_LIST_HEAD(&fs->vqs[i].end_reqs);
- init_completion(&fs->vqs[i].in_flight_zero);
- snprintf(fs->vqs[i].name, sizeof(fs->vqs[i].name),
- "requests.%u", i - VQ_REQUEST);
+ char vq_name[VQ_NAME_LEN];
+
+ snprintf(vq_name, VQ_NAME_LEN, "requests.%u", i - VQ_REQUEST);
+ virtio_fs_init_vq(&fs->vqs[i], vq_name, VQ_REQUEST);
callbacks[i] = virtio_fs_vq_done;
names[i] = fs->vqs[i].name;
}
--
2.20.1
From: Stefan Hajnoczi <[email protected]>
Set up a dax device.
Use the shm capability to find the cache entry and map it.
The DAX window is accessed by the fs/dax.c infrastructure and must have
struct pages (at least on x86). Use devm_memremap_pages() to map the
DAX window PCI BAR and allocate struct page.
Signed-off-by: Stefan Hajnoczi <[email protected]>
Signed-off-by: Dr. David Alan Gilbert <[email protected]>
Signed-off-by: Vivek Goyal <[email protected]>
Signed-off-by: Sebastien Boeuf <[email protected]>
Signed-off-by: Liu Bo <[email protected]>
---
fs/fuse/virtio_fs.c | 115 +++++++++++++++++++++++++++++++++
include/uapi/linux/virtio_fs.h | 3 +
2 files changed, 118 insertions(+)
diff --git a/fs/fuse/virtio_fs.c b/fs/fuse/virtio_fs.c
index 62cdd6817b5b..b0574b208cd5 100644
--- a/fs/fuse/virtio_fs.c
+++ b/fs/fuse/virtio_fs.c
@@ -5,6 +5,9 @@
*/
#include <linux/fs.h>
+#include <linux/dax.h>
+#include <linux/pci.h>
+#include <linux/pfn_t.h>
#include <linux/module.h>
#include <linux/virtio.h>
#include <linux/virtio_fs.h>
@@ -50,6 +53,12 @@ struct virtio_fs {
struct virtio_fs_vq *vqs;
unsigned int nvqs; /* number of virtqueues */
unsigned int num_request_queues; /* number of request queues */
+ struct dax_device *dax_dev;
+
+ /* DAX memory window where file contents are mapped */
+ void *window_kaddr;
+ phys_addr_t window_phys_addr;
+ size_t window_len;
};
struct virtio_fs_forget_req {
@@ -690,6 +699,108 @@ static void virtio_fs_cleanup_vqs(struct virtio_device *vdev,
vdev->config->del_vqs(vdev);
}
+/* Map a window offset to a page frame number. The window offset will have
+ * been produced by .iomap_begin(), which maps a file offset to a window
+ * offset.
+ */
+static long virtio_fs_direct_access(struct dax_device *dax_dev, pgoff_t pgoff,
+ long nr_pages, void **kaddr, pfn_t *pfn)
+{
+ struct virtio_fs *fs = dax_get_private(dax_dev);
+ phys_addr_t offset = PFN_PHYS(pgoff);
+ size_t max_nr_pages = fs->window_len/PAGE_SIZE - pgoff;
+
+ if (kaddr)
+ *kaddr = fs->window_kaddr + offset;
+ if (pfn)
+ *pfn = phys_to_pfn_t(fs->window_phys_addr + offset,
+ PFN_DEV | PFN_MAP);
+ return nr_pages > max_nr_pages ? max_nr_pages : nr_pages;
+}
+
+static size_t virtio_fs_copy_from_iter(struct dax_device *dax_dev,
+ pgoff_t pgoff, void *addr,
+ size_t bytes, struct iov_iter *i)
+{
+ return copy_from_iter(addr, bytes, i);
+}
+
+static size_t virtio_fs_copy_to_iter(struct dax_device *dax_dev,
+ pgoff_t pgoff, void *addr,
+ size_t bytes, struct iov_iter *i)
+{
+ return copy_to_iter(addr, bytes, i);
+}
+
+static const struct dax_operations virtio_fs_dax_ops = {
+ .direct_access = virtio_fs_direct_access,
+ .copy_from_iter = virtio_fs_copy_from_iter,
+ .copy_to_iter = virtio_fs_copy_to_iter,
+};
+
+static void virtio_fs_cleanup_dax(void *data)
+{
+ struct virtio_fs *fs = data;
+
+ kill_dax(fs->dax_dev);
+ put_dax(fs->dax_dev);
+}
+
+static int virtio_fs_setup_dax(struct virtio_device *vdev, struct virtio_fs *fs)
+{
+ struct virtio_shm_region cache_reg;
+ struct dev_pagemap *pgmap;
+ bool have_cache;
+
+ if (!IS_ENABLED(CONFIG_DAX_DRIVER))
+ return 0;
+
+ /* Get cache region */
+ have_cache = virtio_get_shm_region(vdev, &cache_reg,
+ (u8)VIRTIO_FS_SHMCAP_ID_CACHE);
+ if (!have_cache) {
+ dev_notice(&vdev->dev, "%s: No cache capability\n", __func__);
+ return 0;
+ } else {
+ dev_notice(&vdev->dev, "Cache len: 0x%llx @ 0x%llx\n",
+ cache_reg.len, cache_reg.addr);
+ }
+
+ pgmap = devm_kzalloc(&vdev->dev, sizeof(*pgmap), GFP_KERNEL);
+ if (!pgmap)
+ return -ENOMEM;
+
+ pgmap->type = MEMORY_DEVICE_FS_DAX;
+
+ /* Ideally we would directly use the PCI BAR resource but
+ * devm_memremap_pages() wants its own copy in pgmap. So
+ * initialize a struct resource from scratch (only the start
+ * and end fields will be used).
+ */
+ pgmap->res = (struct resource){
+ .name = "virtio-fs dax window",
+ .start = (phys_addr_t) cache_reg.addr,
+ .end = (phys_addr_t) cache_reg.addr + cache_reg.len - 1,
+ };
+
+ fs->window_kaddr = devm_memremap_pages(&vdev->dev, pgmap);
+ if (IS_ERR(fs->window_kaddr))
+ return PTR_ERR(fs->window_kaddr);
+
+ fs->window_phys_addr = (phys_addr_t) cache_reg.addr;
+ fs->window_len = (phys_addr_t) cache_reg.len;
+
+ dev_dbg(&vdev->dev, "%s: window kaddr 0x%px phys_addr 0x%llx"
+ " len 0x%llx\n", __func__, fs->window_kaddr, cache_reg.addr,
+ cache_reg.len);
+
+ fs->dax_dev = alloc_dax(fs, NULL, &virtio_fs_dax_ops, 0);
+ if (!fs->dax_dev)
+ return -ENOMEM;
+
+ return devm_add_action_or_reset(&vdev->dev, virtio_fs_cleanup_dax, fs);
+}
+
static int virtio_fs_probe(struct virtio_device *vdev)
{
struct virtio_fs *fs;
@@ -711,6 +822,10 @@ static int virtio_fs_probe(struct virtio_device *vdev)
/* TODO vq affinity */
+ ret = virtio_fs_setup_dax(vdev, fs);
+ if (ret < 0)
+ goto out_vqs;
+
/* Bring the device online in case the filesystem is mounted and
* requests need to be sent before we return.
*/
diff --git a/include/uapi/linux/virtio_fs.h b/include/uapi/linux/virtio_fs.h
index b02eb2ac3d99..2f64abce781f 100644
--- a/include/uapi/linux/virtio_fs.h
+++ b/include/uapi/linux/virtio_fs.h
@@ -16,4 +16,7 @@ struct virtio_fs_config {
__u32 num_request_queues;
} __attribute__((packed));
+/* For the id field in virtio_pci_shm_cap */
+#define VIRTIO_FS_SHMCAP_ID_CACHE 0
+
#endif /* _UAPI_LINUX_VIRTIO_FS_H */
--
2.20.1
virtiofs does not have a block device. Modify bdev_dax_pgoff() to be
able to handle that.
If there is no bdev, the dax offset is 0. (It can't be a partition
block device starting at an offset in the dax device.)
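A userspace sketch of the resulting offset arithmetic (assuming 4K pages, as
on x86; PHYS_PFN and the 512-byte sector size mirror the kernel definitions):
with no bdev the partition start sector is taken as 0, so the pgoff is
derived from the sector alone.

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_SHIFT 12			/* assume 4K pages */
#define PHYS_PFN(x) ((uint64_t)(x) >> PAGE_SHIFT)

/* Mirror of the patched bdev_dax_pgoff() arithmetic: start_sect is
 * get_start_sect(bdev) for a real block device, or 0 when bdev == NULL
 * (the virtiofs case). Sectors are 512 bytes. */
static uint64_t dax_pgoff(uint64_t start_sect, uint64_t sector)
{
	uint64_t phys_off = (start_sect + sector) * 512;

	return PHYS_PFN(phys_off);
}
```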
Signed-off-by: Stefan Hajnoczi <[email protected]>
Signed-off-by: Vivek Goyal <[email protected]>
---
drivers/dax/super.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/drivers/dax/super.c b/drivers/dax/super.c
index 0aa4b6bc5101..c34f21f2f199 100644
--- a/drivers/dax/super.c
+++ b/drivers/dax/super.c
@@ -46,7 +46,8 @@ EXPORT_SYMBOL_GPL(dax_read_unlock);
int bdev_dax_pgoff(struct block_device *bdev, sector_t sector, size_t size,
pgoff_t *pgoff)
{
- phys_addr_t phys_off = (get_start_sect(bdev) + sector) * 512;
+ sector_t start_sect = bdev ? get_start_sect(bdev) : 0;
+ phys_addr_t phys_off = (start_sect + sector) * 512;
if (pgoff)
*pgoff = PHYS_PFN(phys_off);
--
2.20.1
The virtiofs device has a range of memory which is mapped into file inodes
using dax. This memory is mapped in qemu on the host and maps different
sections of real files on the host. The size of this memory is limited
(determined by the administrator) and, depending on filesystem size, we will
soon reach a situation where all the memory is in use and we need to
reclaim some.
As part of the reclaim process, we will need to make sure that there are
no active references to pages (taken by get_user_pages()) in the memory
range we are trying to reclaim. I am planning to use
dax_layout_busy_page() for this. But in its current form it is per inode
and scans through all the pages of the inode.
We want to reclaim only a portion of memory (say a 2MB range), so we want
to make sure that only that 2MB range of pages has no references (and we
don't want to unmap all the pages of the inode).
Hence, create a range version of this function, named
dax_layout_busy_page_range(), which can be passed the range which
needs to be unmapped.
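The intended index/length arithmetic can be sketched in userspace as follows
(4K pages assumed; this mirrors the logic described above, not the exact
kernel code): the unmap length is measured from the rounded-down start so a
partial first page is still covered, and end == 0 means "to end of file".

```c
#include <assert.h>
#include <stdint.h>

#define SK_PAGE_SIZE 4096ULL
#define SK_PAGE_SHIFT 12

struct busy_range {
	uint64_t start_idx;	/* first page index to scan */
	uint64_t end_idx;	/* last page index to scan */
	uint64_t len;		/* length for unmap_mapping_range(); 0 = to EOF */
};

/* Sketch of the range setup in dax_layout_busy_page_range(): the length
 * is computed from the page-aligned start, so e.g. start=4094, end=4096
 * yields len=4097 and covers page indices 0 and 1. */
static struct busy_range busy_range(uint64_t start, uint64_t end)
{
	struct busy_range r;
	uint64_t lstart = start & ~(SK_PAGE_SIZE - 1);	/* round_down */

	r.start_idx = start >> SK_PAGE_SHIFT;
	if (end == 0) {			/* whole file from start onwards */
		r.end_idx = UINT64_MAX;
		r.len = 0;
	} else {
		r.end_idx = end >> SK_PAGE_SHIFT;
		r.len = end - lstart + 1;
	}
	return r;
}
```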
Signed-off-by: Vivek Goyal <[email protected]>
---
fs/dax.c | 66 ++++++++++++++++++++++++++++++++-------------
include/linux/dax.h | 6 +++++
2 files changed, 54 insertions(+), 18 deletions(-)
diff --git a/fs/dax.c b/fs/dax.c
index 35da144375a0..fde92bb5da69 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -558,27 +558,20 @@ static void *grab_mapping_entry(struct xa_state *xas,
return xa_mk_internal(VM_FAULT_FALLBACK);
}
-/**
- * dax_layout_busy_page - find first pinned page in @mapping
- * @mapping: address space to scan for a page with ref count > 1
- *
- * DAX requires ZONE_DEVICE mapped pages. These pages are never
- * 'onlined' to the page allocator so they are considered idle when
- * page->count == 1. A filesystem uses this interface to determine if
- * any page in the mapping is busy, i.e. for DMA, or other
- * get_user_pages() usages.
- *
- * It is expected that the filesystem is holding locks to block the
- * establishment of new mappings in this address_space. I.e. it expects
- * to be able to run unmap_mapping_range() and subsequently not race
- * mapping_mapped() becoming true.
+/*
+ * Partial pages are included. If end is 0, pages in the range from start
+ * to end of the file are included.
*/
-struct page *dax_layout_busy_page(struct address_space *mapping)
+struct page *dax_layout_busy_page_range(struct address_space *mapping,
+ loff_t start, loff_t end)
{
- XA_STATE(xas, &mapping->i_pages, 0);
void *entry;
unsigned int scanned = 0;
struct page *page = NULL;
+ pgoff_t start_idx = start >> PAGE_SHIFT;
+ pgoff_t end_idx = end >> PAGE_SHIFT;
+ XA_STATE(xas, &mapping->i_pages, start_idx);
+ loff_t len, lstart = round_down(start, PAGE_SIZE);
/*
* In the 'limited' case get_user_pages() for dax is disabled.
@@ -589,6 +582,22 @@ struct page *dax_layout_busy_page(struct address_space *mapping)
if (!dax_mapping(mapping) || !mapping_mapped(mapping))
return NULL;
+ /* If end == 0, all pages from start to till end of file */
+ if (!end) {
+ end_idx = ULONG_MAX;
+ len = 0;
+ } else {
+ /* length is being calculated from lstart and not start.
+ * This is due to behavior of unmap_mapping_range(). If
+ * start is say 4094 and end is on 4096 then we want to
+ * unamp two pages, idx 0 and 1. But unmap_mapping_range()
+ * will unmap only page at idx 0. If we calculate len
+ * from the rounded down start, this problem should not
+ * happen.
+ */
+ len = end - lstart + 1;
+ }
+
/*
* If we race get_user_pages_fast() here either we'll see the
* elevated page count in the iteration and wait, or
@@ -601,10 +610,10 @@ struct page *dax_layout_busy_page(struct address_space *mapping)
* guaranteed to either see new references or prevent new
* references from being established.
*/
- unmap_mapping_range(mapping, 0, 0, 0);
+ unmap_mapping_range(mapping, start, len, 0);
xas_lock_irq(&xas);
- xas_for_each(&xas, entry, ULONG_MAX) {
+ xas_for_each(&xas, entry, end_idx) {
if (WARN_ON_ONCE(!xa_is_value(entry)))
continue;
if (unlikely(dax_is_locked(entry)))
@@ -625,6 +634,27 @@ struct page *dax_layout_busy_page(struct address_space *mapping)
xas_unlock_irq(&xas);
return page;
}
+EXPORT_SYMBOL_GPL(dax_layout_busy_page_range);
+
+/**
+ * dax_layout_busy_page - find first pinned page in @mapping
+ * @mapping: address space to scan for a page with ref count > 1
+ *
+ * DAX requires ZONE_DEVICE mapped pages. These pages are never
+ * 'onlined' to the page allocator so they are considered idle when
+ * page->count == 1. A filesystem uses this interface to determine if
+ * any page in the mapping is busy, i.e. for DMA, or other
+ * get_user_pages() usages.
+ *
+ * It is expected that the filesystem is holding locks to block the
+ * establishment of new mappings in this address_space. I.e. it expects
+ * to be able to run unmap_mapping_range() and subsequently not race
+ * mapping_mapped() becoming true.
+ */
+struct page *dax_layout_busy_page(struct address_space *mapping)
+{
+ return dax_layout_busy_page_range(mapping, 0, 0);
+}
EXPORT_SYMBOL_GPL(dax_layout_busy_page);
static int __dax_invalidate_entry(struct address_space *mapping,
diff --git a/include/linux/dax.h b/include/linux/dax.h
index 328c2dbb4409..4fd4f866a4d3 100644
--- a/include/linux/dax.h
+++ b/include/linux/dax.h
@@ -139,6 +139,7 @@ int dax_writeback_mapping_range(struct address_space *mapping,
struct dax_device *dax_dev, struct writeback_control *wbc);
struct page *dax_layout_busy_page(struct address_space *mapping);
+struct page *dax_layout_busy_page_range(struct address_space *mapping, loff_t start, loff_t end);
dax_entry_t dax_lock_page(struct page *page);
void dax_unlock_page(struct page *page, dax_entry_t cookie);
#else
@@ -169,6 +170,11 @@ static inline struct page *dax_layout_busy_page(struct address_space *mapping)
return NULL;
}
+static inline struct page *dax_layout_busy_page_range(struct address_space *mapping, loff_t start, loff_t end)
+{
+ return NULL;
+}
+
static inline int dax_writeback_mapping_range(struct address_space *mapping,
struct dax_device *dax_dev, struct writeback_control *wbc)
{
--
2.20.1
This patch implements basic DAX support. mmap() is not implemented
yet and will come in later patches. This patch implements read/write.
We make use of an interval tree to keep track of per-inode dax mappings.
Do not use dax for file-extending writes; instead just send a WRITE message
to the daemon (like we do for the direct I/O path). This keeps the write and
i_size change atomic w.r.t. crash.
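The per-inode bookkeeping can be pictured with a small sketch (the struct
layout and the 2MB FUSE_DAX_MEM_RANGE_SZ below are illustrative of the
scheme, not the exact kernel definitions): each interval-tree node maps a
range-aligned span of the file to an offset inside the device's DAX window,
and I/O at file position pos is redirected to window_offset + (pos - start).

```c
#include <assert.h>
#include <stdint.h>

#define FUSE_DAX_MEM_RANGE_SZ (2ULL * 1024 * 1024)	/* assumed 2MB ranges */

/* Simplified stand-in for one interval-tree node (fuse_dax_mapping). */
struct dax_mapping {
	uint64_t start;		/* file offset, FUSE_DAX_MEM_RANGE_SZ aligned */
	uint64_t end;		/* start + FUSE_DAX_MEM_RANGE_SZ - 1 */
	uint64_t window_offset;	/* offset into the device DAX window */
};

/* Translate a file position covered by an established mapping into the
 * corresponding position in the DAX window. */
static uint64_t dax_window_pos(const struct dax_mapping *m, uint64_t pos)
{
	assert(pos >= m->start && pos <= m->end);
	return m->window_offset + (pos - m->start);
}
```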
Signed-off-by: Stefan Hajnoczi <[email protected]>
Signed-off-by: Dr. David Alan Gilbert <[email protected]>
Signed-off-by: Vivek Goyal <[email protected]>
Signed-off-by: Miklos Szeredi <[email protected]>
Signed-off-by: Liu Bo <[email protected]>
Signed-off-by: Peng Tao <[email protected]>
---
fs/fuse/file.c | 597 +++++++++++++++++++++++++++++++++++++-
fs/fuse/fuse_i.h | 23 ++
fs/fuse/inode.c | 6 +
include/uapi/linux/fuse.h | 1 +
4 files changed, 621 insertions(+), 6 deletions(-)
diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index 9d67b830fb7a..9effdd3dc6d6 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -18,6 +18,12 @@
#include <linux/swap.h>
#include <linux/falloc.h>
#include <linux/uio.h>
+#include <linux/dax.h>
+#include <linux/iomap.h>
+#include <linux/interval_tree_generic.h>
+
+INTERVAL_TREE_DEFINE(struct fuse_dax_mapping, rb, __u64, __subtree_last,
+ START, LAST, static inline, fuse_dax_interval_tree);
static struct page **fuse_pages_alloc(unsigned int npages, gfp_t flags,
struct fuse_page_desc **desc)
@@ -187,6 +193,242 @@ static void fuse_link_write_file(struct file *file)
spin_unlock(&fi->lock);
}
+static struct fuse_dax_mapping *alloc_dax_mapping(struct fuse_conn *fc)
+{
+ struct fuse_dax_mapping *dmap = NULL;
+
+ spin_lock(&fc->lock);
+
+ if (fc->nr_free_ranges <= 0) {
+ spin_unlock(&fc->lock);
+ return NULL;
+ }
+
+ WARN_ON(list_empty(&fc->free_ranges));
+
+ /* Take a free range */
+ dmap = list_first_entry(&fc->free_ranges, struct fuse_dax_mapping,
+ list);
+ list_del_init(&dmap->list);
+ fc->nr_free_ranges--;
+ spin_unlock(&fc->lock);
+ return dmap;
+}
+
+/* This assumes fc->lock is held */
+static void __dmap_add_to_free_pool(struct fuse_conn *fc,
+ struct fuse_dax_mapping *dmap)
+{
+ list_add_tail(&dmap->list, &fc->free_ranges);
+ fc->nr_free_ranges++;
+}
+
+static void dmap_add_to_free_pool(struct fuse_conn *fc,
+ struct fuse_dax_mapping *dmap)
+{
+ /* Return fuse_dax_mapping to free list */
+ spin_lock(&fc->lock);
+ __dmap_add_to_free_pool(fc, dmap);
+ spin_unlock(&fc->lock);
+}
+
+/* offset passed in should be aligned to FUSE_DAX_MEM_RANGE_SZ */
+static int fuse_setup_one_mapping(struct inode *inode, loff_t offset,
+ struct fuse_dax_mapping *dmap, bool writable,
+ bool upgrade)
+{
+ struct fuse_conn *fc = get_fuse_conn(inode);
+ struct fuse_inode *fi = get_fuse_inode(inode);
+ struct fuse_setupmapping_in inarg;
+ FUSE_ARGS(args);
+ ssize_t err;
+
+ WARN_ON(offset % FUSE_DAX_MEM_RANGE_SZ);
+ WARN_ON(fc->nr_free_ranges < 0);
+
+ /* Ask fuse daemon to setup mapping */
+ memset(&inarg, 0, sizeof(inarg));
+ inarg.foffset = offset;
+ inarg.fh = -1;
+ inarg.moffset = dmap->window_offset;
+ inarg.len = FUSE_DAX_MEM_RANGE_SZ;
+ inarg.flags |= FUSE_SETUPMAPPING_FLAG_READ;
+ if (writable)
+ inarg.flags |= FUSE_SETUPMAPPING_FLAG_WRITE;
+ args.opcode = FUSE_SETUPMAPPING;
+ args.nodeid = fi->nodeid;
+ args.in_numargs = 1;
+ args.in_args[0].size = sizeof(inarg);
+ args.in_args[0].value = &inarg;
+ err = fuse_simple_request(fc, &args);
+ if (err < 0) {
+ printk(KERN_ERR "%s request failed at mem_offset=0x%llx %zd\n",
+ __func__, dmap->window_offset, err);
+ return err;
+ }
+
+ pr_debug("fuse_setup_one_mapping() succeeded. offset=0x%llx writable=%d"
+ " err=%zd\n", offset, writable, err);
+
+ dmap->writable = writable;
+ if (!upgrade) {
+ dmap->start = offset;
+ dmap->end = offset + FUSE_DAX_MEM_RANGE_SZ - 1;
+ /* Protected by fi->i_dmap_sem */
+ fuse_dax_interval_tree_insert(dmap, &fi->dmap_tree);
+ fi->nr_dmaps++;
+ }
+ return 0;
+}
+
+static int
+fuse_send_removemapping(struct inode *inode,
+ struct fuse_removemapping_in *inargp,
+ struct fuse_removemapping_one *remove_one)
+{
+ struct fuse_inode *fi = get_fuse_inode(inode);
+ struct fuse_conn *fc = get_fuse_conn(inode);
+ FUSE_ARGS(args);
+
+ args.opcode = FUSE_REMOVEMAPPING;
+ args.nodeid = fi->nodeid;
+ args.in_numargs = 2;
+ args.in_args[0].size = sizeof(*inargp);
+ args.in_args[0].value = inargp;
+ args.in_args[1].size = inargp->count * sizeof(*remove_one);
+ args.in_args[1].value = remove_one;
+ return fuse_simple_request(fc, &args);
+}
+
+static int dmap_removemapping_list(struct inode *inode, unsigned num,
+ struct list_head *to_remove)
+{
+ struct fuse_removemapping_one *remove_one, *ptr;
+ struct fuse_removemapping_in inarg;
+ struct fuse_dax_mapping *dmap;
+ int ret = 0, i = 0, nr_alloc;
+
+ nr_alloc = min_t(unsigned int, num, FUSE_REMOVEMAPPING_MAX_ENTRY);
+ remove_one = kmalloc_array(nr_alloc, sizeof(*remove_one), GFP_NOFS);
+ if (!remove_one)
+ return -ENOMEM;
+
+ ptr = remove_one;
+ list_for_each_entry(dmap, to_remove, list) {
+ ptr->moffset = dmap->window_offset;
+ ptr->len = dmap->length;
+ ptr++;
+ i++;
+ num--;
+ if (i >= nr_alloc || num == 0) {
+ memset(&inarg, 0, sizeof(inarg));
+ inarg.count = i;
+ ret = fuse_send_removemapping(inode, &inarg,
+ remove_one);
+ if (ret)
+ goto out;
+ ptr = remove_one;
+ i = 0;
+ }
+ }
+out:
+ kfree(remove_one);
+ return ret;
+}
+
+/*
+ * Cleanup dmap entry and add back to free list. This should be called with
+ * fc->lock held.
+ */
+static void dmap_reinit_add_to_free_pool(struct fuse_conn *fc,
+ struct fuse_dax_mapping *dmap)
+{
+ pr_debug("fuse: freeing memory range start=0x%llx end=0x%llx "
+ "window_offset=0x%llx length=0x%llx\n", dmap->start,
+ dmap->end, dmap->window_offset, dmap->length);
+ dmap->start = dmap->end = 0;
+ __dmap_add_to_free_pool(fc, dmap);
+}
+
+/*
+ * Free inode dmap entries whose range falls entirely inside [start, end].
+ * Does not take any locks. Currently it should only be called from the
+ * evict_inode() path, where we know all dmap entries can be reclaimed.
+ */
+static void inode_reclaim_dmap_range(struct fuse_conn *fc, struct inode *inode,
+ loff_t start, loff_t end)
+{
+ struct fuse_inode *fi = get_fuse_inode(inode);
+ struct fuse_dax_mapping *dmap, *n;
+ int err, num = 0;
+ LIST_HEAD(to_remove);
+
+ pr_debug("fuse: %s: start=0x%llx, end=0x%llx\n", __func__, start, end);
+
+ /*
+ * Interval tree search matches intersecting entries. Adjust the range
+ * to avoid dropping partial valid entries.
+ */
+ start = ALIGN(start, FUSE_DAX_MEM_RANGE_SZ);
+ end = ALIGN_DOWN(end, FUSE_DAX_MEM_RANGE_SZ);
+
+ while (1) {
+ dmap = fuse_dax_interval_tree_iter_first(&fi->dmap_tree, start,
+ end);
+ if (!dmap)
+ break;
+ fuse_dax_interval_tree_remove(dmap, &fi->dmap_tree);
+ num++;
+ list_add(&dmap->list, &to_remove);
+ }
+
+ /* Nothing to remove */
+ if (list_empty(&to_remove))
+ return;
+
+ WARN_ON(fi->nr_dmaps < num);
+ fi->nr_dmaps -= num;
+ /*
+ * During umount/shutdown, fuse connection is dropped first
+ * and evict_inode() is called later. That means any
+ * removemapping messages are going to fail. Send messages
+ * only if connection is up. Otherwise fuse daemon is
+ * responsible for cleaning up any leftover references and
+ * mappings.
+ */
+ if (fc->connected) {
+ err = dmap_removemapping_list(inode, num, &to_remove);
+ if (err) {
+ pr_warn("Failed to remove mappings. start=0x%llx"
+ " end=0x%llx\n", start, end);
+ }
+ }
+ spin_lock(&fc->lock);
+ list_for_each_entry_safe(dmap, n, &to_remove, list) {
+ list_del_init(&dmap->list);
+ dmap_reinit_add_to_free_pool(fc, dmap);
+ }
+ spin_unlock(&fc->lock);
+}
+
+/*
+ * This is called from evict_inode(), by which time the inode is going
+ * away. So this function does not take any locks (such as fi->i_dmap_sem)
+ * while traversing the fuse inode interval tree. If that lock were taken,
+ * lockdep would complain about a deadlock w.r.t. the fs_reclaim lock.
+ */
+void fuse_cleanup_inode_mappings(struct inode *inode)
+{
+ struct fuse_conn *fc = get_fuse_conn(inode);
+ /*
+ * fuse_evict_inode() has already called truncate_inode_pages_final()
+ * before we arrive here. So we should not have to worry about
+ * any pages/exception entries still associated with the inode.
+ */
+ inode_reclaim_dmap_range(fc, inode, 0, -1);
+}
+
void fuse_finish_open(struct inode *inode, struct file *file)
{
struct fuse_file *ff = file->private_data;
@@ -1562,32 +1804,364 @@ static ssize_t fuse_direct_write_iter(struct kiocb *iocb, struct iov_iter *from)
return res;
}
+static ssize_t fuse_dax_read_iter(struct kiocb *iocb, struct iov_iter *to);
static ssize_t fuse_file_read_iter(struct kiocb *iocb, struct iov_iter *to)
{
struct file *file = iocb->ki_filp;
struct fuse_file *ff = file->private_data;
+ struct inode *inode = file->f_mapping->host;
if (is_bad_inode(file_inode(file)))
return -EIO;
- if (!(ff->open_flags & FOPEN_DIRECT_IO))
- return fuse_cache_read_iter(iocb, to);
- else
+ if (IS_DAX(inode))
+ return fuse_dax_read_iter(iocb, to);
+
+ if (ff->open_flags & FOPEN_DIRECT_IO)
return fuse_direct_read_iter(iocb, to);
+
+ return fuse_cache_read_iter(iocb, to);
}
+static ssize_t fuse_dax_write_iter(struct kiocb *iocb, struct iov_iter *from);
static ssize_t fuse_file_write_iter(struct kiocb *iocb, struct iov_iter *from)
{
struct file *file = iocb->ki_filp;
struct fuse_file *ff = file->private_data;
+ struct inode *inode = file->f_mapping->host;
if (is_bad_inode(file_inode(file)))
return -EIO;
- if (!(ff->open_flags & FOPEN_DIRECT_IO))
- return fuse_cache_write_iter(iocb, from);
- else
+ if (IS_DAX(inode))
+ return fuse_dax_write_iter(iocb, from);
+
+ if (ff->open_flags & FOPEN_DIRECT_IO)
return fuse_direct_write_iter(iocb, from);
+
+ return fuse_cache_write_iter(iocb, from);
+}
+
+static void fuse_fill_iomap_hole(struct iomap *iomap, loff_t length)
+{
+ iomap->addr = IOMAP_NULL_ADDR;
+ iomap->length = length;
+ iomap->type = IOMAP_HOLE;
+}
+
+static void fuse_fill_iomap(struct inode *inode, loff_t pos, loff_t length,
+ struct iomap *iomap, struct fuse_dax_mapping *dmap,
+ unsigned flags)
+{
+ loff_t offset, len;
+ loff_t i_size = i_size_read(inode);
+
+ offset = pos - dmap->start;
+ len = min(length, dmap->length - offset);
+
+ /* If length is beyond end of file, truncate further */
+ if (pos + len > i_size)
+ len = i_size - pos;
+
+ if (len > 0) {
+ iomap->addr = dmap->window_offset + offset;
+ iomap->length = len;
+ if (flags & IOMAP_FAULT)
+ iomap->length = ALIGN(len, PAGE_SIZE);
+ iomap->type = IOMAP_MAPPED;
+ pr_debug("%s: returns iomap: addr 0x%llx offset 0x%llx"
+ " length 0x%llx\n", __func__, iomap->addr,
+ iomap->offset, iomap->length);
+ } else {
+ /* Mapping beyond end of file is hole */
+ fuse_fill_iomap_hole(iomap, length);
+ pr_debug("%s: returns iomap: addr 0x%llx offset 0x%llx"
+ "length 0x%llx\n", __func__, iomap->addr,
+ iomap->offset, iomap->length);
+ }
+}
+
+static int iomap_begin_setup_new_mapping(struct inode *inode, loff_t pos,
+ loff_t length, unsigned flags,
+ struct iomap *iomap)
+{
+ struct fuse_inode *fi = get_fuse_inode(inode);
+ struct fuse_conn *fc = get_fuse_conn(inode);
+ struct fuse_dax_mapping *dmap, *alloc_dmap = NULL;
+ int ret;
+ bool writable = flags & IOMAP_WRITE;
+
+ alloc_dmap = alloc_dax_mapping(fc);
+ if (!alloc_dmap)
+ return -EBUSY;
+
+ /*
+ * Take write lock so that only one caller can try to setup mapping
+ * and other waits.
+ */
+ down_write(&fi->i_dmap_sem);
+ /*
+ * We dropped lock. Check again if somebody else setup
+ * mapping already.
+ */
+ dmap = fuse_dax_interval_tree_iter_first(&fi->dmap_tree, pos,
+ pos);
+ if (dmap) {
+ fuse_fill_iomap(inode, pos, length, iomap, dmap, flags);
+ dmap_add_to_free_pool(fc, alloc_dmap);
+ up_write(&fi->i_dmap_sem);
+ return 0;
+ }
+
+ /* Setup one mapping */
+ ret = fuse_setup_one_mapping(inode,
+ ALIGN_DOWN(pos, FUSE_DAX_MEM_RANGE_SZ),
+ alloc_dmap, writable, false);
+ if (ret < 0) {
+ pr_err("fuse_setup_one_mapping() failed. err=%d"
+ " pos=0x%llx, writable=%d\n", ret, pos, writable);
+ dmap_add_to_free_pool(fc, alloc_dmap);
+ up_write(&fi->i_dmap_sem);
+ return ret;
+ }
+ fuse_fill_iomap(inode, pos, length, iomap, alloc_dmap, flags);
+ up_write(&fi->i_dmap_sem);
+ return 0;
+}
+
+static int iomap_begin_upgrade_mapping(struct inode *inode, loff_t pos,
+ loff_t length, unsigned flags,
+ struct iomap *iomap)
+{
+ struct fuse_inode *fi = get_fuse_inode(inode);
+ struct fuse_dax_mapping *dmap;
+ int ret;
+
+ /*
+ * Take exclusive lock so that only one caller can try to setup
+ * mapping and others wait.
+ */
+ down_write(&fi->i_dmap_sem);
+ dmap = fuse_dax_interval_tree_iter_first(&fi->dmap_tree, pos, pos);
+
+ /* We are holding either inode lock or i_mmap_sem, and that should
+ * ensure that the dmap can't be reclaimed or truncated and it should
+ * still be there in the tree despite the fact we dropped and re-acquired
+ * lock.
+ */
+ ret = -EIO;
+ if (WARN_ON(!dmap))
+ goto out_err;
+
+ /* Maybe another thread already upgraded mapping while we were not
+ * holding lock.
+ */
+ if (dmap->writable) {
+ ret = 0;
+ goto out_fill_iomap;
+ }
+
+ ret = fuse_setup_one_mapping(inode,
+ ALIGN_DOWN(pos, FUSE_DAX_MEM_RANGE_SZ),
+ dmap, true, true);
+ if (ret < 0) {
+ pr_err("fuse_setup_one_mapping() failed. err=%d pos=0x%llx\n",
+ ret, pos);
+ goto out_err;
+ }
+
+out_fill_iomap:
+ fuse_fill_iomap(inode, pos, length, iomap, dmap, flags);
+out_err:
+ up_write(&fi->i_dmap_sem);
+ return ret;
+}
+
+/* This is just for DAX and the mapping is ephemeral, do not use it for other
+ * purposes since there is no block device with a permanent mapping.
+ */
+static int fuse_iomap_begin(struct inode *inode, loff_t pos, loff_t length,
+ unsigned flags, struct iomap *iomap,
+ struct iomap *srcmap)
+{
+ struct fuse_inode *fi = get_fuse_inode(inode);
+ struct fuse_conn *fc = get_fuse_conn(inode);
+ struct fuse_dax_mapping *dmap;
+ bool writable = flags & IOMAP_WRITE;
+
+ /* We don't support FIEMAP */
+ BUG_ON(flags & IOMAP_REPORT);
+
+ pr_debug("fuse_iomap_begin() called. pos=0x%llx length=0x%llx\n",
+ pos, length);
+
+ /*
+ * Writes beyond end of file are not handled using dax path. Instead
+ * a fuse write message is sent to daemon
+ */
+ if (flags & IOMAP_WRITE && pos >= i_size_read(inode))
+ return -EIO;
+
+ iomap->offset = pos;
+ iomap->flags = 0;
+ iomap->bdev = NULL;
+ iomap->dax_dev = fc->dax_dev;
+
+ /*
+ * Both read/write and mmap path can race here. So we need something
+ * to make sure if we are setting up mapping, then other path waits
+ *
+ * For now, use a semaphore for this. It probably needs to be
+ * optimized later.
+ */
+ down_read(&fi->i_dmap_sem);
+ dmap = fuse_dax_interval_tree_iter_first(&fi->dmap_tree, pos, pos);
+
+ if (dmap) {
+ if (writable && !dmap->writable) {
+ /* Upgrade read-only mapping to read-write. This will
+ * require exclusive i_dmap_sem lock as we don't want
+ * two threads trying to do this simultaneously
+ * for the same dmap. So drop shared lock and acquire
+ * exclusive lock.
+ */
+ up_read(&fi->i_dmap_sem);
+ pr_debug("%s: Upgrading mapping at offset 0x%llx"
+ " length 0x%llx\n", __func__, pos, length);
+ return iomap_begin_upgrade_mapping(inode, pos, length,
+ flags, iomap);
+ } else {
+ fuse_fill_iomap(inode, pos, length, iomap, dmap, flags);
+ up_read(&fi->i_dmap_sem);
+ return 0;
+ }
+ } else {
+ up_read(&fi->i_dmap_sem);
+ pr_debug("%s: no mapping at offset 0x%llx length 0x%llx\n",
+ __func__, pos, length);
+ if (pos >= i_size_read(inode))
+ goto iomap_hole;
+
+ return iomap_begin_setup_new_mapping(inode, pos, length, flags,
+ iomap);
+ }
+
+ /*
+ * If a read beyond end of file happens, fs code returns
+ * it as a hole
+ */
+iomap_hole:
+ fuse_fill_iomap_hole(iomap, length);
+ pr_debug("fuse_iomap_begin() returning hole mapping. pos=0x%llx length_asked=0x%llx length_returned=0x%llx\n", pos, length, iomap->length);
+ return 0;
+}
+
+static int fuse_iomap_end(struct inode *inode, loff_t pos, loff_t length,
+ ssize_t written, unsigned flags,
+ struct iomap *iomap)
+{
+ /* DAX writes beyond end-of-file aren't handled using iomap, so the
+ * file size is unchanged and there is nothing to do here.
+ */
+ return 0;
+}
+
+static const struct iomap_ops fuse_iomap_ops = {
+ .iomap_begin = fuse_iomap_begin,
+ .iomap_end = fuse_iomap_end,
+};
+
+static ssize_t fuse_dax_read_iter(struct kiocb *iocb, struct iov_iter *to)
+{
+ struct inode *inode = file_inode(iocb->ki_filp);
+ ssize_t ret;
+
+ if (iocb->ki_flags & IOCB_NOWAIT) {
+ if (!inode_trylock_shared(inode))
+ return -EAGAIN;
+ } else {
+ inode_lock_shared(inode);
+ }
+
+ ret = dax_iomap_rw(iocb, to, &fuse_iomap_ops);
+ inode_unlock_shared(inode);
+
+ /* TODO file_accessed(iocb->f_filp) */
+ return ret;
+}
+
+static bool file_extending_write(struct kiocb *iocb, struct iov_iter *from)
+{
+ struct inode *inode = file_inode(iocb->ki_filp);
+
+ return (iov_iter_rw(from) == WRITE &&
+ ((iocb->ki_pos) >= i_size_read(inode)));
+}
+
+static ssize_t fuse_dax_direct_write(struct kiocb *iocb, struct iov_iter *from)
+{
+ struct inode *inode = file_inode(iocb->ki_filp);
+ struct fuse_io_priv io = FUSE_IO_PRIV_SYNC(iocb);
+ ssize_t ret;
+
+ ret = fuse_direct_io(&io, from, &iocb->ki_pos, FUSE_DIO_WRITE);
+ if (ret < 0)
+ return ret;
+
+ fuse_invalidate_attr(inode);
+ fuse_write_update_size(inode, iocb->ki_pos);
+ return ret;
+}
+
+static ssize_t fuse_dax_write_iter(struct kiocb *iocb, struct iov_iter *from)
+{
+ struct inode *inode = file_inode(iocb->ki_filp);
+ ssize_t ret, count;
+
+ if (iocb->ki_flags & IOCB_NOWAIT) {
+ if (!inode_trylock(inode))
+ return -EAGAIN;
+ } else {
+ inode_lock(inode);
+ }
+
+ ret = generic_write_checks(iocb, from);
+ if (ret <= 0)
+ goto out;
+
+ ret = file_remove_privs(iocb->ki_filp);
+ if (ret)
+ goto out;
+ /* TODO file_update_time() but we don't want metadata I/O */
+
+ /* Do not use dax for file extending writes as it's an mmap and
+ * trying to write beyond end of existing page will generate
+ * SIGBUS.
+ */
+ if (file_extending_write(iocb, from)) {
+ ret = fuse_dax_direct_write(iocb, from);
+ goto out;
+ }
+
+ ret = dax_iomap_rw(iocb, from, &fuse_iomap_ops);
+ if (ret < 0)
+ goto out;
+
+ /*
+ * If part of the write was file extending, fuse dax path will not
+ * take care of that. Do direct write instead.
+ */
+ if (iov_iter_count(from) && file_extending_write(iocb, from)) {
+ count = fuse_dax_direct_write(iocb, from);
+ if (count < 0)
+ goto out;
+ ret += count;
+ }
+
+out:
+ inode_unlock(inode);
+
+ if (ret > 0)
+ ret = generic_write_sync(iocb, ret);
+ return ret;
}
static void fuse_writepage_free(struct fuse_writepage_args *wpa)
@@ -2318,6 +2892,11 @@ static int fuse_file_mmap(struct file *file, struct vm_area_struct *vma)
return 0;
}
+static int fuse_dax_mmap(struct file *file, struct vm_area_struct *vma)
+{
+ return -EINVAL; /* TODO */
+}
+
static int convert_fuse_file_lock(struct fuse_conn *fc,
const struct fuse_file_lock *ffl,
struct file_lock *fl)
@@ -3387,6 +3966,7 @@ static const struct address_space_operations fuse_file_aops = {
void fuse_init_file_inode(struct inode *inode)
{
struct fuse_inode *fi = get_fuse_inode(inode);
+ struct fuse_conn *fc = get_fuse_conn(inode);
inode->i_fop = &fuse_file_operations;
inode->i_data.a_ops = &fuse_file_aops;
@@ -3396,4 +3976,9 @@ void fuse_init_file_inode(struct inode *inode)
fi->writectr = 0;
init_waitqueue_head(&fi->page_waitq);
INIT_LIST_HEAD(&fi->writepages);
+ fi->dmap_tree = RB_ROOT_CACHED;
+
+ if (fc->dax_dev) {
+ inode->i_flags |= S_DAX;
+ }
}
diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
index b41275f73e4c..490549862bda 100644
--- a/fs/fuse/fuse_i.h
+++ b/fs/fuse/fuse_i.h
@@ -70,16 +70,29 @@ struct fuse_forget_link {
struct fuse_forget_link *next;
};
+#define START(node) ((node)->start)
+#define LAST(node) ((node)->end)
+
/** Translation information for file offsets to DAX window offsets */
struct fuse_dax_mapping {
/* Will connect in fc->free_ranges to keep track of free memory */
struct list_head list;
+ /* For interval tree in file/inode */
+ struct rb_node rb;
+ /** Start Position in file */
+ __u64 start;
+ /** End Position in file */
+ __u64 end;
+ __u64 __subtree_last;
/** Position in DAX window */
u64 window_offset;
/** Length of mapping, in bytes */
loff_t length;
+
+ /* Is this mapping read-only or read-write */
+ bool writable;
};
/** FUSE inode */
@@ -167,6 +180,15 @@ struct fuse_inode {
/** Lock to protect write related fields */
spinlock_t lock;
+
+ /*
+ * Semaphore to protect modifications to dmap_tree
+ */
+ struct rw_semaphore i_dmap_sem;
+
+ /** Sorted rb tree of struct fuse_dax_mapping elements */
+ struct rb_root_cached dmap_tree;
+ unsigned long nr_dmaps;
};
/** FUSE inode state bits */
@@ -1127,5 +1149,6 @@ unsigned int fuse_len_args(unsigned int numargs, struct fuse_arg *args);
*/
u64 fuse_get_unique(struct fuse_iqueue *fiq);
void fuse_free_conn(struct fuse_conn *fc);
+void fuse_cleanup_inode_mappings(struct inode *inode);
#endif /* _FS_FUSE_I_H */
diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
index 36cb9c00bbe5..93bc65607a15 100644
--- a/fs/fuse/inode.c
+++ b/fs/fuse/inode.c
@@ -86,7 +86,9 @@ static struct inode *fuse_alloc_inode(struct super_block *sb)
fi->attr_version = 0;
fi->orig_ino = 0;
fi->state = 0;
+ fi->nr_dmaps = 0;
mutex_init(&fi->mutex);
+ init_rwsem(&fi->i_dmap_sem);
spin_lock_init(&fi->lock);
fi->forget = fuse_alloc_forget();
if (!fi->forget) {
@@ -114,6 +116,10 @@ static void fuse_evict_inode(struct inode *inode)
clear_inode(inode);
if (inode->i_sb->s_flags & SB_ACTIVE) {
struct fuse_conn *fc = get_fuse_conn(inode);
+ if (IS_DAX(inode)) {
+ fuse_cleanup_inode_mappings(inode);
+ WARN_ON(fi->nr_dmaps);
+ }
fuse_queue_forget(fc, fi->forget, fi->nodeid, fi->nlookup);
fi->forget = NULL;
}
diff --git a/include/uapi/linux/fuse.h b/include/uapi/linux/fuse.h
index 62633555d547..36d824b82ebc 100644
--- a/include/uapi/linux/fuse.h
+++ b/include/uapi/linux/fuse.h
@@ -896,6 +896,7 @@ struct fuse_copy_file_range_in {
#define FUSE_SETUPMAPPING_ENTRIES 8
#define FUSE_SETUPMAPPING_FLAG_WRITE (1ull << 0)
+#define FUSE_SETUPMAPPING_FLAG_READ (1ull << 1)
struct fuse_setupmapping_in {
/* An already open handle */
uint64_t fh;
--
2.20.1
This is done along the lines of ext4 and xfs. I primarily wanted the
->writepages hook at this time so that I could call into
dax_writeback_mapping_range(). This in turn will decide which pfns need
to be written back.
Signed-off-by: Vivek Goyal <[email protected]>
---
fs/fuse/file.c | 18 ++++++++++++++++++
1 file changed, 18 insertions(+)
diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index ab56396cf661..619aff6b5f44 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -2696,6 +2696,16 @@ static int fuse_writepages_fill(struct page *page,
return err;
}
+static int fuse_dax_writepages(struct address_space *mapping,
+ struct writeback_control *wbc)
+{
+ struct inode *inode = mapping->host;
+ struct fuse_conn *fc = get_fuse_conn(inode);
+
+ return dax_writeback_mapping_range(mapping, fc->dax_dev, wbc);
+}
+
static int fuse_writepages(struct address_space *mapping,
struct writeback_control *wbc)
{
@@ -4032,6 +4042,13 @@ static const struct address_space_operations fuse_file_aops = {
.write_end = fuse_write_end,
};
+static const struct address_space_operations fuse_dax_file_aops = {
+ .writepages = fuse_dax_writepages,
+ .direct_IO = noop_direct_IO,
+ .set_page_dirty = noop_set_page_dirty,
+ .invalidatepage = noop_invalidatepage,
+};
+
void fuse_init_file_inode(struct inode *inode)
{
struct fuse_inode *fi = get_fuse_inode(inode);
@@ -4049,5 +4066,6 @@ void fuse_init_file_inode(struct inode *inode)
if (fc->dax_dev) {
inode->i_flags |= S_DAX;
+ inode->i_data.a_ops = &fuse_dax_file_aops;
}
}
--
2.20.1
From: Sebastien Boeuf <[email protected]>
On PCI the shm regions are found using capability entries;
find a region by searching for the capability.
Signed-off-by: Sebastien Boeuf <[email protected]>
Signed-off-by: Dr. David Alan Gilbert <[email protected]>
Signed-off-by: kbuild test robot <[email protected]>
---
drivers/virtio/virtio_pci_modern.c | 107 +++++++++++++++++++++++++++++
include/uapi/linux/virtio_pci.h | 11 ++-
2 files changed, 117 insertions(+), 1 deletion(-)
diff --git a/drivers/virtio/virtio_pci_modern.c b/drivers/virtio/virtio_pci_modern.c
index 7abcc50838b8..52f179411015 100644
--- a/drivers/virtio/virtio_pci_modern.c
+++ b/drivers/virtio/virtio_pci_modern.c
@@ -443,6 +443,111 @@ static void del_vq(struct virtio_pci_vq_info *info)
vring_del_virtqueue(vq);
}
+static int virtio_pci_find_shm_cap(struct pci_dev *dev,
+ u8 required_id,
+ u8 *bar, u64 *offset, u64 *len)
+{
+ int pos;
+
+ for (pos = pci_find_capability(dev, PCI_CAP_ID_VNDR);
+ pos > 0;
+ pos = pci_find_next_capability(dev, pos, PCI_CAP_ID_VNDR)) {
+ u8 type, cap_len, id;
+ u32 tmp32;
+ u64 res_offset, res_length;
+
+ pci_read_config_byte(dev, pos + offsetof(struct virtio_pci_cap,
+ cfg_type),
+ &type);
+ if (type != VIRTIO_PCI_CAP_SHARED_MEMORY_CFG)
+ continue;
+
+ pci_read_config_byte(dev, pos + offsetof(struct virtio_pci_cap,
+ cap_len),
+ &cap_len);
+ if (cap_len != sizeof(struct virtio_pci_cap64)) {
+ printk(KERN_ERR "%s: shm cap with bad size offset: %d size: %d\n",
+ __func__, pos, cap_len);
+ continue;
+ }
+
+ pci_read_config_byte(dev, pos + offsetof(struct virtio_pci_cap,
+ id),
+ &id);
+ if (id != required_id)
+ continue;
+
+ /* Type, and ID match, looks good */
+ pci_read_config_byte(dev, pos + offsetof(struct virtio_pci_cap,
+ bar),
+ bar);
+
+ /* Read the lower 32bit of length and offset */
+ pci_read_config_dword(dev, pos + offsetof(struct virtio_pci_cap, offset),
+ &tmp32);
+ res_offset = tmp32;
+ pci_read_config_dword(dev, pos + offsetof(struct virtio_pci_cap, length),
+ &tmp32);
+ res_length = tmp32;
+
+ /* and now the top half */
+ pci_read_config_dword(dev,
+ pos + offsetof(struct virtio_pci_cap64,
+ offset_hi),
+ &tmp32);
+ res_offset |= ((u64)tmp32) << 32;
+ pci_read_config_dword(dev,
+ pos + offsetof(struct virtio_pci_cap64,
+ length_hi),
+ &tmp32);
+ res_length |= ((u64)tmp32) << 32;
+
+ *offset = res_offset;
+ *len = res_length;
+
+ return pos;
+ }
+ return 0;
+}
+
+static bool vp_get_shm_region(struct virtio_device *vdev,
+ struct virtio_shm_region *region, u8 id)
+{
+ struct virtio_pci_device *vp_dev = to_vp_device(vdev);
+ struct pci_dev *pci_dev = vp_dev->pci_dev;
+ u8 bar;
+ u64 offset, len;
+ phys_addr_t phys_addr;
+ size_t bar_len;
+ int ret;
+
+ if (!virtio_pci_find_shm_cap(pci_dev, id, &bar, &offset, &len)) {
+ return false;
+ }
+
+ ret = pci_request_region(pci_dev, bar, "virtio-pci-shm");
+ if (ret < 0) {
+ dev_err(&pci_dev->dev, "%s: failed to request BAR\n",
+ __func__);
+ return false;
+ }
+
+ phys_addr = pci_resource_start(pci_dev, bar);
+ bar_len = pci_resource_len(pci_dev, bar);
+
+ if (offset + len > bar_len) {
+ dev_err(&pci_dev->dev,
+ "%s: bar shorter than cap offset+len\n",
+ __func__);
+ return false;
+ }
+
+ region->len = len;
+ region->addr = (u64) phys_addr + offset;
+
+ return true;
+}
+
static const struct virtio_config_ops virtio_pci_config_nodev_ops = {
.get = NULL,
.set = NULL,
@@ -457,6 +562,7 @@ static const struct virtio_config_ops virtio_pci_config_nodev_ops = {
.bus_name = vp_bus_name,
.set_vq_affinity = vp_set_vq_affinity,
.get_vq_affinity = vp_get_vq_affinity,
+ .get_shm_region = vp_get_shm_region,
};
static const struct virtio_config_ops virtio_pci_config_ops = {
@@ -473,6 +579,7 @@ static const struct virtio_config_ops virtio_pci_config_ops = {
.bus_name = vp_bus_name,
.set_vq_affinity = vp_set_vq_affinity,
.get_vq_affinity = vp_get_vq_affinity,
+ .get_shm_region = vp_get_shm_region,
};
/**
diff --git a/include/uapi/linux/virtio_pci.h b/include/uapi/linux/virtio_pci.h
index 90007a1abcab..fe9f43680a1d 100644
--- a/include/uapi/linux/virtio_pci.h
+++ b/include/uapi/linux/virtio_pci.h
@@ -113,6 +113,8 @@
#define VIRTIO_PCI_CAP_DEVICE_CFG 4
/* PCI configuration access */
#define VIRTIO_PCI_CAP_PCI_CFG 5
+/* Additional shared memory capability */
+#define VIRTIO_PCI_CAP_SHARED_MEMORY_CFG 8
/* This is the PCI capability header: */
struct virtio_pci_cap {
@@ -121,11 +123,18 @@ struct virtio_pci_cap {
__u8 cap_len; /* Generic PCI field: capability length */
__u8 cfg_type; /* Identifies the structure. */
__u8 bar; /* Where to find it. */
- __u8 padding[3]; /* Pad to full dword. */
+ __u8 id; /* Multiple capabilities of the same type */
+ __u8 padding[2]; /* Pad to full dword. */
__le32 offset; /* Offset within bar. */
__le32 length; /* Length of the structure, in bytes. */
};
+struct virtio_pci_cap64 {
+ struct virtio_pci_cap cap;
+ __le32 offset_hi; /* Most sig 32 bits of offset */
+ __le32 length_hi; /* Most sig 32 bits of length */
+};
+
struct virtio_pci_notify_cap {
struct virtio_pci_cap cap;
__le32 notify_off_multiplier; /* Multiplier for queue_notify_off. */
--
2.20.1
From: Sebastien Boeuf <[email protected]>
On MMIO a new set of registers is defined for finding SHM
regions. Add their definitions and use them to find the region.
Signed-off-by: Sebastien Boeuf <[email protected]>
---
drivers/virtio/virtio_mmio.c | 32 ++++++++++++++++++++++++++++++++
include/uapi/linux/virtio_mmio.h | 11 +++++++++++
2 files changed, 43 insertions(+)
diff --git a/drivers/virtio/virtio_mmio.c b/drivers/virtio/virtio_mmio.c
index 97d5725fd9a2..4922a1a9e3a7 100644
--- a/drivers/virtio/virtio_mmio.c
+++ b/drivers/virtio/virtio_mmio.c
@@ -500,6 +500,37 @@ static const char *vm_bus_name(struct virtio_device *vdev)
return vm_dev->pdev->name;
}
+static bool vm_get_shm_region(struct virtio_device *vdev,
+ struct virtio_shm_region *region, u8 id)
+{
+ struct virtio_mmio_device *vm_dev = to_virtio_mmio_device(vdev);
+ u64 len, addr;
+
+ /* Select the region we're interested in */
+ writel(id, vm_dev->base + VIRTIO_MMIO_SHM_SEL);
+
+ /* Read the region size */
+ len = (u64) readl(vm_dev->base + VIRTIO_MMIO_SHM_LEN_LOW);
+ len |= (u64) readl(vm_dev->base + VIRTIO_MMIO_SHM_LEN_HIGH) << 32;
+
+ region->len = len;
+
+ /* Check if region length is -1. If that's the case, the shared memory
+ * region does not exist and there is no need to proceed further.
+ */
+ if (len == ~(u64)0) {
+ return false;
+ }
+
+ /* Read the region base address */
+ addr = (u64) readl(vm_dev->base + VIRTIO_MMIO_SHM_BASE_LOW);
+ addr |= (u64) readl(vm_dev->base + VIRTIO_MMIO_SHM_BASE_HIGH) << 32;
+
+ region->addr = addr;
+
+ return true;
+}
+
static const struct virtio_config_ops virtio_mmio_config_ops = {
.get = vm_get,
.set = vm_set,
@@ -512,6 +543,7 @@ static const struct virtio_config_ops virtio_mmio_config_ops = {
.get_features = vm_get_features,
.finalize_features = vm_finalize_features,
.bus_name = vm_bus_name,
+ .get_shm_region = vm_get_shm_region,
};
diff --git a/include/uapi/linux/virtio_mmio.h b/include/uapi/linux/virtio_mmio.h
index c4b09689ab64..0650f91bea6c 100644
--- a/include/uapi/linux/virtio_mmio.h
+++ b/include/uapi/linux/virtio_mmio.h
@@ -122,6 +122,17 @@
#define VIRTIO_MMIO_QUEUE_USED_LOW 0x0a0
#define VIRTIO_MMIO_QUEUE_USED_HIGH 0x0a4
+/* Shared memory region id */
+#define VIRTIO_MMIO_SHM_SEL 0x0ac
+
+/* Shared memory region length, 64 bits in two halves */
+#define VIRTIO_MMIO_SHM_LEN_LOW 0x0b0
+#define VIRTIO_MMIO_SHM_LEN_HIGH 0x0b4
+
+/* Shared memory region base address, 64 bits in two halves */
+#define VIRTIO_MMIO_SHM_BASE_LOW 0x0b8
+#define VIRTIO_MMIO_SHM_BASE_HIGH 0x0bc
+
/* Configuration atomicity value */
#define VIRTIO_MMIO_CONFIG_GENERATION 0x0fc
--
2.20.1
Introduce two new fuse commands to set up/remove memory mappings. These
will be used to set up/tear down file mappings in the dax window.
Signed-off-by: Vivek Goyal <[email protected]>
Signed-off-by: Peng Tao <[email protected]>
---
include/uapi/linux/fuse.h | 37 +++++++++++++++++++++++++++++++++++++
1 file changed, 37 insertions(+)
diff --git a/include/uapi/linux/fuse.h b/include/uapi/linux/fuse.h
index 5b85819e045f..62633555d547 100644
--- a/include/uapi/linux/fuse.h
+++ b/include/uapi/linux/fuse.h
@@ -894,4 +894,41 @@ struct fuse_copy_file_range_in {
uint64_t flags;
};
+#define FUSE_SETUPMAPPING_ENTRIES 8
+#define FUSE_SETUPMAPPING_FLAG_WRITE (1ull << 0)
+struct fuse_setupmapping_in {
+ /* An already open handle */
+ uint64_t fh;
+ /* Offset into the file to start the mapping */
+ uint64_t foffset;
+ /* Length of mapping required */
+ uint64_t len;
+ /* Flags, FUSE_SETUPMAPPING_FLAG_* */
+ uint64_t flags;
+ /* Offset in Memory Window */
+ uint64_t moffset;
+};
+
+struct fuse_setupmapping_out {
+ /* Offsets into the cache of mappings */
+ uint64_t coffset[FUSE_SETUPMAPPING_ENTRIES];
+ /* Lengths of each mapping */
+ uint64_t len[FUSE_SETUPMAPPING_ENTRIES];
+};
+
+struct fuse_removemapping_in {
+ /* number of fuse_removemapping_one follows */
+ uint32_t count;
+};
+
+struct fuse_removemapping_one {
+ /* Offset into the dax window to start the unmapping */
+ uint64_t moffset;
+ /* Length of mapping to remove */
+ uint64_t len;
+};
+
+#define FUSE_REMOVEMAPPING_MAX_ENTRY \
+ (PAGE_SIZE / sizeof(struct fuse_removemapping_one))
+
#endif /* _LINUX_FUSE_H */
--
2.20.1
From: Sebastien Boeuf <[email protected]>
Virtio defines 'shared memory regions' that provide a contiguous
region of memory shared between the host and guest.
Provide a method to find a particular region on a device.
Signed-off-by: Sebastien Boeuf <[email protected]>
Signed-off-by: Dr. David Alan Gilbert <[email protected]>
---
include/linux/virtio_config.h | 17 +++++++++++++++++
1 file changed, 17 insertions(+)
diff --git a/include/linux/virtio_config.h b/include/linux/virtio_config.h
index bb4cc4910750..c859f000a751 100644
--- a/include/linux/virtio_config.h
+++ b/include/linux/virtio_config.h
@@ -10,6 +10,11 @@
struct irq_affinity;
+struct virtio_shm_region {
+ u64 addr;
+ u64 len;
+};
+
/**
* virtio_config_ops - operations for configuring a virtio device
* Note: Do not assume that a transport implements all of the operations
@@ -65,6 +70,7 @@ struct irq_affinity;
* the caller can then copy.
* @set_vq_affinity: set the affinity for a virtqueue (optional).
* @get_vq_affinity: get the affinity for a virtqueue (optional).
+ * @get_shm_region: get a shared memory region based on the index.
*/
typedef void vq_callback_t(struct virtqueue *);
struct virtio_config_ops {
@@ -88,6 +94,8 @@ struct virtio_config_ops {
const struct cpumask *cpu_mask);
const struct cpumask *(*get_vq_affinity)(struct virtio_device *vdev,
int index);
+ bool (*get_shm_region)(struct virtio_device *vdev,
+ struct virtio_shm_region *region, u8 id);
};
/* If driver didn't advertise the feature, it will never appear. */
@@ -250,6 +258,15 @@ int virtqueue_set_affinity(struct virtqueue *vq, const struct cpumask *cpu_mask)
return 0;
}
+static inline
+bool virtio_get_shm_region(struct virtio_device *vdev,
+ struct virtio_shm_region *region, u8 id)
+{
+ if (!vdev->config->get_shm_region)
+ return false;
+ return vdev->config->get_shm_region(vdev, region, id);
+}
+
static inline bool virtio_is_little_endian(struct virtio_device *vdev)
{
return virtio_has_feature(vdev, VIRTIO_F_VERSION_1) ||
--
2.20.1
The device communicates FUSE_SETUPMAPPING/FUSE_REMOVEMAPPING alignment
constraints via the FUSE_INIT map_alignment field. Parse this field and
ensure our DAX mappings meet the alignment constraints.
We don't actually align anything differently since our mappings are
already 2MB aligned. Just check the value when the connection is
established. If it becomes necessary to honor arbitrary alignments in
the future we'll have to adjust how mappings are sized.
The upshot of this commit is that we can be confident that mappings will
work even when emulating x86 on Power and similar combinations where the
host page sizes are different.
Signed-off-by: Stefan Hajnoczi <[email protected]>
Signed-off-by: Vivek Goyal <[email protected]>
---
fs/fuse/fuse_i.h | 5 ++++-
fs/fuse/inode.c | 19 +++++++++++++++++--
include/uapi/linux/fuse.h | 4 +++-
3 files changed, 24 insertions(+), 4 deletions(-)
diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
index edd3136c11f7..b41275f73e4c 100644
--- a/fs/fuse/fuse_i.h
+++ b/fs/fuse/fuse_i.h
@@ -47,7 +47,10 @@
/** Number of dentries for each connection in the control filesystem */
#define FUSE_CTL_NUM_DENTRIES 5
-/* Default memory range size, 2MB */
+/*
+ * Default memory range size. A power of 2 so it agrees with common FUSE_INIT
+ * map_alignment values 4KB and 64KB.
+ */
#define FUSE_DAX_MEM_RANGE_SZ (2*1024*1024)
#define FUSE_DAX_MEM_RANGE_PAGES (FUSE_DAX_MEM_RANGE_SZ/PAGE_SIZE)
diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
index 0ba092bf0b6d..36cb9c00bbe5 100644
--- a/fs/fuse/inode.c
+++ b/fs/fuse/inode.c
@@ -961,9 +961,10 @@ static void process_init_reply(struct fuse_conn *fc, struct fuse_args *args,
{
struct fuse_init_args *ia = container_of(args, typeof(*ia), args);
struct fuse_init_out *arg = &ia->out;
+ bool ok = true;
if (error || arg->major != FUSE_KERNEL_VERSION)
- fc->conn_error = 1;
+ ok = false;
else {
unsigned long ra_pages;
@@ -1026,6 +1027,14 @@ static void process_init_reply(struct fuse_conn *fc, struct fuse_args *args,
min_t(unsigned int, FUSE_MAX_MAX_PAGES,
max_t(unsigned int, arg->max_pages, 1));
}
+ if ((arg->flags & FUSE_MAP_ALIGNMENT) &&
+ (FUSE_DAX_MEM_RANGE_SZ % (1ul << arg->map_alignment))) {
+ printk(KERN_ERR "FUSE: map_alignment %u"
+ " incompatible with dax mem range size"
+ " %u\n", arg->map_alignment,
+ FUSE_DAX_MEM_RANGE_SZ);
+ ok = false;
+ }
} else {
ra_pages = fc->max_read / PAGE_SIZE;
fc->no_lock = 1;
@@ -1041,6 +1050,11 @@ static void process_init_reply(struct fuse_conn *fc, struct fuse_args *args,
}
kfree(ia);
+ if (!ok) {
+ fc->conn_init = 0;
+ fc->conn_error = 1;
+ }
+
fuse_set_initialized(fc);
wake_up_all(&fc->blocked_waitq);
}
@@ -1063,7 +1077,8 @@ void fuse_send_init(struct fuse_conn *fc)
FUSE_WRITEBACK_CACHE | FUSE_NO_OPEN_SUPPORT |
FUSE_PARALLEL_DIROPS | FUSE_HANDLE_KILLPRIV | FUSE_POSIX_ACL |
FUSE_ABORT_ERROR | FUSE_MAX_PAGES | FUSE_CACHE_SYMLINKS |
- FUSE_NO_OPENDIR_SUPPORT | FUSE_EXPLICIT_INVAL_DATA;
+ FUSE_NO_OPENDIR_SUPPORT | FUSE_EXPLICIT_INVAL_DATA |
+ FUSE_MAP_ALIGNMENT;
ia->args.opcode = FUSE_INIT;
ia->args.in_numargs = 1;
ia->args.in_args[0].size = sizeof(ia->in);
diff --git a/include/uapi/linux/fuse.h b/include/uapi/linux/fuse.h
index 373cada89815..5b85819e045f 100644
--- a/include/uapi/linux/fuse.h
+++ b/include/uapi/linux/fuse.h
@@ -313,7 +313,9 @@ struct fuse_file_lock {
* FUSE_CACHE_SYMLINKS: cache READLINK responses
* FUSE_NO_OPENDIR_SUPPORT: kernel supports zero-message opendir
* FUSE_EXPLICIT_INVAL_DATA: only invalidate cached pages on explicit request
- * FUSE_MAP_ALIGNMENT: map_alignment field is valid
+ * FUSE_MAP_ALIGNMENT: init_out.map_alignment contains log2(byte alignment) for
+ * foffset and moffset fields in struct
+ * fuse_setupmapping_out and fuse_removemapping_one.
*/
#define FUSE_ASYNC_READ (1 << 0)
#define FUSE_POSIX_LOCKS (1 << 1)
--
2.20.1
This option was introduced so that virtio_fs does not show any mount
options in fuse_show_options(), because we don't offer any of these
options to be controlled by the mounter.
Very soon we are planning to introduce a "dax" option which the mounter
should be able to specify, and the all-or-nothing no_mount_options flag
no longer works. What we need is a per-option flag so that the
filesystem can specify which options to show.
Add a few such flags to control the behavior in a more fine-grained
manner and get rid of no_mount_options.
Signed-off-by: Vivek Goyal <[email protected]>
---
fs/fuse/fuse_i.h | 14 ++++++++++----
fs/fuse/inode.c | 22 ++++++++++++++--------
fs/fuse/virtio_fs.c | 1 -
3 files changed, 24 insertions(+), 13 deletions(-)
diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
index aa75e2305b75..2cebdf6dcfd8 100644
--- a/fs/fuse/fuse_i.h
+++ b/fs/fuse/fuse_i.h
@@ -468,18 +468,21 @@ struct fuse_fs_context {
int fd;
unsigned int rootmode;
kuid_t user_id;
+ bool user_id_show;
kgid_t group_id;
+ bool group_id_show;
bool is_bdev:1;
bool fd_present:1;
bool rootmode_present:1;
bool user_id_present:1;
bool group_id_present:1;
bool default_permissions:1;
+ bool default_permissions_show:1;
bool allow_other:1;
+ bool allow_other_show:1;
bool destroy:1;
bool no_control:1;
bool no_force_umount:1;
- bool no_mount_options:1;
unsigned int max_read;
unsigned int blksize;
const char *subtype;
@@ -509,9 +512,11 @@ struct fuse_conn {
/** The user id for this mount */
kuid_t user_id;
+ bool user_id_show:1;
/** The group id for this mount */
kgid_t group_id;
+ bool group_id_show:1;
/** The pid namespace for this mount */
struct pid_namespace *pid_ns;
@@ -695,10 +700,14 @@ struct fuse_conn {
/** Check permissions based on the file mode or not? */
unsigned default_permissions:1;
+ bool default_permissions_show:1;
/** Allow other than the mounter user to access the filesystem ? */
unsigned allow_other:1;
+ /** Show allow_other in mount options */
+ bool allow_other_show:1;
+
/** Does the filesystem support copy_file_range? */
unsigned no_copy_file_range:1;
@@ -714,9 +723,6 @@ struct fuse_conn {
/** Do not allow MNT_FORCE umount */
unsigned int no_force_umount:1;
- /* Do not show mount options */
- unsigned int no_mount_options:1;
-
/** The number of requests waiting for completion */
atomic_t num_waiting;
diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
index 95d712d44ca1..f160a3d47b63 100644
--- a/fs/fuse/inode.c
+++ b/fs/fuse/inode.c
@@ -515,10 +515,12 @@ static int fuse_parse_param(struct fs_context *fc, struct fs_parameter *param)
case OPT_DEFAULT_PERMISSIONS:
ctx->default_permissions = true;
+ ctx->default_permissions_show = true;
break;
case OPT_ALLOW_OTHER:
ctx->allow_other = true;
+ ctx->allow_other_show = true;
break;
case OPT_MAX_READ:
@@ -553,14 +555,15 @@ static int fuse_show_options(struct seq_file *m, struct dentry *root)
struct super_block *sb = root->d_sb;
struct fuse_conn *fc = get_fuse_conn_super(sb);
- if (fc->no_mount_options)
- return 0;
-
- seq_printf(m, ",user_id=%u", from_kuid_munged(fc->user_ns, fc->user_id));
- seq_printf(m, ",group_id=%u", from_kgid_munged(fc->user_ns, fc->group_id));
- if (fc->default_permissions)
+ if (fc->user_id_show)
+ seq_printf(m, ",user_id=%u",
+ from_kuid_munged(fc->user_ns, fc->user_id));
+ if (fc->group_id_show)
+ seq_printf(m, ",group_id=%u",
+ from_kgid_munged(fc->user_ns, fc->group_id));
+ if (fc->default_permissions && fc->default_permissions_show)
seq_puts(m, ",default_permissions");
- if (fc->allow_other)
+ if (fc->allow_other && fc->allow_other_show)
seq_puts(m, ",allow_other");
if (fc->max_read != ~0)
seq_printf(m, ",max_read=%u", fc->max_read);
@@ -1171,14 +1174,17 @@ int fuse_fill_super_common(struct super_block *sb, struct fuse_fs_context *ctx)
sb->s_flags |= SB_POSIXACL;
fc->default_permissions = ctx->default_permissions;
+ fc->default_permissions_show = ctx->default_permissions_show;
fc->allow_other = ctx->allow_other;
+ fc->allow_other_show = ctx->allow_other_show;
fc->user_id = ctx->user_id;
+ fc->user_id_show = ctx->user_id_show;
fc->group_id = ctx->group_id;
+ fc->group_id_show = ctx->group_id_show;
fc->max_read = max_t(unsigned, 4096, ctx->max_read);
fc->destroy = ctx->destroy;
fc->no_control = ctx->no_control;
fc->no_force_umount = ctx->no_force_umount;
- fc->no_mount_options = ctx->no_mount_options;
err = -ENOMEM;
root = fuse_get_root_inode(sb, ctx->rootmode);
diff --git a/fs/fuse/virtio_fs.c b/fs/fuse/virtio_fs.c
index a16cc9195087..3f786a15b0d9 100644
--- a/fs/fuse/virtio_fs.c
+++ b/fs/fuse/virtio_fs.c
@@ -1060,7 +1060,6 @@ static int virtio_fs_fill_super(struct super_block *sb)
.destroy = true,
.no_control = true,
.no_force_umount = true,
- .no_mount_options = true,
};
mutex_lock(&virtio_fs_mutex);
--
2.20.1
When a file is opened with O_TRUNC, we need to make sure that any other
DAX operation is not in progress. DAX expects i_size to be stable.
In fuse_iomap_begin() we check for i_size at multiple places and we expect
i_size to not change.
Another problem is that if we set up a mapping in fuse_iomap_begin() and
the file then gets truncated while a dax read/write happens, KVM
currently hangs. It tries to fault in a page which no longer exists on
the host (the file got truncated). This probably requires a fix in KVM.
So for now, take the inode lock. Once KVM is fixed, we may need to
revisit this.
Signed-off-by: Vivek Goyal <[email protected]>
---
fs/fuse/file.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index 561428b66101..8b264fcb9b3c 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -483,7 +483,7 @@ int fuse_open_common(struct inode *inode, struct file *file, bool isdir)
int err;
bool is_wb_truncate = (file->f_flags & O_TRUNC) &&
fc->atomic_o_trunc &&
- fc->writeback_cache;
+ (fc->writeback_cache || IS_DAX(inode));
err = generic_file_open(inode, file);
if (err)
--
2.20.1
From: Stefan Hajnoczi <[email protected]>
Add DAX mmap() support.
Signed-off-by: Stefan Hajnoczi <[email protected]>
---
fs/fuse/file.c | 62 +++++++++++++++++++++++++++++++++++++++++++++++++-
1 file changed, 61 insertions(+), 1 deletion(-)
diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index 9effdd3dc6d6..303496e6617f 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -2870,10 +2870,15 @@ static const struct vm_operations_struct fuse_file_vm_ops = {
.page_mkwrite = fuse_page_mkwrite,
};
+static int fuse_dax_mmap(struct file *file, struct vm_area_struct *vma);
static int fuse_file_mmap(struct file *file, struct vm_area_struct *vma)
{
struct fuse_file *ff = file->private_data;
+ /* DAX mmap is superior to direct_io mmap */
+ if (IS_DAX(file_inode(file)))
+ return fuse_dax_mmap(file, vma);
+
if (ff->open_flags & FOPEN_DIRECT_IO) {
/* Can't provide the coherency needed for MAP_SHARED */
if (vma->vm_flags & VM_MAYSHARE)
@@ -2892,9 +2897,63 @@ static int fuse_file_mmap(struct file *file, struct vm_area_struct *vma)
return 0;
}
+static vm_fault_t __fuse_dax_fault(struct vm_fault *vmf,
+ enum page_entry_size pe_size, bool write)
+{
+ vm_fault_t ret;
+ struct inode *inode = file_inode(vmf->vma->vm_file);
+ struct super_block *sb = inode->i_sb;
+ pfn_t pfn;
+
+ if (write)
+ sb_start_pagefault(sb);
+
+ ret = dax_iomap_fault(vmf, pe_size, &pfn, NULL, &fuse_iomap_ops);
+
+ if (ret & VM_FAULT_NEEDDSYNC)
+ ret = dax_finish_sync_fault(vmf, pe_size, pfn);
+
+ if (write)
+ sb_end_pagefault(sb);
+
+ return ret;
+}
+
+static vm_fault_t fuse_dax_fault(struct vm_fault *vmf)
+{
+ return __fuse_dax_fault(vmf, PE_SIZE_PTE,
+ vmf->flags & FAULT_FLAG_WRITE);
+}
+
+static vm_fault_t fuse_dax_huge_fault(struct vm_fault *vmf,
+ enum page_entry_size pe_size)
+{
+ return __fuse_dax_fault(vmf, pe_size, vmf->flags & FAULT_FLAG_WRITE);
+}
+
+static vm_fault_t fuse_dax_page_mkwrite(struct vm_fault *vmf)
+{
+ return __fuse_dax_fault(vmf, PE_SIZE_PTE, true);
+}
+
+static vm_fault_t fuse_dax_pfn_mkwrite(struct vm_fault *vmf)
+{
+ return __fuse_dax_fault(vmf, PE_SIZE_PTE, true);
+}
+
+static const struct vm_operations_struct fuse_dax_vm_ops = {
+ .fault = fuse_dax_fault,
+ .huge_fault = fuse_dax_huge_fault,
+ .page_mkwrite = fuse_dax_page_mkwrite,
+ .pfn_mkwrite = fuse_dax_pfn_mkwrite,
+};
+
static int fuse_dax_mmap(struct file *file, struct vm_area_struct *vma)
{
- return -EINVAL; /* TODO */
+ file_accessed(file);
+ vma->vm_ops = &fuse_dax_vm_ops;
+ vma->vm_flags |= VM_MIXEDMAP | VM_HUGEPAGE;
+ return 0;
}
static int convert_fuse_file_lock(struct fuse_conn *fc,
@@ -3940,6 +3999,7 @@ static const struct file_operations fuse_file_operations = {
.release = fuse_release,
.fsync = fuse_fsync,
.lock = fuse_file_lock,
+ .get_unmapped_area = thp_get_unmapped_area,
.flock = fuse_file_flock,
.splice_read = generic_file_splice_read,
.splice_write = iter_file_splice_write,
--
2.20.1
Add logic to free up a busy memory range. The freed memory range is
returned to the free pool. Add a worker which can be started to select
and free some busy memory ranges.
A process can also steal one of its own busy dax ranges if no free range
is available. I will refer to this as direct reclaim.
If no free range is available and nothing can be stolen from the same
inode, the caller waits on a waitq for a free range to become available.
For reclaiming a range, as of now we need to hold the following locks in
the specified order.
down_write(&fi->i_mmap_sem);
down_write(&fi->i_dmap_sem);
We look for a free range in the following order.
A. Try to get a free range.
B. If not, try direct reclaim.
C. If not, wait for a memory range to become free
Signed-off-by: Vivek Goyal <[email protected]>
Signed-off-by: Liu Bo <[email protected]>
---
fs/fuse/file.c | 450 ++++++++++++++++++++++++++++++++++++++++++++++-
fs/fuse/fuse_i.h | 25 +++
fs/fuse/inode.c | 5 +
3 files changed, 473 insertions(+), 7 deletions(-)
diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index 8b264fcb9b3c..61ae2ddeef55 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -8,6 +8,7 @@
#include "fuse_i.h"
+#include <linux/delay.h>
#include <linux/pagemap.h>
#include <linux/slab.h>
#include <linux/kernel.h>
@@ -37,6 +38,8 @@ static struct page **fuse_pages_alloc(unsigned int npages, gfp_t flags,
return pages;
}
+static struct fuse_dax_mapping *alloc_dax_mapping_reclaim(struct fuse_conn *fc,
+ struct inode *inode, bool fault);
static int fuse_send_open(struct fuse_conn *fc, u64 nodeid, struct file *file,
int opcode, struct fuse_open_out *outargp)
{
@@ -193,6 +196,28 @@ static void fuse_link_write_file(struct file *file)
spin_unlock(&fi->lock);
}
+static void
+__kick_dmap_free_worker(struct fuse_conn *fc, unsigned long delay_ms)
+{
+ unsigned long free_threshold;
+
+ /* If the number of free ranges is below the threshold, start reclaim */
+ free_threshold = max((fc->nr_ranges * FUSE_DAX_RECLAIM_THRESHOLD)/100,
+ (unsigned long)1);
+ if (fc->nr_free_ranges < free_threshold) {
+ pr_debug("fuse: Kicking dax memory reclaim worker. nr_free_ranges=%ld nr_total_ranges=%ld\n", fc->nr_free_ranges, fc->nr_ranges);
+ queue_delayed_work(system_long_wq, &fc->dax_free_work,
+ msecs_to_jiffies(delay_ms));
+ }
+}
+
+static void kick_dmap_free_worker(struct fuse_conn *fc, unsigned long delay_ms)
+{
+ spin_lock(&fc->lock);
+ __kick_dmap_free_worker(fc, delay_ms);
+ spin_unlock(&fc->lock);
+}
+
static struct fuse_dax_mapping *alloc_dax_mapping(struct fuse_conn *fc)
{
struct fuse_dax_mapping *dmap = NULL;
@@ -201,7 +226,7 @@ static struct fuse_dax_mapping *alloc_dax_mapping(struct fuse_conn *fc)
if (fc->nr_free_ranges <= 0) {
spin_unlock(&fc->lock);
- return NULL;
+ goto out_kick;
}
WARN_ON(list_empty(&fc->free_ranges));
@@ -212,6 +237,9 @@ static struct fuse_dax_mapping *alloc_dax_mapping(struct fuse_conn *fc)
list_del_init(&dmap->list);
fc->nr_free_ranges--;
spin_unlock(&fc->lock);
+
+out_kick:
+ kick_dmap_free_worker(fc, 0);
return dmap;
}
@@ -238,6 +266,7 @@ static void __dmap_add_to_free_pool(struct fuse_conn *fc,
{
list_add_tail(&dmap->list, &fc->free_ranges);
fc->nr_free_ranges++;
+ wake_up(&fc->dax_range_waitq);
}
static void dmap_add_to_free_pool(struct fuse_conn *fc,
@@ -289,6 +318,12 @@ static int fuse_setup_one_mapping(struct inode *inode, loff_t offset,
dmap->writable = writable;
if (!upgrade) {
+ /*
+ * We don't take a reference on the inode. The inode is valid right now
+ * and when inode is going away, cleanup logic should first
+ * cleanup dmap entries.
+ */
+ dmap->inode = inode;
dmap->start = offset;
dmap->end = offset + FUSE_DAX_MEM_RANGE_SZ - 1;
/* Protected by fi->i_dmap_sem */
@@ -368,6 +403,7 @@ static void dmap_reinit_add_to_free_pool(struct fuse_conn *fc,
"window_offset=0x%llx length=0x%llx\n", dmap->start,
dmap->end, dmap->window_offset, dmap->length);
__dmap_remove_busy_list(fc, dmap);
+ dmap->inode = NULL;
dmap->start = dmap->end = 0;
__dmap_add_to_free_pool(fc, dmap);
}
@@ -386,7 +422,8 @@ static void inode_reclaim_dmap_range(struct fuse_conn *fc, struct inode *inode,
int err, num = 0;
LIST_HEAD(to_remove);
- pr_debug("fuse: %s: start=0x%llx, end=0x%llx\n", __func__, start, end);
+ pr_debug("fuse: %s: inode=0x%px start=0x%llx, end=0x%llx\n", __func__,
+ inode, start, end);
/*
* Interval tree search matches intersecting entries. Adjust the range
@@ -400,6 +437,8 @@ static void inode_reclaim_dmap_range(struct fuse_conn *fc, struct inode *inode,
end);
if (!dmap)
break;
+ /* inode is going away. There should not be any users of dmap */
+ WARN_ON(refcount_read(&dmap->refcnt) > 1);
fuse_dax_interval_tree_remove(dmap, &fi->dmap_tree);
num++;
list_add(&dmap->list, &to_remove);
@@ -434,6 +473,21 @@ static void inode_reclaim_dmap_range(struct fuse_conn *fc, struct inode *inode,
spin_unlock(&fc->lock);
}
+static int dmap_removemapping_one(struct inode *inode,
+ struct fuse_dax_mapping *dmap)
+{
+ struct fuse_removemapping_one forget_one;
+ struct fuse_removemapping_in inarg;
+
+ memset(&inarg, 0, sizeof(inarg));
+ inarg.count = 1;
+ memset(&forget_one, 0, sizeof(forget_one));
+ forget_one.moffset = dmap->window_offset;
+ forget_one.len = dmap->length;
+
+ return fuse_send_removemapping(inode, &inarg, &forget_one);
+}
+
/*
* It is called from evict_inode() and by that time inode is going away. So
* this function does not take any locks like fi->i_dmap_sem for traversing
@@ -1903,6 +1957,17 @@ static void fuse_fill_iomap(struct inode *inode, loff_t pos, loff_t length,
if (flags & IOMAP_FAULT)
iomap->length = ALIGN(len, PAGE_SIZE);
iomap->type = IOMAP_MAPPED;
+ /*
+ * Increase refcnt so that reclaim code knows this dmap is in
+ * use. This assumes the i_dmap_sem rwsem is held either
+ * shared or exclusive.
+ */
+ refcount_inc(&dmap->refcnt);
+
+ /* iomap->private should be NULL */
+ WARN_ON_ONCE(iomap->private);
+ iomap->private = dmap;
+
pr_debug("%s: returns iomap: addr 0x%llx offset 0x%llx"
" length 0x%llx\n", __func__, iomap->addr,
iomap->offset, iomap->length);
@@ -1925,8 +1990,12 @@ static int iomap_begin_setup_new_mapping(struct inode *inode, loff_t pos,
int ret;
bool writable = flags & IOMAP_WRITE;
- alloc_dmap = alloc_dax_mapping(fc);
- if (!alloc_dmap)
+ alloc_dmap = alloc_dax_mapping_reclaim(fc, inode, flags & IOMAP_FAULT);
+ if (IS_ERR(alloc_dmap))
+ return PTR_ERR(alloc_dmap);
+
+ /* If we are here, we should have memory allocated */
+ if (WARN_ON(!alloc_dmap))
return -EBUSY;
/*
@@ -1979,14 +2048,25 @@ static int iomap_begin_upgrade_mapping(struct inode *inode, loff_t pos,
dmap = fuse_dax_interval_tree_iter_first(&fi->dmap_tree, pos, pos);
/* We are holding either inode lock or i_mmap_sem, and that should
- * ensure that dmap can't reclaimed or truncated and it should still
- * be there in tree despite the fact we dropped and re-acquired the
- * lock.
+ * ensure that dmap can't be truncated. We are holding a reference
+ * on dmap and that should make sure it can't be reclaimed. So dmap
+ * should still be there in tree despite the fact we dropped and
+ * re-acquired the i_dmap_sem lock.
*/
ret = -EIO;
if (WARN_ON(!dmap))
goto out_err;
+ /* We took an extra reference on dmap to make sure it's not reclaimed.
+ * Now we hold i_dmap_sem lock and that reference is not needed
+ * anymore. Drop it.
+ */
+ if (refcount_dec_and_test(&dmap->refcnt)) {
+ /* refcount should not hit 0. This object only goes
+ * away when fuse connection goes away */
+ WARN_ON_ONCE(1);
+ }
+
/* Maybe another thread already upgraded mapping while we were not
* holding lock.
*/
@@ -2056,7 +2136,11 @@ static int fuse_iomap_begin(struct inode *inode, loff_t pos, loff_t length,
* two threads to be trying to this simultaneously
* for same dmap. So drop shared lock and acquire
* exclusive lock.
+ *
+ * Before dropping i_dmap_sem lock, take reference
+ * on dmap so that it's not freed by range reclaim.
*/
+ refcount_inc(&dmap->refcnt);
up_read(&fi->i_dmap_sem);
pr_debug("%s: Upgrading mapping at offset 0x%llx"
" length 0x%llx\n", __func__, pos, length);
@@ -2092,6 +2176,16 @@ static int fuse_iomap_end(struct inode *inode, loff_t pos, loff_t length,
ssize_t written, unsigned flags,
struct iomap *iomap)
{
+ struct fuse_dax_mapping *dmap = iomap->private;
+
+ if (dmap) {
+ if (refcount_dec_and_test(&dmap->refcnt)) {
+ /* refcount should not hit 0. This object only goes
+ * away when fuse connection goes away */
+ WARN_ON_ONCE(1);
+ }
+ }
+
/* DAX writes beyond end-of-file aren't handled using iomap, so the
* file size is unchanged and there is nothing to do here.
*/
@@ -4103,3 +4197,345 @@ void fuse_init_file_inode(struct inode *inode)
inode->i_data.a_ops = &fuse_dax_file_aops;
}
}
+
+static int dmap_writeback_invalidate(struct inode *inode,
+ struct fuse_dax_mapping *dmap)
+{
+ int ret;
+
+ ret = filemap_fdatawrite_range(inode->i_mapping, dmap->start,
+ dmap->end);
+ if (ret) {
+ printk("filemap_fdatawrite_range() failed. err=%d start=0x%llx,"
+ " end=0x%llx\n", ret, dmap->start, dmap->end);
+ return ret;
+ }
+
+ ret = invalidate_inode_pages2_range(inode->i_mapping,
+ dmap->start >> PAGE_SHIFT,
+ dmap->end >> PAGE_SHIFT);
+ if (ret)
+ printk("invalidate_inode_pages2_range() failed err=%d\n", ret);
+
+ return ret;
+}
+
+static int reclaim_one_dmap_locked(struct fuse_conn *fc, struct inode *inode,
+ struct fuse_dax_mapping *dmap)
+{
+ int ret;
+ struct fuse_inode *fi = get_fuse_inode(inode);
+
+ /*
+ * igrab() was done to make sure inode won't go under us, and this
+ * further avoids the race with evict().
+ */
+ ret = dmap_writeback_invalidate(inode, dmap);
+ if (ret)
+ return ret;
+
+ /* Remove dax mapping from inode interval tree now */
+ fuse_dax_interval_tree_remove(dmap, &fi->dmap_tree);
+ fi->nr_dmaps--;
+
+ /* It is possible that umount/shutdown has killed the fuse connection
+ * and worker thread is trying to reclaim memory in parallel. So check
+ * if connection is still up or not otherwise don't send removemapping
+ * message.
+ */
+ if (fc->connected) {
+ ret = dmap_removemapping_one(inode, dmap);
+ if (ret) {
+ pr_warn("Failed to remove mapping. offset=0x%llx"
+ " len=0x%llx ret=%d\n", dmap->window_offset,
+ dmap->length, ret);
+ }
+ }
+ return 0;
+}
+
+static void fuse_wait_dax_page(struct inode *inode)
+{
+ struct fuse_inode *fi = get_fuse_inode(inode);
+
+ up_write(&fi->i_mmap_sem);
+ schedule();
+ down_write(&fi->i_mmap_sem);
+}
+
+/* Should be called with fi->i_mmap_sem lock held exclusively */
+static int __fuse_break_dax_layouts(struct inode *inode, bool *retry,
+ loff_t start, loff_t end)
+{
+ struct page *page;
+
+ page = dax_layout_busy_page_range(inode->i_mapping, start, end);
+ if (!page)
+ return 0;
+
+ *retry = true;
+ return ___wait_var_event(&page->_refcount,
+ atomic_read(&page->_refcount) == 1, TASK_INTERRUPTIBLE,
+ 0, 0, fuse_wait_dax_page(inode));
+}
+
+/* dmap_end == 0 leads to unmapping of whole file */
+static int fuse_break_dax_layouts(struct inode *inode, u64 dmap_start,
+ u64 dmap_end)
+{
+ bool retry;
+ int ret;
+
+ do {
+ retry = false;
+ ret = __fuse_break_dax_layouts(inode, &retry, dmap_start,
+ dmap_end);
+ } while (ret == 0 && retry);
+
+ return ret;
+}
+
+/* Find first mapping in the tree and free it. */
+static struct fuse_dax_mapping *
+inode_reclaim_one_dmap_locked(struct fuse_conn *fc, struct inode *inode)
+{
+ struct fuse_inode *fi = get_fuse_inode(inode);
+ struct fuse_dax_mapping *dmap;
+ int ret;
+
+ for (dmap = fuse_dax_interval_tree_iter_first(&fi->dmap_tree, 0, -1);
+ dmap;
+ dmap = fuse_dax_interval_tree_iter_next(dmap, 0, -1)) {
+ /* still in use. */
+ if (refcount_read(&dmap->refcnt) > 1)
+ continue;
+
+ ret = reclaim_one_dmap_locked(fc, inode, dmap);
+ if (ret < 0)
+ return ERR_PTR(ret);
+
+ /* Clean up dmap. Do not add back to free list */
+ dmap_remove_busy_list(fc, dmap);
+ dmap->inode = NULL;
+ dmap->start = dmap->end = 0;
+
+ pr_debug("fuse: %s: reclaimed memory range. inode=%px,"
+ " window_offset=0x%llx, length=0x%llx\n", __func__,
+ inode, dmap->window_offset, dmap->length);
+ return dmap;
+ }
+
+ return NULL;
+}
+
+/*
+ * Find first mapping in the tree and free it and return it. Do not add
+ * it back to free pool. If fault == true, this function should be called
+ * with fi->i_mmap_sem held.
+ */
+static struct fuse_dax_mapping *inode_reclaim_one_dmap(struct fuse_conn *fc,
+ struct inode *inode,
+ bool fault)
+{
+ struct fuse_inode *fi = get_fuse_inode(inode);
+ struct fuse_dax_mapping *dmap;
+ int ret;
+
+ if (!fault)
+ down_write(&fi->i_mmap_sem);
+
+ /*
+ * Make sure there are no references to inode pages using
+ * get_user_pages()
+ */
+ ret = fuse_break_dax_layouts(inode, 0, 0);
+ if (ret) {
+ printk("virtio_fs: fuse_break_dax_layouts() failed. err=%d\n",
+ ret);
+ dmap = ERR_PTR(ret);
+ goto out_mmap_sem;
+ }
+ down_write(&fi->i_dmap_sem);
+ dmap = inode_reclaim_one_dmap_locked(fc, inode);
+ up_write(&fi->i_dmap_sem);
+out_mmap_sem:
+ if (!fault)
+ up_write(&fi->i_mmap_sem);
+ return dmap;
+}
+
+/* If fault == true, it should be called with fi->i_mmap_sem locked */
+static struct fuse_dax_mapping *alloc_dax_mapping_reclaim(struct fuse_conn *fc,
+ struct inode *inode, bool fault)
+{
+ struct fuse_dax_mapping *dmap;
+ struct fuse_inode *fi = get_fuse_inode(inode);
+
+ while(1) {
+ dmap = alloc_dax_mapping(fc);
+ if (dmap)
+ return dmap;
+
+ if (fi->nr_dmaps) {
+ dmap = inode_reclaim_one_dmap(fc, inode, fault);
+ if (dmap)
+ return dmap;
+ /* If we could not reclaim a mapping because it
+ * had a reference, that should be a temporary
+ * situation. Try again.
+ */
+ msleep(1);
+ continue;
+ }
+ /*
+ * There are no mappings which can be reclaimed.
+ * Wait for one.
+ */
+ if (!(fc->nr_free_ranges > 0)) {
+ if (wait_event_killable_exclusive(fc->dax_range_waitq,
+ (fc->nr_free_ranges > 0)))
+ return ERR_PTR(-EINTR);
+ }
+ }
+}
+
+static int lookup_and_reclaim_dmap_locked(struct fuse_conn *fc,
+ struct inode *inode, u64 dmap_start)
+{
+ int ret;
+ struct fuse_inode *fi = get_fuse_inode(inode);
+ struct fuse_dax_mapping *dmap;
+
+ /* Find fuse dax mapping at file offset inode. */
+ dmap = fuse_dax_interval_tree_iter_first(&fi->dmap_tree, dmap_start,
+ dmap_start);
+
+ /* Range already got cleaned up by somebody else */
+ if (!dmap)
+ return 0;
+
+ /* still in use. */
+ if (refcount_read(&dmap->refcnt) > 1)
+ return 0;
+
+ ret = reclaim_one_dmap_locked(fc, inode, dmap);
+ if (ret < 0)
+ return ret;
+
+ /* Cleanup dmap entry and add back to free list */
+ spin_lock(&fc->lock);
+ dmap_reinit_add_to_free_pool(fc, dmap);
+ spin_unlock(&fc->lock);
+ return ret;
+}
+
+/*
+ * Free a range of memory.
+ * Locking.
+ * 1. Take fuse_inode->i_mmap_sem to block dax faults.
+ * 2. Take fuse_inode->i_dmap_sem to protect interval tree and also to make
+ * sure read/write can not reuse a dmap which we might be freeing.
+ */
+static int lookup_and_reclaim_dmap(struct fuse_conn *fc, struct inode *inode,
+ u64 dmap_start, u64 dmap_end)
+{
+ int ret;
+ struct fuse_inode *fi = get_fuse_inode(inode);
+
+ down_write(&fi->i_mmap_sem);
+ ret = fuse_break_dax_layouts(inode, dmap_start, dmap_end);
+ if (ret) {
+ printk("virtio_fs: fuse_break_dax_layouts() failed. err=%d\n",
+ ret);
+ goto out_mmap_sem;
+ }
+
+ down_write(&fi->i_dmap_sem);
+ ret = lookup_and_reclaim_dmap_locked(fc, inode, dmap_start);
+ up_write(&fi->i_dmap_sem);
+out_mmap_sem:
+ up_write(&fi->i_mmap_sem);
+ return ret;
+}
+
+static int try_to_free_dmap_chunks(struct fuse_conn *fc,
+ unsigned long nr_to_free)
+{
+ struct fuse_dax_mapping *dmap, *pos, *temp;
+ int ret, nr_freed = 0;
+ u64 dmap_start = 0, window_offset = 0, dmap_end = 0;
+ struct inode *inode = NULL;
+
+ /* Pick first busy range and free it for now */
+ while(1) {
+ if (nr_freed >= nr_to_free)
+ break;
+
+ dmap = NULL;
+ spin_lock(&fc->lock);
+
+ if (!fc->nr_busy_ranges) {
+ spin_unlock(&fc->lock);
+ return 0;
+ }
+
+ list_for_each_entry_safe(pos, temp, &fc->busy_ranges,
+ busy_list) {
+ /* skip this range if it's in use. */
+ if (refcount_read(&pos->refcnt) > 1)
+ continue;
+
+ inode = igrab(pos->inode);
+ /*
+ * This inode is going away. That will free
+ * up all the ranges anyway, continue to
+ * next range.
+ */
+ if (!inode)
+ continue;
+ /*
+ * Take this element off the list and add it to the tail. If
+ * this element can't be freed, it will help with
+ * selecting new element in next iteration of loop.
+ */
+ dmap = pos;
+ list_move_tail(&dmap->busy_list, &fc->busy_ranges);
+ dmap_start = dmap->start;
+ dmap_end = dmap->end;
+ window_offset = dmap->window_offset;
+ break;
+ }
+ spin_unlock(&fc->lock);
+ if (!dmap)
+ return 0;
+
+ ret = lookup_and_reclaim_dmap(fc, inode, dmap_start, dmap_end);
+ iput(inode);
+ if (ret) {
+ printk("%s(window_offset=0x%llx) failed. err=%d\n",
+ __func__, window_offset, ret);
+ return ret;
+ }
+ nr_freed++;
+ }
+ return 0;
+}
+
+void fuse_dax_free_mem_worker(struct work_struct *work)
+{
+ int ret;
+ struct fuse_conn *fc = container_of(work, struct fuse_conn,
+ dax_free_work.work);
+ pr_debug("fuse: Worker to free memory called. nr_free_ranges=%lu"
+ " nr_busy_ranges=%lu\n", fc->nr_free_ranges,
+ fc->nr_busy_ranges);
+
+ ret = try_to_free_dmap_chunks(fc, FUSE_DAX_RECLAIM_CHUNK);
+ if (ret) {
+ pr_debug("fuse: try_to_free_dmap_chunks() failed with err=%d\n",
+ ret);
+ }
+
+ /* If the number of free ranges is still below the threshold, requeue */
+ kick_dmap_free_worker(fc, 1);
+}
diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
index de213a7e1b0e..41c2fbff0d37 100644
--- a/fs/fuse/fuse_i.h
+++ b/fs/fuse/fuse_i.h
@@ -54,6 +54,16 @@
#define FUSE_DAX_MEM_RANGE_SZ (2*1024*1024)
#define FUSE_DAX_MEM_RANGE_PAGES (FUSE_DAX_MEM_RANGE_SZ/PAGE_SIZE)
+/* Number of ranges reclaimer will try to free in one invocation */
+#define FUSE_DAX_RECLAIM_CHUNK (10)
+
+/*
+ * Dax memory reclaim threshold as a percentage of total ranges. When the
+ * number of free ranges drops below this threshold, reclaim can trigger.
+ * Default is 20%.
+ */
+#define FUSE_DAX_RECLAIM_THRESHOLD (20)
+
/** List of active connections */
extern struct list_head fuse_conn_list;
@@ -75,6 +85,9 @@ struct fuse_forget_link {
/** Translation information for file offsets to DAX window offsets */
struct fuse_dax_mapping {
+ /* Pointer to inode where this memory range is mapped */
+ struct inode *inode;
+
/* Will connect in fc->free_ranges to keep track of free memory */
struct list_head list;
@@ -97,6 +110,9 @@ struct fuse_dax_mapping {
/* Is this mapping read-only or read-write */
bool writable;
+
+ /* reference count when the mapping is used by dax iomap. */
+ refcount_t refcnt;
};
/** FUSE inode */
@@ -822,11 +838,19 @@ struct fuse_conn {
unsigned long nr_busy_ranges;
struct list_head busy_ranges;
+ /* Worker to free up memory ranges */
+ struct delayed_work dax_free_work;
+
+ /* Wait queue for a dax range to become free */
+ wait_queue_head_t dax_range_waitq;
+
/*
* DAX Window Free Ranges
*/
long nr_free_ranges;
struct list_head free_ranges;
+
+ unsigned long nr_ranges;
};
static inline struct fuse_conn *get_fuse_conn_super(struct super_block *sb)
@@ -1164,6 +1188,7 @@ unsigned int fuse_len_args(unsigned int numargs, struct fuse_arg *args);
*/
u64 fuse_get_unique(struct fuse_iqueue *fiq);
void fuse_free_conn(struct fuse_conn *fc);
+void fuse_dax_free_mem_worker(struct work_struct *work);
void fuse_cleanup_inode_mappings(struct inode *inode);
#endif /* _FS_FUSE_I_H */
diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
index d4770e7fb7eb..3560b62077a7 100644
--- a/fs/fuse/inode.c
+++ b/fs/fuse/inode.c
@@ -663,11 +663,13 @@ static int fuse_dax_mem_range_init(struct fuse_conn *fc,
range->window_offset = i * FUSE_DAX_MEM_RANGE_SZ;
range->length = FUSE_DAX_MEM_RANGE_SZ;
INIT_LIST_HEAD(&range->busy_list);
+ refcount_set(&range->refcnt, 1);
list_add_tail(&range->list, &mem_ranges);
}
list_replace_init(&mem_ranges, &fc->free_ranges);
fc->nr_free_ranges = nr_ranges;
+ fc->nr_ranges = nr_ranges;
return 0;
out_err:
/* Free All allocated elements */
@@ -692,6 +694,7 @@ void fuse_conn_init(struct fuse_conn *fc, struct user_namespace *user_ns,
refcount_set(&fc->count, 1);
atomic_set(&fc->dev_count, 1);
init_waitqueue_head(&fc->blocked_waitq);
+ init_waitqueue_head(&fc->dax_range_waitq);
fuse_iqueue_init(&fc->iq, fiq_ops, fiq_priv);
INIT_LIST_HEAD(&fc->bg_queue);
INIT_LIST_HEAD(&fc->entry);
@@ -711,6 +714,7 @@ void fuse_conn_init(struct fuse_conn *fc, struct user_namespace *user_ns,
fc->max_pages = FUSE_DEFAULT_MAX_PAGES_PER_REQ;
INIT_LIST_HEAD(&fc->free_ranges);
INIT_LIST_HEAD(&fc->busy_ranges);
+ INIT_DELAYED_WORK(&fc->dax_free_work, fuse_dax_free_mem_worker);
}
EXPORT_SYMBOL_GPL(fuse_conn_init);
@@ -719,6 +723,7 @@ void fuse_conn_put(struct fuse_conn *fc)
if (refcount_dec_and_test(&fc->count)) {
struct fuse_iqueue *fiq = &fc->iq;
+ flush_delayed_work(&fc->dax_free_work);
if (fc->dax_dev)
fuse_free_dax_mem_ranges(&fc->free_ranges);
if (fiq->ops->release)
--
2.20.1
We need some kind of locking mechanism here. Normal filesystems like
ext4 and xfs seem to take their own semaphore to protect against
truncate while a fault is in progress.
We have an additional requirement: protecting against fuse dax memory
range reclaim. When a range has been selected for reclaim, we need to
make sure no read/write/fault can access that memory range while
reclaim is in progress. Once reclaim is complete, the lock is released
and read/write/fault will trigger allocation of a fresh dax range.
Taking inode_lock() is not an option in the fault path, as lockdep
complains about circular dependencies. So define a new
fuse_inode->i_mmap_sem.
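The intended lock ordering can be summarized in a kernel-style sketch
(not standalone-compilable; it condenses the fault and truncate paths
from the diff below):

```c
/*
 * Faults take fi->i_mmap_sem shared; truncate, punch hole and dax
 * range reclaim take it exclusive, so a range being reclaimed cannot
 * be faulted back in concurrently.
 */

/* fault path: */
down_read(&fi->i_mmap_sem);
ret = dax_iomap_fault(vmf, pe_size, &pfn, NULL, &fuse_iomap_ops);
up_read(&fi->i_mmap_sem);

/* truncate / punch hole / range reclaim path: */
down_write(&fi->i_mmap_sem);
down_write(&fi->i_dmap_sem);    /* dmap tree lock nests inside */
/* ... unmap pages and reclaim the dax range ... */
up_write(&fi->i_dmap_sem);
up_write(&fi->i_mmap_sem);
```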
Signed-off-by: Vivek Goyal <[email protected]>
---
fs/fuse/dir.c | 2 ++
fs/fuse/file.c | 15 ++++++++++++---
fs/fuse/fuse_i.h | 7 +++++++
fs/fuse/inode.c | 1 +
4 files changed, 22 insertions(+), 3 deletions(-)
diff --git a/fs/fuse/dir.c b/fs/fuse/dir.c
index de1e2fde60bd..ad699a60ec03 100644
--- a/fs/fuse/dir.c
+++ b/fs/fuse/dir.c
@@ -1609,8 +1609,10 @@ int fuse_do_setattr(struct dentry *dentry, struct iattr *attr,
*/
if ((is_truncate || !is_wb) &&
S_ISREG(inode->i_mode) && oldsize != outarg.attr.size) {
+ down_write(&fi->i_mmap_sem);
truncate_pagecache(inode, outarg.attr.size);
invalidate_inode_pages2(inode->i_mapping);
+ up_write(&fi->i_mmap_sem);
}
clear_bit(FUSE_I_SIZE_UNSTABLE, &fi->state);
diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index 303496e6617f..ab56396cf661 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -2907,11 +2907,18 @@ static vm_fault_t __fuse_dax_fault(struct vm_fault *vmf,
if (write)
sb_start_pagefault(sb);
-
+ /*
+ * We need to serialize against not only truncate but also against
+ * fuse dax memory range reclaim. While a range is being reclaimed,
+ * we do not want any read/write/mmap to make progress and try
+ * to populate page cache or access memory we are trying to free.
+ */
+ down_read(&get_fuse_inode(inode)->i_mmap_sem);
ret = dax_iomap_fault(vmf, pe_size, &pfn, NULL, &fuse_iomap_ops);
if (ret & VM_FAULT_NEEDDSYNC)
ret = dax_finish_sync_fault(vmf, pe_size, pfn);
+ up_read(&get_fuse_inode(inode)->i_mmap_sem);
if (write)
sb_end_pagefault(sb);
@@ -3869,9 +3876,11 @@ static long fuse_file_fallocate(struct file *file, int mode, loff_t offset,
file_update_time(file);
}
- if (mode & FALLOC_FL_PUNCH_HOLE)
+ if (mode & FALLOC_FL_PUNCH_HOLE) {
+ down_write(&fi->i_mmap_sem);
truncate_pagecache_range(inode, offset, offset + length - 1);
-
+ up_write(&fi->i_mmap_sem);
+ }
fuse_invalidate_attr(inode);
out:
diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
index 490549862bda..3fea84411401 100644
--- a/fs/fuse/fuse_i.h
+++ b/fs/fuse/fuse_i.h
@@ -186,6 +186,13 @@ struct fuse_inode {
*/
struct rw_semaphore i_dmap_sem;
+ /**
+ * Can't take inode lock in fault path (leads to circular dependency).
+ * So take this in fuse dax fault path to make sure truncate and
+ * punch hole etc. can't make progress in parallel.
+ */
+ struct rw_semaphore i_mmap_sem;
+
/** Sorted rb tree of struct fuse_dax_mapping elements */
struct rb_root_cached dmap_tree;
unsigned long nr_dmaps;
diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
index 93bc65607a15..abc881e6acb0 100644
--- a/fs/fuse/inode.c
+++ b/fs/fuse/inode.c
@@ -88,6 +88,7 @@ static struct inode *fuse_alloc_inode(struct super_block *sb)
fi->state = 0;
fi->nr_dmaps = 0;
mutex_init(&fi->mutex);
+ init_rwsem(&fi->i_mmap_sem);
init_rwsem(&fi->i_dmap_sem);
spin_lock_init(&fi->lock);
fi->forget = fuse_alloc_forget();
--
2.20.1
On Wed, Mar 04, 2020 at 11:58:28AM -0500, Vivek Goyal wrote:
> From: Sebastien Boeuf <[email protected]>
>
> Virtio defines 'shared memory regions' that provide a continuously
> shared region between the host and guest.
>
> Provide a method to find a particular region on a device.
>
> Signed-off-by: Sebastien Boeuf <[email protected]>
> Signed-off-by: Dr. David Alan Gilbert <[email protected]>
> ---
> include/linux/virtio_config.h | 17 +++++++++++++++++
> 1 file changed, 17 insertions(+)
Reviewed-by: Stefan Hajnoczi <[email protected]>
On Wed, Mar 04, 2020 at 11:58:29AM -0500, Vivek Goyal wrote:
> diff --git a/drivers/virtio/virtio_pci_modern.c b/drivers/virtio/virtio_pci_modern.c
> index 7abcc50838b8..52f179411015 100644
> --- a/drivers/virtio/virtio_pci_modern.c
> +++ b/drivers/virtio/virtio_pci_modern.c
> @@ -443,6 +443,111 @@ static void del_vq(struct virtio_pci_vq_info *info)
> vring_del_virtqueue(vq);
> }
>
> +static int virtio_pci_find_shm_cap(struct pci_dev *dev,
> + u8 required_id,
> + u8 *bar, u64 *offset, u64 *len)
> +{
> + int pos;
> +
> + for (pos = pci_find_capability(dev, PCI_CAP_ID_VNDR);
Please fix the mixed tabs vs space indentation in this patch.
> +static bool vp_get_shm_region(struct virtio_device *vdev,
> + struct virtio_shm_region *region, u8 id)
> +{
> + struct virtio_pci_device *vp_dev = to_vp_device(vdev);
> + struct pci_dev *pci_dev = vp_dev->pci_dev;
> + u8 bar;
> + u64 offset, len;
> + phys_addr_t phys_addr;
> + size_t bar_len;
> + int ret;
> +
> + if (!virtio_pci_find_shm_cap(pci_dev, id, &bar, &offset, &len)) {
> + return false;
> + }
> +
> + ret = pci_request_region(pci_dev, bar, "virtio-pci-shm");
> + if (ret < 0) {
> + dev_err(&pci_dev->dev, "%s: failed to request BAR\n",
> + __func__);
> + return false;
> + }
> +
> + phys_addr = pci_resource_start(pci_dev, bar);
> + bar_len = pci_resource_len(pci_dev, bar);
> +
> + if (offset + len > bar_len) {
> + dev_err(&pci_dev->dev,
> + "%s: bar shorter than cap offset+len\n",
> + __func__);
> + return false;
> + }
> +
> + region->len = len;
> + region->addr = (u64) phys_addr + offset;
> +
> + return true;
> +}
Missing pci_release_region()?
On Wed, Mar 04, 2020 at 11:58:30AM -0500, Vivek Goyal wrote:
> From: Sebastien Boeuf <[email protected]>
>
> On MMIO a new set of registers is defined for finding SHM
> regions. Add their definitions and use them to find the region.
>
> Signed-off-by: Sebastien Boeuf <[email protected]>
> ---
> drivers/virtio/virtio_mmio.c | 32 ++++++++++++++++++++++++++++++++
> include/uapi/linux/virtio_mmio.h | 11 +++++++++++
> 2 files changed, 43 insertions(+)
Reviewed-by: Stefan Hajnoczi <[email protected]>
On Wed, Mar 04, 2020 at 11:58:29AM -0500, Vivek Goyal wrote:
> From: Sebastien Boeuf <[email protected]>
>
> On PCI the shm regions are found using capability entries;
> find a region by searching for the capability.
>
> Signed-off-by: Sebastien Boeuf <[email protected]>
> Signed-off-by: Dr. David Alan Gilbert <[email protected]>
> Signed-off-by: kbuild test robot <[email protected]>
> ---
> drivers/virtio/virtio_pci_modern.c | 107 +++++++++++++++++++++++++++++
> include/uapi/linux/virtio_pci.h | 11 ++-
> 2 files changed, 117 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/virtio/virtio_pci_modern.c b/drivers/virtio/virtio_pci_modern.c
> index 7abcc50838b8..52f179411015 100644
> --- a/drivers/virtio/virtio_pci_modern.c
> +++ b/drivers/virtio/virtio_pci_modern.c
> @@ -443,6 +443,111 @@ static void del_vq(struct virtio_pci_vq_info *info)
> vring_del_virtqueue(vq);
> }
>
> +static int virtio_pci_find_shm_cap(struct pci_dev *dev,
> + u8 required_id,
> + u8 *bar, u64 *offset, u64 *len)
> +{
> + int pos;
> +
> + for (pos = pci_find_capability(dev, PCI_CAP_ID_VNDR);
> + pos > 0;
> + pos = pci_find_next_capability(dev, pos, PCI_CAP_ID_VNDR)) {
> + u8 type, cap_len, id;
> + u32 tmp32;
> + u64 res_offset, res_length;
> +
> + pci_read_config_byte(dev, pos + offsetof(struct virtio_pci_cap,
> + cfg_type),
> + &type);
> + if (type != VIRTIO_PCI_CAP_SHARED_MEMORY_CFG)
> + continue;
> +
> + pci_read_config_byte(dev, pos + offsetof(struct virtio_pci_cap,
> + cap_len),
> + &cap_len);
> + if (cap_len != sizeof(struct virtio_pci_cap64)) {
> + printk(KERN_ERR "%s: shm cap with bad size offset: %d size: %d\n",
> + __func__, pos, cap_len);
> + continue;
> + }
> +
> + pci_read_config_byte(dev, pos + offsetof(struct virtio_pci_cap,
> + id),
> + &id);
> + if (id != required_id)
> + continue;
> +
> + /* Type, and ID match, looks good */
> + pci_read_config_byte(dev, pos + offsetof(struct virtio_pci_cap,
> + bar),
> + bar);
> +
> + /* Read the lower 32bit of length and offset */
> + pci_read_config_dword(dev, pos + offsetof(struct virtio_pci_cap, offset),
> + &tmp32);
> + res_offset = tmp32;
> + pci_read_config_dword(dev, pos + offsetof(struct virtio_pci_cap, length),
> + &tmp32);
> + res_length = tmp32;
> +
> + /* and now the top half */
> + pci_read_config_dword(dev,
> + pos + offsetof(struct virtio_pci_cap64,
> + offset_hi),
> + &tmp32);
> + res_offset |= ((u64)tmp32) << 32;
> + pci_read_config_dword(dev,
> + pos + offsetof(struct virtio_pci_cap64,
> + length_hi),
> + &tmp32);
> + res_length |= ((u64)tmp32) << 32;
> +
> + *offset = res_offset;
> + *len = res_length;
> +
> + return pos;
> + }
> + return 0;
> +}
> +
> +static bool vp_get_shm_region(struct virtio_device *vdev,
> + struct virtio_shm_region *region, u8 id)
> +{
> + struct virtio_pci_device *vp_dev = to_vp_device(vdev);
> + struct pci_dev *pci_dev = vp_dev->pci_dev;
> + u8 bar;
> + u64 offset, len;
> + phys_addr_t phys_addr;
> + size_t bar_len;
> + int ret;
> +
> + if (!virtio_pci_find_shm_cap(pci_dev, id, &bar, &offset, &len)) {
> + return false;
> + }
> +
> + ret = pci_request_region(pci_dev, bar, "virtio-pci-shm");
> + if (ret < 0) {
> + dev_err(&pci_dev->dev, "%s: failed to request BAR\n",
> + __func__);
> + return false;
> + }
> +
> + phys_addr = pci_resource_start(pci_dev, bar);
> + bar_len = pci_resource_len(pci_dev, bar);
> +
> + if (offset + len > bar_len) {
> + dev_err(&pci_dev->dev,
> + "%s: bar shorter than cap offset+len\n",
> + __func__);
> + return false;
> + }
> +
Something wrong with indentation here.
Also as long as you are validating things, it's worth checking
offset + len does not overflow.
> + region->len = len;
> + region->addr = (u64) phys_addr + offset;
> +
> + return true;
> +}
> +
> static const struct virtio_config_ops virtio_pci_config_nodev_ops = {
> .get = NULL,
> .set = NULL,
> @@ -457,6 +562,7 @@ static const struct virtio_config_ops virtio_pci_config_nodev_ops = {
> .bus_name = vp_bus_name,
> .set_vq_affinity = vp_set_vq_affinity,
> .get_vq_affinity = vp_get_vq_affinity,
> + .get_shm_region = vp_get_shm_region,
> };
>
> static const struct virtio_config_ops virtio_pci_config_ops = {
> @@ -473,6 +579,7 @@ static const struct virtio_config_ops virtio_pci_config_ops = {
> .bus_name = vp_bus_name,
> .set_vq_affinity = vp_set_vq_affinity,
> .get_vq_affinity = vp_get_vq_affinity,
> + .get_shm_region = vp_get_shm_region,
> };
>
> /**
> diff --git a/include/uapi/linux/virtio_pci.h b/include/uapi/linux/virtio_pci.h
> index 90007a1abcab..fe9f43680a1d 100644
> --- a/include/uapi/linux/virtio_pci.h
> +++ b/include/uapi/linux/virtio_pci.h
> @@ -113,6 +113,8 @@
> #define VIRTIO_PCI_CAP_DEVICE_CFG 4
> /* PCI configuration access */
> #define VIRTIO_PCI_CAP_PCI_CFG 5
> +/* Additional shared memory capability */
> +#define VIRTIO_PCI_CAP_SHARED_MEMORY_CFG 8
>
> /* This is the PCI capability header: */
> struct virtio_pci_cap {
> @@ -121,11 +123,18 @@ struct virtio_pci_cap {
> __u8 cap_len; /* Generic PCI field: capability length */
> __u8 cfg_type; /* Identifies the structure. */
> __u8 bar; /* Where to find it. */
> - __u8 padding[3]; /* Pad to full dword. */
> + __u8 id; /* Multiple capabilities of the same type */
> + __u8 padding[2]; /* Pad to full dword. */
> __le32 offset; /* Offset within bar. */
> __le32 length; /* Length of the structure, in bytes. */
> };
>
> +struct virtio_pci_cap64 {
> + struct virtio_pci_cap cap;
> + __le32 offset_hi; /* Most sig 32 bits of offset */
> + __le32 length_hi; /* Most sig 32 bits of length */
> +};
> +
> struct virtio_pci_notify_cap {
> struct virtio_pci_cap cap;
> __le32 notify_off_multiplier; /* Multiplier for queue_notify_off. */
> --
> 2.20.1
On Wed, Mar 4, 2020 at 5:59 PM Vivek Goyal <[email protected]> wrote:
>
> This reduces code duplication and makes the code a little easier to read.
>
> Signed-off-by: Vivek Goyal <[email protected]>
Reviewed-by: Miklos Szeredi <[email protected]>
On Wed, Mar 4, 2020 at 5:59 PM Vivek Goyal <[email protected]> wrote:
>
> This option was introduced so that for virtio_fs we don't show any mount
> options in fuse_show_options(), because we don't offer any of these
> options to be controlled by the mounter.
>
> Very soon we are planning to introduce a "dax" option which the mounter
> should be able to specify, and no_mount_options does not work anymore.
> What we need is a per-option flag so that the filesystem can specify
> which options to show.
>
> Add a few such flags to control the behavior in a more fine-grained
> manner and get rid of no_mount_options.
>
> Signed-off-by: Vivek Goyal <[email protected]>
Reviewed-by: Miklos Szeredi <[email protected]>
On Wed, Mar 4, 2020 at 5:59 PM Vivek Goyal <[email protected]> wrote:
>
> Add a mount option to allow using dax with virtio_fs.
>
> Signed-off-by: Vivek Goyal <[email protected]>
Reviewed-by: Miklos Szeredi <[email protected]>
On Wed, Mar 04, 2020 at 11:58:27AM -0500, Vivek Goyal wrote:
>
> + /* If end == 0, all pages from start to till end of file */
> + if (!end) {
> + end_idx = ULONG_MAX;
> + len = 0;
I find this a bit odd to specify end == 0 for ULONG_MAX...
> }
> +EXPORT_SYMBOL_GPL(dax_layout_busy_page_range);
> +
> +/**
> + * dax_layout_busy_page - find first pinned page in @mapping
> + * @mapping: address space to scan for a page with ref count > 1
> + *
> + * DAX requires ZONE_DEVICE mapped pages. These pages are never
> + * 'onlined' to the page allocator so they are considered idle when
> + * page->count == 1. A filesystem uses this interface to determine if
> + * any page in the mapping is busy, i.e. for DMA, or other
> + * get_user_pages() usages.
> + *
> + * It is expected that the filesystem is holding locks to block the
> + * establishment of new mappings in this address_space. I.e. it expects
> + * to be able to run unmap_mapping_range() and subsequently not race
> + * mapping_mapped() becoming true.
> + */
> +struct page *dax_layout_busy_page(struct address_space *mapping)
> +{
> + return dax_layout_busy_page_range(mapping, 0, 0);
... other functions I have seen specify ULONG_MAX here. Which IMO makes this
call site more clear.
Ira
On Tue, Mar 10, 2020 at 11:04:37AM +0000, Stefan Hajnoczi wrote:
> On Wed, Mar 04, 2020 at 11:58:29AM -0500, Vivek Goyal wrote:
> > diff --git a/drivers/virtio/virtio_pci_modern.c b/drivers/virtio/virtio_pci_modern.c
> > index 7abcc50838b8..52f179411015 100644
> > --- a/drivers/virtio/virtio_pci_modern.c
> > +++ b/drivers/virtio/virtio_pci_modern.c
> > @@ -443,6 +443,111 @@ static void del_vq(struct virtio_pci_vq_info *info)
> > vring_del_virtqueue(vq);
> > }
> >
> > +static int virtio_pci_find_shm_cap(struct pci_dev *dev,
> > + u8 required_id,
> > + u8 *bar, u64 *offset, u64 *len)
> > +{
> > + int pos;
> > +
> > + for (pos = pci_find_capability(dev, PCI_CAP_ID_VNDR);
>
> Please fix the mixed tabs vs space indentation in this patch.
Will do. There are plenty of these in this patch.
>
> > +static bool vp_get_shm_region(struct virtio_device *vdev,
> > + struct virtio_shm_region *region, u8 id)
> > +{
> > + struct virtio_pci_device *vp_dev = to_vp_device(vdev);
> > + struct pci_dev *pci_dev = vp_dev->pci_dev;
> > + u8 bar;
> > + u64 offset, len;
> > + phys_addr_t phys_addr;
> > + size_t bar_len;
> > + int ret;
> > +
> > + if (!virtio_pci_find_shm_cap(pci_dev, id, &bar, &offset, &len)) {
> > + return false;
> > + }
> > +
> > + ret = pci_request_region(pci_dev, bar, "virtio-pci-shm");
> > + if (ret < 0) {
> > + dev_err(&pci_dev->dev, "%s: failed to request BAR\n",
> > + __func__);
> > + return false;
> > + }
> > +
> > + phys_addr = pci_resource_start(pci_dev, bar);
> > + bar_len = pci_resource_len(pci_dev, bar);
> > +
> > + if (offset + len > bar_len) {
> > + dev_err(&pci_dev->dev,
> > + "%s: bar shorter than cap offset+len\n",
> > + __func__);
> > + return false;
> > + }
> > +
> > + region->len = len;
> > + region->addr = (u64) phys_addr + offset;
> > +
> > + return true;
> > +}
>
> Missing pci_release_region()?
Good catch. We don't have a mechanism to call pci_release_region(), and
the virtio-mmio device's ->get_shm_region() implementation does not even
seem to reserve the resources.
So how about we leave resource reservation to the caller;
->get_shm_region() just returns the addr/len pair of the requested
resource.
Something like this patch.
---
drivers/virtio/virtio_pci_modern.c | 8 --------
fs/fuse/virtio_fs.c | 13 ++++++++++---
2 files changed, 10 insertions(+), 11 deletions(-)
Index: redhat-linux/fs/fuse/virtio_fs.c
===================================================================
--- redhat-linux.orig/fs/fuse/virtio_fs.c 2020-03-10 09:13:34.624565666 -0400
+++ redhat-linux/fs/fuse/virtio_fs.c 2020-03-10 14:11:10.970284651 -0400
@@ -763,11 +763,18 @@ static int virtio_fs_setup_dax(struct vi
if (!have_cache) {
dev_notice(&vdev->dev, "%s: No cache capability\n", __func__);
return 0;
- } else {
- dev_notice(&vdev->dev, "Cache len: 0x%llx @ 0x%llx\n",
- cache_reg.len, cache_reg.addr);
}
+ if (!devm_request_mem_region(&vdev->dev, cache_reg.addr, cache_reg.len,
+ dev_name(&vdev->dev))) {
+ dev_warn(&vdev->dev, "could not reserve region addr=0x%llx"
+ " len=0x%llx\n", cache_reg.addr, cache_reg.len);
+ return -EBUSY;
+ }
+
+ dev_notice(&vdev->dev, "Cache len: 0x%llx @ 0x%llx\n", cache_reg.len,
+ cache_reg.addr);
+
pgmap = devm_kzalloc(&vdev->dev, sizeof(*pgmap), GFP_KERNEL);
if (!pgmap)
return -ENOMEM;
Index: redhat-linux/drivers/virtio/virtio_pci_modern.c
===================================================================
--- redhat-linux.orig/drivers/virtio/virtio_pci_modern.c 2020-03-10 08:51:36.886565666 -0400
+++ redhat-linux/drivers/virtio/virtio_pci_modern.c 2020-03-10 13:43:15.168753543 -0400
@@ -511,19 +511,11 @@ static bool vp_get_shm_region(struct vir
u64 offset, len;
phys_addr_t phys_addr;
size_t bar_len;
- int ret;
if (!virtio_pci_find_shm_cap(pci_dev, id, &bar, &offset, &len)) {
return false;
}
- ret = pci_request_region(pci_dev, bar, "virtio-pci-shm");
- if (ret < 0) {
- dev_err(&pci_dev->dev, "%s: failed to request BAR\n",
- __func__);
- return false;
- }
-
phys_addr = pci_resource_start(pci_dev, bar);
bar_len = pci_resource_len(pci_dev, bar);
On Tue, Mar 10, 2020 at 07:12:25AM -0400, Michael S. Tsirkin wrote:
[..]
> > +static bool vp_get_shm_region(struct virtio_device *vdev,
> > + struct virtio_shm_region *region, u8 id)
> > +{
> > + struct virtio_pci_device *vp_dev = to_vp_device(vdev);
> > + struct pci_dev *pci_dev = vp_dev->pci_dev;
> > + u8 bar;
> > + u64 offset, len;
> > + phys_addr_t phys_addr;
> > + size_t bar_len;
> > + int ret;
> > +
> > + if (!virtio_pci_find_shm_cap(pci_dev, id, &bar, &offset, &len)) {
> > + return false;
> > + }
> > +
> > + ret = pci_request_region(pci_dev, bar, "virtio-pci-shm");
> > + if (ret < 0) {
> > + dev_err(&pci_dev->dev, "%s: failed to request BAR\n",
> > + __func__);
> > + return false;
> > + }
> > +
> > + phys_addr = pci_resource_start(pci_dev, bar);
> > + bar_len = pci_resource_len(pci_dev, bar);
> > +
> > + if (offset + len > bar_len) {
> > + dev_err(&pci_dev->dev,
> > + "%s: bar shorter than cap offset+len\n",
> > + __func__);
> > + return false;
> > + }
> > +
>
> Something wrong with indentation here.
Will fix all indentation related issues in this patch.
> Also as long as you are validating things, it's worth checking
> offset + len does not overflow.
Something like addition of following lines?
+ if ((offset + len) < offset) {
+ dev_err(&pci_dev->dev, "%s: cap offset+len overflow detected\n",
+ __func__);
+ return false;
+ }
Vivek
On Wed, Mar 4, 2020 at 5:59 PM Vivek Goyal <[email protected]> wrote:
>
> The device communicates FUSE_SETUPMAPPING/FUSE_REMOVEMAPPING alignment
> constraints via the FUSE_INIT map_alignment field. Parse this field and
> ensure our DAX mappings meet the alignment constraints.
>
> We don't actually align anything differently since our mappings are
> already 2MB aligned. Just check the value when the connection is
> established. If it becomes necessary to honor arbitrary alignments in
> the future we'll have to adjust how mappings are sized.
>
> The upshot of this commit is that we can be confident that mappings will
> work even when emulating x86 on Power and similar combinations where the
> host page sizes are different.
>
> Signed-off-by: Stefan Hajnoczi <[email protected]>
> Signed-off-by: Vivek Goyal <[email protected]>
Reviewed-by: Miklos Szeredi <[email protected]>
On Wed, Mar 4, 2020 at 5:59 PM Vivek Goyal <[email protected]> wrote:
>
> Introduce two new fuse commands to setup/remove memory mappings. This
> will be used to setup/tear down file mapping in dax window.
>
> Signed-off-by: Vivek Goyal <[email protected]>
> Signed-off-by: Peng Tao <[email protected]>
> ---
> include/uapi/linux/fuse.h | 37 +++++++++++++++++++++++++++++++++++++
> 1 file changed, 37 insertions(+)
>
> diff --git a/include/uapi/linux/fuse.h b/include/uapi/linux/fuse.h
> index 5b85819e045f..62633555d547 100644
> --- a/include/uapi/linux/fuse.h
> +++ b/include/uapi/linux/fuse.h
> @@ -894,4 +894,41 @@ struct fuse_copy_file_range_in {
> uint64_t flags;
> };
>
> +#define FUSE_SETUPMAPPING_ENTRIES 8
> +#define FUSE_SETUPMAPPING_FLAG_WRITE (1ull << 0)
> +struct fuse_setupmapping_in {
> + /* An already open handle */
> + uint64_t fh;
> + /* Offset into the file to start the mapping */
> + uint64_t foffset;
> + /* Length of mapping required */
> + uint64_t len;
> + /* Flags, FUSE_SETUPMAPPING_FLAG_* */
> + uint64_t flags;
> + /* Offset in Memory Window */
> + uint64_t moffset;
> +};
> +
> +struct fuse_setupmapping_out {
> + /* Offsets into the cache of mappings */
> + uint64_t coffset[FUSE_SETUPMAPPING_ENTRIES];
> + /* Lengths of each mapping */
> + uint64_t len[FUSE_SETUPMAPPING_ENTRIES];
> +};
fuse_setupmapping_out together with FUSE_SETUPMAPPING_ENTRIES seem to be unused.
Thanks,
Miklos
On Tue, Mar 10, 2020 at 08:19:07AM -0700, Ira Weiny wrote:
> On Wed, Mar 04, 2020 at 11:58:27AM -0500, Vivek Goyal wrote:
> >
> > + /* If end == 0, all pages from start to till end of file */
> > + if (!end) {
> > + end_idx = ULONG_MAX;
> > + len = 0;
>
> I find this a bit odd to specify end == 0 for ULONG_MAX...
>
> > }
> > +EXPORT_SYMBOL_GPL(dax_layout_busy_page_range);
> > +
> > +/**
> > + * dax_layout_busy_page - find first pinned page in @mapping
> > + * @mapping: address space to scan for a page with ref count > 1
> > + *
> > + * DAX requires ZONE_DEVICE mapped pages. These pages are never
> > + * 'onlined' to the page allocator so they are considered idle when
> > + * page->count == 1. A filesystem uses this interface to determine if
> > + * any page in the mapping is busy, i.e. for DMA, or other
> > + * get_user_pages() usages.
> > + *
> > + * It is expected that the filesystem is holding locks to block the
> > + * establishment of new mappings in this address_space. I.e. it expects
> > + * to be able to run unmap_mapping_range() and subsequently not race
> > + * mapping_mapped() becoming true.
> > + */
> > +struct page *dax_layout_busy_page(struct address_space *mapping)
> > +{
> > + return dax_layout_busy_page_range(mapping, 0, 0);
>
> ... other functions I have seen specify ULONG_MAX here. Which IMO makes this
> call site more clear.
I think I looked at unmap_mapping_range(), where holelen=0 implies "till
the end of file", and followed the same pattern.
But I agree that LLONG_MAX (end is of type loff_t) is probably more
intuitive. I will change it.
Vivek
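The change being agreed on above would make the wrapper pass the sentinel
explicitly; a minimal kernel-style sketch (not standalone-compilable, and
assuming the range helper's signature stays as posted):

```c
/* "Scan the whole mapping" is now spelled out at the call site instead
 * of relying on the special value end == 0. */
struct page *dax_layout_busy_page(struct address_space *mapping)
{
	return dax_layout_busy_page_range(mapping, 0, LLONG_MAX);
}
```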
On Tue, Mar 10, 2020 at 08:49:49PM +0100, Miklos Szeredi wrote:
> On Wed, Mar 4, 2020 at 5:59 PM Vivek Goyal <[email protected]> wrote:
> >
> > Introduce two new fuse commands to setup/remove memory mappings. This
> > will be used to setup/tear down file mapping in dax window.
> >
> > Signed-off-by: Vivek Goyal <[email protected]>
> > Signed-off-by: Peng Tao <[email protected]>
> > ---
> > include/uapi/linux/fuse.h | 37 +++++++++++++++++++++++++++++++++++++
> > 1 file changed, 37 insertions(+)
> >
> > diff --git a/include/uapi/linux/fuse.h b/include/uapi/linux/fuse.h
> > index 5b85819e045f..62633555d547 100644
> > --- a/include/uapi/linux/fuse.h
> > +++ b/include/uapi/linux/fuse.h
> > @@ -894,4 +894,41 @@ struct fuse_copy_file_range_in {
> > uint64_t flags;
> > };
> >
> > +#define FUSE_SETUPMAPPING_ENTRIES 8
> > +#define FUSE_SETUPMAPPING_FLAG_WRITE (1ull << 0)
> > +struct fuse_setupmapping_in {
> > + /* An already open handle */
> > + uint64_t fh;
> > + /* Offset into the file to start the mapping */
> > + uint64_t foffset;
> > + /* Length of mapping required */
> > + uint64_t len;
> > + /* Flags, FUSE_SETUPMAPPING_FLAG_* */
> > + uint64_t flags;
> > + /* Offset in Memory Window */
> > + uint64_t moffset;
> > +};
> > +
> > +struct fuse_setupmapping_out {
> > + /* Offsets into the cache of mappings */
> > + uint64_t coffset[FUSE_SETUPMAPPING_ENTRIES];
> > + /* Lengths of each mapping */
> > + uint64_t len[FUSE_SETUPMAPPING_ENTRIES];
> > +};
>
> fuse_setupmapping_out together with FUSE_SETUPMAPPING_ENTRIES seem to be unused.
This looks like leftover from the old code. I will get rid of it. Thanks.
Vivek
On Tue, Mar 10, 2020 at 02:47:20PM -0400, Vivek Goyal wrote:
> On Tue, Mar 10, 2020 at 07:12:25AM -0400, Michael S. Tsirkin wrote:
> [..]
> > > +static bool vp_get_shm_region(struct virtio_device *vdev,
> > > + struct virtio_shm_region *region, u8 id)
> > > +{
> > > + struct virtio_pci_device *vp_dev = to_vp_device(vdev);
> > > + struct pci_dev *pci_dev = vp_dev->pci_dev;
> > > + u8 bar;
> > > + u64 offset, len;
> > > + phys_addr_t phys_addr;
> > > + size_t bar_len;
> > > + int ret;
> > > +
> > > + if (!virtio_pci_find_shm_cap(pci_dev, id, &bar, &offset, &len)) {
> > > + return false;
> > > + }
> > > +
> > > + ret = pci_request_region(pci_dev, bar, "virtio-pci-shm");
> > > + if (ret < 0) {
> > > + dev_err(&pci_dev->dev, "%s: failed to request BAR\n",
> > > + __func__);
> > > + return false;
> > > + }
> > > +
> > > + phys_addr = pci_resource_start(pci_dev, bar);
> > > + bar_len = pci_resource_len(pci_dev, bar);
> > > +
> > > + if (offset + len > bar_len) {
> > > + dev_err(&pci_dev->dev,
> > > + "%s: bar shorter than cap offset+len\n",
> > > + __func__);
> > > + return false;
> > > + }
> > > +
> >
> > Something wrong with indentation here.
>
> Will fix all indentation related issues in this patch.
>
> > Also as long as you are validating things, it's worth checking
> > offset + len does not overflow.
>
> Something like addition of following lines?
>
> + if ((offset + len) < offset) {
> + dev_err(&pci_dev->dev, "%s: cap offset+len overflow detected\n",
> + __func__);
> + return false;
> + }
>
> Vivek
That should do it.
On Wed, Mar 04, 2020 at 11:58:45AM -0500, Vivek Goyal wrote:
> Add logic to free up a busy memory range. Freed memory range will be
> returned to free pool. Add a worker which can be started to select
> and free some busy memory ranges.
>
> A process can also steal one of its own busy dax ranges if a free range
> is not available. I will refer to this as direct reclaim.
>
> If a free range is not available and nothing can be stolen from the
> same inode, the caller waits on a waitq for a free range to become
> available.
>
> For reclaiming a range, as of now we need to hold following locks in
> specified order.
>
> down_write(&fi->i_mmap_sem);
> down_write(&fi->i_dmap_sem);
>
> We look for a free range in following order.
>
> A. Try to get a free range.
> B. If not, try direct reclaim.
> C. If not, wait for a memory range to become free
>
> Signed-off-by: Vivek Goyal <[email protected]>
> Signed-off-by: Liu Bo <[email protected]>
> ---
> fs/fuse/file.c | 450 ++++++++++++++++++++++++++++++++++++++++++++++-
> fs/fuse/fuse_i.h | 25 +++
> fs/fuse/inode.c | 5 +
> 3 files changed, 473 insertions(+), 7 deletions(-)
>
> diff --git a/fs/fuse/file.c b/fs/fuse/file.c
> index 8b264fcb9b3c..61ae2ddeef55 100644
> --- a/fs/fuse/file.c
> +++ b/fs/fuse/file.c
> @@ -8,6 +8,7 @@
>
> #include "fuse_i.h"
>
> +#include <linux/delay.h>
> #include <linux/pagemap.h>
> #include <linux/slab.h>
> #include <linux/kernel.h>
> @@ -37,6 +38,8 @@ static struct page **fuse_pages_alloc(unsigned int npages, gfp_t flags,
> return pages;
> }
>
> +static struct fuse_dax_mapping *alloc_dax_mapping_reclaim(struct fuse_conn *fc,
> + struct inode *inode, bool fault);
> static int fuse_send_open(struct fuse_conn *fc, u64 nodeid, struct file *file,
> int opcode, struct fuse_open_out *outargp)
> {
> @@ -193,6 +196,28 @@ static void fuse_link_write_file(struct file *file)
> spin_unlock(&fi->lock);
> }
>
> +static void
> +__kick_dmap_free_worker(struct fuse_conn *fc, unsigned long delay_ms)
> +{
> + unsigned long free_threshold;
> +
> + /* If number of free ranges are below threshold, start reclaim */
> + free_threshold = max((fc->nr_ranges * FUSE_DAX_RECLAIM_THRESHOLD)/100,
> + (unsigned long)1);
> + if (fc->nr_free_ranges < free_threshold) {
> + pr_debug("fuse: Kicking dax memory reclaim worker. nr_free_ranges=0x%ld nr_total_ranges=%ld\n", fc->nr_free_ranges, fc->nr_ranges);
> + queue_delayed_work(system_long_wq, &fc->dax_free_work,
> + msecs_to_jiffies(delay_ms));
> + }
> +}
> +
> +static void kick_dmap_free_worker(struct fuse_conn *fc, unsigned long delay_ms)
> +{
> + spin_lock(&fc->lock);
> + __kick_dmap_free_worker(fc, delay_ms);
> + spin_unlock(&fc->lock);
> +}
> +
> static struct fuse_dax_mapping *alloc_dax_mapping(struct fuse_conn *fc)
> {
> struct fuse_dax_mapping *dmap = NULL;
> @@ -201,7 +226,7 @@ static struct fuse_dax_mapping *alloc_dax_mapping(struct fuse_conn *fc)
>
> if (fc->nr_free_ranges <= 0) {
> spin_unlock(&fc->lock);
> - return NULL;
> + goto out_kick;
> }
>
> WARN_ON(list_empty(&fc->free_ranges));
> @@ -212,6 +237,9 @@ static struct fuse_dax_mapping *alloc_dax_mapping(struct fuse_conn *fc)
> list_del_init(&dmap->list);
> fc->nr_free_ranges--;
> spin_unlock(&fc->lock);
> +
> +out_kick:
> + kick_dmap_free_worker(fc, 0);
> return dmap;
> }
>
> @@ -238,6 +266,7 @@ static void __dmap_add_to_free_pool(struct fuse_conn *fc,
> {
> list_add_tail(&dmap->list, &fc->free_ranges);
> fc->nr_free_ranges++;
> + wake_up(&fc->dax_range_waitq);
> }
>
> static void dmap_add_to_free_pool(struct fuse_conn *fc,
> @@ -289,6 +318,12 @@ static int fuse_setup_one_mapping(struct inode *inode, loff_t offset,
>
> dmap->writable = writable;
> if (!upgrade) {
> + /*
> + * We don't take a reference on the inode. The inode is valid
> + * right now and, when the inode is going away, the cleanup
> + * logic should first clean up the dmap entries.
> + */
> + dmap->inode = inode;
> dmap->start = offset;
> dmap->end = offset + FUSE_DAX_MEM_RANGE_SZ - 1;
> /* Protected by fi->i_dmap_sem */
> @@ -368,6 +403,7 @@ static void dmap_reinit_add_to_free_pool(struct fuse_conn *fc,
> "window_offset=0x%llx length=0x%llx\n", dmap->start,
> dmap->end, dmap->window_offset, dmap->length);
> __dmap_remove_busy_list(fc, dmap);
> + dmap->inode = NULL;
> dmap->start = dmap->end = 0;
> __dmap_add_to_free_pool(fc, dmap);
> }
> @@ -386,7 +422,8 @@ static void inode_reclaim_dmap_range(struct fuse_conn *fc, struct inode *inode,
> int err, num = 0;
> LIST_HEAD(to_remove);
>
> - pr_debug("fuse: %s: start=0x%llx, end=0x%llx\n", __func__, start, end);
> + pr_debug("fuse: %s: inode=0x%px start=0x%llx, end=0x%llx\n", __func__,
> + inode, start, end);
>
> /*
> * Interval tree search matches intersecting entries. Adjust the range
> @@ -400,6 +437,8 @@ static void inode_reclaim_dmap_range(struct fuse_conn *fc, struct inode *inode,
> end);
> if (!dmap)
> break;
> + /* inode is going away. There should not be any users of dmap */
> + WARN_ON(refcount_read(&dmap->refcnt) > 1);
> fuse_dax_interval_tree_remove(dmap, &fi->dmap_tree);
> num++;
> list_add(&dmap->list, &to_remove);
> @@ -434,6 +473,21 @@ static void inode_reclaim_dmap_range(struct fuse_conn *fc, struct inode *inode,
> spin_unlock(&fc->lock);
> }
>
> +static int dmap_removemapping_one(struct inode *inode,
> + struct fuse_dax_mapping *dmap)
> +{
> + struct fuse_removemapping_one forget_one;
> + struct fuse_removemapping_in inarg;
> +
> + memset(&inarg, 0, sizeof(inarg));
> + inarg.count = 1;
> + memset(&forget_one, 0, sizeof(forget_one));
> + forget_one.moffset = dmap->window_offset;
> + forget_one.len = dmap->length;
> +
> + return fuse_send_removemapping(inode, &inarg, &forget_one);
> +}
> +
> /*
> * It is called from evict_inode() and by that time inode is going away. So
> * this function does not take any locks like fi->i_dmap_sem for traversing
> @@ -1903,6 +1957,17 @@ static void fuse_fill_iomap(struct inode *inode, loff_t pos, loff_t length,
> if (flags & IOMAP_FAULT)
> iomap->length = ALIGN(len, PAGE_SIZE);
> iomap->type = IOMAP_MAPPED;
> + /*
> + * Increase refcnt so that reclaim code knows this dmap is in
> + * use. This assumes i_dmap_sem is held either shared or
> + * exclusive.
> + */
> + refcount_inc(&dmap->refcnt);
> +
> + /* iomap->private should be NULL */
> + WARN_ON_ONCE(iomap->private);
> + iomap->private = dmap;
> +
> pr_debug("%s: returns iomap: addr 0x%llx offset 0x%llx"
> " length 0x%llx\n", __func__, iomap->addr,
> iomap->offset, iomap->length);
> @@ -1925,8 +1990,12 @@ static int iomap_begin_setup_new_mapping(struct inode *inode, loff_t pos,
> int ret;
> bool writable = flags & IOMAP_WRITE;
>
> - alloc_dmap = alloc_dax_mapping(fc);
> - if (!alloc_dmap)
> + alloc_dmap = alloc_dax_mapping_reclaim(fc, inode, flags & IOMAP_FAULT);
> + if (IS_ERR(alloc_dmap))
> + return PTR_ERR(alloc_dmap);
> +
> + /* If we are here, we should have memory allocated */
> + if (WARN_ON(!alloc_dmap))
> return -EBUSY;
>
> /*
> @@ -1979,14 +2048,25 @@ static int iomap_begin_upgrade_mapping(struct inode *inode, loff_t pos,
> dmap = fuse_dax_interval_tree_iter_first(&fi->dmap_tree, pos, pos);
>
> /* We are holding either inode lock or i_mmap_sem, and that should
> - * ensure that dmap can't reclaimed or truncated and it should still
> - * be there in tree despite the fact we dropped and re-acquired the
> - * lock.
> + * ensure that dmap can't be truncated. We are holding a reference
> + * on dmap and that should make sure it can't be reclaimed. So dmap
> + * should still be there in tree despite the fact we dropped and
> + * re-acquired the i_dmap_sem lock.
> */
> ret = -EIO;
> if (WARN_ON(!dmap))
> goto out_err;
>
> + /* We took an extra reference on dmap to make sure it's not
> + * reclaimed. Now we hold the i_dmap_sem lock and that reference
> + * is not needed anymore. Drop it.
> + */
> + if (refcount_dec_and_test(&dmap->refcnt)) {
> + /* refcount should not hit 0. This object only goes
> + * away when fuse connection goes away */
> + WARN_ON_ONCE(1);
> + }
> +
> /* Maybe another thread already upgraded mapping while we were not
> * holding lock.
> */
> @@ -2056,7 +2136,11 @@ static int fuse_iomap_begin(struct inode *inode, loff_t pos, loff_t length,
> * two threads to be trying to this simultaneously
> * for same dmap. So drop shared lock and acquire
> * exclusive lock.
> + *
> + * Before dropping i_dmap_sem lock, take reference
> + * on dmap so that it's not freed by range reclaim.
> */
> + refcount_inc(&dmap->refcnt);
> up_read(&fi->i_dmap_sem);
> pr_debug("%s: Upgrading mapping at offset 0x%llx"
> " length 0x%llx\n", __func__, pos, length);
> @@ -2092,6 +2176,16 @@ static int fuse_iomap_end(struct inode *inode, loff_t pos, loff_t length,
> ssize_t written, unsigned flags,
> struct iomap *iomap)
> {
> + struct fuse_dax_mapping *dmap = iomap->private;
> +
> + if (dmap) {
> + if (refcount_dec_and_test(&dmap->refcnt)) {
> + /* refcount should not hit 0. This object only goes
> + * away when fuse connection goes away */
> + WARN_ON_ONCE(1);
> + }
> + }
> +
> /* DAX writes beyond end-of-file aren't handled using iomap, so the
> * file size is unchanged and there is nothing to do here.
> */
> @@ -4103,3 +4197,345 @@ void fuse_init_file_inode(struct inode *inode)
> inode->i_data.a_ops = &fuse_dax_file_aops;
> }
> }
> +
> +static int dmap_writeback_invalidate(struct inode *inode,
> + struct fuse_dax_mapping *dmap)
> +{
> + int ret;
> +
> + ret = filemap_fdatawrite_range(inode->i_mapping, dmap->start,
> + dmap->end);
> + if (ret) {
> + printk("filemap_fdatawrite_range() failed. err=%d start=0x%llx,"
> + " end=0x%llx\n", ret, dmap->start, dmap->end);
> + return ret;
> + }
> +
> + ret = invalidate_inode_pages2_range(inode->i_mapping,
> + dmap->start >> PAGE_SHIFT,
> + dmap->end >> PAGE_SHIFT);
> + if (ret)
> + printk("invalidate_inode_pages2_range() failed err=%d\n", ret);
> +
> + return ret;
> +}
> +
> +static int reclaim_one_dmap_locked(struct fuse_conn *fc, struct inode *inode,
> + struct fuse_dax_mapping *dmap)
> +{
> + int ret;
> + struct fuse_inode *fi = get_fuse_inode(inode);
> +
> + /*
> + * igrab() was done to make sure the inode won't go away under us,
> + * and this further avoids the race with evict().
> + */
> + ret = dmap_writeback_invalidate(inode, dmap);
> + if (ret)
> + return ret;
> +
> + /* Remove dax mapping from inode interval tree now */
> + fuse_dax_interval_tree_remove(dmap, &fi->dmap_tree);
> + fi->nr_dmaps--;
> +
> + /* It is possible that umount/shutdown has killed the fuse connection
> + * and the worker thread is trying to reclaim memory in parallel. So
> + * check if the connection is still up; otherwise don't send the
> + * removemapping message.
> + */
> + if (fc->connected) {
> + ret = dmap_removemapping_one(inode, dmap);
> + if (ret) {
> + pr_warn("Failed to remove mapping. offset=0x%llx"
> + " len=0x%llx ret=%d\n", dmap->window_offset,
> + dmap->length, ret);
> + }
> + }
> + return 0;
> +}
> +
> +static void fuse_wait_dax_page(struct inode *inode)
> +{
> + struct fuse_inode *fi = get_fuse_inode(inode);
> +
> + up_write(&fi->i_mmap_sem);
> + schedule();
> + down_write(&fi->i_mmap_sem);
> +}
> +
> +/* Should be called with fi->i_mmap_sem lock held exclusively */
> +static int __fuse_break_dax_layouts(struct inode *inode, bool *retry,
> + loff_t start, loff_t end)
> +{
> + struct page *page;
> +
> + page = dax_layout_busy_page_range(inode->i_mapping, start, end);
> + if (!page)
> + return 0;
> +
> + *retry = true;
> + return ___wait_var_event(&page->_refcount,
> + atomic_read(&page->_refcount) == 1, TASK_INTERRUPTIBLE,
> + 0, 0, fuse_wait_dax_page(inode));
> +}
> +
> +/* dmap_end == 0 leads to unmapping of whole file */
> +static int fuse_break_dax_layouts(struct inode *inode, u64 dmap_start,
> + u64 dmap_end)
> +{
> + bool retry;
> + int ret;
> +
> + do {
> + retry = false;
> + ret = __fuse_break_dax_layouts(inode, &retry, dmap_start,
> + dmap_end);
> + } while (ret == 0 && retry);
> +
> + return ret;
> +}
> +
> +/* Find first mapping in the tree and free it. */
> +static struct fuse_dax_mapping *
> +inode_reclaim_one_dmap_locked(struct fuse_conn *fc, struct inode *inode)
> +{
> + struct fuse_inode *fi = get_fuse_inode(inode);
> + struct fuse_dax_mapping *dmap;
> + int ret;
> +
> + for (dmap = fuse_dax_interval_tree_iter_first(&fi->dmap_tree, 0, -1);
> + dmap;
> + dmap = fuse_dax_interval_tree_iter_next(dmap, 0, -1)) {
> + /* still in use. */
> + if (refcount_read(&dmap->refcnt) > 1)
> + continue;
> +
> + ret = reclaim_one_dmap_locked(fc, inode, dmap);
> + if (ret < 0)
> + return ERR_PTR(ret);
> +
> + /* Clean up dmap. Do not add back to free list */
> + dmap_remove_busy_list(fc, dmap);
> + dmap->inode = NULL;
> + dmap->start = dmap->end = 0;
> +
> + pr_debug("fuse: %s: reclaimed memory range. inode=%px,"
> + " window_offset=0x%llx, length=0x%llx\n", __func__,
> + inode, dmap->window_offset, dmap->length);
> + return dmap;
> + }
> +
> + return NULL;
> +}
> +
> +/*
> + * Find first mapping in the tree and free it and return it. Do not add
> + * it back to free pool. If fault == true, this function should be called
> + * with fi->i_mmap_sem held.
> + */
> +static struct fuse_dax_mapping *inode_reclaim_one_dmap(struct fuse_conn *fc,
> + struct inode *inode,
> + bool fault)
> +{
> + struct fuse_inode *fi = get_fuse_inode(inode);
> + struct fuse_dax_mapping *dmap;
> + int ret;
> +
> + if (!fault)
> + down_write(&fi->i_mmap_sem);
> +
> + /*
> + * Make sure there are no references to inode pages using
> + * get_user_pages()
> + */
> + ret = fuse_break_dax_layouts(inode, 0, 0);
> + if (ret) {
> + printk("virtio_fs: fuse_break_dax_layouts() failed. err=%d\n",
> + ret);
> + dmap = ERR_PTR(ret);
> + goto out_mmap_sem;
> + }
> + down_write(&fi->i_dmap_sem);
> + dmap = inode_reclaim_one_dmap_locked(fc, inode);
> + up_write(&fi->i_dmap_sem);
> +out_mmap_sem:
> + if (!fault)
> + up_write(&fi->i_mmap_sem);
> + return dmap;
> +}
> +
> +/* If fault == true, it should be called with fi->i_mmap_sem locked */
> +static struct fuse_dax_mapping *alloc_dax_mapping_reclaim(struct fuse_conn *fc,
> + struct inode *inode, bool fault)
> +{
> + struct fuse_dax_mapping *dmap;
> + struct fuse_inode *fi = get_fuse_inode(inode);
> +
> + while (1) {
> + dmap = alloc_dax_mapping(fc);
> + if (dmap)
> + return dmap;
> +
> + if (fi->nr_dmaps) {
> + dmap = inode_reclaim_one_dmap(fc, inode, fault);
> + if (dmap)
> + return dmap;
> + /* If we could not reclaim a mapping because it
> + * had a reference, that should be a temporary
> + * situation. Try again.
> + */
> + msleep(1);
> + continue;
> + }
> + /*
> + * There are no mappings which can be reclaimed.
> + * Wait for one.
> + */
> + if (!(fc->nr_free_ranges > 0)) {
> + if (wait_event_killable_exclusive(fc->dax_range_waitq,
> + (fc->nr_free_ranges > 0)))
> + return ERR_PTR(-EINTR);
> + }
> + }
> +}
> +
> +static int lookup_and_reclaim_dmap_locked(struct fuse_conn *fc,
> + struct inode *inode, u64 dmap_start)
> +{
> + int ret;
> + struct fuse_inode *fi = get_fuse_inode(inode);
> + struct fuse_dax_mapping *dmap;
> +
> + /* Find fuse dax mapping at file offset inode. */
> + dmap = fuse_dax_interval_tree_iter_first(&fi->dmap_tree, dmap_start,
> + dmap_start);
> +
> + /* Range already got cleaned up by somebody else */
> + if (!dmap)
> + return 0;
> +
> + /* still in use. */
> + if (refcount_read(&dmap->refcnt) > 1)
> + return 0;
> +
> + ret = reclaim_one_dmap_locked(fc, inode, dmap);
> + if (ret < 0)
> + return ret;
> +
> + /* Cleanup dmap entry and add back to free list */
> + spin_lock(&fc->lock);
> + dmap_reinit_add_to_free_pool(fc, dmap);
> + spin_unlock(&fc->lock);
> + return ret;
> +}
> +
> +/*
> + * Free a range of memory.
> + * Locking.
> + * 1. Take fuse_inode->i_mmap_sem to block dax faults.
> + * 2. Take fuse_inode->i_dmap_sem to protect interval tree and also to make
> + * sure read/write can not reuse a dmap which we might be freeing.
> + */
> +static int lookup_and_reclaim_dmap(struct fuse_conn *fc, struct inode *inode,
> + u64 dmap_start, u64 dmap_end)
> +{
> + int ret;
> + struct fuse_inode *fi = get_fuse_inode(inode);
> +
> + down_write(&fi->i_mmap_sem);
> + ret = fuse_break_dax_layouts(inode, dmap_start, dmap_end);
> + if (ret) {
> + printk("virtio_fs: fuse_break_dax_layouts() failed. err=%d\n",
> + ret);
> + goto out_mmap_sem;
> + }
> +
> + down_write(&fi->i_dmap_sem);
> + ret = lookup_and_reclaim_dmap_locked(fc, inode, dmap_start);
> + up_write(&fi->i_dmap_sem);
> +out_mmap_sem:
> + up_write(&fi->i_mmap_sem);
> + return ret;
> +}
> +
> +static int try_to_free_dmap_chunks(struct fuse_conn *fc,
> + unsigned long nr_to_free)
> +{
> + struct fuse_dax_mapping *dmap, *pos, *temp;
> + int ret, nr_freed = 0;
> + u64 dmap_start = 0, window_offset = 0, dmap_end = 0;
> + struct inode *inode = NULL;
> +
> + /* Pick the first busy range and free it for now */
> + while (1) {
> + if (nr_freed >= nr_to_free)
> + break;
> +
> + dmap = NULL;
> + spin_lock(&fc->lock);
> +
> + if (!fc->nr_busy_ranges) {
> + spin_unlock(&fc->lock);
> + return 0;
> + }
> +
> + list_for_each_entry_safe(pos, temp, &fc->busy_ranges,
> + busy_list) {
> + /* skip this range if it's in use. */
> + if (refcount_read(&pos->refcnt) > 1)
> + continue;
> +
> + inode = igrab(pos->inode);
> + /*
> + * This inode is going away. That will free
> + * up all the ranges anyway, continue to
> + * next range.
> + */
> + if (!inode)
> + continue;
> + /*
> + * Take this element off the list and add it to the tail. If
> + * this element can't be freed, that helps with selecting a
> + * new element in the next iteration of the loop.
> + */
> + dmap = pos;
> + list_move_tail(&dmap->busy_list, &fc->busy_ranges);
> + dmap_start = dmap->start;
> + dmap_end = dmap->end;
> + window_offset = dmap->window_offset;
> + break;
> + }
> + spin_unlock(&fc->lock);
> + if (!dmap)
> + return 0;
> +
> + ret = lookup_and_reclaim_dmap(fc, inode, dmap_start, dmap_end);
> + iput(inode);
> + if (ret) {
> + printk("%s(window_offset=0x%llx) failed. err=%d\n",
> + __func__, window_offset, ret);
> + return ret;
> + }
> + nr_freed++;
> + }
> + return 0;
> +}
> +
> +void fuse_dax_free_mem_worker(struct work_struct *work)
> +{
> + int ret;
> + struct fuse_conn *fc = container_of(work, struct fuse_conn,
> + dax_free_work.work);
> + pr_debug("fuse: Worker to free memory called. nr_free_ranges=%lu"
> + " nr_busy_ranges=%lu\n", fc->nr_free_ranges,
> + fc->nr_busy_ranges);
> +
> + ret = try_to_free_dmap_chunks(fc, FUSE_DAX_RECLAIM_CHUNK);
> + if (ret) {
> + pr_debug("fuse: try_to_free_dmap_chunks() failed with err=%d\n",
> + ret);
> + }
> +
> + /* If the number of free ranges is still below the threshold, requeue */
> + kick_dmap_free_worker(fc, 1);
> +}
> diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
> index de213a7e1b0e..41c2fbff0d37 100644
> --- a/fs/fuse/fuse_i.h
> +++ b/fs/fuse/fuse_i.h
> @@ -54,6 +54,16 @@
> #define FUSE_DAX_MEM_RANGE_SZ (2*1024*1024)
> #define FUSE_DAX_MEM_RANGE_PAGES (FUSE_DAX_MEM_RANGE_SZ/PAGE_SIZE)
>
> +/* Number of ranges reclaimer will try to free in one invocation */
> +#define FUSE_DAX_RECLAIM_CHUNK (10)
> +
> +/*
> + * DAX memory reclaim threshold as a percentage of total ranges. When the
> + * number of free ranges drops below this threshold, reclaim can trigger.
> + * Default is 20%.
> + */
> +#define FUSE_DAX_RECLAIM_THRESHOLD (20)
> +
> /** List of active connections */
> extern struct list_head fuse_conn_list;
>
> @@ -75,6 +85,9 @@ struct fuse_forget_link {
>
> /** Translation information for file offsets to DAX window offsets */
> struct fuse_dax_mapping {
> + /* Pointer to inode where this memory range is mapped */
> + struct inode *inode;
> +
> /* Will connect in fc->free_ranges to keep track of free memory */
> struct list_head list;
>
> @@ -97,6 +110,9 @@ struct fuse_dax_mapping {
>
> /* Is this mapping read-only or read-write */
> bool writable;
> +
> + /* reference count when the mapping is used by dax iomap. */
> + refcount_t refcnt;
> };
>
> /** FUSE inode */
> @@ -822,11 +838,19 @@ struct fuse_conn {
> unsigned long nr_busy_ranges;
> struct list_head busy_ranges;
>
> + /* Worker to free up memory ranges */
> + struct delayed_work dax_free_work;
> +
> + /* Wait queue for a dax range to become free */
> + wait_queue_head_t dax_range_waitq;
> +
> /*
> * DAX Window Free Ranges
> */
> long nr_free_ranges;
> struct list_head free_ranges;
> +
> + unsigned long nr_ranges;
> };
>
> static inline struct fuse_conn *get_fuse_conn_super(struct super_block *sb)
> @@ -1164,6 +1188,7 @@ unsigned int fuse_len_args(unsigned int numargs, struct fuse_arg *args);
> */
> u64 fuse_get_unique(struct fuse_iqueue *fiq);
> void fuse_free_conn(struct fuse_conn *fc);
> +void fuse_dax_free_mem_worker(struct work_struct *work);
> void fuse_cleanup_inode_mappings(struct inode *inode);
>
> #endif /* _FS_FUSE_I_H */
> diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
> index d4770e7fb7eb..3560b62077a7 100644
> --- a/fs/fuse/inode.c
> +++ b/fs/fuse/inode.c
> @@ -663,11 +663,13 @@ static int fuse_dax_mem_range_init(struct fuse_conn *fc,
> range->window_offset = i * FUSE_DAX_MEM_RANGE_SZ;
> range->length = FUSE_DAX_MEM_RANGE_SZ;
> INIT_LIST_HEAD(&range->busy_list);
> + refcount_set(&range->refcnt, 1);
> list_add_tail(&range->list, &mem_ranges);
> }
>
> list_replace_init(&mem_ranges, &fc->free_ranges);
> fc->nr_free_ranges = nr_ranges;
> + fc->nr_ranges = nr_ranges;
> return 0;
> out_err:
> /* Free All allocated elements */
> @@ -692,6 +694,7 @@ void fuse_conn_init(struct fuse_conn *fc, struct user_namespace *user_ns,
> refcount_set(&fc->count, 1);
> atomic_set(&fc->dev_count, 1);
> init_waitqueue_head(&fc->blocked_waitq);
> + init_waitqueue_head(&fc->dax_range_waitq);
> fuse_iqueue_init(&fc->iq, fiq_ops, fiq_priv);
> INIT_LIST_HEAD(&fc->bg_queue);
> INIT_LIST_HEAD(&fc->entry);
> @@ -711,6 +714,7 @@ void fuse_conn_init(struct fuse_conn *fc, struct user_namespace *user_ns,
> fc->max_pages = FUSE_DEFAULT_MAX_PAGES_PER_REQ;
> INIT_LIST_HEAD(&fc->free_ranges);
> INIT_LIST_HEAD(&fc->busy_ranges);
> + INIT_DELAYED_WORK(&fc->dax_free_work, fuse_dax_free_mem_worker);
> }
> EXPORT_SYMBOL_GPL(fuse_conn_init);
>
> @@ -719,6 +723,7 @@ void fuse_conn_put(struct fuse_conn *fc)
> if (refcount_dec_and_test(&fc->count)) {
> struct fuse_iqueue *fiq = &fc->iq;
>
> + flush_delayed_work(&fc->dax_free_work);
Today while debugging another case, I realized that flushing the work here
at the very last fuse_conn_put() is a bit too late; here's my analysis:
umount                                  kthread

deactivate_locked_super
  ->virtio_kill_sb                      try_to_free_dmap_chunks
    ->generic_shutdown_super              ->igrab()
                                          ...
      ->evict_inodes() -> check all inodes' count
    ->fuse_conn_put                       ->iput
  ->virtio_fs_free_devs
    ->fuse_dev_free
      ->fuse_conn_put // vq1
    ->fuse_dev_free
      ->fuse_conn_put // vq2
        ->flush_delayed_work
The above can end up with a warning message reported by evict_inodes()
about stale inodes. So I think it's necessary to put either
cancel_delayed_work_sync() or flush_delayed_work() before going into
generic_shutdown_super().
thanks,
-liubo
> if (fc->dax_dev)
> fuse_free_dax_mem_ranges(&fc->free_ranges);
> if (fiq->ops->release)
> --
> 2.20.1
On Wed, Mar 4, 2020 at 7:01 PM Vivek Goyal <[email protected]> wrote:
>
> Hi,
>
> This patch series adds DAX support to virtiofs filesystem. This allows
> bypassing guest page cache and allows mapping host page cache directly
> in guest address space.
>
> When a page of file is needed, guest sends a request to map that page
> (in host page cache) in qemu address space. Inside guest this is
> a physical memory range controlled by virtiofs device. And guest
> directly maps this physical address range using DAX and hence gets
> access to file data on host.
>
> This can speed up things considerably in many situations. Also this
> can result in substantial memory savings as file data does not have
> to be copied in guest and it is directly accessed from host page
> cache.
>
> Most of the changes are limited to fuse/virtiofs. There are couple
> of changes needed in generic dax infrastructure and couple of changes
> in virtio to be able to access shared memory region.
>
> These patches apply on top of 5.6-rc4 and are also available here.
>
> https://github.com/rhvgoyal/linux/commits/vivek-04-march-2020
>
> Any review or feedback is welcome.
>
[...]
> drivers/dax/super.c | 3 +-
> drivers/virtio/virtio_mmio.c | 32 +
> drivers/virtio/virtio_pci_modern.c | 107 +++
> fs/dax.c | 66 +-
> fs/fuse/dir.c | 2 +
> fs/fuse/file.c | 1162 +++++++++++++++++++++++++++-
That's a big addition to an already big file.c.
Maybe split the dax-specific code out into a dax.c?
That can be a post-series cleanup too.
Thanks,
Amir.
On Tue, Mar 10, 2020 at 10:34 PM Vivek Goyal <[email protected]> wrote:
>
> On Tue, Mar 10, 2020 at 08:49:49PM +0100, Miklos Szeredi wrote:
> > On Wed, Mar 4, 2020 at 5:59 PM Vivek Goyal <[email protected]> wrote:
> > >
> > > Introduce two new fuse commands to setup/remove memory mappings. This
> > > will be used to setup/tear down file mapping in dax window.
> > >
> > > Signed-off-by: Vivek Goyal <[email protected]>
> > > Signed-off-by: Peng Tao <[email protected]>
> > > ---
> > > include/uapi/linux/fuse.h | 37 +++++++++++++++++++++++++++++++++++++
> > > 1 file changed, 37 insertions(+)
> > >
> > > diff --git a/include/uapi/linux/fuse.h b/include/uapi/linux/fuse.h
> > > index 5b85819e045f..62633555d547 100644
> > > --- a/include/uapi/linux/fuse.h
> > > +++ b/include/uapi/linux/fuse.h
> > > @@ -894,4 +894,41 @@ struct fuse_copy_file_range_in {
> > > uint64_t flags;
> > > };
> > >
> > > +#define FUSE_SETUPMAPPING_ENTRIES 8
> > > +#define FUSE_SETUPMAPPING_FLAG_WRITE (1ull << 0)
> > > +struct fuse_setupmapping_in {
> > > + /* An already open handle */
> > > + uint64_t fh;
> > > + /* Offset into the file to start the mapping */
> > > + uint64_t foffset;
> > > + /* Length of mapping required */
> > > + uint64_t len;
> > > + /* Flags, FUSE_SETUPMAPPING_FLAG_* */
> > > + uint64_t flags;
> > > + /* Offset in Memory Window */
> > > + uint64_t moffset;
> > > +};
> > > +
> > > +struct fuse_setupmapping_out {
> > > + /* Offsets into the cache of mappings */
> > > + uint64_t coffset[FUSE_SETUPMAPPING_ENTRIES];
> > > + /* Lengths of each mapping */
> > > + uint64_t len[FUSE_SETUPMAPPING_ENTRIES];
> > > +};
> >
> > fuse_setupmapping_out together with FUSE_SETUPMAPPING_ENTRIES seem to be unused.
>
> This looks like leftover from the old code. I will get rid of it. Thanks.
>
Hmm. I wonder if we should keep some out args for future extensions.
Maybe return the mapped size even though it is all or nothing at this
point?
I have interest in a similar FUSE mapping functionality that was prototyped
by Miklos and published here:
https://lore.kernel.org/linux-fsdevel/CAJfpegtjEoE7H8tayLaQHG9fRSBiVuaspnmPr2oQiOZXVB1+7g@mail.gmail.com/
In this prototype, a FUSE_MAP command is used by the server to map a
range of file to the kernel for io. The command in args are quite similar to
those in fuse_setupmapping_in, but since the server is on the same host,
the mapping response is {mapfd, offset, size}.
I wonder, if we decide to go forward with this prototype (and I may well decide
to drive this forward), should the new command overload FUSE_SETUPMAPPING
by using new flags, or should it be a new command? In either case, I think it
would be best to try to make a decision now in order to avoid ambiguity with
protocol command/flag names later on.
If we decide that those are completely different beasts and it is agreed that
the future command will be named, for example, FUSE_SETUPIOMAP
with different arguments and that this naming will not create confusion and
ambiguity with FUSE_SETUPMAPPING, then there is no actionable item at
this time.
But it is possible that there is something to gain from using the same
command(?) and the same bookkeeping mechanism for both types of
mappings. Even a server on the same host could decide that it wants
to map some file regions via mmap and some via iomap.
In that case, perhaps we should make the FUSE_SETUPMAPPING
response args expressive enough to describe an iomap mapping in
the future, and perhaps the dax code should explicitly request a
FUSE_SETUPMAPPING_FLAG_DAX mapping type?
Thoughts?
Amir.
On Wed, Mar 11, 2020 at 01:16:42PM +0800, Liu Bo wrote:
[..]
> > @@ -719,6 +723,7 @@ void fuse_conn_put(struct fuse_conn *fc)
> > if (refcount_dec_and_test(&fc->count)) {
> > struct fuse_iqueue *fiq = &fc->iq;
> >
> > + flush_delayed_work(&fc->dax_free_work);
>
> Today while debugging another case, I realized that flushing work here
> at the very last fuse_conn_put() is a bit too late, here's my analysis,
>
> umount kthread
>
> deactivate_locked_super
> ->virtio_kill_sb try_to_free_dmap_chunks
> ->generic_shutdown_super ->igrab()
> ...
> ->evict_inodes() -> check all inodes' count
> ->fuse_conn_put ->iput
> ->virtio_fs_free_devs
> ->fuse_dev_free
> ->fuse_conn_put // vq1
> ->fuse_dev_free
> ->fuse_conn_put // vq2
> ->flush_delayed_work
>
> The above can end up with a warning message reported by evict_inodes()
> about stale inodes.
Hi Liu Bo,
Which warning is that? Can you point me to it in the code?
> So I think it's necessary to put either
> cancel_delayed_work_sync() or flush_delayed_work() before going to
> generic_shutdown_super().
In general I agree that shutting down the memory range freeing worker
early in the unmount/shutdown sequence makes sense. It does not seem
helpful to let it run while the filesystem is going away. How about the
following patch?
---
fs/fuse/inode.c | 1 -
fs/fuse/virtio_fs.c | 5 +++++
2 files changed, 5 insertions(+), 1 deletion(-)
Index: redhat-linux/fs/fuse/virtio_fs.c
===================================================================
--- redhat-linux.orig/fs/fuse/virtio_fs.c 2020-03-10 14:11:10.970284651 -0400
+++ redhat-linux/fs/fuse/virtio_fs.c 2020-03-11 08:27:08.103330039 -0400
@@ -1295,6 +1295,11 @@ static void virtio_kill_sb(struct super_
vfs = fc->iq.priv;
fsvq = &vfs->vqs[VQ_HIPRIO];
+ /* Stop dax worker. Soon evict_inodes() will be called which will
+ * free all memory ranges belonging to all inodes.
+ */
+ flush_delayed_work(&fc->dax_free_work);
+
/* Stop forget queue. Soon destroy will be sent */
spin_lock(&fsvq->lock);
fsvq->connected = false;
Index: redhat-linux/fs/fuse/inode.c
===================================================================
--- redhat-linux.orig/fs/fuse/inode.c 2020-03-10 09:13:35.132565666 -0400
+++ redhat-linux/fs/fuse/inode.c 2020-03-11 08:22:02.685330039 -0400
@@ -723,7 +723,6 @@ void fuse_conn_put(struct fuse_conn *fc)
if (refcount_dec_and_test(&fc->count)) {
struct fuse_iqueue *fiq = &fc->iq;
- flush_delayed_work(&fc->dax_free_work);
if (fc->dax_dev)
fuse_free_dax_mem_ranges(&fc->free_ranges);
if (fiq->ops->release)
On Wed, Mar 11, 2020 at 07:22:51AM +0200, Amir Goldstein wrote:
> On Wed, Mar 4, 2020 at 7:01 PM Vivek Goyal <[email protected]> wrote:
> >
> > Hi,
> >
> > This patch series adds DAX support to virtiofs filesystem. This allows
> > bypassing guest page cache and allows mapping host page cache directly
> > in guest address space.
> >
> > When a page of file is needed, guest sends a request to map that page
> > (in host page cache) in qemu address space. Inside guest this is
> > a physical memory range controlled by virtiofs device. And guest
> > directly maps this physical address range using DAX and hence gets
> > access to file data on host.
> >
> > This can speed up things considerably in many situations. Also this
> > can result in substantial memory savings as file data does not have
> > to be copied in guest and it is directly accessed from host page
> > cache.
> >
> > Most of the changes are limited to fuse/virtiofs. There are couple
> > of changes needed in generic dax infrastructure and couple of changes
> > in virtio to be able to access shared memory region.
> >
> > These patches apply on top of 5.6-rc4 and are also available here.
> >
> > https://github.com/rhvgoyal/linux/commits/vivek-04-march-2020
> >
> > Any review or feedback is welcome.
> >
> [...]
> > drivers/dax/super.c | 3 +-
> > drivers/virtio/virtio_mmio.c | 32 +
> > drivers/virtio/virtio_pci_modern.c | 107 +++
> > fs/dax.c | 66 +-
> > fs/fuse/dir.c | 2 +
> > fs/fuse/file.c | 1162 +++++++++++++++++++++++++++-
>
> That's a big addition to an already big file.c.
> Maybe split the dax-specific code out into a dax.c?
> That can be a post-series cleanup too.
A lot of this code comes from the logic to reclaim dax memory ranges
assigned to an inode. I will look into moving some of it to a
separate file.
Vivek
On Wed, Mar 11, 2020 at 8:03 AM Amir Goldstein <[email protected]> wrote:
>
> On Tue, Mar 10, 2020 at 10:34 PM Vivek Goyal <[email protected]> wrote:
> >
> > On Tue, Mar 10, 2020 at 08:49:49PM +0100, Miklos Szeredi wrote:
> > > On Wed, Mar 4, 2020 at 5:59 PM Vivek Goyal <[email protected]> wrote:
> > > >
> > > > Introduce two new fuse commands to setup/remove memory mappings. This
> > > > will be used to setup/tear down file mapping in dax window.
> > > >
> > > > Signed-off-by: Vivek Goyal <[email protected]>
> > > > Signed-off-by: Peng Tao <[email protected]>
> > > > ---
> > > > include/uapi/linux/fuse.h | 37 +++++++++++++++++++++++++++++++++++++
> > > > 1 file changed, 37 insertions(+)
> > > >
> > > > diff --git a/include/uapi/linux/fuse.h b/include/uapi/linux/fuse.h
> > > > index 5b85819e045f..62633555d547 100644
> > > > --- a/include/uapi/linux/fuse.h
> > > > +++ b/include/uapi/linux/fuse.h
> > > > @@ -894,4 +894,41 @@ struct fuse_copy_file_range_in {
> > > > uint64_t flags;
> > > > };
> > > >
> > > > +#define FUSE_SETUPMAPPING_ENTRIES 8
> > > > +#define FUSE_SETUPMAPPING_FLAG_WRITE (1ull << 0)
> > > > +struct fuse_setupmapping_in {
> > > > + /* An already open handle */
> > > > + uint64_t fh;
> > > > + /* Offset into the file to start the mapping */
> > > > + uint64_t foffset;
> > > > + /* Length of mapping required */
> > > > + uint64_t len;
> > > > + /* Flags, FUSE_SETUPMAPPING_FLAG_* */
> > > > + uint64_t flags;
> > > > + /* Offset in Memory Window */
> > > > + uint64_t moffset;
> > > > +};
> > > > +
> > > > +struct fuse_setupmapping_out {
> > > > + /* Offsets into the cache of mappings */
> > > > + uint64_t coffset[FUSE_SETUPMAPPING_ENTRIES];
> > > > + /* Lengths of each mapping */
> > > > + uint64_t len[FUSE_SETUPMAPPING_ENTRIES];
> > > > +};
> > >
> > > fuse_setupmapping_out together with FUSE_SETUPMAPPING_ENTRIES seem to be unused.
> >
> > This looks like leftover from the old code. I will get rid of it. Thanks.
> >
>
> Hmm. I wonder if we should keep some out args for future extensions.
> Maybe return the mapped size even though it is all or nothing at this
> point?
>
> I have interest in a similar FUSE mapping functionality that was prototyped
> by Miklos and published here:
> https://lore.kernel.org/linux-fsdevel/CAJfpegtjEoE7H8tayLaQHG9fRSBiVuaspnmPr2oQiOZXVB1+7g@mail.gmail.com/
>
> In this prototype, a FUSE_MAP command is used by the server to map a
> range of file to the kernel for io. The command in args are quite similar to
> those in fuse_setupmapping_in, but since the server is on the same host,
> the mapping response is {mapfd, offset, size}.
Right. So the difference is in which entity allocates the mapping.
IOW whether the {fd, offset, size} is input or output in the protocol.
I don't remember the reasons for going with the mapping being
allocated by the client, not the other way round. Vivek?
If the allocation were to be by the server, we could share the request
type and possibly some code between the two, although the I/O
mechanism would still be different.
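To make the input-vs-output distinction concrete, here is a hypothetical side-by-side sketch of the two message layouts. The FUSE_MAP reply fields are guesses based on the {mapfd, offset, size} description above, not a real UAPI:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical layouts to contrast the two designs. The FUSE_MAP reply
 * fields are stand-ins based on the discussion, not a real UAPI. */

/* Client-allocated (virtiofs): the mapping slot is an *input*. */
struct setupmapping_in_sketch {
	uint64_t fh;
	uint64_t foffset;
	uint64_t len;
	uint64_t flags;
	uint64_t moffset;	/* client tells server where to map */
};

/* Server-allocated (FUSE_MAP prototype): the mapping is the *output*. */
struct map_out_sketch {
	uint64_t mapfd;		/* server-chosen fd */
	uint64_t offset;	/* server-chosen placement */
	uint64_t size;
};
```

Either way the in args stay nearly identical; only the placement information moves from the request to the reply.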
Thanks,
Miklos
On Wed, Mar 11, 2020 at 03:19:18PM +0100, Miklos Szeredi wrote:
> On Wed, Mar 11, 2020 at 8:03 AM Amir Goldstein <[email protected]> wrote:
> >
> > On Tue, Mar 10, 2020 at 10:34 PM Vivek Goyal <[email protected]> wrote:
> > >
> > > On Tue, Mar 10, 2020 at 08:49:49PM +0100, Miklos Szeredi wrote:
> > > > On Wed, Mar 4, 2020 at 5:59 PM Vivek Goyal <[email protected]> wrote:
> > > > >
> > > > > Introduce two new fuse commands to setup/remove memory mappings. This
> > > > > will be used to setup/tear down file mapping in dax window.
> > > > >
> > > > > Signed-off-by: Vivek Goyal <[email protected]>
> > > > > Signed-off-by: Peng Tao <[email protected]>
> > > > > ---
> > > > > include/uapi/linux/fuse.h | 37 +++++++++++++++++++++++++++++++++++++
> > > > > 1 file changed, 37 insertions(+)
> > > > >
> > > > > diff --git a/include/uapi/linux/fuse.h b/include/uapi/linux/fuse.h
> > > > > index 5b85819e045f..62633555d547 100644
> > > > > --- a/include/uapi/linux/fuse.h
> > > > > +++ b/include/uapi/linux/fuse.h
> > > > > @@ -894,4 +894,41 @@ struct fuse_copy_file_range_in {
> > > > > uint64_t flags;
> > > > > };
> > > > >
> > > > > +#define FUSE_SETUPMAPPING_ENTRIES 8
> > > > > +#define FUSE_SETUPMAPPING_FLAG_WRITE (1ull << 0)
> > > > > +struct fuse_setupmapping_in {
> > > > > + /* An already open handle */
> > > > > + uint64_t fh;
> > > > > + /* Offset into the file to start the mapping */
> > > > > + uint64_t foffset;
> > > > > + /* Length of mapping required */
> > > > > + uint64_t len;
> > > > > + /* Flags, FUSE_SETUPMAPPING_FLAG_* */
> > > > > + uint64_t flags;
> > > > > + /* Offset in Memory Window */
> > > > > + uint64_t moffset;
> > > > > +};
> > > > > +
> > > > > +struct fuse_setupmapping_out {
> > > > > + /* Offsets into the cache of mappings */
> > > > > + uint64_t coffset[FUSE_SETUPMAPPING_ENTRIES];
> > > > > + /* Lengths of each mapping */
> > > > > + uint64_t len[FUSE_SETUPMAPPING_ENTRIES];
> > > > > +};
> > > >
> > > > fuse_setupmapping_out together with FUSE_SETUPMAPPING_ENTRIES seem to be unused.
> > >
> > > This looks like leftover from the old code. I will get rid of it. Thanks.
> > >
> >
> > Hmm. I wonder if we should keep some out args for future extensions.
> > Maybe return the mapped size even though it is all or nothing at this
> > point?
> >
> > I have interest in a similar FUSE mapping functionality that was prototyped
> > by Miklos and published here:
> > https://lore.kernel.org/linux-fsdevel/CAJfpegtjEoE7H8tayLaQHG9fRSBiVuaspnmPr2oQiOZXVB1+7g@mail.gmail.com/
> >
> > In this prototype, a FUSE_MAP command is used by the server to map a
> > range of file to the kernel for io. The command in args are quite similar to
> > those in fuse_setupmapping_in, but since the server is on the same host,
> > the mapping response is {mapfd, offset, size}.
>
> Right. So the difference is in which entity allocates the mapping.
> IOW whether the {fd, offset, size} is input or output in the protocol.
>
> I don't remember the reasons for going with the mapping being
> allocated by the client, not the other way round. Vivek?
I think one of the main reasons is memory reclaim. Once all ranges in
the cache window are allocated, we need to free a memory range so that
it can be reused. And the client has all the logic to free up that range
so that it can be remapped and reused for a different file/offset. The
server will not know any of this. So I think that for virtiofs, the
server might not be able to decide where to map a section of a file and
it has to be told explicitly by the client.
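As a rough userspace sketch (not the kernel code; NR_RANGES and the 2MB range size are assumptions for illustration), the client-side allocation looks roughly like this:

```c
#include <assert.h>

/* Userspace sketch of client-side range allocation (not the kernel code).
 * NR_RANGES and the 2MB range size are assumptions for illustration. */
#define NR_RANGES	4
#define RANGE_SZ	(2UL * 1024 * 1024)

struct range_pool {
	unsigned long free_mask;	/* bit i set => range i is free */
};

static void pool_init(struct range_pool *p)
{
	p->free_mask = (1UL << NR_RANGES) - 1;
}

/* Returns the window offset of an allocated range, or -1 if exhausted,
 * in which case the client must reclaim a range and retry. */
static long pool_alloc(struct range_pool *p)
{
	for (int i = 0; i < NR_RANGES; i++) {
		if (p->free_mask & (1UL << i)) {
			p->free_mask &= ~(1UL << i);
			return (long)i * RANGE_SZ;
		}
	}
	return -1;
}

static void pool_free(struct range_pool *p, long moffset)
{
	p->free_mask |= 1UL << (moffset / RANGE_SZ);
}
```

The point being that only this pool knows which moffset is free, which is why FUSE_SETUPMAPPING carries moffset as an input.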
>
> If the allocation were to be by the server, we could share the request
> type and possibly some code between the two, although the I/O
> mechanism would still be different.
>
So the input parameters of both FUSE_SETUPMAPPING and FUSE_MAP seem
similar (except the moffset field). Given that the output of the FUSE_MAP
request is very different, I would think it will be easier to have it as
a separate command.
Or could it be some sort of optional output args which differentiate
between the two types of requests?
/me personally finds it simpler to have a separate command instead of
overloading FUSE_SETUPMAPPING. But it's your call. :-)
Vivek
On Wed, Mar 11, 2020 at 3:41 PM Vivek Goyal <[email protected]> wrote:
>
> On Wed, Mar 11, 2020 at 03:19:18PM +0100, Miklos Szeredi wrote:
> > On Wed, Mar 11, 2020 at 8:03 AM Amir Goldstein <[email protected]> wrote:
> > >
> > > On Tue, Mar 10, 2020 at 10:34 PM Vivek Goyal <[email protected]> wrote:
> > > >
> > > > On Tue, Mar 10, 2020 at 08:49:49PM +0100, Miklos Szeredi wrote:
> > > > > On Wed, Mar 4, 2020 at 5:59 PM Vivek Goyal <[email protected]> wrote:
> > > > > >
> > > > > > Introduce two new fuse commands to setup/remove memory mappings. This
> > > > > > will be used to setup/tear down file mapping in dax window.
> > > > > >
> > > > > > Signed-off-by: Vivek Goyal <[email protected]>
> > > > > > Signed-off-by: Peng Tao <[email protected]>
> > > > > > ---
> > > > > > include/uapi/linux/fuse.h | 37 +++++++++++++++++++++++++++++++++++++
> > > > > > 1 file changed, 37 insertions(+)
> > > > > >
> > > > > > diff --git a/include/uapi/linux/fuse.h b/include/uapi/linux/fuse.h
> > > > > > index 5b85819e045f..62633555d547 100644
> > > > > > --- a/include/uapi/linux/fuse.h
> > > > > > +++ b/include/uapi/linux/fuse.h
> > > > > > @@ -894,4 +894,41 @@ struct fuse_copy_file_range_in {
> > > > > > uint64_t flags;
> > > > > > };
> > > > > >
> > > > > > +#define FUSE_SETUPMAPPING_ENTRIES 8
> > > > > > +#define FUSE_SETUPMAPPING_FLAG_WRITE (1ull << 0)
> > > > > > +struct fuse_setupmapping_in {
> > > > > > + /* An already open handle */
> > > > > > + uint64_t fh;
> > > > > > + /* Offset into the file to start the mapping */
> > > > > > + uint64_t foffset;
> > > > > > + /* Length of mapping required */
> > > > > > + uint64_t len;
> > > > > > + /* Flags, FUSE_SETUPMAPPING_FLAG_* */
> > > > > > + uint64_t flags;
> > > > > > + /* Offset in Memory Window */
> > > > > > + uint64_t moffset;
> > > > > > +};
> > > > > > +
> > > > > > +struct fuse_setupmapping_out {
> > > > > > + /* Offsets into the cache of mappings */
> > > > > > + uint64_t coffset[FUSE_SETUPMAPPING_ENTRIES];
> > > > > > + /* Lengths of each mapping */
> > > > > > + uint64_t len[FUSE_SETUPMAPPING_ENTRIES];
> > > > > > +};
> > > > >
> > > > > fuse_setupmapping_out together with FUSE_SETUPMAPPING_ENTRIES seem to be unused.
> > > >
> > > > This looks like leftover from the old code. I will get rid of it. Thanks.
> > > >
> > >
> > > Hmm. I wonder if we should keep some out args for future extensions.
> > > Maybe return the mapped size even though it is all or nothing at this
> > > point?
> > >
> > > I have interest in a similar FUSE mapping functionality that was prototyped
> > > by Miklos and published here:
> > > https://lore.kernel.org/linux-fsdevel/CAJfpegtjEoE7H8tayLaQHG9fRSBiVuaspnmPr2oQiOZXVB1+7g@mail.gmail.com/
> > >
> > > In this prototype, a FUSE_MAP command is used by the server to map a
> > > range of file to the kernel for io. The command in args are quite similar to
> > > those in fuse_setupmapping_in, but since the server is on the same host,
> > > the mapping response is {mapfd, offset, size}.
> >
> > Right. So the difference is in which entity allocates the mapping.
> > IOW whether the {fd, offset, size} is input or output in the protocol.
> >
> > I don't remember the reasons for going with the mapping being
> > allocated by the client, not the other way round. Vivek?
>
> I think one of the main reasons is memory reclaim. Once all ranges in
> the cache window are allocated, we need to free a memory range so that
> it can be reused. And the client has all the logic to free up that range
> so that it can be remapped and reused for a different file/offset. The
> server will not know any of this. So I think that for virtiofs, the
> server might not be able to decide where to map a section of a file and
> it has to be told explicitly by the client.
Okay.
> >
> > If the allocation were to be by the server, we could share the request
> > type and possibly some code between the two, although the I/O
> > mechanism would still be different.
> >
>
> So the input parameters of both FUSE_SETUPMAPPING and FUSE_MAP seem
> similar (except the moffset field). Given that the output of the FUSE_MAP
> request is very different, I would think it will be easier to have it as
> a separate command.
>
> Or could it be some sort of optional output args which differentiate
> between the two types of requests?
>
> /me personally finds it simpler to have a separate command instead of
> overloading FUSE_SETUPMAPPING. But it's your call. :-)
I too prefer a separate request type.
Thanks,
Miklos
On Wed, Mar 11, 2020 at 08:59:23AM -0400, Vivek Goyal wrote:
> On Wed, Mar 11, 2020 at 01:16:42PM +0800, Liu Bo wrote:
>
> [..]
> > > @@ -719,6 +723,7 @@ void fuse_conn_put(struct fuse_conn *fc)
> > > if (refcount_dec_and_test(&fc->count)) {
> > > struct fuse_iqueue *fiq = &fc->iq;
> > >
> > > + flush_delayed_work(&fc->dax_free_work);
> >
> > Today while debugging another case, I realized that flushing work here
> > at the very last fuse_conn_put() is a bit too late, here's my analysis,
> >
> > umount kthread
> >
> > deactivate_locked_super
> > ->virtio_kill_sb try_to_free_dmap_chunks
> > ->generic_shutdown_super ->igrab()
> > ...
> > ->evict_inodes() -> check all inodes' count
> > ->fuse_conn_put ->iput
> > ->virtio_fs_free_devs
> > ->fuse_dev_free
> > ->fuse_conn_put // vq1
> > ->fuse_dev_free
> > ->fuse_conn_put // vq2
> > ->flush_delayed_work
> >
> > The above can end up with a warning message reported by evict_inodes()
> > about stable inodes.
>
> Hi Liu Bo,
>
> Which warning is that? Can you point me to it in code.
>
Hmm, it was actually in generic_shutdown_super(),
---
printk("VFS: Busy inodes after unmount of %s. "
"Self-destruct in 5 seconds. Have a nice day...\n",
---
> > So I think it's necessary to put either
> > cancel_delayed_work_sync() or flush_delayed_work() before going to
> > generic_shutdown_super().
>
> In general I agree that shutting down the memory range freeing worker
> early in the unmount/shutdown sequence makes sense. It does not seem
> helpful to let it run while the filesystem is going away. How about the
> following patch?
>
> ---
> fs/fuse/inode.c | 1 -
> fs/fuse/virtio_fs.c | 5 +++++
> 2 files changed, 5 insertions(+), 1 deletion(-)
>
> Index: redhat-linux/fs/fuse/virtio_fs.c
> ===================================================================
> --- redhat-linux.orig/fs/fuse/virtio_fs.c 2020-03-10 14:11:10.970284651 -0400
> +++ redhat-linux/fs/fuse/virtio_fs.c 2020-03-11 08:27:08.103330039 -0400
> @@ -1295,6 +1295,11 @@ static void virtio_kill_sb(struct super_
> vfs = fc->iq.priv;
> fsvq = &vfs->vqs[VQ_HIPRIO];
>
> + /* Stop dax worker. Soon evict_inodes() will be called which will
> + * free all memory ranges belonging to all inodes.
> + */
> + flush_delayed_work(&fc->dax_free_work);
> +
> /* Stop forget queue. Soon destroy will be sent */
> spin_lock(&fsvq->lock);
> fsvq->connected = false;
> Index: redhat-linux/fs/fuse/inode.c
> ===================================================================
> --- redhat-linux.orig/fs/fuse/inode.c 2020-03-10 09:13:35.132565666 -0400
> +++ redhat-linux/fs/fuse/inode.c 2020-03-11 08:22:02.685330039 -0400
> @@ -723,7 +723,6 @@ void fuse_conn_put(struct fuse_conn *fc)
> if (refcount_dec_and_test(&fc->count)) {
> struct fuse_iqueue *fiq = &fc->iq;
>
> - flush_delayed_work(&fc->dax_free_work);
> if (fc->dax_dev)
> fuse_free_dax_mem_ranges(&fc->free_ranges);
> if (fiq->ops->release)
Looks good, it should be safe now, but I feel like
cancel_delayed_work_sync() would be a good alternative for "stop dax
worker".
Reviewed-by: Liu Bo <[email protected]>
Fine with either folding it in directly or sending a new patch; thanks for fixing it.
thanks,
-liubo
On Tue, Mar 10, 2020 at 02:19:36PM -0400, Vivek Goyal wrote:
> On Tue, Mar 10, 2020 at 11:04:37AM +0000, Stefan Hajnoczi wrote:
> > On Wed, Mar 04, 2020 at 11:58:29AM -0500, Vivek Goyal wrote:
> > > diff --git a/drivers/virtio/virtio_pci_modern.c b/drivers/virtio/virtio_pci_modern.c
> > > index 7abcc50838b8..52f179411015 100644
> > > --- a/drivers/virtio/virtio_pci_modern.c
> > > +++ b/drivers/virtio/virtio_pci_modern.c
> > > @@ -443,6 +443,111 @@ static void del_vq(struct virtio_pci_vq_info *info)
> > > vring_del_virtqueue(vq);
> > > }
> > >
> > > +static int virtio_pci_find_shm_cap(struct pci_dev *dev,
> > > + u8 required_id,
> > > + u8 *bar, u64 *offset, u64 *len)
> > > +{
> > > + int pos;
> > > +
> > > + for (pos = pci_find_capability(dev, PCI_CAP_ID_VNDR);
> >
> > Please fix the mixed tabs vs space indentation in this patch.
>
> Will do. There are plenty of these in this patch.
>
> >
> > > +static bool vp_get_shm_region(struct virtio_device *vdev,
> > > + struct virtio_shm_region *region, u8 id)
> > > +{
> > > + struct virtio_pci_device *vp_dev = to_vp_device(vdev);
> > > + struct pci_dev *pci_dev = vp_dev->pci_dev;
> > > + u8 bar;
> > > + u64 offset, len;
> > > + phys_addr_t phys_addr;
> > > + size_t bar_len;
> > > + int ret;
> > > +
> > > + if (!virtio_pci_find_shm_cap(pci_dev, id, &bar, &offset, &len)) {
> > > + return false;
> > > + }
> > > +
> > > + ret = pci_request_region(pci_dev, bar, "virtio-pci-shm");
> > > + if (ret < 0) {
> > > + dev_err(&pci_dev->dev, "%s: failed to request BAR\n",
> > > + __func__);
> > > + return false;
> > > + }
> > > +
> > > + phys_addr = pci_resource_start(pci_dev, bar);
> > > + bar_len = pci_resource_len(pci_dev, bar);
> > > +
> > > + if (offset + len > bar_len) {
> > > + dev_err(&pci_dev->dev,
> > > + "%s: bar shorter than cap offset+len\n",
> > > + __func__);
> > > + return false;
> > > + }
> > > +
> > > + region->len = len;
> > > + region->addr = (u64) phys_addr + offset;
> > > +
> > > + return true;
> > > +}
> >
> > Missing pci_release_region()?
>
> Good catch. We don't have a mechanism to call pci_release_region() and
> the virtio-mmio device's ->get_shm_region() implementation does not even
> seem to reserve the resources.
>
> So how about we leave this resource reservation to the caller?
> ->get_shm_region() just returns the addr/len pair of the requested resource.
>
> Something like this patch.
>
> ---
> drivers/virtio/virtio_pci_modern.c | 8 --------
> fs/fuse/virtio_fs.c | 13 ++++++++++---
> 2 files changed, 10 insertions(+), 11 deletions(-)
>
> Index: redhat-linux/fs/fuse/virtio_fs.c
> ===================================================================
> --- redhat-linux.orig/fs/fuse/virtio_fs.c 2020-03-10 09:13:34.624565666 -0400
> +++ redhat-linux/fs/fuse/virtio_fs.c 2020-03-10 14:11:10.970284651 -0400
> @@ -763,11 +763,18 @@ static int virtio_fs_setup_dax(struct vi
> if (!have_cache) {
> dev_notice(&vdev->dev, "%s: No cache capability\n", __func__);
> return 0;
> - } else {
> - dev_notice(&vdev->dev, "Cache len: 0x%llx @ 0x%llx\n",
> - cache_reg.len, cache_reg.addr);
> }
>
> + if (!devm_request_mem_region(&vdev->dev, cache_reg.addr, cache_reg.len,
> + dev_name(&vdev->dev))) {
> + dev_warn(&vdev->dev, "could not reserve region addr=0x%llx"
> + " len=0x%llx\n", cache_reg.addr, cache_reg.len);
> + return -EBUSY;
> + }
> +
> + dev_notice(&vdev->dev, "Cache len: 0x%llx @ 0x%llx\n", cache_reg.len,
> + cache_reg.addr);
> +
> pgmap = devm_kzalloc(&vdev->dev, sizeof(*pgmap), GFP_KERNEL);
> if (!pgmap)
> return -ENOMEM;
> Index: redhat-linux/drivers/virtio/virtio_pci_modern.c
> ===================================================================
> --- redhat-linux.orig/drivers/virtio/virtio_pci_modern.c 2020-03-10 08:51:36.886565666 -0400
> +++ redhat-linux/drivers/virtio/virtio_pci_modern.c 2020-03-10 13:43:15.168753543 -0400
> @@ -511,19 +511,11 @@ static bool vp_get_shm_region(struct vir
> u64 offset, len;
> phys_addr_t phys_addr;
> size_t bar_len;
> - int ret;
>
> if (!virtio_pci_find_shm_cap(pci_dev, id, &bar, &offset, &len)) {
> return false;
> }
>
> - ret = pci_request_region(pci_dev, bar, "virtio-pci-shm");
> - if (ret < 0) {
> - dev_err(&pci_dev->dev, "%s: failed to request BAR\n",
> - __func__);
> - return false;
> - }
> -
> phys_addr = pci_resource_start(pci_dev, bar);
> bar_len = pci_resource_len(pci_dev, bar);
Do pci_resource_start()/pci_resource_len() work on a BAR where
pci_request_region() hasn't been called yet? (I haven't checked the
code, sorry...)
Assuming yes, then my next question is whether devm_request_mem_region()
works in both the VIRTIO PCI and MMIO cases?
If yes, then this looks like a solution, though the need for
devm_request_mem_region() should be explained in the vp_get_shm_region()
doc comments so that callers remember to make that call. Or maybe it
can be included in vp_get_shm_region().
Stefan
On Wed, Mar 11, 2020 at 07:22:51AM +0200, Amir Goldstein wrote:
> On Wed, Mar 4, 2020 at 7:01 PM Vivek Goyal <[email protected]> wrote:
> >
> > Hi,
> >
> > This patch series adds DAX support to virtiofs filesystem. This allows
> > bypassing guest page cache and allows mapping host page cache directly
> > in guest address space.
> >
> > When a page of file is needed, guest sends a request to map that page
> > (in host page cache) in qemu address space. Inside guest this is
> > a physical memory range controlled by virtiofs device. And guest
> > directly maps this physical address range using DAX and hence gets
> > access to file data on host.
> >
> > This can speed up things considerably in many situations. Also this
> > can result in substantial memory savings as file data does not have
> > to be copied in guest and it is directly accessed from host page
> > cache.
> >
> > Most of the changes are limited to fuse/virtiofs. There are couple
> > of changes needed in generic dax infrastructure and couple of changes
> > in virtio to be able to access shared memory region.
> >
> > These patches apply on top of 5.6-rc4 and are also available here.
> >
> > https://github.com/rhvgoyal/linux/commits/vivek-04-march-2020
> >
> > Any review or feedback is welcome.
> >
> [...]
> > drivers/dax/super.c | 3 +-
> > drivers/virtio/virtio_mmio.c | 32 +
> > drivers/virtio/virtio_pci_modern.c | 107 +++
> > fs/dax.c | 66 +-
> > fs/fuse/dir.c | 2 +
> > fs/fuse/file.c | 1162 +++++++++++++++++++++++++++-
>
> That's a big addition to already big file.c.
> Maybe split dax specific code to dax.c?
> Can be a post series cleanup too.
How about fs/fuse/iomap.c instead? This will have all the iomap-related
logic as well as all the dax range allocation/free logic required by the
iomap code. That moves about 900 lines of code from file.c to iomap.c.
Vivek
On Wed, Mar 11, 2020 at 05:34:05PM +0000, Stefan Hajnoczi wrote:
> On Tue, Mar 10, 2020 at 02:19:36PM -0400, Vivek Goyal wrote:
> > On Tue, Mar 10, 2020 at 11:04:37AM +0000, Stefan Hajnoczi wrote:
> > > On Wed, Mar 04, 2020 at 11:58:29AM -0500, Vivek Goyal wrote:
> > > > diff --git a/drivers/virtio/virtio_pci_modern.c b/drivers/virtio/virtio_pci_modern.c
> > > > index 7abcc50838b8..52f179411015 100644
> > > > --- a/drivers/virtio/virtio_pci_modern.c
> > > > +++ b/drivers/virtio/virtio_pci_modern.c
> > > > @@ -443,6 +443,111 @@ static void del_vq(struct virtio_pci_vq_info *info)
> > > > vring_del_virtqueue(vq);
> > > > }
> > > >
> > > > +static int virtio_pci_find_shm_cap(struct pci_dev *dev,
> > > > + u8 required_id,
> > > > + u8 *bar, u64 *offset, u64 *len)
> > > > +{
> > > > + int pos;
> > > > +
> > > > + for (pos = pci_find_capability(dev, PCI_CAP_ID_VNDR);
> > >
> > > Please fix the mixed tabs vs space indentation in this patch.
> >
> > Will do. There are plenty of these in this patch.
> >
> > >
> > > > +static bool vp_get_shm_region(struct virtio_device *vdev,
> > > > + struct virtio_shm_region *region, u8 id)
> > > > +{
> > > > + struct virtio_pci_device *vp_dev = to_vp_device(vdev);
> > > > + struct pci_dev *pci_dev = vp_dev->pci_dev;
> > > > + u8 bar;
> > > > + u64 offset, len;
> > > > + phys_addr_t phys_addr;
> > > > + size_t bar_len;
> > > > + int ret;
> > > > +
> > > > + if (!virtio_pci_find_shm_cap(pci_dev, id, &bar, &offset, &len)) {
> > > > + return false;
> > > > + }
> > > > +
> > > > + ret = pci_request_region(pci_dev, bar, "virtio-pci-shm");
> > > > + if (ret < 0) {
> > > > + dev_err(&pci_dev->dev, "%s: failed to request BAR\n",
> > > > + __func__);
> > > > + return false;
> > > > + }
> > > > +
> > > > + phys_addr = pci_resource_start(pci_dev, bar);
> > > > + bar_len = pci_resource_len(pci_dev, bar);
> > > > +
> > > > + if (offset + len > bar_len) {
> > > > + dev_err(&pci_dev->dev,
> > > > + "%s: bar shorter than cap offset+len\n",
> > > > + __func__);
> > > > + return false;
> > > > + }
> > > > +
> > > > + region->len = len;
> > > > + region->addr = (u64) phys_addr + offset;
> > > > +
> > > > + return true;
> > > > +}
> > >
> > > Missing pci_release_region()?
> >
> > Good catch. We don't have a mechanism to call pci_release_region() and
> > the virtio-mmio device's ->get_shm_region() implementation does not even
> > seem to reserve the resources.
> >
> > So how about we leave this resource reservation to the caller?
> > ->get_shm_region() just returns the addr/len pair of the requested resource.
> >
> > Something like this patch.
> >
> > ---
> > drivers/virtio/virtio_pci_modern.c | 8 --------
> > fs/fuse/virtio_fs.c | 13 ++++++++++---
> > 2 files changed, 10 insertions(+), 11 deletions(-)
> >
> > Index: redhat-linux/fs/fuse/virtio_fs.c
> > ===================================================================
> > --- redhat-linux.orig/fs/fuse/virtio_fs.c 2020-03-10 09:13:34.624565666 -0400
> > +++ redhat-linux/fs/fuse/virtio_fs.c 2020-03-10 14:11:10.970284651 -0400
> > @@ -763,11 +763,18 @@ static int virtio_fs_setup_dax(struct vi
> > if (!have_cache) {
> > dev_notice(&vdev->dev, "%s: No cache capability\n", __func__);
> > return 0;
> > - } else {
> > - dev_notice(&vdev->dev, "Cache len: 0x%llx @ 0x%llx\n",
> > - cache_reg.len, cache_reg.addr);
> > }
> >
> > + if (!devm_request_mem_region(&vdev->dev, cache_reg.addr, cache_reg.len,
> > + dev_name(&vdev->dev))) {
> > + dev_warn(&vdev->dev, "could not reserve region addr=0x%llx"
> > + " len=0x%llx\n", cache_reg.addr, cache_reg.len);
> > + return -EBUSY;
> > + }
> > +
> > + dev_notice(&vdev->dev, "Cache len: 0x%llx @ 0x%llx\n", cache_reg.len,
> > + cache_reg.addr);
> > +
> > pgmap = devm_kzalloc(&vdev->dev, sizeof(*pgmap), GFP_KERNEL);
> > if (!pgmap)
> > return -ENOMEM;
> > Index: redhat-linux/drivers/virtio/virtio_pci_modern.c
> > ===================================================================
> > --- redhat-linux.orig/drivers/virtio/virtio_pci_modern.c 2020-03-10 08:51:36.886565666 -0400
> > +++ redhat-linux/drivers/virtio/virtio_pci_modern.c 2020-03-10 13:43:15.168753543 -0400
> > @@ -511,19 +511,11 @@ static bool vp_get_shm_region(struct vir
> > u64 offset, len;
> > phys_addr_t phys_addr;
> > size_t bar_len;
> > - int ret;
> >
> > if (!virtio_pci_find_shm_cap(pci_dev, id, &bar, &offset, &len)) {
> > return false;
> > }
> >
> > - ret = pci_request_region(pci_dev, bar, "virtio-pci-shm");
> > - if (ret < 0) {
> > - dev_err(&pci_dev->dev, "%s: failed to request BAR\n",
> > - __func__);
> > - return false;
> > - }
> > -
> > phys_addr = pci_resource_start(pci_dev, bar);
> > bar_len = pci_resource_len(pci_dev, bar);
>
> Do pci_resource_start()/pci_resource_len() work on a BAR where
> pci_request_region() hasn't been called yet? (I haven't checked the
> code, sorry...)
It should. In fact, pci_request_region() itself calls
pci_resource_start() and pci_resource_len().
>
> Assuming yes, then my next question is whether devm_request_mem_region()
> works in both the VIRTIO PCI and MMIO cases?
It should work in the MMIO case as well. This basically works on the
/proc/iomem resource tree to reserve resources. So as long as the MMIO
memory range has been registered by the driver in /proc/iomem, it will work.
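Independent of who does the reservation, the containment check that vp_get_shm_region() performs on the capability can be sketched in plain C (the helper name here is made up). Note that this form also avoids wrapping in offset + len:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Standalone sketch of the containment check (helper name made up).
 * Written so that cap_offset + cap_len cannot wrap around. */
static bool shm_region_from_bar(uint64_t bar_start, uint64_t bar_len,
				uint64_t cap_offset, uint64_t cap_len,
				uint64_t *addr, uint64_t *len)
{
	if (cap_offset > bar_len || cap_len > bar_len - cap_offset)
		return false;	/* "bar shorter than cap offset+len" */
	*addr = bar_start + cap_offset;
	*len = cap_len;
	return true;
}
```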
>
> If yes, then this looks like a solution, though the need for
> devm_request_mem_region() should be explained in the vp_get_shm_region()
> doc comments so that callers remember to make that call. Or maybe it
> can be included in vp_get_shm_region().
How about adding a line in include/linux/virtio_config.h right below the
@get_shm_region description which says:
"This does not reserve the resources and the caller is expected to call
devm_request_mem_region() or similar to reserve resources."
Vivek
On Wed, Mar 11, 2020 at 8:48 PM Vivek Goyal <[email protected]> wrote:
>
> On Wed, Mar 11, 2020 at 07:22:51AM +0200, Amir Goldstein wrote:
> > On Wed, Mar 4, 2020 at 7:01 PM Vivek Goyal <[email protected]> wrote:
> > >
> > > Hi,
> > >
> > > This patch series adds DAX support to virtiofs filesystem. This allows
> > > bypassing guest page cache and allows mapping host page cache directly
> > > in guest address space.
> > >
> > > When a page of file is needed, guest sends a request to map that page
> > > (in host page cache) in qemu address space. Inside guest this is
> > > a physical memory range controlled by virtiofs device. And guest
> > > directly maps this physical address range using DAX and hence gets
> > > access to file data on host.
> > >
> > > This can speed up things considerably in many situations. Also this
> > > can result in substantial memory savings as file data does not have
> > > to be copied in guest and it is directly accessed from host page
> > > cache.
> > >
> > > Most of the changes are limited to fuse/virtiofs. There are couple
> > > of changes needed in generic dax infrastructure and couple of changes
> > > in virtio to be able to access shared memory region.
> > >
> > > These patches apply on top of 5.6-rc4 and are also available here.
> > >
> > > https://github.com/rhvgoyal/linux/commits/vivek-04-march-2020
> > >
> > > Any review or feedback is welcome.
> > >
> > [...]
> > > drivers/dax/super.c | 3 +-
> > > drivers/virtio/virtio_mmio.c | 32 +
> > > drivers/virtio/virtio_pci_modern.c | 107 +++
> > > fs/dax.c | 66 +-
> > > fs/fuse/dir.c | 2 +
> > > fs/fuse/file.c | 1162 +++++++++++++++++++++++++++-
> >
> > That's a big addition to already big file.c.
> > Maybe split dax specific code to dax.c?
> > Can be a post series cleanup too.
>
> How about fs/fuse/iomap.c instead? This will have all the iomap-related
> logic as well as all the dax range allocation/free logic required by the
> iomap code. That moves about 900 lines of code from file.c to iomap.c.
>
Fine by me. I didn't take time to study the code in file.c;
I just noticed it has grown a lot bigger and wasn't sure that
made sense. Up to you. Only if you think the result would be nicer
to maintain.
Thanks,
Amir.
On Wed, Mar 11, 2020 at 09:32:17PM +0200, Amir Goldstein wrote:
> On Wed, Mar 11, 2020 at 8:48 PM Vivek Goyal <[email protected]> wrote:
> >
> > On Wed, Mar 11, 2020 at 07:22:51AM +0200, Amir Goldstein wrote:
> > > On Wed, Mar 4, 2020 at 7:01 PM Vivek Goyal <[email protected]> wrote:
> > > >
> > > > Hi,
> > > >
> > > > This patch series adds DAX support to virtiofs filesystem. This allows
> > > > bypassing guest page cache and allows mapping host page cache directly
> > > > in guest address space.
> > > >
> > > > When a page of file is needed, guest sends a request to map that page
> > > > (in host page cache) in qemu address space. Inside guest this is
> > > > a physical memory range controlled by virtiofs device. And guest
> > > > directly maps this physical address range using DAX and hence gets
> > > > access to file data on host.
> > > >
> > > > This can speed up things considerably in many situations. Also this
> > > > can result in substantial memory savings as file data does not have
> > > > to be copied in guest and it is directly accessed from host page
> > > > cache.
> > > >
> > > > Most of the changes are limited to fuse/virtiofs. There are couple
> > > > of changes needed in generic dax infrastructure and couple of changes
> > > > in virtio to be able to access shared memory region.
> > > >
> > > > These patches apply on top of 5.6-rc4 and are also available here.
> > > >
> > > > https://github.com/rhvgoyal/linux/commits/vivek-04-march-2020
> > > >
> > > > Any review or feedback is welcome.
> > > >
> > > [...]
> > > > drivers/dax/super.c | 3 +-
> > > > drivers/virtio/virtio_mmio.c | 32 +
> > > > drivers/virtio/virtio_pci_modern.c | 107 +++
> > > > fs/dax.c | 66 +-
> > > > fs/fuse/dir.c | 2 +
> > > > fs/fuse/file.c | 1162 +++++++++++++++++++++++++++-
> > >
> > > That's a big addition to already big file.c.
> > > Maybe split dax specific code to dax.c?
> > > Can be a post series cleanup too.
> >
> > How about fs/fuse/iomap.c instead? This will have all the iomap-related
> > logic as well as all the dax range allocation/free logic required by the
> > iomap code. That moves about 900 lines of code from file.c to iomap.c.
> >
>
> Fine by me. I didn't take time to study the code in file.c;
> I just noticed it has grown a lot bigger and wasn't sure that
> made sense. Up to you. Only if you think the result would be nicer
> to maintain.
I am happy to move this code to a separate file. In fact I think we could
probably break it further into another file, say dax-mapping.c or something
like that, where all the memory range allocation/reclaim logic goes, while
the iomap logic remains in iomap.c.
But that's probably a future cleanup, if the code in this file continues to grow.
Vivek
Vivek Goyal <[email protected]> writes:
> This patch series adds DAX support to virtiofs filesystem. This allows
> bypassing guest page cache and allows mapping host page cache directly
> in guest address space.
>
> When a page of file is needed, guest sends a request to map that page
> (in host page cache) in qemu address space. Inside guest this is
> a physical memory range controlled by virtiofs device. And guest
> directly maps this physical address range using DAX and hence gets
> access to file data on host.
>
> This can speed up things considerably in many situations. Also this
> can result in substantial memory savings as file data does not have
> to be copied in guest and it is directly accessed from host page
> cache.
As a potential user of this, let me make sure I understand the expected
outcome: is the goal to let virtiofs use DAX (for increased performance,
etc.) or also let applications that use virtiofs use DAX?
You are mentioning using the host's page cache, so it's probably the
former and MAP_SYNC on virtiofs will continue to be rejected, right?
--
Best Regards
Patrick Ohly
On Wed, Mar 4, 2020 at 5:59 PM Vivek Goyal <[email protected]> wrote:
>
> This patch implements basic DAX support. mmap() is not implemented
> yet and will come in later patches. This patch looks into implementing
> read/write.
>
> We make use of interval tree to keep track of per inode dax mappings.
>
> Do not use dax for file-extending writes; instead just send a WRITE message
> to the daemon (like we do for the direct I/O path). This will keep the write
> and i_size change atomic w.r.t. crash.
>
> Signed-off-by: Stefan Hajnoczi <[email protected]>
> Signed-off-by: Dr. David Alan Gilbert <[email protected]>
> Signed-off-by: Vivek Goyal <[email protected]>
> Signed-off-by: Miklos Szeredi <[email protected]>
> Signed-off-by: Liu Bo <[email protected]>
> Signed-off-by: Peng Tao <[email protected]>
> ---
> fs/fuse/file.c | 597 +++++++++++++++++++++++++++++++++++++-
> fs/fuse/fuse_i.h | 23 ++
> fs/fuse/inode.c | 6 +
> include/uapi/linux/fuse.h | 1 +
> 4 files changed, 621 insertions(+), 6 deletions(-)
>
> diff --git a/fs/fuse/file.c b/fs/fuse/file.c
> index 9d67b830fb7a..9effdd3dc6d6 100644
> --- a/fs/fuse/file.c
> +++ b/fs/fuse/file.c
> @@ -18,6 +18,12 @@
> #include <linux/swap.h>
> #include <linux/falloc.h>
> #include <linux/uio.h>
> +#include <linux/dax.h>
> +#include <linux/iomap.h>
> +#include <linux/interval_tree_generic.h>
> +
> +INTERVAL_TREE_DEFINE(struct fuse_dax_mapping, rb, __u64, __subtree_last,
> + START, LAST, static inline, fuse_dax_interval_tree);
Are you using this because of byte ranges (u64)? Does it not make
more sense to use page offsets, which are unsigned long and so fit
nicely into the generic interval tree?
>
> static struct page **fuse_pages_alloc(unsigned int npages, gfp_t flags,
> struct fuse_page_desc **desc)
> @@ -187,6 +193,242 @@ static void fuse_link_write_file(struct file *file)
> spin_unlock(&fi->lock);
> }
>
> +static struct fuse_dax_mapping *alloc_dax_mapping(struct fuse_conn *fc)
> +{
> + struct fuse_dax_mapping *dmap = NULL;
> +
> + spin_lock(&fc->lock);
> +
> + if (fc->nr_free_ranges <= 0) {
> + spin_unlock(&fc->lock);
> + return NULL;
> + }
> +
> + WARN_ON(list_empty(&fc->free_ranges));
> +
> + /* Take a free range */
> + dmap = list_first_entry(&fc->free_ranges, struct fuse_dax_mapping,
> + list);
> + list_del_init(&dmap->list);
> + fc->nr_free_ranges--;
> + spin_unlock(&fc->lock);
> + return dmap;
> +}
> +
> +/* This assumes fc->lock is held */
> +static void __dmap_add_to_free_pool(struct fuse_conn *fc,
> + struct fuse_dax_mapping *dmap)
> +{
> + list_add_tail(&dmap->list, &fc->free_ranges);
> + fc->nr_free_ranges++;
> +}
> +
> +static void dmap_add_to_free_pool(struct fuse_conn *fc,
> + struct fuse_dax_mapping *dmap)
> +{
> + /* Return fuse_dax_mapping to free list */
> + spin_lock(&fc->lock);
> + __dmap_add_to_free_pool(fc, dmap);
> + spin_unlock(&fc->lock);
> +}
> +
> +/* offset passed in should be aligned to FUSE_DAX_MEM_RANGE_SZ */
> +static int fuse_setup_one_mapping(struct inode *inode, loff_t offset,
> + struct fuse_dax_mapping *dmap, bool writable,
> + bool upgrade)
> +{
> + struct fuse_conn *fc = get_fuse_conn(inode);
> + struct fuse_inode *fi = get_fuse_inode(inode);
> + struct fuse_setupmapping_in inarg;
> + FUSE_ARGS(args);
> + ssize_t err;
> +
> + WARN_ON(offset % FUSE_DAX_MEM_RANGE_SZ);
> + WARN_ON(fc->nr_free_ranges < 0);
> +
> + /* Ask fuse daemon to setup mapping */
> + memset(&inarg, 0, sizeof(inarg));
> + inarg.foffset = offset;
> + inarg.fh = -1;
> + inarg.moffset = dmap->window_offset;
> + inarg.len = FUSE_DAX_MEM_RANGE_SZ;
> + inarg.flags |= FUSE_SETUPMAPPING_FLAG_READ;
> + if (writable)
> + inarg.flags |= FUSE_SETUPMAPPING_FLAG_WRITE;
> + args.opcode = FUSE_SETUPMAPPING;
> + args.nodeid = fi->nodeid;
> + args.in_numargs = 1;
> + args.in_args[0].size = sizeof(inarg);
> + args.in_args[0].value = &inarg;
args.force = true?
> + err = fuse_simple_request(fc, &args);
> + if (err < 0) {
> + printk(KERN_ERR "%s request failed at mem_offset=0x%llx %zd\n",
> + __func__, dmap->window_offset, err);
Is this level of noisiness really needed? AFAICS, the error will
reach the caller, in which case we don't usually need to print a
kernel error.
> + return err;
> + }
> +
> + pr_debug("fuse_setup_one_mapping() succeeded. offset=0x%llx writable=%d"
> + " err=%zd\n", offset, writable, err);
> +
> + dmap->writable = writable;
> + if (!upgrade) {
> + dmap->start = offset;
> + dmap->end = offset + FUSE_DAX_MEM_RANGE_SZ - 1;
> + /* Protected by fi->i_dmap_sem */
> + fuse_dax_interval_tree_insert(dmap, &fi->dmap_tree);
> + fi->nr_dmaps++;
> + }
> + return 0;
> +}
> +
> +static int
> +fuse_send_removemapping(struct inode *inode,
> + struct fuse_removemapping_in *inargp,
> + struct fuse_removemapping_one *remove_one)
> +{
> + struct fuse_inode *fi = get_fuse_inode(inode);
> + struct fuse_conn *fc = get_fuse_conn(inode);
> + FUSE_ARGS(args);
> +
> + args.opcode = FUSE_REMOVEMAPPING;
> + args.nodeid = fi->nodeid;
> + args.in_numargs = 2;
> + args.in_args[0].size = sizeof(*inargp);
> + args.in_args[0].value = inargp;
> + args.in_args[1].size = inargp->count * sizeof(*remove_one);
> + args.in_args[1].value = remove_one;
args.force = true?
> + return fuse_simple_request(fc, &args);
> +}
> +
> +static int dmap_removemapping_list(struct inode *inode, unsigned num,
> + struct list_head *to_remove)
> +{
> + struct fuse_removemapping_one *remove_one, *ptr;
> + struct fuse_removemapping_in inarg;
> + struct fuse_dax_mapping *dmap;
> + int ret, i = 0, nr_alloc;
> +
> + nr_alloc = min_t(unsigned int, num, FUSE_REMOVEMAPPING_MAX_ENTRY);
> + remove_one = kmalloc_array(nr_alloc, sizeof(*remove_one), GFP_NOFS);
> + if (!remove_one)
> + return -ENOMEM;
> +
> + ptr = remove_one;
> + list_for_each_entry(dmap, to_remove, list) {
> + ptr->moffset = dmap->window_offset;
> + ptr->len = dmap->length;
> + ptr++;
Minor nit: ptr = &remove_one[i] at the start of the section would be
cleaner IMO.
> + i++;
> + num--;
> + if (i >= nr_alloc || num == 0) {
> + memset(&inarg, 0, sizeof(inarg));
> + inarg.count = i;
> + ret = fuse_send_removemapping(inode, &inarg,
> + remove_one);
> + if (ret)
> + goto out;
> + ptr = remove_one;
> + i = 0;
> + }
> + }
> +out:
> + kfree(remove_one);
> + return ret;
> +}
> +
> +/*
> + * Cleanup dmap entry and add back to free list. This should be called with
> + * fc->lock held.
> + */
> +static void dmap_reinit_add_to_free_pool(struct fuse_conn *fc,
> + struct fuse_dax_mapping *dmap)
> +{
> + pr_debug("fuse: freeing memory range start=0x%llx end=0x%llx "
> + "window_offset=0x%llx length=0x%llx\n", dmap->start,
> + dmap->end, dmap->window_offset, dmap->length);
> + dmap->start = dmap->end = 0;
> + __dmap_add_to_free_pool(fc, dmap);
> +}
> +
> +/*
> + * Free inode dmap entries whose range falls entirely inside [start, end].
> + * Does not take any locks. At this point of time it should only be
> + * called from evict_inode() path where we know all dmap entries can be
> + * reclaimed.
> + */
> +static void inode_reclaim_dmap_range(struct fuse_conn *fc, struct inode *inode,
> + loff_t start, loff_t end)
> +{
> + struct fuse_inode *fi = get_fuse_inode(inode);
> + struct fuse_dax_mapping *dmap, *n;
> + int err, num = 0;
> + LIST_HEAD(to_remove);
> +
> + pr_debug("fuse: %s: start=0x%llx, end=0x%llx\n", __func__, start, end);
> +
> + /*
> + * Interval tree search matches intersecting entries. Adjust the range
> + * to avoid dropping partial valid entries.
> + */
> + start = ALIGN(start, FUSE_DAX_MEM_RANGE_SZ);
> + end = ALIGN_DOWN(end, FUSE_DAX_MEM_RANGE_SZ);
> +
> + while (1) {
> + dmap = fuse_dax_interval_tree_iter_first(&fi->dmap_tree, start,
> + end);
> + if (!dmap)
> + break;
> + fuse_dax_interval_tree_remove(dmap, &fi->dmap_tree);
> + num++;
> + list_add(&dmap->list, &to_remove);
> + }
> +
> + /* Nothing to remove */
> + if (list_empty(&to_remove))
> + return;
> +
> + WARN_ON(fi->nr_dmaps < num);
> + fi->nr_dmaps -= num;
> + /*
> + * During umount/shutdown, fuse connection is dropped first
> + * and evict_inode() is called later. That means any
> + * removemapping messages are going to fail. Send messages
> + * only if connection is up. Otherwise fuse daemon is
> + * responsible for cleaning up any leftover references and
> + * mappings.
> + */
> + if (fc->connected) {
> + err = dmap_removemapping_list(inode, num, &to_remove);
> + if (err) {
> + pr_warn("Failed to removemappings. start=0x%llx"
> + " end=0x%llx\n", start, end);
> + }
> + }
> + spin_lock(&fc->lock);
> + list_for_each_entry_safe(dmap, n, &to_remove, list) {
> + list_del_init(&dmap->list);
> + dmap_reinit_add_to_free_pool(fc, dmap);
> + }
> + spin_unlock(&fc->lock);
> +}
> +
> +/*
> + * It is called from evict_inode() and by that time inode is going away. So
> + * this function does not take any locks like fi->i_dmap_sem for traversing
> + * that fuse inode interval tree. If that lock is taken then lock validator
> + * complains of deadlock situation w.r.t fs_reclaim lock.
> + */
> +void fuse_cleanup_inode_mappings(struct inode *inode)
> +{
> + struct fuse_conn *fc = get_fuse_conn(inode);
> + /*
> + * fuse_evict_inode() has already called truncate_inode_pages_final()
> + * before we arrive here. So we should not have to worry about
> + * any pages/exception entries still associated with inode.
> + */
> + inode_reclaim_dmap_range(fc, inode, 0, -1);
> +}
> +
> void fuse_finish_open(struct inode *inode, struct file *file)
> {
> struct fuse_file *ff = file->private_data;
> @@ -1562,32 +1804,364 @@ static ssize_t fuse_direct_write_iter(struct kiocb *iocb, struct iov_iter *from)
> return res;
> }
>
> +static ssize_t fuse_dax_read_iter(struct kiocb *iocb, struct iov_iter *to);
> static ssize_t fuse_file_read_iter(struct kiocb *iocb, struct iov_iter *to)
> {
> struct file *file = iocb->ki_filp;
> struct fuse_file *ff = file->private_data;
> + struct inode *inode = file->f_mapping->host;
>
> if (is_bad_inode(file_inode(file)))
> return -EIO;
>
> - if (!(ff->open_flags & FOPEN_DIRECT_IO))
> - return fuse_cache_read_iter(iocb, to);
> - else
> + if (IS_DAX(inode))
> + return fuse_dax_read_iter(iocb, to);
> +
> + if (ff->open_flags & FOPEN_DIRECT_IO)
> return fuse_direct_read_iter(iocb, to);
> +
> + return fuse_cache_read_iter(iocb, to);
> }
>
> +static ssize_t fuse_dax_write_iter(struct kiocb *iocb, struct iov_iter *from);
> static ssize_t fuse_file_write_iter(struct kiocb *iocb, struct iov_iter *from)
> {
> struct file *file = iocb->ki_filp;
> struct fuse_file *ff = file->private_data;
> + struct inode *inode = file->f_mapping->host;
>
> if (is_bad_inode(file_inode(file)))
> return -EIO;
>
> - if (!(ff->open_flags & FOPEN_DIRECT_IO))
> - return fuse_cache_write_iter(iocb, from);
> - else
> + if (IS_DAX(inode))
> + return fuse_dax_write_iter(iocb, from);
> +
> + if (ff->open_flags & FOPEN_DIRECT_IO)
> return fuse_direct_write_iter(iocb, from);
> +
> + return fuse_cache_write_iter(iocb, from);
> +}
> +
> +static void fuse_fill_iomap_hole(struct iomap *iomap, loff_t length)
> +{
> + iomap->addr = IOMAP_NULL_ADDR;
> + iomap->length = length;
> + iomap->type = IOMAP_HOLE;
> +}
> +
> +static void fuse_fill_iomap(struct inode *inode, loff_t pos, loff_t length,
> + struct iomap *iomap, struct fuse_dax_mapping *dmap,
> + unsigned flags)
> +{
> + loff_t offset, len;
> + loff_t i_size = i_size_read(inode);
> +
> + offset = pos - dmap->start;
> + len = min(length, dmap->length - offset);
> +
> + /* If length is beyond end of file, truncate further */
> + if (pos + len > i_size)
> + len = i_size - pos;
> +
> + if (len > 0) {
> + iomap->addr = dmap->window_offset + offset;
> + iomap->length = len;
> + if (flags & IOMAP_FAULT)
> + iomap->length = ALIGN(len, PAGE_SIZE);
> + iomap->type = IOMAP_MAPPED;
> + pr_debug("%s: returns iomap: addr 0x%llx offset 0x%llx"
> + " length 0x%llx\n", __func__, iomap->addr,
> + iomap->offset, iomap->length);
> + } else {
> + /* Mapping beyond end of file is hole */
> + fuse_fill_iomap_hole(iomap, length);
> + pr_debug("%s: returns iomap: addr 0x%llx offset 0x%llx"
> + " length 0x%llx\n", __func__, iomap->addr,
> + iomap->offset, iomap->length);
> + }
> +}
> +
> +static int iomap_begin_setup_new_mapping(struct inode *inode, loff_t pos,
> + loff_t length, unsigned flags,
> + struct iomap *iomap)
> +{
> + struct fuse_inode *fi = get_fuse_inode(inode);
> + struct fuse_conn *fc = get_fuse_conn(inode);
> + struct fuse_dax_mapping *dmap, *alloc_dmap = NULL;
> + int ret;
> + bool writable = flags & IOMAP_WRITE;
> +
> + alloc_dmap = alloc_dax_mapping(fc);
> + if (!alloc_dmap)
> + return -EBUSY;
> +
> + /*
> + * Take write lock so that only one caller can try to setup mapping
> + * and other waits.
> + */
> + down_write(&fi->i_dmap_sem);
> + /*
> + * We dropped lock. Check again if somebody else setup
> + * mapping already.
> + */
> + dmap = fuse_dax_interval_tree_iter_first(&fi->dmap_tree, pos,
> + pos);
> + if (dmap) {
> + fuse_fill_iomap(inode, pos, length, iomap, dmap, flags);
> + dmap_add_to_free_pool(fc, alloc_dmap);
> + up_write(&fi->i_dmap_sem);
> + return 0;
> + }
> +
> + /* Setup one mapping */
> + ret = fuse_setup_one_mapping(inode,
> + ALIGN_DOWN(pos, FUSE_DAX_MEM_RANGE_SZ),
> + alloc_dmap, writable, false);
> + if (ret < 0) {
> + printk("fuse_setup_one_mapping() failed. err=%d"
> + " pos=0x%llx, writable=%d\n", ret, pos, writable);
More unnecessary noise?
> + dmap_add_to_free_pool(fc, alloc_dmap);
> + up_write(&fi->i_dmap_sem);
> + return ret;
> + }
> + fuse_fill_iomap(inode, pos, length, iomap, alloc_dmap, flags);
> + up_write(&fi->i_dmap_sem);
> + return 0;
> +}
> +
> +static int iomap_begin_upgrade_mapping(struct inode *inode, loff_t pos,
> + loff_t length, unsigned flags,
> + struct iomap *iomap)
> +{
> + struct fuse_inode *fi = get_fuse_inode(inode);
> + struct fuse_dax_mapping *dmap;
> + int ret;
> +
> + /*
> + * Take exclusive lock so that only one caller can try to setup
> + * mapping and others wait.
> + */
> + down_write(&fi->i_dmap_sem);
> + dmap = fuse_dax_interval_tree_iter_first(&fi->dmap_tree, pos, pos);
> +
> + /* We are holding either inode lock or i_mmap_sem, and that should
> + * ensure that dmap can't be reclaimed or truncated and it should still
> + * be there in tree despite the fact we dropped and re-acquired the
> + * lock.
> + */
> + ret = -EIO;
> + if (WARN_ON(!dmap))
> + goto out_err;
> +
> + /* Maybe another thread already upgraded mapping while we were not
> + * holding lock.
> + */
> + if (dmap->writable)
> + goto out_fill_iomap;
> +
> + ret = fuse_setup_one_mapping(inode,
> + ALIGN_DOWN(pos, FUSE_DAX_MEM_RANGE_SZ),
> + dmap, true, true);
> + if (ret < 0) {
> + printk("fuse_setup_one_mapping() failed. err=%d pos=0x%llx\n",
> + ret, pos);
Again.
> + goto out_err;
> + }
> +
> +out_fill_iomap:
> + fuse_fill_iomap(inode, pos, length, iomap, dmap, flags);
> +out_err:
> + up_write(&fi->i_dmap_sem);
> + return ret;
> +}
> +
> +/* This is just for DAX and the mapping is ephemeral, do not use it for other
> + * purposes since there is no block device with a permanent mapping.
> + */
> +static int fuse_iomap_begin(struct inode *inode, loff_t pos, loff_t length,
> + unsigned flags, struct iomap *iomap,
> + struct iomap *srcmap)
> +{
> + struct fuse_inode *fi = get_fuse_inode(inode);
> + struct fuse_conn *fc = get_fuse_conn(inode);
> + struct fuse_dax_mapping *dmap;
> + bool writable = flags & IOMAP_WRITE;
> +
> + /* We don't support FIEMAP */
> + BUG_ON(flags & IOMAP_REPORT);
> +
> + pr_debug("fuse_iomap_begin() called. pos=0x%llx length=0x%llx\n",
> + pos, length);
> +
> + /*
> + * Writes beyond end of file are not handled using dax path. Instead
> + * a fuse write message is sent to daemon
> + */
> + if (flags & IOMAP_WRITE && pos >= i_size_read(inode))
> + return -EIO;
Okay, this will work fine if the host filesystem is not modified by
other entities.
What happens if there's a concurrent truncate going on on the host
with this write? If the two are not in any way synchronized, then
either of the following two behaviors is allowed:
1) Whole or partial data in the write is truncated. (If there are
complete pages from the write being truncated, then the writing
process will receive SIGBUS. Does KVM handle that? I remember that
being discussed, but don't remember the conclusion.)
2) Write re-extends file size.
However EIO is not a good result, so we need to do something with it.
> +
> + iomap->offset = pos;
> + iomap->flags = 0;
> + iomap->bdev = NULL;
> + iomap->dax_dev = fc->dax_dev;
> +
> + /*
> + * Both read/write and mmap path can race here. So we need something
> + * to make sure if we are setting up mapping, then other path waits
> + *
> + * For now, use a semaphore for this. It probably needs to be
> + * optimized later.
> + */
> + down_read(&fi->i_dmap_sem);
> + dmap = fuse_dax_interval_tree_iter_first(&fi->dmap_tree, pos, pos);
> +
> + if (dmap) {
> + if (writable && !dmap->writable) {
> + /* Upgrade read-only mapping to read-write. This will
> + * require exclusive i_dmap_sem lock as we don't want
> + * two threads to be trying to do this simultaneously
> + * for same dmap. So drop shared lock and acquire
> + * exclusive lock.
> + */
> + up_read(&fi->i_dmap_sem);
> + pr_debug("%s: Upgrading mapping at offset 0x%llx"
> + " length 0x%llx\n", __func__, pos, length);
> + return iomap_begin_upgrade_mapping(inode, pos, length,
> + flags, iomap);
> + } else {
> + fuse_fill_iomap(inode, pos, length, iomap, dmap, flags);
> + up_read(&fi->i_dmap_sem);
> + return 0;
> + }
> + } else {
> + up_read(&fi->i_dmap_sem);
> + pr_debug("%s: no mapping at offset 0x%llx length 0x%llx\n",
> + __func__, pos, length);
> + if (pos >= i_size_read(inode))
> + goto iomap_hole;
> +
> + return iomap_begin_setup_new_mapping(inode, pos, length, flags,
> + iomap);
> + }
> +
> + /*
> + * If read beyond end of file happens, fs code seems to return
> + * it as hole
> + */
> +iomap_hole:
> + fuse_fill_iomap_hole(iomap, length);
> + pr_debug("fuse_iomap_begin() returning hole mapping. pos=0x%llx length_asked=0x%llx length_returned=0x%llx\n", pos, length, iomap->length);
> + return 0;
> +}
> +
> +static int fuse_iomap_end(struct inode *inode, loff_t pos, loff_t length,
> + ssize_t written, unsigned flags,
> + struct iomap *iomap)
> +{
> + /* DAX writes beyond end-of-file aren't handled using iomap, so the
> + * file size is unchanged and there is nothing to do here.
> + */
> + return 0;
> +}
> +
> +static const struct iomap_ops fuse_iomap_ops = {
> + .iomap_begin = fuse_iomap_begin,
> + .iomap_end = fuse_iomap_end,
> +};
> +
> +static ssize_t fuse_dax_read_iter(struct kiocb *iocb, struct iov_iter *to)
> +{
> + struct inode *inode = file_inode(iocb->ki_filp);
> + ssize_t ret;
> +
> + if (iocb->ki_flags & IOCB_NOWAIT) {
> + if (!inode_trylock_shared(inode))
> + return -EAGAIN;
> + } else {
> + inode_lock_shared(inode);
> + }
> +
> + ret = dax_iomap_rw(iocb, to, &fuse_iomap_ops);
> + inode_unlock_shared(inode);
> +
> + /* TODO file_accessed(iocb->f_filp) */
> + return ret;
> +}
> +
> +static bool file_extending_write(struct kiocb *iocb, struct iov_iter *from)
> +{
> + struct inode *inode = file_inode(iocb->ki_filp);
> +
> + return (iov_iter_rw(from) == WRITE &&
> + ((iocb->ki_pos) >= i_size_read(inode)));
> +}
> +
> +static ssize_t fuse_dax_direct_write(struct kiocb *iocb, struct iov_iter *from)
> +{
> + struct inode *inode = file_inode(iocb->ki_filp);
> + struct fuse_io_priv io = FUSE_IO_PRIV_SYNC(iocb);
> + ssize_t ret;
> +
> + ret = fuse_direct_io(&io, from, &iocb->ki_pos, FUSE_DIO_WRITE);
> + if (ret < 0)
> + return ret;
> +
> + fuse_invalidate_attr(inode);
> + fuse_write_update_size(inode, iocb->ki_pos);
> + return ret;
> +}
> +
> +static ssize_t fuse_dax_write_iter(struct kiocb *iocb, struct iov_iter *from)
> +{
> + struct inode *inode = file_inode(iocb->ki_filp);
> + ssize_t ret, count;
> +
> + if (iocb->ki_flags & IOCB_NOWAIT) {
> + if (!inode_trylock(inode))
> + return -EAGAIN;
> + } else {
> + inode_lock(inode);
> + }
> +
> + ret = generic_write_checks(iocb, from);
> + if (ret <= 0)
> + goto out;
> +
> + ret = file_remove_privs(iocb->ki_filp);
> + if (ret)
> + goto out;
> + /* TODO file_update_time() but we don't want metadata I/O */
> +
> + /* Do not use dax for file extending writes as it's an mmap and
> + * trying to write beyond end of existing page will generate
> + * SIGBUS.
Ah, here it is. So what happens in case of a race? Does that
currently crash KVM?
> + */
> + if (file_extending_write(iocb, from)) {
> + ret = fuse_dax_direct_write(iocb, from);
> + goto out;
> + }
> +
> + ret = dax_iomap_rw(iocb, from, &fuse_iomap_ops);
> + if (ret < 0)
> + goto out;
> +
> + /*
> + * If part of the write was file extending, fuse dax path will not
> + * take care of that. Do direct write instead.
> + */
> + if (iov_iter_count(from) && file_extending_write(iocb, from)) {
> + count = fuse_dax_direct_write(iocb, from);
> + if (count < 0)
> + goto out;
> + ret += count;
> + }
> +
> +out:
> + inode_unlock(inode);
> +
> + if (ret > 0)
> + ret = generic_write_sync(iocb, ret);
> + return ret;
> }
>
> static void fuse_writepage_free(struct fuse_writepage_args *wpa)
> @@ -2318,6 +2892,11 @@ static int fuse_file_mmap(struct file *file, struct vm_area_struct *vma)
> return 0;
> }
>
> +static int fuse_dax_mmap(struct file *file, struct vm_area_struct *vma)
> +{
> + return -EINVAL; /* TODO */
> +}
> +
> static int convert_fuse_file_lock(struct fuse_conn *fc,
> const struct fuse_file_lock *ffl,
> struct file_lock *fl)
> @@ -3387,6 +3966,7 @@ static const struct address_space_operations fuse_file_aops = {
> void fuse_init_file_inode(struct inode *inode)
> {
> struct fuse_inode *fi = get_fuse_inode(inode);
> + struct fuse_conn *fc = get_fuse_conn(inode);
>
> inode->i_fop = &fuse_file_operations;
> inode->i_data.a_ops = &fuse_file_aops;
> @@ -3396,4 +3976,9 @@ void fuse_init_file_inode(struct inode *inode)
> fi->writectr = 0;
> init_waitqueue_head(&fi->page_waitq);
> INIT_LIST_HEAD(&fi->writepages);
> + fi->dmap_tree = RB_ROOT_CACHED;
> +
> + if (fc->dax_dev) {
> + inode->i_flags |= S_DAX;
> + }
> }
> diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
> index b41275f73e4c..490549862bda 100644
> --- a/fs/fuse/fuse_i.h
> +++ b/fs/fuse/fuse_i.h
> @@ -70,16 +70,29 @@ struct fuse_forget_link {
> struct fuse_forget_link *next;
> };
>
> +#define START(node) ((node)->start)
> +#define LAST(node) ((node)->end)
> +
> /** Translation information for file offsets to DAX window offsets */
> struct fuse_dax_mapping {
> /* Will connect in fc->free_ranges to keep track of free memory */
> struct list_head list;
>
> + /* For interval tree in file/inode */
> + struct rb_node rb;
> + /** Start Position in file */
> + __u64 start;
> + /** End Position in file */
> + __u64 end;
> + __u64 __subtree_last;
> /** Position in DAX window */
> u64 window_offset;
>
> /** Length of mapping, in bytes */
> loff_t length;
> +
> + /* Is this mapping read-only or read-write */
> + bool writable;
> };
>
> /** FUSE inode */
> @@ -167,6 +180,15 @@ struct fuse_inode {
>
> /** Lock to protect write related fields */
> spinlock_t lock;
> +
> + /*
> + * Semaphore to protect modifications to dmap_tree
> + */
> + struct rw_semaphore i_dmap_sem;
> +
> + /** Sorted rb tree of struct fuse_dax_mapping elements */
> + struct rb_root_cached dmap_tree;
> + unsigned long nr_dmaps;
> };
>
> /** FUSE inode state bits */
> @@ -1127,5 +1149,6 @@ unsigned int fuse_len_args(unsigned int numargs, struct fuse_arg *args);
> */
> u64 fuse_get_unique(struct fuse_iqueue *fiq);
> void fuse_free_conn(struct fuse_conn *fc);
> +void fuse_cleanup_inode_mappings(struct inode *inode);
>
> #endif /* _FS_FUSE_I_H */
> diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
> index 36cb9c00bbe5..93bc65607a15 100644
> --- a/fs/fuse/inode.c
> +++ b/fs/fuse/inode.c
> @@ -86,7 +86,9 @@ static struct inode *fuse_alloc_inode(struct super_block *sb)
> fi->attr_version = 0;
> fi->orig_ino = 0;
> fi->state = 0;
> + fi->nr_dmaps = 0;
> mutex_init(&fi->mutex);
> + init_rwsem(&fi->i_dmap_sem);
> spin_lock_init(&fi->lock);
> fi->forget = fuse_alloc_forget();
> if (!fi->forget) {
> @@ -114,6 +116,10 @@ static void fuse_evict_inode(struct inode *inode)
> clear_inode(inode);
> if (inode->i_sb->s_flags & SB_ACTIVE) {
> struct fuse_conn *fc = get_fuse_conn(inode);
> + if (IS_DAX(inode)) {
> + fuse_cleanup_inode_mappings(inode);
> + WARN_ON(fi->nr_dmaps);
> + }
> fuse_queue_forget(fc, fi->forget, fi->nodeid, fi->nlookup);
> fi->forget = NULL;
> }
> diff --git a/include/uapi/linux/fuse.h b/include/uapi/linux/fuse.h
> index 62633555d547..36d824b82ebc 100644
> --- a/include/uapi/linux/fuse.h
> +++ b/include/uapi/linux/fuse.h
> @@ -896,6 +896,7 @@ struct fuse_copy_file_range_in {
>
> #define FUSE_SETUPMAPPING_ENTRIES 8
> #define FUSE_SETUPMAPPING_FLAG_WRITE (1ull << 0)
> +#define FUSE_SETUPMAPPING_FLAG_READ (1ull << 1)
> struct fuse_setupmapping_in {
> /* An already open handle */
> uint64_t fh;
> --
> 2.20.1
>
On Thu, Mar 12, 2020 at 10:43:10AM +0100, Miklos Szeredi wrote:
> On Wed, Mar 4, 2020 at 5:59 PM Vivek Goyal <[email protected]> wrote:
> >
> > This patch implements basic DAX support. mmap() is not implemented
> > yet and will come in later patches. This patch looks into implementing
> > read/write.
> >
> > We make use of interval tree to keep track of per inode dax mappings.
> >
> > Do not use dax for file extending writes, instead just send WRITE message
> > to daemon (like we do for direct I/O path). This will keep write and
> > i_size change atomic w.r.t crash.
> >
> > Signed-off-by: Stefan Hajnoczi <[email protected]>
> > Signed-off-by: Dr. David Alan Gilbert <[email protected]>
> > Signed-off-by: Vivek Goyal <[email protected]>
> > Signed-off-by: Miklos Szeredi <[email protected]>
> > Signed-off-by: Liu Bo <[email protected]>
> > Signed-off-by: Peng Tao <[email protected]>
> > ---
> > fs/fuse/file.c | 597 +++++++++++++++++++++++++++++++++++++-
> > fs/fuse/fuse_i.h | 23 ++
> > fs/fuse/inode.c | 6 +
> > include/uapi/linux/fuse.h | 1 +
> > 4 files changed, 621 insertions(+), 6 deletions(-)
> >
> > diff --git a/fs/fuse/file.c b/fs/fuse/file.c
> > index 9d67b830fb7a..9effdd3dc6d6 100644
> > --- a/fs/fuse/file.c
> > +++ b/fs/fuse/file.c
> > @@ -18,6 +18,12 @@
> > #include <linux/swap.h>
> > #include <linux/falloc.h>
> > #include <linux/uio.h>
> > +#include <linux/dax.h>
> > +#include <linux/iomap.h>
> > +#include <linux/interval_tree_generic.h>
> > +
> > +INTERVAL_TREE_DEFINE(struct fuse_dax_mapping, rb, __u64, __subtree_last,
> > + START, LAST, static inline, fuse_dax_interval_tree);
>
> Are you using this because of byte ranges (u64)? Does it not make
> more sense to use page offsets, which are unsigned long and so fit
> nicely into the generic interval tree?
I think I should be able to use generic interval tree. I will switch
to that.
[..]
> > +/* offset passed in should be aligned to FUSE_DAX_MEM_RANGE_SZ */
> > +static int fuse_setup_one_mapping(struct inode *inode, loff_t offset,
> > + struct fuse_dax_mapping *dmap, bool writable,
> > + bool upgrade)
> > +{
> > + struct fuse_conn *fc = get_fuse_conn(inode);
> > + struct fuse_inode *fi = get_fuse_inode(inode);
> > + struct fuse_setupmapping_in inarg;
> > + FUSE_ARGS(args);
> > + ssize_t err;
> > +
> > + WARN_ON(offset % FUSE_DAX_MEM_RANGE_SZ);
> > + WARN_ON(fc->nr_free_ranges < 0);
> > +
> > + /* Ask fuse daemon to setup mapping */
> > + memset(&inarg, 0, sizeof(inarg));
> > + inarg.foffset = offset;
> > + inarg.fh = -1;
> > + inarg.moffset = dmap->window_offset;
> > + inarg.len = FUSE_DAX_MEM_RANGE_SZ;
> > + inarg.flags |= FUSE_SETUPMAPPING_FLAG_READ;
> > + if (writable)
> > + inarg.flags |= FUSE_SETUPMAPPING_FLAG_WRITE;
> > + args.opcode = FUSE_SETUPMAPPING;
> > + args.nodeid = fi->nodeid;
> > + args.in_numargs = 1;
> > + args.in_args[0].size = sizeof(inarg);
> > + args.in_args[0].value = &inarg;
>
> args.force = true?
I can do that but I am not sure what exactly args.force does and
why we need it in this case.
First thing it does is that the request is allocated with flag __GFP_NOFAIL.
Second thing it does is that the caller is forced to wait for request
completion, and it's not an interruptible sleep.
I am wondering what makes FUSE_SETUPMAPPING/FUSE_REMOVEMAPPING requests
special that we need to set the force flag.
>
> > + err = fuse_simple_request(fc, &args);
> > + if (err < 0) {
> > + printk(KERN_ERR "%s request failed at mem_offset=0x%llx %zd\n",
> > + __func__, dmap->window_offset, err);
>
> Is this level of noisiness really needed? AFAICS, the error will
> reach the caller, in which case we don't usually need to print a
> kernel error.
I will remove it. I think code in general has quite a few printk() and
pr_debug() we can get rid of. Some of them were helpful for debugging
problems while code was being developed. But now that code is working,
we should be able to drop some of them.
[..]
> > +static int
> > +fuse_send_removemapping(struct inode *inode,
> > + struct fuse_removemapping_in *inargp,
> > + struct fuse_removemapping_one *remove_one)
> > +{
> > + struct fuse_inode *fi = get_fuse_inode(inode);
> > + struct fuse_conn *fc = get_fuse_conn(inode);
> > + FUSE_ARGS(args);
> > +
> > + args.opcode = FUSE_REMOVEMAPPING;
> > + args.nodeid = fi->nodeid;
> > + args.in_numargs = 2;
> > + args.in_args[0].size = sizeof(*inargp);
> > + args.in_args[0].value = inargp;
> > + args.in_args[1].size = inargp->count * sizeof(*remove_one);
> > + args.in_args[1].value = remove_one;
>
> args.force = true?
FUSE_REMOVEMAPPING is an optional, nice-to-have request. Will it
help to set force?
>
> > + return fuse_simple_request(fc, &args);
> > +}
> > +
> > +static int dmap_removemapping_list(struct inode *inode, unsigned num,
> > + struct list_head *to_remove)
> > +{
> > + struct fuse_removemapping_one *remove_one, *ptr;
> > + struct fuse_removemapping_in inarg;
> > + struct fuse_dax_mapping *dmap;
> > + int ret, i = 0, nr_alloc;
> > +
> > + nr_alloc = min_t(unsigned int, num, FUSE_REMOVEMAPPING_MAX_ENTRY);
> > + remove_one = kmalloc_array(nr_alloc, sizeof(*remove_one), GFP_NOFS);
> > + if (!remove_one)
> > + return -ENOMEM;
> > +
> > + ptr = remove_one;
> > + list_for_each_entry(dmap, to_remove, list) {
> > + ptr->moffset = dmap->window_offset;
> > + ptr->len = dmap->length;
> > + ptr++;
>
> Minor nit: ptr = &remove_one[i] at the start of the section would be
> cleaner IMO.
Will do.
[..]
> > +static int iomap_begin_setup_new_mapping(struct inode *inode, loff_t pos,
> > + loff_t length, unsigned flags,
> > + struct iomap *iomap)
> > +{
> > + struct fuse_inode *fi = get_fuse_inode(inode);
> > + struct fuse_conn *fc = get_fuse_conn(inode);
> > + struct fuse_dax_mapping *dmap, *alloc_dmap = NULL;
> > + int ret;
> > + bool writable = flags & IOMAP_WRITE;
> > +
> > + alloc_dmap = alloc_dax_mapping(fc);
> > + if (!alloc_dmap)
> > + return -EBUSY;
> > +
> > + /*
> > + * Take write lock so that only one caller can try to setup mapping
> > + * and other waits.
> > + */
> > + down_write(&fi->i_dmap_sem);
> > + /*
> > + * We dropped lock. Check again if somebody else setup
> > + * mapping already.
> > + */
> > + dmap = fuse_dax_interval_tree_iter_first(&fi->dmap_tree, pos,
> > + pos);
> > + if (dmap) {
> > + fuse_fill_iomap(inode, pos, length, iomap, dmap, flags);
> > + dmap_add_to_free_pool(fc, alloc_dmap);
> > + up_write(&fi->i_dmap_sem);
> > + return 0;
> > + }
> > +
> > + /* Setup one mapping */
> > + ret = fuse_setup_one_mapping(inode,
> > + ALIGN_DOWN(pos, FUSE_DAX_MEM_RANGE_SZ),
> > + alloc_dmap, writable, false);
> > + if (ret < 0) {
> > + printk("fuse_setup_one_mapping() failed. err=%d"
> > + " pos=0x%llx, writable=%d\n", ret, pos, writable);
>
> More unnecessary noise?
Will remove.
>
> > + dmap_add_to_free_pool(fc, alloc_dmap);
> > + up_write(&fi->i_dmap_sem);
> > + return ret;
> > + }
> > + fuse_fill_iomap(inode, pos, length, iomap, alloc_dmap, flags);
> > + up_write(&fi->i_dmap_sem);
> > + return 0;
> > +}
> > +
> > +static int iomap_begin_upgrade_mapping(struct inode *inode, loff_t pos,
> > + loff_t length, unsigned flags,
> > + struct iomap *iomap)
> > +{
> > + struct fuse_inode *fi = get_fuse_inode(inode);
> > + struct fuse_dax_mapping *dmap;
> > + int ret;
> > +
> > + /*
> > + * Take exclusive lock so that only one caller can try to setup
> > + * mapping and others wait.
> > + */
> > + down_write(&fi->i_dmap_sem);
> > + dmap = fuse_dax_interval_tree_iter_first(&fi->dmap_tree, pos, pos);
> > +
> > + /* We are holding either inode lock or i_mmap_sem, and that should
> > + * ensure that dmap can't be reclaimed or truncated and it should still
> > + * be there in tree despite the fact we dropped and re-acquired the
> > + * lock.
> > + */
> > + ret = -EIO;
> > + if (WARN_ON(!dmap))
> > + goto out_err;
> > +
> > + /* Maybe another thread already upgraded mapping while we were not
> > + * holding lock.
> > + */
> > + if (dmap->writable)
> > + goto out_fill_iomap;
> > +
> > + ret = fuse_setup_one_mapping(inode,
> > + ALIGN_DOWN(pos, FUSE_DAX_MEM_RANGE_SZ),
> > + dmap, true, true);
> > + if (ret < 0) {
> > + printk("fuse_setup_one_mapping() failed. err=%d pos=0x%llx\n",
> > + ret, pos);
>
> Again.
Will remove. How about converting some of them to pr_debug() instead? It
can help with debugging if something is not working.
>
> > + goto out_err;
> > + }
> > +
> > +out_fill_iomap:
> > + fuse_fill_iomap(inode, pos, length, iomap, dmap, flags);
> > +out_err:
> > + up_write(&fi->i_dmap_sem);
> > + return ret;
> > +}
> > +
> > +/* This is just for DAX and the mapping is ephemeral, do not use it for other
> > + * purposes since there is no block device with a permanent mapping.
> > + */
> > +static int fuse_iomap_begin(struct inode *inode, loff_t pos, loff_t length,
> > + unsigned flags, struct iomap *iomap,
> > + struct iomap *srcmap)
> > +{
> > + struct fuse_inode *fi = get_fuse_inode(inode);
> > + struct fuse_conn *fc = get_fuse_conn(inode);
> > + struct fuse_dax_mapping *dmap;
> > + bool writable = flags & IOMAP_WRITE;
> > +
> > + /* We don't support FIEMAP */
> > + BUG_ON(flags & IOMAP_REPORT);
> > +
> > + pr_debug("fuse_iomap_begin() called. pos=0x%llx length=0x%llx\n",
> > + pos, length);
> > +
> > + /*
> > + * Writes beyond end of file are not handled using dax path. Instead
> > + * a fuse write message is sent to daemon
> > + */
> > + if (flags & IOMAP_WRITE && pos >= i_size_read(inode))
> > + return -EIO;
>
> Okay, this will work fine if the host filesystem is not modified by
> other entities.
This requires a slightly longer explanation. It took me a while to remember
what I did.
For file-extending writes, we do not want to go through the dax path because
we want the written data and the file size update to be atomic w.r.t. a
guest crash. So in fuse_dax_write_iter() I detect that a write is file
extending and call fuse_dax_direct_write() instead, to fall back to a
regular fuse message for the write and bypass dax.
But if a write partially overwrites existing data and the rest extends the
file, the current logic uses dax for the portion of the page being
overwritten and falls back to a fuse write message for the remaining
file-extending part. That's why, after the call to dax_iomap_rw(), I check
one more time whether some bytes were not written and use a fuse write to
extend the file.
/*
* If part of the write was file extending, fuse dax path will not
* take care of that. Do direct write instead.
*/
if (iov_iter_count(from) && file_extending_write(iocb, from)) {
count = fuse_dax_direct_write(iocb, from);
if (count < 0)
goto out;
ret += count;
}
dax_iomap_rw() will do the dax operation for the bytes which are within
i_size. Then it will call iomap_apply() again for the file-extending portion
of the write, and this time iomap_begin() will return -EIO. dax_iomap_rw()
will return the number of bytes written (and not -EIO) to the caller.
while (iov_iter_count(iter)) {
ret = iomap_apply(inode, pos, iov_iter_count(iter), flags, ops,
iter, dax_iomap_actor);
if (ret <= 0)
break;
pos += ret;
done += ret;
}
I am beginning to think that this is way more complicated than it needs
to be. Probably I should just fall back to the fuse write path whenever
any part of the write is file extending.
> What happens if there's a concurrent truncate going on on the host
> with this write?
For a regular fuse write, concurrent truncate is not a problem. But for
dax read/write/mmap, concurrent truncate is a problem. If another guest
truncates the file (after this guest has mapped a page), then any attempt
to access that page hangs the process: KVM tries to fault in a page on the
host which does not exist anymore. Currently kvm does not seem to have the
logic to deal with errors in the async page fault path, and we will have
to modify all that so that we can propagate errors (SIGBUS) to the guest
and deliver them to the process.
So if a process did mmap() and tried to access a truncated portion of the
file, it should get SIGBUS. If we are doing read/write, then we should
have the logic to deal with this error (exception table magic) and deliver
-EIO to user space.
None of that is handled right now; it is a future TODO item. So for now
this will work well only with a single guest, and we will run into issues
if we share directories with another guest.
> If the two are not in any way synchronized than
> either the two following behavior is allowed:
>
> 1) Whole or partial data in write is truncated. (If there are
> complete pages from the write being truncated, then the writing
> process will receive SIGBUS. Does KVM hande that? I remember that
> being discussed, but don't remember the conclusion).
>
> 2) Write re-extends file size.
Currently, for file-extending writes, if the other guest truncates the
file first, then the fuse write will extend the file again. If the fuse
write finishes first, then the other guest will truncate the file and
reduce its size.
I think we will have a problem when only part of the write extends the
file. In that case we are doing dax for the part of the file being
overwritten, and if the other guest truncates the file first, then kvm
will hang. But that's a problem we have not just with file-extending
writes, but with any read/write/mmap racing against a truncate by another
guest. We will have to fix that before we support virtiofs+dax for shared
directories.
>
> However EIO is not a good result, so we need to do something with it.
This -EIO is not seen by the user. dax_iomap_rw() does not return it and
instead returns the number of bytes which have been written.
[..]
> > +static ssize_t fuse_dax_write_iter(struct kiocb *iocb, struct iov_iter *from)
> > +{
> > + struct inode *inode = file_inode(iocb->ki_filp);
> > + ssize_t ret, count;
> > +
> > + if (iocb->ki_flags & IOCB_NOWAIT) {
> > + if (!inode_trylock(inode))
> > + return -EAGAIN;
> > + } else {
> > + inode_lock(inode);
> > + }
> > +
> > + ret = generic_write_checks(iocb, from);
> > + if (ret <= 0)
> > + goto out;
> > +
> > + ret = file_remove_privs(iocb->ki_filp);
> > + if (ret)
> > + goto out;
> > + /* TODO file_update_time() but we don't want metadata I/O */
> > +
> > + /* Do not use dax for file extending writes as its an mmap and
> > + * trying to write beyond end of existing page will generate
> > + * SIGBUS.
>
> Ah, here it is. So what happens in case of a race? Does that
> currently crash KVM?
In case of a race, yes, KVM hangs. So no shared directory operation yet,
till we have designed proper error handling in the kvm path.
Thanks
Vivek
On Thu, Mar 12, 2020 at 5:02 PM Vivek Goyal <[email protected]> wrote:
>
> On Thu, Mar 12, 2020 at 10:43:10AM +0100, Miklos Szeredi wrote:
> > On Wed, Mar 4, 2020 at 5:59 PM Vivek Goyal <[email protected]> wrote:
> > >
> > > This patch implements basic DAX support. mmap() is not implemented
> > > yet and will come in later patches. This patch looks into implementing
> > > read/write.
> > >
> > > We make use of interval tree to keep track of per inode dax mappings.
> > >
> > > Do not use dax for file extending writes, instead just send WRITE message
> > > to daemon (like we do for direct I/O path). This will keep write and
> > > i_size change atomic w.r.t crash.
> > >
> > > Signed-off-by: Stefan Hajnoczi <[email protected]>
> > > Signed-off-by: Dr. David Alan Gilbert <[email protected]>
> > > Signed-off-by: Vivek Goyal <[email protected]>
> > > Signed-off-by: Miklos Szeredi <[email protected]>
> > > Signed-off-by: Liu Bo <[email protected]>
> > > Signed-off-by: Peng Tao <[email protected]>
> > > ---
> > > fs/fuse/file.c | 597 +++++++++++++++++++++++++++++++++++++-
> > > fs/fuse/fuse_i.h | 23 ++
> > > fs/fuse/inode.c | 6 +
> > > include/uapi/linux/fuse.h | 1 +
> > > 4 files changed, 621 insertions(+), 6 deletions(-)
> > >
> > > diff --git a/fs/fuse/file.c b/fs/fuse/file.c
> > > index 9d67b830fb7a..9effdd3dc6d6 100644
> > > --- a/fs/fuse/file.c
> > > +++ b/fs/fuse/file.c
> > > @@ -18,6 +18,12 @@
> > > #include <linux/swap.h>
> > > #include <linux/falloc.h>
> > > #include <linux/uio.h>
> > > +#include <linux/dax.h>
> > > +#include <linux/iomap.h>
> > > +#include <linux/interval_tree_generic.h>
> > > +
> > > +INTERVAL_TREE_DEFINE(struct fuse_dax_mapping, rb, __u64, __subtree_last,
> > > + START, LAST, static inline, fuse_dax_interval_tree);
> >
> > Are you using this because of byte ranges (u64)? Does it not make
> > more sense to use page offsets, which are unsigned long and so fit
> > nicely into the generic interval tree?
>
> I think I should be able to use generic interval tree. I will switch
> to that.
>
> [..]
> > > +/* offset passed in should be aligned to FUSE_DAX_MEM_RANGE_SZ */
> > > +static int fuse_setup_one_mapping(struct inode *inode, loff_t offset,
> > > + struct fuse_dax_mapping *dmap, bool writable,
> > > + bool upgrade)
> > > +{
> > > + struct fuse_conn *fc = get_fuse_conn(inode);
> > > + struct fuse_inode *fi = get_fuse_inode(inode);
> > > + struct fuse_setupmapping_in inarg;
> > > + FUSE_ARGS(args);
> > > + ssize_t err;
> > > +
> > > + WARN_ON(offset % FUSE_DAX_MEM_RANGE_SZ);
> > > + WARN_ON(fc->nr_free_ranges < 0);
> > > +
> > > + /* Ask fuse daemon to setup mapping */
> > > + memset(&inarg, 0, sizeof(inarg));
> > > + inarg.foffset = offset;
> > > + inarg.fh = -1;
> > > + inarg.moffset = dmap->window_offset;
> > > + inarg.len = FUSE_DAX_MEM_RANGE_SZ;
> > > + inarg.flags |= FUSE_SETUPMAPPING_FLAG_READ;
> > > + if (writable)
> > > + inarg.flags |= FUSE_SETUPMAPPING_FLAG_WRITE;
> > > + args.opcode = FUSE_SETUPMAPPING;
> > > + args.nodeid = fi->nodeid;
> > > + args.in_numargs = 1;
> > > + args.in_args[0].size = sizeof(inarg);
> > > + args.in_args[0].value = &inarg;
> >
> > args.force = true?
>
> I can do that but I am not sure what exactly does args.force do and
> why do we need it in this case.
Hm, it prevents interrupts. Looking closely, however it will only
prevent SIGKILL from immediately interrupting the request, otherwise
it will send an INTERRUPT request and the filesystem can ignore that.
Might make sense to have a args.nonint flag to prevent the sending of
INTERRUPT...
> First thing it does is that request is allocated with flag __GFP_NOFAIL.
> Second thing it does is that caller is forced to wait for request
> completion and its not an interruptible sleep.
>
> I am wondering what makes FUSE_SETUPMAPING/FUSE_REMOVEMAPPING requests
> special that we need to set force flag.
Maybe not for SETUPMAPPING (I was confused by the error log).
However if REMOVEMAPPING fails for some reason, then that dax mapping
will be leaked for the lifetime of the filesystem. Or am I
misunderstanding it?
> > > + ret = fuse_setup_one_mapping(inode,
> > > + ALIGN_DOWN(pos, FUSE_DAX_MEM_RANGE_SZ),
> > > + dmap, true, true);
> > > + if (ret < 0) {
> > > + printk("fuse_setup_one_mapping() failed. err=%d pos=0x%llx\n",
> > > + ret, pos);
> >
> > Again.
>
> Will remove. How about converting some of them to pr_debug() instead? It
> can help with debugging if something is not working.
Okay, and please move it to fuse_setup_one_mapping() where there's
already a pr_debug() for the success case.
> > +
> > > + /* Do not use dax for file extending writes as its an mmap and
> > > + * trying to write beyond end of existing page will generate
> > > + * SIGBUS.
> >
> > Ah, here it is. So what happens in case of a race? Does that
> > currently crash KVM?
>
> In case of race, yes, KVM hangs. So no shared directory operation yet
> till we have designed proper error handling in kvm path.
I think before this is merged we have to fix the KVM crash; that's not
acceptable even if we explicitly say that shared directory is not
supported for the time being.
Thanks,
Miklos
On Fri, Mar 13, 2020 at 11:18:15AM +0100, Miklos Szeredi wrote:
[..]
> > > > +/* offset passed in should be aligned to FUSE_DAX_MEM_RANGE_SZ */
> > > > +static int fuse_setup_one_mapping(struct inode *inode, loff_t offset,
> > > > + struct fuse_dax_mapping *dmap, bool writable,
> > > > + bool upgrade)
> > > > +{
> > > > + struct fuse_conn *fc = get_fuse_conn(inode);
> > > > + struct fuse_inode *fi = get_fuse_inode(inode);
> > > > + struct fuse_setupmapping_in inarg;
> > > > + FUSE_ARGS(args);
> > > > + ssize_t err;
> > > > +
> > > > + WARN_ON(offset % FUSE_DAX_MEM_RANGE_SZ);
> > > > + WARN_ON(fc->nr_free_ranges < 0);
> > > > +
> > > > + /* Ask fuse daemon to setup mapping */
> > > > + memset(&inarg, 0, sizeof(inarg));
> > > > + inarg.foffset = offset;
> > > > + inarg.fh = -1;
> > > > + inarg.moffset = dmap->window_offset;
> > > > + inarg.len = FUSE_DAX_MEM_RANGE_SZ;
> > > > + inarg.flags |= FUSE_SETUPMAPPING_FLAG_READ;
> > > > + if (writable)
> > > > + inarg.flags |= FUSE_SETUPMAPPING_FLAG_WRITE;
> > > > + args.opcode = FUSE_SETUPMAPPING;
> > > > + args.nodeid = fi->nodeid;
> > > > + args.in_numargs = 1;
> > > > + args.in_args[0].size = sizeof(inarg);
> > > > + args.in_args[0].value = &inarg;
> > >
> > > args.force = true?
> >
> > I can do that but I am not sure what exactly does args.force do and
> > why do we need it in this case.
>
> Hm, it prevents interrupts. Looking closely, however it will only
> prevent SIGKILL from immediately interrupting the request, otherwise
> it will send an INTERRUPT request and the filesystem can ignore that.
> Might make sense to have a args.nonint flag to prevent the sending of
> INTERRUPT...
Hi Miklos,
virtiofs does not support interrupt requests yet. Its fiq interrupt
handler just does not do anything.
static void virtio_fs_wake_interrupt_and_unlock(struct fuse_iqueue *fiq)
__releases(fiq->lock)
{
/*
* TODO interrupts.
*
* Normal fs operations on a local filesystems aren't interruptible.
* Exceptions are blocking lock operations; for example fcntl(F_SETLKW)
* with shared lock between host and guest.
*/
spin_unlock(&fiq->lock);
}
So as of now, setting force or not will not make any difference; we will
still end up waiting for the request to finish.
In fact, I think there is no mechanism to set fc->no_interrupt in
virtio_fs. If I am reading request_wait_answer() correctly, it will
see that fc->no_interrupt is not set. That means the filesystem supports
interrupt requests, so it will do wait_event_interruptible() and
not even check the FR_FORCE bit.
Right now fc->no_interrupt is set in response to an INTERRUPT request
reply. Would it make sense to also be able to set it as part of the
connection negotiation protocol, so that a filesystem can say up front
that it does not support interrupts and virtiofs can make use of that?
So the force flag is only useful if the filesystem does not support
interrupts; in that case we do wait_event_killable() and, upon receiving
SIGKILL, cancel the request if it is still in the pending queue. For
virtiofs, we take the request out of the fiq->pending queue in the
submission path itself, and if it can't be dispatched it waits on a
virtiofs-specific queue with FR_PENDING cleared. That means setting
FR_FORCE for virtiofs does not mean anything, as the caller will end up
waiting for the request to finish anyway.
IOW, setting FR_FORCE will make sense once we have a mechanism to detect
that a request is still queued in the virtiofs queues and a mechanism to
cancel it. We don't have that. In fact, given that we are a push model, we
dispatch requests to the filesystem immediately unless the virtqueue is
full, so the probability of a request still sitting in a virtiofs queue
is low.
So maybe we can start setting force at some point later, when we have a
mechanism to detect and cancel pending requests in virtiofs.
>
> > First thing it does is that request is allocated with flag __GFP_NOFAIL.
> > Second thing it does is that caller is forced to wait for request
> > completion and its not an interruptible sleep.
> >
> > I am wondering what makes FUSE_SETUPMAPING/FUSE_REMOVEMAPPING requests
> > special that we need to set force flag.
>
> Maybe not for SETUPMAPPING (I was confused by the error log).
>
> However if REMOVEMAPPING fails for some reason, than that dax mapping
> will be leaked for the lifetime of the filesystem. Or am I
> misunderstanding it?
FUSE_REMOVEMAPPING is not a must. If we send another FUSE_SETUPMAPPING,
it will create the new mapping and free up the resources associated with
the previous mapping, IIUC.
So at one point we were wondering what the point of sending
FUSE_REMOVEMAPPING is at all. It helps with freeing up filesystem
resources earlier: if the cache size is big, there will not be much
reclaim activity going on, and if we don't send FUSE_REMOVEMAPPING, all
these filesystem resources will remain busy on the host for a long time.
>
> > > > + ret = fuse_setup_one_mapping(inode,
> > > > + ALIGN_DOWN(pos, FUSE_DAX_MEM_RANGE_SZ),
> > > > + dmap, true, true);
> > > > + if (ret < 0) {
> > > > + printk("fuse_setup_one_mapping() failed. err=%d pos=0x%llx\n",
> > > > + ret, pos);
> > >
> > > Again.
> >
> > Will remove. How about converting some of them to pr_debug() instead? It
> > can help with debugging if something is not working.
>
> Okay, and please move it to fuse_setup_one_mapping() where there's
> already a pr_debug() for the success case.
Will do.
>
> > > +
> > > > + /* Do not use dax for file extending writes as its an mmap and
> > > > + * trying to write beyong end of existing page will generate
> > > > + * SIGBUS.
> > >
> > > Ah, here it is. So what happens in case of a race? Does that
> > > currently crash KVM?
> >
> > In case of race, yes, KVM hangs. So no shared directory operation yet
> > till we have designed proper error handling in kvm path.
>
> I think before this is merged we have to fix the KVM crash; that's not
> acceptable even if we explicitly say that shared directory is not
> supported for the time being.
Ok, I will look into it. I had done some work on this in the past and
realized it's not trivial to fix the kvm error paths. There are no users
of that path, and propagating signals back into qemu instances and finding
the right process is going to be tricky.
Given the complexity of that work, I thought that for now we say that
shared directories are not supported and, once the basic dax patches get
merged, focus on the kvm work.
Thanks
Vivek
On Wed, Mar 11, 2020 at 02:38:03PM +0100, Patrick Ohly wrote:
> Vivek Goyal <[email protected]> writes:
> > This patch series adds DAX support to virtiofs filesystem. This allows
> > bypassing guest page cache and allows mapping host page cache directly
> > in guest address space.
> >
> > When a page of file is needed, guest sends a request to map that page
> > (in host page cache) in qemu address space. Inside guest this is
> > a physical memory range controlled by virtiofs device. And guest
> > directly maps this physical address range using DAX and hence gets
> > access to file data on host.
> >
> > This can speed up things considerably in many situations. Also this
> > can result in substantial memory savings as file data does not have
> > to be copied in guest and it is directly accessed from host page
> > cache.
>
> As a potential user of this, let me make sure I understand the expected
> outcome: is the goal to let virtiofs use DAX (for increased performance,
> etc.) or also let applications that use virtiofs use DAX?
>
> You are mentioning using the host's page cache, so it's probably the
> former and MAP_SYNC on virtiofs will continue to be rejected, right?
Hi Patrick,
You are right, it's the former. That is, we want virtiofs to be able to
make use of DAX to bypass the guest page cache. But there is no persistent
memory, so no persistent memory programming semantics are available to
user space. For that, I guess, we have virtio-pmem.
We expect users to issue fsync/msync like on a regular filesystem to
make changes persistent. So in that respect, rejecting MAP_SYNC makes
sense. I will test and see whether the current code is rejecting MAP_SYNC
or not.
Thanks
Vivek
Vivek Goyal <[email protected]> writes:
> We expect users will issue fsync/msync like a regular filesystem to
> make changes persistent. So in that aspect, rejecting MAP_SYNC
> makes sense. I will test and see if current code is rejecting MAP_SYNC
> or not.
Last time I checked, it did. Here's the test program that I wrote for
that:
https://github.com/intel/pmem-csi/blob/ee3200794a1ade49a02df6f359a134115b409e90/test/cmd/pmem-dax-check/main.go
--
Best Regards
Patrick Ohly
On Wed, Mar 04, 2020 at 11:58:45AM -0500, Vivek Goyal wrote:
> Add logic to free up a busy memory range. Freed memory range will be
> returned to free pool. Add a worker which can be started to select
> and free some busy memory ranges.
>
> Process can also steal one of its busy dax ranges if a free range is not
> available. I will refer to it as direct reclaim.
>
> If a free range is not available and nothing can be stolen from the same
> inode, the caller waits on a waitq for a free range to become available.
>
> For reclaiming a range, as of now we need to hold following locks in
> specified order.
>
> down_write(&fi->i_mmap_sem);
> down_write(&fi->i_dmap_sem);
>
> We look for a free range in following order.
>
> A. Try to get a free range.
> B. If not, try direct reclaim.
> C. If not, wait for a memory range to become free
>
> Signed-off-by: Vivek Goyal <[email protected]>
> Signed-off-by: Liu Bo <[email protected]>
> ---
> fs/fuse/file.c | 450 ++++++++++++++++++++++++++++++++++++++++++++++-
> fs/fuse/fuse_i.h | 25 +++
> fs/fuse/inode.c | 5 +
> 3 files changed, 473 insertions(+), 7 deletions(-)
>
> diff --git a/fs/fuse/file.c b/fs/fuse/file.c
> index 8b264fcb9b3c..61ae2ddeef55 100644
> --- a/fs/fuse/file.c
> +++ b/fs/fuse/file.c
> @@ -8,6 +8,7 @@
>
> #include "fuse_i.h"
>
> +#include <linux/delay.h>
> #include <linux/pagemap.h>
> #include <linux/slab.h>
> #include <linux/kernel.h>
> @@ -37,6 +38,8 @@ static struct page **fuse_pages_alloc(unsigned int npages, gfp_t flags,
> return pages;
> }
>
> +static struct fuse_dax_mapping *alloc_dax_mapping_reclaim(struct fuse_conn *fc,
> + struct inode *inode, bool fault);
> static int fuse_send_open(struct fuse_conn *fc, u64 nodeid, struct file *file,
> int opcode, struct fuse_open_out *outargp)
> {
> @@ -193,6 +196,28 @@ static void fuse_link_write_file(struct file *file)
> spin_unlock(&fi->lock);
> }
>
> +static void
> +__kick_dmap_free_worker(struct fuse_conn *fc, unsigned long delay_ms)
> +{
> + unsigned long free_threshold;
> +
> + /* If number of free ranges are below threshold, start reclaim */
> + free_threshold = max((fc->nr_ranges * FUSE_DAX_RECLAIM_THRESHOLD)/100,
> + (unsigned long)1);
> + if (fc->nr_free_ranges < free_threshold) {
> + pr_debug("fuse: Kicking dax memory reclaim worker. nr_free_ranges=0x%ld nr_total_ranges=%ld\n", fc->nr_free_ranges, fc->nr_ranges);
> + queue_delayed_work(system_long_wq, &fc->dax_free_work,
> + msecs_to_jiffies(delay_ms));
> + }
> +}
> +
> +static void kick_dmap_free_worker(struct fuse_conn *fc, unsigned long delay_ms)
> +{
> + spin_lock(&fc->lock);
> + __kick_dmap_free_worker(fc, delay_ms);
> + spin_unlock(&fc->lock);
> +}
> +
> static struct fuse_dax_mapping *alloc_dax_mapping(struct fuse_conn *fc)
> {
> struct fuse_dax_mapping *dmap = NULL;
> @@ -201,7 +226,7 @@ static struct fuse_dax_mapping *alloc_dax_mapping(struct fuse_conn *fc)
>
> if (fc->nr_free_ranges <= 0) {
> spin_unlock(&fc->lock);
> - return NULL;
> + goto out_kick;
> }
>
> WARN_ON(list_empty(&fc->free_ranges));
> @@ -212,6 +237,9 @@ static struct fuse_dax_mapping *alloc_dax_mapping(struct fuse_conn *fc)
> list_del_init(&dmap->list);
> fc->nr_free_ranges--;
> spin_unlock(&fc->lock);
> +
> +out_kick:
> + kick_dmap_free_worker(fc, 0);
> return dmap;
> }
>
> @@ -238,6 +266,7 @@ static void __dmap_add_to_free_pool(struct fuse_conn *fc,
> {
> list_add_tail(&dmap->list, &fc->free_ranges);
> fc->nr_free_ranges++;
> + wake_up(&fc->dax_range_waitq);
> }
>
> static void dmap_add_to_free_pool(struct fuse_conn *fc,
> @@ -289,6 +318,12 @@ static int fuse_setup_one_mapping(struct inode *inode, loff_t offset,
>
> dmap->writable = writable;
> if (!upgrade) {
> + /*
> + * We don't take a reference on inode. inode is valid right now
> + * and when inode is going away, cleanup logic should first
> + * cleanup dmap entries.
> + */
> + dmap->inode = inode;
> dmap->start = offset;
> dmap->end = offset + FUSE_DAX_MEM_RANGE_SZ - 1;
> /* Protected by fi->i_dmap_sem */
> @@ -368,6 +403,7 @@ static void dmap_reinit_add_to_free_pool(struct fuse_conn *fc,
> "window_offset=0x%llx length=0x%llx\n", dmap->start,
> dmap->end, dmap->window_offset, dmap->length);
> __dmap_remove_busy_list(fc, dmap);
> + dmap->inode = NULL;
> dmap->start = dmap->end = 0;
> __dmap_add_to_free_pool(fc, dmap);
> }
> @@ -386,7 +422,8 @@ static void inode_reclaim_dmap_range(struct fuse_conn *fc, struct inode *inode,
> int err, num = 0;
> LIST_HEAD(to_remove);
>
> - pr_debug("fuse: %s: start=0x%llx, end=0x%llx\n", __func__, start, end);
> + pr_debug("fuse: %s: inode=0x%px start=0x%llx, end=0x%llx\n", __func__,
> + inode, start, end);
>
> /*
> * Interval tree search matches intersecting entries. Adjust the range
> @@ -400,6 +437,8 @@ static void inode_reclaim_dmap_range(struct fuse_conn *fc, struct inode *inode,
> end);
> if (!dmap)
> break;
> + /* inode is going away. There should not be any users of dmap */
> + WARN_ON(refcount_read(&dmap->refcnt) > 1);
> fuse_dax_interval_tree_remove(dmap, &fi->dmap_tree);
> num++;
> list_add(&dmap->list, &to_remove);
> @@ -434,6 +473,21 @@ static void inode_reclaim_dmap_range(struct fuse_conn *fc, struct inode *inode,
> spin_unlock(&fc->lock);
> }
>
> +static int dmap_removemapping_one(struct inode *inode,
> + struct fuse_dax_mapping *dmap)
> +{
> + struct fuse_removemapping_one forget_one;
> + struct fuse_removemapping_in inarg;
> +
> + memset(&inarg, 0, sizeof(inarg));
> + inarg.count = 1;
> + memset(&forget_one, 0, sizeof(forget_one));
> + forget_one.moffset = dmap->window_offset;
> + forget_one.len = dmap->length;
> +
> + return fuse_send_removemapping(inode, &inarg, &forget_one);
> +}
> +
> /*
> * It is called from evict_inode() and by that time inode is going away. So
> * this function does not take any locks like fi->i_dmap_sem for traversing
> @@ -1903,6 +1957,17 @@ static void fuse_fill_iomap(struct inode *inode, loff_t pos, loff_t length,
> if (flags & IOMAP_FAULT)
> iomap->length = ALIGN(len, PAGE_SIZE);
> iomap->type = IOMAP_MAPPED;
> + /*
> + * increase refcnt so that reclaim code knows this dmap is in
> + * use. This assumes i_dmap_sem mutex is held either
> + * shared/exclusive.
> + */
> + refcount_inc(&dmap->refcnt);
> +
> + /* iomap->private should be NULL */
> + WARN_ON_ONCE(iomap->private);
> + iomap->private = dmap;
> +
> pr_debug("%s: returns iomap: addr 0x%llx offset 0x%llx"
> " length 0x%llx\n", __func__, iomap->addr,
> iomap->offset, iomap->length);
> @@ -1925,8 +1990,12 @@ static int iomap_begin_setup_new_mapping(struct inode *inode, loff_t pos,
> int ret;
> bool writable = flags & IOMAP_WRITE;
>
> - alloc_dmap = alloc_dax_mapping(fc);
> - if (!alloc_dmap)
> + alloc_dmap = alloc_dax_mapping_reclaim(fc, inode, flags & IOMAP_FAULT);
> + if (IS_ERR(alloc_dmap))
> + return PTR_ERR(alloc_dmap);
> +
> + /* If we are here, we should have memory allocated */
> + if (WARN_ON(!alloc_dmap))
> return -EBUSY;
>
> /*
> @@ -1979,14 +2048,25 @@ static int iomap_begin_upgrade_mapping(struct inode *inode, loff_t pos,
> dmap = fuse_dax_interval_tree_iter_first(&fi->dmap_tree, pos, pos);
>
> /* We are holding either inode lock or i_mmap_sem, and that should
> - * ensure that dmap can't reclaimed or truncated and it should still
> - * be there in tree despite the fact we dropped and re-acquired the
> - * lock.
> + * ensure that dmap can't be truncated. We are holding a reference
> + * on dmap and that should make sure it can't be reclaimed. So dmap
> + * should still be there in tree despite the fact we dropped and
> + * re-acquired the i_dmap_sem lock.
> */
> ret = -EIO;
> if (WARN_ON(!dmap))
> goto out_err;
>
> + /* We took an extra reference on dmap to make sure it's not reclaimed.
> + * Now we hold i_dmap_sem lock and that reference is not needed
> + * anymore. Drop it.
> + */
> + if (refcount_dec_and_test(&dmap->refcnt)) {
> + /* refcount should not hit 0. This object only goes
> + * away when fuse connection goes away */
> + WARN_ON_ONCE(1);
> + }
> +
> /* Maybe another thread already upgraded mapping while we were not
> * holding lock.
> */
> @@ -2056,7 +2136,11 @@ static int fuse_iomap_begin(struct inode *inode, loff_t pos, loff_t length,
> * two threads to be trying to this simultaneously
> * for same dmap. So drop shared lock and acquire
> * exclusive lock.
> + *
> + * Before dropping i_dmap_sem lock, take reference
> + * on dmap so that its not freed by range reclaim.
> */
> + refcount_inc(&dmap->refcnt);
> up_read(&fi->i_dmap_sem);
> pr_debug("%s: Upgrading mapping at offset 0x%llx"
> " length 0x%llx\n", __func__, pos, length);
> @@ -2092,6 +2176,16 @@ static int fuse_iomap_end(struct inode *inode, loff_t pos, loff_t length,
> ssize_t written, unsigned flags,
> struct iomap *iomap)
> {
> + struct fuse_dax_mapping *dmap = iomap->private;
> +
> + if (dmap) {
> + if (refcount_dec_and_test(&dmap->refcnt)) {
> + /* refcount should not hit 0. This object only goes
> + * away when fuse connection goes away */
> + WARN_ON_ONCE(1);
> + }
> + }
> +
> /* DAX writes beyond end-of-file aren't handled using iomap, so the
> * file size is unchanged and there is nothing to do here.
> */
> @@ -4103,3 +4197,345 @@ void fuse_init_file_inode(struct inode *inode)
> inode->i_data.a_ops = &fuse_dax_file_aops;
> }
> }
> +
> +static int dmap_writeback_invalidate(struct inode *inode,
> + struct fuse_dax_mapping *dmap)
> +{
> + int ret;
> +
> + ret = filemap_fdatawrite_range(inode->i_mapping, dmap->start,
> + dmap->end);
> + if (ret) {
> + printk("filemap_fdatawrite_range() failed. err=%d start=0x%llx,"
> + " end=0x%llx\n", ret, dmap->start, dmap->end);
> + return ret;
> + }
> +
> + ret = invalidate_inode_pages2_range(inode->i_mapping,
> + dmap->start >> PAGE_SHIFT,
> + dmap->end >> PAGE_SHIFT);
> + if (ret)
> + printk("invalidate_inode_pages2_range() failed err=%d\n", ret);
> +
> + return ret;
> +}
> +
> +static int reclaim_one_dmap_locked(struct fuse_conn *fc, struct inode *inode,
> + struct fuse_dax_mapping *dmap)
> +{
> + int ret;
> + struct fuse_inode *fi = get_fuse_inode(inode);
> +
> + /*
> + * igrab() was done to make sure inode won't go under us, and this
> + * further avoids the race with evict().
> + */
> + ret = dmap_writeback_invalidate(inode, dmap);
> + if (ret)
> + return ret;
> +
> + /* Remove dax mapping from inode interval tree now */
> + fuse_dax_interval_tree_remove(dmap, &fi->dmap_tree);
> + fi->nr_dmaps--;
> +
> + /* It is possible that umount/shutdown has killed the fuse connection
> + * and worker thread is trying to reclaim memory in parallel. So check
> + * if connection is still up or not otherwise don't send removemapping
> + * message.
> + */
> + if (fc->connected) {
> + ret = dmap_removemapping_one(inode, dmap);
> + if (ret) {
> + pr_warn("Failed to remove mapping. offset=0x%llx"
> + " len=0x%llx ret=%d\n", dmap->window_offset,
> + dmap->length, ret);
> + }
> + }
> + return 0;
> +}
> +
> +static void fuse_wait_dax_page(struct inode *inode)
> +{
> + struct fuse_inode *fi = get_fuse_inode(inode);
> +
> + up_write(&fi->i_mmap_sem);
> + schedule();
> + down_write(&fi->i_mmap_sem);
> +}
> +
> +/* Should be called with fi->i_mmap_sem lock held exclusively */
> +static int __fuse_break_dax_layouts(struct inode *inode, bool *retry,
> + loff_t start, loff_t end)
> +{
> + struct page *page;
> +
> + page = dax_layout_busy_page_range(inode->i_mapping, start, end);
> + if (!page)
> + return 0;
> +
> + *retry = true;
> + return ___wait_var_event(&page->_refcount,
> + atomic_read(&page->_refcount) == 1, TASK_INTERRUPTIBLE,
> + 0, 0, fuse_wait_dax_page(inode));
> +}
> +
> +/* dmap_end == 0 leads to unmapping of whole file */
> +static int fuse_break_dax_layouts(struct inode *inode, u64 dmap_start,
> + u64 dmap_end)
> +{
> + bool retry;
> + int ret;
> +
> + do {
> + retry = false;
> + ret = __fuse_break_dax_layouts(inode, &retry, dmap_start,
> + dmap_end);
> + } while (ret == 0 && retry);
> +
> + return ret;
> +}
> +
> +/* Find first mapping in the tree and free it. */
> +static struct fuse_dax_mapping *
> +inode_reclaim_one_dmap_locked(struct fuse_conn *fc, struct inode *inode)
> +{
> + struct fuse_inode *fi = get_fuse_inode(inode);
> + struct fuse_dax_mapping *dmap;
> + int ret;
> +
> + for (dmap = fuse_dax_interval_tree_iter_first(&fi->dmap_tree, 0, -1);
> + dmap;
> + dmap = fuse_dax_interval_tree_iter_next(dmap, 0, -1)) {
> + /* still in use. */
> + if (refcount_read(&dmap->refcnt) > 1)
> + continue;
> +
> + ret = reclaim_one_dmap_locked(fc, inode, dmap);
> + if (ret < 0)
> + return ERR_PTR(ret);
> +
> + /* Clean up dmap. Do not add back to free list */
> + dmap_remove_busy_list(fc, dmap);
> + dmap->inode = NULL;
> + dmap->start = dmap->end = 0;
> +
> + pr_debug("fuse: %s: reclaimed memory range. inode=%px,"
> + " window_offset=0x%llx, length=0x%llx\n", __func__,
> + inode, dmap->window_offset, dmap->length);
> + return dmap;
> + }
> +
> + return NULL;
> +}
> +
> +/*
> + * Find first mapping in the tree and free it and return it. Do not add
> + * it back to free pool. If fault == true, this function should be called
> + * with fi->i_mmap_sem held.
> + */
> +static struct fuse_dax_mapping *inode_reclaim_one_dmap(struct fuse_conn *fc,
> + struct inode *inode,
> + bool fault)
> +{
> + struct fuse_inode *fi = get_fuse_inode(inode);
> + struct fuse_dax_mapping *dmap;
> + int ret;
> +
> + if (!fault)
> + down_write(&fi->i_mmap_sem);
> +
> + /*
> + * Make sure there are no references to inode pages using
> + * get_user_pages()
> + */
> + ret = fuse_break_dax_layouts(inode, 0, 0);
Hi Vivek,
This patch is enabling inline reclaim for the fault path, but the fault path
already holds a locked exceptional entry which I believe the above
fuse_break_dax_layouts() needs to wait for. Can you please elaborate
on how this can be avoided?
thanks,
liubo
> + if (ret) {
> + printk("virtio_fs: fuse_break_dax_layouts() failed. err=%d\n",
> + ret);
> + dmap = ERR_PTR(ret);
> + goto out_mmap_sem;
> + }
> + down_write(&fi->i_dmap_sem);
> + dmap = inode_reclaim_one_dmap_locked(fc, inode);
> + up_write(&fi->i_dmap_sem);
> +out_mmap_sem:
> + if (!fault)
> + up_write(&fi->i_mmap_sem);
> + return dmap;
> +}
> +
> +/* If fault == true, it should be called with fi->i_mmap_sem locked */
> +static struct fuse_dax_mapping *alloc_dax_mapping_reclaim(struct fuse_conn *fc,
> + struct inode *inode, bool fault)
> +{
> + struct fuse_dax_mapping *dmap;
> + struct fuse_inode *fi = get_fuse_inode(inode);
> +
> + while (1) {
> + dmap = alloc_dax_mapping(fc);
> + if (dmap)
> + return dmap;
> +
> + if (fi->nr_dmaps) {
> + dmap = inode_reclaim_one_dmap(fc, inode, fault);
> + if (dmap)
> + return dmap;
> + /* If we could not reclaim a mapping because it
> + * had a reference, that should be a temporary
> + * situation. Try again.
> + */
> + msleep(1);
> + continue;
> + }
> + /*
> + * There are no mappings which can be reclaimed.
> + * Wait for one.
> + */
> + if (!(fc->nr_free_ranges > 0)) {
> + if (wait_event_killable_exclusive(fc->dax_range_waitq,
> + (fc->nr_free_ranges > 0)))
> + return ERR_PTR(-EINTR);
> + }
> + }
> +}
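For readers, the allocation fallback order above can be modeled in userspace C. This is a sketch with hypothetical names; the real code additionally sleeps between retries and blocks on fc->dax_range_waitq, which this model omits:

```c
#include <stddef.h>

/* Userspace model of alloc_dax_mapping_reclaim(): take from the
 * global free pool first, then fall back to reclaiming one of the
 * inode's own ranges, else report that the caller must wait. */
struct pool {
    int nr_free;          /* models fc->nr_free_ranges */
    int nr_inode_ranges;  /* models fi->nr_dmaps */
};

enum source { SRC_FREE_POOL, SRC_RECLAIM, SRC_NONE };

static enum source alloc_range(struct pool *p)
{
    if (p->nr_free > 0) {
        p->nr_free--;
        return SRC_FREE_POOL;
    }
    if (p->nr_inode_ranges > 0) {
        p->nr_inode_ranges--;   /* inline reclaim of our own mapping */
        return SRC_RECLAIM;
    }
    return SRC_NONE;            /* real code waits on the waitq here */
}
```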
> +
> +static int lookup_and_reclaim_dmap_locked(struct fuse_conn *fc,
> + struct inode *inode, u64 dmap_start)
> +{
> + int ret;
> + struct fuse_inode *fi = get_fuse_inode(inode);
> + struct fuse_dax_mapping *dmap;
> +
> + /* Find the fuse dax mapping at file offset dmap_start. */
> + dmap = fuse_dax_interval_tree_iter_first(&fi->dmap_tree, dmap_start,
> + dmap_start);
> +
> + /* Range already got cleaned up by somebody else */
> + if (!dmap)
> + return 0;
> +
> + /* still in use. */
> + if (refcount_read(&dmap->refcnt) > 1)
> + return 0;
> +
> + ret = reclaim_one_dmap_locked(fc, inode, dmap);
> + if (ret < 0)
> + return ret;
> +
> + /* Cleanup dmap entry and add back to free list */
> + spin_lock(&fc->lock);
> + dmap_reinit_add_to_free_pool(fc, dmap);
> + spin_unlock(&fc->lock);
> + return ret;
> +}
> +
> +/*
> + * Free a range of memory.
> + * Locking.
> + * 1. Take fuse_inode->i_mmap_sem to block dax faults.
> + * 2. Take fuse_inode->i_dmap_sem to protect interval tree and also to make
> + * sure read/write can not reuse a dmap which we might be freeing.
> + */
> +static int lookup_and_reclaim_dmap(struct fuse_conn *fc, struct inode *inode,
> + u64 dmap_start, u64 dmap_end)
> +{
> + int ret;
> + struct fuse_inode *fi = get_fuse_inode(inode);
> +
> + down_write(&fi->i_mmap_sem);
> + ret = fuse_break_dax_layouts(inode, dmap_start, dmap_end);
> + if (ret) {
> + printk("virtio_fs: fuse_break_dax_layouts() failed. err=%d\n",
> + ret);
> + goto out_mmap_sem;
> + }
> +
> + down_write(&fi->i_dmap_sem);
> + ret = lookup_and_reclaim_dmap_locked(fc, inode, dmap_start);
> + up_write(&fi->i_dmap_sem);
> +out_mmap_sem:
> + up_write(&fi->i_mmap_sem);
> + return ret;
> +}
> +
> +static int try_to_free_dmap_chunks(struct fuse_conn *fc,
> + unsigned long nr_to_free)
> +{
> + struct fuse_dax_mapping *dmap, *pos, *temp;
> + int ret, nr_freed = 0;
> + u64 dmap_start = 0, window_offset = 0, dmap_end = 0;
> + struct inode *inode = NULL;
> +
> + /* Pick the first busy range and free it for now */
> + while (1) {
> + if (nr_freed >= nr_to_free)
> + break;
> +
> + dmap = NULL;
> + spin_lock(&fc->lock);
> +
> + if (!fc->nr_busy_ranges) {
> + spin_unlock(&fc->lock);
> + return 0;
> + }
> +
> + list_for_each_entry_safe(pos, temp, &fc->busy_ranges,
> + busy_list) {
> + /* skip this range if it's in use. */
> + if (refcount_read(&pos->refcnt) > 1)
> + continue;
> +
> + inode = igrab(pos->inode);
> + /*
> + * This inode is going away. That will free
> + * up all the ranges anyway, continue to
> + * next range.
> + */
> + if (!inode)
> + continue;
> + /*
> + * Take this element off the list and add it to the
> + * tail. If this element can't be freed, this helps
> + * with selecting a new element in the next iteration
> + * of the loop.
> + */
> + dmap = pos;
> + list_move_tail(&dmap->busy_list, &fc->busy_ranges);
> + dmap_start = dmap->start;
> + dmap_end = dmap->end;
> + window_offset = dmap->window_offset;
> + break;
> + }
> + spin_unlock(&fc->lock);
> + if (!dmap)
> + return 0;
> +
> + ret = lookup_and_reclaim_dmap(fc, inode, dmap_start, dmap_end);
> + iput(inode);
> + if (ret) {
> + printk("%s(window_offset=0x%llx) failed. err=%d\n",
> + __func__, window_offset, ret);
> + return ret;
> + }
> + nr_freed++;
> + }
> + return 0;
> +}
> +
> +void fuse_dax_free_mem_worker(struct work_struct *work)
> +{
> + int ret;
> + struct fuse_conn *fc = container_of(work, struct fuse_conn,
> + dax_free_work.work);
> + pr_debug("fuse: Worker to free memory called. nr_free_ranges=%lu"
> + " nr_busy_ranges=%lu\n", fc->nr_free_ranges,
> + fc->nr_busy_ranges);
> +
> + ret = try_to_free_dmap_chunks(fc, FUSE_DAX_RECLAIM_CHUNK);
> + if (ret) {
> + pr_debug("fuse: try_to_free_dmap_chunks() failed with err=%d\n",
> + ret);
> + }
> +
> + /* If the number of free ranges is still below the threshold, requeue */
> + kick_dmap_free_worker(fc, 1);
> +}
> diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
> index de213a7e1b0e..41c2fbff0d37 100644
> --- a/fs/fuse/fuse_i.h
> +++ b/fs/fuse/fuse_i.h
> @@ -54,6 +54,16 @@
> #define FUSE_DAX_MEM_RANGE_SZ (2*1024*1024)
> #define FUSE_DAX_MEM_RANGE_PAGES (FUSE_DAX_MEM_RANGE_SZ/PAGE_SIZE)
>
> +/* Number of ranges reclaimer will try to free in one invocation */
> +#define FUSE_DAX_RECLAIM_CHUNK (10)
> +
> +/*
> + * Dax memory reclaim threshold in percentage of total ranges. When the
> + * number of free ranges drops below this threshold, reclaim can trigger.
> + * Default is 20%.
> + */
> +#define FUSE_DAX_RECLAIM_THRESHOLD (20)
> +
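As an aside, the trigger condition implied by FUSE_DAX_RECLAIM_THRESHOLD can be written out as a small helper. This is only an illustration of the 20% rule as I read it from the patch, not the kernel function itself:

```c
/* Sketch of the reclaim trigger implied by FUSE_DAX_RECLAIM_THRESHOLD:
 * the free worker is kicked once free ranges drop below 20% of the
 * total. Integer math avoids floating point, as kernel code would. */
#define FUSE_DAX_RECLAIM_THRESHOLD 20

static int reclaim_needed(unsigned long nr_free_ranges,
                          unsigned long nr_ranges)
{
    return nr_free_ranges * 100 < nr_ranges * FUSE_DAX_RECLAIM_THRESHOLD;
}
```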
> /** List of active connections */
> extern struct list_head fuse_conn_list;
>
> @@ -75,6 +85,9 @@ struct fuse_forget_link {
>
> /** Translation information for file offsets to DAX window offsets */
> struct fuse_dax_mapping {
> + /* Pointer to inode where this memory range is mapped */
> + struct inode *inode;
> +
> /* Will connect in fc->free_ranges to keep track of free memory */
> struct list_head list;
>
> @@ -97,6 +110,9 @@ struct fuse_dax_mapping {
>
> /* Is this mapping read-only or read-write */
> bool writable;
> +
> + /* reference count when the mapping is used by dax iomap. */
> + refcount_t refcnt;
> };
>
> /** FUSE inode */
> @@ -822,11 +838,19 @@ struct fuse_conn {
> unsigned long nr_busy_ranges;
> struct list_head busy_ranges;
>
> + /* Worker to free up memory ranges */
> + struct delayed_work dax_free_work;
> +
> + /* Wait queue for a dax range to become free */
> + wait_queue_head_t dax_range_waitq;
> +
> /*
> * DAX Window Free Ranges
> */
> long nr_free_ranges;
> struct list_head free_ranges;
> +
> + unsigned long nr_ranges;
> };
>
> static inline struct fuse_conn *get_fuse_conn_super(struct super_block *sb)
> @@ -1164,6 +1188,7 @@ unsigned int fuse_len_args(unsigned int numargs, struct fuse_arg *args);
> */
> u64 fuse_get_unique(struct fuse_iqueue *fiq);
> void fuse_free_conn(struct fuse_conn *fc);
> +void fuse_dax_free_mem_worker(struct work_struct *work);
> void fuse_cleanup_inode_mappings(struct inode *inode);
>
> #endif /* _FS_FUSE_I_H */
> diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
> index d4770e7fb7eb..3560b62077a7 100644
> --- a/fs/fuse/inode.c
> +++ b/fs/fuse/inode.c
> @@ -663,11 +663,13 @@ static int fuse_dax_mem_range_init(struct fuse_conn *fc,
> range->window_offset = i * FUSE_DAX_MEM_RANGE_SZ;
> range->length = FUSE_DAX_MEM_RANGE_SZ;
> INIT_LIST_HEAD(&range->busy_list);
> + refcount_set(&range->refcnt, 1);
> list_add_tail(&range->list, &mem_ranges);
> }
>
> list_replace_init(&mem_ranges, &fc->free_ranges);
> fc->nr_free_ranges = nr_ranges;
> + fc->nr_ranges = nr_ranges;
> return 0;
> out_err:
> /* Free All allocated elements */
> @@ -692,6 +694,7 @@ void fuse_conn_init(struct fuse_conn *fc, struct user_namespace *user_ns,
> refcount_set(&fc->count, 1);
> atomic_set(&fc->dev_count, 1);
> init_waitqueue_head(&fc->blocked_waitq);
> + init_waitqueue_head(&fc->dax_range_waitq);
> fuse_iqueue_init(&fc->iq, fiq_ops, fiq_priv);
> INIT_LIST_HEAD(&fc->bg_queue);
> INIT_LIST_HEAD(&fc->entry);
> @@ -711,6 +714,7 @@ void fuse_conn_init(struct fuse_conn *fc, struct user_namespace *user_ns,
> fc->max_pages = FUSE_DEFAULT_MAX_PAGES_PER_REQ;
> INIT_LIST_HEAD(&fc->free_ranges);
> INIT_LIST_HEAD(&fc->busy_ranges);
> + INIT_DELAYED_WORK(&fc->dax_free_work, fuse_dax_free_mem_worker);
> }
> EXPORT_SYMBOL_GPL(fuse_conn_init);
>
> @@ -719,6 +723,7 @@ void fuse_conn_put(struct fuse_conn *fc)
> if (refcount_dec_and_test(&fc->count)) {
> struct fuse_iqueue *fiq = &fc->iq;
>
> + flush_delayed_work(&fc->dax_free_work);
> if (fc->dax_dev)
> fuse_free_dax_mem_ranges(&fc->free_ranges);
> if (fiq->ops->release)
> --
> 2.20.1
On Thu, Mar 26, 2020 at 08:09:05AM +0800, Liu Bo wrote:
[..]
> > +/*
> > + * Find first mapping in the tree and free it and return it. Do not add
> > + * it back to free pool. If fault == true, this function should be called
> > + * with fi->i_mmap_sem held.
> > + */
> > +static struct fuse_dax_mapping *inode_reclaim_one_dmap(struct fuse_conn *fc,
> > + struct inode *inode,
> > + bool fault)
> > +{
> > + struct fuse_inode *fi = get_fuse_inode(inode);
> > + struct fuse_dax_mapping *dmap;
> > + int ret;
> > +
> > + if (!fault)
> > + down_write(&fi->i_mmap_sem);
> > +
> > + /*
> > + * Make sure there are no references to inode pages using
> > + * get_user_pages()
> > + */
> > + ret = fuse_break_dax_layouts(inode, 0, 0);
>
> Hi Vivek,
>
> This patch is enabling inline reclaim for the fault path, but the fault path
> already holds a locked exceptional entry which I believe the above
> fuse_break_dax_layouts() needs to wait for. Can you please elaborate
> on how this can be avoided?
>
Hi Liubo,
Can you please point to the exact lock you are referring to? I will
check it out. Once we got rid of needing to take the inode lock in
the reclaim path, that opened the door to doing inline reclaim in the
fault path as well. But I was not aware of this exceptional entry lock.
Vivek
On Fri, Mar 27, 2020 at 10:01:14AM -0400, Vivek Goyal wrote:
> On Thu, Mar 26, 2020 at 08:09:05AM +0800, Liu Bo wrote:
>
> [..]
> > > +/*
> > > + * Find first mapping in the tree and free it and return it. Do not add
> > > + * it back to free pool. If fault == true, this function should be called
> > > + * with fi->i_mmap_sem held.
> > > + */
> > > +static struct fuse_dax_mapping *inode_reclaim_one_dmap(struct fuse_conn *fc,
> > > + struct inode *inode,
> > > + bool fault)
> > > +{
> > > + struct fuse_inode *fi = get_fuse_inode(inode);
> > > + struct fuse_dax_mapping *dmap;
> > > + int ret;
> > > +
> > > + if (!fault)
> > > + down_write(&fi->i_mmap_sem);
> > > +
> > > + /*
> > > + * Make sure there are no references to inode pages using
> > > + * get_user_pages()
> > > + */
> > > + ret = fuse_break_dax_layouts(inode, 0, 0);
> >
> > Hi Vivek,
> >
> > This patch is enabling inline reclaim for the fault path, but the fault path
> > already holds a locked exceptional entry which I believe the above
> > fuse_break_dax_layouts() needs to wait for. Can you please elaborate
> > on how this can be avoided?
> >
>
> Hi Liubo,
>
> Can you please point to the exact lock you are referring to? I will
> check it out. Once we got rid of needing to take the inode lock in
> the reclaim path, that opened the door to doing inline reclaim in the
> fault path as well. But I was not aware of this exceptional entry lock.
Hi Vivek,
dax_iomap_{pte,pmd}_fault() calls grab_mapping_entry() to get a
locked entry. When this fault gets into inline reclaim, would
fuse_break_dax_layouts() wait for the exceptional entry that is
still locked by dax_iomap_{pte,pmd}_fault()?
thanks,
liubo
On Wed, Mar 04, 2020 at 11:58:38AM -0500, Vivek Goyal wrote:
> This patch implements basic DAX support. mmap() is not implemented
> yet and will come in later patches. This patch implements
> read/write.
>
> We make use of interval tree to keep track of per inode dax mappings.
>
> Do not use dax for file extending writes, instead just send WRITE message
> to daemon (like we do for direct I/O path). This will keep write and
> i_size change atomic w.r.t crash.
>
> Signed-off-by: Stefan Hajnoczi <[email protected]>
> Signed-off-by: Dr. David Alan Gilbert <[email protected]>
> Signed-off-by: Vivek Goyal <[email protected]>
> Signed-off-by: Miklos Szeredi <[email protected]>
> Signed-off-by: Liu Bo <[email protected]>
> Signed-off-by: Peng Tao <[email protected]>
> ---
> fs/fuse/file.c | 597 +++++++++++++++++++++++++++++++++++++-
> fs/fuse/fuse_i.h | 23 ++
> fs/fuse/inode.c | 6 +
> include/uapi/linux/fuse.h | 1 +
> 4 files changed, 621 insertions(+), 6 deletions(-)
>
> diff --git a/fs/fuse/file.c b/fs/fuse/file.c
> index 9d67b830fb7a..9effdd3dc6d6 100644
> --- a/fs/fuse/file.c
> +++ b/fs/fuse/file.c
> @@ -18,6 +18,12 @@
> #include <linux/swap.h>
> #include <linux/falloc.h>
> #include <linux/uio.h>
> +#include <linux/dax.h>
> +#include <linux/iomap.h>
> +#include <linux/interval_tree_generic.h>
> +
> +INTERVAL_TREE_DEFINE(struct fuse_dax_mapping, rb, __u64, __subtree_last,
> + START, LAST, static inline, fuse_dax_interval_tree);
>
> static struct page **fuse_pages_alloc(unsigned int npages, gfp_t flags,
> struct fuse_page_desc **desc)
> @@ -187,6 +193,242 @@ static void fuse_link_write_file(struct file *file)
> spin_unlock(&fi->lock);
> }
>
> +static struct fuse_dax_mapping *alloc_dax_mapping(struct fuse_conn *fc)
> +{
> + struct fuse_dax_mapping *dmap = NULL;
> +
> + spin_lock(&fc->lock);
> +
> + if (fc->nr_free_ranges <= 0) {
> + spin_unlock(&fc->lock);
> + return NULL;
> + }
> +
> + WARN_ON(list_empty(&fc->free_ranges));
> +
> + /* Take a free range */
> + dmap = list_first_entry(&fc->free_ranges, struct fuse_dax_mapping,
> + list);
> + list_del_init(&dmap->list);
> + fc->nr_free_ranges--;
> + spin_unlock(&fc->lock);
> + return dmap;
> +}
> +
> +/* This assumes fc->lock is held */
> +static void __dmap_add_to_free_pool(struct fuse_conn *fc,
> + struct fuse_dax_mapping *dmap)
> +{
> + list_add_tail(&dmap->list, &fc->free_ranges);
> + fc->nr_free_ranges++;
> +}
> +
> +static void dmap_add_to_free_pool(struct fuse_conn *fc,
> + struct fuse_dax_mapping *dmap)
> +{
> + /* Return fuse_dax_mapping to free list */
> + spin_lock(&fc->lock);
> + __dmap_add_to_free_pool(fc, dmap);
> + spin_unlock(&fc->lock);
> +}
> +
> +/* offset passed in should be aligned to FUSE_DAX_MEM_RANGE_SZ */
> +static int fuse_setup_one_mapping(struct inode *inode, loff_t offset,
> + struct fuse_dax_mapping *dmap, bool writable,
> + bool upgrade)
> +{
> + struct fuse_conn *fc = get_fuse_conn(inode);
> + struct fuse_inode *fi = get_fuse_inode(inode);
> + struct fuse_setupmapping_in inarg;
> + FUSE_ARGS(args);
> + ssize_t err;
> +
> + WARN_ON(offset % FUSE_DAX_MEM_RANGE_SZ);
> + WARN_ON(fc->nr_free_ranges < 0);
> +
> + /* Ask fuse daemon to setup mapping */
> + memset(&inarg, 0, sizeof(inarg));
> + inarg.foffset = offset;
> + inarg.fh = -1;
> + inarg.moffset = dmap->window_offset;
> + inarg.len = FUSE_DAX_MEM_RANGE_SZ;
> + inarg.flags |= FUSE_SETUPMAPPING_FLAG_READ;
> + if (writable)
> + inarg.flags |= FUSE_SETUPMAPPING_FLAG_WRITE;
> + args.opcode = FUSE_SETUPMAPPING;
> + args.nodeid = fi->nodeid;
> + args.in_numargs = 1;
> + args.in_args[0].size = sizeof(inarg);
> + args.in_args[0].value = &inarg;
> + err = fuse_simple_request(fc, &args);
> + if (err < 0) {
> + printk(KERN_ERR "%s request failed at mem_offset=0x%llx %zd\n",
> + __func__, dmap->window_offset, err);
> + return err;
> + }
> +
> + pr_debug("fuse_setup_one_mapping() succeeded. offset=0x%llx writable=%d"
> + " err=%zd\n", offset, writable, err);
> +
> + dmap->writable = writable;
> + if (!upgrade) {
> + dmap->start = offset;
> + dmap->end = offset + FUSE_DAX_MEM_RANGE_SZ - 1;
> + /* Protected by fi->i_dmap_sem */
> + fuse_dax_interval_tree_insert(dmap, &fi->dmap_tree);
> + fi->nr_dmaps++;
> + }
> + return 0;
> +}
> +
> +static int
> +fuse_send_removemapping(struct inode *inode,
> + struct fuse_removemapping_in *inargp,
> + struct fuse_removemapping_one *remove_one)
> +{
> + struct fuse_inode *fi = get_fuse_inode(inode);
> + struct fuse_conn *fc = get_fuse_conn(inode);
> + FUSE_ARGS(args);
> +
> + args.opcode = FUSE_REMOVEMAPPING;
> + args.nodeid = fi->nodeid;
> + args.in_numargs = 2;
> + args.in_args[0].size = sizeof(*inargp);
> + args.in_args[0].value = inargp;
> + args.in_args[1].size = inargp->count * sizeof(*remove_one);
> + args.in_args[1].value = remove_one;
> + return fuse_simple_request(fc, &args);
> +}
> +
> +static int dmap_removemapping_list(struct inode *inode, unsigned num,
> + struct list_head *to_remove)
> +{
> + struct fuse_removemapping_one *remove_one, *ptr;
> + struct fuse_removemapping_in inarg;
> + struct fuse_dax_mapping *dmap;
> + int ret, i = 0, nr_alloc;
> +
> + nr_alloc = min_t(unsigned int, num, FUSE_REMOVEMAPPING_MAX_ENTRY);
> + remove_one = kmalloc_array(nr_alloc, sizeof(*remove_one), GFP_NOFS);
> + if (!remove_one)
> + return -ENOMEM;
> +
> + ptr = remove_one;
> + list_for_each_entry(dmap, to_remove, list) {
> + ptr->moffset = dmap->window_offset;
> + ptr->len = dmap->length;
> + ptr++;
> + i++;
> + num--;
> + if (i >= nr_alloc || num == 0) {
> + memset(&inarg, 0, sizeof(inarg));
> + inarg.count = i;
> + ret = fuse_send_removemapping(inode, &inarg,
> + remove_one);
> + if (ret)
> + goto out;
> + ptr = remove_one;
> + i = 0;
> + }
> + }
> +out:
> + kfree(remove_one);
> + return ret;
> +}
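A note on the batching above: at most FUSE_REMOVEMAPPING_MAX_ENTRY entries travel in one FUSE_REMOVEMAPPING request, so `num` entries need a ceiling-division number of requests. A hypothetical userspace helper capturing just that arithmetic (not kernel code):

```c
/* Userspace model of the batching in dmap_removemapping_list():
 * num entries, at most max_per_req per FUSE_REMOVEMAPPING request,
 * gives ceil(num / max_per_req) requests. */
static unsigned int removemapping_requests(unsigned int num,
                                           unsigned int max_per_req)
{
    if (num == 0 || max_per_req == 0)
        return 0;
    return (num + max_per_req - 1) / max_per_req;
}
```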
> +
> +/*
> + * Cleanup dmap entry and add back to free list. This should be called with
> + * fc->lock held.
> + */
> +static void dmap_reinit_add_to_free_pool(struct fuse_conn *fc,
> + struct fuse_dax_mapping *dmap)
> +{
> + pr_debug("fuse: freeing memory range start=0x%llx end=0x%llx "
> + "window_offset=0x%llx length=0x%llx\n", dmap->start,
> + dmap->end, dmap->window_offset, dmap->length);
> + dmap->start = dmap->end = 0;
> + __dmap_add_to_free_pool(fc, dmap);
> +}
> +
> +/*
> + * Free inode dmap entries whose range falls entirely inside [start, end].
> + * Does not take any locks. At this point in time it should only be
> + * called from the evict_inode() path, where we know all dmap entries can be
> + * reclaimed.
> + */
> +static void inode_reclaim_dmap_range(struct fuse_conn *fc, struct inode *inode,
> + loff_t start, loff_t end)
> +{
> + struct fuse_inode *fi = get_fuse_inode(inode);
> + struct fuse_dax_mapping *dmap, *n;
> + int err, num = 0;
> + LIST_HEAD(to_remove);
> +
> + pr_debug("fuse: %s: start=0x%llx, end=0x%llx\n", __func__, start, end);
> +
> + /*
> + * Interval tree search matches intersecting entries. Adjust the range
> + * to avoid dropping partial valid entries.
> + */
> + start = ALIGN(start, FUSE_DAX_MEM_RANGE_SZ);
> + end = ALIGN_DOWN(end, FUSE_DAX_MEM_RANGE_SZ);
> +
> + while (1) {
> + dmap = fuse_dax_interval_tree_iter_first(&fi->dmap_tree, start,
> + end);
> + if (!dmap)
> + break;
> + fuse_dax_interval_tree_remove(dmap, &fi->dmap_tree);
> + num++;
> + list_add(&dmap->list, &to_remove);
> + }
> +
> + /* Nothing to remove */
> + if (list_empty(&to_remove))
> + return;
> +
> + WARN_ON(fi->nr_dmaps < num);
> + fi->nr_dmaps -= num;
> + /*
> + * During umount/shutdown, fuse connection is dropped first
> + * and evict_inode() is called later. That means any
> + * removemapping messages are going to fail. Send messages
> + * only if connection is up. Otherwise fuse daemon is
> + * responsible for cleaning up any leftover references and
> + * mappings.
> + */
> + if (fc->connected) {
> + err = dmap_removemapping_list(inode, num, &to_remove);
> + if (err) {
> + pr_warn("Failed to remove mappings. start=0x%llx"
> + " end=0x%llx\n", start, end);
> + }
> + }
> + spin_lock(&fc->lock);
> + list_for_each_entry_safe(dmap, n, &to_remove, list) {
> + list_del_init(&dmap->list);
> + dmap_reinit_add_to_free_pool(fc, dmap);
> + }
> + spin_unlock(&fc->lock);
> +}
> +
> +/*
> + * It is called from evict_inode() and by that time the inode is going away.
> + * So this function does not take any locks like fi->i_dmap_sem for traversing
> + * the fuse inode interval tree. If that lock is taken then the lock validator
> + * complains of a deadlock situation w.r.t. the fs_reclaim lock.
> + */
> +void fuse_cleanup_inode_mappings(struct inode *inode)
> +{
> + struct fuse_conn *fc = get_fuse_conn(inode);
> + /*
> + * fuse_evict_inode() has already called truncate_inode_pages_final()
> + * before we arrive here. So we should not have to worry about
> + * any pages/exception entries still associated with inode.
> + */
> + inode_reclaim_dmap_range(fc, inode, 0, -1);
> +}
> +
> void fuse_finish_open(struct inode *inode, struct file *file)
> {
> struct fuse_file *ff = file->private_data;
> @@ -1562,32 +1804,364 @@ static ssize_t fuse_direct_write_iter(struct kiocb *iocb, struct iov_iter *from)
> return res;
> }
>
> +static ssize_t fuse_dax_read_iter(struct kiocb *iocb, struct iov_iter *to);
> static ssize_t fuse_file_read_iter(struct kiocb *iocb, struct iov_iter *to)
> {
> struct file *file = iocb->ki_filp;
> struct fuse_file *ff = file->private_data;
> + struct inode *inode = file->f_mapping->host;
>
> if (is_bad_inode(file_inode(file)))
> return -EIO;
>
> - if (!(ff->open_flags & FOPEN_DIRECT_IO))
> - return fuse_cache_read_iter(iocb, to);
> - else
> + if (IS_DAX(inode))
> + return fuse_dax_read_iter(iocb, to);
> +
> + if (ff->open_flags & FOPEN_DIRECT_IO)
> return fuse_direct_read_iter(iocb, to);
> +
> + return fuse_cache_read_iter(iocb, to);
> }
>
> +static ssize_t fuse_dax_write_iter(struct kiocb *iocb, struct iov_iter *from);
> static ssize_t fuse_file_write_iter(struct kiocb *iocb, struct iov_iter *from)
> {
> struct file *file = iocb->ki_filp;
> struct fuse_file *ff = file->private_data;
> + struct inode *inode = file->f_mapping->host;
>
> if (is_bad_inode(file_inode(file)))
> return -EIO;
>
> - if (!(ff->open_flags & FOPEN_DIRECT_IO))
> - return fuse_cache_write_iter(iocb, from);
> - else
> + if (IS_DAX(inode))
> + return fuse_dax_write_iter(iocb, from);
> +
> + if (ff->open_flags & FOPEN_DIRECT_IO)
> return fuse_direct_write_iter(iocb, from);
> +
> + return fuse_cache_write_iter(iocb, from);
> +}
> +
> +static void fuse_fill_iomap_hole(struct iomap *iomap, loff_t length)
> +{
> + iomap->addr = IOMAP_NULL_ADDR;
> + iomap->length = length;
> + iomap->type = IOMAP_HOLE;
> +}
> +
> +static void fuse_fill_iomap(struct inode *inode, loff_t pos, loff_t length,
> + struct iomap *iomap, struct fuse_dax_mapping *dmap,
> + unsigned flags)
> +{
> + loff_t offset, len;
> + loff_t i_size = i_size_read(inode);
> +
> + offset = pos - dmap->start;
> + len = min(length, dmap->length - offset);
> +
> + /* If length is beyond end of file, truncate further */
> + if (pos + len > i_size)
> + len = i_size - pos;
> +
> + if (len > 0) {
> + iomap->addr = dmap->window_offset + offset;
> + iomap->length = len;
> + if (flags & IOMAP_FAULT)
> + iomap->length = ALIGN(len, PAGE_SIZE);
> + iomap->type = IOMAP_MAPPED;
> + pr_debug("%s: returns iomap: addr 0x%llx offset 0x%llx"
> + " length 0x%llx\n", __func__, iomap->addr,
> + iomap->offset, iomap->length);
> + } else {
> + /* Mapping beyond end of file is hole */
> + fuse_fill_iomap_hole(iomap, length);
> + pr_debug("%s: returns iomap: addr 0x%llx offset 0x%llx"
> + " length 0x%llx\n", __func__, iomap->addr,
> + iomap->offset, iomap->length);
> + }
> +}
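The clamping logic above is worth spelling out: the returned extent is limited both by the end of the 2MB dmap range and by i_size, and anything at or past EOF becomes a hole. A userspace model with hypothetical names (sizes in bytes, long long standing in for loff_t):

```c
/* Model of the length clamping in fuse_fill_iomap(): returns the
 * mapped length for (pos, length) inside a dmap starting at
 * dmap_start with size dmap_len, or 0 when the range is a hole
 * beyond i_size. */
static long long iomap_len(long long pos, long long length,
                           long long dmap_start, long long dmap_len,
                           long long i_size)
{
    long long offset = pos - dmap_start;
    long long len = length < dmap_len - offset ? length
                                               : dmap_len - offset;

    /* If length is beyond end of file, truncate further */
    if (pos + len > i_size)
        len = i_size - pos;
    return len > 0 ? len : 0;
}
```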
> +
> +static int iomap_begin_setup_new_mapping(struct inode *inode, loff_t pos,
> + loff_t length, unsigned flags,
> + struct iomap *iomap)
> +{
> + struct fuse_inode *fi = get_fuse_inode(inode);
> + struct fuse_conn *fc = get_fuse_conn(inode);
> + struct fuse_dax_mapping *dmap, *alloc_dmap = NULL;
> + int ret;
> + bool writable = flags & IOMAP_WRITE;
> +
> + alloc_dmap = alloc_dax_mapping(fc);
> + if (!alloc_dmap)
> + return -EBUSY;
> +
> + /*
> + * Take write lock so that only one caller can try to setup mapping
> + * and other waits.
> + */
> + down_write(&fi->i_dmap_sem);
> + /*
> + * We dropped lock. Check again if somebody else setup
> + * mapping already.
> + */
> + dmap = fuse_dax_interval_tree_iter_first(&fi->dmap_tree, pos,
> + pos);
> + if (dmap) {
> + fuse_fill_iomap(inode, pos, length, iomap, dmap, flags);
> + dmap_add_to_free_pool(fc, alloc_dmap);
> + up_write(&fi->i_dmap_sem);
> + return 0;
> + }
> +
> + /* Setup one mapping */
> + ret = fuse_setup_one_mapping(inode,
> + ALIGN_DOWN(pos, FUSE_DAX_MEM_RANGE_SZ),
> + alloc_dmap, writable, false);
> + if (ret < 0) {
> + printk("fuse_setup_one_mapping() failed. err=%d"
> + " pos=0x%llx, writable=%d\n", ret, pos, writable);
> + dmap_add_to_free_pool(fc, alloc_dmap);
> + up_write(&fi->i_dmap_sem);
> + return ret;
> + }
> + fuse_fill_iomap(inode, pos, length, iomap, alloc_dmap, flags);
> + up_write(&fi->i_dmap_sem);
> + return 0;
> +}
> +
> +static int iomap_begin_upgrade_mapping(struct inode *inode, loff_t pos,
> + loff_t length, unsigned flags,
> + struct iomap *iomap)
> +{
> + struct fuse_inode *fi = get_fuse_inode(inode);
> + struct fuse_dax_mapping *dmap;
> + int ret;
> +
> + /*
> + * Take exclusive lock so that only one caller can try to setup
> + * mapping and others wait.
> + */
> + down_write(&fi->i_dmap_sem);
> + dmap = fuse_dax_interval_tree_iter_first(&fi->dmap_tree, pos, pos);
> +
> + /* We are holding either inode lock or i_mmap_sem, and that should
> + * ensure that the dmap can't be reclaimed or truncated and it should still
> + * be there in tree despite the fact we dropped and re-acquired the
> + * lock.
> + */
> + ret = -EIO;
> + if (WARN_ON(!dmap))
> + goto out_err;
> +
> + /* Maybe another thread already upgraded mapping while we were not
> + * holding lock.
> + */
> + if (dmap->writable)
oops, looks like it's still returning -EIO here, %ret should be zero.
thanks,
liubo
> + goto out_fill_iomap;
> +
> + ret = fuse_setup_one_mapping(inode,
> + ALIGN_DOWN(pos, FUSE_DAX_MEM_RANGE_SZ),
> + dmap, true, true);
> + if (ret < 0) {
> + printk("fuse_setup_one_mapping() failed. err=%d pos=0x%llx\n",
> + ret, pos);
> + goto out_err;
> + }
> +
> +out_fill_iomap:
> + fuse_fill_iomap(inode, pos, length, iomap, dmap, flags);
> +out_err:
> + up_write(&fi->i_dmap_sem);
> + return ret;
> +}
> +
> +/* This is just for DAX and the mapping is ephemeral, do not use it for other
> + * purposes since there is no block device with a permanent mapping.
> + */
> +static int fuse_iomap_begin(struct inode *inode, loff_t pos, loff_t length,
> + unsigned flags, struct iomap *iomap,
> + struct iomap *srcmap)
> +{
> + struct fuse_inode *fi = get_fuse_inode(inode);
> + struct fuse_conn *fc = get_fuse_conn(inode);
> + struct fuse_dax_mapping *dmap;
> + bool writable = flags & IOMAP_WRITE;
> +
> + /* We don't support FIEMAP */
> + BUG_ON(flags & IOMAP_REPORT);
> +
> + pr_debug("fuse_iomap_begin() called. pos=0x%llx length=0x%llx\n",
> + pos, length);
> +
> + /*
> + * Writes beyond end of file are not handled using the dax path. Instead
> + * a fuse write message is sent to the daemon.
> + */
> + if (flags & IOMAP_WRITE && pos >= i_size_read(inode))
> + return -EIO;
> +
> + iomap->offset = pos;
> + iomap->flags = 0;
> + iomap->bdev = NULL;
> + iomap->dax_dev = fc->dax_dev;
> +
> + /*
> + * Both read/write and mmap path can race here. So we need something
> + * to make sure if we are setting up mapping, then other path waits
> + *
> + * For now, use a semaphore for this. It probably needs to be
> + * optimized later.
> + */
> + down_read(&fi->i_dmap_sem);
> + dmap = fuse_dax_interval_tree_iter_first(&fi->dmap_tree, pos, pos);
> +
> + if (dmap) {
> + if (writable && !dmap->writable) {
> + /* Upgrade read-only mapping to read-write. This will
> + * require exclusive i_dmap_sem lock as we don't want
> + * two threads trying to do this simultaneously
> + * for the same dmap. So drop the shared lock and acquire
> + * exclusive lock.
> + */
> + up_read(&fi->i_dmap_sem);
> + pr_debug("%s: Upgrading mapping at offset 0x%llx"
> + " length 0x%llx\n", __func__, pos, length);
> + return iomap_begin_upgrade_mapping(inode, pos, length,
> + flags, iomap);
> + } else {
> + fuse_fill_iomap(inode, pos, length, iomap, dmap, flags);
> + up_read(&fi->i_dmap_sem);
> + return 0;
> + }
> + } else {
> + up_read(&fi->i_dmap_sem);
> + pr_debug("%s: no mapping at offset 0x%llx length 0x%llx\n",
> + __func__, pos, length);
> + if (pos >= i_size_read(inode))
> + goto iomap_hole;
> +
> + return iomap_begin_setup_new_mapping(inode, pos, length, flags,
> + iomap);
> + }
> +
> + /*
> + * If a read beyond end of file happens, fs code seems to return
> + * it as a hole
> + */
> +iomap_hole:
> + fuse_fill_iomap_hole(iomap, length);
> + pr_debug("fuse_iomap_begin() returning hole mapping. pos=0x%llx length_asked=0x%llx length_returned=0x%llx\n", pos, length, iomap->length);
> + return 0;
> +}
> +
> +static int fuse_iomap_end(struct inode *inode, loff_t pos, loff_t length,
> + ssize_t written, unsigned flags,
> + struct iomap *iomap)
> +{
> + /* DAX writes beyond end-of-file aren't handled using iomap, so the
> + * file size is unchanged and there is nothing to do here.
> + */
> + return 0;
> +}
> +
> +static const struct iomap_ops fuse_iomap_ops = {
> + .iomap_begin = fuse_iomap_begin,
> + .iomap_end = fuse_iomap_end,
> +};
> +
> +static ssize_t fuse_dax_read_iter(struct kiocb *iocb, struct iov_iter *to)
> +{
> + struct inode *inode = file_inode(iocb->ki_filp);
> + ssize_t ret;
> +
> + if (iocb->ki_flags & IOCB_NOWAIT) {
> + if (!inode_trylock_shared(inode))
> + return -EAGAIN;
> + } else {
> + inode_lock_shared(inode);
> + }
> +
> + ret = dax_iomap_rw(iocb, to, &fuse_iomap_ops);
> + inode_unlock_shared(inode);
> +
> + /* TODO file_accessed(iocb->ki_filp) */
> + return ret;
> +}
> +
> +static bool file_extending_write(struct kiocb *iocb, struct iov_iter *from)
> +{
> + struct inode *inode = file_inode(iocb->ki_filp);
> +
> + return (iov_iter_rw(from) == WRITE &&
> + ((iocb->ki_pos) >= i_size_read(inode)));
> +}
> +
> +static ssize_t fuse_dax_direct_write(struct kiocb *iocb, struct iov_iter *from)
> +{
> + struct inode *inode = file_inode(iocb->ki_filp);
> + struct fuse_io_priv io = FUSE_IO_PRIV_SYNC(iocb);
> + ssize_t ret;
> +
> + ret = fuse_direct_io(&io, from, &iocb->ki_pos, FUSE_DIO_WRITE);
> + if (ret < 0)
> + return ret;
> +
> + fuse_invalidate_attr(inode);
> + fuse_write_update_size(inode, iocb->ki_pos);
> + return ret;
> +}
> +
> +static ssize_t fuse_dax_write_iter(struct kiocb *iocb, struct iov_iter *from)
> +{
> + struct inode *inode = file_inode(iocb->ki_filp);
> + ssize_t ret, count;
> +
> + if (iocb->ki_flags & IOCB_NOWAIT) {
> + if (!inode_trylock(inode))
> + return -EAGAIN;
> + } else {
> + inode_lock(inode);
> + }
> +
> + ret = generic_write_checks(iocb, from);
> + if (ret <= 0)
> + goto out;
> +
> + ret = file_remove_privs(iocb->ki_filp);
> + if (ret)
> + goto out;
> + /* TODO file_update_time() but we don't want metadata I/O */
> +
> + /* Do not use dax for file extending writes as it's an mmap and
> + * trying to write beyond the end of an existing page will
> + * generate SIGBUS.
> + */
> + if (file_extending_write(iocb, from)) {
> + ret = fuse_dax_direct_write(iocb, from);
> + goto out;
> + }
> +
> + ret = dax_iomap_rw(iocb, from, &fuse_iomap_ops);
> + if (ret < 0)
> + goto out;
> +
> + /*
> + * If part of the write was file extending, fuse dax path will not
> + * take care of that. Do direct write instead.
> + */
> + if (iov_iter_count(from) && file_extending_write(iocb, from)) {
> + count = fuse_dax_direct_write(iocb, from);
> + if (count < 0)
> + goto out;
> + ret += count;
> + }
> +
> +out:
> + inode_unlock(inode);
> +
> + if (ret > 0)
> + ret = generic_write_sync(iocb, ret);
> + return ret;
> }
>
> static void fuse_writepage_free(struct fuse_writepage_args *wpa)
> @@ -2318,6 +2892,11 @@ static int fuse_file_mmap(struct file *file, struct vm_area_struct *vma)
> return 0;
> }
>
> +static int fuse_dax_mmap(struct file *file, struct vm_area_struct *vma)
> +{
> + return -EINVAL; /* TODO */
> +}
> +
> static int convert_fuse_file_lock(struct fuse_conn *fc,
> const struct fuse_file_lock *ffl,
> struct file_lock *fl)
> @@ -3387,6 +3966,7 @@ static const struct address_space_operations fuse_file_aops = {
> void fuse_init_file_inode(struct inode *inode)
> {
> struct fuse_inode *fi = get_fuse_inode(inode);
> + struct fuse_conn *fc = get_fuse_conn(inode);
>
> inode->i_fop = &fuse_file_operations;
> inode->i_data.a_ops = &fuse_file_aops;
> @@ -3396,4 +3976,9 @@ void fuse_init_file_inode(struct inode *inode)
> fi->writectr = 0;
> init_waitqueue_head(&fi->page_waitq);
> INIT_LIST_HEAD(&fi->writepages);
> + fi->dmap_tree = RB_ROOT_CACHED;
> +
> + if (fc->dax_dev) {
> + inode->i_flags |= S_DAX;
> + }
> }
> diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
> index b41275f73e4c..490549862bda 100644
> --- a/fs/fuse/fuse_i.h
> +++ b/fs/fuse/fuse_i.h
> @@ -70,16 +70,29 @@ struct fuse_forget_link {
> struct fuse_forget_link *next;
> };
>
> +#define START(node) ((node)->start)
> +#define LAST(node) ((node)->end)
> +
> /** Translation information for file offsets to DAX window offsets */
> struct fuse_dax_mapping {
> /* Will connect in fc->free_ranges to keep track of free memory */
> struct list_head list;
>
> + /* For interval tree in file/inode */
> + struct rb_node rb;
> + /** Start Position in file */
> + __u64 start;
> + /** End Position in file */
> + __u64 end;
> + __u64 __subtree_last;
> /** Position in DAX window */
> u64 window_offset;
>
> /** Length of mapping, in bytes */
> loff_t length;
> +
> + /* Is this mapping read-only or read-write */
> + bool writable;
> };
>
> /** FUSE inode */
> @@ -167,6 +180,15 @@ struct fuse_inode {
>
> /** Lock to protect write related fields */
> spinlock_t lock;
> +
> + /*
> + * Semaphore to protect modifications to dmap_tree
> + */
> + struct rw_semaphore i_dmap_sem;
> +
> + /** Sorted rb tree of struct fuse_dax_mapping elements */
> + struct rb_root_cached dmap_tree;
> + unsigned long nr_dmaps;
> };
>
> /** FUSE inode state bits */
> @@ -1127,5 +1149,6 @@ unsigned int fuse_len_args(unsigned int numargs, struct fuse_arg *args);
> */
> u64 fuse_get_unique(struct fuse_iqueue *fiq);
> void fuse_free_conn(struct fuse_conn *fc);
> +void fuse_cleanup_inode_mappings(struct inode *inode);
>
> #endif /* _FS_FUSE_I_H */
> diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
> index 36cb9c00bbe5..93bc65607a15 100644
> --- a/fs/fuse/inode.c
> +++ b/fs/fuse/inode.c
> @@ -86,7 +86,9 @@ static struct inode *fuse_alloc_inode(struct super_block *sb)
> fi->attr_version = 0;
> fi->orig_ino = 0;
> fi->state = 0;
> + fi->nr_dmaps = 0;
> mutex_init(&fi->mutex);
> + init_rwsem(&fi->i_dmap_sem);
> spin_lock_init(&fi->lock);
> fi->forget = fuse_alloc_forget();
> if (!fi->forget) {
> @@ -114,6 +116,10 @@ static void fuse_evict_inode(struct inode *inode)
> clear_inode(inode);
> if (inode->i_sb->s_flags & SB_ACTIVE) {
> struct fuse_conn *fc = get_fuse_conn(inode);
> + if (IS_DAX(inode)) {
> + fuse_cleanup_inode_mappings(inode);
> + WARN_ON(fi->nr_dmaps);
> + }
> fuse_queue_forget(fc, fi->forget, fi->nodeid, fi->nlookup);
> fi->forget = NULL;
> }
> diff --git a/include/uapi/linux/fuse.h b/include/uapi/linux/fuse.h
> index 62633555d547..36d824b82ebc 100644
> --- a/include/uapi/linux/fuse.h
> +++ b/include/uapi/linux/fuse.h
> @@ -896,6 +896,7 @@ struct fuse_copy_file_range_in {
>
> #define FUSE_SETUPMAPPING_ENTRIES 8
> #define FUSE_SETUPMAPPING_FLAG_WRITE (1ull << 0)
> +#define FUSE_SETUPMAPPING_FLAG_READ (1ull << 1)
> struct fuse_setupmapping_in {
> /* An already open handle */
> uint64_t fh;
> --
> 2.20.1
On Sat, Apr 04, 2020 at 08:25:21AM +0800, Liu Bo wrote:
[..]
> > +static int iomap_begin_upgrade_mapping(struct inode *inode, loff_t pos,
> > + loff_t length, unsigned flags,
> > + struct iomap *iomap)
> > +{
> > + struct fuse_inode *fi = get_fuse_inode(inode);
> > + struct fuse_dax_mapping *dmap;
> > + int ret;
> > +
> > + /*
> > + * Take exclusive lock so that only one caller can try to setup
> > + * mapping and others wait.
> > + */
> > + down_write(&fi->i_dmap_sem);
> > + dmap = fuse_dax_interval_tree_iter_first(&fi->dmap_tree, pos, pos);
> > +
> > +	/* We are holding either inode lock or i_mmap_sem, and that should
> > +	 * ensure that dmap can't be reclaimed or truncated and it should
> > +	 * still be there in the tree despite the fact we dropped and
> > +	 * re-acquired the lock.
> > +	 */
> > + ret = -EIO;
> > + if (WARN_ON(!dmap))
> > + goto out_err;
> > +
> > + /* Maybe another thread already upgraded mapping while we were not
> > + * holding lock.
> > + */
> > + if (dmap->writable)
>
> oops, looks like it's still returning -EIO here, %ret should be zero.
>
Good catch. Will fix it.
Vivek
On Sat, Mar 28, 2020 at 06:06:06AM +0800, Liu Bo wrote:
> On Fri, Mar 27, 2020 at 10:01:14AM -0400, Vivek Goyal wrote:
> > On Thu, Mar 26, 2020 at 08:09:05AM +0800, Liu Bo wrote:
> >
> > [..]
> > > > +/*
> > > > + * Find first mapping in the tree and free it and return it. Do not add
> > > > + * it back to free pool. If fault == true, this function should be called
> > > > + * with fi->i_mmap_sem held.
> > > > + */
> > > > +static struct fuse_dax_mapping *inode_reclaim_one_dmap(struct fuse_conn *fc,
> > > > + struct inode *inode,
> > > > + bool fault)
> > > > +{
> > > > + struct fuse_inode *fi = get_fuse_inode(inode);
> > > > + struct fuse_dax_mapping *dmap;
> > > > + int ret;
> > > > +
> > > > + if (!fault)
> > > > + down_write(&fi->i_mmap_sem);
> > > > +
> > > > + /*
> > > > + * Make sure there are no references to inode pages using
> > > > + * get_user_pages()
> > > > + */
> > > > + ret = fuse_break_dax_layouts(inode, 0, 0);
> > >
> > > Hi Vivek,
> > >
> > > This patch is enabling inline reclaim for fault path, but fault path
> > > has already holds a locked exceptional entry which I believe the above
> > > fuse_break_dax_layouts() needs to wait for, can you please elaborate
> > > on how this can be avoided?
> > >
> >
> > Hi Liubo,
> >
> > Can you please point to the exact lock you are referring to. I will
> > check it out. Once we got rid of needing to take the inode lock in the
> > reclaim path, that opened the door to do inline reclaim in the fault
> > path as well. But I was not aware of this exceptional entry lock.
>
> Hi Vivek,
>
> dax_iomap_{pte,pmd}_fault has called grab_mapping_entry to get a
> locked entry, when this fault gets into inline reclaim, would
> fuse_break_dax_layouts wait for the locked exceptional entry which is
> locked in dax_iomap_{pte,pmd}_fault?
Hi Liu Bo,
This is a good point. Indeed it can deadlock the way the code is
written currently.
Currently we are calling fuse_break_dax_layouts() on the whole file in
the inline memory reclaim path. I am thinking of changing that. Instead,
find a mapped memory range and its file offset and call
fuse_break_dax_layouts() only on that range (2MB). This should ensure
that we don't try to break the dax layout in the range where we are
holding the exceptional entry lock, and avoid the deadlock possibility.
This also has the added benefit that we don't have to unmap the whole
file in an attempt to reclaim one memory range. We will unmap only a
portion of the file, which should be good from a performance point of
view.
Here is a proof-of-concept patch which applies on top of my internal
tree.
---
fs/fuse/file.c | 72 +++++++++++++++++++++++++++++++++++++++------------------
1 file changed, 50 insertions(+), 22 deletions(-)
Index: redhat-linux/fs/fuse/file.c
===================================================================
--- redhat-linux.orig/fs/fuse/file.c 2020-04-14 13:47:19.493780528 -0400
+++ redhat-linux/fs/fuse/file.c 2020-04-14 14:58:26.814079643 -0400
@@ -4297,13 +4297,13 @@ static int fuse_break_dax_layouts(struct
return ret;
}
-/* Find first mapping in the tree and free it. */
-static struct fuse_dax_mapping *
-inode_reclaim_one_dmap_locked(struct fuse_conn *fc, struct inode *inode)
+/* Find first mapped dmap for an inode and return file offset. Caller needs
+ * to hold inode->i_dmap_sem lock either shared or exclusive. */
+static struct fuse_dax_mapping *inode_lookup_first_dmap(struct fuse_conn *fc,
+ struct inode *inode)
{
struct fuse_inode *fi = get_fuse_inode(inode);
struct fuse_dax_mapping *dmap;
- int ret;
for (dmap = fuse_dax_interval_tree_iter_first(&fi->dmap_tree, 0, -1);
dmap;
@@ -4312,18 +4312,6 @@ inode_reclaim_one_dmap_locked(struct fus
if (refcount_read(&dmap->refcnt) > 1)
continue;
- ret = reclaim_one_dmap_locked(fc, inode, dmap);
- if (ret < 0)
- return ERR_PTR(ret);
-
- /* Clean up dmap. Do not add back to free list */
- dmap_remove_busy_list(fc, dmap);
- dmap->inode = NULL;
- dmap->start = dmap->end = 0;
-
- pr_debug("fuse: %s: reclaimed memory range. inode=%px,"
- " window_offset=0x%llx, length=0x%llx\n", __func__,
- inode, dmap->window_offset, dmap->length);
return dmap;
}
@@ -4335,30 +4323,70 @@ inode_reclaim_one_dmap_locked(struct fus
* it back to free pool. If fault == true, this function should be called
* with fi->i_mmap_sem held.
*/
-static struct fuse_dax_mapping *inode_reclaim_one_dmap(struct fuse_conn *fc,
- struct inode *inode,
- bool fault)
+static struct fuse_dax_mapping *
+inode_inline_reclaim_one_dmap(struct fuse_conn *fc, struct inode *inode,
+ bool fault)
{
struct fuse_inode *fi = get_fuse_inode(inode);
struct fuse_dax_mapping *dmap;
+ u64 dmap_start, dmap_end;
int ret;
if (!fault)
down_write(&fi->i_mmap_sem);
+ /* Lookup a dmap and corresponding file offset to reclaim. */
+ down_read(&fi->i_dmap_sem);
+ dmap = inode_lookup_first_dmap(fc, inode);
+ if (dmap) {
+ dmap_start = dmap->start;
+ dmap_end = dmap->end;
+ }
+ up_read(&fi->i_dmap_sem);
+
+ if (!dmap)
+ goto out_mmap_sem;
/*
* Make sure there are no references to inode pages using
* get_user_pages()
*/
- ret = fuse_break_dax_layouts(inode, 0, 0);
+ ret = fuse_break_dax_layouts(inode, dmap_start, dmap_end);
if (ret) {
printk("virtio_fs: fuse_break_dax_layouts() failed. err=%d\n",
ret);
dmap = ERR_PTR(ret);
goto out_mmap_sem;
}
+
down_write(&fi->i_dmap_sem);
- dmap = inode_reclaim_one_dmap_locked(fc, inode);
+ dmap = fuse_dax_interval_tree_iter_first(&fi->dmap_tree, dmap_start,
+ dmap_start);
+ /* Range already got reclaimed by somebody else */
+ if (!dmap)
+ goto out_write_dmap_sem;
+
+ /* still in use. */
+ if (refcount_read(&dmap->refcnt) > 1) {
+ dmap = NULL;
+ goto out_write_dmap_sem;
+ }
+
+ ret = reclaim_one_dmap_locked(fc, inode, dmap);
+ if (ret < 0) {
+ dmap = NULL;
+ goto out_write_dmap_sem;
+ }
+
+ /* Clean up dmap. Do not add back to free list */
+ dmap_remove_busy_list(fc, dmap);
+ dmap->inode = NULL;
+ dmap->start = dmap->end = 0;
+
+ pr_debug("fuse: %s: inline reclaimed memory range. inode=%px,"
+ " window_offset=0x%llx, length=0x%llx\n", __func__,
+ inode, dmap->window_offset, dmap->length);
+
+out_write_dmap_sem:
up_write(&fi->i_dmap_sem);
out_mmap_sem:
if (!fault)
@@ -4379,7 +4407,7 @@ static struct fuse_dax_mapping *alloc_da
return dmap;
if (fi->nr_dmaps) {
- dmap = inode_reclaim_one_dmap(fc, inode, fault);
+ dmap = inode_inline_reclaim_one_dmap(fc, inode, fault);
if (dmap)
return dmap;
/* If we could not reclaim a mapping because it
On Tue, Apr 14, 2020 at 03:30:45PM -0400, Vivek Goyal wrote:
> On Sat, Mar 28, 2020 at 06:06:06AM +0800, Liu Bo wrote:
> > On Fri, Mar 27, 2020 at 10:01:14AM -0400, Vivek Goyal wrote:
> > > On Thu, Mar 26, 2020 at 08:09:05AM +0800, Liu Bo wrote:
> > >
> > > [..]
> >
> > Hi Vivek,
> >
> > dax_iomap_{pte,pmd}_fault has called grab_mapping_entry to get a
> > locked entry, when this fault gets into inline reclaim, would
> > fuse_break_dax_layouts wait for the locked exceptional entry which is
> > locked in dax_iomap_{pte,pmd}_fault?
>
> Hi Liu Bo,
>
> This is a good point. Indeed it can deadlock the way code is written
> currently.
>
It's 100% reproducible on 4.19, but not on 5.x, which uses an xarray in
dax_layout_busy_page.
It was weird that on 5.x kernels the deadlock is gone; it turned out
that the xarray search in dax_layout_busy_page simply skips the empty
locked exceptional entry. I didn't dig deeper to find out whether that's
reasonable, but with that behavior 5.x doesn't run into the deadlock.
thanks,
liubo
On Thu, Apr 16, 2020 at 01:22:29AM +0800, Liu Bo wrote:
> On Tue, Apr 14, 2020 at 03:30:45PM -0400, Vivek Goyal wrote:
> > On Sat, Mar 28, 2020 at 06:06:06AM +0800, Liu Bo wrote:
> > > On Fri, Mar 27, 2020 at 10:01:14AM -0400, Vivek Goyal wrote:
> > > > On Thu, Mar 26, 2020 at 08:09:05AM +0800, Liu Bo wrote:
> > > >
> > > > [..]
> >
> > Hi Liu Bo,
> >
> > This is a good point. Indeed it can deadlock the way code is written
> > currently.
> >
>
> It's 100% reproducible on 4.19, but not on 5.x which has xarray for
> dax_layout_busy_page.
>
> It was weird that on 5.x kernel the deadlock is gone, it turned out
> that xarray search in dax_layout_busy_page simply skips the empty
> locked exceptional entry, I didn't get deeper to find out whether it's
> reasonable, but with that 5.x doesn't run to deadlock.
I found more problems with enabling inline reclaim in the fault path. I
am holding fi->i_mmap_sem shared, and fuse_break_dax_layouts() can drop
fi->i_mmap_sem if a page is busy. I don't think we can drop and
reacquire fi->i_mmap_sem while in the fault path.
Also, fuse_break_dax_layouts() does not know whether we are holding it
shared or exclusive.
So I will probably have to go back to disabling inline reclaim in the
fault path. If a memory range is not available, go back up in
fuse_dax_fault(), drop the fi->i_mmap_sem lock, wait on a wait queue for
a range to become free, and retry.
I can retain the changes I made to break the layout for a 2MB range
only, and not the whole file. I think that's a good optimization to keep
anyway.
Vivek
On Thu, Apr 16, 2020 at 03:05:07PM -0400, Vivek Goyal wrote:
> On Thu, Apr 16, 2020 at 01:22:29AM +0800, Liu Bo wrote:
> > On Tue, Apr 14, 2020 at 03:30:45PM -0400, Vivek Goyal wrote:
> > > On Sat, Mar 28, 2020 at 06:06:06AM +0800, Liu Bo wrote:
> > > > On Fri, Mar 27, 2020 at 10:01:14AM -0400, Vivek Goyal wrote:
> > > > > On Thu, Mar 26, 2020 at 08:09:05AM +0800, Liu Bo wrote:
> > > > >
> > > > > [..]
> >
> > It's 100% reproducible on 4.19, but not on 5.x which has xarray for
> > dax_layout_busy_page.
> >
> > It was weird that on 5.x kernel the deadlock is gone, it turned out
> > that xarray search in dax_layout_busy_page simply skips the empty
> > locked exceptional entry, I didn't get deeper to find out whether it's
> > reasonable, but with that 5.x doesn't run to deadlock.
>
> I found more problems with enabling inline reclaim in fault path. I
> am holding fi->i_mmap_sem, shared and fuse_break_dax_layouts() can
> drop fi->i_mmap_sem if page is busy. I don't think we can drop and
> reacquire fi->i_mmap_sem while in fault path.
>
Good point, yes, dropping & reacquiring the lock might bring more
trouble w.r.t. races on i_mmap_sem.
> Also fuse_break_dax_layouts() does not know if we are holding it
> shared or exclusive.
>
> So I will probably have to go back to disable inline reclaim in
> fault path. If memory range is not available go back up in
> fuse_dax_fault(), drop fi->i_mmap_sem lock and wait on wait queue for
> a range to become free and retry.
>
> I can retain the changes I did to break layout for a 2MB range only
> and not the whole file. I think that's a good optimization to retain
> anyway.
>
That part does look reasonable to me.
thanks,
liubo