Hi All,
This is V3 of patches. I had posted version v2 version here.
https://lore.kernel.org/linux-fsdevel/[email protected]/
I have taken care of comments from V2. Changes from V2 are.
- Rebased patches on top of 5.9-rc1
- Renamed couple of functions to get rid of iomap prefix. (Dave Chinner)
- Modified truncate/punch_hole paths to serialize with dax fault
path. For now did this only for dax paths. May be non-dax path
can benefit from this too. But that is an option for a different
day. (Dave Chinner).
- Took care of comments by Jan Kara in dax_layout_busy_page_range()
implementation patch.
- Dropped one of the patches which forced sync release in
fuse_file_put() path for DAX files. It was redundant now as virtiofs
already sets fs_context->destroy which forces sync release. (Miklos)
- Took care of some of the errors flagged by checkpatch.pl.
Description from previous post
------------------------------
This patch series adds DAX support to virtiofs filesystem. This allows
bypassing guest page cache and allows mapping host page cache directly
in guest address space.
When a page of file is needed, guest sends a request to map that page
(in host page cache) in qemu address space. Inside guest this is
a physical memory range controlled by virtiofs device. And guest
directly maps this physical address range using DAX and hence gets
access to file data on host.
This can speed up things considerably in many situations. Also this
can result in substantial memory savings as file data does not have
to be copied in guest and it is directly accessed from host page
cache.
Most of the changes are limited to fuse/virtiofs. There are couple
of changes needed in generic dax infrastructure and couple of changes
in virtio to be able to access shared memory region.
Thanks
Vivek
Sebastien Boeuf (3):
virtio: Add get_shm_region method
virtio: Implement get_shm_region for PCI transport
virtio: Implement get_shm_region for MMIO transport
Stefan Hajnoczi (2):
virtio_fs, dax: Set up virtio_fs dax_device
fuse,dax: add DAX mmap support
Vivek Goyal (13):
dax: Modify bdev_dax_pgoff() to handle NULL bdev
dax: Create a range version of dax_layout_busy_page()
virtiofs: Provide a helper function for virtqueue initialization
fuse: Get rid of no_mount_options
fuse,virtiofs: Add a mount option to enable dax
fuse,virtiofs: Keep a list of free dax memory ranges
fuse: implement FUSE_INIT map_alignment field
fuse: Introduce setupmapping/removemapping commands
fuse, dax: Implement dax read/write operations
fuse,virtiofs: Define dax address space operations
fuse, dax: Serialize truncate/punch_hole and dax fault path
fuse,virtiofs: Maintain a list of busy elements
fuse,virtiofs: Add logic to free up a memory range
drivers/dax/super.c | 3 +-
drivers/virtio/virtio_mmio.c | 31 +
drivers/virtio/virtio_pci_modern.c | 95 +++
fs/dax.c | 29 +-
fs/fuse/dir.c | 32 +-
fs/fuse/file.c | 1198 +++++++++++++++++++++++++++-
fs/fuse/fuse_i.h | 114 ++-
fs/fuse/inode.c | 146 +++-
fs/fuse/virtio_fs.c | 279 ++++++-
include/linux/dax.h | 6 +
include/linux/virtio_config.h | 17 +
include/uapi/linux/fuse.h | 34 +-
include/uapi/linux/virtio_fs.h | 3 +
include/uapi/linux/virtio_mmio.h | 11 +
include/uapi/linux/virtio_pci.h | 11 +-
15 files changed, 1933 insertions(+), 76 deletions(-)
Cc: Jan Kara <[email protected]>
Cc: Dave Chinner <[email protected]>
Cc: Christoph Hellwig <[email protected]>
Cc: Ira Weiny <[email protected]>
Cc: "Michael S. Tsirkin" <[email protected]>
Cc: Vishal L Verma <[email protected]>
--
2.25.4
Introduce two new fuse commands to setup/remove memory mappings. This
will be used to setup/tear down file mapping in dax window.
Signed-off-by: Vivek Goyal <[email protected]>
Signed-off-by: Peng Tao <[email protected]>
---
include/uapi/linux/fuse.h | 29 +++++++++++++++++++++++++++++
1 file changed, 29 insertions(+)
diff --git a/include/uapi/linux/fuse.h b/include/uapi/linux/fuse.h
index 5b85819e045f..60a7bfc787ce 100644
--- a/include/uapi/linux/fuse.h
+++ b/include/uapi/linux/fuse.h
@@ -894,4 +894,33 @@ struct fuse_copy_file_range_in {
uint64_t flags;
};
+#define FUSE_SETUPMAPPING_FLAG_WRITE (1ull << 0)
+struct fuse_setupmapping_in {
+ /* An already open handle */
+ uint64_t fh;
+ /* Offset into the file to start the mapping */
+ uint64_t foffset;
+ /* Length of mapping required */
+ uint64_t len;
+ /* Flags, FUSE_SETUPMAPPING_FLAG_* */
+ uint64_t flags;
+ /* Offset in Memory Window */
+ uint64_t moffset;
+};
+
+struct fuse_removemapping_in {
+ /* number of fuse_removemapping_one follows */
+ uint32_t count;
+};
+
+struct fuse_removemapping_one {
+ /* Offset into the dax window start the unmapping */
+ uint64_t moffset;
+ /* Length of mapping required */
+ uint64_t len;
+};
+
+#define FUSE_REMOVEMAPPING_MAX_ENTRY \
+ (PAGE_SIZE / sizeof(struct fuse_removemapping_one))
+
#endif /* _LINUX_FUSE_H */
--
2.25.4
This reduces code duplication and make it little easier to read code.
Signed-off-by: Vivek Goyal <[email protected]>
---
fs/fuse/virtio_fs.c | 50 +++++++++++++++++++++++++++------------------
1 file changed, 30 insertions(+), 20 deletions(-)
diff --git a/fs/fuse/virtio_fs.c b/fs/fuse/virtio_fs.c
index 104f35de5270..ed8da4825b70 100644
--- a/fs/fuse/virtio_fs.c
+++ b/fs/fuse/virtio_fs.c
@@ -24,6 +24,8 @@ enum {
VQ_REQUEST
};
+#define VQ_NAME_LEN 24
+
/* Per-virtqueue state */
struct virtio_fs_vq {
spinlock_t lock;
@@ -36,7 +38,7 @@ struct virtio_fs_vq {
bool connected;
long in_flight;
struct completion in_flight_zero; /* No inflight requests */
- char name[24];
+ char name[VQ_NAME_LEN];
} ____cacheline_aligned_in_smp;
/* A virtio-fs device instance */
@@ -596,6 +598,26 @@ static void virtio_fs_vq_done(struct virtqueue *vq)
schedule_work(&fsvq->done_work);
}
+static void virtio_fs_init_vq(struct virtio_fs_vq *fsvq, char *name,
+ int vq_type)
+{
+ strncpy(fsvq->name, name, VQ_NAME_LEN);
+ spin_lock_init(&fsvq->lock);
+ INIT_LIST_HEAD(&fsvq->queued_reqs);
+ INIT_LIST_HEAD(&fsvq->end_reqs);
+ init_completion(&fsvq->in_flight_zero);
+
+ if (vq_type == VQ_REQUEST) {
+ INIT_WORK(&fsvq->done_work, virtio_fs_requests_done_work);
+ INIT_DELAYED_WORK(&fsvq->dispatch_work,
+ virtio_fs_request_dispatch_work);
+ } else {
+ INIT_WORK(&fsvq->done_work, virtio_fs_hiprio_done_work);
+ INIT_DELAYED_WORK(&fsvq->dispatch_work,
+ virtio_fs_hiprio_dispatch_work);
+ }
+}
+
/* Initialize virtqueues */
static int virtio_fs_setup_vqs(struct virtio_device *vdev,
struct virtio_fs *fs)
@@ -611,7 +633,7 @@ static int virtio_fs_setup_vqs(struct virtio_device *vdev,
if (fs->num_request_queues == 0)
return -EINVAL;
- fs->nvqs = 1 + fs->num_request_queues;
+ fs->nvqs = VQ_REQUEST + fs->num_request_queues;
fs->vqs = kcalloc(fs->nvqs, sizeof(fs->vqs[VQ_HIPRIO]), GFP_KERNEL);
if (!fs->vqs)
return -ENOMEM;
@@ -625,29 +647,17 @@ static int virtio_fs_setup_vqs(struct virtio_device *vdev,
goto out;
}
+ /* Initialize the hiprio/forget request virtqueue */
callbacks[VQ_HIPRIO] = virtio_fs_vq_done;
- snprintf(fs->vqs[VQ_HIPRIO].name, sizeof(fs->vqs[VQ_HIPRIO].name),
- "hiprio");
+ virtio_fs_init_vq(&fs->vqs[VQ_HIPRIO], "hiprio", VQ_HIPRIO);
names[VQ_HIPRIO] = fs->vqs[VQ_HIPRIO].name;
- INIT_WORK(&fs->vqs[VQ_HIPRIO].done_work, virtio_fs_hiprio_done_work);
- INIT_LIST_HEAD(&fs->vqs[VQ_HIPRIO].queued_reqs);
- INIT_LIST_HEAD(&fs->vqs[VQ_HIPRIO].end_reqs);
- INIT_DELAYED_WORK(&fs->vqs[VQ_HIPRIO].dispatch_work,
- virtio_fs_hiprio_dispatch_work);
- init_completion(&fs->vqs[VQ_HIPRIO].in_flight_zero);
- spin_lock_init(&fs->vqs[VQ_HIPRIO].lock);
/* Initialize the requests virtqueues */
for (i = VQ_REQUEST; i < fs->nvqs; i++) {
- spin_lock_init(&fs->vqs[i].lock);
- INIT_WORK(&fs->vqs[i].done_work, virtio_fs_requests_done_work);
- INIT_DELAYED_WORK(&fs->vqs[i].dispatch_work,
- virtio_fs_request_dispatch_work);
- INIT_LIST_HEAD(&fs->vqs[i].queued_reqs);
- INIT_LIST_HEAD(&fs->vqs[i].end_reqs);
- init_completion(&fs->vqs[i].in_flight_zero);
- snprintf(fs->vqs[i].name, sizeof(fs->vqs[i].name),
- "requests.%u", i - VQ_REQUEST);
+ char vq_name[VQ_NAME_LEN];
+
+ snprintf(vq_name, VQ_NAME_LEN, "requests.%u", i - VQ_REQUEST);
+ virtio_fs_init_vq(&fs->vqs[i], vq_name, VQ_REQUEST);
callbacks[i] = virtio_fs_vq_done;
names[i] = fs->vqs[i].name;
}
--
2.25.4
From: Sebastien Boeuf <[email protected]>
On PCI the shm regions are found using capability entries;
find a region by searching for the capability.
Signed-off-by: Sebastien Boeuf <[email protected]>
Signed-off-by: Dr. David Alan Gilbert <[email protected]>
Signed-off-by: kbuild test robot <[email protected]>
Acked-by: Michael S. Tsirkin <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: "Michael S. Tsirkin" <[email protected]>
---
drivers/virtio/virtio_pci_modern.c | 95 ++++++++++++++++++++++++++++++
include/uapi/linux/virtio_pci.h | 11 +++-
2 files changed, 105 insertions(+), 1 deletion(-)
diff --git a/drivers/virtio/virtio_pci_modern.c b/drivers/virtio/virtio_pci_modern.c
index 3e14e700b231..3d6ae5a5e252 100644
--- a/drivers/virtio/virtio_pci_modern.c
+++ b/drivers/virtio/virtio_pci_modern.c
@@ -444,6 +444,99 @@ static void del_vq(struct virtio_pci_vq_info *info)
vring_del_virtqueue(vq);
}
+static int virtio_pci_find_shm_cap(struct pci_dev *dev, u8 required_id,
+ u8 *bar, u64 *offset, u64 *len)
+{
+ int pos;
+
+ for (pos = pci_find_capability(dev, PCI_CAP_ID_VNDR); pos > 0;
+ pos = pci_find_next_capability(dev, pos, PCI_CAP_ID_VNDR)) {
+ u8 type, cap_len, id;
+ u32 tmp32;
+ u64 res_offset, res_length;
+
+ pci_read_config_byte(dev, pos + offsetof(struct virtio_pci_cap,
+ cfg_type), &type);
+ if (type != VIRTIO_PCI_CAP_SHARED_MEMORY_CFG)
+ continue;
+
+ pci_read_config_byte(dev, pos + offsetof(struct virtio_pci_cap,
+ cap_len), &cap_len);
+ if (cap_len != sizeof(struct virtio_pci_cap64)) {
+ dev_err(&dev->dev, "%s: shm cap with bad size offset:"
+ " %d size: %d\n", __func__, pos, cap_len);
+ continue;
+ }
+
+ pci_read_config_byte(dev, pos + offsetof(struct virtio_pci_cap,
+ id), &id);
+ if (id != required_id)
+ continue;
+
+ /* Type, and ID match, looks good */
+ pci_read_config_byte(dev, pos + offsetof(struct virtio_pci_cap,
+ bar), bar);
+
+ /* Read the lower 32bit of length and offset */
+ pci_read_config_dword(dev, pos + offsetof(struct virtio_pci_cap,
+ offset), &tmp32);
+ res_offset = tmp32;
+ pci_read_config_dword(dev, pos + offsetof(struct virtio_pci_cap,
+ length), &tmp32);
+ res_length = tmp32;
+
+ /* and now the top half */
+ pci_read_config_dword(dev,
+ pos + offsetof(struct virtio_pci_cap64,
+ offset_hi), &tmp32);
+ res_offset |= ((u64)tmp32) << 32;
+ pci_read_config_dword(dev,
+ pos + offsetof(struct virtio_pci_cap64,
+ length_hi), &tmp32);
+ res_length |= ((u64)tmp32) << 32;
+
+ *offset = res_offset;
+ *len = res_length;
+
+ return pos;
+ }
+ return 0;
+}
+
+static bool vp_get_shm_region(struct virtio_device *vdev,
+ struct virtio_shm_region *region, u8 id)
+{
+ struct virtio_pci_device *vp_dev = to_vp_device(vdev);
+ struct pci_dev *pci_dev = vp_dev->pci_dev;
+ u8 bar;
+ u64 offset, len;
+ phys_addr_t phys_addr;
+ size_t bar_len;
+
+ if (!virtio_pci_find_shm_cap(pci_dev, id, &bar, &offset, &len))
+ return false;
+
+ phys_addr = pci_resource_start(pci_dev, bar);
+ bar_len = pci_resource_len(pci_dev, bar);
+
+ if ((offset + len) < offset) {
+ dev_err(&pci_dev->dev, "%s: cap offset+len overflow detected\n",
+ __func__);
+ return false;
+ }
+
+ if (offset + len > bar_len) {
+ dev_err(&pci_dev->dev, "%s: bar shorter than cap offset+len\n",
+ __func__);
+ return false;
+ }
+
+ region->len = len;
+ region->addr = (u64) phys_addr + offset;
+
+ return true;
+}
+
static const struct virtio_config_ops virtio_pci_config_nodev_ops = {
.get = NULL,
.set = NULL,
@@ -458,6 +551,7 @@ static const struct virtio_config_ops virtio_pci_config_nodev_ops = {
.bus_name = vp_bus_name,
.set_vq_affinity = vp_set_vq_affinity,
.get_vq_affinity = vp_get_vq_affinity,
+ .get_shm_region = vp_get_shm_region,
};
static const struct virtio_config_ops virtio_pci_config_ops = {
@@ -474,6 +568,7 @@ static const struct virtio_config_ops virtio_pci_config_ops = {
.bus_name = vp_bus_name,
.set_vq_affinity = vp_set_vq_affinity,
.get_vq_affinity = vp_get_vq_affinity,
+ .get_shm_region = vp_get_shm_region,
};
/**
diff --git a/include/uapi/linux/virtio_pci.h b/include/uapi/linux/virtio_pci.h
index 90007a1abcab..3a86f36d7e3d 100644
--- a/include/uapi/linux/virtio_pci.h
+++ b/include/uapi/linux/virtio_pci.h
@@ -113,6 +113,8 @@
#define VIRTIO_PCI_CAP_DEVICE_CFG 4
/* PCI configuration access */
#define VIRTIO_PCI_CAP_PCI_CFG 5
+/* Additional shared memory capability */
+#define VIRTIO_PCI_CAP_SHARED_MEMORY_CFG 8
/* This is the PCI capability header: */
struct virtio_pci_cap {
@@ -121,11 +123,18 @@ struct virtio_pci_cap {
__u8 cap_len; /* Generic PCI field: capability length */
__u8 cfg_type; /* Identifies the structure. */
__u8 bar; /* Where to find it. */
- __u8 padding[3]; /* Pad to full dword. */
+ __u8 id; /* Multiple capabilities of the same type */
+ __u8 padding[2]; /* Pad to full dword. */
__le32 offset; /* Offset within bar. */
__le32 length; /* Length of the structure, in bytes. */
};
+struct virtio_pci_cap64 {
+ struct virtio_pci_cap cap;
+ __le32 offset_hi; /* Most sig 32 bits of offset */
+ __le32 length_hi; /* Most sig 32 bits of length */
+};
+
struct virtio_pci_notify_cap {
struct virtio_pci_cap cap;
__le32 notify_off_multiplier; /* Multiplier for queue_notify_off. */
--
2.25.4
Currently in fuse we don't seem have any lock which can serialize fault
path with truncate/punch_hole path. With dax support I need one for
following reasons.
1. Dax requirement
DAX fault code relies on inode size being stable for the duration of
fault and want to serialize with truncate/punch_hole and they explicitly
mention it.
static vm_fault_t dax_iomap_pmd_fault(struct vm_fault *vmf, pfn_t *pfnp,
const struct iomap_ops *ops)
/*
* Check whether offset isn't beyond end of file now. Caller is
* supposed to hold locks serializing us with truncate / punch hole so
* this is a reliable test.
*/
max_pgoff = DIV_ROUND_UP(i_size_read(inode), PAGE_SIZE);
2. Make sure there are no users of pages being truncated/punch_hole
get_user_pages() might take references to page and then do some DMA
to said pages. Filesystem might truncate those pages without knowing
that a DMA is in progress or some I/O is in progress. So use
dax_layout_busy_page() to make sure there are no such references
and I/O is not in progress on said pages before moving ahead with
truncation.
3. Limitation of kvm page fault error reporting
If we are truncating file on host first and then removing mappings in
guest lateter (truncate page cache etc), then this could lead to a
problem with KVM. Say a mapping is in place in guest and truncation
happens on host. Now if guest accesses that mapping, then host will
take a fault and kvm will either exit to qemu or spin infinitely.
IOW, before we do truncation on host, we need to make sure that guest
inode does not have any mapping in that region or whole file.
4. virtiofs memory range reclaim
Soon I will introduce the notion of being able to reclaim dax memory
ranges from a fuse dax inode. There also I need to make sure that
no I/O or fault is going on in the reclaimed range and nobody is using
it so that range can be reclaimed without issues.
Currently if we take inode lock, that serializes read/write. But it does
not do anything for faults. So I add another semaphore fuse_inode->i_mmap_sem
for this purpose. It can be used to serialize with faults.
As of now, I am adding taking this semaphore only in dax fault path and
not regular fault path because existing code does not have one. May
be existing code can benefit from it as well to take care of some
races, but that we can fix later if need be. For now, I am just focussing
only on DAX path which is new path.
Also added logic to take fuse_inode->i_mmap_sem in
truncate/punch_hole/open(O_TRUNC) path to make sure file truncation and
fuse dax fault are mutually exlusive and avoid all the above problems.
Signed-off-by: Vivek Goyal <[email protected]>
Cc: Dave Chinner <[email protected]>
---
fs/fuse/dir.c | 32 ++++++++++++++-----
fs/fuse/file.c | 81 +++++++++++++++++++++++++++++++++++++++++++++---
fs/fuse/fuse_i.h | 9 ++++++
fs/fuse/inode.c | 1 +
4 files changed, 112 insertions(+), 11 deletions(-)
diff --git a/fs/fuse/dir.c b/fs/fuse/dir.c
index 26f028bc760b..4c7e29ba7c4c 100644
--- a/fs/fuse/dir.c
+++ b/fs/fuse/dir.c
@@ -1501,6 +1501,7 @@ int fuse_do_setattr(struct dentry *dentry, struct iattr *attr,
loff_t oldsize;
int err;
bool trust_local_cmtime = is_wb && S_ISREG(inode->i_mode);
+ bool fault_blocked = false;
if (!fc->default_permissions)
attr->ia_valid |= ATTR_FORCE;
@@ -1509,6 +1510,22 @@ int fuse_do_setattr(struct dentry *dentry, struct iattr *attr,
if (err)
return err;
+ if (attr->ia_valid & ATTR_SIZE) {
+ if (WARN_ON(!S_ISREG(inode->i_mode)))
+ return -EIO;
+ is_truncate = true;
+ }
+
+ if (IS_DAX(inode) && is_truncate) {
+ down_write(&fi->i_mmap_sem);
+ fault_blocked = true;
+ err = fuse_break_dax_layouts(inode, 0, 0);
+ if (err) {
+ up_write(&fi->i_mmap_sem);
+ return err;
+ }
+ }
+
if (attr->ia_valid & ATTR_OPEN) {
/* This is coming from open(..., ... | O_TRUNC); */
WARN_ON(!(attr->ia_valid & ATTR_SIZE));
@@ -1521,17 +1538,11 @@ int fuse_do_setattr(struct dentry *dentry, struct iattr *attr,
*/
i_size_write(inode, 0);
truncate_pagecache(inode, 0);
- return 0;
+ goto out;
}
file = NULL;
}
- if (attr->ia_valid & ATTR_SIZE) {
- if (WARN_ON(!S_ISREG(inode->i_mode)))
- return -EIO;
- is_truncate = true;
- }
-
/* Flush dirty data/metadata before non-truncate SETATTR */
if (is_wb && S_ISREG(inode->i_mode) &&
attr->ia_valid &
@@ -1614,6 +1625,10 @@ int fuse_do_setattr(struct dentry *dentry, struct iattr *attr,
}
clear_bit(FUSE_I_SIZE_UNSTABLE, &fi->state);
+out:
+ if (fault_blocked)
+ up_write(&fi->i_mmap_sem);
+
return 0;
error:
@@ -1621,6 +1636,9 @@ int fuse_do_setattr(struct dentry *dentry, struct iattr *attr,
fuse_release_nowrite(inode);
clear_bit(FUSE_I_SIZE_UNSTABLE, &fi->state);
+
+ if (fault_blocked)
+ up_write(&fi->i_mmap_sem);
return err;
}
diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index 0eecb4097c14..aaa57c625af7 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -445,23 +445,35 @@ int fuse_open_common(struct inode *inode, struct file *file, bool isdir)
int err;
bool is_wb_truncate = (file->f_flags & O_TRUNC) &&
fc->atomic_o_trunc &&
- fc->writeback_cache;
+ (fc->writeback_cache);
+ bool dax_truncate = (file->f_flags & O_TRUNC) &&
+ fc->atomic_o_trunc && IS_DAX(inode);
err = generic_file_open(inode, file);
if (err)
return err;
- if (is_wb_truncate) {
+ if (is_wb_truncate || dax_truncate) {
inode_lock(inode);
fuse_set_nowrite(inode);
}
- err = fuse_do_open(fc, get_node_id(inode), file, isdir);
+ if (dax_truncate) {
+ down_write(&get_fuse_inode(inode)->i_mmap_sem);
+ err = fuse_break_dax_layouts(inode, 0, 0);
+ if (err)
+ goto out;
+ }
+ err = fuse_do_open(fc, get_node_id(inode), file, isdir);
if (!err)
fuse_finish_open(inode, file);
- if (is_wb_truncate) {
+out:
+ if (dax_truncate)
+ up_write(&get_fuse_inode(inode)->i_mmap_sem);
+
+ if (is_wb_truncate | dax_truncate) {
fuse_release_nowrite(inode);
inode_unlock(inode);
}
@@ -2011,6 +2023,47 @@ static const struct iomap_ops fuse_iomap_ops = {
.iomap_end = fuse_iomap_end,
};
+static void fuse_wait_dax_page(struct inode *inode)
+{
+ struct fuse_inode *fi = get_fuse_inode(inode);
+
+ up_write(&fi->i_mmap_sem);
+ schedule();
+ down_write(&fi->i_mmap_sem);
+}
+
+/* Should be called with fi->i_mmap_sem lock held exclusively */
+static int __fuse_break_dax_layouts(struct inode *inode, bool *retry,
+ loff_t start, loff_t end)
+{
+ struct page *page;
+
+ page = dax_layout_busy_page_range(inode->i_mapping, start, end);
+ if (!page)
+ return 0;
+
+ *retry = true;
+ return ___wait_var_event(&page->_refcount,
+ atomic_read(&page->_refcount) == 1, TASK_INTERRUPTIBLE,
+ 0, 0, fuse_wait_dax_page(inode));
+}
+
+/* dmap_end == 0 leads to unmapping of whole file */
+int fuse_break_dax_layouts(struct inode *inode, u64 dmap_start,
+ u64 dmap_end)
+{
+ bool retry;
+ int ret;
+
+ do {
+ retry = false;
+ ret = __fuse_break_dax_layouts(inode, &retry, dmap_start,
+ dmap_end);
+ } while (ret == 0 && retry);
+
+ return ret;
+}
+
static ssize_t fuse_dax_read_iter(struct kiocb *iocb, struct iov_iter *to)
{
struct inode *inode = file_inode(iocb->ki_filp);
@@ -2889,10 +2942,18 @@ static vm_fault_t __fuse_dax_fault(struct vm_fault *vmf,
if (write)
sb_start_pagefault(sb);
+ /*
+ * We need to serialize against not only truncate but also against
+ * fuse dax memory range reclaim. While a range is being reclaimed,
+ * we do not want any read/write/mmap to make progress and try
+ * to populate page cache or access memory we are trying to free.
+ */
+ down_read(&get_fuse_inode(inode)->i_mmap_sem);
ret = dax_iomap_fault(vmf, pe_size, &pfn, NULL, &fuse_iomap_ops);
if (ret & VM_FAULT_NEEDDSYNC)
ret = dax_finish_sync_fault(vmf, pe_size, pfn);
+ up_read(&get_fuse_inode(inode)->i_mmap_sem);
if (write)
sb_end_pagefault(sb);
@@ -3811,6 +3872,8 @@ static long fuse_file_fallocate(struct file *file, int mode, loff_t offset,
bool lock_inode = !(mode & FALLOC_FL_KEEP_SIZE) ||
(mode & FALLOC_FL_PUNCH_HOLE);
+ bool block_faults = IS_DAX(inode) && lock_inode;
+
if (mode & ~(FALLOC_FL_KEEP_SIZE | FALLOC_FL_PUNCH_HOLE))
return -EOPNOTSUPP;
@@ -3819,6 +3882,13 @@ static long fuse_file_fallocate(struct file *file, int mode, loff_t offset,
if (lock_inode) {
inode_lock(inode);
+ if (block_faults) {
+ down_write(&fi->i_mmap_sem);
+ err = fuse_break_dax_layouts(inode, 0, 0);
+ if (err)
+ goto out;
+ }
+
if (mode & FALLOC_FL_PUNCH_HOLE) {
loff_t endbyte = offset + length - 1;
@@ -3868,6 +3938,9 @@ static long fuse_file_fallocate(struct file *file, int mode, loff_t offset,
if (!(mode & FALLOC_FL_KEEP_SIZE))
clear_bit(FUSE_I_SIZE_UNSTABLE, &fi->state);
+ if (block_faults)
+ up_write(&fi->i_mmap_sem);
+
if (lock_inode)
inode_unlock(inode);
diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
index 22fb01ba55fb..e555c9a33359 100644
--- a/fs/fuse/fuse_i.h
+++ b/fs/fuse/fuse_i.h
@@ -176,6 +176,13 @@ struct fuse_inode {
/** Lock to protect write related fields */
spinlock_t lock;
+ /**
+ * Can't take inode lock in fault path (leads to circular dependency).
+ * Introduce another semaphore which can be taken in fault path and
+ * then other filesystem paths can take this to block faults.
+ */
+ struct rw_semaphore i_mmap_sem;
+
/*
* Semaphore to protect modifications to dmap_tree
*/
@@ -1158,4 +1165,6 @@ node_to_dmap(struct interval_tree_node *node)
return container_of(node, struct fuse_dax_mapping, itn);
}
+int fuse_break_dax_layouts(struct inode *inode, u64 dmap_start, u64 dmap_end);
+
#endif /* _FS_FUSE_I_H */
diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
index 8ea2d3ebcb36..3735bc5fdfa2 100644
--- a/fs/fuse/inode.c
+++ b/fs/fuse/inode.c
@@ -88,6 +88,7 @@ static struct inode *fuse_alloc_inode(struct super_block *sb)
fi->state = 0;
fi->nr_dmaps = 0;
mutex_init(&fi->mutex);
+ init_rwsem(&fi->i_mmap_sem);
init_rwsem(&fi->i_dmap_sem);
spin_lock_init(&fi->lock);
fi->forget = fuse_alloc_forget();
--
2.25.4
From: Stefan Hajnoczi <[email protected]>
Add DAX mmap() support.
Signed-off-by: Stefan Hajnoczi <[email protected]>
---
fs/fuse/file.c | 62 +++++++++++++++++++++++++++++++++++++++++++++++++-
1 file changed, 61 insertions(+), 1 deletion(-)
diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index 99457d0b14b9..f1ad8b95b546 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -2841,10 +2841,15 @@ static const struct vm_operations_struct fuse_file_vm_ops = {
.page_mkwrite = fuse_page_mkwrite,
};
+static int fuse_dax_mmap(struct file *file, struct vm_area_struct *vma);
static int fuse_file_mmap(struct file *file, struct vm_area_struct *vma)
{
struct fuse_file *ff = file->private_data;
+ /* DAX mmap is superior to direct_io mmap */
+ if (IS_DAX(file_inode(file)))
+ return fuse_dax_mmap(file, vma);
+
if (ff->open_flags & FOPEN_DIRECT_IO) {
/* Can't provide the coherency needed for MAP_SHARED */
if (vma->vm_flags & VM_MAYSHARE)
@@ -2863,9 +2868,63 @@ static int fuse_file_mmap(struct file *file, struct vm_area_struct *vma)
return 0;
}
+static vm_fault_t __fuse_dax_fault(struct vm_fault *vmf,
+ enum page_entry_size pe_size, bool write)
+{
+ vm_fault_t ret;
+ struct inode *inode = file_inode(vmf->vma->vm_file);
+ struct super_block *sb = inode->i_sb;
+ pfn_t pfn;
+
+ if (write)
+ sb_start_pagefault(sb);
+
+ ret = dax_iomap_fault(vmf, pe_size, &pfn, NULL, &fuse_iomap_ops);
+
+ if (ret & VM_FAULT_NEEDDSYNC)
+ ret = dax_finish_sync_fault(vmf, pe_size, pfn);
+
+ if (write)
+ sb_end_pagefault(sb);
+
+ return ret;
+}
+
+static vm_fault_t fuse_dax_fault(struct vm_fault *vmf)
+{
+ return __fuse_dax_fault(vmf, PE_SIZE_PTE,
+ vmf->flags & FAULT_FLAG_WRITE);
+}
+
+static vm_fault_t fuse_dax_huge_fault(struct vm_fault *vmf,
+ enum page_entry_size pe_size)
+{
+ return __fuse_dax_fault(vmf, pe_size, vmf->flags & FAULT_FLAG_WRITE);
+}
+
+static vm_fault_t fuse_dax_page_mkwrite(struct vm_fault *vmf)
+{
+ return __fuse_dax_fault(vmf, PE_SIZE_PTE, true);
+}
+
+static vm_fault_t fuse_dax_pfn_mkwrite(struct vm_fault *vmf)
+{
+ return __fuse_dax_fault(vmf, PE_SIZE_PTE, true);
+}
+
+static const struct vm_operations_struct fuse_dax_vm_ops = {
+ .fault = fuse_dax_fault,
+ .huge_fault = fuse_dax_huge_fault,
+ .page_mkwrite = fuse_dax_page_mkwrite,
+ .pfn_mkwrite = fuse_dax_pfn_mkwrite,
+};
+
static int fuse_dax_mmap(struct file *file, struct vm_area_struct *vma)
{
- return -EINVAL; /* TODO */
+ file_accessed(file);
+ vma->vm_ops = &fuse_dax_vm_ops;
+ vma->vm_flags |= VM_MIXEDMAP | VM_HUGEPAGE;
+ return 0;
}
static int convert_fuse_file_lock(struct fuse_conn *fc,
@@ -3938,6 +3997,7 @@ static const struct file_operations fuse_file_operations = {
.release = fuse_release,
.fsync = fuse_fsync,
.lock = fuse_file_lock,
+ .get_unmapped_area = thp_get_unmapped_area,
.flock = fuse_file_flock,
.splice_read = generic_file_splice_read,
.splice_write = iter_file_splice_write,
--
2.25.4
Add logic to free up a busy memory range. Freed memory range will be
returned to free pool. Add a worker which can be started to select
and free some busy memory ranges.
Process can also steal one of its busy dax ranges if free range is not
available. I will refer it to as direct reclaim.
If free range is not available and nothing can't be stolen from same
inode, caller waits on a waitq for free range to become available.
For reclaiming a range, as of now we need to hold following locks in
specified order.
down_write(&fi->i_mmap_sem);
down_write(&fi->i_dmap_sem);
We look for a free range in following order.
A. Try to get a free range.
B. If not, try direct reclaim.
C. If not, wait for a memory range to become free
Signed-off-by: Vivek Goyal <[email protected]>
Signed-off-by: Liu Bo <[email protected]>
---
fs/fuse/file.c | 482 +++++++++++++++++++++++++++++++++++++++++++-
fs/fuse/fuse_i.h | 25 +++
fs/fuse/inode.c | 4 +
fs/fuse/virtio_fs.c | 5 +
4 files changed, 508 insertions(+), 8 deletions(-)
diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index 723602813ad6..12c4716fc1e5 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -8,6 +8,7 @@
#include "fuse_i.h"
+#include <linux/delay.h>
#include <linux/pagemap.h>
#include <linux/slab.h>
#include <linux/kernel.h>
@@ -35,6 +36,8 @@ static struct page **fuse_pages_alloc(unsigned int npages, gfp_t flags,
return pages;
}
+static struct fuse_dax_mapping *alloc_dax_mapping_reclaim(struct fuse_conn *fc,
+ struct inode *inode);
static int fuse_send_open(struct fuse_conn *fc, u64 nodeid, struct file *file,
int opcode, struct fuse_open_out *outargp)
{
@@ -191,6 +194,26 @@ static void fuse_link_write_file(struct file *file)
spin_unlock(&fi->lock);
}
+static void
+__kick_dmap_free_worker(struct fuse_conn *fc, unsigned long delay_ms)
+{
+ unsigned long free_threshold;
+
+ /* If number of free ranges are below threshold, start reclaim */
+ free_threshold = max((fc->nr_ranges * FUSE_DAX_RECLAIM_THRESHOLD)/100,
+ (unsigned long)1);
+ if (fc->nr_free_ranges < free_threshold)
+ queue_delayed_work(system_long_wq, &fc->dax_free_work,
+ msecs_to_jiffies(delay_ms));
+}
+
+static void kick_dmap_free_worker(struct fuse_conn *fc, unsigned long delay_ms)
+{
+ spin_lock(&fc->lock);
+ __kick_dmap_free_worker(fc, delay_ms);
+ spin_unlock(&fc->lock);
+}
+
static struct fuse_dax_mapping *alloc_dax_mapping(struct fuse_conn *fc)
{
struct fuse_dax_mapping *dmap = NULL;
@@ -199,7 +222,7 @@ static struct fuse_dax_mapping *alloc_dax_mapping(struct fuse_conn *fc)
if (fc->nr_free_ranges <= 0) {
spin_unlock(&fc->lock);
- return NULL;
+ goto out_kick;
}
WARN_ON(list_empty(&fc->free_ranges));
@@ -210,6 +233,9 @@ static struct fuse_dax_mapping *alloc_dax_mapping(struct fuse_conn *fc)
list_del_init(&dmap->list);
fc->nr_free_ranges--;
spin_unlock(&fc->lock);
+
+out_kick:
+ kick_dmap_free_worker(fc, 0);
return dmap;
}
@@ -236,6 +262,7 @@ static void __dmap_add_to_free_pool(struct fuse_conn *fc,
{
list_add_tail(&dmap->list, &fc->free_ranges);
fc->nr_free_ranges++;
+ wake_up(&fc->dax_range_waitq);
}
static void dmap_add_to_free_pool(struct fuse_conn *fc,
@@ -279,6 +306,12 @@ static int fuse_setup_one_mapping(struct inode *inode, unsigned long start_idx,
return err;
dmap->writable = writable;
if (!upgrade) {
+ /*
+ * We don't take a refernce on inode. inode is valid right now
+ * and when inode is going away, cleanup logic should first
+ * cleanup dmap entries.
+ */
+ dmap->inode = inode;
dmap->itn.start = dmap->itn.last = start_idx;
/* Protected by fi->i_dmap_sem */
interval_tree_insert(&dmap->itn, &fi->dmap_tree);
@@ -357,6 +390,7 @@ static void dmap_reinit_add_to_free_pool(struct fuse_conn *fc,
"window_offset=0x%llx length=0x%llx\n", dmap->itn.start,
dmap->itn.last, dmap->window_offset, dmap->length);
__dmap_remove_busy_list(fc, dmap);
+ dmap->inode = NULL;
dmap->itn.start = dmap->itn.last = 0;
__dmap_add_to_free_pool(fc, dmap);
}
@@ -384,6 +418,8 @@ static void inode_reclaim_dmap_range(struct fuse_conn *fc, struct inode *inode,
if (!node)
break;
dmap = node_to_dmap(node);
+ /* inode is going away. There should not be any users of dmap */
+ WARN_ON(refcount_read(&dmap->refcnt) > 1);
interval_tree_remove(&dmap->itn, &fi->dmap_tree);
num++;
list_add(&dmap->list, &to_remove);
@@ -418,6 +454,21 @@ static void inode_reclaim_dmap_range(struct fuse_conn *fc, struct inode *inode,
spin_unlock(&fc->lock);
}
+static int dmap_removemapping_one(struct inode *inode,
+ struct fuse_dax_mapping *dmap)
+{
+ struct fuse_removemapping_one forget_one;
+ struct fuse_removemapping_in inarg;
+
+ memset(&inarg, 0, sizeof(inarg));
+ inarg.count = 1;
+ memset(&forget_one, 0, sizeof(forget_one));
+ forget_one.moffset = dmap->window_offset;
+ forget_one.len = dmap->length;
+
+ return fuse_send_removemapping(inode, &inarg, &forget_one);
+}
+
/*
* It is called from evict_inode() and by that time inode is going away. So
* this function does not take any locks like fi->i_dmap_sem for traversing
@@ -1859,6 +1910,16 @@ static void fuse_fill_iomap(struct inode *inode, loff_t pos, loff_t length,
if (flags & IOMAP_FAULT)
iomap->length = ALIGN(len, PAGE_SIZE);
iomap->type = IOMAP_MAPPED;
+ /*
+ * increace refcnt so that reclaim code knows this dmap is in
+ * use. This assumes i_dmap_sem mutex is held either
+ * shared/exclusive.
+ */
+ refcount_inc(&dmap->refcnt);
+
+ /* iomap->private should be NULL */
+ WARN_ON_ONCE(iomap->private);
+ iomap->private = dmap;
} else {
/* Mapping beyond end of file is hole */
fuse_fill_iomap_hole(iomap, length);
@@ -1877,8 +1938,28 @@ static int fuse_setup_new_dax_mapping(struct inode *inode, loff_t pos,
unsigned long start_idx = pos >> FUSE_DAX_SHIFT;
struct interval_tree_node *node;
- alloc_dmap = alloc_dax_mapping(fc);
- if (!alloc_dmap)
+ /*
+ * Can't do inline reclaim in fault path. We call
+ * dax_layout_busy_page() before we free a range. And
+ * fuse_wait_dax_page() drops fi->i_mmap_sem lock and requires it.
+ * In fault path we enter with fi->i_mmap_sem held and can't drop
+ * it. Also in fault path we hold fi->i_mmap_sem shared and not
+ * exclusive, so that creates further issues with fuse_wait_dax_page().
+ * Hence return -EAGAIN and fuse_dax_fault() will wait for a memory
+ * range to become free and retry.
+ */
+ if (flags & IOMAP_FAULT) {
+ alloc_dmap = alloc_dax_mapping(fc);
+ if (!alloc_dmap)
+ return -EAGAIN;
+ } else {
+ alloc_dmap = alloc_dax_mapping_reclaim(fc, inode);
+ if (IS_ERR(alloc_dmap))
+ return PTR_ERR(alloc_dmap);
+ }
+
+ /* If we are here, we should have memory allocated */
+ if (WARN_ON(!alloc_dmap))
return -EBUSY;
/*
@@ -1930,16 +2011,26 @@ static int fuse_upgrade_dax_mapping(struct inode *inode, loff_t pos,
node = interval_tree_iter_first(&fi->dmap_tree, idx, idx);
/* We are holding either inode lock or i_mmap_sem, and that should
- * ensure that dmap can't reclaimed or truncated and it should still
- * be there in tree despite the fact we dropped and re-acquired the
- * lock.
+ * ensure that dmap can't be truncated. We are holding a reference
+ * on dmap and that should make sure it can't be reclaimed. So dmap
+ * should still be there in tree despite the fact we dropped and
+ * re-acquired the i_dmap_sem lock.
*/
ret = -EIO;
if (WARN_ON(!node))
goto out_err;
-
dmap = node_to_dmap(node);
+ /* We took an extra reference on dmap to make sure its not reclaimd.
+ * Now we hold i_dmap_sem lock and that reference is not needed
+ * anymore. Drop it.
+ */
+ if (refcount_dec_and_test(&dmap->refcnt)) {
+ /* refcount should not hit 0. This object only goes
+ * away when fuse connection goes away */
+ WARN_ON_ONCE(1);
+ }
+
/* Maybe another thread already upgraded mapping while we were not
* holding lock.
*/
@@ -1998,7 +2089,11 @@ static int fuse_iomap_begin(struct inode *inode, loff_t pos, loff_t length,
* two threads to be trying to this simultaneously
* for same dmap. So drop shared lock and acquire
* exclusive lock.
+ *
+ * Before dropping i_dmap_sem lock, take reference
+ * on dmap so that its not freed by range reclaim.
*/
+ refcount_inc(&dmap->refcnt);
up_read(&fi->i_dmap_sem);
pr_debug("%s: Upgrading mapping at offset 0x%llx"
" length 0x%llx\n", __func__, pos, length);
@@ -2034,6 +2129,16 @@ static int fuse_iomap_end(struct inode *inode, loff_t pos, loff_t length,
ssize_t written, unsigned flags,
struct iomap *iomap)
{
+ struct fuse_dax_mapping *dmap = iomap->private;
+
+ if (dmap) {
+ if (refcount_dec_and_test(&dmap->refcnt)) {
+ /* refcount should not hit 0. This object only goes
+ * away when fuse connection goes away */
+ WARN_ON_ONCE(1);
+ }
+ }
+
/* DAX writes beyond end-of-file aren't handled using iomap, so the
* file size is unchanged and there is nothing to do here.
*/
@@ -2960,9 +3065,15 @@ static vm_fault_t __fuse_dax_fault(struct vm_fault *vmf,
struct inode *inode = file_inode(vmf->vma->vm_file);
struct super_block *sb = inode->i_sb;
pfn_t pfn;
+ int error = 0;
+ struct fuse_conn *fc = get_fuse_conn(inode);
+ bool retry = false;
if (write)
sb_start_pagefault(sb);
+retry:
+ if (retry && !(fc->nr_free_ranges > 0))
+ wait_event(fc->dax_range_waitq, (fc->nr_free_ranges > 0));
/*
* We need to serialize against not only truncate but also against
@@ -2971,7 +3082,13 @@ static vm_fault_t __fuse_dax_fault(struct vm_fault *vmf,
* to populate page cache or access memory we are trying to free.
*/
down_read(&get_fuse_inode(inode)->i_mmap_sem);
- ret = dax_iomap_fault(vmf, pe_size, &pfn, NULL, &fuse_iomap_ops);
+ ret = dax_iomap_fault(vmf, pe_size, &pfn, &error, &fuse_iomap_ops);
+ if ((ret & VM_FAULT_ERROR) && error == -EAGAIN) {
+ error = 0;
+ retry = true;
+ up_read(&get_fuse_inode(inode)->i_mmap_sem);
+ goto retry;
+ }
if (ret & VM_FAULT_NEEDDSYNC)
ret = dax_finish_sync_fault(vmf, pe_size, pfn);
@@ -4153,3 +4270,352 @@ void fuse_init_file_inode(struct inode *inode)
inode->i_data.a_ops = &fuse_dax_file_aops;
}
}
+
+static int dmap_writeback_invalidate(struct inode *inode,
+ struct fuse_dax_mapping *dmap)
+{
+ int ret;
+ loff_t start_pos = dmap->itn.start << FUSE_DAX_SHIFT;
+ loff_t end_pos = (start_pos + FUSE_DAX_SZ - 1);
+
+ ret = filemap_fdatawrite_range(inode->i_mapping, start_pos, end_pos);
+ if (ret) {
+ pr_debug("fuse: filemap_fdatawrite_range() failed. err=%d"
+ " start_pos=0x%llx, end_pos=0x%llx\n", ret, start_pos,
+ end_pos);
+ return ret;
+ }
+
+ ret = invalidate_inode_pages2_range(inode->i_mapping,
+ start_pos >> PAGE_SHIFT,
+ end_pos >> PAGE_SHIFT);
+ if (ret)
+ pr_debug("fuse: invalidate_inode_pages2_range() failed err=%d\n"
+ , ret);
+
+ return ret;
+}
+
+static int reclaim_one_dmap_locked(struct fuse_conn *fc, struct inode *inode,
+ struct fuse_dax_mapping *dmap)
+{
+ int ret;
+ struct fuse_inode *fi = get_fuse_inode(inode);
+
+ /*
+ * igrab() was done to make sure inode won't go under us, and this
+ * further avoids the race with evict().
+ */
+ ret = dmap_writeback_invalidate(inode, dmap);
+ if (ret)
+ return ret;
+
+ /* Remove dax mapping from inode interval tree now */
+ interval_tree_remove(&dmap->itn, &fi->dmap_tree);
+ fi->nr_dmaps--;
+
+ /* It is possible that umount/shutodwn has killed the fuse connection
+ * and worker thread is trying to reclaim memory in parallel. So check
+ * if connection is still up or not otherwise don't send removemapping
+ * message.
+ */
+ if (fc->connected) {
+ ret = dmap_removemapping_one(inode, dmap);
+ if (ret) {
+ pr_warn("Failed to remove mapping. offset=0x%llx"
+ " len=0x%llx ret=%d\n", dmap->window_offset,
+ dmap->length, ret);
+ }
+ }
+ return 0;
+}
+
+/* Find first mapped dmap for an inode and return file offset. Caller needs
+ * to hold inode->i_dmap_sem lock either shared or exclusive. */
+static struct fuse_dax_mapping *inode_lookup_first_dmap(struct fuse_conn *fc,
+ struct inode *inode)
+{
+ struct fuse_inode *fi = get_fuse_inode(inode);
+ struct fuse_dax_mapping *dmap;
+ struct interval_tree_node *node;
+
+ for (node = interval_tree_iter_first(&fi->dmap_tree, 0, -1); node;
+ node = interval_tree_iter_next(node, 0, -1)) {
+ dmap = node_to_dmap(node);
+ /* still in use. */
+ if (refcount_read(&dmap->refcnt) > 1)
+ continue;
+
+ return dmap;
+ }
+
+ return NULL;
+}
+
+/*
+ * Find first mapping in the tree and free it and return it. Do not add
+ * it back to free pool.
+ */
+static struct fuse_dax_mapping *
+inode_inline_reclaim_one_dmap(struct fuse_conn *fc, struct inode *inode,
+ bool *retry)
+{
+ struct fuse_inode *fi = get_fuse_inode(inode);
+ struct fuse_dax_mapping *dmap;
+ u64 dmap_start, dmap_end;
+ unsigned long start_idx;
+ int ret;
+ struct interval_tree_node *node;
+
+ down_write(&fi->i_mmap_sem);
+
+ /* Lookup a dmap and corresponding file offset to reclaim. */
+ down_read(&fi->i_dmap_sem);
+ dmap = inode_lookup_first_dmap(fc, inode);
+ if (dmap) {
+ start_idx = dmap->itn.start;
+ dmap_start = start_idx << FUSE_DAX_SHIFT;
+ dmap_end = dmap_start + FUSE_DAX_SZ - 1;
+ }
+ up_read(&fi->i_dmap_sem);
+
+ if (!dmap)
+ goto out_mmap_sem;
+ /*
+ * Make sure there are no references to inode pages using
+ * get_user_pages()
+ */
+ ret = fuse_break_dax_layouts(inode, dmap_start, dmap_end);
+ if (ret) {
+ pr_debug("fuse: fuse_break_dax_layouts() failed. err=%d\n",
+ ret);
+ dmap = ERR_PTR(ret);
+ goto out_mmap_sem;
+ }
+
+ down_write(&fi->i_dmap_sem);
+ node = interval_tree_iter_first(&fi->dmap_tree, start_idx, start_idx);
+ /* Range already got reclaimed by somebody else */
+ if (!node) {
+ if (retry)
+ *retry = true;
+ goto out_write_dmap_sem;
+ }
+
+ dmap = node_to_dmap(node);
+ /* still in use. */
+ if (refcount_read(&dmap->refcnt) > 1) {
+ dmap = NULL;
+ if (retry)
+ *retry = true;
+ goto out_write_dmap_sem;
+ }
+
+ ret = reclaim_one_dmap_locked(fc, inode, dmap);
+ if (ret < 0) {
+ dmap = ERR_PTR(ret);
+ goto out_write_dmap_sem;
+ }
+
+ /* Clean up dmap. Do not add back to free list */
+ dmap_remove_busy_list(fc, dmap);
+ dmap->inode = NULL;
+ dmap->itn.start = dmap->itn.last = 0;
+
+ pr_debug("fuse: %s: inline reclaimed memory range. inode=%px,"
+ " window_offset=0x%llx, length=0x%llx\n", __func__,
+ inode, dmap->window_offset, dmap->length);
+
+out_write_dmap_sem:
+ up_write(&fi->i_dmap_sem);
+out_mmap_sem:
+ up_write(&fi->i_mmap_sem);
+ return dmap;
+}
+
+static struct fuse_dax_mapping *alloc_dax_mapping_reclaim(struct fuse_conn *fc,
+ struct inode *inode)
+{
+ struct fuse_dax_mapping *dmap;
+ struct fuse_inode *fi = get_fuse_inode(inode);
+
+ while (1) {
+ bool retry = false;
+
+ dmap = alloc_dax_mapping(fc);
+ if (dmap)
+ return dmap;
+
+ dmap = inode_inline_reclaim_one_dmap(fc, inode, &retry);
+ /*
+ * Either we got a mapping or it is an error, return in both
+ * the cases.
+ */
+ if (dmap)
+ return dmap;
+
+ /* If we could not reclaim a mapping because it
+ * had a reference or some other temporary failure,
+ * Try again. We want to give up inline reclaim only
+ * if there is no range assigned to this node. Otherwise
+ * if a deadlock is possible if we sleep with fi->i_mmap_sem
+ * held and worker to free memory can't make progress due
+ * to unavailability of fi->i_mmap_sem lock. So sleep
+ * only if fi->nr_dmaps=0
+ */
+ if (retry)
+ continue;
+ /*
+ * There are no mappings which can be reclaimed. Wait for one.
+ * We are not holding fi->i_dmap_sem. So it is possible
+ * that range gets added now. But as we are not holding
+ * fi->i_mmap_sem, worker should still be able to free up
+ * a range and wake us up.
+ */
+ if (!fi->nr_dmaps && !(fc->nr_free_ranges > 0)) {
+ if (wait_event_killable_exclusive(fc->dax_range_waitq,
+ (fc->nr_free_ranges > 0))) {
+ return ERR_PTR(-EINTR);
+ }
+ }
+ }
+}
+
+static int lookup_and_reclaim_dmap_locked(struct fuse_conn *fc,
+ struct inode *inode,
+ unsigned long start_idx)
+{
+ int ret;
+ struct fuse_inode *fi = get_fuse_inode(inode);
+ struct fuse_dax_mapping *dmap;
+ struct interval_tree_node *node;
+
+ /* Find fuse dax mapping at file offset inode. */
+ node = interval_tree_iter_first(&fi->dmap_tree, start_idx, start_idx);
+
+ /* Range already got cleaned up by somebody else */
+ if (!node)
+ return 0;
+ dmap = node_to_dmap(node);
+
+ /* still in use. */
+ if (refcount_read(&dmap->refcnt) > 1)
+ return 0;
+
+ ret = reclaim_one_dmap_locked(fc, inode, dmap);
+ if (ret < 0)
+ return ret;
+
+ /* Cleanup dmap entry and add back to free list */
+ spin_lock(&fc->lock);
+ dmap_reinit_add_to_free_pool(fc, dmap);
+ spin_unlock(&fc->lock);
+ return ret;
+}
+
+/*
+ * Free a range of memory.
+ * Locking.
+ * 1. Take fuse_inode->i_mmap_sem to block dax faults.
+ * 2. Take fuse_inode->i_dmap_sem to protect interval tree and also to make
+ * sure read/write can not reuse a dmap which we might be freeing.
+ */
+static int lookup_and_reclaim_dmap(struct fuse_conn *fc, struct inode *inode,
+ unsigned long start_idx,
+ unsigned long end_idx)
+{
+ int ret;
+ struct fuse_inode *fi = get_fuse_inode(inode);
+ loff_t dmap_start = start_idx << FUSE_DAX_SHIFT;
+ loff_t dmap_end = (dmap_start + FUSE_DAX_SZ) - 1;
+
+ down_write(&fi->i_mmap_sem);
+ ret = fuse_break_dax_layouts(inode, dmap_start, dmap_end);
+ if (ret) {
+ pr_debug("virtio_fs: fuse_break_dax_layouts() failed. err=%d\n",
+ ret);
+ goto out_mmap_sem;
+ }
+
+ down_write(&fi->i_dmap_sem);
+ ret = lookup_and_reclaim_dmap_locked(fc, inode, start_idx);
+ up_write(&fi->i_dmap_sem);
+out_mmap_sem:
+ up_write(&fi->i_mmap_sem);
+ return ret;
+}
+
+static int try_to_free_dmap_chunks(struct fuse_conn *fc,
+ unsigned long nr_to_free)
+{
+ struct fuse_dax_mapping *dmap, *pos, *temp;
+ int ret, nr_freed = 0;
+ unsigned long start_idx = 0, end_idx = 0;
+ u64 window_offset = 0;
+ struct inode *inode = NULL;
+
+ /* Pick first busy range and free it for now*/
+ while (1) {
+ if (nr_freed >= nr_to_free)
+ break;
+
+ dmap = NULL;
+ spin_lock(&fc->lock);
+
+ if (!fc->nr_busy_ranges) {
+ spin_unlock(&fc->lock);
+ return 0;
+ }
+
+ list_for_each_entry_safe(pos, temp, &fc->busy_ranges,
+ busy_list) {
+ /* skip this range if it's in use. */
+ if (refcount_read(&pos->refcnt) > 1)
+ continue;
+
+ inode = igrab(pos->inode);
+ /*
+ * This inode is going away. That will free
+ * up all the ranges anyway, continue to
+ * next range.
+ */
+ if (!inode)
+ continue;
+ /*
+ * Take this element off list and add it tail. If
+ * this element can't be freed, it will help with
+ * selecting new element in next iteration of loop.
+ */
+ dmap = pos;
+ list_move_tail(&dmap->busy_list, &fc->busy_ranges);
+ start_idx = end_idx = dmap->itn.start;
+ window_offset = dmap->window_offset;
+ break;
+ }
+ spin_unlock(&fc->lock);
+ if (!dmap)
+ return 0;
+
+ ret = lookup_and_reclaim_dmap(fc, inode, start_idx, end_idx);
+ iput(inode);
+ if (ret)
+ return ret;
+ nr_freed++;
+ }
+ return 0;
+}
+
+void fuse_dax_free_mem_worker(struct work_struct *work)
+{
+ int ret;
+ struct fuse_conn *fc = container_of(work, struct fuse_conn,
+ dax_free_work.work);
+ ret = try_to_free_dmap_chunks(fc, FUSE_DAX_RECLAIM_CHUNK);
+ if (ret) {
+ pr_debug("fuse: try_to_free_dmap_chunks() failed with err=%d\n",
+ ret);
+ }
+
+ /* If number of free ranges are still below threhold, requeue */
+ kick_dmap_free_worker(fc, 1);
+}
diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
index 400a19a464ca..79e8297aacc9 100644
--- a/fs/fuse/fuse_i.h
+++ b/fs/fuse/fuse_i.h
@@ -56,6 +56,16 @@
#define FUSE_DAX_SHIFT (21)
#define FUSE_DAX_PAGES (FUSE_DAX_SZ/PAGE_SIZE)
+/* Number of ranges reclaimer will try to free in one invocation */
+#define FUSE_DAX_RECLAIM_CHUNK (10)
+
+/*
+ * Dax memory reclaim threshold in percetage of total ranges. When free
+ * number of free ranges drops below this threshold, reclaim can trigger
+ * Default is 20%
+ * */
+#define FUSE_DAX_RECLAIM_THRESHOLD (20)
+
/** List of active connections */
extern struct list_head fuse_conn_list;
@@ -74,6 +84,9 @@ struct fuse_forget_link {
/** Translation information for file offsets to DAX window offsets */
struct fuse_dax_mapping {
+ /* Pointer to inode where this memory range is mapped */
+ struct inode *inode;
+
/* Will connect in fc->free_ranges to keep track of free memory */
struct list_head list;
@@ -91,6 +104,9 @@ struct fuse_dax_mapping {
/* Is this mapping read-only or read-write */
bool writable;
+
+ /* reference count when the mapping is used by dax iomap. */
+ refcount_t refcnt;
};
/** FUSE inode */
@@ -819,11 +835,19 @@ struct fuse_conn {
unsigned long nr_busy_ranges;
struct list_head busy_ranges;
+ /* Worker to free up memory ranges */
+ struct delayed_work dax_free_work;
+
+ /* Wait queue for a dax range to become free */
+ wait_queue_head_t dax_range_waitq;
+
/*
* DAX Window Free Ranges
*/
long nr_free_ranges;
struct list_head free_ranges;
+
+ unsigned long nr_ranges;
};
static inline struct fuse_conn *get_fuse_conn_super(struct super_block *sb)
@@ -1161,6 +1185,7 @@ unsigned int fuse_len_args(unsigned int numargs, struct fuse_arg *args);
*/
u64 fuse_get_unique(struct fuse_iqueue *fiq);
void fuse_free_conn(struct fuse_conn *fc);
+void fuse_dax_free_mem_worker(struct work_struct *work);
void fuse_cleanup_inode_mappings(struct inode *inode);
static inline struct fuse_dax_mapping *
diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
index 671e84e3dd99..d2e09fd0a3e6 100644
--- a/fs/fuse/inode.c
+++ b/fs/fuse/inode.c
@@ -683,11 +683,13 @@ static int fuse_dax_mem_range_init(struct fuse_conn *fc,
range->window_offset = i * FUSE_DAX_SZ;
range->length = FUSE_DAX_SZ;
INIT_LIST_HEAD(&range->busy_list);
+ refcount_set(&range->refcnt, 1);
list_add_tail(&range->list, &mem_ranges);
}
list_replace_init(&mem_ranges, &fc->free_ranges);
fc->nr_free_ranges = nr_ranges;
+ fc->nr_ranges = nr_ranges;
return 0;
out_err:
/* Free All allocated elements */
@@ -712,6 +714,7 @@ void fuse_conn_init(struct fuse_conn *fc, struct user_namespace *user_ns,
refcount_set(&fc->count, 1);
atomic_set(&fc->dev_count, 1);
init_waitqueue_head(&fc->blocked_waitq);
+ init_waitqueue_head(&fc->dax_range_waitq);
fuse_iqueue_init(&fc->iq, fiq_ops, fiq_priv);
INIT_LIST_HEAD(&fc->bg_queue);
INIT_LIST_HEAD(&fc->entry);
@@ -731,6 +734,7 @@ void fuse_conn_init(struct fuse_conn *fc, struct user_namespace *user_ns,
fc->max_pages = FUSE_DEFAULT_MAX_PAGES_PER_REQ;
INIT_LIST_HEAD(&fc->free_ranges);
INIT_LIST_HEAD(&fc->busy_ranges);
+ INIT_DELAYED_WORK(&fc->dax_free_work, fuse_dax_free_mem_worker);
}
EXPORT_SYMBOL_GPL(fuse_conn_init);
diff --git a/fs/fuse/virtio_fs.c b/fs/fuse/virtio_fs.c
index fb31c9dbc0b8..07ad697b403c 100644
--- a/fs/fuse/virtio_fs.c
+++ b/fs/fuse/virtio_fs.c
@@ -1345,6 +1345,11 @@ static void virtio_kill_sb(struct super_block *sb)
vfs = fc->iq.priv;
fsvq = &vfs->vqs[VQ_HIPRIO];
+ /* Stop dax worker. Soon evict_inodes() will be called which will
+ * free all memory ranges belonging to all inodes.
+ */
+ cancel_delayed_work_sync(&fc->dax_free_work);
+
/* Stop forget queue. Soon destroy will be sent */
spin_lock(&fsvq->lock);
fsvq->connected = false;
--
2.25.4
Divide the dax memory range into fixed size ranges (2MB for now) and put
them in a list. This will track free ranges. Once an inode requires a
free range, we will take one from here and put it in interval-tree
of ranges assigned to inode.
Signed-off-by: Vivek Goyal <[email protected]>
Signed-off-by: Peng Tao <[email protected]>
---
fs/fuse/fuse_i.h | 23 ++++++++++++
fs/fuse/inode.c | 88 ++++++++++++++++++++++++++++++++++++++++++++-
fs/fuse/virtio_fs.c | 2 ++
3 files changed, 112 insertions(+), 1 deletion(-)
diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
index 04fdd7c41bd1..478c940b05b4 100644
--- a/fs/fuse/fuse_i.h
+++ b/fs/fuse/fuse_i.h
@@ -47,6 +47,11 @@
/** Number of dentries for each connection in the control filesystem */
#define FUSE_CTL_NUM_DENTRIES 5
+/* Default memory range size, 2MB */
+#define FUSE_DAX_SZ (2*1024*1024)
+#define FUSE_DAX_SHIFT (21)
+#define FUSE_DAX_PAGES (FUSE_DAX_SZ/PAGE_SIZE)
+
/** List of active connections */
extern struct list_head fuse_conn_list;
@@ -63,6 +68,18 @@ struct fuse_forget_link {
struct fuse_forget_link *next;
};
+/** Translation information for file offsets to DAX window offsets */
+struct fuse_dax_mapping {
+ /* Will connect in fc->free_ranges to keep track of free memory */
+ struct list_head list;
+
+ /** Position in DAX window */
+ u64 window_offset;
+
+ /** Length of mapping, in bytes */
+ loff_t length;
+};
+
/** FUSE inode */
struct fuse_inode {
/** Inode data */
@@ -768,6 +785,12 @@ struct fuse_conn {
/** DAX device, non-NULL if DAX is supported */
struct dax_device *dax_dev;
+
+ /*
+ * DAX Window Free Ranges
+ */
+ long nr_free_ranges;
+ struct list_head free_ranges;
};
static inline struct fuse_conn *get_fuse_conn_super(struct super_block *sb)
diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
index beac337ccc10..b82eb61d63cc 100644
--- a/fs/fuse/inode.c
+++ b/fs/fuse/inode.c
@@ -23,6 +23,8 @@
#include <linux/exportfs.h>
#include <linux/posix_acl.h>
#include <linux/pid_namespace.h>
+#include <linux/dax.h>
+#include <linux/pfn_t.h>
MODULE_AUTHOR("Miklos Szeredi <[email protected]>");
MODULE_DESCRIPTION("Filesystem in Userspace");
@@ -620,6 +622,76 @@ static void fuse_pqueue_init(struct fuse_pqueue *fpq)
fpq->connected = 1;
}
+static void fuse_free_dax_mem_ranges(struct list_head *mem_list)
+{
+ struct fuse_dax_mapping *range, *temp;
+
+ /* Free All allocated elements */
+ list_for_each_entry_safe(range, temp, mem_list, list) {
+ list_del(&range->list);
+ kfree(range);
+ }
+}
+
+#ifdef CONFIG_FS_DAX
+static int fuse_dax_mem_range_init(struct fuse_conn *fc,
+ struct dax_device *dax_dev)
+{
+ long nr_pages, nr_ranges;
+ void *kaddr;
+ pfn_t pfn;
+ struct fuse_dax_mapping *range;
+ LIST_HEAD(mem_ranges);
+ phys_addr_t phys_addr;
+ int ret = 0, id;
+ size_t dax_size = -1;
+ unsigned long i;
+
+ id = dax_read_lock();
+ nr_pages = dax_direct_access(dax_dev, 0, PHYS_PFN(dax_size), &kaddr,
+ &pfn);
+ dax_read_unlock(id);
+ if (nr_pages < 0) {
+ pr_debug("dax_direct_access() returned %ld\n", nr_pages);
+ return nr_pages;
+ }
+
+ phys_addr = pfn_t_to_phys(pfn);
+ nr_ranges = nr_pages/FUSE_DAX_PAGES;
+ printk("fuse_dax_mem_range_init(): dax mapped %ld pages. nr_ranges=%ld\n", nr_pages, nr_ranges);
+
+ for (i = 0; i < nr_ranges; i++) {
+ range = kzalloc(sizeof(struct fuse_dax_mapping), GFP_KERNEL);
+ if (!range) {
+ pr_debug("memory allocation for mem_range failed.\n");
+ ret = -ENOMEM;
+ goto out_err;
+ }
+ /* TODO: This offset only works if virtio-fs driver is not
+ * having some memory hidden at the beginning. This needs
+ * better handling
+ */
+ range->window_offset = i * FUSE_DAX_SZ;
+ range->length = FUSE_DAX_SZ;
+ list_add_tail(&range->list, &mem_ranges);
+ }
+
+ list_replace_init(&mem_ranges, &fc->free_ranges);
+ fc->nr_free_ranges = nr_ranges;
+ return 0;
+out_err:
+ /* Free All allocated elements */
+ fuse_free_dax_mem_ranges(&mem_ranges);
+ return ret;
+}
+#else /* !CONFIG_FS_DAX */
+static inline int fuse_dax_mem_range_init(struct fuse_conn *fc,
+ struct dax_device *dax_dev)
+{
+ return 0;
+}
+#endif /* CONFIG_FS_DAX */
+
void fuse_conn_init(struct fuse_conn *fc, struct user_namespace *user_ns,
const struct fuse_iqueue_ops *fiq_ops, void *fiq_priv)
{
@@ -647,6 +719,7 @@ void fuse_conn_init(struct fuse_conn *fc, struct user_namespace *user_ns,
fc->pid_ns = get_pid_ns(task_active_pid_ns(current));
fc->user_ns = get_user_ns(user_ns);
fc->max_pages = FUSE_DEFAULT_MAX_PAGES_PER_REQ;
+ INIT_LIST_HEAD(&fc->free_ranges);
}
EXPORT_SYMBOL_GPL(fuse_conn_init);
@@ -655,6 +728,8 @@ void fuse_conn_put(struct fuse_conn *fc)
if (refcount_dec_and_test(&fc->count)) {
struct fuse_iqueue *fiq = &fc->iq;
+ if (fc->dax_dev)
+ fuse_free_dax_mem_ranges(&fc->free_ranges);
if (fiq->ops->release)
fiq->ops->release(fiq);
put_pid_ns(fc->pid_ns);
@@ -1179,11 +1254,19 @@ int fuse_fill_super_common(struct super_block *sb, struct fuse_fs_context *ctx)
if (sb->s_user_ns != &init_user_ns)
sb->s_xattr = fuse_no_acl_xattr_handlers;
+ if (ctx->dax_dev) {
+ err = fuse_dax_mem_range_init(fc, ctx->dax_dev);
+ if (err) {
+ pr_debug("fuse_dax_mem_range_init() returned %d\n", err);
+ goto err_free_ranges;
+ }
+ }
+
if (ctx->fudptr) {
err = -ENOMEM;
fud = fuse_dev_alloc_install(fc);
if (!fud)
- goto err;
+ goto err_free_ranges;
}
fc->dev = sb->s_dev;
@@ -1242,6 +1325,9 @@ int fuse_fill_super_common(struct super_block *sb, struct fuse_fs_context *ctx)
err_dev_free:
if (fud)
fuse_dev_free(fud);
+ err_free_ranges:
+ if (ctx->dax_dev)
+ fuse_free_dax_mem_ranges(&fc->free_ranges);
err:
return err;
}
diff --git a/fs/fuse/virtio_fs.c b/fs/fuse/virtio_fs.c
index 741cad4abad8..fb31c9dbc0b8 100644
--- a/fs/fuse/virtio_fs.c
+++ b/fs/fuse/virtio_fs.c
@@ -747,6 +747,8 @@ static long virtio_fs_direct_access(struct dax_device *dax_dev, pgoff_t pgoff,
phys_addr_t offset = PFN_PHYS(pgoff);
size_t max_nr_pages = fs->window_len/PAGE_SIZE - pgoff;
+ pr_debug("virtio_fs_direct_access(): called. nr_pages=%ld max_nr_pages=%zu\n", nr_pages, max_nr_pages);
+
if (kaddr)
*kaddr = fs->window_kaddr + offset;
if (pfn)
--
2.25.4
From: Sebastien Boeuf <[email protected]>
On MMIO a new set of registers is defined for finding SHM
regions. Add their definitions and use them to find the region.
Signed-off-by: Sebastien Boeuf <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: "Michael S. Tsirkin" <[email protected]>
---
drivers/virtio/virtio_mmio.c | 31 +++++++++++++++++++++++++++++++
include/uapi/linux/virtio_mmio.h | 11 +++++++++++
2 files changed, 42 insertions(+)
diff --git a/drivers/virtio/virtio_mmio.c b/drivers/virtio/virtio_mmio.c
index 627ac0487494..238383ff1064 100644
--- a/drivers/virtio/virtio_mmio.c
+++ b/drivers/virtio/virtio_mmio.c
@@ -498,6 +498,36 @@ static const char *vm_bus_name(struct virtio_device *vdev)
return vm_dev->pdev->name;
}
+static bool vm_get_shm_region(struct virtio_device *vdev,
+ struct virtio_shm_region *region, u8 id)
+{
+ struct virtio_mmio_device *vm_dev = to_virtio_mmio_device(vdev);
+ u64 len, addr;
+
+ /* Select the region we're interested in */
+ writel(id, vm_dev->base + VIRTIO_MMIO_SHM_SEL);
+
+ /* Read the region size */
+ len = (u64) readl(vm_dev->base + VIRTIO_MMIO_SHM_LEN_LOW);
+ len |= (u64) readl(vm_dev->base + VIRTIO_MMIO_SHM_LEN_HIGH) << 32;
+
+ region->len = len;
+
+ /* Check if region length is -1. If that's the case, the shared memory
+ * region does not exist and there is no need to proceed further.
+ */
+ if (len == ~(u64)0)
+ return false;
+
+ /* Read the region base address */
+ addr = (u64) readl(vm_dev->base + VIRTIO_MMIO_SHM_BASE_LOW);
+ addr |= (u64) readl(vm_dev->base + VIRTIO_MMIO_SHM_BASE_HIGH) << 32;
+
+ region->addr = addr;
+
+ return true;
+}
+
static const struct virtio_config_ops virtio_mmio_config_ops = {
.get = vm_get,
.set = vm_set,
@@ -510,6 +540,7 @@ static const struct virtio_config_ops virtio_mmio_config_ops = {
.get_features = vm_get_features,
.finalize_features = vm_finalize_features,
.bus_name = vm_bus_name,
+ .get_shm_region = vm_get_shm_region,
};
diff --git a/include/uapi/linux/virtio_mmio.h b/include/uapi/linux/virtio_mmio.h
index c4b09689ab64..0650f91bea6c 100644
--- a/include/uapi/linux/virtio_mmio.h
+++ b/include/uapi/linux/virtio_mmio.h
@@ -122,6 +122,17 @@
#define VIRTIO_MMIO_QUEUE_USED_LOW 0x0a0
#define VIRTIO_MMIO_QUEUE_USED_HIGH 0x0a4
+/* Shared memory region id */
+#define VIRTIO_MMIO_SHM_SEL 0x0ac
+
+/* Shared memory region length, 64 bits in two halves */
+#define VIRTIO_MMIO_SHM_LEN_LOW 0x0b0
+#define VIRTIO_MMIO_SHM_LEN_HIGH 0x0b4
+
+/* Shared memory region base address, 64 bits in two halves */
+#define VIRTIO_MMIO_SHM_BASE_LOW 0x0b8
+#define VIRTIO_MMIO_SHM_BASE_HIGH 0x0bc
+
/* Configuration atomicity value */
#define VIRTIO_MMIO_CONFIG_GENERATION 0x0fc
--
2.25.4
This patch implements basic DAX support. mmap() is not implemented
yet and will come in later patches. This patch looks into implemeting
read/write.
We make use of interval tree to keep track of per inode dax mappings.
Do not use dax for file extending writes, instead just send WRITE message
to daemon (like we do for direct I/O path). This will keep write and
i_size change atomic w.r.t crash.
Signed-off-by: Stefan Hajnoczi <[email protected]>
Signed-off-by: Dr. David Alan Gilbert <[email protected]>
Signed-off-by: Vivek Goyal <[email protected]>
Signed-off-by: Miklos Szeredi <[email protected]>
Signed-off-by: Liu Bo <[email protected]>
Signed-off-by: Peng Tao <[email protected]>
Cc: Dave Chinner <[email protected]>
---
fs/fuse/file.c | 550 +++++++++++++++++++++++++++++++++++++-
fs/fuse/fuse_i.h | 26 ++
fs/fuse/inode.c | 6 +
include/uapi/linux/fuse.h | 1 +
4 files changed, 577 insertions(+), 6 deletions(-)
diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index 6611ef3269a8..99457d0b14b9 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -19,6 +19,9 @@
#include <linux/falloc.h>
#include <linux/uio.h>
#include <linux/fs.h>
+#include <linux/dax.h>
+#include <linux/iomap.h>
+#include <linux/interval_tree_generic.h>
static struct page **fuse_pages_alloc(unsigned int npages, gfp_t flags,
struct fuse_page_desc **desc)
@@ -188,6 +191,228 @@ static void fuse_link_write_file(struct file *file)
spin_unlock(&fi->lock);
}
+static struct fuse_dax_mapping *alloc_dax_mapping(struct fuse_conn *fc)
+{
+ struct fuse_dax_mapping *dmap = NULL;
+
+ spin_lock(&fc->lock);
+
+ if (fc->nr_free_ranges <= 0) {
+ spin_unlock(&fc->lock);
+ return NULL;
+ }
+
+ WARN_ON(list_empty(&fc->free_ranges));
+
+ /* Take a free range */
+ dmap = list_first_entry(&fc->free_ranges, struct fuse_dax_mapping,
+ list);
+ list_del_init(&dmap->list);
+ fc->nr_free_ranges--;
+ spin_unlock(&fc->lock);
+ return dmap;
+}
+
+/* This assumes fc->lock is held */
+static void __dmap_add_to_free_pool(struct fuse_conn *fc,
+ struct fuse_dax_mapping *dmap)
+{
+ list_add_tail(&dmap->list, &fc->free_ranges);
+ fc->nr_free_ranges++;
+}
+
+static void dmap_add_to_free_pool(struct fuse_conn *fc,
+ struct fuse_dax_mapping *dmap)
+{
+ /* Return fuse_dax_mapping to free list */
+ spin_lock(&fc->lock);
+ __dmap_add_to_free_pool(fc, dmap);
+ spin_unlock(&fc->lock);
+}
+
+static int fuse_setup_one_mapping(struct inode *inode, unsigned long start_idx,
+ struct fuse_dax_mapping *dmap, bool writable,
+ bool upgrade)
+{
+ struct fuse_conn *fc = get_fuse_conn(inode);
+ struct fuse_inode *fi = get_fuse_inode(inode);
+ struct fuse_setupmapping_in inarg;
+ loff_t offset = start_idx << FUSE_DAX_SHIFT;
+ FUSE_ARGS(args);
+ ssize_t err;
+
+ WARN_ON(fc->nr_free_ranges < 0);
+
+ /* Ask fuse daemon to setup mapping */
+ memset(&inarg, 0, sizeof(inarg));
+ inarg.foffset = offset;
+ inarg.fh = -1;
+ inarg.moffset = dmap->window_offset;
+ inarg.len = FUSE_DAX_SZ;
+ inarg.flags |= FUSE_SETUPMAPPING_FLAG_READ;
+ if (writable)
+ inarg.flags |= FUSE_SETUPMAPPING_FLAG_WRITE;
+ args.opcode = FUSE_SETUPMAPPING;
+ args.nodeid = fi->nodeid;
+ args.in_numargs = 1;
+ args.in_args[0].size = sizeof(inarg);
+ args.in_args[0].value = &inarg;
+ err = fuse_simple_request(fc, &args);
+ if (err < 0)
+ return err;
+ dmap->writable = writable;
+ if (!upgrade) {
+ dmap->itn.start = dmap->itn.last = start_idx;
+ /* Protected by fi->i_dmap_sem */
+ interval_tree_insert(&dmap->itn, &fi->dmap_tree);
+ fi->nr_dmaps++;
+ }
+ return 0;
+}
+
+static int
+fuse_send_removemapping(struct inode *inode,
+ struct fuse_removemapping_in *inargp,
+ struct fuse_removemapping_one *remove_one)
+{
+ struct fuse_inode *fi = get_fuse_inode(inode);
+ struct fuse_conn *fc = get_fuse_conn(inode);
+ FUSE_ARGS(args);
+
+ args.opcode = FUSE_REMOVEMAPPING;
+ args.nodeid = fi->nodeid;
+ args.in_numargs = 2;
+ args.in_args[0].size = sizeof(*inargp);
+ args.in_args[0].value = inargp;
+ args.in_args[1].size = inargp->count * sizeof(*remove_one);
+ args.in_args[1].value = remove_one;
+ return fuse_simple_request(fc, &args);
+}
+
+static int dmap_removemapping_list(struct inode *inode, unsigned num,
+ struct list_head *to_remove)
+{
+ struct fuse_removemapping_one *remove_one, *ptr;
+ struct fuse_removemapping_in inarg;
+ struct fuse_dax_mapping *dmap;
+ int ret, i = 0, nr_alloc;
+
+ nr_alloc = min_t(unsigned int, num, FUSE_REMOVEMAPPING_MAX_ENTRY);
+ remove_one = kmalloc_array(nr_alloc, sizeof(*remove_one), GFP_NOFS);
+ if (!remove_one)
+ return -ENOMEM;
+
+ ptr = remove_one;
+ list_for_each_entry(dmap, to_remove, list) {
+ ptr->moffset = dmap->window_offset;
+ ptr->len = dmap->length;
+ ptr++;
+ i++;
+ num--;
+ if (i >= nr_alloc || num == 0) {
+ memset(&inarg, 0, sizeof(inarg));
+ inarg.count = i;
+ ret = fuse_send_removemapping(inode, &inarg,
+ remove_one);
+ if (ret)
+ goto out;
+ ptr = remove_one;
+ i = 0;
+ }
+ }
+out:
+ kfree(remove_one);
+ return ret;
+}
+
+/*
+ * Cleanup dmap entry and add back to free list. This should be called with
+ * fc->lock held.
+ */
+static void dmap_reinit_add_to_free_pool(struct fuse_conn *fc,
+ struct fuse_dax_mapping *dmap)
+{
+ pr_debug("fuse: freeing memory range start_idx=0x%lx end_idx=0x%lx "
+ "window_offset=0x%llx length=0x%llx\n", dmap->itn.start,
+ dmap->itn.last, dmap->window_offset, dmap->length);
+ dmap->itn.start = dmap->itn.last = 0;
+ __dmap_add_to_free_pool(fc, dmap);
+}
+
+/*
+ * Free inode dmap entries whose range falls inside [start, end].
+ * Does not take any locks. At this point of time it should only be
+ * called from evict_inode() path where we know all dmap entries can be
+ * reclaimed.
+ */
+static void inode_reclaim_dmap_range(struct fuse_conn *fc, struct inode *inode,
+ loff_t start, loff_t end)
+{
+ struct fuse_inode *fi = get_fuse_inode(inode);
+ struct fuse_dax_mapping *dmap, *n;
+ int err, num = 0;
+ LIST_HEAD(to_remove);
+ unsigned long start_idx = start >> FUSE_DAX_SHIFT;
+ unsigned long end_idx = end >> FUSE_DAX_SHIFT;
+ struct interval_tree_node *node;
+
+ while (1) {
+ node = interval_tree_iter_first(&fi->dmap_tree, start_idx,
+ end_idx);
+ if (!node)
+ break;
+ dmap = node_to_dmap(node);
+ interval_tree_remove(&dmap->itn, &fi->dmap_tree);
+ num++;
+ list_add(&dmap->list, &to_remove);
+ }
+
+ /* Nothing to remove */
+ if (list_empty(&to_remove))
+ return;
+
+ WARN_ON(fi->nr_dmaps < num);
+ fi->nr_dmaps -= num;
+ /*
+ * During umount/shutdown, fuse connection is dropped first
+ * and evict_inode() is called later. That means any
+ * removemapping messages are going to fail. Send messages
+ * only if connection is up. Otherwise fuse daemon is
+ * responsible for cleaning up any leftover references and
+ * mappings.
+ */
+ if (fc->connected) {
+ err = dmap_removemapping_list(inode, num, &to_remove);
+ if (err) {
+ pr_warn("Failed to removemappings. start=0x%llx"
+ " end=0x%llx\n", start, end);
+ }
+ }
+ spin_lock(&fc->lock);
+ list_for_each_entry_safe(dmap, n, &to_remove, list) {
+ list_del_init(&dmap->list);
+ dmap_reinit_add_to_free_pool(fc, dmap);
+ }
+ spin_unlock(&fc->lock);
+}
+
+/*
+ * It is called from evict_inode() and by that time inode is going away. So
+ * this function does not take any locks like fi->i_dmap_sem for traversing
+ * that fuse inode interval tree. If that lock is taken then lock validator
+ * complains of deadlock situation w.r.t fs_reclaim lock.
+ */
+void fuse_cleanup_inode_mappings(struct inode *inode)
+{
+ struct fuse_conn *fc = get_fuse_conn(inode);
+ /*
+ * fuse_evict_inode() has alredy called truncate_inode_pages_final()
+ * before we arrive here. So we should not have to worry about
+ * any pages/exception entries still associated with inode.
+ */
+ inode_reclaim_dmap_range(fc, inode, 0, -1);
+}
+
void fuse_finish_open(struct inode *inode, struct file *file)
{
struct fuse_file *ff = file->private_data;
@@ -1535,32 +1760,335 @@ static ssize_t fuse_direct_write_iter(struct kiocb *iocb, struct iov_iter *from)
return res;
}
+static ssize_t fuse_dax_read_iter(struct kiocb *iocb, struct iov_iter *to);
static ssize_t fuse_file_read_iter(struct kiocb *iocb, struct iov_iter *to)
{
struct file *file = iocb->ki_filp;
struct fuse_file *ff = file->private_data;
+ struct inode *inode = file->f_mapping->host;
if (is_bad_inode(file_inode(file)))
return -EIO;
- if (!(ff->open_flags & FOPEN_DIRECT_IO))
- return fuse_cache_read_iter(iocb, to);
- else
+ if (IS_DAX(inode))
+ return fuse_dax_read_iter(iocb, to);
+
+ if (ff->open_flags & FOPEN_DIRECT_IO)
return fuse_direct_read_iter(iocb, to);
+
+ return fuse_cache_read_iter(iocb, to);
}
+static ssize_t fuse_dax_write_iter(struct kiocb *iocb, struct iov_iter *from);
static ssize_t fuse_file_write_iter(struct kiocb *iocb, struct iov_iter *from)
{
struct file *file = iocb->ki_filp;
struct fuse_file *ff = file->private_data;
+ struct inode *inode = file->f_mapping->host;
if (is_bad_inode(file_inode(file)))
return -EIO;
- if (!(ff->open_flags & FOPEN_DIRECT_IO))
- return fuse_cache_write_iter(iocb, from);
- else
+ if (IS_DAX(inode))
+ return fuse_dax_write_iter(iocb, from);
+
+ if (ff->open_flags & FOPEN_DIRECT_IO)
return fuse_direct_write_iter(iocb, from);
+
+ return fuse_cache_write_iter(iocb, from);
+}
+
+static void fuse_fill_iomap_hole(struct iomap *iomap, loff_t length)
+{
+ iomap->addr = IOMAP_NULL_ADDR;
+ iomap->length = length;
+ iomap->type = IOMAP_HOLE;
+}
+
+static void fuse_fill_iomap(struct inode *inode, loff_t pos, loff_t length,
+ struct iomap *iomap, struct fuse_dax_mapping *dmap,
+ unsigned flags)
+{
+ loff_t offset, len;
+ loff_t i_size = i_size_read(inode);
+
+ offset = pos - (dmap->itn.start << FUSE_DAX_SHIFT);
+ len = min(length, dmap->length - offset);
+
+ /* If length is beyond end of file, truncate further */
+ if (pos + len > i_size)
+ len = i_size - pos;
+
+ if (len > 0) {
+ iomap->addr = dmap->window_offset + offset;
+ iomap->length = len;
+ if (flags & IOMAP_FAULT)
+ iomap->length = ALIGN(len, PAGE_SIZE);
+ iomap->type = IOMAP_MAPPED;
+ } else {
+ /* Mapping beyond end of file is hole */
+ fuse_fill_iomap_hole(iomap, length);
+ }
+}
+
+static int fuse_setup_new_dax_mapping(struct inode *inode, loff_t pos,
+ loff_t length, unsigned flags,
+ struct iomap *iomap)
+{
+ struct fuse_inode *fi = get_fuse_inode(inode);
+ struct fuse_conn *fc = get_fuse_conn(inode);
+ struct fuse_dax_mapping *dmap, *alloc_dmap = NULL;
+ int ret;
+ bool writable = flags & IOMAP_WRITE;
+ unsigned long start_idx = pos >> FUSE_DAX_SHIFT;
+ struct interval_tree_node *node;
+
+ alloc_dmap = alloc_dax_mapping(fc);
+ if (!alloc_dmap)
+ return -EBUSY;
+
+ /*
+ * Take write lock so that only one caller can try to setup mapping
+ * and other waits.
+ */
+ down_write(&fi->i_dmap_sem);
+ /*
+ * We dropped lock. Check again if somebody else setup
+ * mapping already.
+ */
+ node = interval_tree_iter_first(&fi->dmap_tree, start_idx, start_idx);
+ if (node) {
+ dmap = node_to_dmap(node);
+ fuse_fill_iomap(inode, pos, length, iomap, dmap, flags);
+ dmap_add_to_free_pool(fc, alloc_dmap);
+ up_write(&fi->i_dmap_sem);
+ return 0;
+ }
+
+ /* Setup one mapping */
+ ret = fuse_setup_one_mapping(inode, pos >> FUSE_DAX_SHIFT, alloc_dmap,
+ writable, false);
+ if (ret < 0) {
+ dmap_add_to_free_pool(fc, alloc_dmap);
+ up_write(&fi->i_dmap_sem);
+ return ret;
+ }
+ fuse_fill_iomap(inode, pos, length, iomap, alloc_dmap, flags);
+ up_write(&fi->i_dmap_sem);
+ return 0;
+}
+
+static int fuse_upgrade_dax_mapping(struct inode *inode, loff_t pos,
+ loff_t length, unsigned flags,
+ struct iomap *iomap)
+{
+ struct fuse_inode *fi = get_fuse_inode(inode);
+ struct fuse_dax_mapping *dmap;
+ int ret;
+ unsigned long idx = pos >> FUSE_DAX_SHIFT;
+ struct interval_tree_node *node;
+
+ /*
+ * Take exclusive lock so that only one caller can try to setup
+ * mapping and others wait.
+ */
+ down_write(&fi->i_dmap_sem);
+ node = interval_tree_iter_first(&fi->dmap_tree, idx, idx);
+
+ /* We are holding either inode lock or i_mmap_sem, and that should
+ * ensure that dmap can't reclaimed or truncated and it should still
+ * be there in tree despite the fact we dropped and re-acquired the
+ * lock.
+ */
+ ret = -EIO;
+ if (WARN_ON(!node))
+ goto out_err;
+
+ dmap = node_to_dmap(node);
+
+ /* Maybe another thread already upgraded mapping while we were not
+ * holding lock.
+ */
+ if (dmap->writable) {
+ ret = 0;
+ goto out_fill_iomap;
+ }
+
+ ret = fuse_setup_one_mapping(inode, pos >> FUSE_DAX_SHIFT, dmap, true,
+ true);
+ if (ret < 0)
+ goto out_err;
+out_fill_iomap:
+ fuse_fill_iomap(inode, pos, length, iomap, dmap, flags);
+out_err:
+ up_write(&fi->i_dmap_sem);
+ return ret;
+}
+
+/* This is just for DAX and the mapping is ephemeral, do not use it for other
+ * purposes since there is no block device with a permanent mapping.
+ */
+static int fuse_iomap_begin(struct inode *inode, loff_t pos, loff_t length,
+ unsigned flags, struct iomap *iomap,
+ struct iomap *srcmap)
+{
+ struct fuse_inode *fi = get_fuse_inode(inode);
+ struct fuse_conn *fc = get_fuse_conn(inode);
+ struct fuse_dax_mapping *dmap;
+ bool writable = flags & IOMAP_WRITE;
+ unsigned long start_idx = pos >> FUSE_DAX_SHIFT;
+ struct interval_tree_node *node;
+
+ /* We don't support FIEMAP */
+ BUG_ON(flags & IOMAP_REPORT);
+
+ iomap->offset = pos;
+ iomap->flags = 0;
+ iomap->bdev = NULL;
+ iomap->dax_dev = fc->dax_dev;
+
+ /*
+ * Both read/write and mmap path can race here. So we need something
+ * to make sure if we are setting up mapping, then other path waits
+ *
+ * For now, use a semaphore for this. It probably needs to be
+ * optimized later.
+ */
+ down_read(&fi->i_dmap_sem);
+ node = interval_tree_iter_first(&fi->dmap_tree, start_idx, start_idx);
+ if (node) {
+ dmap = node_to_dmap(node);
+ if (writable && !dmap->writable) {
+ /* Upgrade read-only mapping to read-write. This will
+ * require exclusive i_dmap_sem lock as we don't want
+ * two threads to be trying to this simultaneously
+ * for same dmap. So drop shared lock and acquire
+ * exclusive lock.
+ */
+ up_read(&fi->i_dmap_sem);
+ pr_debug("%s: Upgrading mapping at offset 0x%llx"
+ " length 0x%llx\n", __func__, pos, length);
+ return fuse_upgrade_dax_mapping(inode, pos, length,
+ flags, iomap);
+ } else {
+ fuse_fill_iomap(inode, pos, length, iomap, dmap, flags);
+ up_read(&fi->i_dmap_sem);
+ return 0;
+ }
+ } else {
+ up_read(&fi->i_dmap_sem);
+ pr_debug("%s: no mapping at offset 0x%llx length 0x%llx\n",
+ __func__, pos, length);
+ if (pos >= i_size_read(inode))
+ goto iomap_hole;
+
+ return fuse_setup_new_dax_mapping(inode, pos, length, flags,
+ iomap);
+ }
+
+ /*
+ * If read beyond end of file happnes, fs code seems to return
+ * it as hole
+ */
+iomap_hole:
+ fuse_fill_iomap_hole(iomap, length);
+ pr_debug("fuse_iomap_begin() returning hole mapping. pos=0x%llx length_asked=0x%llx length_returned=0x%llx\n", pos, length, iomap->length);
+ return 0;
+}
+
+static int fuse_iomap_end(struct inode *inode, loff_t pos, loff_t length,
+ ssize_t written, unsigned flags,
+ struct iomap *iomap)
+{
+ /* DAX writes beyond end-of-file aren't handled using iomap, so the
+ * file size is unchanged and there is nothing to do here.
+ */
+ return 0;
+}
+
+static const struct iomap_ops fuse_iomap_ops = {
+ .iomap_begin = fuse_iomap_begin,
+ .iomap_end = fuse_iomap_end,
+};
+
+static ssize_t fuse_dax_read_iter(struct kiocb *iocb, struct iov_iter *to)
+{
+ struct inode *inode = file_inode(iocb->ki_filp);
+ ssize_t ret;
+
+ if (iocb->ki_flags & IOCB_NOWAIT) {
+ if (!inode_trylock_shared(inode))
+ return -EAGAIN;
+ } else {
+ inode_lock_shared(inode);
+ }
+
+ ret = dax_iomap_rw(iocb, to, &fuse_iomap_ops);
+ inode_unlock_shared(inode);
+
+ /* TODO file_accessed(iocb->f_filp) */
+ return ret;
+}
+
+static bool file_extending_write(struct kiocb *iocb, struct iov_iter *from)
+{
+ struct inode *inode = file_inode(iocb->ki_filp);
+
+ return (iov_iter_rw(from) == WRITE &&
+ ((iocb->ki_pos) >= i_size_read(inode) ||
+ (iocb->ki_pos + iov_iter_count(from) > i_size_read(inode))));
+}
+
+static ssize_t fuse_dax_direct_write(struct kiocb *iocb, struct iov_iter *from)
+{
+ struct inode *inode = file_inode(iocb->ki_filp);
+ struct fuse_io_priv io = FUSE_IO_PRIV_SYNC(iocb);
+ ssize_t ret;
+
+ ret = fuse_direct_io(&io, from, &iocb->ki_pos, FUSE_DIO_WRITE);
+ if (ret < 0)
+ return ret;
+
+ fuse_invalidate_attr(inode);
+ fuse_write_update_size(inode, iocb->ki_pos);
+ return ret;
+}
+
+static ssize_t fuse_dax_write_iter(struct kiocb *iocb, struct iov_iter *from)
+{
+ struct inode *inode = file_inode(iocb->ki_filp);
+ ssize_t ret;
+
+ if (iocb->ki_flags & IOCB_NOWAIT) {
+ if (!inode_trylock(inode))
+ return -EAGAIN;
+ } else {
+ inode_lock(inode);
+ }
+
+ ret = generic_write_checks(iocb, from);
+ if (ret <= 0)
+ goto out;
+
+ ret = file_remove_privs(iocb->ki_filp);
+ if (ret)
+ goto out;
+ /* TODO file_update_time() but we don't want metadata I/O */
+
+ /* Do not use dax for file extending writes as write and on
+ * on disk i_size increase are not atomic otherwise.
+ */
+ if (file_extending_write(iocb, from))
+ ret = fuse_dax_direct_write(iocb, from);
+ else
+ ret = dax_iomap_rw(iocb, from, &fuse_iomap_ops);
+
+out:
+ inode_unlock(inode);
+
+ if (ret > 0)
+ ret = generic_write_sync(iocb, ret);
+ return ret;
}
static void fuse_writepage_free(struct fuse_writepage_args *wpa)
@@ -2335,6 +2863,11 @@ static int fuse_file_mmap(struct file *file, struct vm_area_struct *vma)
return 0;
}
+static int fuse_dax_mmap(struct file *file, struct vm_area_struct *vma)
+{
+ return -EINVAL; /* TODO */
+}
+
static int convert_fuse_file_lock(struct fuse_conn *fc,
const struct fuse_file_lock *ffl,
struct file_lock *fl)
@@ -3431,6 +3964,7 @@ static const struct address_space_operations fuse_file_aops = {
void fuse_init_file_inode(struct inode *inode)
{
struct fuse_inode *fi = get_fuse_inode(inode);
+ struct fuse_conn *fc = get_fuse_conn(inode);
inode->i_fop = &fuse_file_operations;
inode->i_data.a_ops = &fuse_file_aops;
@@ -3440,4 +3974,8 @@ void fuse_init_file_inode(struct inode *inode)
fi->writectr = 0;
init_waitqueue_head(&fi->page_waitq);
fi->writepages = RB_ROOT;
+ fi->dmap_tree = RB_ROOT_CACHED;
+
+ if (fc->dax_dev)
+ inode->i_flags |= S_DAX;
}
diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
index 4a46e35222c7..22fb01ba55fb 100644
--- a/fs/fuse/fuse_i.h
+++ b/fs/fuse/fuse_i.h
@@ -31,6 +31,7 @@
#include <linux/pid_namespace.h>
#include <linux/refcount.h>
#include <linux/user_namespace.h>
+#include <linux/interval_tree.h>
/** Default max number of pages that can be used in a single read request */
#define FUSE_DEFAULT_MAX_PAGES_PER_REQ 32
@@ -76,11 +77,17 @@ struct fuse_dax_mapping {
/* Will connect in fc->free_ranges to keep track of free memory */
struct list_head list;
+ /* For interval tree in file/inode */
+ struct interval_tree_node itn;
+
/** Position in DAX window */
u64 window_offset;
/** Length of mapping, in bytes */
loff_t length;
+
+ /* Is this mapping read-only or read-write */
+ bool writable;
};
/** FUSE inode */
@@ -168,6 +175,15 @@ struct fuse_inode {
/** Lock to protect write related fields */
spinlock_t lock;
+
+ /*
+ * Semaphore to protect modifications to dmap_tree
+ */
+ struct rw_semaphore i_dmap_sem;
+
+ /** Sorted rb tree of struct fuse_dax_mapping elements */
+ struct rb_root_cached dmap_tree;
+ unsigned long nr_dmaps;
};
/** FUSE inode state bits */
@@ -1131,5 +1147,15 @@ unsigned int fuse_len_args(unsigned int numargs, struct fuse_arg *args);
*/
u64 fuse_get_unique(struct fuse_iqueue *fiq);
void fuse_free_conn(struct fuse_conn *fc);
+void fuse_cleanup_inode_mappings(struct inode *inode);
+
+static inline struct fuse_dax_mapping *
+node_to_dmap(struct interval_tree_node *node)
+{
+ if (!node)
+ return NULL;
+
+ return container_of(node, struct fuse_dax_mapping, itn);
+}
#endif /* _FS_FUSE_I_H */
diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
index 947abdd776ca..8ea2d3ebcb36 100644
--- a/fs/fuse/inode.c
+++ b/fs/fuse/inode.c
@@ -86,7 +86,9 @@ static struct inode *fuse_alloc_inode(struct super_block *sb)
fi->attr_version = 0;
fi->orig_ino = 0;
fi->state = 0;
+ fi->nr_dmaps = 0;
mutex_init(&fi->mutex);
+ init_rwsem(&fi->i_dmap_sem);
spin_lock_init(&fi->lock);
fi->forget = fuse_alloc_forget();
if (!fi->forget) {
@@ -114,6 +116,10 @@ static void fuse_evict_inode(struct inode *inode)
clear_inode(inode);
if (inode->i_sb->s_flags & SB_ACTIVE) {
struct fuse_conn *fc = get_fuse_conn(inode);
+ if (IS_DAX(inode)) {
+ fuse_cleanup_inode_mappings(inode);
+ WARN_ON(fi->nr_dmaps);
+ }
fuse_queue_forget(fc, fi->forget, fi->nodeid, fi->nlookup);
fi->forget = NULL;
}
diff --git a/include/uapi/linux/fuse.h b/include/uapi/linux/fuse.h
index 60a7bfc787ce..8899e4862309 100644
--- a/include/uapi/linux/fuse.h
+++ b/include/uapi/linux/fuse.h
@@ -895,6 +895,7 @@ struct fuse_copy_file_range_in {
};
#define FUSE_SETUPMAPPING_FLAG_WRITE (1ull << 0)
+#define FUSE_SETUPMAPPING_FLAG_READ (1ull << 1)
struct fuse_setupmapping_in {
/* An already open handle */
uint64_t fh;
--
2.25.4
From: Stefan Hajnoczi <[email protected]>
Setup a dax device.
Use the shm capability to find the cache entry and map it.
The DAX window is accessed by the fs/dax.c infrastructure and must have
struct pages (at least on x86). Use devm_memremap_pages() to map the
DAX window PCI BAR and allocate struct page.
Signed-off-by: Stefan Hajnoczi <[email protected]>
Signed-off-by: Dr. David Alan Gilbert <[email protected]>
Signed-off-by: Vivek Goyal <[email protected]>
Signed-off-by: Sebastien Boeuf <[email protected]>
Signed-off-by: Liu Bo <[email protected]>
---
fs/fuse/virtio_fs.c | 139 +++++++++++++++++++++++++++++++++
include/uapi/linux/virtio_fs.h | 3 +
2 files changed, 142 insertions(+)
diff --git a/fs/fuse/virtio_fs.c b/fs/fuse/virtio_fs.c
index 47ecdc15f25d..0fd3b5cecc5f 100644
--- a/fs/fuse/virtio_fs.c
+++ b/fs/fuse/virtio_fs.c
@@ -5,12 +5,16 @@
*/
#include <linux/fs.h>
+#include <linux/dax.h>
+#include <linux/pci.h>
+#include <linux/pfn_t.h>
#include <linux/module.h>
#include <linux/virtio.h>
#include <linux/virtio_fs.h>
#include <linux/delay.h>
#include <linux/fs_context.h>
#include <linux/highmem.h>
+#include <linux/uio.h>
#include "fuse_i.h"
/* List of virtio-fs device instances and a lock for the list. Also provides
@@ -49,6 +53,12 @@ struct virtio_fs {
struct virtio_fs_vq *vqs;
unsigned int nvqs; /* number of virtqueues */
unsigned int num_request_queues; /* number of request queues */
+ struct dax_device *dax_dev;
+
+ /* DAX memory window where file contents are mapped */
+ void *window_kaddr;
+ phys_addr_t window_phys_addr;
+ size_t window_len;
};
struct virtio_fs_forget_req {
@@ -686,6 +696,131 @@ static void virtio_fs_cleanup_vqs(struct virtio_device *vdev,
vdev->config->del_vqs(vdev);
}
+/* Map a window offset to a page frame number. The window offset will have
+ * been produced by .iomap_begin(), which maps a file offset to a window
+ * offset.
+ */
+static long virtio_fs_direct_access(struct dax_device *dax_dev, pgoff_t pgoff,
+ long nr_pages, void **kaddr, pfn_t *pfn)
+{
+ struct virtio_fs *fs = dax_get_private(dax_dev);
+ phys_addr_t offset = PFN_PHYS(pgoff);
+ size_t max_nr_pages = fs->window_len/PAGE_SIZE - pgoff;
+
+ if (kaddr)
+ *kaddr = fs->window_kaddr + offset;
+ if (pfn)
+ *pfn = phys_to_pfn_t(fs->window_phys_addr + offset,
+ PFN_DEV | PFN_MAP);
+ return nr_pages > max_nr_pages ? max_nr_pages : nr_pages;
+}
+
+static size_t virtio_fs_copy_from_iter(struct dax_device *dax_dev,
+ pgoff_t pgoff, void *addr,
+ size_t bytes, struct iov_iter *i)
+{
+ return copy_from_iter(addr, bytes, i);
+}
+
+static size_t virtio_fs_copy_to_iter(struct dax_device *dax_dev,
+ pgoff_t pgoff, void *addr,
+ size_t bytes, struct iov_iter *i)
+{
+ return copy_to_iter(addr, bytes, i);
+}
+
+static int virtio_fs_zero_page_range(struct dax_device *dax_dev,
+ pgoff_t pgoff, size_t nr_pages)
+{
+ long rc;
+ void *kaddr;
+
+ rc = dax_direct_access(dax_dev, pgoff, nr_pages, &kaddr, NULL);
+ if (rc < 0)
+ return rc;
+ memset(kaddr, 0, nr_pages << PAGE_SHIFT);
+ dax_flush(dax_dev, kaddr, nr_pages << PAGE_SHIFT);
+ return 0;
+}
+
+static const struct dax_operations virtio_fs_dax_ops = {
+ .direct_access = virtio_fs_direct_access,
+ .copy_from_iter = virtio_fs_copy_from_iter,
+ .copy_to_iter = virtio_fs_copy_to_iter,
+ .zero_page_range = virtio_fs_zero_page_range,
+};
+
+static void virtio_fs_cleanup_dax(void *data)
+{
+ struct dax_device *dax_dev = data;
+
+ kill_dax(dax_dev);
+ put_dax(dax_dev);
+}
+
+static int virtio_fs_setup_dax(struct virtio_device *vdev, struct virtio_fs *fs)
+{
+ struct virtio_shm_region cache_reg;
+ struct dev_pagemap *pgmap;
+ bool have_cache;
+
+ if (!IS_ENABLED(CONFIG_DAX_DRIVER))
+ return 0;
+
+ /* Get cache region */
+ have_cache = virtio_get_shm_region(vdev, &cache_reg,
+ (u8)VIRTIO_FS_SHMCAP_ID_CACHE);
+ if (!have_cache) {
+ dev_notice(&vdev->dev, "%s: No cache capability\n", __func__);
+ return 0;
+ }
+
+ if (!devm_request_mem_region(&vdev->dev, cache_reg.addr, cache_reg.len,
+ dev_name(&vdev->dev))) {
+ dev_warn(&vdev->dev, "could not reserve region addr=0x%llx"
+ " len=0x%llx\n", cache_reg.addr, cache_reg.len);
+ return -EBUSY;
+ }
+
+ dev_notice(&vdev->dev, "Cache len: 0x%llx @ 0x%llx\n", cache_reg.len,
+ cache_reg.addr);
+
+ pgmap = devm_kzalloc(&vdev->dev, sizeof(*pgmap), GFP_KERNEL);
+ if (!pgmap)
+ return -ENOMEM;
+
+ pgmap->type = MEMORY_DEVICE_FS_DAX;
+
+ /* Ideally we would directly use the PCI BAR resource but
+ * devm_memremap_pages() wants its own copy in pgmap. So
+ * initialize a struct resource from scratch (only the start
+ * and end fields will be used).
+ */
+ pgmap->res = (struct resource){
+ .name = "virtio-fs dax window",
+ .start = (phys_addr_t) cache_reg.addr,
+ .end = (phys_addr_t) cache_reg.addr + cache_reg.len - 1,
+ };
+
+ fs->window_kaddr = devm_memremap_pages(&vdev->dev, pgmap);
+ if (IS_ERR(fs->window_kaddr))
+ return PTR_ERR(fs->window_kaddr);
+
+ fs->window_phys_addr = (phys_addr_t) cache_reg.addr;
+ fs->window_len = (phys_addr_t) cache_reg.len;
+
+ dev_dbg(&vdev->dev, "%s: window kaddr 0x%px phys_addr 0x%llx"
+ " len 0x%llx\n", __func__, fs->window_kaddr, cache_reg.addr,
+ cache_reg.len);
+
+ fs->dax_dev = alloc_dax(fs, NULL, &virtio_fs_dax_ops, 0);
+ if (IS_ERR(fs->dax_dev))
+ return PTR_ERR(fs->dax_dev);
+
+ return devm_add_action_or_reset(&vdev->dev, virtio_fs_cleanup_dax,
+ fs->dax_dev);
+}
+
static int virtio_fs_probe(struct virtio_device *vdev)
{
struct virtio_fs *fs;
@@ -707,6 +842,10 @@ static int virtio_fs_probe(struct virtio_device *vdev)
/* TODO vq affinity */
+ ret = virtio_fs_setup_dax(vdev, fs);
+ if (ret < 0)
+ goto out_vqs;
+
/* Bring the device online in case the filesystem is mounted and
* requests need to be sent before we return.
*/
diff --git a/include/uapi/linux/virtio_fs.h b/include/uapi/linux/virtio_fs.h
index 3056b6e9f8ce..bea38291421b 100644
--- a/include/uapi/linux/virtio_fs.h
+++ b/include/uapi/linux/virtio_fs.h
@@ -16,4 +16,7 @@ struct virtio_fs_config {
__le32 num_request_queues;
} __attribute__((packed));
+/* For the id field in virtio_pci_shm_cap */
+#define VIRTIO_FS_SHMCAP_ID_CACHE 0
+
#endif /* _UAPI_LINUX_VIRTIO_FS_H */
--
2.25.4
This list will be used selecting fuse_dax_mapping to free when number of
free mappings drops below a threshold.
Signed-off-by: Vivek Goyal <[email protected]>
---
fs/fuse/file.c | 22 ++++++++++++++++++++++
fs/fuse/fuse_i.h | 7 +++++++
fs/fuse/inode.c | 4 ++++
3 files changed, 33 insertions(+)
diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index aaa57c625af7..723602813ad6 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -213,6 +213,23 @@ static struct fuse_dax_mapping *alloc_dax_mapping(struct fuse_conn *fc)
return dmap;
}
+/* This assumes fc->lock is held */
+static void __dmap_remove_busy_list(struct fuse_conn *fc,
+ struct fuse_dax_mapping *dmap)
+{
+ list_del_init(&dmap->busy_list);
+ WARN_ON(fc->nr_busy_ranges == 0);
+ fc->nr_busy_ranges--;
+}
+
+static void dmap_remove_busy_list(struct fuse_conn *fc,
+ struct fuse_dax_mapping *dmap)
+{
+ spin_lock(&fc->lock);
+ __dmap_remove_busy_list(fc, dmap);
+ spin_unlock(&fc->lock);
+}
+
/* This assumes fc->lock is held */
static void __dmap_add_to_free_pool(struct fuse_conn *fc,
struct fuse_dax_mapping *dmap)
@@ -266,6 +283,10 @@ static int fuse_setup_one_mapping(struct inode *inode, unsigned long start_idx,
/* Protected by fi->i_dmap_sem */
interval_tree_insert(&dmap->itn, &fi->dmap_tree);
fi->nr_dmaps++;
+ spin_lock(&fc->lock);
+ list_add_tail(&dmap->busy_list, &fc->busy_ranges);
+ fc->nr_busy_ranges++;
+ spin_unlock(&fc->lock);
}
return 0;
}
@@ -335,6 +356,7 @@ static void dmap_reinit_add_to_free_pool(struct fuse_conn *fc,
pr_debug("fuse: freeing memory range start_idx=0x%lx end_idx=0x%lx "
"window_offset=0x%llx length=0x%llx\n", dmap->itn.start,
dmap->itn.last, dmap->window_offset, dmap->length);
+ __dmap_remove_busy_list(fc, dmap);
dmap->itn.start = dmap->itn.last = 0;
__dmap_add_to_free_pool(fc, dmap);
}
diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
index e555c9a33359..400a19a464ca 100644
--- a/fs/fuse/fuse_i.h
+++ b/fs/fuse/fuse_i.h
@@ -80,6 +80,9 @@ struct fuse_dax_mapping {
/* For interval tree in file/inode */
struct interval_tree_node itn;
+ /* Will connect in fc->busy_ranges to keep track busy memory */
+ struct list_head busy_list;
+
/** Position in DAX window */
u64 window_offset;
@@ -812,6 +815,10 @@ struct fuse_conn {
/** DAX device, non-NULL if DAX is supported */
struct dax_device *dax_dev;
+ /* List of memory ranges which are busy */
+ unsigned long nr_busy_ranges;
+ struct list_head busy_ranges;
+
/*
* DAX Window Free Ranges
*/
diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
index 3735bc5fdfa2..671e84e3dd99 100644
--- a/fs/fuse/inode.c
+++ b/fs/fuse/inode.c
@@ -636,6 +636,8 @@ static void fuse_free_dax_mem_ranges(struct list_head *mem_list)
/* Free All allocated elements */
list_for_each_entry_safe(range, temp, mem_list, list) {
list_del(&range->list);
+ if (!list_empty(&range->busy_list))
+ list_del(&range->busy_list);
kfree(range);
}
}
@@ -680,6 +682,7 @@ static int fuse_dax_mem_range_init(struct fuse_conn *fc,
*/
range->window_offset = i * FUSE_DAX_SZ;
range->length = FUSE_DAX_SZ;
+ INIT_LIST_HEAD(&range->busy_list);
list_add_tail(&range->list, &mem_ranges);
}
@@ -727,6 +730,7 @@ void fuse_conn_init(struct fuse_conn *fc, struct user_namespace *user_ns,
fc->user_ns = get_user_ns(user_ns);
fc->max_pages = FUSE_DEFAULT_MAX_PAGES_PER_REQ;
INIT_LIST_HEAD(&fc->free_ranges);
+ INIT_LIST_HEAD(&fc->busy_ranges);
}
EXPORT_SYMBOL_GPL(fuse_conn_init);
--
2.25.4
The device communicates FUSE_SETUPMAPPING/FUSE_REMOVMAPPING alignment
constraints via the FUST_INIT map_alignment field. Parse this field and
ensure our DAX mappings meet the alignment constraints.
We don't actually align anything differently since our mappings are
already 2MB aligned. Just check the value when the connection is
established. If it becomes necessary to honor arbitrary alignments in
the future we'll have to adjust how mappings are sized.
The upshot of this commit is that we can be confident that mappings will
work even when emulating x86 on Power and similar combinations where the
host page sizes are different.
Signed-off-by: Stefan Hajnoczi <[email protected]>
Signed-off-by: Vivek Goyal <[email protected]>
---
fs/fuse/fuse_i.h | 5 ++++-
fs/fuse/inode.c | 18 ++++++++++++++++--
include/uapi/linux/fuse.h | 4 +++-
3 files changed, 23 insertions(+), 4 deletions(-)
diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
index 478c940b05b4..4a46e35222c7 100644
--- a/fs/fuse/fuse_i.h
+++ b/fs/fuse/fuse_i.h
@@ -47,7 +47,10 @@
/** Number of dentries for each connection in the control filesystem */
#define FUSE_CTL_NUM_DENTRIES 5
-/* Default memory range size, 2MB */
+/*
+ * Default memory range size. A power of 2 so it agrees with common FUSE_INIT
+ * map_alignment values 4KB and 64KB.
+ */
#define FUSE_DAX_SZ (2*1024*1024)
#define FUSE_DAX_SHIFT (21)
#define FUSE_DAX_PAGES (FUSE_DAX_SZ/PAGE_SIZE)
diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
index b82eb61d63cc..947abdd776ca 100644
--- a/fs/fuse/inode.c
+++ b/fs/fuse/inode.c
@@ -980,9 +980,10 @@ static void process_init_reply(struct fuse_conn *fc, struct fuse_args *args,
{
struct fuse_init_args *ia = container_of(args, typeof(*ia), args);
struct fuse_init_out *arg = &ia->out;
+ bool ok = true;
if (error || arg->major != FUSE_KERNEL_VERSION)
- fc->conn_error = 1;
+ ok = false;
else {
unsigned long ra_pages;
@@ -1045,6 +1046,13 @@ static void process_init_reply(struct fuse_conn *fc, struct fuse_args *args,
min_t(unsigned int, FUSE_MAX_MAX_PAGES,
max_t(unsigned int, arg->max_pages, 1));
}
+ if ((arg->flags & FUSE_MAP_ALIGNMENT) &&
+ (FUSE_DAX_SZ % (1ul << arg->map_alignment))) {
+ pr_err("FUSE: map_alignment %u incompatible"
+ " with dax mem range size %u\n",
+ arg->map_alignment, FUSE_DAX_SZ);
+ ok = false;
+ }
} else {
ra_pages = fc->max_read / PAGE_SIZE;
fc->no_lock = 1;
@@ -1060,6 +1068,11 @@ static void process_init_reply(struct fuse_conn *fc, struct fuse_args *args,
}
kfree(ia);
+ if (!ok) {
+ fc->conn_init = 0;
+ fc->conn_error = 1;
+ }
+
fuse_set_initialized(fc);
wake_up_all(&fc->blocked_waitq);
}
@@ -1082,7 +1095,8 @@ void fuse_send_init(struct fuse_conn *fc)
FUSE_WRITEBACK_CACHE | FUSE_NO_OPEN_SUPPORT |
FUSE_PARALLEL_DIROPS | FUSE_HANDLE_KILLPRIV | FUSE_POSIX_ACL |
FUSE_ABORT_ERROR | FUSE_MAX_PAGES | FUSE_CACHE_SYMLINKS |
- FUSE_NO_OPENDIR_SUPPORT | FUSE_EXPLICIT_INVAL_DATA;
+ FUSE_NO_OPENDIR_SUPPORT | FUSE_EXPLICIT_INVAL_DATA |
+ FUSE_MAP_ALIGNMENT;
ia->args.opcode = FUSE_INIT;
ia->args.in_numargs = 1;
ia->args.in_args[0].size = sizeof(ia->in);
diff --git a/include/uapi/linux/fuse.h b/include/uapi/linux/fuse.h
index 373cada89815..5b85819e045f 100644
--- a/include/uapi/linux/fuse.h
+++ b/include/uapi/linux/fuse.h
@@ -313,7 +313,9 @@ struct fuse_file_lock {
* FUSE_CACHE_SYMLINKS: cache READLINK responses
* FUSE_NO_OPENDIR_SUPPORT: kernel supports zero-message opendir
* FUSE_EXPLICIT_INVAL_DATA: only invalidate cached pages on explicit request
- * FUSE_MAP_ALIGNMENT: map_alignment field is valid
+ * FUSE_MAP_ALIGNMENT: init_out.map_alignment contains log2(byte alignment) for
+ * foffset and moffset fields in struct
+ * fuse_setupmapping_out and fuse_removemapping_one.
*/
#define FUSE_ASYNC_READ (1 << 0)
#define FUSE_POSIX_LOCKS (1 << 1)
--
2.25.4
This option was introduced so that for virtio_fs we don't show any mounts
options fuse_show_options(). Because we don't offer any of these options
to be controlled by mounter.
Very soon we are planning to introduce option "dax" which mounter should
be able to specify. And no_mount_options does not work anymore. What
we need is a per mount option specific flag so that filesystem can
specify which options to show.
Add few such flags to control the behavior in more fine grained manner
and get rid of no_mount_options.
Signed-off-by: Vivek Goyal <[email protected]>
---
fs/fuse/fuse_i.h | 14 ++++++++++----
fs/fuse/inode.c | 22 ++++++++++++++--------
fs/fuse/virtio_fs.c | 1 -
3 files changed, 24 insertions(+), 13 deletions(-)
diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
index 740a8a7d7ae6..cf5e675100ec 100644
--- a/fs/fuse/fuse_i.h
+++ b/fs/fuse/fuse_i.h
@@ -471,18 +471,21 @@ struct fuse_fs_context {
int fd;
unsigned int rootmode;
kuid_t user_id;
+ bool user_id_show;
kgid_t group_id;
+ bool group_id_show;
bool is_bdev:1;
bool fd_present:1;
bool rootmode_present:1;
bool user_id_present:1;
bool group_id_present:1;
bool default_permissions:1;
+ bool default_permissions_show:1;
bool allow_other:1;
+ bool allow_other_show:1;
bool destroy:1;
bool no_control:1;
bool no_force_umount:1;
- bool no_mount_options:1;
unsigned int max_read;
unsigned int blksize;
const char *subtype;
@@ -512,9 +515,11 @@ struct fuse_conn {
/** The user id for this mount */
kuid_t user_id;
+ bool user_id_show:1;
/** The group id for this mount */
kgid_t group_id;
+ bool group_id_show:1;
/** The pid namespace for this mount */
struct pid_namespace *pid_ns;
@@ -698,10 +703,14 @@ struct fuse_conn {
/** Check permissions based on the file mode or not? */
unsigned default_permissions:1;
+ bool default_permissions_show:1;
/** Allow other than the mounter user to access the filesystem ? */
unsigned allow_other:1;
+ /** Show allow_other in mount options */
+ bool allow_other_show:1;
+
/** Does the filesystem support copy_file_range? */
unsigned no_copy_file_range:1;
@@ -717,9 +726,6 @@ struct fuse_conn {
/** Do not allow MNT_FORCE umount */
unsigned int no_force_umount:1;
- /* Do not show mount options */
- unsigned int no_mount_options:1;
-
/** The number of requests waiting for completion */
atomic_t num_waiting;
diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
index bba747520e9b..2ac5713c4c32 100644
--- a/fs/fuse/inode.c
+++ b/fs/fuse/inode.c
@@ -535,10 +535,12 @@ static int fuse_parse_param(struct fs_context *fc, struct fs_parameter *param)
case OPT_DEFAULT_PERMISSIONS:
ctx->default_permissions = true;
+ ctx->default_permissions_show = true;
break;
case OPT_ALLOW_OTHER:
ctx->allow_other = true;
+ ctx->allow_other_show = true;
break;
case OPT_MAX_READ:
@@ -573,14 +575,15 @@ static int fuse_show_options(struct seq_file *m, struct dentry *root)
struct super_block *sb = root->d_sb;
struct fuse_conn *fc = get_fuse_conn_super(sb);
- if (fc->no_mount_options)
- return 0;
-
- seq_printf(m, ",user_id=%u", from_kuid_munged(fc->user_ns, fc->user_id));
- seq_printf(m, ",group_id=%u", from_kgid_munged(fc->user_ns, fc->group_id));
- if (fc->default_permissions)
+ if (fc->user_id_show)
+ seq_printf(m, ",user_id=%u",
+ from_kuid_munged(fc->user_ns, fc->user_id));
+ if (fc->group_id_show)
+ seq_printf(m, ",group_id=%u",
+ from_kgid_munged(fc->user_ns, fc->group_id));
+ if (fc->default_permissions && fc->default_permissions_show)
seq_puts(m, ",default_permissions");
- if (fc->allow_other)
+ if (fc->allow_other && fc->allow_other_show)
seq_puts(m, ",allow_other");
if (fc->max_read != ~0)
seq_printf(m, ",max_read=%u", fc->max_read);
@@ -1193,14 +1196,17 @@ int fuse_fill_super_common(struct super_block *sb, struct fuse_fs_context *ctx)
sb->s_flags |= SB_POSIXACL;
fc->default_permissions = ctx->default_permissions;
+ fc->default_permissions_show = ctx->default_permissions_show;
fc->allow_other = ctx->allow_other;
+ fc->allow_other_show = ctx->allow_other_show;
fc->user_id = ctx->user_id;
+ fc->user_id_show = ctx->user_id_show;
fc->group_id = ctx->group_id;
+ fc->group_id_show = ctx->group_id_show;
fc->max_read = max_t(unsigned, 4096, ctx->max_read);
fc->destroy = ctx->destroy;
fc->no_control = ctx->no_control;
fc->no_force_umount = ctx->no_force_umount;
- fc->no_mount_options = ctx->no_mount_options;
err = -ENOMEM;
root = fuse_get_root_inode(sb, ctx->rootmode);
diff --git a/fs/fuse/virtio_fs.c b/fs/fuse/virtio_fs.c
index ed8da4825b70..47ecdc15f25d 100644
--- a/fs/fuse/virtio_fs.c
+++ b/fs/fuse/virtio_fs.c
@@ -1096,7 +1096,6 @@ static int virtio_fs_fill_super(struct super_block *sb)
.destroy = true,
.no_control = true,
.no_force_umount = true,
- .no_mount_options = true,
};
mutex_lock(&virtio_fs_mutex);
--
2.25.4
Add a mount option to allow using dax with virtio_fs.
Signed-off-by: Vivek Goyal <[email protected]>
---
fs/fuse/fuse_i.h | 7 ++++
fs/fuse/inode.c | 3 ++
fs/fuse/virtio_fs.c | 82 +++++++++++++++++++++++++++++++++++++--------
3 files changed, 78 insertions(+), 14 deletions(-)
diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
index cf5e675100ec..04fdd7c41bd1 100644
--- a/fs/fuse/fuse_i.h
+++ b/fs/fuse/fuse_i.h
@@ -486,10 +486,14 @@ struct fuse_fs_context {
bool destroy:1;
bool no_control:1;
bool no_force_umount:1;
+ bool dax:1;
unsigned int max_read;
unsigned int blksize;
const char *subtype;
+ /* DAX device, may be NULL */
+ struct dax_device *dax_dev;
+
/* fuse_dev pointer to fill in, should contain NULL on entry */
void **fudptr;
};
@@ -761,6 +765,9 @@ struct fuse_conn {
/** List of device instances belonging to this connection */
struct list_head devices;
+
+ /** DAX device, non-NULL if DAX is supported */
+ struct dax_device *dax_dev;
};
static inline struct fuse_conn *get_fuse_conn_super(struct super_block *sb)
diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
index 2ac5713c4c32..beac337ccc10 100644
--- a/fs/fuse/inode.c
+++ b/fs/fuse/inode.c
@@ -589,6 +589,8 @@ static int fuse_show_options(struct seq_file *m, struct dentry *root)
seq_printf(m, ",max_read=%u", fc->max_read);
if (sb->s_bdev && sb->s_blocksize != FUSE_DEFAULT_BLKSIZE)
seq_printf(m, ",blksize=%lu", sb->s_blocksize);
+ if (fc->dax_dev)
+ seq_printf(m, ",dax");
return 0;
}
@@ -1207,6 +1209,7 @@ int fuse_fill_super_common(struct super_block *sb, struct fuse_fs_context *ctx)
fc->destroy = ctx->destroy;
fc->no_control = ctx->no_control;
fc->no_force_umount = ctx->no_force_umount;
+ fc->dax_dev = ctx->dax_dev;
err = -ENOMEM;
root = fuse_get_root_inode(sb, ctx->rootmode);
diff --git a/fs/fuse/virtio_fs.c b/fs/fuse/virtio_fs.c
index 0fd3b5cecc5f..741cad4abad8 100644
--- a/fs/fuse/virtio_fs.c
+++ b/fs/fuse/virtio_fs.c
@@ -13,6 +13,7 @@
#include <linux/virtio_fs.h>
#include <linux/delay.h>
#include <linux/fs_context.h>
+#include <linux/fs_parser.h>
#include <linux/highmem.h>
#include <linux/uio.h>
#include "fuse_i.h"
@@ -81,6 +82,45 @@ struct virtio_fs_req_work {
static int virtio_fs_enqueue_req(struct virtio_fs_vq *fsvq,
struct fuse_req *req, bool in_flight);
+enum {
+ OPT_DAX,
+};
+
+static const struct fs_parameter_spec virtio_fs_parameters[] = {
+ fsparam_flag ("dax", OPT_DAX),
+ {}
+};
+
+static int virtio_fs_parse_param(struct fs_context *fc,
+ struct fs_parameter *param)
+{
+ struct fs_parse_result result;
+ struct fuse_fs_context *ctx = fc->fs_private;
+ int opt;
+
+ opt = fs_parse(fc, virtio_fs_parameters, param, &result);
+ if (opt < 0)
+ return opt;
+
+ switch (opt) {
+ case OPT_DAX:
+ ctx->dax = 1;
+ break;
+ default:
+ return -EINVAL;
+ }
+
+ return 0;
+}
+
+static void virtio_fs_free_fc(struct fs_context *fc)
+{
+ struct fuse_fs_context *ctx = fc->fs_private;
+
+ if (ctx)
+ kfree(ctx);
+}
+
static inline struct virtio_fs_vq *vq_to_fsvq(struct virtqueue *vq)
{
struct virtio_fs *fs = vq->vdev->priv;
@@ -1220,23 +1260,27 @@ static const struct fuse_iqueue_ops virtio_fs_fiq_ops = {
.release = virtio_fs_fiq_release,
};
-static int virtio_fs_fill_super(struct super_block *sb)
+static inline void virtio_fs_ctx_set_defaults(struct fuse_fs_context *ctx)
+{
+ ctx->rootmode = S_IFDIR;
+ ctx->default_permissions = 1;
+ ctx->allow_other = 1;
+ ctx->max_read = UINT_MAX;
+ ctx->blksize = 512;
+ ctx->destroy = true;
+ ctx->no_control = true;
+ ctx->no_force_umount = true;
+}
+
+static int virtio_fs_fill_super(struct super_block *sb, struct fs_context *fsc)
{
struct fuse_conn *fc = get_fuse_conn_super(sb);
struct virtio_fs *fs = fc->iq.priv;
+ struct fuse_fs_context *ctx = fsc->fs_private;
unsigned int i;
int err;
- struct fuse_fs_context ctx = {
- .rootmode = S_IFDIR,
- .default_permissions = 1,
- .allow_other = 1,
- .max_read = UINT_MAX,
- .blksize = 512,
- .destroy = true,
- .no_control = true,
- .no_force_umount = true,
- };
+ virtio_fs_ctx_set_defaults(ctx);
mutex_lock(&virtio_fs_mutex);
/* After holding mutex, make sure virtiofs device is still there.
@@ -1260,8 +1304,10 @@ static int virtio_fs_fill_super(struct super_block *sb)
}
/* virtiofs allocates and installs its own fuse devices */
- ctx.fudptr = NULL;
- err = fuse_fill_super_common(sb, &ctx);
+ ctx->fudptr = NULL;
+ if (ctx->dax)
+ ctx->dax_dev = fs->dax_dev;
+ err = fuse_fill_super_common(sb, ctx);
if (err < 0)
goto err_free_fuse_devs;
@@ -1372,7 +1418,7 @@ static int virtio_fs_get_tree(struct fs_context *fsc)
return PTR_ERR(sb);
if (!sb->s_root) {
- err = virtio_fs_fill_super(sb);
+ err = virtio_fs_fill_super(sb, fsc);
if (err) {
deactivate_locked_super(sb);
return err;
@@ -1387,11 +1433,19 @@ static int virtio_fs_get_tree(struct fs_context *fsc)
}
static const struct fs_context_operations virtio_fs_context_ops = {
+ .free = virtio_fs_free_fc,
+ .parse_param = virtio_fs_parse_param,
.get_tree = virtio_fs_get_tree,
};
static int virtio_fs_init_fs_context(struct fs_context *fsc)
{
+ struct fuse_fs_context *ctx;
+
+ ctx = kzalloc(sizeof(struct fuse_fs_context), GFP_KERNEL);
+ if (!ctx)
+ return -ENOMEM;
+ fsc->fs_private = ctx;
fsc->ops = &virtio_fs_context_ops;
return 0;
}
--
2.25.4
virtiofs device has a range of memory which is mapped into file inodes
using dax. This memory is mapped in qemu on host and maps different
sections of real file on host. Size of this memory is limited
(determined by administrator) and depending on filesystem size, we will
soon reach a situation where all the memory is in use and we need to
reclaim some.
As part of reclaim process, we will need to make sure that there are
no active references to pages (taken by get_user_pages()) on the memory
range we are trying to reclaim. I am planning to use
dax_layout_busy_page() for this. But in current form this is per inode
and scans through all the pages of the inode.
We want to reclaim only a portion of memory (say 2MB page). So we want
to make sure that only that 2MB range of pages do not have any
references (and don't want to unmap all the pages of inode).
Hence, create a range version of this function named
dax_layout_busy_page_range() which can be used to pass a range which
needs to be unmapped.
Cc: Dan Williams <[email protected]>
Cc: [email protected]
Cc: Jan Kara <[email protected]>
Cc: Vishal L Verma <[email protected]>
Cc: "Weiny, Ira" <[email protected]>
Signed-off-by: Vivek Goyal <[email protected]>
---
fs/dax.c | 29 +++++++++++++++++++++++------
include/linux/dax.h | 6 ++++++
2 files changed, 29 insertions(+), 6 deletions(-)
diff --git a/fs/dax.c b/fs/dax.c
index 95341af1a966..ddd705251d9f 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -559,7 +559,7 @@ static void *grab_mapping_entry(struct xa_state *xas,
}
/**
- * dax_layout_busy_page - find first pinned page in @mapping
+ * dax_layout_busy_page_range - find first pinned page in @mapping
* @mapping: address space to scan for a page with ref count > 1
*
* DAX requires ZONE_DEVICE mapped pages. These pages are never
@@ -572,13 +572,19 @@ static void *grab_mapping_entry(struct xa_state *xas,
* establishment of new mappings in this address_space. I.e. it expects
* to be able to run unmap_mapping_range() and subsequently not race
* mapping_mapped() becoming true.
+ *
+ * Partial pages are included. If 'end' is LLONG_MAX, pages in the range
+ * from 'start' to end of the file are inluded.
*/
-struct page *dax_layout_busy_page(struct address_space *mapping)
+struct page *dax_layout_busy_page_range(struct address_space *mapping,
+ loff_t start, loff_t end)
{
- XA_STATE(xas, &mapping->i_pages, 0);
void *entry;
unsigned int scanned = 0;
struct page *page = NULL;
+ pgoff_t start_idx = start >> PAGE_SHIFT;
+ pgoff_t end_idx;
+ XA_STATE(xas, &mapping->i_pages, start_idx);
/*
* In the 'limited' case get_user_pages() for dax is disabled.
@@ -589,6 +595,11 @@ struct page *dax_layout_busy_page(struct address_space *mapping)
if (!dax_mapping(mapping) || !mapping_mapped(mapping))
return NULL;
+ /* If end == LLONG_MAX, all pages from start to till end of file */
+ if (end == LLONG_MAX)
+ end_idx = ULONG_MAX;
+ else
+ end_idx = end >> PAGE_SHIFT;
/*
* If we race get_user_pages_fast() here either we'll see the
* elevated page count in the iteration and wait, or
@@ -596,15 +607,15 @@ struct page *dax_layout_busy_page(struct address_space *mapping)
* against is no longer mapped in the page tables and bail to the
* get_user_pages() slow path. The slow path is protected by
* pte_lock() and pmd_lock(). New references are not taken without
- * holding those locks, and unmap_mapping_range() will not zero the
+ * holding those locks, and unmap_mapping_pages() will not zero the
* pte or pmd without holding the respective lock, so we are
* guaranteed to either see new references or prevent new
* references from being established.
*/
- unmap_mapping_range(mapping, 0, 0, 0);
+ unmap_mapping_pages(mapping, start_idx, end_idx - start_idx + 1, 0);
xas_lock_irq(&xas);
- xas_for_each(&xas, entry, ULONG_MAX) {
+ xas_for_each(&xas, entry, end_idx) {
if (WARN_ON_ONCE(!xa_is_value(entry)))
continue;
if (unlikely(dax_is_locked(entry)))
@@ -625,6 +636,12 @@ struct page *dax_layout_busy_page(struct address_space *mapping)
xas_unlock_irq(&xas);
return page;
}
+EXPORT_SYMBOL_GPL(dax_layout_busy_page_range);
+
+struct page *dax_layout_busy_page(struct address_space *mapping)
+{
+ return dax_layout_busy_page_range(mapping, 0, LLONG_MAX);
+}
EXPORT_SYMBOL_GPL(dax_layout_busy_page);
static int __dax_invalidate_entry(struct address_space *mapping,
diff --git a/include/linux/dax.h b/include/linux/dax.h
index 6904d4e0b2e0..9016929db4c6 100644
--- a/include/linux/dax.h
+++ b/include/linux/dax.h
@@ -141,6 +141,7 @@ int dax_writeback_mapping_range(struct address_space *mapping,
struct dax_device *dax_dev, struct writeback_control *wbc);
struct page *dax_layout_busy_page(struct address_space *mapping);
+struct page *dax_layout_busy_page_range(struct address_space *mapping, loff_t start, loff_t end);
dax_entry_t dax_lock_page(struct page *page);
void dax_unlock_page(struct page *page, dax_entry_t cookie);
#else
@@ -171,6 +172,11 @@ static inline struct page *dax_layout_busy_page(struct address_space *mapping)
return NULL;
}
+static inline struct page *dax_layout_busy_page_range(struct address_space *mapping, pgoff_t start, pgoff_t nr_pages)
+{
+ return NULL;
+}
+
static inline int dax_writeback_mapping_range(struct address_space *mapping,
struct dax_device *dax_dev, struct writeback_control *wbc)
{
--
2.25.4
On Wed 19-08-20 18:19:40, Vivek Goyal wrote:
> virtiofs device has a range of memory which is mapped into file inodes
> using dax. This memory is mapped in qemu on host and maps different
> sections of real file on host. Size of this memory is limited
> (determined by administrator) and depending on filesystem size, we will
> soon reach a situation where all the memory is in use and we need to
> reclaim some.
>
> As part of reclaim process, we will need to make sure that there are
> no active references to pages (taken by get_user_pages()) on the memory
> range we are trying to reclaim. I am planning to use
> dax_layout_busy_page() for this. But in current form this is per inode
> and scans through all the pages of the inode.
>
> We want to reclaim only a portion of memory (say 2MB page). So we want
> to make sure that only that 2MB range of pages do not have any
> references (and don't want to unmap all the pages of inode).
>
> Hence, create a range version of this function named
> dax_layout_busy_page_range() which can be used to pass a range which
> needs to be unmapped.
>
> Cc: Dan Williams <[email protected]>
> Cc: [email protected]
> Cc: Jan Kara <[email protected]>
> Cc: Vishal L Verma <[email protected]>
> Cc: "Weiny, Ira" <[email protected]>
> Signed-off-by: Vivek Goyal <[email protected]>
> ---
> fs/dax.c | 29 +++++++++++++++++++++++------
> include/linux/dax.h | 6 ++++++
> 2 files changed, 29 insertions(+), 6 deletions(-)
>
> diff --git a/fs/dax.c b/fs/dax.c
> index 95341af1a966..ddd705251d9f 100644
> --- a/fs/dax.c
> +++ b/fs/dax.c
> @@ -559,7 +559,7 @@ static void *grab_mapping_entry(struct xa_state *xas,
> }
>
> /**
> - * dax_layout_busy_page - find first pinned page in @mapping
> + * dax_layout_busy_page_range - find first pinned page in @mapping
> * @mapping: address space to scan for a page with ref count > 1
Please document additional function arguments in the kernel-doc comment.
Otherwise the patch looks good so feel free to add:
Reviewed-by: Jan Kara <[email protected]>
after fixing this nit.
Honza
> *
> * DAX requires ZONE_DEVICE mapped pages. These pages are never
> @@ -572,13 +572,19 @@ static void *grab_mapping_entry(struct xa_state *xas,
> * establishment of new mappings in this address_space. I.e. it expects
> * to be able to run unmap_mapping_range() and subsequently not race
> * mapping_mapped() becoming true.
> + *
> + * Partial pages are included. If 'end' is LLONG_MAX, pages in the range
> + * from 'start' to end of the file are inluded.
> */
> -struct page *dax_layout_busy_page(struct address_space *mapping)
> +struct page *dax_layout_busy_page_range(struct address_space *mapping,
> + loff_t start, loff_t end)
> {
> - XA_STATE(xas, &mapping->i_pages, 0);
> void *entry;
> unsigned int scanned = 0;
> struct page *page = NULL;
> + pgoff_t start_idx = start >> PAGE_SHIFT;
> + pgoff_t end_idx;
> + XA_STATE(xas, &mapping->i_pages, start_idx);
>
> /*
> * In the 'limited' case get_user_pages() for dax is disabled.
> @@ -589,6 +595,11 @@ struct page *dax_layout_busy_page(struct address_space *mapping)
> if (!dax_mapping(mapping) || !mapping_mapped(mapping))
> return NULL;
>
> + /* If end == LLONG_MAX, all pages from start to till end of file */
> + if (end == LLONG_MAX)
> + end_idx = ULONG_MAX;
> + else
> + end_idx = end >> PAGE_SHIFT;
> /*
> * If we race get_user_pages_fast() here either we'll see the
> * elevated page count in the iteration and wait, or
> @@ -596,15 +607,15 @@ struct page *dax_layout_busy_page(struct address_space *mapping)
> * against is no longer mapped in the page tables and bail to the
> * get_user_pages() slow path. The slow path is protected by
> * pte_lock() and pmd_lock(). New references are not taken without
> - * holding those locks, and unmap_mapping_range() will not zero the
> + * holding those locks, and unmap_mapping_pages() will not zero the
> * pte or pmd without holding the respective lock, so we are
> * guaranteed to either see new references or prevent new
> * references from being established.
> */
> - unmap_mapping_range(mapping, 0, 0, 0);
> + unmap_mapping_pages(mapping, start_idx, end_idx - start_idx + 1, 0);
>
> xas_lock_irq(&xas);
> - xas_for_each(&xas, entry, ULONG_MAX) {
> + xas_for_each(&xas, entry, end_idx) {
> if (WARN_ON_ONCE(!xa_is_value(entry)))
> continue;
> if (unlikely(dax_is_locked(entry)))
> @@ -625,6 +636,12 @@ struct page *dax_layout_busy_page(struct address_space *mapping)
> xas_unlock_irq(&xas);
> return page;
> }
> +EXPORT_SYMBOL_GPL(dax_layout_busy_page_range);
> +
> +struct page *dax_layout_busy_page(struct address_space *mapping)
> +{
> + return dax_layout_busy_page_range(mapping, 0, LLONG_MAX);
> +}
> EXPORT_SYMBOL_GPL(dax_layout_busy_page);
>
> static int __dax_invalidate_entry(struct address_space *mapping,
> diff --git a/include/linux/dax.h b/include/linux/dax.h
> index 6904d4e0b2e0..9016929db4c6 100644
> --- a/include/linux/dax.h
> +++ b/include/linux/dax.h
> @@ -141,6 +141,7 @@ int dax_writeback_mapping_range(struct address_space *mapping,
> struct dax_device *dax_dev, struct writeback_control *wbc);
>
> struct page *dax_layout_busy_page(struct address_space *mapping);
> +struct page *dax_layout_busy_page_range(struct address_space *mapping, loff_t start, loff_t end);
> dax_entry_t dax_lock_page(struct page *page);
> void dax_unlock_page(struct page *page, dax_entry_t cookie);
> #else
> @@ -171,6 +172,11 @@ static inline struct page *dax_layout_busy_page(struct address_space *mapping)
> return NULL;
> }
>
> +static inline struct page *dax_layout_busy_page_range(struct address_space *mapping, pgoff_t start, pgoff_t nr_pages)
> +{
> + return NULL;
> +}
> +
> static inline int dax_writeback_mapping_range(struct address_space *mapping,
> struct dax_device *dax_dev, struct writeback_control *wbc)
> {
> --
> 2.25.4
>
--
Jan Kara <[email protected]>
SUSE Labs, CR
On Thu, Aug 20, 2020 at 02:58:55PM +0200, Jan Kara wrote:
[..]
> > /**
> > - * dax_layout_busy_page - find first pinned page in @mapping
> > + * dax_layout_busy_page_range - find first pinned page in @mapping
> > * @mapping: address space to scan for a page with ref count > 1
>
> Please document additional function arguments in the kernel-doc comment.
>
> Otherwise the patch looks good so feel free to add:
>
> Reviewed-by: Jan Kara <[email protected]>
>
> after fixing this nit.
>
Hi Jan
Thanks for the review. Here is the updated patch. I also captured your
Reviewed-by.
From 3f81f769be9419ffc5a788833339ed439dbcd48e Mon Sep 17 00:00:00 2001
From: Vivek Goyal <[email protected]>
Date: Tue, 3 Mar 2020 14:58:21 -0500
Subject: [PATCH 02/20] dax: Create a range version of dax_layout_busy_page()
virtiofs device has a range of memory which is mapped into file inodes
using dax. This memory is mapped in qemu on host and maps different
sections of real file on host. Size of this memory is limited
(determined by administrator) and depending on filesystem size, we will
soon reach a situation where all the memory is in use and we need to
reclaim some.
As part of reclaim process, we will need to make sure that there are
no active references to pages (taken by get_user_pages()) on the memory
range we are trying to reclaim. I am planning to use
dax_layout_busy_page() for this. But in current form this is per inode
and scans through all the pages of the inode.
We want to reclaim only a portion of memory (say 2MB page). So we want
to make sure that only that 2MB range of pages do not have any
references (and don't want to unmap all the pages of inode).
Hence, create a range version of this function named
dax_layout_busy_page_range() which can be used to pass a range which
needs to be unmapped.
Cc: Dan Williams <[email protected]>
Cc: [email protected]
Cc: Jan Kara <[email protected]>
Cc: Vishal L Verma <[email protected]>
Cc: "Weiny, Ira" <[email protected]>
Signed-off-by: Vivek Goyal <[email protected]>
Reviewed-by: Jan Kara <[email protected]>
---
fs/dax.c | 29 +++++++++++++++++++++++------
include/linux/dax.h | 6 ++++++
2 files changed, 29 insertions(+), 6 deletions(-)
Index: redhat-linux/fs/dax.c
===================================================================
--- redhat-linux.orig/fs/dax.c 2020-08-20 14:04:41.995676669 +0000
+++ redhat-linux/fs/dax.c 2020-08-20 14:15:20.072676669 +0000
@@ -559,8 +559,11 @@ fallback:
}
/**
- * dax_layout_busy_page - find first pinned page in @mapping
+ * dax_layout_busy_page_range - find first pinned page in @mapping
* @mapping: address space to scan for a page with ref count > 1
+ * @start: Starting offset. Page containing 'start' is included.
+ * @end: End offset. Page containing 'end' is included. If 'end' is LLONG_MAX,
+ * pages from 'start' till the end of file are included.
*
* DAX requires ZONE_DEVICE mapped pages. These pages are never
* 'onlined' to the page allocator so they are considered idle when
@@ -573,12 +576,15 @@ fallback:
* to be able to run unmap_mapping_range() and subsequently not race
* mapping_mapped() becoming true.
*/
-struct page *dax_layout_busy_page(struct address_space *mapping)
+struct page *dax_layout_busy_page_range(struct address_space *mapping,
+ loff_t start, loff_t end)
{
- XA_STATE(xas, &mapping->i_pages, 0);
void *entry;
unsigned int scanned = 0;
struct page *page = NULL;
+ pgoff_t start_idx = start >> PAGE_SHIFT;
+ pgoff_t end_idx;
+ XA_STATE(xas, &mapping->i_pages, start_idx);
/*
* In the 'limited' case get_user_pages() for dax is disabled.
@@ -589,6 +595,11 @@ struct page *dax_layout_busy_page(struct
if (!dax_mapping(mapping) || !mapping_mapped(mapping))
return NULL;
+ /* If end == LLONG_MAX, all pages from start to till end of file */
+ if (end == LLONG_MAX)
+ end_idx = ULONG_MAX;
+ else
+ end_idx = end >> PAGE_SHIFT;
/*
* If we race get_user_pages_fast() here either we'll see the
* elevated page count in the iteration and wait, or
@@ -596,15 +607,15 @@ struct page *dax_layout_busy_page(struct
* against is no longer mapped in the page tables and bail to the
* get_user_pages() slow path. The slow path is protected by
* pte_lock() and pmd_lock(). New references are not taken without
- * holding those locks, and unmap_mapping_range() will not zero the
+ * holding those locks, and unmap_mapping_pages() will not zero the
* pte or pmd without holding the respective lock, so we are
* guaranteed to either see new references or prevent new
* references from being established.
*/
- unmap_mapping_range(mapping, 0, 0, 0);
+ unmap_mapping_pages(mapping, start_idx, end_idx - start_idx + 1, 0);
xas_lock_irq(&xas);
- xas_for_each(&xas, entry, ULONG_MAX) {
+ xas_for_each(&xas, entry, end_idx) {
if (WARN_ON_ONCE(!xa_is_value(entry)))
continue;
if (unlikely(dax_is_locked(entry)))
@@ -625,6 +636,12 @@ struct page *dax_layout_busy_page(struct
xas_unlock_irq(&xas);
return page;
}
+EXPORT_SYMBOL_GPL(dax_layout_busy_page_range);
+
+struct page *dax_layout_busy_page(struct address_space *mapping)
+{
+ return dax_layout_busy_page_range(mapping, 0, LLONG_MAX);
+}
EXPORT_SYMBOL_GPL(dax_layout_busy_page);
static int __dax_invalidate_entry(struct address_space *mapping,
Index: redhat-linux/include/linux/dax.h
===================================================================
--- redhat-linux.orig/include/linux/dax.h 2020-08-20 14:04:42.014676669 +0000
+++ redhat-linux/include/linux/dax.h 2020-08-20 14:04:56.867676669 +0000
@@ -141,6 +141,7 @@ int dax_writeback_mapping_range(struct a
struct dax_device *dax_dev, struct writeback_control *wbc);
struct page *dax_layout_busy_page(struct address_space *mapping);
+struct page *dax_layout_busy_page_range(struct address_space *mapping, loff_t start, loff_t end);
dax_entry_t dax_lock_page(struct page *page);
void dax_unlock_page(struct page *page, dax_entry_t cookie);
#else
@@ -170,6 +171,11 @@ static inline struct page *dax_layout_bu
{
return NULL;
}
+
+static inline struct page *dax_layout_busy_page_range(struct address_space *mapping, pgoff_t start, pgoff_t nr_pages)
+{
+ return NULL;
+}
static inline int dax_writeback_mapping_range(struct address_space *mapping,
struct dax_device *dax_dev, struct writeback_control *wbc)
On Thu, Aug 20, 2020 at 12:21 AM Vivek Goyal <[email protected]> wrote:
>
> The device communicates FUSE_SETUPMAPPING/FUSE_REMOVMAPPING alignment
> constraints via the FUST_INIT map_alignment field. Parse this field and
> ensure our DAX mappings meet the alignment constraints.
>
> We don't actually align anything differently since our mappings are
> already 2MB aligned. Just check the value when the connection is
> established. If it becomes necessary to honor arbitrary alignments in
> the future we'll have to adjust how mappings are sized.
>
> The upshot of this commit is that we can be confident that mappings will
> work even when emulating x86 on Power and similar combinations where the
> host page sizes are different.
>
> Signed-off-by: Stefan Hajnoczi <[email protected]>
> Signed-off-by: Vivek Goyal <[email protected]>
> ---
> fs/fuse/fuse_i.h | 5 ++++-
> fs/fuse/inode.c | 18 ++++++++++++++++--
> include/uapi/linux/fuse.h | 4 +++-
> 3 files changed, 23 insertions(+), 4 deletions(-)
>
> diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
> index 478c940b05b4..4a46e35222c7 100644
> --- a/fs/fuse/fuse_i.h
> +++ b/fs/fuse/fuse_i.h
> @@ -47,7 +47,10 @@
> /** Number of dentries for each connection in the control filesystem */
> #define FUSE_CTL_NUM_DENTRIES 5
>
> -/* Default memory range size, 2MB */
> +/*
> + * Default memory range size. A power of 2 so it agrees with common FUSE_INIT
> + * map_alignment values 4KB and 64KB.
> + */
> #define FUSE_DAX_SZ (2*1024*1024)
> #define FUSE_DAX_SHIFT (21)
> #define FUSE_DAX_PAGES (FUSE_DAX_SZ/PAGE_SIZE)
> diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
> index b82eb61d63cc..947abdd776ca 100644
> --- a/fs/fuse/inode.c
> +++ b/fs/fuse/inode.c
> @@ -980,9 +980,10 @@ static void process_init_reply(struct fuse_conn *fc, struct fuse_args *args,
> {
> struct fuse_init_args *ia = container_of(args, typeof(*ia), args);
> struct fuse_init_out *arg = &ia->out;
> + bool ok = true;
>
> if (error || arg->major != FUSE_KERNEL_VERSION)
> - fc->conn_error = 1;
> + ok = false;
> else {
> unsigned long ra_pages;
>
> @@ -1045,6 +1046,13 @@ static void process_init_reply(struct fuse_conn *fc, struct fuse_args *args,
> min_t(unsigned int, FUSE_MAX_MAX_PAGES,
> max_t(unsigned int, arg->max_pages, 1));
> }
> + if ((arg->flags & FUSE_MAP_ALIGNMENT) &&
> + (FUSE_DAX_SZ % (1ul << arg->map_alignment))) {
This just obfuscates "arg->map_alignment != FUSE_DAX_SHIFT".
So the intention was that userspace can ask the kernel for a
particular alignment, right?
In that case kernel can definitely succeed if the requested alignment
is smaller than the kernel provided one, no? It would also make
sense to make this a two way negotiation. I.e. send the largest
alignment (FUSE_DAX_SHIFT in this implementation) that the kernel can
provide in fuse_init_in. In that case the only error would be if
userspace ignored the given constraints.
Am I getting not getting something?
> + pr_err("FUSE: map_alignment %u incompatible"
> + " with dax mem range size %u\n",
> + arg->map_alignment, FUSE_DAX_SZ);
> + ok = false;
> + }
> } else {
> ra_pages = fc->max_read / PAGE_SIZE;
> fc->no_lock = 1;
> @@ -1060,6 +1068,11 @@ static void process_init_reply(struct fuse_conn *fc, struct fuse_args *args,
> }
> kfree(ia);
>
> + if (!ok) {
> + fc->conn_init = 0;
> + fc->conn_error = 1;
> + }
> +
> fuse_set_initialized(fc);
> wake_up_all(&fc->blocked_waitq);
> }
> @@ -1082,7 +1095,8 @@ void fuse_send_init(struct fuse_conn *fc)
> FUSE_WRITEBACK_CACHE | FUSE_NO_OPEN_SUPPORT |
> FUSE_PARALLEL_DIROPS | FUSE_HANDLE_KILLPRIV | FUSE_POSIX_ACL |
> FUSE_ABORT_ERROR | FUSE_MAX_PAGES | FUSE_CACHE_SYMLINKS |
> - FUSE_NO_OPENDIR_SUPPORT | FUSE_EXPLICIT_INVAL_DATA;
> + FUSE_NO_OPENDIR_SUPPORT | FUSE_EXPLICIT_INVAL_DATA |
> + FUSE_MAP_ALIGNMENT;
> ia->args.opcode = FUSE_INIT;
> ia->args.in_numargs = 1;
> ia->args.in_args[0].size = sizeof(ia->in);
> diff --git a/include/uapi/linux/fuse.h b/include/uapi/linux/fuse.h
> index 373cada89815..5b85819e045f 100644
> --- a/include/uapi/linux/fuse.h
> +++ b/include/uapi/linux/fuse.h
> @@ -313,7 +313,9 @@ struct fuse_file_lock {
> * FUSE_CACHE_SYMLINKS: cache READLINK responses
> * FUSE_NO_OPENDIR_SUPPORT: kernel supports zero-message opendir
> * FUSE_EXPLICIT_INVAL_DATA: only invalidate cached pages on explicit request
> - * FUSE_MAP_ALIGNMENT: map_alignment field is valid
> + * FUSE_MAP_ALIGNMENT: init_out.map_alignment contains log2(byte alignment) for
> + * foffset and moffset fields in struct
> + * fuse_setupmapping_out and fuse_removemapping_one.
fuse_setupmapping_in
Thanks,
Miklos
On Wed, Aug 26, 2020 at 04:06:35PM +0200, Miklos Szeredi wrote:
> On Thu, Aug 20, 2020 at 12:21 AM Vivek Goyal <[email protected]> wrote:
> >
> > The device communicates FUSE_SETUPMAPPING/FUSE_REMOVMAPPING alignment
> > constraints via the FUST_INIT map_alignment field. Parse this field and
> > ensure our DAX mappings meet the alignment constraints.
> >
> > We don't actually align anything differently since our mappings are
> > already 2MB aligned. Just check the value when the connection is
> > established. If it becomes necessary to honor arbitrary alignments in
> > the future we'll have to adjust how mappings are sized.
> >
> > The upshot of this commit is that we can be confident that mappings will
> > work even when emulating x86 on Power and similar combinations where the
> > host page sizes are different.
> >
> > Signed-off-by: Stefan Hajnoczi <[email protected]>
> > Signed-off-by: Vivek Goyal <[email protected]>
> > ---
> > fs/fuse/fuse_i.h | 5 ++++-
> > fs/fuse/inode.c | 18 ++++++++++++++++--
> > include/uapi/linux/fuse.h | 4 +++-
> > 3 files changed, 23 insertions(+), 4 deletions(-)
> >
> > diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
> > index 478c940b05b4..4a46e35222c7 100644
> > --- a/fs/fuse/fuse_i.h
> > +++ b/fs/fuse/fuse_i.h
> > @@ -47,7 +47,10 @@
> > /** Number of dentries for each connection in the control filesystem */
> > #define FUSE_CTL_NUM_DENTRIES 5
> >
> > -/* Default memory range size, 2MB */
> > +/*
> > + * Default memory range size. A power of 2 so it agrees with common FUSE_INIT
> > + * map_alignment values 4KB and 64KB.
> > + */
> > #define FUSE_DAX_SZ (2*1024*1024)
> > #define FUSE_DAX_SHIFT (21)
> > #define FUSE_DAX_PAGES (FUSE_DAX_SZ/PAGE_SIZE)
> > diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
> > index b82eb61d63cc..947abdd776ca 100644
> > --- a/fs/fuse/inode.c
> > +++ b/fs/fuse/inode.c
> > @@ -980,9 +980,10 @@ static void process_init_reply(struct fuse_conn *fc, struct fuse_args *args,
> > {
> > struct fuse_init_args *ia = container_of(args, typeof(*ia), args);
> > struct fuse_init_out *arg = &ia->out;
> > + bool ok = true;
> >
> > if (error || arg->major != FUSE_KERNEL_VERSION)
> > - fc->conn_error = 1;
> > + ok = false;
> > else {
> > unsigned long ra_pages;
> >
> > @@ -1045,6 +1046,13 @@ static void process_init_reply(struct fuse_conn *fc, struct fuse_args *args,
> > min_t(unsigned int, FUSE_MAX_MAX_PAGES,
> > max_t(unsigned int, arg->max_pages, 1));
> > }
> > + if ((arg->flags & FUSE_MAP_ALIGNMENT) &&
> > + (FUSE_DAX_SZ % (1ul << arg->map_alignment))) {
>
> This just obfuscates "arg->map_alignment != FUSE_DAX_SHIFT".
>
> So the intention was that userspace can ask the kernel for a
> particular alignment, right?
My understanding is that device will specify alignment for
the foffset/moffset fields in fuse_setupmapping_in/fuse_removemapping_one.
And DAX mapping can be any size meeting that alignment contraint.
>
> In that case kernel can definitely succeed if the requested alignment
> is smaller than the kernel provided one, no?
Yes. So if map_alignemnt is 64K and DAX mapping size is 2MB, that's just
fine because it meets 4K alignment contraint. Just that we can't use
4K size DAX mapping in that case.
> It would also make
> sense to make this a two way negotiation. I.e. send the largest
> alignment (FUSE_DAX_SHIFT in this implementation) that the kernel can
> provide in fuse_init_in. In that case the only error would be if
> userspace ignored the given constraints.
We could make it two way negotiation if it helps. So if we support
multiple mapping sizes in future, say 4K, 64K, 2MB, 1GB. So idea is
to send alignment of largest mapping size to device/user_space (1GB)
in this case? And that will allow device to choose an alignment
which best fits its needs?
But problem here is that sending (log2(1GB)) does not mean we support
all the alignments in that range. For example, if device selects say
256MB as minimum alignment, kernel might not support it.
So there seem to be two ways to handle this.
A.Let device be conservative and always specify the minimum aligment
it can work with and let guest kernel automatically choose a mapping
size which meets that min_alignment contraint.
B.Send all the mapping sizes supported by kernel to device and then
device chooses an alignment as it sees fit. We could probably send
a 64bit field and set a bit for every size we support as dax mapping.
If we were to go down this path, I think in that case client should
respond back with exact mapping size we should use (and not with
minimum alignment).
I thought intent behind this patch was to implement A.
Stefan/David, this patch came from you folks. What do you think?
>
> Am I getting not getting something?
>
> > + pr_err("FUSE: map_alignment %u incompatible"
> > + " with dax mem range size %u\n",
> > + arg->map_alignment, FUSE_DAX_SZ);
> > + ok = false;
> > + }
> > } else {
> > ra_pages = fc->max_read / PAGE_SIZE;
> > fc->no_lock = 1;
> > @@ -1060,6 +1068,11 @@ static void process_init_reply(struct fuse_conn *fc, struct fuse_args *args,
> > }
> > kfree(ia);
> >
> > + if (!ok) {
> > + fc->conn_init = 0;
> > + fc->conn_error = 1;
> > + }
> > +
> > fuse_set_initialized(fc);
> > wake_up_all(&fc->blocked_waitq);
> > }
> > @@ -1082,7 +1095,8 @@ void fuse_send_init(struct fuse_conn *fc)
> > FUSE_WRITEBACK_CACHE | FUSE_NO_OPEN_SUPPORT |
> > FUSE_PARALLEL_DIROPS | FUSE_HANDLE_KILLPRIV | FUSE_POSIX_ACL |
> > FUSE_ABORT_ERROR | FUSE_MAX_PAGES | FUSE_CACHE_SYMLINKS |
> > - FUSE_NO_OPENDIR_SUPPORT | FUSE_EXPLICIT_INVAL_DATA;
> > + FUSE_NO_OPENDIR_SUPPORT | FUSE_EXPLICIT_INVAL_DATA |
> > + FUSE_MAP_ALIGNMENT;
> > ia->args.opcode = FUSE_INIT;
> > ia->args.in_numargs = 1;
> > ia->args.in_args[0].size = sizeof(ia->in);
> > diff --git a/include/uapi/linux/fuse.h b/include/uapi/linux/fuse.h
> > index 373cada89815..5b85819e045f 100644
> > --- a/include/uapi/linux/fuse.h
> > +++ b/include/uapi/linux/fuse.h
> > @@ -313,7 +313,9 @@ struct fuse_file_lock {
> > * FUSE_CACHE_SYMLINKS: cache READLINK responses
> > * FUSE_NO_OPENDIR_SUPPORT: kernel supports zero-message opendir
> > * FUSE_EXPLICIT_INVAL_DATA: only invalidate cached pages on explicit request
> > - * FUSE_MAP_ALIGNMENT: map_alignment field is valid
> > + * FUSE_MAP_ALIGNMENT: init_out.map_alignment contains log2(byte alignment) for
> > + * foffset and moffset fields in struct
> > + * fuse_setupmapping_out and fuse_removemapping_one.
>
> fuse_setupmapping_in
Will fix it.
Vivek
On Wed, Aug 26, 2020 at 11:51:42AM -0400, Vivek Goyal wrote:
> On Wed, Aug 26, 2020 at 04:06:35PM +0200, Miklos Szeredi wrote:
> > On Thu, Aug 20, 2020 at 12:21 AM Vivek Goyal <[email protected]> wrote:
> > >
> > > The device communicates FUSE_SETUPMAPPING/FUSE_REMOVMAPPING alignment
> > > constraints via the FUST_INIT map_alignment field. Parse this field and
> > > ensure our DAX mappings meet the alignment constraints.
> > >
> > > We don't actually align anything differently since our mappings are
> > > already 2MB aligned. Just check the value when the connection is
> > > established. If it becomes necessary to honor arbitrary alignments in
> > > the future we'll have to adjust how mappings are sized.
> > >
> > > The upshot of this commit is that we can be confident that mappings will
> > > work even when emulating x86 on Power and similar combinations where the
> > > host page sizes are different.
> > >
> > > Signed-off-by: Stefan Hajnoczi <[email protected]>
> > > Signed-off-by: Vivek Goyal <[email protected]>
> > > ---
> > > fs/fuse/fuse_i.h | 5 ++++-
> > > fs/fuse/inode.c | 18 ++++++++++++++++--
> > > include/uapi/linux/fuse.h | 4 +++-
> > > 3 files changed, 23 insertions(+), 4 deletions(-)
> > >
> > > diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
> > > index 478c940b05b4..4a46e35222c7 100644
> > > --- a/fs/fuse/fuse_i.h
> > > +++ b/fs/fuse/fuse_i.h
> > > @@ -47,7 +47,10 @@
> > > /** Number of dentries for each connection in the control filesystem */
> > > #define FUSE_CTL_NUM_DENTRIES 5
> > >
> > > -/* Default memory range size, 2MB */
> > > +/*
> > > + * Default memory range size. A power of 2 so it agrees with common FUSE_INIT
> > > + * map_alignment values 4KB and 64KB.
> > > + */
> > > #define FUSE_DAX_SZ (2*1024*1024)
> > > #define FUSE_DAX_SHIFT (21)
> > > #define FUSE_DAX_PAGES (FUSE_DAX_SZ/PAGE_SIZE)
> > > diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
> > > index b82eb61d63cc..947abdd776ca 100644
> > > --- a/fs/fuse/inode.c
> > > +++ b/fs/fuse/inode.c
> > > @@ -980,9 +980,10 @@ static void process_init_reply(struct fuse_conn *fc, struct fuse_args *args,
> > > {
> > > struct fuse_init_args *ia = container_of(args, typeof(*ia), args);
> > > struct fuse_init_out *arg = &ia->out;
> > > + bool ok = true;
> > >
> > > if (error || arg->major != FUSE_KERNEL_VERSION)
> > > - fc->conn_error = 1;
> > > + ok = false;
> > > else {
> > > unsigned long ra_pages;
> > >
> > > @@ -1045,6 +1046,13 @@ static void process_init_reply(struct fuse_conn *fc, struct fuse_args *args,
> > > min_t(unsigned int, FUSE_MAX_MAX_PAGES,
> > > max_t(unsigned int, arg->max_pages, 1));
> > > }
> > > + if ((arg->flags & FUSE_MAP_ALIGNMENT) &&
> > > + (FUSE_DAX_SZ % (1ul << arg->map_alignment))) {
> >
> > This just obfuscates "arg->map_alignment != FUSE_DAX_SHIFT".
> >
> > So the intention was that userspace can ask the kernel for a
> > particular alignment, right?
>
> My understanding is that device will specify alignment for
> the foffset/moffset fields in fuse_setupmapping_in/fuse_removemapping_one.
> And DAX mapping can be any size meeting that alignment contraint.
>
> >
> > In that case kernel can definitely succeed if the requested alignment
> > is smaller than the kernel provided one, no?
>
> Yes. So if map_alignemnt is 64K and DAX mapping size is 2MB, that's just
> fine because it meets 4K alignment contraint. Just that we can't use
> 4K size DAX mapping in that case.
>
> > It would also make
> > sense to make this a two way negotiation. I.e. send the largest
> > alignment (FUSE_DAX_SHIFT in this implementation) that the kernel can
> > provide in fuse_init_in. In that case the only error would be if
> > userspace ignored the given constraints.
>
> We could make it two way negotiation if it helps. So if we support
> multiple mapping sizes in future, say 4K, 64K, 2MB, 1GB. So idea is
> to send alignment of largest mapping size to device/user_space (1GB)
> in this case? And that will allow device to choose an alignment
> which best fits its needs?
>
> But problem here is that sending (log2(1GB)) does not mean we support
> all the alignments in that range. For example, if device selects say
> 256MB as minimum alignment, kernel might not support it.
>
> So there seem to be two ways to handle this.
>
> A.Let device be conservative and always specify the minimum aligment
> it can work with and let guest kernel automatically choose a mapping
> size which meets that min_alignment contraint.
>
> B.Send all the mapping sizes supported by kernel to device and then
> device chooses an alignment as it sees fit. We could probably send
> a 64bit field and set a bit for every size we support as dax mapping.
> If we were to go down this path, I think in that case client should
> respond back with exact mapping size we should use (and not with
> minimum alignment).
>
> I thought intent behind this patch was to implement A.
>
> Stefan/David, this patch came from you folks. What do you think?
Yes, I agree with Vivek.
The FUSE server is telling the client the minimum alignment for
foffset/moffset. The client can map any size it likes as long as
foffset/moffset meet the alignment constraint. I can't think of a reason
to do two-way negotiation.
Stefan
* Stefan Hajnoczi ([email protected]) wrote:
> On Wed, Aug 26, 2020 at 11:51:42AM -0400, Vivek Goyal wrote:
> > On Wed, Aug 26, 2020 at 04:06:35PM +0200, Miklos Szeredi wrote:
> > > On Thu, Aug 20, 2020 at 12:21 AM Vivek Goyal <[email protected]> wrote:
> > > >
> > > > The device communicates FUSE_SETUPMAPPING/FUSE_REMOVMAPPING alignment
> > > > constraints via the FUST_INIT map_alignment field. Parse this field and
> > > > ensure our DAX mappings meet the alignment constraints.
> > > >
> > > > We don't actually align anything differently since our mappings are
> > > > already 2MB aligned. Just check the value when the connection is
> > > > established. If it becomes necessary to honor arbitrary alignments in
> > > > the future we'll have to adjust how mappings are sized.
> > > >
> > > > The upshot of this commit is that we can be confident that mappings will
> > > > work even when emulating x86 on Power and similar combinations where the
> > > > host page sizes are different.
> > > >
> > > > Signed-off-by: Stefan Hajnoczi <[email protected]>
> > > > Signed-off-by: Vivek Goyal <[email protected]>
> > > > ---
> > > > fs/fuse/fuse_i.h | 5 ++++-
> > > > fs/fuse/inode.c | 18 ++++++++++++++++--
> > > > include/uapi/linux/fuse.h | 4 +++-
> > > > 3 files changed, 23 insertions(+), 4 deletions(-)
> > > >
> > > > diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
> > > > index 478c940b05b4..4a46e35222c7 100644
> > > > --- a/fs/fuse/fuse_i.h
> > > > +++ b/fs/fuse/fuse_i.h
> > > > @@ -47,7 +47,10 @@
> > > > /** Number of dentries for each connection in the control filesystem */
> > > > #define FUSE_CTL_NUM_DENTRIES 5
> > > >
> > > > -/* Default memory range size, 2MB */
> > > > +/*
> > > > + * Default memory range size. A power of 2 so it agrees with common FUSE_INIT
> > > > + * map_alignment values 4KB and 64KB.
> > > > + */
> > > > #define FUSE_DAX_SZ (2*1024*1024)
> > > > #define FUSE_DAX_SHIFT (21)
> > > > #define FUSE_DAX_PAGES (FUSE_DAX_SZ/PAGE_SIZE)
> > > > diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
> > > > index b82eb61d63cc..947abdd776ca 100644
> > > > --- a/fs/fuse/inode.c
> > > > +++ b/fs/fuse/inode.c
> > > > @@ -980,9 +980,10 @@ static void process_init_reply(struct fuse_conn *fc, struct fuse_args *args,
> > > > {
> > > > struct fuse_init_args *ia = container_of(args, typeof(*ia), args);
> > > > struct fuse_init_out *arg = &ia->out;
> > > > + bool ok = true;
> > > >
> > > > if (error || arg->major != FUSE_KERNEL_VERSION)
> > > > - fc->conn_error = 1;
> > > > + ok = false;
> > > > else {
> > > > unsigned long ra_pages;
> > > >
> > > > @@ -1045,6 +1046,13 @@ static void process_init_reply(struct fuse_conn *fc, struct fuse_args *args,
> > > > min_t(unsigned int, FUSE_MAX_MAX_PAGES,
> > > > max_t(unsigned int, arg->max_pages, 1));
> > > > }
> > > > + if ((arg->flags & FUSE_MAP_ALIGNMENT) &&
> > > > + (FUSE_DAX_SZ % (1ul << arg->map_alignment))) {
> > >
> > > This just obfuscates "arg->map_alignment != FUSE_DAX_SHIFT".
> > >
> > > So the intention was that userspace can ask the kernel for a
> > > particular alignment, right?
> >
> > My understanding is that device will specify alignment for
> > the foffset/moffset fields in fuse_setupmapping_in/fuse_removemapping_one.
> > And DAX mapping can be any size meeting that alignment contraint.
> >
> > >
> > > In that case kernel can definitely succeed if the requested alignment
> > > is smaller than the kernel provided one, no?
> >
> > Yes. So if map_alignemnt is 64K and DAX mapping size is 2MB, that's just
> > fine because it meets 4K alignment contraint. Just that we can't use
> > 4K size DAX mapping in that case.
> >
> > > It would also make
> > > sense to make this a two way negotiation. I.e. send the largest
> > > alignment (FUSE_DAX_SHIFT in this implementation) that the kernel can
> > > provide in fuse_init_in. In that case the only error would be if
> > > userspace ignored the given constraints.
> >
> > We could make it two way negotiation if it helps. So if we support
> > multiple mapping sizes in future, say 4K, 64K, 2MB, 1GB. So idea is
> > to send alignment of largest mapping size to device/user_space (1GB)
> > in this case? And that will allow device to choose an alignment
> > which best fits its needs?
> >
> > But problem here is that sending (log2(1GB)) does not mean we support
> > all the alignments in that range. For example, if device selects say
> > 256MB as minimum alignment, kernel might not support it.
> >
> > So there seem to be two ways to handle this.
> >
> > A.Let device be conservative and always specify the minimum aligment
> > it can work with and let guest kernel automatically choose a mapping
> > size which meets that min_alignment contraint.
> >
> > B.Send all the mapping sizes supported by kernel to device and then
> > device chooses an alignment as it sees fit. We could probably send
> > a 64bit field and set a bit for every size we support as dax mapping.
> > If we were to go down this path, I think in that case client should
> > respond back with exact mapping size we should use (and not with
> > minimum alignment).
> >
> > I thought intent behind this patch was to implement A.
> >
> > Stefan/David, this patch came from you folks. What do you think?
>
> Yes, I agree with Vivek.
>
> The FUSE server is telling the client the minimum alignment for
> foffset/moffset. The client can map any size it likes as long as
> foffset/moffset meet the alignment constraint. I can't think of a reason
> to do two-way negotiation.
Agreed, because there's not much that the server can do about it if the
client would like a smaller granularity - the servers granularity might
be dictated by it's mmap/pagesize/filesystem. If the client wants a
larger granularity that's it's choice when it sends the setupmapping
calls.
Dave
> Stefan
--
Dr. David Alan Gilbert / [email protected] / Manchester, UK
On Wed, Aug 26, 2020 at 9:17 PM Dr. David Alan Gilbert
<[email protected]> wrote:
> Agreed, because there's not much that the server can do about it if the
> client would like a smaller granularity - the servers granularity might
> be dictated by it's mmap/pagesize/filesystem. If the client wants a
> larger granularity that's it's choice when it sends the setupmapping
> calls.
What bothers me is that the server now comes with the built in 2MiB
granularity (obviously much larger than actually needed).
What if at some point we'd want to reduce that somewhat in the client?
Yeah, we can't. Maybe this is not a kernel problem after all, the
proper thing would be to fix the server to actually send something
meaningful.
Thanks,
Miklos
On Wed, Aug 26, 2020 at 09:26:29PM +0200, Miklos Szeredi wrote:
> On Wed, Aug 26, 2020 at 9:17 PM Dr. David Alan Gilbert
> <[email protected]> wrote:
>
> > Agreed, because there's not much that the server can do about it if the
> > client would like a smaller granularity - the servers granularity might
> > be dictated by it's mmap/pagesize/filesystem. If the client wants a
> > larger granularity that's it's choice when it sends the setupmapping
> > calls.
>
> What bothers me is that the server now comes with the built in 2MiB
> granularity (obviously much larger than actually needed).
>
> What if at some point we'd want to reduce that somewhat in the client?
> Yeah, we can't. Maybe this is not a kernel problem after all, the
> proper thing would be to fix the server to actually send something
> meaningful.
Hi Miklos,
Current implementation of virtiofsd reports this map alignment as
PAGE_SIZE.
/* This constraint comes from mmap(2) and munmap(2) */
outarg.map_alignment = ffsl(sysconf(_SC_PAGE_SIZE)) - 1;
Which should be 4K on x86.
And that means if client wants it can drop to dax mapping size as
small as 4K and still meeting alignment constratints. Just that by
default we have chosen 2MB as of now fearing there might be too
many small mmap() calls on host and we will hit various limits.
Thanks
Vivek
On Thu, Aug 20, 2020 at 12:21 AM Vivek Goyal <[email protected]> wrote:
>
> Hi All,
>
> This is V3 of patches. I had posted version v2 version here.
Pushed to:
git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse.git#dax
Fixed a couple of minor issues, and added two patches:
1. move dax specific code from fuse core to a separate source file
2. move dax specific data, as well as allowing dax to be configured out
I think it would be cleaner to fold these back into the original
series, but for now I'm just asking for comments and testing.
Thanks,
Miklos
On Fri, Aug 28, 2020 at 04:26:55PM +0200, Miklos Szeredi wrote:
> On Thu, Aug 20, 2020 at 12:21 AM Vivek Goyal <[email protected]> wrote:
> >
> > Hi All,
> >
> > This is V3 of patches. I had posted version v2 version here.
>
> Pushed to:
>
> git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse.git#dax
>
> Fixed a couple of minor issues, and added two patches:
>
> 1. move dax specific code from fuse core to a separate source file
>
> 2. move dax specific data, as well as allowing dax to be configured out
>
> I think it would be cleaner to fold these back into the original
> series, but for now I'm just asking for comments and testing.
Thanks Miklos. I will have a look and test.
Vivek