2010-02-09 06:03:06

by Tejun Heo

Subject: [PATCH] FUSE/CUSE: implement direct mmap support

Implement FUSE direct mmap support. The server can redirect client
mmap requests to any SHMLBA aligned offset in the custom address space
attached to the fuse channel. The address space is managed by the
server using mmap/munmap(2). The SHMLBA alignment requirement is
necessary to avoid cache aliasing issues on archs with virtually
indexed caches as FUSE direct mmaps are basically shared memory
between clients and the server.

The direct mmap address space is backed by pinned kernel pages which
are allocated on the first fault either from a client or the server.
If used carelessly, this can easily waste and drain memory.
Currently, a server must have CAP_SYS_ADMIN to manage dmmap regions by
mmapping and munmapping the channel fd.

Signed-off-by: Tejun Heo <[email protected]>
---
Here's the long overdue FUSE direct mmap support. I tried several
things but this simplistic custom address space implementation turns
out to be the cleanest. It also shouldn't be too difficult to extend
it with a userland ->fault() handler if such a thing ever becomes
necessary. I'll post the library part soon.

Thanks.

fs/fuse/cuse.c | 1 +
fs/fuse/dev.c | 294 ++++++++++++++++++++++++++++++++++++++++++++++++++
fs/fuse/file.c | 209 +++++++++++++++++++++++++++++++++--
fs/fuse/fuse_i.h | 19 ++++
fs/fuse/inode.c | 1 +
include/linux/fuse.h | 32 ++++++
6 files changed, 545 insertions(+), 11 deletions(-)

diff --git a/fs/fuse/cuse.c b/fs/fuse/cuse.c
index de792dc..e5cefa6 100644
--- a/fs/fuse/cuse.c
+++ b/fs/fuse/cuse.c
@@ -181,6 +181,7 @@ static const struct file_operations cuse_frontend_fops = {
.unlocked_ioctl = cuse_file_ioctl,
.compat_ioctl = cuse_file_compat_ioctl,
.poll = fuse_file_poll,
+ .mmap = fuse_do_dmmap,
};


diff --git a/fs/fuse/dev.c b/fs/fuse/dev.c
index 51d9e33..8dbd39b 100644
--- a/fs/fuse/dev.c
+++ b/fs/fuse/dev.c
@@ -1220,6 +1220,299 @@ static int fuse_dev_fasync(int fd, struct file *file, int on)
return fasync_helper(fd, file, on, &fc->fasync);
}

+/*
+ * Direct mmap implementation.
+ *
+ * A simple custom address space is available for each fuse_conn at
+ * fc->dmmap_regions, which is managed by the server. The server can
+ * access it by mmap system calls on the fuse channel fd. mmap()
+ * creates a region, the final munmap() of the region destroys the
+ * region and overlapping regions aren't allowed.
+ *
+ * When a client invokes mmap(), the server can specify the offset the
+ * mmap request should be mapped to in the above address space. Please
+ * note that the offset adjusted by the server and the originally
+ * requested offset must share the same alignment with respect to
+ * SHMLBA to avoid cache aliasing issues on architectures with
+ * virtually indexed caches. IOW, the following should hold.
+ *
+ * REQUESTED_OFFSET & (SHMLBA - 1) == ADJUSTED_OFFSET & (SHMLBA - 1)
+ *
+ * As long as the above holds, the server is free to specify any
+ * offset allowing it to assign and share any region among arbitrary
+ * set of clients.
+ *
+ * The direct mmap address space is backed by pinned kernel pages
+ * which are allocated on the first fault either from a client or the
+ * server. If used carelessly, this can easily waste and drain
+ * memory. Currently, a server must have CAP_SYS_ADMIN to manage
+ * dmmap regions by mmapping and munmapping the channel fd.
+ *
+ * fuse_dmmap_region describes a single region in the dmmap address
+ * space.
+ */
+struct fuse_dmmap_region {
+ struct rb_node node;
+ atomic_t count; /* reference count */
+ pgoff_t pgoff; /* pgoff into dmmap AS */
+ pgoff_t nr_pages; /* number of pages */
+ struct page *pages[0]; /* pointer to allocated pages */
+};
+
+/**
+ * fuse_find_dmmap_node - find dmmap region for given page offset
+ * @fc: fuse connection to search dmmap_node from
+ * @pgoff: page offset
+ * @parentp: optional out parameter for parent
+ *
+ * Look for dmmap region which contains @pgoff in it. @parentp, if
+ * not NULL, will be filled with pointer to the immediate parent which
+ * may be NULL or the immediate previous or next node.
+ *
+ * CONTEXT:
+ * spin_lock(fc->lock)
+ *
+ * RETURNS:
+ * Always returns pointer to the rbtree slot. If matching node is
+ * found, the returned slot contains non-NULL pointer. The returned
+ * slot can be used for node insertion if it contains NULL.
+ */
+static struct rb_node **fuse_find_dmmap_node(struct fuse_conn *fc,
+ unsigned long pgoff,
+ struct rb_node **parentp)
+{
+ struct rb_node **link = &fc->dmmap_regions.rb_node;
+ struct rb_node *last = NULL;
+
+ while (*link) {
+ struct fuse_dmmap_region *region;
+
+ last = *link;
+ region = rb_entry(last, struct fuse_dmmap_region, node);
+
+ if (pgoff < region->pgoff)
+ link = &last->rb_left;
+ else if (pgoff >= region->pgoff + region->nr_pages)
+ link = &last->rb_right;
+ else
+ return link;
+ }
+
+ if (parentp)
+ *parentp = last;
+ return link;
+}
+
+static void fuse_dmmap_put_region(struct fuse_conn *fc,
+ struct fuse_dmmap_region *region)
+{
+ pgoff_t i;
+
+ if (!atomic_dec_and_test(&region->count))
+ return;
+
+ /* refcnt hit zero, unlink and free */
+ spin_lock(&fc->lock);
+ rb_erase(&region->node, &fc->dmmap_regions);
+ spin_unlock(&fc->lock);
+
+ for (i = 0; i < region->nr_pages; i++)
+ if (region->pages[i])
+ put_page(region->pages[i]);
+ kfree(region);
+}
+
+/**
+ * fuse_dmmap_find_get_page - find or create page for given page offset
+ * @fc: fuse connection of interest
+ * @pgoff: page offset
+ * @pagep: out parameter for the found page
+ *
+ * Find or create page at @pgoff in @fc, increment reference and store
+ * it to *@pagep.
+ *
+ * CONTEXT:
+ * May sleep.
+ *
+ * RETURNS:
+ * 0 on success. VM_FAULT_SIGBUS if matching dmmap region doesn't
+ * exist. VM_FAULT_OOM if allocation fails.
+ */
+int fuse_dmmap_find_get_page(struct fuse_conn *fc, pgoff_t pgoff,
+ struct page **pagep)
+{
+ struct rb_node **link;
+ struct fuse_dmmap_region *region = NULL;
+ struct page *new_page = NULL;
+ pgoff_t idx;
+
+ /* find the region and see if the page is already there */
+ spin_lock(&fc->lock);
+
+ link = fuse_find_dmmap_node(fc, pgoff, NULL);
+ if (unlikely(!*link)) {
+ spin_unlock(&fc->lock);
+ return VM_FAULT_SIGBUS;
+ }
+
+ region = rb_entry(*link, struct fuse_dmmap_region, node);
+ idx = pgoff - region->pgoff;
+ *pagep = region->pages[idx];
+ if (*pagep)
+ get_page(*pagep);
+ else
+ atomic_inc(&region->count);
+
+ spin_unlock(&fc->lock);
+
+ /*
+ * At this point, we're holding a reference to either the
+ * *pagep or region. If *pagep, we're done.
+ */
+ if (*pagep)
+ return 0;
+
+ /* need to allocate and install a new page */
+ new_page = alloc_page(GFP_HIGHUSER | __GFP_ZERO);
+ if (!new_page) {
+ fuse_dmmap_put_region(fc, region);
+ return VM_FAULT_OOM;
+ }
+
+ /* try to install, check whether someone else already did it */
+ spin_lock(&fc->lock);
+
+ *pagep = region->pages[idx];
+ if (!*pagep) {
+ *pagep = region->pages[idx] = new_page;
+ new_page = NULL;
+ }
+ get_page(*pagep);
+
+ spin_unlock(&fc->lock);
+
+ fuse_dmmap_put_region(fc, region);
+ if (new_page)
+ put_page(new_page);
+
+ return 0;
+}
+
+static void fuse_dev_vm_open(struct vm_area_struct *vma)
+{
+ struct fuse_dmmap_region *region = vma->vm_private_data;
+
+ atomic_inc(&region->count);
+}
+
+static void fuse_dev_vm_close(struct vm_area_struct *vma)
+{
+ struct fuse_dmmap_region *region = vma->vm_private_data;
+ struct fuse_conn *fc = fuse_get_conn(vma->vm_file);
+
+ BUG_ON(!fc);
+ fuse_dmmap_put_region(fc, region);
+}
+
+static int fuse_dev_vm_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
+{
+ struct fuse_conn *fc = fuse_get_conn(vma->vm_file);
+
+ if (!fc)
+ return VM_FAULT_SIGBUS;
+
+ return fuse_dmmap_find_get_page(fc, vmf->pgoff, &vmf->page);
+}
+
+static const struct vm_operations_struct fuse_dev_vm_ops = {
+ .open = fuse_dev_vm_open,
+ .close = fuse_dev_vm_close,
+ .fault = fuse_dev_vm_fault,
+};
+
+static int fuse_dev_mmap(struct file *file, struct vm_area_struct *vma)
+{
+ struct fuse_conn *fc = fuse_get_conn(file);
+ struct fuse_dmmap_region *region = NULL, *overlap = NULL;
+ struct rb_node **link, *parent_link;
+ pgoff_t nr_pages = (vma->vm_end - vma->vm_start) >> PAGE_SHIFT;
+ int ret;
+
+ /* server side dmmap will consume pinned pages, allow only root */
+ ret = -EPERM;
+ if (!fc || !capable(CAP_SYS_ADMIN))
+ goto out;
+
+ /* alloc and init region */
+ ret = -ENOMEM;
+ region = kzalloc(sizeof(*region) + sizeof(region->pages[0]) * nr_pages,
+ GFP_KERNEL);
+ if (!region)
+ goto out;
+
+ atomic_set(&region->count, 1);
+ region->pgoff = vma->vm_pgoff;
+ region->nr_pages = nr_pages;
+
+ /* check for overlapping regions and insert it */
+ spin_lock(&fc->lock);
+
+ link = fuse_find_dmmap_node(fc, region->pgoff, &parent_link);
+
+ if (*link) {
+ /*
+ * This covers coinciding starting pgoff and the prev
+ * region going over this one.
+ */
+ overlap = rb_entry(*link, struct fuse_dmmap_region, node);
+ } else if (parent_link) {
+ struct fuse_dmmap_region *parent, *next = NULL;
+
+ /* determine next */
+ parent = rb_entry(parent_link, struct fuse_dmmap_region, node);
+ if (parent->pgoff < region->pgoff) {
+ struct rb_node *next_link = rb_next(parent_link);
+ if (next_link)
+ next = rb_entry(next_link,
+ struct fuse_dmmap_region, node);
+ } else
+ next = parent;
+
+ /* and see whether the new region goes over the next one */
+ if (next && region->pgoff + nr_pages > next->pgoff)
+ overlap = next;
+ }
+
+ if (overlap) {
+ printk("FUSE: server %u tries to dmmap %lu-%lu which overlaps "
+ "with %lu-%lu\n",
+ task_pid_nr(current),
+ (unsigned long)region->pgoff,
+ (unsigned long)(region->pgoff + nr_pages - 1),
+ (unsigned long)overlap->pgoff,
+ (unsigned long)(overlap->pgoff + overlap->nr_pages - 1));
+ ret = -EINVAL;
+ goto out_unlock;
+ }
+
+ /* yay, nothing overlaps, link region and initialize vma */
+ rb_link_node(&region->node, parent_link, link);
+ rb_insert_color(&region->node, &fc->dmmap_regions);
+
+ vma->vm_flags |= VM_DONTEXPAND;
+ vma->vm_ops = &fuse_dev_vm_ops;
+ vma->vm_private_data = region;
+
+ region = NULL;
+ ret = 0;
+
+out_unlock:
+ spin_unlock(&fc->lock);
+out:
+ kfree(region);
+ return ret;
+}
+
const struct file_operations fuse_dev_operations = {
.owner = THIS_MODULE,
.llseek = no_llseek,
@@ -1230,6 +1523,7 @@ const struct file_operations fuse_dev_operations = {
.poll = fuse_dev_poll,
.release = fuse_dev_release,
.fasync = fuse_dev_fasync,
+ .mmap = fuse_dev_mmap,
};
EXPORT_SYMBOL_GPL(fuse_dev_operations);

diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index a9f5e13..8c5988c 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -13,6 +13,8 @@
#include <linux/kernel.h>
#include <linux/sched.h>
#include <linux/module.h>
+#include <linux/mman.h>
+#include <asm/shmparam.h>

static const struct file_operations fuse_direct_io_file_operations;

@@ -1344,17 +1346,6 @@ static int fuse_file_mmap(struct file *file, struct vm_area_struct *vma)
return 0;
}

-static int fuse_direct_mmap(struct file *file, struct vm_area_struct *vma)
-{
- /* Can't provide the coherency needed for MAP_SHARED */
- if (vma->vm_flags & VM_MAYSHARE)
- return -ENODEV;
-
- invalidate_inode_pages2(file->f_mapping);
-
- return generic_file_mmap(file, vma);
-}
-
static int convert_fuse_file_lock(const struct fuse_file_lock *ffl,
struct file_lock *fl)
{
@@ -1974,6 +1965,202 @@ int fuse_notify_poll_wakeup(struct fuse_conn *fc,
return 0;
}

+/*
+ * For details on dmmap implementation, please read "Direct mmap
+ * implementation" in dev.c.
+ *
+ * fuse_dmmap_vm represents the result of a single mmap() call, which
+ * can be shared by multiple client vmas created by forking.
+ * fuse_dmmap_vm maintains the original length so that it can be used
+ * to notify the server of the final put of this area. This is
+ * necessary because vmas can be shrunk using mremap().
+ */
+struct fuse_dmmap_vm {
+ atomic_t count; /* reference count */
+ u64 mh; /* unique dmmap id */
+ size_t len;
+};
+
+static void fuse_dmmap_vm_open(struct vm_area_struct *vma)
+{
+ struct fuse_dmmap_vm *fdvm = vma->vm_private_data;
+
+ /* vma copied */
+ atomic_inc(&fdvm->count);
+}
+
+static void fuse_dmmap_vm_close(struct vm_area_struct *vma)
+{
+ struct fuse_dmmap_vm *fdvm = vma->vm_private_data;
+ struct fuse_file *ff = vma->vm_file->private_data;
+ struct fuse_conn *fc = ff->fc;
+ struct fuse_req *req;
+ struct fuse_munmap_in *inarg;
+
+ if (!atomic_dec_and_test(&fdvm->count))
+ return;
+ /*
+ * Notify server that the mmap region has been unmapped.
+ * Failing this might lead to resource leak in server, don't
+ * fail.
+ */
+ req = fuse_get_req_nofail(fc, vma->vm_file);
+ inarg = &req->misc.munmap_in;
+
+ inarg->fh = ff->fh;
+ inarg->mh = fdvm->mh;
+ inarg->len = fdvm->len;
+ inarg->offset = vma->vm_pgoff << PAGE_SHIFT;
+
+ req->in.h.opcode = FUSE_MUNMAP;
+ req->in.h.nodeid = ff->nodeid;
+ req->in.numargs = 1;
+ req->in.args[0].size = sizeof(*inarg);
+ req->in.args[0].value = inarg;
+
+ fuse_request_send_noreply(fc, req);
+ kfree(fdvm);
+}
+
+static int fuse_dmmap_vm_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
+{
+ struct fuse_file *ff = vma->vm_file->private_data;
+ struct fuse_conn *fc = ff->fc;
+
+ return fuse_dmmap_find_get_page(fc, vmf->pgoff, &vmf->page);
+}
+
+static const struct vm_operations_struct fuse_dmmap_vm_ops = {
+ .open = fuse_dmmap_vm_open,
+ .close = fuse_dmmap_vm_close,
+ .fault = fuse_dmmap_vm_fault,
+};
+
+int fuse_do_dmmap(struct file *file, struct vm_area_struct *vma)
+{
+ struct fuse_file *ff = file->private_data;
+ struct fuse_conn *fc = ff->fc;
+ struct fuse_dmmap_vm *fdvm = NULL;
+ struct fuse_req *req = NULL;
+ struct fuse_mmap_in inarg;
+ struct fuse_mmap_out outarg;
+ int err;
+
+ if (fc->no_dmmap)
+ return -ENOSYS;
+
+ /* allocate and initialize fdvm */
+ err = -ENOMEM;
+ fdvm = kzalloc(sizeof(*fdvm), GFP_KERNEL);
+ if (!fdvm)
+ goto fail;
+
+ atomic_set(&fdvm->count, 1);
+ fdvm->len = vma->vm_end - vma->vm_start;
+
+ spin_lock(&fc->lock);
+ fdvm->mh = ++fc->dmmapctr;
+ spin_unlock(&fc->lock);
+
+ req = fuse_get_req(fc);
+ if (IS_ERR(req)) {
+ err = PTR_ERR(req);
+ goto fail;
+ }
+
+ /* ask server whether this mmap is okay and what the offset should be */
+ memset(&inarg, 0, sizeof(inarg));
+ inarg.fh = ff->fh;
+ inarg.mh = fdvm->mh;
+ inarg.addr = vma->vm_start;
+ inarg.len = fdvm->len;
+ inarg.prot = ((vma->vm_flags & VM_READ) ? PROT_READ : 0) |
+ ((vma->vm_flags & VM_WRITE) ? PROT_WRITE : 0) |
+ ((vma->vm_flags & VM_EXEC) ? PROT_EXEC : 0);
+ inarg.flags = ((vma->vm_flags & VM_GROWSDOWN) ? MAP_GROWSDOWN : 0) |
+ ((vma->vm_flags & VM_DENYWRITE) ? MAP_DENYWRITE : 0) |
+ ((vma->vm_flags & VM_EXECUTABLE) ? MAP_EXECUTABLE : 0) |
+ ((vma->vm_flags & VM_LOCKED) ? MAP_LOCKED : 0);
+ inarg.offset = (loff_t)vma->vm_pgoff << PAGE_SHIFT;
+
+ req->in.h.opcode = FUSE_MMAP;
+ req->in.h.nodeid = ff->nodeid;
+ req->in.numargs = 1;
+ req->in.args[0].size = sizeof(inarg);
+ req->in.args[0].value = &inarg;
+ req->out.numargs = 1;
+ req->out.args[0].size = sizeof(outarg);
+ req->out.args[0].value = &outarg;
+
+ fuse_request_send(fc, req);
+ err = req->out.h.error;
+ if (err) {
+ if (err == -ENOSYS)
+ fc->no_dmmap = 1;
+ goto fail;
+ }
+
+ /*
+ * Make sure the returned offset has the same SHMLBA alignment
+ * as the requested one; otherwise, virtual cache aliasing may
+ * happen. This check also makes sure that the offset is
+ * page aligned.
+ */
+ if ((outarg.offset & (SHMLBA - 1)) != (inarg.offset & (SHMLBA - 1))) {
+ if (!fc->dmmap_misalign_warned) {
+ pr_err("FUSE: mmap offset 0x%lx returned by server "
+ "is misaligned, failing mmap\n",
+ (unsigned long)outarg.offset);
+ fc->dmmap_misalign_warned = 1;
+ }
+ err = -EINVAL;
+ goto fail;
+ }
+
+ /* initialize @vma accordingly */
+ vma->vm_pgoff = outarg.offset >> PAGE_SHIFT;
+ vma->vm_ops = &fuse_dmmap_vm_ops;
+ vma->vm_private_data = fdvm;
+
+ vma->vm_flags |= VM_DONTEXPAND; /* disallow expansion for now */
+ if (outarg.flags & FUSE_MMAP_DONT_COPY)
+ vma->vm_flags |= VM_DONTCOPY;
+
+ fuse_put_request(fc, req);
+ return 0;
+
+fail:
+ kfree(fdvm);
+ if (req)
+ fuse_put_request(fc, req);
+ return err;
+}
+EXPORT_SYMBOL_GPL(fuse_do_dmmap);
+
+static int fuse_direct_mmap(struct file *file, struct vm_area_struct *vma)
+{
+ struct inode *inode = file->f_dentry->d_inode;
+ int ret;
+
+ if (is_bad_inode(inode))
+ return -EIO;
+
+ ret = fuse_do_dmmap(file, vma);
+ if (ret != -ENOSYS)
+ return ret;
+
+ /*
+ * Fall back to generic mmap.
+ * Generic mmap can't provide the coherency needed for MAP_SHARED.
+ */
+ if (vma->vm_flags & VM_MAYSHARE)
+ return -ENODEV;
+
+ invalidate_inode_pages2(file->f_mapping);
+
+ return generic_file_mmap(file, vma);
+}
+
static const struct file_operations fuse_file_operations = {
.llseek = fuse_file_llseek,
.read = do_sync_read,
diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
index 01cc462..32f01dd 100644
--- a/fs/fuse/fuse_i.h
+++ b/fs/fuse/fuse_i.h
@@ -21,6 +21,7 @@
#include <linux/rwsem.h>
#include <linux/rbtree.h>
#include <linux/poll.h>
+#include <linux/radix-tree.h>

/** Max number of pages that can be used in a single read request */
#define FUSE_MAX_PAGES_PER_REQ 32
@@ -270,6 +271,7 @@ struct fuse_req {
struct fuse_write_out out;
} write;
struct fuse_lk_in lk_in;
+ struct fuse_munmap_in munmap_in;
} misc;

/** page vector */
@@ -447,6 +449,12 @@ struct fuse_conn {
/** Is poll not implemented by fs? */
unsigned no_poll:1;

+ /** Is direct mmap not implemented by fs? */
+ unsigned no_dmmap:1;
+
+ /** Already warned unaligned direct mmap */
+ unsigned dmmap_misalign_warned:1;
+
/** Do multi-page cached writes */
unsigned big_writes:1;

@@ -494,6 +502,12 @@ struct fuse_conn {

/** Read/write semaphore to hold when accessing sb. */
struct rw_semaphore killsb;
+
+ /** dmmap unique id */
+ u64 dmmapctr;
+
+ /** Direct mmap regions */
+ struct rb_root dmmap_regions;
};

static inline struct fuse_conn *get_fuse_conn_super(struct super_block *sb)
@@ -746,4 +760,9 @@ long fuse_do_ioctl(struct file *file, unsigned int cmd, unsigned long arg,
unsigned fuse_file_poll(struct file *file, poll_table *wait);
int fuse_dev_release(struct inode *inode, struct file *file);

+int fuse_dmmap_find_get_page(struct fuse_conn *fc, pgoff_t pgoff,
+ struct page **pagep);
+
+int fuse_do_dmmap(struct file *file, struct vm_area_struct *vma);
+
#endif /* _FS_FUSE_I_H */
diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
index 1a822ce..8557b59 100644
--- a/fs/fuse/inode.c
+++ b/fs/fuse/inode.c
@@ -541,6 +541,7 @@ void fuse_conn_init(struct fuse_conn *fc)
fc->blocked = 1;
fc->attr_version = 1;
get_random_bytes(&fc->scramble_key, sizeof(fc->scramble_key));
+ fc->dmmap_regions = RB_ROOT;
}
EXPORT_SYMBOL_GPL(fuse_conn_init);

diff --git a/include/linux/fuse.h b/include/linux/fuse.h
index 3e2925a..44bf759 100644
--- a/include/linux/fuse.h
+++ b/include/linux/fuse.h
@@ -209,6 +209,13 @@ struct fuse_file_lock {
*/
#define FUSE_POLL_SCHEDULE_NOTIFY (1 << 0)

+/**
+ * Mmap flags
+ *
+ * FUSE_MMAP_DONT_COPY: don't copy the region on fork
+ */
+#define FUSE_MMAP_DONT_COPY (1 << 0)
+
enum fuse_opcode {
FUSE_LOOKUP = 1,
FUSE_FORGET = 2, /* no reply */
@@ -248,6 +255,8 @@ enum fuse_opcode {
FUSE_DESTROY = 38,
FUSE_IOCTL = 39,
FUSE_POLL = 40,
+ FUSE_MMAP = 41,
+ FUSE_MUNMAP = 42,

/* CUSE specific operations */
CUSE_INIT = 4096,
@@ -523,6 +532,29 @@ struct fuse_notify_poll_wakeup_out {
__u64 kh;
};

+struct fuse_mmap_in {
+ __u64 fh;
+ __u64 mh;
+ __u64 addr;
+ __u64 len;
+ __u32 prot;
+ __u32 flags;
+ __u64 offset;
+};
+
+struct fuse_mmap_out {
+ __u64 offset;
+ __u32 flags;
+ __u32 padding;
+};
+
+struct fuse_munmap_in {
+ __u64 fh;
+ __u64 mh;
+ __u64 len;
+ __u64 offset;
+};
+
struct fuse_in_header {
__u32 len;
__u32 opcode;
--
1.6.4.2


2010-02-09 14:59:59

by Miklos Szeredi

Subject: Re: [fuse-devel] [PATCH] FUSE/CUSE: implement direct mmap support

On Tue, 09 Feb 2010, Tejun Heo wrote:
> Implement FUSE direct mmap support. The server can redirect client
> mmap requests to any SHMLBA aligned offset in the custom address space
> attached to the fuse channel. The address space is managed by the
> server using mmap/munmap(2). The SHMLBA alignment requirement is
> necessary to avoid cache aliasing issues on archs with virtually
> indexed caches as FUSE direct mmaps are basically shared memory
> between clients and the server.
>
> The direct mmap address space is backed by pinned kernel pages which
> are allocated on the first fault either from a client or the server.
> If used carelessly, this can easily waste and drain memory.
> Currently, a server must have CAP_SYS_ADMIN to manage dmmap regions by
> mmapping and munmapping the channel fd.

Okay, I'm a bit confused about these offsets.

Client asks to map a file at an offset. Server receives offset, may
change it (but only by multiple of SHMLBA) then returns it to the
kernel. The returned offset globally identifies not only the mapped
region but the page within the region. Sounds neat.
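
The alignment rule described above can be sketched as a small check.
This is only an illustration: dmmap_offset_ok() is a hypothetical
helper name, not something in the patch, and SHMLBA is passed in as a
parameter (assumed to be a power of two) rather than taken from the
architecture headers.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper, not part of the patch: returns true when the
 * offset chosen by the server sits at the same position inside an
 * SHMLBA-sized window as the offset the client asked for.
 * shmlba is assumed to be a power of two.
 */
static bool dmmap_offset_ok(uint64_t requested, uint64_t adjusted,
			    uint64_t shmlba)
{
	assert(shmlba != 0 && (shmlba & (shmlba - 1)) == 0);
	return (requested & (shmlba - 1)) == (adjusted & (shmlba - 1));
}
```

With a 16 KiB SHMLBA, a request at 0x1000 may be redirected to 0x5000
(both sit 0x1000 into their window) but not to 0x2000.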

But then fuse_do_dmmap() goes and changes vma->vm_pgoff, which will
show up in /proc/PID/maps for example, which is really not nice.

Can't this page ID rather be put in vma->vm_private_data?

Also can we take this page ID abstraction a step further, and say that
the ID has nothing to do with the original offset, the only
requirement is that it'd globally identify all direct mapped pages.

And the coherency requirements would be satisfied by the
fuse_dev_mmap() code. Haven't looked into what that would take, but
it sounds doable.

That would take the load off userspace having to search for a suitable
offset using some magic architecture dependent constant.
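
For reference, the userspace search mentioned above need not be a
search at all. A minimal sketch, assuming the server keeps its region
start SHMLBA aligned: dmmap_pick_offset() and its parameters are
hypothetical names, and SHMLBA is again passed in as a power-of-two
parameter.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical server-side helper: given an SHMLBA-aligned base in the
 * dmmap address space and the offset the client requested, return an
 * adjusted offset that keeps the client's position within the SHMLBA
 * window. The region must be large enough to cover the extra
 * (requested & (shmlba - 1)) bytes in front of the mapping.
 */
static uint64_t dmmap_pick_offset(uint64_t region_base, uint64_t requested,
				  uint64_t shmlba)
{
	assert(shmlba != 0 && (shmlba & (shmlba - 1)) == 0);
	assert((region_base & (shmlba - 1)) == 0);
	return region_base + (requested & (shmlba - 1));
}
```

The cost of this approach is up to SHMLBA minus one page of slack per
mapping, which matters because the backing pages are pinned.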

Otherwise I like the interface, it can be extended with fault and
reclaim requests and server side requests to load and save the map
contents as well.

Thanks,
Miklos

2010-02-10 08:24:17

by Goswin von Brederlow

Subject: Re: [fuse-devel] [PATCH] FUSE/CUSE: implement direct mmap support

Tejun Heo <[email protected]> writes:

> Implement FUSE direct mmap support. The server can redirect client
> mmap requests to any SHMLBA aligned offset in the custom address space
> attached to the fuse channel. The address space is managed by the
> server using mmap/munmap(2). The SHMLBA alignment requirement is
> necessary to avoid cache aliasing issues on archs with virtually
> indexed caches as FUSE direct mmaps are basically shared memory
> between clients and the server.
>
> The direct mmap address space is backed by pinned kernel pages which
> are allocated on the first fault either from a client or the server.
> If used carelessly, this can easily waste and drain memory.
> Currently, a server must have CAP_SYS_ADMIN to manage dmmap regions by
> mmapping and munmapping the channel fd.

Does that mean that for example in unionfs-fuse when a user wants to
mmap a file I can just mmap the actual underlying file from the real
filesystem and any read/write access would then shortcut fuse and go
directly to the real file?

MfG
Goswin

2010-02-10 08:41:06

by Miklos Szeredi

Subject: Re: [fuse-devel] [PATCH] FUSE/CUSE: implement direct mmap support

On Wed, 10 Feb 2010, Goswin von Brederlow wrote:
> Tejun Heo <[email protected]> writes:
>
> > Implement FUSE direct mmap support. The server can redirect client
> > mmap requests to any SHMLBA aligned offset in the custom address space
> > attached to the fuse channel. The address space is managed by the
> > server using mmap/munmap(2). The SHMLBA alignment requirement is
> > necessary to avoid cache aliasing issues on archs with virtually
> > indexed caches as FUSE direct mmaps are basically shared memory
> > between clients and the server.
> >
> > The direct mmap address space is backed by pinned kernel pages which
> > are allocated on the first fault either from a client or the server.
> > If used carelessly, this can easily waste and drain memory.
> > Currently, a server must have CAP_SYS_ADMIN to manage dmmap regions by
> > mmapping and munmapping the channel fd.
>
> Does that mean that for example in unionfs-fuse when a user wants to
> mmap a file I can just mmap the actual underlying file from the real
> filesystem and any read/write access would then shortcut fuse and go
> directly to the real file?

No. It is the /dev/fuse file descriptor that is being mapped to give
access to the "direct mmaped" memory.
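
A rough sketch of what the server side might look like, assuming the
server already holds the open channel descriptor. The helper name
dmmap_region_create() is hypothetical, and the protection and sharing
flags are an assumption; the patch itself only requires CAP_SYS_ADMIN
and a non-overlapping range.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/types.h>

/*
 * Hypothetical helper: create a dmmap region by mapping the fuse
 * channel fd. region_off is the offset inside the dmmap address space,
 * not an offset into any file. Returns NULL on failure.
 */
static void *dmmap_region_create(int fuse_fd, size_t len, off_t region_off)
{
	void *p;

	assert(len > 0);
	p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED,
		 fuse_fd, region_off);
	return p == MAP_FAILED ? NULL : p;
}
```

The region lives until the server's final munmap() of it, so the
returned pointer and length need to be kept around.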

Thanks,
Miklos

2010-02-10 09:40:42

by Paul Schutte

Subject: Re: [fuse-devel] [PATCH] FUSE/CUSE: implement direct mmap support

Hi.

I am just sharing my thoughts on the matter.

Maybe one can implement a separate "fuse sync" ioctl which can then be
called by a user space program say "fusersync" to sync all the fuse
filesystems you are allowed to.

I implemented sync to my filesystem by means of an extended attribute.
I now do "setfattr -nsync /mountpoint" to get the syncing done.

This has the drawback that you can not easily put it in a script because
you need to know the mount point. I know one can write smart scripts to
figure out the mountpoint, but it would be nice if you can just say
"fusersync" and all the fuse filesystems that you are allowed to will
just sync.

One can then maybe rename the system wide sync to say "sync.system" and
put a script in its place which calls "fusersync;sync.system"

Hopefully this idea might be useful.

Regards
Paul

Goswin von Brederlow wrote:
> Tejun Heo <[email protected]> writes:
>
>
>> Implement FUSE direct mmap support. The server can redirect client
>> mmap requests to any SHMLBA aligned offset in the custom address space
>> attached to the fuse channel. The address space is managed by the
>> server using mmap/munmap(2). The SHMLBA alignment requirement is
>> necessary to avoid cache aliasing issues on archs with virtually
>> indexed caches as FUSE direct mmaps are basically shared memory
>> between clients and the server.
>>
>> The direct mmap address space is backed by pinned kernel pages which
>> are allocated on the first fault either from a client or the server.
>> If used carelessly, this can easily waste and drain memory.
>> Currently, a server must have CAP_SYS_ADMIN to manage dmmap regions by
>> mmapping and munmapping the channel fd.
>>
>
> Does that mean that for example in unionfs-fuse when a user wants to
> mmap a file I can just mmap the actual underlying file from the real
> filesystem and any read/write access would then shortcut fuse and go
> directly to the real file?
>
> MfG
> Goswin
>
> ------------------------------------------------------------------------------
> SOLARIS 10 is the OS for Data Centers - provides features such as DTrace,
> Predictive Self Healing and Award Winning ZFS. Get Solaris 10 NOW
> http://p.sf.net/sfu/solaris-dev2dev
> _______________________________________________
> fuse-devel mailing list
> [email protected]
> https://lists.sourceforge.net/lists/listinfo/fuse-devel
>

2010-02-10 10:02:38

by Paul Schutte

Subject: Re: [fuse-devel] [PATCH] FUSE/CUSE: implement direct mmap support

Sorry, wrong thread!

Paul Schutte wrote:
> Hi.
>
> I am just sharing my thoughts on the matter.
>
> Maybe one can implement a separate "fuse sync" ioctl which can then be
> called by a user space program say "fusersync" to sync all the fuse
> filesystems you are allowed to.
>
> I implemented sync to my filesystem by means of an extended attribute.
> I now do "setfattr -nsync /mountpoint" to get the syncing done.
>
> This has the drawback that you can not easily put it in a script because
> you need to know the mount point. I know one can write smart scripts to
> figure out the mountpoint, but it would be nice if you can just say
> "fusersync" and all the fuse filesystems that you are allowed to will
> just sync.
>
> One can then maybe rename the system wide sync to say "sync.system" and
> put a script in its place which calls "fusersync;sync.system"
>
> Hopefully this idea might be useful.
>
> Regards
> Paul
>
> Goswin von Brederlow wrote:
>
>> Tejun Heo <[email protected]> writes:
>>
>>
>>
>>> Implement FUSE direct mmap support. The server can redirect client
>>> mmap requests to any SHMLBA aligned offset in the custom address space
>>> attached to the fuse channel. The address space is managed by the
>>> server using mmap/munmap(2). The SHMLBA alignment requirement is
>>> necessary to avoid cache aliasing issues on archs with virtually
>>> indexed caches as FUSE direct mmaps are basically shared memory
>>> between clients and the server.
>>>
>>> The direct mmap address space is backed by pinned kernel pages which
>>> are allocated on the first fault either from a client or the server.
>>> If used carelessly, this can easily waste and drain memory.
>>> Currently, a server must have CAP_SYS_ADMIN to manage dmmap regions by
>>> mmapping and munmapping the channel fd.
>>>
>>>
>> Does that mean that for example in unionfs-fuse when a user wants to
>> mmap a file I can just mmap the actual underlying file from the real
>> filesystem and any read/write access would then shortcut fuse and go
>> directly to the real file?
>>
>> MfG
>> Goswin

2010-02-10 11:15:57

by Tejun Heo

Subject: Re: [fuse-devel] [PATCH] FUSE/CUSE: implement direct mmap support

Hello, Miklos.

On 02/09/2010 11:59 PM, Miklos Szeredi wrote:
> On Tue, 09 Feb 2010, Tejun Heo wrote:
> Okay, I'm a bit confused about these offsets.
>
> Client asks to map a file at an offset. Server receives offset, may
> change it (but only by multiple of SHMLBA) then returns it to the
> kernel. The returned offset globally identifies not only the mapped
> region but the page within the region. Sounds neat.
>
> But then fuse_do_dmmap() goes and changes vma->vm_pgoff, which will
> show up in /proc/PID/maps for example, which is really not nice.
>
> Can't this page ID rather be put in vma->vm_private_data?

Yeah, sure, it can be put anywhere. I just never thought about it
being visible under /proc. Will move it inside fuse_dmmap_vm.

> Also can we take this page ID abstraction a step further, and say that
> the ID has nothing to do with the original offset, the only
> requirement is that it'd globally identify all direct mapped pages.
>
> And the coherency requirements would be satisfied by the
> fuse_dev_mmap() code. Haven't looked into what that would take, but
> it sounds doable.

The coherency requirement is not really between the address and offset
but between virtual addresses in different maps which may end up sharing the same
physical page. On an architecture with virtually indexed but
physically tagged caches, virtual addresses which map to the same page
must end up hitting the same cache group; otherwise, the processor
wouldn't be able to determine whether the two addresses point to the
same page (as only the tags inside the same cache group are matched
against the physical address).
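
To make the cache group idea concrete, here is a simplified model. It
assumes the aliasing window is exactly SHMLBA bytes and that both
SHMLBA and the page size are powers of two; cache_colour() is a
hypothetical name and real cache geometry is more involved.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Simplified model: the "colour" of a virtual address is the index of
 * its page within an SHMLBA-sized window. Two virtual addresses that
 * map the same physical page are alias-free in this model only if
 * their colours are equal.
 */
static uint64_t cache_colour(uint64_t vaddr, uint64_t shmlba,
			     uint64_t page_size)
{
	assert(shmlba != 0 && (shmlba & (shmlba - 1)) == 0);
	assert(page_size != 0 && (page_size & (page_size - 1)) == 0);
	assert(page_size <= shmlba);
	return (vaddr & (shmlba - 1)) / page_size;
}
```

With a 16 KiB SHMLBA and 4 KiB pages there are four colours; 0x10001000
and 0x20005000 share colour 1, while 0x20002000 has colour 2.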

This is achieved by enforcing address to offset alignment. IOW, all
maps are forced to be SHMLBA aligned to offset so that all maps are
SHMLBA aligned to each other. And without knowing which maps are
gonna end up sharing which pages beforehand, enforcing SHMLBA
alignment against something is pretty much the only way to achieve
this and that's the reason why something as low level as processor
cache organization is exported to userland as SHMLBA for shared memory
in the first place.
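
To make the aliasing condition concrete, here is a minimal sketch. It
assumes a hypothetical aliasing span of 16 KiB (ALIAS_SPAN is made up for
illustration; the real value is architecture specific and is what SHMLBA
exports) and page-aligned addresses. Two mappings of the same physical
page land in the same cache group only if their virtual addresses are
congruent modulo that span.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical aliasing span for illustration; real code would use SHMLBA. */
#define ALIAS_SPAN 0x4000u	/* 16 KiB */

/* Cache colour: the virtual address bits that select the cache group. */
static inline uintptr_t cache_colour(uintptr_t vaddr)
{
	return vaddr & (ALIAS_SPAN - 1);
}

/*
 * Two virtual mappings of the same physical page are alias-free only if
 * both addresses have the same colour, i.e. they are congruent modulo
 * the aliasing span.
 */
static inline bool same_colour(uintptr_t va1, uintptr_t va2)
{
	return cache_colour(va1) == cache_colour(va2);
}
```

With these assumed numbers, 0x10000 and 0x24000 share a colour while
0x10000 and 0x11000 do not.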

FUSE dmmap has exactly the same problem as it basically is a shared
memory implementation with slightly different interface. Unless we
can know which maps are gonna be shared beforehand, and we can't know
this either between the client and server or between clients, the
only way to make those shared mappings fall into the same cache slot
is aligning them to address space offsets.

So, I don't think it's feasible to do the address matching from inside
the kernel without a lot of convolution. The only way to do it would
be adjusting the mapping addresses but this has at least two problems
- 1. the VM layer isn't made that way and virtual address is
determined by vm generic and arch code before ->mmap() is invoked and
can't be changed after that. 2. the SHMLBA alignment is good in the
sense that it gives the userland something to rely on when determining
mapping address for address hint or fixed mapping. I don't think it
would be wise to break such assumptions which hold for *ALL* existing
memory mappings.

Thanks.

--
tejun

2010-02-10 11:29:43

by Miklos Szeredi

Subject: Re: [fuse-devel] [PATCH] FUSE/CUSE: implement direct mmap support

On Wed, 10 Feb 2010, Tejun Heo wrote:
> > Also can we take this page ID abstraction a step further, and say that
> > the ID has nothing to do with the original offset, the only
> > requirement is that it'd globally identify all direct mapped pages.
> >
> > And the coherency requirements would be satisfied by the
> > fuse_dev_mmap() code. Haven't looked into what that would take, but
> > it sounds doable.
>
> The coherency requirement is not really between the address and offset
> but between virtual addresses of different maps which may share the same
> physical page. On an architecture with virtually indexed but
> physically tagged caches, virtual addresses which map to the same page
> must end up hitting the same cache group; otherwise, the processor
> wouldn't be able to determine whether the two addresses point to the
> same page (as only the tags inside the same cache group are matched
> against the physical address).
>
> This is achieved by enforcing address to offset alignment. IOW, all
> maps are forced to be SHMLBA aligned to offset so that all maps are
> SHMLBA aligned to each other.

Okay, let's be a little clearer. There are client side maps and server
side maps. Client side maps are naturally aligned (same offset ->
same page).

So that leaves server side maps needing to be aligned to client side
maps. Since we use the offset into the mmap as the ID, we might as
well just cheat and calculate a matching offset for the server side map
and use that. I'm not worried about changing vma->vm_pgoff there, if
that's the only way to get proper alignment, since those are not
"proper" mmaps anyway.

> FUSE dmmap has exactly the same problem as it basically is a shared
> memory implementation with slightly different interface. Unless we
> can know which maps are gonna be shared beforehand, and we can't know
> this either between the client and server or between clients, the
> only way to make those shared mappings fall into the same cache slot
> is aligning them to address space offsets.

Currently you are aligning client/client maps by changing the
requested offset. That's like saying: you wanted the file mapped from
offset X, but we mapped it from offset Y because that's better
aligned. It doesn't make much sense. If the filesystem is doing
something nasty like that, then to hell with cache alignment.

Thanks,
Miklos

2010-02-10 11:30:53

by Tejun Heo

Subject: Re: [fuse-devel] [PATCH] FUSE/CUSE: implement direct mmap support

On 02/10/2010 08:22 PM, Tejun Heo wrote:
> So, I don't think it's feasible to do the address matching from inside
> the kernel without a lot of convolution.

To clarify a bit. The alignment is still a must. What's
theoretically feasible with convolution is hiding the alignment from
the userland server by adjusting virtual addresses of maps, but this
will visibly break alignment as seen from the clients (e.g. client may
not be able to unmap part of existing mmap and mmap SHMLBA aligned
disjoint part there) even if we ignore the fact that implementation
would be so invasive to the vm layer that it has no possibility of
getting accepted.

SHMLBA is something that a user of shm should know anyway. I don't
think it's too much to ask for from FUSE servers which would implement
direct mmap and it's easy to detect and enforce too. The proposed
libfuse patch abort()s if the condition is not met.
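
As a rough sketch of how cheap such a check is in userland (this is not
the proposed libfuse patch; dmmap_offset_ok() is a hypothetical helper
name), SHMLBA is available from <sys/shm.h> and the test is a single
modulo. A library could call abort() when it returns false.

```c
#include <stdbool.h>
#include <stdint.h>
#include <sys/shm.h>	/* SHMLBA */

/*
 * Hypothetical helper: true if a byte offset into the dmmap address
 * space is SHMLBA aligned and therefore safe to return to the kernel.
 */
static inline bool dmmap_offset_ok(uint64_t offset)
{
	return offset % (uint64_t)SHMLBA == 0;
}
```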

Thanks.

--
tejun

2010-02-10 11:50:23

by Tejun Heo

Subject: Re: [fuse-devel] [PATCH] FUSE/CUSE: implement direct mmap support

Hello,

On 02/10/2010 08:29 PM, Miklos Szeredi wrote:
>> This is achieved by enforcing address to offset alignment. IOW, all
>> maps are forced to be SHMLBA aligned to offset so that all maps are
>> SHMLBA aligned to each other.
>
> Okay, let's be a little clearer. There are client side maps and server
> side maps. Client side maps are naturally aligned (same offset ->
> same page).

Same offset -> same page doesn't hold. Client mappings sharing the
same offset can end up with different physical pages. ie. Process A
opening /dev/dsp and mmapping at 0 and process B doing the same thing
clearly need to be served with different pages and that's one of the
reasons why the server is given the ability to adjust the offset.
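
A minimal sketch of a server handing out a separate region per mmap
request, assuming a simple bump allocator over the dmmap address space.
The struct and function names are hypothetical and the alignment value
is passed in by the caller (it would be SHMLBA in a real server). Each
region base stays a multiple of the alignment, so redirecting a client
to it never breaks the alignment.

```c
#include <stdint.h>

/* Hypothetical bump allocator over the dmmap address space (in bytes). */
struct dmmap_alloc {
	uint64_t next;	/* first unallocated offset, always aligned */
	uint64_t align;	/* required alignment, SHMLBA in practice */
};

/* Round v up to the next multiple of align. */
static inline uint64_t dmmap_round_up(uint64_t v, uint64_t align)
{
	return (v + align - 1) / align * align;
}

/* Reserve a fresh, aligned region of at least len bytes; return its base. */
static inline uint64_t dmmap_alloc_region(struct dmmap_alloc *a, uint64_t len)
{
	uint64_t base = a->next;

	a->next = base + dmmap_round_up(len, a->align);
	return base;
}
```

Two processes mmapping the same device at offset 0 would each get a
different base from dmmap_alloc_region() and thus different pages.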

> So that leaves server side maps needing to be aligned to client side
> maps. Since we use the offset into the mmap as the ID, we might as
> well just cheat and calculate a matching offset for the server side map
> and use that. I'm not worried about changing vma->vm_pgoff there, if
> that's the only way to get proper alignment, since those are not
> "proper" mmaps anyway.
...
> Currently you are aligning client/client maps by changing the
> requested offset. That's like saying: you wanted the file mapped from
> offset X, but we mapped it from offset Y because that's better
> aligned. It doesn't make much sense. If the filesystem is doing
> something nasty like that, then to hell with cache alignment.

I think we're misunderstanding each other pretty good here. I really
can't understand the above paragraph. The offset adjustment is to
allow the server to choose which clients share which mappings. It's
not to align anything. The alignment is done by the generic VM layer
against the offset. The SHMLBA requirement for the FUSE server is
there so that the FUSE server doesn't break the alignment while
changing the offset.

Can you please elaborate how you think the thing can work without
referencing the proposed implementation? Let's find out where the
misunderstanding is.

Thanks.

--
tejun

2010-02-10 12:15:39

by Miklos Szeredi

Subject: Re: [fuse-devel] [PATCH] FUSE/CUSE: implement direct mmap support

On Wed, 10 Feb 2010, Tejun Heo wrote:
> > Okay, let's be a little clearer. There are client side maps and server
> > side maps. Client side maps are naturally aligned (same offset ->
> > same page).
>
> Same offset -> same page doesn't hold.

Right, I really meant same page -> same offset. If the same offset is
mapped to multiple pages: no problem. If the same page is mapped to
multiple offsets, then obviously it's not going to work properly.

> Can you please elaborate how you think the thing can work without
> referencing the proposed implementation? Let's find out where the
> misunderstanding is.

Thinking about it I'm not really sure...

Maybe the problem is that the proposed solution allows too much
freedom. Normally there's a 1:1 relationship between pages and
offsets. But we want to break that for CUSE, because two different
mappings of a char dev might point to completely different pages,
right?

When does that happen? Can it happen that two mappings of the same
file descriptor will have different backing pages?

Thanks,
Miklos

2010-02-10 12:28:57

by Tejun Heo

Subject: Re: [fuse-devel] [PATCH] FUSE/CUSE: implement direct mmap support

Hello, Miklos.

On 02/10/2010 09:15 PM, Miklos Szeredi wrote:
>> Same offset -> same page doesn't hold.
>
> Right, I really meant same page -> same offset. If the same offset is
> mapped to multiple pages: no problem. If the same page is mapped to
> multiple offsets, then obviously it's not going to work properly.

Yeap.

>> Can you please elaborate how you think the thing can work without
>> referencing the proposed implementation? Let's find out where the
>> misunderstanding is.
>
> Thinking about it I'm not really sure...
>
> Maybe the problem is that the proposed solution allows too much
> freedom. Normally there's a 1:1 relationship between pages and
> offsets. But we want to break that for CUSE, because two different
> mappings of a char dev might point to completely different pages,
> right?

Yeap. It basically behaves like each mmap() instance is a shm
instance and the offset into dmmap_regions is the shmkey.

> When does that happen? Can it happen that two mappings of the same
> file descriptor will have different backing pages?

Yeah, sure. FUSE server is free to give them separate regions. The
only restriction is the SHMLBA alignment which is pretty easy to
adhere to.

Thanks.

--
tejun

2010-02-10 15:02:26

by Miklos Szeredi

Subject: Re: [fuse-devel] [PATCH] FUSE/CUSE: implement direct mmap support

On Wed, 10 Feb 2010, Tejun Heo wrote:
> > When does that happen? Can it happen that two mappings of the same
> > file descriptor will have different backing pages?
>
> Yeah, sure. FUSE server is free to give them separate regions. The
> only restriction is the SHMLBA alignment which is pretty easy to
> adhere to.

What I really meant was: does it ever happen that mapping a *single
open file* of a chardev twice will result in two different maps?
Or will the maps be only different if it was a different open of the
device?

If there's a pattern there, we might make the sharing/non-sharing
automated, and greatly simplify the interface.

Thanks,
Miklos

2010-02-10 23:41:55

by Tejun Heo

Subject: Re: [fuse-devel] [PATCH] FUSE/CUSE: implement direct mmap support

Hello,

On 02/11/2010 12:02 AM, Miklos Szeredi wrote:
>> Yeah, sure. FUSE server is free to give them separate regions. The
>> only restriction is the SHMLBA alignment which is pretty easy to
>> adhere to.
>
> What I really meant was: does it ever happen that mapping a *single
> open file* of a chardev twice will result in two different maps?
> Or will the maps be only different if it was a different open of the
> device?

I don't know if there is one but ->mmap() implementation is definitely
allowed to do that.

> If there's a pattern there, we might make the sharing/non-sharing
> automated, and greatly simplify the interface.

Isn't the interface pretty simple as it is? If we want to make it
easier for API users by imposing limits, I think the correct layer to
do that would be at the library level.

Thanks.

--
tejun

2010-02-11 09:31:57

by Miklos Szeredi

Subject: Re: [fuse-devel] [PATCH] FUSE/CUSE: implement direct mmap support

On Thu, 11 Feb 2010, Tejun Heo wrote:
> > If there's a pattern there, we might make the sharing/non-sharing
> > automated, and greatly simplify the interface.
>
> Isn't the interface pretty simple as it is?

On one hand it's simple, on the other it has pretty weird limitations,
considering that server side mmap shouldn't be mandatory. But of
course if server side mmap isn't used, then the SHMLBA limitation is
not necessary anymore so the implementation could choose an arbitrary
"offset".

My biggest gripe with the kernel API is that we shouldn't be calling
that thing in fuse_mmap_out an offset at all, because it's not and
it's confusing (like making you set vma->vm_pgoff to that value, which
is bogus). Adding a separate "page_id" or whatever would make me
happy.

And if the server wants to mmap /dev/fuse then it can do that and send
the result in "dev_offset", to make it clear that it's a different
offset from the one the client used on the mmap. And it can even use
that value as page_id, if it wants to or it can use a different
page_id.

Does that sound reasonable?

> If we want to make it
> easier for API users by imposing limits, I think the correct layer to
> do that would be at the library level.

Hmm, might do that. High level library users definitely shouldn't be
having to mess around with SHMLBA.

Thanks,
Miklos

2010-02-11 09:51:54

by Miklos Szeredi

Subject: Re: [fuse-devel] [PATCH] FUSE/CUSE: implement direct mmap support

On Thu, 11 Feb 2010, Miklos Szeredi wrote:
> On one hand it's simple, on the other it has pretty weird limitations,
> considering that server side mmap shouldn't be mandatory. But of
> course if server side mmap isn't used, then the SHMLBA limitation is
> not necessary anymore so the implementation could choose an arbitrary
> "offset".
>
> My biggest gripe with the kernel API is that we shouldn't be calling
> that thing in fuse_mmap_out an offset at all, because it's not and
> it's confusing (like making you set vma->vm_pgoff to that value, which
> is bogus). Adding a separate "page_id" or whatever would make me
> happy.

And even a page_id is superfluous, since all we want is differentiate
between possibly separate maps for the same file and don't care about
individual pages.

And even differentiating between maps on the same file is only ever
possible for char devs, normal files will only ever have a single map.

So instead of "page_id" it could just be a "map_id", which will be
zero for normal maps, and whatever for /dev/fuse maps.

> And if the server wants to mmap /dev/fuse then it can do that and send
> the result in "dev_offset", to make it clear that it's a different
> offset from the one the client used on the mmap. And it can even use
> that value as page_id, if it wants to or it can use a different
> page_id.

And there's a weakness in the current server side mmap interface, for
example:

1) user mmaps region 3-5 (in page index)
server maps this region at 3-5

2) user mmaps region 1-7
server can't tell kernel that it wants to reuse region 3-5 but
wants to create two other regions

What happens in that case?

On the other hand if the map_id is separate from the dev_offset, then
the server can map the second region as one map and the kernel can
connect it up with the right pages based on the map_id returned in the
MMAP reply.

Thanks,
Miklos

2010-02-11 11:41:56

by Tejun Heo

Subject: Re: [fuse-devel] [PATCH] FUSE/CUSE: implement direct mmap support

Hello,

On 02/11/2010 06:31 PM, Miklos Szeredi wrote:
> On Thu, 11 Feb 2010, Tejun Heo wrote:
>>> If there's a pattern there, we might make the sharing/non-sharing
>>> automated, and greatly simplify the interface.
>>
>> Isn't the interface pretty simple as it is?
>
> On one hand it's simple, on the other it has pretty weird
> limitations, considering that server side mmap shouldn't be
> mandatory.

It's not like it's something completely foreign. It's a limitation
which can also be found in shared memory and the server side mmap
doesn't really have much to do with it. It's also necessary to avoid
aliasing issues among clients.

> But of course if server side mmap isn't used, then the SHMLBA
> limitation is not necessary anymore so the implementation could
> choose an arbitrary "offset".

It of course is necessary. How else can aliasing among clients be
avoided? The alignment is not only for the server. If you're gonna
share memories and want to adjust the mapping in any way, you need to
follow the SHMLBA alignment.

> My biggest gripe with the kernel API is that we shouldn't be calling
> that thing in fuse_mmap_out an offset at all, because it's not and
> it's confusing (like making you set vma->vm_pgoff to that value, which
> is bogus). Adding a separate "page_id" or whatever would make me
> happy.

It _is_ an offset into the dmmap address space. mmap() requests to
map certain length at certain offset of an address space. The FUSE
server can redirect the offset to make it share, not share or combine
or whatever. It's very simple conceptually.

> And if the server wants to mmap /dev/fuse then it can do that and send
> the result in "dev_offset", to make it clear that it's a different
> offset from the one the client used on the mmap. And it can even use
> that value as page_id, if it wants to or it can use a different
> page_id.
>
> Does that sound reasonable?

I can't really follow what you're suggesting. Either you're
misunderstanding why SHMLBA is necessary or I'm being plain dumb. Can
you please describe what you have in mind without referring to the
current implementation? How would it work?

>> If we want to make it easier for API users by imposing limits, I
>> think the correct layer to do that would be at the library level.
>
> Hmm, might do that. High level library users definitely shouldn't be
> having to mess around with SHMLBA.

The main thing would be to let the library manage mmap regions and ask
the server only two things - region size and share ID both of which
are optional. If both are not set (ie. -1 and -1), the mmap request
won't be shareable and sized as requested by the client. If the
server returns a positive ID, all mmap requests with the same ID will
share the region.
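
A minimal sketch of that library-level policy, under stated assumptions:
the table layout, MAX_SHARED, NO_SHARE and the function names are all
hypothetical, and region placement is a simple bump allocation. A share
ID of -1 yields a private region; any other ID maps every request
carrying it to the same region base.

```c
#include <stdint.h>

#define MAX_SHARED 16		/* hypothetical table size */
#define NO_SHARE   (-1)		/* "not set": region is private */

struct region_table {
	int64_t  ids[MAX_SHARED];	/* share IDs seen so far */
	uint64_t bases[MAX_SHARED];	/* region base for each share ID */
	int      nr;			/* used entries */
	uint64_t next;			/* next free offset in the dmmap AS */
	uint64_t align;			/* SHMLBA in practice */
};

/* Carve a new aligned region out of the address space. */
static inline uint64_t region_new(struct region_table *t, uint64_t size)
{
	uint64_t base = t->next;

	t->next += (size + t->align - 1) / t->align * t->align;
	return base;
}

/* Return the region base to use for a request with the given share ID. */
static inline uint64_t region_for(struct region_table *t, int64_t share_id,
				  uint64_t size)
{
	if (share_id != NO_SHARE) {
		for (int i = 0; i < t->nr; i++)
			if (t->ids[i] == share_id)
				return t->bases[i];	/* reuse: shared */
	}

	uint64_t base = region_new(t, size);

	/* Remember shareable regions; if the table is full it stays private. */
	if (share_id != NO_SHARE && t->nr < MAX_SHARED) {
		t->ids[t->nr] = share_id;
		t->bases[t->nr] = base;
		t->nr++;
	}
	return base;
}
```

The server never sees an offset in this scheme; the library picks the
offsets and keeps them aligned.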

Also, I said it before but SHMLBA is something which is already known
to the userland for shared memories. I really don't think this is too
much to ask for if the server is playing with direct mmap.

Thanks.

--
tejun

2010-02-11 11:47:54

by Tejun Heo

Subject: Re: [fuse-devel] [PATCH] FUSE/CUSE: implement direct mmap support

Hello,

On 02/11/2010 06:51 PM, Miklos Szeredi wrote:
> And even a page_id is superfluous, since all we want is differentiate
> between possibly separate maps for the same file and don't care about
> individual pages.
>
> And even differentiating between maps on the same file is only ever
> possible for char devs, normal files will only ever have a single map.
>
> So instead of "page_id" it could just be a "map_id", which will be
> zero for normal maps, and whatever for /dev/fuse maps.

Well, not exactly. There are device mmap() implementations which
simply ignore @offset because offsetting doesn't make any sense at
all. ossp mmap is actually done that way. I really don't think it
would be wise to impose such restriction at the kernel API level. For
highlevel interface, dealing with separate regions could be fine tho.

>> And if the server wants to mmap /dev/fuse then it can do that and send
>> the result in "dev_offset", to make it clear that it's a different
>> offset from the one the client used on the mmap. And it can even use
>> that value as page_id, if it wants to or it can use a different
>> page_id.
>
> And there's a weakness in the current server side mmap interface, for
> example:
>
> 1) user mmaps region 3-5 (in page index)
> server maps this region at 3-5
>
> 2) user mmaps region 1-7
> server can't tell kernel that it wants to reuse region 3-5 but
> wants to create two other regions
>
> What happens in that case?

If the server wants the two regions to be separate, it can map it to
say 5-11 and return the offset of 5. If it wants them to be shared,
it will have to mmap 1-2 and 6-7 and return offset of 1. But, the
server should really know better than growing the area this way. If
this type of expansion ever becomes a problem, we can implement
expandable mmap on the server side (i.e. don't require VM_DONTEXPAND).
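
The shared case above can be sketched as a small range computation:
given the page range the server already has mapped and the range a
client now requests, work out which pieces still need a server side
mmap. The struct and function names are hypothetical; ranges are
inclusive page indices as in the example.

```c
/* Inclusive range of page indices in the dmmap address space. */
struct pg_range {
	unsigned long first, last;
};

/*
 * Store in out[] the parts of req not covered by have and return how
 * many there are (0, 1 or 2).
 */
static inline int uncovered(struct pg_range have, struct pg_range req,
			    struct pg_range out[2])
{
	int n = 0;

	/* No overlap at all: the whole request is missing. */
	if (have.last < req.first || have.first > req.last) {
		out[n++] = req;
		return n;
	}
	/* Missing piece below the existing mapping. */
	if (req.first < have.first)
		out[n++] = (struct pg_range){ req.first, have.first - 1 };
	/* Missing piece above the existing mapping. */
	if (req.last > have.last)
		out[n++] = (struct pg_range){ have.last + 1, req.last };
	return n;
}
```

For an existing mapping of 3-5 and a request for 1-7 this yields the
two pieces 1-2 and 6-7.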

> On the other hand if the map_id is separate from the dev_offset, then
> the server can map the second region as one map and the kernel can
> connect it up with the right pages based on the map_id returned in the
> MMAP reply.

How is this not possible in the current implementation?

Thanks.

--
tejun

2010-02-11 12:25:21

by Miklos Szeredi

Subject: Re: [fuse-devel] [PATCH] FUSE/CUSE: implement direct mmap support

On Thu, 11 Feb 2010, Tejun Heo wrote:
> > And there's a weakness in the current server side mmap interface, for
> > example:
> >
> > 1) user mmaps region 3-5 (in page index)
> > server maps this region at 3-5
> >
> > 2) user mmaps region 1-7
> > server can't tell kernel that it wants to reuse region 3-5 but
> > wants to create two other regions
> >
> > What happens in that case?
>
> If the server wants the two regions to be separate, it can map it to
> say 5-11 and return the offset of 5. If it wants them to be shared,
> it will have to mmap 1-2 and 6-7 and return offset of 1.

What if region 6-7 is already occupied (e.g. because a separate region
was put there)?

Thanks,
Miklos

2010-02-11 12:34:48

by Miklos Szeredi

Subject: Re: [fuse-devel] [PATCH] FUSE/CUSE: implement direct mmap support

On Thu, 11 Feb 2010, Tejun Heo wrote:
> It's not like it's something completely foreign. It's a limitation
> which can also be found in shared memory and the server side mmap
> doesn't really have much to do with it. It's also necessary to avoid
> aliasing issues among clients.

What aliasing issues among clients? That's a job for the arch
dependent part of mmap. And as you said, there's not a lot the driver
can do (except play with ->vm_pgoff) to influence the address
selection.

And playing with ->vm_pgoff is *not* a valid thing to do for the
client, at least not on "normal" mmaps.

Thanks,
Miklos

2010-02-11 12:43:22

by Tejun Heo

Subject: Re: [fuse-devel] [PATCH] FUSE/CUSE: implement direct mmap support

Hello, Miklos.

On 02/11/2010 09:25 PM, Miklos Szeredi wrote:
>> If the server wants the two regions to be separate, it can map it to
>> say 5-11 and return the offset of 5. If it wants them to be shared,
>> it will have to mmap 1-2 and 6-7 and return offset of 1.
>
> What if region 6-7 is already occupied (e.g. because a separate region
> was put there)?

Allocating and managing the address space ranges is the server's
responsibility. If it expects the region to grow, it shouldn't
colocate those regions. The kernel is just giving the server an
address space to manage and letting it redirect mmaps to arbitrary
(sans the SHMLBA alignment restriction) part of it. The rest is up to
the server.

Thanks.

--
tejun

2010-02-11 12:46:51

by Miklos Szeredi

Subject: Re: [fuse-devel] [PATCH] FUSE/CUSE: implement direct mmap support

On Thu, 11 Feb 2010, Tejun Heo wrote:
> Hello, Miklos.
>
> On 02/11/2010 09:25 PM, Miklos Szeredi wrote:
> >> If the server wants the two regions to be separate, it can map it to
> >> say 5-11 and return the offset of 5. If it wants them to be shared,
> >> it will have to mmap 1-2 and 6-7 and return offset of 1.
> >
> > What if region 6-7 is already occupied (e.g. because a separate region
> > was put there)?
>
> Allocating and managing the address space ranges is the server's
> responsibility. If it expects the region to grow, it shouldn't
> colocate those regions. The kernel is just giving the server an
> address space to manage and letting it redirect mmaps to arbitrary
> (sans the SHMLBA alignment restriction) part of it. The rest is up to
> the server.

The problem with that is you simply can't determine in advance where
the region will grow. Okay, you can leave space according to the size
of the file, but the size of the file can grow too.

*This* is the complexity that I want to get rid of.

Thanks,
Miklos

2010-02-11 12:50:28

by Tejun Heo

Subject: Re: [fuse-devel] [PATCH] FUSE/CUSE: implement direct mmap support

Hello,

On 02/11/2010 09:34 PM, Miklos Szeredi wrote:
> On Thu, 11 Feb 2010, Tejun Heo wrote:
>> It's not like it's something completely foreign. It's a limitation
>> which can also be found in shared memory and the server side mmap
>> doesn't really have much to do with it. It's also necessary to avoid
>> aliasing issues among clients.
>
> What aliasing issues among clients? That's a job for the arch
> dependent part of mmap. And as you said, there's not a lot the driver
> can do (except play with ->vm_pgoff) to influence the address
> selection.
>
> And playing with ->vm_pgoff is *not* a valid thing to do for the
> client, at least not on "normal" mmaps.

That's the requirement coming from allowing the server to determine
how these mmaps are served, not from the fact that the management is
done via server side mmaps or the server maps those regions into its
process address space. No matter how you do it, if you want to mix
and match client mmap requests, the SHMLBA alignment will be visible.
I suppose you're talking about not allowing offsets to be adjusted at
all but I don't think that's a restriction we would want to have at
the kernel API level because offset might encode different things for
device mmaps.

Thanks.

--
tejun

2010-02-11 12:59:48

by Tejun Heo

Subject: Re: [fuse-devel] [PATCH] FUSE/CUSE: implement direct mmap support

Hello,

On 02/11/2010 09:46 PM, Miklos Szeredi wrote:
> The problem with that is you simply can't determine in advance where
> the region will grow. Okay, you can leave space according to the size
> of the file, but the size of the file can grow too.
>
> *This* is the complexity that I want to get rid of.

Alright, then let's talk about that, not SHMLBA which doesn't really
have much to do with this. So, you're basically saying that you want
multiple address spaces. In that case, the only logical abstraction
for that is files. ie. We need to be passing file descriptors to the
server and asking the server which descriptor it would want to use,
which was the previous implementation, which you objected to, mentioning
that this type of direct mmap would probably be useful only for device
mmap implementations.

The address space provided for dmmap is huge. It's any range which
pgoff_t can address. I really can't imagine managing allocation from
that space causing problems. If you want to use it for multiple huge
mappings, we can't use the simple implementation w/ pinned pages
anyway. We'll have to borrow anonymous file implementation or
implement private swap backing.

Thanks.

--
tejun

2010-02-11 13:01:32

by Miklos Szeredi

Subject: Re: [fuse-devel] [PATCH] FUSE/CUSE: implement direct mmap support

On Thu, 11 Feb 2010, Tejun Heo wrote:
> > And playing with ->vm_pgoff is *not* a valid thing to do for the
> > client, at least not on "normal" mmaps.
>
> That's the requirement coming from allowing the server to determine
> how these mmaps are served, not from the fact that the management is
> done via server side mmaps or the server maps those regions into its
> process address space. No matter how you do it, if you want to mix
> and match client mmap requests, the SHMLBA alignment will be visible.

Can you give an example?

> I suppose you're talking about not allowing offsets to be adjusted at
> all but I don't think that's a restriction we would want to have at
> the kernel API level because offset might encode different things for
> device mmaps.

Possibly. But there's some confusion about offsets again. This
offset is not the same as the one currently returned in fuse_mmap_out.

And neither is the SHMLBA alignment requirement so clear:

You say, that ossp ignores the client mmap offset. So basically it
resets the offset to zero, whatever it was, no? That sounds fine, but
then you are adjusting the offset by something not necessarily a
multiple of SHMLBA.

So there are different offsets:

a) vma->vm_pgoff (which may mean anything, but usually means b)
b) the offset at which the pages of the mapping are located
c) the offset at which the server side mmap is located

Thanks,
Miklos

2010-02-11 13:08:25

by Miklos Szeredi

Subject: Re: [fuse-devel] [PATCH] FUSE/CUSE: implement direct mmap support

On Thu, 11 Feb 2010, Tejun Heo wrote:
> Hello,
>
> On 02/11/2010 09:46 PM, Miklos Szeredi wrote:
> > The problem with that is you simply can't determine in advance where
> > the region will grow. Okay, you can leave space according to the size
> > of the file, but the size of the file can grow too.
> >
> > *This* is the complexity that I want to get rid of.
>
> Alright, then let's talk about that, not SHMLBA which doesn't really
> have much to do with this. So, you're basically saying that you want
> multiple address spaces. In that case, the only logical abstraction
> for that is files. ie. We need to be passing file descriptors to the
> server and asking the server which descriptor it would want to use,
> which was the previous implementation, which you objected to, mentioning
> that this type of direct mmap would probably be useful only for device
> mmap implementations.

No, I don't want to pass file descriptors (early fuse did use that for
readdir and it was a mistake).

Global address space for server side maps is OK. What I mind is that
in the *absence* of server side maps the filesystem still has to deal
with the global address space (with all its quirks). This is totally
unnecessary.

Thanks,
Miklos

2010-02-11 13:23:17

by Tejun Heo

Subject: Re: [fuse-devel] [PATCH] FUSE/CUSE: implement direct mmap support

Hello,

On 02/11/2010 10:01 PM, Miklos Szeredi wrote:
>> That's the requirement coming from allowing the server to determine
>> how these mmaps are served, not from the fact that the management is
>> done via server side mmaps or the server maps those regions into its
>> process address space. No matter how you do it, if you want to mix
>> and match client mmap requests, the SHMLBA alignment will be visible.
>
> Can you give an example?

Hmmm... I don't have any specific example. osspd is the only real
thing which uses it and it serves each mmap request with a separate
region. Oh... the fmmap example program serves the same region to
mmap requests coming from the same UID, so it does both sharing and
putting maps apart.

>> I suppose you're talking about not allowing offsets to be adjusted at
>> all but I don't think that's a restriction we would want to have at
>> the kernel API level because offset might encode different things for
>> device mmaps.
>
> Possibly. But there's some confusion about offsets again. This
> offset is not the same as the one currently returned in fuse_mmap_out.

This offset is the offset client requests.

> And neither is the SHMLBA alignment requirement so clear:
>
> You say, that ossp ignores the client mmap offset. So basically it
> resets the offset to zero, whatever it was, no? That sounds fine, but
> then you are adjusting the offset by something not necessarily a
> multiple of SHMLBA.

Yeap and that would be a bug. It will probably have to do % SHMLBA on
the offset. I don't think all OSS drivers care about this stuff
anyway. Most of them work only on x86, where SHMLBA == PAGE_SIZE.
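
One possible reading of the "% SHMLBA" remark, as a hedged sketch: rather
than resetting the offset to zero, a server that ignores the requested
offset could keep only its remainder modulo SHMLBA and add that to an
aligned region base, so that the adjustment stays a multiple of SHMLBA.
redirect_offset() is a hypothetical name, shmlba is passed in for
illustration, and region_base is assumed to be SHMLBA aligned.

```c
#include <stdint.h>

/*
 * Hypothetical helper: map a client's requested byte offset into the
 * region starting at region_base while preserving the offset's
 * remainder modulo shmlba. region_base must be a multiple of shmlba.
 */
static inline uint64_t redirect_offset(uint64_t requested,
				       uint64_t region_base,
				       uint64_t shmlba)
{
	return region_base + requested % shmlba;
}
```

Where SHMLBA == PAGE_SIZE the remainder of a page-aligned offset is
always zero, so the result is simply region_base.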

> So there are different offsets:
>
> a) vma->vm_pgoff (which may mean anything, but usually means b)

Yeap, vma->vm_pgoff can be any value and doesn't really matter. The
only visible difference would be the /proc listing, right? Setting
this to the requested offset is trivial.

> b) the offset at which the pages of the mapping are located
> c) the offset at which the server side mmap is located

There are three offsets.

a) the offset a client requested

b) the offset into dmmap AS, a client mmap region is mapped to. This
could be different from a) by multiple of SHMLBA / PAGE_SIZE.

c) the offset into dmmap AS, a server mmap region is mapped to, where
collection of these mmaps define the dmmap AS.

The offsets used in b) and c) are the same offsets.
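
The relation between a) and b) as stated here can be written as a small
check in page units. This is only a sketch of the rule described above,
with a hypothetical function name and with shmlba and page_size passed
in as plain numbers.

```c
#include <stdbool.h>

/*
 * True if the redirected page offset b) differs from the requested page
 * offset a) by a multiple of SHMLBA / PAGE_SIZE pages.
 */
static inline bool dmmap_pgoff_valid(unsigned long requested,
				     unsigned long redirected,
				     unsigned long shmlba,
				     unsigned long page_size)
{
	unsigned long step = shmlba / page_size;	/* pages per SHMLBA */
	unsigned long diff = requested > redirected ?
			     requested - redirected : redirected - requested;

	return diff % step == 0;
}
```

When shmlba equals page_size the step is one page, so any redirection
passes the check.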

Thanks.

--
tejun

2010-02-11 13:33:34

by Tejun Heo

Subject: Re: [fuse-devel] [PATCH] FUSE/CUSE: implement direct mmap support

Hello,

On 02/11/2010 10:08 PM, Miklos Szeredi wrote:
> No, I don't want to pass file descriptors (early fuse did use that for
> readdir and it was a mistake).

Yeah, I agree that passing fds is pretty painful to implement and look
at.

> Global address space for server side maps are OK.

Do you mean having a single dmmap AS is okay?

> What I mind is that in the *absence* of server side maps the
> filesystem still has to deal with the global address space (with all
> its quirks). This is totally unnecessary.

And then the above seems contradictory with the first sentence. Sorry
to be dense but I really don't follow what you're suggesting. If
you're okay with a single dmmap AS, how can you not be okay with the
quirks of dealing with global AS? The only exit from that problem is
having multiple ASes.

Thanks.

--
tejun

2010-02-11 13:40:37

by Miklos Szeredi

Subject: Re: [fuse-devel] [PATCH] FUSE/CUSE: implement direct mmap support

On Thu, 11 Feb 2010, Tejun Heo wrote:
> > So there are different offsets:
> >
> > a) vma->vm_pgoff (which may mean anything, but usually means b)
>
> Yeap, vma->vm_pgoff can be any value and doesn't really matter. The
> only visible difference would be the /proc listing, right? Setting
> this to the requested offset is trivial.

You mean leaving it at the requested offset? Yes, that's the most
trivial thing to do. Very few drivers change vm_pgoff:

git grep "vm_pgoff *=[^=]"

> > b) the offset at which the pages of the mapping are located
> > c) the offset at which the server side mmap is located
>
> There are three offsets.
>
> a) the offset a client requested
>
> b) the offset into dmmap AS, a client mmap region is mapped to. This
> could be different from a) by multiple of SHMLBA / PAGE_SIZE.

No, it could be different from a) by an arbitrary value.

>
> c) the offset into dmmap AS, a server mmap region is mapped to, where
> collection of these mmaps define the dmmap AS.
>
> The offsets used in b) and c) are the same offsets.

Why are they the same?

Thanks,
Miklos

2010-02-11 13:51:20

by Tejun Heo

Subject: Re: [fuse-devel] [PATCH] FUSE/CUSE: implement direct mmap support

Hello,

On 02/11/2010 10:40 PM, Miklos Szeredi wrote:
> On Thu, 11 Feb 2010, Tejun Heo wrote:
>>> So there are different offsets:
>>>
>>> a) vma->vm_pgoff (which may mean anything, but usually means b)
>>
>> Yeap, vma->vm_pgoff can be any value and doesn't really matter. The
>> only visible difference would be the /proc listing, right? Setting
>> this to the requested offset is trivial.
>
> You mean leaving it at the requested offset? Yes, that's the most
> trivial thing to do. Very few drivers change vm_pgoff:
>
> git grep "vm_pgoff *=[^=]"

Yeap, sure. I just didn't think it was visible outside.

>>> b) the offset at which the pages of the mapping are located
>>> c) the offset at which the server side mmap is located
>>
>> There are three offsets.
>>
>> a) the offset a client requested
>>
>> b) the offset into dmmap AS, a client mmap region is mapped to. This
>> could be different from a) by multiple of SHMLBA / PAGE_SIZE.
>
> No, it could be different from a) by an arbitrary value.

Then, sharing those pages would cause aliasing issues.

>> c) the offset into dmmap AS, a server mmap region is mapped to, where
>> collection of these mmaps define the dmmap AS.
>>
>> The offsets used in b) and c) are the same offsets.
>
> Why are they the same?

I meant they point into the same space. If they're the same value,
they point to the same page.

Thanks.

--
tejun

2010-02-11 14:40:11

by Miklos Szeredi

Subject: Re: [fuse-devel] [PATCH] FUSE/CUSE: implement direct mmap support

On Thu, 11 Feb 2010, Tejun Heo wrote:
> >> There are three offsets.
> >>
> >> a) the offset a client requested
> >>
> >> b) the offset into dmmap AS, a client mmap region is mapped to. This
> >> could be different from a) by multiple of SHMLBA / PAGE_SIZE.
> >
> > No, it could be different from a) by an arbitrary value.
>
> Then, sharing those pages would cause aliasing issues.

You said a few mails up:

"There are device mmap() implementations which simply ignore @offset
because offsetting doesn't make any sense at all"

Which means a) doesn't necessarily matter, so it's not something that
determines aliasing issues.

> >> c) the offset into dmmap AS, a server mmap region is mapped to, where
> >> collection of these mmaps define the dmmap AS.
> >>
> >> The offsets used in b) and c) are the same offsets.
> >
> > Why are they the same?
>
> I meant they point into the same space. If they're the same value,
> they point to the same page.

I'm beginning to understand what you mean by "dmmap AS".

The thing is, I'm still not sure if or how this kind of mmap makes
sense outside of the CUSE context. Which makes designing the API
difficult.

So, for now maybe it's best to go with your implementation, fix issues
with the offsets and make it CUSE only for the moment.

The alternative is for me to start implementing a coherent distributed
filesystem, so I can see what the actual requirements for a direct
mmap would be. That would be fun, but it would

a) delay direct mmap for CUSE by an unknown amount of time
b) delay everything else that I have in the pipeline ;)

Thanks,
Miklos

2010-02-12 00:06:22

by Tejun Heo

Subject: Re: [fuse-devel] [PATCH] FUSE/CUSE: implement direct mmap support

Hello,

On 02/11/2010 11:40 PM, Miklos Szeredi wrote:
>> Then, sharing those pages would cause aliasing issues.
>
> You said a few mails up:
>
> "There are device mmap() implementations which simply ignore @offset
> because offsetting doesn't make any sense at all"
>
> Which means a) doesn't necessarily matter, so it's not something that
> determines aliasing issues.

Mmap regions of devices aren't always used as shared memory. IO
regions often don't have cache backing at all. Also, a particular arch
is often assumed.

And yes it is something which determines aliasing issues. There is no
way around it.

>>>> The offsets used in b) and c) are the same offsets.
>>>
>>> Why are they the same?
>>
>> I meant they point into the same space. If they're the same value,
>> they point to the same page.
>
> I'm beginning to understand what you mean by "dmmap AS".

The thing described by rbtree of struct dmmap_regions.

> The thing is, I'm still not sure if or how this kind of mmap makes
> sense outside of the CUSE context. Which makes designing the API
> difficult.

For this to be useful for normal FS, it has to be backed by multiple
swap backed files and I can almost guarantee you would need to be
passing fds around.

> So, for now maybe it's best to go with your implementation, fix issues
> with the offsets and make it CUSE only for the moment.

The offset problem can't be fixed. If you allow offsets to be
adjusted in any way, SHMLBA requirement is gonna be there and for CUSE
I think having the ability to adjust offset would be useful even if
multiple files are used. It can of course be hidden behind a
highlevel library API tho.

> The alternative is for me to start implementing a coherent distributed
> filesystem, so I can see what the actual requirements for a direct
> mmap would be. That would be fun, but it would
>
> a) delay direct mmap for CUSE by an unknown amount of time
> b) delay everything else that I have in the pipeline ;)

Well, for CUSE, it has been delayed for a long time already so I don't
think there would be much harm in waiting a bit more. Any estimates
on how long it would take?

Thanks.

--
tejun

2010-02-12 00:23:15

by Tejun Heo

Subject: Re: [fuse-devel] [PATCH] FUSE/CUSE: implement direct mmap support

Hello,

On 02/12/2010 09:07 AM, Tejun Heo wrote:
>> The thing is, I'm still not sure if or how this kind of mmap makes
>> sense outside of the CUSE context. Which makes designing the API
>> difficult.
>
> For this to be useful for normal FS, it has to be backed by multiple
> swap backed files and I can almost guarantee you would need to be
> passing fds around.

BTW, if you're looking this way, one way to do it would be to ask the
server to return a fd to use for the mmap backing and offset
adjustment (still needs to be SHMLBA aligned) rather than creating
anonymous file from the kernel. In this approach, the problem would
be wrapping the vma ops so that FUSE can determine when the mmap
segment is released. I did it in a pretty ugly way in the previous mmap
implementation by simply overwriting the ops, as anonymous files didn't
have their own ->open or ->close, but to do this for any file you'll
need a way to nest or inherit vma ops in a generic way. The problem is
that the assumptions regarding vma are pretty deeply buried all over
the VM layer. FWIW, I couldn't think of a way to do that in a generic
manner.

Thanks.

--
tejun

2010-02-12 09:55:25

by Miklos Szeredi

Subject: Re: [fuse-devel] [PATCH] FUSE/CUSE: implement direct mmap support

On Fri, 12 Feb 2010, Tejun Heo wrote:
> The offset problem can't be fixed. If you allow offsets to be
> adjusted in any way, SHMLBA requirement is gonna be there and for CUSE
> I think having the ability to adjust offset would be useful even if
> multiple files are used. It can of course be hidden behind a
> highlevel library API tho.

Yes, offset issues can be fixed:

1) don't change vma->vm_pgoff

2) in fuse_mmap_out, call it dev_offset, dev_mmap_offset or whatever

3) on the low level API don't make it an offset pointer that is
adjusted. It's not an offset to be either left alone or changed
(that would be the case if we wanted to allow adjustment to
vma->vm_pgoff itself). It's about calculating a completely new
offset for the server side mmap.

> > The alternative is for me to start implementing a coherent distributed
> > filesystem, so I can see what the actual requirements for a direct
> > mmap would be. That would be fun, but it would
> >
> > a) delay direct mmap for CUSE by an unknown amount of time
> > b) delay everything else that I have in the pipeline ;)
>
> Well, for CUSE, it has been delayed for a long time already so I don't
> think there would be much harm in waiting a bit more. Any estimates
> on how long it would take?

Well if there were no higher priority things then I think I could do
that in a month.

I don't think I want a swap backed solution. There is already a
backing for these maps and that is the userspace filesystem. In fact
most of what is required is already there in the form of the page
cache. What I think would be interesting is to be able to load/save
contents of page cache from the server side, and not necessarily using
server side mmaps (server side mmap is also a possibility, but not an
easy one if it has to cooperate with the page cache).

Basically all we need to ensure, is that the mmap API doesn't conflict
with the above usage. The problem is that the details for the above
usage will only come out with a real implementation.

Thanks,
Miklos

2010-02-12 13:27:53

by Tejun Heo

Subject: Re: [fuse-devel] [PATCH] FUSE/CUSE: implement direct mmap support

Hello,

On 02/12/2010 06:55 PM, Miklos Szeredi wrote:
> On Fri, 12 Feb 2010, Tejun Heo wrote:
>> The offset problem can't be fixed. If you allow offsets to be
>> adjusted in any way, SHMLBA requirement is gonna be there and for CUSE
>> I think having the ability to adjust offset would be useful even if
>> multiple files are used. It can of course be hidden behind a
>> highlevel library API tho.
>
> Yes, offset issues can be fixed:
>
> 1) don't change vma->vm_pgoff

vma->vm_pgoff is peripheral unless you mean not adjusting offset at
all.

> 2) in fuse_mmap_out, call it dev_offset, dev_mmap_offset or whatever

Okay.

> 3) on the low level API don't make it an offset pointer that is
> adjusted. It's not an offset to be either left alone or changed
> (that would be the case if we wanted to allow adjustment to
> vma->vm_pgoff itself). It's about calculating a completely new
> offset for the server side mmap.

And now I'm completely lost. So, we'll assign new offset (no matter
how it is called) but it doesn't have to be aligned? It seems like
we've been having this disconnection from the beginning. Can you
please describe how this can avoid aliasing issues between clients
sharing the same page? So, in 2), whatever it is called, the server
specifies a value, how is that value used?

>> Well, for CUSE, it has been delayed for a long time already so I don't
>> think there would be much harm in waiting a bit more. Any estimates
>> on how long it would take?
>
> Well if there were no higher priority things then I think I could do
> that in a month.

Alright, we'll be missing the upcoming merge window anyway, so no
biggie.

> I don't think I want a swap backed solution. There is already a
> backing for these maps and that is the userspace filesystem.

Yeap, sure. This was what I was talking about in the other reply.
Named files would probably be much easier to work with for the server
implementations.

> In fact most of what is required is already there in the form of the
> page cache. What I think would be interesting is to be able to
> load/save contents of page cache from the server side, and not
> necessarily using server side mmaps (server side mmap is also a
> possibility, but not an easy one if it has to cooperate with the
> page cache).

Device mmap use cases might not work very well if the server can't
mmap directly.

> Basically all we need to ensure, is that the mmap API doesn't conflict
> with the above usage. The problem is that the details for the above
> usage will only come out with a real implementation.

Alright, I'll ping you in a while.

Thanks.

--
tejun

2010-02-12 13:53:47

by Miklos Szeredi

Subject: Re: [fuse-devel] [PATCH] FUSE/CUSE: implement direct mmap support

On Fri, 12 Feb 2010, Tejun Heo wrote:
> > 3) on the low level API don't make it an offset pointer that is
> > adjusted. It's not an offset to be either left alone or changed
> > (that would be the case if we wanted to allow adjustment to
> > vma->vm_pgoff itself). It's about calculating a completely new
> > offset for the server side mmap.
>
> And now I'm completely lost. So, we'll assign new offset (no matter
> how it is called) but it doesn't have to be aligned? It seems like
> we've been having this disconnection from the beginning. Can you
> please describe how this can avoid aliasing issues between clients
> sharing the same page? So, in 2), whatever it is called, the server
> specifies a value, how is that value used?

That dev_offset value is used as an offset into the server side mmap
address space. And yes, vma->vm_pgoff and dev_offset should be SHMLBA
multiples apart.

But don't call that _adjustment_. That's totally confusing, these are
*two* *different* *offsets*. There's an alignment requirement but
that's all. If they are the same that is pure coincidence.

And dev_offset (which points into the dmmap address space) is only
required if the filesystem/CUSE driver needs server side mmap.
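
The "SHMLBA multiples apart" requirement can be written as a small
predicate. This is only a sketch: the function name is hypothetical, and
it assumes dev_offset is expressed in bytes, vm_pgoff in pages, and
SHMLBA is a power of two.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical check (not from the patch): true when the client's
 * vma->vm_pgoff and the server-chosen dev_offset are a multiple of
 * shmlba apart. The two values are otherwise unrelated offsets.
 */
static bool dev_offset_compatible(uint64_t vm_pgoff, unsigned int page_shift,
                                  uint64_t dev_offset, uint64_t shmlba)
{
    assert(shmlba != 0 && (shmlba & (shmlba - 1)) == 0);

    uint64_t client_off = vm_pgoff << page_shift;

    /* Unsigned wraparound is harmless: 2^64 is a multiple of shmlba. */
    return ((dev_offset - client_off) & (shmlba - 1)) == 0;
}
```

With shmlba equal to the page size, any page-aligned dev_offset passes,
which matches the x86 case discussed earlier in the thread.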

> > In fact most of what is required is already there in the form of the
> > page cache. What I think would be interesting to be able to
> > load/save contents of page cache from the server side, and not
> > necessarily using server side mmaps (server side mmap is also a
> > possibility, but not an easy one if it has to cooperate with the
> > page cache).
>
> Device mmap use cases might not work very well if the server can't
> mmap directly.

I understand that, and that's where the interesting part comes in:
make the mmap API in a way that it works with and without server side
mmap.

Thanks,
Miklos

2010-02-12 17:49:23

by Tejun Heo

Subject: Re: [fuse-devel] [PATCH] FUSE/CUSE: implement direct mmap support

Hello,

On 02/12/2010 10:53 PM, Miklos Szeredi wrote:
> That dev_offset value is used as an offset into the server side mmap
> address space. And yes, vma->vm_pgoff and dev_offset should be SHMLBA
> multiples apart.

Alright.

> But don't call that _adjustment_. That's totally confusing, these are
> *two* *different* *offsets*.

At this point, we're completely in the realm of the Tower of Babel.
Unless we define all the terms explicitly, I don't think using or not
using a single word would make much difference. Working with FUSE
gets a bit confusing regarding which term means what but with
multiple process address spaces and the dmmap address space and all
the offsets, it's getting pretty ridiculous.

> There's an alignment requirement but that's all. If they are the
> same that is pure coincidence.
>
> And dev_offset (which points into the dmmap address space) is only
> required if the filesystem/CUSE driver needs server side mmap.

Hmmm... w/o the dev_offset how would the server designate which area
to use for the client?

Thanks.

--
tejun