2009-06-18 09:26:33

by Tejun Heo

[permalink] [raw]
Subject: [PATCHSET] FUSE/CUSE: implement direct mmap, take#3

Hello,

This is the third take of fuse-implement-direct-mmap patchset. This
patchset implements direct mmap support for FUSE (and CUSE). Each
direct mmap area is backed by anonymous mapping (shmem_file) and the
FUSE server can decide how they are shared.

mmap request is handled in two steps. MMAP first queries the server
whether it wants to share the mapping with an existing one or create a
new one, and if so, with which flags. MMAP_COMMIT notifies the server
the result of mmap and if successful the fd the server can use to
access the mmap region.

Changes from the last take[L] are

* mmap-don-t-assume-f_op-mmap-doesn-t-change-vma.patch was
unnecessary. Dropped and mmap implementation updated accordingly.

* Updated to the current master which now contains CUSE support.

This patchset contains the following patches.

0001-fdtable-export-alloc_fd.patch
0002-FUSE-make-request_wait_answer-wait-for-end-co.patch
0003-FUSE-implement-fuse_req-prep.patch
0004-FUSE-implement-direct-mmap.patch

0001 update exports fdtable for following FUSE changes. 0002-0003
update fuse_req->end() handling and add ->prep(). 0004 implements
direct mmap.

This patchset is on top of

linus#master(1d89b30cc9be41af87881682ec82e2c107849dbe)

and is available in the following git tree.

git://git.kernel.org/pub/scm/linux/kernel/git/tj/misc.git fuse-mmap

diffstat follows.

fs/file.c | 1
fs/fuse/cuse.c | 1
fs/fuse/dev.c | 70 ++++++--
fs/fuse/file.c | 432 +++++++++++++++++++++++++++++++++++++++++++++++++--
fs/fuse/fuse_i.h | 19 ++
include/linux/fuse.h | 47 +++++
6 files changed, 538 insertions(+), 32 deletions(-)

Thanks.

--
tejun

[L] http://thread.gmane.org/gmane.linux.kernel/821839


2009-06-18 09:26:20

by Tejun Heo

[permalink] [raw]
Subject: [PATCH 3/4] FUSE: implement fuse_req->prep()

Implement ->prep() which is the opposite equivalent of ->end(). It's
called right before the request is passed to userland server in the
kernel context of the server. ->prep() can fail the request without
disrupting the whole channel.

This will be used by direct mmap implementation.

Signed-off-by: Tejun Heo <[email protected]>
---
fs/fuse/dev.c | 29 ++++++++++++++++++++++++++---
fs/fuse/fuse_i.h | 6 ++++++
2 files changed, 32 insertions(+), 3 deletions(-)

diff --git a/fs/fuse/dev.c b/fs/fuse/dev.c
index 0745728..1c56c6e 100644
--- a/fs/fuse/dev.c
+++ b/fs/fuse/dev.c
@@ -758,6 +758,7 @@ static ssize_t fuse_dev_read(struct kiocb *iocb, const struct iovec *iov,
unsigned long nr_segs, loff_t pos)
{
int err;
+ bool restart;
struct fuse_req *req;
struct fuse_in *in;
struct fuse_copy_state cs;
@@ -804,12 +805,32 @@ static ssize_t fuse_dev_read(struct kiocb *iocb, const struct iovec *iov,
goto restart;
}
spin_unlock(&fc->lock);
+
+ restart = false;
fuse_copy_init(&cs, fc, 1, req, iov, nr_segs);
+
+ /*
+ * Execute prep if available. Failure from prep doesn't
+ * indicate faulty channel. On failure, fail the current
+ * request and proceed to the next one.
+ */
+ if (req->prep) {
+ err = req->prep(fc, req);
+ if (err) {
+ restart = true;
+ goto finish;
+ }
+ }
+
err = fuse_copy_one(&cs, &in->h, sizeof(in->h));
- if (!err)
- err = fuse_copy_args(&cs, in->numargs, in->argpages,
- (struct fuse_arg *) in->args, 0);
+ if (err)
+ goto finish;
+
+ err = fuse_copy_args(&cs, in->numargs, in->argpages,
+ (struct fuse_arg *) in->args, 0);
+ finish:
fuse_copy_finish(&cs);
+
spin_lock(&fc->lock);
req->locked = 0;
if (req->aborted) {
@@ -819,6 +840,8 @@ static ssize_t fuse_dev_read(struct kiocb *iocb, const struct iovec *iov,
if (err) {
req->out.h.error = -EIO;
request_end(fc, req);
+ if (restart)
+ goto restart;
return err;
}
if (!req->isreply)
diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
index 7b930b6..aa112e2 100644
--- a/fs/fuse/fuse_i.h
+++ b/fs/fuse/fuse_i.h
@@ -292,6 +292,12 @@ struct fuse_req {
/** Link on fi->writepages */
struct list_head writepages_entry;

+ /** Request preparation callback. Called from the kernel
+ context of the FUSE server before passing the request to
+ the FUSE server. Non-zero return from this function will
+ fail the request. */
+ int (*prep)(struct fuse_conn *, struct fuse_req *);
+
/** Request completion callback. This function is called from
the kernel context of the FUSE server if the request isn't
being aborted. If the request is being aborted, it's
--
1.6.0.2

2009-06-18 09:26:46

by Tejun Heo

[permalink] [raw]
Subject: [PATCH libfuse] implement direct mmap support

This patch implements direct mmap. It allows FUSE server to honor
each mmap request with anonymous mapping. FUSE server can make
multiple mmap requests share a single anonymous mapping or separate
mappings as it sees fit.

mmap request is handled in two steps. MMAP first queries the server
whether it wants to share the mapping with an existing one or create a
new one, and if so, with which flags. MMAP_COMMIT notifies the server
the result of mmap and if successful the fd the server can use to
access the mmap region.

Signed-off-by: Tejun Heo <[email protected]>
---
Here's updated direct mmap implementation in libfuse. Thanks.

example/Makefile.am | 6
example/fmmap.c | 370 ++++++++++++++++++++++++++++++++++++++++++++++++
example/fmmapclient.c | 112 ++++++++++++++
include/cuse_lowlevel.h | 8 +
include/fuse.h | 70 +++++++++
include/fuse_common.h | 9 +
include/fuse_kernel.h | 47 ++++++
include/fuse_lowlevel.h | 76 +++++++++
lib/cuse_lowlevel.c | 30 +++
lib/fuse.c | 129 ++++++++++++++++
lib/fuse_lowlevel.c | 65 ++++++++
lib/fuse_versionscript | 4
12 files changed, 924 insertions(+), 2 deletions(-)

Index: fuse/include/fuse.h
===================================================================
--- fuse.orig/include/fuse.h
+++ fuse/include/fuse.h
@@ -493,6 +493,67 @@ struct fuse_operations {
*/
int (*poll) (const char *, struct fuse_file_info *,
struct fuse_pollhandle *ph, unsigned *reventsp);
+
+ /**
+ * Direct mmap
+ *
+ * Direct mmap is available for direct_io files. FUSE direct
+ * mmap is always backed by anonymous mapping which can be
+ * shared between different maps.
+ *
+ * mmap is done in two steps. mmap() is the first step and
+ * queries the server whether an existing mmap should be
+ * reused or a new one should be created and if a new area is
+ * to be created with which flags.
+ *
+ * The server can set *@fdp to an existing fd to share the map
+ * and set flags on *@flagsp for new map. Note that setting
+ * both *@fdp and *@flagsp makes *@flagsp setting ignored.
+ * The fd set in *@fdp must be from previous FUSE mmaps.
+ *
+ * Introduced in version 2.9
+ */
+ int (*mmap) (const char *, void *addr, size_t len, int prot, int flags,
+ off_t offset, struct fuse_file_info *fi,
+ int *fdp, unsigned *flagsp);
+
+ /**
+ * Direct mmap commit
+ *
+ * This method is called to commit a direct mmap. This method
+ * is guaranteed to be called after successful return from
+ * mmap(). If @fd is negative, it contains -errno and
+ * indicates internal error occurred and the mmap request is
+ * being failed. Otherwise, @fd is connected to the anonymous
+ * mapping which will be used for the mmap. The server is
+ * free to resize or fill the area as it likes. If @fd has
+ * been created for this mmap, it will be visible to the
+ * client only after mmap_commit() completes successfully.
+ *
+ * On mmap_commit() failure, the kernel will automatically
+ * close @fd if it has been created for this mmap.
+ *
+ * The server can associate MMAP_COMMIT with the respective
+ * MMAP using @mmap_unique which contains request unique ID of
+ * the MMAP request.
+ */
+ int (*mmap_commit) (const char *, void *addr, size_t len,
+ int prot, int flags, int fd, off_t offset,
+ struct fuse_file_info *fi, uint64_t mmap_unique);
+
+ /**
+ * Direct munmap
+ *
+ * This method is called when a direct mmap area is being
+ * unmapped.
+ *
+ * The server can associate MUNMAP with the respective MMAP or
+ * MMAP_COMMIT using @mmap_unique which contains request
+ * unique ID of the MMAP request.
+ */
+ void (*munmap) (const char *, void *addr, size_t len,
+ struct fuse_file_info *fi, uint64_t mmap_unique,
+ int fd);
};

/** Extra context that may be needed by some filesystems
@@ -730,6 +791,15 @@ int fuse_fs_ioctl(struct fuse_fs *fs, co
int fuse_fs_poll(struct fuse_fs *fs, const char *path,
struct fuse_file_info *fi, struct fuse_pollhandle *ph,
unsigned *reventsp);
+int fuse_fs_mmap(struct fuse_fs *fs, const char *path, void *addr, size_t len,
+ int prot, int flags, off_t offset, struct fuse_file_info *fi,
+ int *fdp, unsigned *flagsp);
+int fuse_fs_mmap_commit(struct fuse_fs *fs, const char *path, void *addr,
+ size_t len, int prot, int flags, int fd, off_t offset,
+ struct fuse_file_info *fi, uint64_t mmap_unique);
+void fuse_fs_munmap(struct fuse_fs *fs, const char *path,
+ void *addr, size_t len, struct fuse_file_info *fi,
+ uint64_t mmap_unique, int fd);
void fuse_fs_init(struct fuse_fs *fs, struct fuse_conn_info *conn);
void fuse_fs_destroy(struct fuse_fs *fs);

Index: fuse/include/fuse_kernel.h
===================================================================
--- fuse.orig/include/fuse_kernel.h
+++ fuse/include/fuse_kernel.h
@@ -207,6 +207,15 @@ struct fuse_file_lock {
*/
#define FUSE_POLL_SCHEDULE_NOTIFY (1 << 0)

+/**
+ * Mmap flags
+ *
+ * FUSE_MMAP_DONT_COPY: don't copy the region on fork
+ * FUSE_MMAP_DONT_EXPAND: can't be expanded with mremap()
+ */
+#define FUSE_MMAP_DONT_COPY (1 << 0)
+#define FUSE_MMAP_DONT_EXPAND (1 << 1)
+
enum fuse_opcode {
FUSE_LOOKUP = 1,
FUSE_FORGET = 2, /* no reply */
@@ -246,6 +255,9 @@ enum fuse_opcode {
FUSE_DESTROY = 38,
FUSE_IOCTL = 39,
FUSE_POLL = 40,
+ FUSE_MMAP = 41,
+ FUSE_MMAP_COMMIT = 42,
+ FUSE_MUNMAP = 43,

/* CUSE specific operations */
CUSE_INIT = 4096,
@@ -507,6 +519,41 @@ struct fuse_notify_poll_wakeup_out {
__u64 kh;
};

+struct fuse_mmap_in {
+ __u64 fh;
+ __u64 addr;
+ __u64 len;
+ __s32 prot;
+ __s32 flags;
+ __u64 offset;
+};
+
+struct fuse_mmap_out {
+ __s32 fd;
+ __u32 flags;
+};
+
+struct fuse_mmap_commit_in {
+ __u64 fh;
+ __u64 mmap_unique;
+ __u64 addr;
+ __u64 len;
+ __s32 prot;
+ __s32 flags;
+ __s32 fd;
+ __u32 padding;
+ __u64 offset;
+};
+
+struct fuse_munmap_in {
+ __u64 fh;
+ __u64 mmap_unique;
+ __u64 addr;
+ __u64 len;
+ __s32 fd;
+ __u32 padding;
+};
+
struct fuse_in_header {
__u32 len;
__u32 opcode;
Index: fuse/include/fuse_lowlevel.h
===================================================================
--- fuse.orig/include/fuse_lowlevel.h
+++ fuse/include/fuse_lowlevel.h
@@ -869,6 +869,72 @@ struct fuse_lowlevel_ops {
*/
void (*poll) (fuse_req_t req, fuse_ino_t ino, struct fuse_file_info *fi,
struct fuse_pollhandle *ph);
+
+ /**
+ * Mmap
+ *
+ * Introduced in version 2.8
+ *
+ * Valid replies:
+ * fuse_reply_mmap
+ * fuse_reply_err
+ *
+ * @param req request handle
+ * @param ino the inode number
+ * @param addr address to be mapped
+ * @param len length of the mapping
+ * @param prot PROT_*
+ * @param flags MAP_* flags
+ * @param offset mmap offset
+ * @param fi file information
+ */
+ void (*mmap) (fuse_req_t req, fuse_ino_t ino, void *addr, size_t len,
+ int prot, int flags, off_t offset,
+ struct fuse_file_info *fi);
+
+ /**
+ * Mmap commit
+ *
+ * Introduced in version 2.8
+ *
+ * Valid replies:
+ * fuse_reply_err
+ *
+ * @param req request handle
+ * @param ino the inode number
+ * @param addr address to be mapped
+ * @param len length of the mapping
+ * @param prot PROT_*
+ * @param flags MAP_* flags
+ * @param fd file backing the anonymous shared mapping for this mmap
+ * @param offset mmap offset
+ * @param fi file information
+ * @param mmap_unique req->unique of the corresponding MMAP request
+ */
+ void (*mmap_commit) (fuse_req_t req, fuse_ino_t ino,
+ void *addr, size_t len, int prot, int flags,
+ int fd, off_t offset, struct fuse_file_info *fi,
+ uint64_t mmap_unique);
+
+ /**
+ * Munmap
+ *
+ * Introduced in version 2.8
+ *
+ * Valid replies:
+ * fuse_reply_none
+ *
+ * @param req request handle
+ * @param ino the inode number
+ * @param addr address to be mapped
+ * @param len length of the mapping
+ * @param fi file information
+ * @param mmap_unique req->unique of the corresponding MMAP request
+ * @param fd file backing the anonymous shared mapping for this mmap
+ */
+ void (*munmap) (fuse_req_t req, fuse_ino_t ino,
+ void *addr, size_t len, struct fuse_file_info *fi,
+ uint64_t mmap_unique, int fd);
};

/**
@@ -1138,6 +1204,16 @@ int fuse_reply_ioctl_iov(fuse_req_t req,
*/
int fuse_reply_poll(fuse_req_t req, unsigned revents);

+/**
+ * Reply with mmap information
+ *
+ * @param req request handle
+ * @param fd file for the mapping. -1 creates new one; otherwise,
+ * should point to fd created by mmap
+ * @param flags FUSE_MMAP_* flags
+ */
+int fuse_reply_mmap(fuse_req_t req, int fd, unsigned flags);
+
/* ----------------------------------------------------------- *
* Notification *
* ----------------------------------------------------------- */
Index: fuse/lib/fuse_lowlevel.c
===================================================================
--- fuse.orig/lib/fuse_lowlevel.c
+++ fuse/lib/fuse_lowlevel.c
@@ -509,6 +509,17 @@ int fuse_reply_poll(fuse_req_t req, unsi
return send_reply_ok(req, &arg, sizeof(arg));
}

+int fuse_reply_mmap(fuse_req_t req, int fd, unsigned flags)
+{
+ struct fuse_mmap_out arg;
+
+ memset(&arg, 0, sizeof(arg));
+ arg.fd = fd;
+ arg.flags = flags;
+
+ return send_reply_ok(req, &arg, sizeof(arg));
+}
+
static void do_lookup(fuse_req_t req, fuse_ino_t nodeid, const void *inarg)
{
char *name = (char *) inarg;
@@ -1125,6 +1136,57 @@ static void do_poll(fuse_req_t req, fuse
}
}

+static void do_mmap(fuse_req_t req, fuse_ino_t nodeid, const void *inarg)
+{
+ struct fuse_mmap_in *arg = (struct fuse_mmap_in *) inarg;
+ struct fuse_file_info fi;
+
+ memset(&fi, 0, sizeof(fi));
+ fi.fh = arg->fh;
+ fi.fh_old = fi.fh;
+
+ if (req->f->op.mmap)
+ req->f->op.mmap(req, nodeid, (void *)(unsigned long)arg->addr,
+ arg->len, arg->prot, arg->flags, arg->offset,
+ &fi);
+ else
+ fuse_reply_err(req, ENOSYS);
+}
+
+static void do_mmap_commit(fuse_req_t req, fuse_ino_t nodeid, const void *inarg)
+{
+ struct fuse_mmap_commit_in *arg = (struct fuse_mmap_commit_in *) inarg;
+ struct fuse_file_info fi;
+
+ memset(&fi, 0, sizeof(fi));
+ fi.fh = arg->fh;
+ fi.fh_old = fi.fh;
+
+ if (req->f->op.mmap_commit)
+ req->f->op.mmap_commit(req, nodeid,
+ (void *)(unsigned long)arg->addr,
+ arg->len, arg->prot, arg->flags, arg->fd,
+ arg->offset, &fi, arg->mmap_unique);
+ else
+ fuse_reply_err(req, ENOSYS);
+}
+
+static void do_munmap(fuse_req_t req, fuse_ino_t nodeid, const void *inarg)
+{
+ struct fuse_munmap_in *arg = (struct fuse_munmap_in *) inarg;
+ struct fuse_file_info fi;
+
+ memset(&fi, 0, sizeof(fi));
+ fi.fh = arg->fh;
+ fi.fh_old = fi.fh;
+
+ if (req->f->op.munmap)
+ req->f->op.munmap(req, nodeid, (void *)(unsigned long)arg->addr,
+ arg->len, &fi, arg->mmap_unique, arg->fd);
+ else
+ fuse_reply_none(req);
+}
+
static void do_init(fuse_req_t req, fuse_ino_t nodeid, const void *inarg)
{
struct fuse_init_in *arg = (struct fuse_init_in *) inarg;
@@ -1341,7 +1403,10 @@ static struct {
[FUSE_BMAP] = { do_bmap, "BMAP" },
[FUSE_IOCTL] = { do_ioctl, "IOCTL" },
[FUSE_POLL] = { do_poll, "POLL" },
+ [FUSE_MMAP] = { do_mmap, "MMAP" },
+ [FUSE_MMAP_COMMIT] = { do_mmap_commit, "MMAP_COMMIT" },
[FUSE_DESTROY] = { do_destroy, "DESTROY" },
+ [FUSE_MUNMAP] = { do_munmap, "MUNMAP" },
[CUSE_INIT] = { do_cuse_init, "CUSE_INIT" },
};

Index: fuse/lib/fuse_versionscript
===================================================================
--- fuse.orig/lib/fuse_versionscript
+++ fuse/lib/fuse_versionscript
@@ -169,6 +169,10 @@ FUSE_2.8 {
fuse_reply_ioctl;
fuse_reply_ioctl_iov;
fuse_reply_ioctl_retry;
+ fuse_fs_mmap;
+ fuse_fs_mmap_commit;
+ fuse_fs_munmap;
+ fuse_reply_mmap;
fuse_reply_poll;

local:
Index: fuse/example/Makefile.am
===================================================================
--- fuse.orig/example/Makefile.am
+++ fuse/example/Makefile.am
@@ -3,7 +3,7 @@
AM_CPPFLAGS = -I$(top_srcdir)/include -D_FILE_OFFSET_BITS=64 -D_REENTRANT
noinst_HEADERS = fioc.h
noinst_PROGRAMS = fusexmp fusexmp_fh null hello hello_ll fioc fioclient \
- fsel fselclient cusexmp
+ fsel fselclient fmmap fmmapclient cusexmp

LDADD = ../lib/libfuse.la @libfuse_libs@
fusexmp_fh_LDADD = ../lib/libfuse.la ../lib/libulockmgr.la @libfuse_libs@
@@ -14,4 +14,6 @@ fioclient_LDADD =
fselclient_CPPFLAGS =
fselclient_LDFLAGS =
fselclient_LDADD =
-
+fmmapclient_CPPFLAGS =
+fmmapclient_LDFLAGS =
+fmmapclient_LDADD =
Index: fuse/example/fmmap.c
===================================================================
--- /dev/null
+++ fuse/example/fmmap.c
@@ -0,0 +1,370 @@
+/*
+ FUSE fmmap: FUSE mmap example
+ Copyright (C) 2008-2009 SUSE Linux Products GmbH
+ Copyright (C) 2008-2009 Tejun Heo <[email protected]>
+
+ This program can be distributed under the terms of the GNU GPL.
+ See the file COPYING.
+
+ gcc -Wall `pkg-config fuse --cflags --libs` fmmap.c -o fmmap
+*/
+
+#define FUSE_USE_VERSION 29
+#define _GNU_SOURCE
+
+#include <errno.h>
+#include <fuse.h>
+#include <pthread.h>
+#include <stdint.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+#include <sys/mman.h>
+#include <sys/types.h>
+#include <time.h>
+#include <unistd.h>
+
+#define FMMAP_NAME "fmmap"
+
+struct fmmap_data {
+ struct fmmap_data *next;
+ uid_t uid;
+ int fd;
+ void *buf;
+ size_t size;
+ size_t cur;
+};
+
+static pthread_mutex_t mutex = PTHREAD_MUTEX_INITIALIZER;
+static struct fmmap_data *fmmap_data_list;
+static pthread_t fmmap_worker_thread;
+
+enum {
+ FMMAP_NONE,
+ FMMAP_ROOT,
+ FMMAP_FILE,
+};
+
+static struct fmmap_data *fmmap_get_data(uid_t uid)
+{
+ struct fmmap_data **pos, *fmdata;
+
+ pthread_mutex_lock(&mutex);
+
+ for (pos = &fmmap_data_list; *pos; pos = &((*pos)->next))
+ if ((*pos)->uid == uid)
+ break;
+ fmdata = *pos;
+
+ if (!fmdata) {
+ fmdata = calloc(1, sizeof(*fmdata));
+ if (fmdata) {
+ fmdata->fd = -1;
+ fmdata->uid = uid;
+ *pos = fmdata;
+ }
+ }
+
+ pthread_mutex_unlock(&mutex);
+
+ return fmdata;
+}
+
+static int fmmap_mmap_buf(struct fmmap_data *fmdata, int fd, size_t size)
+{
+ void *buf;
+ int rc = 0;
+
+ pthread_mutex_lock(&mutex);
+
+ if (fmdata->fd >= 0 && fmdata->fd != fd) {
+ rc = -EBUSY;
+ goto out_unlock;
+ }
+ fmdata->fd = fd;
+
+ if (size <= fmdata->size)
+ goto out_unlock;
+
+ if (ftruncate(fd, size)) {
+ rc = -errno;
+ goto out_unlock;
+ }
+
+ if (!fmdata->buf)
+ buf = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED,
+ fmdata->fd, 0);
+ else
+ buf = mremap(fmdata->buf, fmdata->size, size, MREMAP_MAYMOVE);
+
+ if (buf == MAP_FAILED) {
+ rc = -errno;
+ goto out_unlock;
+ }
+
+ fmdata->buf = buf;
+ fmdata->size = size;
+ out_unlock:
+ pthread_mutex_unlock(&mutex);
+ return rc;
+}
+
+static int fmmap_file_type(const char *path)
+{
+ if (strcmp(path, "/") == 0)
+ return FMMAP_ROOT;
+ if (strcmp(path, "/" FMMAP_NAME) == 0)
+ return FMMAP_FILE;
+ return FMMAP_NONE;
+}
+
+static int fmmap_readdir(const char *path, void *buf, fuse_fill_dir_t filler,
+ off_t offset, struct fuse_file_info *fi)
+{
+ (void) fi;
+ (void) offset;
+
+ if (fmmap_file_type(path) != FMMAP_ROOT)
+ return -ENOENT;
+
+ filler(buf, ".", NULL, 0);
+ filler(buf, "..", NULL, 0);
+ filler(buf, FMMAP_NAME, NULL, 0);
+
+ return 0;
+}
+
+static int fmmap_getattr(const char *path, struct stat *stbuf)
+{
+ stbuf->st_uid = getuid();
+ stbuf->st_gid = getgid();
+ stbuf->st_atime = stbuf->st_mtime = time(NULL);
+
+ switch (fmmap_file_type(path)) {
+ case FMMAP_ROOT:
+ stbuf->st_mode = S_IFDIR | 0777;
+ stbuf->st_nlink = 2;
+ break;
+ case FMMAP_FILE:
+ stbuf->st_mode = S_IFREG | 0666;
+ stbuf->st_nlink = 1;
+ break;
+ case FMMAP_NONE:
+ return -ENOENT;
+ }
+
+ return 0;
+}
+
+static int fmmap_open(const char *path, struct fuse_file_info *fi)
+{
+ struct fuse_context *cxt = fuse_get_context();
+
+ switch (fmmap_file_type(path)) {
+ case FMMAP_ROOT:
+ return 0;
+ case FMMAP_FILE:
+ fi->fh = cxt->uid;
+ fi->direct_io = 1;
+ return 0;
+ default:
+ return -ENOENT;
+ }
+}
+
+static int fmmap_read(const char *path, char *buf, size_t size,
+ off_t offset, struct fuse_file_info *fi)
+{
+ struct fmmap_data *fmdata;
+ size_t bytes = 0;
+
+ (void) path;
+ (void) fi;
+
+ /* all data should be fetched in one slurp */
+ if (offset)
+ return 0;
+
+#define append(fmt, args...) \
+ ({ \
+ size_t cur, t; \
+ cur = t = snprintf(buf + bytes, size - bytes, \
+ fmt , ##args); \
+ if (cur >= size - bytes) \
+ cur = size - bytes; \
+ bytes += cur; \
+ t; \
+ })
+
+ pthread_mutex_lock(&mutex);
+
+ for (fmdata = fmmap_data_list; fmdata; fmdata = fmdata->next) {
+ int col = 80, step, i;
+ size_t off = 0;
+
+ col -= append("%6d: %6zu ", fmdata->uid, fmdata->size);
+ step = fmdata->size / col;
+
+ for (i = 0; i < col; i++) {
+ int nr_zeros = 0, nr_ones = 0, has_cursor = 0;
+ size_t end;
+
+ if (i < col - 1)
+ end = off + step;
+ else
+ end = fmdata->size;
+
+ for (; off < end; off++) {
+ char *p = fmdata->buf + off;
+
+ if (off == fmdata->cur)
+ has_cursor = 1;
+ if (*p)
+ nr_ones++;
+ else
+ nr_zeros++;
+ }
+
+ if (has_cursor)
+ append("^");
+ else if (nr_ones && nr_zeros)
+ append("?");
+ else if (nr_ones)
+ append("1");
+ else
+ append("0");
+ }
+ append("\n");
+ }
+
+ pthread_mutex_unlock(&mutex);
+ return bytes;
+}
+
+static int fmmap_mmap(const char *path, void *addr, size_t len, int prot,
+ int flags, off_t offset, struct fuse_file_info *fi,
+ int *fdp, unsigned *flagsp)
+{
+ uid_t uid = fi->fh;
+ struct fmmap_data *fmdata;
+
+ (void) path;
+ (void) addr;
+ (void) len;
+ (void) prot;
+ (void) flags;
+ (void) offset;
+
+ printf("MMAP: addr=%p len=%zu prot=0x%x flags=0x%x offset=%lld\n",
+ addr, len, prot, flags, (long long)offset);
+
+ fmdata = fmmap_get_data(uid);
+ if (!fmdata)
+ return -ENOMEM;
+
+ *fdp = fmdata->fd;
+ *flagsp |= FUSE_MMAP_DONT_COPY;
+
+ return 0;
+}
+
+static int fmmap_mmap_commit(const char *path, void *addr, size_t len,
+ int prot, int flags, int fd, off_t offset,
+ struct fuse_file_info *fi, uint64_t mmap_unique)
+{
+ uid_t uid = fi->fh;
+ struct fmmap_data *fmdata;
+ int rc;
+
+ (void) path;
+
+ printf("MMAP_COMMIT: addr=%p len=%zu prot=0x%x flags=0x%x\n"
+ " fd=%d offset=%lld mmap_unique=%llu\n",
+ addr, len, prot, flags, fd, (long long)offset,
+ (unsigned long long)mmap_unique);
+
+ if (fd < 0)
+ return 0;
+
+ fmdata = fmmap_get_data(uid);
+ if (!fmdata)
+ return -ENOMEM;
+
+ if (offset + len > SIZE_MAX)
+ return -EOVERFLOW;
+
+ rc = fmmap_mmap_buf(fmdata, fd, offset + len);
+ if (rc)
+ return rc;
+
+ return 0;
+}
+
+static void fmmap_munmap(const char *path, void *addr, size_t len,
+ struct fuse_file_info *fi, uint64_t mmap_unique,
+ int fd)
+{
+ (void) path;
+ (void) addr;
+ (void) len;
+ (void) fi;
+ (void) mmap_unique;
+ (void) fd;
+
+ printf("MUNMAP: addr=%p len=%zu mmap_unique=%llu fd=%d\n",
+ addr, len, (unsigned long long)mmap_unique, fd);
+}
+
+static struct fuse_operations fmmap_oper = {
+ .getattr = fmmap_getattr,
+ .readdir = fmmap_readdir,
+ .open = fmmap_open,
+ .read = fmmap_read,
+ .mmap = fmmap_mmap,
+ .mmap_commit = fmmap_mmap_commit,
+ .munmap = fmmap_munmap,
+};
+
+static void *fmmap_worker(void *arg)
+{
+ const struct timespec delay = { 0, 1000000 }; /* 1ms */
+ struct fmmap_data *fmdata;
+
+ (void) arg;
+
+ repeat:
+ pthread_mutex_lock(&mutex);
+
+ for (fmdata = fmmap_data_list; fmdata; fmdata = fmdata->next) {
+ char *p = NULL;
+ int i;
+
+ for (i = 0; i < fmdata->size; i++) {
+ p = fmdata->buf + (fmdata->cur + i) % fmdata->size;
+ if (!*p)
+ break;
+ }
+
+ if (p && !*p) {
+ *p = 1;
+ fmdata->cur = (fmdata->cur + i + 1) % fmdata->size;
+ }
+ }
+
+ pthread_mutex_unlock(&mutex);
+
+ nanosleep(&delay, NULL);
+
+ goto repeat;
+}
+
+int main(int argc, char *argv[])
+{
+ errno = pthread_create(&fmmap_worker_thread, NULL, &fmmap_worker, NULL);
+ if (errno) {
+ perror("pthread_create");
+ return 1;
+ }
+
+ return fuse_main(argc, argv, &fmmap_oper, NULL);
+}
Index: fuse/include/fuse_common.h
===================================================================
--- fuse.orig/include/fuse_common.h
+++ fuse/include/fuse_common.h
@@ -109,6 +109,15 @@ struct fuse_file_info {
#define FUSE_IOCTL_MAX_IOV 256

/**
+ * Mmap flags
+ *
+ * FUSE_MMAP_DONT_COPY: don't copy the region on fork
+ * FUSE_MMAP_DONT_EXPAND: can't be expanded with mremap()
+ */
+#define FUSE_MMAP_DONT_COPY (1 << 0)
+#define FUSE_MMAP_DONT_EXPAND (1 << 1)
+
+/**
* Connection information, passed to the ->init() method
*
* Some of the elements are read-write, these can be changed to
Index: fuse/example/fmmapclient.c
===================================================================
--- /dev/null
+++ fuse/example/fmmapclient.c
@@ -0,0 +1,112 @@
+/*
+ FUSE fmmapclient: FUSE mmap example client
+ Copyright (C) 2008-2009 SUSE Linux Products GmbH
+ Copyright (C) 2008-2009 Tejun Heo <[email protected]>
+
+ This program can be distributed under the terms of the GNU GPL.
+ See the file COPYING.
+
+ gcc -Wall fmmapclient.c -o fmmapclient
+*/
+
+#include <fcntl.h>
+#include <limits.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <sys/mman.h>
+#include <sys/stat.h>
+#include <sys/types.h>
+#include <time.h>
+#include <unistd.h>
+
+#define COL 78
+
+const char *usage =
+"Usage: fmmapclient FMMAP_FILE OFF SIZE\n"
+"\n"
+" OFF : offset in pages\n"
+" SIZE : size in pages\n"
+"\n";
+
+int main(int argc, char **argv)
+{
+ const struct timespec delay = { 0, 1000000 }; /* 1ms */
+ unsigned long param[2] = { 0, 0 };
+ char status[COL] = "";
+ size_t pos = 0;
+ int nr_ones = 0, nr_zeros = 0;
+ size_t page_size;
+ off_t offset;
+ size_t size;
+ char *endp, *map;
+ int i, fd;
+
+ page_size = sysconf(_SC_PAGESIZE);
+
+ if (argc < 4)
+ goto usage;
+
+ for (i = 2; i < 4; i++) {
+ param[i - 2] = strtoul(argv[i], &endp, 0);
+ if (endp == argv[i] || *endp != '\0')
+ goto usage;
+ }
+
+ /* well, let's not care about overflow for now */
+ offset = param[0] * page_size;
+ size = param[1] * page_size;
+
+ if (!size)
+ goto usage;
+
+ fd = open(argv[1], O_RDWR);
+ if (fd < 0) {
+ perror("open");
+ return 1;
+ }
+
+ map = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, offset);
+ if (map == MAP_FAILED) {
+ perror("mmap");
+ return 1;
+ }
+
+ while (1) {
+ int col = COL, step = size / col;
+ char *p = map + pos;
+ int col_pre, col_new;
+
+ col_pre = pos / step;
+ pos = (pos + 1) % size;
+ col_new = pos / step;
+
+ if (*p) {
+ status[col_pre] = '#';
+ nr_ones++;
+ *p = 0;
+ } else {
+ status[col_pre] = '^';
+ nr_zeros++;
+ }
+
+ if (col_pre != col_new) {
+ if (nr_zeros && nr_ones)
+ status[col_pre] = '?';
+ else if (nr_ones)
+ status[col_pre] = '1';
+ else
+ status[col_pre] = '0';
+ nr_zeros = 0;
+ nr_ones = 0;
+ }
+ printf("\r%s", status);
+ fflush(stdout);
+
+ nanosleep(&delay, NULL);
+ }
+ return 0;
+
+ usage:
+ fprintf(stderr, usage);
+ return 1;
+}
Index: fuse/include/cuse_lowlevel.h
===================================================================
--- fuse.orig/include/cuse_lowlevel.h
+++ fuse/include/cuse_lowlevel.h
@@ -63,6 +63,14 @@ struct cuse_lowlevel_ops {
const void *in_buf, size_t in_bufsz, size_t out_bufsz);
void (*poll) (fuse_req_t req, struct fuse_file_info *fi,
struct fuse_pollhandle *ph);
+ void (*mmap) (fuse_req_t req, void *addr, size_t len, int prot,
+ int flags, off_t offset, struct fuse_file_info *fi);
+ void (*mmap_commit) (fuse_req_t req, void *addr, size_t len, int prot,
+ int flags, int fd, off_t offset,
+ struct fuse_file_info *fi, uint64_t mmap_unique);
+ void (*munmap) (fuse_req_t req, void *addr, size_t len,
+ struct fuse_file_info *fi, uint64_t mmap_unique,
+ int fd);
};

struct fuse_session *cuse_lowlevel_new(struct fuse_args *args,
Index: fuse/lib/fuse.c
===================================================================
--- fuse.orig/lib/fuse.c
+++ fuse/lib/fuse.c
@@ -1690,6 +1690,67 @@ int fuse_fs_poll(struct fuse_fs *fs, con
return -ENOSYS;
}

+int fuse_fs_mmap(struct fuse_fs *fs, const char *path, void *addr, size_t len,
+ int prot, int flags, off_t offset, struct fuse_file_info *fi,
+ int *fdp, unsigned *flagsp)
+{
+ fuse_get_context()->private_data = fs->user_data;
+ if (fs->op.mmap) {
+ int res;
+
+ if (fs->debug)
+ fprintf(stderr, "mmap[%llu] addr: %p len: %zu "
+ "prot: 0x%x flags: 0x%x offset: %llu\n",
+ (unsigned long long) fi->fh, addr, len,
+ prot, flags, (unsigned long long) offset);
+
+ res = fs->op.mmap(path, addr, len, prot, flags, offset, fi,
+ fdp, flagsp);
+
+ if (fs->debug && !res)
+ fprintf(stderr, " mmap[%llu] fd: %d flags: 0x%x\n",
+ (unsigned long long) fi->fh, *fdp, *flagsp);
+
+ return res;
+ } else
+ return -ENOSYS;
+}
+
+int fuse_fs_mmap_commit(struct fuse_fs *fs, const char *path, void *addr,
+ size_t len, int prot, int flags, int fd, off_t offset,
+ struct fuse_file_info *fi, uint64_t mmap_unique)
+{
+ fuse_get_context()->private_data = fs->user_data;
+ if (fs->op.mmap_commit) {
+ if (fs->debug)
+ fprintf(stderr, "mmap_commit[%llu] addr: %p len: %zu "
+ "prot: 0x%x flags: 0x%x fd: %d offset: %llu "
+ "mmap_unique: 0x%llx\n",
+ (unsigned long long) fi->fh, addr, len,
+ prot, flags, fd, (unsigned long long) offset,
+ (unsigned long long) mmap_unique);
+
+ return fs->op.mmap_commit(path, addr, len, prot, flags, fd,
+ offset, fi, mmap_unique);
+ } else
+ return -ENOSYS;
+}
+
+void fuse_fs_munmap(struct fuse_fs *fs, const char *path, void *addr, size_t len,
+ struct fuse_file_info *fi, uint64_t mmap_unique, int fd)
+{
+ fuse_get_context()->private_data = fs->user_data;
+ if (fs->op.munmap) {
+ if (fs->debug)
+ fprintf(stderr, "munmap[%llu] addr: %p len: %zu "
+ "mmap_unique: 0x%llx fd: %d\n",
+ (unsigned long long) fi->fh, addr, len,
+ (unsigned long long) mmap_unique, fd);
+
+ fs->op.munmap(path, addr, len, fi, mmap_unique, fd);
+ }
+}
+
static int is_open(struct fuse *f, fuse_ino_t dir, const char *name)
{
struct node *node;
@@ -3274,6 +3335,71 @@ static void fuse_lib_poll(fuse_req_t req
reply_err(req, ret);
}

+static void fuse_lib_mmap(fuse_req_t req, fuse_ino_t ino,
+ void *addr, size_t len, int prot, int flags,
+ off_t offset, struct fuse_file_info *fi)
+{
+ struct fuse *f = req_fuse_prepare(req);
+ struct fuse_intr_data d;
+ char *path;
+ int ret;
+ int fd = -1;
+ unsigned fmmap_flags = 0;
+
+ ret = get_path(f, ino, &path);
+ if (!ret) {
+ fuse_prepare_interrupt(f, req, &d);
+ ret = fuse_fs_mmap(f->fs, path, addr, len, prot, flags,
+ offset, fi, &fd, &fmmap_flags);
+ fuse_finish_interrupt(f, req, &d);
+ free_path(f, ino, path);
+ }
+ if (!ret)
+ fuse_reply_mmap(req, fd, fmmap_flags);
+ else
+ reply_err(req, ret);
+}
+
+static void fuse_lib_mmap_commit(fuse_req_t req, fuse_ino_t ino, void *addr,
+ size_t len, int prot, int flags, int fd,
+ off_t offset, struct fuse_file_info *fi,
+ uint64_t mmap_unique)
+{
+ struct fuse *f = req_fuse_prepare(req);
+ struct fuse_intr_data d;
+ char *path;
+ int ret;
+
+ ret = get_path(f, ino, &path);
+ if (!ret) {
+ fuse_prepare_interrupt(f, req, &d);
+ ret = fuse_fs_mmap_commit(f->fs, path, addr, len, prot, flags,
+ fd, offset, fi, mmap_unique);
+ fuse_finish_interrupt(f, req, &d);
+ free_path(f, ino, path);
+ }
+ reply_err(req, ret);
+}
+
+static void fuse_lib_munmap(fuse_req_t req, fuse_ino_t ino, void *addr,
+ size_t len, struct fuse_file_info *fi,
+ uint64_t mmap_unique, int fd)
+{
+ struct fuse *f = req_fuse_prepare(req);
+ struct fuse_intr_data d;
+ char *path;
+ int ret;
+
+ ret = get_path(f, ino, &path);
+ if (!ret) {
+ fuse_prepare_interrupt(f, req, &d);
+ fuse_fs_munmap(f->fs, path, addr, len, fi, mmap_unique, fd);
+ fuse_finish_interrupt(f, req, &d);
+ free_path(f, ino, path);
+ }
+ fuse_reply_none(req);
+}
+
static struct fuse_lowlevel_ops fuse_path_ops = {
.init = fuse_lib_init,
.destroy = fuse_lib_destroy,
@@ -3311,6 +3437,9 @@ static struct fuse_lowlevel_ops fuse_pat
.bmap = fuse_lib_bmap,
.ioctl = fuse_lib_ioctl,
.poll = fuse_lib_poll,
+ .mmap = fuse_lib_mmap,
+ .mmap_commit = fuse_lib_mmap_commit,
+ .munmap = fuse_lib_munmap,
};

int fuse_notify_poll(struct fuse_pollhandle *ph)
Index: fuse/lib/cuse_lowlevel.c
===================================================================
--- fuse.orig/lib/cuse_lowlevel.c
+++ fuse/lib/cuse_lowlevel.c
@@ -93,6 +93,33 @@ static void cuse_fll_poll(fuse_req_t req
req_clop(req)->poll(req, fi, ph);
}

+static void cuse_fll_mmap(fuse_req_t req, fuse_ino_t ino, void *addr,
+ size_t len, int prot, int flags, off_t offset,
+ struct fuse_file_info *fi)
+{
+ (void)ino;
+ req_clop(req)->mmap(req, addr, len, prot, flags, offset, fi);
+}
+
+static void cuse_fll_mmap_commit(fuse_req_t req, fuse_ino_t ino,
+ void *addr, size_t len, int prot, int flags,
+ int fd, off_t offset,
+ struct fuse_file_info *fi,
+ uint64_t mmap_unique)
+{
+ (void)ino;
+ req_clop(req)->mmap_commit(req, addr, len, prot, flags, fd, offset, fi,
+ mmap_unique);
+}
+
+static void cuse_fll_munmap(fuse_req_t req, fuse_ino_t ino,
+ void *addr, size_t len, struct fuse_file_info *fi,
+ uint64_t mmap_unique, int fd)
+{
+ (void)ino;
+ req_clop(req)->munmap(req, addr, len, fi, mmap_unique, fd);
+}
+
static size_t cuse_pack_info(int argc, const char **argv, char *buf)
{
size_t size = 0;
@@ -169,6 +196,9 @@ struct fuse_session *cuse_lowlevel_new(s
lop.fsync = clop->fsync ? cuse_fll_fsync : NULL;
lop.ioctl = clop->ioctl ? cuse_fll_ioctl : NULL;
lop.poll = clop->poll ? cuse_fll_poll : NULL;
+ lop.mmap = clop->mmap ? cuse_fll_mmap : NULL;
+ lop.mmap_commit = clop->mmap_commit ? cuse_fll_mmap_commit : NULL;
+ lop.munmap = clop->munmap ? cuse_fll_munmap : NULL;

se = fuse_lowlevel_new_common(args, &lop, sizeof(lop), userdata);
if (!se) {

2009-06-18 09:26:58

by Tejun Heo

[permalink] [raw]
Subject: [PATCH 1/4] fdtable: export alloc_fd()

Export alloc_fd(). Will be used by FUSE.

Signed-off-by: Tejun Heo <[email protected]>
---
fs/file.c | 1 +
1 files changed, 1 insertions(+), 0 deletions(-)

diff --git a/fs/file.c b/fs/file.c
index f313314..806b3ad 100644
--- a/fs/file.c
+++ b/fs/file.c
@@ -487,6 +487,7 @@ out:
spin_unlock(&files->file_lock);
return error;
}
+EXPORT_SYMBOL_GPL(alloc_fd);

int get_unused_fd(void)
{
--
1.6.0.2

2009-06-18 09:27:20

by Tejun Heo

[permalink] [raw]
Subject: [PATCH 2/4] FUSE: make request_wait_answer() wait for ->end() completion

Previously, a request was marked FINISHED before ->end() is executed
and thus request_wait_answer() can return before it's done. This
patch makes request_wait_answer() wait for ->end() to finish before
returning.

Note that no current ->end() user waits for request completion, so
this change doesn't cause any behavior difference.

While at it, beef up the comment above ->end() hook and clarify when
and where it's called.

Signed-off-by: Tejun Heo <[email protected]>
---
fs/fuse/dev.c | 41 +++++++++++++++++++++++++----------------
fs/fuse/fuse_i.h | 5 ++++-
2 files changed, 29 insertions(+), 17 deletions(-)

diff --git a/fs/fuse/dev.c b/fs/fuse/dev.c
index 8fed2ed..0745728 100644
--- a/fs/fuse/dev.c
+++ b/fs/fuse/dev.c
@@ -278,7 +278,6 @@ __releases(&fc->lock)
req->end = NULL;
list_del(&req->list);
list_del(&req->intr_entry);
- req->state = FUSE_REQ_FINISHED;
if (req->background) {
if (fc->num_background == FUSE_MAX_BACKGROUND) {
fc->blocked = 0;
@@ -293,10 +292,21 @@ __releases(&fc->lock)
fc->active_background--;
flush_bg_queue(fc);
}
+
spin_unlock(&fc->lock);
- wake_up(&req->waitq);
- if (end)
+
+ if (end) {
end(fc, req);
+ smp_wmb();
+ }
+
+ /*
+ * We own this request and wake_up() has enough memory
+ * barrier, no need to grab spin lock to set state.
+ */
+ req->state = FUSE_REQ_FINISHED;
+
+ wake_up(&req->waitq);
fuse_put_request(fc, req);
}

@@ -372,17 +382,16 @@ __acquires(&fc->lock)
return;

aborted:
- BUG_ON(req->state != FUSE_REQ_FINISHED);
- if (req->locked) {
- /* This is uninterruptible sleep, because data is
- being copied to/from the buffers of req. During
- locked state, there mustn't be any filesystem
- operation (e.g. page fault), since that could lead
- to deadlock */
- spin_unlock(&fc->lock);
- wait_event(req->waitq, !req->locked);
- spin_lock(&fc->lock);
- }
+ spin_unlock(&fc->lock);
+ wait_event(req->waitq, req->state == FUSE_REQ_FINISHED);
+ /*
+ * This is uninterruptible sleep, because data is being copied
+ * to/from the buffers of req. During locked state, there
+ * mustn't be any filesystem operation (e.g. page fault),
+ * since that could lead to deadlock
+ */
+ wait_event(req->waitq, !req->locked);
+ spin_lock(&fc->lock);
}

void fuse_request_send(struct fuse_conn *fc, struct fuse_req *req)
@@ -1062,9 +1071,7 @@ __acquires(&fc->lock)

req->aborted = 1;
req->out.h.error = -ECONNABORTED;
- req->state = FUSE_REQ_FINISHED;
list_del_init(&req->list);
- wake_up(&req->waitq);
if (end) {
req->end = NULL;
__fuse_get_request(req);
@@ -1074,6 +1081,8 @@ __acquires(&fc->lock)
fuse_put_request(fc, req);
spin_lock(&fc->lock);
}
+ req->state = FUSE_REQ_FINISHED;
+ wake_up(&req->waitq);
}
}

diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
index aaf2f9f..7b930b6 100644
--- a/fs/fuse/fuse_i.h
+++ b/fs/fuse/fuse_i.h
@@ -292,7 +292,10 @@ struct fuse_req {
/** Link on fi->writepages */
struct list_head writepages_entry;

- /** Request completion callback */
+ /** Request completion callback. This function is called from
+ the kernel context of the FUSE server if the request isn't
+ being aborted. If the request is being aborted, it's
+ called from the kernel context of the aborting process. */
void (*end)(struct fuse_conn *, struct fuse_req *);

/** Request is stolen from fuse_file->reserved_req */
--
1.6.0.2

2009-06-18 09:27:35

by Tejun Heo

[permalink] [raw]
Subject: [PATCH 4/4] FUSE: implement direct mmap

This patch implements direct mmap. It allows FUSE server to honor
each mmap request with anonymous mapping. FUSE server can make
multiple mmap requests share a single anonymous mapping or separate
mappings as it sees fit.

mmap request is handled in two steps. MMAP first queries the server
whether it wants to share the mapping with an existing one or create a
new one, and if so, with which flags. MMAP_COMMIT notifies the server
the result of mmap and if successful the fd the server can use to
access the mmap region.

Internally, shmem_file is used to back the mmap areas and vma->vm_file
is overridden from the FUSE file to the shmem_file.

For details, please read the comment on top of
fuse_file_direct_mmap().

Signed-off-by: Tejun Heo <[email protected]>
---
fs/fuse/cuse.c | 1 +
fs/fuse/file.c | 432 ++++++++++++++++++++++++++++++++++++++++++++++++--
fs/fuse/fuse_i.h | 8 +
include/linux/fuse.h | 47 ++++++
4 files changed, 476 insertions(+), 12 deletions(-)

diff --git a/fs/fuse/cuse.c b/fs/fuse/cuse.c
index de792dc..0ec447c 100644
--- a/fs/fuse/cuse.c
+++ b/fs/fuse/cuse.c
@@ -181,6 +181,7 @@ static const struct file_operations cuse_frontend_fops = {
.unlocked_ioctl = cuse_file_ioctl,
.compat_ioctl = cuse_file_compat_ioctl,
.poll = fuse_file_poll,
+ .mmap = fuse_file_direct_mmap,
};


diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index fce6ce6..fab3c2c 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -13,6 +13,9 @@
#include <linux/kernel.h>
#include <linux/sched.h>
#include <linux/module.h>
+#include <linux/file.h>
+#include <linux/syscalls.h>
+#include <linux/mman.h>

static const struct file_operations fuse_direct_io_file_operations;

@@ -1340,17 +1343,6 @@ static int fuse_file_mmap(struct file *file, struct vm_area_struct *vma)
return 0;
}

-static int fuse_direct_mmap(struct file *file, struct vm_area_struct *vma)
-{
- /* Can't provide the coherency needed for MAP_SHARED */
- if (vma->vm_flags & VM_MAYSHARE)
- return -ENODEV;
-
- invalidate_inode_pages2(file->f_mapping);
-
- return generic_file_mmap(file, vma);
-}
-
static int convert_fuse_file_lock(const struct fuse_file_lock *ffl,
struct file_lock *fl)
{
@@ -1970,6 +1962,422 @@ int fuse_notify_poll_wakeup(struct fuse_conn *fc,
return 0;
}

+struct fuse_mmap {
+ struct fuse_conn *fc; /* associated fuse_conn */
+ struct file *file; /* associated file */
+ struct kref kref; /* reference count */
+ u64 mmap_unique; /* mmap req which created this */
+ int mmap_fd; /* server side fd for shmem file */
+ struct file *mmap_file; /* shmem file backing this mmap */
+ unsigned long start;
+ unsigned long len;
+
+ /* our copy of vm_ops w/ open and close overridden */
+ struct vm_operations_struct vm_ops;
+};
+
+/*
+ * Create fuse_mmap structure which represents a single mmapped
+ * region. If @mfile is specified the created fuse_mmap would be
+ * associated with it; otherwise, a new shmem_file is created.
+ */
+static struct fuse_mmap *create_fuse_mmap(struct fuse_conn *fc,
+ struct file *file, struct file *mfile,
+ u64 mmap_unique, int mmap_fd,
+ struct vm_area_struct *vma)
+{
+ char dname[] = "dev/fuse";
+ loff_t off = (loff_t)vma->vm_pgoff << PAGE_SHIFT;
+ size_t len = vma->vm_end - vma->vm_start;
+ struct fuse_mmap *fmmap;
+ int err;
+
+ err = -ENOMEM;
+ fmmap = kzalloc(sizeof(*fmmap), GFP_KERNEL);
+ if (!fmmap)
+ goto fail;
+ kref_init(&fmmap->kref);
+
+ if (mfile) {
+ /*
+ * dentry name with a slash in it can't be created
+ * from userland, so testing dname ensures that the fd
+ * is the one we've created. Note that @mfile is
+ * already grabbed by fuse_mmap_end().
+ */
+ err = -EINVAL;
+ if (strcmp(mfile->f_dentry->d_name.name, dname))
+ goto fail;
+ } else {
+ /*
+ * Create a new shmem_file. As fuse direct mmaps can
+ * be shared, offset can't be zapped to zero. Use off
+ * + len as the default size. Server has a chance to
+ * adjust this and other stuff while processing the
+ * COMMIT request before the client sees this mmap
+ * area.
+ */
+ mfile = shmem_file_setup(dname, off + len, vma->vm_flags);
+ if (IS_ERR(mfile)) {
+ err = PTR_ERR(mfile);
+ goto fail;
+ }
+ }
+ fmmap->mmap_file = mfile;
+
+ fmmap->fc = fuse_conn_get(fc);
+ get_file(file);
+ fmmap->file = file;
+ fmmap->mmap_unique = mmap_unique;
+ fmmap->mmap_fd = mmap_fd;
+ fmmap->start = vma->vm_start;
+ fmmap->len = len;
+
+ return fmmap;
+
+ fail:
+ kfree(fmmap);
+ return ERR_PTR(err);
+}
+
+static void destroy_fuse_mmap(struct fuse_mmap *fmmap)
+{
+ /* mmap_file reference is managed by VM */
+ fuse_conn_put(fmmap->fc);
+ fput(fmmap->file);
+ kfree(fmmap);
+}
+
+static void fuse_vm_release(struct kref *kref)
+{
+ struct fuse_mmap *fmmap = container_of(kref, struct fuse_mmap, kref);
+ struct fuse_conn *fc = fmmap->fc;
+ struct fuse_file *ff = fmmap->file->private_data;
+ struct fuse_req *req;
+ struct fuse_munmap_in *inarg;
+
+ /* failing this might lead to resource leak in server, don't fail */
+ req = fuse_get_req_nofail(fc, fmmap->file);
+ inarg = &req->misc.munmap.in;
+
+ inarg->fh = ff->fh;
+ inarg->mmap_unique = fmmap->mmap_unique;
+ inarg->fd = fmmap->mmap_fd;
+ inarg->addr = fmmap->start;
+ inarg->len = fmmap->len;
+
+ req->in.h.opcode = FUSE_MUNMAP;
+ req->in.h.nodeid = get_node_id(fmmap->file->f_dentry->d_inode);
+ req->in.numargs = 1;
+ req->in.args[0].size = sizeof(*inarg);
+ req->in.args[0].value = inarg;
+
+ fuse_request_send_noreply(fc, req);
+
+ destroy_fuse_mmap(fmmap);
+}
+
+static void fuse_vm_open(struct vm_area_struct *vma)
+{
+ struct fuse_mmap *fmmap = vma->vm_private_data;
+
+ kref_get(&fmmap->kref);
+}
+
+static void fuse_vm_close(struct vm_area_struct *vma)
+{
+ struct fuse_mmap *fmmap = vma->vm_private_data;
+
+ kref_put(&fmmap->kref, fuse_vm_release);
+}
+
+static void fuse_mmap_end(struct fuse_conn *fc, struct fuse_req *req)
+{
+ struct fuse_mmap_out *mmap_out = req->out.args[0].value;
+ int fd = mmap_out->fd;
+ struct file *file;
+
+ /*
+ * If aborted, we're in a different context and the server is
+ * gonna die soon anyway. Don't bother.
+ */
+ if (unlikely(req->aborted))
+ return;
+
+ if (!req->out.h.error && fd >= 0) {
+ /*
+ * fget() failure should be handled differently as the
+ * userland is expecting MMAP_COMMIT. Set ERR_PTR
+ * value in misc.mmap.file instead of setting
+ * out.h.error.
+ */
+ file = fget(fd);
+ if (!file)
+ file = ERR_PTR(-EBADF);
+ req->misc.mmap.file = file;
+ }
+}
+
+static int fuse_mmap_commit_prep(struct fuse_conn *fc, struct fuse_req *req)
+{
+ struct fuse_mmap_commit_in *commit_in = (void *)req->in.args[0].value;
+ struct file *mfile = req->misc.mmap.file;
+ int fd;
+
+ if (!mfile)
+ return 0;
+
+ /* new mmap.file has been created, assign a fd to it */
+ fd = commit_in->fd = get_unused_fd_flags(O_CLOEXEC);
+ if (fd < 0)
+ return 0;
+
+ get_file(mfile);
+ fd_install(fd, mfile);
+ return 0;
+}
+
+static void fuse_mmap_commit_end(struct fuse_conn *fc, struct fuse_req *req)
+{
+ struct fuse_mmap_commit_in *commit_in = (void *)req->in.args[0].value;
+
+ /*
+ * If aborted, we're in a different context and the server is
+ * gonna die soon anyway. Don't bother.
+ */
+ if (unlikely(req->aborted))
+ return;
+
+ /*
+ * If a new fd was assigned to mmap.file but the request
+ * failed, close the fd.
+ */
+ if (req->misc.mmap.file && commit_in->fd >= 0 && req->out.h.error)
+ sys_close(commit_in->fd);
+}
+
+/*
+ * Direct mmap is implemented using two requests - FUSE_MMAP and
+ * FUSE_MMAP_COMMIT. This is to allow the userland server to choose
+ * whether to share an existing mmap or create a new one.
+ *
+ * Each separate mmap area is backed by a shmem_file (an anonymous
+ * mapping). If the server specifies fd to an existing shmem_file
+ * created by previous FUSE_MMAP_COMMIT, the shmem_file for that
+ * mapping is reused. If not, a new shmem_file is created and a new
+ * fd is opened and notified to the server via FUSE_MMAP_COMMIT.
+ *
+ * Because the server might allocate resources on FUSE_MMAP, FUSE
+ * guarantees that FUSE_MMAP_COMMIT will be sent whether the mmap
+ * attempt succeeds or not. On failure, commit_in.fd will contain
+ * negative error code; otherwise, it will contain the fd for the
+ * shmem_file. The server is then free to truncate the fd to desired
+ * size and fill in the content. The client will only see the area
+ * only after COMMIT is successfully replied. If the server fails the
+ * COMMIT request and new fd has been allocated for it, the fd will be
+ * automatically closed by the kernel.
+ *
+ * FUSE guarantees that MUNMAP request will be sent when the area gets
+ * unmapped.
+ *
+ * The server can associate the three related requests - MMAP,
+ * MMAP_COMMIT and MUNMAP using ->unique of the MMAP request. The
+ * latter two requests carry ->mmap_unique field which contains
+ * ->unique of the MMAP request.
+ */
+int fuse_file_direct_mmap(struct file *file, struct vm_area_struct *vma)
+{
+ struct fuse_file *ff = file->private_data;
+ struct fuse_conn *fc = ff->fc;
+ struct vm_operations_struct *orig_vm_ops = vma->vm_ops;
+ struct file *orig_vm_file = vma->vm_file;
+ unsigned long orig_vm_flags = vma->vm_flags;
+ struct fuse_mmap *fmmap = NULL;
+ struct file *mfile = NULL;
+ struct fuse_req *req;
+ struct fuse_mmap_in mmap_in;
+ struct fuse_mmap_out mmap_out;
+ struct fuse_mmap_commit_in commit_in;
+ u64 mmap_unique;
+ int err;
+
+ /*
+ * First, execute FUSE_MMAP which will query the server
+ * whether this mmap request is valid and which fd it wants to
+ * use to mmap this request.
+ */
+ req = fuse_get_req(fc);
+ if (IS_ERR(req)) {
+ err = PTR_ERR(req);
+ goto err;
+ }
+
+ memset(&mmap_in, 0, sizeof(mmap_in));
+ mmap_in.fh = ff->fh;
+ mmap_in.addr = vma->vm_start;
+ mmap_in.len = vma->vm_end - vma->vm_start;
+ mmap_in.prot = ((vma->vm_flags & VM_READ) ? PROT_READ : 0) |
+ ((vma->vm_flags & VM_WRITE) ? PROT_WRITE : 0) |
+ ((vma->vm_flags & VM_EXEC) ? PROT_EXEC : 0);
+ mmap_in.flags = ((vma->vm_flags & VM_GROWSDOWN) ? MAP_GROWSDOWN : 0) |
+ ((vma->vm_flags & VM_DENYWRITE) ? MAP_DENYWRITE : 0) |
+ ((vma->vm_flags & VM_EXECUTABLE) ? MAP_EXECUTABLE : 0) |
+ ((vma->vm_flags & VM_LOCKED) ? MAP_LOCKED : 0);
+ mmap_in.offset = (loff_t)vma->vm_pgoff << PAGE_SHIFT;
+
+ req->in.h.opcode = FUSE_MMAP;
+ req->in.h.nodeid = ff->nodeid;
+ req->in.numargs = 1;
+ req->in.args[0].size = sizeof(mmap_in);
+ req->in.args[0].value = &mmap_in;
+ req->out.numargs = 1;
+ req->out.args[0].size = sizeof(mmap_out);
+ req->out.args[0].value = &mmap_out;
+
+ req->end = fuse_mmap_end;
+
+ fuse_request_send(fc, req);
+
+ /* mmap.file is set if server requested to reuse existing mapping */
+ mfile = req->misc.mmap.file;
+ mmap_unique = req->in.h.unique;
+ err = req->out.h.error;
+
+ fuse_put_request(fc, req);
+
+ /* ERR_PTR value in mfile means fget failure, send failure COMMIT */
+ if (IS_ERR(mfile)) {
+ err = PTR_ERR(mfile);
+ goto commit;
+ }
+ /* userland indicated failure, we can just fail */
+ if (err)
+ goto err;
+
+ /*
+ * Second, create mmap as the server requested.
+ */
+ fmmap = create_fuse_mmap(fc, file, mfile, mmap_unique, mmap_out.fd,
+ vma);
+ if (IS_ERR(fmmap)) {
+ err = PTR_ERR(fmmap);
+ goto commit;
+ }
+
+ /* fmmap points to shm_file to mmap, give it to vma */
+ mfile = fmmap->mmap_file;
+ vma->vm_file = mfile;
+
+ /* add flags server requested and mmap the shm_file */
+ if (mmap_out.flags & FUSE_MMAP_DONT_COPY)
+ vma->vm_flags |= VM_DONTCOPY;
+ if (mmap_out.flags & FUSE_MMAP_DONT_EXPAND)
+ vma->vm_flags |= VM_DONTEXPAND;
+
+ err = mfile->f_op->mmap(mfile, vma);
+ if (err)
+ goto commit;
+
+ /*
+ * Override vm_ops->open and ->close. This is a bit hacky but
+ * vma's can't easily be nested and FUSE needs to notify the
+ * server when to release resources for mmaps. Both shmem and
+ * tiny_shmem implementations are okay with this trick but if
+ * there's a cleaner way to do this, please update it.
+ */
+ err = -EINVAL;
+ if (vma->vm_ops->open || vma->vm_ops->close || vma->vm_private_data) {
+ printk(KERN_ERR "FUSE: can't do direct mmap. shmem mmap has "
+ "open, close or vm_private_data\n");
+ goto commit;
+ }
+
+ fmmap->vm_ops = *vma->vm_ops;
+ vma->vm_ops = &fmmap->vm_ops;
+ vma->vm_ops->open = fuse_vm_open;
+ vma->vm_ops->close = fuse_vm_close;
+ vma->vm_private_data = fmmap;
+ err = 0;
+
+ commit:
+ /*
+ * Third, either mmap succeeded or failed after MMAP request
+ * succeeded. Notify userland what happened.
+ */
+
+ /* missing commit can cause resource leak on server side, don't fail */
+ req = fuse_get_req_nofail(fc, file);
+
+ memset(&commit_in, 0, sizeof(commit_in));
+ commit_in.fh = ff->fh;
+ commit_in.mmap_unique = mmap_unique;
+ commit_in.addr = mmap_in.addr;
+ commit_in.len = mmap_in.len;
+ commit_in.prot = mmap_in.prot;
+ commit_in.flags = mmap_in.flags;
+ commit_in.offset = mmap_in.offset;
+
+ if (!err) {
+ commit_in.fd = fmmap->mmap_fd;
+ /*
+ * If fmmap->mmap_fd < 0, new fd needs to be created
+ * when the server reads MMAP_COMMIT. Pass the file
+ * pointer. A fd will be assigned to it by the
+ * fuse_mmap_commit_prep callback.
+ */
+ if (fmmap->mmap_fd < 0)
+ req->misc.mmap.file = mfile;
+ } else
+ commit_in.fd = err;
+
+ req->in.h.opcode = FUSE_MMAP_COMMIT;
+ req->in.h.nodeid = ff->nodeid;
+ req->in.numargs = 1;
+ req->in.args[0].size = sizeof(commit_in);
+ req->in.args[0].value = &commit_in;
+
+ req->prep = fuse_mmap_commit_prep;
+ req->end = fuse_mmap_commit_end;
+
+ fuse_request_send(fc, req);
+ if (!err) /* notified failure to userland */
+ err = req->out.h.error;
+ if (!err && commit_in.fd < 0) /* failed to allocate fd */
+ err = commit_in.fd;
+ fuse_put_request(fc, req);
+
+ if (!err) {
+ fput(orig_vm_file);
+ fmmap->mmap_fd = commit_in.fd;
+ return 0;
+ }
+
+ /* fall through */
+ err:
+ if (fmmap && !IS_ERR(fmmap))
+ destroy_fuse_mmap(fmmap);
+ if (mfile && !IS_ERR(mfile))
+ fput(mfile);
+
+ /* restore original vm_ops, file and flags */
+ vma->vm_ops = orig_vm_ops;
+ vma->vm_file = orig_vm_file;
+ vma->vm_flags = orig_vm_flags;
+
+ if (err == -ENOSYS) {
+ /* Can't provide the coherency needed for MAP_SHARED */
+ if (vma->vm_flags & VM_MAYSHARE)
+ return -ENODEV;
+
+ invalidate_inode_pages2(file->f_mapping);
+
+ return generic_file_mmap(file, vma);
+ }
+
+ return err;
+}
+EXPORT_SYMBOL_GPL(fuse_file_direct_mmap);
+
static const struct file_operations fuse_file_operations = {
.llseek = fuse_file_llseek,
.read = do_sync_read,
@@ -1993,7 +2401,7 @@ static const struct file_operations fuse_direct_io_file_operations = {
.llseek = fuse_file_llseek,
.read = fuse_direct_read,
.write = fuse_direct_write,
- .mmap = fuse_direct_mmap,
+ .mmap = fuse_file_direct_mmap,
.open = fuse_open,
.flush = fuse_flush,
.release = fuse_release,
diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
index aa112e2..0fdaee7 100644
--- a/fs/fuse/fuse_i.h
+++ b/fs/fuse/fuse_i.h
@@ -272,6 +272,13 @@ struct fuse_req {
struct fuse_write_out out;
} write;
struct fuse_lk_in lk_in;
+ struct {
+ /** to move filp for mmap between client and server */
+ struct file *file;
+ } mmap;
+ struct {
+ struct fuse_munmap_in in;
+ } munmap;
} misc;

/** page vector */
@@ -724,6 +731,7 @@ ssize_t fuse_direct_io(struct file *file, const char __user *buf,
long fuse_do_ioctl(struct file *file, unsigned int cmd, unsigned long arg,
unsigned int flags);
unsigned fuse_file_poll(struct file *file, poll_table *wait);
+int fuse_file_direct_mmap(struct file *file, struct vm_area_struct *vma);
int fuse_dev_release(struct inode *inode, struct file *file);

#endif /* _FS_FUSE_I_H */
diff --git a/include/linux/fuse.h b/include/linux/fuse.h
index d41ed59..e000c33 100644
--- a/include/linux/fuse.h
+++ b/include/linux/fuse.h
@@ -178,6 +178,15 @@ struct fuse_file_lock {
*/
#define FUSE_POLL_SCHEDULE_NOTIFY (1 << 0)

+/**
+ * Mmap flags
+ *
+ * FUSE_MMAP_DONT_COPY: don't copy the region on fork
+ * FUSE_MMAP_DONT_EXPAND: can't be expanded with mremap()
+ */
+#define FUSE_MMAP_DONT_COPY (1 << 0)
+#define FUSE_MMAP_DONT_EXPAND (1 << 1)
+
enum fuse_opcode {
FUSE_LOOKUP = 1,
FUSE_FORGET = 2, /* no reply */
@@ -217,6 +226,9 @@ enum fuse_opcode {
FUSE_DESTROY = 38,
FUSE_IOCTL = 39,
FUSE_POLL = 40,
+ FUSE_MMAP = 41,
+ FUSE_MMAP_COMMIT = 42,
+ FUSE_MUNMAP = 43,

/* CUSE specific operations */
CUSE_INIT = 4096,
@@ -478,6 +490,41 @@ struct fuse_notify_poll_wakeup_out {
__u64 kh;
};

+struct fuse_mmap_in {
+ __u64 fh;
+ __u64 addr;
+ __u64 len;
+ __s32 prot;
+ __s32 flags;
+ __u64 offset;
+};
+
+struct fuse_mmap_out {
+ __s32 fd;
+ __u32 flags;
+};
+
+struct fuse_mmap_commit_in {
+ __u64 fh;
+ __u64 mmap_unique;
+ __u64 addr;
+ __u64 len;
+ __s32 prot;
+ __s32 flags;
+ __s32 fd;
+ __u32 padding;
+ __u64 offset;
+};
+
+struct fuse_munmap_in {
+ __u64 fh;
+ __u64 mmap_unique;
+ __u64 addr;
+ __u64 len;
+ __s32 fd;
+ __u32 padding;
+};
+
struct fuse_in_header {
__u32 len;
__u32 opcode;
--
1.6.0.2

2009-06-23 23:58:31

by Andrew Morton

[permalink] [raw]
Subject: Re: [PATCH 3/4] FUSE: implement fuse_req->prep()

On Thu, 18 Jun 2009 18:24:32 +0900
Tejun Heo <[email protected]> wrote:

> Implement ->prep() which is the opposite equivalent of ->end().

The opposite of "end" is "begin". I'd expect to see a sequence like

-> prep
-> begin
-> end

so the naming choice here is unexpected.

> It's
> called right before the request is passed to userland server in the
> kernel context of the server. ->prep() can fail the request without
> disrupting the whole channel.
>
> This will be used by direct mmap implementation.

> ...
>
> --- a/fs/fuse/fuse_i.h
> +++ b/fs/fuse/fuse_i.h
> @@ -292,6 +292,12 @@ struct fuse_req {
> /** Link on fi->writepages */
> struct list_head writepages_entry;
>
> + /** Request preparation callback. Called from the kernel
> + context of the FUSE server before passing the request to
> + the FUSE server. Non-zero return from this function will
> + fail the request. */
> + int (*prep)(struct fuse_conn *, struct fuse_req *);
> +
> /** Request completion callback. This function is called from
> the kernel context of the FUSE server if the request isn't
> being aborted. If the request is being aborted, it's

Why the strange comment layout? Does kerneldoc actually recognise and
appropriately process this text? if not, please do

/*
* Request preparation callback. Called from the kernel
* context of the FUSE server before passing the request to
* the FUSE server. Non-zero return from this function will
* fail the request.
*/

If that looks odd then, well, that wasn't your fault ;)

2009-06-24 00:04:35

by Tejun Heo

[permalink] [raw]
Subject: Re: [PATCH 2/4] FUSE: make request_wait_answer() wait for ->end() completion

(restoring cc list)

Andrew Morton wrote:
> On Thu, 18 Jun 2009 18:24:31 +0900
> Tejun Heo <[email protected]> wrote:
>
>> - /** Request completion callback */
>> + /** Request completion callback. This function is called from
>> + the kernel context of the FUSE server if the request isn't
>> + being aborted. If the request is being aborted, it's
>> + called from the kernel context of the aborting process. */
>
> The comment pretends to be kerneldoc but isn't.
>
> Like this:
>
> /*
> * Request completion callback. This function is called from
> * the kernel context of the FUSE server if the request isn't
> * being aborted. If the request is being aborted, it's
> * called from the kernel context of the aborting process.
> */
>
> please.

All the sorrounding comments were formatted in false-kerneldoc format,
so it will look a bit weird that way. Hmm... I'll add a patch to
change the surrounding comments too.

Thanks.

--
tejun

2009-06-24 00:09:41

by Tejun Heo

[permalink] [raw]
Subject: Re: [PATCH 3/4] FUSE: implement fuse_req->prep()

Hello,

Andrew Morton wrote:
> On Thu, 18 Jun 2009 18:24:32 +0900
> Tejun Heo <[email protected]> wrote:
>
>> Implement ->prep() which is the opposite equivalent of ->end().
>
> The opposite of "end" is "begin". I'd expect to see a sequence like
>
> -> prep
> -> begin
> -> end
>
> so the naming choice here is unexpected.

Yes, indeed. I'll look for a better name.

>> It's
>> called right before the request is passed to userland server in the
>> kernel context of the server. ->prep() can fail the request without
>> disrupting the whole channel.
>>
>> This will be used by direct mmap implementation.
>
>> ...
>>
>> --- a/fs/fuse/fuse_i.h
>> +++ b/fs/fuse/fuse_i.h
>> @@ -292,6 +292,12 @@ struct fuse_req {
>> /** Link on fi->writepages */
>> struct list_head writepages_entry;
>>
>> + /** Request preparation callback. Called from the kernel
>> + context of the FUSE server before passing the request to
>> + the FUSE server. Non-zero return from this function will
>> + fail the request. */
>> + int (*prep)(struct fuse_conn *, struct fuse_req *);
>> +
>> /** Request completion callback. This function is called from
>> the kernel context of the FUSE server if the request isn't
>> being aborted. If the request is being aborted, it's
>
> Why the strange comment layout? Does kerneldoc actually recognise and
> appropriately process this text? if not, please do
>
> /*
> * Request preparation callback. Called from the kernel
> * context of the FUSE server before passing the request to
> * the FUSE server. Non-zero return from this function will
> * fail the request.
> */
>
> If that looks odd then, well, that wasn't your fault ;)

I usually try to conform to the surrounding style so that new stuff at
least is consistent with the rest (and maintainers tend to feel
happier that way) but yeah this can cause confusion to kerneldoc
handling. I'll add a patch to change the style.

Thanks.

--
tejun

2009-06-24 00:16:33

by Andrew Morton

[permalink] [raw]
Subject: Re: [PATCH 4/4] FUSE: implement direct mmap

On Thu, 18 Jun 2009 18:24:33 +0900
Tejun Heo <[email protected]> wrote:

> This patch implements direct mmap. It allows FUSE server to honor
> each mmap request with anonymous mapping. FUSE server can make
> multiple mmap requests share a single anonymous mapping or separate
> mappings as it sees fit.
>
> mmap request is handled in two steps. MMAP first queries the server
> whether it wants to share the mapping with an existing one or create a
> new one, and if so, with which flags. MMAP_COMMIT notifies the server
> the result of mmap and if successful the fd the server can use to
> access the mmap region.
>
> Internally, shmem_file is used to back the mmap areas and vma->vm_file
> is overridden from the FUSE file to the shmem_file.
>
> For details, please read the comment on top of
> fuse_file_direct_mmap().
>
>
> ...
>
> +static int fuse_mmap_commit_prep(struct fuse_conn *fc, struct fuse_req *req)
> +{
> + struct fuse_mmap_commit_in *commit_in = (void *)req->in.args[0].value;
> + struct file *mfile = req->misc.mmap.file;
> + int fd;
> +
> + if (!mfile)
> + return 0;
> +
> + /* new mmap.file has been created, assign a fd to it */
> + fd = commit_in->fd = get_unused_fd_flags(O_CLOEXEC);
> + if (fd < 0)
> + return 0;

Shouldn't this be `return fd;'?

> + get_file(mfile);
> + fd_install(fd, mfile);
> + return 0;
> +}
> +
>
> ...
>
> +int fuse_file_direct_mmap(struct file *file, struct vm_area_struct *vma)
> +{
> + struct fuse_file *ff = file->private_data;
> + struct fuse_conn *fc = ff->fc;
> + struct vm_operations_struct *orig_vm_ops = vma->vm_ops;
> + struct file *orig_vm_file = vma->vm_file;
> + unsigned long orig_vm_flags = vma->vm_flags;
> + struct fuse_mmap *fmmap = NULL;
> + struct file *mfile = NULL;
> + struct fuse_req *req;
> + struct fuse_mmap_in mmap_in;
> + struct fuse_mmap_out mmap_out;
> + struct fuse_mmap_commit_in commit_in;
> + u64 mmap_unique;
> + int err;
> +
> + /*
> + * First, execute FUSE_MMAP which will query the server
> + * whether this mmap request is valid and which fd it wants to
> + * use to mmap this request.
> + */
> + req = fuse_get_req(fc);
> + if (IS_ERR(req)) {
> + err = PTR_ERR(req);
> + goto err;
> + }
> +
> + memset(&mmap_in, 0, sizeof(mmap_in));
> + mmap_in.fh = ff->fh;
> + mmap_in.addr = vma->vm_start;
> + mmap_in.len = vma->vm_end - vma->vm_start;
> + mmap_in.prot = ((vma->vm_flags & VM_READ) ? PROT_READ : 0) |
> + ((vma->vm_flags & VM_WRITE) ? PROT_WRITE : 0) |
> + ((vma->vm_flags & VM_EXEC) ? PROT_EXEC : 0);
> + mmap_in.flags = ((vma->vm_flags & VM_GROWSDOWN) ? MAP_GROWSDOWN : 0) |
> + ((vma->vm_flags & VM_DENYWRITE) ? MAP_DENYWRITE : 0) |
> + ((vma->vm_flags & VM_EXECUTABLE) ? MAP_EXECUTABLE : 0) |
> + ((vma->vm_flags & VM_LOCKED) ? MAP_LOCKED : 0);
> + mmap_in.offset = (loff_t)vma->vm_pgoff << PAGE_SHIFT;
> +
> + req->in.h.opcode = FUSE_MMAP;
> + req->in.h.nodeid = ff->nodeid;
> + req->in.numargs = 1;
> + req->in.args[0].size = sizeof(mmap_in);
> + req->in.args[0].value = &mmap_in;
> + req->out.numargs = 1;
> + req->out.args[0].size = sizeof(mmap_out);
> + req->out.args[0].value = &mmap_out;
> +
> + req->end = fuse_mmap_end;
> +
> + fuse_request_send(fc, req);
> +
> + /* mmap.file is set if server requested to reuse existing mapping */
> + mfile = req->misc.mmap.file;
> + mmap_unique = req->in.h.unique;
> + err = req->out.h.error;
> +
> + fuse_put_request(fc, req);
> +
> + /* ERR_PTR value in mfile means fget failure, send failure COMMIT */
> + if (IS_ERR(mfile)) {
> + err = PTR_ERR(mfile);
> + goto commit;
> + }
> + /* userland indicated failure, we can just fail */
> + if (err)
> + goto err;
> +
> + /*
> + * Second, create mmap as the server requested.
> + */
> + fmmap = create_fuse_mmap(fc, file, mfile, mmap_unique, mmap_out.fd,
> + vma);
> + if (IS_ERR(fmmap)) {
> + err = PTR_ERR(fmmap);
> + goto commit;
> + }
> +
> + /* fmmap points to shm_file to mmap, give it to vma */
> + mfile = fmmap->mmap_file;
> + vma->vm_file = mfile;
> +
> + /* add flags server requested and mmap the shm_file */
> + if (mmap_out.flags & FUSE_MMAP_DONT_COPY)
> + vma->vm_flags |= VM_DONTCOPY;
> + if (mmap_out.flags & FUSE_MMAP_DONT_EXPAND)
> + vma->vm_flags |= VM_DONTEXPAND;
> +
> + err = mfile->f_op->mmap(mfile, vma);
> + if (err)
> + goto commit;
> +
> + /*
> + * Override vm_ops->open and ->close. This is a bit hacky but
> + * vma's can't easily be nested and FUSE needs to notify the
> + * server when to release resources for mmaps. Both shmem and
> + * tiny_shmem implementations are okay with this trick but if
> + * there's a cleaner way to do this, please update it.
> + */
> + err = -EINVAL;
> + if (vma->vm_ops->open || vma->vm_ops->close || vma->vm_private_data) {
> + printk(KERN_ERR "FUSE: can't do direct mmap. shmem mmap has "
> + "open, close or vm_private_data\n");
> + goto commit;
> + }
> +
> + fmmap->vm_ops = *vma->vm_ops;
> + vma->vm_ops = &fmmap->vm_ops;
> + vma->vm_ops->open = fuse_vm_open;
> + vma->vm_ops->close = fuse_vm_close;
> + vma->vm_private_data = fmmap;
> + err = 0;

Yes, the vm_operations save/overwrite/restore is the stinky part here.

I'm not completely clear what's going on here, and why, and what the
overall implications and risks are. Some discussion in the changelog
would help. With all that information we then would be better situated
to discuss and perhaps solve <whatever problem this solves> by more
pleasing means.

I mean, if the need for this hack indicates that there is some
shortcoming in core VM then perhaps we can improve core VM. Dunno.

> + commit:
> + /*
> + * Third, either mmap succeeded or failed after MMAP request
> + * succeeded. Notify userland what happened.
> + */
> +
> + /* missing commit can cause resource leak on server side, don't fail */
> + req = fuse_get_req_nofail(fc, file);

That's an optimistically-named function.

<looks>

hm, it seems to duplicate mempool.c.

> + memset(&commit_in, 0, sizeof(commit_in));
> + commit_in.fh = ff->fh;
> + commit_in.mmap_unique = mmap_unique;
> + commit_in.addr = mmap_in.addr;
> + commit_in.len = mmap_in.len;
> + commit_in.prot = mmap_in.prot;
> + commit_in.flags = mmap_in.flags;
> + commit_in.offset = mmap_in.offset;
> +
> + if (!err) {
> + commit_in.fd = fmmap->mmap_fd;
> + /*
> + * If fmmap->mmap_fd < 0, new fd needs to be created
> + * when the server reads MMAP_COMMIT. Pass the file
> + * pointer. A fd will be assigned to it by the
> + * fuse_mmap_commit_prep callback.
> + */
> + if (fmmap->mmap_fd < 0)
> + req->misc.mmap.file = mfile;
> + } else
> + commit_in.fd = err;
> +
> + req->in.h.opcode = FUSE_MMAP_COMMIT;
> + req->in.h.nodeid = ff->nodeid;
> + req->in.numargs = 1;
> + req->in.args[0].size = sizeof(commit_in);
> + req->in.args[0].value = &commit_in;
> +
> + req->prep = fuse_mmap_commit_prep;
> + req->end = fuse_mmap_commit_end;
> +
> + fuse_request_send(fc, req);
> + if (!err) /* notified failure to userland */
> + err = req->out.h.error;
> + if (!err && commit_in.fd < 0) /* failed to allocate fd */
> + err = commit_in.fd;
> + fuse_put_request(fc, req);
> +
> + if (!err) {
> + fput(orig_vm_file);
> + fmmap->mmap_fd = commit_in.fd;
> + return 0;
> + }
> +
> + /* fall through */
> + err:
> + if (fmmap && !IS_ERR(fmmap))
> + destroy_fuse_mmap(fmmap);
> + if (mfile && !IS_ERR(mfile))
> + fput(mfile);
> +
> + /* restore original vm_ops, file and flags */
> + vma->vm_ops = orig_vm_ops;
> + vma->vm_file = orig_vm_file;
> + vma->vm_flags = orig_vm_flags;
> +
> + if (err == -ENOSYS) {
> + /* Can't provide the coherency needed for MAP_SHARED */
> + if (vma->vm_flags & VM_MAYSHARE)
> + return -ENODEV;
> +
> + invalidate_inode_pages2(file->f_mapping);
> +
> + return generic_file_mmap(file, vma);

Now what's happening on this path?

We seem to have decided to fallback to a standard mmap of some form?
Why? If the user's request failed, shouldn't we just fail the syscall?
Hasn't the kernel just lied to userspace about what happened?

Also, this code looks like it's duplicating the behaviour of core MM?
If so then it needs to be kept in sync as core MM changes - that's a
maintenance problem. Can it be improved by changing core MM?

> + }
> +
> + return err;
> +}
> +EXPORT_SYMBOL_GPL(fuse_file_direct_mmap);
>
> ...
>
> --- a/fs/fuse/fuse_i.h
> +++ b/fs/fuse/fuse_i.h
> @@ -272,6 +272,13 @@ struct fuse_req {
> struct fuse_write_out out;
> } write;
> struct fuse_lk_in lk_in;
> + struct {
> + /** to move filp for mmap between client and server */

Comment pretends to be kerneldoc?

> + struct file *file;
> + } mmap;
> + struct {
> + struct fuse_munmap_in in;
> + } munmap;
> } misc;
>
>
> ...
>
> --- a/include/linux/fuse.h
> +++ b/include/linux/fuse.h
> @@ -178,6 +178,15 @@ struct fuse_file_lock {
> */
> #define FUSE_POLL_SCHEDULE_NOTIFY (1 << 0)
>
> +/**

kerneldoc? Does this work?

> + * Mmap flags
> + *
> + * FUSE_MMAP_DONT_COPY: don't copy the region on fork
> + * FUSE_MMAP_DONT_EXPAND: can't be expanded with mremap()
> + */
> +#define FUSE_MMAP_DONT_COPY (1 << 0)
> +#define FUSE_MMAP_DONT_EXPAND (1 << 1)
> +
>
> ...
>

2009-06-24 04:39:31

by Al Viro

[permalink] [raw]
Subject: Re: [PATCH 1/4] fdtable: export alloc_fd()

On Thu, Jun 18, 2009 at 06:24:30PM +0900, Tejun Heo wrote:
> Export alloc_fd(). Will be used by FUSE.

Where and how?

2009-06-24 10:02:58

by Miklos Szeredi

[permalink] [raw]
Subject: Re: [PATCH 2/4] FUSE: make request_wait_answer() wait for ->end() completion

On Thu, 18 Jun 2009, Tejun Heo wrote:
> Previously, a request was marked FINISHED before ->end() is executed
> and thus request_wait_answer() can return before it's done. This
> patch makes request_wait_answer() wait for ->end() to finish before
> returning.

Why is this change needed?

>
> Note that no current ->end() user waits for request completion, so
> this change doesn't cause any behavior difference.
>
> While at it, beef up the comment above ->end() hook and clarify when
> and where it's called.

OK.

[snip]

> @@ -293,10 +292,21 @@ __releases(&fc->lock)
> fc->active_background--;
> flush_bg_queue(fc);
> }
> +
> spin_unlock(&fc->lock);
> - wake_up(&req->waitq);
> - if (end)
> +
> + if (end) {
> end(fc, req);
> + smp_wmb();

Why is this barrier needed?

Thanks,
Miklos

2009-06-24 10:33:21

by Miklos Szeredi

[permalink] [raw]
Subject: Re: [PATCH 4/4] FUSE: implement direct mmap

On Thu, 18 Jun 2009, Tejun Heo wrote:
> This patch implements direct mmap. It allows FUSE server to honor
> each mmap request with anonymous mapping. FUSE server can make
> multiple mmap requests share a single anonymous mapping or separate
> mappings as it sees fit.

One thing that worries me is that this isn't generic enough. It may
be good for OSS emulation, but if someone would want to use it for
normal file based mmap, I don't think it would be usable: there's no
synchronization between normal I/O paths.

Now that's not a big problem as long as there's a possility to extend
the current interface in that direction. E.g. add the possibility for
userspace to handle faults and unmap pages from the address space on
demand.

The other problem is that sharing address spaces between filesystems
in this way is inherently hackish. CODA manages to get away with it,
but CODA doesn't want to reimplement mmap/munmap in exotic ways.

Is there s a possibility to do this without needing to share the
pages? E.g. implement explicit calls (notifications) on the fuse
interface to read and write the mapped pages. That would get rid of
all the ugly problems caused by having to pass and translate file
descriptors on the fuse interface.

>
> mmap request is handled in two steps. MMAP first queries the server
> whether it wants to share the mapping with an existing one or create a
> new one, and if so, with which flags. MMAP_COMMIT notifies the server
> the result of mmap and if successful the fd the server can use to
> access the mmap region.

And you might have noticed I'm not a big fan of these three way handshake
messages ;)

Thanks,
Miklos

2009-06-29 16:21:05

by Tejun Heo

[permalink] [raw]
Subject: Re: [PATCH 2/4] FUSE: make request_wait_answer() wait for ->end() completion

Hello, Miklos.

Miklos Szeredi wrote:
> On Thu, 18 Jun 2009, Tejun Heo wrote:
>> Previously, a request was marked FINISHED before ->end() is executed
>> and thus request_wait_answer() can return before it's done. This
>> patch makes request_wait_answer() wait for ->end() to finish before
>> returning.
>
> Why is this change needed?

Direct mmap implementation manages fds in the server process using the
end() callback. After each command which needs to manipulate server
fds in some way, the end callback is responsible for doing it and
communicating what happened to the client side which is executing
fuse_file_direct_mmap(), so it needs to wait for end to complete
before proceeding.

>> spin_unlock(&fc->lock);
>> - wake_up(&req->waitq);
>> - if (end)
>> +
>> + if (end) {
>> end(fc, req);
>> + smp_wmb();
>
> Why is this barrier needed?

To separate the changes end() made from setting FUSE_REQ_FINISHED.
The corresponding wait is in wait_answer_interruptible() and the
corresponding barrier is rmb() implied by spin_lock(&fc->lock) in that
function. ie. the barrier makes sure that when
wait_event_interruptible() exits the waiting state because state
equals FUSE_REQ_FINISHED, the caller is guaranteed to see the changes
made by end().

Thanks.

--
tejun

2009-06-29 16:41:47

by Tejun Heo

[permalink] [raw]
Subject: Re: [PATCH 4/4] FUSE: implement direct mmap

Hello, Miklos.

Miklos Szeredi wrote:
> On Thu, 18 Jun 2009, Tejun Heo wrote:
>> This patch implements direct mmap. It allows FUSE server to honor
>> each mmap request with anonymous mapping. FUSE server can make
>> multiple mmap requests share a single anonymous mapping or separate
>> mappings as it sees fit.
>
> One thing that worries me is that this isn't generic enough. It may
> be good for OSS emulation, but if someone would want to use it for
> normal file based mmap, I don't think it would be usable: there's no
> synchronization between normal I/O paths.

I would love if it can be made more elegant and generic. Although I
wrote it, I have to agree the whole thing doesn't look too pretty.

> Now that's not a big problem as long as there's a possility to extend
> the current interface in that direction. E.g. add the possibility for
> userspace to handle faults and unmap pages from the address space on
> demand.
>
> The other problem is that sharing address spaces between filesystems
> in this way is inherently hackish. CODA manages to get away with it,
> but CODA doesn't want to reimplement mmap/munmap in exotic ways.
>
> Is there s a possibility to do this without needing to share the
> pages? E.g. implement explicit calls (notifications) on the fuse
> interface to read and write the mapped pages. That would get rid of
> all the ugly problems caused by having to pass and translate file
> descriptors on the fuse interface.

Hmmm... yeah, it would be great if the fd ugliness can be killed.

One problem to consider is that mmaps of the same file doesn't always
share the mapping. For example, on a normal filesystem, shared mmaps
of a file will always share the same pages; however, mmaps of a device
node might or might not share the mapping. For example, mmaps could
represent separate data streams or a shared buffer space which the
different mmaps might partially use for data passing between
themselves without notifying the server. So, the server should be
able to determine whether a mmap is gonna be shared with another mmap
or not and that's where the passing fd back and forth came in.

Simlar thing can be done by letting the kernel part of fuse keep track
of mapped pages using server provided page IDs and letting the server
determine sharing or creation of pages and communicate it to the
kernel part. It would require more code but the interface should be
less convoluted and more flexible. Is this what you have in mind?

>> mmap request is handled in two steps. MMAP first queries the server
>> whether it wants to share the mapping with an existing one or create a
>> new one, and if so, with which flags. MMAP_COMMIT notifies the server
>> the result of mmap and if successful the fd the server can use to
>> access the mmap region.
>
> And you might have noticed I'm not a big fan of these three way handshake
> messages ;)

Believe it or not, I'm not particularly big fan of it either. I just
couldn't think of anything better at the time. :-)

Thanks.

--
tejun

2009-06-29 16:59:25

by Tejun Heo

[permalink] [raw]
Subject: Re: [PATCH 4/4] FUSE: implement direct mmap

Hello, Andrew.

Andrew Morton wrote:
>> +static int fuse_mmap_commit_prep(struct fuse_conn *fc, struct fuse_req *req)
>> +{
>> + struct fuse_mmap_commit_in *commit_in = (void *)req->in.args[0].value;
>> + struct file *mfile = req->misc.mmap.file;
>> + int fd;
>> +
>> + if (!mfile)
>> + return 0;
>> +
>> + /* new mmap.file has been created, assign a fd to it */
>> + fd = commit_in->fd = get_unused_fd_flags(O_CLOEXEC);
>> + if (fd < 0)
>> + return 0;
>
> Shouldn't this be `return fd;'?

Non-zero return from prep aborts the command immediately and starts
processing the next command. However, when the fd assigning fails,
the failure should be communicated to the userland server so that it
can release any resource it might be holding for the mmap. So, the
failure state is recorded in commit_in->fd which and the prep callback
returns 0 so that the failure status can be transmitted to the
userland server.

>> + fmmap->vm_ops = *vma->vm_ops;
>> + vma->vm_ops = &fmmap->vm_ops;
>> + vma->vm_ops->open = fuse_vm_open;
>> + vma->vm_ops->close = fuse_vm_close;
>> + vma->vm_private_data = fmmap;
>> + err = 0;
>
> Yes, the vm_operations save/overwrite/restore is the stinky part here.
>
> I'm not completely clear what's going on here, and why, and what the
> overall implications and risks are. Some discussion in the changelog
> would help. With all that information we then would be better situated
> to discuss and perhaps solve <whatever problem this solves> by more
> pleasing means.
>
> I mean, if the need for this hack indicates that there is some
> shortcoming in core VM then perhaps we can improve core VM. Dunno.

Sorry about lack of explanation, in a nutshell, fuse direct mmap
implementation wants to use shmem but wants to know when each mmap is
opened and closed so that it can determine when the mmap is finally
released and tell the userland to release related resources. I think
the following two approaches would be more regular for situations like
this.

1. Export the necessary parts of shmem operations and build our fuse's
own vm_ops using necessary operations.

2. If vma can be nested, create wrapper vma around shmem vma and catch
open/close calls from there.

#1 might be the better solution here but with two separate shmem
implementations it was a bit awkward. #2 currently can't be done, so
I used bastardized form of #1. So, yeap, it's ugly.

Anyways, it looks like the whole fd fiddling and shmem vma trickery
might all go away, so please don't invest too much time into it.

>> + /* missing commit can cause resource leak on server side, don't fail */
>> + req = fuse_get_req_nofail(fc, file);
>
> That's an optimistically-named function.
>
> <looks>
>
> hm, it seems to duplicate mempool.c.

It's much more narrowly defined but yeah in a way.

>> + if (err == -ENOSYS) {
>> + /* Can't provide the coherency needed for MAP_SHARED */
>> + if (vma->vm_flags & VM_MAYSHARE)
>> + return -ENODEV;
>> +
>> + invalidate_inode_pages2(file->f_mapping);
>> +
>> + return generic_file_mmap(file, vma);
>
> Now what's happening on this path?
>
> We seem to have decided to fallback to a standard mmap of some form?
> Why? If the user's request failed, shouldn't we just fail the syscall?
> Hasn't the kernel just lied to userspace about what happened?
>
> Also, this code looks like it's duplicating the behaviour of core MM?
> If so then it needs to be kept in sync as core MM changes - that's a
> maintenance problem. Can it be improved by changing core MM?

That's from the original mmap implementation before the direct mmap is
added. It's just code movement. No functional change by this patch
on this path.

>> --- a/fs/fuse/fuse_i.h
>> +++ b/fs/fuse/fuse_i.h
>> @@ -272,6 +272,13 @@ struct fuse_req {
>> struct fuse_write_out out;
>> } write;
>> struct fuse_lk_in lk_in;
>> + struct {
>> + /** to move filp for mmap between client and server */
>
> Comment pretends to be kerneldoc?

Will fix.

>>
>> +/**
>
> kerneldoc? Does this work?

I have no idea. Will update.

>> + * Mmap flags
>> + *
>> + * FUSE_MMAP_DONT_COPY: don't copy the region on fork
>> + * FUSE_MMAP_DONT_EXPAND: can't be expanded with mremap()
>> + */
>> +#define FUSE_MMAP_DONT_COPY (1 << 0)
>> +#define FUSE_MMAP_DONT_EXPAND (1 << 1)
>> +

Thanks.

--
tejun

2009-06-30 03:07:28

by Tejun Heo

[permalink] [raw]
Subject: Re: [PATCH 1/4] fdtable: export alloc_fd()

Hello, Al.

Al Viro wrote:
> On Thu, Jun 18, 2009 at 06:24:30PM +0900, Tejun Heo wrote:
>> Export alloc_fd(). Will be used by FUSE.
>
> Where and how?

This currently is used via get_unused_fd_flags() in
fuse_mmap_commit_prep(). A shmem_file is created in the requesting
client's context (for accounting and easier flow of control) and then
passed to the server's context where an fd is allocated and assigned
to the file so that the fd can be passed to the userland server.

The shmem_file based implementation serves certain device mmap file
emulation well but as Miklos pointed out it may not be sufficiently
flexible for other purposes and passing file back and forth and
wrapping shmem_file is quite ugly, so I think there's pretty good
chance it will get reimplemented in some different way which likely
won't need alloc_fd() exported. If there's any vfs related changes,
I'll cc you.

Thanks.

--
tejun

2009-07-02 13:52:46

by Miklos Szeredi

[permalink] [raw]
Subject: Re: [PATCH 4/4] FUSE: implement direct mmap

On Tue, 30 Jun 2009, Tejun Heo wrote:
> One problem to consider is that mmaps of the same file doesn't always
> share the mapping. For example, on a normal filesystem, shared mmaps
> of a file will always share the same pages; however, mmaps of a device
> node might or might not share the mapping. For example, mmaps could
> represent separate data streams or a shared buffer space which the
> different mmaps might partially use for data passing between
> themselves without notifying the server. So, the server should be
> able to determine whether a mmap is gonna be shared with another mmap
> or not and that's where the passing fd back and forth came in.
>
> Simlar thing can be done by letting the kernel part of fuse keep track
> of mapped pages using server provided page IDs and letting the server
> determine sharing or creation of pages and communicate it to the
> kernel part. It would require more code but the interface should be
> less convoluted and more flexible. Is this what you have in mind?

Something like that. In fact we could possibly allow even sharing the
pages with userspace, similarly to how drivers do it with the
hardware. Afaics sound drivers now map the dma memory with
remap_pfn_range(). Similary we could allocate a chunk of
non-swapabble kernel memory on request from the userspace server and
map its pages using this trick to both the server's and the client's
address space.

This is still sort of OSSP specific, I don't see clearly how it could
be made more generic.

> >> mmap request is handled in two steps. MMAP first queries the server
> >> whether it wants to share the mapping with an existing one or create a
> >> new one, and if so, with which flags. MMAP_COMMIT notifies the server
> >> the result of mmap and if successful the fd the server can use to
> >> access the mmap region.
> >
> > And you might have noticed I'm not a big fan of these three way handshake
> > messages ;)
>
> Believe it or not, I'm not particularly big fan of it either. I just
> couldn't think of anything better at the time. :-)

A better scheme would be to assume that MMAP is successful, and if
there's an error, send a MUNMAP request. That way the normal case is
simpler and there's no need for an extra message type.

Thanks,
Miklos

2009-07-04 11:13:54

by Tejun Heo

[permalink] [raw]
Subject: Re: [PATCH 4/4] FUSE: implement direct mmap

Hello, Miklos.

Miklos Szeredi wrote:
>> Simlar thing can be done by letting the kernel part of fuse keep track
>> of mapped pages using server provided page IDs and letting the server
>> determine sharing or creation of pages and communicate it to the
>> kernel part. It would require more code but the interface should be
>> less convoluted and more flexible. Is this what you have in mind?
>
> Something like that. In fact we could possibly allow even sharing the
> pages with userspace, similarly to how drivers do it with the
> hardware.

This was possible with the original implementation. Hmmm... we'll
need some sort of filehandle to represent the address space one way or
the other.

> Afaics sound drivers now map the dma memory with remap_pfn_range().
> Similary we could allocate a chunk of non-swapabble kernel memory on
> request from the userspace server and map its pages using this trick
> to both the server's and the client's address space.
>
> This is still sort of OSSP specific, I don't see clearly how it could
> be made more generic.

Using non-swappable memory would be fine for most device emulations
but mapping large amount of pages would be problematic. Hmmm... this
is difficult. It's a compromise among flexibility, scalability and
code complexity.

>>>> mmap request is handled in two steps. MMAP first queries the server
>>>> whether it wants to share the mapping with an existing one or create a
>>>> new one, and if so, with which flags. MMAP_COMMIT notifies the server
>>>> the result of mmap and if successful the fd the server can use to
>>>> access the mmap region.
>>> And you might have noticed I'm not a big fan of these three way handshake
>>> messages ;)
>> Believe it or not, I'm not particularly big fan of it either. I just
>> couldn't think of anything better at the time. :-)
>
> A better scheme would be to assume that MMAP is successful, and if
> there's an error, send a MUNMAP request. That way the normal case is
> simpler and there's no need for an extra message type.

The reason why the second stage was necessary was because the fd needs
to be communicated back to the server. The kernel doesn't know
whether the server wants to share an existing fd or create a new one
so the MMAP queries that and MMAP_COMMIT communicates newly created fd
if necessary. What can be done is to always create a new fd and let
the server override it if it wants the map to be shared and then the
kernel can kill the unused fd.

Thanks.

--
tejun

2009-07-06 11:41:23

by Miklos Szeredi

[permalink] [raw]
Subject: Re: [PATCH 4/4] FUSE: implement direct mmap

On Sat, 04 Jul 2009, Tejun Heo wrote:
> > Afaics sound drivers now map the dma memory with remap_pfn_range().
> > Similary we could allocate a chunk of non-swapabble kernel memory on
> > request from the userspace server and map its pages using this trick
> > to both the server's and the client's address space.
> >
> > This is still sort of OSSP specific, I don't see clearly how it could
> > be made more generic.
>
> Using non-swappable memory would be fine for most device emulations
> but mapping large amount of pages would be problematic. Hmmm... this
> is difficult. It's a compromise among flexibility, scalability and
> code complexity.

What's the difficulty?

Allocating pages, giving them an ID and mapping them into various page
tables seems simple in contrast to trying to make a tmpfs file be a
fuse file at the same time, which the VM is really not prepared for.

Thanks,
Miklos

2009-07-08 23:16:44

by Tejun Heo

[permalink] [raw]
Subject: Re: [PATCH 4/4] FUSE: implement direct mmap

Hello, Miklos.

Miklos Szeredi wrote:
> On Sat, 04 Jul 2009, Tejun Heo wrote:
>>> Afaics sound drivers now map the dma memory with remap_pfn_range().
>>> Similary we could allocate a chunk of non-swapabble kernel memory on
>>> request from the userspace server and map its pages using this trick
>>> to both the server's and the client's address space.
>>>
>>> This is still sort of OSSP specific, I don't see clearly how it could
>>> be made more generic.
>> Using non-swappable memory would be fine for most device emulations
>> but mapping large amount of pages would be problematic. Hmmm... this
>> is difficult. It's a compromise among flexibility, scalability and
>> code complexity.
>
> What's the difficulty?

The desire to avoid pinning all the mapped pages. :-)

> Allocating pages, giving them an ID and mapping them into various page
> tables seems simple in contrast to trying to make a tmpfs file be a
> fuse file at the same time, which the VM is really not prepared for.

If pinning all the pages are okay, the above should work fine.

Thanks.

--
tejun

2009-07-09 10:35:40

by Miklos Szeredi

[permalink] [raw]
Subject: Re: [PATCH 4/4] FUSE: implement direct mmap

On Thu, 09 Jul 2009, Tejun Heo wrote:
> >> Using non-swappable memory would be fine for most device emulations
> >> but mapping large amount of pages would be problematic. Hmmm... this
> >> is difficult. It's a compromise among flexibility, scalability and
> >> code complexity.
> >
> > What's the difficulty?
>
> The desire to avoid pinning all the mapped pages. :-)
>
> > Allocating pages, giving them an ID and mapping them into various page
> > tables seems simple in contrast to trying to make a tmpfs file be a
> > fuse file at the same time, which the VM is really not prepared for.
>
> If pinning all the pages are okay, the above should work fine.

If the page allocation is privileged then I think it's okay.

I have only vague ideas how this "direct mmap" could be used for other
things than sound drivers. For example there's a significant push
towards making network filesystems coherent. This is currently not
well supported by fuse and adding coherency guarantees to mmap is
especially tricky. But I'm not sure it really needs a special mmap
interface, that's just one of the options.

Thoughts?

Thanks,
Miklos